
AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

Shengda Fan1,*, Xuyan Ye1,*, Yupeng Huo1, Zhi-Yuan Chen1, Yiju Guo1, Shenzhi Yang1, Wenkai Yang1, Shuqi Ye1, Jingwen Chen3, Haotian Chen4, Xin Cong2, Yankai Lin1,†
1 Renmin University of China, Beijing, China
2 Tsinghua University, Beijing, China
3 Beijing Jiaotong University, Beijing, China
4 Shanghai Jiao Tong University, Shanghai, China
* Equal contribution   † Corresponding author
AgentProcessBench main figure

An overview of AgentProcessBench. First, we sample trajectories generated by five source models on four representative agent benchmarks. Human experts then annotate the data on a specialized platform, reaching an inter-annotator agreement of 89.1%. Finally, we use the resulting benchmark to evaluate 20 models across families and parameter scales with the StepAcc and FirstErrAcc metrics.
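This page does not define StepAcc or FirstErrAcc, so the following is only an illustrative sketch under stated assumptions: StepAcc is taken to be the fraction of steps whose predicted label matches the human label, and FirstErrAcc the fraction of trajectories where the predicted first erroneous step matches the annotated one. The label encoding (`ERROR = -1`) and all function names are hypothetical; the paper's definitions may differ.

```python
from typing import Optional, Sequence

# Hypothetical label value marking an erroneous step (an assumption, not from the paper).
ERROR = -1


def step_acc(gold: Sequence[Sequence[int]], pred: Sequence[Sequence[int]]) -> float:
    """Assumed StepAcc: share of steps, pooled over trajectories, with matching labels."""
    total = 0
    correct = 0
    for g_traj, p_traj in zip(gold, pred):
        if len(g_traj) != len(p_traj):
            raise ValueError("gold and predicted trajectories must have equal length")
        total += len(g_traj)
        correct += sum(g == p for g, p in zip(g_traj, p_traj))
    return correct / total if total else 0.0


def first_error(labels: Sequence[int]) -> Optional[int]:
    """Index of the first step labeled ERROR, or None if the trajectory has none."""
    for i, label in enumerate(labels):
        if label == ERROR:
            return i
    return None


def first_err_acc(gold: Sequence[Sequence[int]], pred: Sequence[Sequence[int]]) -> float:
    """Assumed FirstErrAcc: share of trajectories whose first-error index matches."""
    if not gold:
        return 0.0
    hits = sum(first_error(g) == first_error(p) for g, p in zip(gold, pred))
    return hits / len(gold)


# Toy data: two trajectories with per-step labels (1 = acceptable step, -1 = error).
gold = [[1, 1, -1, -1], [1, 1, 1]]
pred = [[1, -1, -1, -1], [1, 1, 1]]
print(step_acc(gold, pred))       # 6 of 7 steps match
print(first_err_acc(gold, pred))  # 1 of 2 trajectories match
```

On the toy data, the first trajectory's error is predicted one step too early, so it counts as a miss for the first-error metric while still agreeing on three of its four steps.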

Statistics of AgentProcessBench


Overall Performance on AgentProcessBench


Case Study

Case study example 0
Case study example 1
Case study example 2
Case study example 3

BibTeX

@article{fan2026agentprocessbench,
  title={{AgentProcessBench}: Diagnosing Step-Level Process Quality in Tool-Using Agents},
  author={Fan, Shengda and Ye, Xuyan and Huo, Yupeng and Chen, Zhi-Yuan and Guo, Yiju and Yang, Shenzhi and Yang, Wenkai and Ye, Shuqi and Chen, Jingwen and Chen, Haotian and others},
  journal={arXiv preprint arXiv:2603.14465},
  year={2026}
}