
AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

Shengda Fan1,*, Xuyan Ye1,*, Yupeng Huo1, Zhi-Yuan Chen1, Yiju Guo1, Shenzhi Yang1, Wenkai Yang1, Shuqi Ye1, Jingwen Chen3, Haotian Chen2, Xin Cong2, Yankai Lin1,†
1 Renmin University of China, Beijing, China; 2 Tsinghua University, Beijing, China; 3 Beijing Jiaotong University, Beijing, China
* Equal contribution. † Corresponding author.
AgentProcessBench main figure

An overview of AgentProcessBench. First, we sample trajectories, generated by five source models, from four representative agent benchmarks. Next, human experts annotate the trajectories on a specialized platform, reaching an inter-annotator agreement of 89.1%. Finally, we use the resulting benchmark to evaluate 20 models spanning multiple families and parameter scales, reporting the StepAcc and FirstErrAcc metrics.
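The page does not define the two metrics, so the following is only a minimal sketch under stated assumptions: StepAcc is taken to be the fraction of steps whose predicted label matches the human label, and FirstErrAcc the fraction of trajectories whose predicted first erroneous step matches the annotated one (counting "no error" as a valid answer). The function names, the label encoding (`1` for a correct step, `-1` for an erroneous one), and the `error_label` parameter are hypothetical and may differ from the paper's definitions.

```python
from typing import Optional, Sequence


def step_acc(pred: Sequence[Sequence[int]], gold: Sequence[Sequence[int]]) -> float:
    """Assumed StepAcc: share of all steps whose predicted label equals the gold label."""
    total = 0
    correct = 0
    for p, g in zip(pred, gold):
        if len(p) != len(g):
            raise ValueError("prediction and annotation must cover the same steps")
        total += len(g)
        correct += sum(a == b for a, b in zip(p, g))
    return correct / total if total else 0.0


def first_error(labels: Sequence[int], error_label: int = -1) -> Optional[int]:
    """Index of the first step labeled as an error, or None if there is none."""
    for i, label in enumerate(labels):
        if label == error_label:
            return i
    return None


def first_err_acc(
    pred: Sequence[Sequence[int]],
    gold: Sequence[Sequence[int]],
    error_label: int = -1,  # hypothetical encoding of an erroneous step
) -> float:
    """Assumed FirstErrAcc: share of trajectories whose first-error index matches."""
    if not gold:
        return 0.0
    hits = sum(
        first_error(p, error_label) == first_error(g, error_label)
        for p, g in zip(pred, gold)
    )
    return hits / len(gold)


# Toy data: two trajectories with per-step labels (1 = correct, -1 = erroneous).
gold = [[1, 1, -1, 1], [1, 1]]
pred = [[1, -1, -1, 1], [1, 1]]
print(step_acc(pred, gold))
print(first_err_acc(pred, gold))
```

In this toy case the model flags the error one step too early in the first trajectory, so it is penalized on one step for StepAcc but misses the whole trajectory for FirstErrAcc, which illustrates why the two numbers can diverge.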

Statistics of AgentProcessBench


Overall Performance on AgentProcessBench


Case Study

Case study example 0
Case study example 1
Case study example 2
Case study example 3

BibTeX

@misc{fan2026agentprocessbenchdiagnosingsteplevelprocess,
      title={AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents}, 
      author={Shengda Fan and Xuyan Ye and Yupeng Huo and Zhi-Yuan Chen and Yiju Guo and Shenzhi Yang and Wenkai Yang and Shuqi Ye and Jingwen Chen and Haotian Chen and Xin Cong and Yankai Lin},
      year={2026},
      eprint={2603.14465},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2603.14465}
}