AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

Shengda Fan¹*, Xuyan Ye¹*, Yupeng Huo¹, Zhi-Yuan Chen¹, Yiju Guo¹, Shenzhi Yang¹, Wenkai Yang¹, Shuqi Ye¹, Jingwen Chen³, Haotian Chen⁴, Xin Cong², Yankai Lin¹†

¹ Renmin University of China, Beijing, China
² Tsinghua University, Beijing, China
³ Beijing Jiaotong University, Beijing, China
⁴ Shanghai Jiao Tong University, Shanghai, China
* Equal contribution
† Corresponding author

An overview of AgentProcessBench. First, we sample trajectories generated by five source models on four representative agent benchmarks. Next, human experts annotate the data on a specialized platform, reaching an inter-annotator agreement of 89.1%. Finally, we use the resulting benchmark to evaluate 20 models across various families and parameter scales with the StepAcc and FirstErrAcc metrics.
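The page names the two metrics without defining them, so the following is only a minimal sketch under stated assumptions. It assumes StepAcc is the fraction of steps whose predicted label matches the human label, and FirstErrAcc is the fraction of trajectories in which the predicted first erroneous step matches the annotated one (including the case where neither contains an error). The label encoding (`1` for a correct step, `-1` for an erroneous one) and all function names are hypothetical; the paper's exact definitions may differ.

```python
from typing import List, Optional, Sequence

# Hypothetical label encoding: 1 = correct step, -1 = erroneous step.
ERROR_LABEL = -1


def step_acc(pred: Sequence[Sequence[int]], gold: Sequence[Sequence[int]]) -> float:
    """Assumed StepAcc: share of steps whose predicted label equals the human label."""
    total = 0
    correct = 0
    for p, g in zip(pred, gold):
        if len(p) != len(g):
            raise ValueError("prediction and annotation must cover the same steps")
        total += len(g)
        correct += sum(a == b for a, b in zip(p, g))
    return correct / total if total else 0.0


def first_error_index(labels: Sequence[int]) -> Optional[int]:
    """Index of the first erroneous step, or None if the trajectory has no error."""
    for i, label in enumerate(labels):
        if label == ERROR_LABEL:
            return i
    return None


def first_err_acc(pred: Sequence[Sequence[int]], gold: Sequence[Sequence[int]]) -> float:
    """Assumed FirstErrAcc: share of trajectories whose first error is located exactly."""
    if not gold:
        return 0.0
    hits = sum(first_error_index(p) == first_error_index(g) for p, g in zip(pred, gold))
    return hits / len(gold)


# Toy data: two trajectories with per-step labels.
gold: List[List[int]] = [[1, 1, -1], [1, 1]]
pred: List[List[int]] = [[1, -1, -1], [1, 1]]

print(step_acc(pred, gold))       # 4 of 5 step labels agree
print(first_err_acc(pred, gold))  # first error located correctly in 1 of 2 trajectories
```

In this toy example the evaluated model flags an error one step too early in the first trajectory, which costs one step-level match and the first-error match for that trajectory.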

Statistics of AgentProcessBench

Overall Performance on AgentProcessBench

Case Study

BibTeX

@article{fan2026agentprocessbench,
  title={{AgentProcessBench}: Diagnosing Step-Level Process Quality in Tool-Using Agents},
  author={Fan, Shengda and Ye, Xuyan and Huo, Yupeng and Chen, Zhi-Yuan and Guo, Yiju and Yang, Shenzhi and Yang, Wenkai and Ye, Shuqi and Chen, Jingwen and Chen, Haotian and others},
  journal={arXiv preprint arXiv:2603.14465},
  year={2026}
}