Software Development Life Cycle Perspective: A Survey of Benchmarks for Code Large Language Models and Agents

Abstract

Code large language models (CodeLLMs) and agents are increasingly being integrated into complex software engineering tasks spanning the entire Software Development Life Cycle (SDLC). Benchmarking is critical for rigorously evaluating these capabilities. However, despite their growing significance, there remains a lack of comprehensive reviews that examine these benchmarks from an SDLC perspective. To bridge this gap, we propose a tiered analysis framework to systematically review 178 benchmarks from 461 papers, comprehensively characterizing them from the perspective of the SDLC. Our findings reveal a notable imbalance in the coverage of current benchmarks, with approximately 61% focused on the software implementation phase in SDLC, while requirements engineering and software design phases receive minimal attention at only 5% and 3%, respectively. Furthermore, current benchmarks lack effective anti-contamination strategies, posing significant risks of data leakage and potentially inflated performance assessments. Finally, we identify key open challenges in current research and outline future directions to narrow the gap between the theoretical capabilities of CodeLLMs and agents and their practical effectiveness in real-world scenarios.
