Benchmarks evaluating LLM agents for software development


This post provides an overview of the most useful benchmarks for evaluating LLM code generation agents and agentic software development workflows.


We already wrote about the most handy benchmarks to compare code generation in LLMs. Evaluating agentic behavior, however, is very different. To learn more about the basics of LLM agents for software development, check out our post.

LLMs vs LLM agents

LLM agentic use cases differ from “simple” LLM workflows: instead of a single prompt leading directly to an output, agentic workflows are iterative. Agents use a multi-step reasoning process and rely on planning, tool use, and self-correction capabilities to deliver solutions. An agent should be able to:

  • develop a broad contextual understanding of the system and the problem at hand
  • adapt to the domain and task
  • look up and synthesize information from a range of sources
  • use tools to fetch necessary data or execute actions
  • define a plan to solve the problem introduced in the prompt
  • provide solutions via an iterative process, and, if a solution turns out to be erroneous, start over until the problem is resolved

To keep things simple for this post, we’ll introduce the following rudimentary distinction:

  • if the LLM is only queried once (i.e., it gets one prompt instructing it to solve a problem and then provides an output), it is not considered agentic
  • if more than one query is passed to the agent or LLM solving a problem (i.e., there is back-and-forth interaction between the human and the agent), we consider the tool to have an agentic character, as sketched in the example below. For a more nuanced definition of “agentic” behavior, check out the excellent explanation by Harrison Chase (CEO of LangChain)
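To make the distinction concrete, here is a minimal sketch in Python. `call_llm` and `run_tests` are hypothetical placeholders for a model API and a test runner, not real library calls; the point is the shape of the workflow, not the implementation.

```python
# Minimal sketch of the distinction above. `call_llm` and `run_tests` are
# hypothetical placeholders for a model API and a test runner.

def call_llm(prompt: str) -> str:
    """Placeholder for a single LLM completion call."""
    raise NotImplementedError

def run_tests(code: str) -> tuple[bool, str]:
    """Placeholder: returns (passed, error_log) for a candidate solution."""
    raise NotImplementedError

# Non-agentic: one prompt in, one answer out.
def single_shot(task: str) -> str:
    return call_llm(f"Solve this task:\n{task}")

# Agentic (simplified): iterate on feedback until the tests pass or we give up.
def agentic_loop(task: str, max_steps: int = 5) -> str:
    code = call_llm(f"Solve this task:\n{task}")
    for _ in range(max_steps):
        passed, log = run_tests(code)
        if passed:
            break
        code = call_llm(f"Task:\n{task}\nYour last attempt failed with:\n{log}\nFix it.")
    return code
```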

Challenges of evaluating LLM software development agents

For a benchmark to be truly useful in evaluating the performance of LLM agents in software development, it should measure all of the capabilities described above. With the recent shift towards agents and agentic workflows in the LLM world, we expect to see benchmarks that better assess the capabilities of agents in terms of:

  • Multi-step reasoning: Instead of single-turn interactions, benchmarks should support the evaluation of agents based on complex reasoning chains and multiple decision points.
  • Tool integration assessment: Tool selection, tool calling, usage patterns, and using the results of tool calls should all be evaluated.
  • Context window management: Agents require a larger context window than LLMs used in single-turn interactions. Benchmarks should also measure how well information is retained and used over time.
  • Planning and decomposition: It is necessary to evaluate planning quality, task decomposition, routing, and execution sequencing, as these are core capabilities of an LLM agent.
  • Self-correction mechanisms: Benchmarks are needed to evaluate error detection, solution refinement, and iterative improvement capabilities.
  • Long-term memory: In the case of development tasks taking hours, days, or more for the agent to solve, benchmarks should evaluate performance and memory management capabilities over extended periods.
  • Environmental interaction: Agents performing real-life problem solving work in a complex environment. Benchmarks need to be able to simulate realistic environmental constraints and the interactions of agents with external tools (IDEs, repositories, databases, etc).
  • Adaptation to feedback: Since agents are expected to work interactively and iteratively, benchmarks need to assess how well the agent responds to and implements suggestions.
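As an illustration of what such benchmarks need to capture, a per-task trace could record how many reasoning steps, tool calls, and self-corrections an agent needed. This is a hypothetical sketch; the field names and toy scoring function are ours, not taken from any benchmark below.

```python
# Hypothetical per-task trace for scoring agentic behavior; names are illustrative.
from dataclasses import dataclass, field

@dataclass
class AgentTrace:
    task_id: str
    steps: int = 0                                        # multi-step reasoning depth
    tool_calls: list[str] = field(default_factory=list)   # tool integration
    self_corrections: int = 0                             # error detection / refinement
    tokens_used: int = 0                                  # context window / cost pressure
    solved: bool = False

def score(trace: AgentTrace) -> dict:
    """Toy scoring: reward solved tasks, reward efficiency, count distinct tools used."""
    return {
        "solved": trace.solved,
        "efficiency": 1.0 / trace.steps if trace.solved and trace.steps else 0.0,
        "distinct_tools": len(set(trace.tool_calls)),
        "tokens_used": trace.tokens_used,
    }
```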

Benchmarks for LLM coding agents

AgentBench

GitHub | Paper

AgentBench evaluates general agent capabilities across multiple domains, including coding. This multi-dimensional benchmark consists of 8 different environments (operating system, database, knowledge graph, digital card game, lateral thinking puzzles, house-holding, web shopping, and web browsing) and uses a multi-turn, open-ended generation setting. The focus of this benchmark is to measure planning, reasoning, tool use, and decision-making in agents. While the detailed approach of AgentBench is very promising and it claims to be the first systematic benchmark to evaluate LLMs as agents on a wide range of tasks (with the first edition evaluating 25 models), it’s important to note that GPT-4 is the latest evaluated model, and the latest code change on GitHub is from January 2025, meaning that AgentBench may be a little outdated if you’re looking to assess the performance of state-of-the-art models as agents.

DA-Code

GitHub | Paper

DA-Code is a code generation benchmark that was designed to evaluate LLMs on agent-based data science tasks. Note that the benchmark focuses on plain LLMs (rather than agent applications), but it evaluates models based on agentic workflows specifically in a data science context. Rather than general coding abilities, DA-Code focuses on data science tasks that require advanced coding skills.

The reason DA-Code can be considered a great tool to assess the capabilities of LLM agents is that the benchmark:

  • challenges models with complex problems that require planning
  • is based on real and diverse data
  • requires LLMs to use data science programming languages and to perform complex data processing (including classification, regression, clustering, statistical analysis, visualization, data manipulation & transformation, and more) to deliver solutions, as in the sketch below
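For a sense of what such a task demands, here is the kind of end-to-end solution an agent might have to produce for a clustering exercise. The file name and columns are invented for illustration; actual DA-Code tasks are considerably more involved.

```python
# Illustration of a DA-Code-style data science solution; inputs are invented.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_csv("customers.csv")  # hypothetical input file
features = df[["annual_spend", "visits", "tenure_months"]].dropna()

X = StandardScaler().fit_transform(features)             # data transformation
labels = KMeans(n_clusters=4, n_init=10).fit_predict(X)  # clustering

features.assign(segment=labels).to_csv("segments.csv", index=False)
```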

LiveSWEBench

GitHub | HuggingFace

LiveSWEBench is a benchmark for AI coding assistants that focuses specifically on end-user coding agent applications (e.g., GitHub Copilot, Cursor, or Aider). The leaderboard shows a periodically updated list of the best-performing AI agent tools based on three task types (categorized by the level of developer involvement):

  • Fully agentic programming tasks: LLM assistants are provided with just a GitHub issue and are instructed to solve it fully autonomously
  • Targeted editing tasks: The name of a file to be modified is given to the assistants, along with a high-level prompt that describes the change to make
  • Autocomplete tasks: Assistants are prompted to generate code at a specific file location.

The tasks in LiveSWEBench are sourced from real-world GitHub repositories (issue + merged pull request pairs), and the benchmark’s creators aim to prevent contamination by updating the benchmark to stay in line with the developing capabilities of agents. Languages and frameworks covered include C++, Java, TypeScript, and Python.

MLE-Bench

GitHub | Paper

MLE-Bench is, again, not specific to software code but focuses on agentic workflows: it’s a benchmark by OpenAI that aims to measure how well AI agents perform at machine learning engineering tasks. The creators of this benchmark have curated engineering-related competitions from Kaggle into a set of challenging real-world tasks that test the ML engineering capabilities of models (including training models, preparing datasets, and running experiments).

At the time of writing, MLE-Bench contains 75 manually selected competitions, each of which includes a description of the task, training and testing datasets, the code to grade models' submissions, a leaderboard to compare LLM performance with that of human contestants, and a complexity rating ranging from low (an experienced human can code a solution in under 2 hours), through medium (2-10 hours), to high (more than 10 hours).
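As a rough picture of the work involved, a low-complexity competition boils down to something like the following: load the data, train a model, and write a Kaggle-style submission file. The paths and column names below are invented and far simpler than actual MLE-Bench tasks.

```python
# Simplified sketch of a Kaggle-style workflow; file and column names invented.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

X, y = train.drop(columns=["id", "target"]), train["target"]
model = GradientBoostingClassifier().fit(X, y)

pd.DataFrame(
    {"id": test["id"], "target": model.predict(test.drop(columns=["id"]))}
).to_csv("submission.csv", index=False)
```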

Polyglot benchmark by Aider

GitHub | Leaderboard

The Polyglot benchmark by Aider is a curated collection of programming exercises (225 coding problems at the time of writing). The benchmark is based on Exercism coding problems in a variety of popular languages (C++, Go, Java, JavaScript, Python, Rust) and focuses on the most difficult exercises to challenge agents. The Polyglot benchmark was introduced in December 2024 and continues to evaluate state-of-the-art models (the most recent at the time of writing being Claude 3.7 Sonnet and o3, which reach ~82%).
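Exercism-style exercises give the model a stub plus a test suite; the agent must edit the stub until the tests pass. The snippet below is a simplified illustration of that format (loosely modeled on the classic "raindrops" exercise), not an actual benchmark problem.

```python
# Stub the agent has to complete (e.g. raindrops.py):
def convert(number: int) -> str:
    raise NotImplementedError

# Tests the agent's edits must satisfy (e.g. raindrops_test.py):
def test_convert():
    assert convert(28) == "Plong"          # divisible by 7
    assert convert(30) == "PlingPlang"     # divisible by 3 and 5
    assert convert(34) == "34"             # no factors of 3, 5, or 7
```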

In addition to the Polyglot benchmark, Aider (itself the provider of an LLM coding agent) has also developed a refactoring benchmark and leaderboard to help evaluate the performance of coding agents.

☝️ Using Aider AI for code generation

Aider provides an AI pair programming agent to support software development tasks. Read our introduction to Aider.

StackEval

GitHub | Paper

StackEval evaluates an LLM’s capability to function as a coding assistant. The coding assistance tasks include code writing, debugging, code review, and conceptual understanding. This focus on assistant-style capabilities makes it useful for evaluating agents (note, however, that the benchmark evaluates only plain LLMs rather than agent tools).

The benchmark is based on two curated datasets derived from StackOverflow questions:

  • StackEval: contains 925 questions sampled between Jan 2018 - Sep 2023
  • StackUnseen: a continuously updated dynamic dataset with two versions at the time of writing (Sep 2023 - Mar 2024, Mar 2024 - May 2024).

The developer of this benchmark also provides a function-calling benchmark that evaluates the ability of LLMs to use tools to perform tasks (including web searches, code execution, and planning multiple function calls). While not coding-specific, that is a useful addition when assessing agentic performance.
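To illustrate the idea, a function-calling evaluation typically hands the model a set of tool schemas and a request, then checks the emitted call against a gold reference. Everything in this sketch (tool names, schemas, grading) is invented for illustration and is not the benchmark's actual format.

```python
# Hypothetical function-calling check; tool schemas and grading are invented.
import json

TOOLS = [
    {"name": "web_search", "parameters": {"query": "string"}},
    {"name": "run_python", "parameters": {"code": "string"}},
]

def grade(model_call_json: str, expected_name: str, expected_args: dict) -> bool:
    """Return True if the model picked the right tool with the right arguments."""
    call = json.loads(model_call_json)
    return call.get("name") == expected_name and call.get("arguments") == expected_args

# e.g. the prompt "What is 17**23?" should trigger the code-execution tool:
print(grade('{"name": "run_python", "arguments": {"code": "print(17**23)"}}',
            "run_python", {"code": "print(17**23)"}))
```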

SWE-bench (+ SWE-agent and other tools)

GitHub (SWE-bench) | Paper (SWE-bench)

SWE-bench is a benchmark for evaluating LLMs on real-world software issues collected from GitHub. Models are provided a codebase and an issue, and are tasked with creating a patch that resolves the problem. At the time of writing, the benchmark contains 2,294 issue-commit pairs across 12 Python repositories. It provides a leaderboard and is available in several additional variations:

  • SWE-bench Full: contains all ~2.3K task instances
  • SWE-bench Lite: a curated subset of 300 instances that makes evaluation less expensive and focuses on evaluating the bug-fixing ability of models
  • SWE-bench Verified (by OpenAI): a human-validated subset that provides a fairer and more reliable set of tasks for evaluating LLMs’ ability to solve coding issues
  • SWE-bench Multimodal: contains visual elements (images, videos) from JavaScript repositories.
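To get a feel for what a task instance contains, the datasets are published on Hugging Face and can be inspected directly. The sketch below assumes the SWE-bench Lite dataset and its published field names; treat it as illustrative rather than as the official evaluation harness.

```python
# Peek at a SWE-bench Lite task instance (assumes the published HF dataset layout).
from datasets import load_dataset

lite = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
task = lite[0]

print(task["repo"], task["base_commit"])   # the codebase the agent starts from
print(task["problem_statement"][:300])     # the GitHub issue text
# A candidate patch is judged by whether the FAIL_TO_PASS tests start passing
# (and the PASS_TO_PASS tests keep passing) once the patch is applied.
print(task["FAIL_TO_PASS"])
```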

The team behind SWE-bench develops other useful tools, including their own SWE-agent, an agent scaffold (built on the concept of Agent Computer Interfaces) that turns LLMs into software engineering agents. It lets any LLM use tools autonomously, enabling it to fix issues in real GitHub repositories, perform tasks on the web, and more.

Another useful tool is SWE-smith, which lets you generate SWE-bench-compatible task instances for any GitHub repository, supporting the training of software engineering agents. Together, these tools support the development and evaluation of LLM agents for software engineering use cases.

The future of agent evaluation

As agents evolve and their role in the LLM landscape shifts, evaluations need to be better geared towards the use cases, workflows, and challenges specific to LLM agents in software development.

As with traditional LLM benchmarks, new agent-centered evaluations will likely become more fine-grained for different use cases. Cost-efficiency must also become a central metric, considering the higher token costs associated with leveraging the reasoning capabilities of agents; at the time of writing, the Aider Polyglot benchmark is the only one that prominently compares agents' scores against their cost. To be useful in practice, software engineering evaluations should focus on real-world programming challenges. A comprehensive set of metrics needs to be tracked to help identify the right agents (or LLMs) for specific use cases, and evaluations will also need to tackle the problems presented by newer multi-agent systems. Function calling, too, will need to be measured to make sure agents use tools efficiently and effectively to deliver solutions.

A new evaluation challenge to tackle will be to measure the effectiveness of collaboration between the human user and the agent, as this will be the most prominent real-world scenario. Measuring how agents learn and improve over time would also be important, as this ability can greatly impact the usefulness of agents in specific use cases.

A very important challenge is preventing benchmarks from becoming contaminated, i.e. newly released models simply learning and internalizing the evaluation cases, giving them an unfair advantage. The creators of LiveCodeBench compared the performance of models on coding challenges released before and after the models' cutoff date. The results are worrying because some models performed noticeably better on challenges that surfaced on the web before their cutoff date, raising the question of how meaningful benchmarks with public datasets truly are.
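The check itself is simple to reason about: bucket problems by their release date relative to a model's training cutoff and compare pass rates on the two buckets. The sketch below uses invented dates and results purely to show the shape of the analysis.

```python
# Contamination check sketch: compare pass rates before vs. after the model cutoff.
from datetime import date
from statistics import mean

CUTOFF = date(2024, 4, 1)  # hypothetical training cutoff of the model under test

problems = [
    {"released": date(2024, 1, 10), "solved": True},   # invented results
    {"released": date(2024, 2, 2),  "solved": True},
    {"released": date(2024, 6, 15), "solved": False},
    {"released": date(2024, 7, 3),  "solved": True},
]

before = [p["solved"] for p in problems if p["released"] < CUTOFF]
after = [p["solved"] for p in problems if p["released"] >= CUTOFF]

# A large gap between the two pass rates is a red flag for memorization.
print(f"pass rate before cutoff: {mean(before):.0%}, after cutoff: {mean(after):.0%}")
```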

📈 DevQualityEval

DevQualityEval (GitHub) is an evaluation benchmark and framework to compare and evolve the quality of code generation of LLMs.

While the benchmark started out with simple LLM-only coding tasks, it is developing beyond LLM evaluation, with a task set that spans diverse software engineering scenarios and tasks that require planning and a degree of agentic behavior (migration, transpilation). The next versions will raise the ceiling by implementing a combination of tasks (scenarios) that better represent the real-world work of software developers and extending the evaluation to new programming languages.

Check out the latest results of the DevQualityEval evaluation benchmark.

| 2025-05-12