Comparing LLM benchmarks for software development

A detailed comparison of benchmarks for using LLMs in software development

In this post, we’re comparing the various benchmarks that help rank large language models for software development tasks.

Large language models are getting advanced enough to be useful for software development tasks. While models are now capable of writing commit messages, searching through your code base, and generating and fixing code, finding the one that best suits your needs can be difficult.

Luckily, there are benchmarks that help you gauge and compare the performance of different models on typical software development tasks. This post introduces the key benchmarks for assessing LLMs and the feasibility of using a model to support you in your everyday work.

💡 A blog post series on LLM benchmarking

Read all the other posts in Symflower’s series on LLM evaluation, and check out our latest deep dive on the best LLMs for code generation.

Benchmarks for LLM code generation

For a list of general benchmarks (i.e. those not specific to software development), see the previous post in this series. Read on for coding-specific benchmarks:

HumanEval

Paper | GitHub | Dataset

  • Released: July 2021
  • Total number of tasks: 164
  • Programming languages: Python
  • Result verification: Unit tests (average 7.7 test cases per example)

HumanEval was developed by OpenAI alongside their first “Codex” model (the model that originally powered GitHub Copilot). Instead of measuring simple text similarity (which would be typical for natural-language tasks), it focuses on whether the code generated by the model actually works as intended.

It includes 164 programming challenges, each with its own unit tests. The examples were written by hand to make sure they are not part of the training data of existing code generation models. Each challenge consists of a Python function signature and natural-language text (comments, docstrings) in English; the LLM has to fill in the function body. The tasks cover domains ranging from simple algorithms to the understanding of key Python language features.

The benchmark evaluates the implementation in the LLM’s response by checking whether it passes the corresponding test cases. It uses the Pass@k metric to measure model performance: k code samples are generated per task, and the task counts as solved if at least one of them passes the provided unit tests. With models becoming more and more powerful, it has become common practice to evaluate with Pass@1, meaning that an LLM gets only a single attempt at each challenge.
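
The Pass@k number is usually computed with the unbiased estimator from the HumanEval paper. Here is a minimal sketch in Python, where n is the number of samples generated per task and c the number of samples that pass all tests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator from the HumanEval paper.

    n: samples generated per task, c: samples that pass all unit tests,
    k: the budget being evaluated (e.g. 1 for Pass@1).
    """
    if n - c < k:
        return 1.0  # every k-sized subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per task, 37 of which pass the unit tests.
print(pass_at_k(200, 37, 1))   # 0.185
print(pass_at_k(200, 37, 10))  # ~0.88
```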

HumanEval helps identify models that can really solve problems rather than just regurgitating code they have been trained on. It was initially introduced for Python, but the community has since released versions for other programming languages.

Available forks include:

  • HumanEval-X provides 820 human-written data samples with corresponding test cases in Python, C++, Java, JavaScript, and Go.
  • HumanEval-XL covers 80 problems across 23 natural languages and 12 programming languages: Python, Java, Go, Kotlin, PHP, Ruby, Scala, JavaScript, C#, Perl, Swift, and TypeScript.
  • EvalPlus offers 80 times as many tests as the original HumanEval.

DevQualityEval

  • Released: April 2024 (new releases every month)
  • Total number of tasks: Varies
  • Programming languages: Go, Java, Ruby (more added with new releases)
  • Result verification: Quality metrics of static and dynamic analyses

DevQualityEval splits software development into distinct subtasks:

  • The “write test” tasks ask the model to write tests for a piece of code.
  • “Code repair” tasks require the model to repair a piece of code that has compile-time errors.
  • “Transpile” tasks ask the model to convert source code from one language to another.

New tasks are added with every version.

Results are validated automatically using static and dynamic analyses, e.g. by dynamically generating tests and executing them. Models are graded in a point-based system based on quality metrics (e.g. producing code that compiles, reaching sufficient coverage with the generated test code).
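
The exact criteria and weights evolve with each release, but conceptually the grading looks roughly like the following sketch. The point values and checks here are purely hypothetical and only illustrate the idea of point-based grading:

```python
# Purely hypothetical point values: DevQualityEval's real criteria and
# weights differ and change with every release.
def grade_response(compiles: bool, tests_pass: bool,
                   coverage: float, response_chars: int) -> int:
    score = 0
    if compiles:
        score += 1               # the response contains compiling code
    if tests_pass:
        score += 5               # the generated tests execute successfully
    score += int(coverage * 10)  # reward the coverage reached by the tests
    if response_chars < 2000:
        score += 1               # favor short, efficient responses
    return score

print(grade_response(compiles=True, tests_pass=True,
                     coverage=0.85, response_chars=1500))  # 15
```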

The benchmark also measures how long models take to generate responses and counts the number of characters in each response to highlight models that answer concisely and efficiently.

DevQualityEval supports any inference endpoint implementing the OpenAI inference API and works on Linux, Mac, and Windows.

ClassEval

Paper | GitHub | Dataset

  • Released: August 2023
  • Total number of tasks: 100
  • Programming languages: Python
  • Result verification: Unit tests (average 33.1 test cases per example)

ClassEval is a manually constructed dataset of 100 class-level Python coding tasks (100 classes containing 410 methods in total). It challenges models with code generation that spans the logic of a whole class, not just a single function. ClassEval also provides tests (an average of 33.1 test cases per class, to be precise).

Tasks are chosen across a wide range of topics (management systems, data formatting, mathematical operations, game development, file handling, database operations and Natural Language Processing). The dataset accounts for library, field, and method dependencies as well as standalone methods.
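
To illustrate what class-level generation means, a ClassEval-style task roughly hands the model a class skeleton like the following and expects it to fill in all method bodies consistently (a simplified, hypothetical example, not an actual task from the dataset):

```python
class ShoppingCart:
    """Hypothetical ClassEval-style skeleton: the model receives the class
    docstring and method signatures and must generate bodies whose behavior
    is consistent across methods."""

    def __init__(self):
        self.items = {}  # maps item name -> (unit price, quantity)

    def add_item(self, name: str, price: float, quantity: int = 1) -> None:
        """Add an item, accumulating the quantity if it is already in the cart."""
        ...

    def remove_item(self, name: str) -> bool:
        """Remove an item completely; return False if it was not in the cart."""
        ...

    def total(self) -> float:
        """Return the total price of all items currently in the cart."""
        ...
```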

SWE-bench

Paper | GitHub | Dataset

  • Released: October 2023
  • Total number of tasks: 2294
  • Programming languages: Python
  • Result verification: Unit tests (depending on the PR)

SWE-bench contains more than 2,000 real-world GitHub issues and their corresponding pull requests from 12 popular Python repositories. Solving these issues requires the LLM to understand the issue description and coordinate changes across several functions, classes, and files. Models have to interact with execution environments, process broad contexts, and perform reasoning that goes beyond the code generation tasks found in most benchmarks. The changes made by the model are verified with the unit tests that were introduced when the actual issues were resolved.
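
Conceptually, verification boils down to applying the model's patch to the repository checked out just before the real fix and re-running the tests that the original pull request made pass. A simplified sketch (the real harness uses isolated, per-repository execution environments and additional metadata):

```python
import subprocess

def evaluate_patch(repo_dir: str, model_patch: str, fail_to_pass: list[str]) -> bool:
    """Simplified SWE-bench-style check: apply the model's patch, then run the
    tests that failed before the original fix and must pass afterwards."""
    subprocess.run(["git", "apply", "-"], input=model_patch, text=True,
                   cwd=repo_dir, check=True)
    result = subprocess.run(["python", "-m", "pytest", *fail_to_pass], cwd=repo_dir)
    return result.returncode == 0
```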

Aider

GitHub | Leaderboards

  • Released: June 2023
  • Total number of tasks: 202
  • Programming languages: Python
  • Result verification: Unit tests

Aider is an open-source tool that lets you use LLMs as a pair programmer in your terminal. Its creator, Paul Gauthier, started conducting benchmarks to help users choose among the ever-growing number of available models. The assessment focuses on how effectively LLMs can translate a natural-language coding request into code that executes and passes unit tests. Two types of benchmarks are available: code editing and code refactoring.

The 133 code editing tasks are small coding problems based on Exercism Python exercises. Model responses are validated with the accompanying unit tests, and the model gets one additional chance to fix errors should the tests fail on the first attempt.
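
The evaluation loop is roughly: ask the model to solve the exercise, run the accompanying unit tests, and if they fail, hand the test output back to the model for one final attempt. A conceptual sketch (the generate and run_tests callables are placeholders, not Aider's actual API):

```python
from typing import Callable, Tuple

def benchmark_exercise(prompt: str,
                       generate: Callable[..., str],
                       run_tests: Callable[[str], Tuple[bool, str]]) -> bool:
    """Conceptual two-attempt loop in the style of the Aider code editing
    benchmark. `generate` calls the LLM and `run_tests` executes the
    exercise's unit tests; both are placeholders for a concrete harness."""
    solution = generate(prompt)
    passed, test_output = run_tests(solution)
    if passed:
        return True
    # Second and final attempt: the model gets to see the failing test output.
    solution = generate(prompt, error_feedback=test_output)
    passed, _ = run_tests(solution)
    return passed
```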

The 89 code refactoring tasks represent non-trivial refactorings from 12 popular Python repositories. The task is always to refactor a method of a class into a standalone function, and correctness is roughly checked with one unit test per refactoring.
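
A refactoring task of this shape looks roughly like the following miniature example (hypothetical, not taken from the benchmark):

```python
# Before: the helper logic lives inside a class method.
class ReportGenerator:
    def format_header(self, title: str) -> str:
        return f"=== {title.upper()} ==="

# After: the same logic extracted into a standalone, top-level function.
def format_header(title: str) -> str:
    return f"=== {title.upper()} ==="
```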

Aider offers its own leaderboard so you can track the progress of various models at https://aider.chat/docs/leaderboards/.

BigCodeBench

Paper | GitHub | Dataset

  • Released: June 2024
  • Total number of tasks: 1140
  • Programming languages: Python
  • Result verification: Unit tests (average 5.6 test cases per example)

BigCodeBench is dubbed the next generation of HumanEval: it targets function-level code generation with practical and challenging coding tasks. Compared to HumanEval, its instructions are more complex, and its 1,140 tasks require calling functions from 139 different libraries. Each task comes with 5.6 test cases on average, reaching an average branch coverage of 99%. BigCodeBench also provides LLM-generated samples as open source.
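
A hypothetical task in this style still asks for a single function, but solving it requires combining calls to several libraries, for example (illustrative only, not an actual task from the dataset):

```python
# Hypothetical BigCodeBench-style task: "Fetch a CSV file from a URL and
# return the arithmetic mean of the given column." Solving it requires
# combining several libraries within one function.
import csv
import io
import statistics
import urllib.request

def column_mean(url: str, column: str) -> float:
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    reader = csv.DictReader(io.StringIO(text))
    return statistics.mean(float(row[column]) for row in reader)
```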

DS-1000

Paper | GitHub | Dataset

  • Released: November 2022
  • Total number of tasks: 1000
  • Programming languages: Python
  • Result verification: Unit tests (average 1.6 test cases per example)

DS-1000 is the result of a collaboration between researchers from several high-profile universities (including the University of Hong Kong, Stanford, and the University of California, Berkeley) and Meta AI. It’s a Python benchmark of data science use cases based on real StackOverflow questions (451 of them, to be precise).

Problems are slightly modified compared to the originals on StackOverflow so that models can’t simply reproduce memorized solutions from pre-training. The benchmark aims to reflect a diverse set of realistic and practical use cases: in total, 1,000 problems spanning 7 popular Python data science libraries.
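
Problems typically pair a short natural-language question with a partially written code snippet that the model has to complete, StackOverflow-style (a hypothetical example, not an actual problem from the dataset):

```python
# Hypothetical DS-1000-style problem: "My DataFrame's 'price' column contains
# missing values. How do I replace them with the column's median?"
import pandas as pd

df = pd.DataFrame({"price": [10.0, None, 30.0, None, 50.0]})

# --- the model has to complete the solution from here ---
df["price"] = df["price"].fillna(df["price"].median())
# --- end of completion; the benchmark compares against a reference result ---

print(df)
```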

MBPP

Paper | GitHub | Dataset

  • Released: August 2021
  • Total number of tasks: 974
  • Programming languages: Python
  • Result verification: Unit tests (3 test cases per task)

MBPP (Mostly Basic Python Problems) was developed by Google. It contains 974 crowd-sourced Python programming problems that are simple enough to be solved by beginner Python programmers (covering programming fundamentals, standard library functionality, etc.). Each task consists of a description, a reference solution, and 3 automated test cases.
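
An MBPP-style task bundles a one-sentence description, a reference solution, and three assert statements (a hypothetical example in the dataset's format, not an actual entry):

```python
# Task description: "Write a function to find the largest of three numbers."
def max_of_three(a, b, c):
    return max(a, b, c)

# The three automated test cases used for verification:
assert max_of_three(10, 4, 3) == 10
assert max_of_three(-1, -2, -3) == -1
assert max_of_three(5, 5, 5) == 5
```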

MultiPL-E

Paper | GitHub | Dataset

  • Released: August 2022
  • Total number of tasks: ~ 530 (per language)
  • Programming languages: Clojure, C++, C#, D, Elixir, Go, Haskell, Java, Julia, JavaScript, Lua, ML, PHP, Perl, R, Ruby, Racket, Rust, Scala, Bash, Swift, TypeScript
  • Result verification: Unit tests (average 6.3 test cases per example)

The abbreviation stands for Multi-Programming Language Evaluation of Large Language Models of Code. It’s a system for translating unit-test-driven code generation benchmarks into new languages. Starting from OpenAI’s HumanEval and MBPP, MultiPL-E uses a suite of small compilers to translate both the prompts and the unit tests into 22 programming languages. The benchmark is integrated into the BigCode Evaluation Harness, which is the easiest way to run it, but you can also use it directly.

APPS

Paper | GitHub | Dataset

  • Released: May 2021
  • Total number of tasks: 10000
  • Programming languages: Python
  • Result verification: Unit tests (average 13.2 test cases per example)

APPS contains problems sourced from a variety of open-access coding websites (Codeforces, Kattis, etc.). The goal of APPS is to mirror how human programmers are evaluated: coding problems are presented in natural language.

The dataset contains 10,000 problems (with 131,836 corresponding test cases and 232,444 human-written ground-truth solutions). The problems in APPS range from beginner-level tasks to advanced competition-level problems. The average length of a problem in APPS is 293.2 words, so problems can get fairly complex. APPS aims to measure both coding ability and problem-solving.

CodeXGLUE

Paper | GitHub | Dataset

  • Released: February 2021
  • Total number of tasks: 10 (across 14 datasets)
  • Programming languages: C/C++, C#, Java, Python, PHP, JavaScript, Ruby, Go
  • Result verification: Included tests

CodeXGLUE (General Language Understanding Evaluation benchmark for CODE) contains 14 datasets for 10 programming-related tasks. The benchmark covers many scenarios:

  • code-code (clone detection, code search, defect detection, cloze test, code completion, code repair, and code-to-code translation)
  • text-code (natural language code search, text-to-code generation)
  • code-text (code summarization)
  • text-text (documentation translation)

To get your model evaluated on this test set, submit your model to the developers of the benchmark.

Comparing LLM benchmarks for code generation

For this post, we looked at lots of benchmarks that focus exclusively on LLM performance for software development-related tasks. While it is amazing to see all the progress that has been made over the last few years, there are some distinct shortcomings in this field due to challenges that are unique to software development:

Programming language diversity

Python is the language of machine learning, so it is not too surprising that many code generation benchmarks focus on Python. However, the industry uses a far wider selection of programming languages.

Luckily, there are already some benchmarks that cover languages other than Python. Since a model's performance can vary between languages, the eventual goal has to be a standard benchmark that tells developers which model works best for their language of choice.

Number of examples

In theory, the more evaluation examples there are, the better. In practice, however, evaluating LLMs is costly, and there are many promising models to evaluate. A balance is needed between having enough examples to assess a model's capabilities and not adding examples that contribute little value (HumanEval, for example, contains a task that just requires adding two numbers).

Verification and scoring

In almost every benchmark, model results are verified by unit tests that need to pass. There have been instances where the test suites of popular benchmarks turned out to be incomplete or incorrect, so the correctness of the tests themselves needs to be ensured.

While unit tests are a great way to automate verification, they do not assess the quality of the produced code. Every programmer who regularly does code reviews knows that a lot can go wrong despite passing tests (e.g. bad code design, misleading naming or comments, lack of documentation, poor optimization).

There need to be additional assessments that take the quality of the code generated by LLMs into account. Furthermore, some meta-metrics might play a big role in how you would rank a candidate model: for example, neither the cost of a model nor whether its weights are openly available is factored into the assessment of any of the presented benchmarks.

Task diversity

Programming is at the core of software development, so naturally most of the benchmarks introduced in this post focus on code generation. But the skills a software developer actually needs are far more diverse.

Writing and reviewing new code, refactoring old code, planning and integrating new behavior at different scales, writing tests at different levels, reading documentation and using frameworks, hunting and fixing bugs, and maintaining documentation, services, and deployments are just a few examples. Benchmarking LLMs only on code generation captures just a fraction of the bigger picture that is software development.

Stale state

Almost all benchmarks we presented were curated once at creation and then considered “done”. However, software development is a very active profession, and new tools and techniques emerge almost every day. A benchmark should not be a “snapshot” of how well a model solved some tasks a few years ago; it should be an actively maintained leaderboard that shows the best model to use right now.

Make sure you never miss any of our upcoming content by signing up for our newsletter and by following us on X, LinkedIn, or Facebook!

2024-07-11