Anthropic's Claude 3.7 Sonnet is the new king 👑 of code generation (but only with help), and DeepSeek R1 disappoints (Deep dives from the DevQualityEval v1.0)

Cost-effectiveness scatter plot that shows the best LLMs aligned with their capability on the y-axis to the average costs per API request with a logarithmic scale on the x-axis solving 760 benchmark cases. — SVG and higher resolution PNG

This deep dive takes a look at the results of the DevQualityEval v1.0 which analyzed 107 different LLMs for generating quality code (for Java, Go and Ruby). Anthropic’s Claude 3.5 Sonnet (2024-10-22) and OpenAI’s o1-mini (2024-09-12) have an advantage over Google’s Gemini 2.0 Flash Lite and Anthropic’s Claude 3.7 Sonnet (2025-02-19) in functional score. However, Anthropic’s Claude 3.7 Sonnet (2025-02-19) is with better context the best functional scoring model, while DeepSeek’s V3 is the best open-weight model 🔓. Mistral’s Codestral (2501) is the best European model 🦸 while DeepSeek R1 doesn’t hold up to the hype at all and Llama is not even in the TOP 20.

The results in this post are based on 5 full runs using DevQualityEval v1.0. Multiple bugfix releases where necessary to make the comparisons fair. Metrics and logs have been extracted and detailed leaderboard data can be requested. The full logs are packed with information to improve LLMs and can be requested) as well. The full evaluation setup and reasoning behind the tasks are similar to the previous dive but considerably extended as detailed in this dive.

This deep dive is not yet done. We are adding new models, sections and details daily. Make sure to stay up to date on the latest changes or.

The following sections are a deep dive into the results, learnings and insights of all evaluation runs towards the DevQualityEval v1.0 release. Each section can be read on its own and comes with a multitude of learnings that we will integrate into the next release.

Each graph links to an interactive HTML for easier digestion and perusing of the results you are interested in.

DevQualityEval deep dives build on each other. Check out all the deep dives we have published so far:

Access the results of the DevQualityEval LLM benchmark.

Table of contents:

Terminology
Comparing the capabilities and costs of top models
Comparing model capabilities by total scores
- Comparing scores and model size
- Comparing scores and performance
Programming languages
- Go
- Java
- Ruby
Tasks
Chattiness
Model reliability
API reliability
Degraded API availability of Anthropic: Claude 3.7 Sonnet (2025-02-19)
Benchmark reliability
Noteworthy changes and learnings in DevQualityEval v1.0
DevQualityEval Prö
What comes next? DevQualityEval v1.1

The purpose of the evaluation benchmark and the examination of its results is to give LLM creators a tool to improve the results of software development tasks towards quality and to provide LLM users with a comparison to choose the right model for their needs. Your feedback is highly appreciated and guides the next steps of the eval.

With that said, let’s dive in!

Terminology

This DevQualityEval benchmark evaluates LLMs and LRMs (Large Reasoning Models) on software development use cases. The benchmark results span numerous metrics that are introduced and explained with every section. Additional metrics and details are defined in the full leaderboard.

Since LLMs and LRMs are strongly related we are using just “LLM” or “model” to refer to them.

Since the benchmark is touching multiple programming languages and frameworks, as well as various technical topics that have terminologies on their own, we need to further define some common terminology for these deep dives:

Each “task type”, or just “task”, is a distinct problem category, for example: writing tests.
A “task variant” is a specific implementation of a task, for example: writing tests but with an additional test template as context.
A “task case”, or just “case”, is a concrete instantiation of such a problem, for example: a Binary Search implementation written in Go that needs unit tests.
An API request attempt is a query to an LLM API provider. There are up to 10 request attempts for each task case if an API is unreliable.
An API request retry is a query to an LLM API provider after the initial request failed. There are up to 9 request retries for each task case if an API is unreliable.

If you are missing an entry in this terminology section, please let us know.

💰 We need your help

We built DevQualityEval for the community and we believe benchmarks should stay open and accessible to anyone. But, benchmarking is not free.

Every time we run the benchmark on the latest models, it costs us serious 💸. And that’s just the smallest portion of the costs. Maintaining the benchmark, refining results, adding new tasks, cases, languages, assessments… All that is time well spent, but we still need to make a living and find the time to further extend DevQualityEval.

We truly appreciate any help, whether it comes in the form of direct financial support or LLM consulting projects. You can already help us a lot by just sharing this post and spreading the word!

We are eternally grateful for your help 👽.

Comparing the capabilities and costs of top models

The graph shows the best models in relation to their overall scores (y-axis, linear scale) and costs (x-axis, logarithmic scale, $ spent for average costs per API request over the whole benchmark for a model). The sweet spot is the top-left corner: cheap with good results.

Looking at cost-efficiency (i.e. performance vs. the average costs per API request of solving all general 760 benchmark cases), there is a cluster of high-scoring and affordable models at the top left. Using a 45° angled line going from that left-top corner towards the bottom-right corner gives us the following TOP 3 of cost-effectiveness:

#1 the 👑 of cost-effectiveness Google’s Gemini 2.0 Flash Lite (88.26% at $0.162406 for the whole benchmark, and #3 overall),
#2 💎 DeepSeek’s DeepSeek V3 (84.96% at $0.244723 for the whole benchmark, and #9 overall)
#3 💍 Qwen’s Qwen2.5 Coder 32B (81.32% at $0.085345 for the whole benchmark, and #16 overall)

This TOP 3 clearly shows that a model does not need to score high to be cost-effective. #2 in terms of cost-effectiveness is already on #9 overall. Comparing the score of our #1 in cost-effectiveness Google’s Gemini 2.0 Flash Lite (88.26%) with the #1 over all categories Anthropic: Claude 3.7 Sonnet (2025-02-19) (95.03%) makes it even clearer that as a user, one needs to chose wisely, and balance costs and functional results for the current task.

Comparing the TOP 3 models in terms of overall score without help, we see significant differences in costs:

#1 🥇 Anthropic: Claude 3.5 Sonnet (2024-10-22) (89.18% at $14.01118 for the whole benchmark)
#2 🥈 OpenAI: o1-mini (2024-09-12) (88.88% at $20.136773 for the whole benchmark)
#3 🥉 Google: Gemini 2.0 Flash Lite (88.25% at $0.162406 for the whole benchmark)

The most expensive models overall did not result in the best scoring:

#1 OpenAI’s o1-preview (2024-09-12) (85.16% at $182.504534 for the whole benchmark)
#2 Anthropic’s Claude 3.7 Sonnet (Thinking) (84.65% at $118.437782 for the whole benchmark) (Worth noting: While Anthropic’s Claude 3.7 Sonnet (2025-02-19) did comparably well (87.59% at $0.007422 per API request), switching its reasoning mode to “Thinking” did not help it solve our benchmark cases any better (84.65%) but increased costs by almost 14x.)
#3 Anthropic: Claude 3 Opus (85.33% at $37.416199 for the whole benchmark)

It is absolute worth noting that the TOP 3 most expensive models all score lower than our #1 of cost-effectiveness: Google’s Gemini 2.0 Flash Lite. OpenAI’s o1-preview (2024-09-12) alone is about 1124x more expensive with -3.10%.

Reducing the full list of over 300 LLMs to a manageable size was done by sorting based on scores and then costs. We then removed all models that were worse than the best models of the same vendor/family/version/variant (e.g. gpt-4* is better than gpt-3* so the latter can be removed). For clarity, the remaining models were renamed to represent their variants. Even then, the list was immense. In the end, only the most important new models, fundamental models and top-scorers were kept for the above graph. For a complete picture, request access to the DevQualityEval leaderboard with all the results from the latest run.

Comparing model capabilities by total scores

Bar chart that shows the total score of 107 LLMs along the x-axis, including the potential score improvement through static code repair and static analytics. — Interactive HTML, SVG and higher resolution PNG

This graph displays the total score of all benchmarked models of DevQualityEval v1.0. The higher the score, the better the model performed. This release uses a percentage-based score instead of a numerical scoring system, as outlined in the v0.6 “percentage-based-score” section. The graph also includes the improved score through static code repair of common mistakes and for the first time improved score through better context with static analytics.

The top three best-scoring models without static code repair and static analytics are:

#1 Anthropic’s Claude 3.5 Sonnet (2024-10-22) (89.19%)
#2 OpenAI’s o1-mini (2024-09-12) (88.88%)
#3 Google’s Gemini 2.0 Flash Lite (88.26%)

We see an average score of 54.51% (previously 49.30%: +5.21) which indicates that the average model is getting better at coding. However, we see 0% (previously 13%: -13.00) of all models scoring 90% or higher, and 15.89% (previously 21.74%: -5.85) of all models scoring 80% or higher. This does not indicate that models are getting worse, but that our efforts to increase the ceiling of the DevQualityEval are working.

Comparing scores and model size

Scatter chart that shows the total score of LLMs along the y-axis against their number of model parameters on the x-axis — Interactive HTML, SVG and higher resolution PNG

This graph plots the total score of models (reaching 50% or higher) against their number of parameters, in all the cases where the number of parameters was known and confirmed.

These are the TOP models according to their sizes:

Tiny: Mistral’s Ministral with 3B and 57.66%
Small: Mistral’s Ministral with 8B and 64.48%
Mid-small: Mistral’s Pixtral (v2409) with 12B and 58.56%
Medium: Qwen’s Qwen 2.5 Coder with 32B and 81.32%
Big: Qwen’s Qwen 2.5 with 72B and 76.57%
Bigger: DeepSeek’s DeepSeek V3 with 236B and 84.39%
Huge: DeepSeek’s DeepSeek R1 with 685B and 68.23%

In the previous iteration, we pointed out that while more model parameters seem to lead to a better score, size is not everything. In DevQualityEval v1.0, 56.72% of all <32B models scored over 50% vs 65% of all ≥32B models. But parameter size isn’t the only factor:

DeepSeek R1 only scored 68.23% despite having 685B parameters
Meta’s Llama 3.1 405B (Instruct) at 65.38% scored lower than many smaller models
Qwen 2 72B (Instruct) achieved 58.49% while Qwen 2.5 72B (Instruct) scored 76.58% (+18.09) with the same parameter size.

Comparing scores and performance

Scatter chart that shows the total score of LLMs along the y-axis against their average response time per case on the x-axis — Interactive HTML, SVG and higher resolution PNG

This graph plots the total score of models (reaching 50% or higher) against their performance (logarithmically). We measured the response time for each model and normalized over the solved cases, resulting in the average time it took the model to solve a single example.

There was a huge gap between the fastest model: Mistral’s Codestral (2501) (83.78%) at 2.25 seconds per response, and the slowest: DeepSeek R1 (68.23%) at 269.05 seconds per response. That means DeepSeek R1 took over 4 minutes to solve a single task, which made this model an outlier: it was over 1.7 times slower than the second-slowest model: Qwen: QwQ 32B (157.39 seconds per response).

Mistral won in terms of speed, with Codestral (2501) being the fastest and Mistral Small 3 (2.28 seconds per task) being the second-fastest.

Programming languages

Bar chart that shows the score of 107 LLMs along the x-axis, per programming language — Interactive HTML, SVG and higher resolution PNG

One goal of DevQualityEval is to look at the language-specific performance of models to allow everyone to choose the best model for their projects. The evaluation currently supports three programming languages (Go, Java, Ruby), with more languages to be added in upcoming iterations.

Average scores across all models were higher for Go (68.71%) (previously 46.64%: +22.07) than for Java (49.31%) (previously 53.82%: -4.51) and Ruby (61.49%) (previously 46.13%: +15.36). The collapse in Java’s average is not surprising since a focus of the DevQualityEval v1.0 release was to introduce challenging tasks and cases for Java. Read on for details on these languages and their averages.

Go

Bar chart that shows the Go score of 107 LLMs along the x-axis — Interactive HTML, SVG and higher resolution PNG

The TOP 3 models for Go are OpenAI’s o3-mini (2025-01-31, reasoning effort: high) (99.54%) followed by Anthropic’s Claude 3.5 Sonnet (2024-10-22) (98.89%) and OpenAI’s o1-mini (2024-09-12) (98.07%).

The best small (<10B) model for Go is Ministral 3B (80.40%) which surpasses the score of Ministral 8B (77.16%) at Go (but not at Java or Ruby).

The average Go score across all models is 68.71% (previously 46.64%: +22.07), which indicates that models are getting better at generating Go code.

Java

The TOP 3 models for Java are Anthropic’s Claude 3.5 Sonnet (2024-10-22) (87.57%), Google’s Gemini 2.0 Flash Lite (85.92%), and OpenAI: o1-mini (2024-09-12) (85.77%).

Finding a small, good-performing model for Java is more challenging, with Mistral’s Ministral 8B being the best-performing model (58.93%) among sub-10B parameter models. The next viable option is Qwen2.5 7B (Instruct) (52.20%).

While the previous iteration of this benchmark saw the highest average per-language scores for Java, this time, models were the least successful at Java when compared to other languages. The average Java score across all models was 49.31% (previously 53.82%: -4.51) The simple reason is that this version of DevQualityEval added particular complex cases to the benchmark: writing integration tests for Java’s Spring Boot and migrating JUnit 4 tests to JUnit 5.

Ruby

The TOP 3 models for Ruby are from OpenAI: o1-preview (2024-09-12) (95.55%), GPT-4o (2024-11-20) (95.47%) and o3-mini (2025-01-31, reasoning effort: medium) (95.11%). The first non-OpenAI model is on 6th place: Google’s Gemini 2.0 Flash Lite (93.44%).

Smaller (<10B) models are getting better for Ruby: while the best-performing model in the previous iteration of this benchmark only reached 61.81% (Codestral Mamba), the top score among smaller models now is 80.25% (Gemini Flash 1.5 8B), with Mistral’s Ministral 8B coming in second among smaller models at 73.36%. A bit surprising is Codestral Mamba’s degraded performance at Ruby: while it was the winner among small models in the previous benchmark with 61.81%, now, it only scored 54.43%.

The average Ruby score across all models is 61.49% (previously 46.13%: +15.36, which indicates that models are getting better at generating Ruby code.

Tasks

Bar chart that shows the score of 107 LLMs along the x-axis, per task type — Interactive HTML, SVG and higher resolution PNG

DevQualityEval’s tasks are based on common software engineering scenarios. Previous versions added cases for writing tests, transpilation and fixing generic mistakes in code (e.g. syntax and type errors). With v1.0 we introduced the base for migrating code (specifically migrating Java JUnit 4 tests to JUnit 5), generating unit (and partly integration) tests for application frameworks (specifically Java’s Spring Boot), as well as employing test templates to have a syntactical correct base for writing tests.

Using test templates has been proven to be very effective. Also, with the introduction of more complex cases, DevQualityEval v1.0 demonstrates that it is easy to raise the ceiling of the benchmark.

The following sections go into the details of the results for each task.

Code repair

Bar chart that shows the score of 107 LLMs for repairing code along the x-axis — Interactive HTML, SVG and higher resolution PNG

The code repair task consists of generic mistakes in code (e.g. syntax and type errors) with corresponding error messages. Models that perform well overall have reached consistently high scores at this task. However, in this version, we hit the ceiling with multiple models.

31 models reached 100% at the code repair task in DevQualityEval v1.0:

Anthropic: Claude 3 Sonnet
Anthropic: Claude 3.5 Haiku (2024-10-22)
Anthropic: Claude 3.5 Sonnet (2024-06-20)
Anthropic: Claude 3.5 Sonnet (2024-10-22)
Anthropic: Claude 3.7 Sonnet (2025-02-19)
Anthropic: Claude 3.7 Sonnet (Thinking)
DeepSeek: DeepSeek V2.5
Google: Gemini 2.0 Flash Lite
Google: Gemini Flash 1.5
Google: Gemini Pro 1.5
Google: Gemma 2 27B
Meta: Llama 3 70B (Instruct)
Meta: Llama 3.1 405B (Instruct)
Mistral: Codestral (2501)
Mistral: Ministral 8B
OpenAI: GPT-4o (2024-11-20)
OpenAI: GPT-4o-mini (2024-07-18)
OpenAI: o1-mini (2024-09-12)
OpenAI: o1-preview (2024-09-12)
OpenAI: o3-mini (2025-01-31) (reasoning_effort=high)
OpenAI: o3-mini (2025-01-31) (reasoning_effort=low)
OpenAI: o3-mini (2025-01-31) (reasoning_effort=medium)
Perplexity: Llama 3 Sonar 70B (Online)
Perplexity: Llama 3.1 Sonar 70B
Qwen: Qwen-Max
Qwen: Qwen-Plus
Qwen: Qwen2.5 32B Instruct
Qwen: Qwen2.5 72B (Instruct)
Qwen: Qwen2.5 Coder 32B (Instruct)
Qwen: QwQ 32B
xAI: Grok-2 (1212)

Small models can be effectively used for syntax-related tasks. Mistral’s Ministral 8B reached 100% score at code repair. Another representative of the lower parameter size category, Qwen2.5 Coder 32B (Instruct), also achieved 100%. Other notable mentions include Cognitive Computations: Dolphin 2.9.2 Mixtral 8x22B, which didn’t do very well in the overall comparison (42.0%), but performed comparably well at code repair, achieving 98.11%.

36 models scored >99% at this task, with the overall average score for this task being 90.56% (previously 73.52%: +17.04), suggesting that models are getting better at code repair. This underlines the need to again raise the ceiling by adding more complex tasks in upcoming iterations of DevQualityEval.

Migrate

Bar chart that shows the score of 107 LLMs for code migration along the x-axis — Interactive HTML, SVG and higher resolution PNG

Only one model has been able to achieve a score of 100% at this task: Anthropic: Claude 3.7 Sonnet (2025-02-19). The other two models in the TOP 3 are Mistral: Codestral (2501) (98.92%) and MiniMax: MiniMax-01 (98.23%). Worth noting is that MiniMax-01 hasn’t shown outstanding performance across all the other tasks, but did well at migrating JUnit 4 tests to JUnit 5, suggesting it may be a good candidate for migration use cases.

Smaller models tended to lose points at the migration task. Qwen2.5 Coder 32B (Instruct) only reached 48.29%, while Ministral 8B scored 55.74%.

One surprising result is that of Mistral’s Pixtral Large (2411) which, despite being a vision model, did incredibly well at migrating code (5th place with 95.66%), beating strong contenders like o1-preview (2024-09-12) (93.19%) and DeepSeek V3 (91.59%).

The average score for the migration task was 45.83%.

Transpile

Bar chart that shows the score of 107 LLMs for transpiling code along the x-axis — Interactive HTML, SVG and higher resolution PNG

OpenAI’s models performed well at the transpilation task. The TOP 3 were o3-mini (2025-01-31)(reasoning_effort=low) with 93.45%, a tie between o3-mini (2025-01-31)(reasoning_effort=medium) (93.33%) and GPT-4o-mini (2024-07-18) (93.33%) at second place, and o3-mini (2025-01-31) (reasoning_effort=high) (92.84%) at third place.

Only 4 models from other providers made it into the TOP 10 of transpilation scores. One model to point out is Qwen2.5 Coder 32B (Instruct) (91.23%), which performed consistently well not only at transpiling code, but also writing tests and code repair.

Many LLMs that performed well at writing tests weren’t as great at transpiling code. For instance, the overall leader Claude 3.5 Sonnet (2024-10-22) was best at writing tests, but only scored 83.13% at transpilation.

The average score in this category across all models was 74.03%, a big leap up from the average in DevQualityEval v0.6 (59.00%, +15.03), suggesting that models are getting better at transpiling code.

Write test

Bar chart that shows the score of 107 LLMs for generating tests along the x-axis — Interactive HTML, SVG and higher resolution PNG

The top models for generating tests are unsurprisingly also the ones that score high in the overall comparison. The TOP 3 were the overall winner Claude 3.5 Sonnet (2024-10-22) with 88.94%, o1-mini (2024-09-12) with 86.94%, and Claude 3.5 Sonnet (2024-06-20) with 86.55%.

Among smaller models, Qwen2.5 Coder 32B (Instruct) scored 83.43%, reaching 6th place and beating strong contenders like, for example, o3-mini (2025-01-31) (reasoning_effort=medium) at 82.61%, Mistral: Codestral (2501) at 81.26%, or o1-preview (2024-09-12) at 82.18%. Mistral’s Ministral 8B only scored 60.22%.

It is worth noting that no model has scored over 90% in this category, and only 18 models (16.82%) scored between 80% and 90%. This indicates that LLM-generated tests can still miss logic cases in the implementation code, resulting in incomplete coverage.

The average score in this category across all models was 50.85%.

Chattiness

DevQualityEval tracks the character count in LLM responses to compute metrics about “chattiness”. Learn more about how we measure chattiness in the previous deep dive.

Mistral’s models exhibited both extremes in terms of content chattiness ratios: Mixtral 8x7B (Base) (v0.1) scored highest at 72.62%, meaning almost three-quarters of its output was unnecessary content, while another Mistral model, Ministral 8B scored the lowest at 0.34% (meaning only 0.34% of the response was classified as unwanted excess content). In the TOP 3 of the least chatty models are Ministral 3B with 0.57% and MiniMax: MiniMax-01 with 0.84%.

Models that perform well on the DevQualityEval benchmark usually have a lower excess content chattiness ratio. All models in the top 20 of overall scores had relative chattiness scores of <2% except for Claude 3 Opus with 3.7%. The average relative chattiness of the portion of models scoring above 80% on the benchmark was 1.29%, while for the models scoring below 80%, the average chattiness was 9.61%.

The covariance coefficient between model score and this chattiness ratio is -0.4875865466, indicating a negative correlation, i.e. the higher the score, the lower the excess content. One outlier to note was Liquid’s LFM 3B, a small LLM that was among the lowest-performing models in the benchmark, yet had an excess content chattiness ratio of just 0.94%, the 5th best among all evaluated models.

On average, evaluated models exhibited a chattiness of 9.61% (or 38.86 characters per point scored), up from 9.40% in DevQualityEval v0.6.

Scatter chart that shows the score of LLMs scoring 60% and above along the y-axis, with the characters required per scored point across the y-axis — Interactive HTML, SVG and higher resolution PNG

The second chattiness metric is “overall chattiness”: how many response characters a model needs to reach its score. This is interesting as there can be models with similar scores, but the responses of one model might be concise, while the other one’s response may contain lots of redundant code.

Looking at absolute chattiness (i.e. how many characters a model produces for one point scored in DevQualityEval) for the models scoring over 60%, there is a cluster of very efficient and high-scoring models at the top left. There are some subtle differences, for example, Claude 3.5 Sonnet (2024-10-22) (overall score: 89.19%) is the model with the briefest responses, while o1-mini (2024-09-12) (overall score: 88.88%) is less efficient. Models with a lower score have less concise responses, represented by a lower cluster closer to the center.

This graph shows absolute chattiness for all benchmarked LLMs. The TOP 3 least chatty models were Anthropic’s Claude 3.5 Sonnet (2024-10-22) with just 10.33 characters per point, followed by GPT-4o (2024-11-20) at 11.51 and Claude 3 Opus at 12.03 characters per point scored.

An interesting finding here is the extreme chattiness of DeepSeek: R1 Distill Qwen 1.5B with 598.10 characters per point scored in the benchmark, which is over 60% higher than the second most chatty model Mistral: Mixtral 8x7B (Base) (v0.1) with 363.89.

Interestingly, fine-tuned models using Mixtral 8x7B as a base, such as Cognitive Computations: Dolphin 2.6 Mixtral 8x7B, NousResearch: Hermes 2 Mixtral 8x7B (DPO), or Mistral: Mixtral 8x7B (Instruct) (v0.1) were way less chatty (with an average of 41.29 characters per point). The score of the third most chatty model, Llama 3.2 1B (Instruct) at 126.77 characters per point scored, is yet 43.02% lower, signaling great differences even among these extremely chatty models.

Model reliability

LLMs produce their output through probabilistic sampling. To compensate for this nondeterministic behavior, DevQualityEval is evaluated over 5 runs by default. This also allows us to measure fluctuations in model performance, which translates to how reliably a model can solve tasks.

To conduct this analysis, we calculate the “mean deviation” over the 5 evaluation runs. To be precise, all the scores that a model gets for the runs r are r1, …, r5. The mean of these scores is then calculated AVG(r1, …, r5). And the mean deviation is the averaged absolute difference of the mean and the separate scores: (|AVG(r1, …, r5) - r1| + … + | AVG(r1, …, r5) - r5 |) / 5. A nice property of this metric is that it can be interpreted directly as the average fluctuation of the score around the mean, i.e. how consistent a model’s score is.

Bar chart that shows the "relative mean deviation" of 107 LLMs along the x-axis — Interactive HTML, SVG and higher resolution PNG

Across all models, we have seen values as low as 1.31% for Grok-2 (1212), indicating that the model score fluctuated by an average +/- 1.3% around the mean. That makes Grok-2 (1212) the most reliable model of this iteration. On the other hand, there are extremely high values such as 20.98% for OpenChat: OpenChat 3.5 7B, meaning that its actual score fluctuated on average +/- 20.98% around the mean.

While uncertainties resulting from varying model reliability cannot be entirely eliminated due to the nondeterministic nature of LLMs, additional runs for high-fluctuation models might be necessary in the future to pinpoint their capabilities with higher confidence. On the other hand, it appears unjustified to invest additional resources to evaluate such a model’s performance more accurately if the model is unreliable in practice anyways.

API reliability

For this deep dive, we investigated how reliable the model API endpoints are. We recorded how many API requests to a model fail, and how many retries are necessary to finally fulfill the request. For the retries, we currently have an upper limit of 10 attempts total, meaning that there are no points for cases that are not solved within 9x retries (i.e. ten requests).

We briefly mentioned the problem of API availability already in our previous v0.6 release. It should be noted that OpenRouter bundles multiple API providers under a single API with automatic failover. So actual API failures resemble serious issues where multiple providers are unable to handle a request. Proprietary models usually have only a few, or even just a single provider. For example, Anthropic models are offered via the official Anthropic API, Google Vertex and Amazon Bedrock, while OpenAI models usually only run via the official OpenAI endpoint.

Bar chart that shows successful API requests of 107 LLMs along the x-axis — Interactive HTML, SVG and higher resolution PNG

The graph shows the requests per model, split into requests that immediately succeeded and requests that only succeeded after at most 9x retries. Most models worked without the need for retries, or answered all requests within the 9x retries we allow. However, 16 models were unable to complete some benchmark cases entirely, because 10x total requests was not enough to get a valid response.

Bar chart that shows average retries needed per request for 107 LLMs along the x-axis — Interactive HTML, SVG and higher resolution PNG

The graph above shows how many additional retries were necessary on average to complete a request (or hit the 9x retry limit). The worst model in this regard was Qwen: Qwen-Turbo (2024-11-01), which required an average of 0.75 retries per request. So most of the time, at least one retry was performed. In second place is DeepSeek: DeepSeek R1 with 0.60, meaning it also required a retry most of the time.

Degraded API availability of Anthropic: Claude 3.7 Sonnet (2025-02-19) after release

From our experience, API reliability can vary over time and is often particularly bad when a model is just released. We experienced mayor problems with Anthropic: Claude 3.7 Sonnet (2025-02-19) shortly after it’s release. The following graphs are taken from our 2025-02-25 evaluation.

Bar chart that shows successful API requests for 107 LLMs along the x-axis — SVG and higher resolution PNG

The graph shows that Anthropic: Claude 3.7 Sonnet (2025-02-19) was the third-worst model when benchmarking with at most 3x retries. Since then, we have increased the number of permitted retries to 9x.

Anthropic: Claude 3.7 Sonnet (2025-02-19) needed 0.87 retries per request. Almost a guaranteed retry every time.

Benchmark reliability

DevQualityEval is evaluated over 5 runs to compensate for varying model performance. We analyzed the impact of LLM nondeterminism on the stability of our benchmark results in the deep dive for v0.6.

Noteworthy changes and learnings in DevQualityEval v1.0

LLM naming convention

With DevQualityEval v1.0 we are proposing a unified model naming convention to ensure consistency and the visibility of important information at one glance.

LLM naming is WILDLY inconsistent. Each model developer follows their own practices without an industry-standard naming convention. And, in some cases, even these individual practices change with almost every model. The DevQualityEval naming convention contains all the necessary information about models and is consistent across vendors. Please help spread the word by sharing the following image and link to this section. We would very much love to see model developers and other benchmarks adopt either this naming convention, or another one that is comparably clear. We are not picky about the convention, just picky about consistency.

Illustration that shows the elements that make up DevQualityEval's new LLM naming convention.

DevQualityEval’s model naming convention:

${company} ": " ${family name} [" " ${version}] [" " ${model name}] [" (" ${build version or better an ISO date} ")"] [" " ${parameter size with a numerical abbreviation M, B, or T}] [" distill (" ${distilled model name} ")" ] [" (" ("base"/"chat"/"instruct") ")"] [" (" ("experimental"/"extended") ")"] [" (" ${context size with SI unit suffix "k"} ")"] [" (" ("free"/"nitro"/"self-moderated") ")"] [" (" ${reasoning attributing e.g. "reasoning-effort=high"} ")"]

Let’s look at some examples:

Illustration that shows the elements in DevQualityEval's new LLM naming convention on the example of Qwen: Qwen v2.5 Coder 32B (instruct).

Qwen2.5 Coder 32B Instruct becomes Qwen: Qwen 2.5 Coder 32B (instruct)
- Company: “Qwen”
- Family name: “Qwen” (even though this word is used twice in the model name, the convention considers it mandatory)
- Version: “2.5” (note that a space is added, i.e. not “Qwen2.5”)
- Model name: “Coder”
- Parameter size: “32B”
- Fine-tuning attribute: “(instruct)”

Illustration that shows the elements in DevQualityEval's new LLM naming convention on the example of Anthropic: Claude v3.5 Sonnet (2024-06-20).

Claude 3.5 becomes Anthropic: Claude v3.5 Sonnet (2024-06-20)
- Company: “Anthropic” (mandatory)
- Family name: “Claude”
- Version: “3.5”
- Model name: “Sonnet”
- ISO date: “(2024-06-20)” (note it is not “2406” nor “202406” and it is added, because there are multiple versions)

Illustration that shows the elements in DevQualityEval's new LLM naming convention on the example of Microsoft: Phi v3 Mini (instruct) (128k).

Phi-3-Mini-128K-Instruct becomes Microsoft: Phi v3 Mini (instruct) (128k)
- Company: “Microsoft” (mandatory)
- Family name: “Phi”
- Version: “3”
- Model name: “Mini”
- Fine-tuning attribute: “(instruct)”
- Context size: “128k”

A key problem with how some models are named today is that version information and/or model attributes are often missing or are inconsistently added to model names. Some model developers have adopted naming conventions that convey minimal information regarding the model’s version or attributes. In some cases, information was simply removed from model names, while in other cases, a brand name was established that hides the details. The model underlying such brand model names may change over time, and versioning information isn’t always readily available, making it difficult to compare a model’s performance at different points in time.

One example is Mistral Small, a model evaluated in DevQualityEval v0.6. While versioning information is available in Mistral’s documentation, even for legacy models, it can have errors: for the model Mistral Small 24.02 (deprecated on 30 Nov 2024), model version is erroneously displayed as 24.09. That creates some confusion as both versions 24.02 and 24.09 were available to users at some point. Note that while some LLM users have used the name online, there has been no official announcement of a Mistral Small 2 model prior to the release of Mistral Small 3 on 2025-01-30. At the time of writing, Mistral Small 3, a 24B model, is at version of 25.01.

Another thought we have about the naming is that more parts could be mandatory, e.g. for all open-weight models the parameter size could be mandatory. Or even for all models the context size could be mandatory, as this is nowadays a key part of every provider. Let us know what you think! And again, please help spread the word by sharing the image of the convention and link to this section.

Static analytics improvement

In DevQualityEval v1.0, we showcase a RAG scenario via a preprocessing step that adds additional information to the input of the task before it is passed to the LLM. We targeted the “write-test” task type where LLMs are prompted to write unit tests for code. We added Symflower’s Smart Test Templates to the input which provide initialization for tests including imports, test setup, and object initialization.

We have already shown that static analysis can improve model scores by repairing common mistakes. In the previous version of this benchmark, we applied static analysis as a post-processing step on the LLM output, and showed that smaller models with static code repair could provide better results than larger models without static analysis.

Since generating the test template is done via static code analysis, the performance overhead of computing this input is negligible (milliseconds). Yet the results show that final scores are generally higher with templates than without them and in some cases, the difference is significant.

New task: migrating JUnit 4 tests to JUnit 5

In DevQualityEval v1.0, we introduced a new migration task to be able to identify the models best suited for automating code migration, an area where AI agents for software engineering have great potential. For now, we use a modified version of the “write-test” task: models are asked to convert JUnit 4 tests to JUnit 5. They only have access to the tests, but not the original source code. We’re using a Symflower-curated repository that is part of DevQualityEval Prö which contains scenarios for converting real-world projects from JUnit 4 to 5. JUnit 4 dependencies are not present for the source code of the new migration task. This means that if there is any JUnit 4 code left over after a model processed a case, it won’t compile, resulting in 0 coverage score for the model.

Spring Boot support

DevQualityEval v1.0 introduces Spring support to benchmark application framework development. The benchmark contains scenarios for Spring components, controllers, repositories, and services for different configurations. We use a plain repository as an example, and ensure that Spring packages are available during execution. Boilerplate files (e.g. Application.java) can now be excluded so the evaluation does not confuse them for actual tasks. The LLM prompt can be adjusted to reflect test framework changes (e.g. “JUnit 5 for Spring”), and execution output can be validated (e.g. to ensure that “Spring Boot v2.7.9” is in the execution output to make sure Spring is in fact initialized).

Documentation & GitHub

We’ve improved the documentation and how issues are managed on GitHub for DevQualityEval. To streamline issue creation, we now make use of forms to ensure we obtain enough information for e.g. bug reports. We’ve also extended our roadmap planning and release process guidelines, and went through an exhaustive issue purging session, re-evaluating every open issue and closing ones that were out of scope or outdated.

Development

We added the following linters to our GitHub actions:

At Symflower, we are spoiled by a very picky internal CI. We even have our own linter to enforce certain practices and maintain high code quality. Our enthusiasm for DevQualityEval had us focus more on features than on development tools, and so we already encountered some issues with code being merged on GitHub that we are used to being flagged in our internal CI (i.e. missing Go error checks). With the new linters added, we already found some cases of unreachable code leftover from earlier refactorings.

We have also added a VS Code extension configuration and custom VS Code debugging configuration to the repository to streamline development.

Scoring and reporting

With this release, we removed all the scoring logic from within the core benchmark and will from now on use a proprietary tool to conduct scoring.

In previous versions, DevQualityEval recorded all the metrics during an evaluation run internally and already performed some rudimentary transformations on the data. This included, for example, scaling the coverage information by a factor of 10x for the “write-test” case, or collecting the metrics from multiple runs and tasks into a total score per model. However, this tight coupling of evaluation and scoring made maintenance unnecessarily complex. In an attempt to decouple the logic, moving some parts of the reporting into a separate tool introduced parallelism that made maintenance even more difficult.

Metrics are now written directly to CSV files on a task case basis i.e. the lowest possible level. This means a simple and clean implementation on the benchmark side, and having all the information required for further analysis in a single CSV file. We currently obtain the metrics, analysis, and graphs for this deep dive report using proprietary tooling that analyzes the raw data from the evaluation in the CSV.

Finally, we should note that (as announced in the last deep dive), we’re switching from absolute value-based scoring to percentage-based scores in the evaluation. Since this requires us to know the total number of scores achievable on certain tasks, we added these values as meta-information to the respective repository.json.

Logging

Structural logging (i.e. splitting up the logs of specific models and task types into separate files) was introduced in the previous version of DevQualityEval. With this version, we completed the transition by de-duplicating logs and by enabling easier parsing and post-processing. Logs now use JSON and enable parametrization (i.e. having the currently processed model as a dedicated JSON field), and we’ve also removed duplicates (the overall log file no longer contains the data belonging to model-specific log files).

Fixes

The following sections highlight fixes for problems that are universal for other benchmarks and evaluations and should therefore be of interested to anyone who has an internal test suite for LLM usage.

Disqualification with static analysis

We loosened the qualification criterion so that a model is qualified even if it is only able to solve the qualification task with the help of RAG or code repair.

We mentioned in a previous deep dive that DevQualityEval employs a qualification system for programming languages by default. Models have to solve a trivial coding task for each language to qualify for the full catalog of tasks and cases for that language. The qualification system ensures that one does not waste time and money on weak models that can’t solve even this simple task.

Until now, a model had to solve the qualification task of a language autonomously to be qualified for all of that language’s tasks. Starting with DevQualityEval v1.0, a model is qualified for the benchmark if it can solve the qualification task with RAG or code repair.

The official evaluation of DevQualityEval v1.0 is deliberately run with the aforementioned disqualification system disabled to give all models a fair chance of solving all tasks, even if it costs us more money.

IO errors during `git clean`

We introduced a retry to ensure the execution environment is always cleaned up properly.

The evaluation internally uses git to manage the files of tasks. This ensures that all files are reset to their default state before an LLM gets to modify them. We noticed some rare cases where git clean errored in case a temporary file that git tried to remove was already removed by some other mechanism. With this release, a retry was introduced to make sure the cleanup process is executed in any case.

Remove timeout optimization regarding `symflower fix`

We removed a timeout optimization which may lead to longer benchmarking times due to multiple executions.

The last release showcased how static code repair can improve model performance. Internally, the LLM’s coding solution is executed once on its own, and once with symflower fix applied. Initially, we had an optimization in place to skip the code repair phase if the original solution resulted in a timeout. The reasoning behind this was that a timeout only happens if the original code is inefficient or contains an infinite loop. As symflower fix currently only repairs, but doesn’t optimize code, applying the code repair and executing the solution again is bound to trigger the same timeout again. We plan to extend symflower fix in the future to also perform optimization. In preparation, the timeout optimization was now removed.

Collecting query data even if responses are empty

We often encounter cases where models just reply with empty responses. Until now, we ranked this as an error. So if a model produced an empty response, the corresponding entry in the evaluation.csv would look something like this (abbreviated for simplicity):

model-id,language,repository,case,task,run,costs-total-actual,native-token-input,native-token-output,response-no-error
dummy-model,golang,golang/plain,plain.go,write-tests,0,0,0,0

A problematic observation is that there are no costs and tokens tracked. Since the model response is classified as an error, no metadata is extracted from the response. But in reality, the request still produced costs, which should be tracked. With the fix, an empty response now produces the correct entry in the evaluation.csv (again abbreviated):

model-id,language,repository,case,task,run,costs-total-actual,native-token-input,native-token-output,response-no-error
dummy-model,golang,golang/plain,plain.go,write-tests,1,111,222,0

We are tracking all costs meticulously throughout the benchmark because newer reasoning models commonly produce costs for internal reasoning that is hidden from the user. Tracking the costs on request basis allows us to include these hidden costs in our benchmark results, and that should include requests that cost money but did not produce a response.

Connecting requests and responses in logs

Logging of requests and responses now adds a unique query-id to them. This makes it easier to find a response for a certain request and vice versa. The query-id is also applied for retries in case of API errors.

DevQualityEval is evaluated with 10 API request attempts by default. This means that a model can perform 9 additional retries after the first request failed.

DevQualityEval Prö

With DevQualityEval v1.0, we’ve decided to require a small fee to access detailed results and from now on, we’ll be developing parts of the benchmark closed-source. Here’s why that decision was made.

At the current state the industry is in, and especially with the likely proliferation of AI agents in the near future, we feel there is a huge need for a new SOTA software development benchmark that covers many languages, scenarios, tasks and assessments. Existing software engineering benchmarks are lacking in that regard. We genuinely believe that this benchmark should be open and accessible to everyone.

Yet, developing such a benchmark and running evaluations requires significant investment in the form of paying our developers and continuously evaluating models. We believe that parties that benefit the most from such a benchmark should bear a fair proportion of those costs. That’s why detailed results are moving behind a paywall (meet DevQualityEval Prö 🍾), while deep dive blog posts like this one with general findings remain freely available for everyone. To get access to the leaderboard, please sponsor the project by buying us two beverages.

Here’s a detailed list of the steps this change entails:

The core of DevQualityEval stays on GitHub as a framework for everyone to run evaluations with.
DevQualityEval only outputs raw evaluation results in CSV format. The reporting tooling used to analyze and aggregate the raw results into the leaderboard and charts is not open-sourced.
The open-source repository only contains basic task data, while the bulk of task examples used to create the DevQualityEval leaderboard is held-out (this also helps prevent over-fitting i.e. model developers tweaking models to perform better on the DevQualityEval benchmark).
Access to the leaderboard with the latest results is granted to anyone sponsoring the project.

What comes next? DevQualityEval v1.1

In the last deep dive (v0.6) we identified 4 main problems that need to be solved for a true software development benchmark. With the v1.0 release of the DevQualityEval we made big strides and are moving towards a truly ever-evolving benchmark that stays ahead of the daily announcements of “the next” LLMs and software development agents.

Let’s check what has been done since v0.6. and what needs to change for v1.1.

Moving the ceiling further up with complex scenarios

In the previous evaluation, the top-scoring model reached a score of 98%, clearly indicating that the ceiling of the benchmarks was reached. With the new Java Spring Boot repositories and JUnit 4 to JUnit 5 migration tasks of v1.0, the top-scoring model reaches a score of 88%. Not only is this a major headroom to the new ceiling but the additional complexity brings new challenges that are difficult to solve without considerable reasoning. Since these changes mainly focused on Java, we are looking forward to adding equivalent repositories to the other languages.

We did not just increase the amount of distinct cases of the benchmark (from 105 to 152: +45%) but also added more assessments and challenges to make every percentage more difficult to collect. For the next version we will broaden our set of assessments and rules to stay true to the mission of DevQualityEval: benchmark towards real-world usage and quality.

However, the main focus for DevQualityEval v1.1 is to further raise the ceiling by implementing a combination of tasks (scenarios) that better represent real world work of software developers. From planning and implementing changes, over reviewing and maintaining, as well as keeping to conventions and policies. DevQualityEval now dives into the world of true software development agents.

Adding more languages

For DevQualityEval v1.1 we have something special planned. This benchmark is about helping everyone create and choose their perfect model. We gathered feedback over the last releases and made our decision. The following languages will be added with the next version:

C#
C++
JavaScript
PHP
Python
Rust
Swift

This involves adding the plain, light, mistakes and transpile repositories for these languages. Repositories for application frameworks and migrations are not planned for v1.1 but we would greatly appreciate your help! Especially, if your favorite technology is not represented, let us know (as your vote counts towards our roadmap).

More RAG and static code repair

Previously, we teased that there lies great potential in the combination of LLMs and static/dynamic code analysis. The results showcase how the right Retrieval Augmented Generation can boost model performance, and we will continue to improve symflower and especially symflower fix to include more languages and rules. These analyses not just showcase the importance to combine static/dynamic analyses with LLM usage, but should give LLM creators and users a way to further improve ever LLM response.

In addition, for DevQualityEval v1.1, we are planning to introduce new tools to showcase which models are truly capable of being used in software development agents.

Automatic reporting tooling

Recent releases involved a considerable amount of manual work to analyze results and create plots and leaderboards, even with parts being automated. With this release, we finally moved to a fully automatic, proprietary reporting tool. This allows us to react to new model releases more quickly and start to implement a truly dynamic leaderboard.

With DevQualityEval v1.1 we will further implement quality-of-life changes as well as user-based filtering tools that allows everyone to find their perfect model.

We hope you enjoyed reading this deep-dive and we would love to hear your thoughts and feedback on how you liked the details, how we can improve both our deep-dives and the DevQualityEval benchmark overall.

If you are interested in joining our development efforts for the DevQualityEval benchmark: GREAT, let’s do it! You can use our issue list and discussion forum to interact or write us directly at markus.zimmermann@symflower.com or on Twitter.

| 2025-02-16