This deep dive takes a look at the results of DevQualityEval v1.0, which analyzed 100 different LLMs for generating quality code (for Java, Go and Ruby). Anthropic’s Claude 3.5 Sonnet (2024-10-22) and OpenAI’s o1-mini (2024-09-12) hold a clear lead over Anthropic’s Claude 3 Opus and OpenAI’s o1-preview in functional score. But all four are vastly more expensive than our king of cost-effectiveness, DeepSeek V3 (previously: V2), closely followed by the MUCH faster Mistral’s Codestral (2501). DeepSeek R1 doesn’t live up to the hype at all, and Llama is not even in the TOP 20.
The results in this post are based on 5 full runs using DevQualityEval v1.0. Metrics and logs have been extracted, and detailed leaderboard data can be requested. The full evaluation setup and the reasoning behind the tasks are similar to the previous deep dive but considerably extended, as detailed throughout this post.
The following sections are a deep dive into the results, learnings and insights of all evaluation runs towards the DevQualityEval v1.0 release. Each section can be read on its own and comes with a multitude of learnings that we will integrate into the next release.
Table of contents:
- Terminology
- Comparing model capabilities by total scores
- Programming languages
- DevQualityEval Prö
- What comes next? DevQualityEval v1.1
The purpose of the evaluation benchmark and the examination of its results is to give LLM creators a tool to improve the results of software development tasks towards quality and to provide LLM users with a comparison to choose the right model for their needs. Your feedback is highly appreciated and guides the next steps of the eval.
With that said, let’s dive in!
Terminology
This benchmark, DevQualityEval, evaluates Large Language Models and Large Reasoning Models on software development use cases. Each “task type”, or just “task”, is a distinct problem category, for example: writing unit tests. A “task case”, or just “case”, is a concrete instantiation of such a problem, e.g. a Binary Search implementation written in Go that needs unit tests. The benchmark results span numerous metrics that are explained in their respective sections. Additional metrics and details are defined in the leaderboard.
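To make the terminology concrete, here is a minimal sketch of what a unit-test-writing task case could look like (shown in Java for brevity; the implementation and test below are illustrative and not taken from the benchmark’s actual task data):

```java
// Task case input (illustrative): a small implementation handed to the model as context.
public class BinarySearch {
    // Returns the index of "needle" in the sorted array "haystack", or -1 if it is absent.
    public static int find(int[] haystack, int needle) {
        int low = 0;
        int high = haystack.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1; // unsigned shift avoids overflow for large indices
            if (haystack[mid] == needle) {
                return mid;
            } else if (haystack[mid] < needle) {
                low = mid + 1;
            } else {
                high = mid - 1;
            }
        }
        return -1;
    }
}

// Expected kind of output (illustrative): a compiling unit test suite for that implementation.
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class BinarySearchTest {
    @Test
    void findsExistingElement() {
        assertEquals(2, BinarySearch.find(new int[]{1, 3, 5, 7}, 5));
    }

    @Test
    void returnsMinusOneForMissingElement() {
        assertEquals(-1, BinarySearch.find(new int[]{1, 3, 5, 7}, 4));
    }
}
```

Assessments then score properties of such a response, for example whether the generated tests compile and execute.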
We built DevQualityEval for the community and we believe benchmarks should stay open and accessible to anyone. But, benchmarking is not free.
Every time we run the benchmark on the latest models, it costs us serious money. And that’s just the smallest portion of the costs. Maintaining the benchmark, refining results, adding new tasks, cases, languages, assessments… All that is time well spent, but we still need to make a living and find the time to further extend DevQualityEval.
We truly appreciate any help, whether it comes in the form of direct financial support or LLM consulting projects. You can already help us a lot by just sharing this post and spreading the word!
We are eternally grateful for your help.
Comparing model capabilities by total scores
This graph displays the total score of all benchmarked models of DevQualityEval v1.0. The higher the score, the better the model performed. This release uses a percentage-based score instead of a numerical scoring system, as outlined in the v0.6 “percentage-based-score” section. The graph also includes the improved score through static code repair of common mistakes and RAG through static analysis.
The top three best-scoring models are Anthropic’s Claude 3.5 Sonnet (2024-10-22) (89.10%), closely followed by OpenAI’s o1-mini (88.88%), and, a few paces behind in third place, Anthropic’s Claude 3 Opus (85.34%). We see an average score of 52.58% (previously 49.30%: +3.28), which indicates that the average model is getting better at coding. However, we see 0% (previously 13%: -13.0) of all models scoring 90% or higher, and 13.68% (previously 21.74%: -8.06) of all models scoring 80% or higher. This does not indicate that models are getting worse, but that our efforts to raise the ceiling of DevQualityEval are working.
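As a rough, hedged illustration of what a percentage-based score means in practice: each task case can award a certain number of assessment points, and the total score is the share of reachable points a model actually collects. The sketch below is purely illustrative; the actual DevQualityEval assessments, weights and aggregation are more involved, and all numbers in it are made up.

```java
import java.util.List;

// Purely illustrative sketch of a percentage-based score: reached assessment
// points divided by reachable points. The real benchmark uses many more
// assessments and a different aggregation; the numbers here are invented.
public class PercentageScoreSketch {

    record CaseResult(double reachedPoints, double reachablePoints) {}

    static double totalScore(List<CaseResult> results) {
        double reached = results.stream().mapToDouble(CaseResult::reachedPoints).sum();
        double reachable = results.stream().mapToDouble(CaseResult::reachablePoints).sum();
        return 100.0 * reached / reachable;
    }

    public static void main(String[] args) {
        List<CaseResult> hypotheticalResults = List.of(
                new CaseResult(10, 10), // e.g. response compiles, tests pass, full coverage
                new CaseResult(7, 10),  // e.g. response needed repair before it compiled
                new CaseResult(2, 10)); // e.g. response was mostly unusable
        System.out.printf("Total score: %.2f%%%n", totalScore(hypotheticalResults));
        // Prints: Total score: 63.33%
    }
}
```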
Programming languages
One goal of DevQualityEval is to look at the language-specific performance of models to allow everyone to choose the best model for their projects. The evaluation currently supports three programming languages (Go, Java, Ruby), with more languages to be added in upcoming iterations.
Average scores across all models were higher for Go 66.16% (previously 46.64%: +19.52) than for Java 46.83% (previously 53.82%: -6.99) and Ruby 58.54% (previously 46.13%: +12.41). The drop in Java’s average is not surprising, since a focus of the DevQualityEval v1.0 release was to introduce challenging tasks and cases for Java. Read on for details on these languages and their averages.
Go
The TOP 3 models for Go are OpenAI’s o3-mini (2025-01-31, reasoning effort: high) (99.54%) and the same model with reasoning effort: low (99.20%), followed by Anthropic’s Claude 3.5 Sonnet (2024-10-22) (98.89%).
The best small (<10B) model for Go is Ministral 3B (80.40%) which surpasses the score of Ministral 8B (77.16%) at Go (but not at Java or Ruby).
The average Go score across all models is 66.16% (previously 46.64%: +19.52), which indicates that models are getting better at generating Go code.
Java
The TOP 3 models for Java are Anthropic’s Claude 3.5 Sonnet (2024-10-22) (87.57%), OpenAI’s o1-mini (2024-09-12) (85.77%) and Anthropic’s Claude 3 Opus (85.03%).
Finding a small, good-performing model for Java is more challenging, with Mistral’s Ministral 8B being the best-performing model (58.92%) among sub-10B parameter models. The next viable option is Qwen2.5 7B (Instruct) (52.20%).
While the previous iteration of this benchmark saw the highest average per-language scores for Java, this time, models were the least successful at Java compared to the other languages. The average Java score across all models is 46.83% (previously 53.82%: -6.99). The simple reason is that this version of DevQualityEval added particularly complex cases to the benchmark: writing integration tests for Java’s Spring Boot and migrating JUnit 4 tests to JUnit 5.
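To illustrate why the JUnit migration task adds real difficulty, here is a minimal, made-up example of the kind of transformation a model has to apply consistently across an entire test file (the Calculator class is hypothetical and only serves the example):

```java
// Before (JUnit 4): old packages, @Before, and expected exceptions via an annotation attribute.
import static org.junit.Assert.assertEquals;

import org.junit.Before;
import org.junit.Test;

public class CalculatorTest {
    private Calculator calculator;

    @Before
    public void setUp() {
        calculator = new Calculator();
    }

    @Test
    public void addsTwoNumbers() {
        assertEquals(4, calculator.add(2, 2));
    }

    @Test(expected = ArithmeticException.class)
    public void divisionByZeroFails() {
        calculator.divide(1, 0);
    }
}

// After (JUnit 5): new packages, @BeforeEach, and assertThrows instead of the annotation attribute.
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertThrows;

import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;

class CalculatorTest {
    private Calculator calculator;

    @BeforeEach
    void setUp() {
        calculator = new Calculator();
    }

    @Test
    void addsTwoNumbers() {
        assertEquals(4, calculator.add(2, 2));
    }

    @Test
    void divisionByZeroFails() {
        assertThrows(ArithmeticException.class, () -> calculator.divide(1, 0));
    }
}
```

Getting every import, annotation and assertion right at once, without breaking compilation, is exactly the kind of consistency these cases demand.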
Ruby
The TOP 3 models for Ruby are from OpenAI: o1-preview (2024-09-12) (95.55%), GPT-4o (2024-11-20) (95.47%) and o3-mini (2025-01-31, reasoning effort: high) (94.37%). The first non-OpenAI model is in 8th place: Qwen’s 2.5 Coder 32B (Instruct) (92.57%).
Smaller (<10B) models are getting better for Ruby: while the best-performing model in the previous iteration of this benchmark only reached 61.81% (Codestral Mamba), the top score among smaller models now is 80.25% (Gemini Flash 1.5 8B), with Mistral’s Ministral 8B coming in second among smaller models at 73.36%. A bit surprising is Codestral Mamba’s degraded performance at Ruby: while it was the winner among small models in the previous benchmark with 61.81%, now, it only scored 54.43%.
The average Ruby score across all models is 58.54% (previously 46.13%: +12.41), which indicates that models are getting better at generating Ruby code.
DevQualityEval Prö
With DevQualityEval v1.0, we’ve decided to require a small fee to access detailed results, and from now on we’ll be developing parts of the benchmark closed-source. Here’s why we made that decision.
At the current state the industry is in, and especially with the likely proliferation of AI agents in the near future, we feel there is a huge need for a new SOTA software development benchmark that covers many languages, scenarios, tasks and assessments. Existing software engineering benchmarks are lacking in that regard. We genuinely believe that this benchmark should be open and accessible to everyone.
Yet, developing such a benchmark and running evaluations requires significant investment in the form of paying our developers and continuously evaluating models. We believe that parties that benefit the most from such a benchmark should bear a fair proportion of those costs. That’s why detailed results are moving behind a paywall (meet DevQualityEval Prö), while deep dive blog posts like this one with general findings remain freely available for everyone. To get access to the leaderboard, please sponsor the project by buying us two beverages.
Here’s a detailed list of the steps this change entails:
- The core of DevQualityEval stays on GitHub as a framework for everyone to run evaluations with.
- The proprietary Symflower tool used throughout various parts of the binary now requires a paid license.
- DevQualityEval only outputs raw evaluation results in CSV format. The reporting tooling used to analyze and aggregate the raw results into the leaderboard and charts is not open-sourced.
- The open-source repository only contains basic task data, while the bulk of task examples used to create the DevQualityEval leaderboard is held out (this also helps prevent overfitting, i.e. model developers tweaking models to perform better on the DevQualityEval benchmark).
- Access to the leaderboard with the latest results is granted to anyone sponsoring the project.
What comes next? DevQualityEval v1.1
In the last deep dive (v0.6) we identified 4 main problems that need to be solved for a true software development benchmark. With the v1.0 release of the DevQualityEval we made big strides and are moving towards a truly ever-evolving benchmark that stays ahead of the daily announcements of “the next” LLMs and software development agents.
Let’s check what has been done since v0.6 and what needs to change for v1.1.
Moving the ceiling further up with complex scenarios
In the previous evaluation, the top-scoring model reached a score of 98%, clearly indicating that the ceiling of the benchmark had been reached. With the new Java Spring Boot repositories and JUnit 4 to JUnit 5 migration tasks of v1.0, the top-scoring model reaches a score of 88%. Not only does this leave major headroom to the new ceiling, but the additional complexity brings new challenges that are difficult to solve without considerable reasoning. Since these changes mainly focused on Java, we are looking forward to adding equivalent repositories for the other languages.
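For context, a minimal Spring Boot integration test of the kind such repositories call for might look roughly like the sketch below; the endpoint, controller and expectations are invented for illustration and are not part of the benchmark’s task data:

```java
import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.get;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.content;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.status;

import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.autoconfigure.web.servlet.AutoConfigureMockMvc;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.web.servlet.MockMvc;

// Illustrative sketch: a full-context integration test that exercises a
// hypothetical /api/users endpoint through the web layer.
@SpringBootTest
@AutoConfigureMockMvc
class UserControllerIntegrationTest {

    @Autowired
    private MockMvc mockMvc;

    @Test
    void listsUsers() throws Exception {
        mockMvc.perform(get("/api/users"))
                .andExpect(status().isOk())
                .andExpect(content().contentType("application/json"));
    }
}
```

Unlike a plain unit test, the model has to get the test annotations, the application context and the web-layer setup right at the same time, which is part of what makes these cases harder.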
We did not just increase the number of distinct cases in the benchmark (from 105 to 152: +45%) but also added more assessments and challenges to make every percentage point more difficult to collect. For the next version, we will broaden our set of assessments and rules to stay true to the mission of DevQualityEval: benchmarking towards real-world usage and quality.
However, the main focus for DevQualityEval v1.1 is to further raise the ceiling by implementing combinations of tasks (scenarios) that better represent the real-world work of software developers: from planning and implementing changes, to reviewing and maintaining code, to keeping to conventions and policies. DevQualityEval now dives into the world of true software development agents.
Adding more languages
For DevQualityEval v1.1 we have something special planned. This benchmark is about helping everyone create and choose their perfect model. We gathered feedback over the last releases and made our decision. The following languages will be added with the next version:
- C#
- C++
- JavaScript
- PHP
- Python
- Rust
- Swift
This involves adding the `plain`, `light`, `mistakes` and `transpile` repositories for these languages. Repositories for application frameworks and migrations are not planned for v1.1, but we would greatly appreciate your help! Especially if your favorite technology is not represented, let us know (as your vote counts towards our roadmap).
More RAG and static code repair
Previously, we teased that there lies great potential in the combination of LLMs and static/dynamic code analysis. The results clearly showcase how the right Retrieval-Augmented Generation can boost model performance, and we will continue to improve `symflower`, and especially `symflower fix`, to include more languages and rules. These analyses not only showcase the importance of combining static/dynamic analysis with LLM usage, but should also give LLM creators and users a way to further improve every LLM response.
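To make “static code repair of common mistakes” tangible, here is a hedged, invented example of the kind of fix such repair can apply deterministically; it is not actual output of `symflower fix`, and the Greeter class is hypothetical. A model response is often semantically fine but fails to compile because of a trivial omission, such as a missing import:

```java
// Generated test as returned by a (hypothetical) model: the assertion is fine,
// but the missing import of Assertions makes the file fail to compile.
import org.junit.jupiter.api.Test;

class GreeterTest {
    @Test
    void greets() {
        Assertions.assertEquals("Hello, Ada!", new Greeter().greet("Ada"));
    }
}

// After static repair: the only change is the added import; no extra model round trip is needed.
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;

class GreeterTest {
    @Test
    void greets() {
        Assertions.assertEquals("Hello, Ada!", new Greeter().greet("Ada"));
    }
}
```

Fixing such issues deterministically with static analysis is cheap, which is why the charts above report a separate, improved score with these repairs applied.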
Additionally, for DevQualityEval v1.1 we are planning to introduce tool usage to showcase which models are truly capable of being used in software development agents.
Automatic reporting tooling
Recent releases involved a considerable amount of manual work to analyze results and create plots and leaderboards, even with parts being automated. With this release, we finally moved to a fully automatic, proprietary reporting tool. This allows us to react to new model releases more quickly and start to implement a truly dynamic leaderboard.
With DevQualityEval v1.1, we will implement further quality-of-life changes as well as user-based filtering tools that allow everyone to find their perfect model.
We hope you enjoyed reading this deep dive, and we would love to hear your thoughts and feedback on the details, on how we can improve such deep dives, and on DevQualityEval itself.
If you are interested in joining our development efforts for the DevQualityEval benchmark: GREAT, let’s do it! You can use our issue list and discussion forum to interact or write us directly at markus.zimmermann@symflower.com or on Twitter.