OpenAI's o1-preview is the king πŸ‘‘ of code generation but is super slow and expensive (Deep dives from the DevQualityEval v0.6)

This deep dive takes a look at the results of the DevQualityEval v0.6 which analyzed over 80 different LLMs for generating quality code (for Java, Go and Ruby). OpenAI’s o1-preview and o1-mini are slightly ahead of Anthropic’s Claude 3.5 Sonnet in functional score, but are MUCH slower and chattier. DeepSeek’s v2 is still the king of cost-effectiveness, but GPT-4o-mini and Meta’s Llama 3.1 405B are catching up.

The results in this post are based on 5 full runs using DevQualityEval v0.6. Metrics and logs have been extracted, and detailed leaderboard data can be requested. The full evaluation setup and the reasoning behind the tasks are similar to the previous deep dive but considerably extended, as detailed in this post.

Access the results of the DevQualityEval LLM benchmark.

The following sections are a deep dive into the results, learnings and insights of all evaluation runs towards the DevQualityEval v0.6 release. Each section can be read on its own and comes with a multitude of learnings that we will integrate into the next release.

πŸ’° We need your help

We built DevQualityEval for the community and we believe benchmarks should stay open and accessible to anyone. But, benchmarking is not free. You can already help us a lot by just sharing this post and spreading the word!

Every time we run the benchmark on the latest models, it costs us serious πŸ’Έ. And that’s just the smallest portion of the costs. Maintaining the benchmark, refining results, adding new tasks, cases, languages, assessments… All that is time well spent, but we still need to make a living and find the time to further extend DevQualityEval.

We truly appreciate any help, whether it comes in the form of LLM consulting projects, direct financial support or simply letting others know by sharing this blog post and our social posts.

Can you chip in? Contact us. Know someone that might have a project for us? Please let us know! We are eternally grateful for your help πŸ‘½.

Table of contents:

The purpose of the evaluation benchmark and the examination of its results is to give LLM creators a tool to improve the results of software development tasks towards quality and to provide LLM users with a comparison to choose the right model for their needs. Your feedback is highly appreciated and guides the next steps of the eval.

With that said, let’s dive in!

Terminology

This benchmark, DevQualityEval, evaluates Large Language Models and Large Reasoning Models on software development use cases. Each “task type”, or just “task”, is a distinct problem category, for example: writing unit tests. A “task case”, or just “case”, is a concrete instantiation of such a problem, e.g. a Binary Search implementation written in Go that needs unit tests. The benchmark results span many different metrics, which are all explained in the respective sections as they come up.

Comparing the capabilities and costs of top models

Cost-effectiveness scatter plot that shows the best LLMs with their capability on the y-axis against total costs (logarithmic scale) on the x-axis for solving 525 benchmark cases.
Higher resolution PNG in case SVG is not working.

The graph shows the best models in relation to their overall scores (y-axis, linear scale) and costs (x-axis, logarithmic scale, $ spent for one benchmark). The sweet spot is the top-left corner: cheap with good results.

Reducing the full list of over 80 LLMs to a manageable size was done by sorting based on scores and then costs. We then removed all models that were worse than the best models of the same vendor/family/version/variant (e.g. gpt-4* is better than gpt-3* so the latter can be removed). For clarity, the remaining models were renamed to represent their variant. Even then, the list was immense. In the end, only the most important new models, fundamental models and top-scorers were kept for the above graph. For a complete picture, request access to the DevQualityEval leaderboard with all the results from the latest run.

Comparing model capabilities by total scores

Bar chart that shows the total score of 69 LLMs along the x-axis, including the potential score improvement through static code repair.
Higher resolution PNG in case SVG is not working.

This graph displays the total score of all benchmarked models of DevQualityEval v0.6. The higher the score, the better the model performed. For this release, we switched from a numerical scoring system to percentage-based scores, as outlined in the “percentage-based-score” section. The graph also includes the improved score through static code repair of common mistakes. Dedicated sections go both into the concept behind automatic post-processing of LLM output and how this affects the Go language score of the benchmark.

The top three best-scoring models are OpenAI’s o1-preview (98.6%) and o1-mini (96.9%), with Claude 3.5 Sonnet (95.5%) in third place. We see an average score of 49.3%, with 13% of all models scoring 90% or higher. The separate scores for programming languages and task types are presented in dedicated sections.

Scatter chart that shows the total score of LLMs along the y-axis against their average response time per case on the x-axis.
Higher resolution PNG in case SVG is not working.

This graph plots the total score of models (reaching 50% or better) against their performance (logarithmically). We measured the response time for each model and normalized over the solved cases, resulting in the average time it took the model to solve a single example.

There is a huge gap between the fastest model, Claude 3 Haiku (82.2%), at 2.8 seconds per response, and the slowest, o1-preview (98.6%), at 23.3 seconds per response.

Scatter chart that shows the total score of LLMs along the y-axis against their number of model parameters on the x-axis.
Higher resolution PNG in case SVG is not working.

This graph plots the total score of models (reaching 50% or better) against their number of parameters, wherever the number of parameters is known and confirmed.

While more model parameters seem to lead to a better score, it is interesting to see that Llama 3.1 405B is not too much better than the DeepSeek V2 models, despite having almost double the parameter size.

Findings of DevQualityEval v0.6

Large Language Models vs. Large Reasoning Models

The new o1-preview and o1-mini offerings from OpenAI are conceptually still LLMs. However, their internal reasoning implements the “Chain of Thought” (CoT) approach initially proposed by Google Deepmind in 2022 and later refined by Amazon Research. Every LLM can be improved by adding a CoT mechanism, and OpenAI has evidently done a great job in doing so. But unfortunately, that makes it hard to have an apples-to-apples comparison between LLMs with CoT and ones without. Hence, we will refer to these new OpenAI models as “Large Reasoning Models” (LRM) to clearly separate them from “stock” LLMs.

Costs for OpenAI’s Large Reasoning Models also include the tokens generated throughout the internal thinking process, which are hidden from the user. Therefore we compare LRM cost not according to token pricing, but by the real cost that the model accumulated throughout our benchmark.
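To make this concrete, here is a minimal sketch of how such a real cost can be computed, assuming the provider reports the hidden reasoning tokens alongside the usual usage numbers and bills them at the completion-token rate (the struct and field names below are illustrative, not an actual API):

package cost

// usage captures the token counts reported for a single request (illustrative fields).
type usage struct {
	PromptTokens     int
	CompletionTokens int
	ReasoningTokens  int // Hidden "thinking" tokens that never appear in the visible output.
}

// requestCost returns the dollar cost of one request, given prices per one million tokens.
// Reasoning tokens are assumed to be billed at the completion-token rate.
func requestCost(u usage, promptPricePerM, completionPricePerM float64) float64 {
	inputCost := float64(u.PromptTokens) * promptPricePerM / 1e6
	outputCost := float64(u.CompletionTokens+u.ReasoningTokens) * completionPricePerM / 1e6
	return inputCost + outputCost
}

Summing requestCost over all benchmark requests then yields the accumulated cost we report for LRMs.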

OpenAI o1-preview and o1-mini

The new o1-preview and o1-mini by OpenAI are the first so-called “Large Reasoning Models” (LRMs). Before producing an output, these models perform some internal thinking to plan and iterate towards a solution. While this approach improves reasoning capabilities (also reflected by the scores in DevQualityEval), it comes at a high cost. The intermediate reasoning is billed in addition to the input and output tokens that usually make up the price of LLM usage. Since the internal reasoning is currently hidden and cannot be controlled, this results in a somewhat opaque pricing model. Costs for LRMs can add up fast: on average, we spent approximately $1.79 per model when evaluating it for the DevQualityEval v0.6 benchmark. Until now, there was only one extreme outlier, Anthropic’s Claude 3 Opus at $12.90, because of its high price. Benchmarking OpenAI’s new models cost us a whopping $76.91 for o1-preview and $8.68 for o1-mini. Inference time is also slower than the average of 6.9s per response, especially for o1-preview at 23.3s per response. The previously slowest model was Xwin 70B with 19.9s per response, followed by DeepSeek V2 Coder at 17.9s per response.

Both o1-preview and o1-mini score exceptionally well at 98.6% and 96.9% respectively, beating Claude 3.5 Sonnet (95.5%), GPT-4o (94.3%) and DeepSeek V2 Chat (93.6%). As already mentioned, comparing pricing is hard due to the internal reasoning tokens of LRMs. In that regard, o1-mini beat Claude 3 Opus both in score (96.9% vs. 93.0%) and in price, costing us a total of $8.68 vs. $12.90. While o1-preview is the overall best model, it also stands out with its high Ruby score (98.2%), beating the second-best Ruby model, which surprisingly is GPT-4o (96.5%) and not o1-mini.

OpenAI GPT-4o-mini

The new GPT-4o-mini is cheaper and even better than GPT-4o (+1.1%). This score increase stems from the improved Go performance (+6.5%), while Java and Ruby stay the same. GPT-4o-mini essentially matches the score of Sonnet 3.5 (within the margin of error), but at a fraction of the cost ($0.75 vs. $18.00 per 1M token). It scores a bit better than DeepSeek V2 Chat (+1.7%) at almost double the price ($0.75 vs. $0.42 per 1M token) but is also 3x faster (4.2s vs. 15.6s per request). Compared to the new LRMs, it scored -3.3% worse than o1-preview, but cost us only $0.08 during the whole evaluation (solving 525 cases), while o1-preview cost $76.61. It was also over 4x faster (4.9s per response vs. 23.3s).

Mistral Large V2 (July 2024)

We were happy to see more open-source LLMs in the upper field. Mistral Large V2 scored lower (-8.4%) than the current LLM leader, Sonnet 3.5, while being just a bit cheaper ($12.00 vs. $18.00 per 1M token). It also scored worse (-6.4%) than the current open-weight leader, DeepSeek V2 Chat. It is twice as fast as DeepSeek’s model (7.7s vs. 15.6s per request) but at 28x the price ($12.00 vs. $0.42 per 1M token). Compared to the recently released LRMs, Mistral Large V2 scored worse (-11.5%) than o1-preview, but cost us only $1.09 throughout the benchmark (vs. $76.61) and was roughly 3x faster (7.7s per response vs. 23.3s).

As the old Mistral Large V1 was proprietary and is now deprecated, we cannot compare it to the new version in the benchmark across the new tasks. Looking back at our v0.5 results of the Go and Java “write-test” task, we see that Mistral Large V1 obtained an absolute score of 13352 while Mistral Large V2 now scored 18798 in these categories. This yields a relative 40.8% score improvement over the old version. Compared to other Mistral models, Large V2 performs better than Medium V1 (+13.5%) at a slightly higher price ($12.00 vs. $10.80 per 1M token) and also better than Mixtral 8x22B (+16.3%), but the 8x22B model is a lot cheaper ($12.00 vs. $1.30 per 1M token).

It should be noted, however, that Mistral Large V2 is far less chatty than Mixtral 8x22B: 1.8% vs. 14.7%, meaning that only 1.8% of Large V2’s response content was measured as unwanted excess explanations alongside the actual source code that the model is supposed to produce. Chattiness is introduced in greater depth in a dedicated section below.

Meta Llama 3.1 vs. Llama 3.0

Meta refreshed their Llama models, so we wanted to compare them against the previous versions. We also benchmarked the performance of the new Llama 3.1 405B release. First up: are the new Llama 3.1 models better than Llama 3.0?

  • Llama 3.1 70B: +3.5% over Llama 3.0 70B
  • Llama 3.1 8B: +34.4% over Llama 3.0 8B

The improvement for the 70B variants is due to the +7.41% improved Ruby score, while Go and Java stayed the same. For the 8B variants, scores increased across all languages, but the most for Go (+53.0%).

When comparing the Llama 3.1 models against each other, we realized that the new big 405B performs better (+7.7%) than the new smaller 3.1 70B (and +11.3% better than the previous 3.0 70B model). Interestingly, Llama 3.1 405B performs worse than the 3.1 70B for Ruby (-1.9%) but better overall due to the better Go and Java scores. And naturally, Llama 3.1 70B performs better (+24.7%) than its smaller sibling Llama 3.1 8B.

With Llama 3.1 405B we have a second new open-weight LLM in the upper field. It beats the new Mistral Large V2 by +3.6% at a quarter of the price ($3.58 vs. $12.00 per 1M token) but is slower (10.4s vs. 7.7s per request). It falls behind DeepSeek’s V2 Chat (-2.8%) at a higher price ($3.58 vs. $0.42 per 1M token) but also higher speed (10.4s vs. 15.6s per request). Compared to the top LLM, Sonnet 3.5, it scored -4.84% worse at 1/5 of the price ($3.58 vs. $18.00), while being twice as slow (10.4s vs. 4.9s per request). Compared to the new LRMs, it performed worse than o1-preview (-7.9%) but is cheaper (costing $0.35 vs. $76.61 throughout the evaluation) and 2x faster (10.4s per response vs. 23.3s).

Mistral Nemo and Codestral Mamba

On the lower end, the new Mistral Nemo model, a small general-purpose LLM from Mistral, beats Mistral 7B v0.3 (+12.8%) at a slightly higher price ($0.26 vs. $0.11 per 1M token). Mistral Codestral Mamba, which swaps the de-facto standard transformer architecture for the recently introduced Mamba state space architecture, is able to surpass Mistral Nemo (+12.9%) despite having fewer parameters (7B vs. 12B). The new Mamba architecture is optimized for speed, and indeed Codestral Mamba is one of the faster LLMs we have seen, taking an average of 2.9 seconds to solve a case. The only superior model in this regard is Claude 3 Haiku, with the same runtime but a higher (+23.0%) score, at 3x the cost ($0.50 vs. $1.50 per 1M token).

Comparing Codestral Mamba to the also recently released Llama 3.1 8B, they have roughly the same parameter size (7B vs. 8B) and similar scores (59.22% vs. 58.27%). Curiously, inference pricing for the Llama model is cheaper regardless ($0.50 vs. $0.11 per 1M token), possibly because inference is not yet optimized for the new Mamba architecture. It is worth noting that language performance does differ: Codestral exhibits better Go (+24.0%) and Ruby (+7.2%) performance than Llama, while Llama shines for Java (+14.4%).

While Codestral Mamba delivers a good score, it is also a very chatty model, with 18.7% of the response content being measured as unwanted excess explanations alongside the actual source code. Llama 3.1 8B is less chatty with only 10.2% of excess response content. And while Mistral Nemo is even more concise (4.5%), we have seen values as low as 2% throughout the evaluation, so there is room for improvement.

Google Gemini Java score degradation

One rather surprising finding is that Google Gemini’s Java score for test generation (“write-test” task) dropped significantly since our last v0.5 evaluation in July:

  • Gemini Flash 1.5
    • “write-test” Java score July: 13504
    • “write-test” Java score August: 4686 (relative -65.3%)
  • Gemini Pro 1.5
    • “write-test” Java score July: 10976
    • “write-test” Java score August: 4601 (relative -58.1%)

We first thought that we were dealing with network problems, but the Go language score increased slightly at the same time, with Gemini Pro 1.5 being one of the best models for Go (93.26% of total reachable score):

  • Gemini Flash 1.5
    • “write-test” Go score July: 4175
    • “write-test” Go score August: 4390 (relative +5.1%)
  • Gemini Pro 1.5
    • “write-test” Go score July: 4299
    • “write-test” Go score August: 4481 (relative +4.2%)

Upon closer inspection, this degradation in Java score happens because the Gemini models no longer add package declarations to generated Java test code, resulting in non-compiling code:

  • Gemini Flash 1.5
    • “write-test” Java light repository in August: only 23/115 generated tests have a package com.eval; statement. The others start straight away with the imports.
    • “write-test” Java light repository in July: 103/115 tests had package statements.
  • Gemini Pro 1.5
    • “write-test” Java light repository in August: only 29/115 generated tests have a package com.eval; statement. The others start straight away with the imports.
    • “write-test” Java light repository in July: 93/115 tests had package statements.

One thing that changed compared to DevQualityEval v0.5 is that the prompt now explicitly states:

  • v0.5: “the response must contain only the test code and nothing else”
  • v0.6: “the response must contain only the test code in a fenced code block and nothing else”

Given that the Go performance did not suffer at all, we are inclined to conclude that this change is not the reason for the drop in Java score.

With the addition of Ruby, it should be mentioned that Gemini Flash 1.5 has a vastly better Ruby score than Gemini Pro 1.5 (+47.8%) while performing worse for Go (-4.3%) and Java (-12.6%). This Ruby score difference stems again from imports. Pro 1.5 only managed to produce correct relative imports in 36/115 cases, while Flash 1.5 did it correctly in all 115/115 cases, making it one of the better-scoring models for Ruby (89.25% of total reachable score).

Cohere Command (August 2024)

Bar chart that shows the scores of the four Cohere LLMs along the y-axis, separately for the Go, Java and Ruby programming languages.
Higher resolution PNG in case SVG is not working.

The Cohere models received a refresh in August, so we wanted to take a closer look at them.

The smaller Command R performs +15.8% better than the previous revision. This improvement is thanks to the better Go (+52.0%) and Java (+16.3%) scores, while Ruby (-4.0%) slightly degraded. While the price dropped ($0.75 vs. $2.00 per 1M token), we cannot comment on the reported performance improvements. Our evaluation tasks require short responses, so measuring latency and throughput reliably is not possible with the current setup.

Command R Plus does not exhibit the same score improvement, with the new August revision having a similar score (within the margin of error) to the previous version. Looking at the different tasks in detail, the newer “08-2024” model performed -4.3% worse for writing unit tests, but +12.1% better for code transpilation. The price dropped here as well ($12.50 vs. $18.00 per 1M token).

Programming languages

Bar chart that shows the score of 69 LLMs along the x-axis, per programming language.
Higher resolution PNG in case SVG is not working.

As our evaluation now supports three programming languages, Go, Java and Ruby, we want to look at the language-specific performance of models.

Go

Bar chart that shows the Go score of 69 LLMs along the x-axis.
Higher resolution PNG in case SVG is not working.

The best LRM for Go is o1-mini (99.0%) and the best LLMs are Llama 3.1 405B (94.7%), alongside Gemini Pro 1.5 (93.3%), Claude 3.5 Sonnet (93.2%) and Claude 3 Opus (91.3%). Interestingly, the original OpenAI LLMs are not immediately present in the upper field, with GPT-4 Turbo and GPT-4o-mini at 88.6% and 88.23% respectively.

The cheapest ($0.42 per 1M token), but also slowest (15.6s per request) well-performing LLM for Go is DeepSeek V2 Chat, while Llama 3.1 405B is still relatively expensive ($3.58 per 1M token). Another cheap option is GPT-4o-mini ($0.75 per 1M token), though its Go score is not as high.

The best small model for Go is Codestral Mamba (77.2%) with just 7B parameters.

The average Go score across all models is 46.6%.

Java

Bar chart that shows the Java score of 69 LLMs along the x-axis.
Higher resolution PNG in case SVG is not working.

The best Java LRM is o1-preview (99.1%) and the best LLMs are GPT-4 Turbo (99.3%), Claude 3.5 Sonnet (98.1%), Claude 3 Opus (98.1%) and DeepSeek V2 Chat (98.0%). For Java, Llama 3.1 405B is also a very good choice at 93.4%, as is Mistral Large V2 with 95.6%. The Google models are not present in the upper field because of the package statement problems discussed previously.

The cheapest ($0.42 per 1M token), but also slowest (15.6s per request) well-performing model for Java is again DeepSeek V2 Chat, with GPT-4o-mini being the next cheapest option ($0.75 per 1M token).

Finding a small, good-performing model for Java is more challenging, with Codestral Mamba being the only model below 10B parameters (7B) scraping the 50% mark (49.0%). The next viable option is Mixtral 8x7B at 82.4%, where 13B parameters are usually active during inference.

The average Java score across all models is 53.8%.

Ruby

Bar chart that shows the Ruby score of 69 LLMs along the x-axis.
Higher resolution PNG in case SVG is not working.

The best LRM for Ruby is o1-preview (98.18%), with the best LLMs being GPT-4o (96.5%), GPT-4o-mini (96.3%), Claude 3.5 Sonnet (93.7%) and surprisingly: DeepSeek V2 Coder (93.2%), which usually scores lower than the “Chat” variant. Interestingly, Claude 3 Opus is not as capable for Ruby (87.7%) and especially the other open-source models: Llama 3.1 405B (85.3%) and Mistral Large V2 (78.1%) fall behind as well.

The cheapest ($0.42 per 1M token), but extremely slow (17.9s per request) well-performing model for Ruby is DeepSeek V2 Coder, with GPT-4o-mini being the next cheapest option ($0.75 per 1M token).

Smaller models are not yet very good for Ruby, with Codestral Mamba in the lead at 61.8%.

The average Ruby score across all models is 46.1%, which is the lowest of all three benchmarked languages (Go, Java, Ruby), but very close to the average performance for Go (46.6%). This is interesting because Ruby is not as widely used as Go or Java, ranking 16th among the most used languages on GitHub.

Tasks

Bar chart that shows the score of 69 LLMs along the x-axis, per task type.
Higher resolution PNG in case SVG is not working.

With now three different software engineering scenarios, there are some differences to how models score for writing unit tests, transpiling code and fixing syntactical errors.

Unit tests

Bar chart that shows the score of 69 LLMs for generating unit tests along the x-axis.
Higher resolution PNG in case SVG is not working.

As writing unit tests is a non-trivial task that involves understanding the implementation logic, the top LLMs for generating unit tests are unsurprisingly also the ones that score high in general, such as Claude 3.5 Sonnet. Similarly, the top LRM is o1-preview.

It is worth noting that only 14.5% of all models managed to score over 90% in this category, and none reached 100%. OpenAI’s o1-preview LRM got very close though (98.5%), which will require us to add more challenging tasks in upcoming versions. The score is derived from the code coverage that the generated unit tests produce. This indicates that LLM-generated unit tests can still miss logic cases in the implementation code, resulting in incomplete coverage.

The average score in this category across all models is 47.1%.

Transpilation

Bar chart that shows the score of 69 LLMs for transpiling code along the x-axis.
Higher resolution PNG in case SVG is not working.

Code transpilation is simpler than writing unit tests but requires correctly translating logic into different syntax. Only one LLM, GPT-4o-mini, managed to reach 100% on this task, with both new OpenAI LRMs, o1-preview and o1-mini, surprisingly scoring below 100% (99.0% and 96.9% respectively). The score is derived from a unit test corpus, which ensures the logic is correctly transferred to the new language. Interestingly, many “good” LLMs that performed well for writing (unit test) code did not solve the transpilation as well. E.g. GPT-4 Turbo scored very high on the “write-test” category with 94.6% but “only” solved 83.2% in the “transpile” category. On the other hand, many models that are “weak” on the “write-test” category managed to score well for transpilation. E.g. Google’s PaLM 2 only scored 12.8% for unit test generation, but 77.2% for transpilation.

While it makes sense that “weak” LLMs perform better for pure syntax translation, it is surprising to see models that perform well for test generation (i.e. semantic and logic comprehension) lacking when it comes to a simpler transpilation scenario.

The average score in this category across all models is 59.0%.

Code repair

Bar chart that shows the score of 69 LLMs for repairing code along the x-axis.
Higher resolution PNG in case SVG is not working.

The code repair task consists of syntax errors with corresponding error messages. Over 29.0% of the evaluated models scored > 99% on this task, with an average score of 73.5%. This indicates a need to add more challenging examples in the future, including semantic logic errors in addition to the already present syntax problems. However, it also shows the big potential of using small models for such syntax-related tasks: e.g. Codestral Mamba (7B parameters) scored 93.1% for code repair.

Score improvement through static code repair

Bar chart that shows the Go score of LLMs along the x-axis, including the potential score improvement through static code repair.
Higher resolution PNG in case SVG is not working.

As we’ve already discussed, LLMs often score lower than they should because of common mistakes that are easy to fix. Recent research mathematically proves that hallucinations and mishaps are an inherent design “feature” of LLMs and that it is impossible to reduce them to 0.0%. Therefore, this release of DevQualityEval includes a prototype for static Go code repair. The goal is to automatically fix common problems and increase the model score accordingly. We explain how exactly the code repair works later on.

The benchmark results show that small fixes to the code can improve the score of many LLMs. The results compare both the bare model score and the fixed score with code repair (in case a compilation error was detected with the original output).

For the Go language, applying symflower fix increased performance by +25.3% across all models compared to the base model score. Models with code repair scored a 9.8% higher benchmark score on average. This corresponds to an additional +29.3% of model responses compiling thanks to code repair.

Some extreme cases presented themselves in the upper field, where performance is already very high. For instance, using symflower fix boosted the performance of GPT-4o-mini by +10.5%. In the overall Go scoring, that boost would have enabled GPT-4o-mini to surpass the best Go model, Llama 3.1 405B (97.2% vs. 94.5% score). For the new OpenAI LRMs in particular, code repair did not improve the score, as the models are able to catch and eliminate simple errors during their thought process.

Static repair can especially help improve weaker LLMs to surpass stronger ones. With static repairs, Gemma 2 27B (+20.1%) would also surpass Llama 3.1 405B (95.4% vs 94.5% score).

We found the most extreme cases in the lower and mid fields, where static analysis boosted the performance of Mistral Tiny by +127.0%, leading to a 35.2% absolute score improvement (27.7% -> 62.9%). That makes mistral-tiny (with static repairs) better than mistral-small (62.9% vs. 54.2%), bringing it to the same level as mistral-medium without static repairs (62.9% vs. 64.04%).

Pairing an automatic repair logic with LLM code generation tools has great potential to positively impact the usefulness of LLMs in software development. The performance of smaller and cheaper LLM models is improved to match (or surpass) that of larger and more expensive ones.

Interestingly, looking at the overall score for all languages, static code repair influences the ranking of LLMs even though it is currently only applied for Go. While Sonnet 3.5 stands out as the highest scoring LLM without code repair (at 95.5% total score), GPT-4o-mini is the best scoring LLM with code repair enabled (at 97.1%), better than Sonnet 3.5 with code repair (96.0%).

Bar chart that shows the compilable code responses of LLMs along the x-axis, including the potential score improvement through static Go code repair.
Higher resolution PNG in case SVG is not working.

This graph shows the total number of compilable code responses for all languages, including the code that compiled additionally thanks to static Go code repair. Through automatic fixup, the number of compiling responses increased by +14.7 on average per model.

Percentage-based score

Our scoring until this version consisted of absolute numerical values because they are easier to handle during score computation. One big drawback of this approach is that we could not easily compute a percentage-based score. So while it was possible to compare models against each other, it was unclear how good the models performed in relation to the total score that is theoretically obtainable in DevQualityEval.

It would be possible to solve all benchmark task cases by hand and add up the score to arrive at the perfect solution. However, this approach does not scale well, especially given the fact that we want to add more languages and tasks regularly. So for now, we implemented a rudimentary approximation by scraping the logs for the highest scores for each task case and adding them up to arrive at the total score. This approximation only works under the assumption that each task is solved by at least one model perfectly. We verified this assumption by re-calculating portions of the score by hand as a reference.
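A minimal sketch of this approximation, assuming per-model and per-case scores have already been extracted from the logs (the data layout is hypothetical, not the evaluation’s actual code):

package scoring

// percentageScores converts absolute scores into percentages of the approximated
// maximum. scores maps model name -> case ID -> absolute score.
func percentageScores(scores map[string]map[string]float64) map[string]float64 {
	// Approximate the per-case maximum with the best score any model reached.
	ceiling := map[string]float64{}
	for _, cases := range scores {
		for caseID, score := range cases {
			if score > ceiling[caseID] {
				ceiling[caseID] = score
			}
		}
	}

	// The total reachable score is the sum of the per-case maxima.
	var total float64
	for _, best := range ceiling {
		total += best
	}

	// A model's percentage score is its summed score relative to that total.
	result := map[string]float64{}
	for model, cases := range scores {
		var sum float64
		for _, score := range cases {
			sum += score
		}
		result[model] = 100 * sum / total
	}
	return result
}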

Solving the ceiling problem

Thanks to the introduction of a percentage-based DevQualityEval score, we can now discuss the “ceiling” problem of benchmarking. It consists of the following two problems. As benchmarking candidates approach 100% on a benchmark (the “ceiling”)…

  1. … the benchmark is getting too easy and further progress cannot be measured anymore.
  2. … it becomes impossible to compare candidates against each other as they all perform “similarly well”.

As the current best model for DevQualityEval (o1-preview) scores 98.6%, it is evident that v0.6 of the benchmark already scrapes the ceiling, leading to the mentioned problems. We see a similar trend with other established coding benchmarks such as HumanEval, where o1-preview also reached 92.4%.

We want to note that the first version of DevQualityEval just tasked models to write tests for an empty Go function, a scenario where solving a single example essentially meant reaching 100% right away. By adding multiple languages, task cases and task scenarios, we shifted the ceiling upwards already. Our goal is to a.) add more languages, cases and scenarios with every new release to keep DevQualityEval challenging as LLM capabilities increase and b.) to make all scenarios, tasks and cases dynamic to make the benchmark impossible to fine-tune for. After all, models evolve and get better, so why shouldn’t benchmarks become increasingly harder as well?

Chattiness

Bar chart that shows the "excess content chattiness" of 69 LLMs along the x-axis.
Higher resolution PNG in case SVG is not working.

With the recent addition of LLM response character counts and the fixed code fence instructions in our prompt, we are able to compute metrics about “chattiness”. Our prompt deliberately states that nothing but the source code should be returned. Any extra output is just added cost. Therefore, one new metric is “excess content chattiness”: the portion of the model response that contains excess information (e.g. additional explanations). It should be noted that reaching 0.0% on this metric is impossible, as the code fences used to mark source code are currently counted towards excess content because they are technically not part of the source code itself. So, there will always be a small portion of excess content detected by the evaluation pipeline.
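A simplified sketch of how such a ratio can be computed, under the assumption stated above that everything outside fenced code blocks (including the fence markers themselves) counts as excess content; this is an illustration, not the evaluation’s actual parser:

package chattiness

import "strings"

// excessContentRatio returns the fraction of response characters that lie
// outside fenced code blocks. Fence markers and, for simplicity, newlines are
// counted as excess content.
func excessContentRatio(response string) float64 {
	if len(response) == 0 {
		return 0
	}

	codeCharacters := 0
	insideFence := false
	for _, line := range strings.Split(response, "\n") {
		if strings.HasPrefix(strings.TrimSpace(line), "```") {
			insideFence = !insideFence // The fence line itself counts as excess.
			continue
		}
		if insideFence {
			codeCharacters += len(line)
		}
	}

	excess := len(response) - codeCharacters
	return float64(excess) / float64(len(response))
}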

We have seen excess content chattiness ratios as low as 1.3% (Llama 3 70B), meaning only 1.3% of the response was classified as unwanted excess content, but also extreme values such as 39.9% (NousResearch: Hermes 2 Mixtral 8x7B DPO), where roughly 40% of the output was unnecessary content. On average, the 67 evaluated models exhibited a chattiness of 9.6%.

Models that perform extremely well on the DevQualityEval benchmark usually have a lower excess content chattiness ratio. The average chattiness of the models scoring above 90% on the benchmark is 2.1%, while for the models scoring below 90% the average chattiness is 10.7%. The correlation coefficient between model score and this chattiness ratio is -0.36, indicating that the two are inversely related (i.e. the higher the score, the lower the excess content chattiness). There are some exceptions to this observation. Claude 3 Opus, a very high scorer (93.0%), exhibited a relatively high chattiness of 5.3%. So a portion of its high costs ($90.00 per 1M token) could be reduced by improving instruction-following. On the other hand, Google: PaLM 2 Code Chat scored quite low (22.9%), but with only 1.7% chattiness.

Bigger models seem to exhibit less excess content chattiness as well. The correlation coefficient between model parameter size and chattiness ratio is -0.27, indicating a slight inverse relationship (i.e. the bigger the model, the lower the excess content chattiness). However, as smaller models are usually more affordable, this drawback is not too critical.

Scatter chart that shows the score of LLMs scoring 60% and above along the y-axis, with the characters required per scored point along the x-axis.
Higher resolution PNG in case SVG is not working.

The second new metric we introduce is “overall chattiness”: how many response characters a model needs to reach its score. This is interesting as there can be models with similar scores, but the responses of one model might be concise, while the other one’s contain lots of redundant code.

Looking at the absolute chattiness (i.e. how many characters a model produces for one point scored in DevQualityEval), there is a cluster of very efficient and high-scoring models at the top left. There are some subtle differences, e.g. GPT-4o is the model with the briefest responses, while DeepSeek V2 Coder is less efficient. Models with a lower score have less concise responses, represented by the cluster in the lower center. One interesting observation is that the outputs of Llama 3.1 70B contain almost twice as many characters as Llama 3 70B’s, despite only a marginally better score.

Scatter chart that shows the score of all benchmarked LLMs along the y-axis, with the characters required per scored point along the x-axis.
Higher resolution PNG in case SVG is not working.

This graph shows again the absolute chattiness but for all benchmarked LLMs. An interesting find here is that the newest Llama 3.1 8B model is 4.5x more chatty than its big sibling, Llama 3.1 405B.

Model and result reliability

Large Language Models produce their output through probabilistic sampling. To compensate for this nondeterministic behavior, DevQualityEval is evaluated over 5 runs by default. This also allows us to measure fluctuations in model performance, which translates to how reliably a model can solve tasks.

To conduct this analysis, we calculate the “mean deviation” over the 5 evaluation runs. To be precise, the scores that a model receives across the runs are r1, …, r5. Their mean is AVG(r1, …, r5), and the mean deviation is the averaged absolute difference between the mean and the individual scores: (|AVG(r1, …, r5) - r1| + … + |AVG(r1, …, r5) - r5|) / 5. A nice property of this metric is that it can be interpreted directly as the average fluctuation of the score around the mean, i.e. how consistent a model’s score is.
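Expressed as a small Go helper (a direct translation of the formula above, not the evaluation’s actual code):

package reliability

import "math"

// meanDeviation returns the average absolute difference between the run scores
// and their mean, i.e. how much a model's score fluctuates around the mean.
func meanDeviation(runs []float64) float64 {
	if len(runs) == 0 {
		return 0
	}

	mean := 0.0
	for _, r := range runs {
		mean += r
	}
	mean /= float64(len(runs))

	deviation := 0.0
	for _, r := range runs {
		deviation += math.Abs(mean - r)
	}
	return deviation / float64(len(runs))
}

For example, for run scores of 90, 92, 94, 91 and 93, the mean is 92 and the mean deviation is 1.2, i.e. the score fluctuates on average by +/- 1.2 percentage points around its mean.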

Model reliability

Bar chart that shows the "relative mean deviation" of 69 LLMs along the x-axis.
Higher resolution PNG in case SVG is not working.

Across all models, we have seen values as low as 1.3% for Gemini Flash 1.5, indicating that the model score fluctuated by an average of +/- 1.3% around the mean. This makes Gemini Flash 1.5 the most reliable model of this iteration. On the other hand, there are extremely high values such as 50.6% for Yi 34B Chat, meaning that its actual score fluctuated on average by +/- 50.6% around the mean. This is not surprising since Yi 34B Chat only achieved a total score of 0.6%, so its responses were just hit-or-miss. This not only means that Yi 34B Chat is a highly unreliable model, but also that its score and therefore its ranking against other models might not be accurate enough.

While these uncertainties cannot be entirely eliminated due to the nondeterministic nature of LLMs, additional runs for high-fluctuation models might be necessary in the future to pinpoint their capabilities with higher confidence. On the other hand, it appears unjustified to invest additional resources to evaluate such a model’s performance more accurately if the model is unreliable in practice anyways.

Bar chart that shows the score of 69 LLMs along the x-axis, including error bars for the "standard deviation".
Higher resolution PNG in case SVG is not working.

A general observation is that higher scoring models were usually more reliable. The correlation between score and mean deviation is -0.66, indicating an inverse relationship (the higher the score, the lower the mean deviation, i.e. fluctuation). We can see that only a handful of model families manage to achieve a mean deviation of 4% or lower: OpenAI GPT, Anthropic Claude (excluding Haiku), DeepSeek and Google Gemini. Other models with similar scores, such as Llama 3.1 405B and Mistral Large V2, struggle with a mean deviation of 5% and above. Finally, it is interesting to point out that Gemini Flash 1.5, even though it is the most reliable model, is unfortunately not in the top field because of the Java problems already discussed.

Benchmark reliability

For the previous version v0.5 we internally analyzed how the number of benchmark runs influences the stability of our results. We compared the mean deviation of three models (Claude 3 Sonnet, Claude 3.5 Sonnet, DeepSeek V2 Coder) for 5x singular runs against three separate benchmark runs that consisted of 5x runs each (i.e. 15 runs in total). In this setup, we deliberately chose reliable models to mitigate the nondeterministic nature of LLMs.

Statistically, increasing the number of samples of a measurement is guaranteed to decrease the variance of the sample mean. So intuitively, conducting more runs in a benchmark will always make the results more accurate. We still wanted to investigate how much the mean deviation (i.e. the fluctuation) of the score changes when comparing 5x individual runs to 3x multi-run evaluations. While the individual runs showed an average mean deviation of 2.33%, the evaluations with multiple runs had an average mean deviation of 0.93% for the mentioned models. With this, we are convinced that DevQualityEval (with 5 runs per default) can pinpoint the score of reliable models up to an error margin of +/- 1%.

Model selection

We benchmark all models via OpenRouter.ai, which unifies all popular LLMs, LRMs and providers via one API, making it easy to switch between them as new powerful models are released frequently.

Some model types are excluded by default for DevQualityEval:

  • Non-text-generation models (such as Meta’s Llama Guard classification model)
  • Roleplay and creative-writing models
  • Extended context window models
  • Base models (if a chat/instruct fine-tune is available)
  • Models with internet access (as our tasks don’t require research)
  • High-throughput models (they usually run on better hardware and therefore just cost more)
  • Models optimized for tool usage / function calling
  • Multi-modal models (such as Google’s Gemini Pro Vision)
  • Beta or preview releases (we only evaluate stable models that are production-ready)
  • Outdated models that have been superseded (such as OpenAI’s GPT-3.5)

Furthermore, we regret to report that the following models are no longer available on OpenRouter and therefore also not benchmarked anymore:

  • 01-ai/yi-6b
  • 01-ai/yi-large
  • allenai/olmo-7b-instruct
  • bigcode/starcoder2-15b-instruct
  • fireworks/firellava-13b
  • google/gemma-7b-it
  • huggingfaceh4/zephyr-7b-beta
  • intel/neural-chat-7b
  • liuhaotian/llava-13b
  • liuhaotian/llava-yi-34b
  • meta-llama/codellama-34b-instruct
  • meta-llama/codellama-70b-instruct
  • nousresearch/nous-capybara-34b
  • nousresearch/nous-capybara-7b
  • nousresearch/nous-hermes-2-mistral-7b-dpo
  • nousresearch/nous-hermes-2-mixtral-8x7b-sft
  • open-orca/mistral-7b-openorca
  • phind/phind-codellama-34b
  • qwen/qwen-14b-chat
  • qwen/qwen-32b-chat
  • qwen/qwen-4b-chat
  • qwen/qwen-7b-chat
  • recursal/eagle-7b
  • recursal/rwkv-5-3b-ai-town
  • rwkv/rwkv-5-world-3b
  • snowflake/snowflake-arctic-instruct
  • teknium/openhermes-2-mistral-7b
  • togethercomputer/stripedhyena-hessian-7b

Some exciting new models that are now part of the benchmark:

  • meta-llama/llama-3.1-405b-instruct
  • meta-llama/llama-3.1-70b-instruct
  • meta-llama/llama-3.1-8b-instruct
  • mistralai/codestral-mamba
  • mistralai/mistral-large (version 2)
  • mistralai/mistral-nemo
  • openai/gpt-4o-mini
  • openai/o1-mini
  • openai/o1-preview
  • cohere/command-r-plus-08-2024
  • cohere/command-r-08-2024

Uptime

We noticed that not all OpenRouter models were available all the time. Some models (like Qwen 110B during our testing) were down for hours, others only for a few minutes. Do they deserve a low score because they are not reliably available? That’s something we need to decide for the next version of the evaluation. For now, we just gave them a second chance and reran the benchmark on these models.

Deprecated models

Another problem was that some models just disappear, making it very difficult to compare older versions with newer ones. For instance, Mistral’s newly released “Large” model is now in V2. Mistral decided to just bury V1, and since that one was a proprietary model, it’s no longer possible to run the evaluation using V1 of the same model.

As for running models that are no longer in our list again (e.g. over Ollama): please help us!

Noteworthy changes in DevQualityEval 0.6

Static code repair for common mistakes

In our last report, we outlined how common compile errors prevented many models from receiving good scores. We have selected four representatives from this list of common mistakes and implemented a static code repair tool for Go to automatically fix them. This showcases the advantages of pairing LLMs with code analysis to improve the overall results. The impact of this on the benchmarking score is presented in the results section of this report. In a future version, we want to also introduce similar fixes for Java.

Code repair mistakes

The selected code repair candidates are:

  1. Missing imports: referencing a declaration of a package without importing it.
  2. Unused imports: import statements that are unused (compilation error in Go).
  3. Undefined variable: variables in Go are defined using the := operator, but many models use a plain assignment = instead.
  4. Black-box testing with incorrect package declaration: it is common Go practice to place implementation and testing code into the same package. For “black-box” testing, it is possible to put the testing code into a separate package called <implementation-package-name>_test. Then, the test code can only access public/exported declarations of the implementation. We observed many cases where LLMs decide to do just that, but still try to access implementation package internals, leading to compilation errors.

To illustrate candidate 4 with a real-world example, here is phind-codellama-34b attempting to test an unexported function in a black-box scenario:

package plain_test

import (
	"testing"
	"plain"
)

func TestPlain(t *testing.T) {
	plain.plain()
}

The function-under-test plain is private within the plain package so it can only be used if the tests are in the same package: package plain.

There are some special cases that we didn’t anticipate and that are hence not handled in this version. These are outlined below.

Namespace imports

Go allows importing complete packages into the current namespace (similar to Java’s “on-demand” / wildcard imports).

// "Normal" package import:
import "fmt" // Reference a declaration within "fmt" using: "fmt.Println".

// Namespace package import:
import . "fmt" // Reference a declaration within "fmt" using just: "Println".

While importing directly into the current namespace is usually discouraged to avoid naming conflicts, we have seen multiple examples of this in LLM-generated code (e.g. from qwen-14b-chat and command-r-plus). Detecting whether these forms of imports are unused is more challenging than for “normal” imports. This is because an import of the form import "pkg" is unused if no expression pkg.SomeDeclaration is ever encountered within the current file. However, with import . "pkg", the selector of the expression vanishes, leaving just SomeDeclaration. But now, SomeDeclaration could come from anywhere within the current scope, package or even another namespace import. For every such expression, one has to check whether the referenced object is not already defined somewhere else. Only then can we assume that it comes from pkg and that the import is indeed used.
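The following heavily simplified sketch illustrates the idea (it ignores shadowing, selector expressions and nested scopes, and the set of exported names is assumed to be known up front; it is not our actual implementation):

package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

const src = `package demo

import . "fmt"

func greet() { Println("hello") }
`

func main() {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "demo.go", src, 0)
	if err != nil {
		panic(err)
	}

	// Hypothetical: the names exported by the dot-imported package.
	exported := map[string]bool{"Println": true, "Printf": true}

	// Collect the names declared in the file itself.
	declared := map[string]bool{}
	for _, decl := range file.Decls {
		switch d := decl.(type) {
		case *ast.FuncDecl:
			declared[d.Name.Name] = true
		case *ast.GenDecl:
			for _, spec := range d.Specs {
				switch s := spec.(type) {
				case *ast.TypeSpec:
					declared[s.Name.Name] = true
				case *ast.ValueSpec:
					for _, name := range s.Names {
						declared[name.Name] = true
					}
				}
			}
		}
	}

	// The dot-import is (potentially) used if an identifier is exported by the
	// package but not declared locally.
	used := false
	ast.Inspect(file, func(n ast.Node) bool {
		if ident, ok := n.(*ast.Ident); ok && exported[ident.Name] && !declared[ident.Name] {
			used = true
		}
		return true
	})

	fmt.Println("dot-import appears used:", used)
}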

Removing such more complex unused imports correctly is something we have planned for the future, further improving LLM code generation quality.

Black-box testing without an implementation import

The black-box testing package mistake (candidate 4) was already illustrated with an example above. However, in that case, it was at least clear what the intention of the test was. The import was done correctly, the only problem was that the referenced declaration was private/unexported. The following example by xwin-lm-70b is more ambiguous:

package plain_test

import (
	"testing"
)

func TestPlain(t *testing.T) {
	plain()
}

// func TestPlainEmpty(t *testing.T) {}

In the previous example, plain.plain() was unexported but it was clear that it must be part of the plain package. But here plain() is simply undefined and could come from anywhere. That it could come from the implementation package is in this case just an assumption.

Soon, we want to extend the static analysis to also handle these more challenging cases.

Integration in the evaluation

The static code repair is currently available for the write-test and transpile tasks for the Go language. We detect whenever an LLM-generated solution does not compile, apply the code repair techniques outlined above, and re-compile to see if the repairs were successful. To clearly separate LLM scores enhanced through static repairs from the bare LLM results, we introduce new “virtual” tasks to collect the improved scores. I.e. write-test results are the out-of-the-box LLM results and write-test-symflower-fix are the ones with additional code repair.

Ruby support

We added Ruby as a third language to the benchmark because it is not a very common language (ranked as the 19th most popular language in the 2024 Stack Overflow survey). This makes it an interesting candidate, as LLM performance is influenced by the availability and quality of training examples. The journey of supporting Ruby as a new language is documented in a separate blog post. The results section contains a dedicated Ruby portion.

New & updated tasks

Previously, we only challenged LLMs with a “write test” task for both Java and Go. This task has models write unit tests for implementation functions. In our recent release, we increased the number of cases to 24 per language. Preparing to support other software engineering tasks in the future, we now added more internal infrastructure that allows us to add new task types more easily.

The way repositories are validated is also something we have improved. Previously, repository validation (checking if the folders containing the cases have the valid structure) was performed for every run and every model. Naturally, verifying the repository structure again and again is inefficient and unnecessarily extends the evaluation time. Instead, we now only validate task repositories once, right before running the evaluation.

With the two new tasks we added for this new version, we have a total of three types of tasks in the evaluation for each language (Go, Java):

Unit tests

This task already existed, but we did some tweaking.

  • Asks a model to write unit tests for a function
  • 24 different “write-test” cases (per language)

Last time, we reported that the Java Knapsack problem with the non-static outer class was problematic for some models to solve. Many models could not figure out that they need to instantiate the outer Knapsack class to be able to instantiate the inner Item class. The example has been fixed to make the outer class static as well, so instantiating an Item can be done with new Knapsack.Item().

Code repair

  • Asks a model to repair source code with compilation errors
  • 5 different “code-repair” cases (per language) with a corresponding test suite (avg. 3.6 tests per response)

Task repositories are located in golang/mistakes and java/mistakes. A mistakes repository itself contains several packages, each representing one case for an LLM to solve. Each case is composed of a source file (which contains the erroneous code that the LLM should fix) and the corresponding test file with a valid test suite.

Multiple scenarios are covered in the code repair task:

  • Missing import
  • Function missing the opening bracket of its body
  • Function missing the parameter type
  • Function with a misspelled parameter type
  • Function using an undeclared variable

When running the evaluation, we first compile the package to obtain the list of errors. We then hand the errors to the LLM, along with the source code. Finally, we check the LLM’s response by running the predefined tests with the source code returned by the model.
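In rough pseudo-Go, the flow for a single code-repair case might look like this (queryModel is a hypothetical stand-in for the actual LLM call; this is a sketch, not the evaluation’s real implementation):

package repair

import (
	"os"
	"os/exec"
	"path/filepath"
)

// queryModel is a placeholder for the actual LLM request containing the
// broken source code and the compiler errors.
func queryModel(source, compileErrors string) string {
	// ... send the prompt and extract the code from the response ...
	return source
}

// repairCase compiles a broken case, asks the model for a fix and validates
// the answer by running the predefined test suite.
func repairCase(packageDir, sourceFile string) error {
	// 1. Compile the package to obtain the list of errors (expected to fail).
	build := exec.Command("go", "build", "./...")
	build.Dir = packageDir
	compileErrors, _ := build.CombinedOutput()

	// 2. Hand the errors to the LLM, along with the source code.
	source, err := os.ReadFile(filepath.Join(packageDir, sourceFile))
	if err != nil {
		return err
	}
	repaired := queryModel(string(source), string(compileErrors))

	// 3. Write the model's answer back and run the predefined tests against it.
	if err := os.WriteFile(filepath.Join(packageDir, sourceFile), []byte(repaired), 0o644); err != nil {
		return err
	}
	test := exec.Command("go", "test", "./...")
	test.Dir = packageDir
	_, err = test.CombinedOutput()
	return err // nil means all predefined tests passed.
}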

When test-driving the code repair task for the first time, we encountered an interesting ambiguity. For the variableUnknown test case, we originally defined a variable y with y := x and then removed the declaration to provoke the compile error, as shown below (of course without the comment):

package variableUnknown

func variableUnknown(x int) int {
	// deleted y := x
	if x > 0 {
		return y
	}
	if x < 0 {
		return -y
	}
	return 0
}

We then defined some test cases around this example to check if LLMs would properly repair the broken code. Keep in mind that we assumed y == x and that the tests are never shown to the LLM as part of the challenge. This setup proved to be ambiguous, since without context there is no way to figure out that y must be equal to x. Some LLMs defined y == 10, while others just returned x. By making y = x an assignment instead of a definition (which is an error in both Go and Java), we gave the LLM just enough of a hint to correct the problem:

package variableUnknown

func variableUnknown(x int) int {
	y = x
	if x > 0 {
		return y
	}
	if x < 0 {
		return -y
	}
	return 0
}

The score of the write-test task is based on the coverage that the LLM-generated tests produce for the implementation. For code-repair, it is the other way around: the score is based only on the number of passing unit tests, which ensure that the repaired code behaves correctly.

Transpilation

  • Asks a model to transpile source code to another language
  • 5 different “transpile” task cases (per language) with a corresponding test suite (avg. 4.4 tests per response)

With this new task, the evaluation supports transpilation between Java and Go (i.e. Java to Go and vice versa). The task repositories golang/transpile and java/transpile hold these tasks’ cases. Each case has an implementation folder containing the source code to be transpiled (e.g. golang/transpile/binarySearch/implementation/BinarySearch.java is the implementation file to be transpiled to Go). The remaining two files are a stub for the target language so the LLMs know the function signature, plus a test file with a predefined test suite to ensure that the transpiled code has the correct behavior. This design allows us to easily support more languages in the future by adding more source implementations (e.g. golang/transpile/binarySearch/implementation/binarySearch.rb).

golang/
└── transpile
    └── binarySearch
        β”œβ”€β”€ binarySearch.go       // Transpilation target stub
        β”œβ”€β”€ binarySearch_test.go  // Predefined test suite
        β”œβ”€β”€ go.mod
        └── implementation
            └── BinarySearch.java // Transpilation source

Scenarios are taken from the light repository:

  • balancedBrackets
  • binarySearch
  • cascadingIfElse
  • pascalsTriangle
  • sort

We initially had a small inconsistency in the sort Go test case. In the stub file, the package declaration was isSorted. But in the prompt, we instructed LLMs that the package should be sort (derived from the folder structure). Some LLMs blindly followed the prompt and called the transpiled package sort, some thought it would be smarter to use the stub package isSorted. We eventually changed the naming to match and resolve this confusion.

Given the following Java code file, transpile it into a Go code file.
The response must contain only the transpiled Go source code and nothing else.
package com.eval;

class Sort {
    static boolean isSorted(int[] a) {
        int i = 0;
        while (i < a.length - 1 && a[i] <= a[i + 1]) {
            i++;
        }

        return i == a.length - 1;
    }
}

The transpiled Go code file must have the package "sort" and the following signature:

func isSorted(a []int) bool {
}

Note that we are again not using coverage scoring for this task. An LLM could just add random statements to the transpiled code and receive more coverage, which would be unfair. So instead, we once more check how many unit tests of the test suite pass (as in the code repair task).
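For Go, such a check can be sketched as follows, assuming the predefined test suite is executed with the standard toolchain: run "go test -json" and count the passing test events (this is an illustration, not necessarily how the evaluation parses results internally):

package scoring

import (
	"bufio"
	"bytes"
	"encoding/json"
	"os/exec"
)

// testEvent mirrors the fields of a "go test -json" event that we care about.
type testEvent struct {
	Action string `json:"Action"`
	Test   string `json:"Test"`
}

// countPassingTests runs the test suite in dir and returns how many individual
// tests passed.
func countPassingTests(dir string) (int, error) {
	cmd := exec.Command("go", "test", "-json", "./...")
	cmd.Dir = dir
	output, _ := cmd.CombinedOutput() // A non-zero exit code just means some tests failed.

	passed := 0
	scanner := bufio.NewScanner(bytes.NewReader(output))
	for scanner.Scan() {
		var event testEvent
		if err := json.Unmarshal(scanner.Bytes(), &event); err != nil {
			continue // Skip lines that are not JSON events (e.g. build output).
		}
		if event.Action == "pass" && event.Test != "" {
			passed++
		}
	}
	return passed, scanner.Err()
}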

Docker image

We now provide a Docker image for the evaluation. It was a long-time goal of ours to sandbox unsafe LLM-generated code. Furthermore, the eval-dev-quality command now has a --runtime=docker argument which causes it to switch from the (unsafe) local evaluation to an automatically spun-up Docker container for the evaluation to run in. This also enables benchmarking several models in parallel by spawning multiple containers (using the new --parallel flag).

Image tagging within GitHub actions

For convenience, we wanted to tag the images with the current commit hash with which the image was built. However, when tagging the images in GitHub actions, we realized that GitHub uses different commit hashes when actions are run for PRs. This is because the PR HEAD simulates the PR branch being already merged on top of the base branch including the merge commit. As we are not the only ones confused by this, there is a discussion about this behavior. For us, the solution was just to build the Docker image separately for the actual feature branch and the PR. This allows us to keep separate images for the branch + the “simulated behavior” of the branch being merged to main. We also added the new --runtime-image <sha> flag which then runs the evaluation with a specific image by the commit hash of a PR branch.

Testing Docker support within GitHub actions

We were initially skeptical whether it would be possible to get a Docker runtime working within a GitHub action to test the new sandboxing. After all, a GitHub action runner is itself a container as well. There exists a “docker in docker” image but using it is not generally recommended. So far, though, we have not encountered any problems.

However, this additional abstraction layer is a bit more challenging to work with conceptually. One problem was that, currently, only task data located within the Docker image is supported. I.e. when running the evaluation with --runtime=docker, one can only evaluate on the stock cases that were present when the image was built. Mounting custom tasks into the container is on our list for a future update. This was problematic during testing. Our tests load the task data from a temporary directory so that it can optionally be modified for the test. But of course that temporary directory does not exist in the image. This was resolved by adapting the testing logic.

Non-root containers

We also ran into the problem of rootless containers. The result directory was mounted into the container as a “bind mount”. However, this can lead to permission/ownership problems and a lot of head-scratching. So instead of using “bind mount” volumes, we ended up using actual volumes, which alleviated all of those problems. The only downside of this approach is that we need an additional container connected to the volume, with which we then run docker cp to move the data back to the host.

Parallel access to result directories

Containerization came with another general problem. As mentioned above, we can now run parallel evaluations using multiple containers at once (either locally or in a Kubernetes cluster). We already had a check in place to avoid overwriting existing results by suffixing the result directory with a number if it already exists. But when spawning multiple containers, there were still race conditions where several containers picked the same suffix, resulting in I/O synchronization problems. So we now make the result directory of each container unique during initialization instead of relying on the previous mechanism.
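
One way to get a collision-free directory without a check-then-suffix race is to let the operating system create it atomically. This simplified Go sketch uses os.MkdirTemp to illustrate the idea; it is not the evaluation’s exact implementation:

package main

import (
    "fmt"
    "os"
)

func main() {
    // Ensure the parent directory exists.
    if err := os.MkdirAll("results", 0o755); err != nil {
        panic(err)
    }

    // MkdirTemp creates "results/evaluation-<random>" atomically, so two
    // containers can never end up writing into the same directory.
    resultPath, err := os.MkdirTemp("results", "evaluation-")
    if err != nil {
        panic(err)
    }

    fmt.Println("writing results to", resultPath)
}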

Kubernetes support

Building on the Docker runtime, we have also added support for running the containers in Kubernetes clusters. This allows us to benchmark many models in parallel on our infrastructure.

Running distributed evaluations requires ReadWriteMany persistent volumes. To copy result data back from the volume to the host, kubectl cp is used. When multiple models are benchmarked, we apply pod anti-affinity to spread the workload across the whole cluster during execution. For added security, we pass secrets (API keys) to containers via environment variables instead of CLI arguments.
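
For illustration, copying results off the shared volume through a pod could look roughly like this (namespace, pod name and paths are placeholders):

kubectl cp <namespace>/<results-pod>:/app/results ./results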

Test execution timeouts

In our previous report, we mentioned that weaker models often generate very inefficient or even blocking code. To prevent such generated code from slowing down or halting the complete evaluation, we added a timeout for test execution. It defaults to 5 minutes, which should be enough for reasonable generated tests to finish while aborting excessively time-consuming or blocking ones.
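
A simplified Go sketch of such a timeout around an external test run; the concrete test command is a placeholder and the code is an illustration rather than the evaluation’s implementation:

package main

import (
    "context"
    "errors"
    "fmt"
    "os/exec"
    "time"
)

func main() {
    // Abort test execution after the default timeout of 5 minutes.
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
    defer cancel()

    // Run the generated test suite; the process is killed once the deadline is hit.
    cmd := exec.CommandContext(ctx, "go", "test", "./...")
    output, err := cmd.CombinedOutput()
    if errors.Is(ctx.Err(), context.DeadlineExceeded) {
        fmt.Println("test execution timed out")
        return
    }
    if err != nil {
        fmt.Printf("tests failed: %v\n%s", err, output)
    }
}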

Prompt refinement

We unified our task prompts across all operating systems to use UNIX-style forward slashes in file paths. This is a subtle change but it ensures that results are comparable and consistent across operating systems.

Additionally, we refined our prompts with respect to code fences. As reported in our very first deep dive, lots of models return unwanted information in their responses. We have to parse the code portion from the response and rely on code fences to find the relevant part of the output. Furthermore, we award extra points (the no-excess score) during scoring if a model response contains just code in code fences and nothing else. We first realized something was off when looking at the results of Claude 3.5 Sonnet: it repeatedly missed a few no-excess points, affecting its overall rank despite otherwise good performance. Investigating the generated responses showed that Claude 3.5 Sonnet prefers not to use code fences at all. To avoid such confusion in the future, we adapted our prompt to strictly require code fences. This should also improve our metric for excess code, as it now penalizes models that do not obey the requirement to use a code fence.
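
To illustrate why strict code fences matter for parsing, here is a simplified Go sketch of extracting the first fenced block from a model response (not the benchmark’s actual parser):

package main

import (
    "fmt"
    "regexp"
)

// fenceRE matches the first ```-fenced block and skips an optional language tag.
var fenceRE = regexp.MustCompile("(?s)```[a-zA-Z0-9]*\n(.*?)```")

func extractFencedCode(response string) (string, bool) {
    match := fenceRE.FindStringSubmatch(response)
    if match == nil {
        return "", false
    }
    return match[1], true
}

func main() {
    response := "Sure! Here is the code:\n```go\npackage sort\n```\nLet me know if you need anything else."
    code, ok := extractFencedCode(response)
    fmt.Println(ok)
    fmt.Print(code)
}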

Improved Ollama initialization

In our previous release we added Ollama support for benchmarking arbitrary (local) models. As part of the initialization, the evaluation spins up an Ollama server instance to verify that all models selected for evaluation have already been pulled. We have made two improvements to this process. First, selected models are now pulled automatically. Second, we no longer start the Ollama server if no Ollama model is selected for evaluation, speeding up initialization when Ollama is not needed.
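
In pseudo-Go, the improved initialization boils down to the following sketch; the helper is made up for illustration, while "ollama serve" and "ollama pull" are the actual Ollama CLI commands:

package main

import (
    "fmt"
    "os/exec"
)

// ensureOllamaModels is a hypothetical helper illustrating the new behavior.
func ensureOllamaModels(models []string) error {
    // Skip the Ollama server entirely if no Ollama model is selected.
    if len(models) == 0 {
        return nil
    }

    // Start the server (simplified, without waiting for it to become ready).
    if err := exec.Command("ollama", "serve").Start(); err != nil {
        return err
    }

    // Pull every selected model automatically instead of requiring a manual pull.
    for _, model := range models {
        if err := exec.Command("ollama", "pull", model).Run(); err != nil {
            return fmt.Errorf("pulling %s: %w", model, err)
        }
    }

    return nil
}

func main() {
    if err := ensureOllamaModels([]string{"llama3.1"}); err != nil {
        fmt.Println(err)
    }
}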

We did some CPU-inference tests for bigger models and found that CPU-only inference quickly reaches its limits. We are looking forward to switching to GPU inference for bigger models. We also plan to add a reasonable query timeout for Ollama models to prevent inefficient CPU inference from accidentally taking hours on slow CPUs.

Reporting, logging and metrics

Structured logging

Until now, all logs were written into a single file (per task and repository). Our policy is to log as much information as possible, but that also makes it difficult to quickly find specific information. With this iteration, we internally introduced structured logging to prepare for better reporting. Our goal is a “JUnit HTML report”-like log structure where it is easy to inspect overall information (e.g. which tasks were solved by a model) with the ability to go deeper (e.g. to inspect the model responses and compilation output directly). As a first step, model responses are now written to separate files for easier access.
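
A minimal sketch of what structured logging with per-response files could look like in Go, using the standard log/slog package; the file layout and attribute names are assumptions, not the benchmark’s actual format:

package main

import (
    "log/slog"
    "os"
    "path/filepath"
)

func main() {
    logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

    // Write the raw model response to its own file for easy inspection ...
    responsePath := filepath.Join("results", "gpt-4o", "write-tests", "binary-search", "response.md")
    if err := os.MkdirAll(filepath.Dir(responsePath), 0o755); err != nil {
        panic(err)
    }
    if err := os.WriteFile(responsePath, []byte("model response goes here"), 0o644); err != nil {
        panic(err)
    }

    // ... and log structured metadata that points to it.
    logger.Info("model response received",
        slog.String("model", "gpt-4o"),
        slog.String("task", "write-tests"),
        slog.String("case", "binary-search"),
        slog.String("response-file", responsePath),
    )
}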

Result post-processing

In this release, we have also decoupled benchmarking from results processing. Previously, scores were aggregated and written to disk only after the evaluation completed, which meant that when the evaluation failed, its results were gone. Now, we write results right away so nothing gets lost. This implies that post-processing (combining results from different sources, aggregating model scores across multiple tasks, …) has to be done in a separate step.

Hence, there is a new sub-command for post-processing results. Currently, it can be used to combine multiple evaluation results (e.g. from multiple containerized evaluations) into one report:

eval-dev-quality report --evaluation-path=<path-to-evaluation.csv> --evaluation-path=<path-to-other-evaluation.csv> --result-path=<result-path>

Arguments for this sub-command are:

  • evaluation-path: A list of paths to evaluation.csv files. This argument also supports glob patterns, e.g. docs/v5/*/evaluation.csv (see the example below).
  • result-path: A path to write the results to.
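
For example, combining the results of all v5 runs into a single report could look like this (the result path is a placeholder):

eval-dev-quality report --evaluation-path="docs/v5/*/evaluation.csv" --result-path=<result-path>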

The next step for us is to extend result post-processing to enable more sophisticated insights (aggregating model scores, interactive plots, a leaderboard, …).

Configuration reproducibility

When running sample benchmarks, we always found it challenging to later trace back how exactly they were configured (e.g. which models and task repositories were used for the evaluation). We would have to search through the logs to find that information and then reconstruct the exact command to re-run a specific benchmark. Initially, we just started collecting dozens of scripts with different benchmark command-line invocations, but that solution does not scale.

With this release, we prototyped a configuration file that is written out with each evaluation result and documents the selected models and task repositories. The same file can also be passed back to the evaluation to start a new benchmark with the same settings. This is currently limited to native OpenRouter and Ollama models, but we want to extend it to arbitrary OpenAI API providers as well to allow saving and restoring their custom API URLs with the configuration.
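
To make this concrete, the shape of such a configuration could look roughly like the following Go types; field names, model identifiers and repository names are purely illustrative, and the actual file format may differ:

package main

import (
    "encoding/json"
    "fmt"
)

// EvaluationConfiguration is a hypothetical shape for the written-out configuration file.
type EvaluationConfiguration struct {
    Models       []string `json:"models"`
    Repositories []string `json:"repositories"`
}

func main() {
    configuration := EvaluationConfiguration{
        Models:       []string{"openrouter/anthropic/claude-3.5-sonnet", "ollama/llama3.1"},
        Repositories: []string{"golang/transpile", "java/write-tests"},
    }

    data, err := json.MarshalIndent(configuration, "", "  ")
    if err != nil {
        panic(err)
    }
    fmt.Println(string(data))
}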

What comes next? DevQualityEval v0.7

There are four problems that we need to solve for the next DevQualityEval version, v0.7. Let’s again take a look at the overall score chart:

Bar chart that shows the total score of 69 LLMs along the x-axis, including the potential score improvement through static code repair.
Higher resolution PNG in case SVG is not working.

From this chart we can directly identify two problems:

  1. Models are hitting the score ceiling: the section on solving the ceiling problem discusses solutions.
  2. Static code repair is not fully utilized: the section on score improvement through static code repair discusses the current code repair solution. However, we identified an immense list of features that should be implemented to make even <10B models perform similarly to >400B models.

However, there are two problems that you as a reader of our deep dives have surely identified:

  • Missing languages, frameworks and use cases: this benchmark is about helping everyone create and choose their perfect model. If your favorite technology is not represented, let us know (your vote counts towards our roadmap) and make sure to help us!
  • More automatic reporting and a dynamic leaderboard: working on new DevQualityEval versions, which involves creating these deep dives to reflect how we move forward, takes a huge amount of time. In the next version we will therefore focus on more automated tooling to make sure we are not wasting time on creating graphs or filtering logs by hand. The next version will “automate all the things”!

We hope you enjoyed reading this deep dive. We would love to hear your thoughts and feedback on how you liked the details, and how we can improve such deep dives and the DevQualityEval.

If you are interested in joining our development efforts for the DevQualityEval benchmark: GREAT, let’s do it! You can use our issue list and discussion forum to interact or write us directly at markus.zimmermann@symflower.com or on Twitter.

2024-09-26