The most capable model for Rust is OpenAI: ChatGPT-4o, with GPT-4.1 Nano being the most cost-efficient choice. DeepSeek: DeepSeek V3 is the best open-weight model and Google: Gemma 3 (both 12B and 27B) shine as strong models for local deployment.
We evaluate how useful LLMs are for Rust on code generation and software engineering tasks. Which models complete all their assigned tasks without issues, which already struggle with syntax, and which generate usable code at a cheap price point? We added Rust to DevQualityEval since it continues to be the most anticipated programming language, with over 80% of developers hoping to use Rust in the future.
This article takes a deep dive into the evaluation, focusing on findings and cases that need to be improved, either via better LLMs or dedicated coding tools. Some highlights we have seen include high-scoring models such as OpenAI: ChatGPT-4o struggling with Rust’s explicit overflow semantics. Generic type information and import syntax also lead to issues. Due to an oversight on our end, models were even able to cheat during transpilation to obtain a higher score than should be possible.
The terminology and setup of the DevQualityEval benchmark are explained in a dedicated section. For the full benchmark, including the additional languages Java, Go and Ruby, visit devqualityeval.com.
Table of contents:
- Comparing the capabilities and costs of top models
- Comparing model capabilities by total score
- Tasks
- Setup
Comparing the capabilities and costs of top models
The graph shows the best models in relation to their overall scores (y-axis, linear scale) and costs (x-axis, logarithmic scale, $ spent for average costs per API request over the whole benchmark for a model). The sweet spot is the top-left corner: cheap with good results.
Looking at cost-efficiency (i.e. performance vs. the average costs per API request for solving all 120 Rust benchmark cases), there is a cluster of high-scoring and affordable models at the top left. Sweeping a 45° line from that top-left corner towards the bottom-right corner gives us the following TOP 3 in cost-effectiveness:
- #1 the 👑 of cost-effectiveness OpenAI: GPT-4.1 Nano (95.34% at $0.03 for the whole benchmark, and #12 overall),
- #2 💎 OpenAI: ChatGPT-4o (99.77% at $0.92 for the whole benchmark, and #1 overall)
- #3 💍 OpenAI: GPT-4.1 Mini (97.93% at $0.10 for the whole benchmark, and #5 overall)
Comparing the most cost-effective model OpenAI: GPT-4.1 Nano (95.34%) with the best performing model OpenAI: ChatGPT-4o (99.77%, at $0.92 for the whole benchmark), it is possible to reach almost top performance at ~3% of the cost. This makes it clear that it is essential to compare costs alongside performance when choosing a model for Rust programming.
Comparing the TOP 3 models in terms of functional score for Rust, we see increased costs for reasoning models:
- #1 🥇 OpenAI: ChatGPT-4o, 99.77% at $0.92 for the whole benchmark
- #2 🥈 OpenAI: o3-mini (2025-01-31) (reasoning-effort=low), 99.44% at $7.44 for the whole benchmark
- #3 🥉 OpenAI: o3-mini (2025-01-31) (reasoning-effort=high), 99.15% at $7.39 for the whole benchmark
The three most expensive models overall made up 68% of our costs for running the benchmark, but only one of them reached an almost perfect score:
- OpenAI: o1-preview (2024-09-12), 97.15% at $33.99 for the whole benchmark
- Anthropic: Claude 3.7 Sonnet (thinking), 95.13% at $17.69 for the whole benchmark
- Google: Gemini 2.5 Pro Preview, 86.40% at $15.39 for the whole benchmark
Reducing the full list of over 300 LLMs to a manageable size was done by sorting based on scores and then costs. We then removed all models that were worse than the best models of the same vendor/family/version/variant (e.g. gpt-4* is better than gpt-3*, so the latter can be removed). For clarity, the remaining models were renamed to represent their variants. Even then, the list was immense. In the end, only the most important new models, fundamental models and top-scorers were kept for the above graph. For a complete picture, take a look at the DevQualityEval leaderboard or even request access to all metrics and graphs.
Comparing model capabilities by total score
This graph displays the total Rust score of all benchmarked models. The higher the score, the better the model performed on a functional level. This uses a percentage-based score instead of a numerical scoring system, as outlined in the v0.6 “percentage-based-score” section.
The TOP 3 best-scoring models are:
- #1 🥇 OpenAI: ChatGPT-4o, 99.77%
- #2 🥈 OpenAI: o3-mini (2025-01-31) (reasoning-effort=low), 99.44%
- #3 🥉 OpenAI: o3-mini (2025-01-31) (reasoning-effort=high), 99.15%
The average score is 69.07%. We have 20.0% of all models scoring 90% or higher, and 41.5% of all models scoring 80% or higher. So, while we have lots of good models, there is still a portion of LLMs (58.5%) struggling to generate immediately usable Rust code.
Comparing scores and model size
This graph plots the total score of models (reaching 50% or higher) against their number of parameters, in all the cases where the number of parameters was known and confirmed.
These are the TOP models according to their sizes:
- Tiny: Mistral: Codestral Mamba with 7B and 73.51%
- Small: Mistral: Ministral with 8B and 71.68%
- Mid-small: Google: Gemma 3 with 12B and 78.51%
- Medium: Google: Gemma 3 with 27B and 86.00%
- Big: Qwen2.5 Instruct with 72B and 85.49%
- Bigger: Mistral: Mistral Large 2 (2407) with 123B and 90.61%
- Huge: DeepSeek: DeepSeek V3 (0324) with 685B and 93.95%
In a previous iteration, we pointed out that while more model parameters seem to lead to a better score, size is not everything. In this Rust evaluation, 55.3% of all <32B models scored over 50% vs 76.9% of all ≥32B models. But parameter size isn’t the only factor:
- Mistral: Mixtral 8x22B Instruct only scored 60.75% despite having 176B parameters
- Meta: Llama 3.1 405B Instruct at 71.44% scored no better than many smaller models (there are models 17x smaller scoring as well or better, e.g. Google: Gemini 1.5 Flash 8B at 71.44%)
- Nous: Hermes 2 Mixtral 8x7B DPO achieved only 23.85% despite having 56B parameters. The only three models that scored worse are all significantly smaller: Mistral: Mistral 7B Instruct v0.3 (18.68%), Mistral: Mistral 7B Instruct (16.73%) and Meta: Llama 3.2 1B Instruct (16.55%).
Comparing scores and performance
This graph plots the total score of models (reaching 50% or higher) against their performance (logarithmically). We measured the response time for each model and normalized over the solved cases, resulting in the average time it took the model to solve a single example. On average, models took 19.58 seconds to solve a task case.
There was a huge gap between the fastest model, Amazon: Nova Micro 1.0 (53.05%) at 6.67 seconds per response, and the slowest, Microsoft: Phi 4 Reasoning Plus (8.79%) at 213.89 seconds per response. That means Phi 4 Reasoning Plus took about 3.5 minutes to solve a single task, which made this model an outlier: it was almost 1.7 times as slow as the second-slowest model, Qwen: Qwen3 235B A22B (126.08 seconds per response).
Amazon won in terms of speed, with Nova Micro 1.0 (53.05%) being the fastest and Nova Lite 1.0 (6.75 seconds per task, 71.60%) being the second-fastest LLM. The fastest model scoring >80% is Inception: Mercury Coder Small Beta with 7.11 seconds per task. It should be noted that the best scoring model, OpenAI: ChatGPT-4o (99.77%), only took 7.61 seconds to solve a task on average.
Comparing model reliability
LLMs produce their output through probabilistic sampling. To compensate for this nondeterministic behavior, DevQualityEval is evaluated over 5 runs by default. This also allows us to measure fluctuations in model performance, which translates to how reliably a model can solve tasks.
To conduct this analysis, we calculate the mean deviation over the 5 evaluation runs. To be precise, the scores that a model receives for the 5 runs are r1, …, r5. The mean of these scores is AVG(r1, …, r5), and the mean deviation is the averaged absolute difference between the mean and the individual scores: (|AVG(r1, …, r5) - r1| + … + |AVG(r1, …, r5) - r5|) / 5. A nice property of this metric is that it can be interpreted directly as the average fluctuation of the score around the mean, i.e. how consistent a model’s score is.
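As a small illustration (our own sketch, not part of the benchmark code), the metric boils down to the following:

```rust
/// Mean deviation: the average absolute difference between each run's score
/// and the mean score over all runs.
fn mean_deviation(scores: &[f64]) -> f64 {
    let mean = scores.iter().sum::<f64>() / scores.len() as f64;
    scores.iter().map(|score| (score - mean).abs()).sum::<f64>() / scores.len() as f64
}

// Example: five runs fluctuating slightly around a mean of 99.0 give a
// mean deviation of 0.2, i.e. the score fluctuated by +/-0.2% on average.
// mean_deviation(&[99.0, 98.5, 99.5, 99.0, 99.0]) == 0.2
```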
Across all models, we have seen values as low as 0.17% for OpenAI: o3-mini (2025-01-31) (reasoning-effort=low), indicating that the model score fluctuated by an average +/-0.17% around the mean. That makes OpenAI: o3-mini (2025-01-31) (reasoning-effort=low) the most reliable model of this iteration. On the other hand, there are extremely high values such as 65.20% for Microsoft: Phi 4 Reasoning Plus 15B, meaning that its actual score fluctuated on average +/-65.20% around the mean.
While the three most reliable models are all from OpenAI, it is worth mentioning that in fourth place we have Inception: Mercury Coder Small (beta) with 0.64%: also an extremely reliable model, built on a new architecture inspired by diffusion models, called a “Large Language Diffusion Model”. We are eager to see what comes next.
Tasks
We evaluate models on three scenarios:
- Write tests: the model receives a piece of code and has to provide a unit test suite reaching 100% coverage
- Code repair: the model receives an erroneous code section with an error message and has to provide a fixed version of the code where the error is resolved
- Transpile: the model receives a function stub with an implementation in a different language and has to provide a transpiled version of the implementation
Write tests
We discuss results, setup of the task, and noteworthy insights.
Results
No LLM was able to reach 100%, but 3 OpenAI models were close with a score of over 99%.
The only reason why OpenAI: o3-mini (2025-01-31) (reasoning-effort=high) and OpenAI: o3-mini (2025-01-31) (reasoning-effort=low) did not reach the full score was that each had a single run (out of 5) where the generated tests did not cover all lines of the following source code: the second occurrence of jacobi = -jacobi; was not covered.
use std::error::Error;

pub fn jacobi_symbol(k: u64, mut n: u64) -> Result<i32, Box<dyn Error>> {
    if k < 0 || n % 2 == 0 {
        return Err("Invalid value".into());
    }
    let mut k = k % n;
    let mut jacobi = 1;
    while k > 0 {
        while k % 2 == 0 {
            k /= 2;
            let r = n % 8;
            if r == 3 || r == 5 {
                jacobi = -jacobi;
            }
        }
        let temp = n;
        n = k;
        k = temp;
        if k % 4 == 3 && n % 4 == 3 {
            jacobi = -jacobi; // <- Missed coverage.
        }
        k %= n;
    }
    if n == 1 {
        Ok(jacobi)
    } else {
        Ok(0)
    }
}
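For illustration, an assertion along the following lines (our sketch, not taken from a model response) would have covered the missed line, since for the inputs (3, 7) the swap leaves both k % 4 and n % 4 equal to 3:

```rust
#[test]
fn test_jacobi_symbol_second_negation() {
    // After the swap, k == 7 and n == 3, so k % 4 == 3 && n % 4 == 3 holds
    // and the second `jacobi = -jacobi;` is executed, yielding -1.
    assert_eq!(jacobi_symbol(3, 7).unwrap(), -1);
}
```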
OpenAI: ChatGPT-4o is slightly behind the o3-mini variants from above: it had two runs where it missed covering one line, and one run where it missed two lines of the following example:
pub fn valid_date(day: i32, month: i32, year: i32) -> bool {
    let month_days = vec![31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31];
    if year < 1583 {
        return false;
    }
    if month < 1 || month > 12 {
        return false;
    }
    if day < 1 {
        return false;
    }
    if month == 2 {
        if (year % 400) != 0 && (year % 4) == 0 {
            if day > 29 {
                return false;
            }
        } else {
            if day > 28 {
                return false;
            }
        }
    } else {
        if day > month_days[month as usize - 1] {
            return false;
        }
    }
    true
}
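For reference, a compact test set along the following lines (our own sketch, not a model response) exercises every line of valid_date, including both February branches:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_valid_date_all_branches() {
        assert_eq!(valid_date(1, 1, 1000), false);  // Year before 1583.
        assert_eq!(valid_date(1, 13, 2000), false); // Month out of range.
        assert_eq!(valid_date(0, 1, 2000), false);  // Day below 1.
        assert_eq!(valid_date(30, 2, 2024), false); // February, first branch, day > 29.
        assert_eq!(valid_date(29, 2, 2023), false); // February, second branch, day > 28.
        assert_eq!(valid_date(32, 1, 2020), false); // Non-February, day > month length.
        assert_eq!(valid_date(15, 6, 2020), true);  // Valid date.
    }
}
```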
Models scored on average 67.20%, with 19.51% of the models scoring >90%.
Are the models solving the problems in the same way?
Most of the models solved the examples in the same way, by using the testing functions and macros from the Rust standard library. Some models, like Mistral: Mistral 7B Instruct v0.3 and Mistral: Mistral 7B Instruct, tried to import different testing libraries (e.g. proptest), but without success since we don’t allow the use of third-party libraries.
What are the problems of models not reaching 100 percent?
Many models had problems dealing with the idiomatic Rust type Result for error handling. While they know the type and how it is used, they lack a critical piece of semantic information that is relevant for generating tests.
Consider the following example with the previously mentioned return type.
pub fn jacobi_symbol(k: u64, mut n: u64) -> Result<i32, Box<dyn Error>> {
    if k < 0 || n % 2 == 0 {
        return Err("Invalid value".into());
    }
    let mut k = k % n;
    let mut jacobi = 1;
    while k > 0 {
        while k % 2 == 0 {
            k /= 2;
            let r = n % 8;
            if r == 3 || r == 5 {
                jacobi = -jacobi;
            }
        }
        let temp = n;
        n = k;
        k = temp;
        if k % 4 == 3 && n % 4 == 3 {
            jacobi = -jacobi;
        }
        k %= n;
    }
    if n == 1 {
        Ok(jacobi)
    } else {
        Ok(0)
    }
}
These models then provided tests containing assertions like assert_eq!(jacobi_symbol(2, 7), Ok(1)), which results in the following compilation error:
error[E0369]: binary operation `==` cannot be applied to type `Result<i32, Box<dyn std::error::Error>>`
A working assertion would be assert_eq!(jacobi_symbol(2, 7).unwrap(), 1).
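As a minimal sketch (ours, not a model response), a test module that handles the Result return type correctly could look like this:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_jacobi_symbol_ok() {
        // Unwrap the Ok value before comparing it to the expected result.
        assert_eq!(jacobi_symbol(2, 7).unwrap(), 1);
    }

    #[test]
    fn test_jacobi_symbol_err() {
        // An even n is rejected, so only the error case is asserted.
        assert!(jacobi_symbol(2, 4).is_err());
    }
}
```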
Models causing this issue include
- 01.AI: Yi Large
- Cohere: Command A
- Qwen: Qwen3 235B A22B
- Anthropic: Claude 3.7 Sonnet (thinking)
- Google: Gemma 3 27B
Another common type of error also concerns the jacobi_symbol example from above. Rust does not allow assigning a negative value to unsigned integers. Hence, an assertion like assert!(jacobi_symbol(-1, 3).is_err()) yields a compilation error:
error[E0600]: cannot apply unary operator `-` to type `u64`
Models having this issue include
- DeepSeek: DeepSeek V3 0324
- Google: Gemma 3 27B
- Command R7B (12-2024)
Initialization of multidimensional arrays is also difficult for LLMs. We asked the models to provide tests for the following function:
pub fn type_array_multidimensional_array_length(x: &[&[i32]]) -> i32 {
    if x.len() == 2 {
        if x[0].len() == 2 {
            return 2;
        }
        return 1;
    }
    0
}
Cohere: Command A provided tests of the following form:
#[test]
fn test_2x2_array() {
    let arr = [&[1, 2], &[3, 4]];
    assert_eq!(type_array_multidimensional_array_length(&arr), 2);
}
The initialization of arr is fine, but it has the wrong type for passing it to the function. Correct would have been this:
#[test]
fn test_2x2_array() {
    let arr: &[&[i32]] = &[&[1, 2], &[3, 4]];
    assert_eq!(type_array_multidimensional_array_length(arr), 2);
}
A similar mistake, made by DeepSeek: DeepSeek V3 0324 and Qwen: Qwen3 235B A22B, was omitting the type upon assignment, i.e. writing let arr = … instead of let arr: &[&[i32]] = ….
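The underlying reason is type inference: without the annotation, the binding is inferred as a reference to an array of array references, and the inner references are not coerced to slices when the value is passed to the function. A minimal sketch (ours) of the difference:

```rust
#[test]
fn test_2x2_array_coercion() {
    // Without an annotation, `arr` would be inferred as `&[&[i32; 2]; 2]`. The
    // outer reference coerces to a slice, but the inner `&[i32; 2]` elements do
    // not become `&[i32]`, so passing it to the function fails:
    // let arr = &[&[1, 2], &[3, 4]];
    // type_array_multidimensional_array_length(arr); // error[E0308]: mismatched types

    // The explicit annotation forces both levels to be slices, so the call compiles.
    let arr: &[&[i32]] = &[&[1, 2], &[3, 4]];
    assert_eq!(type_array_multidimensional_array_length(arr), 2);
}
```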
There are a lot of prominent models having issues with multidimensional arrays, including
- Cohere: Command A
- Google: Gemma 3 27B
- Qwen: Qwen3 8B
- Anthropic: Claude 3.7 Sonnet (thinking)
- OpenAI: o4 Mini
A very low score was achieved by Mistral: Mistral 7B Instruct v0.3 and Mistral: Mistral 7B Instruct. These models had a lot of syntactical issues, like forgetting or misplacing parentheses or quotes, or using wrong function names (balanced_brackets instead of has_balanced_brackets, or cascading_if_Else instead of cascading_if_else). They also tried to write the tests in unconventional ways, e.g. using testing frameworks other than the one from the standard library, but without success.
Distribution of solved cases
Case | Line coverage |
---|---|
src/balanced_brackets.rs | 73.55% |
src/binary_search.rs | 81.13% |
src/call_loop_conditions_often_enough.rs | 79.25% |
src/cascading_if_else.rs | 87.64% |
src/collatz_calculator.rs | 55.68% |
src/conditions_and.rs | 85.38% |
src/equilibrium_indices.rs | 80.45% |
src/for_loop.rs | 82.06% |
src/jacobi_symbol.rs | 27.72% |
src/klarner_rado_sequence.rs | 80.29% |
src/knapsack.rs | 83.72% |
src/largest_proper_divisor.rs | 53.46% |
src/magic_square_odd.rs | 64.74% |
src/match_bytes.rs | 75.24% |
src/pascals_triangle.rs | 43.72% |
src/phone_number.rs | 58.13% |
src/simple_if_else.rs | 86.71% |
src/simple_switch_with_return.rs | 85.71% |
src/sort.rs | 83.13% |
src/type_array_access.rs | 79.76% |
src/type_array_consecutive_access.rs | 86.00% |
src/type_array_multidimensional_array_length.rs | 39.17% |
src/validate_date.rs | 74.80% |
src/plain.rs | 80.33% |
This table shows all write test cases and their average line coverage over all models. For example, the plain.rs case only has an average coverage of 80%, which is quite surprising given that it is just a single empty function. The most challenging case is jacobi_symbol.rs, with only 28% coverage reached on average.
Setup of the write tests task
For this task, models receive a Rust function and have to provide a complete test suite with 100% coverage.
An example prompt would be:
Given the following Rust code file "src/cascading_if_else.rs", provide tests for this code that can be appended to the source file.
Add everything required for testing, but do not repeat the original file and do not try to import the code file.
The tests should produce 100 percent code coverage and must compile.
The response must contain only the test code in a fenced code block and nothing else.
```rust
pub fn cascading_if_else(i: i32) -> i32 {
    if i == 1 {
        2
    } else if i == 3 {
        4
    } else {
        5
    }
}
```
As the prompt states, the response must contain only the tests, but they are meant to be appended to the source code. Once a response is received, the source file including the tests is written to a file, cargo llvm-cov is executed, and the determined coverage is used to calculate the score of the response.
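For the cascading_if_else example above, an appended test module reaching 100% coverage could look like this (our own sketch of a passing response):

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_cascading_if_else() {
        assert_eq!(cascading_if_else(1), 2); // First branch.
        assert_eq!(cascading_if_else(3), 4); // Second branch.
        assert_eq!(cascading_if_else(0), 5); // Else branch.
    }
}
```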
Noteworthy insights
Tests in Rust can reside either in the corresponding implementation file for unit tests, or within a separate package for integration tests. Unit tests are more common, so we chose to implement that workflow. We ask a model to generate just the tests. We then append them to the implementation file and run the tests. With our initial prompt, it was not clear enough that the models only had to return the tests and nothing else.
Given the following Rust code file “src/plain.rs”, provide tests for this code. The tests should produce 100 percent code coverage and must compile. The response must contain only the test code in a fenced code block and nothing else.
We observed that models would repeat imports that already exist in the implementation in their tests. Appending the tests would then result in duplicated imports, and therefore non-compiling code. The following answer is from Meta: Llama 3.3 8B:
#[cfg(test)]
mod tests {
    use super::*;
    use std::error::Error;

    #[test]
    fn test_largest_proper_divisor_invalid_input() {
        assert!(largest_proper_divisor(-1).is_err());
        assert!(largest_proper_divisor(0).is_err());
    }

    // ... abbreviated ...

    #[test]
    fn test_largest_proper_divisor_error() {
        assert_eq!(largest_proper_divisor(-1).unwrap_err().downcast_ref::<&str>().unwrap(), "Argument must be >= 1");
        assert_eq!(largest_proper_divisor(0).unwrap_err().downcast_ref::<&str>().unwrap(), "Argument must be >= 1");
    }
}
The implementation already imports and uses std::error::Error, so the import by the LLM triggers a compilation error.
It took several iterations to find a reliable prompt that ensures the model responses only contain the test portion and nothing else:
Given the following Rust code file “src/plain.rs”, provide tests for this code that can be appended to the source file. Add everything required for testing, but do not repeat the original file and do not try to import the code file. The tests should produce 100 percent code coverage and must compile. The response must contain only the test code in a fenced code block and nothing else.
Code repair
We discuss results, setup of the task, and noteworthy insights.
Results
71 out of 123 (57.72%) models were able to solve 100% of the cases. We are answering the question of whether these models solved the cases in the same way, and provide an insight into the challenges of the remaining models.
Are the models solving the problems in the same way?
Almost all cases were solved in the same way. The single exception where several top models provided different solutions was the following:
pub fn import_missing(s: &str) -> bool {
    let t = String::from_str(s).unwrap();
    t.contains("foobar")
}
We gave the above implementation to the LLM and asked it to repair the following compilation error:
src/lib.rs:2: no function or associated item named `from_str` found for struct `String` in the current scope
Check out the setup for code repair task for more information on the prompt handed to the LLMs.
The function is called import_missing since we expected the models to simply add the missing import to make the implementation compile. But instead, the LLMs came up with different fixes.
Solution 1 was to replace String::from_str(s).unwrap() with String::from(s). Models providing this solution include
- OpenAI: o4 Mini
- Amazon: Nova Pro 1.0
- Qwen: Qwen3 32B
pub fn import_missing(s: &str) -> bool {
    let t = String::from(s);
    t.contains("foobar")
}
Solution 2 was to replace String::from_str(s).unwrap() with s.to_string(). Models providing this solution include
- Google: Gemini 1.5 Flash
- AI21: Jamba 1.6 Large
- OpenAI: o3-mini (2025-01-31) (reasoning-effort=high)
pub fn import_missing(s: &str) -> bool {
    let t = s.to_string();
    t.contains("foobar")
}
Solution 3 was our expected solution: adding the missing import use std::str::FromStr. Models providing this solution include
- Anthropic: Claude 3.7 Sonnet (thinking)
- Mistral: Ministral 8B
- Google: Gemma 3 4B
use std::str::FromStr;

pub fn import_missing(s: &str) -> bool {
    let t = String::from_str(s).unwrap();
    t.contains("foobar")
}
What are the problems of models not reaching 100 percent?
Fixing compile-time errors seems like the easiest of our tasks, so we expected many models to rank high, or to actually achieve the full score. Here are some insights about why some LLMs did not perform as well as expected.
There are models which are pretty close to 100% but provide excessive answers. That is, they provide the correct source code in a code fence but also add some kind of additional text, even though they are explicitly asked not to do that. This additional text can be an explanation of why this is the correct solution, or just a simple “Here's the corrected code”. Among those LLMs are:
- 01.AI: Yi Large
- Cohere: Command A
- DeepSeek: DeepSeek V3 0324
A similar type of misbehavior was shown by Cohere: Command R7B (12-2024): in one response, the returned source code was not wrapped in a code fence even though we specifically ask for that in the prompt.
A task case causing frequent failure was the import_missing example already discussed here.
For instance, Qwen: Qwen3 235B A22B and Qwen: Qwen3 8B responded with this piece of code:
pub fn import_missing(s: &str) -> bool {
    let t = String::from(s).unwrap();
    t.contains("foobar")
}
So essentially, the model replaced String::from_str(s).unwrap() with String::from(s).unwrap(), which does not compile either. Note that without the trailing .unwrap() it would have been correct.
Google: Gemma 3 27B used the wrong use directive: instead of use std::str::FromStr it added use std::string::FromStr.
Cohere: Command R+ (08-2024), Cohere: Command R+ and Cohere: Command R fixed the given compilation error in all instances, but altered other parts of the code such that our tests could not be executed. This happened to a specific case only:
pub fn variable_unknown(x: i32) -> i32 {
    y = x;
    if x > 0 {
        return y;
    }
    if x < 0 {
        return -y;
    }
    return 0;
}
The LLMs are asked to fix the undeclared variable y. Cohere: Command R+ and Cohere: Command R responded with the following:
fn variable_unknown(x: i32) -> i32 {
    let mut y = 0;
    y = x;
    if x > 0 {
        return y;
    }
    if x < 0 {
        return -y;
    }
    0
}
Note that the pub modifier of the function is missing, which means that our tests cannot be executed since those are in a separate module. While for Command R this happened in three runs, it only happened once for Command R+.
Command R+ (08-2024) modified the function name in two runs, calling it variable_fixed and variable_known. This, of course, also resulted in non-executable tests.
Mistral: Mistral 7B Instruct and Mistral: Mistral 7B Instruct v0.3 are extremely chatty. Every single response contained some explanation although we deliberately asked for just the code. Apart from the chattiness, both models had issues fixing the missing import problem. While Mistral 7B Instruct was not able to fix the compilation error at all, Mistral 7B Instruct v0.3 was able to provide a correct solution in one run out of five.
Most of the wrong answers were different. One attempt tried to implement the FromStr trait for the argument type, but imported a private trait (use std::string::FromStr), which does not work. Also, only traits defined in the current crate can be implemented for non-local types.
use std::string::FromStr;

impl FromStr for &str {
    type Err = ();
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        s.parse().map_err(|_| ())
    }
}

pub fn import_missing(s: &str) -> bool {
    let t: String = s.parse().unwrap();
    t.contains("foobar")
}
Same approach but a bit different: It does not compile because of the trait implementation. Just the function alone would be the correct solution.
use std::string::FromStr;

impl FromStr for &str {
    type Err = std::num::ParseIntError;
}

pub fn import_missing(s: &str) -> bool {
    let t = s.to_string();
    t.contains("foobar")
}
Again similar: The function alone would be correct, but the trait implementation makes the solution not compile.
use std::str::FromStr;

impl FromStr for &str {
    type Err = <str as std::convert::TryFrom<&str>)->Result<String, <str as std::convert::TryFrom<&str>>::Error>;
    fn from_str(s: &str) -> Result<String, <str as std::convert::TryFrom<&str>>::Error> {
        s.to_string()
    }
}

pub fn import_missing(s: &str) -> bool {
    let t = s.to_string();
    t.contains("foobar")
}
Using a wrong use directive (use std::str::FromStr would have been correct):
use std::string::FromStr;

pub fn import_missing(s: &str) -> bool {
    let t = String::from_str(s).unwrap();
    t.contains("foobar")
}
Mistral: Mistral 7B Instruct v0.3 also claimed to have fixed the missing type of the function parameter in the following example:
pub fn syntax_error_type_missing(x) -> i32 {
    if x > 0 {
        return 1;
    }
    if x < 0 {
        return -1;
    }
    return 0;
}
But in reality it just added a semicolon at the end of the function:
pub fn syntax_error_type_missing(x) -> i32 {
    if x > 0 {
        return 1;
    }
    if x < 0 {
        return -1;
    }
    ; // Added a semicolon here
}
Besides the semantic issues above, the responses of Mistral: Mistral 7B Instruct also had some syntactic issues like forgetting parentheses or semicolons, or adding generic parameters to non-generic methods, e.g. let t = String::from_str(s).unwrap::<String>;.
Distribution of solved cases
Case | Tests passing |
---|---|
importMissing | 89.36% |
openingBracketMissing | 96.15% |
syntaxErrorTypeMissing | 94.36% |
typeUnknown | 96.48% |
variableUnknown | 96.04% |
This table shows all code repair test cases and their average portion of passing tests over all models. For example, the typeUnknown case has an average of 96.5% passing tests, making it the least challenging case. The most challenging case is importMissing, with “only” 89.4% average passing tests.
Setup of the code repair task
The LLM is given a piece of source code and a list of compilation errors, and is asked to provide a fixed version of the code. The response is then validated by compiling the response to make sure that the compilation errors are fixed, and executing a predefined set of tests to make sure that the model did not return a completely different program.
For instance, the prompt for the example discussed in the result section is as follows:
Given the following Rust code file "src/lib.rs" and a list of compilation errors, modify the code such that the errors are resolved.
The response must contain only the source code in a fenced code block and nothing else.
```rust
pub fn import_missing(s: &str) -> bool {
    let t = String::from_str(s).unwrap();
    t.contains("foobar")
}
```
The list of compilation errors is the following:
- src/lib.rs:2: no function or associated item named `from_str` found for struct `String` in the current scope
The response is then written to a file called src/lib.rs. The following predefined tests are written to tests/import_missing_test.rs. The whole project is then compiled and the tests are executed, as described here.
use import_missing::import_missing;

#[cfg(test)]
mod tests {
    use super::*;
    use std::assert_eq;

    #[test]
    fn success() {
        assert_eq!(true, import_missing("foobar"));
    }

    #[test]
    fn failure() {
        assert_eq!(false, import_missing("bar"));
    }
}
Noteworthy insights
More tools are beginning to support interoperability by adding structured output to their operations. For example, the latest Go 1.24 release includes structured output for builds and tests. Rust also supports JSON output for the rustc compiler, which is great for parsing program errors before handing them to the LLMs.
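As a rough sketch of how such structured output can be consumed (our own example; it assumes a Cargo project in the current directory and keeps the filtering deliberately simple):

```rust
use std::process::Command;

/// Runs `cargo build` with JSON message output and returns the raw
/// compiler-message lines, which can then be parsed into the error list
/// that is handed to the LLM.
fn collect_compiler_messages() -> std::io::Result<Vec<String>> {
    let output = Command::new("cargo")
        .args(["build", "--message-format=json"])
        .output()?;
    let stdout = String::from_utf8_lossy(&output.stdout);
    Ok(stdout
        .lines()
        .filter(|line| line.contains("\"reason\":\"compiler-message\""))
        .map(|line| line.to_string())
        .collect())
}
```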
Transpile
We discuss results, setup of the task, and noteworthy insights.
Results
Only a single model managed to score 100% on code transpilation: OpenAI: GPT-4.1. On average, models reached 72.70%, with 23.58% of models scoring >90%.
Are the models solving the problems in the same way?
A noteworthy example is the following isSorted function, with the unusual behavior that it returns false for empty arrays.
class Sort {
    static boolean isSorted(int[] a) {
        int i = 0;
        while (i < a.length - 1 && a[i] <= a[i + 1]) {
            i++;
        }
        return i == a.length - 1;
    }
}
Although empty arrays are commonly considered to be sorted, we define them as not sorted in the reference implementation and test suite to make the case a bit more challenging for the LLMs. In order to successfully transpile the algorithm, they actually need to understand the semantics of the program, and cannot be misled by what is usually expected of a sorting algorithm. For instance, the above Java implementation is given to the models and they are asked to transpile it to Rust. A one-to-one translation of this algorithm would be of the following form:
pub fn is_sorted(a: &[i32]) -> bool {
    let mut i = 0;
    while i + 1 < a.len() && a[i] <= a[i + 1] {
        i += 1;
    }
    i == a.len() - 1
}
However, a.len() returns an unsigned integer, so a.len() - 1 results in a panic (“attempt to subtract with overflow”) if a.len() is zero. Models providing this solution in some run include:
- OpenAI: ChatGPT-4o
- OpenAI: o1-mini (2024-09-12)
- Qwen: Qwen-Plus
- Google: Gemini 2.5 Pro Preview
- xAI: Grok 3 Beta
Some models managed to avoid that pitfall by using the function usize::saturating_sub(rhs: usize). Models providing this solution in some run include:
- OpenAI: o1-mini (2024-09-12)
- Google: Gemini 2.5 Pro Preview
pub fn is_sorted(a: &[i32]) -> bool {
    let mut i = 0;
    while i + 1 < a.len() && a[i] <= a[i + 1] {
        i += 1;
    }
    i == a.len().saturating_sub(1)
}
But that changes the behavior of the function since a.len().saturating_sub(1) returns zero if a.len() is zero. That is, the function is_sorted now returns true if an empty array is passed, which is again not a correct transpilation.
That means a correct solution introduces a guard at the beginning of the function to check for the empty array. Models providing this solution in some run include:
- OpenAI: GPT-4.1
- OpenAI: ChatGPT-4o
- Google: Gemini 2.5 Pro Preview
pub fn is_sorted(a: &[i32]) -> bool {
    if a.is_empty() {
        return false;
    }
    let mut i = 0;
    while i + 1 < a.len() && a[i] <= a[i + 1] {
        i += 1;
    }
    i == a.len() - 1
}
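A quick check of the semantic difference (a hypothetical test of ours, not part of the benchmark suite): the one-to-one translation panics on a.len() - 1, the saturating_sub variant returns true, and only the guarded variant above returns false as required by the reference behavior.

```rust
#[test]
fn test_empty_slice_is_not_sorted() {
    // Only the guarded variant reproduces the reference semantics.
    assert_eq!(is_sorted(&[]), false);
}
```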
What are the problems of models not reaching 100 percent?
As outlined above, lots of models had problems with the is_sorted task, especially the case of an empty array as input. We have seen these issues across the board:
- OpenAI: GPT-4.1-mini
- Qwen: Qwen 2.5 Plus
- DeepSeek: DeepSeek V3
- Google: Gemini 2.5 Pro (preview)
- xAI: Grok 3 Beta
We have also seen several issues when initializing Rust vectors. Many models would skip proper type information, which then leads to non-compiling code. The following example is from Cohere: Command A 111B:
pub fn pascals_triangle(rows: i32) -> Result<Vec<Vec<i32>>, String> {
    if rows < 0 {
        return Err("Rows can't be negative!".to_string());
    }
    let rows = rows as usize;
    let mut triangle = Vec::with_capacity(rows); // Needs type information.
    for i in 0..rows {
        let mut row = Vec::with_capacity(i + 1);
        row.push(1);
        if i > 0 {
            for j in 1..i {
                row.push(triangle[i - 1][j - 1] + triangle[i - 1][j]);
            }
        }
        row.push(1);
        triangle.push(row);
    }
    Ok(triangle)
}
When initializing triangle with let mut triangle = Vec::with_capacity(rows);, type information is needed; hence the above case does not compile. We have also seen similar issues for the following models:
- 01.AI: Yi Large
- DeepSeek: DeepSeek V3 (0324) 685B
- Qwen: Qwen 3 235B A22B
- Qwen: Qwen 3 8B
- Cohere: Command R+ (04-2024) 104B
Distribution of solved cases
Case | Tests passing |
---|---|
balancedBrackets | 88.32% |
binarySearch | 72.94% |
cascadingIfElse | 92.35% |
isSorted | 73.86% |
pascalsTriangle | 40.40% |
This table shows all transpile test cases and their average portion of passing tests over all models. For example, the pascalsTriangle case only has 40.4% of tests passing on average, making it the most challenging case. The least challenging case is cascadingIfElse, with 92.4% of tests passing on average.
Setup of the transpile task
For this task, a model receives a Rust function stub that it needs to complete and an implementation in a different programming language (Go, Java, Ruby). We then run a test suite to check if the transpilation was successful.
An example prompt would be:
Given the following Java code file, transpile it into a Rust code file.
The response must contain only the transpiled Rust source code in a fenced code block and nothing else.
```java
package com.eval;
public class BalancedBrackets {
    static boolean hasBalancedBrackets(char[] charArray) {
        int brackets = 0;
        for (char ch : charArray) {
            if (ch == '[') {
                brackets++;
            } else if (ch == ']') {
                brackets--;
            } else {
                return false; // Non-bracket characters.
            }
            if (brackets < 0) { // Closing bracket before opening bracket.
                return false;
            }
        }
        return brackets == 0;
    }
}
```
The transpiled Rust code file must have the package "" and the following signature:
```rust
pub fn has_balanced_brackets(char_array: &str) -> bool {
    // Implement this function to check if brackets are balanced
    todo!()
}
```
The test suite looks like this:
use balanced_brackets::has_balanced_brackets;

#[test]
fn test_empty_string() {
    // Empty string
    assert_eq!(has_balanced_brackets(""), true);
}

#[test]
fn test_single_opening_bracket() {
    // Single opening bracket (unbalanced)
    assert_eq!(has_balanced_brackets("["), false);
}

#[test]
fn test_unclosed_brackets() {
    // Unclosed brackets (unbalanced)
    assert_eq!(has_balanced_brackets("[[["), false);
}

#[test]
fn test_simple_balanced_brackets() {
    // Simple balanced brackets
    assert_eq!(has_balanced_brackets("[[]]"), true);
}

#[test]
fn test_nested_balanced_brackets() {
    // Nested balanced brackets
    assert_eq!(has_balanced_brackets("[[[[]]]]"), true);
}
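For reference, a transpiled implementation that passes this suite could look like the following (our own sketch, not an official reference solution):

```rust
pub fn has_balanced_brackets(char_array: &str) -> bool {
    let mut brackets: i32 = 0;
    for ch in char_array.chars() {
        if ch == '[' {
            brackets += 1;
        } else if ch == ']' {
            brackets -= 1;
        } else {
            return false; // Non-bracket characters.
        }
        if brackets < 0 { // Closing bracket before opening bracket.
            return false;
        }
    }
    brackets == 0
}
```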
Noteworthy insights
Models adding additional unit tests
While it is common to place tests within implementation files for Rust, we don’t want to place them into the initial stub file to ensure that the models cannot know or modify the test suite. Therefore, we make use of Rust’s integration testing feature and put the tests into a separate testing package. Models are graded according to how many tests of the test suite pass. Because of an oversight on our end, both unit tests and integration tests are used to compute the score, which means that models can just add unit tests alongside their implementation to get a better score. The following example shows a response from OpenAI: o4-mini:
pub fn is_sorted(a: &[i32]) -> bool {
    if a.len() == 0 {
        return false;
    }
    let mut i = 0;
    while i < a.len() - 1 && a[i] <= a[i + 1] {
        i += 1;
    }
    i == a.len() - 1
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_empty() {
        assert_eq!(is_sorted(&[]), false);
    }

    #[test]
    fn test_single() {
        assert_eq!(is_sorted(&[1]), true);
    }

    #[test]
    fn test_sorted() {
        assert_eq!(is_sorted(&[1, 2, 2, 3, 5]), true);
    }

    #[test]
    fn test_unsorted() {
        assert_eq!(is_sorted(&[1, 3, 2, 4]), false);
    }
}
By adding four tests of its own, the model was able to reach a total of 8 out of 5 reachable passing tests, which was very suspicious and luckily led us to investigate.
Here is another example by Meta: Llama 3.3 70B:
pub fn binary_search(a: &[i32], x: i32) -> i32 {
    let mut index: i32 = -1;
    let mut min: usize = 0;
    let mut max: usize = a.len() - 1;
    while index == -1 && min <= max {
        let m = (min + max) / 2;
        if x == a[m] {
            index = m as i32;
        } else if x < a[m] {
            max = m - 1;
        } else {
            min = m + 1;
        }
    }
    return index;
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_binary_search() {
        let a = [1, 2, 3, 4, 5, 6, 7, 8, 9];
        assert_eq!(binary_search(&a, 5), 4);
        assert_eq!(binary_search(&a, 0), -1);
        assert_eq!(binary_search(&a, 3), 2);
        assert_eq!(binary_search(&a, 9), 8);
    }
}
Note that in this case, the model does not even gain a scoring advantage, as the newly added test fails. The new assertions reside within a single test case (#[test]), and the implementation cannot handle the second assertion without panicking (searching for element 0 eventually leads to an underflow), so the additional unit test fails anyway.
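For completeness, here is the trace of that failing assertion (our annotation):

```rust
// binary_search(&[1, 2, 3, 4, 5, 6, 7, 8, 9], 0):
// min = 0, max = 8 -> m = 4, 0 < a[4] -> max = 3
// min = 0, max = 3 -> m = 1, 0 < a[1] -> max = 0
// min = 0, max = 0 -> m = 0, 0 < a[0] -> max = m - 1
//                                        ^ 0usize - 1 underflows and panics in debug builds
```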
Test suite specifications
As described here, the transpile task consists of an origin implementation in some language X, and the LLMs are asked to translate that into Rust. The result is then verified by running a test suite which is not known to the LLM.
We ourselves apply AI to tedious tasks daily, so we also used LLMs to transpile the verification test suite into the target language. Even though we have a thorough review process in place, an LLM-generated mistake slipped through. Here’s what happened:
We asked Claude Code to transpile the following Java test suite. Note that an empty array is considered to be not sorted, as explained here.
package com.eval;

import org.junit.jupiter.api.*;
import static org.junit.jupiter.api.Assertions.*;

public class IsSortedTest {
    @Test
    public void isSorted1() {
        int[] a = {};
        boolean actual = IsSorted.isSorted(a);
        assertFalse(actual);
    }

    @Test
    public void isSorted2() {
        int[] a = { 5, 4, 3, 2, 1 };
        boolean actual = IsSorted.isSorted(a);
        assertFalse(actual);
    }

    @Test
    public void isSorted3() {
        int[] a = { 0 };
        boolean actual = IsSorted.isSorted(a);
        assertTrue(actual);
    }

    @Test
    public void isSorted4() {
        int[] a = { 1, 2, 3, 4, 5 };
        boolean actual = IsSorted.isSorted(a);
        assertTrue(actual);
    }

    @Test
    public void isSorted5() {
        int[] a = { 1, 2, 10, 11, 20, 21 };
        boolean actual = IsSorted.isSorted(a);
        assertTrue(actual);
    }
}
As a result we got the Rust tests below. Everything seems correct at first glance, but note that in the test case for empty arrays we expect false in Java and true in Rust.
use is_sorted::is_sorted;

#[test]
fn test_empty_array() {
    // Empty array should be considered sorted
    assert_eq!(is_sorted(&[]), true);
}

#[test]
fn test_descending_array() {
    // Descending array
    assert_eq!(is_sorted(&[5, 4, 3, 2, 1]), false);
}

#[test]
fn test_single_element_array() {
    // Single element array
    assert_eq!(is_sorted(&[0]), true);
}

#[test]
fn test_ascending_array() {
    // Ascending array
    assert_eq!(is_sorted(&[1, 2, 3, 4, 5]), true);
}

#[test]
fn test_ascending_array_with_jumps() {
    // Ascending array with larger jumps
    assert_eq!(is_sorted(&[1, 2, 10, 11, 20, 21]), true);
}
So Claude Code did not perform a proper 1-to-1 transpilation, but instead overruled our definition, deciding that an empty array should always be considered sorted. A reminder to always review all new code, be it written by a human or an LLM.
Setup
Terminology
The DevQualityEval benchmark evaluates LLMs and LRMs (Large Reasoning Models) on software development use cases. This report introduces the newly added Rust support. The benchmark results span metrics that are introduced and explained with every section. Additional metrics and details are defined in the full leaderboard.
Since LLMs and LRMs are strongly related we are using just “LLM” or “model” to refer to both of them.
Since the benchmark is touching various technical topics that have terminologies on their own, we need to further define some common terminology for these deep dives:
- Each “task type”, or just “task”, is a distinct problem category, for example: writing tests.
- A “task case”, or just “case”, is a concrete instantiation of such a problem, for example: a binary search implementation written in Go that needs unit tests.
- An API request attempt is a query to an LLM API provider. There are up to 10 request attempts for each task case if an API is unreliable.
- An API request retry is a query to an LLM API provider after the initial request failed. There are up to 9 request retries for each task case if an API is unreliable.
Rust
We installed Rust via rustup. Obtaining code coverage data from Rust tests is sadly not trivial by default, so we opted to use cargo-llvm-cov, a third-party tool that takes over all the heavy lifting for us.
Evaluation
DevQualityEval was configured to perform 5 evaluation runs per model and to permit up to 10 attempts for every LLM query. Also, every LLM query needs to be handled by the models within 20 minutes. While this might sound unreasonably high, it is necessary for some insanely slow models. We looked closer at the performance data from DevQualityEval v1.0 and extracted the distribution of query response times for DeepSeek: R1:

We hope you enjoyed reading this deep dive. We would love to hear your thoughts and feedback on how you liked the details and how we can improve both our deep dives and the DevQualityEval benchmark overall.
If you are interested in joining our development efforts for the DevQualityEval benchmark: GREAT, let’s do it! You can use our issue list and discussion forum to interact or write to us directly at markus.zimmermann@symflower.com or on Twitter.