The project was to extend our DevQualityEval LLM code generation benchmark with a new language: Ruby. We successfully used LLMs to transpile existing Java and Go code (tasks and test cases) to Ruby. The workflow of this project and our experiences are described below.
Table of contents:
- Extending the DevQualityEval with Ruby
- Project definition
- Challenges of transpiling existing code to Ruby
- Process
- Experiences: Using LLMs to transpile Go/Java code to Ruby
Extending the DevQualityEval with Ruby
Learning a new language is hard. To extend DevQualityEval’s existing Go and Java repositories with Ruby through manual transpilation, we would have needed to know the basics, including:
- How does the language work?
- How is a Ruby project structured?
- What testing tools are available?
- How to run the tests?
- What package management tools are available?
We prompted LLMs to transpile the test repositories for each task in DevQualityEval, which let us skip most of the learning process and gave us an automated way to do the transpilation.
We used a combination of two models for transpiling Go and Java code to Ruby: Claude 3.5 Sonnet and GPT-4o. We chose these two LLMs because our results show that they are among the most powerful in the tested languages (Java and Go), so we figured they had the potential to work well with Ruby too.
Project definition
We prompted our chosen models to transpile the existing Java and Go test repositories in DevQualityEval to Ruby. DevQualityEval contains the following repositories for the evaluation tasks:
Plain & Light repositories (used for the write test task)
- plain repository: contains a function that does nothing.
- light repository: contains more complex examples.
In the evaluation, LLMs are prompted to write tests for the code provided in these repositories.
👉 The goal with these repositories was to transpile the source code from Java or Go to Ruby.
Mistakes (used for the code repair task)
The mistakes repository contains code with errors and predefined tests. In the evaluation, LLMs are prompted to fix the errors. We then run the predefined tests to check if the generated code is correct.
👉 Our goal here was to get examples for Ruby that are as close as possible to the original ones we defined for Java and Go. The predefined tests also needed to be transpiled.
Transpile (used for the transpile task)
Uses 5 examples from the light repository. In the evaluation, LLMs are prompted to transpile code between languages.
👉 The goal was to transpile the test cases that enable us to verify the Ruby output of benchmarked LLMs in the transpile task, so we can benchmark how well LLMs transpile code from one language to another.
Challenges of transpiling existing code to Ruby
In our case, since all three languages (Java, Go, and Ruby) are imperative and share similar concepts (conditionals, loops, functions, etc.), the transition is somewhat easier. However, Go and Java are statically typed, while Ruby is dynamically typed and has no type annotations, so some code examples in the repositories have no 1:1 equivalent. For instance, DevQualityEval has an example (for Java and Go) of a function whose parameter type is missing, which results in a compilation error. Since function parameters in Ruby carry no type declarations, we needed the LLMs to come up with a Ruby counterpart for examples like this one:
func syntaxErrorTypeMissing(x) int { // The parameter type is missing.
...
}
Ruby is an interpreted language without a separate compilation step. For the Java and Go examples in the mistakes repository, we can just compile the code and check the list of errors. With Ruby, we cannot do that, so we had to run the Ruby tests and collect the list of errors from the test run.
There’s also a difference in testing frameworks and tools:
- Go has a standard testing framework (go test), a standard library (testing), and a de-facto standard assertion framework (testify).
- Java has a de-facto standard testing framework (JUnit 5) and well-established build tooling with Maven and Gradle.
- Ruby has two commonly used testing frameworks (minitest and RSpec) that are used differently, and lots of tools that help you manage your Ruby installation.
Finally, conventions also differ: for example, both Go and Java use camelCase, while Ruby uses snake_case.
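As a small illustration of the naming difference, identifiers could be converted mechanically along these lines. This is only a sketch: the helper name to_snake_case is hypothetical and not part of DevQualityEval, and it only handles simple camelCase names.

```ruby
# Hypothetical helper (not part of DevQualityEval): converts a simple
# camelCase identifier to Ruby's snake_case convention.
def to_snake_case(name)
  # Put an underscore between a lowercase letter or digit and the
  # uppercase letter that follows it, then lowercase everything.
  name.gsub(/([a-z\d])([A-Z])/, '\1_\2').downcase
end

puts to_snake_case("hasBalancedBrackets") # has_balanced_brackets
```

In practice the LLMs applied this convention on their own, as the transpiled examples further down show.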
Process
Understanding project structure
We asked Sonnet about the project structure of a Ruby project using the minitest framework. It responded with the following project structure, explaining what the folders/files are used for:
project_root/
├── lib/
│ └── your_code_files.rb
├── test/
│ ├── test_helper.rb
│ └── test_your_code_files.rb
├── Gemfile
└── Rakefile
We also asked for alternative project structures, but after looking at some Ruby projects on GitHub, we decided to go with this structure.
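With a layout like this, the Rakefile usually wires up the test run. A minimal sketch using Rake's Rake::TestTask follows; the task name and the file pattern are assumptions chosen to match the tree above, not necessarily what DevQualityEval uses.

```ruby
# Rakefile (sketch): defines a `test` task that runs every
# test/test_*.rb file with lib/ and test/ on the load path.
require "rake/testtask"

Rake::TestTask.new(:test) do |t|
  t.libs << "test"              # "lib" is on the load path by default
  t.pattern = "test/test_*.rb"  # assumed naming scheme from the tree above
end
```

Running rake test from the project root would then execute the test files.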
Selecting a testing framework
To be able to transpile the test repositories, we first had to clear up the question of testing frameworks. We asked Claude 3.5 Sonnet about the most often used Ruby test frameworks. The response listed:
RSpec
minitest
Cucumber
Test::Unit
Capybara
We then asked for the pros and cons of the top two frameworks on that list. Sonnet provided us with the following information:
Ruby testing frameworks | Pros | Cons |
---|---|---|
RSpec | | |
Minitest | | |
We decided to go with minitest for the following reasons:
- Fast test execution: We want to keep the evaluation benchmark’s execution time as low as possible.
- No dependencies needed: It already ships with Ruby.
Transpiling code: Light repository
We had to transpile Go examples from the light repository to Ruby. Both GPT-4o and Claude 3.5 Sonnet did a great job at transpiling the repository. Here’s an example of code generated by GPT-4o and Claude 3.5 Sonnet for has_balanced_brackets.
The original Go code:
package light

func hasBalancedBrackets(charArray string) bool {
	brackets := 0
	for _, ch := range charArray {
		if ch == '[' {
			brackets++
		} else if ch == ']' {
			brackets--
		} else {
			return false // Non-bracket characters.
		}
	}
	if brackets < 0 { // Closing bracket before opening bracket.
		return false
	}
	return brackets == 0
}
GPT-4o’s output:
def has_balanced_brackets?(char_array)
  brackets = 0
  char_array.each_char do |ch|
    if ch == '['
      brackets += 1
    elsif ch == ']'
      brackets -= 1
    else
      return false # Non-bracket characters.
    end
    return false if brackets < 0 # Closing bracket before opening bracket.
  end
  brackets == 0
end
Claude 3.5 Sonnet’s output:
class BalancedBrackets
  def self.has_balanced_brackets?(char_array)
    brackets = 0
    char_array.each_char do |ch|
      if ch == '['
        brackets += 1
      elsif ch == ']'
        brackets -= 1
      else
        return false # Non-bracket characters.
      end
      return false if brackets < 0 # Closing bracket before opening bracket.
    end
    brackets == 0
  end
end
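To give an idea of how such a transpiled function can be checked with minitest, here is a minimal sketch. The test class name and the chosen inputs are our own assumptions for illustration, not the actual DevQualityEval test cases; the function body repeats the logic of the outputs above so the file is self-contained.

```ruby
require "minitest/autorun"

# Transpiled function under test (same logic as the outputs above).
def has_balanced_brackets?(char_array)
  brackets = 0
  char_array.each_char do |ch|
    if ch == "["
      brackets += 1
    elsif ch == "]"
      brackets -= 1
    else
      return false # Non-bracket characters.
    end
    return false if brackets < 0 # Closing bracket before opening bracket.
  end
  brackets == 0
end

# Hypothetical test cases, one per branch of the function.
class TestHasBalancedBrackets < Minitest::Test
  def test_balanced_input
    assert has_balanced_brackets?("[[]]")
  end

  def test_empty_input
    assert has_balanced_brackets?("")
  end

  def test_closing_bracket_first
    refute has_balanced_brackets?("][")
  end

  def test_non_bracket_character
    refute has_balanced_brackets?("[a]")
  end
end
```

Saved as a file under test/, this runs with plain ruby because minitest/autorun executes all test classes when the script exits.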
Transpiling code: Mistakes repository
We knew this would be more challenging because of Ruby’s syntax and its dynamic typing. Since Ruby has no separate compilation step, we needed to actually run the tests to get the list of errors. Even then, we ran into problems because syntax errors and runtime errors are reported differently, requiring different parsing approaches.
One of the examples in the mistakes repository has a missing import. The LLM responded with a good example: a function that calls a method of the JSON module without requiring the module first. This example is very similar to the ones our benchmark had for Go and Java.
def parse_json(json_string)
  return JSON.parse(json_string)
end
When test-driving the benchmark in a Docker container, we noticed the tests for this function were passing right away. The expected behavior was to get a list of errors so we could hand them to the LLMs. This was the case on our local machines, but not in the Docker container. We then used GPT-4o to investigate this behavior. It speculated that there may have been some Ruby gem already using the JSON module, making the import for the module not required for our example. Besides pointing out what could be the reason for the problem, it also gave us a hint on how to debug it:
“You can inspect the loaded features (i.e., files that have been required) by adding puts $LOADED_FEATURES in your code.”
After following the suggestion and re-running the benchmark, we checked the modules that were being used, and the JSON module was in fact already loaded. To save the time we would have spent investigating the problem, we decided to go with another example instead. We checked Ruby’s module list and found one that was not being loaded: the CSV module.
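A check along the following lines can confirm whether a library is already loaded in a given environment. It is a sketch: the helper name feature_loaded? is hypothetical, and matching on the file's base name is a simplification of how entries in $LOADED_FEATURES look.

```ruby
# Hypothetical helper: returns true if a file with the given feature name
# (for example "json" or "csv") has already been required in this process.
def feature_loaded?(name)
  $LOADED_FEATURES.any? { |path| File.basename(path, ".*") == name }
end

# The first answer depends on the environment, which is exactly the
# difference we saw between our local machines and the Docker container.
puts "json loaded before require? #{feature_loaded?('json')}"

require "json"
puts "json loaded after require? #{feature_loaded?('json')}"
```

If the first line already prints true, a "missing require" example built on that library will not fail as intended.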
We also had to tackle the problems resulting from Ruby having no static types. The following Go example cannot be directly transpiled to Ruby since Ruby has no type annotations:
package typeUnknown

func typeUnknown(x intt) int {
	if x > 0 {
		return 1
	}
	if x < 0 {
		return -1
	}
	return 0
}
In this example, we are forcing an error by misspelling the type int. We asked GPT-4o what the equivalent would be in Ruby. The suggestion was:
def type_unknown(x)
  # Attempting to use `Intt` as if it were defined
  if x.is_a?(Intt)
    if x > 0
      return 1
    elsif x < 0
      return -1
    end
  end
  return 0
end
In this example, the type of x is checked at runtime. Because the constant Intt is not defined, the check raises an error (a NameError) when the function is called; the correct check would be x.is_a?(Integer) instead of x.is_a?(Intt).
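The important difference to Go is when the mistake surfaces. A short sketch of the behavior, using a condensed version of the function above:

```ruby
# Ruby counterpart of the misspelled-type example: `Intt` is an
# undefined constant, so the mistake only surfaces when the line runs.
def type_unknown(x)
  if x.is_a?(Intt)
    return 1 if x > 0
    return -1 if x < 0
  end
  0
end

# Defining the method above raises nothing; calling it does.
begin
  type_unknown(1)
rescue NameError => e
  puts "#{e.class}: #{e.message}"
end
```

This is why the error only shows up in the output of a test run and not in any earlier step.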
Missing brackets represented another challenge in the transpilation. Java and Go use curly braces to define code blocks. One of the examples of the mistakes repository defines a function that’s missing the function body’s opening curly brace:
package openingBracketMissing

func openingBracketMissing(x int) int
	if x > 0 {
		return 1
	}
	if x < 0 {
		return -1
	}
	return 0
}
This example cannot be transpiled 1:1 to Ruby since the language does not delimit function bodies with curly braces, but rather uses the end keyword to mark where a code block ends. We asked GPT-4o what the closest approximation in Ruby could be, and it suggested a function with the def keyword missing. Although the suggestion was valid, we decided to go with a different approach and defined a function without the end keyword, since it is closer to the example we already have.
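Unlike the undefined constant above, a missing end is rejected when the file is parsed, before any test runs. The following sketch shows this; the function end_missing and the helper syntax_error_for are hypothetical names, and RubyVM::InstructionSequence is specific to the reference Ruby implementation (CRuby).

```ruby
# Hypothetical helper: returns the parser's error message for invalid
# source, or nil if the source parses. The source is never executed.
# RubyVM::InstructionSequence is specific to CRuby (MRI).
def syntax_error_for(source)
  RubyVM::InstructionSequence.compile(source)
  nil
rescue SyntaxError => e
  e.message
end

# A function whose closing `end` keyword is missing.
MISSING_END = <<~RUBY
  def end_missing(x)
    if x > 0
      return 1
    end
    return 0
RUBY

puts syntax_error_for(MISSING_END)
```

Syntax errors like this one and runtime errors like the NameError above come with differently shaped messages, which matches our experience of needing separate parsing approaches.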
Transpiling code: Transpile repository
The transpile task uses 5 examples from the light repository. We took the Java test cases from the java/transpile repository and asked GPT-4o to transpile them to Ruby. Overall, the transpilation went well.
For each example in the transpile repository for Ruby, we have a file that contains only the Ruby function definition. This is so LLMs know what the generated function signature should look like, which enables us to run the predefined tests. Since Ruby has no type annotations, we decided to give the LLMs a hint about what the function parameter and return types should be, hoping this would increase the chance of LLMs generating correct code. We asked Claude 3.5 Sonnet what would make a good hint: a comment specifying the parameter and return types, or something like Sorbet (a static type checker for Ruby). The response, quoted below, was very interesting:
Using comments: “This approach is simple and can work well for many LLMs. It’s clear and doesn’t require any additional setup or dependencies. However, it relies on the LLM correctly interpreting and following the comment.”
Using Sorbet: “This approach has several advantages:
- It’s more explicit and formal.
- It provides type information for both the parameter and the return value.
- It can be used for static type checking in your Ruby codebase.
However, it requires setting up Sorbet in your project, which might be an overkill if you’re only using it for LLM code generation.”
YARD-style (Claude 3.5 Sonnet suggestion): “For LLM code generation, I’d recommend using the YARD-style comment approach. It’s more explicit than a simple comment, widely recognized, and doesn’t require additional setup. It clearly specifies both the parameter type and return type.”
We decided to go with YARD-style because it was simpler and did not require any additional setup.
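A signature-only file with a YARD-style hint could look roughly like this. It is a sketch under our own assumptions: the wording of the tags and the placeholder body are illustrative, not the exact content of the files in the repository.

```ruby
# Hypothetical signature-only file for the transpile task.
#
# @param char_array [String] the characters to check
# @return [Boolean] true if every bracket is properly opened and closed
def has_balanced_brackets?(char_array)
  # Body intentionally left for the benchmarked model to generate.
  raise NotImplementedError
end
```

The @param and @return tags carry the type information that Go and Java express in the signature itself, without adding any dependency to the project.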
Experiences: Using LLMs to transpile Go/Java code to Ruby
The challenge was supporting Ruby as a new language in the DevQualityEval benchmark without knowing the language and its environment. We used LLMs to gain insights into the most often-used tools of the Ruby language, and to transpile our existing test repositories to Ruby. This helped us save time and integrate Ruby faster in the evaluation.
The tested models (Claude 3.5 Sonnet, GPT-4o) did a great job at transpiling the existing code. We deemed this project a success and will use the same strategy for supporting new languages in DevQualityEval in the future.