Transpiling Go & Java to Ruby using GPT-4o & Claude 3.5 Sonnet

Using LLMs to transpile code from Go & Java to Ruby

The project was to extend our DevQualityEval LLM code generation benchmark with a new language: Ruby. We successfully used LLMs to transpile existing Java and Go code (tasks and test cases) to Ruby. The workflow of this project and our experiences are described below.

Extending the DevQualityEval with Ruby

Learning a new language is hard. To extend DevQualityEval’s existing Go and Java repositories with Ruby through manual transpilation, we would first have needed to learn the basics, including:

  • How does the language work?
  • How is a Ruby project structured?
  • What testing tools are available?
  • How to run the tests?
  • What package management tools are available?

Instead, we prompted LLMs to transpile the test repositories for each task in DevQualityEval. This allowed us to skip most of the learning process and gave us an automated way to do the transpilation.

Support the DevQualityEval project and access the latest results of the benchmark.

We used a combination of two models for transpiling Go and Java code to Ruby: Claude 3.5 Sonnet and GPT-4o. We chose these two LLMs because our benchmark results show that they are among the most powerful models for the languages tested so far (Java and Go), so we figured they had the potential to work well with Ruby too.

Project definition

We prompted our chosen models to transpile the existing Java and Go test repositories in DevQualityEval to Ruby. DevQualityEval contains the following repositories for the evaluation tasks:

Plain & Light repositories (used for the write test task)

  • plain repository: contains a function that does nothing.
  • light repository: contains more complex examples.

In the evaluation, LLMs are prompted to write tests for the code provided in these repositories.

👉 The goal with these repositories was to transpile the source code from Java or Go to Ruby.

Mistakes (used for the code repair task)

The mistakes repository contains code with errors and predefined tests. In the evaluation, LLMs are prompted to fix the errors. We then run the pre-defined tests to check if the generated code is correct.

👉 Our goal here was to get examples for Ruby that are as close as possible to the original ones we defined for Java and Go. The predefined tests also needed to be transpiled.

Transpile (used for the transpile task)

The transpile repository uses 5 examples from the light repository. In the evaluation, LLMs are prompted to transpile code between languages.

👉 The goal was to transpile the test cases that let us verify the Ruby output of benchmarked LLMs in the transpile task, so that we can benchmark how well LLMs transpile code from one language to another.

Challenges of transpiling existing code to Ruby

In our case, since all three languages (Java, Go, and Ruby) are imperative and share similar concepts (conditionals, loops, functions, etc.), the transition is somewhat easier. However, Go and Java are statically typed with explicit type annotations, while Ruby is dynamically typed and its method signatures carry no type annotations, meaning there is no 1:1 equivalent for some code examples in the repositories. For instance, DevQualityEval has an example (for Java and Go) of a function whose parameter type is missing, which results in a compilation error. As method parameters in Ruby are not annotated with types, we needed the LLMs to give us a Ruby counterpart for an example like this one:

func syntaxErrorTypeMissing(x) int { // The parameter type is missing.
   ...
}

Since Ruby is an interpreted language, there is no separate compile step that reports errors up front. For the Java and Go examples in the mistakes repository, we can just compile the code and collect the list of errors. With Ruby, we cannot do that, so we had to run the Ruby tests and collect the errors from the test run.

There’s also a difference in testing frameworks and tools:

  • Go has a standard test runner (go test), a standard library package (testing), and a de-facto standard assertion framework (testify).
  • Java has a de-facto standard testing framework (JUnit 5) and well-established build tooling with Maven and Gradle.
  • Ruby has two commonly used testing frameworks (minitest and RSpec) that are used differently, and lots of tools that help you manage your Ruby installation.

Finally, naming conventions are also different, e.g. both Go and Java use camelCase, while Ruby uses snake_case.
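
To make the naming difference concrete, here is a minimal sketch of a camelCase to snake_case conversion. The helper name camel_to_snake is hypothetical and not part of DevQualityEval. It also does not add the trailing ? that Ruby conventionally uses for predicate methods.

```ruby
# Hypothetical helper: converts a Go/Java-style camelCase identifier
# into a Ruby-style snake_case identifier.
def camel_to_snake(name)
  name
    .gsub(/([A-Z]+)([A-Z][a-z])/, '\1_\2') # split acronyms: "JSONString" -> "JSON_String"
    .gsub(/([a-z\d])([A-Z])/, '\1_\2')     # split word boundaries: "hasBalanced" -> "has_Balanced"
    .downcase
end

puts camel_to_snake("hasBalancedBrackets")
puts camel_to_snake("typeUnknown")
```

The LLMs applied this renaming on their own during transpilation, so a helper like this is only useful for checking names afterwards.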

Process

Understanding project structure

We asked Sonnet about the project structure of a Ruby project using the minitest framework. It responded with the following project structure, explaining what the folders/files are used for:

project_root/
├── lib/
│   └── your_code_files.rb
├── test/
│   ├── test_helper.rb
│   └── test_your_code_files.rb
├── Gemfile
└── Rakefile

We also asked for alternative project structures, but after looking at some Ruby projects on GitHub, we decided to go with this structure.
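
As a rough illustration of how the Rakefile in this layout could tie lib/ and test/ together, here is a minimal sketch using Rake's built-in TestTask. The task name and the file pattern are assumptions chosen to match the tree above, not the benchmark's actual configuration.

```ruby
# Rakefile (sketch): defines a `test` task that runs every test/test_*.rb file.
require "rake/testtask"

Rake::TestTask.new(:test) do |t|
  t.libs << "lib"                     # make `require "your_code_files"` work
  t.libs << "test"                    # make `require "test_helper"` work
  t.pattern = "test/**/test_*.rb"     # assumed naming scheme from the tree above
end

# Running plain `rake` then runs the tests.
task default: :test
```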

Selecting a testing framework

To be able to transpile the test repositories, we first had to clear up the question of testing frameworks. We asked Claude 3.5 Sonnet about the most often used Ruby test frameworks. The response listed:

  • RSpec
  • minitest
  • Cucumber
  • Test::Unit
  • Capybara

We then asked for the pros and cons of the top two frameworks on that list. Sonnet provided us with the following information:

RSpec

Pros:

  • Expressive, readable syntax
  • Rich set of matchers and expectations
  • Extensive mocking and stubbing capabilities
  • Strong community support and extensive documentation

Cons:

  • Steeper learning curve for beginners
  • Slower test execution compared to minitest
  • More complex setup and configuration
  • Can lead to overly verbose tests if not carefully managed

Minitest

Pros:

  • Simple and lightweight
  • Fast test execution
  • Built into the Ruby standard library
  • Easy to learn

Cons:

  • Less expressive syntax compared to RSpec
  • Fewer built-in matchers and expectations
  • Less robust mocking and stubbing capabilities out of the box
  • Smaller ecosystem of extensions and plugins

We decided to go with minitest for the following reasons:

  • Fast test execution: We want to keep the evaluation benchmark’s execution time as low as possible.
  • No dependencies needed: It is already built into the Ruby standard library.
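
For readers who have not seen minitest before, a test file can be as small as the following sketch. The function sign and the class SignTest are hypothetical names used only for illustration; they are not taken from the DevQualityEval repositories.

```ruby
require "minitest/autorun" # runs all defined tests when the script exits

# Hypothetical function under test.
def sign(x)
  return 1 if x > 0
  return -1 if x < 0

  0
end

# Every public method whose name starts with `test_` is run as a test case.
class SignTest < Minitest::Test
  def test_positive
    assert_equal 1, sign(5)
  end

  def test_negative
    assert_equal(-1, sign(-3))
  end

  def test_zero
    assert_equal 0, sign(0)
  end
end
```

Running the file with the ruby command is enough to execute the tests, which is part of what keeps the setup lightweight.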

    Transpiling code: Light repository

    We had to transpile Go examples from the light repository to Ruby. Both GPT-4o and Claude 3.5 Sonnet did a great job at transpiling the repository. Here’s an example of code generated by GPT-4o and Claude 3.5 Sonnet for has_balanced_brackets.

    The original Go code:

    package light
    
    func hasBalancedBrackets(charArray string) bool {
    	brackets := 0
    	for _, ch := range charArray {
    		if ch == '[' {
    			brackets++
    		} else if ch == ']' {
    			brackets--
    		} else {
    			return false // Non-bracket characters.
    		}
    	}
    
    	if brackets < 0 { // Closing bracket before opening bracket.
    		return false
    	}
    
    	return brackets == 0
    }
    

    GPT-4o’s output:

    def has_balanced_brackets?(char_array)
      brackets = 0
      char_array.each_char do |ch|
        if ch == '['
          brackets += 1
        elsif ch == ']'
          brackets -= 1
        else
          return false # Non-bracket characters.
        end
    
        return false if brackets < 0 # Closing bracket before opening bracket.
      end
    
      brackets == 0
    end
    

    Claude 3.5 Sonnet’s output:

    class BalancedBrackets
      def self.has_balanced_brackets?(char_array)
        brackets = 0
        char_array.each_char do |ch|
          if ch == '['
            brackets += 1
          elsif ch == ']'
            brackets -= 1
          else
            return false # Non-bracket characters.
          end
          return false if brackets < 0 # Closing bracket before opening bracket.
        end
        brackets == 0
      end
    end
    

    Transpiling code: Mistakes repository

    We knew this would be more challenging because of Ruby’s syntax and the fact that the language is dynamically typed. Since there is no compile step, we needed to actually run the tests to get the list of errors. Even then, we ran into problems because syntax errors and runtime errors are reported differently, requiring different parsing approaches.
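
The difference between the two error kinds can be sketched in a few lines of Ruby. This is not how the benchmark collects errors (it parses the output of the test run); the helper error_kind is a hypothetical name used only to show that a syntax error is raised while parsing, whereas a runtime error only appears once the offending line executes.

```ruby
# Hypothetical helper: evaluates a Ruby snippet and reports which kind of
# error it produces, if any.
def error_kind(source)
  Object.new.instance_eval(source)
  :none
rescue SyntaxError
  # Raised while parsing, before any of the snippet runs. SyntaxError is not
  # a StandardError, so a bare `rescue` would not catch it.
  :syntax_error
rescue StandardError
  # Raised while the snippet runs, e.g. NameError for an undefined constant.
  :runtime_error
end

puts error_kind("def broken(x)\n  x + 1\n") # the closing `end` is missing
puts error_kind("UndefinedConstantXyz.new") # constant is never defined
puts error_kind("1 + 1")
```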

    One of the examples in the mistakes repository has a missing import. The LLM responded with a good example, a function that uses a JSON module function but does not import it. This example is very similar to the ones our benchmark had for Go and Java.

    def parse_json(json_string)
    	return JSON.parse(json_string)
    end
    

    When test-driving the benchmark in a Docker container, we noticed the tests for this function were passing right away. The expected behavior was to get a list of errors so we could hand them to the LLMs. This was the case on our local machines, but not in the Docker container. We then used GPT-4o to investigate this behavior. It speculated that there may have been some Ruby gem already using the JSON module, making the import for the module not required for our example. Besides pointing out what could be the reason for the problem, it also gave us a hint on how to debug it:

    “You can inspect the loaded features (i.e., files that have been required) by adding puts $LOADED_FEATURES in your code.”

    After following the suggestion and re-running the benchmark, we checked the modules that were being used, and the JSON module was in fact already loaded. To save the time we would have spent investigating the problem, we decided to go with another example instead. We checked Ruby’s module list and found one that was not being loaded: the CSV module.
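
A small sketch of this kind of check is shown below. The helper feature_loaded? is a hypothetical name. It compares file base names only, which is an assumption that holds for simple libraries such as json whose entry file is json.rb.

```ruby
# Hypothetical helper: reports whether a file with the given base name
# (e.g. "json" for ".../json.rb") has already been required.
def feature_loaded?(name)
  $LOADED_FEATURES.any? { |path| File.basename(path, ".*") == name }
end

# Depending on the environment, this may already be true before we require
# anything ourselves, which is the situation we hit in the Docker container.
puts "json loaded before require: #{feature_loaded?('json')}"

require "json"
puts "json loaded after require:  #{feature_loaded?('json')}"
```

If the first line already prints true, a "missing import" example built on that library will not fail in that environment.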

    We also had to tackle the problems resulting from Ruby being dynamically typed. The following Go example cannot be directly transpiled to Ruby because Ruby method signatures carry no type annotations:

    package typeUnknown
    
    func typeUnknown(x intt) int {
    	if x > 0 {
    		return 1
    	}
    	if x < 0 {
    		return -1
    	}
    
    	return 0
    }
    

    In this example, we are forcing an error by misspelling the type int. We asked GPT-4o what the equivalent would be in Ruby. The suggestion was:

    def type_unknown(x)
        # Attempting to use `Intt` as if it were defined
        if x.is_a?(Intt)
            if x > 0
                return 1
            elsif x < 0
                return -1
            end
        end
    
        return 0
    end
    

    In this example, the type of x is checked at runtime. Since the constant Intt is not defined, calling the function raises a NameError; the correct check would be x.is_a?(Integer) instead of x.is_a?(Intt).
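
The following sketch shows the behavior side by side. Defining type_unknown succeeds, because Ruby only resolves the constant when the line runs; the error appears on the first call. The repaired variant type_known is a hypothetical name for the fixed version.

```ruby
# Broken variant: `Intt` is never defined, but this definition alone raises nothing.
def type_unknown(x)
  if x.is_a?(Intt)
    return 1 if x > 0
    return -1 if x < 0
  end

  0
end

# Repaired variant (hypothetical name), using the real Integer class.
def type_known(x)
  if x.is_a?(Integer)
    return 1 if x > 0
    return -1 if x < 0
  end

  0
end

# The error only shows up once the broken function is actually called.
error = begin
  type_unknown(5)
  nil
rescue NameError => e
  e
end

puts "#{error.class}: #{error.message}"
puts type_known(5)
```

This is why the Ruby examples need a test run to surface the error, while the Go and Java originals already fail at compile time.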

    Missing brackets represented another challenge in the transpilation. Java and Go use curly braces to define code blocks. One of the examples of the mistakes repository defines a function that is missing the opening curly brace of the function body:

    package openingBracketMissing
    
    func openingBracketMissing(x int) int
    	if x > 0 {
    		return 1
    	}
    	if x < 0 {
    		return -1
    	}
    
    	return 0
    }
    

    This example cannot be transpiled 1:1 to Ruby since the language does not use curly braces for method bodies, but instead uses the end keyword to mark where a code block ends. We asked GPT-4o what the closest approximation in Ruby could be, and the suggestion was to use a function with the def keyword missing. Although the suggestion was valid, we decided to go with a different approach and defined a function without the closing end keyword, since this is closer to the example we already have.
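
Both candidates fail to parse, which the following sketch illustrates. The two source strings are our own reconstruction of what each variant could look like, not the exact files in the mistakes repository, and the helper syntax_error_for is a hypothetical name. It relies on RubyVM::InstructionSequence.compile, which is specific to the standard Ruby interpreter (CRuby) and compiles code without running it.

```ruby
# Variant we chose: the method's closing `end` is missing.
MISSING_END = <<~RUBY
  def opening_bracket_missing(x)
    if x > 0
      return 1
    end
    if x < 0
      return -1
    end

    return 0
RUBY

# Variant suggested by GPT-4o: the `def` keyword is missing.
MISSING_DEF = <<~RUBY
  opening_bracket_missing(x)
    if x > 0
      return 1
    end
    if x < 0
      return -1
    end

    return 0
  end
RUBY

# Hypothetical helper: returns the SyntaxError raised while compiling the
# source, or nil if the source parses cleanly. The source is never executed.
def syntax_error_for(source)
  RubyVM::InstructionSequence.compile(source)
  nil
rescue SyntaxError => e
  e
end

puts syntax_error_for(MISSING_END).class
puts syntax_error_for(MISSING_DEF).class
```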

    Transpiling code: Transpile repository

    The transpile task uses 5 examples from the light repository. We took the Java test cases from the java/transpile repository and asked GPT-4o to transpile them to Ruby. Overall, the transpilation went well.

    For each example in the transpile repository for Ruby, we have a file that contains only the Ruby function definition. This is so LLMs know what the generated function signature should look like, which enables us to run the predefined tests. Since Ruby is not typed, we decided to give a hint to the LLMs about what the function parameter/return types should be. We were hoping this would increase the chance of LLMs generating code correctly. We asked Claude 3.5 Sonnet what would be a good hint: adding a comment specifying what the parameters and return types were, or using something like Sorbet (a static type-checker for Ruby). The response, quoted below, was very interesting:

    • Using comments: “This approach is simple and can work well for many LLMs. It’s clear and doesn’t require any additional setup or dependencies. However, it relies on the LLM correctly interpreting and following the comment.”

    • Using Sorbet: “This approach has several advantages:

      • It’s more explicit and formal.
      • It provides type information for both the parameter and the return value.
      • It can be used for static type checking in your Ruby codebase.

      However, it requires setting up Sorbet in your project, which might be an overkill if you’re only using it for LLM code generation.”

    • YARD-style (Claude 3.5 Sonnet suggestion): “For LLM code generation, I’d recommend using the YARD-style comment approach. It’s more explicit than a simple comment, widely recognized, and doesn’t require additional setup. It clearly specifies both the parameter type and return type.”

    We decided to go with YARD-style because it was simpler and did not require any additional setup.
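
A signature file with YARD-style hints could look like the sketch below. The tag descriptions and the empty body are assumptions for illustration; the actual stub files in the repository may be worded differently.

```ruby
# Checks whether the given string consists only of balanced square brackets.
#
# @param char_array [String] the characters to check
# @return [Boolean] true if every "[" has a matching "]", false otherwise
def has_balanced_brackets?(char_array)
  # Body intentionally left empty: the benchmarked LLM is expected to fill it in.
end
```

The @param and @return tags are plain comments, so Ruby ignores them at runtime and no extra tooling is needed to run the predefined tests.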

    Experiences: Using LLMs to transpile Go/Java code to Ruby

    The challenge was supporting Ruby as a new language in the DevQualityEval benchmark without knowing the language and its environment. We used LLMs to gain insights into the most commonly used tools of the Ruby ecosystem, and to transpile our existing test repositories to Ruby. This helped us save time and integrate Ruby into the evaluation faster.

    The tested models (Claude 3.5 Sonnet, GPT-4o) did a great job at transpiling the existing code. We deemed this project a success and will use the same strategy for supporting new languages in DevQualityEval in the future.

    2024-08-29