Icon for gfsd IntelliJ IDEA

TreeSitter - the holy grail of parsing source code

We at Symflower fell in love ❤️ with TreeSitter right away.

As a software analysis company we have an inherent need for parsing source code. Until recently we used the maybe most prominent tool for parsing Java code which is JavaParser. JavaParser works exceptionally well and certainly paid its dues for us. However, JavaParser became more and more a bottleneck in terms of performance. Not necessarily because the parser itself is slow, but because we had to perform expensive external Java calls from Go to execute JavaParser. To be able to keep delivering a snappy UX with more features to come, we had to say our farewells to JavaParser and welcome a new parsing library.

As already indicated above, the main requirement for a replacement was the ability to call the new parser natively within our Go code base. Our final decision fell on TreeSitter, written in C and available in Go via the package smacker/go-tree-sitter which provides bindings using cgo. With the migration from JavaParser to TreeSitter we gained a speedup by a factor of 36x for parsing source code in our parser benchmarks 🤯.

The speedup is a perfect start for the journey of optimizing our analysis pipeline towards real time compatibility. However, it is not only the performance which steered us towards TreeSitter, but also various other properties and features which give us an advantage in the mid and long-term.

Since you are reading an article on parsing technology, you are definitely interested in analyzing and generating source code as well. Perfect! We are hiring!

What is TreeSitter?

To rephrase the website, TreeSitter is a generic parser generator and incremental parsing library. It can build a concrete syntax tree (CST) and efficiently update syntax trees as the source file is edited. So it aims to be general enough to parse any programming language, fast enough to parse on every keystroke, and robust enough to provide results even in the presence of syntax errors.

Our experience with TreeSitter so far just underlines these properties. The active community makes the project even more valuable as there already exists a huge amount of language grammars. Grammars, basically the definition of the syntax of a language, are written in JavaScript in a simple domain-specific language, and creating them is fairly intuitive, a few details are tricky though. TreeSitter itself is written in pure C and was therefore already made available to other languages via bindings. There is a playground where you can give TreeSitter a try.

Our additions and fixes to the TreeSitter-related projects got reviewed and accepted quickly, and we are looking forward to being a part of that community.

TreeSitter at Symflower

Symflower analyzes source code and generates tests for the analyzed code. We created our own language/syntax tree, the Symflower AST, which is optimized towards analysis. If, for instance, a Java file is passed to Symflower, then the Java syntax tree is translated into a Symflower AST. The analysis is performed on the Symflower AST, and generates a test represented as Symflower AST. The generated test is then translated back into the Java syntax tree. That is, for every language we want to support, we “only” have to write a so-called language frontend responsible for translating the language syntax tree to Symflower AST and back.

The description of our architecture might already give away why TreeSitter is so valuable to us in addition to pure performance. Due to generality and the huge amount of existing language grammars, TreeSitter provides a single interface for integrating and supporting new languages in Symflower. And it turns out that a huge chunk of the before-mentioned language frontend can be generated from the language grammar itself. Stay tuned!

Always generate as much of your code as you can.

Concrete vs. abstract syntax tree

Before we go into details, we want to make the differences between concrete (CST) and abstract syntax trees (AST) clear. TreeSitter produces CSTs whereas, at Symflower, we use ASTs to perform our analysis.

A CST, also called parse tree, is an exact and unambiguous representation of the source code in tree form. E.g. the CST of (a + b) + c is different to the one of a + (b + c). From a semantic point of view, however, there is no difference between those two, that’s where ASTs come in. ASTs are abstractions of CSTs removing irrelevant syntactic detail, like the parentheses in this case. Removing these irrelevant details makes the syntax tree easier to handle since there is only one case to treat instead of two. The definition of irrelevant syntactic detail highly depends on the application and use case. ASTs can therefore be geared towards their application, e.g. the Symflower AST is optimized towards source code analysis.

A CST gives you every detail of the parsed text.

A generic TreeSitter AST generator

Since our analysis is performed on the Symflower AST we have to bridge the gap between the TreeSitter CST and the Symflower AST. Unfortunately, not the whole step can be filled with generated code, but a huge chunk nonetheless. What we can do is to generate data structures for a TreeSitter AST, and a translator from TreeSitter CST to TreeSitter AST. The code for translating a TreeSitter AST into a Symflower AST has then to be a manual effort.

The reason why we do not translate a TreeSitter CST into a Symflower AST directly is that an AST is easier to work with, and since the translation into Symflower AST has to be written manually, a translation from AST to AST is easier to grasp and maintain. However, if we need to gain more performance, we can apply an automatic refactoring which removes the TreeSitter AST in a preprocessing step such that we have a direct TreeSitter CST to Symflower AST translation in our production code.

So how did we write a TreeSitter AST generator and translator which works for any TreeSitter grammar? The first step is to generate data structures representing the node types of a given language. For that we use the node-types.json file which is automatically generated for every grammar by TreeSitter. It is extracted from the grammar specification and provides structured data about every possible syntax node which we use to generate type declarations. Below is the JSON entry for a binary expression in the node-types.json file of the Java grammar, and after that is the corresponding generated type declaration in Go.

Java binary expression in node-types.json:

{
  "type": "binary_expression",
  "named": true,
  "fields": {
    "left": {
      "multiple": false,
      "required": true,
      "types": [
        {
          "type": "expression",
          "named": true
        }
      ]
    },
    "operator": {
      "multiple": false,
      "required": true,
      "types": [
        {
          "type": "!=",
          "named": false
        }
        // Other possible operators.
      ]
    },
    "right": {
      "multiple": false,
      "required": true,
      "types": [
        {
          "type": "expression",
          "named": true
        }
      ]
    }
  }
}

Generated Go type for the Java binary expression:

// BinaryExpression represents the grammar rule "binary_expression".
type BinaryExpression struct {
   // Left holds the field "left".
   Left Expression
   // Operator holds the field "operator".
   // BangEqual | Percent | ...
   Operator *Token
   // Right holds the field "right".
   Right Expression

   // Start holds the start location of the node.
   Start sitter.Point
   // End holds the end location of the node.
   End sitter.Point
}

For generating these type declarations we already created a one-to-one mapping between the JSON data and the type declarations. We use that mapping to perform the second necessary step, namely generating code which translates the TreeSitter CST into our new TreeSitter AST data structures. Below is the generated code which translates the CST of a binary expression into the corresponding AST.

// UnmarshalBinaryExpression unmarshalls the grammar rule "binary_expression", or returns nil if the unmarshalling failed.
func (u *Unmarshaller) UnmarshalBinaryExpression(in *sitter.Node) (out *BinaryExpression) {
   out = &BinaryExpression{
       Start: in.StartPoint(),
       End:   in.EndPoint(),
   }

   fieldNodes, additionalChildren := u.childNodes(in)
   for fieldName, nodes := range fieldNodes {
       switch fieldName {
       case "left":
           out.Left = u.UnmarshalExpression(nodes[0])
       case "operator":
           out.Operator = u.UnmarshalToken(nodes[0])
       case "right":
           out.Right = u.UnmarshalExpression(nodes[0])
       default:
           u.addError(ErrUnknownFieldName, in)

           return nil
       }
   }

   return out
}

As already mentioned, the code for translating the resulting TreeSitter AST into Symflower AST has then to be written manually. The whole approach is heavily inspired by GitHub’s semantic project, which allows to analyze programming languages with a common AST similar to how the Symflower AST works.

A generic TreeSitter AST printer generator

We already briefly sketched how Symflower’s workflow looks like. After performing source code analysis Symflower generates tests represented as a Symflower AST. The Symflower AST is then translated back into a language-specific AST, e.g. Java AST, and this language-specific AST is then printed to a source file.

While using JavaParser, we wrote our own Java AST formatter which is a tedious work to do. TreeSitter again provides a possibility to automatically generate code for printing and formatting ASTs. TreeSitter not only creates a file called node-types.json, but also a file named grammar.json. This file describes the grammar of a language and can therefore be leveraged to generate code for printing TreeSitter ASTs. We will update the article when this part has been implemented with how this can be done.

Still reading? Clearly you are interested in parsing technologies, as well as analyzing and generating source code. This blog post just highlights one aspect at what we do at Symflower. Join us!

Conclusion

TreeSitter is an amazing project with an active community. Its properties, generality, performance and robustness, together with the already existing amount of language grammars makes it super valuable. By migrating to TreeSitter we not only gained speed, but also a huge simplification when it comes to integrating and supporting new languages. The files nodes-types.json and grammar.json are extracted from the grammar specification alone, and provide type and language information which can be used to automatically generate code for handling the syntax trees of any given language.

To stay updated on the next steps of our TreeSitter journey and future content about software development and software testing, sign up for our newsletter, and follow us on Twitter, LinkedIn or Facebook

| 2023-03-17