Symflower finds small reproducers, the smart way
A software component that gracefully handles every possible input is an asset. Such strong guarantees greatly benefit a system’s security and maintainability, but the implementation of a robust system can be a daunting task. Often, programmers think of the “golden path” that corresponds to a typical use case, but forget edge cases.
It is tedious for programmers and testers to come up with test cases for problematic edge cases, let alone all of them. Tests for the specification of an application’s logic are specific to that application. However, many desirable properties are not really specific to a system’s application logic, and thus require no domain-specific knowledge. For example, a property could be “the system shall never crash on any input”. Finding edge cases that violate this property is a generic task that can be automated.
Fuzzing is a popular approach to find such edge cases by repeatedly feeding a program random inputs. Its reliance on randomness comes with fundamental limitations: in general, fuzzers cannot feasibly explore all possible program paths. The good news is that Symflower is a source code analysis tool that will not miss any edge case in your code. Symflower analyzes your source code and uses mathematical methods to find minimal test cases for every relevant path.
This article is a tutorial on how to use Symflower, which you can install for free, to find bugs in your source code and how to generate unit tests to reproduce them. The tutorial is written in Java but you can tag along with any language that is supported.
The Story of A Bug
Imagine you are a freelance Java programmer. Today, your client is a scientist who is analyzing variations in the human genome. The genome consists of DNA molecules, which are sequences of base pairs. There are four different bases: adenine (A), cytosine (C), guanine (G) and thymine (T). These DNA sequences represent the genetic code.
Sequence alignment is a well-studied problem in bioinformatics that is concerned with comparing such DNA sequences. You are tasked to solve a related problem: deliver a program that measures the similarity between two linear sequences of base pairs.
You decide to implement the Levenshtein algorithm to measure the minimal edit distance between two sequences. It computes the minimal number of characters to insert, delete or change to transform a given string into another. You can encode a DNA sequence as a string, with one character per base pair, like “ACGT” for a DNA sequence of four base pairs.
For example, to transform the string “ACG” into “ATTG”, we need to change one character, and insert one more.
The algorithm uses a divide-and-conquer strategy. Dividing the input problem into smaller subproblems allows us to combine their solutions to solve the original problem. To decompose the problem, we create an alignment matrix where every cell represents the minimal edit distance between two substrings of the input strings. This matrix shows how that would work for the “ACG -> ATTG” example input.
Observe that the cell values are the edit distances for corresponding substrings:
- Row A column A is 0, which is the edit distance between “A” and “A”
- Row A column C is 1, which is the edit distance between “AC” and “A”
- Row A column T₁ is 1, which is the edit distance between “A” and “AT”
Based on the three values above, we can compute the edit distance between “AC” and “AT”. There are three options that correspond to the values above.
- The distance between “A” and “A” (0) + the distance between “C” and “T” (1 because they mismatch) = 1
- The distance between “AC” and “A” (1) + the distance added by inserting “T” (1) = 2
- The distance between “A” and “AT” (1) + the distance added by deleting “C” (1) = 2
Since we are computing the minimal distance, we are using the minimum out of the three for the cell in row C column T₁.
The same process is applied to the remaining cells. Finally, the rightmost column in the last row is 2, which is the minimal edit distance between “ACG” and “ATTG”.
This is your implementation:
|
|
You add some tests to demonstrate that your implementation is correct for the three cases: insertion, deletion and modification of characters.
static void testEditDistance() {
assert editDistance("same", "same") == 0;
assert editDistance("insert", "inssert") == 1;
assert editDistance("delete", "deete") == 1;
assert editDistance("modify", "mödify") == 1;
assert editDistance("cake", "cookie") == 3;
}
Satisfied with your work, you deliver it to your client.
The next day, you get an angry call from a customer: your program crashes! This is the stack trace:
Exception in thread "crack_genetic_code" java.lang.AssertionError
at Levenshtein.editDistance(Levenshtein.java:27)
Our DNA is encoded in 23 individual chromosomes, which are very similar across all humans and other mammals. Now the problem is that Chromosome 1 alone comprises around 250 million base pairs. You could ask your client to send the input that triggers the problem, but to compare two of those chromosomes, the matrix used in our algorithm will need to hold 6.25 x 1016 integers, or around 250 Petabytes. Unless you have that much spare memory on your sleek development laptop, you need to figure out a smarter way to debug this issue.
Let’s think about the problem and do some systematic debugging. You know that this assertion is violated:
assert (distance == 0) == x.equals(y);
Observe that there are two possibilities how that could happen:
- The computed distance is zero but the two input strings are different.
- The computed distance is greater than zero, but the two input strings are the same.
Oh great, now you have two problems! The good thing is that both are simpler, because they add assumptions which help further debugging. However, tracking down the problem is still a boring, mechanical process. What if you had a tool to quickly solve this problem for you? Then you are free in the afternoon, so you can go about spending all that coder money? 😎
Enter Symflower, a source code analysis tool that automagically finds a reproducer for this bug. Here’s how you can try it yourself.
- Install Symflower by following the steps on https://get.symflower.com.
- Create a new directory that contains
Levenshtein.java
with the above implementation. - Run Symflower either in your console or editor to generate tests. This will create a file
LevenshteinSymflowerTest.java
with test cases to cover the entire function. Since this is the “local” version of Symflower, whose computing resources are limited by your machine, this might take half a minute. But, can you do it faster? 🤔
Click here for the solution!
Here is the failing test that reveals the problem.
public class LevenshteinSymflowerTest {
@Test // (expected = AssertionError.class)
public void edit_distance0() throws AssertionError {
String x = "A";
String y = "T";
int actual = Levenshtein.edit_distance(x, y);
}
}
As usual, the problem is quite simple to explain with an example, but it’s hard to find by merely reading the source code. The two input strings are obviously different, but the nested loop that computes the alignment costs is never executed. This is a simple off-by-one error, that can be corrected by adding 1 to the dimensions of the alignment matrix.
You will notice that the generated tests cover every path in your function. If you are only interested in a single assertion or exception, you can add a temporary wrapper function to Levenshtein.java
to guide test generation. This is done by calling Symflower.assume()
which tells Symflower to ignore all paths that do not satisfy the given condition.
static void only_verify_assertion(String x, String y) {
boolean assertion_failed = false;
try {
editDistance(x, y);
} catch (AssertionError e) {
assertion_failed = true;
} catch (IllegalArgumentException e) {
}
symflower.Symflower.assume(assertion_failed);
}
After adding this function, run symflower Levenshtein::only_verify_assertion
in your console. This will instruct Symflower to generate tests only for that wrapper function. Observe that Symflower finds the bug instantly and exits, because we have asked it to not look for other tests. If you are using one of the editor extensions of Symflower, you might need to use an extra command to specifically generate tests for only one function.
Conclusion
We have shown how to use Symflower to quickly find bugs. By design, Symflower gives you minimal reproducers, that make it easy to understand the bug. This can reduce time needed to find and reduce test cases.
The user interface and other features of Symflower are a work in progress. Send us feedback for this blog post and on Symflower at our public issue tracker on GitHub or via email at hello@symflower.com.