Technical | 2021-08-15

Symflower finds small reproducers, the smart way

Sit back and relax as Symflower finds bugs in your code.

A software component that gracefully handles every possible input is an asset. Such strong guarantees greatly benefit a system’s security and maintainability, but the implementation of a robust system can be a daunting task. Often, programmers think of the “golden path” that corresponds to a typical use case, but forget edge cases.

It is tedious for programmers to come up with tests for some problematic edge cases, let alone all of them. Of course, tests that are a specification of the application’s logic must be written by programmers. However, many desirable properties are not really specific to a system’s application logic, and thus require no domain-specific knowledge. For example, a property could be “the system shall never crash on any input”. Finding edge cases that violate this property is a generic task that can be automated.

Fuzzing is a popular approach to find such edge cases by repeatedly feeding a program random inputs. Its reliance on randomness comes with fundamental limitations: in general, fuzzers cannot feasibly explore all possible program paths. Symflower is an analysis tool for Java that will not miss any edge case in your code. Symflower analyzes your source code and uses mathematical methods to find minimal test cases for every relevant path.

This article is a tutorial on how to use Symflower’s free command-line tool to find bugs in your Java code and how to generate unit tests to reproduce them.

The Story of A Bug

Imagine you are a freelance Java programmer. Today, your client is a scientist who is analyzing variations in the human genome. The genome consists of DNA molecules, which are sequences of base pairs. There are four different bases: adenine (A), cytosine (C), guanine (G) and thymine (T). These DNA sequences represent the genetic code.

Sequence alignment is a well-studied problem in bioinformatics that is concerned with comparing such DNA sequences. You are tasked to solve a related problem: deliver a program that measures the similarity between two linear sequences of base pairs.

You decide to implement the Levenshtein algorithm to measure the minimal edit distance between two sequences. It computes the minimal number of characters to insert, delete or change to transform a given string into another. You can encode a DNA sequence as a string, with one character per base pair, like “ACGT” for a DNA sequence of four base pairs.

For example, to transform the string “ACG” into “ATTG”, we need to change one character, and insert one more.

Edit distance between ACG and ATTG.

The algorithm uses a divide-and-conquer strategy. Dividing the input problem into smaller subproblems allows us to combine their solutions to solve the original problem. To decompose the problem, we create an alignment matrix where every cell represents the minimal edit distance between two substrings of the input strings. This matrix shows how that would work for the “ACG -> ATTG” example input.

Edit distance between ACG and ATTG.

Observe that the cell values are the edit distances for corresponding substrings:

  • Row A column A is 0, which is the edit distance between “A” and “A”
  • Row A column C is 1, which is the edit distance between “AC” and “A”
  • Row A column T₁ is 1, which is the edit distance between “A” and “AT”

Based on the three values above, we can compute the edit distance between “AC” and “AT”. There are three options that correspond to the values above.

  • The distance between “A” and “A” (0) + the distance between “C” and “T” (1 because they mismatch) = 1
  • The distance between “AC” and “A” (1) + the distance added by inserting “T” (1) = 2
  • The distance between “A” and “AT” (1) + the distance added by deleting “C” (1) = 2

Since we are computing the minimal distance, we are using the minimum out of the three for the cell in row C column T₁.

The same process is applied to the remaining cells. Finally, the rightmost column in the last row is 2, which is the minimal edit distance between “ACG” and “ATTG”.

This is your implementation:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
class Levenshtein {
	static int editDistance(String x, String y) {
		if (x == null || y == null)
			throw new IllegalArgumentException();
		// Easy cases - only insertions or deletions.
		if (x.isEmpty()) return y.length();
		if (y.isEmpty()) return x.length();
		// The alignment matrix.
		int[][] M = new int[y.length()][x.length()];
		for (int i = 0; i < x.length(); i++) M[0][i] = i;
		for (int j = 0; j < y.length(); j++) M[j][0] = j;

		int j = 1;
		int i = 1;
		for (; j < y.length(); j++) {
			for (i = 1; i < x.length(); i++) {
				int match = M[j-1][i-1];
				if (x.charAt(i-1) != y.charAt(j-1)) match++;
				int left = M[j][i-1] + 1;
				int above = M[j-1][i] + 1;
				M[j][i] = Math.min(match, Math.min(left, above));
			}
		}

		int distance = M[j-1][i-1];

		assert (distance == 0) == x.equals(y);

		return distance;
	}
}

You add some tests to demonstrate that your implementation is correct for the three cases: insertion, deletion and modification of characters.

static void testEditDistance() {
	assert editDistance("same", "same") == 0;
	assert editDistance("insert", "inssert") == 1;
	assert editDistance("delete", "deete") == 1;
	assert editDistance("modify", "mödify") == 1;
	assert editDistance("cake", "cookie") == 3;
}

Satisfied with your work, you deliver it to your client.

The next day, you get an angry call from a customer: your program crashes! This is the stack trace:

Exception in thread "crack_genetic_code" java.lang.AssertionError
   at Levenshtein.editDistance(Levenshtein.java:27)

Our DNA is encoded in 23 individual chromosomes, which are very similar across all humans and other mammals. Now the problem is that Chromosome 1 alone comprises around 250 million base pairs. You could ask your client to send the input that triggers the problem, but to compare two of those chromosomes, the matrix used in our algorithm will need to hold 6.25 x 1016 integers, or around 250 Petabytes. Unless you have that much spare memory on your sleek development laptop, you need to figure out a smarter way to debug this issue.

Let’s think about the problem and do some systematic debugging. You know that this assertion is violated:

assert (distance == 0) == x.equals(y);

Observe that there are two possibilities how that could happen:

  1. The computed distance is zero but the two input strings are different.
  2. The computed distance is greater than zero, but the two input strings are the same.

Oh great, now you have two problems! The good thing is that both are simpler, because they add assumptions which help further debugging. However, tracking down the problem is still a boring, mechanical process. What if you had a tool to quickly solve this problem for you? Then you are free in the afternoon, so you can go about spending all that coder money? 😎

Enter Symflower, a source code analysis tool that automagically finds a reproducer for this bug. Here’s how you can try it yourself.

  1. Install Symflower by following the steps on https://get.symflower.com.
  2. Create a new directory that contains Levenshtein.java with the above implementation.
  3. Simply run symflower inside that directory. This will create a file LevenshteinSymflowerTest.java with test cases to cover the entire function. Since this is the “local” version of Symflower, whose computing resources are limited by your machine, this might take a minute. But, can you do it faster? 🤔

Click here for the solution!

Here is the failing test that reveals the problem.

public class LevenshteinSymflowerTest {
	@Test // (expected = java.lang.AssertionError.class)
	public void edit_distance0() throws java.lang.AssertionError {
		java.lang.String x = "A";
		java.lang.String y = "T";
		int actual = Levenshtein.edit_distance(x, y);
	}
}

As usual, the problem is quite simple to explain with an example, but it’s hard to find by merely reading the source code. The two input strings are obviously different, but the nested loop that computes the alignment costs is never executed. This is a simple off-by-one error, that can be corrected by adding 1 to the dimensions of the alignment matrix.


You will notice that the generated tests cover every path in your function. If you are only interested in a single assertion or exception, you can add a temporary wrapper function to Levenshtein.java to guide test generation. This is done by calling Symflower.assume() which tells Symflower to ignore all paths that do not satisfy the given condition.

static void only_verify_assertion(String x, String y) {
	boolean assertion_failed = false;
	try {
		editDistance(x, y);
	} catch (AssertionError e) {
		assertion_failed = true;
	} catch (IllegalArgumentException e) {
	}
	symflower.Symflower.assume(assertion_failed);
}

After adding this function, run symflower Levenshtein::only_verify_assertion. This will instruct Symflower to generate tests only for that wrapper function. Observe that Symflower finds the bug instantly and exits, because we have asked it to not look for other tests.

Conclusion

We have shown how to use Symflower to quickly find bugs. By design, Symflower gives you minimal reproducers, that make it easy to understand the bug. This can reduce time needed to find and reduce test cases.

The user interface and other features of the free command-line tool version of Symflower are work in progress. Send us feedback for this blog post and on Symflower at our public issue tracker on GitHub or via email at hello@symflower.com.