Debugging part 4: Finding simple reproducers during debugging with Symflower

Connecting the dots when debugging is easier with Symflower!

In this final article of our 4-part series on debugging, we’ll walk you through a practical example of using Symflower to find simple reproducers for bugs in your code.

This blog series on debugging aims to equip you with all the knowledge and tools you need to start to efficiently debug software applications. In part 1, we gave a general overview of the debugging workflow using the TRAFFIC principle. Part 2 introduced the most useful tools and workflows to streamline your debugging efforts. In part 3, we covered all the ways unit tests can help the debugging process.

Artificial strawberries need your help

It’s a sleepy Thursday afternoon and you’re just about to grab one more cup of coffee when your phone rings. It’s a long-lost school friend of yours who studied biology and is now working on an exciting project that explores the possibility of growing artificial strawberries. They need your help computationally analyzing variations in the strawberry’s genome. The genome consists of DNA molecules, which are sequences of base pairs. There are four different bases: adenine (A), cytosine (C), guanine (G) and thymine (T). These DNA sequences represent the genetic code.

Sequence alignment is a well-studied problem in bioinformatics that is concerned with comparing such DNA sequences. You are tasked to solve a related problem: deliver a program that measures the similarity between two linear sequences of base pairs.

You decide to implement the Levenshtein algorithm to measure the minimal edit distance between two sequences. It computes the minimal number of characters to insert, delete, or change to transform a given string into another. You can encode a DNA sequence as a string, with one character per base pair, like “ACGT” for a DNA sequence of four base pairs.

For example, to transform the string “ACG” into “ATTG”, we need to change one character, and insert one more.

The algorithm uses a divide-and-conquer strategy. Dividing the input problem into smaller subproblems allows us to combine their solutions to solve the original problem. To decompose the problem, we create an alignment matrix where every cell represents the minimal edit distance between two substrings of the input strings. This matrix shows how that would work for the “ACG -> ATTG” example input.

Observe that the cell values are the edit distances for corresponding substrings:

Row A column A is 0, which is the edit distance between “A” and “A”
Row A column C is 1, which is the edit distance between “AC” and “A”
Row A column T₁ is 1, which is the edit distance between “A” and “AT”

Based on the three values above, we can compute the edit distance between “AC” and “AT”. There are three options that correspond to the values above.

The distance between “A” and “A” (0) + the distance between “C” and “T” (1 because they mismatch) = 1
The distance between “AC” and “A” (1) + the distance added by inserting “T” (1) = 2
The distance between “A” and “AT” (1) + the distance added by deleting “C” (1) = 2

Since we are computing the minimal distance, we are using the minimum out of the three for the cell in row C column T₁.

The same process is applied to the remaining cells. Finally, the rightmost column in the last row is 2, which is the minimal edit distance between “ACG” and “ATTG”.

This is your implementation:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39


class Levenshtein {
   static int editDistance(String x, String y) {
       if (x == null || y == null)
           throw new IllegalArgumentException();


       // Handle the easy cases: only insertions or deletions.
       if (x.isEmpty()) return y.length();
       if (y.isEmpty()) return x.length();


       // Define the alignment matrix.
       int[][] M = new int[y.length()][x.length()];
       for (int i = 0; i < x.length(); i++) M[0][i] = i;
       for (int j = 0; j < y.length(); j++) M[j][0] = j;


       int j = 1;
       int i = 1;
       for (; j < y.length(); j++) {
           for (i = 1; i < x.length(); i++) {
               int match = M[j-1][i-1];
               if (x.charAt(i-1) != y.charAt(j-1)) match++;
               int left = M[j][i-1] + 1;
               int above = M[j-1][i] + 1;
               M[j][i] = Math.min(match, Math.min(left, above));
           }
       }


       int distance = M[j-1][i-1];


       assert (distance == 0) == x.equals(y);


       return distance;
   }
}

You add some tests to demonstrate that your implementation is correct for the three cases: insertion, deletion and modification of characters.

static void testEditDistance() {
   assert editDistance("same", "same") == 0;
   assert editDistance("insert", "inssert") == 1;
   assert editDistance("delete", "deete") == 1;
   assert editDistance("modify", "mödify") == 1;
   assert editDistance("cake", "cookie") == 3;
}

Satisfied with your work, you deliver it to your school friend.

The next day, you get an angry call: your program crashes! Here’s the stack trace:

Exception in thread "crack_strawberry_genome" java.lang.AssertionError
  at Levenshtein.editDistance(Levenshtein.java:27)

A strawberry’s DNA is present in 56 chromosomes, with 60% of this DNA information also being present in the human DNA. Now the problem is that a single chromosome alone can hold millions of base pairs. You could ask your client to send the input that triggers the problem, but to compare two of those chromosomes, the matrix used in our algorithm will probably require a few Petabytes of space. Unless you have that much spare memory on your sleek development laptop, you need to figure out a smarter way to debug this issue.

Debugging the smart way

Let’s think about the problem and do some systematic debugging. You know that this assertion is violated:

assert (distance == 0) == x.equals(y);

Observe that there are two possibilities for how that could happen:

The computed distance is zero but the two input strings are different.
The computed distance is greater than zero, but the two input strings are the same.

Oh great, now you have two problems! The good thing is that both are simpler because they add assumptions that help further debugging. However, tracking down the problem is still a boring, mechanical process. What if you had a tool to quickly solve this problem for you?

Enter Symflower, a source code analysis tool that automagically finds a reproducer for this bug. Here’s how you can try it yourself.

Install Symflower by following the steps on https://get.symflower.com.
Create a new directory that contains Levenshtein.java with the above implementation.
Run Symflower either in your console or editor to generate tests. This will create a file LevenshteinSymflowerTest.java with test cases to cover the entire function. Since Symflower’s computing resources are limited by your machine, this might take as long as half a minute. But could you do it faster by hand? 🤔

Click here for the solution!

Here is the failing test that reveals the problem:

@Test
public void editDistance6() throws AssertionError {
    String x = "A";
    String y = "B";
    // assertThrows(java.lang.AssertionError.class, () -> {
    Plain.editDistance(x, y);
    // });
}

As usual, the problem is quite simple to explain with an example, but it’s hard to find by merely reading the source code. The two input strings are obviously different, but the nested loop that computes the alignment costs is never executed. This is a simple off-by-one error that can be corrected by adding 1 to the dimensions of the alignment matrix.

You will notice that the generated tests cover every path in your function. If you are only interested in a single assertion or exception, you can add a temporary wrapper function to Levenshtein.java to guide test generation. This is done by calling Symflower.assume() which tells Symflower to ignore all paths that do not satisfy the given condition.

static void only_verify_assertion(String x, String y) {
	boolean assertion_failed = false;
	try {
		editDistance(x, y);
	} catch (AssertionError e) {
		assertion_failed = true;
	} catch (IllegalArgumentException e) {
	}
	symflower.Symflower.assume(assertion_failed);
}

After adding this function, run symflower ::only_verify_assertion in your console. This will instruct Symflower to generate tests only for that wrapper function. Observe that Symflower finds the bug instantly and exits, because we have asked it to not look for other tests. If you are using one of the editor extensions of Symflower, you might need to use an extra command to specifically generate tests for only one function.

Conclusion

That went pretty fast, didn’t it? We just used Symflower to quickly find bugs where going the manual way could have taken quite a bit of time. By design, Symflower gives you minimal reproducers that make it easy to understand the bug. This can reduce the time needed to find and reduce test cases.

Have your own artificial strawberry project where you could put Symflower to good use? Install it in your IDE of choice or CLI, test it with your code, and let us know what you think!

| 2023-06-20

Debugging part 4: Finding simple reproducers during debugging with Symflower

Artificial strawberries need your help

Debugging the smart way

Conclusion

Series: Debugging

Product

Company