The Toilet Paper

An industrial evaluation of unit test generation: Finding real faults in a financial application

Why write tests if you can let a computer write them for you?

[Image: some old Dutch banknotes, with a ladybug on one of the notes]
It’s not exactly a Monet, but… money doesn’t smell

Writing tests isn’t something that many developers enjoy, and clients generally don’t like spending money on testing either. Could we automate it? Almasi, Hemmati, Fraser, Arcuri, and Benefelds compared two unit test generation tools for Java, and concluded that while they do work, you’ll still have to write tests manually for now.

Why it matters


Testing is an important part of software development – unfortunately it’s also something that not all developers are good at.

Automated test generation could solve that problem. Researchers have introduced such tools and studied their effectiveness on open source projects, but it’s not clear whether those results carry over to industrial systems.

This study therefore aims to provide evidence of the effectiveness of automated test generation tools on commercially developed software.

How the study was conducted


The authors used LifeCalc, a medium-sized life insurance and pensions application, as the subject of their study. More specifically, they studied 25 faults that had been identified and fixed in the past by LifeCalc’s developers.

For each of the faults a special LifeCalc version was created that exhibited only that particular fault.

Test generation


When it comes to automated unit test generation, there are basically three main approaches:

  • Random testing involves generating more or less random inputs, in the form of sequences of method calls, and checking that the application doesn’t crash;

  • With search-based testing, the generator iteratively tries to find optimal inputs and assertions using a metaheuristic search algorithm, such as a genetic algorithm;

  • For symbolic testing the program code is analysed to determine combinations of possible input values that will activate all possible execution paths.
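To make the random approach concrete, here is a minimal sketch of what a random tester does. The `premium` method and its “don’t return nonsense” oracle are made up for illustration; real tools like Randoop work on whole classes, but the core loop looks like this:

```java
import java.util.Random;

public class RandomTesterSketch {
    // Made-up method under test: yearly premium from age and coverage amount.
    static int premium(int age, int coverage) {
        if (age < 0 || coverage < 0) throw new IllegalArgumentException("negative input");
        return coverage / 100 + age;
    }

    // Random testing in a nutshell: throw random-ish inputs at the method and
    // check only generic oracles (no undeclared crash, no obviously wrong result).
    static int unexpectedFailures(int calls, long seed) {
        Random rng = new Random(seed);
        int failures = 0;
        for (int i = 0; i < calls; i++) {
            int age = rng.nextInt(200) - 50;              // some inputs deliberately invalid
            int coverage = rng.nextInt(1_000_000) - 1_000;
            try {
                if (premium(age, coverage) < 0) failures++; // generic sanity oracle
            } catch (IllegalArgumentException expected) {
                // documented exception for invalid input: not a failure
            } catch (RuntimeException unexpected) {
                failures++;                                 // undeclared crash: report it
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println("unexpected failures: " + unexpectedFailures(1_000, 42L));
    }
}
```

Note that the only oracles available are generic ones; the tool has no idea what a *correct* premium would be, which is exactly why generated assertions tend to be shallow.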

The authors ran two such tools on the non-faulty version of LifeCalc:

  • Randoop, which uses random testing, and

  • EvoSuite, which uses search-based testing.
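For reference, this is roughly how the two tools are invoked from the command line. The class name, classpath, and jar version below are placeholders, not taken from the study; the flags themselves are real Randoop and EvoSuite options:

```shell
# Randoop (random testing): generate tests for a class with a 3-minute budget.
java -classpath build/classes:randoop-all-4.3.2.jar randoop.main.Main gentests \
  --testclass=com.example.PremiumCalculator --time-limit=180

# EvoSuite (search-based testing): same target, 3-minute search budget.
java -jar evosuite.jar -class com.example.PremiumCalculator \
  -projectCP build/classes -Dsearch_budget=180
```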

Each tool was run ten times, with two different time budgets: 3 and 15 minutes.

Finally, the authors conducted a survey about generated and manually written tests with five of LifeCalc’s developers.

What discoveries were made


Let’s start with the good news: together, the tools managed to detect 19 of the 25 faults. This suggests that the tools can definitely be useful.

In single runs, however, EvoSuite on average only finds about half of all faults, while Randoop doesn’t even manage two fifths. Allowing the tools to run longer results in slightly higher detection rates, but the difference is pretty negligible.

The faults can be grouped into three categories:

  • Easy faults don’t require specific input or conditions and typically result in things like NullPointerExceptions. These were detected by both tools in at least 8 out of 10 runs.

  • Hard faults can only be found using specific inputs. Each of these was detected at least once, but only by one of the tools (usually the search-based one).

  • Challenging faults are faults that can only be detected using specific input that usually consists of complex objects. Neither tool managed to detect these faults in any of the runs.
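The difference between the first two categories is easy to illustrate with two hypothetical bugs: one that crashes on virtually any input, and one that only misbehaves for a specific value. Both methods and their business rules are made up:

```java
import java.util.Random;

public class FaultCategories {
    // "Easy" fault: a forgotten null check, so almost any generated call crashes.
    static int nameLength(String name) {
        return name.length(); // NullPointerException whenever name == null
    }

    // "Hard" fault: only wrong for one specific input value.
    static int discountPercent(int age) {
        if (age == 65) return 0;     // bug: 65-year-olds should get 10, not 0
        return age >= 65 ? 10 : 0;
    }

    // Random probing stumbles over the easy fault almost immediately, but
    // hitting age == 65 by chance in a large input space is unlikely;
    // search-based tools instead steer inputs toward such boundary values.
    public static boolean randomlyFindsHardFault(int attempts, long seed) {
        Random rng = new Random(seed);
        for (int i = 0; i < attempts; i++) {
            int age = rng.nextInt(1_000_000);
            if (discountPercent(age) != (age >= 65 ? 10 : 0)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("bug at age 65: " + (discountPercent(65) != 10));
        System.out.println("found by 1000 random tries: "
                + randomlyFindsHardFault(1_000, 42L));
    }
}
```

Challenging faults go one step further: the triggering input isn’t a single primitive value but a complex object graph that neither random generation nor search manages to construct.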

The developers found the tools easy to use, but also noted that the generated tests weren’t very readable.

This is largely due to the input values and assertions; both are a bit random and not particularly meaningful within the context of the tested application logic.
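The readability gap is easy to see side by side. The first test below mimics the style of a generated test, with arbitrary values and an assertion that merely echoes whatever the code returned; the second is what a developer might write for the same made-up method:

```java
public class ReadabilityExample {
    // Made-up method under test: yearly pension payout from capital and term.
    static int payout(int capital, int years) {
        if (years <= 0) throw new IllegalArgumentException("years must be positive");
        return capital / years;
    }

    // Generated-style test: meaningless inputs, assertion recorded from the output.
    static void test0() {
        int int0 = (-1951);
        int int1 = payout(int0, 17);
        assert int1 == -114; // why -114? The tool just captured what came back.
    }

    // Hand-written test: named values that reflect the business rule.
    static void payoutSpreadsCapitalEvenlyOverTheTerm() {
        int capital = 120_000;
        int termInYears = 20;
        assert payout(capital, termInYears) == 6_000;
    }

    public static void main(String[] args) {
        test0();
        payoutSpreadsCapitalEvenlyOverTheTerm();
        System.out.println("both tests pass");
    }
}
```

Both tests exercise the same code, but only the second one tells a maintainer what the method is supposed to do, and only the second assertion would be worth keeping if it ever failed.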

Summary


  1. Automatically generated tests are usable as regression tests

  2. Search-based testing works better than random testing

  3. There are faults that both random testing and search-based testing cannot find

  4. Generated tests often aren’t as readable as manually written tests