An industrial evaluation of unit test generation: Finding real faults in a financial application (2017)

[Image: some old Dutch banknotes, with a ladybug on one of the notes]

Writing tests isn’t something that many developers enjoy, and clients generally don’t like spending money on testing either. Could we try to automate it? Almasi, Hemmati, Fraser, Arcuri, and Benefelds compare two unit test generation tools for Java and conclude that while they do work, you’ll still have to write tests manually for now.

Why it matters

Testing is an important part of software development – unfortunately it’s also something that not all developers are good at.

Automated test generation could solve that problem. Researchers have introduced such tools and studied their effectiveness on open source projects, but it’s not clear how usable they are on industrial systems: many of them are research prototypes that might not always be easy to set up or use.

This study therefore aims to provide evidence of the effectiveness of automated test generation tools on commercially developed software.

How the study was conducted

The authors used LifeCalc, a medium-sized life insurance and pensions application, as the subject of study. More specifically, they studied 25 faults that LifeCalc’s developers had identified and fixed in the past.

For each of the faults a special LifeCalc version was created that exhibited only that particular fault.

Test generation

When it comes to automated unit test generation, there are basically three main approaches: random testing (generating more or less arbitrary sequences of method calls), search-based testing (evolving test suites guided by coverage criteria), and symbolic execution (deriving inputs from the path conditions in the code).

The authors ran two actively maintained, mature test generation tools on the non-faulty version of LifeCalc: EvoSuite (search-based) and Randoop (random testing). None of the symbolic execution tools available at the time were mature enough to be included in the study.
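
To get a feel for the random approach, here’s a minimal sketch of the underlying idea (arbitrary inputs plus generic checks), in the spirit of Randoop but not its actual implementation. The PremiumCalculator class and its monthlyPremium method are made-up stand-ins for LifeCalc code.

```java
import java.util.Random;

// Toy illustration of random testing: call a method with arbitrary inputs
// and report anything suspicious (unexpected exceptions, violated properties).
public class RandomTestSketch {

    public static void main(String[] args) {
        Random random = new Random(42);
        PremiumCalculator calculator = new PremiumCalculator();
        for (int i = 0; i < 1_000; i++) {
            int age = random.nextInt(120);
            double salary = random.nextDouble() * 200_000;
            try {
                double premium = calculator.monthlyPremium(age, salary);
                // A generic property check; real tools also record the observed
                // value as an assertion in the generated test.
                if (premium < 0) {
                    System.out.printf("Possible fault: age=%d, salary=%.2f -> %.2f%n",
                            age, salary, premium);
                }
            } catch (RuntimeException e) {
                System.out.printf("Unexpected exception for age=%d, salary=%.2f: %s%n",
                        age, salary, e);
            }
        }
    }
}

// Hypothetical class under test, standing in for LifeCalc code.
class PremiumCalculator {
    double monthlyPremium(int age, double salary) {
        return salary * 0.01 + age * 0.5;
    }
}
```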

Each tool was run ten times, with two different time budgets: 3 and 15 minutes.

Finally, the authors conducted a survey about generated and manually written tests with five of LifeCalc’s developers.

What discoveries were made

Let’s start with the good news: together, the tools managed to detect 19 of the 25 faults. This suggests that the tools can definitely be useful.

In a single run, however, EvoSuite on average finds only about half of the faults, while Randoop doesn’t even manage two fifths. Allowing the tools to run longer results in slightly higher detection rates, but the difference is pretty negligible.

The faults can be grouped into three categories:

The developers found the tools easy to use, but also noted that the generated tests weren’t very readable. This is largely due to the input values and assertions; both are a bit random and not particularly meaningful within the context of the tested application logic.
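
To illustrate, here’s a hypothetical side-by-side comparison, reusing the made-up PremiumCalculator from the sketch above. The first test mimics the style that tools like EvoSuite and Randoop produce (it isn’t actual output from the study); the second is what a developer might write by hand.

```java
import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class ReadabilityExample {

    // Roughly what a generated test looks like: arbitrary inputs, and an
    // assertion that simply pins down whatever value was observed.
    @Test
    public void test07() {
        PremiumCalculator premiumCalculator0 = new PremiumCalculator();
        double double0 = premiumCalculator0.monthlyPremium(61, -1.0);
        assertEquals(30.49, double0, 0.01);
    }

    // A manually written test encodes intent: a 40-year-old earning 30,000
    // should pay the premium the specification promises.
    @Test
    public void fortyYearOldPaysExpectedMonthlyPremium() {
        PremiumCalculator calculator = new PremiumCalculator();
        assertEquals(320.0, calculator.monthlyPremium(40, 30_000), 0.01);
    }
}
```

Both kinds of test fail when the behaviour changes, which is why the generated variety still works as a regression test; the difference is that only the manual one tells you what was supposed to happen.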

The important bits

  1. Automatically generated tests are usable as regression tests
  2. Search-based testing works better than random testing
  3. There are faults that both random testing and search-based testing cannot find
  4. Generated tests often aren’t as readable as manually written tests