An industrial evaluation of unit test generation: Finding real faults in a financial application (2017)
Writing tests isn’t something that many developers enjoy, and clients generally don’t like spending money on testing either. Could we try to automate it? Almasi, Hemmati, Fraser, Arcuri, and Benefelds compared two unit test generation tools for Java, and concluded that while they do work, you’ll still have to write tests manually for now.
Why it matters
Testing is an important part of software development – unfortunately it’s also something that not all developers are good at.
Automated test generation could solve that problem. Researchers have introduced such tools and studied their effectiveness on open source projects, but it’s not clear how usable these tools are for industrial systems. (It’s heavily implied that many of the tools are research prototypes that might not always be easy to set up or use.)
This study therefore aims to provide evidence of the effectiveness of automated test generation tools on commercially developed software.
How the study was conducted
The authors used LifeCalc, a medium-sized life insurance and pensions software application, as the subject of study. More specifically, the authors studied 25 faults that have been identified and remedied in the past by LifeCalc’s developers.
For each of the faults a special LifeCalc version was created that exhibited only that particular fault.
When it comes to automated unit test generation, there are basically three main approaches:
- Random testing involves generating random-ish inputs in the form of method calls that verify that the application doesn’t crash;
- With search-based testing, the generator iteratively tries to find optimal inputs and assertions using a fitness function (a function that looks at a particular algorithm or set of parameters, and then “grades” its performance) to guide test generation;
- For symbolic testing the program code is analysed to determine combinations of possible input values that will activate all possible execution paths.
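To make the difference between the first two approaches concrete, here is a minimal sketch in Java. Everything in it is hypothetical: `PremiumCalculator`, the magic value `4242`, and the hand-written `fitness` function are illustrative assumptions, not code from LifeCalc or from any real tool. Real generators derive their fitness from instrumented code and evolve whole sequences of method calls; this toy only searches for a single integer input.

```java
import java.util.Random;

// Hypothetical class under test; not taken from LifeCalc.
final class PremiumCalculator {
    // Fails for exactly one input, standing in for a fault that needs a specific value.
    static int premium(int age) {
        if (age == 4242) {
            throw new IllegalStateException("fault triggered");
        }
        return 100 + age;
    }
}

final class GenerationSketch {
    // Random testing: try arbitrary inputs and return the first one that crashes, if any.
    static Integer randomSearch(Random rng, int budget) {
        for (int i = 0; i < budget; i++) {
            int candidate = rng.nextInt(1_000_000);
            if (crashes(candidate)) {
                return candidate;
            }
        }
        return null; // budget exhausted without a crash
    }

    // Search-based testing (here: simple hill climbing): move to whichever
    // neighbour the fitness function grades as closer to the target branch.
    static Integer guidedSearch(int start, int budget) {
        int current = start;
        for (int i = 0; i < budget; i++) {
            if (crashes(current)) {
                return current;
            }
            int left = current - 1;
            int right = current + 1;
            current = fitness(left) < fitness(right) ? left : right;
        }
        return null;
    }

    // Assumed fitness: distance between the input and the value that makes
    // the condition `age == 4242` true. Lower is better.
    static int fitness(int age) {
        return Math.abs(age - 4242);
    }

    static boolean crashes(int age) {
        try {
            PremiumCalculator.premium(age);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("random: " + randomSearch(new Random(), 1_000));
        System.out.println("guided: " + guidedSearch(0, 10_000));
    }
}
```

With a budget of 1,000 attempts over a million possible values, the random search will usually come back empty-handed, while the guided search walks straight to the failing input. That is the intuition behind using a fitness function, although it only helps when the fitness landscape actually points towards the fault.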
The authors ran two actively maintained, mature test generation tools on the non-faulty version of LifeCalc: EvoSuite, which is search-based, and Randoop, which uses random testing. None of the symbolic testing tools in existence were mature enough for the study.
Each tool was run ten times for each of two time budgets: 3 and 15 minutes.
Finally, the authors conducted a survey about generated and manually written tests with five of LifeCalc’s developers.
What discoveries were made
Let’s start with the good news: together, the tools managed to detect 19 of the 25 faults. This suggests that the tools can definitely be useful.
In a single run, however, EvoSuite finds only about half of all faults on average, while Randoop finds fewer than two fifths. Letting the tools run longer results in slightly higher detection rates, but the difference is negligible.
The faults can be grouped into three categories:
- Easy faults don’t require specific input or conditions and typically result in things like NullPointerExceptions. These were detected by both tools in at least 8 out of 10 runs.
- Hard faults can only be found using specific inputs. These were detected at least once, but by only one of the tools (usually the search-based one).
- Challenging faults are faults that can only be detected using specific input that usually consists of complex objects. Neither tool managed to detect these faults in any of the runs.
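The gap between the first two categories can be illustrated with a small Java sketch. The class, the discount rule, and both seeded faults below are invented for illustration and are not taken from LifeCalc or the paper.

```java
// Hypothetical examples of seeded faults; names and rules are assumptions.
final class FaultExamples {
    // Seeded fault: this field is never initialised.
    private static String currency;

    // "Easy" fault: every call dereferences the null field, so any input
    // at all triggers a NullPointerException.
    static int easyFault(int amount) {
        return amount + currency.length();
    }

    // "Hard" fault: the intended rule is a discounted premium from age 65,
    // but the comparison uses > instead of >=. Only the input 65 exposes it.
    static int hardFault(int age) {
        return age > 65 ? 80 : 100;
    }

    public static void main(String[] args) {
        System.out.println(hardFault(64)); // 100, same as the correct version
        System.out.println(hardFault(65)); // 100, but the intended rule gives 80
        System.out.println(hardFault(66)); // 80, same as the correct version
    }
}
```

A generator that calls `easyFault` with any value sees a crash immediately. For `hardFault`, it has to hit exactly one boundary value and also produce an assertion on the returned premium, which is a much smaller target. A challenging fault would additionally require that input to be a correctly constructed object graph rather than a single integer.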
The developers found the tools easy to use, but also noted that the generated tests weren’t very readable. This is largely due to the input values and assertions; both are a bit random and not particularly meaningful within the context of the tested application logic.
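The readability complaint is easier to appreciate with an example. The sketch below contrasts a test written in the general style of generated tests with a handwritten one for the same method. It is an imitation for illustration only, not actual output from either tool, and `Pension.yearlyPayout` with its 4% rule is a made-up stand-in for application logic. Plain Java checks are used instead of a test framework to keep the block self-contained.

```java
// Hypothetical application logic; the 4% rule is an assumption for illustration.
final class Pension {
    // Yearly payout is 4% of the balance, truncated to whole units.
    static long yearlyPayout(long balance) {
        return balance * 4 / 100;
    }
}

final class ReadabilityContrast {
    // In the style of a generated test: an arbitrary input and an assertion
    // that merely records whatever the code returned for it.
    static void test0() {
        long long0 = Pension.yearlyPayout(-1831L);
        check(long0 == -73L);
    }

    // Handwritten: the name, input, and expected value all reflect the rule.
    static void payoutIsFourPercentOfBalance() {
        long balance = 50_000L;
        check(Pension.yearlyPayout(balance) == 2_000L);
    }

    static void check(boolean condition) {
        if (!condition) {
            throw new AssertionError("check failed");
        }
    }

    public static void main(String[] args) {
        test0();
        payoutIsFourPercentOfBalance();
        System.out.println("both checks passed");
    }
}
```

Both tests pass, but only the second one tells a maintainer what behaviour is being protected. The first also locks in the result for a negative balance, which may or may not be meaningful in the domain.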