The Toilet Paper

An industrial evaluation of unit test generation: Finding real faults in a financial application

Why write tests if you can let a computer write them for you?

[Image: some old Dutch banknotes, with a ladybug on one of the notes]
It’s not exactly a Monet, but… money doesn’t smell

Writing tests isn’t something that many developers enjoy, and clients generally don’t like spending money on testing either. Could we automate it? Almasi, Hemmati, Fraser, Arcuri, and Benefelds compared two unit test generation tools for Java, and concluded that while they do work, you’ll still have to write tests manually for now.

Why it matters


Testing is an important part of software development – unfortunately it’s also something that not all developers are good at.

Automated test generation could solve that problem. Researchers have introduced such tools and studied their effectiveness on open source projects, but it’s not clear whether those results carry over to industrial systems.

This study therefore aims to provide evidence of the effectiveness of automated test generation tools on commercially developed software.

How the study was conducted


The authors used LifeCalc, a medium-sized life insurance and pensions application, as the subject of their study. More specifically, they studied 25 faults that had been identified and fixed in the past by LifeCalc’s developers.

For each of the faults a special LifeCalc version was created that exhibited only that particular fault.

Test generation


When it comes to automated unit test generation, there are basically three main approaches:

  • Random testing involves generating more or less random inputs, in the form of sequences of method calls, and checking that the application doesn’t crash;

  • With search-based testing, the generator iteratively tries to find optimal inputs and assertions using a metaheuristic search algorithm, such as a genetic algorithm;

  • For symbolic testing the program code is analysed to determine combinations of possible input values that will activate all possible execution paths.
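To make the random approach concrete, here is a minimal sketch of what a random tester does. The `premium` method and its “don’t return nonsense” oracle are made up for illustration; real tools like Randoop work on whole classes, but the core loop looks like this:

```java
import java.util.Random;

public class RandomTesterSketch {
    // Made-up method under test: yearly premium from age and coverage amount.
    static int premium(int age, int coverage) {
        if (age < 0 || coverage < 0) throw new IllegalArgumentException("negative input");
        return coverage / 100 + age;
    }

    // Random testing in a nutshell: throw random-ish inputs at the method and
    // check only generic oracles (no undeclared crash, no obviously wrong result).
    static int unexpectedFailures(int calls, long seed) {
        Random rng = new Random(seed);
        int failures = 0;
        for (int i = 0; i < calls; i++) {
            int age = rng.nextInt(200) - 50;              // some inputs deliberately invalid
            int coverage = rng.nextInt(1_000_000) - 1_000;
            try {
                if (premium(age, coverage) < 0) failures++; // generic sanity oracle
            } catch (IllegalArgumentException expected) {
                // documented exception for invalid input: not a failure
            } catch (RuntimeException unexpected) {
                failures++;                                 // undeclared crash: report it
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println("unexpected failures: " + unexpectedFailures(1_000, 42L));
    }
}
```

Note that the only oracles available are generic ones; the tool has no idea what a *correct* premium would be, which is exactly why generated assertions tend to be shallow.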

The authors ran two such tools on the non-faulty version of LifeCalc:

  • Randoop, which uses random testing, and

  • EvoSuite, which uses search-based testing.
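For reference, this is roughly how the two tools are invoked from the command line. The class name, classpath, and jar version below are placeholders, not taken from the study; the flags themselves are real Randoop and EvoSuite options:

```shell
# Randoop (random testing): generate tests for a class with a 3-minute budget.
java -classpath build/classes:randoop-all-4.3.2.jar randoop.main.Main gentests \
  --testclass=com.example.PremiumCalculator --time-limit=180

# EvoSuite (search-based testing): same target, 3-minute search budget.
java -jar evosuite.jar -class com.example.PremiumCalculator \
  -projectCP build/classes -Dsearch_budget=180
```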

Each tool was run ten times, with two different time budgets: 3 and 15 minutes.

Finally, the authors conducted a survey about generated and manually written tests with five of LifeCalc’s developers.

What discoveries were made


Let’s start with the good news: together, the tools managed to detect 19 of the 25 faults. This suggests that the tools can definitely be useful.

In single runs, however, EvoSuite on average only finds about half of all faults, while Randoop doesn’t even manage two fifths. Allowing the tools to run longer results in slightly higher detection rates, but the difference is pretty negligible.

The faults can be grouped into three categories:

  • Easy faults don’t require specific input or conditions and typically result in things like NullPointerExceptions. These were detected by both tools in at least 8 out of 10 runs.

  • Hard faults can only be found using specific inputs. Each of these was detected at least once, but only by one of the tools (usually the search-based one).

  • Challenging faults are faults that can only be detected using specific input that usually consists of complex objects. Neither tool managed to detect these faults in any of the runs.
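The difference between the first two categories is easy to illustrate with two hypothetical bugs: one that crashes on virtually any input, and one that only misbehaves for a specific value. Both methods and their business rules are made up:

```java
import java.util.Random;

public class FaultCategories {
    // "Easy" fault: a forgotten null check, so almost any generated call crashes.
    static int nameLength(String name) {
        return name.length(); // NullPointerException whenever name == null
    }

    // "Hard" fault: only wrong for one specific input value.
    static int discountPercent(int age) {
        if (age == 65) return 0;     // bug: 65-year-olds should get 10, not 0
        return age >= 65 ? 10 : 0;
    }

    // Random probing stumbles over the easy fault almost immediately, but
    // hitting age == 65 by chance in a large input space is unlikely;
    // search-based tools instead steer inputs toward such boundary values.
    public static boolean randomlyFindsHardFault(int attempts, long seed) {
        Random rng = new Random(seed);
        for (int i = 0; i < attempts; i++) {
            int age = rng.nextInt(1_000_000);
            if (discountPercent(age) != (age >= 65 ? 10 : 0)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("bug at age 65: " + (discountPercent(65) != 10));
        System.out.println("found by 1000 random tries: "
                + randomlyFindsHardFault(1_000, 42L));
    }
}
```

Challenging faults go one step further: the triggering input isn’t a single primitive value but a complex object graph that neither random generation nor search manages to construct.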

The developers found the tools easy to use, but also noted that the generated tests weren’t very readable.

This is largely due to the input values and assertions; both are a bit random and not particularly meaningful within the context of the tested application logic.
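The readability gap is easy to see side by side. The first test below mimics the style of a generated test, with arbitrary values and an assertion that merely echoes whatever the code returned; the second is what a developer might write for the same made-up method:

```java
public class ReadabilityExample {
    // Made-up method under test: yearly pension payout from capital and term.
    static int payout(int capital, int years) {
        if (years <= 0) throw new IllegalArgumentException("years must be positive");
        return capital / years;
    }

    // Generated-style test: meaningless inputs, assertion recorded from the output.
    static void test0() {
        int int0 = (-1951);
        int int1 = payout(int0, 17);
        assert int1 == -114; // why -114? The tool just captured what came back.
    }

    // Hand-written test: named values that reflect the business rule.
    static void payoutSpreadsCapitalEvenlyOverTheTerm() {
        int capital = 120_000;
        int termInYears = 20;
        assert payout(capital, termInYears) == 6_000;
    }

    public static void main(String[] args) {
        test0();
        payoutSpreadsCapitalEvenlyOverTheTerm();
        System.out.println("both tests pass");
    }
}
```

Both tests exercise the same code, but only the second one tells a maintainer what the method is supposed to do, and only the second assertion would be worth keeping if it ever failed.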

Summary


  1. Automatically generated tests are usable as regression tests

  2. Search-based testing works better than random testing

  3. There are faults that both random testing and search-based testing cannot find

  4. Generated tests often aren’t as readable as manually written tests