Does it really matter to test-first or to test-last?

Published: 17 Mar 2019
Written by: Chun Fei Lung

Debunking some myths about the effectiveness of test-driven development.

Adding features during refactoring is counterproductive! It’s a fallacy that may blow up in your face.

Test-driven development is a development practice that involves short, iterative cycles in which the programmer writes tests before adding new functionality or refactoring existing code. It’s commonly believed that writing tests first leads to higher-quality code and improved productivity. Fucci et al. put that belief to the test.

About the article

Title	A dissection of the test-driven development process: Does it really matter to test-first or to test-last?
Year	2017
Author(s)	Davide Fucci (University of Oulu), Hakan Erdogmus (Carnegie Mellon University), Burak Turhan (University of Oulu), Markku Oivo (University of Oulu), Natalia Juristo (University of Oulu)
Venue	IEEE Transactions on Software Engineering

Why it matters

Test-driven development (TDD) has multiple characteristics that set it apart from “traditional” programming, but the “tests first, code later” aspect tends to be the thing that most people talk about (and remember).

There’s more to it than that, however, so let’s start with definitions.

TDD is a programming technique that involves cyclic, iterative implementation of new features.

In each cycle a programmer carries out the following tasks:

Writing unit tests for the desired behaviour;
Writing code to make those tests pass;
Refactoring code to improve its design, strictly without modifying its behaviour (side note: Changing behaviour while refactoring could nullify or even reverse the benefits of refactoring).
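To make the three steps concrete, here is a minimal sketch of what one finished cycle might leave behind. The leap-year task and the names `LeapYearTest` and `is_leap_year` are made up for illustration and do not come from the study.

```python
import unittest


# Step 1: the tests are written first. They fail until is_leap_year
# exists and behaves as described.
class LeapYearTest(unittest.TestCase):
    def test_year_divisible_by_four_is_leap_year(self):
        self.assertTrue(is_leap_year(2016))

    def test_century_is_not_leap_year(self):
        self.assertFalse(is_leap_year(1900))

    def test_every_fourth_century_is_leap_year(self):
        self.assertTrue(is_leap_year(2000))


# Steps 2 and 3: code that makes the tests pass, shown here after a
# refactoring that tidied the condition without changing its behaviour.
def is_leap_year(year: int) -> bool:
    def divisible_by(n: int) -> bool:
        return year % n == 0

    return divisible_by(4) and (not divisible_by(100) or divisible_by(400))
```

Running the file with `python -m unittest` after every step shows whether the cycle is done. A refactoring that adds behaviour, such as support for another calendar, would belong in a new cycle with its own test.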

A cycle is finished when all new and existing unit tests pass, and the programmer is content with the program’s design. Ideally, all cycles are short and roughly the same length (side note: Cycles should be around 5 minutes long, and never be longer than 10 minutes.).

TDD advocates claim that adherence to these practices will lead to improved quality and productivity.

In a nutshell, TDD has four characteristics:

The sequence in which tests are written: before or after coding
The granularity of cycles (side note: Length of cycles)
The uniformity of cycle lengths
The amount of effort spent on refactoring
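As a rough illustration, the four characteristics could be boiled down to one number each. The sketch below is my own simplification, not the paper's exact method: it assumes the median cycle length for granularity, the median absolute deviation for uniformity, and simple fractions for sequence and refactoring effort. The `Cycle` record and its fields are hypothetical.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Cycle:
    minutes: float      # how long the cycle took
    test_first: bool    # True if the test was written before the code
    refactoring: bool   # True if the cycle only restructured existing code


def characteristics(cycles: list[Cycle]) -> dict[str, float]:
    """Summarise a non-empty list of cycles (assumed definitions)."""
    durations = [c.minutes for c in cycles]
    typical = median(durations)
    return {
        # share of cycles in which the test came first
        "sequence": sum(c.test_first for c in cycles) / len(cycles),
        # typical cycle length; lower means finer-grained
        "granularity": typical,
        # spread around the typical length; lower means more uniform
        "uniformity": median(abs(d - typical) for d in durations),
        # share of total time spent in refactoring cycles
        "refactoring": sum(c.minutes for c in cycles if c.refactoring)
        / sum(durations),
    }


session = [
    Cycle(4, True, False),
    Cycle(6, True, False),
    Cycle(5, False, True),
    Cycle(15, False, False),
]
summary = characteristics(session)
```

Note that with these definitions a lower granularity or uniformity value means shorter or more even cycles, which matters when reading the signs of correlations later on.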

How do these four characteristics affect the external quality (side note: “Does the software do what it’s supposed to do?”) of the produced software and the developer’s productivity?

How the study was conducted

The authors held several five-day workshops about unit testing and TDD at two Nordic companies.

During each workshop, participants were asked to individually implement three tasks, of which two were greenfield (side note: Implementing a solution from scratch) and one was brownfield (side note: Extending an existing system). Some participants made use of a test-first sequence, while others used a test-last sequence.

TDD dictates that development is done iteratively using many short cycles. To help participants work on their tasks in small steps, the researchers refined each task into clearly delineated stories and sub-stories. Tasks were then “graded” using acceptance test suites for each user story in order to determine the quality of submitted solutions.

All participants made use of a special Eclipse IDE that collected information about actions that are performed in it, like:

Code modification
Test modification
Code compilation
Test execution

This information was used to determine how participants applied TDD.

Combining timestamps from the IDE logs with the pass rate of the acceptance test suite allows one to calculate the productivity of each developer.
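A back-of-the-envelope version of that calculation could look like the sketch below. The exact formulas are in the original article; here I simply assume that quality is the fraction of acceptance tests that pass and that productivity is the number of passing tests per hour of logged work. Both function names are made up.

```python
from datetime import datetime


def external_quality(passed: int, total: int) -> float:
    """Fraction of acceptance tests that pass (assumed definition)."""
    if total <= 0:
        raise ValueError("the suite must contain at least one test")
    return passed / total


def productivity(first_event: datetime, last_event: datetime,
                 passed: int) -> float:
    """Passing acceptance tests per hour between the first and last
    IDE event (assumed definition)."""
    hours = (last_event - first_event).total_seconds() / 3600
    if hours <= 0:
        raise ValueError("the log must span a positive amount of time")
    return passed / hours


# Hypothetical participant: 20 of 25 acceptance tests pass after 2.5 hours.
start = datetime(2019, 3, 17, 9, 0)
end = datetime(2019, 3, 17, 11, 30)
quality = external_quality(passed=20, total=25)
rate = productivity(start, end, passed=20)
```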

What discoveries were made

You probably already guessed by now that Betteridge’s law of headlines (side note: “Any headline that ends in a question mark can be answered by the word no.”) strikes again, but in what way?

Correlation

Granularity and uniformity are positively correlated, i.e. developers who use shorter cycles are able to keep them consistently short, while those who use larger cycles tend to have cycles of varying lengths. Both factors also appear to affect external quality: smaller cycles and cycles that have consistent lengths are associated with better external quality.

A small, but statistically significant correlation exists between granularity and refactoring effort: developers who use coarser cycles spend less time on refactoring.
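For readers who want to see what such a correlation looks like in practice, here is a small sketch that computes a plain Pearson coefficient by hand. This summary does not say which coefficient the authors used, so treat the choice as an assumption, and the numbers are invented for five hypothetical developers.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


# Invented data: typical cycle length and spread of cycle lengths,
# both in minutes, for five hypothetical developers.
granularity = [3.0, 4.0, 6.0, 9.0, 12.0]
uniformity = [0.5, 1.0, 1.5, 4.0, 5.0]

r = pearson(granularity, uniformity)  # close to +1 for this made-up data
```

A value near +1 means that developers with longer cycles also tend to have more variable cycles, which is the pattern described above. The coefficient says nothing about cause and effect.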

Regression

To better understand the relation between TDD’s four characteristic factors and the two outcome variables (quality and productivity), the authors constructed two models.

The basic idea here is that each model should predict one of the outcome variables using information about the code-test sequence, cycle granularity and uniformity, and refactoring effort.

A good model is also simple, and should not include superfluous input variables. The process of trimming these variables, feature selection, is described in the original article.

I’ll simply list the most noteworthy discoveries here:

Code-test sequence is not part of either model, which suggests that – at least for external quality and developer productivity – it does not matter whether you write your tests before or after your “real” code (side note: This study did not look at the effects on internal quality (i.e. maintainability), which is also pretty important.);
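Cycle granularity, cycle uniformity, and refactoring effort are all negatively associated with both quality and productivity: longer cycles, less consistent cycle lengths, and more refactoring effort each go together with worse outcomes;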
The negative correlation between refactoring effort and the two outcome variables is likely due to floss refactoring (side note: This is a form of refactoring that also includes other activities, like implementation of new features. These new features might not be covered by tests and are therefore more likely to introduce regression bugs.).

Summary

No, it doesn’t really matter whether you write your tests first or last (if you only care about external quality and productivity)
Test-driven development works best if you keep your cycle lengths short and consistent
Improper refactoring not only lowers (short-term) productivity, but also increases the likelihood of bugs