Does it really matter to test-first or to test-last? (2017)

A professor draws two variants of the test-driven development process on a blackboard
Adding features during refactoring is counterproductive! It’s a fallacy that may blow up in your face.

Test-driven development is a development practice that involves short, iterative cycles in which the programmer writes tests before adding new functionality or refactoring existing code. It’s commonly believed that writing tests first leads to higher-quality code and improved productivity. Fucci et al. put that belief to the test.

Why it matters

Test-driven development (TDD) has multiple characteristics that set it apart from “traditional” programming, but the “tests first, code later” aspect tends to be the thing that most people talk about (and remember).

There’s more to it than that, however, so let’s talk definitions first.

TDD is a programming technique that involves cyclic, iterative implementation of new features.

In each cycle a programmer carries out the following tasks:

  1. Writing unit tests for the desired behaviour;
  2. Writing code to make those tests pass;
  3. Strictly refactoring code to improve its design, i.e. without modifying its behaviour (doing so could nullify or even reverse the benefits of refactoring).

A cycle is finished when all new and existing unit tests pass, and the programmer is content with the program’s design. Ideally, all cycles are short and roughly the same length: around 5 minutes long, and never longer than 10 minutes.
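
To make the three steps concrete, here is a minimal sketch of a single cycle in Python. The function `leap_year` and its tests are hypothetical examples chosen for illustration; they are not taken from the study's tasks.

```python
import unittest


# Step 1: write the unit tests first. They describe the desired behaviour
# of a hypothetical function, leap_year, before it is implemented.
class LeapYearTest(unittest.TestCase):
    def test_year_divisible_by_four_is_leap(self):
        self.assertTrue(leap_year(2024))

    def test_ordinary_year_is_not_leap(self):
        self.assertFalse(leap_year(2023))

    def test_century_is_leap_only_if_divisible_by_400(self):
        self.assertFalse(leap_year(1900))
        self.assertTrue(leap_year(2000))


# Step 2: write just enough code to make the tests pass.
# Step 3: refactor without changing behaviour. In this sketch the result
# of refactoring is a single boolean expression; the tests stay untouched.
def leap_year(year: int) -> bool:
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
```

Saved as a module, the tests can be run with `python -m unittest`. In a test-last sequence the same code and tests would exist at the end of the cycle, only written in the opposite order.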

TDD advocates claim that adherence to these practices will lead to improved quality and productivity.

In a nutshell, TDD has four characteristics: the code-test sequence, cycle granularity, cycle uniformity, and refactoring effort.

How do these four characteristics affect the external quality (“Does the software do what it’s supposed to do?”) of the produced software and the developer’s productivity?

How the study was conducted

The authors held several five-day workshops about unit testing and TDD at two Nordic companies.

During the workshop, participants were asked to individually implement three tasks, of which two were greenfield (implementing a solution from scratch) and one was brownfield (extending an existing system). Some participants made use of a test-first sequence, while others used a test-last sequence.

TDD dictates that development is done iteratively using many short cycles. To help participants work on their tasks in small steps, the researchers refined each task into clearly delineated stories and sub-stories. Tasks were then “graded” using acceptance test suites for each user story in order to determine the quality of submitted solutions.

All participants used a special Eclipse IDE that collected information about the actions performed in it.

This information was used to determine how participants applied TDD.

Combining timestamps from the IDE logs with the pass rate of the acceptance test suite allows one to calculate the productivity of each developer.
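
As a rough illustration of that calculation, the sketch below divides the acceptance-test pass rate by the time spent. The function name `productivity`, its parameters, and the exact formula (percentage points passed per minute) are assumptions made for this example; the article does not spell out the formula used in the study.

```python
from datetime import datetime


def productivity(passed: int, total: int, start: datetime, end: datetime) -> float:
    """Percentage points of acceptance tests passed per minute of work.

    Assumed definition for illustration only; `start` and `end` stand in
    for the first and last timestamps found in the IDE log.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    minutes = (end - start).total_seconds() / 60
    if minutes <= 0:
        raise ValueError("end must be after start")
    pass_rate = 100 * passed / total
    return pass_rate / minutes


# Example: 30 of 40 acceptance tests pass after 75 minutes of work.
score = productivity(
    30, 40,
    start=datetime(2017, 1, 1, 9, 0),
    end=datetime(2017, 1, 1, 10, 15),
)
```

Under this definition a developer who reaches the same pass rate in half the time scores twice as high.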

What discoveries were made

You’ve probably already guessed that Betteridge’s law of headlines (“Any headline that ends in a question mark can be answered by the word no.”) strikes again, but how exactly?

Correlation

Granularity and uniformity are positively correlated, i.e. developers who use shorter cycles are able to keep them consistently short, while those who use larger cycles tend to have cycles of varying lengths. Both factors also appear to affect external quality: smaller cycles and cycles that have consistent lengths are associated with better external quality.
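
One way to put numbers on these two factors is sketched below. It assumes that granularity is summarised as the median cycle duration and uniformity as the median absolute deviation of cycle durations; this article does not state the exact formulas, so treat both definitions, the function names, and the sample durations as illustrative.

```python
from statistics import median


def granularity(cycle_minutes):
    """Typical cycle length: lower values mean finer-grained cycles."""
    return median(cycle_minutes)


def uniformity(cycle_minutes):
    """Median absolute deviation: lower values mean more uniform cycles."""
    centre = median(cycle_minutes)
    return median(abs(duration - centre) for duration in cycle_minutes)


# Made-up cycle durations in minutes for two hypothetical developers.
steady = [4, 5, 5, 6, 5]
erratic = [3, 12, 5, 25, 8]

print(granularity(steady), uniformity(steady))    # 5 0
print(granularity(erratic), uniformity(erratic))  # 8 4
```

The developer with short cycles also has the more consistent ones here, which is the pattern the positive correlation describes.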

A small but statistically significant correlation exists between granularity and refactoring effort: developers who use coarser cycles spend less time on refactoring.

Regression

To better understand the relation between TDD’s four characteristic factors and the two outcome variables (quality and productivity), the authors constructed two models.

The basic idea here is that each model should predict one of the outcome variables using information about the code-test sequence, cycle granularity and uniformity, and refactoring effort.

A good model is also simple, and should not include superfluous input variables. The process of trimming these variables, known as feature selection, is described in the original article.

I’ll simply list the most noteworthy discoveries here:

The important bits

  1. No, it doesn’t really matter whether you write tests-first or tests-last (if you only care about external quality and productivity)
  2. Test-driven development works best if you keep your cycle lengths short and consistent
  3. Improper refactoring not only lowers (short-term) productivity, but also increases the likelihood of bugs