The Toilet Paper

Does it really matter to test-first or to test-last?

Debunking some myths about the effectiveness of test-driven development.

Figure: a professor draws two variants of the test-driven development process on a blackboard. Caption: Adding features during refactoring is counterproductive! It’s a fallacy that may blow up in your face.

Test-driven development is a development practice that involves short, iterative cycles in which the programmer writes tests before adding new functionality or refactoring existing code. It’s commonly believed that writing tests first leads to higher-quality code and improved productivity. Fucci et al. put that belief to the test.

Why it matters


Test-driven development (TDD) has multiple characteristics that set it apart from “traditional” programming, but the “tests first, code later” aspect tends to be the thing that most people talk about (and remember).

There’s more to it than that, however, so let’s start with some definitions.

TDD is a programming technique that involves the cyclic, iterative implementation of new features.

In each cycle a programmer carries out the following tasks:

  1. Writing unit tests for the desired behaviour;

  2. Writing code to make those tests pass;

  3. Strictly refactoring code to improve its design, i.e. changing its structure without adding new behaviour.

A cycle is finished when all new and existing unit tests pass, and the programmer is content with the program’s design.
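To make the three steps concrete, here is a minimal sketch of a single cycle in Python. The `is_leap_year` function and its tests are hypothetical examples and do not come from the study. In a real cycle the code from step 2 would be edited in place during step 3; the two versions have different names here only so that both fit in one listing.

```python
import unittest


# Step 1: write the unit tests for the desired behaviour first.
# Until step 2 is done, running these tests fails.
class LeapYearTest(unittest.TestCase):
    def test_year_divisible_by_four_is_leap(self):
        self.assertTrue(is_leap_year(2024))
        self.assertFalse(is_leap_year(2023))

    def test_century_is_leap_only_if_divisible_by_400(self):
        self.assertFalse(is_leap_year(1900))
        self.assertTrue(is_leap_year(2000))


# Step 2: write just enough code to make the tests pass.
def is_leap_year_first_attempt(year):
    if year % 400 == 0:
        return True
    if year % 100 == 0:
        return False
    return year % 4 == 0


# Step 3: refactor to improve the design without changing behaviour.
# The tests from step 1 must still pass afterwards.
def is_leap_year(year):
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
```

Saved as a module, the tests can be run with `python -m unittest`. The cycle is complete once they pass and the design is considered good enough.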

TDD advocates claim that adherence to these practices will lead to improved quality and productivity.

In a nutshell, TDD has four characteristics:

  • The sequence in which tests and code are written: tests before or after the code

  • The granularity of cycles, i.e. how long each cycle is

  • The uniformity of cycle lengths

  • The amount of effort spent on refactoring

How do these four characteristics affect the external quality of the produced software and the developer’s productivity?

How the study was conducted


The authors held several five-day workshops about unit testing and TDD at two Nordic companies.

During each workshop, participants were asked to individually implement three tasks. Some participants used a test-first sequence, while others used a test-last sequence.

TDD dictates that development is done iteratively using many short cycles. To help participants work on their tasks in small steps, the researchers refined each task into clearly delineated stories and sub-stories. Tasks were then “graded” using acceptance test suites for each user story in order to determine the quality of submitted solutions.

All participants used a special Eclipse IDE that collected information about the actions performed in it, such as:

  • Code modification

  • Test modification

  • Code compilation

  • Test execution

This information was used to determine how participants applied TDD.
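As a rough sketch of how the four characteristics could be derived from such logs, assume the IDE events have already been grouped into cycles, each with a type and a duration. The `Cycle` class, its labels, and the formulas below (median duration for granularity, median absolute deviation for uniformity, simple fractions for sequence and refactoring effort) are assumptions made for illustration; the original article describes the exact definitions used in the study.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Cycle:
    kind: str       # hypothetical labels: "test-first", "test-last", "refactoring"
    minutes: float  # how long the cycle took


def summarise(cycles):
    """Summarise a developer's cycles into four numbers (assumed definitions)."""
    if not cycles:
        raise ValueError("at least one cycle is required")

    durations = [c.minutes for c in cycles]
    # Granularity: the typical cycle length (lower means finer-grained).
    granularity = median(durations)
    # Uniformity: median absolute deviation (lower means more consistent).
    uniformity = median([abs(d - granularity) for d in durations])

    n_first = sum(c.kind == "test-first" for c in cycles)
    n_last = sum(c.kind == "test-last" for c in cycles)
    n_refactoring = sum(c.kind == "refactoring" for c in cycles)

    # Sequence: share of test-first cycles among cycles that add behaviour.
    n_adding = n_first + n_last
    sequence = n_first / n_adding if n_adding else 0.0
    # Refactoring effort: share of all cycles spent on refactoring.
    refactoring_effort = n_refactoring / len(cycles)

    return {
        "sequence": sequence,
        "granularity": granularity,
        "uniformity": uniformity,
        "refactoring_effort": refactoring_effort,
    }


example = [
    Cycle("test-first", 4),
    Cycle("test-last", 6),
    Cycle("refactoring", 5),
    Cycle("test-first", 9),
]
print(summarise(example))
```

Note that under these assumed definitions, a lower granularity or uniformity value means shorter or more consistent cycles.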

Combining timestamps from the IDE logs with the pass rate of the acceptance test suite allows one to calculate the productivity of each developer.
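One plausible way to do this calculation is sketched below. The formula (percentage of acceptance tests passed, divided by the minutes between the first and last logged IDE event) and the `productivity` function are assumptions for illustration, not necessarily the exact measure used by the authors.

```python
from datetime import datetime


def productivity(passed, total, first_event, last_event):
    """Percentage of acceptance tests passed per minute worked (assumed formula)."""
    if total <= 0:
        raise ValueError("total must be positive")
    minutes = (last_event - first_event).total_seconds() / 60
    if minutes <= 0:
        raise ValueError("last_event must be later than first_event")
    pass_rate = 100 * passed / total
    return pass_rate / minutes


# Hypothetical developer: 30 of 40 acceptance tests pass after 2.5 hours.
score = productivity(
    passed=30,
    total=40,
    first_event=datetime(2024, 1, 15, 9, 0),
    last_event=datetime(2024, 1, 15, 11, 30),
)
print(score)
```

In this sketch the pass rate on its own doubles as a simple measure of external quality, while dividing it by the time spent turns it into a measure of productivity.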

What discoveries were made


You probably already guessed by now that Betteridge’s law of headlines strikes again, but in what way?



Granularity and uniformity are positively correlated, i.e. developers who use shorter cycles are able to keep them consistently short, while those who use larger cycles tend to have cycles of varying lengths. Both factors also appear to affect external quality: smaller cycles and cycles that have consistent lengths are associated with better external quality.

A small but statistically significant correlation exists between granularity and refactoring effort: developers who use coarser cycles spend less time on refactoring.



To better understand the relation between TDD’s four characteristic factors and the two outcome variables (quality and productivity), the authors constructed two models.

The basic idea here is that each model should predict one of the outcome variables using information about the code-test sequence, cycle granularity and uniformity, and refactoring effort.

A good model is also simple and should not include superfluous input variables. The process of trimming these variables, known as feature selection, is described in the original article.

I’ll simply list the most noteworthy discoveries here:

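  • Code-test sequence is not part of either model, which suggests that the order in which tests and code are written has little effect on external quality or productivity;

  • Cycle granularity and uniformity, and refactoring effort are all negatively correlated with both quality and productivity. In other words, shorter and more consistent cycles go together with better outcomes, as does less refactoring effort;

  • The negative correlation between refactoring effort and the two outcome variables is likely due to improper refactoring, e.g. adding features while refactoring.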

Summary


  1. No, it doesn’t really matter whether you write tests first or last (if you only care about external quality and productivity)

  2. Test-driven development works best if you keep your cycles short and consistent in length

  3. Improper refactoring not only lowers (short-term) productivity, but also increases the likelihood of bugs