The Toilet Paper

A checklist for controlled program comprehension experiments

If you want to study the effect of something on the understandability of code, you need a controlled experiment. What should you do, and what should you avoid?

Admiral Piett
It’s an older code, sir, but it checks out

Developers spend more time reading code than writing it. Code can be hard to understand, especially when written by someone else. Researchers are therefore interested in the factors that make it hard to understand code and what can be done about it.

Controlled experiments, in which subjects are asked to complete a programming task on given code and researchers perform all kinds of measurements, allow us to learn more about what makes code difficult to understand.

There are many things that need to be taken into account: a good experiment uses the right code, task, metrics, and human subjects. This week’s paper presents a checklist for controlled experiments on program comprehension.

I’ll try to give you an idea of what you need to consider, but if you intend to conduct an experiment of your own, you should just go ahead and read the original!



Any experiment on program comprehension involves code, but it’s not easy to find code that is suitable for an experiment.

First, you need to consider how much code you should use. This can be anything from a few lines of code to the complete source code of a software project. The right amount depends on what you’re studying.

For example, if your study is about control structures, then control structures are all you need – including code from a higher scope (e.g. class-level) may result in confounding effects. Limiting the scope of an experiment also makes it more manageable. Few subjects will be able to participate in experiments that take more than an hour!

On the other hand, if the goal of the study is to understand how things work in the real world, you need a real-world amount of code: understanding entire systems is different from understanding a limited amount of code. And because comprehension takes time, experiments that need to be truly realistic may have to take weeks (or even months)! This is rarely feasible in practice, of course, so researchers often take shortcuts, limit the scope of their study, or rely on observations.

Similar arguments can be made for the difficulty of the code, which should be neither too easy nor too difficult for the task and the experimental subjects. For instance, a task could be too hard if it requires specific technical knowledge or domain knowledge. Any subject who lacks such knowledge would not be able to meaningfully participate in the experiment. Pilot studies and careful recruitment of subjects can be used to ensure that the difficulty of the code is at an appropriate level.

Finally, you need to think about where the code comes from: should you use real or synthetic code? Using real code is often easier, because so much of it is readily available. However, writing code for experiments gives you full control over what the code will look like.

A downside of using existing code is that it can be harder to understand, e.g. because a reader needs specific domain knowledge or must be aware of the assumptions and constraints in its design. These issues can be partially mitigated by using code from utility libraries.



There are 7 pitfalls that you should be aware of:

  1. Code can be misleading, e.g. due to the presence of linguistic anti-patterns. This may lead to unexpected outcomes and incorrect conclusions.

  2. Code that is based on well-known algorithms may be recognised, allowing subjects to complete tasks faster than normal. Moreover, modifying such algorithms may lead to misleading code when subjects expect them to work in a very specific way.

  3. Code used in experiments should be realistic. That includes its structure and style: it should not contain things that would not appear in real code.

  4. If code is presented badly, subjects may focus on the wrong things.

  5. When multiple snippets of code are presented in a sequence, performance may differ for the snippets that appear later in the sequence due to learning or fatigue effects. This can be mitigated by randomising the order.

  6. Badly-named variables can have unintended side-effects, even (or maybe especially) when they are replaced by distracting arbitrary strings (asdf, superman) or obfuscated into long strings that are hard to parse (ecoamKayiEoaikAmKayiEckqmqca).

  7. The code must be appropriate for the task, e.g. if you are studying the effect of indentation, the code should have a structure that actually allows for indentation.
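Randomising the order of snippets (pitfall 5) could be sketched as follows. This is only a minimal illustration, not something taken from the paper: the helper name `presentation_order`, the snippet labels, and the seeding scheme are all assumptions. The idea is that each subject gets their own shuffled order, and that the order can be reproduced later for analysis.

```python
import random


def presentation_order(snippet_ids, subject_id, seed=0):
    """Return a shuffled copy of snippet_ids for one subject (hypothetical helper)."""
    # Seeding with the experiment seed and the subject id keeps each
    # subject's order reproducible without giving everyone the same order.
    rng = random.Random(f"{seed}-{subject_id}")
    order = list(snippet_ids)  # copy, so the caller's list is not modified
    rng.shuffle(order)
    return order


snippets = ["A", "B", "C", "D"]
for subject in ["s01", "s02", "s03"]:
    print(subject, presentation_order(snippets, subject))
```

With few snippets and few subjects, a plain shuffle may still put the same snippet first too often. In that case a counterbalanced design (e.g. a Latin square) might be a better fit.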



Program comprehension studies typically involve code. A subject can prove that they understand that code by performing a task. The task therefore “defines” what comprehension means.

There are many types of tasks:

  1. Reading tasks are about the readability of the code. This is not about understanding the code, but merely about its tokens and its structure.

  2. Parsing tasks are used to show that a subject understands the syntax of the code.

  3. Interpretation tasks also require understanding the semantics of the code on a machine level, i.e. what would happen if you executed the code.

  4. Comprehension tasks require a more thorough understanding of the code, such that a subject can explain what it does in their own terms.

  5. Use tasks are about being able to use an API based on its interface and documentation, without having access to the implementation.

  6. Correction tasks require subjects to fix a bug in the code. Note that one must distinguish between technical bug fixing (which only requires interpretation of the code) and semantic bug fixing (which requires actual understanding).

  7. Extension or modification tasks require subjects to make a change to the code. This is typically done using larger units of code.

  8. Design-related tasks are no longer about code, but about the structure or architecture of a system. This requires a deeper level of understanding, e.g. why the system is structured in a certain way.

  9. Recall tasks ask subjects to read code, understand it, and then recall it from memory. The idea is that humans are good at memorising things that are meaningful, so if a subject can recall the code from memory, they have probably understood it. While recall tasks are reliable, they’re not used very often because they’re a bit weird.
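To make the difference between some of these task types a bit more concrete, here is a small made-up snippet (it does not come from the paper, and the name `mystery` is deliberately uninformative). The same code can be used for tasks at different levels, depending on the question that is asked.

```python
def mystery(values):
    # Assumes a non-empty list of comparable values.
    best = values[0]
    for v in values[1:]:
        if v > best:
            best = v
    return best


# Interpretation task: "What does mystery([3, 7, 2]) return?"
print(mystery([3, 7, 2]))

# Comprehension task: "Describe in your own words what mystery does."
# (An acceptable answer: it returns the largest element of the list.)
```

A subject can answer the first question by mentally executing the code step by step. The second question requires them to recognise the purpose of the loop.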

Of course, as a researcher you can also use multiple tasks to measure different aspects of understanding. Such tasks may involve time limits and you may want to measure how many tasks subjects can achieve in a given amount of time or how much time they need to complete a task.



As we have seen above, there are different levels of understanding that can be measured. A task should therefore measure the correct level of understanding. One also needs to make sure that a subject really does understand the code: perceived understanding might not be the same as actual understanding!

Even if a task explicitly checks for actual understanding, there may still be shortcuts that allow subjects to complete it without fully understanding what the code does. Care should be taken to ensure that such shortcuts do not exist.

Of course, tasks should also be designed such that they only test what the experiment is supposed to test. There should be no confounding explanations.

Finally, the working environment may affect how subjects perform their tasks. This is a hard one: tools like IDEs and syntax highlighting may make tasks easier to complete than they should be, but leaving them out may be problematic for subjects who are used to them.



An experiment isn’t complete without measurements. Most experiments measure the accuracy of answers for tasks, the response time for correct answers, or a combination thereof. But measuring these things is not entirely trivial.

Some tasks have a single, simple answer, e.g. when a subject is asked what the output of a function will be for a given input. It’s easy to judge the accuracy of such answers. But other tasks, like “Name this function”, may have many different acceptable answers. In such cases the answers need to be judged, and when multiple people serve as judges there needs to be a protocol for settling disputes among them.
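One possible protocol for multiple judges, sketched below, is to accept a verdict only when a strict majority agrees and to flag everything else for discussion. This is an assumption made for illustration; the paper does not prescribe this rule, and the function name `settle` is made up.

```python
from collections import Counter


def settle(ratings):
    """Return the majority verdict for one answer, or None if the judges must discuss it."""
    if not ratings:
        return None
    verdict, votes = Counter(ratings).most_common(1)[0]
    # Require a strict majority; ties and split votes go to discussion.
    if votes > len(ratings) / 2:
        return verdict
    return None


print(settle(["correct", "correct", "wrong"]))  # majority agrees
print(settle(["correct", "wrong"]))             # tie, needs discussion
```

Whatever rule is used, it helps to fix it before the judges see any answers.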

Speaking of wrong answers, what should you do when a subject provides a wrong answer? A common approach is to just continue with the experiment. Informing the subject might introduce learning effects or discourage subjects from continuing.

When there are multiple dimensions of performance (e.g. time and correctness), they can be reported separately or combined in some way. One option is to report them separately, but there are also different ways to combine them. For instance, one can distinguish between three levels of accomplishment: incorrect answers, correct answers that are given in relatively little time, and correct answers that took relatively long.
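The three levels of accomplishment could be computed roughly like this. The cutoff for “relatively little time” is an assumption here: the sketch uses the median time of all correct answers, and the labels are made up.

```python
from statistics import median


def accomplishment_levels(results):
    """Label each (correct, seconds) pair as incorrect, correct-fast, or correct-slow."""
    correct_times = [seconds for correct, seconds in results if correct]
    if not correct_times:
        return ["incorrect"] * len(results)
    # Assumption: "fast" means at or below the median time of correct answers.
    cutoff = median(correct_times)
    labels = []
    for correct, seconds in results:
        if not correct:
            labels.append("incorrect")
        elif seconds <= cutoff:
            labels.append("correct-fast")
        else:
            labels.append("correct-slow")
    return labels


results = [(True, 30), (True, 90), (False, 20), (True, 60)]
print(accomplishment_levels(results))
```

Note that the time of an incorrect answer is ignored entirely in this scheme, which is one reason why you might still want to report the raw times as well.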

There are also completely different ways to perform measurements. Eye tracking can be used to identify what subjects focus on or are interested in, while biophysical indicators like fMRI can be used to (sort of) see how a subject processes code.



Beware of confounding effects:

  • Subjects need time to get used to the experimental setting, which may lead to longer task times in the first task or two.

  • Sometimes time and correctness measure two very different things. In one study researchers found that time reflects difficulty, while the error rate reflects a “surprise factor” (i.e. misleading code).

It’s also important to think about measurement technicalities, e.g. when measuring task time, are you also measuring the time it takes for the subject to provide their answer?
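One way to keep these technicalities visible is to record separate timestamps for when the code is shown, when the subject starts answering, and when they submit. The class below is a hypothetical sketch (the names `TaskTimer`, `show`, `start_answer`, and `submit` are made up); it only uses `time.perf_counter` from the standard library.

```python
import time


class TaskTimer:
    """Records three timestamps for a single task (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable, so the timer can be tested
        self.shown = None
        self.answer_started = None
        self.submitted = None

    def show(self):
        self.shown = self._clock()

    def start_answer(self):
        self.answer_started = self._clock()

    def submit(self):
        self.submitted = self._clock()

    @property
    def thinking_time(self):
        # Time between seeing the code and starting to answer.
        return self.answer_started - self.shown

    @property
    def answering_time(self):
        # Time spent entering the answer itself.
        return self.submitted - self.answer_started

    @property
    def total_time(self):
        return self.submitted - self.shown
```

With the three values stored separately, you can decide during the analysis whether the time spent typing an answer should count towards the task time.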

Once you’ve made sure that all the measurements you’ve done were correct, there’s one more thing to keep in mind: don’t jump straight to premature theorising, but collect more data first. Not every measurement has to directly lead to a cognitive theory.



Each subject is different. There are three major factors that may affect task performance: knowledge, skill, and motivation. These factors are the hardest to control.

It certainly helps if you use large enough samples that are representative of the overall population.

An important question is whether students are appropriate subjects. Not just from an ethical point of view, but also because most students are clearly different from professional software developers. The author argues that this doesn’t really matter. The dividing line between students and professionals is somewhat ambiguous anyway, as some students have worked or currently work as developers. One could therefore instead look at how much work experience a subject has, or attempt to assess their proficiency and skill in programming.

There are more demographic factors, like age and gender, that may affect performance. However, if such an effect exists, it’s probably small.



There are different dimensions of knowledge. Someone may be good at one thing, but bad at another. This makes classifications into experts and novices less useful. And what exactly is an “expert”? Some studies classify graduate students as experts, even though this is only true when compared to freshmen.

Finally, some subjects are unsuitable for a study and should simply be excluded. A subject might lack the required knowledge to meaningfully participate in the experiment, have done the (or a very similar) experiment before, or may be affected by a lack of motivation.
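Exclusion criteria are best written down before the data is analysed. The sketch below shows one way to make them explicit. All field names are assumptions, and using “completed all tasks” as an indicator of motivation is a crude stand-in chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Subject:
    subject_id: str
    passed_pretest: bool       # has the knowledge the tasks require
    seen_before: bool          # did this (or a very similar) experiment before
    completed_all_tasks: bool  # crude, assumed proxy for motivation


def exclusion_reasons(subject):
    """Return the list of reasons to exclude a subject (empty if none apply)."""
    reasons = []
    if not subject.passed_pretest:
        reasons.append("lacks required knowledge")
    if subject.seen_before:
        reasons.append("prior exposure to the experiment")
    if not subject.completed_all_tasks:
        reasons.append("possible lack of motivation")
    return reasons


subjects = [
    Subject("s01", True, False, True),
    Subject("s02", False, False, True),
    Subject("s03", True, True, False),
]
included = [s for s in subjects if not exclusion_reasons(s)]
print([s.subject_id for s in included])
```

Recording the reasons, rather than just dropping subjects, makes it possible to report how many subjects were excluded and why.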


  1. Controlled experiments for program comprehension are affected by the choice of code, tasks, metrics, and experimental subjects.