Does code structure affect comprehension? On using and naming intermediate variables

Published: 22 Aug 2021
Written by: Chun Fei Lung

The answer is not “yes” or “no”, but somewhere in between. Who would’ve guessed?

There are two hard problems in computer science: caching and naming things.

Conventional wisdom says that lengthy chunks of code are hard to read, and that you should therefore split them into smaller pieces. Does this really make your code easier to read?

About the article

Title	Does code structure affect comprehension? On using and naming intermediate variables
Year	2021
Author(s)	Roee Cates (The Hebrew University of Jerusalem), Nadav Yunik (The Hebrew University of Jerusalem), Dror G. Feitelson (The Hebrew University of Jerusalem)
Venue	International Conference on Program Comprehension (ICPC)

Why it matters

Algorithms and functionality can often be expressed in different ways. For example, the distance between two points can be calculated as follows using a single expression:

d = sqrt((A.x - B.x)**2 + (A.y - B.y)**2)

or using three separate expressions:

dx = A.x - B.x
dy = A.y - B.y
d = sqrt( dx**2 + dy**2 )

Each of the lines in the second version is easier to understand than the compound expression in the first version. However, the reader now also has to mentally “connect” the lines if they want to understand what is going on.
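To make the comparison concrete, here is a minimal runnable sketch of both forms. The Point class and the function names distance_compound and distance_decomposed are assumptions introduced for illustration; the snippets above only assume objects with x and y attributes.

```python
import math
from dataclasses import dataclass


@dataclass
class Point:
    # Hypothetical helper type, not from the paper: any object
    # with numeric .x and .y attributes would work equally well.
    x: float
    y: float


def distance_compound(a: Point, b: Point) -> float:
    # Everything in one compound expression.
    return math.sqrt((a.x - b.x) ** 2 + (a.y - b.y) ** 2)


def distance_decomposed(a: Point, b: Point) -> float:
    # The same computation, split up using intermediate variables.
    dx = a.x - b.x
    dy = a.y - b.y
    return math.sqrt(dx ** 2 + dy ** 2)


if __name__ == "__main__":
    origin, corner = Point(0, 0), Point(3, 4)
    print(distance_compound(origin, corner))
    print(distance_decomposed(origin, corner))
```

Both functions perform the same arithmetic in the same order, so they return identical results. The only difference is how much the reader has to hold in their head at once.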

This is a very simple example, of course, but similar issues arise when large functions are decomposed into several smaller ones, or in codebases with macaroni-style dependency injection (side note: these look good on paper, but are hard to understand in practice…).

How the study was conducted

When you split a single compound expression into multiple smaller expressions, you inevitably also create intermediate variables, which you have to name. Good variable names serve as a form of inline documentation and thus may also make your code easier to understand. So decomposition actually affects understandability in two different ways.

The researchers studied these two ways using a controlled experiment. Participants were given 6 Python functions that implemented relatively well-known, non-trivial mathematical functions in one of three ways:

As a single compound expression without any intermediate variables

def foo(arr):
    return sum((x - (sum(arr) / len(arr)))**2 for x in arr) / len(arr)

Decomposed into separate expressions, where intermediate variables are given meaningless names, like tmp1 and tmp2

def foo(arr):
    tmp1 = len(arr)
    tmp2 = sum(arr) / tmp1
    return sum((x - tmp2)**2 for x in arr) / tmp1

Again, decomposed into separate expressions, but now intermediate variables are given meaningful names, like n and mean

def foo(arr):
    n = len(arr)
    mean = sum(arr) / n
    return sum((x - mean)**2 for x in arr) / n

As you can see, all functions are named foo(). That is because participants were asked to read the code and come up with a name for it. If the name accurately describes the algorithm, one can assume that the participant understood what the code does. The researchers manually verified the correctness of the answers.
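As a sanity check (not part of the study), the sketch below confirms that the three variants shown above compute the same thing, namely the population variance. The function names variance_compound, variance_meaningless and variance_meaningful are hypothetical labels that replace foo so the three can coexist in one file. The comparison against statistics.pvariance from the standard library is likewise an addition for illustration.

```python
import math
import statistics


def variance_compound(arr):
    # Version 1: single compound expression, no intermediate variables.
    return sum((x - (sum(arr) / len(arr))) ** 2 for x in arr) / len(arr)


def variance_meaningless(arr):
    # Version 2: decomposed, but with meaningless variable names.
    tmp1 = len(arr)
    tmp2 = sum(arr) / tmp1
    return sum((x - tmp2) ** 2 for x in arr) / tmp1


def variance_meaningful(arr):
    # Version 3: decomposed, with meaningful variable names.
    n = len(arr)
    mean = sum(arr) / n
    return sum((x - mean) ** 2 for x in arr) / n


if __name__ == "__main__":
    # Arbitrary sample data, chosen only for this illustration.
    data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    expected = statistics.pvariance(data)
    for f in (variance_compound, variance_meaningless, variance_meaningful):
        print(f.__name__, math.isclose(f(data), expected))
```

Note that version 1 recomputes sum(arr) / len(arr) for every element, so the decomposed versions also happen to do less work, although the experiment was about comprehension, not performance.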

What discoveries were made

The first and third versions above are the extremes, while the second version sits somewhere in between. One would therefore expect the proportion of correct answers for the second version to fall between those for versions 1 and 3.

It does not. For some functions version 2 was the hardest to identify correctly, for one it was very similar to version 3, and for others the results were similar to those for version 1. In other words, there is no such thing as a “best” way to write such code – it all depends on the function.

The results also clearly show that version 3 was better understood for 5 of the 6 Python functions. This strongly suggests that intermediate variables are not necessarily “better” on their own, but can be very useful when they are given meaningful names.