Understanding large-scale software – A hierarchical view (2019)
Large software systems are more expensive to maintain – not because changes require more code, but because such systems are harder to understand. Many developers and studies focus on things like cyclomatic complexity and API documentation, but those aren’t much help when you need to understand an entire system.
Why it matters
Most research in program comprehension focusses on understanding of code. That certainly offers valuable insights, but it’s not necessarily representative of how developers actually work with software.
In practice, developers who maintain software systems need to understand more than just the code: they have to know what the rest of the system looks like, and how their changes affect the rest of the system.
How do developers manage to understand entire systems? What methods do developers use to gain a better understanding of the systems they’re working on?
How the study was conducted
The authors conducted semi-structured interviews with 11 experienced developers, managers, architects, and entrepreneurs from different organisations.
What discoveries were made
The study tells us something about how developers learn about the inner workings of larger software systems.
Depth of comprehension
There are two major levels of comprehension:
- Black-box comprehension is the simplest level of comprehension: the developer knows what a component does from a user’s perspective, but has no understanding of its internals. This is more than enough if you only need to use a component “as is”.
- White-box comprehension goes a bit further: at this level, a developer also understands how a component is implemented. This level of understanding is necessary for most maintenance and refactoring activities.
Some interviewees point out that there are other levels as well:
- Unboxable comprehension consists of assumptions that cannot be derived from the code. You may find these in the documentation or by having a chat with the component’s original developer (assuming, of course, that they’re still part of the project).
- Out-of-the-box comprehension requires deep knowledge about the way the code will actually be executed and can be useful for implementing optimisations. This is rarely needed.
Full, white-box comprehension of a system often isn’t just unattainable – it’s also undesirable. Developers prefer to avoid actual comprehension whenever possible, by:
- Guarding design and code quality (information hiding, modularity) so that others only need to understand small parts at a time;
- Relying on other indicators that may serve as proxies for quality when selecting external software packages, e.g. the number of stars, downloads, or the reputation of its author(s);
- Making use of unit tests, which “automate” understanding.
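The last point can be illustrated with a minimal sketch. The function `normalise_username` is hypothetical and does not come from the study. Its tests record what a caller may rely on, so a reader can learn the contract without reading the function body.

```python
import unittest


def normalise_username(raw: str) -> str:
    """Hypothetical helper: trim whitespace and lower-case a username."""
    cleaned = raw.strip().lower()
    if not cleaned:
        raise ValueError("username must not be empty")
    return cleaned


class NormaliseUsernameContract(unittest.TestCase):
    """Each test states one fact that a caller can rely on."""

    def test_trims_and_lowercases(self):
        self.assertEqual(normalise_username("  Alice "), "alice")

    def test_is_idempotent(self):
        # Normalising an already normalised name changes nothing.
        once = normalise_username("Bob")
        self.assertEqual(normalise_username(once), once)

    def test_rejects_blank_input(self):
        with self.assertRaises(ValueError):
            normalise_username("   ")
```

Running `python -m unittest` against the file executes these checks. If a later change breaks one of the stated facts, the failing test name says which part of the contract no longer holds.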
There are two approaches that one can take when attempting to understand a system: top-down or bottom-up. Top-down approaches typically involve design documents and API documentation, while bottom-up approaches are more likely to involve the actual source code and possibly some inline documentation.
Both approaches have their up- and downsides: for example, a top-down approach makes it easy to understand which components are used and how, but might also quickly overwhelm newcomers. Some therefore suggest using a combination of the two approaches.
However, it seems that developers who gravitate towards top-down approaches are better able to comprehend large volumes of code, as such approaches allow them to defer understanding of details that don’t matter much yet.
Aside from top-down and bottom-up approaches, the interviewees also mention other methods and aids for becoming acquainted with unfamiliar systems:
- Meaningful names for functions, classes, and packages;
- Adherence to coding and naming conventions;
- Assigning minor maintenance tasks (fixing simple bugs, adding simple features, creating missing documentation) to newcomers to help them focus on important parts of the system;
- Asking questions or discussing code with other team members;
- Using a debugger to step through the code;
- Drawing analogies with things you already know;
- Documentation in the form of inline comments, design documents, and test suites.
Comprehension means different things at different levels.
Interviewees define understanding of a function as understanding of its contract: what parameters does it accept, what does it do, what does it return? White-box comprehension generally isn’t necessary, unless:
- The function has side effects, which means that it cannot be understood in isolation;
- One needs to consider reentrancy or thread synchronisation, i.e. how and when the function can be used;
- One needs to optimise for performance, which requires understanding how the function is implemented.
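The first two exceptions can be sketched in a few lines. The names `add`, `next_id`, `_counter`, and `_counter_lock` are made up for illustration and are not taken from the study. The signature of `add` tells the whole story, whereas `next_id` can only be understood by also knowing about the shared state it modifies.

```python
import threading

_counter = 0  # module-level state shared by every caller
_counter_lock = threading.Lock()


def add(a: int, b: int) -> int:
    """Pure function: the contract is fully visible in the signature."""
    return a + b


def next_id() -> int:
    """Side effect: every call changes _counter, so the result depends on call history."""
    global _counter
    # Without the lock, concurrent callers could receive duplicate ids.
    with _counter_lock:
        _counter += 1
        return _counter
```

Note that `threading.Lock` is not reentrant: code that called `next_id()` while already holding `_counter_lock` would deadlock. That is another fact that cannot be read from the signature alone.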
At the class level, designer intent starts to become more important than interfaces. In order to really understand a class, one needs to know why it exists and when it should (not) be used.
This information cannot be found in the code itself, but can sometimes be found in design documentation or by learning more about the parts of the system where the class is used or located.
Packages (e.g. third-party libraries) are collections of classes that can easily be reused “as is”.
One can understand how a package works without the need to consider the rest of the system. This makes packages the only level at which black-box comprehension is often sufficient.
It’s a bit harder to define what it means to understand an entire system. Many interviewees talk about:
- understanding of the structure of a system: what modules does it consist of, how do they communicate with each other, how does data flow through the system?
- intent of the system as a whole: what is its grand objective and why was it designed this way?
The actual code is completely irrelevant at this level. But it’s not enough to understand the system in its current state: its history and evolution also need to be taken into account. The structure – or even the entire project – may reflect intentions, constraints, and choices that are no longer relevant. This makes it challenging to understand systems.