Casting doubt on the ability of tools to measure technical debt
As developers, most of us do our best to avoid creating technical debt, which is often caused by bad code that leads to excessive maintenance costs in the future. To avoid – or at least keep track of – these costs, some teams use tools like SonarQube or NDepend to identify bad code that may cause technical debt.
Curiously enough, each tool uses its own definitions of “bad” code to detect the presence of technical debt. What does this mean?
The researchers compared five well-known tools that can detect technical debt at the file level: SonarQube, Designite Java, DV8, Structure101 and Software Archinaut. The Succinct Code Counter was also included in the comparison for normalisation (or control) purposes.
These tools were used to analyse 10 Java projects from the 20-MAD dataset, each of which had at least 1,000 commits linked to an issue in an issue tracker. The researchers compared the output of these tools with each other, and with four maintainability measures that can be seen as indicators of the “interest” that results from technical debt:
- the number of times a file appears in commits
- the number of lines added or removed in these commits
- the number of file revisions linked to bug commits
- the number of lines added or removed in bug commits for a file
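As a rough illustration of how these four indicators could be derived from a project's history, here is a minimal sketch. The `Commit` and `FileChange` types and the `is_bug_fix` flag are hypothetical names introduced for this example; it assumes each commit has already been classified as a bug fix or not via its linked issue.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FileChange:
    path: str
    added: int      # lines added to this file in the commit
    removed: int    # lines removed from this file in the commit


@dataclass
class Commit:
    is_bug_fix: bool            # assumption: derived from the linked issue
    changes: List[FileChange]


def maintenance_measures(commits: List[Commit]) -> Dict[str, Dict[str, int]]:
    """Aggregate the four per-file indicators over a list of commits."""
    measures = defaultdict(
        lambda: {"revisions": 0, "churn": 0, "bug_revisions": 0, "bug_churn": 0}
    )
    for commit in commits:
        for change in commit.changes:
            entry = measures[change.path]
            lines = change.added + change.removed
            entry["revisions"] += 1          # file appears in a commit
            entry["churn"] += lines          # lines added or removed
            if commit.is_bug_fix:
                entry["bug_revisions"] += 1  # revision linked to a bug commit
                entry["bug_churn"] += lines  # lines changed in bug commits
    return dict(measures)
```

Files that score high on these counts are the ones that keep costing effort, which is why they can serve as a stand-in for the interest paid on debt.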
Some measures, like file size (the number of lines of code), cyclomatic complexity, and the number of import cycles between files or packages are fairly easy to compute. Consequently, one would expect that each tool would give you the same result.
That’s not actually the case. Even for the simplest measure, file size, the pair-wise correlation between tools is often “only” between 0.95 and 1.00. As a result, the tools also don’t entirely agree on which files are the largest (and thus most problematic).
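To see how something as simple as file size can come out differently, consider this small sketch, which counts the lines of one Java-like snippet in three ways. It is not how any of the tools above works; the comment handling is deliberately simplified (a line that starts with a comment marker is treated as a comment in its entirety).

```python
def count_lines(source: str) -> dict:
    """Count physical, non-blank, and code lines of a Java-like source text."""
    physical = 0
    non_blank = 0
    code = 0
    in_block_comment = False

    for raw in source.splitlines():
        physical += 1
        line = raw.strip()
        if not line:
            continue
        non_blank += 1

        if in_block_comment:
            # Simplification: anything after the closing */ is ignored.
            if "*/" in line:
                in_block_comment = False
            continue
        if line.startswith("//"):
            continue
        if line.startswith("/*"):
            if "*/" not in line:
                in_block_comment = True
            continue
        code += 1  # lines with trailing comments still count as code

    return {"physical": physical, "non_blank": non_blank, "code": code}


sample = """/*
 * Header
 */
package demo;

// A class
class A {
    int x = 1; // trailing
}
"""
print(count_lines(sample))
```

The same nine-line file is reported as 9, 8, or 4 lines depending on the definition, so two tools can disagree on size without either of them being wrong.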
Size is an important measure, because it’s well known that it has an impact on almost every other measure. The researchers demonstrate this by computing the correlations for cyclomatic complexity twice. When computing it directly, correlations between the six tools seem relatively high. However, correlations drop significantly once file size is taken into account.
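One common way to take size into account is a partial correlation, which measures how strongly two variables are related once the effect of a third has been removed. The sketch below is only an illustration of that idea, not necessarily the normalisation the researchers used. The numbers are synthetic and were chosen so that the two hypothetical tools agree only because both of their measures grow with file size.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def partial_correlation(xs, ys, zs):
    """Correlation between xs and ys after controlling for zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Synthetic data for eight files (not taken from the study).
size = [100, 200, 300, 400, 500, 600, 700, 800]
complexity_tool_a = [12, 18, 28, 42, 52, 58, 68, 82]
complexity_tool_b = [7, 8, 17, 18, 23, 32, 33, 42]

raw = pearson(complexity_tool_a, complexity_tool_b)
controlled = partial_correlation(complexity_tool_a, complexity_tool_b, size)
print(f"raw correlation:     {raw:.3f}")
print(f"controlled for size: {controlled:.3f}")
```

In this constructed case the raw correlation is close to 1, while the size-controlled correlation is essentially zero: the apparent agreement was carried entirely by file size.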
Correlations are even lower when the researchers look at “fancy” measures, like those that focus on code smells, design smells, or possible bugs. Even measures that are not related to size, like cycles, are not computed consistently between different tools.
The researchers found the following explanations for the disagreement between different debt detection tools:
- Different tools compute file size in different ways. For instance, some tools count all lines, while others only count lines with code. Moreover, it is possible that some count logical rather than physical lines.
- There are different ways to calculate the cyclomatic complexity of functions. There are also different ways to aggregate these values for a file: some tools sum the individual cyclomatic complexities of each function, while others use a weighted approach.
- Cycles come in different shapes and sizes. Some tools only look at pair-wise cycles, while others see them as strongly connected graphs. Moreover, some files can be part of multiple cycles, but certain tools see cycles as mutually exclusive.
- Each tool defines its own smells, so it is not very surprising that tools disagree about which smells are present and where.
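The difference between the two views on cycles can be made concrete with a small sketch. It is not the implementation of any of the tools discussed here. The dependency graph and file names are made up, and the strongly connected components are found with Tarjan's algorithm (recursive, so only suitable for small graphs).

```python
def pairwise_cycles(deps):
    """Pairs of files that depend directly on each other."""
    pairs = set()
    for src, targets in deps.items():
        for dst in targets:
            if src != dst and src in deps.get(dst, ()):
                pairs.add(frozenset((src, dst)))
    return pairs


def strongly_connected_components(deps):
    """Tarjan's algorithm; returns components with more than one file."""
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    components = []
    counter = 0

    def visit(node):
        nonlocal counter
        index[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for nxt in deps.get(node, ()):
            if nxt not in index:
                visit(nxt)
                lowlink[node] = min(lowlink[node], lowlink[nxt])
            elif nxt in on_stack:
                lowlink[node] = min(lowlink[node], index[nxt])
        if lowlink[node] == index[node]:
            # node is the root of a component: pop its members off the stack
            component = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == node:
                    break
            if len(component) > 1:
                components.append(frozenset(component))

    nodes = set(deps) | {t for targets in deps.values() for t in targets}
    for node in sorted(nodes):
        if node not in index:
            visit(node)
    return components


# Hypothetical import graph: A -> B -> C -> A, and D <-> E, with E -> A.
deps = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["E"], "E": ["D", "A"]}
print(pairwise_cycles(deps))
print(strongly_connected_components(deps))
```

A tool that only looks at pairs flags D and E and misses the A, B, C loop entirely, while a tool based on strongly connected components reports two cycles covering all five files.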
Given the large differences between tools, it is unlikely that all of them are good at detecting technical debt.
The results initially seem somewhat promising as there is some correlation between measures implemented by the tools and the four indicators of technical debt. However, after normalising them we once again see that the results are pretty abysmal: virtually no correlation exists between the tools’ measures and actual technical debt.
Only DV8, which incorporates history information from a version control system, manages to come close, especially with its Modularity Violation and Unstable Interface anti-patterns.
The overall conclusion is that most debt detection tools probably aren’t good at detecting actual technical debt. Any advice given by a debt detection tool should be taken with a huge grain of salt and is probably best ignored, unless the tool can make good use of historical information to understand the changes that are being made to a file.
- Most debt detection tools are not able to detect technical debt
- Technical debt can only be detected properly when the change history of code is taken into account