Micro-clones in evolving software (2018)

A group of gnomes use a Surface Studio to edit a PHP script that outputs lyrics for Duck Sauce’s Barbra Streisand

Codebases often contain code clones: code fragments that are very similar or even completely identical to each other. Until now, only larger clones have been studied thoroughly – not much is known about micro-clones, which are only 1–4 lines of code. Mondal, Roy, and Schneider show that these micro-clones are quite widespread.

Why it matters

The characteristics and impact of code clones on software development and maintenance have been studied extensively by researchers.

While some have found that code cloning has positive effects, there’s also plenty of strong evidence that code cloning makes programs more prone to bugs due to unintentional inconsistencies.

Tools that keep track of clones within a codebase can help mitigate these issues. Most tools – and researchers for that matter – only look at larger clones, as it’s generally thought that smaller code clones don’t really matter that much.

Those smaller clones are called “micro-clones” and may be as small as a single line of code. Typical examples of cloned one-liners are invocations or declarations with hard-coded values, e.g. CSS declarations like “color: #c90016”.

The authors argue that micro-clones can also have a strong negative effect on software quality and should therefore also be covered by tracking tools.

How the study was conducted

The authors mine commits from six open-source Java and C application repositories for micro-clones.

Intuitively, code clones can be recognised by looking for all lines that look the same and are changed in the same way within a single commit: the same line might have been added, updated, or removed in multiple places.

Strictly speaking, “looking the same” only applies to Type 1 clones. Code fragments that have different types or identifiers but are syntactically the same are called Type 2 clones, while Type 3 clones consist of fragments that are almost syntactically identical. This study mostly focusses on Type 1 and Type 2 clones, as Type 3 clones are close to impossible to detect in micro-clones.
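
The intuition above can be sketched in a few lines of Python. This is only an illustration, not the authors' tooling: the `(file, line_no, kind, text)` record, the tiny keyword set, and the function names are all assumptions made for this example, and an update is simplified to its new text. Type 1 matching only collapses whitespace, while the rough Type 2 approximation also replaces identifiers and literals with placeholders.

```python
import re
from collections import defaultdict

# Tiny illustrative keyword subset (an assumption, not a full C/Java keyword list).
KEYWORDS = {"int", "return", "if", "else", "for", "while"}
# String literals, numbers, and identifiers, matched in a single pass.
TOKEN = re.compile(r'"[^"]*"|\d+(?:\.\d+)?|[A-Za-z_]\w*')


def normalise(line, type2=False):
    """Collapse whitespace (Type 1); optionally abstract names and literals (Type 2)."""
    line = " ".join(line.split())
    if not type2:
        return line

    def abstract(match):
        tok = match.group(0)
        if tok.startswith('"'):
            return "STR"
        if tok[0].isdigit():
            return "NUM"
        return tok if tok in KEYWORDS else "ID"

    return TOKEN.sub(abstract, line)


def consistent_changes(changes, type2=False):
    """Group changed lines by (kind, normalised text); keep groups seen in 2+ places."""
    groups = defaultdict(list)
    for file, line_no, kind, text in changes:
        groups[(kind, normalise(text, type2))].append((file, line_no))
    return {key: places for key, places in groups.items() if len(places) > 1}


# Hypothetical changed lines from one commit.
commit = [
    ("a.c", 10, "updated", "timeout = 30;"),
    ("a.c", 42, "updated", "timeout  =  30;"),
    ("b.c", 7, "updated", "delay = 60;"),
    ("b.c", 9, "added", "free(buf);"),
]
type1 = consistent_changes(commit)              # the two identical updates in a.c
type2 = consistent_changes(commit, type2=True)  # additionally groups the update in b.c
```

With Type 1 matching only the two `timeout` updates are grouped; with the Type 2 approximation the `delay` update joins them, because all three lines reduce to the same shape.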

All these lines might be micro-clones, but it’s also possible that they’re simply part of regular code clones, i.e. clones that are at least 5 lines of code.

Therefore, the NiCad clone detector is executed on the same commits to detect regular code clones. Any change that is not included in the set of regular code clones is likely a micro-clone.
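
The filtering step might look roughly like the sketch below. It assumes the regular clones have already been detected and converted into `(file, start, end)` line ranges; that shape is hypothetical and is not NiCad's actual output format, and the function names are made up for this example.

```python
def in_regular_clone(file, line_no, clone_ranges):
    """True if the line falls inside any regular clone range (inclusive bounds)."""
    return any(
        f == file and start <= line_no <= end
        for f, start, end in clone_ranges
    )


def micro_clone_candidates(changed_places, clone_ranges):
    """Keep only the changed lines that are not covered by a regular clone."""
    return [
        (file, line_no)
        for file, line_no in changed_places
        if not in_regular_clone(file, line_no, clone_ranges)
    ]


# Hypothetical data: one regular clone covering lines 40-52 of a.c.
regular = [("a.c", 40, 52)]
places = [("a.c", 10), ("a.c", 42), ("b.c", 7)]
candidates = micro_clone_candidates(places, regular)
```

Here the change at line 42 of `a.c` is dropped because it lies inside a regular clone, leaving the other two places as likely micro-clone changes.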

This yields sufficient information for statistical and qualitative analyses of micro-clones.

What discoveries were made

It turns out that micro-clones are very common.

The majority of consistent changes (about 80%) that were made throughout the history of the six projects occur in micro-clones. Only 16% occur in regular code clones. The remaining 4% is “uncategorised” and consists of changes to or around single-line characters, like { and }. These actually don’t matter.

Manual analysis of 300 micro-clones suggests that most of these changes are non-trivial: the changes aren’t merely changes in spacing or variable naming, but actually affect what the program does or shows.

Most changes that are consistently made in micro-clones are updates (80%). Additions (12%) and deletions (8%) are comparatively rare.

The distribution of micro-clone sizes tends to vary a bit among the six applications, but single-line micro-clones appear to be the most common in many of the studied repositories.

Finally, the authors note that micro-clone pairs usually reside in the same file, although this does not necessarily have to be the case.

The important bits

  1. Micro-clones, which are code clones that measure less than 5 lines of code, may account for about 80% of all consistent changes
  2. Most changes to micro-clones consist of modifications to existing lines. Only a few changes are additions or deletions.
  3. Many micro-clones are single-line clones
  4. Clone trackers should keep track of micro-clones: failure to update them consistently may result in bugs or unexpected behaviour