Deep code comment generation (2018)

An angry mobster interrogates a programmer who didn’t comment their code

Machine learning models can be used to find relevant code snippets for a natural language description. Does that mean we can also do the opposite and predict natural language descriptions for code snippets that lack comments? Hu, Li, Xia, Lo, and Jin designed a model that does just that.

Why it matters

Before a developer can make changes to an existing codebase, they first need to understand it. It’s estimated that developers spend roughly 60% of their time on reading code. This is one of the primary reasons why software maintenance is expensive.

Good comments can help speed up such program comprehension activities. Unfortunately, good comments are rare: in practice they are often mismatched, missing, or outdated.

Since the developers who originally wrote the code are probably gone (or simply bad at writing or maintaining comments), many researchers have proposed automated comment generation instead.

Existing approaches typically make use of manually crafted rules, information retrieval techniques, and heuristics. These approaches work with varying degrees of effectiveness, but generally have two main limitations: they don’t work well with poorly named identifiers and methods, and often need access to similar code snippets.

How the study was conducted

The authors propose a new approach, named DeepCom.

Conception

The whole idea for DeepCom is based on three simple observations:

  1. Source code is text that's written in a (programming) language, such as Java
  2. Comments are text that's written in a (natural) language, such as English
  3. Neural machine translation models are good at translating text from one language to another

Comment generation can then be treated as a neural machine translation problem: translating text in one language (e.g. Java) to text in another (e.g. English).

Implementation

Two major challenges needed to be solved: representing the structure of source code, and keeping the vocabulary of tokens at a manageable size.

Structure

Unlike natural language texts, source code is strongly structured and unambiguous. Models that were designed for natural language don't capture this structure out of the box, so it needs to be made explicit.

The authors solved this by applying a structure-based traversal (SBT) to the abstract syntax tree (AST) of each method, which produces flat sequences that can be processed by DeepCom. The traversal wraps every subtree in brackets that are labeled with its root node, which makes it possible to unambiguously reconstruct the AST from the resulting sequence.
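To get a feel for how this works, here's a minimal sketch of such a structure-based traversal. The Node class below is a stand-in of my own making, not the authors' implementation:

import java.util.List;

class Node {
    String label;          // e.g. "MethodDeclaration", or a token such as "root"
    List<Node> children;   // empty for leaf nodes

    Node(String label, List<Node> children) {
        this.label = label;
        this.children = children;
    }

    // Wraps the subtree rooted at this node in brackets that are labeled
    // with the node itself, so the tree can be rebuilt from the sequence.
    String sbt() {
        StringBuilder sequence = new StringBuilder("( ").append(label);
        for (Node child : children) {
            sequence.append(" ").append(child.sbt());
        }
        return sequence.append(" ) ").append(label).toString();
    }
}

An expression like root == null then comes out as ( InfixExpression ( SimpleName ) SimpleName ( NullLiteral ) NullLiteral ) InfixExpression. The labeled brackets are what make the mapping reversible: a plain pre-order traversal of node labels could correspond to many different trees.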

Vocabulary size

Natural language texts usually don't contain that many distinct words, which makes it possible to train a model on only the most common words while treating the rest as "unknown" words. This approach doesn't work for source code: there are only a handful of different keywords and operators, but identifiers are largely unique, and including all of them in the vocabulary isn't feasible.

These rare identifiers are therefore represented by their (inferred) type, which at least offers some information.
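Concretely, the encoding step could look something like the sketch below. The class and method names are hypothetical and only meant to illustrate the idea:

import java.util.Set;

class TokenEncoder {
    private final Set<String> vocabulary;  // the most frequent tokens in the corpus

    TokenEncoder(Set<String> vocabulary) {
        this.vocabulary = vocabulary;
    }

    // Common tokens are kept as-is, while rare identifiers are replaced by
    // the type of their AST node (e.g. "SimpleName"). This keeps the
    // vocabulary small while still conveying some information.
    String encode(String token, String nodeType) {
        return vocabulary.contains(token) ? token : nodeType;
    }
}

A rare method name like bitmapToByte would thus be encoded as SimpleName, while frequently occurring tokens are left untouched.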

Evaluation

The authors evaluate the performance of DeepCom by feeding it Java code from almost 10,000 repositories and comparing the results with those generated by CODE-NN (not to be confused with CODEnn, which was discussed last week), which was the state-of-the-art approach at the time of writing.

BLEU scores, which are widely used to assess the accuracy of neural machine translation (or, more simply put, the quality of the generated comments), were calculated for both approaches.
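For those who are curious: a BLEU-N score combines the modified n-gram precisions p_n of the generated text (the fraction of its n-grams that also appear in the reference) with a brevity penalty BP that punishes overly short output. In its usual form,

$$\text{BLEU-}N = BP \cdot \exp\Big( \sum_{n=1}^{N} w_n \log p_n \Big), \qquad BP = \min\big(1,\ e^{1 - r/c}\big)$$

where w_n = 1/N, r is the length of the reference, and c the length of the candidate. The BLEU-4 scores reported below therefore measure the overlap in 1- to 4-grams between generated and human-written comments.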

What discoveries were made

DeepCom performs a lot better than CODE-NN. The difference is especially pronounced when SBT is used: its BLEU-4 score of 38% is almost 15 percentage points higher than that of CODE-NN.

Because long AST sequences are truncated, the BLEU-4 scores gradually decrease as methods grow longer. The length of the target comment doesn't have a strong effect on DeepCom's accuracy; CODE-NN's accuracy, on the other hand, quickly drops to 0% when it's asked to generate longer comments of about 30 words.

Examples

Because most people won't be very familiar with BLEU, I'll conclude this summary with some examples that should give you an idea of what DeepCom is and isn't capable of.

Comments that are just as good

public static byte[] bitmapToByte(Bitmap b) {
    ByteArrayOutputStream o = new ByteArrayOutputStream();
    b.compress(Bitmap.CompressFormat.PNG, 100, o);
    return o.toByteArray();
}

The comment that DeepCom predicts

Convert Bitmap to byte array

is exactly the same as the comment that a human has written

Convert Bitmap to byte array

Comments that are better

public boolean isEmpty() {
    return root == null;
}

The comment that DeepCom predicts

Returns true if the symbol is empty

is better than the comment that a human has written

Is this symbol table empty?

Comments that don’t make sense

protected void createItemsLayout() {
    if (mItemsLayout == null) {
        mItemsLayout = new LinearLayout(getContext());
        mItemsLayout.setOrientation(LinearLayout.VERTICAL);
    }
}

The comment that DeepCom predicts

Creates item layouts if any parameters

is not as good as the one written by a human

Creates item layouts if necessary

The important bits

  1. Code comment generation can be viewed as a form of translation from text in a programming language to text in a natural language
  2. Source code can be unambiguously represented as a sequence by applying a structure-based traversal to its abstract syntax tree
  3. Identifiers with few occurrences can be represented using their type information to limit the size of the training vocabulary