Chuniversiteit.nl
The Toilet Paper

Deep code comment generation

Why write comments in your code when you can also generate them?

[Illustration: an angry mobster interrogates a programmer who didn’t comment their code]
It’s got some rough edges here and there

Machine learning models can be used to find relevant code snippets for a natural language description. Does that mean we can also do the opposite and predict natural language descriptions for code snippets that lack comments? Hu, Li, Xia, Lo, and Jin designed a model that does just that.

Why it matters

Before a developer can make changes to an existing codebase, they first need to understand it. It’s estimated that developers spend roughly 60% of their time on reading code. This is one of the primary reasons why software maintenance is expensive.

Good comments can help speed up such program comprehension activities. Unfortunately, good comments are rare – they’re often mismatched, missing or outdated.

Since the developers who originally wrote the code are probably gone (or were simply bad at writing or maintaining comments), many researchers have proposed ways to generate comments automatically.

Existing approaches typically make use of manually crafted rules, information retrieval techniques, and heuristics. These approaches work with varying degrees of effectiveness, but generally have two main limitations: they don’t work well with poorly named identifiers and methods, and often need access to similar code snippets.

How the study was conducted

The authors propose a new approach, named DeepCom.

Conception

The whole idea for DeepCom is based on three simple observations:

  • At the very least, most method documentation will say something about what the method does;

  • Neural networks can be used to translate texts from one language to another, e.g. from Chinese to English;

  • While source code is clearly very different from “normal” texts, the two do have some things in common: both are written in a language and consist of words that, when put together in a particular order, convey an idea.

Comment generation can then be treated as neural machine translation from text in one language (e.g. Java) into another (e.g. English).
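
To make this concrete, here’s a minimal sketch of the encoder–decoder (“seq2seq”) idea in PyTorch. This is my own toy illustration, not the authors’ actual architecture: DeepCom is an attention-based model trained on millions of code–comment pairs, while this version only shows how a sequence of code tokens goes in and a sequence of comment tokens comes out.

```python
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    def __init__(self, code_vocab, comment_vocab, dim=64):
        super().__init__()
        self.code_emb = nn.Embedding(code_vocab, dim)
        self.comment_emb = nn.Embedding(comment_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, comment_vocab)

    def forward(self, code_ids, comment_ids):
        # Encode the flattened code sequence into a hidden state...
        _, hidden = self.encoder(self.code_emb(code_ids))
        # ...and decode the comment conditioned on that state.
        dec_out, _ = self.decoder(self.comment_emb(comment_ids), hidden)
        return self.out(dec_out)  # logits over the comment vocabulary

model = TinySeq2Seq(code_vocab=100, comment_vocab=50)
code = torch.tensor([[1, 7, 3, 9]])   # token ids for a method
comment = torch.tensor([[1, 4, 2]])   # token ids for its comment
print(model(code, comment).shape)     # torch.Size([1, 3, 50])
```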

Implementation

Two major challenges needed to be solved: representing the structure of source code, and dealing with its huge vocabulary of tokens.

Structure

Unlike natural language texts, source code is strongly structured and unambiguous. Models that are intended for use with natural language texts need to take this into account.

The authors solved this by applying structure-based traversal (SBT) to the abstract syntax tree (AST) of each method, which yields sequences that DeepCom can process. The resulting sequences make it possible to unambiguously reconstruct the original AST.
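
For illustration, here’s what SBT looks like in a few lines of Python. The toy AST below is hand-built from (node type, children) tuples rather than parsed from real Java, but the bracketing scheme matches the idea described in the paper: every subtree is wrapped in brackets labelled with its node type, so the flat sequence maps back to exactly one tree.

```python
def sbt(node):
    """Structure-based traversal: tree -> unambiguous token sequence."""
    node_type, children = node
    tokens = ["(", node_type]
    for child in children:
        tokens += sbt(child)
    tokens += [")", node_type]
    return tokens

# Toy AST for something like: return a + b
ast = ("ReturnStatement", [
    ("InfixExpression", [
        ("SimpleName_a", []),
        ("SimpleName_b", []),
    ]),
])
print(" ".join(sbt(ast)))
# ( ReturnStatement ( InfixExpression ( SimpleName_a ) SimpleName_a
#   ( SimpleName_b ) SimpleName_b ) InfixExpression ) ReturnStatement
```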

Vocabulary size

Natural language texts usually don’t contain that many different words, which makes it possible to train a model using only the most common words and treating the rest as “unknown words”. This approach doesn’t work for source code: there aren’t that many different keywords and operators, but identifiers tend to be unique to their project, and including all of them isn’t feasible.

These rare identifiers are therefore represented by their (inferred) type, which at least offers some information.
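
A small sketch of how that could work, assuming we already have each token paired with its AST node type (the paper infers these from the parsed code):

```python
from collections import Counter

def build_vocab(token_streams, max_size):
    """Keep only the most frequent tokens across the training corpus."""
    counts = Counter(tok for stream in token_streams for tok, _ in stream)
    return {tok for tok, _ in counts.most_common(max_size)}

def encode(stream, vocab):
    # Frequent tokens are kept as-is; rare identifiers fall back to
    # their node type, which at least preserves some information.
    return [tok if tok in vocab else node_type for tok, node_type in stream]

method = [("getUserName", "SimpleName"), ("return", "Keyword"), (";", "Separator")]
vocab = {"return", ";"}  # pretend only very common tokens made the cut
print(encode(method, vocab))  # ['SimpleName', 'return', ';']
```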

Evaluation

The authors evaluate the performance of DeepCom by feeding it Java code from almost 10,000 repositories and comparing the results with those generated by CODE-NN, which was the state-of-the-art approach at the time of writing.

BLEU scores, which are widely used to measure the quality of machine translation, were calculated for both approaches.
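
In essence, BLEU measures how much n-gram overlap there is between a generated sentence and a human-written reference. If you want to play with it yourself, NLTK ships an implementation (note that the paper reports corpus-level BLEU-4; this is just a single-sentence example):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["convert", "bitmap", "to", "byte", "array"]]
candidate = ["convert", "bitmap", "to", "byte", "array"]

# Smoothing avoids zero scores for short sentences with missing n-grams.
smooth = SmoothingFunction().method1
print(sentence_bleu(reference, candidate, smoothing_function=smooth))  # 1.0
```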

What discoveries were made

DeepCom performs considerably better than CODE-NN. The difference is especially pronounced when SBT is used: DeepCom’s BLEU-4 score of 38% is almost 15 percentage points higher than that of CODE-NN.

Because long AST sequences are truncated, BLEU-4 scores gradually decrease as methods grow longer. The length of the target comment doesn’t have a strong effect on DeepCom’s accuracy; CODE-NN’s accuracy, on the other hand, quickly drops to 0% when it has to generate longer comments of about 30 words.

Examples

Because most people won’t be very familiar with BLEU, I’ll conclude this summary with some examples that should give you an idea of the things that DeepCom is and isn’t capable of.

Comments that are just as good

The comment that DeepCom predicts

Convert Bitmap to byte array

is exactly the same as the comment that a human has written

Convert Bitmap to byte array

Comments that are better

The comment that DeepCom predicts

Returns true if the symbol is empty

is better than the comment that a human has written

Is this symbol table empty?

Comments that don’t make sense

The comment that DeepCom predicts

Creates item layouts if any parameters

is not as good as the one written by a human

Creates item layouts if necessary

Summary

  1. Code comment generation can be viewed as a form of translation from text in one language (source code) to text in another (a natural language)

  2. Source code can be unambiguously represented using sequences by applying structure-based traversal on its abstract syntax tree

  3. Identifiers with few occurrences can be represented using their type information to limit the size of the training vocabulary