The Toilet Paper

Deep code comment generation

Why write comments in your code when you can also generate them?

An angry mobster interrogates a programmer who didn’t comment their code
It’s got some rough edges here and there

Machine learning models can be used to find relevant code snippets for a natural language description. Does that mean we can also do the opposite and predict natural language descriptions for code snippets that lack comments? Hu, Li, Xia, Lo, and Jin designed a model that does just that.

Why it matters


Before a developer can make changes to an existing codebase, they first need to understand it. It’s estimated that developers spend roughly 60% of their time on reading code. This is one of the primary reasons why software maintenance is expensive.

Good comments can help speed up such program comprehension activities. Unfortunately, good comments are rare – they’re often mismatched, missing or outdated.

Since the developers who originally wrote the code are probably gone (or simply bad at writing or maintaining comments), many have proposed automated comment generation.

Existing approaches typically make use of manually crafted rules, information retrieval techniques, and heuristics. These approaches work with varying degrees of effectiveness, but generally have two main limitations: they don’t work well with poorly named identifiers and methods, and often need access to similar code snippets.

How the study was conducted


The authors propose a new approach, named DeepCom.



The whole idea for DeepCom is based on three simple observations:

  • At the very least, most method documentation will say something about what the method does;

  • Neural networks can be used to translate texts from one language to another, e.g. from Chinese to English;

  • While source code is clearly very different from “normal” texts, the two do have some things in common: both are written in a language and consist of words that, when put together in a particular order, convey an idea.

Comment generation can then be treated as neural machine translation of text in one language (e.g. Java) into another (e.g. English).
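Framed this way, each training example is simply a parallel pair: the method’s token sequence on the “source language” side and its comment’s token sequence on the “target language” side. The sketch below illustrates that framing; the tokenizer and function name are illustrative, not DeepCom’s actual preprocessing.

```python
# A minimal sketch of framing comment generation as translation:
# each example pairs a Java method's token sequence with its
# comment's token sequence, like a sentence pair in NMT.

def make_translation_pair(method_source: str, comment: str):
    """Tokenize both 'languages' into parallel sequences (toy tokenizer)."""
    code_tokens = method_source.replace("(", " ( ").replace(")", " ) ").split()
    comment_tokens = comment.lower().split()
    return code_tokens, comment_tokens

code, target = make_translation_pair(
    "public boolean isEmpty() { return size == 0; }",
    "Returns true if the symbol table is empty",
)
```

A sequence-to-sequence model is then trained to map `code` to `target`, exactly as it would for Chinese-to-English sentence pairs.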



Two major challenges needed to be solved: source code representation and dealing with a heterogeneous set of tokens.


Source code representation

Unlike natural language texts, source code is strongly structured and unambiguous. Models that are intended for use with natural language texts need to take this into account.

The authors solved this by applying a structure-based traversal (SBT) to the abstract syntax tree (AST) of each method, producing sequences that DeepCom can process. The traversal makes it possible to unambiguously reconstruct the AST from a sequence.
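The traversal can be sketched as follows. This is a toy version that assumes an AST of simple `(label, children)` tuples; the real DeepCom operates on Java parse trees. The key property is that every subtree is wrapped in brackets labelled with its node, so the flat sequence encodes the full tree shape.

```python
# A toy sketch of structure-based traversal (SBT): wrap each subtree
# in brackets labelled with its node, so the tree can be rebuilt
# unambiguously from the flat token sequence.

def sbt(node):
    label, children = node
    tokens = ["(", label]
    for child in children:
        tokens += sbt(child)
    tokens += [")", label]
    return tokens

# A tiny hand-written AST for `return size == 0;`
ast = ("ReturnStmt", [("BinaryExpr:==", [("Name:size", []), ("Int:0", [])])])
print(" ".join(sbt(ast)))
# ( ReturnStmt ( BinaryExpr:== ( Name:size ) Name:size ( Int:0 ) Int:0 ) BinaryExpr:== ) ReturnStmt
```

Because every closing bracket repeats its node label, a parser reading the sequence left to right always knows which subtree is being closed, which is what makes the reconstruction unambiguous.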

Vocabulary size

Natural language texts usually don’t contain that many different words, which makes it possible to train a model on only the most common words and treat the rest as “unknown words”. This approach doesn’t work for source code: there aren’t that many different keywords and operators, but identifiers are often project-specific and rarely repeated. Including all of them in the vocabulary isn’t feasible.

These rare identifiers are therefore represented by their (inferred) type, which at least offers some information.
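The replacement step can be sketched like this. The frequency threshold, function name, and `token_types` mapping are illustrative assumptions, not DeepCom’s actual type-inference machinery.

```python
# A sketch of the vocabulary trick: tokens seen fewer than `min_count`
# times are replaced by a type-based placeholder (or <UNK> when no
# type is known), shrinking the training vocabulary.

from collections import Counter

def shrink_vocab(sequences, token_types, min_count=2):
    counts = Counter(tok for seq in sequences for tok in seq)
    return [
        [tok if counts[tok] >= min_count else token_types.get(tok, "<UNK>")
         for tok in seq]
        for seq in sequences
    ]

seqs = [["size", "=", "0"], ["size", "=", "myWeirdCounter"]]
types = {"myWeirdCounter": "SimpleName", "0": "IntegerLiteral"}
print(shrink_vocab(seqs, types))
# [['size', '=', 'IntegerLiteral'], ['size', '=', 'SimpleName']]
```

A one-off name like `myWeirdCounter` would otherwise bloat the vocabulary; its type placeholder still tells the model roughly what kind of token stood there.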



The authors evaluate the performance of DeepCom by feeding it Java code from almost 10,000 repositories and comparing the results with those generated by CODE-NN, which was the state-of-the-art approach at the time of writing.

BLEU scores, which are widely used to assess the quality of machine translation output, were calculated for both approaches.

What discoveries were made


DeepCom performs considerably better than CODE-NN. The difference is especially pronounced when structure-based traversal (SBT) is used: DeepCom’s BLEU-4 score of 38% is almost 15 percentage points higher than that of CODE-NN.

Because long AST sequences are truncated, the BLEU-4 scores gradually decrease as the length of methods increases. The target comment length doesn’t have a strong effect on DeepCom’s accuracy. CODE-NN’s accuracy on the other hand quickly decreases to 0% if it’s asked to generate longer comments of about 30 words.



Because most people won’t be very familiar with BLEU, I’ll conclude this summary with some examples that should give you an idea of the things that DeepCom is and isn’t capable of.

Comments that are just as good

The comment that DeepCom predicts

Convert Bitmap to byte array

is exactly the same as the comment that a human has written

Convert Bitmap to byte array

Comments that are better

The comment that DeepCom predicts

Returns true if the symbol is empty

is better than the comment that a human has written

Is this symbol table empty?

Comments that don’t make sense

The comment that DeepCom predicts

Creates item layouts if any parameters

is not as good as the one written by a human

Creates item layouts if necessary

Summary

  1. Code comment generation can be viewed as a form of translation from text in one language (source code) to text in another (a natural language)

  2. Source code can be unambiguously represented using sequences by applying structure-based traversal on its abstract syntax tree

  3. Identifiers with few occurrences can be represented using their type information to limit the size of the training vocabulary