Deep code search (2018)
Text search is something that (mostly) “just works”, but the same can’t be said of code search. Gu, Zhang, and Kim present a deep neural network that can be used to retrieve code snippets based on natural language queries and a proof-of-concept application that demonstrates the feasibility of this approach.
Why it matters
Most code search engines that are currently in use rely on information retrieval techniques like usage of structural information, keyword matching, and PageRank(-like) algorithms.
These approaches generally don’t work very wellExhibit A: GitHub’s awful search functionality. This is because of a mismatch between the high-level intent of a searcher and the low-level implementation of that intent in the source code:
- the terminology used by the searcher might differ completely from that used in the code;
- natural language queries issued by a searcher may contain irrelevant keywords that add noise to search results.
A higher-level semantic mapping between the user’s world and the “source code world” might yield better results.
How the study was conducted
The authors propose a deep neural network named CODEnn (Code-Description Embedding Neural Network) that makes it possible to search for code in a more natural way.
If all functions or methodsI’ll just call these “functions” in the remainder of this article so I don’t have to repeat the “or methods” part all the time in a codebase are thoroughly commented, one can simply find specific pieces of code by simply searching for keywords that are included in the function documentation.
Unfortunately, most codebases aren’t actually that thoroughly commented, so this wouldn’t work very well in practice – or would it?
Code search can be seen as a prediction problem, where a model tries to guess relevant results for a particular search query. In this case CODEnn is that model.
CODEnn does this in two steps.
It first uses a technique called embedding, which is typically used to learn vector representationsMachine learning technically only works on numbers. All information must therefore be converted to a list of numbers before a machine learning algorithm can do anything useful with it. of natural language sentences, such that sentences that are similar to each other have similar representations.
Here it’s applied to function documentation, which consists of one or more sentences, and functions, which are represented by a name, a sequence of API invocations and a list of tokens. This yields two different sets of vectors: one for the functions and one for their documentation.
Joint embedding is the last piece of the puzzle: this technique allows the model to learn the mappings between code and documentation by embedding both of them within one unified vector space based on their similarity.
The authors also implemented a proof-of-concept search tool named DeepCS, that allows users to search for code snippets using queries in plain English. The tool makes use of a CODEnn model that’s trained on 18 million commented Java methods from public GitHub repositories.
After training the CODEnn model, about 16 million commented and uncommented code snippets from roughly 10,000 other Java repositories (i.e. not included in the training set) were fed into DeepCS.
The authors then created a list of queries that were derived from the top 50 Java programming questions on Stack Overflow and evaluated DeepCS’s search results for each of the queries.
The evaluation was performed using several common metrics. The most important of these is probably FRank, which is essentially the position (rank) of the first search result that resembles the snippet in the accepted answer.
What discoveries were made
DeepCS’s search results tend to be more relevant than those returned by similar approaches that are used by CodeHow and a conventional Lucene-based code search tool.
This appears to be because DeepCS is better at understanding the relative importance of each keyword within a query and inferring the correct meaning of words within their context.
To illustrate what that last bit means, the authors provide two examples of queries that contain very similar words, but mean very different things and should therefore return very different results:
queue an event to be run on the thread
run an event on a thread queue
Another interesting observation is that DeepCS is able to find relevant code snippets even if those snippets do not contain any of the keywords used in the query. For example, searching for
play a song will also include results that are associated with words like
Having said that, the method does have its limitations. One major limitation is that DeepCS only ranks results based on their semantic vectors: it does not take other features into account. This may cause partially relevant results to be ranked above exact results.