General-purpose search engines like Google, Yahoo, and Bing often try to understand what you’re looking for in an attempt to provide you with more relevant results. For example, if you conduct a search for “ps5 games”, a search engine will likely deduce that you may also want to see search results that would match “PlayStation 5 game”.
Custom-built search engines for enterprise software are often a lot simpler. Some of them may offer the ability to formulate queries using boolean operators, but only because the underlying search engine (typically Elasticsearch) provides that functionality out of the box.
It’s a shame that developers often don’t put in a little more effort to create a vastly better search experience for end users, especially because it’s not that hard to do.
In this article, I’ll discuss two techniques to rewrite a user’s search query such that it will likely match more results. Some of these techniques are a form of query expansion, and work by finding related words for each term in the original search query and appending them to an “expanded” version that’s sent to the search engine.
The first technique is called stemming and is technically not query expansion. However, it achieves roughly the same purpose (increasing the number of matches) and is easy to implement, especially if your search engine uses Elasticsearch.
In Western languages like English and French, words are written as sequences of alphabetical characters. Different character sequences represent different words and thus tend to have different meanings. Interestingly, if two words start with the exact same character sequence, they are likely to be semantically related, i.e. they have roughly the same meaning. For instance, the words “game”, “games” and “gaming” all start with “gam”. This “gam” is called a stem and can be seen as the root form of all three words.
Elasticsearch provides support for two built-in ways to automatically determine the stem of a word: algorithmic stemming and dictionary stemming.
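As a sketch of the first option, algorithmic stemming can be enabled in Elasticsearch with its built-in `stemmer` token filter. The index, filter, and analyzer names below are made up for illustration:

```json
PUT /products
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        }
      },
      "analyzer": {
        "stemmed_english": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "english_stemmer"]
        }
      }
    }
  }
}
```

The `language` parameter selects one of the algorithmic stemmers that ship with Elasticsearch; for dictionary stemming, Elasticsearch offers the `hunspell` token filter instead, which requires you to provide dictionary files.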
Algorithmic stemming works by applying a series of rules to a word to reduce it to a root form. This method is simple, cheap, and fast, but suffers from a number of problems:
Words that are highly irregular, like “be” and “are”, will obviously be reduced to completely different stems and thus be treated as if they are unrelated.
Understemming can also happen when words do look similar, e.g. when words are compounds with identical endings, like “moonlight” and “starlight”, or in the case of “alumnus”, “alumni”, and “alumna”, which are reduced to “alumnu”, “alumni”, and “alumna” respectively.
Overstemming is the opposite of understemming, and refers to errors that happen when words that look similar but have different meanings, like “universe” and “university”, are reduced to the same stem (in this case “univers”).
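To make this concrete, here is a deliberately crude suffix-stripping stemmer. It is a toy sketch, far simpler than real algorithmic stemmers like Porter’s, but it reproduces the examples above, including the understemming and overstemming failure modes:

```python
# Suffixes to strip, tried in order. A real stemmer has many more rules.
SUFFIXES = ("ing", "ity", "es", "s", "e")

def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# "game", "games" and "gaming" all collapse onto the same stem:
print([naive_stem(w) for w in ("game", "games", "gaming")])      # ['gam', 'gam', 'gam']

# Understemming: related forms end up with distinct stems:
print([naive_stem(w) for w in ("alumnus", "alumni", "alumna")])  # ['alumnu', 'alumni', 'alumna']

# Overstemming: unrelated words collapse onto one stem:
print([naive_stem(w) for w in ("universe", "university")])       # ['univers', 'univers']
```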
Dictionary stemming works by looking up stems of words in a dictionary. A dictionary can be as simple as a CSV file. This method solves many of the problems with under- and overstemming, but this comes at a cost: you’ll have to provide your own dictionaries, which potentially consume a lot of memory and need to be kept up-to-date as new words are introduced into languages.
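A minimal sketch of dictionary stemming might look like this; the CSV is inlined and hand-built for illustration, whereas a real deployment would load a much larger, maintained dictionary file:

```python
import csv
import io

# A stem dictionary can be as simple as a CSV of word,stem pairs.
DICTIONARY_CSV = """\
word,stem
be,be
is,be
are,be
was,be
alumnus,alumnus
alumni,alumnus
alumna,alumnus
"""

def load_stems(csv_text: str) -> dict[str, str]:
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["word"]: row["stem"] for row in reader}

STEMS = load_stems(DICTIONARY_CSV)

def dictionary_stem(word: str) -> str:
    # Fall back to the word itself for anything not in the dictionary.
    return STEMS.get(word, word)

print(dictionary_stem("are"))     # be
print(dictionary_stem("alumni"))  # alumnus
```

Note how the irregular forms of “be” and the “alumnus” family, which defeated the algorithmic approach, now map onto the correct stems, at the cost of having to list every form explicitly.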
We can take things one step further by including words that have roughly similar meanings. This saves users the trouble of having to formulate elaborate queries that include any keyword that might be relevant for the thing they are looking for.
We can distinguish between several types of words that have similar meanings:
Synonyms are words that have roughly or even precisely the same meaning as another word. For example, “big” is a synonym for “large”.
Hypernyms are words that have a more general meaning than another word. For example, “game” is a hypernym of “video game” and “board game”.
Hyponyms are words that have a more specific meaning than another word. For example, “PlayStation 5” and “Xbox One S” can be considered to be hyponyms of “gaming console”.
WordNet is an open-source lexical database of the English language that describes these (and other) relationships between words. The English version of WordNet is part of the Natural Language Toolkit (NLTK) for Python. Some people have created their own publicly available versions of WordNet for other languages, like Dutch and German.
Although WordNet can be a useful solution in some cases, it also has major drawbacks, one of which is that it’s incapable of taking context into account. For example, if a user searches for “banking left plane”, they probably do not want their results to be about “finance”.
Language models like BERT and GPT are capable of taking context into account when predicting possibly related keywords (to some extent), but are also resource-hungry and slow, so I’m not entirely convinced that they’re worth using for keyword suggestions right now.