Finding the most common words in a set of texts for a word cloud

[Image: a police officer uses pepper spray against a protester. Caption: “We need to talk about cloud control techniques”]

Suppose that someone walks over to your desk and asks you to create a word cloud from some news articles. The sensible thing to do here would be to tell them that “word clouds are very 2006 and people who still use them should be shot”. You don’t want to come off as rude, however, so you agree to help them.

The word cloud should be based on the 10 most common words in a small, totally-not-politically-motivated selection of BBC news articles about the protests in Hong Kong.

The naïve solution

There’s a really simple *nix one-liner from the 1980s (split over several lines below for readability) that gives you the 10 most common words (any sequence of the alphabetic characters a-z) in some texts:

cat *.txt \
  | tr -cs A-Za-z '\n' \
  | tr A-Z a-z \
  | sort \
  | uniq -c \
  | sort -rn \
  | sed 10q

It was originally published by Doug McIlroy in response to Donald Knuth’s 8-page-long solution (to be fair to Knuth, most of those 8 pages consisted of documentation), and does the following:

  1. Take all txt files in a directory and combine their contents into a single text by putting them after each other;
  2. Replace every run of non-alphabetic characters with a single newline, so that each word appears on its own line;
  3. Convert everything to lower case, so we don’t need to deal with upper and lower case versions separately;
  4. Sort the words so that duplicates are all in consecutive lines;
  5. Deduplicate the lines and count how often they occur;
  6. Sort the word counts in reverse, numeric order;
  7. Quit when 10 results have been printed.
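
For comparison, here’s a rough Python sketch of the same idea (not McIlroy’s code, just a translation; the .txt files are whatever articles you happen to have in the directory):

import glob
import re
from collections import Counter

counts = Counter()
for path in glob.glob('*.txt'):
    with open(path, encoding='utf-8') as f:
        # Same crude definition of a "word": any run of the letters a-z
        counts.update(re.findall(r'[a-z]+', f.read().lower()))

# most_common does the work of sort | uniq -c | sort -rn | sed 10q
for word, count in counts.most_common(10):
    print(count, word)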

Stop words

McIlroy’s script technically does exactly what we’ve asked for – finding the 10 most common alphabetic character sequences – but it’s clearly not what we actually wanted:

 135 the
  68 to
  65 of
  63 in
  60 a
  44 and
  43 hong
  39 kong
  37 s
  34 that

The list is topped by so-called stop words that occur often in any text and therefore don’t tell you very much about these specific articles.

We can of course choose to simply omit these words from our results. There are some ready-made stop lists on the Internet that you can use, but it’s also possible to roll your own.

Here’s a revised version for BSD-based systems (those who have installed GNU’s version of sed may need to use -r instead of -E) that uses sed to remove a few common stop words from the output:

cat *.txt \
  | tr -cs A-Za-z '\n' \
  | tr A-Z a-z \
  | sed -E '/^(a|and|have|in|is|it|of|the|to|was)$/d' \
  | sort \
  | uniq -c \
  | sort -rn \
  | sed 10q

Note that this won’t work if you want to create word clouds for something like the lyrics of The Beatles’ Let It Be, which largely consist of stop words. It’ll work well enough for most other texts, however.

We can see that our revised script already yields a slightly more useful list, even though we’ve excluded only 10 stop words:

  43 hong
  39 kong
  37 s
  34 that
  32 for
  23 police
  21 are
  20 by
  18 but
  17 with
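
Maintaining that sed pattern by hand gets tedious as the list grows; one ready-made alternative is NLTK’s bundled stopwords corpus. A minimal sketch (the file name is a placeholder, and the word lists need a one-off nltk.download('stopwords')):

import re
from collections import Counter

from nltk.corpus import stopwords

# Run nltk.download('stopwords') once if the corpus is missing
stop_words = set(stopwords.words('english'))  # a couple of hundred common English words

with open('article.txt', encoding='utf-8') as f:  # placeholder file name
    tokens = re.findall(r'[a-z]+', f.read().lower())

counts = Counter(t for t in tokens if t not in stop_words)
for word, count in counts.most_common(10):
    print(count, word)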

Using a larger stop list like that one improves the quality of our results further. Unfortunately, stop words aren’t our only problem.

Tokenisation

Our script output shows that the three most common “words” in our texts are Hong, Kong, and s. But wait, none of these are actual words!

So what went wrong here? In the second line of the script we assumed that words are non-empty sequences of alphabetic characters (a-z). That assumption is too simple: the apostrophe in Kong’s is treated as a separator (which is where the stray s comes from), and Hong Kong is really a single name that has been split in two.

What we need for our word cloud aren’t necessarily words, but tokens: non-empty sequences of characters that together form a single linguistic building block. Most of the time these tokens will simply be words, but there are many, many edge cases (contractions and possessives, hyphenated words, names like Hong Kong that span multiple tokens, and so on) that make proper tokenisation a non-trivial task.
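
To get a feel for what a real tokeniser does differently from our a-z trick, here’s a small sketch using NLTK’s word_tokenize (the example sentence is made up, and the tokeniser models need a one-off nltk.download('punkt')):

import nltk
from nltk import word_tokenize

# nltk.download('punkt')  # tokeniser models; only needed on first run

sentence = "Hong Kong's leader didn't comment."
print(word_tokenize(sentence))
# ['Hong', 'Kong', "'s", 'leader', 'did', "n't", 'comment', '.']

The possessive ’s and the n’t are kept as separate tokens instead of being mangled, and punctuation becomes a token of its own.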

Normalisation

Proper tokenisation is a hard problem that’s still being researched, but let’s assume that you’ve fixed the tokenisation issues to some degree using heuristics, machine learning, or a combination thereof.

You’ll quickly run into another issue: there are many similar tokens, such as different forms of the same verb (protest, protests, protesting) or noun (protester, protesters).

Sometimes you want to treat these different verb or noun forms differently, but most of the time you don’t really care and just want to merge all occurrences into a single form.

This process of merging tokens is called normalisation. There are two popular normalisation techniques that you can use: stemming and lemmatisation.

Stemming

Stemming is a crude – but fairly effective – way to normalise words. It works by simply chopping off the ends of words, based on the idea that many words have similar word endings that don’t necessarily add a lot of information to their meaning: for a word cloud it’s totally fine to substitute protest for protesting, for example.

Most people use one of two stemming algorithms. The Porter stemming algorithm has traditionally been the most popular algorithm for English, although nowadays you might want to use the Snowball stemming algorithm instead, as it supports more languages and gives slightly better results.

Here’s a snippet that demonstrates how you can stem words using the Natural Language Toolkit for Python:

from nltk.stem.snowball import SnowballStemmer

tokens = ['stemming', 'is', 'a', 'crude', 'but', 'fairly', 'effective',
          'way', 'to', 'normalise', 'words']
stemmer = SnowballStemmer("english")

# Print the stem of each token
for token in tokens:
    print(stemmer.stem(token))

This prints the following stems:

stem
is
a
crude
but
fair
effect
way
to
normalis
word

It clearly isn’t perfect: blunt removal of characters from a word will sometimes leave you with “words” that don’t make a lot of sense, or reduce two words with different meanings to the same stem (e.g. authority and author), so we need to find a way to map stemmed words back to real words.
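
One simple way to do that mapping (a sketch, not the only approach) is to remember which original tokens produced each stem and display the most frequent of them:

from collections import Counter, defaultdict

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
tokens = ['protest', 'protesting', 'protesters', 'protests', 'police']

stem_counts = Counter()
surface_forms = defaultdict(Counter)  # stem -> counts of the original tokens

for token in tokens:
    stem = stemmer.stem(token)
    stem_counts[stem] += 1
    surface_forms[stem][token] += 1

for stem, count in stem_counts.most_common(10):
    # Show the most common original form instead of the (possibly mangled) stem
    display, _ = surface_forms[stem].most_common(1)[0]
    print(count, display)

With these tokens it prints 4 protest and 1 police, rather than the raw stems.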

Lemmatisation

The other method, lemmatisation, is more sophisticated and is based on the idea that you always look up words in a dictionary by their lemma: the canonical form that appears as the headword (protest for protesting, for instance). This means that we can normalise words by converting them to lemmas.

Obviously this requires that you have a complete dictionary in the first place and a way to figure out which lemmas belong to which words, which makes it much harder to implement.

Fortunately there’s no need to do that, as the Natural Language Toolkit also comes with a lemmatizer. It clearly produces better results than the stemmer:

from nltk.stem import WordNetLemmatizer

# Requires the WordNet data; run nltk.download('wordnet') once if it's missing
tokens = ['stemming', 'is', 'a', 'crude', 'but', 'fairly', 'effective',
          'way', 'to', 'normalise', 'words']
lemmatizer = WordNetLemmatizer()

for token in tokens:
    print(lemmatizer.lemmatize(token))

This prints the following lemmas:

stemming
is
a
crude
but
fairly
effective
way
to
normalise
word
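
One caveat worth knowing about: WordNetLemmatizer treats every word as a noun unless you pass a part-of-speech tag, so verbs like protesting or said aren’t reduced to their lemmas by default:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize('protesting'))           # protesting (assumed to be a noun)
print(lemmatizer.lemmatize('protesting', pos='v'))  # protest
print(lemmatizer.lemmatize('said', pos='v'))        # say

Picking the right tag for each token automatically is a job for a part-of-speech tagger, which is probably more than a word cloud deserves.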

If we now apply the lemmatiser to the original list of most common words and remove all stop words that are common in the English language, we might end up with a list like this:

  43 hong
  39 kong
  23 police
  16 protest
  14 said
  14 government
  10 protester
   9 would
   9 people
   9 officer

It’s far from perfect: Hong and Kong are still listed as separate words (with different numbers of appearances!) and we may want to treat said as a stop word for news articles, since they frequently include quotes and paraphrases.

It’s good enough for us however. Because let’s be serious: no one’s going to look at a stupid word cloud in 2019.
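
Still, for the record, here’s a sketch of what the whole pipeline could look like in Python, combining the pieces above (under the same assumptions as before: the .txt files are placeholders for the articles, the NLTK data has been downloaded, and every token is lemmatised as a noun):

import glob
from collections import Counter

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-off downloads: nltk.download('punkt'), nltk.download('stopwords'),
# nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

counts = Counter()
for path in glob.glob('*.txt'):
    with open(path, encoding='utf-8') as f:
        for token in word_tokenize(f.read().lower()):
            # Keep alphabetic tokens only, drop stop words, merge word forms
            if token.isalpha() and token not in stop_words:
                counts[lemmatizer.lemmatize(token)] += 1

for word, count in counts.most_common(10):
    print(count, word)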

Summary

  1. Stop words are words that occur very often in natural languages, but don’t tell you much about what a text is about
  2. Tokenisation is the process of splitting a single text into a sequence of tokens (≈words). It’s kind of hard to do well
  3. Stemming and lemmatisation are normalisation techniques that can help you group words with similar meanings