How to deal with Dutch compound words when processing text
A few months ago I published an article about Stonecoal, my translation side-project which tries to convert Dutch to English in the worst way possible.
Stonecoal is currently dictionary-based: it splits a text into words and runs each of those words through a dictionary in order to translate each word individually, without any regard for context. This approach is easy to understand and works reasonably well for very simple texts – but it’s unable to handle many real-world texts, which often include compound words.
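As a rough illustration of the word-by-word approach described above (not Stonecoal's actual code), the sketch below tokenises a text and looks each word up in a tiny dictionary. The `DICTIONARY` entries and the `translate` function are made up for this example.

```python
import re

# Tiny illustrative dictionary; these entries are assumptions for the example.
DICTIONARY = {
    "de": "the",
    "is": "is",
    "zwart": "black",
    "steen": "stone",
    "kool": "coal",
}

def translate(text: str) -> str:
    """Translate word by word, leaving unknown words untouched."""
    words = re.findall(r"\w+", text.lower())
    return " ".join(DICTIONARY.get(word, word) for word in words)

print(translate("De steenkool is zwart"))
```

The compound “steenkool” is left untranslated even though both of its parts are in the dictionary, which is exactly the limitation discussed here.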
In English such compound words are generally a sequence of nouns, verbs, or adjectives, separated by a space.
For example, the compound “machine learning” consists of two words: “machine” and “learning”. Both words have meanings of their own and can be found in a dictionary. Because the combination of these two words has a very specific, non-obvious meaning, virtually all modern dictionaries also have a separate entry for “machine learning”.
However, dictionaries are unlikely to include entries for words like “requirements engineering”. Not only would it be infeasible to include any possible combination of words as a separate dictionary entry, it is also unnecessary: if you know what “requirements” and “engineering” mean, you can easily guess what “requirements engineering” is about.
Dutch – like most other Germanic languages – works a bit differently. Compounds are written without spaces as if they were one word. Again, this is not an issue if the compound has a distinct, non-obvious meaning. A word like “wittebroodsweken” – like its English counterpart, “honeymoon” – can be found in any dictionary. But you’re unlikely to find “erwtensoeppizza” in a dictionary. Can you discern the parts that make up this compound as a non-native speaker? It’s doable, but certainly not trivial.
I primarily use compound splitting for Stonecoal. But there are many other possible applications, including:
Electronic dictionaries rarely contain entries for compounds, which greatly reduces their usefulness for second-language learners. A dictionary that can split compounds into parts is able to provide definitions for compounds even if they do not have an explicit entry.
Browsers, text processors, and other types of software can automatically hyphenate text. This is typically done using rules that are based on syllables, which result in hyphenations that are technically correct, but can sometimes be harder to understand when simple words are split in suboptimal positions, e.g. “bommel-ding” (“bommel” thing) vs. “bom-melding” (bomb threat).
If you are doing anything with term frequencies or text similarity, where you are performing some token-based analysis, you will likely get better results if you split compound words first.
If you think that compound splitting is easy for computers, you wouldn’t be entirely wrong: a computer can quickly spot substrings in compound words that possibly represent parts that have an entry in a dictionary. But there are a few things that we need to keep in mind.
There is often more than one way to split a compound word into parts, but not all ways make sense. This is especially true when the compound itself can already be found in a dictionary. If you want to understand what the word “honeymoon” means you shouldn’t look up “honey” and “moon”. Similarly, it would be a bad idea to split the Dutch “bestelling” (order) into “bes” and “telling” (berry count).
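To make the ambiguity concrete, here is a minimal sketch that enumerates every way a word can be written as a sequence of dictionary entries. The function name `all_splits` and the three-word `LEXICON` are assumptions for illustration only.

```python
def all_splits(word, lexicon):
    """Return every way to write `word` as a sequence of lexicon entries."""
    if not word:
        return [[]]  # one way to split the empty string: no parts at all
    splits = []
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in lexicon:
            for rest in all_splits(word[end:], lexicon):
                splits.append([head] + rest)
    return splits

# Hypothetical mini-lexicon.
LEXICON = {"bestelling", "bes", "telling"}

for parts in all_splits("bestelling", LEXICON):
    print(" + ".join(parts))
```

Both “bes” + “telling” and the unsplit “bestelling” come out as candidates, so enumerating candidates alone is not enough and some form of ranking is needed.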
There are three basic rules of thumb that we can follow when splitting compounds:
different parts cannot overlap with each other
each character in a compound should belong to exactly one part
we should prefer longer parts over shorter parts, e.g. the fictional word “regenboogschutter” could be split into “re” + “gen” + “boog” + “schutter”, but “regenboog” + “schutter” is far more likely.
This means that we can largely approach compound splitting as a knapsack problem, with a few additional constraints.
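One possible way to encode these rules of thumb is sketched below. It assumes that “prefer longer parts” can be approximated by “use as few parts as possible”, which is a simplification, and it uses a made-up lexicon. Non-overlap and full coverage follow from the way each part starts where the previous one ended.

```python
from functools import lru_cache

def split_compound(word, lexicon):
    """Split `word` into the fewest lexicon entries; return None if impossible."""

    @lru_cache(maxsize=None)
    def best(start):
        if start == len(word):
            return ()  # nothing left to cover
        candidates = []
        for end in range(start + 1, len(word) + 1):
            part = word[start:end]
            if part in lexicon:
                rest = best(end)
                if rest is not None:
                    candidates.append((part,) + rest)
        # Fewest parts wins, as a stand-in for "prefer longer parts".
        return min(candidates, key=len) if candidates else None

    result = best(0)
    return list(result) if result is not None else None

# Hypothetical mini-lexicon.
LEXICON = {"re", "gen", "boog", "schutter", "regenboog"}

print(split_compound("regenboogschutter", LEXICON))
```

With this lexicon the four-part split is discarded in favour of “regenboog” + “schutter”. Ties between splits with the same number of parts are not resolved in any meaningful way by this sketch.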
I wrote that each character in a compound should belong to exactly one part, but that doesn’t necessarily mean that each character also appears in the dictionary entry for that part.
Many Dutch compound words make use of interfixes: meaningless characters that link different parts of a compound word together. As I wrote earlier, in English compounds tend to be separated by spaces. This makes interfixes rather rare in English, but they do exist, e.g. the “-o-” in “speed**-o-**meter”.
Interfixes are a lot more common in Dutch, e.g.:
“geesteswetenschappen” (“geest” + “wetenschappen”)
“rundergehaktbal” (“rund” + “gehaktbal”)
“bruidegom” (“bruid” + “gom”)
Diminutives and plurals create issues similar to those caused by interfixes, as they add (and sometimes even modify) characters of parts:
“inkomensongelijkheid” (“inkomen” + “ongelijkheid”)
“scholengemeenschap” (“school” + “gemeenschap”)
“lapjeskat” (“lap” + “kat”)
Before we verify whether a substring is possibly a real word we therefore first need to lemmatise it, i.e. try to convert it to a dictionary form.
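A very rough sketch of such a lemmatisation step is shown below. The suffix list and the vowel-doubling heuristic are deliberately incomplete assumptions chosen to cover the examples above. A real lemmatiser for Dutch needs far more rules or a proper morphological lexicon.

```python
# Hypothetical, incomplete list of linking elements and plural/diminutive endings.
SUFFIXES = ("", "s", "e", "en", "er", "es", "jes")
VOWELS = "aeiou"

def restore_long_vowel(stem):
    """Turn e.g. 'schol' back into 'school' (consonant-vowel-consonant ending)."""
    if (len(stem) >= 3 and stem[-1] not in VOWELS
            and stem[-2] in VOWELS and stem[-3] not in VOWELS):
        return stem[:-1] + stem[-2] + stem[-1]
    return None

def lemmatise(part, lexicon):
    """Return a dictionary form for `part`, or None if no candidate is known."""
    for suffix in SUFFIXES:
        if not part.endswith(suffix):
            continue
        stem = part[:len(part) - len(suffix)]
        if stem in lexicon:
            return stem
        doubled = restore_long_vowel(stem)
        if doubled is not None and doubled in lexicon:
            return doubled
    return None

# Hypothetical mini-lexicon.
LEXICON = {"geest", "rund", "bruid", "inkomen", "school", "lap"}

for part in ("geestes", "runder", "bruide", "inkomens", "scholen", "lapjes"):
    print(part, "->", lemmatise(part, LEXICON))
```

Stripping endings this aggressively will also produce false matches on a realistic dictionary, so it only makes sense in combination with the ranking rules discussed in this article.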
Even when we take all of the above rules into account, we may still run into situations where it is not entirely clear how a word can be split. This may happen when one or more parts do not appear in a dictionary, e.g. “robinhood” in “robinhoodbelasting” or “fjdsagkljldfhkj” in “klotefjdsagkljldfhkjsmerissen”.
A simple solution that works most of the time is to treat any remaining consecutive character sequences as words.
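The fallback could look something like the following sketch, which scans the word from left to right, takes the longest known entry at each position, and collects everything in between as an unknown part. The function name and the lexicon are assumptions. The left-to-right greedy matching is a simplification that can misfire when a short dictionary word happens to occur inside an unknown part.

```python
def split_with_unknowns(word, lexicon):
    """Greedy split that keeps runs of unmatched characters as parts of their own."""
    parts = []
    unknown = ""
    i = 0
    while i < len(word):
        # Longest lexicon entry that starts at position i, if any.
        match = max((entry for entry in lexicon if word.startswith(entry, i)),
                    key=len, default=None)
        if match:
            if unknown:
                parts.append(unknown)
                unknown = ""
            parts.append(match)
            i += len(match)
        else:
            unknown += word[i]
            i += 1
    if unknown:
        parts.append(unknown)
    return parts

# Hypothetical mini-lexicon.
LEXICON = {"belasting", "klote", "smerissen"}

print(split_with_unknowns("robinhoodbelasting", LEXICON))
print(split_with_unknowns("klotefjdsagkljldfhkjsmerissen", LEXICON))
```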
Sometimes you may run into the opposite problem: there are multiple ways to split a compound.
A well-known confusing Dutch example is “minister”, which can be split into “mini” and “ster” (“miniature star”). If we follow the guideline that one should prefer longer parts over shorter parts, we should not split this compound. But what if the word appears in a text about stars?
Or another example that I used earlier: “regenboogschutter”. I proposed “regenboog” + “schutter” as a possible decomposition, but based on the above guidelines “regen” + “boogschutter” would be equally acceptable.
Most of these dilemmas can be solved reasonably well by taking word frequencies of possible part combinations into account. This can be done using general word frequency lists, domain-specific word frequency lists, or a combination thereof.
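As a sketch of how frequencies might break such ties, the code below scores each candidate decomposition by the sum of the log relative frequencies of its parts and keeps the best one. All counts are invented for illustration, and the add-one smoothing is just one simple way to avoid taking the logarithm of zero for unseen parts.

```python
import math

def score(parts, counts):
    """Sum of log relative frequencies, with add-one smoothing for unseen parts."""
    total = sum(counts.values())
    return sum(math.log((counts.get(part, 0) + 1) / (total + 1)) for part in parts)

def most_likely(candidates, counts):
    """Pick the candidate decomposition with the highest score."""
    return max(candidates, key=lambda parts: score(parts, counts))

CANDIDATES = [["regenboog", "schutter"], ["regen", "boogschutter"]]

# Invented counts standing in for a general word frequency list.
GENERAL = {"regenboog": 120, "schutter": 80, "regen": 900, "boogschutter": 5}
# Invented counts standing in for a domain-specific list (say, archery texts).
ARCHERY = {"regenboog": 1, "schutter": 10, "regen": 50, "boogschutter": 400}

print(most_likely(CANDIDATES, GENERAL))
print(most_likely(CANDIDATES, ARCHERY))
```

With these made-up numbers the general list favours “regenboog” + “schutter”, while the domain-specific list favours “regen” + “boogschutter”. Because every extra part multiplies in another probability below one, the same score also tends to penalise splits with many short parts.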