Translate Dutch to English like it’s 1999

Published: 1 Sept 2021
Written by: Chun Fei Lung

I spent three years building a web app that translates Dutch to bad English and the results are even worse than I expected.

Add cheesy caption here

I like the Dutch. I don’t think the feeling is mutual, but that doesn’t really matter: this article is not about the Dutch, but about their language. Actually, it’s not even about their language. This article is about how the Dutch speak English, the language of their friendly overseas neighbours.

Dunglish

Most Dutch speak English reasonably well, but there are also some well-known examples of Dutch people whose mastery of the English language leaves a lot to be desired (side note: I’m just gonna leave this here: Muphry’s law.).

For instance, Dutch football manager Louis van Gaal is renowned for his Dunglish. But his English is still miles ahead of that of football manager Hendrik de Jongh, whose steenkolenengels, the Dutch word for the poorest form of Dunglish (side note: Translated literally this means “Coal English”. The term originates from the early 1900s, when Dutch port workers would often use a rudimentary form of English to communicate with staff of English coal ships.), is extremely hard to follow, even for Dutch speakers.

Dunglish is not limited to the sports industry of course. Here are some other good examples of Dunglish that I found on Wikipedia:

“The Dutch are a nation of undertakers.”

“Undertaker” is a literal translation of the Dutch word “ondernemer”, i.e. someone who “undertakes”. The correct English word would be “entrepreneur”. The word is a so-called false friend.

“I fuck horses.”

This was reportedly said by a Dutch minister of Foreign Affairs when he was asked about his hobbies during a visit to John F. Kennedy. He meant to say that he breeds horses, which would be “fok” in Dutch.

“This college goes over ramps.”

This very literal translation of the Dutch sentence “Dit college gaat over rampen” (“This lecture is about disasters”) clearly shows that Dunglish speakers not only mistranslate words, but are also likely to use Dutch syntax and grammar.

“Now my clog breaks.”

While Google Translate’s translation of the Dutch idiom “Nu breekt mijn klomp” is technically correct, it’s extremely unlikely that whoever said it actually has a broken wooden shoe. It simply means that someone is very surprised.

Stonecoal

In the past, online translation services like AltaVista’s Babel Fish could be misused to convert Dutch to Dunglish.

Unfortunately, most online translators are pretty good nowadays. Google Translate, Microsoft’s translator, and Yandex Translate all work very well, even when asked to translate nonsensical compound words like “regenboogeenhoorntjesrugtasvergunningsaanvraag” (“rainbow unicorn backpack permit application”) or common idioms like “Maak dat de kat wijs” (“Tell that to the cat”).

This is why I built my own translation service: Stonecoal. Not only could it translate Dutch text to English poorly, it could also read the translated text aloud… poorly.

Screenshot of Stonecoal’s user interface, taken when it was still up.

How it works

Stonecoal’s back end is reasonably straightforward. User input is sent through a pipeline that consists of three steps: tokenisation, translation, and phonemisation. I’ll discuss these three steps in more detail below.

Tokenisation

Stonecoal is a text-based translation service: you send it some text that is written in Dutch and Stonecoal returns a translated version of that text.

Humans can easily see which parts of a text should be translated and which should just be left as is (side note: This includes things like punctuation, URLs, email addresses, names of people, organisations, and so on.), but all a computer sees is one big-ass string.

Before we can translate anything, we therefore first need to tokenise a user’s text input. Simply put, the tokenisation process converts the text into a list of tokens, which consist of punctuation, words, and other word-like character sequences. It’s a bit more complicated than a simple split() on spaces and punctuation characters, but I won’t bore you with the details here.
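To give an idea of what "more complicated than split()" can mean, here is a minimal Python sketch of a regex-based tokeniser. It is not Stonecoal's actual code: the pattern, the token kinds, and the name `tokenise` are all assumptions made for illustration.

```python
import re

# Hypothetical token pattern. Alternatives are tried left to right, so the
# more specific ones (URLs, email addresses) come before plain words.
TOKEN_PATTERN = re.compile(
    r"""
    (?P<url>https?://[^\s<>"']*[^\s<>"'.,;:!?)])   # URLs, minus trailing punctuation
  | (?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)        # email addresses
  | (?P<number>\d+(?:[.,]\d+)*)                    # 1999 or 3,5 or 1.000.000
  | (?P<word>[^\W\d_]+(?:['’-][^\W\d_]+)*)         # words, including zo'n and e-mail
  | (?P<space>\s+)                                 # whitespace, kept so text can be rebuilt
  | (?P<punct>.)                                   # anything else, one character at a time
    """,
    re.VERBOSE | re.DOTALL,
)


def tokenise(text):
    """Split text into (kind, value) pairs without dropping any characters."""
    return [(m.lastgroup, m.group()) for m in TOKEN_PATTERN.finditer(text)]
```

Because the last alternative matches any single character, joining the token values always gives back the original text. That makes it possible to translate only the `word` tokens and leave everything else exactly where it was.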

Additionally, we try to determine where sentences start and end, the part-of-speech tag (side note: Nouns, verbs, adjectives, etc.) of each token in a sentence, and which tokens represent names. This makes it easier for us to find the right translations (or at the very least makes it more likely that we translate words “correctly”).

Translation

Once we have a list of tokens, we can start translating stuff! Implementing translators from scratch is normally a pretty hard problem, but fortunately for us we don’t really have to care about boring things like semantics, idioms, and word order. All we have to do is translate each word individually.

We do this by running each token through three word translators, which basically act as dumb dictionaries.

The first translator handles all the simple cases. This includes untranslatable tokens like punctuation, whitespace, and numbers, but also words that occur very often or are often mistranslated in undesirable ways.
Any token that cannot be handled by that first translator is handled by our main translator, which uses Google’s Cloud Translation API (side note: So please don’t give Stonecoal the hug of death; I have a mortgage to pay. 😅).
At this point every token should have been translated, because Google’s API will always return results, even for nonsensical words like “sdgsdg”. But if it ever fails, a “no-op” translator simply uses the original Dutch word as the English “translation” (side note: Some Dunglish speakers actually do this, so this is a pretty good fallback.).
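The chain of translators described above could be sketched roughly as follows. This is an illustration under assumptions, not the real implementation: the names, the example entries in `SIMPLE_CASES`, and the `lookup` callable (a stand-in for whatever calls the translation API) are all hypothetical.

```python
from typing import Callable, Optional, Sequence

# A word translator returns a translation, or None when it cannot handle the token.
Translator = Callable[[str], Optional[str]]

# Hypothetical hand-picked entries for frequent or often-mistranslated words.
SIMPLE_CASES = {"ondernemer": "undertaker", "college": "college"}


def simple_translator(token: str) -> Optional[str]:
    # Punctuation, whitespace, and numbers contain no letters and pass through untouched.
    if not any(ch.isalpha() for ch in token):
        return token
    return SIMPLE_CASES.get(token.lower())


def make_main_translator(lookup: Callable[[str], str]) -> Translator:
    """Wrap a lookup function (assumed to call a translation API) so that failures yield None."""
    def main_translator(token: str) -> Optional[str]:
        try:
            return lookup(token)
        except Exception:
            return None
    return main_translator


def noop_translator(token: str) -> Optional[str]:
    # Last resort: keep the Dutch word as its own "translation".
    return token


def translate_token(token: str, translators: Sequence[Translator]) -> str:
    """Return the result of the first translator that can handle the token."""
    for translator in translators:
        result = translator(token)
        if result is not None:
            return result
    return token
```

Passing the lookup in as a function keeps the paid API out of the chain itself, which makes it easy to swap in a fake during testing.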

Phonemisation

Translating Dutch sentences into broken English is mildly amusing. But what really makes it funny is hearing that broken English.

This is why the final step of our pipeline tries to determine how a Dunglish speaker would read the translated text aloud. To keep things simple I assume that Dunglish speakers kind of know how to pronounce words, but maybe can’t do it 100% correctly.

Because I didn’t want to reinvent the wheel, I made use of the CMU Pronouncing Dictionary, a machine-readable dictionary that contains pronunciations for more than 134,000 words in American English. This dictionary converts English words into a sequence of phonemes, which are kind of like sound patterns in written form.

Stonecoal first runs each word in the translated text through the dictionary and then translates the phonemes into something that can be converted into audio by a text-to-speech generator for Dutch.
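As a rough illustration of that last step, the sketch below parses a few entries written in the CMU Pronouncing Dictionary's plain-text format and respells their phonemes with Dutch letter combinations. The phoneme-to-Dutch table is invented for this example and is not the mapping Stonecoal uses; the function names are hypothetical too.

```python
import re

# A few entries in the dictionary's format: the word, then its ARPAbet phonemes.
# Digits on vowels mark stress.
CMUDICT_EXCERPT = """\
HELLO  HH AH0 L OW1
THINK  TH IH1 NG K
THIS  DH IH1 S
"""

# Invented for illustration: a rough Dutch respelling for each phoneme.
ARPABET_TO_DUTCH = {
    "HH": "h", "AH": "u", "L": "l", "OW": "oo",
    "TH": "t", "IH": "i", "NG": "ng", "K": "k",
    "DH": "d", "S": "s",
}


def load_pronunciations(raw):
    """Parse dictionary lines into {word: [phoneme, ...]}, skipping blank and ';;;' comment lines."""
    entries = {}
    for line in raw.splitlines():
        if not line.strip() or line.startswith(";;;"):
            continue
        word, *phonemes = line.split()
        entries[word.lower()] = phonemes
    return entries


def respell(word, pronunciations):
    """Return a Dutch-style respelling, or the word itself if it is not in the dictionary."""
    phonemes = pronunciations.get(word.lower())
    if phonemes is None:
        return word
    # Stress digits are dropped in this sketch; unknown phonemes are lowercased as-is.
    bare = [re.sub(r"\d", "", phoneme) for phoneme in phonemes]
    return "".join(ARPABET_TO_DUTCH.get(phoneme, phoneme.lower()) for phoneme in bare)
```

With this toy table, "think" comes out as "tingk" and "this" as "dis", which is about what you would expect a Dutch text-to-speech voice to need in order to sound suitably off.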

How it doesn’t work (yet)

If you tried out Stonecoal with some texts of your own, you may have noticed that there’s a lot of room for improvement.

The art of good “bad” translations

Stonecoal translates Dutch into poor English. This sounds like an easy problem, but after three years of tinkering I can only conclude that it actually isn’t.

The hardest challenge is finding the “correct” bad translation for words. This is currently done using dictionaries which assume that each word with a certain part of speech only has one possible translation. This is obviously not true: even simple words like “kop” (side note: “Cup” or “head”) can have multiple meanings, and even the worst Dunglish speakers will probably pick the right one.
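The limitation is easiest to see in the shape of the data. The snippet below is a hypothetical sketch (the entries, the part-of-speech tags, and the function name are made up) that contrasts a one-translation-per-word table with one that keeps every candidate.

```python
# One translation per (word, part of speech): simple, but "kop" can never be "head".
ONE_TRANSLATION = {
    ("kop", "NOUN"): "cup",
    ("klomp", "NOUN"): "clog",
}

# What a fuller bilingual dictionary would offer: every candidate per (word, tag).
ALL_TRANSLATIONS = {
    ("kop", "NOUN"): ["cup", "head"],
}


def candidates(word, tag):
    """Return every known translation, falling back to the single-entry table."""
    key = (word.lower(), tag)
    if key in ALL_TRANSLATIONS:
        return ALL_TRANSLATIONS[key]
    single = ONE_TRANSLATION.get(key)
    return [single] if single is not None else []
```

Having a list of candidates is only half the problem, of course: something still has to choose between "cup" and "head", and that is exactly where context comes in.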

“Real” bilingual dictionaries don’t just give you a translation of a word, but all possible translations. I haven’t been able to find any decent dictionary for Dutch-to-English translation that can be used both offline and programmatically. There was one open-source dictionary that technically met those requirements, but it would often result in translations using obscure words that only people who browse through thesauri every day would know.

Since I ended up using an online translation service anyway, you might wonder why I didn’t just use the free (as in beer) Google Translate, as it too suggests multiple possible translations when you ask it to translate single words. The reason is that it cannot be used via an API (side note: There are libraries that seem to have reverse-engineered it, but they rarely work nowadays. Not to mention that libraries that use APIs “illegally” are wholly unsuitable for any public-facing application like Stonecoal.). Its paid counterpart, Google’s Cloud Translation API, can be used programmatically, but sadly does not seem to provide alternate translations for individual words, presumably because it’s meant for translating entire texts only and because good translations require understanding of context.

Of course there are other ways to translate words that are not based on bilingual dictionaries or translation services.

One of these is the word2word package for Python, which can translate words based on co-occurrence statistics from movie and TV subtitles. Stonecoal actually has a fourth translator that is based on word2word. Unfortunately, word2word’s translations are sometimes even worse than those of the open-source dictionary, as word2word will occasionally suggest seemingly random words that have nothing to do with the original. This is why I disabled this translator for now.

Another way to “translate” words is by simply looking for English words that look very similar, where similarity can either be determined using a text-based metric like the Levenshtein distance or a pronunciation-based algorithm, like Soundex. This would not be entirely trivial either without some method to filter out obscure words. So please get in touch if you have or know a good frequency list for English words that’s also reasonably complete!
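For the text-based variant, a minimal sketch could look like this. The Levenshtein function is the standard dynamic-programming version; the tiny `COMMON_ENGLISH` list is a hypothetical stand-in for the frequency list mentioned above, and `lookalike` and its `max_distance` threshold are assumptions as well.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


# Hypothetical stand-in for a frequency list of common English words.
COMMON_ENGLISH = ["ramp", "camp", "lamp", "clog", "cup", "cop", "undertaker"]


def lookalike(dutch_word, vocabulary=COMMON_ENGLISH, max_distance=2):
    """Return the closest English word within max_distance edits, or None."""
    best = min(vocabulary, key=lambda word: levenshtein(dutch_word, word))
    return best if levenshtein(dutch_word, best) <= max_distance else None
```

With this toy vocabulary, "kop" turns into "cop" and "ramp" stays "ramp", which is the kind of mistake the "ramps" example earlier in this article shows. A real version would scan a much larger word list, so the linear search and the fixed threshold would both need rethinking.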

Delivering speech, reliably

Generating speech from text also turned out to be harder than I thought, even though the actual speech generation is simply done using third-party libraries.

For instance, there’s a Web Speech API for SpeechSynthesis, but it’s still experimental and not all browsers support it. It also requires that the user has a Dutch voice installed on their system. This means that if I’d built the speech functionality on the Web Speech API, it would not work for most people.

I therefore had to look for text-to-speech engines that also work offline and can be used programmatically. Very few engines provide support for Dutch and the ones that do tend to sound like Stephen Hawking. I eventually managed to find one, but I’m not sure how well it handles edge cases.

Another major issue is phonemisation. The CMU Pronouncing Dictionary gives us information about phonemes and intonation, but none of this is useful if you have an off-the-shelf text-to-speech engine that only accepts text as input.

I haven’t been able to find a good offline text-to-speech engine that lets you use International Phonetic Alphabet (IPA)-based inputs or customise things like intonation on a word-by-word basis, so some words may sound a bit off right now. Again, let me know if you have any suggestions!