Translate Dutch to English like it’s 1999
I like the Dutch. I don’t think the feeling is mutual, but that doesn’t really matter: this article is not about the Dutch, but about their language. Actually, it’s not even about their language. This article is about how the Dutch speak English, the language of their friendly overseas neighbours.
Most Dutch speak English reasonably well, but there are also some well-known examples of Dutch people whose .
For instance, Dutch football manager Louis van Gaal is renowned for his Dunglish. But his English is still miles ahead of that of football manager Hendrik de Jongh, whose steenkolenengels (the Dutch word for the poorest form of Dunglish) is extremely hard to follow – even for Dutch speakers.
Dunglish is not limited to the sports industry of course. Here are some other good examples of Dunglish that I found on Wikipedia:
“The Dutch are a nation of undertakers.”
“Undertaker” is a literal translation of the Dutch word “ondernemer”, i.e. someone who “undertakes”. The correct English word for this would be entrepreneur. The word is a so-called false friend.
“I fuck horses.”
This was reportedly said by a Dutch minister of Foreign Affairs when he was asked about his hobbies during a visit to John F. Kennedy. He meant to say that he breeds horses, which would be “fok” in Dutch.
“This college goes over ramps.”
This very literal translation of the Dutch sentence “Dit college gaat over rampen” (“This lecture is about disasters”) clearly shows that Dunglish speakers not only mistranslate words, but are also likely to use Dutch syntax and grammar.
“Now my clog breaks.”
While Google Translate’s translation of the Dutch idiom “Nu breekt mijn klomp” is technically correct, it’s extremely unlikely that whoever said it actually has a broken wooden shoe. It simply means that someone is very surprised.
In the past, online translation services like AltaVista’s Babel Fish could be misused to convert Dutch to Dunglish.
Unfortunately, most online translators are pretty good nowadays. Google Translate, Microsoft’s translator, and Yandex Translate all work very well, even when asked to translate nonsensical compound words like “regenboogeenhoorntjesrugtasvergunningsaanvraag” (“rainbow unicorn backpack permit application”) or common idioms like “Maak dat de kat wijs” (“Tell that to the cat”).
This is why I built my own translation service: Stonecoal. Not only can it translate Dutch text to English poorly, it can also read the translated text aloud… poorly.
You can check it out by going to stonecoal.chuniversiteit.nl.
Stonecoal’s back end is reasonably straightforward. User input is sent through a pipeline that consists of three steps: tokenisation, translation, and phonemisation. I’ll discuss these three steps in more detail below.
Stonecoal is a text-based translation service: you send it some text that is written in Dutch and Stonecoal returns a translated version of that text.
Humans can easily see which parts of a text should be translated, but all a computer sees is one big-ass string.
Before we can translate anything, we therefore first need to tokenise a user’s text input. Simply put, the tokenisation process converts the text into a list of tokens, which consist of punctuation, words, and other word-like character sequences.
It’s a bit more complicated than a simple split() on spaces and punctuation characters, but I won’t bore you with the details.
Additionally, we try to determine where sentences start and end, the part of speech of each token in a sentence, and which tokens represent names. This makes it easier for us to find the right translations (or at the very least makes it more likely that we translate words “correctly”).
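To give an idea of what the word-level part of tokenisation involves, here is a minimal sketch in Python. It is not Stonecoal’s actual tokeniser: the pattern, the token kinds, and the name tokenise are assumptions made for illustration, and it does not attempt sentence detection, part-of-speech tagging, or name recognition.

```python
import re

# One alternative per token kind; the first alternative that matches wins.
TOKEN_PATTERN = re.compile(
    r"""
    (?P<number>\d+(?:[.,]\d+)*)               # 1999, 3,5 or 1.000.000
  | (?P<word>[^\W\d_]+(?:['’-][^\W\d_]+)*)    # letters, with an inner ' or - allowed
  | (?P<space>\s+)                            # runs of whitespace
  | (?P<punct>.)                              # anything else, one character at a time
    """,
    re.VERBOSE | re.DOTALL,
)


def tokenise(text):
    """Split text into (kind, value) pairs without dropping any characters."""
    return [(match.lastgroup, match.group()) for match in TOKEN_PATTERN.finditer(text)]


for kind, value in tokenise("Nu breekt mijn klomp!"):
    print(kind, repr(value))
```

Because whitespace and punctuation are kept as tokens of their own, joining the token values back together reproduces the input exactly, which makes it easy to reassemble a sentence after each word has been translated.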
Once we have a list of tokens, we can start translating stuff! Implementing translators from scratch is normally a pretty hard problem, but fortunately for us we don’t really have to care about boring things like semantics, idioms, and word order. All we have to do is translate each word individually.
We do this by running each token through three word translators, which basically act as dumb dictionaries.
The first translator handles all the simple cases. This includes untranslatable tokens like punctuation, whitespace, and numbers, but also words that occur very often or are often mistranslated in undesirable ways.
Any token that cannot be handled by that first translator is handled by our main translator, which uses Google Cloud’s Translate API.
At this point every token should have been translated, because Google’s API will always return results, even for nonsensical words like “sdgsdg”. But if it ever fails, a “no-op” translator simply returns the token untranslated.
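The chain of three translators could be sketched roughly as follows. This is an illustration under assumptions rather than Stonecoal’s real code: the OVERRIDES table, the function names, and the fake_service dictionary are all made up, and the main translator is a stub that stands in for the online translation service so that the example runs offline.

```python
# Hypothetical hand-picked translations for frequent or easily mistranslated words.
OVERRIDES = {"ondernemer": "undertaker", "klomp": "clog"}


def simple_translator(token):
    """Pass through tokens without letters; look up the rest in OVERRIDES."""
    if not any(ch.isalpha() for ch in token):
        return token  # punctuation, whitespace, numbers
    return OVERRIDES.get(token.lower())  # None means "not handled here"


def make_main_translator(lookup):
    """Wrap a lookup function (a stand-in for a remote service call)."""
    def translate(token):
        try:
            return lookup(token)
        except Exception:
            return None  # treat any failure as "not handled"
    return translate


def noop_translator(token):
    """Last resort: return the token untranslated."""
    return token


def translate_token(token, translators):
    """Return the result of the first translator that handles the token."""
    for translator in translators:
        result = translator(token)
        if result is not None:
            return result
    raise ValueError(f"no translator handled {token!r}")


# Made-up data standing in for the translation service.
fake_service = {"nu": "now", "breekt": "breaks", "mijn": "my"}
pipeline = [
    simple_translator,
    make_main_translator(lambda token: fake_service.get(token.lower())),
    noop_translator,
]
tokens = ["Nu", " ", "breekt", " ", "mijn", " ", "klomp", "!"]
print("".join(translate_token(token, pipeline) for token in tokens))
# prints: now breaks my clog!
```

Ending the chain with a translator that never declines means translate_token always returns something, so a failing service degrades to untranslated Dutch instead of an error.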
Translating Dutch sentences into broken English is mildly amusing. But what really makes it funny is hearing that broken English.
This is why the final step of our pipeline tries to determine how a Dunglish speaker would read the translated text aloud. To keep things simple I assume that Dunglish speakers kind of know how to pronounce words, but maybe can’t do it 100% correctly.
Because I didn’t want to reinvent the wheel, I made use of the CMU Pronouncing Dictionary, a machine-readable dictionary that contains pronunciations for more than 134,000 words in American English. This dictionary converts English words into a sequence of phonemes, which are kind of like sound patterns in written form.
Stonecoal first runs each word in the translated text through the dictionary and then translates the phonemes into something that can be converted into audio by a text-to-speech generator for Dutch.
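As a rough illustration of this step, the sketch below looks up a word’s phonemes and respells them so that a Dutch voice would pronounce them approximately right. Everything in it is an assumption: PRONUNCIATIONS is a tiny inline stand-in for the CMU dictionary, and the DUTCH_SPELLING table is a guess at a plausible mapping, not the one Stonecoal uses.

```python
# Tiny stand-in for the CMU Pronouncing Dictionary. Entries use ARPAbet
# symbols; the digit after a vowel marks its stress level.
PRONUNCIATIONS = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "world": ["W", "ER1", "L", "D"],
}

# Hypothetical mapping from ARPAbet phonemes to Dutch-looking spellings.
DUTCH_SPELLING = {
    "HH": "h",
    "AH": "u",
    "L": "l",
    "OW": "oo",
    "W": "w",
    "ER": "ur",
    "D": "d",
    "TH": "t",  # Dutch has no "th" sound, so fall back to "t"
}


def respell(word):
    """Respell an English word for a Dutch text-to-speech voice."""
    phonemes = PRONUNCIATIONS.get(word.lower())
    if phonemes is None:
        return word  # unknown word: let the voice read it as written
    parts = []
    for phoneme in phonemes:
        base = phoneme.rstrip("012")  # drop the stress marker
        parts.append(DUTCH_SPELLING.get(base, base.lower()))
    return "".join(parts)


print(respell("Hello"), respell("world"))
```

Note that this approach throws away the stress markers, since a text-only engine has no way to receive them.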
If you tried out Stonecoal with some texts of your own, you may have noticed that there’s a lot of room for improvement.
The art of good “bad” translations
Stonecoal translates Dutch into poor English. This sounds like an easy problem, but after three years of tinkering I can only conclude that it actually isn’t.
The hardest challenge is finding the “correct” bad translation for words. This is currently done using dictionaries which assume that each word with a certain part of speech only has one possible translation. This is obviously not true: even simple words, like “”, can have multiple possible meanings, but even the worst Dunglish speakers will probably be able to translate them using the correct word.
“Real” bilingual dictionaries don’t just give you a translation of a word, but all possible translations. I haven’t been able to find any decent dictionary for Dutch-to-English translation that can be used both offline and programmatically. There was one open-source dictionary that technically met those requirements, but it would often result in translations using obscure words that only people who browse through thesauri every day would know.
Since I ended up using an online translation service anyway, you might wonder why I didn’t just use the free (as in beer) Google Translate, as it too suggests multiple possible translations when you ask it to translate single words. The reason for that is that it does not come with an API. Its paid version, Google Cloud’s Translate, does come with an API, but sadly does not seem to provide alternate translations for individual words, presumably because it’s meant for translating entire texts only and because good translations require understanding of context.
Of course there are other ways to translate words that are not based on bilingual dictionaries or translation services.
One of these is the word2word package for Python, which can translate words based on co-occurrence statistics from movie and TV subtitles. Stonecoal actually has a fourth translator that is based on word2word. Unfortunately, word2word’s translations are sometimes even worse than those of the open-source dictionary, as word2word will occasionally suggest seemingly random words that have nothing to do with the original. This is why I disabled this translator for now.
Another way to “translate” words is by simply looking for English words that look very similar, where similarity can either be determined using a text-based metric like the Levenshtein distance or a pronunciation-based algorithm, like Soundex. This would not be entirely trivial either without some method to filter out obscure words. So please get in touch if you have or know a good frequency list for English words that’s also reasonably complete!
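A text-based version of this idea could look something like the sketch below. It is only an illustration: the helper closest_word and the three-word candidate list are made up, and a real implementation would need the kind of frequency list mentioned above to supply sensible candidates.

```python
def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b), # substitute, free if equal
            ))
        previous = current
    return previous[-1]


def closest_word(dutch_word, english_words):
    """Pick the English candidate that looks most like the Dutch word."""
    return min(english_words, key=lambda word: levenshtein(dutch_word.lower(), word))


# Made-up candidate list; in practice this would come from a frequency list.
candidates = ["disasters", "ramps", "lecture"]
print(closest_word("rampen", candidates))
# prints: ramps
```

With these candidates, “rampen” lands on “ramps” rather than “disasters”, which is exactly the kind of wrong answer a Dunglish translator is after.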
Delivering speech, reliably
Generating speech from text also turned out to be harder than I thought, even though the actual speech generation is simply done using third-party libraries.
For instance, there’s a Web Speech API for SpeechSynthesis, but it’s still experimental and not all browsers support it. It also requires that the user has a Dutch voice installed on their system. This means that if I had implemented the speech functionality using the Web Speech API, it would not work for most people.
I therefore had to look for text-to-speech engines that also work offline and can be used programmatically. Very few engines provide support for Dutch and the ones that do tend to sound like Stephen Hawking. I eventually managed to find one, but I’m not sure how well it handles edge cases.
Another major issue is phonemisation. The CMU Pronouncing Dictionary gives us information about phonemes and intonation, but none of this is useful if you have an off-the-shelf text-to-speech engine that only accepts text as input.
I haven’t been able to find a good offline text-to-speech engine that lets you use International Phonetic Alphabet (IPA)-based inputs or customise things like intonation on a word-by-word basis, so some words may sound a bit off right now. Again, let me know if you have any suggestions!