Chuniversiteit.nl

Allonsay’s language classification algorithm

How you can detect the language of texts, without using the “ML” word.

Image: a time-travelling spacecraft disguised as a police box. Caption: “You can’t name this thing Allonsay and not include support for French”

macOS comes with a say command that reads texts out loud using the system’s default voice. Unfortunately, voices are monolingual so if you often consume content in different languages you’re out of luck. Allonsay is a tiny command-line application that detects which language (and thus voice) you need. How does it work?

In an ideal situation there’s no need to guess which language a text is written in. Web pages (like this one) often include metadata that tell you which language is used on the page. Some websites will also tell you when they use foreign words, so your screen reader knows how to pronounce them correctly.

Such information is rarely available in practice though, so we need a way to detect what language a text is written in.

Sometimes we’re lucky, and we only need to identify a language that has its own unique writing system (e.g. Thai and Hebrew). Text written in these languages will use Unicode code points that you won’t find in any other language. In all other cases we need a more sophisticated solution…
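The lucky case can be sketched in a few lines of Python. This is only an illustration, not Allonsay’s actual code: the function name `detect_by_script` and the mapping from script to language code are assumptions, and the ranges are the Unicode blocks for Thai (U+0E00–U+0E7F) and Hebrew (U+0590–U+05FF).

```python
# Unicode block ranges (inclusive) for two scripts that the text above
# treats as belonging to a single language. The mapping to language codes
# is an illustrative assumption, not Allonsay's actual table.
SCRIPT_RANGES = {
    "th": (0x0E00, 0x0E7F),  # Thai block
    "he": (0x0590, 0x05FF),  # Hebrew block
}


def detect_by_script(text):
    """Return a language code as soon as one character falls in a known block."""
    for char in text:
        code_point = ord(char)
        for language, (start, end) in SCRIPT_RANGES.items():
            if start <= code_point <= end:
                return language
    return None  # no unique script found; another method is needed


print(detect_by_script("สวัสดี"))  # th
print(detect_by_script("שלום"))    # he
print(detect_by_script("hello"))   # None
```

Returning on the first matching character keeps the check cheap, but it also means a single foreign word in an otherwise Latin-script text would decide the outcome. A stricter variant could require a minimum share of matching characters.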

The easiest and most intuitive solution by far is to simply look up words of a text in dictionaries for various languages until you’ve found enough matches in one of the dictionaries. But this solution doesn’t scale very well. Any tool that needs to identify languages now needs to ship with dictionaries for every supported language. This can be expensive in terms of disk or bandwidth usage and (possibly) licensing costs, so we need to look for a solution that works without dictionaries!
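For comparison, here is roughly what the dictionary approach looks like. The word lists are tiny, made-up stand-ins for real dictionaries, and `detect_by_dictionary` and its `min_matches` threshold are hypothetical names chosen for this sketch.

```python
# Toy word lists standing in for full dictionaries (hypothetical data).
DICTIONARIES = {
    "en": {"the", "is", "a", "of", "and", "time", "ball"},
    "nl": {"de", "het", "een", "van", "en", "tijd", "bal"},
}


def detect_by_dictionary(text, min_matches=3):
    """Return the first language whose word list matches enough words."""
    words = text.lower().split()
    for language, known_words in DICTIONARIES.items():
        matches = sum(1 for word in words if word in known_words)
        if matches >= min_matches:
            return language
    return None  # not enough matches in any dictionary


print(detect_by_dictionary("time is a big ball of stuff"))    # en
print(detect_by_dictionary("de tijd is een bal van dingen"))  # nl
```

Even in this toy version the cost is visible: every supported language needs its own word list, and real lists would have to be far larger than seven words to be useful.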

Letter frequencies

Fortunately we don’t have to look very far, because the characters that occur in a text can also tell us a lot about its language!

This might sound strange at first, because for western languages most of these characters will be in the a–z range. However, the frequency at which characters occur still differs between languages.

Here’s a table from Wikipedia that shows the relative frequencies of the letters a–z in English and Dutch:

Character English (%) Dutch (%)
a 8.167 7.486
b 1.492 1.584
c 2.782 1.242
d 4.253 5.933
e 12.702 18.910
f 2.228 0.805
g 2.015 3.403
h 6.094 2.380
i 6.966 6.499
j 0.153 1.460
k 0.772 2.248
l 4.025 3.568
m 2.406 2.213
n 6.749 10.032
o 7.507 6.063
p 1.929 1.570
q 0.095 0.009
r 5.987 6.411
s 6.327 3.730
t 9.056 6.790
u 2.758 1.990
v 0.978 2.850
w 2.360 1.520
x 0.150 0.036
y 1.974 0.035
z 0.074 1.390

The distributions are largely similar – both languages are part of the West Germanic language group after all – but there are some clear differences too. For example, the character e occurs a lot more in Dutch than in English, while y is basically non-existent in Dutch.

We can use this to guess the language of a text. The overall idea is pretty simple. First, we determine the distribution of characters in a text and then check which language has the most similar distribution.

Example

Let’s say we have the following input text:

People assume that time is a strict progression of cause to effect, but, actually, from a non-linear, non-subjective viewpoint, it’s more like a big ball of wibbly-wobbly… timey-wimey… stuff

This text has the following character distribution:

Character Count Relative frequency (%)
a 10 6.667
b 8 5.333
c 5 3.333
d 0 0.000
e 16 10.667
f 7 4.667
g 2 1.333
h 1 0.667
i 14 9.333
j 1 0.667
k 1 0.667
l 9 6.000
m 6 4.000
n 7 4.667
o 12 8.000
p 4 2.667
q 0 0.000
r 6 4.000
s 10 6.667
t 14 9.333
u 6 4.000
v 2 1.333
w 4 2.667
x 0 0.000
y 5 3.333
z 0 0.000
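A table like this one can be produced with a short script. The sketch below assumes that only the letters a–z are counted, case-insensitively, and that punctuation, spaces and other characters are ignored. The function name `letter_distribution` is made up for this example.

```python
from collections import Counter
import string

TEXT = (
    "People assume that time is a strict progression of cause to effect, "
    "but, actually, from a non-linear, non-subjective viewpoint, it’s more "
    "like a big ball of wibbly-wobbly… timey-wimey… stuff"
)


def letter_distribution(text):
    """Return {letter: (count, percentage)} for the letters a-z."""
    # Lowercase the text and keep only a-z; everything else is ignored.
    letters = [char for char in text.lower() if char in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters)
    return {
        letter: (counts[letter], 100 * counts[letter] / total if total else 0.0)
        for letter in string.ascii_lowercase
    }


for letter, (count, percentage) in letter_distribution(TEXT).items():
    print(f"{letter} {count} {percentage:.3f}")
```

The text contains 150 letters in total, which is why a letter that occurs once ends up at 0.667%.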

We can compare the distribution of the input text with the overall distributions in the English and Dutch languages:

Character English (%) Dutch (%) Input text (%) Difference with English (pp.) Difference with Dutch (pp.)
a 8.167 7.486 6.667 1.500 0.819
b 1.492 1.584 5.333 3.841 3.749
c 2.782 1.242 3.333 0.551 2.091
d 4.253 5.933 0.000 4.253 5.933
e 12.702 18.910 10.667 2.035 8.243
f 2.228 0.805 4.667 2.439 3.862
g 2.015 3.403 1.333 0.682 2.070
h 6.094 2.380 0.667 5.427 1.713
i 6.966 6.499 9.333 2.367 2.834
j 0.153 1.460 0.667 0.514 0.793
k 0.772 2.248 0.667 0.105 1.581
l 4.025 3.568 6.000 1.975 2.432
m 2.406 2.213 4.000 1.594 1.787
n 6.749 10.032 4.667 2.082 5.365
o 7.507 6.063 8.000 0.493 1.937
p 1.929 1.570 2.667 0.738 1.097
q 0.095 0.009 0.000 0.095 0.009
r 5.987 6.411 4.000 1.987 2.411
s 6.327 3.730 6.667 0.340 2.937
t 9.056 6.790 9.333 0.277 2.543
u 2.758 1.990 4.000 1.242 2.010
v 0.978 2.850 1.333 0.355 1.517
w 2.360 1.520 2.667 0.307 1.147
x 0.150 0.036 0.000 0.150 0.036
y 1.974 0.035 3.333 1.359 3.298
z 0.074 1.390 0.000 0.074 1.390
Total 100.000 100.000 100.000 36.782 63.604

The last row shows the sum of the absolute percentage point differences between the input text and the two languages: the smaller the sum, the more similar the distributions. Our input text is more similar to English (36.782) than to Dutch (63.604), so we can assume that it’s written in English!
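Put together, the whole comparison fits in a small script. This is a minimal sketch of the technique described above, not necessarily how Allonsay implements it. The reference frequencies are copied from the Wikipedia table, and the names `letter_percentages`, `distance` and `classify` are assumptions for this example.

```python
import string

# Reference frequencies (%) for a-z, copied from the table above.
FREQUENCIES = {
    "English": [8.167, 1.492, 2.782, 4.253, 12.702, 2.228, 2.015, 6.094, 6.966,
                0.153, 0.772, 4.025, 2.406, 6.749, 7.507, 1.929, 0.095, 5.987,
                6.327, 9.056, 2.758, 0.978, 2.360, 0.150, 1.974, 0.074],
    "Dutch": [7.486, 1.584, 1.242, 5.933, 18.910, 0.805, 3.403, 2.380, 6.499,
              1.460, 2.248, 3.568, 2.213, 10.032, 6.063, 1.570, 0.009, 6.411,
              3.730, 6.790, 1.990, 2.850, 1.520, 0.036, 0.035, 1.390],
}

TEXT = (
    "People assume that time is a strict progression of cause to effect, "
    "but, actually, from a non-linear, non-subjective viewpoint, it’s more "
    "like a big ball of wibbly-wobbly… timey-wimey… stuff"
)


def letter_percentages(text):
    """Return the relative frequency (%) of each letter a-z, in order."""
    letters = [char for char in text.lower() if char in string.ascii_lowercase]
    total = len(letters)
    if total == 0:
        return [0.0] * 26  # a text without any a-z letters yields all zeros
    return [100 * letters.count(letter) / total
            for letter in string.ascii_lowercase]


def distance(observed, reference):
    """Sum of absolute percentage point differences between two distributions."""
    return sum(abs(o - r) for o, r in zip(observed, reference))


def classify(text):
    """Return the language with the smallest distance, plus all distances."""
    observed = letter_percentages(text)
    scores = {language: distance(observed, reference)
              for language, reference in FREQUENCIES.items()}
    return min(scores, key=scores.get), scores


language, scores = classify(TEXT)
print(language)  # English
for name, score in scores.items():
    print(f"{name}: {score:.1f}")
```

The scores come out at roughly 36.8 for English and 63.6 for Dutch. They can differ from the table in the last decimal, because the table sums differences that were already rounded to three decimals. Supporting another language only requires adding one more list of 26 numbers to `FREQUENCIES`.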

Summary

  1. Languages have their own, unique distribution of characters

  2. The language of a text can be classified by comparing the distribution of its characters with known distributions of languages