Allonsay’s language classification algorithm
macOS comes with a `say` command that reads text out loud using the system’s default voice. Unfortunately, voices are monolingual, so if you often consume content in different languages you’re out of luck.
Allonsay is a tiny command-line application that detects which language (and thus voice) you need. How does it work?
In an ideal situation there’s no need to guess which language a text is written in. Web pages (like this one) often include metadata that tell you which language is used on the page. Some websites will also tell you when they use foreign words, so your screen reader knows how to pronounce them correctly.
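For HTML pages, this metadata is typically the `lang` attribute on the `<html>` element. As a rough illustration, the sketch below reads that attribute with Python’s standard `html.parser` module. The names `LangFinder` and `declared_language` are made up for this example, and nothing here is taken from Allonsay itself.

```python
from html.parser import HTMLParser


class LangFinder(HTMLParser):
    """Remembers the lang attribute of the first <html> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


def declared_language(html):
    """Return the declared page language, or None if the page has none."""
    finder = LangFinder()
    finder.feed(html)
    return finder.lang


print(declared_language('<html lang="nl"><body>Hallo</body></html>'))
print(declared_language("<html><body>Hello</body></html>"))
```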
Such information is rarely available in practice though, so we need a way to detect what language a text is written in.
Sometimes we’re lucky, and we only need to identify a language that has its own unique writing system (e.g. Thai or Hebrew). Text written in these languages will use Unicode code points that you won’t find in most other languages. In all other cases we need a more sophisticated solution…
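A minimal sketch of this script-based shortcut, assuming we only care about the Thai (U+0E00 to U+0E7F) and Hebrew (U+0590 to U+05FF) Unicode blocks. The function name `guess_by_script` and the mapping from block to language code are illustrative assumptions, not Allonsay’s actual code.

```python
# Unicode block ranges for two scripts. Mapping each block to exactly one
# language is a simplifying assumption made for this example.
SCRIPT_RANGES = {
    "th": (0x0E00, 0x0E7F),  # Thai block
    "he": (0x0590, 0x05FF),  # Hebrew block
}


def guess_by_script(text):
    """Return the language whose block covers the most characters, or None."""
    counts = {lang: 0 for lang in SCRIPT_RANGES}
    for ch in text:
        for lang, (low, high) in SCRIPT_RANGES.items():
            if low <= ord(ch) <= high:
                counts[lang] += 1
    best = max(counts, key=counts.get)
    # No character fell inside a known block: the shortcut does not apply.
    return best if counts[best] > 0 else None


print(guess_by_script("สวัสดี"))
print(guess_by_script("שלום"))
print(guess_by_script("hello"))
```

A result of `None` means the text has to go through a more general method.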
The easiest and most intuitive solution by far is to simply look up words of a text in dictionaries for various languages until you’ve found enough matches in one of the dictionaries. But this solution doesn’t scale very well. Any tool that needs to identify languages now needs to ship with dictionaries for every supported language. This can be expensive in terms of disk or bandwidth usage and (possibly) licensing costs, so we need to look for a solution that works without dictionaries!
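To make the dictionary approach concrete, here is a toy sketch. The word lists are tiny, hand-picked stand-ins for real dictionaries, and the function name `guess_by_dictionary` is hypothetical. Rather than stopping at "enough" matches, it simply counts all matches per language.

```python
import re

# Hypothetical miniature "dictionaries"; real ones would hold many
# thousands of words per language, which is exactly the scaling problem.
WORDS = {
    "en": {"the", "and", "of", "is", "that", "to", "on"},
    "nl": {"de", "het", "en", "van", "is", "dat", "op"},
}


def guess_by_dictionary(text):
    """Return (best language, matches per language) for the given text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = {
        lang: sum(token in vocab for token in tokens)
        for lang, vocab in WORDS.items()
    }
    return max(hits, key=hits.get), hits


print(guess_by_dictionary("the cat is on the roof of the house"))
print(guess_by_dictionary("de kat zit op het dak van het huis"))
```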
Fortunately we don’t have to look very far, because the characters that occur in a text can also tell us a lot about its language!
This might sound strange at first, because for western languages most of these characters will be in the a–z range. However, the frequency at which characters occur still differs between languages.
Here’s a table from Wikipedia that shows the relative frequencies of the letters a–z in English and Dutch:
Character | English (%) | Dutch (%) |
---|---|---|
a | 8.167 | 7.486 |
b | 1.492 | 1.584 |
c | 2.782 | 1.242 |
d | 4.253 | 5.933 |
e | 12.702 | 18.910 |
f | 2.228 | 0.805 |
g | 2.015 | 3.403 |
h | 6.094 | 2.380 |
i | 6.966 | 6.499 |
j | 0.153 | 1.460 |
k | 0.772 | 2.248 |
l | 4.025 | 3.568 |
m | 2.406 | 2.213 |
n | 6.749 | 10.032 |
o | 7.507 | 6.063 |
p | 1.929 | 1.570 |
q | 0.095 | 0.009 |
r | 5.987 | 6.411 |
s | 6.327 | 3.730 |
t | 9.056 | 6.790 |
u | 2.758 | 1.990 |
v | 0.978 | 2.850 |
w | 2.360 | 1.520 |
x | 0.150 | 0.036 |
y | 1.974 | 0.035 |
z | 0.074 | 1.390 |
The distributions are largely similar – both are part of the West Germanic language group after all – but there are some clear differences too. For example, the character `e` occurs a lot more in Dutch than in English, while `y` is basically non-existent in Dutch.
We can use this to guess the language of a text. The overall idea is pretty simple. First, we determine the distribution of characters in a text and then check which language has the most similar distribution.
Let’s say we have the following input text:
People assume that time is a strict progression of cause to effect, but, actually, from a non-linear, non-subjective viewpoint, it’s more like a big ball of wibbly-wobbly… timey-wimey… stuff
This text has the following character distribution:
Character | Count | Relative frequency (%) |
---|---|---|
a | 10 | 6.667 |
b | 8 | 5.333 |
c | 5 | 3.333 |
d | 0 | 0.000 |
e | 16 | 10.667 |
f | 7 | 4.667 |
g | 2 | 1.333 |
h | 1 | 0.667 |
i | 14 | 9.333 |
j | 1 | 0.667 |
k | 1 | 0.667 |
l | 9 | 6.000 |
m | 6 | 4.000 |
n | 7 | 4.667 |
o | 12 | 8.000 |
p | 4 | 2.667 |
q | 0 | 0.000 |
r | 6 | 4.000 |
s | 10 | 6.667 |
t | 14 | 9.333 |
u | 6 | 4.000 |
v | 2 | 1.333 |
w | 4 | 2.667 |
x | 0 | 0.000 |
y | 5 | 3.333 |
z | 0 | 0.000 |
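The counts in this table can be reproduced with a few lines of Python. This is a sketch under the assumption that only the letters a–z count (case-insensitively) and that everything else, such as spaces, punctuation and digits, is ignored. The name `letter_distribution` is made up for the example.

```python
from collections import Counter
import string

TEXT = (
    "People assume that time is a strict progression of cause to effect, "
    "but, actually, from a non-linear, non-subjective viewpoint, it’s more "
    "like a big ball of wibbly-wobbly… timey-wimey… stuff"
)


def letter_distribution(text):
    """Map each letter a-z to (count, relative frequency in percent)."""
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters)
    return {
        ch: (counts[ch], 100 * counts[ch] / total if total else 0.0)
        for ch in string.ascii_lowercase
    }


for letter, (count, percent) in letter_distribution(TEXT).items():
    print(f"{letter} | {count} | {percent:.3f}")
```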
We can compare the distribution of the input text with the overall distributions in the English and Dutch languages:
Character | English (%) | Dutch (%) | Input text (%) | Difference with English (pp.) | Difference with Dutch (pp.) |
---|---|---|---|---|---|
a | 8.167 | 7.486 | 6.667 | 1.500 | 0.819 |
b | 1.492 | 1.584 | 5.333 | 3.841 | 3.749 |
c | 2.782 | 1.242 | 3.333 | 0.551 | 2.091 |
d | 4.253 | 5.933 | 0.000 | 4.253 | 5.933 |
e | 12.702 | 18.910 | 10.667 | 2.035 | 8.243 |
f | 2.228 | 0.805 | 4.667 | 2.439 | 3.862 |
g | 2.015 | 3.403 | 1.333 | 0.682 | 2.070 |
h | 6.094 | 2.380 | 0.667 | 5.427 | 1.713 |
i | 6.966 | 6.499 | 9.333 | 2.367 | 2.834 |
j | 0.153 | 1.460 | 0.667 | 0.514 | 0.793 |
k | 0.772 | 2.248 | 0.667 | 0.105 | 1.581 |
l | 4.025 | 3.568 | 6.000 | 1.975 | 2.432 |
m | 2.406 | 2.213 | 4.000 | 1.594 | 1.787 |
n | 6.749 | 10.032 | 4.667 | 2.082 | 5.365 |
o | 7.507 | 6.063 | 8.000 | 0.493 | 1.937 |
p | 1.929 | 1.570 | 2.667 | 0.738 | 1.097 |
q | 0.095 | 0.009 | 0.000 | 0.095 | 0.009 |
r | 5.987 | 6.411 | 4.000 | 1.987 | 2.411 |
s | 6.327 | 3.730 | 6.667 | 0.340 | 2.937 |
t | 9.056 | 6.790 | 9.333 | 0.277 | 2.543 |
u | 2.758 | 1.990 | 4.000 | 1.242 | 2.010 |
v | 0.978 | 2.850 | 1.333 | 0.355 | 1.517 |
w | 2.360 | 1.520 | 2.667 | 0.307 | 1.147 |
x | 0.150 | 0.036 | 0.000 | 0.150 | 0.036 |
y | 1.974 | 0.035 | 3.333 | 1.359 | 3.298 |
z | 0.074 | 1.390 | 0.000 | 0.074 | 1.390 |
Total | 100.000 | 100.000 | 100.000 | 36.782 | 63.604 |
The last row shows the sum of the absolute percentage point differences between the input text and each of the two languages. Our input text is more similar to English (36.782) than to Dutch (63.604), so we can assume that it’s written in English!
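Putting the pieces together, the comparison can be sketched as follows. The reference frequencies are copied from the English and Dutch columns of the first table. The function names and the `"en"`/`"nl"` keys are illustrative assumptions, and the score is simply the sum of absolute differences used in the table, which may differ from what Allonsay actually does.

```python
import string

LETTERS = string.ascii_lowercase

# Reference frequencies (%) for a-z, in alphabetical order.
ENGLISH = [8.167, 1.492, 2.782, 4.253, 12.702, 2.228, 2.015, 6.094, 6.966,
           0.153, 0.772, 4.025, 2.406, 6.749, 7.507, 1.929, 0.095, 5.987,
           6.327, 9.056, 2.758, 0.978, 2.360, 0.150, 1.974, 0.074]
DUTCH = [7.486, 1.584, 1.242, 5.933, 18.910, 0.805, 3.403, 2.380, 6.499,
         1.460, 2.248, 3.568, 2.213, 10.032, 6.063, 1.570, 0.009, 6.411,
         3.730, 6.790, 1.990, 2.850, 1.520, 0.036, 0.035, 1.390]
PROFILES = {"en": ENGLISH, "nl": DUTCH}

TEXT = (
    "People assume that time is a strict progression of cause to effect, "
    "but, actually, from a non-linear, non-subjective viewpoint, it’s more "
    "like a big ball of wibbly-wobbly… timey-wimey… stuff"
)


def distribution(text):
    """Relative frequency (%) of each letter a-z, ignoring everything else."""
    letters = [ch for ch in text.lower() if ch in LETTERS]
    if not letters:
        return [0.0] * len(LETTERS)
    return [100 * letters.count(ch) / len(letters) for ch in LETTERS]


def distances(text):
    """Sum of absolute percentage point differences per language."""
    observed = distribution(text)
    return {
        lang: sum(abs(o - r) for o, r in zip(observed, reference))
        for lang, reference in PROFILES.items()
    }


def classify(text):
    """Pick the language whose profile is closest to the text."""
    scores = distances(text)
    return min(scores, key=scores.get)


print(distances(TEXT))
print(classify(TEXT))
```

The scores come out very close to the totals in the table. Tiny deviations are expected because the table rounds the input percentages to three decimals before subtracting.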
- Languages have their own unique distributions of characters
- The language of a text can be classified by comparing the distribution of its characters with the known distributions of languages