Ranking alphabetical languages by entropy of letter distribution

This came from some work I did for an information theory class this semester. Right off the bat, I want to warn everyone that this is amateur-quality work at best. The main data source has some significant shortcomings, so you should treat these results as factoids of uncertain veracity, not statistical gospel. Before I go into full disclaimer mode, though, let me introduce the main concept at play.

Shannon entropy — after Claude Shannon, the visionary founder of information theory, and henceforth referred to simply as “entropy” — is, loosely, a measure of “surprise” among various outcomes.

Consider flipping a coin. Here the two outcomes are “heads” and “tails.” For a fair coin, each outcome is equally likely. We have no way of predicting what the outcome will be. For an unfair coin, we do already have some inkling of how the coin flip will go. In the extreme case, if a coin were to be so unfairly weighted that it always landed on heads, for example, there would be no surprise inherent to the system.

The surprise, thus entropy, is maximized when the coin is fair.

The same goes for the roll of a die. A coin is like a two-sided die. A die with more sides — more possible outcomes — has more capacity to surprise than a die with fewer sides. Similarly, the less fair a die is, the less capacity it has to surprise. Entropy depends on both the number of possible outcomes and their relative likelihoods.

Okay, on to the formal mathematics. Let X = {x_1, x_2, x_3, …} be our collection of outcomes. Each outcome has a certain probability of occurring, defined by a probability function p. So p(x_1) denotes the probability of outcome x_1 occurring, p(x_2) denotes the probability of outcome x_2 occurring, and so on.

For a fair coin, we can sidestep the x-subscripts and write X = {h, t} and say that p(h) = 0.5 and p(t) = 0.5.
For a six-sided die, we have X = {x_1, …, x_6} and, if the die is properly weighted, p(x_i) = 1/6 for each index i.

The entropy of the system X, given a probability distribution defined on it, is denoted by H(X) and given by the formula

H(X) = Σ p(x) log (1/p(x)),

where the sum ranges over all outcomes x in X. The base of the logarithm determines the unit of entropy. For base 2 the unit is “bits.”
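To make the formula concrete: for the fair coin, each term of the sum is 0.5 log2(1/0.5) = 0.5 · 1, so H(X) = 1 bit. For the fair six-sided die, each term is (1/6) log2(6), and summing six of them gives H(X) = log2(6) ≈ 2.58 bits. And for the coin that always lands on heads, the heads term is 1 · log2(1) = 0 (the tails term, with probability zero, is taken to be 0 by convention), so H(X) = 0 bits: no surprise at all.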

Now, for any alphabetical language, we can consider the letters as a collection of outcomes. Then the probability function on that alphabet is determined by letter frequency in a given corpus. (A corpus is a collection of language excerpts, representing the language as it is written and/or spoken.)
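To make this computational, here's a minimal sketch in Python, assuming a plain ASCII alphabet and a toy sample string (a real analysis would of course need a substantial corpus):

    from collections import Counter
    from math import log2

    def letter_entropy(text, alphabet):
        """Shannon entropy, in bits, of the letter distribution observed in text."""
        counts = Counter(ch for ch in text.lower() if ch in alphabet)
        total = sum(counts.values())
        # Each letter contributes p * log2(1/p), where p = count / total.
        return sum((n / total) * log2(total / n) for n in counts.values())

    # A toy "corpus" for illustration only.
    sample = "the quick brown fox jumps over the lazy dog"
    print(f"{letter_entropy(sample, set('abcdefghijklmnopqrstuvwxyz')):.2f} bits")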

This page from the American Cryptogram Association displays letter frequency statistics for more than 130 alphabetical languages, both natural and artificial. Curiously, it omits English. We can turn to this page from Cornell University’s “Math Explorers’ Club” for the equivalent English data, though.

Now, there are good reasons to take that cryptogram.org data with a healthy dose of skepticism. In analyzing the data, I’ve found that

  • no methodology or sources are given for the data,
  • not all languages list a total sample size,
  • for the languages that do list a total sample size, the total doesn’t always agree with the sum of the individual letter frequencies (counts), and
  • Walloon is listed twice.

So this is strictly amateur hour when it comes to data. I’ve used the individual letter counts to recalculate the probabilities whenever necessary, but I have no way to double-check those individual letter counts.
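For the curious, the recalculation itself is simple: normalize the raw counts into probabilities and apply the entropy formula. A minimal sketch, with made-up placeholder counts rather than the actual cryptogram.org numbers:

    from math import log2

    # Hypothetical raw letter counts -- placeholders, not real cryptogram.org data.
    counts = {"a": 1200, "b": 190, "c": 310, "d": 450}  # ...one entry per letter

    total = sum(counts.values())  # recomputed from the counts, ignoring any listed total
    probs = {letter: n / total for letter, n in counts.items()}

    entropy = sum(p * log2(1 / p) for p in probs.values())
    print(f"observed entropy: {entropy:.2f} bits")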

Casting all those worries aside, taking the data as is and heedlessly crunching forth, we can obtain some interesting results. The table below ranks 133 alphabetical languages by the percent of maximum entropy achieved by each language's observed letter distribution. That is, for each language — i.e., each distribution of outcomes at the letter level — it compares the observed entropy to that of an equidistributed alphabet with the same number of letters.
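Concretely, the last column is just the observed entropy divided by log2 of the alphabet size. A quick sketch, using the English figures as a check:

    from math import log2

    def percent_of_max(observed_entropy, alphabet_size):
        """Observed entropy as a percentage of the maximum, log2(alphabet size)."""
        return 100 * observed_entropy / log2(alphabet_size)

    # English: 26 letters, observed entropy ~4.18 bits (per the Cornell data).
    print(f"{percent_of_max(4.18, 26):.1f}%")  # ~88.9%, matching the table up to rounding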

All entropies are reported in bits.

Enjoy!

Rank | Language | Alphabet size | Observed entropy | Maximum entropy | Percent of max. entropy achieved
1 KLINGON 24 4.34 4.58 94.7%
2 ICELANDIC 21 4.14 4.39 94.2%
3 CHEYENNE 15 3.68 3.91 94.2%
4 MIKMAQ (MICMAC) 18 3.86 4.17 92.6%
5 FAROESE 27 4.37 4.75 92.0%
6 CROATIAN 23 4.15 4.52 91.8%
7 K’ICHE’ 25 4.25 4.64 91.6%
8 SWEDISH 26 4.30 4.70 91.4%
9 LIMBA 23 4.13 4.52 91.4%
10 FRANCOPROVENÇAL 30 4.47 4.91 91.1%
11 ITALIAN 22 4.05 4.46 90.8%
12 NORWEGIAN 23 4.10 4.52 90.7%
13 MENDE 24 4.15 4.58 90.5%
14 CORNISH 24 4.14 4.58 90.4%
15 HMONG (Sichuan-Guizhou-Yunnan) 26 4.25 4.70 90.4%
16 HAWAIIAN 13 3.34 3.70 90.3%
17 CZECH 37 4.70 5.21 90.2%
18 GREEK (MODERN) 25 4.18 4.64 90.1%
19 NAHUATL 18 3.75 4.17 90.0%
20 LATIN 21 3.95 4.39 89.9%
21 BICHELAMAR 22 4.00 4.46 89.7%
22 ESTONIAN 24 4.11 4.58 89.6%
23 KANURI YERWA 24 4.11 4.58 89.6%
24 LËTZEBUERGESCH 29 4.35 4.86 89.5%
25 FINNISH 21 3.92 4.39 89.3%
26 KURDISH 31 4.42 4.95 89.3%
27 POLISH 32 4.46 5.00 89.1%
28 BAVARIAN 28 4.28 4.81 89.1%
29 MAPUDUNGUN 26 4.19 4.70 89.1%
30 ENGLISH 26 4.18 4.70 89.0%
31 CAKCHIQUEL 31 4.40 4.95 88.8%
32 HMONG (Southern-East Guizhou) 26 4.16 4.70 88.6%
33 MAM 24 4.05 4.58 88.4%
34 NGANGELA 23 4.00 4.52 88.3%
35 KICONGO 21 3.88 4.39 88.3%
36 CHECHEWA 26 4.15 4.70 88.2%
37 HMONG (Northern East-Guizhou) 27 4.19 4.75 88.2%
38 DANISH 25 4.10 4.64 88.2%
39 MOORÉ 31 4.37 4.95 88.1%
40 NEDDERDÜÜTSCH 28 4.24 4.81 88.1%
41 CEBUANO 20 3.80 4.32 88.0%
42 ACHEHNESE 23 3.98 4.52 87.9%
43 ALBANIAN 34 4.47 5.09 87.9%
44 GAGAUZ 28 4.22 4.81 87.8%
45 KINYARWANDA 25 4.07 4.64 87.7%
46 IDO 25 4.07 4.64 87.7%
47 MALTESE 32 4.38 5.00 87.6%
48 KAONDE 23 3.96 4.52 87.5%
49 BRETON 29 4.24 4.86 87.3%
50 HUASTECO 29 4.24 4.86 87.3%
51 LUVALE 24 4.00 4.58 87.2%
52 ESPERANTO 27 4.14 4.75 87.1%
53 HILIGAYNON 20 3.77 4.32 87.1%
54 GREEK (CLASSICAL) 25 4.04 4.64 87.1%
55 ILOKO 20 3.76 4.32 87.0%
56 JAVANESE 24 3.98 4.58 86.9%
57 GREENLANDIC (INUKTIKUT) 18 3.62 4.17 86.9%
58 HUNGARIAN 37 4.52 5.21 86.8%
59 CORSICAN 23 3.92 4.52 86.7%
60 LITHUANIAN 31 4.29 4.95 86.6%
61 BASQUE 23 3.91 4.52 86.4%
62 COKWE 25 4.01 4.64 86.4%
63 RHAETO-ROMANCE 27 4.11 4.75 86.3%
64 GERMAN 27 4.10 4.75 86.3%
65 SCOTTISH GAELIC 30 4.23 4.91 86.3%
66 ARAGONESE 29 4.19 4.86 86.2%
67 MAORI 16 3.45 4.00 86.1%
68 DUTCH 26 4.04 4.70 86.0%
69 AFRIKAANS 26 4.04 4.70 85.9%
70 IRISH GAELIC 32 4.29 5.00 85.9%
71 IBIBIO 20 3.71 4.32 85.8%
72 EDO 22 3.83 4.46 85.8%
73 VIENNESE 28 4.12 4.81 85.7%
74 MARSHALLESE 19 3.64 4.25 85.7%
75 INDONESIAN 24 3.92 4.58 85.6%
76 INTERLINGUA 23 3.87 4.52 85.6%
77 MACEDONIAN 29 4.14 4.86 85.3%
78 BAOULÉ 29 4.14 4.86 85.3%
79 GUARANI 36 4.41 5.17 85.3%
80 VENETIAN 26 4.01 4.70 85.3%
81 BALINESE 23 3.86 4.52 85.2%
82 MADURESE 25 3.96 4.64 85.2%
83 PICARD 33 4.30 5.04 85.2%
84 LINGALA 21 3.74 4.39 85.2%
85 GALICIAN 24 3.90 4.58 85.0%
86 FIJIAN 21 3.73 4.39 85.0%
87 HANI 26 3.99 4.70 84.9%
88 MALAGASY 22 3.78 4.46 84.9%
89 CHUUK 22 3.77 4.46 84.6%
90 BUGISNESE 23 3.82 4.52 84.4%
91 SARDINIAN 27 3.99 4.75 84.0%
92 ROMANIAN 26 3.95 4.70 83.9%
93 MAYAN 30 4.12 4.91 83.9%
94 FRISIAN 32 4.19 5.00 83.9%
95 LOZI 25 3.90 4.64 83.9%
96 WALLOON 35 4.30 5.13 83.8%
97 ASTURIAN 29 4.06 4.86 83.6%
98 KAMPANPANGAN 21 3.67 4.39 83.6%
99 PAPIAMENTU 35 4.29 5.13 83.6%
100 MINANGKABAU 24 3.83 4.58 83.6%
101 OROMIFFA 26 3.92 4.70 83.5%
102 CHAMORRO 26 3.92 4.70 83.5%
103 RUMANTSCH 28 4.01 4.81 83.4%
104 LUNDA 27 3.96 4.75 83.3%
105 CATALAN 31 4.12 4.95 83.1%
106 CAMPA PANONALIJO 23 3.76 4.52 83.1%
107 MALAY 25 3.86 4.64 83.1%
108 AMAHUACA 24 3.81 4.58 83.0%
109 NDEBELE 31 4.11 4.95 82.9%
110 SPANISH 27 3.94 4.75 82.9%
111 CAQUINTE 22 3.69 4.46 82.7%
112 GARIFUNA 29 4.01 4.86 82.5%
113 FRENCH 31 4.08 4.95 82.3%
114 BIKOL 25 3.82 4.64 82.2%
115 MAZATECO 28 3.95 4.81 82.2%
116 JÈRRIAIS 37 4.28 5.21 82.1%
117 PIEDMONTESE 39 4.33 5.29 81.9%
118 ADJA 43 4.44 5.43 81.9%
119 ARABELA 20 3.54 4.32 81.8%
120 FRIULIAN 34 4.15 5.09 81.5%
121 GASCON 34 4.14 5.09 81.4%
122 SWAHILI 29 3.94 4.86 81.1%
123 PORTUGUESE 33 4.09 5.04 81.0%
124 CANDOSHI SHAPRA 25 3.75 4.64 80.7%
125 TAGALOG 23 3.65 4.52 80.7%
126 CHAYAHUITA 25 3.75 4.64 80.7%
127 ACHUAR-SHIWIAR 23 3.63 4.52 80.2%
128 AGUARUNA 24 3.68 4.58 80.2%
129 AMUESHA-YANESHA 33 3.99 5.04 79.0%
130 ASHÉNINCA 25 3.67 4.64 79.0%
131 ASHÀNINCA 24 3.58 4.58 78.0%
132 CASHIBO-CACATAIBO 31 3.77 4.95 76.1%
133 MISKITO 21 3.32 4.39 75.5%
