This came from some work I did for an information theory class this semester. Right off the bat, I want to warn everyone that this is **amateur quality work** at best. The main data source has some significant shortcomings, so you should treat these results as factoids of uncertain veracity, not statistical gospel. Before I go into full disclaimer mode, though, let me introduce the main concept at play.

**Shannon entropy**, named after Claude Shannon, the visionary founder of information theory, and henceforth referred to simply as “entropy,” is, loosely, a measure of “surprise” among various outcomes.

Consider flipping a coin. Here the two outcomes are “heads” and “tails.” For a fair coin, each outcome is equally likely. We have no way of predicting what the outcome will be. For an unfair coin, we do already have some inkling of how the coin flip will go. In the extreme case, if a coin were to be so unfairly weighted that it always landed on heads, for example, there would be *no* surprise inherent to the system.

The surprise, and thus the entropy, is maximized when the coin is fair.

And the same thing would go for the roll of a die. A coin is like a two-sided die. A die with more sides — more possible outcomes — has more capacity to surprise than a die with fewer sides. Similarly, the less fair a die is, the less capacity it has to surprise. Entropy depends on both the number of possible outcomes and their relative likelihoods.

Okay, on to the formal mathematics. Let *X* = {*x*_1, *x*_2, *x*_3, …} be our collection of outcomes. Each outcome has a certain probability of occurring, defined by a probability function *p*. So *p*(*x*_1) denotes the probability of outcome *x*_1 occurring, *p*(*x*_2) denotes the probability of outcome *x*_2 occurring, and so on.

For a fair coin, we can sidestep the *x*-subscripts and write *X* = {*h*, *t*} and say that *p*(*h*) = 0.5 and *p*(*t*) = 0.5.

For a six-sided die, we have *X* = {*x*_1, …, *x*_6} and, if the die is properly weighted, *p*(*x*_i) = 1/6 for each index *i*.

The entropy of the system *X*, given a probability distribution defined on it, is denoted by *H*(*X*) and given by the formula

*H*(*X*) = Σ *p*(*x*) log(1/*p*(*x*)),

where the sum ranges over all outcomes *x* in *X*. The base of the logarithm determines the unit of entropy. For base 2 the unit is “bits.”
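To make the formula concrete, here is a minimal Python sketch of it applied to the coin and die examples above. The function name `entropy` and the 90/10 "biased coin" are my own illustrative choices. Outcomes with zero probability are skipped, following the usual convention that they contribute nothing to the sum.

```python
import math

def entropy(probs, base=2):
    """Shannon entropy: sum of p * log(1/p) over outcomes with p > 0."""
    return sum(p * math.log(1 / p, base) for p in probs if p > 0)

fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]   # hypothetical unfair coin, for illustration only
rigged_coin = [1.0, 0.0]   # always lands on heads
fair_die = [1 / 6] * 6

print(entropy(fair_coin))    # 1 bit: the maximum for two outcomes
print(entropy(biased_coin))  # about 0.469 bits: less surprise
print(entropy(rigged_coin))  # 0 bits: no surprise at all
print(entropy(fair_die))     # about 2.585 bits, i.e. log2(6)
```

The fair die beats the fair coin simply because it has more outcomes, and the biased coin loses to the fair coin because its outcomes are lopsided.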

Now, for any alphabetical language, we can consider the letters as a collection of outcomes. Then the probability function on that alphabet is determined by letter frequency in a given corpus. (A corpus is a collection of language excerpts, representing the language as it is written and/or spoken.)
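As a rough sketch of how a corpus turns into a probability function, the snippet below counts letters in a string and divides by the total. The one-sentence "corpus" is just a toy example, and the choice to lowercase everything and drop non-letters is my own assumption, not necessarily what the data sources did.

```python
from collections import Counter

def letter_probabilities(text):
    """Map each letter in the text to its relative frequency."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values())
    if total == 0:
        return {}  # no letters at all, so no distribution to report
    return {letter: n / total for letter, n in counts.items()}

# Toy corpus for illustration; a real corpus would be far larger.
sample = "the quick brown fox jumps over the lazy dog"
probs = letter_probabilities(sample)

# Show the five most frequent letters with their probabilities.
for letter, p in sorted(probs.items(), key=lambda kv: -kv[1])[:5]:
    print(letter, round(p, 3))
```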

This page from the American Cryptogram Association displays letter frequency statistics for more than 130 alphabetical languages, both natural and artificial. Curiously, it omits English. We can turn to this page from Cornell University’s “Math Explorers’ Club” for the equivalent English data, though.

Now, there are good reasons to take that cryptogram.org data with a healthy dose of skepticism. In analyzing the data, I’ve found that

- no methodology or sources are given for the data,
- not all languages list a total sample size,
- for the languages that do list a total sample size, the total doesn’t always agree with the sum of the individual letter frequencies (counts), and
- Walloon is listed twice.

So this is strictly amateur hour when it comes to data. I’ve used the individual letter counts to recalculate the probabilities whenever necessary, but I have no way to double-check those individual letter counts.
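The recalculation amounts to something like the sketch below: ignore the listed total, sum the individual counts, and divide. The function name and the letter counts here are made up for illustration; they are not taken from the cryptogram.org data.

```python
def probabilities_from_counts(counts, listed_total=None):
    """Recompute probabilities from raw counts and flag a mismatched listed total."""
    actual_total = sum(counts.values())
    if actual_total == 0:
        raise ValueError("no letter counts to normalize")
    # A mismatch means the published total disagrees with the sum of the counts.
    mismatch = listed_total is not None and listed_total != actual_total
    probs = {letter: n / actual_total for letter, n in counts.items()}
    return probs, mismatch

# Hypothetical counts with a listed total that does not match their sum.
toy_counts = {"a": 50, "b": 30, "c": 20}
probs, mismatch = probabilities_from_counts(toy_counts, listed_total=105)

print(probs)     # probabilities based on the actual sum of 100
print(mismatch)  # True, since 105 != 100
```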

Casting all those worries aside, taking the data as is and heedlessly crunching forth, we can obtain some interesting results. The table below ranks 133 alphabetical languages by the percent of maximum entropy that each one's observed letter distribution achieves. That is, for each language (i.e., each distribution of outcomes at the letter level), it compares the observed entropy to that of an equidistributed alphabet with the same number of letters.

All entropies are reported in bits.
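For anyone who wants to try the same comparison, here is a minimal sketch. The maximum entropy for an alphabet of *n* letters is log2(*n*), which is what a perfectly even distribution would give. The function name and the four-letter toy distribution are my own, and I'm assuming the alphabet size is just the number of probabilities passed in.

```python
import math

def percent_of_max_entropy(probs):
    """Return (observed entropy, maximum entropy, percent of maximum), in bits."""
    n = len(probs)  # assumption: alphabet size is the number of listed letters
    if n < 2:
        raise ValueError("need at least two letters for a meaningful ratio")
    observed = sum(p * math.log2(1 / p) for p in probs if p > 0)
    maximum = math.log2(n)  # entropy of an equidistributed alphabet of n letters
    return observed, maximum, 100 * observed / maximum

# Toy four-letter alphabet with a lopsided distribution.
observed, maximum, percent = percent_of_max_entropy([0.5, 0.25, 0.125, 0.125])
print(observed, maximum, percent)  # 1.75 2.0 87.5

# An evenly used 26-letter alphabet hits the maximum of log2(26), about 4.70 bits.
print(percent_of_max_entropy([1 / 26] * 26))
```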

Enjoy!

Rank | Language | Size of alphabet | Observed entropy | Maximum entropy | Percent of max. entropy achieved |
---|---|---|---|---|---|
1 | KLINGON | 24 | 4.34 | 4.58 | 94.7% |
2 | ICELANDIC | 21 | 4.14 | 4.39 | 94.2% |
3 | CHEYENNE | 15 | 3.68 | 3.91 | 94.2% |
4 | MIKMAQ (MICMAC) | 18 | 3.86 | 4.17 | 92.6% |
5 | FAROESE | 27 | 4.37 | 4.75 | 92.0% |
6 | CROATIAN | 23 | 4.15 | 4.52 | 91.8% |
7 | K’ICHE’ | 25 | 4.25 | 4.64 | 91.6% |
8 | SWEDISH | 26 | 4.30 | 4.70 | 91.4% |
9 | LIMBA | 23 | 4.13 | 4.52 | 91.4% |
10 | FRANCOPROVENÇAL | 30 | 4.47 | 4.91 | 91.1% |
11 | ITALIAN | 22 | 4.05 | 4.46 | 90.8% |
12 | NORWEGIAN | 23 | 4.10 | 4.52 | 90.7% |
13 | MENDE | 24 | 4.15 | 4.58 | 90.5% |
14 | CORNISH | 24 | 4.14 | 4.58 | 90.4% |
15 | HMONG (Sichuan-Guizhou-Yunnan) | 26 | 4.25 | 4.70 | 90.4% |
16 | HAWAIIAN | 13 | 3.34 | 3.70 | 90.3% |
17 | CZECH | 37 | 4.70 | 5.21 | 90.2% |
18 | GREEK (MODERN) | 25 | 4.18 | 4.64 | 90.1% |
19 | NAHUATL | 18 | 3.75 | 4.17 | 90.0% |
20 | LATIN | 21 | 3.95 | 4.39 | 89.9% |
21 | BICHELAMAR | 22 | 4.00 | 4.46 | 89.7% |
22 | ESTONIAN | 24 | 4.11 | 4.58 | 89.6% |
23 | KANURI YERWA | 24 | 4.11 | 4.58 | 89.6% |
24 | LËTZEBUERGESCH | 29 | 4.35 | 4.86 | 89.5% |
25 | FINNISH | 21 | 3.92 | 4.39 | 89.3% |
26 | KURDISH | 31 | 4.42 | 4.95 | 89.3% |
27 | POLISH | 32 | 4.46 | 5.00 | 89.1% |
28 | BAVARIAN | 28 | 4.28 | 4.81 | 89.1% |
29 | MAPUDUNGUN | 26 | 4.19 | 4.70 | 89.1% |
30 | ENGLISH | 26 | 4.18 | 4.70 | 89.0% |
31 | CAKCHIQUEL | 31 | 4.40 | 4.95 | 88.8% |
32 | HMONG (Southern-East Guizhou) | 26 | 4.16 | 4.70 | 88.6% |
33 | MAM | 24 | 4.05 | 4.58 | 88.4% |
34 | NGANGELA | 23 | 4.00 | 4.52 | 88.3% |
35 | KICONGO | 21 | 3.88 | 4.39 | 88.3% |
36 | CHECHEWA | 26 | 4.15 | 4.70 | 88.2% |
37 | HMONG (Northern East-Guizhou) | 27 | 4.19 | 4.75 | 88.2% |
38 | DANISH | 25 | 4.10 | 4.64 | 88.2% |
39 | MOORÉ | 31 | 4.37 | 4.95 | 88.1% |
40 | NEDDERDÜÜTSCH | 28 | 4.24 | 4.81 | 88.1% |
41 | CEBUANO | 20 | 3.80 | 4.32 | 88.0% |
42 | ACHEHNESE | 23 | 3.98 | 4.52 | 87.9% |
43 | ALBANIAN | 34 | 4.47 | 5.09 | 87.9% |
44 | GAGAUZ | 28 | 4.22 | 4.81 | 87.8% |
45 | KINYARWANDA | 25 | 4.07 | 4.64 | 87.7% |
46 | IDO | 25 | 4.07 | 4.64 | 87.7% |
47 | MALTESE | 32 | 4.38 | 5.00 | 87.6% |
48 | KAONDE | 23 | 3.96 | 4.52 | 87.5% |
49 | BRETON | 29 | 4.24 | 4.86 | 87.3% |
50 | HUASTECO | 29 | 4.24 | 4.86 | 87.3% |
51 | LUVALE | 24 | 4.00 | 4.58 | 87.2% |
52 | ESPERANTO | 27 | 4.14 | 4.75 | 87.1% |
53 | HILIGAYNON | 20 | 3.77 | 4.32 | 87.1% |
54 | GREEK (CLASSICAL) | 25 | 4.04 | 4.64 | 87.1% |
55 | ILOKO | 20 | 3.76 | 4.32 | 87.0% |
56 | JAVANESE | 24 | 3.98 | 4.58 | 86.9% |
57 | GREENLANDIC (INUKTIKUT) | 18 | 3.62 | 4.17 | 86.9% |
58 | HUNGARIAN | 37 | 4.52 | 5.21 | 86.8% |
59 | CORSICAN | 23 | 3.92 | 4.52 | 86.7% |
60 | LITHUANIAN | 31 | 4.29 | 4.95 | 86.6% |
61 | BASQUE | 23 | 3.91 | 4.52 | 86.4% |
62 | COKWE | 25 | 4.01 | 4.64 | 86.4% |
63 | RHAETO-ROMANCE | 27 | 4.11 | 4.75 | 86.3% |
64 | GERMAN | 27 | 4.10 | 4.75 | 86.3% |
65 | SCOTTISH GAELIC | 30 | 4.23 | 4.91 | 86.3% |
66 | ARAGONESE | 29 | 4.19 | 4.86 | 86.2% |
67 | MAORI | 16 | 3.45 | 4.00 | 86.1% |
68 | DUTCH | 26 | 4.04 | 4.70 | 86.0% |
69 | AFRIKAANS | 26 | 4.04 | 4.70 | 85.9% |
70 | IRISH GAELIC | 32 | 4.29 | 5.00 | 85.9% |
71 | IBIBIO | 20 | 3.71 | 4.32 | 85.8% |
72 | EDO | 22 | 3.83 | 4.46 | 85.8% |
73 | VIENNESE | 28 | 4.12 | 4.81 | 85.7% |
74 | MARSHALLESE | 19 | 3.64 | 4.25 | 85.7% |
75 | INDONESIAN | 24 | 3.92 | 4.58 | 85.6% |
76 | INTERLINGUA | 23 | 3.87 | 4.52 | 85.6% |
77 | MACEDONIAN | 29 | 4.14 | 4.86 | 85.3% |
78 | BAOULÉ | 29 | 4.14 | 4.86 | 85.3% |
79 | GUARANI | 36 | 4.41 | 5.17 | 85.3% |
80 | VENETIAN | 26 | 4.01 | 4.70 | 85.3% |
81 | BALINESE | 23 | 3.86 | 4.52 | 85.2% |
82 | MADURESE | 25 | 3.96 | 4.64 | 85.2% |
83 | PICARD | 33 | 4.30 | 5.04 | 85.2% |
84 | LINGALA | 21 | 3.74 | 4.39 | 85.2% |
85 | GALICIAN | 24 | 3.90 | 4.58 | 85.0% |
86 | FIJIAN | 21 | 3.73 | 4.39 | 85.0% |
87 | HANI | 26 | 3.99 | 4.70 | 84.9% |
88 | MALAGASY | 22 | 3.78 | 4.46 | 84.9% |
89 | CHUUK | 22 | 3.77 | 4.46 | 84.6% |
90 | BUGISNESE | 23 | 3.82 | 4.52 | 84.4% |
91 | SARDINIAN | 27 | 3.99 | 4.75 | 84.0% |
92 | ROMANIAN | 26 | 3.95 | 4.70 | 83.9% |
93 | MAYAN | 30 | 4.12 | 4.91 | 83.9% |
94 | FRISIAN | 32 | 4.19 | 5.00 | 83.9% |
95 | LOZI | 25 | 3.90 | 4.64 | 83.9% |
96 | WALLOON | 35 | 4.30 | 5.13 | 83.8% |
97 | ASTURIAN | 29 | 4.06 | 4.86 | 83.6% |
98 | KAMPANPANGAN | 21 | 3.67 | 4.39 | 83.6% |
99 | PAPIAMENTU | 35 | 4.29 | 5.13 | 83.6% |
100 | MINANGKABAU | 24 | 3.83 | 4.58 | 83.6% |
101 | OROMIFFA | 26 | 3.92 | 4.70 | 83.5% |
102 | CHAMORRO | 26 | 3.92 | 4.70 | 83.5% |
103 | RUMANTSCH | 28 | 4.01 | 4.81 | 83.4% |
104 | LUNDA | 27 | 3.96 | 4.75 | 83.3% |
105 | CATALAN | 31 | 4.12 | 4.95 | 83.1% |
106 | CAMPA PANONALIJO | 23 | 3.76 | 4.52 | 83.1% |
107 | MALAY | 25 | 3.86 | 4.64 | 83.1% |
108 | AMAHUACA | 24 | 3.81 | 4.58 | 83.0% |
109 | NDEBELE | 31 | 4.11 | 4.95 | 82.9% |
110 | SPANISH | 27 | 3.94 | 4.75 | 82.9% |
111 | CAQUINTE | 22 | 3.69 | 4.46 | 82.7% |
112 | GARIFUNA | 29 | 4.01 | 4.86 | 82.5% |
113 | FRENCH | 31 | 4.08 | 4.95 | 82.3% |
114 | BIKOL | 25 | 3.82 | 4.64 | 82.2% |
115 | MAZATECO | 28 | 3.95 | 4.81 | 82.2% |
116 | JÈRRIAIS | 37 | 4.28 | 5.21 | 82.1% |
117 | PIEDMONTESE | 39 | 4.33 | 5.29 | 81.9% |
118 | ADJA | 43 | 4.44 | 5.43 | 81.9% |
119 | ARABELA | 20 | 3.54 | 4.32 | 81.8% |
120 | FRIULIAN | 34 | 4.15 | 5.09 | 81.5% |
121 | GASCON | 34 | 4.14 | 5.09 | 81.4% |
122 | SWAHILI | 29 | 3.94 | 4.86 | 81.1% |
123 | PORTUGUESE | 33 | 4.09 | 5.04 | 81.0% |
124 | CANDOSHI SHAPRA | 25 | 3.75 | 4.64 | 80.7% |
125 | TAGALOG | 23 | 3.65 | 4.52 | 80.7% |
126 | CHAYAHUITA | 25 | 3.75 | 4.64 | 80.7% |
127 | ACHUAR-SHIWIAR | 23 | 3.63 | 4.52 | 80.2% |
128 | AGUARUNA | 24 | 3.68 | 4.58 | 80.2% |
129 | AMUESHA-YANESHA | 33 | 3.99 | 5.04 | 79.0% |
130 | ASHÉNINCA | 25 | 3.67 | 4.64 | 79.0% |
131 | ASHÀNINCA | 24 | 3.58 | 4.58 | 78.0% |
132 | CASHIBO-CACATAIBO | 31 | 3.77 | 4.95 | 76.1% |
133 | MISKITO | 21 | 3.32 | 4.39 | 75.5% |

submitted by /u/neutrinoprism
