EDIT: I made it clearer that the second language is not original.

If you calculate a running sum of the sorted frequency of every word in a long book 1, the result looks like the following curve: 1

You can try this J program with your favorite book (as long as cutopen supports the language), where count is defined in my previous post: 2

NB. Plot the word count. Your text should only have one newline at the end of
NB. the file, otherwise `freadr` and `cutopen` won't work. J is definitely not
NB. an ideal language for string processing.
plot +/\ (% +/) \:~ (count"_ 0 ~.) cutopen , freadr 'your/book.txt'
NB.                            ~.   finds the unique word
NB.                 (count"_ 0 ~.)  is a monadic hook that gets the word count
NB.             \:~                 sort the word count
NB.      (% +/)                     normalizes it
NB.  +/\                            calculate partial sum

There are 18000 kinds of words used in total, but only 20% of them is used in 80% of the scenarios. This phenomenon is known as the Pareto principle or the 80:20 rule. In this post, we will have some understanding about why this pattern appears, by designing two languages that follows the principle.

A Language With Equal Word Length

The first language is a boring one: every word has the same length.

The syntax of the first language:

  1. There is a finite set Ω\Omega called alphabet.
  2. A word is an element of Ωn\Omega^n where nn is a fixed number.
  3. Characters in a word are iid samples of some random variable XX over Ω\Omega.

For example, let Ω={a,b,c}\Omega=\{a,b,c\}, n=5n=5 and the value of XX over a,b,ca,b,c is 0.5,0.4,0.10.5,0.4,0.1 respectively, then a sentence of our language looks like:

aabaa bbbaa aabaa baaba abaca bbbba abaac ...

This can be generated using the following J sentence:

20 5 $ ((roulette"1) 100 3 $ 0.5 0.4 0.1) (({ >)"0) 100 $ < 'abc'

Typical Set and the Asymptotic Equipartition Principle

We will focus on one seemingly arbitrary property of the language: the information content of a word. Let II be the random variable that represents the information content. Given a character xx, we have I(x)=log2Pr(X=x).I(x)=-\log_2\mathrm{Pr}(X=x).

Since the characters are iid samples, according to the law of large numbers, for every ε>0\varepsilon>0: limnPr(1ni=1nI(xi)H(X)<ε)=1,(1)\lim_{n\to\infty}\mathrm{Pr}\left(\left|\frac{1}{n}\sum_{i=1}^nI(x_i)-H(X)\right|<\varepsilon\right)=1,\tag{1} where H(X)=E[I(X)]H(X)=\mathbb{E}[I(X)] is the entropy of XX.

Let A={xnΩn:1ni=1nI(xi)H(X)<ε}\mathcal{A}=\{x^n\in\Omega^n:|\frac{1}{n}\sum_{i=1}^nI(x_i)-H(X)|<\varepsilon\} be the set of words where the average information content is close to the entropy (it’s called the (weak) typical set in information theory). We know from the equation above that you will almost surely get an element of A\mathcal{A} when sampling a word, despite that A|\mathcal{A}| is relatively small as we will see.

To estimate A|\mathcal{A}|, let’s consider an element xn=(x1,x2,,xn)Ax^n=(x_1,x_2,\dots,x_n)\in\mathcal{A} and try to answer: What’s the probability of getting the word xnx^n?

Here we can make use of the iid assumption again: Pr(Xn=xn)=i=1nPr(X=xi)=2i=1nlog2Pr(X=xi)=2i=1nI(xi).\mathrm{Pr}(X^n=x^n)=\prod_{i=1}^n\mathrm{Pr}(X=x_i)=2^{\sum_{i=1}^n\log_2\mathrm{Pr}(X=x_i)}=2^{-\sum_{i=1}^nI(x_i)}.

By definition of A\mathcal{A}, (1ε)H(X)<1ni=1nI(xi)<(1+ε)H(X).(1-\varepsilon)H(X)<\frac{1}{n}\sum_{i=1}^nI(x_i)<(1+\varepsilon)H(X).

Therefore, 2n(1+ε)H(X)<Pr(Xn=xn)<2n(1ε)H(X).2^{-n(1+\varepsilon)H(X)}<\mathrm{Pr}(X^n=x^n)<2^{-n(1-\varepsilon)H(X)}.

We know this is true for every xnAx^n\in\mathcal{A}, so the probability of XnAX^n\in\mathcal{A} is lower-bounded: 2n(1+ε)H(X)A<Pr(XnA).2^{-n(1+\varepsilon)H(X)}|\mathcal{A}|<\mathrm{Pr}(X^n\in\mathcal{A}).

Pr(XnA)\mathrm{Pr}(X^n\in\mathcal{A}) is a probability, therefore it can’t be greater than 1: 2n(1+ε)H(X)A<Pr(XnA)1.2^{-n(1+\varepsilon)H(X)}|\mathcal{A}|<\mathrm{Pr}(X^n\in\mathcal{A})\leq1.

Then we have this nice upper bound of A|\mathcal{A}|: A<2n(1+ε)H(X).|\mathcal{A}|<2^{n(1+\varepsilon)H(X)}.

The result is a part of the asymptotic equipartition principle in information theory.

Making The Language 80:20

Let go back to our simple language, and find a set of parameters to make it satisfies the 80:20 rule.

The typical set will have less than 20% of the element if 2n(1+ε)H(X)Ωn<15.\frac{2^{n(1+\varepsilon)H(X)}}{|\Omega|^n}<\frac{1}{5}.

We can use the probability distribution (12,14,,12Ω1,12Ω1)(\frac{1}{2},\frac{1}{4},\dots,\frac{1}{2^{|\Omega|-1}},\frac{1}{2^{|\Omega|-1}}) to ensure that the entropy is always less than 2 for every Ω\Omega that has more than 1 element (see appendix A for the proof).

Setting m=Ω=8m=|\Omega|=8 and using the result H(X)<2H(X)<2, we got the first constraint 22n(1+ε)8n<15ε<112nlog25.\frac{2^{2n(1+\varepsilon)}}{8^n}<\frac{1}{5}\Rightarrow \varepsilon<1-\frac{1}{2n}\log_2 5.

Unfortunately, weak LLN (1) doesn’t tell us how the probability converges as nn\to\infty, and the simplest way I came up with is to sample a lot of words.

n =: 6
m =: 8
sample =: 100000
NB. We use the constraint above to get a valid epsilon.
eps =: (1 - (2 ^. 5) % 2 * n) - 0.01

NB. The probability distribution
p =: (2 ^ _1 - i. m - 1) , 2 ^ - m - 1

NB. The entropy
e =: entropy p

NB. Sample the words
words =: roulette"1 (sample, n, m) $ p

NB. Take the y-th element of the probability
NB. The `"0` at the end specifies that its rank is 0
take =: 3 : '{: 1 {. y }. p'"0

NB. Calculate their average information content
i =: (+/ % #)"1 - 2 ^. take words
NB.  (+/ % #)   is a fork that calculates the averave

NB. Get the result
(+/ % #) (i < e + eps) *. (i > e - eps)

The probability is about 0.84, so the configuration does give us a language that satisfy the 80:20 rule.

Another Language

I was about to end this post when I read this paper by Conrad and Mitzenmacher. The paper describes a way to generate languages where the length of words are varied and the word count is power-law distribution. Let’s have a look at a simple language described in the paper.

The syntax of the second language:

  1. There are only three kinds of characters: a,ba,b and spaces.
  2. Each character are iid samples from a probability distribution where Pr(a)=p,Pr(b)=p2\mathrm{Pr}(a)=p,\mathrm{Pr}(b)=p^2 and Pr([space])=1pp2\mathrm{Pr}(\text{[space]})=1-p-p^2.
  3. A word is a sequence of aas and bbs that are not separated by a space.

Again, let’s generate some sample texts with J:

p =: 0.5
((roulette"1) 100 3 $ p,(p * p),(1 - p * p)) (({ >)"0) 100 $ < 'ab '

The result looks like:

b ab baa ab baa   aa  ba  bb ba aa baaaabbbbabb  ba a aaba abaaaa

Note that there are some extra spaces in the text, but that doesn’t change the probability of word because of the iid assumption.

Zipf’s Law

Our second language is a much richer language: there are an infinitive number of words, and as a result, we can not say that a set of a words’ size is 20% of “the set of all words”. Instead, we will proof that the word frequency in our language (asymptotically) satisfies Zipf’s law.

Zipf’s law: Let the probability of the ithi^{\text{th}} most frequent word in the language be fif_i, then fi1(i+β)α(2)f_i\propto\frac{1}{(i+\beta)^\alpha}\tag{2} where α>0\alpha>0.

If we take the logarithm of ff, then we have logf=αlog(i+β)+constant,\log f=-\alpha\log(i+\beta)+\text{constant},

That’s why people usually describe Zipf’s law using straight lines in log-log diagram, like this one: 2 (this diagram is plotted using Python and the code is AI-generated, so I won’t paste it here.)

Frequency Ranking and Fibonacci Numbers

To know what the ithi^{\text{th}} word is, we should sort all the words in our language according to its frequency. The probability of a word with ss aas and tt bbs (and a space) appearing in a text is Pr(a)jPr(b)kPr([space])=ps+2t(1pp2).\mathrm{Pr}(a)^j\mathrm{Pr}(b)^k\mathrm{Pr}(\text{[space]})=p^{s+2t}(1-p-p^2).

We can generate the list of words ordered by their frequencies: S1={a} has probability p(1pp2),S2={aa,b} has probability p2(1pp2),S3={ab,aaa,ba} has probability p3(1pp2),S4={aab,bb,aba,aaaa,baa} has probability p4(1pp2),\begin{aligned} S_1=\{a\} & \text{ has probability } p(1-p-p^2), \\ S_2=\{aa,b\} & \text{ has probability } p^2(1-p-p^2), \\ S_3=\{ab,aaa,ba\} & \text{ has probability } p^3(1-p-p^2), \\ S_4=\{aab,bb,aba,aaaa,baa\} & \text{ has probability } p^4(1-p-p^2), \\ & \vdots \\ \end{aligned}

Every word can find its place in the list since every word’s probability has the form pk(1pp2)p^k(1-p-p^2).

You may have noticed that the size of SkS_k looks like the (k+1)th(k+1)^{\text{th}} Fibonacci number Fk+1F_{k+1} (and I’m not sorting the words in alphabetical order). Indeed we can proof that Sk=Fk+1|S_k|=F_{k+1}, and thus Sk=φk+1ψk+15|S_k|=\frac{\varphi^{k+1}-\psi^{k+1}}{\sqrt{5}} (where φ=1+52,ψ=152\varphi=\frac{1+\sqrt{5}}{2},\psi=\frac{1-\sqrt{5}}{2}).

The proof. We can proof it by induction.

  • S1=1=F2|S_1|=1=F_2 and S2=2=F3|S_2|=2=F_3.
  • Assume that Sk2|S_{k-2}| and Sk1|S_{k-1}| are Fk1F_{k-1} and FkF_k respectively. We prove that Sk={wbwSk2}{wawSk1},S_k=\{wb|w\in S_{k-2}\}\cup\{wa|w\in S_{k-1}\}, where wawa means concatenating the word ww with aaaa.
    • Because every word in {wbwSk2}\{wb|w\in S_{k-2}\} and {wawSk1}\{wa|w\in S_{k-1}\} has probability pk(1pp2)p^k(1-p-p^2), therefore Sk{wbwSk2}{wawSk1}.S_k\supseteq\{wb|w\in S_{k-2}\}\cup\{wa|w\in S_{k-1}\}.
    • For a word wSkw'\in S_k, if it ends with aa, then we can strip away the last aa to get a word ww. It has probability pk1(1pp2)p^{k-1}(1-p-p^2), so it will appear in Sk1S_{k-1}. Similarily, if ww' ends with bb, then stripping away the last bb yields a word in Sk2S_{k-2}. Therefore Sk{wbwSk2}{wawSk1}. S_k\subseteq\{wb|w\in S_{k-2}\}\cup\{wa|w\in S_{k-1}\}.\ \Box

One Last Trick

Finally, we can crack the problem of the frequency fif_i. If we can write kk as an expression of ii, then its probability is just pkp^k times a constant.

Here’s the tricky part: we find that writing kk as a function of ii is harder than the other way, so we instead writing ii as a function of kk and take its inverse. We have j=1k1Sjij=1kSj,(3)\sum_{j=1}^{k-1}|S_j|\leq i\leq\sum_{j=1}^{k}|S_j|,\tag{3} and j=1kSj=j=1kφk+1ψk+15=15(j=1kφk+1j=1kψk+1)=15(φk+11φ1ψk+11ψ1).\begin{aligned} \sum_{j=1}^{k}|S_j| & =\sum_{j=1}^{k}\frac{\varphi^{k+1}-\psi^{k+1}}{\sqrt{5}} \\ & =\frac{1}{\sqrt{5}}\left(\sum_{j=1}^{k}\varphi^{k+1}-\sum_{j=1}^{k}\psi^{k+1}\right) \\ & =\frac{1}{\sqrt{5}}\left(\frac{\varphi^{k+1}-1}{\varphi-1}-\frac{\psi^{k+1}-1}{\psi-1}\right). \\ \end{aligned}

Because ψ<1|\psi|<1, when kk is large, ψk+1\psi^{k+1} is close to 0, and the equation (3) becomes C1φk+C2εiC1φk+1+C2+ε,C_1\varphi^{k}+C_2-\varepsilon\leq i\leq C_1\varphi^{k+1}+C_2+\varepsilon, where C1,C2C_1,C_2 are constant that we don’t care about, and ε\varepsilon is a relatively small value (in terms of kk).

Then we have 1C1logφ(iC2ε)1k1C1logφ(iC2+ε), p1C1logφ(iC2ε)1pkp1C1logφ(iC2+ε), 1p(iC2ε)1C1logφppk(iC2+ε)1C1logφp.\begin{aligned} & \frac{1}{C_1}\log_\varphi(i-C_2-\varepsilon)-1\leq k\leq\frac{1}{C_1}\log_\varphi(i-C_2+\varepsilon), \\ \Rightarrow\ & p^{\frac{1}{C_1}\log_\varphi(i-C_2-\varepsilon)-1}\leq p^k\leq p^{\frac{1}{C_1}\log_\varphi(i-C_2+\varepsilon)}, \\ \Rightarrow\ & \frac{1}{p}(i-C_2-\varepsilon)^{\frac{1}{C_1}\log_\varphi p}\leq p^k\leq(i-C_2+\varepsilon)^{\frac{1}{C_1}\log_\varphi p}. \end{aligned}

Note that C1>0C_1>0 and logφp<0\log_\varphi p<0, therefore the equation holds when ii is sufficiently large.

Conclusion

These two languages can’t fully explain the 80:20 pattern in natural languages, because we made an assumption that characters are iid, while characters in natural languages are not independent. For example, If you are asked to fill the blanks in t_ese t_ree words with one kind of character, you will choose h rather than e despite that the probability of e in English is higher than that of h. A better probabilistic model requires deep understanding of linguistics and better mathematical tools.

The concept of typical set and AEP are derived from chapter 4 of MacKay’s ITILA textbook. The first language is also influenced by its examples. Fixed-length words are widely used in computing, and typical set provides a way to compress them. In fact, the famous Shannon’s source coding theorem can be described as: we can compress our language without losing too much information, by only encoding the words in the typical set.

I have already mentioned that the second language is paraphrased from this paper by Conrad and Mitzenmacher. The paper proofs that the power-law distribution applies to languages with any probability distribution on a finite set of characters, not just (p,p2)(p,p^2) on two characters.

Appendix: Why the Entropy is Always Less than 2

We want to prove that: for every m>2m>2, Hm=12m1log212m1i=1m112ilog212i<2.H_m=-\frac{1}{2^{m-1}}\log_2\frac{1}{2^{m-1}}-\sum_{i=1}^{m-1}\frac{1}{2^i}\log_2\frac{1}{2^i}<2.

We can simplify HmH_m to Hm=m12m1+i=1m1i2i.H_m=\frac{m-1}{2^{m-1}}+\sum_{i=1}^{m-1}\frac{i}{2^i}.

The proof consists of two parts:

  1. HmH_m increases monotonically,
  2. limnHm=2\lim_{n\to\infty}H_m=2.

The first part is simple: Hm+1Hm=m2mm12m1+m2m=12m1>0.H_{m+1}-H_{m}=\frac{m}{2^m}-\frac{m-1}{2^{m-1}}+\frac{m}{2^m}=\frac{1}{2^{m-1}}>0.

For the second part, consider the series (we can get this by taking the derivative of the series i=0xn\sum_{i=0}^\infty x^n) n=1nxn1=1(1x)2(1<x<1).\sum_{n=1}^\infty nx^{n-1}=\frac{1}{(1-x)^2}\quad(-1<x<1).

Let x=12x=\frac{1}{2}, then we have n=1n2n1=4n=1n2n=2.\sum_{n=1}^\infty\frac{n}{2^{n-1}}=4\Rightarrow\sum_{n=1}^\infty\frac{n}{2^n}=2.

Therefore limmHm=limmm12m1+limmi=1m1i2i=2.\lim_{m\to\infty}H_m=\lim_{m\to\infty}\frac{m-1}{2^{m-1}}+\lim_{m\to\infty}\sum_{i=1}^{m-1}\frac{i}{2^i}=2.

  1. The book I used is Wuthering Heights by Emily Brontë, downloaded from the project Gutenberg. Actually, I found the plot hard to comprehend (probably because I’m often messed up with the names and don’t understand who is who) and gave up reading it. 

  2. If there is a J verb that isn’t in the standard library and doesn’t have definition later in this post, then it’s defined in the previous post.