1 The Khinchin (Shannon?) axioms for entropy

Note. In this course, “random variable” will mean “discrete random variable” (unless otherwise specified).

All logarithms will be base 2 (unless otherwise specified).

Definition (Entropy). The entropy of a discrete random variable X is a quantity H[X] that takes real values and has the following properties:

  • (i) Normalisation: If X is uniform on {0,1} then H[X]=1.
  • (ii) Invariance: If X takes values in A, Y takes values in B, f is a bijection from A to B, and for every a ∈ A we have ℙ[X=a]=ℙ[Y=f(a)], then H[Y]=H[X].
  • (iii) Extendability: If X takes values in a set A, B is disjoint from A, and Y takes values in A ∪ B with ℙ[Y=a]=ℙ[X=a] for all a ∈ A, then H[Y]=H[X].
  • (iv) Maximality: If X takes values in a finite set A and Y is uniformly distributed on A, then H[X] ≤ H[Y].
  • (v) Continuity: H depends continuously on X with respect to total variation distance (the distance between X and Y is defined to be sup_E |ℙ[X ∈ E] − ℙ[Y ∈ E]|).

For the last axiom we need a definition:

Let X and Y be random variables. The conditional entropy H[X|Y] of X given Y is

∑_y ℙ[Y=y] H[X|Y=y].

  • (vi) Additivity: H[X,Y]=H[Y]+H[X|Y].
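
The additivity axiom is easy to test numerically. Here is a minimal Python sketch, assuming the explicit formula H = ∑ p log(1/p) (which Theorem 1.8 below derives from the axioms) and an arbitrarily chosen joint distribution:

```python
from math import log2

def H(dist):
    """Shannon entropy (base 2) of a {outcome: probability} dict."""
    return sum(q * log2(1 / q) for q in dist.values() if q > 0)

# Joint distribution of (X, Y); the numbers are arbitrary test values.
p = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

# Marginal distribution of Y.
pY = {}
for (x, y), q in p.items():
    pY[y] = pY.get(y, 0) + q

# H[X|Y] = sum over y of P[Y=y] * H[X | Y=y].
H_X_given_Y = sum(
    pY[y] * H({x: q / pY[y] for (x, yy), q in p.items() if yy == y})
    for y in pY
)

# Additivity: H[X,Y] = H[Y] + H[X|Y].
assert abs(H(p) - (H(pY) + H_X_given_Y)) < 1e-12
```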

Lemma 1.1. Assuming that:

  • X and Y are independent random variables

Then
H[X,Y]=H[X]+H[Y].

Proof. H[X|Y] = ∑_y ℙ[Y=y] H[X|Y=y].

Since X and Y are independent, the distribution of X is unaffected by conditioning on Y, so by invariance

H[X|Y=y] = H[X]

for all y. Hence H[X|Y] = H[X], and the result follows by additivity.

Corollary 1.2. If X_1,…,X_n are independent, then

H[X_1,…,X_n] = H[X_1] + ⋯ + H[X_n].

Proof. Lemma 1.1 and obvious induction.

Lemma 1.3 (Chain rule). Assuming that:

  • X_1,…,X_n are random variables

Then
H[X_1,…,X_n] = H[X_1] + H[X_2|X_1] + H[X_3|X_1,X_2] + ⋯ + H[X_n|X_1,…,X_{n−1}].

Proof. The case n=2 is additivity. In general,

H[X_1,…,X_n] = H[X_1,…,X_{n−1}] + H[X_n|X_1,…,X_{n−1}]

so we are done by induction.
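
The chain rule is likewise easy to check numerically; here is a sketch for n = 3, with a randomly generated joint distribution (test data only):

```python
import itertools
import random
from math import log2

def H(dist):
    return sum(q * log2(1 / q) for q in dist.values() if q > 0)

def marginal(dist, coords):
    """Marginal distribution of the chosen coordinates of a joint distribution."""
    out = {}
    for xs, q in dist.items():
        key = tuple(xs[i] for i in coords)
        out[key] = out.get(key, 0) + q
    return out

def cond_H(dist, target, given):
    """H[target | given], computed via additivity as H[target, given] - H[given]."""
    return H(marginal(dist, given + target)) - H(marginal(dist, given))

# Random joint distribution of (X1, X2, X3) on {0,1}^3.
random.seed(0)
w = [random.random() for _ in range(8)]
s = sum(w)
p = {xyz: v / s for xyz, v in zip(itertools.product((0, 1), repeat=3), w)}

# Chain rule: H[X1,X2,X3] = H[X1] + H[X2|X1] + H[X3|X1,X2].
lhs = H(p)
rhs = H(marginal(p, (0,))) + cond_H(p, (1,), (0,)) + cond_H(p, (2,), (0, 1))
assert abs(lhs - rhs) < 1e-12
```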

Lemma 1.4. Assuming that:

  • Y=f(X)

Then
H[X,Y]=H[X].

Also,

H[Z|X,Y]=H[Z|X].

Proof. The map g : x ↦ (x, f(x)) is a bijection, and (X,Y)=g(X). So the first statement follows by invariance. For the second statement:

H[Z|X,Y] = H[Z,X,Y] − H[X,Y]   (by additivity)
         = H[Z,X] − H[X]       (by the first part)
         = H[Z|X]              (by additivity)

Lemma 1.5. Assuming that:

  • X takes only one value

Then H[X]=0.

Proof. Since X takes only one value, X and X are independent. Therefore, by Lemma 1.1, H[X,X]=2H[X]. But by invariance, H[X,X]=H[X]. So H[X]=0.

Proposition 1.6. Assuming that:

  • X is uniformly distributed on a set of size 2^n

Then H[X]=n.

Proof. Let X_1,…,X_n be independent random variables uniformly distributed on {0,1}. By Corollary 1.2 and normalisation, H[X_1,…,X_n]=n. But (X_1,…,X_n) is uniformly distributed on {0,1}^n, so by invariance, the result follows.

Proposition 1.7. Assuming that:

  • X is uniformly distributed on a set A of size n

Then H[X] = log n.

Reminder: log here is to the base 2 (which is the convention for this course).

Proof. Let r be a positive integer and let X_1,…,X_r be independent copies of X.

Then (X_1,…,X_r) is uniform on A^r and

H[X_1,…,X_r] = rH[X].

Now pick k such that 2^k ≤ n^r ≤ 2^{k+1}. Then by invariance, maximality, and Proposition 1.6, we have that

k ≤ rH[X] ≤ k+1.

So

k/r ≤ log n ≤ (k+1)/r  and  k/r ≤ H[X] ≤ (k+1)/r,

so |H[X] − log n| ≤ 1/r for every positive integer r. Therefore H[X] = log n, as claimed.
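
A numerical illustration of the sandwich argument, for the concrete (arbitrarily chosen) case n = 3: both log 3 and H[X] are trapped in [k/r, (k+1)/r], so they agree to within 1/r.

```python
from math import floor, log2

n = 3
for r in (1, 10, 100, 1000):
    k = floor(r * log2(n))  # then 2^k <= n^r < 2^(k+1)
    print(f"r={r}: {k/r:.6f} <= log2(3) = {log2(n):.6f} <= {(k+1)/r:.6f}")
```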

Notation. We will write p_a = ℙ[X=a].

We will also use the notation [n] = {1,2,…,n}.

Theorem 1.8 (Khinchin). Assuming that:

  • H satisfies the Khinchin axioms

  • X takes values in a finite set A

Then
H[X] = ∑_{a∈A} p_a log(1/p_a).

Proof. First we do the case where all p_a are rational (we can then finish easily using the continuity axiom).

Pick n such that, for every a, there is some m_a ∈ ℕ ∪ {0} such that p_a = m_a/n.

Let Z be uniform on [n]. Let (E_a : a ∈ A) be a partition of [n] into sets with |E_a| = m_a. By invariance we may assume that X = a if and only if Z ∈ E_a. Then

log n = H[Z] = H[Z,X] = H[X] + H[Z|X]   (Lemma 1.4 and additivity)
      = H[X] + ∑_{a∈A} p_a H[Z|X=a]
      = H[X] + ∑_{a∈A} p_a log(m_a)   (given X=a, Z is uniform on E_a)
      = H[X] + ∑_{a∈A} p_a (log p_a + log n)   (since m_a = p_a n)

Hence, subtracting log n = ∑_{a∈A} p_a log n from both sides,

H[X] = −∑_{a∈A} p_a log p_a = ∑_{a∈A} p_a log(1/p_a).

By continuity, since the formula holds whenever all p_a are rational, it holds in general.
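
A worked instance of the formula (the bias 3/4 is an arbitrary illustrative choice): a coin with ℙ[heads] = 3/4 has entropy about 0.81 bits, strictly less than the fair coin's 1 bit, as maximality requires.

```python
from math import log2

# Biased coin: P[heads] = 3/4 (an arbitrary illustrative value).
p = {"heads": 0.75, "tails": 0.25}
H = sum(q * log2(1 / q) for q in p.values())
print(H)  # 0.75*log2(4/3) + 0.25*log2(4) ≈ 0.8113 < 1
```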

Corollary 1.9. Assuming that:

  • X and Y random variables

Then H[X] ≥ 0 and H[X|Y] ≥ 0.

Proof. Immediate consequence of Theorem 1.8.

Corollary 1.10. Assuming that:

  • Y=f(X)

Then H[Y] ≤ H[X].

Proof. H[X] = H[X,Y] = H[Y] + H[X|Y]. But H[X|Y] ≥ 0.

Proposition 1.11 (Subadditivity). Assuming that:

  • X and Y are random variables

Then H[X,Y] ≤ H[X] + H[Y].

Proof. Note that for any two random variables X,Y we have

H[X,Y] ≤ H[X] + H[Y] ⟺ H[X|Y] ≤ H[X] ⟺ H[Y|X] ≤ H[Y]

Next, observe that H[X|Y] ≤ H[X] if X is uniform on a finite set. That is because

H[X|Y] = ∑_y ℙ[Y=y] H[X|Y=y]
       ≤ ∑_y ℙ[Y=y] H[X]   (by maximality)
       = H[X]

By the equivalence noted above, we also have that H[X|Y] ≤ H[X] if Y is uniform.

Now let p_{ab} = ℙ[(X,Y)=(a,b)] and assume that all p_{ab} are rational. Pick n such that we can write p_{ab} = m_{ab}/n with each m_{ab} an integer. Partition [n] into sets E_{ab} of size m_{ab}. Let Z be uniform on [n]. Without loss of generality (by invariance), (X,Y) = (a,b) if and only if Z ∈ E_{ab}.

Let E_b = ⋃_a E_{ab} for each b, so that Y = b if and only if Z ∈ E_b. Now define a random variable W as follows: if Y = b then W ∈ E_b, and conditionally on Y = b, W is uniformly distributed on E_b and independent of X (or of Z, if you prefer).

So W and X are conditionally independent given Y, and W is uniform on [n].

Then

H[X|Y] = H[X|Y,W]   (by conditional independence)
       = H[X|W]     (as W determines Y)
       ≤ H[X]       (as W is uniform)

By continuity, we get the result for general probabilities.
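
A sketch of subadditivity at its two extremes (both joint distributions are toy examples): an independent pair attains equality, while the fully dependent pair Y = X is strict.

```python
from math import log2

def H(dist):
    return sum(q * log2(1 / q) for q in dist.values() if q > 0)

def marginals(p):
    pX, pY = {}, {}
    for (x, y), q in p.items():
        pX[x] = pX.get(x, 0) + q
        pY[y] = pY.get(y, 0) + q
    return pX, pY

indep = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}  # independent fair bits
copy = {(0, 0): 0.5, (1, 1): 0.5}                       # Y = X

for p in (indep, copy):
    pX, pY = marginals(p)
    print(H(p), "<=", H(pX) + H(pY))  # 2.0 <= 2.0, then 1.0 <= 2.0
```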

Corollary 1.12. Assuming that:

  • X a random variable

Then H[X] ≥ 0.

Proof (without using the formula). By Subadditivity, H[X|X] ≤ H[X]. But H[X|X] = 0.

Corollary 1.13. Assuming that:

  • X_1,…,X_n are random variables

Then
H[X_1,…,X_n] ≤ H[X_1] + ⋯ + H[X_n].

Proof. Induction using Subadditivity.

Proposition 1.14 (Submodularity). Assuming that:

  • X,Y,Z are random variables

Then
H[X|Y,Z] ≤ H[X|Z].

Proof. Calculate:

H[X|Y,Z] = ∑_z ℙ[Z=z] H[X|Y,Z=z]
         ≤ ∑_z ℙ[Z=z] H[X|Z=z]   (by subadditivity, applied conditionally on Z=z)
         = H[X|Z]

Submodularity can be expressed in many ways.

Expanding using additivity gives the following inequalities:

H[X,Y,Z] − H[Y,Z] ≤ H[X,Z] − H[Z]

H[X,Y,Z] ≤ H[X,Z] + H[Y,Z] − H[Z]

H[X,Y,Z] + H[Z] ≤ H[X,Z] + H[Y,Z]
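
The last form is convenient for a numerical sanity check; here is a sketch over random joint distributions on {0,1}^3 (test data only):

```python
import itertools
import random
from math import log2

def H(dist):
    return sum(q * log2(1 / q) for q in dist.values() if q > 0)

def marginal(dist, coords):
    out = {}
    for xs, q in dist.items():
        key = tuple(xs[i] for i in coords)
        out[key] = out.get(key, 0) + q
    return out

random.seed(1)
for _ in range(1000):
    w = [random.random() for _ in range(8)]
    s = sum(w)
    p = dict(zip(itertools.product((0, 1), repeat=3), (v / s for v in w)))
    lhs = H(p) + H(marginal(p, (2,)))                      # H[X,Y,Z] + H[Z]
    rhs = H(marginal(p, (0, 2))) + H(marginal(p, (1, 2)))  # H[X,Z] + H[Y,Z]
    assert lhs <= rhs + 1e-9
```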

Lemma 1.15. Assuming that:

  • X,Y,Z random variables

  • Z=f(Y)

Then
H[X|Y] ≤ H[X|Z].

Proof.

H[X|Y] = H[X,Y] − H[Y]
       = H[X,Y,Z] − H[Y,Z]   (by Lemma 1.4, as Z = f(Y))
       ≤ H[X,Z] − H[Z]       (Submodularity)
       = H[X|Z]

Lemma 1.16. Assuming that:

  • X,Y,Z random variables

  • Z=f(X)=g(Y)

Then
H[X,Y] + H[Z] ≤ H[X] + H[Y].

Proof. Submodularity says

H[X,Y,Z] + H[Z] ≤ H[X,Z] + H[Y,Z].

Since Z = f(X) = g(Y), Lemma 1.4 gives H[X,Z] = H[X], H[Y,Z] = H[Y] and H[X,Y,Z] = H[X,Y], which yields the result.

Lemma 1.17. Assuming that:

  • X takes values in a finite set A

  • Y is uniform on A

  • H[X]=H[Y]

Then X is uniform.

Proof. Let p_a = ℙ[X=a]. Then

H[X] = ∑_{a∈A} p_a log(1/p_a) = |A| · 𝔼_{a∈A}[p_a log(1/p_a)],

where 𝔼_{a∈A} denotes the average over a uniformly random a ∈ A.

The function x ↦ x log(1/x) is concave on [0,1]. So, by Jensen's inequality, this is at most

|A| (𝔼_a p_a) log(1/𝔼_a p_a) = log(|A|) = H[Y],

since 𝔼_a p_a = 1/|A|.

Equality holds if and only if a ↦ p_a is constant, i.e. X is uniform.
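
A quick numerical illustration of the inequality just proved (the non-uniform distribution is an arbitrary example): on a 4-element set, only the uniform distribution reaches log 4 = 2 bits.

```python
from math import log2

def H(ps):
    return sum(q * log2(1 / q) for q in ps if q > 0)

print(H([0.25, 0.25, 0.25, 0.25]))  # 2.0 (uniform: maximal)
print(H([0.4, 0.3, 0.2, 0.1]))      # ≈ 1.846 < 2 (non-uniform: strictly smaller)
```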

Corollary 1.18. Assuming that:

  • X,Y random variables

  • H[X,Y]=H[X]+H[Y]

Then X and Y are independent.

Proof. We go through the proof of Subadditivity and check when equality holds.

Suppose that X is uniform on A. Then

H[X|Y] = ∑_y ℙ[Y=y] H[X|Y=y] ≤ H[X]

with equality if and only if the conditional distribution of X given Y=y is uniform on A for every y (by maximality and Lemma 1.17), which implies that X and Y are independent.

At the last stage of the proof we used

H[X|Y] = H[X|Y,W] = H[X|W] ≤ H[X]

where W was uniform. So equality holds only if X and W are independent, which implies (since Y depends on W) that X and Y are independent.

Definition (Mutual information). Let X and Y be random variables. The mutual information I[X:Y] is

H[X] + H[Y] − H[X,Y] = H[X] − H[X|Y] = H[Y] − H[Y|X]

Subadditivity is equivalent to the statement that I[X:Y] ≥ 0, and Corollary 1.18 implies that I[X:Y] = 0 if and only if X and Y are independent.

Note that

H[X,Y] = H[X] + H[Y] − I[X:Y].

Definition (Conditional mutual information). Let X, Y and Z be random variables. The conditional mutual information of X and Y given Z, denoted by I[X:Y|Z], is

∑_z ℙ[Z=z] I[X|Z=z : Y|Z=z] = ∑_z ℙ[Z=z] (H[X|Z=z] + H[Y|Z=z] − H[X,Y|Z=z])
                            = H[X|Z] + H[Y|Z] − H[X,Y|Z]
                            = H[X,Z] + H[Y,Z] − H[X,Y,Z] − H[Z]

Submodularity is equivalent to the statement that I[X:Y|Z] ≥ 0.
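
A sketch computing I[X:Y|Z] from the last expression above; the test case (chosen purely for illustration) is X, Y independent fair bits with Z = X XOR Y.

```python
from math import log2

def H(dist):
    return sum(q * log2(1 / q) for q in dist.values() if q > 0)

def marginal(dist, coords):
    out = {}
    for xs, q in dist.items():
        key = tuple(xs[i] for i in coords)
        out[key] = out.get(key, 0) + q
    return out

# Joint distribution of (X, Y, Z) with X, Y independent fair bits, Z = X XOR Y.
p = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}

# I[X:Y|Z] = H[X,Z] + H[Y,Z] - H[X,Y,Z] - H[Z]
I = H(marginal(p, (0, 2))) + H(marginal(p, (1, 2))) - H(p) - H(marginal(p, (2,)))
print(I)  # 1.0, and in particular >= 0
```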