1 The Khinchin (Shannon?) axioms for entropy
Note.
In this course, “random variable” will mean “discrete random variable” (unless otherwise specified).
All logarithms will be base 2 (unless otherwise specified).
Definition (Entropy).
The entropy of a discrete random variable $X$ is a real-valued quantity $H(X)$ with the following properties:
- (i) Normalisation: If $X$ is uniform on $\{0, 1\}$, then $H(X) = 1$.
- (ii) Invariance: If $X$ takes values in $A$, $Y$ takes values in $B$, $f$ is a bijection from $A$ to $B$, and for every $a \in A$ we have $\mathbb{P}(Y = f(a)) = \mathbb{P}(X = a)$, then $H(Y) = H(X)$.
- (iii) Extendability: If $X$ takes values in a set $A$, $B$ is disjoint from $A$, $Y$ takes values in $A \cup B$, and for all $a \in A$ we have $\mathbb{P}(Y = a) = \mathbb{P}(X = a)$, then $H(Y) = H(X)$.
- (iv) Maximality: If $X$ takes values in a finite set $A$ and $Y$ is uniformly distributed in $A$, then $H(X) \le H(Y)$.
- (v) Continuity: $H$ depends continuously on $X$ with respect to total variation distance (defined by: the distance between $X$ and $Y$ is $\max_E |\mathbb{P}(X \in E) - \mathbb{P}(Y \in E)|$, the maximum over events $E$).
For the last axiom we need a definition:
Let $X$ and $Y$ be random variables.
The conditional entropy of $X$ given $Y$ is
\[ H(X \mid Y) = \sum_y \mathbb{P}(Y = y) \, H(X \mid Y = y), \]
where $X \mid Y = y$ denotes $X$ conditioned on the event $Y = y$.
- (vi) Additivity: $H(X, Y) = H(Y) + H(X \mid Y)$.
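The axioms can be sanity-checked numerically against the explicit formula $H(X) = \sum_x p_x \log(1/p_x)$, which Theorem 1.8 below shows is forced by them. A minimal sketch of additivity, assuming distributions are stored as dictionaries; the helper names `H` and `H_given` are illustrative, not from any library:

```python
from math import log2

def H(dist):
    """Entropy (base 2) of a distribution given as {outcome: probability}."""
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

def H_given(joint):
    """Conditional entropy H(X | Y) for a joint law {(x, y): probability}."""
    pY = {}
    for (x, y), p in joint.items():
        pY[y] = pY.get(y, 0.0) + p
    # Average over y of the entropy of X conditioned on the event Y = y.
    return sum(py * H({x: p / py for (x, yy), p in joint.items() if yy == y})
               for y, py in pY.items())

# Additivity: H(X, Y) = H(Y) + H(X | Y).
joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 1): 0.25}
pY = {0: 0.5, 1: 0.5}  # marginal distribution of Y under `joint`
assert abs(H(joint) - (H(pY) + H_given(joint))) < 1e-12
```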
Lemma 1.1.
Assuming that:
- $X$ and $Y$ are independent random variables.
Then $H(X, Y) = H(X) + H(Y)$.
Proof.
By additivity, $H(X, Y) = H(Y) + H(X \mid Y)$.
Since $X$ and $Y$ are independent, the
distribution of $X$ is
unaffected by knowing $Y = y$
(so by invariance, $H(X \mid Y = y) = H(X)$),
so $H(X \mid Y = y) = H(X)$
for all $y$,
which gives the result. □
Corollary 1.2.
If $X_1, \dots, X_n$ are independent, then
\[ H(X_1, \dots, X_n) = H(X_1) + \dots + H(X_n). \]
Proof.
Lemma 1.1 and obvious induction. □
Lemma 1.3 (Chain rule).
Assuming that:
- $X_1, \dots, X_n$ are random variables.
Then
\[ H(X_1, \dots, X_n) = H(X_1) + H(X_2 \mid X_1) + \dots + H(X_n \mid X_1, \dots, X_{n-1}). \]
Proof.
The case $n = 2$
is additivity. In general,
\[ H(X_1, \dots, X_n) = H(X_1, \dots, X_{n-1}) + H(X_n \mid X_1, \dots, X_{n-1}), \]
so we are done by induction. □
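The telescoping in the chain rule is easy to check numerically as well; here is a small sketch on a random joint distribution of three bits, where conditional entropy is computed via the rearrangement $H(U \mid V) = H(U, V) - H(V)$ of additivity (all names illustrative):

```python
import random
from itertools import product
from math import log2

random.seed(0)

def H(dist):
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

def marginal(P, coords):
    """Marginal law of the chosen coordinates of a joint law P."""
    out = {}
    for o, p in P.items():
        key = tuple(o[i] for i in coords)
        out[key] = out.get(key, 0.0) + p
    return out

def H_given(P, a, b):
    """H(X_a | X_b) = H(X_a, X_b) - H(X_b), by additivity."""
    return H(marginal(P, a + b)) - H(marginal(P, b))

# Random joint distribution of (X1, X2, X3) on {0, 1}^3.
w = [random.random() for _ in range(8)]
s = sum(w)
P = dict(zip(product((0, 1), repeat=3), (x / s for x in w)))

chain = H(marginal(P, (0,))) + H_given(P, (1,), (0,)) + H_given(P, (2,), (0, 1))
assert abs(H(P) - chain) < 1e-12
```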
Lemma 1.4.
Assuming that:
- $X$ is a random variable
- $Y = f(X)$ for some function $f$.
Then $H(X, Y) = H(X)$ and $H(Y \mid X) = 0$.
Proof.
The map $x \mapsto (x, f(x))$ is
a bijection, and $\mathbb{P}((X, Y) = (x, f(x))) = \mathbb{P}(X = x)$.
So the first statement follows by invariance. For the second statement: by additivity (and invariance under swapping coordinates),
\[ H(Y \mid X) = H(Y, X) - H(X) = H(X, Y) - H(X) = 0. \ \square \]
Lemma 1.5.
Assuming that:
- $X$ takes only one value (i.e. $X$ is constant).
Then $H(X) = 0$.
Proof.
$X$ and $X$
are independent. Therefore, by Lemma 1.1, $H(X, X) = H(X) + H(X)$.
But by invariance, $H(X, X) = H(X)$ (the map $x \mapsto (x, x)$ is a bijection onto its image).
So $H(X) = 0$.
□
Proposition 1.6.
Assuming that:
- $X$ is uniformly distributed on a set of size $2^n$.
Then $H(X) = n$.
Proof.
Let $X_1, \dots, X_n$
be independent random variables uniformly distributed on $\{0, 1\}$.
By Corollary 1.2 and normalisation, $H(X_1, \dots, X_n) = n$.
But $(X_1, \dots, X_n)$
is uniformly distributed on $\{0, 1\}^n$, a set of size $2^n$,
so by invariance, the result follows. □
Proposition 1.7.
Assuming that:
- $X$ is uniformly distributed on a set $A$ of size $n$.
Then $H(X) = \log n$.
Reminder: here $\log$
is to the base $2$
(which is the convention for this course).
Proof.
Let $k$
be a positive integer and let $X_1, \dots, X_k$
be independent copies of $X$.
Then $(X_1, \dots, X_k)$ is
uniform on $A^k$
and
\[ H(X_1, \dots, X_k) = k H(X). \]
Now pick $r$
such that $2^r \le n^k \le 2^{r+1}$.
Then by invariance, maximality, and Proposition 1.6, we have that
\[ r \le k H(X) \le r + 1. \]
So
\[ |H(X) - \log n| \le \frac{1}{k}, \]
since also $r \le k \log n \le r + 1$. This holds for every $k$.
Therefore, $H(X) = \log n$,
as claimed. □
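As a concrete instance of the sandwich $r \le kH(X) \le r + 1$: taking $n = 3$, the interval $[r/k, (r+1)/k]$ shrinks onto $\log 3 \approx 1.585$ as $k$ grows. A short illustrative computation:

```python
from math import floor, log2

n = 3  # size of the set on which X is uniform; log2(3) ≈ 1.585
for k in (1, 10, 100):
    r = floor(log2(n ** k))       # the r with 2**r <= n**k < 2**(r + 1)
    print(k, r / k, (r + 1) / k)  # H(X) is trapped in [r/k, (r+1)/k]
```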
Notation.
We will write $H(p_1, \dots, p_n)$ for the entropy of a random variable that takes $n$ distinct values with probabilities $p_1, \dots, p_n$.
We will also use the notation $H(p) = H(p, 1 - p)$.
Theorem 1.8 (Khinchin).
Assuming that:
- $X$ takes values in a finite set $A$
- $\mathbb{P}(X = a) = p_a$ for each $a \in A$.
Then
\[ H(X) = \sum_{a \in A} p_a \log \frac{1}{p_a}. \]
Proof.
First we do the case where all the $p_a$
are rational (and then can finish easily by the continuity axiom).
Pick $n$
such that for all $a \in A$,
there is some non-negative integer $m_a$
such that $p_a = m_a / n$.
Let $Z$ be uniform
on $[n] = \{1, \dots, n\}$. Let
$\{E_a : a \in A\}$ be a partition
of $[n]$ into sets with
$|E_a| = m_a$. By invariance we
may assume that $X = f(Z)$, where $f(z) = a$ whenever $z \in E_a$.
Then, since given $X = a$ the random variable $Z$ is uniform on $E_a$,
\[ \log n = H(Z) = H(Z, X) = H(X) + H(Z \mid X) = H(X) + \sum_{a \in A} p_a \log m_a. \]
Hence
\[ H(X) = \log n - \sum_{a \in A} p_a \log m_a = \sum_{a \in A} p_a \log \frac{n}{m_a} = \sum_{a \in A} p_a \log \frac{1}{p_a}. \]
By continuity, since this holds if all the $p_a$
are rational, we conclude that the formula holds in general. □
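The bookkeeping in the rational case can be spot-checked numerically; for instance, with the hypothetical distribution $(1/2, 1/3, 1/6)$ we can take $n = 6$ and $(m_a) = (3, 2, 1)$:

```python
from math import log2

n, m = 6, [3, 2, 1]  # p_a = m_a / n gives the distribution (1/2, 1/3, 1/6)
p = [ma / n for ma in m]
lhs = log2(n) - sum(pa * log2(ma) for pa, ma in zip(p, m))
rhs = sum(pa * log2(1 / pa) for pa in p)
assert abs(lhs - rhs) < 1e-12
```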
Corollary 1.9.
Assuming that:
- $X$ and $Y$ are random variables.
Then $H(X) \ge 0$
and $H(X \mid Y) \ge 0$.
Proof.
Immediate consequence of Theorem 1.8. □
Corollary 1.10.
Assuming that:
- $Y = f(X)$ for some function $f$.
Then $H(Y) \le H(X)$.
Proof.
$H(X) = H(X, Y) = H(Y) + H(X \mid Y)$, by Lemma 1.4 and additivity.
But $H(X \mid Y) \ge 0$ by Corollary 1.9.
□
Proposition 1.11 (Subadditivity).
Assuming that:
- $X$ and $Y$ are random variables.
Then $H(X, Y) \le H(X) + H(Y)$.
Proof.
Note that for any two random variables $X$ and $Y$
we have, by additivity,
\[ H(X, Y) \le H(X) + H(Y) \iff H(X \mid Y) \le H(X). \]
Next, observe that $H(X \mid Y) \le H(X)$
if $X$ is
uniform on a finite set. That is because
\[ H(X \mid Y) = \sum_y \mathbb{P}(Y = y) \, H(X \mid Y = y) \le \sum_y \mathbb{P}(Y = y) \, H(X) = H(X), \]
using maximality for each $y$.
By the equivalence noted above, we also have that
$H(X, Y) \le H(X) + H(Y)$
if $X$
is
uniform.
Now let $X, Y$ be arbitrary and assume
that all the probabilities $p_{xy} = \mathbb{P}\big((X, Y) = (x, y)\big)$ are
rational. Pick $n$ such
that we can write $p_{xy} = m_{xy} / n$
with each $m_{xy}$ an
integer. Partition $[n]$
into sets $E_{xy}$ of
size $m_{xy}$. Let $Z$
be uniform on
$[n]$. Without loss of generality
(by invariance) $(X, Y) = \sigma(Z)$, where $\sigma(z) = (x, y)$ whenever $z \in E_{xy}$.
Let $F_x = \bigcup_y E_{xy}$ for each
$x$. So
$|F_x| = n \mathbb{P}(X = x) = n p_x$. Now define a
random variable $W$
as follows: If $(X, Y) = (x, y)$,
then $W \in F_x$, but then
$W$ is uniformly
distributed in $F_x$ and
independent of $Y$
(or of $Z$, if
you prefer).
So $W$ and $Y$
are conditionally
independent given $X$,
and $W$ is
uniform on $[n]$.
Then, since $X$ is determined by $W$ (because $W \in F_x$ forces $X = x$), and since given $(X, Y) = (x, y)$ the variable $W$ is uniform on $F_x$,
\[ H(W, Y) = H(X, W, Y) = H(X, Y) + H(W \mid X, Y) = H(X, Y) + \sum_x p_x \log(n p_x), \]
while on the other hand, since $W$ is uniform on $[n]$,
\[ H(W, Y) = H(Y) + H(W \mid Y) \le H(Y) + H(W) = H(Y) + \log n. \]
Combining the two expressions for $H(W, Y)$ and using Theorem 1.8,
\[ H(X, Y) \le H(Y) + \log n - \sum_x p_x \log(n p_x) = H(Y) + \sum_x p_x \log \frac{1}{p_x} = H(X) + H(Y). \]
By continuity, we get the result for general probabilities. □
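A numerical sanity check of subadditivity on random joint distributions (purely illustrative, and of course no substitute for the axiomatic argument above):

```python
import random
from math import log2

random.seed(1)

def H(dist):
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

for _ in range(1000):
    w = [random.random() for _ in range(9)]
    s = sum(w)
    joint = {(x, y): w[3 * x + y] / s for x in range(3) for y in range(3)}
    pX, pY = {}, {}
    for (x, y), p in joint.items():
        pX[x] = pX.get(x, 0.0) + p
        pY[y] = pY.get(y, 0.0) + p
    # H(X, Y) <= H(X) + H(Y), up to floating-point tolerance
    assert H(joint) <= H(pX) + H(pY) + 1e-12
```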
Corollary 1.12.
Assuming that:
- $X$ and $Y$ are random variables.
Then $H(X \mid Y) \le H(X)$.
Proof (Without using formula).
By Subadditivity, $H(X, Y) \le H(X) + H(Y)$.
But $H(X, Y) = H(Y) + H(X \mid Y)$.
□
Corollary 1.13.
Assuming that:
- $X_1, \dots, X_n$ are random variables.
Then
\[ H(X_1, \dots, X_n) \le H(X_1) + \dots + H(X_n). \]
Proposition 1.14 (Submodularity).
Assuming that:
- $X$, $Y$ and $Z$ are random variables.
Then $H(X, Y, Z) + H(Z) \le H(X, Z) + H(Y, Z)$.
Proof.
Calculate, writing $p_{xyz}$ for $\mathbb{P}\big((X, Y, Z) = (x, y, z)\big)$ and similarly for marginals, with all sums over outcomes of positive probability:
\begin{align*}
H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z)
&= \sum_{x, y, z} p_{xyz} \log \frac{p_{xyz} \, p_z}{p_{xz} \, p_{yz}} \\
&= - \sum_{x, y, z} p_{xyz} \log \frac{p_{xz} \, p_{yz}}{p_{xyz} \, p_z} \\
&\ge - \log \Big( \sum_{x, y, z} \frac{p_{xz} \, p_{yz}}{p_z} \Big) \ge -\log 1 = 0,
\end{align*}
where the first inequality is Jensen's inequality (concavity of $\log$), and the last holds because extending the sum to all triples gives $\sum_z \frac{p_z \cdot p_z}{p_z} = \sum_z p_z = 1$. □
Submodularity can be expressed in many ways.
Expanding using additivity gives the following inequalities:
\[ H(X \mid Y, Z) \le H(X \mid Z) \quad \text{and} \quad H(X, Y \mid Z) \le H(X \mid Z) + H(Y \mid Z). \]
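Submodularity, and hence these reformulations, can likewise be spot-checked on random triples; a sketch with illustrative helper names:

```python
import random
from itertools import product
from math import log2

random.seed(2)

def H(dist):
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

def marginal(P, coords):
    out = {}
    for o, p in P.items():
        key = tuple(o[i] for i in coords)
        out[key] = out.get(key, 0.0) + p
    return out

for _ in range(1000):
    w = [random.random() for _ in range(8)]
    s = sum(w)
    P = dict(zip(product((0, 1), repeat=3), (x / s for x in w)))
    # H(X, Y, Z) + H(Z) <= H(X, Z) + H(Y, Z)
    lhs = H(P) + H(marginal(P, (2,)))
    rhs = H(marginal(P, (0, 2))) + H(marginal(P, (1, 2)))
    assert lhs <= rhs + 1e-12
```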
Lemma 1.15.
Assuming that:
- $X$, $Y$ and $Z$ are random variables
- $Z = f(Y)$ for some function $f$.
Then $H(X \mid Y) \le H(X \mid Z)$.
Proof.
Since $Z$ depends only on $Y$, we have $H(X \mid Y) = H(X \mid Y, Z)$, and $H(X \mid Y, Z) \le H(X \mid Z)$ by submodularity. □
Lemma 1.16.
Assuming that:
- $X$, $Y$, $Z$ and $W$ are random variables
- $W = f(Y, Z)$ for some function $f$.
Then $H(X \mid Y, Z) \le H(X \mid W)$.
Proof.
Submodularity says:
\[ H(X \mid Y, Z, W) \le H(X \mid W), \]
which implies the result since $W$
depends on $Y$
and $Z$, so that $H(X \mid Y, Z, W) = H(X \mid Y, Z)$.
□
Lemma 1.17.
Assuming that:
- $X$ takes values in a finite set $A$
- $H(X) = \log |A|$.
Then $X$ is
uniform.
Proof.
Let $p_a = \mathbb{P}(X = a)$ for each $a \in A$.
Then
\[ H(X) = \sum_{a \in A} p_a \log \frac{1}{p_a}. \]
The function $f(t) = t \log \frac{1}{t}$
is concave on $[0, 1]$.
So, by Jensen's inequality this is at most
\[ |A| \, f\Big( \frac{1}{|A|} \sum_{a \in A} p_a \Big) = |A| \, f\Big( \frac{1}{|A|} \Big) = \log |A|. \]
Equality holds if and only if $a \mapsto p_a$
is constant – i.e. $X$
is uniform. □
Corollary 1.18.
Assuming that:
- $X$ and $Y$ are random variables
- $H(X, Y) = H(X) + H(Y)$.
Then
$X$ and $Y$ are
independent.
Proof.
We go through the proof of Subadditivity and check when equality holds.
Suppose that $X$
is uniform on a finite set $A$.
Then
\[ H(X \mid Y) = \sum_y \mathbb{P}(Y = y) \, H(X \mid Y = y) \le \sum_y \mathbb{P}(Y = y) \, H(X) = H(X), \]
with equality if and only if
$X \mid Y = y$ is uniform on $A$ for all $y$
(by Lemma 1.17),
which implies that
$X$ and $Y$ are
independent.
At the last stage of the proof we used
\[ H(W \mid Y) \le H(W), \]
where $W$ was uniform. So
equality holds only if
$W$ and $Y$ are independent,
which implies (since $X$
depends on $W$)
that
$X$ and $Y$ are
independent. □
Definition (Mutual information).
Let $X$
and $Y$ be random variables.
The mutual information $I(X; Y)$
is
\[ I(X; Y) = H(X) + H(Y) - H(X, Y). \]
Subadditivity is equivalent to the statement that $I(X; Y) \ge 0$,
and Corollary 1.18 implies that $I(X; Y) = 0$
if and only if
$X$ and $Y$ are
independent.
Note that
\[ I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X). \]
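The agreement of these three expressions for $I(X; Y)$ is easy to confirm numerically, with $H(X \mid Y)$ computed directly from its definition as an average over $y$ (names illustrative):

```python
from math import log2

def H(dist):
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

def H_given(joint, axis):
    """Entropy of the other coordinate given coordinate `axis` of a pair."""
    slices = {}
    for pair, p in joint.items():
        slices.setdefault(pair[axis], {})[pair[1 - axis]] = p
    total = 0.0
    for s in slices.values():
        pv = sum(s.values())
        total += pv * H({k: q / pv for k, q in s.items()})
    return total

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
pX = {0: 0.5, 1: 0.5}
pY = {0: 0.5, 1: 0.5}
I1 = H(pX) + H(pY) - H(joint)
I2 = H(pX) - H_given(joint, axis=1)  # H(X) - H(X | Y)
I3 = H(pY) - H_given(joint, axis=0)  # H(Y) - H(Y | X)
assert abs(I1 - I2) < 1e-12 and abs(I1 - I3) < 1e-12
```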
Definition (Conditional mutual information).
Let $X$,
$Y$
and $Z$
be random variables. The
conditional mutual information of $X$
and $Y$
given $Z$,
denoted by $I(X; Y \mid Z)$,
is
\[ I(X; Y \mid Z) = \sum_z \mathbb{P}(Z = z) \, I(X \mid Z = z; \, Y \mid Z = z) = H(X \mid Z) + H(Y \mid Z) - H(X, Y \mid Z). \]
Submodularity is equivalent to the statement that
$I(X; Y \mid Z) \ge 0$.
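Finally, a numerical sketch comparing the averaged definition of $I(X; Y \mid Z)$ with the entropy form, on a random joint distribution of three bits (illustrative names; the non-negativity at the end is exactly submodularity):

```python
import random
from itertools import product
from math import log2

random.seed(3)

def H(dist):
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

def marginal(P, coords):
    out = {}
    for o, p in P.items():
        key = tuple(o[i] for i in coords)
        out[key] = out.get(key, 0.0) + p
    return out

w = [random.random() for _ in range(8)]
s = sum(w)
P = dict(zip(product((0, 1), repeat=3), (x / s for x in w)))
pZ = marginal(P, (2,))

# Averaged definition: sum over z of P(Z = z) * I(X; Y | Z = z).
cmi = 0.0
for (z,), pz in pZ.items():
    slice_ = {(x, y): p / pz for (x, y, zz), p in P.items() if zz == z}
    cmi += pz * (H(marginal(slice_, (0,))) + H(marginal(slice_, (1,)))
                 - H(slice_))

# Entropy form, using H(U | Z) = H(U, Z) - H(Z).
alt = (H(marginal(P, (0, 2))) - H(pZ)) + (H(marginal(P, (1, 2))) - H(pZ)) \
      - (H(P) - H(pZ))
assert abs(cmi - alt) < 1e-12 and cmi >= -1e-12
```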