Lecture 7. Entropic CLT (4)
This lecture completes the proof of the entropic central limit theorem.
From Fisher information to entropy (continued)
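In the previous lecture, we proved the following: if $X_1,\ldots,X_n$ are independent random variables, then

$$I\Big(\sum_{i=1}^n X_i\Big)\le\sum_{s\in\mathcal{G}}\frac{w_s^2}{\beta_s}\,I\Big(\sum_{i\in s}X_i\Big),\qquad (1)$$

where $\{w_s:s\in\mathcal{G}\}$ are weights satisfying $\sum_{s\in\mathcal{G}}w_s=1$ and $\{\beta_s:s\in\mathcal{G}\}$ is any fractional packing of the hypergraph $\mathcal{G}$ in the sense that $\sum_{s\in\mathcal{G},\,s\ni i}\beta_s\le 1$ for every $i\in[n]=\{1,\ldots,n\}$.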
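By optimizing the choice of the weights, it is easy to check that (1) is equivalent to

$$\frac{1}{I\big(\sum_{i=1}^n X_i\big)}\ge\sum_{s\in\mathcal{G}}\frac{\beta_s}{I\big(\sum_{i\in s}X_i\big)}.$$

The analogous statement for entropy is

$$e^{2h(\sum_{i=1}^n X_i)}\ge\sum_{s\in\mathcal{G}}\beta_s\,e^{2h(\sum_{i\in s}X_i)}.\qquad (2)$$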
Proving (2) for general fractional packings and general hypergraphs requires some additional steps that we will avoid here. Instead, we will prove a special case, due to Artstein, Ball, Barthe and Naor (2004), that suffices to resolve the question of monotonicity of entropy in the CLT.
In the following, let $\mathcal{G}$ be the set of all subsets of $[n]$ of size $n-1$ and take $\beta_s=\frac{1}{n-1}$ for every $s\in\mathcal{G}$. In this case, (2) takes the following form.
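Theorem EPI. If $X_1,\ldots,X_n$ are independent random variables, then

$$e^{2h(\sum_{i=1}^n X_i)}\ge\frac{1}{n-1}\sum_{i=1}^n e^{2h(\sum_{j\ne i}X_j)},$$

provided that all the entropies exist.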
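As a quick sanity check, the leave-one-out inequality can be evaluated in closed form in two simple cases. The sketch below assumes the inequality in the form $e^{2h(X_1+\cdots+X_n)}\ge\frac{1}{n-1}\sum_i e^{2h(\sum_{j\ne i}X_j)}$ and uses the standard formula $e^{2h}=2\pi e\sigma^2$ for $N(0,\sigma^2)$. The function names and the chosen variances are illustrative only and do not come from the lecture.

```python
import math

def entropy_power(variance):
    """e^{2h} for a N(0, variance) variable, using h = 0.5 * log(2*pi*e*variance)."""
    return 2.0 * math.pi * math.e * variance

def epi_sides(variances):
    """Return (lhs, rhs) of the leave-one-out inequality for independent Gaussians."""
    n = len(variances)
    total = sum(variances)
    lhs = entropy_power(total)
    rhs = sum(entropy_power(total - v) for v in variances) / (n - 1)
    return lhs, rhs

# Independent Gaussians with arbitrary (illustrative) variances.
lhs, rhs = epi_sides([0.5, 1.0, 2.5, 4.0])
print(lhs, rhs)

# Non-Gaussian illustration with n = 2: X1, X2 i.i.d. uniform on [0, 1].
# h(X1) = 0 and h(X1 + X2) = 1/2 nats (triangular density on [0, 2]).
uniform_lhs = math.exp(2 * 0.5)
uniform_rhs = (math.exp(2 * 0.0) + math.exp(2 * 0.0)) / (2 - 1)
print(uniform_lhs, uniform_rhs)
```

For Gaussians the two sides coincide up to rounding, while in the uniform example the inequality is strict.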
To prove Theorem EPI, we will use the de Bruijn identity, which was discussed in the previous lecture. We restate it in a form convenient for the coming proof, and include a proof.
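Theorem (de Bruijn identity). Let $X$ be a random variable with density $f$ on $\mathbb{R}$. Let $X^t=e^{-t}X+\sqrt{1-e^{-2t}}\,Z$, $t\ge 0$, where $Z\sim N(0,1)$ is independent of $X$. We have

$$\frac{d}{dt}h(X^t)=I(X^t)-1,\qquad\text{and hence}\qquad h(Z)-h(X)=\int_0^\infty\big[I(X^t)-1\big]\,dt.\qquad (3)$$

Proof. Let $f_t$ be the density of $X^t$. Then

$$\frac{\partial f_t}{\partial t}=L^*f_t,$$

where $(L^*\psi)(x)=\psi''(x)+\frac{d}{dx}\big[x\psi(x)\big]$. It is easy to check that $\int L^*f_t(x)\,dx=0$. Hence, integrating by parts,

$$\frac{d}{dt}h(X^t)=-\int(L^*f_t)\log f_t\,dx-\int L^*f_t\,dx=\int\frac{(f_t')^2}{f_t}\,dx+\int x f_t'\,dx=I(X^t)-1.$$

Integrating from $0$ to $\infty$ gives (3).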
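The identity can be checked numerically in the Gaussian case, where everything is explicit. The sketch below assumes the Ornstein-Uhlenbeck parametrization $X^t=e^{-t}X+\sqrt{1-e^{-2t}}\,Z$ with $X\sim N(0,s^2)$, so that $X^t$ is Gaussian with variance $e^{-2t}s^2+1-e^{-2t}$. The helper names, the value of $s^2$, and the truncation horizon are illustrative choices, not part of the lecture.

```python
import math

def variance_t(s2, t):
    """Variance of X^t = e^{-t} X + sqrt(1 - e^{-2t}) Z when X ~ N(0, s2)."""
    return math.exp(-2.0 * t) * s2 + 1.0 - math.exp(-2.0 * t)

def entropy(v):
    """Differential entropy (in nats) of N(0, v)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * v)

def fisher(v):
    """Fisher information of N(0, v)."""
    return 1.0 / v

def integrated_fisher_gap(s2, horizon=20.0, steps=20_000):
    """Trapezoidal approximation of the integral of I(X^t) - 1 over [0, horizon]."""
    dt = horizon / steps
    values = [fisher(variance_t(s2, k * dt)) - 1.0 for k in range(steps + 1)]
    return dt * (sum(values) - 0.5 * (values[0] + values[-1]))

s2 = 0.3            # illustrative variance of X
t, eps = 0.7, 1e-6  # time point and step for a central finite difference

# Differential form: d/dt h(X^t) versus I(X^t) - 1.
derivative = (entropy(variance_t(s2, t + eps)) - entropy(variance_t(s2, t - eps))) / (2 * eps)
pointwise_rhs = fisher(variance_t(s2, t)) - 1.0

# Integrated form: h(Z) - h(X) versus the time integral of I(X^t) - 1.
entropy_gap = entropy(1.0) - entropy(s2)
integral = integrated_fisher_gap(s2)

print(derivative, pointwise_rhs)
print(entropy_gap, integral)
```

Both pairs of printed numbers should agree up to discretization error. The integral is truncated at a finite horizon, which is harmless here because the integrand decays like $e^{-2t}$.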
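Proof of Theorem EPI. In (1), let $\mathcal{G}$ be the set of all subsets of $[n]$ of size $n-1$ and $\beta_s=\frac{1}{n-1}$. We have

$$I\Big(\sum_{i=1}^n X_i\Big)\le(n-1)\sum_{i=1}^n w_i^2\,I\Big(\sum_{j\ne i}X_j\Big),\qquad (4)$$

where $w_i$ is the weight corresponding to the set $[n]\setminus\{i\}$. Applying (4) to the variables $a_iX_i$, it is easy to check that (4) is equivalent to

$$I\Big(\sum_{i=1}^n a_iX_i\Big)\le(n-1)\sum_{i=1}^n w_i^2\,I\Big(\sum_{j\ne i}a_jX_j\Big),\qquad (5)$$

or, using the scaling identity $I(cX)=I(X)/c^2$, to

$$I\Big(\sum_{i=1}^n a_iX_i\Big)\le(n-1)\sum_{i=1}^n\frac{w_i^2}{1-a_i^2}\,I\Big(\frac{1}{\sqrt{1-a_i^2}}\sum_{j\ne i}a_jX_j\Big)\qquad (6)$$

for any choice of $(a_1,\ldots,a_n)$ with $\sum_i a_i^2=1$ (and all $a_i\ne 0$) and $(w_1,\ldots,w_n)$ with $\sum_i w_i=1$.

Write $T=\sum_i a_iX_i$ and $T_{-i}=\frac{1}{\sqrt{1-a_i^2}}\sum_{j\ne i}a_jX_j$. Let $X_i^t=e^{-t}X_i+\sqrt{1-e^{-2t}}\,Z_i$, where $Z_1,\ldots,Z_n$ are i.i.d. $N(0,1)$ and independent of all else. Then

$$\sum_{i=1}^n a_iX_i^t=e^{-t}\sum_{i=1}^n a_iX_i+\sqrt{1-e^{-2t}}\sum_{i=1}^n a_iZ_i\stackrel{d}{=}T^t,$$

where the equality in distribution is due to the fact that $\sum_i a_i^2=1$, so that $\sum_i a_iZ_i\sim N(0,1)$. In the same way, $\frac{1}{\sqrt{1-a_i^2}}\sum_{j\ne i}a_jX_j^t\stackrel{d}{=}T_{-i}^t$. Hence, by (6) applied to $X_1^t,\ldots,X_n^t$ with the weights $w_i=\frac{1-a_i^2}{n-1}$ (which sum to $1$),

$$I(T^t)-1\le\sum_{i=1}^n\frac{1-a_i^2}{n-1}\big[I(T_{-i}^t)-1\big].$$

We use the de Bruijn identity to integrate this over time from $0$ to $\infty$ and get

$$h(Z)-h(T)\le\sum_{i=1}^n\frac{1-a_i^2}{n-1}\big[h(Z)-h(T_{-i})\big].$$

Since the coefficients $\frac{1-a_i^2}{n-1}$ sum to $1$, this is the same as

$$h\Big(\sum_{i=1}^n a_iX_i\Big)\ge\sum_{i=1}^n\frac{1-a_i^2}{n-1}\,h\Big(\frac{1}{\sqrt{1-a_i^2}}\sum_{j\ne i}a_jX_j\Big).\qquad (7)$$

This is the entropy analog of the Fisher information inequality obtained above.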
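As a final step, let $N_i=e^{2h(\sum_{j\ne i}X_j)}$. We are to show

$$e^{2h(\sum_{i=1}^n X_i)}\ge\frac{1}{n-1}\sum_{i=1}^n N_i.$$

If $N_k\ge\frac{1}{n-1}\sum_{i=1}^n N_i$ for some $k$, the result is immediate due to the general fact $h(X+Y)\ge h(X)$ for independent $X$ and $Y$ (convolution increases entropy). Hence, we assume that $N_k<\frac{1}{n-1}\sum_{i=1}^n N_i$ for every $k$, so

$$\lambda_k:=\frac{N_k}{\sum_{i=1}^n N_i}<\frac{1}{n-1}\qquad\text{for every }k.$$

Set $a_i=\sqrt{1-(n-1)\lambda_i}$, $i=1,\ldots,n$. Note that

$$\sum_{i=1}^n a_i^2=n-(n-1)\sum_{i=1}^n\lambda_i=1.$$

With this choice of $a_1,\ldots,a_n$, we apply (7) to the random variables $X_i/a_i$. After some computations (using $1-a_i^2=(n-1)\lambda_i$ and $h(cX)=h(X)+\log|c|$), we get

$$h\Big(\sum_{i=1}^n X_i\Big)\ge\sum_{i=1}^n\lambda_i\,h\Big(\frac{1}{\sqrt{(n-1)\lambda_i}}\sum_{j\ne i}X_j\Big)=\frac12\sum_{i=1}^n\lambda_i\log\frac{N_i}{(n-1)\lambda_i}.$$

Using the definition of the $\lambda_i$, we have $N_i/\lambda_i=\sum_k N_k$ for every $i$, and the final result follows from here immediately.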
Proof of the Entropic Central Limit Theorem
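Entropic CLT. Suppose that $X_1,X_2,\ldots$ are i.i.d. random variables with $\mathbb{E}X_1=0$ and $\mathrm{Var}(X_1)=1$. Let $S_n=\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i$. If $h(S_n)>-\infty$ for some $n$, then $h(S_n)\to h(N(0,1))$ as $n\to\infty$. Equivalently, if $D(S_n)<\infty$ for some $n$, then $D(S_n)\to 0$ as $n\to\infty$.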
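In the case of i.i.d. random variables, Theorem EPI gives

$$h(S_n)\ge h(S_{n-1}),$$

which is equivalent to $D(S_n)\le D(S_{n-1})$ because of the identity $D(X)=h(N(0,1))-h(X)$, valid for $X$ with mean $0$ and variance $1$ (here $D(X)$ is the relative entropy of the law of $X$ with respect to $N(0,1)$). Therefore, it remains to show $D(S_n)\to 0$ as $n\to\infty$. For this, we need some analytical properties of the relative entropy. Henceforth, we use $\mathcal{P}(\mathcal{X})$ to denote the set of all probability measures on some Polish space $\mathcal{X}$ (take $\mathcal{X}=\mathbb{R}$ for instance).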
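Proposition (Variational characterization of $D$). Let $\mu,\nu\in\mathcal{P}(\mathcal{X})$. Then

$$D(\mu\|\nu)=\sup_{f\in B(\mathcal{X})}\Big\{\int f\,d\mu-\log\int e^f\,d\nu\Big\},\qquad (8)$$

where $B(\mathcal{X})$ denotes the set of all bounded measurable functions $f:\mathcal{X}\to\mathbb{R}$.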
- In the variational characterization of $D$ above, it is enough to take the supremum over the set $C_b(\mathcal{X})$ of all bounded continuous functions $f:\mathcal{X}\to\mathbb{R}$.
- For fixed $f\in C_b(\mathcal{X})$, the mapping $(\mu,\nu)\mapsto\int f\,d\mu-\log\int e^f\,d\nu$ is convex and continuous. Since $D(\mu\|\nu)$ is the supremum over a class of convex, continuous functions, it is convex and lower semicontinuous. These properties of relative entropy, made transparent by the variational characterization, are very useful in many different contexts.
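On a finite space the variational characterization can be checked directly. The sketch below assumes the formula $D(\mu\|\nu)=\sup_f\{\int f\,d\mu-\log\int e^f\,d\nu\}$, evaluates the functional at the candidate maximizer $f=\log\frac{d\mu}{d\nu}$, and compares it with randomly drawn bounded functions. The two distributions, the sampling range, and the function names are illustrative assumptions.

```python
import math
import random

def relative_entropy(mu, nu):
    """D(mu || nu) on a finite set, assuming nu[i] > 0 wherever mu[i] > 0."""
    return sum(m * math.log(m / n) for m, n in zip(mu, nu) if m > 0)

def dv_objective(f, mu, nu):
    """Value of  integral f dmu - log integral e^f dnu  for f given as a list of values."""
    mean_f = sum(fi * m for fi, m in zip(f, mu))
    log_mgf = math.log(sum(math.exp(fi) * n for fi, n in zip(f, nu)))
    return mean_f - log_mgf

mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.2, 0.6]
d = relative_entropy(mu, nu)

# Candidate maximizer f = log(dmu/dnu); it is bounded here because both
# distributions have full support.
f_star = [math.log(m / n) for m, n in zip(mu, nu)]
at_optimum = dv_objective(f_star, mu, nu)

# Random bounded test functions should never exceed D(mu || nu).
rng = random.Random(0)
best_random = max(
    dv_objective([rng.uniform(-5.0, 5.0) for _ in mu], mu, nu)
    for _ in range(10_000)
)

print(d, at_optimum, best_random)
```

The value at $f=\log\frac{d\mu}{d\nu}$ matches $D(\mu\|\nu)$ up to rounding, and no sampled function does better. On a general space this maximizer may be unbounded, which is why the proof needs a truncation argument.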
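Corollary. Sublevel sets of $D(\cdot\|\nu)$ are compact, that is, the set $\{\mu\in\mathcal{P}(\mathcal{X}):D(\mu\|\nu)\le M\}$ is compact (with respect to the topology of weak convergence) for every $M<\infty$ and $\nu\in\mathcal{P}(\mathcal{X})$.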
Before we prove the corollary, let us recall the definition of tightness and Prohorov's theorem.
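Definition (Tightness). A set $A\subseteq\mathcal{P}(\mathcal{X})$ is called tight if for every $\varepsilon>0$ there exists a compact set $K\subseteq\mathcal{X}$ such that $\mu(K^c)\le\varepsilon$ for every $\mu\in A$.
Prohorov Theorem. A set $A\subseteq\mathcal{P}(\mathcal{X})$ is weakly precompact if and only if it is tight.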
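Proof of Corollary. Let $(\mu_n)$ be a sequence in $\mathcal{P}(\mathcal{X})$ with $D(\mu_n\|\nu)\le M$ for every $n$. Note that $\{\nu\}$ is a tight set as a singleton. We claim that the sequence $(\mu_n)$ is also tight. Indeed, let $\varepsilon>0$ and let $K\subseteq\mathcal{X}$ be a compact set with $\nu(K^c)\le\varepsilon$. We take $f=\log\big(1+\frac{1}{\varepsilon}\big)1_{K^c}$ and apply (8) to get

$$\mu_n(K^c)\le\frac{1}{\log(1+\frac{1}{\varepsilon})}\Big(D(\mu_n\|\nu)+\log\Big(1+\frac{\nu(K^c)}{\varepsilon}\Big)\Big)\le\frac{M+\log 2}{\log(1+\frac{1}{\varepsilon})},$$

where the rightmost term can be made arbitrarily small by taking $\varepsilon$ small. Hence, $(\mu_n)$ is tight. By Prohorov's theorem, there exist $\mu\in\mathcal{P}(\mathcal{X})$ and a subsequence $(\mu_{n_k})$ such that $\mu_{n_k}\to\mu$ weakly as $k\to\infty$. By lower semicontinuity of $D$, we have

$$D(\mu\|\nu)\le\liminf_{k\to\infty}D(\mu_{n_k}\|\nu)\le M.$$

This finishes the proof of the corollary.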
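Proof of Variational Characterization of $D$. As a first case, suppose $\mu\not\ll\nu$. So $D(\mu\|\nu)=\infty$ and there exists a Borel set $A\subseteq\mathcal{X}$ with $\nu(A)=0$ and $\mu(A)>0$. Let $f_n=n1_A\in B(\mathcal{X})$. We have

$$\int f_n\,d\mu-\log\int e^{f_n}\,d\nu=n\mu(A)-\log\big(\nu(A^c)+e^n\nu(A)\big)=n\mu(A)\to\infty$$

as $n\to\infty$. Hence, both sides of the variational characterization are equal to $+\infty$.

For the rest, we assume $\mu\ll\nu$. First, we show the $\ge$ part. If $D(\mu\|\nu)=\infty$, the inequality is obvious. Suppose $D(\mu\|\nu)<\infty$. Given $f\in B(\mathcal{X})$, define a probability measure $\nu^f$ by

$$d\nu^f=\frac{e^f}{\int e^f\,d\nu}\,d\nu.$$

Then

$$D(\mu\|\nu)=\int\log\frac{d\mu}{d\nu^f}\,d\mu+\int\log\frac{d\nu^f}{d\nu}\,d\mu=D(\mu\|\nu^f)+\int f\,d\mu-\log\int e^f\,d\nu\ge\int f\,d\mu-\log\int e^f\,d\nu.$$

Since $f$ is chosen arbitrarily, taking the supremum on the right hand side gives the $\ge$ part.

Next, we prove the $\le$ part. Note that if $f=\log\frac{d\mu}{d\nu}$, then $\int f\,d\mu-\log\int e^f\,d\nu=D(\mu\|\nu)$. However, this choice of $f$ may not be in $B(\mathcal{X})$, that is, $\frac{d\mu}{d\nu}$ may fail to be bounded or bounded away from zero. So we employ the following truncation argument. Let

$$f_n=\min\{\max\{f,-n\},n\},$$

so that $f_n\to f$ as $n\to\infty$. Note that $e^{f_n}1_{\{f\ge 0\}}\uparrow e^f1_{\{f\ge 0\}}$ and $e^{f_n}1_{\{f<0\}}\downarrow e^f1_{\{f<0\}}$. Thus we have

$$\lim_{n\to\infty}\int e^{f_n}\,d\nu=\int e^f\,d\nu=1$$

by monotone convergence. On the other hand, since $f_n\frac{d\mu}{d\nu}$ is bounded below by $-1/e$, Fatou's Lemma gives

$$\liminf_{n\to\infty}\int f_n\,d\mu\ge\int f\,d\mu=D(\mu\|\nu),$$

from which the $\le$ part of the result follows.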
Building on these now standard facts (whose exposition above follows that in the book of Dupuis and Ellis), Harremoes and Vignat (2005) gave a short proof of the desired convergence, which we will follow below. It relies on the fact that for uniformly bounded densities within the appropriate moment class, pointwise convergence implies convergence of entropies.
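Lemma. If $Y_n$ are random variables with $\mathbb{E}Y_n=0$, $\mathbb{E}Y_n^2=1$, and the corresponding densities $f_n$ are uniformly bounded with $f_n\to f$ as $n\to\infty$ (pointwise) for some density $f$, then $h(f_n)\to h(f)$ and $D(f_n)\to D(f)$ as $n\to\infty$.
Proof. Recall $D(Y)=h(N(0,1))-h(Y)$ for $Y$ with mean $0$ and variance $1$. By lower semicontinuity of $D$, we have

$$\limsup_{n\to\infty}h(f_n)\le h(f).$$

On the other hand, letting $c=\sup_{n,x}f_n(x)$, we have

$$h(f_n)=-\int f_n\log\frac{f_n}{c}\,dx-\log c,\qquad\text{with}\quad -f_n\log\frac{f_n}{c}\ge 0,$$

and using Fatou's Lemma,

$$\liminf_{n\to\infty}h(f_n)\ge-\int f\log\frac{f}{c}\,dx-\log c=h(f).$$

Hence, $h(f_n)\to h(f)$ as $n\to\infty$.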
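End of proof of Entropic CLT. Assume $D(S_{n_0})<\infty$ for some $n_0$. We will use $J(X)=I(X)-1$ to denote normalized Fisher information. For any $t>0$, we have that $J(S_n^t)\le J(S_{n-1}^t)$ for $n\ge 2$. So $J(S_n^t)$ decreases to a limit as $n\to\infty$ for every $t>0$. We want to show that this limit is $0$, since then we will get by Lebesgue's dominated convergence theorem (with dominating function $J(S_{n_0}^t)$, which is integrable because $D(S_{n_0})<\infty$) that

$$D(S_n)=\int_0^\infty J(S_n^t)\,dt\to 0$$

as $n\to\infty$. But since

$$D(S_n^t)=\int_t^\infty J(S_n^u)\,du,$$

it is enough to show that $D(S_n^t)\to 0$ for each $t>0$.

By the monotonicity property we have proved, we know that

$$D(S_n^t)\le D(S_{n_0}^t)\le D(S_{n_0})<\infty$$

for any $n\ge n_0$. By compactness of sublevel sets of $D$, the sequence $(S_n^t)$ must therefore have a subsequence $(S_{n_k}^t)$ whose distribution converges to a probability measure (let us call $Y$ a random variable with this limiting measure as its distribution). For $t>0$, the smoothing caused by Gaussian convolution implies that the density of $S_{n_k}^t$ converges pointwise to that of $Y$, and also that the density of $S_{2n_k}^t$ converges pointwise to that of $\frac{Y+Y'}{\sqrt{2}}$, where $Y'$ is an independent copy of $Y$. By the previous lemma

$$h(S_{n_k}^t)\to h(Y)$$

as $k\to\infty$, and

$$h(S_{2n_k}^t)\to h\Big(\frac{Y+Y'}{\sqrt{2}}\Big),$$

so that necessarily (because $h(S_n^t)$ is monotone in $n$, both subsequences have the same limit)

$$h\Big(\frac{Y+Y'}{\sqrt{2}}\Big)=h(Y).$$

By the equality condition in the entropy power inequality, this can only happen if $Y$ is Gaussian, which in turn implies that $D(S_n^t)\to 0$ as $n\to\infty$.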
Lecture by Mokshay Madiman | Scribed by Cagin Ararat