Lecture 6. Entropic CLT (3)
In this lecture, we complete the proof of monotonicity of the Fisher information in the CLT, and begin developing the connection with entropy. The entropic CLT will be completed in the next lecture.
Variance drop inequality
In the previous lecture, we proved the following decomposition result for functions of independent random variables due to Hoeffding.
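ANOVA decomposition. Let $X_1,\ldots,X_n$ be independent random variables taking values in $\mathbb{R}$, and let $\Psi:\mathbb{R}^n\to\mathbb{R}$ be such that $E[\Psi(X_1,\ldots,X_n)^2]<\infty$. Then $\Psi$ satisfies
$$\Psi(X_1,\ldots,X_n)=\sum_{t\subseteq\{1,\ldots,n\}}\bar E_t[\Psi],\qquad \bar E_t[\Psi]:=\Big(\prod_{i\notin t}E_i\prod_{j\in t}(I-E_j)\Big)\Psi,$$
where $E_i\Psi:=E[\Psi\mid X_1,\ldots,X_{i-1},X_{i+1},\ldots,X_n]$. Furthermore, $E[\bar E_t\Psi\cdot\bar E_s\Psi]=0$ whenever $t\neq s$.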
Note that $\bar E_t[\Psi]$ is a function only of $X_t:=(X_{i_1},\ldots,X_{i_m})$ (where $i_1<\cdots<i_m$ are the ordered elements of $t$).
In the previous lecture, we proved superadditivity of the inverse Fisher information $I^{-1}$. The key part of the proof was the observation that the score function of the sum could be written as the conditional expectation of a sum of independent random variables, whose variance is trivially computed. This does not suffice, however, to prove monotonicity in the CLT. To do the latter, we need a more refined bound on the Fisher information in terms of overlapping subsets of indices. Following the same proof, the score function of the sum can be written as the conditional expectation of a sum of terms that are now no longer independent. To estimate the variance of this sum, we will use the following “variance drop lemma” whose proof relies on the ANOVA decomposition.
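Lemma. Let $U(X_1,\ldots,X_n)=\sum_{s\in\mathcal{G}}\Psi_s(X_s)$ where $\Psi_s:\mathbb{R}^{|s|}\to\mathbb{R}$ and $\mathcal{G}$ is some collection of subsets of $[n]:=\{1,\ldots,n\}$. If $X_1,\ldots,X_n$ are independent random variables with $E[\Psi_s(X_s)^2]<\infty$ for every $s\in\mathcal{G}$, then
$$\mathrm{Var}(U(X_1,\ldots,X_n))\le\sum_{s\in\mathcal{G}}\frac{1}{\beta_s}\,\mathrm{Var}(\Psi_s(X_s)),$$
where $\beta=\{\beta_s:s\in\mathcal{G}\}$ is a fractional packing with respect to $\mathcal{G}$.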
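- Recall that a fractional packing is a function $\beta:\mathcal{G}\to[0,1]$ such that $\sum_{s\in\mathcal{G}:\,s\ni i}\beta_s\le 1$ for every $i\in\{1,\ldots,n\}$.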
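Example 1. Let $d(i):=\#\{s\in\mathcal{G}:s\ni i\}$ be the number of sets in $\mathcal{G}$ that contain $i$, and define $d_+:=\max_i d(i)$. Taking $\beta_s=1/d_+$ for all $s\in\mathcal{G}$ always defines a fractional packing, as $\sum_{s\ni i}\beta_s=d(i)/d_+\le 1$ by definition of $d_+$.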
Example 2. If $\mathcal{G}=\{s\subseteq\{1,\ldots,n\}:|s|=m\}$, then $d_+=\binom{n-1}{m-1}$.
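- The original paper of Hoeffding (1948) proves the following special case where the $X_i$ are i.i.d., each $\Psi_s=\Psi$ is symmetric in its arguments, and $\mathcal{G}$ is as in Example 2 above: $U=\binom{n}{m}^{-1}\sum_{|s|=m}\Psi(X_s)$ (the $U$-statistic) satisfies
$$\mathrm{Var}(U)\le\frac{m}{n}\,\mathrm{Var}(\Psi(X_s)).$$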
Of course, if $m=1$ then $\mathrm{Var}(U)=\frac{1}{n}\mathrm{Var}(\Psi(X_1))$. Thus Hoeffding’s inequality for the variance of $U$-statistics above and the more general variance drop lemma should be viewed as capturing how much of a drop we get in variance of an additive-type function, when the terms are not independent but have only limited dependencies (overlaps) in their structure.
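The two examples and Hoeffding's special case can be checked exactly on a small instance. The sketch below is only an illustration under assumptions that are not in the lecture: it takes $n=4$, $m=2$, i.i.d. uniform bits, and a hypothetical symmetric kernel `psi(x, y) = x * y`, and it enumerates all outcomes with exact rational arithmetic.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb

n, m = 4, 2
# Hypergraph of Example 2: all subsets of {0, ..., n-1} of size m.
G = list(combinations(range(n), m))

# d(i) = number of sets containing i; d_plus = max_i d(i).
degree = [sum(1 for s in G if i in s) for i in range(n)]
d_plus = max(degree)

# beta_s = 1/d_plus: for each i, the beta_s over sets containing i sum to at most 1.
beta = {s: Fraction(1, d_plus) for s in G}
packing_sums = [sum(beta[s] for s in G if i in s) for i in range(n)]

def psi(x, y):
    """Hypothetical symmetric kernel, chosen only for illustration."""
    return x * y

# X_1, ..., X_n i.i.d. uniform on {0, 1}: all 2^n outcomes are equally likely.
outcomes = list(product([0, 1], repeat=n))
prob = Fraction(1, len(outcomes))

def variance(values):
    """Exact variance of a random variable listed by its value on each outcome."""
    mean = sum(values) * prob
    return sum((v - mean) ** 2 for v in values) * prob

# The U-statistic and a single kernel evaluation, outcome by outcome.
u_values = [Fraction(sum(psi(x[i], x[j]) for i, j in G), comb(n, m)) for x in outcomes]
psi_values = [Fraction(psi(x[0], x[1])) for x in outcomes]

var_u = variance(u_values)
hoeffding_bound = Fraction(m, n) * variance(psi_values)
print(d_plus, packing_sums, var_u, hoeffding_bound)
```

Exact fractions are used so that the comparison `var_u <= hoeffding_bound` is not affected by floating-point rounding.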
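Proof. We may assume without loss of generality that each $\Psi_s(X_s)$ has mean zero. Applying the ANOVA decomposition to each term (with components $\bar E_t$; note that $\bar E_t[\Psi_s]=0$ unless $t\subseteq s$) and exchanging the order of summation,
$$U=\sum_{s\in\mathcal{G}}\Psi_s(X_s)=\sum_{s\in\mathcal{G}}\sum_{t\subseteq s}\bar E_t[\Psi_s(X_s)]=\sum_{t\subseteq[n]}\ \sum_{s\in\mathcal{G}:\,s\supseteq t}\bar E_t[\Psi_s(X_s)].$$
We then have, using orthogonality of the terms in the ANOVA decomposition:
$$E[U^2]=\sum_{t\subseteq[n]}E\Big[\Big(\sum_{s\in\mathcal{G}:\,s\supseteq t}\bar E_t[\Psi_s(X_s)]\Big)^2\Big].$$
For each term, we have
$$E\Big[\Big(\sum_{s\supseteq t}\bar E_t[\Psi_s]\Big)^2\Big]\le\Big[\sum_{s\supseteq t}\beta_s\Big]\Big[\sum_{s\supseteq t}\frac{E[(\bar E_t[\Psi_s])^2]}{\beta_s}\Big]\le\sum_{s\supseteq t}\frac{E[(\bar E_t[\Psi_s])^2]}{\beta_s},$$
where the first inequality is Cauchy-Schwarz, and the second inequality follows from the definition of fractional packing if $t$ is non-empty, and from the fact that $\bar E_\emptyset$ takes any $\Psi_s$ to its mean (which is zero) if $t=\emptyset$. Hence
$$E[U^2]\le\sum_{t\subseteq[n]}\ \sum_{s\in\mathcal{G}:\,s\supseteq t}\frac{E[(\bar E_t[\Psi_s])^2]}{\beta_s}=\sum_{s\in\mathcal{G}}\frac{1}{\beta_s}\sum_{t\subseteq s}E[(\bar E_t[\Psi_s])^2]=\sum_{s\in\mathcal{G}}\frac{1}{\beta_s}E[\Psi_s(X_s)^2],$$
again using orthogonality of the $\bar E_t[\Psi_s]$ in the last step.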
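As a sanity check on the lemma beyond the symmetric case, the following sketch verifies the inequality $\mathrm{Var}(U)\le\sum_s\mathrm{Var}(\Psi_s(X_s))/\beta_s$ exactly for overlapping index sets and a non-uniform fractional packing. Everything specific here (three biased bits, the index sets, the functions $\Psi_s$, and the weights $\beta_s$) is a hypothetical choice made for illustration only.

```python
from fractions import Fraction
from itertools import product

# Hypothetical example: three independent biased bits X_0, X_1, X_2.
p = [Fraction(1, 3), Fraction(1, 2), Fraction(3, 4)]  # P(X_i = 1)

# Overlapping index sets s and arbitrary functions Psi_s of the coordinates in s.
terms = {
    (0,): lambda x0: 3 * x0,
    (0, 1): lambda x0, x1: x0 - 2 * x0 * x1,
    (1, 2): lambda x1, x2: x1 + x2 * x2 + x1 * x2,
}

# A non-uniform fractional packing: for each i, the beta_s with i in s sum to <= 1.
beta = {(0,): Fraction(7, 10), (0, 1): Fraction(3, 10), (1, 2): Fraction(7, 10)}

def weight(x):
    """Probability of the outcome x under the product measure."""
    w = Fraction(1)
    for xi, pi in zip(x, p):
        w *= pi if xi == 1 else 1 - pi
    return w

def variance(fn):
    """Exact variance of fn(X), computed by enumerating all outcomes."""
    outcomes = list(product([0, 1], repeat=len(p)))
    mean = sum(weight(x) * fn(x) for x in outcomes)
    return sum(weight(x) * (fn(x) - mean) ** 2 for x in outcomes)

def restrict(s, psi):
    """View Psi_s as a function of the full vector x."""
    return lambda x: psi(*(x[i] for i in s))

full_terms = {s: restrict(s, psi) for s, psi in terms.items()}

var_u = variance(lambda x: sum(f(x) for f in full_terms.values()))
bound = sum(variance(full_terms[s]) / beta[s] for s in terms)
packing_ok = all(sum(b for s, b in beta.items() if i in s) <= 1 for i in range(len(p)))
print(packing_ok, var_u, bound)
```

Note that the weights $\beta_s$ here sum to more than 1 overall; only the sums over sets sharing a common index are constrained, which is exactly what the proof uses.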
Monotonicity of Fisher information
We can now finally prove monotonicity of the Fisher information.
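Corollary. Let $X_1,\ldots,X_n$ be independent random variables with $I(X_i)<\infty$. Then
$$I(X_1+\cdots+X_n)\le\sum_{s\in\mathcal{G}}\frac{w_s^2}{\beta_s}\,I\Big(\sum_{i\in s}X_i\Big)$$
for any hypergraph $\mathcal{G}$ on $[n]=\{1,\ldots,n\}$, fractional packing $\beta$, and positive weights $\{w_s:s\in\mathcal{G}\}$ summing to 1.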
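Proof. Recall that $I(X)=\mathrm{Var}(\rho_X(X))=E[\rho_X(X)^2]$ and $\rho_X(x)=f_X'(x)/f_X(x)$. The identity proved in the last lecture states
$$\rho_{X+Y}(x)=E[\rho_X(X)\mid X+Y=x]$$
for independent $X$ and $Y$. With $T_s=\sum_{i\in s}X_i$, we can write
$$\rho_{T_{[n]}}(x)=E[\rho_{T_s}(T_s)\mid T_{[n]}=x]\qquad\text{for every }s\in\mathcal{G},$$
since $T_{[n]}=T_s+T_{s^c}$. By taking a convex combination of these identities,
$$\rho_{T_{[n]}}(x)=E\Big[\sum_{s\in\mathcal{G}}w_s\,\rho_{T_s}(T_s)\,\Big|\,T_{[n]}=x\Big].$$
Now by using the Pythagorean inequality (or Jensen’s inequality) and the variance drop lemma, we have
$$I(T_{[n]})=E[\rho_{T_{[n]}}(T_{[n]})^2]\le E\Big[\Big(\sum_{s\in\mathcal{G}}w_s\,\rho_{T_s}(T_s)\Big)^2\Big]\le\sum_{s\in\mathcal{G}}\frac{w_s^2}{\beta_s}E[\rho_{T_s}(T_s)^2]=\sum_{s\in\mathcal{G}}\frac{w_s^2}{\beta_s}I(T_s).$$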
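Remark. The $w_s$ being arbitrary weights, we can optimize over them (the optimal choice is $w_s$ proportional to $\beta_s/I(T_s)$, where $T_s=\sum_{i\in s}X_i$). This gives
$$\frac{1}{I(T_{[n]})}\ge\sum_{s\in\mathcal{G}}\frac{\beta_s}{I(T_s)}.$$
With $\mathcal{G}$ being all singletons and $\beta_s=1$ we recover the superadditivity property of $I^{-1}$. With $\mathcal{G}$ being all sets of size $n-1$ and $\beta_s=\frac{1}{n-1}$, we get
$$\frac{1}{I(X_1+\cdots+X_n)}\ge\frac{1}{n-1}\sum_{i=1}^n\frac{1}{I\big(\sum_{j\neq i}X_j\big)}.$$
In particular, if the $X_i$ are i.i.d., then every leave-one-out sum has the law of $X_1+\cdots+X_{n-1}$, so $I(X_1+\cdots+X_n)\le\frac{n-1}{n}I(X_1+\cdots+X_{n-1})$; by the scaling property $I(aX)=a^{-2}I(X)$ this reads $I(S_n)\le I(S_{n-1})$ for the normalized sums $S_n=(X_1+\cdots+X_n)/\sqrt{n}$.
Thus we have proved the monotonicity of Fisher information in the central limit theorem.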
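The optimization over the weights can be illustrated in the one case where everything is explicit: independent Gaussians, for which the Fisher information of a sum is the reciprocal of the total variance. The sketch below assumes hypothetical variances `sigma2` and the leave-one-out hypergraph with $\beta_s=1/(n-1)$, and checks that weights proportional to $\beta_s/I(T_s)$ attain the closed-form bound, which for Gaussians holds with equality.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical variances of independent Gaussians X_i ~ N(0, sigma2[i]).
sigma2 = [Fraction(1), Fraction(2), Fraction(1, 2), Fraction(3)]
n = len(sigma2)

def fisher_of_sum(s):
    """I(sum_{i in s} X_i) = 1 / (sum of variances) for independent Gaussians."""
    return 1 / sum(sigma2[i] for i in s)

# Leave-one-out hypergraph with the packing beta_s = 1/(n-1).
G = list(combinations(range(n), n - 1))
beta = {s: Fraction(1, n - 1) for s in G}

def upper_bound(w):
    """Right-hand side sum_s w_s^2 I(T_s) / beta_s of the corollary."""
    return sum(w[s] ** 2 * fisher_of_sum(s) / beta[s] for s in G)

# Candidate optimal weights: w_s proportional to beta_s / I(T_s).
raw = {s: beta[s] / fisher_of_sum(s) for s in G}
total = sum(raw.values())
w_opt = {s: raw[s] / total for s in G}
w_uniform = {s: Fraction(1, len(G)) for s in G}

best = upper_bound(w_opt)
closed_form = 1 / sum(beta[s] / fisher_of_sum(s) for s in G)
lhs = fisher_of_sum(range(n))  # I(X_1 + ... + X_n)
print(lhs, best, upper_bound(w_uniform))
```

For non-Gaussian summands the Fisher informations are not available in closed form, so this only confirms the algebra of the weight optimization and the equality case.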
From Fisher information to entropy
Having proved monotonicity for the CLT written in terms of Fisher information, we now want to show the analogous statement for entropy. The key tool here is the de Bruijn identity.
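To formulate this identity, let us introduce some basic quantities. Let $X\sim f$ on $\mathbb{R}$, and define
$$X^t=e^{-t}X+\sqrt{1-e^{-2t}}\,Z,\qquad t\ge 0,$$
where $Z\sim N(0,1)$ is independent of $X$. Denote by $f_t$ the density of $X^t$. The following facts are readily verified for $t>0$: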
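- $f_t>0$.
- $f_t(\cdot)$ is smooth.
- $I(f_t)<\infty$.
- $\dfrac{\partial f_t(x)}{\partial t}=f_t''(x)+\dfrac{d}{dx}[x f_t(x)]=:(L^*f_t)(x)$.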
Observe that $X^0=X$ has density $f_0=f$, and that as $t\to\infty$, $X^t$ converges to $Z$, which has a standard Gaussian distribution. Thus $f_t$ provides an interpolation between the density $f$ and the normal density.
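The interpolation can be made concrete in the Gaussian case, where $f_t$ is explicit. The sketch below assumes a hypothetical initial law $X\sim N(0,\sigma^2)$ with $\sigma^2=4$, so that $X^t\sim N(0,\,e^{-2t}\sigma^2+1-e^{-2t})$, and uses central finite differences to check numerically that $f_t$ satisfies the evolution equation $\partial_t f_t=f_t''+(x f_t)'$ associated with this interpolation.

```python
import math

SIGMA2 = 4.0  # hypothetical variance of the initial law X ~ N(0, SIGMA2)

def var_t(t):
    """Variance of X^t = e^{-t} X + sqrt(1 - e^{-2t}) Z for independent Z ~ N(0, 1)."""
    return math.exp(-2 * t) * SIGMA2 + 1.0 - math.exp(-2 * t)

def f(t, x):
    """Density f_t of X^t, which is N(0, var_t(t)) in this Gaussian example."""
    v = var_t(t)
    return math.exp(-x * x / (2 * v)) / math.sqrt(2 * math.pi * v)

def pde_residual(t, x, h=1e-3):
    """Central-difference estimate of d/dt f_t - (f_t'' + (x f_t)')."""
    dt = (f(t + h, x) - f(t - h, x)) / (2 * h)
    dxx = (f(t, x + h) - 2 * f(t, x) + f(t, x - h)) / (h * h)
    d_xf = ((x + h) * f(t, x + h) - (x - h) * f(t, x - h)) / (2 * h)
    return dt - (dxx + d_xf)

residuals = [abs(pde_residual(t, x)) for t in (0.1, 0.5, 2.0) for x in (-2.0, 0.0, 0.7, 3.0)]
print(var_t(0.0), var_t(10.0), max(residuals))
```

The variance starts at $\sigma^2$ and relaxes to 1, and the residuals are small up to discretization error; this is a numerical consistency check, not a proof.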
Remark. Let us recall some standard facts from the theory of diffusions. The Ornstein-Uhlenbeck process $X(t)$ is defined by the stochastic differential equation
$$dX(t)=-X(t)\,dt+\sqrt{2}\,dB(t),$$
where $B(t)$ is Brownian motion. This is, like Brownian motion, a Markov process, but the drift term (which always pushes trajectories towards 0) ensures that it has a stationary distribution, unlike Brownian motion. The Markov semigroup associated to this Markov process, namely the semigroup of operators defined on an appropriate domain by
$$P_t\varphi(x)=E[\varphi(X(t))\mid X(0)=x],$$
has a generator $L$ (defined via $P_t=e^{tL}$) given by $L\varphi(x)=\varphi''(x)-x\varphi'(x)$. The semigroup generated by $L$ governs the evolution of conditional expectations of functions of the process $X(t)$, while the adjoint semigroup generated by $L^*$ governs the evolution of the marginal density of $X(t)$. The above expression for $\partial f_t/\partial t$, namely $\partial_t f_t=L^*f_t$ with $L^*f=f''+(xf)'$, follows from this remark by noting that $X(t)$ and $X^t$ are the same in distribution when $X(0)=X$; however, it can also be deduced more simply just by writing down the density of $X^t$ explicitly, and using the smoothness of the Gaussian density to verify each part of the claim.
We can now formulate the key identity.
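de Bruijn identity. Let $g$ be the density of the standard normal $N(0,1)$, let $D(\cdot\|\cdot)$ denote relative entropy, and take $X$ to have mean zero and unit variance.
- Differential form:
$$\frac{d}{dt}D(f_t\|g)=-J(f_t),$$
where $J(f)=\mathrm{Var}(f)\,I(f)-1$ is the normalized Fisher information (so that $J(f_t)=I(f_t)-1$, as $X^t$ has unit variance).
- Integral form:
$$D(f\|g)=\int_0^\infty J(f_t)\,dt.$$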
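The differential form follows by using the last part of the claim (the evolution equation $\partial_t f_t=f_t''+(xf_t)'$) together with integration by parts. The integral form follows from the differential form by the fundamental theorem of calculus, since
$$D(f\|g)-D(f_\infty\|g)=-\int_0^\infty\frac{d}{dt}D(f_t\|g)\,dt=\int_0^\infty J(f_t)\,dt,$$
which yields the desired identity since $f_\infty=g$ and hence $D(f_\infty\|g)=D(g\|g)=0$.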
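The differential form can be checked numerically on a non-Gaussian example. The sketch below is an illustration under stated assumptions, not part of the lecture: $X$ is a hypothetical symmetric two-component Gaussian mixture with mean zero and unit variance (parameter `A`), for which $f_t$ is again an explicit mixture. It compares a finite-difference estimate of $\frac{d}{dt}D(f_t\|g)$ with $-(I(f_t)-1)$, with both integrals approximated by Riemann sums on a fixed grid.

```python
import math

A = 0.8      # hypothetical mixture parameter: X = +-A (prob 1/2 each) plus N(0, 1 - A^2) noise
H = 0.005    # grid spacing for the Riemann sums
GRID = [-10.0 + H * k for k in range(4001)]

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def density_and_slope(t, x):
    """f_t(x) and f_t'(x): X^t is an equal mixture of N(+-A e^{-t}, 1 - A^2 e^{-2t})."""
    mean = A * math.exp(-t)
    var = 1.0 - A * A * math.exp(-2 * t)
    plus, minus = normal_pdf(x, mean, var), normal_pdf(x, -mean, var)
    f = 0.5 * (plus + minus)
    df = 0.5 * (-(x - mean) * plus - (x + mean) * minus) / var
    return f, df

def relative_entropy(t):
    """D(f_t || g) for the standard normal density g."""
    total = 0.0
    for x in GRID:
        f, _ = density_and_slope(t, x)
        total += f * math.log(f / normal_pdf(x, 0.0, 1.0)) * H
    return total

def normalized_fisher(t):
    """J(f_t) = I(f_t) - 1, valid here because X^t has mean zero and unit variance."""
    total = 0.0
    for x in GRID:
        f, df = density_and_slope(t, x)
        total += df * df / f * H
    return total - 1.0

t0, eps = 0.3, 1e-4
dD_dt = (relative_entropy(t0 + eps) - relative_entropy(t0 - eps)) / (2 * eps)
J = normalized_fisher(t0)
print(dD_dt, -J)
```

The two printed numbers should agree up to discretization error. The choice of a mixture is only for convenience, since it keeps $f_t$ and $f_t'$ in closed form.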
This gives us the desired link between Fisher information and entropy. In the next lecture, we will use this to complete the proof of the entropic central limit theorem.
Lecture by Mokshay Madiman | Scribed by Georgina Hall