Lecture 6. Entropic CLT (3)

In this lecture, we complete the proof of monotonicity of the Fisher information in the CLT, and begin developing the connection with entropy. The entropic CLT will be completed in the next lecture.

Variance drop inequality

In the previous lecture, we proved the following decomposition result for functions of independent random variables due to Hoeffding.

ANOVA decomposition. Let X_1, \ldots, X_n be independent random variables taking values in \mathbb{R}, and let \Psi: \mathbb{R}^n \rightarrow \mathbb{R} be such that \mathbb{E}[\Psi(X_1,\ldots,X_n)^2]< \infty. Then \Psi satisfies

    \begin{align*} \Psi(X_1,\ldots,X_n)&= \sum_{t \subseteq \{1,\ldots,n\}} \bar{E_t}[\Psi] \\ \text{where } \bar{E_t}[\Psi]&:=\bigg(\prod_{i \notin t} E_i \prod_{j \in t} (I-E_j)\bigg)\Psi \\ \text{and } E_{i} [\Psi] &:= \mathbb{E}[\Psi | X_1,\ldots,X_{i-1},X_{i+1},\ldots,X_n]. \end{align*}

Furthermore, \mathbb{E}[\bar{E_t}\Psi \cdot \bar{E_s}\Psi]=0 \text{ if } t\neq s.

Note that \bar{E_t}\Psi is a function only of X_t=(X_{i_1}, \ldots, X_{i_m}) (i_1<\ldots < i_m are the ordered elements of t).

In the previous lecture, we proved subadditivity of the inverse Fisher information I^{-1}. The key part of the proof was the observation that the score function of the sum could be written as the conditional expectation of a sum of independent random variables, whose variance is trivially computed. This does not suffice, however, to prove monotonicity in the CLT. To do the latter, we need a more refined bound on the Fisher information in terms of overlapping subsets of indices. Following the same proof, the score function of the sum can be written as the conditional expectation of a sum of terms that are now no longer independent. To estimate the variance of this sum, we will use the following “variance drop lemma” whose proof relies on the ANOVA decomposition.

Lemma. Let U(X_1,\ldots,X_n)=\sum_{s \in \mathcal{G}} \Psi_s(x_s) where \Psi_s : \mathbb{R}^{|s|} \rightarrow \mathbb{R} and \mathcal{G} is some collection of subsets of \{1,\ldots,n\}. If X_1,\ldots, X_n are independent random variables with \mathbb{E} \Psi_s(X_s)^2 < \infty \text{ } \forall s \in \mathcal{G}, then

    \[\mathrm{Var}(U(X_1,\ldots, X_n)) \leq \sum_{s \in \mathcal{G}} \frac{1}{\beta_s}\,\mathrm{Var}(\Psi_s(X_s)) ,\]

where \{\beta_s : s \in \mathcal{G} \} is a fractional packing with respect to \mathcal{G}.


  1. Recall that a fractional packing is a function \beta:\mathcal{G} \rightarrow [0;1] such that \sum_{s \in \mathcal{G}, s  \ni i} \beta_s\leq 1 \text{ } \forall i \in [n].

    Example 1. Let d(i)=\#\{s \in \mathcal{G} : s \ni i \}, and define d_+= \max d(i). Taking \beta_s=\frac{1}{d_+} always defines a fractional packing as \sum_{s \in \mathcal{G}, i  \ni s} \frac{1}{d_+}=\frac{1}{d_+} \cdot d(i) \leq 1 by definition of d_+.

    Example 2. If \mathcal{G}=\{s\subseteq\{1,\ldots,n\}:|s|=m\}, then d_+=\binom{n-1}{m-1}.

  2. The original paper of Hoeffding (1948) proves the following special case where each \Psi_s is symmetric in its arguments and \mathcal{G} is as in Example 2 above: U=\frac{1}{\binom{n}{m}}\sum_{|s|=m}\Psi(X_s) (the U-statistic) satisfies

        \[\mathrm{Var}(U) \leq \frac{m}{n} \mathrm{Var}(\Psi(X_s)).\]

    Of course, if m=1 then \mathrm{Var}(U) = \frac{1}{n} \mathrm{Var}(\Psi(X_1)). Thus Hoeffding’s inequality for the variance of U-statistics above and the more general variance drop lemma should be viewed as capturing how much of a drop we get in variance of an additive-type function, when the terms are not independent but have only limited dependencies (overlaps) in their structure.

Proof. We may assume without loss of generality that each \Psi_s(X_s) has mean zero.

    \begin{align*}  U(X_1,\ldots,X_n)&=\sum_{s \in \mathcal{G}} \Psi_s(X_s)= \sum_{s \in \mathcal{G}} \sum_{t \subseteq s} \bar{E_t}[\Psi_s(X_s)]\qquad\text{ by ANOVA}\\ &= \sum_{t \subseteq [n]} \sum_{s \in \mathcal{G}, s \supseteq t} \bar{E_t} \Psi_s(X_s)\\  &= \sum_{t \subseteq [n]} \bar{E_t} \sum_{s \in \mathcal{G}, s \supseteq t} \Psi_{s}. \end{align*}

We then have, using orthogonality of the terms in the ANOVA decomposition:

    \begin{align*} \mathbb{E}U^2 &= \sum_{t \subseteq [n]} \mathbb{E} \bigg[\sum_{s \in \mathcal{G} , s \supseteq t} \bar{E_t}[\Psi_s(X_s)] \bigg]^2. \end{align*}

For each term, we have

    \begin{align*} \Big[\sum_{s \in \mathcal{G} , s \supseteq t} \frac{\sqrt{\beta_s}\bar{E_t}[\Psi_s(X_s)]}{\sqrt{\beta_s}}\Big]^2 &\leq \Big[\sum_{s \in \mathcal{G}, s \supseteq t} \beta_s\Big] \Big[\sum_{s \in \mathcal{G}, s \supseteq t} \frac{[\bar{E_t}[\Psi_s(X_s)]]^2}{\beta_s}\Big] \qquad\text{ by Cauchy-Schwarz}\\ &\leq \sum_{s \in \mathcal{G}, s \supseteq t}\frac{[\bar{E_t}[\Psi_s(X_s)]]^2}{\beta_s} ,  \end{align*}

where the second inequality follows from the definition of fractional packing if t is non-empty, and the fact that \bar{E}_{\varnothing} takes any \Psi_s to its mean. Hence

    \begin{align*} \mathbb{E}[U^2] &\leq \sum_{t \subseteq [n]} \mathbb{E} \Big[ \sum_{s \in \mathcal{G}, s \supseteq t}\frac{[\bar{E_t}[\Psi_s(X_s)]]^2}{\beta_s} \Big]\\ &= \sum_{s \subseteq \mathcal{G}} \frac{1}{\beta_s} \sum_{t \subseteq s} \mathbb{E} [\bar{E_t} \Psi_s]^2\\ &= \sum_{s \subseteq \mathcal{G}} \frac{1}{\beta_s} \Psi_s^2 , \end{align*}

again using orthogonality of the \bar{E_t}\Psi in the last step. \qquad\square

Monotonicity of Fisher information

We can now finally prove monotonicity of the Fisher information.

Corollary. Let X_i be independent random variables with I(X_i) <\infty. Then

    \[I(X_1+\ldots + X_n) \leq \sum_{s \in \mathcal{G}} \frac{\omega_s^2}{\beta_s} I\bigg(\sum_{i \in s} X_i\bigg)\]

for any hypergraph \mathcal{G} on [n], fractional packing \beta, and positive weights \{\omega_s :s \in \mathcal{G}\} summing to 1.

Proof. Recall that I(X)=\mathrm{Var}(\rho_X(X)) and \rho_X(x)=\frac{f_X'(x)}{f_X(x)}. The identity proved in the last lecture states

    \begin{align*} \rho_{X+Y}(x)&= \mathbb{E}[\rho_X(X)|X+Y=x]\\ \rho_{X+Y}(X+Y)&=\mathbb{E}[\rho_X(X)|X+Y] \end{align*}

With T_s=\sum_{i \in S} X_i, we can write

    \[\rho_{T_{[n]}}(T_{[n]})=\mathbb{E}[\rho_{T_s}(T_s)|T_{[n]}] \text{ }\qquad\forall s \in \mathcal{G}\]

since T_{[n]}=T_s+T_s^C. By taking a convex combination of these identities,

    \begin{align*} \rho_{T_{[n]}}(T_{[n]}) &= \sum_{s \in \mathcal{G}} \omega_s \mathbb{E}[\rho_{T_{s}}(T_s) | T_{[n]}] \\ &= \mathbb{E}\bigg[\sum_{s \in \mathcal{G}} \omega_s \rho_{T_{s}}(T_s) | T_{[n]}\bigg] . \end{align*}

Now by using the Pythagorean inequality (or Jensen’s inequality) and the variance drop lemma, we have

    \begin{align*} I(T_{[n]}) &= \mathbb{E} [\rho^2_{T_{[n]}}(T_{[n]})]\\ &\leq \mathbb{E}\bigg[\bigg(\sum_{s \in \mathcal{G}} \omega_s \rho_{T_s}(T_s)\bigg)^2\bigg]\\ &\leq \sum_{s \in \mathcal{G}} \frac{1}{\beta_s} \omega_s^2 \mathbb{E}[\rho^2_{T_s}(T_s)]\\ &= \sum_{s \in \mathcal{G}} \frac{\omega_s^2}{\beta_s} I(T_s) \end{align*}

as desired. \qquad\square

Remark. The \omega_s being arbitrary weights, we can optimize over them. This gives

    \[\frac{1}{I(T_{[n]})} \geq \sum_{s \in \mathcal{G}} \frac{\beta_s}{I(T_s)}.\]

With \mathcal{G} being all singletons and \beta_s=1 we recover the superadditivity property of I^{-1}. With \mathcal{G} being all sets of size n-1 and \beta_s=\frac{1}{\binom{n-1}{n-2}}=\frac{1}{n-1}, we get

    \[\frac{1}{I(T_{[n]})} \geq \frac{n}{n-1}\frac{1}{I(T_{n-1})} \Rightarrow I\Big(\frac{T_{[n]}}{\sqrt{n}}\Big) \leq I\Big(\frac{T_{[n-1]}}{\sqrt{n-1}}\Big) \Leftrightarrow I(S_n) \leq I(S_{n-1}).\]

Thus we have proved the monotonicity of Fisher information in the central limit theorem.

From Fisher information to entropy

Having proved monotonicity for the CLT written in terms of Fisher information, we now want to show the analogous statement for entropy. The key tool here is the de Bruijn identity.

To formulate this identity, let us introduce some basic quantities. Let X\sim f on \mathbb{R}, and define

    \[X^t = e^{-t}X+\sqrt{1-e^{-2t}}Z\]

where Z \sim \mathcal{N}(0,1). Denote by f_t the density of X^t. The following facts are readily verified for t>0:

  1. f_t>0.
  2. f_t(\cdot) is smooth.
  3. I[f_t] < \infty.
  4. \frac{\partial f_t(x)}{\partial t}= f_t^{\prime \prime}(x)+ \frac{d}{dx} [xf_t(x)] =: (Lf_t)(x).

Observe that X^0=X has density f, and that as t \rightarrow \infty , X^t converges to Z, which has a standard Gaussian distribution. Thus X^t provides an interpolation between the density f and the normal density.

Remark. Let us recall some standard facts from the theory of diffusions. The Ornstein-Uhlenbeck process X(t) is defined by the stochastic differential equation

    \[dX(t)= -X(t) dt + \sqrt{2} dB(t),\]

where B(t) is Brownian motion. This is, like Brownian motion, a Markov process, but the drift term (which always pushes trajectories towards 0) ensures that it has a stationary distribution, unlike Brownian motion. The Markov semigroup associated to this Markov process, namely the semigroup of operators defined on an appropriate domain by

    \[P_t \Psi(x)=\mathbb{E}[\Psi(X(t))|X_0=x] ,\]

has a generator A (defined via A= \lim_{t\rightarrow \infty} \frac{P_t - I}{t}) given by A\Psi(x)=\Psi^{\prime \prime}(x)-x\Psi'(x). The semigroup P_t generated by A governs the evolution of conditional expectations of functions of the process X(t), while the adjoint semigroup generated by L=A^* governs the evolution of the marginal density of X(t). The above expression for \partial f_t/\partial t follows from this remark by noting that X(t) and X^t are the same in distribution; however, it can also be deduced more simply just by writing down the density of X^t explicitly, and using the smoothness of the Gaussian density to verify each part of the claim.

We can now formulate the key identity.

de Bruijn identity. Let g be the density of the standard normal N(0,1).

  1. Differential form:

        \[\frac{d}{dt} D(f_t \| g)=\frac{1}{2} J(f_t),\]

    where J(f)=Var(f) \cdot I(f)-1 is the normalized Fisher information.

  2. Integral form:

        \[D(f \| g)=\frac{1}{2} \int_0^\infty J(f_t) dt .\]

The differential form follows by using the last part of the claim together with integration by parts. The integral form follows from the differential form by the fundamental theorem of calculus, since

    \[D(g \| g)-D(f \| g)=- \frac{1}{2}\int_0^\infty J(f_t) dt ,\]

which yields the desired identity since D(g\|g)=0.

This gives us the desired link between Fisher information and entropy. In the next lecture, we will use this to complete the proof of the entropic central limit theorem.

Lecture by Mokshay Madiman | Scribed by Georgina Hall

13. November 2013 by Ramon van Handel
Categories: Information theoretic methods | Leave a comment