Lecture 5. Entropic CLT (2)

The goal of this lecture is to prove monotonicity of Fisher information in the central limit theorem. Next lecture we will connect Fisher information to entropy, completing the proof of the entropic CLT.

Two lemmas about the score function

Recall that for a random variable X with absolutely continuous density f, the score function is defined as \rho(x)=f'(x)/f(x) and the Fisher information is I(X)=\mathbb{E}[\rho^2(X)].
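
As a quick illustration (a standard example, added here and not part of the scribed lecture): if X\sim N(0,\sigma^2), then f(x)=(2\pi\sigma^2)^{-1/2}e^{-x^2/2\sigma^2}, so

    \[\rho(x)=\frac{f'(x)}{f(x)}=-\frac{x}{\sigma^2},\qquad I(X)=\mathbb{E}\bigg[\frac{X^2}{\sigma^4}\bigg]=\frac{1}{\sigma^2}.\]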

The following lemma was proved in the previous lecture.

Lemma 1. Let X be a random variable with finite Fisher information. Let w:\mathbb{R} \rightarrow \mathbb{R} be measurable and let \eta(u) = \mathbb{E}[w^2(u + X)]. If \eta is bounded on an interval [a,b], then the function g(u) = \mathbb{E} w(u+X) is absolutely continuous on [a,b] with g'(u) = -\mathbb{E}[w(u+X)\rho(X)] a.e.
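
Remark. A consequence of Lemma 1 that will be used below (a small aside, spelled out here for completeness): taking w\equiv 1, the function g is constant, so for any random variable X with finite Fisher information

    \[\mathbb{E}[\rho(X)]=0.\]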

There is a converse to the above lemma, which gives a useful characterization of the score function.

Lemma 2. Let X be a random variable with density f and let m: \mathbb{R} \rightarrow \mathbb{R} be a measurable function with \mathbb{E}|m(X)|<\infty. Suppose that for every bounded measurable function w:\mathbb{R} \rightarrow \mathbb{R}, the function g(u) = \mathbb{E} w(u+X) is absolutely continuous on \mathbb{R} with g'(u) = -\mathbb{E}[w(u+X)m(X)] a.e. Then there must exist an absolutely continuous version of the density f, and moreover m(X)=\rho(X) a.s.

Proof. Take w(x)=\mathbb{I}_{\{x\leq t\}}. Then

    \[g(u)=\int_{-\infty}^{t-u}f(x)dx.\]

Hence g is absolutely continuous with derivative g'(u)=-f(t-u) for a.e. u. On the other hand,

    \[\begin{split} g'(u)&=-\mathbb{E}[w(u+X)m(X)]\\ &=-\int_{-\infty}^{t-u}f(x)m(x)dx\qquad\mbox{a.e.} \end{split}\]

by our assumption. Comparing the two expressions for g', we conclude that f(y)=\int_{-\infty}^{y}f(x)m(x)dx for a.e. y. Since \mathbb{E}|m(X)|<\infty, the right-hand side is absolutely continuous in y; modifying f on a null set therefore yields an absolutely continuous version of the density for which

    \[f(y)=\int_{-\infty}^{y}f(x)m(x)dx\]

for every y\in\mathbb{R}. Hence f'(x)=f(x)m(x) a.e., so that m(x)=\rho(x) whenever f(x)\neq 0. As \mathbb{P}[f(X)=0]=0, it follows that m(X)=\rho(X) a.s., and the proof is complete. \square

Score function of the sum of independent variables

We now show a key property of the score function: the score function of the sum of two independent random variables is a projection.

Proposition. Let X and Y be two independent random variables. Suppose X has finite Fisher information. Then \rho_{X+Y}(X+Y)=\mathbb{E}[\rho_X(X)|X+Y] a.s.

Proof. Let m(u)=\mathbb{E}[\rho_X(X)|X+Y=u], and note that \mathbb{E}|m(X+Y)|\leq\mathbb{E}|\rho_X(X)|\leq\sqrt{I(X)}<\infty. By Lemma 2, it therefore suffices to show that the function g(u) = \mathbb{E} w(u+X+Y) is absolutely continuous with g'(u) = -\mathbb{E}[w(u+X+Y)m(X+Y)] a.e. for every bounded measurable function w.

Fix a\in\mathbb{R}. By independence of X and Y, we can apply Lemma 1 to X (conditioned on Y) to obtain

    \[\mathbb{E}[w(u+X+Y)|Y]-\mathbb{E}[w(a+X+Y)|Y]=-\int_a^u\mathbb{E}[w(t+X+Y)\rho_X(X)|Y]\,dt.\]

Taking expectation of both sides and applying Fubini, we get

    \[g(u)-g(a)=-\int_a^u\mathbb{E}[w(t+X+Y)\rho_X(X)]\,dt.\]

Since

    \[\begin{split}\mathbb{E}[w(t+X+Y)\rho_X(X)]&=\mathbb{E}[\mathbb{E}[w(t+X+Y)\rho_X(X)|X+Y]]\\ &=\mathbb{E}[w(t+X+Y)\mathbb{E}[\rho_X(X)|X+Y]]\\ &=\mathbb{E}[w(t+X+Y)m(X+Y)],\end{split}\]

we arrive at

    \[g(u)-g(a)=-\int_a^u\mathbb{E}[w(t+X+Y)m(X+Y)]\,dt,\]

which implies g'(u) = -\mathbb{E}[w(u+X+Y)m(X+Y)] a.e. \qquad\square

Remark. The well-known interpretation of conditional expectation as a projection means that under the assumption of finite Fisher information (i.e., score functions in L^2(\Omega,\mathcal{F},\mathbb{P})), the score function of the sum is just the projection of the score of a summand onto the closed subspace L^2(\Omega, \sigma\{X+Y\}, \mathbb{P}). This implies directly, by the Pythagorean inequality, that convolution decreases Fisher information: I(X+Y)=\mathbb{E}[\rho_{X+Y}^2(X+Y)]\leq\mathbb{E}[\rho_{X}^2(X)]=I(X). In fact, we can do better, as we will see forthwith.
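
As a sanity check (a standard Gaussian computation, added here as an illustration): if X\sim N(0,\sigma_X^2) and Y\sim N(0,\sigma_Y^2) are independent, then \rho_X(x)=-x/\sigma_X^2 and X+Y\sim N(0,\sigma_X^2+\sigma_Y^2), so

    \[\mathbb{E}[\rho_X(X)|X+Y]=-\frac{\mathbb{E}[X|X+Y]}{\sigma_X^2}=-\frac{X+Y}{\sigma_X^2+\sigma_Y^2}=\rho_{X+Y}(X+Y),\]

in agreement with the proposition (here we used \mathbb{E}[X|X+Y]=\frac{\sigma_X^2}{\sigma_X^2+\sigma_Y^2}(X+Y)).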

Monotonicity of Fisher information in the CLT along a subsequence

We now make a first step towards proving monotonicity of the Fisher information in the CLT.

Theorem. Let X and Y be independent random variables both with finite Fisher information. Then

    \[I(X+Y)\leq\lambda^2I(X)+(1-\lambda)^2I(Y)\]

for any \lambda\in\mathbb{R}.

Before going into the proof, we make some remarks.

Remark. Taking \lambda=1, we get I(X+Y)\leq I(X). Hence the above theorem is a significant strengthening of the simple fact that convolution decreases Fisher information.

Remark. By taking \lambda=\frac{I(Y)}{I(X)+I(Y)}, we get the following optimized version of the theorem:

    \[\frac{1}{I(X+Y)}\geq\frac{1}{I(X)}+\frac{1}{I(Y)}.\]
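
Indeed (a short computation, included here for completeness), with this choice of \lambda,

    \[\lambda^2I(X)+(1-\lambda)^2I(Y)=\frac{I(Y)^2I(X)+I(X)^2I(Y)}{(I(X)+I(Y))^2}=\frac{I(X)I(Y)}{I(X)+I(Y)},\]

and taking reciprocals in I(X+Y)\leq\frac{I(X)I(Y)}{I(X)+I(Y)} yields the displayed inequality.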

Remark. Let X_1,X_2,\ldots be an i.i.d. sequence of random variables with mean zero and unit variance. Let S_n=\frac{1}{\sqrt{n}}\sum_{i=1}^nX_i. By taking \lambda=\frac{1}{2} in the theorem and using the scaling property I(aX)=a^{-2}I(X) of Fisher information, it is easy to obtain I(S_{2n})\leq I(S_n) (the computation is spelled out after this remark). Hence, the above theorem already implies monotonicity of Fisher information along the subsequence of times 1,2,4,8,16,\ldots: that is, I(S_{2^n}) is monotone in n.
However, the theorem is not strong enough to give monotonicity I(S_{n+1})\leq I(S_n) without passing to a subsequence. For example, if we apply the previous remark repeatedly we only get I(S_n)\leq I(S_1), which is not very interesting. To prove full monotonicity of the Fisher information, we will need a strengthening of the above Theorem. But it is instructive to first consider the proof of the simpler case.
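
To spell out the subsequence step of the preceding remark (a routine computation, not in the scribed notes): write S_{2n}=\frac{1}{\sqrt{2}}(S_n+S_n'), where S_n'=\frac{1}{\sqrt{n}}\sum_{i=n+1}^{2n}X_i is an independent copy of S_n. By the scaling property and the theorem with \lambda=\frac{1}{2},

    \[I(S_{2n})=2\,I(S_n+S_n')\leq 2\Big(\tfrac{1}{4}I(S_n)+\tfrac{1}{4}I(S_n')\Big)=I(S_n).\]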

Proof. By the projection property of the score function,

    \[\begin{split} &\rho_{X+Y}(X+Y)=\mathbb{E}[\rho_X(X)|X+Y],\\ &\rho_{X+Y}(X+Y)=\mathbb{E}[\rho_Y(Y)|X+Y].\\ \end{split}\]

Hence

    \[\rho_{X+Y}(X+Y)=\mathbb{E}[\lambda\rho_X(X)+(1-\lambda)\rho_Y(Y)|X+Y].\]

Applying the conditional Jensen inequality, we obtain

    \[\rho_{X+Y}^2(X+Y)\leq\mathbb{E}[(\lambda\rho_X(X)+(1-\lambda)\rho_Y(Y))^2|X+Y].\]

Taking expectations of both sides and using independence, we obtain

    \[I(X+Y)\leq\mathbb{E}[(\lambda\rho_X(X)+(1-\lambda)\rho_Y(Y))^2] =\lambda^2I(X)+(1-\lambda)^2I(Y)+2\lambda(1-\lambda)\mathbb{E}[\rho_X(X)]\mathbb{E}[\rho_Y(Y)].\]

By Lemma 1 (applied with w\equiv 1, so that g is constant), \mathbb{E}[\rho_X(X)]=0. This finishes the proof. \qquad\square

Monotonicity of Fisher information in the CLT

To prove monotonicity of the Fisher information in the CLT (without passing to a subsequence) we need a strengthening of the property of Fisher information given in the previous section.

In the following, we will use the common notation [n]:=\{1,\ldots,n\}.

Definition. Let \mathcal{G} be a collection of non-empty subsets of [n]. Then a collection of non-negative numbers \{\beta_s:s\in\mathcal{G}\} is called a fractional packing for \mathcal{G} if \sum_{s:i\in s}\beta_s\leq 1 for all 1\leq i\leq n.
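
For example (an illustration, not from the lecture): the collection of singletons \mathcal{G}=\{\{1\},\ldots,\{n\}\} with \beta_{\{i\}}=1 is a fractional packing, since each index i lies in exactly one set. More generally, for any \mathcal{G} one obtains a fractional packing by setting \beta_s=1/\max_{i\in[n]}\#\{t\in\mathcal{G}:i\in t\}.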

We can now state the desired strengthening of our earlier theorem.

Theorem. Let \{\beta_s:s\in\mathcal{G}\} be a fractional packing for \mathcal{G} and let X_1,\ldots,X_n be independent random variables. Let \{w_s:s\in\mathcal{G}\} be a collection of real numbers such that \sum_{s\in\mathcal{G}}w_s=1. Then

    \[I(X_1+\ldots+X_n)\leq\sum_{s\in\mathcal{G}}\frac{w_s^2}{\beta_s}I\bigg(\sum_{j\in s}X_j\bigg)\]

and

    \[\frac{1}{I(X_1+\ldots+X_n)}\geq\sum_{s\in\mathcal{G}}\beta_s\frac{1}{I(\sum_{j\in s}X_j)}.\]
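
Remark. Just as in the two-variable case, the second inequality follows from the first by optimizing over the weights (a short computation, spelled out here for completeness): choosing w_s=\frac{\beta_s/I(\sum_{j\in s}X_j)}{\sum_{t\in\mathcal{G}}\beta_t/I(\sum_{j\in t}X_j)}, the first inequality becomes

    \[I(X_1+\ldots+X_n)\leq\sum_{s\in\mathcal{G}}\frac{w_s^2}{\beta_s}I\bigg(\sum_{j\in s}X_j\bigg)=\bigg(\sum_{s\in\mathcal{G}}\frac{\beta_s}{I(\sum_{j\in s}X_j)}\bigg)^{-1},\]

which is the second inequality after taking reciprocals.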

Remark. Suppose that the X_i are identically distributed. Take \mathcal{G}=\{[n]\backslash 1,[n]\backslash 2,\ldots,[n]\backslash n\}. For each s\in\mathcal{G}, define \beta_s=\frac{1}{n-1}. It is easy to check that \{\beta_s:s\in\mathcal{G}\} is a fractional packing of \mathcal{G}. Then

    \[\frac{1}{I(X_1+\ldots+X_n)}\geq\frac{n}{n-1}\frac{1}{I(X_1+\ldots+X_{n-1})}\]

by the above theorem. By the scaling property of Fisher information, this is equivalent to I(S_n)\leq I(S_{n-1}), i.e., monotonicity of the Fisher information (the computation is spelled out below). This special case was first proved by Artstein, Ball, Barthe and Naor (2004) with a more complicated proof. The proof of the more general theorem above that we will follow is due to Barron and Madiman (2007).
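
To spell out the scaling step (a routine computation, with S_n=\frac{1}{\sqrt{n}}\sum_{i=1}^nX_i as in the earlier remark): since X_1+\ldots+X_n=\sqrt{n}\,S_n and X_1+\ldots+X_{n-1}=\sqrt{n-1}\,S_{n-1}, the scaling property I(aX)=a^{-2}I(X) gives I(X_1+\ldots+X_n)=\frac{1}{n}I(S_n) and I(X_1+\ldots+X_{n-1})=\frac{1}{n-1}I(S_{n-1}). The displayed inequality therefore reads

    \[\frac{n}{I(S_n)}\geq\frac{n}{n-1}\cdot\frac{n-1}{I(S_{n-1})}=\frac{n}{I(S_{n-1})},\]

i.e., I(S_n)\leq I(S_{n-1}).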

The proof of the above theorem is based on an analysis of variance (ANOVA) type decomposition, which dates back at least to the classic paper of Hoeffding (1948) on U-statistics. To state this decomposition, let X_1,\ldots,X_n be independent random variables, and define the Hilbert space

    \[H:=\{\phi:\mathbb{R}^n \rightarrow \mathbb{R}~~:~~\mathbb{E}[\phi^2(X_1,\ldots,X_n)]<\infty\}\]

with inner product

    \[\langle\phi,\psi\rangle:=\mathbb{E}[\phi(X_1,\ldots,X_n)\psi(X_1,\ldots,X_n)].\]

For every j\in[n], define an operator E_j on H as

    \[(E_j\phi)(x_1,\ldots,x_n)=\mathbb{E}[\phi(x_1,\ldots,x_{j-1},X_j,x_{j+1},\ldots,x_n)],\]

that is, E_j averages out the j-th coordinate.

Proposition. Each \phi\in H can be decomposed as

    \[\phi=\sum_{T\subseteq [n]}\widetilde{E}_T\phi,\]

where

    \[\widetilde{E}_T:=\prod_{j\notin T}E_j\prod_{j\in T}(I-E_j)\]

satisfies

  1. \widetilde{E}_T\widetilde{E}_S=0 if T\neq S;
  2. \langle\widetilde{E}_T\phi,\widetilde{E}_S\psi\rangle=0 if T\neq S;
  3. \widetilde{E}_T\phi does not depend on x_i if i\notin T.

Proof. It is easy to verify that E_j is a projection operator, that is, E_j^2=E_j and \langle E_j\phi,\psi\rangle=\langle\phi,E_j\psi\rangle (self-adjointness). It is also easy to see that E_jE_i=E_iE_j for i\neq j. Hence we have

    \[\phi=\prod_{j=1}^n(E_j+I-E_j)\phi=\sum_{T\subseteq [n]}\prod_{j\notin T}E_j\prod_{j\in T}(I-E_j)\phi=\sum_{T\subseteq [n]}\widetilde{E}_T\phi.\]

If T\neq S, choose j_0\in T such that j_0\notin S (interchanging the roles of T and S if necessary). Then

    \[\widetilde{E}_T\widetilde{E}_S=(I-E_{j_0})E_{j_0}\prod_{j\notin T}E_j\prod_{j\in T\backslash\{j_0\}}(I-E_j)\prod_{j\notin S\cup\{j_0\}}E_j\prod_{j\in S}(I-E_j)=0,\]

since the operators commute and (I-E_{j_0})E_{j_0}=E_{j_0}-E_{j_0}^2=0.

It is easily verified that \widetilde{E}_T is itself self-adjoint, so that \langle\widetilde{E}_T\phi,\widetilde{E}_S\psi\rangle=\langle\phi,\widetilde{E}_T\widetilde{E}_S\psi\rangle=0 follows directly from the first property. Finally, as by definition E_j\phi does not depend on x_j, it is clear that \widetilde{E}_T\phi does not depend on x_j if j\notin T. \qquad\square
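
For instance (an illustration, not part of the scribed lecture), for n=2 the decomposition reads

    \[\phi=E_1E_2\phi+(I-E_1)E_2\phi+E_1(I-E_2)\phi+(I-E_1)(I-E_2)\phi,\]

which expresses \phi(x_1,x_2) as the sum of its mean \mathbb{E}[\phi(X_1,X_2)], the two centered main effects \mathbb{E}[\phi(x_1,X_2)]-\mathbb{E}[\phi(X_1,X_2)] and \mathbb{E}[\phi(X_1,x_2)]-\mathbb{E}[\phi(X_1,X_2)], and an interaction remainder: this is the classical Hoeffding decomposition for two independent variables.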

The decomposition will be used in the form of the following variance drop lemma, whose proof we postpone to the next lecture. Here X_S:=(X_{i_1},\ldots,X_{i_{|S|}}) for S=\{i_1,\ldots,i_{|S|}\}\subseteq [n], i_1<\cdots<i_{|S|}.

Lemma. Let \{\beta_s:s\in\mathcal{G}\} be a fractional packing for \mathcal{G}. Let U=\sum_{s\in\mathcal{G}}\phi_s(X_s). Suppose that \mathbb{E}[\phi_s^2(X_s)]<\infty for each s\in\mathcal{G}. Then \text{Var}[U]\leq\sum_{s\in\mathcal{G}}\frac{1}{\beta_s}\text{Var}[\phi_s(X_s)].
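
As a quick sanity check (an aside, not from the lecture): for the singleton packing \mathcal{G}=\{\{1\},\ldots,\{n\}\} with \beta_{\{i\}}=1, the lemma reduces to

    \[\text{Var}\bigg[\sum_{i=1}^n\phi_i(X_i)\bigg]\leq\sum_{i=1}^n\text{Var}[\phi_i(X_i)],\]

which holds with equality by independence. The content of the lemma is that overlaps between the sets in \mathcal{G} cost at most the factors 1/\beta_s.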

To be continued…

Lecture by Mokshay Madiman | Scribed by Liyao Wang

06. November 2013 by Ramon van Handel
Categories: Information theoretic methods
