Lecture 4. Entropic CLT (1)

The subject of the next lectures will be the entropic central limit theorem (entropic CLT) and its proof.

Theorem (Entropic CLT). Let X_1,X_2,\ldots be i.i.d. real-valued random variables with mean zero and unit variance. Let

    \[S_n = \frac{1}{\sqrt{n}}\sum_{i=1}^nX_i.\]

If h(S_n) > -\infty for some n, then h(S_n) \uparrow h(N(0,1)), or equivalently D( \mathcal{L}(S_n)\| N(0,1)) \downarrow 0. That is, the entropy of S_n increases monotonically to that of the standard Gaussian.

Recall that D( \mathcal{L}(X) \| \mathcal{L}(Z)) = h(Z) - h(X) when Z is Gaussian with the same mean and variance as X, which explains the equivalence stated in the theorem.
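To keep the notes self-contained, here is the one-line computation behind this identity. Writing f for the density of X and \varphi for that of Z = N(\mu,\sigma^2), with \mu = \mathbb{E}X and \sigma^2 = \mathrm{Var}(X),

    \[D(\mathcal{L}(X)\|\mathcal{L}(Z)) = \int f\log\frac{f}{\varphi} = -h(X) - \int f\log\varphi = -h(X) + \frac{1}{2}\log(2\pi\sigma^2) + \frac{1}{2} = h(Z) - h(X),\]

where we used -\log\varphi(x) = \frac{1}{2}\log(2\pi\sigma^2) + \frac{(x-\mu)^2}{2\sigma^2}, the matching of moments \int f(x)(x-\mu)^2dx = \sigma^2, and h(Z) = \frac{1}{2}\log(2\pi e\sigma^2).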

Let us note that the assumption h(S_n) > -\infty for some n represents a genuine (but not unexpected) restriction: in particular, it implies that the entropic CLT does not apply if X_i are discrete.

Entropy power inequality

Historically, the first result on monotonicity of entropy in the CLT was that h(S_{2n}) \ge h(S_n) for all n. This follows directly from an important inequality for entropy, the entropy power inequality (EPI). The rest of this lecture and part of the next lecture will be devoted to proving the EPI. While the EPI does not suffice to establish the full entropic CLT, the same tools will prove to be crucial later on.

Entropy power inequality. Let X_1 and X_2 be independent real-valued random variables such that h(X_1), h(X_2), and h(X_1 + X_2) all exist. Then

    \[e^{2h(X_1+X_2)} \ge e^{2h(X_1)} + e^{2h(X_2)},\]

with equality if and only if X_1 and X_2 are Gaussian.

Before we embark on the proof, let us make some remarks.

Remark. None of the assumptions about existence of entropies is redundant: it can happen that h(X_1) and h(X_2) exist but h(X_1+X_2) does not.

Remark. If X_1 and X_2 are i.i.d., S_1=X_1, and S_2=(X_1+X_2)/\sqrt{2}, then the EPI implies

    \[2e^{2h(X_1)} \le e^{2h(X_1+X_2)} = e^{2h(\sqrt{2}S_2)} = 2e^{2h(S_2)},\]

which implies h(S_1) \le h(S_2). Here we have used the easy-to-check identity h(aX) = h(X) + \log|a|, which of course implies e^{2h(aX)}=a^2e^{2h(X)}. From this observation, the proof of the claim that h(S_{2n}) \ge h(S_n) is immediate: simply note that S_{2n} = (S_n + S_n')/\sqrt{2}, where S_n' is an independent copy of S_n.

Remark. It is easy to check that h(X_1+X_2) \ge h(X_1+X_2 | X_2) = h(X_1 | X_2) = h(X_1). In fact, this is true in much more general settings (e.g. on locally compact groups, with entropy defined relative to Haar measure). The EPI is a much stronger statement particular to real-valued random variables.

Remark. The EPI admits the following multidimensional extension.

Multidimensional EPI. Let X_1 and X_2 be independent \mathbb{R}^n-valued random vectors such that h(X_1), h(X_2), and h(X_1 + X_2) all exist. Then

    \[e^{2h(X_1+X_2)/n} \ge e^{2h(X_1)/n} + e^{2h(X_2)/n},\]

with equality if and only if X_1 and X_2 are Gaussian with proportional covariance matrices.

Define the entropy power N(X) for an \mathbb{R}^n-valued random vector X by

    \[N(X) := \exp\left(\frac{2h(X)}{n}\right).\]

The EPI says that N is superadditive under convolution.

Digression: EPI and Brunn-Minkowski

A good way to develop an appreciation for what the EPI is saying is in analogy with the Brunn-Minkowski inequality. If A,B \subset \mathbb{R}^n are Borel sets and |\cdot| denotes n-dimensional Lebesgue measure, then

    \[|A + B|^{1/n} \ge |A|^{1/n} + |B|^{1/n},\]

where A+B := \{a + b : a \in A, \ b \in B\} is the Minkowski sum. In particular, note that |A|^{1/n} is proportional, with a constant depending only on the dimension n, to the radius of the n-dimensional Euclidean ball whose volume matches that of A. The Brunn-Minkowski inequality expresses superadditivity of this functional (and we clearly have equality for balls). The Brunn-Minkowski inequality is of fundamental importance in various areas of mathematics: for example, it implies the isoperimetric inequality in \mathbb{R}^n, which states that Euclidean balls with volume V have the minimal surface area among all subsets of \mathbb{R}^n with volume V.
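As a quick numerical sanity check (not part of the lecture), the Brunn-Minkowski inequality can be verified for axis-aligned boxes, for which Minkowski sums are again boxes with added side lengths; the helper `vol_radius` below is ad hoc, introduced only for this illustration.

```python
import numpy as np

# For axis-aligned boxes A, B in R^n with side lengths a_i, b_i, the Minkowski
# sum A + B is the box with side lengths a_i + b_i, so Brunn-Minkowski reads
# (prod(a_i + b_i))^(1/n) >= (prod a_i)^(1/n) + (prod b_i)^(1/n).

def vol_radius(sides):
    """|A|^{1/n} for a box A with the given side lengths."""
    sides = np.asarray(sides, dtype=float)
    return float(np.prod(sides)) ** (1.0 / len(sides))

rng = np.random.default_rng(0)
for _ in range(1000):
    n = int(rng.integers(1, 6))
    a = rng.uniform(0.1, 10.0, size=n)
    b = rng.uniform(0.1, 10.0, size=n)
    assert vol_radius(a + b) >= vol_radius(a) + vol_radius(b) - 1e-9
```

For boxes the inequality reduces to the AM-GM inequality applied coordinatewise, with equality when the boxes are homothetic.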

In a sense, the EPI is to random variables as the Brunn-Minkowski inequality is to sets. The Gaussians play the role of the balls, and variance corresponds to radius. In one dimension, for example, since

    \[h(N(0,\sigma^2)) = \frac{1}{2}\log(2\pi e\sigma^2) \quad \Rightarrow \quad e^{2h(N(0,\sigma^2))} = 2\pi e\sigma^2,\]

we see that e^{2h(X)} is proportional to the variance of the Gaussian whose entropy matches that of X. The entropy power inequality expresses superadditivity of this functional, with equality for Gaussians.
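In one dimension this correspondence can be checked directly from the closed-form Gaussian entropy. A minimal sketch (the helper names are ad hoc):

```python
import math

# h(N(0, v)) = (1/2) log(2*pi*e*v), so the entropy power N(X) = e^{2 h(X)}
# of a Gaussian equals 2*pi*e*v, i.e. it is proportional to the variance.

def gaussian_entropy(v):
    return 0.5 * math.log(2 * math.pi * math.e * v)

def entropy_power(h):
    return math.exp(2 * h)

v1, v2 = 2.0, 3.0
# Equality in the EPI for independent Gaussians: variances add under convolution.
lhs = entropy_power(gaussian_entropy(v1 + v2))
rhs = entropy_power(gaussian_entropy(v1)) + entropy_power(gaussian_entropy(v2))
assert abs(lhs - rhs) < 1e-9

# The scaling identity h(aX) = h(X) + log|a| gives N(aX) = a^2 N(X).
a = math.sqrt(2.0)
assert abs(entropy_power(gaussian_entropy(v1) + math.log(a))
           - a * a * entropy_power(gaussian_entropy(v1))) < 1e-9
```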

Proposition. The EPI is equivalent to the following statement: if X_1 and X_2 are independent and Z_1 and Z_2 are independent Gaussians with h(Z_1)=h(X_1) and h(Z_2)=h(X_2), then h(X_1+X_2) \ge h(Z_1 +Z_2) provided that all of the entropies exist.

Proof. Both implications follow from

    \[\exp\left(\frac{2h(X_1)}{n}\right)+ \exp\left(\frac{2h(X_2)}{n}\right) = \exp\left(\frac{2h(Z_1)}{n}\right) + \exp\left(\frac{2h(Z_2)}{n}\right) = \exp\left(\frac{2h(Z_1+Z_2)}{n}\right). \qquad\square\]

Proof of the entropy power inequality

There are many proofs of the EPI. It was stated by Shannon (1948) but first fully proven by Stam (1959); different proofs were later provided by Blachman (1969), Lieb (1978), and many others. We will follow a simplified version of Stam’s proof. We work from now on in the one-dimensional case for simplicity.

Definition (Fisher information). Let X be an \mathbb{R}-valued random variable whose density f is an absolutely continuous function. The score function of X is defined as

    \[\rho(x) = \rho_X(x) := \begin{cases} \frac{f'(x)}{f(x)} & \text{for } x \text{ at which } f(x)>0 \text{ and } f'(x) \text{ exists,} \\ 0 & \text{otherwise.} \end{cases}\]

The Fisher information (FI) of X is defined as

    \[I(X) := \mathbb{E}\left[\rho_X^2(X)\right].\]

Remark. Let \mathcal{P} = \{p_\theta : \theta \in \Theta\} be a parametric statistical model. In statistics, the score function is usually defined by \rho(\theta,x) := \frac{\partial}{\partial \theta}\log p_\theta(x), and the Fisher information by I(\theta) = \mathbb{E}_\theta\left[\rho^2(\theta,X)\right]. This reduces to our definition in the special case of location families, where p_\theta(x) = p(x-\theta) for some probability density p(x): in this case, \rho(\theta,x) = -\frac{\partial}{\partial x}\log p(x-\theta) and we have

    \[I(\theta) = \int p_\theta(x)\left[\frac{\partial}{\partial x}\log p(x-\theta)\right]^2dx = \int p(x)\left[\frac{\partial}{\partial x}\log p(x)\right]^2dx.\]

Thus for location families I(\theta) does not depend on \theta and coincides precisely with the Fisher information I(Z) as we defined it above for a random variable Z with density p(x). The statistical interpretation allows us to derive a useful inequality. Suppose for simplicity that \mathbb{E}Z=0. Then \mathbb{E}_\theta X = \theta for every \theta, so X is an unbiased estimator of \theta. The Cramér-Rao bound therefore implies the inequality

    \[I(Z) \ge \frac{1}{\mathrm{Var}(Z)},\]

with equality if and only if Z is Gaussian. The same conclusion holds when \mathbb{E}Z\in\mathbb{R} is arbitrary, as both FI and variance are invariant under translation.

Remark. There is a Fisher information analogue of the entropic CLT: in the setup of the entropic CLT, subject to an additional variance constraint, we have I(S_n) \downarrow I(N(0,1)) = 1. Moreover, among all distributions with a given variance, Fisher information is minimized by the Gaussian. This is often stated in terms of the normalized Fisher information, defined as J(X) := \mathrm{Var}(X)I(X)-1. Note that J is both translation and scale invariant: J(aX)=J(X) for a \neq 0 and J(X+a)=J(X). We have J(X) \ge 0, with equality if and only if X is Gaussian, by the previous remark. The Fisher information analogue of the entropic CLT can now be restated as J(S_n) \downarrow 0.
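These quantities are easy to probe numerically. The following sketch (assuming simple Riemann sums on a wide uniform grid; all helper names are ad hoc) checks that J vanishes for a standard Gaussian and equals 1 for the Laplace distribution.

```python
import numpy as np

dx = 1e-3
x = np.arange(-25.0, 25.0 + dx / 2, dx)

def J(density, score):
    """Normalized Fisher information J(X) = Var(X) I(X) - 1 via Riemann sums."""
    f = density(x)
    I = np.sum(score(x) ** 2 * f) * dx            # I(X) = E[rho(X)^2]
    m = np.sum(x * f) * dx                        # mean
    var = np.sum((x - m) ** 2 * f) * dx           # variance
    return var * I - 1.0

# Standard Gaussian: rho(x) = -x, so I = 1 and Var = 1, hence J = 0.
g = J(lambda t: np.exp(-t ** 2 / 2) / np.sqrt(2 * np.pi), lambda t: -t)
assert abs(g) < 1e-4

# Laplace(b=1): f(x) = e^{-|x|}/2, rho(x) = -sign(x) a.e., so I = 1,
# Var = 2, hence J = 1 > 0, consistent with non-Gaussianity.
l = J(lambda t: 0.5 * np.exp(-np.abs(t)), lambda t: -np.sign(t))
assert abs(l - 1.0) < 1e-3
```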

The strategy of the proof of the EPI is as follows:

  1. We first prove an inequality for Fisher information:

        \[\frac{1}{I(X_1+X_2)} \ge \frac{1}{I(X_1)} + \frac{1}{I(X_2)}.\]

  2. Develop an integral identity relating I and h.
  3. Combine (1) and (2) to get the EPI.

The reason to concentrate on Fisher information is that I is an L^2-type quantity, as opposed to the entropy which is an L\log L-type quantity. This makes I easier to work with.
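As a preview of step (1), the Fisher information inequality can be probed numerically. A sketch under explicit assumptions, not part of the proof: densities discretized on a uniform grid, the density of the sum computed by discrete convolution, the derivative by central differences; X_1 is standard Gaussian and X_2 is Laplace, both with I = 1.

```python
import numpy as np

dx = 0.01
x = np.arange(-30.0, 30.0 + dx / 2, dx)

f1 = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)   # N(0,1):       I(X1) = 1
f2 = 0.5 * np.exp(-np.abs(x))                   # Laplace(b=1): I(X2) = 1

# Density of X1 + X2 via discrete convolution (grid has 2*len(x) - 1 points).
f = np.convolve(f1, f2) * dx
fp = np.gradient(f, dx)                          # f' by central differences

mask = f > 1e-12                                 # avoid dividing by ~0 in the tails
I_sum = np.sum(fp[mask] ** 2 / f[mask]) * dx     # I(X1+X2) = int (f')^2 / f

# Stam:       1/I(X1+X2) >= 1/I(X1) + 1/I(X2) = 2, strictly (X2 not Gaussian).
# Cramer-Rao: 1/I(X1+X2) <= Var(X1+X2) = 1 + 2 = 3, strictly.
inv = 1.0 / I_sum
assert 1.99 < inv < 3.0    # tolerance for discretization error
```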

We begin with some technical results about Fisher information.

Lemma. If X has absolutely continuous density f and I(X) < \infty, then f has bounded variation. In particular, f is bounded.

Proof. Let D_f denote the set of points at which f is differentiable, so that |D_f^c|=0. Define f'(x) := 0 for x \notin D_f. Let S = \{x \in \mathbb{R} : f(x) > 0\}. Then

    \[\int_\mathbb{R}|f'(x)|dx = \int_S|f'(x)|dx = \int_S|\rho(x)|f(x)dx = \mathbb{E}|\rho(X)| \le \sqrt{I(X)} < \infty,\]

where the inequality \mathbb{E}|\rho(X)| \le \sqrt{I(X)} follows from Cauchy-Schwarz. \qquad\square

Lemma. Let X have an absolutely continuous density f with I(X) < \infty, and let w : \mathbb{R} \rightarrow \mathbb{R} be measurable. Let \eta : [a,b] \rightarrow [0,\infty] be defined as \eta(u) = \mathbb{E} w^2(u + X). If \eta is bounded, then g : [a,b] \rightarrow \mathbb{R} defined by g(u) = \mathbb{E} w(u+X) is absolutely continuous on (a,b) with g'(u) = -\mathbb{E}\left[w(u+X)\rho(X)\right] a.e.

Proof. Let \tilde{g}(u)=-\mathbb{E}\left[w(u+X)\rho(X)\right]; we want to show that g is absolutely continuous with derivative \tilde{g}. First observe that \tilde{g} is integrable on [a,b] from the Cauchy-Schwarz inequality and the boundedness condition. Thus we can compute

    \begin{align*} \int_a^b \tilde{g}(u)\,du &= - \int_a^b \bigg[ \int w(s) f'(s-u)\, ds \bigg] du \\ &= - \int w(s)\bigg[ \int_a^b f'(s-u)\, du \bigg] ds \\ &= - \int w(s)\big[ f(s-a) - f(s-b) \big] ds \\ &= \mathbb{E}\, w(b+X) - \mathbb{E}\, w(a+X) \\ &= g(b) - g(a), \end{align*}

which proves the desired result. \qquad\square

Corollary. \mathbb{E}[\rho(X)] = 0 and, if \mathbb{E} X^2 < \infty, then \mathbb{E}[X\rho(X)] = -1.

Proof. Take w \equiv 1 in the lemma above for the first claim. Take w(u)=u for the second. \qquad\square
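The corollary is easy to verify for a concrete distribution. A numerical sketch for X \sim N(0,4), whose score is \rho(x) = -x/4 (Riemann sums on a wide grid; illustrative only):

```python
import numpy as np

dx = 1e-3
x = np.arange(-40.0, 40.0 + dx / 2, dx)

f = np.exp(-x ** 2 / 8) / np.sqrt(8 * np.pi)  # density of N(0, 4)
rho = -x / 4                                  # score: f'(x)/f(x) = -x/4

# E[rho(X)] = 0 and E[X rho(X)] = -E[X^2]/4 = -4/4 = -1.
assert abs(np.sum(rho * f) * dx) < 1e-6
assert abs(np.sum(x * rho * f) * dx + 1.0) < 1e-5
```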

To be continued…

Lecture by Mokshay Madiman | Scribed by Dan Lacker

23. October 2013 by Ramon van Handel
Categories: Information theoretic methods