End of Fall 2013 seminar

That was it for the Fall 2013 stochastic analysis seminar! Many thanks in particular to our fantastic guest lecturers Mokshay Madiman and Emmanuel Abbe.

There will be no stochastic analysis seminar in the Spring 2014 semester (students interested in related material may want to consider taking the special topics course Probability in High Dimension). We will resume in the Fall 2014 semester with a topic to be announced at that time.

15. December 2013 by Ramon van Handel
Categories: Announcement

Lecture 10. Concentration, information, transportation (2)

Recall the main proposition proved in the previous lecture, which is due to Bobkov and Götze (1999).

Proposition. The following are equivalent for X\sim \mu:

  1. \mathbb{E}[e^{\lambda\{f(X)-\mathbb{E}f(X)\}}] \le e^{\lambda^2\sigma^2/2} for every \lambda>0 and f\in 1\text{-Lip} with finite mean.
  2. W_1(\nu,\mu) \le \sqrt{2\sigma^2 D(\nu||\mu)} for every probability measure \nu \ll \mu.

This result provides a characterization of concentration of Lipschitz functions on a fixed metric space. Our original aim, however, was to understand dimension-free concentration (in analogy with the Gaussian case): that is, for which measures \mu is it true that if X_1,\ldots,X_n are i.i.d. \sim\mu, then every 1\text{-Lip} function f(X_1,\ldots,X_n) is subgaussian with the same parameter \sigma^2 for every n\ge 1? In principle, the above result answers this question: \mu satisfies dimension-free concentration if and only if

    \[W_1(\mathbb{Q},\mu^{\otimes n}) \le \sqrt{2\sigma^2 D(\mathbb{Q}||\mu^{\otimes n})}      \quad\text{for every }\mathbb{Q}\ll\mu^{\otimes n}\text{ and }n\ge 1.\]

However, this is not a very satisfactory characterization: to check whether \mu satisfies dimension-free concentration, we must check that the transportation-information inequality holds for \mu^{\otimes n} for every n\ge 1. Instead, we would like to characterize dimension-free transportation in terms of a property of \mu itself (and not its tensor products). This will be achieved in the present lecture.

Tensorization

How are we going to obtain a transportation-information inequality for every \mu^{\otimes n} starting from only a property of \mu? If one is an optimist, one might hope for a miracle: perhaps the validity of the transportation-information inequality for \mu itself already implies the analogous inequality for all its products \mu^{\otimes n}. In such situations, one says that the inequality tensorizes. As we will shortly see, the inequality W_1(\nu,\mu) \le \sqrt{2\sigma^2 D(\nu||\mu)} does not tensorize in precisely the manner that one would hope, but this naive idea will nonetheless lead us in the right direction.

We will develop the tensorization result in a slightly more general setting that will be useful in the sequel. Recall that \mathcal{C}(\mu,\nu):=\{\mathrm{Law}(X,Y):X\sim\mu,Y\sim \nu\} denotes the set of couplings between probability measures \mu and \nu, and that the Wasserstein distance W_1(\mu,\nu) can be written as

    \[W_1(\mu,\nu) = \inf_{\mathbb{M}\in\mathcal{C}(\mu,\nu)}\mathbb{E_M}[d(X,Y)].\]

We can now state a general tensorization result for transportation-information type inequalities. The proof is due to Marton (1996), who initiated the use of this method to study concentration.

Proposition. (Tensorization) Let \phi:\mathbb{R}_+\to\mathbb{R}_+ be a convex function, and let w:\mathbb{X}\times\mathbb{X}\to\mathbb{R}_+ be a positive weight function on the metric space (\mathbb{X},d). Fix a probability measure \mu on \mathbb{X}. If

    \[\inf_{\mathbb{M}\in C(\nu,\mu)} \phi(\mathbb{E}_{\mathbb{M}}[w(X,Y)])            \le c\, D(\nu || \mu)\]

holds for all \nu\ll\mu, then

    \[\inf_{\mathbb{M}\in C(\mathbb{Q},\mu^{\otimes n})} \sum_{i=1}^n          \phi(\mathbb{E}_{\mathbb{M}}[w(X_i,Y_i)]) \le c\, D(\mathbb{Q} || \mu^{\otimes n})\]

holds for every \mathbb{Q}\ll\mu^{\otimes n} and every n\ge 1.

Proof. The conclusion holds for n=1 by assumption. The proof now proceeds by induction on n: we will suppose in the sequel that the conclusion holds for n-1, and deduce the conclusion for n from that.

Fix \mathbb{Q}\ll\mu^{\otimes n}. We denote by \mathbb{Q}^{n-1} the marginal of \mathbb{Q} on the first n-1 coordinates, and define the conditional distribution \mathbb{Q}_{X_1,\ldots,X_{n-1}}=\mathbb{Q}(X_n\in\,\cdot\,|X_1,\ldots,X_{n-1}) (whose existence is guaranteed by the Bayes formula as \mathbb{Q}\ll\mu^{\otimes n}). The key idea of the proof is to exploit the chain rule for relative entropy:

    \[D(\mathbb{Q}||\mu^{\otimes n}) =    D(\mathbb{Q}^{n-1}||\mu^{\otimes n-1}) +    \mathbb{E_Q}[D(\mathbb{Q}_{X_1,\ldots,X_{n-1}} || \mu )].\]

The first term on the right can be bounded below by the induction hypothesis, while the second term can be bounded below by the assumption of the Proposition. In particular, fixing \varepsilon>0, it follows that we can choose \mathbb{M}^{n-1}\in\mathcal{C}(\mathbb{Q}^{n-1},\mu^{\otimes n-1}) and \mathbb{M}_{X_1,\ldots,X_{n-1}}\in\mathcal{C}(\mathbb{Q}_{X_1,\ldots,X_{n-1}},\mu) such that

    \begin{align*}    c\,D(\mathbb{Q}^{n-1}||\mu^{\otimes n-1}) &\ge     \sum_{i=1}^{n-1}\phi(\mathbb{E}_{\mathbb{M}^{n-1}}[w(X_i,Y_i)]) - \varepsilon,\\    c\,D(\mathbb{Q}_{X_1,\ldots,X_{n-1}} || \mu ) &\ge    \phi(\mathbb{E}_{\mathbb{M}_{X_1,\ldots,X_{n-1}}}[w(X_n,Y_n)]) - \varepsilon. \end{align*}

If we now define the probability measure \mathbb{M} as

    \[\mathbb{E_M}[f(X_1,\ldots,X_n,Y_1,\ldots,Y_n)] =     \mathbb{E}_{\mathbb{M}^{n-1}}\bigg[\int f(X_1,\ldots,X_{n-1},x,Y_1,\ldots,Y_{n-1},y)\mathbb{M}_{X_1,\ldots,X_{n-1}}(dx,dy)\bigg],\]

then we can estimate by Jensen’s inequality

    \begin{align*}    c\,D(\mathbb{Q}||\mu^{\otimes n}) &\ge    \sum_{i=1}^{n-1}\phi(\mathbb{E}_{\mathbb{M}^{n-1}}[w(X_i,Y_i)]) +    \mathbb{E_Q}[\phi(\mathbb{E}_{\mathbb{M}_{X_1,\ldots,X_{n-1}}}[w(X_n,Y_n)])]    - 2\varepsilon \\ &\ge    \sum_{i=1}^{n}\phi(\mathbb{E}_{\mathbb{M}}[w(X_i,Y_i)]) - 2\varepsilon. \end{align*}

But evidently \mathbb{M}\in\mathcal{C}(\mathbb{Q},\mu^{\otimes n}), and thus

    \[c\,D(\mathbb{Q}||\mu^{\otimes n}) \ge \inf_{\mathbb{M}\in    \mathcal{C}(\mathbb{Q},\mu^{\otimes n})} \sum_{i=1}^{n}\phi(\mathbb{E}_{\mathbb{M}}    [w(X_i,Y_i)]) - 2\varepsilon.\]

The proof is completed by letting \varepsilon\downarrow 0. \square

Remark. In the above proof, we have swept a technicality under the rug: we assumed that an \varepsilon-optimal coupling \mathbb{M}_{X_1,\ldots,X_{n-1}}\in\mathcal{C}(\mathbb{Q}_{X_1,\ldots,X_{n-1}},\mu) can be chosen to be measurable as a function of X_1,\ldots,X_{n-1}. This can generally be justified by standard methods (e.g., on Polish spaces by a measurable selection argument, or in special cases such as w(x,y)=\mathbf{1}_{x\ne y} by an explicit construction of the optimal coupling).

Now assume that \mu satisfies the transportation-information inequality

    \[W_1(\nu,\mu) \le \sqrt{2\sigma^2 D(\nu||\mu)}\quad\text{for all }\nu\ll\mu,\]

which characterizes concentration in a fixed metric space. This corresponds to the setting of the above tensorization result with \phi(x)=x^2 and w(x,y)=d(x,y). Tensorization then yields

    \[\inf_{\mathbb{M}\in C(\mathbb{Q},\mu^{\otimes n})} \sqrt{\sum_{i=1}^n          \mathbb{E}_{\mathbb{M}}[d(X_i,Y_i)]^2} \le \sqrt{2\sigma^2 D(\mathbb{Q} ||          \mu^{\otimes n})}\quad\text{for all }\mathbb{Q}\ll\mu^{\otimes n},~n\ge 1.\]

Unfortunately, the left-hand side of this inequality is not itself a Wasserstein distance, and so we do not automatically obtain a transportation-information inequality in higher dimension. In the previous lecture, we showed that one can bound the left-hand side from below by a Wasserstein metric with respect to an \ell_1-type distance using the Cauchy-Schwarz inequality. However, we then lose a factor n^{-1/2} by Cauchy-Schwarz, and thus the dimension-free nature of the concentration is lost.
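To spell out the loss of dimension-freeness: writing d_n^1(x,y):=\sum_{i=1}^nd(x_i,y_i) for the \ell_1-type distance, the Cauchy-Schwarz inequality gives for any coupling \mathbb{M}

    \[\mathbb{E}_{\mathbb{M}}[d_n^1(X,Y)] = \sum_{i=1}^n\mathbb{E}_{\mathbb{M}}[d(X_i,Y_i)]      \le \sqrt{n}\,\bigg[\sum_{i=1}^n\mathbb{E}_{\mathbb{M}}[d(X_i,Y_i)]^2\bigg]^{1/2},\]

so, using the coupling representation of W_1, the tensorized inequality above only yields W_1(\mathbb{Q},\mu^{\otimes n})\le\sqrt{2n\sigma^2 D(\mathbb{Q}||\mu^{\otimes n})} with respect to d_n^1: the subgaussian parameter grows linearly in the dimension n.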

The above computation, however, suggests how we can “fix” our assumption to obtain dimension-free concentration. Note that the left-hand side of the tensorization inequality above is the \ell_2-norm of the vector of expectations (\mathbb{E}_{\mathbb{M}}[d(X_i,Y_i)])_{i\le n}. If we could take the \ell_2 norm inside the expectation, rather than outside, then the left-hand side would be a Wasserstein distance between probability measures on (\mathbb{X}^n,d_n) with respect to the \ell_2-distance d_n(x,y):=\sqrt{\sum_{i=1}^nd(x_i,y_i)^2}! In order to engineer such a stronger inequality, however, we must begin with a stronger assumption.

To this end, define the quadratic Wasserstein distance

    \[W_2(\mu,\nu) := \inf_{\mathbb{M}\in\mathcal{C}(\mu,\nu)}\sqrt{\mathbb{E_M}[d(X,Y)^2]}.\]

Suppose that \mu satisfies the quadratic transportation cost inequality (QTCI)

    \[W_2(\mu,\nu) \le \sqrt{2\sigma^2 D(\nu||\mu)}\quad\text{for all }\nu\ll\mu.\]

Then applying tensorization with \phi(x)=x and w(x,y)=d(x,y)^2 immediately yields

    \[W_2(\mathbb{Q},\mu^{\otimes n}) \le \sqrt{2\sigma^2 D(\mathbb{Q}||\mu^{\otimes n})}     \quad\text{for all }\mathbb{Q}\ll\mu^{\otimes n},~n\ge 1.\]

On the other hand, as obviously W_1(\mu,\nu)\le W_2(\mu,\nu) by Jensen’s inequality, we immediately find that QTCI implies dimension-free concentration: that is, we have proved

Corollary. Suppose that the probability measure \mu satisfies

    \[W_2(\mu,\nu) \le \sqrt{2\sigma^2 D(\nu||\mu)}\quad\text{for all }\nu\ll\mu.\]

Then we have dimension-free concentration, that is, for X_1,X_2,\ldots i.i.d. \sim\mu

    \[\mathbb{P}[f(X_1,\ldots,X_n)-\mathbb{E}f(X_1,\ldots,X_n)\ge t] \le e^{-t^2/2\sigma^2}\]

for all n\ge 1, t\ge 0, and every function f with finite mean satisfying |f(x)-f(y)|\le \sqrt{\sum_{i=1}^n d(x_i,y_i)^2}.

The observation that the quadratic transportation cost inequality yields dimension-free concentration with respect to the \ell_2-metric is due to Talagrand (1996), who used it to prove Gaussian concentration.
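For example, for \mu=N(0,1) on (\mathbb{R},|\cdot|) (which, as we will see below, satisfies the quadratic transportation cost inequality with \sigma^2=1), the choice \nu=N(m,1) shows that the constant cannot be improved: the translation coupling Y=X+m gives W_2(\nu,\mu)\le|m|, while W_2(\nu,\mu)\ge W_1(\nu,\mu)\ge|\int x\,\nu(dx)-\int x\,\mu(dx)|=|m|, and a direct computation gives D(\nu||\mu)=m^2/2. Thus

    \[W_2(\nu,\mu) = |m| = \sqrt{2D(\nu||\mu)},\]

so that the inequality holds with equality in this case.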

Characterizing dimension-free concentration

We began this topic by showing that the transportation cost inequality

    \[W_1(\mu,\nu) \le \sqrt{2\sigma^2D(\nu||\mu)}\quad\text{for all }\nu\ll\mu\]

is necessary and sufficient for a probability \mu on a fixed metric space (\mathbb{X},d) to exhibit concentration of Lipschitz functions. However, this does not suffice to obtain dimension-free concentration, that is, concentration of the product measure \mu^{\otimes n} for every n\ge 1. To obtain the latter, we “fixed” our original assumption by imposing the stronger quadratic transportation cost inequality

    \[W_2(\mu,\nu) \le \sqrt{2\sigma^2D(\nu||\mu)}\quad\text{for all }\nu\ll\mu.\]

As was shown above, this inequality is sufficient to obtain dimension-free concentration. But it is far from clear whether it should also be necessary: by strengthening the inequality, we have lost the connection with the proof on a fixed space (which relies on the variational property of relative entropy). It is therefore a remarkable fact that the quadratic transportation cost inequality proves to be both necessary and sufficient for dimension-free concentration, as was observed by Gozlan (2009).

Theorem. Let \mu be a probability measure on a Polish space (\mathbb{X},d), and let X_1,X_2,\ldots be i.i.d. \sim\mu. Then the following are equivalent:

  1. Dimension-free concentration:

        \[\mathbb{P}[f(X_1,\ldots,X_n)-\mathbb{E}f(X_1,\ldots,X_n)\ge t] \le e^{-t^2/2\sigma^2}\]

    for all n\ge 1, t\ge 0 and |f(x)-f(y)|\le \sqrt{\sum_{i=1}^n d(x_i,y_i)^2} with finite mean.

  2. Quadratic transportation cost inequality:

        \[W_2(\mu,\nu) \le \sqrt{2\sigma^2 D(\nu||\mu)}\]

    for every probability measure \nu \ll \mu.

This result effectively resolves the question we set out to answer.

Proof. We have already shown that 2 implies 1. In the sequel, we will prove that 1 implies 2: that is, the validity of the quadratic transportation cost inequality is necessary for dimension-free concentration. The proof of this fact is a surprising application of Sanov’s theorem (see Lecture 3).

We will need the following three facts that will be proved below.

  1. Law of large numbers: \mathbb{E}[W_2(\frac{1}{n}\sum_{i=1}^n\delta_{X_i},\mu)]\to 0 as n\to\infty.
  2. Lower semicontinuity: O_t:=\{\nu:W_2(\mu,\nu)>t\} is open in the weak convergence topology.
  3. Lipschitz property: the map g_n:(x_1,\ldots,x_n)\mapsto W_2(\frac{1}{n}\sum_{i=1}^n\delta_{x_i},\mu) is n^{-1/2}-Lipschitz.

The first two claims are essentially technical exercises: the empirical measures \frac{1}{n}\sum_{i=1}^n\delta_{X_i} converge weakly to \mu by the law of large numbers, so the only difficulty is to verify that the convergence holds in the slightly stronger sense of the quadratic Wasserstein distance; and lower-semicontinuity of the quadratic Wasserstein distance is an elementary technical fact. The third claim is a matter of direct computation, which we will do below. Let us take these claims for granted and complete the proof.

As O_t is open, we can apply Sanov’s theorem as follows:

    \[-\inf_{\nu\in O_t} D(\nu || \mu)  \le    \liminf_{n\to\infty}\frac{1}{n}\log\mathbb{P}\bigg[     \frac{1}{n}\sum_{i=1}^n\delta_{X_i}\in O_t    \bigg] =    \liminf_{n\to\infty}\frac{1}{n}\log\mathbb{P}\bigg[     W_2\bigg(\frac{1}{n}\sum_{i=1}^n\delta_{X_i},\mu\bigg)>t    \bigg].\]

But as the function g_n is n^{-1/2}-Lipschitz, dimension-free concentration implies

    \[\mathbb{P}\bigg[     W_2\bigg(\frac{1}{n}\sum_{i=1}^n\delta_{X_i},\mu\bigg)>t    \bigg] =    \mathbb{P}[     g_n(X_1,\ldots,X_n)>t    ] \le e^{-n(t-\mathbb{E}[g_n(X_1,\ldots,X_n)])^2/2\sigma^2}.\]

Thus we have

    \[-\inf_{\nu\in O_t} D(\nu || \mu)  \le    -\limsup_{n\to\infty} \frac{(t-\mathbb{E}[g_n(X_1,\ldots,X_n)])^2}{2\sigma^2}    = -\frac{t^2}{2\sigma^2},\]

where we have used \mathbb{E}[g_n(X_1,\ldots,X_n)]=\mathbb{E}[W_2(\frac{1}{n}\sum_{i=1}^n\delta_{X_i},\mu)]\to 0. In particular, we have proved

    \[\sqrt{2\sigma^2 D(\nu || \mu)}  \ge    t\quad\text{whenever}\quad W_2(\mu,\nu)>t>0.\]

The quadratic transportation cost inequality follows readily (let t=W_2(\mu,\nu)-\varepsilon and \varepsilon\downarrow 0). \square

It remains to establish the claims used in the proof. Let us begin with the Lipschitz property of g_n.

Lemma. (Claim 3) The map g_n:(x_1,\ldots,x_n)\mapsto W_2(\frac{1}{n}\sum_{i=1}^n\delta_{x_i},\mu) is n^{-1/2}-Lipschitz.

Proof. Let \mathbb{M}\in C(\frac{1}{n}\sum_{i=1}^n \delta_{x_i},\mu). If we define \mu_i = \mathbb{M} [Y\in\,\cdot\, | X=x_i], then we evidently have

    \[\mathbb{E_M}[f(X,Y)]=\frac{1}{n}\sum_{i=1}^n\int f(x_i,y)\,\mu_i(dy),\qquad\quad    \frac{1}{n}\sum_{i=1}^n\mu_i=\mu.\]

Conversely, every family of measures \mu_1,\ldots,\mu_n such that \frac{1}{n}\sum_{i=1}^n\mu_i=\mu defines a coupling \mathbb{M}\in C(\frac{1}{n}\sum_{i=1}^n \delta_{x_i},\mu) in this manner. We can therefore estimate as follows:

    \begin{align*}         &W_2\bigg(\frac{1}{n} \sum_{i=1}^n\delta_{x_i}, \mu\bigg)         - W_2\bigg(\frac{1}{n} \sum_{i=1}^n\delta_{\tilde x_i}, \mu\bigg)\\         &= \inf_{\frac{1}{n}\sum_{i=1}^n \mu_i = \mu}         \bigg[\frac{1}{n} \sum_{i=1}^n \int d(x_i,y)^2 \mu_i(dy)\bigg]^{1/2}         - \inf_{\frac{1}{n}\sum_{i=1}^n \mu_i = \mu}         \bigg[\frac{1}{n} \sum_{i=1}^n \int d(\tilde x_i,y)^2 \mu_i(dy)\bigg]^{1/2} \\         &\le \sup_{\frac{1}{n}\sum_{i=1}^n \mu_i = \mu}         \left\{         \bigg[\frac{1}{n} \sum_{i=1}^n \int d(x_i,y)^2 \mu_i(dy)\bigg]^{1/2}         -         \bigg[\frac{1}{n} \sum_{i=1}^n \int d(\tilde x_i,y)^2 \mu_i(dy)\bigg]^{1/2}         \right\}\\         &\le \sup_{\frac{1}{n}\sum_{i=1}^n \mu_i = \mu}         \bigg[\frac{1}{n} \sum_{i=1}^n \int          \{d(x_i,y)-d(\tilde x_i,y)\}^2 \mu_i(dy)\bigg]^{1/2} \\         &\le \frac{1}{\sqrt{n}} \bigg[\sum_{i=1}^n d(x_i,\tilde x_i)^2\bigg]^{1/2}, \end{align*}

where in the last two lines we used, respectively, the reverse triangle inequality for L^2 norms (that is, \| X \|_2 - \| Y \|_2 \le \| X - Y \|_2) and the reverse triangle inequality for the metric d. \square

Next, we prove lower-semicontinuity of W_2. This is an exercise in using weak convergence.

Lemma. (Claim 2) \nu\mapsto W_2(\nu,\mu) is lower-semicontinuous in the weak convergence topology.

Proof. Let \nu_1,\nu_2,\ldots be probability measures such that \nu_n\to \nu weakly as n\to\infty. We must show

    \[\liminf_{n\to\infty}W_2(\nu_n,\mu)\ge W_2(\nu,\mu).\]

Fix \varepsilon>0, and choose for each n a coupling \mathbb{M}_n\in\mathcal{C}(\nu_n,\mu) such that

    \[W_2(\nu_n,\mu) \ge \sqrt{\mathbb{E}_{\mathbb{M}_n}[d(X,Y)^2]}-\varepsilon.\]

We claim that the sequence (\mathbb{M}_n)_{n\ge 1} is tight. Indeed, the sequence (\nu_n)_{n\ge 1} is tight (as it converges) and clearly \mu itself is tight. For any \delta>0, choose a compact set K_\delta such that \nu_n(K_\delta)\ge 1-\delta/2 for all n\ge 1 and \mu(K_\delta)\ge 1-\delta/2. Then evidently \mathbb{M}_n(K_\delta\times K_\delta) \ge 1-\delta, and thus tightness follows.

By tightness, we can choose a subsequence n_k\uparrow\infty such that \liminf_{n}W_2(\nu_n,\mu)=\lim_kW_2(\nu_{n_k},\mu) and \mathbb{M}_{n_k}\to\mathbb{M} weakly for some probability measure \mathbb{M}. As d is continuous and nonnegative, we obtain

    \[\liminf_{n\to\infty} W_2(\nu_n,\mu) \ge    \lim_{k\to\infty}\sqrt{\mathbb{E}_{\mathbb{M}_{n_k}}[d(X,Y)^2]} - \varepsilon \ge    \sqrt{\mathbb{E}_{\mathbb{M}}[d(X,Y)^2]} - \varepsilon.\]

But as \mathbb{M}\in\mathcal{C}(\nu,\mu), we have shown \liminf_{n\to\infty} W_2(\nu_n,\mu) \ge W_2(\nu,\mu)-\varepsilon. We conclude using \varepsilon\downarrow 0. \square

Finally, we prove the law of large numbers in W_2. This is an exercise in truncation.

Lemma. (Claim 1) Suppose that \mu satisfies the Lipschitz concentration property. Then the law of large numbers holds in the sense \mathbb{E}[W_2(\frac{1}{n}\sum_{i=1}^n\delta_{X_i},\mu)]\to 0 as n\to\infty for X_1,X_2,\ldots i.i.d. \sim\mu.

Proof. Let x^*\in\mathbb{X} be some arbitrary point. We truncate the Wasserstein distance as follows:

    \begin{align*}    W_2(\mu,\nu)^2    &= \inf_{\mathbb{M}\in\mathcal{C}(\mu,\nu)}\{    \mathbb{E_M}[d(X,Y)^2\mathbf{1}_{d(X,Y)\le a}] +    \mathbb{E_M}[d(X,Y)^2\mathbf{1}_{d(X,Y)>a}] \} \\    &\le a\inf_{\mathbb{M}\in\mathcal{C}(\mu,\nu)}\mathbb{E_M}[d(X,Y)\wedge a] +    \frac{4\int d(x,x^*)^3\{\mu(dx)+\nu(dx)\}}{a} \end{align*}

where we used (b+c)^3 \le 4(b^3 + c^3) for b,c\ge 0. We claim that if \nu_n\to\mu weakly, then

    \[\inf_{\mathbb{M}\in\mathcal{C}(\nu_n,\mu)}\mathbb{E_M}[d(X,Y)\wedge a]\xrightarrow{n\to\infty}0.\]

Indeed, by the Skorokhod representation theorem, we can construct random variables (X_n)_{n\ge 1} and X on a common probability space such that X_n\sim\nu_n for every n, X\sim\mu, and X_n\to X a.s. Thus \mathbb{E}[d(X_n,X)\wedge a]\to 0 by bounded convergence, and as \mathrm{Law}(X_n,X)\in\mathcal{C}(\nu_n,\mu) the claim follows. Thus \nu_n\to\mu implies W_2(\nu_n,\mu)\to 0 if we can control the second term in the above truncation.

Denote the empirical measure as \mu_n=\frac{1}{n}\sum_{i=1}^n\delta_{X_i}. Recall that \mu_n\to\mu weakly a.s. by the law of large numbers. Therefore, following the above reasoning, we obtain

    \[\limsup_{n\to\infty}\mathbb{E}[W_2(\mu_n,\mu)^2]     \le     \frac{8\int d(x,x^*)^3\mu(dx)}{a}\]

for every a>0. Thus it remains to show that \int d(x,x^*)^3\mu(dx)<\infty. But as x\mapsto d(x,x^*) is evidently Lipschitz (with constant 1), this follows directly from the following Lemma. \square

Finally, we have used in the last proof the following lemma, which shows that if \mu satisfies the Lipschitz concentration property then any Lipschitz function has all finite moments. In particular, every Lipschitz function has finite mean, which means that the qualifier “with finite mean” used above in our definition of (dimension-free) concentration is superfluous and can therefore be dropped.

Lemma. Suppose that the probability measure \mu satisfies the Lipschitz concentration property. Then any Lipschitz function f satisfies \int |f(x)|^q\,\mu(dx)<\infty for every 0< q<\infty.

Proof. Let f be L-Lipschitz. It suffices to prove that |f| has finite mean. If this is the case, then the Lipschitz concentration property implies for every 0<q<\infty that

    \[\int |f(x)|^q\,\mu(dx) =    q\int_0^\infty x^{q-1}\,\mathbb{P}[|f|\ge x]\,dx \le    q\int_0^\infty x^{q-1}\,e^{-(x-\mathbb{E}|f|)_+^2/2\sigma^2L^2}\,dx<\infty,\]

where (y)_+:=\max(y,0) and we note that |f| is Lipschitz with the same constant as f. To prove that |f| has finite mean, let us apply the Lipschitz concentration property to -\{|f|\wedge a\} (which certainly has finite mean). This gives

    \[\mathbb{P}[|f|\wedge a\le \mathbb{E}(|f|\wedge a)-t]\le e^{-t^2/2\sigma^2L^2}.\]

Now choose t such that e^{-t^2/2\sigma^2L^2}<1/2. Then clearly \mathbb{E}(|f|\wedge a)-t\le\mathrm{Med}(|f|\wedge a). But note that the median \mathrm{Med}(|f|\wedge a)=\mathrm{Med}|f| for a>\mathrm{Med}|f|. Thus we obtain \mathbb{E}|f|\le \mathrm{Med}|f|+t<\infty as a\uparrow\infty. \square
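To see the law of large numbers in W_2 (Claim 1) in action, recall (as will be used in Talagrand's proof below) that optimal transportation on \mathbb{R} is solved by the quantile coupling, so that W_2(\nu,\mu)^2=\int_0^1(F_\nu^{-1}(u)-F_\mu^{-1}(u))^2\,du in terms of quantile functions. The following Python snippet (a minimal numerical sketch, not part of the proof) uses this formula to estimate \mathbb{E}[W_2(\mu_n,\mu)] for \mu=N(0,1) and increasing n:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    def w2_empirical_vs_gaussian(x, grid=2000):
        """W_2 between the empirical measure of the points x and N(0,1), computed via the
        one-dimensional quantile formula W_2^2 = int_0^1 (F_n^{-1}(u) - Phi^{-1}(u))^2 du."""
        x = np.sort(x)
        n = len(x)
        u = (np.arange(grid) + 0.5) / grid              # midpoint rule on (0,1)
        emp_quantile = x[np.minimum((u * n).astype(int), n - 1)]
        return np.sqrt(np.mean((emp_quantile - norm.ppf(u)) ** 2))

    for n in [10, 100, 1000, 10000]:
        est = np.mean([w2_empirical_vs_gaussian(rng.standard_normal(n))
                       for _ in range(50)])
        print(f"n = {n:6d}   E[W_2(mu_n, mu)] ~ {est:.3f}")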

Gaussian concentration

We started our discussion of dimension-free concentration with the classical Gaussian concentration property of Tsirelson, Ibragimov, and Sudakov. It therefore seems fitting to conclude by giving a proof of this result using the machinery that we have developed: we only need to show that the standard normal N(0,1) on (\mathbb{R},|\cdot|) satisfies the quadratic transportation cost inequality. [It should be noted that there are numerous other proofs of Gaussian concentration, each with their own interesting ideas.]

Proposition. Let \mu=N(0,1) on (\mathbb{R},|\cdot|). Then W_2(\nu,\mu)\le\sqrt{2D(\nu||\mu)} for all \nu\ll\mu.

This result is due to Talagrand (1996). Talagrand’s proof exploits the fact that optimal transportation problems on \mathbb{R} admit an explicit solution in terms of quantile functions, which allows one to establish inequalities on \mathbb{R} by calculus manipulations. In contrast, optimal transportation problems on \mathbb{R}^d for d\ge 2 are far from trivial (see the excellent introductory and comprehensive texts by Villani). We therefore see that tensorization is really the key to a tractable proof by Talagrand’s method.

Instead of going down this road, we will present a lovely short proof of the transportation-information inequality due to Djellout, Guillin, and Wu (2004) that uses stochastic calculus.

Proof. Denote by \mathbb{P} the law of standard Brownian motion (W_t)_{t\in[0,1]}. Fix a probability measure \nu\ll\mu such that D(\nu||\mu)<\infty, and define the probability measure \mathbb{Q} as

    \[d\mathbb{Q} = \frac{d\nu}{d\mu}(W_1)\,d\mathbb{P}.\]

Then clearly W_1\sim\mu under \mathbb{P} and W_1\sim\nu under \mathbb{Q}.

Note that M_t=\mathbb{E}[\frac{d\nu}{d\mu}(W_1)|\mathcal{F}_t]=(\frac{d\nu}{d\mu}*\phi_{1-t})(W_t) is a uniformly integrable martingale and M_t>0 for 0\le t<1 (here \phi_s denotes the density of N(0,s)). Thus we find that

    \[\frac{d\nu}{d\mu}(W_1) = \exp\bigg(         \int_0^1 \beta_t\,dW_t -\frac{1}{2}\int_0^1 \beta_t^2\,dt      \bigg)\]

for some nonanticipating process (\beta_t)_{t\in[0,1]} by the martingale representation theorem and Itô's formula. But then Girsanov's theorem implies that the stochastic process defined by

    \[Y_t := W_t - \int_0^t \beta_s\,ds\]

is Brownian motion under \mathbb{Q}. Thus the law of (W_1,Y_1) under \mathbb{Q} is a coupling of \nu and \mu, and

    \[W_2(\mu,\nu)^2 \le \mathbb{E_Q}[|W_1-Y_1|^2] \le      \mathbb{E_Q}\bigg[\int_0^1 \beta_t^2\,dt\bigg]\]

by Jensen's inequality. The proof is therefore complete once we prove that

    \[\mathbb{E_Q}\bigg[\int_0^1 \beta_t^2\,dt\bigg] = 2D(\nu||\mu).\]

To see this, note that

    \[D(\nu||\mu) = \mathbb{E_Q}\bigg[\log\frac{d\nu}{d\mu}(W_1)\bigg]     = \mathbb{E_Q}\bigg[         \int_0^1 \beta_t\,dY_t + \frac{1}{2}\int_0^1 \beta_t^2\,dt     \bigg].\]

If \mathbb{E_Q}[\int_0^1 \beta_t^2dt]<\infty then the dY_t integral is a square-integrable martingale; thus its expectation vanishes and the proof is complete. However, it is not difficult to show using a simple localization argument that D(\nu||\mu)<\infty implies \mathbb{E_Q}[\int_0^1 \beta_t^2dt]<\infty, see Lemma 2.6 of Föllmer for a careful proof. \square

Remark. To be fair, it should be noted that the above stochastic calculus proof works just as easily in \mathbb{R}^d for any d. Thus we could directly establish the transportation-information inequality (and therefore concentration) in any dimension in this manner without going through the tensorization argument.
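To make the objects appearing in the proof concrete, consider \nu=N(m,1), so that \frac{d\nu}{d\mu}(x)=e^{mx-m^2/2}. Then M_t=\mathbb{E}[e^{mW_1-m^2/2}|\mathcal{F}_t]=e^{mW_t-m^2t/2}, so that \beta_t\equiv m and Y_t=W_t-mt. In this case

    \[\mathbb{E_Q}\bigg[\int_0^1\beta_t^2\,dt\bigg]=m^2=2D(\nu||\mu),\qquad     \mathbb{E_Q}[|W_1-Y_1|^2]=m^2,\]

and since W_2(\mu,\nu)=|m| for these two Gaussians, both inequalities in the proof hold with equality.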

Lecture by Ramon van Handel | Scribed by Patrick Rebeschini

15. December 2013 by Ramon van Handel
Categories: Information theoretic methods

Lecture 9. Concentration, information, transportation (1)

The goal of the next two lectures is to explore the connections between concentration of measure, entropy inequalities, and optimal transportation.

What is concentration?

Roughly speaking, concentration of measure is the idea that nonlinear functions of many random variables often have “nice tails”. Concentration represents one of the most important ideas in modern probability and has played an important role in many other fields, such as statistics, computer science, combinatorics, and geometric analysis.

To illustrate concentration, let X\sim N(0,1). Using Markov’s inequality, we can estimate for t\ge 0

    \[\mathbb{P}[X\ge t]          = \inf_{\lambda>0} \mathbb{P}[e^{\lambda X}\ge e^{\lambda t}]         \le \inf_{\lambda>0} e^{-\lambda t} \mathbb{E}[e^{\lambda X}]         = \inf_{\lambda>0} e^{\lambda^2/2 - \lambda t}         = e^{-t^2/2},\]

where we have made the optimal choice \lambda=t (this is called a Chernoff bound). Thus a Gaussian random variable X has “nice tails”. The concentration of measure phenomenon shows that not only do Gaussian random variables have nice tails, but that many nonlinear functions of Gaussian random variables still have nice tails. The classical result on this topic is the following.

Theorem. (Gaussian concentration, Tsirelson-Ibragimov-Sudakov 1976)
Let X_1,\ldots,X_n be i.i.d. N(0,1), and let f:\mathbb{R}^n\rightarrow\mathbb{R} be L-Lipschitz, that is, |f(x)-f(y)|\le L \| x - y \| for each x,y\in\mathbb{R}^n (where \|\cdot\| is the Euclidean norm). Then

    \[\mathbb{P}[f(X_1,\ldots,X_n)-\mathbb{E} f(X_1,\ldots,X_n) \ge t] \le e^{-t^2/2L^2}.\]

This result shows that Lipschitz functions (which could be very nonlinear) of i.i.d. Gaussian random variables concentrate closely around their expected value: the probability that the function differs from its expected value decays as a Gaussian in the degree of the discrepancy. The beauty of this result is that it is dimension-free, that is, the rate of decay of the tail only depends on the Lipschitz constant L, and there is no dependence on the number of random variables n. Such results are essential in high-dimensional problems where one would like to obtain dimension-free bounds.
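As a quick numerical illustration (a minimal Monte Carlo sketch, not part of the original argument), take the 1-Lipschitz function f(x)=\|x\| and observe that the size of the fluctuations of f(X_1,\ldots,X_n) around its mean does not grow with n, in agreement with the dimension-free tail bound:

    import numpy as np

    rng = np.random.default_rng(0)

    # f(x) = ||x||_2 is 1-Lipschitz, so Gaussian concentration predicts
    # P[f - E f >= t] <= exp(-t^2/2) regardless of the dimension n.
    t = 1.0
    for n in [10, 100, 1000, 10000]:
        f = np.linalg.norm(rng.standard_normal((2000, n)), axis=1)
        dev = f - f.mean()
        print(f"n = {n:6d}   std(f) = {dev.std():.3f}   "
              f"P[f - Ef >= {t}] = {np.mean(dev >= t):.3f}   bound = {np.exp(-t**2/2):.3f}")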

Gaussian concentration is only one result in a theory with many fascinating ideas and questions. One might ask, for instance, what random variables beside Gaussians exhibit this type of phenomenon, whether other notions of Lipschitz functions can be considered, etc. See, for example, the excellent books by Ledoux and Boucheron, Lugosi, and Massart for much more around this theme.

The basic question that we want to address in the next two lectures is the following.

Question. Let X_1,\ldots,X_n be i.i.d. random variables on a metric space (\mathbb{X},d) with distribution \mu. Can we characterize all \mu for which dimension-free concentration holds as for the Gaussian case above?

It turns out that a remarkably complete answer can be given in terms of entropy inequalities. This is where information-theoretic methods enter the picture.

When bounding tails of random variables (provided they are at least exponential), it is convenient to bound moment generating functions as we did above instead of working directly with the tail.

Definition. X is subgaussian with parameter \sigma^2 if \mathbb{E}[e^{\lambda (X-\mathbb{E}X)}] \le e^{\lambda^2\sigma^2/2} for each \lambda\in\mathbb{R}.

If X is subgaussian, then \mathbb{P}[X-\mathbb{E}X \ge t]\le e^{-t^2/2\sigma^2} (by the Chernoff bound as above). In fact, it can be shown that the converse is also true for slightly larger \sigma, so that the property of having Gaussian tails is equivalent to the subgaussian property (we omit the proof). It will be convenient to investigate the property of Gaussian tails in terms of subgaussianity.

Concentration and relative entropy

Before we can tackle the problem of dimension-free concentration, we must begin by making the connection between subgaussianity and entropy in the most basic setting.

Let (\mathbb{X},d) be a metric space. A function f:\mathbb{X}\rightarrow \mathbb{R} is L-Lipschitz (L-Lip) if |f(x)-f(y)|\le L d(x,y) for each x,y\in\mathbb{X}. One thing that we can do with Lipschitz functions is to define a distance between probability measures (we will assume in the sequel that the necessary measurability conditions are satisfied): for probability measures \mu,\nu on \mathbb{X}, define the Wasserstein distance as

    \[W_1(\mu,\nu) := \sup_{f\in 1\text{-Lip}} \left| \int f d\mu - \int f d\nu \right|.\]

The idea is that two measures are close if the expectations of a large class of functions are close. In the case of W_1, the class of functions being used is the class 1\text{-Lip}.

As we are interested in concentration of Lipschitz functions, it is intuitive that a quantity such as W_1 should enter the picture. On the other hand, we have seen in earlier lectures that the relative entropy D(\nu || \mu) can also be viewed as a “sort of distance” between probability measures (albeit not a metric). It is not clear, a priori, how W_1 and D are related. We will presently see that relative entropy is closely related to moment generating functions, and therefore to tails of random variables: in particular, we can characterize concentration on a fixed space by comparing the Wasserstein metric and relative entropy.

Proposition. The following are equivalent for X\sim \mu:

  1. \mathbb{E}[e^{\lambda\{f(X)-\mathbb{E}f(X)\}}] \le e^{\lambda^2\sigma^2/2} for every \lambda>0 and f\in 1\text{-Lip}.
  2. W_1(\nu,\mu) \le \sqrt{2\sigma^2 D(\nu||\mu)} for every probability measure \nu\ll\mu.

Note that this result characterizes those measures \mu on a fixed metric space (\mathbb{X},d) that exhibit Gaussian concentration. There is no independence as of yet, and thus no notion of “dimension-free” concentration for functions of independent random variables: the present result is in “fixed dimension”.

Example. Let d(x,y)=\mathbf{1}_{x\neq y} be the trivial metric. A function f is 1-Lip with respect to d if

    \[| f(x) - f(y) | \le \mathbf{1}_{x\neq y}         \quad\text{for each } x,y\in\mathbb{X},\]

that is, if and only if \mathrm{osc}\,f:=\sup f - \inf f \le 1. Hence we have

    \[W_1(\mu,\nu) = \sup_{f:\mathrm{osc}\,f \le 1} \left| \int f d\mu - \int f d\nu \right|          = \|\mu-\nu\|_{\rm TV}.\]

Thus 2 in the above proposition holds with \sigma^2=\frac{1}{4} by Pinsker’s inequality

    \[\|\mu-\nu\|_{\rm TV} \le \sqrt{\tfrac{1}{2}D(\nu||\mu)}.\]

We consequently find by 1 above that

    \[\mathbb{E}[e^{\lambda\{f(X)-\mathbb{E}f(X)\}}] \le e^{\lambda^2/8}\]

for every function f such that \mathrm{osc}\,f \le 1. Thus the above Proposition reduces in this case to the well known Hoeffding lemma, which states that bounded random variables are subgaussian.
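The content of this example is easy to check numerically. The following sketch verifies Pinsker's inequality \|\mu-\nu\|_{\rm TV}\le\sqrt{\tfrac{1}{2}D(\nu||\mu)} (with the convention \|\mu-\nu\|_{\rm TV}=\sup_A|\mu(A)-\nu(A)| used above) for Bernoulli measures over a grid of parameters:

    import numpy as np

    def tv(p, q):
        # sup_A |mu(A) - nu(A)| for mu = Bernoulli(p), nu = Bernoulli(q)
        return abs(p - q)

    def kl(q, p):
        # relative entropy D(Bernoulli(q) || Bernoulli(p)), natural logarithm
        return sum(0.0 if a == 0 else a * np.log(a / b)
                   for a, b in [(q, p), (1 - q, 1 - p)])

    worst = max(tv(p, q) - np.sqrt(0.5 * kl(q, p))
                for p in np.linspace(0.05, 0.95, 19)
                for q in np.linspace(0.01, 0.99, 99))
    print("max of TV - sqrt(D/2) over the grid:", worst)   # never positive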

Let us turn to the proof of the Proposition. The first observation is a classic result that connects relative entropy with moment generating functions. It dates back to the very beginnings of statistical mechanics (see the classic treatise by J. W. Gibbs (1902), Theorem III, p. 131).

Lemma. (Gibbs variational principle) Let Z be any random variable. Then

    \[\log \mathbb{E}_{\mathbb{P}}[e^Z]         = \sup_{\mathbb{Q} \ll \mathbb{P}}         \{         \mathbb{E}_{\mathbb{Q}}[Z] - D(\mathbb{Q} || \mathbb{P})         \}.\]

Proof. Assume that Z is bounded above by some constant Z\le M <\infty (otherwise replace Z by \min\{Z,M\} and then let M\uparrow\infty at the end). Define a probability measure \mathbb{\tilde Q} by

    \[d \mathbb{\tilde Q} = \frac{e^Z d\mathbb{P}}{\mathbb{E}_{\mathbb{P}}[e^Z]}.\]

Then

    \[\log\mathbb{E}_{\mathbb{P}}[e^Z] - D(\mathbb{Q} || \mathbb{\tilde Q})         = \log\mathbb{E}_{\mathbb{P}}[e^Z]          - \mathbb{E}_{\mathbb{Q}}\left[\log\frac{d \mathbb{Q}}{d \mathbb{P}}\right]         + \mathbb{E}_{\mathbb{Q}}\left[\log\frac{d \mathbb{\tilde Q}}{d \mathbb{P}}\right]         = \mathbb{E}_{\mathbb{Q}}[Z] - D(\mathbb{Q} || \mathbb{P}).\]

As the relative entropy is always nonnegative, we have

    \[\log\mathbb{E}_{\mathbb{P}}[e^Z]          \ge \mathbb{E}_{\mathbb{Q}}[Z] - D(\mathbb{Q} || \mathbb{P})\]

for every \mathbb{Q}\ll\mathbb{P}, and equality is obtained by choosing the optimizer \mathbb{Q}=\mathbb{\tilde Q}. \square
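On a finite alphabet the variational principle can be verified directly. The sketch below (an illustration only) compares \log\mathbb{E}_{\mathbb{P}}[e^Z] with \mathbb{E}_{\mathbb{Q}}[Z]-D(\mathbb{Q}||\mathbb{P}) at the tilted measure \mathbb{\tilde Q} and at randomly chosen measures \mathbb{Q}:

    import numpy as np

    rng = np.random.default_rng(1)

    p = rng.dirichlet(np.ones(5))          # a probability vector P on 5 points
    z = rng.normal(size=5)                 # values of the random variable Z

    def objective(q):
        # E_Q[Z] - D(Q||P), with the convention 0 log 0 = 0
        mask = q > 0
        return np.dot(q, z) - np.sum(q[mask] * np.log(q[mask] / p[mask]))

    lhs = np.log(np.dot(p, np.exp(z)))                   # log E_P[e^Z]
    q_tilde = p * np.exp(z) / np.dot(p, np.exp(z))       # the tilted measure

    print("log E_P[e^Z]         :", lhs)
    print("objective at Q_tilde :", objective(q_tilde))  # equals lhs
    print("max over random Q    :", max(objective(rng.dirichlet(np.ones(5)))
                                        for _ in range(10000)))   # <= lhs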

Using the variational principle, it is easy to prove the Proposition.

Proof of the Proposition. By the variational principle, we have

    \[\mathbb{E}_{\mathbb{P}}[e^{\lambda\{f(X)-\mathbb{E}f(X)\}}] \le e^{\lambda^2\sigma^2/2}\]

if and only if

    \[\lambda\, \mathbb{E}_{\mathbb{Q}}[f(X)] - \lambda\, \mathbb{E}_{\mathbb{P}}[f(X)]         - D(\mathbb{Q} || \mathbb{P}) \le \frac{\lambda^2\sigma^2}{2}\]

for all \mathbb{Q}\ll\mathbb{P}. Optimizing over \lambda, we find that 1 is equivalent to the validity of

    \[\mathbb{E}_{\mathbb{Q}}[f(X)] - \mathbb{E}_{\mathbb{P}}[f(X)]         \le \sqrt{2\sigma^2 D(\mathbb{Q} || \mathbb{P})}\]

for all f\in 1\text{-Lip} and \mathbb{Q}\ll\mathbb{P}. \square

Tensorization and optimal transport

So far we have considered concentration in a fixed metric space (\mathbb{X},d). If X_1,\ldots,X_n are independent random variables, we can certainly apply the Proposition to X=(X_1,\ldots,X_n) with the product distribution \mu^{\otimes n}. However, to establish dimension-free concentration, we would have to check that the conditions of the Proposition hold for \mu^{\otimes n} for every n with the same constant \sigma! This is hardly a satisfactory answer: we would like to characterize dimension-free concentration directly in terms of a property of \mu only. To this end, a natural conjecture might be that if the conditions of the Proposition hold for the measure \mu, then that will already imply the same property for the measures \mu^{\otimes n} for every n. This turns out not to be quite true, but this idea will lead us in the right direction.

Motivated by the above, we set out to answer the following

Question. Suppose that \mu satisfies W_1(\nu,\mu) \le \sqrt{2\sigma^2 D(\nu||\mu)} for every \nu\ll\mu. Does this imply that a similar property is satisfied by the product measures \mu^{\otimes n}?

Such a conclusion is often referred to as a tensorization property. To make progress in this direction, we must understand the classic connection between Wasserstein distances and optimal transportation.

Theorem. (Kantorovich-Rubinstein duality, 1958) Let \mu and \nu be probability measures on a Polish space. Let \mathcal{C}(\mu,\nu):=\{\mathrm{Law}(X,Y): X\sim\mu, Y\sim\nu \} be the set of couplings of \mu and \nu. Then

    \[W_1(\mu,\nu) = \inf_{\mathbb{M}\in \mathcal{C}(\mu,\nu)} \mathbb{E}_\mathbb{M}[d(X,Y)].\]

The right side of this equation is a so-called “optimal transport problem”. For this reason, inequalities such as W_1(\nu,\mu) \le \sqrt{2\sigma^2 D(\nu||\mu)} are often called transportation-information inequalities.

The full proof of Kantorovich-Rubinstein duality is part of the theory of optimal transportation and is beyond our scope (optimal transportation is itself a fascinating topic with many connections to other areas of mathematics such as probability theory, PDEs, and geometric analysis—perhaps a topic for another semester?). Fortunately, we will only need the easy half of the theorem in the sequel.

Proof of lower bound. For each f\in 1\text{-Lip} and \mathbb{M}\in \mathcal{C}(\mu,\nu), we have

    \[\mathbb{E}_\mu [f] - \mathbb{E}_\nu [f]          = \mathbb{E}_\mathbb{M}[f(X)-f(Y)]          \le \mathbb{E}_\mathbb{M}[d(X,Y)],\]

from which we immediately get

    \[W_1(\mu,\nu) \le \inf_{\mathbb{M}\in \mathcal{C}(\mu,\nu)} \mathbb{E}_\mathbb{M}[d(X,Y)].\]

This proves the easy direction in the above theorem. \square
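Both sides of the duality can be computed explicitly for small discrete examples. The sketch below (an illustration assuming SciPy; it is not needed anywhere in the theory) solves the optimal transport linear program for two measures supported on finitely many points of the real line and compares the value with SciPy's one-dimensional W_1:

    import numpy as np
    from scipy.optimize import linprog
    from scipy.stats import wasserstein_distance

    # two discrete measures on points of the real line
    x = np.array([0.0, 1.0, 3.0])
    y = np.array([0.5, 2.0, 2.5, 4.0])
    mu = np.array([0.2, 0.5, 0.3])
    nu = np.array([0.25, 0.25, 0.25, 0.25])

    cost = np.abs(x[:, None] - y[None, :])      # d(x_i, y_j) = |x_i - y_j|
    m, n = cost.shape

    # linear program over couplings M: minimize <cost, M>
    # subject to row sums = mu, column sums = nu, M >= 0
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0        # row-sum constraints
    for j in range(n):
        A_eq[m + j, j::n] = 1.0                 # column-sum constraints
    b_eq = np.concatenate([mu, nu])

    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    print("optimal transport cost :", res.fun)
    print("1D W_1 (quantile form) :", wasserstein_distance(x, y, mu, nu))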

It turns out that the optimal transportation approach is the “right” way to tensorize transportation-information inequalities. Even though the following result is not quite yet what we need to prove dimension-free concentration, it already suffices to derive some interesting results.

Proposition. (Tensorization) Suppose that

    \[\inf_{\mathbb{M}\in \mathcal{C}(\nu,\mu)} \mathbb{E}_{\mathbb{M}}[d(X,Y)]^2          \le 2\sigma^2 D(\nu || \mu)\]

for all \nu\ll\mu. Then, for any n\ge 1,

    \[\inf_{\mathbb{M}\in \mathcal{C}(\mathbb{Q},\mu^{\otimes n})}          \sum_{i=1}^n \mathbb{E}_{\mathbb{M}}[d(X_i,Y_i)]^2         \le 2\sigma^2 D(\mathbb{Q} || \mu^{\otimes n})\]

for all \mathbb{Q} \ll\mu^{\otimes n}.

We postpone the proof of this result until the next lecture.

Example. Let d(x,y) = \mathbf{1}_{x\neq y}. By Pinsker’s inequality

    \[\inf_{\mathbb{M}\in \mathcal{C}(\mu,\nu)}         \mathbb{P}_{\mathbb{M}}[X\neq Y]^2 = \|\mu-\nu\|_{\rm TV}^2         \le \frac{1}{2} D(\nu || \mu)         \quad\mbox{for all }\nu \ll \mu.\]

Define the weighted Hamming distance for positive weights c_i as

    \[d_n(x,y) = \sum_{i=1}^n c_i \mathbf{1}_{x_i\neq y_i}.\]

Then, by Cauchy-Schwarz and tensorization we get

    \[\inf_{\mathbb{M}\in \mathcal{C}(\mathbb{Q},\mu^{\otimes n})}         \mathbb{E}_{\mathbb{M}}[d_n(X,Y)]^2        \le        \left(\sum_{i=1}^n c_i^2\right) \inf_{\mathbb{M}\in \mathcal{C}(\mathbb{Q},\mu^{\otimes n})}\sum_{i=1}^n          \mathbb{P}_{\mathbb{M}}[X_i\neq Y_i]^2         \le \frac{1}{2} \left(\sum_{i=1}^n c_i^2\right) D(\mathbb{Q} || \mu^{\otimes n})\]

for each \mathbb{Q}\ll\mu^{\otimes n}. So, we have

    \[\mathbb{E}[e^{\lambda\{f(X_1,\ldots,X_n)-\mathbb{E}f(X_1,\ldots,X_n)\}}]         \le e^{\lambda^2\sigma^2/2},\]

with \sigma^2=\frac{1}{4} \sum_{i=1}^n c_i^2, for each \lambda>0 and each function f which is 1-Lip with respect to d_n. This implies

    \[\mathbb{P}[f(X_1,\ldots,X_n)-\mathbb{E}f(X_1,\ldots,X_n) \ge t]          \le e^{-2t^2/\sum_{i=1}^n c_i^2}.\]

That is, we recover the well known bounded difference inequality.
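For instance, if X_1,\ldots,X_n are i.i.d. fair coin flips and f(x)=\frac{1}{n}\sum_{i=1}^nx_i (which is 1-Lip with respect to d_n with c_i=1/n), the bound reads \mathbb{P}[f-\mathbb{E}f\ge t]\le e^{-2nt^2}. A quick Monte Carlo check (an illustrative sketch only):

    import numpy as np

    rng = np.random.default_rng(0)

    n, trials, t = 100, 200000, 0.1
    flips = rng.integers(0, 2, size=(trials, n))
    f = flips.mean(axis=1)                      # 1-Lipschitz w.r.t. d_n with c_i = 1/n
    empirical = np.mean(f - 0.5 >= t)           # E f = 1/2 for fair coins
    bound = np.exp(-2 * n * t ** 2)
    print(f"P[f - Ef >= {t}] ~ {empirical:.4f}   bounded-difference bound = {bound:.4f}")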

Outlook

We have not yet shown that the transportation-information inequality holds for X\sim N(0,1) on (\mathbb{R},|\cdot|). Even once we establish this, however, the tensorization result we have given above is not sufficient to prove dimension-free Gaussian concentration in the sense of Tsirelson-Ibragimov-Sudakov. Indeed, if we apply the above tensorization result, then at best we can get

    \[\mathbb{P}[f(X_1,\ldots,X_n)-\mathbb{E}f(X_1,\ldots,X_n) \ge t]          \le e^{-t^2/2\sum_{i=1}^n c_i^2}\]

whenever

    \[|f(x)-f(y)|\le \sum_{i=1}^n c_i |x_i-y_i|.\]

Setting the weights c_i=1, we find a tail bound of the form e^{-t^2/2n} whenever f is 1\text{-Lip} with respect to the \ell_1-norm |f(x)-f(y)|\le\|x-y\|_1. Note that this is not dimension-free: the factor 1/n appears inside the exponent! On the other hand, Gaussian concentration shows that we have a dimension-free tail bound e^{-t^2/2} whenever f is 1\text{-Lip} with respect to the \ell_2-norm |f(x)-f(y)|\le\|x-y\|_2. Note that the latter property is strictly stronger than the former because \|\cdot\|_1\le\sqrt{n}\,\|\cdot\|_2! Our tensorization method is not sufficiently strong, however, to yield this type of dimension-free result.

Fortunately, we now have enough ingredients to derive a slightly stronger transportation-information inequality that is not only sufficient, but also necessary for dimension-free concentration. Stay tuned!

Lecture by Ramon van Handel | Scribed by Patrick Rebeschini

11. December 2013 by Ramon van Handel
Categories: Information theoretic methods

Lecture 8. Entropic cone and matroids

This lecture introduces the notion of the entropic cone and its connection with entropy inequalities.

Entropic cone

Recall that if X is a discrete random variable with distribution P_X, the entropy of X is defined as

    \[H(X) = H(P_X) = -\sum_{x\in\mathrm{supp}(P_X)}P_X(x)\log_2 P_X(x).\]

Now let (X_1,\ldots,X_m)\sim \mu be (not necessarily independent) discrete random variables. It will be convenient to define the entropy function

    \[H(S) := H(X_S) = H(\mu|_{S}) \quad\mbox{where}\quad S\subseteq[m]:=     \{1,\ldots,m\},\quad X_S:=\{X_i:i\in S\}.\]

As the entropy depends only on the probabilities of the outcomes of a random variable and not on its values, we will assume without loss of generality in the sequel that X_1,\ldots,X_m take values in \mathbb{Z}_+.

For any probability \mu on \mathbb{Z}_+^m, let \vec{H_\mu} be the (2^m-1)-dimensional vector

    \[\vec{H_\mu} := \{H(S):S\subseteq[m],~S\ne\varnothing\}.\]

We can now define the entropic cone.

Definition. The entropic cone is the set

    \[\Gamma_m^* := \{\vec{H_\mu}:\mu\mbox{ a probability on }\mathbb{Z}_+^m\}   \subset\mathbb{R}^{2^m-1},\]

i.e., the set of all vectors that can be obtained as the entropy function of m variables.

Question. Can we characterize \Gamma_m^*?

Examples.

  1. If m=1, then \Gamma_1^*=\mathbb{R}_+, as there exist random variables of arbitrary (nonnegative) entropy.
  2. If m=2, the vector \vec{H_\mu} for (X_1,X_2)\sim\mu is

        \[\vec{H_\mu} = \left( \begin{array}{c} H(X_1) \\ H(X_2) \\ H(X_1,X_2) \end{array} \right).\]

    What constraints must this vector satisfy? We must certainly have

        \[H(X_1) \vee H(X_2) \leq H(X_1,X_2) \leq H(X_1)+H(X_2),\]

    which follows from the chain rule together with positivity for the first inequality, and that conditioning reduces entropy for the second. Are these the only constraints? In fact, we can create many vectors given these constraints. For example:

    • The vector (\alpha,\alpha,\alpha)^T is obtained by taking Z such that H(Z)=\alpha, and letting X_1=X_2=Z.
    • The vector (\alpha,\alpha,2\alpha)^T is obtained by taking X_1,X_2 to be i.i.d. copies of Z.
    • Convex combinations of these vectors can be obtained by taking mixtures of their distributions.

    In fact, a careful analysis shows that the above constraints completely characterize the entropic cone for m=2: that is,

        \[\Gamma_2^* = \{(x,y,z)\in\mathbb{R}_+^3: x\vee y\le z\le x+y\}.\]

  3. If m=3, what happens? Here we are still subject to the same constraints as in the case m=2, but we pick up some new constraints as well, such as

        \[H(X_1,X_2) + H(X_2,X_3) \ge H(X_1,X_2,X_3) + H(X_2)\]

    which is equivalent to the inequality H(X_1|X_2) \ge H(X_1|X_2,X_3).

Evidently, the constraints on the entropic cone correspond to entropy inequalities. What type of inequalities must hold for any given m? Let us think about this question a bit more systematically.

Definition. A set function f:2^{[m]}\to\mathbb{R} is called a polymatroid if

  1. f(\varnothing)=0, f(S)\ge 0 for all S.
  2. f(S)\le f(T) when S\subseteq T (monotonicity).
  3. f(S\cup T)+f(S\cap T)\le f(S)+f(T) (submodularity).

It is not difficult to check that

Lemma. The entropy function H is a polymatroid.

Other examples of polymatroids: (1) entropy, mutual information; (2) max-weight f(S)=\max_{x\in S}w(x) for given weights w; (3) flows; (4) cuts (not monotone, but submodular); (5) Von Neumann entropy.
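The polymatroid property of the entropy function is also easy to test numerically. The following sketch (an illustration only) draws random joint distributions of three binary random variables and checks monotonicity and submodularity of S\mapsto H(X_S):

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    m = 3

    def entropy(p, S):
        """Entropy (in bits) of the marginal of the variables in S,
        for a joint pmf p given as an m-dimensional array."""
        other = tuple(i for i in range(p.ndim) if i not in S)
        q = p.sum(axis=other).ravel()
        q = q[q > 0]
        return float(-(q * np.log2(q)).sum())

    subsets = [frozenset(S) for k in range(m + 1)
               for S in itertools.combinations(range(m), k)]

    for _ in range(1000):
        p = rng.dirichlet(np.ones(2 ** m)).reshape((2,) * m)
        H = {S: entropy(p, S) for S in subsets}
        for S in subsets:
            for T in subsets:
                if S <= T:
                    assert H[S] <= H[T] + 1e-9                      # monotonicity
                assert H[S | T] + H[S & T] <= H[S] + H[T] + 1e-9    # submodularity
    print("monotonicity and submodularity hold for all sampled distributions")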

Let us define

    \[\Gamma_m = \{ \mbox{polymatroid vectors on }[m]\}\subset\mathbb{R}^{2^m-1}.\]

Evidently \Gamma_m is a polyhedral cone (the intersection of a finite number of half-spaces).

Theorem.

  1. \Gamma_3^*\ne \Gamma_3.
  2. \mathrm{cl}\,\Gamma_3^*=\Gamma_3.
  3. \mathrm{cl}\,\Gamma_m^*\ne\Gamma_m for m\ge 4.
  4. \mathrm{cl}\,\Gamma_m^* is a convex cone for all m\ge 1 (but not polyhedral for m\ge 4).

For the proof, we refer to the book by R. Yeung.

As an example of \Gamma_3^*\ne \Gamma_3, we note that the vector (\alpha,\alpha,\alpha,2\alpha,2\alpha,2\alpha,2\alpha)^T is only achievable when \alpha=\log_2M for some M\in\mathbb{Z}_+. Let us also note that convexity of \Gamma_3^* holds in its interior but can fail on the boundary. However, such issues only arise on the boundary of \Gamma_3^*; that is, no new entropy inequalities appear beyond the polymatroid inequalities when m=3.

On the other hand, when m\ge 4, many new inequalities are introduced that actually cause holes to appear within \Gamma_m^*. One such inequality (expressed in terms of mutual information) is

    \[2I(X_3;X_4)\leq I(X_1;X_2) + I(X_1;X_3,X_4) + 3I(X_3;X_4 \mid X_1) + I(X_3;X_4 \mid X_2).\]

Using the entropic cone

The entropic cone can be used to obtain new information theoretic inequalities. For example, the following question arose in the study of mechanism design in computer science.

Problem. Let X_1,\ldots,X_m be discrete random variables. Define the m\times m matrix A as A_{ij}=I(X_i;X_j)=H(X_i)+H(X_j)-H(X_i,X_j). Is A positive semidefinite?

When m=2, we have \det(A)=H(X_1)H(X_2)-I(X_1 ; X_2)^2 \geq 0 since H(X_1)\wedge H(X_2) \geq I(X_1;X_2). Thus all the principal minors of A have nonnegative determinant, so A is positive semidefinite.

How about m=3? Note that A depends linearly on the entries of the vector \vec{H_\mu} (where \mu is the joint distribution of X_1,X_2,X_3). Thus if A is positive semidefinite for entropy vectors on the extreme rays (ER) of the cone \Gamma_3=\mathrm{cl}\,\Gamma_3^*, then A must be positive semidefinite for any distribution. More generally:

Proposition. If F : \mathbb{R}_+^7 \rightarrow \mathbb{R} is concave, then F(\vec{H}) \geq 0 holds for all \vec H \in \Gamma_3^* if and only if F(\vec H_i) \geq 0 for all \vec H_i \in ER(\Gamma_3).

Proof. It suffices to note that

    \[F(\lambda H_i + (1-\lambda) H_j) \geq \lambda F(H_i) + (1-\lambda)F(H_j) \geq 0\]

for every 0\le\lambda\le 1, and to use that \Gamma_3=\mathrm{cl}\,\Gamma_3^*. \square

This necessary and sufficient condition for m=3 generalizes to a sufficient condition for m\ge 4.

Proposition. If F : \mathbb{R}_+^{2^m-1} \rightarrow \mathbb{R} is concave, then F(\vec{H}) \geq 0 holds for all \vec H \in \Gamma_m^* if F(\vec H_i) \geq 0 for all \vec H_i \in ER(\Gamma_m).

As \Gamma_m is polyhedral, this simplifies solving problems such as checking positive semidefiniteness of A significantly: it suffices to check a finite number of cases, which can essentially be done by hand. It can be shown this way that A is always positive semidefinite for m=3, but this can fail for m\ge 4.
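For m=3 the conclusion can also be checked by sampling: the sketch below (a numerical illustration, not a proof) builds the matrix A_{ij}=I(X_i;X_j) from random joint distributions of three binary variables and verifies that its smallest eigenvalue stays nonnegative:

    import numpy as np

    rng = np.random.default_rng(0)

    def H(q):
        q = q.ravel()
        q = q[q > 0]
        return float(-(q * np.log2(q)).sum())

    min_eig = np.inf
    for _ in range(2000):
        p = rng.dirichlet(np.ones(8)).reshape(2, 2, 2)   # joint pmf of (X_1, X_2, X_3)
        marg = [p.sum(axis=tuple(k for k in range(3) if k != i)) for i in range(3)]
        A = np.empty((3, 3))
        for i in range(3):
            for j in range(3):
                pair = p.sum(axis=tuple(k for k in range(3) if k not in (i, j)))
                A[i, j] = H(marg[i]) + H(marg[j]) - H(pair)   # I(X_i;X_j); equals H(X_i) if i = j
        min_eig = min(min_eig, np.linalg.eigvalsh(A).min())
    print("smallest eigenvalue encountered:", min_eig)   # nonnegative up to rounding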

Matroids

It is sometimes of interest to investigate discrete variants of the above questions.

Definition. A matroid f : 2^{[m]} \rightarrow \mathbb{Z}_+ is defined by the conditions

  1. f(S)\le |S| for all S;
  2. f is monotone;
  3. f is submodular.

Examples.

  1. Vector matroids (also called \mathbb{F}-representable matroids, where \mathbb{F} is a field). Given a matrix A with values in \mathbb{F}, define f(S) = rank(A(S)) where A(S) denotes the columns induced by S.
  2. Graphical matroids. Let G=(V,E) be a graph, and choose m=|E|. Then define f(S) = \max|H|, where the maximum is taken over acyclic edge sets H \subseteq S.
  3. Entropic matroids. Let f(S)=H(X_S). For what distributions of X_1,\ldots,X_m is this a matroid?

Denote by \tilde\Gamma^*_{m} the entropic cone induced by those distributions where X_1,\ldots,X_m take values in \{0,1\}. In order for such a distribution \mu to define an entropic matroid, the vector \vec{H_\mu} must take values in \mathbb{Z}^{2^m-1}. Thus we are led to consider the set

    \[\Pi_{m} := \tilde\Gamma^*_{m} \cap \mathbb{Z}^{2^m-1}.\]

Can one characterize what type of matroids can arise as entropic matroids?

Theorem. \Pi_m coincides with the set of all GF(2)-representable matroids with m elements.
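A simple example: let X_1,X_2 be i.i.d. uniform bits and X_3=X_1\oplus X_2. Then

    \[H(S) = \min(|S|,2)\quad\text{for every }S\subseteq[3],\]

which is precisely the rank function of the matroid represented over GF(2) by the three vectors e_1, e_2, e_1+e_2.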

For the proof and further results around this theme, see the paper of E. Abbe.

Lecture by Emmanuel Abbe | Scribed by Danny Gitelman

06. December 2013 by Ramon van Handel
Categories: Information theoretic methods

Lecture 7. Entropic CLT (4)

This lecture completes the proof of the entropic central limit theorem.

From Fisher information to entropy (continued)

In the previous lecture, we proved the following:

Theorem. If X_1, ..., X_n are independent random variables with I(X_i)<+\infty, then

(1)   \begin{equation*} I\left(\sum_{i\in[n]}X_{i}\right)\leq\sum_{s\in\mathcal{G}}\frac{\omega_{s}^{2}}{\beta_{s}}\, I\left(\sum_{i\in s}X_{i}\right), \end{equation*}

where \{\omega_{s}: s\in\mathcal{G}\} are weights satisfying \sum_{s\in\mathcal{G}}\omega_{s}=1 and \{\beta_{s}\colon s\in\mathcal{G}\} is any fractional packing of the hypergraph G in the sense that \sum_{s\in\mathcal{G}, s\ni i}\beta_{s}\leq 1 for every i\in[n]=\{1,...,n\}.

By optimizing the choice of the weights, it is easy to check that (1) is equivalent to

    \begin{equation*} \frac{1}{I\left(\sum_{i\in[n]}X_{i}\right)}\geq \sum_{s\in\mathcal{G}}\frac{\beta_{s}}{I\left(\sum_{i\in s}X_{i}\right)}. \end{equation*}

In fact, an analogous statement holds for entropy:

(2)   \begin{equation*} e^{2h\left(\sum_{i\in[n]}X_{i}\right)}\geq \sum_{s\in\mathcal{G}}\beta_{s}\, e^{2h\left(\sum_{i\in s}X_{i}\right)}. \end{equation*}

Proving (2) for general fractional packings and general hypergraphs requires some additional steps that we will avoid here. Instead, we will prove a special case, due to Artstein, Ball, Barthe and Naor (2004), that suffices to resolve the question of monotonicity of entropy in the CLT.

In the following, let \mathcal{G} be the set of all subsets of [n] of size n-1 and take \beta_{s}=\frac{1}{n-1} for every s\in\mathcal{G}. In this case, (2) takes the following form.

Theorem EPI. If X_1, ..., X_n are independent random variables, then

    \begin{equation*} e^{2h\left(\sum_{i\in[n]}X_{i}\right)}\geq \frac{1}{n-1}\sum_{i=1}^{n}e^{2h\left(\sum_{j\neq i}X_{j}\right)}, \end{equation*}

provided that all the entropies exist.
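For n=2, Theorem EPI is the classical Shannon entropy power inequality e^{2h(X_1+X_2)}\ge e^{2h(X_1)}+e^{2h(X_2)}, which is saturated by independent Gaussians: if X_i\sim N(0,\sigma_i^2), then e^{2h(X_i)}=2\pi e\,\sigma_i^2 and

    \begin{equation*} e^{2h(X_1+X_2)}=2\pi e\,(\sigma_1^2+\sigma_2^2)=e^{2h(X_1)}+e^{2h(X_2)}. \end{equation*}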

To prove Theorem EPI, we will use the de Bruijn identity, which was discussed in the previous lecture. Let us restate it in a form that will be useful below, and give a proof.

Theorem (de Bruijn identity). Let X be a random variable with density f on \mathbb{R}. Let X^{t}=e^{-t}X+\sqrt{1-e^{-2t}}Z, t\geq 0, where Z\sim N(0,1) is independent of X. We have

    \begin{equation*} \frac{d}{dt}h(X^{t})=I(X^{t})-1. \end{equation*}

In particular,

(3)   \begin{equation*} h(N(0,1))-h(X)=\int_{0}^{\infty}[I(X^{t})-1]dt. \end{equation*}

Proof. Let f_{t} be the density of X^{t}. Then

    \begin{equation*} \frac{\partial f_{t}(x)}{\partial t}=(Lf_{t})(x), \end{equation*}

where (L\psi)(x)=\psi^{\prime\prime}(x)+\frac{d}{dx}[x\psi(x)]. It is easy to check that \int (Lf_{t})(x)dx=0. Hence,

    \begin{equation*} \begin{split} \frac{d}{dt}\left[-\int f_{t}(x)\log f_{t}(x)dx\right]&=-\int \frac{\partial f_{t}(x)}{\partial t}\log f_{t}(x)dx-\int f_{t}(x)\frac{1}{f_{t}(x)}\frac{\partial f_{t}(x)}{\partial t}dx\\ &=-\int (Lf_{t})(x)\log f_{t}(x)dx-\int (Lf_{t})(x)dx\\ &=-\int f_{t}^{\prime\prime}(x)\log f_{t}(x)dx-\int \frac{\partial}{\partial x}[xf_{t}(x)]\log f_{t}(x)dx\\ &=\int \frac{(f_{t}^{\prime}(x))^{2}}{f_{t}(x)}dx+\int xf_{t}^{\prime}(x)dx\\ &=I(X^{t})-\int f_{t}(x)dx\\ &=I(X^{t})-1. \end{split} \end{equation*}

Integrating from 0 to \infty gives (3). \square
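As a sanity check, consider X\sim N(0,\sigma^2). Then X^{t}\sim N(0,\sigma_t^2) with \sigma_t^2=e^{-2t}\sigma^2+1-e^{-2t}, so that I(X^{t})=1/\sigma_t^2 and h(X^{t})=\frac{1}{2}\log(2\pi e\,\sigma_t^2). Differentiating,

    \begin{equation*} \frac{d}{dt}h(X^{t})=\frac{1}{2\sigma_t^2}\frac{d\sigma_t^2}{dt}=\frac{e^{-2t}(1-\sigma^2)}{\sigma_t^2}=\frac{1-\sigma_t^2}{\sigma_t^2}=I(X^{t})-1, \end{equation*}

in agreement with the de Bruijn identity.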

Proof of Theorem EPI. In (1), let \mathcal{G} be the set of all subsets of [n] of size n-1 and \beta_{s}=\frac{1}{n-1}. We have

(4)   \begin{equation*} I\left(\sum_{i\in[n]}X_{i}\right)\leq(n-1)\sum_{i\in[n]}w_{i}^{2}\, I\left(\sum_{j\neq i}X_{j}\right), \end{equation*}

where w_{i}=\omega_{[n]\setminus\{i\}} is the weight corresponding to the set [n]\setminus\{i\}\in\mathcal{G}. It is easy to check that (4) is equivalent, for every (a_{1},...,a_{n})\in\mathbb{R}^{n} with \sum_{i=1}^{n}a_{i}^{2}=1, to

    \begin{equation*} I\left(\sum_{i=1}^{n}a_{i}X_{i}\right)\leq(n-1)\sum_{i=1}^{n}w_{i}^{2}\, I\left(\sum_{j\neq i}a_{j}X_{j}\right)=(n-1)\sum_{i=1}^{n}\frac{w_{i}^{2}}{\sum_{j\neq i}a_{j}^{2}}\, I\left(\frac{\sum_{j\neq i}a_{j}X_{j}}{\sqrt{\sum_{j\neq i}a_{j}^{2}}}\right). \end{equation*}

Note the last equality is due to the scaling property of Fisher information. Recalling that w_{1},...,w_{n} are arbitrarily chosen weights with \sum_{i=1}^{n}w_{i}=1, we conclude that

(5)   \begin{equation*} I\left(\sum_{i=1}^{n}a_{i}X_{i}\right)\leq(n-1)\sum_{i=1}^{n}b_{i}^{2}\, I\left(\frac{\sum_{j\neq i}a_{j}X_{j}}{\sqrt{\sum_{j\neq i}a_{j}^{2}}}\right), \end{equation*}

for any choice of (a_1,...,a_n)\in\mathbb{R}^n with \sum_{i=1}^{n}a_{i}^{2}=1 and (b_1,...,b_n)\in \mathbb{R}^{n}_{+} with \sum_{i=1}^{n}b_{i}\sqrt{\sum_{j\neq i}a_{j}^{2}}=1.

Let us keep the choice of such (a_1,...,a_n) arbitrary for the moment and let b_{i}=\frac{1}{n-1}\sqrt{\sum_{j\neq i}a_{j}^{2}}. Then

(6)   \begin{equation*}\sum_{i=1}^{n}b_{i}\sqrt{\sum_{j\neq i}a_{j}^{2}}=\sum_{i=1}^{n}\frac{1}{n-1}\sum_{j\neq i}a_{j}^{2}=\frac{1}{n-1}\,(n-1)\sum_{i=1}^{n}a_{i}^{2}=1. \end{equation*}

Let X_{i}^{t}=e^{-t}X_{i}+\sqrt{1-e^{-2t}}Z_{i}, where Z_{i} is independent of all else. Then

    \begin{equation*}I\left(\left(\sum_{i=1}^{n}a_{i}X_{i}\right)^{t}\right)=I\left(\sum_{i=1}^{n}a_{i}X_{i}^{t}\right)\leq \frac{1}{n-1}\sum_{i=1}^{n}\left(\sum_{j\neq i}a_{j}^{2}\right)\, I\left(\frac{\sum_{j\neq i}a_{j}X_{j}^{t}}{\sqrt{\sum_{j\neq i}a_{j} ^{2}}}\right). \end{equation*}

Note that

    \begin{equation*} \begin{split}\frac{\sum_{j\neq i}a_{j}X_{j}^{t}}{\sqrt{\sum_{j\neq i}a_{j}^{2}}}&=\frac{\sum_{j\neq i}a_{j}\left(e^{-t}X_{j}+\sqrt{1-e^{-2t}}Z_{j}\right)}{\sqrt{\sum_{j\neq  i}a_{j}^{2}}}\\&=e^{-t}\cdot\frac{\sum_{j\neq i}a_{j}X_{j}}{\sqrt{\sum_{j\neq i}a_{j}^{2}}}+\sqrt{1-e^{-2t}}\cdot\frac{\sum_{j\neq i}a_{j}Z_{j}}{\sqrt{\sum_{j\neq i}a_{j}^{2}} }\\&\stackrel{(D)}{=}\left(\frac{\sum_{j\neq i}a_{j}X_{j}}{\sqrt{\sum_{j\neq i}a_{j }^{2}}}\right)^{t}, \end{split} \end{equation*}

where the equality in distribution is due to the fact that \frac{\sum_{j\neq i} a_{j}Z_{j}}{\sqrt{\sum_{j\neq i}a_{j}^{2}}}\sim N(0,1). Hence, by (6),

    \begin{equation*}I\left(\left(\sum_{i=1}^{n}a_{i}X_{i}\right)^{t}\right)-1\leq \frac{1}{n-1}\sum_{i=1}^{n}\left(\sum_{j\neq i}a_{j}^{2}\right)\left[I\left(\left(\frac{\sum_ {j\neq i}a_{j}X_{j}}{\sqrt{\sum_{j\neq i}a_{j}^{2}}}\right)^{t}\right)-1\right]. \end{equation*}

Integrating this over t from 0 to \infty and using the de Bruijn identity, we get

    \begin{equation*}h(N(0,1))-h\left(\sum_{i=1}^{n}a_{i}X_{i}\right)\leq\frac{1}{n-1}\sum_{i=1}^{n}\left(\sum_{j\neq i}a_{j}^{2}\right)\left[h(N(0,1))-h\left(\frac{\sum_{j\neq  i} a_{j}X_{j}}{\sqrt{\sum_{j\neq i}a_{j}^{2}}}\right)\right]. \end{equation*}

By (6) again,

(7)   \begin{equation*}h\left(\sum_{i=1}^{n}a_{i}X_{i}\right)\geq \frac{1}{n-1}\sum_{i=1}^{n}\left(\sum_{j\neq i}a_{j}^{2}\right)h\left(\frac{\sum_{j\neq i} a_{j}X_{j}}{\sqrt{\sum_{j\neq i}a_{j}^{2}}}\right) \end{equation*}

This is the entropy analog of the Fisher information inequality obtained above.

As a final step, let N_{i}=e^{2h\left(\sum_{j\neq i}X_{j}\right)}. We are to show

    \[e^{2h\left(\sum_{i=1}^{n}X_{i}\right)}\geq \frac{1}{n-1}\sum_{i=1}^{n}N_{i}.\]

If N_{k}\geq \frac{1}{n-1}\sum_{i=1}^{n}N_{i} for some k, the result is immediate due to the general fact h(Y_1+Y_2)\geq h(Y_1) (convolution increases entropy). Hence, we assume that N_{k}\leq \frac{1}{n-1}\sum_{i=1}^{n}N_{i} for every k, so

    \begin{equation*} \lambda_{k}:=\frac{N_{k}}{\sum_{i=1}^{n}N_{i}}\leq \frac{1}{n-1}. \end{equation*}

Set a_{i}=\sqrt{1-(n-1)\lambda_{i}}, i\in[n]. Note that

    \begin{equation*} \sum_{i=1}^{n}a_{i}^{2}=\sum_{i=1}^{n}1-(n-1)\sum_{i=1}^{n}\lambda_{i}=n-(n-1)=1. \end{equation*}

With this choice of a_1,...,a_n, we apply (7). After some computations, we get

    \begin{equation*} h\left(\sum_{i=1}^{n}X_{i}\right)\geq \sum_{i=1}^{n}\lambda_{i}\,h\left(\frac{1}{\sqrt{(n-1)\lambda_{i}}}\sum_{j\neq i}X_{j}\right). \end{equation*}

Using the definition of the \lambda_{i}, the final result follows from here immediately. \square

Proof of the Entropic Central Limit Theorem

Entropic CLT. Suppose that X_1, ..., X_n are i.i.d. random variables with \mathbb{E} X_{i}=0 and \text{Var}X_{i}=1. Let S_{n}=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}X_{i}. If h(S_{n})>-\infty for some n, then h(S_{n})\uparrow h(N(0,1)) as n\rightarrow\infty. Equivalently, if D(\mathcal{L}(S_{n})||N(0,1))<+\infty for some n, then D(\mathcal{L}(S_{n})||N(0,1))\downarrow 0 as n\rightarrow\infty.

In the case of i.i.d. random variables, Theorem EPI gives

    \begin{equation*} e^{2h\left(\sum_{i=1}^{n}X_{i}\right)}\geq \frac{n}{n-1}e^{2h\left(\sum_{j=1}^{n-1}X_{j}\right)}, \end{equation*}

which is equivalent to h(S_n)\geq h(S_{n-1}) because of the identity h(aX)=h(X)+\log|a| (so that e^{2h(\sum_{i=1}^{n}X_{i})}=n\,e^{2h(S_{n})}). Therefore, it remains to show that h(S_{n})\rightarrow h(N(0,1)) as n\rightarrow \infty. For this, we need some analytical properties of the relative entropy. Henceforth, we use \mathcal{P}(\mathcal{X}) to denote the set of all probability measures on a Polish space \mathcal{X} (take \mathcal{X}=\mathbb{R}^d for instance).

Proposition (Variational characterization of D). Let \gamma,\theta\in\mathcal{P}(\mathcal{X}). Then

    \begin{equation*} D(\gamma||\theta)=\sup_{g\in\text{BM}(\mathcal{X})}\left(\int gd\gamma-\log\int e^{g}d\theta\right), \end{equation*}

where \text{BM}(\mathcal{X}) denotes the set of all bounded measurable functions g:\mathcal{X}\rightarrow\mathbb{R}.
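
As a quick sanity check of this formula, here is a small numerical illustration on a finite space (a sketch; the two probability vectors below are arbitrary choices): every test function g gives a lower bound on D(\gamma||\theta), and g=\log(d\gamma/d\theta) attains the supremum.

    import numpy as np

    rng = np.random.default_rng(0)
    gamma = rng.random(6); gamma /= gamma.sum()   # an arbitrary probability vector
    theta = rng.random(6); theta /= theta.sum()   # another one (everywhere positive)

    D = np.sum(gamma * np.log(gamma / theta))     # relative entropy computed directly

    def functional(g):
        # g -> int g dgamma - log int e^g dtheta
        return np.sum(g * gamma) - np.log(np.sum(np.exp(g) * theta))

    # random test functions never exceed D ...
    print(all(functional(rng.standard_normal(6)) <= D + 1e-12 for _ in range(1000)))
    # ... and the choice g = log(dgamma/dtheta) attains it
    print(D, functional(np.log(gamma / theta)))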

Remark.

  1. In the variational characterization of D above, it is enough to take the supremum over the set \text{BC}(\mathcal{X}) of all bounded continuous functions g:\mathcal{X}\rightarrow\mathbb{R}.
  2. For fixed g\in\text{BM}(\mathcal{X}), the mapping (\gamma,\theta)\mapsto\int gd\gamma-\log\int e^{g}d\theta is convex and continuous. Since D(\gamma||\theta) is the supremum over a class of convex, continuous functions, it is convex and lower semicontinuous. These properties of relative entropy, made transparent by the variational characterization, are very useful in many different contexts.

Corollary. Sublevel sets of D(\cdot||\theta) are compact, that is, the set \{\mu\in\mathcal{P}(\mathcal{X})\colon D(\mu||\theta)\leq M\} is compact (with respect to the topology of weak convergence) for every M<+\infty and \theta\in\mathcal{P}(\mathcal{X}).

Before we prove the corollary, let us recall the definition of tightness and Prohorov's theorem.

Definition (Tightness). A set A\subseteq\mathcal{P}(\mathcal{X}) is called tight if for every \varepsilon>0 there exists a compact set K\subseteq\mathcal{X} such that \mu(K^{c})<\varepsilon for every \mu\in A.

Prohorov Theorem. A set A\subseteq\mathcal{P}(\mathcal{X}) is weakly precompact if and only if it is tight.

Proof of Corollary. Let (\mu_{n})_{n\in\mathbb{N}} be a sequence in \mathcal{P}(\mathcal{X}) with \sup_{n\in\mathbb{N}}D(\mu_{n}||\theta)\leq M. By the variational characterization of D, for every g\in\text{BM}(\mathcal{X}), we have

(8)   \begin{equation*} \int gd\mu_{n}-\log \int e^{g}d\theta\leq D(\mu_{n}||\theta)\leq M. \end{equation*}

Note that \{\theta\} is a tight set as a singleton. We claim that the sequence (\mu_{n})_{n\in\mathbb{N}} is also tight. Indeed, let \varepsilon>0 and let K\subseteq\mathcal{X} be a compact set with \theta(K^{c})<\varepsilon. We take g=\log\left(1+\frac{1}{\varepsilon}\right)1_{K^{c}} and apply (8) to get

    \begin{equation*} \log\left(1+\frac{1}{\varepsilon}\right)\mu_{n}(K^{c})-\log\left[1+\frac{1}{\varepsilon}\,\theta(K^{c})\right]\leq M. \end{equation*}

Hence,

    \begin{equation*} \mu_{n}(K^{c})\leq\frac{1}{\log\left(1+\frac{1}{\varepsilon}\right)}\left(M+\log\left[1+\frac{1}{\varepsilon}\,\theta(K^{c})\right]\right)\leq \frac{M+\log 2}{\log\left(1+\frac{1}{\varepsilon}\right)}, \end{equation*}

where the rightmost term can be made arbitrarily small (uniformly in n) by choosing \varepsilon small. Hence, (\mu_{n})_{n\in\mathbb{N}} is tight. By Prohorov's theorem, there exist a subsequence (\mu_{n_{k}})_{k\in\mathbb{N}} and \mu\in\mathcal{P}(\mathcal{X}) such that \mu_{n_{k}}\Rightarrow\mu as k\rightarrow\infty. By lower semicontinuity of D, we have

    \begin{equation*} D(\mu||\theta)\leq \liminf_{k\rightarrow\infty}D(\mu_{n_{k}}||\theta)\leq M. \end{equation*}

This finishes the proof of the corollary. \square

Proof of Variational Characterization of D. Suppose first that \gamma\not\ll\theta. Then D(\gamma||\theta)=+\infty, and there exists a Borel set A\subseteq \mathcal{X} with \theta(A)=0 and \gamma(A)>0. Let g_{n}=n\cdot 1_{A}\in\text{BM}(\mathcal{X}). We have

    \begin{equation*} \int g_{n}d\gamma-\log\int e^{g_{n}}d\theta=n\gamma(A)\rightarrow+\infty \end{equation*}

as n\rightarrow\infty. Hence, both sides of the variational characterization are equal to +\infty.

For the rest, we assume \gamma\ll\theta. First, we show the \geq part. If D(\gamma||\theta)=+\infty, the inequality is obvious. Suppose D(\gamma||\theta)<+\infty. Given g\in\text{BM}(\mathcal{X}), define a probability measure \gamma_{0}\sim\theta by

    \begin{equation*} \frac{d\gamma_{0}}{d\theta}=\frac{e^{g}}{\int e^{g}d\theta}. \end{equation*}

So

    \begin{equation*} \begin{split} D(\gamma||\theta)&=\int \log\left(\frac{d\gamma}{d\theta}\right)d\gamma\\ &=\int \log\left(\frac{d\gamma}{d\gamma_{0}}\right)d\gamma+\int\log\left(\frac{d\gamma_{0}}{d\theta}\right)d\gamma\\ &=D(\gamma||\gamma_{0})+\int\left(g-\log\int e^{g}d\theta\right)d\gamma\\ &\geq \int gd\gamma-\log\int e^{g}d\theta. \end{split} \end{equation*}

Since g\in\text{BM}(\mathcal{X}) is chosen arbitrarily, taking supremum on the right hand side gives the \geq part.

Next, we prove the \leq part. Note that if g=\log\frac{d\gamma}{d\theta}, then \int gd\gamma-\log\int e^{g}d\theta=\int \log\frac{d\gamma}{d\theta}d\gamma=D(\gamma||\theta). However, this choice of g may not be in \text{BM}(\mathcal{X}), that is, \frac{d\gamma}{d\theta} may fail to be bounded or bounded away from zero. So we employ the following truncation argument. Let

    \begin{equation*} g_n=\min\{\max\{g,-n\},n\}\in\text{BM}(\mathcal{X}), \end{equation*}

so that g_n\rightarrow g as n\rightarrow\infty. Note that g_{n}1_{\{g\geq 0\}}\uparrow g1_{\{g\geq 0\}} and g_{n}1_{\{g< 0\}}\downarrow g1_{\{g< 0\}}. Thus we have

    \begin{equation*} \begin{split} \lim_{n\rightarrow\infty}\int e^{g_{n}}d\theta&=\lim_{n\rightarrow\infty}\left(\int e^{g_{n}}1_{\{g\geq 0\}}d\theta+\int e^{g_{n}}1_{\{g< 0\}}d\theta\right)\\ &=\lim_{n\rightarrow\infty}\left(\int e^{\min\{g,n\}}1_{\{g\geq 0\}}d\theta+\int e^{\max\{g,-n\}}1_{\{g< 0\}}d\theta\right)\\ &=\int e^{g}1_{\{g\geq 0\}}d\theta+\int e^{g}1_{\{g< 0\}}d\theta\\ &=\int e^{g}d\theta\\ &=\int \frac{d\gamma}{d\theta}d\theta\\ &=1 \end{split} \end{equation*}

by monotone convergence in the first term and dominated convergence in the second. On the other hand, Fatou’s Lemma applies to the sequence (g_{n}), since g_{n}\geq -g^{-} and \int g^{-}d\gamma\leq e^{-1}<\infty (note that \int g^{-}d\gamma=\int \tfrac{d\gamma}{d\theta}\,\log^{-}\!\tfrac{d\gamma}{d\theta}\,d\theta and u\log^{-}u\leq e^{-1} for all u\geq 0); hence

    \begin{equation*} \liminf_{n\rightarrow\infty}\int g_{n}d\gamma\geq \int gd\gamma=D(\gamma||\theta). \end{equation*}

Hence,

    \begin{equation*} \liminf_{n\rightarrow\infty}\left(\int g_{n}d\gamma-\log\int e^{g_{n}}d\theta\right)\geq D(\gamma||\theta), \end{equation*}

from which the \leq part of the result follows. \square

Building on these now standard facts (whose exposition above follows that in the book of Dupuis and Ellis), Harremoes and Vignat (2005) gave a short proof of the desired convergence, which we will follow below. It relies on the fact that for uniformly bounded densities within the appropriate moment class, pointwise convergence implies convergence of entropies.

Lemma. If Y_1, Y_2, \ldots are random variables with \mathbb{E} Y_n =0, \mathbb{E} Y_{n}^{2}=1, and the corresponding densities f_1, f_2, \ldots are uniformly bounded with f_n\rightarrow f pointwise as n\rightarrow\infty for some density f, then h(f_n)\rightarrow h(f) and D(f_n||N(0,1))\rightarrow D(f||N(0,1)) as n\rightarrow\infty.

Proof. Recall that D(f||N(0,1))=h(N(0,1))-h(f) for f with mean 0 and variance 1. Since pointwise convergence of the densities implies weak convergence of the laws (by Scheffé's lemma), lower semicontinuity of D gives

    \begin{equation*} \limsup_{n\rightarrow\infty}h(f_n)\leq h(f). \end{equation*}

On the other hand, letting c=\sup_{n,x}f_{n}(x), we have

    \begin{equation*} h(f_n)=c\int \left(-\frac{f_n(x)}{c}\log\frac{f_n(x)}{c}\right)dx-\log c, \end{equation*}

and using Fatou’s Lemma,

    \begin{equation*}\liminf_{n\rightarrow\infty}h(f_n)\geq c\int \left(-\frac{f(x)}{c}\log\frac{f(x) }{c}\right)dx-\log c=-\int f(x)\log f(x)dx=h(f). \end{equation*}

Hence, h(f_n)\rightarrow h(f) as n\rightarrow\infty. \square

End of proof of Entropic CLT. Assume D(\mathcal{L}(S_N)||N(0,1))<+\infty. We will use J(X)=I(X)-1 to denote normalized Fisher information. For any t\in (0,\infty), the monotonicity of Fisher information applied to the i.i.d. variables X_i^t (note that S_n^t has the same distribution as n^{-1/2}\sum_{i=1}^{n}X_i^t, and that I(X_i^t)<\infty for t>0) gives J(S^{t}_{n})\leq J(S^{t}_{n-1}) for every n\geq 2. In particular, J(S^{t}_{n}) is nonincreasing in n and thus converges to some limit g(t)\geq 0 as n\rightarrow\infty, for every t\in(0,\infty). We want to show that g\equiv 0, since then we will get by Lebesgue’s dominated convergence theorem (the integrands for n>N being dominated by J(S^{t}_{N}), which is integrable as D(\mathcal{L}(S_N)||N(0,1))<+\infty) that

    \[D(\mathcal{L}(S_n)||N(0,1))= \int_0^\infty J(S^{t}_{n}) dt \rightarrow \int_0^\infty g(t) dt= 0\]

as n\rightarrow\infty. But since

    \[D(\mathcal{L}(S_n^u)||N(0,1))= \int_u^\infty J(S^{t}_{n}) dt \rightarrow \int_u^\infty g(t) dt\]

(the first identity uses the semigroup property (S_n^u)^s=S_n^{u+s}, and the convergence again follows by dominated convergence), it is enough to show that D(\mathcal{L}(S_n^u)||N(0,1))\rightarrow 0 for each u>0: this forces \int_u^\infty g(t)\,dt=0 for every u>0, and hence g=0 almost everywhere.

By the monotonicity property we have proved, we know that

    \[D(\mathcal{L}(S_n)||N(0,1)) \leq D(\mathcal{L}(S_N)||N(0,1)) <\infty\]

for any n>N. By compactness of sublevel sets of D, the sequence S_n must therefore have a subsequence S_{n_k} whose distribution converges weakly to a probability measure (let us call Z a random variable with this limiting measure as its distribution). For u>0, the smoothing caused by Gaussian convolution implies that the density of S_{n_k}^u converges pointwise to that of Z^u, and also that the density of S_{2n_k}^u converges pointwise to that of \frac{Z^u + \tilde{Z}^u}{\sqrt{2}}, where \tilde{Z} is an independent copy of Z; moreover, all of these densities are uniformly bounded by (2\pi(1-e^{-2u}))^{-1/2}. By the previous lemma

    \[D(\mathcal{L}(S_{n_k}^u)||N(0,1)) \rightarrow D(\mathcal{L}(Z^u)\|N(0,1))\]

as k\rightarrow \infty, and

    \[D(\mathcal{L}(S_{2n_k}^u)||N(0,1)) \rightarrow  D(\mathcal{L}(\tfrac{Z^u + \tilde{Z}^u}{\sqrt{2}})\|N(0,1)) ,\]

Since D(\mathcal{L}(S_n^u)||N(0,1)) is nonincreasing in n, both subsequences converge to the same limit, so that necessarily

    \[D(\mathcal{L}(\tfrac{Z^u + \tilde{Z}^u}{\sqrt{2}})\|N(0,1)) =D(\mathcal{L}(Z^u)\|N(0,1)) .\]

By the equality condition in the entropy power inequality, this can only happen if Z^u is Gaussian, which in turn implies that g(u)=J(Z^u)=0. \square

Lecture by Mokshay Madiman | Scribed by Cagin Ararat

05. December 2013 by Ramon van Handel
Categories: Information theoretic methods | Leave a comment

Schedule for rest of the semester

Here is the schedule for the remaining three lectures of the semester:

November 21: Emmanuel Abbe on information theoretic inequalities.

November 28: No lecture (Thanksgiving).

December 5 and 12: Ramon van Handel on transportation-information inequalities and concentration.

The remaining notes on the entropic CLT (Lecture 7) will be posted soon.

20. November 2013 by Ramon van Handel
Categories: Announcement | Leave a comment

Lecture 6. Entropic CLT (3)

In this lecture, we complete the proof of monotonicity of the Fisher information in the CLT, and begin developing the connection with entropy. The entropic CLT will be completed in the next lecture.

Variance drop inequality

In the previous lecture, we proved the following decomposition result for functions of independent random variables due to Hoeffding.

ANOVA decomposition. Let X_1, \ldots, X_n be independent random variables taking values in \mathbb{R}, and let \Psi: \mathbb{R}^n \rightarrow \mathbb{R} be such that \mathbb{E}[\Psi(X_1,\ldots,X_n)^2]< \infty. Then \Psi satisfies

    \begin{align*} \Psi(X_1,\ldots,X_n)&= \sum_{t \subseteq \{1,\ldots,n\}} \bar{E_t}[\Psi] \\ \text{where } \bar{E_t}[\Psi]&:=\bigg(\prod_{i \notin t} E_i \prod_{j \in t} (I-E_j)\bigg)\Psi \\ \text{and } E_{i} [\Psi] &:= \mathbb{E}[\Psi | X_1,\ldots,X_{i-1},X_{i+1},\ldots,X_n]. \end{align*}

Furthermore, \mathbb{E}[\bar{E_t}\Psi \cdot \bar{E_s}\Psi]=0 \text{ if } t\neq s.

Note that \bar{E_t}\Psi is a function only of X_t=(X_{i_1}, \ldots, X_{i_m}) (i_1<\ldots < i_m are the ordered elements of t).
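
As a concrete illustration of the decomposition (a small sketch: we take n=2 and X_1,X_2 independent and uniform on \{-1,+1\}, so that every conditional expectation is a finite average, and \Psi is an arbitrary table of values), one can verify the decomposition and the orthogonality relations numerically:

    import numpy as np

    # Psi on {-1,+1}^2, stored as a 2x2 table (row = value of X_1, column = value of X_2).
    rng = np.random.default_rng(1)
    Psi = rng.standard_normal((2, 2))

    def E1(F):   # integrate out X_1 (average over rows)
        return np.broadcast_to(F.mean(axis=0, keepdims=True), F.shape)

    def E2(F):   # integrate out X_2 (average over columns)
        return np.broadcast_to(F.mean(axis=1, keepdims=True), F.shape)

    terms = {                                 # \bar{E}_t Psi for the four subsets t of {1,2}
        (): E1(E2(Psi)),
        (1,): E2(Psi - E1(Psi)),
        (2,): E1(Psi - E2(Psi)),
        (1, 2): Psi - E1(Psi) - E2(Psi) + E1(E2(Psi)),
    }

    print(np.allclose(sum(terms.values()), Psi))              # the pieces sum back to Psi
    print(all(abs((terms[t] * terms[u]).mean()) < 1e-12       # E[bar{E}_t Psi * bar{E}_u Psi] = 0 for t != u
              for t in terms for u in terms if t != u))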

In the previous lecture, we proved subadditivity of the inverse Fisher information I^{-1}. The key part of the proof was the observation that the score function of the sum could be written as the conditional expectation of a sum of independent random variables, whose variance is trivially computed. This does not suffice, however, to prove monotonicity in the CLT. To do the latter, we need a more refined bound on the Fisher information in terms of overlapping subsets of indices. Following the same proof, the score function of the sum can be written as the conditional expectation of a sum of terms that are now no longer independent. To estimate the variance of this sum, we will use the following “variance drop lemma” whose proof relies on the ANOVA decomposition.

Lemma. Let U(X_1,\ldots,X_n)=\sum_{s \in \mathcal{G}} \Psi_s(X_s) where \Psi_s : \mathbb{R}^{|s|} \rightarrow \mathbb{R} and \mathcal{G} is some collection of subsets of \{1,\ldots,n\}. If X_1,\ldots, X_n are independent random variables with \mathbb{E} \Psi_s(X_s)^2 < \infty \text{ } \forall s \in \mathcal{G}, then

    \[\mathrm{Var}(U(X_1,\ldots, X_n)) \leq \sum_{s \in \mathcal{G}} \frac{1}{\beta_s}\,\mathrm{Var}(\Psi_s(X_s)) ,\]

where \{\beta_s : s \in \mathcal{G} \} is a fractional packing with respect to \mathcal{G}.

Remarks.

  1. Recall that a fractional packing is a function \beta:\mathcal{G} \rightarrow [0,1] such that \sum_{s \in \mathcal{G}, s  \ni i} \beta_s\leq 1 \text{ } \forall i \in [n].

    Example 1. Let d(i)=\#\{s \in \mathcal{G} : s \ni i \}, and define d_+= \max_{i\in[n]} d(i). Taking \beta_s=\frac{1}{d_+} always defines a fractional packing, as \sum_{s \in \mathcal{G}, s \ni i} \frac{1}{d_+}=\frac{1}{d_+} \cdot d(i) \leq 1 by definition of d_+.

    Example 2. If \mathcal{G}=\{s\subseteq\{1,\ldots,n\}:|s|=m\}, then d_+=\binom{n-1}{m-1}.

  2. The original paper of Hoeffding (1948) proves the following special case where each \Psi_s is symmetric in its arguments and \mathcal{G} is as in Example 2 above: U=\frac{1}{\binom{n}{m}}\sum_{|s|=m}\Psi(X_s) (the U-statistic) satisfies

        \[\mathrm{Var}(U) \leq \frac{m}{n} \mathrm{Var}(\Psi(X_s)).\]

    Of course, if m=1 then \mathrm{Var}(U) = \frac{1}{n} \mathrm{Var}(\Psi(X_1)). Thus Hoeffding’s inequality for the variance of U-statistics above, and the more general variance drop lemma, should be viewed as capturing how much of a drop in variance we get for an additive-type function when the terms are not independent but have only limited dependencies (overlaps) in their structure. A small numerical illustration is given after these remarks.
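
Here is the promised numerical illustration of Hoeffding's variance bound (a Monte Carlo sketch; the kernel |x-y|, the standard normal sample, and the parameters n=10, m=2 are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    n, m, reps = 10, 2, 50_000

    def kernel(x, y):
        # an arbitrary symmetric kernel with finite second moment
        return np.abs(x - y)

    i, j = np.triu_indices(n, k=1)                            # all pairs i < j
    samples = rng.standard_normal((reps, n))
    U = kernel(samples[:, i], samples[:, j]).mean(axis=1)     # the U-statistic, once per replication

    pairs = rng.standard_normal((reps, 2))
    var_psi = kernel(pairs[:, 0], pairs[:, 1]).var()          # Var(Psi(X_1, X_2))

    print("Var(U)        ", U.var())
    print("(m/n) Var(Psi)", (m / n) * var_psi)                # the bound; Var(U) should be smaller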

Proof. We may assume without loss of generality that each \Psi_s(X_s) has mean zero.

    \begin{align*}  U(X_1,\ldots,X_n)&=\sum_{s \in \mathcal{G}} \Psi_s(X_s)= \sum_{s \in \mathcal{G}} \sum_{t \subseteq s} \bar{E_t}[\Psi_s(X_s)]\qquad\text{ by ANOVA}\\ &= \sum_{t \subseteq [n]} \sum_{s \in \mathcal{G}, s \supseteq t} \bar{E_t} \Psi_s(X_s)\\  &= \sum_{t \subseteq [n]} \bar{E_t} \sum_{s \in \mathcal{G}, s \supseteq t} \Psi_{s}. \end{align*}

We then have, using orthogonality of the terms in the ANOVA decomposition:

    \begin{align*} \mathbb{E}U^2 &= \sum_{t \subseteq [n]} \mathbb{E} \bigg[\sum_{s \in \mathcal{G} , s \supseteq t} \bar{E_t}[\Psi_s(X_s)] \bigg]^2. \end{align*}

For each term, we have

    \begin{align*} \Big[\sum_{s \in \mathcal{G} , s \supseteq t} \sqrt{\beta_s}\,\frac{\bar{E_t}[\Psi_s(X_s)]}{\sqrt{\beta_s}}\Big]^2 &\leq \Big[\sum_{s \in \mathcal{G}, s \supseteq t} \beta_s\Big] \Big[\sum_{s \in \mathcal{G}, s \supseteq t} \frac{[\bar{E_t}[\Psi_s(X_s)]]^2}{\beta_s}\Big] \qquad\text{ by Cauchy-Schwarz}\\ &\leq \sum_{s \in \mathcal{G}, s \supseteq t}\frac{[\bar{E_t}[\Psi_s(X_s)]]^2}{\beta_s} ,  \end{align*}

where the second inequality follows from the definition of a fractional packing when t is non-empty; when t=\varnothing, the term vanishes altogether, since \bar{E}_{\varnothing} takes any \Psi_s to its mean, which is zero by assumption. Hence

    \begin{align*} \mathbb{E}[U^2] &\leq \sum_{t \subseteq [n]} \mathbb{E} \Big[ \sum_{s \in \mathcal{G}, s \supseteq t}\frac{[\bar{E_t}[\Psi_s(X_s)]]^2}{\beta_s} \Big]\\ &= \sum_{s \in \mathcal{G}} \frac{1}{\beta_s} \sum_{t \subseteq s} \mathbb{E} [(\bar{E_t} \Psi_s)^2]\\ &= \sum_{s \in \mathcal{G}} \frac{1}{\beta_s} \mathbb{E}[\Psi_s(X_s)^2] , \end{align*}

again using orthogonality of the \bar{E_t}\Psi_s (applied to the ANOVA decomposition of \Psi_s) in the last step. As each \Psi_s(X_s) has mean zero, \mathbb{E}[\Psi_s(X_s)^2]=\mathrm{Var}(\Psi_s(X_s)) and \mathbb{E}[U^2]=\mathrm{Var}(U), which gives the claim. \qquad\square

Monotonicity of Fisher information

We can now finally prove monotonicity of the Fisher information.

Corollary. Let X_i be independent random variables with I(X_i) <\infty. Then

    \[I(X_1+\ldots + X_n) \leq \sum_{s \in \mathcal{G}} \frac{\omega_s^2}{\beta_s} I\bigg(\sum_{i \in s} X_i\bigg)\]

for any hypergraph \mathcal{G} on [n], fractional packing \beta, and positive weights \{\omega_s :s \in \mathcal{G}\} summing to 1.

Proof. Recall that I(X)=\mathrm{Var}(\rho_X(X)) and \rho_X(x)=\frac{f_X'(x)}{f_X(x)}. The identity proved in the last lecture states

    \begin{align*} \rho_{X+Y}(x)&= \mathbb{E}[\rho_X(X)|X+Y=x]\\ \rho_{X+Y}(X+Y)&=\mathbb{E}[\rho_X(X)|X+Y] \end{align*}

With T_s=\sum_{i \in s} X_i, we can write

    \[\rho_{T_{[n]}}(T_{[n]})=\mathbb{E}[\rho_{T_s}(T_s)|T_{[n]}] \text{ }\qquad\forall s \in \mathcal{G}\]

since T_{[n]}=T_s+T_{s^c}. By taking a convex combination of these identities,

    \begin{align*} \rho_{T_{[n]}}(T_{[n]}) &= \sum_{s \in \mathcal{G}} \omega_s \mathbb{E}[\rho_{T_{s}}(T_s) | T_{[n]}] \\ &= \mathbb{E}\bigg[\sum_{s \in \mathcal{G}} \omega_s \rho_{T_{s}}(T_s) | T_{[n]}\bigg] . \end{align*}

Now by using the Pythagorean inequality (or Jensen’s inequality) and the variance drop lemma, we have

    \begin{align*} I(T_{[n]}) &= \mathbb{E} [\rho^2_{T_{[n]}}(T_{[n]})]\\ &\leq \mathbb{E}\bigg[\bigg(\sum_{s \in \mathcal{G}} \omega_s \rho_{T_s}(T_s)\bigg)^2\bigg]\\ &\leq \sum_{s \in \mathcal{G}} \frac{1}{\beta_s} \omega_s^2 \mathbb{E}[\rho^2_{T_s}(T_s)]\\ &= \sum_{s \in \mathcal{G}} \frac{\omega_s^2}{\beta_s} I(T_s) \end{align*}

as desired. \qquad\square

Remark. The \omega_s being arbitrary weights, we can optimize over them. This gives

    \[\frac{1}{I(T_{[n]})} \geq \sum_{s \in \mathcal{G}} \frac{\beta_s}{I(T_s)}.\]

With \mathcal{G} being all singletons and \beta_s=1 we recover the superadditivity property of I^{-1}. With X_1,\ldots,X_n i.i.d., \mathcal{G} being all sets of size n-1, and \beta_s=\frac{1}{\binom{n-1}{n-2}}=\frac{1}{n-1}, we get

    \[\frac{1}{I(T_{[n]})} \geq \frac{n}{n-1}\frac{1}{I(T_{[n-1]})} \Rightarrow I\Big(\frac{T_{[n]}}{\sqrt{n}}\Big) \leq I\Big(\frac{T_{[n-1]}}{\sqrt{n-1}}\Big) \Leftrightarrow I(S_n) \leq I(S_{n-1}).\]

Thus we have proved the monotonicity of Fisher information in the central limit theorem.
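
As a numerical sanity check (a sketch, not part of the proof: we take X_i distributed according to an arbitrary mean-zero, unit-variance two-component Gaussian mixture, for which the density of S_n and its derivative are available in closed form), one can compute I(S_n) by quadrature and watch it decrease toward I(N(0,1))=1:

    import numpy as np
    from scipy.stats import norm
    from scipy.special import comb

    a = 0.8
    s = np.sqrt(1 - a**2)        # X_i ~ 0.5 N(-a, s^2) + 0.5 N(a, s^2): mean 0, variance 1
    x = np.linspace(-12, 12, 400001)
    dx = x[1] - x[0]

    def fisher_info_Sn(n):
        # S_n is the mixture sum_k C(n,k) 2^{-n} N((2k-n) a / sqrt(n), s^2); integrate (f')^2 / f.
        k = np.arange(n + 1)
        w = (comb(n, k) * 0.5**n)[:, None]
        mu = ((2 * k - n) * a / np.sqrt(n))[:, None]
        comp = norm.pdf(x[None, :], loc=mu, scale=s)
        f = np.sum(w * comp, axis=0)
        fp = np.sum(-w * comp * (x[None, :] - mu) / s**2, axis=0)
        mask = f > 1e-300
        return np.sum(fp[mask] ** 2 / f[mask]) * dx

    for n in [1, 2, 4, 8, 16]:
        print(n, fisher_info_Sn(n))   # nonincreasing, approaching 1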

From Fisher information to entropy

Having proved monotonicity for the CLT written in terms of Fisher information, we now want to show the analogous statement for entropy. The key tool here is the de Bruijn identity.

To formulate this identity, let us introduce some basic quantities. Let X\sim f on \mathbb{R}, and define

    \[X^t = e^{-t}X+\sqrt{1-e^{-2t}}Z\]

where Z \sim \mathcal{N}(0,1). Denote by f_t the density of X^t. The following facts are readily verified for t>0:

  1. f_t>0.
  2. f_t(\cdot) is smooth.
  3. I[f_t] < \infty.
  4. \frac{\partial f_t(x)}{\partial t}= f_t^{\prime \prime}(x)+ \frac{d}{dx} [xf_t(x)] =: (Lf_t)(x).

Observe that X^0=X has density f, and that as t \rightarrow \infty, X^t converges in distribution to a standard Gaussian. Thus X^t provides an interpolation between the density f and the standard normal density.

Remark. Let us recall some standard facts from the theory of diffusions. The Ornstein-Uhlenbeck process X(t) is defined by the stochastic differential equation

    \[dX(t)= -X(t) dt + \sqrt{2} dB(t),\]

where B(t) is Brownian motion. This is, like Brownian motion, a Markov process, but the drift term (which always pushes trajectories towards 0) ensures that it has a stationary distribution, unlike Brownian motion. The Markov semigroup associated to this Markov process, namely the semigroup of operators defined on an appropriate domain by

    \[P_t \Psi(x)=\mathbb{E}[\Psi(X(t))|X(0)=x] ,\]

has a generator A (defined via A= \lim_{t\downarrow 0} \frac{P_t - I}{t}) given by A\Psi(x)=\Psi^{\prime \prime}(x)-x\Psi'(x). The semigroup P_t generated by A governs the evolution of conditional expectations of functions of the process X(t), while the adjoint semigroup generated by L=A^* governs the evolution of the marginal density of X(t). The above expression for \partial f_t/\partial t follows from this remark by noting that X(t) started from X(0)=X has the same distribution as X^t; however, it can also be deduced more simply just by writing down the density of X^t explicitly, and using the smoothness of the Gaussian density to verify each part of the claim.
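
As a quick numerical sanity check of the expression for \partial f_t/\partial t (a sketch: the two-component Gaussian mixture below is an arbitrary choice of starting density f, for which f_t is available in closed form, and the PDE \partial f_t/\partial t = Lf_t is verified by finite differences):

    import numpy as np
    from scipy.stats import norm

    a, s = 1.0, 0.5    # f = density of 0.5 N(-a, s^2) + 0.5 N(a, s^2), an arbitrary test density

    def f(t, x):
        # density of X^t = e^{-t} X + sqrt(1 - e^{-2t}) Z: again a two-component Gaussian mixture
        var = np.exp(-2 * t) * s**2 + 1 - np.exp(-2 * t)
        m = np.exp(-t) * a
        return 0.5 * (norm.pdf(x, -m, np.sqrt(var)) + norm.pdf(x, m, np.sqrt(var)))

    t, x = 0.7, np.linspace(-1.3, 1.3, 5)
    eps, h = 1e-5, 1e-4
    dfdt = (f(t + eps, x) - f(t - eps, x)) / (2 * eps)               # partial f_t / partial t
    fxx = (f(t, x + h) - 2 * f(t, x) + f(t, x - h)) / h**2           # f_t''
    dxf = ((x + h) * f(t, x + h) - (x - h) * f(t, x - h)) / (2 * h)  # d/dx [x f_t(x)]
    print(np.max(np.abs(dfdt - (fxx + dxf))))                        # approx. 0, up to finite-difference error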

We can now formulate the key identity.

de Bruijn identity. Let g be the density of the standard normal N(0,1), and let f have mean zero and unit variance (as in our application), so that f_t also has mean zero and unit variance and J(f_t)=I(f_t)-1.

  1. Differential form:

        \[\frac{d}{dt} D(f_t \| g)=-J(f_t),\]

    where J(f)=Var(f) \cdot I(f)-1 is the normalized Fisher information.

  2. Integral form:

        \[D(f \| g)=\int_0^\infty J(f_t)\, dt .\]

The differential form follows by using the last part of the claim together with integration by parts. The integral form follows from the differential form by the fundamental theorem of calculus, since

    \[D(g \| g)-D(f \| g)=\int_0^\infty \frac{d}{dt}D(f_t\|g)\, dt=-\int_0^\infty J(f_t)\, dt ,\]

which yields the desired identity since D(g\|g)=0.
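
For concreteness, here is the integration by parts behind the differential form (a sketch that takes for granted the differentiation under the integral sign and the vanishing of the boundary terms, both of which can be justified using the facts listed above). Since \int \frac{\partial f_t}{\partial t}(x)\,dx=0,

    \begin{equation*} \begin{split} \frac{d}{dt}h(f_t) &= -\int \frac{\partial f_t}{\partial t}(x)\log f_t(x)\,dx = -\int \big(f_t''(x)+(xf_t(x))'\big)\log f_t(x)\,dx \\ &= \int \big(f_t'(x)+xf_t(x)\big)\,\frac{f_t'(x)}{f_t(x)}\,dx = I(f_t)+\int x f_t'(x)\,dx = I(f_t)-1=J(f_t), \end{split} \end{equation*}

where we used \int x f_t'(x)\,dx=-\int f_t(x)\,dx=-1 in the last step. Since D(f_t\|g)=h(g)-h(f_t) for mean-zero, unit-variance f_t, this is exactly the differential form \frac{d}{dt}D(f_t\|g)=-J(f_t).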

This gives us the desired link between Fisher information and entropy. In the next lecture, we will use this to complete the proof of the entropic central limit theorem.

Lecture by Mokshay Madiman | Scribed by Georgina Hall

13. November 2013 by Ramon van Handel
Categories: Information theoretic methods | Leave a comment

Lecture 5. Entropic CLT (2)

The goal of this lecture is to prove monotonicity of Fisher information in the central limit theorem. Next lecture we will connect Fisher information to entropy, completing the proof of the entropic CLT.

Two lemmas about the score function

Recall that for a random variable X with absolutely continuous density f, the score function is defined as \rho(x)=f'(x)/f(x) and the Fisher information is I(X)=\mathbb{E}[\rho^2(X)].

The following lemma was proved in the previous lecture.

Lemma 1. Let X be a random variable with finite Fisher information. Let w:\mathbb{R} \rightarrow \mathbb{R} be measurable and let \eta(u) = \mathbb{E}[w^2(u + X)]. If \eta is bounded on the interval [a,b], then the function g(u) = \mathbb{E} w(u+X) is absolutely continuous on [a,b] with g'(u) = -\mathbb{E}[w(u+X)\rho(X)] a.e.

There is a converse to the above lemma, which gives a useful characterization of the score function.

Lemma 2. Let X be a random variable with density f and let m: \mathbb{R} \rightarrow \mathbb{R} be a measurable function with \mathbb{E}|m(X)|<\infty. Suppose that for every bounded measurable function w:\mathbb{R} \rightarrow \mathbb{R}, the function g(u) = \mathbb{E} w(u+X) is absolutely continuous on \mathbb{R} with g'(u) = -\mathbb{E}[w(u+X)m(X)] a.e. Then there must exist an absolutely continuous version of the density f, and moreover m(X)=\rho(X) a.s.

Proof. Take w(x)=\mathbb{I}_{\{x\leq t\}}. Then

    \[g(u)=\int_{-\infty}^{t-u}f(x)dx.\]

Hence g is absolutely continuous with g'(u)=-f(t-u) for a.e. u. On the other hand,

    \[\begin{split} g'(u)&=-\mathbb{E}[w(u+X)m(X)]\\ &=-\int_{-\infty}^{t-u}f(x)m(x)\,dx\qquad\mbox{a.e.} \end{split}\]

by our assumption. Comparing the two expressions for g' and noting that the right-hand side above is continuous in u (as fm is integrable), we conclude that f admits a version that satisfies

    \[f(y)=\int_{-\infty}^{y}f(x)m(x)dx\]

for every y\in\mathbb{R}. In particular, this version of f is absolutely continuous with f'(x)=f(x)m(x) a.e. Since \mathbb{P}[f(X)=0]=0, it follows that m(X)=\rho(X) a.s. \square

Score function of the sum of independent variables

We now show a key property of the score function: the score function of the sum of two independent random variables is a projection.

Proposition. Let X and Y be two independent random variables. Suppose X has finite Fisher information. Then \rho_{X+Y}(X+Y)=\mathbb{E}[\rho_X(X)|X+Y] a.s.

Proof. Let m(u)=\mathbb{E}[\rho_X(X)|X+Y=u]. By Lemma 2, we only need to show that the function g(u) = \mathbb{E} w(u+X+Y) is locally absolutely continuous with g'(u) = -\mathbb{E}[w(u+X+Y)m(X+Y)] a.e. for every bounded measurable function w.

Fix a\in\mathbb{R}. By independence of X and Y, we can apply Lemma 1 to X (conditioned on Y) to obtain

    \[\mathbb{E}[w(u+X+Y)|Y]-\mathbb{E}[w(a+X+Y)|Y]=-\int_a^u\mathbb{E}[w(t+X+Y)\rho_X(X)|Y]\,dt.\]

Taking expectation of both sides and applying Fubini, we get

    \[g(u)-g(a)=-\int_a^u\mathbb{E}[w(t+X+Y)\rho_X(X)]\,dt.\]

Since

    \[\begin{split}\mathbb{E}[w(t+X+Y)\rho_X(X)]&=\mathbb{E}[\mathbb{E}[w(t+X+Y)\rho_X(X)|X+Y]]\\ &=\mathbb{E}[w(t+X+Y)\mathbb{E}[\rho_X(X)|X+Y]]\\ &=\mathbb{E}[w(t+X+Y)m(X+Y)],\end{split}\]

we arrive at

    \[g(u)-g(a)=-\int_a^u\mathbb{E}[w(t+X+Y)m(X+Y)]\,dt,\]

which implies g'(u) = -\mathbb{E}[w(u+X+Y)m(X+Y)] a.e. \qquad\square

Remark. The well-known interpretation of conditional expectation as a projection means that under the assumption of finite Fisher information (i.e., score functions in L^2(\Omega,\mathcal{F},\mathbb{P})), the score function of the sum is just the projection of the score of a summand onto the closed subspace L^2(\Omega, \sigma\{X+Y\}, \mathbb{P}). This implies directly, by the Pythagorean inequality, that convolution decreases Fisher information: I(X+Y)=\mathbb{E}[\rho_{X+Y}^2(X+Y)]\leq\mathbb{E}[\rho_{X}^2(X)]=I(X). In fact, we can do better, as we will see forthwith.

Monotonicity of FI in the CLT along a subsequence

We now make a first step towards proving monotonicity of the Fisher information in the CLT.

Theorem. Let X and Y be independent random variables both with finite Fisher information. Then

    \[I(X+Y)\leq\lambda^2I(X)+(1-\lambda)^2I(Y)\]

for any \lambda\in\mathbb{R}.

Before going into the proof, we make some remarks.

Remark. Taking \lambda=1, we get I(X+Y)\leq I(X). Hence the above theorem is a significant strengthening of the simple fact that convolution decreases Fisher information.

Remark. By taking \lambda=\frac{I(Y)}{I(X)+I(Y)}, we get the following optimized version of the theorem:

    \[\frac{1}{I(X+Y)}\geq\frac{1}{I(X)}+\frac{1}{I(Y)}.\]
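
For example, for independent Gaussians this is an equality: if X\sim N(0,\sigma_1^2) and Y\sim N(0,\sigma_2^2), then I(X)=1/\sigma_1^2, I(Y)=1/\sigma_2^2 and

    \[\frac{1}{I(X+Y)}=\sigma_1^2+\sigma_2^2=\frac{1}{I(X)}+\frac{1}{I(Y)}.\]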

Remark. Let X_1,X_2,\ldots be an i.i.d. sequence of random variables with mean zero and unit variance. Let S_n=\frac{1}{\sqrt{n}}\sum_{i=1}^nX_i. By taking \lambda=\frac{1}{2} in the theorem and using scaling property of Fisher information, it is easy to obtain I(S_{2n})\leq I(S_n). Hence, the above theorem already implies monotonicity of Fisher information along the subsequence of times 1,2,4,8,16,\ldots: that is, I(S_{2^n}) is monotone in n.
However, the theorem is not strong enough to give monotonicity I(S_{n+1})\leq I(S_n) without passing to a subsequence. For example, if we apply the previous remark repeatedly we only get I(S_n)\leq I(S_1), which is not very interesting. To prove full monotonicity of the Fisher information, we will need a strengthening of the above Theorem. But it is instructive to first consider the proof of the simpler case.

Proof. By the projection property of the score function,

    \[\begin{split} &\rho_{X+Y}(X+Y)=\mathbb{E}[\rho_X(X)|X+Y],\\ &\rho_{X+Y}(X+Y)=\mathbb{E}[\rho_Y(Y)|X+Y].\\ \end{split}\]

Hence

    \[\rho_{X+Y}(X+Y)=\mathbb{E}[\lambda\rho_X(X)+(1-\lambda)\rho_Y(Y)|X+Y].\]

Applying the conditional Jensen inequality, we obtain,

    \[\rho_{X+Y}^2(X+Y)\leq\mathbb{E}[(\lambda\rho_X(X)+(1-\lambda)\rho_Y(Y))^2|X+Y].\]

Taking expectations of both sides and using independence, we obtain

    \[I(X+Y)\leq\mathbb{E}[(\lambda\rho_X(X)+(1-\lambda)\rho_Y(Y))^2] =\lambda^2I(X)+(1-\lambda)^2I(Y)+2\lambda(1-\lambda)\mathbb{E}[\rho_X(X)]\mathbb{E}[\rho_Y(Y)].\]

By Lemma 1, \mathbb{E}[\rho_X(X)]=0. This finishes the proof. \qquad\square

Monotonicity of Fisher information in the CLT

To prove monotonicity of the Fisher information in the CLT (without passing to a subsequence) we need a strengthening of the property of Fisher information given in the previous section.

In the following, we will use the common notation [n]:=\{1,\ldots,n\}.

Definition. Let \mathcal{G} be a collection of non-empty subsets of [n]. Then a collection of non-negative numbers \{\beta_s:s\in\mathcal{G}\} is called a fractional packing for \mathcal{G} if \sum_{s:i\in s}\beta_s\leq 1 for all 1\leq i\leq n.

We can now state the desired strengthening of our earlier theorem.

Theorem. Let \{\beta_s:s\in\mathcal{G}\} be a fractional packing for \mathcal{G} and let X_1,\ldots,X_n be independent random variables. Let \{w_s:s\in\mathcal{G}\} be a collection of real numbers such that \sum_{s\in\mathcal{G}}w_s=1. Then

    \[I(X_1+\ldots+X_n)\leq\sum_{s\in\mathcal{G}}\frac{w_s^2}{\beta_s}I\bigg(\sum_{j\in s}X_j\bigg)\]

and

    \[\frac{1}{I(X_1+\ldots+X_n)}\geq\sum_{s\in\mathcal{G}}\beta_s\frac{1}{I(\sum_{j\in s}X_j)}.\]

Remark. Suppose that the X_i are identically distributed. Take \mathcal{G}=\{[n]\backslash 1,[n]\backslash 2,\ldots,[n]\backslash n\}. For each s\in\mathcal{G}, define \beta_s=\frac{1}{n-1}. It is easy to check that \{\beta_s:s\in\mathcal{G}\} is a fractional packing of \mathcal{G}. Then

    \[\frac{1}{I(X_1+\ldots+X_n)}\geq\frac{n}{n-1}\frac{1}{I(X_1+\ldots+X_{n-1})}\]

by the above theorem. By the scaling property of Fisher information, this is equivalent to I(S_n)\leq I(S_{n-1}), i.e., monotonicity of the Fisher information. This special case was first proved by Artstein, Ball, Barthe and Naor (2004) with a more complicated proof. The proof we will follow of the more general theorem above is due to Barron and Madiman (2007).

The proof of the above theorem is based on an analysis of variance (ANOVA) type decomposition, which dates back at least to the classic paper of Hoeffding (1948) on U-statistics. To state this decomposition, let X_1,\ldots,X_n be independent random variables, and define the Hilbert space

    \[H:=\{\phi:\mathbb{R}^n \rightarrow \mathbb{R}~~:~~\mathbb{E}[\phi^2(X_1,\ldots,X_n)]<\infty\}\]

with inner product

    \[\langle\phi,\psi\rangle:=\mathbb{E}[\phi(X_1,\ldots,X_n)\psi(X_1,\ldots,X _n)].\]

For every j\in[n], define an operator E_j on H as

    \[(E_j\phi)(x_1,\ldots,x_n)=\mathbb{E}[\phi(x_1,\ldots,x_{j-1},X_j,x_{j+1},\ldots,x_n)].\]

Proposition. Each \phi\in H can be decomposed as

    \[\phi=\sum_{T\subseteq [n]}\widetilde{E}_T\phi,\]

where

    \[\widetilde{E}_T:=\prod_{j\notin T}E_j\prod_{j\in T}(I-E_j)\]

satisfies

  1. \widetilde{E}_T\widetilde{E}_S=0 if T\neq S;
  2. \langle\widetilde{E}_T\phi,\widetilde{E}_S\psi\rangle=0 if T\neq S;
  3. \widetilde{E}_T\phi does not depend on x_i if i\notin T.

Proof. It is easy to verify that E_j is a projection operator, that is, E_j^2=E_j and \langle E_j\phi,\psi\rangle=\langle\phi,E_j\psi\rangle (self-adjointness). It is also easy to see that E_jE_i=E_iE_j for i\neq j. Hence we have

    \[\phi=\prod_{j=1}^n(E_j+I-E_j)\phi=\sum_{T\subset [n]}\prod_{j\notin T}E_j\prod_{j\in T}(I-E_j)\phi=\sum_{T\subset [n]}\widetilde{E}_T\phi.\]

If T\neq S, we may assume (interchanging the roles of T and S if necessary) that there exists j_0\in T such that j_0\notin S. Then

    \[\widetilde{E}_T\widetilde{E}_S=(I-E_{j_0})E_{j_0}\prod_{j\notin T}E_j\prod_{j\in T\backslash\{j_0\}}(I-E_j)\prod_{j\notin S\cup\{j_0\}}E_j\prod_{j\in S}(I-E_j)=0.\]

It is easily verified that \widetilde{E}_T is itself self-adjoint, so that \langle\widetilde{E}_T\phi,\widetilde{E}_S\psi\rangle=0 follows directly. Finally, as by definition E_j\phi does not depend on x_j, it is clear that \widetilde{E}_T\phi does not depend on x_j if j\notin T. \qquad\square

The decomposition will be used in the form of the following variance drop lemma whose proof we postpone to next lecture. Here X_S:=(X_{i_1},\ldots,X_{i_{|S|}}) for S=\{i_1,\ldots,i_{|S|}\}\subseteq [n], i_1<\cdots<i_{|S|}.

Lemma. Let \{\beta_s:s\in\mathcal{G}\} be a fractional packing for \mathcal{G}. Let U=\sum_{s\in\mathcal{G}}\phi_s(X_s). Suppose that \mathbb{E}[\phi_s^2(X_s)]<\infty for each s\in\mathcal{G}. Then \text{Var}[U]\leq\sum_{s\in\mathcal{G}}\frac{1}{\beta_s}\text{Var}[\phi_s(X_s)].

To be continued…

Lecture by Mokshay Madiman | Scribed by Liyao Wang

06. November 2013 by Ramon van Handel
Categories: Information theoretic methods | Leave a comment

Next lecture: November 7

A quick reminder that next week is Fall break (no lecture). The next lecture will take place on Thursday, November 7: Mokshay will resume the proof of the entropic central limit theorem.

24. October 2013 by Ramon van Handel
Categories: Announcement | Leave a comment

Lecture 4. Entropic CLT (1)

The subject of the next lectures will be the entropic central limit theorem (entropic CLT) and its proof.

Theorem (Entropic CLT). Let X_1,X_2,\ldots be i.i.d. real-valued random variables with mean zero and unit variance. Let

    \[S_n = \frac{1}{\sqrt{n}}\sum_{i=1}^nX_i.\]

If h(S_n) > -\infty for some n, then h(S_n) \uparrow h(N(0,1)), or equivalently D( \mathcal{L}(S_n)\| N(0,1)) \downarrow 0. That is, the entropy of S_n increases monotonically to that of the standard Gaussian.

Recall that D( \mathcal{L}(X) \| \mathcal{L}(Z)) = h(Z) - h(X) when Z is Gaussian with the same mean and variance as X, which explains the equivalence stated in the theorem.

Let us note that the assumption h(S_n) > -\infty for some n represents a genuine (but not unexpected) restriction: in particular, the entropic CLT does not apply if the X_i are discrete, since in that case h(S_n)=-\infty for every n.

Entropy power inequality

Historically, the first result on monotonicity of entropy in the CLT was that h(S_{2n}) \ge h(S_n) for all n. This follows directly from an important inequality for entropy, the entropy power inequality (EPI). The rest of this lecture and part of the next lecture will be devoted to proving the EPI. While the EPI does not suffice to establish the full entropic CLT, the same tools will prove to be crucial later on.

Entropy power inequality. Let X_1 and X_2 be independent real-valued random variables such that h(X_1), h(X_2), and h(X_1 + X_2) all exist. Then

    \[e^{2h(X_1+X_2)} \ge e^{2h(X_1)} + e^{2h(X_2)},\]

with equality if and only if X_1 and X_2 are Gaussian.

Before we embark on the proof, let us make some remarks.

Remark. None of the assumptions about existence of entropies is redundant: it can happen that h(X_1) and h(X_2) exist but h(X_1+X_2) does not.

Remark. If X_1 and X_2 are i.i.d., S_1=X_1, and S_2=(X_1+X_2)/\sqrt{2}, then the EPI implies

    \[2e^{2h(X_1)} \le e^{2h(X_1+X_2)} = e^{2h(\sqrt{2}S_2)} = 2e^{2h(S_2)},\]

which implies h(S_1) \le h(S_2). Here we have used the easy-to-check equality h(aX) = h(X) + \log|a|, which of course implies e^{2h(aX)}=a^2e^{2h(X)}. From this observation, the proof of the claim that h(S_{2n}) \ge h(S_n) is immediate: simply note that S_{2n} is the sum of two independent copies of S_n.

Remark. It is easy to check that h(X_1+X_2) \ge h(X_1+X_2 | X_2) = h(X_1 | X_2) = h(X_1). In fact, this is true in much more general settings (e.g. on locally compact groups, with entropy defined relative to Haar measure). The EPI is a much stronger statement particular to real-valued random variables.

Remark. The EPI admits the following multidimensional extension.

Multidimensional EPI. Let X_1 and X_2 be independent \mathbb{R}^n-valued random vectors such that h(X_1), h(X_2), and h(X_1 + X_2) all exist. Then

    \[e^{2h(X_1+X_2)/n} \ge e^{2h(X_1)/n} + e^{2h(X_2)/n},\]

with equality if and only if X_1 and X_2 are Gaussian with proportional covariance matrices.

Define the entropy power N(X) for an \mathbb{R}^n-valued random vector X by

    \[N(X) := \exp\left(\frac{2h(X)}{n}\right).\]

The EPI says that N is superadditive under convolution.
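
As a quick numerical illustration of this superadditivity (a sketch with an arbitrary choice of distributions): for two independent Uniform[0,1] variables X_1, X_2 we have h(X_1)=h(X_2)=0, so N(X_1)+N(X_2)=2, while X_1+X_2 has the triangular density on [0,2], whose entropy power can be computed by quadrature.

    import numpy as np

    # X1, X2 ~ Uniform[0,1], independent; X1 + X2 has the triangular density on [0, 2].
    x = np.linspace(0, 2, 400001)
    dx = x[1] - x[0]
    f = np.minimum(x, 2 - x)                      # triangular density, peak 1 at x = 1
    mask = f > 0
    h_sum = -np.sum(f[mask] * np.log(f[mask])) * dx
    print("N(X1+X2)    =", np.exp(2 * h_sum))     # exact value e^{2 * 0.5} = e, approx. 2.718
    print("N(X1)+N(X2) =", 2 * np.exp(2 * 0.0))   # = 2, so the EPI holds strictly here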

Digression: EPI and Brunn-Minkowski

A good way to develop an appreciation for what the EPI is saying is in analogy with the Brunn-Minkowski inequality. If A,B \subset \mathbb{R}^n are Borel sets and |\cdot| denotes n-dimensional Lebesgue measure, then

    \[|A + B|^{1/n} \ge |A|^{1/n} + |B|^{1/n},\]

where A+B := \{a + b : a \in A, \ b \in B\} is the Minkowski sum. In particular, note that |A|^{1/n} is proportional (with a constant depending only on the dimension n) to the radius of the n-dimensional Euclidean ball whose volume matches that of A. The Brunn-Minkowski inequality expresses superadditivity of this functional (and we clearly have equality for balls). The Brunn-Minkowski inequality is of fundamental importance in various areas of mathematics: for example, it implies the isoperimetric inequality in \mathbb{R}^n, which states that Euclidean balls with volume V have the minimal surface area among all subsets of \mathbb{R}^n with volume V.

In a sense, the EPI is to random variables as the Brunn-Minkowski inequality is to sets. The Gaussians play the role of the balls, and variance corresponds to radius. In one dimension, for example, since

    \[h(N(0,\sigma^2)) = \frac{1}{2}\log(2\pi e\sigma^2) \quad \Rightarrow \quad e^{2h(N(0,\sigma^2))} = 2\pi e\sigma^2,\]

we see that e^{2h(X)} is proportional to the variance of the Gaussian whose entropy matches that of X. The entropy power inequality expresses superadditivity of this functional, with equality for Gaussians.

Proposition. The EPI is equivalent to the following statement: if X_1 and X_2 are independent and Z_1 and Z_2 are independent Gaussians with h(Z_1)=h(X_1) and h(Z_2)=h(X_2) (taken, say, with covariances proportional to the identity), then h(X_1+X_2) \ge h(Z_1 +Z_2) provided that all of the entropies exist.

Proof. Both implications follow from

    \[\exp\left(\frac{2h(X_1)}{n}\right)+ \exp\left(\frac{2h(X_2)}{n}\right) = \exp\left(\frac{2h(Z_1)}{n}\right) + \exp\left(\frac{2h(Z_2)}{n}\right) = \exp\left(\frac{2h(Z_1+Z_2)}{n}\right). \qquad\square\]

Proof of the entropy power inequality

There are many proofs of the EPI. It was stated by Shannon (1948) but first fully proven by Stam (1959); different proofs were later provided by Blachman (1969), Lieb (1978), and many others. We will follow a simplified version of Stam’s proof. We work from now on in the one-dimensional case for simplicity.

Definition (Fisher information). Let X be an \mathbb{R}-valued random variable whose density f is an absolutely continuous function. The score function of X is defined as

    \[\rho(x) = \rho_X(x) := \begin{cases} \frac{f'(x)}{f(x)} & \text{for } x \text{ at which } f(x)>0 \text{ and } f'(x) \text{ exists,} \\ 0 & \text{otherwise.} \end{cases}\]

The Fisher information (FI) of X is defined as

    \[I(X) := \mathbb{E}\left[\rho_X^2(X)\right]. \]

Remark. Let \mathcal{P} = \{p_\theta : \theta \in \Theta\} be a parametric statistical model. In statistics, the score function is usually defined by \rho(\theta,x) := \frac{\partial}{\partial \theta}\log p_\theta(x), and the Fisher information by I(\theta) = \mathbb{E}_\theta\left[\rho^2(\theta,X)\right]. This reduces to our definition in the special case of location families, where p_\theta(x) = p(x-\theta) for some probability density p(x): in this case, \rho(\theta,x) = -\frac{\partial}{\partial x}\log p(x-\theta) and we have

    \[I(\theta) = \int p_\theta(x)\left[\frac{\partial}{\partial x}\log p(x-\theta)\right]^2dx = \int p(x)\left[\frac{\partial}{\partial x}\log p(x)\right]^2dx.\]

Thus for location families I(\theta) does not depend on \theta and coincides precisely with the Fisher information I(Z) as we defined it above for a random variable Z with density p(x). The statistical interpretation allows us to derive a useful inequality. Suppose for simplicity that \mathbb{E}Z=0. Then \mathbb{E}_\theta X = \theta for every \theta, so X is an unbiased estimator of \theta. The Cramér-Rao bound therefore implies the inequality

    \[I(Z) \ge \frac{1}{\mathrm{Var}(Z)},\]

with equality if and only if Z is Gaussian. The same conclusion holds when \mathbb{E}Z\in\mathbb{R} is arbitrary, as both FI and variance are invariant under translation.
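
As a small numerical illustration (a sketch; the standard logistic density below is an arbitrary non-Gaussian choice), one can check I(Z)\geq 1/\mathrm{Var}(Z) by quadrature:

    import numpy as np

    # Standard logistic: f(x) = e^{-x} / (1 + e^{-x})^2, with score rho(x) = (e^{-x} - 1) / (e^{-x} + 1).
    x = np.linspace(-40, 40, 800001)
    dx = x[1] - x[0]
    f = np.exp(-x) / (1 + np.exp(-x))**2
    score = (np.exp(-x) - 1) / (np.exp(-x) + 1)
    I = np.sum(score**2 * f) * dx
    var = np.sum(x**2 * f) * dx                   # the mean is 0 by symmetry
    print("I(Z)     =", I)                        # exact value 1/3
    print("1/Var(Z) =", 1 / var)                  # exact value 3/pi^2, approx. 0.304 < 1/3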

Remark. There is a Fisher information analogue of the entropic CLT: in the setup of the entropic CLT, subject to an additional variance constraint, we have I(S_n) \downarrow I(N(0,1)) = 1. Moreover, Fisher information is minimized by Gaussians. This is often stated in terms of normalized Fisher information, defined as J(X) := \mathrm{Var}(X)I(X)-1. Note that J is both translation and scale invariant: J(aX)=J(X) and J(X+a)=J(X). We have J(X) \ge 0, with equality if and only if X is Gaussian, by the previous remark. The Fisher information analogue of the entropic CLT can now be restated as J(S_n) \downarrow 0.

The strategy of the proof of the EPI is as follows:

  1. We first prove an inequality for Fisher information:

        \[\frac{1}{I(X_1+X_2)} \ge \frac{1}{I(X_1)} + \frac{1}{I(X_2)}.\]

  2. Develop an integral identity relating I and h.
  3. Combine (1) and (2) to get the EPI.

The reason to concentrate on Fisher information is that I is an L^2-type quantity, as opposed to the entropy which is an L\log L-type quantity. This makes I easier to work with.

We begin with some technical results about Fisher information.

Lemma. If X has absolutely continuous density f and I(X) < \infty, then f has bounded variation. In particular, f is bounded.

Proof. Let D_f denote the set of points at which f is differentiable, so that |D_f^c|=0. Define f'(x) := 0 for x \notin D_f. Let S = \{x \in \mathbb{R} : f(x) > 0\}. Then

    \[\int_\mathbb{R}|f'(x)|dx = \int_S|f'(x)|dx = \int_S|\rho(x)|f(x)dx = \mathbb{E}|\rho(X)| \le \sqrt{I(X)} < \infty. \qquad\square\]

Lemma. Let w : \mathbb{R} \rightarrow \mathbb{R} be measurable. Let \eta : [a,b] \rightarrow [0,\infty] be defined as \eta(u) = \mathbb{E} w^2(u + X). If \eta is bounded, then g : [a,b] \rightarrow \mathbb{R} defined by g(u) = \mathbb{E} w(u+X) is absolutely continuous on (a,b) with g'(u) = -\mathbb{E}\left[w(u+X)\rho(X)\right] a.e.

Proof. Let \tilde{g}(u)=-\mathbb{E}\left[w(u+X)\rho(X)\right]; we want to show that g is absolutely continuous with derivative \tilde{g}. First observe that \tilde{g} is integrable on [a,b] from the Cauchy-Schwarz inequality and the boundedness condition. Thus we can compute

    \begin{equation*}\begin{split} \int_a^b \tilde{g}(u) du  &= - \int_a^b \bigg[ \int w(s) f'(s-u) ds \bigg] du \\ &= -  \int w(s)\bigg[ \int_a^b f'(s-u) du \bigg]  ds \\ &= -  \int w(s)\bigg[ f(s-a) - f(s-b) \bigg]  ds \\ &= \mathbb{E} w(b+X)- \mathbb{E} w(a+X) \\ &= g(b) - g(a) , \end{split}\end{equation*}

which proves the desired result. \qquad\square

Corollary. \mathbb{E}[\rho(X)] = 0 and, if \mathbb{E} X^2 < \infty, then \mathbb{E}[X\rho(X)] = -1.

Proof. Take w \equiv 1 in the lemma above for the first claim. Take w(u)=u for the second. \qquad\square

To be continued…

Lecture by Mokshay Madiman | Scribed by Dan Lacker

23. October 2013 by Ramon van Handel
Categories: Information theoretic methods | Leave a comment
