Lecture 10. Concentration, information, transportation (2)

Recall the main proposition proved in the previous lecture, which is due to Bobkov and Götze (1999).

Proposition. The following are equivalent for X\sim \mu:

  1. \mathbb{E}[e^{\lambda\{f(X)-\mathbb{E}f(X)\}}] \le e^{\lambda^2\sigma^2/2} for every \lambda>0 and f\in 1\text{-Lip} with finite mean.
  2. W_1(\nu,\mu) \le \sqrt{2\sigma^2 D(\nu||\mu)} for every probability measure \nu \ll \mu.

This result provides a characterization of concentration of Lipschitz functions on a fixed metric space. Our original aim, however, was to understand dimension-free concentration (in analogy with the Gaussian case): that is, for which measures \mu is it true that if X_1,\ldots,X_n are i.i.d. \sim\mu, then every 1\text{-Lip} function f(X_1,\ldots,X_n) is subgaussian with the same parameter \sigma^2 for every n\ge 1? In principle, the above result answers this question: \mu satisfies dimension-free concentration if and only if

    \[W_1(\mathbb{Q},\mu^{\otimes n}) \le \sqrt{2\sigma^2 D(\mathbb{Q}||\mu^{\otimes n})}      \quad\text{for every }\mathbb{Q}\ll\mu^{\otimes n}\text{ and }n\ge 1.\]

However, this is not a very satisfactory characterization: to check whether \mu satisfies dimension-free concentration, we must check that the transportation-information inequality holds for \mu^{\otimes n} for every n\ge 1. Instead, we would like to characterize dimension-free concentration in terms of a property of \mu itself (and not of its tensor products). This will be achieved in the present lecture.

Tensorization

How are we going to obtain a transportation-information inequality for every \mu^{\otimes n} starting from only a property of \mu? If one is an optimist, one might hope for a miracle: perhaps the validity of the transportation-information inequality for \mu itself already implies the analogous inequality for all its products \mu^{\otimes n}. In such situations, one says that the inequality tensorizes. As we will shortly see, the inequality W_1(\nu,\mu) \le \sqrt{2\sigma^2 D(\nu||\mu)} does not tensorize in precisely the manner that one would hope, but this naive idea will nonetheless lead us in the right direction.

We will develop the tensorization result in a slightly more general setting that will be useful in the sequel. Recall that \mathcal{C}(\mu,\nu):=\{\mathrm{Law}(X,Y):X\sim\mu,Y\sim \nu\} denotes the set of couplings between probability measures \mu and \nu, and that the Wasserstein distance W_1(\mu,\nu) can be written as

    \[W_1(\mu,\nu) = \inf_{\mathbb{M}\in\mathcal{C}(\mu,\nu)}\mathbb{E_M}[d(X,Y)].\]
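
Before moving on, here is a minimal numerical sketch of the coupling formulation of W_1, assuming scipy is available: for two discrete measures on the real line it solves the coupling linear program directly and compares the result with scipy's built-in one-dimensional routine. The points and weights are arbitrary toy choices.

    import numpy as np
    from scipy.optimize import linprog
    from scipy.stats import wasserstein_distance

    # Two discrete probability measures on the real line (arbitrary toy data).
    x, p = np.array([0.0, 1.0, 3.0]), np.array([0.5, 0.3, 0.2])
    y, q = np.array([0.5, 2.0]), np.array([0.4, 0.6])

    # W_1 as a linear program over couplings M: minimize sum_ij M_ij |x_i - y_j|
    # subject to the marginal constraints (row sums = p, column sums = q, M >= 0).
    cost = np.abs(x[:, None] - y[None, :])
    m, n = cost.shape
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0      # i-th row sum of M equals p_i
    for j in range(n):
        A_eq[m + j, j::n] = 1.0               # j-th column sum of M equals q_j
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=np.concatenate([p, q]),
                  bounds=(0, None), method="highs")

    print("W1 via coupling LP:", res.fun)
    print("W1 via scipy      :", wasserstein_distance(x, y, p, q))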

We can now state a general tensorization result for transportation-information type inequalities. The proof is due to Marton (1996), who initiated the use of this method to study concentration.

Proposition. (Tensorization) Let \phi:\mathbb{R}_+\to\mathbb{R}_+ be a convex function, and let w:\mathbb{X}\times\mathbb{X}\to\mathbb{R}_+ be a positive weight function on the metric space (\mathbb{X},d). Fix a probability measure \mu on \mathbb{X}. If

    \[\inf_{\mathbb{M}\in C(\nu,\mu)} \phi(\mathbb{E}_{\mathbb{M}}[w(X,Y)])            \le c\, D(\nu || \mu)\]

holds for all \nu\ll\mu, then

    \[\inf_{\mathbb{M}\in C(\mathbb{Q},\mu^{\otimes n})} \sum_{i=1}^n          \phi(\mathbb{E}_{\mathbb{M}}[w(X_i,Y_i)]) \le c\, D(\mathbb{Q} || \mu^{\otimes n})\]

holds for every \mathbb{Q}\ll\mu^{\otimes n} and every n\ge 1.

Proof. The conclusion holds for n=1 by assumption. The proof now proceeds by induction on n: we will suppose in the sequel that the conclusion holds for n-1, and deduce the conclusion for n from that.

Fix \mathbb{Q}\ll\mu^{\otimes n}. We denote by \mathbb{Q}^{n-1} the marginal of \mathbb{Q} on the first n-1 coordinates, and define the conditional distribution \mathbb{Q}_{X_1,\ldots,X_{n-1}}=\mathbb{Q}(X_n\in\,\cdot\,|X_1,\ldots,X_{n-1}) (whose existence is guaranteed by the Bayes formula as \mathbb{Q}\ll\mu^{\otimes n}). The key idea of the proof is to exploit the chain rule for relative entropy:

    \[D(\mathbb{Q}||\mu^{\otimes n}) =    D(\mathbb{Q}^{n-1}||\mu^{\otimes n-1}) +    \mathbb{E_Q}[D(\mathbb{Q}_{X_1,\ldots,X_{n-1}} || \mu )].\]

The first term on the right can be bounded below by the induction hypothesis, while the second term can be bounded below by the assumption of the Proposition. In particular, fixing \varepsilon>0, it follows that we can choose \mathbb{M}^{n-1}\in\mathcal{C}(\mathbb{Q}^{n-1},\mu^{\otimes n-1}) and \mathbb{M}_{X_1,\ldots,X_{n-1}}\in\mathcal{C}(\mathbb{Q}_{X_1,\ldots,X_{n-1}},\mu) such that

    \begin{align*}    c\,D(\mathbb{Q}^{n-1}||\mu^{\otimes n-1}) &\ge     \sum_{i=1}^{n-1}\phi(\mathbb{E}_{\mathbb{M}^{n-1}}[w(X_i,Y_i)]) - \varepsilon,\\    c\,D(\mathbb{Q}_{X_1,\ldots,X_{n-1}} || \mu ) &\ge    \phi(\mathbb{E}_{\mathbb{M}_{X_1,\ldots,X_{n-1}}}[w(X_n,Y_n)]) - \varepsilon. \end{align*}

If we now define the probability measure \mathbb{M} as

    \[\mathbb{E_M}[f(X_1,\ldots,X_n,Y_1,\ldots,Y_n)] =     \mathbb{E}_{\mathbb{M}^{n-1}}\bigg[\int f(X_1,\ldots,X_{n-1},x,Y_1,\ldots,Y_{n-1},y)\mathbb{M}_{X_1,\ldots,X_{n-1}}(dx,dy)\bigg],\]

then we can estimate by Jensen’s inequality

    \begin{align*}    c\,D(\mathbb{Q}||\mu^{\otimes n}) &\ge    \sum_{i=1}^{n-1}\phi(\mathbb{E}_{\mathbb{M}^{n-1}}[w(X_i,Y_i)]) +    \mathbb{E_Q}[\phi(\mathbb{E}_{\mathbb{M}_{X_1,\ldots,X_{n-1}}}[w(X_n,Y_n)])]    - 2\varepsilon \\ &\ge    \sum_{i=1}^{n}\phi(\mathbb{E}_{\mathbb{M}}[w(X_i,Y_i)]) - 2\varepsilon. \end{align*}

But evidently \mathbb{M}\in\mathcal{C}(\mathbb{Q},\mu^{\otimes n}), and thus

    \[c\,D(\mathbb{Q}||\mu^{\otimes n}) \ge \inf_{\mathbb{M}\in    \mathcal{C}(\mathbb{Q},\mu^{\otimes n})} \sum_{i=1}^{n}\phi(\mathbb{E}_{\mathbb{M}}    [w(X_i,Y_i)]) - 2\varepsilon.\]

The proof is completed by letting \varepsilon\downarrow 0. \square

Remark. In the above proof, we have swept a technicality under the rug: we assumed that an \varepsilon-optimal coupling \mathbb{M}_{X_1,\ldots,X_{n-1}}\in\mathcal{C}(\mathbb{Q}_{X_1,\ldots,X_{n-1}},\mu) can be chosen to be measurable as a function of X_1,\ldots,X_{n-1}. This can generally be justified by standard methods (e.g., on Polish spaces by a measurable selection argument, or in special cases such as w(x,y)=\mathbf{1}_{x\ne y} by an explicit construction of the optimal coupling).
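
As a quick sanity check of the chain rule that drives the induction, the following sketch verifies the identity numerically for a randomly generated joint law \mathbb{Q} of two coordinates and a random base measure \mu on a finite alphabet (the specific distributions are arbitrary; this is an illustration, not part of the proof).

    import numpy as np

    rng = np.random.default_rng(0)

    def kl(p, q):
        """Relative entropy D(p||q) for discrete distributions with p << q."""
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    k = 4
    mu = rng.random(k)
    mu /= mu.sum()                   # base measure mu on {0, ..., k-1}
    Q = rng.random((k, k))
    Q /= Q.sum()                     # joint law Q of (X_1, X_2)

    Q1 = Q.sum(axis=1)               # marginal law of X_1
    cond = Q / Q1[:, None]           # conditional law of X_2 given X_1

    lhs = kl(Q.ravel(), np.outer(mu, mu).ravel())    # D(Q || mu x mu)
    rhs = kl(Q1, mu) + sum(Q1[i] * kl(cond[i], mu) for i in range(k))
    print(lhs, rhs)                  # the two expressions agree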

Now assume that \mu satisfies the transportation-information inequality

    \[W_1(\nu,\mu) \le \sqrt{2\sigma^2 D(\nu||\mu)}\quad\text{for all }\nu\ll\mu,\]

which characterizes concentration in a fixed metric space. This corresponds to the setting of the above tensorization result with \phi(x)=x^2 and w(x,y)=d(x,y). Tensorization then yields

    \[\inf_{\mathbb{M}\in C(\mathbb{Q},\mu^{\otimes n})} \sqrt{\sum_{i=1}^n          \mathbb{E}_{\mathbb{M}}[d(X_i,Y_i)]^2} \le \sqrt{2\sigma^2 D(\mathbb{Q} ||          \mu^{\otimes n})}\quad\text{for all }\mathbb{Q}\ll\mu^{\otimes n},~n\ge 1.\]

Unfortunately, the left-hand side of this inequality is not itself a Wasserstein distance, and so we do not automatically obtain a transportation-information inequality in higher dimension. In the previous lecture, we showed that one can bound the left-hand side from below by a Wasserstein metric with respect to an \ell_1-type distance using the Cauchy-Schwarz inequality. However, we then lose a factor n^{-1/2} by Cauchy-Schwarz, and thus the dimension-free nature of the concentration is lost.

The above computation, however, suggests how we can “fix” our assumption to obtain dimension-free concentration. Note that the left-hand side of the tensorization inequality above is the \ell_2-norm of the vector of expectations (\mathbb{E}_{\mathbb{M}}[d(X_i,Y_i)])_{i\le n}. If we could take the \ell_2 norm inside the expectation, rather than outside, then the left-hand side would be a Wasserstein distance between probability measures on (\mathbb{X}^n,d_n) with respect to the \ell_2-distance d_n(x,y):=\sqrt{\sum_{i=1}^nd(x_i,y_i)^2}! In order to engineer such a stronger inequality, however, we must begin with a stronger assumption.

To this end, define the quadratic Wasserstein distance

    \[W_2(\mu,\nu) := \inf_{\mathbb{M}\in\mathcal{C}(\mu,\nu)}\sqrt{\mathbb{E_M}[d(X,Y)^2]}.\]
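
In one dimension, the W_2 distance between two equal-size empirical samples is attained by the monotone (sorted) coupling, which gives a quick way to compute it numerically. The following sketch, with arbitrary Gaussian samples, also illustrates the inequality W_1\le W_2 that will be used momentarily.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 10_000
    x = rng.normal(0.0, 1.0, n)      # sample from N(0, 1)
    y = rng.normal(0.5, 1.0, n)      # sample from N(0.5, 1)

    # For equal-size empirical measures on the line the sorted pairing is an
    # optimal coupling, so W_p^p is an average over the sorted gaps.
    gaps = np.sort(x) - np.sort(y)
    w1 = np.mean(np.abs(gaps))
    w2 = np.sqrt(np.mean(gaps ** 2))
    print(w1, w2)                    # both close to 0.5 here, and W_1 <= W_2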

Suppose that \mu satisfies the quadratic transportation cost inequality (QTCI)

    \[W_2(\mu,\nu) \le \sqrt{2\sigma^2 D(\nu||\mu)}\quad\text{for all }\nu\ll\mu.\]

Then applying tensorization with \phi(x)=x and w(x,y)=d(x,y)^2 immediately yields

    \[W_2(\mathbb{Q},\mu^{\otimes n}) \le \sqrt{2\sigma^2 D(\mathbb{Q}||\mu^{\otimes n})}     \quad\text{for all }\mathbb{Q}\ll\mu^{\otimes n},~n\ge 1.\]

On the other hand, as obviously W_1(\mu,\nu)\le W_2(\mu,\nu) by Jensen’s inequality, we immediately find that QTCI implies dimension-free concentration: that is, we have proved

Corollary. Suppose that the probability measure \mu satisfies

    \[W_2(\mu,\nu) \le \sqrt{2\sigma^2 D(\nu||\mu)}\quad\text{for all }\nu\ll\mu.\]

Then we have dimension-free concentration, that is, for X_1,X_2,\ldots i.i.d. \sim\mu

    \[\mathbb{P}[f(X_1,\ldots,X_n)-\mathbb{E}f(X_1,\ldots,X_n)\ge t] \le e^{-t^2/2\sigma^2}\]

for all n\ge 1, t\ge 0, and every function f such that |f(x)-f(y)|\le \sqrt{\sum_{i=1}^n d(x_i,y_i)^2} with finite mean.

The observation that the quadratic transportation cost inequality yields dimension-free concentration with respect to the \ell_2-metric is due to Talagrand (1996), who used it to prove Gaussian concentration.
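
To see what dimension-free concentration looks like in practice, one can simulate a 1-Lipschitz function of i.i.d. Gaussians, for instance f(x)=\max_i x_i: its mean grows with n, but its fluctuations around the mean remain of order one uniformly in the dimension. This is purely a numerical illustration; the sample sizes are arbitrary.

    import numpy as np

    rng = np.random.default_rng(4)
    for n in [1, 10, 100, 1000]:
        X = rng.normal(size=(10_000, n))
        f = X.max(axis=1)            # x -> max_i x_i is 1-Lipschitz for the l2 metric
        print(n, round(f.mean(), 3), round(f.std(), 3))
        # the mean grows roughly like sqrt(2 log n), but the standard deviation
        # stays bounded (in fact it decreases) as the dimension grows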

Characterizing dimension-free concentration

We began this topic by showing that the transportation cost inequality

    \[W_1(\mu,\nu) \le \sqrt{2\sigma^2D(\nu||\mu)}\quad\text{for all }\nu\ll\mu\]

is necessary and sufficient for a probability \mu on a fixed metric space (\mathbb{X},d) to exhibit concentration of Lipschitz functions. However, this does not suffice to obtain dimension-free concentration, that is, concentration of the product measure \mu^{\otimes n} for every n\ge 1. To obtain the latter, we “fixed” our original assumption by imposing the stronger quadratic transportation cost inequality

    \[W_2(\mu,\nu) \le \sqrt{2\sigma^2D(\nu||\mu)}\quad\text{for all }\nu\ll\mu.\]

As was shown above, this inequality is sufficient to obtain dimension-free concentration. But it is far from clear whether it should also be necessary: by strengthening the inequality, we have lost the connection with the proof on a fixed space (which relies on the variational property of relative entropy). It is therefore a remarkable fact that the quadratic transportation cost inequality proves to be both necessary and sufficient for dimension-free concentration, as was observed by Gozlan (2009).

Theorem. Let \mu be a probability measure on a Polish space (\mathbb{X},d), and let X_1,X_2,\ldots be i.i.d. \sim\mu. Then the following are equivalent:

  1. Dimension-free concentration:

        \[\mathbb{P}[f(X_1,\ldots,X_n)-\mathbb{E}f(X_1,\ldots,X_n)\ge t] \le e^{-t^2/2\sigma^2}\]

    for all n\ge 1, t\ge 0, and every function f such that |f(x)-f(y)|\le \sqrt{\sum_{i=1}^n d(x_i,y_i)^2} with finite mean.

  2. Quadratic transportation cost inequality:

        \[W_2(\mu,\nu) \le \sqrt{2\sigma^2 D(\nu||\mu)}\]

    for every probability measure \nu \ll \mu.

This result effectively resolves the question we set out to answer.

Proof. We have already shown that 2 implies 1. In the sequel, we will prove that 1 implies 2: that is, the validity of the quadratic transportation cost inequality is necessary for dimension-free concentration. The proof of this fact is a surprising application of Sanov’s theorem (see Lecture 3).

We will need the following three facts that will be proved below.

  1. Law of large numbers: \mathbb{E}[W_2(\frac{1}{n}\sum_{i=1}^n\delta_{X_i},\mu)]\to 0 as n\to\infty.
  2. Lower semicontinuity: O_t:=\{\nu:W_2(\mu,\nu)>t\} is open in the weak convergence topology.
  3. Lipschitz property: the map g_n:(x_1,\ldots,x_n)\mapsto W_2(\frac{1}{n}\sum_{i=1}^n\delta_{x_i},\mu) is n^{-1/2}-Lipschitz with respect to the \ell_2-metric d_n introduced above.

The first two claims are essentially technical exercises: the empirical measures \frac{1}{n}\sum_{i=1}^n\delta_{X_i} converge weakly to \mu by the law of large numbers, so the only difficulty is to verify that the convergence holds in the slightly stronger sense of the quadratic Wasserstein distance; and lower-semicontinuity of the quadratic Wasserstein distance is an elementary technical fact. The third claim is a matter of direct computation, which we will do below. Let us take these claims for granted and complete the proof.

As O_t is open, we can apply Sanov’s theorem as follows:

    \[-\inf_{\nu\in O_t} D(\nu || \mu)  \le    \liminf_{n\to\infty}\frac{1}{n}\log\mathbb{P}\bigg[     \frac{1}{n}\sum_{i=1}^n\delta_{X_i}\in O_t    \bigg] =    \liminf_{n\to\infty}\frac{1}{n}\log\mathbb{P}\bigg[     W_2\bigg(\frac{1}{n}\sum_{i=1}^n\delta_{X_i},\mu\bigg)>t    \bigg].\]

But as the function g_n is n^{-1/2}-Lipschitz, dimension-free concentration yields, for every t\ge\mathbb{E}[g_n(X_1,\ldots,X_n)] (which is the case for all sufficiently large n, as this expectation vanishes by the first claim),

    \[\mathbb{P}\bigg[     W_2\bigg(\frac{1}{n}\sum_{i=1}^n\delta_{X_i},\mu\bigg)>t    \bigg] =    \mathbb{P}[     g_n(X_1,\ldots,X_n)>t    ] \le e^{-n(t-\mathbb{E}[g_n(X_1,\ldots,X_n)])^2/2\sigma^2}.\]

Thus we have

    \[-\inf_{\nu\in O_t} D(\nu || \mu)  \le    -\limsup_{n\to\infty} \frac{(t-\mathbb{E}[g_n(X_1,\ldots,X_n)])^2}{2\sigma^2}    = -\frac{t^2}{2\sigma^2},\]

where we have used \mathbb{E}[g_n(X_1,\ldots,X_n)]=\mathbb{E}[W_2(\frac{1}{n}\sum_{i=1}^n\delta_{X_i},\mu)]\to 0. In particular, we have proved

    \[\sqrt{2\sigma^2 D(\nu || \mu)}  \ge    t\quad\text{whenever}\quad W_2(\mu,\nu)>t>0.\]

The quadratic transportation cost inequality follows readily (let t=W_2(\mu,\nu)-\varepsilon and \varepsilon\downarrow 0). \square

It remains to establish the claims used in the proof. Let us begin with the Lipschitz property of g_n.

Lemma. (Claim 3) The map g_n:(x_1,\ldots,x_n)\mapsto W_2(\frac{1}{n}\sum_{i=1}^n\delta_{x_i},\mu) is n^{-1/2}-Lipschitz.

Proof. Let \mathbb{M}\in C(\frac{1}{n}\sum_{i=1}^n \delta_{x_i},\mu). If we define \mu_i = \mathbb{M} [Y\in\,\cdot\, | X=x_i], then we evidently have

    \[\mathbb{E_M}[f(X,Y)]=\frac{1}{n}\sum_{i=1}^n\int f(x_i,y)\,\mu_i(dy),\qquad\quad    \frac{1}{n}\sum_{i=1}^n\mu_i=\mu.\]

Conversely, every family of measures \mu_1,\ldots,\mu_n such that \frac{1}{n}\sum_{i=1}^n\mu_i=\mu defines a coupling \mathbb{M}\in C(\frac{1}{n}\sum_{i=1}^n \delta_{x_i},\mu) in this manner. We can therefore estimate as follows:

    \begin{align*}         &W_2\bigg(\frac{1}{n} \sum_{i=1}^n\delta_{x_i}, \mu\bigg)         - W_2\bigg(\frac{1}{n} \sum_{i=1}^n\delta_{\tilde x_i}, \mu\bigg)\\         &= \inf_{\frac{1}{n}\sum_{i=1}^n \mu_i = \mu}         \bigg[\frac{1}{n} \sum_{i=1}^n \int d(x_i,y)^2 \mu_i(dy)\bigg]^{1/2}         - \inf_{\frac{1}{n}\sum_{i=1}^n \mu_i = \mu}         \bigg[\frac{1}{n} \sum_{i=1}^n \int d(\tilde x_i,y)^2 \mu_i(dy)\bigg]^{1/2} \\         &\le \sup_{\frac{1}{n}\sum_{i=1}^n \mu_i = \mu}         \left\{         \bigg[\frac{1}{n} \sum_{i=1}^n \int d(x_i,y)^2 \mu_i(dy)\bigg]^{1/2}         -         \bigg[\frac{1}{n} \sum_{i=1}^n \int d(\tilde x_i,y)^2 \mu_i(dy)\bigg]^{1/2}         \right\}\\         &\le \sup_{\frac{1}{n}\sum_{i=1}^n \mu_i = \mu}         \bigg[\frac{1}{n} \sum_{i=1}^n \int          \{d(x_i,y)-d(\tilde x_i,y)\}^2 \mu_i(dy)\bigg]^{1/2} \\         &\le \frac{1}{\sqrt{n}} \bigg[\sum_{i=1}^n d(x_i,\tilde x_i)^2\bigg]^{1/2}, \end{align*}

where in the last two lines we used, respectively, the reverse triangle inequality for L^2 norms (that is, \| X \|_2 - \| Y \|_2 \le \| X - Y \|_2) and the reverse triangle inequality for the metric d. \square
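
The Lipschitz bound just proved can also be checked numerically in the one-dimensional Gaussian case, where W_2 against \mu=N(0,1) can be evaluated through the quantile-coupling formula W_2(\frac{1}{n}\sum_{i=1}^n\delta_{x_i},\mu)^2=\int_0^1(F_n^{-1}(u)-\Phi^{-1}(u))^2\,du. This is only a sketch; the sample, the perturbation, and the quadrature grid are arbitrary choices.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(5)
    u = (np.arange(100_000) + 0.5) / 100_000     # quadrature grid on (0, 1)

    def g(points):
        """g_n(x) = W_2(empirical measure of x, N(0,1)) via quantile functions."""
        xs = np.sort(points)
        emp_quantile = xs[np.minimum((u * len(xs)).astype(int), len(xs) - 1)]
        return np.sqrt(np.mean((emp_quantile - norm.ppf(u)) ** 2))

    n = 50
    x = rng.normal(size=n)
    x_tilde = x + rng.normal(scale=0.3, size=n)  # a perturbed configuration
    lhs = abs(g(x) - g(x_tilde))
    rhs = np.linalg.norm(x - x_tilde) / np.sqrt(n)
    print(lhs <= rhs, lhs, rhs)                  # the Lipschitz bound holds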

Next, we prove lower-semicontinuity of W_2. This is an exercise in using weak convergence.

Lemma. (Claim 2) \nu\mapsto W_2(\nu,\mu) is lower-semicontinuous in the weak convergence topology.

Proof. Let \nu_1,\nu_2,\ldots be probability measures such that \nu_n\to \nu weakly as n\to\infty. We must show

    \[\liminf_{n\to\infty}W_2(\nu_n,\mu)\ge W_2(\nu,\mu).\]

Fix \varepsilon>0, and choose for each n a coupling \mathbb{M}_n\in\mathcal{C}(\nu_n,\mu) such that

    \[W_2(\nu_n,\mu) \ge \sqrt{\mathbb{E}_{\mathbb{M}_n}[d(X,Y)^2]}-\varepsilon.\]

We claim that the sequence (\mathbb{M}_n)_{n\ge 1} is tight. Indeed, the sequence (\nu_n)_{n\ge 1} is tight (as it converges) and clearly \mu itself is tight. For any \delta>0, choose a compact set K_\delta such that \nu_n(K_\delta)\ge 1-\delta/2 for all n\ge 1 and \mu(K_\delta)\ge 1-\delta/2. Then evidently \mathbb{M}_n(K_\delta\times K_\delta) \ge 1-\delta, and thus tightness follows.

By tightness, we can choose a subsequence n_k\uparrow\infty such that \liminf_{n}W_2(\nu_n,\mu)=\lim_kW_2(\nu_{n_k},\mu) and \mathbb{M}_{n_k}\to\mathbb{M} weakly for some probability measure \mathbb{M}. As d is continuous and nonnegative, we obtain

    \[\liminf_{n\to\infty} W_2(\nu_n,\mu) \ge    \lim_{k\to\infty}\sqrt{\mathbb{E}_{\mathbb{M}_{n_k}}[d(X,Y)^2]} - \varepsilon \ge    \sqrt{\mathbb{E}_{\mathbb{M}}[d(X,Y)^2]} - \varepsilon.\]

But as \mathbb{M}\in\mathcal{C}(\nu,\mu), we have shown \liminf_{n\to\infty} W_2(\nu_n,\mu) \ge W_2(\nu,\mu)-\varepsilon. We conclude using \varepsilon\downarrow 0. \square

Finally, we prove the law of large numbers in W_2. This is an exercise in truncation.

Lemma. (Claim 1) Suppose that \mu satisfies the Lipschitz concentration property. Then the law of large numbers holds in the sense \mathbb{E}[W_2(\frac{1}{n}\sum_{i=1}^n\delta_{X_i},\mu)]\to 0 as n\to\infty for X_1,X_2,\ldots i.i.d. \sim\mu.

Proof. Let x^*\in\mathbb{X} be some arbitrary point. We truncate the Wasserstein distance as follows:

    \begin{align*}    W_2(\mu,\nu)^2    &= \inf_{\mathbb{M}\in\mathcal{C}(\mu,\nu)}\{    \mathbb{E_M}[d(X,Y)^2\mathbf{1}_{d(X,Y)\le a}] +    \mathbb{E_M}[d(X,Y)^2\mathbf{1}_{d(X,Y)>a}] \} \\    &\le a\inf_{\mathbb{M}\in\mathcal{C}(\mu,\nu)}\mathbb{E_M}[d(X,Y)\wedge a] +    \frac{4\int d(x,x^*)^3\{\mu(dx)+\nu(dx)\}}{a} \end{align*}

where we used (b+c)^3 \le 4(b^3 + c^3) for b,c\ge 0. We claim that if \nu_n\to\mu weakly, then

    \[\inf_{\mathbb{M}\in\mathcal{C}(\nu_n,\mu)}\mathbb{E_M}[d(X,Y)\wedge a]\xrightarrow{n\to\infty}0.\]

Indeed, by the Skorokhod representation theorem, we can construct random variables (X_n)_{n\ge 1} and X on a common probability space such that X_n\sim\nu_n for every n, X\sim\mu, and X_n\to X a.s. Thus \mathbb{E}[d(X_n,X)\wedge a]\to 0 by bounded convergence, and as \mathrm{Law}(X_n,X)\in\mathcal{C}(\nu_n,\mu) the claim follows. Thus \nu_n\to\mu implies W_2(\nu_n,\mu)\to 0 if we can control the second term in the above truncation.

Denote the empirical measure as \mu_n=\frac{1}{n}\sum_{i=1}^n\delta_{X_i}. Recall that \mu_n\to\mu weakly a.s. by the law of large numbers. Therefore, following the above reasoning, we obtain

    \[\limsup_{n\to\infty}\mathbb{E}[W_2(\mu_n,\mu)^2]     \le     \frac{8\int d(x,x^*)^3\mu(dx)}{a}\]

for every a>0. As \mathbb{E}[W_2(\mu_n,\mu)]\le\sqrt{\mathbb{E}[W_2(\mu_n,\mu)^2]} by Jensen's inequality, it therefore remains only to show that \int d(x,x^*)^3\mu(dx)<\infty and to let a\uparrow\infty. But as x\mapsto d(x,x^*) is evidently Lipschitz (with constant 1), the finiteness of this integral follows directly from the following Lemma. \square
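
The decay asserted by the lemma can be illustrated numerically for \mu=N(0,1) on the line, using the same quantile representation of W_2 as in the previous sketch (again an illustration only; the sample sizes and number of repetitions are arbitrary).

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(6)
    u = (np.arange(100_000) + 0.5) / 100_000     # quadrature grid on (0, 1)

    def w2_emp(points):
        """W_2 between the empirical measure of `points` and N(0,1)."""
        xs = np.sort(points)
        emp_quantile = xs[np.minimum((u * len(xs)).astype(int), len(xs) - 1)]
        return np.sqrt(np.mean((emp_quantile - norm.ppf(u)) ** 2))

    for n in [10, 100, 1_000, 10_000]:
        est = np.mean([w2_emp(rng.normal(size=n)) for _ in range(20)])
        print(n, round(est, 4))                  # E[W_2(mu_n, mu)] decreases toward 0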

Finally, we have used in the last proof the following lemma, which shows that if \mu satisfies the Lipschitz concentration property then any Lipschitz function has all finite moments. In particular, every Lipschitz function has finite mean, which means that the qualifier “with finite mean” used above in our definition of (dimension-free) concentration is superfluous and can therefore be dropped.

Lemma. Suppose that the probability measure \mu satisfies the Lipschitz concentration property. Then any Lipschitz function f satisfies \int |f(x)|^q\,\mu(dx)<\infty for every 0< q<\infty.

Proof. Let f be L-Lipschitz. It suffices to prove that |f| has finite mean. If this is the case, then the Lipschitz concentration property implies for every 0<q<\infty that

    \[\int |f(x)|^q\,\mu(dx) =    q\int_0^\infty x^{q-1}\,\mathbb{P}[|f|\ge x]\,dx \le    q\int_0^{\mathbb{E}|f|} x^{q-1}\,dx +    q\int_{\mathbb{E}|f|}^\infty x^{q-1}\,e^{-(x-\mathbb{E}|f|)^2/2\sigma^2L^2}\,dx<\infty,\]

where we note that |f| is Lipschitz with the same constant as f. To prove that |f| has finite mean, let us apply the Lipschitz concentration property to -\{|f|\wedge a\} (which certainly has finite mean). This gives

    \[\mathbb{P}[|f|\wedge a\le \mathbb{E}(|f|\wedge a)-t]\le e^{-t^2/2\sigma^2L^2}.\]

Now choose t such that e^{-t^2/2\sigma^2L^2}<1/2. Then clearly \mathbb{E}(|f|\wedge a)-t\le\mathrm{Med}(|f|\wedge a). But note that the median \mathrm{Med}(|f|\wedge a)=\mathrm{Med}|f| for a>\mathrm{Med}|f|. Thus we obtain \mathbb{E}|f|\le \mathrm{Med}|f|+t<\infty as a\uparrow\infty. \square

Gaussian concentration

We started our discussion of dimension-free concentration with the classical Gaussian concentration property of Tsirelson, Ibragimov, and Sudakov. It therefore seems fitting to conclude by giving a proof of this result using the machinery that we have developed: we only need to show that the standard normal N(0,1) on (\mathbb{R},|\cdot|) satisfies the quadratic transportation cost inequality. [It should be noted that there are numerous other proofs of Gaussian concentration, each with their own interesting ideas.]

Proposition. Let \mu=N(0,1) on (\mathbb{R},|\cdot|). Then W_2(\nu,\mu)\le\sqrt{2D(\nu||\mu)} for all \nu\ll\mu.

This result is due to Talagrand (1996). Talagrand’s proof exploits the fact that optimal transportation problems on \mathbb{R} admit an explicit solution in terms of quantile functions, which makes it possible to establish inequalities on \mathbb{R} by calculus manipulations. In contrast, optimal transportation problems on \mathbb{R}^d for d\ge 2 are far from trivial (see the excellent introductory and comprehensive texts by Villani). We therefore see that tensorization is really the key to a tractable proof by Talagrand’s method.
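
Before turning to that proof, here is a quick numerical sanity check of the inequality restricted to Gaussian test measures \nu=N(m,s^2), for which both sides admit closed forms. This covers only a very special family of \nu and is in no way a proof.

    import numpy as np

    def w2_gauss(m, s):
        # For one-dimensional Gaussians: W_2(N(m, s^2), N(0, 1))^2 = m^2 + (s - 1)^2.
        return np.sqrt(m ** 2 + (s - 1.0) ** 2)

    def kl_gauss(m, s):
        # D(N(m, s^2) || N(0, 1)) = (s^2 + m^2 - 1 - log s^2) / 2.
        return 0.5 * (s ** 2 + m ** 2 - 1.0 - np.log(s ** 2))

    rng = np.random.default_rng(7)
    for _ in range(5):
        m, s = rng.normal(), rng.uniform(0.2, 3.0)
        print(w2_gauss(m, s) <= np.sqrt(2.0 * kl_gauss(m, s)))   # prints True every time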

Instead of going down this road, we will present a lovely short proof of the transportation-information inequality due to Djellout, Guillin, and Wu (2004) that uses stochastic calculus.

Proof. Denote by \mathbb{P} the law of standard Brownian motion (W_t)_{t\in[0,1]}. Fix a probability measure \nu\ll\mu such that D(\nu||\mu)<\infty, and define the probability measure \mathbb{Q} as

    \[d\mathbb{Q} = \frac{d\nu}{d\mu}(W_1)\,d\mathbb{P}.\]

Then clearly W_1\sim\mu under \mathbb{P} and W_1\sim\nu under \mathbb{Q}.

Note that M_t=\mathbb{E}[\frac{d\nu}{d\mu}(W_1)|\mathcal{F}_t]=(\frac{d\nu}{d\mu}*\phi_{1-t})(W_t) is a uniformly integrable martingale and M_t>0 for 0\le t<1 (here \phi_s denotes the density of N(0,s)). Thus we find that

    \[\frac{d\nu}{d\mu}(W_1) = \exp\bigg(         \int_0^1 \beta_t\,dW_t -\frac{1}{2}\int_0^1 \beta_t^2\,dt      \bigg)\]

for some nonanticipating process (\beta_t)_{t\in[0,1]} by the martingale representation theorem and Itô's formula. But then Girsanov's theorem implies that the stochastic process defined by

    \[Y_t := W_t - \int_0^t \beta_s\,ds\]

is Brownian motion under \mathbb{Q}. Thus the law of (W_1,Y_1) under \mathbb{Q} is a coupling of \nu and \mu, and

    \[W_2(\mu,\nu)^2 \le \mathbb{E_Q}[|W_1-Y_1|^2] \le      \mathbb{E_Q}\bigg[\int_0^1 \beta_t^2\,dt\bigg]\]

by Jensen's inequality. The proof is therefore complete once we prove that

    \[\mathbb{E_Q}\bigg[\int_0^1 \beta_t^2\,dt\bigg] = 2D(\nu||\mu).\]

To see this, note that

    \[D(\nu||\mu) = \mathbb{E_Q}\bigg[\log\frac{d\nu}{d\mu}(W_1)\bigg]     = \mathbb{E_Q}\bigg[         \int_0^1 \beta_t\,dY_t + \frac{1}{2}\int_0^1 \beta_t^2\,dt     \bigg].\]

If \mathbb{E_Q}[\int_0^1 \beta_t^2dt]<\infty then the dY_t integral is a square-integrable martingale; thus its expectation vanishes and the proof is complete. However, it is not difficult to show using a simple localization argument that D(\nu||\mu)<\infty implies \mathbb{E_Q}[\int_0^1 \beta_t^2dt]<\infty, see Lemma 2.6 of Föllmer for a careful proof. \square
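
To see the objects in this proof concretely, consider the special case \nu=N(m,1) (a worked example added for illustration; it is not needed for the argument). Here \frac{d\nu}{d\mu}(x)=e^{mx-m^2/2}, and a direct computation gives

    \[M_t = e^{mW_t-m^2t/2},\qquad \beta_t\equiv m,\qquad Y_t = W_t-mt,\]

so that \mathbb{E_Q}[\int_0^1\beta_t^2\,dt]=m^2=2D(N(m,1)||N(0,1)). Moreover, (W_1,Y_1)=(W_1,W_1-m) couples \nu and \mu at cost |m|, so that W_2(\mu,\nu)=|m| and the quadratic transportation cost inequality holds with equality in this case.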

Remark. To be fair, it should be noted that the above stochastic calculus proof works just as easily in \mathbb{R}^d for any d. Thus we could directly establish the transportation-information inequality (and therefore concentration) in any dimension in this manner without going through the tensorization argument.

Lecture by Ramon van Handel | Scribed by Patrick Rebeschini

15. December 2013 by Ramon van Handel