Lecture 9. Concentration, information, transportation (1)

The goal of the next two lectures is to explore the connections between concentration of measure, entropy inequalities, and optimal transportation.

What is concentration?

Roughly speaking, concentration of measure is the idea that nonlinear functions of many random variables often have “nice tails”. Concentration represents one of the most important ideas in modern probability and has played an important role in many other fields, such as statistics, computer science, combinatorics, and geometric analysis.

To illustrate concentration, let X\sim N(0,1). Using Markov’s inequality, we can estimate for t\ge 0

    \[\mathbb{P}[X\ge t]          = \inf_{\lambda>0} \mathbb{P}[e^{\lambda X}\ge e^{\lambda t}]         \le \inf_{\lambda>0} e^{-\lambda t} \mathbb{E}[e^{\lambda X}]         = \inf_{\lambda>0} e^{\lambda^2/2 - \lambda t}         = e^{-t^2/2},\]

where we have made the optimal choice \lambda=t (this is called a Chernoff bound). Thus a Gaussian random variable X has “nice tails”. The concentration of measure phenomenon shows that not only do Gaussian random variables have nice tails, but that many nonlinear functions of Gaussian random variables still have nice tails. The classical result on this topic is the following.

Theorem. (Gaussian concentration, Tsirelson-Ibragimov-Sudakov 1976)
Let X_1,\ldots,X_n be i.i.d. N(0,1), and let f:\mathbb{R}^n\rightarrow\mathbb{R} be L-Lipschitz, that is, |f(x)-f(y)|\le L \| x - y \| for each x,y\in\mathbb{R}^n (where \|\cdot\| is the Euclidean norm). Then

    \[\mathbb{P}[f(X_1,\ldots,X_n)-\mathbb{E} f(X_1,\ldots,X_n) \ge t] \le e^{-t^2/2L^2}.\]

This result shows that Lipschitz functions (which could be very nonlinear) of i.i.d. Gaussian random variables concentrate closely around their expected value: the probability that the function exceeds its expected value by t decays like a Gaussian tail in t. The beauty of this result is that it is dimension-free, that is, the rate of decay of the tail depends only on the Lipschitz constant L, and not on the number of random variables n. Such results are essential in high-dimensional problems where one would like to obtain dimension-free bounds.
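
To see the dimension-free phenomenon numerically, here is a minimal sketch (in Python; the sample sizes are arbitrary and purely illustrative) using the 1-Lipschitz function f(x)=\|x\|: as n grows, the mean of f(X_1,\ldots,X_n) grows like \sqrt{n}, but its fluctuations remain of order one.

    import numpy as np

    rng = np.random.default_rng(0)
    for n in [10, 100, 1000]:
        X = rng.standard_normal(size=(10000, n))         # 10000 samples of (X_1,...,X_n)
        f = np.linalg.norm(X, axis=1)                    # f(x) = ||x|| is 1-Lipschitz
        print(n, round(f.mean(), 2), round(f.std(), 2))  # the std stays O(1) as n grows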

Gaussian concentration is only one result in a theory with many fascinating ideas and questions. One might ask, for instance, what random variables besides Gaussians exhibit this type of phenomenon, whether other notions of Lipschitz functions can be considered, etc. See, for example, the excellent books by Ledoux and by Boucheron, Lugosi, and Massart for much more around this theme.

The basic question that we want to address in the next two lectures is the following.

Question. Let X_1,\ldots,X_n be i.i.d. random variables on a metric space (\mathbb{X},d) with distribution \mu. Can we characterize all \mu for which dimension-free concentration holds as for the Gaussian case above?

It turns out that a remarkably complete answer can be given in terms of entropy inequalities. This is where information-theoretic methods enter the picture.

When bounding tails of random variables (provided the tails decay at least exponentially), it is convenient to bound moment generating functions, as we did above, instead of working directly with the tail.

Definition. X is subgaussian with parameter \sigma^2 if \mathbb{E}[e^{\lambda (X-\mathbb{E}X)}] \le e^{\lambda^2\sigma^2/2} for each \lambda\in\mathbb{R}.

If X is subgaussian, then \mathbb{P}[X-\mathbb{E}X \ge t]\le e^{-t^2/2\sigma^2} (by the Chernoff bound as above). In fact, it can be shown that the converse is also true for slightly larger \sigma, so that the property of having Gaussian tails is equivalent to the subgaussian property (we omit the proof). It will be convenient to investigate the property of Gaussian tails in terms of subgaussianity.
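
For completeness, the Chernoff computation in this setting reads

    \[\mathbb{P}[X-\mathbb{E}X\ge t] \le \inf_{\lambda>0} e^{-\lambda t}\, \mathbb{E}[e^{\lambda(X-\mathbb{E}X)}] \le \inf_{\lambda>0} e^{\lambda^2\sigma^2/2-\lambda t} = e^{-t^2/2\sigma^2},\]

where the infimum is attained at \lambda=t/\sigma^2.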

Concentration and relative entropy

Before we can tackle the problem of dimension-free concentration, we must begin by making the connection between subgaussianity and entropy in the most basic setting.

Let (\mathbb{X},d) be a metric space. A function f:\mathbb{X}\rightarrow \mathbb{R} is L-Lipschitz (L-Lip) if |f(x)-f(y)|\le L d(x,y) for each x,y\in\mathbb{X}. One thing that we can do with Lipschitz functions is to define a distance between probability measures (we will assume in the sequel that the necessary measurability conditions are satisfied): for probability measures \mu,\nu on \mathbb{X}, define the Wasserstein distance as

    \[W_1(\mu,\nu) := \sup_{f\in 1\text{-Lip}} \left| \int f d\mu - \int f d\nu \right|.\]

The idea is that two measures are close if the expectations of a large class of functions are close. In the case of W_1, the class of functions being used is the class 1-Lip.
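
As a quick sanity check on the definition, take point masses: for \mu=\delta_x and \nu=\delta_y, every f\in 1\text{-Lip} satisfies |\int f d\mu - \int f d\nu| = |f(x)-f(y)| \le d(x,y), and the bound is attained by f=d(\,\cdot\,,y), so

    \[W_1(\delta_x,\delta_y) = d(x,y).\]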

As we are interested in concentration of Lipschitz functions, it is intuitive that a quantity such as W_1 should enter the picture. On the other hand, we have seen in earlier lectures that the relative entropy D(\nu || \mu) can also be viewed as a “sort of distance” between probability measures (albeit not a metric). It is not clear, a priori, how W_1 and D are related. We will presently see that relative entropy is closely related to moment generating functions, and therefore to tails of random variables: in particular, we can characterize concentration on a fixed space by comparing the Wasserstein metric and relative entropy.

Proposition. The following are equivalent for X\sim \mu:

  1. \mathbb{E}[e^{\lambda\{f(X)-\mathbb{E}f(X)\}}] \le e^{\lambda^2\sigma^2/2} for every \lambda>0 and f\in 1\text{-Lip}.
  2. W_1(\nu,\mu) \le \sqrt{2\sigma^2 D(\nu||\mu)} for every probability measure \nu\ll\mu.

Note that this result characterizes those measures \mu on a fixed metric space (\mathbb{X},d) that exhibit Gaussian concentration. There is no independence as of yet, and thus no notion of “dimension-free” concentration for functions of independent random variables: the present result is in “fixed dimension”.

Example. Let d(x,y)=\mathbf{1}_{x\neq y} be the trivial metric. A function f is 1-Lip with respect to d if

    \[| f(x) - f(y) | \le \mathbf{1}_{x\neq y}         \quad\text{for each } x,y\in\mathbb{X},\]

that is, if and only if \mathrm{osc}\,f:=\sup f - \inf f \le 1. Hence we have

    \[W_1(\mu,\nu) = \sup_{f:\mathrm{osc}\,f \le 1} \left| \int f d\mu - \int f d\nu \right|          = \|\mu-\nu\|_{\rm TV}.\]

Thus 2 in the above proposition holds with \sigma^2=\frac{1}{4} by Pinsker’s inequality

    \[\|\mu-\nu\|_{\rm TV} \le \sqrt{\tfrac{1}{2}D(\nu||\mu)}.\]

Consequently, by the equivalence in the Proposition, we find that

    \[\mathbb{E}[e^{\lambda\{f(X)-\mathbb{E}f(X)\}}] \le e^{\lambda^2/8}\]

for every function f such that \mathrm{osc}\,f \le 1. Thus the above Proposition reduces in this case to the well known Hoeffding lemma, which states that bounded random variables are subgaussian.
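
To see this concretely, here is a minimal numerical sketch (with a Bernoulli variable as the test case, an arbitrary choice) that checks the subgaussian bound with \sigma^2=\frac{1}{4} for the identity function, which has oscillation 1.

    import numpy as np

    p = 0.3                                      # X ~ Bernoulli(p); f(x) = x has osc f = 1
    for lam in np.linspace(0.1, 8.0, 40):        # lambda > 0 as in the Proposition
        # E[exp(lam*(X - E X))], computed exactly for the two-point distribution
        mgf = (1 - p) * np.exp(-lam * p) + p * np.exp(lam * (1 - p))
        assert mgf <= np.exp(lam**2 / 8)         # subgaussian with sigma^2 = 1/4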

Let us turn to the proof of the Proposition. The first observation is a classic result that connects relative entropy with moment generating functions. It dates back to the very beginnings of statistical mechanics (see the classic treatise by J. W. Gibbs (1902), Theorem III, p. 131).

Lemma. (Gibbs variational principle) Let Z be any random variable. Then

    \[\log \mathbb{E}_{\mathbb{P}}[e^Z]         = \sup_{\mathbb{Q} \ll \mathbb{P}}         \{         \mathbb{E}_{\mathbb{Q}}[Z] - D(\mathbb{Q} || \mathbb{P})         \}.\]

Proof. Assume that Z is bounded above by some constant Z\le M <\infty (otherwise replace Z by \min\{Z,M\} and then let M\uparrow\infty at the end). Define a probability measure \mathbb{\tilde Q} by

    \[d \mathbb{\tilde Q} = \frac{e^Z d\mathbb{P}}{\mathbb{E}_{\mathbb{P}}[e^Z]}.\]

Then

    \[\log\mathbb{E}_{\mathbb{P}}[e^Z] - D(\mathbb{Q} || \mathbb{\tilde Q})         = \log\mathbb{E}_{\mathbb{P}}[e^Z]          - \mathbb{E}_{\mathbb{Q}}\left[\log\frac{d \mathbb{Q}}{d \mathbb{P}}\right]         + \mathbb{E}_{\mathbb{Q}}\left[\log\frac{d \mathbb{\tilde Q}}{d \mathbb{P}}\right]         = \mathbb{E}_{\mathbb{Q}}[Z] - D(\mathbb{Q} || \mathbb{P}).\]

As the relative entropy is always positive, we have

    \[\log\mathbb{E}_{\mathbb{P}}[e^Z]          \ge \mathbb{E}_{\mathbb{Q}}[Z] - D(\mathbb{Q} || \mathbb{P})\]

for every \mathbb{Q}\ll\mathbb{P}, and equality is obtained by choosing the optimizer \mathbb{Q}=\mathbb{\tilde Q}. \square
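
On a finite probability space the variational principle is easy to verify numerically; the following is a minimal sketch (the choices of \mathbb{P} and Z are arbitrary and purely illustrative).

    import numpy as np

    rng = np.random.default_rng(0)
    P = rng.random(5); P /= P.sum()              # an arbitrary probability vector P
    Z = rng.standard_normal(5)                   # an arbitrary random variable Z

    lhs = np.log(np.sum(P * np.exp(Z)))          # log E_P[e^Z]

    # Equality at the tilted measure dQt = e^Z dP / E_P[e^Z]
    Qt = P * np.exp(Z); Qt /= Qt.sum()
    assert abs(lhs - (np.sum(Qt * Z) - np.sum(Qt * np.log(Qt / P)))) < 1e-10

    # The inequality >= holds for any other Q << P
    Q = rng.random(5); Q /= Q.sum()
    assert lhs >= np.sum(Q * Z) - np.sum(Q * np.log(Q / P)) - 1e-10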

Using the variational principle, it is easy to prove the Proposition.

Proof of the Proposition. By the variational principle, we have

    \[\mathbb{E}_{\mathbb{P}}[e^{\lambda\{f(X)-\mathbb{E}f(X)\}}] \le e^{\lambda^2\sigma^2/2}\]

if and only if

    \[\lambda\, \mathbb{E}_{\mathbb{Q}}[f(X)] - \lambda\, \mathbb{E}_{\mathbb{P}}[f(X)] - D(\mathbb{Q} || \mathbb{P}) \le \frac{\lambda^2\sigma^2}{2}\]

for all \mathbb{Q}\ll\mathbb{P}. Optimizing over \lambda>0 (writing a=\mathbb{E}_{\mathbb{Q}}[f(X)]-\mathbb{E}_{\mathbb{P}}[f(X)], the supremum of \lambda a - \lambda^2\sigma^2/2 over \lambda>0 equals a^2/2\sigma^2 when a\ge 0, while for a<0 both conditions hold trivially), we find that 1 is equivalent to the validity of

    \[\mathbb{E}_{\mathbb{Q}}[f(X)] - \mathbb{E}_{\mathbb{P}}[f(X)] \le \sqrt{2\sigma^2 D(\mathbb{Q} || \mathbb{P})}\]

for all f\in 1\text{-Lip} and \mathbb{Q}\ll\mathbb{P}. \square

Tensorization and optimal transport

So far we have considered concentration in a fixed metric space (\mathbb{X},d). If X_1,\ldots,X_n are independent random variables, we can certainly apply the Proposition to X=(X_1,\ldots,X_n) with the product distribution \mu^{\otimes n}. However, to establish dimension-free concentration, we would have to check that the conditions of the Proposition hold for \mu^{\otimes n} for every n with the same constant \sigma! This is hardly a satisfactory answer: we would like to characterize dimension-free concentration directly in terms of a property of \mu only. To this end, a natural conjecture might be that if the conditions of the Proposition hold for the measure \mu, then that will already imply the same property for the measures \mu^{\otimes n} for every n. This turns out not to be quite true, but this idea will lead us in the right direction.

Motivated by the above, we set out to answer the following

Question. Suppose that \mu satisfies W_1(\nu,\mu) \le \sqrt{2\sigma^2 D(\nu||\mu)} for every \nu\ll\mu. Does this imply that a similar property is satisfied by the product measures \mu^{\otimes n}?

Such a conclusion is often referred to as a tensorization property. To make progress in this direction, we must understand the classic connection between Wasserstein distances and optimal transportation.

Theorem. (Kantorovich-Rubinstein duality, 1958) Let \mu and \nu be probability measures on a Polish space. Let \mathcal{C}(\mu,\nu):=\{\mathrm{Law}(X,Y): X\sim\mu, Y\sim\nu \} be the set of couplings of \mu and \nu. Then

    \[W_1(\mu,\nu) = \inf_{\mathbb{M}\in \mathcal{C}(\mu,\nu)} \mathbb{E}_\mathbb{M}[d(X,Y)].\]

The right side of this equation is a so-called “optimal transport problem”. For this reason, inequalities such as W_1(\nu,\mu) \le \sqrt{2\sigma^2 D(\nu||\mu)} are often called transportation-information inequalities.

The full proof of Kantorovich-Rubinstein duality is part of the theory of optimal transportation and is beyond our scope (optimal transportation is itself a fascinating topic with many connections to other areas of mathematics such as probability theory, PDEs, and geometric analysis, perhaps a topic for another semester?). Fortunately, we will only need the easy half of the theorem in the sequel.

Proof of lower bound. For each f\in 1\text{-Lip} and \mathbb{M}\in \mathcal{C}(\mu,\nu), we have

    \[\mathbb{E}_\mu [f] - \mathbb{E}_\nu [f]          = \mathbb{E}_\mathbb{M}[f(X)-f(Y)]          \le \mathbb{E}_\mathbb{M}[d(X,Y)],\]

from which, taking the supremum over f\in 1\text{-Lip} (note that -f is 1-Lip whenever f is, so the absolute value causes no trouble) and then the infimum over couplings, we immediately get

    \[W_1(\mu,\nu) \le \inf_{\mathbb{M}\in \mathcal{C}(\mu,\nu)} \mathbb{E}_\mathbb{M}[d(X,Y)].\]

This proves the easy direction in the above theorem. \square
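
To make the optimal transport problem concrete, here is a minimal sketch (assuming scipy is available; the two discrete distributions are arbitrary) that solves the coupling problem as a small linear program on the real line and compares the optimal cost with scipy's one-dimensional W_1 routine; the two values agree.

    import numpy as np
    from scipy.optimize import linprog
    from scipy.stats import wasserstein_distance

    # Two small distributions on the real line
    x, mu = np.array([0.0, 1.0, 3.0]), np.array([0.5, 0.3, 0.2])
    y, nu = np.array([0.5, 2.0]),      np.array([0.6, 0.4])

    # Cost matrix d(x_i, y_j) = |x_i - y_j|, flattened for the LP variables M_ij
    cost = np.abs(x[:, None] - y[None, :]).ravel()

    # Marginal constraints: row sums of M equal mu, column sums equal nu
    m, n = len(x), len(y)
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0         # sum_j M_ij = mu_i
    for j in range(n):
        A_eq[m + j, j::n] = 1.0                  # sum_i M_ij = nu_j
    b_eq = np.concatenate([mu, nu])

    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    print(res.fun)                                                 # optimal transport cost
    print(wasserstein_distance(x, y, u_weights=mu, v_weights=nu))  # W_1 in one dimension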

It turns out that the optimal transportation approach is the “right” way to tensorize transportation-information inequalities. Even though the following result is not quite yet what we need to prove dimension-free concentration, it already suffices to derive some interesting results.

Proposition. (Tensorization) Suppose that

    \[\inf_{\mathbb{M}\in \mathcal{C}(\nu,\mu)} \mathbb{E}_{\mathbb{M}}[d(X,Y)]^2          \le 2\sigma^2 D(\nu || \mu)\]

for all \nu\ll\mu. Then, for any n\ge 1,

    \[\inf_{\mathbb{M}\in \mathcal{C}(\mathbb{Q},\mu^{\otimes n})}          \sum_{i=1}^n \mathbb{E}_{\mathbb{M}}[d(X_i,Y_i)]^2         \le 2\sigma^2 D(\mathbb{Q} || \mu^{\otimes n})\]

for all \mathbb{Q} \ll\mu^{\otimes n}.

We postpone the proof of this result until the next lecture.

Example. Let d(x,y) = \mathbf{1}_{x\neq y}. By the maximal coupling characterization of the total variation distance and Pinsker's inequality

    \[\inf_{\mathbb{M}\in \mathcal{C}(\mu,\nu)}         \mathbb{P}_{\mathbb{M}}[X\neq Y]^2 = \|\mu-\nu\|_{\rm TV}^2         \le \frac{1}{2} D(\nu || \mu)         \quad\mbox{for all }\nu \ll \mu.\]

Define the weighted Hamming distance for positive weights c_i as

    \[d_n(x,y) = \sum_{i=1}^n c_i \mathbf{1}_{x_i\neq y_i}.\]

Then, by Cauchy-Schwarz and tensorization we get

    \[\inf_{\mathbb{M}\in \mathcal{C}(\mathbb{Q},\mu^{\otimes n})}         \mathbb{E}_{\mathbb{M}}[d_n(X,Y)]^2        \le        \left(\sum_{i=1}^n c_i^2\right) \inf_{\mathbb{M}\in \mathcal{C}(\mathbb{Q},\mu^{\otimes n})}\sum_{i=1}^n          \mathbb{P}_{\mathbb{M}}[X_i\neq Y_i]^2         \le \frac{1}{2} \left(\sum_{i=1}^n c_i^2\right) D(\mathbb{Q} || \mu^{\otimes n})\]

for each \mathbb{Q}\ll\mu^{\otimes n}. Hence, by the easy half of Kantorovich-Rubinstein duality (applied to the metric d_n) and the Proposition above, we have

    \[\mathbb{E}[e^{\lambda\{f(X_1,\ldots,X_n)-\mathbb{E}f(X_1,\ldots,X_n)\}}]         \le e^{\lambda^2\sigma^2/2},\]

with \sigma^2=\frac{1}{4} \sum_{i=1}^n c_i^2, for each \lambda>0 and each function f which is 1-Lip with respect to d_n. This implies

    \[\mathbb{P}[f(X_1,\ldots,X_n)-\mathbb{E}f(X_1,\ldots,X_n) \ge t]          \le e^{-2t^2/\sum_{i=1}^n c_i^2}.\]

That is, we recover the well-known bounded difference inequality (McDiarmid's inequality).
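
As a quick sanity check, here is a minimal numerical sketch (with the empirical mean of uniform [0,1] variables as the test function, an arbitrary choice for which one may take c_i=1/n) comparing the empirical tail to the bound e^{-2t^2/\sum_i c_i^2}=e^{-2nt^2}.

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps, t = 50, 100000, 0.1

    # f(x_1,...,x_n) = mean(x) is 1-Lip for d_n with weights c_i = 1/n when x_i is in [0,1]
    X = rng.random(size=(reps, n))                # X_i i.i.d. uniform on [0,1], E[mean] = 1/2
    f = X.mean(axis=1)
    empirical = np.mean(f - 0.5 >= t)
    bound = np.exp(-2 * n * t**2)                 # bounded difference bound exp(-2 t^2 / sum c_i^2)
    print(empirical, bound)                       # the empirical tail does not exceed the bound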

Outlook

We have not yet shown that the transportation-information inequality holds for X\sim N(0,1) on (\mathbb{R},|\cdot|). Even once we establish this, however, the tensorization result we have given above is not sufficient to prove dimension-free Gaussian concentration in the sense of Tsirelson-Ibragimov-Sudakov. Indeed, if we apply the above tensorization result, then at best we can get

    \[\mathbb{P}[f(X_1,\ldots,X_n)-\mathbb{E}f(X_1,\ldots,X_n) \ge t]          \le e^{-t^2/2\sum_{i=1}^n c_i^2}\]

whenever

    \[|f(x)-f(y)|\le \sum_{i=1}^n c_i |x_i-y_i|.\]

Setting the weights c_i=1, we find a tail bound of the form e^{-t^2/2n} whenever f is 1\text{-Lip} with respect to the \ell_1-norm |f(x)-f(y)|\le\|x-y\|_1. Note that this is not dimension-free: the factor 1/n appears inside the exponent! On the other hand, Gaussian concentration shows that we have a dimension-free tail bound e^{-t^2/2} whenever f is 1\text{-Lip} with respect to the \ell_2-norm |f(x)-f(y)|\le\|x-y\|_2. Note that the latter property is strictly stronger than the former because \|\cdot\|_1\le\sqrt{n}\,\|\cdot\|_2! Our tensorization method is not sufficiently strong, however, to yield this type of dimension-free result.
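
To spell out the comparison: if f is 1\text{-Lip} with respect to the \ell_1-norm, then by Cauchy-Schwarz

    \[|f(x)-f(y)| \le \|x-y\|_1 \le \sqrt{n}\,\|x-y\|_2,\]

so f is \sqrt{n}-Lipschitz with respect to the \ell_2-norm, and applying Gaussian concentration with L=\sqrt{n} recovers the tail bound e^{-t^2/2n} above; the converse implication does not hold.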

Fortunately, we now have enough ingredients to derive a slightly stronger transportation-information inequality that is not only sufficient, but also necessary for dimension-free concentration. Stay tuned!

Lecture by Ramon van Handel | Scribed by Patrick Rebeschini
