## Lecture 9. Concentration, information, transportation (1)

The goal of the next two lectures is to explore the connections between concentration of measure, entropy inequalities, and optimal transportation.

**What is concentration?**

Roughly speaking, concentration of measure is the idea that nonlinear functions of many random variables often have “nice tails”. Concentration represents one of the most important ideas in modern probability and has played an important role in many other fields, such as statistics, computer science, combinatorics, and geometric analysis.

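To illustrate concentration, let $X\sim N(0,1)$. Using Markov’s inequality, we can estimate for $t>0$

$$\mathbf{P}[X\ge t] = \inf_{\lambda>0}\mathbf{P}[e^{\lambda X}\ge e^{\lambda t}] \le \inf_{\lambda>0} e^{-\lambda t}\,\mathbf{E}[e^{\lambda X}] = \inf_{\lambda>0} e^{\lambda^2/2-\lambda t} = e^{-t^2/2},$$

where we have made the optimal choice $\lambda=t$ (this is called a *Chernoff bound*). Thus a Gaussian random variable has “nice tails”. The concentration of measure phenomenon shows that not only do Gaussian random variables have nice tails, but that many nonlinear functions of Gaussian random variables still have nice tails. The classical result on this topic is the following.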
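
As a quick numerical sanity check of the Chernoff bound for a standard Gaussian, the following Python sketch compares the exact tail $\mathbf{P}[X\ge t]$ (computed through the complementary error function) with the bound $e^{-t^2/2}$, and evaluates the exponent $\lambda^2/2-\lambda t$ so that one can see it is smallest at $\lambda=t$. The function names are illustrative choices made here, not part of the lecture.

```python
import math


def gaussian_tail(t: float) -> float:
    """Exact P[X >= t] for X ~ N(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def chernoff_bound(t: float) -> float:
    """Chernoff bound exp(-t^2 / 2), obtained with the choice lambda = t."""
    return math.exp(-t * t / 2.0)


def chernoff_exponent(lam: float, t: float) -> float:
    """Exponent lambda^2 / 2 - lambda * t of the Markov bound at a given lambda."""
    return lam * lam / 2.0 - lam * t


if __name__ == "__main__":
    for t in (0.5, 1.0, 2.0, 3.0):
        print(f"t={t}: tail={gaussian_tail(t):.5f}  bound={chernoff_bound(t):.5f}")
```

The bound is not tight (the exact tail carries an additional polynomial factor), but it captures the correct Gaussian rate of decay.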

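**Theorem.** (Gaussian concentration, Tsirelson-Ibragimov-Sudakov 1976)

Let $X_1,\ldots,X_n$ be i.i.d. $N(0,1)$, and let $f:\mathbb{R}^n\to\mathbb{R}$ be $L$-Lipschitz, that is, $|f(x)-f(y)|\le L\|x-y\|$ for each $x,y\in\mathbb{R}^n$ (where $\|\cdot\|$ is the Euclidean norm). Then

$$\mathbf{P}[f(X_1,\ldots,X_n)-\mathbf{E}f(X_1,\ldots,X_n)\ge t]\le e^{-t^2/2L^2}\quad\text{for all }t\ge 0.$$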

This result shows that Lipschitz functions (which could be very nonlinear) of i.i.d. Gaussian random variables concentrate closely around their expected value: the probability that the function differs from its expected value decays as a Gaussian in the degree of the discrepancy. The beauty of this result is that it is *dimension-free*, that is, the rate of decay of the tail only depends on the Lipschitz constant $L$, and there is no dependence on the number of random variables $n$. Such results are essential in high-dimensional problems where one would like to obtain dimension-free bounds.
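
To see the dimension-free behavior numerically, here is a minimal Monte Carlo sketch. It uses the Euclidean norm as the test function (it is $1$-Lipschitz by the triangle inequality), and it replaces the true expectation by the empirical mean of the samples, which is an approximation made here for simplicity. The sample size, seed, and function names are arbitrary illustrative choices.

```python
import math
import random


def euclidean_norm(x):
    """f(x) = ||x||_2, a 1-Lipschitz function on R^n."""
    return math.sqrt(sum(v * v for v in x))


def empirical_upper_tail(n, t, samples=2000, seed=0):
    """Monte Carlo estimate of P[f(X) - E f(X) >= t] for X ~ N(0, I_n).

    The expectation E f(X) is approximated by the sample mean.
    """
    rng = random.Random(seed)
    values = [
        euclidean_norm([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(samples)
    ]
    mean = sum(values) / samples
    return sum(1 for v in values if v - mean >= t) / samples


def gaussian_concentration_bound(t, lipschitz=1.0):
    """Right-hand side exp(-t^2 / (2 L^2)) of the Gaussian concentration bound."""
    return math.exp(-t * t / (2.0 * lipschitz ** 2))


if __name__ == "__main__":
    for n in (5, 50, 500):
        est = empirical_upper_tail(n, t=1.0)
        print(f"n={n}: empirical tail={est:.4f}  bound={gaussian_concentration_bound(1.0):.4f}")
```

The estimated tail should stay below the same bound for every $n$, even though the typical size of the norm itself grows like $\sqrt{n}$.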

Gaussian concentration is only one result in a theory with many fascinating ideas and questions. One might ask, for instance, what random variables besides Gaussians exhibit this type of phenomenon, whether other notions of Lipschitz functions can be considered, etc. See, for example, the excellent books by Ledoux and by Boucheron, Lugosi, and Massart for much more around this theme.

The basic question that we want to address in the next two lectures is the following.

**Question.** Let $X_1,\ldots,X_n$ be i.i.d. random variables on a metric space $(\mathbb{X},d)$ with distribution $\mu$. Can we characterize all $\mu$ for which dimension-free concentration holds as for the Gaussian case above?

It turns out that a remarkably complete answer can be given in terms of entropy inequalities. This is where information-theoretic methods enter the picture.

When bounding tails of random variables (provided they decay at least exponentially), it is convenient to bound moment generating functions, as we did above, instead of working directly with the tail.

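**Definition.** $X$ is subgaussian with parameter $\sigma^2$ if $\mathbf{E}[e^{\lambda(X-\mathbf{E}X)}]\le e^{\lambda^2\sigma^2/2}$ for each $\lambda\in\mathbb{R}$.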

If $X$ is subgaussian, then $\mathbf{P}[X-\mathbf{E}X\ge t]\le e^{-t^2/2\sigma^2}$ (by the Chernoff bound as above). In fact, it can be shown that the converse is also true for slightly larger $\sigma$, so that the property of having Gaussian tails is equivalent to the subgaussian property (we omit the proof). It will be convenient to investigate the property of Gaussian tails in terms of subgaussianity.

**Concentration and relative entropy**

Before we can tackle the problem of *dimension-free* concentration, we must begin by making the connection between subgaussianity and entropy in the most basic setting.

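Let $(\mathbb{X},d)$ be a metric space. A function $f:\mathbb{X}\to\mathbb{R}$ is $L$-Lipschitz ($L$-Lip) if $|f(x)-f(y)|\le L\,d(x,y)$ for each $x,y\in\mathbb{X}$. One thing that we can do with Lipschitz functions is to define a distance between probability measures (we will assume in the sequel that the necessary measurability conditions are satisfied): for probability measures $\mu,\nu$ on $\mathbb{X}$, define the *Wasserstein distance* as

$$W_1(\mu,\nu) := \sup_{f\in 1\text{-Lip}}\left|\int f\,d\mu-\int f\,d\nu\right|.$$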

The idea is that two measures are close if the expectations of a large class of functions are close. In the case of $W_1$, the class of functions being used is the class $1$-Lip.

As we are interested in concentration of Lipschitz functions, it is intuitive that a quantity such as $W_1$ should enter the picture. On the other hand, we have seen in earlier lectures that the relative entropy $D(\nu\|\mu)$ can also be viewed as a “sort of distance” between probability measures (albeit not a metric). It is not clear, *a priori*, how $W_1$ and $D$ are related. We will presently see that relative entropy is closely related to moment generating functions, and therefore to tails of random variables: in particular, we can characterize concentration on a fixed space by comparing the Wasserstein metric and relative entropy.

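**Proposition.** The following are equivalent for $X\sim\mu$:

1. $\mathbf{E}[e^{\lambda\{f(X)-\mathbf{E}f(X)\}}]\le e^{\lambda^2\sigma^2/2}$ for every $\lambda>0$ and $f\in 1\text{-Lip}$.
2. $W_1(\nu,\mu)\le\sqrt{2\sigma^2 D(\nu\|\mu)}$ for every probability measure $\nu\ll\mu$.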

Note that this result characterizes those measures on a *fixed* metric space that exhibit Gaussian concentration. There is no independence as of yet, and thus no notion of “dimension-free” concentration for functions of independent random variables: the present result is in “fixed dimension”.

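**Example.** Let $d(x,y)=\mathbf{1}_{x\ne y}$ be the trivial metric. A function $f$ is $1$-Lip with respect to $d$ if

$$|f(x)-f(y)|\le\mathbf{1}_{x\ne y}\quad\text{for each }x,y\in\mathbb{X},$$

that is, if and only if $\mathrm{osc}\,f:=\sup f-\inf f\le 1$. Hence we have

$$W_1(\mu,\nu)=\sup_{f:\,\mathrm{osc}\,f\le 1}\left|\int f\,d\mu-\int f\,d\nu\right|=\|\mu-\nu\|_{\rm TV},$$

where the total variation distance is normalized as $\|\mu-\nu\|_{\rm TV}:=\sup_A|\mu(A)-\nu(A)|$. Thus *2* in the above proposition holds with $\sigma^2=\frac{1}{4}$ by Pinsker’s inequality

$$\|\mu-\nu\|_{\rm TV}\le\sqrt{\tfrac{1}{2}D(\nu\|\mu)}.$$

We consequently find by *1* above that

$$\mathbf{E}[e^{\lambda\{f(X)-\mathbf{E}f(X)\}}]\le e^{\lambda^2/8}$$

for every function $f$ such that $\mathrm{osc}\,f\le 1$. Thus the above Proposition reduces in this case to the well known *Hoeffding lemma*, which states that bounded random variables are subgaussian.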
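
Both sides of this example can be checked by hand in the simplest case of Bernoulli measures on $\{0,1\}$, where everything is available in closed form. The sketch below (with illustrative function names chosen here) verifies Pinsker’s inequality $|q-p|\le\sqrt{D/2}$ between $\mathrm{Ber}(q)$ and $\mathrm{Ber}(p)$, and Hoeffding’s lemma for $f(x)=x$, whose oscillation is $1$.

```python
import math


def kl_bernoulli(q: float, p: float) -> float:
    """Relative entropy D(Ber(q) || Ber(p)) in nats, with 0 log 0 = 0.

    Assumes 0 < p < 1 so that Ber(q) is absolutely continuous w.r.t. Ber(p).
    """
    total = 0.0
    for a, b in ((q, p), (1.0 - q, 1.0 - p)):
        if a > 0.0:
            total += a * math.log(a / b)
    return total


def tv_bernoulli(q: float, p: float) -> float:
    """Total variation sup_A |mu(A) - nu(A)| between Ber(q) and Ber(p)."""
    return abs(q - p)


def centered_mgf_bernoulli(lam: float, p: float) -> float:
    """E exp(lam * (X - E X)) for X ~ Ber(p), so f(x) = x has oscillation 1."""
    return p * math.exp(lam * (1.0 - p)) + (1.0 - p) * math.exp(-lam * p)


if __name__ == "__main__":
    p, q, lam = 0.3, 0.6, 2.0
    print("TV      :", tv_bernoulli(q, p), "<=", math.sqrt(kl_bernoulli(q, p) / 2.0))
    print("Hoeffding:", centered_mgf_bernoulli(lam, p), "<=", math.exp(lam * lam / 8.0))
```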

Let us turn to the proof of the Proposition. The first observation is a classic result that connects relative entropy with moment generating functions. It dates back to the very beginnings of statistical mechanics (see the classic treatise by J. W. Gibbs (1902), Theorem III, p. 131).

**Lemma.** (Gibbs variational principle) Let $Z$ be any random variable. Then

$$\log\mathbf{E}_P[e^Z]=\sup_{Q\ll P}\{\mathbf{E}_Q[Z]-D(Q\|P)\}.$$

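**Proof.** Assume that $Z$ is bounded above by some constant $M<\infty$ (otherwise replace $Z$ by $\min\{Z,M\}$ and then let $M\uparrow\infty$ at the end). Define a probability measure $\tilde P$ by

$$d\tilde P=\frac{e^Z\,dP}{\mathbf{E}_P[e^Z]}.$$

Then

$$\log\mathbf{E}_P[e^Z]-D(Q\|\tilde P)=\log\mathbf{E}_P[e^Z]-\mathbf{E}_Q\!\left[\log\frac{dQ}{dP}\right]+\mathbf{E}_Q\!\left[\log\frac{d\tilde P}{dP}\right]=\mathbf{E}_Q[Z]-D(Q\|P).$$

As the relative entropy is always nonnegative, we have

$$\log\mathbf{E}_P[e^Z]\ge\mathbf{E}_Q[Z]-D(Q\|P)$$

for every $Q\ll P$, and equality is obtained by choosing the optimizer $Q=\tilde P$.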
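
On a finite sample space the Gibbs variational principle can be verified directly. The following sketch is an illustration with arbitrary random inputs and function names chosen here. It checks that $\mathbf{E}_Q[Z]-D(Q\|P)$ never exceeds $\log\mathbf{E}_P[e^Z]$ for randomly drawn $Q$, and that the exponentially tilted measure attains equality.

```python
import math
import random


def log_mgf(z, p):
    """log E_P[exp(Z)] on a finite space with point masses p and values z."""
    return math.log(sum(pi * math.exp(zi) for zi, pi in zip(z, p)))


def relative_entropy(q, p):
    """D(Q || P) in nats for finite distributions with p > 0 everywhere."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


def gibbs_objective(q, z, p):
    """E_Q[Z] - D(Q || P), the quantity maximized in the variational principle."""
    return sum(qi * zi for qi, zi in zip(q, z)) - relative_entropy(q, p)


def tilted_measure(z, p):
    """The optimizer: dQ proportional to exp(Z) dP."""
    weights = [pi * math.exp(zi) for zi, pi in zip(z, p)]
    total = sum(weights)
    return [w / total for w in weights]


def random_distribution(k, rng):
    """A random strictly positive probability vector of length k."""
    raw = [rng.random() + 1e-3 for _ in range(k)]
    total = sum(raw)
    return [r / total for r in raw]


if __name__ == "__main__":
    rng = random.Random(1)
    p = random_distribution(5, rng)
    z = [rng.uniform(-2.0, 2.0) for _ in range(5)]
    print("log E_P[e^Z]       :", log_mgf(z, p))
    print("objective at tilt  :", gibbs_objective(tilted_measure(z, p), z, p))
    print("objective at random:", gibbs_objective(random_distribution(5, rng), z, p))
```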

Using the variational principle, it is easy to prove the Proposition.

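**Proof of the Proposition.** Write $P=\mu$. Applying the variational principle to $Z=\lambda\{f(X)-\mathbf{E}_P f(X)\}$, we have

$$\mathbf{E}_P[e^{\lambda\{f(X)-\mathbf{E}_P f(X)\}}]\le e^{\lambda^2\sigma^2/2}$$

if and only if

$$\lambda\{\mathbf{E}_Q[f(X)]-\mathbf{E}_P[f(X)]\}-D(Q\|P)\le\frac{\lambda^2\sigma^2}{2}$$

for all $Q\ll P$. Optimizing over $\lambda>0$, we find that *1* is equivalent to the validity of

$$\mathbf{E}_Q[f(X)]-\mathbf{E}_P[f(X)]\le\sqrt{2\sigma^2 D(Q\|P)}$$

for all $f\in 1\text{-Lip}$ and $Q\ll P$. As the class $1$-Lip is closed under $f\mapsto -f$, taking the supremum over $f$ shows that this is precisely *2*.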
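
The optimization over $\lambda$ in this proof can be spelled out as follows. As shorthand introduced here, write $a:=\mathbf{E}_Q[f(X)]-\mathbf{E}_P[f(X)]$ and $D:=D(Q\|P)$. The condition obtained from the variational principle reads

$$\lambda a-\frac{\lambda^2\sigma^2}{2}\le D\quad\text{for all }\lambda>0
\qquad\Longleftrightarrow\qquad
\sup_{\lambda>0}\left\{\lambda a-\frac{\lambda^2\sigma^2}{2}\right\}\le D.$$

If $a\le 0$ the supremum is $0$ and the condition holds trivially, since $D\ge 0$. If $a>0$, the quadratic is maximized at $\lambda=a/\sigma^2$, so that

$$\sup_{\lambda>0}\left\{\lambda a-\frac{\lambda^2\sigma^2}{2}\right\}=\frac{a^2}{2\sigma^2}\le D
\qquad\Longleftrightarrow\qquad
a\le\sqrt{2\sigma^2 D}.$$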

**Tensorization and optimal transport**

So far we have considered concentration in a fixed metric space $(\mathbb{X},d)$. If $X_1,\ldots,X_n$ are independent random variables, we can certainly apply the Proposition to $X=(X_1,\ldots,X_n)$ with the product distribution $\mu^{\otimes n}$. However, to establish *dimension-free* concentration, we would have to check that the conditions of the Proposition hold for $\mu^{\otimes n}$ for *every* $n$ with the same constant $\sigma$! This is hardly a satisfactory answer: we would like to characterize dimension-free concentration directly in terms of a property of $\mu$ only. To this end, a natural conjecture might be that if the conditions of the Proposition hold for the measure $\mu$, then that will already imply the same property for the measures $\mu^{\otimes n}$ for every $n$. This turns out not to be *quite* true, but this idea will lead us in the right direction.

Motivated by the above, we set out to answer the following

**Question.** Suppose that $\mu$ satisfies $W_1(\nu,\mu)\le\sqrt{2\sigma^2 D(\nu\|\mu)}$ for every $\nu\ll\mu$. Does this imply that a similar property is satisfied by the product measures $\mu^{\otimes n}$?

Such a conclusion is often referred to as a *tensorization* property. To make progress in this direction, we must understand the classic connection between Wasserstein distances and optimal transportation.

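**Theorem.** (Kantorovich-Rubinstein duality, 1958) Let $\mu$ and $\nu$ be probability measures on a Polish space. Let $\mathcal{C}(\mu,\nu):=\{\mathrm{Law}(X,Y):X\sim\mu,\ Y\sim\nu\}$ be the set of couplings of $\mu$ and $\nu$. Then

$$W_1(\mu,\nu)=\inf_{M\in\mathcal{C}(\mu,\nu)}\mathbf{E}_M[d(X,Y)].$$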

The right side of this equation is a so-called “optimal transport problem”. For this reason, inequalities such as $W_1(\nu,\mu)\le\sqrt{2\sigma^2 D(\nu\|\mu)}$ are often called *transportation-information inequalities*.

The full proof of Kantorovich-Rubinstein duality is part of the theory of optimal transportation and is beyond our scope (optimal transportation is itself a fascinating topic with many connections to other areas of mathematics such as probability theory, PDEs, and geometric analysis; perhaps a topic for another semester?). Fortunately, we will only need the easy half of the theorem in the sequel.

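**Proof of lower bound.** For each $f\in 1\text{-Lip}$ and $M\in\mathcal{C}(\mu,\nu)$, we have

$$\left|\int f\,d\mu-\int f\,d\nu\right|=\big|\mathbf{E}_M[f(X)-f(Y)]\big|\le\mathbf{E}_M[d(X,Y)],$$

from which we immediately get

$$W_1(\mu,\nu)\le\inf_{M\in\mathcal{C}(\mu,\nu)}\mathbf{E}_M[d(X,Y)].$$

This proves the easy direction in the above theorem.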
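
The easy direction can be illustrated on the real line with $d(x,y)=|x-y|$: any $1$-Lipschitz test function gives a lower bound on the Wasserstein distance, and any coupling gives an upper bound. The sketch below uses two small discrete measures chosen here purely for illustration, the test function $f(x)=x$, the independent coupling, and the monotone (quantile) coupling. None of these specific choices come from the lecture.

```python
import itertools


def expectation(points, weights, f):
    """Integral of f against the discrete measure sum_i weights[i] * delta_{points[i]}."""
    return sum(w * f(x) for x, w in zip(points, weights))


def independent_coupling_cost(xs, mu, ys, nu):
    """E_M |X - Y| under the product coupling M = mu x nu."""
    return sum(
        m * n * abs(x - y)
        for (x, m), (y, n) in itertools.product(zip(xs, mu), zip(ys, nu))
    )


def monotone_coupling_cost(xs, mu, ys, nu):
    """E_M |X - Y| under the quantile coupling; xs and ys must be sorted ascending."""
    i = j = 0
    mi, nj = mu[0], nu[0]
    cost = 0.0
    while i < len(xs) and j < len(ys):
        mass = min(mi, nj)  # move as much mass as both current atoms allow
        cost += mass * abs(xs[i] - ys[j])
        mi -= mass
        nj -= mass
        if mi <= 1e-15:  # atom of mu exhausted, advance to the next one
            i += 1
            mi = mu[i] if i < len(xs) else 0.0
        if nj <= 1e-15:  # atom of nu exhausted, advance to the next one
            j += 1
            nj = nu[j] if j < len(ys) else 0.0
    return cost


if __name__ == "__main__":
    xs, mu = [0.0, 1.0, 2.0], [0.5, 0.25, 0.25]
    ys, nu = [0.0, 1.0, 3.0], [0.25, 0.25, 0.5]
    lower = abs(expectation(xs, mu, lambda x: x) - expectation(ys, nu, lambda x: x))
    print("dual lower bound with f(x) = x:", lower)
    print("monotone coupling cost        :", monotone_coupling_cost(xs, mu, ys, nu))
    print("independent coupling cost     :", independent_coupling_cost(xs, mu, ys, nu))
```

In this particular example the lower bound and the monotone coupling cost coincide, so both equal the Wasserstein distance, while the independent coupling is strictly more expensive.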

It turns out that the optimal transportation approach is the “right” way to tensorize transportation-information inequalities. Even though the following result is not quite yet what we need to prove dimension-free concentration, it already suffices to derive some interesting results.

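**Proposition.** (Tensorization) Suppose that

$$\inf_{M\in\mathcal{C}(\nu,\mu)}\big(\mathbf{E}_M[d(X,Y)]\big)^2\le 2\sigma^2 D(\nu\|\mu)$$

for all $\nu\ll\mu$. Then, for any $n\ge 1$,

$$\inf_{M\in\mathcal{C}(\nu,\mu^{\otimes n})}\sum_{i=1}^n\big(\mathbf{E}_M[d(X_i,Y_i)]\big)^2\le 2\sigma^2 D(\nu\|\mu^{\otimes n})$$

for all $\nu\ll\mu^{\otimes n}$.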

We postpone the proof of this result until the next lecture.

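**Example.** Let $d(x,y)=\mathbf{1}_{x\ne y}$. By Pinsker’s inequality

$$\inf_{M\in\mathcal{C}(\mu,\nu)}\mathbf{P}_M[X\ne Y]^2=\|\mu-\nu\|_{\rm TV}^2\le\frac{1}{2}D(\nu\|\mu)\quad\text{for all }\nu\ll\mu.$$

Define the *weighted Hamming distance* for positive weights $c_i$ as

$$d_n(x,y)=\sum_{i=1}^n c_i\mathbf{1}_{x_i\ne y_i}.$$

Then, by Cauchy-Schwarz and tensorization we get

$$\inf_{M\in\mathcal{C}(\nu,\mu^{\otimes n})}\mathbf{E}_M[d_n(X,Y)]=\inf_{M}\sum_{i=1}^n c_i\,\mathbf{P}_M[X_i\ne Y_i]\le\sqrt{\sum_{i=1}^n c_i^2}\;\inf_{M}\sqrt{\sum_{i=1}^n\mathbf{P}_M[X_i\ne Y_i]^2}\le\sqrt{\frac{1}{2}\sum_{i=1}^n c_i^2\,D(\nu\|\mu^{\otimes n})}$$

for each $\nu\ll\mu^{\otimes n}$. So, we have

$$\mathbf{E}[e^{\lambda\{f(X)-\mathbf{E}f(X)\}}]\le e^{\lambda^2\sigma^2/2}$$

with $\sigma^2=\frac{1}{4}\sum_{i=1}^n c_i^2$, for each $\lambda>0$ and each function $f$ which is 1-Lip with respect to $d_n$. This implies

$$\mathbf{P}[f(X)-\mathbf{E}f(X)\ge t]\le e^{-2t^2/\sum_{i=1}^n c_i^2}.$$

That is, we recover the well known *bounded difference inequality*.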
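
For a concrete instance of the bounded difference inequality, take $f(x)=\sum_i x_i$ with i.i.d. fair coin flips $X_i\in\{0,1\}$. This $f$ changes by at most $c_i=1$ when one coordinate is changed, so the bound reads $e^{-2t^2/n}$, and the exact tail is a binomial sum. The sketch below compares the two. The choice of $f$, of fair coins, and the function names are illustrative assumptions made here.

```python
import math


def binomial_upper_tail(n: int, t: float) -> float:
    """Exact P[S - n/2 >= t] for S ~ Binomial(n, 1/2)."""
    threshold = n / 2.0 + t
    count = sum(math.comb(n, k) for k in range(n + 1) if k >= threshold)
    return count / 2.0 ** n


def bounded_difference_bound(t: float, weights) -> float:
    """Right-hand side exp(-2 t^2 / sum_i c_i^2) of the bounded difference inequality."""
    return math.exp(-2.0 * t * t / sum(c * c for c in weights))


if __name__ == "__main__":
    n = 100
    weights = [1.0] * n  # f(x) = sum_i x_i changes by at most 1 per coordinate
    for t in (5.0, 10.0, 20.0):
        exact = binomial_upper_tail(n, t)
        bound = bounded_difference_bound(t, weights)
        print(f"t={t}: exact tail={exact:.3e}  bound={bound:.3e}")
```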

**Outlook**

We have not yet shown that the transportation-information inequality holds for $X\sim N(0,1)$ on $(\mathbb{R},|\cdot|)$. Even once we establish this, however, the tensorization result we have given above is not sufficient to prove dimension-free Gaussian concentration in the sense of Tsirelson-Ibragimov-Sudakov. Indeed, if we apply the above tensorization result, then at best we can get

$$\mathbf{P}[f(X_1,\ldots,X_n)-\mathbf{E}f(X_1,\ldots,X_n)\ge t]\le e^{-t^2/2\sum_{i=1}^n c_i^2}$$

whenever

$$|f(x)-f(y)|\le\sum_{i=1}^n c_i|x_i-y_i|.$$

Setting the weights $c_i=1$, we find a tail bound of the form $e^{-t^2/2n}$ whenever $f$ is $1$-Lip with respect to the $\ell_1$-norm $\|x\|_1=\sum_{i=1}^n|x_i|$. Note that this is not dimension-free: the factor $1/n$ appears inside the exponent! On the other hand, Gaussian concentration shows that we have a *dimension-free* tail bound $e^{-t^2/2}$ whenever $f$ is $1$-Lip with respect to the $\ell_2$-norm $\|x\|_2=(\sum_{i=1}^n x_i^2)^{1/2}$. Note that the latter property is strictly stronger than the former because $\|x\|_1\le\sqrt{n}\,\|x\|_2$! Our tensorization method is not sufficiently strong, however, to yield this type of dimension-free result.
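
The gap between the two bounds is easy to tabulate. The sketch below, with function names chosen here for illustration, evaluates the $\ell_1$-type bound $e^{-t^2/2n}$ obtained from tensorization with unit weights next to the dimension-free $\ell_2$-type bound $e^{-t^2/2}$, and checks the norm comparison $\|x\|_1\le\sqrt{n}\,\|x\|_2$ on a sample vector.

```python
import math


def l1_tensorization_bound(t: float, n: int) -> float:
    """exp(-t^2 / (2 n)): tail bound for 1-Lip functions w.r.t. the l1 norm (c_i = 1)."""
    return math.exp(-t * t / (2.0 * n))


def l2_gaussian_bound(t: float) -> float:
    """exp(-t^2 / 2): dimension-free bound for 1-Lip functions w.r.t. the l2 norm."""
    return math.exp(-t * t / 2.0)


def l1_norm(x) -> float:
    return sum(abs(v) for v in x)


def l2_norm(x) -> float:
    return math.sqrt(sum(v * v for v in x))


if __name__ == "__main__":
    t = 3.0
    for n in (1, 10, 100, 1000):
        print(f"n={n}: l1-type bound={l1_tensorization_bound(t, n):.4f}  "
              f"l2-type bound={l2_gaussian_bound(t):.4f}")
```

As $n$ grows with $t$ fixed, the $\ell_1$-type bound tends to $1$ and becomes uninformative, while the $\ell_2$-type bound does not change.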

Fortunately, we now have enough ingredients to derive a slightly stronger transportation-information inequality that is not only sufficient, but also necessary for dimension-free concentration. Stay tuned!

*Lecture by Ramon van Handel* | *Scribed by Patrick Rebeschini*