Lecture 10. Concentration, information, transportation (2)
Recall the main proposition proved in the previous lecture, which is due to Bobkov and Götze (1999).
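Proposition. The following are equivalent for $X\sim\mu$:
- $\mathbb{E}[e^{\lambda\{f(X)-\mathbb{E}f(X)\}}]\le e^{\lambda^2\sigma^2/2}$ for every $\lambda>0$ and every $1$-Lipschitz function $f$ with finite mean.
- $W_1(\nu,\mu)\le\sqrt{2\sigma^2 D(\nu\|\mu)}$ for every probability measure $\nu\ll\mu$.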
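This result provides a characterization of concentration of Lipschitz functions on a fixed metric space. Our original aim, however, was to understand dimension-free concentration (in analogy with the Gaussian case): that is, for which measures $\mu$ is it true that if $X_1,\ldots,X_n$ are i.i.d. $\sim\mu$, then every function $f(X_1,\ldots,X_n)$ that is $1$-Lipschitz with respect to the metric $d_n(x,y):=[\sum_{i=1}^n d(x_i,y_i)^2]^{1/2}$ is subgaussian with the same parameter $\sigma^2$ for every $n\ge 1$? In principle, the above result answers this question: $\mu$ satisfies dimension-free concentration if and only if
$$W_1(\mathbb{Q},\mu^{\otimes n})\le\sqrt{2\sigma^2 D(\mathbb{Q}\|\mu^{\otimes n})}\quad\text{for every }\mathbb{Q}\ll\mu^{\otimes n}\text{ and }n\ge 1,$$
where $W_1$ is defined with respect to the metric $d_n$ on $\mathbb{X}^n$.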
However, this is not a very satisfactory characterization: to check whether $\mu$ satisfies dimension-free concentration, we must check that the transportation-information inequality holds for $\mu^{\otimes n}$ for every $n\ge 1$. Instead, we would like to characterize dimension-free concentration in terms of a property of $\mu$ itself (and not its tensor products). This will be achieved in the present lecture.
How are we going to obtain a transportation-information inequality for every $\mu^{\otimes n}$ starting from only a property of $\mu$? If one is an optimist, one might hope for a miracle: perhaps the validity of the transportation-information inequality for $\mu$ itself already implies the analogous inequality for all its products $\mu^{\otimes n}$. In such situations, one says that the inequality tensorizes. As we will shortly see, the inequality $W_1(\nu,\mu)\le\sqrt{2\sigma^2 D(\nu\|\mu)}$ does not tensorize in precisely the manner that one would hope, but this naive idea will nonetheless lead us in the right direction.
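We will develop the tensorization result in a slightly more general setting that will be useful in the sequel. Recall that $\mathcal{C}(\mu,\nu):=\{\mathrm{Law}(X,Y):X\sim\mu,\ Y\sim\nu\}$ denotes the set of couplings between probability measures $\mu$ and $\nu$, and that the Wasserstein distance $W_1(\mu,\nu)$ can be written as
$$W_1(\mu,\nu)=\inf_{\mathbb{M}\in\mathcal{C}(\mu,\nu)}\mathbb{E}_{\mathbb{M}}[d(X,Y)].$$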
We can now state a general tensorization result for transportation-information type inequalities. The proof is due to Marton (1996), who initiated the use of this method to study concentration.
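Proposition. (Tensorization) Let $\phi:\mathbb{R}_+\to\mathbb{R}_+$ be a convex function, and let $w:\mathbb{X}\times\mathbb{X}\to\mathbb{R}_+$ be a positive weight function on the metric space $(\mathbb{X},d)$. Fix a probability measure $\mu$ on $\mathbb{X}$ and a constant $c>0$. If
$$\inf_{\mathbb{M}\in\mathcal{C}(\nu,\mu)}\phi\big(\mathbb{E}_{\mathbb{M}}[w(X,Y)]\big)\le c\,D(\nu\|\mu)$$
holds for all $\nu\ll\mu$, then
$$\inf_{\mathbb{M}\in\mathcal{C}(\mathbb{Q},\mu^{\otimes n})}\sum_{i=1}^n\phi\big(\mathbb{E}_{\mathbb{M}}[w(X_i,Y_i)]\big)\le c\,D(\mathbb{Q}\|\mu^{\otimes n})$$
holds for every $\mathbb{Q}\ll\mu^{\otimes n}$ and every $n\ge 1$.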
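Proof. The conclusion holds for $n=1$ by assumption. The proof now proceeds by induction on $n$: we will suppose in the sequel that the conclusion holds for $n-1$, and deduce the conclusion for $n$ from that.
Fix $\mathbb{Q}\ll\mu^{\otimes n}$. We denote by $\mathbb{Q}^{n-1}$ the marginal of $\mathbb{Q}$ on the first $n-1$ coordinates, and define the conditional distribution $\mathbb{Q}_{X_1,\ldots,X_{n-1}}=\mathbb{Q}(X_n\in\cdot\,|X_1,\ldots,X_{n-1})$ (whose existence is guaranteed by the Bayes formula as $\mathbb{Q}\ll\mu^{\otimes n}$). The key idea of the proof is to exploit the chain rule for relative entropy:
$$D(\mathbb{Q}\|\mu^{\otimes n})=D(\mathbb{Q}^{n-1}\|\mu^{\otimes n-1})+\mathbb{E}_{\mathbb{Q}}\big[D(\mathbb{Q}_{X_1,\ldots,X_{n-1}}\|\mu)\big].$$
The first term on the right can be bounded below by the induction hypothesis, while the second term can be bounded below by the assumption of the Proposition. In particular, fixing $\varepsilon>0$, it follows that we can choose $\mathbb{M}^{n-1}\in\mathcal{C}(\mathbb{Q}^{n-1},\mu^{\otimes n-1})$ and $\mathbb{M}_{X_1,\ldots,X_{n-1}}\in\mathcal{C}(\mathbb{Q}_{X_1,\ldots,X_{n-1}},\mu)$ such that
$$c\,D(\mathbb{Q}^{n-1}\|\mu^{\otimes n-1})\ge\sum_{i=1}^{n-1}\phi\big(\mathbb{E}_{\mathbb{M}^{n-1}}[w(X_i,Y_i)]\big)-\varepsilon,$$
$$c\,D(\mathbb{Q}_{X_1,\ldots,X_{n-1}}\|\mu)\ge\phi\big(\mathbb{E}_{\mathbb{M}_{X_1,\ldots,X_{n-1}}}[w(X_n,Y_n)]\big)-\varepsilon.$$
If we now define the probability measure $\mathbb{M}$ as
$$\mathbb{E}_{\mathbb{M}}[f(X_1,\ldots,X_n,Y_1,\ldots,Y_n)]=\mathbb{E}_{\mathbb{M}^{n-1}}\bigg[\int f(X_1,\ldots,X_{n-1},x_n,Y_1,\ldots,Y_{n-1},y_n)\,\mathbb{M}_{X_1,\ldots,X_{n-1}}(dx_n,dy_n)\bigg],$$
then we can estimate by Jensen’s inequality
$$c\,D(\mathbb{Q}\|\mu^{\otimes n})\ge\sum_{i=1}^{n-1}\phi\big(\mathbb{E}_{\mathbb{M}}[w(X_i,Y_i)]\big)+\mathbb{E}_{\mathbb{Q}}\big[\phi\big(\mathbb{E}_{\mathbb{M}_{X_1,\ldots,X_{n-1}}}[w(X_n,Y_n)]\big)\big]-2\varepsilon\ge\sum_{i=1}^{n}\phi\big(\mathbb{E}_{\mathbb{M}}[w(X_i,Y_i)]\big)-2\varepsilon.$$
But evidently $\mathbb{M}\in\mathcal{C}(\mathbb{Q},\mu^{\otimes n})$, and thus
$$c\,D(\mathbb{Q}\|\mu^{\otimes n})\ge\inf_{\mathbb{M}\in\mathcal{C}(\mathbb{Q},\mu^{\otimes n})}\sum_{i=1}^{n}\phi\big(\mathbb{E}_{\mathbb{M}}[w(X_i,Y_i)]\big)-2\varepsilon.$$
The proof is completed by letting $\varepsilon\downarrow 0$.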
Remark. In the above proof, we have swept a technicality under the rug: we assumed that an $\varepsilon$-optimal coupling $\mathbb{M}_{X_1,\ldots,X_{n-1}}\in\mathcal{C}(\mathbb{Q}_{X_1,\ldots,X_{n-1}},\mu)$ can be chosen to be measurable as a function of $X_1,\ldots,X_{n-1}$. This can generally be justified by standard methods (e.g., on Polish spaces by a measurable selection argument, or in special cases such as $\mathbb{X}=\mathbb{R}$ by an explicit construction of the optimal coupling).
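The chain rule for relative entropy that drives the induction can be checked by hand on a finite space. The following is a minimal sketch, assuming a two-point space and a hypothetical joint law `Q` on pairs (the numbers are made up for illustration); it compares $D(\mathbb{Q}\|\mu^{\otimes 2})$ with the sum of the marginal and averaged conditional relative entropies.

```python
import math

def kl(p, q):
    """Relative entropy D(p||q) for finite distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical example: mu on {0, 1} and a joint law Q on {0, 1}^2, Q[x1][x2].
mu = [0.3, 0.7]
Q = [[0.10, 0.25],
     [0.40, 0.25]]

# Left-hand side: D(Q || mu x mu), computed directly on the product space.
lhs = sum(Q[a][b] * math.log(Q[a][b] / (mu[a] * mu[b]))
          for a in range(2) for b in range(2) if Q[a][b] > 0)

# Marginal of the first coordinate and conditional laws of the second given the first.
Q1 = [sum(Q[a]) for a in range(2)]
cond = [[Q[a][b] / Q1[a] for b in range(2)] for a in range(2)]

# Right-hand side: D(Q^1 || mu) + E_Q[ D(Q_{X_1} || mu) ].
rhs = kl(Q1, mu) + sum(Q1[a] * kl(cond[a], mu) for a in range(2))

print(lhs, rhs)  # the two values agree up to floating-point error
```

The identity is exact, so any discrepancy between the two printed values is rounding error only.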
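Now assume that $\mu$ satisfies the transportation-information inequality
$$W_1(\nu,\mu)\le\sqrt{2\sigma^2 D(\nu\|\mu)}\quad\text{for all }\nu\ll\mu,$$
which characterizes concentration in a fixed metric space. This corresponds to the setting of the above tensorization result with $\phi(x)=x^2$, $w(x,y)=d(x,y)$ and $c=2\sigma^2$. Tensorization then yields
$$\inf_{\mathbb{M}\in\mathcal{C}(\mathbb{Q},\mu^{\otimes n})}\sqrt{\sum_{i=1}^n\mathbb{E}_{\mathbb{M}}[d(X_i,Y_i)]^2}\le\sqrt{2\sigma^2 D(\mathbb{Q}\|\mu^{\otimes n})}\quad\text{for all }\mathbb{Q}\ll\mu^{\otimes n}.$$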
Unfortunately, the left-hand side of this inequality is not itself a Wasserstein distance, and so we do not automatically obtain a transportation-information inequality in higher dimension. In the previous lecture, we showed that one can bound the left-hand side from below by a Wasserstein metric with respect to an $\ell_1$-type distance using the Cauchy-Schwarz inequality. However, we then lose a factor $\sqrt{n}$ by Cauchy-Schwarz, and thus the dimension-free nature of the concentration is lost.
The above computation, however, suggests how we can “fix” our assumption to obtain dimension-free concentration. Note that the left-hand side of the tensorization inequality above is the $\ell_2$-norm of the vector of expectations $(\mathbb{E}_{\mathbb{M}}[d(X_i,Y_i)])_{i\le n}$. If we could take the $\ell_2$-norm inside the expectation, rather than outside, then the left-hand side would be a Wasserstein distance between probability measures on $\mathbb{X}^n$ with respect to the $\ell_2$-distance $d_n(x,y)=[\sum_{i=1}^n d(x_i,y_i)^2]^{1/2}$! In order to engineer such a stronger inequality, however, we must begin with a stronger assumption.
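To this end, define the quadratic Wasserstein distance
$$W_2(\mu,\nu):=\inf_{\mathbb{M}\in\mathcal{C}(\mu,\nu)}\sqrt{\mathbb{E}_{\mathbb{M}}[d(X,Y)^2]}.$$
Suppose that $\mu$ satisfies the quadratic transportation cost inequality (QTCI)
$$W_2(\nu,\mu)\le\sqrt{2\sigma^2 D(\nu\|\mu)}\quad\text{for all }\nu\ll\mu.$$
Then applying tensorization with $\phi(x)=x$, $w(x,y)=d(x,y)^2$ and $c=2\sigma^2$ immediately yields
$$W_2(\mathbb{Q},\mu^{\otimes n})\le\sqrt{2\sigma^2 D(\mathbb{Q}\|\mu^{\otimes n})}\quad\text{for all }\mathbb{Q}\ll\mu^{\otimes n},\ n\ge 1,$$
where $W_2$ on $\mathbb{X}^n$ is defined with respect to the metric $d_n(x,y)=[\sum_{i=1}^n d(x_i,y_i)^2]^{1/2}$.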
On the other hand, as obviously $W_1(\mathbb{Q},\mu^{\otimes n})\le W_2(\mathbb{Q},\mu^{\otimes n})$ by Jensen’s inequality, we immediately find that QTCI implies dimension-free concentration: that is, we have proved
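Corollary. Suppose that the probability measure $\mu$ satisfies
$$W_2(\nu,\mu)\le\sqrt{2\sigma^2 D(\nu\|\mu)}\quad\text{for all }\nu\ll\mu.$$
Then we have dimension-free concentration, that is, for $X_1,X_2,\ldots$ i.i.d. $\sim\mu$
$$\mathbb{P}[f(X_1,\ldots,X_n)-\mathbb{E}f(X_1,\ldots,X_n)\ge t]\le e^{-t^2/2\sigma^2}$$
for all $n\ge 1$, $t\ge 0$ and $|f(x)-f(y)|\le[\sum_{i=1}^n d(x_i,y_i)^2]^{1/2}$ with finite mean.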
The observation that the quadratic transportation cost inequality yields dimension-free concentration with respect to the $\ell_2$-metric is due to Talagrand (1996), who used it to prove Gaussian concentration.
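The difference between taking the $\ell_2$-norm outside or inside the expectation can be seen numerically. The sketch below is only an illustration: it uses a hypothetical empirical "coupling" (random nonnegative vectors standing in for the coordinatewise distances $d(X_i,Y_i)$) and compares the tensorized left-hand side, a $W_1$-type cost and a $W_2$-type cost for that one coupling.

```python
import math
import random

random.seed(0)
n, m = 5, 1000  # n coordinates, m sample points (both arbitrary choices)

# Each row stands in for (d(X_1,Y_1), ..., d(X_n,Y_n)) under a hypothetical coupling.
samples = [[abs(random.gauss(0, 1)) for _ in range(n)] for _ in range(m)]

def mean(values):
    return sum(values) / len(values)

# l2-norm OUTSIDE the expectation: sqrt(sum_i E[d_i]^2).
norm_outside = math.sqrt(sum(mean([row[i] for row in samples]) ** 2 for i in range(n)))

# l2-norm INSIDE the expectation: E[d_n(X,Y)], the W_1-type cost for the metric d_n.
w1_type = mean([math.sqrt(sum(v * v for v in row)) for row in samples])

# Quadratic cost: sqrt(E[d_n(X,Y)^2]), the W_2-type cost for the metric d_n.
w2_type = math.sqrt(mean([sum(v * v for v in row) for row in samples]))

print(norm_outside, w1_type, w2_type)  # nondecreasing from left to right
```

The ordering holds for every coupling (by convexity of the norm and Jensen's inequality), which is why the quadratic cost controls the $W_1$ distance on $(\mathbb{X}^n,d_n)$ while the tensorized $W_1$ inequality does not.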
Characterizing dimension-free concentration
We began this topic by showing that the transportation cost inequality
$$W_1(\nu,\mu)\le\sqrt{2\sigma^2 D(\nu\|\mu)}\quad\text{for all }\nu\ll\mu$$
is necessary and sufficient for a probability measure $\mu$ on a fixed metric space $(\mathbb{X},d)$ to exhibit concentration of Lipschitz functions. However, this does not suffice to obtain dimension-free concentration, that is, concentration of the product measure $\mu^{\otimes n}$ for every $n\ge 1$. To obtain the latter, we “fixed” our original assumption by imposing the stronger quadratic transportation cost inequality
$$W_2(\nu,\mu)\le\sqrt{2\sigma^2 D(\nu\|\mu)}\quad\text{for all }\nu\ll\mu.$$
As was shown above, this inequality is sufficient to obtain dimension-free concentration. But it is far from clear whether it should also be necessary: by strengthening the inequality, we have lost the connection with the proof on a fixed space (which relies on the variational property of relative entropy). It is therefore a remarkable fact that the quadratic transportation cost inequality proves to be both necessary and sufficient for dimension-free concentration, as was observed by Gozlan (2009).
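Theorem. Let $\mu$ be a probability measure on a Polish space $(\mathbb{X},d)$, and let $X_1,X_2,\ldots$ be i.i.d. $\sim\mu$. Then the following are equivalent:
1. Dimension-free concentration:
$$\mathbb{P}[f(X_1,\ldots,X_n)-\mathbb{E}f(X_1,\ldots,X_n)\ge t]\le e^{-t^2/2\sigma^2}$$
for all $n\ge 1$, $t\ge 0$ and $|f(x)-f(y)|\le[\sum_{i=1}^n d(x_i,y_i)^2]^{1/2}$ with finite mean.
2. Quadratic transportation cost inequality:
$$W_2(\mu,\nu)\le\sqrt{2\sigma^2 D(\nu\|\mu)}$$
for every probability measure $\nu\ll\mu$.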
This result effectively resolves the question we set out to answer.
Proof. We have already shown that 2 implies 1. In the sequel, we will prove that 1 implies 2: that is, the validity of the quadratic transportation cost inequality is necessary for dimension-free concentration. The proof of this fact is a surprising application of Sanov’s theorem (see Lecture 3).
We will need the following three facts that will be proved below.
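1. Law of large numbers: $\mathbb{E}[W_2(\frac{1}{n}\sum_{i=1}^n\delta_{X_i},\mu)]\to 0$ as $n\to\infty$.
2. Lower semicontinuity: $O_t:=\{\nu:W_2(\mu,\nu)>t\}$ is open in the weak convergence topology.
3. Lipschitz property: the map $g_n:(x_1,\ldots,x_n)\mapsto W_2(\frac{1}{n}\sum_{i=1}^n\delta_{x_i},\mu)$ is $n^{-1/2}$-Lipschitz.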
The first two claims are essentially technical exercises: the empirical measures converge weakly to $\mu$ by the law of large numbers, so the only difficulty is to verify that the convergence holds in the slightly stronger sense of the quadratic Wasserstein distance; and lower-semicontinuity of the quadratic Wasserstein distance is an elementary technical fact. The third claim is a matter of direct computation, which we will do below. Let us take these claims for granted and complete the proof.
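As $O_t=\{\nu:W_2(\mu,\nu)>t\}$ is open, we can apply Sanov’s theorem as follows:
$$-\inf_{\nu\in O_t}D(\nu\|\mu)\le\liminf_{n\to\infty}\frac{1}{n}\log\mathbb{P}\bigg[\frac{1}{n}\sum_{i=1}^n\delta_{X_i}\in O_t\bigg]=\liminf_{n\to\infty}\frac{1}{n}\log\mathbb{P}\bigg[W_2\bigg(\frac{1}{n}\sum_{i=1}^n\delta_{X_i},\mu\bigg)>t\bigg].$$
But as the function $g_n(x_1,\ldots,x_n):=W_2(\frac{1}{n}\sum_{i=1}^n\delta_{x_i},\mu)$ is $n^{-1/2}$-Lipschitz, dimension-free concentration implies
$$\mathbb{P}[g_n(X_1,\ldots,X_n)-\mathbb{E}g_n(X_1,\ldots,X_n)\ge s]\le e^{-ns^2/2\sigma^2}\quad\text{for all }s\ge 0.$$
Thus we have
$$-\inf_{\nu\in O_t}D(\nu\|\mu)\le\liminf_{n\to\infty}\frac{1}{n}\log\mathbb{P}[g_n-\mathbb{E}g_n>t-\mathbb{E}g_n]\le-\lim_{n\to\infty}\frac{(t-\mathbb{E}g_n)^2}{2\sigma^2}=-\frac{t^2}{2\sigma^2},$$
where we have used $\mathbb{E}g_n(X_1,\ldots,X_n)\to 0$ (so that $t-\mathbb{E}g_n>0$ for all large $n$). In particular, we have proved
$$\sqrt{2\sigma^2 D(\nu\|\mu)}\ge t\quad\text{whenever}\quad W_2(\mu,\nu)>t>0.$$
The quadratic transportation cost inequality follows readily (let $t=W_2(\mu,\nu)-\varepsilon$ and $\varepsilon\downarrow 0$).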
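It remains to establish the claims used in the proof. Let us begin with the Lipschitz property of the map $g_n:(x_1,\ldots,x_n)\mapsto W_2(\frac{1}{n}\sum_{i=1}^n\delta_{x_i},\mu)$.
Lemma. (Claim 3) The map $g_n$ is $n^{-1/2}$-Lipschitz with respect to $d_n(x,y)=[\sum_{i=1}^n d(x_i,y_i)^2]^{1/2}$.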
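Proof. Let $\mathbb{M}\in\mathcal{C}(\frac{1}{n}\sum_{i=1}^n\delta_{x_i},\mu)$. If we define $\mu_i=\mathbb{M}[Y\in\cdot\,|X=x_i]$, then we evidently have
$$\mathbb{E}_{\mathbb{M}}[f(X,Y)]=\frac{1}{n}\sum_{i=1}^n\int f(x_i,y)\,\mu_i(dy),\qquad\frac{1}{n}\sum_{i=1}^n\mu_i=\mu.$$
Conversely, every family of measures $\mu_1,\ldots,\mu_n$ such that $\frac{1}{n}\sum_{i=1}^n\mu_i=\mu$ defines a coupling $\mathbb{M}\in\mathcal{C}(\frac{1}{n}\sum_{i=1}^n\delta_{x_i},\mu)$ in this manner. We can therefore estimate as follows, where the infima and suprema are taken over all such families $(\mu_i)$:
$$W_2\bigg(\frac{1}{n}\sum_{i=1}^n\delta_{x_i},\mu\bigg)-W_2\bigg(\frac{1}{n}\sum_{i=1}^n\delta_{\tilde x_i},\mu\bigg)=\inf_{(\mu_i)}\bigg[\frac{1}{n}\sum_{i=1}^n\int d(x_i,y)^2\,\mu_i(dy)\bigg]^{1/2}-\inf_{(\mu_i)}\bigg[\frac{1}{n}\sum_{i=1}^n\int d(\tilde x_i,y)^2\,\mu_i(dy)\bigg]^{1/2}$$
$$\le\sup_{(\mu_i)}\bigg\{\bigg[\frac{1}{n}\sum_{i=1}^n\int d(x_i,y)^2\,\mu_i(dy)\bigg]^{1/2}-\bigg[\frac{1}{n}\sum_{i=1}^n\int d(\tilde x_i,y)^2\,\mu_i(dy)\bigg]^{1/2}\bigg\}$$
$$\le\sup_{(\mu_i)}\bigg[\frac{1}{n}\sum_{i=1}^n\int\{d(x_i,y)-d(\tilde x_i,y)\}^2\,\mu_i(dy)\bigg]^{1/2}$$
$$\le\frac{1}{\sqrt{n}}\bigg[\sum_{i=1}^n d(x_i,\tilde x_i)^2\bigg]^{1/2},$$
where in the last two lines we used, respectively, the reverse triangle inequality for $L^2$ norms (that is, $\|X\|_2-\|Y\|_2\le\|X-Y\|_2$) and the reverse triangle inequality for the metric $d$.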
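A related sanity check can be run on the real line. By the triangle inequality for $W_2$, the difference $|g_n(x)-g_n(\tilde x)|$ is at most the $W_2$ distance between the two empirical measures, so it suffices to see that this distance is at most $n^{-1/2}d_n(x,\tilde x)$. The sketch below assumes $\mathbb{X}=\mathbb{R}$ with $d(x,y)=|x-y|$ and relies on the standard fact that matching sorted atoms is optimal for the quadratic cost in one dimension; the helper name `w2_empirical_1d` and the perturbation are hypothetical.

```python
import math
import random

def w2_empirical_1d(xs, ys):
    """W_2 between (1/n) sum delta_{x_i} and (1/n) sum delta_{y_i} on the real line.

    Assumes len(xs) == len(ys); matches the sorted atoms, which is optimal
    for the quadratic cost in one dimension.
    """
    n = len(xs)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(sorted(xs), sorted(ys))) / n)

random.seed(1)
n = 50
x = [random.gauss(0, 1) for _ in range(n)]
x_tilde = [xi + random.uniform(-0.5, 0.5) for xi in x]  # hypothetical perturbation of x

# Distance between the two empirical measures.
lhs = w2_empirical_1d(x, x_tilde)

# n^{-1/2} times the Euclidean distance d_n(x, x_tilde) between the two point vectors.
rhs = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_tilde))) / math.sqrt(n)

print(lhs, rhs)  # lhs never exceeds rhs
```

The bound holds because pairing $x_i$ with $\tilde x_i$ is one admissible coupling of the two empirical measures, and the optimal coupling can only do better.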
Next, we prove lower-semicontinuity of $W_2$. This is an exercise in using weak convergence.
Lemma. (Claim 2) $\nu\mapsto W_2(\mu,\nu)$ is lower-semicontinuous in the weak convergence topology.
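Proof. Let $\nu_1,\nu_2,\ldots$ be probability measures such that $\nu_n\to\nu$ weakly as $n\to\infty$. We must show
$$\liminf_{n\to\infty}W_2(\mu,\nu_n)\ge W_2(\mu,\nu).$$
Fix $\varepsilon>0$, and choose for each $n$ a coupling $\mathbb{M}_n\in\mathcal{C}(\nu_n,\mu)$ such that
$$W_2(\mu,\nu_n)\ge\sqrt{\mathbb{E}_{\mathbb{M}_n}[d(X,Y)^2]}-\varepsilon.$$
We claim that the sequence $(\mathbb{M}_n)_{n\ge 1}$ is tight. Indeed, the sequence $(\nu_n)_{n\ge 1}$ is tight (as it converges) and clearly $\mu$ itself is tight. For any $\delta>0$, choose a compact set $K_\delta$ such that $\nu_n(K_\delta)\ge 1-\delta/2$ for all $n\ge 1$ and $\mu(K_\delta)\ge 1-\delta/2$. Then evidently $\mathbb{M}_n(K_\delta\times K_\delta)\ge 1-\delta$, and thus tightness follows.
By tightness, we can choose a subsequence $n_k\uparrow\infty$ such that $\liminf_n W_2(\mu,\nu_n)=\lim_k W_2(\mu,\nu_{n_k})$ and $\mathbb{M}_{n_k}\to\mathbb{M}$ weakly for some probability measure $\mathbb{M}$. As $d$ is continuous and nonnegative, we obtain
$$\liminf_{n\to\infty}W_2(\mu,\nu_n)\ge\liminf_{k\to\infty}\sqrt{\mathbb{E}_{\mathbb{M}_{n_k}}[d(X,Y)^2]}-\varepsilon\ge\sqrt{\mathbb{E}_{\mathbb{M}}[d(X,Y)^2]}-\varepsilon.$$
But as $\mathbb{M}\in\mathcal{C}(\nu,\mu)$, we have shown $\liminf_n W_2(\mu,\nu_n)\ge W_2(\mu,\nu)-\varepsilon$. We conclude using $\varepsilon\downarrow 0$.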
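Finally, we prove the law of large numbers in $W_2$. This is an exercise in truncation.
Lemma. (Claim 1) Suppose that $\mu$ satisfies the Lipschitz concentration property. Then the law of large numbers holds in the sense $\mathbb{E}[W_2(\frac{1}{n}\sum_{i=1}^n\delta_{X_i},\mu)]\to 0$ as $n\to\infty$ for $X_1,X_2,\ldots$ i.i.d. $\sim\mu$.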
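Proof. Let $x^*\in\mathbb{X}$ be some arbitrary point. We truncate the Wasserstein distance as follows: for any $a>0$,
$$W_2(\mu,\nu)^2=\inf_{\mathbb{M}\in\mathcal{C}(\mu,\nu)}\big\{\mathbb{E}_{\mathbb{M}}[d(X,Y)^2 1_{d(X,Y)\le a}]+\mathbb{E}_{\mathbb{M}}[d(X,Y)^2 1_{d(X,Y)>a}]\big\}\le a\inf_{\mathbb{M}\in\mathcal{C}(\mu,\nu)}\mathbb{E}_{\mathbb{M}}[d(X,Y)\wedge a]+\frac{4}{a}\int d(x,x^*)^3\,\{\mu(dx)+\nu(dx)\},$$
where we used $(b+c)^3\le 4(b^3+c^3)$ for $b,c\ge 0$. We claim that if $\nu_n\to\mu$ weakly, then
$$\inf_{\mathbb{M}\in\mathcal{C}(\nu_n,\mu)}\mathbb{E}_{\mathbb{M}}[d(X,Y)\wedge a]\to 0\quad\text{as }n\to\infty.$$
Indeed, by the Skorokhod representation theorem, we can construct random variables $X_n$ and $X$ on a common probability space such that $X_n\sim\nu_n$ for every $n$, $X\sim\mu$, and $X_n\to X$ a.s. Thus $\mathbb{E}[d(X_n,X)\wedge a]\to 0$ by bounded convergence, and as $\mathrm{Law}(X_n,X)\in\mathcal{C}(\nu_n,\mu)$ the claim follows. Thus $\nu_n\to\mu$ weakly implies $W_2(\nu_n,\mu)\to 0$ if we can control the second term in the above truncation.
Denote the empirical measure as $\mu_n=\frac{1}{n}\sum_{i=1}^n\delta_{X_i}$. Recall that $\mu_n\to\mu$ weakly a.s. by the law of large numbers. Therefore, following the above reasoning (and using $\mathbb{E}[\int d(x,x^*)^3\,\mu_n(dx)]=\int d(x,x^*)^3\,\mu(dx)$), we obtain
$$\limsup_{n\to\infty}\mathbb{E}[W_2(\mu_n,\mu)^2]\le\frac{8}{a}\int d(x,x^*)^3\,\mu(dx)$$
for every $a>0$. Thus it remains to show that $\int d(x,x^*)^3\,\mu(dx)<\infty$: letting $a\to\infty$ then gives $\mathbb{E}[W_2(\mu_n,\mu)^2]\to 0$, and hence $\mathbb{E}[W_2(\mu_n,\mu)]\to 0$ by Jensen's inequality. But as $x\mapsto d(x,x^*)$ is evidently Lipschitz (with constant $1$), this follows directly from the following Lemma.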
Finally, we have used in the last proof the following lemma, which shows that if $\mu$ satisfies the Lipschitz concentration property then any Lipschitz function has all finite moments. In particular, every Lipschitz function has finite mean, which means that the qualifier “with finite mean” used above in our definition of (dimension-free) concentration is superfluous and can therefore be dropped.
Lemma. Suppose that the probability measure $\mu$ satisfies the Lipschitz concentration property. Then any Lipschitz function $f$ satisfies $\int|f|^q\,d\mu<\infty$ for every $q\ge 1$.
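Proof. Let $f$ be $L$-Lipschitz. It suffices to prove that $|f|$ has finite mean. If this is the case, then writing $m:=\int|f|\,d\mu$, the Lipschitz concentration property implies for every $q\ge 1$ that
$$\int|f|^q\,d\mu=\int_0^\infty q\,t^{q-1}\,\mu[|f|\ge t]\,dt\le m^q+\int_m^\infty q\,t^{q-1}e^{-(t-m)^2/2\sigma^2L^2}\,dt<\infty,$$
where we note that $|f|$ is Lipschitz with the same constant as $f$. To prove that $|f|$ has finite mean, let us apply the Lipschitz concentration property to $-\{|f|\wedge a\}$ (which certainly has finite mean). This gives
$$\mu\bigg[|f|\wedge a\le\int(|f|\wedge a)\,d\mu-t\bigg]\le e^{-t^2/2\sigma^2L^2}.$$
Now choose $t$ such that $e^{-t^2/2\sigma^2L^2}<1/2$. Then clearly $\int(|f|\wedge a)\,d\mu\le\mathrm{Med}(|f|\wedge a)+t$. But note that the median $\mathrm{Med}(|f|\wedge a)=\mathrm{Med}|f|$ for $a>\mathrm{Med}|f|$. Thus we obtain $\int|f|\,d\mu\le\mathrm{Med}|f|+t<\infty$ as $a\uparrow\infty$.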
We started our discussion of dimension-free concentration with the classical Gaussian concentration property of Tsirelson, Ibragimov, and Sudakov. It therefore seems fitting to conclude by giving a proof of this result using the machinery that we have developed: we only need to show that the standard normal $N(0,1)$ on $\mathbb{R}$ satisfies the quadratic transportation cost inequality. [It should be noted that there are numerous other proofs of Gaussian concentration, each with their own interesting ideas.]
Proposition. Let $\mu=N(0,1)$ on $(\mathbb{R},|\cdot|)$. Then $W_2(\mu,\nu)\le\sqrt{2D(\nu\|\mu)}$ for all $\nu\ll\mu$.
This result is due to Talagrand (1996). Talagrand’s proof exploits the fact that optimal transportation problems on $\mathbb{R}$ admit an explicit solution in terms of quantile functions. This allows one to establish inequalities on $\mathbb{R}$ using calculus manipulations. In contrast, optimal transportation problems on $\mathbb{R}^d$ for $d\ge 2$ are far from trivial (see the excellent introductory and comprehensive texts by Villani). We therefore see that tensorization is really key to a tractable proof by Talagrand’s method.
Instead of going down this road, we will present a lovely short proof of the transportation-information inequality due to Djellout, Guillin, and Wu (2004) that uses stochastic calculus.
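Proof. Denote by $\mathbb{P}$ the law of standard Brownian motion $(B_t)_{t\in[0,1]}$. Fix a probability measure $\nu\ll\mu$ such that $D(\nu\|\mu)<\infty$, and define the probability measure $\mathbb{Q}$ as
$$d\mathbb{Q}=\frac{d\nu}{d\mu}(B_1)\,d\mathbb{P}.$$
Then clearly $B_1\sim\mu$ under $\mathbb{P}$ and $B_1\sim\nu$ under $\mathbb{Q}$.
Note that $M_t=\mathbb{E}_{\mathbb{P}}[\frac{d\nu}{d\mu}(B_1)|\mathcal{F}_t]=(\frac{d\nu}{d\mu}*\varphi_{1-t})(B_t)$ is a uniformly integrable martingale and $M_t>0$ for $0\le t<1$ (here $\varphi_s$ denotes the density of $N(0,s)$). Thus we find that
$$M_t=\exp\bigg(\int_0^t\beta_s\,dB_s-\frac{1}{2}\int_0^t\beta_s^2\,ds\bigg)\quad\text{for }t<1$$
for some nonanticipating process $(\beta_t)_{t<1}$ by the martingale representation theorem and Itô's formula. But then Girsanov's theorem implies that the stochastic process defined by
$$Y_t:=B_t-\int_0^t\beta_s\,ds$$
is Brownian motion under $\mathbb{Q}$. Thus the law of $(B_1,Y_1)$ under $\mathbb{Q}$ is a coupling of $\nu$ and $\mu$, and
$$W_2(\mu,\nu)^2\le\mathbb{E}_{\mathbb{Q}}[|B_1-Y_1|^2]=\mathbb{E}_{\mathbb{Q}}\bigg[\bigg|\int_0^1\beta_t\,dt\bigg|^2\bigg]\le\mathbb{E}_{\mathbb{Q}}\bigg[\int_0^1\beta_t^2\,dt\bigg]$$
by Jensen's inequality. The proof is therefore complete once we prove that
$$\mathbb{E}_{\mathbb{Q}}\bigg[\int_0^1\beta_t^2\,dt\bigg]=2D(\nu\|\mu).$$
To see this, note that
$$D(\nu\|\mu)=\mathbb{E}_{\mathbb{Q}}\bigg[\log\frac{d\nu}{d\mu}(B_1)\bigg]=\mathbb{E}_{\mathbb{Q}}\bigg[\int_0^1\beta_t\,dB_t-\frac{1}{2}\int_0^1\beta_t^2\,dt\bigg]=\mathbb{E}_{\mathbb{Q}}\bigg[\int_0^1\beta_t\,dY_t+\frac{1}{2}\int_0^1\beta_t^2\,dt\bigg].$$
If $\mathbb{E}_{\mathbb{Q}}[\int_0^1\beta_t^2\,dt]<\infty$ then the integral $\int_0^\cdot\beta_t\,dY_t$ is a square-integrable martingale under $\mathbb{Q}$; thus its expectation vanishes and the proof is complete. However, it is not difficult to show using a simple localization argument that $D(\nu\|\mu)<\infty$ implies $\mathbb{E}_{\mathbb{Q}}[\int_0^1\beta_t^2\,dt]<\infty$, see Lemma 2.6 of Föllmer for a careful proof.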
Remark. To be fair, it should be noted that the above stochastic calculus proof works just as easily in $\mathbb{R}^d$ for any $d$. Thus we could directly establish the transportation-information inequality (and therefore concentration) in any dimension in this manner without going through the tensorization argument.
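As a sanity check of the Proposition, one can test the inequality $W_2(\mu,\nu)^2\le 2D(\nu\|\mu)$ on the family $\nu=N(m,s^2)$, for which both sides have well-known closed forms: $W_2(\mu,\nu)^2=m^2+(s-1)^2$ and $D(\nu\|\mu)=\frac{1}{2}(s^2+m^2-1)-\log s$. The sketch below only covers this Gaussian family (the grid of parameters is an arbitrary choice) and is not a substitute for the proof.

```python
import math

def w2_squared(m, s):
    """Squared quadratic Wasserstein distance between N(m, s^2) and N(0, 1)."""
    return m ** 2 + (s - 1.0) ** 2

def rel_entropy(m, s):
    """Relative entropy D(N(m, s^2) || N(0, 1))."""
    return 0.5 * (s ** 2 + m ** 2 - 1.0) - math.log(s)

# Arbitrary grid of means m and standard deviations s > 0.
grid = [(m, s) for m in (-2.0, -0.5, 0.0, 1.0, 3.0) for s in (0.1, 0.5, 1.0, 2.0, 5.0)]

# Slack in the inequality: 2 D(nu || mu) - W_2(mu, nu)^2, which should be nonnegative.
gaps = {(m, s): 2.0 * rel_entropy(m, s) - w2_squared(m, s) for m, s in grid}

for (m, s), gap in sorted(gaps.items()):
    print(f"m={m:5.1f}  s={s:4.1f}  slack={gap:.6f}")
```

On this family the slack simplifies to $2(s-1-\log s)$, which is nonnegative and vanishes exactly when $s=1$: translations of the standard normal attain equality, so the constant in the inequality cannot be improved.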
Lecture by Ramon van Handel | Scribed by Patrick Rebeschini