That was it for the Fall 2013 stochastic analysis seminar! Many thanks in particular to our fantastic guest lecturers Mokshay Madiman and Emmanuel Abbe.
There will be no stochastic analysis seminar in the Spring 2014 semester (students interested in related material may want to consider taking the special topics course Probability in High Dimension). We will resume in the Fall 2014 semester with a topic to be announced at that time.
Recall the main proposition proved in the previous lecture, which is due to Bobkov and Götze (1999).
Proposition. The following are equivalent for :
- for every and with finite mean.
- for every probability measure .
This result provides a characterization of concentration of Lipschitz functions on a fixed metric space. Our original aim, however, was to understand dimension-free concentration (in analogy with the Gaussian case): that is, for which measures is it true that if are i.i.d. , then every function is subgaussian with the same parameter for every ? In principle, the above result answers this question: satisfies dimension-free concentration if and only if
However, this is not a very satisfactory characterization: to check whether satisfies dimension-free concentration, we must check that the transportation-information inequality holds for for every . Instead, we would like to characterize dimension-free transportation in terms of a property of itself (and not its tensor products). This will be achieved in the present lecture.
How are we going to obtain a transportation-information inequality for every starting from only a property of ? If one is an optimist, one might hope for a miracle: perhaps the validity of the transportation-information inequality for itself already implies the analogous inequality for all its products . In such situations, one says that the inequality tensorizes. As we will shortly see, the inequality does not tensorize in precisely the manner that one would hope, but this naive idea will nonetheless lead us in the right direction.
We will develop the tensorization result in a slightly more general setting that will be useful in the sequel. Recall that denotes the set of couplings between probability measures and , and that the Wasserstein distance can be written as
We can now state a general tensorization result for transportation-information type inequalities. The proof is due to Marton (1996), who initiated the use of this method to study concentration.
Proposition. (Tensorization) Let be a convex function, and let be a positive weight function on the metric space . Fix a probability measure on . If
holds for all , then
holds for every and every .
Proof. The conclusion holds for by assumption. The proof now proceeds by induction on : we will suppose in the sequel that the conclusion holds for , and deduce the conclusion for from that.
Fix . We denote by the marginal of on the first coordinates, and define the conditional distribution (whose existence is guaranteed by the Bayes formula as ). The key idea of the proof is to exploit the chain rule for relative entropy:
The first term on the right can be bounded below by the induction hypothesis, while the second term can be bounded below by the assumption of the Proposition. In particular, fixing , it follows that we can choose and such that
If we now define the probability measure as
then we can estimate by Jensen’s inequality
But evidently , and thus
The proof is completed by letting .
Remark. In the above proof, we have swept a technicality under the rug: we assumed that an -optimal coupling can be chosen to be measurable as a function of . This can generally be justified by standard methods (e.g., on Polish spaces by a measurable selection argument, or in special cases such as by an explicit construction of the optimal coupling).
Now assume that satisfies the transportation-information inequality
which characterizes concentration in a fixed metric space. This corresponds to the setting of the above tensorization result with and . Tensorization then yields
Unfortunately, the left-hand side of this inequality is not itself a Wasserstein distance, and so we do not automatically obtain a transportation-information inequality in higher dimension. In the previous lecture, we showed that one can bound the left-hand side from below by a Wasserstein metric with respect to an -type distance using the Cauchy-Schwarz inequality. However, we then lose a factor by Cauchy-Schwarz, and thus the dimension-free nature of the concentration is lost.
The above computation, however, suggests how we can “fix” our assumption to obtain dimension-free concentration. Note that the left-hand side of the tensorization inequality above is the -norm of the vector of expectations . If we could take the norm inside the expectation, rather than outside, then the left-hand side would be a Wasserstein distance between probability measures on with respect to the -distance ! In order to engineer such a stronger inequality, however, we must begin with a stronger assumption.
To this end, define the quadratic Wasserstein distance
Suppose that satisfies the quadratic transportation cost inequality (QTCI)
Then applying tensorization with and immediately yields
On the other hand, as obviously by Jensen’s inequality, we immediately find that QTCI implies dimension-free concentration: that is, we have proved
Corollary. Suppose that the probability measure satisfies
Then we have dimension-free concentration, that is, for i.i.d.
for all and with finite mean.
The observation that the quadratic transportation cost inequality yields dimension-free concentration with respect to the -metric is due to Talagrand (1996), who used it to prove Gaussian concentration.
Characterizing dimension-free concentration
We began this topic by showing that the transportation cost inequality
is necessary and sufficient for a probability on a fixed metric space to exhibit concentration of Lipschitz functions. However, this does not suffice to obtain dimension-free concentration, that is, concentration of the product measure for every . To obtain the latter, we “fixed” our original assumption by imposing the stronger quadratic transportation cost inequality
As was shown above, this inequality is sufficient to obtain dimension-free concentration. But it is far from clear whether it should also be necessary: by strengthening the inequality, we have lost the connection with the proof on a fixed space (which relies on the variational property of relative entropy). It is therefore a remarkable fact that the quadratic transportation cost inequality proves to be both necessary and sufficient for dimension-free concentration, as was observed by Gozlan (2009).
Theorem. Let be a probability measure on a Polish space , and let be i.i.d. . Then the following are equivalent:
- Dimension-free concentration:
for all , and with finite mean.
- Quadratic transporation cost inequality:
for every probability measure .
This result effectively resolves the question we set out to answer.
Proof. We have already shown that 2 implies 1. In the sequel, we will prove that 1 implies 2: that is, the validity of the quadratic transportation cost inequality is necessary for dimension-free concentration. The proof of this fact is a surprising application of Sanov’s theorem (see Lecture 3).
We will need the following three facts that will be proved below.
- Law of large numbers: as .
- Lower semicontinuity: is open in the weak convergence topology.
- Lipschitz property: the map is -Lipschitz.
The first two claims are essentially technical exercises: the empirical measures converge weakly to by the law of large numbers, so the only difficulty is to verify that the convergence holds in the slightly stronger sense of the quadratic Wasserstein distance; and lower-semicontinuity of the quadratic Wasserstein distance is an elementary technical fact. The third claim is a matter of direct computation, which we will do below. Let us take these claims for granted and complete the proof.
As is open, we can apply Sanov’s theorem as follows:
But as the function is -Lipschitz, dimension-free concentration implies
Thus we have
where we have used . In particular, we have proved
The quadratic transportation cost inequality follows readily (let and ).
It remains to establish the claims used in the proof. Let us begin the with Lipschitz property of .
Lemma. (Claim 3) The map is -Lipschitz.
Proof. Let . If we define , then we evidently have
Conversely, every family of measures such that defines a coupling in this manner. We can therefore estimate as follows:
where in the last two lines we used, respectively, the reverse triangle inequality for norms (that is, ) and the reverse triangle inequality for the metric .
Next, we prove lower-semicontinuity of . This is an exercise in using weak convergence.
Lemma. (Claim 2) is lower-semicontinuous in the weak convergence topology.
Proof. Let be probability measures such that weakly as . We must show
Fix , and choose for each a coupling such that
We claim that the sequence is tight. Indeed, the sequence is tight (as it converges) and clearly itself is tight. For any , choose a compact set such that for all and . Then evidently , and thus tightness follows.
By tightness, we can choose a subsequence such that and weakly for some probability measure . As is continuous and nonnegative, we obtain
But as , we have shown . We conclude using .
Finally, we prove the law of large numbers in . This is an exercise in truncation.
Lemma. (Claim 1) Suppose that satisfies the Lipschitz concentration property. Then the law of large numbers holds in the sense as for i.i.d. .
Proof. Let be some arbitrary point. We truncate the Wasserstein distance as follows:
where we used for . We claim that if weakly, then
Indeed, by the Skorokhod representation theorem, we can construct random variables and on a common probability space such that for every , , and a.s. Thus by bounded convergence, and as the claim follows. Thus implies if we can control the second term in the above truncation.
Denote the empirical measure as . Recall that weakly a.s. by the law of large numbers. Therefore, following the above reasoning, we obtain
for every . Thus it remains to show that . But as is evidently Lipschitz (with constant ), this follows directly from the following Lemma.
Finally, we have used in the last proof the following lemma, which shows that if satisfies the Lipschitz concentration property then any Lipschitz function has all finite moments. In particular, every Lipschitz function has finite mean, which means that the qualifier “with finite mean” used above in our definition of (dimension-free) concentration is superfluous and can therefore be dropped.
Lemma. Suppose that the probability measure satisfies the Lipschitz concentration property. Then any Lipschitz function satisfies for every .
Proof. Let be -Lipschitz. It suffices to prove that has finite mean. If this is the case, then the Lipschitz concentration property implies for every that
where we note that is Lipschitz with the same constant as . To prove that has finite mean, let us apply the Lipschitz concentration property to (which certainly has finite mean). This gives
Now choose such that . Then clearly . But note that the median for . Thus we obtain as .
We started our discussion of dimension-free concentration with the classical Gaussian concentration property of Tsirelson, Ibragimov, and Sudakov. It therefore seems fitting to conclude by giving a proof of this result using the machinery that we have developed: we only need to show that the standard normal on satisfies the quadratic transportation cost inequality. [It should be noted that there are numerous other proofs of Gaussian concentration, each with their own interesting ideas.]
Proposition. Let on . Then for all .
This result is due to Talagrand (1996). Talagrand’s proof exploits the fact that optimal transportation problems on admit an explicit solution in terms of quantile functions. This allows to establish inequalities on using calculus manipulations. In contrast, optimal transportation problems on for are far from trivial (see excellent introductory and comprehensive texts by Villani). We therefore see that tensorization is really key to a tractable proof by Talagrand’s method.
Instead of going down this road, we will present a lovely short proof of the transportation-information inequality due to Djellout, Guillin, and Wu (2004) that uses stochastic calculus.
Proof. Denote by be the law of standard Brownian motion . Fix a probability measure such that , and define the probability measure as
Then clearly under and under .
Note that is a uniformly integrable martingale and for (here denotes the density of ). Thus we find that
for some nonanticipating process by the martingale representation theorem and Itô's formula. But then Girsanov's theorem implies that the stochastic process defined by
is Brownian motion under . Thus the law of under is a coupling of and , and
by Jensen's inequality. The proof is therefore complete once we prove that
To see this, note that
If then the integral is a square-integrable martingale; thus its expectation vanishes and the proof is complete. However, it is not difficult to show using a simple localization argument that implies , see Lemma 2.6 of Föllmer for a careful proof.
Remark. To be fair, it should be noted that the above stochastic calculus proof works just as easily in for any . Thus we could directly establish the transportation-information inequality (and therefore concentration) in any dimension in this manner without going through the tensorization argument.
Lecture by Ramon van Handel | Scribed by Patrick Rebeschini
The goal of the next two lectures is to explore the connections between concentration of measure, entropy inequalities, and optimal transportation.
What is concentration?
Roughly speaking, concentration of measure is the idea that nonlinear functions of many random variables often have “nice tails”. Concentration represents one of the most important ideas in modern probability and has played an important role in many other fields, such as statistics, computer science, combinatorics, and geometric analysis.
To illustrate concentration, let . Using Markov’s inequality, we can estimate for
where we have made the optimal choice (this is called a Chernoff bound). Thus a Gaussian random variable has “nice tails”. The concentration of measure phenomenon shows that not only do Gaussian random variables have nice tails, but that many nonlinear functions of Gaussian random variables still have nice tails. The classical result on this topic is the following.
Theorem. (Gaussian concentration, Tsirelson-Ibragimov-Sudakov 1976)
Let be i.i.d. , and let be -Lipschitz, that is, for each (where is the Euclidean norm). Then
This result shows that Lipschitz functions (which could be very nonlinear) of i.i.d. Gaussian random variables concentrate closely around their expected value: the probability that the function differs from its expected value decays as a Gaussian in the degree of the discrepancy. The beauty of this result is that it is dimension-free, that is, the rate of decay of the tail only depends on the Lipschitz constant , and there is no dependence on the number of random variables . Such results are essential in high-dimensional problems where one would like to obtain dimension-free bounds.
Gaussian concentration is only one result in a theory with many fascinating ideas and questions. One might ask, for instance, what random variables beside Gaussians exhibit this type of phenomenon, whether other notions of Lipschitz functions can be considered, etc. See, for example, the excellent books by Ledoux and Boucheron, Lugosi, and Massart for much more around this theme.
The basic question that we want to address in the next two lectures is the following.
Question. Let be i.i.d. random variables on a metric space with distribution . Can we characterize all for which dimension-free concentration holds as for the Gaussian case above?
It turns out that a remarkably complete answer can be given in terms of entropy inequalities. This is where information-theoretic methods enter the picture.
When bounding tails of random variables (provided they are at least exponential), it is convenient to bound moment generating functions as we did above instead of working directly with the tail.
Definition. is subgaussian with parameter if for each .
If is subgaussian, then (by the Chernoff bound as above). In fact, it can be shown that the converse is also true for slightly larger , so that the property of having Gaussian tails is equivalent to the subgaussian property (we omit the proof). It will be convenient to investigate the property of Gaussian tails in terms of subgaussianity.
Concentration and relative entropy
Before we can tackle the problem of dimension-free concentration, we must begin by making the connection between subgaussianity and entropy in the most basic setting.
Let be a metric space. A function is -Lipschitz (-Lip) if for each . One thing that we can do with Lipschitz functions is to define a distance between probability measures (we will assume in the sequel that the necessary measurability conditions are satisfied): for probability measures on , define the Wasserstein distance as
The idea is that two measures are close if the expectations of a large class of function is close. In the case of , the class of function being used is the class -Lip.
As we are interested in concentration of Lipschitz functions, it is intuitive that a quantity such as should enter the picture. On the other hand, we have seen in earlier lectures that the relative entropy can also be viewed as a “sort of distance” between probability measures (albeit not a metric). It is not clear, a priori, how and are related. We will presently see that relative entropy is closely related to moment generating functions, and therefore to tails of random variables: in particular, we can characterize concentration on a fixed space by comparing the Wasserstein metric and relative entropy.
Proposition. The following are equivalent for :
- for every and .
- for every probability measure .
Note that this result characterizes those measures on a fixed metric space that exhibit Gaussian concentration. There is no independence as of yet, and thus no notion of “dimension-free” concentration for functions of independent random variables: the present result is in “fixed dimension”.
Example. Let be the trivial metric. A function is -Lip with respect to if
that is, if and only if . Hence we have
Thus 2 in the above proposition holds with by Pinsker’s inequality
We consequently find by 1 above that
for every function such that . Thus the above Proposition reduces in this case to the well known Hoeffding lemma, which states that bounded random variables are subgaussian.
Let us turn to the proof of the Proposition. The first observation is a classic result that connects relative entropy with moment generating functions. It dates back to the very beginnings of statistical mechanics (see the classic treatise by J. W. Gibbs (1902), Theorem III, p. 131).
Lemma. (Gibbs variational principle) Let be any random variable. Then
Proof. Assume that is bounded above by some constant (otherwise replace by and then let at the end). Define a probability measure by
As the relative entropy is always positive, we have
for every , and equality is obtained by choosing the optimizer .
Using the variational principle, it is easy to prove the Proposition.
Proof of the Proposition. By the variational principle, we have
if and only if
for all . Optimizing over , we find that 1 is equivalent to the validity of
for all and .
Tensorization and optimal transport
So far we have considered concentration in a fixed metric space . If are independent random variables, we can certainly apply the Proposition to with the product distribution . However, to establish dimension-free concentration, we would have to check that the conditions of the Proposition hold for for every with the same constant ! This is hardly a satisfactory answer: we would like to characterize dimension-free concentration directly in terms of a property of only. To this end, a natural conjecture might be that if the conditions of the Proposition hold for the measure , then that will already imply the same property for the measures for every . This turns out not to be quite true, but this idea will lead us in the right direction.
Motivated by the above, we set out to answer the following
Question. Suppose that satisfies for every . Does this imply that a similar property is satisfied by the product measures ?
Such a conclusion is often referred to as a tensorization property. To make progress in this direction, we must understand the classic connection between Wasserstein distances and optimal transportation.
Theorem. (Kantorovich-Rubinstein duality, 1958) Let and be probability measures on a Polish space. Let be the set of couplings of and . Then
The right side of this equation is a so-called “optimal transport problem”. For this reason, inequalities such as are often called transportation-information inequalities.
The full proof of Kantorovich-Rubinstein duality is part of the theory of optimal transportation and is beyond our scope (optimal transportation is itself a fascinating topic with many connections to other areas of mathematics such as probability theory, PDEs, and geometric analysis—perhaps a topic for another semester?) Fortunately, we will only need the easy half of the theorem in the sequel.
Proof of lower bound. For each and , we have
from which we immediately get
This proves the easy direction in the above theorem.
It turns out that the optimal transportation approach is the “right” way to tensorize transportation-information inequalities. Even though the following result is not quite yet what we need to prove dimension-free concentration, it already suffices to derive some interesting results.
Proposition. (Tensorization) Suppose that
for all . Then, for any ,
for all .
We postpone the proof of this result until the next lecture.
Example. Let . By Pinsker’s inequality
Define the weighted Hamming distance for positive weights as
Then, by Cauchy-Schwarz and tensorization we get
for each . So, we have
with , for each and each function which is 1-Lip with respect to . This implies
That is, we recover the well known bounded difference inequality.
We have not yet shown that the transportation-information inequality holds for on . Even once we establish this, however, the tensorization result we have given above is not sufficient to prove dimension-free Gaussian concentration in the sense of Tsirelson-Ibragimov-Sudakov. Indeed, if we apply the above tensorization result, then at best we can get
Setting the weights , we find a tail bound of the form whenever is with respect to the -norm . Note that this is not dimension-free: the factor appears inside the exponent! On the other hand, Gaussian concentration shows that we have a dimension-free tail bound whenever is with respect to the -norm . Note that the latter property is strictly stronger than the former because ! Our tensorization method is not sufficiently strong, however, to yield this type of dimension-free result.
Fortunately, we now have enough ingredients to derive a slightly stronger transportation-information inequality that is not only sufficient, but also necessary for dimension-free concentration. Stay tuned!
Lecture by Ramon van Handel | Scribed by Patrick Rebeschini
This lecture introduces the notion of the entropic cone and its connection with entropy inequalities.
Recall that if is a discrete random variable with distribution , the entropy of is defined as
Now let be (not necessarily independent) discrete random variables. It will be convenient to define the entropy function
As the entropy depends only on the probabilities of the outcomes of a random variable and not on its values, we will assume without loss of generality in the sequel that take values in .
For any probability on , let be the -dimensional vector
We can now define the entropic cone.
Definition. The entropic cone is the set
i.e., the set of all vectors that can be obtained as the entropy function of variables.
Question. Can we characterize ?
- If , then , as there exist random variables of arbitrary (nonnegative) entropy.
- If , the vector for is
What constraints must this vector satisfy? We must certainly have
which follows from the chain rule together with positivity for the first inequality, and that conditioning reduces entropy for the second. Are these the only constraints? In fact, we can create many vectors given these constraints. For example:
- The vector is obtained by taking such that , and letting .
- The vector is obtained by taking to be i.i.d. copies of .
- Convex combinations of these vectors can be obtained by taking mixtures of their distributions.
In fact, a careful analysis shows that the above constraints completely characterize the entropic cone for : that is,
- If , what happens? Here we are are still subject to the same constraints as in the case , but we pick up some new constraints as well such as
which is equivalent to the inequality .
Evidently, the constraints on the entropic cone correspond to entropy inequalities. What type of inequalities must hold for any given ? Let us think about this question a bit more systematically.
Definition. A set function is called a polymatroid if
- , for all .
- when (monotonicity).
It is not difficult to check that
Lemma. The entropy function is a polymatroid.
Other examples of polymatroids: (1) entropy, mutual information; (2) max-weight for given weights ; (3) flows; (4) cuts (not monotone, but submodular); (5) Von Neumann entropy.
Let us define
Evidently is a polyhedral cone (the intersection of a finite number of halfplanes).
- for .
- is a convex cone for all (but not polyhedral for ).
For the proof, we refer to the book for R. Yeung.
As an example of , we note that the vector is only achievable when for some . Let us also note that convexity of holds in its interior but not on the boundary. However, issues only arise on the boundary of , that is, no new entropy inequalities appear beyond the polymatroid inequalities when .
On the other hand, when , many new inequalities are introduced that actually cause holes to appear within . One such inequality (expressed in terms of mutual information) is
Using the entropic cone
The entropic cone can be used to obtain new information theoretic inequalities. For example, the following question arose in the study of mechanism design in computer science.
Problem. Let be discrete random variables. Define the matrix as . Is positive definite?
When , we have since . Thus all the principal minors of have positive determinant, so is positive definite.
How about ? Note that depends linearly on the entries of the vector (where is the joint distribution of ). Thus if is positive definite for distributions on the extreme rays (ER) of the entropic cone, then must be positive definite for any distribution. More generally:
Proposition. If is convex, then holds for all if and only if for all .
Proof. It suffices to note that
for every , and to use that .
This necessary and sufficient condition for generalizes to a sufficient condition for .
Proposition. If is convex, then holds for all if for all .
As is polyhedral, this simplifies solving problems such as checking positive definiteness of significantly: it suffices to check a finite number of cases, which can essentially be done by hand. It can be shown this way that is always positive definite for , but this can fail for .
It is sometimes of interest to investigate discrete variants of the above questions.
Definition. A matroid is defined by the conditions
- for all ;
- is monotone;
- is submodular.
- Vector matroids (also called -representable matroids, where is a field). Given a matrix with values in , define where denotes the columns induced by .
- Graphical matroids. Let be a graph, and choose . Then define , where the maximum is taking over acyclic subgraphs .
- Entropic matroids. Let . For what distributions of is this a matroid?
Denote by the entropic cone induced by those distributions where take values in . In order for such a distribution to define an entropic matroid, the vector must take values in . Thus we are led to consider the set
Can one characterize what type of matroids can arise as entropic matroids?
Theorem. coincides with the set of all -representable matroids with elements.
For the proof and further results around this theme, see the paper of E. Abbe.
Lecture by Emmanuel Abbe | Scribed by Danny Gitelman
This lecture completes the proof of the entropic central limit theorem.
From Fisher information to entropy (continued)
In the previous lecture, we proved the following:
By optimizing the choice of the weights, it is easy to check that (1) is equivalent to
Proving (2) for general fractional packings and general hypergraphs requires some additional steps that we will avoid here. Instead, we will prove the a special case, due to Artstein, Ball, Barthe and Naor (2004), that suffices to resolve the question of monotonicity of entropy in the CLT.
In the following, let be the set of all subsets of of size and take for every . In this case, (2) takes the following form.
Theorem EPI. If are independent random variables, then
provided that all the entropies exist.
To prove Theorem EPI, we will use the de Bruijn identity which was discussed in the previous lecture. Let us rewrite it in a useful form for the coming proof and give a proof.
Theorem (de Bruijn identity). Let be a random variable with density on . Let , , where is independent of . We have
Proof. Let be the density of . Then
where . It is easy to check that . Hence,
Integrating from to gives (3).
Proof of Theorem EPI. In (1), let be the set of all subsets of of size and . We have
where, is the weight corresponding to the set . It is easy to check that (4) is equivalent, for every with , to
for any choice of with and with .
Let , where is independent of all else. Then
where the equality in distribution is due to the fact that . Hence, by (6),
We use de Bruijn identity to integrate this from to over time and get
By (6) again,
This is the entropy analog of the Fisher information inequality obtained above.
As a final step, let . We are to show
If for some , the result is immediate due to the general fact (convolution increases entropy). Hence, we assume that for every , so
Set , . Note that
With this choice of , we apply (7). After some computations, we get
Using the definition of the , the final result follows from here immediately.
Proof of the Entropic Central Limit Theorem
Entropic CLT. Suppose that are i.i.d. random variables with and . Let . If for some , then as . Equivalently, if for some , then as .
In the case of i.i.d. random variables, Theorem EPI gives
which is equivalent to because of the identity . Therefore, it remains to show as . For this, we need some analytical properties of the relative entropy. Henceforth, we use to denote the set of all probability measures on some Polish space (take for instance).
Proposition (Variational characterization of ). Let . Then
where denotes the set of all bounded measurable functions .
- In the variational characterization of above, it is enough to take the supremum over the set of all bounded continuous functions .
- For fixed , the mapping is convex and continuous. Since is the supremum over a class of convex, continuous functions, it is convex and lower semicontinuous. These properties of relative entropy, made transparent by the variational characterization, are very useful in many different contexts.
Corollary. Sublevel sets of are compact, that is, the set is compact (with respect to the topology of weak convergence) for every and .
Before we prove the corollary, let us recall the definition of tightness and Prohorov Theorem.
Definition (Tightness). A set is called tight if for every there exists a compact set such that for every .
Prohorov Theorem. A set is weakly precompact if and only if it is tight.
Note that is a tight set as a singleton. We claim that the sequence is also tight. Indeed, let and let be a compact set with . We take and apply (8) to get
where the rightmost term can be made arbitrarily small. Hence, is tight. By Prohorov Theorem, there exists such that as . By lower semicontinuity of , we have
This finishes the proof of the corollary.
Proof of Variational Characterization of . As a first case, suppose . So and there exists a Borel set with and . Let . We have
as . Hence, both sides of the variational characterization are equal to .
For the rest, we assume . First, we show the part. If , the inequality is obvious. Suppose . Given , define a probability measure by
Since is chosen arbitrarily, taking supremum on the right hand side gives the part.
Next, we prove the part. Note that if , then . However, this choice of may not be in , that is, may fail to be bounded or bounded away from zero. So we employ the following truncation argument. Let
so that as . Note that and . Thus we have
by monotone convergence. On the other hand, by Fatou’s Lemma, we have
from which the part of the result follows.
Building on these now standard facts (whose exposition above follows that in the book of Dupuis and Ellis), Harremoes and Vignat (2005) gave a short proof of the desired convergence, which we will follow below. It relies on the fact that for uniformly bounded densities within the appropriate moment class, pointwise convergence implies convergence of entropies.
Lemma. If are random variables with , , and the corresponding densities are uniformly bounded with as (pointwise) for some density , then and as .
Proof. Recall for with mean and variance . By lower semicontinuity of , we have
On the other hand, letting , we have
and using Fatou’s Lemma,
Hence, as .
End of proof of Entropic CLT. Assume . We will use to denote normalized Fisher information. For any , we have that for . So as for every . We want to show that , since then we will get by Lebesgue’s dominated convergence theorem that
as . But since
it is enough to show that for each .
By the monotonicity property we have proved, we know that
for any . By compactness of sublevel sets of , the sequence must therefore have a subsequence whose distribution converges to a probability measure (let us call a random variable with this limiting measure as its distribution). For , the smoothing caused by Gaussian convolution implies that the density of converges pointwise to that of , and also that the density of converges pointwise to that of , where is an independent copy of . By the previous lemma
as , and
so that necessarily
By the equality condition in the entropy power inequality, this can only happen if is Gaussian, which in turn implies that .
Lecture by Mokshay Madiman | Scribed by Cagin Ararat
Here is the schedule for the remaining three lectures of the semester:
November 21: Emmanuel Abbe on information theoretic inequalities.
November 28: No lecture (Thanksgiving).
December 5 and 12: Ramon van Handel on transportation-information inequalities and concentration.
The remaining notes on the entropic CLT (Lecture 7) will be posted soon.
In this lecture, we complete the proof of monotonicity of the Fisher information in the CLT, and begin developing the connection with entropy. The entropic CLT will be completed in the next lecture.
Variance drop inequality
In the previous lecture, we proved the following decomposition result for functions of independent random variables due to Hoeffding.
ANOVA decomposition. Let be independent random variables taking values in , and let be such that . Then satisfies
Note that is a function only of ( are the ordered elements of ).
In the previous lecture, we proved subadditivity of the inverse Fisher information . The key part of the proof was the observation that the score function of the sum could be written as the conditional expectation of a sum of independent random variables, whose variance is trivially computed. This does not suffice, however, to prove monotonicity in the CLT. To do the latter, we need a more refined bound on the Fisher information in terms of overlapping subsets of indices. Following the same proof, the score function of the sum can be written as the conditional expectation of a sum of terms that are now no longer independent. To estimate the variance of this sum, we will use the following “variance drop lemma” whose proof relies on the ANOVA decomposition.
Lemma. Let where and is some collection of subsets of . If are independent random variables with , then
where is a fractional packing with respect to .
- Recall that a fractional packing is a function such that
Example 1. Let , and define . Taking always defines a fractional packing as by definition of .
Example 2. If , then .
- The original paper of Hoeffding (1948) proves the following special case where each is symmetric in its arguments and is as in Example 2 above: (the -statistic) satisfies
Of course, if then . Thus Hoeffding’s inequality for the variance of -statistics above and the more general variance drop lemma should be viewed as capturing how much of a drop we get in variance of an additive-type function, when the terms are not independent but have only limited dependencies (overlaps) in their structure.
Proof. We may assume without loss of generality that each has mean zero.
We then have, using orthogonality of the terms in the ANOVA decomposition:
For each term, we have
where the second inequality follows from the definition of fractional packing if is non-empty, and the fact that takes any to its mean. Hence
again using orthogonality of the in the last step.
Monotonicity of Fisher information
We can now finally prove monotonicity of the Fisher information.
Corollary. Let be independent random variables with . Then
for any hypergraph on , fractional packing , and positive weights summing to 1.
Proof. Recall that and . The identity proved in the last lecture states
With , we can write
since . By taking a convex combination of these identities,
Now by using the Pythagorean inequality (or Jensen’s inequality) and the variance drop lemma, we have
Remark. The being arbitrary weights, we can optimize over them. This gives
With being all singletons and we recover the superadditivity property of . With being all sets of size and , we get
Thus we have proved the monotonicity of Fisher information in the central limit theorem.
From Fisher information to entropy
Having proved monotonicity for the CLT written in terms of Fisher information, we now want to show the analogous statement for entropy. The key tool here is the de Bruijn identity.
To formulate this identity, let us introduce some basic quantities. Let on , and define
where . Denote by the density of . The following facts are readily verified for :
- is smooth.
Observe that has density , and that as , converges to , which has a standard Gaussian distribution. Thus provides an interpolation between the density and the normal density.
Remark. Let us recall some standard facts from the theory of diffusions. The Ornstein-Uhlenbeck process is defined by the stochastic differential equation
where is Brownian motion. This is, like Brownian motion, a Markov process, but the drift term (which always pushes trajectories towards 0) ensures that it has a stationary distribution, unlike Brownian motion. The Markov semigroup associated to this Markov process, namely the semigroup of operators defined on an appropriate domain by
has a generator (defined via ) given by . The semigroup generated by governs the evolution of conditional expectations of functions of the process , while the adjoint semigroup generated by governs the evolution of the marginal density of . The above expression for follows from this remark by noting that and are the same in distribution; however, it can also be deduced more simply just by writing down the density of explicitly, and using the smoothness of the Gaussian density to verify each part of the claim.
We can now formulate the key identity.
de Bruijn identity. Let be the density of the standard normal .
- Differential form:
where is the normalized Fisher information.
The differential form follows by using the last part of the claim together with integration by parts. The integral form follows from the differential form by the fundamental theorem of calculus, since
which yields the desired identity since .
This gives us the desired link between Fisher information and entropy. In the next lecture, we will use this to complete the proof of the entropic central limit theorem.
Lecture by Mokshay Madiman | Scribed by Georgina Hall
The goal of this lecture is to prove monotonicity of Fisher information in the central limit theorem. Next lecture we will connect Fisher information to entropy, completing the proof of the entropic CLT.
Two lemmas about the score function
Recall that for a random variable with absolutely continuous density , the score function is defined as and the Fisher information is .
The following lemma was proved in the previous lecture.
Lemma 1. Let be a random variable with finite Fisher information. Let be measurable and let . If is bounded on the interval , then the function is absolutely continuous on with a.e.
There is a converse to the above lemma, which gives a useful characterization of the score function.
Lemma 2. Let be a random variable with density and let be a measurable function with . Suppose that for every bounded measurable function , the function is absolutely continuous on with a.e. Then there must exist an absolutely continuous version of the density , and moreover a.s.
Proof. Take . Then
Hence is in fact continuously differentiable with derivative . On the other hand,
by our assumption. By continuity of and the assumption , we must have
for every . Hence a.e. Since , the proof is complete.
Score function of the sum of independent variables
We now show a key property of the score function: the score function of the sum of two independent random variables is a projection.
Proposition. Let and be two independent random variables. Suppose has finite Fisher information. Then a.s.
Proof. Let . By Lemma 2, we only need to show that the function is locally absolutely continuous with a.e. for every bounded measurable function .
Fix . By independence of and , we can apply Lemma 1 to (conditioned on ) to obtain
Taking expectation of both sides and applying Fubini, we get
we arrive at
which implies a.e.
Remark. The well-known interpretation of conditional expectation as a projection means that under the assumption of finite Fisher information (i.e., score functions in ), the score function of the sum is just the projection of the score of a summand onto the closed subspace . This implies directly, by the Pythagorean inequality, that convolution decreases Fisher information: . In fact, we can do better, as we will see forthwith.
Monotonicity of FI in the CLT along a subsequence
We now make a first step towards proving monotonicity of the Fisher information in the CLT.
Theorem. Let and be independent random variables both with finite Fisher information. Then
for any .
Before going into the proof, we make some remarks.
Remark. Taking , we get . Hence the above theorem is a significant strengthening of the simple fact that convolution decreases Fisher information.
Remark. By taking , we get the following optimized version of the theorem:
Remark. Let be an i.i.d. sequence of random variables with mean zero and unit variance. Let . By taking in the theorem and using scaling property of Fisher information, it is easy to obtain . Hence, the above theorem already implies monotonicity of Fisher information along the subsequence of times : that is, is monotone in .
However, the theorem is not strong enough to give monotonicity without passing to a subsequence. For example, if we apply the previous remark repeatedly we only get , which is not very interesting. To prove full monotonicity of the Fisher information, we will need a strengthening of the above Theorem. But it is instructive to first consider the proof of the simpler case.
Proof. By the projection property of the score function,
Applying the conditional Jensen inequality, we obtain,
Taking expectations of both sides and using independence, we obtain
By Lemma 1, . This finishes the proof.
Monotonicity of Fisher information in the CLT
To prove monotonicity of the Fisher information in the CLT (without passing to a subsequence) we need a strengthening of the property of Fisher information given in the previous section.
In the following, we will use the common notation .
Definition. Let be a collection of non-empty subsets of . Then a collection of non-negative numbers is called a fractional packing for if for all .
We can now state the desired strengthening of our earlier theorem.
Theorem. Let be a fractional packing for and let be independent random variables. Let be a collection of real numbers such that . Then
Remark. Suppose that the are identically distributed. Take . For each , define . It is easy to check that is a fractional packing of . Then
by the above theorem. By the scaling property of Fisher information, this is equivalent to , i.e., monotonicity of the Fisher information. This special case was first proved by Artstein, Ball, Barthe and Naor (2004) with a more complicated proof. The proof we will follow of the more general theorem above is due to Barron and Madiman (2007).
The proof of the above theorem is based on an analysis of variance (ANOVA) type decomposition, which dates back at least to the classic paper of Hoeffding (1948) on U-statistics. To state this decomposition, let be independent random variables, and define the Hilbert space
with inner product
For every , define an operator on as
Proposition. Each can be decomposed as
- if ;
- if ;
- does not depend on if .
Proof. It is easy to verify that is a projection operator, that is, and (self-adjointness). It is also easy to see that for . Hence we have
If , choose such that . Then
It is easily verified that is itself self-adjoint, so that follows directly. Finally, as by definition does not depend on , it is clear that does not depend on if .
The decomposition will be used in the form of the following variance drop lemma whose proof we postpone to next lecture. Here for , .
Lemma. Let be a fractional packing for . Let . Suppose that for each . Then
To be continued…
Lecture by Mokshay Madiman | Scribed by Liyao Wang
A quick reminder that next week is Fall break (no lecture). The next lecture will take place on Thursday, November 7: Mokshay will resume the proof of the entropic central limit theorem.
The subject of the next lectures will be the entropic central limit theorem (entropic CLT) and its proof.
Theorem (Entropic CLT). Let be i.i.d. real-valued random variables with mean zero and unit variance. Let
If for some , then , or equivalently . That is, the entropy of increases monotonically to that of the standard Gaussian.
Recall that when is Gaussian with the same mean and variance as , which explains the equivalence stated in the theorem.
Let us note that the assumption for some represents a genuine (but not unexpected) restriction: in particular, it implies that the entropic CLT does not apply if are discrete.
Entropy power inequality
Historically, the first result on monotonicity of entropy in the CLT was that for all . This follows directly from an important inequality for entropy, the entropy power inequality (EPI). The rest of this lecture and part of the next lecture will be devoted to proving the EPI. While the EPI does not suffice to establish the full entropic CLT, the same tools will prove to be crucial later on.
Entropy power inequality. Let and be independent real-valued random variables such that , , and all exist. Then
with equality if and only if and are Gaussian.
Before we embark on the proof, let us make some remarks.
Remark. None of the assumptions about existence of entropies is redundant: it can happen that and exist but does not.
Remark. If and are i.i.d., , and , then the EPI implies
which implies . Here we have used the easy-to-check equality , which of course implies . From this observation, the proof of the claim that is immediate: simply note that is the sum of two independent copies of .
Remark. It is easy to check that . In fact, this is true in much more general settings (e.g. on locally compact groups, with entropy defined relative to Haar measure). The EPI is a much stronger statement particular to real-valued random variables.
Remark. The EPI admits the following multidimensional extension.
Multidimensional EPI. Let and be independent -valued random vectors such that , , and all exist. Then
with equality if and only if and are Gaussian with proportional covariance matrices.
Define the entropy power for an -valued random vector by
The EPI says that is superadditive under convolution.
Digression: EPI and Brunn-Minkowski
A good way to develop an appreciation for what the EPI is saying is in analogy with the Brunn-Minkowski inequality. If are Borel sets and denotes -dimensional Lebesgue measure, then
where is the Minkowski sum. In particular, note that is proportional up to an absolute constant to the radius of the -dimensional Euclidean ball whose volume matches that of . The Brunn-Minkowski inequality expresses superadditivity of this functional (and we clearly have equality for balls). The Brunn-Minkowski inequality is of fundamental importance in various areas of mathematics: for example, it implies the isoperimetric inequality in , which states that Euclidean balls with volume have the minimal surface area among all subsets of with volume .
In a sense, the EPI is to random variables as the Brunn-Minkowski inequality is to sets. The Gaussians play the role of the balls, and variance corresponds to radius. In one dimension, for example, since
we see that is proportional to the variance of the Gaussian whose entropy matches that of . The entropy power inequality expresses superadditivity of this functional, with equality for Gaussians.
Proposition. The EPI is equivalent to the following statement: if and are independent and and are independent Gaussians with and , then provided that all of the entropies exist.
Proof. Both implications follow from
Proof of the entropy power inequality
There are many proofs of the EPI. It was stated by Shannon (1948) but first fully proven by Stam (1959); different proofs were later provided by Blachman (1969), Lieb (1978), and many others. We will follow a simplified version of Stam’s proof. We work from now on in the one-dimensional case for simplicity.
Definition (Fisher information). Let be a -valued random variable whose density is an absolutely continuous function. The score function of is defined as
The Fisher information (FI) of is defined as
Remark. Let be a parametric statistical model. In statistics, the score function is usually defined by , and the Fisher information by . This reduces to our definition in the special case of location families, where for some probability density : in this case, and we have
Thus for location families does not depend on and coincides precisely with the Fisher information as we defined it above for a random variable with density . The statistical interpretation allows us do derive a useful inequality. Suppose for simplicity that . Then for every , so is an unbiased estimator of . The Cramér-Rao bound therefore implies the inequality
with equality if and only if is Gaussian. The same conclusion holds when is arbitrary, as both FI and variance are invariant under translation.
Remark. There is a Fisher information analogue of the entropic CLT: in the setup of the entropic CLT, subject to an additional variance constraint, we have . Moreover, Fisher information is minimized by Gaussians. This is often stated in terms of normalized Fisher information, defined as . Note that is both translation and scale invariant: and . We have , with equality if and only if is Gaussian, by the previous remark. The Fisher information analogue of the entropic CLT can now be restated as .
The strategy of the proof of the EPI is as follows:
- We first prove an inequality for Fisher information:
- Develop an integral identity relating and .
- Combine (1) and (2) to get the EPI.
The reason to concentrate on Fisher information is that is an -type quantity, as opposed to the entropy which is an -type quantity. This makes easier to work with.
We begin with some technical results about Fisher information.
Lemma. If has absolutely continuous density and , then has bounded variation. In particular, is bounded.
Proof. Let denote the set of points at which is differentiable, so that . Define for . Let . Then
Lemma. Let be measurable. Let be defined as . If is bounded, then defined by is absolutely continuous on with a.e.
Proof. Let ; we want to show that is absolutely continuous with derivative . First observe that is integrable on from the Cauchy-Schwarz inequality and the boundedness condition. Thus we can compute
which proves the desired result.
Corollary. and, if , then .
Proof. Take in the lemma above for the first claim. Take for the second.
To be continued…
Lecture by Mokshay Madiman | Scribed by Dan Lacker