The goal of the next two lectures is to explore the connections between concentration of measure, entropy inequalities, and optimal transportation.
What is concentration?
Roughly speaking, concentration of measure is the idea that nonlinear functions of many random variables often have “nice tails”. Concentration represents one of the most important ideas in modern probability and has played an important role in many other fields, such as statistics, computer science, combinatorics, and geometric analysis.
To illustrate concentration, let . Using Markov’s inequality, we can estimate for
where we have made the optimal choice (this is called a Chernoff bound). Thus a Gaussian random variable has “nice tails”. The concentration of measure phenomenon shows that not only do Gaussian random variables have nice tails, but that many nonlinear functions of Gaussian random variables still have nice tails. The classical result on this topic is the following.
Theorem. (Gaussian concentration, Tsirelson-Ibragimov-Sudakov 1976)
Let be i.i.d. , and let be -Lipschitz, that is, for each (where is the Euclidean norm). Then
This result shows that Lipschitz functions (which could be very nonlinear) of i.i.d. Gaussian random variables concentrate closely around their expected value: the probability that the function differs from its expected value decays as a Gaussian in the degree of the discrepancy. The beauty of this result is that it is dimension-free, that is, the rate of decay of the tail only depends on the Lipschitz constant , and there is no dependence on the number of random variables . Such results are essential in high-dimensional problems where one would like to obtain dimension-free bounds.
Gaussian concentration is only one result in a theory with many fascinating ideas and questions. One might ask, for instance, what random variables beside Gaussians exhibit this type of phenomenon, whether other notions of Lipschitz functions can be considered, etc. See, for example, the books by Ledoux and Boucheron, Lugosi, and Massart for much more around this theme.
The basic question that we want to address in the next two lectures is the following.
Question. Let be i.i.d. random variables on a metric space with distribution . Can we characterize all for which dimension-free concentration holds as for the Gaussian case above?
It turns out that a remarkably complete answer can be given in terms of entropy inequalities. This is where information-theoretic methods enter the picture.
When bounding tails of random variables (provided they are at least exponential), it is convenient to bound moment generating functions as we did above instead of working directly with the tail.
Definition. is subgaussian with parameter if for each .
If is subgaussian, then (by the Chernoff bound as above). In fact, it can be shown that the converse is also true for slightly larger , so that the property of having Gaussian tails is equivalent to the subgaussian property (we omit the proof). It will be convenient to investigate the property of Gaussian tails in terms of subgaussianity.
Concentration and relative entropy
Before we can tackle the problem of dimension-free concentration, we must begin by making the connection between subgaussianity and entropy in the most basic setting.
Let be a metric space. A function is -Lipschitz (-Lip) if for each . One thing that we can do with Lipschitz functions is to define a distance between probability measures (we will assume in the sequel that the necessary measurability conditions are satisfied): for probability measures on , define the Wasserstein distance as
The idea is that two measures are close if the expectations of a large class of function is close. In the case of , the class of function being used is the class -Lip.
As we are interested in concentration of Lipschitz functions, it is intuitive that a quantity such as should enter the picture. On the other hand, we have seen in earlier lectures that the relative entropy can also be viewed as a “sort of distance” between probability measures (albeit not a metric). It is not clear, a priori, how and are related. We will presently see that relative entropy is closely related to moment generating functions, and therefore to tails of random variables: in particular, we can characterize concentration on a fixed space by comparing the Wasserstein metric and relative entropy.
Proposition. The following are equivalent for :
- for every and .
- for every probability measure .
Note that this result characterizes those measures on a fixed metric space that exhibit Gaussian concentration. There is no independence as of yet, and thus no notion of “dimension-free” concentration for functions of independent random variables: the present result is in “fixed dimension”.
Example. Let be the trivial metric. A function is -Lip with respect to if
that is, if and only if . Hence we have
Thus 2 in the above proposition holds with by Pinsker’s inequality
We consequently find by 1 above that
for every function such that . Thus the above Proposition reduces in this case to the well known Hoeffding lemma, which states that bounded random variables are subgaussian.
Let us turn to the proof of the Proposition. The first observation is a classic result that connects relative entropy with moment generating functions. It dates back to the very beginnings of statistical mechanics (see the classic treatise by J. W. Gibbs (1902), Theorem III, p. 131).
Lemma. (Gibbs variational principle) Let be any random variable. Then
Proof. Assume that is bounded above by some constant (otherwise replace by and then let at the end). Define a probability measure by
As the relative entropy is always positive, we have
for every , and equality is obtained by choosing the optimizer .
Using the variational principle, it is easy to prove the Proposition.
Proof of the Proposition. By the variational principle, we have
if and only if
for all . Optimizing over , we find that 1 is equivalent to the validity of
for all and .
Tensorization and optimal transport
So far we have considered concentration in a fixed metric space . If are independent random variables, we can certainly apply the Proposition to with the product distribution . However, to establish dimension-free concentration, we would have to check that the conditions of the Proposition hold for for every with the same constant ! This is hardly a satisfactory answer: we would like to characterize dimension-free concentration directly in terms of a property of only. To this end, a natural conjecture might be that if the conditions of the Proposition hold for the measure , then that will already imply the same property for the measures for every . This turns out not to be quite true, but this idea will lead us in the right direction.
Motivated by the above, we set out to answer the following
Question. Suppose that satisfies for every . Does this imply that a similar property is satisfied by the product measures ?
Such a conclusion is often referred to as a tensorization property. To make progress in this direction, we must understand the classic connection between Wasserstein distances and optimal transportation.
Theorem. (Kantorovich-Rubinstein duality, 1958) Let and be probability measures on a Polish space. Let be the set of couplings of and . Then
The right side of this equation is a so-called “optimal transport problem”. For this reason, inequalities such as are often called transportation-information inequalities.
The full proof of Kantorovich-Rubinstein duality is part of the theory of optimal transportation and is beyond our scope (optimal transportation is itself a fascinating topic with many connections to other areas of mathematics such as probability theory, PDEs, and geometric analysis—perhaps a topic for another semester?) Fortunately, we will only need the easy half of the theorem in the sequel.
Proof of lower bound. For each and , we have
from which we immediately get
This proves the easy direction in the above theorem.
It turns out that the optimal transportation approach is the “right” way to tensorize transportation-information inequalities. Even though the following result is not quite yet what we need to prove dimension-free concentration, it already suffices to derive some interesting results.
Proposition. (Tensorization) Suppose that
for all . Then, for any ,
for all .
We postpone the proof of this result until the next lecture.
Example. Let . By Pinsker’s inequality
Define the weighted Hamming distance for positive weights as
Then, by Cauchy-Schwarz and tensorization we get
for each . So, we have
with , for each and each function which is 1-Lip with respect to . This implies
That is, we recover the well known bounded difference inequality.
We have not yet shown that the transportation-information inequality holds for on . Even once we establish this, however, the tensorization result we have given above is not sufficient to prove dimension-free Gaussian concentration in the sense of Tsirelson-Ibragimov-Sudakov. Indeed, if we apply the above tensorization result, then at best we can get
Setting the weights , we find a tail bound of the form whenever is with respect to the -norm . Note that this is not dimension-free: the factor appears inside the exponent! On the other hand, Gaussian concentration shows that we have a dimension-free tail bound whenever is with respect to the -norm . Our tensorization result is not sufficiently strong to yield this type of behavior.
Fortunately, we now have enough ingredients to derive a slightly stronger transportation-information inequality that is not only sufficient, but also necessary for dimension-free concentration. Stay tuned!
Lecture by Ramon van Handel | Scribed by Patrick Rebeschini
This lecture introduces the notion of the entropic cone and its connection with entropy inequalities.
Recall that if is a discrete random variable with distribution , the entropy of is defined as
Now let be (not necessarily independent) discrete random variables. It will be convenient to define the entropy function
As the entropy depends only on the probabilities of the outcomes of a random variable and not on its values, we will assume without loss of generality in the sequel that take values in .
For any probability on , let be the -dimensional vector
We can now define the entropic cone.
Definition. The entropic cone is the set
i.e., the set of all vectors that can be obtained as the entropy function of variables.
Question. Can we characterize ?
- If , then , as there exist random variables of arbitrary (nonnegative) entropy.
- If , the vector for is
What constraints must this vector satisfy? We must certainly have
which follows from the chain rule together with positivity for the first inequality, and that conditioning reduces entropy for the second. Are these the only constraints? In fact, we can create many vectors given these constraints. For example:
- The vector is obtained by taking such that , and letting .
- The vector is obtained by taking to be i.i.d. copies of .
- Convex combinations of these vectors can be obtained by taking mixtures of their distributions.
In fact, a careful analysis shows that the above constraints completely characterize the entropic cone for : that is,
- If , what happens? Here we are are still subject to the same constraints as in the case , but we pick up some new constraints as well such as
which is equivalent to the inequality .
Evidently, the constraints on the entropic cone correspond to entropy inequalities. What type of inequalities must hold for any given ? Let us think about this question a bit more systematically.
Definition. A set function is called a polymatroid if
- , for all .
- when (monotonicity).
It is not difficult to check that
Lemma. The entropy function is a polymatroid.
Other examples of polymatroids: (1) entropy, mutual information; (2) max-weight for given weights ; (3) flows; (4) cuts (not monotone, but submodular); (5) Von Neumann entropy.
Let us define
Evidently is a polyhedral cone (the intersection of a finite number of halfplanes).
- for .
- is a convex cone for all (but not polyhedral for ).
For the proof, we refer to the book for R. Yeung.
As an example of , we note that the vector is only achievable when for some . Let us also note that convexity of holds in its interior but not on the boundary. However, issues only arise on the boundary of , that is, no new entropy inequalities appear beyond the polymatroid inequalities when .
On the other hand, when , many new inequalities are introduced that actually cause holes to appear within . One such inequality (expressed in terms of mutual information) is
Using the entropic cone
The entropic cone can be used to obtain new information theoretic inequalities. For example, the following question arose in the study of mechanism design in computer science.
Problem. Let be discrete random variables. Define the matrix as . Is positive definite?
When , we have since . Thus all the principal minors of have positive determinant, so is positive definite.
How about ? Note that depends linearly on the entries of the vector (where is the joint distribution of ). Thus if is positive definite for distributions on the extreme rays (ER) of the entropic cone, then must be positive definite for any distribution. More generally:
Proposition. If is convex, then holds for all if and only if for all .
Proof. It suffices to note that
for every , and to use that .
This necessary and sufficient condition for generalizes to a sufficient condition for .
Proposition. If is convex, then holds for all if for all .
As is polyhedral, this simplifies solving problems such as checking positive definiteness of significantly: it suffices to check a finite number of cases, which can essentially be done by hand. It can be shown this way that is always positive definite for , but this can fail for .
It is sometimes of interest to investigate discrete variants of the above questions.
Definition. A matroid is defined by the conditions
- for all ;
- is monotone;
- is submodular.
- Vector matroids (also called -representable matroids, where is a field). Given a matrix with values in , define where denotes the columns induced by .
- Graphical matroids. Let be a graph, and choose . Then define , where the maximum is taking over acyclic subgraphs .
- Entropic matroids. Let . For what distributions of is this a matroid?
Denote by the entropic cone induced by those distributions where take values in . In order for such a distribution to define an entropic matroid, the vector must take values in . Thus we are led to consider the set
Can one characterize what type of matroids can arise as entropic matroids?
Theorem. coincides with the set of all -representable matroids with elements.
For the proof and further results around this theme, see the paper of E. Abbe.
Lecture by Emmanuel Abbe | Scribed by Danny Gitelman
This lecture completes the proof of the entropic central limit theorem.
From Fisher information to entropy (continued)
In the previous lecture, we proved the following:
By optimizing the choice of the weights, it is easy to check that (1) is equivalent to
Proving (2) for general fractional packings and general hypergraphs requires some additional steps that we will avoid here. Instead, we will prove the a special case, due to Artstein, Ball, Barthe and Naor (2004), that suffices to resolve the question of monotonicity of entropy in the CLT.
In the following, let be the set of all subsets of of size and take for every . In this case, (2) takes the following form.
Theorem EPI. If are independent random variables, then
provided that all the entropies exist.
To prove Theorem EPI, we will use the de Bruijn identity which was discussed in the previous lecture. Let us rewrite it in a useful form for the coming proof and give a proof.
Theorem (de Bruijn identity). Let be a random variable with density on . Let , , where is independent of . We have
Proof. Let be the density of . Then
where . It is easy to check that . Hence,
Integrating from to gives (3).
Proof of Theorem EPI. In (1), let be the set of all subsets of of size and . We have
where, is the weight corresponding to the set . It is easy to check that (4) is equivalent, for every with , to
for any choice of with and with .
Let , where is independent of all else. Then
where the equality in distribution is due to the fact that . Hence, by (6),
We use de Bruijn identity to integrate this from to over time and get
By (6) again,
This is the entropy analog of the Fisher information inequality obtained above.
As a final step, let . We are to show
If for some , the result is immediate due to the general fact (convolution increases entropy). Hence, we assume that for every , so
Set , . Note that
With this choice of , we apply (7). After some computations, we get
Using the definition of the , the final result follows from here immediately.
Proof of the Entropic Central Limit Theorem
Entropic CLT. Suppose that are i.i.d. random variables with and . Let . If for some , then as . Equivalently, if for some , then as .
In the case of i.i.d. random variables, Theorem EPI gives
which is equivalent to because of the identity . Therefore, it remains to show as . For this, we need some analytical properties of the relative entropy. Henceforth, we use to denote the set of all probability measures on some Polish space (take for instance).
Proposition (Variational characterization of ). Let . Then
where denotes the set of all bounded measurable functions .
- In the variational characterization of above, it is enough to take the supremum over the set of all bounded continuous functions .
- For fixed , the mapping is convex and continuous. Since is the supremum over a class of convex, continuous functions, it is convex and lower semicontinuous. These properties of relative entropy, made transparent by the variational characterization, are very useful in many different contexts.
Corollary. Sublevel sets of are compact, that is, the set is compact (with respect to the topology of weak convergence) for every and .
Before we prove the corollary, let us recall the definition of tightness and Prohorov Theorem.
Definition (Tightness). A set is called tight if for every there exists a compact set such that for every .
Prohorov Theorem. A set is weakly precompact if and only if it is tight.
Note that is a tight set as a singleton. We claim that the sequence is also tight. Indeed, let and let be a compact set with . We take and apply (8) to get
where the rightmost term can be made arbitrarily small. Hence, is tight. By Prohorov Theorem, there exists such that as . By lower semicontinuity of , we have
This finishes the proof of the corollary.
Proof of Variational Characterization of . As a first case, suppose . So and there exists a Borel set with and . Let . We have
as . Hence, both sides of the variational characterization are equal to .
For the rest, we assume . First, we show the part. If , the inequality is obvious. Suppose . Given , define a probability measure by
Since is chosen arbitrarily, taking supremum on the right hand side gives the part.
Next, we prove the part. Note that if , then . However, this choice of may not be in , that is, may fail to be bounded or bounded away from zero. So we employ the following truncation argument. Let
so that as . Note that and . Thus we have
by monotone convergence. On the other hand, by Fatou’s Lemma, we have
from which the part of the result follows.
Building on these now standard facts (whose exposition above follows that in the book of Dupuis and Ellis), Harremoes and Vignat (2005) gave a short proof of the desired convergence, which we will follow below. It relies on the fact that for uniformly bounded densities within the appropriate moment class, pointwise convergence implies convergence of entropies.
Lemma. If are random variables with , , and the corresponding densities are uniformly bounded with as (pointwise) for some density , then and as .
Proof. Recall for with mean and variance . By lower semicontinuity of , we have
On the other hand, letting , we have
and using Fatou’s Lemma,
Hence, as .
End of proof of Entropic CLT. Assume . We will use to denote normalized Fisher information. For any , we have that for . So as for every . We want to show that , since then we will get by Lebesgue’s dominated convergence theorem that
as . But since
it is enough to show that for each .
By the monotonicity property we have proved, we know that
for any . By compactness of sublevel sets of , the sequence must therefore have a subsequence whose distribution converges to a probability measure (let us call a random variable with this limiting measure as its distribution). For , the smoothing caused by Gaussian convolution implies that the density of converges pointwise to that of , and also that the density of converges pointwise to that of , where is an independent copy of . By the previous lemma
as , and
so that necessarily
By the equality condition in the entropy power inequality, this can only happen if is Gaussian, which in turn implies that .
Lecture by Mokshay Madiman | Scribed by Cagin Ararat
Here is the schedule for the remaining three lectures of the semester:
November 21: Emmanuel Abbe on information theoretic inequalities.
November 28: No lecture (Thanksgiving).
December 5 and 12: Ramon van Handel on transportation-information inequalities and concentration.
The remaining notes on the entropic CLT (Lecture 7) will be posted soon.
In this lecture, we complete the proof of monotonicity of the Fisher information in the CLT, and begin developing the connection with entropy. The entropic CLT will be completed in the next lecture.
Variance drop inequality
In the previous lecture, we proved the following decomposition result for functions of independent random variables due to Hoeffding.
ANOVA decomposition. Let be independent random variables taking values in , and let be such that . Then satisfies
Note that is a function only of ( are the ordered elements of ).
In the previous lecture, we proved subadditivity of the inverse Fisher information . The key part of the proof was the observation that the score function of the sum could be written as the conditional expectation of a sum of independent random variables, whose variance is trivially computed. This does not suffice, however, to prove monotonicity in the CLT. To do the latter, we need a more refined bound on the Fisher information in terms of overlapping subsets of indices. Following the same proof, the score function of the sum can be written as the conditional expectation of a sum of terms that are now no longer independent. To estimate the variance of this sum, we will use the following “variance drop lemma” whose proof relies on the ANOVA decomposition.
Lemma. Let where and is some collection of subsets of . If are independent random variables with , then
where is a fractional packing with respect to .
- Recall that a fractional packing is a function such that
Example 1. Let , and define . Taking always defines a fractional packing as by definition of .
Example 2. If , then .
- The original paper of Hoeffding (1948) proves the following special case where each is symmetric in its arguments and is as in Example 2 above: (the -statistic) satisfies
Of course, if then . Thus Hoeffding’s inequality for the variance of -statistics above and the more general variance drop lemma should be viewed as capturing how much of a drop we get in variance of an additive-type function, when the terms are not independent but have only limited dependencies (overlaps) in their structure.
Proof. We may assume without loss of generality that each has mean zero.
We then have, using orthogonality of the terms in the ANOVA decomposition:
For each term, we have
where the second inequality follows from the definition of fractional packing if is non-empty, and the fact that takes any to its mean. Hence
again using orthogonality of the in the last step.
Monotonicity of Fisher information
We can now finally prove monotonicity of the Fisher information.
Corollary. Let be independent random variables with . Then
for any hypergraph on , fractional packing , and positive weights summing to 1.
Proof. Recall that and . The identity proved in the last lecture states
With , we can write
since . By taking a convex combination of these identities,
Now by using the Pythagorean inequality (or Jensen’s inequality) and the variance drop lemma, we have
Remark. The being arbitrary weights, we can optimize over them. This gives
With being all singletons and we recover the superadditivity property of . With being all sets of size and , we get
Thus we have proved the monotonicity of Fisher information in the central limit theorem.
From Fisher information to entropy
Having proved monotonicity for the CLT written in terms of Fisher information, we now want to show the analogous statement for entropy. The key tool here is the de Bruijn identity.
To formulate this identity, let us introduce some basic quantities. Let on , and define
where . Denote by the density of . The following facts are readily verified for :
- is smooth.
Observe that has density , and that as , converges to , which has a standard Gaussian distribution. Thus provides an interpolation between the density and the normal density.
Remark. Let us recall some standard facts from the theory of diffusions. The Ornstein-Uhlenbeck process is defined by the stochastic differential equation
where is Brownian motion. This is, like Brownian motion, a Markov process, but the drift term (which always pushes trajectories towards 0) ensures that it has a stationary distribution, unlike Brownian motion. The Markov semigroup associated to this Markov process, namely the semigroup of operators defined on an appropriate domain by
has a generator (defined via ) given by . The semigroup generated by governs the evolution of conditional expectations of functions of the process , while the adjoint semigroup generated by governs the evolution of the marginal density of . The above expression for follows from this remark by noting that and are the same in distribution; however, it can also be deduced more simply just by writing down the density of explicitly, and using the smoothness of the Gaussian density to verify each part of the claim.
We can now formulate the key identity.
de Bruijn identity. Let be the density of the standard normal .
- Differential form:
where is the normalized Fisher information.
The differential form follows by using the last part of the claim together with integration by parts. The integral form follows from the differential form by the fundamental theorem of calculus, since
which yields the desired identity since .
This gives us the desired link between Fisher information and entropy. In the next lecture, we will use this to complete the proof of the entropic central limit theorem.
Lecture by Mokshay Madiman | Scribed by Georgina Hall
The goal of this lecture is to prove monotonicity of Fisher information in the central limit theorem. Next lecture we will connect Fisher information to entropy, completing the proof of the entropic CLT.
Two lemmas about the score function
Recall that for a random variable with absolutely continuous density , the score function is defined as and the Fisher information is .
The following lemma was proved in the previous lecture.
Lemma 1. Let be a random variable with finite Fisher information. Let be measurable and let . If is bounded on the interval , then the function is absolutely continuous on with a.e.
There is a converse to the above lemma, which gives a useful characterization of the score function.
Lemma 2. Let be a random variable with density and let be a measurable function with . Suppose that for every bounded measurable function , the function is absolutely continuous on with a.e. Then there must exist an absolutely continuous version of the density , and moreover a.s.
Proof. Take . Then
Hence is in fact continuously differentiable with derivative . On the other hand,
by our assumption. By continuity of and the assumption , we must have
for every . Hence a.e. Since , the proof is complete.
Score function of the sum of independent variables
We now show a key property of the score function: the score function of the sum of two independent random variables is a projection.
Proposition. Let and be two independent random variables. Suppose has finite Fisher information. Then a.s.
Proof. Let . By Lemma 2, we only need to show that the function is locally absolutely continuous with a.e. for every bounded measurable function .
Fix . By independence of and , we can apply Lemma 1 to (conditioned on ) to obtain
Taking expectation of both sides and applying Fubini, we get
we arrive at
which implies a.e.
Remark. The well-known interpretation of conditional expectation as a projection means that under the assumption of finite Fisher information (i.e., score functions in ), the score function of the sum is just the projection of the score of a summand onto the closed subspace . This implies directly, by the Pythagorean inequality, that convolution decreases Fisher information: . In fact, we can do better, as we will see forthwith.
Monotonicity of FI in the CLT along a subsequence
We now make a first step towards proving monotonicity of the Fisher information in the CLT.
Theorem. Let and be independent random variables both with finite Fisher information. Then
for any .
Before going into the proof, we make some remarks.
Remark. Taking , we get . Hence the above theorem is a significant strengthening of the simple fact that convolution decreases Fisher information.
Remark. By taking , we get the following optimized version of the theorem:
Remark. Let be an i.i.d. sequence of random variables with mean zero and unit variance. Let . By taking in the theorem and using scaling property of Fisher information, it is easy to obtain . Hence, the above theorem already implies monotonicity of Fisher information along the subsequence of times : that is, is monotone in .
However, the theorem is not strong enough to give monotonicity without passing to a subsequence. For example, if we apply the previous remark repeatedly we only get , which is not very interesting. To prove full monotonicity of the Fisher information, we will need a strengthening of the above Theorem. But it is instructive to first consider the proof of the simpler case.
Proof. By the projection property of the score function,
Applying the conditional Jensen inequality, we obtain,
Taking expectations of both sides and using independence, we obtain
By Lemma 1, . This finishes the proof.
Monotonicity of Fisher information in the CLT
To prove monotonicity of the Fisher information in the CLT (without passing to a subsequence) we need a strengthening of the property of Fisher information given in the previous section.
In the following, we will use the common notation .
Definition. Let be a collection of non-empty subsets of . Then a collection of non-negative numbers is called a fractional packing for if for all .
We can now state the desired strengthening of our earlier theorem.
Theorem. Let be a fractional packing for and let be independent random variables. Let be a collection of real numbers such that . Then
Remark. Suppose that the are identically distributed. Take . For each , define . It is easy to check that is a fractional packing of . Then
by the above theorem. By the scaling property of Fisher information, this is equivalent to , i.e., monotonicity of the Fisher information. This special case was first proved by Artstein, Ball, Barthe and Naor (2004) with a more complicated proof. The proof we will follow of the more general theorem above is due to Barron and Madiman (2007).
The proof of the above theorem is based on an analysis of variance (ANOVA) type decomposition, which dates back at least to the classic paper of Hoeffding (1948) on U-statistics. To state this decomposition, let be independent random variables, and define the Hilbert space
with inner product
For every , define an operator on as
Proposition. Each can be decomposed as
- if ;
- if ;
- does not depend on if .
Proof. It is easy to verify that is a projection operator, that is, and (self-adjointness). It is also easy to see that for . Hence we have
If , choose such that . Then
It is easily verified that is itself self-adjoint, so that follows directly. Finally, as by definition does not depend on , it is clear that does not depend on if .
The decomposition will be used in the form of the following variance drop lemma whose proof we postpone to next lecture. Here for , .
Lemma. Let be a fractional packing for . Let . Suppose that for each . Then
To be continued…
Lecture by Mokshay Madiman | Scribed by Liyao Wang
A quick reminder that next week is Fall break (no lecture). The next lecture will take place on Thursday, November 7: Mokshay will resume the proof of the entropic central limit theorem.
The subject of the next lectures will be the entropic central limit theorem (entropic CLT) and its proof.
Theorem (Entropic CLT). Let be i.i.d. real-valued random variables with mean zero and unit variance. Let
If for some , then , or equivalently . That is, the entropy of increases monotonically to that of the standard Gaussian.
Recall that when is Gaussian with the same mean and variance as , which explains the equivalence stated in the theorem.
Let us note that the assumption for some represents a genuine (but not unexpected) restriction: in particular, it implies that the entropic CLT does not apply if are discrete.
Entropy power inequality
Historically, the first result on monotonicity of entropy in the CLT was that for all . This follows directly from an important inequality for entropy, the entropy power inequality (EPI). The rest of this lecture and part of the next lecture will be devoted to proving the EPI. While the EPI does not suffice to establish the full entropic CLT, the same tools will prove to be crucial later on.
Entropy power inequality. Let and be independent real-valued random variables such that , , and all exist. Then
with equality if and only if and are Gaussian.
Before we embark on the proof, let us make some remarks.
Remark. None of the assumptions about existence of entropies is redundant: it can happen that and exist but does not.
Remark. If and are i.i.d., , and , then the EPI implies
which implies . Here we have used the easy-to-check equality , which of course implies . From this observation, the proof of the claim that is immediate: simply note that is the sum of two independent copies of .
Remark. It is easy to check that . In fact, this is true in much more general settings (e.g. on locally compact groups, with entropy defined relative to Haar measure). The EPI is a much stronger statement particular to real-valued random variables.
Remark. The EPI admits the following multidimensional extension.
Multidimensional EPI. Let and be independent -valued random vectors such that , , and all exist. Then
with equality if and only if and are Gaussian with proportional covariance matrices.
Define the entropy power for an -valued random vector by
The EPI says that is superadditive under convolution.
Digression: EPI and Brunn-Minkowski
A good way to develop an appreciation for what the EPI is saying is in analogy with the Brunn-Minkowski inequality. If are Borel sets and denotes -dimensional Lebesgue measure, then
where is the Minkowski sum. In particular, note that is proportional up to an absolute constant to the radius of the -dimensional Euclidean ball whose volume matches that of . The Brunn-Minkowski inequality expresses superadditivity of this functional (and we clearly have equality for balls). The Brunn-Minkowski inequality is of fundamental importance in various areas of mathematics: for example, it implies the isoperimetric inequality in , which states that Euclidean balls with volume have the minimal surface area among all subsets of with volume .
In a sense, the EPI is to random variables as the Brunn-Minkowski inequality is to sets. The Gaussians play the role of the balls, and variance corresponds to radius. In one dimension, for example, since
we see that is proportional to the variance of the Gaussian whose entropy matches that of . The entropy power inequality expresses superadditivity of this functional, with equality for Gaussians.
Proposition. The EPI is equivalent to the following statement: if and are independent and and are independent Gaussians with and , then provided that all of the entropies exist.
Proof. Both implications follow from
Proof of the entropy power inequality
There are many proofs of the EPI. It was stated by Shannon (1948) but first fully proven by Stam (1959); different proofs were later provided by Blachman (1969), Lieb (1978), and many others. We will follow a simplified version of Stam’s proof. We work from now on in the one-dimensional case for simplicity.
Definition (Fisher information). Let be a -valued random variable whose density is an absolutely continuous function. The score function of is defined as
The Fisher information (FI) of is defined as
Remark. Let be a parametric statistical model. In statistics, the score function is usually defined by , and the Fisher information by . This reduces to our definition in the special case of location families, where for some probability density : in this case, and we have
Thus for location families does not depend on and coincides precisely with the Fisher information as we defined it above for a random variable with density . The statistical interpretation allows us do derive a useful inequality. Suppose for simplicity that . Then for every , so is an unbiased estimator of . The Cramér-Rao bound therefore implies the inequality
with equality if and only if is Gaussian. The same conclusion holds when is arbitrary, as both FI and variance are invariant under translation.
Remark. There is a Fisher information analogue of the entropic CLT: in the setup of the entropic CLT, subject to an additional variance constraint, we have . Moreover, Fisher information is minimized by Gaussians. This is often stated in terms of normalized Fisher information, defined as . Note that is both translation and scale invariant: and . We have , with equality if and only if is Gaussian, by the previous remark. The Fisher information analogue of the entropic CLT can now be restated as .
The strategy of the proof of the EPI is as follows:
- We first prove an inequality for Fisher information:
- Develop an integral identity relating and .
- Combine (1) and (2) to get the EPI.
The reason to concentrate on Fisher information is that is an -type quantity, as opposed to the entropy which is an -type quantity. This makes easier to work with.
We begin with some technical results about Fisher information.
Lemma. If has absolutely continuous density and , then has bounded variation. In particular, is bounded.
Proof. Let denote the set of points at which is differentiable, so that . Define for . Let . Then
Lemma. Let be measurable. Let be defined as . If is bounded, then defined by is absolutely continuous on with a.e.
Proof. Let ; we want to show that is absolutely continuous with derivative . First observe that is integrable on from the Cauchy-Schwarz inequality and the boundedness condition. Thus we can compute
which proves the desired result.
Corollary. and, if , then .
Proof. Take in the lemma above for the first claim. Take for the second.
To be continued…
Lecture by Mokshay Madiman | Scribed by Dan Lacker
The next lecture is on October 17. Please note that we have a one time room change to Sherrerd Hall Room 101 on October 17 as the usual Bendheim room is in use for another event. We will return to the Bendheim center the following week for the rest of the semester.
The goal of this lecture is to prove one of the most basic results in large deviations theory. Our motivations are threefold:
- It is an example of a probabilistic question where entropy naturally appears.
- The proof we give uses ideas typical in information theory.
- We will need it later to discuss the transportation-information inequalities (if we get there).
What is large deviations?
The best way to start thinking about large deviations is to consider a basic example. Let be i.i.d. random variables with and . The law of large numbers states that
To say something quantitative about the rate of convergence, we need finer limit theorems. For example, the central limit theorem states that
Therefore, for any fixed , we have
(the probability converges to a value strictly between zero and one). Informally, this implies that the typical size of is of order .
Rather than considering the probability of typical events, large deviations theory allows us to understand the probability of rare events, that is, events whose probabilities are exponentially small. For example, if , then the probability that is at least of order unity (a rare event, as we have just shown that the typical size is of order ) can be computed as
The probability of this rare event decays exponentially at rate . If the random variables have a different distribution, then these tail probabilities still decay exponentially but with a different rate function. The goal of large deviations theory is to compute precisely the rate of decay of the probabilities of such rare events. In the sequel, we will consider a more general version of this problem.
Let be i.i.d. random variables with values in a finite set and with distribution (random variables in a continuous space will be considered at the end of this lecture). Denote by the set of probabilities on . Let be the empirical distribution of :
The law of large numbers states that a.s. To define a rare event, we fix that does not contain . We are interested in behavior of probabilities of the form , as .
Example. Let be such that . Define for some . Then . Thus the rare events of the type described in the previous section form a special case of the present setting.
We are now in a position to state Sanov’s theorem, which explains precisely at what exponential rate the probabilities decay.
Theorem (Sanov). With the above notations, it holds that
In particular, for “nice” such that the left- and right-hand sides coincide we have the exact rate
In words, Sanov’s theorem states that the exponential rate of decay of the probability of a rare event is controlled by the element of that is closest to the true distribution in the sense of relative entropy.
Method of types
It is a trivial observation that each possible value must appear an integer number of times among the samples . This implies, however, that the empirical measure cannot take arbitrary values: evidently it is always the case that , where we define
Each element of is called a type: it contains only the information about how often each value shows up in the sample, discarding the order in which they appear. The key idea behind the proof of Sanov’s theorem is that we can obtain a very good bound on the probability that the empirical measure takes the value for each type .
Type theorem. For every , we have
That is, up to a polynomial factor, the probability of each type behaves like .
In view of the type theorem, the conclusion of Sanov’s theorem is not surprising. The type theorem implies that types such that have exponentially smaller probability than the “optimal” distribution that minimizes the relative entropy in . The probability of the rare event is therefore controlled by the probability of the most likely type. In other words, we have the following intuition, common in large deviation theory: the probability of a rare event is dominated by the most likely of the unlikely outcomes. The proof of Sanov’s theorem makes this intuition precise.
Proof of Sanov’s theorem.
Upper bound. Note that
[Note that in the finite case, by continuity, the infimum over equals the infimum over as stated in the theorem. The closure becomes more important in the continuous setting.]
Lower bound. Note that is dense in . As is open, we can choose for each , such that . Therefore,
It follows that
[Note that despite that we are in the finite case, it is essential to consider the interior of .]
Of course, all the magic has now shifted to the type theorem itself: why are the probabilities of the types controlled by relative entropy? We will presently see that relative entropy arises naturally in the proof.
Proof of the type theorem. Let us define
Then we can write
It is therefore sufficient to prove that for every
To show this, the key idea is to utilize precisely the same expression for given above, for the case that the distribution that defined the empirical measure is replaced by (which is a type). To this end, let us denote by the empirical measure of i.i.d. random variables with distribution .
Upper bound. We simply estimate using the above expression
Lower bound. It seems intuitively plausible that for every , that is, the probability of the empirical distribution is maximized at the true distribution (“what else could it be?” We will prove it below.) Assuming this fact, we simply estimate
Proof of the claim. It remains to prove the above claim that for every . To this end, note that consists of all vectors such that of the entries take the value , of the entries take the value , etc. The number of such vectors is
It is now straightforward to estimate
Thus the claim is established.
Remark. It is a nice exercise to work out the explicit form of the rate function in the example considered at the beginning of this lecture. The resulting expression yields another basic result in large deviations theory, which is known as Cramèr’s theorem.
General form of Sanov’s theorem
The drawback to the method of types is that it relies heavily on the assumption that take values in a finite state space. In fact, Sanov’s theorem continues to hold in a much more general setting.
Let be a Polish space (think ), and let be i.i.d. random variables taking values in with distribution . Denote by the space of probability measures on endowed with the topology of weak convergence: that is, iff for every bounded continuous function . Now that we have specified the topology, it makes sense to speak of “open” and “closed” subsets of .
Theorem. In the present setting, Sanov’s theorem holds verbatim as stated above.
It turns out that the lower bound in the general Sanov theorem can be easily deduced from the finite state space version. The upper bound can also be deduced, but this is much more tricky (see this note) and a direct proof in the continuous setting using entirely different methods is more natural. [There is in fact a simple information-theoretic proof of the upper bound that is however restricted to sets that are sufficiently convex, which is an unnecessary restriction; see this classic paper by Csiszar.]
We will need the general form of Sanov’s theorem in the development of transportation-information inequalities. Fortunately, however, we will only need the lower bound. We will therefore be content to deduce the general lower bound from the finite state space version that we proved above.
Proof of the lower bound. It evidently suffices to consider the case that is an open set. We use the following topological fact whose proof will be given below: if is open and , then there is a finite (measurable) partition of and such that
Given such a set, the idea is now to reduce to the discrete case using the data processing inequality.
Define the function such that for . Then if and only if the empirical measure of lies in . Thus
As take values in a finite set, and as is open, we obtain from the finite Sanov theorem
where we have used the data processing inequality and in the last inequality. As was arbitrary, taking the supremum over completes the proof.
Proof of the topological fact. Sets of the form
for , , bounded continuous functions, and form a base for the weak convergence topology on . Thus any open subset must contain a set of this form for every (think of the analogous statement in : any open set must contain a ball around any ).
It is now easy to see that each set of this form must contain a set of the form used in the above proof of the lower bound in Sanov’s theorem. Indeed, as is a bounded function, we can find for each a simple function such that . Clearly implies , so we can replace the functions by simple functions. But then forming the partition generated by the sets that define these simple functions, it is evident that if is chosen sufficiently small, then for all implies . The proof is complete.
Remark. It is also possible to work with topologies different than the topology of weak convergence. See, for example, the text by Dembo and Zeitouni for further discussion.
Lecture by Ramon van Handel | Scribed by Quentin Berthet