Recall that we are in the middle of proving the following result.
Theorem. Let be independent random vectors in whose distribution has finite support. Suppose that
with positive probability.
Define the random matrix and its characteristic polynomial as
As is positive definite, one representation of the matrix norm is . The proof of the above theorem consists of two main parts:
- Show that with positive probability.
- Show that .
The first statement was proved in the previous two lectures. The goal of this lecture and the next is to prove the second statement.
Mixed characteristic polynomials
The key tool that was used in the previous lectures in the mixed characteristic polynomial of the matrices , which we defined as follows:
We recall the two crucial observations used in the previous lectures:
- is multilinear in .
- If have rank one (this is essential!), then .
As in our case are independent random matrices of rank one, this implies
where we define for simplicity .
Note that the matrices are no longer of rank one (in particular, it is not true that is the characteristic polynomial of : that would make the remainder of the proof quite trivial!) However, by assumption, the matrices satisfy
To complete the proof, our aim will be to show that for any matrices with these properties, the maximal root of can be at most .
Theorem. Let be positive semidefinite matrices in . Suppose that
As the matrices sum to the identity , it will be convenient to rewrite the definition of the mixed characteristic polynomial slightly using this property:
where we defined the multivariate polynomial
Our aim is to show that whenever .
How should we go about proving such a property? The first observation we should make is that the polynomial itself has no roots in an even larger region.
Lemma. whenever .
Proof. Note that
as and . Thus is nonsingular whenever .
One might now hope that we can follow a similar idea to what we did in the previous lectures. In the previous lectures, we wanted to show that the mixed characteristic polynomial is real-rooted. To this end, we first showed that the determinant in the definition of the mixed characteristic polynomial is real stable (the appropriate multivariate generalization or real-rootedness), and then we showed that this property is preserved when we take derivatives of the form .
In an ideal world, we would show precisely the same thing here: if the polynomial has no roots in the region , then perhaps has no roots there as well, etc. If this were true, then we would in fact have shown that has no roots in the region . Unfortunately, this is clearly false: that would imply an upper bound of zero for the matrix norm in our original problem (impossible in general!) Nonetheless, this general strategy proves to be the right approach. It is indeed not true that the region with no roots is preserved by the operation . However, it turns out that we will be able to control by how much this region shrinks in successive applications of . The control that we will develop will prove to be sufficiently sharp in order to obtain the desired result.
Evidently, the key problem that we face is to understand what happens to the roots of a polynomial when we apply an operation of the form . In the remainder of this lecture, we will investigate this problem in a univariate toy setting. This univariate approach will not be sufficiently sharp to yield the desired result, but will give us significant insight into the ingredients that are needed in the proof. In the following lecture, we will complete the proof by developing a multivariate version of these ideas.
Barrier functions: a toy problem
Let us put aside for the moment our original problem, and consider the following simplified setting. Let be a univariate polynomial with real roots . What can we say about the locations of the roots of the polynomial ? Clearly, the latter polynomial has a root at if and only if
The function is very interesting; let us investigate what it looks like. Note that we can always represent a polynomial in terms of its roots as
for some constant . We can therefore readily compute
As we assumed that the roots of are real, the function looks something like this:
The function blows up at the roots of ; this function is therefore referred to as the barrier function, in analogy with a similar notion in optimization. The values where it is equal to one determine the locations of the roots of . It follows immediately from the shape of the barrier function that the roots of our two polynomials are interlaced, as can be seen by inspecting the above figure. Note, moreover, that , so that it is unfortunately the case that the maximal root of a polynomial increases under the operation . However, we can control the location of the maximal root if we are able to control the barrier function.
Let us now return to our matrix problem and try to see how this idea can be useful. We will consider the first step towards establishing the desired result: what happens to the locations of the roots of the polynomial when we apply a single operation ? (We will ultimately want to iterate such an argument for every coordinate to control the roots of the mixed characteristic polynomial.)
To this end, let us fix for the time being, and define
We have already shown that . In order to control the roots of , we need to control the barrier function . The stroke of luck that we have at this point is that the derivative of the logarithm of a determinant is a remarkably nice object.
Lemma (Jacobi formula). If is invertible, then .
Proof. First, we note that
It therefore suffices to prove that
To this end, we use directly the definition of the determinant:
where the sum is over permutations . Setting , we see that the only term that survives is the one that satisfies for all , that is, the identity permutation. We therefore obtain (as )
This completes the proof.
Using the Jacobi formula, we immediately find
It therefore follows that
This brings us one step closer to the desired conclusion!
We have seen above that we can control the roots of and . Of course, the next step will be to control the roots of . Unfortunately, a direct application of the barrier function method does not lend itself well to iteration. The problem is that while we could easily control the barrier function of using the Jacobi formula, it is not so obvious how to control the barrier function of .
Instead, we are going to develop in the following lecture a multivariate version of the barrier argument. In each consecutive application of , we will control the region of in which the polynomial has no roots; at the same time, we will also obtain control over the barrier function of the polynomial in the next stage. When implemented properly, this procedure will allow us to iterate the barrier argument without requiring explicit control of the barrier function except at the first stage.
Many thanks to Mark Cerenzia for scribing this lecture!
There will be no lecture on October 9 as several of us are out of town. The stochastic analysis seminar will resume on October 16 (the next week) at the usual time and place.
The goal of the following lectures is to prove the following result about random matrices due to Marcus, Spielman, and Srivastava. This result readily implies a positive resolution of the Kadison-Singer conjecture, as was explained in detail in the previous lecture.
Theorem. Let be independent random vectors in whose distribution has finite support. Suppose that
with positive probability.
Define the random matrix and its characteristic polynomial as
As is positive definite, one representation of the matrix norm is . The proof of the above theorem consists of two main parts:
- Show that with positive probability.
- Show that .
The present post is devoted to the proof of the first statement. The second statement will be the subject of the subsequent lectures.
Interlacing roots of polynomials
How can we prove an identity such as ? The key issue that we face is that the functional is extremely nonlinear, and it is not at all clear how it behaves when you take expectations. This problem already arises in the very simplest setting: what can we say about the locations of the roots of a convex combination of two polynomials and ? In general, there is little hope to say anything concrete. However, if the polynomials and have some special properties, then we are in much better shape.
Lemma. Let and be polynomials that are monic (their leading coefficient is one) and of the same degree. Suppose that the polynomial is real-rooted (all its roots are real) for every . Then
for every .
The message of this lemma is that under the given assumptions, the maximal root of the convex combination interlaces between the maximal roots of and (in fact, a similar property holds for all other roots as well; but we will not need this fact.)
Proof. Let us assume without loss of generality that , so we must prove
Proof of upper bound. Because the polynomials are monic, and are strictly increasing for real to the right of their maximal roots. Thus and are both strictly positive for . But this means that is also strictly positive for , and thus cannot have a root in this region. This implies that
Proof of lower bound. Of course, this is the interesting part. By the same argument as the upper bound, it is clear that if . We can therefore assume without loss of generality that .
We distinguish two cases. Note that for . Suppose first that for all . Then and must look something like this:
This means that for any , we have at and at . Thus must have a root between and by the intermediate value theorem. In particular,
Now suppose that for some , and let be the second largest root of . Then and must look something like this:
(Note that it could be that , or that is a local maximum; then the picture looks slightly different, but this will not change our argument.) We now argue as follows. The polynomial has no roots in the interval , but the polynomial has at least two roots in this interval (counting multiplicities). Thus as we let vary from to , the number of roots of the polynomial must jump at some point from having at least two roots to having zero roots in the interval . The polynomial must therefore look like this:
This means that for any sufficiently small, the polynomial loses two real roots in the interval . As long as these roots do not reappear elsewhere on the real line, this implies that this polynomial has less real roots than its degree: and, therefore, it must have at least two roots in the complex plane away from the real line. This contradicts the assumption that all convex combinations of and are real-rooted.
Thus it only remains to argue what happens if the two roots that are lost when we go from to reappear elsewhere on the real line. Note that for new roots to appear, a local maximum that was strictly below (or a local minimum that was strictly above ) when must have moved above (or below) when . As the graph of a polynomial is continuous in its coefficients, we can always avoid this by choosing to be sufficiently small.
Once we know how to deal with two polynomials, we can easily generalize as follows.
Corollary. Let be monic polynomials of the same degree. Suppose that all convex combinations of are real-rooted. Then
for all convex weights , .
Proof by induction. assume the result is true for a given number of polynomials ; then the result is also true for polynomials by applying the previous lemma.
Mixed characteristic polynomials
In order to apply the interlacing property developed above to the Kadison-Singer problem, we must show that the characteristic polynomial satisfies the necessary assumptions. Recall that the random matrix has a distribution with finite support; that is, is a random variable that can take one of possible values . Our aim is to show that there exists such that
where the expectation is clearly a convex combination of the polynomials . In order to apply the interlacing property, we must show not only that each of these polynomials is real-rooted (this is obvious, as are Hermitian matrices), but also that convex combinations of these polynomials are real-rooted. The latter is far from obvious: we must reason about the properties of convex combinations of determinants, such as . Determinants are nonlinear objects, and so it is not at all clear how to address such questions.
The stroke of luck in the Kadison-Singer problem is that the the matrix is a sum of rank-one matrices. Even though the determinant is nonlinear, it is linear with respect to rank-one perturbations. This crucial fact will allow us to express convex combinations of the determinants that appear in our problem in terms of another determinant, for which it will prove to be easier to establish real-rootedness.
Lemma. (Sylvester’s determinant identity) .
Proof. Let be nonsingular square matrices. Then
By continuity, this remains valid for singular matrices . The result follows by letting be the matrix whose first column is and be the matrix whose first column is (all other entries are zero).
Corollary. is linear in .
Proof. If is nonsingular, we can write
which is linear in . As the limit of linear functions is linear, the result remains valid for singular .
What we would really like is that for a random vector : that would make it very easy to handle convex combinations of characteristic polynomials. Unfortunately, this is wishful thinking: the matrix is not of rank one, so this does not follow from the linearity of rank-one perturbations. Nonetheless, this is not such a bad idea. We can use the linearity of rank-one perturbations to rewrite the characteristic polynomial of a sum of rank-one matrices in terms of a different quantity, which proves to be linear even beyond the rank-one setting.
To this end, let us define the multivariate polynomial
Then . Moreover, we have just shown that is multilinear in for every . Using that a linear function equals its first-order Taylor expansion, we can write
Iterating this device, we obtain
Setting , we obtain the identity
The quantity on the right is called a mixed characteristic polynomial.
Definition. The mixed characteristic polynomial of square matrices is defined as
Note that if all are rank one, we have shown that
That is, the mixed characteristic polynomial of coincides with the characteristic polynomial of when all are rank one. This is no longer true when the matrices are not of rank one! Nonetheless, the mixed characteristic polynomial is an extremely useful device for the following reason.
Lemma. is multilinear in .
Proof. The determinant is a polynomial in the entries of the matrices , and each term in the product is a product of derivatives with respect to distinct variables . In order for such a derivative of one of the terms of the polynomial to be nonzero after we set , this term can contain each at most once: otherwise this term must be proportional to for some , so the derivative of this term will vanish when we set . Thus the only terms that survive in the mixed characteristic polynomials are products of the entries of the matrices , where each appears at most once. Such terms are clearly multilinear in .
A simple consequence of this result is, for example, that
where we have used multilinearity and the independence of the vectors . Let us emphasize that the right-hand side in this expression cannot itself be expressed as the characteristic polynomial of , for example. Nonetheless, we have achieved something very useful: we have expressed a convex combination of determinants in terms of the derivative of a determinant. It turns out that it is much easier to establish real-rootedness of derivatives of determinants, than it is to establish real-rootedness of convex combinations of determinants. Our next order of business is to understand why this is true.
Multivariate polynomials and real stability
Given positive semidefinite matrices , let us define the multivariate polynomial
The mixed characteristic polynomial is obtained from by taking derivatives, followed by fixing . We are going to prove real-rootedness of by showing that the real-rootedness property is preserved by such operations.
Of course, it is not quite clear what we mean by this: is a multivariate polynomial, so its roots are points in . The appropriate extension of real-rootedness to multivariate polynomials is as follows.
Definition. A multivariate polynomial is called stable if it has no roots in the set , and is called real stable if in addition all its coefficients are real.
We are doing to establish the following facts:
- The polynomial is real stable.
- Taking derivatives or fixing parameters preserves real stability.
- A univariate polynomial is real stable if and only if it is real-rooted.
As the (univariate) mixed characteristic polynomial is obtained from by taking derivatives and fixing parameters, this reasoning will show that is real-rooted.
The third assertion is the easiest, and we prove it first.
Lemma. A univariate polynomial is real stable if and only if it is real-rooted.
Proof. Clearly a real-rooted polynomial is real stable. Conversely, suppose a univariate polynomial is real stable. Then it has no roots with . However, if , then also. Therefore, there are also no roots with . It follows that must be real-rooted.
Next, we prove the preservation of real stability under taking derivatives and fixing parameters.
Lemma. Let be a real stable polynomial and let . Then
- is real stable.
- is real stable.
Proof. We begin by proving the first assertion. Clearly has real coefficients, so it suffices to show that it is stable. Suppose to the contrary that has a root in . As the roots of a polynomial vary continuously with its coefficients, the polynomial must also have a root in for sufficiently small. But this contradicts the real stability of .
Remark. A careful proof of the continuity of roots of complex polynomials can be given using Rouché’s theorem. This is left as an exercise to the interested reader.
We now turn to the second assertion. It is again obvious that the derivative of the polynomial has real coefficients, so it remains to establish that it is stable. Assume to the contrary that
for some in .
Define the univariate polynomial , and denote by the roots of . As is real stable, we must have and for all . Moreover, we can express as
By assumption, we have
(as is analytic in a neighborhood of ). But note that
Taking the imaginary part of this expression yields
But and , so we have achieved a contradiction.
Finally, we prove that the polynomial is in fact real stable.
Lemma. Suppose that are positive semidefinite matrices. Then the polynomial
is real stable.
Proof. To show that is real, note that
As are Hermitian, this implies that is real whenever the parameters are all real. Thus the coefficients of must be real, and it remains to prove stability.
To this end, suppose that has a root in . Then the matrix is not invertible, so there exists a vector such that
But and , so the imaginary part of the left-hand size must be strictly positive by the assumption that the root is in . Thus we have achieved a contradiction.
Putting together the previous three results, we have shown the following.
Corollary. The mixed characteristic polynomial is real-rooted whenever are positive semidefinite matrices.
Putting everything together
We are now ready to put together the various ideas developed in this post.
Proposition. Let be independent random vectors in whose distribution has finite support, and let . Then
with positive probability.
Proof. We will apply the interlacing property step by step. Denote by the possible values that the random vector can take. First, note that for every collection of convex weights
by multilinearity. We have shown that the mixed characteristic polynomial on the right is real-rooted. Moreover, it is easily verified from their definition that the polynomials are monic and have the same degree. Therefore, by the interlacing property, there exists a vector (which may depend on the outcomes of ) such that
We now proceed in precisely the same manner with the polynomial on the right-hand side of this inequality to establish the existence of a vector (depending on ) such that
By iterating this argument, we show that
But note that , which is a single realization of ; whereas by multilinearity of the mixed characteristic polynomial and as the random vectors are independent. We have therefore completed the proof.
We have now completed the first step in our program. In the following lectures, we will concern ourselves the with second step: the control of .
Remark. It is only fair to note that, even though the result we are after is a statement about random matrices, essentially no probability appears in the proof: the proof only finds one single realization of the random variable whose maximal root is controlled as desired, and the statement that this happens with positive probability is largely cosmetic (this is a finite probability space, so any realization occurs with positive probability). This highlights, in some sense, the limitations of the probabilistic method in the present context: as the event we are after has very small probability, the usual methods of probabilistic reasoning are not efficient at finding such an event.
Lectures by Philippe Rigollet | Notes by Junwei Lu and Ramon van Handel
The goal of our seminar this semester is to try to understand the unexpected application of the geometry of roots of polynomials to address probabilistic problems. Such methods have recently led to the resolution of several important questions, and could potentially be very useful in many other problems as well. In the following lectures, we will work in detail through some recent problems in which the geometry of polynomials arises in a crucial manner in the proof.
Kadison-Singer, random matrices, and polynomials
In 1959, Kadison and Singer made a fundamental conjecture about the extension of pure states on algebras. While the original conjecture was purely qualitative, it has since been reformulated in many different ways. These different formulations prove to have applications in various areas: see the nice survey by Cassaza and Tremain. After many decades, the Kadison-Singer conjecture was finally proved in a paper by Marcus, Spielman, and Srivastava. They prove a quantitative formulation of the conjecture, due to Weaver, that can be stated as an elementary (deterministic!) property of vectors in .
Theorem (formerly Conjecture) There exist universal constants such that, if and are any positive integers and are vectors such that
then we can partition into disjoint subsets and such that
where denotes the spectral norm of the matrix .
This result appears at first sight to be almost obvious: in the scalar case , the corresponding statement is completely elementary. To see this, let satisfy and for all . The only way that we can fail to find a subset such that and is when the sizes of the are very imbalanced (e.g., if and ), and this is ruled out by the assumption . It seems natural to expect that a similar conclusion follows in any dimension .
Remark. To make this argument precise, let us argue by contradiction. Suppose that for every set , we have either or . The first possibility occurs when and the second possibility occurs when . By adding points one by one, we see that there must exist a set and a point such that and . But this implies that , which contradicts when is sufficiently small.
Despite the intuitive nature of the statement, the general result is far from obvious. In particular, it is important to note that the affirmative solution to the Kadison-Singer conjecture requires that the constants must be independent of and . It is not at all obvious a priori that it is possible to achieve this: the scalar case is misleadingly simple, and its proof does not extend to higher dimension (try it!)
So far no probability has entered the picture: we are only asked to establish the existence of one set for which the above result holds. However, in many problems in analysis and combinatorics, the easiest way to find the existence of a nontrivial object is by selecting such an object randomly and attempting to show that it satisfies the desired properties with positive probability. This approach is known as the probabilistic method. Using this idea, we will presently show that the above deterministic problem can be reduced to a basic probabilistic problem of estimating the norm of certain random matrices.
Let be given. In order to find a suitable set , let us try to choose this set uniformly at random: that is, is a random set such that each point is included in independently with probability one half. Let us now define the random vectors as follows:
where denotes the zero vector in . A quick calculation gives
We therefore have
Moreover, we can compute
It is now immediately clear that in order to prove the deterministic version of the Kadison-Singer conjecture, it suffices to prove the following result about random matrices.
Theorem. Let be independent random vectors in whose distribution has finite support. Suppose that
(where as depends only on ).
How might one prove this result? The easiest approach would be to show that
in which case the result would follow immediately (this is called the first moment method). This seems promising, as the literature on random matrices contains numerous bounds on the norm of a random matrix. Unfortunately, this approach cannot work, as the following example illustrates.
Example. Denote by the standard basis in , and let be i.i.d. standard normal random variables for and . Define the random vectors by
Then , so . Moreover, . Thus the assumptions of the above result are satisfied when is sufficiently large. On the other hand, the matrix is diagonal with entries given by , and we therefore find that
as (as the maximum of chi-square random variables is of order ). Thus the expected matrix norm in this example must depend on the dimension . We therefore cannot obtain a dimension-free result, as is required to resolve Kadison-Singer, by looking at the expectation of the matrix norm.
Remark. In this example we used Gaussian variables, for which but is not uniformly bounded. In the reduction of Kadison-Singer to the random matrix problem, we had the stronger property that uniformly. However, by using Bernoulli variables instead of Gaussians, we should be able to construct an analogous example using only bounded variables.
Remark. The lovely method due to Ahlswede and Winter makes it possible to show rather easily that in the general random matrix setting that we are considering. Thus the above example exhibits the worst case behavior for this problem. However, as noted above, a dimension-dependent bound of this kind does not suffice for our purposes.
Is the result we are trying to prove hopeless? Not at all: bounding the expectation of the matrix norm is only a very crude way to show that the matrix norm is small with positive probability. The expected norm captures the typical behavior of the matrix norm: in the present case, we have seen above that this typical behavior must in general be dimension-dependent. However, all we want to establish is that it is possible that the matrix norm is bounded by a dimension-independent quantity (that is, that this happens with positive probability). Unfortunately, the above example suggests that the latter cannot be a typical event: the bound that we are interested in can only hold, in general, with very small probability that vanishes as we increase the dimension. This means that it is much more difficult to apply the probabilistic method, which usually works well precisely when the phenomenon that we are interested in is not only possible, but in fact typical. The present situation is much more delicate, and results in random matrix theory are not equipped to address it. To surmount this problem, Marcus, Spielman, and Srivastava developed an entirely new approach to bound the norm of a random matrix.
Let us briefly recall the common approaches in random matrix theory for bounding the norm of a random matrix . The key issue is to reduce the random variable to a probabilistic structure that we know how to control. The following two approaches are used in much of the literature:
- Recall that we can write , where denotes the unit ball in . This reduces estimating the norm to estimating the supremum of a family of random variables whose distribution is easily controlled (this is a linear function of the matrix entries). There are many standard probabilistic methods for estimating the supremum of random variables.
- Recall that we can write (the quantity is the -norm of the eigenvalues of , and the spectral norm is the norm of the eigenvalues). We can therefore estimate the matrix norm if we can control the moments . The latter is the expectation of a polynomial in the matrix entries, and thus the problem reduces to controlling the moments of the matrix entries and counting (this typically involves some combinatorics).
The beauty of these methods is that they reduce the problem of estimating the matrix norm to well-understood probabilistic problems of estimating the maxima or moments of random variables.
There is a third representation of the matrix norm that is taught in every linear algebra course. Suppose that the random matrix is positive definite (as is the case in our setting). Then the matrix norm is the largest eigenvalue of , and therefore coincides with the maximal root of the characteristic polynomial (the roots of are all real as is Hermitian, so it makes sense to speak of the “maximal” root). That is, we have the representation
Note that as is a random matrix, is a random polynomial. Its expectation could potentially be a very useful tool in random matrix theory (just as was a very useful quantity). Nonetheless, this representation is typically considered to be useless in random matrix theory. The problem is that is an extremely nasty functional, and there is no reason for it to behave well under expectations. In essence, the difficulty is that the geometry of roots of polynomials is extremely nonlinear, and the probabilistic behavior of these objects is not well understood.
A major insight of Marcus, Spielman, and Srivastava is that this representation can be extremely useful in certain problems. To exploit it, they use (a variant of) the following remarkable observation.
Lemma. Let be a random polynomial that takes values in the finite set . Suppose that all have the same degree and are monic (the leading coefficient is ), and that all convex combinations of have only real roots. Then
This statement is not at all obvious! In particular, it is not true without the assumption on the locations of the roots. To use it, we must show that the random polynomials that arise in the random matrix problem satisfy the required assumption, and we must also find a way to control the maximal root of . We can now see why roots of polynomials play a key role throughout the proof, both in its statement (the norm is represented as a maximal root) and in its guts (the locations of the roots are essential to obtain results such as the above lemma). The development of these ideas will require us to develop tools that marry the geometry of roots of polynomials with the probabilistic problems that we are trying to address. These methods will be developed in detail in the following lectures.
Let us now turn to an entirely different application of roots of polynomials in probability theory. We first recall some basic definitions.
A point process on a space is a random finite or countable subset of . For example, might be , , or . Some interesting examples of point processes include:
- The eigenvalues of Gaussian random matrices;
- The roots of an infinite power series with i.i.d. Gaussian coefficients;
- Given a (finite) graph , choosing a spanning tree uniformly at random defines a random subset of the edges ;
- (Karlin-McGregor) Pick different starting points in , and run independent random walks starting from these points conditionally on the event that their paths never cross. Then the positions of the random walks at time is a point process.
Remarkably, all of these apparently quite unrelated point processes turn out to share a very rich common structure: they are examples of determinantal point processes. For simplicity, let us define this notion only in the discrete case where .
Definition. A random subset of is called a determinantal point process if
for every , where is a given matrix called the kernel of the point process, and is the restriction of to the rows and columns in .
Determinantal point processes have many interesting properties. For the time being, let us mention one example: the property of negative dependence
for all , . We will (hopefully) encounter more interesting examples later on.
A basic tool for studying determinantal point processes is the generating function
Note that is a multivariate polynomial in (complex) variables. As an illustration of the utility of the generating function, let us use it to express the property of negative dependence. It is easy to check that
where denotes the vector of ones. Hence, negative dependence is equivalent to the polynomial identity
Clearly only very special kinds of polynomials can satisfy such a property. Why should the generating functions of determinantal point processes be of this kind? Some insight can be gained from the following result of Brändén.
Theorem. The following are equivalent.
- For all and , we have
- has no roots in .
If either of the equivalent conditions of this theorem hold, we say that is real stable. Note that , and thus we recover the negative dependence relationship if the generating function is real stable.
While it was initially far from clear why we would care about the roots of polynomials in this setting, we now see that the locations of the roots of the generating function are apparently intimately related with the negative dependence properties of point processes. The geometry of the roots of polynomials enters in an essential manner in results of this kind. It is not difficult to see that determinantal point processes must have real stable generating functions under some mild assumptions (very roughly speaking, one can compute in terms of the kernel of the determinantal point process; thus if , the generating function is the characteristic polynomial of a positive definite matrix, and thus has no roots in the upper quadrant of ). Such ideas therefore provide a powerful tool to investigate the properties of determinantal point processes.
The simple negative dependence property described above could have been obtained in a more elementary fashion. However, the polynomial method makes it possible to prove more delicate properties, such as stronger negative association properties or concentration of measure properties of determinantal point processes. If time permits, we will try to cover some of these topics at the end of the semester. While these questions are quite distinct from the Kadison-Singer problem, it is remarkable that many of the same tools play a central role in both cases.
Many thanks to Dan Lacker for scribing this lecture!
Welcome back to the stochastic analysis seminar! We resume this semester with an unusual but exciting topic: the geometry of roots of polynomials and their probabilistic applications.
Understanding the roots of polynomials seems far from a probabilistic issue, yet has recently appeared as an important technique in various unexpected problems in probability as well as in theoretical computer science. As an illustration of the power of such methods, these informal lectures will work through two settings where significant recent progress was enabled using this idea. The first is the proof of the Kadison-Singer conjecture by using roots of polynomials to study the norm of certain random matrices. The second is the proof that determinantal processes, which arise widely in probability theory, exhibit concentration of measure properties. No prior knowledge of these topics will be assumed.
Time and location: Thursdays, 4:30-6:00, Sherrerd Hall 101 (note different location than last year).
The first lecture will be on September 18.
Notes from the lectures, schedule changes and any other announcements will be posted on this blog.
- A. W. Marcus, D. A. Spielman, N. Srivastava, “Interlacing families I/II”
- Notes by T. Tao
- Notes by N. K. Vishnoi
- J. Borcea, P. Brändén, T. M. Liggett, “Negative dependence and the geometry of polynomials”
- R. Pemantle and Y. Peres, “Concentration of Lipschitz functionals of determinantal and other strong Rayleigh measures”
That was it for the Fall 2013 stochastic analysis seminar! Many thanks in particular to our fantastic guest lecturers Mokshay Madiman and Emmanuel Abbe.
There will be no stochastic analysis seminar in the Spring 2014 semester (students interested in related material may want to consider taking the special topics course Probability in High Dimension). We will resume in the Fall 2014 semester with a topic to be announced at that time.
Recall the main proposition proved in the previous lecture, which is due to Bobkov and Götze (1999).
Proposition. The following are equivalent for :
- for every and with finite mean.
- for every probability measure .
This result provides a characterization of concentration of Lipschitz functions on a fixed metric space. Our original aim, however, was to understand dimension-free concentration (in analogy with the Gaussian case): that is, for which measures is it true that if are i.i.d. , then every function is subgaussian with the same parameter for every ? In principle, the above result answers this question: satisfies dimension-free concentration if and only if
However, this is not a very satisfactory characterization: to check whether satisfies dimension-free concentration, we must check that the transportation-information inequality holds for for every . Instead, we would like to characterize dimension-free transportation in terms of a property of itself (and not its tensor products). This will be achieved in the present lecture.
How are we going to obtain a transportation-information inequality for every starting from only a property of ? If one is an optimist, one might hope for a miracle: perhaps the validity of the transportation-information inequality for itself already implies the analogous inequality for all its products . In such situations, one says that the inequality tensorizes. As we will shortly see, the inequality does not tensorize in precisely the manner that one would hope, but this naive idea will nonetheless lead us in the right direction.
We will develop the tensorization result in a slightly more general setting that will be useful in the sequel. Recall that denotes the set of couplings between probability measures and , and that the Wasserstein distance can be written as
We can now state a general tensorization result for transportation-information type inequalities. The proof is due to Marton (1996), who initiated the use of this method to study concentration.
Proposition. (Tensorization) Let be a convex function, and let be a positive weight function on the metric space . Fix a probability measure on . If
holds for all , then
holds for every and every .
Proof. The conclusion holds for by assumption. The proof now proceeds by induction on : we will suppose in the sequel that the conclusion holds for , and deduce the conclusion for from that.
Fix . We denote by the marginal of on the first coordinates, and define the conditional distribution (whose existence is guaranteed by the Bayes formula as ). The key idea of the proof is to exploit the chain rule for relative entropy:
The first term on the right can be bounded below by the induction hypothesis, while the second term can be bounded below by the assumption of the Proposition. In particular, fixing , it follows that we can choose and such that
If we now define the probability measure as
then we can estimate by Jensen’s inequality
But evidently , and thus
The proof is completed by letting .
Remark. In the above proof, we have swept a technicality under the rug: we assumed that an -optimal coupling can be chosen to be measurable as a function of . This can generally be justified by standard methods (e.g., on Polish spaces by a measurable selection argument, or in special cases such as by an explicit construction of the optimal coupling).
Now assume that satisfies the transportation-information inequality
which characterizes concentration in a fixed metric space. This corresponds to the setting of the above tensorization result with and . Tensorization then yields
Unfortunately, the left-hand side of this inequality is not itself a Wasserstein distance, and so we do not automatically obtain a transportation-information inequality in higher dimension. In the previous lecture, we showed that one can bound the left-hand side from below by a Wasserstein metric with respect to an -type distance using the Cauchy-Schwarz inequality. However, we then lose a factor by Cauchy-Schwarz, and thus the dimension-free nature of the concentration is lost.
The above computation, however, suggests how we can “fix” our assumption to obtain dimension-free concentration. Note that the left-hand side of the tensorization inequality above is the -norm of the vector of expectations . If we could take the norm inside the expectation, rather than outside, then the left-hand side would be a Wasserstein distance between probability measures on with respect to the -distance ! In order to engineer such a stronger inequality, however, we must begin with a stronger assumption.
To this end, define the quadratic Wasserstein distance
Suppose that satisfies the quadratic transportation cost inequality (QTCI)
Then applying tensorization with and immediately yields
On the other hand, as obviously by Jensen’s inequality, we immediately find that QTCI implies dimension-free concentration: that is, we have proved
Corollary. Suppose that the probability measure satisfies
Then we have dimension-free concentration, that is, for i.i.d.
for all and with finite mean.
The observation that the quadratic transportation cost inequality yields dimension-free concentration with respect to the -metric is due to Talagrand (1996), who used it to prove Gaussian concentration.
Characterizing dimension-free concentration
We began this topic by showing that the transportation cost inequality
is necessary and sufficient for a probability on a fixed metric space to exhibit concentration of Lipschitz functions. However, this does not suffice to obtain dimension-free concentration, that is, concentration of the product measure for every . To obtain the latter, we “fixed” our original assumption by imposing the stronger quadratic transportation cost inequality
As was shown above, this inequality is sufficient to obtain dimension-free concentration. But it is far from clear whether it should also be necessary: by strengthening the inequality, we have lost the connection with the proof on a fixed space (which relies on the variational property of relative entropy). It is therefore a remarkable fact that the quadratic transportation cost inequality proves to be both necessary and sufficient for dimension-free concentration, as was observed by Gozlan (2009).
Theorem. Let be a probability measure on a Polish space , and let be i.i.d. . Then the following are equivalent:
- Dimension-free concentration:
for all , and with finite mean.
- Quadratic transporation cost inequality:
for every probability measure .
This result effectively resolves the question we set out to answer.
Proof. We have already shown that 2 implies 1. In the sequel, we will prove that 1 implies 2: that is, the validity of the quadratic transportation cost inequality is necessary for dimension-free concentration. The proof of this fact is a surprising application of Sanov’s theorem (see Lecture 3).
We will need the following three facts that will be proved below.
- Law of large numbers: as .
- Lower semicontinuity: is open in the weak convergence topology.
- Lipschitz property: the map is -Lipschitz.
The first two claims are essentially technical exercises: the empirical measures converge weakly to by the law of large numbers, so the only difficulty is to verify that the convergence holds in the slightly stronger sense of the quadratic Wasserstein distance; and lower-semicontinuity of the quadratic Wasserstein distance is an elementary technical fact. The third claim is a matter of direct computation, which we will do below. Let us take these claims for granted and complete the proof.
As is open, we can apply Sanov’s theorem as follows:
But as the function is -Lipschitz, dimension-free concentration implies
Thus we have
where we have used . In particular, we have proved
The quadratic transportation cost inequality follows readily (let and ).
It remains to establish the claims used in the proof. Let us begin the with Lipschitz property of .
Lemma. (Claim 3) The map is -Lipschitz.
Proof. Let . If we define , then we evidently have
Conversely, every family of measures such that defines a coupling in this manner. We can therefore estimate as follows:
where in the last two lines we used, respectively, the reverse triangle inequality for norms (that is, ) and the reverse triangle inequality for the metric .
Next, we prove lower-semicontinuity of . This is an exercise in using weak convergence.
Lemma. (Claim 2) is lower-semicontinuous in the weak convergence topology.
Proof. Let be probability measures such that weakly as . We must show
Fix , and choose for each a coupling such that
We claim that the sequence is tight. Indeed, the sequence is tight (as it converges) and clearly itself is tight. For any , choose a compact set such that for all and . Then evidently , and thus tightness follows.
By tightness, we can choose a subsequence such that and weakly for some probability measure . As is continuous and nonnegative, we obtain
But as , we have shown . We conclude using .
Finally, we prove the law of large numbers in . This is an exercise in truncation.
Lemma. (Claim 1) Suppose that satisfies the Lipschitz concentration property. Then the law of large numbers holds in the sense as for i.i.d. .
Proof. Let be some arbitrary point. We truncate the Wasserstein distance as follows:
where we used for . We claim that if weakly, then
Indeed, by the Skorokhod representation theorem, we can construct random variables and on a common probability space such that for every , , and a.s. Thus by bounded convergence, and as the claim follows. Thus implies if we can control the second term in the above truncation.
Denote the empirical measure as . Recall that weakly a.s. by the law of large numbers. Therefore, following the above reasoning, we obtain
for every . Thus it remains to show that . But as is evidently Lipschitz (with constant ), this follows directly from the following Lemma.
Finally, we have used in the last proof the following lemma, which shows that if satisfies the Lipschitz concentration property then any Lipschitz function has all finite moments. In particular, every Lipschitz function has finite mean, which means that the qualifier “with finite mean” used above in our definition of (dimension-free) concentration is superfluous and can therefore be dropped.
Lemma. Suppose that the probability measure satisfies the Lipschitz concentration property. Then any Lipschitz function satisfies for every .
Proof. Let be -Lipschitz. It suffices to prove that has finite mean. If this is the case, then the Lipschitz concentration property implies for every that
where we note that is Lipschitz with the same constant as . To prove that has finite mean, let us apply the Lipschitz concentration property to (which certainly has finite mean). This gives
Now choose such that . Then clearly . But note that the median for . Thus we obtain as .
We started our discussion of dimension-free concentration with the classical Gaussian concentration property of Tsirelson, Ibragimov, and Sudakov. It therefore seems fitting to conclude by giving a proof of this result using the machinery that we have developed: we only need to show that the standard normal on satisfies the quadratic transportation cost inequality. [It should be noted that there are numerous other proofs of Gaussian concentration, each with their own interesting ideas.]
Proposition. Let on . Then for all .
This result is due to Talagrand (1996). Talagrand’s proof exploits the fact that optimal transportation problems on admit an explicit solution in terms of quantile functions. This allows to establish inequalities on using calculus manipulations. In contrast, optimal transportation problems on for are far from trivial (see excellent introductory and comprehensive texts by Villani). We therefore see that tensorization is really key to a tractable proof by Talagrand’s method.
Instead of going down this road, we will present a lovely short proof of the transportation-information inequality due to Djellout, Guillin, and Wu (2004) that uses stochastic calculus.
Proof. Denote by be the law of standard Brownian motion . Fix a probability measure such that , and define the probability measure as
Then clearly under and under .
Note that is a uniformly integrable martingale and for (here denotes the density of ). Thus we find that
for some nonanticipating process by the martingale representation theorem and Itô's formula. But then Girsanov's theorem implies that the stochastic process defined by
is Brownian motion under . Thus the law of under is a coupling of and , and
by Jensen's inequality. The proof is therefore complete once we prove that
To see this, note that
If then the integral is a square-integrable martingale; thus its expectation vanishes and the proof is complete. However, it is not difficult to show using a simple localization argument that implies , see Lemma 2.6 of Föllmer for a careful proof.
Remark. To be fair, it should be noted that the above stochastic calculus proof works just as easily in for any . Thus we could directly establish the transportation-information inequality (and therefore concentration) in any dimension in this manner without going through the tensorization argument.
Lecture by Ramon van Handel | Scribed by Patrick Rebeschini
The goal of the next two lectures is to explore the connections between concentration of measure, entropy inequalities, and optimal transportation.
What is concentration?
Roughly speaking, concentration of measure is the idea that nonlinear functions of many random variables often have “nice tails”. Concentration represents one of the most important ideas in modern probability and has played an important role in many other fields, such as statistics, computer science, combinatorics, and geometric analysis.
To illustrate concentration, let . Using Markov’s inequality, we can estimate for
where we have made the optimal choice (this is called a Chernoff bound). Thus a Gaussian random variable has “nice tails”. The concentration of measure phenomenon shows that not only do Gaussian random variables have nice tails, but that many nonlinear functions of Gaussian random variables still have nice tails. The classical result on this topic is the following.
Theorem. (Gaussian concentration, Tsirelson-Ibragimov-Sudakov 1976)
Let be i.i.d. , and let be -Lipschitz, that is, for each (where is the Euclidean norm). Then
This result shows that Lipschitz functions (which could be very nonlinear) of i.i.d. Gaussian random variables concentrate closely around their expected value: the probability that the function differs from its expected value decays as a Gaussian in the degree of the discrepancy. The beauty of this result is that it is dimension-free, that is, the rate of decay of the tail only depends on the Lipschitz constant , and there is no dependence on the number of random variables . Such results are essential in high-dimensional problems where one would like to obtain dimension-free bounds.
Gaussian concentration is only one result in a theory with many fascinating ideas and questions. One might ask, for instance, what random variables beside Gaussians exhibit this type of phenomenon, whether other notions of Lipschitz functions can be considered, etc. See, for example, the excellent books by Ledoux and Boucheron, Lugosi, and Massart for much more around this theme.
The basic question that we want to address in the next two lectures is the following.
Question. Let be i.i.d. random variables on a metric space with distribution . Can we characterize all for which dimension-free concentration holds as for the Gaussian case above?
It turns out that a remarkably complete answer can be given in terms of entropy inequalities. This is where information-theoretic methods enter the picture.
When bounding tails of random variables (provided they are at least exponential), it is convenient to bound moment generating functions as we did above instead of working directly with the tail.
Definition. is subgaussian with parameter if for each .
If is subgaussian, then (by the Chernoff bound as above). In fact, it can be shown that the converse is also true for slightly larger , so that the property of having Gaussian tails is equivalent to the subgaussian property (we omit the proof). It will be convenient to investigate the property of Gaussian tails in terms of subgaussianity.
Concentration and relative entropy
Before we can tackle the problem of dimension-free concentration, we must begin by making the connection between subgaussianity and entropy in the most basic setting.
Let be a metric space. A function is -Lipschitz (-Lip) if for each . One thing that we can do with Lipschitz functions is to define a distance between probability measures (we will assume in the sequel that the necessary measurability conditions are satisfied): for probability measures on , define the Wasserstein distance as
The idea is that two measures are close if the expectations of a large class of function is close. In the case of , the class of function being used is the class -Lip.
As we are interested in concentration of Lipschitz functions, it is intuitive that a quantity such as should enter the picture. On the other hand, we have seen in earlier lectures that the relative entropy can also be viewed as a “sort of distance” between probability measures (albeit not a metric). It is not clear, a priori, how and are related. We will presently see that relative entropy is closely related to moment generating functions, and therefore to tails of random variables: in particular, we can characterize concentration on a fixed space by comparing the Wasserstein metric and relative entropy.
Proposition. The following are equivalent for :
- for every and .
- for every probability measure .
Note that this result characterizes those measures on a fixed metric space that exhibit Gaussian concentration. There is no independence as of yet, and thus no notion of “dimension-free” concentration for functions of independent random variables: the present result is in “fixed dimension”.
Example. Let be the trivial metric. A function is -Lip with respect to if
that is, if and only if . Hence we have
Thus 2 in the above proposition holds with by Pinsker’s inequality
We consequently find by 1 above that
for every function such that . Thus the above Proposition reduces in this case to the well known Hoeffding lemma, which states that bounded random variables are subgaussian.
Let us turn to the proof of the Proposition. The first observation is a classic result that connects relative entropy with moment generating functions. It dates back to the very beginnings of statistical mechanics (see the classic treatise by J. W. Gibbs (1902), Theorem III, p. 131).
Lemma. (Gibbs variational principle) Let be any random variable. Then
Proof. Assume that is bounded above by some constant (otherwise replace by and then let at the end). Define a probability measure by
As the relative entropy is always positive, we have
for every , and equality is obtained by choosing the optimizer .
Using the variational principle, it is easy to prove the Proposition.
Proof of the Proposition. By the variational principle, we have
if and only if
for all . Optimizing over , we find that 1 is equivalent to the validity of
for all and .
Tensorization and optimal transport
So far we have considered concentration in a fixed metric space . If are independent random variables, we can certainly apply the Proposition to with the product distribution . However, to establish dimension-free concentration, we would have to check that the conditions of the Proposition hold for for every with the same constant ! This is hardly a satisfactory answer: we would like to characterize dimension-free concentration directly in terms of a property of only. To this end, a natural conjecture might be that if the conditions of the Proposition hold for the measure , then that will already imply the same property for the measures for every . This turns out not to be quite true, but this idea will lead us in the right direction.
Motivated by the above, we set out to answer the following
Question. Suppose that satisfies for every . Does this imply that a similar property is satisfied by the product measures ?
Such a conclusion is often referred to as a tensorization property. To make progress in this direction, we must understand the classic connection between Wasserstein distances and optimal transportation.
Theorem. (Kantorovich-Rubinstein duality, 1958) Let and be probability measures on a Polish space. Let be the set of couplings of and . Then
The right side of this equation is a so-called “optimal transport problem”. For this reason, inequalities such as are often called transportation-information inequalities.
The full proof of Kantorovich-Rubinstein duality is part of the theory of optimal transportation and is beyond our scope (optimal transportation is itself a fascinating topic with many connections to other areas of mathematics such as probability theory, PDEs, and geometric analysis—perhaps a topic for another semester?) Fortunately, we will only need the easy half of the theorem in the sequel.
Proof of lower bound. For each and , we have
from which we immediately get
This proves the easy direction in the above theorem.
It turns out that the optimal transportation approach is the “right” way to tensorize transportation-information inequalities. Even though the following result is not quite yet what we need to prove dimension-free concentration, it already suffices to derive some interesting results.
Proposition. (Tensorization) Suppose that
for all . Then, for any ,
for all .
We postpone the proof of this result until the next lecture.
Example. Let . By Pinsker’s inequality
Define the weighted Hamming distance for positive weights as
Then, by Cauchy-Schwarz and tensorization we get
for each . So, we have
with , for each and each function which is 1-Lip with respect to . This implies
That is, we recover the well known bounded difference inequality.
We have not yet shown that the transportation-information inequality holds for on . Even once we establish this, however, the tensorization result we have given above is not sufficient to prove dimension-free Gaussian concentration in the sense of Tsirelson-Ibragimov-Sudakov. Indeed, if we apply the above tensorization result, then at best we can get
Setting the weights , we find a tail bound of the form whenever is with respect to the -norm . Note that this is not dimension-free: the factor appears inside the exponent! On the other hand, Gaussian concentration shows that we have a dimension-free tail bound whenever is with respect to the -norm . Note that the latter property is strictly stronger than the former because ! Our tensorization method is not sufficiently strong, however, to yield this type of dimension-free result.
Fortunately, we now have enough ingredients to derive a slightly stronger transportation-information inequality that is not only sufficient, but also necessary for dimension-free concentration. Stay tuned!
Lecture by Ramon van Handel | Scribed by Patrick Rebeschini
This lecture introduces the notion of the entropic cone and its connection with entropy inequalities.
Recall that if is a discrete random variable with distribution , the entropy of is defined as
Now let be (not necessarily independent) discrete random variables. It will be convenient to define the entropy function
As the entropy depends only on the probabilities of the outcomes of a random variable and not on its values, we will assume without loss of generality in the sequel that take values in .
For any probability on , let be the -dimensional vector
We can now define the entropic cone.
Definition. The entropic cone is the set
i.e., the set of all vectors that can be obtained as the entropy function of variables.
Question. Can we characterize ?
- If , then , as there exist random variables of arbitrary (nonnegative) entropy.
- If , the vector for is
What constraints must this vector satisfy? We must certainly have
which follows from the chain rule together with positivity for the first inequality, and that conditioning reduces entropy for the second. Are these the only constraints? In fact, we can create many vectors given these constraints. For example:
- The vector is obtained by taking such that , and letting .
- The vector is obtained by taking to be i.i.d. copies of .
- Convex combinations of these vectors can be obtained by taking mixtures of their distributions.
In fact, a careful analysis shows that the above constraints completely characterize the entropic cone for : that is,
- If , what happens? Here we are are still subject to the same constraints as in the case , but we pick up some new constraints as well such as
which is equivalent to the inequality .
Evidently, the constraints on the entropic cone correspond to entropy inequalities. What type of inequalities must hold for any given ? Let us think about this question a bit more systematically.
Definition. A set function is called a polymatroid if
- , for all .
- when (monotonicity).
It is not difficult to check that
Lemma. The entropy function is a polymatroid.
Other examples of polymatroids: (1) entropy, mutual information; (2) max-weight for given weights ; (3) flows; (4) cuts (not monotone, but submodular); (5) Von Neumann entropy.
Let us define
Evidently is a polyhedral cone (the intersection of a finite number of halfplanes).
- for .
- is a convex cone for all (but not polyhedral for ).
For the proof, we refer to the book for R. Yeung.
As an example of , we note that the vector is only achievable when for some . Let us also note that convexity of holds in its interior but not on the boundary. However, issues only arise on the boundary of , that is, no new entropy inequalities appear beyond the polymatroid inequalities when .
On the other hand, when , many new inequalities are introduced that actually cause holes to appear within . One such inequality (expressed in terms of mutual information) is
Using the entropic cone
The entropic cone can be used to obtain new information theoretic inequalities. For example, the following question arose in the study of mechanism design in computer science.
Problem. Let be discrete random variables. Define the matrix as . Is positive definite?
When , we have since . Thus all the principal minors of have positive determinant, so is positive definite.
How about ? Note that depends linearly on the entries of the vector (where is the joint distribution of ). Thus if is positive definite for distributions on the extreme rays (ER) of the entropic cone, then must be positive definite for any distribution. More generally:
Proposition. If is convex, then holds for all if and only if for all .
Proof. It suffices to note that
for every , and to use that .
This necessary and sufficient condition for generalizes to a sufficient condition for .
Proposition. If is convex, then holds for all if for all .
As is polyhedral, this simplifies solving problems such as checking positive definiteness of significantly: it suffices to check a finite number of cases, which can essentially be done by hand. It can be shown this way that is always positive definite for , but this can fail for .
It is sometimes of interest to investigate discrete variants of the above questions.
Definition. A matroid is defined by the conditions
- for all ;
- is monotone;
- is submodular.
- Vector matroids (also called -representable matroids, where is a field). Given a matrix with values in , define where denotes the columns induced by .
- Graphical matroids. Let be a graph, and choose . Then define , where the maximum is taking over acyclic subgraphs .
- Entropic matroids. Let . For what distributions of is this a matroid?
Denote by the entropic cone induced by those distributions where take values in . In order for such a distribution to define an entropic matroid, the vector must take values in . Thus we are led to consider the set
Can one characterize what type of matroids can arise as entropic matroids?
Theorem. coincides with the set of all -representable matroids with elements.
For the proof and further results around this theme, see the paper of E. Abbe.
Lecture by Emmanuel Abbe | Scribed by Danny Gitelman
This lecture completes the proof of the entropic central limit theorem.
From Fisher information to entropy (continued)
In the previous lecture, we proved the following:
By optimizing the choice of the weights, it is easy to check that (1) is equivalent to
Proving (2) for general fractional packings and general hypergraphs requires some additional steps that we will avoid here. Instead, we will prove the a special case, due to Artstein, Ball, Barthe and Naor (2004), that suffices to resolve the question of monotonicity of entropy in the CLT.
In the following, let be the set of all subsets of of size and take for every . In this case, (2) takes the following form.
Theorem EPI. If are independent random variables, then
provided that all the entropies exist.
To prove Theorem EPI, we will use the de Bruijn identity which was discussed in the previous lecture. Let us rewrite it in a useful form for the coming proof and give a proof.
Theorem (de Bruijn identity). Let be a random variable with density on . Let , , where is independent of . We have
Proof. Let be the density of . Then
where . It is easy to check that . Hence,
Integrating from to gives (3).
Proof of Theorem EPI. In (1), let be the set of all subsets of of size and . We have
where, is the weight corresponding to the set . It is easy to check that (4) is equivalent, for every with , to
for any choice of with and with .
Let , where is independent of all else. Then
where the equality in distribution is due to the fact that . Hence, by (6),
We use de Bruijn identity to integrate this from to over time and get
By (6) again,
This is the entropy analog of the Fisher information inequality obtained above.
As a final step, let . We are to show
If for some , the result is immediate due to the general fact (convolution increases entropy). Hence, we assume that for every , so
Set , . Note that
With this choice of , we apply (7). After some computations, we get
Using the definition of the , the final result follows from here immediately.
Proof of the Entropic Central Limit Theorem
Entropic CLT. Suppose that are i.i.d. random variables with and . Let . If for some , then as . Equivalently, if for some , then as .
In the case of i.i.d. random variables, Theorem EPI gives
which is equivalent to because of the identity . Therefore, it remains to show as . For this, we need some analytical properties of the relative entropy. Henceforth, we use to denote the set of all probability measures on some Polish space (take for instance).
Proposition (Variational characterization of ). Let . Then
where denotes the set of all bounded measurable functions .
- In the variational characterization of above, it is enough to take the supremum over the set of all bounded continuous functions .
- For fixed , the mapping is convex and continuous. Since is the supremum over a class of convex, continuous functions, it is convex and lower semicontinuous. These properties of relative entropy, made transparent by the variational characterization, are very useful in many different contexts.
Corollary. Sublevel sets of are compact, that is, the set is compact (with respect to the topology of weak convergence) for every and .
Before we prove the corollary, let us recall the definition of tightness and Prohorov Theorem.
Definition (Tightness). A set is called tight if for every there exists a compact set such that for every .
Prohorov Theorem. A set is weakly precompact if and only if it is tight.
Note that is a tight set as a singleton. We claim that the sequence is also tight. Indeed, let and let be a compact set with . We take and apply (8) to get
where the rightmost term can be made arbitrarily small. Hence, is tight. By Prohorov Theorem, there exists such that as . By lower semicontinuity of , we have
This finishes the proof of the corollary.
Proof of Variational Characterization of . As a first case, suppose . So and there exists a Borel set with and . Let . We have
as . Hence, both sides of the variational characterization are equal to .
For the rest, we assume . First, we show the part. If , the inequality is obvious. Suppose . Given , define a probability measure by
Since is chosen arbitrarily, taking supremum on the right hand side gives the part.
Next, we prove the part. Note that if , then . However, this choice of may not be in , that is, may fail to be bounded or bounded away from zero. So we employ the following truncation argument. Let
so that as . Note that and . Thus we have
by monotone convergence. On the other hand, by Fatou’s Lemma, we have
from which the part of the result follows.
Building on these now standard facts (whose exposition above follows that in the book of Dupuis and Ellis), Harremoes and Vignat (2005) gave a short proof of the desired convergence, which we will follow below. It relies on the fact that for uniformly bounded densities within the appropriate moment class, pointwise convergence implies convergence of entropies.
Lemma. If are random variables with , , and the corresponding densities are uniformly bounded with as (pointwise) for some density , then and as .
Proof. Recall for with mean and variance . By lower semicontinuity of , we have
On the other hand, letting , we have
and using Fatou’s Lemma,
Hence, as .
End of proof of Entropic CLT. Assume . We will use to denote normalized Fisher information. For any , we have that for . So as for every . We want to show that , since then we will get by Lebesgue’s dominated convergence theorem that
as . But since
it is enough to show that for each .
By the monotonicity property we have proved, we know that
for any . By compactness of sublevel sets of , the sequence must therefore have a subsequence whose distribution converges to a probability measure (let us call a random variable with this limiting measure as its distribution). For , the smoothing caused by Gaussian convolution implies that the density of converges pointwise to that of , and also that the density of converges pointwise to that of , where is an independent copy of . By the previous lemma
as , and
so that necessarily
By the equality condition in the entropy power inequality, this can only happen if is Gaussian, which in turn implies that .
Lecture by Mokshay Madiman | Scribed by Cagin Ararat