Next lecture: December 4

There will be no lecture on November 20 in view of the Northeast Probability Seminar. There will also be no lecture on November 27 due to Thanksgiving.

The seminar will resume on December 4 (the week after Thanksgiving) at the usual time and place. We will try to post notes from the last two lectures before the seminar resumes.

16. November 2014 by Ramon van Handel
Categories: Announcement

Lecture 5. Proof of Kadison-Singer (3)

This is the last installment of the proof of the Kadison-Singer theorem. After several plot twists, we have finally arrived at the following formulation of the problem. If you do not recall how we got here, this might be a good time to go back and read the previous posts.

Theorem 1. Let B_1,\ldots,B_n\succeq 0 be positive semidefinite matrices in \mathbb{C}^{d\times d} such that

    \[\sum_{i=1}^nB_i=I,\qquad \mathrm{Tr}[B_i]\le\varepsilon\text{ for all }i.\]

Define the multivariate polynomial

    \[p(z_1,\ldots,z_n):=\det\Bigg(\sum_{i=1}^nz_iB_i\Bigg).\]

Then

    \[\prod_{i=1}^n\bigg(1-\frac{\partial}{\partial z_i}\bigg)p(z_1,\ldots,z_n)\ne 0\quad\text{whenever }z_1,\ldots,z_n>(1+\sqrt{\varepsilon})^2.\]

The entire goal of this lecture is to prove the above theorem.
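
Before turning to the proof, here is a small numerical sanity check of the statement (this is not part of the argument; it assumes sympy, and the matrices below are hand-picked purely for illustration):

```python
import sympy as sp

# hand-picked positive semidefinite matrices with B1 + B2 + B3 = I and Tr[B_i] <= 3/4
B = [sp.Matrix([[sp.Rational(1, 2), 0], [0, sp.Rational(1, 4)]]),
     sp.Matrix([[sp.Rational(1, 4), sp.Rational(1, 8)], [sp.Rational(1, 8), sp.Rational(1, 4)]]),
     sp.Matrix([[sp.Rational(1, 4), sp.Rational(-1, 8)], [sp.Rational(-1, 8), sp.Rational(1, 2)]])]
assert sum(B, sp.zeros(2, 2)) == sp.eye(2)
eps = max(Bi.trace() for Bi in B)                # eps = 3/4 in this example

z = sp.symbols('z1 z2 z3')
M = sp.zeros(2, 2)
for zi, Bi in zip(z, B):
    M = M + zi * Bi
p = M.det()                                      # p(z1,z2,z3) = det(z1*B1 + z2*B2 + z3*B3)
for zi in z:                                     # apply prod_i (1 - d/dz_i)
    p = p - sp.diff(p, zi)

t = sp.symbols('t')
q = sp.expand(p.subs({zi: t for zi in z}))       # restrict to the diagonal z1 = z2 = z3 = t
maxroot = max(sp.re(sp.N(r)) for r in sp.solve(q, t))
print(maxroot, sp.N((1 + sp.sqrt(eps)) ** 2))    # the first number should not exceed the second
```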

Let us recall that we already proved such a result without the derivatives 1-\frac{\partial}{\partial z_i} (this is almost trivial).

Lemma 1. p(z_1,\ldots,z_n)\ne 0 whenever z_1,\ldots,z_n>0.

The key challenge is to understand what happens to the region without roots under operations of the form 1-\frac{\partial}{\partial z_i}. To this end, we introduced in the previous lecture the barrier function

    \[\frac{\partial}{\partial z_i}\log p(z_1,\ldots,z_n).\]

It is easily seen that the roots of (1-\frac{\partial}{\partial z_i})p(z) coincide with the points where the barrier function equals one. As the barrier function is easily bounded in our setting using the Jacobi formula for determinants, we can immediately bound the roots of (1-\frac{\partial}{\partial z_i})p(z).

Unfortunately, this simple argument is only sufficient to apply the operation 1-\frac{\partial}{\partial z_i} once, while we must apply n such operations. The difficulty is that while it is easy to control the barrier function of the polynomial p(z), we do not know how to control directly the barrier function of the polynomial (1-\frac{\partial}{\partial z_i})p(z). To solve this problem, we must develop a more sophisticated version of the barrier argument that provides control not only of the roots of the derivative polynomial, but also of its barrier function: once we accomplish this, we will be able to iterate this argument to complete the proof.

The multivariate barrier argument

In the multivariate barrier argument, we will keep track of an octant in which the multivariate polynomial of interest has no roots. We will use the following terminology.

Definition. A point x=(x_1,\ldots,x_n)\in\mathbb{R}^n is said to bound the roots of the multivariate polynomial q(z_1,\ldots,z_n) if q(z_1,\ldots,z_n)\ne 0 whenever z_1>x_1, z_2>x_2, \ldots, z_n>x_n.

This notion is illustrated in the following figure:

[Figure: the octant above the point x contains no roots of q.]

We can now formulate the multivariate barrier argument that is at the heart of the proof.

Theorem 2. Let q(z_1,\ldots,z_n) be a real stable polynomial. Suppose that x bounds the roots of q, and

    \[\frac{\partial}{\partial x_j}\log q(x_1,\ldots,x_n)\le 1-\frac{1}{\delta}\]

for some \delta>0 and j\in[n]. Then x bounds the roots of (1-\frac{\partial}{\partial z_j})q(z_1,\ldots,z_n), and

    \[\frac{\partial}{\partial x_i}\log\bigg(1-\frac{\partial}{\partial x_j}\bigg)q(x_1,\ldots,x_{j-1},x_j+\delta,x_{j+1},\ldots,x_n) \le \frac{\partial}{\partial x_i}\log q(x_1,\ldots,x_n)\]

for all i\in[n].

Note that the barrier function assumption

    \[\frac{\partial}{\partial x_j}\log q(x_1,\ldots,x_n)<1\]

would already be enough to ensure that x bounds the roots of (1-\frac{\partial}{\partial z_j})q(z_1,\ldots,z_n); this is essentially what we proved in the last lecture (combined with a basic monotonicity property of the barrier function that we will prove below). The key innovation in Theorem 2 is that we not only bound the roots of (1-\frac{\partial}{\partial z_j})q(z_1,\ldots,z_n), but also control its barrier function \frac{\partial}{\partial z_i}\log(1-\frac{\partial}{\partial z_j})q(z_1,\ldots,z_n). This allows us to iterate this theorem over and over again to add more derivatives to the polynomial. To engineer this property, we must build some extra room (a gap of 1/\delta) into our bound on the barrier function of q. Once we understand the proof, we will see that this idea arises very naturally.

Up to the proof of Theorem 2, we now have everything we need to prove Theorem 1 (and therefore the Kadison-Singer theorem). Let us complete this proof first, so that we can concentrate for the remainder of this lecture on proving Theorem 2.

Proof of Theorem 1. In the previous lecture, we used the Jacobi formula to show that

    \[\frac{\partial}{\partial z_i}\log p(z_1,\ldots,z_n)\bigg|_{z_1,\ldots,z_n=t} \le \frac{\varepsilon}{t}\]

for all i. To start the barrier argument, let us therefore choose x_1,\ldots,x_n=t>0 and \delta>0 such that

    \[\frac{\varepsilon}{t} \le 1-\frac{1}{\delta}.\]

Initially, by Lemma 1, we see that x bounds the roots of p and that the barrier function satisfies the assumption of Theorem 2. That is, we start in the following situation:

[Figure: the point x=(t,\ldots,t) bounds the roots of p, and the barrier function at x is at most 1-1/\delta.]

Applying Theorem 2, we find that x still bounds the roots of (1-\frac{\partial}{\partial z_1})p(z_1,\ldots,z_n). Moreover, we have control over the barrier function of this derivative polynomial, albeit at a point that lies above x:

    \[\frac{\partial}{\partial x_i}\log\bigg(1-\frac{\partial}{\partial x_1}\bigg)p(x_1+\delta,x_2,\ldots,x_n) \le \frac{\partial}{\partial x_i}\log p(x_1,\ldots,x_n) \le 1-\frac{1}{\delta}\]

for all i. This is illustrated in the following figure:

[Figure: x still bounds the roots of (1-\partial/\partial z_1)p, and its barrier function is controlled at the shifted point (x_1+\delta,x_2,\ldots,x_n).]

We now have everything we need to apply Theorem 2 again to the polynomial \big(1-\frac{\partial}{\partial z_1}\big) p(z) (recall that p and its derivative polynomials are all real stable, as we proved early on in these lectures). In the next step, we control the barrier in the z_2-direction to obtain, in a picture:

[Figure: after the second step, the barrier function of (1-\partial/\partial z_2)(1-\partial/\partial z_1)p is controlled at the point (x_1+\delta,x_2+\delta,x_3,\ldots,x_n).]

We can now repeat this process in the z_3-direction, etc. After n iterations, we have evidently proved that

    \[\prod_{i=1}^n\bigg(1-\frac{\partial}{\partial z_i}\bigg)p(z_1,\ldots,z_n)\ne 0\quad\text{whenever }z_1,\ldots,z_n>t+\delta.\]

All that remains is to choose t>0 and \delta>0. To optimize the bound, we minimize t+\delta subject to the constraint \frac{\varepsilon}{t} \le 1-\frac{1}{\delta}. This is achieved by t=\sqrt{\varepsilon}+\varepsilon and \delta=1+\sqrt{\varepsilon}, which completes the proof. \square
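
Remark. For completeness, the optimization in the last step of the proof can be carried out as follows. At the optimum the constraint is active, so \delta=t/(t-\varepsilon) with t>\varepsilon, and we minimize

    \[t+\delta = t+\frac{t}{t-\varepsilon},\qquad \frac{d}{dt}\bigg(t+\frac{t}{t-\varepsilon}\bigg) = 1-\frac{\varepsilon}{(t-\varepsilon)^2} = 0 \quad\Longrightarrow\quad t-\varepsilon=\sqrt{\varepsilon},\]

so that t=\varepsilon+\sqrt{\varepsilon}, \delta=t/\sqrt{\varepsilon}=1+\sqrt{\varepsilon}, and t+\delta=(1+\sqrt{\varepsilon})^2.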

Of course, it remains to prove Theorem 2. It turns out that this is quite easy, once we develop some (nontrivial!) properties of the barrier functions of real-stable polynomials.

Some properties of barrier functions

Throughout this section, let q(z_1,\ldots,z_n) be a real stable polynomial. As q is real stable, the univariate polynomial z_1\mapsto q(z_1,z_2,\ldots,z_n) is real-rooted, and we denote its roots by

    \[u_1(z_2,\ldots,z_n)\ge u_2(z_2,\ldots,z_n)\ge\cdots\ge u_d(z_2,\ldots,z_n).\]

We can then represent the polynomial q as

    \[q(z_1,\ldots,z_n) = c(z_2,\ldots,z_n)\prod_{i=1}^d(z_1-u_i(z_2,\ldots,z_n)),\]

and therefore the barrier function as

    \[\frac{\partial}{\partial z_1}\log q(z_1,\ldots,z_n) = \sum_{i=1}^d\frac{1}{z_1-u_i(z_2,\ldots,z_n)}.\]

Some simple properties of the barrier function follow easily from this expression.

Lemma 2. Suppose that z\in\mathbb{R}^n bounds the roots of q. Then the barrier function t\mapsto\frac{\partial}{\partial z_1}\log q(z_1+t,z_2,\ldots,z_n) is positive, decreasing and convex for t\ge 0.

Proof. As z bounds the roots of q, we have z_1>u_i(z_2,\ldots,z_n) for all i. Thus clearly the barrier function is positive. To show that it is decreasing, we note that

    \[\frac{\partial}{\partial z_1}\frac{\partial}{\partial z_1}\log q(z_1,z_2,\ldots,z_n) = -\sum_{i=1}^d\frac{1}{(z_1-u_i(z_2,\ldots,z_n))^2}<0.\]

Likewise, convexity follows as

    \[\frac{\partial^2}{\partial z_1^2}\frac{\partial}{\partial z_1}\log q(z_1,z_2,\ldots,z_n) = \sum_{i=1}^d\frac{2}{(z_1-u_i(z_2,\ldots,z_n))^3}>0\]

when z bounds the roots of q. \square

The main property that we need in order to prove Theorem 2 is that these monotonicity and convexity properties of the barrier function also hold when we vary other coordinates. This seems innocuous, but is actually much harder to prove (and requires a clever idea!)

Lemma 3. Suppose that z\in\mathbb{R}^n bounds the roots of q. Then the barrier function t\mapsto\frac{\partial}{\partial z_1}\log q(z_1,z_2+t,z_3,\ldots,z_n) is positive, decreasing and convex for t\ge 0.

Remark. There is of course nothing special about the use (for notational simplicity) of the first two coordinates z_1,z_2 in Lemmas 2 and 3. We can identically consider the barrier function in any direction z_i and obtain monotonicity and convexity in any other direction z_j, as will be needed in the proof of Theorem 2. In fact, as the remaining coordinates z_3,\ldots,z_n are frozen, Lemmas 2 and 3 are really just statements about the properties of bivariate real stable polynomials.

The difficulty in proving Lemma 3 is that while \frac{\partial}{\partial z_1}\log q(z) behaves very nicely as a function of z_1, it is much less clear how it behaves as a function of z_2: to understand this, we must understand how the roots u_i(z_2,\ldots,z_n) vary as a function of z_2. In general, this might seem like a hopeless task. Surprisingly, however, the roots of real stable polynomials exhibit some remarkable behavior.

Lemma 4. Let q(x,y) be a bivariate real-stable polynomial, and let x_1(y)\ge x_2(y)\ge\cdots\ge x_d(y) be the roots of x\mapsto q(x,y). Then each y\mapsto x_i(y) is a decreasing function.
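
Before giving the proof, here is a quick numerical illustration (a sketch assuming numpy; the random matrices are purely illustrative). Take q(x,y)=\det(xI+yB_1+B_2) with B_1,B_2 positive semidefinite; this is real stable by the results of Lectures 2 and 3, and its roots in x for fixed y are the eigenvalues of -(yB_1+B_2), which indeed decrease as y increases:

```python
import numpy as np

rng = np.random.default_rng(0)
A1 = rng.standard_normal((4, 4)); B1 = A1 @ A1.T   # random positive semidefinite matrices
A2 = rng.standard_normal((4, 4)); B2 = A2 @ A2.T

ys = np.linspace(0.0, 5.0, 200)
# for fixed y, the roots in x of q(x, y) = det(xI + y*B1 + B2) are the eigenvalues of -(y*B1 + B2)
roots = np.array([np.linalg.eigvalsh(-(y * B1 + B2)) for y in ys])
assert np.all(np.diff(roots, axis=0) <= 1e-10)     # every trajectory y -> x_i(y) is decreasing
print("all root trajectories are decreasing in y")
```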

This is enough to prove Lemma 3.

Proof of Lemma 3. That the barrier function is positive follows precisely as in the proof of Lemma 2 (using the fact that if z bounds the roots of q, then (z_1,z_2+t,z_3,\ldots,z_n) also bounds the roots of q).

It remains to establish monotonicity and convexity. There is no loss of generality in assuming that q is a bivariate polynomial in z_1,z_2 only, and that we can write

    \[q(z_1,z_2) = c(z_1)\prod_{i=1}^d (z_2-v_i(z_1))\]

for roots v_1(z_1)\ge v_2(z_1)\ge\cdots \ge v_d(z_1). We can therefore write

    \[\frac{\partial}{\partial z_2} \frac{\partial}{\partial z_1}\log q(z_1,z_2) = \frac{\partial}{\partial z_1}\sum_{i=1}^d \frac{1}{z_2-v_i(z_1)}, \qquad \frac{\partial^2}{\partial z_2^2} \frac{\partial}{\partial z_1}\log q(z_1,z_2) = -\frac{\partial}{\partial z_1}\sum_{i=1}^d \frac{1}{(z_2-v_i(z_1))^2}.\]

But as z bounds the roots of q, we have z_2>v_i(z_1) for all i. As z_1\mapsto v_i(z_1) is also decreasing, we have

    \[\frac{\partial}{\partial z_2} \frac{\partial}{\partial z_1}\log q(z_1,z_2) <0, \qquad \frac{\partial^2}{\partial z_2^2} \frac{\partial}{\partial z_1}\log q(z_1,z_2) > 0,\]

which is precisely what we set out to show. \square

We must still prove Lemma 4. This seems to be quite a miracle: why should such a property be true? To get some intuition, let us first consider an apparently very special case where q is a polynomial of degree one, that is, q(x,y)=ax+by+c. The root of the polynomial x\mapsto q(x,y) is clearly given by

    \[x_1(y) = -\frac{b}{a}y-\frac{c}{a}.\]

Suppose that y\mapsto x_1(y) is nondecreasing, that is, that b and a have opposite sign. Then q cannot be real stable! Indeed, for any real root (x,y), the point (x+i/a,y-i/b) is also a root, which violates real stability if a and b have opposite sign (as both coordinates have strictly positive or negative imaginary parts). Thus for polynomials of degree one, real stability trivially implies that y\mapsto x_1(y) is decreasing.

While this is very intuitive, it also seems at first sight like it does not help much in understanding nontrivial polynomials of higher degree. Nonetheless, this simple observation proves to be the key to understanding the case of general polynomials. The idea is that the property that y\mapsto x_i(y) is decreasing is local. If there is a point at which this property is violated, then we can Taylor expand the polynomial around that point to reduce to the degree one case, and thereby obtain a contradiction.

Proof of Lemma 4. By the implicit function theorem, the maps y\mapsto x_i(y) are continuous and C^1 everywhere except at a finite number of points. Therefore, if the conclusion fails, then there must exist a root i and a (nondegenerate) point y^* such that

    \[\beta:=\frac{dx_i(y)}{dy}\bigg|_{y=y^*}>0.\]

We will use this to bring ourselves to a contradiction.

Let us write x^*=x_i(y^*), so (x^*,y^*) is a root of q. We Taylor expand q around this point. Note that

    \[q(x,y) = c(y)\prod_{j=1}^d (x-x_j(y)),\]

so that

    \[\frac{\partial}{\partial x}q(x,y)\bigg|_{x^*,y^*} = c(y)\sum_{k=1}^d\prod_{j\ne k} (x-x_j(y))\bigg|_{x^*,y^*} = c(y^*)\prod_{j\ne i} (x^*-x_j(y^*)),\]

where we have used that x^*-x_i(y^*)=0. Similarly, we obtain

    \[\frac{\partial}{\partial y}q(x,y)\bigg|_{x^*,y^*} = -\frac{dx_i(y)}{dy}\bigg|_{y=y^*}c(y^*) \prod_{j\ne i} (x^*-x_j(y^*)).\]

As q(x^*,y^*)=0, we obtain the Taylor expansion

    \[q(x,y) = \alpha\bigg\{(x-x^*)-\beta(y-y^*)\bigg\} +o(|x-x^*|+|y-y^*|)\]

for a suitable constant \alpha\in\mathbb{R} and where \beta>0.

We now conclude by a perturbation argument. Define the univariate polynomials

    \[\tilde q_\delta(x) := \delta^{-1} q(\delta x+x^*,\delta i+y^*).\]

Letting \delta\downarrow 0, we obtain \tilde q_0(x) = \alpha(x-i\beta), which evidently has a root \tilde x=i\beta with strictly positive imaginary part. By the continuity of roots of polynomials, \tilde q_\delta(x) must still have a root with strictly positive imaginary part when \delta>0 is sufficiently small (this follows readily using Rouché’s theorem). But this implies that q(x,y) has a root with \text{Im}(x)>0 and \text{Im}(y)>0, contradicting real stability. \square

Conclusion of the proof

All that remains is to finally prove Theorem 2. The hard work is behind us: with the monotonicity and convexity properties of Lemmas 2 and 3 in hand, the proof of Theorem 2 is straightforward.

Proof of Theorem 2. As the barrier function is coordinatewise decreasing, the assumption

    \[\frac{\partial}{\partial x_j}\log q(x_1,\ldots,x_n)< 1\]

implies that

    \[\frac{\partial}{\partial z_j}\log q(z_1,\ldots,z_n)< 1\quad\text{whenever }z_1>x_1,\ldots,z_n>x_n.\]

It follows immediately that x bounds the roots of \big(1-\frac{\partial}{\partial z_j}\big)q(z_1,\ldots,z_n).

Let us now control the barrier function of \big(1-\frac{\partial}{\partial z_j}\big)q(z_1,\ldots,z_n). As q is positive above the roots of q,

    \[\bigg(1-\frac{\partial}{\partial z_j}\bigg)q(z_1,\ldots,z_n) = q(z_1,\ldots,z_n)\bigg(1-\frac{\partial}{\partial z_j}\log q(z_1,\ldots,z_n)\bigg).\]

We can therefore write the barrier function as

    \[\frac{\partial}{\partial z_i}\log \bigg(1-\frac{\partial}{\partial z_j}\bigg)q(z_1,\ldots,z_n) = \frac{\partial}{\partial z_i}\log q(z_1,\ldots,z_n) + \frac{-\frac{\partial^2}{\partial z_i\partial z_j}\log q(z_1,\ldots,z_n)}{1-\frac{\partial}{\partial z_j}\log q(z_1,\ldots,z_n)}.\]

Note that as the barrier function is decreasing, the numerator in this expression is positive. Moreover, by the assumption of Theorem 2 and monotonicity of the barrier function, we have 1-\frac{\partial}{\partial z_j}\log q(z_1,\ldots,z_n) \ge \frac{1}{\delta} whenever z_k\ge x_k for all k. We can therefore estimate

    \begin{align*} &\frac{\partial}{\partial x_i}\log\bigg(1-\frac{\partial}{\partial x_j}\bigg)q(x_1,\ldots,x_{j-1},x_j+\delta,x_{j+1},\ldots,x_n) \\ &\le \frac{\partial}{\partial x_i}\log q(x_1,\ldots,x_{j-1},x_j+\delta,x_{j+1},\ldots,x_n) - \delta\frac{\partial}{\partial x_j} \frac{\partial}{\partial x_i}\log q(x_1,\ldots,x_{j-1},x_j+\delta,x_{j+1},\ldots,x_n) \\ &\le \frac{\partial}{\partial x_i}\log q(x_1,\ldots,x_n), \end{align*}

where we have used convexity of the barrier function (that is, we used the first-order condition g(t+\delta)-g(t)\le \delta g'(t+\delta) for the convexity of a function g). And Kadison-Singer is now proved. \square

Epilogue: our presentation of the proof of the Kadison-Singer theorem has largely followed the approach from the blog post by T. Tao, which simplifies some of the arguments in the original paper by A. W. Marcus, D. A. Spielman, and N. Srivastava. Both are very much worth reading!

03. November 2014 by Ramon van Handel
Categories: Roots of polynomials

Lecture 4. Proof of Kadison-Singer (2)

Recall that we are in the middle of proving the following result.

Theorem. Let v_1,\ldots,v_n be independent random vectors in \mathbb{C}^d whose distribution has finite support. Suppose that

    \[\mathbf{E}\sum_{i=1}^nv_iv_i^* = I, \qquad \mathbf{E}\|v_i\|^2 \le \varepsilon\mbox{ for all }i.\]

Then

    \[\left\|\sum_{i=1}^nv_iv_i^*\right\| \le (1+\sqrt{\varepsilon})^2\]

with positive probability.

Define the random matrix A and its characteristic polynomial p_A as

    \[A = \sum_{i=1}^n A_i,\qquad A_i= v_iv_i^*,\qquad p_A(z) = \mathrm{det}(zI-A).\]

As A is positive semidefinite, one representation of the matrix norm is \|A\|=\mathrm{maxroot}(p_A). The proof of the above theorem consists of two main parts:

  1. Show that \mathrm{maxroot}(p_A)\le \mathrm{maxroot}(\mathbf{E}p_A) with positive probability.
  2. Show that \mathrm{maxroot}(\mathbf{E}p_A)\le (1+\sqrt{\varepsilon})^2.

The first statement was proved in the previous two lectures. The goal of this lecture and the next is to prove the second statement.

Mixed characteristic polynomials

The key tool that was used in the previous lectures is the mixed characteristic polynomial of the matrices A_1,\ldots,A_n, which we defined as follows:

    \[\mu[A_1, \ldots, A_n](z) := \prod_{i=1}^n \bigg(1- \frac{\partial}{\partial t_i}\bigg) \det\Bigg( zI + \sum_{i=1}^n t_i A_i\Bigg)\Bigg\vert_{t_1,\ldots,t_n = 0}.\]

We recall the two crucial observations used in the previous lectures:

  1. \mu[A_1,\ldots,A_n] is multilinear in A_1,\ldots,A_n.
  2. If A_1,\ldots,A_n have rank one (this is essential!), then p_{\sum_{i=1}^nA_i}(z)=\mu[A_1,\ldots,A_n](z).

As in our case A_1,\ldots,A_n are independent random matrices of rank one, this implies

    \[\mathbf{E}p_A = \mu[\mathbf{E}A_1,\ldots,\mathbf{E}A_n] =     \mu[B_1,\ldots,B_n],\]

where we define for simplicity B_i:=\mathbf{E}A_i.

Note that the matrices B_i are no longer of rank one (in particular, it is not true that \mathbf{E}p_A is the characteristic polynomial of \sum_{i=1}^nB_i: that would make the remainder of the proof quite trivial!) However, by assumption, the matrices B_i satisfy

    \[\sum_{i=1}^nB_i=I,\qquad     B_i\succeq 0 \quad\mbox{and}\quad\mathrm{Tr}[B_i]\le\varepsilon \quad\mbox{for all }i.\]

To complete the proof, our aim will be to show that for any matrices B_1,\ldots,B_n with these properties, the maximal root of \mu[B_1,\ldots,B_n] can be at most (1+\sqrt{\varepsilon})^2.

Theorem. Let B_1,\ldots,B_n\succeq 0 be positive semidefinite matrices in \mathbb{C}^{d\times d}. Suppose that

    \[\sum_{i=1}^nB_i=I,\qquad     \mathrm{Tr}[B_i]\le\varepsilon\mbox{ for all }i.\]

Then

    \[\text{maxroot}(\mu[B_1,\ldots,B_n]) \le (1+\sqrt{\varepsilon})^2.\]

As the matrices B_i sum to the identity \sum_{i=1}^nB_i=I, it will be convenient to rewrite the definition of the mixed characteristic polynomial slightly using this property:

    \begin{align*} \mu[B_1, \ldots, B_n](z) &:= \prod_{i=1}^n \bigg(1- \frac{\partial}{\partial t_i}\bigg) \det\Bigg( zI + \sum_{i=1}^n t_i B_i\Bigg)\Bigg\vert_{t_1,\ldots,t_n = 0} \\ &= \prod_{i=1}^n \bigg(1- \frac{\partial}{\partial t_i}\bigg) \det\Bigg(\sum_{i=1}^n (z+t_i) B_i\Bigg)\Bigg\vert_{t_1,\ldots,t_n = 0} \\ &= \prod_{i=1}^n \bigg(1- \frac{\partial}{\partial z_i}\bigg) \det\Bigg(\sum_{i=1}^n z_i B_i\Bigg)\Bigg\vert_{z_1,\ldots,z_n = z} \\ &=: \prod_{i=1}^n \bigg(1- \frac{\partial}{\partial z_i}\bigg) p(z_1,\ldots,z_n)\bigg\vert_{z_1,\ldots,z_n = z}, \end{align*}

where we defined the multivariate polynomial

    \[p(z_1,\ldots,z_n):=\det\Bigg(\sum_{i=1}^n z_i B_i\Bigg).\]

Our aim is to show that \prod_{i=1}^n(1-\frac{\partial}{\partial z_i})p(z_1,\ldots,z_n)\ne 0 whenever z_1=\cdots=z_n=z>(1+\sqrt{\varepsilon})^2.

How should we go about proving such a property? The first observation we should make is that the polynomial p itself has no roots in an even larger region.

Lemma. p(z_1,\ldots,z_n)\ne 0 whenever z_1,\ldots,z_n>0.

Proof. Note that

    \[\sum_{i=1}^n z_iB_i - \big(\min_iz_i\big)I = \sum_{i=1}^n \big(z_i-\min_iz_i\big)B_i\succeq 0,\]

as z_i-\min_iz_i\ge 0 and B_i\succeq 0. Thus \sum_{i=1}^n z_iB_i is nonsingular whenever \min_iz_i>0. \square

One might now hope that we can follow a similar idea to what we did in the previous lectures. In the previous lectures, we wanted to show that the mixed characteristic polynomial is real-rooted. To this end, we first showed that the determinant in the definition of the mixed characteristic polynomial is real stable (the appropriate multivariate generalization of real-rootedness), and then we showed that this property is preserved when we apply operations of the form 1-\frac{\partial}{\partial z_i}.

In an ideal world, we would show precisely the same thing here: if the polynomial p has no roots in the region \{z_1,\ldots,z_n>0\}, then perhaps (1-\frac{\partial}{\partial z_i})p has no roots there as well, etc. If this were true, then we would in fact have shown that \mu[B_1,\ldots,B_n](z) has no roots in the region z>0. Unfortunately, this is clearly false: that would imply an upper bound of zero for the matrix norm in our original problem (impossible in general!) Nonetheless, this general strategy proves to be the right approach. It is indeed not true that the region with no roots is preserved by the operation 1-\frac{\partial}{\partial z_i}. However, it turns out that we will be able to control by how much this region shrinks in successive applications of 1-\frac{\partial}{\partial z_i}. The control that we will develop will prove to be sufficiently sharp in order to obtain the desired result.

Evidently, the key problem that we face is to understand what happens to the roots of a polynomial when we apply an operation of the form 1-\frac{\partial}{\partial z_i}. In the remainder of this lecture, we will investigate this problem in a univariate toy setting. This univariate approach will not be sufficiently sharp to yield the desired result, but will give us significant insight into the ingredients that are needed in the proof. In the following lecture, we will complete the proof by developing a multivariate version of these ideas.

Barrier functions: a toy problem

Let us put aside for the moment our original problem, and consider the following simplified setting. Let q(z) be a univariate polynomial with real roots x_1\ge x_2\ge\cdots\ge x_d. What can we say about the locations of the roots of the polynomial (1-\frac{\partial}{\partial z})q(z)? Clearly, the latter polynomial has a root at z if and only if

    \[\bigg(1-\frac{\partial}{\partial z}\bigg)q(z) = 0 \qquad\Longleftrightarrow\qquad q(z) = \frac{\partial q(z)}{\partial z} \qquad\Longleftrightarrow\qquad \frac{\partial}{\partial z}\log q(z)=1.\]

The function \frac{\partial}{\partial z}\log q(z) is very interesting; let us investigate what it looks like. Note that we can always represent a polynomial in terms of its roots as

    \[q(z) = c\prod_{i=1}^d (z-x_i)\]

for some constant c. We can therefore readily compute

    \[\frac{\partial}{\partial z}\log q(z) = \sum_{i=1}^d \frac{1}{z-x_i}.\]

As we assumed that the roots x_i of q are real, the function \frac{\partial}{\partial z}\log q(z) looks something like this:

[Figure: graph of the barrier function \partial/\partial z\,\log q(z), with vertical asymptotes at the roots of q; the points where it equals one are the roots of (1-\partial/\partial z)q.]

The function \frac{\partial}{\partial z}\log q(z) blows up at the roots x_1\ge x_2\ge\cdots\ge x_d of q; this function is therefore referred to as the barrier function, in analogy with a similar notion in optimization. The values where it is equal to one determine the locations of the roots y_1\ge y_2\ge\cdots\ge y_d of (1-\frac{\partial}{\partial z})q(z). It follows immediately from the shape of the barrier function that the roots of our two polynomials are interlaced, as can be seen by inspecting the above figure. Note, moreover, that y_1>x_1, so that it is unfortunately the case that the maximal root of a polynomial increases under the operation 1-\frac{\partial}{\partial z}. However, we can control the location of the maximal root if we are able to control the barrier function.
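
Here is a small numerical version of this picture (a sketch assuming numpy; the cubic q below is an arbitrary choice). It checks the interlacing of the roots and the fact that the roots of (1-\frac{\partial}{\partial z})q are exactly the points where the barrier function equals one:

```python
import numpy as np

q = np.polynomial.Polynomial.fromroots([1.0, 2.0, 4.0])
r = q - q.deriv()                     # the polynomial (1 - d/dz) q
x = np.sort(q.roots().real)           # roots of q:             x_d <= ... <= x_1
y = np.sort(r.roots().real)           # roots of (1 - d/dz) q:  y_d <= ... <= y_1
# interlacing, with the maximal root pushed up: x_1 < y_1
assert np.all(x <= y) and np.all(y[:-1] <= x[1:]) and y[-1] > x[-1]
print(q.deriv()(y) / q(y))            # the barrier function equals 1 at each y_i
```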

Let us now return to our matrix problem and try to see how this idea can be useful. We will consider the first step towards establishing the desired result: what happens to the locations of the roots of the polynomial p(z_1,\ldots,z_n) when we apply a single operation 1-\frac{\partial}{\partial z_1}? (We will ultimately want to iterate such an argument for every coordinate to control the roots of the mixed characteristic polynomial.)

To this end, let us fix t>0 for the time being, and define

    \[q(z) = p(z,t,t,\ldots,t) = \det\Bigg(zB_1+t\sum_{i=2}^n B_i\Bigg).\]

We have already shown that \text{maxroot}(q)\le 0. In order to control the roots of (1-\frac{\partial}{\partial z_1})p(z_1,\ldots,z_n), we need to control the barrier function \frac{\partial}{\partial z}\log q(z). The stroke of luck that we have at this point is that the derivative of the logarithm of a determinant is a remarkably nice object.

Lemma (Jacobi formula). If A+tB is invertible, then \frac{d}{dt}\log\det(A+tB) = \mathrm{Tr}((A+tB)^{-1}B).

Proof. First, we note that

    \begin{align*} \frac{d}{dt}\log\det(A+tB) &= \frac{d}{d\varepsilon}\log\det(A+tB+\varepsilon B)\bigg|_{\varepsilon=0} \\ &= \frac{d}{d\varepsilon}\log(\det(A+tB)\det(I+\varepsilon(A+tB)^{-1} B))\bigg|_{\varepsilon=0} \\ & = \frac{d}{d\varepsilon}\log\det(I+\varepsilon(A+tB)^{-1} B)\bigg|_{\varepsilon=0}. \end{align*}

It therefore suffices to prove that

    \[\frac{d}{d\varepsilon}\log\det(I+\varepsilon M)\bigg|_{\varepsilon=0}=\mathrm{Tr}[M].\]

To this end, we use directly the definition of the determinant:

    \[\frac{d}{d\varepsilon}\det(I+\varepsilon M) = \sum_{\sigma} (-1)^{|\sigma|} \sum_{j=1}^d M_{j\sigma(j)} \prod_{i\ne j}(\delta_{i\sigma(i)}+\varepsilon M_{i\sigma(i)}),\]

where the sum is over permutations \sigma. Setting \varepsilon=0, we see that the only term that survives is the one that satisfies i=\sigma(i) for all i\ne j, that is, the identity permutation. We therefore obtain (as \det(I)=1)

    \[\frac{d}{d\varepsilon}\log\det(I+\varepsilon M)\bigg|_{\varepsilon=0}= \frac{1}{\det(I+\varepsilon M)}\frac{d}{d\varepsilon} \det(I+\varepsilon M)\bigg|_{\varepsilon=0}= \sum_{j=1}^d M_{jj} = \mathrm{Tr}[M].\]

This completes the proof. \square
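
As a quick numerical sanity check of the Jacobi formula (assuming numpy; the random matrices are only illustrative), one can compare a finite-difference derivative of \log\det(A+tB) with the trace formula:

```python
import numpy as np

rng = np.random.default_rng(1)
A0 = rng.standard_normal((5, 5)); A = A0 @ A0.T + np.eye(5)   # positive definite, so det > 0
B0 = rng.standard_normal((5, 5)); B = B0 @ B0.T               # positive semidefinite
t, h = 0.3, 1e-6

lhs = (np.log(np.linalg.det(A + (t + h) * B))
       - np.log(np.linalg.det(A + (t - h) * B))) / (2 * h)    # central finite difference
rhs = np.trace(np.linalg.inv(A + t * B) @ B)                  # Jacobi formula
print(lhs, rhs)                                               # should agree up to O(h^2)
```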

Using the Jacobi formula, we immediately find

    \[\frac{\partial}{\partial z}\log q(z)\bigg|_{z=t} = \mathrm{Tr}[({\textstyle t\sum_{i=1}^nB_i})^{-1}B_1] = \mathrm{Tr}[t^{-1}B_1]\le \frac{\varepsilon}{t}.\]

It therefore follows that

    \[\bigg(1-\frac{\partial}{\partial z_1}\bigg)p(z_1,\ldots,z_n)\bigg|_{z_1,\ldots,z_n=t}\ne 0 \quad\mbox{for all}\quad t>\varepsilon.\]

This brings us one step closer to the desired conclusion!

We have seen above that we can control the roots of p(z_1,\ldots,z_n) and (1-\frac{\partial}{\partial z_1})p(z_1,\ldots,z_n). Of course, the next step will be to control the roots of (1-\frac{\partial}{\partial z_2})(1-\frac{\partial}{\partial z_1})p(z_1,\ldots,z_n). Unfortunately, a direct application of the barrier function method does not lend itself well to iteration. The problem is that while we could easily control the barrier function of p(z_1,\ldots,z_n) using the Jacobi formula, it is not so obvious how to control the barrier function of (1-\frac{\partial}{\partial z_1})p(z_1,\ldots,z_n).

Instead, we are going to develop in the following lecture a multivariate version of the barrier argument. In each consecutive application of 1-\frac{\partial}{\partial z_i}, we will control the region of \mathbb{R}^n in which the polynomial has no roots; at the same time, we will also obtain control over the barrier function of the polynomial in the next stage. When implemented properly, this procedure will allow us to iterate the barrier argument without requiring explicit control of the barrier function except at the first stage.

Many thanks to Mark Cerenzia for scribing this lecture!

19. October 2014 by Ramon van Handel
Categories: Roots of polynomials

Next lecture: October 16

There will be no lecture on October 9 as several of us are out of town. The stochastic analysis seminar will resume on October 16 (the next week) at the usual time and place.

05. October 2014 by Ramon van Handel
Categories: Announcement

Lectures 2 and 3. Proof of Kadison-Singer (1)

The goal of the following lectures is to prove the following result about random matrices due to Marcus, Spielman, and Srivastava. This result readily implies a positive resolution of the Kadison-Singer conjecture, as was explained in detail in the previous lecture.

Theorem. Let v_1,\ldots,v_n be independent random vectors in \mathbb{C}^d whose distribution has finite support. Suppose that

    \[\mathbf{E}\sum_{i=1}^nv_iv_i^* = I, \qquad \mathbf{E}\|v_i\|^2 \le \varepsilon\mbox{ for all }i.\]

Then

    \[\left\|\sum_{i=1}^nv_iv_i^*\right\| \le (1+\sqrt{\varepsilon})^2\]

with positive probability.

Define the random matrix A and its characteristic polynomial p_A as

    \[A = \sum_{i=1}^n v_iv_i^*,\qquad p_A(z) = \mathrm{det}(zI-A).\]

As A is positive semidefinite, one representation of the matrix norm is \|A\|=\mathrm{maxroot}(p_A). The proof of the above theorem consists of two main parts:

  1. Show that \mathrm{maxroot}(p_A)\le \mathrm{maxroot}(\mathbf{E}p_A) with positive probability.
  2. Show that \mathrm{maxroot}(\mathbf{E}p_A)\le (1+\sqrt{\varepsilon})^2.

The present post is devoted to the proof of the first statement. The second statement will be the subject of the subsequent lectures.

Interlacing roots of polynomials

How can we prove an inequality such as \mathrm{maxroot}(p_A)\le \mathrm{maxroot}(\mathbf{E}p_A)? The key issue that we face is that the \mathrm{maxroot} functional is extremely nonlinear, and it is not at all clear how it behaves when you take expectations. This problem already arises in the very simplest setting: what can we say about the locations of the roots of a convex combination \lambda p+(1-\lambda)q of two polynomials p and q? In general, there is little hope to say anything concrete. However, if the polynomials p and q have some special properties, then we are in much better shape.

Lemma. Let p and q be polynomials that are monic (their leading coefficient is one) and of the same degree. Suppose that the polynomial \lambda p+(1-\lambda)q is real-rooted (all its roots are real) for every 0\le\lambda\le 1. Then

    \[\text{maxroot}(p)\wedge\text{maxroot}(q) \le \text{maxroot}(\lambda p+(1-\lambda)q) \le \text{maxroot}(p)\vee\text{maxroot}(q)\]

for every 0\le \lambda\le 1.

The message of this lemma is that under the given assumptions, the maximal root of the convex combination \lambda p+(1-\lambda)q interlaces between the maximal roots of p and q (in fact, a similar property holds for all other roots as well; but we will not need this fact.)
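
A tiny concrete example may be helpful (purely illustrative, assuming numpy). For p(z)=(z-1)(z-3) and q(z)=(z-2)(z-4), every combination \lambda p+(1-\lambda)q has discriminant 4(\lambda^2-\lambda+1)>0, so the assumption of the lemma holds, and the maximal root of the combination indeed stays between \text{maxroot}(p)=3 and \text{maxroot}(q)=4:

```python
import numpy as np

P = np.polynomial.Polynomial
p = P.fromroots([1.0, 3.0])
q = P.fromroots([2.0, 4.0])
for lam in np.linspace(0.0, 1.0, 11):
    r = lam * p + (1 - lam) * q
    roots = r.roots()
    assert np.all(np.abs(roots.imag) < 1e-12)                 # real-rooted
    assert 3.0 - 1e-12 <= roots.real.max() <= 4.0 + 1e-12     # maximal root interlaces
print("maxroot of every convex combination lies in [3, 4]")
```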

Proof. Let us assume without loss of generality that \text{maxroot}(p)\le\text{maxroot}(q), so we must prove

    \[\text{maxroot}(p) \le \text{maxroot}(\lambda p+(1-\lambda)q) \le \text{maxroot}(q).\]

Proof of upper bound. Because the polynomials are monic, p(z) and q(z) are strictly increasing for real z to the right of their maximal roots. Thus p(z) and q(z) are both strictly positive for z>\text{maxroot}(q). But this means that \lambda p(z)+(1-\lambda)q(z) is also strictly positive for z>\text{maxroot}(q), and thus cannot have a root in this region. This implies that

    \[\text{maxroot}(\lambda p+(1-\lambda)q)\le\text{maxroot}(q).\]

Proof of lower bound. Of course, this is the interesting part. By the same argument as the upper bound, it is clear that \text{maxroot}(\lambda p+(1-\lambda)q)=\text{maxroot}(p) if \text{maxroot}(p)=\text{maxroot}(q). We can therefore assume without loss of generality that \text{maxroot}(p)<\text{maxroot}(q).

We distinguish two cases. Note that p(z)>0 for z>\text{maxroot}(p). Suppose first that q(z)<0 for all \text{maxroot}(p)<z<\text{maxroot}(q). Then p and q must look something like this:

[Figure: graphs of p and q in the case q(z)<0 for \text{maxroot}(p)<z<\text{maxroot}(q).]

This means that for any 0<\lambda<1, we have \lambda p(z)+(1-\lambda)q(z)\le 0 at z=\text{maxroot}(p) and \lambda p(z)+(1-\lambda)q(z)\ge 0 at z=\text{maxroot}(q). Thus \lambda p+(1-\lambda)q must have a root between \text{maxroot}(p) and \text{maxroot}(q) by the intermediate value theorem. In particular,

    \[\text{maxroot}(\lambda p+(1-\lambda)q)\ge\text{maxroot}(p).\]

Now suppose that q(z)\ge 0 for some \text{maxroot}(p)<z<\text{maxroot}(q), and let z^* be the second largest root of q. Then p and q must look something like this:

[Figure: graphs of p and q in the case where q(z)\ge 0 for some \text{maxroot}(p)<z<\text{maxroot}(q); z^* denotes the second largest root of q.]

(Note that it could be that z^*=\text{maxroot}(q), or that z^* is a local maximum; then the picture looks slightly different, but this will not change our argument.) We now argue as follows. The polynomial p has no roots in the interval [z^*,\infty[, but the polynomial q has at least two roots in this interval (counting multiplicities). Thus as we let \lambda vary from 0 to 1, the number of roots of the polynomial \lambda p+(1-\lambda)q must jump at some point \lambda^* from having at least two roots to having zero roots in the interval [z^*,\infty[. The polynomial \lambda^*p+(1-\lambda^*)q must therefore look like this:

[Figure: graph of the polynomial \lambda^*p+(1-\lambda^*)q at the critical value \lambda^*.]

This means that for any \varepsilon>0 sufficiently small, the polynomial (\lambda^*+\varepsilon)p+(1-\lambda^*-\varepsilon)q loses two real roots in the interval [z^*,\infty[. As long as these roots do not reappear elsewhere on the real line, this implies that this polynomial has fewer real roots than its degree, and therefore it must have at least two roots in the complex plane away from the real line. This contradicts the assumption that all convex combinations of p and q are real-rooted.

Thus it only remains to argue what happens if the two roots that are lost when we go from \varepsilon=0 to \varepsilon>0 reappear elsewhere on the real line. Note that for new roots to appear, a local maximum that was strictly below 0 (or a local minimum that was strictly above 0) when \varepsilon=0 must have moved above (or below) 0 when \varepsilon>0. As the graph of a polynomial is continuous in its coefficients, we can always avoid this by choosing \varepsilon>0 to be sufficiently small. \square

Once we know how to deal with two polynomials, we can easily generalize as follows.

Corollary. Let p_1,\ldots,p_n be monic polynomials of the same degree. Suppose that all convex combinations of p_1,\ldots,p_n are real-rooted. Then

    \[\min_{i\le n}\text{maxroot}(p_i) \le \text{maxroot}(\lambda_1 p_1+\cdots+\lambda_np_n) \le \max_{i\le n}\text{maxroot}(p_i)\]

for all convex weights \lambda_i\ge 0, \lambda_1+\cdots+\lambda_n=1.

Proof by induction. Assume the result is true for a given number of polynomials n. A convex combination of n+1 polynomials can be written as a convex combination of two polynomials, one of which is itself a convex combination of the first n; the result for n+1 polynomials therefore follows by applying the previous lemma together with the induction hypothesis. \square

Mixed characteristic polynomials

In order to apply the interlacing property developed above to the Kadison-Singer problem, we must show that the characteristic polynomial p_A satisfies the necessary assumptions. Recall that the random matrix A has a distribution with finite support; that is, A is a random variable that can take one of N possible values A_1,\ldots,A_N. Our aim is to show that there exists 1\le i\le N such that

    \[\text{maxroot}(p_{A_i}) \le \text{maxroot}(\mathbf{E}p_A),\]

where the expectation \mathbf{E}p_A is clearly a convex combination of the polynomials p_{A_1},\ldots,p_{A_N}. In order to apply the interlacing property, we must show not only that each of these polynomials is real-rooted (this is obvious, as A_i are Hermitian matrices), but also that convex combinations of these polynomials are real-rooted. The latter is far from obvious: we must reason about the properties of convex combinations of determinants, such as \lambda \det(zI-A_1)+(1-\lambda)\det(zI-A_2). Determinants are nonlinear objects, and so it is not at all clear how to address such questions.

The stroke of luck in the Kadison-Singer problem is that the matrix A is a sum of rank-one matrices. Even though the determinant is nonlinear, it is linear with respect to rank-one perturbations. This crucial fact will allow us to express convex combinations of the determinants that appear in our problem in terms of another determinant, for which it will prove to be easier to establish real-rootedness.

Lemma. (Sylvester’s determinant identity) \det(I+uv^*) = 1+v^*u.

Proof. Let U,V be nonsingular square matrices. Then

    \[\det(I+UV^*) = \det(U^{-1})\det(I+UV^*)\det(U) = \det(I+V^*U).\]

By continuity, this remains valid for singular matrices U,V. The result follows by letting U be the matrix whose first column is u and V be the matrix whose first column is v (all other entries are zero). \square

Corollary. t\mapsto\det(B+tvv^*) is linear in t.

Proof. If B is nonsingular, we can write

    \[\det(B+tvv^*) = \det(B)\det(I+tB^{-1}vv^*) = \det(B)(1+tv^*B^{-1}v),\]

which is linear in t. As the limit of linear functions is linear, the result remains valid for singular B. \square
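
Both facts are easy to check numerically (a sketch assuming numpy; the random matrix and vectors are only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
u = rng.standard_normal(d) + 1j * rng.standard_normal(d)
v = rng.standard_normal(d) + 1j * rng.standard_normal(d)
# Sylvester's identity: det(I + u v*) = 1 + v* u
print(np.linalg.det(np.eye(d) + np.outer(u, v.conj())), 1 + v.conj() @ u)

# linearity of t -> det(B + t v v*): second differences on an equispaced grid vanish
B = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
ts = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
dets = np.array([np.linalg.det(B + s * np.outer(v, v.conj())) for s in ts])
print(np.max(np.abs(np.diff(dets, n=2))))                     # should be ~0 up to roundoff
```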

What we would really like is that \mathbf{E}\det(B+vv^*)\overset{?}{=}\det(B+\mathbf{E}vv^*) for a random vector v: that would make it very easy to handle convex combinations of characteristic polynomials. Unfortunately, this is wishful thinking: the matrix \mathbf{E}vv^* is not of rank one, so this does not follow from the linearity of rank-one perturbations. Nonetheless, this is not such a bad idea. We can use the linearity of rank-one perturbations to rewrite the characteristic polynomial of a sum of rank-one matrices in terms of a different quantity, which proves to be linear even beyond the rank-one setting.

To this end, let us define the multivariate polynomial

    \[p(z,t_1,\ldots,t_n) = \det\Bigg(zI+\sum_{i=1}^nt_iv_iv_i^*\Bigg).\]

Then p_A(z)=p(z,-1,\ldots,-1). Moreover, we have just shown that p(z,t_1,\ldots,t_n) is multilinear in t_1,\ldots,t_n for every z. Using that a linear function equals its first-order Taylor expansion, we can write

    \[p(z,t_1,\ldots,t_n) = p(z,t_1,\ldots,t_{n-1},0) + t_n\frac{\partial}{\partial s}p(z,t_1,\ldots,t_{n-1},s)\bigg|_{s=0}.\]

Iterating this device, we obtain

    \[p(z,t_1,\ldots,t_n) = \prod_{i=1}^n\bigg(1+t_i\frac{\partial}{\partial s_i}\bigg)p(z,s_1,\ldots,s_n)\bigg|_{s_1,\ldots,s_n=0}.\]

Setting t_1,\ldots,t_n=-1, we obtain the identity

    \[\det\Bigg(zI-\sum_{i=1}^nv_iv_i^*\Bigg) = \prod_{i=1}^n\bigg(1-\frac{\partial}{\partial t_i}\bigg) \det\Bigg(zI+\sum_{i=1}^nt_i v_iv_i^*\Bigg)\Bigg|_{t_1,\ldots,t_n=0}.\]

The quantity on the right is called a mixed characteristic polynomial.

Definition. The mixed characteristic polynomial of square matrices B_1,\ldots,B_n is defined as

    \[\mu[B_1,\ldots,B_n](z) := \prod_{i=1}^n\bigg(1-\frac{\partial}{\partial t_i}\bigg) \det\Bigg(zI+\sum_{i=1}^nt_i B_i\Bigg)\Bigg|_{t_1,\ldots,t_n=0}.\]

Note that if all B_i are rank one, we have shown that

    \[p_{\sum_{i=1}^nB_i}(z) = \mu[B_1,\ldots,B_n](z).\]

That is, the mixed characteristic polynomial of B_1,\ldots,B_n coincides with the characteristic polynomial of \sum_{i=1}^nB_i when all B_i are rank one. This is no longer true when the matrices B_i are not of rank one! Nonetheless, the mixed characteristic polynomial is an extremely useful device for the following reason.

Lemma. \mu[B_1,\ldots,B_n] is multilinear in B_1,\ldots,B_n.

Proof. The determinant \det(zI+\sum_{i=1}^nt_iB_i) is a polynomial in the entries of the matrices t_iB_i, and each term in the product \prod_{i=1}^n(1-\frac{\partial}{\partial t_i}) is a product of derivatives with respect to distinct variables t_i. In order for such a derivative of one of the terms of the polynomial \det(zI+\sum_{i=1}^nt_iB_i) to be nonzero after we set t_1,\ldots,t_n=0, this term can contain each t_iB_i at most once: otherwise this term must be proportional to t_i^k for some k>1, so the derivative of this term will vanish when we set t_i=0. Thus the only terms that survive in the mixed characteristic polynomial are products of the entries of the matrices t_iB_i, where each i appears at most once. Such terms are clearly multilinear in B_1,\ldots,B_n. \square

A simple consequence of this result is, for example, that

    \[\mathbf{E}p_A = \mathbf{E}\mu[v_1v_1^*,\ldots,v_nv_n^*] = \mu[\mathbf{E}v_1v_1^*,\ldots,\mathbf{E}v_nv_n^*],\]

where we have used multilinearity and the independence of the vectors v_i. Let us emphasize that the right-hand side in this expression cannot itself be expressed as the characteristic polynomial of \mathbf{E}A, for example. Nonetheless, we have achieved something very useful: we have expressed a convex combination of determinants in terms of the derivative of a determinant. It turns out that it is much easier to establish real-rootedness of derivatives of determinants, than it is to establish real-rootedness of convex combinations of determinants. Our next order of business is to understand why this is true.
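
Before moving on, the rank-one identity p_{\sum_iB_i}=\mu[B_1,\ldots,B_n] can be verified symbolically on a small example (a sketch assuming sympy; the three vectors in \mathbb{C}^2 below are arbitrary choices, taken real so that transposition plays the role of the conjugate transpose):

```python
import sympy as sp

vs = [sp.Matrix([1, 1]) / sp.sqrt(2), sp.Matrix([1, -1]) / sp.sqrt(2), sp.Matrix([0, 1])]
z = sp.symbols('z')
t = sp.symbols('t1 t2 t3')

def mixed_char_poly(Bs):
    q = (z * sp.eye(2) + sum((ti * Bi for ti, Bi in zip(t, Bs)), sp.zeros(2, 2))).det()
    for ti in t:                                  # apply prod_i (1 - d/dt_i)
        q = q - sp.diff(q, ti)
    return sp.expand(q.subs({ti: 0 for ti in t}))

A = sum((v * v.T for v in vs), sp.zeros(2, 2))    # A = sum_i v_i v_i^* (real vectors)
p_A = sp.expand((z * sp.eye(2) - A).det())        # characteristic polynomial of A
assert sp.expand(mixed_char_poly([v * v.T for v in vs]) - p_A) == 0
print(p_A)                                        # here (z - 1)(z - 2), since A = diag(1, 2)
```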

Multivariate polynomials and real stability

Given positive semidefinite matrices B_1,\ldots,B_n, let us define the multivariate polynomial

    \[p(z,t_1,\ldots,t_n) = \det\Bigg(zI+\sum_{i=1}^nt_iB_i\Bigg).\]

The mixed characteristic polynomial \mu[B_1,\ldots,B_n](z) is obtained from p by taking derivatives, followed by fixing t_1,\ldots,t_n=0. We are going to prove real-rootedness of \mu[B_1,\ldots,B_n] by showing that the real-rootedness property is preserved by such operations.

Of course, it is not quite clear what we mean by this: p is a multivariate polynomial, so its roots are points in \mathbb{C}^n. The appropriate extension of real-rootedness to multivariate polynomials is as follows.

Definition. A multivariate polynomial q(z_1,\ldots,z_n) is called stable if it has no roots in the set \mathcal{H}=\{z\in\mathbb{C}^n:\text{Im}z_i>0\text{ for all }i\}, and is called real stable if in addition all its coefficients are real.

We are going to establish the following facts:

  1. The polynomial p is real stable.
  2. Taking derivatives or fixing parameters preserves real stability.
  3. A univariate polynomial is real stable if and only if it is real-rooted.

As the (univariate) mixed characteristic polynomial \mu[B_1,\ldots,B_n] is obtained from p by taking derivatives and fixing parameters, this reasoning will show that \mu[B_1,\ldots,B_n] is real-rooted.

The third assertion is the easiest, and we prove it first.

Lemma. A univariate polynomial is real stable if and only if it is real-rooted.

Proof. Clearly a real-rooted polynomial is real stable. Conversely, suppose a univariate polynomial q is real stable. Then it has no roots z with \text{Im}z>0. However, if q(z)=0, then q(\bar z)=0 also (as q has real coefficients). Therefore, there are also no roots with \text{Im}z<0. It follows that q must be real-rooted. \square

Next, we prove the preservation of real stability under taking derivatives and fixing parameters.

Lemma. Let q(z_1,\ldots,z_n) be a real stable polynomial and let t\in\mathbb{R}. Then

  1. q_t(z_1,\ldots,z_{n-1}):= q(z_1,\ldots,z_{n-1},t) is real stable.
  2. (1+t\frac{\partial}{\partial z_n})q(z_1,\ldots,z_n) is real stable.

Proof. We begin by proving the first assertion. Clearly q_t has real coefficients, so it suffices to show that it is stable. Suppose to the contrary that q_t has a root z^*=(z^*_1,\ldots,z^*_{n-1}) in \mathcal{H}. As the roots of a polynomial vary continuously with its coefficients, the polynomial q_{t+i\varepsilon} must also have a root in \mathcal{H} for \varepsilon>0 sufficiently small. But this contradicts the real stability of q.

Remark. A careful proof of the continuity of roots of complex polynomials can be given using Rouché’s theorem. This is left as an exercise to the interested reader.

We now turn to the second assertion. It is again obvious that the derivative of the polynomial has real coefficients, so it remains to establish that it is stable. Assume to the contrary that

    \[(1+t\frac{\partial}{\partial z_n})q(z_1^*,\ldots,z_n^*)=0\]

for some z^*=(z^*_1,\ldots,z^*_n) in \mathcal{H}.

Define the univariate polynomial \tilde q(z)= q(z^*_1,\ldots,z^*_{n-1},z), and denote by w_1,\ldots,w_d\in\mathbb{C} the roots of \tilde q. As q is real stable, we must have \tilde q(z_n^*)\ne 0 and \text{Im}w_i\le 0 for all i. Moreover, we can express \tilde q as

    \[\tilde q(z) = c\prod_{i=1}^d (z-w_i).\]

By assumption, we have

    \[0 = \tilde q(z_n^*)+t \tilde q'(z_n^*)    = \tilde q(z_n^*) (1+t(\log\tilde q)'(z_n^*))\]

(as \log\tilde q(z) is analytic in a neighborhood of z_n^*). But note that

    \[\log \tilde q(z) = \log c+\sum_{i=1}^d\log(z-w_i),\]

so

    \[0 = 1+\sum_{i=1}^d\frac{t}{z_n^*-w_i}.\]

Taking the imaginary part of this expression and dividing by t (the case t=0 being trivial) yields

    \[0 = \sum_{i=1}^d \frac{\text{Im}(z_n^*)-\text{Im}(w_i)}{|z_n^*-w_i|^2}.\]

But \text{Im}(z_n^*)>0 and \text{Im}(w_i)\le 0, so we have achieved a contradiction. \square

Finally, we prove that the polynomial p is in fact real stable.

Lemma. Suppose that B_1,\ldots,B_n are positive semidefinite matrices. Then the polynomial

    \[p(z,t_1,\ldots,t_n) = \det\Bigg(zI+\sum_{i=1}^nt_iB_i\Bigg)\]

is real stable.

Proof. To show that p is real, note that

    \[\overline{\det\Bigg(zI+\sum_{i=1}^nt_iB_i\Bigg)} = \det\Bigg(\bar zI+\sum_{i=1}^n\bar t_i B_i^*\Bigg).\]

As B_i are Hermitian, this implies that p(z,t_1,\ldots,t_n) is real whenever the parameters z,t_1,\ldots,t_n are all real. Thus the coefficients of p must be real, and it remains to prove stability.

To this end, suppose that p has a root z^*,t_1^*,\ldots,t_n^* in \mathcal{H}. Then the matrix z^*I+\sum_{i=1}^nt_i^*B_i is not invertible, so there exists a vector v\ne 0 such that

    \[z^*\|v\|^2+\sum_{i=1}^nt_i^*v^*B_iv=0.\]

But \|v\|^2>0 and v^*B_iv\ge 0, so the imaginary part of the left-hand side must be strictly positive by the assumption that the root z^*,t_1^*,\ldots,t_n^* is in \mathcal{H}. Thus we have achieved a contradiction. \square

Putting together the previous three results, we have shown the following.

Corollary. The mixed characteristic polynomial \mu[B_1,\ldots,B_n] is real-rooted whenever B_1,\ldots,B_n are positive semidefinite matrices.

Putting everything together

We are now ready to put together the various ideas developed in this post.

Proposition. Let v_1,\ldots,v_n be independent random vectors in \mathbb{C}^d whose distribution has finite support, and let A=\sum_{i=1}^nv_iv_i^*. Then

    \[\text{maxroot}(p_A) \le \text{maxroot}(\mathbf{E}p_A)\]

with positive probability.

Proof. We will apply the interlacing property step by step. Denote by \mathcal{V}_i the possible values that the random vector v_i can take. First, note that for every collection of convex weights \{\lambda_w\}_{w\in\mathcal{V}_n}

    \[\sum_{w\in\mathcal{V}_n}\lambda_w\,\mu[v_1v_1^*,\ldots,v_{n-1}v_{n-1}^*,ww^*] = \mu\Bigg[v_1v_1^*,\ldots,v_{n-1}v_{n-1}^*,\sum_{w\in\mathcal{V}_n}\lambda_w ww^*\Bigg]\]

by multilinearity. We have shown that the mixed characteristic polynomial on the right is real-rooted. Moreover, it is easily verified from their definition that the polynomials \mu[v_1v_1^*,\ldots,v_{n-1}v_{n-1}^*,ww^*] are monic and have the same degree. Therefore, by the interlacing property, there exists a vector w_n\in\mathcal{V}_n (which may depend on the outcomes of v_1,\ldots,v_{n-1}) such that

    \[\text{maxroot}(\mu[v_1v_1^*,\ldots,v_{n-1}v_{n-1}^*,w_nw_n^*]) \le \text{maxroot}(\mu[v_1v_1^*,\ldots,v_{n-1}v_{n-1}^*,\mathbf{E}v_nv_n^*]).\]

We now proceed in precisely the same manner with the polynomial on the right-hand side of this inequality to establish the existence of a vector w_{n-1} (depending on v_1,\ldots,v_{n-2}) such that

    \begin{align*} &\text{maxroot}(\mu[v_1v_1^*,\ldots,v_{n-2}v_{n-2}^*,w_{n-1}w_{n-1}^*,\mathbf{E}v_nv_n^*]) \\ &\qquad\le \text{maxroot}(\mu[v_1v_1^*,\ldots,v_{n-2}v_{n-2}^*,\mathbf{E}v_{n-1}v_{n-1}^*,\mathbf{E}v_nv_n^*]). \end{align*}

By iterating this argument, we show that

    \[\text{maxroot}(\mu[w_1w_1^*,\ldots,w_{n}w_{n}^*]) \le \text{maxroot}(\mu[\mathbf{E}v_1v_1^*,\ldots,\mathbf{E}v_nv_n^*]).\]

But note that \mu[w_1w_1^*,\ldots,w_{n}w_{n}^*]=p_{\sum_{i=1}^nw_iw_i^*}, which is a single realization of p_A; whereas \mu[\mathbf{E}v_1v_1^*,\ldots,\mathbf{E}v_nv_n^*]=\mathbf{E}\mu[v_1v_1^*,\ldots,v_nv_n^*]=\mathbf{E}p_A by multilinearity of the mixed characteristic polynomial and as the random vectors v_i are independent. We have therefore completed the proof. \square

We have now completed the first step in our program. In the following lectures, we will concern ourselves with the second step: the control of \mathrm{maxroot}(\mathbf{E}p_A).

Remark. It is only fair to note that, even though the result we are after is a statement about random matrices, essentially no probability appears in the proof: the proof only finds one single realization of the random variable p_A whose maximal root is controlled as desired, and the statement that this happens with positive probability is largely cosmetic (this is a finite probability space, so any realization occurs with positive probability). This highlights, in some sense, the limitations of the probabilistic method in the present context: as the event we are after has very small probability, the usual methods of probabilistic reasoning are not efficient at finding such an event.

Lectures by Philippe Rigollet | Notes by Junwei Lu and Ramon van Handel

05. October 2014 by Ramon van Handel
Categories: Roots of polynomials

Lecture 1. Introduction

The goal of our seminar this semester is to try to understand the unexpected application of the geometry of roots of polynomials to address probabilistic problems. Such methods have recently led to the resolution of several important questions, and could potentially be very useful in many other problems as well. In the following lectures, we will work in detail through some recent problems in which the geometry of polynomials arises in a crucial manner in the proof.

Kadison-Singer, random matrices, and polynomials

In 1959, Kadison and Singer made a fundamental conjecture about the extension of pure states on C^* algebras. While the original conjecture was purely qualitative, it has since been reformulated in many different ways. These different formulations prove to have applications in various areas: see the nice survey by Casazza and Tremain. After many decades, the Kadison-Singer conjecture was finally proved in a paper by Marcus, Spielman, and Srivastava. They prove a quantitative formulation of the conjecture, due to Weaver, that can be stated as an elementary (deterministic!) property of vectors in \mathbb{C}^d.

Theorem (formerly Conjecture) There exist universal constants \delta,\varepsilon > 0 such that, if n and d are any positive integers and w_1,\ldots,w_n \in \mathbb{C}^d are vectors such that

    \[\sum_{i=1}^n w_iw_i^* = I\qquad\mbox{and}\qquad \|w_i\|^2\le\delta\mbox{ for all }i,\]

then we can partition \{1,\ldots,n\} into disjoint subsets S and S^c such that

    \begin{align*} \left\|\sum_{i \in S}w_iw_i^*\right\|\le 1-\varepsilon,\qquad \left\|\sum_{i \in S^c}w_iw_i^*\right\|\le 1-\varepsilon, \end{align*}

where \|M\| denotes the spectral norm of the matrix M.

This result appears at first sight to be almost obvious: in the scalar case d=1, the corresponding statement is completely elementary. To see this, let x_1,\ldots,x_n\ge 0 satisfy \sum_ix_i=1 and x_i\le\delta for all i. The only way that we can fail to find a subset S such that \sum_{i\in S}x_i\le 1-\varepsilon and \sum_{i\in S^c}x_i\le 1-\varepsilon is when the sizes of the x_i are very imbalanced (e.g., if x_1=1 and x_2,\ldots,x_n=0), and this is ruled out by the assumption x_i\le\delta. It seems natural to expect that a similar conclusion follows in any dimension d.

Remark. To make this argument precise, let us argue by contradiction. Suppose that for every set S, we have either \sum_{i\in S}x_i> 1-\varepsilon or \sum_{i\in S^c}x_i> 1-\varepsilon. The first possibility occurs when S=\{1,\ldots,n\} and the second possibility occurs when S=\varnothing. By adding points one by one, we see that there must exist a set S and a point j\not\in S such that \sum_{i\in S\cup\{j\}}x_i> 1-\varepsilon and \sum_{i\in S^c}x_i> 1-\varepsilon. But this implies that x_j=\sum_{i\in S\cup\{j\}}x_i-1+\sum_{i\in S^c}x_i > 1-2\varepsilon, which contradicts x_j\le\delta when \varepsilon is sufficiently small.

Despite the intuitive nature of the statement, the general result is far from obvious. In particular, it is important to note that the affirmative solution to the Kadison-Singer conjecture requires that the constants \delta,\varepsilon must be independent of d and n. It is not at all obvious a priori that it is possible to achieve this: the scalar case is misleadingly simple, and its proof does not extend to higher dimension (try it!)

So far no probability has entered the picture: we are only asked to establish the existence of one set S for which the above result holds. However, in many problems in analysis and combinatorics, the easiest way to find the existence of a nontrivial object is by selecting such an object randomly and attempting to show that it satisfies the desired properties with positive probability. This approach is known as the probabilistic method. Using this idea, we will presently show that the above deterministic problem can be reduced to a basic probabilistic problem of estimating the norm of certain random matrices.

Let w_1,\ldots,w_n\in\mathbb{C}^d be given. In order to find a suitable set S, let us try to choose this set uniformly at random: that is, S is a random set such that each point i\in\{1,\ldots,n\} is included in S independently with probability one half. Let us now define the random vectors v_1,\ldots,v_n\in\mathbb{C}^{2d} as follows:

    \[v_i = \begin{cases} \left(\begin{array}{c} \sqrt{2}w_i \\ 0_d \end{array}\right) &\text{ if } i\in S, \vspace{0.5em} \\ \left(\begin{array}{c} 0_d \\ \sqrt{2}w_i \end{array}\right) &\text{ if } i\not\in S, \end{cases}\]

where 0_d denotes the zero vector in \mathbb{C}^d. A quick calculation gives

    \[\mathbf{E}[v_iv_i^*] = \left(\begin{array}{cc}         w_iw_i^* & 0 \\ 0  &w_iw_i^* \end{array}\right).\]

We therefore have

    \[\mathbf{E}\sum_{i=1}^nv_iv_i^* = I,\qquad \|v_i\|^2 = 2\|w_i\|^2 \le 2\delta.\]

Moreover, we can compute

    \[\sum_{i=1}^nv_iv_i^* = 2 \left(\begin{array}{cc}         \sum_{i \in S}w_iw_i^* & 0 \\ 0  &\sum_{i \in S^c}w_iw_i^* \end{array}\right),\]

and therefore

    \[\left\|\sum_{i=1}^nv_iv_i^* \right\| =  2\max\left(\left\|\sum_{i \in S}w_iw_i^*  \right\|, \left\|\sum_{i \in S^c}w_iw_i^*  \right\|\right).\]
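
This reduction is easy to check numerically (a sketch assuming numpy; building the w_i from a random unitary matrix is just one convenient way to satisfy \sum_iw_iw_i^*=I):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 3, 8
Q, _ = np.linalg.qr(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
W = Q[:, :d].conj().T                          # d x n, columns w_i with sum_i w_i w_i* = I
w = [W[:, i] for i in range(n)]
assert np.allclose(W @ W.conj().T, np.eye(d))

S = rng.random(n) < 0.5                        # uniformly random subset
zero = np.zeros(d)
v = [np.concatenate([np.sqrt(2) * w[i], zero]) if S[i]
     else np.concatenate([zero, np.sqrt(2) * w[i]]) for i in range(n)]

def block_norm(idx):
    return np.linalg.norm(sum(np.outer(w[i], w[i].conj()) for i in idx), 2) if len(idx) else 0.0

lhs = np.linalg.norm(sum(np.outer(vi, vi.conj()) for vi in v), 2)
rhs = 2 * max(block_norm(np.where(S)[0]), block_norm(np.where(~S)[0]))
print(lhs, rhs)                                # the two numbers coincide
```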

It is now immediately clear that in order to prove the deterministic version of the Kadison-Singer conjecture, it suffices to prove the following result about random matrices.

Theorem. Let v_1,\ldots,v_n be independent random vectors in \mathbb{C}^d whose distribution has finite support. Suppose that

    \[\mathbf{E}\sum_{i=1}^nv_iv_i^* = I, \qquad \mathbf{E}\|v_i\|^2 \le \gamma\mbox{ for all }i.\]

Then

    \begin{align*} \mathbf{P}\left(\left\|\sum_{i=1}^nv_iv_i^*\right\| \le 1 + o(1)\right) > 0 \end{align*}

(where the o(1) term depends only on \gamma and tends to 0 as \gamma\to 0).

How might one prove this result? The easiest approach would be to show that

    \[\mathbf{E}\left\|\sum_{i=1}^nv_iv_i^*\right\| \le 1 + o(1),\]

in which case the result would follow immediately (this is called the first moment method). This seems promising, as the literature on random matrices contains numerous bounds on the norm of a random matrix. Unfortunately, this approach cannot work, as the following example illustrates.

Example. Denote by e_1,\ldots,e_d the standard basis in \mathbb{C}^d, and let g_{ij} be i.i.d. standard normal random variables for i=1,\ldots,k and j=1,\ldots,d. Define the random vectors v_1,\ldots,v_{dk} \in \mathbb{C}^d by

    \[v_{(j-1)k + i} := \frac{1}{\sqrt{k}}g_{ij}e_j.\]

That is,

    \[\begin{array}{cccc} v_1 = \frac{1}{\sqrt{k}}g_{11}e_1, &v_{k+1} = \frac{1}{\sqrt{k}}g_{12}e_2, &\cdots &v_{(d-1)k + 1} = \frac{1}{\sqrt{k}}g_{1d}e_d, \\ \vdots & \vdots &  & \vdots \\ v_k = \frac{1}{\sqrt{k}}g_{k1}e_1, &v_{2k} = \frac{1}{\sqrt{k}}g_{k2}e_2, &\cdots &v_{dk} = \frac{1}{\sqrt{k}}g_{kd}e_d. \end{array}\]

Then \mathbf{E}[v_{(j-1)k + i}v_{(j-1)k + i}^*] = \frac{1}{k}e_je_j^*, so \mathbf{E}\sum_{i=1}^{kd}v_iv_i^* = I. Moreover, \mathbf{E}\|v_i\|^2 = 1/k =: \gamma. Thus the assumptions of the above result are satisfied when k is sufficiently large. On the other hand, the matrix \sum_{i=1}^{kd} v_iv_i^* is diagonal with entries j=1,\ldots,d given by \frac{1}{k}\sum_{i=1}^kg_{ij}^2, and we therefore find that

    \[\left\|\sum_{i=1}^{kd} v_iv_i^*\right\| = \max_{j \le d} \frac{1}{k}\sum_{i=1}^kg_{ij}^2 \sim \gamma\log d\]

as d\to\infty (as the maximum of chi-square random variables is of order \log d). Thus the expected matrix norm in this example must depend on the dimension d. We therefore cannot obtain a dimension-free result, as is required to resolve Kadison-Singer, by looking at the expectation of the matrix norm.
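
The logarithmic growth in this example is easy to observe in a small simulation; the following Python sketch uses k=2 (so \gamma=1/2) purely for illustration.

    import numpy as np

    rng = np.random.default_rng(2)
    k = 2                                    # gamma = 1/k = 0.5

    for d in [100, 10_000, 1_000_000]:
        # In this example the matrix is diagonal, so its norm is simply the
        # largest of the d averages (1/k) sum_i g_ij^2.
        norm = (rng.standard_normal((k, d)) ** 2).mean(axis=0).max()
        print(f"d = {d:9d}   norm = {norm:6.2f}   log d = {np.log(d):6.2f}")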

Remark. In this example we used Gaussian variables, for which \mathbf{E}\|v_i\|^2\le\gamma but \|v_i\|^2 is not uniformly bounded. In the reduction of Kadison-Singer to the random matrix problem, we had the stronger property that \|v_i\|^2\le\gamma uniformly. However, by using Bernoulli variables instead of Gaussians, we should be able to construct an analogous example using only bounded variables.

Remark. The lovely method due to Ahlswede and Winter makes it possible to show rather easily that \mathbf{E}\left\|\sum_i v_iv_i^*\right\| \lesssim \gamma\log d in the general random matrix setting that we are considering. Thus the above example exhibits the worst case behavior for this problem. However, as noted above, a dimension-dependent bound of this kind does not suffice for our purposes.

Is the result we are trying to prove hopeless? Not at all: bounding the expectation of the matrix norm is only a very crude way to show that the matrix norm is small with positive probability. The expected norm captures the typical behavior of the matrix norm: in the present case, we have seen above that this typical behavior must in general be dimension-dependent. However, all we want to establish is that it is possible that the matrix norm is bounded by a dimension-independent quantity (that is, that this happens with positive probability). Unfortunately, the above example suggests that the latter cannot be a typical event: the bound that we are interested in can only hold, in general, with very small probability that vanishes as we increase the dimension. This means that it is much more difficult to apply the probabilistic method, which usually works well precisely when the phenomenon that we are interested in is not only possible, but in fact typical. The present situation is much more delicate, and results in random matrix theory are not equipped to address it. To surmount this problem, Marcus, Spielman, and Srivastava developed an entirely new approach to bound the norm of a random matrix.

Let us briefly recall the common approaches in random matrix theory for bounding the norm \|X\| of a random matrix X. The key issue is to reduce the random variable \|X\| to a probabilistic structure that we know how to control. The following two approaches are used in much of the literature:

  1. Recall that we can write \|X\|=\sup_{v,w\in B}\langle v,Xw\rangle, where B denotes the unit ball in \mathbb{C}^d. This reduces estimating the norm to estimating the supremum of a family of random variables \langle v,Xw\rangle whose distribution is easily controlled (this is a linear function of the matrix entries). There are many standard probabilistic methods for estimating the supremum of random variables.
  2. Recall that we can write \|X\| = \lim_{p\to\infty}\mathrm{Tr}[X^{2p}]^{1/2p} (the quantity \mathrm{Tr}[X^{2p}]^{1/2p} is the \ell_p-norm of the eigenvalues of X, and the spectral norm is the \ell_\infty norm of the eigenvalues). We can therefore estimate the matrix norm if we can control the moments \mathbf{E}\mathrm{Tr}[X^{2p}]. The latter is the expectation of a polynomial in the matrix entries, and thus the problem reduces to controlling the moments of the matrix entries and counting (this typically involves some combinatorics).

The beauty of these methods is that they reduce the problem of estimating the matrix norm to well-understood probabilistic problems of estimating the maxima or moments of random variables.
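
As a minimal illustration of the second approach, the following Python sketch compares \mathrm{Tr}[X^{2p}]^{1/2p} with \|X\| for a single random positive semidefinite matrix (the matrix and the values of p are arbitrary).

    import numpy as np

    rng = np.random.default_rng(7)
    d = 50

    G = rng.standard_normal((d, d))
    X = G @ G.T / d                          # a random positive semidefinite matrix
    eigs = np.linalg.eigvalsh(X)

    for p in [1, 2, 5, 10, 50]:
        print(f"p = {p:3d}   Tr[X^(2p)]^(1/(2p)) = {np.sum(eigs ** (2 * p)) ** (1 / (2 * p)):.4f}")
    print(f"          ||X||                = {eigs.max():.4f}")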

There is a third representation of the matrix norm that is taught in every linear algebra course. Suppose that the random matrix X is positive semidefinite (as is the case in our setting). Then the matrix norm \|X\| is the largest eigenvalue of X, and therefore coincides with the maximal root of the characteristic polynomial p_X(z) := \text{det}(zI-X) (the roots of p_X are all real as X is Hermitian, so it makes sense to speak of the “maximal” root). That is, we have the representation

    \[\|X\| = \text{maxroot}(p_X).\]

Note that as X is a random matrix, p_X is a random polynomial. Its expectation \mathbf{E}p_X could potentially be a very useful tool in random matrix theory (just as \mathbf{E}\text{Tr}[X^{2p}] was a very useful quantity). Nonetheless, this representation is typically considered to be useless in random matrix theory. The problem is that p\mapsto\text{maxroot}(p) is an extremely nasty functional, and there is no reason for it to behave well under expectations. In essence, the difficulty is that the geometry of roots of polynomials is extremely nonlinear, and the probabilistic behavior of these objects is not well understood.

A major insight of Marcus, Spielman, and Srivastava is that this representation can be extremely useful in certain problems. To exploit it, they use (a variant of) the following remarkable observation.

Lemma. Let p be a random polynomial that takes values in the finite set \{p_1,\ldots,p_n\}. Suppose that p_1,\ldots,p_n all have the same degree and are monic (the leading coefficient is 1), and that all convex combinations of p_1,\ldots,p_n have only real roots. Then \mathbf{P}\left[\text{maxroot}(p) \le \text{maxroot}(\mathbf{E}p)\right] > 0.

This statement is not at all obvious! In particular, it is not true without the assumption on the locations of the roots. To use it, we must show that the random polynomials that arise in the random matrix problem satisfy the required assumption, and we must also find a way to control the maximal root of \mathbf{E}p_X. We can now see why roots of polynomials play a key role throughout the proof, both in its statement (the norm is represented as a maximal root) and in its guts (the locations of the roots are essential to obtain results such as the above lemma). The development of these ideas will require us to develop tools that marry the geometry of roots of polynomials with the probabilistic problems that we are trying to address. These methods will be developed in detail in the following lectures.
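
To get a feel for the statement of the Lemma, here is a tiny numerical illustration in Python; the two quadratics are chosen by hand so that all of their convex combinations are real-rooted, and the random polynomial p takes each value with probability one half.

    import numpy as np

    # p1(x) = x(x-2) = x^2 - 2x and p2(x) = (x-1)^2 are monic of the same degree,
    # and every convex combination x^2 - 2x + (1-lam) has the real roots 1 +/- sqrt(lam).
    p1 = np.array([1.0, -2.0, 0.0])
    p2 = np.array([1.0, -2.0, 1.0])

    def maxroot(p):
        return np.roots(p).real.max()

    avg = 0.5 * (p1 + p2)                    # E p when p equals p1 or p2 with probability 1/2

    print(maxroot(p1), maxroot(p2), maxroot(avg))
    # maxroot(p1) = 2 and maxroot(p2) = 1, while maxroot(E p) = 1 + 1/sqrt(2) ~ 1.71,
    # so the event {maxroot(p) <= maxroot(E p)} indeed has probability 1/2 > 0 here.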

Determinantal processes

Let us now turn to an entirely different application of roots of polynomials in probability theory. We first recall some basic definitions.

A point process on a space E is a random finite or countable subset of E. For example, E might be \{1,\ldots,n\}, \mathbb{R}, or \mathbb{C}. Some interesting examples of point processes include:

  • The eigenvalues of Gaussian random matrices;
  • The roots of an infinite power series with i.i.d. Gaussian coefficients;
  • Given a (finite) graph (V,E), choosing a spanning tree uniformly at random defines a random subset of the edges E;
  • (Karlin-McGregor) Pick n different starting points in \mathbb{Z}, and run n independent random walks starting from these points conditionally on the event that their paths never cross. Then the positions of the n random walks at time n form a point process.

Remarkably, all of these apparently quite unrelated point processes turn out to share a very rich common structure: they are examples of determinantal point processes. For simplicity, let us define this notion only in the discrete case where E=\{1,\ldots,n\}.

Definition. A random subset X of \{1,\ldots,n\} is called a determinantal point process if

    \[\mathbf{P}(S \subseteq X) = \text{det}(A|_S)\]

for every S \subseteq \{1,\ldots,n\}, where A is a given n \times n matrix called the kernel of the point process, and A|_S is the restriction of A to the rows and columns in S.

Determinantal point processes have many interesting properties. For the time being, let us mention one example: the property of negative dependence

    \[\mathbf{P}(i \in X,  j \in X) \le \mathbf{P}(i \in X)\mathbf{P}(j \in X)\]

for all i\ne j, i,j\in\{1,\ldots,n\}. We will (hopefully) encounter more interesting examples later on.
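
For pairs, negative dependence is in fact immediate from the definition, since \text{det}(A|_{\{i,j\}})=A_{ii}A_{jj}-|A_{ij}|^2\le A_{ii}A_{jj}. The following minimal Python sketch checks this numerically for a randomly generated Hermitian kernel with 0\preceq A\preceq I (the construction of A is purely illustrative).

    import numpy as np

    rng = np.random.default_rng(3)
    n = 6

    # A random Hermitian kernel with eigenvalues in [0,1], so that 0 <= A <= I.
    B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    U, _ = np.linalg.qr(B)                                     # random unitary
    A = U @ np.diag(rng.uniform(0, 1, n)) @ U.conj().T

    for i in range(n):
        for j in range(i + 1, n):
            joint = np.linalg.det(A[np.ix_([i, j], [i, j])]).real   # P(i in X, j in X)
            assert joint <= A[i, i].real * A[j, j].real + 1e-12     # <= P(i in X) P(j in X)
    print("negative dependence holds for every pair (i, j)")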

A basic tool for studying determinantal point processes is the generating function

    \[g(z) = \mathbf{E}\left[\prod_{i \in X}z_i\right] =: \mathbf{E}[z^X],\qquad z\in\mathbb{C}^n.\]

Note that g is a multivariate polynomial in n (complex) variables. As an illustration of the utility of the generating function, let us use it to express the property of negative dependence. It is easy to check that

    \[\mathbf{P}(i \in X) = \frac{\partial g}{\partial z_i}(1)\]

and

    \[\mathbf{P}(i \in X,  j \in X) = \frac{\partial^2 g}{\partial z_i \partial z_j}(1),\]

where 1 denotes the vector of ones. Hence, negative dependence is equivalent to the polynomial identity

    \[\frac{\partial^2 g}{\partial z_i \partial z_j}(1) \le \frac{\partial g}{\partial z_i}(1)\frac{\partial g}{\partial z_j}(1).\]
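
For completeness, the derivative identities used above follow by writing out the generating function over the (finitely many) possible realizations of X:

    \[g(z) = \sum_{S\subseteq\{1,\ldots,n\}}\mathbf{P}(X=S)\prod_{i\in S}z_i \qquad\Longrightarrow\qquad \frac{\partial g}{\partial z_i}(1) = \sum_{S:\,i\in S}\mathbf{P}(X=S) = \mathbf{P}(i\in X),\]

and similarly \frac{\partial^2 g}{\partial z_i \partial z_j}(1)=\mathbf{P}(i\in X,\ j\in X) for i\ne j.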

Clearly only very special kinds of polynomials can satisfy such a property. Why should the generating functions of determinantal point processes be of this kind? Some insight can be gained from the following result of Brändén.

Theorem. The following are equivalent.

  1. For all z \in \mathbb{R}^n and 1\le i,j\le n, we have

        \begin{align*} g(z)\frac{\partial^2 g}{\partial z_i \partial z_j}(z) \le \frac{\partial g}{\partial z_i}(z)\frac{\partial g}{\partial z_j}(z). \end{align*}

  2. g has no roots in \{z \in \mathbb{C}^n : \text{Im}(z_i) > 0 \text{ for all }1\le i\le n\}.

If either of the equivalent conditions of this theorem holds, we say that g is real stable. Note that g(1) = 1, and thus we recover the negative dependence relationship if the generating function g is real stable.

While it was initially far from clear why we would care about the roots of polynomials in this setting, we now see that the locations of the roots of the generating function g are apparently intimately related with the negative dependence properties of point processes. The geometry of the roots of polynomials enters in an essential manner in results of this kind. It is not difficult to see that determinantal point processes must have real stable generating functions under some mild assumptions (very roughly speaking, one can compute g(z)=\text{det}(A)\text{det}(A^{-1}-I-\text{diag}(z)) in terms of the kernel A of the determinantal point process; thus if 0\preceq A\preceq I, the generating function is the characteristic polynomial of a positive definite matrix, and thus has no roots in the upper quadrant of \mathbb{C}^n). Such ideas therefore provide a powerful tool to investigate the properties of determinantal point processes.

The simple negative dependence property described above could have been obtained in a more elementary fashion. However, the polynomial method makes it possible to prove more delicate properties, such as stronger negative association properties or concentration of measure properties of determinantal point processes. If time permits, we will try to cover some of these topics at the end of the semester. While these questions are quite distinct from the Kadison-Singer problem, it is remarkable that many of the same tools play a central role in both cases.

Many thanks to Dan Lacker for scribing this lecture!

23. September 2014 by Ramon van Handel
Categories: Roots of polynomials | Leave a comment

Fall 2014: Roots of polynomials and probabilistic applications

Welcome back to the stochastic analysis seminar! We resume this semester with an unusual but exciting topic: the geometry of roots of polynomials and their probabilistic applications.

Understanding the roots of polynomials may seem far removed from probabilistic issues, yet it has recently emerged as an important technique in various unexpected problems in probability as well as in theoretical computer science. As an illustration of the power of such methods, these informal lectures will work through two settings where significant recent progress was made using this idea. The first is the proof of the Kadison-Singer conjecture, in which roots of polynomials are used to study the norm of certain random matrices. The second is the proof that determinantal processes, which arise widely in probability theory, exhibit concentration of measure properties. No prior knowledge of these topics will be assumed.

Time and location: Thursdays, 4:30-6:00, Sherrerd Hall 101 (note different location than last year).
The first lecture will be on September 18.

Notes from the lectures, schedule changes and any other announcements will be posted on this blog.


19. September 2014 by Ramon van Handel
Categories: Roots of polynomials | Leave a comment

End of Fall 2013 seminar

That was it for the Fall 2013 stochastic analysis seminar! Many thanks in particular to our fantastic guest lecturers Mokshay Madiman and Emmanuel Abbe.

There will be no stochastic analysis seminar in the Spring 2014 semester (students interested in related material may want to consider taking the special topics course Probability in High Dimension). We will resume in the Fall 2014 semester with a topic to be announced at that time.

15. December 2013 by Ramon van Handel
Categories: Announcement | Comments Off

Lecture 10. Concentration, information, transportation (2)

Recall the main proposition proved in the previous lecture, which is due to Bobkov and Götze (1999).

Proposition. The following are equivalent for X\sim \mu:

  1. \mathbb{E}[e^{\lambda\{f(X)-\mathbb{E}f(X)\}}] \le e^{\lambda^2\sigma^2/2} for every \lambda>0 and f\in 1\text{-Lip} with finite mean.
  2. W_1(\nu,\mu) \le \sqrt{2\sigma^2 D(\nu||\mu)} for every probability measure \nu \ll \mu.

This result provides a characterization of concentration of Lipschitz functions on a fixed metric space. Our original aim, however, was to understand dimension-free concentration (in analogy with the Gaussian case): that is, for which measures \mu is it true that if X_1,\ldots,X_n are i.i.d. \sim\mu, then every 1\text{-Lip} function f(X_1,\ldots,X_n) is subgaussian with the same parameter \sigma^2 for every n\ge 1? In principle, the above result answers this question: \mu satisfies dimension-free concentration if and only if

    \[W_1(\mathbb{Q},\mu^{\otimes n}) \le \sqrt{2\sigma^2 D(\mathbb{Q}||\mu^{\otimes n})}      \quad\text{for every }\mathbb{Q}\ll\mu^{\otimes n}\text{ and }n\ge 1.\]

However, this is not a very satisfactory characterization: to check whether \mu satisfies dimension-free concentration, we must check that the transportation-information inequality holds for \mu^{\otimes n} for every n\ge 1. Instead, we would like to characterize dimension-free concentration in terms of a property of \mu itself (and not its tensor products). This will be achieved in the present lecture.

Tensorization

How are we going to obtain a transportation-information inequality for every \mu^{\otimes n} starting from only a property of \mu? If one is an optimist, one might hope for a miracle: perhaps the validity of the transportation-information inequality for \mu itself already implies the analogous inequality for all its products \mu^{\otimes n}. In such situations, one says that the inequality tensorizes. As we will shortly see, the inequality W_1(\nu,\mu) \le \sqrt{2\sigma^2 D(\nu||\mu)} does not tensorize in precisely the manner that one would hope, but this naive idea will nonetheless lead us in the right direction.

We will develop the tensorization result in a slightly more general setting that will be useful in the sequel. Recall that \mathcal{C}(\mu,\nu):=\{\mathrm{Law}(X,Y):X\sim\mu,Y\sim \nu\} denotes the set of couplings between probability measures \mu and \nu, and that the Wasserstein distance W_1(\mu,\nu) can be written as

    \[W_1(\mu,\nu) = \inf_{\mathbb{M}\in\mathcal{C}(\mu,\nu)}\mathbb{E_M}[d(X,Y)].\]

We can now state a general tensorization result for transportation-information type inequalities. The proof is due to Marton (1996), who initiated the use of this method to study concentration.

Proposition. (Tensorization) Let \phi:\mathbb{R}_+\to\mathbb{R}_+ be a convex function, and let w:\mathbb{X}\times\mathbb{X}\to\mathbb{R}_+ be a positive weight function on the metric space (\mathbb{X},d). Fix a probability measure \mu on \mathbb{X}. If

    \[\inf_{\mathbb{M}\in C(\nu,\mu)} \phi(\mathbb{E}_{\mathbb{M}}[w(X,Y)])            \le c\, D(\nu || \mu)\]

holds for all \nu\ll\mu, then

    \[\inf_{\mathbb{M}\in C(\mathbb{Q},\mu^{\otimes n})} \sum_{i=1}^n          \phi(\mathbb{E}_{\mathbb{M}}[w(X_i,Y_i)]) \le c\, D(\mathbb{Q} || \mu^{\otimes n})\]

holds for every \mathbb{Q}\ll\mu^{\otimes n} and every n\ge 1.

Proof. The conclusion holds for n=1 by assumption. The proof now proceeds by induction on n: we will suppose in the sequel that the conclusion holds for n-1, and deduce the conclusion for n from that.

Fix \mathbb{Q}\ll\mu^{\otimes n}. We denote by \mathbb{Q}^{n-1} the marginal of \mathbb{Q} on the first n-1 coordinates, and define the conditional distribution \mathbb{Q}_{X_1,\ldots,X_{n-1}}=\mathbb{Q}(X_n\in\,\cdot\,|X_1,\ldots,X_{n-1}) (whose existence is guaranteed by the Bayes formula as \mathbb{Q}\ll\mu^{\otimes n}). The key idea of the proof is to exploit the chain rule for relative entropy:

    \[D(\mathbb{Q}||\mu^{\otimes n}) =    D(\mathbb{Q}^{n-1}||\mu^{\otimes n-1}) +    \mathbb{E_Q}[D(\mathbb{Q}_{X_1,\ldots,X_{n-1}} || \mu )].\]

The first term on the right can be bounded below by the induction hypothesis, while the second term can be bounded below by the assumption of the Proposition. In particular, fixing \varepsilon>0, it follows that we can choose \mathbb{M}^{n-1}\in\mathcal{C}(\mathbb{Q}^{n-1},\mu^{\otimes n-1}) and \mathbb{M}_{X_1,\ldots,X_{n-1}}\in\mathcal{C}(\mathbb{Q}_{X_1,\ldots,X_{n-1}},\mu) such that

    \begin{align*}    c\,D(\mathbb{Q}^{n-1}||\mu^{\otimes n-1}) &\ge     \sum_{i=1}^{n-1}\phi(\mathbb{E}_{\mathbb{M}^{n-1}}[w(X_i,Y_i)]) - \varepsilon,\\    c\,D(\mathbb{Q}_{X_1,\ldots,X_{n-1}} || \mu ) &\ge    \phi(\mathbb{E}_{\mathbb{M}_{X_1,\ldots,X_{n-1}}}[w(X_n,Y_n)]) - \varepsilon. \end{align*}

If we now define the probability measure \mathbb{M} as

    \[\mathbb{E_M}[f(X_1,\ldots,X_n,Y_1,\ldots,Y_n)] =     \mathbb{E}_{\mathbb{M}^{n-1}}\bigg[\int f(X_1,\ldots,X_{n-1},x,Y_1,\ldots,Y_{n-1},y)\mathbb{M}_{X_1,\ldots,X_{n-1}}(dx,dy)\bigg],\]

then we can estimate by Jensen’s inequality

    \begin{align*}    c\,D(\mathbb{Q}||\mu^{\otimes n}) &\ge    \sum_{i=1}^{n-1}\phi(\mathbb{E}_{\mathbb{M}^{n-1}}[w(X_i,Y_i)]) +    \mathbb{E_Q}[\phi(\mathbb{E}_{\mathbb{M}_{X_1,\ldots,X_{n-1}}}[w(X_n,Y_n)])]    - 2\varepsilon \\ &\ge    \sum_{i=1}^{n}\phi(\mathbb{E}_{\mathbb{M}}[w(X_i,Y_i)]) - 2\varepsilon. \end{align*}

But evidently \mathbb{M}\in\mathcal{C}(\mathbb{Q},\mu^{\otimes n}), and thus

    \[c\,D(\mathbb{Q}||\mu^{\otimes n}) \ge \inf_{\mathbb{M}\in    \mathcal{C}(\mathbb{Q},\mu^{\otimes n})} \sum_{i=1}^{n}\phi(\mathbb{E}_{\mathbb{M}}    [w(X_i,Y_i)]) - 2\varepsilon.\]

The proof is completed by letting \varepsilon\downarrow 0. \square

Remark. In the above proof, we have swept a technicality under the rug: we assumed that an \varepsilon-optimal coupling \mathbb{M}_{X_1,\ldots,X_{n-1}}\in\mathcal{C}(\mathbb{Q}_{X_1,\ldots,X_{n-1}},\mu) can be chosen to be measurable as a function of X_1,\ldots,X_{n-1}. This can generally be justified by standard methods (e.g., on Polish spaces by a measurable selection argument, or in special cases such as w(x,y)=\mathbf{1}_{x\ne y} by an explicit construction of the optimal coupling).

Now assume that \mu satisfies the transportation-information inequality

    \[W_1(\nu,\mu) \le \sqrt{2\sigma^2 D(\nu||\mu)}\quad\text{for all }\nu\ll\mu,\]

which characterizes concentration in a fixed metric space. This corresponds to the setting of the above tensorization result with \phi(x)=x^2 and w(x,y)=d(x,y). Tensorization then yields

    \[\inf_{\mathbb{M}\in C(\mathbb{Q},\mu^{\otimes n})} \sqrt{\sum_{i=1}^n          \mathbb{E}_{\mathbb{M}}[d(X_i,Y_i)]^2} \le \sqrt{2\sigma^2 D(\mathbb{Q} ||          \mu^{\otimes n})}\quad\text{for all }\mathbb{Q}\ll\mu^{\otimes n},~n\ge 1.\]

Unfortunately, the left-hand side of this inequality is not itself a Wasserstein distance, and so we do not automatically obtain a transportation-information inequality in higher dimension. In the previous lecture, we showed that one can bound the left-hand side from below by a Wasserstein metric with respect to an \ell_1-type distance using the Cauchy-Schwarz inequality. However, we then lose a factor n^{-1/2} by Cauchy-Schwarz, and thus the dimension-free nature of the concentration is lost.

The above computation, however, suggests how we can “fix” our assumption to obtain dimension-free concentration. Note that the left-hand side of the tensorization inequality above is the \ell_2-norm of the vector of expectations (\mathbb{E}_{\mathbb{M}}[d(X_i,Y_i)])_{i\le n}. If we could take the \ell_2 norm inside the expectation, rather than outside, then the left-hand side would be a Wasserstein distance between probability measures on (\mathbb{X}^n,d_n) with respect to the \ell_2-distance d_n(x,y):=\sqrt{\sum_{i=1}^nd(x_i,y_i)^2}! In order to engineer such a stronger inequality, however, we must begin with a stronger assumption.

To this end, define the quadratic Wasserstein distance

    \[W_2(\mu,\nu) := \inf_{\mathbb{M}\in\mathcal{C}(\mu,\nu)}\sqrt{\mathbb{E_M}[d(X,Y)^2]}.\]

Suppose that \mu satisfies the quadratic transportation cost inequality (QTCI)

    \[W_2(\mu,\nu) \le \sqrt{2\sigma^2 D(\nu||\mu)}\quad\text{for all }\nu\ll\mu.\]

Then applying tensorization with \phi(x)=x and w(x,y)=d(x,y)^2, and noting that \inf_{\mathbb{M}}\sum_{i=1}^n\mathbb{E}_{\mathbb{M}}[d(X_i,Y_i)^2] is precisely W_2(\mathbb{Q},\mu^{\otimes n})^2 for the \ell_2-distance d_n, we immediately obtain

    \[W_2(\mathbb{Q},\mu^{\otimes n}) \le \sqrt{2\sigma^2 D(\mathbb{Q}||\mu^{\otimes n})}     \quad\text{for all }\mathbb{Q}\ll\mu^{\otimes n},~n\ge 1.\]

On the other hand, as obviously W_1(\mu,\nu)\le W_2(\mu,\nu) by Jensen’s inequality, we immediately find that QTCI implies dimension-free concentration: that is, we have proved

Corollary. Suppose that the probability measure \mu satisfies

    \[W_2(\mu,\nu) \le \sqrt{2\sigma^2 D(\nu||\mu)}\quad\text{for all }\nu\ll\mu.\]

Then we have dimension-free concentration, that is, for X_1,X_2,\ldots i.i.d. \sim\mu

    \[\mathbb{P}[f(X_1,\ldots,X_n)-\mathbb{E}f(X_1,\ldots,X_n)\ge t] \le e^{-t^2/2\sigma^2}\]

for all n\ge 1, t\ge 0 and all functions f with |f(x)-f(y)|\le \sqrt{\sum_{i=1}^n d(x_i,y_i)^2} and finite mean.

The observation that the quadratic transportation cost inequality yields dimension-free concentration with respect to the \ell_2-metric is due to Talagrand (1996), who used it to prove Gaussian concentration.

Characterizing dimension-free concentration

We began this topic by showing that the transportation cost inequality

    \[W_1(\mu,\nu) \le \sqrt{2\sigma^2D(\nu||\mu)}\quad\text{for all }\nu\ll\mu\]

is necessary and sufficient for a probability \mu on a fixed metric space (\mathbb{X},d) to exhibit concentration of Lipschitz functions. However, this does not suffice to obtain dimension-free concentration, that is, concentration of the product measure \mu^{\otimes n} for every n\ge 1. To obtain the latter, we “fixed” our original assumption by imposing the stronger quadratic transportation cost inequality

    \[W_2(\mu,\nu) \le \sqrt{2\sigma^2D(\nu||\mu)}\quad\text{for all }\nu\ll\mu.\]

As was shown above, this inequality is sufficient to obtain dimension-free concentration. But it is far from clear whether it should also be necessary: by strengthening the inequality, we have lost the connection with the proof on a fixed space (which relies on the variational property of relative entropy). It is therefore a remarkable fact that the quadratic transportation cost inequality proves to be both necessary and sufficient for dimension-free concentration, as was observed by Gozlan (2009).

Theorem. Let \mu be a probability measure on a Polish space (\mathbb{X},d), and let X_1,X_2,\ldots be i.i.d. \sim\mu. Then the following are equivalent:

  1. Dimension-free concentration:

        \[\mathbb{P}[f(X_1,\ldots,X_n)-\mathbb{E}f(X_1,\ldots,X_n)\ge t] \le e^{-t^2/2\sigma^2}\]

    for all n\ge 1, t\ge 0 and |f(x)-f(y)|\le \sqrt{\sum_{i=1}^n d(x_i,y_i)^2} with finite mean.

  2. Quadratic transportation cost inequality:

        \[W_2(\mu,\nu) \le \sqrt{2\sigma^2 D(\nu||\mu)}\]

    for every probability measure \nu \ll \mu.

This result effectively resolves the question we set out to answer.

Proof. We have already shown that 2 implies 1. In the sequel, we will prove that 1 implies 2: that is, the validity of the quadratic transportation cost inequality is necessary for dimension-free concentration. The proof of this fact is a surprising application of Sanov’s theorem (see Lecture 3).

We will need the following three facts that will be proved below.

  1. Law of large numbers: \mathbb{E}[W_2(\frac{1}{n}\sum_{i=1}^n\delta_{X_i},\mu)]\to 0 as n\to\infty.
  2. Lower semicontinuity: O_t:=\{\nu:W_2(\mu,\nu)>t\} is open in the weak convergence topology.
  3. Lipschitz property: the map g_n:(x_1,\ldots,x_n)\mapsto W_2(\frac{1}{n}\sum_{i=1}^n\delta_{x_i},\mu) is n^{-1/2}-Lipschitz.

The first two claims are essentially technical exercises: the empirical measures \frac{1}{n}\sum_{i=1}^n\delta_{X_i} converge weakly to \mu by the law of large numbers, so the only difficulty is to verify that the convergence holds in the slightly stronger sense of the quadratic Wasserstein distance; and lower-semicontinuity of the quadratic Wasserstein distance is an elementary technical fact. The third claim is a matter of direct computation, which we will do below. Let us take these claims for granted and complete the proof.
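
For intuition, the law of large numbers in W_2 (claim 1) is easy to observe numerically in one dimension, where W_2(\nu,\mu) coincides with the L^2 distance between the quantile functions of \nu and \mu. A small Python sketch for \mu=N(0,1) follows; the grid approximation of the quantile integral is crude but suffices for illustration.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(4)

    def w2_to_standard_normal(samples, grid_size=10_000):
        # On R, W_2(nu, mu)^2 = int_0^1 (F_nu^{-1}(u) - F_mu^{-1}(u))^2 du; we
        # approximate the integral on a uniform grid of quantile levels.
        u = (np.arange(grid_size) + 0.5) / grid_size
        return np.sqrt(np.mean((np.quantile(samples, u) - norm.ppf(u)) ** 2))

    for n in [100, 1_000, 10_000, 100_000]:
        dist = w2_to_standard_normal(rng.standard_normal(n))
        print(f"n = {n:7d}   W_2(empirical measure, N(0,1)) ~ {dist:.3f}")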

As O_t is open, we can apply Sanov’s theorem as follows:

    \[-\inf_{\nu\in O_t} D(\nu || \mu)  \le    \liminf_{n\to\infty}\frac{1}{n}\log\mathbb{P}\bigg[     \frac{1}{n}\sum_{i=1}^n\delta_{X_i}\in O_t    \bigg] =    \liminf_{n\to\infty}\frac{1}{n}\log\mathbb{P}\bigg[     W_2\bigg(\frac{1}{n}\sum_{i=1}^n\delta_{X_i},\mu\bigg)>t    \bigg].\]

But as the function g_n is n^{-1/2}-Lipschitz, dimension-free concentration implies

    \[\mathbb{P}\bigg[     W_2\bigg(\frac{1}{n}\sum_{i=1}^n\delta_{X_i},\mu\bigg)>t    \bigg] =    \mathbb{P}[     g_n(X_1,\ldots,X_n)>t    ] \le e^{-n(t-\mathbb{E}[g_n(X_1,\ldots,X_n)])^2/2\sigma^2}.\]

Thus we have

    \[-\inf_{\nu\in O_t} D(\nu || \mu)  \le    -\limsup_{n\to\infty} \frac{(t-\mathbb{E}[g_n(X_1,\ldots,X_n)])^2}{2\sigma^2}    = -\frac{t^2}{2\sigma^2},\]

where we have used \mathbb{E}[g_n(X_1,\ldots,X_n)]=\mathbb{E}[W_2(\frac{1}{n}\sum_{i=1}^n\delta_{X_i},\mu)]\to 0. In particular, we have proved

    \[\sqrt{2\sigma^2 D(\nu || \mu)}  \ge    t\quad\text{whenever}\quad W_2(\mu,\nu)>t>0.\]

The quadratic transportation cost inequality follows readily (let t=W_2(\mu,\nu)-\varepsilon and \varepsilon\downarrow 0). \square

It remains to establish the claims used in the proof. Let us begin with the Lipschitz property of g_n.

Lemma. (Claim 3) The map g_n:(x_1,\ldots,x_n)\mapsto W_2(\frac{1}{n}\sum_{i=1}^n\delta_{x_i},\mu) is n^{-1/2}-Lipschitz.

Proof. Let \mathbb{M}\in C(\frac{1}{n}\sum_{i=1}^n \delta_{x_i},\mu). If we define \mu_i = \mathbb{M} [Y\in\,\cdot\, | X=x_i], then we evidently have

    \[\mathbb{E_M}[f(X,Y)]=\frac{1}{n}\sum_{i=1}^n\int f(x_i,y)\,\mu_i(dy),\qquad\quad    \frac{1}{n}\sum_{i=1}^n\mu_i=\mu.\]

Conversely, every family of measures \mu_1,\ldots,\mu_n such that \frac{1}{n}\sum_{i=1}^n\mu_i=\mu defines a coupling \mathbb{M}\in C(\frac{1}{n}\sum_{i=1}^n \delta_{x_i},\mu) in this manner. We can therefore estimate as follows:

    \begin{align*}         &W_2\bigg(\frac{1}{n} \sum_{i=1}^n\delta_{x_i}, \mu\bigg)         - W_2\bigg(\frac{1}{n} \sum_{i=1}^n\delta_{\tilde x_i}, \mu\bigg)\\         &= \inf_{\frac{1}{n}\sum_{i=1}^n \mu_i = \mu}         \bigg[\frac{1}{n} \sum_{i=1}^n \int d(x_i,y)^2 \mu_i(dy)\bigg]^{1/2}         - \inf_{\frac{1}{n}\sum_{i=1}^n \mu_i = \mu}         \bigg[\frac{1}{n} \sum_{i=1}^n \int d(\tilde x_i,y)^2 \mu_i(dy)\bigg]^{1/2} \\         &\le \sup_{\frac{1}{n}\sum_{i=1}^n \mu_i = \mu}         \left\{         \bigg[\frac{1}{n} \sum_{i=1}^n \int d(x_i,y)^2 \mu_i(dy)\bigg]^{1/2}         -         \bigg[\frac{1}{n} \sum_{i=1}^n \int d(\tilde x_i,y)^2 \mu_i(dy)\bigg]^{1/2}         \right\}\\         &\le \sup_{\frac{1}{n}\sum_{i=1}^n \mu_i = \mu}         \bigg[\frac{1}{n} \sum_{i=1}^n \int          \{d(x_i,y)-d(\tilde x_i,y)\}^2 \mu_i(dy)\bigg]^{1/2} \\         &\le \frac{1}{\sqrt{n}} \bigg[\sum_{i=1}^n d(x_i,\tilde x_i)^2\bigg]^{1/2}, \end{align*}

where in the last two lines we used, respectively, the reverse triangle inequality for L^2 norms (that is, \| X \|_2 - \| Y \|_2 \le \| X - Y \|_2) and the reverse triangle inequality for the metric d. \square

Next, we prove lower-semicontinuity of W_2. This is an exercise in using weak convergence.

Lemma. (Claim 2) \nu\mapsto W_2(\nu,\mu) is lower-semicontinuous in the weak convergence topology.

Proof. Let \nu_1,\nu_2,\ldots be probability measures such that \nu_n\to \nu weakly as n\to\infty. We must show

    \[\liminf_{n\to\infty}W_2(\nu_n,\mu)\ge W_2(\nu,\mu).\]

Fix \varepsilon>0, and choose for each n a coupling \mathbb{M}_n\in\mathcal{C}(\nu_n,\mu) such that

    \[W_2(\nu_n,\mu) \ge \sqrt{\mathbb{E}_{\mathbb{M}_n}[d(X,Y)^2]}-\varepsilon.\]

We claim that the sequence (\mathbb{M}_n)_{n\ge 1} is tight. Indeed, the sequence (\nu_n)_{n\ge 1} is tight (as it converges) and clearly \mu itself is tight. For any \delta>0, choose a compact set K_\delta such that \nu_n(K_\delta)\ge 1-\delta/2 for all n\ge 1 and \mu(K_\delta)\ge 1-\delta/2. Then evidently \mathbb{M}_n(K_\delta\times K_\delta) \ge 1-\delta, and thus tightness follows.

By tightness, we can choose a subsequence n_k\uparrow\infty such that \liminf_{n}W_2(\nu_n,\mu)=\lim_kW_2(\nu_{n_k},\mu) and \mathbb{M}_{n_k}\to\mathbb{M} weakly for some probability measure \mathbb{M}. As d is continuous and nonnegative, we obtain

    \[\liminf_{n\to\infty} W_2(\nu_n,\mu) \ge    \lim_{k\to\infty}\sqrt{\mathbb{E}_{\mathbb{M}_{n_k}}[d(X,Y)^2]} - \varepsilon \ge    \sqrt{\mathbb{E}_{\mathbb{M}}[d(X,Y)^2]} - \varepsilon.\]

But as \mathbb{M}\in\mathcal{C}(\nu,\mu), we have shown \liminf_{n\to\infty} W_2(\nu_n,\mu) \ge W_2(\nu,\mu)-\varepsilon. We conclude using \varepsilon\downarrow 0. \square

Finally, we prove the law of large numbers in W_2. This is an exercise in truncation.

Lemma. (Claim 1) Suppose that \mu satisfies the Lipschitz concentration property. Then the law of large numbers holds in the sense \mathbb{E}[W_2(\frac{1}{n}\sum_{i=1}^n\delta_{X_i},\mu)]\to 0 as n\to\infty for X_1,X_2,\ldots i.i.d. \sim\mu.

Proof. Let x^*\in\mathbb{X} be some arbitrary point. We truncate the Wasserstein distance as follows:

    \begin{align*}    W_2(\mu,\nu)^2    &= \inf_{\mathbb{M}\in\mathcal{C}(\mu,\nu)}\{    \mathbb{E_M}[d(X,Y)^2\mathbf{1}_{d(X,Y)\le a}] +    \mathbb{E_M}[d(X,Y)^2\mathbf{1}_{d(X,Y)>a}] \} \\    &\le a\inf_{\mathbb{M}\in\mathcal{C}(\mu,\nu)}\mathbb{E_M}[d(X,Y)\wedge a] +    \frac{4\int d(x,x^*)^3\{\mu(dx)+\nu(dx)\}}{a} \end{align*}

where we used (b+c)^3 \le 4(b^3 + c^3) for b,c\ge 0. We claim that if \nu_n\to\mu weakly, then

    \[\inf_{\mathbb{M}\in\mathcal{C}(\nu_n,\mu)}\mathbb{E_M}[d(X,Y)\wedge a]\xrightarrow{n\to\infty}0.\]

Indeed, by the Skorokhod representation theorem, we can construct random variables (X_n)_{n\ge 1} and X on a common probability space such that X_n\sim\nu_n for every n, X\sim\mu, and X_n\to X a.s. Thus \mathbb{E}[d(X_n,X)\wedge a]\to 0 by bounded convergence, and as \mathrm{Law}(X_n,X)\in\mathcal{C}(\nu_n,\mu) the claim follows. Thus \nu_n\to\mu implies W_2(\nu_n,\mu)\to 0 if we can control the second term in the above truncation.

Denote the empirical measure as \mu_n=\frac{1}{n}\sum_{i=1}^n\delta_{X_i}. Recall that \mu_n\to\mu weakly a.s. by the law of large numbers. Therefore, following the above reasoning, we obtain

    \[\limsup_{n\to\infty}\mathbb{E}[W_2(\mu_n,\mu)^2]     \le     \frac{8\int d(x,x^*)^3\mu(dx)}{a}\]

for every a>0. Thus it remains to show that \int d(x,x^*)^3\mu(dx)<\infty. But as x\mapsto d(x,x^*) is evidently Lipschitz (with constant 1), this follows directly from the following Lemma. \square

Finally, we have used in the last proof the following lemma, which shows that if \mu satisfies the Lipschitz concentration property then any Lipschitz function has all finite moments. In particular, every Lipschitz function has finite mean, which means that the qualifier “with finite mean” used above in our definition of (dimension-free) concentration is superfluous and can therefore be dropped.

Lemma. Suppose that the probability measure \mu satisfies the Lipschitz concentration property. Then any Lipschitz function f satisfies \int |f(x)|^q\,\mu(dx)<\infty for every 0< q<\infty.

Proof. Let f be L-Lipschitz. It suffices to prove that |f| has finite mean. If this is the case, then the Lipschitz concentration property implies for every 0<q<\infty that

    \[\int |f(x)|^q\,\mu(dx) =    q\int_0^\infty x^{q-1}\,\mathbb{P}[|f|\ge x]\,dx \le    q\int_0^\infty x^{q-1}\,\big(\mathbf{1}_{x\le\mathbb{E}|f|}+e^{-(x-\mathbb{E}|f|)^2/2\sigma^2L^2}\big)\,dx<\infty,\]

where we note that |f| is Lipschitz with the same constant as f. To prove that |f| has finite mean, let us apply the Lipschitz concentration property to -\{|f|\wedge a\} (which certainly has finite mean). This gives

    \[\mathbb{P}[|f|\wedge a\le \mathbb{E}(|f|\wedge a)-t]\le e^{-t^2/2\sigma^2L^2}.\]

Now choose t such that e^{-t^2/2\sigma^2L^2}<1/2. Then clearly \mathbb{E}(|f|\wedge a)-t\le\mathrm{Med}(|f|\wedge a). But note that the median \mathrm{Med}(|f|\wedge a)=\mathrm{Med}|f| for a>\mathrm{Med}|f|. Thus we obtain \mathbb{E}|f|\le \mathrm{Med}|f|+t<\infty as a\uparrow\infty. \square

Gaussian concentration

We started our discussion of dimension-free concentration with the classical Gaussian concentration property of Tsirelson, Ibragimov, and Sudakov. It therefore seems fitting to conclude by giving a proof of this result using the machinery that we have developed: we only need to show that the standard normal N(0,1) on (\mathbb{R},|\cdot|) satisfies the quadratic transportation cost inequality. [It should be noted that there are numerous other proofs of Gaussian concentration, each with their own interesting ideas.]

Proposition. Let \mu=N(0,1) on (\mathbb{R},|\cdot|). Then W_2(\nu,\mu)\le\sqrt{2D(\nu||\mu)} for all \nu\ll\mu.

This result is due to Talagrand (1996). Talagrand’s proof exploits the fact that optimal transportation problems on \mathbb{R} admit an explicit solution in terms of quantile functions, which makes it possible to establish inequalities on \mathbb{R} by calculus manipulations. In contrast, optimal transportation problems on \mathbb{R}^d for d\ge 2 are far from trivial (see the excellent introductory and comprehensive texts by Villani). We therefore see that tensorization is really the key to a tractable proof by Talagrand’s method.
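
Before giving a proof, one can at least sanity-check the inequality within the Gaussian family, where both sides have closed forms: for \nu=N(m,s^2) one has W_2(\nu,\mu)^2=m^2+(s-1)^2 and D(\nu||\mu)=\frac{1}{2}(m^2+s^2-1-2\log s), so the inequality reduces to \log s\le s-1. A minimal numerical check in Python:

    import numpy as np

    # For nu = N(m, s^2) and mu = N(0,1) on the real line:
    #   W_2(nu, mu)^2 = m^2 + (s - 1)^2,
    #   D(nu || mu)   = (m^2 + s^2 - 1 - 2 log s) / 2,
    # so W_2^2 <= 2 D reduces to the elementary inequality log s <= s - 1.
    m = np.linspace(-5, 5, 101)[:, None]
    s = np.linspace(0.05, 5, 100)[None, :]

    w2_squared = m**2 + (s - 1)**2
    kl = (m**2 + s**2 - 1 - 2*np.log(s)) / 2
    assert np.all(w2_squared <= 2*kl + 1e-12)
    print("W_2^2 <= 2 D(nu||mu) holds on the whole (m, s) grid")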

Instead of going down this road, we will present a lovely short proof of the transportation-information inequality due to Djellout, Guillin, and Wu (2004) that uses stochastic calculus.

Proof. Denote by \mathbb{P} the law of standard Brownian motion (W_t)_{t\in[0,1]}. Fix a probability measure \nu\ll\mu such that D(\nu||\mu)<\infty, and define the probability measure \mathbb{Q} as

    \[d\mathbb{Q} = \frac{d\nu}{d\mu}(W_1)\,d\mathbb{P}.\]

Then clearly W_1\sim\mu under \mathbb{P} and W_1\sim\nu under \mathbb{Q}.

Note that M_t=\mathbb{E}[\frac{d\nu}{d\mu}(W_1)|\mathcal{F}_t]=(\frac{d\nu}{d\mu}*\phi_{1-t})(W_t) is a uniformly integrable martingale and M_t>0 for 0\le t<1 (here \phi_s denotes the density of N(0,s)). Thus we find that

    \[\frac{d\nu}{d\mu}(W_1) = \exp\bigg(         \int_0^1 \beta_t\,dW_t -\frac{1}{2}\int_0^1 \beta_t^2\,dt      \bigg)\]

for some nonanticipating process (\beta_t)_{t\in[0,1]} by the martingale representation theorem and Itô's formula. But then Girsanov's theorem implies that the stochastic process defined by

    \[Y_t := W_t - \int_0^t \beta_s\,ds\]

is Brownian motion under \mathbb{Q}. Thus the law of (W_1,Y_1) under \mathbb{Q} is a coupling of \nu and \mu, and

    \[W_2(\mu,\nu)^2 \le \mathbb{E_Q}[|W_1-Y_1|^2] \le      \mathbb{E_Q}\bigg[\int_0^1 \beta_t^2\,dt\bigg]\]

by Jensen's inequality. The proof is therefore complete once we prove that

    \[\mathbb{E_Q}\bigg[\int_0^1 \beta_t^2\,dt\bigg] = 2D(\nu||\mu).\]

To see this, note that

    \[D(\nu||\mu) = \mathbb{E_Q}\bigg[\log\frac{d\nu}{d\mu}(W_1)\bigg]     = \mathbb{E_Q}\bigg[         \int_0^1 \beta_t\,dY_t + \frac{1}{2}\int_0^1 \beta_t^2\,dt     \bigg].\]

If \mathbb{E_Q}[\int_0^1 \beta_t^2dt]<\infty then the dY_t integral is a square-integrable martingale; thus its expectation vanishes and the proof is complete. However, it is not difficult to show using a simple localization argument that D(\nu||\mu)<\infty implies \mathbb{E_Q}[\int_0^1 \beta_t^2dt]<\infty, see Lemma 2.6 of Föllmer for a careful proof. \square

Remark. To be fair, it should be noted that the above stochastic calculus proof works just as easily in \mathbb{R}^d for any d. Thus we could directly establish the transportation-information inequality (and therefore concentration) in any dimension in this manner without going through the tensorization argument.

Lecture by Ramon van Handel | Scribed by Patrick Rebeschini

15. December 2013 by Ramon van Handel
Categories: Information theoretic methods | Comments Off

Lecture 9. Concentration, information, transportation (1)

The goal of the next two lectures is to explore the connections between concentration of measure, entropy inequalities, and optimal transportation.

What is concentration?

Roughly speaking, concentration of measure is the idea that nonlinear functions of many random variables often have “nice tails”. Concentration represents one of the most important ideas in modern probability and has played an important role in many other fields, such as statistics, computer science, combinatorics, and geometric analysis.

To illustrate concentration, let X\sim N(0,1). Using Markov’s inequality, we can estimate for t\ge 0

    \[\mathbb{P}[X\ge t]          = \inf_{\lambda>0} \mathbb{P}[e^{\lambda X}\ge e^{\lambda t}]         \le \inf_{\lambda>0} e^{-\lambda t} \mathbb{E}[e^{\lambda X}]         = \inf_{\lambda>0} e^{\lambda^2/2 - \lambda t}         = e^{-t^2/2},\]

where we have made the optimal choice \lambda=t (this is called a Chernoff bound). Thus a Gaussian random variable X has “nice tails”. The concentration of measure phenomenon shows that not only do Gaussian random variables have nice tails, but that many nonlinear functions of Gaussian random variables still have nice tails. The classical result on this topic is the following.

Theorem. (Gaussian concentration, Tsirelson-Ibragimov-Sudakov 1976)
Let X_1,\ldots,X_n be i.i.d. N(0,1), and let f:\mathbb{R}^n\rightarrow\mathbb{R} be L-Lipschitz, that is, |f(x)-f(y)|\le L \| x - y \| for each x,y\in\mathbb{R}^n (where \|\cdot\| is the Euclidean norm). Then

    \[\mathbb{P}[f(X_1,\ldots,X_n)-\mathbb{E} f(X_1,\ldots,X_n) \ge t] \le e^{-t^2/2L^2}.\]

This result shows that Lipschitz functions (which could be very nonlinear) of i.i.d. Gaussian random variables concentrate closely around their expected value: the probability that the function differs from its expected value decays as a Gaussian in the degree of the discrepancy. The beauty of this result is that it is dimension-free, that is, the rate of decay of the tail only depends on the Lipschitz constant L, and there is no dependence on the number of random variables n. Such results are essential in high-dimensional problems where one would like to obtain dimension-free bounds.
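
To see the dimension-free nature of the bound in a simulation, consider the 1-Lipschitz function f(x)=\|x\|; the following Python sketch (with arbitrary parameters) shows that its fluctuations around the mean do not grow with n.

    import numpy as np

    rng = np.random.default_rng(5)
    trials = 1000

    for n in [10, 1_000, 10_000]:
        # f(x) = ||x|| is 1-Lipschitz on R^n; empirically its standard deviation
        # stays around 0.7 for every n, while the mean grows like sqrt(n).
        norms = np.linalg.norm(rng.standard_normal((trials, n)), axis=1)
        print(f"n = {n:6d}   mean ~ {norms.mean():7.2f}   std ~ {norms.std():.3f}")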

Gaussian concentration is only one result in a theory with many fascinating ideas and questions. One might ask, for instance, what random variables beside Gaussians exhibit this type of phenomenon, whether other notions of Lipschitz functions can be considered, etc. See, for example, the excellent books by Ledoux and Boucheron, Lugosi, and Massart for much more around this theme.

The basic question that we want to address in the next two lectures is the following.

Question. Let X_1,\ldots,X_n be i.i.d. random variables on a metric space (\mathbb{X},d) with distribution \mu. Can we characterize all \mu for which dimension-free concentration holds as for the Gaussian case above?

It turns out that a remarkably complete answer can be given in terms of entropy inequalities. This is where information-theoretic methods enter the picture.

When bounding the tails of random variables (at least when the tails decay exponentially or faster), it is convenient to bound moment generating functions, as we did above, instead of working directly with the tails.

Definition. X is subgaussian with parameter \sigma^2 if \mathbb{E}[e^{\lambda (X-\mathbb{E}X)}] \le e^{\lambda^2\sigma^2/2} for each \lambda\in\mathbb{R}.

If X is subgaussian, then \mathbb{P}[X-\mathbb{E}X \ge t]\le e^{-t^2/2\sigma^2} (by the Chernoff bound as above). In fact, it can be shown that the converse is also true for slightly larger \sigma, so that the property of having Gaussian tails is equivalent to the subgaussian property (we omit the proof). It will be convenient to investigate the property of Gaussian tails in terms of subgaussianity.

Concentration and relative entropy

Before we can tackle the problem of dimension-free concentration, we must begin by making the connection between subgaussianity and entropy in the most basic setting.

Let (\mathbb{X},d) be a metric space. A function f:\mathbb{X}\rightarrow \mathbb{R} is L-Lipschitz (L-Lip) if |f(x)-f(y)|\le L d(x,y) for each x,y\in\mathbb{X}. One thing that we can do with Lipschitz functions is to define a distance between probability measures (we will assume in the sequel that the necessary measurability conditions are satisfied): for probability measures \mu,\nu on \mathbb{X}, define the Wasserstein distance as

    \[W_1(\mu,\nu) := \sup_{f\in 1\text{-Lip}} \left| \int f d\mu - \int f d\nu \right|.\]

The idea is that two measures are close if the expectations of a large class of functions are close. In the case of W_1, the class of functions being used is the class 1-Lip.

As we are interested in concentration of Lipschitz functions, it is intuitive that a quantity such as W_1 should enter the picture. On the other hand, we have seen in earlier lectures that the relative entropy D(\nu || \mu) can also be viewed as a “sort of distance” between probability measures (albeit not a metric). It is not clear, a priori, how W_1 and D are related. We will presently see that relative entropy is closely related to moment generating functions, and therefore to tails of random variables: in particular, we can characterize concentration on a fixed space by comparing the Wasserstein metric and relative entropy.

Proposition. The following are equivalent for X\sim \mu:

  1. \mathbb{E}[e^{\lambda\{f(X)-\mathbb{E}f(X)\}}] \le e^{\lambda^2\sigma^2/2} for every \lambda>0 and f\in 1\text{-Lip}.
  2. W_1(\nu,\mu) \le \sqrt{2\sigma^2 D(\nu||\mu)} for every probability measure \nu\ll\mu.

Note that this result characterizes those measures \mu on a fixed metric space (\mathbb{X},d) that exhibit Gaussian concentration. There is no independence as of yet, and thus no notion of “dimension-free” concentration for functions of independent random variables: the present result is in “fixed dimension”.

Example. Let d(x,y)=\mathbf{1}_{x\neq y} be the trivial metric. A function f is 1-Lip with respect to d if

    \[| f(x) - f(y) | \le \mathbf{1}_{x\neq y}         \quad\text{for each } x,y\in\mathbb{X},\]

that is, if and only if \mathrm{osc}\,f:=\sup f - \inf f \le 1. Hence we have

    \[W_1(\mu,\nu) = \sup_{f:\mathrm{osc}\,f \le 1} \left| \int f d\mu - \int f d\nu \right|          = \|\mu-\nu\|_{\rm TV}.\]

Thus 2 in the above proposition holds with \sigma^2=\frac{1}{4} by Pinsker’s inequality

    \[\|\mu-\nu\|_{\rm TV} \le \sqrt{\tfrac{1}{2}D(\nu||\mu)}.\]

We consequently find by 1 above that

    \[\mathbb{E}[e^{\lambda\{f(X)-\mathbb{E}f(X)\}}] \le e^{\lambda^2/8}\]

for every function f such that \mathrm{osc}\,f \le 1. Thus the above Proposition reduces in this case to the well known Hoeffding lemma, which states that bounded random variables are subgaussian.

Let us turn to the proof of the Proposition. The first observation is a classic result that connects relative entropy with moment generating functions. It dates back to the very beginnings of statistical mechanics (see the classic treatise by J. W. Gibbs (1902), Theorem III, p. 131).

Lemma. (Gibbs variational principle) Let Z be any random variable. Then

    \[\log \mathbb{E}_{\mathbb{P}}[e^Z]         = \sup_{\mathbb{Q} \ll \mathbb{P}}         \{         \mathbb{E}_{\mathbb{Q}}[Z] - D(\mathbb{Q} || \mathbb{P})         \}.\]

Proof. Assume that Z is bounded above by some constant Z\le M <\infty (otherwise replace Z by \min\{Z,M\} and then let M\uparrow\infty at the end). Define a probability measure \mathbb{\tilde Q} by

    \[d \mathbb{\tilde Q} = \frac{e^Z d\mathbb{P}}{\mathbb{E}_{\mathbb{P}}[e^Z]}.\]

Then

    \[\log\mathbb{E}_{\mathbb{P}}[e^Z] - D(\mathbb{Q} || \mathbb{\tilde Q})         = \log\mathbb{E}_{\mathbb{P}}[e^Z]          - \mathbb{E}_{\mathbb{Q}}\left[\log\frac{d \mathbb{Q}}{d \mathbb{P}}\right]         + \mathbb{E}_{\mathbb{Q}}\left[\log\frac{d \mathbb{\tilde Q}}{d \mathbb{P}}\right]         = \mathbb{E}_{\mathbb{Q}}[Z] - D(\mathbb{Q} || \mathbb{P}).\]

As the relative entropy is always nonnegative, we have

    \[\log\mathbb{E}_{\mathbb{P}}[e^Z]          \ge \mathbb{E}_{\mathbb{Q}}[Z] - D(\mathbb{Q} || \mathbb{P})\]

for every \mathbb{Q}\ll\mathbb{P}, and equality is obtained by choosing the optimizer \mathbb{Q}=\mathbb{\tilde Q}. \square
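
On a finite probability space the variational principle is easy to verify numerically: the supremum is attained at the tilted measure \mathbb{\tilde Q} from the proof, and any other \mathbb{Q} gives a smaller value. A small Python sketch (the space and the values of Z are arbitrary):

    import numpy as np

    rng = np.random.default_rng(6)
    n = 8

    p = rng.dirichlet(np.ones(n))            # the reference measure P
    z = rng.standard_normal(n)               # the values of Z

    lhs = np.log(np.sum(p * np.exp(z)))

    def functional(q):                       # E_Q[Z] - D(Q || P)
        return np.sum(q * z) - np.sum(q * np.log(q / p))

    q_tilt = p * np.exp(z)
    q_tilt /= q_tilt.sum()                   # the tilted optimizer from the proof
    assert np.isclose(lhs, functional(q_tilt))

    for _ in range(1000):                    # any other Q gives a smaller value
        assert functional(rng.dirichlet(np.ones(n))) <= lhs + 1e-10
    print("variational principle verified on a finite space")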

Using the variational principle, it is easy to prove the Proposition.

Proof of the Proposition. By the variational principle, we have

    \[\mathbb{E}_{\mathbb{P}}[e^{\lambda\{f(X)-\mathbb{E}f(X)\}}] \le e^{\lambda^2\sigma^2/2}\]

if and only if

    \[\lambda\, \mathbb{E}_{\mathbb{Q}}[f(X)] - \lambda\, \mathbb{E}_{\mathbb{P}}[f(X)] - D(\mathbb{Q} || \mathbb{P}) \le \frac{\lambda^2\sigma^2}{2}\]

for all \mathbb{Q}\ll\mathbb{P}. Optimizing over \lambda, we find that 1 is equivalent to the validity of

    \[\mathbb{E}_{\mathbb{Q}}[f(X)] - \mathbb{E}_{\mathbb{P}}[f(X)] \le \sqrt{2\sigma^2 D(\mathbb{Q} || \mathbb{P})}\]

for all f\in 1\text{-Lip} and \mathbb{Q}\ll\mathbb{P}. As f\in 1\text{-Lip} if and only if -f\in 1\text{-Lip}, this is precisely statement 2. \square

Tensorization and optimal transport

So far we have considered concentration in a fixed metric space (\mathbb{X},d). If X_1,\ldots,X_n are independent random variables, we can certainly apply the Proposition to X=(X_1,\ldots,X_n) with the product distribution \mu^{\otimes n}. However, to establish dimension-free concentration, we would have to check that the conditions of the Proposition hold for \mu^{\otimes n} for every n with the same constant \sigma! This is hardly a satisfactory answer: we would like to characterize dimension-free concentration directly in terms of a property of \mu only. To this end, a natural conjecture might be that if the conditions of the Proposition hold for the measure \mu, then that will already imply the same property for the measures \mu^{\otimes n} for every n. This turns out not to be quite true, but this idea will lead us in the right direction.

Motivated by the above, we set out to answer the following

Question. Suppose that \mu satisfies W_1(\nu,\mu) \le \sqrt{2\sigma^2 D(\nu||\mu)} for every \nu\ll\mu. Does this imply that a similar property is satisfied by the product measures \mu^{\otimes n}?

Such a conclusion is often referred to as a tensorization property. To make progress in this direction, we must understand the classic connection between Wasserstein distances and optimal transportation.

Theorem. (Kantorovich-Rubinstein duality, 1958) Let \mu and \nu be probability measures on a Polish space. Let \mathcal{C}(\mu,\nu):=\{\mathrm{Law}(X,Y): X\sim\mu, Y\sim\nu \} be the set of couplings of \mu and \nu. Then

    \[W_1(\mu,\nu) = \inf_{\mathbb{M}\in \mathcal{C}(\mu,\nu)} \mathbb{E}_\mathbb{M}[d(X,Y)].\]

The right side of this equation is a so-called “optimal transport problem”. For this reason, inequalities such as W_1(\nu,\mu) \le \sqrt{2\sigma^2 D(\nu||\mu)} are often called transportation-information inequalities.

The full proof of Kantorovich-Rubinstein duality is part of the theory of optimal transportation and is beyond our scope (optimal transportation is itself a fascinating topic with many connections to other areas of mathematics such as probability theory, PDEs, and geometric analysis—perhaps a topic for another semester?) Fortunately, we will only need the easy half of the theorem in the sequel.

Proof of lower bound. For each f\in 1\text{-Lip} and \mathbb{M}\in \mathcal{C}(\mu,\nu), we have

    \[\mathbb{E}_\mu [f] - \mathbb{E}_\nu [f]          = \mathbb{E}_\mathbb{M}[f(X)-f(Y)]          \le \mathbb{E}_\mathbb{M}[d(X,Y)],\]

from which we immediately get

    \[W_1(\mu,\nu) \le \inf_{\mathbb{M}\in \mathcal{C}(\mu,\nu)} \mathbb{E}_\mathbb{M}[d(X,Y)].\]

This proves the easy direction in the above theorem. \square

It turns out that the optimal transportation approach is the “right” way to tensorize transportation-information inequalities. Even though the following result is not quite yet what we need to prove dimension-free concentration, it already suffices to derive some interesting results.

Proposition. (Tensorization) Suppose that

    \[\inf_{\mathbb{M}\in \mathcal{C}(\nu,\mu)} \mathbb{E}_{\mathbb{M}}[d(X,Y)]^2          \le 2\sigma^2 D(\nu || \mu)\]

for all \nu\ll\mu. Then, for any n\ge 1,

    \[\inf_{\mathbb{M}\in \mathcal{C}(\mathbb{Q},\mu^{\otimes n})}          \sum_{i=1}^n \mathbb{E}_{\mathbb{M}}[d(X_i,Y_i)]^2         \le 2\sigma^2 D(\mathbb{Q} || \mu^{\otimes n})\]

for all \mathbb{Q} \ll\mu^{\otimes n}.

We postpone the proof of this result until the next lecture.

Example. Let d(x,y) = \mathbf{1}_{x\neq y}. By Pinsker’s inequality

    \[\inf_{\mathbb{M}\in \mathcal{C}(\mu,\nu)}         \mathbb{P}_{\mathbb{M}}[X\neq Y]^2 = \|\mu-\nu\|_{\rm TV}^2         \le \frac{1}{2} D(\nu || \mu)         \quad\mbox{for all }\nu \ll \mu.\]

Define the weighted Hamming distance for positive weights c_i as

    \[d_n(x,y) = \sum_{i=1}^n c_i \mathbf{1}_{x_i\neq y_i}.\]

Then, by Cauchy-Schwarz and tensorization we get

    \[\inf_{\mathbb{M}\in \mathcal{C}(\mathbb{Q},\mu^{\otimes n})}         \mathbb{E}_{\mathbb{M}}[d_n(X,Y)]^2        \le        \left(\sum_{i=1}^n c_i^2\right) \inf_{\mathbb{M}\in \mathcal{C}(\mathbb{Q},\mu^{\otimes n})}\sum_{i=1}^n          \mathbb{P}_{\mathbb{M}}[X_i\neq Y_i]^2         \le \frac{1}{2} \left(\sum_{i=1}^n c_i^2\right) D(\mathbb{Q} || \mu^{\otimes n})\]

for each \mathbb{Q}\ll\mu^{\otimes n}. So, we have

    \[\mathbb{E}[e^{\lambda\{f(X_1,\ldots,X_n)-\mathbb{E}f(X_1,\ldots,X_n)\}}]         \le e^{\lambda^2\sigma^2/2},\]

with \sigma^2=\frac{1}{4} \sum_{i=1}^n c_i^2, for each \lambda>0 and each function f which is 1-Lip with respect to d_n. This implies

    \[\mathbb{P}[f(X_1,\ldots,X_n)-\mathbb{E}f(X_1,\ldots,X_n) \ge t]          \le e^{-2t^2/\sum_{i=1}^n c_i^2}.\]

That is, we recover the well known bounded difference inequality.
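
For instance, if the X_i take values in [0,1] and f(x)=\frac{1}{n}\sum_{i=1}^nx_i is the empirical mean, then f is 1-Lip with respect to d_n with weights c_i=\frac{1}{n}, and the above bound reads

    \[\mathbb{P}\bigg[\frac{1}{n}\sum_{i=1}^nX_i-\mathbb{E}\bigg[\frac{1}{n}\sum_{i=1}^nX_i\bigg] \ge t\bigg]          \le e^{-2nt^2},\]

which is the classical Hoeffding inequality.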

Outlook

We have not yet shown that the transportation-information inequality holds for X\sim N(0,1) on (\mathbb{R},|\cdot|). Even once we establish this, however, the tensorization result we have given above is not sufficient to prove dimension-free Gaussian concentration in the sense of Tsirelson-Ibragimov-Sudakov. Indeed, if we apply the above tensorization result, then at best we can get

    \[\mathbb{P}[f(X_1,\ldots,X_n)-\mathbb{E}f(X_1,\ldots,X_n) \ge t]          \le e^{-t^2/2\sum_{i=1}^n c_i^2}\]

whenever

    \[|f(x)-f(y)|\le \sum_{i=1}^n c_i |x_i-y_i|.\]

Setting the weights c_i=1, we find a tail bound of the form e^{-t^2/2n} whenever f is 1\text{-Lip} with respect to the \ell_1-norm |f(x)-f(y)|\le\|x-y\|_1. Note that this is not dimension-free: the factor 1/n appears inside the exponent! On the other hand, Gaussian concentration shows that we have a dimension-free tail bound e^{-t^2/2} whenever f is 1\text{-Lip} with respect to the \ell_2-norm |f(x)-f(y)|\le\|x-y\|_2. Note that the latter property is strictly stronger than the former because \|\cdot\|_1\le\sqrt{n}\,\|\cdot\|_2! Our tensorization method is not sufficiently strong, however, to yield this type of dimension-free result.

Fortunately, we now have enough ingredients to derive a slightly stronger transportation-information inequality that is not only sufficient, but also necessary for dimension-free concentration. Stay tuned!

Lecture by Ramon van Handel | Scribed by Patrick Rebeschini

11. December 2013 by Ramon van Handel
Categories: Information theoretic methods | Comments Off
