**Models of growing graphs**

A natural general model of randomly growing graphs can be defined as follows. For $n \ge k \ge 1$ and a graph $S$ on $k$ vertices, define the random graph $G(n, S)$ by induction. First, set $G(k, S) = S$; we call $S$ the *seed* of the graph evolution process. Then, given $G(n, S)$, the graph $G(n+1, S)$ is formed from $G(n, S)$ by adding a new vertex and some new edges according to some adaptive rule. If $S$ is a single vertex, we write simply $G(n)$ instead of $G(n, S)$.

There are several rules one can consider; here we study perhaps the two most natural ones: uniform attachment and preferential attachment (denoted $\mathrm{UA}$ and $\mathrm{PA}$ in the following). Moreover, for simplicity we focus on the case of growing *trees*, where at every time step a single edge is added. Uniform attachment trees are defined recursively as follows: given $\mathrm{UA}(n, S)$, the tree $\mathrm{UA}(n+1, S)$ is formed from $\mathrm{UA}(n, S)$ by adding a new vertex $u$ and a new edge $uv$, where the vertex $v$ is chosen uniformly at random among the vertices of $\mathrm{UA}(n, S)$, independently of all past choices. Preferential attachment trees are defined similarly, except that $v$ is chosen with probability proportional to its degree:

$$\mathbb{P}\left(v = i \,\middle|\, \mathrm{PA}(n, S)\right) = \frac{d_{\mathrm{PA}(n, S)}(i)}{2(n-1)},$$

where for a tree $T$ we denote by $d_T(u)$ the degree of vertex $u$ in $T$ (the normalization is the total degree of a tree on $n$ vertices).
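Both growth rules are easy to simulate. Here is a minimal sketch in Python (the function name and the convention of starting both processes from a single edge are our own choices; for preferential attachment we use the standard trick that a uniform endpoint of a uniform edge is a degree-biased vertex):

```python
import random

def grow_tree(n, preferential=False, seed_edges=None, rng=random):
    """Grow a random tree on vertices 0..n-1.

    Uniform attachment: the new vertex attaches to a uniformly random
    existing vertex.  Preferential attachment: it attaches to an existing
    vertex chosen with probability proportional to its degree.
    """
    edges = list(seed_edges) if seed_edges else [(0, 1)]  # start from a single edge
    num = 1 + max(max(e) for e in edges)
    for v in range(num, n):
        if preferential:
            # a uniform endpoint of a uniform edge is degree-biased
            target = rng.choice(rng.choice(edges))
        else:
            target = rng.randrange(v)  # uniform over existing vertices 0..v-1
        edges.append((target, v))
    return edges
```

With `preferential=True` the high-degree early vertices keep attracting new leaves, which is exactly the rich-get-richer effect discussed below.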

**Questions: detection and estimation**

The most basic questions to consider are those of *detection* and *estimation*. Can one detect the influence of the initial seed graph? If so, is it possible to estimate the seed? Can one find the root if the process was started from a single node? We introduce these questions in the general model of randomly growing graphs described above, even though we study them in the special cases of uniform and preferential attachment trees later.

The detection question can be rephrased in the terminology of hypothesis testing. Given two potential seed graphs $S$ and $T$, and an observation $G$ which is a graph on $n$ vertices, one wishes to test whether $G \sim G(n, S)$ or $G \sim G(n, T)$. The question then boils down to whether one can design a test with asymptotically (in $n$) nonnegligible power. This is equivalent to studying the total variation distance between $G(n, S)$ and $G(n, T)$, so we naturally define

$$\delta(S, T) := \lim_{n \to \infty} \mathrm{TV}\left(G(n, S),\, G(n, T)\right),$$

where $G(n, S)$ and $G(n, T)$ are random elements in the finite space of unlabeled graphs with $n$ vertices. This limit is well-defined because $\mathrm{TV}\left(G(n, S), G(n, T)\right)$ is nonincreasing in $n$ (since if $G(n, S) = G(n, T)$, then the evolution of the random graphs can be coupled such that $G(m, S) = G(m, T)$ for all $m \ge n$) and always nonnegative.

If the seed has an influence, it is natural to ask whether one can estimate $S$ from $G(n, S)$ for large $n$. If so, can the subgraph corresponding to the seed be located in $G(n, S)$? We study this latter question in the simple case when the process starts from a single vertex called the *root*. (In the case of preferential attachment, starting from a single vertex is not well-defined; in this case we start the process from a single edge and the goal is to find one of its endpoints.) A *root-finding algorithm* is defined as follows. Given $G(n)$ and a target accuracy $\varepsilon \in (0, 1)$, a root-finding algorithm outputs a set $H\left(G(n), \varepsilon\right)$ of $K(\varepsilon)$ vertices such that the root is in $H\left(G(n), \varepsilon\right)$ with probability at least $1 - \varepsilon$ (with respect to the random generation of $G(n)$).

An important aspect of this definition is that the size of the output set is allowed to depend on $\varepsilon$, but not on the size $n$ of the input graph. Therefore it is not clear that root-finding algorithms exist at all. Indeed, there are examples where they do not exist: consider a path that grows by picking one of its two ends at random and extending it by a single edge. However, it turns out that in many interesting cases root-finding algorithms do exist. In such cases it is natural to ask for the best possible value of $K(\varepsilon)$.

**The influence of the seed**

Consider distinguishing between a PA tree started from a star with $k$ vertices, $\mathrm{PA}(n, S_k)$, and a PA tree started from a path with $k$ vertices, $\mathrm{PA}(n, P_k)$. Since the preferential attachment mechanism incorporates the rich-get-richer phenomenon, one expects the degree of the center of the star in $\mathrm{PA}(n, S_k)$ to be significantly larger than the degree of any of the initial vertices in the path in $\mathrm{PA}(n, P_k)$. This intuition guided Bubeck, Mossel, and Racz when they initiated the theoretical study of the influence of the seed in PA trees. They showed that this intuition is correct: the limiting distribution of the maximum degree of the PA tree indeed depends on the seed. Using this they were able to show that for any two seeds $S$ and $T$ with at least $3$ vertices and different degree profiles we have

$$\delta_{\mathrm{PA}}(S, T) > 0.$$
However, statistics based solely on degrees cannot distinguish all pairs of nonisomorphic seeds. This is because if $S$ and $T$ have the same degree profiles, then it is possible to couple $\mathrm{PA}(n, S)$ and $\mathrm{PA}(n, T)$ such that they have the same degree profiles for every $n$. In order to distinguish between such seeds, it is necessary to incorporate information about the graph structure into the statistics that are studied. This was done successfully by Curien, Duquesne, Kortchemski, and Manolescu, who analyzed statistics that measure the *geometry* of large degree nodes. These results can be summarized in the following theorem.

Theorem:The seed has an influence in PA trees in the following sense. For any trees $S$ and $T$ that are nonisomorphic and have at least $3$ vertices, we have

$$\delta_{\mathrm{PA}}(S, T) > 0.$$
In the case of uniform attachment, degrees do not play a special role, so initially one might even think that the seed has no influence in the limit. However, it turns out that the right perspective is not to look at degrees but rather the sizes of appropriate subtrees (we shall discuss such statistics later). By extending the approach of Curien et al. to deal with such statistics, Bubeck, Eldan, Mossel, and Racz showed that the seed has an influence in uniform attachment trees as well.

Theorem:The seed has an influence in UA trees in the following sense. For any trees $S$ and $T$ that are nonisomorphic and have at least $3$ vertices, we have

$$\delta_{\mathrm{UA}}(S, T) > 0.$$
These results, together with a lack of examples showing opposite behavior, suggest that for most models of randomly growing graphs the seed has an influence.

Question:How common is the phenomenon observed in Theorems 1 and 2? Is there a natural large class of randomly growing graphs for which the seed has an influence? That is, models where for any two seeds $S$ and $T$ (perhaps satisfying an extra condition), we have $\delta(S, T) > 0$. Is there a natural model where the seed has no influence?

**Finding Adam**

These theorems about the influence of the seed open up the problem of *finding* the seed. Here we present the results of Bubeck, Devroye, and Lugosi, who first studied root-finding algorithms in the case of UA and PA trees.

They showed that root-finding algorithms indeed exist for PA trees and that the size of the best confidence set is polynomial in $1/\varepsilon$.

Theorem:There exists a polynomial time root-finding algorithm for PA trees with

$$K(\varepsilon) \le \frac{c \log^2(1/\varepsilon)}{\varepsilon^{4}}$$

for some finite constant $c$. Furthermore, there exists a positive constant $c'$ such that any root-finding algorithm for PA trees must satisfy

$$K(\varepsilon) \ge \frac{c'}{\varepsilon}.$$

They also showed the existence of root-finding algorithms for UA trees. In this model, however, there are confidence sets whose size is *subpolynomial* in $1/\varepsilon$. Moreover, the size of any confidence set has to be at least *superpolylogarithmic* in $1/\varepsilon$.

Theorem:There exists a polynomial time root-finding algorithm for UA trees with

$$K(\varepsilon) \le \exp\left(c\, \frac{\log(1/\varepsilon)}{\log\log(1/\varepsilon)}\right)$$

for some finite constant $c$. Furthermore, there exists a positive constant $c'$ such that any root-finding algorithm for UA trees must satisfy

$$K(\varepsilon) \ge \exp\left(c' \sqrt{\log(1/\varepsilon)}\right).$$

These theorems show an interesting quantitative difference between the two models: finding the root is exponentially more difficult in PA than in UA. While this might seem counter-intuitive at first, the reason behind this can be traced back to the rich-get-richer phenomenon: the effect of a rare event where not many vertices attach to the root gets amplified by preferential attachment, making it harder to find the root.

**Proofs using Pólya urns**

We now explain the basic ideas that go into proving Theorems 3 and 4 and prove some simpler cases. While UA and PA are arguably the most basic models of randomly growing graphs, the evolution of various simple statistics, such as degrees or subtree sizes, can be described using even simpler building blocks: Pólya urns. In this post we assume familiarity with Pólya urns; we refer to the lecture notes for a primer on Pólya urns for the interested reader.

**A root-finding algorithm based on the centroid**

We start by presenting a simple root-finding algorithm for UA trees. This algorithm is not optimal, but its analysis is simple and highlights the basic ideas.

For a tree $T$, if we remove a vertex $v$, then the tree becomes a forest consisting of disjoint subtrees of the original tree. Let $\psi_T(v)$ denote the size (i.e., the number of vertices) of the largest component of this forest. A vertex that minimizes $\psi_T$ is known as a *centroid* of $T$; one can show that there can be at most two centroids. We define the confidence set $H_{\psi}$ by taking the set of $K$ vertices with smallest $\psi$ values.
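This estimator is simple to implement. Below is a sketch (our own code: one DFS pass computes all subtree sizes, from which every $\psi$ value follows):

```python
from collections import defaultdict

def psi_values(edges, n):
    """For each vertex v of a tree on vertices 0..n-1, compute psi(v):
    the size of the largest component left after deleting v."""
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    # iterative DFS from vertex 0 to get a parent structure
    order, parent, stack = [], {0: None}, [0]
    while stack:
        v = stack.pop()
        order.append(v)
        for u in adj[v]:
            if u != parent[v]:
                parent[u] = v
                stack.append(u)
    size = {v: 1 for v in range(n)}
    for v in reversed(order):          # accumulate subtree sizes bottom-up
        if parent[v] is not None:
            size[parent[v]] += size[v]
    psi = {}
    for v in range(n):
        # components after removing v: the children subtrees, plus the
        # part of the tree "above" v (if v is not the DFS root)
        comps = [size[u] for u in adj[v] if parent.get(u) == v]
        if parent[v] is not None:
            comps.append(n - size[v])
        psi[v] = max(comps) if comps else 0
    return psi

def centroid_confidence_set(edges, n, K):
    psi = psi_values(edges, n)
    return sorted(range(n), key=lambda v: psi[v])[:K]
```

On a path 0-1-2-3-4 this returns the middle vertex for $K = 1$, as expected of a centroid.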

Theorem:The centroid-based $H_{\psi}$ defined above is a root-finding algorithm for the UA tree. More precisely, if

$$K \ge \frac{C \log(1/\varepsilon)}{\varepsilon}$$

for a large enough universal constant $C$, then

$$\liminf_{n \to \infty}\, \mathbb{P}\left(1 \in H_{\psi}\left(\mathrm{UA}(n)^{\circ}\right)\right) \ge 1 - \varepsilon,$$

where $1$ denotes the root, and $\mathrm{UA}(n)^{\circ}$ denotes the unlabeled version of $\mathrm{UA}(n)$.

*Proof:* We label the vertices of the UA tree in chronological order, so that the root is vertex $1$. We start by introducing some notation that is useful throughout the proof. For $1 \le i \le k$, denote by $T_i(n)$ the tree containing vertex $i$ in the forest obtained by removing in $\mathrm{UA}(n)$ all edges between the vertices $1, \dots, k$. See the figure for an illustration.

Let $|T|$ denote the size of a tree $T$, i.e., the number of vertices it contains. Note that the vector

$$\left(|T_1(n)|, |T_2(n)|, \dots, |T_k(n)|\right)$$

evolves according to the classical Pólya urn with $k$ colors, with initial state $(1, 1, \dots, 1)$. Therefore the normalized vector

$$\frac{1}{n}\left(|T_1(n)|, |T_2(n)|, \dots, |T_k(n)|\right)$$

converges in distribution to a Dirichlet distribution with parameters $(1, 1, \dots, 1)$.
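Since everything below reduces to Pólya urns, it is worth seeing one in code. The sketch below (ours, plain Python) simulates the classical urn and checks the $k = 2$ case against its Uniform$[0,1]$ limit, whose mean is $1/2$ and variance $1/12$:

```python
import random

def polya_urn(k, steps, rng):
    """Classical Polya urn with k colors, one ball of each color initially:
    repeatedly draw a uniform ball and put it back with an extra copy."""
    counts = [1] * k
    total = k
    for _ in range(steps):
        r = rng.randrange(total)
        for c in range(k):          # locate the drawn color by linear scan
            r -= counts[c]
            if r < 0:
                counts[c] += 1
                break
        total += 1
    return counts

rng = random.Random(0)
# with k = 2 the limiting fraction of the first color is Uniform[0, 1]
fractions = [polya_urn(2, 500, rng)[0] / 502 for _ in range(2000)]
```

In fact for $k = 2$ the count of the first color after $N$ steps is exactly uniform on $\{1, \dots, N+1\}$, so the empirical mean and variance of `fractions` should be close to $1/2$ and $1/12$.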

Now observe that if $1 \notin H_{\psi}$ then some vertex $i \ge K+1$ satisfies $\psi(i) \le \psi(1)$, and hence for any $\delta \in (0, 1)$,

$$\mathbb{P}\left(1 \notin H_{\psi}\right) \le \mathbb{P}\left(\psi(1) \ge (1 - \delta) n\right) + \mathbb{P}\left(\exists\, i \ge K+1 : \psi(i) \le (1 - \delta) n\right).$$

We bound the two terms appearing above separately, starting with the first one. Note that (taking $k = 2$ in the notation above)

$$\psi(1) \le \max\left(|T_1(n)|, |T_2(n)|\right),$$

and both $|T_1(n)|/n$ and $|T_2(n)|/n$ converge in distribution to a uniform random variable in $[0, 1]$. Hence a union bound gives us that

$$\limsup_{n \to \infty}\, \mathbb{P}\left(\psi(1) \ge (1 - \delta) n\right) \le 2\delta.$$

For the other term, first observe that for any $i \ge K+1$ we have (now taking $k = K$)

$$\psi(i) \ge n - \max_{1 \le j \le K} |T_j(n)|.$$

Now using results on Pólya urns we have that for every $j$ such that $1 \le j \le K$, the random variable

$$\frac{|T_j(n)|}{n}$$

converges in distribution to the $\mathrm{Beta}(1, K-1)$ distribution. Hence by a union bound we have that

$$\limsup_{n \to \infty}\, \mathbb{P}\left(\exists\, i \ge K+1 : \psi(i) \le (1-\delta) n\right) \le K\, \mathbb{P}\left(\mathrm{Beta}(1, K-1) \ge \delta\right) = K (1 - \delta)^{K-1}.$$

Putting together the two bounds gives that

$$\limsup_{n \to \infty}\, \mathbb{P}\left(1 \notin H_{\psi}\right) \le 2\delta + K (1-\delta)^{K-1},$$

which, taking $\delta = \varepsilon/3$, concludes the proof due to the assumption on $K$.

The same estimator works for the preferential attachment tree as well, if one takes

$$K \ge \frac{c \log^2(1/\varepsilon)}{\varepsilon^{4}}$$

for some positive constant $c$. The proof mirrors the one above, but involves a few additional steps; we refer to Bubeck et al. for details.

For uniform attachment the bound on $K$ given by Theorem 5 is not optimal. It turns out that it is possible to write down the maximum likelihood estimator (MLE) for the root in the UA model; we do not do so here, see Bubeck et al. One can view the estimator based on the centroid as a certain “relaxation” of the MLE. By constructing a certain “tighter” relaxation of the MLE, one can obtain a confidence set with size subpolynomial in $1/\varepsilon$ as described in Theorem 4. The analysis of this is the most technical part of Bubeck et al. and we refer to this paper for more details.

**Lower bounds**

As mentioned above, the MLE for the root can be written down explicitly. This aids in showing a lower bound on the size of a confidence set. In particular, Bubeck et al. define a set of trees whose probability of occurrence under the UA model is not too small, yet the MLE provably fails, giving the lower bound described in Theorem 4. We refer to Bubeck et al. for details.

On the other hand, for the PA model it is not necessary to use the structure of the MLE to obtain a lower bound. A simple symmetry argument suffices to show the lower bound in Theorem 3, which we now sketch.

First observe that the probability of error for the optimal procedure is non-decreasing with $n$, since otherwise one could simulate the process to obtain a better estimate. Thus it suffices to show that the optimal procedure must have a probability of error of at least $2\varepsilon$ for some finite $n$. We show that there is some finite $n_0$ such that with probability at least $4\varepsilon$, the root is isomorphic to at least $c/\varepsilon$ vertices in $\mathrm{PA}(n_0)$. Thus if a procedure outputs at most $c/(2\varepsilon)$ vertices, then it must make an error at least half the time (so with probability at least $2\varepsilon$).

Observe that the probability that the root is a leaf in $\mathrm{PA}(n)$ is

$$\prod_{t=2}^{n-1} \left(1 - \frac{1}{2(t-1)}\right) = \Theta\left(\frac{1}{\sqrt{n}}\right).$$

By choosing $n_0 = \Theta(1/\varepsilon^2)$ with an appropriate constant, this happens with probability at least $4\varepsilon$. Furthermore, conditioned on the root being a leaf, with constant probability vertex $2$ is connected to $\Theta(\sqrt{n_0}) = \Theta(1/\varepsilon)$ leaves, which are then isomorphic to the root.
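This product is easy to evaluate numerically. The sketch below (our own check, using the attachment probabilities described in the text, starting from a single edge) confirms the $\Theta(1/\sqrt{n})$ decay:

```python
import math

def prob_root_stays_leaf(n):
    """P(the root keeps degree 1 in PA(n)), starting from a single edge:
    at the step creating vertex t+1 the tree has t vertices and total
    degree 2(t-1), so the degree-1 root is hit with prob 1/(2(t-1))."""
    p = 1.0
    for t in range(2, n):
        p *= 1 - 1 / (2 * (t - 1))
    return p

# the product decays like a constant times 1/sqrt(n),
# so p(n) * sqrt(n) should stabilize:
ratios = [prob_root_stays_leaf(n) * math.sqrt(n) for n in (1000, 4000, 16000)]
```

(The limiting constant is $1/\sqrt{\pi} \approx 0.564$, by the standard asymptotics of this Wallis-type product.)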

**Open problems**

There are many open problems and further directions that one can pursue; the four papers we have discussed contain 20 open problems and conjectures alone, and we urge the reader to have a look and try to solve them!

**Some properties to look for in non-convex optimization**

We will say that a function $f$ admits *first order optimality* (respectively *second order optimality*) if all critical points (respectively all local minima) of $f$ are global minima (of course first order optimality implies second order optimality for smooth functions). In particular with first order optimality one has that gradient descent converges to the global minimum, and with second order optimality this is also true provided that one avoids saddle points. To obtain rates of convergence it can be useful to make more quantitative statements. For example we say that $f$ is $\mu$-Polyak if

$$\frac{1}{2}\, \|\nabla f(x)\|^2 \ge \mu \left(f(x) - \min f\right) \quad \text{for all } x.$$

Clearly $\mu$-Polyak implies first order optimality, but more importantly it also implies a linear convergence rate for gradient descent on $f$. A variant of this condition is $\tau$-weak-quasi-convexity:

$$\left\langle \nabla f(x),\, x - x^* \right\rangle \ge \tau \left(f(x) - f(x^*)\right) \quad \text{for all } x,$$

where $x^*$ is a global minimizer of $f$, in which case gradient descent converges at the slow non-smooth rate $O(1/\sqrt{T})$ (and in this case it is also robust to noise, i.e. one can write a stochastic gradient descent version). The proofs of these statements just mimic the usual convex proofs. For more on these conditions see for instance this paper.
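As a sanity check of the linear-rate claim, here is a tiny gradient descent experiment on $f(x) = x^2 + 3\sin^2 x$, a standard non-convex function satisfying a Polyak condition with some $\mu > 0$ (the step size and iteration count below are our own arbitrary choices; the step size is below $1/L$ for this function):

```python
import math

def f(x):
    # non-convex (many inflections from the sine term) but Polyak:
    # its only critical point is the global minimum at x = 0
    return x * x + 3 * math.sin(x) ** 2

def grad(x):
    return 2 * x + 3 * math.sin(2 * x)

def gradient_descent(x0, lr=0.1, iters=300):
    x = x0
    for _ in range(iters):
        x -= lr * grad(x)
    return x

x_star = gradient_descent(3.0)  # converges (fast) to the global minimum 0
```

Despite the wiggles of the sine term, the iterates reach the global minimum at a linear rate, which is the point of the Polyak condition.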

**Linearized Residual Networks**

Recall that a neural network is just a map $x \mapsto W_L\, \sigma\left(W_{L-1}\, \sigma\left(\cdots \sigma(W_1 x)\right)\right)$ where $W_1, \dots, W_L$ are linear maps (i.e. they are the matrices parametrizing the neural network) and $\sigma$ is some non-linear map (the most popular one, ReLU, is just the coordinate-wise positive part). Alternatively you can think of a neural network as a sequence of hidden states $h_0, h_1, \dots, h_L$ where $h_0 = x$ and $h_{k+1} = \sigma(W_{k+1} h_k)$. In 2015 a team of researchers at MSR Asia introduced the concept of a *residual neural network* where the hidden states are now updated as before for even $k$ but for odd $k$ we set $h_{k+1} = h_{k-1} + \sigma(W_{k+1} h_k)$. Apparently this trick allowed them to train much deeper networks, though it is not clear why this would help from a theoretical point of view (the intuition is that at least when the network is initialized with all matrices being $0$ it still does something non-trivial, namely it computes the identity).

In their most recent paper Moritz Hardt and Tengyu Ma try to explain why adding this “identity connection” could be a good idea from a geometric point of view. They consider an (extremely) simplified model where there is no non-linearity, i.e. $\sigma$ is the identity map. A neural network is then just a product of matrices, $x \mapsto W_L \cdots W_1 x$. In particular the landscape we are looking at for least-squares with such a model is of the form:

$$f(W_1, \dots, W_L) = \mathbb{E}\, \left\| W_L \cdots W_1\, x - y \right\|^2,$$

which is of course a non-convex function (just think of the function $(w_1, w_2) \mapsto w_1 w_2$ and observe that on the segment from $(-1, 1)$ to $(1, -1)$ it gives the non-convex function $x \mapsto -x^2$). However it actually satisfies the second-order optimality condition:

Proposition[Kawaguchi 2016]Assume that $x$ has a full rank covariance matrix and that $y = R x$ for some deterministic matrix $R$. Then all local minima of $f$ are global minima.
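A quick numerical check of the non-convexity mentioned above (our own toy instance with $L = 2$, scalar "matrices", and target $R = 1$; convexity fails along the segment joining the two global minima):

```python
def f(w1, w2):
    # least-squares landscape of a depth-2 linear "network" on scalars,
    # with target R = 1:  f(w1, w2) = (w1*w2 - 1)^2
    return (w1 * w2 - 1) ** 2

# Convexity would force the value at a midpoint to be at most the average
# of the endpoint values; the segment from (1, 1) to (-1, -1) violates this:
ends = (f(1, 1) + f(-1, -1)) / 2   # = 0: both endpoints are global minima
mid = f(0, 0)                      # = 1
assert mid > ends                  # so f is not convex
```

Yet, consistently with the proposition, every local minimum of this $f$ is global (the whole hyperbola $w_1 w_2 = 1$).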

I won’t give the proof of this result as it requires to take the second derivative of $f$ which is a bit annoying (I will give below the computation of the first derivative). Now in this linearized setting the residual network version (where the identity connection is added at every layer) corresponds simply to a reparametrization around the identity, in other words we consider now the following function:

$$g(W_1, \dots, W_L) = \mathbb{E}\, \left\| (I + W_L) \cdots (I + W_1)\, x - y \right\|^2.$$

Proposition[Hardt and Ma 2016]Assume that $x$ has a full rank covariance matrix and that $y = R x$ for some deterministic matrix $R$. Then $g$ has first order optimality on the set $\left\{(W_1, \dots, W_L) : \|W_i\| < 1 \text{ for all } i\right\}$, where $\|\cdot\|$ denotes the spectral norm.

Thus adding the identity connection makes the objective function better behaved around the starting point with all-zero matrices (in the sense that gradient descent doesn’t have to worry about avoiding saddle points). The proof is just a few lines of standard calculations to take derivatives of functions with matrix-valued inputs.

**Proof:** One has, with $A = (I + W_L) \cdots (I + W_1)$ and $\Sigma = \mathbb{E}\, x x^{\top}$,

$$g(W_1, \dots, W_L) = \mathbb{E}\, \left\|(A - R)\, x\right\|^2 = \mathrm{Tr}\left((A - R)\, \Sigma\, (A - R)^{\top}\right),$$

so with $B = (I + W_L) \cdots (I + W_{i+1})$ and $C = (I + W_{i-1}) \cdots (I + W_1)$,

$$g(\dots, W_i + V, \dots) - g(\dots, W_i, \dots) = 2\, \mathrm{Tr}\left(V\, C\, \Sigma\, (A - R)^{\top} B\right) + O\left(\|V\|^2\right),$$

which exactly means that the derivative of $g$ with respect to $W_i$ is equal to $B^{\top} (A - R)\, \Sigma\, C^{\top}$. On the set under consideration one has that $B$ and $C$ are invertible (and so is $\Sigma$ by assumption), and thus if this derivative is equal to $0$ it must be that $A - R = 0$, and thus $g = 0$ (which is the global minimum).

**Linearized recurrent neural networks**

The simplest version of a recurrent neural network is as follows. It is a mapping of the form $(x_1, \dots, x_T) \mapsto (\hat{y}_1, \dots, \hat{y}_T)$ (we are thinking of doing sequence to sequence prediction). In these networks the hidden state is updated as $h_{t+1} = \sigma(A h_t + B x_t)$ (with $h_1 = 0$) and the output is $\hat{y}_t = C h_t + D x_t$. I will now describe a paper by Hardt, Ma and Recht (see also this blog post) that tries to understand the geometry of least-squares for this problem in the linearized version where $\sigma$ is the identity. That is we are looking at the function:

$$f(A, B, C, D) = \mathbb{E} \sum_{t=1}^{T} \left\| \hat{y}_t - y_t \right\|^2,$$

where $y$ is obtained from $x$ via some unknown recurrent neural network with parameters $(A^*, B^*, C^*, D^*)$. First observe that by induction one can easily see that $h_t = \sum_{k=1}^{t-1} A^{t-1-k} B x_k$ and $\hat{y}_t = D x_t + \sum_{k=1}^{t-1} C A^{t-1-k} B x_k$. In particular, assuming that $(x_t)$ is an i.i.d. isotropic sequence one obtains

$$f = \sum_{t=1}^{T} \left( \left\| D - D^* \right\|_F^2 + \sum_{k=1}^{t-1} \left\| C A^{t-1-k} B - C^* (A^*)^{t-1-k} B^* \right\|_F^2 \right)$$

and thus

$$f = T \left\| D - D^* \right\|_F^2 + \sum_{k=1}^{T-1} (T - k) \left\| C A^{k-1} B - C^* (A^*)^{k-1} B^* \right\|_F^2.$$

In particular we see that the effect of $D$ is decoupled from the other variables and that it appears as a convex function, thus we will just ignore it. Next we make the natural assumption that the spectral radius of $A$ is less than $1$ (for otherwise the influence of the initial input is growing over time which doesn’t seem natural) and thus up to some small error term (for large $T$) one can consider the *idealized risk*:

$$\sum_{k=1}^{\infty} \left\| C A^{k-1} B - C^* (A^*)^{k-1} B^* \right\|^2.$$

The next idea is a cute one which makes the above expression more tractable. Consider the series $\left(C A^{k-1} B\right)_{k \ge 1}$ and its Fourier transform:

$$G(\theta) = \sum_{k=1}^{\infty} C A^{k-1} B\, e^{i (k-1) \theta}.$$

By Parseval’s theorem the idealized risk is equal to the $L^2$ distance between $G$ and $G^*$ (i.e. $\frac{1}{2\pi} \int_{-\pi}^{\pi} |G(\theta) - G^*(\theta)|^2\, d\theta$). We will now show that under appropriate further assumptions, for any $\theta$ the integrand $|G(\theta) - G^*(\theta)|^2$ is weakly-quasi-convex in the parameters (in particular this shows that the idealized risk is weakly-quasi-convex). The big assumption that Hardt, Ma and Recht make is that the system is a “single-input single-output” model, that is both $x_t$ and $y_t$ are scalar. In this case it turns out that control theory shows that there is a “canonical controllable form” where $B = (0, \dots, 0, 1)^{\top}$, and $A$ has zeros everywhere except on the upper diagonal where it has ones and on the last row where it has $(-a_d, \dots, -a_1)$ (I don’t know the proof of this result, if some reader has a pointer for a simple proof please share in the comments!). Note that with this form the system is simple to interpret as one has $h_{t+1} = \left(h_t(2), \dots, h_t(d),\ x_t - a_d h_t(1) - \cdots - a_1 h_t(d)\right)$ and $\hat{y}_t = \langle c, h_t \rangle + D x_t$. Now with just a few lines of algebra:

$$G(\theta) = \frac{c_d + c_{d-1} e^{i\theta} + \cdots + c_1 e^{i(d-1)\theta}}{1 + a_1 e^{i\theta} + \cdots + a_d e^{i d \theta}} =: \frac{q_c(\theta)}{p_a(\theta)}.$$

Thus we are just asking to check the weak-quasi-convexity of

$$(a, c) \mapsto \frac{1}{2\pi} \int_{-\pi}^{\pi} \left| \frac{q_c(\theta)}{p_a(\theta)} - \frac{q_{c^*}(\theta)}{p_{a^*}(\theta)} \right|^2 d\theta.$$

Weak-quasi-convexity is preserved by linear functions, so we just need to understand, for a fixed $\theta$ and a fixed target $w = q_{c^*}(\theta)/p_{a^*}(\theta)$, the map

$$(u, v) \in \mathbb{C}^2 \mapsto \left| \frac{u}{v} - w \right|^2,$$

which is weak-quasi-convex provided that $v$ has a positive inner product with $v^* = p_{a^*}(\theta)$ (viewing complex numbers as vectors in $\mathbb{R}^2$). In particular we just proved the following:

Theorem[Hardt, Ma, Recht 2016]

Let $\mathcal{C} \subset \mathbb{C}$ be a cone of angle less than $\pi/2$ and assume that $p_{a^*}(\theta) \in \mathcal{C}$ for all $\theta \in [-\pi, \pi]$. Then the idealized risk is $\tau$-weakly-quasi-convex, for some $\tau > 0$ depending only on the angle of the cone, on the set of $(a, c)$ such that $p_a(\theta) \in \mathcal{C}$ for all $\theta \in [-\pi, \pi]$.

(In the paper they specifically pick the cone where the imaginary part is larger than the real part.) This theorem naturally suggests that by overparametrizing the network (i.e. increasing the dimension of the hidden state $h_t$) one could have a nicer landscape (indeed in this case the above condition can be easier to check), see the paper for more details!
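The canonical controllable form above is easy to play with numerically. Below is a sketch (in Python with NumPy; the specific coefficients are our own arbitrary choice, and sign conventions for the last row vary across references): we check that the companion matrix has the expected characteristic polynomial, that it is stable, and that the transfer function $C(zI - A)^{-1}B$ agrees with the impulse-response series $\sum_{k \ge 0} C A^k B\, z^{-k-1}$ underlying the Fourier-transform step.

```python
import numpy as np

# coefficients a = (a_1, a_2, a_3); sum |a_i| < 1 forces spectral radius < 1
a = np.array([0.1, 0.1, 0.1])
d = len(a)
A = np.zeros((d, d))
A[np.arange(d - 1), np.arange(1, d)] = 1.0  # ones on the superdiagonal
A[-1, :] = -a[::-1]                          # last row: -a_d, ..., -a_1
B = np.zeros(d); B[-1] = 1.0                 # B = e_d
C = np.array([0.3, -0.2, 0.5])               # arbitrary output weights

# characteristic polynomial is z^d + a_1 z^{d-1} + ... + a_d
assert np.allclose(np.poly(A), [1.0, *a])

# transfer function at a point z outside the spectrum matches the series
z = 2.0
lhs = C @ np.linalg.inv(z * np.eye(d) - A) @ B
rhs = sum((C @ np.linalg.matrix_power(A, k) @ B) / z ** (k + 1)
          for k in range(60))
```

The agreement of `lhs` and `rhs` is exactly the (convergent, since the spectral radius is below one) power-series identity used in the text.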

**Local max-cut and smoothed analysis**

Let $G = (V, E)$ be a connected graph with $n$ vertices and $w : E \to [-1, 1]$ be an edge weight function. The local max-cut problem asks to find a partition of the vertices $\sigma : V \to \{-1, 1\}$ whose total cut weight

$$\sum_{\{u, v\} \in E \,:\, \sigma(u) \neq \sigma(v)} w(\{u, v\})$$

is locally maximal, in the sense that one cannot increase the cut weight by changing the value of $\sigma$ at a single vertex (recall that actually finding the global maximum is NP-hard). See the papers linked to above for motivation on this problem.

There is a simple local search algorithm for this problem, sometimes referred to as “FLIP”: start from some initial $\sigma$ and iteratively flip vertices (i.e. change the sign of $\sigma$ at a vertex) to improve the cut weight until reaching a local maximum. It is easy to build instances where FLIP takes exponential time, however in “practice” it seems that FLIP always converges quickly. This motivates the *smoothed analysis* of FLIP, that is we want to understand what is the typical number of steps for FLIP when the edge weights are perturbed by a small amount of noise. Formally we now assume that the weight on edge $e$ is given by a random variable $X_e \in [-1, 1]$ which has a density with respect to the Lebesgue measure bounded from above by $\phi$ (for example this forbids $X_e$ to be too close to a point mass). We assume that these random variables are independent.
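Here is a compact implementation of FLIP on a complete graph with smoothed weights (a sketch; the random scan order and the uniform noise model are our own choices, the analysis below only requires the weight densities to be bounded):

```python
import random

def cut_weight(w, sigma):
    n = len(sigma)
    return sum(w[i][j] for i in range(n) for j in range(i + 1, n)
               if sigma[i] != sigma[j])

def flip(w, sigma, rng):
    """FLIP local search: repeatedly flip any vertex that strictly improves
    the cut weight, until no single flip improves it."""
    n = len(sigma)
    steps, improved = 0, True
    while improved:
        improved = False
        for v in rng.sample(range(n), n):
            # change in cut weight if we flip v: uncut edges at v become
            # cut (+w) and cut edges become uncut (-w)
            gain = sum((w[v][u] if sigma[v] == sigma[u] else -w[v][u])
                       for u in range(n) if u != v)
            if gain > 0:
                sigma[v] = -sigma[v]
                steps += 1
                improved = True
    return sigma, steps

rng = random.Random(0)
n = 30
w = [[0.0] * n for _ in range(n)]          # complete graph, noisy weights
for i in range(n):
    for j in range(i + 1, n):
        w[i][j] = w[j][i] = rng.uniform(-1, 1)
sigma0 = [rng.choice([-1, 1]) for _ in range(n)]
sigma, steps = flip(w, list(sigma0), rng)
```

On small smoothed instances like this one, FLIP indeed terminates after very few improving moves, which is the empirical behavior the smoothed analysis explains.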

Theorem(Etscheid and Roglin [2014]): With probability $1 - o(1)$, FLIP terminates in at most $\phi \cdot n^{c \log n}$ steps for some universal constant $c$.

We improve this result from quasi-polynomial to polynomial, assuming that we put some noise on the interaction between every pair of vertices, or in other words assuming that the graph is complete.

Theorem(Angel, Bubeck, Peres, Wei [2016]): Let $G$ be complete. With probability $1 - o(1)$, FLIP terminates in a number of steps polynomial in $n$ and $\phi$.

I will now prove the Etscheid and Roglin result.

**Proof strategy**

To simplify notation let us introduce the Hamiltonian

$$H(\sigma) = -\sum_{\{u, v\} \in E} X_{\{u, v\}}\, \sigma(u)\, \sigma(v), \qquad \sigma \in \{-1, 1\}^V,$$

which differs from the cut weight only by an affine transformation. We want to find a local max of $H$. For any $\sigma \in \{-1, 1\}^V$ and $v \in V$, we denote by $\sigma^{(v)}$ the state equal to $\sigma$ except for the coordinate corresponding to $v$ which is flipped. For such a pair there exists a vector $\alpha_{v, \sigma} \in \{-2, 0, 2\}^E$ such that

$$H\left(\sigma^{(v)}\right) - H(\sigma) = \left\langle \alpha_{v, \sigma},\, X \right\rangle.$$

More specifically $\alpha_{v, \sigma}$ is defined by $\alpha_{v, \sigma}(e) = 2\, \sigma(u)\, \sigma(v)$ for edges $e = \{u, v\}$ containing $v$, and $\alpha_{v, \sigma}(e) = 0$ otherwise.

We say that flipping a vertex is a *move*, and it is an improving move if the value of $H$ strictly improves. We say that a sequence of moves $v_1, \dots, v_{\ell}$, with corresponding states $\sigma_0, \sigma_1, \dots, \sigma_{\ell}$, is $\epsilon$-slow from $\sigma_0$ if every move is improving and

$$H(\sigma_{\ell}) - H(\sigma_0) \le \epsilon.$$

It is sufficient to show that with high probability there is no $\epsilon$-slow sequence with say $\ell = 2n$ and $\epsilon = n^{-c \log n}$ (indeed in this case after $2n \cdot n^2 / \epsilon$ steps FLIP must have stopped, for otherwise the value of $H$ would exceed the maximal possible value of $n^2$). We will do this in three main steps, a probability step, a linear algebra step, and a combinatorial step.

**Probability step**

Lemma:Let $\alpha_1, \dots, \alpha_m$ be linearly independent vectors in $\mathbb{Z}^E$. Then one has

$$\mathbb{P}\left(\forall i \in [m],\ \langle \alpha_i, X \rangle \in (0, \epsilon]\right) \le (\phi\, \epsilon)^m.$$

*Proof:* The inequality follows from a simple change of variables. Let $M$ be a full rank linear map whose first $m$ rows are $\alpha_1, \dots, \alpha_m$ and which is completed so that $M$ is the identity on the subspace orthogonal to $\mathrm{span}(\alpha_1, \dots, \alpha_m)$. Let $f$ be the density of $X$, and $g$ the density of $M X$. One has $g(y) = f(M^{-1} y)/|\det M|$, and the key observation is that since the Gram matrix of the $\alpha_i$'s has integer coefficients, its determinant must be an integer too, and since it is non-zero one has $|\det M| = \sqrt{\det \mathrm{Gram}(\alpha_1, \dots, \alpha_m)} \ge 1$. Thus one gets:

$$\mathbb{P}\left(\forall i \in [m],\ \langle \alpha_i, X \rangle \in (0, \epsilon]\right) \le \epsilon^m \sup_z g_{[m]}(z) \le (\phi\, \epsilon)^m,$$

where $g_{[m]}$ denotes the density of $\left(\langle \alpha_1, X \rangle, \dots, \langle \alpha_m, X \rangle\right)$.

**Linear algebra step**

Lemma:Consider a sequence of improving moves $v_1, \dots, v_{\ell}$ and a set of $d$ distinct vertices (say $u_1, \dots, u_d$) that repeat at least twice in this sequence. Let $\alpha_1, \dots, \alpha_{\ell}$ be the corresponding move coefficients, and for each $i \in [d]$ let $t_i$ (respectively $s_i$) be the first (respectively second) time at which $u_i$ moves. Then the vectors $\beta_i := \alpha_{t_i} + \alpha_{s_i}$, $i \in [d]$, are linearly independent. Furthermore for any vertex $x$ that did not move between the times $t_i$ and $s_i$ one has $\beta_i(\{u_i, x\}) = 0$ (and $\beta_i(e) = 0$ for any $e$ not containing $u_i$).

*Proof:* The last sentence of the lemma is obvious. For the linear independence let $\lambda \in \mathbb{R}^d$ be such that $\sum_{i=1}^d \lambda_i \beta_i = 0$. Consider a new graph $\mathcal{G}$ with vertex set $\{u_1, \dots, u_d\}$ and such that $u_i$ is connected to $u_j$ if $u_j$ appears an odd number of times between the times $t_i$ and $s_i$. This defines an oriented graph, however if $u_i$ is connected to $u_j$ but $u_j$ is not connected to $u_i$ then one has $\beta_j(\{u_i, u_j\}) = 0$ while $\beta_i(\{u_i, u_j\}) \ne 0$ (and furthermore $\beta_k(\{u_i, u_j\}) = 0$ for any $k \notin \{i, j\}$) and thus $\lambda_i = 0$. In other words we can consider a subset of $\{u_1, \dots, u_d\}$ where $\mathcal{G}$ is an undirected graph, and outside of this subset $\lambda$ is identically zero. To reduce notation we simply assume that $\mathcal{G}$ is undirected. Next we observe that if $u_i$ and $u_j$ are connected then one must have $\beta_i(\{u_i, u_j\}) + \beta_j(\{u_i, u_j\}) = 0$ (this uses the fact that we look at the *first* two times at which the vertices move) and in particular (again using that $\beta_k(\{u_i, u_j\}) = 0$ for any $k \notin \{i, j\}$) one must have $\lambda_i = \lambda_j$. Now let $\mathcal{C}$ be some connected component of $\mathcal{G}$, and let $\mu_{\mathcal{C}}$ be the unique value of $\lambda$ on $\mathcal{C}$. Noting that the $\beta_i$'s corresponding to different components of $\mathcal{G}$ have different supports (more precisely with $\beta_{\mathcal{C}} = \sum_{i \in \mathcal{C}} \beta_i$ one has $\sum_i \lambda_i \beta_i = \sum_{\mathcal{C}} \mu_{\mathcal{C}}\, \beta_{\mathcal{C}}$, and the $\beta_{\mathcal{C}}$'s have disjoint supports) one obtains $\mu_{\mathcal{C}}\, \beta_{\mathcal{C}} = 0$ for every component $\mathcal{C}$. On the other hand since the sequence of moves is improving one must have $\langle \beta_{\mathcal{C}}, X \rangle > 0$, which implies $\beta_{\mathcal{C}} \neq 0$ and finally $\mu_{\mathcal{C}} = 0$ (thus concluding the proof of the linear independence).

**Combinatorial step**

Lemma:Let $\ell = 2n$ and consider any sequence of moves $v_1, \dots, v_{\ell}$. There exists $k \le \log_2 \ell$ and a segment of length $2^k$ such that the number of vertices that repeat at least twice in the segment is at least $2^k / (5 \log_2 n)$.

*Proof (from ABPW):* Define the surplus of a sequence to be the difference between the number of elements and the number of distinct elements in the sequence. Let $s_k$ be the maximum surplus in any segment of length $2^k$ in $v_1, \dots, v_{\ell}$. Observe that $s_0 = 0$, while the surplus of the whole sequence is at least $\ell - n = n$. Let us now assume that for any segment of length $2^k$, the number of vertices that repeat at least twice is at most $2^k / (5 \log_2 n)$. Then one has by induction (splitting a segment into two halves, and noting that the surplus of the concatenation is at most the sum of the surpluses of the halves plus the number of vertices that repeat in the whole segment)

$$s_k \le 2\, s_{k-1} + \frac{2^k}{5 \log_2 n} \le \frac{k\, 2^k}{5 \log_2 n}.$$

This shows that the surplus of the whole sequence is at most $s_{\log_2(2n)} \le \frac{2n \log_2(2n)}{5 \log_2 n} < n$, a contradiction, which concludes the proof.

**Putting things together**

We want to show that with $\epsilon = n^{-c \log n}$ one has

$$\mathbb{P}\left(\exists\ \text{an } \epsilon\text{-slow sequence of length } 2n\right) = o(1).$$

By the combinatorial lemma we know that it is enough to show that:

$$\sum_{k \le \log_2(2n)} \mathbb{P}\left(\exists\ \text{an } \epsilon\text{-slow segment of length } 2^k \text{ with at least } 2^k/(5 \log_2 n) \text{ repeating vertices}\right) = o(1).$$

Now using the probability lemma together with the linear algebra lemma (and observing that critically $\beta_i$ only depends on the value of $\sigma$ at the vertices appearing in the segment, and thus the union bound over $\sigma$ only gives a factor $2^{2^k}$ instead of $2^n$) one obtains that the above probability is bounded by

$$\sum_{k \le \log_2(2n)} 2n \cdot n^{2^k} \cdot 2^{2^k} \cdot \left(2\, \phi\, \epsilon\right)^{2^k / (5 \log_2 n)},$$

which is $o(1)$ for $c$ large enough, concluding the proof of the Etscheid and Roglin result.

Note that a natural route to get a polynomial-time bound from the above proof would be to remove the $\log n$ term in the combinatorial lemma, but we show in our paper that this is impossible. Our result comes essentially from improvements to the linear algebra step (this is more difficult as the Etscheid and Roglin linear algebra lemma is particularly friendly for the union bound step, so we had to find another way to do the union bound).

- The junior faculty (Shayan Oveis Gharan, Thomas Rothvoss, and Yin Tat Lee) are all doing groundbreaking (and award-winning) work at the interface of optimization and probability. In my opinion the junior faculty roster is a key element in the choice of grad school, as typically junior faculty have much more time to dedicate to students. In particular I know that Yin Tat Lee is looking for graduate students starting next Fall.
- Besides the theory of computation group, UW has lots of resources in optimization such as TOPS (Trends in Optimization Seminar), and many optimization faculty in various departments (Maryam Fazel, Jim Burke, Dmitriy Drusvyatskiy, Jeff Bilmes, Zaid Harchaoui), which means many interesting classes to take!
- The Theory Group at Microsoft Research is just a bridge away from UW, and we have lots of activities on optimization/probability there too. In fact I am also looking for one graduate student, to be co-advised with a faculty from UW.

Long story short, if you are a talented young mathematician interested in making a difference in optimization then you should apply to the CS department at UW, and here is the link to do so.


Theorem[Bubeck and Ganguly]If the distribution $\mu$ is log-concave, i.e., if it has density $e^{-\varphi}$ for some convex function $\varphi$, and if $d / \left(n^3 \log^2 d\right) \to \infty$, then

$$\mathrm{TV}\left(\mathcal{W}(n, d),\, \mathcal{M}(n)\right) \to 0, \tag{1}$$

where $\mathcal{W}(n, d)$ is an appropriately scaled Wishart matrix coming from $n$ vectors in $\mathbb{R}^d$ having i.i.d. entries from $\mu$, and $\mathcal{M}(n)$ is a GOE matrix, both with the diagonal removed.

The proof hinges on a high-dimensional entropic central limit theorem, so a large part of the post is devoted to entropic central limit theorems and ways of proving them. Without further ado let us jump right in.
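Before the proof machinery, the statement itself is easy to probe by simulation. The sketch below (our own, in the Gaussian case for simplicity) samples a single off-diagonal entry of the scaled Wishart matrix $(X X^{\top} - d\, I)/\sqrt{d}$ and checks that for large $d$ its mean and variance match those of a standard Gaussian, i.e. of an off-diagonal GOE entry:

```python
import math, random

rng = random.Random(7)

def wishart_entry(d):
    """Off-diagonal entry of the scaled Wishart matrix (X X^T - d I)/sqrt(d),
    built from two independent rows with i.i.d. standard Gaussian entries;
    for large d it should be close to N(0, 1), like a GOE entry."""
    x = [rng.gauss(0, 1) for _ in range(d)]
    y = [rng.gauss(0, 1) for _ in range(d)]
    return sum(a * b for a, b in zip(x, y)) / math.sqrt(d)

samples = [wishart_entry(400) for _ in range(5000)]
mean = sum(samples) / len(samples)
var = sum(s * s for s in samples) / len(samples) - mean ** 2
```

Of course matching one entry's moments is far weaker than closeness in total variation of the whole matrix, which is what the theorem controls.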

**Pinsker’s inequality: from total variation to relative entropy**

Our goal is now to bound $\mathrm{TV}\left(\mathcal{W}(n, d), \mathcal{M}(n)\right)$ from above. In the general setting considered here there is no nice formula for the density of the Wishart ensemble, so this total variation distance cannot be computed directly. Coupling these two random matrices also seems challenging.

In light of these observations, it is natural to switch to a different metric on probability distributions that is easier to handle in this case. Here we use Pinsker’s inequality to switch to relative entropy:

$$\mathrm{TV}\left(\mathcal{W}(n, d),\, \mathcal{M}(n)\right)^2 \le \frac{1}{2}\, \mathrm{Ent}\left(\mathcal{W}(n, d)\, \big\|\, \mathcal{M}(n)\right), \tag{2}$$

where $\mathrm{Ent}\left(\mathcal{W} \,\|\, \mathcal{M}\right)$ denotes the relative entropy of $\mathcal{W}$ with respect to $\mathcal{M}$. We next take a detour to entropic central limit theorems and techniques involved in their proof, before coming back to bounding the right hand side in (2).

**An introduction to entropic CLTs**

Let $\gamma$ denote the density of $\mathcal{N}(0, I_n)$, the $n$-dimensional standard Gaussian distribution, and let $f$ be an isotropic density with mean zero, i.e., a density for which the covariance matrix is the identity $I_n$. Then

$$\mathrm{Ent}(f \,\|\, \gamma) = \int f \log \frac{f}{\gamma} = -H(f) + H(\gamma),$$

where $H(f) := -\int f \log f$ denotes differential entropy, and the second equality follows from the fact that $\log \gamma$ is quadratic in $x$, and the first two moments of $f$ and $\gamma$ are the same by assumption. Since relative entropy is always nonnegative, we thus see that the standard Gaussian maximizes entropy among isotropic densities. It turns out that much more is true.

The central limit theorem states that if $X_1, X_2, \dots$ are i.i.d. real-valued random variables with zero mean and unit variance, then $S_n := \frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i$ converges in distribution to a standard Gaussian random variable as $n \to \infty$. There are many other senses in which $S_n$ converges to a standard Gaussian, the entropic CLT being one of them.

Theorem[Entropic CLT]Let $X_1, X_2, \dots$ be i.i.d. real-valued random variables with zero mean and unit variance, and let $S_n := \frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i$. If $H(S_n) > -\infty$ for some $n$, then

$$H(S_n) \to H(\gamma_1)$$

as $n \to \infty$. Moreover, the entropy of $S_n$ increases monotonically, i.e., $H(S_n) \le H(S_{n+1})$ for every $n$.

The condition $H(S_n) > -\infty$ for some $n$ is necessary for an entropic CLT to hold; for instance, if the $X_i$ are discrete, then $H(S_n) = -\infty$ for all $n$.

The entropic CLT originates with Shannon in the 1940s and was first proven by Linnik (without the monotonicity part of the statement). The first proofs that gave explicit convergence rates were given independently and at roughly the same time by Artstein, Ball, Barthe, and Naor, and Johnson and Barron in the early 2000s, using two different techniques.

The fact that $H(S_2) \ge H(S_1)$ follows from the entropy power inequality, which goes back to Shannon in 1948. This implies that $H(S_{2^k})$ is non-decreasing in $k$, and so it was naturally conjectured that $H(S_n)$ increases monotonically. However, proving this turned out to be challenging. Even the inequality $H(S_3) \ge H(S_2)$ was unknown for over fifty years, until Artstein, Ball, Barthe, and Naor proved in general that $H(S_{n+1}) \ge H(S_n)$ for all $n$.

In the following we sketch some of the main ideas that go into the proof of these results, in particular following the technique introduced by Ball, Barthe, and Naor.

**From relative entropy to Fisher information**

Our goal is to show that some random variable $X$ with density $f$, which is a convolution of many i.i.d. random variables, is close to a Gaussian $G \sim \mathcal{N}(0, I_n)$. One way to approach this is to *interpolate* between the two. There are several ways of doing this; for our purposes interpolation along the Ornstein-Uhlenbeck semigroup is most useful. Define

$$X_t = e^{-t}\, X + \sqrt{1 - e^{-2t}}\, G$$

for $t \in [0, \infty)$, where $G$ is standard Gaussian and independent of $X$, and let $f_t$ denote the density of $X_t$. We have $f_0 = f$ and $f_t \to \gamma$ as $t \to \infty$. This semigroup has several desirable properties. For instance, if the density of $X$ is isotropic, then so is $f_t$. Before we can state the next desirable property that we will use, we need to introduce a few more useful quantities.
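The interpolation is straightforward to simulate. The following sketch (our own illustration) starts from a centered unit-variance uniform distribution and checks numerically that isotropy (zero mean, unit variance) is preserved along the semigroup:

```python
import math, random

rng = random.Random(42)

def ou_interpolate(x0_sampler, t, n_samples):
    """Sample X_t = e^{-t} X_0 + sqrt(1 - e^{-2t}) G along the
    Ornstein-Uhlenbeck semigroup, with G a standard Gaussian."""
    a = math.exp(-t)
    b = math.sqrt(1 - math.exp(-2 * t))
    return [a * x0_sampler() + b * rng.gauss(0, 1) for _ in range(n_samples)]

# a centered, unit-variance, non-Gaussian starting density:
# uniform on [-sqrt(3), sqrt(3)]
x0 = lambda: rng.uniform(-math.sqrt(3), math.sqrt(3))
xt = ou_interpolate(x0, t=0.7, n_samples=20000)
mean = sum(xt) / len(xt)
var = sum(x * x for x in xt) / len(xt) - mean ** 2
```

The identity $e^{-2t} + (1 - e^{-2t}) = 1$ is exactly why the variance stays at one for every $t$.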

For a density function $f$ on $\mathbb{R}^n$, let

$$\mathcal{I}(f) := \int \frac{\nabla f\, (\nabla f)^{\top}}{f}$$

be the *Fisher information matrix*. The Cramer-Rao bound states that

$$\mathrm{Cov}(f) \succeq \mathcal{I}(f)^{-1}.$$

More generally this holds for the covariance of any unbiased estimator of the mean. The *Fisher information* is defined as

$$I(f) := \mathrm{Tr}\left(\mathcal{I}(f)\right) = \int \frac{\|\nabla f\|^2}{f}.$$

It is sometimes more convenient to work with the Fisher information distance, defined as $J(f) := I(f) - I(\gamma) = I(f) - n$. Similarly to the discussion above, one can show that the standard Gaussian minimizes the Fisher information among isotropic densities, and hence the Fisher information distance is always nonnegative.

Now we are ready to state the De Bruijn identity, which characterizes the change of entropy along the Ornstein-Uhlenbeck semigroup via the Fisher information distance:

$$\frac{\partial}{\partial t}\, \mathrm{Ent}\left(f_t \,\|\, \gamma\right) = -J(f_t).$$

This implies that the relative entropy between $f$ and $\gamma$—which is our quantity of interest—can be expressed as follows:

$$\mathrm{Ent}(f \,\|\, \gamma) = \int_0^{\infty} J(f_t)\, dt. \tag{3}$$

Thus our goal is to bound the Fisher information distance $J(f_t)$.

**Bounding the Fisher information distance**

We first recall a classical result by Blachman and Stam that shows that Fisher information decreases under convolution.

Theorem[Blachman; Stam]Let $X_1, \dots, X_n$ be independent random variables taking values in $\mathbb{R}$, and let $a = (a_1, \dots, a_n)$ be such that $\sum_{i=1}^{n} a_i^2 = 1$. Then

$$I\left(\sum_{i=1}^{n} a_i X_i\right) \le \sum_{i=1}^{n} a_i^2\, I(X_i).$$

In the i.i.d. case, this bound becomes $I(S_n) \le I(X_1)$.

Ball, Barthe, and Naor gave the following variational characterization of the Fisher information, which gives a particularly simple proof of Theorem 3. (See Bubeck and Ganguly for a short proof.)

Theorem[Variational characterization of Fisher information]Let $w$ be a sufficiently smooth density on $\mathbb{R}^n$, let $a \in \mathbb{R}^n$ be a unit vector, and let $h$ be the marginal of $w$ in the direction $a$ (i.e., the density of $\langle a, X \rangle$ when $X$ has density $w$). Then we have

$$I(h) \le \int_{\mathbb{R}^n} \frac{\left(\mathrm{div}(p\, w)\right)^2}{w} \tag{4}$$

for any continuously differentiable vector field $p : \mathbb{R}^n \to \mathbb{R}^n$ with the property that for every $x$, $\langle p(x), a \rangle = 1$. Moreover, if $w$ satisfies mild regularity conditions, then there is equality for some suitable vector field $p$.

The Blachman-Stam theorem follows from this characterization by taking the constant vector field . Then we have , and so the right hand side of (4) becomes , where recall that is the Fisher information matrix. In the setting of Theorem 3 the density of is a product density: , where is the density of . Consequently the Fisher information matrix is a diagonal matrix, , and thus , concluding the proof of Theorem 3 using Theorem 4.
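As a numerical sanity check on the Blachman-Stam inequality, one can compute the Fisher information of a one-dimensional density by discretizing the integral of f'(x)^2 / f(x). The sketch below uses a Gaussian-mixture example of our own choosing (not from the text); for a mixture the convolution is again an explicit mixture, so both sides of the inequality are computable.

```python
import math

def phi(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fisher_information(components, lo=-15.0, hi=15.0, step=1e-3):
    """I(f) = integral of f'(x)^2 / f(x) dx for a Gaussian mixture
    f = sum of w * N(mu, var) over (w, mu, var) triples."""
    total, x = 0.0, lo
    while x < hi:
        f = sum(w * phi(x, m, v) for w, m, v in components)
        fp = sum(w * (-(x - m) / v) * phi(x, m, v) for w, m, v in components)
        if f > 1e-300:
            total += fp * fp / f * step
        x += step
    return total

# X has density f = 0.5 N(-1,1) + 0.5 N(1,1), so Var(X) = 2.
f = [(0.5, -1.0, 1.0), (0.5, 1.0, 1.0)]
# For i.i.d. X, Y the density of X + Y is the convolution f * f,
# again a Gaussian mixture.
g = [(0.25, -2.0, 2.0), (0.5, 0.0, 2.0), (0.25, 2.0, 2.0)]

I_f, I_g = fisher_information(f), fisher_information(g)
assert I_f > 0.45          # Cramer-Rao: I(X) >= 1/Var(X) = 0.5, up to grid error
assert 1 / I_g > 2 / I_f   # Blachman-Stam, strict since f is not Gaussian
print(I_f, I_g)
```

Equality in the last inequality would hold for a Gaussian; the strict gap reflects the non-Gaussianity of the mixture.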

Given the characterization of Theorem 4, one need not take the vector field to be constant; one can obtain more by optimizing over the vector field. Doing this leads to the following theorem, which gives a rate of decrease of the Fisher information distance under convolutions.

Theorem (Artstein, Ball, Barthe, and Naor). Let be i.i.d. random variables with a density having a positive spectral gap . (We say that a random variable has spectral gap if for every sufficiently smooth , we have . In particular, log-concave random variables have a positive spectral gap, see Bobkov (1999).) Then for any with we have that

When , then , and thus using (3) we obtain a rate of convergence of in the entropic CLT.

A result similar to Theorem 5 was proven independently and roughly at the same time by Johnson and Barron using a different approach involving score functions.

**A high-dimensional entropic CLT**

The techniques of Artstein, Ball, Barthe, and Naor generalize to higher dimensions, as was recently shown by Bubeck and Ganguly.

A result similar to Theorem 5 can be proven, from which a high-dimensional entropic CLT follows, together with a rate of convergence, by using (3) again.

Theorem (Bubeck and Ganguly). Let be a random vector with i.i.d. entries from a distribution with zero mean, unit variance, and spectral gap . Let be a matrix such that , the identity matrix. Let

and

Then we have that

where denotes the standard Gaussian measure in .

To interpret this result, consider the case where the matrix is built by picking rows one after the other uniformly at random on the Euclidean sphere in , conditionally on being orthogonal to previous rows (to satisfy the isotropicity condition ). We then expect to have and (we leave the details as an exercise for the reader), and so Theorem 7 tells us that .

**Back to Wishart and GOE**

We now turn our attention back to bounding the relative entropy ; recall (2). Since the Wishart matrix contains the (scaled) inner products of vectors in , it is natural to relate and , since the former comes from the latter by adding an additional -dimensional vector to the vectors already present. Specifically, we have the following:

where is a -dimensional random vector with i.i.d. entries from , which are also independent from . Similarly we can write the matrix using :

This naturally suggests using the chain rule for relative entropy and bounding

by induction on . We get that

By convexity of the relative entropy we also have that

Thus our goal is to understand and bound for , and then apply the bound to (followed by taking expectation over ). This is precisely what was done in Theorem 6, the high-dimensional entropic CLT, for satisfying . Since does not necessarily satisfy , we have to correct for the lack of isotropicity. This is the content of the following lemma, the proof of which we leave as an exercise for the reader.

Lemma. Let and be such that . Then for any isotropic random variable taking values in we have that

(5)

We then apply this lemma with and . Observe that

and hence in expectation the middle two terms of the right hand side of (5) cancel each other out.

The last term in (5),

should be understood as the relative entropy between a centered Gaussian with covariance given by and a standard Gaussian in . Controlling the expectation of this term requires studying the probability that is close to being non-invertible, which requires bounds on the left tail of the smallest singular value of . Understanding the extreme singular values of random matrices is a fascinating topic, but it is outside of the scope of these notes, and so we refer the reader to Bubeck and Ganguly for more details on this point.

Finally, the high-dimensional entropic CLT can now be applied to see that

From the induction on we get another factor of , arriving at

We conclude that the dimension threshold is , and the information-theoretic proof that we have outlined sheds light on why this threshold is .

**Barrier to detecting geometry: when Wishart becomes GOE**

Recall from the previous post that is a random geometric graph where the underlying metric space is the -dimensional unit sphere , and where the latent labels of the nodes are i.i.d. uniform random vectors in . Our goal now is to show the impossibility result of Bubeck, Ding, Eldan, and Racz: if , then it is impossible to distinguish between and the Erdos-Renyi random graph . More precisely, we have that

(1)

when and , where denotes total variation distance.

There are essentially three main ways to bound the total variation of two distributions from above: (i) if the distributions have nice formulas associated with them, then exact computation is possible; (ii) through *coupling* the distributions; or (iii) by using inequalities between probability metrics to switch the problem to bounding a different notion of distance between the distributions. Here, while the distribution of does not have a nice formula associated with it, the main idea is to view this random geometric graph as a function of an Wishart matrix with degrees of freedom—i.e., a matrix of inner products of -dimensional Gaussian vectors—denoted by . It turns out that one can view as (essentially) the same function of an GOE random matrix—i.e., a symmetric matrix with i.i.d. Gaussian entries on and above the diagonal—denoted by . The upside of this is that both of these random matrix ensembles have explicit densities that allow for explicit computation. We explain this connection here in the special case of for simplicity; see Bubeck et al. for the case of general .

Recall that if is a standard normal random variable in , then is uniformly distributed on the sphere . Consequently we can view as a function of an appropriate Wishart matrix, as follows. Let be an matrix where the entries are i.i.d. standard normal random variables, and let be the corresponding Wishart matrix. Note that and so . Thus the matrix defined as

has the same law as the adjacency matrix of . Denote the map that takes to by , i.e., .
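In code (a sketch with parameters of our own choosing): for p = 1/2 the threshold is zero, so the map simply takes signs of the off-diagonal Wishart entries.

```python
import random

def geometric_adjacency(n, d, rng):
    """Adjacency matrix of G(n, 1/2, d): connect i ~ j iff <X_i, X_j> >= 0,
    i.e. iff the corresponding Wishart entry is nonnegative."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    W = [[sum(X[i][k] * X[j][k] for k in range(d)) for j in range(n)]
         for i in range(n)]
    return [[1 if i != j and W[i][j] >= 0 else 0 for j in range(n)]
            for i in range(n)]

rng = random.Random(0)
n = 60
A = geometric_adjacency(n, 50, rng)
assert all(A[i][j] == A[j][i] for i in range(n) for j in range(n))  # symmetric
assert all(A[i][i] == 0 for i in range(n))                          # no loops
density = sum(A[i][j] for i in range(n) for j in range(i + 1, n)) / (n * (n - 1) / 2)
print(density)  # close to 1/2, since each inner product is symmetric around 0
```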

In a similar way we can view as a function of an matrix drawn from the Gaussian Orthogonal Ensemble (GOE). Let be a symmetric random matrix where the diagonal entries are i.i.d. normal random variables with mean zero and variance 2, and the entries above the diagonal are i.i.d. standard normal random variables, with the entries on and above the diagonal all independent. Then has the same law as the adjacency matrix of . Note that only depends on the sign of the off-diagonal elements of , so in the definition of we can replace with , where is the identity matrix.

We can thus conclude that

The densities of these two random matrix ensembles are explicit and well known (although we do not state them here), which allow for explicit calculations. The outcome of these calculations is the following result, proven independently and simultaneously by Bubeck et al. and Jiang and Li.

Theorem (Bubeck, Ding, Eldan, and Racz; Jiang and Li). Define the random matrix ensembles and as above. If , then

We conclude that it is impossible to detect underlying geometry whenever .

**The universality of the threshold dimension**

How robust is the result presented above? We have seen that the detection threshold is intimately connected to the threshold of when a Wishart matrix becomes GOE. Understanding the robustness of this result on random matrices is interesting in its own right, and this is what we will pursue in the remainder of this post, which is based on a recent paper by Bubeck and Ganguly.

Let be an random matrix with i.i.d. entries from a distribution that has mean zero and variance . The matrix is known as the Wishart matrix with degrees of freedom. As we have seen above, this arises naturally in geometry, where is known as the Gram matrix of inner products of points in . The Wishart matrix also appears naturally in statistics as the sample covariance matrix, where is the number of samples and is the number of parameters. (Note that in statistics the number of samples is usually denoted by , and the number of parameters is usually denoted by ; here our notation is taken with the geometric perspective in mind.)

We consider the Wishart matrix with the diagonal removed, and scaled appropriately:

In many applications—such as to random graphs as above—the diagonal of the matrix is not relevant, so removing it does not lose information. Our goal is to understand how large the dimension has to be so that is approximately like , which is defined as the Wigner matrix with zeros on the diagonal and i.i.d. standard Gaussians above the diagonal. In other words, is drawn from the Gaussian Orthogonal Ensemble (GOE) with the diagonal replaced with zeros.
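A quick sketch of this construction (sizes are our choice): with unit-variance entries, each off-diagonal entry of the scaled Wishart matrix is an inner product divided by the square root of the dimension, so it has mean 0 and variance 1, matching the Wigner entries one by one; only the joint dependence distinguishes the ensembles.

```python
import random

def scaled_wishart_offdiag(n, d, rng):
    """X X^T / sqrt(d) with the diagonal zeroed out, for X with i.i.d. N(0,1) entries."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    return [[0.0 if i == j else sum(X[i][k] * X[j][k] for k in range(d)) / d ** 0.5
             for j in range(n)] for i in range(n)]

rng = random.Random(1)
n, d = 16, 2000
M = scaled_wishart_offdiag(n, d, rng)
entries = [M[i][j] for i in range(n) for j in range(i + 1, n)]
mean = sum(entries) / len(entries)
var = sum(e * e for e in entries) / len(entries)
print(mean, var)  # each off-diagonal entry is approximately N(0, 1) for large d
```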

A simple application of the multivariate central limit theorem gives that if is fixed and , then converges to in distribution. The main result of Bubeck and Ganguly establishes that this holds as long as under rather general conditions on the distribution .

Theorem (Bubeck and Ganguly). If the distribution is log-concave, i.e., if it has density for some convex function , and if , then

(2)

On the other hand, if has a finite fourth moment and , then

(3)

This result extends Theorem 1 from the previous post and Theorem 1 from above, and establishes as the universal critical dimension (up to logarithmic factors) for sufficiently smooth measures : is approximately Gaussian if and only if is much larger than . For random graphs, as seen above, this is the dimension barrier to extracting geometric information from a network: if the dimension is much greater than the cube of the number of vertices, then all geometry is lost. In the setting of statistics this means that the Gaussian approximation of a Wishart matrix is valid as long as the sample size is much greater than the cube of the number of parameters. Note that for some statistics of a Wishart matrix the Gaussian approximation is valid for much smaller sample sizes (e.g., the largest eigenvalue behaves as in the limit even when the number of parameters is on the same order as the sample size (Johnstone, 2001)).

To distinguish the random matrix ensembles, we have seen in the previous post that signed triangles work up until the threshold dimension in the case when is standard normal. It turns out that the same statistic works in this more general setting; when the entries of the matrices are centered, this statistic can be written as . We leave the details as an exercise for the reader.
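A sketch of this distinguishing statistic for centered matrices with zero diagonal (our parameter choices): one natural version of it is the trace of the cubed matrix, which sums the products over triples of indices. Under the Wishart ensemble its mean is of order n^3 / sqrt(d), while under the Wigner ensemble it has mean zero and standard deviation of order n^(3/2).

```python
import random

def trace_cubed(M):
    """Sum over all index triples of M_ij * M_jk * M_ki; the zero diagonal
    kills every term with a repeated index."""
    n = len(M)
    return sum(M[i][j] * M[j][k] * M[k][i]
               for i in range(n) for j in range(n) for k in range(n))

def scaled_wishart(n, d, rng):
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    return [[0.0 if i == j else sum(X[i][k] * X[j][k] for k in range(d)) / d ** 0.5
             for j in range(n)] for i in range(n)]

def wigner(n, rng):
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            M[i][j] = M[j][i] = rng.gauss(0.0, 1.0)
    return M

rng = random.Random(2)
n, d, trials = 30, 100, 5
w_stat = sum(trace_cubed(scaled_wishart(n, d, rng)) for _ in range(trials)) / trials
g_stat = sum(trace_cubed(wigner(n, rng)) for _ in range(trials)) / trials
print(w_stat, g_stat)  # w_stat is around n(n-1)(n-2)/sqrt(d); g_stat fluctuates near 0
```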

We note that for (2) to hold it is necessary to have some smoothness assumption on the distribution . For instance, if is purely atomic, then so is the distribution of , and thus its total variation distance to is . The log-concave assumption gives this necessary smoothness, and it is an interesting open problem to understand how far this can be relaxed.

We will see the proof (and in particular the connection to entropic CLT!) in the next post.

**A simple random geometric graph model and basic questions**

We study perhaps the simplest model of a random geometric graph, where the underlying metric space is the -dimensional unit sphere , and where the latent labels of the nodes are i.i.d. uniform random vectors in . More precisely, the random geometric graph is defined as follows. Let be independent random vectors, uniformly distributed on . In , distinct nodes and are connected by an edge if and only if , where the threshold value is chosen such that . For example, when , we have .

The most natural random graph model without any structure is the standard Erdos-Renyi random graph , where any two of the vertices are independently connected with probability .

We can thus formalize the question of detecting underlying geometry as a simple hypothesis testing question. The null hypothesis is that the graph is drawn from the Erdos-Renyi model, while the alternative is that it is drawn from . In brief:

(1)

To understand this question, the basic quantity we need to study is the total variation distance between the two distributions on graphs, and , denoted by ; recall that the total variation distance between two probability measures and is defined as . We are interested in particular in the case when the dimension is *large*, growing with .
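For distributions on a finite set, the total variation distance is half the L1 distance between the probability vectors, as in this small sketch (the example distributions are ours):

```python
def tv_distance(mu, nu):
    """Total variation distance between two distributions on a common finite set,
    given as dicts mapping outcomes to probabilities."""
    support = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(x, 0.0) - nu.get(x, 0.0)) for x in support)

mu = {"edge": 0.5, "no-edge": 0.5}   # e.g. a single edge marginal at p = 1/2
nu = {"edge": 0.8, "no-edge": 0.2}
print(tv_distance(mu, nu))  # 0.3
assert abs(tv_distance(mu, nu) - 0.3) < 1e-12
assert tv_distance(mu, mu) == 0.0
```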

It is intuitively clear that if the geometry is too high-dimensional, then it is impossible to detect it, while a low-dimensional geometry will have a strong effect on the generated graph and will be detectable. How fast can the dimension grow with while still being able to detect it? Most of this post will focus on this question.

If we can detect geometry, then it is natural to ask for more information. Perhaps the ultimate goal would be to find an embedding of the vertices into an appropriate dimensional sphere that is a *true representation*, in the sense that the geometric graph formed from the embedded points is indeed the original graph. More modestly, can the dimension be estimated? We touch on this question at the end of the post.

**The dimension threshold for detecting underlying geometry**

The high-dimensional setting of the random geometric graph was first studied by Devroye, Gyorgy, Lugosi, and Udina, who showed that geometry is indeed lost in high dimensions: if is fixed and , then . More precisely, they show that this convergence happens when , but this is not tight. The dimension threshold for dense graphs was recently found by Bubeck, Ding, Eldan, and Racz, and it turns out that it is , in the following sense.

Theorem (Bubeck, Ding, Eldan, and Racz 2014). Let be fixed. Then

(2)

(3)

Moreover, in the latter case there exists a computationally efficient test to detect underlying geometry (with running time ).

Most of this post is devoted to understanding (3), that is, how the two models can be distinguished; the impossibility result of (2) will be discussed in a future post. At the end we will also consider this same question for *sparse graphs* (where ), where determining the dimension threshold is an intriguing open problem.

**The triangle test**

A natural test to uncover geometric structure is to count the number of triangles in . Indeed, in a purely random scenario, vertex being connected to both and says nothing about whether and are connected. On the other hand, in a geometric setting this implies that and are close to each other due to the triangle inequality, thus increasing the probability of a connection between them. This, in turn, implies that the expected number of triangles is larger in the geometric setting, given the same edge density. Let us now compute what this statistic gives us.

Given that is connected to both and , and are more likely to be connected under than under .

For a graph , let denote its adjacency matrix. Then

is the indicator variable that three vertices , , and form a triangle, and so the number of triangles in is

By linearity of expectation, for both models the expected number of triangles is times the probability of a triangle between three specific vertices. For the Erdos-Renyi random graph the edges are independent, so the probability of a triangle is , and thus we have

For it turns out that for any fixed we have

(4)

for some constant , which gives that

Showing (4) is somewhat involved, but in essence it follows from the *concentration of measure* phenomenon on the sphere, namely that most of the mass on the high-dimensional sphere is located in a band of around the equator. We sketch here the main intuition for , which is illustrated in the figure below.

Let , , and be independent uniformly distributed points in . Then

where the last equality follows by independence. So what remains is to show that this latter conditional probability is approximately . To compute this conditional probability what we really need to know is the typical angle between and . By rotational invariance we may assume that , and hence , the first coordinate of . One way to generate is to sample a -dimensional standard Gaussian and then normalize it by its length. Since the norm of a -dimensional standard Gaussian is very well concentrated around , it follows that is on the order of . Conditioned on , this typical angle gives the boost in the conditional probability that we see.

If and are two independent uniform points on , then their inner product is on the order of due to the concentration of measure phenomenon on the sphere.
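This concentration is easy to see in simulation (the dimensions below are our choice): sampling uniform points on the sphere by normalizing Gaussian vectors, the absolute inner product of two independent points is on the order of one over the square root of the dimension.

```python
import math
import random

def random_unit_vector(d, rng):
    """Uniform point on the (d-1)-sphere: normalize a standard Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]

rng = random.Random(4)
d, trials = 1000, 200
avg = sum(abs(sum(u * v for u, v in zip(random_unit_vector(d, rng),
                                        random_unit_vector(d, rng))))
          for _ in range(trials)) / trials
print(avg, 1 / math.sqrt(d))  # the average |<u, v>| is on the order of 1/sqrt(d)
```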

Thus we see that the boost in the number of triangles in the geometric setting is in expectation:

To be able to tell apart the two graph distributions based on the number of triangles, the boost in expectation needs to be much greater than the standard deviation. A simple calculation shows that

and also

Thus we see that if , which is equivalent to .
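The Erdos-Renyi side of this computation can be checked directly in simulation (parameters ours): the triangle count concentrates around C(n,3) p^3.

```python
import random

def triangle_count(A):
    """Number of triangles in the graph with adjacency matrix A."""
    n = len(A)
    return sum(A[i][j] * A[j][k] * A[i][k]
               for i in range(n) for j in range(i + 1, n) for k in range(j + 1, n))

def erdos_renyi(n, p, rng):
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            A[i][j] = A[j][i] = 1 if rng.random() < p else 0
    return A

rng = random.Random(3)
n, p, trials = 30, 0.5, 10
avg = sum(triangle_count(erdos_renyi(n, p, rng)) for _ in range(trials)) / trials
expected = n * (n - 1) * (n - 2) / 6 * p ** 3  # C(n,3) p^3 = 507.5 here
print(avg, expected)
```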

**Signed triangles are more powerful**

While triangles detect geometry up until , are there even more powerful statistics that detect geometry for larger dimensions? One can check that longer cycles also only work when , as do several other natural statistics. Yet the underlying geometry can be detected even when .

The simple idea that leads to this improvement is to consider *signed triangles*. We have already noticed that triangles are more likely in the geometric setting than in the purely random setting. This also means that induced wedges (i.e., when there are exactly two edges among the three possible ones) are less likely in the geometric setting. Similarly, induced single edges are more likely, and induced independent sets on three vertices are less likely in the geometric setting. The following figure summarizes these observations.

The signed triangles statistic incorporates these observations by giving the different patterns positive or negative weights. More precisely, we define

The key insight motivating this definition is that the variance of signed triangles is *much smaller* than the variance of triangles, due to the cancellations introduced by the centering of the adjacency matrix: the term vanishes, leaving only the term. It is a simple exercise to show that

and

On the other hand it can be shown that

(5)

so the gap between the expectations remains. Furthermore, it can be shown that the variance also decreases for and we have

Putting everything together we get that if , which is equivalent to . This concludes the proof of (3) from Theorem 1.
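A sketch of the signed triangle statistic in action (our parameter choices): we take p = 1/2, generate the geometric graph by thresholding inner products of Gaussian latent vectors at zero, and use a small dimension so the geometric boost is pronounced. Under the Erdos-Renyi model the statistic fluctuates around zero, while the geometric graph shows a clear positive mean.

```python
import random

def signed_triangles(A, p):
    """tau(G) = sum over i < j < k of (A_ij - p)(A_jk - p)(A_ik - p)."""
    n = len(A)
    return sum((A[i][j] - p) * (A[j][k] - p) * (A[i][k] - p)
               for i in range(n) for j in range(i + 1, n) for k in range(j + 1, n))

def erdos_renyi(n, p, rng):
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            A[i][j] = A[j][i] = 1 if rng.random() < p else 0
    return A

def geometric(n, d, rng):
    """G(n, 1/2, d): connect i ~ j iff <X_i, X_j> >= 0 for Gaussian latent vectors."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(X[i][k] * X[j][k] for k in range(d))
            A[i][j] = A[j][i] = 1 if dot >= 0 else 0
    return A

rng = random.Random(5)
n, trials = 40, 3
tau_er = sum(signed_triangles(erdos_renyi(n, 0.5, rng), 0.5)
             for _ in range(trials)) / trials
tau_geo = sum(signed_triangles(geometric(n, 2, rng), 0.5)
              for _ in range(trials)) / trials
print(tau_er, tau_geo)  # tau_er fluctuates around 0; tau_geo has a positive mean
```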

**Estimating the dimension**

Until now we discussed *detecting* geometry. However, the insights gained above allow us to also touch upon the more subtle problem of *estimating* the underlying dimension .

Dimension estimation can also be done by counting the “number” of signed triangles as above. However, here it is necessary to have a bound on the difference of the expected number of signed triangles between consecutive dimensions; the lower bound on in (5) is not enough. Still, we believe that the lower bound should give the true value of the expected value for an appropriate constant , and hence we expect to have that

(6)

Thus, using the variance bound from above, we get that dimension estimation should be possible using signed triangles whenever , which is equivalent to . Showing (6) for general seems involved; Bubeck et al. showed that it holds for , which can be considered as a proof of concept.

Theorem (Bubeck, Ding, Eldan, and Racz 2014). There exists a universal constant such that for all integers and , one has

This result is tight, as demonstrated by a result of Eldan, which implies that and are indistinguishable when .

**The mysterious sparse regime**

We conclude this post by discussing an intriguing conjecture for *sparse graphs*. It is again natural to consider the number of triangles as a way to distinguish between and . Bubeck et al. show that this statistic works whenever , and conjecture that this is tight.

Conjecture (Bubeck, Ding, Eldan, and Racz 2014). Let be fixed and assume . Then

The main reason for this conjecture is that, when , and seem to be locally equivalent; in particular, they both have the same Poisson number of triangles asymptotically. Thus the only way to distinguish between them would be to find an emergent global property which is significantly different under the two models, but this seems unlikely to exist. Proving or disproving this conjecture remains a challenging open problem. The best known bound is from (2) (which holds uniformly over ), which is very far from !


Community detection is a fundamental problem in many sciences, such as sociology (e.g., finding tight-knit groups in social networks), biology (e.g., detecting protein complexes), and beyond. Given its importance, there have been a plethora of algorithms developed in the past few decades to detect communities. In the past few years there have also been many works studying the fundamental limits of these recovery algorithms.

This post describes the recent results of Abbe and Sandon which characterize the fundamental limits of exact recovery in the most canonical and popular probabilistic generative model, the stochastic block model (SBM).

**The stochastic block model and exact recovery**

The general stochastic block model is a distribution on graphs with latent community structure, and it has three parameters:

- , the number of vertices;
- a probability distribution that describes the relative sizes of the communities;
- and , a symmetric matrix that describes the probabilities with which two given vertices are connected, depending on which communities they belong to.

The number of communities, , is implicit in this notation; in these notes we assume that is a fixed constant.

A random graph from is defined as follows:

- The vertex set of the graph is .
- Every vertex is independently assigned a (hidden) label from the probability distribution on . That is, for every .
- Given the labels of the vertices, each (unordered) pair of vertices is connected independently with probability .
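The sampling procedure above can be sketched directly (the two-community parameters below are our own choice, not from the text):

```python
import random

def sbm(n, prior, Q, rng):
    """Sample (labels, adjacency matrix) from the stochastic block model.
    prior: community probabilities; Q[a][b]: connection probability between
    a vertex in community a and a vertex in community b."""
    labels = [rng.choices(range(len(prior)), weights=prior)[0] for _ in range(n)]
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < Q[labels[i]][labels[j]]:
                A[i][j] = A[j][i] = 1
    return labels, A

rng = random.Random(6)
n = 200
labels, A = sbm(n, [0.5, 0.5], [[0.5, 0.1], [0.1, 0.5]], rng)
intra = [A[i][j] for i in range(n) for j in range(i + 1, n) if labels[i] == labels[j]]
inter = [A[i][j] for i in range(n) for j in range(i + 1, n) if labels[i] != labels[j]]
print(sum(intra) / len(intra), sum(inter) / len(inter))  # roughly 0.5 vs 0.1
```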

Example (Symmetric communities). A simple example to keep in mind is that of symmetric communities, with more edges within communities than between communities.

This is modeled by the SBM with for all and if and otherwise, with .

We write for a graph generated according to the SBM without the hidden vertex labels revealed.

The goal of a statistical inference algorithm is to recover as many labels as possible using only the underlying graph as an observation.

There are various notions of success that are worth studying. In this post we focus on *exact recovery*: we aim to recover the labels of all vertices exactly with high probability (whp).

More precisely, in evaluating the success of an algorithm, the agreement of a partition with the true partition is maximized over all relabellings of the communities, since we are not interested in the specific original labelling per se, but rather the partition (community structure) it induces.

For exact recovery, all vertices in all but one community should be non-isolated (in the symmetric case this means that the graph should be connected), requiring the edge probabilities to be .

It is thus natural to scale the edge probability matrix accordingly, i.e., to consider , where .

We also assume that the communities have linear size, i.e., that is independent of , and for all .

**From exact recovery to testing multivariate Poisson distributions**

As a thought experiment, imagine that not only is the graph given, but also all vertex labels are revealed, except for that of a given vertex . Is it possible to determine the label of ?

Understanding this question is key for understanding exact recovery, since if the error probability of this is too high, then exact recovery will not be possible. On the other hand, it turns out that in this regime it is possible to recover all but labels using an initial partial recovery algorithm. The setup of the thought experiment then becomes relevant, and if we can determine the label of given the labels of all the other nodes with low error probability, then we can correct all errors made in the initial partial recovery algorithm, leading to exact recovery. We will come back to the connection between the thought experiment and exact recovery; for now we focus on understanding this thought experiment.

Given the labels of all vertices except , the information we have about is the number of nodes in each community it is connected to. In other words, we know the *degree profile* of , where, for a given labelling of the graph’s vertices, the -th component is the number of edges between and the vertices in community .

The distribution of the degree profile depends on the community that belongs to. Recall that the community sizes are given by a multinomial distribution with parameters and , and hence the relative size of community concentrates on . Thus if , the degree profile can be approximated by independent binomials, with approximately distributed as

where denotes the binomial distribution with trials and success probability . In this regime, the binomial distribution is well-approximated by a Poisson distribution of the same mean. Thus the degree profile of a vertex in community is approximately Poisson distributed with mean

where is the -th unit vector. Defining , this can be abbreviated as , where denotes the -th column of the matrix .

We call the quantity the *community profile* of community ; this is the quantity that determines the distribution of the degree profile of vertices from a given community.

Our thought experiment has thus been reduced to a Bayesian hypothesis testing problem between multivariate Poisson distributions.

The prior on the label of is given by , and we get to observe the degree profile , which comes from one of multivariate Poisson distributions, which have mean times the community profiles , .

**Testing multivariate Poisson distributions**

We now turn to understanding the testing problem described above; the setup is as follows.

We consider a Bayesian hypothesis testing problem with hypotheses.

The random variable takes values in with prior given by , i.e., .

We do not observe , but instead we observe a draw from a multivariate Poisson distribution whose mean depends on the realization of :

given , the mean is .

In short:

In more detail:

where

and

Our goal is to infer the value of from a realization of .

The error probability is minimized by the maximum a posteriori (MAP) rule, which, upon observing , selects

as an estimate for the value of , with ties broken arbitrarily.

Let denote the error of the MAP estimator.
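A sketch of this testing problem (the prior, the mean vectors, and the simple rejection-free Poisson sampler are our choices): the MAP rule maximizes the log prior plus the Poisson log-likelihood over the hypotheses.

```python
import math
import random

def poisson_sample(lam, rng):
    """Knuth's method; fine for moderate lam."""
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def log_poisson_pmf(x, lam):
    """log P(D = x) for D with independent Poisson(lam[i]) coordinates."""
    return sum(xi * math.log(li) - li - math.lgamma(xi + 1)
               for xi, li in zip(x, lam))

def map_estimate(x, prior, means):
    """MAP rule: maximize prior[j] * P_j(x) over the hypotheses j."""
    return max(range(len(prior)),
               key=lambda j: math.log(prior[j]) + log_poisson_pmf(x, means[j]))

rng = random.Random(7)
prior = [0.5, 0.5]
means = [[10.0, 2.0], [2.0, 10.0]]  # two well-separated "community profiles"
correct = 0
for _ in range(500):
    j = 0 if rng.random() < prior[0] else 1
    x = [poisson_sample(m, rng) for m in means[j]]
    correct += map_estimate(x, prior, means) == j
print(correct / 500)  # well-separated means give high accuracy
```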

One can think of the MAP estimator as a tournament of pairwise comparisons of the hypotheses:

if

then the MAP estimate is not .

The probability that one makes an error during such a comparison is exactly

(1)

For finite , the error of the MAP estimator is on the same order as the largest pairwise comparison error, i.e., .

In particular, we have that

(2)

Thus we desire to understand the magnitude of the error probability in (1) in the particular case when the conditional distribution of given is a multivariate Poisson distribution with mean vector on the order of . The following result determines this error up to first order in the exponent.

Lemma (Abbe and Sandon 2015). For any with and , we have

(3)

where

(4)

Due to connections with other, well-known measures of divergence (which we do not discuss here), Abbe and Sandon termed the quantity in (4) the *Chernoff-Hellinger divergence*.
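The CH-divergence in (4), as we recall it, is a maximum over t in [0,1] of sum_i (t theta_i + (1-t) theta'_i - theta_i^t theta'_i^(1-t)), which is easy to compute by grid search; this sketch and its example vectors are ours.

```python
def ch_divergence(theta, theta_p, grid=1001):
    """D_+(theta, theta') = max over t in [0,1] of
       sum_i [ t*theta_i + (1-t)*theta'_i - theta_i^t * theta'_i^(1-t) ],
    computed on a uniform grid in t (all theta_i assumed positive)."""
    best = 0.0
    for s in range(grid):
        t = s / (grid - 1)
        val = sum(t * a + (1 - t) * b - a ** t * b ** (1 - t)
                  for a, b in zip(theta, theta_p))
        best = max(best, val)
    return best

def hellinger(theta, theta_p):
    """The t = 1/2 term: (1/2) * sum_i (sqrt(theta_i) - sqrt(theta'_i))^2."""
    return 0.5 * sum((a ** 0.5 - b ** 0.5) ** 2 for a, b in zip(theta, theta_p))

theta, theta_p = [5.0, 1.0], [1.0, 5.0]
print(ch_divergence(theta, theta_p), hellinger(theta, theta_p))
assert ch_divergence(theta, theta) < 1e-12                 # divergence vanishes at equality
assert ch_divergence(theta, theta_p) >= hellinger(theta, theta_p) - 1e-9
```

For a symmetric pair of profiles, as in the example above, the maximum is attained at t = 1/2 by symmetry and concavity in t, which recovers the Hellinger-type quantity of the symmetric-community case.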

We do not go over the proof of this statement—which we leave to the reader as a challenging exercise—but we provide some intuition in the univariate case.

Observe that

decays rapidly away from

so we can obtain a good estimate of the sum

by simply estimating the term

Now observe that must satisfy

after some algebra this is equivalent to

Let denote the maximizer in the expression of in (4).

By differentiating in , we obtain that satisfies

and so

Thus we see that

from which, after some algebra, we get that

The proof of (3) in the multivariate case follows along the same lines: the single term corresponding to

gives the lower bound. For the upper bound of (3) one has to show that the other terms do not contribute much more.

**Characterizing exact recoverability using CH-divergence**

Our conclusion is thus that the error exponent in testing multivariate Poisson distributions is given by the explicit quantity in (4).

The discussion above then implies that plays an important role in the threshold for exact recovery.

In particular, it intuitively follows from the above Lemma that a necessary condition for exact recovery should be that

Suppose on the contrary that

for some and .

This implies that the error probability in the testing problem is for some for all vertices in communities and . Since the number of vertices in these communities is linear in , and most of the hypothesis testing problems are approximately independent, one expects there to be no error in the testing problems with probability at most .

It turns out that this indeed gives the recoverability threshold.

Theorem (Abbe and Sandon 2015). Let denote the number of communities, let with denote the community prior, let , and let be a symmetric matrix with no two rows equal.

Exact recovery is solvable in if and only if

(5)

This theorem thus provides an operational meaning to the CH-divergence for the community recovery problem.

Example (Symmetric communities). Consider again symmetric communities, that is, for all , if , and otherwise, with .

Then exact recovery is solvable in if and only if

(6)

We note that in this case is the same as the Hellinger divergence.

Above we have heuristically described why the condition (5) is necessary.

To see why it is sufficient, first note that if (5) holds, then the Lemma tells us that in the hypothesis testing problem between Poisson distributions the error of the MAP estimate is .

Thus if the setting of the thought experiment applies to every vertex, then by looking at the degree profiles of the vertices we can correctly reclassify all vertices, and the probability that we make an error is by a union bound.

However, the setting of the thought experiment does not quite apply. Nonetheless, in this logarithmic degree regime it is possible to partially reconstruct the labels of the vertices, with only vertices being misclassified. The details of this partial reconstruction procedure would require a separate post—in brief, it determines whether two vertices are in the same community or not by looking at how their size neighborhoods interact.

One can then apply *two* rounds of the degree-profiling step to each vertex, and it can be shown that all vertices are then correctly labelled whp.

(1)

for some . Effectively this means that the estimated cumulative loss outside of the ball is infinite (recall that is proportional to ). Thus to enforce (1) (at time ) we will actually set the loss estimate to on . The price one pays for this is that the loss estimator is now unbiased only on , which in turn means that we control the cumulative regret with respect to points in only. We believe that in the so-called *stochastic setting* where the loss sequence is i.i.d. one should be able to prove that the minimizer of the expected loss function remains in at all rounds (for ), which would imply with the calculations from Part 1 and Part 2 a cumulative regret of order (note: we were not able to prove this and I think it is a great and accessible open problem). However this will certainly not be true in general, as one could imagine an adversary that makes us zoom in on a small portion of space using the first rounds and then moves the optimum very far from this region for the remaining rounds. To solve this issue we introduce a *restart condition*: essentially if the algorithm believes that the optimum might be outside the region then it restarts (i.e. it plays as if the next round was the first one in a game with time horizon ). We will modify the algorithm so we can ensure that when a restart occurs the algorithm actually had negative regret, and thus in the final regret bound we can ignore everything that happened before the last restart. The definition of the focus region given in (1) will not quite work, and in fact we will construct a region which verifies

Furthermore we will need to take , which in turn forces us to take (indeed, recall that in Part 1 we showed that the magnitude of the estimated loss is essentially , where instead of comes from the above display; furthermore, in Part 2 we explained that with the Gaussian core one needs to take to be times smaller than the value predicted in Part 1), and hence the final regret bound (indeed, recall that we got a bound of , where the first comes from the fact that we need to take a learning rate times smaller than the optimal learning rate to ensure approximate log-concavity of the exponential weights, and the comes from the relative entropy distance to the optimum at the beginning).
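To make the truncation idea concrete, here is a small discretized simulation (the grid, losses, and constants are illustrative placeholders, not the quantities of the notes): any point whose estimated cumulative loss falls too far behind the running minimum gets its estimate set to a huge value, effectively removing it from the focus region.

```python
import numpy as np

# Toy discretized sketch of the truncated loss estimator: points whose
# estimated cumulative loss falls more than `threshold` behind the running
# minimum are assigned a huge loss (standing in for "+infinity") and thereby
# leave the focus region. All names and constants are illustrative.
rng = np.random.default_rng(0)
n_points, T, threshold = 200, 50, 5.0
xs = np.linspace(-1.0, 1.0, n_points)
cum_est = np.zeros(n_points)   # estimated cumulative losses
BIG = 1e9                      # stands in for an infinite loss estimate

for t in range(T):
    losses = (xs - 0.3) ** 2 + 0.1 * rng.standard_normal(n_points)
    focus = cum_est <= cum_est.min() + threshold   # current focus region
    # the estimator is unbiased on the focus region only; BIG outside it
    cum_est += np.where(focus, losses, BIG)

focus = cum_est <= cum_est.min() + threshold
print(f"final focus region covers {focus.mean():.0%} of the grid")
```

Note that in this toy version a point excluded once never re-enters the region, which is precisely the kind of over-commitment that the restart condition is designed to police.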

We will use extensively the following result:

Lemma: Let be an isotropic log-concave measure. Then for any such that , one has .

**Restart condition**

The simplest restart idea goes as follows. Let us consider some (with defined as in (1)). Why is it there? Well, there must be some time in the past where we estimated the cumulative regret of to be at least . In particular, if this point now has a small estimated regret, say smaller than , then we can reasonably believe that something weird is going on; for instance, the adversary could be trying to move the optimum outside of the current focus region . Thus our restart condition looks like this: restart if

A key observation (which is the real reason to introduce the above restart condition) is that by convexity of the losses, by the concentration property of Part 2, and by taking the constant to be large enough, we know that the optimum (over all of ) of the “true” cumulative loss is still in at the time when we restart.
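As a toy illustration of the restart test (the function name and threshold are placeholders, not the constants of the notes):

```python
import numpy as np

# Hypothetical sketch of the restart test. A point on the boundary of the
# focus region must, at some past time, have had a large estimated regret;
# if its current estimated regret has dropped below a small threshold, we
# suspect the adversary is moving the optimum and we restart.
def should_restart(boundary_cum_loss, current_min_cum_loss, threshold):
    est_regret = boundary_cum_loss - current_min_cum_loss
    return bool(np.any(est_regret < threshold))

# a boundary point whose estimated regret has dropped to 2.0 < 3.0: restart
print(should_restart(np.array([10.0, 2.0, 9.0]), 0.0, threshold=3.0))
```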

**Hoping for negative regret**

If we catch the adversary’s presumed attempt to move the optimum out of quickly enough, we can hope to get negative regret. Indeed, we initially zoomed in on the region for a good reason, and thus if we compare ourselves to the point which triggers the restart, we have accumulated a lot of negative regret during this zooming procedure (since was performing badly during this time). Mathematically this shows up in the following term, which appears in the regret bound calculation, where is the time at which enters the boundary of the focus region,

(2)

Roughly (essentially thanks to the Lemma above) we should expect this quantity to be as small as for , and thus this term divided by (which is then ) could easily compensate for the variance term, which is . This sounds great, but unfortunately the entropy difference in (2) does not appear in our current regret bound (because of the telescoping sum of entropies). However it is easy to see that if at time step we had increased the learning rate from to , then we would have the term multiplied by in the final regret bound, and thus we would only need to have to ensure negative regret (note that compensating the entropy at the beginning is more difficult than compensating the variance term because of our choice of ). Let us make this slightly more formal.

**Turning up the learning rate**

Let us assume that we update the exponential weights distribution with a time-dependent learning rate as follows: . One can see with the same calculations as in Part 1 that:

In particular, if we increase the learning rate at some times by , then we have

Using the fact that the normalizing constants are all larger than (this can be proved using the fact that the covariance of the exponential weights is at least some ), and assuming that (i.e., ), one gets roughly, with ,

where the second inequality uses that (see Part 2).

We see that by taking we can guarantee negative regret with respect to any whose first time on the boundary of the focus region is also a time at which we increased the learning rate (i.e., one of the times ). Hence we should take if we want a regret of order . Recall also that we can afford to update the learning rate only times. Thus the next idea is to update the focus region only when it is necessary. We will see that we can take (and thus , ) while guaranteeing that the focus region satisfies
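The mechanics of exponential weights with a learning rate that is turned up at a few rounds can be sketched on a grid as follows (a hypothetical discretized illustration; the schedule and all constants are arbitrary):

```python
import numpy as np

# Discretized sketch of exponential weights with a time-dependent learning
# rate: p_t(x) is proportional to exp(-eta_t * L_t(x)), with L_t the
# cumulative (estimated) loss. Turning eta up at a few rounds sharpens the
# distribution, which is what produces the extra entropy terms in the bound.
rng = np.random.default_rng(1)
xs = np.linspace(-1.0, 1.0, 101)
cum_loss = np.zeros_like(xs)
eta = 0.1
bump_times = (30, 60)          # rounds at which the learning rate is doubled

for t in range(100):
    cum_loss += (xs - 0.5) ** 2 + 0.05 * rng.standard_normal(xs.size)
    if t in bump_times:
        eta *= 2.0
    weights = np.exp(-eta * (cum_loss - cum_loss.min()))
    p = weights / weights.sum()

print(f"mass within 0.2 of the optimum: {p[np.abs(xs - 0.5) < 0.2].sum():.2f}")
```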

As we explained at the beginning of this post this will conclude the proof of the regret bound.

**Updating the focus region more carefully**

We update the focus region only at times such that, once space is rescaled so that is isotropic,

and in this case we set

It is fairly clear that we won’t make more than updates (since after that many updates the focus region is really tiny), so the only thing to verify is that one always has (whether we update or not)

This follows from:

Lemma: Let be a convex body and let be the centered unit ball. Suppose that . Then .

*Proof:* Let us prove the contrapositive and assume that there is a point with . Denote , and consider the sets .

Note that these sets are disjoint. Indeed, the intervals are disjoint, which implies that the projections of the ellipsoids onto the span of are disjoint. Thus we have

which concludes the proof.
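Numerically, the rescale-and-intersect update of the focus region might look as follows (a sketch under the assumption that the region is the ellipsoid determined by the empirical covariance; the radius and the Gaussian data are placeholders):

```python
import numpy as np

# Sketch of the focus-region update: put the current exponential-weights
# distribution in isotropic position by whitening with its empirical
# covariance, then keep a centered ball in the whitened coordinates.
def update_focus_region(samples, radius=2.0):
    mean = samples.mean(axis=0)
    cov = np.cov(samples.T)
    w, V = np.linalg.eigh(cov)              # cov = V diag(w) V^T
    whiten = V @ np.diag(w ** -0.5) @ V.T   # inverse square root of cov
    return lambda x: np.linalg.norm(whiten @ (x - mean)) <= radius

rng = np.random.default_rng(2)
pts = rng.multivariate_normal([1.0, -1.0], [[4.0, 0.0], [0.0, 0.25]], size=5000)
in_region = update_focus_region(pts)
print(in_region(np.array([1.0, -1.0])), in_region(np.array([20.0, -1.0])))
```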

**That’s it folks!**

This concludes the informal proof! In the real proof we unfortunately need to deal with the various places where we said “events of probability do not matter”. There is also the discrepancy between approximately log-concave and exactly log-concave measures. Finally, doing the concentration argument properly (with the induction) is one more place that adds some unwanted (but relatively minor) technical complications.

Another important part which I completely omitted from these lecture notes is the computational complexity of the resulting algorithm. Using known results on sampling from approximately log-concave measures, it is fairly easy to sample in time . Checking whether the focus region has to be updated can also be done in polynomial time. The only real difficulty is checking the restart condition, which asks us to minimize an approximately convex function on the boundary of a convex set! The trick here is that one can replace the ellipsoids by boxes, so that we are left with minimizing an approximately convex function on convex sets (which are dimensional). All of this can be found in the paper of course!

**Gaussian core**

First let us revisit the introduction of the core. Recall that we are considering a kernel such that is the distribution of for some random variable to be defined. To control the regret we want the following inequality to hold for any convex function :

which means that the distribution of the random variable should satisfy

(1)

The core is defined so that (1) is in fact an equality. As we discussed in the previous post, the core is a deep mathematical construction, but unfortunately we could not find a way to generalize the variance calculation of Part 1 when the core is non-Gaussian. In what follows we describe a slightly different construction which will allow us to satisfy (1) with being Gaussian. A key observation is that if is *convexly dominated* by , i.e., for any convex one has , then (1) is satisfied by taking to be the core of :

Thus it suffices to show that for any that we care about one can find a Gaussian “inside” of (in the sense that is convexly dominated by ). Then instead of taking the core of to define the kernel one can take the core of (which we will also refer to as the Gaussian core of ).
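Convex domination is easy to probe by Monte Carlo: check that the expectation of each convex test function under the dominated measure is below its expectation under the dominating one, here with a narrow centered Gaussian sitting inside a wider one (this illustrates the definition only, not the actual construction in the notes):

```python
import numpy as np

# Monte Carlo probe of convex domination: nu is convexly dominated by mu if
# E_nu[f] <= E_mu[f] for every convex f. We spot-check a few convex test
# functions with two centered Gaussians, one "inside" the other.
rng = np.random.default_rng(3)
nu = 0.5 * rng.standard_normal(100_000)   # N(0, 0.25), the dominated measure
mu = rng.standard_normal(100_000)         # N(0, 1), the dominating measure

tests = [np.abs, lambda x: x ** 2, lambda x: np.maximum(x - 0.3, 0.0), np.exp]
for f in tests:
    assert f(nu).mean() <= f(mu).mean() + 1e-2   # small Monte Carlo slack
print("E_nu[f] <= E_mu[f] held for all convex test functions")
```

Both measures must be centered for this to make sense: applying the domination inequality to a linear function and its negation forces the means to agree.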

Next we show that we can essentially restrict our attention to the case where is log-concave, and then we show how to find a Gaussian inside a log-concave measure.

**Approximate log-concavity**

Recall that in Part 1 we replaced the variance calculation by simply showing that the loss estimates are bounded. Thus we see by induction and using Hoeffding-Azuma that with probability at least one has

Using a union bound over an -net together with the Lipschitzness of (which we didn’t prove, but it follows along the same lines as boundedness), one has in fact that with high probability (say ) there is a convex function (recall that is a convex function) such that

In particular, provided that , the above display shows that (whose density is proportional to ) is -approximately log-concave (recall that is said to be -approximately log-concave if there is a log-concave function such that ). To simplify the discussion in these lectures we will in fact assume that is exactly log-concave.
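The notion of approximate log-concavity can be illustrated in one dimension: a density proportional to exp of minus (a convex potential plus a bounded perturbation) stays within a multiplicative band of a genuinely log-concave density (the potential and perturbation below are made up for the illustration):

```python
import numpy as np

# One-dimensional illustration of epsilon-approximate log-concavity: with F
# convex and |noise| <= eps/2, the function exp(-(F + noise)) stays within a
# multiplicative factor e^{eps} of the log-concave function exp(-F).
xs = np.linspace(-3.0, 3.0, 1000)
eps = 0.2
F = xs ** 2                            # convex potential
noise = (eps / 2) * np.sin(25 * xs)    # bounded, wiggly perturbation
g_tilde = np.exp(-(F + noise))         # only approximately log-concave
g = np.exp(-F)                         # log-concave witness
ratio = g_tilde / g
assert ratio.max() <= np.exp(eps) and ratio.min() >= np.exp(-eps)
print("g_tilde is sandwiched within e^{±eps} of a log-concave function")
```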

We note that the above concentration argument makes us lose a factor in the regret. Indeed, in Part 1 we used to (informally) obtain the regret (or with even more optimistic calculations). We see that now we are forced to take a learning rate smaller than this, which in turn multiplies the regret bound by a factor .

**Finding a Gaussian inside a log-concave**

Let be an isotropic log-concave measure. We will show that a centered Gaussian with covariance is (approximately) convexly dominated by . We note in passing that, going back to the calculations at the end of Part 1, it is easy to see that this factor in the covariance will in turn force us to take to be smaller than what we hoped for at the end of Part 1, leading yet again to an extra in the regret bound.

In fact we will show that any centered measure supported on a ball of radius is (exactly) convexly dominated by . This implies the (approximate) Gaussian convex domination we mentioned above since most of the mass of is in such a ball (it is easy to deal with the remaining mass but we won’t worry about it here).

Let us fix some convex function . We want to show that . Since both measures are centered, we may add any linear function to without affecting this inequality, so we may legitimately assume that and for all . By scaling we may also assume that the maximum of on the centered ball of radius is , and thus it is enough to show . By convexity, is above the linear function , where is the maximizer of on the centered ball of radius . Note also that , and thus it is enough to show (recall that ):

where , which is implied by

It only remains to use the fact that an isotropic real log-concave random variable satisfies , and thus the above display holds with .
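The sub-exponential tail of an isotropic real log-concave random variable can be checked empirically, e.g. on a unit-variance Laplace distribution; one standard form of the estimate is P(|X| > t) ≤ e^{1-t} (the precise constant used in the notes may differ):

```python
import numpy as np

# Empirical check of the sub-exponential tail of an isotropic (mean zero,
# unit variance) real log-concave random variable, using a Laplace
# distribution with scale 1/sqrt(2), whose variance is exactly 1.
rng = np.random.default_rng(5)
x = rng.laplace(scale=1 / np.sqrt(2), size=1_000_000)
assert abs(x.var() - 1.0) < 0.01        # isotropic: unit variance
for t in (1.0, 2.0, 4.0, 6.0):
    assert np.mean(np.abs(x) > t) <= np.exp(1 - t)
print("empirical tails stay below e^{1 - t}")
```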
