Geometry of linearized neural networks

This week we had the pleasure of hosting Tengyu Ma from Princeton University, who told us about the recent progress he and his co-authors have made in understanding various linearized versions of neural networks. I will describe here two such results, one for Residual Neural Networks and one for Recurrent Neural Networks.

Some properties to look for in non-convex optimization

We will say that a function f admits first order optimality (respectively second order optimality) if all critical points (respectively all local minima) of f are global minima (of course first order optimality implies second order optimality for smooth functions). In particular with first order optimality one has that gradient descent converges to the global minimum, and with second order optimality this is also true provided that one avoids saddle points. To obtain rates of convergence it can be useful to make more quantitative statements. For example we say that f is \alpha-Polyak if

    \[\|\nabla f(x)\|^2 \geq \alpha (f(x) - f^*) .\]

Clearly \alpha-Polyak implies first order optimality, but more importantly it also implies linear convergence rate for gradient descent on f. A variant of this condition is \alpha-weak-quasi-convexity:

    \[\langle \nabla f(x), x-x^*\rangle \geq \alpha (f(x) - f^*) ,\]

in which case gradient descent converges at the slow non-smooth rate 1/\sqrt{t} (and in this case it is also robust to noise, i.e. one can write a stochastic gradient descent version). The proofs of these statements just mimic the usual convex proofs. For more on these conditions see for instance this paper.
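
To make these conditions a little more concrete, here is a tiny numerical illustration in Python/NumPy (my own sketch, not from the talk). The function x^2 + 3\sin^2(x) is non-convex but is commonly used as an example of a Polyak function; the code simply estimates the constant \alpha on a grid rather than proving anything, and then runs gradient descent, which converges quickly to the global minimum as promised.

    import numpy as np

    # f(x) = x^2 + 3 sin^2(x): non-convex, global minimum f* = 0 at x = 0
    f = lambda x: x**2 + 3 * np.sin(x)**2
    grad = lambda x: 2 * x + 3 * np.sin(2 * x)

    # crude numerical estimate of the Polyak constant alpha on [-10, 10]
    xs = np.linspace(-10, 10, 10001)
    mask = f(xs) > 1e-8
    alpha = np.min(grad(xs[mask])**2 / f(xs[mask]))
    print("estimated Polyak constant:", alpha)

    # gradient descent: the Polyak condition guarantees a linear convergence rate
    x, eta = 7.0, 0.1
    for _ in range(100):
        x -= eta * grad(x)
    print("f after 100 steps:", f(x))   # essentially 0, the global minimum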

Linearized Residual Networks

Recall that a neural network is just a map x \in \mathbb{R}^n \mapsto \sigma \circ A_L \circ \sigma \circ A_{L-1} \circ \hdots \circ \sigma \circ A_0 (x) where A_0,\hdots, A_{L} are linear maps (i.e. they are the matrices parametrizing the neural network) and \sigma is some non-linear map (the most popular one, ReLU, is just the coordinate-wise positive part). Alternatively you can think of a neural network as a sequence of hidden states h_0, h_1,\hdots,h_L where h_0=x and h_{t+1} = \sigma(A_t h_{t}). In 2015 a team of researchers at MSR Asia introduced the concept of a residual neural network where the hidden states are now updated as before for t even but for t odd we set h_{t+1} = h_{t-1} + \sigma(A_t h_t). Apparently this trick allowed them to train much deeper networks, though it is not clear why this would help from a theoretical point of view (the intuition is that at least when the network is initialized with all matrices being 0 it still does something non-trivial, namely it computes the identity).
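
To fix ideas, here is a minimal sketch in Python/NumPy (my own code, with the residual update written exactly as above) contrasting the two parametrizations; at the all-zeros initialization the plain network outputs 0 while the residual network indeed computes the identity.

    import numpy as np

    relu = lambda z: np.maximum(z, 0.0)   # coordinate-wise positive part

    def forward_plain(x, mats):
        # h_{t+1} = sigma(A_t h_t)
        h = x
        for A in mats:
            h = relu(A @ h)
        return h

    def forward_residual(x, mats):
        # h_{t+1} = sigma(A_t h_t) for t even, h_{t+1} = h_{t-1} + sigma(A_t h_t) for t odd
        h_prev, h = x, x
        for t, A in enumerate(mats):
            if t % 2 == 0:
                h_prev, h = h, relu(A @ h)
            else:
                h_prev, h = h, h_prev + relu(A @ h)
        return h

    n, L = 5, 6
    x = np.random.randn(n)
    zero_mats = [np.zeros((n, n)) for _ in range(L)]
    print(forward_plain(x, zero_mats))      # the zero vector
    print(forward_residual(x, zero_mats))   # equals x: the identity map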

In their most recent paper Moritz Hardt and Tengyu Ma try to explain why adding this “identity connection” could be a good idea from a geometric point of view. They consider an (extremely) simplified model where there is no non-linearity, i.e. \sigma is the identity map. A neural network is then just a product of matrices. In particular the landscape we are looking at for least-squares with such a model is of the form:

    \[g(A_0,\hdots, A_L) = \mathbb{E}_{(x,y) \sim \nu} \|\prod_{i=0}^L A_i x - y\|^2 ,\]

which is of course a non-convex function (just think of the function (a,b) \mapsto (ab-1)^2 and observe that on the line a=b it gives the non-convex function (a^2-1)^2). However it actually satisfies the second-order optimality condition:

Proposition [Kawaguchi 2016]

Assume that x has a full rank covariance matrix and that y=Rx for some deterministic matrix R. Then all local minima of g are global minima.

I won’t give the proof of this result as it requires computing the second derivative of g, which is a bit annoying (I will compute the first derivative below). Now in this linearized setting the residual network version (where the identity connection is added at every layer) corresponds simply to a reparametrization around the identity; in other words we now consider the following function:

    \[f(A_0,\hdots, A_L) = \mathbb{E}_{(x,y) \sim \nu} \|\prod_{i=0}^L (A_i+\mathrm{I}) x - y\|^2 .\]

Proposition [Hardt and Ma 2016]

Assume that x has a full rank covariance matrix and that y= Rx for some deterministic matrix R. Then f has first order optimality on the set \{(A_0, \hdots, A_L) : \|A_i\| <1\}.

Thus adding the identity connection makes the objective function better behaved around the all-zeros starting point (in the sense that gradient descent doesn’t have to worry about avoiding saddle points). The proof is just a few lines of standard calculations to take derivatives of functions with matrix-valued inputs.

Proof: One has with E = R - \prod_{i=0}^L (A_i+\mathrm{I}) and \Sigma = \E x x^{\top},

    \[f = \E (E x)^{\top} Ex = \mathrm{Tr}(E \Sigma E^{\top}) =: \|E\|_{\Sigma}^2 ,\]

so with E_{<i} = \prod_{j <i} (A_j + \mathrm{I}) and E_{>i}=\prod_{j >i} (A_j + \mathrm{I}),

    \begin{eqnarray*} f(A_0,\hdots, A_i + V, \hdots, A_L) & = & \|R - \prod_{j <i} (A_j + \mathrm{I}) \times (A_i + V +\mathrm{I}) \times \prod_{j >i} (A_j + \mathrm{I})\|_{\Sigma}^2 \\ & = & \|E - E_{<i} V E_{>i}\|_{\Sigma}^2 \\ & = & \|E\|_{\Sigma}^2 - 2 \langle E \Sigma, E_{<i} V E_{>i} \rangle + \|E_{<i} V E_{>i}\|_{\Sigma}^2 , \end{eqnarray*}

which exactly means that the derivative of f with respect to A_i is equal to -2 E_{<i}^{\top} E \Sigma E_{>i}^{\top}. On the set under consideration one has that E_{>i} and E_{<i} are invertible (and so is \Sigma by assumption), and thus if this derivative is equal to 0 it must be that E=0 and thus f=0 (which is the global minimum).
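
Here is a quick finite-difference check of this gradient computation in Python/NumPy (my own sketch; the formula for the derivative holds for any choice of the matrices, the constraint \|A_i\|<1 only enters to guarantee invertibility):

    import numpy as np

    rng = np.random.default_rng(0)
    n, L, i = 4, 3, 1                      # small instance, gradient taken at slot i
    A = [0.2 * rng.standard_normal((n, n)) for _ in range(L + 1)]
    R = rng.standard_normal((n, n))
    M = rng.standard_normal((n, n))
    Sigma = M @ M.T + np.eye(n)            # some covariance matrix E[x x^T]

    def prod(mats):
        P = np.eye(n)
        for B in mats:
            P = P @ B
        return P

    E = lambda A: R - prod([B + np.eye(n) for B in A])
    f = lambda A: np.trace(E(A) @ Sigma @ E(A).T)

    E_lt = prod([A[j] + np.eye(n) for j in range(i)])
    E_gt = prod([A[j] + np.eye(n) for j in range(i + 1, L + 1)])
    grad = -2 * E_lt.T @ E(A) @ Sigma @ E_gt.T      # closed form derived above

    # central finite differences (essentially exact here, since f is quadratic in A_i)
    eps, num = 1e-5, np.zeros((n, n))
    for r in range(n):
        for c in range(n):
            Ap = [B.copy() for B in A]; Am = [B.copy() for B in A]
            Ap[i][r, c] += eps; Am[i][r, c] -= eps
            num[r, c] = (f(Ap) - f(Am)) / (2 * eps)
    print(np.max(np.abs(num - grad)))      # tiny (~1e-8)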

Linearized recurrent neural networks

The simplest version of a recurrent neural network is as follows. It is a mapping of the form (x_1,\hdots,x_T) \mapsto (y_1,\hdots,y_T) (we are thinking of doing sequence to sequence prediction). In these networks the hidden state is updated as h_{t+1} = \sigma_1(A h_{t} + B x_{t}) (with h_1=0) and the output is y_t = \sigma_2(C h_t + D x_t). I will now describe a paper by Hardt, Ma and Recht (see also this blog post) that tries to understand the geometry of least-squares for this problem in the linearized version where \sigma_1 = \sigma_2 = \mathrm{Id}. That is we are looking at the function:

    \[f(\hat{A}, \hat{B}, \hat{C}, \hat{D})=\mathbb{E}_{(x_t)} \frac{1}{T} \sum_{t=1}^T \|\hat{y}_t - y_t\|^2 ,\]

where (y_t) is obtained from (x_t) via some unknown recurrent neural network with parameters A,B,C,D. First observe that by induction one can easily see that h_{t+1} = \sum_{k=1}^t A^{t-k} B x_k and y_t = D x_t + \sum_{k=1}^{t-1} C A^{t-1-k} B x_k. In particular, assuming that (x_t) is an i.i.d. isotropic sequence one obtains

    \[\E \|y_t - \hat{y}_t\|_2^2 = \|D-\hat{D}\|_F^2 + \sum_{k=1}^{t-1} \|\hat{C} \hat{A}^{t-1-k} \hat{B} - C A^{t-1-k} B\|_F^2 ,\]

and thus

    \[f(\hat{A}, \hat{B}, \hat{C}, \hat{D})=\|D-\hat{D}\|_F^2 + \sum_{k=1}^{T-1} (1- \frac{k}{T}) \|\hat{C} \hat{A}^{k-1} \hat{B} - C A^{k-1} B\|_F^2 .\]

In particular we see that the effect of \hat{D} is decoupled from the other variables and that it appears through a convex function, thus we will just ignore it. Next we make the natural assumption that the spectral radius of A is less than 1 (for otherwise the influence of the initial input x_1 grows over time, which doesn’t seem natural) and thus, up to some small error term (for large T), one can consider the idealized risk:

    \[(\hat{A},\hat{B}, \hat{C}) \mapsto \sum_{k=0}^{+\infty} \|\hat{C} \hat{A}^{k} \hat{B} - C A^{k} B\|_F^2 .\]
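
As a sanity check, the unrolled expression for y_t above is easy to verify numerically against a step-by-step simulation of the recursion (a small sketch in Python/NumPy, my own code):

    import numpy as np

    rng = np.random.default_rng(1)
    n, m, T = 4, 3, 20
    A = 0.4 * rng.standard_normal((n, n))      # random system matrices
    B, C, D = rng.standard_normal((n, m)), rng.standard_normal((2, n)), rng.standard_normal((2, m))
    x = rng.standard_normal((T + 1, m))        # x_1, ..., x_T stored in x[1], ..., x[T]

    # direct simulation: h_1 = 0, h_{t+1} = A h_t + B x_t, y_t = C h_t + D x_t
    h, y_sim = np.zeros(n), []
    for t in range(1, T + 1):
        y_sim.append(C @ h + D @ x[t])
        h = A @ h + B @ x[t]

    # unrolled formula: y_t = D x_t + sum_{k=1}^{t-1} C A^{t-1-k} B x_k
    y_unr = [D @ x[t] + sum((C @ np.linalg.matrix_power(A, t - 1 - k) @ B @ x[k]
                             for k in range(1, t)), np.zeros(2))
             for t in range(1, T + 1)]

    print(np.max(np.abs(np.array(y_sim) - np.array(y_unr))))   # ~1e-13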

The next idea is a cute one which makes the above expression more tractable. Consider the series r_k = C A^k B and its Fourier transform:

    \[G(\theta) = \sum_{k=0}^{+\infty} r_k \exp(i k \theta) = C (\sum_{k=0}^{+\infty} (\exp(i \theta) A)^k) B = C(\mathrm{I} - \exp(i \theta) A)^{-1} B .\]
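
This is just a (matrix-valued) geometric series, and it is easy to check numerically when the spectral radius of A is smaller than 1 (a sketch in Python/NumPy, my own code):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 5
    A = rng.standard_normal((n, n))
    A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))   # rescale the spectral radius to 0.9
    B, C = rng.standard_normal((n, 1)), rng.standard_normal((1, n))

    z = np.exp(1j * 0.7)                               # a point on the unit circle
    series, P = 0.0 + 0.0j, np.eye(n)
    for k in range(2000):                              # truncated sum of C A^k B z^k
        series += (C @ P @ B).item() * z**k
        P = A @ P
    closed = (C @ np.linalg.inv(np.eye(n) - z * A) @ B).item()
    print(abs(series - closed))                        # ~1e-14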

By Parseval’s theorem the idealized risk is equal (up to a normalization factor) to the squared L_2 distance between G and \hat{G} (i.e. \int_{[-\pi, \pi]} \|G(\theta)-\hat{G}(\theta)\|_F^2 d\theta). We will now show that under appropriate further assumptions, for any \theta, \|G(\theta) - \hat{G}(\theta)\|_F^2 is weakly-quasi-convex in (\hat{A},\hat{B},\hat{C}) (in particular this shows that the idealized risk is weakly-quasi-convex). The big assumption that Hardt, Ma and Recht make is that the system is a “single-input single-output” model, that is both x_t and y_t are scalar. In this case it turns out that control theory shows that there is a “canonical controllable form” where B=(0,\hdots,0,1), C=(c_1,\hdots,c_n) and A has zeros everywhere except on the upper diagonal where it has ones and on the last row where it has a_n,\hdots, a_1 (I don’t know the proof of this result, if some reader has a pointer for a simple proof please share in the comments!). Note that with this form the system is simple to interpret as one has Ah = (h(2),\hdots, h(n), \langle a,h\rangle) and Ah+Bx =(h(2),\hdots, h(n), \langle a,h\rangle+x). Now with just a few lines of algebra:

    \[G(\theta) = \frac{c_1 + \hdots + c_{n} z^{n-1}}{z^n + a_1 z^{n-1} + \hdots + a_n}, \ \text{where} \ z=\exp(i \theta) .\]

Thus we are just asking to check the weak-quasi-convexity of

    \[(\hat{a}, \hat{c}) \mapsto |\frac{\hat{c}_1 + \hdots + \hat{c}_{n} z^{n-1}}{z^n + \hat{a}_1 z^{n-1} + \hdots + \hat{a}_n} - \frac{u}{v}|^2 .\]

Weak-quasi-convexity is preserved by linear functions, so we just need to understand the map

    \[(\hat{u},\hat{v}) \in \mathbb{C} \times \mathbb{C} \mapsto |\frac{\hat{u}}{\hat{v}} - \frac{u}{v}|^2 ,\]

which is weak-quasi-convex provided that \hat{v} has a positive inner product with v. In particular we just proved the following:

Theorem [Hardt, Ma, Recht 2016]

Let C(a) := \{z^n + a_1 z^{n-1} + \hdots + a_n , z \in \mathbb{C}, |z|=1\} and assume there is some cone \mathcal{C} \subset \mathbb{C}^2 of angle less than \pi/2-\alpha such that C(a) \subset \mathcal{C}. Then the idealized risk is \alpha-weakly-quasi-convex on the set of \hat{a} such that C(\hat{a}) \subset \mathcal{C}.

(In the paper they specifically pick the cone \mathcal{C} where the imaginary part is larger than the real part.) This theorem naturally suggests that by overparametrizing the network (i.e. adding dimensions to a and c) one could have a nicer landscape (indeed in this case the above condition can be easier to check), see the paper for more details!


Local max-cut in smoothed polynomial time

Omer Angel, Yuval Peres, Fan Wei, and I have just posted to the arXiv our paper showing that local max-cut is in smoothed polynomial time. In this post I briefly explain what the problem is, and I give a short proof of the previous state of the art result on this problem, which was a paper by Etscheid and Roglin showing that local max-cut is in quasi-polynomial time.

Local max-cut and smoothed analysis

Let G = (V,E) be a connected graph with n vertices and w: E\rightarrow [-1,1] be an edge weight function. The local max-cut problem asks to find a partition of the vertices \sigma: V\rightarrow \{-1,1\} whose total cut weight

    \[\frac12 \sum_{uv \in E} w(uv) \big(1-\sigma(u)\sigma(v)\big) ,\]

is locally maximal, in the sense that one cannot increase the cut weight by changing the value of \sigma at a single vertex (recall that actually finding the global maximum is NP-hard). See the papers linked to above for motivation on this problem.

There is a simple local search algorithm for this problem, sometimes referred to as “FLIP”: start from some initial \sigma_0 and iteratively flip vertices (i.e. change the sign of \sigma at a vertex) to improve the cut weight until reaching a local maximum. It is easy to build instances (\sigma_0, w) where FLIP takes exponential time, however in “practice” it seems that FLIP always converges quickly. This motivates the smoothed analysis of FLIP, that is we want to understand the typical number of steps of FLIP when the edge weights are perturbed by a small amount of noise. Formally we now assume that the weight on edge e \in E is given by a random variable X_e \in [-1,1] which has a density with respect to the Lebesgue measure bounded from above by \phi (for example this forbids X_e from being too close to a point mass). We assume that these random variables are independent.
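
Here is a minimal implementation of FLIP in Python/NumPy for the complete graph (my own sketch; for concreteness it always flips the vertex with the largest gain, while the analysis below allows any rule that picks an improving move):

    import numpy as np

    def flip(X, sigma):
        # X: symmetric weight matrix with zero diagonal; sigma: initial partition in {-1,1}^n
        sigma = sigma.astype(float).copy()
        moves = 0
        while True:
            gains = sigma * (X @ sigma)        # gain of flipping v is sigma(v) * sum_u X[v,u] sigma(u)
            v = int(np.argmax(gains))
            if gains[v] <= 0:                  # no single flip improves the cut: local maximum
                return sigma, moves
            sigma[v] = -sigma[v]
            moves += 1

    rng = np.random.default_rng(0)
    n = 100
    X = rng.uniform(-1, 1, (n, n))
    X = (X + X.T) / 2
    np.fill_diagonal(X, 0.0)
    sigma, moves = flip(X, rng.choice([-1.0, 1.0], n))
    print("local max-cut reached after", moves, "improving moves")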

Theorem (Etscheid and Roglin [2014]): With probability 1-o(1) FLIP terminates in O(\phi^c n^{c \log(n)}) steps for some universal constant c>0.

We improve this result from quasi-polynomial to polynomial, assuming that we put some noise on the interaction between every pair of vertices, or in other words assuming that the graph is complete.

Theorem (Angel, Bubeck, Peres, Wei [2016]): Let G be complete. With probability 1-o(1) FLIP terminates in O(\phi^5 n^{15.1}) steps.

I will now prove the Etscheid and Roglin result.

Proof strategy

To simplify notation let us introduce the Hamiltonian

    \[H(\sigma) =-\frac{1}{2} \sum_{uv \in E} X_{uv} \sigma(u) \sigma(v).\]

We want to find a local max of H. For any \sigma \in \{-1,1\}^V and v \in V, we denote by \sigma^{-v} the state equal to \sigma except for the coordinate corresponding to v which is flipped. For such \sigma,v there exists a vector \alpha = \alpha(\sigma,v) \in \{-1,0,1\}^E such that

    \[H(\sigma^{-v}) = H(\sigma) + \langle \alpha, X \rangle .\]

More specifically \alpha=(\alpha_{uw})_{uw \in E} is defined by

    \[\left\{\begin{array}{ll} \alpha_{uv} = \sigma(v) \sigma(u) & \forall u \neq v \\ \alpha_{uw} = 0 & \text{if} \ v \not\in \{u,w\} \end{array}\right.\]
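
The identity H(\sigma^{-v}) = H(\sigma) + \langle \alpha, X \rangle is a one-line computation, and here is a quick numerical check of it on the complete graph (a sketch in Python, my own code):

    import numpy as np

    rng = np.random.default_rng(3)
    n, v = 6, 2
    X = rng.uniform(-1, 1, (n, n)); X = (X + X.T) / 2; np.fill_diagonal(X, 0.0)
    sigma = rng.choice([-1, 1], n)

    def H(sig):
        # Hamiltonian, each edge {u,w} counted once
        return -0.5 * sum(X[u, w] * sig[u] * sig[w] for u in range(n) for w in range(u + 1, n))

    sigma_flip = sigma.copy(); sigma_flip[v] = -sigma_flip[v]
    inner = sum(X[u, v] * sigma[u] * sigma[v] for u in range(n) if u != v)   # <alpha, X>
    print(H(sigma_flip), H(sigma) + inner)    # the two values coincide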

We say that flipping a vertex is a move, and it is an improving move if the value of H strictly improves. We say that a sequence of moves L=(v_1,\hdots, v_{\ell}) is \epsilon-slow from \sigma_0 if

    \[H(\sigma_{i}) -  H(\sigma_{i-1}) \in (0, \epsilon], \forall i \in [\ell].\]

It is sufficient to show that with high probability there is no \epsilon-slow sequence with say \ell=2n and \epsilon = 1/(\phi n^{\log(n)})^c (indeed in this case after 2n \times (\phi n^{\log(n)})^c \times n^2 steps FLIP must have stopped, for otherwise the H value would exceed the maximal possible value of n^2). We will do this in three main steps: a probability step, a linear algebra step, and a combinatorial step.

Probability step

Lemma: Let \alpha_1, \hdots, \alpha_k be k linearly independent vectors in \mathbb{Z}^E. Then one has

    \[\mathbb{P}(\forall i \in [k], \langle \alpha_i, X \rangle \in (0,\epsilon]) \leq (\phi \epsilon)^{k} .\]

Proof: The inequality follows from a simple change of variables. Let A \in \mathbb{Z}^{|E| \times |E|} be a full rank matrix whose first k rows are \alpha_1, \hdots, \alpha_k, completed so that A is the identity on the subspace orthogonal to \alpha_1, \hdots, \alpha_k. Let g be the density of A X, and f=\prod_{i=1}^{|E|} f_i the density of X. One has g(y) = |\mathrm{det}(A^{-1})| f(A^{-1} y) and the key observation is that since A has integer coefficients, its determinant must be an integer too, and since it is non-zero one has |\mathrm{det}(A^{-1})|=|\mathrm{det}(A)^{-1}| \leq 1 . Thus one gets:

    \begin{align*} \mathbb{P}(\forall i \in [k], \langle \alpha_i, X \rangle \in (0,\epsilon]) & = \int_{(0,\epsilon]^k \times \mathbb{R}^{|E|-k}} g(y) dy \\ & \leq \int_{(0,\epsilon]^k \times \mathbb{R}^{|E|-k}} \prod_{i=1}^{|E|} f_i((A^{-1} y)_i) dy \\ & \leq (\phi \epsilon)^k \int_{\mathbb{R}^{|E|-k}} \prod_{i=k+1}^{|E|} f_i(y_i) dy = (\phi \epsilon)^k. \end{align*}

Linear algebra step

Lemma: Consider a sequence of \ell improving moves with k distinct vertices (say v_1, \hdots, v_k) that repeat at least twice in this sequence. Let \alpha_1, \hdots, \alpha_{\ell} \in \{-1,0,1\}^E be the corresponding move coefficients, and for each i \in [k] let s_i (respectively t_i) be the first (respectively second) time at which v_i moves. Then the vectors \tilde{\alpha}_i = \alpha_{s_i} + \alpha_{t_i} \in \mathbb{Z}^E, i \in [k], are linearly independent. Furthermore for any v that did not move between the times s_i and t_i one has \tilde{\alpha}_i(vv_i) = 0 (and for any e \not\ni v_i, \tilde{\alpha}_i(e) = 0).

Proof: The last sentence of the lemma is obvious. For the linear independence let \lambda \in \R^k be such that \sum_{i=1}^k \lambda_i \tilde{\alpha}_i = 0. Consider a new graph H with vertex set [k] and such that i is connected to j if v_i appears an odd number of times between the times s_j and t_j. This defines an oriented graph, however if i is connected to j but j is not connected to i then one has \tilde{\alpha}_j(v_iv_j) \in \{-2,2\} while \tilde{\alpha}_i(v_iv_j)=0 (and furthermore for any m \not\in \{i,j\}, \tilde{\alpha}_m(v_iv_j)=0) and thus \lambda_j=0. In other words we can consider a subset of [k] where H is an undirected graph, and outside of this subset \lambda is identically zero. To reduce notation we simply assume that H is undirected. Next we observe that if i and j are connected then one must have \tilde{\alpha}_j(v_iv_j)= - \tilde{\alpha}_i(v_iv_j) (this uses the fact that s_i and t_i are the first two times at which v_i moves) and in particular (again using that for any m \not\in \{i,j\}, \tilde{\alpha}_m(v_iv_j)=0) one must have \lambda_i=\lambda_j. Now let C be some connected component of H, and let \lambda_C be the unique value of \lambda on C. Noting that the \tilde{\alpha}'s corresponding to different components of H have disjoint supports (more precisely with C_E := \cup_{j \in C} \{e \ni j\} one has for any i \in C, \tilde{\alpha}_i |_{C_E^c} = 0 and for any i \not\in C, \tilde{\alpha}_i |_{C_E}=0) one obtains \lambda_C \sum_{i \in C} \tilde{\alpha}_i = 0. On the other hand since the sequence of moves is improving one must have \sum_{i \in C} \tilde{\alpha}_i \neq 0, which implies \lambda_C = 0 and finally \lambda = 0 (thus concluding the proof of the linear independence).

Combinatorial step

Lemma: Let v_1, \hdots, v_{2n} \in V. There exists \ell \in \mathbb{N} and i \in [2n - \ell] such that the number of vertices that repeat at least twice in the segment v_i, \hdots, v_{i + \ell} is at least \ell / (2 \log_2(n)).

Proof (from ABPW): Define the surplus of a sequence to be the difference between the number of elements and the number of distinct elements in the sequence. Let s_{\ell} be the maximum surplus in any segment of length \ell in v_1, \hdots, v_{2n}. Observe that s_{2n} \geq n. Let us now assume that for any segment of length \ell, the number of vertices that repeat at least twice is at most \epsilon \ell. Then one has by induction

    \[s_{2n} \leq 2 s_{n} + \epsilon 2n \leq 2 \epsilon n \log_2(n) .\]

This shows that \epsilon has to be greater than 1/(2 \log_2(n)) which concludes the proof.

Putting things together

We want to show that with \epsilon = 1/(\phi n^{\log(n)})^c one has

    \[\mathbb{P}(\exists (\sigma_0, L) \; \epsilon-\text{slow and} \; |L| = 2n) = o(1) .\]

By the combinatorial lemma we know that it is enough to show that:

    \begin{align*} & \sum_{\ell=1}^{2n} \mathbb{P}(\exists (\sigma_0, L) \; \epsilon-\text{slow}, \; |L| = \ell, \; \text{and} \\ & \text{L has at least} \; \ell/(2 \log_2(n)) \; \text{repeating vertices}) = o(1) . \end{align*}

Now using the probability lemma together with the linear algebra lemma (and observing that critically \tilde{\alpha} only depends on the value of \sigma at the vertices in L, and thus the union bound over \sigma_0 only gives a 2^{\ell} factor instead of 2^{n}) one obtains that the above probability is bounded by

    \[2^{\ell} n^{\ell} (\phi \epsilon)^{\ell/(2 \log_2(n))} ,\]

which concludes the proof of the Etscheid and Roglin result.

Note that a natural route to get a polynomial-time bound from the above proof would be to remove the \log(n) term in the combinatorial lemma but we show in our paper that this is impossible. Our result comes essentially from improvements to the linear algebra step (this is more difficult as the Etscheid and Roglin linear algebra lemma is particularly friendly for the union bound step, so we had to find another way to do the union bound).


Prospective graduate student? Consider University of Washington!

This post is targeted at prospective graduate students, especially foreigners from outside the US, who are primarily interested in optimization but also have a taste for probability theory (basically readers of this blog!). As a foreigner myself I remember that during my undergraduate studies my view of the US was essentially restricted to the usual suspects: Princeton, Harvard, MIT, Berkeley, Stanford. These places are indeed amazing, but I would like to try to raise awareness that, in terms of the interface optimization/probability, University of Washington (and especially the theory of computation group there) has a reasonable claim to being the place with the most amazing opportunities right now in this space:

  1. The junior faculty (Shayan Oveis Gharan, Thomas Rothvoss, and Yin Tat Lee) are all doing groundbreaking (and award-winning) work at the interface optimization/probability. In my opinion the junior faculty roster is a key element in the choice of grad school, as typically junior faculty have much more time to dedicate to students. In particular I know that Yin Tat Lee is looking for graduate students starting next Fall.
  2. Besides the theory of computation group, UW has lots of resources in optimization such as TOPS (Trends in Optimization Seminar), and many optimization faculty in various departments (Maryam Fazel, Jim Burke, Dmitriy Drusvyatskiy, Jeff Bilmes, Zaid Harchaoui), which means many interesting classes to take!
  3. The Theory Group at Microsoft Research is just a bridge away from UW, and we have lots of activities on optimization/probability there too. In fact I am also looking for one graduate student, to be co-advised with a faculty from UW.

Long story short, if you are a talented young mathematician interested in making a difference in optimization then you should apply to the CS department at UW, and here is the link to do so.


Guest post by Miklos Racz: Entropic central limit theorems and a proof of the fundamental limits of dimension estimation

In this post we give a proof of the fundamental limits of dimension estimation in random geometric graphs, based on the recent work of Bubeck and Ganguly. We refer the reader to the previous post for a detailed introduction; here we just recall the main theorem we will prove.

Theorem [Bubeck and Ganguly]

If the distribution \mu is log-concave, i.e., if it has density f(\cdot) = e^{-\varphi(\cdot)} for some convex function \varphi, and if \frac{d}{n^{3} \log^{2} \left( d \right)} \to \infty, then

(1)   \begin{equation*} \mathrm{TV} \left( \mathcal{W}_{n,d}, \mathcal{G}_{n} \right) \to 0, \end{equation*}

where \mathcal{W}_{n,d} is an appropriately scaled Wishart matrix coming from vectors having i.i.d. \mu entries and \mathcal{G}_{n} is a GOE matrix, both with the diagonal removed.

The proof hinges on a high-dimensional entropic central limit theorem, so a large part of the post is devoted to entropic central limit theorems and ways of proving them. Without further ado let us jump right in.


Pinsker’s inequality: from total variation to relative entropy

Our goal is now to bound \mathrm{TV} \left( \mathcal{W}_{n,d}, \mathcal{G}_{n} \right) from above. In the general setting considered here there is no nice formula for the density of the Wishart ensemble, so \mathrm{TV} \left( \mathcal{W}_{n,d}, \mathcal{G}_{n} \right) cannot be computed directly. Coupling these two random matrices also seems challenging.

In light of these observations, it is natural to switch to a different metric on probability distributions that is easier to handle in this case. Here we use Pinsker’s inequality to switch to relative entropy:

(2)   \begin{equation*} \mathrm{TV} \left( \mathcal{W}_{n,d}, \mathcal{G}_{n} \right)^{2} \leq \frac{1}{2} \mathrm{Ent} \left( \mathcal{W}_{n,d} \, \| \, \mathcal{G}_{n} \right), \end{equation*}

where \mathrm{Ent} \left( \mathcal{W}_{n,d} \, \| \, \mathcal{G}_{n} \right) denotes the relative entropy of \mathcal{W}_{n,d} with respect to \mathcal{G}_{n}. We next take a detour to entropic central limit theorems and techniques involved in their proof, before coming back to bounding the right hand side in (2).


An introduction to entropic CLTs

Let \phi denote the density of \gamma_{n}, the n-dimensional standard Gaussian distribution, and let f be an isotropic density with mean zero, i.e., a density for which the covariance matrix is the identity I_{n}. Then

    \begin{align*} 0 \leq \mathrm{Ent} \left( f \, \| \, \phi \right) &= \int f \log f - \int f \log \phi \\ &= \int f \log f - \int \phi \log \phi = \mathrm{Ent} \left( \phi \right) - \mathrm{Ent} \left( f \right), \end{align*}

where the second equality follows from the fact that \log \phi \left( x \right) is quadratic in x, and the first two moments of f and \phi are the same by assumption. We thus see that the standard Gaussian maximizes entropy among isotropic densities. It turns out that much more is true.

The central limit theorem states that if Z_{1}, Z_{2}, \dots are i.i.d. real-valued random variables with zero mean and unit variance, then S_{m} := \left( Z_{1} + \dots + Z_{m} \right) / \sqrt{m} converges in distribution to a standard Gaussian random variable as m \to \infty. There are many other senses in which S_{m} converges to a standard Gaussian, the entropic CLT being one of them.

Theorem [Entropic CLT]

Let Z_{1}, Z_{2}, \dots be i.i.d. real-valued random variables with zero mean and unit variance, and let S_{m} := \left( Z_{1} + \dots + Z_{m} \right) / \sqrt{m}. If \mathrm{Ent} \left( Z_{1} \, \| \, \phi \right) < \infty, then

    \[ \mathrm{Ent} \left( S_{m} \right) \nearrow \mathrm{Ent} \left( \phi \right) \]

as m \to \infty. Moreover, the entropy of S_{m} increases monotonically, i.e., \mathrm{Ent} \left( S_{m} \right) \leq \mathrm{Ent} \left( S_{m+1} \right) for every m \geq 1.

The condition \mathrm{Ent} \left( Z_{1} \, \| \, \phi \right) < \infty is necessary for an entropic CLT to hold; for instance, if the Z_{i} are discrete, then \mathrm{Ent} \left( S_{m} \right) = - \infty for all m.

The entropic CLT originates with Shannon in the 1940s and was first proven by Linnik (without the monotonicity part of the statement). The first proofs that gave explicit convergence rates were given independently and at roughly the same time by Artstein, Ball, Barthe, and Naor, and Johnson and Barron in the early 2000s, using two different techniques.

The fact that \mathrm{Ent} \left( S_{1} \right) \leq \mathrm{Ent} \left( S_{2} \right) follows from the entropy power inequality, which goes back to Shannon in 1948. This implies that \mathrm{Ent} \left( S_{m} \right) \leq \mathrm{Ent} \left( S_{2m} \right) for all m \geq 1, and so it was naturally conjectured that \mathrm{Ent} \left( S_{m} \right) increases monotonically. However, proving this turned out to be challenging. Even the inequality \mathrm{Ent} \left( S_{2} \right) \leq \mathrm{Ent} \left( S_{3} \right) was unknown for over fifty years, until Artstein, Ball, Barthe, and Naor proved in general that \mathrm{Ent} \left( S_{m} \right) \leq \mathrm{Ent} \left( S_{m+1} \right) for all m \geq 1.

In the following we sketch some of the main ideas that go into the proof of these results, in particular following the technique introduced by Ball, Barthe, and Naor.


From relative entropy to Fisher information

Our goal is to show that some random variable Z, which is a sum of many independent random variables, is close to a Gaussian G. One way to approach this is to interpolate between the two. There are several ways of doing this; for our purposes interpolation along the Ornstein-Uhlenbeck semigroup is most useful. Define

    \[ P_{t} Z := e^{-t} Z + \sqrt{1 - e^{-2t}} G \]

for t \in [0,\infty), and let f_{t} denote the density of P_{t} Z. We have P_{0} Z = Z and P_{\infty} Z = G. This semigroup has several desirable properties. For instance, if the density of Z is isotropic, then so is f_{t}. Before we can state the next desirable property that we will use, we need to introduce a few more useful quantities.
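
As a quick sanity check of this interpolation, one can simulate it (a sketch in Python/NumPy, my own code, starting from a standardized uniform distribution): the variance stays equal to 1 along the semigroup while the distribution becomes more and more Gaussian.

    import numpy as np

    rng = np.random.default_rng(4)
    N = 200_000
    Z = rng.uniform(-np.sqrt(3), np.sqrt(3), N)    # mean 0, variance 1, but far from Gaussian
    G = rng.standard_normal(N)

    excess_kurtosis = lambda s: np.mean(s**4) / np.mean(s**2)**2 - 3   # 0 for a Gaussian

    for t in [0.0, 0.3, 1.0, 3.0]:
        Pt = np.exp(-t) * Z + np.sqrt(1 - np.exp(-2 * t)) * G
        print(f"t={t:.1f}  variance={Pt.var():.3f}  excess kurtosis={excess_kurtosis(Pt):.3f}")
    # variance ~1 throughout; the excess kurtosis goes from -1.2 (uniform) towards 0 (Gaussian)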

For a density function f : \mathbb{R}^{n} \to \mathbb{R}_{+}, let

    \[ \mathcal{I} \left( f \right) := \int \frac{\nabla f (\nabla f)^{T}}{f} = \E \left[ \left( \nabla \log f \right) \left( \nabla \log f \right)^{T} \right] \]

be the Fisher information matrix. The Cramer-Rao bound states that

    \[ \mathrm{Cov} \left( f \right) \succeq \mathcal{I} \left( f \right)^{-1}. \]

More generally this holds for the covariance of any unbiased estimator of the mean. The Fisher information is defined as

    \[ I \left( f \right) := \Tr \left( \mathcal{I} \left( f \right) \right). \]

It is sometimes more convenient to work with the Fisher information distance, defined as J(f) := I(f) - I(\phi) = I(f) - n. Similarly to the discussion above, one can show that the standard Gaussian minimizes the Fisher information among isotropic densities, and hence the Fisher information distance is always nonnegative.
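
For a concrete feel for these quantities, here is a small numerical computation (my own sketch in Python/NumPy) of the Fisher information of two unit-variance densities on \mathbb{R}: the standard Gaussian, which attains the minimum I = 1 (so J = 0), and a unit-variance logistic density, which has J > 0.

    import numpy as np

    def fisher_information(density, lo=-20.0, hi=20.0, m=400001):
        # I(f) = integral of (f'(x)/f(x))^2 f(x) dx, by numerical differentiation/integration
        x = np.linspace(lo, hi, m)
        f = density(x)
        fp = np.gradient(f, x)
        return np.sum(fp**2 / np.maximum(f, 1e-300)) * (x[1] - x[0])

    gaussian = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
    s = np.sqrt(3) / np.pi                         # scale making the logistic have unit variance
    logistic = lambda x: np.exp(-x / s) / (s * (1 + np.exp(-x / s))**2)

    I_gauss, I_logi = fisher_information(gaussian), fisher_information(logistic)
    print(I_gauss, I_logi, "J(logistic) =", I_logi - I_gauss)   # ~1.0, ~1.1, so J > 0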

Now we are ready to state the De Bruijn identity, which characterizes the change of entropy along the Ornstein-Uhlenbeck semigroup via the Fisher information distance:

    \[ \partial_{t} \mathrm{Ent} \left( f_{t} \right) = J \left( f_{t} \right). \]

This implies that the relative entropy between f and \phi—which is our quantity of interest—can be expressed as follows:

(3)   \begin{equation*} \mathrm{Ent} \left( f \, \| \, \phi \right) = \mathrm{Ent} \left( \phi \right) - \mathrm{Ent} \left( f \right) = \int_{0}^{\infty} J \left( f_{t} \right) dt. \end{equation*}

Thus our goal is to bound the Fisher information distance J(f_{t}).


Bounding the Fisher information distance

We first recall a classical result by Blachman and Stam that shows that Fisher information decreases under convolution.

Theorem [Blachman; Stam]

Let Y_{1}, \dots, Y_{d} be independent random variables taking values in \mathbb{R}, and let a \in \mathbb{R}^{d} be such that \left\| a \right\|_{2} = 1. Then

    \[ I \left( \sum_{i=1}^{d} a_{i} Y_{i} \right) \leq \sum_{i=1}^{d} a_{i}^{2} I \left( Y_{i} \right). \]

In the i.i.d. case, this bound becomes \left\| a \right\|_{2}^{2} I \left( Y_{1} \right).

Ball, Barthe, and Naor gave the following variational characterization of the Fisher information, which gives a particularly simple proof of the Blachman-Stam theorem above. (See Bubeck and Ganguly for a short proof.)

Theorem [Variational characterization of Fisher information]

Let w : \mathbb{R}^{d} \to \left( 0, \infty \right) be a sufficiently smooth density on \mathbb{R}^{d}, let a \in \mathbb{R}^{d} be a unit vector, and let h be the marginal of w in direction a. Then we have

(4)   \begin{equation*} I \left( h \right) \leq \int_{\mathbb{R}^{d}} \left( \frac{\mathrm{div} \left( pw \right)}{w} \right)^{2} w \end{equation*}

for any continuously differentiable vector field p : \mathbb{R}^{d} \to \mathbb{R}^{d} with the property that for every x, \left\langle p \left( x \right), a \right\rangle = 1. Moreover, if w satisfies \int \left\| x \right\|^{2} w \left( x \right) < \infty, then there is equality for some suitable vector field p.

The Blachman-Stam theorem follows from this characterization by taking the constant vector field p \equiv a. Then we have \mathrm{div} \left( pw \right) = \left\langle \nabla w, a \right\rangle, and so the right hand side of (4) becomes a^{T} \mathcal{I} \left( w \right) a, where recall that \mathcal{I} is the Fisher information matrix. In the setting of the Blachman-Stam theorem the density w of \left( Y_{1}, \dots, Y_{d} \right) is a product density: w \left( x_{1}, \dots, x_{d} \right) = f_{1} \left( x_{1} \right) \times \dots \times f_{d} \left( x_{d} \right), where f_{i} is the density of Y_{i}. Consequently the Fisher information matrix is a diagonal matrix, \mathcal{I} \left( w \right) = \mathrm{diag} \left( I \left( f_{1} \right), \dots, I \left( f_{d} \right) \right), and thus a^{T} \mathcal{I} \left( w \right) a = \sum_{i=1}^{d} a_{i}^{2} I \left( f_{i} \right), concluding the proof of the Blachman-Stam theorem via the variational characterization.

Given this variational characterization, one need not take the vector field to be constant; one can obtain more by optimizing over the vector field. Doing this leads to the following theorem, which gives a rate of decrease of the Fisher information distance under convolutions.

Theorem [Artstein, Ball, Barthe, and Naor]

Let Y_{1}, \dots, Y_{d} be i.i.d. random variables with a density having a positive spectral gap c. (We say that a random variable has spectral gap c if for every sufficiently smooth g, we have \mathrm{Var} \left( g \right) \leq \tfrac{1}{c} \E g'^{2}. In particular, log-concave random variables have a positive spectral gap, see Bobkov (1999).) Then for any a \in \mathbb{R}^{d} with \left\| a \right\|_{2} = 1 we have that

    \[ J \left( \sum_{i=1}^{d} a_{i} Y_{i} \right) \leq \frac{2 \left\| a \right\|_{4}^{4}}{c + (2-c) \left\| a \right\|_{4}^{4}} J \left( Y_{1} \right). \]

When a = \frac{1}{\sqrt{d}} \mathbf{1}, then \frac{2 \left\| a \right\|_{4}^{4}}{c + (2-c) \left\| a \right\|_{4}^{4}} = O \left( 1 / d \right), and thus using (3) we obtain a rate of convergence of O \left( 1 / d \right) in the entropic CLT.

A result similar to the theorem above was proven independently and roughly at the same time by Johnson and Barron using a different approach involving score functions.


A high-dimensional entropic CLT

The techniques of Artstein, Ball, Barthe, and Naor generalize to higher dimensions, as was recently shown by Bubeck and Ganguly.

A result similar to the theorem of Artstein, Ball, Barthe, and Naor can be proven, from which a high-dimensional entropic CLT follows, together with a rate of convergence, by using (3) again.

Theorem [Bubeck and Ganguly]

Let Y \in \mathbb{R}^{d} be a random vector with i.i.d. entries from a distribution \nu with zero mean, unit variance, and spectral gap c \in (0,1]. Let A \in \mathbb{R}^{n \times d} be a matrix such that AA^{T} = I_{n}, the n \times n identity matrix. Let

    \[\varepsilon = \max_{i \in [d]} \left( A^T A \right)_{i,i}\]

and

    \[\zeta = \max_{i,j \in [d], i \neq j} \left| \left( A^T A \right)_{i,j} \right|.\]

Then we have that

    \[ \mathrm{Ent} \left( A Y \, \| \, \gamma_{n} \right) \leq n \min \left\{ 2 \left( \varepsilon + \zeta^{2} d \right) / c, 1 \right\} \mathrm{Ent} \left( \nu \, \| \, \gamma_{1} \right), \]

where \gamma_{n} denotes the standard Gaussian measure in \mathbb{R}^{n}.

To interpret this result, consider the case where the matrix A is built by picking rows one after the other uniformly at random on the Euclidean sphere in \mathbb{R}^{d}, conditionally on being orthogonal to previous rows (to satisfy the isotropicity condition AA^{T} = I_{n}). We then expect to have \varepsilon \simeq n/d and \zeta \simeq \sqrt{n} / d (we leave the details as an exercise for the reader), and so the theorem above tells us that \mathrm{Ent} \left( A Y \, \| \, \gamma_{n} \right) \lesssim n^{2}/d.


Back to Wishart and GOE

We now turn our attention back to bounding the relative entropy \mathrm{Ent} \left( \mathcal{W}_{n,d} \, \| \, \mathcal{G}_{n} \right); recall (2). Since the Wishart matrix contains the (scaled) inner products of n vectors in \mathbb{R}^{d}, it is natural to relate \mathcal{W}_{n+1,d} and \mathcal{W}_{n,d}, since the former comes from the latter by adding an additional d-dimensional vector to the n vectors already present. Specifically, we have the following:

    \[ \mathcal{W}_{n+1,d} = \begin{pmatrix} \mathcal{W}_{n,d} & \frac{1}{\sqrt{d}} \mathbb{X} X \\ \frac{1}{\sqrt{d}} \left( \mathbb{X} X \right)^{T} & 0 \end{pmatrix}, \]

where X is a d-dimensional random vector with i.i.d. entries from \mu, which are also independent from \mathbb{X}. Similarly we can write the matrix \mathcal{G}_{n+1} using \mathcal{G}_{n}:

    \[ \mathcal{G}_{n+1} = \begin{pmatrix} \mathcal{G}_{n} & \gamma_{n} \\ \gamma_{n}^{T} & 0 \end{pmatrix}. \]

This naturally suggests using the chain rule for relative entropy and bounding \mathrm{Ent} \left( \mathcal{W}_{n,d} \, \| \, \mathcal{G}_{n} \right) by induction on n. We get that

    \[ \mathrm{Ent} \left( \mathcal{W}_{n+1,d} \, \| \, \mathcal{G}_{n+1} \right) = \mathrm{Ent} \left( \mathcal{W}_{n,d} \, \| \, \mathcal{G}_{n} \right) + \mathbb{E}_{W_{n,d}} \left[ \mathrm{Ent} \left( \tfrac{1}{\sqrt{d}} \mathbb{X} X \, | \, \mathcal{W}_{n,d} \, \| \, \gamma_{n} \right) \right]. \]

By convexity of the relative entropy we also have that

    \[ \mathbb{E}_{W_{n,d}} \left[ \mathrm{Ent} \left( \tfrac{1}{\sqrt{d}} \mathbb{X} X \, | \, \mathcal{W}_{n,d} \, \| \, \gamma_{n} \right) \right] \leq \mathbb{E}_{\mathbb{X}} \left[ \mathrm{Ent} \left( \tfrac{1}{\sqrt{d}} \mathbb{X} X \, | \, \mathbb{X} \, \| \, \gamma_{n} \right) \right]. \]

Thus our goal is to understand and bound \mathrm{Ent} \left( A X \, \| \, \gamma_{n} \right) for A \in \mathbb{R}^{n \times d}, and then apply the bound to A = \tfrac{1}{\sqrt{d}} \mathbb{X} (followed by taking expectation over \mathbb{X}). This is precisely what was done in the high-dimensional entropic CLT above, for A satisfying AA^T = I_{n}. Since A = \tfrac{1}{\sqrt{d}} \mathbb{X} does not necessarily satisfy AA^T = I_{n}, we have to correct for the lack of isotropicity. This is the content of the following lemma, the proof of which we leave as an exercise for the reader.

Lemma

Let A \in \mathbb{R}^{n \times d} and Q \in \mathbb{R}^{n \times n} be such that QA \left( QA \right)^{T} = I_{n}. Then for any isotropic random variable X taking values in \mathbb{R}^{d} we have that

(5)   \begin{equation*} \mathrm{Ent} \left( A X \, \| \, \gamma_{n} \right) = \mathrm{Ent} \left( QA X \, \| \, \gamma_{n} \right) + \frac{1}{2} \mathrm{Tr} \left( A A^{T} \right) - \frac{n}{2} + \log \left| \det \left( Q \right) \right|. \end{equation*}

We then apply this lemma with A = \tfrac{1}{\sqrt{d}} \mathbb{X} and Q = \left( \tfrac{1}{d} \mathbb{X} \mathbb{X}^{T} \right)^{-1/2}. Observe that

    \[\mathbb{E} \mathrm{Tr} \left( A A^{T} \right) = \tfrac{1}{d} \mathbb{E} \mathrm{Tr} \left( \mathbb{X} \mathbb{X}^{T} \right) = \tfrac{1}{d} \times n \times d = n,\]

and hence in expectation the middle two terms of the right hand side of (5) cancel each other out.

The last term in (5),

    \[- \tfrac{1}{2} \log \det \left( \tfrac{1}{d} \mathbb{X} \mathbb{X}^{T} \right),\]

should be understood as the relative entropy between a centered Gaussian with covariance given by \tfrac{1}{d} \mathbb{X} \mathbb{X}^{T} and a standard Gaussian in \mathbb{R}^{n}. Controlling the expectation of this term requires studying the probability that \mathbb{X} \mathbb{X}^{T} is close to being non-invertible, which requires bounds on the left tail of the smallest singular value of \mathbb{X}. Understanding the extreme singular values of random matrices is a fascinating topic, but it is outside of the scope of these notes, and so we refer the reader to Bubeck and Ganguly for more details on this point.

Finally, the high-dimensional entropic CLT can now be applied to see that

    \[\mathrm{Ent} \left( QA X \, \| \, \gamma_{n} \right) \lesssim n^{2} / d.\]

From the induction on n we get another factor of n, arriving at

    \[\mathrm{Ent} \left( \mathcal{W}_{n,d} \, \| \, \mathcal{G}_{n} \right) \lesssim n^{3} / d.\]

We conclude that the dimension threshold is d \approx n^{3}, and the information-theoretic proof that we have outlined sheds light on why this threshold is n^{3}.


Guest post by Miklos Racz: The fundamental limits of dimension estimation in random geometric graphs

This post is a continuation of the previous one, where we explored how to detect geometry in a random geometric graph. We now consider the other side of the coin: when is it impossible to detect geometry and what are techniques for proving this? We begin by discussing the G(n,p,d) model introduced in the previous post and then turn to a more general setup, proving a robustness result on the threshold dimension for detection. The proof of the latter result also gives us the opportunity to learn about the fascinating world of entropic central limit theorems.

Barrier to detecting geometry: when Wishart becomes GOE

Recall from the previous post that G(n,p,d) is a random geometric graph where the underlying metric space is the d-dimensional unit sphere \mathbb{S}^{d-1} = \left\{ x \in \mathbb{R}^d : \left\| x \right\|_2 = 1 \right\}, and where the latent labels of the nodes are i.i.d. uniform random vectors in \mathbb{S}^{d-1}. Our goal now is to show the impossibility result of Bubeck, Ding, Eldan, and Racz: if d \gg n^{3}, then it is impossible to distinguish between G(n,p,d) and the Erdos-Renyi random graph G(n,p). More precisely, we have that

(1)   \begin{equation*} \mathrm{TV} \left( G(n,p), G(n,p,d) \right) \to 0 \end{equation*}

when d \gg n^{3} and n \to \infty, where \mathrm{TV} denotes total variation distance.

There are essentially three main ways to bound the total variation of two distributions from above: (i) if the distributions have nice formulas associated with them, then exact computation is possible; (ii) through coupling the distributions; or (iii) by using inequalities between probability metrics to switch the problem to bounding a different notion of distance between the distributions. Here, while the distribution of G(n,p,d) does not have a nice formula associated with it, the main idea is to view this random geometric graph as a function of an n \times n Wishart matrix with d degrees of freedom—i.e., a matrix of inner products of n d-dimensional Gaussian vectors—denoted by W(n,d). It turns out that one can view G(n,p) as (essentially) the same function of an n \times n GOE random matrix—i.e., a symmetric matrix with i.i.d. Gaussian entries on and above the diagonal—denoted by M(n). The upside of this is that both of these random matrix ensembles have explicit densities that allow for explicit computation. We explain this connection here in the special case of p = 1/2 for simplicity; see Bubeck et al. for the case of general p.

Recall that if Y_{1} is a standard normal random variable in \mathbb{R}^d, then Y_1 / \left\| Y_1 \right\| is uniformly distributed on the sphere \mathbb{S}^{d-1}. Consequently we can view G\left( n, 1/2, d \right) as a function of an appropriate Wishart matrix, as follows. Let Y be an n \times d matrix where the entries are i.i.d. standard normal random variables, and let W \equiv W (n,d) = YY^T be the corresponding n \times n Wishart matrix. Note that W_{ii} = \left\langle Y_i, Y_i \right\rangle = \left\| Y_i \right\|^2 and so \left\langle Y_i / \left\| Y_i \right\|, Y_j / \left\| Y_j \right\| \right\rangle = W_{ij} / \sqrt{W_{ii} W_{jj}}. Thus the n \times n matrix A defined as

    \[ A_{i,j} = \begin{cases} 1 & \text{if } W_{ij} \geq 0 \text{ and } i \neq j\\ 0 & \text{otherwise} \end{cases} \]

has the same law as the adjacency matrix of G\left(n,1/2,d\right). Denote the map that takes W to A by H, i.e., A = H \left( W \right).

In a similar way we can view G \left( n, 1/2 \right) as a function of an n \times n matrix drawn from the Gaussian Orthogonal Ensemble (GOE). Let M\left( n \right) be a symmetric n \times n random matrix where the diagonal entries are i.i.d. normal random variables with mean zero and variance 2, and the entries above the diagonal are i.i.d. standard normal random variables, with the entries on and above the diagonal all independent. Then B = H \left( M(n) \right) has the same law as the adjacency matrix of G(n,1/2). Note that B only depends on the sign of the off-diagonal elements of M\left(n \right), so in the definition of B we can replace M\left( n \right) with M \left( n, d \right) := \sqrt{d} M \left( n \right) + d I_n, where I_n is the n \times n identity matrix.

We can thus conclude that

    \begin{align*} \mathrm{TV} \left( G(n,1/2,d), G(n,1/2) \right) &= \mathrm{TV} \left( H \left( W(n,d) \right), H \left( M(n,d) \right) \right) \\ &\leq \mathrm{TV} \left( W(n,d), M(n,d) \right). \end{align*}
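
Here is what this construction looks like in code, for p = 1/2 (a sketch in Python/NumPy, my own, not from the paper): the same thresholding map H is applied to a Wishart matrix and to the shifted GOE matrix M(n,d), producing samples of G(n,1/2,d) and G(n,1/2) respectively.

    import numpy as np

    rng = np.random.default_rng(5)
    n, d = 50, 5000

    def H(M):
        # keep only the signs of the off-diagonal entries
        A = (M >= 0).astype(int)
        np.fill_diagonal(A, 0)
        return A

    Y = rng.standard_normal((n, d))
    A_geom = H(Y @ Y.T)                        # adjacency matrix of G(n, 1/2, d)

    M = rng.standard_normal((n, n))
    M = (M + M.T) / np.sqrt(2)                 # GOE: N(0,1) off-diagonal, N(0,2) diagonal
    A_er = H(np.sqrt(d) * M + d * np.eye(n))   # adjacency matrix of G(n, 1/2)

    print(A_geom.sum() / (n * (n - 1)), A_er.sum() / (n * (n - 1)))   # both edge densities ~1/2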

The densities of these two random matrix ensembles are explicit and well known (although we do not state them here), which allow for explicit calculations. The outcome of these calculations is the following result, proven independently and simultaneously by Bubeck et al. and Jiang and Li.

Theorem [Bubeck, Ding, Eldan, and Racz; Jiang and Li]

Define the random matrix ensembles W\left( n, d \right) and M \left( n, d \right) as above. If d / n^3 \to \infty, then

    \begin{equation*}\label{eq:Wishart_GOE} \mathrm{TV} \left( W \left( n, d \right), M \left( n, d \right) \right) \to 0. \end{equation*}

We conclude that it is impossible to detect underlying geometry whenever d \gg n^{3}.

The universality of the threshold dimension

How robust is the result presented above? We have seen that the detection threshold is intimately connected to the threshold of when a Wishart matrix becomes GOE. Understanding the robustness of this result on random matrices is interesting in its own right, and this is what we will pursue in the remainder of this post, which is based on a recent paper by Bubeck and Ganguly.

Let \mathbb{X} be an n \times d random matrix with i.i.d. entries from a distribution \mu that has mean zero and variance 1. The n \times n matrix \mathbb{X} \mathbb{X}^{T} is known as the Wishart matrix with d degrees of freedom. As we have seen above, this arises naturally in geometry, where \mathbb{X} \mathbb{X}^{T} is known as the Gram matrix of inner products of n points in \mathbb{R}^{d}. The Wishart matrix also appears naturally in statistics as the sample covariance matrix, where d is the number of samples and n is the number of parameters. (Note that in statistics the number of samples is usually denoted by n, and the number of parameters is usually denoted by p; here our notation is taken with the geometric perspective in mind.)

We consider the Wishart matrix with the diagonal removed, and scaled appropriately:

    \[ \mathcal{W}_{n,d} = \frac{1}{\sqrt{d}} \left( \mathbb{X} \mathbb{X}^{T} - \mathrm{diag} \left( \mathbb{X} \mathbb{X}^{T} \right) \right). \]

In many applications—such as to random graphs as above—the diagonal of the matrix is not relevant, so removing it does not lose information. Our goal is to understand how large does the dimension d have to be so that \mathcal{W}_{n,d} is approximately like \mathcal{G}_{n}, which is defined as the n \times n Wigner matrix with zeros on the diagonal and i.i.d. standard Gaussians above the diagonal. In other words, \mathcal{G}_{n} is drawn from the Gaussian Orthogonal Ensemble (GOE) with the diagonal replaced with zeros.

A simple application of the multivariate central limit theorem gives that if n is fixed and d \to \infty, then \mathcal{W}_{n,d} converges to \mathcal{G}_{n} in distribution. The main result of Bubeck and Ganguly establishes that this holds as long as d \, \widetilde{\gg} \, n^{3} under rather general conditions on the distribution \mu.

Theorem [Bubeck and Ganguly]

If the distribution \mu is log-concave, i.e., if it has density f(\cdot) = e^{-\varphi(\cdot)} for some convex function \varphi, and if \frac{d}{n^{3} \log^{2} \left( d \right)} \to \infty, then

(2)   \begin{equation*} \mathrm{TV} \left( \mathcal{W}_{n,d}, \mathcal{G}_{n} \right) \to 0. \end{equation*}

On the other hand, if \mu has a finite fourth moment and \frac{d}{n^{3}} \to 0, then

(3)   \begin{equation*} \mathrm{TV} \left( \mathcal{W}_{n,d}, \mathcal{G}_{n} \right) \to 1. \end{equation*}

This result extends the theorem from the previous post (and the Wishart-to-GOE theorem above), and establishes n^{3} as the universal critical dimension (up to logarithmic factors) for sufficiently smooth measures \mu: \mathcal{W}_{n,d} is approximately Gaussian if and only if d is much larger than n^{3}. For random graphs, as seen above, this is the dimension barrier to extracting geometric information from a network: if the dimension is much greater than the cube of the number of vertices, then all geometry is lost. In the setting of statistics this means that the Gaussian approximation of a Wishart matrix is valid as long as the sample size is much greater than the cube of the number of parameters. Note that for some statistics of a Wishart matrix the Gaussian approximation is valid for much smaller sample sizes (e.g., the largest eigenvalue behaves as in the Gaussian limit even when the number of parameters is on the same order as the sample size (Johnstone, 2001)).

To distinguish the random matrix ensembles, we have seen in the previous post that signed triangles work up until the threshold dimension in the case when \mu is standard normal. It turns out that the same statistic works in this more general setting; when the entries of the matrices are centered, this statistic can be written as A \mapsto \mathrm{Tr} \left( A^{3} \right). We leave the details as an exercise for the reader.
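
As an illustration (my own sketch in Python/NumPy, not from the paper), one can compare the statistic \mathrm{Tr}(A^3) on the two ensembles: for small d the centered Wishart matrix has a clear excess of signed triangles, while for d of order n^3 the excess becomes comparable to the fluctuations of the GOE statistic.

    import numpy as np

    rng = np.random.default_rng(6)
    n = 30

    def tr3_wishart(d):
        X = rng.standard_normal((n, d))               # i.i.d. entries, mean 0, variance 1
        W = (X @ X.T) / np.sqrt(d)
        np.fill_diagonal(W, 0.0)
        return np.trace(np.linalg.matrix_power(W, 3))

    def tr3_goe():
        G = rng.standard_normal((n, n))
        G = (G + G.T) / np.sqrt(2)
        np.fill_diagonal(G, 0.0)
        return np.trace(np.linalg.matrix_power(G, 3))

    for d in [50, n**3]:
        w = np.mean([tr3_wishart(d) for _ in range(10)])
        g = np.mean([tr3_goe() for _ in range(10)])
        print(f"d = {d}:  Tr(W^3) ~ {w:.0f}   Tr(G^3) ~ {g:.0f}")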

We note that for (2) to hold it is necessary to have some smoothness assumption on the distribution \mu. For instance, if \mu is purely atomic, then so is the distribution of \mathcal{W}_{n,d}, and thus its total variation distance to \mathcal{G}_{n} is 1. The log-concave assumption gives this necessary smoothness, and it is an interesting open problem to understand how far this can be relaxed.

We will see the proof (and in particular the connection to entropic CLT!) in the next post.


Guest post by Miklos Racz: Estimating the dimension of a random geometric graph on a high-dimensional sphere

Following the previous post in which we studied community detection, in this post we study the fundamental limits of inferring geometric structure in networks. Many networks coming from physical considerations naturally have an underlying geometry, such as the network of major roads in a country. In other networks this stems from a latent feature space of the nodes. For instance, in social networks a person might be represented by a feature vector of their interests, and two people are connected if their interests are close enough; this latent metric space is referred to as the social space. We are particularly interested in the high-dimensional regime, which brings about a host of new questions, such as estimating the dimension.

A simple random geometric graph model and basic questions

We study perhaps the simplest model of a random geometric graph, where the underlying metric space is the d-dimensional unit sphere \mathbb{S}^{d-1} = \left\{ x \in \mathbb{R}^d : \left\| x \right\|_2 = 1 \right\}, and where the latent labels of the nodes are i.i.d. uniform random vectors in \mathbb{S}^{d-1}. More precisely, the random geometric graph G \left( n, p, d \right) is defined as follows. Let X_1, \dots, X_n be independent random vectors, uniformly distributed on \mathbb{S}^{d-1}. In G\left( n, p, d \right), distinct nodes i \in \left[n\right] and j \in \left[n \right] are connected by an edge if and only if \left\langle X_i, X_j \right\rangle \geq t_{p,d}, where the threshold value t_{p,d} \in \left[-1,1\right] is such that \mathbb{P} \left( \left\langle X_1, X_2 \right\rangle \geq t_{p,d} \right) = p. For example, when p = 1/2 we have t_{p,d} = 0.
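
Sampling from G(n,p,d) is straightforward; here is a small sketch in Python/NumPy (my own code; the threshold t_{p,d} is estimated by Monte Carlo, using the fact that by rotational invariance \langle X_1, X_2 \rangle has the law of one coordinate of a uniform point on the sphere):

    import numpy as np

    def sample_gnpd(n, d, p, rng):
        X = rng.standard_normal((n, d))
        X /= np.linalg.norm(X, axis=1, keepdims=True)      # i.i.d. uniform points on S^{d-1}
        Z = rng.standard_normal((200000, d))
        first_coord = Z[:, 0] / np.linalg.norm(Z, axis=1)
        t = np.quantile(first_coord, 1 - p)                # so that P(<X_1, X_2> >= t) ~ p
        A = (X @ X.T >= t).astype(int)
        np.fill_diagonal(A, 0)
        return A

    rng = np.random.default_rng(7)
    A = sample_gnpd(n=200, d=10, p=0.3, rng=rng)
    print("edge density:", A.sum() / (200 * 199))          # close to p = 0.3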

The most natural random graph model without any structure is the standard Erdos-Renyi random graph G(n,p), where any two of the n vertices are independently connected with probability p.

We can thus formalize the question of detecting underlying geometry as a simple hypothesis testing question. The null hypothesis is that the graph is drawn from the Erdos-Renyi model, while the alternative is that it is drawn from G(n,p,d). In brief:

(1)   \begin{equation*} H_{0} : G \sim G(n,p), \qquad \qquad H_{1} : G \sim G(n,p,d). \end{equation*}

To understand this question, the basic quantity we need to study is the total variation distance between the two distributions on graphs, G(n,p) and G(n,p,d), denoted by \mathrm{TV} \left( G(n,p), G(n,p,d) \right); recall that the total variation distance between two probability measures P and Q is defined as \mathrm{TV} \left( P, Q \right) = \tfrac{1}{2} \left\| P - Q \right\|_{1} = \sup_{A} \left| P(A) - Q(A) \right|. We are interested in particular in the case when the dimension d is large, growing with n.

It is intuitively clear that if the geometry is too high-dimensional, then it is impossible to detect it, while a low-dimensional geometry will have a strong effect on the generated graph and will be detectable. How fast can the dimension grow with n while still being able to detect it? Most of this post will focus on this question.

If we can detect geometry, then it is natural to ask for more information. Perhaps the ultimate goal would be to find an embedding of the vertices into an appropriate dimensional sphere that is a true representation, in the sense that the geometric graph formed from the embedded points is indeed the original graph. More modestly, can the dimension be estimated? We touch on this question at the end of the post.

The dimension threshold for detecting underlying geometry

The high-dimensional setting of the random geometric graph G(n,p,d) was first studied by Devroye, Gyorgy, Lugosi, and Udina, who showed that geometry is indeed lost in high dimensions: if n is fixed and d \to \infty, then \mathrm{TV} \left( G(n,p), G(n,p,d) \right) \to 0. More precisely, they show that this convergence happens when d \gg n^{7} 2^{n^2 / 2}, but this is not tight. The dimension threshold for dense graphs was recently found by Bubeck, Ding, Eldan, and Racz, and it turns out that it is d \approx n^3, in the following sense.

Theorem [Bubeck, Ding, Eldan, and Racz 2014]

Let p \in (0,1) be fixed. Then

    \[\mathrm{TV} \left( G(n,p), G(n,p,d) \right) \to \left\{ \begin{array}{l}  0 \;\; \text{ if } \;\; d \gg n^3,  \qquad \qquad (2) \\ 1 \;\; \text{ if } \;\; d \ll n^3. \qquad \qquad (3) \end{array} \right.\]

Moreover, in the latter case there exists a computationally efficient test to detect underlying geometry (with running time O\left( n^{3} \right)).


Most of this post is devoted to understanding (3), that is, how the two models can be distinguished; the impossibility result of (2) will be discussed in a future post. At the end we will also consider this same question for sparse graphs (where p = c/n), where determining the dimension threshold is an intriguing open problem.

The triangle test

A natural test to uncover geometric structure is to count the number of triangles in G. Indeed, in a purely random scenario, vertex u being connected to both v and w says nothing about whether v and w are connected. On the other hand, in a geometric setting this implies that v and w are close to each other due to the triangle inequality, thus increasing the probability of a connection between them. This, in turn, implies that the expected number of triangles is larger in the geometric setting, given the same edge density. Let us now compute what this statistic gives us.

[Figure: given that u is connected to both v and w, the vertices v and w are more likely to be connected under G(n,p,d) than under G(n,p).]

For a graph G, let A denote its adjacency matrix. Then

    \[T_{G} \left( i,j,k \right) := A_{i,j} A_{i,k} A_{j,k}\]

is the indicator variable that three vertices i, j, and k form a triangle, and so the number of triangles in G is

    \[T(G) := \sum_{\{i,j,k\} \in \binom{[n]}{3}} T_{G} \left( i,j,k \right).\]

By linearity of expectation, for both models the expected number of triangles is \binom{n}{3} times the probability of a triangle between three specific vertices. For the Erdos-Renyi random graph the edges are independent, so the probability of a triangle is p^3, and thus we have

    \[ \mathbb{E} \left[ T \left( G(n,p) \right) \right] = \binom{n}{3} p^3. \]
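
In code, the triangle count is simply \mathrm{Tr}(A^3)/6, and a quick simulation (my own sketch in Python/NumPy) confirms the Erdos-Renyi expectation:

    import numpy as np
    from math import comb

    def count_triangles(A):
        return int(round(np.trace(np.linalg.matrix_power(A, 3)) / 6))

    rng = np.random.default_rng(8)
    n, p = 200, 0.5
    U = (rng.random((n, n)) < p).astype(int)
    A = np.triu(U, 1); A = A + A.T                 # adjacency matrix of G(n, p)

    print(count_triangles(A), comb(n, 3) * p**3)   # observed vs expected ~ C(n,3) p^3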

For G(n,p,d) it turns out that for any fixed p \in \left( 0, 1 \right) we have

(4)   \begin{equation*} \mathbb{P} \left( T_{G(n,p,d)} \left( 1, 2, 3 \right) = 1 \right) \approx p^{3} \left( 1 + \frac{C_{p}}{\sqrt{d}} \right) \end{equation*}

for some constant C_{p} > 0, which gives that

    \[ \mathbb{E} \left[ T \left( G(n,p,d) \right) \right] \geq \binom{n}{3} p^3 \left( 1 + \frac{C_{p}}{\sqrt{d}} \right). \]

Showing (4) is somewhat involved, but in essence it follows from the concentration of measure phenomenon on the sphere, namely that most of the mass on the high-dimensional sphere is located in a band of O \left( 1 / \sqrt{d} \right) around the equator. We sketch here the main intuition for p=1/2, which is illustrated in the figure below.

Let X_1, X_2, and X_3 be independent uniformly distributed points in \mathbb{S}^{d-1}. Then

    \begin{multline*} \mathbb{P} \left( T_{G(n,1/2,d)} \left( 1, 2, 3 \right) = 1 \right) \\ \begin{aligned} &= \mathbb{P} \left( \langle X_1, X_2 \rangle \geq 0, \langle X_1, X_3 \rangle \geq 0, \langle X_2, X_3 \rangle \geq 0 \right) \\ &= \mathbb{P} \left( \langle X_2, X_3 \rangle \geq 0 \, \middle| \, \langle X_1, X_2 \rangle \geq 0, \langle X_1, X_3 \rangle \geq 0 \right) \mathbb{P} \left( \langle X_1, X_2 \rangle \geq 0, \langle X_1, X_3 \rangle \geq 0 \right) \\ &= \frac{1}{4} \times \mathbb{P} \left( \langle X_2, X_3 \rangle \geq 0 \, \middle| \, \langle X_1, X_2 \rangle \geq 0, \langle X_1, X_3 \rangle \geq 0 \right), \end{aligned} \end{multline*}

where the last equality follows by independence. So what remains is to show that this latter conditional probability is approximately 1/2 + c / \sqrt{d}. To compute this conditional probability what we really need to know is the typical angle between X_1 and X_2. By rotational invariance we may assume that X_1 = (1,0,0, \dots, 0), and hence \langle X_1, X_2 \rangle = X_{2} (1), the first coordinate of X_{2}. One way to generate X_2 is to sample a d-dimensional standard Gaussian and then normalize it by its length. Since the norm of a d-dimensional standard Gaussian is very well concentrated around \sqrt{d}, it follows that X_{2}(1) is on the order of 1/\sqrt{d}. Conditioned on \langle X_1, X_2 \rangle \geq 0 and \langle X_1, X_3 \rangle \geq 0, both X_2 and X_3 are thus slightly biased (by an amount of order 1/\sqrt{d}) towards X_1, and this bias is what gives the 1/\sqrt{d} boost in the conditional probability that \langle X_2, X_3 \rangle \geq 0.
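This 1/\sqrt{d} scaling is easy to see numerically; here is a quick Python/numpy check (again my own sketch, with arbitrary values of d) of both the typical size of \langle X_1, X_2 \rangle and of the conditional probability above.

    import numpy as np

    rng = np.random.default_rng(1)
    for d in [10, 100]:
        m = 50_000
        X1, X2, X3 = (rng.standard_normal((m, d)) for _ in range(3))
        for X in (X1, X2, X3):
            X /= np.linalg.norm(X, axis=1, keepdims=True)  # uniform points on S^{d-1}
        ip12, ip13, ip23 = (X1 * X2).sum(1), (X1 * X3).sum(1), (X2 * X3).sum(1)
        # typical inner product times sqrt(d) is of constant order
        print(d, np.mean(np.abs(ip12)) * np.sqrt(d),
              # P( <X2,X3> >= 0 | <X1,X2> >= 0, <X1,X3> >= 0 ), roughly 1/2 + c/sqrt(d)
              np.mean(ip23[(ip12 >= 0) & (ip13 >= 0)] >= 0))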

[Figure: If X_{1} and X_{2} are two independent uniform points on \mathbb{S}^{d-1}, then their inner product \left\langle X_1, X_2 \right\rangle is on the order of 1/\sqrt{d} due to the concentration of measure phenomenon on the sphere.]

Thus we see that the boost in the number of triangles in the geometric setting is \Theta \left( n^{3} / \sqrt{d} \right) in expectation:

    \[\mathbb{E} \left[ T \left( G(n,p,d) \right) \right] - \mathbb{E} \left[ T \left( G(n,p) \right) \right] \geq \binom{n}{3} \frac{C_p}{\sqrt{d}}.\]

To be able to tell apart the two graph distributions based on the number of triangles, the boost in expectation needs to be much greater than the standard deviation. A simple calculation shows that

    \[\mathrm{Var} \left( T \left( G \left( n, p \right) \right) \right) = \binom{n}{3} \left( p^{3} - p^{6} \right) + \binom{n}{4} \binom{4}{2} \left( p^{5} - p^{6} \right)\]

and also

    \[\mathrm{Var} \left( T \left( G \left( n, p, d \right) \right) \right) \leq n^4.\]

Thus we see that \mathrm{TV} \left( G(n,p), G(n,p,d) \right) \to 1 if n^{3} / \sqrt{d} \gg \sqrt{n^4}, which is equivalent to d \ll n^2.

Signed triangles are more powerful

While triangles detect geometry up until d \ll n^2, are there even more powerful statistics that detect geometry for larger dimensions? One can check that longer cycles also only work when d \ll n^2, as do several other natural statistics. Yet the underlying geometry can be detected even when d \ll n^{3}.

The simple idea that leads to this improvement is to consider signed triangles. We have already noticed that triangles are more likely in the geometric setting than in the purely random setting. This also means that induced wedges (i.e., when there are exactly two edges among the three possible ones) are less likely in the geometric setting. Similarly, induced single edges are more likely, and induced independent sets on three vertices are less likely in the geometric setting. The following figure summarizes these observations.

[Figure: the relative likelihoods, under G(n,p,d) versus G(n,p), of the four induced patterns on three vertices: triangle, wedge, single edge, and independent set.]

The signed triangles statistic incorporates these observations by giving the different patterns positive or negative weights. More precisely, we define

    \[ \tau \left( G \right) := \sum_{\{i,j,k\} \in \binom{[n]}{3}} \left( A_{i,j} - p \right) \left( A_{i,k} - p \right) \left( A_{j,k} - p \right). \]

The key insight motivating this definition is that the variance of signed triangles is much smaller than the variance of triangles, due to the cancellations introduced by the centering of the adjacency matrix: the \Theta \left( n^{4} \right) term vanishes, leaving only the \Theta \left( n^{3} \right) term. It is a simple exercise to show that

    \[\mathbb{E} \left[ \tau \left( G(n,p) \right) \right] = 0\]

and

    \[\mathrm{Var} \left( \tau \left( G(n,p) \right) \right) = \binom{n}{3} p^{3} \left( 1 - p \right)^{3}.\]

On the other hand it can be shown that

(5)   \begin{equation*} \mathbb{E} \left[ \tau \left( G(n,p,d) \right) \right] \geq c_{p} n^{3} / \sqrt{d}, \end{equation*}

so the gap between the expectations remains. Furthermore, it can be shown that the variance also decreases for G(n,p,d), and we have

    \[\mathrm{Var} \left( \tau \left( G(n,p,d) \right) \right) \leq n^{3} + \frac{3 n^{4}}{d}.\]

Putting everything together we get that \mathrm{TV} \left( G(n,p), G(n,p,d) \right) \to 1 if n^{3} / \sqrt{d} \gg \sqrt{n^3 + n^{4} / d}, which is equivalent to d \ll n^3. This concludes the proof of (3) from Theorem 1.
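In code, the signed triangle statistic is just as easy to compute as the triangle count; here is a minimal Python/numpy sketch (mine), which uses that centering the off-diagonal entries of A and taking \mathrm{tr}(B^3)/6 gives exactly the sum over unordered triples.

    import numpy as np

    def signed_triangles(A, p):
        # tau(G) = sum over triples {i,j,k} of (A_ij - p)(A_ik - p)(A_jk - p)
        B = A - p
        np.fill_diagonal(B, 0.0)
        return np.trace(B @ B @ B) / 6

    # sanity check on an Erdos-Renyi sample: E[tau(G(n,p))] = 0,
    # with fluctuations of order n^{3/2}
    rng = np.random.default_rng(2)
    n, p = 300, 0.5
    upper = np.triu(rng.random((n, n)) < p, 1)
    A = (upper | upper.T).astype(float)
    print(signed_triangles(A, p))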

Estimating the dimension

Until now we discussed detecting geometry. However, the insights gained above allow us to also touch upon the more subtle problem of estimating the underlying dimension d.

Dimension estimation can also be done by counting the “number” of signed triangles as above. However, here it is necessary to have a bound on the difference of the expected number of signed triangles between consecutive dimensions; the lower bound on \mathbb{E} \left[ \tau \left( G(n,p,d) \right) \right] in (5) is not enough. Still, we believe that the lower bound in (5) should give the correct order of the expectation for an appropriate constant c_{p}, and hence we expect to have that

(6)   \begin{equation*} \mathbb{E} \left[ \tau \left( G(n,p,d) \right) \right] - \mathbb{E} \left[ \tau \left( G(n,p,d+1) \right) \right] = \Theta \left( \frac{n^{3}}{d^{3/2}} \right). \end{equation*}

Thus, using the variance bound from above, we get that dimension estimation should be possible using signed triangles whenever n^{3} / d^{3/2} \gg \sqrt{n^3 + n^{4} / d}, which is equivalent to d \ll n. Showing (6) for general p seems involved; Bubeck et al. showed that it holds for p = 1/2, which can be considered as a proof of concept.

Theorem [Bubeck, Ding, Eldan, and Racz 2014]

There exists a universal constant C > 0 such that for all integers n and d_1 < d_2, one has

    \[ \mathrm{TV} \left( G (n, 1/2, d_1), G(n, 1/2, d_2) \right) \geq 1 - C \left( \frac{d_1}{n} \right)^{2}. \]

This result is tight, as demonstrated by a result of Eldan, which implies that G(n,1/2,d) and G(n,1/2,d+1) are indistinguishable when d \gg n.

The mysterious sparse regime

We conclude this post by discussing an intriguing conjecture for sparse graphs. It is again natural to consider the number of triangles as a way to distinguish between G(n,c/n) and G(n, c/n,d). Bubeck et al. show that this statistic works whenever d \ll \log^{3} \left( n \right), and conjecture that this is tight.

Conjecture [Bubeck, Ding, Eldan, and Racz 2014]

Let c > 0 be fixed and assume d / \log^{3} \left( n \right) \to \infty. Then

    \[ \mathrm{TV} \left( G \left( n, \frac{c}{n} \right), G \left( n, \frac{c}{n}, d \right) \right) \to 0. \]

The main reason for this conjecture is that, when d \gg \log^{3} \left( n \right), G(n, c/n) and G(n, c/n, d) seem to be locally equivalent; in particular, they both have the same Poisson number of triangles asymptotically. Thus the only way to distinguish between them would be to find an emergent global property which is significantly different under the two models, but this seems unlikely to exist. Proving or disproving this conjecture remains a challenging open problem. The best known bound is n^{3} from (2) (which holds uniformly over p), which is very far from \log^{3} \left( n \right)!

Posted in Probability theory, Random graphs

Guest post by Miklos Racz: A primer on exact recovery in the general stochastic block model

Foreword: this will be a series of blog posts written by our Theory Group postdoc Miklos Racz. It is essentially the transcript of a mini-course that Miki gave at the University of Washington and at the XX Brazilian School of Probability, and that I gave at the University of Nice. These notes are also posted on arxiv. I now leave the floor to Miki:

 

Community detection is a fundamental problem in many sciences, such as sociology (e.g., finding tight-knit groups in social networks), biology (e.g., detecting protein complexes), and beyond. Given its importance, there have been a plethora of algorithms developed in the past few decades to detect communities. In the past few years there have also been many works studying the fundamental limits of these recovery algorithms.

This post describes the recent results of Abbe and Sandon which characterize the fundamental limits of exact recovery in the most canonical and popular probabilistic generative model, the stochastic block model (SBM).

The stochastic block model and exact recovery

The general stochastic block model is a distribution on graphs with latent community structure, and it has three parameters:

n, the number of vertices;

a probability distribution p = (p_1, \dots, p_k) that describes the relative sizes of the communities;

and \overline{Q} \in \left[0,1\right]^{k \times k}, a symmetric k \times k matrix that describes the probabilities with which two given vertices are connected, depending on which communities they belong to.

The number of communities, k, is implicit in this notation; in these notes we assume that k is a fixed constant.

A random graph from \mathrm{SBM}( n, p, \overline{Q}) is defined as follows:

  • The vertex set of the graph is V = \left\{ 1, \dots, n \right\} \equiv \left[ n \right].
  • Every vertex v \in V is independently assigned a (hidden) label \sigma_v \in \left[ k \right] from the probability distribution p on \left[k \right]. That is, \mathbb{P} \left( \sigma_v = i \right) = p_i for every i \in \left[k \right].
  • Given the labels of the vertices, each (unordered) pair of vertices \left( u, v \right) \in V \times V is connected independently with probability \overline{Q}_{\sigma_{u}, \sigma_{v}}.

Example [Symmetric communities]

A simple example to keep in mind is that of symmetric communities, with more edges within communities than between communities.

This is modeled by the SBM with p_i = 1/k for all i \in \left[ k \right] and \overline{Q}_{i,j} = a if i = j and \overline{Q}_{i,j} = b otherwise, with a > b > 0.
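To fix ideas, here is a minimal Python/numpy sketch (my own code, with arbitrary parameter values) of sampling from this symmetric example; the entries a and b of \overline{Q} are used directly as connection probabilities.

    import numpy as np

    def sample_symmetric_sbm(n, k, a, b, rng):
        # hidden labels, uniform on [k]
        labels = rng.integers(k, size=n)
        # connection probability: a within a community, b across communities
        probs = np.where(labels[:, None] == labels[None, :], a, b)
        # independent coin flips for each pair i < j
        upper = np.triu(rng.random((n, n)) < probs, 1)
        A = (upper | upper.T).astype(float)
        return A, labels

    rng = np.random.default_rng(0)
    A, labels = sample_symmetric_sbm(n=500, k=3, a=0.6, b=0.1, rng=rng)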

We write G \sim \mathrm{SBM}(n,p,\overline{Q}) for a graph generated according to the SBM without the hidden vertex labels revealed.

The goal of a statistical inference algorithm is to recover as many labels as possible using only the underlying graph as an observation.

There are various notions of success that are worth studying. In this post we focus on exact recovery: we aim to recover the labels of all vertices exactly with high probability (whp).

More precisely, in evaluating the success of an algorithm, the agreement of a partition with the true partition is maximized over all relabellings of the communities, since we are not interested in the specific original labelling per se, but rather the partition (community structure) it induces.

For exact recovery, all vertices in all but one community should be non-isolated (in the symmetric case this means that the graph should be connected), requiring the edge probabilities to be \Omega \left( \ln(n) / n \right).

It is thus natural to scale the edge probability matrix accordingly, i.e., to consider \mathrm{SBM} \left( n, p, \ln \left( n \right) Q / n \right), where Q \in \mathbb{R}_{+}^{k \times k}.

We also assume that the communities have linear size, i.e., that p is independent of n, and p_i \in (0,1) for all i.

From exact recovery to testing multivariate Poisson distributions

As a thought experiment, imagine that not only is the graph G given, but also all vertex labels are revealed, except for that of a given vertex v \in V. Is it possible to determine the label of v?

Understanding this question is key for understanding exact recovery, since if the error probability of this is too high, then exact recovery will not be possible. On the other hand, it turns out that in this regime it is possible to recover all but o(n) labels using an initial partial recovery algorithm. The setup of the thought experiment then becomes relevant, and if we can determine the label of v given the labels of all the other nodes with low error probability, then we can correct all errors made in the initial partial recovery algorithm, leading to exact recovery. We will come back to the connection between the thought experiment and exact recovery; for now we focus on understanding this thought experiment.

Given the labels of all vertices except v, the information we have about v is the number of nodes in each community it is connected to. In other words, we know the degree profile d(v) of v, where, for a given labelling of the graph’s vertices, the i-th component d_{i} (v) is the number of edges between v and the vertices in community i.

The distribution of the degree profile d(v) depends on the community that v belongs to. Recall that the community sizes are given by a multinomial distribution with parameters n and p, and hence the relative size of community i \in [k] concentrates on p_i. Thus if \sigma_v = j, the degree profile d(v) = (d_1(v), \dots, d_k(v)) can be approximated by independent binomials, with d_i (v) approximately distributed as

    \[\mathrm{Bin} \left( np_i, \ln(n) Q_{i,j} / n \right) ,\]

where \mathrm{Bin}(m,q) denotes the binomial distribution with m trials and success probability q. In this regime, the binomial distribution is well-approximated by a Poisson distribution of the same mean. Thus the degree profile of a vertex in community j is approximately Poisson distributed with mean

    \[\ln \left( n \right) \sum_{i \in [k]} p_i Q_{i,j} e_i ,\]

where e_i is the i-th unit vector. Defining P = \mathrm{diag}(p), this can be abbreviated as \ln \left( n \right) \left( PQ \right)_{j}, where \left( PQ \right)_{j} denotes the j-th column of the matrix PQ.

We call the quantity \left( PQ \right)_{j} the community profile of community j; this is the quantity that determines the distribution of the degree profile of vertices from a given community.

Our thought experiment has thus been reduced to a Bayesian hypothesis testing problem between k multivariate Poisson distributions.

The prior on the label of v is given by p, and we get to observe the degree profile d(v), which comes from one of k multivariate Poisson distributions, which have mean \ln(n) times the community profiles \left( PQ \right)_j, j \in [k].
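In code, the reduction looks as follows; this is a small sketch under the Poisson approximation above (the function names and the numerical values of p and Q are mine).

    import numpy as np

    def community_profiles(p, Q):
        # columns of P Q with P = diag(p); column j is the community profile of j
        return np.diag(p) @ Q

    def sample_degree_profile(n, p, Q, j, rng):
        # approximate degree profile of a vertex in community j under
        # SBM(n, p, ln(n) Q / n): multivariate Poisson with mean ln(n) (PQ)_j
        mean = np.log(n) * community_profiles(p, Q)[:, j]
        return rng.poisson(mean)

    p = np.array([0.5, 0.5])
    Q = np.array([[5.0, 1.0], [1.0, 5.0]])
    rng = np.random.default_rng(1)
    print(sample_degree_profile(10_000, p, Q, j=0, rng=rng))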

Testing multivariate Poisson distributions

We now turn to understanding the testing problem described above; the setup is as follows.

We consider a Bayesian hypothesis testing problem with k hypotheses.

The random variable H takes values in [k] with prior given by p, i.e., \mathbb{P} \left( H = j \right) = p_j.

We do not observe H, but instead we observe a draw from a multivariate Poisson distribution whose mean depends on the realization of H:

given H = j, the mean is \lambda(j) \in \mathbb{R}_{+}^{k}.

In short:

    \[ D \, | \, H = j \sim \mathrm{Poi} \left( \lambda(j) \right), \qquad j \in [k]. \]

In more detail:

    \[ \mathbb{P} \left( D = d \, \middle| \, H = j \right) = \mathcal{P}_{\lambda(j)} \left( d \right), \qquad d \in \mathbb{Z}_{+}^{k}, \]

where

    \[ \mathcal{P}_{\lambda(j)} \left( d \right) = \prod_{i \in [k]} \mathcal{P}_{\lambda_{i}(j)} \left( d_{i} \right) \]

and

    \[ \mathcal{P}_{\lambda_{i}(j)} \left( d_{i} \right) = \frac{\lambda_{i}(j)^{d_{i}}}{d_{i}!} e^{-\lambda_{i}(j)}. \]

Our goal is to infer the value of H from a realization of D.

The error probability is minimized by the maximum a posteriori (MAP) rule, which, upon observing D = d, selects

    \[ \mathrm{argmax}_{j \in [k]} \mathbb{P} \left( D = d \, \middle| \, H = j \right) p_j \]

as an estimate for the value of H, with ties broken arbitrarily.
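Here is what the MAP rule looks like in code (a small Python sketch of my own, working in log space for numerical stability, with a toy choice of prior and means):

    import numpy as np
    from scipy.stats import poisson

    def map_estimate(d, means, prior):
        # pick argmax_j  P(D = d | H = j) p_j, i.e. argmax of the log-posterior;
        # means[j] is the mean vector lambda(j) and prior[j] is p_j
        log_post = [poisson.logpmf(d, mu).sum() + np.log(pj)
                    for mu, pj in zip(means, prior)]
        return int(np.argmax(log_post))

    prior = np.array([0.5, 0.5])
    means = np.log(10_000) * np.array([[2.5, 0.5], [0.5, 2.5]])
    print(map_estimate(np.array([25, 3]), means, prior))  # -> 0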

Let P_e denote the error of the MAP estimator.

One can think of the MAP estimator as a tournament of k-1 pairwise comparisons of the hypotheses:

if

    \[\mathbb{P} \left( D = d \, \middle| \, H = i \right) p_i >\mathbb{P} \left( D = d \, \middle| \, H = j \right) p_j\]

then the MAP estimate is not j.

The probability that one makes an error during such a comparison is exactly

(1)   \begin{equation*} P_{e} \left( i, j \right) := \sum_{x \in \mathbb{Z}_{+}^{k}} \min \left\{ \mathbb{P} \left( D = x \, \middle| \, H = i \right) p_i , \mathbb{P} \left( D = x \, \middle| \, H = j \right) p_j \right\}. \end{equation*}

For finite k, the error of the MAP estimator is on the same order as the largest pairwise comparison error, i.e., \max_{i,j} P_{e} \left( i, j \right).

In particular, we have that

(2)   \begin{equation*} \frac{1}{k-1} \sum_{i < j} P_{e} \left( i, j \right) \leq P_{e} \leq \sum_{i < j} P_{e} \left( i, j \right). \end{equation*}

Thus we desire to understand the magnitude of the error probability P_{e} \left( i, j \right) in (1) in the particular case when the conditional distribution of D given H is a multivariate Poisson distribution with mean vector on the order of \ln \left( n \right). The following result determines this error up to first order in the exponent.

Lemma [Abbe and Sandon 2015]

For any c_{1}, c_{2} \in \left( 0, \infty \right)^{k} with c_1 \neq c_2 and p_1, p_2 > 0, we have

(3)   \begin{align*} \sum_{x \in \mathbb{Z}_{+}^{k}} \min \left\{ \mathcal{P}_{\ln \left( n \right) c_{1}} \left( x \right) p_{1}, \mathcal{P}_{\ln \left( n \right) c_{2}} \left( x \right) p_{2} \right\} &= O \left( n^{-D_{+} \left( c_1, c_2 \right) - \tfrac{\ln \ln \left( n \right)}{2 \ln \left( n \right)}} \right), \\ \sum_{x \in \mathbb{Z}_{+}^{k}} \min \left\{ \mathcal{P}_{\ln \left( n \right) c_{1}} \left( x \right) p_{1}, \mathcal{P}_{\ln \left( n \right) c_{2}} \left( x \right) p_{2} \right\} &= \Omega \left( n^{-D_{+} \left( c_1, c_2 \right) - \tfrac{k \ln \ln \left( n \right)}{2 \ln \left( n \right)}} \right). \end{align*}

where

(4)   \begin{equation*} D_{+} \left( c_{1}, c_{2} \right) = \max_{t \in \left[ 0, 1 \right]} \sum_{i \in \left[ k \right]} \left( t c_{1} \left( i \right) + \left( 1 - t \right) c_{2} \left( i \right) - c_{1} \left( i \right)^{t} c_{2} \left( i \right)^{1-t} \right). \end{equation*}

Due to connections with other, well-known measures of divergence (which we do not discuss here), Abbe and Sandon termed D_{+} the Chernoff-Hellinger divergence.
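Since D_{+} is given by a one-dimensional concave maximization over t, it is easy to compute numerically; here is a simple Python/numpy sketch (mine) using a grid search over t \in [0,1]:

    import numpy as np

    def ch_divergence(c1, c2, grid=10_001):
        # D_+(c1, c2) = max over t in [0,1] of
        #   sum_i ( t c1_i + (1-t) c2_i - c1_i^t c2_i^{1-t} )
        c1, c2 = np.asarray(c1, float), np.asarray(c2, float)
        t = np.linspace(0.0, 1.0, grid)[:, None]
        vals = (t * c1 + (1 - t) * c2 - c1 ** t * c2 ** (1 - t)).sum(axis=1)
        return vals.max()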

We do not go over the proof of this statement—which we leave to the reader as a challenging exercise—but we provide some intuition in the univariate case.

Observe that

    \[\min \left\{ \mathcal{P}_{\lambda} \left( x \right), \mathcal{P}_{\mu} \left( x \right) \right\}\]

decays rapidly away from

    \[x_{\mathrm{max}} := \mathrm{argmax}_{x \in \mathbb{Z}_{+}} \min \left\{ \mathcal{P}_{\lambda} \left( x \right), \mathcal{P}_{\mu} \left( x \right) \right\} ,\]

so we can obtain a good estimate of the sum

    \[\sum_{x \in \mathbb{Z}_{+}} \min \left\{ \mathcal{P}_{\lambda} \left( x \right), \mathcal{P}_{\mu} \left( x \right) \right\}\]

by simply estimating the term

    \[\min \left\{ \mathcal{P}_{\lambda} \left( x_{\mathrm{max}} \right), \mathcal{P}_{\mu} \left( x_{\mathrm{max}} \right) \right\} .\]

Now observe that x_{\mathrm{max}} must satisfy

    \[\mathcal{P}_{\lambda} \left( x_{\mathrm{max}} \right) \approx \mathcal{P}_{\mu} \left( x_{\mathrm{max}} \right) ;\]

after some algebra this is equivalent to

    \[x_{\mathrm{max}} \approx \frac{\lambda - \mu}{\log \left( \lambda / \mu \right)} .\]

Let t^{*} denote the maximizer in the expression of D_{+} \left( \lambda, \mu \right) in (4).

By differentiating in t, we obtain that t^{*} satisfies

    \[\lambda - \mu - \log \left( \lambda / \mu \right) \cdot \lambda^{t^{*}} \mu^{1 - t^{*}} = 0 ,\]

and so

    \[\lambda^{t^{*}} \mu^{1 - t^{*}} = \frac{\lambda - \mu}{\log \left( \lambda / \mu \right)} .\]

Thus we see that

    \[x_{\mathrm{max}} \approx \lambda^{t^{*}} \mu^{1 - t^{*}} ,\]

from which, after some algebra, we get that

    \[\mathcal{P}_{\lambda} \left( x_{\mathrm{max}} \right) \approx \mathcal{P}_{\mu} \left( x_{\mathrm{max}} \right) \approx \exp \left( - D_{+} \left( \lambda, \mu \right) \right) .\]

The proof of (3) in the multivariate case follows along the same lines: the single term corresponding to

    \[x_{\mathrm{max}} := \mathrm{argmax}_{x \in \mathbb{Z}_{+}^{k}} \min \left\{ \mathcal{P}_{\ln \left( n \right) c_{1}} \left( x \right), \mathcal{P}_{\ln \left( n \right) c_{2}} \left( x \right) \right\}\]

gives the lower bound. For the upper bound of (3) one has to show that the other terms do not contribute much more.

Characterizing exact recoverability using CH-divergence

Our conclusion is thus that the error exponent in testing multivariate Poisson distributions is given by the explicit quantity D_{+} in (4).

The discussion above then implies that D_{+} plays an important role in the threshold for exact recovery.

In particular, it intuitively follows from the above Lemma that a necessary condition for exact recovery should be that

    \[ \min_{i,j \in [k], i \neq j} D_{+} \left( \left( PQ \right)_{i}, \left( PQ \right)_{j} \right) \geq 1. \]

Suppose on the contrary that

    \[D_{+} \left( \left( PQ \right)_{i}, \left( PQ \right)_{j} \right) < 1\]

for some i and j.

This implies that the error probability in the testing problem is \Omega \left( n^{\varepsilon - 1} \right) for some \varepsilon > 0 for all vertices in communities i and j. Since the number of vertices in these communities is linear in n, and most of the hypothesis testing problems are approximately independent, one expects there to be no error in the testing problems with probability at most

    \[\left( 1 - \Omega \left( n^{\varepsilon - 1} \right) \right)^{\Omega \left( n \right)} = \exp\left( - \Omega \left( n^{\varepsilon} \right) \right) = o(1).\]

It turns out that this indeed gives the recoverability threshold.

Theorem [Abbe and Sandon 2015]

Let k \in \mathbb{Z}_{+} denote the number of communities,

let p \in (0,1)^{k} with \left\| p \right\|_{1} = 1 denote the community prior,

let P = \mathrm{diag}(p),

and let Q \in \left( 0, \infty \right)^{k \times k} be a symmetric k \times k matrix with no two rows equal.

Exact recovery is solvable in \mathrm{SBM} \left( n, p, \ln(n) Q / n \right) if and only if

(5)   \begin{equation*} \min_{i,j \in [k], i \neq j} D_{+} \left( \left( PQ \right)_{i}, \left( PQ \right)_{j} \right) \geq 1. \end{equation*}

This theorem thus provides an operational meaning to the CH-divergence for the community recovery problem.

Example [Symmetric communities]

Consider again k symmetric communities, that is,

p_i = 1/k for all i \in [k],

Q_{i,j} = a if i = j,

and Q_{i,j} = b otherwise, with a, b > 0.

Then exact recovery is solvable in \mathrm{SBM} \left( n, p, \ln(n) Q / n \right) if and only if

(6)   \begin{equation*} \left| \sqrt{a} - \sqrt{b} \right| \geq \sqrt{k}. \end{equation*}

We note that in this case D_{+} is the same as the Hellinger divergence.
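As a numerical sanity check of (6): in the symmetric case, D_{+} between two distinct community profiles works out to (\sqrt{a} - \sqrt{b})^2 / k, which the little grid-search sketch from above confirms (the parameter values here are mine).

    import numpy as np

    def ch_divergence(c1, c2, grid=10_001):
        # same grid-search sketch as above
        c1, c2 = np.asarray(c1, float), np.asarray(c2, float)
        t = np.linspace(0.0, 1.0, grid)[:, None]
        return (t * c1 + (1 - t) * c2 - c1 ** t * c2 ** (1 - t)).sum(axis=1).max()

    k, a, b = 2, 9.0, 1.0
    col_i = np.full(k, b / k); col_i[0] = a / k   # community profile (PQ)_i
    col_j = np.full(k, b / k); col_j[1] = a / k   # community profile (PQ)_j, j != i
    print(ch_divergence(col_i, col_j))            # approx (sqrt(a) - sqrt(b))**2 / k = 2.0
    print((np.sqrt(a) - np.sqrt(b)) ** 2 / k)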

Above we have heuristically described why the condition (5) is necessary.

To see why it is sufficient, first note that if (5) holds, then the Lemma tells us that in the hypothesis testing problem between Poisson distributions the error of the MAP estimate is o(1/n).

Thus if the setting of the thought experiment applies to every vertex, then by looking at the degree profiles of the vertices we can correctly reclassify all vertices, and the probability that we make an error is o(1) by a union bound.

However, the setting of the thought experiment does not quite apply. Nonetheless, in this logarithmic degree regime it is possible to partially reconstruct the labels of the vertices, with only o(n) vertices being misclassified. The details of this partial reconstruction procedure would require a separate post—in brief, it determines whether two vertices are in the same community or not by looking at how their \log(n) size neighborhoods interact.

One can then apply two rounds of the degree-profiling step to each vertex, and it can be shown that all vertices are then correctly labelled whp.

Posted in Probability theory, Random graphs

Kernel-based methods for convex bandits, part 3

(This post absolutely requires having read Part 1 and Part 2.) A key assumption at the end of Part 1 was that, after rescaling space so that the current exponential weights distribution p_t is isotropic, one has

(1)   \begin{equation*} \mathrm{supp}(p_t) \subset \{x \in \mathbb{R}^n : |x| \leq R\} =:F_{t-1} , \end{equation*}

for some R = \tilde{O}(\sqrt{n}). Effectively this means that the estimated cumulative loss outside of the ball F_t is infinite (recall that p_t is proportional to \exp(- \eta \sum_{s=1}^{t-1} \tilde{\ell}_s)). Thus to enforce (1) (at time t+1) we will actually set the loss estimate \tilde{\ell}_t to +\infty on F_t^c. The price one pays for this is that the loss estimator is now unbiased only on F_t, which in turn means that we control the cumulative regret with respect to points in F_T only. We believe that in the so-called stochastic setting where the loss sequence \ell_t is i.i.d. one should be able to prove that the minimizer of the expected loss function remains in F_t at all rounds t (for R=\tilde{O}(\sqrt{n})), which would imply with the calculations from Part 1 and Part 2 a cumulative regret of order \tilde{O}(n^{4.5} \sqrt{T}) (note: we were not able to prove this and I think it is a great and accessible open problem). However this will certainly not be true in general, as one could imagine an adversary that makes us zoom in on a small portion of space using the first T^{1-\epsilon} rounds and then moves the optimum very far from this region for the remaining (1-o(1)) T rounds. To solve this issue we introduce a restart condition: essentially if the algorithm believes that the optimum might be outside the region F_t then it restarts (i.e. it plays as if the next round was the first one in a game with time horizon T-t). We will modify the algorithm so we can ensure that when a restart occurs the algorithm actually had negative regret, and thus in the final regret bound we can ignore everything that happened before the last restart. The definition of the focus region F_t given in (1) will not quite work, and in fact we will construct a region which verifies

    \[F_{t-1} \subset \{x \in \mathbb{R}^n : |x| \leq O(n R) \} .\]

Furthermore we will need to take R = \tilde{O}(n^2) which in turn forces us to take \lambda = \tilde{O}(1/n^8) (indeed recall that in Part 1 we showed that the magnitude of the estimated loss is essentially \exp(nR \times \sqrt{\lambda n}), where nR instead of R comes from the above display, and furthermore in Part 2 we explained that with the Gaussian core one needs to take \lambda a factor n smaller than the value predicted in Part 1) and hence the \tilde{O}(n^{9.5} \sqrt{T}) final regret bound (indeed recall that we got a bound of \frac{n}{\lambda} \sqrt{n T}, where the first n comes from the fact that we need to take a learning rate a factor n smaller than the optimal learning rate to ensure approximate log-concavity of the exponential weights, and the \sqrt{n} comes from the relative entropy distance to the optimum at the beginning).

We will use extensively the following result:

Lemma: Let p = \exp(-f) be an isotropic log-concave measure. Then for any x such that |x| = \tilde{\Omega}(n) one has f(x) - \min_y f(y) = \Omega(|x|).

Restart condition

The simplest idea to restart goes as follows. Let us consider some x \in \partial F_t (with F_t defined as in (1)). Why is it there? Well there must be some time s in the past where we estimated the cumulative regret of x to be at least \Omega(R/\eta). In particular if this x now has a small estimated regret, say smaller than O(1/\eta) then we can reasonably believe that something weird is going on, for instance the adversary could be trying to move the optimum outside of the current region of focus F_t. Thus our restart condition looks like this: restart if

    \[\exists \ x \in \partial F_t : \eta \left(\sum_{s=1}^t \tilde{\ell}_s(x) - \min_{y \in F_t} \sum_{s=1}^t \tilde{\ell}_s(y) \right) = O(1) .\]

A key observation (which is the real reason to introduce the above restart condition) is that by convexity of the losses, by the concentration property of Part 2, and by taking the O(1) constant to be large enough, we know that the optimum (over all of \mathcal{K}) of the “true” cumulative loss \sum_{s=1}^t K^*_s \ell_s(x) is still in F_t at the time when we restart.

Hoping for negative regret

If we catch the adversary’s presumptive attempt to move the optimum out of F_t quickly enough we can hope to get negative regret. Indeed we initially zoomed in on region F_t for a good reason, and thus if we compare ourselves to the point x which triggers the restart we have accumulated a lot of negative regret during this zooming procedure (since x was performing badly during this time). Mathematically this shows up in the following term which appears in the regret bound calculation, where s is the time at which x enters the boundary of the focus region,

(2)   \begin{equation*} \mathrm{Ent}(\delta_x \Vert p_1) - \mathrm{Ent}(\delta_x \Vert p_{s+1}) = \log(p_{s+1}(x) / p_1(x)) . \end{equation*}

Roughly (essentially thanks to the Lemma above) we should expect this quantity to be as small as \tilde{O}(n) - R for R = \Omega(n), and thus this term divided by \eta (which is then O(- R \sqrt{T n})) could easily compensate the variance term which is O(\eta T) = O(\sqrt{T / n}). This sounds great, but unfortunately the difference of entropy in (2) does not appear in our current regret bound (because of the telescopic sum of entropies). However it is easy to see that if at time step s we had increased the learning rate from \eta to (1+\gamma) \eta then we would have the term - \mathrm{Ent}(\delta_x \Vert p_{s+1}) multiplied by \gamma in the final regret bound and thus we would only need to have \gamma R \geq \mathrm{Ent}(\delta_x \Vert p_1) /\eta = \tilde{O}(n) to ensure negative regret (note that compensating the entropy at the beginning is more difficult than compensating the variance term because of our choice of \eta). Let us make this slightly more formal.

Turning up the learning rate

Let us assume that we update the exponential weights distribution with a time-dependent learning rate as follows: p_{t+1}(x) \propto p_t(x) \exp(- \eta_t \tilde{\ell}_t(x)). One can see with the same calculations as in Part 1 that:

    \begin{align*} \langle p_t - \delta_x , \tilde{\ell}_t \rangle & \leq \frac{\mathrm{Ent}(\delta_x \Vert p_t) - \mathrm{Ent}(\delta_x \Vert p_{t+1}) }{\eta_t} + \eta_t \langle p_t, \tilde{\ell}_t^2 \rangle \\ & = \frac{\log(p_{t+1}(x)) - \log(p_t(x))}{\eta_t} + \eta_t \langle p_t, \tilde{\ell}_t^2 \rangle . \end{align*}

In particular if we increase the learning rate at some times \tau_1, \hdots, \tau_N by \eta_{\tau_i+1} = (1+\gamma) \eta_{\tau_i} then we have

    \begin{align*} & \sum_{t=1}^T \frac{\log(p_{t+1}(x)) - \log(p_t(x))}{\eta_t} \\ & = \frac{\log(p_{T+1}(x))}{\eta_{T}} - \frac{\log(p_1(x))}{\eta_1} + \sum_{i=1}^N \log(p_{\tau_i+1}(x)) \left(\frac{1}{\eta_{\tau_i}} - \frac{1}{\eta_{\tau_i+1}} \right) \\ & = \frac{\log(p_{T+1}(x))}{\eta_{T}} - \frac{\log(p_1(x))}{\eta_1} + \frac{\gamma}{1+\gamma} \sum_{i=1}^N \frac{\log(p_{\tau_i+1}(x))}{\eta_{\tau_i}} . \end{align*}

Using that the normalizing constants are all larger than T^{-O(n)} (this can be proved using the fact that the covariance of the exponential weights is at least some T^{-O(1)}), and assuming that \eta_T / \eta_1 = (1+\gamma)^N = O(1) (i.e. N \approx 1/\gamma), one gets roughly, with Q_t(x) = \sum_{s=1}^t \eta_s \tilde{\ell}_s(x),

    \begin{align*} \sum_{t=1}^T \langle p_t - \delta_x , \tilde{\ell}_t \rangle & \lesssim \eta_1 T + \frac{n \log(T)}{\eta_1} - \frac{\gamma}{\eta_1} \sum_{i=1}^N Q_{\tau_i}(x) \\ & \lesssim \sqrt{T} \bigg( (n \log(T))^{3/2} - \gamma \sqrt{n \log(T)} \max_{i \in [N]} Q_{\tau_i}(x) \bigg), \end{align*}

where the second inequality uses that \eta_1 \approx \frac{1}{\sqrt{T n \log(T)}} (see Part 2).

We see that by taking \gamma R = \Omega(n \log(T)) we can guarantee negative regret with respect to any x whose first time at which it belonged to the boundary of the focus region is also a time at which we increased the learning rate (i.e., one of the times \tau_1, \hdots, \tau_N). Thus we see that we should take \gamma = \tilde{\Theta}(1/\mathrm{poly}(n)) if we want to have a regret of order \tilde{O}(\mathrm{poly}(n) \sqrt{T}). Recall also that we can afford to update the learning rate only N \approx 1/\gamma = \tilde{\Theta}(\mathrm{poly}(n)) times. Thus the next idea is to update the focus region only when it is necessary. We will see that we can take N = O(n \log(T)) (and thus \gamma = \Theta(1/(n \log(T))) and R = \Theta( (n \log(T))^2)) while guaranteeing that the focus region satisfies

    \[F_{t+1} \subset \{x \in \mathbb{R}^n : \|x - \mathrm{mean}(p_t)\|_{\mathrm{Cov}(p_t)^{-1}} \leq O(n R) \}\]

As we explained at the beginning of this post this will conclude the proof of the n^{9.5} \sqrt{T} regret bound.

Updating the focus region more carefully

We update the focus region F_t only at times such that, once space is rescaled so that p_{t+1} is isotropic,

    \[\mathrm{Vol} \bigg(F_t \cap\{x \in \mathbb{R}^n : |x| \leq R\}\bigg) \leq \frac{1}{2} \mathrm{Vol}(F_t) ,\]

and in this case we set

    \[F_{t+1} := F_t \cap \{x \in \mathbb{R}^n : |x| \leq R\} .\]

Somewhat clearly we won’t make more than N=O(n \log(T)) updates (since after that many updates the focus region is really tiny) so the only thing to verify is that one always has (whether we update or not)

    \[F_{t+1} \subset \{x \in \mathbb{R}^n : |x| \leq O(n R) \} .\]

This follows from:

Lemma: Let \mathcal{K} be a convex body and \mathcal{E} be the centered unit ball. Suppose that \mathrm{Vol}(\mathcal{K} \cap \mathcal{E}) \geq \tfrac{1}{2} \mathrm{Vol}(\mathcal{K}). Then \mathcal{K} \subset 10 n \mathcal{E}.

Proof: Let us prove the contrapositive and assume that there is a point x \in \mathcal{K} with |x| > 10n. Denote s_i = \frac{2i}{10n}, i=1,\hdots,5n, and consider the sets \mathcal{K}_i = (1 - s_i) (\mathcal{E} \cap \mathcal{K}) + s_i x.

Note that those sets are disjoint. Indeed, the intervals (1 - s_i) [-1,1] + |x| s_i are disjoint, which implies that the projections of the ellipsoids (1 - s_i) \mathcal{E} + s_i x onto the span of x are disjoint. So, we have

    \[\mathrm{Vol}(\mathcal{K}) \geq \sum_{i=1}^{5 n} \mathrm{Vol}(\mathcal{K}_i) = \sum_{i=1}^{5 n} (1 - s_i)^{n} \mathrm{Vol}(\mathcal{E} \cap \mathcal{K}) \geq 2 \mathrm{Vol}(\mathcal{E} \cap \mathcal{K}) ,\]

which concludes the proof.

That’s it folks!

This concludes the informal proof! In the real proof we unfortunately need to deal with the various places where we said “events of probability 1/\mathrm{poly}(T) do not matter”. There is also the discrepancy between approximately log-concave and exactly log-concave measures. Finally, doing the concentration argument properly (with the induction) is one more place that adds some unwanted (but relatively minor) technical complications.

Another important part which I completely omitted in these lecture notes is the computational complexity of the resulting algorithm. Using known results on sampling from approximately log-concave measures it is fairly easy to sample x_t in time \mathrm{poly}(n) T. Checking whether the focus region has to be updated can also be done in poly time. The only real difficulty is checking the restart condition, which asks to minimize an approximately convex function on the boundary of a convex set! The trick here is that one can replace ellipsoids by boxes, and thus we are left with minimizing an approximately convex function on \mathrm{poly}(n) convex sets (which are n-1 dimensional). All of this can be found in the paper of course!

Posted in Optimization, Theoretical Computer Science

Kernel-based methods for convex bandits, part 2

The goal of this second lecture is to explain how to do the variance calculation we talked about at the end of Part 1 for the case where the exponential weights distribution is non-Gaussian. We will lose here a factor n^2 in the regret bound compared to the optimistic calculations from Part 1, and thus by the end of this lecture we could still hope to prove a regret bound of order n^{4.5} \sqrt{T}.

Gaussian core

First let us revisit the introduction of the core. Recall that we are considering a kernel K such that K \delta_x is the distribution of (1- \lambda) Z + \lambda x for some random variable Z to be defined. To control the regret we want to have the following inequality for any convex function f:

    \[K^* f(x) \leq (1-\lambda) \langle Kp, f\rangle + \lambda f(x) ,\]

which means that the distribution q of the random variable Z should satisfy

(1)   \begin{equation*} \langle q, f \rangle \leq \langle Kp, f\rangle . \end{equation*}

The core is defined so that (1) is in fact an equality. As we discussed in the previous post the core is a deep mathematical construction, but unfortunately we could not find a way to generalize Part 1’s variance calculation when the core is non-Gaussian. In what follows we describe a slightly different construction which will allow us to satisfy (1) with q being Gaussian. A key observation is that if r is convexly dominated by p, i.e. for any convex f one has \langle r, f \rangle \leq \langle p, f\rangle, then (1) is satisfied by taking q to be the core of r:

    \[\langle q, f \rangle = \mathbb{E}_{X \sim q} f(X) = \mathbb{E}_{X \sim q, Y \sim r} f((1-\lambda) X + \lambda Y) \leq \mathbb{E}_{X \sim q, Y \sim p} f((1-\lambda) X + \lambda Y) = \langle Kp, f\rangle .\]

Thus it suffices to show that for any p that we care about one can find a Gaussian r “inside” of p (in the sense that r is convexly dominated by p). Then instead of taking the core of p to define the kernel one can take the core of r (which we will also refer to as the Gaussian core of p).

Next we show that we can essentially restrict our attention to the case where p is log-concave, and then we show how to find a Gaussian inside a log-concave measure.

Approximate log-concavity

Recall that in Part 1 we replaced the variance calculation by simply showing that the loss estimates \tilde{\ell}_t(y) are bounded. Thus we see by induction and using Hoeffding-Azuma that with probability at least 1-\delta one has

    \[\left|\sum_{s=1}^{t-1} (\tilde{\ell}_s (y) - K_s^* \ell_s(y)) \right| \lesssim \sqrt{T \log(1/\delta)} .\]

Using a union bound over an \epsilon-net together with Lipschitzness of \tilde{\ell}_s (which we didn’t prove, but it follows along the same lines as boundedness) one has in fact that with high probability (say 1-1/T^{10}) there is a convex function f (recall that \mathbb{E}_{x_t \sim K_t p_t} \tilde{\ell}_t = K_t^* \ell_t is a convex function) such that

    \[\|f - \sum_{s=1}^{t-1} \tilde{\ell}_s\|_{\infty} \lesssim \sqrt{T n \log(T)} .\]

In particular provided that \eta \lesssim \frac{1}{\sqrt{T n \log(T)}} the above display shows that p_{t} (whose density is proportional to \exp(-\eta \sum_{s=1}^{t-1} \tilde{\ell}_s)) is O(1)-approximately log-concave (recall that p is said to be \epsilon-approximately log-concave if there is a log-concave function q such that \epsilon \leq p/q \leq 1/\epsilon). To simplify the discussion in these lectures we will in fact assume that p_t is exactly log-concave.

We note that the above concentration argument makes us lose a factor n in the regret. Indeed in Part 1 we used \eta \approx \sqrt{n / T} to (informally) obtain the regret n^{2.5} \sqrt{T} (or n^{1.5} \sqrt{T} with even more optimistic calculations). We see that now we are forced to take a learning rate which is a factor n smaller than this, which in turn multiplies the regret bound by a factor n.

Finding a Gaussian inside a log-concave

Let p be an isotropic log-concave measure. We will show that a centered Gaussian G with covariance \tilde{O}(\frac{1}{n}) I_n is (approximately) convexly dominated by p. We note in passing that, by going back to the calculations at the end of Part 1, it is easy to see that this factor \tilde{O}(\frac{1}{n}) in the covariance will in turn force us to take \lambda to be \tilde{O}(1/n) times what we hoped for at the end of Part 1, leading to yet again an extra n in the regret bound.

In fact we will show that any centered measure q supported on a ball of radius \epsilon=O(1) is (exactly) convexly dominated by p. This implies the (approximate) Gaussian convex domination we mentioned above since most of the mass of G is in such a ball (it is easy to deal with the remaining \mathrm{poly}(1/T) mass but we won’t worry about it here).

Let us fix some convex function f. We want to show that \langle q, f \rangle \leq \langle p, f \rangle. Since both measures are centered we may add any linear function to f without affecting this inequality, so we may legitimately assume that f(0)=0 and f(x) \geq 0 for all x\in \mathbb{R}^n. By scaling we might also assume that the maximum on the centered ball of radius \epsilon is 1 and thus it is enough to show \langle p, f \rangle \geq 1. By convexity f is above the linear function x \mapsto 1+ \nabla f(y)^{\top} (x - y) where y is the maximizer of f on the centered ball of radius \epsilon. Note also that |\nabla f(y)| \geq 1/\epsilon and thus it is enough to show (recall that f \geq 0):

    \[\langle p, x \mapsto \max(0, \theta^{\top} (x-y)) \rangle \geq \epsilon ,\]

where \theta = \nabla f(y) /|\nabla f(y)|, which is implied by

    \[\langle p, x \mapsto \max(0, \theta^{\top} x) \rangle \geq 2 \epsilon .\]

It only remains to use the fact that an isotropic real log-concave random variable X verifies \mathbb{E} (X \textbf{1}\{X \geq 0\}) \approx 1, and thus the above display holds true with \epsilon = O(1).
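As a quick numerical illustration of this last fact (a Python check of my own, on two standard isotropic log-concave distributions on the real line):

    import numpy as np

    rng = np.random.default_rng(0)
    m = 1_000_000
    gaussian = rng.standard_normal(m)                      # mean 0, variance 1
    laplace = rng.laplace(scale=1 / np.sqrt(2), size=m)    # mean 0, variance 1
    for X in (gaussian, laplace):
        # E[X 1{X >= 0}] is a constant of order 1 (roughly 0.40 and 0.35 here)
        print(np.mean(X * (X >= 0)))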

Posted in Optimization, Theoretical Computer Science

Kernel-based methods for bandit convex optimization, part 1

A month ago Ronen Eldan, Yin Tat Lee and myself posted our latest work on bandit convex optimization. I’m quite happy with the result (first polynomial time method with poly(dimension)\times\sqrt{T}-regret) but I’m even more excited by the techniques we developed. Next week I will give three 60-minute lectures at MSR to explain those techniques. I thought the notes for these lectures could be useful to people who are going through the main paper, as the informal style of blog posts/lectures allows one to gloss over some of the annoying details we had to deal with in the real proof. Each lecture covers essentially one major idea of the paper. Here we go for the notes of the first lecture!

Bandit convex optimization
Let \mathcal{K} be a convex body in \mathbb{R}^n. For t=1,\hdots, T, the player selects (at random) x_t \in \mathcal{K} and simultaneously an adversary selects a convex function \ell_t : \mathcal{K} \rightarrow [0,1]. The player observes/suffers the loss \ell_t(x_t). The player’s performance is measured by the regret:

    \[R_T = \sum_{t=1}^T \ell_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^T \ell_t(x) .\]

The goal of these notes is to give a “simple” informal proof of a \mathrm{poly}(n) \sqrt{T}-regret (with high probability).

The full information engine
Most previous works on bandit convex optimization were using Online Gradient Descent as a full information primitive (in the full information setting one observes the complete loss function \ell_t). Here we will use continuous exponential weights. Let us recall what this means and the basic regret bound one can obtain. To ease notation we use the same symbol for a probability measure p and for its density with respect to the Lebesgue measure (when it has one). We write \langle \cdot, \cdot \rangle for the usual L^2 inner product (i.e. \langle f, g\rangle = \int_{x \in \mathbb{R}^n} f(x) g(x) dx) and \mathrm{Ent}(f \Vert g) = \langle f, \log(f/g) \rangle.

Theorem: Let \ell_1, \hdots, \ell_T be arbitrary functions from \mathcal{K} to \mathbb{R}_+. Let p_1 be some measure on \mathcal{K} and define by induction

    \[p_{t+1}(x) = \frac{1}{Z_{t+1}} p_t(x) \exp(- \eta \ell_t(x)) ,\]

where Z_{t+1} is a normalization constant and \eta>0 is some fixed learning rate. Then

    \[\sum_{t=1}^T \langle p_t - q , \ell_t \rangle \leq \frac{\mathrm{Ent}(q \Vert p_1) - \mathrm{Ent}(q \Vert p_{T+1})}{\eta} + \frac{\eta}{2} \sum_{t=1}^T \langle p_t, \ell_t^2 \rangle .\]

Note that the quantity \sum_{t=1}^T \langle p_t - q , \ell_t \rangle should be interpreted as the expected regret of playing x_t at random from p_t compared to playing the fixed distribution q. Just as a teaser for the experts reading this, a key point of the argument in the third lecture will be to use the term - \mathrm{Ent}(q \Vert p_{T+1}) in the regret bound (usually we simply drop this term as it is negative).

Proof: Observe that

    \[{\mathrm{Ent}(q \Vert p_{t}) - \mathrm{Ent}(q \Vert p_{t+1}) = \langle q , \log(p_{t+1} / {p_t}) \rangle = - \log Z_{t+1} - \eta \langle q, \ell_t \rangle}.\]

Using the basic inequalities \log(1+s) \leq s and \exp(-s) \leq 1 - s + \frac{s^2}{2} for any s \geq 0, as well as the fact that \ell_t \geq 0 one obtains

    \[\log(Z_{t+1}) = \log(\langle p_t , \exp(- \eta \ell_t) \rangle) \leq - \eta \langle p_t, \ell_t \rangle + \frac{\eta^2}{2} \langle p_t, \ell_t^2 \rangle .\]

Rearranging the two above displays we proved

    \[\eta \langle p_t - q, \ell_t \rangle \leq \mathrm{Ent}(q \Vert p_{t}) - \mathrm{Ent}(q \Vert p_{t+1}) + \frac{\eta^2}{2} \langle p_t, \ell_t^2 \rangle ,\]

which concludes the proof.
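As an aside, here is a tiny Python/numpy illustration of this theorem (my own toy sketch, with the continuous measure replaced by a uniform grid on [0,1], which of course side-steps all the interesting measure-theoretic aspects):

    import numpy as np

    def exponential_weights(losses, n_points, eta):
        # discretized continuous exponential weights on a 1-d grid
        p = np.full(n_points, 1.0 / n_points)   # p_1: uniform prior
        dists = [p]
        for loss in losses:
            p = p * np.exp(-eta * loss)
            p = p / p.sum()                     # normalization constant Z_{t+1}
            dists.append(p)
        return dists

    grid = np.linspace(0.0, 1.0, 201)
    T = 500
    losses = [(grid - 0.7) ** 2 for _ in range(T)]   # toy losses l_t(x) = (x - 0.7)^2
    dists = exponential_weights(losses, len(grid), eta=np.sqrt(np.log(len(grid)) / T))
    print(grid[np.argmax(dists[-1])])                # mass concentrates near the minimizer 0.7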

Kernels
The usual bandit argument is to use the bandit feedback \ell_t(x_t) to build some unbiased estimator \tilde{\ell}_t of \ell_t (i.e. such that \mathbb{E}_{x_t \sim p_t} \tilde{\ell}_t = \ell_t) and then use the full information primitive with the estimated losses instead of the real ones. The previous theorem indicates that the only difficulty is then to control the variance of these estimators, i.e. \mathbb{E} \langle p_t, \tilde{\ell_t}^2 \rangle. This is where the shape assumption on the losses has to be used: the observation of the loss at x_t has to be somehow propagated to nearby points so as to have a low variance estimator. The bottleneck of bandit convex optimization is exactly that it is highly unclear how to leverage the convexity assumption to propagate information to nearby points. Our first idea is that one should not estimate \ell_t directly, but rather that it is enough to estimate a coarse approximation to \ell_t, where the level of coarseness depends on our level of uncertainty about where the optimum of the current cumulative loss lies. In particular “far” (where distances are measured with respect to the variance of our current exponential weights) from our current believed optimum (which is basically the center of mass of our current exponential weights) a very rough estimation of \ell_t is probably enough. We found that the best way to realize this idea is via kernels as we now explain.

Let K_t : \mathcal{K} \times \mathcal{K} \rightarrow \mathbb{R}_+ be a kernel (which will depend on the current exponential weights distribution p_t), which we think of as a linear operator over probability measures through the formula K_t q(x) = \int K_t(x,y) q(y) dy. The adjoint K_t^* acts on functions: K_t^* f(y) = \int f(x) K_t(x,y) dx (indeed \langle K_t q , f \rangle = \langle q, K_t^* f \rangle). The key point of K_t^* f is that there is a canonical unbiased estimator of it based on the observation of f(x_t) with x_t sampled from some distribution q, namely f(x_t) K_t(x_t, \cdot) / q(x_t). That’s great, and we will see soon how to control the variance of this estimator, but first let us talk about a more pressing issue: with these estimators one would control the regret as if the adversary was playing the losses K_t^* \ell_t, but what we want is to control the regret with respect to the real losses \ell_t. In other words we control terms of the form \langle p_t - q, K_t^* \ell_t \rangle = \langle K_t (p_t - q), \ell_t \rangle but what we really care about is something like \langle q_t - q, \ell_t \rangle for some q_t potentially different from p_t (indeed we can play from some q_t which can be different from the exponential weights distribution). A natural guess is to take q_t = K_t p_t, in which case we would be satisfied by an inequality of the form \langle K_t p_t - q, \ell_t \rangle \lesssim \langle K_t (p_t - q), \ell_t \rangle.

Generalized Bernoulli convolutions
(What follows is an almost exact copy of Section 1.2.2. in the main paper.) As we just explained above for a given measure p we want to find a kernel K such that \langle Kp - \delta_x, f \rangle \lesssim \langle K(p - \delta_x), f \rangle for all convex functions f and all points x \in \mathcal{K}. We note that for any \lambda \in (0,1) one has

(1)   \begin{equation*}  \langle Kp - \delta_x, f \rangle \leq \frac{1}{\lambda} \langle K(p - \delta_x), f \rangle \Leftrightarrow K^* f(x) \leq (1-\lambda) \langle Kp, f\rangle + \lambda f(x) . \end{equation*}

Leveraging the fact that f is convex we see that a natural kernel to consider is such that K \delta_x is the distribution of (1- \lambda) Z + \lambda x for some random variable Z to be defined. Indeed in this case one has

    \[K^* f(x) = \mathbb{E} f((1-\lambda) Z + \lambda x) \leq (1-\lambda) \mathbb{E} f(Z) + \lambda f(x) .\]

Thus this kernel satisfies the right hand side of (1) if Z is defined to be equal to K p, that is Z satisfies the following distributional identity, where X \sim p,

(2)   \begin{equation*}  Z \; \overset{D}{=} \; (1-\lambda) Z + \lambda X . \end{equation*}

If (2) holds true we say that Z is the core of p. It is easy to see that the core always exists and is unique by taking Z = \sum_{k=0}^{+\infty} (1-\lambda)^k \lambda X_k where X_0, X_1, \hdots are i.i.d. copies of X. Interestingly such random variables have a long history for the special case of a random sign X where they are called Bernoulli convolutions (they were introduced by Erdos in a 1939 paper). Our notion of core can thus be viewed as a generalized Bernoulli convolution. We refer the reader to a very nice survey on Bernoulli convolutions by Peres, Schlag and Solomyak, and we simply mention that the main objective in this literature is to understand for which values of \lambda the random variable Z is “smooth” (say for instance absolutely continuous with respect to the Lebesgue measure).
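Here is a small Python/numpy sketch (my own) that samples an approximate core by truncating this series; as a sanity check, for a standard Gaussian X the empirical covariance comes out close to \frac{\lambda}{2-\lambda} I_n, which is the value used in the variance calculation below.

    import numpy as np

    def sample_core(sample_X, lam, rng, n_terms=100):
        # approximate sample of the core: Z = sum_k (1-lam)^k * lam * X_k
        return sum((1 - lam) ** k * lam * sample_X(rng) for k in range(n_terms))

    lam, n = 0.1, 5
    rng = np.random.default_rng(0)
    Z = np.array([sample_core(lambda r: r.standard_normal(n), lam, rng)
                  for _ in range(5_000)])
    print(Z.var(axis=0).mean(), lam / (2 - lam))   # empirical variance vs lam/(2-lam)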

Summarizing the discussion so far, we see that by playing at round t from K_t p_t, where K_t is the kernel described above and p_t is the continuous exponential weights strategy on the estimated losses \tilde{\ell}_s = \ell_s(x_s) \frac{K_s(x_s, \cdot)}{K_s p_s(x_s)} one has for any q,

    \[\mathbb{E} \sum_{t=1}^T (\ell_t(x_t) - \langle q, \ell_t \rangle) \leq \frac{1}{\lambda} \mathbb{E} \left(\frac{\mathrm{Ent}(q \Vert p_1)}{\eta} + \frac{\eta}{2} \sum_{t=1}^T \langle p_t, \left(\frac{K_t(x_t, \cdot)}{K_t p_t(x_t)}\right)^2 \rangle\right) .\]

The term on the left hand side is almost exactly the expected regret when taking q to be uniform on a small ball around the minimizer of \sum_{t=1}^T \ell_t, and in this case the term \mathrm{Ent}(q \Vert p_1) is of order n \log(T).

Variance control in the Gaussian case
All that remains to be done is to control the variance term \mathbb{E}_{x \sim K p} \langle p, \tilde{\ell}^2 \rangle where \tilde{\ell}(y) = \frac{K(x, y)}{K p(x)} = \frac{K(x,y)}{\int K(x,y') p(y') dy'}. More precisely if this quantity is O(1) then we obtain a regret of \tilde{O}\left( \frac{1}{\lambda} \sqrt{n T}\right). This variance control will turn out to be more challenging than one might expect and our goal for the rest of this first lecture is to derive some intuition from the case where p is a standard Gaussian. In the next lecture we will see how to somehow reduce the general case to the Gaussian case. In the calculations below we will also assume that p is compactly supported on the Euclidean ball of radius R = \mathrm{poly}(n, \log(T)). Of course this is at odds with the Gaussian assumption, but again this issue will be dealt with in the next lecture. The more tricky part will be to ensure that the exponential weights distribution is indeed truncated outside of a ball of such radius (observe that p is assumed to be isotropic, so with respect to the distances in the original space this ball might represent a tiny ball in \mathcal{K}). This truncation (and dealing with the issues that result from it) will be the topic of the third lecture (this is where the negative entropic term in the regret bound of exponential weights will turn out to be useful!).
First observe that to bound the variance term it is in fact sufficient to show that K(x,y) / K(x,y') is bounded for all y,y' in the support of p and all x in the support of Kp. In fact it is fine if we have this control with probability at least 1-1/T^{10} with respect to the draw of x from Kp (recall that eventually x is the played point, so events with small probability on x can intuitively be ignored). Next we need to understand what K(x,y) is when p is a standard Gaussian. Observe that, with c denoting the density of the core of p, one always has K(x,y) = K \delta_y (x) = \mathrm{cst} \times c\left(\frac{x-\lambda y}{1-\lambda}\right). Furthermore it is a straightforward calculation that the core of a standard Gaussian is a centered Gaussian with covariance \frac{\lambda}{2-\lambda} I_n. Thus we obtain:

    \begin{align*} \frac{K(x,y)}{K(x,y')} & = \exp \left(\frac{2-\lambda}{2 \lambda} \left(\left\vert \frac{x-\lambda y'}{1-\lambda} \right\vert^2 - \left\vert \frac{x-\lambda y}{1-\lambda} \right\vert^2 \right) \right) \\ & \leq \exp\left( \frac{1}{(1-\lambda)^2} (4 R |x| + 2 \lambda R^2) \right). \end{align*}

Finally note that with high probability one has |x| \lesssim \lambda R + \sqrt{\lambda n \log(T)}. Ignoring logarithmic factors, and provided that somehow one could take R = \tilde{O}(\sqrt{n}) (which is the ball that contains most of the mass of a standard Gaussian so it seems reasonable), we see that for \lambda = \tilde{O}(1/n^2) we have a constant variance, which in turn leads to a regret of order n^{2.5} \sqrt{T}. In fact by being more careful in bounding the variance one could hope that \lambda = \tilde{O}(1/n) is enough (basically in the display above one could do a moment generating function calculation instead of Cauchy-Schwarz to replace the term R|x| by |x|^2), which in turn would give a n^{1.5} \sqrt{T} regret. This latter result is what we conjecture to be optimal, but we are quite far from being able to prove this!

In the next lecture we will do a little bit of geometry of log-concave measures to show how to reduce the general p case to the Gaussian case. In the third and final lecture we will deal with the truncation of the exponential weights distribution (this will force us to significantly modify the algorithm).

Posted in Optimization, Theoretical Computer Science