COLT 2018 call for papers

Philippe Rigollet and myself will be the program chairs for this year’s edition of COLT. It will be in Stockholm in July, which I hear is absolutely gorgeous at that time of the year. We also have a fantastic lineup of invited speakers: Stephane Mallat, Susan Murphy, and Johan Hastad. We believe we will also guarantee high quality reviews for your submissions as we have assembled a great (and extended) program committee to make sure that we cover all aspects of modern learning theory (in particular we expect the number of subreviewers to be minimal). See below for the PC, as well as the call for papers. Looking forward to read all your submissions in February!

Program Committee:

Jacob Abernethy (Georgia Tech)
Shivani Agarwal (University of Pennsylvania)
Shipra Agrawal (Columbia University)
Alexandr Andoni (Columbia University)
Pranjal Awasthi (Rutgers University)
Francis Bach (INRIA)
Nina Balcan (Carnegie Mellon University)
Afonso Bandeira (New-York University)
Mihail Belkin (Ohio State University)
Shai Ben-David (University of Waterloo)
Quentin Berthet (University of Cambridge)
Alina Beygelzimer (Yahoo! Research)
Avrim Blum (TTI Chicago)
Guy Bresler (MIT)
Olivier Cappé (Université Paris-Saclay)
Constantine Caramanis (UT Austin)
Alexandra Carpentier (OvGU Magdeburg)
Nicolo Cesa-Bianchi (University of Milan)
Arnak Dalalyan (ENSAE)
Amit Daniely (Hebrew University)
Ronen Eldan (Weizman Institute)
Tim van Erven (Leiden University)
Vitaly Feldman (IBM Research)
Aurelien Garivier (University Paul-Sabatier)
Rong Ge (Duke University)
Claudio Gentile (Universita degli Studi dell’Insubria)
Steve Hanneke
Elad Hazan (Princeton University)
Daniel Hsu (Columbia University)
Prateek Jain (Microsoft Research)
Satyen Kale (Google)
Varun Kanade (Oxford University) Vladimir Koltchinskii (Georgia Tech)
Wouter Koolen (CWI, Amsterdam)
Tomer Koren (Google)
John Lafferty (Yale University)
Po-Ling Loh (UW Madison)
Gabor Lugosi (ICREA and Pompeu Fabra University)
Shie Mannor (Technion)
Yishay Mansour (Tel Aviv University and Google)
Ankur Moitra (MIT)
Robert Nowak (UW Madison)
Vianney Perchet (ENS Paris-Saclay and Criteo)
Alexandre Proutiere (KTH)
Luis Rademacher (UC Davis)
Maxim Raginsky (University of Illinois)
Sasha Rakhlin (University of Pennsylvania)
Lorenzo Rosasco (MIT and Universita’ di Genova)
Robert Schapire (Microsoft Research)
Ohad Shamir (Weizman Institute)
David Steurer (ETH Zurich)
Suvrit Sra (MIT)
Nathan Srebro (TTI Chicago)
Karthik Sridharan (Cornell University)
Csaba Szepesvari (University of Alberta)
Matus Telgarsky (University of Illinois)
Ambuj Tewari (University of Michigan)
Alexandre Tsybakov (ENSAE-CREST)
Ruth Urner (Max Planck Institute)
Santosh Vempala (Georgia Tech)
Roman Vershynin (UC Irvine)
Manfred Warmuth (UC Santa Cruz)
Yihong Wu (Yale University)


Call for papers:

The 31st Annual Conference on Learning Theory (COLT 2018) will take place in Stockholm, Sweden, on July 5-9, 2018 (with a welcome reception on the 4th), immediately before ICML 2018, which takes place in the same city. We invite submissions of papers addressing theoretical aspects of machine learning and related topics. We strongly support a broad definition of learning theory, including, but not limited to:
-Design and analysis of learning algorithms
-Statistical and computational complexity of learning
-Optimization methods for learning
-Unsupervised, semi-supervised, online and active learning
-Interactions with other mathematical fields
-Interactions with statistical physics
-Artificial neural networks, including deep learning
-High-dimensional and non-parametric statistics
-Learning with algebraic or combinatorial structure
-Geometric and topological data analysis
-Bayesian methods in learning
-Planning and control, including reinforcement learning
-Learning with system constraints: e.g. privacy, memory or communication budget
-Learning from complex data: e.g., networks, time series, etc.
-Learning in other settings: e.g. social, economic, and game-theoretic

Submissions by authors who are new to COLT are encouraged. While the primary focus of the conference is theoretical, the authors may support their analysis by including relevant experimental results.
All accepted papers will be presented in a single track at the conference. At least one of each paper’s authors should be present at the conference to present the work. Accepted papers will be published electronically in the Proceedings of Machine Learning Research (PMLR). The authors of accepted papers will have the option of opting-out of the proceedings in favor of a 1-page extended abstract. The full paper reviewed for COLT will then be placed on the arXiv repository.

Paper Awards:

COLT will award both best paper and best student paper awards. To be eligible for the best student paper award, the primary contributor(s) must be full-time students at the time of submission. For eligible papers, authors must indicate at submission time if they wish their paper to be considered for a student paper award. The program committee may decline to make these awards, or may split them among several papers.

Dual Submissions:

Submissions that are substantially similar to papers that have been previously published, accepted for publication, or submitted in parallel to other peer-reviewed conferences with proceedings may not be submitted to COLT. The same policy applies to journals, unless the submission is a short version of a paper submitted to a journal, and not yet published. Authors must declare such dual submissions either through the Easychair submission form, or via email to the program chairs.


Submissions are limited to 12 PMLR-formatted pages, plus unlimited additional pages for references and appendices. All details, proofs and derivations required to substantiate the results must be included in the submission, possibly in the appendices. However, the contribution, novelty and significance of submissions will be judged primarily based on the main text (without appendices), and so enough details, including proof details, must be provided in the main text to convince the reviewers of the submissions’ merits. Formatting and submission instructions can be found here.

Rebuttal Phase:

As in previous years, there will be a rebuttal phase during the review process. Initial reviews will be sent to authors before final decisions have been made. Authors will have the opportunity to provide a short response on the PC’s initial evaluation.

Open Problems Session:

We also invite submission of open problems. A separate call for open problems will be made available on the conference website.

Important Dates:

-Paper submission deadline: February 16, 2018, 11:00 PM EST
-Author feedback: April 18-21, 2018
-Author notification: May 2, 2018
-Conference: July 5-9, 2018 (welcome reception on the 4th)


Papers should be submitted through EasyChair at


Posted in Conference/workshop | Leave a comment

Michael B. Cohen

This is an incredibly difficult post to write. Michael Benjamin Cohen, an amazing student and person passed away this week in Berkeley. Below is the official MSR statement (where he spent the summer together with many friends) and some personal remembrance. Also this is a picture of one of the many many happy moments we had together this summer at MSR (with Yin Tat Lee too):

MSR statement:

The Microsoft Research family is shocked and deeply saddened by the tragic loss of our colleague and friend Michael Cohen.  Michael was a brilliant mathematician and a rising star in his field.  He spent this past summer with us as a Microsoft Research Fellow, doing what he loved most. Over the summer, he made sweeping progress in online learning and online algorithms, two fields he had just recently become acquainted with. In addition to solving five open problems in these areas, he continued his substantial progress on the k-server problem, one of the most celebrated and notoriously difficult challenges in the space of adaptive algorithms.

Michael was a truly exceptional individual.  We will remember Michael for his infectious smile and his larger-than-life personality. We will never forget his unrelenting curiosity, his thirst for knowledge, and his deep love for mathematics and theoretical foundations of computing.  We are stunned by his loss.  We will hold onto our memories of Michael, and know that his ideas and scientific accomplishments will continue on as important advances.

We extend our most sincere condolences to Michael’s family and friends.

Personal notes:

I have known Michael for less than a year, but in that short time span he had a profound impact on me, like only a handful of people have had in my scientific life. I will always remember our first meeting, on October 26th 2016 at MIT. We were about to start lunch with a small group of graduate students and Michael entered the room, he (gently) interrupted the conversation and his first sentence to me was a question about mirror descent that I was not able to answer (we now know the answer, and as it turns out his question was pretty deep and the answer highly non-trivial). This is a typical Michael story, and I’m sure anyone who has interacted with him had similar experiences, where a seemingly innocuous question (at a perhaps inopportune time) turned out to be extremely deep and interesting.

Michael’s way of doing mathematics was truly remarkable. He was relentlessly searching for the right way to look at a problem, and he was never satisfied with a hacky explanation. As far as I know he never wrote anything on paper while doing research. I assume that if the calculations were too long to fit in his head then it was probably the wrong approach anyway. Doing research with him was an immense pleasure. He was always willing to share his new and original point of view on his current readings. He was also interested in a breadth of topics that is only matched by a tiny fraction of researchers in theoretical computer science. As James Lee shared in his comment on Luca’s blog, I too was feeling that all of this was just warm-up for him. I am incredibly sad to have lost a friend and a great collaborator, but I’m also sad that we will never see Michael’s next steps.

Beyond his mathematical gift he was also a kind and gentle human being. Perhaps it was not obvious to see at first sight with his exuberant personality, but he was compassionate and caring.

We truly lost one of the best. It is really hard to cope with the unfairness of these events. We will do our best to share Michael’s wonderful ideas with the world, and to make his scientific accomplishements live on. Needless to say my heart breaks for his family and all my thoughts are with them at this moment.

Posted in Uncategorized | 7 Comments

Smooth distributed convex optimization

A couple of months ago we (Kevin Scaman, Francis Bach, Yin Tat Lee, Laurent Massoulie and myself) uploaded a new paper on distributed convex optimization. We came up with a pretty clean picture for the optimal oracle complexity of this setting, which I will describe below. I should note that there are hundreds of papers on this topic, but the point of the post is to show our simple cute calculations and not to survey the immense literature on distributed optimization, see the paper itself for a number of pointers to other recent works.

Distributed optimization setting

Let G=(V,E) be an undirected graph on n vertices (|V| =n) and with diameter \Delta. We will think of the nodes in G as the computing units. To each vertex v \in V there is an associated convex function f_v : \mathbb{R}^d \rightarrow \mathbb{R}. For machine learning applications one can think that each computing unit has access to a “private” dataset, and f_v(x) represents the fit of the model corresponding to x \in \mathbb{R}^d on this dataset (say measured on least squares loss, or logistic loss for example). The goal will be to find in a distributed way the optimal “consensus” point:

    \[x^* \in \mathrm{argmin}_{x \in \mathbb{R}^d} \sum_{v \in V} f_v(x) .\]

The distributed processing protocol is as follows: asynchronously/in parallel, each node v can (i) compute a (local) gradient \nabla f_i in time 1, and (ii) communicate a vector in \mathbb{R}^d to its neighbors in G in time \tau. We denote by x_{v,t} the local model (essentially its guess for x^*) of node v at time t \in \mathbb{R}_+. We aim to characterize the smallest time T_{\epsilon} such that one can guarantee that all nodes v satisfy \bar{f}(x_{v,T}) - \bar{f}(x^*) \leq \epsilon where \bar{f} = \frac1n \sum_{v \in V} f_v.

We focus on the case where \bar{f} is \beta-smooth and \alpha-strongly convex (\kappa =\beta/\alpha is the condition number), which is arguably the most challenging case since one expects linear convergence (i.e., the scaling of T_{\epsilon} in \epsilon should be \log(1/\epsilon)) which a priori makes the interaction of optimization error and communication error potentially delicate (one key finding is that in fact it is not delicate!). Also, having in mind applications outside of large-scale machine learning (such as “federated” learning), we will make no assumptions about the functions at different vertices relate to each other.

A trivial answer

Recall that Nesterov’s accelerated gradient descent solves the serial problem in time O(\sqrt{\kappa} \log(1/\epsilon)). Trivially one can distribute a step of Nesterov’s accelerated gradient descent in time O(\tau \Delta) (simply designate a master node at the beginning, and everybody sends its local gradient to the master node in time O(\tau \Delta)). Thus we arrive at the upper bound T_{\epsilon} = O((1+\tau \Delta) \sqrt{\kappa} \log(1/\epsilon)) using a trivial (centralized) algorithm. We now show (slightly informally, see the paper for proper definitions) that this in fact optimal!

First let us recall the lower bound proof in the serial case (see for example Theorem 3.15 here). The idea is to introduce the function f(x) = \beta L[x,x] - 2 \alpha x(1) + \alpha |x|^2 where L is the Laplacian of the path graph on \N, or in other words

    \[L[x,x] = \sum_{i \in \mathbb{N}} (x(i) - x(i+1))^2 .\]

First it is easy to see that this function is indeed O(\beta)-smooth and \Omega(\alpha)-strongly convex. The key point is that, for any algorithm starting at x_0=0 and such that each iteration stays in the linear span of the previously computed gradients (a very natural assumption) then

    \[x_t(i) = 0 \;\; \text{for all} \;\; i > t ~.\]

In words one can say that each gradient calculation “discovers” a new edge of the path graph involved in the definition of f. Concluding the serial proof is then just a matter of brute force calculations.

Now let us move to the distributed setting, and consider two vertices u and v that realize the diameter of G. The idea goes as follows: let L_1 (respectively L_2) be the Laplacian of even edges of the path graph on \mathbb{N} (respectively the odd edges), that is

    \[L_1[x,x] = \sum_{i \in \mathbb{N} : i \; \text{even}} (x(i) - x(i+1))^2 .\]

Now define \frac{1}{n} f_u = \beta L_1[x,x] - 2 \alpha x(1) + \frac{\alpha}{2} |x|^2, \frac{1}{n} f_v = \beta L_2[x,x] + \frac{\alpha}{2} |x|^2, and f_w = 0 for any w \not\in \{u,v\}. The key observation is that node u does not “know” about the even edges until it receives a message from v and vice versa. Thus it fairly easy to show that in this case one has:

    \[x_{u,t}(i) = 0 \;\; \text{for all} \;\; i > t / (1+\tau \Delta) ~,\]

which effectively amounts to a slowdown by a factor (1+\tau \Delta) compared to the serial case and proves the lower bound T_{\epsilon} = \Omega((1+\tau \Delta) \sqrt{\kappa} \log(1/\epsilon).

Not so fast!

One can say that the algorithm proposed above defeats a bit the purpose of the distributed setting. Indeed the centralized communication protocol it relies on is not robust to various real-life issues such as machine failures, time-varying graphs, edges with different latency, etc. An elegant and practical solution is to restrict communication to be gossip-like. That is local computations have now to be communicated via matrix multiplication with a walk matrix W which we define as satisfying the following three conditions: (i) W(i,j) \neq 0 \Rightarrow (i,j) \in E, (ii) \mathrm{Ker}(W) =\mathrm{Span}((1,\hdots,1)), and (iii) W \succeq 0. Let us briefly discuss these conditions: (i) simply means that if x \in \mathbb{R}^V represents real values stored at the vertices, then W x can be calculated with a distributed communication protocol; (ii) says that if there is consensus (that is all vertices hold the same value) then no communication occurs with this matrix multiplication; and (iii) will turn out to be natural in a just a moment for our algorithm based on duality. A prominent example of a walk matrix would be the (normalized) Laplacian of G

We denote by \gamma the inverse condition number of W on (1,\hdots,1)^{\perp} (that is the ratio of the smallest non-zero eigvenvalue of W to its largest eigenvalue), also known as the spectral gap of G when W is the Laplacian. Notice that \gamma naturally controls the number of gossip steps to reach consensus, in the sense that gossip steps corresponds to gradient descent steps on f(x) = x^{\top} W x, which will converge in O(\gamma^{-1} \log(1/\epsilon)) steps. Doing an “accelerated gossip” (also known as Chebyshev gossiping) one could thus hope to essentially replace the diameter \Delta by O(\gamma^{-1/2} \log(1/\epsilon)). Notice that this is hopeful thinking because in the centralized model \Delta steps gets you to an exact consensus, while in the gossip model one only reaches an \epsilon-approximate consensus and errors might compound. In fact with a bit of graph theory one can immediately see that simply replacing \Delta by O(\gamma^{-1/2} \log(1/\epsilon)) is too good to be true: there are graphs (namely expanders) where \Delta is of order \log(n) while \gamma is of order of a constant, and thus an upper bound of the form (say) \Omega((1+\tau \gamma^{-1/2} \log(1/\epsilon)) \sqrt{\kappa} \log(1/\epsilon) would violate our previous lower bound by a factor 1/\log(n).

To save the day we will make extra assumptions, namely that each local function f_v has condition number \kappa and that in addition to computing local gradient the vertices can also compute local gradients of the Fenchel dual functions f_v^*. The latter assumption can be removed at the expense of extra logarithmic factors but we will ignore this point here (see the paper for some hints as well as further discussion on this point). For the former assumption we note that the lower bound proof given above completely breaks under this assumption. However one can save the construction for some specific graphs (finding the correct generalization to arbitrary graphs is one of our open problems). For example imagine a line graph, and cluster the vertices into three groups, the first third, the middle, and the last third. Then one could distribute the even part of the Laplacian on \mathbb{N} in the first group, and the odd part on the last group, as well as distribute the Euclidean norm evenly among all vertices. This construction verifies that each vertex function has condition number O(\kappa) and furthermore the rest of the argument still goes through. Interestingly in this case one also has \Delta \approx 1/\sqrt{\gamma} and thus this proves that for the line graph one has T_{\epsilon} = \Omega((1+\tau \gamma^{-1/2}) \sqrt{\kappa} \log(1/\epsilon) for gossip algorithms. We will now show a matching upper bound (which holds for arbitrary graphs).

Dual algorithm

For X \in \mathbb{R}^{d \times n} (which we think of as a set of column vectors, one for each vertex i \in V = [n]), denote X_i for the i^{th} column and let F(X) := \sum_{i =1}^n f_i(X_i). We are interested in minimizing F under the constraint that all columns are equal, which can be written as X \sqrt{W} = 0. By definition of the Fenchel dual F^* and a simple change of variable one has:

    \[\min_{X \in \mathbb{R}^{d\times n} : X \sqrt{W} = 0} F(X) = \max_{\lambda \in \mathbb{R}^{d \times n}} - F^*(\lambda \sqrt{W}) .\]

Next observe that gradient ascent on \lambda can be written as

    \[\lambda^+ = \lambda + \eta \nabla F^*(\lambda \sqrt{W}) \sqrt{W}\]

and with the notation y = \lambda \sqrt{W} this is simply y^+ = y - \eta \nabla F^*(y) W. Crucially \nabla F^*(y) W exactly corresponds to gossiping the local conjugate gradients (which are also the local models) \nabla f_i^*(y_{i,t}) =: X_{i,t}. In other words we only have to understand the condition number of the function \lambda \mapsto F^*(\lambda \sqrt{W}). The beauty of all of this is that this condition number is precisely \kappa / \gamma (i.e. it naturally combines the condition number of the vertex functions with the “condition number” of the graph). Thus by accelerating gradient ascent we arrive at a time complexity of O((1+\tau) \sqrt{\kappa / \gamma} \log(1/\epsilon)) (recall that a gossip step takes time \tau). We call the corresponding algorithm SSDA (Single-Step Dual Accelerated). One can improve it slightly in the case of low communication cost by doing multiple rounds of communication between two gradient computations (essentially replacing W by W^k). We call the corresponding algorithm MSDA (Multi-Step Dual Accelerated) and its attains the optimal (in the worst case over graphs) complexity of O((1+\tau \gamma^{-1/2}) \sqrt{\kappa} \log(1/\epsilon)).

Posted in Optimization | Leave a comment

COLT 2017 accepted papers

The list of accepted papers at COLT 2017 has been published and it looks particularly good (see below with links to arxiv version)! The growth trend of previous years continues with 228 submissions (14% increase from 2016) and 73 accepted papers. Note also that this year’s edition will be in Amsterdam beginning of July which should be fun. The deadline for early registration is in two days, so hurry up!

COLT 2017 accepted papers
– Constantinos Daskalakis, Manolis Zampetakis and Christos Tzamos. Ten Steps of EM Suffice for Mixtures of Two Gaussians
– Shachar Lovett and Jiapeng Zhang. Noisy Population Recovery from Unknown Noise
– Jonathan Scarlett, Ilija Bogunovic and Volkan Cevher. Lower Bounds on Regret for Noisy Gaussian Process Bandit Optimization
– Avrim Blum and Yishay Mansour. Efficient Co-Training of Linear Separators under Weak Dependence
– Michal Moshkovitz and Dana Moshkovitz. Mixing Implies Lower Bounds for Space Bounded Learning
– Mitali Bafna and Jonathan Ullman. The Price of Selection in Differential Privacy
– Nader Bshouty, Dana Drachsler Cohen, Martin Vechev and Eran Yahav. Learning Disjunctions of Predicates
– Avinatan Hassidim and Yaron Singer. Submodular Optimization under Noise
– Debarghya Ghoshdastidar, Ulrike von Luxburg, Maurilio Gutzeit and Alexandra Carpentier. Two-Sample Tests for Large Random Graphs using Network Statistics
– Andreas Maurer. A second-order look at stability and generalization
– Eric Balkanski and Yaron Singer. The Sample Complexity of Optimizing a Convex Function
– Daniel Vainsencher, Shie Mannor and Huan Xu. Ignoring Is a Bliss: Learning with Large Noise Through Reweighting-Minimization
– Alekh Agarwal, Haipeng Luo, Behnam Neyshabur and Robert Schapire. Corralling a Band of Bandit Algorithms
– Nikita Zhivotovskiy. Optimal learning via local entropies and sample compression
– Max Simchowitz, Kevin Jamieson and Benjamin Recht. The Simulator: Understanding Adaptive Sampling in the Moderate-Confidence Regime
– Lunjia Hu, Ruihan Wu, Tianhong Li and Liwei Wang. Quadratic Upper Bound for Recursive Teaching Dimension of Finite VC Classes
– Nicolas Brosse, Alain Durmus, Eric Moulines and Marcelo Pereyra. Sampling from a log-concave distribution with compact support with proximal Langevin Monte Carlo
– Maxim Raginsky, Alexander Rakhlin and Matus Telgarsky. Non-Convex Learning via Stochastic Gradient Langevin Dynamics: A Nonasymptotic Analysis
– Alon Cohen, Tamir Hazan and Tomer Koren. Tight Bounds for Bandit Combinatorial Optimization
– Moran Feldman, Christopher Harshaw and Amin Karbasi. Greed Is Good: Near-Optimal Submodular Maximization via Greedy Optimization
– Bin Hu, Peter Seiler and Anders Rantzer. A Unified Analysis of Stochastic Optimization Methods Using Jump System Theory and Quadratic Constraints
– Jerry Li. Robust Sparse Estimation Tasks in High Dimensions (*to be merge)
– Yeshwanth Cherapanamjeri, Prateek Jain and Praneeth Netrapalli. Thresholding based Efficient Outlier Robust PCA
– Amir Globerson, Roi Livni and Shai Shalev-Shwartz. Effective Semisupervised Learning on Manifolds
– Yuchen Zhang, Percy Liang and Moses Charikar. A Hitting Time Analysis of Stochastic Gradient Langevin Dynamics
– Ashok Cutkosky and Kwabena Boahen. Online Learning Without Prior Information
Joon Kwon, Vianney Perchet and Claire Vernade. Sparse Stochastic Bandits
– Blake Woodworth, Suriya Gunasekar, Mesrob I. Ohannessian and Nathan Srebro. Learning Non-Discriminatory Predictors
– Arpit Agarwal, Shivani Agarwal, Sepehr Assadi and Sanjeev Khanna. Learning with Limited Rounds of Adaptivity: Coin Tossing, Multi-Armed Bandits, and Ranking from Pairwise Comparisons
– Jerry Li and Ludwig Schmidt. Robust Proper Learning for Mixtures of Gaussians via Systems of Polynomial Inequalities
Alexandr Andoni, Daniel Hsu, Kevin Shi and Xiaorui Sun. Correspondence retrieval
– Andrea Locatelli, Alexandra Carpentier and Samory Kpotufe. Adaptivity to Noise Parameters in Nonparametric Active Learning
Salil Vadhan. On Learning versus Refutation
– Sebastian Casalaina-Martin, Rafael Frongillo, Tom Morgan and Bo Waggoner. Multi-Observation Elicitation
Vitaly Feldman and Thomas Steinke. Generalization for Adaptively-chosen Estimators via Stable Median
Shipra Agrawal, Vashist Avadhanula, Vineet Goyal and Assaf Zeevi. Thompson Sampling for the MNL-Bandit
Rafael Frongillo and Andrew Nobel. Memoryless Sequences for Differentiable Losses
– Pranjal Awasthi, Avrim Blum, Nika Haghtalab and Yishay Mansour. Efficient PAC Learning from the Crowd
– Tselil Schramm and David Steurer. Fast and robust tensor decomposition with applications to dictionary learning
– Yury Polyanskiy, Ananda Theertha Suresh and Yihong Wu. Sample complexity of population recovery
– Holden Lee, Rong Ge, Tengyu Ma, Andrej Risteski and Sanjeev Arora. On the Ability of Neural Nets to Express Distributions
Lijie Chen, Anupam Gupta, Jian Li, Mingda Qiao and Ruosong Wang. Nearly Optimal Sampling Algorithms for Combinatorial Pure Exploration
Dylan Foster, Alexander Rakhlin and Karthik Sridharan. ZIGZAG: A new approach to adaptive online learning
Posted in Conference/workshop | 1 Comment

Discrepancy algorithm inspired by gradient descent and multiplicative weights; after Levy, Ramadas and Rothvoss

A week or so ago at our Theory Lunch we had the pleasure to listen to Harishchandra Ramadas (student of Thomas Rothvoss) who told us about their latest discrepancy algorithm. I think the algorithm is quite interesting as it combines ideas from gradient descent and multiplicative weights in a non-trivial (yet very simple) way. Below I reprove Spencer’s 6 deviations theorem with their machinery (in the actual paper Levy, Ramadas and Rothvoss do more than this).

First let me remind you the setting (see also this previous blog post for some motivation on discrepancy and a bit more context; by the way it is funny to read the comments in that post after this): given v_1, \hdots, v_n \in \mathbb{S}^{n-1} one wants to find x \in \{-1,1\}^n (think of it as a “coloring” of the coordinates) such that \max_{i \in [n]} |x \cdot v_i| \leq C for some numerical constant C>0 (when v_i is a normalized vectors of 1‘s and 0‘s the quantity |x \cdot v_i| represents the unbalancedness of the coloring in the set corresponding to v_i). Clearly it suffices to give a method to find x \in [-1,1]^n with at least half of its coordinates equal to -1 and 1 and such that \max_{i \in [n]} |x \cdot v_i| \leq C' for some numerical constant C'>0 (indeed one can then simply recurse on the coordinates not yet set to -1 or 1; this is the so-called “partial coloring” argument). Note also that one can drop the absolute value by taking v_i and -v_i (the number of constraints then becomes 2n but this is easy to deal with and we ignore it here for sake of simplicity).

The algorithm

Let x_0 = 0, w_0 = 1 \in \mathbb{R}^n. We run an iterative algorithm which keeps at every time step t \in \mathbb{N} a subspace U_t of valid update directions and then proceeds as follows. First find (using for instance a basis for U_t) z_t \in \mathbb{S}^{n-1} \bigcap U_t such that

(1)   \begin{equation*}  \sum_{i=1}^n \frac{w_t(i)}{\|w\|_1} (v_i \cdot z_t)^2 \leq \frac{1}{\mathrm{dim}(U_t)} . \end{equation*}

Then update x_{t+1}= x_t + \lambda_t z_t where \lambda_t \in [0,1] is maximal so that x_{t+1} remains in [-1,1]^n. Finally update the exponential weights by w_{t+1}(i) = w_t(i) \exp( v_i \cdot (x_{t+1} - x_t) ).


It remains to describe the subspace U_t. For this we introduce the set I_t \subset [n] containing the n/16^{th} largest coordinates of w_t (the “inactive” coordinates) and the set F_t \subset [n] containing the coordinates of x_t equal to -1 or 1 (the “frozen” coordinates). The subspace U_t is now described as the set of points orthogonal to (i) x_t, (ii) e_j, j \in F_t, (iii) v_i, i \in I_t, (iv) \sum_{i=1}^n w_t(i) v_i. The intuition for (i) and (ii) is rather clear: for (i) one simply wants to ensure that the method keeps making progress towards the boundary of the cube (i.e., |x_{t+1}| > |x_t|) while for (ii) one wants to make sure that coordinates which are already “colored” (i.e., set to -1 or 1) are not updated. In particular (i) and (ii) together ensures that at each step either the norm squared of x_t augments by 1 (in particular \lambda_t=1) or that one fixes forever one of the coordinates to -1 or 1. In particular this means that after at most 3 n /2 iterations one will have a partial coloring (i.e., half of the coordinates set to -1 or 1, which was our objective). Property (iii) is about ensuring that we stop walking in the directions where we are not making good progress (there are many ways to ensure this and this precise form will make sense towards the end of the analysis). Property (iv) is closely related, and while it might be only a technical condition it can also be understood as ensuring that locally one is not increasing the softmax of the constraints, indeed (iv) exactly says that one should move orthogonally to the gradient of \log(\sum_{i=1}^n \exp(x \cdot v_i)).

The analysis

Let Z_t = \sum_{i=1}^n w_t(i). Note that since z_t is on the sphere and \lambda_t \in [0,1] one has that |v_i \cdot (x_{t+1} - x_t)| \leq 1. Thus using \exp(x) \leq 1 + x + x^2 for x \in [-1,1], as well as property (iv) (i.e., \sum_{i=1}^n w_t(i) v_i \cdot z_t = 0) and \lambda_t \in [0,1] one obtains:

    \[Z_{t+1} = \sum_{i=1}^n w_t(i) \exp(v_i \cdot (x_{t+1} - x_t)) \leq \sum_{i=1}^n w_t(i) (1 + (v_i \cdot z_t)^2) .\]

Observe now that the subspace U_t has dimension at least n/4 (say for n \geq 16) and thus by (1) and the above inequalities one gets:

    \[Z_{t+1} \leq (1+ 4/n) Z_t .\]

In particular for any t \leq 2n, Z_{t} \leq C n for some numerical constant C >0. It only remains to observe that this ensures w_{2n}(i) = O(1) for any i \in [n] (this concludes the proof since we already observed that at time 2 n at least half of the coordinates are colored). For this last implication we simply use property (iii). Indeed assume that some coordinate i satisfies at some time t \leq 2n, w_t(i) > c e for some c>0. Since each update increases the weights (multiplicatively) by at most e it means that there is a previous time (say s) where this weight was larger than c and yet it got updated, meaning that it was not in the top n/16 weights, and in particular one had Z_s \geq c n / 16 which contradicts Z_{s} \leq C n for c large enough (namely c > 16 C).

Posted in Theoretical Computer Science | 1 Comment

New journal: Mathematical Statistics and Learning

I am thrilled to announce the launch of a new journal, “Mathematical Statistics and Learning”, to be edited be the European Mathematical Society. The goal of the journal is be the natural home for the top works addressing the mathematical challenges that arise from the current data revolution (as well as breakthroughs on classical data analysis problems!). I personally wish such a journal had existed for at least a decade and I am very happy to be part of this endeavor as an associate editor. Please consider submitting your very best mathematical work in statistics and learning to this journal!

Some more details provided by Gabor Lugosi and Shahar Mendelson on behalf of the Editorial Board:

The journal is devoted to research articles of the highest quality in all aspects of Mathematical Statistics and Learning, including those studied in traditional areas of Statistics and in Machine Learning as well as in Theoretical Computer Science and Signal Processing. We believe that at this point in time there is no venue for top level mathematical publications in those areas, and our aim is to make the new journal such a venue.

The journal’s Editorial Board consists of the Editors,

Luc Devroye (McGill),

Gabor Lugosi (UPF Barcelona),

Shahar Mendelson (Technion and ANU),

Elchanan Mossel (MIT),

Mike Steele (U. Pennsylvania),

Alexandre Tsybakov (ENSAE),

Roman Vershynin (U. Michigan),


and the Associate Editors,

Sebastien Bubeck (Microsoft Research),

Andrea Montanari (Stanford),

Jelani Nelson (Harvard),

Philippe Rigollet (MIT),

Sara van de Geer (ETH – Zurich),

Ramon van Handel (Princeton),

Rachel Ward (UT – Austin).

The success of the journal depends entirely on our community;  we need your help and support in making it the success we believe it can be. We therefore ask that you consider submitting to the journal results you think are of a very high quality.

The first issue of the journal is scheduled to appear in early 2018.

Posted in Announcement | 1 Comment

STOC 2017 accepted papers

The list of accepted papers to STOC 2017 has just been released. Following the trend in recent years there are quite a few learning theory papers! I have already blogged about the kernel-based convex bandit algorithm; as well as the smoothed poly-time local max-cut (a.k.a. asynchronous Hopfield network). Some of the other learning papers that caught my attention: yet again a new viewpoint on acceleration for convex optimization; some progress on the complexity of finding stationary point on non-convex functions; a new twist on tensor decomposition for poly-time learning of latent variable models; an approximation algorithm for low-rank approximation in ell_1 norm; a new framework to learn from adversarial data; some progress on the trace reconstruction problem (amazingly the exact same result was discovered independently by two teams, see here and here); new sampling technique for graphical models; new relevant statistical physics result; faster submodular minimization;  and finally some new results on nearest neighbor search.

Posted in Conference/workshop, Theoretical Computer Science | 2 Comments

Guest post by Miklos Racz: Confidence sets for the root in uniform and preferential attachment trees

In the final post of this series (see here for the previous posts) we consider yet another point of view for understanding networks. In the previous posts we studied random graph models with community structure and also models with an underlying geometry. While these models are important and lead to fascinating problems, they are also static in time. Many real-world networks are constantly evolving, and their understanding requires models that reflect this. This point of view brings about a host of new interesting and challenging statistical inference questions that concern the temporal dynamics of these networks. In this post we study such questions for two canonical models of randomly growing trees: uniform attachment and preferential attachment.

Models of growing graphs

A natural general model of randomly growing graphs can be defined as follows. For n \geq k \geq 1 and a graph S on k vertices, define the random graph G(n,S) by induction. First, set G(k,S) = S; we call S the seed of the graph evolution process. Then, given G(n,S), G(n+1, S) is formed from G(n,S) by adding a new vertex and some new edges according to some adaptive rule. If S is a single vertex, we write simply G(n) instead of G(n,S).

There are several rules one can consider; here we study perhaps the two most natural ones: uniform attachment and preferential attachment (denoted \mathrm{UA} and \mathrm{PA} in the following). Moreover, for simplicity we focus on the case of growing trees, where at every time step a single edge is added. Uniform attachment trees are defined recursively as follows: given \mathrm{UA}(n,S), \mathrm{UA}(n+1,S) is formed from \mathrm{UA}(n,S) by adding a new vertex u and adding a new edge uv where the vertex v is chosen uniformly at random among vertices of \mathrm{UA} \left( n, S \right), independently of all past choices. Preferential attachment trees are defined similarly, except that v is chosen with probability proportional to its degree:

    \[ \mathbb{P}\left( v = i \, \middle| \, \mathrm{PA}(n, S) \right) = \frac{d_{\mathrm{PA}(n, S)}(i)}{2 \left( n - 1 \right)}, \]

where for a tree T we denote by d_{T} (u) the degree of vertex u in T.

Questions: detection and estimation

The most basic questions to consider are those of detection and estimation. Can one detect the influence of the initial seed graph? If so, is it possible to estimate the seed? Can one find the root if the process was started from a single node? We introduce these questions in the general model of randomly growing graphs described above, even though we study them in the special cases of uniform and preferential attachment trees later.

The detection question can be rephrased in the terminology of hypothesis testing. Given two potential seed graphs S and T, and an observation R which is a graph on n vertices, one wishes to test whether R \sim G(n, S) or R \sim G(n, T). The question then boils down to whether one can design a test with asymptotically (in n) nonnegligible power. This is equivalent to studying the total variation distance between G(n, S) and G(n, T), so we naturally define

    \begin{equation*} \delta(S, T) := \lim_{n \to \infty} \mathrm{TV}(G(n, S), G(n, T)), \end{equation*}

where G(n,S) and G(n,T) are random elements in the finite space of unlabeled graphs with n vertices. This limit is well-defined because \mathrm{TV}(G(n, S), G(n, T)) is nonincreasing in n (since if G(n,S) = G(n,T), then the evolution of the random graphs can be coupled such that G(n', S) = G(n', T) for all n' \geq n) and always nonnegative.

If the seed has an influence, it is natural to ask whether one can estimate S from G(n,S) for large n. If so, can the subgraph corresponding to the seed be located in G(n,S)? We study this latter question in the simple case when the process starts from a single vertex called the root. (In the case of preferential attachment, starting from a single vertex is not well-defined; in this case we start the process from a single edge and the goal is to find one of its endpoints.) A root-finding algorithm is defined as follows. Given G(n) and a target accuracy \epsilon \in (0,1), a root-finding algorithm outputs a set H\left( G(n), \epsilon \right) of K(\epsilon) vertices such that the root is in H\left( G(n), \epsilon \right) with probability at least 1-\epsilon (with respect to the random generation of G(n)).

An important aspect of this definition is that the size of the output set is allowed to depend on \epsilon, but not on the size n of the input graph. Therefore it is not clear that root-finding algorithms exist at all. Indeed, there are examples when they do not exist: consider a path that grows by picking one of its two ends at random and extending it by a single edge. However, it turns out that in many interesting cases root-finding algorithms do exist. In such cases it is natural to ask for the best possible value of K(\epsilon).

The influence of the seed

Consider distinguishing between a PA tree started from a star with 10 vertices, S_{10}, and a PA tree started from a path with 10 vertices, P_{10}. Since the preferential attachment mechanism incorporates the rich-get-richer phenomenon, one expects the degree of the center of the star in \mathrm{PA}(n,S_{10}) to be significantly larger than the degree of any of the initial vertices in the path in \mathrm{PA}(n,P_{10}). This intuition guided Bubeck, Mossel, and Racz when they initiated the theoretical study of the influence of the seed in PA trees. They showed that this intuition is correct: the limiting distribution of the maximum degree of the PA tree indeed depends on the seed. Using this they were able to show that for any two seeds S and T with at least 3 vertices and different degree profiles we have

    \[\delta_{\mathrm{PA}} (S,T) > 0.\]

However, statistics based solely on degrees cannot distinguish all pairs of nonisomorphic seeds. This is because if S and T have the same degree profiles, then it is possible to couple \mathrm{PA}(n,S) and \mathrm{PA}(n,T) such that they have the same degree profiles for every n. In order to distinguish between such seeds, it is necessary to incorporate information about the graph structure into the statistics that are studied. This was done successfully by Curien, Duquesne, Kortchemski, and Manolescu, who analyzed statistics that measure the geometry of large degree nodes. These results can be summarized in the following theorem.

Theorem: The seed has an influence in PA trees in the following sense. For any trees S and T that are nonisomorphic and have at least 3 vertices, we have

    \[\delta_{\mathrm{PA}}(S,T) > 0.\]

In the case of uniform attachment, degrees do not play a special role, so initially one might even think that the seed has no influence in the limit. However, it turns out that the right perspective is not to look at degrees but rather the sizes of appropriate subtrees (we shall discuss such statistics later). By extending the approach of Curien et al. to deal with such statistics, Bubeck, Eldan, Mossel, and Racz showed that the seed has an influence in uniform attachment trees as well.

Theorem: The seed has an influence in UA trees in the following sense. For any trees S and T that are nonisomorphic and have at least 3 vertices, we have

    \[\delta_{\mathrm{UA}}(S,T) > 0.\]

These results, together with a lack of examples showing opposite behavior, suggest that for most models of randomly growing graphs the seed has influence.

Question: How common is the phenomenon observed in Theorems 1 and 2? Is there a natural large class of randomly growing graphs for which the seed has an influence? That is, models where for any two seeds S and T (perhaps satisfying an extra condition), we have \delta (S,T) > 0. Is there a natural model where the seed has no influence?

Finding Adam

These theorems about the influence of the seed open up the problem of finding the seed. Here we present the results of Bubeck, Devroye, and Lugosi who first studied root-finding algorithms in the case of UA and PA trees.

They showed that root-finding algorithms indeed exist for PA trees and that the size of the best confidence set is polynomial in 1/\epsilon.

Theorem: There exists a polynomial time root-finding algorithm for PA trees with

    \[K(\epsilon) \leq c \frac{\log^{2} (1/\epsilon)}{\epsilon^{4}}\]

for some finite constant c. Furthermore, there exists a positive constant c' such that any root-finding algorithm for PA trees must satisfy

    \[K(\epsilon) \geq \frac{c'}{\epsilon}.\]

They also showed the existence of root-finding algorithms for UA trees. In this model, however, there are confidence sets whose size is subpolynomial in 1/\epsilon. Moreover, the size of any confidence set has to be at least superpolylogarithmic in 1/\epsilon.

Theorem: There exists a polynomial time root-finding algorithm for UA trees with

    \[K(\epsilon) \leq \exp \left( c \tfrac{\log (1/\epsilon)}{\log \log (1/\epsilon)} \right)\]

for some finite constant c. Furthermore, there exists a positive constant c' such that any root-finding algorithm for UA trees must satisfy

    \[K(\epsilon) \geq \exp \left( c' \sqrt{\log (1/\epsilon)} \right).\]

These theorems show an interesting quantitative difference between the two models: finding the root is exponentially more difficult in PA than in UA. While this might seem counter-intuitive at first, the reason behind this can be traced back to the rich-get-richer phenomenon: the effect of a rare event where not many vertices attach to the root gets amplified by preferential attachment, making it harder to find the root.

Proofs using Polya urns

We now explain the basic ideas that go into proving Theorems 3 and 4 and prove some simpler cases. While UA and PA are arguably the most basic models of randomly growing graphs, the evolution of various simple statistics, such as degrees or subtree sizes, can be described using even simpler building blocks: Polya urns. In this post we assume familiarity with Polya urns; we refer to the lecture notes for a primer on Polya urns for the interested reader.

A root-finding algorithm based on the centroid

We start by presenting a simple root-finding algorithm for UA trees. This algorithm is not optimal, but its analysis is simple and highlights the basic ideas.

For a tree T, if we remove a vertex v \in V(T), then the tree becomes a forest consisting of disjoint subtrees of the original tree. Let \psi_{T} \left( v \right) denote the size (i.e., the number of vertices) of the largest component of this forest. A vertex v that minimizes \psi_{T} \left( v \right) is known as a centroid of T; one can show that there can be at most two centroids. We define the confidence set H_{\psi} by taking the set of K vertices with smallest \psi values.

Theorem: The centroid-based H_{\psi} defined above is a root-finding algorithm for the UA tree. More precisely, if

    \[K \geq \frac{5}{2} \frac{\log \left( 1 / \epsilon \right)}{\epsilon},\]


    \[ \liminf_{n \to \infty} \mathbb{P} \left( 1 \in H_{\psi} \left( \mathrm{UA} \left( n \right)^{\circ} \right) \right) \geq 1 - \frac{4\epsilon}{1-\epsilon}, \]

where 1 denotes the root, and \mathrm{UA} \left( n \right)^{\circ} denotes the unlabeled version of \mathrm{UA} \left( n \right).

Proof: We label the vertices of the UA tree in chronological order. We start by introducing some notation that is useful throughout the proof. For 1 \leq i \leq k, denote by T_{i,k} the tree containing vertex i in the forest obtained by removing in \mathrm{UA}\left( n \right) all edges between vertices \left\{ 1, \dots, k \right\}. See the figure for an illustration.


Let \left| T \right| denote the size of a tree T, i.e., the number of vertices it contains. Note that the vector

    \[\left( \left| T_{1, k} \right|, \dots, \left| T_{k,k} \right| \right)\]

evolves according to the classical P\’olya urn with k colors, with initial state \left( 1, \dots, 1 \right). Therefore the normalized vector

    \[\left( \left| T_{1, k} \right|, \dots, \left| T_{k,k} \right| \right) / n\]

converges in distribution to a Dirichlet distribution with parameters \left( 1, \dots, 1 \right).

Now observe that

    \[ \mathbb{P} \left( 1 \notin H_{\psi} \right) \leq \mathbb{P} \left( \exists i > K : \psi \left( i \right) \leq \psi \left( 1 \right) \right) \leq \mathbb{P} \left( \psi \left( 1 \right) \geq \left( 1 - \epsilon \right) n \right) + \mathbb{P} \left( \exists i > K : \psi \left( i \right) \leq \left( 1 - \epsilon \right) n \right). \]

We bound the two terms appearing above separately, starting with the first one. Note that

    \[\psi \left( 1 \right) \leq \max \left\{ \left| T_{1,2} \right|, \left| T_{2,2} \right| \right\},\]

and both \left| T_{1,2} \right| / n and \left| T_{2,2} \right| / n converge in distribution to a uniform random variable in \left[ 0, 1 \right]. Hence a union bound gives us that

    \[ \limsup_{n \to \infty} \mathbb{P} \left( \psi \left( 1 \right) \geq \left( 1 - \epsilon \right) n \right) \leq 2 \lim_{n \to \infty} \mathbb{P} \left( \left| T_{1,2} \right| \geq \left( 1 - \epsilon \right) n \right) = 2 \epsilon. \]

For the other term, first observe that for any i > K we have

    \[ \psi \left( i \right) \geq \min_{1 \leq k \leq K} \sum_{j = 1, j \neq k}^{K} \left| T_{j,K} \right|. \]

Now using results on P\’olya urns we have that for every k such that 1 \leq k \leq K, the random variable

    \[\frac{1}{n} \sum_{j = 1, j \neq k}^{K} \left| T_{j,K} \right|\]

converges in distribution to the \mathrm{Beta} \left( K - 1, 1 \right) distribution. Hence by a union bound we have that

    \begin{align*} \limsup_{n \to \infty} \mathbb{P} \left( \exists i > K : \psi \left( i \right) \leq \left( 1 - \epsilon \right) n \right) &\leq \lim_{n \to \infty} \mathbb{P} \left( \exists 1 \leq k \leq K : \sum_{j = 1, j \neq k}^{K} \left| T_{j,K} \right| \leq \left( 1 - \epsilon \right) n \right) \\ &\leq K \left( 1 - \epsilon \right)^{K - 1}. \end{align*}

Putting together the two bounds gives that

    \[ \limsup_{n \to \infty} \mathbb{P} \left( 1 \notin H_{\psi} \right) \leq 2 \epsilon + K \left( 1 - \epsilon \right)^{K-1}, \]

which concludes the proof due to the assumption on K.



The same estimator H_{\psi} works for the preferential attachment tree as well, if one takes

    \[K \geq C \frac{\log^{2} \left( 1 / \epsilon \right)}{\epsilon^{4}}\]

for some positive constant C. The proof mirrors the one above, but involves a few additional steps; we refer to Bubeck et al. for details.

For uniform attachment the bound on K given by Theorem 5 is not optimal. It turns out that it is possible to write down the maximum likelihood estimator (MLE) for the root in the UA model; we do not do so here, see Bubeck et al. One can view the estimator H_{\psi} based on the centroid as a certain “relaxation” of the MLE. By constructing a certain “tighter” relaxation of the MLE, one can obtain a confidence set with size subpolynomial in 1/\epsilon as described in Theorem 4. The analysis of this is the most technical part of Bubeck et al. and we refer to this paper for more details.

Lower bounds

As mentioned above, the MLE for the root can be written down explicitly. This aids in showing a lower bound on the size of a confidence set. In particular, Bubeck et al. define a set of trees whose probability of occurrence under the UA model is not too small, yet the MLE provably fails, giving the lower bound described in Theorem 4. We refer to Bubeck et al. for details.

On the other hand, for the PA model it is not necessary to use the structure of the MLE to obtain a lower bound. A simple symmetry argument suffices to show the lower bound in Theorem 3, which we now sketch.

First observe that the probability of error for the optimal procedure is non-decreasing with n, since otherwise one could simulate the process to obtain a better estimate. Thus it suffices to show that the optimal procedure must have a probability of error of at least \epsilon for some finite n. We show that there is some finite n such that with probability at least 2\epsilon, the root is isomorphic to at least 2c / \epsilon vertices in \mathrm{PA}(n). Thus if a procedure outputs at most c/\epsilon vertices, then it must make an error at least half the time (so with probability at least \epsilon).

Observe that the probability that the root is a leaf in \mathrm{PA}(n) is

    \[\frac{1}{2} \times \frac{3}{4} \times \dots \times \left( 1 - \frac{1}{2n} \right) = \Theta \left( 1 / \sqrt{n} \right).\]

By choosing n = \Theta \left( 1 / \epsilon^{2} \right), this happens with probability \Theta \left( \epsilon \right). Furthermore, conditioned on the root being a leaf, with constant probability vertex 2 is connected to \Theta \left( \sqrt{n} \right) = \Theta \left( 1 / \epsilon \right) leaves, which are then isomorphic to the root.

Open problems

There are many open problems and further directions that one can pursue; the four papers we have discussed contain 20 open problems and conjectures alone, and we urge the reader to have a look and try to solve them!

Posted in Probability theory, Random graphs | Leave a comment

Geometry of linearized neural networks

This week we had the pleasure to host Tengyu Ma from Princeton University who told us about the recent progress he has made with co-authors to understand various linearized versions of neural networks. I will describe here two such results, one for Residual Neural Networks and one for Recurrent Neural Networks.

Some properties to look for in non-convex optimization

We will say that a function f admits first order optimality (respectively second order optimality) if all critical points (respectively all local minima) of f are global minima (of course first order optimality implies second order optimality for smooth functions). In particular with first order optimality one has that gradient descent converges to the global minimum, and with second order optimality this is also true provided that one avoids saddle points. To obtain rates of convergence it can be useful to make more quantitative statements. For example we say that f is \alpha-Polyak if

    \[\|\nabla f(x)\|^2 \geq \alpha (f(x) - f^*) .\]

Clearly \alpha-Polyak implies first order optimality, but more importantly it also implies linear convergence rate for gradient descent on f. A variant of this condition is \alpha-weak-quasi-convexity:

    \[\langle \nabla f(x), x-x^*\rangle \geq \alpha (f(x) - f^*) ,\]

in which case gradient descent converges at the slow non-smooth rate 1/\sqrt{t} (and in this case it is also robust to noise, i.e. one can write a stochastic gradient descent version). The proofs of these statements just mimic the usual convex proofs. For more on these conditions see for instance this paper.

Linearized Residual Networks

Recall that a neural network is just a map x \in \mathbb{R}^n \mapsto \sigma \circ A_L \circ \sigma A_{L-1} \hdots \sigma \circ A_0 (x) where A_0,\hdots, A_{L} are linear maps (i.e. they are the matrices parametrizing the neural network) and \sigma is some non-linear map (the most popular one, ReLu, is the just the coordinate-wise positive part). Alternatively you can think of a neural network as a sequence of hidden states h_0, h_1,\hdots,h_L where h_0=x and h_{t+1} = \sigma(A_t h_{t}). In 2015 a team of researcher at MSR Asia introduced the concept of a residual neural network where the hidden states are now updated as before for t even but for t odd we set h_{t+1} = h_{t-1} + \sigma(A_t h_t). Apparently this trick allowed them to train much deeper networks, though it is not clear why this would help from a theoretical point of view (the intuition is that at least when the network is initialized with all matrices being 0 it still does something non-trivial, namely it computes the identity).

In their most recent paper Moritz Hardt and Tengyu Ma try to explain why adding this “identity connection” could be a good idea from a geometric point of view. They consider an (extremely) simplified model where there is no non-linearity, i.e. \sigma is the identity map. A neural network is then just a product of matrices. In particular the landscape we are looking at for least-squares with such a model is of the form:

    \[g(A_0,\hdots, A_L) = \mathbb{E}_{(x,y) \sim \nu} \|\prod_{i=0}^L A_i x - y\|^2 ,\]

which is of course a non-convex function (just think of the function (a,b) \mapsto (ab-1)^2 and observe that on the segment a=b it gives the non-convex function (a^2-1)^2). However it actually satisfies the second-order optimality condition:

Proposition [Kawaguchi 2016]

Assume that x has a full rank covariance matrix and that y=Rx for some deterministic matrix R. Then all local minima of g are global minima.

I won’t give the proof of this result as it requires to take the second derivative of g which is a bit annoying (I will give below the proof of the first derivative). Now in this linearized setting the residual network version (where the identity connection is added at every layer) corresponds simply to a reparametrization around the identity, in other words we consider now the following function:

    \[f(A_0,\hdots, A_L) = \mathbb{E}_{(x,y) \sim \nu} \|\prod_{i=0}^L (A_i+\mathrm{I}) x - y\|^2 .\]

Proposition [Hardt and Ma 2016]

Assume that x has a full rank covariance matrix and that y= Rx for some deterministic matrix R. Then f has first order optimality on the set \{(A_0, \hdots, A_L) : \|A_i\| <1\}.

Thus adding the identity connection makes the objective function better behave around the starting points with all-zeros matrices (in the sense that gradient descent doesn’t have to worry about avoiding saddle points). The proof is just a few lines of standard calculations to take derivatives of functions with matrix-valued inputs.

Proof: One has with E = R - \prod_{i=0}^L (A_i+\mathrm{I}) and \Sigma = \E x x^{\top},

    \[f = \E (E x)^{\top} Ex = \mathrm{Tr}(E \Sigma E^{\top}) =: \|E\|_{\Sigma}^2 ,\]

so with E_{<i} = \prod_{j <i} (A_j + \mathrm{I}) and E_{>i}=\prod_{j >i} (A_j + \mathrm{I}),

    \begin{eqnarray*} f(A_0,\hdots, A_i + V, \hdots, A_L) & = & \|R - \prod_{j <i} (A_j + \mathrm{I}) \times (A_i + V +\mathrm{I}) \times \prod_{j >i} (A_j + \mathrm{I})\|_{\Sigma}^2 \\ & = & \|E + E_{<i} V E_{>i}\|_{\Sigma}^2 \\ & = & \|E\|_{\Sigma}^2 + 2 \langle \Sigma E, E_{<i} V E_{>i} \rangle + \|E_{<i} V E_{>i}\|_{\Sigma}^2 , \end{eqnarray*}

which exactly means that the derivative of f with respect to A_i is equal to E_{>i}^{\top} \Sigma E E_{<i}^{\top}. On the set under consideration one has that E_{>i} and E_{<i} are invertible (and so is \Sigma by assumption), and thus if this derivative is equal to 0 it muts be that E=0 and thus f=0 (which is the global minimum).

Linearized recurrent neural networks

The simplest version of a recurrent neural network is as follows. It is a mapping of the form (x_1,\hdots,x_T) \mapsto (y_1,\hdots,y_T) (we are thinking of doing sequence to sequence prediction). In these networks the hidden state is updated as h_{t+1} = \sigma_1(A h_{t} + B x_{t}) (with h_1=0) and the output is y_t = \sigma_2(C h_t + D x_t). I will now describe a paper by Hardt, Ma and Recht (see also this blog post) that tries to understand the geometry of least-squares for this problem in the linearized version where \sigma_1 = \sigma_2 = \mathrm{Id}. That is we are looking at the function:

    \[f(\hat{A}, \hat{B}, \hat{C}, \hat{D})=\mathbb{E}_{(x_t)} \frac{1}{T} \sum_{t=1}^T \|\hat{y}_t - y_t\|^2 ,\]

where (y_t) is obtained from (x_t) via some unknown recurrent neural network with parameters A,B,C,D. First observe that by induction one can easily see that h_{t+1} = \sum_{k=1}^t A^{t-k} B x_k and y_t = D x_t + \sum_{k=1}^{t-1} C A^{t-1-k} B x_k. In particular, assuming that (x_t) is an i.i.d. isotropic sequence one obtains

    \[\E \|y_t - \hat{y}_t\|_2^2 = \|D-\hat{D}\|_F^2 + \sum_{k=1}^{t-1} \|\hat{C} \hat{A}^{t-1-k} \hat{B} - C A^{t-1-k} B\|_F^2 ,\]

and thus

    \[f(\hat{A}, \hat{B}, \hat{C}, \hat{D})=\|D-\hat{D}\|_F^2 + \sum_{k=1}^{T-1} (1- \frac{k}{T}) \|\hat{C} \hat{A}^{k-1} \hat{B} - C A^{k-1} B\|_F^2 .\]

In particular we see that the effect of \hat{D} is decoupled from the other variables and that is appears as a convex function, thus we will just ignore it. Next we make the natural assumption that the spectral radius of A is less than 1 (for otherwise the influence of the initial input x_1 is growing over time which doesn’t seem natural) and thus up to some small error term (for large T) one can consider the idealized risk:

    \[(\hat{A},\hat{B}, \hat{C}) \mapsto \sum_{k=0}^{+\infty} \|\hat{C} \hat{A}^{k} \hat{B} - C A^{k} B\|_F^2 .\]

The next idea is a cute one which makes the above expression more tractable. Consider the series r_k = C A^k B and its Fourier transform:

    \[G(\theta) = \sum_{k=0}^{+\infty} r_k \exp(i k \theta) = C (\sum_{k=0}^{+\infty} (\exp(i \theta) A)^k) B = C(\mathrm{I} - \exp(i \theta) A)^{-1} B .\]

By Parseval’s theorem the idealized risk is equal to the L_2 distance between G and \hat{G} (i.e. \int_{[-\pi, \pi]} \|G(\theta)-\hat{G}(\theta)\|_F^2 d\theta). We will now show that under appropriate further assumptions, for any \theta that \|G(\theta) - \hat{G}(\theta)\|_F^2 is weakly-quasi-convex in (\hat{A},\hat{B},\hat{C}) (in particular this shows that the idealized risk is weakly-quasi-convex). The big assumption that Hardt, Ma and Recht make is that the system is a “single-input single-output” model, that is both x_t and y_t are scalar. In this case it turns out that control theory shows that there is a “canonical controlable form” where B=(0,\hdots,0,1), C=(c_1,\hdots,c_n) and A has zeros everywhere except on the upper diagonal where it has ones and on the last row where it has a_n,\hdots, a_1 (I don’t know the proof of this result, if some reader has a pointer for a simple proof please share in the comments!). Note that with this form the system is simple to interpret as one has Ah = (h(2),\hdots, h(n-1), \langle a,h\rangle) and Ah+Bx =(h(2),\hdots, h(n-1), \langle a,h\rangle+x). Now with just a few lines of algebra:

    \[G(\theta) = \frac{c_1 + \hdots + c_{n} z^{n-1}}{z^n + a_1 z^{n-1} + \hdots + a_n}, \ \text{where} \ z=\exp(i \theta) .\]

Thus we are just asking to check the weak-quasi-convexity of

    \[(\hat{a}, \hat{c}) \mapsto |\frac{\hat{c}_1 + \hdots + \hat{c}_{n} z^{n-1}}{z^n + \hat{a}_1 z^{n-1} + \hdots + \hat{a}_n} - \frac{u}{v}|^2 .\]

Weak-quasi-convexity is preserved by linear functions, so we just need to understand the map

    \[(\hat{u},\hat{v}) \in \mathbb{C} \times \mathbb{C} \mapsto |\frac{\hat{u}}{\hat{v}} - \frac{u}{v}|^2 ,\]

which is weak-quasi-convex provided that \hat{v} has a positive inner product with v. In particular we just proved the following:

Theorem [Hardt, Ma, Recht 2016]

Let C(a) := \{z^n + a_1 z^{n-1} + \hdots + a_n , z \in \mathbb{C}, |z|=1\} and assume there is some cone \mathcal{C} \subset \mathbb{C}^2 of angle less than \pi/2-\alpha such that C(a) \subset \mathcal{C}. Then the idealized risk is \alpha-weakly-quasi-convex on the set of \hat{a} such that C(\hat{a}) \subset \mathcal{C}.

(In the paper they specifically pick the cone \mathcal{C} where the imaginary part is larger than the real part.) This theorem naturally suggests that by overparametrizing the network (i.e. adding dimensions to a and c) one could have a nicer landscape (indeed in this case the above condition can be easier to check), see the paper for more details!

Posted in Machine learning, Optimization | 2 Comments

Local max-cut in smoothed polynomial time

Omer Angel, Yuval Peres, Fan Wei, and myself have just posted to the arXiv our paper showing that local max-cut is in smoothed polynomial time. In this post I briefly explain what is the problem, and I give a short proof of the previous state of the art result on this problem, which was a paper by Etscheid and Roglin showing that local max-cut is in quasi-polynomial time.

Local max-cut and smoothed analysis

Let G = (V,E) be a connected graph with n vertices and w: E\rightarrow [-1,1] be an edge weight function. The local max-cut problem asks to find a partition of the vertices \sigma: V\rightarrow \{-1,1\} whose total cut weight

    \[\frac12 \sum_{uv \in E} w(uv) \big(1-\sigma(u)\sigma(v)\big) ,\]

is locally maximal, in the sense that one cannot increase the cut weight by changing the value of \sigma at a single vertex (recall that actually finding the global maximum is NP-hard). See the papers linked to above for motivation on this problem.

There is a simple local search algorithm for this problem, sometimes referred to as “FLIP”: start from some initial \sigma_0 and iteratively flip vertices (i.e. change the sign of \sigma at a vertex) to improve the cut weight until reaching a local maximum. It is easy to build instances (\sigma_0, w) where FLIP takes exponential time, however in “practice” it seems that FLIP always converges quickly. This motivates the smoothed analysis of FLIP, that is we want to understand what is the typical number of steps for FLIP when the edge weights is perturbed by a small amount of noise. Formally we now assume that the weight on edge e \in E is given by a random variable X_e \in [-1,1] which has a density with respect to the Lebesgue measure bounded from above by \phi (for example this forbids X_e to be too close to a point mass). We assume that these random variables are independent.

Theorem(Etscheid and Roglin [2014]): With probability 1-o(1) FLIP terminates in O((\phi^c n^{c \log(n)}) steps for some universal constant c>0.

We improve this result from quasi-polynomial to polynomial, assuming that we put some noise on the interaction between every pair of vertices, or in other words assuming that the graph is complete.

Theorem(Angel, Bubeck, Peres, Wei [2016]): Let G be complete. With probability 1-o(1) FLIP terminates in O(\phi^5 n^{15.1}) steps.

I will now prove the Etscheid and Roglin result.

Proof strategy

To simplify notation let us introduce the Hamiltonian

    \[H(\sigma) =-\frac{1}{2} \sum_{uv \in E} X_{uv} \sigma(u) \sigma(v).\]

We want to find a local max of H. For any \sigma \in \{-1,1\}^V and v \in V, we denote by \sigma^{-v} the state equal to \sigma except for the coordinate corresponding to v which is flipped. For such \sigma,v there exists a vector \alpha = \alpha(\sigma,v) \in \{-1,0,1\}^E such that

    \[H(\sigma^{-v}) = H(\sigma) + \langle \alpha, X \rangle .\]

More specifically \alpha=(\alpha_{uw})_{uw \in E} is defined by

    \[\left\{\begin{array}{ll} \alpha_{uv} = \sigma(v) \sigma(u) & \forall u \neq v \\ \alpha_{uw} = 0 & \text{if} \ v \not\in \{u,w\} \end{array}\right.\]

We say that flipping a vertex is a move, and it is an improving move if the value of H strictly improves. We say that a sequence of moves L=(v_1,\hdots, v_{\ell}) is \epsilon-slow from \sigma_0 if

    \[H(\sigma_{i}) -  H(\sigma_{i-1}) \in (0, \epsilon], \forall i \in [\ell].\]

It is sufficient to show that with high probability there is no \epsilon-slow sequence with say \ell=2n and \epsilon = 1/(\phi n^{\log(n)})^c (indeed in this case after 2n \times (\phi n^{\log(n)})^c \times n^2 steps FLIP must have stopped, for otherwise the H value would exceed the maximal possible value of n^2). We will do this in in three main steps, a probability step, a linear algebra step, and a combinatorial step.

Probability step

Lemma: Let \alpha_1, \hdots, \alpha_k be k linearly independent vectors in \mathbb{Z}^E. Then one has

    \[\mathbb{P}(\forall i \in [k], \langle \alpha_i, X \rangle \in (0,\epsilon]) \leq (\phi \epsilon)^{k} .\]

Proof: The inequality follows from a simple change of variables. Let A \in \mathbb{Z}^{|E| \times E} be a full rank matrix whose first k rows are \alpha_1, \hdots, \alpha_k and it is completed so that A is the identity on the subspace orthogonal to \alpha_1, \hdots, \alpha_k. Let g be the density of A X, and f=\prod_{i=1}^{|E|} f_i the density of X. One has g(y) = |\mathrm{det}(A^{-1})| f(A^{-1} y)| and the key observation is that since A has integer coefficients, its determinant must be an integer too, and since it is non-zero one has |\mathrm{det}(A^{-1})|=|\mathrm{det}(A)^{-1}| \leq 1 . Thus one gets:

    \begin{align*} \mathbb{P}(\forall i \in [k], \langle \alpha_i, X \rangle \in (0,\epsilon]) & = \int_{(0,\epsilon]^k \times \mathbb{R}^{|E|-k}} g(y) dy \\ & \leq \int_{(0,\epsilon]^k \times \mathbb{R}^{|E|-k}} \prod_{i=1}^{|E|} f_i((A^{-1} y)_i) dy \\ & \leq (\phi \epsilon)^k \int_{\mathbb{R}^{|E|-k}} \prod_{i=k+1}^{|E|} f_i(y_i) dy = (\phi \epsilon)^k. \end{align*}

Linear algebra step

Lemma: Consider a sequence of \ell improving moves with k distinct vertices (say v_1, \hdots, v_k) that repeat at least twice in this sequence. Let \alpha_1, \hdots, \alpha_{\ell} \in \{-1,0,1\}^E be the corresponding move coefficients, and for each i \in [k] let s_i (respectively t_i) be the first (respectively second) time at which v_i moves. Then the vectors \tilde{\alpha}_i = \alpha_{s_i} + \alpha_{t_i} \in \mathbb{Z}^E, i \in [k], are linearly independent. Furthermore for any v that did not move between the times s_i and t_i one has \tilde{\alpha}_i(vv_i) = 0 (and for any e \not\ni v_i, \tilde{\alpha}_i(e) = 0).

Proof: The last sentence of the lemma is obvious. For the linear independence let \lambda \in \R^k be such that \sum_{i=1}^k \lambda_i \tilde{\alpha}_i = 0. Consider a new graph H with vertex set [k] and such that i is connected to j if v_i appears an odd number of times between the times s_j and t_j. This defines an oriented graph, however if i is connected to j but j is not connected to i then one has \tilde{\alpha}_j(v_iv_j) \in \{-2,2\} while \tilde{\alpha}_i(v_iv_j)=0 (and furthermore for any m \not\in \{i,j\}, \tilde{\alpha}_m(v_iv_j)=0) and thus \lambda_j=0. In other words we can consider a subset of [k] where H is an undirected graph, and outside of this subset \lambda is identically zero. To reduce notation we simply assume that H is undirected. Next we observe that if i and j are connected then one must have \tilde{\alpha}_j(v_iv_j)= - \tilde{\alpha}_i(v_iv_j) (this uses the fact that we look at the {\em first} consecutive times at which the vertices move) and in particular (again using that for any m \not\in \{i,j\}, \tilde{\alpha}_m(v_iv_j)=0) one must have \lambda_i=\lambda_j. Now let C be some connected component of H, and let \lambda_C be the unique value of \lambda on C. Noting that the \tilde{\alpha}‘s corresponding to different components of H have difference support (more precisely with C_E := \cup_{j \in C} \{e \ni j\} one has for any i \in C, \tilde{\alpha}_i |_{C_E^c} = 0 and for any i \not\in C, \tilde{\alpha}_i |_{C_E}=0) one obtains \lambda_C \sum_{i \in C} \tilde{\alpha}_i = 0. On the other hand since the sequence of moves is improving one must have \sum_{i \in C} \tilde{\alpha}_i \neq 0, which implies \lambda_C = 0 and finally \lambda = 0 (thus concluding the proof of the linear independence).

Combinatorial step

Lemma: Let v_1, \hdots, v_{2n} \in V. There exists \ell \in \mathbb{N} and i \in [2n - \ell] such that the number of vertices that repeat at least twice in the segment v_i, \hdots, v_{i + \ell} is at least \ell / (2 \log_2(n)).

Proof (from ABPW): Define the surplus of a sequence to be the difference between the number of elements and the number of distinct elements in the sequence. Let s_{\ell} be the maximum surplus in any segment of length \ell in v_1, \hdots, v_{2n}. Observe that s_{2n} \geq n. Let us now assume that for any segment of length \ell, the number of vertices that repeat at least twice is at most \epsilon \ell. Then one has by induction

    \[s_{2n} \leq 2 s_{n} + \epsilon 2n \leq 2 \epsilon n \log_2(n) .\]

This shows that \epsilon has to be greater than 1/(2 \log_2(n)) which concludes the proof.

Putting things together

We want to show that with \epsilon = 1/(\phi n^{\log(n)})^c one has

    \[\mathbb{P}(\exists (\sigma_0, L) \; \epsilon-\text{slow and} \; |L| = 2n) = o(1) .\]

By the combinatorial lemma we know that it is enough to show that:

    \begin{align*} & \sum_{\ell=1}^{2n} \mathbb{P}(\exists (\sigma_0, L) \; \epsilon-\text{slow}, \; |L| = \ell, \; \text{and} \\ & \text{L has at least} \; \ell/(2 \log_2(n)) \; \text{repeating vertices}) = o(1) . \end{align*}

Now using the probability lemma together with the linear algebra lemma (and observing that critically \tilde{\alpha} only depends on the value of \sigma at the vertices in L, and thus the union bound over \sigma_0 only gives a 2^{\ell} factor instead of 2^{n}) one obtains that the above probability is bounded by

    \[2^{\ell} n^{\ell} (\phi \epsilon)^{\ell/(2 \log_2(n))} ,\]

which concludes the proof of the Etscheid and Roglin result.

Note that a natural route to get a polynomial-time bound from the above proof would be to remove the \log(n) term in the combinatorial lemma but we show in our paper that this is impossible. Our result comes essentially from improvements to the linear algebra step (this is more difficult as the Etscheid and Roglin linear algebra lemma is particularly friendly for the union bound step, so we had to find another way to do the union bound).

Posted in Optimization, Probability theory | 2 Comments