Komlos conjecture, Gaussian correlation conjecture, and a bit of machine learning

Today I would like to talk (somewhat indirectly) about a beautiful COLT 2014 paper by Nick Harvey and Samira Samadi. The problem studied in this paper goes as follows: imagine that you have a bunch of data points in \mathbb{R}^n with a certain mean value \mu. Your goal is to generate a sequence out these data points, such that the partial averages of this new sequence are as close as possible to \mu. The “herding” procedure of Max Welling has the property that the average of the t^{th} first terms in the new sequence is at distance at most O(1/t) from \mu (this was shown by Francis Bach, Simon Lacoste-Julien and Guillaume Obozinski by viewing the herding procedure as a simple Frank-Wolfe). In their paper Nick and Samira improves the dependency on the dimension n in the O notation by using a new algorithm inspired by the theory of discrepancy. In the rest of this post I would like to discuss the basic notions of this theory. What follows is a transcription of what Ronen Eldan (a postdoc in the theory group at MSR) explained to me (of course all errors are mine). I will use c and C to denote absolute numerical constants that take whatever value I need them to take.

Discrepancy theory

Consider a large finite set of vertices V, and a size-n set \mathcal{P} of subsets of vertices (in other words (V, \mathcal{P}) is an hypergraph). For sake of simplicity we will assume here that elements of \mathcal{P} are also of size n. We are now interested in the following combinatorial optimization problem:

    \begin{align*} & \qquad \;\; \mathrm{min.} \qquad \;\;\max_{A \in \mathcal{P}} \left| \sum_{v \in A} f(v) \right| \\ & f : V \rightarrow \{-1,1\} . \end{align*}

In words, one is looking for a partitioning f of the vertices, so as to have the hyperedges A \in \mathcal{P} as balanced as possible.

Alright, so let’s try something simple, just randomly partition the vertices: clearly a simple union bounds shows that the obtained objective value in this case is of order \sqrt{n \log n}. Denote D (for Discrepancy) the smallest objective value (in particular we just saw that D \leq C \sqrt{n \log n}). A beautiful result of Joel Spencer shows that one can always do much better than random partitioning:

Theorem (Spencer’s six deviations, 1985): D \leq 6 \sqrt{n}.

We will now see a splendid (partial) proof of this result. Let \gamma be the Gaussian measure in \mathbb{R}^d (in the following we will take d=|V|). First let me remind you of a basic fact about this measure, and a related conjecture. A set of the form S_{v,t} = \{x : |v^{\top} x| \leq t\}, for some v \in \mathbb{R}^n, t >0, is called a slab.

Theorem (Sidak, 1967): Let A be a symmetric convex set and B be a slab, then

    \[\gamma(A \cap B) \geq \gamma(A) \gamma(B) .\]

Obviously there is equality if A and B are two orthogonal slabs, and the theorem says (in particular) that an angle between the slabs would imply positive correlation.

Conjecture (Gaussian correlation): Let A and B be two symmetric convex sets, then

    \[\gamma(A \cap B) \geq \gamma(A) \gamma(B) .\]

Ok, now back to the proof of Spencer’s theorem. First, let’s rewrite things suggestively in terms of the slabs S_A := S_{\frac{\mathds{1}_A}{\sqrt{n}}, C}, A \in \mathcal{P}. We want to show that there exists f \in \{-1,1\}^d such that f \in \cap_{A \in \mathcal{P}} S_A. Using Sidak’s theorem one clearly has \gamma(\cap_{A \in \mathcal{P}} S_A) \geq c^d (with c being roughly like 1-2 \exp(-C^2/2)). From there it is now a fairly short route to show that, if c \geq 0.99 (say), then the intersection of \{-1,1\}^d with \cap_{A \in \mathcal{P}} S_A must be non-empty. Ronen showed me a non-trivial (and algorithmically non-efficient) 5 lines argument for this but I think that it’s yet unpublished work so I won’t reproduce it here. Another proof of this, which yields an efficient algorithm for finding a point in the intersection, is given in this very nice paper of Lovett and Meka.


In what we discussed so far the underlying hypergraph was completely arbitrary. For some specific graphs one can clearly improve the bound on D. Indeed if all hyperedges are disjoint, then D=0. More interestingly, Beck and Fiala showed in 1981 that, if each vertex appears in at most t hyperedges, then D \leq 2 t. They also conjectured that the correct bound should be O(\sqrt{t}). In 1998 Banaszczyk showed a bound of O(\sqrt{t \log(n)}) using the following theorem (which is a variant of the result mentioned above):

Theorem (Banaszczyk, 1998): Let A be a convex set such that \gamma(A) \geq 1/2, and u_1, \hdots u_m \in \mathbb{R}^d such that \|u_i\|_2 \leq C. Then there exists f \in \{-1,1\}^m such that \sum_{i=1}^m f_i u_i \in A.

The longstanding Komlos’ conjecture says that in fact this holds for A=[-1,1]^n (despite the fact that the Gaussian measure of this set is c^n).

Finally let me mention another open problem, which is the matrix version of Spencer’s six deviations, see this blog post by Raghu Meka for details.


Posted in Optimization, Probability theory, Theoretical Computer Science | 5 Comments

A zest of number theory

I just encountered an amazing number theoretic result. It is probably very well known, but for those who never saw it it’s quite something, so I thought I would share it.

Let n be a positive integer. A partition of n is a multiset of positive integers such that the sum of its elements is equal to n (for example \{1,1,2\} is a partition of 4). The partition function p(n) is defined as the number of partitions of n. The amazing result that I want to tell you about is the formula for the asymptotic equivalent of p(n). Before giving away the formula let’s try to get some intuition about the order of magnitude one can expect. For this let me first look at the simpler quantity q(n), defined as the number of multiset of [n] with n elements (for example q(2)=3). With the stars and bars argument (thanks to Luc Devroye for reminding me about this elementary combinatorial technique!) one can easily see that q(n) \geq 2^n-1. Moreover it is also easy to see that there is an injection from multisets of [\sqrt{n}] with \sqrt{n} elements to partitions of n (observe that the sum of the elements of such a multiset is necessarily smaller than n). Thus

    \[p(n) \geq q(\sqrt{n}) \geq 2^{\sqrt{n}}-1.\]

A more involved argument, based on Young Diagram representation of partitions, shows that p(n) \leq 2^{100 \sqrt{n}} (don’t quote me on the 100, I think it’s correct but I have only rough calculations here).

Now here we go for the asymptotic formula, discovered by Hardy and Ramanujan in 1918:

    \[p(n) \sim \frac{1}{4 \sqrt{3} n} \exp\left(\pi \sqrt{2 n /3}\right) .\]

In my opinion this is a beautiful connection between \pi, e, and a basic property of integers. The proof is based on the circle method, an elementary technique in complex analysis (see here for the details).

Posted in Theoretical Computer Science | Leave a comment

Probability in high dimension

The Barcelona events have just ended, and I’m happy to report that everything went very smoothly. In my opinion the quality of the works presented at COLT and at the Foundations of Learning Theory workshop were truly outstanding. I hope to blog about the results that I found the most exciting next week.

The point of this post is to advertise the lecture notes of my friend (and colleague) Ramon van Handel. These notes cover all the basics of modern high dimensional probability. Acquaintance with this material is extremely useful in all the fields that I’m interested in (machine learning, statistics, optimization, and theoretical computer science). It is the first time that I see this material presented in such a cohesive manner, and it is an excellent companion to the book of Boucheron, Lugosi and Massart. Last but not least Ramon is an amazing mathematical writer (for instance if you want to learn about filtering I highly recommend his lecture notes on Hidden Markov Models).

Posted in Probability theory | Leave a comment

Theory of Convex Optimization for Machine Learning

I am extremely happy to release the first draft of my monograph based on the lecture notes published last year on this blog. (Comments on the draft are welcome!) The abstract reads as follows:

This monograph presents the main mathematical ideas in convex optimization. Starting from the fundamental theory of black-box optimization, the material progresses towards recent advances in structural optimization and stochastic optimization. Our presentation of black-box optimization, strongly influenced by the seminal book of Nesterov, includes the analysis of the Ellipsoid Method, as well as (accelerated) gradient descent schemes. We also pay special attention to non-Euclidean settings (relevant algorithms include Frank Wolfe, Mirror Descent, and Dual Averaging) and discuss their relevance in machine learning. We provide a gentle introduction to structural optimization with FISTA (to optimize a sum of a smooth and a simple non-smooth term), Saddle-Point Mirror Prox (Nemirovski’s alternative to Nesterov’s smoothing), and a concise description of Interior Point Methods. In stochastic optimization we discuss Stochastic Gradient Descent, mini-batches, Random Coordinate Descent, and sublinear algorithms. We also briefly touch upon convex relaxation of combinatorial problems and the use of randomness to round solutions, as well as random walks based methods.

Posted in Optimization | 13 Comments

COLT 2014 accepted papers

The accepted papers for COLT 2014 have just been posted! This year we had a record number of 140 submissions, out of which 52 were accepted (38 for a 20mn presentation and 14 for a 5mn presentation). In my opinion the program is very strong, I am looking forward to listen to all these talks in Barcelona! By the way now may be a very good time to register, the deadline for early registration is coming up fast!

Here is a list of the accepted papers together with links when I could find one (if you want me to add a link just send me an email).

COLT 2014 accepted papers

– Jacob Abernethy, Chansoo Lee, Abhinav Sinha and Ambuj Tewari. Learning with Perturbations via Gaussian Smoothing

– Alekh Agarwal, Animashree Anandkumar, Prateek Jain, Praneeth Netrapalli and Rashish Tandon. Learning Sparsely Used Overcomplete Dictionaries

– Alekh Agarwal, Ashwin Badanidiyuru, Miroslav Dudik, Robert Schapire and Aleksandrs Slivkins. Robust Multi-objective Learning with Mentor Feedback- Morteza Alamgir, –

– Ulrike von Luxburg and Gabor Lugosi. Density-preserving quantization with application to graph downsampling

– Joseph Anderson, Mikhail Belkin, Navin Goyal, Luis Rademacher and James Voss. The More, the Merrier: the Blessing of Dimensionality for Learning Large Gaussian Mixtures

– Sanjeev Arora, Rong Ge and Ankur Moitra. New Algorithms for Learning Incoherent and Overcomplete Dictionaries

– Ashwinkumar Badanidiyuru, John Langford and Aleksandrs Slivkins. Resourceful Contextual Bandits

– Shai Ben-David and Ruth Urner. The sample complexity of agnostic learning under deterministic labels

– Aditya Bhaskara, Moses Charikar and Aravindan Vijayaraghavan. Uniqueness of Tensor Decompositions with Applications to Polynomial Identifiability

– Evgeny Burnaev and Vladimir Vovk. Efficiency of conformalized ridge regression

– Karthekeyan Chandrasekaran and Richard M. Karp. Finding a most biased coin with fewest flips

– Yudong Chen, Xinyang Yi and Constantine Caramanis. A Convex Formulation for Mixed Regression: Minimax Optimal Rates

– Amit Daniely, Nati Linial and Shai Shalev-Shwartz. The complexity of learning halfspaces using generalized linear methods

– Amit Daniely and Shai Shalev-Shwartz. Optimal Learners for Multiclass Problems

– Constantinos Daskalakis and Gautam Kamath. Faster and Sample Near-Optimal Algorithms for Proper Learning Mixtures of Gaussians

– Ofer Dekel, Jian Ding, Tomer Koren and Yuval Peres. Online Learning with Composite Loss Functions Can Be Hard

– Tim van Erven, Wojciech Kotlowski and Manfred K. Warmuth. Follow the Leader with Dropout Perturbations

– Vitaly Feldman and Pravesh Kothari. Learning Coverage Functions and Private Release of Marginals

– Vitaly Feldman and David Xiao. Sample Complexity Bounds on Differentially Private Learning via Communication Complexity

– Pierre Gaillard, Gilles Stoltz and Tim van Erven. A Second-order Bound with Excess Losses

– Eyal Gofer. Higher-Order Regret Bounds with Switching Costs

– Sudipto Guha and Kamesh Munagala. Stochastic Regret Minimization via Thompson Sampling

– Moritz Hardt, Raghu Meka, Prasad Raghavendra and Benjamin Weitz. Computational Limits for Matrix Completion

– Moritz Hardt and Mary Wootters. Fast Matrix Completion Without the Condition Number

– Elad Hazan, Zohar Karnin and Raghu Meka. Volumetric Spanners: an Efficient Exploration Basis for Learning

– Prateek Jain and Sewoong Oh. Learning Mixtures of Discrete Product Distributions using Spectral Decompositions

– Kevin Jamieson, Matthew Malloy, Robert Nowak and Sebastien Bubeck. lil’ UCB: An Optimal Exploration Algorithm for Multi-Armed Bandits

– Satyen Kale. Multiarmed Bandits With Limited Expert Advice

– Varun Kanade and Justin Thaler. Distribution-Independent Reliable Learning

– Ravindran Kannan, Santosh S. Vempala and David Woodruff. Principal Component Analysis and Higher Correlations for Distributed Data

– Emilie Kaufmann, Olivier Cappé and Aurélien Garivier. On the Complexity of A/B Testing

– Matthäus Kleindessner and Ulrike von Luxburg. Uniqueness of ordinal embedding

– Kfir Levy, Elad Hazan and Tomer Koren. Logistic Regression: Tight Bounds for Stochastic and Online Optimization

– Ping Li, Cun-Hui Zhang and Tong Zhang. Compressed Counting Meets Compressed Sensing

– Che-Yu Liu and Sébastien Bubeck. Most Correlated Arms Identification

– Stefan Magureanu, Richard Combes and Alexandre Proutière. Lipschitz Bandits:Regret Lower Bounds and Optimal Algorithms

– Shie Mannor, Vianney Perchet and Gilles Stoltz. Approachability in unknown games: Online learning meets multi-objective optimization

– Brendan McMahan and Francesco Orabona. Unconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations

– Shahar Mendelson. Learning without Concentration

– Aditya Menon and Robert Williamson. Bayes-Optimal Scorers for Bipartite Ranking

– Elchanan Mossel, Joe Neeman and Allan Sly. Belief Propagation, Robust Reconstruction and Optimal Recovery of Block Models

– Andreas Maurer, Massimiliano Pontil and Bernardino Romera-Paredes. An Inequality with Applications to Structured Sparsity and Multitask Dictionary Learning

– Alexander Rakhlin and Karthik Sridharan. Online Nonparametric Regression

– Harish Ramaswamy, Balaji S.B., Shivani Agarwal and Robert Williamson. On the Consistency of Output Code Based Learning Algorithms for Multiclass Learning Problems

– Samira Samadi and Nick Harvey. Near-Optimal Herding

– Rahim Samei, Pavel Semukhin, Boting Yang and Sandra Zilles. Sample Compression for Multi-label Concept Classes

– Ingo Steinwart, Chloe Pasin and Robert Williamson. Elicitation and Identification of Properties

– Ilya Tolstikhin, Gilles Blanchard and Marius Kloft. Localized Complexities for Transductive Learning

– Robert Williamson. The Geometry of Losses

– Jiaming Xu, Marc Lelarge and Laurent Massoulie. Edge Label Inference in Generalized Stochastic Block Models: from Spectral Theory to Impossibility Results

– Se-Young Yun and Alexandre Proutiere. Community Detection via Random and Adaptive Sampling

– Yuchen Zhang, Martin Wainwright and Michael Jordan. Lower bounds on the performance of polynomial-time algorithms for sparse linear regression

Posted in Conference/workshop | 1 Comment

On the influence of the seed graph in the preferential attachment model


The preferential attachment model, introduced in 1992 by Mahmoud and popularized in 1999 by Barabási and Albert, has attracted a lot of attention in the last decade. In its simplest form it describes the evolution of a random tree. Formally we denote by \mathrm{PA}(n) the preferential attachment tree on n vertices which is defined by induction as follows. First \mathrm{PA}(2) is the unique tree on two vertices. Then, given \mathrm{PA}(n), \mathrm{PA}(n+1) is formed from \mathrm{PA}(n) by adding a new vertex u and a new edge uv where v is selected at random among vertices in \mathrm{PA}(n) according to the following probability distribution:

    \[\mathbb{P}\left(v = i \ \middle| \, \mathrm{PA}(n) \right) = \frac{d_{\mathrm{PA}(n)}(i)}{2 \left( n - 1 \right)} ,\]

where d_T(u) denotes the degree of vertex u in a tree T. In other words vertices of large degrees are more likely to attract the new nodes. This model of evolution is argued to be a good approximation for things such a network of citations, or the internet network.

One of the main reason for the success of the preferential attachment model is the following theorem, which shows that the degree distribution in \mathrm{PA}(n) follows a power-law, a feature that many real-world networks (such as the internet) exhibit but which is not reproduced by standard random graph models such as the Erdös-Rényi model.

Theorem [Bollobás, Riordan, Spencer and Tusnády (2001)Let d be fixed. Then as n \to +\infty, the proportion of vertices with degree d tends in probability to

    \[\frac{4}{(d+1)(d+2)(d+3)} \sim \frac{4}{d^3} .\]

While the above theorem is a fine and interesting mathematical result, I do not view it as the critical aspect of the preferential attachment model (note that Wikipedia disagrees). In my opinion \mathrm{PA}(n) is simply interesting merely because of its natural rule of evolution.

Now think about the application of the PA model to the internet. Of course there is a few obvious objections, such as the fact that in \mathrm{PA}(n) a website can only link to one other website. While this is clearly ridiculous I think that \mathrm{PA}(n) still contains the essence of what one would like to capture to model the evolution of the internet. However there is one potentially important aspect which is overlooked: in the early days of the internet the PA model was probably very far from being a good approximation to the evolution of the network. It is perhaps reasonable to assume that after 1995 the network was evolving according to PA, but certainly from 1970 to 1995 the evolution followed fundamentally different rules. This observation suggests to study the preferential attachment model with a seed.

Thus we are now interested in \mathrm{PA}(n, T), where T is a finite seed tree. Formally \mathrm{PA}(n, T) is also defined by induction, where \mathrm{PA}(|T|, T) = T and \mathrm{PA}(n+1,T) is formed from \mathrm{PA}(n, T) as before. A very basic question which seems to have been overlooked in the literature is the following: what is the influence of the seed T on \mathrm{PA}(n, T) as n goes to infinity?

In our recent joint work with Elchanan Mossel and Miklos Racz we looked exactly at this question. More precisely we ask the following: given two seed trees T and S, do the distributions \mathrm{PA}(n, S) and \mathrm{PA}(n, T) remain separated (say in total variation distance) as n goes to infinity? In other words we are interested in the following quantity:

    \[\delta(S, T) = \lim_{n \to \infty} \mathrm{TV}(\mathrm{PA}(n, S), \mathrm{PA}(n, T)) .\]

A priori it could be that \delta(S, T) = 0 for any S and T, which would mean that the seed has no influence and that the preferential attachment “forgets” its initial conditions. We prove that this is far from true:

Theorem [Bubeck, Mossel and Racz (2014)] Let S and T be two finite trees on at least 3 vertices. If the degree distributions in S and T are different, then \delta \left( S, T \right) > 0.

If I wanted to make a bold statement I could say that this theorem implies the following: by looking at the internet network today, one can still “see” the influence of the topological structure of the internet back in the 90’s. In other words to a certain extent one can go back in time and potentially infer some properties that people may have believed to be lost (perhaps some army’s secrets hidden in the structure of the ARPANET?). Of course at this stage this is pure science fiction, but the theorem certainly leaves that possibility open. Note that we believe that the theorem can even be strengthen to the following statement:

Conjecture Let S and T be two finite trees on at least 3 vertices. If S and T are non-isomorphic, then \delta \left( S, T \right) > 0.

These statements show that even when n is large one can still “see” in \mathrm{PA}(n, T) the influence of the original seed T. However by considering the total variation distance we allow global statistics that depend on entire tree \mathrm{PA}(n, T). What about local statistics that could be computed by an agent looking only at a finite neighborhood around her? Mathematically this question can be interpreted in the framework of the Benjamini-Schramm limit. Recall that a sequence of random graphs (G_n) tends to a random infinite rooted tree (\mathcal{T}, Z) (Z is the random root) if for any r \in \mathbb{N}, the random ball of radius r around a random vertex k_n in G_n tends in distribution to the random ball of radius r around Z in \mathcal{T}. In other words when n is large enough a random agent cannot tell if she is in G_n or in \mathcal{T} by looking at a finite neighborhood around her. One has the following theorem for the weak limit of the PA model:

Theorem [Berger, Borgs, Chayes and Saberi (2014)The Benjamini-Schramm limit of \mathrm{PA}(n) is the Pólya-point graph with m=1.

We extend this result to an arbitrary seed and show that locally the seed has no influence:

Theorem [Bubeck, Mossel and Racz (2014)For any tree T the Benjamini-Schramm limit of \mathrm{PA}(n, T) is the Pólya-point graph with m=1.

Thus, while the army’s secret of the 90’s might be at risk if one looks at the overall topology of the current internet network, these secrets are safe from any local agent who would only access a (random) finite part of the network.

These new results on the PA model naturally lead to a host of new problems. We end the paper with a list of 7 open problems, I recommend to take a look at them (and try to solve them)!

Posted in Random graphs | 7 Comments

Nesterov’s Accelerated Gradient Descent for Smooth and Strongly Convex Optimization


About a year ago I described Nesterov’s Accelerated Gradient Descent in the context of smooth optimization. As I mentioned previously this has been by far the most popular post on this blog. Today I have decided to revisit this post to give a slightly more geometrical proof (though unfortunately still magical in various parts). I will focus on unconstrained optimization of a smooth and strongly convex function f (in the previous post I dealt only with the smooth case). Precisely f is \alpha-strongly convex and \beta-smooth, and we denote by Q=\beta / \alpha the condition number of f. As explained in this post, in this case the basic gradient descent algorithm requires O(Q \log(1/\epsilon)) iterations to reach \epsilon-accuracy. As we shall see below Nesterov’s Accelerated Gradient Descent attains the improved oracle complexity of O(\sqrt{Q} \log(1/\epsilon)). This is particularly relevant in Machine Learning applications since the strong convexity parameter \alpha can often be viewed as a regularization term, and 1/\alpha can be as large as the sample size. Thus reducing the number of step from “\mathrm{sample \; size}” to “\sqrt{\mathrm{sample \; size}}” can be a huge deal, especially in large scale applications.

Without further due let’s get to the algorithm, which can be described quite succintly. Note that everything written below is simply a condensed version of the calculations appearing on pages 71–81 in Nesterov 2004’s book. Start at an arbitrary initial point x_1 = y_1 and then iterate the following equations for s \geq 1,

    \begin{eqnarray*} y_{s+1} & = & x_s - \frac{1}{\beta} \nabla f(x_s) , \\ x_{s+1} & = & \left(1 + \frac{\sqrt{Q}-1}{\sqrt{Q}+1} \right) y_{s+1} - \frac{\sqrt{Q}-1}{\sqrt{Q}+1} y_s . \end{eqnarray*}

Theorem Let f be \alpha-strongly convex and \beta-smooth, then Nesterov’s Accelerated Gradient Descent satisfies

    \[f(y_t) - f(x^*) \leq \frac{\alpha + \beta}{2} \|x_1 - x^*\|^2 \exp\left(- \frac{t-1}{\sqrt{Q}} \right).\]

Proof: We define \alpha-strongly convex quadratic functions \Phi_s, s \geq 1 by induction as follows:

(1)   \begin{eqnarray*} \Phi_1(x) & = & f(x_1) + \frac{\alpha}{2} \|x-x_1\|^2 , \notag \\ \Phi_{s+1}(x) & = & \left(1 - \frac{1}{\sqrt{Q}}\right) \Phi_s(x) \notag \\ & & + \frac{1}{\sqrt{Q}} \left(f(x_s) + \nabla f(x_s)^{\top} (x-x_s) + \frac{\alpha}{2} \|x-x_s\|^2 \right).  \end{eqnarray*}

Intuitively \Phi_s becomes a finer and finer approximation (from below) to f in the following sense:

(2)   \begin{equation*}  \Phi_{s+1}(x) \leq f(x) + \left(1 - \frac{1}{\sqrt{Q}}\right)^s (\Phi_1(x) - f(x)). \end{equation*}

The above inequality can be proved immediately by induction, using the fact that by \alpha-strong convexity one has

    \[f(x_s) + \nabla f(x_s)^{\top} (x-x_s) + \frac{\alpha}{2} \|x-x_s\|^2 \leq f(x) .\]

Equation (2) by itself does not say much, for it to be useful one needs to understand how “far” below f is \Phi_s. The following inequality answers this question:

(3)   \begin{equation*}  f(y_s) \leq \min_{x \in \mathbb{R}^n} \Phi_s(x) . \end{equation*}

The rest of the proof is devoted to showing that (3) holds true, but first let us see how to combine (2) and (3) to obtain the rate given by the theorem (we use that by \beta-smoothness one has f(x) - f(x^*) \leq \frac{\beta}{2} \|x-x^*\|^2):

    \begin{eqnarray*} f(y_t) - f(x^*) & \leq & \Phi_t(x^*) - f(x^*) \\ & \leq & \left(1 - \frac{1}{\sqrt{Q}}\right)^{t-1} (\Phi_1(x^*) - f(x^*)) \\ & \leq & \frac{\alpha + \beta}{2} \|x_1-x^*\|^2 \left(1 - \frac{1}{\sqrt{Q}}\right)^{t-1} . \end{eqnarray*}

We now prove (3) by induction (note that it is true at s=1 since x_1=y_1). Let \Phi_s^* = \min_{x \in \mathbb{R}^n} \Phi_s(x). Using the definition of y_{s+1} (and \beta-smoothness), convexity, and the induction hypothesis, one gets

    \begin{align*} & f(y_{s+1}) \\ & \leq f(x_s) - \frac{1}{2 \beta} \| \nabla f(x_s) \|^2 \\ & = \left(1 - \frac{1}{\sqrt{Q}}\right) f(y_s) + \left(1 - \frac{1}{\sqrt{Q}}\right)(f(x_s) - f(y_s)) + \frac{f(x_s)}{\sqrt{Q}} - \frac{1}{2 \beta} \| \nabla f(x_s) \|^2 \\ & \leq \left(1 - \frac{1}{\sqrt{Q}}\right) \Phi_s^* + \left(1 - \frac{1}{\sqrt{Q}}\right) \nabla f(x_s)^{\top} (x_s - y_s) + \frac{f(x_s)}{\sqrt{Q}} - \frac{1}{2 \beta} \| \nabla f(x_s) \|^2 . \end{align*}

Thus we now have to show that

(4)   \begin{equation*}  \Phi_{s+1}^* \geq \left(1 - \frac{1}{\sqrt{Q}}\right) \Phi_s^* + \left(1 - \frac{1}{\sqrt{Q}}\right) \nabla f(x_s)^{\top} (x_s - y_s) + \frac{f(x_s)}{\sqrt{Q}} - \frac{1}{2 \beta} \| \nabla f(x_s) \|^2 . \end{equation*}

To prove this inequality we have to understand better the functions \Phi_s. First note that \nabla^2 \Phi_s(x) = \alpha \mathrm{I}_n (immediate by induction) and thus \Phi_s has to be of the following form:

    \[\Phi_s(x) = \Phi_s^* + \frac{\alpha}{2} \|x - v_s\|^2 ,\]

for some v_s \in \mathbb{R}^n. Now observe that by differentiating (1) and using the above form of \Phi_s one obtains

    \[\nabla \Phi_{s+1}(x) = \alpha \left(1 - \frac{1}{\sqrt{Q}}\right) (x-v_s) + \frac{1}{\sqrt{Q}} \nabla f(x_s) + \frac{\alpha}{\sqrt{Q}} (x-x_s) .\]

In particular \Phi_{s+1} is by definition minimized at v_{s+1} which can now be defined by induction using the above identity, precisely:

(5)   \begin{equation*}  v_{s+1} = \left(1 - \frac{1}{\sqrt{Q}}\right) v_s + \frac{1}{\sqrt{Q}} x_s - \frac{1}{\alpha \sqrt{Q}} \nabla f(x_s) . \end{equation*}

Using the form of \Phi_s and \Phi_{s+1}, as well as the original definition (1) one gets the following identity by evaluating \Phi_{s+1} at x_s:

(6)   \begin{equation*}  \Phi_{s+1}^* + \frac{\alpha}{2} \|x_s - v_{s+1}\|^2 = \left(1 - \frac{1}{\sqrt{Q}}\right) \Phi_s^* + \frac{\alpha}{2} \left(1 - \frac{1}{\sqrt{Q}}\right) \|x_s - v_s\|^2 + \frac{1}{\sqrt{Q}} f(x_s) . \end{equation*}

Note that thanks to (5) one has

    \begin{eqnarray*} \|x_s - v_{s+1}\|^2   & = & \left(1 - \frac{1}{\sqrt{Q}}\right)^2 \|x_s - v_s\|^2 + \frac{1}{\alpha^2 Q} \|\nabla f(x_s)\|^2 \\ & & - \frac{2}{\alpha \sqrt{Q}} \left(1 - \frac{1}{\sqrt{Q}}\right) \nabla f(x_s)^{\top}(v_s-x_s) , \end{eqnarray*}

which combined with (6) yields

    \begin{eqnarray*} \Phi_{s+1}^* & = & \left(1 - \frac{1}{\sqrt{Q}}\right) \Phi_s^* + \frac{1}{\sqrt{Q}} f(x_s) + \frac{\alpha}{2 \sqrt{Q}} \left(1 - \frac{1}{\sqrt{Q}}\right) \|x_s - v_s\|^2 \\ & & \qquad - \frac{1}{2 \beta} \| \nabla f(x_s) \|^2 + \frac{1}{\sqrt{Q}} \left(1 - \frac{1}{\sqrt{Q}}\right) \nabla f(x_s)^{\top}(v_s-x_s) . \end{eqnarray*}

Finally we show by induction that v_s - x_s = \sqrt{Q}(x_s - y_s), which concludes the proof of (4) and thus also concludes the proof of the theorem:

    \begin{eqnarray*} v_{s+1} - x_{s+1} & = & \left(1 - \frac{1}{\sqrt{Q}}\right) v_s + \frac{1}{\sqrt{Q}} x_s - \frac{1}{\alpha \sqrt{Q}} \nabla f(x_s) - x_{s+1} \\ & = & \sqrt{Q} x_s - (\sqrt{Q}-1) y_s - \frac{\sqrt{Q}}{\beta} \nabla f(x_s) - x_{s+1} \\ & = & \sqrt{Q} y_{s+1} - (\sqrt{Q}-1) y_s - x_{s+1} \\ & = & \sqrt{Q} (x_{s+1} - y_{s+1}) , \end{eqnarray*}

where the first equality comes from (5), the second from the induction hypothesis, the third from the definition of y_{s+1} and the last one from the definition of x_{s+1}.


Posted in Optimization | 7 Comments

COLT deadline next week

COLT deadline is approaching fast, don’t forget to send your awesome learning theory paper(s) before Friday 7th! Also recall that if you get a paper into COLT it will (i) give you an excuse to spend a few days in Barcelona at the best time of the year and (ii) you could stay a few more days to attend our awesome Foundations of Learning Theory workshop.

In other news I attended ITCS in Princeton last month and it was absolutely great. Here are a few papers that I really liked:

– Amir Shpilka, Avishay Tal and Ben Lee Volk, On the Structure of Boolean Functions with Small Spectral Norm

– Andrew Wan, John Wright and Chenggang Wu. Decision Trees, Protocols, and the Fourier Entropy-Influence Conjecture

– Cristopher Moore and Leonard Schulman, Tree Codes and a Conjecture on Exponential Sums

– Fernando Brandao, Aram Harrow, James Lee and Yuval Peres, Adversarial hypothesis testing and a quantum Stein’s Lemma for restricted measurements

– Yossi Azar, Uriel Feige, Michal Feldman and Moshe Tennenholtz, Sequential Decision Making with Vector Outcomes

– Manor Mendel and Assaf Naor, Expanders with respect to Hadamard spaces and random graphs

– David Gamarnik and Madhu Sudan, Limits of local algorithms over sparse random graphs

– Rishi Gupta, Tim Roughgarden and C. Seshadhri, Decompositions of Triangle-Dense Graphs

During my visit to the Theory Group at MSR I also learned about the following topics which my readers will probably like too: Tucker’s lemma (see also the related Borsuk-Ulam theorem), the lamplighter graph and random walks on it, sparse regularity lemma, and algorithmic applications of evolving sets.

Posted in Conference/workshop | Leave a comment

One year of blogging

A year ago I started this blog. I did not expect this experience to be so rewarding. For instance seeing weeks after the weeks the growing interest for the optimization posts gave me the stamina to pursue this endaveor to its end. Meeting with readers at various conferences (or housewarming parties…) has been a lot of fun too.

Now is probably a good time to look back on 2013 as well as to look forward to what this blog will become in 2014.


I’m a bandit in 2013

First of all I’m happy to report that ‘I’m a bandit’ was viewed more than 55 000 times in 2013!!! As you can see from the plot below (taken from Google analytics) there has been three significant spikes.


– On March 22nd I received my first significant link from John Langford on his own blog hunch.net.

– On July 17th I made a post with ‘deep learning’ in the title. It was retweeted, reblogged, facebooked, G+’ed, etc.

– The last spike on December 13th is quite interesting, as it comes from my first link from the Machine Learning group on reddit.

Of course the stars of the blog so far have been the optimization lecture notes. But the star among the stars is the post on Nesterov’s Accelerated Gradient Descent which has been viewed THREE TIMES MORE than any other post in this sequence! Apart from optimization stuff this blog hosted a few other topics such as metric embeddings, convex geometry or graph theory. Browsing these older posts should be easier now with the new Archives page.


I’m a bandit in 2014

My main objective for the first half of 2014 is to turn the optimization posts into an actual book (or rather a long survey). This will be published in the Foundations and Trends in Machine Learning series (alongside with my previous survey on bandits). Of course I expect this project to take up a lot of my time, so I won’t post too much from February to May. On the other hand I am hopeful that during this downtime I will host quite a few interesting guest posts.

Once the first draft for the optimization lecture notes is out (probably in early May) I will have more time to dedicate to the blog. I plan to start a new series of posts on a topic that I find fascinating, this is the recent theory of graphs limits. I believe (and I’m not the only one!) that in the years to come this theory will prove to be a powerful tool for network analysis, and in particular for statistical analysis on network data. More on this in a few months!

Posted in Uncategorized | 5 Comments

A good NIPS!

This year’s edition of NIPS was a big success. As you probably already know we had the surprise visit of Mark Zuckerberg (see here for the reason behind this visit). More interestingly (or perhaps less interestingly depending on who you are) here are some of the papers that I liked the most.


Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n) by Francis Bach and Eric Moulines. With noisy gradients (see this post) it is known that the best rate of convergence to minimize an \alpha-strongly convex function is of order 1/(\alpha t) while for convex functions it is of order 1/\sqrt{t}. Unfortunately in Machine Learning applications the strong convexity parameter \alpha is often a regularization parameter that can be as small as 1/\sqrt{t}, in which case the standard analysis using strong convexity do not yield any acceleration. In this paper Bach and Moulines show that in the case of the square loss (whose strong convexity parameter depends on the smallest non-zero eigenvalue of the covariance matrix of the covariates) one can obtain a rate of order 1/t (i.e. with no dependency on the smallest eigenvalue \alpha). The proposed algorithm is simply Stochastic Gradient Descent with a constant step-size (and averaging for the output). A more intricate algorithm is also proposed for the logistic loss.

On Decomposing the Proximal Map by Yaoliang Yu. Algorithms such as ISTA and FISTA (see this post) require to compute the proximal operator

    \[\mathcal{P}_f(y) = \argmin_x \|x - y\|^2_2 + f(x) .\]

Recall that this proximal operator arises naturally for the minimization of a function of the form F + f, i.e. when one wants to minimize some function F while enforcing some of the ‘properties’ of f in the solution. For instance with f = \|\cdot\|_1 one would like to output a sparse solution. Thus it is very natural to try to understand the relation between \mathcal{P}_{f+g} and \mathcal{P}_f, \mathcal{P}_g. This paper consider various properties under which one has \mathcal{P}_{f+g} = \mathcal{P}_f \circ \mathcal{P}_g.

Accelerating Stochastic Gradient Descent using Predictive Variance Reduction by Rie Johnson and Tong Zhang. This paper gives a beautiful new algorithm achieving the same performances as SDCA and SAG (see this post). The algorithm/analysis are much more intuitive than the one of SDCA and SAG. I will make a more detailed post on this paper later next year.

Accelerated Mini-Batch Stochastic Dual Coordinate Ascent by Shai Shalev-Shwartz and Tong Zhang. Both SDCA and SAG have a linear dependency on the condition number Q. For the deterministic case Nesterov’s accelerated gradient descent attains a linear dependency on \sqrt{Q}. This paper partially bridges the gap between these results and present an accelerated version of SDCA using mini batches.

Mixed Optimization for Smooth Functions by Mehrdad Mahdavi, Lijun Zhang and Rong Jin. This paper considers a new setting which seems quite natural: what if on top of noisy first order oracle one can also access a regular first order oracle? Mehrdad will do a guest post on this problem soon, but the short answer is that with only a logarithmic number of calls to the regular oracle one can attain a rate of order 1/t for smooth optimization (while with only the noisy oracle the rate is 1/\sqrt{t}).

Estimation, Optimization, and Parallelism when Data is Sparse by John Duchi, Mike Jordan and Brendan McMahan. For this paper too I am hoping to have a guest post (by John Duchi this time) that would explain the new algorithm and its properties. The idea is roughly to do a gradient descent where the step-size adapts to the sparsity of the observed gradient, allowing for much faster rates in certain situations.



Online Learning in Episodic Markovian Decision Processes by Relative Entropy Policy Search by Alexander Zimin and Gergely Neu. This paper shows that one can solve episodic loop-free MDPs by simply using a combinatorial semi-bandit strategy (see this paper by Audibert, myself and Lugosi where we solved the semi-bandit problem). I believe that this paper initiate a research direction that will be very fruitful in the future. Namely reducing (or rather reformulating) a complicated sequential decision making problem as a linear bandit (or semi-bandit). A similar approach is very popular in optimization where everyone knows that one should try very hard to formulate the problem of interest as a convex program. On the other hand such an approach in online learning/sequential decision making has not been recognized yet. I believe that at the moment the most promising direction is to try to formulate the problem as a linear bandit as it is both an extremely general problem but also one for which we have seemingly canonical algorithms. A related paper is Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions by Abbasi-Yadkori, Bartlett, Kanade, Seldin and Szepesvari.

Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards by Thomas Bonald and Alexandre Proutiere. This paper considers the famous Bayesian setting of Berry, Chen, Zame, Heath and Shepp where one has a countable set of arms with Bernoulli distributions and means drawn uniformly on [0,1]. This paper shows the first optimal strategy with a Bayesian regret of order \sqrt{2 n}. I recommend to take a look at the strategy, it is both very elegant and quite smart.

Sequential Transfer in Multi-armed Bandit with Finite Set of Models by Mohammad Azar, Alessandro Lazaric and Emma Brunskill. This paper considers an elegant and simple model for transfer learning in multi-armed bandit problems. To put it simply the setting is the one of a Bayesian multi-armed bandit where the underlying parameter \theta is replaced by a new fresh independent sample very N steps. If the prior is known then this problem is fairly straightforward (given our current knowledge of the multi-armed bandit). The issue is what to do when the prior is unknown. The paper proposes an interesting strategy based on learning latent variable models via the method of moments (see this paper by Anandkumar, Ge, Hsu, Kakade, and Telgarsky for instance). While this gives non-trivial results I believe that much more can be said by fully embracing the sequential nature of the problem rather than trying to blend together a batch method with standard bandit strategies. I suspect that the technical difficulties to obtain an optimal strategy for this setting will be tremendous (which is quite exciting!).

Eluder Dimension and the Sample Complexity of Optimistic Exploration by Daniel Russo and Benjamin Van Roy; Prior-free and prior-dependent regret bounds for Thompson Sampling by myself and Che-Yu Liu. These two papers ask the same question (with the latter paper being inspired by the former): we know that Thompson Sampling can be as good as UCB when initialized with a well-chosen priors, but can we show that it has a significant advantage in the cases where the prior is actually informative? The first paper addresses this question in the context of linear bandits (which is, as pointed out above, the most fundamental bandit problem), while the second paper considers a much more restricted class of problems but for which surprisingly strong results can be obtained.


Online Learning

Dimension-Free Exponentiated Gradient by Francesco Orabona. This paper introduces a new regularizer for Mirror Descent and shows that it adapts automatically to the norm of the unknown comparator. In my opinion we know very few interesting regularizers for MD (especially in the full-information setting) and any non-trivial addition to this set seems difficult. This paper manages to do just that. It would be interesting to see if this gives new insights for bandit regularizers.

Minimax Optimal Algorithms for Unconstrained Linear Optimization by Brendan McMahan and Jacob Abernethy. My advice for this paper is to look at Section 3.2 which gives one very concrete and very cool application of the method they develop.



I’m running out of stamina so I will now just list the other papers that I found very interesting.

Density estimation from unweighted k-nearest neighbor graphs: a roadmap by von Luxburg and Alamgir. 

Stochastic blockmodel approximation of a graphon: Theory and consistent estimation by Airoldi, Costa and Chan. 

Near-Optimal Entrywise Sampling for Data Matrices by Achlioptas, Karnin and Liberty.

Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic by Sharpnack, Krishnamurthy and Singh.

Estimating the Unseen: Improved Estimators for Entropy and other Properties by Valiant and Valiant. 

Information-theoretic lower bounds for distributed statistical estimation with communication constraints by Zhang, Duchi, Jordan and Wainwright.

Posted in Conference/workshop | 3 Comments