COLT 2014 accepted papers

The accepted papers for COLT 2014 have just been posted! This year we had a record number of 140 submissions, out of which 52 were accepted (38 for a 20-minute presentation and 14 for a 5-minute presentation). In my opinion the program is very strong, and I am looking forward to listening to all these talks in Barcelona! By the way, now may be a very good time to register: the deadline for early registration is coming up fast!

Here is a list of the accepted papers together with links when I could find one (if you want me to add a link just send me an email).

- Jacob Abernethy, Chansoo Lee, Abhinav Sinha and Ambuj Tewari. Learning with Perturbations via Gaussian Smoothing

- Alekh Agarwal, Animashree Anandkumar, Prateek Jain, Praneeth Netrapalli and Rashish Tandon. Learning Sparsely Used Overcomplete Dictionaries

- Alekh Agarwal, Ashwin Badanidiyuru, Miroslav Dudik, Robert Schapire and Aleksandrs Slivkins. Robust Multi-objective Learning with Mentor Feedback

- Morteza Alamgir, Ulrike von Luxburg and Gabor Lugosi. Density-preserving quantization with application to graph downsampling

- Joseph Anderson, Mikhail Belkin, Navin Goyal, Luis Rademacher and James Voss. The More, the Merrier: the Blessing of Dimensionality for Learning Large Gaussian Mixtures

- Sanjeev Arora, Rong Ge and Ankur Moitra. New Algorithms for Learning Incoherent and Overcomplete Dictionaries

- Ashwinkumar Badanidiyuru, John Langford and Aleksandrs Slivkins. Resourceful Contextual Bandits

- Shai Ben-David and Ruth Urner. The sample complexity of agnostic learning under deterministic labels

- Aditya Bhaskara, Moses Charikar and Aravindan Vijayaraghavan. Uniqueness of Tensor Decompositions with Applications to Polynomial Identifiability

- Evgeny Burnaev and Vladimir Vovk. Efficiency of conformalized ridge regression

- Karthekeyan Chandrasekaran and Richard M. Karp. Finding a most biased coin with fewest flips

- Yudong Chen, Xinyang Yi and Constantine Caramanis. A Convex Formulation for Mixed Regression: Minimax Optimal Rates

- Amit Daniely, Nati Linial and Shai Shalev-Shwartz. The complexity of learning halfspaces using generalized linear methods

- Amit Daniely and Shai Shalev-Shwartz. Optimal Learners for Multiclass Problems

- Constantinos Daskalakis and Gautam Kamath. Faster and Sample Near-Optimal Algorithms for Proper Learning Mixtures of Gaussians

- Ofer Dekel, Jian Ding, Tomer Koren and Yuval Peres. Online Learning with Composite Loss Functions Can Be Hard

- Tim van Erven, Wojciech Kotlowski and Manfred K. Warmuth. Follow the Leader with Dropout Perturbations

- Vitaly Feldman and Pravesh Kothari. Learning Coverage Functions and Private Release of Marginals

- Vitaly Feldman and David Xiao. Sample Complexity Bounds on Differentially Private Learning via Communication Complexity

- Pierre Gaillard, Gilles Stoltz and Tim van Erven. A Second-order Bound with Excess Losses

- Eyal Gofer. Higher-Order Regret Bounds with Switching Costs

- Sudipto Guha and Kamesh Munagala. Stochastic Regret Minimization via Thompson Sampling

- Moritz Hardt, Raghu Meka, Prasad Raghavendra and Benjamin Weitz. Computational Limits for Matrix Completion

- Moritz Hardt and Mary Wootters. Fast Matrix Completion Without the Condition Number

- Elad Hazan, Zohar Karnin and Raghu Meka. Volumetric Spanners: an Efficient Exploration Basis for Learning

- Prateek Jain and Sewoong Oh. Learning Mixtures of Discrete Product Distributions using Spectral Decompositions

- Kevin Jamieson, Matthew Malloy, Robert Nowak and Sebastien Bubeck. lil’ UCB: An Optimal Exploration Algorithm for Multi-Armed Bandits

- Satyen Kale. Multiarmed Bandits With Limited Expert Advice

- Varun Kanade and Justin Thaler. Distribution-Independent Reliable Learning

- Ravindran Kannan, Santosh S. Vempala and David Woodruff. Principal Component Analysis and Higher Correlations for Distributed Data

- Emilie Kaufmann, Olivier Cappé and Aurélien Garivier. On the Complexity of A/B Testing

- Matthäus Kleindessner and Ulrike von Luxburg. Uniqueness of ordinal embedding

- Kfir Levy, Elad Hazan and Tomer Koren. Logistic Regression: Tight Bounds for Stochastic and Online Optimization

- Ping Li, Cun-Hui Zhang and Tong Zhang. Compressed Counting Meets Compressed Sensing

- Che-Yu Liu and Sébastien Bubeck. Most Correlated Arms Identification

- Stefan Magureanu, Richard Combes and Alexandre Proutière. Lipschitz Bandits: Regret Lower Bounds and Optimal Algorithms

- Shie Mannor, Vianney Perchet and Gilles Stoltz. Approachability in unknown games: Online learning meets multi-objective optimization

- Brendan McMahan and Francesco Orabona. Unconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations

- Shahar Mendelson. Learning without Concentration

- Aditya Menon and Robert Williamson. Bayes-Optimal Scorers for Bipartite Ranking

- Elchanan Mossel, Joe Neeman and Allan Sly. Belief Propagation, Robust Reconstruction and Optimal Recovery of Block Models

- Andreas Maurer, Massimiliano Pontil and Bernardino Romera-Paredes. An Inequality with Applications to Structured Sparsity and Multitask Dictionary Learning

- Alexander Rakhlin and Karthik Sridharan. Online Nonparametric Regression

- Harish Ramaswamy, Balaji S.B., Shivani Agarwal and Robert Williamson. On the Consistency of Output Code Based Learning Algorithms for Multiclass Learning Problems

- Samira Samadi and Nick Harvey. Near-Optimal Herding

- Rahim Samei, Pavel Semukhin, Boting Yang and Sandra Zilles. Sample Compression for Multi-label Concept Classes

- Ingo Steinwart, Chloe Pasin and Robert Williamson. Elicitation and Identification of Properties

- Ilya Tolstikhin, Gilles Blanchard and Marius Kloft. Localized Complexities for Transductive Learning

- Robert Williamson. The Geometry of Losses

- Jiaming Xu, Marc Lelarge and Laurent Massoulie. Edge Label Inference in Generalized Stochastic Block Models: from Spectral Theory to Impossibility Results

- Se-Young Yun and Alexandre Proutiere. Community Detection via Random and Adaptive Sampling

- Yuchen Zhang, Martin Wainwright and Michael Jordan. Lower bounds on the performance of polynomial-time algorithms for sparse linear regression


On the influence of the seed graph in the preferential attachment model


The preferential attachment model, introduced in 1992 by Mahmoud and popularized in 1999 by Barabási and Albert, has attracted a lot of attention in the last decade. In its simplest form it describes the evolution of a random tree. Formally we denote by \mathrm{PA}(n) the preferential attachment tree on n vertices which is defined by induction as follows. First \mathrm{PA}(2) is the unique tree on two vertices. Then, given \mathrm{PA}(n), \mathrm{PA}(n+1) is formed from \mathrm{PA}(n) by adding a new vertex u and a new edge uv where v is selected at random among vertices in \mathrm{PA}(n) according to the following probability distribution:

    \[\mathbb{P}\left(v = i \ \middle| \, \mathrm{PA}(n) \right) = \frac{d_{\mathrm{PA}(n)}(i)}{2 \left( n - 1 \right)} ,\]

where d_T(u) denotes the degree of vertex u in a tree T. In other words vertices of large degree are more likely to attract the new nodes. This model of evolution is argued to be a good approximation for things such as a network of citations, or the internet.
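To make the evolution rule concrete, here is a minimal Python sketch (the function name and the degree-list trick are my choices): each vertex appears in a list once per unit of degree, so that a uniform sample from this list is exactly a degree-proportional sample.

    import random

    def preferential_attachment(n):
        """Simulate PA(n) and return the edge list of the random tree."""
        edges = [(1, 2)]       # PA(2): the unique tree on two vertices
        degree_list = [1, 2]   # vertex i appears d(i) times; here d(1) = d(2) = 1
        for u in range(3, n + 1):
            v = random.choice(degree_list)  # degree-proportional sampling
            edges.append((u, v))
            degree_list.extend([u, v])      # both endpoints gain one unit of degree
        return edges

Tabulating the degrees over many runs of this sketch should reproduce the power-law of the theorem below.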

One of the main reasons for the success of the preferential attachment model is the following theorem, which shows that the degree distribution in \mathrm{PA}(n) follows a power-law, a feature that many real-world networks (such as the internet) exhibit but which is not reproduced by standard random graph models such as the Erdös-Rényi model.

Theorem [Bollobás, Riordan, Spencer and Tusnády (2001)] Let d be fixed. Then as n \to +\infty, the proportion of vertices with degree d tends in probability to

    \[\frac{4}{(d+1)(d+2)(d+3)} \sim \frac{4}{d^3} .\]

While the above theorem is a fine and interesting mathematical result, I do not view it as the critical aspect of the preferential attachment model (note that Wikipedia disagrees). In my opinion \mathrm{PA}(n) is interesting simply because of its natural rule of evolution.

Now think about the application of the PA model to the internet. Of course there are a few obvious objections, such as the fact that in \mathrm{PA}(n) a website can only link to one other website. While this is clearly ridiculous I think that \mathrm{PA}(n) still contains the essence of what one would like to capture to model the evolution of the internet. However there is one potentially important aspect which is overlooked: in the early days of the internet the PA model was probably very far from being a good approximation to the evolution of the network. It is perhaps reasonable to assume that after 1995 the network was evolving according to PA, but certainly from 1970 to 1995 the evolution followed fundamentally different rules. This observation suggests studying the preferential attachment model with a seed.

Thus we are now interested in \mathrm{PA}(n, T), where T is a finite seed tree. Formally \mathrm{PA}(n, T) is also defined by induction, where \mathrm{PA}(|T|, T) = T and \mathrm{PA}(n+1,T) is formed from \mathrm{PA}(n, T) as before. A very basic question which seems to have been overlooked in the literature is the following: what is the influence of the seed T on \mathrm{PA}(n, T) as n goes to infinity?
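In the representation used in the sketch above, only the initialization changes; a sketch, assuming the seed T is given as an edge list on vertices labeled 1, \hdots, |T|:

    import random

    def preferential_attachment_with_seed(n, seed_edges):
        """Simulate PA(n, T) where the seed tree T is given by its edge list."""
        edges = list(seed_edges)
        degree_list = [v for e in seed_edges for v in e]  # one entry per edge endpoint
        size = len(seed_edges) + 1                        # a tree on |T| vertices has |T| - 1 edges
        for u in range(size + 1, n + 1):
            v = random.choice(degree_list)
            edges.append((u, v))
            degree_list.extend([u, v])
        return edges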

In recent joint work with Elchanan Mossel and Miklos Racz we looked at exactly this question. More precisely we ask the following: given two seed trees S and T, do the distributions \mathrm{PA}(n, S) and \mathrm{PA}(n, T) remain separated (say in total variation distance) as n goes to infinity? In other words we are interested in the following quantity:

    \[\delta(S, T) = \lim_{n \to \infty} \mathrm{TV}(\mathrm{PA}(n, S), \mathrm{PA}(n, T)) .\]

A priori it could be that \delta(S, T) = 0 for any S and T, which would mean that the seed has no influence and that the preferential attachment “forgets” its initial conditions. We prove that this is far from true:

Theorem [Bubeck, Mossel and Racz (2014)] Let S and T be two finite trees on at least 3 vertices. If the degree distributions in S and T are different, then \delta \left( S, T \right) > 0.

If I wanted to make a bold statement I could say that this theorem implies the following: by looking at the internet network today, one can still “see” the influence of the topological structure of the internet back in the 90's. In other words to a certain extent one can go back in time and potentially infer some properties that people may have believed to be lost (perhaps some army's secrets hidden in the structure of the ARPANET?). Of course at this stage this is pure science fiction, but the theorem certainly leaves that possibility open. Note that we believe that the theorem can be strengthened to the following statement:

Conjecture Let S and T be two finite trees on at least 3 vertices. If S and T are non-isomorphic, then \delta \left( S, T \right) > 0.

These statements show that even when n is large one can still “see” in \mathrm{PA}(n, T) the influence of the original seed T. However by considering the total variation distance we allow global statistics that depend on the entire tree \mathrm{PA}(n, T). What about local statistics that could be computed by an agent looking only at a finite neighborhood around her? Mathematically this question can be interpreted in the framework of the Benjamini-Schramm limit. Recall that a sequence of random graphs (G_n) tends to a random infinite rooted tree (\mathcal{T}, Z) (Z is the random root) if for any r \in \mathbb{N}, the ball of radius r around a uniformly random vertex of G_n tends in distribution to the ball of radius r around Z in \mathcal{T}. In other words when n is large enough a random agent cannot tell if she is in G_n or in \mathcal{T} by looking at a finite neighborhood around her. One has the following theorem for the weak limit of the PA model:

Theorem [Berger, Borgs, Chayes and Saberi (2014)] The Benjamini-Schramm limit of \mathrm{PA}(n) is the Pólya-point graph with m=1.

We extend this result to an arbitrary seed and show that locally the seed has no influence:

Theorem [Bubeck, Mossel and Racz (2014)] For any tree T the Benjamini-Schramm limit of \mathrm{PA}(n, T) is the Pólya-point graph with m=1.

Thus, while the army's secrets of the 90's might be at risk if one looks at the overall topology of the current internet network, these secrets are safe from any local agent who would only access a (random) finite part of the network.

These new results on the PA model naturally lead to a host of new problems. We end the paper with a list of 7 open problems; I recommend taking a look at them (and trying to solve them)!


Nesterov’s Accelerated Gradient Descent for Smooth and Strongly Convex Optimization


About a year ago I described Nesterov's Accelerated Gradient Descent in the context of smooth optimization. As I mentioned previously this has been by far the most popular post on this blog. Today I have decided to revisit this post to give a slightly more geometrical proof (though unfortunately still magical in various parts). I will focus on unconstrained optimization of a smooth and strongly convex function f (in the previous post I dealt only with the smooth case). Precisely f is \alpha-strongly convex and \beta-smooth, and we denote by Q=\beta / \alpha the condition number of f. As explained in this post, in this case the basic gradient descent algorithm requires O(Q \log(1/\epsilon)) iterations to reach \epsilon-accuracy. As we shall see below Nesterov's Accelerated Gradient Descent attains the improved oracle complexity of O(\sqrt{Q} \log(1/\epsilon)). This is particularly relevant in Machine Learning applications since the strong convexity parameter \alpha can often be viewed as a regularization term, and 1/\alpha can be as large as the sample size. Thus reducing the number of steps from “\mathrm{sample \; size}” to “\sqrt{\mathrm{sample \; size}}” can be a huge deal, especially in large scale applications.

Without further ado let's get to the algorithm, which can be described quite succinctly. Note that everything written below is simply a condensed version of the calculations appearing on pages 71-81 of Nesterov's 2004 book. Start at an arbitrary initial point x_1 = y_1 and then iterate the following equations for s \geq 1,

    \begin{eqnarray*} y_{s+1} & = & x_s - \frac{1}{\beta} \nabla f(x_s) , \\ x_{s+1} & = & \left(1 + \frac{\sqrt{Q}-1}{\sqrt{Q}+1} \right) y_{s+1} - \frac{\sqrt{Q}-1}{\sqrt{Q}+1} y_s . \end{eqnarray*}
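In code the scheme is two lines per iteration; here is a minimal Python sketch (the function and parameter names are mine, and grad_f stands for a gradient oracle):

    import numpy as np

    def nesterov_agd(grad_f, x1, alpha, beta, num_iters):
        """Nesterov's accelerated gradient descent for an alpha-strongly
        convex and beta-smooth function given by its gradient oracle."""
        Q = beta / alpha                                  # condition number
        momentum = (np.sqrt(Q) - 1) / (np.sqrt(Q) + 1)
        x, y = x1, x1                                     # x_1 = y_1
        for _ in range(num_iters):
            y_next = x - grad_f(x) / beta                 # gradient step
            x = (1 + momentum) * y_next - momentum * y    # momentum step
            y = y_next
        return y

For instance, for a quadratic f(x) = \frac{1}{2} x^{\top} A x one would pass grad_f = lambda x: A @ x, with \alpha and \beta the smallest and largest eigenvalues of A.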

Theorem Let f be \alpha-strongly convex and \beta-smooth, then Nesterov’s Accelerated Gradient Descent satisfies

    \[f(y_t) - f(x^*) \leq \frac{\alpha + \beta}{2} \|x_1 - x^*\|^2 \exp\left(- \frac{t-1}{\sqrt{Q}} \right).\]

Proof: We define \alpha-strongly convex quadratic functions \Phi_s, s \geq 1 by induction as follows:

(1)   \begin{eqnarray*} \Phi_1(x) & = & f(x_1) + \frac{\alpha}{2} \|x-x_1\|^2 , \notag \\ \Phi_{s+1}(x) & = & \left(1 - \frac{1}{\sqrt{Q}}\right) \Phi_s(x) \notag \\ & & + \frac{1}{\sqrt{Q}} \left(f(x_s) + \nabla f(x_s)^{\top} (x-x_s) + \frac{\alpha}{2} \|x-x_s\|^2 \right).  \end{eqnarray*}

Intuitively \Phi_s becomes a finer and finer approximation (from below) to f in the following sense:

(2)   \begin{equation*}  \Phi_{s+1}(x) \leq f(x) + \left(1 - \frac{1}{\sqrt{Q}}\right)^s (\Phi_1(x) - f(x)). \end{equation*}

The above inequality can be proved immediately by induction, using the fact that by \alpha-strong convexity one has

    \[f(x_s) + \nabla f(x_s)^{\top} (x-x_s) + \frac{\alpha}{2} \|x-x_s\|^2 \leq f(x) .\]

Equation (2) by itself does not say much; for it to be useful one needs to understand how far below f the functions \Phi_s can go. The following inequality answers this question:

(3)   \begin{equation*}  f(y_s) \leq \min_{x \in \mathbb{R}^n} \Phi_s(x) . \end{equation*}

The rest of the proof is devoted to showing that (3) holds true, but first let us see how to combine (2) and (3) to obtain the rate given by the theorem (we use that by \beta-smoothness one has f(x) - f(x^*) \leq \frac{\beta}{2} \|x-x^*\|^2):

    \begin{eqnarray*} f(y_t) - f(x^*) & \leq & \Phi_t(x^*) - f(x^*) \\ & \leq & \left(1 - \frac{1}{\sqrt{Q}}\right)^{t-1} (\Phi_1(x^*) - f(x^*)) \\ & \leq & \frac{\alpha + \beta}{2} \|x_1-x^*\|^2 \left(1 - \frac{1}{\sqrt{Q}}\right)^{t-1} . \end{eqnarray*}

We now prove (3) by induction (note that it is true at s=1 since x_1=y_1). Let \Phi_s^* = \min_{x \in \mathbb{R}^n} \Phi_s(x). Using the definition of y_{s+1} (and \beta-smoothness), convexity, and the induction hypothesis, one gets

    \begin{align*} & f(y_{s+1}) \\ & \leq f(x_s) - \frac{1}{2 \beta} \| \nabla f(x_s) \|^2 \\ & = \left(1 - \frac{1}{\sqrt{Q}}\right) f(y_s) + \left(1 - \frac{1}{\sqrt{Q}}\right)(f(x_s) - f(y_s)) + \frac{f(x_s)}{\sqrt{Q}} - \frac{1}{2 \beta} \| \nabla f(x_s) \|^2 \\ & \leq \left(1 - \frac{1}{\sqrt{Q}}\right) \Phi_s^* + \left(1 - \frac{1}{\sqrt{Q}}\right) \nabla f(x_s)^{\top} (x_s - y_s) + \frac{f(x_s)}{\sqrt{Q}} - \frac{1}{2 \beta} \| \nabla f(x_s) \|^2 . \end{align*}

Thus we now have to show that

(4)   \begin{equation*}  \Phi_{s+1}^* \geq \left(1 - \frac{1}{\sqrt{Q}}\right) \Phi_s^* + \left(1 - \frac{1}{\sqrt{Q}}\right) \nabla f(x_s)^{\top} (x_s - y_s) + \frac{f(x_s)}{\sqrt{Q}} - \frac{1}{2 \beta} \| \nabla f(x_s) \|^2 . \end{equation*}

To prove this inequality we need to better understand the functions \Phi_s. First note that \nabla^2 \Phi_s(x) = \alpha \mathrm{I}_n (immediate by induction) and thus \Phi_s has to be of the following form:

    \[\Phi_s(x) = \Phi_s^* + \frac{\alpha}{2} \|x - v_s\|^2 ,\]

for some v_s \in \mathbb{R}^n. Now observe that by differentiating (1) and using the above form of \Phi_s one obtains

    \[\nabla \Phi_{s+1}(x) = \alpha \left(1 - \frac{1}{\sqrt{Q}}\right) (x-v_s) + \frac{1}{\sqrt{Q}} \nabla f(x_s) + \frac{\alpha}{\sqrt{Q}} (x-x_s) .\]

In particular \Phi_{s+1} is by definition minimized at v_{s+1} which can now be defined by induction using the above identity, precisely:

(5)   \begin{equation*}  v_{s+1} = \left(1 - \frac{1}{\sqrt{Q}}\right) v_s + \frac{1}{\sqrt{Q}} x_s - \frac{1}{\alpha \sqrt{Q}} \nabla f(x_s) . \end{equation*}

Using the form of \Phi_s and \Phi_{s+1}, as well as the original definition (1) one gets the following identity by evaluating \Phi_{s+1} at x_s:

(6)   \begin{equation*}  \Phi_{s+1}^* + \frac{\alpha}{2} \|x_s - v_{s+1}\|^2 = \left(1 - \frac{1}{\sqrt{Q}}\right) \Phi_s^* + \frac{\alpha}{2} \left(1 - \frac{1}{\sqrt{Q}}\right) \|x_s - v_s\|^2 + \frac{1}{\sqrt{Q}} f(x_s) . \end{equation*}

Note that thanks to (5) one has

    \begin{eqnarray*} \|x_s - v_{s+1}\|^2   & = & \left(1 - \frac{1}{\sqrt{Q}}\right)^2 \|x_s - v_s\|^2 + \frac{1}{\alpha^2 Q} \|\nabla f(x_s)\|^2 \\ & & - \frac{2}{\alpha \sqrt{Q}} \left(1 - \frac{1}{\sqrt{Q}}\right) \nabla f(x_s)^{\top}(v_s-x_s) , \end{eqnarray*}

which combined with (6) yields

    \begin{eqnarray*} \Phi_{s+1}^* & = & \left(1 - \frac{1}{\sqrt{Q}}\right) \Phi_s^* + \frac{1}{\sqrt{Q}} f(x_s) + \frac{\alpha}{2 \sqrt{Q}} \left(1 - \frac{1}{\sqrt{Q}}\right) \|x_s - v_s\|^2 \\ & & \qquad - \frac{1}{2 \beta} \| \nabla f(x_s) \|^2 + \frac{1}{\sqrt{Q}} \left(1 - \frac{1}{\sqrt{Q}}\right) \nabla f(x_s)^{\top}(v_s-x_s) . \end{eqnarray*}

Finally we show by induction that v_s - x_s = \sqrt{Q}(x_s - y_s), which concludes the proof of (4) and thus also concludes the proof of the theorem:

    \begin{eqnarray*} v_{s+1} - x_{s+1} & = & \left(1 - \frac{1}{\sqrt{Q}}\right) v_s + \frac{1}{\sqrt{Q}} x_s - \frac{1}{\alpha \sqrt{Q}} \nabla f(x_s) - x_{s+1} \\ & = & \sqrt{Q} x_s - (\sqrt{Q}-1) y_s - \frac{\sqrt{Q}}{\beta} \nabla f(x_s) - x_{s+1} \\ & = & \sqrt{Q} y_{s+1} - (\sqrt{Q}-1) y_s - x_{s+1} \\ & = & \sqrt{Q} (x_{s+1} - y_{s+1}) , \end{eqnarray*}

where the first equality comes from (5), the second from the induction hypothesis, the third from the definition of y_{s+1} and the last one from the definition of x_{s+1}.

\Box


COLT deadline next week

The COLT deadline is approaching fast; don't forget to send your awesome learning theory paper(s) before Friday the 7th! Also recall that if you get a paper into COLT it will (i) give you an excuse to spend a few days in Barcelona at the best time of the year, and (ii) let you stay a few more days to attend our awesome Foundations of Learning Theory workshop.

In other news I attended ITCS in Princeton last month and it was absolutely great. Here are a few papers that I really liked:

- Amir Shpilka, Avishay Tal and Ben Lee Volk, On the Structure of Boolean Functions with Small Spectral Norm

- Andrew Wan, John Wright and Chenggang Wu. Decision Trees, Protocols, and the Fourier Entropy-Influence Conjecture

- Cristopher Moore and Leonard Schulman, Tree Codes and a Conjecture on Exponential Sums

- Fernando Brandao, Aram Harrow, James Lee and Yuval Peres, Adversarial hypothesis testing and a quantum Stein’s Lemma for restricted measurements

- Yossi Azar, Uriel Feige, Michal Feldman and Moshe Tennenholtz, Sequential Decision Making with Vector Outcomes

- Manor Mendel and Assaf Naor, Expanders with respect to Hadamard spaces and random graphs

- David Gamarnik and Madhu Sudan, Limits of local algorithms over sparse random graphs

- Rishi Gupta, Tim Roughgarden and C. Seshadhri, Decompositions of Triangle-Dense Graphs

During my visit to the Theory Group at MSR I also learned about the following topics which my readers will probably like too: Tucker's lemma (see also the related Borsuk-Ulam theorem), the lamplighter graph and random walks on it, the sparse regularity lemma, and algorithmic applications of evolving sets.


One year of blogging

A year ago I started this blog. I did not expect this experience to be so rewarding. For instance, seeing week after week the growing interest in the optimization posts gave me the stamina to pursue this endeavor to its end. Meeting readers at various conferences (or housewarming parties…) has been a lot of fun too.

Now is probably a good time to look back on 2013 as well as to look forward to what this blog will become in 2014.

 

I’m a bandit in 2013

First of all I'm happy to report that ‘I'm a bandit' was viewed more than 55,000 times in 2013!!! As you can see from the plot below (taken from Google Analytics) there have been three significant spikes.

[Plot: 2013 traffic for the blog, from Google Analytics, showing three spikes]

- On March 22nd I received my first significant link from John Langford on his own blog hunch.net.

- On July 17th I made a post with ‘deep learning’ in the title. It was retweeted, reblogged, facebooked, G+’ed, etc.

- The last spike on December 13th is quite interesting, as it comes from my first link from the Machine Learning group on reddit.

Of course the stars of the blog so far have been the optimization lecture notes. But the star among the stars is the post on Nesterov’s Accelerated Gradient Descent which has been viewed THREE TIMES MORE than any other post in this sequence! Apart from optimization stuff this blog hosted a few other topics such as metric embeddings, convex geometry or graph theory. Browsing these older posts should be easier now with the new Archives page.

 

I’m a bandit in 2014

My main objective for the first half of 2014 is to turn the optimization posts into an actual book (or rather a long survey). This will be published in the Foundations and Trends in Machine Learning series (alongside my previous survey on bandits). Of course I expect this project to take up a lot of my time, so I won't post too much from February to May. On the other hand I am hopeful that during this downtime I will host quite a few interesting guest posts.

Once the first draft of the optimization lecture notes is out (probably in early May) I will have more time to dedicate to the blog. I plan to start a new series of posts on a topic that I find fascinating: the recent theory of graph limits. I believe (and I'm not the only one!) that in the years to come this theory will prove to be a powerful tool for network analysis, and in particular for statistical analysis of network data. More on this in a few months!


A good NIPS!

This year’s edition of NIPS was a big success. As you probably already know we had the surprise visit of Mark Zuckerberg (see here for the reason behind this visit). More interestingly (or perhaps less interestingly depending on who you are) here are some of the papers that I liked the most.

Optimization

- Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n) by Francis Bach and Eric Moulines. With noisy gradients (see this post) it is known that the best rate of convergence to minimize an \alpha-strongly convex function is of order 1/(\alpha t) while for convex functions it is of order 1/\sqrt{t}. Unfortunately in Machine Learning applications the strong convexity parameter \alpha is often a regularization parameter that can be as small as 1/\sqrt{t}, in which case the standard analysis using strong convexity does not yield any acceleration. In this paper Bach and Moulines show that in the case of the square loss (whose strong convexity parameter depends on the smallest non-zero eigenvalue of the covariance matrix of the covariates) one can obtain a rate of order 1/t (i.e. with no dependency on the smallest eigenvalue \alpha). The proposed algorithm is simply Stochastic Gradient Descent with a constant step-size (and averaging for the output). A more intricate algorithm is also proposed for the logistic loss.

- On Decomposing the Proximal Map by Yaoliang Yu. Algorithms such as ISTA and FISTA (see this post) require computing the proximal operator

    \[\mathcal{P}_f(y) = \argmin_x \|x - y\|^2_2 + f(x) .\]

Recall that this proximal operator arises naturally for the minimization of a function of the form F + f, i.e. when one wants to minimize some function F while enforcing some of the ‘properties' of f in the solution. For instance with f = \|\cdot\|_1 one would like to output a sparse solution. Thus it is very natural to try to understand the relation between \mathcal{P}_{f+g} and \mathcal{P}_f, \mathcal{P}_g. This paper considers various conditions under which one has \mathcal{P}_{f+g} = \mathcal{P}_f \circ \mathcal{P}_g (a small sketch of the \ell_1 case appears after this list).

- Accelerating Stochastic Gradient Descent using Predictive Variance Reduction by Rie Johnson and Tong Zhang. This paper gives a beautiful new algorithm achieving the same performance as SDCA and SAG (see this post). The algorithm and analysis are much more intuitive than those of SDCA and SAG. I will write a more detailed post on this paper later next year.

- Accelerated Mini-Batch Stochastic Dual Coordinate Ascent by Shai Shalev-Shwartz and Tong Zhang. Both SDCA and SAG have a linear dependency on the condition number Q. For the deterministic case Nesterov's accelerated gradient descent attains a linear dependency on \sqrt{Q}. This paper partially bridges the gap between these results and presents an accelerated version of SDCA using mini-batches.

- Mixed Optimization for Smooth Functions by Mehrdad Mahdavi, Lijun Zhang and Rong Jin. This paper considers a new setting which seems quite natural: what if on top of a noisy first-order oracle one can also access a regular first-order oracle? Mehrdad will do a guest post on this problem soon, but the short answer is that with only a logarithmic number of calls to the regular oracle one can attain a rate of order 1/t for smooth optimization (while with only the noisy oracle the rate is 1/\sqrt{t}).

- Estimation, Optimization, and Parallelism when Data is Sparse by John Duchi, Mike Jordan and Brendan McMahan. For this paper too I am hoping to have a guest post (by John Duchi this time) that would explain the new algorithm and its properties. The idea is roughly to do a gradient descent where the step-size adapts to the sparsity of the observed gradient, allowing for much faster rates in certain situations.
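As a small illustration of the proximal map defined above: with f = \lambda \|\cdot\|_1 it has a closed form, namely soft-thresholding. A minimal sketch (note that the convention above, with no 1/2 in front of the quadratic term, puts the threshold at \lambda/2):

    import numpy as np

    def prox_l1(y, lam):
        """Proximal map of f = lam * ||.||_1 for the convention
        P_f(y) = argmin_x ||x - y||_2^2 + f(x): soft-thresholding at lam/2."""
        return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)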

 

Bandits

- Online Learning in Episodic Markovian Decision Processes by Relative Entropy Policy Search by Alexander Zimin and Gergely Neu. This paper shows that one can solve episodic loop-free MDPs by simply using a combinatorial semi-bandit strategy (see this paper by Audibert, myself and Lugosi where we solved the semi-bandit problem). I believe that this paper initiates a research direction that will be very fruitful in the future, namely reducing (or rather reformulating) a complicated sequential decision making problem as a linear bandit (or semi-bandit). A similar approach is very popular in optimization, where everyone knows that one should try very hard to formulate the problem of interest as a convex program. On the other hand such an approach has not yet been recognized in online learning/sequential decision making. I believe that at the moment the most promising direction is to try to formulate the problem as a linear bandit, as it is both an extremely general problem and one for which we have seemingly canonical algorithms. A related paper is Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions by Abbasi-Yadkori, Bartlett, Kanade, Seldin and Szepesvari.

- Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards by Thomas Bonald and Alexandre Proutiere. This paper considers the famous Bayesian setting of Berry, Chen, Zame, Heath and Shepp where one has a countable set of arms with Bernoulli distributions and means drawn uniformly on [0,1]. This paper shows the first optimal strategy with a Bayesian regret of order \sqrt{2 n}. I recommend taking a look at the strategy; it is both very elegant and quite smart.

- Sequential Transfer in Multi-armed Bandit with Finite Set of Models by Mohammad Azar, Alessandro Lazaric and Emma Brunskill. This paper considers an elegant and simple model for transfer learning in multi-armed bandit problems. To put it simply, the setting is the one of a Bayesian multi-armed bandit where the underlying parameter \theta is replaced by a fresh independent sample every N steps. If the prior is known then this problem is fairly straightforward (given our current knowledge of the multi-armed bandit). The issue is what to do when the prior is unknown. The paper proposes an interesting strategy based on learning latent variable models via the method of moments (see this paper by Anandkumar, Ge, Hsu, Kakade, and Telgarsky for instance). While this gives non-trivial results I believe that much more can be said by fully embracing the sequential nature of the problem rather than trying to blend together a batch method with standard bandit strategies. I suspect that the technical difficulties to obtain an optimal strategy for this setting will be tremendous (which is quite exciting!).

- Eluder Dimension and the Sample Complexity of Optimistic Exploration by Daniel Russo and Benjamin Van Roy; Prior-free and prior-dependent regret bounds for Thompson Sampling by myself and Che-Yu Liu. These two papers ask the same question (with the latter paper being inspired by the former): we know that Thompson Sampling can be as good as UCB when initialized with a well-chosen prior, but can we show that it has a significant advantage in the cases where the prior is actually informative? The first paper addresses this question in the context of linear bandits (which is, as pointed out above, the most fundamental bandit problem), while the second paper considers a much more restricted class of problems but for which surprisingly strong results can be obtained.

 

Online Learning

- Dimension-Free Exponentiated Gradient by Francesco Orabona. This paper introduces a new regularizer for Mirror Descent and shows that it adapts automatically to the norm of the unknown comparator. In my opinion we know very few interesting regularizers for MD (especially in the full-information setting) and any non-trivial addition to this set seems difficult. This paper manages to do just that. It would be interesting to see if this gives new insights for bandit regularizers.

- Minimax Optimal Algorithms for Unconstrained Linear Optimization by Brendan McMahan and Jacob Abernethy. My advice for this paper is to look at Section 3.2 which gives one very concrete and very cool application of the method they develop.

 

Other

I’m running out of stamina so I will now just list the other papers that I found very interesting.

- Density estimation from unweighted k-nearest neighbor graphs: a roadmap by von Luxburg and Alamgir. 

- Stochastic blockmodel approximation of a graphon: Theory and consistent estimation by Airoldi, Costa and Chan. 

- Near-Optimal Entrywise Sampling for Data Matrices by Achlioptas, Karnin and Liberty.

- Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic by Sharpnack, Krishnamurthy and Singh.

- Estimating the Unseen: Improved Estimators for Entropy and other Properties by Valiant and Valiant. 

- Information-theoretic lower bounds for distributed statistical estimation with communication constraints by Zhang, Duchi, Jordan and Wainwright.


The hunter and the rabbit

In this post I will tell you the story of the hunter and the rabbit. To keep you interested let me say that in the story we will see a powerful (yet trivial) inequality that I learned today from Yuval Peres, which he calls a reversed second moment method (we will also review the basic second moment method).

We consider a sequential game between a hunter and a rabbit. At each time step t \geq 0 the hunter (respectively the rabbit) is at a certain vertex H_t (respectively R_t) in the cycle C_n (with vertices labeled \{0, \hdots, n-1\}). We impose the constraint that H_{t+1} and H_t must be neighbors, that is the hunter can only move one step at a time. We impose no restriction on the rabbit (he can jump around like a real rabbit). Let \tau = \min \{t \geq 0 : R_t = H_t \} be the capture time. We allow for randomized strategies for the two players. Of course the hunter wants to minimize the expected capture time while the rabbit wants to maximize it.

December 3rd clarification: the processes R_t and H_t are independent, in other words the hunter does not see the rabbit and vice versa (the game would be trivial otherwise!).

Theorem [Adler et al., 2003]: The hunter can always ensure \mathbb{E} \tau \leq c_1 n \log n and the rabbit can always ensure \mathbb{E} \tau \geq c_2 n \log n.

We will now sketch the proof of this beautiful theorem following the method of [Babichenko et al. 2012]. Quite intuitively it is enough to show that there exists a strategy for the hunter that ensures for any deterministic rabbit \mathbb{P}(\tau < n) \geq c_1' / \log n, and respectively that there exists a strategy for the rabbit that ensures for any deterministic hunter \mathbb{P}(\tau < n ) \leq c_2' / \log n (see Lemma 2.2. here for a formal proof of this).

The hunter’s strategy

Let a, b be independent uniform random variables on [0,1]. The location of the hunter at time t is H_t = \lceil a n + b t \rceil \ \mathrm{mod} \ n. In some sense this hunter moves at a random speed determined by b and in a random direction determined by a. The key idea to analyze this strategy is to consider the number of collisions before time n, namely K_n = \sum_{t=0}^{n-1} \mathbf{1}\{R_t = H_t\}. Clearly

    \[\mathbb{P}( \tau < n ) = \mathbb{P}(K_n > 0) \geq \frac{(\mathbb{E} (K_n) )^2}{\mathbb{E} (K_n^2)},\]

where the inequality is a ‘strong’ form of the second moment method which can be proved with Cauchy-Schwarz as follows (recall that the basic second moment method simply says \mathbb{P}(K_n = 0) \leq \mathbb{P}(|K_n - \mathbb{E} (K_n)| \geq |\mathbb{E} (K_n)|) \leq \mathrm{Var}(K_n) / (\mathbb{E} (K_n))^2):

    \[(\mathbb{E} (K_n))^2 = (\mathbb{E} (\mathbf{1}\{K_n \neq 0\} K_n))^2 \leq \mathbb{E}(\mathbf{1}\{K_n \neq 0\}^2) \mathbb{E} (K_n^2) = \mathbb{P}(K_n \neq 0) \mathbb{E} (K_n^2) .\]

Now it is just a matter of computing \mathbb{E} K_n and \mathbb{E} K_n^2 which can be done very easily, see page 9 here. 
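For readers who want to see the n \log n behavior empirically, here is a minimal Monte Carlo sketch of the hunter's side (the names are mine; redrawing (a, b) every n steps corresponds to the restarting argument of Lemma 2.2 mentioned above):

    import math, random

    def capture_time(n, rabbit, max_steps=10**7):
        """Simulate the hunter H_t = ceil(a*n + b*t) mod n, with fresh
        (a, b) every n steps, against a deterministic rabbit t -> R_t."""
        for t in range(max_steps):
            if t % n == 0:                       # restart with fresh randomness
                a, b = random.random(), random.random()
            hunter = math.ceil(a * n + b * (t % n)) % n
            if hunter == rabbit(t):
                return t
        return max_steps                         # not captured within the horizon

Averaging capture_time(n, rabbit) over many runs, for your favorite deterministic rabbit, should grow like n \log n.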

The rabbit’s strategy

The rabbit’s strategy is an order of magnitude more complicated/more beautiful. The basic intuition to design a good rabbit’s strategy is the reversed second moment method which is the following trivial inequality:

    \[\mathbb{P}( \tau < n) = \mathbb{P}(K_n > 0) \leq \frac{\mathbb{E}(K_{2 n})}{\mathbb{E} (K_{2 n} | K_n > 0)} .\]
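The inequality follows in one line since K_{2n} is non-negative:

    \[\mathbb{E}(K_{2 n}) \geq \mathbb{E}\left(K_{2 n} \mathbf{1}\{K_n > 0\}\right) = \mathbb{P}(K_n > 0) \, \mathbb{E}(K_{2 n} \, | \, K_n > 0) .\]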

Let us look into this last term in more detail. Of course the rabbit's location at a given time step should be uniformly distributed, thus the numerator will be equal to 2 (recall that here the hunter is deterministic). Thus we want to maximize the denominator; in other words, once the rabbit has collided with the hunter, he should try to collide as much as possible after that! This is in my opinion a truly beautiful intuition. The way to achieve this proposed in [Babichenko et al. 2012] goes as follows. Let U be a random variable uniformly distributed in \{0, \hdots, n-1\} and Z=(X,Y) be a simple random walk (independent of U) in \mathbb{Z}^2. Let T_0 = 0 and for k \geq 0,

    \[T_{k+1} = \inf\{ t \geq T_k : Y_t = k+1\} .\]

The rabbit is then defined by R_t = (X_{T_t} + U) \ \mathrm{mod} \ n (with R_0 = U). Using basic properties of the simple random walk in 2 dimensions one can analyze the term \mathbb{E} (K_{2 n} | K_n > 0) in a few lines, see page 7 here. I suspect that one could come up with other rabbit strategies that achieve the same expected capture time.
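To make the construction concrete, here is a sketch of the rabbit's trajectory (simulation only, and the names are mine; the level-hitting times T_k are almost surely finite but have infinite mean, so this is slow for large n):

    import random

    def rabbit_positions(n):
        """Rabbit strategy: R_t = (X_{T_t} + U) mod n, where (X, Y) is a
        simple random walk on Z^2 and T_k is the first time Y hits level k."""
        U = random.randrange(n)
        positions = [U]                  # R_0 = U since T_0 = 0 and X_0 = 0
        x = y = level = 0
        while len(positions) <= n:       # we need R_0, ..., R_n
            dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
            x += dx
            y += dy
            if y == level + 1:           # the time T_{level+1} has been reached
                level += 1
                positions.append((x + U) % n)
        return positions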


Guest post by Dan Garber and Elad Hazan: The Conditional Gradient Method, A Linear Convergent Algorithm – Part II/II

The goal of this post is to develop a Conditional Gradient method that converges exponentially fast while basically solving only a linear minimization problem over the domain \mathcal{X} on each iteration.

To this end we consider the following relaxation of the problem \min_{y\in\mathcal{X}\cap\mathbb{B}_{r_t}(x_t)}\nabla{}f(x_t)^{\top}y which we term a Local Linear Oracle (LLO).

Definition 1: A LLO with parameter \rho for the convex domain \mathcal{X} is a procedure that given a point x\in\mathcal{X}, a scalar r>0 and a linear objective c\in\mathbb{R}^n returns a point y\in\mathcal{X} that satisfies:

1. \forall{z\in\mathcal{X}\cap\mathbb{B}_{r}(x)}:\, c^{\top}y \leq c^{\top}z

2. \Vert{x - y}\Vert_2 \leq \rho\cdot{}r

 

Clearly by taking y_t to be the output of such an LLO and choosing \gamma_t appropriately we would get exponentially fast convergence of the form \exp\left({-O\left({\frac{t}{Q\rho^2}}\right)}\right) (see Part I for more details).

In the following we show that a procedure for a LLO could be constructed for any polytope such that each call to the LLO amounts to a single linear minimization step over the domain \mathcal{X} and we specify the parameter \rho.

As a simple warm-up case, let us consider the very specific case in which \mathcal{X} = \Delta_n, that is the domain is just the n-dimensional simplex. A LLO in this case could be constructed by solving the following optimization problem,

    \begin{eqnarray*} &&\min_{y\in\Delta_n}y^{\top}c \\ && \textrm{s.t. } \Vert{x-y}\Vert_1 \leq \sqrt{n}r \end{eqnarray*}

where \Vert{x-y}\Vert_1 = \sum_{i=1}^n\vert{x(i)-y(i)}\vert.

Denote by y^* the optimal solution to the above problem. Since every z\in\Delta_n\cap\mathbb{B}_r(x) satisfies \Vert{x-z}\Vert_1 \leq \sqrt{n}\Vert{x-z}\Vert_2 \leq \sqrt{n}r, every such z is feasible for the above problem and thus c^{\top}y^* \leq c^{\top}z for all z\in\Delta_n\cap\mathbb{B}_r(x). Moreover it holds that \Vert{y^*-x}\Vert_2 \leq \Vert{y^*-x}\Vert_1 \leq \sqrt{n}r. Thus solving the above \ell_1-constrained linear problem yields a LLO for the simplex with parameter \rho = \sqrt{n}. Most importantly, the above \ell_1-constrained problem is solvable using only a single linear minimization step over \Delta_n and additional computation that is polynomial in the number of non-zeros in the input point x. To see this, observe that y^* is just the outcome of taking a total weight of \sqrt{n}r/2 from the non-zero entries of x that correspond to the largest (signed) entries in the vector c and moving it entirely to a single entry that corresponds to the smallest (signed) entry in c. This just requires checking, for each non-zero entry in x, the value of the corresponding entry in c, sorting these values and reducing weights until a total weight of \sqrt{n}r/2 has been reduced. Finding the smallest entry in c, although a trivial operation, is just a linear minimization problem over the simplex.
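A minimal Python sketch of this procedure (the function name and the greedy loop are my choices; moving a weight w changes the \ell_1 distance by 2w, whence the budget of \sqrt{n}r/2):

    import numpy as np

    def simplex_llo(x, c, r):
        """Sketch of the LLO for the simplex: move a total weight of
        sqrt(n)*r/2 off the non-zero coordinates of x with the largest
        entries of c, onto the single coordinate minimizing c."""
        n = len(x)
        best = np.argmin(c)                        # linear minimization over the simplex
        budget = min(np.sqrt(n) * r / 2.0, 1.0)    # cannot move more than the total mass
        y = x.astype(float)
        # shave weight off the worst (largest c) non-zero coordinates first
        for i in sorted(np.nonzero(x)[0], key=lambda j: c[j], reverse=True):
            if i == best or budget <= 0:
                continue
            w = min(y[i], budget)
            y[i] -= w
            y[best] += w
            budget -= w
        return y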

What about the general case in which \mathcal{X} is some arbitrary polytope? We would like to generalize the construction for the simplex to arbitrary polytopes.

Given the input point to the LLO x\in\mathcal{X}, let us write x as a convex combination of vertices of the polytope, that is x = \sum_i\lambda_iv_i where \lambda_i > 0, \sum_i\lambda_i=1 and each v_i is a vertex of \mathcal{X}. Suppose now that there exists a constant \mu = \mu(\mathcal{X}) such that any point z\in\mathcal{X} which satisfies \Vert{x-z}\Vert_2 \leq r can also be written as a convex combination of vertices of the polytope, z=\sum_i\gamma_iv_i, such that \Vert{\lambda - \gamma}\Vert_1 \leq \mu\cdot{}r. Denote by m the number of vertices of \mathcal{X} and by \lbrace{v_i}\rbrace_{i=1}^m the set of these vertices. Consider the following optimization problem,

    \begin{eqnarray*} &&\min_{\gamma\in\Delta_m}\sum_{i=1}^m\gamma_iv_i^{\top}c \\ && \textrm{s.t. } \Vert{\lambda-\gamma}\Vert_1 \leq \mu\cdot{}r \end{eqnarray*}

Since the above problem is again a linear \ell_1-constrained optimization problem over the simplex we know that it is solvable using only a single call to the linear oracle of \mathcal{X} plus time that is polynomial in the number of non-zeros in \lambda (which is at most t+1 after t iterations of the algorithm and in particular does not depend on m which may be exponential in n). Let \gamma^* be an optimal solution to the above problem and let y^* = \sum_{i=1}^m\gamma^*_iv_i. Then clearly from our definition of \mu we have that \forall{z}\in\mathcal{X}\cap\mathbb{B}_r(x) it holds that c^{\top}y^* \leq c^{\top}z. Also it is not hard to show that \Vert{x-y^*}\Vert_2 \leq \mu{}Dr. Thus if indeed such a constant \mu exists then we have a construction of a LLO with parameter \rho = \mu{}D that requires only a single call to a linear optimization oracle of \mathcal{X}.

The following theorem states the convergence rate of the proposed method.

Theorem 1: Given a polytope \mathcal{X} with parameter \mu as defined above, there exists an algorithm that after t linear minimization steps over the domain \mathcal{X} produces a point x_t\in\mathcal{X} such that

    \begin{eqnarray*} f(x_t) - f(x^*) \leq \left({f(x_0)-f(x^*)}\right)e^{-\frac{1}{4Q\mu^2D^2}t} \end{eqnarray*}

The following theorem states that indeed a constant \mu as suggested above exists for any polyhedral set and gives its dependence on geometric properties of the polytope.

Theorem 2: Let \mathcal{X} be a polytope defined as \mathcal{X} = \lbrace{x\in\mathbb{R}^n \, | \, A_1x = b_1, \, A_2x \leq b_2}\rbrace such that A_2\in\mathbb{R}^{m_2\times{n}}. Assume that there exist parameters \psi >0,\xi >0 such that:

1. for any matrix M which consists of at most n linearly independent rows of A_2 it holds that \Vert{M}\Vert_2 \leq \psi (here \Vert{M}\Vert_2 denotes the spectral norm of M).

2. For every vertex v of \mathcal{X} and every row A_2(i) of A_2 it holds that either A_2(i)^{\top}v = b_2(i) or A_2(i)^{\top}v \leq b_2(i) - \xi (that is, given a vertex and an inequality constraint, the vertex either satisfies the constraint with equality or is far from satisfying it with an equality).

Then \mu \leq \frac{\sqrt{n}\psi}{\xi}.

We now turn to point out several examples of polyhedral sets for which tailored combinatorial algorithms for linear optimization exist and for which the bound on \mu given in Theorem 2 is reasonable.

 

The simplex

The n-dimensional simplex can be written in the form of linear constraints as \Delta_n = \lbrace{x \, | \, -\textbf{I}x \leq \vec{0}, \, \vec{1}\cdot{}x = 1}\rbrace. It's not hard to see that for the simplex it holds that \psi = \xi = 1 and thus by Theorem 2 \mu \leq \sqrt{n}. In particular we get that when applying the proposed algorithm to the problem \min_{x\in\Delta_n}\Vert{x}\Vert_2^2 an error \epsilon is achieved after O(n\log(1/\epsilon)) iterations, which, as stated here, is nearly optimal.

 

The flow polytope

Given a directed acyclic graph with m edges, a source node marked s and a target node marked t, every path from s to t in the graph can be represented by its identifying vector, that is a vector in \lbrace{0,1}\rbrace^m in which the entries that are set to 1 correspond to edges of the path. The flow polytope of the graph is the convex-hull of all such identifying vectors of the simple paths from s to t. This polytope is also exactly the set of all unit s-t flows in the graph if we assume that each edge has a unit flow capacity (a flow is represented here as a vector in \mathbb{R}^m in which each entry is the amount of flow through the corresponding edge). For this polytope also it is not hard to see that \psi = \xi = 1 and thus \mu \leq \sqrt{n}. Since the flow polytope is just the convex-hull of s-t paths in the graph, minimizing a linear objective over it amounts to finding a minimum weight path given weights for the edges.

 

The doubly-stochastic matrices polytope

Doubly-stochastic matrices are square real-valued matrices with non-negative entries in which the sum of the entries of each row and each column amounts to 1. Writing down the linear inequalities and equalities that define this polytope yields that here also \mu \leq \sqrt{n}.

The Birkhoff-Von Neumann theorem states that this polytope is the convex hull of exactly all n\times{n} permutation matrices. Since a permutation matrix corresponds to a perfect matching in a fully connected bipartite graph, linear minimization over this polytope corresponds to finding a minimum weight perfect matching in a bipartite graph.

 

Matroid polytopes

A matroid is a pair (E,I) where E is a set of elements and I is a set of subsets of E, called the independent sets, which satisfy various interesting properties that resemble the concept of linear independence in vector spaces.

Matroids have been studied extensively in combinatorial optimization and a key example of a matroid is the graphical matroid, in which the set E is the set of edges of a given graph and the set I is the set of all subsets of E which are cycle-free. In particular I in this case contains all the spanning trees of the graph. A subset S\in{I} can be represented by its identifying vector, which lies in \lbrace{0,1}\rbrace^{\vert{E}\vert}; this gives rise to the matroid polytope, which is just the convex hull of all identifying vectors of sets in I. It can be shown that the matroid polytope can be defined by exponentially many linear inequalities (exponential in \vert{E}\vert), which makes optimization over it a non-trivial issue since the ellipsoid method needs to be used, which is highly impractical. Moreover a separation oracle for this polytope which runs in time polynomial in \vert{E}\vert exists, however it is also fairly complicated. On the contrary, linear optimization over matroid polytopes is very easy using a simple greedy procedure which runs in nearly linear time. It can be shown that for the matroid polytope it holds that \xi =1 and \psi < n^{3/2} where n = \vert{E}\vert. Thus \mu < n^{3/2}.

An interesting application of the above method, presented here, which was also our initial motivation for studying this problem, is to the setting of online convex optimization (sometimes termed online learning). Combining the above algorithm with the analysis of an online Frank-Wolfe method presented in this paper by Hazan and Kale from ICML 2012 results in an algorithm for online convex optimization with an iteration complexity that amounts to a single linear optimization step over the domain instead of a projection computation, which can be much more involved. This algorithm has optimal regret in terms of the game length, which answers an open question posed by Kalai and Vempala in this paper from COLT 2003 and by Hazan and Kale from ICML 2012. Further applications of the method include Frank-Wolfe-like algorithms for stochastic and non-smooth optimization.

 

We end this post with some further research questions.

The method presented in this post clearly holds only for “nicely shaped” polytopes because of the dependency of the constant \mu on \psi, \xi. In particular if we take \mathcal{X} to be the euclidean unit ball, which can be defined by infinitely many linear inequalities, we will have \xi = 0 and the analysis breaks down. So, an interesting open question is

Question 3: Is there a CG method with a linear convergence rate for smooth and strongly-convex optimization over arbitrary convex sets? In particular is the rate suggested in Question 2 (see Part I) attainable?


Guest post by Dan Garber and Elad Hazan: The Conditional Gradient Method, A Linear Convergent Algorithm – Part I/II

In a previous post Sebastien presented and analysed the conditional gradient method for minimizing a smooth convex function f over a compact and convex domain \mathcal{X}. The update step of the method is as follows,

    \begin{eqnarray*} y_t \in \arg\min_{y\in\mathcal{X}}\nabla{}f(x_t)^{\top}y \\ x_{t+1} = (1-\gamma_t)x_t + \gamma_ty_t \end{eqnarray*}

where \forall{t}, \gamma_t\in(0,1).
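For reference, the basic method fits in a few lines of Python (a sketch; the oracle and function names are mine, and the step size \gamma_t = 2/(t+1) is the choice discussed below):

    import numpy as np

    def conditional_gradient(grad_f, linear_oracle, x0, num_iters):
        """Basic conditional gradient (Frank-Wolfe): minimize the linearized
        objective over the domain, then take a convex combination step."""
        x = x0
        for t in range(1, num_iters + 1):
            y = linear_oracle(grad_f(x))      # argmin over X of <grad f(x), y>
            gamma = 2.0 / (t + 1)
            x = (1 - gamma) * x + gamma * y
        return x

    def simplex_linear_oracle(c):
        """Linear minimization over the simplex: the best vertex e_i."""
        y = np.zeros_like(c, dtype=float)
        y[np.argmin(c)] = 1.0
        return y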

The appeal of the conditional gradient method is two fold:

i) the update step of the method requires only minimizing a linear objective over the domain, which for many domains of interest is computationally cheap (examples are various polytopes that arise in combinatorial optimization such as the flow polytope, the spanning tree polytope and the matching polytope, and the set of unit-trace positive semidefinite matrices), whereas other first order methods require computing a projection onto the convex set with respect to a certain norm on each iteration, which can be computationally much more expensive, and

ii) the method yields sparse solutions in the sense that after t iterations and provided that the first iterate is a vertex of the convex domain, the current iterate is naturally given as the convex sum of at most t+1 vertices of the convex domain (for the simplex this means at most t+1 non-zero entries and for the set of unit-trace psd matrices this means rank at most t+1).

If f is \beta-smooth and the diameter of \mathcal{X} is D (i.e. \max_{x,y\in\mathcal{X}}\Vert{x-y}\Vert_2 = D) then choosing \gamma_t = \frac{2}{t+1} guarantees that f(x_t) - f(x^*) \leq \frac{2\beta{}D^2}{t+1}. This is also the convergence rate of projected gradient descent for smooth optimization. In case the objective function f is both \beta-smooth and \alpha-strongly convex (with respect to the same norm) it is known that projected gradient descent has an error that decreases exponentially fast: an additive error \epsilon is guaranteed after at most \tilde{O}\left({Q\log{\frac{1}{\epsilon}}}\right) iterations where Q = \beta/\alpha is the condition number of f (the complexity notation \tilde{O}() is the same as O() but omits factors that are logarithmic in the diameter of \mathcal{X} and in \beta). What about the conditional gradient method in this case? So our first question for this post is the following,

Question 1: Is there a CG method that given a \beta-smooth and \alpha-strongly convex function f guarantees an \epsilon additive error after at most \tilde{O}\left({Q\log{\frac{1}{\epsilon}}}\right) linear optimization steps over the domain? 

The answer is no, as observed here. In particular the convergence rate of a linearly converging CG method must depend polynomially on the dimension which is not the case for the projected gradient descent method. This brings us to our second question,

Question 2: Is there a CG method that given a \beta-smooth and \alpha-strongly convex function f guarantees an \epsilon additive error after at most \textrm{poly}(n)\cdot\tilde{O}\left({Q\log{\frac{1}{\epsilon}}}\right) linear optimization steps over the domain?

Note that despite the \textrm{poly}(n) factor such a result is still interesting since the time to compute a euclidean projection onto the domain (or non-euclidean in the case of mirror descent methods) may be longer than the time to minimize a linear objective over the domain by a \textrm{poly}(n) multiplicative factor.

Here is the place to remark that several linearly converging CG methods for smooth and strongly convex optimization were proposed before but they rely on assumptions regarding the location of x^* with respect to the boundary of \mathcal{X}. For example if x^* lies in the interior of \mathcal{X} (which means that the problem is an unconstrained one) then the original CG method due to Frank and Wolfe converges exponentially fast, however the number of iterations depends polynomially on the distance of x^* from the boundary, see this paper by Guélat and Marcotte. In case \mathcal{X} is a polytope then a modification of the CG method presented in the same paper, gives a linear convergence rate that is polynomial in the distance of x^* from the boundary of the smallest facet of \mathcal{X} that contains x^*. Here however we don’t want to rely on such strong assumptions on the location of x^* and we aim at a linearly converging method that is free from such restrictions.

 

In the rest of this post we follow our new paper, which describes a new CG method for smooth and strongly convex optimization with a convergence rate of the form stated in Question 2 in case \mathcal{X} is a polyhedral set. The convergence rate will depend on geometric properties of the set. The latter dependence is very reasonable for many polytopes that arise in combinatorial optimization problems; indeed, domains for which fast and simple combinatorial algorithms for linear optimization exist are part of what makes CG methods an attractive approach for non-linear optimization.

To begin the derivation of this new CG method, let us recall that as Sebastien showed here, the conditional gradient method satisfies the following inequality for each iteration t,

    \begin{eqnarray*} \epsilon_{t+1} \leq (1-\gamma_t)\epsilon_t + \frac{\gamma_t^2\beta}{2}\Vert{y_t-x_t}\Vert^2 \end{eqnarray*}

where \epsilon_t = f(x_t) - f(x^*).

The fact that \Vert{y_t-x_t}\Vert might be as large as the diameter of \mathcal{X} while the approximation error \epsilon_t may be arbitrarily small forces one to choose step sizes \gamma_t that decrease roughly like 1/t in order to get the known 1/t convergence rate.

Let us now consider the case that f is also \alpha-strongly convex. That is,

    \begin{eqnarray*} f(y) \geq f(x) + \nabla{}f(x)^{\top}(y-x) + \frac{\alpha}{2}\Vert{x-y}\Vert^2 \end{eqnarray*}

In particular the above inequality implies that,

    \begin{eqnarray*} \forall{x\in\mathcal{X}}\quad f(x) - f(x^*) \geq \frac{\alpha}{2}\Vert{x-x^*}\Vert^2 \end{eqnarray*}

Now assume that the iterate x_t satisfies that f(x_t)-f(x^*) \leq \epsilon_t for some \epsilon_t \geq 0 and denote by \mathbb{B}_r(x) the euclidean ball of radius r centred at point x. Define r_t = \sqrt{\frac{2\epsilon_t}{\alpha}} and let us now choose y_t as the optimal solution to the following problem,

    \begin{eqnarray*} \min_{y\in\mathcal{X}\cap\mathbb{B}_{r_t}(x_t)}\nabla{}f(x_t)^{\top}y \end{eqnarray*}

Note that the CG inequality still holds since by strong convexity we have that x^*\in\mathbb{B}_{r_t}(x_t), and the only thing required of y_t for the CG inequality to hold is that \nabla{}f(x_t)^{\top}y_t \leq \nabla{}f(x_t)^{\top}x^*. We now get that,

    \begin{eqnarray*} f(x_{t+1}) - f(x^*) \leq \left({1-\gamma_t + \gamma_t^2Q}\right)\epsilon_t \end{eqnarray*}

Thus choosing a constant step size like \gamma_t = \frac{1}{2Q} and an initial error bound \epsilon_1 \geq f(x_1)-f(x^*) results in exponentially fast convergence.
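Putting these pieces together gives the following idealized scheme (a sketch, assuming access to a hypothetical local_oracle(x, r, c) returning \mathrm{argmin}_{y\in\mathcal{X}\cap\mathbb{B}_{r}(x)} c^{\top}y):

    import numpy as np

    def idealized_cg(grad_f, local_oracle, x1, alpha, beta, eps1, num_iters):
        """Linearly convergent CG sketch: solve the linear problem over the
        intersection of X with a ball of radius r_t = sqrt(2*eps_t/alpha),
        then take the constant step gamma = 1/(2Q)."""
        Q = beta / alpha
        gamma = 1.0 / (2.0 * Q)
        x, eps = x1, eps1                     # eps1 >= f(x1) - f(x*)
        for _ in range(num_iters):
            r = np.sqrt(2.0 * eps / alpha)
            y = local_oracle(x, r, grad_f(x))
            x = (1 - gamma) * x + gamma * y
            eps *= 1 - gamma + gamma**2 * Q   # here equal to 1 - 1/(4Q)
        return x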

The problem with the above approach is of course that now the optimization problem that needs to be solved on each iteration is no longer a linear minimization step over the original domain \mathcal{X} but a much more difficult problem due to the additional constraint on the distance of y_t from x_t. What we would like to get is an exponentially fast converging method that still requires to solve only a linear minimization problem over \mathcal{X} on each iteration. This will be the subject of our next post.


5 announcements

First two self-centered announcements:

A new problem around subgraph densities

Nati Linial and I have just uploaded our first paper together, titled ‘On the local profiles of trees‘. Some background on the paper: recently there has been a lot of interest in subgraph densities for very large graphs, mainly because of their importance for the emerging theory of graph limits, see here for a bunch of videos on this, here for a nice survey by Lovasz, and here for the book version of this survey. A basic issue is that we know almost nothing about subgraph densities. Consider for instance the set \Delta(k) of distributions on k-vertex subgraphs induced by very large graphs (that is, a distribution p \in \Delta(k) corresponds to a large graph G in the sense that for a k-vertex subgraph H, p(H) is the probability that by picking k vertices at random in G one obtains H as the induced subgraph). We know very little about \Delta(k). It is non-convex (think of the complete graph, then the empty graph, and then try to take convex combinations of the corresponding distributions). Even worse than that, Hatami and Norine proved that it is undecidable to determine the validity of a linear inequality for the set \Delta(k). Alright, so subgraph densities in general graphs are in some sense intractable. Can we make some simplifying assumptions and recover tractability? It turns out that the answer is essentially yes if you restrict your attention to trees and subtrees! For instance we prove in our paper with Nati that in this case the set of possible distributions becomes convex! We also initiate the study of the defining inequalities for this set, but we are still far from a complete picture. The paper contains a list of 7 open problems, and I strongly recommend reading our short paper and trying to solve some of them: they are really fun to work on!

New paper on the multi-armed bandit

Che-Yu Liu (my first graduate student) and I uploaded a few weeks ago our first paper together, titled ‘Prior-free and prior-dependent regret bounds for Thompson Sampling‘. Let me say that there are still plenty of open questions around the theme developed in this paper. The abstract reads as follows: We consider the stochastic multi-armed bandit problem with a prior distribution on the reward distributions. We are interested in studying prior-free and prior-dependent regret bounds, very much in the same spirit as the usual distribution-free and distribution-dependent bounds for the non-Bayesian stochastic bandit. Building on the techniques of Audibert and Bubeck [2009] and Russo and Van Roy [2013] we first show that Thompson Sampling attains an optimal prior-free bound in the sense that for any prior distribution its Bayesian regret is bounded from above by 14 \sqrt{n K}. This result is unimprovable in the sense that there exists a prior distribution such that any algorithm has a Bayesian regret bounded from below by \frac{1}{20} \sqrt{n K}. We also study the case of priors for the setting of Bubeck et al. [2013] (where the optimal mean is known as well as a lower bound on the smallest gap) and we show that in this case the regret of Thompson Sampling is in fact uniformly bounded over time, thus showing that Thompson Sampling can greatly take advantage of the nice properties of these priors.

 

Next are three announcements related to different workshops/conferences:

Simons-Berkeley Research Fellowship 2014-2015

The Simons Institute at UC Berkeley has just posted its call for the Fellowships for next year. Next year's programs are “Algorithmic Spectral Graph Theory”, “Algorithms & Complexity in Algebraic Geometry” and “Information Theory”. The deadline is December 15. Check out what Justin Thaler has to say about his experience as a research fellow this semester. Let me add that I am also enjoying my stay there very much, and the paper with Nati mentioned above would not have been possible without the Simons Institute.

COLT 2014

The call for papers for COLT 2014 is out, see the official website here. Let me also remind you that this edition will be in Barcelona, which should be a lot of fun.

Mathematics of Machine Learning

Together with Nicolo Cesa-Bianchi, Gabor Lugosi, and Sasha Rakhlin, we are organizing a special program in Barcelona on the Mathematics of Machine Learning from April 7, 2014 to July 14, 2014, see the website for all the details. We are still in the process of inviting people but if you are interested in participating please feel free to send us an email. We should also have soon more details on the large workshop that will take place right after COLT.
