The preferential attachment model, introduced in 1992 by Mahmoud and popularized in 1999 by Barabási and Albert, has attracted a lot of attention in the last decade. In its simplest form it describes the evolution of a random tree. Formally we denote by $PA(n)$ the preferential attachment tree on $n$ vertices, which is defined by induction as follows. First $PA(2)$ is the unique tree on two vertices. Then, given $PA(n)$, $PA(n+1)$ is formed from $PA(n)$ by adding a new vertex $u$ and a new edge $uv$ where $v$ is selected at random among the vertices of $PA(n)$ according to the following probability distribution:

$$\mathbb{P}(v = i) = \frac{d_{PA(n)}(i)}{2(n-1)},$$

where $d_T(i)$ denotes the degree of vertex $i$ in a tree $T$ (the normalization holds since a tree on $n$ vertices has $n-1$ edges, hence total degree $2(n-1)$). In other words vertices of large degree are more likely to attract the new nodes. This model of evolution is argued to be a good approximation for things such as a network of citations, or the internet network.
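The evolution rule above is easy to simulate. Here is a minimal sketch (my own code, with hypothetical function names, not from any of the papers discussed): degree-proportional sampling is done by drawing a uniform endpoint from the edge list, since each vertex appears there once per incident edge.

```python
import random
from collections import Counter

def pa_tree(n, seed_edges=((0, 1),)):
    """Grow a preferential attachment tree on n vertices, starting from the
    given seed (default: the unique tree on two vertices)."""
    edges = list(seed_edges)
    # Every vertex appears in this list once per incident edge, so a uniform
    # draw picks a vertex with probability proportional to its degree.
    endpoints = [v for e in edges for v in e]
    for new in range(max(endpoints) + 1, n):
        target = random.choice(endpoints)
        edges.append((new, target))
        endpoints.extend((new, target))
    return edges

random.seed(0)
n = 10_000
edges = pa_tree(n)
deg = Counter(v for e in edges for v in e)
frac_leaves = sum(1 for d in deg.values() if d == 1) / n
print(len(edges), round(frac_leaves, 3))  # n-1 edges; roughly 2/3 of vertices are leaves
```

The fraction of leaves hovering around $2/3 = 4/(1 \cdot 2 \cdot 3)$ is exactly what the power-law degree theorem predicts for degree-one vertices.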

One of the main reasons for the success of the preferential attachment model is the following theorem, which shows that the degree distribution in $PA(n)$ follows a power law, a feature that many real-world networks (such as the internet) exhibit but which is not reproduced by standard random graph models such as the Erdős–Rényi model.

Theorem [Bollobás, Riordan, Spencer and Tusnády (2001)]: Let $k \geq 1$ be fixed. Then as $n \to \infty$, the proportion of vertices with degree $k$ tends in probability to

$$\frac{4}{k(k+1)(k+2)}.$$
While the above theorem is a fine and interesting mathematical result, I do not view it as the critical aspect of the preferential attachment model (note that Wikipedia disagrees). In my opinion $PA(n)$ is interesting simply because of its natural rule of evolution.

Now think about the application of the PA model to the internet. Of course there are a few obvious objections, such as the fact that in $PA(n)$ a website can only link to one other website. While this is clearly unrealistic, I think that $PA(n)$ still contains the essence of what one would like to capture to model the evolution of the internet. However there is one potentially important aspect which is overlooked: in the early days of the internet the PA model was probably very far from being a good approximation to the evolution of the network. It is perhaps reasonable to assume that after 1995 the network was evolving according to PA, but certainly from 1970 to 1995 the evolution followed fundamentally different rules. This observation suggests studying the preferential attachment model *with a seed*.

Thus we are now interested in $PA(n, S)$, where $S$ is a finite seed tree. Formally $PA(n, S)$ is also defined by induction, where $PA(|S|, S) = S$ and $PA(n+1, S)$ is formed from $PA(n, S)$ as before. A very basic question which seems to have been overlooked in the literature is the following: **what is the influence of the seed $S$ on $PA(n, S)$ as $n$ goes to infinity?**

In our recent joint work with Elchanan Mossel and Miklós Rácz we looked exactly at this question. More precisely we ask the following: given two seed trees $S_1$ and $S_2$, do the distributions of $PA(n, S_1)$ and $PA(n, S_2)$ remain separated (say in total variation distance) as $n$ goes to infinity? In other words we are interested in the following quantity:

$$\delta(S_1, S_2) = \lim_{n \to \infty} \mathrm{TV}\big(PA(n, S_1), PA(n, S_2)\big).$$

A priori it could be that $\delta(S_1, S_2) = 0$ for any $S_1$ and $S_2$, which would mean that the seed has no influence and that the preferential attachment "forgets" its initial conditions. We prove that this is far from true:

Theorem [Bubeck, Mossel and Rácz (2014)]: Let $S_1$ and $S_2$ be two finite trees on at least $3$ vertices. If the degree distributions of $S_1$ and $S_2$ are different, then $\delta(S_1, S_2) > 0$.

If I wanted to make a bold statement I could say that this theorem implies the following: by looking at the internet network today, one can still “see” the influence of the topological structure of the internet back in the 90′s. In other words to a certain extent one can go back in time and potentially infer some properties that people may have believed to be lost (perhaps some army’s secrets hidden in the structure of the ARPANET?). Of course at this stage this is pure science fiction, but the theorem certainly leaves that possibility open. Note that we believe that the theorem can even be strengthened to the following statement:

Conjecture: Let $S_1$ and $S_2$ be two finite trees on at least $3$ vertices. If $S_1$ and $S_2$ are non-isomorphic, then $\delta(S_1, S_2) > 0$.

These statements show that even when $n$ is large one can still “see” in $PA(n, S)$ the influence of the original seed $S$. However by considering the total variation distance we allow global statistics that depend on the entire tree. What about local statistics that could be computed by an agent looking only at a finite neighborhood around her? Mathematically this question can be interpreted in the framework of the Benjamini–Schramm limit. Recall that a sequence of random graphs $(G_n)$ tends to a random infinite rooted tree $(T, \rho)$ (where $\rho$ is the random root) if for any $r \geq 0$, the ball of radius $r$ around a uniformly chosen random vertex in $G_n$ tends in distribution to the ball of radius $r$ around $\rho$ in $T$. In other words when $n$ is large enough a random agent cannot tell if she is in $G_n$ or in $T$ by looking at a finite neighborhood around her. One has the following theorem for the weak limit of the PA model:

Theorem [Berger, Borgs, Chayes and Saberi (2014)]: The Benjamini–Schramm limit of $PA(n)$ is the Pólya-point graph with $m = 1$.

We extend this result to an arbitrary seed and show that locally the seed has no influence:

Theorem [Bubeck, Mossel and Rácz (2014)]: For any seed tree $S$, the Benjamini–Schramm limit of $PA(n, S)$ is the Pólya-point graph with $m = 1$.

Thus, while the army’s secret of the 90′s might be at risk if one looks at the overall topology of the current internet network, these secrets are safe from any local agent who would only access a (random) finite part of the network.

These new results on the PA model naturally lead to a host of new problems. We end the paper with a list of 7 open problems; I recommend taking a look at them (and trying to solve them)!


About a year ago I described Nesterov’s Accelerated Gradient Descent in the context of smooth optimization. As I mentioned previously this has been by far the most popular post on this blog. Today I have decided to revisit this post to give a slightly more geometrical proof (though unfortunately still magical in various parts). I will focus on unconstrained optimization of a smooth and strongly convex function $f$ (in the previous post I dealt only with the smooth case). Precisely $f$ is $\alpha$-strongly convex and $\beta$-smooth, and we denote by $Q = \beta/\alpha$ the condition number of $f$. As explained in this post, in this case the basic gradient descent algorithm requires $O(Q \log(1/\epsilon))$ iterations to reach $\epsilon$-accuracy. As we shall see below, Nesterov’s Accelerated Gradient Descent attains the improved oracle complexity of $O(\sqrt{Q} \log(1/\epsilon))$. This is particularly relevant in Machine Learning applications since the strong convexity parameter $\alpha$ often comes from a regularization term, and $1/\alpha$ can be as large as the sample size. Thus reducing the number of steps from “$Q \log(1/\epsilon)$” to “$\sqrt{Q} \log(1/\epsilon)$” can be a huge deal, especially in large scale applications.

Without further ado let’s get to the algorithm, which can be described quite succinctly. Note that everything written below is simply a condensed version of the calculations appearing on pages 71–81 of Nesterov’s 2004 book. Start at an arbitrary initial point $x_1 = y_1$ and then iterate the following equations for $s \geq 1$:

$$y_{s+1} = x_s - \frac{1}{\beta} \nabla f(x_s), \qquad x_{s+1} = \left(1 + \frac{\sqrt{Q} - 1}{\sqrt{Q} + 1}\right) y_{s+1} - \frac{\sqrt{Q} - 1}{\sqrt{Q} + 1}\, y_s.$$
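In code the scheme is only a few lines. Here is a quick numerical sanity check of the two updates (my own illustration, not taken from Nesterov's book) on a simple quadratic with condition number $Q = 100$:

```python
import math

def nesterov_agd(grad, x1, alpha, beta, iters):
    """Iterate the two updates: a gradient step producing y_{s+1},
    then a momentum step producing x_{s+1}."""
    root_q = math.sqrt(beta / alpha)          # sqrt of the condition number Q
    mom = (root_q - 1) / (root_q + 1)
    x, y_prev = list(x1), list(x1)            # x_1 = y_1
    for _ in range(iters):
        g = grad(x)
        y = [xi - gi / beta for xi, gi in zip(x, g)]                  # gradient step
        x = [(1 + mom) * yi - mom * yp for yi, yp in zip(y, y_prev)]  # momentum step
        y_prev = y
    return y_prev

# f(x) = 0.5 * (alpha * x_0^2 + beta * x_1^2), minimized at the origin, Q = 100
alpha, beta = 1.0, 100.0
y = nesterov_agd(lambda x: [alpha * x[0], beta * x[1]], [5.0, 5.0], alpha, beta, 200)
err = math.hypot(y[0], y[1])
print(err)  # after 200 iterations the distance to the minimizer is tiny
```

With $Q = 100$ the theorem below promises a contraction factor of roughly $e^{-1/10}$ per step, and the simulation is consistent with that.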

Theorem: Let $f$ be $\alpha$-strongly convex and $\beta$-smooth; then Nesterov’s Accelerated Gradient Descent satisfies

$$f(y_t) - f(x^*) \leq \frac{\alpha + \beta}{2} \|x_1 - x^*\|^2 \exp\left(-\frac{t-1}{\sqrt{Q}}\right).$$

*Proof:* We define $\alpha$-strongly convex quadratic functions $\Phi_s$, $s \geq 1$, by induction as follows:

(1) $$\Phi_1(x) = f(x_1) + \frac{\alpha}{2}\|x - x_1\|^2, \qquad \Phi_{s+1}(x) = \left(1 - \frac{1}{\sqrt{Q}}\right)\Phi_s(x) + \frac{1}{\sqrt{Q}}\left(f(x_s) + \nabla f(x_s)^\top (x - x_s) + \frac{\alpha}{2}\|x - x_s\|^2\right).$$

Intuitively $\Phi_s$ becomes a finer and finer approximation (from below) to $f$ in the following sense:

(2) $$\Phi_{s+1}(x) \leq f(x) + \left(1 - \frac{1}{\sqrt{Q}}\right)^s \big(\Phi_1(x) - f(x)\big).$$

The above inequality can be proved immediately by induction, using the fact that by $\alpha$-strong convexity one has

$$f(x_s) + \nabla f(x_s)^\top (x - x_s) + \frac{\alpha}{2}\|x - x_s\|^2 \leq f(x).$$

Equation (2) by itself does not say much; for it to be useful one needs to understand how “far” below $f$ is $\Phi_s$. The following inequality answers this question:

(3) $$f(y_s) \leq \min_{x \in \mathbb{R}^n} \Phi_s(x).$$

The rest of the proof is devoted to showing that (3) holds true, but first let us see how to combine (2) and (3) to obtain the rate given by the theorem (we use that by $\beta$-smoothness one has $f(x) - f(x^*) \leq \frac{\beta}{2}\|x - x^*\|^2$):

$$f(y_t) - f(x^*) \leq \Phi_t(x^*) - f(x^*) \leq \left(1 - \frac{1}{\sqrt{Q}}\right)^{t-1}\big(\Phi_1(x^*) - f(x^*)\big) \leq \frac{\alpha + \beta}{2}\|x_1 - x^*\|^2 \left(1 - \frac{1}{\sqrt{Q}}\right)^{t-1},$$

which together with $(1 - \frac{1}{\sqrt{Q}})^{t-1} \leq \exp(-\frac{t-1}{\sqrt{Q}})$ gives the claimed bound.

We now prove (3) by induction (note that it is true at $s = 1$ since $x_1 = y_1$). Denote $\Phi_s^* = \min_{x} \Phi_s(x)$. Using the definition of $y_{s+1}$ (and $\beta$-smoothness), convexity, and the induction hypothesis, one gets

$$f(y_{s+1}) \leq \left(1 - \frac{1}{\sqrt{Q}}\right)\Phi_s^* + \left(1 - \frac{1}{\sqrt{Q}}\right)\nabla f(x_s)^\top (x_s - y_s) + \frac{1}{\sqrt{Q}} f(x_s) - \frac{1}{2\beta}\|\nabla f(x_s)\|^2.$$

Thus we now have to show that

(4) $$\Phi_{s+1}^* \geq \left(1 - \frac{1}{\sqrt{Q}}\right)\Phi_s^* + \left(1 - \frac{1}{\sqrt{Q}}\right)\nabla f(x_s)^\top (x_s - y_s) + \frac{1}{\sqrt{Q}} f(x_s) - \frac{1}{2\beta}\|\nabla f(x_s)\|^2.$$

To prove this inequality we have to understand better the functions $\Phi_s$. First note that $\nabla^2 \Phi_s = \alpha I_n$ (immediate by induction) and thus $\Phi_s$ has to be of the following form:

$$\Phi_s(x) = \Phi_s^* + \frac{\alpha}{2}\|x - v_s\|^2,$$

for some $v_s \in \mathbb{R}^n$. Now observe that by differentiating (1) and using the above form of $\Phi_s$ one obtains

$$\nabla \Phi_{s+1}(x) = \alpha\left(1 - \frac{1}{\sqrt{Q}}\right)(x - v_s) + \frac{1}{\sqrt{Q}}\nabla f(x_s) + \frac{\alpha}{\sqrt{Q}}(x - x_s).$$

In particular $\Phi_{s+1}$ is by definition minimized at $v_{s+1}$, which can now be defined by induction using the above identity; precisely:

(5) $$v_{s+1} = \left(1 - \frac{1}{\sqrt{Q}}\right)v_s + \frac{1}{\sqrt{Q}}\, x_s - \frac{1}{\alpha \sqrt{Q}}\nabla f(x_s).$$

Using the form of $\Phi_s$ and $\Phi_{s+1}$, as well as the original definition (1), one gets the following identity by evaluating $\Phi_{s+1}$ at $x_s$:

(6) $$\Phi_{s+1}^* + \frac{\alpha}{2}\|x_s - v_{s+1}\|^2 = \left(1 - \frac{1}{\sqrt{Q}}\right)\Phi_s^* + \frac{\alpha}{2}\left(1 - \frac{1}{\sqrt{Q}}\right)\|x_s - v_s\|^2 + \frac{1}{\sqrt{Q}} f(x_s).$$

Note that thanks to (5) one has

$$\|x_s - v_{s+1}\|^2 = \left(1 - \frac{1}{\sqrt{Q}}\right)^2\|x_s - v_s\|^2 + \frac{1}{\alpha^2 Q}\|\nabla f(x_s)\|^2 + \frac{2}{\alpha\sqrt{Q}}\left(1 - \frac{1}{\sqrt{Q}}\right)\nabla f(x_s)^\top (x_s - v_s),$$

which combined with (6) yields

$$\Phi_{s+1}^* = \left(1 - \frac{1}{\sqrt{Q}}\right)\Phi_s^* + \frac{1}{\sqrt{Q}} f(x_s) - \frac{1}{2\beta}\|\nabla f(x_s)\|^2 + \frac{\alpha}{2\sqrt{Q}}\left(1 - \frac{1}{\sqrt{Q}}\right)\|x_s - v_s\|^2 + \frac{1}{\sqrt{Q}}\left(1 - \frac{1}{\sqrt{Q}}\right)\nabla f(x_s)^\top (v_s - x_s).$$

Finally we show by induction that $v_s - x_s = \sqrt{Q}\,(x_s - y_s)$; dropping the nonnegative term $\frac{\alpha}{2\sqrt{Q}}(1 - \frac{1}{\sqrt{Q}})\|x_s - v_s\|^2$ in the previous display then yields (4) and thus also concludes the proof of the theorem:

$$v_{s+1} - x_{s+1} = \left(1 - \frac{1}{\sqrt{Q}}\right)v_s + \frac{1}{\sqrt{Q}}\,x_s - \frac{1}{\alpha\sqrt{Q}}\nabla f(x_s) - x_{s+1} = \sqrt{Q}\, x_s - (\sqrt{Q} - 1)\, y_s - \frac{\sqrt{Q}}{\beta}\nabla f(x_s) - x_{s+1} = \sqrt{Q}\, y_{s+1} - (\sqrt{Q} - 1)\, y_s - x_{s+1} = \sqrt{Q}\,(x_{s+1} - y_{s+1}),$$

where the first equality comes from (5), the second from the induction hypothesis, the third from the definition of $y_{s+1}$ and the last one from the definition of $x_{s+1}$.

In other news I attended ITCS in Princeton last month and it was absolutely great. Here are a few papers that I really liked:

- Amir Shpilka, Avishay Tal and Ben Lee Volk, On the Structure of Boolean Functions with Small Spectral Norm

- Andrew Wan, John Wright and Chenggang Wu. Decision Trees, Protocols, and the Fourier Entropy-Influence Conjecture

- Cristopher Moore and Leonard Schulman, Tree Codes and a Conjecture on Exponential Sums

- Fernando Brandao, Aram Harrow, James Lee and Yuval Peres, Adversarial hypothesis testing and a quantum Stein’s Lemma for restricted measurements

- Yossi Azar, Uriel Feige, Michal Feldman and Moshe Tennenholtz, Sequential Decision Making with Vector Outcomes

- Manor Mendel and Assaf Naor, Expanders with respect to Hadamard spaces and random graphs

- David Gamarnik and Madhu Sudan, Limits of local algorithms over sparse random graphs

- Rishi Gupta, Tim Roughgarden and C. Seshadhri, Decompositions of Triangle-Dense Graphs

During my visit to the Theory Group at MSR I also learned about the following topics which my readers will probably like too: Tucker’s lemma (see also the related Borsuk-Ulam theorem), the lamplighter graph and random walks on it, sparse regularity lemma, and algorithmic applications of evolving sets.

Now is probably a good time to look back on 2013 as well as to look forward to what this blog will become in 2014.

**I’m a bandit in 2013**

First of all I’m happy to report that ‘I’m a bandit’ was viewed more than 55,000 times in 2013!!! As you can see from the plot below (taken from Google Analytics) there have been three significant spikes.

- On March 22nd I received my first significant link from John Langford on his own blog hunch.net.

- On July 17th I made a post with ‘deep learning’ in the title. It was retweeted, reblogged, facebooked, G+’ed, etc.

- The last spike on December 13th is quite interesting, as it comes from my first link from the Machine Learning group on reddit.

Of course the stars of the blog so far have been the optimization lecture notes. But the star among the stars is the post on Nesterov’s Accelerated Gradient Descent which has been viewed THREE TIMES MORE than any other post in this sequence! Apart from optimization stuff this blog hosted a few other topics such as metric embeddings, convex geometry or graph theory. Browsing these older posts should be easier now with the new Archives page.

**I’m a bandit in 2014**

My main objective for the first half of 2014 is to turn the optimization posts into an actual book (or rather a long survey). This will be published in the Foundations and Trends in Machine Learning series (alongside my previous survey on bandits). Of course I expect this project to take up a lot of my time, so I won’t post too much from February to May. On the other hand I am hopeful that during this downtime I will host quite a few interesting guest posts.

Once the first draft for the optimization lecture notes is out (probably in early May) I will have more time to dedicate to the blog. I plan to start a new series of posts on a topic that I find fascinating, this is the recent theory of graphs limits. I believe (and I’m not the only one!) that in the years to come this theory will prove to be a powerful tool for network analysis, and in particular for statistical analysis on network data. More on this in a few months!

**Optimization**

- Non-strongly-convex smooth stochastic approximation with convergence rate $O(1/n)$ by Francis Bach and Eric Moulines. With noisy gradients (see this post) it is known that the best rate of convergence to minimize an $\alpha$-strongly convex function is of order $1/(\alpha n)$, while for convex functions it is of order $1/\sqrt{n}$. Unfortunately in Machine Learning applications the strong convexity parameter $\alpha$ is often a regularization parameter that can be very small, in which case the standard analysis using strong convexity does not yield any acceleration. In this paper Bach and Moulines show that in the case of the square loss (whose strong convexity parameter depends on the smallest non-zero eigenvalue of the covariance matrix of the covariates) one can obtain a rate of order $1/n$ (i.e. with no dependency on the smallest eigenvalue). The proposed algorithm is simply Stochastic Gradient Descent with a constant step-size (and averaging for the output). A more intricate algorithm is also proposed for the logistic loss.

- On Decomposing the Proximal Map by Yaoliang Yu. Algorithms such as ISTA and FISTA (see this post) require computing the proximal operator

$$\mathrm{prox}_g(x) = \mathrm{argmin}_{y}\; \frac{1}{2}\|x - y\|^2 + g(y).$$

Recall that this proximal operator arises naturally for the minimization of a function of the form $f + g$, i.e. when one wants to minimize some function $f$ while enforcing some of the ‘properties’ of $g$ in the solution. For instance with $g = \|\cdot\|_1$ one would like to output a sparse solution. Thus it is very natural to try to understand the relation between $\mathrm{prox}_{f+g}$, $\mathrm{prox}_f$ and $\mathrm{prox}_g$. This paper considers various properties under which one has $\mathrm{prox}_{f+g} = \mathrm{prox}_f \circ \mathrm{prox}_g$.

- Accelerating Stochastic Gradient Descent using Predictive Variance Reduction by Rie Johnson and Tong Zhang. This paper gives a beautiful new algorithm achieving the same performance as SDCA and SAG (see this post). The algorithm/analysis are much more intuitive than those of SDCA and SAG. I will make a more detailed post on this paper later next year.

- Accelerated Mini-Batch Stochastic Dual Coordinate Ascent by Shai Shalev-Shwartz and Tong Zhang. Both SDCA and SAG have a linear dependency on the condition number $Q$. For the deterministic case Nesterov’s accelerated gradient descent attains a linear dependency on $\sqrt{Q}$. This paper partially bridges the gap between these results and presents an accelerated version of SDCA using mini-batches.

- Mixed Optimization for Smooth Functions by Mehrdad Mahdavi, Lijun Zhang and Rong Jin. This paper considers a new setting which seems quite natural: what if on top of a noisy first order oracle one can also access a regular first order oracle? Mehrdad will do a guest post on this problem soon, but the short answer is that with only a logarithmic number of calls to the regular oracle one can attain a significantly faster rate for smooth optimization than is possible with the noisy oracle alone.

- Estimation, Optimization, and Parallelism when Data is Sparse by John Duchi, Mike Jordan and Brendan McMahan. For this paper too I am hoping to have a guest post (by John Duchi this time) that would explain the new algorithm and its properties. The idea is roughly to do a gradient descent where the step-size adapts to the sparsity of the observed gradient, allowing for much faster rates in certain situations.
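Returning to the proximal operator from the Yu paper above: for $g = \lambda\|\cdot\|_1$ it has a well-known closed form, the coordinate-wise soft-thresholding map, which is what makes ISTA/FISTA produce sparse iterates. A quick sketch (a standard fact, included as my own illustration, not code from the paper):

```python
def prox_l1(x, lam):
    """prox of lam*||.||_1, i.e. argmin_y 0.5*||y - x||^2 + lam*||y||_1.
    The problem is separable across coordinates; each entry is shrunk
    toward 0 by lam, and entries smaller than lam are zeroed out."""
    return [(abs(xi) - lam) * (1.0 if xi >= 0 else -1.0) if abs(xi) > lam else 0.0
            for xi in x]

p = prox_l1([3.0, -0.5, 1.2], 1.0)
print(p)  # the middle entry is below the threshold and gets zeroed out
```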

**Bandits**

- Online Learning in Episodic Markovian Decision Processes by Relative Entropy Policy Search by Alexander Zimin and Gergely Neu. This paper shows that one can solve episodic loop-free MDPs by simply using a combinatorial semi-bandit strategy (see this paper by Audibert, myself and Lugosi where we solved the semi-bandit problem). I believe that this paper initiates a research direction that will be very fruitful in the future, namely reducing (or rather reformulating) a complicated sequential decision making problem as a linear bandit (or semi-bandit). A similar approach is very popular in optimization, where everyone knows that one should try very hard to formulate the problem of interest as a convex program. On the other hand such an approach in online learning/sequential decision making has not been recognized yet. I believe that at the moment the most promising direction is to try to formulate the problem as a linear bandit, as it is both an extremely general problem but also one for which we have seemingly canonical algorithms. A related paper is Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions by Abbasi-Yadkori, Bartlett, Kanade, Seldin and Szepesvari.

- Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards by Thomas Bonald and Alexandre Proutiere. This paper considers the famous Bayesian setting of Berry, Chen, Zame, Heath and Shepp where one has a countable set of arms with Bernoulli distributions whose means are drawn uniformly on $[0,1]$. This paper gives the first strategy whose Bayesian regret is of optimal order. I recommend taking a look at the strategy; it is both very elegant and quite smart.

- Sequential Transfer in Multi-armed Bandit with Finite Set of Models by Mohammad Azar, Alessandro Lazaric and Emma Brunskill. This paper considers an elegant and simple model for transfer learning in multi-armed bandit problems. To put it simply, the setting is that of a Bayesian multi-armed bandit where the underlying parameter is replaced by a fresh independent sample at regular intervals. If the prior is known then this problem is fairly straightforward (given our current knowledge of the multi-armed bandit). The issue is what to do when the prior is unknown. The paper proposes an interesting strategy based on learning latent variable models via the method of moments (see this paper by Anandkumar, Ge, Hsu, Kakade, and Telgarsky for instance). While this gives non-trivial results I believe that much more can be said by fully embracing the sequential nature of the problem rather than trying to blend together a batch method with standard bandit strategies. I suspect that the technical difficulties to obtain an optimal strategy for this setting will be tremendous (which is quite exciting!).

- Eluder Dimension and the Sample Complexity of Optimistic Exploration by Daniel Russo and Benjamin Van Roy; Prior-free and prior-dependent regret bounds for Thompson Sampling by myself and Che-Yu Liu. These two papers ask the same question (with the latter paper being inspired by the former): we know that Thompson Sampling can be as good as UCB when initialized with a well-chosen prior, but can we show that it has a significant advantage in the cases where the prior is actually informative? The first paper addresses this question in the context of linear bandits (which is, as pointed out above, the most fundamental bandit problem), while the second paper considers a much more restricted class of problems but for which surprisingly strong results can be obtained.

**Online Learning**

- Dimension-Free Exponentiated Gradient by Francesco Orabona. This paper introduces a new regularizer for Mirror Descent and shows that it adapts automatically to the norm of the unknown comparator. In my opinion we know very few interesting regularizers for MD (especially in the full-information setting) and any non-trivial addition to this set seems difficult. This paper manages to do just that. It would be interesting to see if this gives new insights for bandit regularizers.

- Minimax Optimal Algorithms for Unconstrained Linear Optimization by Brendan McMahan and Jacob Abernethy. My advice for this paper is to look at Section 3.2 which gives one very concrete and very cool application of the method they develop.

**Other**

I’m running out of stamina so I will now just list the other papers that I found very interesting.

- Density estimation from unweighted k-nearest neighbor graphs: a roadmap by von Luxburg and Alamgir.

- Stochastic blockmodel approximation of a graphon: Theory and consistent estimation by Airoldi, Costa and Chan.

- Near-Optimal Entrywise Sampling for Data Matrices by Achlioptas, Karnin and Liberty.

- Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic by Sharpnack, Krishnamurthy and Singh.

- Estimating the Unseen: Improved Estimators for Entropy and other Properties by Valiant and Valiant.

- Information-theoretic lower bounds for distributed statistical estimation with communication constraints by Zhang, Duchi, Jordan and Wainwright.

We consider a sequential game between a hunter and a rabbit. At each time step $t$ the hunter is at a vertex $H_t$ and the rabbit at a vertex $R_t$ of the cycle $\mathbb{Z}_n$ (with vertices labeled $0, \ldots, n-1$). We impose the constraint that $H_t$ and $H_{t+1}$ must be neighbors, that is the hunter can only move one step at a time. We impose no restriction on the rabbit (he can jump around like a real rabbit). Let $\tau = \min\{t \geq 0 : H_t = R_t\}$ be the *capture time*. We allow for randomized strategies for the two players. Of course the hunter wants to minimize the expected capture time $\mathbb{E}\,\tau$ while the rabbit wants to maximize it.

**December 3rd clarification:** the processes $(H_t)$ and $(R_t)$ are independent; in other words the hunter does not see the rabbit and vice versa (the game would be trivial otherwise!).

Theorem [Adler et al., 2003]: The hunter can always ensure $\mathbb{E}\,\tau = O(n \log n)$ and the rabbit can always ensure $\mathbb{E}\,\tau = \Omega(n \log n)$.

We will now sketch the proof of this beautiful theorem following the method of [Babichenko et al. 2012]. Quite intuitively it is enough to show that there exists a strategy for the hunter that ensures $\mathbb{E}\,\tau = O(n \log n)$ for any deterministic rabbit, and respectively that there exists a strategy for the rabbit that ensures $\mathbb{E}\,\tau = \Omega(n \log n)$ for any deterministic hunter (see Lemma 2.2 here for a formal proof of this).

**The hunter’s strategy**

Let $a$ and $b$ be independent uniform random variables on $[0, 1]$. The location of the hunter at time $t \leq n$ is $H_t = \lfloor a n + t b \rfloor \bmod n$. In some sense this hunter starts at a random point determined by $a$ and moves at a random speed determined by $b$. The key idea to analyze this strategy is to consider the number of collisions before time $n$, that is $Z = \sum_{t=1}^{n} \mathbb{1}\{H_t = R_t\}$. Clearly

$$\mathbb{P}(\tau \leq n) = \mathbb{P}(Z > 0) \geq \frac{(\mathbb{E} Z)^2}{\mathbb{E} Z^2},$$

where the inequality is a ‘strong’ form of the second moment method which can be proved with Cauchy–Schwarz as follows (recall that the basic second moment method simply says $\mathbb{P}(Z = 0) \leq \mathrm{Var}(Z)/(\mathbb{E} Z)^2$):

$$\mathbb{E} Z = \mathbb{E}\big[Z\, \mathbb{1}\{Z > 0\}\big] \leq \sqrt{\mathbb{E} Z^2}\, \sqrt{\mathbb{P}(Z > 0)}.$$

Now it is just a matter of computing $\mathbb{E} Z$ and $\mathbb{E} Z^2$, which can be done very easily; see page 9 here.
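For intuition, one can simulate a discretized version of this hunter (my own illustration; the exact rounding is an implementation choice, not necessarily the paper's strategy) and estimate the probability of at least one collision within $n$ steps against a fixed rabbit:

```python
import random

def hunter_path(n, a, b):
    """H_t = floor(a*n + t*b) mod n for t = 1..n: random start a*n, random speed b."""
    return [int(a * n + t * b) % n for t in range(1, n + 1)]

def collision_prob(n, rabbit, trials, rng):
    """Monte Carlo estimate of P(Z > 0): at least one collision with a fixed
    (deterministic) rabbit trajectory within the first n steps."""
    hits = 0
    for _ in range(trials):
        a, b = rng.random(), rng.random()
        if any(h == r for h, r in zip(hunter_path(n, a, b), rabbit)):
            hits += 1
    return hits / trials

rng = random.Random(1)
n = 64
p = collision_prob(n, [0] * n, trials=2000, rng=rng)  # a rabbit sitting still
print(p)
```

Against a stationary rabbit the collision probability is a constant, which is consistent with the $O(n \log n)$ bound; the $1/\log n$ behavior only shows up against cleverer rabbits.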

**The rabbit’s strategy**

The rabbit’s strategy is an order of magnitude more complicated/more beautiful. The basic intuition to design a good rabbit’s strategy is the reversed second moment method, which is the following *trivial* identity:

$$\mathbb{P}(\tau \leq n) = \mathbb{P}(Z > 0) = \frac{\mathbb{E} Z}{\mathbb{E}[Z \mid Z > 0]}.$$

Let us look at this last term in more detail. Of course the rabbit’s location at a given time step should be uniformly distributed, thus the numerator will be equal to $1$ (recall that here the hunter is deterministic). Thus we want to minimize this ratio by maximizing the denominator; in other words, *once the rabbit has collided with the hunter, he should try to collide as much as possible after that*! This is in my opinion a truly beautiful intuition. The way to achieve this proposed in [Babichenko et al. 2012] goes as follows. Take a starting point uniformly distributed on the cycle and a simple random walk in $\mathbb{Z}^2$ (independent of the starting point); the rabbit’s successive locations are then read off from the two-dimensional walk through an appropriate sequence of stopping times.

Using basic properties of the simple random walk in 2 dimensions one can then analyze the denominator in a few lines; see page 7 here. I suspect that one could come up with other rabbit strategies that achieve the same expected capture time.

To this end we consider the following relaxation of the problem, which we term a Local Linear Oracle (LLO).

Definition 1: A LLO with parameter $\rho \geq 1$ for the convex domain $K \subset \mathbb{R}^n$ is a procedure that, given a point $x \in K$, a scalar $r > 0$ and a linear objective $c \in \mathbb{R}^n$, returns a point $p \in K$ that satisfies:

1. $c^\top p \leq c^\top y$ for every $y \in K$ with $\|y - x\| \leq r$;

2. $\|x - p\| \leq \rho \cdot r$.

Clearly, by taking the next iterate to be the output of such an LLO (queried with the current gradient and a suitable radius $r$) and choosing the parameters appropriately, we would get exponentially fast convergence (see Part I for more details).

In the following we show that a LLO can be constructed for any polytope in such a way that each call to the LLO amounts to a single linear minimization step over the domain, and we specify the parameter $\rho$.

As an easy warm-up, let us consider the very specific case in which $K = \Delta_n$, that is the domain is just the $n$-dimensional simplex. A LLO in this case can be constructed by solving the following optimization problem,

$$\min_{y \in \Delta_n} \; c^\top y \quad \text{subject to} \quad \|y - x\|_1 \leq r',$$

where $r' = \sqrt{n}\, r$ (so that, since $\|\cdot\|_1 \leq \sqrt{n}\,\|\cdot\|_2$ on $\mathbb{R}^n$, the feasible set contains the Euclidean ball of radius $r$ around $x$).

Denote by $p$ the optimal solution to the above problem. One can verify that since the feasible set contains the Euclidean ball of radius $r$ around $x$, the first property of a LLO holds. Moreover it holds that $\|x - p\|_2 \leq \|x - p\|_1 \leq \sqrt{n}\, r$. Thus solving the above $\ell_1$-constrained linear problem yields a LLO for the simplex with parameter $\rho = \sqrt{n}$. Most importantly, the above $\ell_1$-constrained problem is solvable using only a single linear minimization step over $\Delta_n$ and additional computation that is polynomial in the number of non-zeros in the input point $x$. To see this, observe that $p$ is just the outcome of taking weight $r'/2$ from the non-zero entries of $x$ that correspond to the largest (signed) entries in the vector $c$ and moving it entirely to a single entry that corresponds to the smallest (signed) entry in $c$. This just requires checking, for each non-zero entry of $x$, the value of the corresponding entry of $c$, sorting these values and reducing weights until a total weight of $r'/2$ has been moved. Finding the smallest entry of $c$, although a trivial operation, is just a linear minimization problem over the simplex.
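In code, the weight-moving step just described looks as follows (a sketch under my own naming and tie-breaking; the paper's exact bookkeeping may differ in details):

```python
def simplex_llo(x, c, budget):
    """Local linear oracle sketch on the simplex: move up to `budget` total
    weight from the non-zero entries of x with the largest values of c onto
    the single coordinate minimizing c (one linear minimization step)."""
    p = list(x)
    dest = min(range(len(c)), key=lambda i: c[i])  # linear minimization over the simplex
    remaining = budget
    # greedily drain the support coordinates with the largest c first
    for i in sorted((j for j in range(len(x)) if x[j] > 0 and j != dest),
                    key=lambda j: -c[j]):
        take = min(p[i], remaining)
        p[i] -= take
        p[dest] += take
        remaining -= take
        if remaining <= 0:
            break
    return p

p = simplex_llo([0.25, 0.25, 0.25, 0.25], [4.0, 3.0, 2.0, 1.0], budget=0.3)
print(p)  # still a probability vector; weight has moved onto the last coordinate
```

Note that the only global operation is finding `dest`; everything else touches only the support of `x`.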

What about the general case in which is some arbitrary polytope? We would like to generalize the construction for the simplex to arbitrary polytopes.

Given the input point $x$ to the LLO, let us write $x$ as a convex combination of vertices of the polytope, that is $x = \sum_i \lambda_i v_i$ where the weights $\lambda_i$ are non-negative and sum to one, and each $v_i$ is a vertex of $K$. Suppose now that there exists a constant $\mu$ such that any point $y \in K$ which satisfies $\|y - x\| \leq r$ can also be written as a convex combination of vertices of the polytope, $y = \sum_i \gamma_i v_i$, such that $\sum_i |\lambda_i - \gamma_i| \leq \mu r$. Denote by $N$ the number of vertices of $K$ and by $V$ the set of these vertices. Consider the following optimization problem,

$$\min_{\gamma \in \Delta_N} \; c^\top \Big(\sum_i \gamma_i v_i\Big) \quad \text{subject to} \quad \sum_i |\lambda_i - \gamma_i| \leq \mu r.$$
Since the above problem is again a linear $\ell_1$-constrained optimization problem over the simplex, we know that it is solvable using only a single call to the linear oracle of $K$ plus time that is polynomial in the number of non-zeros of $\lambda$ (which is at most $t$ after $t$ iterations of the algorithm and in particular does not depend on $N$, which may be exponential in $n$). Let $\gamma$ be an optimal solution to the above problem and let $p = \sum_i \gamma_i v_i$. Then clearly from our definition of $\mu$ the first LLO property holds: $c^\top p \leq c^\top y$ for every $y \in K$ with $\|y - x\| \leq r$. Also it is not hard to show that $\|x - p\|$ is at most a constant (depending on the geometry of $K$) times $\mu r$. Thus if indeed such a constant $\mu$ exists then we have a construction of a LLO, with a parameter governed by $\mu$, that requires only a single call to a linear optimization oracle of $K$.

The following theorem states the convergence rate of the proposed method.

Theorem 1: Given a polytope with parameter $\mu$ as defined above, there exists an algorithm that after $t$ linear minimization steps over the domain produces a point $x_t$ whose approximation error decreases exponentially fast in $t$ (with an exponent depending on $\mu$ and on the condition number of the objective).

The following theorem states that indeed a constant $\mu$ as suggested above exists for any polyhedral set and gives its dependence on geometric properties of the polytope.

Theorem 2: Let $K = \{x \in \mathbb{R}^n : Ax \leq b\}$ be a polytope. Assume that there exist parameters $\psi, \xi > 0$ such that:

1. for any matrix $M$ which consists of at most $n$ linearly independent rows of $A$ it holds that $\|M\| \leq \psi$ (here $\|M\|$ denotes the spectral norm of $M$);

2. for every vertex $v$ of $K$ and every row $A_i$ of $A$ it holds that either $A_i^\top v = b_i$ or $b_i - A_i^\top v \geq \xi$ (that is, given a vertex and an inequality constraint, the vertex either satisfies the constraint with equality or is far from satisfying it with an equality).

Then $\mu$ can be bounded by a quantity proportional to $\psi/\xi$ (up to a low-order polynomial factor in the dimension).

We now turn to point out several examples of polyhedral sets for which tailored combinatorial algorithms for linear optimization exist and for which the bound on $\mu$ given in Theorem 2 is reasonable.

**The simplex**

The $n$-dimensional simplex can be written in the form of linear constraints as $\Delta_n = \{x \in \mathbb{R}^n : x_i \geq 0 \;\forall i,\; \sum_i x_i = 1\}$. It’s not hard to see that for the simplex both parameters of Theorem 2 are constants, and thus $\mu$ is well behaved. In particular we get that the number of iterations needed by the proposed algorithm to reach an $\epsilon$ error is, as stated here, nearly optimal.

**The flow polytope**

Given a directed acyclic graph with $m$ edges, a source node marked $s$ and a target node marked $t$, every path from $s$ to $t$ in the graph can be represented by its identifying vector, that is a vector in $\{0,1\}^m$ in which the entries that are set to 1 correspond to the edges of the path. The flow polytope of the graph is the convex hull of all such identifying vectors of the simple paths from $s$ to $t$. This polytope is also exactly the set of all unit flows in the graph if we assume that each edge has a unit flow capacity (a flow is represented here as a vector in $[0,1]^m$ in which each entry is the amount of flow through the corresponding edge). For this polytope too it is not hard to see that the parameters of Theorem 2 are well behaved, and thus so is $\mu$. Since the flow polytope is just the convex hull of paths in the graph, minimizing a linear objective over it amounts to finding a minimum weight path given weights for the edges.

**The doubly-stochastic matrices polytope**

Doubly-stochastic matrices are square real-valued matrices with non-negative entries in which the sum of the entries of each row and each column amounts to 1. Writing down the linear inequalities and equalities that define this polytope yields that here also the parameters of Theorem 2 are well behaved.

The Birkhoff–von Neumann theorem states that this polytope is the convex hull of exactly all permutation matrices. Since a permutation matrix corresponds to a perfect matching in a fully connected bipartite graph, linear minimization over this polytope corresponds to finding a minimum weight perfect matching in a bipartite graph.

**Matroid polytopes**

A matroid is a pair $(E, I)$ where $E$ is a set of elements and $I$ is a set of subsets of $E$, called the independent sets, which satisfy various interesting properties that resemble the concept of linear independence in vector spaces.

Matroids have been studied extensively in combinatorial optimization, and a key example of a matroid is the graphical matroid, in which the set $E$ is the set of edges of a given graph and the set $I$ is the set of all subsets of $E$ which are cycle-free. In particular in this case $I$ contains all the spanning trees of the graph. A subset $S \in I$ can be represented by its identifying vector, which lies in $\{0,1\}^{|E|}$; this gives rise to the matroid polytope, which is just the convex hull of all identifying vectors of sets in $I$. It can be shown that the matroid polytope can be defined by exponentially many linear inequalities (exponential in $|E|$), which makes optimization over it a non-trivial issue since the *ellipsoid method* needs to be used, which is highly impractical. Moreover a separation oracle for this polytope which runs in time polynomial in $|E|$ exists, however it is also fairly complicated. By contrast, linear optimization over matroid polytopes is very easy using a simple greedy procedure which runs in nearly linear time. It can be shown that the parameters of Theorem 2 are reasonable for the matroid polytope as well, so that $\mu$ is again well behaved.

An interesting application of the method presented here, which was also our initial motivation for studying this problem, is to the setting of online convex optimization (sometimes termed online learning). Combining the above algorithm with the analysis of an online Frank-Wolfe method presented in this paper by Hazan and Kale from ICML 2012 results in an algorithm for online convex optimization whose iteration complexity amounts to a single linear optimization step over the domain instead of a projection computation, which can be much more involved. This algorithm has optimal regret in terms of the game length, which answers an open question posed by Kalai and Vempala in this paper from COLT 2003 and by Hazan and Kale from ICML 2012. Further applications of the method include Frank-Wolfe-like algorithms for stochastic and non-smooth optimization.

We end this post with some further research questions.

The method presented in this post clearly holds only for “nicely shaped” polytopes because of the dependency of the constant $\mu$ on the geometry of the domain. In particular if we take the domain to be the Euclidean unit ball, which can only be described by infinitely many linear inequalities, the relevant parameters blow up and the analysis breaks down. So, an interesting open question is


Question 3: Is there a CG method with a linear convergence rate for smooth and strongly convex optimization over arbitrary convex sets? In particular, is the rate suggested in Question 2 (see Part I) attainable?

where , .

The appeal of the conditional gradient method is twofold:

i) the update step of the method requires only minimizing a linear objective over the domain, which for many domains of interest is computationally cheap (examples are various polytopes that arise in combinatorial optimization, such as the flow polytope, the spanning tree polytope and the matching polytope, as well as the set of unit-trace positive semidefinite matrices). Other first-order methods instead require computing, on each iteration, a projection onto the convex set with respect to a certain norm, which can be computationally much more expensive; and

ii) the method yields sparse solutions, in the sense that after iterations, provided the first iterate is a vertex of the convex domain, the current iterate is naturally given as a convex combination of at most vertices of the domain (for the simplex this means at most non-zero entries, and for the set of unit-trace psd matrices this means rank at most ).

If is -smooth and the diameter of is (i.e. ), then choosing guarantees that . This is also the convergence rate of projected gradient descent for smooth optimization. In case the objective function is both -smooth and -strongly convex (with respect to the same norm), it is known that projected gradient descent has an error that decreases exponentially fast: an additive error is guaranteed after at most iterations, where is the condition number of (the complexity notation is the same as but omits factors that are logarithmic in the diameter of and in ). What about the conditional gradient method in this case? Our first question for this post is the following:
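As a concrete illustration of the basic method, here is a minimal Frank-Wolfe sketch for a smooth quadratic over the probability simplex, where the linear minimization step simply returns a vertex of the domain. The objective, the step sizes 2/(t+2) and the iteration count are illustrative choices, not a claim about the post's algorithm:

```python
import numpy as np

# Basic conditional gradient (Frank-Wolfe) on the probability simplex,
# minimizing the smooth function f(x) = 0.5 * ||x - b||^2.
# The linear step argmin_{v in simplex} <grad, v> is a vertex: the
# standard basis vector at the smallest gradient coordinate.

def frank_wolfe_simplex(grad, x0, iters):
    x = x0.copy()
    for t in range(iters):
        g = grad(x)
        v = np.zeros_like(x)
        v[np.argmin(g)] = 1.0          # linear minimization over the simplex
        gamma = 2.0 / (t + 2.0)        # classical step size, roughly 1/t
        x = (1 - gamma) * x + gamma * v
    return x

b = np.array([0.1, 0.2, 0.7])          # b lies in the simplex, so x* = b
x = frank_wolfe_simplex(lambda x: x - b, np.array([1.0, 0.0, 0.0]), 2000)
print(np.linalg.norm(x - b))           # shrinks roughly like 1/sqrt(t)
```

Note that after t iterations the iterate is, by construction, a convex combination of at most t+1 vertices, which is the sparsity property mentioned above.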

Question 1: Is there a CG method that, given a -smooth and -strongly convex function , guarantees an additive error after at most linear optimization steps over the domain?

The answer is no, as observed here. In particular, the convergence rate of a linearly converging CG method must depend polynomially on the dimension, which is not the case for the projected gradient descent method. This brings us to our second question:

Question 2: Is there a CG method that, given a -smooth and -strongly convex function , guarantees an additive error after at most linear optimization steps over the domain?

Note that despite the factor , such a result is still interesting, since the time to compute a Euclidean projection onto the domain (or a non-Euclidean one in the case of mirror descent methods) may be longer than the time to minimize a linear objective over the domain by a multiplicative factor.

Here is the place to remark that several linearly converging CG methods for smooth and strongly convex optimization have been proposed before, but they rely on assumptions about the location of with respect to the boundary of . For example, if lies in the interior of (which means that the problem is essentially an unconstrained one), then the original CG method due to Frank and Wolfe converges exponentially fast; however the number of iterations depends polynomially on the distance of from the boundary, see this paper by Guélat and Marcotte. In case is a polytope, a modification of the CG method presented in the same paper gives a linear convergence rate that is polynomial in the distance of from the boundary of the smallest facet of that contains . Here, however, we do not want to rely on such strong assumptions on the location of , and we aim at a linearly converging method that is free from such restrictions.

In the rest of this post we follow our new paper, which describes a new CG method for smooth and strongly convex optimization with a convergence rate of the form stated in Question 2 in case is a polyhedral set. The convergence rate will depend on geometric properties of the set. This dependence is very reasonable for many polytopes that arise in combinatorial optimization problems; indeed, domains for which fast and simple combinatorial algorithms for linear optimization exist are in part what makes CG methods an attractive approach for non-linear optimization.

To begin the derivation of this new CG method, let us recall that, as Sebastien showed here, the conditional gradient method satisfies the following inequality on each iteration :

where .

The fact that might be as large as the diameter of while the approximation error may be arbitrarily small forces one to choose step sizes that decrease roughly like in order to get the known convergence rate.

Let us now consider the case that is also -strongly convex. That is,

In particular the above inequality implies that,

Now assume that the iterate satisfies for some , and denote by the Euclidean ball of radius centred at the point . Define , and let us now choose as the optimal solution to the following problem:

Note that the CG inequality still holds, since by strong convexity we have that , and the only property of required for the CG inequality to hold is that . We now get that:

Thus choosing a constant step size like and an initial error bound results in exponentially fast convergence.

The problem with the above approach is of course that now the optimization problem that needs to be solved on each iteration is no longer a linear minimization over the original domain, but a much more difficult problem, due to the additional constraint on the distance of from . What we would like is an exponentially fast converging method that still requires solving only a linear minimization problem over on each iteration. This will be the subject of our next post.

**A new problem around subgraph densities**

Nati Linial and I have just uploaded our first paper together, titled ‘On the local profiles of trees’. Some background on the paper: recently there has been a lot of interest in subgraph densities for very large graphs, mainly because of their importance for the emerging theory of graph limits; see here for a bunch of videos on this, here for a nice survey by Lovasz, and here for the book version of this survey. A basic issue is that we know almost nothing about subgraph densities. Consider for instance the set of distributions on -vertex subgraphs induced by very large graphs (that is, a distribution corresponds to a large graph in the sense that for a -vertex subgraph , is the probability that, by picking vertices at random in , one obtains as the induced subgraph). We know very little about this set. It is non-convex (think of the complete graph, then the empty graph, and then try to take convex combinations of the corresponding distributions). Even worse, Hatami and Norine proved that it is undecidable to determine the validity of a linear inequality for this set. Alright, so subgraph densities in general graphs are in some sense intractable. Can we make some simplifying assumptions and recover tractability? It turns out that the answer is essentially yes if you restrict your attention to trees and subtrees! For instance, we prove in our paper with Nati that in this case the set of possible distributions becomes convex! We also initiate the study of the defining inequalities for this set, but we are still far from a complete picture. The paper contains a list of 7 open problems, and I strongly encourage the reader to read our short paper and try to solve some of them: they are really fun to work on!

**New paper on the multi-armed bandit**

Che-Yu Liu (my first graduate student) and I uploaded a few weeks ago our first paper together, titled ‘Prior-free and prior-dependent regret bounds for Thompson Sampling’. Let me say that there are still plenty of open questions around the theme developed in this paper. The abstract reads as follows: We consider the stochastic multi-armed bandit problem with a prior distribution on the reward distributions. We are interested in studying prior-free and prior-dependent regret bounds, very much in the same spirit as the usual distribution-free and distribution-dependent bounds for the non-Bayesian stochastic bandit. Building on the techniques of Audibert and Bubeck [2009] and Russo and Van Roy [2013], we first show that Thompson Sampling attains an optimal prior-free bound, in the sense that for any prior distribution its Bayesian regret is bounded from above by . This result is unimprovable in the sense that there exists a prior distribution such that any algorithm has a Bayesian regret bounded from below by . We also study the case of priors for the setting of Bubeck et al. [2013] (where the optimal mean is known as well as a lower bound on the smallest gap), and we show that in this case the regret of Thompson Sampling is in fact uniformly bounded over time, thus showing that Thompson Sampling can take great advantage of the nice properties of these priors.

Next are three announcements related to different workshops/conferences:

**Simons-Berkeley Research Fellowship 2014-2015**

The Simons Institute at UC Berkeley has just posted its call for the Fellowships for next year. Next year’s programs are “Algorithmic Spectral Graph Theory”, “Algorithms & Complexity in Algebraic Geometry” and “Information Theory”. The deadline is December 15. Check out what Justin Thaler has to say about his experience as a research fellow this semester. Let me add that I am also very much enjoying my stay there, and the paper with Nati that I talked about above would not have been possible without the Simons Institute.

**COLT 2014**

The call for papers for COLT 2014 is out, see the official website here. Let me also remind you that this edition will be in Barcelona, which should be a lot of fun.

**Mathematics of Machine Learning**

Together with Nicolo Cesa-Bianchi, Gabor Lugosi, and Sasha Rakhlin, we are organizing a special program in Barcelona on the Mathematics of Machine Learning from April 7, 2014 to July 14, 2014, see the website for all the details. We are still in the process of inviting people but if you are interested in participating please feel free to send us an email. We should also have soon more details on the large workshop that will take place right after COLT.

]]>This week at the Simons Institute we had the first Big Data workshop on Succinct Data Representations and Applications. Here I would like to briefly talk about one of the ‘stars’ of this workshop: the squared-length sampling technique. I will illustrate this method on three examples (taken from three seminal papers).

**Fast low rank approximation**

Frieze, Kannan and Vempala proposed the following algorithm to compute an approximate low rank decomposition of a matrix (this specific version is taken from Chapter 6 here). We denote by the columns of .

Let be i.i.d. random variables taking values in whose distribution is proportional to the *squared length of the columns*; more precisely, the probability that is equal to is proportional to . Let be the matrix whose column is . Next compute an SVD of , and let be the top left singular vectors of . The low rank approximation to is finally given by

One can prove that for this approximation satisfies:

where is the best rank approximation to .

The amazing feature of this algorithm is that the complexity of computing the projection matrix is *independent* of (one needs to be careful with how the sampling of is done, but this can be taken care of). In fact it is even possible to obtain an algorithm whose complexity is independent of both and , and depends only polynomially on and (though the version described above might be better in practice, because its complexity is simply ).
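Here is a minimal numerical sketch of this column-sampling scheme, in the spirit of the Frieze-Kannan-Vempala algorithm described above. The matrix sizes, the random seed, and the rescaling convention are illustrative choices of mine, not from the papers:

```python
import numpy as np

# Squared-length sampling for low rank approximation: sample s columns with
# probability proportional to their squared norms, rescale them, and project
# the matrix onto the top-k left singular vectors of the sampled matrix.

rng = np.random.default_rng(0)

def sampled_low_rank(A, k, s):
    p = np.sum(A**2, axis=0) / np.sum(A**2)       # squared-length distribution
    idx = rng.choice(A.shape[1], size=s, p=p)
    C = A[:, idx] / np.sqrt(s * p[idx])           # rescaled sampled columns
    U, _, _ = np.linalg.svd(C, full_matrices=False)
    Uk = U[:, :k]                                 # top-k left singular vectors
    return Uk @ (Uk.T @ A)                        # projection of A

A = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 200))  # rank 5
A_approx = sampled_low_rank(A, k=5, s=40)
print(np.linalg.norm(A - A_approx) / np.linalg.norm(A))  # small relative error
```

Note that the SVD is computed on the small sampled matrix only, which is the source of the speedup: its cost does not grow with the number of columns of the original matrix.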

**Fast graph sparsification**

Spielman and Teng looked at the following graph sparsification problem: given a weighted graph , find a ‘sparse’ weighted graph such that

where and are the graph Laplacians of and . The idea is that if one is able to find a sparse approximation to ‘fast enough’, then one can solve many problems on (which is dense) by instead solving them much faster on (think of solving linear systems of the form , for example).

Spielman and Srivastava proposed a method to carry out this graph sparsification task using the squared-length sampling technique. First they reduce the problem to the following: given vectors that form a decomposition of the identity, that is

find a small subset of them such that, appropriately reweighted, they form an approximate decomposition of the identity. Rudelson showed that one can solve this problem by simply sampling vectors i.i.d. proportionally to their squared length, and reweighting each one by the square root of the inverse probability of selecting it. In other words, he showed that if is an i.i.d. sequence in such that is proportional to , then with high probability one has
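A quick numerical illustration of this sampling scheme follows. The test vectors, their number, and the sample size are made up for the demonstration; the only structural requirement is that the vectors decompose the identity:

```python
import numpy as np

# Squared-length sampling for an approximate decomposition of the identity:
# given rows w_i with sum_i w_i w_i^T = I, sample a few of them with
# probability proportional to ||w_i||^2 and reweight so that the expectation
# of the sampled sum is exactly the identity.

rng = np.random.default_rng(1)
d, n, s = 10, 500, 2000

V = rng.standard_normal((n, d))
L = np.linalg.cholesky(V.T @ V)
W = V @ np.linalg.inv(L).T                     # now W.T @ W = I (whitening)

p = np.sum(W**2, axis=1)
p /= p.sum()                                   # squared-length distribution
idx = rng.choice(n, size=s, p=p)
S = W[idx] / np.sqrt(s * p[idx])[:, None]      # reweighted sampled rows
approx = S.T @ S                               # approximate identity
print(np.linalg.norm(approx - np.eye(d), 2))   # small spectral error
```

Since the trace of the identity is the dimension, the sum of the squared lengths equals the dimension, so each vector is picked with probability proportional to how much it contributes to that trace.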

Batson, Spielman and Srivastava also showed how to find such a decomposition with a deterministic algorithm; see also this very nice survey by Naor.

**Approximation guarantee for k-means**

The last example is the now-famous k-means++ algorithm of Arthur and Vassilvitskii. Given , one would like to find minimizing the following quantity:

The following strategy gives a randomized -approximation algorithm for this NP-hard problem. First select uniformly at random from . Then select the new centers iteratively, at each step sampling a point from at random proportionally to its squared distance to the current nearest center.
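The seeding rule just described can be sketched in a few lines. The data (two well-separated Gaussian clusters) and the function name are illustrative; this is only the seeding step, not the full k-means iteration:

```python
import numpy as np

# k-means++ seeding: the first center is uniform over the data, and each
# subsequent center is drawn with probability proportional to the squared
# distance to the nearest center chosen so far.

rng = np.random.default_rng(2)

def kmeans_pp_seeds(X, k):
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # squared distance of every point to its nearest current center
        d2 = np.min([np.sum((X - c)**2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

# two well-separated clusters: the squared-distance weighting makes it very
# likely that one seed lands in each cluster
X = np.vstack([rng.standard_normal((50, 2)),
               rng.standard_normal((50, 2)) + 20])
C = kmeans_pp_seeds(X, 2)
print(C)
```

Note that, unlike uniform seeding, the squared-distance weighting heavily favors points far from the current centers, which is exactly what drives the approximation guarantee.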
