The goal of this post is to develop a Conditional Gradient method that converges exponentially fast while basically solving only a linear minimization problem over the domain on each iteration.
To this end we consider the following relaxation of the problem which we term a Local Linear Oracle (LLO).
Definition 1: A LLO with parameter for the convex domain is a procedure that given a point , a scalar and a linear objective returns a point that satisfies:
1.
2.
Clearly by taking to be the output of such a an LLO and choosing appropriately we would get exponentially fast convergence of the form (see Part I for more details).
In the following we show that a procedure for a LLO could be constructed for any polytope such that each call to the LLO amounts to a single linear minimization step over the domain and we specify the parameter .
As a simple easy case, let us consider the very specific case in which , that is the domain is just the -dimensional simplex. A LLO in this case could be constructed by solving the following optimization problem,
where .
Denote by the optimal solution to the above problem. One can verify that since it follows that . Moreover it holds that . Thus solving the above -constrained linear problem yields a LLO for the simplex with parameter . Most importantly, the above -constrained problem is solvable using only a single linear minimization step over and additional computation that is polynomial in the number of non-zeros in the input point . To see this observe that is just the outcome of taking weight of from the non-zero entries of that correspond to the largest (singed) entries in the vector and moving it entirely to a single entry that corresponds to the smallest (signed) entry in . This just requires to check for each non-zero entry in the value of the corresponding entry in , sorting these values and reducing weights until a total weight of has been reduced. Finding the smallest entry in , although a trivial operation, is just a linear minimization problem over the simplex.
What about the general case in which is some arbitrary polytope? We would like to generalize the construction for the simplex to arbitrary polytopes.
Given the input point to the LLO let us write as a convex combination of vertices of the polytope, that is where , and each is a vertex of . Suppose now that there exists a constant such that given a point which satisfies , could be written also as a convex combination of vertices of the polytope, such that . Denote by the number of vertices of and by the set of these vertices. Consider the following optimization problem,
Since the above problem is again a linear -constrained optimization problem over the simplex we know that it is solvable using only a single call to the linear oracle of plus time that is polynomial in the number of non-zeros in (which is at most after iterations of the algorithm and in particular does not depend on which may be exponential in ). Let be an optimal solution to the above problem and let . Then clearly from our definition of we have that it holds that . Also it is not hard to show that . Thus if indeed such a constant exists then we have a construction of a LLO with parameter that requires only a single call to a linear optimization oracle of .
The following theorem states the convergence rate of the proposed method .
Theorem 1: Given a polytope with parameter as defined above, there exists an algorithm that after linear minimization steps over the domain produces a point such that
The following theorem states that indeed a constant as suggested above exists for any polyhedral set and gives its dependence on geometric proprieties of the polytope.
Theorem 2: Let be a polytope defined as such that . Assume that there exists parameters such that:
1. for any matrix which consists of at most linearly independent rows of it holds that (here denotes the spectral norm of ).
2. For every vertex of and every row of it holds that either or (that is, given a vertex and an inequality constraint, the vertex either satisfies the constraint with equality or is far from satisfying it with an equality).
Then .
We now turn to point out several examples of polyhedral sets for which tailored combinatorial algorithms for linear optimization exist and for which the bound on given in Theorem 2 is reasonable.
The simplex
The -dimensional simplex can be written in the form of linear constraints as . Its not hard to see that for the simplex it holds that and thus by Theorem 2 . In particular we get that when applying the proposed algorithm to the problem an error is achieved after iterations which as stated here, is nearly optimal.
The flow polytope
Given a directed acyclic graph with edges, a source node marked and a target node marked , every path from to in the graph can be represented by its identifying vector, that is a vector in in which the entries that are set to 1 correspond to edges of the path. The flow polytope of the graph is the convex-hull of all such identifying vectors of the simple paths from to . This polytope is also exactly the set of all unit flows in the graph if we assume that each edge has a unit flow capacity (a flow is represented here as a vector in in which each entry is the amount of flow through the corresponding edge). For this polytope also it is not hard to see that and thus . Since the flow polytope is just the convex-hull of paths in the graph, minimizing a linear objective over it amounts to finding a minimum weight path given weights for the edges.
The doubly-stochastic matrices polytope
Doubly-stochastic matrices are squared real-valued matrices with non-negative entries in which the sum of entries of each row and each column amounts to 1. Writing down the linear inequalities and equalities that define this polytope yields that here also .
The Birkhoff Von Neumann theorem states the this polytope is the convex hull of exactly all permutation matrices. Since a permutation matrix corresponds to a perfect matching in a fully connected bipartite graph, linear minimization over this polytope corresponds to finding a minimum weight perfect matching in a bipartite graph.
Matroid polytopes
A matroid is pair where is a set of elements and is a set of subsets of called the independent sets which satisfy various interesting proprieties that resemble the concept of linear independence in vector spaces.
Matroids have been studied extensively in combinatorial optimization and a key example of a matroid is the graphical matroid in which the set is the set of edges of a given graph and the set is the set of all subsets of which are cycle-free. In particular in this case contains all the spanning trees of the graph. A subset could be represented by its identifying vector which lies in which also gives rise to the matroid polytope which is just the convex hull of all identifying vectors of sets in . It can be shown that the matroid polytope can be defined by exponentially many linear inequalities (exponential in ) which makes optimization over it a non-trivial issue since the \textit{ellipsoid method} needs to be used which is highly impractical. Moreover a separation oracle for this polytope which runs in polynomial time in exits however it is also fairly complicated. On the contrary, linear optimization over matroid polytopes is very easy using a simple greedy procedure which runs in nearly linear time. It can be shown that for the matroid polytope it holds that and where . Thus .
An interesting application of the above method presented here, which was also our initial motivation for studying this problem, is for the setting of online convex optimization (sometimes termed online learning). Combining the above algorithm with the analysis of an online Frank-Wolfe method presented in this paper by Hazan and Kale from ICML 2012 results in an algorithm for online convex optimization with an iteration complexity that amounts to a single linear optimization step over the domain instead of projection computation which can be much more involved. This algorithm has optimal regret in terms of the game length which answers an open question by Kalai and Vempala posed in this paper from COLT 2003 and by Hazan and Kale from ICML 2012. Further applications of the method include Frank-Wolfe like algorithms for stochastic and non-smooth optimization.
We end this post with some further research questions.
The method presented in this post clearly holds only for “nicely shaped” polytopes because of the dependency of the constant on . In particular if we take to be the euclidean unit ball which could be defined by infinitely-many linear inequalities we will have that and the analysis breaks down. So, an interesting open question is
Question 3: Is there a CG method with a linear convergence rate for smooth and strongly-convex optimization over arbitrary convex sets? In particular is the rate suggested in Question 2 (see Part I) attainable?