The goal of this post is to develop a Conditional Gradient method that converges exponentially fast while solving, essentially, only a single linear minimization problem over the domain on each iteration.
To this end we consider the following relaxation of the problem of minimizing a linear objective over the intersection of the domain with a small ball, which we term a Local Linear Oracle (LLO).
Definition 1: A LLO with parameter $\rho \geq 1$ for the convex domain $\mathcal{K} \subseteq \mathbb{R}^n$ is a procedure that, given a point $x \in \mathcal{K}$, a scalar $r > 0$ and a linear objective $c \in \mathbb{R}^n$, returns a point $p \in \mathcal{K}$ that satisfies:
1. $c^\top p \leq c^\top z$ for every $z \in \mathcal{K}$ with $\|z - x\| \leq r$;
2. $\|x - p\| \leq \rho\, r$.
Clearly, by taking $p_t$ to be the output of such an LLO (queried at $x_t$ with the objective $c = \nabla f(x_t)$) and choosing the radius $r_t$ appropriately, we would get exponentially fast convergence of the form $f(x_t) - f(x^*) \leq C e^{-\theta t}$ with $\theta = \Omega\!\left(\frac{\alpha}{\beta \rho^2}\right)$ (see Part I for more details).
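To recall why, here is a compressed sketch of the Part I argument; the radius choice $r_t = \sqrt{2h_t/\alpha}$ and the step size $\eta$ below are the natural ones for an $\alpha$-strongly convex, $\beta$-smooth objective $f$, written in my notation rather than quoted from Part I. Writing $h_t = f(x_t) - f(x^*)$, strong convexity gives $\|x_t - x^*\| \leq r_t$, so property 1 of the LLO output $p_t$ together with convexity gives $\nabla f(x_t)^\top (p_t - x_t) \leq \nabla f(x_t)^\top (x^* - x_t) \leq -h_t$. Setting $x_{t+1} = x_t + \eta(p_t - x_t)$, smoothness and property 2 then give

$$h_{t+1} \leq h_t + \eta\, \nabla f(x_t)^\top (p_t - x_t) + \frac{\beta \eta^2}{2} \|p_t - x_t\|^2 \leq h_t \left(1 - \eta + \frac{\beta \rho^2}{\alpha}\, \eta^2 \right) \leq h_t \left(1 - \frac{\alpha}{4 \beta \rho^2}\right),$$

where the last inequality follows by choosing $\eta = \frac{\alpha}{2 \beta \rho^2}$.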
In the following we show that a procedure for an LLO could be constructed for any polytope such that each call to the LLO amounts to a single linear minimization step over the domain, and we specify the parameter $\rho$.
As a simple first case, let us consider the very specific setting in which $\mathcal{P} = \Delta_n$, that is, the domain is just the $n$-dimensional simplex. An LLO in this case could be constructed by solving the following optimization problem:

$$p^* = \arg\min_{y \in \Delta_n} c^\top y \quad \text{s.t.} \quad \|y - x\|_1 \leq d,$$

where $d = \sqrt{n}\, r$.
Denote by $p^*$ the optimal solution to the above problem. One can verify that since $\|x - z\|_1 \leq \sqrt{n}\, \|x - z\|_2$ for every $z$, it follows that every $z \in \Delta_n$ with $\|x - z\|_2 \leq r$ is feasible for the above problem, and hence $c^\top p^* \leq c^\top z$. Moreover it holds that $\|x - p^*\|_2 \leq \|x - p^*\|_1 \leq d = \sqrt{n}\, r$. Thus solving the above $\ell_1$-constrained linear problem yields an LLO for the simplex with parameter $\rho = \sqrt{n}$. Most importantly, the above $\ell_1$-constrained problem is solvable using only a single linear minimization step over $\Delta_n$ and additional computation that is polynomial in the number of non-zero entries of the input point $x$. To see this, observe that $p^*$ is just the outcome of taking weight $d/2$ from the non-zero entries of $x$ that correspond to the largest (signed) entries in the vector $c$ and moving it entirely to a single entry that corresponds to the smallest (signed) entry in $c$. This just requires checking, for each non-zero entry in $x$, the value of the corresponding entry in $c$, sorting these values, and reducing weights until a total weight of $d/2$ has been removed. Finding the smallest entry in $c$, although a trivial operation, is just a linear minimization problem over the simplex.
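To make this concrete, here is a minimal sketch of the simplex LLO in Python; the function name, the dense NumPy representation, and the explicit budget cap are my own choices, not part of the original construction:

```python
import numpy as np

def simplex_llo(x, r, c):
    """LLO for the n-dimensional simplex: solves
    min_{y in Delta_n} c.y  s.t.  ||y - x||_1 <= d,  with d = sqrt(n) * r."""
    n = len(x)
    budget = min(np.sqrt(n) * r / 2.0, 1.0)  # moving weight w costs 2w in l1 distance
    y = x.astype(float).copy()
    j = int(np.argmin(c))                    # the single "linear minimization" step
    # Take weight from the non-zero entries with the largest c_i, move it to entry j.
    for i in sorted(np.nonzero(x)[0], key=lambda i: -c[i]):
        if budget <= 0 or c[i] <= c[j]:
            break                            # moving more weight cannot help
        w = min(y[i], budget)
        y[i] -= w
        y[j] += w
        budget -= w
    return y
```

Note that apart from the argmin (which for a general polytope becomes the call to the linear optimization oracle), the procedure only touches the support of $x$.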
What about the general case in which $\mathcal{P}$ is some arbitrary polytope? We would like to generalize the construction for the simplex to arbitrary polytopes.
Given the input point $x$ to the LLO, let us write $x$ as a convex combination of vertices of the polytope, that is $x = \sum_i \lambda_i v_i$, where $\lambda_i \geq 0$, $\sum_i \lambda_i = 1$ and each $v_i$ is a vertex of $\mathcal{P}$. Suppose now that there exists a constant $\mu$ such that any point $y \in \mathcal{P}$ which satisfies $\|x - y\| \leq r$ could be written also as a convex combination of vertices of the polytope, $y = \sum_i \gamma_i v_i$, such that $\|\lambda - \gamma\|_1 \leq \mu\, r$. Denote by $N$ the number of vertices of $\mathcal{P}$ and by $V = \{v_1, \dots, v_N\}$ the set of these vertices (so that $\lambda$ and $\gamma$ may be viewed as points in $\Delta_N$, padding with zeros). Consider the following optimization problem:

$$\gamma^* = \arg\min_{\gamma \in \Delta_N} \sum_{i=1}^N \gamma_i \, (c^\top v_i) \quad \text{s.t.} \quad \|\gamma - \lambda\|_1 \leq \mu\, r.$$
Since the above problem is again a linear $\ell_1$-constrained optimization problem over a simplex (now the $N$-dimensional one), we know that it is solvable using only a single call to the linear optimization oracle of $\mathcal{P}$ (finding the vertex minimizing $c^\top v_i$ is exactly a linear minimization over $\mathcal{P}$) plus time that is polynomial in the number of non-zeros in $\lambda$ (which is at most $t$ after $t$ iterations of the algorithm, and in particular does not depend on $N$, which may be exponential in $n$). Let $\gamma^*$ be an optimal solution to the above problem and let $p = \sum_{i=1}^N \gamma^*_i v_i$. Then, from our definition of $\mu$, every $z \in \mathcal{P}$ with $\|x - z\| \leq r$ admits weights $\gamma$ that are feasible for the above problem, and hence it holds that $c^\top p \leq c^\top z$. Also it is not hard to show that $\|x - p\| \leq \frac{1}{2}\|\lambda - \gamma^*\|_1\, D \leq \frac{\mu D}{2}\, r$, where $D$ denotes the diameter of $\mathcal{P}$. Thus if indeed such a constant $\mu$ exists, then we have a construction of an LLO with parameter $\rho = \mu D / 2$ that requires only a single call to a linear optimization oracle of $\mathcal{P}$.
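The $\ell_1$-constrained step over $\Delta_N$ never needs the vertex set explicitly. Here is a sketch; the decomposition-as-dictionary representation and all names are mine, and `linear_oracle(c)` is assumed to return a vertex of $\mathcal{P}$ minimizing $c^\top v$:

```python
import numpy as np

def polytope_llo(decomposition, r, c, mu, linear_oracle):
    """LLO for a polytope whose iterate is kept as an explicit convex
    combination of vertices: decomposition maps vertex (a tuple) -> weight.
    Implements the l1-constrained step over the implicit N-dim simplex."""
    budget = min(mu * r / 2.0, 1.0)      # l1 radius mu*r allows moving mu*r/2 weight
    new_dec = dict(decomposition)
    v_best = tuple(linear_oracle(c))     # the single linear minimization over P
    best_val = float(np.dot(c, v_best))
    # Shift weight from the support vertices with the largest c.v to v_best.
    for v in sorted(new_dec, key=lambda v: -float(np.dot(c, v))):
        if budget <= 0 or np.dot(c, v) <= best_val:
            break
        w = min(new_dec[v], budget)
        new_dec[v] -= w
        new_dec[v_best] = new_dec.get(v_best, 0.0) + w
        budget -= w
    new_dec = {v: w for v, w in new_dec.items() if w > 0}
    p = sum(w * np.array(v) for v, w in new_dec.items())
    return p, new_dec
```

The support of the decomposition grows by at most one vertex per call, which is what makes the per-iteration cost independent of $N$.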
The following theorem states the convergence rate of the proposed method.
Theorem 1: Given a polytope $\mathcal{P}$ with parameter $\mu$ as defined above and diameter $D$, and an $\alpha$-strongly convex, $\beta$-smooth objective $f$, there exists an algorithm that after $t$ linear minimization steps over the domain $\mathcal{P}$ produces a point $x_t \in \mathcal{P}$ such that

$$f(x_t) - \min_{x \in \mathcal{P}} f(x) \leq C \exp\left(-c\, \frac{\alpha}{\beta \mu^2 D^2}\, t\right)$$

for some constants $C, c > 0$.
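Put together, the overall method is the usual conditional-gradient loop with the linear minimization call replaced by an LLO call. A sketch follows; the explicit schedule for $r_t$ and $\eta$ mirrors the derivation sketched earlier and is an assumption on my part, and `h0` is an assumed upper bound on the initial gap:

```python
import numpy as np

def linearly_convergent_cg(llo, grad_f, x0, h0, alpha, beta, rho, T):
    """Conditional gradient with a Local Linear Oracle.  Each iteration
    makes one LLO call, i.e. one linear minimization over the domain."""
    x, h = x0, h0                      # h tracks an upper bound on f(x) - f(x*)
    eta = alpha / (2.0 * beta * rho ** 2)
    for _ in range(T):
        r = np.sqrt(2.0 * h / alpha)   # the ball of radius r around x contains x*
        p = llo(x, r, grad_f(x))
        x = x + eta * (p - x)
        h *= 1.0 - eta / 2.0           # per-step contraction from the analysis
    return x
```

For instance, over the simplex one would pass `llo=simplex_llo` from the earlier sketch.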
The following theorem states that indeed a constant $\mu$ as suggested above exists for any polyhedral set, and gives its dependence on geometric properties of the polytope.
Theorem 2: Let $\mathcal{P}$ be a polytope defined as $\mathcal{P} = \{x \in \mathbb{R}^n : A_1 x = b_1,\ A_2 x \leq b_2\}$ such that $A_2 \in \mathbb{R}^{m \times n}$. Assume that there exist parameters $\psi, \xi > 0$ such that:
1. For any matrix $M$ which consists of at most $n$ linearly independent rows of $A_2$ it holds that $\|M\| \leq \psi$ (here $\|M\|$ denotes the spectral norm of $M$).
2. For every vertex $v$ of $\mathcal{P}$ and every row $A_2(i)$ of $A_2$ it holds that either $A_2(i) \cdot v = b_2(i)$ or $b_2(i) - A_2(i) \cdot v \geq \xi$ (that is, given a vertex and an inequality constraint, the vertex either satisfies the constraint with equality or is far from satisfying it with an equality).
Then $\mu \leq \frac{\sqrt{n}\, \psi}{\xi}$.
We now turn to point out several examples of polyhedral sets for which tailored combinatorial algorithms for linear optimization exist and for which the bound on $\mu$ given in Theorem 2 is reasonable.
The simplex
The $n$-dimensional simplex can be written in the form of linear constraints as $\Delta_n = \{x \in \mathbb{R}^n : \sum_{i=1}^n x_i = 1,\ x_i \geq 0\ \forall i\}$. It is not hard to see that for the simplex it holds that $\psi = 1$ and $\xi = 1$, and thus by Theorem 2, $\mu \leq \sqrt{n}$. In particular we get that when applying the proposed algorithm to the problem $\min_{x \in \Delta_n} f(x)$, an error $\epsilon$ is achieved after $O\!\left(n\, \frac{\beta}{\alpha} \log\frac{1}{\epsilon}\right)$ iterations, which, as stated here, is nearly optimal.
The flow polytope
Given a directed acyclic graph with $m$ edges, a source node marked $s$ and a target node marked $t$, every path from $s$ to $t$ in the graph can be represented by its identifying vector, that is, a vector in $\{0,1\}^m$ in which the entries that are set to 1 correspond to the edges of the path. The flow polytope of the graph is the convex hull of all such identifying vectors of the simple paths from $s$ to $t$. This polytope is also exactly the set of all unit $s$-$t$ flows in the graph if we assume that each edge has a unit flow capacity (a flow is represented here as a vector in $\mathbb{R}^m$ in which each entry is the amount of flow through the corresponding edge). For this polytope also it is not hard to see that $\psi = 1$ and $\xi = 1$, and thus $\mu \leq \sqrt{m}$. Since the flow polytope is just the convex hull of the $s$-$t$ paths in the graph, minimizing a linear objective over it amounts to finding a minimum-weight path given weights for the edges.
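For illustration, here is a self-contained sketch of this linear oracle (the names and edge-list representation are mine): relaxing edges in topological order handles the negative weights that a gradient may induce.

```python
import math
from collections import defaultdict, deque

def min_weight_path(n_nodes, edges, w, s, t):
    """Linear oracle for the flow polytope: the identifying vector of a
    minimum-weight s-t path in a DAG (t assumed reachable from s)."""
    adj = defaultdict(list)
    indeg = [0] * n_nodes
    for idx, (u, v) in enumerate(edges):
        adj[u].append((v, idx))
        indeg[v] += 1
    order, q = [], deque(u for u in range(n_nodes) if indeg[u] == 0)
    while q:                                  # Kahn's algorithm
        u = q.popleft()
        order.append(u)
        for v, _ in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                q.append(v)
    dist = [math.inf] * n_nodes
    pred = [None] * n_nodes                   # index of the edge entering v
    dist[s] = 0.0
    for u in order:                           # relax edges in topological order
        if dist[u] == math.inf:
            continue
        for v, idx in adj[u]:
            if dist[u] + w[idx] < dist[v]:
                dist[v], pred[v] = dist[u] + w[idx], idx
    x = [0.0] * len(edges)                    # recover the identifying vector
    v = t
    while v != s:
        idx = pred[v]
        x[idx] = 1.0
        v = edges[idx][0]
    return x
```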
The doubly-stochastic matrices polytope
Doubly-stochastic matrices are square real-valued matrices with non-negative entries in which the entries of each row and each column sum to 1. Writing down the linear inequalities and equalities that define this polytope yields that here also $\psi = 1$ and $\xi = 1$.
The Birkhoff-von Neumann theorem states that this polytope is the convex hull of exactly all permutation matrices. Since a permutation matrix corresponds to a perfect matching in a fully connected bipartite graph, linear minimization over this polytope corresponds to finding a minimum-weight perfect matching in a bipartite graph.
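In code, such an oracle is a one-liner given an assignment solver; here is a sketch using SciPy's `linear_sum_assignment` (my choice of solver, not one prescribed by the method):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def birkhoff_oracle(C):
    """Linear oracle for the doubly-stochastic polytope: min <C, X> over
    doubly-stochastic X is attained at a permutation matrix, i.e. at a
    minimum-weight perfect matching for the bipartite cost matrix C."""
    rows, cols = linear_sum_assignment(C)
    P = np.zeros_like(C, dtype=float)
    P[rows, cols] = 1.0
    return P
```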
Matroid polytopes
A matroid is pair where
is a set of elements and
is a set of subsets of
called the independent sets which satisfy various interesting proprieties that resemble the concept of linear independence in vector spaces.
Matroids have been studied extensively in combinatorial optimization and a key example of a matroid is the graphical matroid in which the set is the set of edges of a given graph and the set
is the set of all subsets of
which are cycle-free. In particular
in this case contains all the spanning trees of the graph. A subset
could be represented by its identifying vector which lies in
which also gives rise to the matroid polytope which is just the convex hull of all identifying vectors of sets in
. It can be shown that the matroid polytope can be defined by exponentially many linear inequalities (exponential in
) which makes optimization over it a non-trivial issue since the \textit{ellipsoid method} needs to be used which is highly impractical. Moreover a separation oracle for this polytope which runs in polynomial time in
exists; however, it is also fairly complicated. On the contrary, linear optimization over matroid polytopes is very easy using a simple greedy procedure which runs in nearly linear time. It can be shown that for the matroid polytope it holds that $\psi \leq n$ and $\xi \geq 1$, where $n = |E|$. Thus $\mu \leq n^{3/2}$.
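Here is a sketch of that greedy oracle for the graphical matroid (my implementation; union-find keeps the chosen edge set cycle-free):

```python
def graphic_matroid_oracle(n_nodes, edges, c):
    """Greedy linear minimization over the graphical-matroid polytope:
    scan edges by increasing cost, keep an edge if its cost is negative
    and adding it leaves the edge set cycle-free (checked via union-find).
    Correctness is the classical matroid greedy theorem."""
    parent = list(range(n_nodes))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path halving
            u = parent[u]
        return u

    x = [0.0] * len(edges)
    for idx in sorted(range(len(edges)), key=lambda i: c[i]):
        if c[idx] >= 0:
            break          # any further edge can only increase the objective
        ru, rv = find(edges[idx][0]), find(edges[idx][1])
        if ru != rv:       # the set stays independent (cycle-free)
            parent[ru] = rv
            x[idx] = 1.0
    return x
```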
An interesting application of the method presented here, which was also our initial motivation for studying this problem, is to the setting of online convex optimization (sometimes termed online learning). Combining the above algorithm with the analysis of an online Frank-Wolfe method presented in this paper by Hazan and Kale from ICML 2012 results in an algorithm for online convex optimization whose iteration complexity amounts to a single linear optimization step over the domain, instead of a projection computation, which can be much more involved. This algorithm has optimal regret in terms of the game length, which answers an open question by Kalai and Vempala posed in this paper from COLT 2003 and by Hazan and Kale from ICML 2012. Further applications of the method include Frank-Wolfe-like algorithms for stochastic and non-smooth optimization.
We end this post with some further research questions.
The method presented in this post clearly holds only for "nicely shaped" polytopes, because of the dependency of the constant $\mu$ on $\psi$ and $\xi$. In particular, if we take the domain $\mathcal{K}$ to be the Euclidean unit ball, which could be defined by infinitely many linear inequalities, we will have that $\xi = 0$ (so the bound on $\mu$ becomes vacuous) and the analysis breaks down. So, an interesting open question is
Question 3: Is there a CG method with a linear convergence rate for smooth and strongly-convex optimization over arbitrary convex sets? In particular is the rate suggested in Question 2 (see Part I) attainable?