In a previous post Sebastien presented and analysed the conditional gradient method for minimizing a smooth convex function $f$ over a compact and convex domain $\mathcal{K} \subseteq \mathbb{R}^n$. The update step of the method is as follows,

$$x_{t+1} = x_t + \eta_t (p_t - x_t),$$

where

$$p_t \in \arg\min_{p \in \mathcal{K}} \; p^\top \nabla f(x_t),$$

$\eta_t \in [0, 1]$.
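To fix ideas, here is a minimal Python sketch of one iteration; `linear_oracle` is an assumed callable (not from the post or paper) returning $\arg\min_{p \in \mathcal{K}} p^\top g$ for a given vector $g$, which is the only access to the domain the method needs.

```python
import numpy as np

def conditional_gradient_step(x, grad_f, linear_oracle, eta):
    """One conditional gradient (Frank-Wolfe) update.

    `linear_oracle(g)` is assumed to return argmin_{p in K} <p, g>;
    `eta` is the step size in [0, 1].
    """
    p = linear_oracle(grad_f(x))   # linear minimization over the domain
    return x + eta * (p - x)       # convex combination, so the iterate stays in K
```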
The appeal of the conditional gradient method is twofold:
i) the update step of the method requires only minimizing a linear objective over the domain, which for many domains of interest is computationally cheap (examples are various polytopes that arise in combinatorial optimization, such as the flow polytope, the spanning tree polytope and the matching polytope, and the set of unit-trace positive semidefinite matrices), whereas other first-order methods require computing a projection onto the convex set with respect to a certain norm on each iteration, which can be computationally much more expensive, and
ii) the method yields sparse solutions, in the sense that after $t$ iterations, and provided that the first iterate is a vertex of the convex domain, the current iterate is naturally given as a convex combination of at most $t+1$ vertices of the convex domain (for the simplex this means at most $t+1$ non-zero entries and for the set of unit-trace psd matrices this means rank at most $t+1$); a concrete example for the simplex is sketched below.
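As an illustration of point (i), for the probability simplex the linear oracle is just a coordinate-wise minimum, so each step costs a single pass over the gradient; a small sketch:

```python
import numpy as np

def simplex_linear_oracle(g):
    """argmin of <p, g> over the simplex is attained at a vertex,
    namely the standard basis vector at the smallest coordinate of g."""
    p = np.zeros_like(g)
    p[np.argmin(g)] = 1.0
    return p
```

Since each update mixes in at most one new vertex, starting from a vertex the iterate after $t$ iterations indeed has at most $t+1$ non-zero entries, which is point (ii).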
If $f$ is $\beta$-smooth and the diameter of $\mathcal{K}$ is $D$ (i.e. $\max_{x, y \in \mathcal{K}} \|x - y\| \le D$) then choosing $\eta_t = \frac{2}{t+1}$ guarantees that $f(x_t) - f(x^*) = O\left(\frac{\beta D^2}{t}\right)$. This is also the convergence rate of projected gradient descent for smooth optimization. In case the objective function $f$ is both $\beta$-smooth and $\sigma$-strongly convex (with respect to the same norm) it is known that projected gradient descent has an error that decreases exponentially fast: an additive error $\epsilon$ is guaranteed after at most $\tilde{O}(\kappa \log(1/\epsilon))$ iterations, where $\kappa = \beta/\sigma$ is the condition number of $f$ (the complexity notation $\tilde{O}$ is the same as $O$ but omits factors that are logarithmic in the diameter of $\mathcal{K}$ and in $\kappa$). What about the conditional gradient method in this case? So our first question for this post is the following,
Question 1: Is there a CG method that, given a $\beta$-smooth and $\sigma$-strongly convex function $f$, guarantees an $\epsilon$ additive error after at most $\tilde{O}(\mathrm{poly}(\kappa) \log(1/\epsilon))$ linear optimization steps over the domain?
The answer is no, as observed here. In particular, the convergence rate of a linearly converging CG method must depend polynomially on the dimension, which is not the case for the projected gradient descent method. This brings us to our second question,
Question 2: Is there a CG method that, given a $\beta$-smooth and $\sigma$-strongly convex function $f$, guarantees an $\epsilon$ additive error after at most $\tilde{O}(\mathrm{poly}(n)\,\kappa \log(1/\epsilon))$ linear optimization steps over the domain?
Note that despite the $\mathrm{poly}(n)$ factor, such a result is still interesting, since the time to compute a euclidean projection onto the domain (or a non-euclidean one, in case of mirror descent methods) may be longer than the time to minimize a linear objective over the domain by a $\mathrm{poly}(n)$ multiplicative factor.
Here is the place to remark that several linearly converging CG methods for smooth and strongly convex optimization were proposed before, but they rely on assumptions regarding the location of $x^*$ with respect to the boundary of $\mathcal{K}$. For example, if $x^*$ lies in the interior of $\mathcal{K}$ (which means that the problem is in fact an unconstrained one) then the original CG method due to Frank and Wolfe converges exponentially fast; however, the number of iterations depends polynomially on the distance of $x^*$ from the boundary, see this paper by Guélat and Marcotte. In case $\mathcal{K}$ is a polytope, a modification of the CG method presented in the same paper gives a linear convergence rate that is polynomial in the distance of $x^*$ from the boundary of the smallest facet of $\mathcal{K}$ that contains $x^*$. Here however we don't want to rely on such strong assumptions on the location of $x^*$, and we aim at a linearly converging method that is free from such restrictions.
In the rest of this post we follow our new paper, which describes a new CG method for smooth and strongly convex optimization with a convergence rate of the form stated in Question 2 in case $\mathcal{K}$ is a polyhedral set. The convergence rate will depend on geometric properties of the set. The latter dependence is very reasonable for many polytopes that arise in combinatorial optimization problems; indeed, domains for which fast and simple combinatorial algorithms for linear optimization exist are in part what makes CG methods an attractive approach for non-linear optimization.
To begin the derivation of this new CG method, let us recall that, as Sebastien showed here, the conditional gradient method satisfies the following inequality for each iteration $t$,

$$f(x_{t+1}) - f(x^*) \le (1 - \eta_t)\left(f(x_t) - f(x^*)\right) + \frac{\beta \eta_t^2}{2}\|p_t - x_t\|^2,$$

where $p_t \in \arg\min_{p \in \mathcal{K}} p^\top \nabla f(x_t)$.
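For completeness, here is a short sketch of how this inequality follows from the standard argument: by $\beta$-smoothness of $f$ and the update $x_{t+1} = x_t + \eta_t(p_t - x_t)$, and then by the optimality of $p_t$ for the linear problem (together with the feasibility of $x^*$) and convexity,

$$f(x_{t+1}) \le f(x_t) + \eta_t \nabla f(x_t)^\top (p_t - x_t) + \frac{\beta \eta_t^2}{2}\|p_t - x_t\|^2 \le f(x_t) + \eta_t \left(f(x^*) - f(x_t)\right) + \frac{\beta \eta_t^2}{2}\|p_t - x_t\|^2,$$

and subtracting $f(x^*)$ from both sides gives the stated inequality.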
The fact that $\|p_t - x_t\|$ might be as large as the diameter of $\mathcal{K}$, while the approximation error $f(x_t) - f(x^*)$ may be arbitrarily small, forces one to choose step sizes $\eta_t$ that decrease roughly like $1/t$ in order to get the known $O(1/t)$ convergence rate.
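Indeed, writing $h_t = f(x_t) - f(x^*)$ and bounding $\|p_t - x_t\| \le D$ in the inequality above, a quick check shows that the choice $\eta_t = \frac{2}{t+1}$ sustains the induction hypothesis $h_t \le \frac{2\beta D^2}{t+1}$:

$$h_{t+1} \le \left(1 - \frac{2}{t+1}\right)\frac{2\beta D^2}{t+1} + \frac{2\beta D^2}{(t+1)^2} = \frac{2\beta D^2\, t}{(t+1)^2} \le \frac{2\beta D^2}{t+2}.$$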
Let us now consider the case in which $f$ is also $\sigma$-strongly convex. That is,

$$\forall x, y \in \mathcal{K}: \quad f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{\sigma}{2}\|y - x\|^2.$$

In particular, plugging $x = x^*$, $y = x_t$ and using the first-order optimality condition $\nabla f(x^*)^\top (x_t - x^*) \ge 0$, the above inequality implies that

$$\|x_t - x^*\|^2 \le \frac{2}{\sigma}\left(f(x_t) - f(x^*)\right).$$
Now assume that the iterate $x_t$ satisfies $f(x_t) - f(x^*) \le \epsilon_t$ for some $\epsilon_t > 0$, and denote by $B(x, r)$ the euclidean ball of radius $r$ centred at point $x$. Define $r_t = \sqrt{2\epsilon_t/\sigma}$ and let us now choose $p_t$ as the optimal solution to the following problem,

$$\min_{p \in \mathcal{K} \cap B(x_t, r_t)} \; p^\top \nabla f(x_t).$$
Note that the CG inequality still holds, since by the strong convexity we have that $x^* \in \mathcal{K} \cap B(x_t, r_t)$, and the only important thing in the selection of $p_t$ that is required for the CG inequality to hold is that $p_t^\top \nabla f(x_t) \le (x^*)^\top \nabla f(x_t)$. We'll now get that,

$$f(x_{t+1}) - f(x^*) \le (1 - \eta_t)\epsilon_t + \frac{\beta \eta_t^2}{2} r_t^2 = \left(1 - \eta_t + \frac{\beta}{\sigma}\eta_t^2\right)\epsilon_t.$$
Thus, choosing a constant step size $\eta_t = \frac{\sigma}{2\beta}$ and an initial error bound $\epsilon_1 \ge f(x_1) - f(x^*)$ results in exponentially fast convergence: since $1 - \eta_t + \frac{\beta}{\sigma}\eta_t^2 = 1 - \frac{\sigma}{2\beta} + \frac{\sigma}{4\beta} = 1 - \frac{\sigma}{4\beta}$, we may take $\epsilon_{t+1} = \left(1 - \frac{\sigma}{4\beta}\right)\epsilon_t$, so that the error bound decreases geometrically.
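To summarize the idealized scheme, here is a minimal Python sketch; `ball_linear_oracle` is a hypothetical routine standing in for a solver of the ball-constrained linear problem above, which, as discussed next, is exactly the expensive part.

```python
import numpy as np

def idealized_cg(x1, grad_f, ball_linear_oracle, beta, sigma, eps1, num_iters):
    """Sketch of the idealized ball-constrained CG variant described above.

    `ball_linear_oracle(g, center, radius)` is a placeholder assumed to return
    argmin <p, g> over K intersected with B(center, radius).
    """
    x, eps = x1, eps1
    eta = sigma / (2.0 * beta)              # constant step size sigma/(2*beta)
    for _ in range(num_iters):
        r = np.sqrt(2.0 * eps / sigma)      # by strong convexity, x* lies in B(x, r)
        p = ball_linear_oracle(grad_f(x), x, r)
        x = x + eta * (p - x)               # usual CG update
        eps = (1.0 - sigma / (4.0 * beta)) * eps   # error bound contracts geometrically
    return x
```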
The problem with the above approach is of course that the optimization problem that now needs to be solved on each iteration is no longer a linear minimization step over the original domain, but a much more difficult problem, due to the additional constraint on the distance of $p_t$ from $x_t$. What we would like to get is an exponentially fast converging method that still requires solving only a linear minimization problem over $\mathcal{K}$ on each iteration. This will be the subject of our next post.