Guest post by Dan Garber and Elad Hazan: The Conditional Gradient Method, A Linear Convergent Algorithm – Part I/II

In a previous post Sebastien presented and analysed the conditional gradient method for minimizing a smooth convex function f over a compact and convex domain \mathcal{X}. The update step of the method is as follows,

    \begin{eqnarray*} y_t \in \arg\min_{y\in\mathcal{X}}\nabla{}f(x_t)^{\top}y \\ x_{t+1} = (1-\gamma_t)x_t + \gamma_ty_t \end{eqnarray*}

where \forall{t}, \gamma_t\in(0,1).

The appeal of the conditional gradient method is two fold:

i) the update step of the method requires only to minimize a linear objective over the domain, which for many domains of interest is computationally cheap (examples are various polytopes that arise in combinatorial optimization such as the flow polytope, the spanning tree polytope and the matching polytope, and the set of unit-trace positive semidefinite matrices), whereas other first order methods require to compute a projection onto the convex set with respect to a certain norm on each iteration which can be computationally much more expensive, and

ii) the method yields sparse solutions in the sense that after t iterations and provided that the first iterate is a vertex of the convex domain, the current iterate is naturally given as the convex sum of at most t+1 vertices of the convex domain (for the simplex this means at most t+1 non-zero entries and for the set of unit-trace psd matrices this means rank at most t+1).

If f is \beta-smooth and the diameter of \mathcal{X} is D (i.e. \max_{x,y\in\mathcal{X}}\Vert{x-y}\Vert_2 = D) then choosing \gamma_t = \frac{2}{t+1} guarantees that f(x_t) - f(x^*) \leq \frac{2\beta{}D^2}{t+1}. This is also the convergence rate of projected gradient descent for smooth optimization. In case the objective function f is both \beta-smooth and \alpha-strongly convex (with respect to the same norm) it is known that projected gradient descent has an error that decreases exponentially fast: an additive error \epsilon is guaranteed after at most \tilde{O}\left({Q\log{\frac{1}{\epsilon}}}\right) iterations where Q = \beta/\alpha is the condition number of f (the complexity notation \tilde{O}() is the same as O() but omits factors that are logarithmic in the diameter of \mathcal{X} and in \beta). What about the conditional gradient method in this case? So our first question for this post is the following,

Question 1: Is there a CG method that given a \beta-smooth and \alpha-strongly convex function f guarantees an \epsilon additive error after at most \tilde{O}\left({Q\log{\frac{1}{\epsilon}}}\right) linear optimization steps over the domain? 

The answer is no, as observed here. In particular the convergence rate of a linearly converging CG method must depend polynomially on the dimension which is not the case for the projected gradient descent method. This brings us to our second question,

Question 2: Is there a CG method that given a \beta-smooth and \alpha-strongly convex function f guarantees an \epsilon additive error after at most \textrm{poly}(n)\cdot\tilde{O}\left({Q\log{\frac{1}{\epsilon}}}\right) linear optimization steps over the domain?

Note that albeit the \textrm{poly}(n) factor such a result is still interesting since the time to compute a euclidean projection to the domain (or non-euclidean in case of mirror descent methods) may be longer than the time to minimize a linear objective over the domain by a \textrm{poly}(n) multiplicative factor.

Here is the place to remark that several linearly converging CG methods for smooth and strongly convex optimization were proposed before but they rely on assumptions regarding the location of x^* with respect to the boundary of \mathcal{X}. For example if x^* lies in the interior of \mathcal{X} (which means that the problem is an unconstrained one) then the original CG method due to Frank and Wolfe converges exponentially fast, however the number of iterations depends polynomially on the distance of x^* from the boundary, see this paper by Guélat and Marcotte. In case \mathcal{X} is a polytope then a modification of the CG method presented in the same paper, gives a linear convergence rate that is polynomial in the distance of x^* from the boundary of the smallest facet of \mathcal{X} that contains x^*. Here however we don’t want to rely on such strong assumptions on the location of x^* and we aim at a linearly converging method that is free from such restrictions.

 

In the rest of this post we follow our new paper which describes a new CG method for smooth and strongly convex optimization with convergence rate of the form stated in question 2 in case \mathcal{X} is a polyhedral set. Such the convergence rate will depend on geometric properties of the set. The latter dependence is very reasonable for many polytopes that arise in combinatorial optimization problems; indeed domains for which fast and simple combinatorial algorithms for linear optimization exists are what in part makes CG methods an attractive approach for non-linear optimization.

To begin the derivation of this new CG method, let us recall that as Sebastien showed here, the conditional gradient method satisfies the following inequality for each iteration t,

    \begin{eqnarray*} \epsilon_{t+1} \leq (1-\gamma_t)\epsilon_t + \frac{\gamma_t^2\beta}{2}\Vert{y_t-x_t}\Vert^2 \end{eqnarray*}

where \epsilon_t = f(x_t) - f(x^*).

The fact that \Vert{y_t-x_t}\Vert^2 might be as large as the diameter of \mathcal{X} while the approximation error \epsilon_t may be arbitrarily small forces one to choose step sizes \gamma_t that decrease roughly like 1/t in order to get the known 1/t convergence rate.

Let us now consider the case that f is also \alpha-strongly convex. That is,

    \begin{eqnarray*} f(y) \geq f(x) + \nabla{}f(x)^{\top}(y-x) + \frac{\alpha}{2}\Vert{x-y}\Vert^2 \end{eqnarray*}

In particular the above inequality implies that,

    \begin{eqnarray*} \forall{x\in\mathcal{X}}\quad f(x) - f(x^*) \geq \frac{\alpha}{2}\Vert{x-x^*}\Vert^2 \end{eqnarray*}

Now assume that the iterate x_t satisfies that f(x_t)-f(x^*) \leq \epsilon_t for some \epsilon_t \geq 0 and denote by \mathbb{B}_r(x) the euclidean ball of radius r centred at point x. Define r_t = \sqrt{\frac{2\epsilon_t}{\alpha}} and let us now choose y_t as the optimal solution to the following problem,

    \begin{eqnarray*} \min_{y\in\mathcal{X}\cap\mathbb{B}_{r_t}(x_t)}\nabla{}f(x_t)^{\top}y \end{eqnarray*}

Note that CG inequality still holds since by the strong convexity we have that x^*\in\mathbb{B}_{r_t}(x_t) and the only important thing in the selection of y_t that is required for the CG inequality to hold is that \nabla{}f(x_t)^{\top}y_t \leq \nabla{}f(x_t)^{\top}x^*. We’ll now get that,

    \begin{eqnarray*} f(x_{t+1}) - f(x^*) \leq \left({1-\gamma_t + \gamma_t^2Q}\right)\epsilon_t \end{eqnarray*}

Thus choosing a constant step size like \gamma_t = \frac{1}{2Q} and an initial error bound \epsilon_1 \geq f(x_1)-f(x^*) results in exponentially fast convergence.

The problem with the above approach is of course that now the optimization problem that needs to be solved on each iteration is no longer a linear minimization step over the original domain \mathcal{X} but a much more difficult problem due to the additional constraint on the distance of y_t from x_t. What we would like to get is an exponentially fast converging method that still requires to solve only a linear minimization problem over \mathcal{X} on each iteration. This will be the subject of our next post.

This entry was posted in Optimization. Bookmark the permalink.