Revisiting Nesterov’s Acceleration

Nesterov’s accelerated gradient descent (AGD) is hard to understand. Ever since Nesterov’s 1983 paper, people have tried to explain “why” acceleration is possible, with the hope that the answer would go beyond the mysterious (but beautiful) algebraic manipulations of the original proof. This search for a good “why” has accelerated in recent years, driven by the machine learning community, which would like to generalize Nesterov’s acceleration in various ways to deal with machine learning applications. Obviously generalizing is difficult when even the basic phenomenon is not well understood. Recent works in the “why” direction re-interpret the AGD algorithm from different points of view:

– Allen-Zhu and Orecchia view AGD as a linear coupling of Gradient Descent and Mirror Descent. This gives a nice new proof where the acceleration results from a primal-dual point of view on the progress of the algorithm.

– Su, Boyd and Candes (building on earlier work of O’Donoghue and Candes) view AGD as the discretization of a certain second-order ODE. Other papers in a somewhat similar direction include the work of Lessard, Recht and Packard, as well as that of Flammarion and Bach. Note that these papers do not give alternative proofs of acceleration for general smooth functions.

– Other relevant references include a recent paper of Arjevani, Shalev-Shwartz and Shamir, and a blog post by Hardt.

In joint work with Yin-Tat Lee and Mohit Singh we discovered a simple geometric reason for the possibility of acceleration, which can be turned into a new algorithm with the same theoretical guarantees as AGD and improved performance in practice. The proof of the convergence rate is 3 lines, assuming a completely elementary geometric fact (in dimension 2) which itself takes less than half a page to prove. The general principle is loosely inspired by the ellipsoid method (except that we work with balls, which makes the algorithm much more efficient). Let me guide you through this.

Preliminaries

Ok so we are trying to minimize on \mathbb{R}^n a function f which is \alpha-strongly convex and \beta-smooth (see here for definitions). Denote by x^* the minimizer of f, by \kappa = \beta / \alpha its condition number, and by \mathrm{B}(x, r^2) the Euclidean ball centered at x with radius squared r^2. Let

    \[x^+ = x - \frac{1}{\beta} \nabla f(x), \;\; x^{++} = x - \frac{1}{\alpha} \nabla f(x) ,\]

which are respectively a standard and a long step of gradient descent. The crucial observation is that the local information at x and the assumptions on f allow us to enclose the minimizer in a ball, precisely:

    \[x^* \in \mathrm{B}\left(x^{++}, \frac{|\nabla f(x)|^2}{\alpha^2} \left(1 - \frac{1}{\kappa}\right) - \frac{2}{\alpha} (f(x^+) - f(x^*)) \right) .\]
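
For the reader's convenience, here is the short computation behind this enclosure (it is not quoted from the paper, but it only uses the definitions above): strong convexity applied between x and x^* gives

    \[f(x^*) \geq f(x) + \nabla f(x) \cdot (x^* - x) + \frac{\alpha}{2} |x^* - x|^2 = f(x) + \frac{\alpha}{2} |x^* - x^{++}|^2 - \frac{|\nabla f(x)|^2}{2 \alpha} ,\]

which rearranges to x^* \in \mathrm{B}\left(x^{++}, \frac{|\nabla f(x)|^2}{\alpha^2} - \frac{2}{\alpha} (f(x) - f(x^*))\right); plugging in the standard smoothness bound f(x) \geq f(x^+) + \frac{1}{2 \beta} |\nabla f(x)|^2 then yields exactly the ball above.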

Finally we need a fact about the intersection of balls. Take a ball with radius squared R^2, and intersect it with another ball whose boundary passes through the center of the first ball. This intersection is obviously contained in a ball of radius squared R^2. An easy calculation shows that if the radius squared of the second ball shrinks by a factor 1-\epsilon, then the intersection is included in a ball of radius squared (1-\epsilon) R^2. Furthermore, if both radii squared shrink (additively) by the same amount, equal to an \epsilon fraction of the second ball’s original radius squared, then the intersection is included in a ball of radius squared (1-\sqrt{\epsilon}) R^2. As we will see, this latter fact is the reason why one can accelerate.
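
As a quick numerical sanity check of these two facts (not a proof, and my own snippet rather than anything from the paper; all names below are mine), the following Python code samples the intersection in the plane and verifies both enclosures for one arbitrary choice of R, of the distance d between the two centers, and of \epsilon:

    import numpy as np

    # Numerical sanity check (not a proof) of the two facts above, in the plane.
    # Ball 1 is B(0, R^2); ball 2 is centered at (d, 0), so before any shrinking its
    # radius squared is d^2 and its boundary passes through the center of ball 1.
    rng = np.random.default_rng(0)
    R, d, eps = 1.0, 0.8, 0.1

    def enclosing_radius_sq(r1_sq, r2_sq, n=200000):
        """Sample the intersection of B(0, r1_sq) and B((d, 0), r2_sq) and return the
        squared radius of the smallest ball, centered on the segment joining the two
        centers, that contains all sampled intersection points."""
        pts = rng.uniform(-1.5, 1.5, size=(n, 2))
        in1 = (pts ** 2).sum(axis=1) <= r1_sq
        in2 = ((pts - np.array([d, 0.0])) ** 2).sum(axis=1) <= r2_sq
        inter = pts[in1 & in2]
        cs = np.linspace(0.0, d, 200)                      # candidate centers (c, 0)
        worst = ((inter[:, :1] - cs) ** 2 + inter[:, 1:] ** 2).max(axis=0)
        return worst.min()

    # Fact 1: shrink ball 2's radius squared by a factor (1 - eps).
    print(enclosing_radius_sq(R ** 2, (1 - eps) * d ** 2), "<=", (1 - eps) * R ** 2)
    # Fact 2: shrink both radii squared additively by the same amount eps * d^2.
    print(enclosing_radius_sq(R ** 2 - eps * d ** 2, (1 - eps) * d ** 2),
          "<=", (1 - np.sqrt(eps)) * R ** 2)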

A suboptimal algorithm

Assume that we are given x \in \mathbb{R}^n such that x^* \in \mathrm{B}(x,R^2). As we saw before one also has

    \[x^* \in \mathrm{B}\left(x - \frac{1}{\alpha} \nabla f(x), \frac{|\nabla f(x)|^2}{\alpha^2} \left(1 - \frac{1}{\kappa}\right) \right) .\]

Taking the intersection of these two balls and using the fact from the previous section we see that there exists x' such that

    \[x^* \in \mathrm{B}\left(x', \left(1 - \frac{1}{\kappa}\right) R^2\right) .\]

Thus by iterating the map from x to x' (defined implicitly above) we obtain a new algorithm with the same rate of convergence as gradient descent.
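
To make the implicit map from x to x' concrete, here is a minimal Python sketch (the function names are mine, and the explicit choice of the new center, taken on the segment between x and x^{++}, is my own reconstruction of the enclosing-ball computation rather than the exact update of the paper); it assumes access to \nabla f and to the constants \alpha and \beta:

    import numpy as np

    def suboptimal_step(x, grad_f, R_sq, alpha, beta):
        """Map x to x': given x with x^* in B(x, R_sq), return x' together with a new
        squared radius which is at most (1 - 1/kappa) * R_sq.  The center below comes
        from the enclosing-ball computation sketched in the preliminaries and is a
        reconstruction, not necessarily the exact update of the paper."""
        eps = alpha / beta                        # = 1 / kappa
        g = grad_f(x)
        d = np.linalg.norm(g) / alpha             # distance from x to x^{++}
        # Ball 1: B(x, R_sq).  Ball 2: B(x^{++}, (1 - eps) * d^2).
        c = min(d, (R_sq / d + eps * d) / 2.0)    # new center is x - c * g / |g|
        x_new = x - c * g / np.linalg.norm(g)
        R_sq_new = (1 - eps) * d ** 2 if c == d else R_sq - c ** 2
        return x_new, R_sq_new                    # one can check R_sq_new <= (1 - eps) * R_sq

    # Toy usage on the quadratic f(x) = 0.5 * x^T A x, whose minimizer is x^* = 0.
    A = np.diag([1.0, 10.0])
    alpha, beta = 1.0, 10.0                       # extreme eigenvalues of A
    x, R_sq = np.array([3.0, 4.0]), 25.0          # valid start since |x - x^*|^2 = 25
    for _ in range(50):
        x, R_sq = suboptimal_step(x, lambda z: A @ z, R_sq, alpha, beta)
    print(np.linalg.norm(x), R_sq)                # R_sq shrinks by at least 1 - 1/kappa per step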

Acceleration

Now assume that the point x from the previous section, together with its guarantee R^2 on the squared distance to the optimum, came from the local information at some other point x_0, that is x = x_0^{++} and R^2 = R_0^2 - \frac{2}{\alpha} (f(x_0^+) - f(x^*)), where R_0^2 = \frac{|\nabla f(x_0)|^2}{\alpha^2} \left(1 - \frac{1}{\kappa}\right). Assume furthermore that x is at least as good as a step of gradient descent at x_0, that is f(x) \leq f(x_0^+). Then we can use the local information at x to reduce the ball around x; indeed one has

    \[f(x^*) \leq f(x^+) \leq f(x) - \frac{1}{2 \beta} |\nabla f(x)|^2 \leq f(x_0^+) - \frac{1}{2 \beta} |\nabla f(x)|^2 ,\]

and thus, plugging the bound f(x_0^+) - f(x^*) \geq \frac{1}{2 \beta} |\nabla f(x)|^2 into the expression for R^2 above (note that \frac{2}{\alpha} \cdot \frac{1}{2 \beta} = \frac{1}{\alpha^2 \kappa}),

    \[x^* \in \mathrm{B}\left(x, R_0^2 - \frac{|\nabla f(x)|^2}{\alpha^2 \kappa} \right) .\]

Furthermore, as in the last section, one has

    \[x^* \in \mathrm{B}\left(x - \frac{1}{\alpha} \nabla f(x), \frac{|\nabla f(x)|^2}{\alpha^2} \left(1 - \frac{1}{\kappa}\right) \right) .\]

Taking the intersection of the above two balls and enclosing it in yet another ball, one obtains (see the preliminaries) that there exists x' such that

    \[x^* \in \mathrm{B}\left(x', \left(1 - \frac{1}{\sqrt{\kappa}}\right) R_0^2 \right) .\]
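
Indeed, writing d^2 = \frac{|\nabla f(x)|^2}{\alpha^2}, the two balls above are \mathrm{B}(x, R_0^2) and \mathrm{B}(x^{++}, d^2) (the latter having x on its boundary), both shrunk additively by the same amount

    \[\frac{|\nabla f(x)|^2}{\alpha^2 \kappa} = \frac{1}{\kappa} \, d^2 ,\]

so the second fact of the preliminaries applies with \epsilon = 1/\kappa and yields the radius squared (1 - 1/\sqrt{\kappa}) R_0^2.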

The only issue at this point is that to iterate the map from x to x' one would need x' to be at least as good as a step of gradient descent at x, which is not necessarily true (just as the assumption f(x_0^{++}) \leq f(x_0^+) is not necessarily true). We circumvent this difficulty by doing a line search on the line going through x' and x^+. The details (which are straightforward given the above discussion) are in the paper, specifically in Section 3. We also report encouraging preliminary experimental results there.
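
For illustration, such a line search could look as follows in Python (a crude grid search along the line through x' and x^+, with names of my choosing, written only to fix ideas; the actual procedure and its analysis are the ones given in Section 3 of the paper):

    import numpy as np

    def line_search(f, x_prime, x_plus, n_grid=200):
        """Approximately minimize f over the line through x_prime and x_plus.
        A crude grid search, only to fix ideas; the algorithm in the paper uses the
        line search described in its Section 3."""
        ts = np.linspace(-1.0, 2.0, n_grid)       # parametrize the line, slightly beyond [0, 1]
        candidates = [x_prime + t * (x_plus - x_prime) for t in ts]
        return min(candidates, key=f)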
