Nesterov’s accelerated gradient descent (AGD) is hard to understand. Since Nesterov’s 1983 paper people have tried to explain “why” acceleration is possible, with the hope that the answer would go beyond the mysterious (but beautiful) algebraic manipulations of the original proof. This search for a good “why” has itself accelerated in recent years under the impetus of the machine learning community, as one would like to generalize Nesterov’s acceleration in various ways to deal with machine learning applications. Obviously generalizing is difficult when even the basic phenomenon is not well understood. Recent works in the “why” direction re-interpret the AGD algorithm from different points of view:
– Allen-Zhu and Orecchia view AGD as a linear coupling of Gradient Descent and Mirror Descent. This gives a nice new proof where the acceleration results from a primal-dual point of view on the progress of the algorithm.
– Su, Boyd and Candes (building on earlier work of O’Donoghue and Candes) view AGD as the discretization of a certain second-order ODE. Other papers in a somewhat similar direction include the work of Lessard, Recht and Packard; as well as Flammarion and Bach. Note that these papers do not give alternative proofs of acceleration for general smooth functions.
– Other relevant references include a recent paper of Arjevani, Shalev-Shwartz and Shamir, and a blog post by Hardt.
In joint work with Yin-Tat Lee and Mohit Singh we discovered a simple geometric reason for the possibility of acceleration, which can be turned into a new algorithm with the same theoretical guarantees as AGD, and improved performance in practice. The proof of the convergence rate is 3 lines, assuming a completely elementary geometric fact (in dimension 2) which itself takes less than half a page to prove. The general principle is loosely inspired by the ellipsoid method (except that we work with balls, which makes the algorithm much more efficient). Let me guide you through this.
Preliminaries
Ok so we are trying to minimize on $\mathbb{R}^n$ a function $f$ which is $\alpha$-strongly convex and $\beta$-smooth (see here for definitions). Denote $x^*$ the minimizer of $f$, $\kappa = \beta / \alpha$ the condition number of $f$, $B(x, r^2)$ for a Euclidean ball centered at $x$ and of radius squared $r^2$, and

$$x^+ = x - \frac{1}{\beta} \nabla f(x), \qquad x^{++} = x - \frac{1}{\alpha} \nabla f(x),$$

which are respectively a standard and a long step of gradient descent. The crucial observation is that the local information at $x$ and the assumptions on $f$ allow us to enclose the minimizer in some ball, precisely:

$$x^* \in B\left(x^{++}, \ \frac{|\nabla f(x)|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right) - \frac{2}{\alpha}\left(f(x^+) - f(x^*)\right)\right).$$

(This follows by completing the square in the strong convexity inequality at $x$, which gives $x^* \in B(x^{++}, \frac{|\nabla f(x)|^2}{\alpha^2} - \frac{2}{\alpha}(f(x) - f(x^*)))$, together with the guaranteed progress of a gradient step, $f(x^+) \leq f(x) - \frac{1}{2\beta}|\nabla f(x)|^2$.)
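For a concrete sanity check, here is a small numerical experiment (the quadratic test function and all variable names are mine, not from the paper) verifying the enclosing-ball observation on a random quadratic, where the minimizer $x^* = 0$ and the constants $\alpha, \beta$ are known exactly:

```python
import numpy as np

# Sanity check of the enclosing-ball observation on a random quadratic
# f(x) = 0.5 x^T A x, whose strong convexity / smoothness constants are
# the extreme eigenvalues of A.  The minimizer is x* = 0 with f(x*) = 0.
rng = np.random.default_rng(1)
n = 10
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)                 # positive definite Hessian
alpha, beta = np.linalg.eigvalsh(A)[[0, -1]]
kappa = beta / alpha

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = rng.standard_normal(n)
xp = x - grad(x) / beta                 # standard gradient step x^+
xpp = x - grad(x) / alpha               # long gradient step x^{++}

g2 = np.sum(grad(x) ** 2)
radius2 = g2 / alpha**2 * (1 - 1 / kappa) - (2 / alpha) * (f(xp) - 0.0)
assert np.sum(xpp**2) <= radius2 + 1e-9  # x* = 0 lies in the enclosing ball
```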
Finally we need some fact on the intersection of balls. Take a ball with radius squared $R^2$, and intersect it with some other ball which has the center of the previous ball on its boundary. This intersection is obviously contained in a ball of radius squared $R^2$. An easy calculation shows that if the radius squared of one ball shrinks (additively) by $\epsilon R^2$ then the intersection is included in a ball of radius squared $(1 - \epsilon) R^2$. Furthermore if both balls shrink (additively) by the same amount (which is an $\epsilon$ fraction of the radius squared for one of the balls), then the intersection is included in a ball of radius squared $(1 - \sqrt{\epsilon}) R^2$. As we will see this fact is the reason why one can accelerate.
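The two-ball fact is easy to test numerically. The following Monte Carlo sketch (parameters and names are mine) samples the intersection of the two shrunk balls in the plane and checks that it fits inside a ball of radius squared $(1 - \sqrt{\epsilon}) R^2$, under the assumption that the second ball's radius squared $r^2$ is at most $R^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
R2 = 1.0        # radius squared of the first ball, centered at the origin
r2 = 0.5        # radius squared of the second ball; origin lies on its boundary
eps = 0.04      # both balls shrink additively by eps * r2
shrink = eps * r2

b = np.array([np.sqrt(r2), 0.0])        # center of the second ball
rhoA2 = R2 - shrink                      # shrunk radius squared of ball 1
rhoB2 = r2 - shrink                      # shrunk radius squared of ball 2

# Sample points, keep those in the intersection (the "lens").
pts = rng.uniform(-1.5, 1.5, size=(200_000, 2))
in_lens = (np.sum(pts**2, 1) <= rhoA2) & (np.sum((pts - b) ** 2, 1) <= rhoB2)
lens = pts[in_lens]

# Candidate enclosing center: midpoint of the two boundary corners,
# which lies on the segment joining the two ball centers.
px = (rhoA2 - rhoB2 + r2) / (2 * np.sqrt(r2))
z = np.array([px, 0.0])

worst = np.max(np.sum((lens - z) ** 2, axis=1))
print(worst, (1 - np.sqrt(eps)) * R2)    # observed radius squared vs bound
assert worst <= (1 - np.sqrt(eps)) * R2
```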
A suboptimal algorithm
Assume that we are given $x$ such that $x^* \in B(x, R^2)$. As we saw before one also has

$$x^* \in B\left(x^{++}, \ \frac{|\nabla f(x)|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right)\right).$$

Taking the intersection of these two balls and using the fact from the previous section we see that there exists $x'$ such that

$$x^* \in B\left(x', \ R^2 \left(1 - \frac{1}{\kappa}\right)\right).$$

Thus by iterating the map from $x$ to $x'$ (defined implicitly above) we obtain a new algorithm with the same rate of convergence as gradient descent.
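To make the iteration concrete, here is a toy run on a quadratic. The helper `enclosing_ball` is my own construction, not from the paper: it returns *an* enclosing ball of the intersection of two balls (center on the segment joining the two centers, valid but not necessarily minimal); everything else follows the map described above.

```python
import numpy as np

def enclosing_ball(c1, R2_1, c2, R2_2):
    """An enclosing ball of B(c1, R2_1) ∩ B(c2, R2_2), assumed nonempty."""
    r1, r2 = np.sqrt(R2_1), np.sqrt(R2_2)
    d = np.linalg.norm(c2 - c1)
    if d + r2 <= r1:                       # ball 2 inside ball 1
        return c2, R2_2
    if d + r1 <= r2:                       # ball 1 inside ball 2
        return c1, R2_1
    u = (c2 - c1) / d
    t = (d**2 + R2_1 - R2_2) / (2 * d)     # axial position of the corner circle
    h2 = max(R2_1 - t**2, 0.0)             # corner circle radius squared
    s_max = min(r1, d + r2)                # axial extent of the lens
    s_min = max(-r1, d - r2)
    corner = (c1 + t * u, max(h2, (s_max - t) ** 2, (t - s_min) ** 2))
    # Each original ball is also a valid enclosing ball; keep the best.
    return min([corner, (c1, R2_1), (c2, R2_2)], key=lambda cr: cr[1])

rng = np.random.default_rng(2)
n = 20
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)
alpha, beta = np.linalg.eigvalsh(A)[[0, -1]]
kappa = beta / alpha
grad = lambda x: A @ x                     # f(x) = 0.5 x^T A x, so x* = 0

x = rng.standard_normal(n)
R2 = np.sum(x**2)                          # valid initial guarantee x* ∈ B(x, R2)
R2_init = R2
for _ in range(50):
    g = grad(x)
    xpp = x - g / alpha
    ball2 = np.sum(g**2) / alpha**2 * (1 - 1 / kappa)
    x, R2 = enclosing_ball(x, R2, xpp, ball2)
    assert np.sum(x**2) <= R2 + 1e-9       # x* = 0 stays enclosed
print(R2_init, R2)                         # the guaranteed radius squared shrinks
```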
Acceleration
Now assume that the point $x$ from the previous section and its guarantee $x^* \in B(x, R^2)$ on the distance to the optimum came from the local information at some other point $y$, that is $x = y^{++}$ and

$$R^2 = \frac{|\nabla f(y)|^2}{\alpha^2} - \frac{2}{\alpha}\left(f(y^+) - f(x^*)\right),$$

where $y^{++} = y - \frac{1}{\alpha} \nabla f(y)$. Assume furthermore that $x$ is at least as good as a step of gradient descent at $y$, that is $f(x) \leq f(y^+)$. Then we can use the local information at $x$ to reduce the ball around $y^{++}$, indeed one has

$$f(x^*) \leq f(x^+) \leq f(x) - \frac{1}{2\beta}|\nabla f(x)|^2 \leq f(y^+) - \frac{1}{2\beta}|\nabla f(x)|^2,$$

and thus

$$x^* \in B\left(y^{++}, \ \frac{|\nabla f(y)|^2}{\alpha^2} - \frac{1}{\alpha \beta}|\nabla f(x)|^2\right).$$

Furthermore, as in the last section, one has

$$x^* \in B\left(x^{++}, \ \frac{|\nabla f(x)|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right)\right).$$

Note that both balls have shrunk (additively) by the same amount $\frac{1}{\kappa} \frac{|\nabla f(x)|^2}{\alpha^2}$, which is a $\frac{1}{\kappa}$ fraction of the radius squared of the second ball. Taking the intersection of the above balls and enclosing it in another ball one obtains (see preliminaries) that there exists $x'$ such that

$$x^* \in B\left(x', \ \left(1 - \frac{1}{\sqrt{\kappa}}\right) \frac{|\nabla f(y)|^2}{\alpha^2}\right).$$
The only issue at this point is that to iterate the map from $x$ to $x'$ one would need that $x'$ is at least as good as a step of gradient descent on $x$, which is not necessarily true (just as the assumption $f(x) \leq f(y^+)$ is not necessarily true). We circumvent this difficulty by doing a line search on the line going through $x^+$ and $x'$. The details (which are straightforward given the above discussion) are in the paper, specifically in Section 3. We also report encouraging preliminary experimental results.
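As a side note, for quadratic test functions like the ones used in the sketches above, the exact line search is available in closed form; a minimal sketch (names are mine):

```python
import numpy as np

# For f(x) = 0.5 x^T A x, the minimizer of f along the line {x + t d}
# satisfies d^T A (x + t d) = 0, i.e. t* = -(d^T A x) / (d^T A d).
def line_search_quadratic(A, x, d):
    t = -(d @ A @ x) / (d @ A @ d)
    return x + t * d

A = np.array([[10.0, 0.0], [0.0, 1.0]])
x = np.array([1.0, 1.0])
d = np.array([1.0, -1.0])
y = line_search_quadratic(A, x, d)
# y minimizes f on the line, so its gradient is orthogonal to the direction d
assert abs((A @ y) @ d) < 1e-12
```

For general strongly convex $f$ a one-dimensional search (e.g. bisection on the directional derivative) plays the same role, at the cost of the extra logarithmic factor mentioned in the comments below.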
By Day 1 at IMA Workshop on Optimization and Parsimonious Modeling January 30, 2016 - 4:42 pm
[…] Nesterov’s acceleration of the gradient descent algorithm is mysterious to a lot of people, particularly because there appears to be little geometric intuition behind it. So Sebastian made a new accelerated gradient method based entirely on a geometric principle which involves intersecting closed balls. This new algorithm nearly achieves the rate of convergence of Nesterov’s method (due to an exact line search, there is a log^2(1/eps) instead of a log(1/eps)). I’ll leave it to Sebastian to explain the proof to you. […]
By anonymous November 4, 2015 - 7:01 pm
How about the O(1/k^2) convergence for non-strongly convex problems? Have you extended your algorithm and intuition to that case?
By Sebastien Bubeck November 12, 2015 - 1:33 pm
First let me warn you that my answer will probably be unpleasing to you: in theory, up to logarithmic factors there is no difference between strongly convex and non-strongly convex (for deterministic optimization; with a stochastic oracle it’s another story). Indeed, given an accuracy eps one can just optimize f + eps |x|^2, which has condition number of order 1/eps, and thus it takes log(1/eps)/sqrt(eps) iterations to optimize it. If you don’t have a fixed accuracy eps you can also do the usual doubling trick and lose another log factor. It is an interesting problem to find a more elegant way to adapt our algorithm to the non-strongly convex case and hopefully recover the exact 1/k^2 (without extraneous logarithmic factors).
By Anonymous July 2, 2015 - 5:59 am
Another relevant development is a recent paper by Kim and Fessler that reaches a method very similar to AGD via a fully constructive approach.
It doesn’t give an answer to “why”, but at least gives a non-mysterious “how”.