Nesterov’s accelerated gradient descent (AGD) is hard to understand. Since Nesterov’s 1983 paper people have tried to explain “why” acceleration is possible, with the hope that the answer would go beyond the mysterious (but beautiful) algebraic manipulations of the original proof. This search for a good “why” has itself accelerated in recent years under the impetus of the machine learning community, as one would like to generalize Nesterov’s acceleration in various ways to deal with machine learning applications. Obviously generalizing is difficult when even the basic phenomenon is not well understood. Recent works in the “why” direction re-interpret the AGD algorithm from different points of view:
– Allen-Zhu and Orecchia view AGD as a linear coupling of Gradient Descent and Mirror Descent. This gives a nice new proof where the acceleration results from a primal-dual point of view on the progress of the algorithm.
– Su, Boyd and Candes (building on earlier work of O’Donoghue and Candes) view AGD as the discretization of a certain second-order ODE. Other papers in a somewhat similar direction include the work of Lessard, Recht and Packard; as well as Flammarion and Bach. Note that these papers do not give alternative proofs of acceleration for general smooth functions.
– Other relevant references include a recent paper of Arjevani, Shalev-Shwartz and Shamir, and a blog post by Hardt.
In joint work with Yin-Tat Lee and Mohit Singh we discovered a simple geometric reason for the possibility of acceleration, which can be turned into a new algorithm with the same theoretical guarantees as AGD, and improved performance in practice. The proof of the convergence rate is 3 lines, assuming a completely elementary geometric fact (in dimension 2) which itself takes less than half a page to prove. The general principle is loosely inspired by the ellipsoid method (except that we work with balls, which makes the algorithm much more efficient). Let me guide you through this.
Preliminaries
Ok so we are trying to minimize on $\mathbb{R}^n$ a function $f$ which is $\alpha$-strongly convex and $\beta$-smooth (see here for definitions). Denote $x^*$ the minimizer of $f$, $\kappa = \beta / \alpha$ the condition number of $f$, $B(x, r^2)$ for a Euclidean ball centered at $x$ and of radius squared $r^2$, and

$$x^+ = x - \frac{1}{\beta} \nabla f(x), \qquad x^{++} = x - \frac{1}{\alpha} \nabla f(x),$$

which are respectively a standard and a long step of gradient descent. The crucial observation is that the local information at $x$ and the assumptions on $f$ allow us to enclose the minimizer in some ball, precisely:

$$x^* \in B\left(x^{++}, \ \frac{|\nabla f(x)|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right) - \frac{2}{\alpha}\left(f(x^+) - f(x^*)\right)\right).$$

(This follows by completing the square in the strong convexity inequality at $x$, which gives $x^* \in B(x^{++}, \frac{|\nabla f(x)|^2}{\alpha^2} - \frac{2}{\alpha}(f(x) - f(x^*)))$, together with the guaranteed progress of a gradient step, $f(x^+) \leq f(x) - \frac{1}{2\beta}|\nabla f(x)|^2$.)
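For a concrete sanity check, here is a small numerical experiment (the quadratic test function and all variable names are mine, not from the paper) verifying the enclosing-ball observation on a random quadratic, where the minimizer $x^* = 0$ and the constants $\alpha, \beta$ are known exactly:

```python
import numpy as np

# Sanity check of the enclosing-ball observation on a random quadratic
# f(x) = 0.5 x^T A x, whose strong convexity / smoothness constants are
# the extreme eigenvalues of A.  The minimizer is x* = 0 with f(x*) = 0.
rng = np.random.default_rng(1)
n = 10
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)                 # positive definite Hessian
alpha, beta = np.linalg.eigvalsh(A)[[0, -1]]
kappa = beta / alpha

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = rng.standard_normal(n)
xp = x - grad(x) / beta                 # standard gradient step x^+
xpp = x - grad(x) / alpha               # long gradient step x^{++}

g2 = np.sum(grad(x) ** 2)
radius2 = g2 / alpha**2 * (1 - 1 / kappa) - (2 / alpha) * (f(xp) - 0.0)
assert np.sum(xpp**2) <= radius2 + 1e-9  # x* = 0 lies in the enclosing ball
```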
Finally we need some fact on the intersection of balls. Take a ball with radius squared $R^2$, and intersect it with some other ball which has the center of the previous ball on its boundary. This intersection is obviously contained in a ball of radius squared $R^2$. An easy calculation shows that if the radius squared of one ball shrinks (additively) by $\epsilon R^2$ then the intersection is included in a ball of radius squared $(1 - \epsilon) R^2$. Furthermore if both balls shrink (additively) by the same amount (which is an $\epsilon$ fraction of the radius squared for one of the balls), then the intersection is included in a ball of radius squared $(1 - \sqrt{\epsilon}) R^2$. As we will see this fact is the reason why one can accelerate.
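The two-ball fact is easy to test numerically. The following Monte Carlo sketch (parameters and names are mine) samples the intersection of the two shrunk balls in the plane and checks that it fits inside a ball of radius squared $(1 - \sqrt{\epsilon}) R^2$, under the assumption that the second ball's radius squared $r^2$ is at most $R^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
R2 = 1.0        # radius squared of the first ball, centered at the origin
r2 = 0.5        # radius squared of the second ball; origin lies on its boundary
eps = 0.04      # both balls shrink additively by eps * r2
shrink = eps * r2

b = np.array([np.sqrt(r2), 0.0])        # center of the second ball
rhoA2 = R2 - shrink                      # shrunk radius squared of ball 1
rhoB2 = r2 - shrink                      # shrunk radius squared of ball 2

# Sample points, keep those in the intersection (the "lens").
pts = rng.uniform(-1.5, 1.5, size=(200_000, 2))
in_lens = (np.sum(pts**2, 1) <= rhoA2) & (np.sum((pts - b) ** 2, 1) <= rhoB2)
lens = pts[in_lens]

# Candidate enclosing center: midpoint of the two boundary corners,
# which lies on the segment joining the two ball centers.
px = (rhoA2 - rhoB2 + r2) / (2 * np.sqrt(r2))
z = np.array([px, 0.0])

worst = np.max(np.sum((lens - z) ** 2, axis=1))
print(worst, (1 - np.sqrt(eps)) * R2)    # observed radius squared vs bound
assert worst <= (1 - np.sqrt(eps)) * R2
```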
A suboptimal algorithm
Assume that we are given $x$ such that $x^* \in B(x, R^2)$. As we saw before one also has

$$x^* \in B\left(x^{++}, \ \frac{|\nabla f(x)|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right)\right).$$

Taking the intersection of these two balls and using the fact from the previous section we see that there exists $x'$ such that

$$x^* \in B\left(x', \ R^2 \left(1 - \frac{1}{\kappa}\right)\right).$$

Thus by iterating the map from $x$ to $x'$ (defined implicitly above) we obtain a new algorithm with the same rate of convergence as gradient descent.
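To make the iteration concrete, here is a toy run on a quadratic. The helper `enclosing_ball` is my own construction, not from the paper: it returns *an* enclosing ball of the intersection of two balls (center on the segment joining the two centers, valid but not necessarily minimal); everything else follows the map described above.

```python
import numpy as np

def enclosing_ball(c1, R2_1, c2, R2_2):
    """An enclosing ball of B(c1, R2_1) ∩ B(c2, R2_2), assumed nonempty."""
    r1, r2 = np.sqrt(R2_1), np.sqrt(R2_2)
    d = np.linalg.norm(c2 - c1)
    if d + r2 <= r1:                       # ball 2 inside ball 1
        return c2, R2_2
    if d + r1 <= r2:                       # ball 1 inside ball 2
        return c1, R2_1
    u = (c2 - c1) / d
    t = (d**2 + R2_1 - R2_2) / (2 * d)     # axial position of the corner circle
    h2 = max(R2_1 - t**2, 0.0)             # corner circle radius squared
    s_max = min(r1, d + r2)                # axial extent of the lens
    s_min = max(-r1, d - r2)
    corner = (c1 + t * u, max(h2, (s_max - t) ** 2, (t - s_min) ** 2))
    # Each original ball is also a valid enclosing ball; keep the best.
    return min([corner, (c1, R2_1), (c2, R2_2)], key=lambda cr: cr[1])

rng = np.random.default_rng(2)
n = 20
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)
alpha, beta = np.linalg.eigvalsh(A)[[0, -1]]
kappa = beta / alpha
grad = lambda x: A @ x                     # f(x) = 0.5 x^T A x, so x* = 0

x = rng.standard_normal(n)
R2 = np.sum(x**2)                          # valid initial guarantee x* ∈ B(x, R2)
R2_init = R2
for _ in range(50):
    g = grad(x)
    xpp = x - g / alpha
    ball2 = np.sum(g**2) / alpha**2 * (1 - 1 / kappa)
    x, R2 = enclosing_ball(x, R2, xpp, ball2)
    assert np.sum(x**2) <= R2 + 1e-9       # x* = 0 stays enclosed
print(R2_init, R2)                         # the guaranteed radius squared shrinks
```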
Acceleration
Now assume that the point $x$ from the previous section and its guarantee $x^* \in B(x, R^2)$ on the distance to the optimum came from the local information at some other point $y$, that is $x = y^{++}$ and

$$R^2 = \frac{|\nabla f(y)|^2}{\alpha^2} - \frac{2}{\alpha}\left(f(y^+) - f(x^*)\right),$$

where $y^{++} = y - \frac{1}{\alpha} \nabla f(y)$. Assume furthermore that $x$ is at least as good as a step of gradient descent at $y$, that is $f(x) \leq f(y^+)$. Then we can use the local information at $x$ to reduce the ball around $y^{++}$, indeed one has

$$f(x^*) \leq f(x^+) \leq f(x) - \frac{1}{2\beta}|\nabla f(x)|^2 \leq f(y^+) - \frac{1}{2\beta}|\nabla f(x)|^2,$$

and thus

$$x^* \in B\left(y^{++}, \ \frac{|\nabla f(y)|^2}{\alpha^2} - \frac{1}{\alpha \beta}|\nabla f(x)|^2\right).$$

Furthermore, as in the last section, one has

$$x^* \in B\left(x^{++}, \ \frac{|\nabla f(x)|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right)\right).$$

Note that both balls have shrunk (additively) by the same amount $\frac{1}{\kappa} \frac{|\nabla f(x)|^2}{\alpha^2}$, which is a $\frac{1}{\kappa}$ fraction of the radius squared of the second ball. Taking the intersection of the above balls and enclosing it in another ball one obtains (see preliminaries) that there exists $x'$ such that

$$x^* \in B\left(x', \ \left(1 - \frac{1}{\sqrt{\kappa}}\right) \frac{|\nabla f(y)|^2}{\alpha^2}\right).$$
The only issue at this point is that to iterate the map from $x$ to $x'$ one would need that $x'$ is at least as good as a step of gradient descent on $x$, which is not necessarily true (just as the assumption $f(x) \leq f(y^+)$ is not necessarily true). We circumvent this difficulty by doing a line search on the line going through $x^+$ and $x'$. The details (which are straightforward given the above discussion) are in the paper, specifically in Section 3. We also report encouraging preliminary experimental results.
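As a side note, for quadratic test functions like the ones used in the sketches above, the exact line search is available in closed form; a minimal sketch (names are mine):

```python
import numpy as np

# For f(x) = 0.5 x^T A x, the minimizer of f along the line {x + t d}
# satisfies d^T A (x + t d) = 0, i.e. t* = -(d^T A x) / (d^T A d).
def line_search_quadratic(A, x, d):
    t = -(d @ A @ x) / (d @ A @ d)
    return x + t * d

A = np.array([[10.0, 0.0], [0.0, 1.0]])
x = np.array([1.0, 1.0])
d = np.array([1.0, -1.0])
y = line_search_quadratic(A, x, d)
# y minimizes f on the line, so its gradient is orthogonal to the direction d
assert abs((A @ y) @ d) < 1e-12
```

For general strongly convex $f$ a one-dimensional search (e.g. bisection on the directional derivative) plays the same role, at the cost of the extra logarithmic factor mentioned in the comments below.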
By Day 1 at IMA Workshop on Optimization and Parsimonious Modeling January 30, 2016 - 4:42 pm
[…] Nesterov’s acceleration of the gradient descent algorithm is mysterious to a lot of people, particularly because there appears to be little geometric intuition behind it. So Sebastian made a new accelerated gradient method based entirely on a geometric principle which involves intersecting closed balls. This new algorithm nearly achieves the rate of convergence of Nesterov’s method (due to an exact line search, there is a log^2(1/eps) instead of a log(1/eps)). I’ll leave it to Sebastian to explain the proof to you. […]
By anonymous November 4, 2015 - 7:01 pm
How about the O(1/k^2) convergence for non-strongly convex problems? Have you extended your algorithm and intuition to that case?
By Sebastien Bubeck November 12, 2015 - 1:33 pm
First let me warn you that my answer will probably be unpleasing to you: in theory, up to logarithmic factors there is no difference between strongly convex and non-strongly convex (for deterministic optimization; with a stochastic oracle it’s another story). Indeed, given an accuracy eps one can just optimize f + eps |x|^2, which has condition number of order 1/eps, and thus it takes log(1/eps)/sqrt(eps) iterations to optimize it. If you don’t have a fixed accuracy eps you can also do the usual doubling trick and lose another log factor. It is an interesting problem to find a more elegant way to adapt our algorithm to the non-strongly convex case and hopefully recover the exact 1/k^2 (without extraneous logarithmic factors).
By Anonymous July 2, 2015 - 5:59 am
Another relevant development is a recent paper by Kim and Fessler that reaches a method very similar to AGD via a fully constructive approach.
It doesn’t give an answer to “why”, but at least gives a non-mysterious “how”.