About a year ago I described Nesterov's Accelerated Gradient Descent in the context of smooth optimization. As I mentioned previously this has been by far the most popular post on this blog. Today I have decided to revisit this post to give a slightly more geometrical proof (though unfortunately still magical in various parts). I will focus on unconstrained optimization of a smooth and strongly convex function (in the previous post I dealt only with the smooth case). Precisely, $f$ is $\alpha$-strongly convex and $\beta$-smooth, and we denote by $\kappa = \beta / \alpha$ the condition number of $f$. As explained in this post, in this case the basic gradient descent algorithm requires $O(\kappa \log(1/\epsilon))$ iterations to reach $\epsilon$-accuracy. As we shall see below, Nesterov's Accelerated Gradient Descent attains the improved oracle complexity of $O(\sqrt{\kappa} \log(1/\epsilon))$. This is particularly relevant in Machine Learning applications since the strong convexity parameter $\alpha$ can often be viewed as coming from a regularization term, and the condition number $\kappa$ can be as large as the sample size. Thus reducing the number of steps from "$\kappa \log(1/\epsilon)$" to "$\sqrt{\kappa} \log(1/\epsilon)$" can be a huge deal, especially in large scale applications.
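To make the gap concrete, here is a quick back-of-the-envelope comparison of the two iteration counts (a sketch of mine that ignores the constants hidden in the complexity bounds; the function names are not from the post):

```python
import math

def iters_gd(kappa, eps):
    # gradient descent: on the order of kappa * log(1/eps) iterations
    return kappa * math.log(1 / eps)

def iters_agd(kappa, eps):
    # accelerated gradient descent: on the order of sqrt(kappa) * log(1/eps)
    return math.sqrt(kappa) * math.log(1 / eps)

# e.g. a weakly regularized problem where kappa is the sample size, n = 10^6
kappa, eps = 1e6, 1e-6
print(round(iters_gd(kappa, eps)))   # ~ 1.4 * 10^7 iterations
print(round(iters_agd(kappa, eps)))  # ~ 1.4 * 10^4 iterations
```

The ratio between the two counts is exactly $\sqrt{\kappa}$, i.e. a thousandfold saving in this example.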
Without further ado let's get to the algorithm, which can be described quite succinctly. Note that everything written below is simply a condensed version of the calculations appearing on pages 71–81 of Nesterov's 2004 book. Start at an arbitrary initial point $x_1 = y_1$ and then iterate the following equations for $s \geq 1$:

$$y_{s+1} = x_s - \frac{1}{\beta} \nabla f(x_s),$$

$$x_{s+1} = \left(1 + \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right) y_{s+1} - \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} y_s.$$
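In code the two equations above translate into a very short loop. This is a minimal NumPy sketch of mine (not code from the post); `nesterov_agd` and the quadratic sanity check are my own names, while the step and momentum coefficients come directly from the display above:

```python
import numpy as np

def nesterov_agd(gradf, x1, alpha, beta, iters):
    """Nesterov's Accelerated Gradient Descent for an alpha-strongly
    convex and beta-smooth function whose gradient is gradf."""
    kappa = beta / alpha
    c = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)  # momentum coefficient
    x = y = np.asarray(x1, dtype=float)              # start with x_1 = y_1
    for _ in range(iters):
        y_next = x - gradf(x) / beta                 # gradient step
        x = (1 + c) * y_next - c * y                 # extrapolation step
        y = y_next
    return y

# Sanity check on f(x) = 0.5 x^T A x (alpha = 1, beta = 100, minimizer 0):
A = np.diag([1.0, 100.0])
y = nesterov_agd(lambda x: A @ x, [1.0, 1.0], alpha=1.0, beta=100.0, iters=300)
print(np.linalg.norm(y))  # essentially 0
```

Note that the theorem below guarantees, for this instance, a suboptimality gap of at most $\frac{101}{2} \cdot 2 \cdot e^{-300/10} \approx 10^{-11}$ after 300 iterations.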
Theorem. Let $f$ be $\alpha$-strongly convex and $\beta$-smooth; then Nesterov's Accelerated Gradient Descent satisfies

$$f(y_t) - f(x^*) \leq \frac{\alpha + \beta}{2} \|x_1 - x^*\|^2 \exp\left(- \frac{t-1}{\sqrt{\kappa}}\right).$$
Proof: We define $\alpha$-strongly convex quadratic functions $\Phi_s$, $s \geq 1$, by induction as follows:

$$\Phi_1(x) = f(x_1) + \frac{\alpha}{2} \|x - x_1\|^2,$$

$$\Phi_{s+1}(x) = \left(1 - \frac{1}{\sqrt{\kappa}}\right) \Phi_s(x) + \frac{1}{\sqrt{\kappa}} \left( f(x_s) + \nabla f(x_s)^{\top} (x - x_s) + \frac{\alpha}{2} \|x - x_s\|^2 \right). \qquad (1)$$
Intuitively $\Phi_s$ becomes a finer and finer approximation (from below) to $f$ in the following sense:

$$\Phi_{s+1}(x) \leq f(x) + \left(1 - \frac{1}{\sqrt{\kappa}}\right)^s \left( \Phi_1(x) - f(x) \right). \qquad (2)$$
The above inequality can be proved immediately by induction, using the fact that by $\alpha$-strong convexity one has

$$f(x_s) + \nabla f(x_s)^{\top} (x - x_s) + \frac{\alpha}{2} \|x - x_s\|^2 \leq f(x).$$
Equation (2) by itself does not say much; for it to be useful one needs to understand how "far" below $f$ the function $\Phi_s$ is. The following inequality answers this question:

$$f(y_s) \leq \min_{x \in \mathbb{R}^n} \Phi_s(x) =: \Phi_s^*. \qquad (3)$$
The rest of the proof is devoted to showing that (3) holds true, but first let us see how to combine (2) and (3) to obtain the rate given by the theorem (we use that by $\beta$-smoothness one has $f(x_1) - f(x^*) \leq \frac{\beta}{2} \|x_1 - x^*\|^2$, and that $1 - 1/\sqrt{\kappa} \leq \exp(-1/\sqrt{\kappa})$):

$$f(y_t) - f(x^*) \leq \Phi_t^* - f(x^*) \leq \Phi_t(x^*) - f(x^*) \leq \left(1 - \frac{1}{\sqrt{\kappa}}\right)^{t-1} \left( \Phi_1(x^*) - f(x^*) \right) \leq \frac{\alpha + \beta}{2} \|x_1 - x^*\|^2 \exp\left(- \frac{t-1}{\sqrt{\kappa}}\right).$$
We now prove (3) by induction (note that it is true at $s = 1$ since $x_1 = y_1$). Let $s \geq 1$. Using the definition of $y_{s+1}$ (and $\beta$-smoothness), convexity, and the induction hypothesis, one gets

$$f(y_{s+1}) \leq f(x_s) - \frac{1}{2 \beta} \|\nabla f(x_s)\|^2 = \left(1 - \frac{1}{\sqrt{\kappa}}\right) f(y_s) + \left(1 - \frac{1}{\sqrt{\kappa}}\right) \left( f(x_s) - f(y_s) \right) + \frac{1}{\sqrt{\kappa}} f(x_s) - \frac{1}{2 \beta} \|\nabla f(x_s)\|^2$$

$$\leq \left(1 - \frac{1}{\sqrt{\kappa}}\right) \Phi_s^* + \left(1 - \frac{1}{\sqrt{\kappa}}\right) \nabla f(x_s)^{\top} (x_s - y_s) + \frac{1}{\sqrt{\kappa}} f(x_s) - \frac{1}{2 \beta} \|\nabla f(x_s)\|^2.$$
Thus we now have to show that

$$\Phi_{s+1}^* \geq \left(1 - \frac{1}{\sqrt{\kappa}}\right) \Phi_s^* + \left(1 - \frac{1}{\sqrt{\kappa}}\right) \nabla f(x_s)^{\top} (x_s - y_s) + \frac{1}{\sqrt{\kappa}} f(x_s) - \frac{1}{2 \beta} \|\nabla f(x_s)\|^2. \qquad (4)$$
To prove this inequality we have to understand better the functions $\Phi_s$. First note that $\nabla^2 \Phi_s(x) = \alpha \mathrm{I}_n$ (immediate by induction) and thus $\Phi_s$ has to be of the following form:

$$\Phi_s(x) = \Phi_s^* + \frac{\alpha}{2} \|x - v_s\|^2,$$

for some $v_s \in \mathbb{R}^n$. Now observe that by differentiating (1) and using the above form of $\Phi_s$ one obtains

$$\nabla \Phi_{s+1}(x) = \alpha \left(1 - \frac{1}{\sqrt{\kappa}}\right) (x - v_s) + \frac{1}{\sqrt{\kappa}} \nabla f(x_s) + \frac{\alpha}{\sqrt{\kappa}} (x - x_s).$$
In particular $\Phi_{s+1}$ is by definition minimized at $v_{s+1}$, which can now be defined by induction using the above identity; precisely:

$$v_{s+1} = \left(1 - \frac{1}{\sqrt{\kappa}}\right) v_s + \frac{1}{\sqrt{\kappa}} x_s - \frac{1}{\alpha \sqrt{\kappa}} \nabla f(x_s). \qquad (5)$$
Using the form of $\Phi_s$ and $\Phi_{s+1}$, as well as the original definition (1), one gets the following identity by evaluating $\Phi_{s+1}$ at $x_s$:

$$\Phi_{s+1}^* + \frac{\alpha}{2} \|x_s - v_{s+1}\|^2 = \left(1 - \frac{1}{\sqrt{\kappa}}\right) \Phi_s^* + \frac{\alpha}{2} \left(1 - \frac{1}{\sqrt{\kappa}}\right) \|x_s - v_s\|^2 + \frac{1}{\sqrt{\kappa}} f(x_s). \qquad (6)$$
Note that thanks to (5) one has

$$\|x_s - v_{s+1}\|^2 = \left(1 - \frac{1}{\sqrt{\kappa}}\right)^2 \|x_s - v_s\|^2 + \frac{1}{\alpha^2 \kappa} \|\nabla f(x_s)\|^2 + \frac{2}{\alpha \sqrt{\kappa}} \left(1 - \frac{1}{\sqrt{\kappa}}\right) \nabla f(x_s)^{\top} (x_s - v_s),$$

which combined with (6) yields

$$\Phi_{s+1}^* = \left(1 - \frac{1}{\sqrt{\kappa}}\right) \Phi_s^* + \frac{1}{\sqrt{\kappa}} f(x_s) + \frac{\alpha}{2 \sqrt{\kappa}} \left(1 - \frac{1}{\sqrt{\kappa}}\right) \|x_s - v_s\|^2 - \frac{1}{2 \beta} \|\nabla f(x_s)\|^2 + \frac{1}{\sqrt{\kappa}} \left(1 - \frac{1}{\sqrt{\kappa}}\right) \nabla f(x_s)^{\top} (v_s - x_s).$$
Finally we show by induction that $v_s - x_s = \sqrt{\kappa} (x_s - y_s)$, which concludes the proof of (4) and thus also concludes the proof of the theorem:

$$v_{s+1} - x_{s+1} = \left(1 - \frac{1}{\sqrt{\kappa}}\right) v_s + \frac{1}{\sqrt{\kappa}} x_s - \frac{1}{\alpha \sqrt{\kappa}} \nabla f(x_s) - x_{s+1} = \sqrt{\kappa} x_s - (\sqrt{\kappa} - 1) y_s - \frac{\sqrt{\kappa}}{\beta} \nabla f(x_s) - x_{s+1} = \sqrt{\kappa} y_{s+1} - (\sqrt{\kappa} - 1) y_s - x_{s+1} = \sqrt{\kappa} (x_{s+1} - y_{s+1}),$$

where the first equality comes from (5), the second from the induction hypothesis, the third from the definition of $y_{s+1}$, and the last one from the definition of $x_{s+1}$.
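As a numerical sanity check (mine, not part of the original argument), one can track $\Phi_s^*$ and $v_s$ along a run using recursions (5) and the combined identity above, and verify both inequality (3) and the identity $v_s - x_s = \sqrt{\kappa}(x_s - y_s)$ on a strongly convex quadratic:

```python
import numpy as np

# f(x) = 0.5 x^T A x is alpha-strongly convex and beta-smooth, with
# alpha, beta the extreme eigenvalues of A; it is minimized at x* = 0.
alpha, beta = 1.0, 100.0
A = np.diag([alpha, beta])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

kappa = beta / alpha
q = 1 - 1 / np.sqrt(kappa)                       # contraction factor
c = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)  # momentum coefficient

x = y = np.array([1.0, 1.0])                     # x_1 = y_1
v, phi_star = x, f(x)                            # v_1 = x_1, Phi_1^* = f(x_1)

for _ in range(50):
    g = grad(x)
    # update Phi_{s+1}^* via the identity obtained by combining (5) and (6)
    phi_star = (q * phi_star + (1 - q) * f(x)
                + alpha / (2 * np.sqrt(kappa)) * q * np.dot(x - v, x - v)
                - np.dot(g, g) / (2 * beta)
                + (1 - q) * q * np.dot(g, v - x))
    v = q * v + (1 - q) * x - g / (alpha * np.sqrt(kappa))   # recursion (5)
    y_next = x - g / beta                                    # gradient step
    x = (1 + c) * y_next - c * y                             # extrapolation step
    y = y_next
    assert f(y) <= phi_star + 1e-9                  # inequality (3)
    assert np.allclose(v - x, np.sqrt(kappa) * (x - y))  # closing identity

print("checks passed")
```

Both assertions hold at every step, as the proof guarantees, and $\Phi_s^*$ itself converges to $f(x^*) = 0$ at the accelerated rate.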
By He Sun May 2, 2014 - 5:35 pm
Just updating what I am thinking: the idea of Nesterov's method may not fit the incremental setting, because in the online learning problem we'd like to obtain a near-optimal point for each function component via some sort of line search rule. Moreover, the function components are homogeneous but do not share gradients pointing in the same direction. Nesterov's method, however, seems to move more ambitiously based on the previous step, since it incorporates x_{k-1}. Therefore it seems hard to combine the two. Am I right about this?
Thanks,
He
By He Sun May 2, 2014 - 5:17 pm
Hi Prof. Bubeck,
Thank you for help. I read the paper.
Two more questions.
1. I feel like PARTAN and Nesterov's method share the same idea in the sense that they both perform a gradient step and an extrapolation step. But the difference is that PARTAN needs two exact line searches. So my question is: can PARTAN also achieve the same complexity as Nesterov's method?
2. For online learning problem, can we implement incremental version of nesterov’s method to deal with data on-the-fly? That sounds very cool if we can derive some theoretical results of it.
Thanks,
He
By He Sun April 22, 2014 - 7:16 pm
Hi Prof. Bubeck,
I am a graduate student taking a convex optimization course at MIT. Your post interests me a lot. A specific question: how does one compute the condition number in practice? For other approximate settings of Q, will the method achieve the optimal complexity too?
Thanks,
-He
By Sebastien Bubeck April 23, 2014 - 11:36 am
There is a huge literature on how to estimate the parameters such as smoothness/Lipschitzness/strong convexity as you go. You can start by taking a look at Nesterov’s book (see also the Beck and Teboulle FISTA paper).
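One common instance of such on-the-fly estimation is the backtracking rule from the FISTA paper mentioned above: increase a trial smoothness constant until the standard descent inequality $f(x - \nabla f(x)/\beta) \leq f(x) - \|\nabla f(x)\|^2 / (2\beta)$ holds. A minimal sketch of that rule (function and variable names are mine):

```python
import numpy as np

def backtracking_beta(f, gradf, x, beta0=1.0, eta=2.0):
    """Estimate a workable smoothness constant at x by backtracking:
    multiply the trial beta by eta until the descent inequality
    f(x - g/beta) <= f(x) - ||g||^2 / (2*beta) is satisfied."""
    g = gradf(x)
    beta = beta0
    while f(x - g / beta) > f(x) - np.dot(g, g) / (2 * beta):
        beta *= eta  # step was too long: try a larger beta
    return beta

# Example on f(x) = 0.5 x^T A x, whose true smoothness constant is 100:
A = np.diag([1.0, 100.0])
beta_hat = backtracking_beta(lambda x: 0.5 * x @ A @ x, lambda x: A @ x,
                             np.array([0.0, 1.0]))
print(beta_hat)  # first trial value (a power of 2 here) above the true beta
```

Estimating the strong convexity parameter online is considerably harder; restart schemes are one standard workaround.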