About a year ago I described Nesterov’s Accelerated Gradient Descent in the context of smooth optimization. As I mentioned previously this has been by far the most popular post on this blog. Today I have decided to revisit this post to give a slightly more geometrical proof (though unfortunately still magical in various parts). I will focus on unconstrained optimization of a smooth and strongly convex function (in the previous post I dealt only with the smooth case). Precisely is -strongly convex and -smooth, and we denote by the condition number of . As explained in this post, in this case the basic gradient descent algorithm requires iterations to reach -accuracy. As we shall see below Nesterov’s Accelerated Gradient Descent attains the improved oracle complexity of . This is particularly relevant in Machine Learning applications since the strong convexity parameter can often be viewed as a regularization term, and can be as large as the sample size. Thus reducing the number of step from “” to “” can be a huge deal, especially in large scale applications.

Without further due let’s get to the algorithm, which can be described quite succintly. Note that everything written below is simply a condensed version of the calculations appearing on pages 71–81 in Nesterov 2004’s book. Start at an arbitrary initial point and then iterate the following equations for ,

TheoremLet be -strongly convex and -smooth, then Nesterov’s Accelerated Gradient Descent satisfies

*Proof:* We define -strongly convex quadratic functions by induction as follows:

(1)

Intuitively becomes a finer and finer approximation (from below) to in the following sense:

(2)

The above inequality can be proved immediately by induction, using the fact that by -strong convexity one has

Equation (2) by itself does not say much, for it to be useful one needs to understand how “far” below is . The following inequality answers this question:

(3)

The rest of the proof is devoted to showing that (3) holds true, but first let us see how to combine (2) and (3) to obtain the rate given by the theorem (we use that by -smoothness one has ):

We now prove (3) by induction (note that it is true at since ). Let . Using the definition of (and -smoothness), convexity, and the induction hypothesis, one gets

Thus we now have to show that

(4)

To prove this inequality we have to understand better the functions . First note that (immediate by induction) and thus has to be of the following form:

for some . Now observe that by differentiating (1) and using the above form of one obtains

In particular is by definition minimized at which can now be defined by induction using the above identity, precisely:

(5)

Using the form of and , as well as the original definition (1) one gets the following identity by evaluating at :

(6)

Note that thanks to (5) one has

which combined with (6) yields

Finally we show by induction that , which concludes the proof of (4) and thus also concludes the proof of the theorem:

where the first equality comes from (5), the second from the induction hypothesis, the third from the definition of and the last one from the definition of .

## By Charles Nutter October 28, 2016 - 3:46 am

I am investigation the Nesterov’s Accelerated Gradient Descent for Smooth and Strongly Convex Optimization for application to optimize setting of gate parameters in LSTM RNN blocks. Finding the best estimate of the minimum is very important esp. when considering possible zig-zagging near the minimum. After reading the proof above it appears that most of my concerns are addressed. But, the method appears dangerous accurate and pleasing fast. Could you comment on its applicability to gate and weights design of LSTM blocks? We have only so much time to settle on a useful minimum. Is there an improvement to Nesterov's method.

## By Bikash July 14, 2016 - 11:37 am

Hi Prof Bubeck,

Thanks for this blog article. This helps alot to understand Nesterov's method. I have been exploring and reading about Nesterov's convergence bound for quite some time now. I have a question: Are there some examples of simple and strongly convex functions for which Nesterov's bound is better than the strongly convex bound and worse than the convex quadratic bound.

best,

Bikash

## By Sebastien Bubeck July 27, 2016 - 2:22 pm

I'm not sure I understand the question correctly but you can check Section 5.3 here https://arxiv.org/abs/1602.04847 , I think the example we give could answer your question.

## By Xiang Cheng December 29, 2015 - 5:39 am

Hi,

Is there any way to generalize this acceleration to functions which are smooth/strongly convex in some non-euclidean norm (e.g. l1)?

L2 has the nice property that the lower bound is always quadratic, whose center is shifted in a way that is easy to compute. In contrast, the lower bound for L1 turns into something complicated very quickly.

Does there exist any other norm for which the lower bound is easily update-able?

## By [Unfinished] Gradient-based optimization algorithms for training in machine learning and EDA global placement process | Exponentials July 14, 2015 - 5:27 pm

[…] [2] Sebastien Bubeck, Nesterov’s Accelerated Gradient Descent for Smooth and Strongly Convex Optimization, https://blogs.princeton.edu/imabandit/2014/03/06/nesterovs-accelerated-gradient-descent-for-smooth-a… […]

## By He Sun May 2, 2014 - 5:35 pm

Just Updating what I am thinking, the idea of nesterov's method may be cannot match incremental idea, because we'd like to obtain the near optimal for each function component in online learning problem by some sort of line search rules. Moreover, each function component is homogenous, but not share the gradient in the same direction. However, for nesterov's method, we seems to move more ambitious based on the previous step because we incorporate x_{k-1}. Therefore, it seems hard to combine these two. Am I right about this?

Thanks,

He

## By He Sun May 2, 2014 - 5:17 pm

Hi Prof. Bubeck,

Thank you for help. I read the paper.

Two more questions.

1. I feel like PARTAN and nesterov's method share the same idea in the sense that they both perform a gradient step and an extrapolation step. But the difference is PARTAN needs two exact line searches. So my question is, can PARTAN also achieves the same complexity as nesterov's method?

2. For online learning problem, can we implement incremental version of nesterov's method to deal with data on-the-fly? That sounds very cool if we can derive some theoretical results of it.

Thanks,

He

## By He Sun April 22, 2014 - 7:16 pm

Hi Prof. Bubeck,

I am a graduate student taking convex optimization course at MIT. Your post attracts me a lot. A specific question, how to compute condition number in practice? For other approximated setting for Q, will it achieve the optimal complexity too?

Thanks,

-He

## By Sebastien Bubeck April 23, 2014 - 11:36 am

There is a huge literature on how to estimate the parameters such as smoothness/Lipschitzness/strong convexity as you go. You can start by taking a look at Nesterov's book (see also the Beck and Teboulle FISTA paper).