About a year ago I described Nesterov’s Accelerated Gradient Descent in the context of smooth optimization. As I mentioned previously this has been by far the most popular post on this blog. Today I have decided to revisit this post to give a slightly more geometrical proof (though unfortunately still magical in various parts). I will focus on unconstrained optimization of a smooth and strongly convex function (in the previous post I dealt only with the smooth case). Precisely is -strongly convex and -smooth, and we denote by the condition number of . As explained in this post, in this case the basic gradient descent algorithm requires iterations to reach -accuracy. As we shall see below Nesterov’s Accelerated Gradient Descent attains the improved oracle complexity of . This is particularly relevant in Machine Learning applications since the strong convexity parameter can often be viewed as a regularization term, and can be as large as the sample size. Thus reducing the number of step from “” to “” can be a huge deal, especially in large scale applications.

Without further due let’s get to the algorithm, which can be described quite succintly. Note that everything written below is simply a condensed version of the calculations appearing on pages 71–81 in Nesterov 2004’s book. Start at an arbitrary initial point and then iterate the following equations for ,

TheoremLet be -strongly convex and -smooth, then Nesterov’s Accelerated Gradient Descent satisfies

*Proof:* We define -strongly convex quadratic functions by induction as follows:

(1)

Intuitively becomes a finer and finer approximation (from below) to in the following sense:

(2)

The above inequality can be proved immediately by induction, using the fact that by -strong convexity one has

Equation (2) by itself does not say much, for it to be useful one needs to understand how “far” below is . The following inequality answers this question:

(3)

The rest of the proof is devoted to showing that (3) holds true, but first let us see how to combine (2) and (3) to obtain the rate given by the theorem (we use that by -smoothness one has ):

We now prove (3) by induction (note that it is true at since ). Let . Using the definition of (and -smoothness), convexity, and the induction hypothesis, one gets

Thus we now have to show that

(4)

To prove this inequality we have to understand better the functions . First note that (immediate by induction) and thus has to be of the following form:

for some . Now observe that by differentiating (1) and using the above form of one obtains

In particular is by definition minimized at which can now be defined by induction using the above identity, precisely:

(5)

Using the form of and , as well as the original definition (1) one gets the following identity by evaluating at :

(6)

Note that thanks to (5) one has

which combined with (6) yields

Finally we show by induction that , which concludes the proof of (4) and thus also concludes the proof of the theorem:

where the first equality comes from (5), the second from the induction hypothesis, the third from the definition of and the last one from the definition of .