ORF523: Nesterov’s Accelerated Gradient Descent

In this lecture we consider the same setting as in the previous post (that is, we want to minimize a smooth convex function over \mathbb{R}^n). Previously we saw that the plain Gradient Descent algorithm has a rate of convergence of order 1/t after t steps, while the lower bound that we proved is of order 1/t^2.

We now present a beautiful algorithm due to Nesterov, called Nesterov’s Accelerated Gradient Descent, which attains a rate of order 1/t^2. First we define the following sequences:

    \[\lambda_0 = 0, \ \lambda_{s} = \frac{1 + \sqrt{1+ 4 \lambda_{s-1}^2}}{2}, \ \text{and} \ \gamma_s = \frac{1 - \lambda_s}{\lambda_{s+1}}.\]
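For concreteness, the first few values of these sequences are

    \[\lambda_1 = 1, \quad \lambda_2 = \frac{1+\sqrt{5}}{2} \approx 1.62, \quad \lambda_3 \approx 2.19, \quad \gamma_1 = \frac{1 - \lambda_1}{\lambda_2} = 0, \quad \gamma_2 = \frac{1 - \lambda_2}{\lambda_3} \approx -0.28 .\]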

(Note that \gamma_s \leq 0.) Now the algorithm is simply defined by the following equations, with an arbitrary initial point x_1 = y_1,

    \begin{eqnarray*} y_{s+1} & = & x_s - \frac{1}{\beta} \nabla f(x_s) , \\ x_{s+1} & = & (1 - \gamma_s) y_{s+1} + \gamma_s y_s . \end{eqnarray*}

In other words, Nesterov’s Accelerated Gradient Descent performs a simple step of gradient descent to go from x_s to y_{s+1}, and then it ‘slides’ a little bit further than y_{s+1}, along the direction from the previous point y_s to y_{s+1} (recall that \gamma_s \leq 0).
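To make the recursion concrete, here is a minimal Python sketch of the iteration, assuming access to a gradient oracle grad_f and the smoothness parameter beta; the names grad_f, beta, x1 and num_iters are illustrative placeholders, not part of the analysis above.

    import numpy as np

    def nesterov_agd(grad_f, beta, x1, num_iters):
        # Sketch of Nesterov's Accelerated Gradient Descent.
        # grad_f: gradient oracle of the beta-smooth convex function f
        # beta:   smoothness parameter
        # x1:     arbitrary starting point (x_1 = y_1)
        x = np.array(x1, dtype=float)
        y = x.copy()
        lam = 0.0                                     # lambda_0 = 0
        lam_next = (1 + np.sqrt(1 + 4 * lam**2)) / 2  # lambda_1 = 1
        for _ in range(num_iters):
            lam, lam_next = lam_next, (1 + np.sqrt(1 + 4 * lam_next**2)) / 2
            gamma = (1 - lam) / lam_next              # gamma_s = (1 - lambda_s) / lambda_{s+1} <= 0
            y_next = x - grad_f(x) / beta             # gradient step: y_{s+1}
            x = (1 - gamma) * y_next + gamma * y      # 'slide' beyond y_{s+1}: x_{s+1}
            y = y_next
        return y

For instance, for the least-squares objective f(x) = \frac12 \|Ax - b\|^2 one could take grad_f = lambda x: A.T @ (A @ x - b) and beta equal to the largest eigenvalue of A^{\top} A.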

The intuition behind the algorithm is quite difficult to grasp, and unfortunately the analysis will not be very enlightening either. Nonetheless Nesterov’s Accelerated Gradient is an optimal method (in terms of oracle complexity) for smooth convex optimization, as shown by the following theorem. [Added in September 2015: we now have a simple geometric explanation of the phenomenon of acceleration, see this post.]


Theorem (Nesterov 1983)
 Let f be a convex and \beta-smooth function. Then Nesterov’s Accelerated Gradient Descent satisfies

    \[f(y_t) - f(x^*) \leq \frac{2 \beta \|x_1 - x^*\|^2}{t^2} .\]

We follow here the proof by Beck and Teboulle from the paper ‘A fast iterative shrinkage-thresholding algorithm for linear inverse problems’.

Proof: We start with the following observation, which makes use of Lemma 1 and Lemma 2 from the previous lecture: let x, y \in \mathbb{R}^n, then

    \begin{align*} & f\left(x - \frac{1}{\beta} \nabla f(x) \right) - f(y) \\ & \leq f\left(x - \frac{1}{\beta} \nabla f(x) \right) - f(x) + \nabla f(x)^{\top} (x-y) \\ & \leq \nabla f(x)^{\top} \left(x - \frac{1}{\beta} \nabla f(x) - x \right) + \frac{\beta}{2} \left\|x - \frac{1}{\beta} \nabla f(x) - x \right\|^2 + \nabla f(x)^{\top} (x-y) \\ & = - \frac{1}{2 \beta} \| \nabla f(x) \|^2 + \nabla f(x)^{\top} (x-y) . \end{align*}

Now let us apply this inequality to x = x_{s} and y = y_s, which gives

(1)   \begin{eqnarray*} f(y_{s+1}) - f(y_s) & = & f\left(x_s - \frac{1}{\beta} \nabla f(x_s) \right) - f(y_s) \notag \\ & \leq & - \frac{1}{2 \beta} \| \nabla f(x_s) \|^2 + \nabla f(x_s)^{\top} (x_s-y_s) \notag \\ & = & - \frac{\beta}{2} \| y_{s+1} - x_s \|^2 - \beta (y_{s+1} - x_s)^{\top} (x_s-y_s) .  \end{eqnarray*}

Similarly, we apply it to x = x_s and y = x^*, which gives

(2)   \begin{equation*}  f(y_{s+1}) - f(x^*) \leq - \frac{\beta}{2} \| y_{s+1} - x_s \|^2 - \beta (y_{s+1} - x_s)^{\top} (x_s-x^*) . \end{equation*}

Now, multiplying (1) by (\lambda_{s}-1) and adding the result to (2), one obtains, with \delta_s = f(y_s) - f(x^*),

    \[\lambda_{s} \delta_{s+1} - (\lambda_{s} - 1) \delta_s \leq - \frac{\beta}{2} \lambda_{s} \| y_{s+1} - x_s \|^2 - \beta (y_{s+1} - x_s)^{\top} (\lambda_{s} x_{s} - (\lambda_{s} - 1) y_s-x^*) .\]
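Here the left-hand side arises because the left-hand side of (1) equals \delta_{s+1} - \delta_s, so that

    \[(\lambda_{s} - 1)(\delta_{s+1} - \delta_s) + \delta_{s+1} = \lambda_{s} \delta_{s+1} - (\lambda_{s} - 1) \delta_s .\]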

Multiplying this inequality by \lambda_{s}, and using that by definition \lambda_{s-1}^2 = \lambda_{s}^2 - \lambda_{s} (verified below), one obtains

(3)   \begin{align*} & \lambda_{s}^2 \delta_{s+1} - \lambda_{s-1}^2 \delta_s \notag \\ & \leq - \frac{\beta}{2} \bigg( \|\lambda_{s}( y_{s+1} - x_s )\|^2 + 2 \lambda_{s} (y_{s+1} - x_s)^{\top} (\lambda_{s} x_{s} - (\lambda_{s} - 1) y_s-x^*) \bigg).  \end{align*}
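The identity \lambda_{s-1}^2 = \lambda_{s}^2 - \lambda_{s} follows directly from the recursion defining the sequence:

    \[2 \lambda_{s} - 1 = \sqrt{1 + 4 \lambda_{s-1}^2} \ \Rightarrow \ 4 \lambda_{s}^2 - 4 \lambda_{s} + 1 = 1 + 4 \lambda_{s-1}^2 \ \Rightarrow \ \lambda_{s}^2 - \lambda_{s} = \lambda_{s-1}^2 .\]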

Now one can verify that

(4)   \begin{align*} & \|\lambda_{s}( y_{s+1} - x_s )\|^2 + 2 \lambda_{s} (y_{s+1} - x_s)^{\top} (\lambda_{s} x_{s} - (\lambda_{s} - 1) y_s-x^*)\notag \\ & = \| \lambda_{s} y_{s+1} - (\lambda_{s} - 1) y_{s}-x^* \|^2 - \| \lambda_{s} x_{s} - (\lambda_{s} - 1) y_{s}-x^* \|^2 .  \end{align*}
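Indeed, (4) is the elementary identity \|a\|^2 + 2 a^{\top} b = \|a + b\|^2 - \|b\|^2 applied with a = \lambda_{s} (y_{s+1} - x_s) and b = \lambda_{s} x_{s} - (\lambda_{s} - 1) y_{s} - x^*, for which

    \[a + b = \lambda_{s} y_{s+1} - (\lambda_{s} - 1) y_{s} - x^* .\]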

Next remark that, by definition, one has

(5)   \begin{align*} & x_{s+1} = y_{s+1} + \gamma_s (y_s - y_{s+1}) \notag \\ & \Leftrightarrow \lambda_{s+1} x_{s+1} = \lambda_{s+1} y_{s+1} + (1-\lambda_{s})(y_s - y_{s+1}) \notag \\ & \Leftrightarrow \lambda_{s+1} x_{s+1} - (\lambda_{s+1} - 1) y_{s+1}= \lambda_{s} y_{s+1} - (\lambda_{s}-1) y_{s} .  \end{align*}

Putting together (3), (4) and (5), one gets, with u_s = \lambda_{s} x_{s} - (\lambda_{s} - 1) y_{s} - x^*,

    \[\lambda_{s}^2 \delta_{s+1} - \lambda_{s-1}^2 \delta_s \leq \frac{\beta}{2} \bigg(\|u_s\|^2 - \|u_{s+1}\|^2 \bigg) .\]

Summing these inequalities from s=1 to s=t-1, one obtains:

    \[\delta_t \leq \frac{\beta}{2 \lambda_{t-1}^2} \|u_1\|^2.\]
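Indeed, since \lambda_0 = 0 the left-hand sides telescope,

    \[\sum_{s=1}^{t-1} \left( \lambda_{s}^2 \delta_{s+1} - \lambda_{s-1}^2 \delta_s \right) = \lambda_{t-1}^2 \delta_t \leq \frac{\beta}{2} \big( \|u_1\|^2 - \|u_t\|^2 \big) \leq \frac{\beta}{2} \|u_1\|^2 ,\]

and since \lambda_1 = 1 and x_1 = y_1 one has u_1 = x_1 - x^*.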

By induction it is easy to see that \lambda_{t-1} \geq \frac{t}{2}, which concludes the proof.
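To see the latter, note that \lambda_1 = 1 and

    \[\lambda_{s} = \frac{1 + \sqrt{1 + 4 \lambda_{s-1}^2}}{2} \geq \frac{1 + 2 \lambda_{s-1}}{2} = \lambda_{s-1} + \frac{1}{2} ,\]

so that \lambda_{t-1} \geq 1 + \frac{t-2}{2} = \frac{t}{2}.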

\Box
