A short proof for Nesterov’s momentum

Yesterday I posted the following picture on Twitter and it quickly became my most visible tweet ever (by far):

$\begin{eqnarray*} & x^+ := x - \eta \nabla f(x) & \text{(gradient step)} \\ & d_t := \gamma_t \cdot (x_{t} - x_{t-1}) & \text{(momentum term)} \\ \\ \text{ [Cauchy, 1847] } & x_{t+1} = x_t^+ & \text{(gradient descent)} \\ \text{ [Polyak, 1964] } & x_{t+1} = x_t^+ + d_{t} & \text{(momentum + gradient)} \\ \text{ [Nesterov, 1983] } & x_{t+1} = (x_t + d_{t})^+ & \text{(momentum + lookahead gradient)} \end{eqnarray*}$
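To make the difference between the three updates concrete, here is a minimal Python sketch of one step of each. The function names, the list-of-floats representation of $x$, and the choice to pass a single `gamma` for $\gamma_t$ are assumptions made for illustration; they are not part of the original picture.

```python
def grad_step(x, grad, eta):
    """Gradient step x^+ := x - eta * grad f(x), coordinate-wise."""
    g = grad(x)
    return [xi - eta * gi for xi, gi in zip(x, g)]


def momentum(x, x_prev, gamma):
    """Momentum term d_t := gamma_t * (x_t - x_{t-1})."""
    return [gamma * (a - b) for a, b in zip(x, x_prev)]


def add(u, v):
    """Coordinate-wise sum of two vectors stored as lists."""
    return [a + b for a, b in zip(u, v)]


def gradient_descent_step(x, x_prev, grad, eta, gamma):
    # [Cauchy] x_{t+1} = x_t^+  (x_prev and gamma are unused, kept for a uniform signature)
    return grad_step(x, grad, eta)


def polyak_step(x, x_prev, grad, eta, gamma):
    # [Polyak] x_{t+1} = x_t^+ + d_t : gradient taken at x_t, momentum added afterwards
    return add(grad_step(x, grad, eta), momentum(x, x_prev, gamma))


def nesterov_step(x, x_prev, grad, eta, gamma):
    # [Nesterov] x_{t+1} = (x_t + d_t)^+ : gradient taken at the lookahead point x_t + d_t
    return grad_step(add(x, momentum(x, x_prev, gamma)), grad, eta)
```

The only difference between the last two functions is where the gradient is evaluated: at $x_t$ for Polyak, at $x_t + d_t$ for Nesterov.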

I thought this would be a good opportunity to revisit the proof of Nesterov’s momentum, especially since as it turns out I really don’t like the way I described it back in 2013 (and to this day the latter post also remains my most visible post ever…). So here we go, for what is hopefully a short and intuitive proof of the $1/t^2$ convergence rate for Nesterov’s momentum (disclaimer: this proof is merely a rearranging of well-known calculations, nothing new is going on here).

We assume that $f$ is a $\beta$-smooth convex function, and we take $\eta = 1/\beta$ in the gradient step. The momentum coefficient $\gamma_t$ will be set to a very particular value, which comes out naturally in the proof.

The two basic inequalities

Let us denote $\delta_t = f(x_t) - f(x^*)$ and $g_t = - \frac{1}{\beta} \nabla f(x_t + d_t)$ (note that $x_{t+1} = x_t + g_t + d_t$ ). Now let us write our favorite inequalities (using $f(x^+) - f(x) \leq - \frac{1}{2 \beta} |\nabla f(x)|^2$ and $f(x) - f(y) \leq \nabla f(x) \cdot (x-y)$ ):

$\delta_{t+1} - \delta_t \leq - \frac{\beta}{2} \left( |g_t|^2 + 2 g_t \cdot d_t \right) \,,$

and

$\delta_{t+1} \leq - \frac{\beta}{2} \left( |g_t|^2 + 2 g_t \cdot (x_t + d_t - x^*) \right) \,.$
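For completeness, here is how the two inequalities follow. Write $y_t := x_t + d_t$ for the lookahead point (a shorthand introduced only for this derivation), so that $x_{t+1} = y_t^+$ and $\nabla f(y_t) = - \beta g_t$. Split each difference at $f(y_t)$, then apply the smoothness inequality to the first bracket and the convexity inequality to the second:

$\begin{align*} \delta_{t+1} - \delta_t & = \left[ f(y_t^+) - f(y_t) \right] + \left[ f(y_t) - f(x_t) \right] \\ & \leq - \frac{1}{2 \beta} |\nabla f(y_t)|^2 + \nabla f(y_t) \cdot (y_t - x_t) \\ & = - \frac{\beta}{2} |g_t|^2 - \beta \, g_t \cdot d_t \,, \\ \\ \delta_{t+1} & = \left[ f(y_t^+) - f(y_t) \right] + \left[ f(y_t) - f(x^*) \right] \\ & \leq - \frac{1}{2 \beta} |\nabla f(y_t)|^2 + \nabla f(y_t) \cdot (y_t - x^*) \\ & = - \frac{\beta}{2} |g_t|^2 - \beta \, g_t \cdot (x_t + d_t - x^*) \,. \end{align*}$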

On the way to a telescopic sum

Recall now that $|a|^2 + 2 a \cdot b = |a+b|^2 - |b|^2$ , so it would be nice to somehow combine the two above inequalities to obtain a telescopic sum thanks to this simple formula. Let us try to take a convex combination of the two inequalities. In fact it will be slightly more elegant if we use the coefficient $1$ on the second inequality, so let us do $\lambda_t-1$ times the first inequality plus $1$ times the second inequality. We obtain an inequality whose right hand side is given by $-\frac{\beta}{2}$ times

$\begin{align*} & \lambda_t |g_t|^2 + 2 g_t \cdot (x_t + \lambda_t d_t - x^*) \\ & = \frac{1}{\lambda_t} \left( |x_t + \lambda_t d_t - x^* + \lambda_t g_t|^2 - |x_t + \lambda_t d_t - x^*|^2 \right) \,. \end{align*}$
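Spelled out, the combination reads as follows (multiplying the first inequality by $\lambda_t - 1$ preserves its direction only when $\lambda_t \geq 1$, which should be kept in mind when choosing $\lambda_t$):

$\begin{align*} (\lambda_t - 1) (\delta_{t+1} - \delta_t) + \delta_{t+1} & = \lambda_t \delta_{t+1} - (\lambda_t - 1) \delta_t \\ & \leq - \frac{\beta}{2} \left( (\lambda_t - 1) \left( |g_t|^2 + 2 g_t \cdot d_t \right) + |g_t|^2 + 2 g_t \cdot (x_t + d_t - x^*) \right) \\ & = - \frac{\beta}{2} \left( \lambda_t |g_t|^2 + 2 g_t \cdot (x_t + \lambda_t d_t - x^*) \right) \,. \end{align*}$

The rewriting as a difference of squares is then the formula $|a|^2 + 2 a \cdot b = |a+b|^2 - |b|^2$ applied with $a = \lambda_t g_t$ and $b = x_t + \lambda_t d_t - x^*$, after factoring out $\frac{1}{\lambda_t}$.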

Recall that our objective is to obtain a telescopic sum, and at this point we still have flexibility both to choose $\lambda_t$ and $\gamma_t$ . What we would like to have is:

$x_t + \lambda_t d_t - x^* + \lambda_t g_t = x_{t+1} + \lambda_{t+1} d_{t+1} - x^* \,.$

Observe that (since $d_{t+1} = \gamma_{t+1} \cdot (d_t + g_t)$ ) the right hand side can be written as $x_t + g_t + d_t + \lambda_{t+1} \cdot \gamma_{t+1} \cdot (g_t + d_t) - x^*$ , and thus we see that we simply need to have:

$\lambda_t = 1+ \lambda_{t+1} \cdot \gamma_{t+1} \,.$

Setting the parameters and concluding the proof

Writing $u_t := \frac{\beta}{2} |x_t + \lambda_t d_t - x^*|^2$ we now obtain as a result of the combination of our two starting inequalities:

$\lambda_t^2 \delta_{t+1} - (\lambda_t^2 - \lambda_t) \delta_t \leq u_t - u_{t+1} \,.$

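It only remains to select $\lambda_t$ such that $\lambda_t^2 - \lambda_t = \lambda_{t-1}^2$ (i.e., roughly $\lambda_t$ is of order $t$), so that the left hand side becomes $\lambda_t^2 \delta_{t+1} - \lambda_{t-1}^2 \delta_t$ and the previous inequality telescopes. Starting from $\lambda_0 = 0$ and $x_0 = x_1$ (so that $d_1 = 0$ and $u_1 = \frac{\beta}{2} |x_1 - x^*|^2$), summing over $t$ and using $u_{t+1} \geq 0$ one obtains $\delta_{t+1} \leq \frac{\beta |x_1 - x^*|^2}{2 \lambda_t^2}$, which is exactly the $1/t^2$ rate we were looking for.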
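As a sanity check, here is a minimal Python sketch that runs the resulting scheme and compares $\delta_{t+1}$ with the bound above. Several things here are assumptions made for illustration rather than part of the proof: the initialization $\lambda_0 = 0$ and $x_0 = x_1$ (so that $d_1 = 0$), the diagonal quadratic used as test function, and the helper names. The momentum coefficient is obtained by solving $\lambda_t = 1 + \lambda_{t+1} \cdot \gamma_{t+1}$, i.e., $\gamma_t = (\lambda_{t-1} - 1) / \lambda_t$.

```python
import math


def lambda_sequence(n):
    """Return [lambda_0, ..., lambda_n] with lambda_0 = 0 (assumed) and
    lambda_t the positive root of lambda_t^2 - lambda_t = lambda_{t-1}^2."""
    lams = [0.0]
    for _ in range(n):
        lams.append((1.0 + math.sqrt(1.0 + 4.0 * lams[-1] ** 2)) / 2.0)
    return lams


def nesterov_run(grad, x1, beta, n_steps):
    """Iterate x_{t+1} = (x_t + d_t)^+ with eta = 1/beta.
    Returns [x_1, ..., x_{n_steps + 1}]."""
    lams = lambda_sequence(n_steps)
    xs = [list(x1)]
    x_prev = list(x1)  # assumed convention x_0 = x_1, hence d_1 = 0
    for t in range(1, n_steps + 1):
        x = xs[-1]
        # gamma_t = (lambda_{t-1} - 1) / lambda_t, from lambda_{t-1} = 1 + lambda_t * gamma_t
        gamma = (lams[t - 1] - 1.0) / lams[t]
        # lookahead point x_t + d_t
        y = [xi + gamma * (xi - pi) for xi, pi in zip(x, x_prev)]
        g = grad(y)
        # gradient step from the lookahead point
        xs.append([yi - gi / beta for yi, gi in zip(y, g)])
        x_prev = x
    return xs


# Illustrative test problem: f(x) = 0.5 * sum_i a_i * x_i^2, minimized at x^* = 0.
a = [1.0, 0.1, 0.01]
beta = max(a)  # smoothness constant of this quadratic


def f(x):
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))


def grad(x):
    return [ai * xi for ai, xi in zip(a, x)]


n_steps = 50
x1 = [1.0, 1.0, 1.0]
xs = nesterov_run(grad, x1, beta, n_steps)
lams = lambda_sequence(n_steps)
r2 = sum(xi * xi for xi in x1)  # |x_1 - x^*|^2 with x^* = 0

# xs[t] stores x_{t+1}; since f(x^*) = 0, delta_{t+1} = f(xs[t]).
for t in range(1, n_steps + 1):
    bound = beta * r2 / (2.0 * lams[t] ** 2)
    assert f(xs[t]) <= bound + 1e-12

print("final gap:", f(xs[-1]))
```

Note that with this initialization $\lambda_1 = 1$ and $\gamma_2 = 0$, so the first two steps are plain gradient steps and the momentum only kicks in from the third step on.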

11 Responses to "A short proof for Nesterov’s momentum"

By Elvis November 22, 2018 - 3:21 am

For the record, it should be noted that \delta_t := f(x_t) - f(x^*) in the above calculations.
- By Sebastien Bubeck November 22, 2018 - 12:51 pm
  
  Thanks! Fixed.
By Anonymous November 22, 2018 - 12:41 am

typo: in the choice of lambda_t, it should be lambda_{t-1}^{2}
- By Sebastien Bubeck November 22, 2018 - 12:50 pm
  
  Thanks! Fixed.
By Anonymous November 22, 2018 - 12:28 am

small typo: I think it is d_{t+1} = gamma_{t+1} (d_t + g_t)
- By Anonymous November 23, 2018 - 12:25 am
  
  Ok so there is a typo in the first definition of d_t (momentum term):
  d_t = gamma_t (x_t - x_{t-1}).
  
  It also changes the condition between gamma and lambda to obtain the telescopic sum.
  
  Btw, thanks a lot for your blog <3
- By Sebastien Bubeck November 23, 2018 - 12:05 pm
  
  Thanks! Fixed.
By Anonymous November 21, 2018 - 9:09 pm

A stupid question about the two basic inequalities: if we use the descent lemma and strong convexity (instead of just convexity), do we obtain the geometric rate with a short proof?
- By Sebastien Bubeck November 22, 2018 - 12:51 pm
  
  Good question! I don’t know, but it’s worth checking.
By Anonymous November 21, 2018 - 8:20 pm

In section The two basic inequalities, the definition of delta_t is missing.
- By Sebastien Bubeck November 22, 2018 - 12:51 pm
  
  Thanks! Fixed.
