Yesterday I posted the following picture on Twitter and it quickly became my most visible tweet ever (by far):

I thought this would be a good opportunity to revisit the proof of Nesterov’s momentum, especially since as it turns out I really don’t like the way I described it back in 2013 (and to this day the latter post also remains my most visible post ever…). So here we go, for what is hopefully a short and intuitive proof of the convergence rate for Nesterov’s momentum (disclaimer: this proof is merely a rearranging of well-known calculations, nothing new is going on here).

We assume that is -smooth convex function, and we take in the gradient step. The momentum term will be set to a very particular value, which comes out naturally in the proof.

**The two basic inequalities**

Let us denote and (note that ). Now let us write our favorite inequalities (using and ):

and

**On the way to a telescopic sum**

Recall now that , so it would be nice to somehow combine the two above inequalities to obtain a telescopic sum thanks to this simple formula. Let us try to take a convex combination of the two inequalities. In fact it will be slightly more elegant if we use the coefficient on the second inequality, so let us do times the first inequality plus times the second inequality. We obtain an inequality whose right hand side is given by times

Recall that our objective is to obtain a telescopic sum, and at this point we still have flexibility both to choose and . What we would like to have is:

Observe that (since ) the right hand side can be written as , and thus we see that we simply need to have:

**Setting the parameters and concluding the proof**

Writing we now obtain as a result of the combination of our two starting inequalities:

It only remains to select such that (i.e., roughly is of order ) so that by summing the previous inequality one obtains which is exactly the rate we were looking for.

## By Ciara Martin February 24, 2021 - 10:13 am

It was helpful for me.

## By Elvis November 22, 2018 - 3:21 am

For the record, it should be noted that \delta_t := f(x_t) – f(x^*) in above calculations.

## By Sebastien Bubeck November 22, 2018 - 12:51 pm

Thanks! Fixed.

## By Anonymous November 22, 2018 - 12:41 am

typo: in the choice of lambda_t, it should be lambda_{t-1}^{2}

## By Sebastien Bubeck November 22, 2018 - 12:50 pm

Thanks! Fixed.

## By Anonymous November 22, 2018 - 12:28 am

small typo: I think it is d_{t+1} = gamma_{t+1} (d_t + g_t)

## By Anonymous November 23, 2018 - 12:25 am

Ok so there is a typo in the first definition of d_t (momentum term):

d_t = gamma_t ( x_t – x_{t-1}).

It also changes the condition between gamma and lambda to obtain the telescopic sum.

Btw, thanks a lot for your blog <3

## By Sebastien Bubeck November 23, 2018 - 12:05 pm

Thanks! Fixed.

## By Anonymous November 21, 2018 - 9:09 pm

Stupid question in the two basic inequalities. If we use descent lemma and strong convexity (instead of just convexity), do we obtain the geometric rates with short proof?

## By Sebastien Bubeck November 22, 2018 - 12:51 pm

Good question! I don’t know, but it’s worth checking.

## By Anonymous November 21, 2018 - 8:20 pm

In section The two basic inequalities, the definition of delta_t is missing.

## By Sebastien Bubeck November 22, 2018 - 12:51 pm

Thanks! Fixed.