A short proof for Nesterov’s momentum

Yesterday I posted the following picture on Twitter and it quickly became my most visible tweet ever (by far):

    \begin{eqnarray*} & x^+ := x - \eta \nabla f(x) & \text{(gradient step)} \\ & d_t := \gamma_t \cdot (x_{t} - x_{t-1}) & \text{(momentum term)} \\ \\ \text{ [Cauchy, 1847] } & x_{t+1} = x_t^+ & \text{(gradient descent)} \\ \text{ [Polyak, 1964] } & x_{t+1} = x_t^+ + d_{t} & \text{(momentum + gradient)} \\ \text{ [Nesterov, 1983] } & x_{t+1} = (x_t + d_{t})^+ & \text{(momentum + lookahead gradient)} \end{eqnarray*}

I thought this would be a good opportunity to revisit the proof of Nesterov’s momentum, especially since as it turns out I really don’t like the way I described it back in 2013 (and to this day the latter post also remains my most visible post ever…). So here we go, for what is hopefully a short and intuitive proof of the 1/t^2 convergence rate for Nesterov’s momentum (disclaimer: this proof is merely a rearranging of well-known calculations, nothing new is going on here).

We assume that f is \beta-smooth convex function, and we take \eta = 1/\beta in the gradient step. The momentum term \gamma_t will be set to a very particular value, which comes out naturally in the proof.


The two basic inequalities

Let us denote \delta_t = f(x_t) - f(x^*) and g_t = - \frac{1}{\beta} \nabla f(x_t + d_t) (note that x_{t+1} = x_t + g_t + d_t). Now let us write our favorite inequalities (using f(x^+) - f(x) \leq - \frac{1}{2 \beta} |\nabla f(x)|^2 and f(x) - f(y) \leq \nabla f(x) \cdot (x-y)):

    \[ \delta_{t+1} - \delta_t \leq - \frac{\beta}{2} \left( |g_t|^2 + 2 g_t \cdot d_t \right) \,, \]


    \[ \delta_{t+1} \leq - \frac{\beta}{2} \left( |g_t|^2 + 2 g_t \cdot (x_t + d_t - x^*) \right) \,. \]


On the way to a telescopic sum

Recall now that |a|^2 + 2 a \cdot b = |a+b|^2 - |b|^2, so it would be nice to somehow combine the two above inequalities to obtain a telescopic sum thanks to this simple formula. Let us try to take a convex combination of the two inequalities. In fact it will be slightly more elegant if we use the coefficient 1 on the second inequality, so let us do \lambda_t-1 times the first inequality plus 1 times the second inequality. We obtain an inequality whose right hand side is given by -\frac{\beta}{2} times

    \begin{align*} & \lambda_t |g_t|^2 + 2 g_t \cdot (x_t + \lambda_t d_t - x^*) \\ & = \frac{1}{\lambda_t} \left( |x_t + \lambda_t d_t - x^* + \lambda_t g_t|^2 - |x_t + \lambda_t d_t - x^*|^2 \right) \,. \end{align*}

Recall that our objective is to obtain a telescopic sum, and at this point we still have flexibility both to choose \lambda_t and \gamma_t. What we would like to have is:

    \[ x_t + \lambda_t d_t - x^* + \lambda_t g_t = x_{t+1} + \lambda_{t+1} d_{t+1} - x^* \,. \]

Observe that (since d_{t+1} = \gamma_{t+1} \cdot (d_t + g_t)) the right hand side can be written as x_t + g_t + d_t + \lambda_{t+1} \cdot \gamma_{t+1} \cdot (g_t + d_t) - x^*, and thus we see that we simply need to have:

    \[ \lambda_t = 1+ \lambda_{t+1} \cdot \gamma_{t+1} \,. \]


Setting the parameters and concluding the proof

Writing u_t := \frac{\beta}{2} |x_t + \lambda_t d_t - x^*|^2 we now obtain as a result of the combination of our two starting inequalities:

    \[ \lambda_t^2 \delta_{t+1} - (\lambda_t^2 - \lambda_t) \delta_t \leq u_t - u_{t+1} \,. \]

It only remains to select \lambda_t such that \lambda_t^2 - \lambda_t = \lambda_{t-1}^2 (i.e., roughly \lambda_t is of order t) so that by summing the previous inequality one obtains \delta_{t+1} \leq \frac{\beta |x_1 - x^*|^2}{2 \lambda_t^2} which is exactly the 1/t^2 rate we were looking for.

This entry was posted in Optimization. Bookmark the permalink.

16 Responses to "A short proof for Nesterov’s momentum"