Nesterov’s Accelerated Gradient Descent for Smooth and Strongly Convex Optimization


About a year ago I described Nesterov’s Accelerated Gradient Descent in the context of smooth optimization. As I mentioned previously, this has been by far the most popular post on this blog. Today I have decided to revisit this post to give a slightly more geometrical proof (though unfortunately still magical in various parts). I will focus on unconstrained optimization of a smooth and strongly convex function f (in the previous post I dealt only with the smooth case). Precisely, f is \alpha-strongly convex and \beta-smooth, and we denote by Q=\beta / \alpha the condition number of f. As explained in this post, in this case the basic gradient descent algorithm requires O(Q \log(1/\epsilon)) iterations to reach \epsilon-accuracy. As we shall see below, Nesterov’s Accelerated Gradient Descent attains the improved oracle complexity of O(\sqrt{Q} \log(1/\epsilon)). This is particularly relevant in Machine Learning applications, since the strong convexity parameter \alpha can often be viewed as a regularization term, and 1/\alpha can be as large as the sample size. Thus reducing the number of steps from “\mathrm{sample \; size}” to “\sqrt{\mathrm{sample \; size}}” can be a huge deal, especially in large scale applications.
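To get a feel for the size of this improvement, here is a back-of-the-envelope comparison with purely illustrative numbers (constants in the O(\cdot) are ignored, and the condition number is a hypothetical choice):

```python
import math

# Hypothetical numbers: condition number Q = 10^6 (weak regularization on a
# large sample) and target accuracy epsilon = 10^-6; constants are ignored.
Q, eps = 1e6, 1e-6

gd_iters = Q * math.log(1 / eps)               # O(Q log(1/eps)): gradient descent
agd_iters = math.sqrt(Q) * math.log(1 / eps)   # O(sqrt(Q) log(1/eps)): Nesterov

print(f"gradient descent : ~{gd_iters:.2e} iterations")
print(f"accelerated      : ~{agd_iters:.2e} iterations")
```

The ratio of the two counts is exactly \sqrt{Q}, i.e. a factor of 1000 in this hypothetical scenario.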

Without further ado, let’s get to the algorithm, which can be described quite succinctly. Note that everything written below is simply a condensed version of the calculations appearing on pages 71–81 of Nesterov’s 2004 book. Start at an arbitrary initial point x_1 = y_1 and then iterate the following equations for s \geq 1,

    \begin{eqnarray*} y_{s+1} & = & x_s - \frac{1}{\beta} \nabla f(x_s) , \\ x_{s+1} & = & \left(1 + \frac{\sqrt{Q}-1}{\sqrt{Q}+1} \right) y_{s+1} - \frac{\sqrt{Q}-1}{\sqrt{Q}+1} y_s . \end{eqnarray*}

Theorem Let f be \alpha-strongly convex and \beta-smooth, then Nesterov’s Accelerated Gradient Descent satisfies

    \[f(y_t) - f(x^*) \leq \frac{\alpha + \beta}{2} \|x_1 - x^*\|^2 \exp\left(- \frac{t-1}{\sqrt{Q}} \right).\]
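The two-line iteration is short enough to implement directly. The sketch below is my own illustration, not from the post: it runs the updates on a toy strongly convex quadratic f(x) = \frac{1}{2} x^{\top} A x (so that \alpha and \beta are the smallest and largest eigenvalues of A, and x^* = 0), and then checks the theorem’s bound numerically.

```python
import numpy as np

def nesterov_agd(grad_f, x1, alpha, beta, num_iters):
    """Nesterov's AGD for an alpha-strongly convex, beta-smooth f; returns y_t."""
    Q = beta / alpha                                  # condition number
    momentum = (np.sqrt(Q) - 1) / (np.sqrt(Q) + 1)
    x, y = x1.copy(), x1.copy()                       # start with x_1 = y_1
    for _ in range(num_iters):
        y_next = x - grad_f(x) / beta                 # gradient step
        x = (1 + momentum) * y_next - momentum * y    # momentum step
        y = y_next
    return y

# Toy problem: f(x) = 0.5 x^T A x with A diagonal, so alpha = min eigenvalue,
# beta = max eigenvalue, and the minimizer is x* = 0 with f(x*) = 0.
alpha, beta = 1.0, 100.0                              # Q = 100
A = np.diag(np.linspace(alpha, beta, 50))
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x

rng = np.random.default_rng(0)
x1 = rng.standard_normal(50)
t = 200
y_t = nesterov_agd(grad_f, x1, alpha, beta, num_iters=t - 1)  # y after t-1 updates

# Theorem: f(y_t) - f(x*) <= (alpha+beta)/2 ||x1 - x*||^2 exp(-(t-1)/sqrt(Q))
bound = (alpha + beta) / 2 * np.dot(x1, x1) * np.exp(-(t - 1) / np.sqrt(beta / alpha))
print("bound holds:", f(y_t) <= bound)
```

Note that the bound here is slightly loose: the proof actually yields the rate (1 - 1/\sqrt{Q})^{t-1}, which is upper bounded by \exp(-(t-1)/\sqrt{Q}).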

Proof: We define \alpha-strongly convex quadratic functions \Phi_s, s \geq 1 by induction as follows:

(1)   \begin{eqnarray*} \Phi_1(x) & = & f(x_1) + \frac{\alpha}{2} \|x-x_1\|^2 , \notag \\ \Phi_{s+1}(x) & = & \left(1 - \frac{1}{\sqrt{Q}}\right) \Phi_s(x) \notag \\ & & + \frac{1}{\sqrt{Q}} \left(f(x_s) + \nabla f(x_s)^{\top} (x-x_s) + \frac{\alpha}{2} \|x-x_s\|^2 \right).  \end{eqnarray*}

Intuitively \Phi_s becomes a finer and finer approximation (from below) to f in the following sense:

(2)   \begin{equation*}  \Phi_{s+1}(x) \leq f(x) + \left(1 - \frac{1}{\sqrt{Q}}\right)^s (\Phi_1(x) - f(x)). \end{equation*}

The above inequality can be proved immediately by induction, using the fact that by \alpha-strong convexity one has

    \[f(x_s) + \nabla f(x_s)^{\top} (x-x_s) + \frac{\alpha}{2} \|x-x_s\|^2 \leq f(x) .\]

Equation (2) by itself does not say much; for it to be useful one needs to understand how “far” below f the function \Phi_s is. The following inequality answers this question:

(3)   \begin{equation*}  f(y_s) \leq \min_{x \in \mathbb{R}^n} \Phi_s(x) . \end{equation*}

The rest of the proof is devoted to showing that (3) holds true, but first let us see how to combine (2) and (3) to obtain the rate given by the theorem (we use that by \beta-smoothness one has f(x) - f(x^*) \leq \frac{\beta}{2} \|x-x^*\|^2):

    \begin{eqnarray*} f(y_t) - f(x^*) & \leq & \Phi_t(x^*) - f(x^*) \\ & \leq & \left(1 - \frac{1}{\sqrt{Q}}\right)^{t-1} (\Phi_1(x^*) - f(x^*)) \\ & \leq & \frac{\alpha + \beta}{2} \|x_1-x^*\|^2 \left(1 - \frac{1}{\sqrt{Q}}\right)^{t-1} . \end{eqnarray*}

We now prove (3) by induction (note that it is true at s=1 since x_1=y_1). Let \Phi_s^* = \min_{x \in \mathbb{R}^n} \Phi_s(x). Using the definition of y_{s+1} (and \beta-smoothness), convexity, and the induction hypothesis, one gets

    \begin{align*} & f(y_{s+1}) \\ & \leq f(x_s) - \frac{1}{2 \beta} \| \nabla f(x_s) \|^2 \\ & = \left(1 - \frac{1}{\sqrt{Q}}\right) f(y_s) + \left(1 - \frac{1}{\sqrt{Q}}\right)(f(x_s) - f(y_s)) + \frac{f(x_s)}{\sqrt{Q}} - \frac{1}{2 \beta} \| \nabla f(x_s) \|^2 \\ & \leq \left(1 - \frac{1}{\sqrt{Q}}\right) \Phi_s^* + \left(1 - \frac{1}{\sqrt{Q}}\right) \nabla f(x_s)^{\top} (x_s - y_s) + \frac{f(x_s)}{\sqrt{Q}} - \frac{1}{2 \beta} \| \nabla f(x_s) \|^2 . \end{align*}

Thus we now have to show that

(4)   \begin{equation*}  \Phi_{s+1}^* \geq \left(1 - \frac{1}{\sqrt{Q}}\right) \Phi_s^* + \left(1 - \frac{1}{\sqrt{Q}}\right) \nabla f(x_s)^{\top} (x_s - y_s) + \frac{f(x_s)}{\sqrt{Q}} - \frac{1}{2 \beta} \| \nabla f(x_s) \|^2 . \end{equation*}

To prove this inequality we need a better understanding of the functions \Phi_s. First note that \nabla^2 \Phi_s(x) = \alpha \mathrm{I}_n (immediate by induction) and thus \Phi_s has to be of the following form:

    \[\Phi_s(x) = \Phi_s^* + \frac{\alpha}{2} \|x - v_s\|^2 ,\]

for some v_s \in \mathbb{R}^n. Now observe that by differentiating (1) and using the above form of \Phi_s one obtains

    \[\nabla \Phi_{s+1}(x) = \alpha \left(1 - \frac{1}{\sqrt{Q}}\right) (x-v_s) + \frac{1}{\sqrt{Q}} \nabla f(x_s) + \frac{\alpha}{\sqrt{Q}} (x-x_s) .\]

In particular \Phi_{s+1} is minimized at a point v_{s+1}, which can now be defined by induction using the above identity; precisely:

(5)   \begin{equation*}  v_{s+1} = \left(1 - \frac{1}{\sqrt{Q}}\right) v_s + \frac{1}{\sqrt{Q}} x_s - \frac{1}{\alpha \sqrt{Q}} \nabla f(x_s) . \end{equation*}

Using the form of \Phi_s and \Phi_{s+1}, as well as the original definition (1), one gets the following identity by evaluating \Phi_{s+1} at x_s:

(6)   \begin{equation*}  \Phi_{s+1}^* + \frac{\alpha}{2} \|x_s - v_{s+1}\|^2 = \left(1 - \frac{1}{\sqrt{Q}}\right) \Phi_s^* + \frac{\alpha}{2} \left(1 - \frac{1}{\sqrt{Q}}\right) \|x_s - v_s\|^2 + \frac{1}{\sqrt{Q}} f(x_s) . \end{equation*}

Note that thanks to (5) one has

    \begin{eqnarray*} \|x_s - v_{s+1}\|^2   & = & \left(1 - \frac{1}{\sqrt{Q}}\right)^2 \|x_s - v_s\|^2 + \frac{1}{\alpha^2 Q} \|\nabla f(x_s)\|^2 \\ & & - \frac{2}{\alpha \sqrt{Q}} \left(1 - \frac{1}{\sqrt{Q}}\right) \nabla f(x_s)^{\top}(v_s-x_s) , \end{eqnarray*}

which combined with (6) yields

    \begin{eqnarray*} \Phi_{s+1}^* & = & \left(1 - \frac{1}{\sqrt{Q}}\right) \Phi_s^* + \frac{1}{\sqrt{Q}} f(x_s) + \frac{\alpha}{2 \sqrt{Q}} \left(1 - \frac{1}{\sqrt{Q}}\right) \|x_s - v_s\|^2 \\ & & \qquad - \frac{1}{2 \beta} \| \nabla f(x_s) \|^2 + \frac{1}{\sqrt{Q}} \left(1 - \frac{1}{\sqrt{Q}}\right) \nabla f(x_s)^{\top}(v_s-x_s) . \end{eqnarray*}

Finally we show by induction that v_s - x_s = \sqrt{Q}(x_s - y_s), which concludes the proof of (4) and thus also concludes the proof of the theorem:

    \begin{eqnarray*} v_{s+1} - x_{s+1} & = & \left(1 - \frac{1}{\sqrt{Q}}\right) v_s + \frac{1}{\sqrt{Q}} x_s - \frac{1}{\alpha \sqrt{Q}} \nabla f(x_s) - x_{s+1} \\ & = & \sqrt{Q} x_s - (\sqrt{Q}-1) y_s - \frac{\sqrt{Q}}{\beta} \nabla f(x_s) - x_{s+1} \\ & = & \sqrt{Q} y_{s+1} - (\sqrt{Q}-1) y_s - x_{s+1} \\ & = & \sqrt{Q} (x_{s+1} - y_{s+1}) , \end{eqnarray*}

where the first equality comes from (5), the second from the induction hypothesis, the third from the definition of y_{s+1} and the last one from the definition of x_{s+1}.
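For readers who like to sanity-check such proofs numerically, here is an illustrative script (again on a toy quadratic of my own choosing, not from the post) that tracks v_s through recursion (5) and \Phi_s^* through the recursion obtained by combining (5) and (6), and verifies the two invariants used above: inequality (3) and v_s - x_s = \sqrt{Q}(x_s - y_s).

```python
import numpy as np

# Toy instance: f(x) = 0.5 x^T A x with A diagonal, alpha = 1, beta = 25.
alpha, beta = 1.0, 25.0
sq = np.sqrt(beta / alpha)                 # sqrt(Q)
A = np.diag(np.linspace(alpha, beta, 10))
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

rng = np.random.default_rng(1)
x = rng.standard_normal(10)                # x_1
y = x.copy()                               # y_1 = x_1
v = x.copy()                               # Phi_1 is minimized at x_1, so v_1 = x_1
phi_star = f(x)                            # Phi_1^* = f(x_1)

for s in range(50):
    g = grad(x)
    # inequality (3): f(y_s) <= Phi_s^*  (small slack for floating point)
    assert f(y) <= phi_star + 1e-9
    # invariant from the last induction: v_s - x_s = sqrt(Q) (x_s - y_s)
    assert np.allclose(v - x, sq * (x - y))
    # Phi_{s+1}^* via the recursion obtained by combining (5) and (6)
    phi_star = ((1 - 1/sq) * phi_star + f(x)/sq
                + alpha/(2*sq) * (1 - 1/sq) * np.dot(x - v, x - v)
                - np.dot(g, g)/(2*beta)
                + (1/sq) * (1 - 1/sq) * np.dot(g, v - x))
    # v_{s+1} via (5)
    v = (1 - 1/sq) * v + x/sq - g/(alpha*sq)
    # the algorithm's updates for y_{s+1} and x_{s+1}
    y_next = x - g/beta
    m = (sq - 1) / (sq + 1)
    x = (1 + m) * y_next - m * y
    y = y_next

print("invariants verified")
```

The two assertions encode exactly the induction hypotheses of the proof, so their holding for every s is a (weak but reassuring) numerical confirmation of the argument.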

