
Nemirovski’s acceleration

I will describe here the very first (to my knowledge) acceleration algorithm for smooth convex optimization, which is due to Arkadi Nemirovski (dating back to the end of the 1970s). The algorithm relies on a 2-dimensional plane-search subroutine (which, in theory, can be implemented with \log(1/\epsilon) calls to a first-order oracle). In 1981 he improved it to require only a 1-dimensional line-search, but of course the breakthrough that everyone knows about came a year later with Nesterov’s famous 1982 paper, which gets rid of this extraneous logarithmic term altogether (and, in addition, is based on the deep insight of modifying Polyak’s momentum).

Let f be a 1-smooth convex function. Denote x^{+} = x - \nabla f(x). Fix a sequence (\lambda_t)_{t \in \mathbb{N}}, to be optimized later, and initialize at x_1 = 0. We consider the “conjugate” point \sum_{s=1}^t \lambda_s \nabla f(x_s). The algorithm simply returns the optimal combination of the conjugate point and the gradient descent point, that is:

    \[ x_{t+1} = \mathrm{argmin}_{x \in P_t} f(x) \quad \text{where} \quad P_t = \mathrm{span}\left(x_t^+, \sum_{s=1}^t \lambda_s \nabla f(x_s)\right) \,. \]
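For concreteness, here is a minimal numerical sketch of this scheme (not from the original papers): the exact 2-dimensional plane search is replaced by a call to scipy.optimize.minimize over the two combination coefficients, the sequence \lambda_s is taken to satisfy the recursion derived below, and the function name nemirovski_acceleration, the quadratic test problem, and the horizon T are illustrative assumptions.

    import numpy as np
    from scipy.optimize import minimize

    def nemirovski_acceleration(f, grad, dim, T):
        """Sketch: x_{t+1} approximately minimizes f over span(x_t^+, conjugate point)."""
        x = np.zeros(dim)                        # x_1 = 0
        lam_prev, conj = 0.0, np.zeros(dim)
        for t in range(1, T + 1):
            g = grad(x)
            # lambda_t is the positive root of lambda_t = lambda_t^2 - lambda_{t-1}^2
            lam = (1.0 + np.sqrt(1.0 + 4.0 * lam_prev ** 2)) / 2.0
            conj = conj + lam * g                # conjugate point: sum_{s <= t} lambda_s g_s
            x_plus = x - g                       # gradient step (step size 1 for a 1-smooth f)
            # 2-dimensional plane search over span(x_plus, conj), done numerically here
            obj = lambda ab: f(ab[0] * x_plus + ab[1] * conj)
            ab = minimize(obj, np.array([1.0, 0.0]), method="Nelder-Mead").x
            x = ab[0] * x_plus + ab[1] * conj
            lam_prev = lam
        return x

    # Toy usage: f(x) = 0.5 x^T A x - b^T x with the spectral norm of A scaled to 1.
    rng = np.random.default_rng(0)
    M = rng.standard_normal((20, 20))
    A = M @ M.T
    A /= np.linalg.norm(A, 2)
    b = rng.standard_normal(20)
    f = lambda x: 0.5 * x @ A @ x - b @ x
    grad = lambda x: A @ x - b
    print(f(nemirovski_acceleration(f, grad, dim=20, T=50)))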

Let us denote g_s = \nabla f(x_s) and \delta_s = f(x_s) - f(x^*) for shorthand. The key point is that g_{t+1} \in P_t^{\perp} (since x_{t+1} minimizes f over the linear subspace P_t), and in particular \|\sum_{s \leq t} \lambda_s g_s\|^2 = \sum_{s \leq t} \lambda_s^2 \|g_s\|^2. Now recognize that \|g_s\|^2 is, up to a factor 2 that we ignore along with other numerical constants, a lower bound on the improvement \delta_s - \delta_{s+1} (by 1-smoothness f(x_s^+) \leq f(x_s) - \frac{1}{2} \|g_s\|^2, and x_{s+1} is at least as good as x_s^+).
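To see the norm identity, expand the square and use that \sum_{s \leq t-1} \lambda_s g_s lies in P_{t-1} while g_t \in P_{t-1}^{\perp}, so the cross term vanishes:

    \[ \left\|\sum_{s \leq t} \lambda_s g_s\right\|^2 = \left\|\sum_{s \leq t-1} \lambda_s g_s\right\|^2 + 2 \lambda_t \, g_t \cdot \sum_{s \leq t-1} \lambda_s g_s + \lambda_t^2 \|g_t\|^2 = \left\|\sum_{s \leq t-1} \lambda_s g_s\right\|^2 + \lambda_t^2 \|g_t\|^2 \,, \]

and the identity follows by induction. Combining these two observations, we thus get: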

    \[ \|\sum_{s \leq t} \lambda_s g_s\|^2 \leq \sum_{s \leq t} \lambda_s^2 (\delta_s - \delta_{s+1}) \leq \sum_{s \leq t} \delta_s (\lambda_s^2 - \lambda_{s-1}^2) \,. \]
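The second inequality above is just summation by parts (with the convention \lambda_0 = 0), dropping a non-positive term:

    \[ \sum_{s \leq t} \lambda_s^2 (\delta_s - \delta_{s+1}) = \sum_{s \leq t} \delta_s (\lambda_s^2 - \lambda_{s-1}^2) - \lambda_t^2 \delta_{t+1} \leq \sum_{s \leq t} \delta_s (\lambda_s^2 - \lambda_{s-1}^2) \,. \]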

In other words, if the sequence \lambda is chosen such that \lambda_s = \lambda_s^2 - \lambda_{s-1}^2 (with \lambda_0 = 0), then we get

    \[ \|\sum_{s \leq t} \lambda_s g_s\|^2 \leq \sum_{s \leq t} \lambda_s \delta_s \,. \]

This is good because roughly the reverse inequality also holds by convexity (and the fact that x_s \in P_{s-1} while g_s \in P_{s-1}^{\perp}, so g_s \cdot x_s = 0; for s=1 this holds since x_1 = 0):

    \[ \sum_{s \leq t} \lambda_s \delta_s \leq \sum_{s \leq t} \lambda_s g_s \cdot (x_s - x^*) \leq \|x^*\| \cdot \| \sum_{s \leq t} \lambda_s g_s\| \,. \]
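To combine this with the earlier bound, write S = \sum_{s \leq t} \lambda_s \delta_s and G = \|\sum_{s \leq t} \lambda_s g_s\| for brevity; then

    \[ G^2 \leq S \leq \|x^*\| \, G \quad \Longrightarrow \quad G \leq \|x^*\| \quad \Longrightarrow \quad S \leq \|x^*\| \, G \leq \|x^*\|^2 \,. \]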

So finally we get \sum_{s \leq t} \lambda_s \delta_s \leq \|x^*\|^2, and it just remains to realize that \lambda_s is of order s while the \delta_s are non-increasing, so that \delta_t \leq \|x^*\|^2 / t^2 up to a numerical constant.
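Indeed, the recursion \lambda_s^2 - \lambda_s - \lambda_{s-1}^2 = 0 (with \lambda_0 = 0) gives \lambda_s = \frac{1 + \sqrt{1 + 4 \lambda_{s-1}^2}}{2} \geq \lambda_{s-1} + \frac{1}{2}, so that \lambda_s \geq s/2 and \sum_{s \leq t} \lambda_s \geq t^2/4. Since the \delta_s are non-increasing (each x_{s+1} is at least as good as the gradient step x_s^+),

    \[ \frac{t^2}{4} \, \delta_t \leq \delta_t \sum_{s \leq t} \lambda_s \leq \sum_{s \leq t} \lambda_s \delta_s \leq \|x^*\|^2 \,, \]

which is the claimed rate, up to the numerical constants we ignored.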
