I will describe here the very first (to my knowledge) acceleration algorithm for smooth convex optimization, which is due to Arkadi Nemirovski (dating back to the end of the 70's). The algorithm relies on a 2-dimensional plane-search subroutine (which, in theory, can be implemented in $O(\log(1/\epsilon))$ calls to a first-order oracle). He later improved it to only require a 1-dimensional line-search in 1981, but of course the breakthrough that everyone knows about came a year after with the famous 1982 paper by Nesterov that gets rid of this extraneous logarithmic term altogether (and in addition is based on the deep insight of modifying Polyak's momentum).
Let $f$ be a $1$-smooth convex function and let $x^*$ be a minimizer of $f$. Denote $x^+ := x - \nabla f(x)$ for the point obtained by one step of gradient descent from $x$. Fix a sequence $(\lambda_t)_{t \geq 1}$ of positive weights, to be optimized later. Starting from a point $x_1$, we consider the "conjugate" point $v_t := x_1 - \sum_{s=1}^{t} \lambda_s \nabla f(x_s)$. The algorithm simply returns the optimal combination of the conjugate point and the gradient descent point, that is:

$$x_{t+1} = \operatorname{argmin}_{x \in P_t} f(x), \qquad \text{where } P_t := \operatorname{span}(x_t^+, v_t).$$
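To make the scheme concrete, here is a minimal sketch in Python (my addition, not from Nemirovski's paper). It assumes value and gradient oracles f and grad_f for a 1-smooth convex function, uses the weight choice $\lambda_t = t/4$ discussed below, and approximates the exact plane-search over $P_t$ with a generic two-variable numerical minimizer (scipy.optimize.minimize), which is only a stand-in for the $O(\log(1/\epsilon))$-oracle-call subroutine.

```python
import numpy as np
from scipy.optimize import minimize

def nemirovski_acceleration(f, grad_f, x1, T):
    """Sketch of the plane-search acceleration described above, for a 1-smooth convex f.

    f, grad_f : value and gradient oracles.
    x1        : starting point (1-d numpy array).
    T         : number of iterations.
    """
    x = np.asarray(x1, dtype=float).copy()
    v = x.copy()            # conjugate point v_t = x_1 - sum_{s<=t} lambda_s * g_s
    for t in range(1, T + 1):
        g = grad_f(x)
        lam = t / 4.0       # satisfies 2 * (lam_t^2 - lam_{t-1}^2) <= lam_t
        v = v - lam * g     # update the conjugate point
        x_plus = x - g      # gradient descent step (step size 1 since f is 1-smooth)
        # Plane-search over P_t = span(x_t^+, v_t): minimize f(a * x_plus + b * v)
        # over the two coefficients (a, b); a generic solver stands in for the
        # exact 2-dimensional search.
        obj = lambda c: f(c[0] * x_plus + c[1] * v)
        res = minimize(obj, x0=np.array([1.0, 0.0]), method="Nelder-Mead")
        x = res.x[0] * x_plus + res.x[1] * v
    return x
```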
Let us denote $\delta_t := f(x_t) - f(x^*)$ and $g_t := \nabla f(x_t)$ for shorthand. The key point is that $\nabla f(x_{t+1})$ is orthogonal to $P_t$ (by first-order optimality of the plane-search), and in particular $g_{t+1} \cdot (v_t - x_{t+1}) = 0$. Now recognize that $\frac{1}{2} \|g_t\|^2$ is a lower bound on the improvement $\delta_t - \delta_{t+1}$ (here we use that $x_{t+1}$ is better than $x_t^+$, which belongs to $P_t$). Thus we get:

$$\|v_t - x^*\|^2 \;\leq\; \|v_{t-1} - x^*\|^2 - 2\lambda_t\, \delta_t + 2\lambda_t^2\, (\delta_t - \delta_{t+1}) \,.$$
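For completeness, here is the algebra behind this display (my reconstruction from the ingredients above: the update $v_t = v_{t-1} - \lambda_t g_t$, the orthogonality from the plane-search at the previous step, convexity, and $1$-smoothness):

\begin{align*}
\|v_t - x^*\|^2 &= \|v_{t-1} - x^*\|^2 - 2\lambda_t\, g_t \cdot (v_{t-1} - x^*) + \lambda_t^2 \|g_t\|^2 \\
&= \|v_{t-1} - x^*\|^2 - 2\lambda_t\, g_t \cdot (x_t - x^*) + \lambda_t^2 \|g_t\|^2 && \text{(orthogonality: } g_t \cdot (v_{t-1} - x_t) = 0\text{)} \\
&\leq \|v_{t-1} - x^*\|^2 - 2\lambda_t\, \delta_t + \lambda_t^2 \|g_t\|^2 && \text{(convexity: } g_t \cdot (x_t - x^*) \geq \delta_t\text{)} \\
&\leq \|v_{t-1} - x^*\|^2 - 2\lambda_t\, \delta_t + 2\lambda_t^2\, (\delta_t - \delta_{t+1}) && \text{(}\tfrac{1}{2}\|g_t\|^2 \leq \delta_t - \delta_{t+1}\text{).}
\end{align*}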
In other words, if the sequence is chosen such that $2(\lambda_t^2 - \lambda_{t-1}^2) \leq \lambda_t$ (with $\lambda_0 := 0$), then summing the above inequality from $t=1$ to $T$ and performing an Abel summation on the last term (using also that the $\delta_t$ are nonnegative), we get:

$$\sum_{t=1}^{T} \lambda_t\, \delta_t \;\leq\; \|x_1 - x^*\|^2 \,.$$
This is good because roughly the reverse inequality also holds true (using the fact that the algorithm only makes progress, $f(x_{t+1}) \leq f(x_t^+) \leq f(x_t)$, so the $\delta_t$ are non-increasing):

$$\sum_{t=1}^{T} \lambda_t\, \delta_t \;\geq\; \Big(\sum_{t=1}^{T} \lambda_t\Big)\, \delta_T \,.$$
So finally we get $\big(\sum_{s=1}^{T} \lambda_s\big)\, \delta_T \leq \sum_{s=1}^{T} \lambda_s\, \delta_s \leq \|x_1 - x^*\|^2$, and it just remains to realize that $\sum_{s=1}^{T} \lambda_s$ is of order $T^2$ (e.g., $\lambda_s = s/4$ satisfies the condition above) so that $\delta_T = O\!\big(\|x_1 - x^*\|^2 / T^2\big)$.
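As a quick numerical sanity check (my addition, not part of the original argument), one can run the sketch given earlier on a simple $1$-smooth convex quadratic and watch the gap decay; this assumes the hypothetical nemirovski_acceleration function defined above.

```python
import numpy as np

# f(x) = 0.5 * (x - b)' A (x - b) with eigenvalues of A in [0, 1]:
# f is 1-smooth, convex, and its minimum value is 0.
rng = np.random.default_rng(0)
d = 200
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = Q @ np.diag(np.linspace(0.0, 1.0, d)) @ Q.T
b = rng.standard_normal(d)

f = lambda x: 0.5 * (x - b) @ A @ (x - b)
grad_f = lambda x: A @ (x - b)
x1 = np.zeros(d)

for T in (10, 20, 40, 80):
    xT = nemirovski_acceleration(f, grad_f, x1, T)
    print(T, f(xT))   # expect at least the ~1/T^2 decay predicted above
```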
By N June 19, 2019 - 10:32 pm
Hi Sebastien,
Do you have a source for Nemirovski’s results? I was not able to find either the late 70’s result, or the 1981 improvement you mention, on his webpages.
By Sebastien Bubeck June 20, 2019 - 12:23 pm
Here is the Russian paper (or part of it at least): https://blogs.princeton.edu/imabandit/wp-content/uploads/sites/122/2019/06/Nemirovski81_Russian.pdf and a discussion of it in English (by Nemirovski): https://blogs.princeton.edu/imabandit/wp-content/uploads/sites/122/2019/06/Nemirovski81_EnglishDiscussion.pdf
By Yichi June 18, 2019 - 12:19 pm
Hi Sebastien, thanks for sharing that!
It seems that Nemirovski's acceleration does not need to know in advance how smooth the function f is. But we do need to know it as a hyperparameter if we are using Nesterov's acceleration. So the question is whether it is possible to get an accelerated gradient descent algorithm without line search when we are not given knowledge of the smoothness parameter?
By Anonymous June 20, 2019 - 12:27 pm
With a line search, there are many variants of Nesterov's method that also do this, such as this paper by Seb: https://arxiv.org/pdf/1506.08187.pdf
Without a line search, I suspect it is not possible?
By boojum January 11, 2019 - 3:09 pm
In the statement, where you write $x_{t+1} =_{x \in P_t} f(x)$ where $P_t$ is the span, is there a missing \min?
By Sebastien Bubeck January 11, 2019 - 3:56 pm
Yes “argmin” was missing, thanks!
By Sohail Bahmani January 9, 2019 - 2:39 pm
Hi Sebastien,
Just a typo: in the very last sentence the $g_s$ in the sum should be $\delta_s$.
By Sebastien Bubeck January 9, 2019 - 5:42 pm
Thanks Sohail, fixed!