In this lecture we consider the unconstrained minimization of a function $f$ that satisfies the following requirements (a simple example is given right after the list):
- (i) $f$ admits a minimizer $x^*$ on $\mathbb{R}^n$ such that $\|x^*\|_2 \leq R$.
- (ii) $f$ is convex on $\mathbb{R}^n$. Recall that in particular this implies $f(x) - f(y) \leq g^{\top}(x - y)$ for any $g \in \partial f(x)$ and any $y \in \mathbb{R}^n$.
- (iii) $f$ is $L$-Lipschitz on the $\ell_2$-ball of radius $R$, that is for any $x$ such that $\|x\|_2 \leq R$ and any $g \in \partial f(x)$, one has $\|g\|_2 \leq L$.
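For instance, a simple function satisfying (i)-(ii)-(iii) with $L = 1$ is the distance to a fixed point inside the ball:
$$f(x) = \|x - x_0\|_2, \qquad \|x_0\|_2 \leq R .$$
This $f$ is convex, its minimizer is $x^* = x_0$ (so (i) holds), and every subgradient is either $(x - x_0)/\|x - x_0\|_2$ or, at $x_0$ itself, any vector of the unit ball; in all cases $\|g\|_2 \leq 1$.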
Subgradient Descent
The simplest strategy to minimize a differentiable function is probably the Gradient Descent scheme. It is an iterative algorithm that starts at some initial point $x_1$, and then iterates the following equation:
$$x_{t+1} = x_t - \eta \nabla f(x_t),$$
where $\eta > 0$ is a step-size fixed beforehand. The idea of this scheme is simply to take small steps that minimize the local first-order Taylor approximation of the function $f$. When the function is convex but not differentiable it seems rather natural to replace gradients by subgradients. This gives the Subgradient Descent strategy.
Here we need to be a bit careful if we want to analyze this scheme under assumptions (i)-(ii)-(iii). Indeed we have a control on the size of the subgradients only in a ball of radius $R$. Furthermore we also know that the minimizer of the function lies in this ball. Thus it makes sense to enforce that if we leave the ball, then we first project the point back onto the ball before taking another gradient step. This gives the (Projected) Subgradient Descent: for $g_t \in \partial f(x_t)$,
$$y_{t+1} = x_t - \eta g_t, \qquad x_{t+1} = \Pi_{B_2(R)}(y_{t+1}),$$
where $\Pi_{B_2(R)}$ denotes the Euclidean projection onto the ball of radius $R$ (which is simply a rescaling of $y_{t+1}$ when $\|y_{t+1}\|_2 > R$).
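The scheme is short enough to state in code. Here is a minimal sketch in Python; the function names and the $\ell_1$ test objective with its subgradient oracle are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def projected_subgradient_descent(subgrad, R, L, t, n):
    """Sketch of (Projected) Subgradient Descent on the Euclidean ball of radius R.

    subgrad(x) must return some g in the subdifferential of f at x.
    Returns the average of the visited points x_1, ..., x_t.
    """
    eta = R / (L * np.sqrt(t))                  # step-size from the theorem below
    x = np.zeros(n)                             # x_1 = 0
    avg = np.zeros(n)
    for _ in range(t):
        avg += x / t
        y = x - eta * subgrad(x)                # subgradient step
        nrm = np.linalg.norm(y)
        x = y if nrm <= R else (R / nrm) * y    # project back onto the ball
    return avg

# Hypothetical test problem: f(x) = ||x - x0||_1, with sign(x - x0) as a subgradient;
# it is sqrt(n)-Lipschitz in the l2 norm and minimized at x0.
if __name__ == "__main__":
    n, R = 50, 1.0
    x0 = np.full(n, 0.1)                        # minimizer, ||x0||_2 <= R
    x_bar = projected_subgradient_descent(
        subgrad=lambda x: np.sign(x - x0), R=R, L=np.sqrt(n), t=10_000, n=n)
    print(np.abs(x_bar - x0).sum())             # should be of order R*L/sqrt(t)
```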
The following elementary result gives a rate of convergence for the subgradient method.
Theorem. Assume that $f$ satisfies (i)-(ii)-(iii), and that $x_1 = 0$. Then the (Projected) Subgradient Descent with $\eta = \frac{R}{L\sqrt{t}}$ satisfies for $t \geq 1$,
$$f\left(\frac{1}{t}\sum_{s=1}^{t} x_s\right) - f(x^*) \leq \frac{R L}{\sqrt{t}} .$$
Proof: Using the definition of subgradients, the definition of the method, and an elementary identity, one obtains
$$f(x_s) - f(x^*) \leq g_s^{\top}(x_s - x^*) = \frac{1}{\eta}(x_s - y_{s+1})^{\top}(x_s - x^*) = \frac{1}{2\eta}\left(\|x_s - x^*\|_2^2 + \|x_s - y_{s+1}\|_2^2 - \|y_{s+1} - x^*\|_2^2\right).$$
Now note that $\|x_s - y_{s+1}\|_2 = \eta \|g_s\|_2 \leq \eta L$, and $\|y_{s+1} - x^*\|_2 \geq \|x_{s+1} - x^*\|_2$ (just draw a picture to convince yourself). Summing the resulting inequality over $s \in \{1, \dots, t\}$, and using that $\|x_1 - x^*\|_2 \leq R$ yields
$$\sum_{s=1}^{t}\left(f(x_s) - f(x^*)\right) \leq \frac{R^2}{2\eta} + \frac{\eta L^2 t}{2}.$$
Plugging in the value of $\eta$ directly gives the statement (recall that by convexity $f\left(\frac{1}{t}\sum_{s=1}^{t} x_s\right) \leq \frac{1}{t}\sum_{s=1}^{t} f(x_s)$).
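For completeness, the 'elementary identity' is just the expansion of a squared norm, and the step-size is chosen to balance the two terms of the last display:
$$2\, a^{\top} b = \|a\|_2^2 + \|b\|_2^2 - \|a - b\|_2^2, \qquad \text{applied with } a = x_s - y_{s+1} = \eta g_s \ \text{ and } \ b = x_s - x^*,$$
and minimizing $\frac{R^2}{2\eta} + \frac{\eta L^2 t}{2}$ over $\eta > 0$ gives precisely $\eta = \frac{R}{L\sqrt{t}}$, for which both terms equal $\frac{RL\sqrt{t}}{2}$, hence the bound $\frac{RL}{\sqrt{t}}$ after dividing by $t$.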
Projected Subgradient Descent
In its general form the Projected Subgradient Descent can solve the constrained minimization problem $\min_{x \in \mathcal{X}} f(x)$, where $\mathcal{X} \subset \mathbb{R}^n$ is a compact convex set. The idea is simply to project back onto the constraint set $\mathcal{X}$ in case we leave it with the gradient step. That is we iterate the following equations: for $g_t \in \partial f(x_t)$,
$$y_{t+1} = x_t - \eta g_t, \qquad x_{t+1} = \Pi_{\mathcal{X}}(y_{t+1}),$$
where $\Pi_{\mathcal{X}}$ denotes the Euclidean projection onto $\mathcal{X}$. The exact same result applies to this more general algorithm, with exactly the same proof. The only 'difficulty' is to prove that in this case too we have $\|y_{t+1} - x^*\|_2 \geq \|x_{t+1} - x^*\|_2$. Again a picture makes this result obvious, and it can be proved formally by simply writing down the first order optimality conditions for the definition of $x_{t+1}$.
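Concretely, the first order optimality condition for $x_{t+1} = \Pi_{\mathcal{X}}(y_{t+1})$, that is for the minimization of $\|y_{t+1} - x\|_2^2$ over $x \in \mathcal{X}$, reads $(x_{t+1} - y_{t+1})^{\top}(x_{t+1} - x) \leq 0$ for all $x \in \mathcal{X}$, and in particular for $x = x^*$:
$$\|y_{t+1} - x^*\|_2^2 = \|y_{t+1} - x_{t+1}\|_2^2 + \|x_{t+1} - x^*\|_2^2 + 2\,(y_{t+1} - x_{t+1})^{\top}(x_{t+1} - x^*) \geq \|x_{t+1} - x^*\|_2^2 .$$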
Later in the course we will see a much more general form of the Projected Subgradient Descent, where the projection is not necessarily done in the $\ell_2$ (Euclidean) norm.
Step-size, decision point
Note that the step-size prescribed by the theorem depends on the number of steps $t$ that we are going to make. It is possible to analyze an anytime version of the algorithm, with time-varying step-size $\eta_s$. For $\eta_s = \frac{R}{L\sqrt{s}}$ one can prove the same rate up to a $\log t$ factor.
The theorem gives two possibilities for the point to be returned by the optimization scheme: either one can return $\mathrm{argmin}_{1 \leq s \leq t} f(x_s)$, which requires access to a $0^{th}$ order oracle, or one can simply return the average of the points visited by the method, that is $\frac{1}{t}\sum_{s=1}^{t} x_s$.
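As an illustration (a sketch under the same assumptions as the earlier code, with $\eta_s = R/(L\sqrt{s})$ taken as the anytime step-size), the loop only changes in two places: the step-size now depends on $s$ rather than on the horizon, and we can track both candidate outputs.

```python
import numpy as np

def anytime_projected_subgradient(f, subgrad, R, L, t, n):
    """Anytime variant: eta_s = R / (L * sqrt(s)), so the step-size does not
    depend on the horizon t. Returns (best visited point, average of visited
    points); the first option needs 0th order access to f, the second does not."""
    x = np.zeros(n)
    avg = np.zeros(n)
    best = x.copy()
    for s in range(1, t + 1):
        avg += x / t
        if f(x) < f(best):
            best = x.copy()
        y = x - (R / (L * np.sqrt(s))) * subgrad(x)   # time-varying step-size
        nrm = np.linalg.norm(y)
        x = y if nrm <= R else (R / nrm) * y          # project back onto the ball
    return best, avg
```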
Oracle complexity, computational complexity
To reach an $\epsilon$-optimal point, one needs to make $\left(\frac{RL}{\epsilon}\right)^2$ calls to the $1^{st}$ order oracle. This result is astonishing, because this number of calls is independent of the dimension $n$! This fact is what makes gradient methods so suitable for large-scale optimization.
In terms of computational complexity, each step of the (Projected) Subgradient Descent (i.e. when the projection is made onto the Euclidean ball) has a linear time complexity! The more general Projected Subgradient Descent might not enjoy such a nice time complexity: it depends on how easy it is to project a point onto the constraint set. Again we will discuss these issues further when we consider the generalized scheme that does not necessarily project in the $\ell_2$ norm.
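The arithmetic behind this count is just the inversion of the rate: imposing $\frac{RL}{\sqrt{t}} \leq \epsilon$ gives
$$t \geq \left(\frac{RL}{\epsilon}\right)^2 .$$
For instance with $R = L = 1$ and $\epsilon = 10^{-2}$ this is $10^4$ oracle calls, whether $n = 10$ or $n = 10^6$.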
Oracle lower bound in high dimension
We will now derive our very first sharp lower bound for an optimization problem. This result was originally discovered by Nemirovski and Yudin, and we follow here the presentation of Nesterov.
Recall from the previous lecture that a black-box optimization procedure is simply a sequence of mappings from 'history' to the next query point. To simplify the argument we restrict our attention to mappings with the property that
(1) $\quad x_{s+1} \in x_1 + \mathrm{span}(g_1, \dots, g_s),$
where $g_s$ is the output of the first order oracle at the point $x_s$ (in particular $g_s \in \partial f(x_s)$). It is relatively straightforward to generalize the lower bound to any black-box procedure but it makes the notation a little bit messier.
Theorem. For any black-box optimization procedure satisfying (1) and for any $t \leq n$, there exists a function $f$ that satisfies (i)-(ii)-(iii) and such that
$$\min_{1 \leq s \leq t} f(x_s) - \min_{x \in B_2(R)} f(x) \geq \frac{R L}{2(1 + \sqrt{t})} .$$
Proof: Without loss of generality we assume that $x_1 = 0$. To avoid confusion we denote here $x(i)$ for the $i^{th}$ coordinate of $x \in \mathbb{R}^n$. Let $e_1, \dots, e_n$ be the canonical basis of $\mathbb{R}^n$.
We consider now the following convex function:
$$f(x) = \gamma \max_{1 \leq i \leq t} x(i) + \frac{\alpha}{2} \|x\|_2^2 .$$
It is easy to see that
$$\partial f(x) = \alpha x + \gamma\, \mathrm{conv}\left(e_i : i \text{ such that } x(i) = \max_{1 \leq j \leq t} x(j)\right).$$
In particular if $\|x\|_2 \leq R$ then for any $g \in \partial f(x)$ one has $\|g\|_2 \leq \alpha R + \gamma$. In other words $f$ is $(\alpha R + \gamma)$-Lipschitz on the $\ell_2$-ball of radius $R$.
Next we describe the first order oracle for this function: when asked for a subgradient at $x$, it returns $\alpha x + \gamma e_i$ where $i$ is the first coordinate that satisfies $x(i) = \max_{1 \leq j \leq t} x(j)$. In particular when asked for a subgradient at $x_1 = 0$ it returns $\gamma e_1$. Thus $x_2$ must lie on the line generated by $e_1$. It is easy to see recursively that in fact $x_s$ must lie in the linear span of $e_1, \dots, e_{s-1}$. In particular for $s \leq t$ we necessarily have $x_s(t) = 0$ and thus $f(x_s) \geq 0$.
It remains to compute the minimal value of $f$. Let $y \in \mathbb{R}^n$ be such that $y(i) = -\frac{\gamma}{\alpha t}$ for $1 \leq i \leq t$ and $y(i) = 0$ for $t+1 \leq i \leq n$. It is clear that $0 \in \partial f(y)$ and thus the minimal value of $f$ is
$$f(y) = -\frac{\gamma^2}{\alpha t} + \frac{\alpha}{2} \cdot t \cdot \frac{\gamma^2}{\alpha^2 t^2} = -\frac{\gamma^2}{2 \alpha t} .$$
Wrapping up, we proved that for any $s \leq t$ one must have
$$f(x_s) - f(y) \geq \frac{\gamma^2}{2 \alpha t} .$$
Taking $\gamma = \frac{L \sqrt{t}}{1 + \sqrt{t}}$ and $\alpha = \frac{L}{R(1 + \sqrt{t})}$ concludes the proof (note in particular that $\|y\|_2 = \frac{\gamma}{\alpha \sqrt{t}} \leq R$ with these parameters).
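For the record, these parameters satisfy both constraints and give exactly the rate of the theorem:
$$\alpha R + \gamma = \frac{L}{1+\sqrt{t}} + \frac{L\sqrt{t}}{1+\sqrt{t}} = L, \qquad \|y\|_2 = \frac{\gamma}{\alpha \sqrt{t}} = R, \qquad \frac{\gamma^2}{2\alpha t} = \frac{R L}{2(1+\sqrt{t})} .$$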
What about lower bounds in finite dimension?
The result of the previous section shows that Subgradient Descent is optimal, as far as dimension-uniform rates are concerned. We already know that for a fixed dimension one can obtain much faster rates of convergence as the number of calls to the oracle $t$ goes to infinity. More precisely recall that the Ellipsoid Method satisfies, with $t$ calls to a first order oracle,
$$f(x_t) - \min_{x \in B_2(R)} f(x) \leq 2 R L \exp\left(-\frac{t}{2 n^2}\right).$$
For $t$ large ($t \gg n^2$) this rate is much better than the one derived for the Subgradient Descent. In this 'large $t$' regime one can prove a lower bound of order $R L \exp(-t/n)$ on the optimization error. One can bridge the gap between the upper bound given by the Ellipsoid Method and this lower bound with an algorithm known as the Method of Centers of Gravity. Unfortunately this method has a very bad computational complexity, which is why it is not so interesting for us. We refer the interested reader to the book by Nesterov for more details on these results in finite dimension.
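In terms of oracle calls the comparison reads as follows: Subgradient Descent needs about $(RL/\epsilon)^2$ calls, while the Ellipsoid Method needs about $n^2 \log(RL/\epsilon)$ calls, so (ignoring logarithmic factors) the latter wins roughly as soon as the target accuracy is small compared to the fixed dimension:
$$\left(\frac{RL}{\epsilon}\right)^2 \gg n^2 \log\frac{RL}{\epsilon} \quad \text{whenever} \quad \epsilon \ll \frac{RL}{n} .$$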
By Juan Ramirez December 1, 2016 - 8:49 pm
Are there lower bounds on the evaluation error when you add to the queries $\phi_t$ the proximals at each iteration for Lipschitz convex functions? I ask because the proximal point method achieves error O(1/t), which is better than the bounds of the model above.
By ZIYE July 25, 2016 - 5:45 pm
Thanks very much for the post. But I don't really understand how you pick the values for \alpha and \gamma. It seems to me that the requirement for those two nonnegative coefficients is just: ||x*|| = \gamma / (\alpha*sqrt(t)) <= R. So can I let \gamma = O(sqrt(t) / (1+t^(1/3))) and \alpha = O(1/(1+t^(1/3))) and get a lower bound O(1/(1+t^(1/3))), which seems to be larger than the subgradient convergence rate.
Am I missing something here? Sorry if I misunderstood the content.
Cheers, Ziye
By ZIYE July 25, 2016 - 5:51 pm
Oh I just realized the setting here is t<=n, so the constant in big 'O' notation actually matters.
By Zhimin May 15, 2014 - 5:21 pm
Thanks for the post. I think there might be something wrong with the proof of the first theorem. It seems like \eta also depends on s. So you cannot do the summation directly, right?
By qd April 14, 2013 - 2:40 pm
great post!
but is this a typo that
||y_{t+1}-x*||<= ||x_{t+1}-x*||?
seems that x_{t+1} is the projection of y_{t+1} onto the convex ball, so it is closer than y_{t+1} to any interior point in X.
Also the proof is a little bit strange to me, it seems you can replace (x^*,f(x^*)) with any feasible point in ball(R), and the argument is still valid! so does that mean no information of optimal solution is applied at all?
By Sebastien Bubeck April 14, 2013 - 2:49 pm
Yes that’s a typo, thanks a lot for spotting it! (I just fixed it.)
Indeed the proof works for any point in the ball, so the upper bound really says f(x_t) – f(x) < … for any x in the ball. But observe that this bound is 'tightest' for x being the minimizer of f on the ball!
By Hein Hundal March 26, 2013 - 3:04 pm
Very Very Nice Post. You might want to check out James Spall’s talk at NIPS titled “Stochastic Search and Optimization” at
http://nips.cc/Conferences/2012/Program/event.php?ID=3131
Cheers, Hein
By Sebastien Bubeck March 26, 2013 - 9:57 pm
I’m glad that you liked the post!
I attended Spall’s tutorial at NIPS. Part III of my course will be about the ‘statistical complexity of optimization’, and I plan to talk about stochastic optimization with noisy 1st order and 0th order oracles. Hopefully I will have time to describe SPSA and prove rates of convergence during this last part!