In this lecture we consider the unconstrained minimization of a function that satisfy the following requirements:

- (i) admits a minimizer on such that .
- (ii) is convex on . Recall that in particular this implies for any .
- (iii) is -Lipschitz on the -ball of radius , that is for any such that and any , one has .

**Subgradient Descent**

The simplest strategy to minimize a differentiable function is probably the Gradient Descent scheme. It is an iterative algorithm that starts at some initial point , and then iterates the following equation:

where is the *step-size* fixed beforehand. The idea of this scheme is simply to do small steps that minimize the local first order Taylor approximation of the function . When the function is convex but not differentiable it seems rather natural to replace gradients by subgradients. This gives the Subgradient Descent strategy.

Here we need to be a bit careful if we want to analyze this scheme under assumptions (i)-(ii)-(iii). Indeed we have a control on the size of the subgradients only in a ball of radius . Furthermore we also know that the minimizer of the function lies in this ball. Thus it makes sense to enforce that if we leave the ball, then we first project back the point to the ball before taking another gradient step. This gives the *(Projected) Subgradient Descent*:

The following elementary result gives a rate of convergence for the subgradient method.

TheoremAssume that satisfies (i)-(ii)-(iii), and that . Then the (Projected) Subgradient Descent with satisfies for ,

*Proof: *Using the definition of subgradients, the definition of the method, and an elementary identity, one obtains

Now note that , and (just do a picture to convince yourself). Summing the resulting inequality over , and using that yield

Plugging in the value of directly gives the statement (recall that by convexity ).

**Projected Subgradient Descent**

In its general form the Project Subgradient Descent can solve the constrained minimization problem where is a compact convex set. The idea is simply to project back on the constraint set in case we leave it with the gradient step. That is we iterate the following equations:

The exact same result applies to this more general algorithm, with exactly the same proof. The only ‘difficulty’ is to prove that in this case too we have . Again a picture makes this result obvious, and it can be proved formally by simply writing down first order optimality conditions for the definition of .

Later in the course we will see a much more general form of the Projected Subgradient Descent, where the projection is not necessarily done in .

**Step-size, decision point**

Note that the step-size prescribed by the theorem depends on the number of steps that we are going to make. It is possible to analyze an *anytime* version of the algorithm, with time-varying step-size . For one can prove the same rate up to a factor.

The theorem gives two possibilities for the point to be returned by the optimization scheme: Either one can return , which requires to have access to a order oracle, or one can simply return the average of the points visited by the method, that is .

**Oracle complexity, computational complexity**

To reach an -optimal point, one needs to make calls to the order oracle. This result is astonishing, because is *independent of the dimension*! This fact is what makes gradient methods so suitable for large-scale optimization.

In terms of computational complexity, each step of the (Projected) Subgradient Descent (i.e. when the projection is made on the Euclidean ball) has a linear time complexity! The more general Projected Subgradient Descent might not enjoy such a nice time complexity: it depends on how easy it is to project a point on the constraint set. Again we will discuss these issues further when we consider the generalized scheme that does not necessarily project in .

**Oracle lower bound in high dimension**

We will now derive our very first *sharp* lower bound for an optimization problem. This result was originally discovered by Nemirovski and Yudin, and we follow here the presentation of Nesterov.

Recall from the previous lecture that a black-box optimization procedure is simply a sequence of mappings from ‘history’ to the next query point. To simplify the argument we restrict our attention to mappings with the property that

(1)

where is the output of the first order oracle at point (in particular ). It is relatively straightforward to generalize the lower bound to *any* black-box procedure but it makes notation a little bit more messy.

TheoremFor any black-box optimization procedure satisfying (1) and for any , there exists a function that satisfies (i)-(ii)-(iii) and such that

*Proof: *Without loss of generality we assume that . To avoid confusion we denote here for the coordinate of . Let be the canonical basis of .

We consider now the following convex function:

It is easy to see that

In particular if then for any one has . In other words is -Lipschitz on the ball of radius .

Next we describe the first order oracle for this function: when asked for a subgradient at , it returns where is the {\em first} coordinate that satisfy . In particular when asked for a subgradient at it returns . Thus must lie on the line generated by . It is easy to see recursively that in fact must lie in the linear span of . In particular for we necessarily have and thus .

It remains to compute the minimal value of . Let be such that for and for . It is clear that and thus the minimal value of is

Wrapping up, we proved that for any one must have

Taking and concludes the proof (note in particular that with these parameters).

**What about lower bounds in finite dimension?**

The result of the previous section shows that Subgradient Descent is optimal, as far as *dimension-uniform* rates are concerned. We already know that for a *fixed* dimension one can obtain much faster rates of convergence as the number of calls to the oracle goes to infinity. More precisely recall that the Ellipsoid Method satisfy, with calls to a first order oracle,

For large () this rate is much better than the one derived for the Subgradient Descent. In this ‘large ‘ regime one can prove a lower bound of on the optimization error. One can bridge the gap between the upper bound given by the Ellipsoid Method and this lower bound with an algorithm known as the Method of centers of gravity. Unfortunately this method has a very bad computational complexity, which is why it is not so interesting for us. We refer the interested reader to the book by Nesterov for more details on these results in finite dimension.

## By Deep Learning Basics II – Theoretical Machine Learning August 2, 2017 - 2:19 pm

[…] The time taken for a full gradient descent over the full network is O(P(V+E)). It is very slow to apply on the entire network, so we use Stochastic Gradient Descent to speed up our computation. Reference to Gradient Descent Proof […]

## By Juan Ramirez December 1, 2016 - 8:49 pm

Are there lower bounds on the evaluation error when you add to the queriesphi_t the proximals at each iteration for Lipschitz convex functions? I ask because the proximal point method achieves error O(1/t); which is better that the bounds of the model above

## By ZIYE July 25, 2016 - 5:45 pm

Thanks very much for the post. But I don't really understand how you pick the values for \alpha and \gama. It seems to me that the requirement for those two nonnegative coefficients is just: ||x*|| = \gama / (\alpha*sqrt(t)) <= R. So can I let \gama = O(sqrt(t) / (1+t^(1/3)) and \alpha = O(1/(1+t^(1/3))) and get a lower bound O(1/(1+t^(1/3))), which seems to be larger than the subgradient convergence rate.

Am I missing something here? Sorry if I misunderstood the content.

Cheers, Ziye

## By ZIYE July 25, 2016 - 5:51 pm

Oh I just realized the setting here is t<=n, so the constant in big 'O' notation actually matters.

## By Zhimin May 15, 2014 - 5:21 pm

Thanks for the post. I think there might be something wrong with the proof of the first theorem. It seems like \eta is also depends on s. So you cannot do the summation directly, right?

## By qd April 14, 2013 - 2:40 pm

great post!

but is this a typo that

||y_{t+1}-x*||<= ||x_{t+1}-x*||?

seems that x_{t+1} is the projection of y_{t+1} onto the convex ball, so it is closer than y_t+1 to any interior point in X.

Also the proof is a little bit strange to me, it seems you can replace (x^*,f(x^*)) with any feasible point in ball(R), and the argument is still valid! so does that mean no information of optimal solution is applied at all?

## By Sebastien Bubeck April 14, 2013 - 2:49 pm

Yes that's a typo, thanks a lot for spotting it! (I just fixed it.)

Indeed the proof works for any point in the ball, so the upper bound really says f(x_t) - f(x) < ... for any x in the ball. But observe that this bound is 'tightest' for x being the minimizer of f on the ball!

## By Hein Hundal March 26, 2013 - 3:04 pm

Very Very Nice Post. You might want to check out James Spall's talk at NIPS titled "Stochastic Search and Optimization" at

http://nips.cc/Conferences/2012/Program/event.php?ID=3131

Cheers, Hein

## By Sebastien Bubeck March 26, 2013 - 9:57 pm

I'm glad that you liked the post!

I attended Spall's tutorial at NIPS. Part III of my course will be about the 'statistical complexity of optimization', and I plan to talk about stochastic optimization with noisy 1st order and 0th order oracles. Hopefully I will have time to describe SPSA and prove rates of convergence during this last part!