In this post we are interested in minimizing a convex function over a compact convex set under the assumption that is -Lipschitz on with respect to some arbitrary norm (precisely we assume that ). As we have seen in this post, using the correct norm to measure the variation of a function can make an enormous difference. We describe now a method invented by Nemirovski and Yudin to deal with this situation.

Let us for a moment abstract the situation a little bit and forget that we are trying to do optimization in finite dimension. We already observed that Projected Subgradient Descent works in an arbitrary Hilbert space (think of ). However we are now interested in the more general situation where we are optimizing the function in some Banach space (for example ). In that case the Gradient Descent strategy does not even make sense: indeed the gradients (more formally the Fréchet derivative) are elements of the dual space and thus one cannot perform the computation (it simply does not make sense). We did not have this problem for optimization in an Hilbert space since by Riesz representation theorem is isometric to . The great insight of Nemirovski and Yudin is that one can still do a gradient descent by first mapping the point into the dual space , then performing the gradient update in the dual space, and finally mapping back the resulting point to the primal space . Of course the new point in the primal space might lie outside of the constraint set and thus we need a way to project back the point on the constraint set . The rest of this lecture will make this idea more precise.

**Preliminaries**

We consider a non-empty convex open set such that the constraint set is included in its closure, that is , and such that .

Now we consider a continuously differentiable and strictly convex function defined on . This function will play the role of the *mirror map* between the primal and dual space. Precisely a point in the primal is mapped to in the ‘dual’ (note that here all points lie in and the notions of ‘primal’ and ‘dual’ space only have an intuitive meaning). In other words the gradient update of Mirror Descent with the mirror map will take the form . Next this dual point needs to be mapped back to the primal space, which is always possible under the condition that (weaker assumptions are possible but we will not need them), that is one can find such that

Now this point might lie outside of the constraint set . To project back on the set of constraint we use the *Bregman divergence* associated to defined by

The projection of onto then takes the following form:

To ensure the existence of this projection we assume now that satisfies

With this assumption is locally increasing on the boundary of , which together with the fact that is compact, and is stricly convex implies the existence and uniqueness of the projection defined above.

We say that is a *mirror map* if it satisfies all the above assumptions.

**Mirror Descent**

We can now describe the Mirror Descent strategy based on a mirror map . Let . Then for , let such that

and

TheoremLet be a mirror map. Assume also that is strongly convex on with respect to , that is

Let and be convex and -Lipschitz w.r.t. , then Mirror Descent with satisfies

*Proof: *Let . The claimed bounded will be obtained by taking a limit . Now by convexity of , the definition of the Mirror Descent, and the definition of the Bregman divergence one has

Now the key observation is that by writing down the first order optimality condition for one obtains

which can be written equivalently as

Thus the term will lead to a telescopic sum when summing over to , and it remains to bound the other term as follows:

We proved

which concludes the proof up to trivial computation.

**Mirror Descent as a proximal method**

One can rewrite Mirror Descent as follows:

Nowadays this last expression is usually taken as the definition of Mirror Descent (see this paper by Beck and Teboulle). It gives a proximal point of view on Mirror Descent: the algorithm is trying to minimize the local linearization of the function while not moving too far away from the previous point (with distances measured via the Bregman divergence of the mirror map).

Next time we will see some of the ‘standard setups’ for Mirror Descent, that is pairs of (constraint set, mirror map) that ‘go well together’.

## By etali March 25, 2015 - 8:19 am

I have a question about y_{t+1}, how to get the y_{t+1}?

## By EricSong April 9, 2015 - 9:19 am

by solving the equation about y_{t+1}。。。 delta phi (y_{t+1} ) = delta phi (x_{t} ) – eta * delta phi (f(x_{t}) ) ..this is a operator equation…

## By Xiaolong February 23, 2015 - 3:36 pm

In (Beck & Teboulle), the author says that $t_k \prop \sqrt{k}^{-1}$ gives a good step size that yields efficiency of $O(1/sqrt{k})$. There argument was based on Thm. 4.2 in there paper. But it seems that the optimal step size is constant with the number of maximal iteration known a prior. This is also consistent (as far as I can see) with you derivation of the theorem in this blog. So in reality, does the $t_k \prop 1/\sqrt{k}$ gives a good efficiency? if so, how to prove? Thanks,

## By CJ April 18, 2013 - 7:37 pm

Is the eta in the theorem correct? It doesn’t agree with the theorem statement for projected sub-gradient descent. I think the L/R should be R/L.

## By Sebastien Bubeck April 19, 2013 - 8:55 am

Yup that’s a typo, thanks a lot for spotting it!