In this post we are interested in minimizing a convex function $f$ over a compact convex set $\mathcal{X}$ under the assumption that $f$ is $L$-Lipschitz on $\mathcal{X}$ with respect to some arbitrary norm $\|\cdot\|$ (precisely we assume that $\|g\|_* \leq L$ for all $x \in \mathcal{X}$ and $g \in \partial f(x)$). As we have seen in this post, using the correct norm to measure the variation of a function can make an enormous difference. We now describe a method invented by Nemirovski and Yudin to deal with this situation.
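As a reminder (standard, and not part of the assumption above): $\|\cdot\|_*$ denotes the dual norm, defined by

$$\|g\|_* = \sup_{x : \|x\| \leq 1} g^{\top} x.$$

For instance, Lipschitzness w.r.t. $\|\cdot\|_1$ only requires the subgradients to be bounded in $\|\cdot\|_{\infty}$, which can be much weaker than a bound in $\|\cdot\|_2$.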
Let us for a moment abstract the situation a little bit and forget that we are trying to do optimization in finite dimension. We already observed that Projected Subgradient Descent works in an arbitrary Hilbert space $\mathcal{H}$ (think of $\mathcal{H} = \mathbb{R}^n$). However we are now interested in the more general situation where we are optimizing the function in some Banach space $\mathcal{B}$ (for example $\mathcal{B} = \ell_1$). In that case the Gradient Descent strategy does not even make sense: indeed the gradients (more formally the Fréchet derivatives) $\nabla f(x)$ are elements of the dual space $\mathcal{B}^*$ and thus one cannot perform the computation $x - \eta \nabla f(x)$ (it simply does not make sense). We did not have this problem for optimization in a Hilbert space $\mathcal{H}$ since by the Riesz representation theorem $\mathcal{H}^*$ is isometric to $\mathcal{H}$. The great insight of Nemirovski and Yudin is that one can still do a gradient descent by first mapping the point $x \in \mathcal{B}$ into the dual space $\mathcal{B}^*$, then performing the gradient update in the dual space, and finally mapping the resulting point back to the primal space $\mathcal{B}$. Of course the new point in the primal space might lie outside of the constraint set $\mathcal{X} \subset \mathcal{B}$ and thus we need a way to project it back onto the constraint set $\mathcal{X}$. The rest of this post will make this idea more precise.
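In symbols, one step of the scheme will look as follows (anticipating the notation introduced below, where $\Phi$ denotes the mirror map and $g_t \in \partial f(x_t)$ a subgradient):

$$x_t \ \xrightarrow{\ \nabla \Phi\ }\ \nabla \Phi(x_t) \ \longmapsto\ \nabla \Phi(x_t) - \eta g_t \ \xrightarrow{\ (\nabla \Phi)^{-1}\ }\ y_{t+1} \ \xrightarrow{\ \text{projection}\ }\ x_{t+1}.$$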
Preliminaries
We consider a non-empty convex open set $\mathcal{D} \subset \mathbb{R}^n$ such that the constraint set $\mathcal{X}$ is included in its closure, that is $\mathcal{X} \subset \overline{\mathcal{D}}$, and such that $\mathcal{X} \cap \mathcal{D} \neq \emptyset$.
Now we consider a continuously differentiable and strictly convex function $\Phi$ defined on $\mathcal{D}$. This function $\Phi$ will play the role of the mirror map between the primal and dual space. Precisely a point $x \in \mathcal{X} \cap \mathcal{D}$ in the primal is mapped to $\nabla \Phi(x)$ in the 'dual' (note that here all points lie in $\mathbb{R}^n$ and the notions of 'primal' and 'dual' space only have an intuitive meaning). In other words the gradient update of Mirror Descent with the mirror map $\Phi$ will take the form $\nabla \Phi(x) - \eta \nabla f(x)$. Next this dual point needs to be mapped back to the primal space, which is always possible under the condition that $\nabla \Phi(\mathcal{D}) = \mathbb{R}^n$ (weaker assumptions are possible but we will not need them), that is one can find $y \in \mathcal{D}$ such that

$$\nabla \Phi(y) = \nabla \Phi(x) - \eta \nabla f(x).$$
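As a quick sanity check (a standard special case, not spelled out above): taking $\Phi(x) = \frac{1}{2} \|x\|_2^2$ on $\mathcal{D} = \mathbb{R}^n$ gives $\nabla \Phi = \mathrm{id}$, and the equation above becomes

$$y = x - \eta \nabla f(x),$$

so the dual-space update reduces to the usual gradient step.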
Now this point might lie outside of the constraint set $\mathcal{X}$. To project back onto the constraint set we use the Bregman divergence associated to $\Phi$, defined by

$$D_{\Phi}(x, y) = \Phi(x) - \Phi(y) - \nabla \Phi(y)^{\top} (x - y).$$

The projection of $y$ onto $\mathcal{X}$ then takes the following form:

$$\Pi_{\mathcal{X}}^{\Phi}(y) = \operatorname*{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} D_{\Phi}(x, y).$$
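For a concrete illustration (a standard example, previewing the 'standard setups' mentioned at the end of this post): for the entropy map $\Phi(x) = \sum_{i=1}^n x_i \log x_i$ on $\mathcal{D} = (0, +\infty)^n$, the Bregman divergence is the generalized Kullback-Leibler divergence

$$D_{\Phi}(x, y) = \sum_{i=1}^n x_i \log \frac{x_i}{y_i} - \sum_{i=1}^n x_i + \sum_{i=1}^n y_i,$$

and when $\mathcal{X}$ is the simplex the Bregman projection is simply a renormalization, $\Pi_{\mathcal{X}}^{\Phi}(y) = y / \sum_{i=1}^n y_i$.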
To ensure the existence of this projection we assume now that $\Phi$ satisfies

$$\lim_{x \to \partial \mathcal{D}} \|\nabla \Phi(x)\| = +\infty.$$

With this assumption $D_{\Phi}(\cdot, y)$ is locally increasing on the boundary of $\mathcal{X} \cap \mathcal{D}$, which together with the fact that $\mathcal{X}$ is compact and $\Phi$ is strictly convex implies the existence and uniqueness of the projection defined above.
We say that $\Phi$ is a mirror map if it satisfies all the above assumptions.
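One can check (a quick sketch, for the entropy example above) that these assumptions indeed hold there: $\nabla \Phi(x) = (1 + \log x_i)_{i=1}^n$ maps $\mathcal{D} = (0, +\infty)^n$ onto all of $\mathbb{R}^n$, and $\|\nabla \Phi(x)\| \to +\infty$ as some coordinate $x_i \to 0$, that is as $x$ approaches $\partial \mathcal{D}$.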
Mirror Descent
We can now describe the Mirror Descent strategy based on a mirror map $\Phi$. Let $x_1 \in \operatorname*{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} \Phi(x)$. Then for $t \geq 1$, let $y_{t+1} \in \mathcal{D}$ be such that

$$\nabla \Phi(y_{t+1}) = \nabla \Phi(x_t) - \eta g_t, \quad \text{where } g_t \in \partial f(x_t),$$

and

$$x_{t+1} \in \Pi_{\mathcal{X}}^{\Phi}(y_{t+1}).$$
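To make the two steps concrete, here is a minimal sketch in Python for the entropy map on the simplex (the function name and the test objective are mine; the constants $\rho = 1$ and $R^2 = \log n$ for this setup are justified in the remark after the theorem below):

    import numpy as np

    def mirror_descent_simplex(subgradient, n, L, T):
        # Mirror Descent with the entropy map Phi(x) = sum_i x_i log x_i on the simplex.
        # Here grad Phi(x)_i = 1 + log x_i, so the dual update
        # grad Phi(y_{t+1}) = grad Phi(x_t) - eta * g_t becomes multiplicative,
        # and the Bregman (KL) projection onto the simplex is a renormalization.
        rho, R = 1.0, np.sqrt(np.log(n))      # entropy is 1-strongly convex w.r.t. ||.||_1
        eta = (R / L) * np.sqrt(2 * rho / T)  # step size from the theorem below
        x = np.full(n, 1.0 / n)               # x_1 minimizes Phi over the simplex
        iterates = [x]
        for _ in range(T - 1):
            g = subgradient(x)                # g_t in the subdifferential of f at x_t
            y = x * np.exp(-eta * g)          # dual gradient step, mapped back to the primal
            x = y / y.sum()                   # Bregman projection onto the simplex
            iterates.append(x)
        return np.mean(iterates, axis=0)      # the theorem bounds the value at this average

    # Usage: minimize f(x) = c^T x over the simplex; f is L-Lipschitz
    # w.r.t. ||.||_1 with L = ||c||_inf.
    c = np.array([0.3, 0.7, 0.1])
    x_avg = mirror_descent_simplex(lambda x: c, n=3, L=np.abs(c).max(), T=1000)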
Theorem. Let $\Phi$ be a mirror map. Assume also that $\Phi$ is $\rho$-strongly convex on $\mathcal{X} \cap \mathcal{D}$ with respect to $\|\cdot\|$, that is

$$\Phi(x) - \Phi(y) \leq \nabla \Phi(x)^{\top} (x - y) - \frac{\rho}{2} \|x - y\|^2, \quad \forall x, y \in \mathcal{X} \cap \mathcal{D}.$$

Let $R^2 = \sup_{x \in \mathcal{X} \cap \mathcal{D}} \Phi(x) - \Phi(x_1)$, and let $f$ be convex and $L$-Lipschitz w.r.t. $\|\cdot\|$. Then Mirror Descent with $\eta = \frac{R}{L} \sqrt{\frac{2 \rho}{t}}$ satisfies

$$f\left(\frac{1}{t} \sum_{s=1}^{t} x_s\right) - \min_{x \in \mathcal{X}} f(x) \leq R L \sqrt{\frac{2}{\rho t}}.$$
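To make the bound concrete (a standard instantiation, anticipating the 'standard setups' of the next post): for the entropy map on the simplex one can take $\rho = 1$ w.r.t. $\|\cdot\|_1$ (by Pinsker's inequality) and $R^2 = \log n$, so that Mirror Descent attains

$$f\left(\frac{1}{t} \sum_{s=1}^{t} x_s\right) - \min_{x \in \mathcal{X}} f(x) \leq L \sqrt{\frac{2 \log n}{t}},$$

where $L$ bounds the subgradients in $\|\cdot\|_{\infty}$; note that the dependence on the dimension is only logarithmic.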
Proof: Let $x \in \mathcal{X} \cap \mathcal{D}$. The claimed bound will be obtained by taking a limit $x \to x^*$. Now by convexity of $f$, the definition of Mirror Descent, and the definition of the Bregman divergence one has

$$f(x_s) - f(x) \leq g_s^{\top} (x_s - x) = \frac{1}{\eta} \left( \nabla \Phi(x_s) - \nabla \Phi(y_{s+1}) \right)^{\top} (x_s - x) = \frac{1}{\eta} \left( D_{\Phi}(x, x_s) + D_{\Phi}(x_s, y_{s+1}) - D_{\Phi}(x, y_{s+1}) \right).$$
Now the key observation is that by writing down the first order optimality condition for $x_{s+1}$ one obtains

$$\left( \nabla \Phi(x_{s+1}) - \nabla \Phi(y_{s+1}) \right)^{\top} (x_{s+1} - x) \leq 0,$$

which can be written equivalently as

$$D_{\Phi}(x, y_{s+1}) \geq D_{\Phi}(x, x_{s+1}) + D_{\Phi}(x_{s+1}, y_{s+1}).$$
Thus the term $D_{\Phi}(x, x_s) - D_{\Phi}(x, x_{s+1})$ will lead to a telescopic sum when summing over $s = 1$ to $s = t$, and it remains to bound the other term as follows:

$$\begin{aligned} D_{\Phi}(x_s, y_{s+1}) - D_{\Phi}(x_{s+1}, y_{s+1}) &= \Phi(x_s) - \Phi(x_{s+1}) - \nabla \Phi(y_{s+1})^{\top} (x_s - x_{s+1}) \\ &\leq \left( \nabla \Phi(x_s) - \nabla \Phi(y_{s+1}) \right)^{\top} (x_s - x_{s+1}) - \frac{\rho}{2} \|x_s - x_{s+1}\|^2 \\ &= \eta g_s^{\top} (x_s - x_{s+1}) - \frac{\rho}{2} \|x_s - x_{s+1}\|^2 \\ &\leq \eta L \|x_s - x_{s+1}\| - \frac{\rho}{2} \|x_s - x_{s+1}\|^2 \\ &\leq \frac{(\eta L)^2}{2 \rho}, \end{aligned}$$

where the first inequality uses the $\rho$-strong convexity of $\Phi$, the second uses $\|g_s\|_* \leq L$ together with Hölder's inequality, and the last step uses $a z - \frac{\rho}{2} z^2 \leq \frac{a^2}{2 \rho}$.
We proved

$$\sum_{s=1}^{t} \left( f(x_s) - f(x) \right) \leq \frac{D_{\Phi}(x, x_1)}{\eta} + \eta \frac{L^2 t}{2 \rho},$$

which concludes the proof up to trivial computation.
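For completeness, here is the omitted computation (it only uses facts already established above): by first order optimality of $x_1$ one has $\nabla \Phi(x_1)^{\top} (x - x_1) \geq 0$, hence $D_{\Phi}(x, x_1) \leq \Phi(x) - \Phi(x_1) \leq R^2$. Dividing by $t$, using convexity of $f$, and plugging in $\eta = \frac{R}{L} \sqrt{\frac{2 \rho}{t}}$ yields

$$f\left(\frac{1}{t} \sum_{s=1}^{t} x_s\right) - f(x) \leq \frac{R^2}{\eta t} + \frac{\eta L^2}{2 \rho} = R L \sqrt{\frac{2}{\rho t}}.$$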
Mirror Descent as a proximal method
One can rewrite Mirror Descent as follows:

$$\begin{aligned} x_{t+1} &\in \operatorname*{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} D_{\Phi}(x, y_{t+1}) \\ &= \operatorname*{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} \ \Phi(x) - \nabla \Phi(y_{t+1})^{\top} x \\ &= \operatorname*{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} \ \Phi(x) - \left( \nabla \Phi(x_t) - \eta g_t \right)^{\top} x \\ &= \operatorname*{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} \ \eta g_t^{\top} x + D_{\Phi}(x, x_t). \end{aligned}$$
Nowadays this last expression is usually taken as the definition of Mirror Descent (see this paper by Beck and Teboulle). It gives a proximal point of view on Mirror Descent: the algorithm is trying to minimize the local linearization of the function while not moving too far away from the previous point (with distances measured via the Bregman divergence of the mirror map).
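As a sanity check (standard, and consistent with this viewpoint): with $\Phi(x) = \frac{1}{2} \|x\|_2^2$ one has $D_{\Phi}(x, y) = \frac{1}{2} \|x - y\|_2^2$, and the proximal formulation becomes

$$x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \ \eta g_t^{\top} x + \frac{1}{2} \|x - x_t\|_2^2 = \Pi_{\mathcal{X}} \left( x_t - \eta g_t \right),$$

which is exactly Projected Subgradient Descent.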
Next time we will see some of the ‘standard setups’ for Mirror Descent, that is pairs of (constraint set, mirror map) that ‘go well together’.
By etali March 25, 2015 - 8:19 am
I have a question about y_{t+1}: how does one compute y_{t+1}?
By EricSong April 9, 2015 - 9:19 am
By solving the equation for y_{t+1}: $\nabla \Phi(y_{t+1}) = \nabla \Phi(x_t) - \eta g_t$ with $g_t \in \partial f(x_t)$... this is an operator equation...
By Xiaolong February 23, 2015 - 3:36 pm
In (Beck & Teboulle), the authors say that $t_k \propto 1/\sqrt{k}$ gives a good step size that yields efficiency of $O(1/\sqrt{k})$. Their argument was based on Thm. 4.2 in their paper. But it seems that the optimal step size is constant, with the maximal number of iterations known a priori. This is also consistent (as far as I can see) with your derivation of the theorem in this blog. So in practice, does $t_k \propto 1/\sqrt{k}$ give a good efficiency? If so, how to prove it? Thanks,
By CJ April 18, 2013 - 7:37 pm
Is the eta in the theorem correct? It doesn’t agree with the theorem statement for projected sub-gradient descent. I think the L/R should be R/L.
By Sebastien Bubeck April 19, 2013 - 8:55 am
Yup that’s a typo, thanks a lot for spotting it!