Today I’m going to talk about a variant of Mirror Descent invented by Nemirovski in 2004 (see this paper). This new algorithm is called Mirror Prox and it is designed to minimize a smooth convex function $f$ over a compact convex set $X \subset \mathbb{R}^n$. We assume that the smoothness is measured in some arbitrary norm $\|\cdot\|$; precisely we assume that
\[ \|\nabla f(x) - \nabla f(y)\|_* \leq \beta \|x - y\|, \quad \forall x, y \in X. \]
Let $\Phi$ be a mirror map on $X$ (see this post for a definition) and let $x_1 \in \mathrm{argmin}_{x \in X \cap D} \Phi(x)$. Mirror Prox is described by the following equations:
\[ \nabla \Phi(y_{t+1}') = \nabla \Phi(x_t) - \eta \nabla f(x_t), \]
\[ y_{t+1} \in \mathrm{argmin}_{x \in X \cap D} D_{\Phi}(x, y_{t+1}'), \]
\[ \nabla \Phi(x_{t+1}') = \nabla \Phi(x_t) - \eta \nabla f(y_{t+1}), \]
\[ x_{t+1} \in \mathrm{argmin}_{x \in X \cap D} D_{\Phi}(x, x_{t+1}'). \]
In words the algorithm first makes a step of Mirror Descent to go from $x_t$ to $y_{t+1}$, and then it makes a similar step to obtain $x_{t+1}$, starting again from $x_t$ but this time using the gradient of $f$ evaluated at $y_{t+1}$ (instead of $x_t$). The following result justifies the procedure.
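To make the two-step update concrete, here is a minimal sketch (not from the original post) of Mirror Prox on the probability simplex with the negative-entropy mirror map: each mirror step is a multiplicative update, and the Bregman (KL) projection back onto the simplex is a renormalization. The toy objective, step size, and iteration count below are illustrative assumptions.

```python
import numpy as np

def mirror_prox_simplex(grad, x0, eta, iters):
    """Mirror Prox on the probability simplex with the negative-entropy
    mirror map: a mirror step is x * exp(-eta * g(x)), and the KL
    projection back onto the simplex is a simple renormalization."""
    x = x0.copy()
    y_sum = np.zeros_like(x)
    for _ in range(iters):
        # First mirror step: from x_t, using the gradient at x_t.
        y = x * np.exp(-eta * grad(x))
        y /= y.sum()
        # Second mirror step: again from x_t, but with the gradient at y_{t+1}.
        x = x * np.exp(-eta * grad(y))
        x /= x.sum()
        y_sum += y
    return y_sum / iters  # the theorem below bounds f at the average of the y's

# Toy objective: f(x) = 0.5 * ||x - c||^2, minimized over the simplex at x = c.
c = np.array([0.2, 0.7, 0.1])
xbar = mirror_prox_simplex(lambda x: x - c, np.ones(3) / 3, eta=0.5, iters=2000)
```

Note how the two exponential updates both start from the same point `x`, mirroring the description above.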
Theorem. Let $\Phi$ be a mirror map $\rho$-strongly convex on $X \cap D$ with respect to $\|\cdot\|$. Let $R^2 = \sup_{x \in X \cap D} \Phi(x) - \Phi(x_1)$, and let $f$ be convex and $\beta$-smooth w.r.t. $\|\cdot\|$. Then Mirror Prox with $\eta = \frac{\rho}{\beta}$ satisfies (with $\bar{y}_t = \frac{1}{t} \sum_{s=1}^{t} y_{s+1}$)
\[ f(\bar{y}_t) - \min_{x \in X} f(x) \leq \frac{\beta R^2}{\rho \, t}. \]
Proof: Let $x \in X \cap D$. We write
\[ f(y_{t+1}) - f(x) \leq \nabla f(y_{t+1})^{\top} (y_{t+1} - x) = \nabla f(y_{t+1})^{\top} (x_{t+1} - x) + \nabla f(x_t)^{\top} (y_{t+1} - x_{t+1}) + (\nabla f(y_{t+1}) - \nabla f(x_t))^{\top} (y_{t+1} - x_{t+1}). \]
We will now bound separately these three terms. For the first one we have (by definition of the method, first-order optimality condition of the projection, and the definition of the Bregman divergence)
\[ \eta \nabla f(y_{t+1})^{\top} (x_{t+1} - x) \leq (\nabla \Phi(x_t) - \nabla \Phi(x_{t+1}))^{\top} (x_{t+1} - x) = D_{\Phi}(x, x_t) - D_{\Phi}(x, x_{t+1}) - D_{\Phi}(x_{t+1}, x_t). \]
For the second term, using the same properties as above and the strong convexity of the mirror map, one has
\[ \eta \nabla f(x_t)^{\top} (y_{t+1} - x_{t+1}) \leq (\nabla \Phi(x_t) - \nabla \Phi(y_{t+1}))^{\top} (y_{t+1} - x_{t+1}) = D_{\Phi}(x_{t+1}, x_t) - D_{\Phi}(x_{t+1}, y_{t+1}) - D_{\Phi}(y_{t+1}, x_t) \leq D_{\Phi}(x_{t+1}, x_t) - \frac{\rho}{2} \|x_{t+1} - y_{t+1}\|^2 - \frac{\rho}{2} \|y_{t+1} - x_t\|^2. \]
Finally for the last term, using Hölder’s inequality, $\beta$-smoothness, and the fact that $2ab \leq a^2 + b^2$, one has
\[ \eta (\nabla f(y_{t+1}) - \nabla f(x_t))^{\top} (y_{t+1} - x_{t+1}) \leq \eta \|\nabla f(y_{t+1}) - \nabla f(x_t)\|_* \, \|y_{t+1} - x_{t+1}\| \leq \eta \beta \|y_{t+1} - x_t\| \, \|y_{t+1} - x_{t+1}\| \leq \frac{\eta \beta}{2} \|y_{t+1} - x_t\|^2 + \frac{\eta \beta}{2} \|y_{t+1} - x_{t+1}\|^2. \]
Thus summing up these three terms and using that $\eta = \frac{\rho}{\beta}$ one gets
\[ \eta \, (f(y_{t+1}) - f(x)) \leq D_{\Phi}(x, x_t) - D_{\Phi}(x, x_{t+1}), \]
and this inequality is also true for $\eta \nabla f(y_{t+1})^{\top}(x_{t+1} - x)$ in place of the left-hand side (see the bound for the first term). The proof is then concluded as usual: summing over the iterations, the right-hand side telescopes to $D_{\Phi}(x, x_1) \leq R^2$, and convexity of $f$ yields the bound at the average point $\bar{y}_t$.
Mirror Prox for Saddle Points
Following the generalization of Mirror Descent for Saddle Points (MD-SP, see this post) one can propose a version of Mirror Prox for Saddle Points (MP-SP). In fact the importance of Mirror Prox lies primarily in its application to computing the saddle point of a smooth convex-concave function. Indeed, the key observation is that many non-smooth convex functions admit a smooth saddle-point representation (see examples below). With such a representation one can then use MP-SP to obtain a rate of convergence of order $1/t$, despite the fact that the original function is non-smooth! Recall that we proved that for non-smooth convex functions the optimal black-box rate is of order $1/\sqrt{t}$. However MP-SP is not a black-box method in the sense that it makes use of the structure of the function (by representing it as a saddle point). Nonetheless MP-SP still enjoys the nice time-complexity properties of black-box methods since it is a first-order method. It is a win-win situation!
Let us now describe more precisely MP-SP. Let $X \subset \mathbb{R}^n$ be a compact and convex set endowed with some norm $\|\cdot\|_X$ and a mirror map $\Phi_X$ (defined on $D_X$), and let $Y \subset \mathbb{R}^m$ be a compact and convex set endowed with some norm $\|\cdot\|_Y$ and a mirror map $\Phi_Y$ (defined on $D_Y$). Let $\varphi : X \times Y \to \mathbb{R}$ be a continuously differentiable function such that $\varphi(\cdot, y)$ is convex and $\varphi(x, \cdot)$ is concave. We assume that $\varphi$ is $(\beta_{11}, \beta_{12}, \beta_{21}, \beta_{22})$-smooth in the sense that:
\[ \|\nabla_x \varphi(x, y) - \nabla_x \varphi(x', y)\|_{X,*} \leq \beta_{11} \|x - x'\|_X, \quad \|\nabla_x \varphi(x, y) - \nabla_x \varphi(x, y')\|_{X,*} \leq \beta_{12} \|y - y'\|_Y, \]
\[ \|\nabla_y \varphi(x, y) - \nabla_y \varphi(x', y)\|_{Y,*} \leq \beta_{21} \|x - x'\|_X, \quad \|\nabla_y \varphi(x, y) - \nabla_y \varphi(x, y')\|_{Y,*} \leq \beta_{22} \|y - y'\|_Y. \]
We consider the mirror map $\Phi(z) = \Phi_X(x) + \Phi_Y(y)$ on $Z = X \times Y$ (defined on $D = D_X \times D_Y$). Let $z_1 \in \mathrm{argmin}_{z \in Z \cap D} \Phi(z)$. Then for $t \geq 1$, let $z_t = (x_t, y_t) \in Z$ and $w_t = (u_t, v_t) \in Z$ be defined by the following equations (with the vector field $g(z) = (\nabla_x \varphi(x, y), -\nabla_y \varphi(x, y))$):
\[ \nabla \Phi(w_{t+1}') = \nabla \Phi(z_t) - \eta \, g(z_t), \quad w_{t+1} \in \mathrm{argmin}_{z \in Z \cap D} D_{\Phi}(z, w_{t+1}'), \]
\[ \nabla \Phi(z_{t+1}') = \nabla \Phi(z_t) - \eta \, g(w_{t+1}), \quad z_{t+1} \in \mathrm{argmin}_{z \in Z \cap D} D_{\Phi}(z, z_{t+1}'). \]
Theorem. Assume that $\Phi_X$ is $1$-strongly convex on $X \cap D_X$ with respect to $\|\cdot\|_X$ and let $R_X^2 = \sup_{x \in X \cap D_X} \Phi_X(x) - \Phi_X(x_1)$ (with similar definitions for $\Phi_Y$ and $R_Y^2$). Let $\beta = \max(\beta_{11}, \beta_{12}, \beta_{21}, \beta_{22})$. Then MP-SP with $\eta = \frac{1}{2\beta}$ satisfies
\[ \max_{y \in Y} \varphi\left( \frac{1}{t} \sum_{s=1}^{t} u_{s+1}, \, y \right) - \min_{x \in X} \varphi\left( x, \, \frac{1}{t} \sum_{s=1}^{t} v_{s+1} \right) \leq \frac{2 \beta \left( R_X^2 + R_Y^2 \right)}{t}. \]
Proof: First we endow $Z$ with the norm $\|\cdot\|_Z$ defined by
\[ \|z\|_Z = \sqrt{\|x\|_X^2 + \|y\|_Y^2}. \]
It is immediate that $\Phi$ is $1$-strongly convex with respect to $\|\cdot\|_Z$ on $Z \cap D$. Furthermore one can easily check that
\[ \|z\|_{Z,*} = \sqrt{\|x\|_{X,*}^2 + \|y\|_{Y,*}^2}, \]
and thus the vector field $g$ used in MP-SP satisfies:
\[ \|g(z) - g(z')\|_{Z,*} \leq 2 \beta \|z - z'\|_Z. \]
It suffices now to use the exact same proof as for the standard Mirror Prox, starting with:
\[ \varphi(u_{t+1}, y) - \varphi(x, v_{t+1}) \leq g(w_{t+1})^{\top} (w_{t+1} - z). \]
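For completeness, here is the one-line derivation (added here, not in the original post) of the inequality that starts this argument; it only uses concavity of $\varphi$ in $y$ and convexity in $x$, with $z = (x, y)$ and $w_{t+1} = (u_{t+1}, v_{t+1})$:

```latex
\varphi(u_{t+1}, y) - \varphi(x, v_{t+1})
= \underbrace{\varphi(u_{t+1}, y) - \varphi(u_{t+1}, v_{t+1})}_{\text{concavity in } y}
+ \underbrace{\varphi(u_{t+1}, v_{t+1}) - \varphi(x, v_{t+1})}_{\text{convexity in } x}
\leq \nabla_y \varphi(u_{t+1}, v_{t+1})^{\top} (y - v_{t+1})
+ \nabla_x \varphi(u_{t+1}, v_{t+1})^{\top} (u_{t+1} - x)
= g(w_{t+1})^{\top} (w_{t+1} - z).
```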
Smooth saddle-point representation of non-smooth functions
Assume that the function $f$ to be optimized is of the form:
\[ f(x) = \max_{1 \leq i \leq m} f_i(x), \]
where the functions $f_i$ are convex, $\beta$-smooth and $L$-Lipschitz in some norm $\|\cdot\|_X$. Minimizing $f$ over $X$ is equivalent to finding a saddle point over $X \times \Delta_m$ since one has:
\[ \max_{1 \leq i \leq m} f_i(x) = \max_{y \in \Delta_m} \sum_{i=1}^{m} y_i f_i(x). \]
Thus we are looking for the saddle point of $\varphi(x, y) = \sum_{i=1}^m y_i f_i(x)$, which is convex in $x$, and concave (in fact linear!) in $y$. Furthermore $\varphi$ is smooth and it is an easy exercise to verify that in this case one has $\beta_{11} = \beta$, $\beta_{12} = L$, $\beta_{21} = L$, and $\beta_{22} = 0$. Thus a rate of convergence (to the minimizer of $f$) of order $\max(\beta, L)(R_X^2 + \log m)/t$ can be obtained by applying MP-SP with appropriate mirror maps (in particular the update in the $y$-space has to be done with the negative entropy). Note that this rate is particularly striking since the oracle lower bound of order $1/\sqrt{t}$ was proved precisely for this type of function! Here we are able to get around this lower bound by making use of the structure of the function. Remark also that if you think about what this algorithm is doing to optimize the function you get a very very elegant procedure for which it is highly non-trivial a priori to prove something about its rate of convergence.
Finally note that if $R_X^2$ and $R_Y^2$ are of different orders of magnitude one can obtain a better result by considering the mirror map $\Phi(z) = a \Phi_X(x) + b \Phi_Y(y)$ and optimizing over the values of $a > 0$ and $b > 0$.
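As an illustration of this representation, here is a hypothetical sketch (not the post’s own code) of MP-SP for $f(x) = \max((x-1)^2, (x+1)^2)$ over an interval: a Euclidean mirror map on the scalar $x$ (so the Bregman projection is a clip) and negative entropy on the weights $y \in \Delta_2$. The specific functions, interval, step size, and iteration count are all illustrative assumptions.

```python
import numpy as np

def mp_sp_max_of_smooth(fs, grads, x0, lo, hi, eta, iters):
    """MP-SP sketch for f(x) = max_i f_i(x), written as the saddle point of
    phi(x, y) = sum_i y_i f_i(x): Euclidean mirror map on the scalar x
    (projection = clipping) and negative entropy on y in the simplex."""
    x, y = x0, np.ones(len(fs)) / len(fs)
    u_sum = 0.0
    F = lambda x: np.array([f(x) for f in fs])                 # values drive the y-ascent
    G = lambda x, y: sum(w * g(x) for w, g in zip(y, grads))   # grad_x phi(x, y)
    for _ in range(iters):
        # Extrapolation step from (x_t, y_t), along the field at (x_t, y_t).
        u = np.clip(x - eta * G(x, y), lo, hi)
        v = y * np.exp(eta * F(x)); v /= v.sum()
        # Correction step, again from (x_t, y_t), along the field at (u, v).
        x = np.clip(x - eta * G(u, v), lo, hi)
        y = y * np.exp(eta * F(u)); y /= y.sum()
        u_sum += u
    return u_sum / iters

# f(x) = max((x-1)^2, (x+1)^2) on [-2, 2]: minimized at x = 0 with value 1.
fs = [lambda x: (x - 1.0) ** 2, lambda x: (x + 1.0) ** 2]
grads = [lambda x: 2.0 * (x - 1.0), lambda x: 2.0 * (x + 1.0)]
xbar = mp_sp_max_of_smooth(fs, grads, x0=1.5, lo=-2.0, hi=2.0, eta=0.05, iters=4000)
```

Even though $f$ is non-smooth at its minimizer $x = 0$, the averaged $x$-iterate approaches it, as promised by the $1/t$ rate.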
Another interesting example arises when $f(x) = \max_{1 \leq i \leq m} (Ax)_i$ with $A \in \mathbb{R}^{m \times n}$, $X = \Delta_n$ and $Y = \Delta_m$, that is $\varphi(x, y) = y^{\top} A x$. With the $\ell_1$-norm on both spaces it is easy to see that $\beta_{12} = \beta_{21} = \|A\|_{\max}$ (the entrywise max norm of $A$), and of course $\beta_{11} = \beta_{22} = 0$. Thus we obtain here a convergence rate of order $\|A\|_{\max} (\log n + \log m)/t$. Again if $\log n$ and $\log m$ are of different orders of magnitude one can improve the result to obtain a rate of order $\|A\|_{\max} \sqrt{\log n \, \log m}/t$ by considering the mirror map $a \Phi_X(x) + b \Phi_Y(y)$ and optimizing over the values of $a$ and $b$.
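This bilinear case is just a zero-sum matrix game, and MP-SP with entropy on both simplices reduces to two coupled multiplicative-weights updates. The sketch below (my own illustration; the payoff matrix, step size, and iteration count are assumptions) tracks the duality gap $\max_i (A\bar{x})_i - \min_j (A^{\top}\bar{y})_j$ of the averaged strategies:

```python
import numpy as np

def sp_mirror_prox_game(A, eta, iters):
    """MP-SP for the bilinear saddle point phi(x, y) = y^T A x on a product
    of simplices, with negative entropy on both blocks, so every mirror
    step is a multiplicative update followed by a renormalization."""
    m, n = A.shape
    x, y = np.ones(n) / n, np.ones(m) / m
    x_sum, y_sum = np.zeros(n), np.zeros(m)
    for _ in range(iters):
        # Extrapolation step along g(z) = (grad_x phi, -grad_y phi) = (A^T y, -A x).
        u = x * np.exp(-eta * (A.T @ y)); u /= u.sum()
        v = y * np.exp(eta * (A @ x));    v /= v.sum()
        # Correction step, again from (x_t, y_t), along the field at (u, v).
        x = x * np.exp(-eta * (A.T @ v)); x /= x.sum()
        y = y * np.exp(eta * (A @ u));    y /= y.sum()
        x_sum += u; y_sum += v
    return x_sum / iters, y_sum / iters

# A 2x2 zero-sum game with interior equilibrium x* = (1/4, 3/4), y* = (1/2, 1/2).
A = np.array([[1.0, 3.0], [4.0, 2.0]])
xbar, ybar = sp_mirror_prox_game(A, eta=0.1, iters=2000)
gap = np.max(A @ xbar) - np.min(A.T @ ybar)  # duality gap, shrinks as O(1/t)
```

The averaging over the intermediate points $(u_t, v_t)$ is essential here: the last iterates of gradient-type dynamics on a bilinear game can cycle, while the averaged MP-SP iterates converge.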