We start with some of the standard setups for the Mirror Descent algorithm we described in the previous post.
Standard setups for Mirror Descent
‘Ball’ setup. The simplest version of Mirror Descent is obtained by taking $\Phi(x) = \frac{1}{2} \|x\|_2^2$ on $\mathcal{D} = \mathbb{R}^n$. The function $\Phi$ is a mirror map strongly convex w.r.t. $\|\cdot\|_2$, and furthermore the associated Bregman divergence is given by $D_{\Phi}(x, y) = \frac{1}{2} \|x - y\|_2^2$. Thus in that case Mirror Descent is exactly equivalent to Projected Subgradient Descent, and the rate of convergence obtained in the previous post recovers our earlier result on Projected Subgradient Descent.
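To make the equivalence concrete, here is a minimal Python sketch of this setup (an illustration, not code from the post); the callables `subgradient` and `proj` and the step size `eta` are placeholders to be supplied for the problem at hand. Since $\nabla \Phi$ is the identity, the mirror step and the Bregman projection reduce exactly to a projected subgradient step:

```python
import numpy as np

def mirror_descent_ball(x0, subgradient, proj, eta, T):
    """Mirror Descent with the mirror map Phi(x) = 0.5 * ||x||_2^2.

    Since grad Phi is the identity map, the mirror step is y = x - eta * g
    and the Bregman projection is the Euclidean projection, so this is
    exactly Projected Subgradient Descent.
    """
    x = np.asarray(x0, dtype=float).copy()
    avg = np.zeros_like(x)
    for t in range(1, T + 1):
        g = subgradient(x)    # any g_t in the subdifferential of f at x_t
        y = x - eta * g       # mirror step: grad Phi(y_{t+1}) = grad Phi(x_t) - eta * g_t
        x = proj(y)           # Bregman projection = Euclidean projection onto the set
        avg += (x - avg) / t  # average iterate, for which the rate is stated
    return avg
```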
‘Simplex’ setup. A more interesting choice of a mirror map is given by the negative entropy $\Phi(x) = \sum_{i=1}^n x(i) \log x(i)$ on $\mathcal{D} = \mathbb{R}_{++}^n$. In that case the gradient update $\nabla \Phi(y_{t+1}) = \nabla \Phi(x_t) - \eta g_t$ can be written equivalently as
$$y_{t+1}(i) = x_t(i) \exp\big(-\eta g_t(i)\big), \quad i = 1, \dots, n.$$
The Bregman divergence of this mirror map is given by $D_{\Phi}(x, y) = \sum_{i=1}^n x(i) \log \frac{x(i)}{y(i)} + \sum_{i=1}^n \big( y(i) - x(i) \big)$ (also known as the generalized Kullback-Leibler divergence). It is easy to verify that the projection with respect to this Bregman divergence on the simplex $\Delta_n = \{x \in \mathbb{R}_+^n : \sum_{i=1}^n x(i) = 1\}$ amounts to a simple renormalization $y \mapsto y / \|y\|_1$. Furthermore it is also easy to verify that $\Phi$ is $1$-strongly convex w.r.t. $\|\cdot\|_1$ on $\Delta_n$ (this result is known as Pinsker’s inequality). Note also that one has $x_1 = (1/n, \dots, 1/n)$ and $R^2 = \log n$ when $\mathcal{X} = \Delta_n$.
The above observations imply that when minimizing on the simplex a function $f$ with subgradients bounded in $\ell_{\infty}$-norm, Mirror Descent with the negative entropy achieves a rate of convergence of order $\sqrt{\frac{\log n}{t}}$. On the other hand, the regular Subgradient Descent achieves only a rate of order $\sqrt{\frac{n}{t}}$ in this case!
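For concreteness, here is a short Python sketch of Mirror Descent with the negative entropy on the simplex (an illustration, not code from the post); the fixed step size and the toy quadratic objective are arbitrary choices, whereas the theory would take $\eta$ of order $\sqrt{\log n / t}$:

```python
import numpy as np

def entropic_mirror_descent(grad, n, eta, T):
    """Mirror Descent with Phi(x) = sum_i x(i) log x(i) on the simplex."""
    x = np.full(n, 1.0 / n)       # x_1 = (1/n, ..., 1/n) minimizes Phi on the simplex
    avg = np.zeros(n)
    for t in range(1, T + 1):
        g = grad(x)               # (sub)gradient of f at x_t
        y = x * np.exp(-eta * g)  # mirror step: y_{t+1}(i) = x_t(i) exp(-eta g_t(i))
        x = y / y.sum()           # Bregman projection = simple renormalization
        avg += (x - avg) / t      # average iterate, for which the rate holds
    return avg

# Illustrative usage: minimize f(x) = 0.5 * ||x - c||_2^2 over the simplex;
# the gradient x - c is bounded in l_infinity-norm on the simplex.
c = np.array([0.7, 0.2, 0.1, 0.0])
x_bar = entropic_mirror_descent(lambda x: x - c, n=4, eta=0.1, T=2000)
```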
‘Spectrahedron’ setup. We consider here functions defined on matrices, and we are interested in minimizing a function $f$ on the spectrahedron $\mathcal{S}_n$ defined as:
$$\mathcal{S}_n = \left\{ X \in \mathbb{S}_+^n : \mathrm{Tr}(X) = 1 \right\}.$$
In this setting we consider the mirror map on $\mathcal{D} = \mathbb{S}_{++}^n$ given by the negative von Neumann entropy:
$$\Phi(X) = \sum_{i=1}^n \lambda_i(X) \log \lambda_i(X),$$
where $\lambda_1(X), \dots, \lambda_n(X)$ are the eigenvalues of $X$. It can be shown that the gradient update $\nabla \Phi(Y_{t+1}) = \nabla \Phi(X_t) - \eta g_t$ can be written equivalently as
$$Y_{t+1} = \exp\big( \log X_t - \eta g_t \big),$$
where the matrix exponential and matrix logarithm are defined as usual. Furthermore the projection on $\mathcal{S}_n$ is a simple trace renormalization $Y \mapsto Y / \mathrm{Tr}(Y)$.
With a highly non-trivial computation one can show that $\Phi$ is $\frac{1}{2}$-strongly convex with respect to the Schatten $1$-norm defined as
$$\|X\|_1 = \sum_{i=1}^n |\lambda_i(X)|.$$
It is easy to see that one has $X_1 = \frac{1}{n} \mathrm{I}_n$ and $R^2 = \log n$ for $\mathcal{X} = \mathcal{S}_n$. In other words the rate of convergence for optimization on the spectrahedron is the same as on the simplex!
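The matrix case can be sketched in the same way. The following Python snippet (again an illustration with arbitrary choices of step size and objective, not code from the post) implements the update $Y_{t+1} = \exp(\log X_t - \eta g_t)$ via eigendecompositions, followed by the trace renormalization:

```python
import numpy as np

def _sym_apply(A, fun):
    """Apply a scalar function to a symmetric matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * fun(w)) @ V.T

def spectrahedron_mirror_descent(grad, n, eta, T):
    """Mirror Descent with the negative von Neumann entropy on the spectrahedron."""
    X = np.eye(n) / n        # X_1 = I/n minimizes Phi on the spectrahedron
    avg = np.zeros((n, n))
    for t in range(1, T + 1):
        G = grad(X)          # symmetric (sub)gradient of f at X_t
        Y = _sym_apply(_sym_apply(X, np.log) - eta * G, np.exp)  # Y = exp(log X - eta G)
        X = Y / np.trace(Y)  # projection = trace renormalization
        avg += (X - avg) / t
    return avg

# Illustrative usage: minimize f(X) = Tr(C X) over the spectrahedron;
# the minimum is attained at the projector onto C's smallest eigenvector.
C = np.diag([1.0, 2.0, 3.0])
X_bar = spectrahedron_mirror_descent(lambda X: C, n=3, eta=0.5, T=2000)
```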
Mirror Descent for Saddle Points
We consider now the following saddle-point problem: Let $\mathcal{X} \subset \mathbb{R}^n$, $\mathcal{Y} \subset \mathbb{R}^m$ be compact convex sets, and $\varphi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ be a continuous function such that $\varphi(\cdot, y)$ is convex and $\varphi(x, \cdot)$ is concave. We are interested in finding $(x^*, y^*) \in \mathcal{X} \times \mathcal{Y}$ such that
$$\varphi(x^*, y^*) = \min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} \varphi(x, y) = \max_{y \in \mathcal{Y}} \min_{x \in \mathcal{X}} \varphi(x, y).$$
(Note that the above equality uses Sion’s minimax theorem from this lecture.) We measure the quality of a candidate solution $(\tilde{x}, \tilde{y})$ through the primal/dual gap
$$\max_{y \in \mathcal{Y}} \varphi(\tilde{x}, y) - \min_{x \in \mathcal{X}} \varphi(x, \tilde{y}).$$
The key observation now is that for any $(\tilde{x}, \tilde{y}) \in \mathcal{X} \times \mathcal{Y}$, one has with $g_{\mathcal{X}} \in \partial_x \varphi(\tilde{x}, \tilde{y})$ and $g_{\mathcal{Y}} \in \partial_y \big( -\varphi(\tilde{x}, \tilde{y}) \big)$, for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$,
$$\varphi(\tilde{x}, \tilde{y}) - \varphi(x, \tilde{y}) \leq g_{\mathcal{X}}^{\top} (\tilde{x} - x),$$
and
$$- \varphi(\tilde{x}, \tilde{y}) + \varphi(\tilde{x}, y) \leq g_{\mathcal{Y}}^{\top} (\tilde{y} - y).$$
In particular, denoting $\tilde{z} = (\tilde{x}, \tilde{y})$, $z = (x, y)$ and $g_{\mathcal{Z}} = (g_{\mathcal{X}}, g_{\mathcal{Y}})$, we just proved
$$\varphi(\tilde{x}, y) - \varphi(x, \tilde{y}) \leq g_{\mathcal{Z}}^{\top} (\tilde{z} - z)$$
for some $g_{\mathcal{Z}} \in \partial_x \varphi(\tilde{x}, \tilde{y}) \times \partial_y \big( -\varphi(\tilde{x}, \tilde{y}) \big)$. This inequality suggests that one can solve the saddle-point problem with a Mirror Descent in the space $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ using the ‘vector field’ given by $\big( \partial_x \varphi, \partial_y (-\varphi) \big)$. Precisely this gives the following SP Mirror Descent:
Let $\Phi_{\mathcal{X}}$ be a mirror map on $\mathcal{X}$ (defined on $\mathcal{D}_{\mathcal{X}} \supset \mathcal{X}$) and let $\Phi_{\mathcal{Y}}$ be a mirror map on $\mathcal{Y}$ (defined on $\mathcal{D}_{\mathcal{Y}} \supset \mathcal{Y}$). We consider now the mirror map $\Phi(z) = \Phi_{\mathcal{X}}(x) + \Phi_{\mathcal{Y}}(y)$ on $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ (defined on $\mathcal{D} = \mathcal{D}_{\mathcal{X}} \times \mathcal{D}_{\mathcal{Y}}$).
Let $z_1 = \mathrm{argmin}_{z \in \mathcal{Z} \cap \mathcal{D}} \Phi(z)$. Then for $t \geq 1$, let $w_{t+1} \in \mathcal{D}$ such that
$$\nabla \Phi(w_{t+1}) = \nabla \Phi(z_t) - \eta g_t,$$
where $g_t = (g_{\mathcal{X}, t}, g_{\mathcal{Y}, t})$ with $g_{\mathcal{X}, t} \in \partial_x \varphi(x_t, y_t)$ and $g_{\mathcal{Y}, t} \in \partial_y \big( -\varphi(x_t, y_t) \big)$, and let $z_{t+1}$ be the Bregman projection of $w_{t+1}$ on $\mathcal{Z}$. Finally let
$$\bar{z}_t = (\bar{x}_t, \bar{y}_t) = \frac{1}{t} \sum_{s=1}^{t} z_s.$$
The following result describes the performance of SP Mirror Descent.
Theorem. Assume that $\Phi_{\mathcal{X}}$ is $\kappa_{\mathcal{X}}$-strongly convex on $\mathcal{X} \cap \mathcal{D}_{\mathcal{X}}$ with respect to $\|\cdot\|_{\mathcal{X}}$, and let $R_{\mathcal{X}}^2 = \sup_{x \in \mathcal{X}} \Phi_{\mathcal{X}}(x) - \min_{x \in \mathcal{X}} \Phi_{\mathcal{X}}(x)$. Assume that $\varphi(\cdot, y)$ is $L_{\mathcal{X}}$-Lipschitz w.r.t. $\|\cdot\|_{\mathcal{X}}$. Define similarly $\kappa_{\mathcal{Y}}$, $R_{\mathcal{Y}}$, $L_{\mathcal{Y}}$. Then SP Mirror Descent with $\eta = \sqrt{\frac{2 (R_{\mathcal{X}}^2 + R_{\mathcal{Y}}^2)}{\big( \frac{L_{\mathcal{X}}^2}{\kappa_{\mathcal{X}}} + \frac{L_{\mathcal{Y}}^2}{\kappa_{\mathcal{Y}}} \big) t}}$ satisfies
$$\max_{y \in \mathcal{Y}} \varphi(\bar{x}_t, y) - \min_{x \in \mathcal{X}} \varphi(x, \bar{y}_t) \leq \sqrt{\frac{2 (R_{\mathcal{X}}^2 + R_{\mathcal{Y}}^2) \big( \frac{L_{\mathcal{X}}^2}{\kappa_{\mathcal{X}}} + \frac{L_{\mathcal{Y}}^2}{\kappa_{\mathcal{Y}}} \big)}{t}}.$$
Proof: First we endow $\mathcal{Z}$ with the norm defined by
$$\|z\|_{\mathcal{Z}} = \sqrt{\kappa_{\mathcal{X}} \|x\|_{\mathcal{X}}^2 + \kappa_{\mathcal{Y}} \|y\|_{\mathcal{Y}}^2}.$$
It is immediate that $\Phi$ is $1$-strongly convex with respect to $\|\cdot\|_{\mathcal{Z}}$ on $\mathcal{Z} \cap \mathcal{D}$. Furthermore one can easily check that the dual norm satisfies
$$\|g\|_{\mathcal{Z}}^{*} = \sqrt{\frac{1}{\kappa_{\mathcal{X}}} \big( \|g_{\mathcal{X}}\|_{\mathcal{X}}^{*} \big)^2 + \frac{1}{\kappa_{\mathcal{Y}}} \big( \|g_{\mathcal{Y}}\|_{\mathcal{Y}}^{*} \big)^2},$$
and thus the vector field used in the SP Mirror Descent satisfies:
$$\|g_t\|_{\mathcal{Z}}^{*} \leq \sqrt{\frac{L_{\mathcal{X}}^2}{\kappa_{\mathcal{X}}} + \frac{L_{\mathcal{Y}}^2}{\kappa_{\mathcal{Y}}}}.$$
It suffices now to use the exact same proof as for the standard Mirror Descent, starting with:
$$\max_{y \in \mathcal{Y}} \varphi(\bar{x}_t, y) - \min_{x \in \mathcal{X}} \varphi(x, \bar{y}_t) \leq \max_{z \in \mathcal{Z}} \frac{1}{t} \sum_{s=1}^{t} g_s^{\top} (z_s - z). \qquad \square$$
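To illustrate SP Mirror Descent, here is a minimal Python sketch (an illustration, not code from the post) for the matrix game $\varphi(x, y) = x^{\top} A y$ on a product of simplices, where both mirror maps are the negative entropy so that each mirror step is a multiplicative update; the payoff matrix and the fixed step size are illustrative choices:

```python
import numpy as np

def sp_mirror_descent_game(A, eta, T):
    """SP Mirror Descent for min_{x in simplex} max_{y in simplex} x^T A y."""
    n, m = A.shape
    x = np.full(n, 1.0 / n)
    y = np.full(m, 1.0 / m)
    x_bar = np.zeros(n)
    y_bar = np.zeros(m)
    for t in range(1, T + 1):
        gx = A @ y       # g_{X,t} = A y_t, a gradient of phi(., y_t)
        gy = -(A.T @ x)  # g_{Y,t} = -A^T x_t, a gradient of -phi(x_t, .)
        x = x * np.exp(-eta * gx)
        x /= x.sum()     # entropic mirror step + renormalization on the x-block
        y = y * np.exp(-eta * gy)
        y /= y.sum()     # same on the y-block
        x_bar += (x - x_bar) / t  # z_bar_t = (1/t) sum_s z_s
        y_bar += (y - y_bar) / t
    return x_bar, y_bar

# Illustrative usage: rock-paper-scissors. The averaged strategies approach
# the equilibrium (1/3, 1/3, 1/3), and the primal/dual gap is exact here since
# the inner max/min over a simplex are attained at vertices.
A = np.array([[0.0, 1.0, -1.0], [-1.0, 0.0, 1.0], [1.0, -1.0, 0.0]])
x_bar, y_bar = sp_mirror_descent_game(A, eta=0.1, T=5000)
gap = (x_bar @ A).max() - (A @ y_bar).min()
```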
By Eufrasio July 16, 2015 - 11:49 am
Congratulations on your post. Do you know where I can find R code (or a library – http://www.r-project.org) implementing the mirror descent algorithm?
By Alvaro December 19, 2013 - 5:15 pm
Thank you for your answer and the reference. It is interesting to see a connection to interior-point methods.
By Alvaro December 19, 2013 - 12:08 am
Thank you very much for the great blog.
I have a question though. Given the geometry of a particular problem, how does one choose an appropriate mirror map? The examples listed here show that the selected mirror maps work remarkably well, but these are a posteriori observations. How was, for instance, the generalized Kullback-Leibler divergence deemed a suitable mirror map for the simplex? It is hard to grasp the intuition behind these choices, other than the fact that they turn out to be good in hindsight.
By Sebastien Bubeck December 19, 2013 - 3:12 pm
To the best of my knowledge this is still an open question (and a very interesting one in my opinion). In some sense this paper (http://web.eecs.umich.edu/~jabernet/123-Abernethy.pdf) gives a recommendation: use a mirror map which is a self-concordant barrier for the constraint set. Note that this recommendation makes sense in the bandit setting but it is not clear how relevant it is for the optimization setting discussed here.