We start with some of the standard setups for the Mirror Descent algorithm we described in the previous post.

**Standard setups for Mirror Descent**

**‘Ball’ setup.** The simplest version of Mirror Descent is obtained by taking on The function is a mirror map strongly convex w.r.t. , and furthermore the associated Bregman Divergence is given by . Thus in that case Mirror Descent is exactly equivalent to Projected Subgradient Descent, and the rate of convergence obtained in the previous post recovers our earlier result on Projected Subgradient Descent.

**‘Simplex’ setup.** A more interesting choice of a mirror map is given by the negative entropy

on . In that case the gradient update can be written equivalently as

The Bregman divergence of this mirror map is given by (also known as the generalized Kullback-Leibler divergence). It is easy to verify that the projection with respect to this Bregman divergence on the simplex amounts to a simple renormalization . Furthermore it is also easy to verify that is -strongly convex w.r.t. on (this result is known as Pinsker’s inequality). Note also that one has and when

The above observations imply that when minimizing on the simplex a function with subgradients bounded in -norm, Mirror Descent with the negative entropy achieves a rate of convergence of order . On the other the regular Subgradient Descent achieves only a rate of order in this case!

**‘Spectrahedron’ setup.** We consider here functions defined on matrices, and we are interested in minimizing a function on the *spectrahedron* defined as:

In this setting we consider the mirror map on given by the negative von Neumann entropy:

where are the eigenvalues of . It can be shown that the gradient update can be written equivalently as

where the matrix exponential and matrix logarithm are defined as usual. Furthermore the projection on is a simple trace renormalization.

With highly non-trivial computation one can show that is -strongly convex with respect to the Schatten -norm defined as

It is easy to see that one has and for . In other words the rate of convergence for optimization on the spectrahedron is the same than on the simplex!

**Mirror Descent for Saddle Points**

We consider now the following saddle-point problem: Let be compact convex sets, and be a continuous function such that is convex and is concave. We are interested in finding such that

(Note that the above equality uses Sion’s minimax theorem from this lecture.) We measure the quality of a candidate solution through the primal/dual gap

The key observation now is that for any , one has with and ,

and

In particular, denoting we just proved

for some This inequality suggests that one can solve the saddle-point problem with a Mirror Descent in the space using the ‘vector field’ given by . Precisely this gives the following SP Mirror Descent:

Let be a mirror map on (defined on ) and let be a mirror map on (defined on ). We consider now the mirror map on (defined on ).

Let . Then for , let such that

where with and . Finally let

The following result describes the performances of SP Mirror Descent.

Theorem.Assume that is -strongly convex on with respect to , and let . Assume that is -Lipschitz w.r.t. . Define similarly . Then SP Mirror Descent with satisfies

*Proof:* First we endow with the norm defined by

It is immediate that is -strongly convex with respect to on . Furthermore one can easily check that

and thus the vector field used in the SP mirror descent satisfies:

It suffices now to use the exact same proof than for the standard Mirror Descent starting with:

## By Alvaro December 19, 2013 - 5:15 pm

Thank you for your answer and the reference. It is interesting to see a connection to interior-point methods.

## By Alvaro December 19, 2013 - 12:08 am

Thank you very much for the great blog.

I have a question though. Given the geometry of a particular problem, how does one choose an appropriate mirror map? The examples listed here show that the selected mirror maps work remarkably well, but these are a posteriori observations. How was, for instance, the generalized Kullback-Leibler divergence deemed a suitable mirror map for the simplex? It is hard to grasp the intuition behind these choices, other than the fact that they turn out to be good in hindsight.

## By Sebastien Bubeck December 19, 2013 - 3:12 pm

To the best of my knowledge this is still an open question (and a very interesting one in my opinion). In some sense this paper (http://web.eecs.umich.edu/~jabernet/123-Abernethy.pdf) gives a recommendation: use a mirror map which is a self-concordant barrier for the constraint set. Note that this recommendation makes sense in the bandit setting but it is not clear how relevant it is for the optimization setting discussed here.