
ORF523: Mirror Descent, part II/II

We start with some of the standard setups for the Mirror Descent algorithm we described in the previous post.

Standard setups for Mirror Descent

‘Ball’ setup. The simplest version of Mirror Descent is obtained by taking \Phi(x) = \frac{1}{2} \|x\|^2_2 on \mathcal{D} = \mathbb{R}^n. The function \Phi is a mirror map, 1-strongly convex w.r.t. \|\cdot\|_2, and the associated Bregman divergence is given by D_{\Phi}(x,y) = \frac{1}{2}\|x - y\|_2^2. Thus in that case Mirror Descent is exactly equivalent to Projected Subgradient Descent, and the rate of convergence obtained in the previous post recovers our earlier result on Projected Subgradient Descent.
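For concreteness, here is the short computation behind this equivalence: since \nabla \Phi(x) = x, the gradient update and the Bregman projection reduce to a subgradient step followed by a Euclidean projection \Pi_{\mathcal{X}} onto \mathcal{X},

    \[D_{\Phi}(x,y) = \frac{1}{2} \|x\|_2^2 - \frac{1}{2} \|y\|_2^2 - y^{\top}(x - y) = \frac{1}{2} \|x - y\|_2^2, \qquad y_{t+1} = x_t - \eta \nabla f(x_t), \quad x_{t+1} = \Pi_{\mathcal{X}}(y_{t+1}) .\]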

‘Simplex’ setup. A more interesting choice of a mirror map is given by the negative entropy

    \[\Phi(x) = \sum_{i=1}^n x_i \log x_i,\]

on \mathcal{D} = \mathbb{R}_{++}^n. In that case the gradient update \nabla \Phi(y_{t+1}) = \nabla \Phi(x_t) - \eta \nabla f(x_t) can be written equivalently as

    \[y_{t+1}(i) = x_{t}(i) \exp\big(- \eta [\nabla f(x_t) ]_i \big) , \ i=1, \hdots, n.\]

The Bregman divergence of this mirror map is given by D_{\Phi}(x,y) = \sum_{i=1}^n x_i \log \frac{x_i}{y_i} - \sum_{i=1}^n (x_i - y_i) (also known as the generalized Kullback-Leibler divergence). It is easy to verify that the projection with respect to this Bregman divergence on the simplex \Delta_n = \{x \in \mathbb{R}_+^n : \sum_{i=1}^n x_i = 1\} amounts to a simple renormalization y \mapsto y / \|y\|_1. Furthermore it is also easy to verify that \Phi is 1-strongly convex w.r.t. \|\cdot\|_1 on \Delta_n (this result is known as Pinsker’s inequality). Note also that one has x_1 = (1/n, \hdots, 1/n) and R^2 = \log n when \mathcal{X} = \Delta_n. 

The above observations imply that, when minimizing on the simplex \Delta_n a function f with subgradients bounded in \ell_{\infty}-norm, Mirror Descent with the negative entropy achieves a rate of convergence of order \sqrt{\frac{\log n}{t}}. On the other hand, regular Projected Subgradient Descent achieves only a rate of order \sqrt{\frac{n}{t}} in this case!
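As an illustration, here is a minimal Python sketch of this entropic setup (the subgradient oracle grad_f and the step size eta are assumptions made for the sake of the example, not part of the analysis above):

    import numpy as np

    def entropic_mirror_descent(grad_f, n, eta, T):
        """Mirror Descent on the simplex with the negative entropy mirror map."""
        x = np.full(n, 1.0 / n)        # x_1 = (1/n, ..., 1/n)
        avg = np.zeros(n)
        for _ in range(T):
            g = grad_f(x)              # subgradient of f at x, assumed bounded in sup-norm
            y = x * np.exp(-eta * g)   # y_{t+1}(i) = x_t(i) exp(-eta [grad f(x_t)]_i)
            x = y / y.sum()            # Bregman projection on Delta_n = l1 renormalization
            avg += x
        return avg / T                 # averaged iterate, as in the convergence theorem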

‘Spectrahedron’ setup. We consider here functions defined on matrices, and we are interested in minimizing a function f on the spectrahedron \mathcal{S}_n defined as:

    \[\mathcal{S}_n = \left\{X \in \mathbb{S}_+^n : \mathrm{Tr}(X) = 1 \right\} .\]

In this setting we consider the mirror map on \mathcal{D} = \mathbb{S}_{++}^n given by the negative von Neumann entropy:

    \[\Phi(X) = \sum_{i=1}^n \lambda_i(X) \log \lambda_i(X) ,\]

where \lambda_1(X), \hdots, \lambda_n(X) are the eigenvalues of X. It can be shown that the gradient update \nabla \Phi(Y_{t+1}) = \nabla \Phi(X_t) - \eta \nabla f(X_t) can be written equivalently as

    \[Y_{t+1} = \exp\big(\log X_t - \eta \nabla f(X_t) \big) ,\]

where the matrix exponential and matrix logarithm are defined as usual. Furthermore the projection on \mathcal{S}_n is a simple trace renormalization.

Via a highly non-trivial computation, one can show that \Phi is \frac{1}{2}-strongly convex with respect to the Schatten 1-norm defined as

    \[\|X\|_1 = \sum_{i=1}^n |\lambda_i(X)|.\]

It is easy to see that one has X_1 = \frac{1}{n} \mathrm{I}_n and R^2 = \log n for \mathcal{X} = \mathcal{S}_n. In other words the rate of convergence for optimization on the spectrahedron is the same as on the simplex!
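Again as an illustration, a minimal Python sketch of the spectrahedron setup (the symmetric subgradient oracle grad_f is an assumption; the matrix exponential and logarithm are computed through an eigendecomposition, which is legitimate here since the iterates stay symmetric positive definite):

    import numpy as np

    def spectrahedron_mirror_descent(grad_f, n, eta, T):
        """Mirror Descent on the spectrahedron with the negative von Neumann entropy."""
        X = np.eye(n) / n                           # X_1 = I/n
        avg = np.zeros((n, n))
        for _ in range(T):
            G = grad_f(X)                           # symmetric subgradient of f at X (assumption)
            w, V = np.linalg.eigh(X)                # X_t = V diag(w) V^T with w > 0
            A = V @ np.diag(np.log(w)) @ V.T - eta * G
            w2, V2 = np.linalg.eigh(A)
            Y = V2 @ np.diag(np.exp(w2)) @ V2.T     # Y_{t+1} = exp(log X_t - eta grad f(X_t))
            X = Y / np.trace(Y)                     # projection on S_n = trace renormalization
            avg += X
        return avg / T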

 

Mirror Descent for Saddle Points

We consider now the following saddle-point problem: Let \mathcal{X} \subset \mathbb{R}^n, \mathcal{Y} \subset \mathbb{R}^m be compact convex sets, and \phi : \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R} be a continuous function such that \phi(\cdot, y) is convex and \phi(x, \cdot) is concave. We are interested in finding z^*=(x^*, y^*) \in \mathcal{X} \times \mathcal{Y} =: \mathcal{Z} such that

    \[\phi(x^*,y^*) = \inf_{x \in \mathcal{X}} \sup_{y \in \mathcal{Y}} \phi(x,y) = \sup_{y \in \mathcal{Y}} \inf_{x \in \mathcal{X}} \phi(x,y) .\]

(Note that the above equality uses Sion’s minimax theorem from this lecture.) We measure the quality of a candidate solution z=(x,y) through the primal/dual gap

    \[\max_{y' \in \mathcal{Y}} \phi(x,y') - \min_{x' \in \mathcal{X}} \phi(x',y) .\]

The key observation now is that for any x, y, x', y', one has, with g_x \in \partial_x \phi(x,y) and -g_y \in \partial_y (- \phi(x,y)),

    \[\phi(x,y) - \phi(x',y) \leq g_x^{\top} (x-x'),\]

and

    \[- \phi(x,y) - (- \phi(x,y')) \leq (- g_y)^{\top} (y-y') .\]

In particular, denoting g_z = (g_x, - g_y) we just proved

    \[\max_{y' \in \mathcal{Y}} \phi(x,y') - \min_{x' \in \mathcal{X}} \phi(x',y) \leq g_z^{\top} (z - z') ,\]

for some z' \in \mathcal{Z}. This inequality suggests that one can solve the saddle-point problem with Mirror Descent in the space \mathcal{Z} using the ‘vector field’ given by g_z. Precisely, this gives the following SP Mirror Descent algorithm:

Let \Phi_{\mathcal{X}} be a mirror map on \mathcal{X} (defined on \mathcal{D_{\mathcal{X}}}) and let \Phi_{\mathcal{Y}} be a mirror map on \mathcal{Y} (defined on \mathcal{D_{\mathcal{Y}}}). We consider now the mirror map \Phi((x,y)) = \Phi_{\mathcal{X}}(x) + \Phi_{\mathcal{Y}}(y) on \mathcal{Z} (defined on \mathcal{D} = \mathcal{D_{\mathcal{X}}} \times \mathcal{D_{\mathcal{Y}}}).

Let z_1 \in \mathrm{argmin}_{z \in \mathcal{Z} \cap \mathcal{D}} \Phi(z). Then for t \geq 1, let w_{t+1} \in \mathcal{D} be such that

    \[\nabla \Phi(w_{t+1}) = \nabla \Phi(z_{t}) - \eta g_t,\]

where g_t = (g_x, - g_y) with g_x \in \partial_x \phi(x_t,y_t) and -g_y \in \partial_y (- \phi(x_t,y_t)). Finally let

    \[z_{t+1} \in \mathrm{argmin}_{z \in \mathcal{Z} \cap \mathcal{D}} D_{\Phi}(z,w_{t+1}) .\]
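As a concrete illustration, here is a minimal Python sketch of SP Mirror Descent for a matrix game \phi(x,y) = x^{\top} A y on \Delta_n \times \Delta_m, taking the negative entropy as the mirror map on both factors (the payoff matrix A and the step size eta are assumptions made for the example):

    import numpy as np

    def sp_mirror_descent_matrix_game(A, eta, T):
        """SP Mirror Descent for phi(x, y) = x^T A y on Delta_n x Delta_m."""
        n, m = A.shape
        x, y = np.full(n, 1.0 / n), np.full(m, 1.0 / m)   # z_1 = argmin Phi: uniform on each simplex
        x_avg, y_avg = np.zeros(n), np.zeros(m)
        for _ in range(T):
            gx = A @ y                   # g_x, gradient of phi(., y_t) at x_t
            gy = A.T @ x                 # g_y, gradient of phi(x_t, .) at y_t; g_t = (g_x, -g_y)
            x = x * np.exp(-eta * gx); x /= x.sum()   # entropic step + projection (descent in x)
            y = y * np.exp(+eta * gy); y /= y.sum()   # entropic step + projection (ascent in y)
            x_avg += x; y_avg += y
        return x_avg / T, y_avg / T      # averaged iterates appearing in the theorem below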

The following result describes the performance of SP Mirror Descent.

Theorem. Assume that \Phi_{\mathcal{X}} is \kappa_{\mathcal{X}}-strongly convex on \mathcal{X} \cap \mathcal{D}_{\mathcal{X}} with respect to \|\cdot\|_{\mathcal{X}}, and let R^2_{\mathcal{X}} = \sup_{x \in \mathcal{X} \cap \mathcal{D}_{\mathcal{X}}} \Phi_{\mathcal{X}}(x) - \Phi_{\mathcal{X}}(x_1). Assume that for every y \in \mathcal{Y} the function \phi(\cdot, y) is L_{\mathcal{X}}-Lipschitz w.r.t. \|\cdot\|_{\mathcal{X}}. Define \kappa_{\mathcal{Y}}, R^2_{\mathcal{Y}}, L_{\mathcal{Y}} similarly. Then SP Mirror Descent with \eta = \sqrt{\frac{R^2_{\mathcal{X}} + R^2_{\mathcal{Y}}}{\frac{L_{\mathcal{X}}^2}{\kappa_{\mathcal{X}}} + \frac{L_{\mathcal{Y}}^2}{\kappa_{\mathcal{Y}}}}} \sqrt{\frac{2}{t}} satisfies

    \[\max_{y \in \mathcal{Y}} \phi\left( \frac1{t} \sum_{s=1}^t x_s,y \right) - \min_{x \in \mathcal{X}} \phi\left(x, \frac1{t} \sum_{s=1}^t y_s \right) \leq \sqrt{\frac{2 \left( R^2_{\mathcal{X}} + R^2_{\mathcal{Y}} \right) \left(\frac{L_{\mathcal{X}}^2}{\kappa_{\mathcal{X}}} + \frac{L_{\mathcal{Y}}^2}{\kappa_{\mathcal{Y}}} \right) }{t}}.\]

Proof: First we endow \mathcal{Z} with the norm defined by

    \[\|z\|_{\mathcal{Z}} = \sqrt{\kappa_{\mathcal{X}} \|x\|_{\mathcal{X}}^2 + \kappa_{\mathcal{Y}} \|y\|_{\mathcal{Y}}^2} .\]

It is immediate that \Phi is 1-strongly convex with respect to \|\cdot\|_{\mathcal{Z}} on \mathcal{Z} \cap \mathcal{D}. Furthermore one can easily check that

    \[\|z\|_{\mathcal{Z},*} = \sqrt{\frac1{\kappa_{\mathcal{X}}} \|x\|_{\mathcal{X},*}^2 + \frac1{\kappa_{\mathcal{Y}}} \|y\|_{\mathcal{Y},*}^2} ,\]

and thus, since the Lipschitz assumptions give \|g_x\|_{\mathcal{X},*} \leq L_{\mathcal{X}} and \|g_y\|_{\mathcal{Y},*} \leq L_{\mathcal{Y}}, the vector field (g_t) used in SP Mirror Descent satisfies:

    \[\|g_t\|_{\mathcal{Z},*} \leq \sqrt{\frac{L_{\mathcal{X}}^2}{\kappa_{\mathcal{X}}} + \frac{L_{\mathcal{Y}}^2}{\kappa_{\mathcal{Y}}}} .\]

It suffices now to use the exact same proof as for the standard Mirror Descent, starting with the following inequalities (the first one follows from the convexity of \phi(\cdot, y) and the concavity of \phi(x, \cdot)):

    \begin{eqnarray*} \phi\left( \frac1{t} \sum_{s=1}^t x_s,y \right) - \phi\left(x, \frac1{t} \sum_{s=1}^t y_s \right) & \leq & \frac{1}{t} \sum_{s=1}^t \bigg( \phi\left(x_s,y \right) - \phi\left(x, y_s \right) \bigg) \\ & \leq & \frac{1}{t} \sum_{s=1}^t g_s^{\top} (z_s - z) . \end{eqnarray*}

\Box
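As a concrete instance of the theorem, consider again the matrix game \phi(x,y) = x^{\top} A y on \Delta_n \times \Delta_m with |A_{i,j}| \leq 1, and take the negative entropy as the mirror map on each simplex. Then \kappa_{\mathcal{X}} = \kappa_{\mathcal{Y}} = 1, R^2_{\mathcal{X}} = \log n, R^2_{\mathcal{Y}} = \log m, and L_{\mathcal{X}}, L_{\mathcal{Y}} \leq 1 (since \nabla_x \phi = A y is bounded in \ell_{\infty}-norm by 1 on \Delta_m, and similarly for \nabla_y \phi), so the theorem yields

    \[\max_{y \in \Delta_m} \phi\left( \frac1{t} \sum_{s=1}^t x_s,y \right) - \min_{x \in \Delta_n} \phi\left(x, \frac1{t} \sum_{s=1}^t y_s \right) \leq 2 \sqrt{\frac{\log n + \log m}{t}} .\]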
