
ORF523: Mirror Descent, part II/II

We start with some of the standard setups for the Mirror Descent algorithm we described in the previous post.

Standard setups for Mirror Descent

‘Ball’ setup. The simplest version of Mirror Descent is obtained by taking \Phi(x) = \frac{1}{2} \|x\|^2_2 on \mathcal{D} = \mathbb{R}^n. The function \Phi is a mirror map, 1-strongly convex w.r.t. \|\cdot\|_2, and the associated Bregman divergence is given by D_{\Phi}(x,y) = \frac{1}{2}\|x - y\|_2^2. Thus in that case Mirror Descent is exactly equivalent to Projected Subgradient Descent, and the rate of convergence obtained in the previous post recovers our earlier result on Projected Subgradient Descent.
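For concreteness, here is the short computation behind this equivalence: since \nabla \Phi(x) = x, the gradient update and the Bregman projection reduce to a subgradient step followed by a Euclidean projection \Pi_{\mathcal{X}} onto \mathcal{X},

    \[D_{\Phi}(x,y) = \frac{1}{2} \|x\|_2^2 - \frac{1}{2} \|y\|_2^2 - y^{\top}(x - y) = \frac{1}{2} \|x - y\|_2^2, \qquad y_{t+1} = x_t - \eta \nabla f(x_t), \quad x_{t+1} = \Pi_{\mathcal{X}}(y_{t+1}) .\]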

‘Simplex’ setup. A more interesting choice of a mirror map is given by the negative entropy

    \[\Phi(x) = \sum_{i=1}^n x_i \log x_i,\]

on \mathcal{D} = \mathbb{R}_{++}^n. In that case the gradient update \nabla \Phi(y_{t+1}) = \nabla \Phi(x_t) - \eta \nabla f(x_t) can be written equivalently as

    \[y_{t+1}(i) = x_{t}(i) \exp\big(- \eta [\nabla f(x_t) ]_i \big) , \ i=1, \hdots, n.\]

The Bregman divergence of this mirror map is given by D_{\Phi}(x,y) = \sum_{i=1}^n x_i \log \frac{x_i}{y_i} - \sum_{i=1}^n (x_i - y_i) (also known as the generalized Kullback-Leibler divergence). It is easy to verify that the projection with respect to this Bregman divergence on the simplex \Delta_n = \{x \in \mathbb{R}_+^n : \sum_{i=1}^n x_i = 1\} amounts to a simple renormalization y \mapsto y / \|y\|_1. Furthermore it is also easy to verify that \Phi is 1-strongly convex w.r.t. \|\cdot\|_1 on \Delta_n (this result is known as Pinsker’s inequality). Note also that one has x_1 = (1/n, \hdots, 1/n) and R^2 = \log n when \mathcal{X} = \Delta_n. 

The above observations imply that, when minimizing on the simplex \Delta_n a function f with subgradients bounded in \ell_{\infty}-norm, Mirror Descent with the negative entropy achieves a rate of convergence of order \sqrt{\frac{\log n}{t}}. On the other hand, regular Projected Subgradient Descent achieves only a rate of order \sqrt{\frac{n}{t}} in this case!
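As an illustration, here is a minimal Python sketch of this entropic setup (the subgradient oracle grad_f and the step size eta are assumptions made for the sake of the example, not part of the analysis above):

    import numpy as np

    def entropic_mirror_descent(grad_f, n, eta, T):
        """Mirror Descent on the simplex with the negative entropy mirror map."""
        x = np.full(n, 1.0 / n)        # x_1 = (1/n, ..., 1/n)
        avg = np.zeros(n)
        for _ in range(T):
            g = grad_f(x)              # subgradient of f at x, assumed bounded in sup-norm
            y = x * np.exp(-eta * g)   # y_{t+1}(i) = x_t(i) exp(-eta [grad f(x_t)]_i)
            x = y / y.sum()            # Bregman projection on Delta_n = l1 renormalization
            avg += x
        return avg / T                 # averaged iterate, as in the convergence theorem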

‘Spectrahedron’ setup. We consider here functions defined on matrices, and we are interested in minimizing a function f on the spectrahedron \mathcal{S}_n defined as:

    \[\mathcal{S}_n = \left\{X \in \mathbb{S}_+^n : \mathrm{Tr}(X) = 1 \right\} .\]

In this setting we consider the mirror map on \mathcal{D} = \mathbb{S}_{++}^n given by the negative von Neumann entropy:

    \[\Phi(X) = \sum_{i=1}^n \lambda_i(X) \log \lambda_i(X) ,\]

where \lambda_1(X), \hdots, \lambda_n(X) are the eigenvalues of X. It can be shown that the gradient update \nabla \Phi(Y_{t+1}) = \nabla \Phi(X_t) - \eta \nabla f(X_t) can be written equivalently as

    \[Y_{t+1} = \exp\big(\log X_t - \eta \nabla f(X_t) \big) ,\]

where the matrix exponential and matrix logarithm are defined as usual. Furthermore the projection on \mathcal{S}_n is a simple trace renormalization.

Via a highly non-trivial computation, one can show that \Phi is \frac{1}{2}-strongly convex with respect to the Schatten 1-norm defined as

    \[\|X\|_1 = \sum_{i=1}^n |\lambda_i(X)|.\]

It is easy to see that one has X_1 = \frac{1}{n} \mathrm{I}_n and R^2 = \log n for \mathcal{X} = \mathcal{S}_n. In other words the rate of convergence for optimization on the spectrahedron is the same as on the simplex!
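Again as an illustration, a minimal Python sketch of the spectrahedron setup (the symmetric subgradient oracle grad_f is an assumption; the matrix exponential and logarithm are computed through an eigendecomposition, which is legitimate here since the iterates stay symmetric positive definite):

    import numpy as np

    def spectrahedron_mirror_descent(grad_f, n, eta, T):
        """Mirror Descent on the spectrahedron with the negative von Neumann entropy."""
        X = np.eye(n) / n                           # X_1 = I/n
        avg = np.zeros((n, n))
        for _ in range(T):
            G = grad_f(X)                           # symmetric subgradient of f at X (assumption)
            w, V = np.linalg.eigh(X)                # X_t = V diag(w) V^T with w > 0
            A = V @ np.diag(np.log(w)) @ V.T - eta * G
            w2, V2 = np.linalg.eigh(A)
            Y = V2 @ np.diag(np.exp(w2)) @ V2.T     # Y_{t+1} = exp(log X_t - eta grad f(X_t))
            X = Y / np.trace(Y)                     # projection on S_n = trace renormalization
            avg += X
        return avg / T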

 

Mirror Descent for Saddle Points

We consider now the following saddle-point problem: Let \mathcal{X} \subset \mathbb{R}^n, \mathcal{Y} \subset \mathbb{R}^m be compact convex sets, and \phi : \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R} be a continuous function such that \phi(\cdot, y) is convex and \phi(x, \cdot) is concave. We are interested in finding z^*=(x^*, y^*) \in \mathcal{X} \times \mathcal{Y} =: \mathcal{Z} such that

    \[\phi(x^*,y^*) = \inf_{x \in \mathcal{X}} \sup_{y \in \mathcal{Y}} \phi(x,y) = \sup_{y \in \mathcal{Y}} \inf_{x \in \mathcal{X}} \phi(x,y) .\]

(Note that the above equality uses Sion’s minimax theorem from this lecture.) We measure the quality of a candidate solution z=(x,y) through the primal/dual gap

    \[\max_{y' \in \mathcal{Y}} \phi(x,y') - \min_{x' \in \mathcal{X}} \phi(x',y) .\]

The key observation now is that for any x, y, x', y', one has, with g_x \in \partial_x \phi(x,y) and -g_y \in \partial_y (- \phi(x,y)),

    \[\phi(x,y) - \phi(x',y) \leq g_x^{\top} (x-x'),\]

and

    \[- \phi(x,y) - (- \phi(x,y')) \leq (- g_y)^{\top} (y-y') .\]

In particular, denoting g_z = (g_x, - g_y) we just proved

    \[\max_{y' \in \mathcal{Y}} \phi(x,y') - \min_{x' \in \mathcal{X}} \phi(x',y) \leq g_z^{\top} (z - z') ,\]

for some z' \in \mathcal{Z}. This inequality suggests that one can solve the saddle-point problem with Mirror Descent in the space \mathcal{Z} using the ‘vector field’ given by g_z. Precisely, this gives the following SP Mirror Descent algorithm:

Let \Phi_{\mathcal{X}} be a mirror map on \mathcal{X} (defined on \mathcal{D_{\mathcal{X}}}) and let \Phi_{\mathcal{Y}} be a mirror map on \mathcal{Y} (defined on \mathcal{D_{\mathcal{Y}}}). We consider now the mirror map \Phi((x,y)) = \Phi_{\mathcal{X}}(x) + \Phi_{\mathcal{Y}}(y) on \mathcal{Z} (defined on \mathcal{D} = \mathcal{D_{\mathcal{X}}} \times \mathcal{D_{\mathcal{Y}}}).

Let z_1 \in \mathrm{argmin}_{z \in \mathcal{Z} \cap \mathcal{D}} \Phi(z). Then for t \geq 1, let w_{t+1} \in \mathcal{D} be such that

    \[\nabla \Phi(w_{t+1}) = \nabla \Phi(z_{t}) - \eta g_t,\]

where g_t = (g_x, - g_y) with g_x \in \partial_x \phi(x_t,y_t) and -g_y \in \partial_y (- \phi(x_t,y_t)). Finally let

    \[z_{t+1} \in \mathrm{argmin}_{z \in \mathcal{Z} \cap \mathcal{D}} D_{\Phi}(z,w_{t+1}) .\]
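As a concrete illustration, here is a minimal Python sketch of SP Mirror Descent for a matrix game \phi(x,y) = x^{\top} A y on \Delta_n \times \Delta_m, taking the negative entropy as the mirror map on both factors (the payoff matrix A and the step size eta are assumptions made for the example):

    import numpy as np

    def sp_mirror_descent_matrix_game(A, eta, T):
        """SP Mirror Descent for phi(x, y) = x^T A y on Delta_n x Delta_m."""
        n, m = A.shape
        x, y = np.full(n, 1.0 / n), np.full(m, 1.0 / m)   # z_1 = argmin Phi: uniform on each simplex
        x_avg, y_avg = np.zeros(n), np.zeros(m)
        for _ in range(T):
            gx = A @ y                   # g_x, gradient of phi(., y_t) at x_t
            gy = A.T @ x                 # g_y, gradient of phi(x_t, .) at y_t; g_t = (g_x, -g_y)
            x = x * np.exp(-eta * gx); x /= x.sum()   # entropic step + projection (descent in x)
            y = y * np.exp(+eta * gy); y /= y.sum()   # entropic step + projection (ascent in y)
            x_avg += x; y_avg += y
        return x_avg / T, y_avg / T      # averaged iterates appearing in the theorem below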

The following result describes the performance of SP Mirror Descent.

Theorem. Assume that \Phi_{\mathcal{X}} is \kappa_{\mathcal{X}}-strongly convex on \mathcal{X} \cap \mathcal{D}_{\mathcal{X}} with respect to \|\cdot\|_{\mathcal{X}}, and let R^2_{\mathcal{X}} = \sup_{x \in \mathcal{X} \cap \mathcal{D}_{\mathcal{X}}} \Phi_{\mathcal{X}}(x) - \Phi_{\mathcal{X}}(x_1). Assume that for every y \in \mathcal{Y} the function \phi(\cdot, y) is L_{\mathcal{X}}-Lipschitz w.r.t. \|\cdot\|_{\mathcal{X}}. Define \kappa_{\mathcal{Y}}, R^2_{\mathcal{Y}}, L_{\mathcal{Y}} similarly. Then SP Mirror Descent with \eta = \sqrt{\frac{R^2_{\mathcal{X}} + R^2_{\mathcal{Y}}}{\frac{L_{\mathcal{X}}^2}{\kappa_{\mathcal{X}}} + \frac{L_{\mathcal{Y}}^2}{\kappa_{\mathcal{Y}}}}} \sqrt{\frac{2}{t}} satisfies

    \[\max_{y \in \mathcal{Y}} \phi\left( \frac1{t} \sum_{s=1}^t x_s,y \right) - \min_{x \in \mathcal{X}} \phi\left(x, \frac1{t} \sum_{s=1}^t y_s \right) \leq \sqrt{\frac{2 \left( R^2_{\mathcal{X}} + R^2_{\mathcal{Y}} \right) \left(\frac{L_{\mathcal{X}}^2}{\kappa_{\mathcal{X}}} + \frac{L_{\mathcal{Y}}^2}{\kappa_{\mathcal{Y}}} \right) }{t}}.\]

Proof: First we endow \mathcal{Z} with the norm defined by

    \[\|z\|_{\mathcal{Z}} = \sqrt{\kappa_{\mathcal{X}} \|x\|_{\mathcal{X}}^2 + \kappa_{\mathcal{Y}} \|y\|_{\mathcal{Y}}^2} .\]

It is immediate that \Phi is 1-strongly convex with respect to \|\cdot\|_{\mathcal{Z}} on \mathcal{Z} \cap \mathcal{D}. Furthermore one can easily check that

    \[\|z\|_{\mathcal{Z},*} = \sqrt{\frac1{\kappa_{\mathcal{X}}} \|x\|_{\mathcal{X},*}^2 + \frac1{\kappa_{\mathcal{Y}}} \|y\|_{\mathcal{Y},*}^2} ,\]

and thus, since the Lipschitz assumptions give \|g_x\|_{\mathcal{X},*} \leq L_{\mathcal{X}} and \|g_y\|_{\mathcal{Y},*} \leq L_{\mathcal{Y}}, the vector field (g_t) used in SP Mirror Descent satisfies:

    \[\|g_t\|_{\mathcal{Z},*} \leq \sqrt{\frac{L_{\mathcal{X}}^2}{\kappa_{\mathcal{X}}} + \frac{L_{\mathcal{Y}}^2}{\kappa_{\mathcal{Y}}}} .\]

It suffices now to use the exact same proof as for the standard Mirror Descent, starting with the following inequalities (the first one follows from the convexity of \phi(\cdot, y) and the concavity of \phi(x, \cdot)):

    \begin{eqnarray*} \phi\left( \frac1{t} \sum_{s=1}^t x_s,y \right) - \phi\left(x, \frac1{t} \sum_{s=1}^t y_s \right) & \leq & \frac{1}{t} \sum_{s=1}^t \bigg( \phi\left(x_s,y \right) - \phi\left(x, y_s \right) \bigg) \\ & \leq & \frac{1}{t} \sum_{s=1}^t g_s^{\top} (z_s - z) . \end{eqnarray*}

\Box
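As a concrete instance of the theorem, consider again the matrix game \phi(x,y) = x^{\top} A y on \Delta_n \times \Delta_m with |A_{i,j}| \leq 1, and take the negative entropy as the mirror map on each simplex. Then \kappa_{\mathcal{X}} = \kappa_{\mathcal{Y}} = 1, R^2_{\mathcal{X}} = \log n, R^2_{\mathcal{Y}} = \log m, and L_{\mathcal{X}}, L_{\mathcal{Y}} \leq 1 (since \nabla_x \phi = A y is bounded in \ell_{\infty}-norm by 1 on \Delta_m, and similarly for \nabla_y \phi), so the theorem yields

    \[\max_{y \in \Delta_m} \phi\left( \frac1{t} \sum_{s=1}^t x_s,y \right) - \min_{x \in \Delta_n} \phi\left(x, \frac1{t} \sum_{s=1}^t y_s \right) \leq 2 \sqrt{\frac{\log n + \log m}{t}} .\]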
