ORF523: Mirror Prox

Today I’m going to talk about a variant of Mirror Descent invented by Nemirovski in 2004 (see this paper). This new algorithm is called Mirror Prox and it is designed to minimize a smooth convex function f over a compact convex set \mathcal{X} \subset \mathbb{R}^n. We assume that the smoothness is measured in some arbitrary norm \|\cdot\|, precisely we assume that \forall x, y \in \mathcal{X}, \|\nabla f(x) - \nabla f(y) \|_* \leq \beta \|x - y\|.

Let \Phi : \mathcal{D} \rightarrow \mathbb{R} be a mirror map on \mathcal{X} (see this post for a definition) and let x_1 \in \mathrm{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} \Phi(x). Mirror Prox is described by the following equations:

    \begin{align*} & \nabla \Phi(y_{t+1}') = \nabla \Phi(x_{t}) - \eta \nabla f(x_t), \\ \notag \\ & y_{t+1} \in \mathrm{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} D_{\Phi}(x,y_{t+1}') , \\ \notag \\ & \nabla \Phi(x_{t+1}') = \nabla \Phi(x_{t}) - \eta \nabla f(y_{t+1}), \\ \notag\\ & x_{t+1} \in \mathrm{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} D_{\Phi}(x,x_{t+1}') . \end{align*}

In words the algorithm first makes a step of Mirror Descent to go from x_t to y_{t+1}, and then it makes a similar step to obtain x_{t+1}, starting again from x_t but this time using the gradient of f evaluated at y_{t+1} (instead of x_t). The following result justifies the procedure.

Theorem. Let \Phi be a mirror map \kappa-strongly convex on \mathcal{X} \cap \mathcal{D} with respect to \|\cdot\|. Let R^2 = \sup_{x \in \mathcal{X} \cap \mathcal{D}} \Phi(x) - \Phi(x_1) and f be convex and \beta-smooth w.r.t. \|\cdot\|. Then Mirror Prox with \eta = \frac{\kappa}{\beta} satisfies (with y_1 = x_1)

    \[f\bigg(\frac{1}{t} \sum_{s=1}^t y_s \bigg) - f(x^*) \leq \frac{\beta R^2}{\kappa t} .\]

Proof: Let x \in \mathcal{X} \cap \mathcal{D}. We write

    \begin{align*} & f(y_{t+1}) - f(x) \\ & \leq \nabla f(y_{t+1})^{\top} (y_{t+1} - x) \\ & = \nabla f(y_{t+1})^{\top} (x_{t+1} - x) + \nabla f(x_t)^{\top} (y_{t+1} - x_{t+1}) \\ & \qquad + (\nabla f(y_{t+1}) - \nabla f(x_t))^{\top} (y_{t+1} - x_{t+1}) . \end{align*}

We will now bound separately these three terms. For the first one we have (by definition of the method, first-order optimality condition of the projection, and the definition of the Bregman divergence)

    \begin{align*} & \eta \nabla f(y_{t+1})^{\top} (x_{t+1} - x) \\ & = ( \nabla \Phi(x_t) - \nabla \Phi(x_{t+1}'))^{\top} (x_{t+1} - x) \\ & = ( \nabla \Phi(x_t) - \nabla \Phi(x_{t+1}))^{\top} (x_{t+1} - x) \\ & = D_{\Phi}(x,x_t) - D_{\Phi}(x, x_{t+1}) - D_{\Phi}(x_{t+1}, x_t) . \end{align*}

For the second term using the same properties than above and the strong-convexity of the mirror map one has

    \begin{align*} & \eta \nabla f(x_t)^{\top} (y_{t+1} - x_{t+1}) \\ & = ( \nabla \Phi(x_t) - \nabla \Phi(y_{t+1}'))^{\top} (y_{t+1} - x_{t+1}) \\ & = ( \nabla \Phi(x_t) - \nabla \Phi(y_{t+1}))^{\top} (y_{t+1} - x_{t+1}) \\ & = D_{\Phi}(x_{t+1},x_t) - D_{\Phi}(x_{t+1}, y_{t+1}) - D_{\Phi}(y_{t+1}, x_t) \\ & \leq D_{\Phi}(x_{t+1},x_t) - \frac{\kappa}{2} \|x_{t+1} - y_{t+1} \|^2 - \frac{\kappa}{2} \|y_{t+1} - x_t\|^2 . \end{align*}

Finally for the last term, using H\”older’s inequality, \beta-smoothness, and the fact 2 ab \leq a^2 + b^2 one has

    \begin{align*} & (\nabla f(y_{t+1}) - \nabla f(x_t))^{\top} (y_{t+1} - x_{t+1}) \\ & \leq \|\nabla f(y_{t+1}) - \nabla f(x_t)\|_* \cdot \|y_{t+1} - x_{t+1} \| \\ & \leq \beta \|y_{t+1} - x_t\| \cdot \|y_{t+1} - x_{t+1} \| \\ & \leq \frac{\beta}{2} \|y_{t+1} - x_t\|^2 + \frac{\beta}{2} \|y_{t+1} - x_{t+1} \|^2 . \end{align*}

Thus summing up these three terms and using that \eta = \frac{\kappa}{\beta} one gets

    \[f(y_{t+1}) - f(x) \leq \frac{D_{\Phi}(x,x_t) - D_{\Phi}(x,x_{t+1})}{\eta} ,\]

and this inequality is also true for t=0 (see the bound for the first term). The proof is then concluded as usual.



Mirror Prox for Saddle Points

Following the generalization of Mirror Descent for Saddle Points (MD-SP, see this post) one can propose a version of Mirror Prox for Saddle Points (MP-SP). In fact the importance of Mirror Prox lies primarily in its application to compute the saddle point of a smooth convex-convave function. Indeed, the key observation is that many non-smooth convex functions admit a smooth saddle-point representation (see examples below). With such a representation one can then use MP-SP to obtain a rate of convergence of order 1/t, despite the fact that the original function is non-smooth! Recall that we proved that for non-smooth convex functions the optimal black-box rate is of order 1/\sqrt{t}. However MP-SP is not a black-box method in the sense that it makes use of the structure of the function (by representing it as a saddle-point). Nonetheless MP-SP still enjoys the nice time-complexity property of black-box methods since it is a first-order method. It is a win-win situation!

Let us now describe more precisely MP-SP. Let \mathcal{X} \subset \mathbb{R}^n be a compact and convex set endowed with some norm \|\cdot\|_{\mathcal{X}} and a mirror map \Phi_{\mathcal{X}} (defined on \mathcal{D_{\mathcal{X}}}), and let \mathcal{Y} \subset \mathbb{R}^m be a compact and convex set endowed with some norm \|\cdot\|_{\mathcal{Y}} and a mirror map \Phi_{\mathcal{Y}} (defined on \mathcal{D_{\mathcal{Y}}}). Let \phi : \cX \times \cY \rightarrow \mathbb{R} be a continuously differentiable function such that \phi(\cdot, y) is convex and \phi(x, \cdot) is concave. We assume that \phi is (\beta_{11}, \beta_{12}, \beta_{22}, \beta_{21})-smooth in the sense that:

    \begin{align*} & \|\nabla_x \phi(x,y) - \nabla_x \phi(x',y) \|_{\mathcal{X},*} \leq \beta_{11} \|x-x'\|_{\mathcal{X}} , \\ & \|\nabla_x \phi(x,y) - \nabla_x \phi(x,y') \|_{\mathcal{X},*} \leq \beta_{12} \|y-y'\|_{\mathcal{Y}} , \\ & \|\nabla_y \phi(x,y) - \nabla_y \phi(x,y') \|_{\mathcal{Y},*} \leq \beta_{22} \|y-y'\|_{\mathcal{Y}} , \\ & \|\nabla_y \phi(x,y) - \nabla_y \phi(x',y) \|_{\mathcal{Y},*} \leq \beta_{21} \|x-x'\|_{\mathcal{X}} , \\ \end{align*}

We consider the mirror map \Phi((x,y)) = \Phi_{\mathcal{X}}(x) + \Phi_{\mathcal{Y}}(y) on \mathcal{Z} = \mathcal{X} \times \mathcal{Y} (defined on \mathcal{D} = \mathcal{D_{\mathcal{X}}} \times \mathcal{D_{\mathcal{Y}}}). Let z_1 \in \mathrm{argmin}_{z \in \mathcal{Z} \cap \mathcal{D}} \Phi(z). Then for t \geq 1, let z_t=(x_t, y_t) and w_t=(u_t,v_t) be defined by the following equations (with w_1=z_1):

    \begin{align*} & \nabla \Phi(w_{t+1}') = \nabla \Phi(z_{t}) - \eta (\nabla_x \phi(x_t, y_t), - \nabla_y \phi(x_t, y_t)), \\ \notag \\ & w_{t+1} \in \mathrm{argmin}_{z \in \mathcal{Z} \cap \mathcal{D}} D_{\Phi}(z,w_{t+1}') , \\ \notag \\ & \nabla \Phi(z_{t+1}') = \nabla \Phi(z_{t}) - \eta (\nabla_x \phi(u_t, v_t), - \nabla_y \phi(u_t, v_t)), \\ \notag\\ & z_{t+1} \in \mathrm{argmin}_{z \in \mathcal{Z} \cap \mathcal{D}} D_{\Phi}(z,z_{t+1}') . \end{align*}


Theorem. Assume that \Phi_{\mathcal{X}} is 1-strongly convex on \mathcal{X} \cap \mathcal{D}_{\mathcal{X}} with respect to \|\cdot\|_{\mathcal{X}} and let R^2_{\mathcal{X}} = \sup_{x \in \mathcal{X} \cap \mathcal{D}} \Phi_{\mathcal{X}}(x) - \inf_{x \in \mathcal{X} \cap \mathcal{D}} \Phi_{\mathcal{X}}(x) (with similar definition for \Phi_{\mathcal{Y}}). Let \beta = \max(\beta_{11}, \beta_{12}, \beta_{22}, \beta_{21}). Then SP Mirror Descent with \eta = \frac{1}{2 \beta} satisfies

    \[\max_{y \in \mathcal{Y}} \phi\left( \frac1{t} \sum_{s=1}^t u_s,y \right) - \min_{x \in \mathcal{X}} \phi\left(x, \frac1{t} \sum_{s=1}^t v_s \right) \leq \frac{2 \beta (R^2_{\mathcal{X}} + R^2_{\mathcal{X}})}{t}.\]

Proof: First we endow \mathcal{Z} with the norm defined by

    \[\|z\|_{\mathcal{Z}} = \sqrt{\|x\|_{\mathcal{X}}^2 + \|y\|_{\mathcal{Y}}^2} .\]

It is immediate that \Phi is 1-strongly convex with respect to \|\cdot\|_{\mathcal{Z}} on \mathcal{Z} \cap \mathcal{D}. Furthermore one can easily check that

    \[\|z\|_{\mathcal{Z},*} = \sqrt{\|x\|_{\mathcal{X},*}^2 + \|y\|_{\mathcal{Y},*}^2} ,\]

and thus the vector fields (\nabla_x \phi(x, y), - \nabla_y \phi(x,y)) used in MP-SP satisfy:

    \[\|(\nabla_x \phi(x, y), - \nabla_y \phi(x,y)) - (\nabla_x \phi(x', y'), - \nabla_y \phi(x',y'))\|_{\mathcal{Z},*}^2 \leq 4 \beta^2 \|(x,y) - (x',y') \|_{\mathcal{Z}}^2 .\]

It suffices now to use the exact same proof than for the standard Mirror Prox starting with:

    \begin{eqnarray*} \phi\left( \frac1{t} \sum_{s=1}^t x_s,y \right) - \phi\left(x, \frac1{t} \sum_{s=1}^t y_s \right) & \leq & \frac{1}{t} \sum_{s=1}^t \bigg( \phi\left(x_s,y \right) - \phi\left(x, y_s \right) \bigg) \\ & \leq & \frac{1}{t} \sum_{s=1}^t g_s^{\top} (z_s - z) . \end{eqnarray*}



Smooth saddle-point representation of non-smooth functions

Assume that the function f to be optimized is of the form:

    \[f(x) = \max_{1 \leq i \leq m} f_{i} (x) ,\]

where the functions f_i(x) are convex, \beta-smooth and L-Lipschitz in some norm \|\cdot\|. Minimizing f over \mathcal{X} is equivalent to finding a saddle point over \mathcal{X} \times \Delta_{m} since one has:

    \[f(x) = \max_{y \in \Delta_{m}} y^{\top} (f_1(x), \hdots, f_{m}(x)) .\]

Thus we are looking for the saddle point of \phi(x,y) = y^{\top} (f_1(x), \hdots, f_{m}(x)) which is convex in x, and concave (in fact linear!) in y. Furthermore \phi is smooth and it is an easy exercise to verify that in this case one has \beta_{22} = 0, \beta_{21} = L, \beta_{11} = \beta, and \beta_{12} = L. Thus a rate of convergence (to the minimizer of f) of order (\beta + L) (R^2_{\mathcal{X}} + \log(\ell)) /t can be obtained by applying MP-SP with appropriate mirror maps (in particular the update in the y-space has to be done with the negative entropy). Note that is rate is particularly striking since the oracle lower bound of 1/\sqrt{t} was proved precisely for this type of function! Here we are able to get around this lower bound by making use of the structure of the function. Remark also that if you think about what this algorithm is doing to optimize the function f you get a very very elegant procedure for which it is highly non-trivial a priori to prove something about its rate of convergence.

Finally note that if \beta and L are of different order of magnitudes one can obtain a better result by considering the mirror map \Phi((x,y)) = a \Phi_{\mathcal{X}}(x) + b \Phi_{\mathcal{Y}}(y) and optimizing over the values of a and b.


Another interesting example arises when \phi(x,y) = x^{\top} A y with A \in \mathbb{R}^{n \times m}, \mathcal{X} = \Delta_n and \mathcal{Y} = \Delta_m. With the \ell_1-norm on both spaces it is easy to see that \beta_{12} = \beta_{21} = \| A \|_{\mathrm{max}} (the entrywise max norm), and of course \beta_{11} = \beta_{22} = 0. Thus we obtain here a convergence rate of order \| A \|_{\mathrm{max}} (\log(n) + \log(m)) / t. Again if n and m are of different order of magnitudes one can improve the result to obtain a rate of order \| A \|_{\mathrm{max}} \sqrt{\log(n) \log(m)} / t by considering the mirror map \Phi((x,y)) = a \Phi_{\mathcal{X}}(x) + b \Phi_{\mathcal{Y}}(y) and optimizing over the values of a and b.

Many more interesting examples can be derived and we refer the reader to the original paper of Nemirovski as well as this book chapter by Juditsky and Nemirovski.

This entry was posted in Optimization. Bookmark the permalink.

2 Responses to "ORF523: Mirror Prox"