
ORF523: Mirror Descent, part I/II

In this post we are interested in minimizing a convex function f over a compact convex set \mathcal{X} \subset \mathbb{R}^n under the assumption that f is L-Lipschitz on \mathcal{X} with respect to some arbitrary norm \|\cdot\| (precisely we assume that \forall x \in \mathcal{X}, g \in \partial f(x), \|g\|_* \leq L). As we have seen in this post, using the correct norm to measure the variation of a function can make an enormous difference. We now describe a method, invented by Nemirovski and Yudin, designed to deal with this situation.

Let us for a moment abstract the situation a little bit and forget that we are trying to do optimization in finite dimension. We already observed that Projected Subgradient Descent works in an arbitrary Hilbert space \mathcal{H} (think of \mathcal{H} = \ell_2). However we are now interested in the more general situation where we are optimizing the function in some Banach space \mathcal{B} (for example \mathcal{B} = \ell_1). In that case the Gradient Descent strategy does not even make sense: indeed the gradients (more formally the Fréchet derivatives) \nabla f(x) are elements of the dual space \mathcal{B}^*, and thus the update x - \eta \nabla f(x) cannot even be formed. We did not have this problem for optimization in a Hilbert space \mathcal{H} since by the Riesz representation theorem \mathcal{H}^* is isometric to \mathcal{H}. The great insight of Nemirovski and Yudin is that one can still do a gradient descent by first mapping the point x \in \mathcal{B} into the dual space \mathcal{B}^*, then performing the gradient update in the dual space, and finally mapping the resulting point back to the primal space \mathcal{B}. Of course the new point in the primal space might lie outside of the constraint set \mathcal{X} \subset \mathcal{B}, and thus we need a way to project it back onto the constraint set \mathcal{X}. The rest of this lecture makes this idea precise.

 

Preliminaries

We consider a non-empty convex open set \mathcal{D} \subset \mathbb{R}^n such that the constraint set \mathcal{X} is included in its closure, that is \mathcal{X} \subset \overline{\mathcal{D}}, and such that \mathcal{X} \cap \mathcal{D} \neq \emptyset.

Now we consider a continuously differentiable and strictly convex function \Phi defined on \mathcal{D}. This function \Phi will play the role of the mirror map between the primal and dual space. Precisely a point x \in \mathcal{X} \cap \mathcal{D} in the primal is mapped to \nabla \Phi(x) in the ‘dual’ (note that here all points lie in \mathbb{R}^n and the notions of ‘primal’ and ‘dual’ space only have an intuitive meaning). In other words the gradient update of Mirror Descent with the mirror map \Phi will take the form \nabla \Phi(x) - \eta \nabla f(x). Next this dual point needs to be mapped back to the primal space, which is always possible under the condition that \nabla \Phi(\mathcal{D}) = \mathbb{R}^n (weaker assumptions are possible but we will not need them), that is one can find y \in \mathcal{D} such that

    \[\nabla \Phi(y) = \nabla \Phi(x) - \eta \nabla f(x) .\]

Now this point y might lie outside of the constraint set \mathcal{X}. To project it back onto the constraint set we use the Bregman divergence associated with \Phi, defined by

    \[D_{\Phi}(x,y) = \Phi(x) - \Phi(y) - \nabla \Phi(y)^{\top} (x - y) .\]
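
Two standard examples may help fix ideas (they are not needed in what follows): the Euclidean mirror map \Phi(x) = \frac{1}{2} \|x\|_2^2 yields the squared Euclidean distance, while the negative entropy \Phi(x) = \sum_{i=1}^n x_i \log x_i yields the generalized Kullback-Leibler divergence, that is

    \[D_{\Phi}(x,y) = \frac{1}{2} \|x - y\|_2^2, \quad \text{respectively} \quad D_{\Phi}(x,y) = \sum_{i=1}^n x_i \log \frac{x_i}{y_i} - \sum_{i=1}^n (x_i - y_i) .\]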

The projection of y \in \mathcal{D} onto \mathcal{X} then takes the following form:

    \[\mathrm{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} D_{\Phi}(x,y) .\]

To ensure the existence of this projection we assume now that \Phi satisfies

    \[\lim_{x \rightarrow \partial \mathcal{D}} \|\nabla \Phi(x)\| = + \infty .\]

With this assumption, x \mapsto D_{\Phi}(x,y) is locally increasing on the boundary of \mathcal{D}, which together with the compactness of \mathcal{X} and the strict convexity of x \mapsto D_{\Phi}(x,y) implies the existence and uniqueness of the projection defined above.

We say that \Phi : \mathcal{D} \rightarrow \mathbb{R} is a mirror map if it satisfies all the above assumptions.

 

Mirror Descent

We can now describe the Mirror Descent strategy based on a mirror map \Phi. Let x_1 \in \mathrm{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} \Phi(x). Then for t \geq 1, let y_{t+1} \in \mathcal{D} such that

    \[\nabla \Phi(y_{t+1}) = \nabla \Phi(x_{t}) - \eta g_t, \ \text{where} \ g_t \in \partial f(x_t) ,\]

and

    \[x_{t+1} \in \mathrm{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} D_{\Phi}(x,y_{t+1}) .\]
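
To make these two steps concrete, here is a minimal sketch in Python of Mirror Descent on the simplex with the negative entropy mirror map \Phi(x) = \sum_i x_i \log x_i (one of the 'standard setups' of the next lecture). The function name, the toy linear objective, and the fixed step size are illustrative assumptions, not part of the discussion above.

    import numpy as np

    def mirror_descent_simplex(subgrad, x1, eta, t):
        # Mirror Descent with the negative entropy mirror map
        # Phi(x) = sum_i x_i log x_i, on D = (0, inf)^n, X = simplex.
        x = x1.copy()
        avg = np.zeros_like(x)
        for s in range(t):
            g = subgrad(x)          # g_s in the subdifferential of f at x_s
            # dual step: with grad Phi(x)_i = 1 + log x_i, the update
            # grad Phi(y) = grad Phi(x) - eta * g reads y = x * exp(-eta * g)
            y = x * np.exp(-eta * g)
            # Bregman (here: KL) projection onto the simplex = renormalization
            x = y / y.sum()
            avg += x
        return avg / t              # averaged iterate, as in the theorem below

    # toy run: minimize the linear function f(x) = c^T x over the simplex
    c = np.array([3.0, 1.0, 2.0])
    x1 = np.ones(3) / 3             # uniform point = argmin of Phi over the simplex
    print(mirror_descent_simplex(lambda x: c, x1, eta=0.1, t=1000))

Note how the abstract 'map to the dual, gradient step, map back, project' scheme reduces here to a multiplicative update followed by a renormalization.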

 

Theorem Let \Phi be a mirror map. Assume also that \Phi is \alpha-strongly convex on \mathcal{X} \cap \mathcal{D} with respect to \|\cdot\|, that is

    \[\forall x, y \in \mathcal{X} \cap \mathcal{D}, \Phi(x) - \Phi(y) \leq \nabla \Phi(x)^{\top}(x-y) - \frac{\alpha}{2} \|x-y\|^2.\]

Let R^2 = \sup_{x \in \mathcal{X} \cap \mathcal{D}} \Phi(x) - \Phi(x_1), and let f be convex and L-Lipschitz w.r.t. \|\cdot\|. Then Mirror Descent with \eta = \frac{R}{L} \sqrt{\frac{2 \alpha}{t}} satisfies

    \[f\bigg(\frac{1}{t} \sum_{s=1}^t x_s \bigg) - \min_{x \in \mathcal{X}} f(x) \leq RL \sqrt{\frac{2}{\alpha t}} .\]
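
To anticipate on the 'standard setups' of the next lecture, one instructive instantiation: on the simplex \Delta_n = \{x \in \mathbb{R}_+^n : \sum_i x_i = 1\} with the negative entropy mirror map, \Phi is 1-strongly convex w.r.t. \|\cdot\|_1 (by Pinsker's inequality) and R^2 = \log n (take x_1 uniform), so that for f with subgradients bounded by L in \|\cdot\|_{\infty} the theorem gives

    \[f\bigg(\frac{1}{t} \sum_{s=1}^t x_s \bigg) - \min_{x \in \Delta_n} f(x) \leq L \sqrt{\frac{2 \log n}{t}} ,\]

a merely logarithmic dependence on the dimension.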

 

Proof: Let x \in \mathcal{X} \cap \mathcal{D}. The claimed bound will be obtained by taking the limit x \rightarrow x^*. Now by convexity of f, the definition of Mirror Descent, and the definition of the Bregman divergence one has

    \begin{eqnarray*} f(x_s) - f(x) & \leq & g_s^{\top} (x_s - x) \\ & = & \frac{1}{\eta} (\nabla \Phi(x_s) - \nabla \Phi(y_{s+1}))^{\top} (x_s - x) \\ & = & \frac{1}{\eta} \bigg( D_{\Phi}(x, x_s) + D_{\Phi}(x_s, y_{s+1}) - D_{\Phi}(x, y_{s+1}) \bigg). \end{eqnarray*}
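
The last equality is the so-called 'three-point identity' for Bregman divergences, which follows by simply expanding the definition of D_{\Phi}: for all a, b, c,

    \[(\nabla \Phi(a) - \nabla \Phi(b))^{\top} (a - c) = D_{\Phi}(c, a) + D_{\Phi}(a, b) - D_{\Phi}(c, b) ,\]

applied here with a = x_s, b = y_{s+1}, and c = x.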

Now the key observation is that by writing down the first order optimality condition for x_{s+1} one obtains

    \[(\nabla \Phi(x_{s+1}) - \nabla \Phi(y_{s+1}))^{\top} (x_{s+1} - x) \leq 0 ,\]

which can be written equivalently as

    \[D_{\Phi}(x, y_{s+1}) \geq D_{\Phi}(x, x_{s+1}) + D_{\Phi}(x_{s+1}, y_{s+1}) .\]

Thus the term D_{\Phi}(x, x_s) - D_{\Phi}(x, x_{s+1}) will lead to a telescopic sum when summing over s=1 to s=t, and it remains to bound the other term as follows:

    \begin{align*} & D_{\Phi}(x_s, y_{s+1}) - D_{\Phi}(x_{s+1}, y_{s+1}) \\ & = \Phi(x_s) - \Phi(x_{s+1}) - \nabla \Phi(y_{s+1})^{\top} (x_{s} - x_{s+1}) \\ & \leq (\nabla \Phi(x_s) - \nabla \Phi(y_{s+1}))^{\top} (x_{s} - x_{s+1}) - \frac{\alpha}{2} \|x_s - x_{s+1}\|^2 \\ & = \eta g_s^{\top} (x_{s} - x_{s+1}) - \frac{\alpha}{2} \|x_s - x_{s+1}\|^2 \\ & \leq \eta L \|x_{s+1} - x_s\| - \frac{\alpha}{2} \|x_s - x_{s+1}\|^2 \\ & \leq \frac{(\eta L)^2}{2 \alpha}. \end{align*}

where the last inequality uses \eta L z - \frac{\alpha}{2} z^2 \leq \frac{(\eta L)^2}{2 \alpha} for any z \geq 0. Summing over s, we proved

    \[\sum_{s=1}^t \bigg(f(x_s) - f(x)\bigg) \leq \frac{D_{\Phi}(x,x_1)}{\eta} + \eta \frac{L^2 t}{2 \alpha},\]

which concludes the proof up to a trivial computation.
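
For completeness, let us spell out this computation. Since x_1 minimizes \Phi on \mathcal{X} \cap \mathcal{D}, the first order optimality condition gives \nabla \Phi(x_1)^{\top} (x - x_1) \geq 0, hence

    \[D_{\Phi}(x,x_1) = \Phi(x) - \Phi(x_1) - \nabla \Phi(x_1)^{\top} (x - x_1) \leq \Phi(x) - \Phi(x_1) \leq R^2 .\]

Plugging \eta = \frac{R}{L} \sqrt{\frac{2 \alpha}{t}} into the bound \frac{R^2}{\eta} + \eta \frac{L^2 t}{2 \alpha} makes both terms equal to R L \sqrt{\frac{t}{2 \alpha}}, and dividing by t gives the claimed rate.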

\Box

 

Mirror Descent as a proximal method

One can rewrite Mirror Descent as follows:

    \begin{eqnarray*} x_{t+1} & = & \mathrm{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} \ D_{\Phi}(x,y_{t+1}) \\ & = & \mathrm{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} \ \Phi(x) - \nabla \Phi(y_{t+1})^{\top} x \\ & = & \mathrm{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} \ \Phi(x) - (\nabla \Phi(x_{t}) - \eta g_t)^{\top} x \\ & = & \mathrm{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} \ \eta g_t^{\top} x + D_{\Phi}(x,x_t) . \\ \end{eqnarray*}

Nowadays this last expression is usually taken as the definition of Mirror Descent (see this paper by Beck and Teboulle). It gives a proximal point of view on Mirror Descent: the algorithm is trying to minimize the local linearization of the function while not moving too far away from the previous point (with distances measured via the Bregman divergence of the mirror map).
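
As a sanity check of this formulation, take the Euclidean mirror map \Phi(x) = \frac{1}{2} \|x\|_2^2 on \mathcal{D} = \mathbb{R}^n, so that D_{\Phi}(x, x_t) = \frac{1}{2} \|x - x_t\|_2^2. The proximal formulation then reads

    \[x_{t+1} = \mathrm{argmin}_{x \in \mathcal{X}} \ \eta g_t^{\top} x + \frac{1}{2} \|x - x_t\|_2^2 = \Pi_{\mathcal{X}}(x_t - \eta g_t) ,\]

that is Mirror Descent recovers exactly Projected Subgradient Descent.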

 

Next time we will see some of the ‘standard setups’ for Mirror Descent, that is pairs of (constraint set, mirror map) that ‘go well together’.
