
ORF523: Mirror Descent, part I/II

In this post we are interested in minimizing a convex function f over a compact convex set \mathcal{X} \subset \mathbb{R}^n under the assumption that f is L-Lipschitz on \mathcal{X} with respect to some arbitrary norm \|\cdot\| (precisely we assume that \forall x \in \mathcal{X}, g \in \partial f(x), \|g\|_* \leq L). As we have seen in this post, using the correct norm to measure the variation of a function can make an enormous difference. We now describe a method, invented by Nemirovski and Yudin, designed to deal with this situation.

Let us for a moment abstract the situation a little bit and forget that we are trying to do optimization in finite dimension. We already observed that Projected Subgradient Descent works in an arbitrary Hilbert space \mathcal{H} (think of \mathcal{H} = \ell_2). However we are now interested in the more general situation where we are optimizing the function in some Banach space \mathcal{B} (for example \mathcal{B} = \ell_1). In that case the Gradient Descent strategy does not even make sense: indeed the gradients (more formally the Fréchet derivatives) \nabla f(x) are elements of the dual space \mathcal{B}^*, and thus the update x - \eta \nabla f(x) cannot even be formed. We did not have this problem for optimization in a Hilbert space \mathcal{H} since by the Riesz representation theorem \mathcal{H}^* is isometric to \mathcal{H}. The great insight of Nemirovski and Yudin is that one can still do a gradient descent by first mapping the point x \in \mathcal{B} into the dual space \mathcal{B}^*, then performing the gradient update in the dual space, and finally mapping the resulting point back to the primal space \mathcal{B}. Of course the new point in the primal space might lie outside of the constraint set \mathcal{X} \subset \mathcal{B}, and thus we need a way to project it back onto the constraint set \mathcal{X}. The rest of this lecture makes this idea precise.

 

Preliminaries

We consider a non-empty convex open set \mathcal{D} \subset \mathbb{R}^n such that the constraint set \mathcal{X} is included in its closure, that is \mathcal{X} \subset \overline{\mathcal{D}}, and such that \mathcal{X} \cap \mathcal{D} \neq \emptyset.

Now we consider a continuously differentiable and strictly convex function \Phi defined on \mathcal{D}. This function \Phi will play the role of the mirror map between the primal and dual space. Precisely a point x \in \mathcal{X} \cap \mathcal{D} in the primal is mapped to \nabla \Phi(x) in the ‘dual’ (note that here all points lie in \mathbb{R}^n and the notions of ‘primal’ and ‘dual’ space only have an intuitive meaning). In other words the gradient update of Mirror Descent with the mirror map \Phi will take the form \nabla \Phi(x) - \eta \nabla f(x). Next this dual point needs to be mapped back to the primal space, which is always possible under the condition that \nabla \Phi(\mathcal{D}) = \mathbb{R}^n (weaker assumptions are possible but we will not need them), that is one can find y \in \mathcal{D} such that

    \[\nabla \Phi(y) = \nabla \Phi(x) - \eta \nabla f(x) .\]

Now this point y might lie outside of the constraint set \mathcal{X}. To project it back onto the constraint set we use the Bregman divergence associated with \Phi, defined by

    \[D_{\Phi}(x,y) = \Phi(x) - \Phi(y) - \nabla \Phi(y)^{\top} (x - y) .\]
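
Two standard examples may help fix ideas (they are not needed in what follows): the Euclidean mirror map \Phi(x) = \frac{1}{2} \|x\|_2^2 yields the squared Euclidean distance, while the negative entropy \Phi(x) = \sum_{i=1}^n x_i \log x_i yields the generalized Kullback-Leibler divergence, that is

    \[D_{\Phi}(x,y) = \frac{1}{2} \|x - y\|_2^2, \quad \text{respectively} \quad D_{\Phi}(x,y) = \sum_{i=1}^n x_i \log \frac{x_i}{y_i} - \sum_{i=1}^n (x_i - y_i) .\]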

The projection of y \in \mathcal{D} onto \mathcal{X} then takes the following form:

    \[\mathrm{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} D_{\Phi}(x,y) .\]

To ensure the existence of this projection we assume now that \Phi satisfies

    \[\lim_{x \rightarrow \partial \mathcal{D}} \|\nabla \Phi(x)\| = + \infty .\]

With this assumption, x \mapsto D_{\Phi}(x,y) is locally increasing on the boundary of \mathcal{D}, which together with the compactness of \mathcal{X} and the strict convexity of x \mapsto D_{\Phi}(x,y) implies the existence and uniqueness of the projection defined above.

We say that \Phi : \mathcal{D} \rightarrow \mathbb{R} is a mirror map if it satisfies all the above assumptions.

 

Mirror Descent

We can now describe the Mirror Descent strategy based on a mirror map \Phi. Let x_1 \in \mathrm{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} \Phi(x). Then for t \geq 1, let y_{t+1} \in \mathcal{D} such that

    \[\nabla \Phi(y_{t+1}) = \nabla \Phi(x_{t}) - \eta g_t, \ \text{where} \ g_t \in \partial f(x_t) ,\]

and

    \[x_{t+1} \in \mathrm{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} D_{\Phi}(x,y_{t+1}) .\]
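
To make these two steps concrete, here is a minimal sketch in Python of Mirror Descent on the simplex with the negative entropy mirror map \Phi(x) = \sum_i x_i \log x_i (one of the 'standard setups' of the next lecture). The function name, the toy linear objective, and the fixed step size are illustrative assumptions, not part of the discussion above.

    import numpy as np

    def mirror_descent_simplex(subgrad, x1, eta, t):
        # Mirror Descent with the negative entropy mirror map
        # Phi(x) = sum_i x_i log x_i, on D = (0, inf)^n, X = simplex.
        x = x1.copy()
        avg = np.zeros_like(x)
        for s in range(t):
            g = subgrad(x)          # g_s in the subdifferential of f at x_s
            # dual step: with grad Phi(x)_i = 1 + log x_i, the update
            # grad Phi(y) = grad Phi(x) - eta * g reads y = x * exp(-eta * g)
            y = x * np.exp(-eta * g)
            # Bregman (here: KL) projection onto the simplex = renormalization
            x = y / y.sum()
            avg += x
        return avg / t              # averaged iterate, as in the theorem below

    # toy run: minimize the linear function f(x) = c^T x over the simplex
    c = np.array([3.0, 1.0, 2.0])
    x1 = np.ones(3) / 3             # uniform point = argmin of Phi over the simplex
    print(mirror_descent_simplex(lambda x: c, x1, eta=0.1, t=1000))

Note how the abstract 'map to the dual, gradient step, map back, project' scheme reduces here to a multiplicative update followed by a renormalization.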

 

Theorem Let \Phi be a mirror map. Assume also that \Phi is \alpha-strongly convex on \mathcal{X} \cap \mathcal{D} with respect to \|\cdot\|, that is

    \[\forall x, y \in \mathcal{X} \cap \mathcal{D}, \Phi(x) - \Phi(y) \leq \nabla \Phi(x)^{\top}(x-y) - \frac{\alpha}{2} \|x-y\|^2.\]

Let R^2 = \sup_{x \in \mathcal{X} \cap \mathcal{D}} \Phi(x) - \Phi(x_1), and let f be convex and L-Lipschitz w.r.t. \|\cdot\|. Then Mirror Descent with \eta = \frac{R}{L} \sqrt{\frac{2 \alpha}{t}} satisfies

    \[f\bigg(\frac{1}{t} \sum_{s=1}^t x_s \bigg) - \min_{x \in \mathcal{X}} f(x) \leq RL \sqrt{\frac{2}{\alpha t}} .\]
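
To anticipate on the 'standard setups' of the next lecture, one instructive instantiation: on the simplex \Delta_n = \{x \in \mathbb{R}_+^n : \sum_i x_i = 1\} with the negative entropy mirror map, \Phi is 1-strongly convex w.r.t. \|\cdot\|_1 (by Pinsker's inequality) and R^2 = \log n (take x_1 uniform), so that for f with subgradients bounded by L in \|\cdot\|_{\infty} the theorem gives

    \[f\bigg(\frac{1}{t} \sum_{s=1}^t x_s \bigg) - \min_{x \in \Delta_n} f(x) \leq L \sqrt{\frac{2 \log n}{t}} ,\]

a merely logarithmic dependence on the dimension.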

 

Proof: Let x \in \mathcal{X} \cap \mathcal{D}. The claimed bound will be obtained by taking the limit x \rightarrow x^*. Now by convexity of f, the definition of Mirror Descent, and the definition of the Bregman divergence one has

    \begin{eqnarray*} f(x_s) - f(x) & \leq & g_s^{\top} (x_s - x) \\ & = & \frac{1}{\eta} (\nabla \Phi(x_s) - \nabla \Phi(y_{s+1}))^{\top} (x_s - x) \\ & = & \frac{1}{\eta} \bigg( D_{\Phi}(x, x_s) + D_{\Phi}(x_s, y_{s+1}) - D_{\Phi}(x, y_{s+1}) \bigg). \end{eqnarray*}
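
The last equality is the so-called 'three-point identity' for Bregman divergences, which follows by simply expanding the definition of D_{\Phi}: for all a, b, c,

    \[(\nabla \Phi(a) - \nabla \Phi(b))^{\top} (a - c) = D_{\Phi}(c, a) + D_{\Phi}(a, b) - D_{\Phi}(c, b) ,\]

applied here with a = x_s, b = y_{s+1}, and c = x.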

Now the key observation is that by writing down the first order optimality condition for x_{s+1} one obtains

    \[(\nabla \Phi(x_{s+1}) - \nabla \Phi(y_{s+1}))^{\top} (x_{s+1} - x) \leq 0 ,\]

which can be written equivalently as

    \[D_{\Phi}(x, y_{s+1}) \geq D_{\Phi}(x, x_{s+1}) + D_{\Phi}(x_{s+1}, y_{s+1}) .\]

Thus the term D_{\Phi}(x, x_s) - D_{\Phi}(x, x_{s+1}) will lead to a telescopic sum when summing over s=1 to s=t, and it remains to bound the other term as follows:

    \begin{align*} & D_{\Phi}(x_s, y_{s+1}) - D_{\Phi}(x_{s+1}, y_{s+1}) \\ & = \Phi(x_s) - \Phi(x_{s+1}) - \nabla \Phi(y_{s+1})^{\top} (x_{s} - x_{s+1}) \\ & \leq (\nabla \Phi(x_s) - \nabla \Phi(y_{s+1}))^{\top} (x_{s} - x_{s+1}) - \frac{\alpha}{2} \|x_s - x_{s+1}\|^2 \\ & = \eta g_s^{\top} (x_{s} - x_{s+1}) - \frac{\alpha}{2} \|x_s - x_{s+1}\|^2 \\ & \leq \eta L \|x_{s+1} - x_s\| - \frac{\alpha}{2} \|x_s - x_{s+1}\|^2 \\ & \leq \frac{(\eta L)^2}{2 \alpha}. \end{align*}

where the last inequality uses \eta L z - \frac{\alpha}{2} z^2 \leq \frac{(\eta L)^2}{2 \alpha} for any z \geq 0. Summing over s, we proved

    \[\sum_{s=1}^t \bigg(f(x_s) - f(x)\bigg) \leq \frac{D_{\Phi}(x,x_1)}{\eta} + \eta \frac{L^2 t}{2 \alpha},\]

which concludes the proof up to a trivial computation.
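
For completeness, let us spell out this computation. Since x_1 minimizes \Phi on \mathcal{X} \cap \mathcal{D}, the first order optimality condition gives \nabla \Phi(x_1)^{\top} (x - x_1) \geq 0, hence

    \[D_{\Phi}(x,x_1) = \Phi(x) - \Phi(x_1) - \nabla \Phi(x_1)^{\top} (x - x_1) \leq \Phi(x) - \Phi(x_1) \leq R^2 .\]

Plugging \eta = \frac{R}{L} \sqrt{\frac{2 \alpha}{t}} into the bound \frac{R^2}{\eta} + \eta \frac{L^2 t}{2 \alpha} makes both terms equal to R L \sqrt{\frac{t}{2 \alpha}}, and dividing by t gives the claimed rate.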

\Box

 

Mirror Descent as a proximal method

One can rewrite Mirror Descent as follows:

    \begin{eqnarray*} x_{t+1} & = & \mathrm{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} \ D_{\Phi}(x,y_{t+1}) \\ & = & \mathrm{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} \ \Phi(x) - \nabla \Phi(y_{t+1})^{\top} x \\ & = & \mathrm{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} \ \Phi(x) - (\nabla \Phi(x_{t}) - \eta g_t)^{\top} x \\ & = & \mathrm{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} \ \eta g_t^{\top} x + D_{\Phi}(x,x_t) . \\ \end{eqnarray*}

Nowadays this last expression is usually taken as the definition of Mirror Descent (see this paper by Beck and Teboulle). It gives a proximal point of view on Mirror Descent: the algorithm is trying to minimize the local linearization of the function while not moving too far away from the previous point (with distances measured via the Bregman divergence of the mirror map).
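
As a sanity check of this formulation, take the Euclidean mirror map \Phi(x) = \frac{1}{2} \|x\|_2^2 on \mathcal{D} = \mathbb{R}^n, so that D_{\Phi}(x, x_t) = \frac{1}{2} \|x - x_t\|_2^2. The proximal formulation then reads

    \[x_{t+1} = \mathrm{argmin}_{x \in \mathcal{X}} \ \eta g_t^{\top} x + \frac{1}{2} \|x - x_t\|_2^2 = \Pi_{\mathcal{X}}(x_t - \eta g_t) ,\]

that is Mirror Descent recovers exactly Projected Subgradient Descent.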

 

Next time we will see some of the ‘standard setups’ for Mirror Descent, that is pairs of (constraint set, mirror map) that ‘go well together’.
