ORF523: Conditional Gradient Descent and Structured Sparsity

In the following table we summarize our findings of previous lectures in terms of oracle convergence rates (we denote $R = \|x_1 - x^*\|_2$ ).

Note that in the last two lines the upper bounds and lower bounds are not matching. In both cases one can show that in fact the lower bound is tight (using respectively Nesterov’s Accelerated Gradient Descent, and the Center of Gravity Method).

Thus in terms of oracle complexity the picture is now complete. However this is not quite the end of the story and the algorithms can be improved in various ways:

The projection step in Projected Subgradient Descent can be expensive in terms of computational complexity. We will see an algorithm that avoids this projection, called Conditional Gradient Descent.
The black-box model is essentially an abstraction (note that this is true for $1^{st}$ -order oracles, but the situation is different for $0^{th}$ -order oracles). In practice one often knows the function to be optimized entirely. In that context it seems very wasteful to drop all this information and focus on black-box procedures. However by doing so we were able to derive linear time algorithms, while the ‘structural’ Interior Point Methods (which use the form of the function to be optimized by deriving an appropriate self-concordant barrier) are not linear time. We will see one example of an algorithm that makes partial use of the structural information on the function to be optimized, while still being essentially a gradient-based procedure with linear time complexity. More specifically we will study the FISTA algorithm, which can optimize functions of the form $f(x) + g(x)$ , where $g$ is known and $f$ can be accessed with a $1^{st}$ -order oracle. Another very interesting idea in that direction is Nesterov’s smoothing, but unfortunately we will not have time to cover it.
The rates of convergence we proved did not explicitly involve the dimension $n$ . However in specific instances some parameters may depend on the dimension. For instance consider the case where the magnitude of the subgradients of $f$ is controlled coordinate-wise, say $\|g\|_{\infty} \leq 1$ for all $g \in \partial f(x)$ . In that case the best one can say is that $\|g\|_2 \leq \sqrt{n}$ , which leads to a rate of convergence for Projected Subgradient Descent of order $\sqrt{n / t}$ . We will see that this situation can be dramatically improved and that one can obtain in this setting a rate of order $\sqrt{ \log (n) / t}$ . More generally we will describe the Mirror Descent algorithm, which is adapted to convex functions that are $L$ -Lipschitz in an arbitrary norm $\| \cdot \|$ (not necessarily $\ell_2$ ).

We start this program with a projection-free algorithm for black-box constrained optimization.

Conditional Gradient Descent (a.k.a. Frank-Wolfe algorithm)

Conditional Gradient Descent is an algorithm designed to minimize a smooth convex function $f$ over a compact convex set $\mathcal{X}$ . It performs the following update for $t \geq 1$ :

$\begin{align*} &y_{t} \in \mathrm{argmin}_{y \in \mathcal{X}} \nabla f(x_t)^{\top} y \\ & x_{t+1} = (1 - \gamma_t) x_t + \gamma_t y_t . \end{align*}$
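As an illustration, the update can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the names `frank_wolfe` and `simplex_lmo`, the choice of the simplex as $\mathcal{X}$ , and the quadratic objective are hypothetical choices made for the example, and the step size $\gamma_t = \frac{2}{t+1}$ is the one used in the theorem stated later in this section.

```python
import numpy as np

def frank_wolfe(grad, lmo, x1, num_steps):
    """Run x_{t+1} = (1 - gamma_t) x_t + gamma_t y_t with gamma_t = 2 / (t + 1)."""
    x = np.array(x1, dtype=float)
    for t in range(1, num_steps + 1):
        y = lmo(grad(x))                 # y_t in argmin_{y in X} grad f(x_t)^T y
        gamma = 2.0 / (t + 1)
        x = (1.0 - gamma) * x + gamma * y
    return x

def simplex_lmo(g):
    """Linear minimization over the simplex: the vertex e_i with i = argmin_i g_i."""
    y = np.zeros_like(g)
    y[np.argmin(g)] = 1.0
    return y

# Hypothetical objective (an assumption for this example):
# f(x) = 0.5 * ||x - c||_2^2 with c in the simplex, so that grad f(x) = x - c.
c = np.array([0.2, 0.5, 0.3])
x1 = np.array([1.0, 0.0, 0.0])
x = frank_wolfe(lambda x: x - c, simplex_lmo, x1, num_steps=200)
```

Note that the only access to $\mathcal{X}$ is through the linear minimization routine `lmo`; no projection is ever computed.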

This algorithm goes back to Frank and Wolfe (1956), and the following result is extracted from Jaggi (2013). We consider here functions which are $\beta$ -smooth in some arbitrary norm $\|\cdot\|$ , that is

$\|\nabla f(x) - \nabla f(y) \|_* \leq \beta \|x-y\| ,$

where the dual norm $\|\cdot\|_*$ is defined as $\|g\|_* = \sup_{x \in \mathbb{R}^n : \|x\| \leq 1} g^{\top} x$ .

Theorem. Let $f$ be a convex and $\beta$ -smooth function w.r.t. some norm $\|\cdot\|$ , $R = \sup_{x, y \in \mathcal{X}} \|x - y\|$ , and $\gamma_s = \frac{2}{s+1}$ . Then for any $t \geq 2$ , one has

$f(x_t) - f(x^*) \leq \frac{2 \beta R^2}{t+1} .$

Proof: The following sequence of inequalities holds true, using respectively $\beta$ -smoothness (which implies $f(x) - f(y) \leq \nabla f(y)^{\top}(x-y) + \frac{\beta}{2} \|x-y\|^2$ ), the definition of $x_{s+1}$ , the definition of $y_s$ , and the convexity of $f$ :

$\begin{eqnarray*} f(x_{s+1}) & \leq & f(x_s) + \nabla f(x_s)^{\top} (x_{s+1} - x_s) + \frac{\beta}{2} \|x_{s+1} - x_s\|^2 \\ & \leq & f(x_s) + \gamma_s \nabla f(x_s)^{\top} (y_{s} - x_s) + \frac{\beta}{2} \gamma_s^2 R^2 \\ & \leq & f(x_s) + \gamma_s \nabla f(x_s)^{\top} (x^* - x_s) + \frac{\beta}{2} \gamma_s^2 R^2 \\ & \leq & f(x_s) + \gamma_s (f(x^*) - f(x_s)) + \frac{\beta}{2} \gamma_s^2 R^2 . \end{eqnarray*}$

Rewriting this inequality in terms of $\delta_s = f(x_s) - f(x^*)$ one obtains

$\delta_{s+1} \leq (1 - \gamma_s) \delta_s + \frac{\beta}{2} \gamma_s^2 R^2 .$

A simple induction using that $\gamma_s = \frac{2}{s+1}$ finishes the proof (note that the initialization is done at step $2$ with the above inequality yielding $\delta_2 \leq \frac{\beta}{2} R^2$ ).

$\Box$
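For completeness, the induction step of the proof can be written out as follows (a routine verification). Assume that $\delta_s \leq \frac{2 \beta R^2}{s+1}$ for some $s \geq 2$ . Then with $\gamma_s = \frac{2}{s+1}$ the recursion gives

$\begin{eqnarray*} \delta_{s+1} & \leq & \left(1 - \frac{2}{s+1}\right) \frac{2 \beta R^2}{s+1} + \frac{\beta}{2} \frac{4}{(s+1)^2} R^2 \\ & = & 2 \beta R^2 \frac{(s-1) + 1}{(s+1)^2} \\ & = & 2 \beta R^2 \frac{s}{(s+1)^2} \\ & \leq & \frac{2 \beta R^2}{s+2} , \end{eqnarray*}$

where the last step uses $s (s+2) \leq (s+1)^2$ . The base case holds since $\delta_2 \leq \frac{\beta}{2} R^2 \leq \frac{2 \beta R^2}{3}$ .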

The rate of convergence we just proved for Conditional Gradient Descent is no better than the rate of Projected Gradient Descent in the same setting. However, in contrast to the latter algorithm, where a quadratic programming problem over $\mathcal{X}$ has to be solved (the projection step), here we only need to solve a linear problem over $\mathcal{X}$ , and in some cases this can be much simpler.

While being projection-free is an important property of Conditional Gradient Descent, a perhaps even more important property is that it produces sparse iterates in the following sense: Assume that $\mathcal{X}$ is a polytope with sparse vertices, that is the number of non-zero coordinates (which we denote by the $\ell_0$ -norm $\|\cdot\|_0$ which is not a norm despite the name) in a fixed vertex is small, say of order $s \ll n$ . Then at each iteration, the new iterate $x_{t+1}$ will increase its number of non-zero coordinates by at most $s$ . In particular one has $\|x_{t+1}\|_0 \leq \|x_1\|_0 + t s$ . We will see now an example of application where this property is critical to obtain a computationally tractable algorithm.

Note that the property described above proves in particular that for a smooth function $f$ on the simplex $\{x \in \mathbb{R}_+^n : \sum_{i=1}^n x_i = 1\}$ there always exist sparse approximate solutions. More precisely there must exist a point $x$ with only $t$ non-zero coordinates and such that $f(x) - f(x^*) = O(1/t)$ . Clearly this is the best one can hope for in general, as it can be seen with the function $f(x) = \|x\|^2_2$ (since by Cauchy-Schwarz one has $\|x\|_1 \leq \sqrt{\|x\|_0} \|x\|_2$ which implies on the simplex $\|x\|_2^2 \geq 1 / \|x\|_0$ ).
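The following sketch illustrates both the sparsity of the iterates and the lower bound $\|x\|_2^2 \geq 1 / \|x\|_0$ on the example $f(x) = \|x\|_2^2$ over the simplex. The function name `frank_wolfe_sq_norm`, the dimension and the number of steps are assumptions made for the illustration, and the starting point is taken to be a vertex.

```python
import numpy as np

def frank_wolfe_sq_norm(n, steps):
    """Conditional Gradient Descent on f(x) = ||x||_2^2 over the simplex in R^n."""
    x = np.zeros(n)
    x[0] = 1.0                          # x_1 is a vertex, so ||x_1||_0 = 1
    for t in range(1, steps + 1):
        i = int(np.argmin(2.0 * x))     # vertex e_i minimising grad f(x)^T y
        gamma = 2.0 / (t + 1)
        x *= 1.0 - gamma
        x[i] += gamma                   # at most one new non-zero coordinate
    return x

n, steps = 1000, 20
x = frank_wolfe_sq_norm(n, steps)
nnz = int(np.count_nonzero(x))          # number of non-zero coordinates
gap = float(x @ x - 1.0 / n)            # f(x^*) = 1/n at the uniform distribution
```

Here each vertex of the simplex has a single non-zero coordinate, so `nnz` is at most `1 + steps`, while `x @ x` can never drop below `1 / nnz`.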

An application of Conditional Gradient Descent: Least-squares regression with structured sparsity

This section is inspired by an open problem from this paper by Gabor Lugosi (what is described below solves the open problem). Consider the problem of approximating a signal $Y \in \mathbb{R}^n$ by a ‘small’ combination of elementary dictionary elements $d_1, \hdots, d_N \in \mathbb{R}^n$ . One way to do this is to consider a LASSO type problem in dimension $N$ of the following form (we denote here $x(i)$ for the $i^{th}$ coordinate of a vector $x$ )

$\min_{x \in \mathbb{R}^N} \big\| Y - \sum_{i=1}^N x(i) d_i \big\|_2^2 + \lambda \|x\|_1 .$

Let $D \in \mathbb{R}^{n \times N}$ be the dictionary matrix with $i^{th}$ column given by $d_i$ . Instead of considering the penalized version of the problem one could look at the following constrained problem on which we will focus now:

$\begin{eqnarray*} \min_{x \in \mathbb{R}^N} \| Y - D x \|_2^2 & \qquad \Leftrightarrow \qquad & \min_{x \in \mathbb{R}^N} \| Y / s - D x \|_2^2 \\ \text{subject to} \; \|x\|_1 \leq s & & \text{subject to} \; \|x\|_1 \leq 1 . \end{eqnarray*}$
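The equivalence between the two formulations can be checked with the change of variables $x = s x'$ (the symbol $x'$ is introduced here only for this verification):

$\| Y - D x \|_2^2 = \| Y - s D x' \|_2^2 = s^2 \| Y / s - D x' \|_2^2 , \qquad \|x\|_1 \leq s \; \Leftrightarrow \; \|x'\|_1 \leq 1 .$

Since $s^2$ is a positive constant, $x'$ minimizes the right-hand problem if and only if $s x'$ minimizes the left-hand one.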

We will now make some assumptions on the dictionary. We are interested in cases where the size of the dictionary $N$ can be very large, potentially exponential in the ambient dimension $n$ . Nonetheless we want to restrict our attention to algorithms that run in reasonable time with respect to the ambient dimension $n$ , that is we want polynomial time algorithms in $n$ . Of course in general this is impossible, and we need to assume that the dictionary has some structure that can be exploited. Here we make the assumption that one can do linear optimization over the dictionary in polynomial time in $n$ . More precisely we assume that we can solve in time $p(n)$ (where $p$ is polynomial) the following problem for any $y \in \mathbb{R}^n$ :

$\min_{1 \leq i \leq N} y^{\top} d_i .$

This assumption is met for many combinatorial dictionaries. For instance the dictionary elements could be the incidence vectors of spanning trees in some fixed graph, and then the linear optimization problem can be solved with a greedy algorithm.
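To make the spanning tree example concrete, here is a minimal sketch of such a greedy routine, namely Kruskal's algorithm with a union-find structure. The function name, the edge-list representation and the small graph are assumptions made for the illustration; the ambient dimension $n$ is the number of edges, and each $d_i$ is the 0/1 incidence vector of a spanning tree.

```python
def min_spanning_tree_incidence(num_vertices, edges, y):
    """Return a spanning-tree incidence vector d minimising y^T d.

    edges: list of (u, v) pairs; y: one weight per edge (weights may be negative).
    """
    parent = list(range(num_vertices))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path halving
            u = parent[u]
        return u

    d = [0] * len(edges)
    # Greedy rule: scan edges by increasing weight, keep those joining two components.
    for k in sorted(range(len(edges)), key=lambda k: y[k]):
        ru, rv = find(edges[k][0]), find(edges[k][1])
        if ru != rv:
            parent[ru] = rv
            d[k] = 1
    return d

# Hypothetical graph: a 4-cycle with one chord, so n = 5 edges.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
y = [0.5, -1.0, 2.0, 0.1, 0.3]
d = min_spanning_tree_incidence(4, edges, y)
```

The number of spanning trees $N$ can be exponential in the number of edges, yet the routine only sorts the edges and performs one pass over them.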

Finally we assume that the $\ell_2$ -norms of the dictionary elements are controlled by some $m>0$ , that is $\|d_i\|_2 \leq m, \forall i \in [N]$ .

Recall that one wants to minimize the function $f(x) = \frac{1}{2} \| Y - D x \|^2_2$ on the $\ell_1$ -ball of $\mathbb{R}^N$ in polynomial time in $n$ . Note that at first sight this task may seem completely impossible, as we are not even allowed to write down entirely a vector $x \in \mathbb{R}^N$ (since this would take time linear in $N$ ). The key property that will save us is that this function admits sparse approximate minimizers, as we discussed in the previous section, and this will be exploited by Conditional Gradient Descent. First note that

$\nabla f(x) = D^{\top} (D x - Y).$

This form of the gradient, together with our computational assumption on the dictionary, implies that Conditional Gradient Descent can be run in polynomial time. Indeed, assume that $z_t = D x_t - Y \in \mathbb{R}^n$ is already computed. Since $[\nabla f(x_t)]_i = d_i^{\top} z_t$ , the first step of Conditional Gradient Descent is to find the coordinate $i_t \in [N]$ that maximizes $|[\nabla f(x_t)]_i|$ , which can be done by maximizing $d_i^{\top} z_t$ and $- d_i^{\top} z_t$ (two calls to the linear optimization routine). Thus this first step takes time $O(p(n))$ . Computing $x_{t+1}$ from $x_t$ and $i_{t}$ takes time $O(t)$ since $\|x_t\|_0 \leq t$ , and computing $z_{t+1}$ from $z_t$ and $i_t$ takes time $O(n)$ . Thus the overall time complexity of $t$ steps is $O(t p(n) + t^2)$ (the $O(n)$ cost per step is absorbed in $p(n)$ : since reading the input $y \in \mathbb{R}^n$ already takes time $n$ , we may assume $p(n) \geq n$ ).
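The bookkeeping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the names `sparse_frank_wolfe`, `oracle` and `column` are hypothetical, the starting point is taken to be $x_1 = 0$ , and a small explicit random matrix stands in for a structured dictionary (with a brute-force `argmin` playing the role of the $p(n)$ -time routine).

```python
import numpy as np

def sparse_frank_wolfe(Y, oracle, column, num_steps):
    """Conditional Gradient Descent for f(x) = 0.5 * ||Y - D x||_2^2 over the l1-ball.

    oracle(v) returns an index i minimising v^T d_i; column(i) returns d_i.
    x is stored as a dict {index: value}, so D is never formed inside the loop.
    """
    Y = np.asarray(Y, dtype=float)
    x = {}                                # x_1 = 0
    z = -Y                                # z_1 = D x_1 - Y
    for t in range(1, num_steps + 1):
        i_pos, i_neg = oracle(-z), oracle(z)      # maximise d_i^T z and -d_i^T z
        g_pos, g_neg = column(i_pos) @ z, column(i_neg) @ z
        if g_pos >= -g_neg:
            i, sign = i_pos, -1.0         # gradient coordinate is positive: y_t = -e_i
        else:
            i, sign = i_neg, 1.0          # gradient coordinate is negative: y_t = +e_i
        gamma = 2.0 / (t + 1)
        x = {j: (1.0 - gamma) * v for j, v in x.items()}   # O(t) work
        x[i] = x.get(i, 0.0) + gamma * sign
        # z_{t+1} = (1 - gamma) z_t + gamma (sign * d_i - Y), an O(n) update
        z = (1.0 - gamma) * z + gamma * (sign * column(i) - Y)
    return x, z

# Stand-in dictionary (an assumption for the demo): explicit and small.
rng = np.random.default_rng(0)
D = rng.standard_normal((5, 30))
Y = rng.standard_normal(5)
x, z = sparse_frank_wolfe(Y, lambda v: int(np.argmin(v @ D)), lambda i: D[:, i], 50)
```

With a structured dictionary one would replace the two lambdas by a combinatorial routine and a function that builds $d_i$ on demand; the loop itself never touches a vector of length $N$ .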

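To derive a rate of convergence it remains to study the smoothness of $f$ . This can be done as follows, using the triangle inequality and then Cauchy-Schwarz ( $|d_i^{\top} d_j| \leq \|d_i\|_2 \|d_j\|_2 \leq m^2$ ):

$\begin{eqnarray*} \| \nabla f(x) - \nabla f(y) \|_{\infty} & = & \|D^{\top} D (x-y) \|_{\infty} \\ & = & \max_{1 \leq i \leq N} \bigg| d_i^{\top} \left(\sum_{j=1}^N d_j (x(j) - y(j))\right) \bigg| \\ & \leq & \max_{1 \leq i \leq N} \sum_{j=1}^N |d_i^{\top} d_j| \, |x(j) - y(j)| \\ & \leq & m^2 \|x-y\|_1 , \end{eqnarray*}$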

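which means that $f$ is $m^2$ -smooth with respect to the $\ell_1$ norm (whose dual norm is the $\ell_{\infty}$ norm). Since the $\ell_1$ -ball has diameter $R = 2$ in the $\ell_1$ norm, the theorem above with $\beta = m^2$ gives the following rate of convergence:

$f(x_t) - f(x^*) \leq \frac{8 m^2}{t+1} .$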

In other words we proved that one can get an $\epsilon$ -optimal solution to our original problem with a computational effort of $O(m^2 p(n)/\epsilon + m^4/\epsilon^2)$ using the Conditional Gradient Descent.

