ORF523: Acceleration by randomization for a sum of smooth and strongly convex functions

In this lecture we are interested in the following problem:

    \[\min_{x \in \mathbb{R}^n} \frac{1}{m} \sum_{i=1}^m f_i(x) ,\]

where each f_i is \beta-smooth and \alpha-strongly convex. We denote by Q=\frac{\beta}{\alpha} the condition number of these functions (which is also the condition number of the average).

Using a gradient descent directly on the sum with step-size \frac{2}{\alpha + \beta} one obtains a rate of convergence of order \exp( - c t / Q) for some numerical constant c>0 (see this lecture). In particular one can reach an \epsilon-optimal point in time O(m Q N \log(1/\epsilon)) with this method, where O(N) is the time to compute the gradient of a single function f_i(x). Using Nesterov’s Accelerated Gradient Descent (see this lecture) one can reduce the time complexity to O(m \sqrt{Q} N \log(1/\epsilon)).

Using the SGD method described in the previous lecture (with the step-size/averaging described here) one can reach an \epsilon-optimal point in time O( \log(m) N / (\alpha \epsilon)) which is much worse than the basic gradient descent if one is looking for a fairly accurate solution. We are interested in designing a randomized algorithm in the spirit of SGD that would leverage the smoothness of the function to attain a linear rate of convergence (i.e., the dependency in \epsilon would be of the form \log(1/\epsilon)).

 

Stochastic Average Gradient (SAG)

We first describe the SAG method proposed in this paper by Le Roux, Schmidt and Bach (2012). The method is simple and elegant: let y_0 = 0, x_1 \in \mathbb{R}^n and for t \geq 1 let i_t be drawn uniformly at random in \{1, \hdots, m\} and

    \[y_t(i) = \left\{ \begin{array}{ccc} \nabla f_i(x_t) & \text{if} & i=i_t \\ y_{t-1}(i) & \text{if} & i\neq i_t \end{array} \right. ; \qquad x_{t+1} = x_t - \frac{\eta}{m} \sum_{i=1}^m y_t(i) .\]

They show that in the regime where m \geq 8 Q, SAG (with step-size \eta = \frac{1}{2 m \alpha} and with x_1 obtained as the average of m iterates of SGD) has a rate of convergence of order \exp( - c t / m) (see Proposition 2 in their paper for a more precise statement). Since each step has a time-complexity of O(N \log(m)) this results in an overall time-complexity of O(m N \log(m) \log(1/\epsilon)), which is much better than the complexity of the plain gradient descent (or even of Nesterov’s Accelerated Gradient Descent). Unfortunately the proof of this result is quite difficult. We will now describe another strategy with similar performance but with a much simpler analysis.
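
To make the bookkeeping concrete, here is a minimal NumPy sketch of the SAG update on a toy ridge-regularized least-squares problem; the choice of losses, data, and the absence of the SGD warm start for x_1 are illustrative assumptions, not the exact setting analyzed in the paper.

    import numpy as np

    # Toy instance: f_i(x) = 0.5*(a_i^T x - b_i)^2 + 0.5*alpha*||x||^2, so each f_i
    # is smooth and alpha-strongly convex (illustrative choice; m >= 8Q roughly holds here).
    rng = np.random.default_rng(0)
    m, n, alpha = 200, 10, 1.0
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)

    def grad_fi(i, x):
        return A[i] * (A[i] @ x - b[i]) + alpha * x

    eta = 1.0 / (2 * m * alpha)   # step-size suggested by the SAG analysis
    x = np.zeros(n)               # (the paper initializes x_1 with a few SGD iterates)
    y = np.zeros((m, n))          # table of stored gradients, y_0 = 0
    g_avg = np.zeros(n)           # running value of (1/m) * sum_i y(i)

    for t in range(5000):
        i = rng.integers(m)               # i_t uniform in {1, ..., m}
        g_new = grad_fi(i, x)
        g_avg += (g_new - y[i]) / m       # update the average in O(n) time
        y[i] = g_new                      # overwrite the stored gradient for index i
        x = x - eta * g_avg               # x_{t+1} = x_t - (eta/m) * sum_i y_t(i)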

 

Short detour through Fenchel’s duality

Before describing the next strategy we first recall a few basic facts about Fenchel duality. First the Fenchel conjugate of a function f: \mathbb{R}^n \rightarrow \mathbb{R} is defined by:

    \[f^*(u) = \sup_{x \in \mathbb{R}^n} u^{\top} x - f(x) .\]

Note that f^* is always a convex function (possibly taking value +\infty). If f is convex, then for any u \in \partial f(x) one clearly has f^*(u) + f(x) = u^{\top} x. Finally one can show that if f is \beta-smooth then f^* is \frac{1}{\beta}-strongly convex (one can use the second Lemma from this lecture).
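
As a quick sanity check of the last claim, take f(x) = \frac{\beta}{2} \|x\|_2^2; the supremum defining f^* is attained at x = u/\beta and one gets

    \[f^*(u) = \sup_{x \in \mathbb{R}^n} \left( u^{\top} x - \frac{\beta}{2} \|x\|_2^2 \right) = \frac{1}{2 \beta} \|u\|_2^2 ,\]

which is indeed \frac{1}{\beta}-strongly convex.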

 

Stochastic Dual Coordinate Ascent (SDCA)

We describe now the SDCA method proposed in this paper by Shalev-Shwartz and Zhang. We need to make a few more assumptions on the functions in the sum, and precisely we will now assume that we want to solve the following problem (note that all norms here are Euclidean):

    \[\min_{w \in \mathbb{R}^n} \frac{1}{m} \sum_{i=1}^m f_i(w^{\top} x_i) + \frac{\alpha}{2} \|w\|^2 ,\]

where

  • f_i : \mathbb{R} \rightarrow \mathbb{R} is convex, \beta-smooth.
  • f_i(s) \geq 0, \forall s \in \mathbb{R} and f_i(0) \leq 1.
  • \|x_i\| \leq 1.

Denote

    \[P(w) = \frac{1}{m} \sum_{i=1}^m f_i(w^{\top} x_i) + \frac{\alpha}{2} \|w\|^2\]

the primal objective. For the dual objective we introduce a dual variable \nu(i) for each data point x_i (as well as a temporary primal variable z(i)):

    \begin{eqnarray*} \min_{w \in \mathbb{R}^n} P(w) & = & \min_{z \in \mathbb{R}^m, w} \max_{\nu \in \mathbb{R}^m} \frac{1}{m} \sum_{i=1}^m \bigg( f_i(z(i)) + \nu(i) (z(i) - w^{\top} x_i) \bigg) + \frac{\alpha}{2} \|w\|^2 \\ & = & \max_{\nu \in \mathbb{R}^m} \min_{z \in \mathbb{R}^m, w} \frac{1}{m} \sum_{i=1}^m \bigg( f_i(z(i)) + \nu(i) (z(i) - w^{\top} x_i) \bigg) + \frac{\alpha}{2} \|w\|^2 \\ & = & \max_{\nu \in \mathbb{R}^m} \min_{w \in \mathbb{R}^n} \frac{1}{m} \sum_{i=1}^m \bigg( - f_i^*(- \nu(i)) - \nu(i) w^{\top} x_i \bigg) + \frac{\alpha}{2} \|w\|^2 . \end{eqnarray*}

Next observe that by the KKT conditions one has at the optimum (w^*, \nu^*):

    \[w^* = \frac{1}{\alpha m} \sum_{i=1}^m \nu^*(i) x_i ,\]

and thus denoting

    \[D(\nu) = \frac{1}{m} \sum_{i=1}^m - f_i^*(- \nu(i)) - \frac{\alpha}{2} \left\|\frac{1}{\alpha m} \sum_{i=1}^m \nu(i) x_i \right\|^2 ,\]

we just proved

    \[\min_{w \in \mathbb{R}^n} P(w) = \max_{\nu \in \mathbb{R}^m} D(\nu) .\]

 

We can now describe SDCA: let \nu_0 = 0 and for t \geq 0 let i_t be drawn uniformly at random in \{1, \hdots, m\} and

    \[\nu_{t+1}(i) = \left\{ \begin{array}{ccc} (1-\eta) \nu_t(i) - \eta f_i'(w_t^{\top} x_i) & \text{if} & i=i_t , \\ \nu_t(i) & \text{if} & i \neq i_t . \end{array} \right.\]

Let w_t = \frac{1}{\alpha m} \sum_{i=1}^m \nu_t(i) x_i be the corresponding primal variable. The following result justifies the procedure.

Theorem (Shalev-Shwartz and Zhang 2013) SDCA with \eta = \frac{m}{Q + m} satisfies:

    \[\mathbb{E} P(w_t) - \min_{w \in \mathbb{R}^n} P(w) \leq (Q+m) \exp\left( - \frac{t}{Q+m} \right) .\]

The above convergence rate yields an overall time-complexity of O((Q+m) N \log(m) \log((Q+m)/\epsilon)) for SDCA (where O(N) is the complexity of computing a gradient of w \mapsto f_i(w^{\top} x_i)), which can be much better than the O(m Q N \log(1/\epsilon)) complexity of the plain gradient descent.
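
Before turning to the proof, here is a minimal NumPy sketch of the SDCA update on a toy binary-classification instance with logistic losses f_i(s) = \log(1 + \exp(-b_i s)) (convex, 1/4-smooth, nonnegative, and with f_i(0) = \log 2 \leq 1); the data below are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, alpha = 200, 10, 0.1
    X = rng.standard_normal((m, n))
    X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))  # enforce ||x_i|| <= 1
    b = rng.choice([-1.0, 1.0], size=m)

    beta = 0.25                               # smoothness of the logistic loss
    def fprime(i, s):                         # derivative of f_i at s
        return -b[i] / (1.0 + np.exp(b[i] * s))

    Q = beta / alpha
    eta = m / (Q + m)                         # step-size from the theorem
    nu = np.zeros(m)                          # dual variables, nu_0 = 0
    w = np.zeros(n)                           # w_t = (1/(alpha*m)) * sum_i nu_t(i) x_i

    for t in range(20000):
        i = rng.integers(m)                                 # i_t uniform in {1, ..., m}
        nu_new = (1 - eta) * nu[i] - eta * fprime(i, X[i] @ w)
        w += (nu_new - nu[i]) / (alpha * m) * X[i]          # keep w in sync in O(n)
        nu[i] = nu_new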

Proof: The key property that we will prove is that the primal/dual gap is controlled by the improvement in the dual objective, precisely:

(1)   \begin{equation*} \mathbb{E} \left( P(w_t) - D(\nu_t) \right) \leq (Q+m) \mathbb{E} \left( D(\nu_{t+1}) - D(\nu_t) \right) . \end{equation*}

Let us see how the above inequality yields the result. First we clearly have D(\nu) \leq P(w), and thus the above inequality immediately yields:

    \[\mathbb{E} \left( D(\nu^*) - D(\nu_{t+1}) \right) \leq \left(1 - \frac{1}{Q+m}\right) \mathbb{E} \left( D(\nu^*) - D(\nu_{t}) \right) ,\]

which by induction gives (using D(0) \geq 0 and D(\nu^*) \leq 1):

    \[\mathbb{E} \left( D(\nu^*) - D(\nu_{t+1}) \right) \leq \left(1 - \frac{1}{Q+m}\right)^{t+1} \leq \exp\left( - \frac{t+1}{Q+m} \right) ,\]

and thus using once more (1):

    \[\mathbb{E} P(w_t) - P(w^*) \leq (Q+m) \exp\left( - \frac{t}{Q+m} \right) .\]

Thus it only remains to prove (1). First observe that the primal/dual gap can be written as:

    \[P(w_t) - D(\nu_t) = \frac{1}{m} \sum_{i=1}^m \bigg( f_i(w_t^{\top} x_i) + f_i^*(- \nu_t(i)) \bigg) + \alpha \|w_t\|^2 .\]

Let us now evaluate the improvement in the dual objective assuming that coordinate i is updated in the (t+1)^{th} step:

    \begin{align*} & m \left( D(\nu_{t+1}) - D(\nu_t) \right) \\ & = - f_i^*(-\nu_{t+1}(i)) - \frac{\alpha m}{2} \|w_{t+1}\|^2 - \left(- f_i^*(-\nu_t(i)) - \frac{\alpha m}{2} \|w_t\|^2\right) \\ & = - f_i^*(-(1-\eta) \nu_t(i) + \eta f_i'(w_t^{\top} x_i)) - \frac{\alpha m}{2} \left\|w_t - \frac{\eta}{\alpha m} (f_i'(w_t^{\top} x_i) + \nu_t(i)) x_i \right\|^2 \\ & \qquad - \left(- f_i^*(-\nu_t(i)) - \frac{\alpha m}{2} \|w_t\|^2\right) . \end{align*}

Now using that f_i^* is \frac{1}{\beta}-strongly convex one has

    \begin{eqnarray*} f_i^*(-(1-\eta) \nu_t(i) + \eta f_i'(w_t^{\top} x_i)) & \leq& (1 -\eta) f_i^*(- \nu_t(i)) + \eta f_i^*(f_i'(w_t^{\top} x_i)) \\ & & \;- \frac{1}{2 \beta} \eta (1-\eta) (f_i'(w_t^{\top} x_i) + \nu_t(i))^2 , \end{eqnarray*}

which yields

    \begin{align*} & m \left( D(\nu_{t+1}) - D(\nu_t) \right) \\ & \geq \eta f_i^*(- \nu_t(i)) - \eta f_i^*(f_i'(w_t^{\top} x_i)) + \frac{\eta (1-\eta)}{2 \beta} (f_i'(w_t^{\top} x_i) + \nu_t(i))^2 \\ & + \eta (f_i'(w_t^{\top} x_i) + \nu_t(i)) w_t^{\top} x_i - \frac{\eta^2}{2\alpha m} (f_i'(w_t^{\top} x_i) + \nu_t(i))^2 \|x_i\|^2 \\ & \geq \eta f_i^*(- \nu_t(i)) - \eta f_i^*(f_i'(w_t^{\top} x_i)) + \eta (f_i'(w_t^{\top} x_i) + \nu_t(i)) w_t^{\top} x_i , \end{align*}

where in the last line we used that \eta = \frac{m}{Q+m} and \|x_i\| \leq 1. Observe now that

    \[f_i^*(f_i'(w_t^{\top} x_i)) = (w_t^{\top} x_i) f_i'(w_t^{\top} x_i) - f_i(w_t^{\top} x_i) ,\]

which finally gives

    \[m \left( D(\nu_{t+1}) - D(\nu_t) \right) \geq \eta \bigg(f_i^*(- \nu_t(i)) + f_i(w_t^{\top} x_i) + \nu_t(i) w_t^{\top} x_i \bigg),\]

and thus taking expectation one obtains

    \begin{align*} & (Q+m) \mathbb{E}_{i_t} \left( D(\nu_{t+1}) - D(\nu_t) \right) \\ & \geq \frac{1}{m} \sum_{i=1}^m \bigg( f_i(w_t^{\top} x_i) + f_i^*(- \nu_t(i)) + \nu_t(i) w_t^{\top} x_i \bigg)\\ & = P(w_t) - D(\nu_t), \end{align*}

which concludes the proof of (1) and the proof of the theorem.

\Box


ORF523: Noisy oracles

Today we start the third (and last) part of our class titled: ‘Optimization and Randomness’. An important setting that we will explore is the one of optimization with noisy oracles:

  • Noisy 0^{th}-order oracle: given x \in \mathcal{X} it outputs Y such that \mathbb{E}(Y | x) = f(x).
  • Noisy 1^{st}-order oracle: given x \in \mathcal{X} it outputs \tilde{g} such that \mathbb{E}(\tilde{g} | x) \in \partial f(x).

Somewhat surprisingly several of the first-order algorithms we studied so far are robust to noise and attain the same performance if one feeds them with the answers of a noisy 1^{st} order oracle. This was first observed by Robbins and Monro in the early Fifties, and the resulting set of techniques is often referred to as stochastic approximation methods. For instance let us consider the Stochastic Mirror Descent:

    \begin{align*} & \nabla \Phi(y_{t+1}) = \nabla \Phi(x_{t}) - \eta \tilde{g}_t, \ \text{where} \ \mathbb{E}(\tilde{g_t} | x_t) \in \partial f(x_t) , \\ & x_{t+1} \in \mathrm{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} D_{\Phi}(x,y_{t+1}) . \end{align*}

The analysis of this algorithm follows the exact same lines as the standard version. The key observation is that:

    \begin{eqnarray*} \mathbb{E} \ f(x_t) - f(x) & \leq & \mathbb{E} \ g_t^{\top} (x_t - x), \ \text{where} \ g_t \in \partial f(x_t) \\ & = & \mathbb{E} \left( \mathbb{E} (\tilde{g}_t|x_t)^{\top} (x_t - x) \right), \\ & = & \mathbb{E} \ \tilde{g}_t^{\top} (x_t - x) , \end{eqnarray*}

and now one can repeat the standard proof since \tilde{g}_t was used in the algorithm. This yields the following result.

Theorem. Let \Phi be a mirror map \kappa-strongly convex on \mathcal{X} \cap \mathcal{D} with respect to \|\cdot\|, and let R^2 = \sup_{x \in \mathcal{X} \cap \mathcal{D}} \Phi(x) - \Phi(x_1). Assume that f is convex and that the noisy oracle is such that \mathbb{E} \|\tilde{g}\|_*^2 \leq \sigma^2. Then Stochastic Mirror Descent with \eta = \frac{R}{\sigma} \sqrt{\frac{2 \kappa}{t}} satisfies

    \[\mathbb{E} f\bigg(\frac{1}{t} \sum_{s=1}^t x_s \bigg) - \min_{x \in \mathcal{X}} f(x) \leq R \sigma \sqrt{\frac{2}{\kappa t}} .\]

Of course the exact same reasoning applies to the Mirror Descent for Saddle Points. One can also get the rate 1/t for strongly convex functions using the idea described in this lecture. High-probability results can be obtained using Markov’s inequality, and stronger concentration results can be proved under a subgaussian assumption on the tails of the estimates \tilde{g}, see Proposition 2.2. here. 

Let us now see some applications of this noisy framework.

 

Statistical Learning

We are interested in solving the following problem:

    \[\min_{w \in \mathcal{W}} \mathbb{E} \ \ell(w, Z) ,\]

where \ell : \mathcal{W} \times \mathcal{Z} \rightarrow \mathbb{R} is a known loss function such that \ell(\cdot, z) is a convex function \forall z \in \mathcal{Z}, and Z is a random variable whose distribution is unknown. While the distribution of Z is unknown, we assume that we have access to a sequence of m i.i.d. examples Z_1, \hdots, Z_m. A typical example within this framework is the case where z=(x,y) represents an (input, output) pair and the function \ell is one of the loss functions we saw in this lecture, for example the logistic loss \ell(w, (x,y)) = \log(1 + \exp(- y w^{\top} x)) with \mathcal{W} \subset \mathbb{R}^n, \mathcal{Z} = \mathbb{R}^n \times \{-1,1\}.

Let g(w,z) be a subgradient of \ell(\cdot, z) at w, and let f(w) = \mathbb{E} \ \ell(w, Z). We assume that \mathbb{E} g(w, Z) \in \partial f(w). Note that we cannot perform a gradient descent on f(w) as we would need to know the distribution of Z to compute a subgradient of f. On the other hand one has access to unbiased estimates of these subgradients with the i.i.d. examples Z_1, \hdots, Z_m. Thus one can use the Stochastic Mirror Descent, where at each iteration a new example is used to compute the noisy subgradient, precisely we take \tilde{g}_t = g(w_t, Z_t). In particular with m examples one can only run the algorithm for m steps, and one gets to a precision of order 1/\sqrt{m}. If in addition f is strongly convex then one can in fact obtain a precision of order 1/m with the step sizes/averaging described in this lecture.
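
Here is a minimal NumPy sketch of this one-pass procedure for the logistic loss, ignoring for simplicity the projection onto \mathcal{W}; the data-generating distribution and the step-size below are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 10000, 20
    w_true = rng.standard_normal(n)          # illustrative ground-truth model

    def sample_example():                    # one fresh i.i.d. example Z = (x, y)
        x = rng.standard_normal(n)
        y = 1.0 if rng.random() < 1.0 / (1.0 + np.exp(-w_true @ x)) else -1.0
        return x, y

    def stochastic_subgradient(w, x, y):     # gradient of w -> log(1 + exp(-y w^T x))
        return -y * x / (1.0 + np.exp(y * (w @ x)))

    eta = 0.1 / np.sqrt(m)                   # constant step of order 1/sqrt(m) (illustrative)
    w = np.zeros(n)
    w_avg = np.zeros(n)
    for t in range(1, m + 1):                # each example Z_t is used exactly once
        x, y = sample_example()
        w = w - eta * stochastic_subgradient(w, x, y)
        w_avg += (w - w_avg) / t             # running average of the iterates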

 

Acceleration by randomization

In the previous example there was no alternative to the use of noisy subgradients as the distribution of Z was unknown. However in some cases, even if the distribution of Z is known, it might be beneficial to artificially sample examples from the distribution in order to avoid computing difficult integrals. Consider for example the problem of minimizing an average of convex functions:

    \[\min_{x \in \mathcal{X}} \frac{1}{m} \sum_{i=1}^m f_i(x) .\]

We have seen several important examples of this form where m might be very large. Note that computing a subgradient of f(x) = \frac{1}{m} \sum_{i=1}^m f_i(x) takes O(m N) time where O(N) is the time to compute the subgradient of a single function f_i(x). Thus a standard gradient descent algorithm will take O(m N / \epsilon^2) time to reach an \epsilon-optimal point. On the other hand one can simply sample (with replacement) an index i \in [m] at every iteration (which takes time O(\log m)), and perform a step of gradient descent using only the subgradient of f_i. This method will reach an \epsilon-optimal point (in expectation) in time O(\log(m) N / \epsilon^2), which is much cheaper than the requirement for GD!

Note that if f is also strongly convex, then GD needs O(m N / \epsilon) time to reach an \epsilon-optimal point, and SGD needs only O(\log(m) N / \epsilon) time (with appropriate step sizes/averaging as discussed above). The situation becomes more favorable to GD if in addition f is smooth (recall that for strongly convex and smooth functions one has an exponential rate of convergence for GD), and we will come back to this in the next lecture.

 

Let us consider another example which we studied at the end of the previous lecture:

    \[\min_{x \in \Delta_n} \max_{y \in \Delta_n} x^{\top} A y ,\]

where A \in \mathbb{S}^n (the symmetry assumption is just for the sake of notation). It is clear that the rate of convergence of MD-SP for this problem is of order \|A\|_{\max} \sqrt{\frac{\log n}{t}}. Furthermore each step requires computing a subgradient in x and in y, which are of the form A y and A x. Thus the time complexity of one step is of order O(n^2) (which is the time complexity of a matrix-vector multiplication). Overall to reach an \epsilon-optimal point one needs O(\|A\|_{\max}^2 \frac{n^2 \log n}{\epsilon^2}) elementary operations. Now the key observation is that the matrix-vector multiplication can easily be randomized, especially since x, y \in \Delta_n. Let A_1, \hdots, A_n be the column vectors of A, then A x = \sum_{i=1}^n x_i A_i, and if I is a random variable such that \mathbb{P}(I=i) = x_i one has \mathbb{E} A_I = A x. Using this randomization in a Stochastic MD-SP one obtains the exact same rate of convergence (in expectation) but now each step has time complexity of order O(n), resulting in an overall complexity to reach an \epsilon-optimal point (in expectation) of O(\|A\|_{\max}^2 \frac{n \log n}{\epsilon^2}). This result is especially amazing since we obtain a sublinear time algorithm (the complexity is \tilde{O}(n) while the ‘size’ of the problem is O(n^2)). One can propose an even better algorithm based on Mirror Prox with a complexity of \tilde{O}(n / \epsilon) and we refer the reader to this book chapter by Juditsky and Nemirovski for more details.
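
The randomization itself is elementary; here is a small NumPy sketch of the unbiased O(n) estimate of A x (the matrix and the point of the simplex are illustrative, and the averaging at the end is only a numerical check of unbiasedness).

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000
    A = rng.uniform(-1, 1, size=(n, n))
    x = rng.random(n); x /= x.sum()          # a point of the simplex Delta_n

    i = rng.choice(n, p=x)                   # sample I with P(I = i) = x_i
    estimate = A[:, i]                       # unbiased: E[A_I] = A @ x

    # Averaging many independent copies recovers A @ x:
    samples = rng.choice(n, size=5000, p=x)
    print(np.linalg.norm(A[:, samples].mean(axis=1) - A @ x))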


ORF523: Mirror Prox

Today I’m going to talk about a variant of Mirror Descent invented by Nemirovski in 2004 (see this paper). This new algorithm is called Mirror Prox and it is designed to minimize a smooth convex function f over a compact convex set \mathcal{X} \subset \mathbb{R}^n. We assume that the smoothness is measured in some arbitrary norm \|\cdot\|, precisely we assume that \forall x, y \in \mathcal{X}, \|\nabla f(x) - \nabla f(y) \|_* \leq \beta \|x - y\|.

Let \Phi : \mathcal{D} \rightarrow \mathbb{R} be a mirror map on \mathcal{X} (see this post for a definition) and let x_1 \in \mathrm{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} \Phi(x). Mirror Prox is described by the following equations:

    \begin{align*} & \nabla \Phi(y_{t+1}') = \nabla \Phi(x_{t}) - \eta \nabla f(x_t), \\ \notag \\ & y_{t+1} \in \mathrm{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} D_{\Phi}(x,y_{t+1}') , \\ \notag \\ & \nabla \Phi(x_{t+1}') = \nabla \Phi(x_{t}) - \eta \nabla f(y_{t+1}), \\ \notag\\ & x_{t+1} \in \mathrm{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} D_{\Phi}(x,x_{t+1}') . \end{align*}

In words the algorithm first makes a step of Mirror Descent to go from x_t to y_{t+1}, and then it makes a similar step to obtain x_{t+1}, starting again from x_t but this time using the gradient of f evaluated at y_{t+1} (instead of x_t). The following result justifies the procedure.

Theorem. Let \Phi be a mirror map \kappa-strongly convex on \mathcal{X} \cap \mathcal{D} with respect to \|\cdot\|. Let R^2 = \sup_{x \in \mathcal{X} \cap \mathcal{D}} \Phi(x) - \Phi(x_1) and f be convex and \beta-smooth w.r.t. \|\cdot\|. Then Mirror Prox with \eta = \frac{\kappa}{\beta} satisfies (with y_1 = x_1)

    \[f\bigg(\frac{1}{t} \sum_{s=1}^t y_s \bigg) - f(x^*) \leq \frac{\beta R^2}{\kappa t} .\]

Proof: Let x \in \mathcal{X} \cap \mathcal{D}. We write

    \begin{align*} & f(y_{t+1}) - f(x) \\ & \leq \nabla f(y_{t+1})^{\top} (y_{t+1} - x) \\ & = \nabla f(y_{t+1})^{\top} (x_{t+1} - x) + \nabla f(x_t)^{\top} (y_{t+1} - x_{t+1}) \\ & \qquad + (\nabla f(y_{t+1}) - \nabla f(x_t))^{\top} (y_{t+1} - x_{t+1}) . \end{align*}

We will now bound separately these three terms. For the first one we have (by definition of the method, first-order optimality condition of the projection, and the definition of the Bregman divergence)

    \begin{align*} & \eta \nabla f(y_{t+1})^{\top} (x_{t+1} - x) \\ & = ( \nabla \Phi(x_t) - \nabla \Phi(x_{t+1}'))^{\top} (x_{t+1} - x) \\ & \leq ( \nabla \Phi(x_t) - \nabla \Phi(x_{t+1}))^{\top} (x_{t+1} - x) \\ & = D_{\Phi}(x,x_t) - D_{\Phi}(x, x_{t+1}) - D_{\Phi}(x_{t+1}, x_t) . \end{align*}

For the second term, using the same properties as above and the strong convexity of the mirror map, one has

    \begin{align*} & \eta \nabla f(x_t)^{\top} (y_{t+1} - x_{t+1}) \\ & = ( \nabla \Phi(x_t) - \nabla \Phi(y_{t+1}'))^{\top} (y_{t+1} - x_{t+1}) \\ & \leq ( \nabla \Phi(x_t) - \nabla \Phi(y_{t+1}))^{\top} (y_{t+1} - x_{t+1}) \\ & = D_{\Phi}(x_{t+1},x_t) - D_{\Phi}(x_{t+1}, y_{t+1}) - D_{\Phi}(y_{t+1}, x_t) \\ & \leq D_{\Phi}(x_{t+1},x_t) - \frac{\kappa}{2} \|x_{t+1} - y_{t+1} \|^2 - \frac{\kappa}{2} \|y_{t+1} - x_t\|^2 . \end{align*}

Finally for the last term, using Hölder’s inequality, \beta-smoothness, and the fact that 2 ab \leq a^2 + b^2, one has

    \begin{align*} & (\nabla f(y_{t+1}) - \nabla f(x_t))^{\top} (y_{t+1} - x_{t+1}) \\ & \leq \|\nabla f(y_{t+1}) - \nabla f(x_t)\|_* \cdot \|y_{t+1} - x_{t+1} \| \\ & \leq \beta \|y_{t+1} - x_t\| \cdot \|y_{t+1} - x_{t+1} \| \\ & \leq \frac{\beta}{2} \|y_{t+1} - x_t\|^2 + \frac{\beta}{2} \|y_{t+1} - x_{t+1} \|^2 . \end{align*}

Thus summing up these three terms and using that \eta = \frac{\kappa}{\beta} one gets

    \[f(y_{t+1}) - f(x) \leq \frac{D_{\Phi}(x,x_t) - D_{\Phi}(x,x_{t+1})}{\eta} ,\]

and this inequality is also true for t=0 (see the bound for the first term). The proof is then concluded as usual.

\Box
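
Here is a minimal NumPy sketch of Mirror Prox in the Euclidean setup, where \nabla \Phi is the identity and the Bregman projection is the Euclidean projection; the quadratic objective and the \ell_2-ball constraint are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 20
    M = rng.standard_normal((n, n)); M = M.T @ M / n     # f(x) = 0.5 * x^T M x
    grad_f = lambda x: M @ x
    beta = np.linalg.norm(M, 2)                          # smoothness constant of f

    def proj_ball(x, radius=1.0):                        # Euclidean projection on X
        nrm = np.linalg.norm(x)
        return x if nrm <= radius else x * (radius / nrm)

    eta = 1.0 / beta                                     # eta = kappa / beta with kappa = 1
    x = proj_ball(rng.standard_normal(n))
    T, y_avg = 500, np.zeros(n)
    for t in range(T):
        y = proj_ball(x - eta * grad_f(x))   # step from x_t using the gradient at x_t
        x = proj_ball(x - eta * grad_f(y))   # step from x_t using the gradient at y_{t+1}
        y_avg += y / T                       # the guarantee is for the average of the y's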

 

Mirror Prox for Saddle Points

Following the generalization of Mirror Descent for Saddle Points (MD-SP, see this post) one can propose a version of Mirror Prox for Saddle Points (MP-SP). In fact the importance of Mirror Prox lies primarily in its application to compute the saddle point of a smooth convex-concave function. Indeed, the key observation is that many non-smooth convex functions admit a smooth saddle-point representation (see examples below). With such a representation one can then use MP-SP to obtain a rate of convergence of order 1/t, despite the fact that the original function is non-smooth! Recall that we proved that for non-smooth convex functions the optimal black-box rate is of order 1/\sqrt{t}. However MP-SP is not a black-box method in the sense that it makes use of the structure of the function (by representing it as a saddle-point). Nonetheless MP-SP still enjoys the nice time-complexity property of black-box methods since it is a first-order method. It is a win-win situation!

Let us now describe more precisely MP-SP. Let \mathcal{X} \subset \mathbb{R}^n be a compact and convex set endowed with some norm \|\cdot\|_{\mathcal{X}} and a mirror map \Phi_{\mathcal{X}} (defined on \mathcal{D_{\mathcal{X}}}), and let \mathcal{Y} \subset \mathbb{R}^m be a compact and convex set endowed with some norm \|\cdot\|_{\mathcal{Y}} and a mirror map \Phi_{\mathcal{Y}} (defined on \mathcal{D_{\mathcal{Y}}}). Let \phi : \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R} be a continuously differentiable function such that \phi(\cdot, y) is convex and \phi(x, \cdot) is concave. We assume that \phi is (\beta_{11}, \beta_{12}, \beta_{22}, \beta_{21})-smooth in the sense that:

    \begin{align*} & \|\nabla_x \phi(x,y) - \nabla_x \phi(x',y) \|_{\mathcal{X},*} \leq \beta_{11} \|x-x'\|_{\mathcal{X}} , \\ & \|\nabla_x \phi(x,y) - \nabla_x \phi(x,y') \|_{\mathcal{X},*} \leq \beta_{12} \|y-y'\|_{\mathcal{Y}} , \\ & \|\nabla_y \phi(x,y) - \nabla_y \phi(x,y') \|_{\mathcal{Y},*} \leq \beta_{22} \|y-y'\|_{\mathcal{Y}} , \\ & \|\nabla_y \phi(x,y) - \nabla_y \phi(x',y) \|_{\mathcal{Y},*} \leq \beta_{21} \|x-x'\|_{\mathcal{X}} , \\ \end{align*}

We consider the mirror map \Phi((x,y)) = \Phi_{\mathcal{X}}(x) + \Phi_{\mathcal{Y}}(y) on \mathcal{Z} = \mathcal{X} \times \mathcal{Y} (defined on \mathcal{D} = \mathcal{D_{\mathcal{X}}} \times \mathcal{D_{\mathcal{Y}}}). Let z_1 \in \mathrm{argmin}_{z \in \mathcal{Z} \cap \mathcal{D}} \Phi(z). Then for t \geq 1, let z_t=(x_t, y_t) and w_t=(u_t,v_t) be defined by the following equations (with w_1=z_1):

    \begin{align*} & \nabla \Phi(w_{t+1}') = \nabla \Phi(z_{t}) - \eta (\nabla_x \phi(x_t, y_t), - \nabla_y \phi(x_t, y_t)), \\ \notag \\ & w_{t+1} \in \mathrm{argmin}_{z \in \mathcal{Z} \cap \mathcal{D}} D_{\Phi}(z,w_{t+1}') , \\ \notag \\ & \nabla \Phi(z_{t+1}') = \nabla \Phi(z_{t}) - \eta (\nabla_x \phi(u_t, v_t), - \nabla_y \phi(u_t, v_t)), \\ \notag\\ & z_{t+1} \in \mathrm{argmin}_{z \in \mathcal{Z} \cap \mathcal{D}} D_{\Phi}(z,z_{t+1}') . \end{align*}

 

Theorem. Assume that \Phi_{\mathcal{X}} is 1-strongly convex on \mathcal{X} \cap \mathcal{D}_{\mathcal{X}} with respect to \|\cdot\|_{\mathcal{X}} and let R^2_{\mathcal{X}} = \sup_{x \in \mathcal{X} \cap \mathcal{D}_{\mathcal{X}}} \Phi_{\mathcal{X}}(x) - \inf_{x \in \mathcal{X} \cap \mathcal{D}_{\mathcal{X}}} \Phi_{\mathcal{X}}(x) (with a similar definition for \Phi_{\mathcal{Y}}). Let \beta = \max(\beta_{11}, \beta_{12}, \beta_{22}, \beta_{21}). Then MP-SP with \eta = \frac{1}{2 \beta} satisfies

    \[\max_{y \in \mathcal{Y}} \phi\left( \frac1{t} \sum_{s=1}^t u_s,y \right) - \min_{x \in \mathcal{X}} \phi\left(x, \frac1{t} \sum_{s=1}^t v_s \right) \leq \frac{2 \beta (R^2_{\mathcal{X}} + R^2_{\mathcal{Y}})}{t}.\]

Proof: First we endow \mathcal{Z} with the norm defined by

    \[\|z\|_{\mathcal{Z}} = \sqrt{\|x\|_{\mathcal{X}}^2 + \|y\|_{\mathcal{Y}}^2} .\]

It is immediate that \Phi is 1-strongly convex with respect to \|\cdot\|_{\mathcal{Z}} on \mathcal{Z} \cap \mathcal{D}. Furthermore one can easily check that

    \[\|z\|_{\mathcal{Z},*} = \sqrt{\|x\|_{\mathcal{X},*}^2 + \|y\|_{\mathcal{Y},*}^2} ,\]

and thus the vector fields (\nabla_x \phi(x, y), - \nabla_y \phi(x,y)) used in MP-SP satisfy:

    \[\|(\nabla_x \phi(x, y), - \nabla_y \phi(x,y)) - (\nabla_x \phi(x', y'), - \nabla_y \phi(x',y'))\|_{\mathcal{Z},*}^2 \leq 4 \beta^2 \|(x,y) - (x',y') \|_{\mathcal{Z}}^2 .\]

It suffices now to use the exact same proof as for the standard Mirror Prox, starting with (where g_s denotes the above vector field evaluated at w_s = (u_s, v_s)):

    \begin{eqnarray*} \phi\left( \frac1{t} \sum_{s=1}^t u_s,y \right) - \phi\left(x, \frac1{t} \sum_{s=1}^t v_s \right) & \leq & \frac{1}{t} \sum_{s=1}^t \bigg( \phi\left(u_s,y \right) - \phi\left(x, v_s \right) \bigg) \\ & \leq & \frac{1}{t} \sum_{s=1}^t g_s^{\top} (w_s - z) . \end{eqnarray*}

\Box

 

Smooth saddle-point representation of non-smooth functions

Assume that the function f to be optimized is of the form:

    \[f(x) = \max_{1 \leq i \leq m} f_{i} (x) ,\]

where the functions f_i(x) are convex, \beta-smooth and L-Lipschitz in some norm \|\cdot\|. Minimizing f over \mathcal{X} is equivalent to finding a saddle point over \mathcal{X} \times \Delta_{m} since one has:

    \[f(x) = \max_{y \in \Delta_{m}} y^{\top} (f_1(x), \hdots, f_{m}(x)) .\]

Thus we are looking for the saddle point of \phi(x,y) = y^{\top} (f_1(x), \hdots, f_{m}(x)) which is convex in x, and concave (in fact linear!) in y. Furthermore \phi is smooth and it is an easy exercise to verify that in this case one has \beta_{22} = 0, \beta_{21} = L, \beta_{11} = \beta, and \beta_{12} = L. Thus a rate of convergence (to the minimizer of f) of order (\beta + L) (R^2_{\mathcal{X}} + \log(m)) /t can be obtained by applying MP-SP with appropriate mirror maps (in particular the update in the y-space has to be done with the negative entropy). Note that this rate is particularly striking since the oracle lower bound of 1/\sqrt{t} was proved precisely for this type of function! Here we are able to get around this lower bound by making use of the structure of the function. Remark also that if you think about what this algorithm is doing to optimize the function f you get a very elegant procedure for which it is highly non-trivial a priori to prove something about its rate of convergence.

Finally note that if \beta and L are of different orders of magnitude one can obtain a better result by considering the mirror map \Phi((x,y)) = a \Phi_{\mathcal{X}}(x) + b \Phi_{\mathcal{Y}}(y) and optimizing over the values of a and b.

 

Another interesting example arises when \phi(x,y) = x^{\top} A y with A \in \mathbb{R}^{n \times m}, \mathcal{X} = \Delta_n and \mathcal{Y} = \Delta_m. With the \ell_1-norm on both spaces it is easy to see that \beta_{12} = \beta_{21} = \| A \|_{\mathrm{max}} (the entrywise max norm), and of course \beta_{11} = \beta_{22} = 0. Thus we obtain here a convergence rate of order \| A \|_{\mathrm{max}} (\log(n) + \log(m)) / t. Again if n and m are of different orders of magnitude one can improve the result to obtain a rate of order \| A \|_{\mathrm{max}} \sqrt{\log(n) \log(m)} / t by considering the mirror map \Phi((x,y)) = a \Phi_{\mathcal{X}}(x) + b \Phi_{\mathcal{Y}}(y) and optimizing over the values of a and b.

Many more interesting examples can be derived and we refer the reader to the original paper of Nemirovski as well as this book chapter by Juditsky and Nemirovski.


COLT 2013 accepted papers

The accepted papers for COLT 2013 have just been revealed. You can also find the list below with a link to the arxiv version when I could find one (if you want me to add a link just send me an email).

I also take the opportunity of this post to remind my student readers that you can apply for a ‘student travel award’ to come participate in COLT 2013. The instructions can be found here. So far I received very few applications so I highly encourage you to give it a try! You can check out this blog post that gives a few reasons to make the effort to travel and participate in conferences.

Update with a message from Philippe Rigollet: Sebastien Bubeck and I are local organizers of this year’s edition of the Conference On Learning Theory (COLT 2013) on June 12-14.
This means that it will take place at Princeton (in Friend 101), and of course, that it’ll be awesome.
The list of accepted papers is very exciting and has been posted on the official COLT 2013 website:
http://orfe.princeton.edu/conferences/colt2013 (see also below).

Thanks to generous sponsors, we have been able to lower the registration costs to a historic minimum (regular: $300, students: $150).
These costs include:
-a cocktail party followed by a banquet on Thursday night
-coffee breaks (with poster presentations)
-lunches (with poster presentation)
-47 amazing presentations
-3 Keynote lectures by Sanjeev Arora (Princeton), Ralf Herbrich (amazon.com) and Yann LeCun (NYU) respectively

The deadline for early registration is May 13 (even if you’re local, help us support the conference by registering and not just crashing the party).
To register, go to:
http://orfe.princeton.edu/conferences/colt2013/?q=for-participants/registration

Please help us spread the word by forwarding this email to anyone interested in Learning Theory and by using your favorite social media.

 



ORF523: Mirror Descent, part II/II

We start with some of the standard setups for the Mirror Descent algorithm we described in the previous post.

Standard setups for Mirror Descent

‘Ball’ setup. The simplest version of Mirror Descent is obtained by taking \Phi(x) = \frac{1}{2} \|x\|^2_2 on \mathcal{D} = \mathbb{R}^n. The function \Phi is a mirror map 1-strongly convex w.r.t. \|\cdot\|_2, and furthermore the associated Bregman divergence is given by D_{\Phi}(x,y) = \frac{1}{2} \|x - y\|^2_2. Thus in that case Mirror Descent is exactly equivalent to Projected Subgradient Descent, and the rate of convergence obtained in the previous post recovers our earlier result on Projected Subgradient Descent.

‘Simplex’ setup. A more interesting choice of a mirror map is given by the negative entropy

    \[\Phi(x) = \sum_{i=1}^n x_i \log x_i,\]

on \mathcal{D} = \mathbb{R}_{++}^n. In that case the gradient update \nabla \Phi(y_{t+1}) = \nabla \Phi(x_t) - \eta \nabla f(x_t) can be written equivalently as

    \[y_{t+1}(i) = x_{t}(i) \exp\big(- \eta [\nabla f(x_t) ]_i \big) , \ i=1, \hdots, n.\]

The Bregman divergence of this mirror map is given by D_{\Phi}(x,y) = \sum_{i=1}^n x_i \log \frac{x_i}{y_i} - \sum_{i=1}^n (x_i - y_i) (also known as the generalized Kullback-Leibler divergence). It is easy to verify that the projection with respect to this Bregman divergence on the simplex \Delta_n = \{x \in \mathbb{R}_+^n : \sum_{i=1}^n x_i = 1\} amounts to a simple renormalization y \mapsto y / \|y\|_1. Furthermore it is also easy to verify that \Phi is 1-strongly convex w.r.t. \|\cdot\|_1 on \Delta_n (this result is known as Pinsker’s inequality). Note also that one has x_1 = (1/n, \hdots, 1/n) and R^2 = \log n when \mathcal{X} = \Delta_n. 

The above observations imply that when minimizing on the simplex \Delta_n a function f with subgradients bounded in \ell_{\infty}-norm, Mirror Descent with the negative entropy achieves a rate of convergence of order \sqrt{\frac{\log n}{t}}. On the other hand the regular Projected Subgradient Descent achieves only a rate of order \sqrt{\frac{n}{t}} in this case!
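
Here is a minimal NumPy sketch of this ‘simplex’ setup (also known as exponentiated gradient); the linear objective is an illustrative stand-in for a convex function with subgradients bounded in \ell_{\infty}-norm.

    import numpy as np

    rng = np.random.default_rng(0)
    n, T = 100, 1000
    c = rng.uniform(-1, 1, size=n)          # f(x) = c^T x, so the subgradient is c
    L = np.max(np.abs(c))                   # bound on ||g||_infinity

    x = np.full(n, 1.0 / n)                 # x_1 = (1/n, ..., 1/n)
    x_avg = np.zeros(n)
    eta = np.sqrt(2 * np.log(n) / T) / L    # step-size from the theorem (kappa = 1, R^2 = log n)
    for t in range(T):
        g = c                               # subgradient of f at x_t
        y = x * np.exp(-eta * g)            # dual gradient step: y(i) = x(i) exp(-eta g_i)
        x = y / y.sum()                     # Bregman projection = l1 renormalization
        x_avg += x / T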

‘Spectrahedron’ setup. We consider here functions defined on matrices, and we are interested in minimizing a function f on the spectrahedron \mathcal{S}_n defined as:

    \[\mathcal{S}_n = \left\{X \in \mathbb{S}_+^n : \mathrm{Tr}(X) = 1 \right\} .\]

In this setting we consider the mirror map on \mathcal{D} = \mathbb{S}_{++}^n given by the negative von Neumann entropy:

    \[\Phi(X) = \sum_{i=1}^n \lambda_i(X) \log \lambda_i(X) ,\]

where \lambda_1(X), \hdots, \lambda_n(X) are the eigenvalues of X. It can be shown that the gradient update \nabla \Phi(Y_{t+1}) = \nabla \Phi(X_t) - \eta \nabla f(X_t) can be written equivalently as

    \[Y_{t+1} = \exp\big(\log X_t - \eta \nabla f(X_t) \big) ,\]

where the matrix exponential and matrix logarithm are defined as usual. Furthermore the projection on \mathcal{S}_n is a simple trace renormalization.

With highly non-trivial computation one can show that \Phi is \frac{1}{2}-strongly convex with respect to the Schatten 1-norm defined as

    \[\|X\|_1 = \sum_{i=1}^n |\lambda_i(X)|.\]

It is easy to see that one has x_1 = \frac{1}{n} I_n and R^2 = \log n for \mathcal{X} = \mathcal{S}_n. In other words the rate of convergence for optimization on the spectrahedron is the same as on the simplex!
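
Here is a minimal NumPy sketch of the corresponding update (sometimes called matrix exponentiated gradient); the linear objective f(X) = \mathrm{Tr}(C X) is an illustrative stand-in for a smooth convex function on \mathcal{S}_n, and the matrix exponential and logarithm are computed through an eigendecomposition.

    import numpy as np

    def sym_apply(X, func):
        # apply a scalar function to the eigenvalues of a symmetric matrix
        lam, U = np.linalg.eigh(X)
        return (U * func(lam)) @ U.T

    rng = np.random.default_rng(0)
    n, T, eta = 10, 200, 0.1                             # illustrative sizes and step-size
    C = rng.standard_normal((n, n)); C = (C + C.T) / 2   # gradient of f(X) = Tr(C X)

    X = np.eye(n) / n                                    # X_1 = I/n
    for t in range(T):
        G = C                                            # grad f(X_t)
        Y = sym_apply(sym_apply(X, np.log) - eta * G, np.exp)  # Y_{t+1} = exp(log X_t - eta G)
        X = Y / np.trace(Y)                              # projection = trace renormalization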

 

Mirror Descent for Saddle Points

We consider now the following saddle-point problem: Let \mathcal{X} \subset \mathbb{R}^n, \mathcal{Y} \subset \mathbb{R}^m be compact convex sets, and \phi : \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R} be a continuous function such that \phi(\cdot, y) is convex and \phi(x, \cdot) is concave. We are interested in finding z^*=(x^*, y^*) \in \mathcal{X} \times \mathcal{Y} =: \mathcal{Z} such that

    \[\phi(x^*,y^*) = \inf_{x \in \mathcal{X}} \sup_{y \in \mathcal{Y}} \phi(x,y) = \sup_{y \in \mathcal{Y}} \inf_{x \in \mathcal{X}} \phi(x,y) .\]

(Note that the above equality uses Sion’s minimax theorem from this lecture.) We measure the quality of a candidate solution z=(x,y) through the primal/dual gap

    \[\max_{y' \in \mathcal{Y}} \phi(x,y') - \min_{x' \in \mathcal{X}} \phi(x',y) .\]

The key observation now is that for any x,y,x',y', one has with g_x \in \partial_x \phi(x,y) and -g_y \in \partial_y (- \phi(x,y)),

    \[\phi(x,y) - \phi(x',y) \leq g_x^{\top} (x-x'),\]

and

    \[- \phi(x,y) - (- \phi(x,y')) \leq (- g_y)^{\top} (y-y') .\]

In particular, denoting g_z = (g_x, - g_y) we just proved

    \[\max_{y' \in \mathcal{Y}} \phi(x,y') - \min_{x' \in \mathcal{X}} \phi(x',y) \leq g_z^{\top} (z - z') ,\]

for some z' \in \mathcal{Z}. This inequality suggests that one can solve the saddle-point problem with a Mirror Descent in the space \mathcal{Z} using the ‘vector field’ given by g_z. Precisely this gives the following SP Mirror Descent:

Let \Phi_{\mathcal{X}} be a mirror map on \mathcal{X} (defined on \mathcal{D_{\mathcal{X}}}) and let \Phi_{\mathcal{Y}} be a mirror map on \mathcal{Y} (defined on \mathcal{D_{\mathcal{Y}}}). We consider now the mirror map \Phi((x,y)) = \Phi_{\mathcal{X}}(x) + \Phi_{\mathcal{Y}}(y) on \mathcal{Z} (defined on \mathcal{D} = \mathcal{D_{\mathcal{X}}} \times \mathcal{D_{\mathcal{Y}}}).

Let z_1 \in \mathrm{argmin}_{z \in \mathcal{Z} \cap \mathcal{D}} \Phi(z). Then for t \geq 1, let w_{t+1} \in \mathcal{D} such that

    \[\nabla \Phi(w_{t+1}) = \nabla \Phi(z_{t}) - \eta g_t,\]

where g_t = (g_x, - g_y) with g_x \in \partial_x \phi(x_t,y_t) and -g_y \in \partial_y (- \phi(x_t,y_t)). Finally let

    \[z_{t+1} \in \mathrm{argmin}_{z \in \mathcal{Z} \cap \mathcal{D}} D_{\Phi}(z,w_{t+1}) .\]

The following result describes the performances of SP Mirror Descent.

Theorem. Assume that \Phi_{\mathcal{X}} is \kappa_{\mathcal{X}}-strongly convex on \mathcal{X} \cap \mathcal{D}_{\mathcal{X}} with respect to \|\cdot\|_{\mathcal{X}}, and let R^2_{\mathcal{X}} = \sup_{x \in \mathcal{X} \cap \mathcal{D}} \Phi_{\mathcal{X}}(x) - \Phi_{\mathcal{X}}(x_1). Assume that \phi(\cdot, y) is L_{\mathcal{X}}-Lipschitz w.r.t. \|\cdot\|_{\mathcal{X}}. Define similarly \kappa_{\mathcal{Y}}, R^2_{\mathcal{Y}}, L_{\mathcal{Y}}. Then SP Mirror Descent with \eta = \frac{R^2_{\mathcal{X}} + R^2_{\mathcal{Y}}}{\sqrt{\frac{L_{\mathcal{X}}^2}{\kappa_{\mathcal{X}}} + \frac{L_{\mathcal{Y}}^2}{\kappa_{\mathcal{Y}}}}} \sqrt{\frac{2}{t}} satisfies

    \[\max_{y \in \mathcal{Y}} \phi\left( \frac1{t} \sum_{s=1}^t x_s,y \right) - \min_{x \in \mathcal{X}} \phi\left(x, \frac1{t} \sum_{s=1}^t y_s \right) \leq \sqrt{\frac{2 \left( R^2_{\mathcal{X}} + R^2_{\mathcal{Y}} \right) \left(\frac{L_{\mathcal{X}}^2}{\kappa_{\mathcal{X}}} + \frac{L_{\mathcal{Y}}^2}{\kappa_{\mathcal{Y}}} \right) }{t}}.\]

Proof: First we endow \mathcal{Z} with the norm defined by

    \[\|z\|_{\mathcal{Z}} = \sqrt{\kappa_{\mathcal{X}} \|x\|_{\mathcal{X}}^2 + \kappa_{\mathcal{Y}} \|y\|_{\mathcal{Y}}^2} .\]

It is immediate that \Phi is 1-strongly convex with respect to \|\cdot\|_{\mathcal{Z}} on \mathcal{Z} \cap \mathcal{D}. Furthermore one can easily check that

    \[\|z\|_{\mathcal{Z},*} = \sqrt{\frac1{\kappa_{\mathcal{X}}} \|x\|_{\mathcal{X},*}^2 + \frac1{\kappa_{\mathcal{Y}}} \|y\|_{\mathcal{Y},*}^2} ,\]

and thus the vector field (g_t) used in the SP mirror descent satisfies:

    \[\|g_t\|_{\mathcal{Z},*} \leq \sqrt{\frac{L_{\mathcal{X}}^2}{\kappa_{\mathcal{X}}} + \frac{L_{\mathcal{Y}}^2}{\kappa_{\mathcal{Y}}}} .\]

It suffices now to use the exact same proof as for the standard Mirror Descent starting with:

    \begin{eqnarray*} \phi\left( \frac1{t} \sum_{s=1}^t x_s,y \right) - \phi\left(x, \frac1{t} \sum_{s=1}^t y_s \right) & \leq & \frac{1}{t} \sum_{s=1}^t \bigg( \phi\left(x_s,y \right) - \phi\left(x, y_s \right) \bigg) \\ & \leq & \frac{1}{t} \sum_{s=1}^t g_s^{\top} (z_s - z) . \end{eqnarray*}

\Box


ORF523: Mirror Descent, part I/II

In this post we are interested in minimizing a convex function f over a compact convex set \mathcal{X} \subset \mathbb{R}^n under the assumption that f is L-Lipschitz on \mathcal{X} with respect to some arbitrary norm \|\cdot\| (precisely we assume that \forall x \in \mathcal{X}, g \in \partial f(x), \|g\|_* \leq L). As we have seen in this post, using the correct norm to measure the variation of a function can make an enormous difference. We describe now a method invented by Nemirovski and Yudin to deal with this situation.

Let us for a moment abstract the situation a little bit and forget that we are trying to do optimization in finite dimension. We already observed that Projected Subgradient Descent works in an arbitrary Hilbert space \mathcal{H} (think of \mathcal{H} = \ell_2). However we are now interested in the more general situation where we are optimizing the function in some Banach space \mathcal{B} (for example \mathcal{B} = \ell_1). In that case the Gradient Descent strategy does not even make sense: indeed the gradients (more formally the Fréchet derivatives) \nabla f(x) are elements of the dual space \mathcal{B}^* and thus one cannot perform the computation x - \eta \nabla f(x) (it simply does not make sense). We did not have this problem for optimization in a Hilbert space \mathcal{H} since by the Riesz representation theorem \mathcal{H}^* is isometric to \mathcal{H}. The great insight of Nemirovski and Yudin is that one can still do a gradient descent by first mapping the point x \in \mathcal{B} into the dual space \mathcal{B}^*, then performing the gradient update in the dual space, and finally mapping back the resulting point to the primal space \mathcal{B}. Of course the new point in the primal space might lie outside of the constraint set \mathcal{X} \subset \mathcal{B} and thus we need a way to project back the point on the constraint set \mathcal{X}. The rest of this lecture will make this idea more precise.

 

Preliminaries

We consider a non-empty convex open set \mathcal{D} \subset \mathbb{R}^n such that the constraint set \mathcal{X} is included in its closure, that is \mathcal{X} \subset \overline{\mathcal{D}}, and such that \mathcal{X} \cap \mathcal{D} \neq \emptyset.

Now we consider a continuously differentiable and strictly convex function \Phi defined on \mathcal{D}. This function \Phi will play the role of the mirror map between the primal and dual space. Precisely a point x \in \mathcal{X} \cap \mathcal{D} in the primal is mapped to \nabla \Phi(x) in the ‘dual’ (note that here all points lie in \mathbb{R}^n and the notions of ‘primal’ and ‘dual’ space only have an intuitive meaning). In other words the gradient update of Mirror Descent with the mirror map \Phi will take the form \nabla \Phi(x) - \eta \nabla f(x). Next this dual point needs to be mapped back to the primal space, which is always possible under the condition that \nabla \Phi(\mathcal{D}) = \mathbb{R}^n (weaker assumptions are possible but we will not need them), that is one can find y \in \mathcal{D} such that

    \[\nabla \Phi(y) = \nabla \Phi(x) - \eta \nabla f(x) .\]

Now this point y might lie outside of the constraint set \mathcal{X}. To project back on the set of constraint we use the Bregman divergence associated to \Phi defined by

    \[D_{\Phi}(x,y) = \Phi(x) - \Phi(y) - \nabla \Phi(y)^{\top} (x - y) .\]

The projection of y \in \mathcal{D} onto \mathcal{X} then takes the following form:

    \[\mathrm{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} D_{\Phi}(x,y) .\]

To ensure the existence of this projection we assume now that \Phi satisfies

    \[\lim_{x \rightarrow \partial \mathcal{D}} \|\nabla \Phi(x)\| = + \infty .\]

With this assumption x \mapsto D_{\Phi}(x,y) is locally increasing on the boundary of \mathcal{D}, which, together with the fact that \mathcal{X} is compact and that x \mapsto D_{\Phi}(x,y) is strictly convex, implies the existence and uniqueness of the projection defined above.

We say that \Phi : \mathcal{D} \rightarrow \mathbb{R} is a mirror map if it satisfies all the above assumptions.

 

Mirror Descent

We can now describe the Mirror Descent strategy based on a mirror map \Phi. Let x_1 \in \mathrm{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} \Phi(x). Then for t \geq 1, let y_{t+1} \in \mathcal{D} such that

    \[\nabla \Phi(y_{t+1}) = \nabla \Phi(x_{t}) - \eta g_t, \ \text{where} \ g_t \in \partial f(x_t) ,\]

and

    \[x_{t+1} \in \mathrm{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} D_{\Phi}(x,y_{t+1}) .\]
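
Schematically, the whole procedure only needs a handful of problem-dependent operations; the sketch below makes this explicit (the callables grad_phi, grad_phi_inv and bregman_proj are assumptions supplied by the caller for the chosen mirror map, not part of the algorithm's statement).

    import numpy as np

    def mirror_descent(subgrad_f, grad_phi, grad_phi_inv, bregman_proj, x1, eta, T):
        # subgrad_f(x): a subgradient g in partial f(x)
        # grad_phi(x): the map x -> grad Phi(x); grad_phi_inv is its inverse on D
        # bregman_proj(y): the Bregman projection of y onto the intersection of X and D
        x = x1
        x_avg = np.zeros_like(x1)
        for t in range(T):
            g = subgrad_f(x)
            y = grad_phi_inv(grad_phi(x) - eta * g)   # grad Phi(y_{t+1}) = grad Phi(x_t) - eta g_t
            x = bregman_proj(y)
            x_avg += x / T
        return x_avg                                  # the guarantee is for the averaged iterate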

 

Theorem Let \Phi be a mirror map. Assume also that \Phi is \alpha-strongly convex on \mathcal{X} \cap \mathcal{D} with respect to \|\cdot\|, that is

    \[\forall x, y \in \mathcal{X} \cap \mathcal{D}, \Phi(x) - \Phi(y) \leq \nabla \Phi(x)^{\top}(x-y) - \frac{\alpha}{2} \|x-y\|^2.\]

Let R^2 = \sup_{x \in \mathcal{X} \cap \mathcal{D}} \Phi(x) - \Phi(x_1) and f be convex and L-Lipschitz w.r.t. \|\cdot\|, then Mirror Descent with \eta = \frac{R}{L} \sqrt{\frac{2 \alpha}{t}} satisfies

    \[f\bigg(\frac{1}{t} \sum_{s=1}^t x_s \bigg) - \min_{x \in \mathcal{X}} f(x) \leq RL \sqrt{\frac{2}{\alpha t}} .\]

 

Proof: Let x \in \mathcal{X} \cap \mathcal{D}. The claimed bound will be obtained by taking a limit x \rightarrow x^*. Now by convexity of f, the definition of Mirror Descent, and the definition of the Bregman divergence one has

    \begin{eqnarray*} f(x_s) - f(x) & \leq & g_s^{\top} (x_s - x) \\ & = & \frac{1}{\eta} (\nabla \Phi(x_s) - \nabla \Phi(y_{s+1}))^{\top} (x_s - x) \\ & = & \frac{1}{\eta} \bigg( D_{\Phi}(x, x_s) + D_{\Phi}(x_s, y_{s+1}) - D_{\Phi}(x, y_{s+1}) \bigg). \end{eqnarray*}

Now the key observation is that by writing down the first order optimality condition for x_{s+1} one obtains

    \[(\nabla \Phi(x_{s+1}) - \nabla \Phi(y_{s+1}))^{\top} (x_{s+1} - x) \leq 0 ,\]

which can be written equivalently as

    \[D_{\Phi}(x, y_{s+1}) \geq D_{\Phi}(x, x_{s+1}) + D_{\Phi}(x_{s+1}, y_{s+1}) .\]

Thus the term D_{\Phi}(x, x_s) - D_{\Phi}(x, x_{s+1}) will lead to a telescopic sum when summing over s=1 to s=t, and it remains to bound the other term as follows:

    \begin{align*} & D_{\Phi}(x_s, y_{s+1}) - D_{\Phi}(x_{s+1}, y_{s+1}) \\ & = \Phi(x_s) - \Phi(x_{s+1}) - \nabla \Phi(y_{s+1})^{\top} (x_{s} - x_{s+1}) \\ & \leq (\nabla \Phi(x_s) - \nabla \Phi(y_{s+1}))^{\top} (x_{s} - x_{s+1}) - \frac{\alpha}{2} \|x_s - x_{s+1}\|^2 \\ & = \eta g_s^{\top} (x_{s} - x_{s+1}) - \frac{\alpha}{2} \|x_s - x_{s+1}\|^2 \\ & \leq \eta L \|x_{s+1} - x_s\| - \frac{\alpha}{2} \|x_s - x_{s+1}\|^2 \\ & \leq \frac{(\eta L)^2}{2 \alpha}. \end{align*}

We proved

    \[\sum_{s=1}^t \bigg(f(x_s) - f(x)\bigg) \leq \frac{D_{\Phi}(x,x_1)}{\eta} + \eta \frac{L^2 t}{2 \alpha},\]

which concludes the proof up to trivial computation.

\Box

 

Mirror Descent as a proximal method

One can rewrite Mirror Descent as follows:

    \begin{eqnarray*} x_{t+1} & = & \mathrm{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} \ D_{\Phi}(x,y_{t+1}) \\ & = & \mathrm{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} \ \Phi(x) - \nabla \Phi(y_{t+1})^{\top} x \\ & = & \mathrm{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} \ \Phi(x) - (\nabla \Phi(x_{t}) - \eta g_t)^{\top} x \\ & = & \mathrm{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} \ \eta g_t^{\top} x + D_{\Phi}(x,x_t) . \\ \end{eqnarray*}

Nowadays this last expression is usually taken as the definition of Mirror Descent (see this paper by Beck and Teboulle). It gives a proximal point of view on Mirror Descent: the algorithm is trying to minimize the local linearization of the function while not moving too far away from the previous point (with distances measured via the Bregman divergence of the mirror map).

 

Next time we will see some of the ‘standard setups’ for Mirror Descent, that is pairs of (constraint set, mirror map) that ‘go well together’.


ORF523: ISTA and FISTA

As we said in the previous lecture it seems stupid to consider that we are in a black-box situation when in fact we know entirely the function to be optimized. Consider for instance the LASSO objective \| X w - Y \|_2^2 + \lambda \|w\|_1 where one wants to minimize over w \in \mathbb{R}^n. By resorting to black-box procedures one would solve this problem with Subgradient Descent and the rate of convergence would be of order 1/\sqrt{t} as this function is non-smooth (and potentially non-strongly convex). However we will see now that one can take advantage of the form of the LASSO objective which is a sum of a smooth part and a simple non-smooth part to obtain rates as fast as 1/t^2.

In this lecture (which follows this paper by Beck and Teboulle) we consider the unconstrained minimization of a sum of two functions f and g that satisfy the following requirements:

(i) f+g admits a minimizer x^* on \mathbb{R}^n.

(ii) f and g are convex, and f is \beta-smooth.

(iii) g is known and f is accessible with a 1^{st}-order oracle.

As we will see next, for the proposed algorithm to be computationally efficient g also needs to be ‘simple’. For instance a separable function (i.e., g(x) = \sum_{i=1}^n g_i(x_i)) will be considered as a simple function. Our prime example will be g(x) = \|x\|_1.

 

ISTA (Iterative Shrinkage-Thresholding Algorithm)

Recall that the Gradient Descent algorithm to optimize the smooth function f is simply given by

    \[x_{t+1} = x_t - \eta \nabla f(x_t) ,\]

which can be written in the proximal form as

    \[x_{t+1} = \mathrm{argmin}_{x \in \mathbb{R}^n} \ f(x_t) + \nabla f(x_t)^{\top} (x-x_t) + \frac{1}{2\eta} \|x - x_t\|^2_2 .\]

Now here one wants to minimize f+g, and g is assumed to be known and ‘simple’. It seems very natural to consider the following iterative procedure known as ISTA:

    \begin{eqnarray*} x_{t+1} & = & \mathrm{argmin}_{x \in \mathbb{R}^n} \ f(x_t) + \nabla f(x_t)^{\top} (x-x_t) + \frac{1}{2\eta} \|x - x_t\|^2_2 + g(x) \\ & = & \mathrm{argmin}_{x \in \mathbb{R}^n} \ g(x) + \frac{1}{2\eta} \|x - (x_t - \eta \nabla f(x_t)) \|_2^2 . \end{eqnarray*}

In terms of convergence rate it is not too hard to show that ISTA has the same convergence rate on f+g as Gradient Descent on f, more precisely with \eta=\frac{1}{\beta} one has

    \[f(x_t) + g(x_t) - (f(x^*) + g(x^*)) \leq \frac{\beta \|x_1 - x^*\|^2_2}{2 t} .\]

This improved convergence rate over Subgradient Descent comes at a price: computing x_{t+1} may be a difficult optimization problem by itself in general, and this is why one needs to assume that g is ‘simple’. For instance if g can be written as g(x) = \sum_{i=1}^n g_i(x_i) then one can compute x_{t+1} by solving n convex problems in dimension 1. In the case where g(x) = \lambda \|x\|_1 this one-dimensional problem is given by:

    \[\min_{x \in \mathbb{R}} \ \lambda |x| + \frac{1}{2 \eta}(x - x_0)^2, \ \text{where} \ x_0 \in \mathbb{R} .\]

Elementary computations show that this problem has an analytical solution given by \tau_{\lambda \eta}(x_0), where \tau is the shrinkage operator defined by

    \[\tau_{\alpha}(x) = (|x|-\alpha)_+ \mathrm{sign}(x) .\]
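
Here is a minimal NumPy sketch of ISTA for a (rescaled) LASSO objective \frac{1}{2} \| X w - Y \|_2^2 + \lambda \|w\|_1; the data and the regularization level are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, lam = 100, 50, 0.1
    X = rng.standard_normal((m, n))
    Y = rng.standard_normal(m)

    def soft_threshold(x, a):                      # the shrinkage operator tau_a
        return np.sign(x) * np.maximum(np.abs(x) - a, 0.0)

    beta = np.linalg.norm(X, 2) ** 2               # smoothness of f(w) = 0.5*||X w - Y||^2
    eta = 1.0 / beta
    w = np.zeros(n)
    for t in range(500):
        grad = X.T @ (X @ w - Y)                   # gradient of the smooth part f
        w = soft_threshold(w - eta * grad, lam * eta)   # prox step on g = lam*||.||_1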

 

FISTA (Fast ISTA)

As we have seen in this lecture, the optimal rate of convergence for smooth functions can be obtained with Nesterov’s Accelerated Gradient Descent. Combining this idea with ISTA one gets FISTA which is described as follows. Let

    \[\lambda_0 = 0, \ \lambda_{s} = \frac{1 + \sqrt{1+ 4 \lambda_{s-1}^2}}{2}, \ \text{and} \ \gamma_s = \frac{1 - \lambda_s}{\lambda_{s+1}}.\]

Let x_1 = y_1 an arbitrary initial point, and

    \begin{eqnarray*} y_{s+1} & = & \mathrm{argmin}_{x \in \mathbb{R}^n} \ g(x) + \frac{\beta}{2} \|x - (x_s - \frac{1}{\beta} \nabla f(x_s)) \|_2^2 , \\ x_{s+1} & = & (1 - \gamma_s) y_{s+1} + \gamma_s y_s . \end{eqnarray*}

Again it is not hard to show that the rate of FISTA is similar to the one of Nesterov’s Accelerated Gradient Descent, more precisely:

    \[f(y_t) + g(y_t) - (f(x^*) + g(x^*)) \leq \frac{2 \beta \|x_1 - x^*\|^2}{t^2} .\]
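
Here is the corresponding NumPy sketch of FISTA on the same illustrative LASSO instance as above; the only change with respect to ISTA is the momentum sequence (\lambda_s, \gamma_s).

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, lam_reg = 100, 50, 0.1
    X = rng.standard_normal((m, n))
    Y = rng.standard_normal(m)

    def soft_threshold(x, a):
        return np.sign(x) * np.maximum(np.abs(x) - a, 0.0)

    beta = np.linalg.norm(X, 2) ** 2
    x = np.zeros(n)                  # x_1
    y = np.zeros(n)                  # y_1 = x_1
    lam = 1.0                        # lambda_1 (the recursion from lambda_0 = 0 gives 1)
    for s in range(500):
        lam_next = (1.0 + np.sqrt(1.0 + 4.0 * lam ** 2)) / 2.0        # lambda_{s+1}
        gamma = (1.0 - lam) / lam_next                                # gamma_s
        grad = X.T @ (X @ x - Y)
        y_next = soft_threshold(x - grad / beta, lam_reg / beta)      # ISTA step from x_s
        x = (1.0 - gamma) * y_next + gamma * y                        # x_{s+1}
        y, lam = y_next, lam_next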


ORF523: Conditional Gradient Descent and Structured Sparsity

In the following table we summarize our findings of previous lectures in terms of oracle convergence rates (we denote R = \|x_1 - x^*\|_2^2).

[Table of oracle convergence rates (upper and lower bounds) omitted.]

Note that in the last two lines the upper bounds and lower bounds are not matching. In both cases one can show that in fact the lower bound is tight (using respectively Nesterov’s Accelerated Gradient Descent, and the Center of Gravity Method).

Thus in terms of oracle complexity the picture is now complete. However this is not quite the end of the story and the algorithms can be improved in various ways:

  1. The projection step in Projected Subgradient Descent can be expensive in terms of computational complexity. We will see an algorithm that avoids this projection, called Conditional Gradient Descent.
  2. The black-box model is essentially an abstraction (note that this is true for 1^{st}-order oracles, but the situation is different for 0^{th}-order oracles). In practice one often knows the function to be optimized entirely. In that context it seems very wasteful to drop all this information and focus on black-box procedures. However by doing so we were able to derive linear-time algorithms while the ‘structural’ Interior Point Methods (which use the form of the function to be optimized by deriving an appropriate self-concordant barrier) are not linear time. We will see one example of an algorithm that makes a partial use of the structural information on the function to be optimized, while still being essentially a gradient-based procedure with linear time complexity. More specifically we will study the FISTA algorithm that can optimize functions of the form f(x) + g(x), where g is known and f can be accessed with a 1^{st}-order oracle. Another very interesting idea in that direction is Nesterov’s smoothing but unfortunately we will not have time to cover it.
  3. The rates of convergence we proved did not explicitly involve the dimension n. However in specific instances some parameters may involve a dependency on the dimension. For instance consider the case where the function f is such that the magnitudes of its subgradients are controlled coordinate-wise, say \|g\|_{\infty} \leq 1, g \in \partial f(x). In that case the best one can say is that \|g\|_2 \leq \sqrt{n} which leads to a rate of convergence for Projected Subgradient Descent of order \sqrt{n / t}. We will see that this situation can be dramatically improved and that one can obtain in this setting a rate of order \sqrt{ \log (n) / t}. More generally we will describe the Mirror Descent algorithm which is adapted to convex functions which are L-Lipschitz in an arbitrary norm \| \cdot \| (not necessarily \ell_2).

We start this program with a projection-free algorithm for black-box constrained optimization.

 

Conditional Gradient Descent (a.k.a. Frank-Wolfe algorithm)

Conditional Gradient Descent is an algorithm designed to minimize a smooth convex function f over a compact convex set \mathcal{X}. It performs the following update for t \geq 1:

    \begin{align*} &y_{t} \in \mathrm{argmin}_{y \in \mathcal{X}} \nabla f(x_t)^{\top} y \\ & x_{t+1} = (1 - \gamma_t) x_t + \gamma_t y_t . \end{align*}
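
A minimal NumPy sketch of these two steps, on an illustrative smooth quadratic over the simplex (where the linear minimization step simply picks a vertex), looks as follows.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 50
    M = rng.standard_normal((n, n)); M = M.T @ M / n       # f(x) = 0.5 * x^T M x
    grad_f = lambda x: M @ x

    x = np.zeros(n); x[0] = 1.0                            # start at a vertex of the simplex
    for t in range(1, 200):
        g = grad_f(x)
        i = np.argmin(g)                                   # argmin over the simplex of g^T y
        y = np.zeros(n); y[i] = 1.0                        # the minimizing vertex
        gamma = 2.0 / (t + 1)                              # gamma_t = 2/(t+1)
        x = (1.0 - gamma) * x + gamma * y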

This algorithm goes back to Frank and Wolfe (1956), and the following result is extracted from Jaggi (2013). We consider here functions which are \beta-smooth in some arbitrary norm \|\cdot\|, that is

    \[\|\nabla f(x) - \nabla f(y) \|_* \leq \beta \|x-y\| ,\]

where the dual norm \|\cdot\|_* is defined as \|g\|_* = \sup_{x \in \mathbb{R}^n : \|x\| \leq 1} g^{\top} x.

Theorem. Let f be a \beta-smooth function w.r.t. some norm \|\cdot\|, R = \sup_{x, y \in \mathcal{X}} \|x - y\|, and \gamma_s = \frac{2}{s+1}. Then for any t \geq 2, one has

    \[f(x_t) - f(x^*) \leq \frac{2 \beta R^2}{t+1} .\]

Proof: The following sequence of inequalities holds true, using respectively \beta-smoothness (which implies f(x) - f(y) \leq \nabla f(y)^{\top}(x-y) + \frac{\beta}{2} \|x-y\|^2), the definition of x_{s+1}, the definition of y_s, and the convexity of f:

    \begin{eqnarray*} f(x_{s+1}) & \leq & f(x_s) + \nabla f(x_s)^{\top} (x_{s+1} - x_s) + \frac{\beta}{2} \|x_{s+1} - x_s\|^2 \\ & \leq & f(x_s) + \gamma_s \nabla f(x_s)^{\top} (y_{s} - x_s) + \frac{\beta}{2} \gamma_s^2 R^2 \\ & \leq & f(x_s) + \gamma_s \nabla f(x_s)^{\top} (x^* - x_s) + \frac{\beta}{2} \gamma_s^2 R^2 \\ & \leq & f(x_s) + \gamma_s (f(x^*) - f(x_s)) + \frac{\beta}{2} \gamma_s^2 R^2 . \end{eqnarray*}

Rewriting this inequality in terms of \delta_s = f(x_s) - f(x^*) one obtains

    \[\delta_{s+1} \leq (1 - \gamma_s) \delta_s + \frac{\beta}{2} \gamma_s^2 R^2 .\]

A simple induction using that \gamma_s = \frac{2}{s+1} finishes the proof (note that the initialization is done at step 2 with the above inequality yielding \delta_2 \leq \frac{\beta}{2} R^2).

\Box

The rate of convergence we just proved for Conditional Gradient Descent is no better than the rate of Projected Gradient Descent in the same setting. However, in contrast to the latter algorithm, where a quadratic programming problem over \mathcal{X} has to be solved (the projection step), here we only need to solve a linear problem over \mathcal{X}, and in some cases this can be much simpler.

While being projection-free is an important property of Conditional Gradient Descent, a perhaps even more important property is that it produces sparse iterates in the following sense: Assume that \mathcal{X} is a polytope with sparse vertices, that is the number of non-zero coordinates (which we denote by the \ell_0-norm \|\cdot\|_0 which is not a norm despite the name) in a fixed vertex is small, say of order s \ll n. Then at each iteration, the new iterate x_{t+1} will increase its number of non-zero coordinates by at most s. In particular one has \|x_{t+1}\|_0 \leq \|x_1\|_0 + t s. We will see now an example of application where this property is critical to obtain a computationally tractable algorithm.

Note that the property described above proves in particular that for a smooth function f on the simplex \{x \in \mathbb{R}_+^n : \sum_{i=1}^n x_i = 1\} there always exist sparse approximate solutions. More precisely there must exist a point x with only t non-zero coordinates and such that f(x) - f(x^*) = O(1/t). Clearly this is the best one can hope for in general, as it can be seen with the function f(x) = \|x\|^2_2 (since by Cauchy-Schwarz one has \|x\|_1 \leq \sqrt{\|x\|_0} \|x\|_2 which implies on the simplex \|x\|_2^2 \geq 1 / \|x\|_0).

 

An application of Conditional Gradient Descent: Least-squares regression with structured sparsity

This section is inspired by an open problem from this paper by Gabor Lugosi (what is described below solves the open problem). Consider the problem of approximating a signal Y \in \mathbb{R}^n by a ‘small’ combination of elementary dictionary elements d_1, \hdots, d_N \in \mathbb{R}^n. One way to do this is to consider a LASSO type problem in dimension N of the following form (we denote here x(i) for the i^{th} coordinate of a vector x)

    \[\min_{x \in \mathbb{R}^N} \big\| Y - \sum_{i=1}^N x(i) d_i \big\|_2^2 + \lambda \|x\|_1 .\]

Let D \in \mathbb{R}^{n \times N} be the dictionary matrix with i^{th} column given by d_i. Instead of considering the penalized version of the problem one could look at the following constrained problem on which we will focus now:

    \begin{eqnarray*} \min_{x \in \mathbb{R}^N} \| Y - D x \|_2^2 & \qquad \Leftrightarrow \qquad & \min_{x \in \mathbb{R}^N} \| Y / s - D x \|_2^2 \\ \text{subject to} \; \|x\|_1 \leq s & & \text{subject to} \; \|x\|_1 \leq 1 . \end{eqnarray*}

We will now make some assumptions on the dictionary. We are interested in cases where the size of the dictionary N can be very large, potentially exponential in the ambient dimension n. Nonetheless we want to restrict our attention to algorithms that run in reasonable time with respect to the ambient dimension n, that is we want polynomial time algorithms in n. Of course in general this is impossible, and we need to assume that the dictionary has some structure that can be exploited. Here we make the assumption that one can do linear optimization over the dictionary in polynomial time in n. More precisely we assume that we can solve in time p(n) (where p is polynomial) the following problem for any y \in \mathbb{R}^n:

    \[\min_{1 \leq i \leq N} y^{\top} d_i .\]

This assumption is met for many combinatorial dictionaries. For instance the dictionary elements could be the incidence vectors of the spanning trees of some fixed graph, in which case the linear optimization problem can be solved with a greedy algorithm.

Finally we assume that the \ell_2-norms of the dictionary elements are controlled by some m>0, that is \|d_i\|_2 \leq m, \forall i \in [N].

 

Recall that one wants to minimize the function f(x) = \frac{1}{2} \| Y - D x \|^2_2 on the \ell_1-ball of \mathbb{R}^N in time polynomial in n. Note that at first sight this task may seem completely impossible, as we are not even allowed to write down a vector x \in \mathbb{R}^N in full (this alone would take time linear in N). The key property that will save us is that this function admits sparse approximate minimizers, as we discussed in the previous section, and this will be exploited by Conditional Gradient Descent. First note that

    \[\nabla f(x) = D^{\top} (D x - Y).\]

This form for the gradient together with our computational assumption on the dictionary imply that Conditional Gradient Descent can be run in polynomial time. Indeed, assume that z_t = D x_t - Y \in \mathbb{R}^n is already computed, then the first step of Conditional Gradient Descent is to find the coordinate i_t \in [N] that maximizes |[\nabla f(x_t)]_i| which can be done by maximizing d_i^{\top} z_t and - d_i^{\top} z_t. Thus this first step takes time O(p(n)). Computing x_{t+1} from x_t and i_{t} takes time O(t) since \|x_t\|_0 \leq t, and computing z_{t+1} from z_t and i_t takes time O(n). Thus the overall time complexity of t steps is O(t p(n) + t^2).
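To make the above bookkeeping concrete, here is a minimal Python sketch of this specialized Conditional Gradient Descent (not part of the original post). The names lmo_index(y), returning an index i minimizing y^{\top} d_i, and column(i), returning d_i, stand for the assumed dictionary oracles; the iterate is stored as a dictionary of its non-zero coordinates, so it never uses more than O(t) space.

    import numpy as np

    def cgd_dictionary(Y, lmo_index, column, T):
        # Sketch of conditional gradient descent for min_{||x||_1 <= 1} (1/2) ||Y - D x||_2^2.
        # lmo_index(y): index i minimizing y^T d_i (assumed oracle); column(i): the element d_i.
        Y = np.asarray(Y, dtype=float)
        x = {}           # x_1 = 0, stored as {coordinate: value}; at most t non-zeros after t steps
        z = -Y.copy()    # z_t = D x_t - Y
        for t in range(1, T + 1):
            gamma = 2.0 / (t + 1)
            # the i-th coordinate of the gradient is d_i^T z_t; find the largest one in absolute value
            i_plus = lmo_index(-z)     # maximizes d_i^T z_t
            i_minus = lmo_index(z)     # minimizes d_i^T z_t
            if column(i_plus) @ z >= -(column(i_minus) @ z):
                i_t, sign = i_plus, -1.0    # gradient coordinate is positive: move along -e_{i_t}
            else:
                i_t, sign = i_minus, 1.0    # gradient coordinate is negative: move along +e_{i_t}
            # x_{t+1} = (1 - gamma) x_t + gamma * sign * e_{i_t}: O(t) work
            x = {j: (1.0 - gamma) * v for j, v in x.items()}
            x[i_t] = x.get(i_t, 0.0) + gamma * sign
            # z_{t+1} = D x_{t+1} - Y = (1 - gamma) z_t + gamma * (sign * d_{i_t} - Y): O(n) work
            z = (1.0 - gamma) * z + gamma * (sign * column(i_t) - Y)
        return x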

To derive a rate of convergence it remains to study the smoothness of f. This can be done as follows:

    \begin{eqnarray*} \| \nabla f(x) - \nabla f(y) \|_{\infty} & = & \|D^{\top} D (x-y) \|_{\infty} \\ & = & \max_{1 \leq i \leq N} \bigg| d_i^{\top} \left(\sum_{j=1}^N d_j (x(j) - y(j))\right) \bigg| \\ & \leq & \max_{1 \leq i, j \leq N} |d_i^{\top} d_j| \; \|x-y\|_1 \\ & \leq & m^2 \|x-y\|_1 , \end{eqnarray*}

which means that f is m^2-smooth with respect to the \ell_1 norm (the last step uses Cauchy-Schwarz: |d_i^{\top} d_j| \leq \|d_i\|_2 \|d_j\|_2 \leq m^2). Applying the theorem above with \beta = m^2 and R = 2 (the \ell_1-diameter of the \ell_1-ball), we get the following rate of convergence:

    \[f(x_t) - f(x^*) \leq \frac{8 m^2}{t+1} .\]

In other words we proved that one can get an \epsilon-optimal solution to our original problem with a computational effort of O(m^2 p(n)/\epsilon + m^4/\epsilon^2) using the Conditional Gradient Descent.


A new book on Concentration Inequalities by Boucheron, Lugosi and Massart

If you work in a field related to probability theory you most certainly have already come across the concentration of measure phenomenon: ‘a function that depends on many independent variables, but not too much on any one of them, is almost constant’. This extremely strong statement can be made formal in various ways, the most basic one being the Hoeffding inequality (and its well-known generalization, McDiarmid’s inequality). However, in many applications McDiarmid’s inequality is too weak: the resulting bound does not account for the variance, and the assumption on the underlying function is very stringent. Fortunately one can go far beyond McDiarmid’s inequality, and this is explored in depth in the new book (published a few weeks ago!) ‘Concentration Inequalities’, by Stéphane Boucheron, Gábor Lugosi and Pascal Massart.


I read an early draft of the book during my postdoc with Gábor, and I strongly recommend it to anyone who is even just remotely interested in probabilistic phenomena. The first 100 pages form a gentle introduction to the subject and they contain indispensable results for Machine Learners and the like. Then Chapters 5-7 are (in my opinion) the core of the book: they masterfully describe results that were always a bit daunting to me, including the geometric point of view on the concentration of measure. This is where I stopped my first reading two years ago, but now that I have the book in my hands I realize that this is only the first half of it!!! I have started to read the second half, and right now I am having a lot of fun with influences and monotone sets ;).


Guest post by Amir Ali Ahmadi: Sum of Squares (SOS) Techniques: An Introduction, Part II/II

The reader is encouraged to read Part I of this series before reading this post. For ease of reference, we recall the polynomial optimization problem from our previous post:

(1)   \begin{equation*} \begin{array}{lll} \mbox{minimize} & p(x) &\ \\ \mbox{subject to} & x\in K\mathrel{\mathop:}=\{x\in\mathbb{R}^n\ |\ g_i(x)\geq 0, h_i(x)=0\}. &\ \end{array} \end{equation*}

This is the task of minimizing a multivariate polynomial p over a basic semialgebraic set K defined by the polynomials g_i, h_i.

 

Sum of squares and semidefinite programming

If a polynomial is nonnegative, can we write it in a way that its nonnegativity becomes obvious? This is the meta-question behind Hilbert’s 17th problem. As the title of this post suggests, one way to achieve this goal is to try to write the polynomial as a sum of squares of polynomials. We say that a polynomial p is a sum of squares (sos) if it can be written as p(x)=\sum_i{q_i^2(x)} for some polynomials q_i. Existence of an sos decomposition is an algebraic certificate for nonnegativity. Remarkably, whether such a decomposition exists can be decided by solving a single semidefinite program.

 Theorem 1 A multivariate polynomial p in n variables and of degree 2d is a sum of squares if and only if there exists a positive semidefinite matrix Q (often called the Gram matrix) such that

(2)   \begin{equation*} p(x)=z^{T}Qz, \end{equation*}

where z is the vector of monomials of degree up to d

    \begin{equation*} z=[1,x_{1},x_{2},\ldots,x_{n},x_{1}x_{2},\ldots,x_{n}^d]. \end{equation*}

The proof is straightforward and is left to the reader. Note that the feasible set defined by these constraints is the intersection of an affine subspace (arising from the equality constraints matching the coefficients of p with the entries of Q in (2)) with the cone of positive semidefinite matrices. Searching for such a Q is therefore precisely a semidefinite programming (SDP) feasibility problem. The size of the Gram matrix Q is \binom{n+d}{d} \times \binom{n+d}{d}, which for fixed d is polynomial in n. Depending on the structure of p, there are well-documented techniques for further reducing the size of the Gram matrix Q and the monomial vector z. We do not pursue this direction here but state as an example that if p is homogeneous of degree 2d, then it suffices to place in the vector z only monomials of degree exactly d.
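To see what this SDP looks like in the simplest setting, here is a small sketch in Python with cvxpy (assumed available; not from the original post) for univariate polynomials, where matching the coefficients of p with the entries of Q is particularly transparent: the coefficient of x^k in z^{T}Qz is the k-th anti-diagonal sum of the Gram matrix.

    import numpy as np
    import cvxpy as cp

    def is_sos_univariate(coeffs):
        # coeffs = [c_0, c_1, ..., c_{2d}] with p(x) = sum_k c_k x^k (even degree assumed)
        c = np.asarray(coeffs, dtype=float)
        d = (len(c) - 1) // 2
        Q = cp.Variable((d + 1, d + 1), PSD=True)   # Gram matrix for z = (1, x, ..., x^d)
        constraints = []
        for k in range(2 * d + 1):
            # coefficient of x^k in z^T Q z is the sum of the entries Q[i, j] with i + j = k
            constraints.append(
                sum(Q[i, k - i] for i in range(max(0, k - d), min(k, d) + 1)) == c[k]
            )
        prob = cp.Problem(cp.Minimize(0), constraints)   # pure feasibility problem
        prob.solve()
        return prob.status == cp.OPTIMAL

For instance, on the coefficient list [1, 2, 3, 2, 1], i.e. p(x)=(x^2+x+1)^2, the program should report feasibility.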

Example: Consider the task of proving nonnegativity of the polynomial

    \begin{equation*} \begin{array}{lll} p(x)&=&x_1^4-6x_1^3x_2+2x_1^3x_3+6x_1^2x_3^2+9x_1^2x_2^2-6x_1^2x_2x_3-14x_1x_2x_3^2 \\ \ &\ &+4x_1x_3^3+5x_3^4-7x_2^2x_3^2+16x_2^4. \end{array} \end{equation*}

Since this is a form (i.e., a homogeneous polynomial), we take

    \[z=(x_1^2,x_1x_2,x_2^2,x_1x_3,x_2x_3,x_3^2)^T.\]

One feasible solution to the SDP in (2) is given by

    \[Q=\begin{pmatrix} 1 & -3 & 0 & 1 & 0 & 2 \\ -3 & 9 & 0 & -3 & 0 & -6 \\ 0 & 0 & 16 & 0 & 0 & -4 \\ 1 & -3 & 0 & 2 & -1 & 2 \\ 0 & 0 & 0 & -1 & 1 & 0 \\ 2 & -6 & -4 & 2 & 0 & 5 \end{pmatrix}.\]

Upon a decomposition Q=\sum_{i=1}^3a_i a_i^T, with a_1=(1,-3,0,1,0,2)^T, a_2=(0,0,0,1,-1,0)^T, a_3=(0,0,4,0,0,-1)^T, one obtains the sos decomposition

(3)   \begin{equation*} p(x)=(x_1^2-3x_1x_2+x_1x_3+2x_3^2)^2+(x_1x_3-x_2x_3)^2+(4x_2^2-x_3^2)^2. \end{equation*}

\triangle
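As a quick numerical sanity check (a small numpy sketch, not part of the original example), one can verify that this Q is indeed positive semidefinite and that z^{T}Qz reproduces p at a few random points:

    import numpy as np

    Q = np.array([[ 1, -3,  0,  1,  0,  2],
                  [-3,  9,  0, -3,  0, -6],
                  [ 0,  0, 16,  0,  0, -4],
                  [ 1, -3,  0,  2, -1,  2],
                  [ 0,  0,  0, -1,  1,  0],
                  [ 2, -6, -4,  2,  0,  5]], dtype=float)

    def p(x1, x2, x3):
        return (x1**4 - 6*x1**3*x2 + 2*x1**3*x3 + 6*x1**2*x3**2 + 9*x1**2*x2**2
                - 6*x1**2*x2*x3 - 14*x1*x2*x3**2 + 4*x1*x3**3 + 5*x3**4
                - 7*x2**2*x3**2 + 16*x2**4)

    assert np.all(np.linalg.eigvalsh(Q) >= -1e-9)        # Q is positive semidefinite
    for _ in range(5):
        x1, x2, x3 = np.random.randn(3)
        z = np.array([x1**2, x1*x2, x2**2, x1*x3, x2*x3, x3**2])
        assert np.isclose(z @ Q @ z, p(x1, x2, x3))      # p(x) = z^T Q z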

You are probably asking yourself right now whether every nonnegative polynomial can be written as a sum of squares. Did we just get lucky on the above example? Well, from complexity considerations alone (see previous post), we know that we should expect a gap between nonnegative and sos polynomials, at least for large n.

In a seminal 1888 paper, Hilbert was the first to show that there exist nonnegative polynomials that are not sos. In fact, for each combination of degree and dimension, he showed whether such polynomials do or do not exist. For example, one famous theorem of Hilbert from that paper is that all nonnegative ternary quartic forms are sos. So we did not really use any luck in our example above! The same would have happened for any other nonnegative quartic form in three variables. Hilbert’s proof of existence of nonnegative polynomials that are not sos was not constructive. The first explicit example interestingly appeared nearly 80 years later and is due to Motzkin:

(4)   \begin{equation*} M(x_1,x_2,x_3)\mathrel{\mathop:}=x_1^4x_2^2+x_1^2x_2^4-3x_1^2x_2^2x_3^2+x_3^6. \end{equation*}

Nonnegativity of M follows from the arithmetic-geometric inequality (why?). Non-existence of an sos decomposition can be shown by some algebraic manipulations or by the systematic tools we have today for showing infeasibility of a semidefinite program (separating hyperplane theorems). From an application viewpoint, the good news for sum of squares optimization is that constructing polynomials of the type in (4) is not a trivial task. This is especially true if additional structure is required on the polynomial. For example, the following problem is still open.

Open problem. Construct an explicit example of a convex, nonnegative polynomial that is not a sum of squares.

This question is due to Parrilo. The motivation behind it is quite clear: we would like to understand whether sos optimization is exact for the special case of convex polynomial optimization problems. Blekherman has shown with non-constructive arguments that such “bad” convex polynomials must exist when the degree is four or larger and the number of variables goes to infinity. However, we do not know the smallest (or any reasonable) dimension for which this is possible and lack any explicit examples. For the reader interested in tackling this problem, it is known that any such convex polynomial must necessarily not be “sos-convex”. Roughly speaking, sos-convex polynomials are convex polynomials whose convexity is certified by an appropriately defined sum of squares identity. Examples of convex but not sos-convex polynomials have recently appeared and a characterization of the degrees and dimensions where they exist is now available. Interestingly, this characterization coincides with that of Hilbert, for reasons that are also not well-understood; see Chapter 3 of our thesis.

Having shown that not every nonnegative polynomial is a sum of squares of polynomials, Hilbert asked in his 17th problem whether every such polynomial can be written as a sum of squares of rational functions. Artin answered the question in the affirmative in 1927. As we will see next, such results allow for a hierarchy of semidefinite programs that approximate the set of nonnegative polynomials better and better.

 

Positivstellensatz and the SOS hierarchy

Consider proving a statement that we all learned in high school:

    \[\forall a,b,c,x,\ ax^2+bx+c=0\Rightarrow b^2-4ac\geq0.\]

Just for the sake of illustration, let us pull an algebraic identity out of our hat which certifies this claim:

(5)   \begin{equation*} b^2-4ac=(2ax+b)^2-4a(ax^2+bx+c). \end{equation*}

Think for a second why this constitutes a proof. The Positivstellensatz is a very powerful algebraic proof system that vastly generalizes what we just did here. It gives a systematic way of certifying infeasibility of any system of polynomial equalities and inequalities over the reals. Sum of squares representations play a central role in it. (They already did in our toy example above if you think about the role of the first term on the right hand side of (5)). Modern optimization theory adds a wonderfully useful aspect to this proof system: we can now use semidefinite programming to automatically find suitable algebraic certificates of the type in (5).
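For what it is worth, the identity (5) itself can be checked mechanically (a quick sketch using sympy, assumed available):

    import sympy as sp

    a, b, c, x = sp.symbols('a b c x')
    lhs = b**2 - 4*a*c
    rhs = (2*a*x + b)**2 - 4*a*(a*x**2 + b*x + c)
    assert sp.simplify(lhs - rhs) == 0   # the identity (5) holds for all a, b, c, x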

The Positivstellensatz is an example of a theorem of the alternative. The reader may already be familiar with similar fundamental theorems, such as Hilbert’s (weak) Nullstellensatz (1893),

a system of polynomial equations f_i(z)=0 is infeasible over the complex numbers

    \[\Updownarrow\]

there exist polynomials t_i(z) such that \sum_i t_i(z)f_i(z)=-1,

or the Farkas Lemma (1902) of linear programming,

a system of linear (in)equalities Ax+b=0, Cx+d\geq0 is infeasible over the reals

    \[\Updownarrow\]

there exist \lambda\geq0, \mu such that A^T\mu+C^T\lambda=0, b^T\mu+d^T\lambda=-1.

These theorems typically have an “easy” (well, trivial) direction and a “hard” direction. The same is true for the Positivstellensatz.

Theorem 2  (Positivstellensatz — Stengle (1974)) The basic semialgebraic set

    \[K\mathrel{\mathop:}=\{x\in\mathbb{R}^n\ |\ g_i(x)\geq 0, i=1,\ldots,m, h_i(x)=0, i=1,\ldots,k\}\]

 is empty

    \[\Updownarrow\]

there exist polynomials t_1,\ldots,t_k and sum of squares polynomials s_0,s_1,\ldots,s_m, s_{12},s_{13},\ldots,s_{m-1m}, s_{123},\ldots,s_{m-2m-1m}, \ldots,s_{12\ldots m} such that

(6)   \begin{equation*} \begin{array}{lll} -1&=&\sum\limits_{i=1}^kt_i(x)h_i(x)+s_0(x)+ \sum\limits_{\{i\}}s_i(x)g_i(x)\\ \ &+&\sum\limits_{\{i,j\}}s_{ij}(x)g_i(x)g_j(x)+\sum\limits_{\{i,j,k\}}s_{ijk}(x)g_i(x)g_j(x)g_k(x) \\ \ &+&\cdots + s_{12\ldots m}(x)g_1(x)g_2(x)\cdots g_m(x). \end{array} \end{equation*}

The number of terms in this expression is finite since we never raise any polynomial g_i to a power larger than one. The sum of squares polynomials s_{i\ldots j} are of course allowed to be the zero polynomial, and in practice many of them often are. There are bounds in the literature on the degrees of the polynomials t_i, s_{i\ldots j}, but they are exponential, as one would expect for complexity reasons. There is substantial numerical evidence, however, from diverse application areas, indicating that in practice (whatever that means) the degrees of these polynomials are usually quite low. We remark that the Positivstellensatz is a very powerful result. For example, it is straightforward to show that the solution to Hilbert’s 17th problem follows as a corollary of this theorem (how?).

Under minor additional assumptions, refined versions of the Positivstellensatz we presented are available. The two most well-known are perhaps due to Schmüdgen and Putinar. For example, Putinar’s Positivstellensatz states that if the set K satisfies the so-called Archimedean property (a property slightly stronger than compactness), then emptiness of K guarantees a representation of the type (6), where the second and third line are scratched out; i.e., there is no need to take products of the constraints g_i(x)\geq 0. While this may look like a simplification at first, there is a tradeoff: the degree of the sos multipliers s_i may need to be higher in Putinar’s representation than in Stengle’s. This makes intuitive sense as the proof system needs to additionally prove statements of the type g_i\geq 0,g_j\geq 0\Rightarrow g_ig_j\geq 0, while in Stengle’s representation this is taken as an axiom.

SOS hierarchies. Positivstellensatz results form the basis of the sos hierarchies of Parrilo and Lasserre for solving the polynomial optimization problem (1). The two approaches only differ in the version of the Positivstellensatz they use (originally, Parrilo’s paper follows Stengle’s version and Lasserre’s follows Putinar’s), and in the fact that Lasserre presents the methodology from the dual (but equivalent) viewpoint of moment sequences. In either case though, the basic idea is pretty simple. We try to obtain the largest lower bound for problem (1) by finding the largest \gamma for which the set \{x\in K, p(x)\leq\gamma\} is empty. We certify this emptiness by finding Positivstellensatz certificates. In level l of the hierarchy, the degree of the polynomials t_i and of the sos polynomials s_i in (6) is bounded by l. As l increases, the quality of the lower bound monotonically increases, and for each fixed l, the search for the optimal \gamma and the polynomials t_i, s_i is a semidefinite optimization problem (possibly with some bisection over \gamma).

 

Revisiting MAXCUT

Recall that in Seb’s earlier lecture we saw a beautiful algorithm for MAXCUT, due to Goemans and Williamson, which produces an approximation ratio of 0.878. The algorithm had two steps: first we solved a semidefinite program, then we performed some clever randomized rounding. In this section we focus only on the first step. We show that even low degree Positivstellensatz refutations can produce stronger bounds than the standard SDP relaxation.

Consider the 5-cycle with all edge weights equal to one. It is easy to see that the MAXCUT value of this graph is equal to 4. However, the standard SDP relaxation (i.e. the one used in the Goemans and Williamson algorithm) produces an upper bound of \frac{5}{8}(\sqrt{5}+5)\approx 4.5225.
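Both numbers are easy to reproduce (a small sketch in Python with cvxpy, assumed available; not part of the original post): the exact MAXCUT value by enumerating all 2^5 sign assignments, and the upper bound by solving the standard SDP relaxation directly.

    import itertools
    import cvxpy as cp

    edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]   # the 5-cycle, unit weights

    # Exact MAXCUT value by brute force over all 2^5 sign assignments.
    maxcut = max(
        sum((1 - s[i] * s[j]) / 2 for i, j in edges)
        for s in itertools.product([-1, 1], repeat=5)
    )
    print(maxcut)        # 4.0

    # Standard SDP relaxation: maximize sum over edges of (1 - X_ij)/2 with X PSD, diag(X) = 1.
    X = cp.Variable((5, 5), PSD=True)
    objective = cp.Maximize(sum((1 - X[i, j]) / 2 for i, j in edges))
    prob = cp.Problem(objective, [cp.diag(X) == 1])
    prob.solve()
    print(prob.value)    # approximately 4.5225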

The MAXCUT value of the 5-cycle is equal to minus the optimal value of the quadratic program

(7)   \begin{equation*}\nonumber \begin{array}{lll} \mbox{minimize} & \frac{1}{2}(x_1x_2+x_2x_3+x_3x_4+x_4x_5+x_1x_5)-\frac{5}{2} &\ \\ \mbox{subject to} & x_i^2=1, \ \ \ i=1,\ldots,5. &\ \end{array} \end{equation*}

We will find the largest constant \gamma such that the objective function minus \gamma is algebraically certified to be nonnegative on the feasible set. To do this, we solve the sos optimization problem

    \begin{equation*} \begin{array}{ll} \mbox{maximize} & \gamma  \\ \mbox{such that} & \frac{1}{2}(x_1x_2+x_2x_3+x_3x_4+x_4x_5+x_1x_5)-\frac{5}{2}-\gamma+\sum\limits_{i=1}^5 t_i(x)(x_i^2-1) \\ & \mbox{is sos.} \\ \end{array} \end{equation*}

The decision variables of this problem are the constant \gamma and the coefficients of the polynomials t_i(x), which in this case we parametrize to be quadratic functions. This sos program results in a polynomially sized semidefinite optimization problem via Theorem 1. The optimal value of the program is -4; i.e., we have solved the MAXCUT instance exactly.

You may be wondering, “can we show that a certain level of the sos hierarchy combined with an appropriately designed rounding procedure produces an approximation ratio of better than 0.878?” Let’s just say that if you did this, you would probably become an overnight celebrity.

 

Software

There are very nice implementations of sum of squares optimization solvers that automate the process of setting up the resulting semidefinite programs. The interested reader may want to play around with SOSTOOLS, YALMIP, or GloptiPoly.

 

Impact

While we focused in this post on the polynomial optimization problem, the impact of sum of squares optimization goes much beyond this area. In dynamics and control, sos optimization has enabled a paradigm shift from classical linear control to an efficient framework for design of nonlinear controllers that are provably safer, more agile, and more robust. Papers on applications of sos optimization have appeared in areas as diverse as quantum information theory, robotics, geometric theorem proving, formal verification, derivative pricing, stochastic optimization, and game theory, among others. In theoretical computer science, sos techniques are currently a subject of intense study. Not too long ago, I attended a plenary presentation at the CCC by Subhash Khot, where one of the last things he said was, “the sos hierarchies are currently the greatest threat to the unique games conjecture.” If you are an ORFE student and still not convinced that this SOS business is actually useful, you may find relief in knowing that SOS is in fact the core discipline of ORFE ;)
