ORF523: Oracle complexity of Lipschitz convex functions

In this lecture we consider the unconstrained minimization of a function f that satisfy the following requirements:

  • (i) f admits a minimizer x^* on \mathbb{R}^n such that \|x^*\| \leq R.
  • (ii) f is convex on \mathbb{R}^n. Recall that in particular this implies \partial f(x) \neq \emptyset for any x \in \mathbb{R}^n.
  • (iii) f is L-Lipschitz on the \ell_2-ball of radius R, that is for any x \in \mathbb{R}^n such that \|x\| \leq R and any g \in \partial f(x), one has \|g\| \leq L.

 

Subgradient Descent

The simplest strategy to minimize a differentiable function is probably the Gradient Descent scheme. It is an iterative algorithm that starts at some initial point x_1 \in \mathbb{R}^n, and then iterates the following equation:

    \[x_{t+1} = x_t - \eta \nabla f(x_t) ,\]

where \eta>0 is the step-size fixed beforehand. The idea of this scheme is simply to do small steps that minimize the local first order Taylor approximation of the function f. When the function is convex but not differentiable it seems rather natural to replace gradients by subgradients. This gives the Subgradient Descent strategy.

Here we need to be a bit careful if we want to analyze this scheme under assumptions (i)-(ii)-(iii). Indeed we have a control on the size of the subgradients only in a ball of radius R. Furthermore we also know that the minimizer of the function lies in this ball. Thus it makes sense to enforce that if we leave the ball, then we first project back the point to the ball before taking another gradient step. This gives the (Projected) Subgradient Descent:

    \begin{align*} & y_{t+1} = x_t - \eta g_t , \ \text{where} \ g_t \in \partial f(x_t) \\ & \text{if} \ \|y_{t+1}\| \leq R \ \text{then} \ x_{t+1} = y_{t+1}, \ \text{otherwise} \ x_{t+1} = \frac{R}{\|y_{t+1}\|} y_{t+1} . \end{align*}

The following elementary result gives a rate of convergence for the subgradient method.

Theorem Assume that f satisfies (i)-(ii)-(iii), and that x_1 = 0. Then the (Projected) Subgradient Descent with \eta = \frac{R}{L \sqrt{t}} satisfies for \bar{x}_t \in \left\{ \frac{1}{t} \sum_{s=1}^t x_s ; \mathrm{argmin}_{1 \leq s \leq t} f(x_s) \right\},

    \[f(\bar{x}_t) - f^* \leq \frac{R L}{\sqrt{t}} .\]

Proof: Using the definition of subgradients, the definition of the method, and an elementary identity, one obtains

    \begin{eqnarray*} f(x_s) - f(x^*) & \leq & g_s^{\top} (x_s - x^*) \\ & = & \frac{1}{\eta} (x_s - y_{s+1})^{\top} (x_s - x^*) \\ & = & \frac{1}{2 \eta} \left(\|x_s - x^*\|^2 + \|x_s - y_{s+1}\|^2 - \|y_{s+1} - x^*\|^2\right) \\ & = & \frac{1}{2 \eta} \left(\|x_s - x^*\|^2 - \|y_{s+1} - x^*\|^2\right) + \frac{\eta}{2} \|g_s\|^2. \end{eqnarray*}

Now note that \|g_s\| \leq L, and \|y_{s+1} - x^*\| \geq \|x_{s+1} - x^*\| (just do a picture to convince yourself). Summing the resulting inequality over s, and using that \|x_1 - x^*\| \leq R yield

    \[\sum_{s=1}^t \left( f(x_s) - f(x^*) \right) \leq \frac{R^2}{2 \eta} + \frac{\eta L^2 t}{2} .\]

Plugging in the value of \eta directly gives the statement (recall that by convexity f(\bar{x}_t) \leq \frac1{t} \sum_{s=1}^t f(x_s)).

\Box

 

Projected Subgradient Descent

In its general form the Project Subgradient Descent can solve the constrained minimization problem \min_{x \in \mathcal{X}} f(x) where \mathcal{X} is a compact convex set. The idea is simply to project back on the constraint set \mathcal{X} in case we leave it with the gradient step. That is we iterate the following equations:

    \begin{align*} & y_{t+1} = x_t - \eta g_t , \ \text{where} \ g_t \in \partial f(x_t) \\ & x_{t+1} = \mathrm{argmin}_{x \in \mathcal{X}} \|x - y_{t+1}\|. \end{align*}

The exact same result applies to this more general algorithm, with exactly the same proof. The only ‘difficulty’ is to prove that in this case too we have \|y_{s+1} - x^*\| \geq \|x_{s+1} - x^*\|. Again a picture makes this result obvious, and it can be proved formally by simply writing down first order optimality conditions for the definition of x_{t+1}.

Later in the course we will see a much more general form of the Projected Subgradient Descent, where the projection is not necessarily done in \ell_2.

 

Step-size, decision point

Note that the step-size \eta prescribed by the theorem depends on the number of steps t that we are going to make. It is possible to analyze an anytime version of the algorithm, with time-varying step-size \eta_s. For \eta_s = \frac{R}{L}\sqrt{\frac{1}{s}} one can prove the same rate up to a \log t factor.

The theorem gives two possibilities for the point to be returned by the optimization scheme: Either one can return \mathrm{argmin}_{1 \leq s \leq t} f(x_s), which requires to have access to a 0^{th} order oracle, or one can simply return the average of the points visited by the method, that is \frac{1}{t} \sum_{s=1}^t x_s.

 

Oracle complexity, computational complexity

To reach an \epsilon-optimal point, one needs to make T_{\epsilon} = \frac{R^2 L^2}{\epsilon^2} calls to the 1^{st} order oracle. This result is astonishing, because T_{\epsilon} is independent of the dimension! This fact is what makes gradient methods so suitable for large-scale optimization.

In terms of computational complexity, each step of the (Projected) Subgradient Descent (i.e. when the projection is made on the Euclidean ball) has a linear time O(n) complexity! The more general Projected Subgradient Descent might not enjoy such a nice time complexity: it depends on how easy it is to project a point on the constraint set. Again we will discuss these issues further when we consider the generalized scheme that does not necessarily project in \ell_2.

 

Oracle lower bound in high dimension

We will now derive our very first sharp lower bound for an optimization problem. This result was originally discovered by Nemirovski and Yudin, and we follow here the presentation of Nesterov.

 

Recall from the previous lecture that a black-box optimization procedure is simply a sequence of mappings from ‘history’ to the next query point. To simplify the argument we restrict our attention to mappings with the property that

(1)   \begin{equation*}  x_{t+1} \in x_1 + \mathrm{Span}(g_1, \hdots, g_t) , \end{equation*}

where g_s is the output of the first order oracle at point x_s (in particular g_s \in \partial f(x_s)). It is relatively straightforward to generalize the lower bound to any black-box procedure but it makes notation a little bit more messy.

Theorem For any black-box optimization procedure satisfying (1) and for any t \leq n, there exists a function f that satisfies (i)-(ii)-(iii) and such that

    \[\min_{1 \leq s \leq t} f(x_s) - f^* \geq \frac{R L}{2 (1 + \sqrt{t})} .\]

Proof: Without loss of generality we assume that x_1 = 0. To avoid confusion we denote here x(i) for the i^{th} coordinate of x \in \mathbb{R}^n. Let e_1, \hdots, e_n be the canonical basis of \mathbb{R}^n.

We consider now the following convex function:

    \[f(x) = \gamma \max_{1 \leq i \leq t} x(i) + \frac{\alpha}{2} \|x\|^2 .\]

It is easy to see that

    \[\partial f(x) = \alpha x + \mathrm{Conv}\left(e_i , i : x(i) = \max_{1 \leq j \leq t} x(j) \right).\]

In particular if \|x\| \leq R then for any g \in \partial f(x) one has \|g\| \leq \alpha R + \gamma. In other words f is (\alpha R + \gamma)-Lipschitz on the \ell_2 ball of radius R.

Next we describe the first order oracle for this function: when asked for a subgradient at x, it returns \alpha x + \gamma e_{i} where i is the {\em first} coordinate that satisfy x(i) = \max_{1 \leq j \leq t} x(j). In particular when asked for a subgradient at x_1=0 it returns e_1. Thus x_2 must lie on the line generated by e_1. It is easy to see recursively that in fact x_s must lie in the linear span of e_1, \hdots, e_{s-1}. In particular for s \leq t we necessarily have x_s(t) = 0 and thus f(x_s) \geq 0.

It remains to compute the minimal value of f. Let y be such that y(i) = - \frac{\gamma}{\alpha t} for 1 \leq i \leq t and y(i) = 0 for t+1 \leq i \leq n. It is clear that 0 \in \partial f(y) and thus the minimal value f^* of f is

    \[f(y) = - \frac{\gamma^2}{\alpha t} + \frac{\alpha}{2} \frac{\gamma^2}{\alpha^2 t} = - \frac{\gamma^2}{2 \alpha t} .\]

Wrapping up, we proved that for any s \leq t one must have

    \[f(x_s) - f^* \geq \frac{\gamma^2}{2 \alpha t} .\]

Taking \alpha = \frac{L}{R} \frac{1}{1 + \sqrt{t}} and \gamma = L \frac{\sqrt{t}}{1 + \sqrt{t}} concludes the proof (note in particular that \|y\|^2 = \frac{\gamma^2}{\alpha^2 t} = R^2 with these parameters).

\Box

 

What about lower bounds in finite dimension?

The result of the previous section shows that Subgradient Descent is optimal, as far as dimension-uniform rates are concerned. We already know that for a fixed dimension n one can obtain much faster rates of convergence as the number of calls to the oracle t goes to infinity. More precisely recall that the Ellipsoid Method satisfy, with t calls x_1, \hdots, x_t to a first order oracle,

    \[\min_{1 \leq s \leq t} f(x_s) - f^* = O(\exp(- t / n^2)) .\]

For large t (t \gg n^2) this rate is much better than the one derived for the Subgradient Descent. In this ‘large t‘ regime one can prove a lower bound of O(\exp(- t / n)) on the optimization error. One can bridge the gap between the upper bound given by the Ellipsoid Method and this lower bound with an algorithm known as the Method of centers of gravity. Unfortunately this method has a very bad computational complexity, which is why it is not so interesting for us. We refer the interested reader to the book by Nesterov for more details on these results in finite dimension.

This entry was posted in Optimization. Bookmark the permalink.

5 Responses to "ORF523: Oracle complexity of Lipschitz convex functions"

    • Sebastien Bubeck
    • Sebastien Bubeck

Leave a reply