ORF523: Oracle complexity, large-scale optimization

Our study of the computational complexity of mathematical optimization led us to realize that we are still far from a complete understanding of what exactly can be optimized with a computer.

On the positive side we proved that some problems can be solved in reasonable time: An optimization problem with convex objective and relatively simple constraint set (see previous lectures) can be efficiently solved with either the ellipsoid method, or interior point methods (and the typical running time was something like ${O(n^3)}$ for optimization in dimension ${n}$ ).

On the negative side we defined the notion of ${\mathbf{NP}}$ -completeness to show that some problems are intrinsically hard, such as computing the size of the largest clique in a graph, or computing the maximal cut of a graph. We also explained why such results have to be taken with a grain of salt, as it does not necessarily mean that approximate solutions are hard to find.

Note that at this point the reader might be either disappointed, or excited that there is still so much that remains to be discovered!

Another issue with what we discussed so far is that both the ellipsoid method and interior point methods are intractable in very high dimension. Indeed, while from a computational complexity point of view a time in ${O(n^3)}$ is good enough, in pratice this is way too big when ${n}$ is in the order of millions, or billions, which is the order of magnitude in ‘Big Data’ applications! In that case one may want to restrict to optimization procedures that run in linear time (or almost linear time), that is in time ${O(n)}$ .

In this second part of the course we will describe a theory that addresses both concerns: it is a complete theory of optimization in the sense that within the restricted model that we consider we will prove both upper and lower complexity bounds (and they will match!), and it also naturally leads to algorithms that work in linear time, and which are thus well-suited for large-scale optimization. This is the theory of black-box optimization.

Black-box optimization, oracle complexity

In the black-box model one tries to minimize an unknown function ${f}$ over a known constraint set ${\mathcal{X} \subset {\mathbb R}^n}$ . One can get information about the unknown function through queries to oracles:

${0^{th}}$ -order oracle: given ${x \in \mathcal{X}}$ it outputs ${f(x)}$ .
${1^{st}}$ -order oracle: if ${f}$ is differentiable (respectively ${f}$ convex) then given ${x \in \mathcal{X}}$ it outputs a gradient ${\nabla f(x)}$ (respectively a subgradient ${g \in \partial f(x)}$ ).

In the following we are essentially restricting our attention to optimization procedure that have access to both ${0^{th}}$ -order and ${1^{st}}$ -order oracles. A black-box optimization procedure is simply a sequence (over ${t=0,1, \hdots}$ ) of mappings ${\phi_t: \mathcal{X}^t \times ({\mathbb R}^n)^t \times {\mathbb R}^t \rightarrow \mathcal{X}}$ . The algorithm given by these mappings runs iteratively as follows: initially it makes a query to the point ${x_0 =\phi_0(\emptyset)}$ , and then at the ${t^{th}}$ step it queries

$\displaystyle x_t = \phi_t(x_0, \hdots, x_{t-1}, \nabla f(x_0), \hdots, \nabla f(x_{t-1}), f(x_0), \hdots, f(x_{t-1})) .$

The minimax oracle optimization error after ${t}$ steps, over a set of functions ${\mathcal{F}}$ , is defined as follows:

$\displaystyle \mathrm{OC}_t(\mathcal{F}) = \inf_{\phi_0, \hdots, \phi_t} \sup_{f \in \mathcal{F}} \left( f(x_t) - \inf_{x \in \mathcal{X}} f(x) \right).$

Intuitively ${\mathrm{OC}_t(\mathcal{F})}$ describes the best possible rate of convergence (as a function of the number of queries) for the optimization error when one restricts to black-box procedures that only know the constraint set ${\mathcal{X}}$ and the class of functions ${\mathcal{F}}$ . We will derive both lower bounds for the oracle optimization error (with ‘information-theoretic’ arguments), and upper bounds (with explicit algorithms), for various set of convex functions ${\mathcal{F}}$ of interest (essentially smooth/non-smooth, and strongly convex/non-strongly convex). In the literature the results are often stated in terms of oracle complexity, which is the smallest ${T_{\epsilon}}$ such that ${\mathrm{OC}_{T_{\epsilon}}(\mathcal{F}) \leq \epsilon}$ . We may use interchangeably the notions of oracle complexity and oracle optimization error.

The oracle complexity tells us nothing a priori about computational complexity. Indeed in the definition of the minimax oracle optimization we made no restriction on the mappings ${\phi_t}$ , and these mappings could be hard to compute. This complaint is of course valid, especially for the lower bounds. On the other hand for the upper bounds we will also discuss the computational complexity of the black-box optimization procedures that we describe.

We have already studied one black-box optimization procedure: the ellipsoid method. We proved an upper bound on the oracle optimization error for the ellipsoid method on bounded convex functions (over a convex body) of order ${O(\exp(- t / n^2))}$ . This bound does not look good with respect to our objective of finding linear time algorithms: here the number of calls to the oracle needs to be at least ${O(n^2)}$ to attain reasonnable accuracy, and each call requires to manipulate vectors in ${{\mathbb R}^n}$ , which leads to a lower computational complexity bound of ${O(n^3)}$ . In fact, more generally, we can see that our only hope to obtain linear time algorithms is to have an oracle complexity bound which is independent (or almost independent) of the dimension ${n}$ . Thus in the following we will restrict our attention to this high-dimensional setting, where we strive for oracle complexity bounds almost independent of the dimension. Note that this also the reason why we restrict to ${0^{th}}$ and ${1^{st}}$ -order oracles: just looking at the answer from higher order oracles would take too much computational effort.

Note that interior-point methods are not black-box optimization procedures. Indeed typically one reduces the problem ${\min_{x \in \mathcal{X}} f(x)}$ to ${\min_{(x,t) \in \mathrm{epi}(f)} t}$ , and then one constructs a self-concordant barrier ${F}$ for the set ${\mathrm{epi}(f)}$ . In particular the barrier ${F}$ ‘encodes’ some global property of the function we are trying to minimize, on the contrary to the ${0^{th}}$ and ${1^{st}}$ order oracle which give only local information on the function. The fact that interior-point methods exploit the global structure of the function at hand is the reason for their success! But it is also what makes them computationally more intensive than black-box procedures.

Examples of large-scale optimization problems

Let us come back to our classification problem, with pairs of points ${(x_i, y_i) \in {\mathbb R}^n \times \{-1,1\}}$ , ${i=1, \hdots, m}$ . Assuming that the data set was linearly separable we considered the following optimization problem to find a separating hyperplane maximizing the margin:

$\displaystyle \begin{array}{rcl} & \min_{w \in {\mathbb R}^n} & \|w\|^2 \\ & \text{subject to} & y_i w^{\top} x_i \geq 1, i =1, \hdots, m . \end{array}$

Of course the assumption that the data set is linearly separable is unrealistic, especially when one has a very large number of examples (that is ${m}$ is very large). In that case one may want to allow for the constraint ${y_i w^{\top} x_i \geq 1}$ to be violated with a price to pay in the objective function for such an error. A natural idea is to have a fixed price ${\lambda >0}$ for this error, leading to the following optimization problem:

$\displaystyle \min_{w \in {\mathbb R}^n} \|w\|^2 + \lambda \sum_{i=1}^m \mathbf{1} \{y_i w^{\top} x_i < 1\} .$

The above optimization problem is non-convex because the loss function ${\ell(s) = \mathbf{1} \{s < 1\}}$ is non-convex. As a result we do not know how to solve this optimization problem. At this point it seems natural to find convex surrogate loss functions for the zero-one loss.

Hinge loss

Perhaps the simpler convex surrogate for the zero-one loss is the hinge loss ${\ell(s) = \max(0, 1-s)}$ which leads to the following convex problem:

$\displaystyle \min_{w \in {\mathbb R}^n} \|w\|^2 + \lambda \sum_{i=1}^m \max(0, 1-y_i w^{\top} x_i) .$

Note that this problem is unconstrained, and the objective function is convex, non-smooth (it has ‘kinks’), and it has ‘curvature’ everywhere because of the term ${\|w\|^2}$ (in other words it is strongly convex).

Logistic loss

Another possible convex surrogate with nicer properties is the logistic loss ${\ell(s) = \log(1 + \exp(-s))}$ which leads to the following convex problem:

$\displaystyle \min_{w \in {\mathbb R}^n} \|w\|^2 + \lambda \sum_{i=1}^m \log(1 + \exp(- y_i w^{\top} x_i)) .$

Note that this problem is unconstrained, and the objective function is smooth and strongly-convex.

Sparse least-squares regression

Consider now the case where ${y_i}$ can take any real value (instead of just ${\{-1,1\}}$ ). In that case a standard objective to minimize is the sum of squared errors when one predicts ${y_i}$ with ${w^{\top} x_i}$ , that is:

$\displaystyle \min_{w \in {\mathbb R}^n} \sum_{i=1}^m (y_i - w^{\top} x_i)^2 .$

Note that this can be rewritten as (with obvious notation):

$\displaystyle \min_{w \in {\mathbb R}^n} \| X w - Y \|_2^2 .$

In a high-dimensional setting (that is ${n}$ is large) we might be interested in restricting to predictors ${w}$ which contain only a few non-zero entries. Indeed in that case the resulting ${w}$ would allow us to interpret the results more easily: to predict ${y}$ only the coordinates of ${x}$ where ${w}$ is non-zero are important. A natural optimization problem here is to minimize the sum of squared errors, with a penalization for predictors ${w}$ with a large number of non-zero coefficients, that is

$\displaystyle \min_{w \in {\mathbb R}^n} \| X w - Y \|_2^2 + \lambda \|w\|_0.$

Again we run into the issue that ${w \mapsto \|w\|_0}$ is a non-convex function. A natural convex surrogate that has been used extensively in the last decade is the ${\ell_1}$ -penalization, leading to the LASSO objective:

$\displaystyle \min_{w \in {\mathbb R}^n} \| X w - Y \|_2^2 + \lambda \|w\|_1.$

Portfolio optimization, sparse portfolio

Consider the problem of investing a capital of ${\$1}$ into ${n}$ assets. This amounts to choosing a portfolio ${w \in {\mathbb R}^n_+}$ such that ${\sum_{i=1}^n w_i = 1}$ . Now assume that we have access to the vector of returns for the ${n}$ assets over ${T}$ trading days, which we denote ${r_1, \hdots, r_T}$ . This means that if every day we re-invest our capital according to ${w}$ , then at the end of the ${T^{th}}$ day we would have a capital of

$\displaystyle \prod_{t=1}^T w^{\top} r_t .$

In portfolio optimization one is interested in finding the portfolio that would have maximized our return, which is equivalent to the following convex problem:

$\displaystyle \min_{w \in {\mathbb R}^n_+, \sum_{i=1}^n w_i = 1} \sum_{t=1}^T - \log \left(w^{\top} r_t\right) .$

If one wants to add a sparsity constraint on the portfolio one cannot simply add a penalization in terms of the ${\ell_1}$ -norm of ${w}$ (since it is always equal to ${1}$ ). A possible alternative is to solve, for some fixed ${p \in (1,2)}$ ,

$\displaystyle \min_{w \in {\mathbb R}^n_+, \sum_{i=1}^n w_i = 1} \sum_{t=1}^T - \log \left(w^{\top} r_t\right) + \lambda \|w\|_p.$

2 Responses to "ORF523: Oracle complexity, large-scale optimization"

By Nicholas Johnson April 5, 2013 - 3:38 pm

If we want our portfolio to be sparse and update occasionally, we can simply add a L1 norm of the difference between w_t and w_{t+1}.
- By Sebastien Bubeck April 9, 2013 - 8:32 pm
  
  That’s a good idea indeed!! I’ll make sure to add this comment in the final version of the lecture notes.

ORF523: Oracle complexity, large-scale optimization

2 Responses to "ORF523: Oracle complexity, large-scale optimization"

By Nicholas Johnson April 5, 2013 - 3:38 pm

By Sebastien Bubeck April 9, 2013 - 8:32 pm

Leave a reply

Archives

Categories

Recent Posts

Subscribe to Blog via Email

Meta

Blogroll