ORF523 (‘Advanced Optimization’): Introduction

In optimization we are ‘given’ a set {\mathcal{X} \subset {\mathbb R}^n}, and a real-valued function {f} defined on this set. The objective is to find a point {x^* \in \mathcal{X}} which minimizes {f} on {\mathcal{X}}.

This goal is ambitious. It encompasses an enormous variety of problems, from engineering to politics. Here is an example that immediately suggests the difficulty of the problem: building an AI (Artificial Intelligence) for a game is an optimization problem, since one wants to find a strategy that maximizes the chances of winning.

In this course we are going to explore the frontiers of what we know about mathematical optimization. We will discuss both positive results: how can you solve such and such optimization problems; but also negative ones with statements like: ‘this family of problems is too hard to be solved in reasonable time’. The course will be divided into three parts, each one explores a different aspect of mathematical optimization.

Part 1: Computational complexity of optimization

The idea here is to look at optimization from the most fundamental perspective: how many ‘elementary operations’ do I need to do to find {x^*}? We will see that this question is very far from being resolved, and that there might be good reasons why this is such a difficult question. Here is sample of contents that will go into this chapter:

  • The ellipsoid method
  • Conic Programming and Interior point methods
  • P vs. NP
  • SDP relaxation of MAX-CUT
  • Extended formulations and Yannakakis’ Theorem
  • Dynamic programming

Part 2: Oracle complexity of optimization

The first part of the course focuses on the complexity in terms of ‘elementary operations’. Here we take another point of view: we simply count the number of ‘questions’ we ask about the function {f}. More formally we assume that we have an oracle that can give us the value and gradient of {f} at any point {x \in \mathcal{X}}. The oracle complexity is then the number of calls to this oracle one needs to do to find {x^*}. While being somehow less fundamental than the computational complexity, the oracle complexity allows for an almost ‘complete’ theory. Here is a partial list of topics for this part:

  • Oracle complexity of Lipschitz optimization
  • Gradient Descent for smooth convex optimization (respectively smooth and strongly convex)
  • Oracle complexity lower bounds for smooth convex optimization (respectively smooth and strongly convex)
  • Optimal methods: Nesterov’s Accelerated Gradient Descent, FISTA
  • Mirror Descent
  • Mirror Descent for saddle points
  • Coordinate Descent, Conditional Gradient Method (aka Frank-Wolfe)

Part 3: Statistical complexity of stochastic optimization

In many situations one does not have a perfect knowledge about the function {f}. In this last part of the course we explore what happens when the oracle gives noisy responses. Partial list of topics:

  • Mirror Descent for stochastic optimization
  • Lower bounds for stochastic optimization
  • Randomized methods for deterministic optimization
  • {0^{th}} order stochastic optimization

 

 

In most of this course we focus on convex set of constraints {\mathcal{X}}, and convex functions {f}. I would like here to review basic definitions and results related to convexity. I will also show why, ‘in some sense’, we can restrict our attention to convexity without loss of generality.

Convex sets

Definition 1 {\mathcal{X}} is convex if for all pair of points {x, y \in \mathcal{X}}, the segment {[x,y]} is included in {\mathcal{X}}, that is

\displaystyle \forall \gamma \in [0,1], (1-\gamma) x + \gamma y \in \mathcal{X}.

{\mathcal{X}} is a convex body if it is convex, compact, and with non-empty interior.

 

Definition 2 The convex hull {\mathrm{conv}(\mathcal{X})} of {\mathcal{X}} is defined as the set of convex combinations of points from {\mathcal{X}}, that is

\displaystyle \mathrm{conv}(\mathcal{X}) = \left\{\sum_{i=1}^N p_i x_i , \; N \geq 1, x_1, \hdots, x_N \in \mathcal{X}, p_1, \hdots, p_N \geq 0 \; \text{s.t.} \; \sum_{i=1}^N p_i =1\right\} .

It is easy to check that the convex hull can also be defined as the minimal (for inclusion) convex set that contains {\mathcal{X}}. A basic result about convex sets that we shall use extensively is the Separation Theorem.

Theorem 3 (Separation Theorem) Let {\mathcal{X} \subset {\mathbb R}^n} be a closed convex set, and {x_0 \in {\mathbb R}^n \setminus \mathcal{X}}. Then, there exists {w \in {\mathbb R}^n} and {t \in {\mathbb R}} such that

\displaystyle w^{\top} x_0 < t, \; \text{and} \; \forall x \in \mathcal{X}, w^{\top} x \geq t.

Note that if {\mathcal{X}} is not closed then one can only guarantee that {w^{\top} x_0 \leq w^{\top} x, \forall x \in \mathcal{X}} (and {w \neq 0}). A related result is the Supporting Hyperplane Theorem.

Theorem 4 (Supporting Hyperplane Theorem) Let {\mathcal{X} \subset {\mathbb R}^n} be a convex set, and {x_0 \in \partial \mathcal{X}}. Then, there exists {w \in {\mathbb R}^n, w \neq 0} such that

\displaystyle \forall x \in \mathcal{X}, w^{\top} x \geq w^{\top} x_0.

 

Convex functions

Definition 5 Let {\mathcal{X} \subset {\mathbb R}^n} be convex, and {f : \mathcal{X} \rightarrow {\mathbb R}}. {f} is convex if it always lies below its chords, that is

\displaystyle \forall (x,y,\gamma) \in \mathcal{X}^2 \times [0,1], f((1-\gamma) x + \gamma y) \leq (1-\gamma)f(x) + \gamma f(y) .

It is obvious that a function {f} is convex if and only if its epigraph

\displaystyle \mathrm{epi}(f) = \{(x,t) \in \mathcal{X} \times {\mathbb R} : t \geq f(x) \} ,

is a convex set.

Definition 6 Let {\mathcal{X} \subset {\mathbb R}^n}, and {f : \mathcal{X} \rightarrow {\mathbb R}}. Then {g \in {\mathbb R}^n} is a subgradient of {f} at {x \in \mathcal{X}} if for any {y \in \mathcal{X}} one has

\displaystyle f(x) - f(y) \leq g^{\top} (x - y) .

The set of subgradients of {f} at {x} is denoted {\partial f (x)}.

 

Proposition 7 Let {\mathcal{X} \subset {\mathbb R}^n} be convex, and {f : \mathcal{X} \rightarrow {\mathbb R}}. If {\forall x \in \mathcal{X}, \partial f(x) \neq \emptyset} then {f} is convex. Conversely if {f} is convex then for any {x \in \mathrm{int}(\mathcal{X}), \partial f(x) \neq \emptyset}. Furthermore if {f} is convex and differentiable at {x} then {\nabla f(x) \in \partial f(x)}.

Proof: The first claim is almost trivial: let {g \in \partial f((1-\gamma) x + \gamma y)}, then by definition one has

\displaystyle \begin{array}{rcl} & & f((1-\gamma) x + \gamma y) \leq f(x) + \gamma g^{\top} (y - x) , \\ & & f((1-\gamma) x + \gamma y) \leq f(y) + (1-\gamma) g^{\top} (x - y) , \end{array}

which clearly shows that {f} is convex by adding the two (appropriately rescaled) inequalities.

Now let us prove that a convex function {f} has subgradients in the interior of {\mathcal{X}}. We build a subgradient by using a supporting hyperplane to the epigraph of the function. Let {x \in \mathcal{X}}. Then clearly {(x,f(x)) \in \partial \ \mathrm{epi}(f)}, and {\mathrm{epi}(f)} is a convex set. Thus by using the Supporting Hyperplane Theorem, there exists {(a,b) \in {\mathbb R}^n \times {\mathbb R}} such that

\displaystyle a^{\top} x + b f(x) \geq a^{\top} y + b t, \forall (y,t) \in \mathrm{epi}(f) . \ \ \ \ \ (1)

Clearly, by letting {t} tends to infinity, one can see that {b \leq 0}. Now let us assume that {x} is in the interior of {\mathcal{X}}. Then for {\epsilon > 0} small enough, {y=x + \epsilon a \in \mathcal{X}}, which implies that {b} cannot be equal to {0}. Thus rewriting (1) for {t=f(y)} one obtains

\displaystyle f(x) - f(y) \leq \frac{1}{|b|} a^{\top} (x - y) .

Thus {a / |b| \in \partial f(x)} which concludes the proof of the second claim.

Finally let {f} be a convex and differentiable function. Then by definition:

\displaystyle \begin{array}{rcl} f(y) & \geq & \frac{f((1-\gamma) x + \gamma y) - (1- \gamma) f(x)}{\gamma} \\ & = & f(x) + \frac{f(x + \gamma (y - x)) - f(x)}{\gamma} \\ & \rightarrow_{\gamma \rightarrow 0} & f(x) + \nabla f(x)^{\top} (y-x), \end{array}

which shows that {\nabla f(x) \in \partial f(x)}. \Box

Why convexity?

This section is neither meant to provide a formal argument, nor to give an algorithmic idea. However it shows that, intuitively, assuming that {f} and {\mathcal{X}} are convex comes ‘without loss of generality’. In particular this should be taken as a word of caution: an optimization problem with convex constraints and convex objective is not easy in general. Somehow the difficulty of the problem is encoded in the geometry of {\mathcal{X}}: if {\mathcal{X}} is ‘nice’, then minimizing convex objectives over {\mathcal{X}} is ‘easy’. The rest of the class will make this statement much more precise. Let us now show why convexity is ‘universal’. First note that we can do the following rewriting of our problem:

\displaystyle \inf_{x \in \mathcal{X}} f(x) = \inf_{(x,t) \in \mathrm{epi}(f)} t ,

and a solution {(x^*,t^*)} to the latter problem immediately gives a solution to the former. Thus we have already reduced the case of a general function {f}, to a simple linear function of the form {f(x) = c^{\top} x} for some {c \in {\mathbb R}^n}. Now remark that we always have

\displaystyle \inf_{x \in \mathcal{X}} c^{\top} x = \inf_{x \in \mathrm{conv}(\mathcal{X})} c^{\top} x .

This follows from the following computation:

\displaystyle \begin{array}{rcl} \inf_{x \in \mathrm{conv}(\mathcal{X})} c^{\top} x & = & \inf_{N \geq 1, x_1, \hdots, x_N \in \mathcal{X}, p_1, \hdots, p_N \geq 0 \; \text{s.t.} \; \sum_{i=1}^N p_i =1} \sum_{i=1}^N p_i (c^{\top} x_i) \\ & \geq & \inf_{N \geq 1, x_1, \hdots, x_N \in \mathcal{X}, p_1, \hdots, p_N \geq 0 \; \text{s.t.} \; \sum_{i=1}^N p_i =1} \sum_{i=1}^N p_i (\min_j c^{\top} x_j) \\ & = & \inf_{x \in \mathcal{X}} c^{\top} x . \end{array}

Here a solution to {\inf_{x \in \mathrm{conv}(\mathcal{X})} c^{\top} x} gives a solution to {\inf_{x \in \mathcal{X}} c^{\top} x} if one is computationally able to decompose {x \in \mathrm{conv}(\mathcal{X})} into a convex combination of points in {\mathcal{X}} (this can be a very difficult task in some cases!). In any case we showed that

\displaystyle \inf_{x \in \mathcal{X}} f(x) = \inf_{(x,t) \in \mathrm{conv}(\mathrm{epi}(f))} t ,

and thus we reduced an arbitrary optimization problem to a linear optimization over a convex set.

This entry was posted in Optimization. Bookmark the permalink.

6 Responses to "ORF523 (‘Advanced Optimization’): Introduction"

    • Sebastien Bubeck
    • Sebastien Bubeck
    • Sebastien Bubeck

Leave a reply