
ORF523 (‘Advanced Optimization’): Introduction

In optimization we are ‘given’ a set {\mathcal{X} \subset {\mathbb R}^n}, and a real-valued function {f} defined on this set. The objective is to find a point {x^* \in \mathcal{X}} which minimizes {f} on {\mathcal{X}}.

This goal is ambitious. It encompasses an enormous variety of problems, from engineering to politics. Here is an example that immediately suggests the difficulty of the problem: building an AI (Artificial Intelligence) for a game is an optimization problem, since one wants to find a strategy that maximizes the chances of winning.

In this course we are going to explore the frontiers of what we know about mathematical optimization. We will discuss both positive results (how one can solve such and such optimization problems) and negative ones (statements of the form ‘this family of problems is too hard to be solved in reasonable time’). The course will be divided into three parts, each one exploring a different aspect of mathematical optimization.

Part 1: Computational complexity of optimization

The idea here is to look at optimization from the most fundamental perspective: how many ‘elementary operations’ do I need to find {x^*}? We will see that this question is very far from being resolved, and that there might be good reasons why it is such a difficult question. Here is a sample of the contents of this part:

  • The ellipsoid method
  • Conic Programming and Interior point methods
  • P vs. NP
  • SDP relaxation of MAX-CUT
  • Extended formulations and Yannakakis’ Theorem
  • Dynamic programming

Part 2: Oracle complexity of optimization

The first part of the course focuses on complexity in terms of ‘elementary operations’. Here we take another point of view: we simply count the number of ‘questions’ we ask about the function {f}. More formally we assume that we have an oracle that can give us the value and gradient of {f} at any point {x \in \mathcal{X}}. The oracle complexity is then the number of calls to this oracle one needs to make to find {x^*}. While somewhat less fundamental than computational complexity, oracle complexity allows for an almost ‘complete’ theory (a minimal illustration of the oracle model is sketched after the list below). Here is a partial list of topics for this part:

  • Oracle complexity of Lipschitz optimization
  • Gradient Descent for smooth convex optimization (respectively smooth and strongly convex)
  • Oracle complexity lower bounds for smooth convex optimization (respectively smooth and strongly convex)
  • Optimal methods: Nesterov’s Accelerated Gradient Descent, FISTA
  • Mirror Descent
  • Mirror Descent for saddle points
  • Coordinate Descent, Conditional Gradient Method (aka Frank-Wolfe)
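
To make the oracle model concrete, here is a minimal sketch in Python (an ad hoc illustration, not material from the lecture): a first-order oracle returns the value and gradient of {f} at a query point while counting how many times it is queried, and oracle complexity counts exactly these queries. The quadratic objective and the step size below are arbitrary choices made for the example.

    import numpy as np

    # First-order oracle: given x it returns f(x) and the gradient of f at x,
    # while keeping track of how many times it has been queried.
    class FirstOrderOracle:
        def __init__(self, f, grad):
            self.f, self.grad, self.calls = f, grad, 0

        def __call__(self, x):
            self.calls += 1
            return self.f(x), self.grad(x)

    # Illustrative smooth convex objective: f(x) = 1/2 ||Ax - b||^2.
    rng = np.random.default_rng(0)
    A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
    oracle = FirstOrderOracle(f=lambda x: 0.5 * np.sum((A @ x - b) ** 2),
                              grad=lambda x: A.T @ (A @ x - b))

    # Plain gradient descent with step size 1/L, where L is the smoothness constant.
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(5)
    for _ in range(100):
        value, g = oracle(x)          # one oracle call per iteration
        x = x - g / L

    print(oracle.calls, value)        # number of oracle calls, last value returned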

Part 3: Statistical complexity of stochastic optimization

In many situations one does not have perfect knowledge of the function {f}. In this last part of the course we explore what happens when the oracle gives noisy responses (a small sketch of such a noisy oracle follows the list). Partial list of topics:

  • Mirror Descent for stochastic optimization
  • Lower bounds for stochastic optimization
  • Randomized methods for deterministic optimization
  • {0^{th}} order stochastic optimization
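
As a minimal follow-up to the earlier first-order oracle sketch (again an ad hoc illustration, not material from the lecture), one can corrupt the oracle's answers with Gaussian noise, so that each query only returns an unbiased estimate of the gradient; stochastic gradient descent with a decaying step size and iterate averaging still makes progress.

    import numpy as np

    rng = np.random.default_rng(1)
    A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
    grad = lambda x: A.T @ (A @ x - b)

    # Noisy first-order oracle: the true gradient plus Gaussian noise, so each
    # query only provides an unbiased estimate of the gradient of f at x.
    def noisy_oracle(x, sigma=1.0):
        return grad(x) + sigma * rng.standard_normal(x.shape)

    x_avg = x = np.zeros(5)
    for t in range(1, 1001):
        x = x - (0.1 / np.sqrt(t)) * noisy_oracle(x)   # decaying step size
        x_avg = x_avg + (x - x_avg) / t                # running average of the iterates

    print(0.5 * np.sum((A @ x_avg - b) ** 2))          # value at the averaged point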

 

 

In most of this course we focus on a convex set of constraints {\mathcal{X}} and convex functions {f}. Here I would like to review basic definitions and results related to convexity. I will also show why, ‘in some sense’, we can restrict our attention to convexity without loss of generality.

Convex sets

Definition 1 {\mathcal{X}} is convex if for every pair of points {x, y \in \mathcal{X}}, the segment {[x,y]} is included in {\mathcal{X}}, that is

\displaystyle \forall \gamma \in [0,1], (1-\gamma) x + \gamma y \in \mathcal{X}.

{\mathcal{X}} is a convex body if it is convex, compact, and with non-empty interior.

 

Definition 2 The convex hull {\mathrm{conv}(\mathcal{X})} of {\mathcal{X}} is defined as the set of convex combinations of points from {\mathcal{X}}, that is

\displaystyle \mathrm{conv}(\mathcal{X}) = \left\{\sum_{i=1}^N p_i x_i , \; N \geq 1, x_1, \hdots, x_N \in \mathcal{X}, p_1, \hdots, p_N \geq 0 \; \text{s.t.} \; \sum_{i=1}^N p_i =1\right\} .

It is easy to check that the convex hull can also be defined as the minimal (for inclusion) convex set that contains {\mathcal{X}}. A basic result about convex sets that we shall use extensively is the Separation Theorem.

Theorem 3 (Separation Theorem) Let {\mathcal{X} \subset {\mathbb R}^n} be a closed convex set, and {x_0 \in {\mathbb R}^n \setminus \mathcal{X}}. Then, there exists {w \in {\mathbb R}^n} and {t \in {\mathbb R}} such that

\displaystyle w^{\top} x_0 < t, \; \text{and} \; \forall x \in \mathcal{X}, w^{\top} x \geq t.

Note that if {\mathcal{X}} is not closed then one can only guarantee that {w^{\top} x_0 \leq w^{\top} x, \forall x \in \mathcal{X}} (and {w \neq 0}). A related result is the Supporting Hyperplane Theorem.

Theorem 4 (Supporting Hyperplane Theorem) Let {\mathcal{X} \subset {\mathbb R}^n} be a convex set, and {x_0 \in \partial \mathcal{X}}. Then, there exists {w \in {\mathbb R}^n, w \neq 0} such that

\displaystyle \forall x \in \mathcal{X}, w^{\top} x \geq w^{\top} x_0.
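
As a quick numerical illustration of the Separation Theorem (a toy sketch, not part of the lecture), take {\mathcal{X}} to be the closed Euclidean unit ball and {x_0} a point outside it; projecting {x_0} onto {\mathcal{X}} gives a point {p}, and {w = p - x_0} together with {t = w^{\top} p} satisfies the conclusion of Theorem 3.

    import numpy as np

    rng = np.random.default_rng(0)
    x0 = np.array([2.0, 1.0, -1.5])           # a point outside the unit ball

    # Projection of x0 onto the closed unit ball X = {x : ||x|| <= 1}.
    p = x0 / max(np.linalg.norm(x0), 1.0)

    # Separating hyperplane built from the projection: w^T x >= t for all x in X,
    # while w^T x0 < t.
    w, t = p - x0, (p - x0) @ p

    # Sanity check on random points of the ball.
    for _ in range(1000):
        x = rng.standard_normal(3)
        x = x / max(np.linalg.norm(x), 1.0)   # push the sample into the ball
        assert w @ x >= t - 1e-12
    assert w @ x0 < t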

 

Convex functions

Definition 5 Let {\mathcal{X} \subset {\mathbb R}^n} be convex, and {f : \mathcal{X} \rightarrow {\mathbb R}}. {f} is convex if it always lies below its chords, that is

\displaystyle \forall (x,y,\gamma) \in \mathcal{X}^2 \times [0,1], f((1-\gamma) x + \gamma y) \leq (1-\gamma)f(x) + \gamma f(y) .

It is obvious that a function {f} is convex if and only if its epigraph

\displaystyle \mathrm{epi}(f) = \{(x,t) \in \mathcal{X} \times {\mathbb R} : t \geq f(x) \} ,

is a convex set.

Definition 6 Let {\mathcal{X} \subset {\mathbb R}^n}, and {f : \mathcal{X} \rightarrow {\mathbb R}}. Then {g \in {\mathbb R}^n} is a subgradient of {f} at {x \in \mathcal{X}} if for any {y \in \mathcal{X}} one has

\displaystyle f(x) - f(y) \leq g^{\top} (x - y) .

The set of subgradients of {f} at {x} is denoted {\partial f (x)}.
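
As a concrete one-dimensional sketch (not from the lecture): the function {f(x) = |x|} is convex but not differentiable at {0}, and every {g \in [-1,1]} is a subgradient of {f} at {x=0}, which can be checked numerically against Definition 6.

    import numpy as np

    f = abs                     # f(x) = |x|, convex, not differentiable at 0

    # Every g in [-1, 1] satisfies f(0) - f(y) <= g * (0 - y), i.e. -|y| <= -g*y,
    # for all y, so it is a subgradient of f at x = 0.
    x = 0.0
    for g in np.linspace(-1.0, 1.0, 21):
        for y in np.linspace(-5.0, 5.0, 101):
            assert f(x) - f(y) <= g * (x - y) + 1e-12

    # In contrast g = 1.5 is not a subgradient at 0: the inequality fails at y = 1.
    print(f(0.0) - f(1.0) <= 1.5 * (0.0 - 1.0))   # False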

 

Proposition 7 Let {\mathcal{X} \subset {\mathbb R}^n} be convex, and {f : \mathcal{X} \rightarrow {\mathbb R}}. If {\forall x \in \mathcal{X}, \partial f(x) \neq \emptyset} then {f} is convex. Conversely if {f} is convex then for any {x \in \mathrm{int}(\mathcal{X}), \partial f(x) \neq \emptyset}. Furthermore if {f} is convex and differentiable at {x} then {\nabla f(x) \in \partial f(x)}.

Proof: The first claim is almost trivial: let {g \in \partial f((1-\gamma) x + \gamma y)}, then by definition one has

\displaystyle \begin{array}{rcl} & & f((1-\gamma) x + \gamma y) \leq f(x) + \gamma g^{\top} (y - x) , \\ & & f((1-\gamma) x + \gamma y) \leq f(y) + (1-\gamma) g^{\top} (x - y) , \end{array}

Multiplying the first inequality by {1-\gamma} and the second by {\gamma}, and adding them, shows that {f} is convex.

Now let us prove that a convex function {f} has subgradients in the interior of {\mathcal{X}}. We build a subgradient by using a supporting hyperplane to the epigraph of the function. Let {x \in \mathcal{X}}. Then clearly {(x,f(x)) \in \partial \mathrm{epi}(f)}, and {\mathrm{epi}(f)} is a convex set. Thus by using the Supporting Hyperplane Theorem, there exists {(a,b) \in {\mathbb R}^n \times {\mathbb R}} with {(a,b) \neq (0,0)} such that

\displaystyle a^{\top} x + b f(x) \geq a^{\top} y + b t, \forall (y,t) \in \mathrm{epi}(f) . \ \ \ \ \ (1)

Clearly, by letting {t} tend to infinity, one can see that {b \leq 0}. Now let us assume that {x} is in the interior of {\mathcal{X}}. Then for {\epsilon > 0} small enough, {y=x + \epsilon a \in \mathcal{X}}, which implies that {b} cannot be equal to {0}: indeed, if {b = 0} then (1) applied to {y = x + \epsilon a} gives {\epsilon \, a^{\top} a \leq 0}, hence {a = 0}, contradicting {(a,b) \neq (0,0)}. Thus rewriting (1) for {t=f(y)} one obtains

\displaystyle f(x) - f(y) \leq \frac{1}{|b|} a^{\top} (x - y) .

Thus {a / |b| \in \partial f(x)} which concludes the proof of the second claim.

Finally let {f} be a convex and differentiable function. Then by definition of convexity:

\displaystyle \begin{array}{rcl} f(y) & \geq & \frac{f((1-\gamma) x + \gamma y) - (1- \gamma) f(x)}{\gamma} \\ & = & f(x) + \frac{f(x + \gamma (y - x)) - f(x)}{\gamma} \\ & \rightarrow_{\gamma \rightarrow 0} & f(x) + \nabla f(x)^{\top} (y-x), \end{array}

which shows that {\nabla f(x) \in \partial f(x)}. \Box

Why convexity?

This section is neither meant to provide a formal argument nor to give an algorithmic idea. However it shows that, intuitively, assuming that {f} and {\mathcal{X}} are convex comes ‘without loss of generality’. In particular this should be taken as a word of caution: an optimization problem with convex constraints and a convex objective is not easy in general. In some sense the difficulty of the problem is encoded in the geometry of {\mathcal{X}}: if {\mathcal{X}} is ‘nice’, then minimizing convex objectives over {\mathcal{X}} is ‘easy’. The rest of the class will make this statement much more precise. Let us now show why convexity is ‘universal’. First note that we can rewrite our problem as follows:

\displaystyle \inf_{x \in \mathcal{X}} f(x) = \inf_{(x,t) \in \mathrm{epi}(f)} t ,

and a solution {(x^*,t^*)} to the latter problem immediately gives a solution to the former. Thus we have already reduced the case of a general function {f} to that of a simple linear function of the form {f(x) = c^{\top} x} for some {c \in {\mathbb R}^n}. Now remark that we always have

\displaystyle \inf_{x \in \mathcal{X}} c^{\top} x = \inf_{x \in \mathrm{conv}(\mathcal{X})} c^{\top} x .

Since {\mathcal{X} \subset \mathrm{conv}(\mathcal{X})} one clearly has {\inf_{x \in \mathrm{conv}(\mathcal{X})} c^{\top} x \leq \inf_{x \in \mathcal{X}} c^{\top} x}, and the reverse inequality follows from the following computation:

\displaystyle \begin{array}{rcl} \inf_{x \in \mathrm{conv}(\mathcal{X})} c^{\top} x & = & \inf_{N \geq 1, x_1, \hdots, x_N \in \mathcal{X}, p_1, \hdots, p_N \geq 0 \; \text{s.t.} \; \sum_{i=1}^N p_i =1} \sum_{i=1}^N p_i (c^{\top} x_i) \\ & \geq & \inf_{N \geq 1, x_1, \hdots, x_N \in \mathcal{X}, p_1, \hdots, p_N \geq 0 \; \text{s.t.} \; \sum_{i=1}^N p_i =1} \sum_{i=1}^N p_i (\min_j c^{\top} x_j) \\ & = & \inf_{x \in \mathcal{X}} c^{\top} x . \end{array}

Here a solution to {\inf_{x \in \mathrm{conv}(\mathcal{X})} c^{\top} x} gives a solution to {\inf_{x \in \mathcal{X}} c^{\top} x} if one is computationally able to decompose {x \in \mathrm{conv}(\mathcal{X})} into a convex combination of points in {\mathcal{X}} (this can be a very difficult task in some cases!). In any case we showed that

\displaystyle \inf_{x \in \mathcal{X}} f(x) = \inf_{(x,t) \in \mathrm{conv}(\mathrm{epi}(f))} t ,

and thus we reduced an arbitrary optimization problem to a linear optimization over a convex set.
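
As a small numerical sanity check of the identity {\inf_{x \in \mathcal{X}} c^{\top} x = \inf_{x \in \mathrm{conv}(\mathcal{X})} c^{\top} x} (a sketch using scipy, not part of the post), take {\mathcal{X}} to be a finite set of points: minimizing {c^{\top} x} over {\mathrm{conv}(\mathcal{X})}, parameterized by the convex-combination weights {p}, is a linear program whose optimal value coincides with the minimum over the points of {\mathcal{X}} themselves.

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    X = rng.standard_normal((8, 3))    # X = {x_1, ..., x_8}, a finite subset of R^3
    c = rng.standard_normal(3)

    # Minimum of c^T x over the finite set X itself.
    min_over_X = np.min(X @ c)

    # Minimum of c^T x over conv(X): write x = sum_i p_i x_i with p >= 0 and
    # sum_i p_i = 1, so the objective becomes sum_i p_i (c^T x_i), an LP in p.
    res = linprog(c=X @ c, A_eq=np.ones((1, 8)), b_eq=[1.0], bounds=(0.0, None))

    print(min_over_X, res.fun)         # the two values coincide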
