ORF523: Interior Point Methods

Khachiyan’s ellipsoid method was such a breakthrough that it made it to The New York Times in 1979. The next time that optimization theory was featured in The New York Times was only a few years later, in 1984, and it was to report a new breakthrough: Karmarkar’s interior point method for solving LPs. As with the ellipsoid method, Karmarkar’s algorithm was shown to be a polynomial-time algorithm for LPs, but this time it was also shown that the method had very good empirical performance! We will now describe the idea behind this new class of algorithms, following the presentation of Nesterov and Nemirovski.

 

Interior point methods

Consider the following standard optimization problem:

\displaystyle \text{Find} \; x^* \in \mathrm{argmin}_{x \in \mathcal{X}} c^{\top} x , \; \text{where} \; \mathcal{X} \subset {\mathbb R}^n \; \text{is a convex body.}

Assume that we can construct a barrier {F} for the set {\mathcal{X}}, that is, a real-valued function {F} defined on {\mathrm{int}(\mathcal{X})} and such that

\displaystyle F(x) \rightarrow_{x \rightarrow \partial \mathcal{X}} + \infty .

Let us extend the function {F} to {{\mathbb R}^n} with {F(x) = +\infty} for {x \not\in \mathrm{int}(\mathcal{X})}. Then, instead of the original problem, one could consider the following penalized version, for {t \geq 0}:

\displaystyle \min_{x \in {\mathbb R}^n} t c^{\top} x + F(x) .

Let {x^*(t)} be a solution to the above problem; we call the curve {(x^*(t))_{t \geq 0}} the central path. Note that {x^*(0)} is simply the minimizer of {F}; we call this point the analytical center of {\mathcal{X}}. Remark also that {x^*(t) \rightarrow_{t \rightarrow +\infty} x^*}, that is, the central path converges to the solution of our original problem.
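
As a concrete illustration (a worked example added here; it is not part of the original argument), take {n=1}, {\mathcal{X} = [0,1]}, {c = 1}, and {F(x) = - \log x - \log (1-x)}. The penalized problem is

\displaystyle \min_{0 < x < 1} \; t x - \log x - \log (1-x) ,

with first-order optimality condition {t = \frac1{x} - \frac1{1-x}}. At {t=0} this gives the analytical center {x^*(0) = 1/2}, and as {t} grows the solution {x^*(t)} decreases monotonically towards {0 = \mathrm{argmin}_{x \in [0,1]} x}: the central path connects the ‘center’ of {\mathcal{X}} to the optimum.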

The whole idea of interior point methods is to ‘boost’ an efficient locally convergent optimization algorithm, such as Newton’s method, with the following scheme: assume that we have found {x^*(t)} for some {t>0}. Then let {t'>t}, and use the locally convergent algorithm, initialized at {x^*(t)}, to find {x^*(t')}. In other words, the idea is to move along the central path, using some basic algorithm to take small steps from one point of the path to the next. The ‘basic algorithm’ used in interior point methods is Newton’s method, because of its very fast local convergence.

To make this (beautiful) idea more formal we need to take care of several things:

  1. First we need to describe precisely the region of fast convergence for Newton’s method. This will lead us to define self-concordant functions, which are ‘natural’ functions for Newton’s method.
  2. Then we need to evaluate how much larger {t'} can be compared to {t}, so that {x^*(t)} is still in the region of fast convergence of Newton’s method when optimizing the function {x \mapsto t' c^{\top} x + F(x)} (which is minimized at {x^*(t')}). This will lead us to define {\nu}-self-concordant barriers.
  3. Finally, can we easily find the analytical center {x^*(0)} of {\mathcal{X}}?

 

Traditional analysis of Newton’s method

In this section we denote {\|\cdot\|} for both the Euclidean norm on {{\mathbb R}^n} and the operator norm on matrices (in particular {\|A x\| \leq \|A\| \cdot \|x\|}).

Let {f: {\mathbb R}^n \rightarrow {\mathbb R}} be a {C^2} function. Newton’s method is a simple iterative optimization scheme. Using a Taylor expansion of {f} around {x} one obtains

\displaystyle f(x+h) = f(x) + h^{\top} \nabla f(x) + \frac12 h^{\top} \nabla^2 f(x) h + o(\|h\|^2) .

Thus, starting at {x}, in order to minimize {f} it seems natural to move in the direction {h} that minimizes

\displaystyle h^{\top} \nabla f(x) + \frac12 h^{\top} \nabla^2 f(x) h .

If {\nabla^2 f(x)} is invertible then the solution to this problem is given by {h = - [\nabla^2 f(x)]^{-1} \nabla f(x)}. Thus Newton’s method simply iterates this idea, starting at some point {x_0 \in {\mathbb R}^n}, and then

\displaystyle x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k), k \geq 0 .
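
In code the iteration is straightforward; here is a minimal sketch (the test function, gradient, and Hessian below are illustrative choices, not anything prescribed by these notes):

    import numpy as np

    def newton(grad, hess, x0, iters=10):
        # Plain Newton iteration: x_{k+1} = x_k - [Hessian f(x_k)]^{-1} grad f(x_k).
        x = np.asarray(x0, dtype=float)
        for _ in range(iters):
            # Solve the linear system rather than forming the inverse explicitly.
            x = x - np.linalg.solve(hess(x), grad(x))
        return x

    # Example: f(x) = cosh(x_1) + x_2^2, minimized at the origin.
    grad = lambda x: np.array([np.sinh(x[0]), 2.0 * x[1]])
    hess = lambda x: np.array([[np.cosh(x[0]), 0.0], [0.0, 2.0]])
    print(newton(grad, hess, x0=[1.0, 2.0]))  # converges rapidly to (0, 0)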

While this method can behave arbitrarily badly in general, if started close enough to a strict local minimum of {f} it converges very fast:

Theorem 1 (Local quadratic convergence of Newton’s method) Assume that {f} has a Lipschitz Hessian, that is, {\| \nabla^2 f(x) - \nabla^2 f(y) \| \leq M \|x - y\|}. Let {x^*} be a local minimum of {f} with strictly positive definite Hessian, that is, {\nabla^2 f(x^*) \succeq \mu I_n} for some {\mu > 0}. Suppose that the initial starting point {x_0} of Newton’s method is such that

\displaystyle \|x_0 - x^*\| \leq \frac{\mu}{2 M} .

Then Newton’s method is well-defined and converges to {x^*} at a quadratic rate:

\displaystyle \|x_{k+1} - x^*\| \leq \frac{M}{\mu} \|x_k - x^*\|^2.

Proof: We use the following simple formula, for {x, h \in {\mathbb R}^n},

\displaystyle \int_0^1 \nabla^2 f(x + s h) \ h \ ds = \nabla f(x+h) - \nabla f(x) .

Now note that {\nabla f(x^*) = 0}, and thus with the above formula one obtains

\displaystyle \nabla f(x_k) = \int_0^1 \nabla^2 f(x^* + s (x_k - x^*)) \ (x_k - x^*) \ ds ,

which allows us to write:

\displaystyle \begin{array}{rcl} x_{k+1} - x^* & = & x_k - x^* - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k) \\ & = & x_k - x^* - [\nabla^2 f(x_k)]^{-1} \int_0^1 \nabla^2 f(x^* + s (x_k - x^*)) \ (x_k - x^*) \ ds \\ & = & [\nabla^2 f(x_k)]^{-1} \int_0^1 [\nabla^2 f (x_k) - \nabla^2 f(x^* + s (x_k - x^*)) ] \ (x_k - x^*) \ ds . \end{array}

In particular one has

\displaystyle \|x_{k+1} - x^*\| \leq \|[\nabla^2 f(x_k)]^{-1}\| \left( \int_0^1 \| \nabla^2 f (x_k) - \nabla^2 f(x^* + s (x_k - x^*)) \| \ ds \right) \|x_k - x^* \|.

Using the Lipschitz property of the Hessian one immediately obtains that

\displaystyle \left( \int_0^1 \| \nabla^2 f (x_k) - \nabla^2 f(x^* + s (x_k - x^*)) \| \ ds \right) \leq \frac{M}{2} \|x_k - x^*\| .

Using again the Lipschitz property of the Hessian, the hypothesis on {x^*}, and an induction hypothesis that {\|x_k - x^*\| \leq \frac{\mu}{2M}}, one has

\displaystyle \nabla^2 f(x_k) \succeq \nabla^2 f(x^*) - M \|x_k - x^*\| I_n \succeq (\mu - M \|x_k - x^*\|) I_n \succeq \frac{\mu}{2} I_n ,

so that {\|[\nabla^2 f(x_k)]^{-1}\| \leq \frac{2}{\mu}}. Combining the last three displays yields {\|x_{k+1} - x^*\| \leq \frac{M}{\mu} \|x_k - x^*\|^2}, and in particular {\|x_{k+1} - x^*\| \leq \frac12 \|x_k - x^*\| \leq \frac{\mu}{2M}}, so the induction hypothesis propagates to the next iterate, which concludes the proof. \Box

 

Self-concordant functions

Let {A} be an invertible {n \times n} matrix. We will use the following simple formulas:

\displaystyle \nabla (x \mapsto f(A x) ) =A^{\top} \nabla f(A x) , \; \text{and} \; \nabla^2 (x \mapsto f(A x) ) =A^{\top} \nabla^2 f(A x) A . \ \ \ \ \ (1)

Let us try to get some insight into the ‘geometry’ of Newton’s method by looking at the iteration that would be performed using the function {\phi(y) = f(A^{-1} y)}. In other words, we view {A} as a mapping from {{\mathbb R}^n} (the ‘{x}-space’) to {{\mathbb R}^n} (the ‘{y}-space’), and the function {\phi} is defined on the ‘{y}-space’, so that its evaluation at a point {y=Ax} coincides with the evaluation of {f} at {x}. Assume that we start Newton’s method on {f} in the ‘{x}-space’ at {x_0}, and on {\phi} in the ‘{y}-space’ at {y_0 = Ax_0}. The next iterations are then given by the following formulas:

\displaystyle x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k) , \; \text{and} \; y_{k+1} = y_k - [\nabla^2 \phi(y_k)]^{-1} \nabla \phi(y_k) .

Using (1) one can see in a few lines (the computation is spelled out below) that {y_{k+1} = A x_{k+1}}, that is, Newton’s method is affine invariant: it follows the same trajectory in the ‘{x}-space’ and the ‘{y}-space’!
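
For completeness, here are the few lines: with {\phi(y) = f(A^{-1} y)}, formula (1) (applied with {A^{-1}} in place of {A}) gives {\nabla \phi(y) = A^{-\top} \nabla f(A^{-1} y)} and {\nabla^2 \phi(y) = A^{-\top} \nabla^2 f(A^{-1} y) A^{-1}}, so that if {y_k = A x_k},

\displaystyle y_{k+1} = A x_k - \left[ A^{-\top} \nabla^2 f(x_k) A^{-1} \right]^{-1} A^{-\top} \nabla f(x_k) = A x_k - A [\nabla^2 f(x_k)]^{-1} \nabla f(x_k) = A x_{k+1} .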

This fact casts some doubt on the assumptions of the previous analysis: indeed, the Lipschitz assumption on the Hessian is not affine invariant. The idea of self-concordance is to remedy this.

Assume from now on that {f} is {C^3}. Consider the third order differential operator {\nabla^3 f(x) : {\mathbb R}^n \times {\mathbb R}^n \times {\mathbb R}^n \rightarrow {\mathbb R}}. Then the Lipschitz assumption on the Hessian that we used can be written as:

\displaystyle \nabla^3 f(x) [h,h,h] \leq M \|h\|^3 .

The left-hand side in the inequality is affine invariant, but not the right-hand side. A natural idea to make the right-hand side affine invariant is to replace the Euclidean metric by the metric given by the function {f} itself at {x}, that is:

\displaystyle \|h\|_x = \sqrt{ h^{\top} \nabla^2 f(x) h }.

Definition 2 Let {\mathcal{X}} be a convex set with non-empty interior, and {f} a {C^3} convex function defined on {\mathrm{int}(\mathcal{X})}. Then {f} is self-concordant (with constant {M}) if for all {x \in \mathrm{int}(\mathcal{X}), h \in {\mathbb R}^n},

\displaystyle \nabla^3 f(x) [h,h,h] \leq M \|h\|_x^3 .

We say that {f} is standard self-concordant if {f} is self-concordant with constant {M=2}.

An easy consequence of the definition is that a self-concordant function is a barrier for the set {\mathcal{X}}. The main example of a standard self-concordant function to keep in mind is {f(x) = - \log x} for {x > 0}.

The next definition will be key in order to describe the region of quadratic convergence for Newton’s method on self-concordant functions. Recall first that for an arbitrary norm {\|\cdot\|} on {{\mathbb R}^n}, its dual norm {\|\cdot\|^*} is defined by

\displaystyle \|h\|^* = \sup_{x \in {\mathbb R}^n : \|x\| \leq 1} x^{\top} h .

In particular, by definition, Hölder’s inequality holds: {|x^{\top} h| \leq \|x\| \cdot \|h\|^*}. Recall also that for a norm of the form {\|x\| = \sqrt{x^{\top} A x}}, the dual norm is given by {\|h\|^* = \sqrt{h^{\top} A^{-1} h}}.
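
Returning to the logarithmic example, one can check Definition 2 directly (a quick verification added here for concreteness): for {f(x) = - \log x} one has {f''(x) = 1/x^2} and {f'''(x) = - 2/x^3}, so that for any {h \in {\mathbb R}},

\displaystyle |\nabla^3 f(x) [h,h,h]| = \frac{2 |h|^3}{x^3} = 2 \left( \frac{h^2}{x^2} \right)^{3/2} = 2 \|h\|_x^3 ,

that is, {-\log x} is self-concordant with constant exactly {M = 2}.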

Definition 3 Let {f} be a standard self-concordant function on {\mathcal{X}}. For {x \in \mathrm{int}(\mathcal{X})}, we say that {\lambda_f(x) = \|\nabla f(x)\|_x^*} is the Newton decrement of {f} at {x}.
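
Using the dual norm formula recalled above (and assuming {\nabla^2 f(x)} is invertible), the Newton decrement can be written explicitly as

\displaystyle \lambda_f(x) = \sqrt{ \nabla f(x)^{\top} [\nabla^2 f(x)]^{-1} \nabla f(x) } .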

We state the next theorem without a proof.

Theorem 4 Let {f} be a standard self-concordant function on {\mathcal{X}}, and {x \in \mathrm{int}(\mathcal{X})} such that {\lambda_f(x) < 1}, then

\displaystyle \lambda_f\Big(x - [\nabla^2 f(x)]^{-1} \nabla f(x)\Big) \leq \left(\frac{\lambda_f(x)}{1-\lambda_f(x)}\right)^2 .

In other words the above theorem states that, if initialized at a point {x_0} such that {\lambda_f(x_0) \leq 1/4}, then Newton’s iterates satisfy {\lambda_f(x_{k+1}) \leq 2 \lambda_f(x_k)^2}. Thus, Newton’s region of quadratic convergence for self-concordant functions can be described as a ‘Newton decrement ball’ {\{x : \lambda_f(x) < 1\}}.

 

{\nu}-self-concordant barriers

Let us go back to our central path idea. Assume that we have computed {x^*(t) = \mathrm{argmin}_{x \in {\mathbb R}^n} F_t(x)}, where {F_t(x) = t c^{\top} x + F(x)} and {F} is a standard self-concordant function on {\mathcal{X}}. We want to use Newton’s method to compute {x^*(t')} for some {t' > t}, starting at {x_0 = x^*(t)}. Thus, using the analysis of the previous section, we need to ensure that

\displaystyle \lambda_{F_{t'}}(x^*(t) ) < 1/4 .

Since the Hessian of {F_{t'}} is the Hessian of {F}, one has, with {\|h\|_x = \sqrt{ h^{\top} \nabla^2 F(x) h }},

\displaystyle \lambda_{F_{t'}}(x^*(t) ) = \|t' c + \nabla F(x^*(t)) \|_{x^*(t)}^* .

Now, by the first-order optimality condition, one has {t c + \nabla F(x^*(t)) = 0}, which yields

\displaystyle \lambda_{F_{t'}}(x^*(t) ) = (t'-t) \|c\|^*_{x^*(t)} .

Thus we can take {t' = t + \frac{1}{4 \|c\|^*_{x^*(t)}}}, and according to the analysis of the previous section Newton’s method will converge very rapidly to {x^*(t')}. Now, to obtain a reasonable optimization method overall, we need to make sure that we are increasing {t} fast enough, that is, we need to control {\|c\|^*_{x^*(t)} = \frac1{t} \|\nabla F(x^*(t))\|_{x^*(t)}^*}. In particular we could do this if we had a uniform control over {\|\nabla F(x)\|_x^*}. There is a natural class of functions for which such a control exists. This is the set of functions such that

\displaystyle \nabla^2 F(x) \succeq \frac1{\nu} \nabla F(x) [\nabla F(x) ]^{\top} . \ \ \ \ \ (2)

Indeed in that case one has:

\displaystyle \|\nabla F(x)\|_x^* = \sup_{h : h^{\top} \nabla^2 F(x) h \leq 1} \nabla F(x)^{\top} h \leq \sup_{h : h^{\top} \left( \frac1{\nu} \nabla F(x) [\nabla F(x) ]^{\top} \right) h \leq 1} \nabla F(x)^{\top} h = \sqrt{\nu} .

Thus a safe choice to increase the penalization parameter is {t' = \left(1 + \frac1{4\sqrt{\nu}}\right) t}. Note that the condition (2) can also be written as the fact that the function {F} is {\frac1{\nu}}-exp-concave, that is {x \mapsto \exp(- \frac1{\nu} F(x))} is concave. We arrive at the following definition.
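
To see why (2) is equivalent to exp-concavity (a short computation added here), write {G(x) = \exp\left(- \frac1{\nu} F(x)\right)}. Then

\displaystyle \nabla^2 G(x) = \frac1{\nu} \exp\left(- \frac1{\nu} F(x)\right) \left( \frac1{\nu} \nabla F(x) [\nabla F(x)]^{\top} - \nabla^2 F(x) \right) ,

so {G} is concave (i.e. {\nabla^2 G(x) \preceq 0} everywhere) precisely when (2) holds.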

Definition 5 {F} is a {\nu}-self-concordant barrier if it is a standard self-concordant function, and it is {\frac1{\nu}}-exp-concave.

Again the canonical example is the logarithmic function, {x \mapsto - \log x}, which is a {1}-self-concordant barrier for the set {{\mathbb R}_+}. We state the next (difficult) theorem without a proof.
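
Continuing the example (another quick verification added here): for {F(x) = - \log x} one has {\nabla F(x) = - 1/x} and {\nabla^2 F(x) = 1/x^2 = [\nabla F(x)]^2}, so (2) holds with {\nu = 1}; together with the standard self-concordance checked earlier, this confirms that {-\log x} is a {1}-self-concordant barrier.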

Theorem 6 Let {\mathcal{X} \subset {\mathbb R}^n} be a closed convex set with non-empty interior. There exists {F} which is a {(c \ n)}-self-concordant barrier for {\mathcal{X}} (where {c} is some universal constant).

A key property of {\nu}-self-concordant barriers is the following inequality:

\displaystyle c^{\top} x^*(t) - \min_{x \in \mathcal{X}} c^{\top} x \leq \frac{\nu}{t} . \ \ \ \ \ (3)
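
For instance (an illustration added here, ignoring that {{\mathbb R}_+} is unbounded): minimizing {x} over {{\mathbb R}_+} with the barrier {F(x) = -\log x}, the optimality condition {t - 1/x = 0} gives {x^*(t) = 1/t}, so the left-hand side of (3) equals {1/t = \nu/t}: the inequality is tight in this case.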

 

Path-following scheme

Let us recap what we have seen so far. The basic path-following scheme with a {\nu}-self-concordant barrier {F} for {\mathcal{X}} goes as follows. Assume that we can find {x_0 \in \mathrm{argmin}_{x \in {\mathbb R}^n} t_0 c^{\top} x + F(x)} for some small value {t_0 >0}. Then for {k \geq 0}, let

\displaystyle \begin{array}{rcl} & & t_{k+1} = \left(1 + \frac1{4\sqrt{\nu}}\right) t_k ,\\ & & x_{k+1} = x_k - [\nabla^2 F(x_k)]^{-1} (t_{k+1} c + \nabla F(x_k) ) . \end{array}

Using the machinery that we developed it is not hard to analyze the above scheme. Indeed it is reasonable to expect that {x_k} will be ‘close’ to {x^*(t_k)}, and thus an inequality similar to (3) should hold for {x_k}. This is indeed the case, up to a multiplicative numerical constant. Furthermore, since one has {t_{k} = \left(1 + \frac1{4\sqrt{\nu}}\right)^{k} t_0}, this leads to the following bound:

\displaystyle c^{\top} x_k - \min_{x \in \mathcal{X}} c^{\top} x \leq \frac{\nu}{t_0} \exp( - O(k / \sqrt{\nu}) ) .

In other words, with {k_{\epsilon} = O\left( \sqrt{\nu} \log \frac{\nu}{t_0 \epsilon} \right)} one can get an {\epsilon}-optimal point with the path-following scheme.

At this point we still need to explain how one can get close to an initial point {x^*(t_0)} on the central path. This can be done with a damped Newton method on {F_{t_0}}, which uses the following iterates:

\displaystyle x_{k+1} = x_k - \frac{1}{1+ \lambda_{F_{t_0}}(x_k)} [\nabla^2 F_{t_0}(x_k)]^{-1} \nabla F_{t_0}(x_k) .

The complexity of this step is roughly of the same order as the rest of the path-following scheme.
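
To make the whole pipeline concrete, here is a minimal numerical sketch (an illustration under simplifying assumptions, not an implementation from the lecture): it minimizes {c^{\top} x} over the box {[0,1]^n} using the barrier {F(x) = -\sum_i \log x_i - \sum_i \log(1 - x_i)} (a {2n}-self-concordant barrier). For numerical robustness it re-centers with a few damped Newton steps after each increase of {t}, rather than taking the single Newton step of the idealized scheme above; all names and parameter choices are illustrative.

    import numpy as np

    def barrier_grad_hess(x):
        # Log-barrier for the box [0,1]^n: F(x) = -sum_i log(x_i) - sum_i log(1 - x_i).
        g = -1.0 / x + 1.0 / (1.0 - x)
        h = np.diag(1.0 / x**2 + 1.0 / (1.0 - x)**2)
        return g, h

    def center(c, t, x, tol=0.25, max_iter=100):
        # Damped Newton on F_t(x) = t * c'x + F(x) until the Newton decrement
        # drops below tol (i.e. until x is well inside the fast-convergence region).
        for _ in range(max_iter):
            g, h = barrier_grad_hess(x)
            grad_t = t * c + g
            step = np.linalg.solve(h, grad_t)
            lam = np.sqrt(grad_t @ step)      # Newton decrement of F_t at x
            if lam < tol:
                break
            x = x - step / (1.0 + lam)        # damped step; stays strictly inside (0,1)^n
        return x

    def path_following(c, t0=1.0, eps=1e-6):
        n = len(c)
        nu = 2.0 * n                          # barrier parameter of the box barrier
        x = 0.5 * np.ones(n)                  # analytical center of [0,1]^n
        t = t0
        x = center(c, t, x)                   # initial centering phase
        while nu / t > eps:                   # stopping rule suggested by (3)
            t = (1.0 + 1.0 / (4.0 * np.sqrt(nu))) * t
            x = center(c, t, x)               # re-center after increasing t
        return x

    # Example: minimize c'x over [0,1]^3; the optimum pushes each coordinate
    # towards 0 or 1 depending on the sign of c_i.
    c = np.array([1.0, -2.0, 0.5])
    print(path_following(c))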
