ORF523: Getting around NP-hardness

In this post I would like to argue that showing {\mathbf{NP}}-completeness of a problem is in fact only a weak certificate of difficulty.

Dynamic Programming

Consider the problem {\mathrm{PARTITION}} of the previous lecture (which is {\mathbf{NP}}-complete): given {n} positive integers {a_1, \hdots, a_n}, does there exist {S \subset \{1, \hdots, n\}} such that {\sum_{i \in S} a_i = \sum_{i \in S^c} a_i}?

Let {N = \sum_{i=1}^n a_i}. We will describe an algorithm which has a time complexity of {O(n N)}. At first sight this looks like a polynomial-time complexity, but if it was the case it would prove that {\mathbf{P} = \mathbf{NP}}. In fact this complexity is not quite polynomial: Note that the size of the encoding of the input for the problem (that is the integers {a_1, \hdots, a_n}) is of order {\sum_{i=1}^n \log a_i}. Thus {N = \sum_{i=1}^n a_i} is exponential in the size of the encoding, which means that the time complexity {O(n N)} is not polynomial. However the algorithm is polynomial with respect to the ‘numerical value’ of the input, and we say in that case that the algorithm is pseudo-polynomial. An {\mathbf{NP}}-complete problem with a pseudo-polynomial algorithm is called weakly {\mathbf{NP}}-complete. In particular {\mathrm{PARTITION}} is weakly {\mathbf{NP}}-complete. On the contrary a problem is strongly {\mathbf{NP}}-complete if it is {\mathbf{NP}}-complete even when one uses unary coding for numbers rather than binary codin (this is a trick to enforce that polynomial-time now refers to polynomial-time in the numerical size of the input).

Let us now describe the algorithm with {O(n N)}-time complexity for {\mathrm{PARTITION}}. The idea is very simple, yet very powerful. It builds on the following observations: First remark that the question equivalently asks to find {S \subset \{1, \hdots, n\}} such that {\sum_{i \in S} a_i = N/2} ({N} has to be even otherwise the partition problem is clearly impossible). Now Let {T(j,\alpha) = \mathbf{1} \{\exists S\subset\{1,\hdots,j\} : \sum_{i \in S} a_i = \alpha\}}, we want to know the value of {T(n, N/2)}. Note that {T(j, \alpha)} is one if and only if either {T(j-1, \alpha)} is one, or if {T(j-1, \alpha -a_j)} is one. This relation gives us a dynamic programming algorithm: First compute {T(1, \alpha), \alpha =1, \hdots, N/2}, then using these values compute {T(2, \alpha), \alpha =1, \hdots, N/2}, and repeat this iteratively until computing {T(n, N/2)} with the previously computed values {T(n-1, \alpha), \alpha =1, \hdots, N/2}. Each step of this iterative procedure is of complexity {O(N)}, and there is a total of {n} steps, resulting in a complexity of {O(n N)} to solve {\mathrm{PARTITION}}.

In general the idea of dynamic programming is to break down recursively the problem into a limited number of simpler pieces. It is a very powerful technique of wide applicability, but we won’t have time to see other examples.


SDP relaxations

We discuss now the problem {\mathrm{MAXCUT}}, which is the optimization version of the decision problem {\mathrm{CUT}} that we have seen in the previous lecture, where one is looking for the cut of maximal size. Let {A \in \{0,1\}^{n \times n}} be the adjacency matrix of the graph {G = ([n],E)}, or more generally let {A \in {\mathbb R}_+^{n \times n}} be a symmetric matrix of non-negative weights. We encode a partition of {[n]} into two sets as a vector {x \in \{-1,1\}^n}. Then {(1-x_i x_j)} is equal to {0} if both {i} and {j} are in the same element of the partition, and it is equal to {2} if they are in different elements. In particular the value of the cut corresponding to {x} is given by

\displaystyle \frac1{4} \sum_{i,j=1}^n A_{i,j} (1-x_i x_j) .

Thus finding the maximal cut is equivalent to the following optimization problem:

\displaystyle \max_{x \in \{-1,1\}^n} \sum_{i,j=1}^n A_{i,j} (1-x_i x_j) \ \ \ \ \ (1)

Since {\mathrm{MAXCUT}} is {\mathbf{NP}}-complete (in the sense that the corresponding decision problem {\mathrm{CUT}} is {\mathbf{NP}}-complete) we know that the above problem is ‘hard’, in the sense that a polynomial-time algorithm solving (1) would prove that {\mathbf{P} = \mathbf{NP}}. Note that naively one may overlook the difficulty of (1) since it is ‘simply’ a quadratic objective with quadratic equality constraints ({x_i^2 = 1}). Remark in particular that if one replaces in the constraint the hypercube {\{-1,1\}^n} by the Euclidean sphere, then one obtains an efficiently solvable problem (it is the problem of computing the maximal eigenvalue of some symmetric matrix).

We will see now that while (1) is a difficult optimization problem, it is in fact possible to find relatively good approximate solutions! Let {\zeta} be uniformly drawn on the hypercube {\{-1,1\}^n}, then clearly:

\displaystyle \mathop{\mathbb E} \sum_{i,j} A_{i,j} (1-\zeta_i \zeta_j) = \sum_{i,j} A_{i,j} \geq \frac12 \max_{x \in \{-1,1\}^n} \sum_{i,j=1}^n A_{i,j} (1-x_i x_j) .

This means that on average, the obtained cut is at least half as good (in terms of the value of the cut) than the optimal cut. In fact this implies that with probability at least {\epsilon}, the obtained cut is a {(1/2-\epsilon)}-approximation of the maximum. Thus by repeatedly sampling from the hypercube, one can get arbitrarily close (with probability approaching {1}) to a {1/2}-approximation of {\mathrm{MAXCUT}}!

Next we show that in fact one can get much better than a {1/2}-approximation with a more subtle sampling strategy. First note that:

\displaystyle \max_{x \in \{-1,1\}^n} \sum_{i,j=1}^n A_{i,j} (1-x_i x_j) = \sum_{i,j=1}^n A_{i,j} - \min_{x \in \{-1,1\}^n} \langle A, x x^{\top} \rangle \leq \sum_{i,j=1}^n A_{i,j} - \min_{X \in \mathbb{S}_+^n, X_{i,i}=1, i=1,\hdots, n} \langle A, X \rangle ,

and the last optimization problem is an SDP (with linear equality constraints) which can thus be solved efficiently. We call the solution to the problem

\displaystyle \min_{X \in \mathbb{S}_+^n, X_{i,i}=1, i=1,\hdots, n} \langle A, X \rangle , \ \ \ \ \ (2)

the SDP relaxation of {\mathrm{MAXCUT}}.

The next theorem is of tremendous importance. It gives a bound on the quality of the SDP relaxation, and the proof gives a randomized algorithm to compute an {0.87}-approximate solution to (1).

Theorem 1 (Goemans and Williamson 1995) The SDP relaxation for {\mathrm{MAXCUT}} satisfies

\displaystyle \begin{array}{rcl} \max_{X \in \mathbb{S}_+^n, X_{i,i}=1, i=1,\hdots, n} \left( \sum_{i,j=1}^n A_{i,j} - \langle A, X \rangle \right) & \geq & \max_{x \in \{-1,1\}^n} \sum_{i,j=1}^n A_{i,j} (1-x_i x_j) \\ & \geq & 0.87 \ \max_{X \in \mathbb{S}_+^n, X_{i,i}=1, i=1,\hdots, n} \left( \sum_{i,j=1}^n A_{i,j} - \langle A, X \rangle \right) . \end{array}

The proof of this result is based on the following elementary geometric lemma.

Lemma 2 Let {\xi \sim \mathcal{N}(0,\Sigma)} with {\Sigma_{i,i}=1} for {i=1, \hdots, n}, and {\zeta = \mathrm{sign}(\xi)}. Then

\displaystyle \mathop{\mathbb E} \ \zeta_i \zeta_j = \frac{2}{\pi} \mathrm{arcsin} \left(\Sigma_{i,j}\right) .

Proof: Let {V \in {\mathbb R}^{n \times n}} (with {i^{th}} row {V_i^{\top}}) be such that {\Sigma = V V^{\top}}. Note that since {\Sigma_{i,i}=1} one has {\|V_i\| = 1} (remark also that necessarily {|\Sigma_{i,j}| \leq 1}, which will be important in the proof of the theorem). Let {\epsilon \sim \mathcal{N}(0,I_n)} be such that {\xi = V \epsilon}. Then {\zeta_i = \mathrm{sign}(V_i^{\top} \epsilon)}, and in particular

\displaystyle \begin{array}{rcl} \mathop{\mathbb E} \ \zeta_i \zeta_j & = & 2 \mathop{\mathbb P}(V_i^{\top} \epsilon \geq 0 \ \text{and} \ V_j^{\top} \epsilon \geq 0) - 2 \mathop{\mathbb P}(V_i^{\top} \epsilon \geq 0 \ \text{and} \ V_j^{\top} \epsilon < 0) \\ & = & 1 - 2 \mathop{\mathbb P}(V_j^{\top} \epsilon < 0 | V_i^{\top} \epsilon \geq 0) . \end{array}

Now a quick picture shows that {\mathop{\mathbb P}(V_j^{\top} \epsilon < 0 | V_i^{\top} \epsilon \geq 0) = \frac{1}{\pi} \mathrm{arccos}(V_i^{\top} V_j)}. Using the fact that {V_i^{\top} V_j = \Sigma_{i,j}} and {\mathrm{arccos}(x) = \frac{\pi}{2} - \mathrm{arcsin}(x)} conclude the proof. \Box

We can now get to the proof of the Goemans and Williamson’s result.

Proof: Let {X} be the solution to the SDP relaxation of {\mathrm{MAXCUT}} (that is a solution of (2)), {\xi \sim \mathcal{N}(0, X)}, and {\zeta = \mathrm{sign}(\xi) \in \{-1,1\}^n}. We shall use the following inequality:

\displaystyle 1 - \frac{2}{\pi} \mathrm{arcsin}(t) \geq 0.87 (1-t), \ \forall t \in [-1,1] .

Now remark that, thanks to the previous lemma and the facts that {A_{i,j} \geq 0} and {|X_{i,j}| \leq 1}, one has

\displaystyle \begin{array}{rcl} \max_{x \in \{-1,1\}^n} \sum_{i,j=1}^n A_{i,j} (1-x_i x_j) & \geq & \mathop{\mathbb E}\sum_{i,j=1}^n A_{i,j} (1-\zeta_i \zeta_j) \\ & = & \sum_{i,j=1}^n A_{i,j} \left(1- \frac{2}{\pi} \mathrm{arcsin} \left(X_{i,j}\right)\right) \\ & \geq & 0.87 \sum_{i,j=1}^n A_{i,j} \left(1- X_{i,j}\right) \\ & = & 0.87 \ \max_{X \in \mathbb{S}_+^n, X_{i,i}=1, i=1,\hdots, n} \left( \sum_{i,j=1}^n A_{i,j} - \langle A, X \rangle \right). \end{array}


The previous result crucially used the form of the {\mathrm{MAXCUT}} objective. We show next a result that applies more generally but with a worse constant of approximation. First note that, as we already pointed out, the {\mathrm{MAXCUT}} objective is of the form:

\displaystyle \max_{x \in \{-1,1\}^n} x^{\top} B x , \ \ \ \ \ (3)

where {B \in \mathbb{S}_+^n}. Indeed in {\mathrm{MAXCUT}} one has {B = \mathrm{diag}(\sum_{j=1}^n A_{i,j}) - A} which is also called the graph Laplacian {L} of {G}, and it is easy to verify that in this case {x^{\top} L x = \frac12 \sum_{i,j} A_{i,j} (x_i - x_j)^2} which shows that {L \in \mathbb{S}_+^n}.

Theorem 3 (Nesterov 1997) Let {B \in \mathbb{S}_+^n}, then

\displaystyle \frac{2}{\pi} \ \max_{X \in \mathbb{S}_+^n, X_{i,i}=1, i=1,\hdots, n} \langle B, X \rangle \leq \max_{x \in \{-1,1\}^n} x^{\top} B x \leq \max_{X \in \mathbb{S}_+^n, X_{i,i}=1, i=1,\hdots, n} \langle B, X \rangle .

Proof: Let {X} be the solution of

\displaystyle \max_{X \in \mathbb{S}_+^n, X_{i,i}=1, i=1,\hdots, n} \langle B, X \rangle ,

{\xi \sim \mathcal{N}(0, X)}, and {\zeta = \mathrm{sign}(\xi) \in \{-1,1\}^n}. Remark that thanks to the previous lemma one has

\displaystyle \mathop{\mathbb E} \ \zeta^{\top} B \zeta = \sum_{i,j=1}^n B_{i,j} \frac{2}{\pi} \mathrm{arcsin} \left(X_{i,j}\right) = \frac{2}{\pi} \langle B, \mathrm{arcsin}(X) \rangle .

Thus to prove the result it is enough to show that {\langle B, \mathrm{arcsin}(X) \rangle \geq \langle B, X \rangle}, which is itself implied by {\mathrm{arcsin}(X) \succeq X} (since {B} is positive semi-definite, just write the eigendecomposition). Now the last inequality is true thanks to a simple Taylor expansion. Indeed recall that {|X_{i,j}| \leq 1} and thus denoting by {A^{\circ \alpha}} the matrix where the entries are raised to the power {\alpha} one has

\displaystyle \mathrm{arcsin}(X) = \sum_{k=0}^{+\infty} \frac{{2k \choose k}}{4^k (2k +1)} X^{\circ (2k+1)} = X + \sum_{k=1}^{+\infty} \frac{{2k \choose k}}{4^k (2k +1)} X^{\circ (2k+1)}.

Finally one can conclude using the fact if {A,B \succeq 0} then {A \circ B \succeq 0}. (This can be seen by writing {A= V V^{\top}}, {B=U U^{\top}}, and thus {(A \circ B)_{i,j} = V_i^{\top} V_j U_i^{\top} U_j = \langle V_i U_i^{\top}, V_j U_j^{\top} \rangle}, which means that {A \circ B} is a Gram-matrix and thus it is positive semi-definite.) \Box


Hardness of approximation

In this previous section we have seen how to improve a trivial {1/2}-approximation algorithm to a highly non-trivial {0.87}-approximation algorithm. Of course at this point one may wonder whether better approximation guarantees are possible for {\mathrm{MAXCUT}}. In the last two decades some technology have been developed to answer this question. In the early Nineties, Sanjeev Arora and his co-authors have discovered the so-called {\mathbf{PCP}} Theorem (Probabilistically Checkable Proofs). This result gives a new randomized perspective on the class {\mathbf{NP}}, and it can be used to prove theorems such as the following one. We refer the interested reader to Chapter 11 in Arora and Barak for more details on this (including the proof).

Theorem 4 There is some {\gamma <1} such that computing a {\gamma}-approximation to {\mathrm{MIN-VERTEXCOVER}} is {\mathbf{NP}}-hard. For every {\rho < 1}, computing a {\rho}-approximation to {\mathrm{MAX-INDSET}} is {\mathbf{NP}}-hard.

The above theorem is nice, but for example for {\mathrm{MIN-VERTEXCOVER}} it might not give a sharp constant in terms of difficulty of approximation. The Unique Games Conjecture, a now decade-old conjecture made by Subhash Khot, implies sharper results for approximation. For example this conjecture (together with {\mathbf{P} \neq \mathbf{NP}}) implies that the constant derived by Goemans and Williamson for {\mathrm{MAXCUT}} is unimprovable: there is no polynomial-time algorithm that achieve a better constant in terms of approximation.


Extended formulations

Remark that (3) can be written as

\displaystyle \max_{x \in \{-1,1\}^n} \langle B, x x^{\top} \rangle = \max_{X \in \mathrm{conv}(x x^{\top}, \ x \in \{-1,1\}^n)} \langle B, X \rangle ,

which is ‘just’ a linear optimization over the polytope {\mathrm{conv}(x x^{\top}, \ x \in \{-1,1\}^n)} (which is called the correlation polytope). We have seen in previous lectures that linear optimization over a polytope can be solved in polynomial-time if the polytope can be described with a polynomial number of faces. Of course here we can immediately see that the correlation polytope has an exponential number of vertices. But this tells us nothing in terms of the faces, as one can see with the hypercube {[-1,1]^n} for example ({2^n} vertices but only {2n} faces). Even worse, the number of faces in the ambient space tells us nothing about possible representations of the polytope in higher dimension (in the sense that the polytope would be a projection of this higher dimensional representation). Think for example of the {\ell_1}-ball which can represented as follows:

\displaystyle \left\{x \in {\mathbb R}^n : \sum_{i=1}^n |x_i| \leq 1 \right\} = \left\{x \in {\mathbb R}^n : \exists y \in {\mathbb R}_+^{2 n} \ \text{s.t.} \ x_i = y_i - y_{n + i}, i=1\hdots,n, \ \text{and} \ \sum_{i=1}^{2 n} y_i \leq 1 \right\} .

In other words while the {\ell_1}-ball has {2^n} faces, it can be represented as the projection of a polytope in dimension {3n} with only {2n+1} faces. This type of representation is called an extended formulation. Let {P \subset {\mathbb R}^n} be a polytope, then {Q \subset {\mathbb R}^m, m \geq n} is an extension of {P} if

\displaystyle P = \{x \in {\mathbb R}^n : \exists y \ \text{s.t.} \ (x, y) \in Q \}.

A quantity of great interest, called the extension complexity of {P}, is the minimal number of faces of an extension of {P}. The extension complexity characterizes in some sense the difficulty of linear optimization over {P} when one restricts to Linear Programming techniques. While this quantity seems completely untractable, a famous result of Yannakakis shows that it can be described as a completely combinatorial quantity: it is equal to the non-negative rank of the slack matrix of {P} (the matrix where each row corresponds to a vertex {v} of {P}, and each column corresponds to a face {a^{\top} x \leq b}, and the corresponding entry if {b - a^{\top} v}). Controlling the non-negative rank is no easy task, but in some cases it can be done. For instance a very recent results of Fiorini et al. shows that the extension complexity of the correlation polytope is at least {2^{c n}} for some constant {c>0}. This can be viewed as an unconditional lower bound on {\mathrm{MAXCUT}} when one restricts to Linear Programming techniques.


Average-case complexity, smoothed analysis

So far we restricted ourselves to worst-case complexity: An algorithm is polynomial only if for every instance {x \in \{0,1\}^n} the running time is polynomial in {n}. This may be considered too strict, as it might be that an algorithm run in polynomial time on any ‘real-life’ instance {x}, but it could have very bad running time on artificial instances. This is exactly what happens with the simplex method that you have studied in ORF522: there are instances of {LP} where the simplex method takes exponential time (the Klee-Minty cube), yet in practice it is much more efficient than the ellipsoid method (and it is competitive with interior-point methods). A great result of Spielman and Teng from 2001 shows that indeed the examples that make the simplex method fail are artificial: if one adds a little bit of noise to the instance, then with high probability the simplex method will run in polynomial time. This is called the smoothed analysis of the simplex method, and it falls under the more general research agenda of understanding average-case complexity rather than worst-case complexity.

This entry was posted in Theoretical Computer Science. Bookmark the permalink.

Leave a reply