Komlos conjecture, Gaussian correlation conjecture, and a bit of machine learning

Today I would like to talk (somewhat indirectly) about a beautiful COLT 2014 paper by Nick Harvey and Samira Samadi. The problem studied in this paper goes as follows: imagine that you have a bunch of data points in \mathbb{R}^n with a certain mean value \mu. Your goal is to generate a sequence out of these data points such that the partial averages of this new sequence are as close as possible to \mu. The “herding” procedure of Max Welling has the property that the average of the first t terms in the new sequence is at distance at most O(1/t) from \mu (this was shown by Francis Bach, Simon Lacoste-Julien and Guillaume Obozinski by viewing the herding procedure as a simple Frank-Wolfe). In their paper Nick and Samira improve the dependency on the dimension n hidden in the O notation by using a new algorithm inspired by the theory of discrepancy. In the rest of this post I would like to discuss the basic notions of this theory. What follows is a transcription of what Ronen Eldan (a postdoc in the theory group at MSR) explained to me (of course all errors are mine). I will use c and C to denote absolute numerical constants that take whatever value I need them to take.
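To make the herding procedure concrete, here is a minimal numerical sketch of the idea in the plain (non-kernelized) case, viewed as the greedy/Frank-Wolfe update mentioned above. This is not the algorithm from Nick and Samira's paper, and all names, sizes and data below are purely illustrative.

```python
import numpy as np

def herd(points, steps):
    """Greedy herding sketch: repeatedly pick the data point that best
    corrects the current deviation of the running sum from the mean."""
    mu = points.mean(axis=0)
    w = mu.copy()                       # herding "state" / residual vector
    sequence = []
    for _ in range(steps):
        i = int(np.argmax(points @ w))  # linear maximization over the data points
        sequence.append(i)
        w += mu - points[i]             # Frank-Wolfe style residual update
    return sequence

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))       # 200 illustrative data points in R^5
idx = herd(X, steps=1000)
partial = np.cumsum(X[idx], axis=0) / np.arange(1, 1001)[:, None]
err = np.linalg.norm(partial - X.mean(axis=0), axis=1)
print(err[[9, 99, 999]])                # distance to mu after 10, 100, 1000 terms
```

In this toy run the distance to \mu decays roughly like 1/t, which is the behaviour the Frank-Wolfe analysis predicts.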

Discrepancy theory

Consider a large finite set of vertices V, and a size-n set \mathcal{P} of subsets of vertices (in other words (V, \mathcal{P}) is a hypergraph). For the sake of simplicity we will assume here that the elements of \mathcal{P} are also of size n. We are now interested in the following combinatorial optimization problem:

    \[\min_{f : V \rightarrow \{-1,1\}} \; \max_{A \in \mathcal{P}} \left| \sum_{v \in A} f(v) \right| .\]

In words, one is looking for a partition f of the vertices into two groups, so as to have every hyperedge A \in \mathcal{P} as balanced as possible.
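To fix ideas, here is a small numerical sketch of this objective (the hypergraph below is random and all sizes are arbitrary choices): we draw n hyperedges of size n over d vertices and evaluate the objective for a uniformly random coloring f.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 400                        # n hyperedges of size n over d vertices

# incidence matrix: row A marks the n vertices belonging to hyperedge A
M = np.zeros((n, d))
for A in range(n):
    M[A, rng.choice(d, size=n, replace=False)] = 1.0

f = rng.choice([-1.0, 1.0], size=d)    # a uniformly random coloring of the vertices
disc = np.abs(M @ f).max()             # the objective value for this particular f
print(disc, np.sqrt(n * np.log(n)))    # random coloring: roughly of order sqrt(n log n)
```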

Alright, so let’s try something simple, just randomly partition the vertices: a simple union bound shows that the objective value obtained in this case is of order \sqrt{n \log n}. Denote by D (for Discrepancy) the smallest objective value (in particular we just saw that D \leq C \sqrt{n \log n}). A beautiful result of Joel Spencer shows that one can always do much better than random partitioning:

Theorem (Spencer’s six deviations, 1985): D \leq 6 \sqrt{n}.

We will now see a splendid (partial) proof of this result. Let \gamma be the Gaussian measure on \mathbb{R}^d (in the following we will take d=|V|). First let me remind you of a basic fact about this measure, and a related conjecture. A set of the form S_{v,t} = \{x : |v^{\top} x| \leq t\}, for some v \in \mathbb{R}^d, t >0, is called a slab.

Theorem (Sidak, 1967): Let A be a symmetric convex set and B be a slab, then

    \[\gamma(A \cap B) \geq \gamma(A) \gamma(B) .\]

Observe that there is equality when A and B are two slabs with orthogonal normal directions (the corresponding events are then independent), and the theorem says (in particular) that tilting the slabs towards one another can only create positive correlation.
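Here is a quick Monte Carlo illustration of Sidak's inequality (just a sanity check in dimension 3, with arbitrary choices for A and B): take A to be a box, which is a symmetric convex set, and B a tilted slab.

```python
import numpy as np

rng = np.random.default_rng(2)
d, N = 3, 200_000
x = rng.standard_normal((N, d))         # samples from the Gaussian measure gamma

in_A = (np.abs(x) <= 1.0).all(axis=1)   # A = [-1,1]^3, a symmetric convex set
v = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)
in_B = np.abs(x @ v) <= 0.5             # B = a slab with tilted normal direction v

lhs = (in_A & in_B).mean()              # Monte Carlo estimate of gamma of the intersection
rhs = in_A.mean() * in_B.mean()         # Monte Carlo estimate of gamma(A) * gamma(B)
print(lhs, rhs)                         # Sidak: lhs >= rhs, up to sampling error
```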

Conjecture (Gaussian correlation): Let A and B be two symmetric convex sets, then

    \[\gamma(A \cap B) \geq \gamma(A) \gamma(B) .\]

Ok, now back to the proof of Spencer’s theorem. First, let’s rewrite things suggestively in terms of the slabs S_A := S_{\frac{\mathds{1}_A}{\sqrt{n}}, C}, A \in \mathcal{P}. We want to show that there exists f \in \{-1,1\}^d such that f \in \cap_{A \in \mathcal{P}} S_A. Applying Sidak’s theorem repeatedly (one slab at a time, the intersection of slabs being symmetric and convex) one gets \gamma(\cap_{A \in \mathcal{P}} S_A) \geq c^n (with c being roughly like 1-2 \exp(-C^2/2), the Gaussian measure of a single slab). From there it is a fairly short route to show that, if c \geq 0.99 (say), then the intersection of \{-1,1\}^d with \cap_{A \in \mathcal{P}} S_A must be non-empty. Ronen showed me a non-trivial (and algorithmically non-efficient) five-line argument for this, but I believe it is as yet unpublished work so I won’t reproduce it here. Another proof of this, which yields an efficient algorithm for finding a point in the intersection, is given in this very nice paper of Lovett and Meka.
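The product bound is easy to check numerically as well (a sketch with small, arbitrary n and d; the slabs are built from a random hypergraph as above): the Gaussian mass of \cap_{A \in \mathcal{P}} S_A is indeed at least the product of the individual slab masses, i.e. at least c^n.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, N, width = 10, 30, 200_000, 2.0   # width plays the role of the constant C

# one unit normal vector 1_A / sqrt(n) per hyperedge A (random hyperedges of size n)
U = np.zeros((n, d))
for A in range(n):
    U[A, rng.choice(d, size=n, replace=False)] = 1.0 / np.sqrt(n)

x = rng.standard_normal((N, d))
hits = np.abs(x @ U.T) <= width         # hits[i, A]: is sample i in slab S_A?
print(hits.all(axis=1).mean(),          # estimate of the mass of the intersection
      hits.mean(axis=0).prod())         # product of individual slab masses, about c^n
```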

Extensions

In what we discussed so far the underlying hypergraph was completely arbitrary. For some specific hypergraphs one can clearly improve the bound on D. Indeed if all hyperedges are disjoint, then D=0 (at least when n is even). More interestingly, Beck and Fiala showed in 1981 that, if each vertex appears in at most t hyperedges, then D \leq 2 t. They also conjectured that the correct bound should be O(\sqrt{t}). In 1998 Banaszczyk showed a bound of O(\sqrt{t \log(n)}) using the following theorem (which is a variant of the result mentioned above):

Theorem (Banaszczyk, 1998): Let A \subset \mathbb{R}^d be a convex set such that \gamma(A) \geq 1/2, and let u_1, \ldots, u_m \in \mathbb{R}^d be such that \|u_i\|_2 \leq c for a small enough absolute constant c. Then there exists f \in \{-1,1\}^m such that \sum_{i=1}^m f_i u_i \in A.

The longstanding Komlós conjecture says that in fact this holds for A=[-1,1]^d (despite the fact that the Gaussian measure of this set is only c^d).
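To get a feel for the Komlós setting, here is a naive greedy signing heuristic (this is not Banaszczyk's argument and comes with no guarantee; the vectors and sizes below are arbitrary): sign the unit vectors one by one so as to keep the sup-norm of the running signed sum as small as possible.

```python
import numpy as np

rng = np.random.default_rng(4)
d, m = 20, 5_000

u = rng.standard_normal((m, d))                     # m random directions in R^d
u /= np.linalg.norm(u, axis=1, keepdims=True)       # normalize: Euclidean norm exactly 1

s = np.zeros(d)                                     # running signed sum
worst = 0.0
for i in range(m):
    # greedy sign: pick the sign that keeps the sup-norm of the partial sum smaller
    sign = 1.0 if np.abs(s + u[i]).max() <= np.abs(s - u[i]).max() else -1.0
    s += sign * u[i]
    worst = max(worst, np.abs(s).max())

print(worst)   # stays O(1)-ish in this experiment; the conjecture asks for a universal bound
```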

Finally let me mention another open problem, the matrix version of Spencer’s six deviations theorem; see this blog post by Raghu Meka for details.

 
