Discrepancy algorithm inspired by gradient descent and multiplicative weights; after Levy, Ramadas and Rothvoss

A week or so ago at our Theory Lunch we had the pleasure of listening to Harishchandra Ramadas (a student of Thomas Rothvoss), who told us about their latest discrepancy algorithm. I think the algorithm is quite interesting as it combines ideas from gradient descent and multiplicative weights in a non-trivial (yet very simple) way. Below I reprove Spencer's 6 deviations theorem with their machinery (in the actual paper Levy, Ramadas and Rothvoss do more than this).

First let me remind you of the setting (see also this previous blog post for some motivation on discrepancy and a bit more context; by the way it is funny to read the comments in that post after this): given v_1, \hdots, v_n \in \mathbb{S}^{n-1} one wants to find x \in \{-1,1\}^n (think of it as a “coloring” of the coordinates) such that \max_{i \in [n]} |x \cdot v_i| \leq C for some numerical constant C>0 (when v_i is a normalized vector of 1's and 0's the quantity |x \cdot v_i| represents the unbalancedness of the coloring in the set corresponding to v_i). Clearly it suffices to give a method to find x \in [-1,1]^n with at least half of its coordinates equal to -1 or 1 and such that \max_{i \in [n]} |x \cdot v_i| \leq C' for some numerical constant C'>0 (indeed one can then simply recurse on the coordinates not yet set to -1 or 1; this is the so-called “partial coloring” argument). Note also that one can drop the absolute value by taking both v_i and -v_i (the number of constraints then becomes 2n but this is easy to deal with and we ignore it here for the sake of simplicity).
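
To make the partial coloring recursion concrete, here is a minimal numpy sketch; partial_color stands for a hypothetical subroutine with the guarantee just described (for instance, the algorithm below), it is not a function from the paper.

import numpy as np

def full_coloring(V, partial_color, tol=1e-9):
    # Drive a partial-coloring subroutine to a full coloring in {-1, +1}^n.
    # V: (n, n) array whose rows are the vectors v_i.
    # partial_color(V, x, active): takes the current point x in [-1,1]^n and the
    # indices of the not-yet-colored coordinates, and returns an updated point
    # in [-1,1]^n in which at least half of those coordinates are now -1 or 1.
    n = V.shape[0]
    x = np.zeros(n)
    active = np.arange(n)
    while active.size > 0:
        x = partial_color(V, x, active)
        remaining = active[np.abs(x[active]) < 1.0 - tol]
        assert remaining.size < active.size, "partial_color made no progress"
        active = remaining
    return np.sign(x)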

The algorithm

Let x_0 = 0, w_0 = 1 \in \mathbb{R}^n. We run an iterative algorithm which maintains at every time step t \in \mathbb{N} a subspace U_t of valid update directions and proceeds as follows. First find (using for instance an orthonormal basis of U_t, see below) z_t \in \mathbb{S}^{n-1} \bigcap U_t such that

(1)   \begin{equation*}  \sum_{i=1}^n \frac{w_t(i)}{\|w_t\|_1} (v_i \cdot z_t)^2 \leq \frac{1}{\mathrm{dim}(U_t)} . \end{equation*}

(Such a z_t always exists: averaging the left-hand side of (1) over an orthonormal basis of U_t gives at most 1/\mathrm{dim}(U_t) since the v_i are unit vectors, so the best basis vector works.) Then update x_{t+1}= x_t + \lambda_t z_t where \lambda_t \in [0,1] is maximal such that x_{t+1} remains in [-1,1]^n. Finally update the exponential weights by w_{t+1}(i) = w_t(i) \exp( v_i \cdot (x_{t+1} - x_t) ).
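
For concreteness, here is a minimal numpy sketch of one iteration, assuming an orthonormal basis of U_t is available as the columns of U_basis (the function name and conventions are mine, not the authors' code).

import numpy as np

def lrr_step(V, x, w, U_basis):
    # One iteration of the update described above. V is the (n, n) array whose
    # rows are the v_i, x the current point in [-1,1]^n, w the current weights,
    # and U_basis an orthonormal basis of U_t stored as columns.
    # Pick z_t as the basis vector minimizing the weighted quadratic form in (1);
    # averaging over the basis shows that this minimizer indeed satisfies (1).
    scores = ((V @ U_basis) ** 2).T @ (w / w.sum())
    z = U_basis[:, np.argmin(scores)]
    # lambda_t: the largest value in [0, 1] keeping x + lambda * z inside [-1, 1]^n.
    room = np.full(x.shape, np.inf)
    pos, neg = z > 1e-12, z < -1e-12
    room[pos] = (1.0 - x[pos]) / z[pos]
    room[neg] = (-1.0 - x[neg]) / z[neg]
    lam = min(1.0, room.min())
    x_new = x + lam * z
    w_new = w * np.exp(V @ (x_new - x))  # exponential (multiplicative) weight update
    return x_new, w_new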

 

It remains to describe the subspace U_t. For this we introduce the set I_t \subset [n] containing the indices of the n/16 largest coordinates of w_t (the “inactive” coordinates) and the set F_t \subset [n] containing the coordinates of x_t equal to -1 or 1 (the “frozen” coordinates). The subspace U_t is now described as the set of points orthogonal to (i) x_t, (ii) e_j, j \in F_t, (iii) v_i, i \in I_t, (iv) \sum_{i=1}^n w_t(i) v_i. The intuition for (i) and (ii) is rather clear: for (i) one simply wants to ensure that the method keeps making progress towards the boundary of the cube (i.e., |x_{t+1}| > |x_t|) while for (ii) one wants to make sure that coordinates which are already “colored” (i.e., set to -1 or 1) are not updated. In particular (i) and (ii) together ensure that at each step either \lambda_t=1, in which case the squared norm of x_t increases by 1 (since z_t is a unit vector orthogonal to x_t), or \lambda_t<1 and one fixes forever one of the coordinates to -1 or 1. Since |x_t|^2 \leq n, steps of the first kind happen at most n times, and steps of the second kind at most n/2 times before half of the coordinates are colored, so after at most 3 n /2 iterations one obtains a partial coloring (i.e., half of the coordinates set to -1 or 1, which was our objective). Property (iii) is about ensuring that we stop walking in the directions where we are not making good progress (there are many ways to ensure this and this precise form will make sense towards the end of the analysis). Property (iv) is closely related, and while it might be only a technical condition it can also be understood as ensuring that locally one is not increasing the softmax of the constraints; indeed (iv) exactly says that one should move orthogonally to the gradient of \log(\sum_{i=1}^n \exp(x \cdot v_i)).
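
Here is a matching numpy sketch of the construction of U_t (again my own transcription of the description above; valid_directions is a hypothetical name, and scipy's null_space does the orthogonal-complement computation).

import numpy as np
from scipy.linalg import null_space

def valid_directions(V, x, w, tol=1e-9):
    # Returns an orthonormal basis (as columns) of the subspace U_t described above.
    n = V.shape[0]
    frozen = np.flatnonzero(np.abs(x) >= 1.0 - tol)               # (ii): coordinates already at -1 or 1
    inactive = np.argsort(w)[-(n // 16):] if n // 16 > 0 else []  # (iii): the n/16 largest weights
    constraints = [x, w @ V]                                      # (i): x_t, and (iv): sum_i w_t(i) v_i
    constraints += [np.eye(n)[j] for j in frozen]                 # (ii): e_j for frozen coordinates
    constraints += [V[i] for i in inactive]                       # (iii): v_i for inactive coordinates
    return null_space(np.vstack(constraints))                     # columns span U_t

Starting from x = np.zeros(n) and w = np.ones(n), one would then repeat x, w = lrr_step(V, x, w, valid_directions(V, x, w)) until half of the coordinates are frozen.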

The analysis

Let Z_t = \sum_{i=1}^n w_t(i). Note that since z_t is on the sphere and \lambda_t \in [0,1] one has that |v_i \cdot (x_{t+1} - x_t)| \leq 1. Thus using \exp(x) \leq 1 + x + x^2 for x \in [-1,1], as well as property (iv) (i.e., \sum_{i=1}^n w_t(i) v_i \cdot z_t = 0) and \lambda_t \in [0,1] one obtains:

    \[Z_{t+1} = \sum_{i=1}^n w_t(i) \exp(v_i \cdot (x_{t+1} - x_t)) \leq \sum_{i=1}^n w_t(i) (1 + (v_i \cdot z_t)^2) .\]
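
In more detail, writing x_{t+1} - x_t = \lambda_t z_t and expanding (the linear term vanishes by property (iv)):

    \[\sum_{i=1}^n w_t(i) \exp(\lambda_t \, v_i \cdot z_t) \leq \sum_{i=1}^n w_t(i) \left(1 + \lambda_t \, v_i \cdot z_t + \lambda_t^2 (v_i \cdot z_t)^2\right) = Z_t + \lambda_t^2 \sum_{i=1}^n w_t(i) (v_i \cdot z_t)^2 ,\]

which is at most the right-hand side above since \lambda_t \leq 1.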

Observe now that the subspace U_t has dimension at least n/4 (say for n \geq 16): indeed U_t is defined by at most 2 + |F_t| + |I_t| \leq 2 + n/2 + n/16 linear constraints (recall that we stop once half of the coordinates are frozen), so \mathrm{dim}(U_t) \geq 7n/16 - 2 \geq n/4. Thus by (1) and the above inequalities one gets:

    \[Z_{t+1} \leq (1+ 4/n) Z_t .\]

In particular for any t \leq 2n, Z_{t} \leq C n for some numerical constant C >0 (one can take C = e^8 since Z_0 = n). It only remains to observe that this ensures w_{2n}(i) = O(1) for any i \in [n] (this concludes the proof since we already observed that at time 2 n at least half of the coordinates are colored). For this last implication we simply use property (iii). Indeed assume that some coordinate i satisfies at some time t \leq 2n, w_t(i) > c e for some c>0. Since each update increases the weights (multiplicatively) by at most a factor e, there must be an earlier time s at which this weight was already larger than c and yet it got updated, meaning that it was not among the top n/16 weights; in particular one had Z_s \geq c n / 16, which contradicts Z_{s} \leq C n for c large enough (namely c > 16 C).
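
To spell out why a bound on the weights is exactly a bound on the discrepancy (this is implicit above): since w_0 = 1 and x_0 = 0 the multiplicative updates telescope, so

    \[w_t(i) = \prod_{s < t} \exp\left(v_i \cdot (x_{s+1} - x_s)\right) = \exp(v_i \cdot x_t) ,\]

and thus w_t(i) \leq c e gives x_t \cdot v_i \leq 1 + \log c = O(1); recall from the setting that one takes both v_i and -v_i as constraints, so this one-sided bound also controls |x_t \cdot v_i|.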
