Discrepancy algorithm inspired by gradient descent and multiplicative weights; after Levy, Ramadas and Rothvoss

A week or so ago at our Theory Lunch we had the pleasure of listening to Harishchandra Ramadas (a student of Thomas Rothvoss), who told us about their latest discrepancy algorithm. I think the algorithm is quite interesting as it combines ideas from gradient descent and multiplicative weights in a non-trivial (yet very simple) way. Below I reprove Spencer's 6 deviations theorem with their machinery (in the actual paper Levy, Ramadas and Rothvoss do more than this).

First let me remind you of the setting (see also this previous blog post for some motivation on discrepancy and a bit more context; by the way it is funny to read the comments in that post after this): given v_1, \hdots, v_n \in \mathbb{S}^{n-1} one wants to find x \in \{-1,1\}^n (think of it as a “coloring” of the coordinates) such that \max_{i \in [n]} |x \cdot v_i| \leq C for some numerical constant C>0 (when v_i is a normalized vector of 1's and 0's the quantity |x \cdot v_i| represents the unbalancedness of the coloring in the set corresponding to v_i). Clearly it suffices to give a method to find x \in [-1,1]^n with at least half of its coordinates equal to -1 or 1 and such that \max_{i \in [n]} |x \cdot v_i| \leq C' for some numerical constant C'>0 (indeed one can then simply recurse on the coordinates not yet set to -1 or 1; this is the so-called “partial coloring” argument). Note also that one can drop the absolute value by taking both v_i and -v_i (the number of constraints then becomes 2n but this is easy to deal with and we ignore it here for the sake of simplicity).
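
To make the partial coloring recursion concrete, here is a minimal numpy sketch; partial_color stands for a hypothetical subroutine with the guarantee just described (for instance, the algorithm below), it is not a function from the paper.

import numpy as np

def full_coloring(V, partial_color, tol=1e-9):
    # Drive a partial-coloring subroutine to a full coloring in {-1, +1}^n.
    # V: (n, n) array whose rows are the vectors v_i.
    # partial_color(V, x, active): takes the current point x in [-1,1]^n and the
    # indices of the not-yet-colored coordinates, and returns an updated point
    # in [-1,1]^n in which at least half of those coordinates are now -1 or 1.
    n = V.shape[0]
    x = np.zeros(n)
    active = np.arange(n)
    while active.size > 0:
        x = partial_color(V, x, active)
        remaining = active[np.abs(x[active]) < 1.0 - tol]
        assert remaining.size < active.size, "partial_color made no progress"
        active = remaining
    return np.sign(x)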

The algorithm

Let x_0 = 0, w_0 = 1 \in \mathbb{R}^n. We run an iterative algorithm which maintains at every time step t \in \mathbb{N} a subspace U_t of valid update directions and proceeds as follows. First find (using for instance an orthonormal basis of U_t, see below) z_t \in \mathbb{S}^{n-1} \bigcap U_t such that

(1)   \begin{equation*}  \sum_{i=1}^n \frac{w_t(i)}{\|w_t\|_1} (v_i \cdot z_t)^2 \leq \frac{1}{\mathrm{dim}(U_t)} . \end{equation*}

(Such a z_t always exists: averaging the left-hand side of (1) over an orthonormal basis of U_t gives at most 1/\mathrm{dim}(U_t) since the v_i are unit vectors, so the best basis vector works.) Then update x_{t+1}= x_t + \lambda_t z_t where \lambda_t \in [0,1] is maximal such that x_{t+1} remains in [-1,1]^n. Finally update the exponential weights by w_{t+1}(i) = w_t(i) \exp( v_i \cdot (x_{t+1} - x_t) ).
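
For concreteness, here is a minimal numpy sketch of one iteration, assuming an orthonormal basis of U_t is available as the columns of U_basis (the function name and conventions are mine, not the authors' code).

import numpy as np

def lrr_step(V, x, w, U_basis):
    # One iteration of the update described above. V is the (n, n) array whose
    # rows are the v_i, x the current point in [-1,1]^n, w the current weights,
    # and U_basis an orthonormal basis of U_t stored as columns.
    # Pick z_t as the basis vector minimizing the weighted quadratic form in (1);
    # averaging over the basis shows that this minimizer indeed satisfies (1).
    scores = ((V @ U_basis) ** 2).T @ (w / w.sum())
    z = U_basis[:, np.argmin(scores)]
    # lambda_t: the largest value in [0, 1] keeping x + lambda * z inside [-1, 1]^n.
    room = np.full(x.shape, np.inf)
    pos, neg = z > 1e-12, z < -1e-12
    room[pos] = (1.0 - x[pos]) / z[pos]
    room[neg] = (-1.0 - x[neg]) / z[neg]
    lam = min(1.0, room.min())
    x_new = x + lam * z
    w_new = w * np.exp(V @ (x_new - x))  # exponential (multiplicative) weight update
    return x_new, w_new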

 

It remains to describe the subspace U_t. For this we introduce the set I_t \subset [n] containing the indices of the n/16 largest coordinates of w_t (the “inactive” coordinates) and the set F_t \subset [n] containing the coordinates of x_t equal to -1 or 1 (the “frozen” coordinates). The subspace U_t is now described as the set of points orthogonal to (i) x_t, (ii) e_j, j \in F_t, (iii) v_i, i \in I_t, (iv) \sum_{i=1}^n w_t(i) v_i. The intuition for (i) and (ii) is rather clear: for (i) one simply wants to ensure that the method keeps making progress towards the boundary of the cube (i.e., |x_{t+1}| > |x_t|) while for (ii) one wants to make sure that coordinates which are already “colored” (i.e., set to -1 or 1) are not updated. In particular (i) and (ii) together ensure that at each step either \lambda_t=1, in which case the squared norm of x_t increases by 1 (since z_t is a unit vector orthogonal to x_t), or \lambda_t<1 and one fixes forever one of the coordinates to -1 or 1. Since |x_t|^2 \leq n, steps of the first kind happen at most n times, and steps of the second kind at most n/2 times before half of the coordinates are colored, so after at most 3 n /2 iterations one obtains a partial coloring (i.e., half of the coordinates set to -1 or 1, which was our objective). Property (iii) is about ensuring that we stop walking in the directions where we are not making good progress (there are many ways to ensure this and this precise form will make sense towards the end of the analysis). Property (iv) is closely related, and while it might be only a technical condition it can also be understood as ensuring that locally one is not increasing the softmax of the constraints; indeed (iv) exactly says that one should move orthogonally to the gradient of \log(\sum_{i=1}^n \exp(x \cdot v_i)).
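
Here is a matching numpy sketch of the construction of U_t (again my own transcription of the description above; valid_directions is a hypothetical name, and scipy's null_space does the orthogonal-complement computation).

import numpy as np
from scipy.linalg import null_space

def valid_directions(V, x, w, tol=1e-9):
    # Returns an orthonormal basis (as columns) of the subspace U_t described above.
    n = V.shape[0]
    frozen = np.flatnonzero(np.abs(x) >= 1.0 - tol)               # (ii): coordinates already at -1 or 1
    inactive = np.argsort(w)[-(n // 16):] if n // 16 > 0 else []  # (iii): the n/16 largest weights
    constraints = [x, w @ V]                                      # (i): x_t, and (iv): sum_i w_t(i) v_i
    constraints += [np.eye(n)[j] for j in frozen]                 # (ii): e_j for frozen coordinates
    constraints += [V[i] for i in inactive]                       # (iii): v_i for inactive coordinates
    return null_space(np.vstack(constraints))                     # columns span U_t

Starting from x = np.zeros(n) and w = np.ones(n), one would then repeat x, w = lrr_step(V, x, w, valid_directions(V, x, w)) until half of the coordinates are frozen.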

The analysis

Let Z_t = \sum_{i=1}^n w_t(i). Note that since z_t is on the sphere and \lambda_t \in [0,1] one has that |v_i \cdot (x_{t+1} - x_t)| \leq 1. Thus using \exp(x) \leq 1 + x + x^2 for x \in [-1,1], as well as property (iv) (i.e., \sum_{i=1}^n w_t(i) v_i \cdot z_t = 0) and \lambda_t \in [0,1] one obtains:

    \[Z_{t+1} = \sum_{i=1}^n w_t(i) \exp(v_i \cdot (x_{t+1} - x_t)) \leq \sum_{i=1}^n w_t(i) (1 + (v_i \cdot z_t)^2) .\]
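
In more detail, writing x_{t+1} - x_t = \lambda_t z_t and expanding (the linear term vanishes by property (iv)):

    \[\sum_{i=1}^n w_t(i) \exp(\lambda_t \, v_i \cdot z_t) \leq \sum_{i=1}^n w_t(i) \left(1 + \lambda_t \, v_i \cdot z_t + \lambda_t^2 (v_i \cdot z_t)^2\right) = Z_t + \lambda_t^2 \sum_{i=1}^n w_t(i) (v_i \cdot z_t)^2 ,\]

which is at most the right-hand side above since \lambda_t \leq 1.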

Observe now that the subspace U_t has dimension at least n/4 (say for n \geq 16): indeed U_t is defined by at most 2 + |F_t| + |I_t| \leq 2 + n/2 + n/16 linear constraints (recall that we stop once half of the coordinates are frozen), so \mathrm{dim}(U_t) \geq 7n/16 - 2 \geq n/4. Thus by (1) and the above inequalities one gets:

    \[Z_{t+1} \leq (1+ 4/n) Z_t .\]

In particular for any t \leq 2n, Z_{t} \leq C n for some numerical constant C >0 (one can take C = e^8 since Z_0 = n). It only remains to observe that this ensures w_{2n}(i) = O(1) for any i \in [n] (this concludes the proof since we already observed that at time 2 n at least half of the coordinates are colored). For this last implication we simply use property (iii). Indeed assume that some coordinate i satisfies at some time t \leq 2n, w_t(i) > c e for some c>0. Since each update increases the weights (multiplicatively) by at most a factor e, there must be an earlier time s at which this weight was already larger than c and yet it got updated, meaning that it was not among the top n/16 weights; in particular one had Z_s \geq c n / 16, which contradicts Z_{s} \leq C n for c large enough (namely c > 16 C).
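
To spell out why a bound on the weights is exactly a bound on the discrepancy (this is implicit above): since w_0 = 1 and x_0 = 0 the multiplicative updates telescope, so

    \[w_t(i) = \prod_{s < t} \exp\left(v_i \cdot (x_{s+1} - x_s)\right) = \exp(v_i \cdot x_t) ,\]

and thus w_t(i) \leq c e gives x_t \cdot v_i \leq 1 + \log c = O(1); recall from the setting that one takes both v_i and -v_i as constraints, so this one-sided bound also controls |x_t \cdot v_i|.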
