This week at the Simons Institute we had the first Big Data workshop on Succinct Data Representations and Applications. Here I would like to briefly talk about one of the ‘stars’ of this workshop: the squared-length sampling technique. I will illustrate this method on three examples (taken from three seminal papers).
Fast low rank approximation
Frieze, Kannan and Vempala proposed the following algorithm to compute an approximate low rank decomposition of a matrix $A \in \mathbb{R}^{m \times n}$ (this specific version is taken from Chapter 6 here). We denote by $a_1, \dots, a_n$ the columns of $A$.

Let $i_1, \dots, i_s$ be i.i.d. random variables taking values in $\{1, \dots, n\}$ whose distribution is proportional to the squared-length of the columns, more precisely the probability that $i_1$ is equal to $j$ is proportional to $\|a_j\|^2$. Let $B \in \mathbb{R}^{m \times s}$ be such that its $t^{\mathrm{th}}$ column is $\frac{\|A\|_F}{\sqrt{s}\, \|a_{i_t}\|} a_{i_t}$. Next compute an SVD of $B$ and let $u_1, \dots, u_k$ be the $k$ top left singular vectors of $B$. The low rank approximation to $A$ is finally given by

$$\tilde{A} = \sum_{t=1}^{k} u_t u_t^{\top} A .$$
One can prove that for $s \geq k / \varepsilon$ this approximation satisfies:

$$\mathbb{E}\, \|A - \tilde{A}\|_F^2 \leq \|A - A_k\|_F^2 + \varepsilon \|A\|_F^2 ,$$

where $A_k$ is the best rank $k$ approximation to $A$.
The amazing feature of this algorithm is that the complexity of computing the projection matrix $\sum_{t=1}^{k} u_t u_t^{\top}$ is independent of $n$ (one needs to be careful with how the sampling of $i_1, \dots, i_s$ is done, but this can be taken care of). In fact it is even possible to obtain an algorithm whose complexity is independent of both $m$ and $n$ and depends only polynomially on $k$ and $1/\varepsilon$ (though the version described above might be better in practice because, once the columns are sampled, the complexity is simply that of computing an SVD of the $m \times s$ matrix $B$, which takes $O(m s^2)$ time).
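To make the sampling step concrete, here is a minimal numpy sketch of the procedure described above (function and variable names are my own, and the dense SVD is just the simplest choice, not a claim about how the original algorithm is implemented):

```python
import numpy as np

def low_rank_by_column_sampling(A, k, s, rng=np.random.default_rng()):
    """Approximate rank-k decomposition of A via squared-length column sampling."""
    n = A.shape[1]
    col_sq_norms = np.sum(A ** 2, axis=0)        # ||a_j||^2 for each column
    probs = col_sq_norms / col_sq_norms.sum()    # squared-length sampling distribution
    idx = rng.choice(n, size=s, p=probs)         # i_1, ..., i_s drawn i.i.d.
    # Rescale column a_{i_t} by 1 / sqrt(s * p_{i_t}) so that E[B B^T] = A A^T.
    B = A[:, idx] / np.sqrt(s * probs[idx])
    U, _, _ = np.linalg.svd(B, full_matrices=False)
    U_k = U[:, :k]                               # top-k left singular vectors of B
    return U_k @ (U_k.T @ A)                     # projection of A onto span(u_1, ..., u_k)
```

Note that the SVD is computed on the small $m \times s$ matrix $B$, which is exactly where the independence from $n$ comes from (apart from the sampling pass itself).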
Fast graph sparsification
Spielman and Teng looked at the following graph sparsification problem: given a weighted graph $G$, find a ‘sparse’ weighted graph $H$ such that

$$(1-\varepsilon) L_G \preceq L_H \preceq (1+\varepsilon) L_G ,$$

where $L_G$ and $L_H$ are the graph Laplacians of $G$ and $H$ (here $\preceq$ is the positive semidefinite order, so the condition says that $x^{\top} L_H x$ is within a factor $1 \pm \varepsilon$ of $x^{\top} L_G x$ for every $x$). The idea is that if one is able to find ‘fast enough’ a sparse approximation $H$ to $G$, then one can solve many problems on $G$ (which is dense) by instead solving them much faster on $H$ (think of solving linear systems of the form $L_G x = b$ for example).
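Recall that the Laplacian of a weighted graph can be written as a sum of weighted edge outer products, $L_G = \sum_{(u,v)} w_{uv} (e_u - e_v)(e_u - e_v)^{\top}$; this is the form that the reduction below exploits. A tiny sketch (the edge-list format is my own choice):

```python
import numpy as np

def laplacian(n, weighted_edges):
    """Graph Laplacian L = sum over edges (u, v, w) of w * (e_u - e_v)(e_u - e_v)^T."""
    L = np.zeros((n, n))
    for u, v, w in weighted_edges:
        L[u, u] += w
        L[v, v] += w
        L[u, v] -= w
        L[v, u] -= w
    return L

# Example: a weighted triangle on 3 vertices.
L = laplacian(3, [(0, 1, 1.0), (1, 2, 2.0), (0, 2, 0.5)])
```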
Spielman and Srivastava proposed a method to carry out this graph sparsification task using the squared-length sampling technique. First they reduce the problem to the following: given vectors $v_1, \dots, v_m \in \mathbb{R}^n$ that form a decomposition of the identity, that is

$$\sum_{j=1}^{m} v_j v_j^{\top} = I_n ,$$

find a small subset of them such that, appropriately reweighted, they form an approximate decomposition of the identity. Rudelson showed that one can solve this problem by simply sampling $s = O(n \log(n) / \varepsilon^2)$ vectors from $v_1, \dots, v_m$ i.i.d. proportionally to their squared-length and reweighting them by the square root of the inverse probability of selecting this vector. In other words, he showed that if $i_1, \dots, i_s$ is an i.i.d. sequence in $\{1, \dots, m\}$ such that $\mathbb{P}(i_1 = j)$ is proportional to $\|v_j\|^2$, then with high probability one has

$$(1-\varepsilon) I_n \preceq \frac{1}{s} \sum_{t=1}^{s} \frac{v_{i_t} v_{i_t}^{\top}}{\mathbb{P}(i_1 = i_t)} \preceq (1+\varepsilon) I_n .$$
Batson, Spielman and Srivastava also showed how to find such an approximate decomposition with a deterministic algorithm; see also this very nice survey by Naor.
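As a quick illustration of Rudelson's sampling result (not of the full Spielman–Srivastava sparsification, which additionally needs the reduction from graphs to such vectors), here is a small numpy sketch on a synthetic decomposition of the identity; all names are mine:

```python
import numpy as np

def subsample_identity_decomposition(V, s, rng=np.random.default_rng()):
    """Rows v_1, ..., v_m of V satisfy sum_j v_j v_j^T = I_n; return s reweighted
    rows whose outer products still sum approximately to I_n."""
    m = V.shape[0]
    sq_norms = np.sum(V ** 2, axis=1)      # ||v_j||^2 (these sum to tr(I_n) = n)
    probs = sq_norms / sq_norms.sum()      # squared-length sampling distribution
    idx = rng.choice(m, size=s, p=probs)   # i_1, ..., i_s drawn i.i.d.
    # Reweight v_{i_t} by sqrt(1 / (s * p_{i_t})) so that E[W^T W] = I_n.
    return V[idx] / np.sqrt(s * probs[idx])[:, None]

# Sanity check on a random decomposition of the identity in dimension 20.
rng = np.random.default_rng(0)
V, _ = np.linalg.qr(rng.standard_normal((5000, 20)))   # rows of V: sum_j v_j v_j^T = I_20
W = subsample_identity_decomposition(V, s=1000, rng=rng)
print(np.linalg.norm(W.T @ W - np.eye(20), 2))          # small spectral error
```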
Approximation guarantee for k-means
The last example is the now famous k-means++ algorithm of Arthur and Vassilvitskii. Given $x_1, \dots, x_n \in \mathbb{R}^d$ one would like to find $k$ centers $c_1, \dots, c_k$ minimizing the following quantity:

$$\sum_{i=1}^{n} \min_{1 \leq j \leq k} \|x_i - c_j\|^2 .$$

The following strategy gives a randomized $O(\log k)$-approximation algorithm to this NP-hard problem. First select $c_1$ uniformly at random from $x_1, \dots, x_n$. Then select the new centers iteratively, each time sampling a point from $x_1, \dots, x_n$ with probability proportional to its squared distance to the current nearest center.
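A minimal sketch of this seeding procedure in numpy (names are my own; a full k-means++ run would follow the seeding with the usual Lloyd iterations):

```python
import numpy as np

def kmeans_pp_seeding(X, k, rng=np.random.default_rng()):
    """Pick k initial centers from the rows of X by squared-distance sampling."""
    n = len(X)
    centers = [X[rng.integers(n)]]                       # first center: uniform at random
    d2 = np.sum((X - centers[0]) ** 2, axis=1)           # squared distance to nearest center
    for _ in range(1, k):
        new_center = X[rng.choice(n, p=d2 / d2.sum())]   # squared-distance sampling
        centers.append(new_center)
        d2 = np.minimum(d2, np.sum((X - new_center) ** 2, axis=1))
    return np.array(centers)
```

This is once again squared-length sampling, applied to the vectors joining each point to its current nearest center.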
By Jeremie September 23, 2013 - 7:53 am
Why is there a weight on the selected columns of A in the matrix B?
How can one draw columns with probability proportional to their Frobenius norm without having computed the norms? I guess you have to introduce an assumption on the distribution of the norms of the columns.
Is it possible to just use the s columns with the highest Frobenius norm?
By Sebastien Bubeck September 23, 2013 - 11:20 am
Hey Jeremie,
thanks for your questions! There are weights in the matrix B in order to ensure unbiasedness. Also, the randomization is absolutely key: selecting the columns with the highest Frobenius norm would not work. As far as the precise sampling algorithm goes, you should check the linked papers; it is not difficult but not trivial either.