
Notes on least-squares, part I

These are mainly notes for myself, but I figured that they might be of interest to some of the blog readers too. Comments on what is written below are most welcome!

Let (X, Y) \in \mathbb{R}^d \times \mathbb{R} be a pair of random variables, and let f: \mathbb{R}^d \rightarrow \mathbb{R} be the convex quadratic function defined by

    \[f(w) = \frac12 \mathbb{E} (X^{\top} w - Y)^2 .\]

We are interested in finding a minimizer w^* of f (we will also be satisfied with w such that f(w) - f(w^*) \leq \epsilon). Denoting by \Sigma = \mathbb{E} X X^{\top} the covariance matrix of the features (we refer to X as a feature vector) and by \mu = \mathbb{E} Y X, one has w^* = \Sigma^{-1} \mu (if \Sigma is not invertible then the inverse should be understood as the Moore-Penrose pseudo-inverse, in which case w^* is the smallest-norm element of the affine subspace of minimizers of f). The model of computation we are interested in corresponds to large-scale problems where one has access to an unlimited stream of i.i.d. copies of (X, Y). Thus the only constraint in finding w^* is our computational ability to process samples. Furthermore we want to be able to deal with high-dimensional problems where \Sigma can be extremely ill-conditioned (for instance the smallest eigenvalue can be 0, or very close to 0), and thus bounds that depend on the condition number of \Sigma (ratio of largest eigenvalue to smallest eigenvalue) are irrelevant for us.

Let us introduce a bit more notation before diving into the comparison of different algorithms. With the well-specified model in mind (i.e., Y = X^{\top} w^* + \epsilon where \epsilon \sim \mathcal{N}(0, \sigma^2) is independent of X) we define the noise level \sigma of the problem by assuming that \mathbb{E} \ (X^{\top} w^* - Y)^2 X X^{\top} \preceq \sigma^2 \Sigma. For simplicity we will assume (whenever necessary) that almost surely one has |X| \leq R and |Y| \leq B. All our iterative algorithms will start at w_0 = 0. We will also not worry too much about numerical constants and write \lesssim for an inequality true up to a numerical constant. Finally for the computational complexity we will count the number of elementary d-dimensional vector operations that one has to do (e.g. adding/scaling such vectors, or requesting a new i.i.d. copy of (X,Y)).
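To make the setting concrete, here is a minimal numpy sketch of a sampling oracle for the well-specified model (the Gaussian choice for X is purely illustrative and my own assumption; it does not literally satisfy the boundedness conditions |X| \leq R and |Y| \leq B, but it is convenient for quick experiments):

    import numpy as np

    def make_stream(w_star, Sigma, sigma, rng):
        # Yield i.i.d. copies of (X, Y) with Y = X^T w_star + eps,
        # eps ~ N(0, sigma^2) independent of X, and X ~ N(0, Sigma).
        d = len(w_star)
        # small jitter so the Cholesky factorization works even when
        # Sigma is singular or extremely ill-conditioned
        L = np.linalg.cholesky(Sigma + 1e-12 * np.eye(d))
        while True:
            x = L @ rng.standard_normal(d)
            y = x @ w_star + sigma * rng.standard_normal()
            yield x, y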

ERM

The most natural approach (also the most terrible as it will turn out) is to take a large sample (x_i, y_i)_{i \in [n]} and then minimize the corresponding empirical version \hat{f} of f, or in other words compute \hat{w} = \left(\mathbb{X} \mathbb{X}^{\top} \right)^{-1} \mathbb{X} \mathbb{Y} where \mathbb{Y} \in \mathbb{R}^n is the column vector (y_i) and \mathbb{X} \in \mathbb{R}^{d \times n} is the matrix whose i^{th} column is x_i. Since \hat{f} is a quadratic function, we can use the conjugate gradient algorithm (see for example Section 2.4 here) to get \hat{w} in d iterations. However each iteration (basically computing a gradient of \hat{f}) costs O(n) elementary vector operations, thus leading overall to O(n d) elementary vector operations (which, perhaps surprisingly, is also the cost of just forming the empirical covariance matrix \hat{\Sigma} = \mathbb{X} \mathbb{X}^{\top}). What number n of samples should one take? The answer (at least for the well-specified model) comes from a standard statistical analysis (e.g. with the Rademacher complexity described here, see also these nice lecture notes by Philippe Rigollet) which yields:

    \[\mathbb{E} \hat{f}(\hat{w}) - f(w^*) \lesssim \frac{\sigma^2 \mathrm{rank}(\Sigma)}{n} .\]

(Note that in the above bound we evaluate \hat{w} on \hat{f} instead of f, see the comments section for more details on this issue. Since this is not the main point of the post I overlook this issue in what follows.) Wrapping up we see that this “standard” approach solves our objective by taking n= \sigma^2 d / \epsilon samples and performing O(\sigma^2 d^2 / \epsilon) elementary vector operations. This d^2 number of elementary vector operations completely dooms this approach when d is very large. We note however that the statistical bound above can be shown to be minimax optimal, that is any algorithm has to use at least \sigma^2 d / \epsilon samples (in the worst case over all \Sigma) in order to satisfy our objective. Thus the only hope to do better than the above approach is to somehow process the data more efficiently.
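For concreteness, here is a sketch of this ERM approach via conjugate gradient on the normal equations \mathbb{X} \mathbb{X}^{\top} w = \mathbb{X} \mathbb{Y}, written so that the empirical covariance matrix is never formed explicitly; each iteration costs one product with \mathbb{X} \mathbb{X}^{\top}, i.e. O(n) elementary vector operations, and (in exact arithmetic) at most d iterations are needed. The function and its stopping rule are my own illustration, not taken from a specific reference.

    import numpy as np

    def erm_conjugate_gradient(X, Y, n_iter=None, tol=1e-10):
        # X has shape (d, n), one sample per column; Y has shape (n,).
        # Solve (X X^T) w = X Y by conjugate gradient, applying X X^T
        # as a matrix-vector product so the d x d covariance is never built.
        d, n = X.shape
        n_iter = d if n_iter is None else n_iter
        matvec = lambda v: X @ (X.T @ v)   # (X X^T) v in O(n) vector operations
        b = X @ Y
        w = np.zeros(d)                    # all our iterative methods start at w_0 = 0
        r = b - matvec(w)
        p = r.copy()
        rs = r @ r
        for _ in range(n_iter):
            Ap = matvec(p)
            alpha = rs / (p @ Ap)
            w += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs) * p
            rs = rs_new
        return w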

A potential idea to trade a bit of the nice statistical property of \hat{w} for some computational easiness is to add a regularization term to \hat{f} in order to make the problem better conditioned (recall that conjugate gradient reaches an \epsilon-optimal point in \sqrt{\kappa} \log(1/\epsilon) iterations, where \kappa is the condition number).

In other words we would like to minimize, for some values of n and \lambda to be specified,

    \[\frac{1}{2n} \sum_{i=1}^n (x_i^{\top} w - y_i)^2 + \frac{\lambda}{2} |w|^2 .\]

One can write the minimizer as \hat{w}(\lambda) = \left(\mathbb{X} \mathbb{X}^{\top} + n \lambda \mathrm{I}_d \right)^{-1} \mathbb{X} \mathbb{Y}, and this point is now obtained at a cost of \tilde{O} \left(n \sqrt{\frac{\|\Sigma\|}{\lambda}}\right) elementary vector operations (\|\Sigma\| denotes the spectral norm of \Sigma and we used the fact that with high probability \|\hat{\Sigma}\| \lesssim \|\Sigma\|). Here one can be even smarter and use accelerated SVRG instead of conjugate gradient (for SVRG see Section 6.3 here, and here or here for how to accelerate it), which will reduce the number of elementary vector operations to \tilde{O}\left(n +\sqrt{n \frac{\|\Sigma\|}{\lambda}} \right). Also observe that again an easy argument based on Rademacher complexity leads to

    \[\mathbb{E} f(\hat{w}(\lambda)) - f(w^*) \lesssim \lambda |w^*|^2 + \frac{\sigma^2 d}{n} .\]

Thus we want to take \lambda of order \frac{\sigma^2 d}{n |w^*|^2}, which in terms of \epsilon is of order \epsilon / |w^*|^2, leading to an overall number of vector operations of

    \[\frac{\sigma^2 d}{\epsilon} + \frac{|w^*| \sqrt{\|\Sigma\| \sigma^2 d}}{\epsilon} .\]

Note that, perhaps surprisingly, anything less smart than accelerated SVRG would lead to a worse dependency on \epsilon than the one obtained above (which is also the same as the one obtained by the “standard” approach). Finally we observe that at a high level the two terms in the above bound can be viewed as “variance” and “bias” terms. The variance term comes from the inherent statistical limitation while the bias comes from the approximate optimization. In particular it makes sense that |w^*| appears in this bias term as it is the initial distance to the optimum for the optimization procedure.
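Since SVRG is doing the heavy lifting here, a sketch may help fix ideas. Below is plain (non-accelerated) SVRG for the ridge-regularized empirical objective; the accelerated variants referenced above wrap an extra momentum/catalyst layer around this inner loop, which I do not reproduce, and the step size and inner-loop length are left as free parameters.

    import numpy as np

    def svrg_ridge(X, Y, lam, step, n_epochs, m=None, rng=None):
        # Plain SVRG for (1/2n) sum_i (x_i^T w - y_i)^2 + (lam/2) |w|^2,
        # with X of shape (d, n).  Each epoch: one full gradient at a
        # reference point, then m variance-reduced stochastic steps.
        rng = np.random.default_rng() if rng is None else rng
        d, n = X.shape
        m = n if m is None else m
        w = np.zeros(d)
        for _ in range(n_epochs):
            w_ref = w.copy()
            full_grad = X @ (X.T @ w_ref - Y) / n + lam * w_ref
            for _ in range(m):
                i = rng.integers(n)
                x_i, y_i = X[:, i], Y[i]
                g_w   = x_i * (x_i @ w     - y_i) + lam * w
                g_ref = x_i * (x_i @ w_ref - y_i) + lam * w_ref
                w -= step * (g_w - g_ref + full_grad)   # variance-reduced step
        return w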

Standard SGD

Another approach altogether, which is probably the most natural one for anyone trained in machine learning, is to run the stochastic gradient algorithm. Writing g_i(w) = x_i (x_i^{\top} w - y_i) (this is a stochastic gradient, that is, it satisfies \mathbb{E}[g_i(w) | w] = \nabla f(w)) we recall that SGD can be written as:

    \[w_{i+1} = w_i - \eta g_i(w_i) = (\mathrm{I}_d - \eta x_i x_i^{\top}) w_i + \eta y_i x_i .\]

The most basic guarantee for SGD (see Theorem 6.1 here) gives that with a step size \eta of order 1/\sqrt{n}, and assuming for simplicity (see below for how to get around this) that |w^*| is known up to a numerical constant, so that if w_{i+1} steps outside the ball of that radius then we project it back onto the ball, one has (we also note that \mathbb{E}( |g_i(w)|^2 | w) \leq (B + |w| R)^2 \mathrm{Tr}(\Sigma), and we denote \bar{\sigma}^2 = (B + |w^*| R)^2):

    \[\mathbb{E} f\left(\frac1{n} \sum_{i=1}^n w_i\right) - f(w^*) \lesssim \sqrt{\frac{|w^*|^2 \bar{\sigma}^2 \mathrm{Tr}(\Sigma)}{n}} .\]

This means that, roughly, with SGD the computational cost for our main task corresponds to O(|w^*|^2 \bar{\sigma}^2 \mathrm{Tr}(\Sigma) / \epsilon^2) vector operations. This is a fairly weak result compared to accelerated SVRG mainly because of: (i) the dependency on \epsilon^{-2} instead of \epsilon^{-1}, and (ii) \bar{\sigma}^2 instead of \sigma^2. We will fix both issues, but let us also observe that on the positive side the term d \|\Sigma\| in accelerated SVRG is replaced by the less conservative quantity \mathrm{Tr}(\Sigma) (that is, one replaces the largest eigenvalue of \Sigma by the average eigenvalue of \Sigma).
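In code, the projected-and-averaged SGD analyzed above looks as follows (a sketch: `stream` is any iterator over i.i.d. samples, e.g. the one from the first sketch, and `radius` is the assumed known bound on |w^*|):

    import numpy as np

    def sgd_projected(stream, d, n, radius, eta=None):
        # SGD with step size of order 1/sqrt(n), projection onto the
        # ball of radius `radius`, and averaging of the iterates.
        eta = 1.0 / np.sqrt(n) if eta is None else eta
        w = np.zeros(d)
        w_avg = np.zeros(d)
        for _ in range(n):
            x, y = next(stream)
            g = x * (x @ w - y)            # stochastic gradient g_i(w_i)
            w -= eta * g
            norm = np.linalg.norm(w)
            if norm > radius:              # project back onto the ball
                w *= radius / norm
            w_avg += w / n
        return w_avg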

Let us quickly check whether one can improve the performance of SGD by regularization. That is, we consider applying SGD to f(w) + \frac{\lambda}{2} |w|^2 with stochastic gradient g_i(w) = (\lambda \mathrm{I}_d + x_i x_i^{\top}) w - y_i x_i, which leads to (see Theorem 6.2 here):

    \[\mathbb{E} f\left(\frac1{n} \sum_{i=1}^n w_i\right) - f(w^*) \lesssim \lambda |w^*|^2 + \frac{\bar{\sigma}^2 \mathrm{Tr}(\Sigma)}{\lambda n} .\]

Optimizing over \lambda we see that we get the same rate as without regularization. Thus, unsurprisingly, standard SGD fares poorly compared to the more sophisticated approach based on accelerated SVRG described above. However we will see next that there is a very simple fix to make SGD competitive with accelerated SVRG, namely just tune up the learning rate!
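For completeness, the regularized variant only changes the stochastic gradient; a sketch, with the same projection and averaging as before:

    import numpy as np

    def sgd_regularized(stream, d, n, lam, radius, eta=None):
        # SGD on f(w) + (lam/2)|w|^2 with stochastic gradient
        # g_i(w) = (lam I + x_i x_i^T) w - y_i x_i.
        eta = 1.0 / np.sqrt(n) if eta is None else eta
        w, w_avg = np.zeros(d), np.zeros(d)
        for _ in range(n):
            x, y = next(stream)
            g = lam * w + x * (x @ w - y)
            w -= eta * g
            norm = np.linalg.norm(w)
            if norm > radius:
                w *= radius / norm
            w_avg += w / n
        return w_avg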

Constant step size SGD (after Bach and Moulines 2013)

SGD with a constant step size \eta of order 1/R^2 satisfies

    \[\mathbb{E} f\left(\frac1{n} \sum_{i=1}^n w_i\right) - f(w^*) \lesssim \frac{\sigma^2 d + R^2 |w^*|^2}{n}.\]

We see that this result gives a bias/variance decomposition similar to the one we saw for accelerated SVRG. In particular the variance term \sigma^2 d / n is minimax optimal, while the bias term R^2 |w^*|^2 / n matches what one would expect for a basic first-order method (this term is not quite optimal, as one would expect that a 1/n^2 decrease is possible since this is the optimal rate for first-order optimization of a smooth quadratic function without strong convexity; indeed a recent paper of Dieuleveut, Flammarion and Bach tackles this issue).
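The algorithm itself could not be simpler: constant step size, no projection, and averaging of the iterates. A sketch (the constant 1/2 in the step size is an arbitrary illustrative choice, not the one prescribed by Bach and Moulines):

    import numpy as np

    def sgd_constant_step(stream, d, n, R, c=0.5):
        # Constant step size SGD with iterate averaging; R is the
        # almost-sure bound on |X| from the setup above.
        eta = c / R ** 2
        w = np.zeros(d)
        w_avg = np.zeros(d)
        for _ in range(n):
            x, y = next(stream)
            w -= eta * x * (x @ w - y)   # w_{i+1} = (I - eta x x^T) w_i + eta y x
            w_avg += w / n
        return w_avg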

In part II I hope to give the intuition/proof of the above result, and in part III I will discuss other aspects of the least squares problem (dual methods, random projections, sparsity).
