
Notes on least-squares, part I

These are mainly notes for myself, but I figured that they might be of interest to some of the blog readers too. Comments on what is written below are most welcome!

Let (X, Y) \in \mathbb{R}^d \times \mathbb{R} be a pair of random variables, and let f: \mathbb{R}^d \rightarrow \mathbb{R} be the convex quadratic function defined by

    \[f(w) = \frac12 \mathbb{E} (X^{\top} w - Y)^2 .\]

We are interested in finding a minimizer w^* of f (we will also be satisfied with w such that f(w) - f(w^*) \leq \epsilon). Denoting by \Sigma = \mathbb{E} X X^{\top} the covariance matrix of the features (we refer to X as a feature vector) and by \mu = \mathbb{E} Y X, one has w^* = \Sigma^{-1} \mu (if \Sigma is not invertible then the inverse should be understood as the Moore-Penrose pseudo-inverse, in which case w^* is the smallest-norm element of the affine subspace of minimizers of f). The model of computation we are interested in corresponds to large-scale problems where one has access to an unlimited stream of i.i.d. copies of (X, Y). Thus the only constraint in finding w^* is our computational ability to process samples. Furthermore we want to be able to deal with high-dimensional problems where \Sigma can be extremely ill-conditioned (for instance the smallest eigenvalue can be 0, or very close to 0), and thus bounds that depend on the condition number of \Sigma (ratio of largest eigenvalue to smallest eigenvalue) are irrelevant for us.

Let us introduce a bit more notation before diving into the comparison of different algorithms. With the well-specified model in mind (i.e., Y = X^{\top} w^* + \epsilon where \epsilon \sim \mathcal{N}(0, \sigma^2) is independent of X) we define the noise level \sigma of the problem by assuming that \mathbb{E} \ (X^{\top} w^* - Y)^2 X X^{\top} \preceq \sigma^2 \Sigma. For simplicity we will assume (whenever necessary) that almost surely one has |X| \leq R and |Y| \leq B. All our iterative algorithms will start at w_0 = 0. We will also not worry too much about numerical constants and write \lesssim for an inequality true up to a numerical constant. Finally for the computational complexity we will count the number of elementary d-dimensional vector operations that one has to do (e.g. adding/scaling such vectors, or requesting a new i.i.d. copy of (X,Y)).
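To make the setting concrete, here is a minimal numpy sketch of a sampling oracle for the well-specified model (the Gaussian choice for X is purely illustrative and my own assumption; it does not literally satisfy the boundedness conditions |X| \leq R and |Y| \leq B, but it is convenient for quick experiments):

    import numpy as np

    def make_stream(w_star, Sigma, sigma, rng):
        # Yield i.i.d. copies of (X, Y) with Y = X^T w_star + eps,
        # eps ~ N(0, sigma^2) independent of X, and X ~ N(0, Sigma).
        d = len(w_star)
        # small jitter so the Cholesky factorization works even when
        # Sigma is singular or extremely ill-conditioned
        L = np.linalg.cholesky(Sigma + 1e-12 * np.eye(d))
        while True:
            x = L @ rng.standard_normal(d)
            y = x @ w_star + sigma * rng.standard_normal()
            yield x, y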

ERM

The most natural approach (also the most terrible as it will turn out) is to take a large sample (x_i, y_i)_{i \in [n]} and then minimize the corresponding empirical version \hat{f} of f, or in other words compute \hat{w} = \left(\mathbb{X} \mathbb{X}^{\top} \right)^{-1} \mathbb{X} \mathbb{Y} where \mathbb{Y} \in \mathbb{R}^n is the column vector (y_i) and \mathbb{X} \in \mathbb{R}^{d \times n} is the matrix whose i^{th} column is x_i. Since \hat{f} is a quadratic function, we can use the conjugate gradient algorithm (see for example Section 2.4 here) to get \hat{w} in d iterations. However each iteration (basically computing a gradient of \hat{f}) costs O(n) elementary vector operations, thus leading overall to O(n d) elementary vector operations (which, perhaps surprisingly, is also the cost of just forming the empirical covariance matrix \hat{\Sigma} = \mathbb{X} \mathbb{X}^{\top}). What number n of samples should one take? The answer (at least for the well-specified model) comes from a standard statistical analysis (e.g. with the Rademacher complexity described here, see also these nice lecture notes by Philippe Rigollet) which yields:

    \[\mathbb{E} \hat{f}(\hat{w}) - f(w^*) \lesssim \frac{\sigma^2 \mathrm{rank}(\Sigma)}{n} .\]

(Note that in the above bound we evaluate \hat{w} on \hat{f} instead of f, see the comments section for more details on this issue. Since this is not the main point of the post I overlook this issue in what follows.) Wrapping up we see that this “standard” approach solves our objective by taking n= \sigma^2 d / \epsilon samples and performing O(\sigma^2 d^2 / \epsilon) elementary vector operations. This d^2 number of elementary vector operations completely dooms this approach when d is very large. We note however that the statistical bound above can be shown to be minimax optimal, that is any algorithm has to use at least \sigma^2 d / \epsilon samples (in the worst case over all \Sigma) in order to satisfy our objective. Thus the only hope to do better than the above approach is to somehow process the data more efficiently.
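For concreteness, here is a sketch of this ERM approach via conjugate gradient on the normal equations \mathbb{X} \mathbb{X}^{\top} w = \mathbb{X} \mathbb{Y}, written so that the empirical covariance matrix is never formed explicitly; each iteration costs one product with \mathbb{X} \mathbb{X}^{\top}, i.e. O(n) elementary vector operations, and (in exact arithmetic) at most d iterations are needed. The function and its stopping rule are my own illustration, not taken from a specific reference.

    import numpy as np

    def erm_conjugate_gradient(X, Y, n_iter=None, tol=1e-10):
        # X has shape (d, n), one sample per column; Y has shape (n,).
        # Solve (X X^T) w = X Y by conjugate gradient, applying X X^T
        # as a matrix-vector product so the d x d covariance is never built.
        d, n = X.shape
        n_iter = d if n_iter is None else n_iter
        matvec = lambda v: X @ (X.T @ v)   # (X X^T) v in O(n) vector operations
        b = X @ Y
        w = np.zeros(d)                    # all our iterative methods start at w_0 = 0
        r = b - matvec(w)
        p = r.copy()
        rs = r @ r
        for _ in range(n_iter):
            Ap = matvec(p)
            alpha = rs / (p @ Ap)
            w += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs) * p
            rs = rs_new
        return w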

A potential idea to trade a bit of the nice statistical property of \hat{w} for some computational easiness is to add a regularization term to \hat{f} in order to make the problem better conditioned (recall that conjugate gradient reaches an \epsilon-optimal point in \sqrt{\kappa} \log(1/\epsilon) iterations, where \kappa is the condition number).

In other words we would like to minimize, for some values of n and \lambda to be specified,

    \[\frac{1}{2n} \sum_{i=1}^n (x_i^{\top} w - y_i)^2 + \frac{\lambda}{2} |w|^2 .\]

One can write the minimizer as \hat{w}(\lambda) = \left(\mathbb{X} \mathbb{X}^{\top} + n \lambda \mathrm{I}_d \right)^{-1} \mathbb{X} \mathbb{Y}, and this point is now obtained at a cost of \tilde{O} \left(n \sqrt{\frac{\|\Sigma\|}{\lambda}}\right) elementary vector operations (\|\Sigma\| denotes the spectral norm of \Sigma and we used the fact that with high probability \|\hat{\Sigma}\| \lesssim \|\Sigma\|). Here one can be even smarter and use accelerated SVRG instead of conjugate gradient (for SVRG see Section 6.3 here, and here or here for how to accelerate it), which will reduce the number of elementary vector operations to \tilde{O}\left(n +\sqrt{n \frac{\|\Sigma\|}{\lambda}} \right). Also observe that again an easy argument based on Rademacher complexity leads to

    \[\mathbb{E} f(\hat{w}(\lambda)) - f(w^*) \lesssim \lambda |w^*|^2 + \frac{\sigma^2 d}{n} .\]

Thus we want to take \lambda of order \frac{\sigma^2 d}{n |w^*|^2}, which in terms of \epsilon is of order \epsilon / |w^*|^2, leading to an overall number of vector operations of

    \[\frac{\sigma^2 d}{\epsilon} + \frac{|w^*| \sqrt{\|\Sigma\| \sigma^2 d}}{\epsilon} .\]

Note that, perhaps surprisingly, anything less smart than accelerated SVRG would lead to a worse dependency on \epsilon than the one obtained above (which is also the same as the one obtained by the “standard” approach). Finally we observe that at a high level the two terms in the above bound can be viewed as “variance” and “bias” terms. The variance term comes from the inherent statistical limitation while the bias comes from the approximate optimization. In particular it makes sense that |w^*| appears in this bias term as it is the initial distance to the optimum for the optimization procedure.
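Since SVRG is doing the heavy lifting here, a sketch may help fix ideas. Below is plain (non-accelerated) SVRG for the ridge-regularized empirical objective; the accelerated variants referenced above wrap an extra momentum/catalyst layer around this inner loop, which I do not reproduce, and the step size and inner-loop length are left as free parameters.

    import numpy as np

    def svrg_ridge(X, Y, lam, step, n_epochs, m=None, rng=None):
        # Plain SVRG for (1/2n) sum_i (x_i^T w - y_i)^2 + (lam/2) |w|^2,
        # with X of shape (d, n).  Each epoch: one full gradient at a
        # reference point, then m variance-reduced stochastic steps.
        rng = np.random.default_rng() if rng is None else rng
        d, n = X.shape
        m = n if m is None else m
        w = np.zeros(d)
        for _ in range(n_epochs):
            w_ref = w.copy()
            full_grad = X @ (X.T @ w_ref - Y) / n + lam * w_ref
            for _ in range(m):
                i = rng.integers(n)
                x_i, y_i = X[:, i], Y[i]
                g_w   = x_i * (x_i @ w     - y_i) + lam * w
                g_ref = x_i * (x_i @ w_ref - y_i) + lam * w_ref
                w -= step * (g_w - g_ref + full_grad)   # variance-reduced step
        return w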

Standard SGD

Another approach altogether, which is probably the most natural one for anyone trained in machine learning, is to run the stochastic gradient algorithm. Writing g_i(w) = x_i (x_i^{\top} w - y_i) (this is a stochastic gradient, that is, it satisfies \mathbb{E}[g_i(w) | w] = \nabla f(w)) we recall that SGD can be written as:

    \[w_{i+1} = w_i - \eta g_i(w_i) = (\mathrm{I}_d - \eta x_i x_i^{\top}) w_i + \eta y_i x_i .\]

The most basic guarantee for SGD (see Theorem 6.1 here) gives that with a step size \eta of order 1/\sqrt{n}, and assuming for simplicity (see below for how to get around this) that |w^*| is known up to a numerical constant, so that if w_{i+1} steps outside the ball of that radius then we project it back onto the ball, one has (we also note that \mathbb{E}( |g_i(w)|^2 | w) \leq (B + |w| R)^2 \mathrm{Tr}(\Sigma), and we denote \bar{\sigma}^2 = (B + |w^*| R)^2):

    \[\mathbb{E} f\left(\frac1{n} \sum_{i=1}^n w_i\right) - f(w^*) \lesssim \sqrt{\frac{|w^*|^2 \bar{\sigma}^2 \mathrm{Tr}(\Sigma)}{n}} .\]

This means that, roughly, with SGD the computational cost for our main task corresponds to O(|w^*|^2 \bar{\sigma}^2 \mathrm{Tr}(\Sigma) / \epsilon^2) vector operations. This is a fairly weak result compared to accelerated SVRG mainly because of: (i) the dependency on \epsilon^{-2} instead of \epsilon^{-1}, and (ii) \bar{\sigma}^2 instead of \sigma^2. We will fix both issues, but let us also observe that on the positive side the term d \|\Sigma\| in accelerated SVRG is replaced by the less conservative quantity \mathrm{Tr}(\Sigma) (that is, one replaces the largest eigenvalue of \Sigma by the average eigenvalue of \Sigma).
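In code, the projected-and-averaged SGD analyzed above looks as follows (a sketch: `stream` is any iterator over i.i.d. samples, e.g. the one from the first sketch, and `radius` is the assumed known bound on |w^*|):

    import numpy as np

    def sgd_projected(stream, d, n, radius, eta=None):
        # SGD with step size of order 1/sqrt(n), projection onto the
        # ball of radius `radius`, and averaging of the iterates.
        eta = 1.0 / np.sqrt(n) if eta is None else eta
        w = np.zeros(d)
        w_avg = np.zeros(d)
        for _ in range(n):
            x, y = next(stream)
            g = x * (x @ w - y)            # stochastic gradient g_i(w_i)
            w -= eta * g
            norm = np.linalg.norm(w)
            if norm > radius:              # project back onto the ball
                w *= radius / norm
            w_avg += w / n
        return w_avg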

Let us quickly check whether one can improve the performance of SGD by regularization. That is, we consider applying SGD to f(w) + \frac{\lambda}{2} |w|^2 with stochastic gradient g_i(w) = (\lambda \mathrm{I}_d + x_i x_i^{\top}) w - y_i x_i, which leads to (see Theorem 6.2 here):

    \[\mathbb{E} f\left(\frac1{n} \sum_{i=1}^n w_i\right) - f(w^*) \lesssim \lambda |w^*|^2 + \frac{\bar{\sigma}^2 \mathrm{Tr}(\Sigma)}{\lambda n} .\]

Optimizing over \lambda we see that we get the same rate as without regularization. Thus, unsurprisingly, standard SGD fares poorly compared to the more sophisticated approach based on accelerated SVRG described above. However we will see next that there is a very simple fix to make SGD competitive with accelerated SVRG, namely just tune up the learning rate!
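For completeness, the regularized variant only changes the stochastic gradient; a sketch, with the same projection and averaging as before:

    import numpy as np

    def sgd_regularized(stream, d, n, lam, radius, eta=None):
        # SGD on f(w) + (lam/2)|w|^2 with stochastic gradient
        # g_i(w) = (lam I + x_i x_i^T) w - y_i x_i.
        eta = 1.0 / np.sqrt(n) if eta is None else eta
        w, w_avg = np.zeros(d), np.zeros(d)
        for _ in range(n):
            x, y = next(stream)
            g = lam * w + x * (x @ w - y)
            w -= eta * g
            norm = np.linalg.norm(w)
            if norm > radius:
                w *= radius / norm
            w_avg += w / n
        return w_avg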

Constant step size SGD (after Bach and Moulines 2013)

SGD with a constant step size \eta of order 1/R^2 satisfies

    \[\mathbb{E} f\left(\frac1{n} \sum_{i=1}^n w_i\right) - f(w^*) \lesssim \frac{\sigma^2 d + R^2 |w^*|^2}{n}.\]

We see that this result gives a bias/variance decomposition similar to the one we saw for accelerated SVRG. In particular the variance term \sigma^2 d / n is minimax optimal, while the bias term R^2 |w^*|^2 / n matches what one would expect for a basic first-order method (this term is not quite optimal, as one would expect that a 1/n^2 decrease is possible since this is the optimal rate for first-order optimization of a smooth quadratic function without strong convexity; indeed a recent paper of Dieuleveut, Flammarion and Bach tackles this issue).
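The algorithm itself could not be simpler: constant step size, no projection, and averaging of the iterates. A sketch (the constant 1/2 in the step size is an arbitrary illustrative choice, not the one prescribed by Bach and Moulines):

    import numpy as np

    def sgd_constant_step(stream, d, n, R, c=0.5):
        # Constant step size SGD with iterate averaging; R is the
        # almost-sure bound on |X| from the setup above.
        eta = c / R ** 2
        w = np.zeros(d)
        w_avg = np.zeros(d)
        for _ in range(n):
            x, y = next(stream)
            w -= eta * x * (x @ w - y)   # w_{i+1} = (I - eta x x^T) w_i + eta y x
            w_avg += w / n
        return w_avg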

In part II I hope to give the intuition/proof of the above result, and in part III I will discuss other aspects of the least squares problem (dual methods, random projections, sparsity).
