# Notes on least-squares, part I

These are mainly notes for myself, but I figured that they might be of interest to some of the blog readers too. Comments on what is written below are most welcome!

Let be a pair of random variables, and let be the convex quadratic function defined by

We are interested in finding a minimizer of (we will also be satisfied with such that ). Denoting for the covariance matrix of the features (we refer to as a feature vector) and , one has (if is not invertible then the inverse should be understood as the Moore-Penrose pseudo-inverse, in which case is the smallest-norm element of the affine subspace of minimizers of ). The model of computation we are interested in corresponds to large-scale problems where one has access to an unlimited stream of i.i.d. copies of . Thus the only constraint to find is our computational ability to process samples. Furthermore we want to be able to deal with large dimensional problem where can be extremely ill-conditioned (for instance the smallest eigenvalue can be , or very close to ), and thus bounds that depend on the condition number of (ratio of largest eigenvalue to smallest eigenvalue) are irrelevant for us.

Let us introduce a few more notation before diving into the comparison of different algorithms. With the well-specified model in mind (i.e., where is independent of ) we define the noise level of the problem by assuming that . For simplicity we will assume (whenever necessary) that almost surely one has and . All our iterative algorithms will start at . We will also not worry too much about numerical constants and write for an inequality true up to a numerical constant. Finally for the computational complexity we will count the number of elementary -dimensional vector operations that one has to do (e.g. adding/scaling such vectors, or requesting a new i.i.d. copy of ).

ERM

The most natural approach (also the most terrible as it will turn out) is to take a large sample and then minimize the corresponding empirical version of , or in other words compute where is the column vector and is the matrix whose column is . Since is a quadratic function, we can use the conjugate gradient algorithm (see for example Section 2.4 here) to get in iterations. However each iteration (basically computing a gradient of ) cost elementary vector operations, thus leading overall to elementary vector operations (which, perhaps surprisingly, is also the cost of just forming the empirical covariance matrix ). What number of samples should one take? The answer (at least for the well-specified model) comes from a standard statistical analysis (e.g. with the Rademacher complexity described here, see also these nice lecture notes by Philippe Rigollet) which yield:

(Note that in the above bound we evaluate on instead of , see the comments section for more details on this issue. Since this is not the main point of the post I overlook this issue in what follows.) Wrapping up we see that this “standard” approach solves our objective by taking samples and performing elementary vector operations. This number of elementary vector operations completely dooms this approach when is very large. We note however that the statistical bound above can be shown to be minimax optimal, that is any algorithm has to use at least samples (in the worst case over all ) in order to satisfy our objective. Thus the only hope to do better than the above approach is to somehow process the data more efficiently.

A potential idea to trade a bit of the nice statistical property of for some computational easiness is to add a regularization term to in order to make the problem better conditioned (recall that conjugate gradient reaches an -optimal point in iterations, where is the condition number).

In other words we would like to minimize, for some values of and to be specified,

One can write the minimizer as , and this point is now obtained at a cost of elementary vector operations ( denotes the spectral norm of and we used the fact that with high probability ). Here one can be even smarter and use accelerated SVRG instead of conjugate gradient (for SVRG see Section 6.3 here, and here or here for how to accelerate it), which will reduce the number of elementary vector operations to . Also observe that again an easy argument based on Rademacher complexity leads to

. Thus we want to take of order which in terms of is of order leading to an overall number of vector operations of

Note that, perhaps surprisingly, anything less smart than accelerated SVRG would lead to a worse dependency on than the one obtained above (which is also the same as the one obtained by the “standard” approach). Finally we observe that at a high level the two terms in the above bound can be viewed as “variance” and “bias” terms. The variance term comes from the inherent statistical limitation while the bias comes from the approximate optimization. In particular it makes sense that appears in this bias term as it is the initial distance to the optimum for the optimization procedure.

Standard SGD

Another approach altogether, which is probably the most natural approach for anyone trained in machine learning, is to run the stochastic gradient algorithm. Writing (this is a stochastic gradient, that is it satisfies ) we recall that SGD can be written as:

The most basic guarantee for SGD (see Theorem 6.1 here) gives that with step size of order , and assuming to simplify (see below for how to get around this) that is known up to a numerical constant so that if steps outside the ball of that radius then we project it back on it, one has (we also note that , and we denote ):

This means that roughly with SGD the computational cost for our main task corresponds vector operations. This is a fairly weak result compared to accelerated SVRG mainly because of: (i) dependency in instead of , and (ii) instead of . We will fix both issues, but let us also observe that on the positive side the term in accelerated SVRG is replaced by the less conservative quantity (that is one replaces the largest eigenvalue of by the average eigenvalue of ).

Let us see quickly check whether one can improve the performance of SGD by regularization. That is we consider applying SGD on with stochastic gradient which leads to (see Theorem 6.2 here):

Optimizing over we see that we get the same rate as without regularization. Thus, unsurprisingly, standard SGD fares poorly compared to the more sophisticated approach based on accelerated SVRG described above. However we will see next that there is a very simple fix to make SGD competitive with accelerated SVRG, namely just tune up the learning rate!

Constant step size SGD (after Bach and Moulines 2013)

SGD with a constant step size of order of a constant times satisfies

We see that this result gives a similar bias/variance decomposition as the one we saw for accelerated SVRG. In particular the variance term is minimax optimal, while the bias term matches what one would expect for a basic first order method (this term is not quite optimal as one would expect that a decrease in is possible since this is the optimal rate for first-order optimization of a smooth quadratic function -without strong convexity-, and indeed a recent paper of Dieuleveut, Flammarion and Bach tackles this issue).

In part II I hope to give the intuition/proof of the above result, and in part III I will discuss other aspects of the least squares problem (dual methods, random projections, sparsity).

This entry was posted in Machine learning, Optimization. Bookmark the permalink.