These are mainly notes for myself, but I figured that they might be of interest to some of the blog readers too. Comments on what is written below are most welcome!
Let $(X,Y) \in \mathbb{R}^d \times \mathbb{R}$ be a pair of random variables, and let $f : \mathbb{R}^d \rightarrow \mathbb{R}$ be the convex quadratic function defined by

$f(w) = \frac{1}{2} \mathbb{E} (X^{\top} w - Y)^2 .$

We are interested in finding a minimizer $w^*$ of $f$ (we will also be satisfied with $w$ such that $f(w) - f(w^*) \leq \epsilon$). Denoting $\Sigma = \mathbb{E}[X X^{\top}]$ for the covariance matrix of the features (we refer to $X$ as a feature vector) and $\mu = \mathbb{E}[Y X]$, one has $w^* = \Sigma^{-1} \mu$ (if $\Sigma$ is not invertible then the inverse should be understood as the Moore-Penrose pseudo-inverse, in which case $w^*$ is the smallest-norm element of the affine subspace of minimizers of $f$). The model of computation we are interested in corresponds to large-scale problems where one has access to an unlimited stream of i.i.d. copies of $(X,Y)$. Thus the only constraint to find $w^*$ is our computational ability to process samples. Furthermore we want to be able to deal with large-dimensional problems where $\Sigma$ can be extremely ill-conditioned (for instance the smallest eigenvalue can be $0$, or very close to $0$), and thus bounds that depend on the condition number of $\Sigma$ (ratio of largest eigenvalue to smallest eigenvalue) are irrelevant for us.
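To see where the closed form comes from, note that

$\nabla f(w) = \mathbb{E}\big[(X^{\top} w - Y) X\big] = \Sigma w - \mu ,$

so the first-order optimality condition $\nabla f(w^*) = 0$ indeed gives $w^* = \Sigma^{-1} \mu$ (with the pseudo-inverse in the degenerate case).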
Let us introduce a bit more notation before diving into the comparison of different algorithms. With the well-specified model in mind (i.e., $Y = X^{\top} w^* + \varepsilon$ where $\varepsilon$ is independent of $X$) we define the noise level $\sigma^2$ of the problem by assuming that $\mathbb{E}\big[(Y - X^{\top} w^*)^2\big] \leq \sigma^2$. For simplicity we will assume (whenever necessary) that almost surely one has $\|X\| \leq R$ and $|Y| \leq B$. All our iterative algorithms will start at $w_1 = 0$. We will also not worry too much about numerical constants and write $\lesssim$ for an inequality that holds up to a numerical constant. Finally, for the computational complexity we will count the number of elementary $d$-dimensional vector operations that one has to perform (e.g., adding/scaling such vectors, or requesting a new i.i.d. copy of $(X,Y)$).
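To fix ideas, here is a minimal Python sketch of this computational model on a toy well-specified stream (the dimension, noise level, and covariance below are arbitrary illustrative choices, and the helper names are mine); it uses the identity $f(w) - f(w^*) = \frac{1}{2} (w - w^*)^{\top} \Sigma (w - w^*)$ to measure progress.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy well-specified model: Y = <w*, X> + noise (all constants are arbitrary choices).
d, sigma = 20, 0.5
w_star = rng.normal(size=d) / np.sqrt(d)
Sigma = np.diag(1.0 / np.arange(1, d + 1))   # ill-conditioned feature covariance
Sigma_half = np.sqrt(Sigma)                  # matrix square root (Sigma is diagonal)

def sample():
    """Request one i.i.d. copy of (X, Y); this counts as one elementary vector operation."""
    x = Sigma_half @ rng.normal(size=d)      # so that E[X X^T] = Sigma
    y = x @ w_star + sigma * rng.normal()
    return x, y

def excess_risk(w):
    """f(w) - f(w*) = (1/2) (w - w*)^T Sigma (w - w*)."""
    delta = w - w_star
    return 0.5 * delta @ Sigma @ delta

print(excess_risk(np.zeros(d)))   # excess risk of the starting point w_1 = 0
```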
ERM
The most natural approach (also the most terrible, as it will turn out) is to take a large sample $(X_i, Y_i)_{1 \leq i \leq n}$ and then minimize the corresponding empirical version $\hat{f}(w) = \frac{1}{2n} \sum_{i=1}^n (X_i^{\top} w - Y_i)^2$ of $f$, or in other words compute

$\hat{w} = (\mathbb{X} \mathbb{X}^{\top})^{-1} \mathbb{X} \mathbb{Y} ,$

where $\mathbb{Y} \in \mathbb{R}^n$ is the column vector $(Y_1, \ldots, Y_n)^{\top}$ and $\mathbb{X} \in \mathbb{R}^{d \times n}$ is the matrix whose $i$-th column is $X_i$. Since $\hat{f}$ is a quadratic function, we can use the conjugate gradient algorithm (see for example Section 2.4 here) to get $\hat{w}$ in $d$ iterations. However each iteration (basically computing a gradient of $\hat{f}$) costs $n$ elementary vector operations, thus leading overall to $n d$ elementary vector operations (which, perhaps surprisingly, is also the cost of just forming the empirical covariance matrix $\hat{\Sigma} = \frac{1}{n} \mathbb{X} \mathbb{X}^{\top}$). What number $n$ of samples should one take? The answer (at least for the well-specified model) comes from a standard statistical analysis (e.g., with the Rademacher complexity described here; see also these nice lecture notes by Philippe Rigollet), which yields:

$\mathbb{E}\, \hat{f}(\hat{w}) - f(w^*) \lesssim \frac{\sigma^2 d}{n} .$
(Note that in the above bound we evaluate $\hat{w}$ on $\hat{f}$ instead of $f$; see the comments section for more details on this issue. Since this is not the main point of the post I overlook this issue in what follows.) Wrapping up, we see that this “standard” approach solves our objective by taking $n$ of order $\sigma^2 d / \epsilon$ samples and performing of order $\sigma^2 d^2 / \epsilon$ elementary vector operations. This quadratic scaling in $d$ of the number of elementary vector operations completely dooms this approach when $d$ is very large. We note however that the statistical bound above can be shown to be minimax optimal, that is any algorithm has to use at least $\sigma^2 d / \epsilon$ samples (in the worst case over all distributions of $(X,Y)$) in order to satisfy our objective. Thus the only hope to do better than the above approach is to somehow process the data more efficiently.
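Concretely, the ERM computation can be sketched as follows (a toy illustration with arbitrary sizes; for simplicity the empirical covariance is formed explicitly, which is exactly the $n d$ vector operations mentioned above):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sample (arbitrary sizes; rows of X are the feature vectors X_i).
d, n, sigma = 20, 2000, 0.5
w_star = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
Y = X @ w_star + sigma * rng.normal(size=n)

hat_Sigma = X.T @ X / n      # empirical covariance (costs ~ n d vector operations)
hat_mu = X.T @ Y / n

def conjugate_gradient(A, b, num_iters):
    """Minimize (1/2) w^T A w - b^T w; exact after at most d iterations for SPD A."""
    w = np.zeros_like(b)
    r = b - A @ w            # residual = minus the gradient
    p = r.copy()
    for _ in range(num_iters):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        w = w + alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < 1e-12:
            break
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return w

hat_w = conjugate_gradient(hat_Sigma, hat_mu, num_iters=d)
print(np.linalg.norm(hat_w - np.linalg.lstsq(X, Y, rcond=None)[0]))   # ~ 0
```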
A potential idea to trade a bit of the nice statistical properties of $\hat{w}$ for some computational easiness is to add a regularization term to $\hat{f}$ in order to make the problem better conditioned (recall that conjugate gradient reaches an $\epsilon$-optimal point in $\sqrt{\kappa} \log(1/\epsilon)$ iterations, where $\kappa$ is the condition number). In other words we would like to minimize, for some values of $n$ and $\lambda$ to be specified,

$\hat{f}(w) + \frac{\lambda}{2} \|w\|^2 .$

One can write the minimizer as $\hat{w}_{\lambda} = (\mathbb{X} \mathbb{X}^{\top} + \lambda n \mathrm{I}_d)^{-1} \mathbb{X} \mathbb{Y}$, and this point is now obtained at a cost of $\tilde{O}\big( n \sqrt{\|\Sigma\| / \lambda} \big)$ elementary vector operations ($\|\Sigma\|$ denotes the spectral norm of $\Sigma$, and we used the fact that with high probability $\|\hat{\Sigma}\| \lesssim \|\Sigma\|$). Here one can be even smarter and use accelerated SVRG instead of conjugate gradient (for SVRG see Section 6.3 here, and here or here for how to accelerate it), which will reduce the number of elementary vector operations to $\tilde{O}\big( n + \sqrt{n \|\Sigma\| / \lambda} \big)$. Also observe that again an easy argument based on Rademacher complexity leads to

$\mathbb{E}\, f(\hat{w}_{\lambda}) - f(w^*) \lesssim \frac{\sigma^2 d}{n} + \lambda \|w^*\|^2 .$

Thus we want to take $\lambda$ of order $\epsilon / \|w^*\|^2$, which in terms of $n$ is of order $\frac{\sigma^2 d}{n \|w^*\|^2}$, leading to an overall number of vector operations of

$\tilde{O}\left( \frac{\sigma^2 d}{\epsilon} + \frac{\sigma \|w^*\| \sqrt{d \|\Sigma\|}}{\epsilon} \right) .$

Note that, perhaps surprisingly, anything less smart than accelerated SVRG would lead to a worse dependency on $\epsilon$ than the one obtained above (which is also the same as the one obtained by the “standard” approach). Finally we observe that, at a high level, the two terms in the above bound can be viewed as a “variance” and a “bias” term. The variance term comes from the inherent statistical limitation, while the bias term comes from the approximate optimization. In particular it makes sense that $\|w^*\|$ appears in this bias term, as it is the initial distance to the optimum for the optimization procedure.
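To give a feel for the SVRG part of this strategy, here is a sketch of plain (non-accelerated) SVRG on the regularized empirical objective; the data, the value of $\lambda$, the step size, and the number of epochs are arbitrary toy choices of mine, and acceleration would require an extra outer loop that is not shown.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy sample and ridge parameter (arbitrary choices for illustration).
d, n, sigma, lam = 20, 2000, 0.5, 1e-3
w_star = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
Y = X @ w_star + sigma * rng.normal(size=n)

def grad_i(w, i):
    """Gradient of (1/2)(<w, X_i> - Y_i)^2 + (lam/2)|w|^2: one vector operation."""
    return (X[i] @ w - Y[i]) * X[i] + lam * w

def full_grad(w):
    """Gradient of the regularized empirical objective: ~n vector operations."""
    return X.T @ (X @ w - Y) / n + lam * w

eta = 1.0 / (np.max(np.sum(X ** 2, axis=1)) + lam)   # step size ~ 1/(smoothness constant)
w_anchor = np.zeros(d)
for epoch in range(30):
    mu_anchor = full_grad(w_anchor)                  # full gradient at the anchor point
    w = w_anchor.copy()
    for _ in range(2 * n):                           # inner loop: O(1) vector ops per step
        i = rng.integers(n)
        w -= eta * (grad_i(w, i) - grad_i(w_anchor, i) + mu_anchor)
    w_anchor = w

# Compare with the closed-form ridge solution (X X^T + lam n I)^{-1} X Y.
w_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ Y / n)
print(np.linalg.norm(w_anchor - w_ridge))            # should be small
```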
Standard SGD
Another approach altogether, which is probably the most natural approach for anyone trained in machine learning, is to run the stochastic gradient algorithm. Writing $g_t = (X_t^{\top} w_t - Y_t) X_t$ (this is a stochastic gradient, that is it satisfies $\mathbb{E}[g_t \,|\, w_t] = \nabla f(w_t)$), we recall that SGD can be written as:

$w_{t+1} = w_t - \eta_t\, g_t .$

The most basic guarantee for SGD (see Theorem 6.1 here) gives that with a step size $\eta_t$ of order $\frac{\|w^*\|}{R (\sigma + R \|w^*\|) \sqrt{t}}$, and assuming to simplify (see below for how to get around this) that $\|w^*\|$ is known up to a numerical constant so that if $w_t$ steps outside the ball of that radius then we project it back onto it, one has (we also note that under these assumptions $\mathbb{E} \|g_t\|^2 \lesssim R^2 (\sigma + R \|w^*\|)^2$, and we denote $\bar{w}_t = \frac{1}{t} \sum_{s=1}^{t} w_s$):

$\mathbb{E}\, f(\bar{w}_t) - f(w^*) \lesssim \frac{R \|w^*\| (\sigma + R \|w^*\|)}{\sqrt{t}} .$
This means that, roughly, with SGD the computational cost for our main task corresponds to $\frac{R^2 \|w^*\|^2 (\sigma + R \|w^*\|)^2}{\epsilon^2}$ vector operations. This is a fairly weak result compared to accelerated SVRG, mainly because of: (i) the dependency in $1/\epsilon^2$ instead of $1/\epsilon$, and (ii) $\|w^*\|^2$ instead of $\|w^*\|$. We will fix both issues, but let us also observe that on the positive side the term $d \|\Sigma\|$ in accelerated SVRG is replaced by the less conservative quantity $R^2$ (that is, one replaces the largest eigenvalue of $\Sigma$ by the average eigenvalue of $\Sigma$).
Let us quickly check whether one can improve the performance of SGD by regularization. That is, we consider applying SGD on $f(w) + \frac{\lambda}{2} \|w\|^2$ with stochastic gradient $g_t + \lambda w_t$, which leads to (see Theorem 6.2 here):

$\mathbb{E}\, f(\bar{w}_t) - f(w^*) \lesssim \lambda \|w^*\|^2 + \frac{R^2 (\sigma + R \|w^*\|)^2}{\lambda\, t} .$

Optimizing over $\lambda$ we see that we get the same rate as without regularization. Thus, unsurprisingly, standard SGD fares poorly compared to the more sophisticated approach based on accelerated SVRG described above. However we will see next that there is a very simple fix to make SGD competitive with accelerated SVRG, namely just tune up the learning rate!
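For concreteness, here is a sketch of this basic projected-and-averaged SGD on a toy stream similar to the earlier one (with identity feature covariance so that the excess risk is simply $\frac{1}{2} \|w - w^*\|^2$); the constant $G$ below is an ad hoc stand-in for the gradient-norm bound appearing in the theoretical step size.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy well-specified stream (illustrative constants; E[X X^T] = I here).
d, sigma = 20, 0.5
w_star = rng.normal(size=d) / np.sqrt(d)

def sample():
    x = rng.normal(size=d)
    y = x @ w_star + sigma * rng.normal()
    return x, y

B = np.linalg.norm(w_star)               # radius of the projection ball (|w^*| assumed known)
G = 2 * B * d + sigma * np.sqrt(d)       # rough stand-in for a bound on |g_t|
T = 200_000
w = np.zeros(d)
w_bar = np.zeros(d)
for t in range(1, T + 1):
    x, y = sample()
    g = (x @ w - y) * x                  # stochastic gradient: E[g | w] = grad f(w)
    w -= (B / (G * np.sqrt(t))) * g      # decreasing step size of order 1/sqrt(t)
    norm_w = np.linalg.norm(w)
    if norm_w > B:                       # project back onto the ball of radius |w^*|
        w *= B / norm_w
    w_bar += (w - w_bar) / t             # running average: the iterate with the guarantee

print(0.5 * np.sum((w_bar - w_star) ** 2))   # excess risk, since Sigma = I here
```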
Constant step size SGD (after Bach and Moulines 2013)
SGD with a constant step size equal to a (small enough) numerical constant times $1/R^2$, together with averaging of the iterates, satisfies

$\mathbb{E}\, f(\bar{w}_t) - f(w^*) \lesssim \frac{\sigma^2 d}{t} + \frac{R^2 \|w^*\|^2}{t} .$

We see that this result gives a similar bias/variance decomposition as the one we saw for accelerated SVRG. In particular the variance term $\frac{\sigma^2 d}{t}$ is minimax optimal, while the bias term $\frac{R^2 \|w^*\|^2}{t}$ matches what one would expect for a basic first-order method (this term is not quite optimal, as one would expect that a decrease in $1/t^2$ is possible since this is the optimal rate for first-order optimization of a smooth quadratic function -without strong convexity-, and indeed a recent paper of Dieuleveut, Flammarion and Bach tackles this issue).
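Here is the same identity-covariance toy stream with the constant step-size recipe (an illustrative sketch rather than the exact scheme of Bach and Moulines; since Gaussian features are not almost surely bounded, $R^2$ is replaced below by $\mathbb{E} \|X\|^2 = d$):

```python
import numpy as np

rng = np.random.default_rng(4)

# Same illustrative streaming model as before (E[X X^T] = I).
d, sigma = 20, 0.5
w_star = rng.normal(size=d) / np.sqrt(d)

def sample():
    x = rng.normal(size=d)
    y = x @ w_star + sigma * rng.normal()
    return x, y

gamma = 1.0 / (4 * d)        # constant step size ~ 1/R^2, with R^2 ~ E|X|^2 = d here
T = 200_000
w = np.zeros(d)
w_bar = np.zeros(d)
for t in range(1, T + 1):
    x, y = sample()
    w -= gamma * (x @ w - y) * x     # constant step size: no decay, no projection
    w_bar += (w - w_bar) / t         # the guarantee is for the averaged iterate

print(0.5 * np.sum((w_bar - w_star) ** 2))   # excess risk, since Sigma = I here
```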
In part II I hope to give the intuition/proof of the above result, and in part III I will discuss other aspects of the least squares problem (dual methods, random projections, sparsity).
By Csaba Szepesvari March 15, 2016 - 10:10 am
Nice writeup, as usual!
Just a little observation: Without regularization the expected loss can be infinite even if the data is bounded, so the first inequality does not hold as written, unless the squiggly less-than means something like that the inequality may or may not hold. I learned this from Problem 10.3 of [1] and we presented a rehash of this often overlooked fact as Example 5.3 in [2] (there is a regrettable typo in the proof of this example..)
[1] L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer, 2002.
[2] Huang, R. and Szepesvári, Cs., A Finite-Sample Generalization Bound for Semiparametric Regression: Partially Linear Models, AISTATS, pp. 402–410, 2014
By Sebastien Bubeck March 15, 2016 - 1:50 pm
Thanks Csaba for pointing this out, indeed this is a real issue. I have added a disclaimer in the main post (and in the bound you mentioned I now have \hat{f} instead of f, I believe that with this modification the inequality is true ;)).