As promised at the end of Part I, I will give some intuition for the Bach and Moulines analysis of constant step size SGD for least-squares. Let us recall the setting. Let be a pair of random variables with where . Let be the convex quadratic function defined by
Observe that, with the notation and one has
We assume that (in particular this implies ) and . For instance these assumptions are respectively satisfied with almost surely, and independent of . Recall that constant step-size SGD on is defined by the following iteration (where is a sequence of i.i.d. copies of ), for ,
Our goal is to show that for one has (for a sequence we denote by its average )
Decomposition into a noise-free SGD and a pure noise SGD
With the notation , , and (with the convention ) one can write an iteration of SGD as
Interestingly, since , we have with ,
In other words the convergence analysis decomposes into two terms: which corresponds to a noise-free SGD (it is as if we were running SGD with ) and which corresponds to a pure noise SGD (it is as if we were running SGD with ).
The analysis of the noise-free SGD is the usual drill: let and recall that (recall that in the noise-free case one has ), thus
which directly yields and thus for one gets which finally yields (by convexity of )
Thus from now on we can restrict our attention to the pure noise case, and the goal is to show that in this case one has .
Full gradient and pure noise
The pure noise case is the difficult part in the proof (this was to be expected, indeed in the SGD analysis 101 one usually cancels the effect of the noise by taking a very small step size). For this case Bach and Moulines (following the footsteps of Polyak and Juditsky ) suggest to first study a “full gradient” method, that is to replace the term by in the SGD iteration. More precisely the “full gradient” analysis would amount to study the sequence defined by (recall that we are in the pure noise case and thus ) and
We expect that will manage to stay pretty close to the optimum since we are doing full gradient steps, and thus it should only suffer the inherent variance of the problem (recall that the statistical lower bound for this problem is ). It is actually easy to prove that this intuition is correct, first note that
and thus for (which implies )
which implies (the cross terms vanish since is centered, and also recall that )
Difference between full gradient and SGD
Let us now see how different is the sequence from our real target :
In words we see that the difference between the two sequences satisfy exactly the same recursion as SGD, except that the noise term is replaced by . This is pretty good news since we expect to be of order : indeed is a full gradient method starting from , so it should stay close to up to the noise term ; again this intuition is easy to make precise since with (X) one sees easily that . In other words while corresponds to SGD with noise (the covariance of this noise is bounded by ), we see that corresponds to SGD with noise (now the covariance of this noise is bounded by ). Thus we see that the new noise term is a factor smaller. In particular we expect to have
(this come from the usual decomposition of the “average regret” for SGD as distance to optimum over –which is here– plus times the average of the norm squared of the noise –which is .)
Not quite there but close
So far we have obtained for the pure noise case:
We fall short of what we were hoping to prove (which is ) because is not close enough to . Morally the above shows that can be viewed as a order approximation to , with a remainder term of order . What about continuing this expansion of as a polynomial in ? Let us see what the next term should look like. First we hope that this term will also satisfy (2) and thus it should correspond to a full gradient method. On the other hand the noise term in this full gradient method should absorb the noise term in (3) so that we can hope that the noise term in the new (3) will be smaller. This leads to:
and thus (1) rewrites as
In fact there is no reason to stop there, and for any one can define
which leads to
Denoting for a positive number such that the covariance matrix of (which is the noise term in the “full gradient with noise” corresponding to both and ) is bounded by , one gets with the same arguments as above
In particular to conclude it suffices to show that is geometrically decreasing. In fact an easy induction shows that one can take (we already did the base case ).
That’s it for today. In part III I will discuss how to use random projections to speed up computations for least squares in both the large sample regime and in the high-dimensional regime .