As promised at the end of Part I, I will give some intuition for the Bach and Moulines analysis of constant step size SGD for least-squares. Let us recall the setting. Let $(X, Y) \in \mathbb{R}^d \times \mathbb{R}$ be a pair of random variables with

$$Y = X^{\top} w^* + \epsilon, \qquad \text{where } \mathbb{E}[\epsilon X] = 0.$$

Let $f : \mathbb{R}^d \rightarrow \mathbb{R}$ be the convex quadratic function defined by

$$f(w) = \frac{1}{2}\, \mathbb{E}\big(X^{\top} w - Y\big)^2.$$

Observe that, with the notation $H = \mathbb{E}\big[X X^{\top}\big]$ and $\|w\|_H = \sqrt{w^{\top} H w}$, one has

$$\nabla f(w) = H (w - w^*) \qquad \text{and} \qquad f(w) - f(w^*) = \frac{1}{2} \|w - w^*\|_H^2.$$

We assume that $\mathbb{E}\big[\|X\|^2 X X^{\top}\big] \preceq R^2 H$ (in particular this implies $H \preceq R^2 I$) and $\mathbb{E}\big[\epsilon^2 X X^{\top}\big] \preceq \sigma^2 H$. For instance these assumptions are respectively satisfied with $\|X\| \leq R$ almost surely, and $\epsilon$ independent of $X$ with $\mathbb{E}[\epsilon^2] \leq \sigma^2$. Recall that constant step-size SGD on $f$ is defined by the following iteration (where $(x_t, y_t)_{t \geq 1}$ is a sequence of i.i.d. copies of $(X, Y)$), for $t \geq 1$,

$$w_t = w_{t-1} - \eta \big(x_t^{\top} w_{t-1} - y_t\big) x_t.$$

Our goal is to show that for $\eta = \frac{1}{2 R^2}$ one has (for a sequence $(u_s)_{s \geq 0}$ we denote by $\bar{u}_t$ its average $\frac{1}{t} \sum_{s=0}^{t-1} u_s$)

$$\mathbb{E}\, f(\bar{w}_t) - f(w^*) \lesssim \frac{\sigma^2 d + R^2 \|w_0 - w^*\|^2}{t}.$$
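To make the setting concrete, here is a small simulation sketch (my own illustration, not part of the Bach and Moulines argument): it runs constant step-size SGD with iterate averaging on synthetic Gaussian data. The data distribution, the choice of $R^2$, and the printed reference value are assumptions I make for the demo.

```python
# Illustrative simulation (my own, not from the post): constant step-size SGD
# with iterate averaging on a synthetic least-squares problem, to see the
# O((sigma^2 d + R^2 ||w_0 - w*||^2) / t) behaviour of the averaged iterate.
import numpy as np

rng = np.random.default_rng(0)
d, T, sigma = 20, 50_000, 0.5
w_star = rng.normal(size=d)
H_diag = 1.0 / np.arange(1, d + 1)           # H = E[X X^T], diagonal here

# For Gaussian X with covariance H one has E[||X||^2 X X^T] = (Tr H) H + 2 H^2,
# so the moment assumption holds with R^2 = Tr(H) + 2 ||H||.
R2 = H_diag.sum() + 2 * H_diag.max()
eta = 1.0 / (2 * R2)

def excess_risk(w):                          # f(w) - f(w*) = 0.5 ||w - w*||_H^2
    return 0.5 * np.sum(H_diag * (w - w_star) ** 2)

w = np.zeros(d)                              # w_0 = 0
w_bar = np.zeros(d)                          # running average of the iterates
for t in range(1, T + 1):
    x = rng.normal(size=d) * np.sqrt(H_diag)
    y = x @ w_star + sigma * rng.normal()
    w = w - eta * (x @ w - y) * x            # constant step-size SGD step
    w_bar += (w - w_bar) / t                 # average of w_1, ..., w_t
    if t in (500, 5_000, 50_000):
        ref = (sigma**2 * d + R2 * np.sum(w_star**2)) / t
        print(f"t={t:6d}  excess risk={excess_risk(w_bar):.2e}  reference={ref:.2e}")
```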
Decomposition into a noise-free SGD and a pure noise SGD
With the notation $A_t = I - \eta\, x_t x_t^{\top}$, $\xi_t = \epsilon_t x_t$, and $A_{t:s} = A_t A_{t-1} \cdots A_s$ (with the convention $A_{t:t+1} = I$) one can write an iteration of SGD as

$$w_t - w^* = A_t (w_{t-1} - w^*) + \eta\, \xi_t, \qquad \text{and thus} \qquad w_t - w^* = A_{t:1} (w_0 - w^*) + \eta \sum_{s=1}^{t} A_{t:s+1}\, \xi_s.$$

Interestingly, since $f(w) - f(w^*) = \frac{1}{2} \|w - w^*\|_H^2$, we have with $w_t^{\mathrm{nf}} - w^* := A_{t:1}(w_0 - w^*)$ and $w_t^{\mathrm{pn}} - w^* := \eta \sum_{s=1}^{t} A_{t:s+1}\, \xi_s$,

$$\sqrt{\mathbb{E}\, f(\bar{w}_t) - f(w^*)} \;\leq\; \sqrt{\mathbb{E}\, f\big(\bar{w}_t^{\mathrm{nf}}\big) - f(w^*)} + \sqrt{\mathbb{E}\, f\big(\bar{w}_t^{\mathrm{pn}}\big) - f(w^*)}$$

(this is simply Minkowski's inequality for the norm $\sqrt{\mathbb{E}\|\cdot\|_H^2}$). In other words the convergence analysis decomposes into two terms: $\mathbb{E}\, f\big(\bar{w}_t^{\mathrm{nf}}\big) - f(w^*)$, which corresponds to a noise-free SGD (it is as if we were running SGD with $\epsilon = 0$), and $\mathbb{E}\, f\big(\bar{w}_t^{\mathrm{pn}}\big) - f(w^*)$, which corresponds to a pure noise SGD (it is as if we were running SGD with $w_0 = w^*$).
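Since the decomposition is an exact consequence of the linearity of the recursion, it can be checked numerically; the sketch below (my own, with arbitrary parameters) couples the three recursions on the same samples and verifies the identity to machine precision.

```python
# Quick numerical check (my own illustration) of the decomposition: run the
# three recursions with the SAME samples (x_t, eps_t); the SGD error splits
# exactly into a noise-free part and a pure-noise part, by linearity.
import numpy as np

rng = np.random.default_rng(1)
d, T, eta, sigma = 5, 1000, 0.05, 1.0
w_star = rng.normal(size=d)

w = np.zeros(d)            # real SGD, started at w_0 = 0
w_nf = np.zeros(d)         # noise-free SGD, started at w_0 = 0, eps = 0
w_pn = w_star.copy()       # pure-noise SGD, started at w^*
for _ in range(T):
    x = rng.normal(size=d)
    eps = sigma * rng.normal()
    y = x @ w_star + eps
    w    -= eta * (x @ w    - y) * x              # uses y = x.w* + eps
    w_nf -= eta * (x @ w_nf - x @ w_star) * x     # same x, but eps = 0
    w_pn -= eta * (x @ w_pn - y) * x              # same x and eps, started at w*

# (w - w*) = (w_nf - w*) + (w_pn - w*) holds up to floating point error:
print(np.max(np.abs((w - w_star) - ((w_nf - w_star) + (w_pn - w_star)))))
```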
Noise-free SGD
The analysis of the noise-free SGD is the usual drill: let $g_t := \big(x_t^{\top} w_{t-1} - y_t\big) x_t$ denote the stochastic gradient used at step $t$ and recall that $\mathbb{E}[g_t \,|\, w_{t-1}] = \nabla f(w_{t-1}) = H (w_{t-1} - w^*)$ (recall that in the noise-free case one has $y_t = x_t^{\top} w^*$, so that $g_t = x_t x_t^{\top} (w_{t-1} - w^*)$ and thus $\mathbb{E}\big[\|g_t\|^2 \,|\, w_{t-1}\big] \leq R^2 \|w_{t-1} - w^*\|_H^2$ by our assumption on $X$), thus

$$\mathbb{E} \|w_t - w^*\|^2 \leq \mathbb{E} \|w_{t-1} - w^*\|^2 - 2 \eta\, \mathbb{E} \|w_{t-1} - w^*\|_H^2 + \eta^2 R^2\, \mathbb{E} \|w_{t-1} - w^*\|_H^2,$$

which directly yields $2 \eta (2 - \eta R^2)\, \mathbb{E}\big[f(w_{t-1}) - f(w^*)\big] \leq \mathbb{E}\|w_{t-1} - w^*\|^2 - \mathbb{E}\|w_t - w^*\|^2$ and thus for $\eta \leq \frac{1}{R^2}$ one gets (by summing over the iterations)

$$\frac{1}{t} \sum_{s=0}^{t-1} \mathbb{E}\big[f(w_s) - f(w^*)\big] \leq \frac{\|w_0 - w^*\|^2}{2 \eta t},$$

which finally yields (by convexity of $f$)

$$\mathbb{E}\, f(\bar{w}_t) - f(w^*) \leq \frac{\|w_0 - w^*\|^2}{2 \eta t} = \frac{R^2 \|w_0 - w^*\|^2}{t} \quad \text{for } \eta = \frac{1}{2 R^2}.$$
Thus from now on we can restrict our attention to the pure noise case (that is, $w_0 = w^*$), and the goal is to show that in this case one has $\mathbb{E}\, f(\bar{w}_t) - f(w^*) \lesssim \frac{\sigma^2 d}{t}$.
Full gradient and pure noise
The pure noise case is the difficult part in the proof (this was to be expected, indeed in the SGD analysis 101 one usually cancels the effect of the noise by taking a very small step size). For this case Bach and Moulines (following the footsteps of Polyak and Juditsky [1992]) suggest to first study a “full gradient” method, that is to replace the term $x_t x_t^{\top}$ by $H = \mathbb{E}\big[X X^{\top}\big]$ in the SGD iteration. More precisely the “full gradient” analysis would amount to study the sequence $(\tilde{w}_t)$ defined by $\tilde{w}_0 = w^*$ (recall that we are in the pure noise case and thus $w_0 = w^*$) and

$$\tilde{w}_t - w^* = (I - \eta H)(\tilde{w}_{t-1} - w^*) + \eta\, \epsilon_t x_t.$$

We expect that $\tilde{w}_t$ will manage to stay pretty close to the optimum since we are doing full gradient steps, and thus it should only suffer the inherent variance of the problem (recall that the statistical lower bound for this problem is $\frac{\sigma^2 d}{t}$). It is actually easy to prove that this intuition is correct, first note that

$$\tilde{w}_t - w^* = \eta \sum_{s=1}^{t} (I - \eta H)^{t-s}\, \epsilon_s x_s, \qquad (1)$$

and thus for $\eta \leq \frac{1}{R^2}$ (which implies $0 \preceq \eta \sum_{u=s}^{t-1} (I - \eta H)^{u-s} \preceq H^{-1}$, assuming for simplicity that $H$ is invertible)

$$\bar{\tilde{w}}_t - w^* = \frac{1}{t} \sum_{s=1}^{t-1} \bigg( \eta \sum_{u=s}^{t-1} (I - \eta H)^{u-s} \bigg) \epsilon_s x_s,$$

which implies (the cross terms vanish since $\epsilon_s x_s$ is centered, and also recall that $\mathbb{E}\big[\epsilon^2 X X^{\top}\big] \preceq \sigma^2 H$)

$$\mathbb{E}\, f\big(\bar{\tilde{w}}_t\big) - f(w^*) = \frac{1}{2}\, \mathbb{E} \big\|\bar{\tilde{w}}_t - w^*\big\|_H^2 \leq \frac{\sigma^2 d}{2 t}. \qquad (2)$$
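As a sanity check of (2), here is a small Monte Carlo sketch (my own; the Gaussian data and all parameter values are assumptions made for the demo) estimating the excess risk of the averaged “full gradient with noise” iterate and comparing it to the $\sigma^2 d / (2t)$ rate.

```python
# A small Monte Carlo sanity check (my own, not from the post) for (2): the
# averaged "full gradient with noise" iterate should suffer roughly the
# statistical rate sigma^2 d / (2 t).
import numpy as np

rng = np.random.default_rng(2)
d, T, sigma, reps = 10, 2000, 1.0, 200
H_diag = 1.0 / np.arange(1, d + 1)      # H = E[X X^T], diagonal here
eta = 0.1                               # small enough that 0 <= I - eta*H <= I
w_star = np.zeros(d)                    # pure noise case: start at w*

vals = []
for _ in range(reps):
    tw = w_star.copy()
    tw_bar = np.zeros(d)
    for t in range(1, T + 1):
        x = rng.normal(size=d) * np.sqrt(H_diag)
        eps = sigma * rng.normal()
        tw = tw - eta * H_diag * (tw - w_star) + eta * eps * x  # full grad + noise
        tw_bar += (tw - tw_bar) / t
    vals.append(0.5 * np.sum(H_diag * (tw_bar - w_star) ** 2))  # f(bar) - f(w*)

print(np.mean(vals), sigma**2 * d / (2 * T))   # should be of the same order
```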
Difference between full gradient and SGD
Let us now see how different is the sequence $(\tilde{w}_t)$ from our real target $(w_t)$:

$$w_t - \tilde{w}_t = \big(I - \eta\, x_t x_t^{\top}\big)\big(w_{t-1} - \tilde{w}_{t-1}\big) + \eta\, \big(H - x_t x_t^{\top}\big)\big(\tilde{w}_{t-1} - w^*\big). \qquad (3)$$

In words we see that the difference between the two sequences satisfies exactly the same recursion as SGD, except that the noise term $\epsilon_t x_t$ is replaced by $\big(H - x_t x_t^{\top}\big)\big(\tilde{w}_{t-1} - w^*\big)$. This is pretty good news since we expect $\tilde{w}_{t-1} - w^*$ to be of order $\sqrt{\eta}\, \sigma$: indeed $\tilde{w}$ is a full gradient method starting from $w^*$, so it should stay close to $w^*$ up to the noise term $\eta\, \epsilon_t x_t$; again this intuition is easy to make precise since with (1) one sees easily that

$$\mathbb{E}\Big[\big(\tilde{w}_t - w^*\big)\big(\tilde{w}_t - w^*\big)^{\top}\Big] \preceq \eta\, \sigma^2 I.$$

In other words while $w_t$ corresponds to SGD with noise $\epsilon_t x_t$ (the covariance of this noise is bounded by $\sigma^2 H$), we see that $w_t - \tilde{w}_t$ corresponds to SGD with noise $\big(H - x_t x_t^{\top}\big)\big(\tilde{w}_{t-1} - w^*\big)$ (now the covariance of this noise is bounded by $\eta R^2\, \sigma^2 H$, using the above display together with $\mathbb{E}\big[\|X\|^2 X X^{\top}\big] \preceq R^2 H$). Thus we see that the new noise term is a factor $\eta R^2$ smaller. In particular we expect to have

$$\frac{1}{2}\, \mathbb{E}\big\|\bar{w}_t - \bar{\tilde{w}}_t\big\|_H^2 \lesssim \eta \cdot \eta R^2 \sigma^2\, \mathrm{Tr}(H)$$

(this comes from the usual decomposition of the “average regret” for SGD as distance to optimum over $\eta t$, which is $0$ here since $w_0 - \tilde{w}_0 = 0$, plus $\eta$ times the average of the norm squared of the noise, which is at most $\eta R^2 \sigma^2\, \mathrm{Tr}(H)$).
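The recursion (3) is again an exact algebraic identity, so it can also be checked numerically; the sketch below (my own illustration, with arbitrary parameters) runs SGD and the full gradient sequence on shared samples and verifies that their gap follows (3).

```python
# Exact identity check (my own illustration): the gap w_t - tilde_w_t between
# SGD and the "full gradient with noise" sequence follows recursion (3),
# driven by the new noise (H - x_t x_t^T)(tilde_w_{t-1} - w*).
import numpy as np

rng = np.random.default_rng(3)
d, T, eta, sigma = 5, 500, 0.05, 1.0
H = np.eye(d)                      # simplest choice: isotropic features
w_star = rng.normal(size=d)

w = w_star.copy()                  # pure noise case: both start at w*
tw = w_star.copy()
gap = np.zeros(d)                  # will follow recursion (3)
for _ in range(T):
    x = rng.normal(size=d)         # E[x x^T] = H = I here
    eps = sigma * rng.normal()
    new_noise = (H - np.outer(x, x)) @ (tw - w_star)
    gap = (np.eye(d) - eta * np.outer(x, x)) @ gap + eta * new_noise
    w -= eta * (x @ w - (x @ w_star + eps)) * x
    tw = tw - eta * H @ (tw - w_star) + eta * eps * x

print(np.max(np.abs((w - tw) - gap)))   # tiny (float error): the recursions agree
```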
Not quite there but close
So far we have obtained for the pure noise case:

$$\mathbb{E}\, f(\bar{w}_t) - f(w^*) \lesssim \frac{\sigma^2 d}{t} + \eta^2 R^2 \sigma^2\, \mathrm{Tr}(H).$$

We fall short of what we were hoping to prove (which is $\mathbb{E}\, f(\bar{w}_t) - f(w^*) \lesssim \frac{\sigma^2 d}{t}$) because $\tilde{w}_t$ is not close enough to $w_t$. Morally the above shows that $\tilde{w}_t$ can be viewed as a first order approximation to $w_t$, with a remainder term of order $\sqrt{\eta R^2}$ (relative to the leading term). What about continuing this expansion of $w_t$ as a polynomial in $\eta$? Let us see what the next term $\tilde{w}^{(2)}_t$ should look like (from now on we write $\tilde{w}^{(1)}_t$ for the sequence $\tilde{w}_t$ defined above). First we hope that this term will also satisfy (2) and thus it should correspond to a full gradient method. On the other hand the noise term in this full gradient method should absorb the noise term in (3) so that we can hope that the noise term in the new (3) will be smaller. This leads to: $\tilde{w}^{(2)}_0 = w^*$ and

$$\tilde{w}^{(2)}_t - w^* = (I - \eta H)\big(\tilde{w}^{(2)}_{t-1} - w^*\big) + \eta\, \big(H - x_t x_t^{\top}\big)\big(\tilde{w}^{(1)}_{t-1} - w^*\big), \qquad (4)$$

and thus (3) rewrites as

$$w_t - \tilde{w}^{(1)}_t - \big(\tilde{w}^{(2)}_t - w^*\big) = \big(I - \eta\, x_t x_t^{\top}\big)\Big(w_{t-1} - \tilde{w}^{(1)}_{t-1} - \big(\tilde{w}^{(2)}_{t-1} - w^*\big)\Big) + \eta\, \big(H - x_t x_t^{\top}\big)\big(\tilde{w}^{(2)}_{t-1} - w^*\big).$$

In fact there is no reason to stop there, and for any $k \geq 1$ one can define $\tilde{w}^{(k+1)}_0 = w^*$ and

$$\tilde{w}^{(k+1)}_t - w^* = (I - \eta H)\big(\tilde{w}^{(k+1)}_{t-1} - w^*\big) + \eta\, \big(H - x_t x_t^{\top}\big)\big(\tilde{w}^{(k)}_{t-1} - w^*\big),$$

which leads to

$$w_t - w^* - \sum_{j=1}^{k} \big(\tilde{w}^{(j)}_t - w^*\big) = \big(I - \eta\, x_t x_t^{\top}\big)\bigg(w_{t-1} - w^* - \sum_{j=1}^{k} \big(\tilde{w}^{(j)}_{t-1} - w^*\big)\bigg) + \eta\, \big(H - x_t x_t^{\top}\big)\big(\tilde{w}^{(k)}_{t-1} - w^*\big).$$

Denoting by $\alpha_k$ a positive number such that the covariance matrix of $\big(H - x_t x_t^{\top}\big)\big(\tilde{w}^{(k)}_{t-1} - w^*\big)$ (which is the noise term in the “full gradient with noise” corresponding to both $\tilde{w}^{(k+1)}_t$ and $w_t - w^* - \sum_{j=1}^{k} \big(\tilde{w}^{(j)}_t - w^*\big)$) is bounded by $\alpha_k\, \sigma^2 H$, one gets with the same arguments as above

$$\mathbb{E}\, f\big(\bar{\tilde{w}}^{(k+1)}_t\big) - f(w^*) \leq \alpha_k\, \frac{\sigma^2 d}{2 t} \qquad \text{and} \qquad \mathbb{E}\Big[\big(\tilde{w}^{(k+1)}_t - w^*\big)\big(\tilde{w}^{(k+1)}_t - w^*\big)^{\top}\Big] \preceq \alpha_k\, \eta\, \sigma^2 I,$$

and

$$\frac{1}{2}\, \mathbb{E}\bigg\| \bar{w}_t - w^* - \sum_{j=1}^{k} \big(\bar{\tilde{w}}^{(j)}_t - w^*\big) \bigg\|_H^2 \lesssim \alpha_k\, \eta^2 R^2 \sigma^2\, \mathrm{Tr}(H).$$

In particular to conclude it suffices to show that $\alpha_k$ is geometrically decreasing: summing over the levels the square roots of the first bound gives a convergent geometric series, and letting $k \to \infty$ in the last bound kills the remainder, so that $\mathbb{E}\, f(\bar{w}_t) - f(w^*) \lesssim \frac{\sigma^2 d}{t}$ in the pure noise case. In fact an easy induction shows that one can take $\alpha_k = (\eta R^2)^k$, which is indeed geometrically decreasing for $\eta = \frac{1}{2 R^2}$ (we already did the base case $k = 1$).
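To see the expansion in action, here is a sketch (my own, with arbitrary parameters chosen so that $\eta R^2 < 1$) that builds the sequences $\tilde{w}^{(1)}, \dots, \tilde{w}^{(K)}$ on shared samples and checks that the residual $w_t - w^* - \sum_{k \leq K} (\tilde{w}^{(k)}_t - w^*)$ shrinks as more levels are included.

```python
# Sketch (my own) of the recursive expansion: level k+1 is a "full gradient"
# sequence driven by the noise (H - x x^T)(tilde_w^{(k)} - w*), cf. (4). The
# residual w_t - w* - sum_k (tilde_w^{(k)}_t - w*) should shrink with the
# number of levels K, roughly geometrically when eta * R^2 < 1.
import numpy as np

rng = np.random.default_rng(4)
d, T, sigma, K = 5, 2000, 1.0, 6
H_diag = np.ones(d)                      # isotropic features, H = I
eta = 0.05                               # eta * R^2 < 1 for this choice
w_star = np.zeros(d)                     # pure noise case: start everything at w*

w = w_star.copy()
tw = [w_star.copy() for _ in range(K + 1)]   # tw[1..K] are the expansion levels
for _ in range(T):
    x = rng.normal(size=d)
    eps = sigma * rng.normal()
    prev = [tw_k - w_star for tw_k in tw]              # deviations at time t-1
    w -= eta * (x @ w - (x @ w_star + eps)) * x        # real SGD (pure noise)
    for k in range(1, K + 1):
        noise = eps * x if k == 1 else H_diag * prev[k - 1] - (x @ prev[k - 1]) * x
        tw[k] = w_star + (1 - eta * H_diag) * prev[k] + eta * noise

for K_used in range(1, K + 1):
    resid = w - w_star - sum(tw[k] - w_star for k in range(1, K_used + 1))
    print(K_used, np.linalg.norm(resid))
```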
That’s it for today. In part III I will discuss how to use random projections to speed up computations for least squares, in both the large sample regime and the high-dimensional regime.