Notes on least-squares, part II

As promised at the end of Part I, I will give some intuition for the Bach and Moulines analysis of constant step size SGD for least-squares. Let us recall the setting. Let (X, Y) \in \mathbb{R}^d \times \mathbb{R} be a pair of random variables with Y=X^{\top} w^* + \epsilon where \mathbb{E}[\epsilon X] = 0. Let f: \mathbb{R}^d \rightarrow \mathbb{R} be the convex quadratic function defined by

    \[f(w) = \frac12 \mathbb{E} (X^{\top} w - Y)^2 .\]

Observe that, with the notation \Sigma =\mathbb{E} X X^{\top} and \|x\|_{\Sigma} = \sqrt{x^{\top} \Sigma x} one has

    \[f(w) - f(w^*) = \frac12 \|w - w^*\|_{\Sigma}^2 .\]

We assume that \mathbb{E}[ |X|^2 X X^{\top} ] \preceq R^2 \Sigma (in particular this implies \mathbb{E} |X|^2 = \mathrm{Tr}(\Sigma) \leq R^2) and \mathbb{E}[ \epsilon^2 X X^{\top} ] \preceq \sigma^2 \Sigma. For instance these assumptions are respectively satisfied with |X| \leq R almost surely, and \epsilon \sim \mathcal{N}(0, \sigma^2) independent of X. Recall that constant step-size SGD on f is defined by the following iteration (where (x_t, y_t) is a sequence of i.i.d. copies of (X,Y)), for t \geq 0,

    \[w_{t+1} = w_{t} - \eta (x_t^{\top} w_{t} - y_t) x_t .\]

Our goal is to show that for \eta = \Theta(1/R^2) one has (for a sequence (u_t)_{t \in [n]} we denote by \bar{u}_n its average \frac{1}{n} \sum_{t=1}^n u_t)

    \[\mathbb{E} \|\bar{w}_n - w^*\|_{\Sigma}^2 \lesssim \frac{\sigma^2 d}{n} + \frac{R^2 |w_0 - w^*|^2}{n} .\]
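To make the setting concrete, here is a small numpy simulation of averaged constant step-size SGD on a synthetic Gaussian least-squares problem (my own illustrative script, not taken from Bach and Moulines; using \mathrm{Tr}(\Sigma) as a rough stand-in for R^2 and the step size 1/(2 R^2) are just convenient choices):

    import numpy as np

    # Averaged constant step-size SGD on least-squares with Gaussian design.
    rng = np.random.default_rng(0)
    d, n, sigma = 20, 10_000, 0.5
    w_star = rng.standard_normal(d)
    Sigma = np.diag(np.linspace(0.1, 1.0, d))   # covariance of X
    L = np.linalg.cholesky(Sigma)

    R2 = np.trace(Sigma)        # E|X|^2 = Tr(Sigma); rough stand-in for R^2
    eta = 1.0 / (2 * R2)        # constant step size of order 1/R^2

    w = np.zeros(d)             # starting point w_0 = 0
    w_bar = np.zeros(d)
    for t in range(1, n + 1):
        x = L @ rng.standard_normal(d)                  # x_t ~ N(0, Sigma)
        y = x @ w_star + sigma * rng.standard_normal()  # y_t = x_t' w* + eps_t
        w = w - eta * (x @ w - y) * x                   # SGD step
        w_bar += (w - w_bar) / t                        # running average of the iterates

    err = (w_bar - w_star) @ Sigma @ (w_bar - w_star)   # ||w_bar - w*||_Sigma^2
    rate = sigma**2 * d / n + R2 * (w_star @ w_star) / n
    print("error:", err, "  claimed rate (up to constants):", rate)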

Decomposition into a noise-free SGD and a pure noise SGD

With the notation b_t = y_t x_t, A_t=I_d - \eta x_t x_t^{\top}, and A_{i:j} = A_j \hdots A_i (with the convention A_{i:i-1} = I_d) one can write an iteration of SGD as

    \[w_{t+1} = A_{t} w_{t} + \eta b_{t} = A_{1:t} w_0 + \eta \sum_{k=1}^t A_{k+1:t} b_k .\]

Interestingly, since y_t = w^{* \top} x_t + \epsilon_t, we have with \xi_t = \epsilon_t x_t,

    \[w_{t+1}-w^* = A_{t} (w_{t}-w^*) + \eta \xi_{t} = A_{1:t} (w_0-w^*) + \eta \sum_{k=1}^t A_{k+1:t} \xi_k.\]

In other words the convergence analysis decomposes into two terms: A_{1:t} (w_0-w^*) which corresponds to a noise-free SGD (it is as if we were running SGD with \epsilon_t = 0) and \eta \sum_{k=1}^t A_{k+1:t} \xi_k which corresponds to a pure noise SGD (it is as if we were running SGD with w_0=w^*=0).
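If you like to sanity-check such unrollings numerically, here is a short script (again mine, purely illustrative) verifying that the product form agrees with the step-by-step recursion for the error:

    import numpy as np

    # Check: Delta_{t+1} = A_{1:t} Delta_init + eta * sum_{k=1}^t A_{k+1:t} xi_k,
    # with A_{i:j} = A_j ... A_i, matches the recursion Delta_{t+1} = A_t Delta_t + eta xi_t.
    rng = np.random.default_rng(1)
    d, T, eta, sigma = 5, 50, 0.05, 0.3
    xs = rng.standard_normal((T + 1, d))                       # x_1, ..., x_T (index 0 unused)
    xis = sigma * rng.standard_normal(T + 1)[:, None] * xs     # xi_t = eps_t x_t
    delta_init = rng.standard_normal(d)                        # initial error w_0 - w*

    # step-by-step recursion
    delta = delta_init.copy()
    for t in range(1, T + 1):
        A_t = np.eye(d) - eta * np.outer(xs[t], xs[t])
        delta = A_t @ delta + eta * xis[t]

    # unrolled form, building the partial products A_{k+1:T} from the right
    prod = np.eye(d)                                           # A_{T+1:T} = I_d by convention
    unrolled = np.zeros(d)
    for k in range(T, 0, -1):
        unrolled += eta * prod @ xis[k]
        prod = prod @ (np.eye(d) - eta * np.outer(xs[k], xs[k]))
    unrolled += prod @ delta_init                              # prod is now A_{1:T}
    print(np.allclose(delta, unrolled))                        # -> True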

Noise-free SGD

The analysis of the noise-free SGD is the usual drill: let \Delta_t = w_t - w^* and recall that \Delta_{t+1} = \Delta_t - \eta x_t x_t^{\top} \Delta_t (recall that in the noise-free case one has \epsilon_t=0), thus

    \[|\Delta_{t+1}|^2 = |\Delta_t|^2 - 2 \eta \Delta_t^{\top} x_t x_t^{\top} \Delta_t + \eta^2 \Delta_t^{\top} (x_t x_t^{\top})^2 \Delta_t ,\]

which directly yields (using \mathbb{E}[ |X|^2 X X^{\top} ] \preceq R^2 \Sigma for the last term) \mathbb{E} |\Delta_{t+1}|^2 \leq \mathbb{E}[ |\Delta_t|^2 - 2 \eta \|\Delta_t\|_{\Sigma}^2 + \eta^2 R^2 \|\Delta_t\|_{\Sigma}^2 ]. Thus for \eta \leq 1/R^2 (so that \eta^2 R^2 \leq \eta) one gets \eta \mathbb{E} \|\Delta_t\|_{\Sigma}^2 \leq \mathbb{E}[ |\Delta_t|^2 - |\Delta_{t+1}|^2 ], which finally yields (by convexity of x \mapsto \|x\|_{\Sigma}^2)

    \[\mathbb{E} \|\bar{w}_n - w^*\|_{\Sigma}^2 = \mathbb{E} \|\bar{\Delta}_n\|_{\Sigma}^2 \leq\frac{|\Delta_1|^2}{n \eta} .\]
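In more detail, the last display follows by summing the previous inequality over t (the right hand side telescopes) and applying Jensen's inequality to the average:

    \begin{align*} \eta \sum_{t=1}^n \mathbb{E} \|\Delta_t\|_{\Sigma}^2 \leq \sum_{t=1}^n \mathbb{E} \left[ |\Delta_t|^2 - |\Delta_{t+1}|^2 \right] \leq |\Delta_1|^2 , \qquad \mathbb{E} \|\bar{\Delta}_n\|_{\Sigma}^2 \leq \frac{1}{n} \sum_{t=1}^n \mathbb{E} \|\Delta_t\|_{\Sigma}^2 . \end{align*}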

Thus from now on we can restrict our attention to the pure noise case, and the goal is to show that in this case one has \mathbb{E} \|\bar{w}_n - w^*\|_{\Sigma}^2 \lesssim \frac{\sigma^2 d}{n}.

Full gradient and pure noise

The pure noise case is the difficult part of the proof (this was to be expected: in the SGD analysis 101 one usually cancels the effect of the noise by taking a very small step size). For this case Bach and Moulines (following in the footsteps of Polyak and Juditsky [1992]) suggest first studying a “full gradient” method, that is, replacing the term x_t x_t^{\top} by \Sigma in the SGD iteration. More precisely the “full gradient” analysis amounts to studying the sequence \Delta_t^{(1)} defined by \Delta_0^{(1)} = 0 (recall that we are in the pure noise case and thus w_0= w^*) and

    \[\Delta_{t+1}^{(1)} =(I_d - \eta \Sigma) \Delta_t^{(1)} + \eta \xi_t .\]

We expect that \Delta_t^{(1)} will manage to stay pretty close to the optimum since we are doing full gradient steps, and thus it should only suffer the inherent variance of the problem (recall that the statistical lower bound for this problem is \sigma^2 d / n). It is actually easy to prove that this intuition is correct: first note that

(1)   \begin{equation*} \Delta_{t}^{(1)} = \eta \sum_{s=1}^t (I_d - \eta \Sigma)^{t-s} \xi_s , \end{equation*}

and thus for \eta \leq 1/R^2 (which implies \eta \Sigma \preceq I_d)

    \begin{align*} \bar{\Delta}_n^{(1)} & = \frac{\eta}{n} \sum_{t=1}^n \sum_{s=1}^t (I_d - \eta \Sigma)^{t-s} \xi_s\\ & = \frac{\eta}{n} \sum_{s=1}^n \sum_{k=0}^{n-s} (I_d - \eta \Sigma)^{k} \xi_s \\ & = \frac{\eta}{n} \sum_{s=1}^n (\eta \Sigma)^{-1} (I_d - (I_d - \eta \Sigma)^{n-s+1}) \xi_s , \end{align*}

which implies (the cross terms vanish since the \xi_s are independent and centered, and also recall that \mathbb{E} \xi_s \xi_s^{\top} \preceq \sigma^2 \Sigma)

(2)   \begin{equation*} \mathbb{E} \|\bar{\Delta}_n^{(1)}\|_{\Sigma}^2 = \frac{1}{n^2} \sum_{s=1}^n \mathbb{E} \| \Sigma^{-1} (I_d - (I_d - \eta \Sigma)^{n-s+1}) \xi_s \|_{\Sigma}^2 \leq \frac{\sigma^2 d}{n} . \end{equation*}
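To see the last inequality in (2), write M_s = I_d - (I_d - \eta \Sigma)^{n-s+1}; this matrix commutes with \Sigma and satisfies 0 \preceq M_s \preceq I_d, hence

    \[\mathbb{E} \| \Sigma^{-1} M_s \xi_s \|_{\Sigma}^2 = \mathrm{Tr}\left( M_s \Sigma^{-1} M_s \, \mathbb{E} \xi_s \xi_s^{\top} \right) \leq \sigma^2 \mathrm{Tr}\left( M_s \Sigma^{-1} M_s \Sigma \right) = \sigma^2 \mathrm{Tr}( M_s^2 ) \leq \sigma^2 d .\]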

Difference between full gradient and SGD

Let us now see how different the sequence \Delta_t^{(1)} is from our real target \Delta_t:

(3)   \begin{align*} \Delta_{t+1} - \Delta_{t+1}^{(1)} & = (I_d - \eta x_t x_t^{\top}) \Delta_t - (I_d - \eta \Sigma) \Delta_t^{(1)} \notag \\ & = (I_d - \eta x_t x_t^{\top}) (\Delta_t - \Delta_t^{(1)}) + \eta (\Sigma - x_t x_t^{\top}) \Delta_t^{(1)} . \end{align*}

In words, the difference between the two sequences satisfies exactly the same recursion as SGD, except that the noise term \xi_t is replaced by (\Sigma - x_t x_t^{\top}) \Delta_t^{(1)}. This is pretty good news since we expect \Delta_t^{(1)} to remain small, with covariance of order \eta: indeed \Delta_t^{(1)} is a full gradient method starting from 0, so it should stay close to 0 up to the noise term \eta \xi_t; this intuition is easy to make precise since with (1) one sees that \mathbb{E}[ \Delta_t^{(1)} \Delta_t^{(1) \top} ] \preceq \eta \sigma^2 I_d. In other words, while \Delta_t corresponds to SGD with noise \xi_t (the covariance of this noise is bounded by \sigma^2 \Sigma), \Delta_t - \Delta_t^{(1)} corresponds to SGD with noise (\Sigma - x_t x_t^{\top}) \Delta_t^{(1)}, whose covariance is bounded by \eta R^2 \sigma^2 \Sigma (this bound is spelled out below). Thus the new noise term is a factor \eta R^2 smaller. In particular we expect to have

    \[\mathbb{E} \|\bar{\Delta}_n - \bar{\Delta}_n^{(1)}\|_{\Sigma}^2 \lesssim \eta \times \eta R^2 \times \sigma^2 R^2\]

(this comes from the usual decomposition of the “average regret” for SGD as the squared distance to the optimum divided by n \times \eta –which is 0 here– plus \eta times the average of the norm squared of the noise –which is \frac{\eta}{n} \sum_{t=1}^n \mathbb{E} |(\Sigma - x_t x_t^{\top}) \Delta_{t}^{(1)}|^2 \leq \eta \times \eta R^2 \times \sigma^2 R^2.)
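The covariance bound on the new noise term can be spelled out as follows; it only uses that x_t is independent of \Delta_t^{(1)}, the bound \mathbb{E}[ \Delta_t^{(1)} \Delta_t^{(1) \top} ] \preceq \eta \sigma^2 I_d, and the assumption \mathbb{E}[ |X|^2 X X^{\top} ] \preceq R^2 \Sigma:

    \begin{align*} \mathbb{E}\left[ (\Sigma - x_t x_t^{\top}) \Delta_t^{(1)} \Delta_t^{(1) \top} (\Sigma - x_t x_t^{\top}) \right] & \preceq \eta \sigma^2 \, \mathbb{E}\left[ (\Sigma - x_t x_t^{\top})^2 \right] \\ & = \eta \sigma^2 \left( \mathbb{E}[ |x_t|^2 x_t x_t^{\top} ] - \Sigma^2 \right) \preceq \eta \sigma^2 R^2 \Sigma . \end{align*}

Taking the trace of this bound also gives \mathbb{E} |(\Sigma - x_t x_t^{\top}) \Delta_t^{(1)}|^2 \leq \eta \sigma^2 R^2 \mathrm{Tr}(\Sigma) \leq \eta R^2 \times \sigma^2 R^2, which is the estimate used in the parenthesis above.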

Not quite there but close

So far we have obtained for the pure noise case (using \|a+b\|_{\Sigma}^2 \leq 2 \|a\|_{\Sigma}^2 + 2 \|b\|_{\Sigma}^2):

    \[\mathbb{E} \|\bar{\Delta}_n\|_{\Sigma}^2 \lesssim \mathbb{E} \|\bar{\Delta}_n^{(1)}\|_{\Sigma}^2 + \mathbb{E} \|\bar{\Delta}_n - \bar{\Delta}_n^{(1)}\|_{\Sigma}^2 \lesssim \frac{\sigma^2 d}{n} + \eta \times \eta R^2 \times \sigma^2 R^2 .\]

We fall short of what we were hoping to prove (which is \mathbb{E} \|\bar{\Delta}_n\|_{\Sigma}^2 \lesssim \frac{\sigma^2 d}{n}) because \Delta_t^{(1)} is not close enough to \Delta_t. Morally the above shows that \Delta_t^{(1)} can be viewed as a 1^{st} order approximation to \Delta_t, with a remainder term of order \eta R^2. What about continuing this expansion of \Delta_t as a polynomial in \eta R^2? Let us see what the next term \Delta_t^{(2)} should look like. First we hope that this term will also satisfy a bound of the same form as (2), and thus it should correspond to a full gradient method. On the other hand the noise term in this full gradient method should absorb the noise term in (3), so that the noise term in the recursion replacing (3) becomes smaller. This leads to:

(4)   \begin{equation*} \Delta_{t+1}^{(2)} = (I_d - \eta \Sigma) \Delta_t^{(2)} + \eta (\Sigma - x_t x_t^{\top}) \Delta_t^{(1)} , \end{equation*}

and thus (3) rewrites as

    \[\Delta_{t+1} - \Delta_{t+1}^{(1)} - \Delta_{t+1}^{(2)} = (I_d - \eta x_t x_t^{\top}) \left(\Delta_t - \Delta_t^{(1)} - \Delta_{t}^{(2)} \right) + \eta (\Sigma - x_t x_t^{\top}) \Delta_t^{(2)} .\]

In fact there is no reason to stop there, and for any k \geq 2 one can define

    \[\Delta_{t+1}^{(k)} = (I_d - \eta \Sigma) \Delta_t^{(k)} + \eta (\Sigma - x_t x_t^{\top}) \Delta_t^{(k-1)} ,\]

which leads to

    \[\Delta_{t+1} - \sum_{k=1}^r \Delta_{t+1}^{(k)} = (I_d - \eta x_t x_t^{\top}) \left(\Delta_t - \sum_{k=1}^r \Delta_t^{(k)} \right) + \eta (\Sigma - x_t x_t^{\top}) \Delta_t^{(r)} .\]

Denoting by \lambda_k a positive number such that the covariance matrix of (\Sigma - x_t x_t^{\top}) \Delta_t^{(k)} (which is the noise term in the “full gradient with noise” recursion defining \Delta_t^{(k+1)}, as well as in the recursion satisfied by \Delta_{t} - \sum_{s=1}^k \Delta_{t}^{(s)}) is bounded by \lambda_k \sigma^2 \Sigma (with the convention \lambda_0 = 1, corresponding to the original noise \xi_t), one gets with the same arguments as above

    \[\mathbb{E} \left\|\bar{\Delta}_{n} - \sum_{k=1}^r \bar{\Delta}_{n}^{(k)} \right\|_{\Sigma}^2 \lesssim \eta \times \lambda_r \times \sigma^2 R^2\]

and

    \[\mathbb{E} \|\bar{\Delta}_n^{(k)}\|_{\Sigma}^2 \lesssim \frac{\lambda_{k-1} \sigma^2 d}{n} .\]

In particular to conclude it suffices to show that \lambda_k decreases geometrically. In fact an easy induction shows that one can take \lambda_k = (\eta R^2)^k: we already did the base case k=1, and the induction step is the same computation (if the noise driving \Delta_t^{(k)} has covariance bounded by \lambda_{k-1} \sigma^2 \Sigma, then \mathbb{E}[ \Delta_t^{(k)} \Delta_t^{(k) \top} ] \preceq \eta \lambda_{k-1} \sigma^2 I_d, and thus the covariance of (\Sigma - x_t x_t^{\top}) \Delta_t^{(k)} is bounded by \eta R^2 \lambda_{k-1} \sigma^2 \Sigma). Summing the resulting bounds over k (via the triangle inequality in L^2) and letting r \to \infty finally gives \mathbb{E} \|\bar{\Delta}_n\|_{\Sigma}^2 \lesssim \frac{\sigma^2 d}{n} in the pure noise case, as desired.
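Finally, for readers who want to see the expansion in action, here is a small simulation of the pure noise case together with the first two terms \Delta^{(1)} and \Delta^{(2)} (my own script with arbitrary parameter choices): each successive averaged residual should typically shrink by roughly a factor \eta R^2 in squared \Sigma-norm.

    import numpy as np

    # Pure noise SGD (w_0 = w* = 0) together with the first two terms of the expansion.
    rng = np.random.default_rng(2)
    d, n, sigma = 10, 20_000, 1.0
    Sigma = np.diag(np.linspace(0.2, 1.0, d))
    L = np.linalg.cholesky(Sigma)
    R2 = np.trace(Sigma)
    eta = 1.0 / (10 * R2)                # so that eta * R^2 = 0.1
    I = np.eye(d)

    delta = np.zeros(d)                  # Delta_t : pure noise SGD error
    d1 = np.zeros(d)                     # Delta_t^(1) : full gradient + noise xi_t
    d2 = np.zeros(d)                     # Delta_t^(2) : full gradient + noise (Sigma - x x') Delta^(1)
    bars = np.zeros((3, d))              # running averages of (delta, d1, d2)
    for t in range(1, n + 1):
        x = L @ rng.standard_normal(d)
        xi = sigma * rng.standard_normal() * x
        delta, d1, d2 = (
            (I - eta * np.outer(x, x)) @ delta + eta * xi,
            (I - eta * Sigma) @ d1 + eta * xi,
            (I - eta * Sigma) @ d2 + eta * (Sigma @ d1 - x * (x @ d1)),
        )
        bars += (np.array([delta, d1, d2]) - bars) / t

    def sq_norm(v):                      # ||v||_Sigma^2
        return v @ Sigma @ v

    print("||bar Delta||^2                  :", sq_norm(bars[0]))
    print("||bar Delta - bar D1||^2         :", sq_norm(bars[0] - bars[1]))
    print("||bar Delta - bar D1 - bar D2||^2:", sq_norm(bars[0] - bars[1] - bars[2]))
    print("sigma^2 d / n                    :", sigma**2 * d / n)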

 

That’s it for today. In part III I will discuss how to use random projections to speed up computations for least squares in both the large sample regime n \gg d and in the high-dimensional regime d \gg n.
