Geometry of linearized neural networks

This week we had the pleasure to host Tengyu Ma from Princeton University who told us about the recent progress he has made with co-authors to understand various linearized versions of neural networks. I will describe here two such results, one for Residual Neural Networks and one for Recurrent Neural Networks.

Some properties to look for in non-convex optimization

We will say that a function f admits first order optimality (respectively second order optimality) if all critical points (respectively all local minima) of f are global minima (of course first order optimality implies second order optimality for smooth functions). In particular with first order optimality one has that gradient descent converges to the global minimum, and with second order optimality this is also true provided that one avoids saddle points. To obtain rates of convergence it can be useful to make more quantitative statements. For example we say that f is \alpha-Polyak if

    \[\|\nabla f(x)\|^2 \geq \alpha (f(x) - f^*) .\]

Clearly \alpha-Polyak implies first order optimality, but more importantly it also implies a linear convergence rate for gradient descent on f. A variant of this condition is \alpha-weak-quasi-convexity:

    \[\langle \nabla f(x), x-x^*\rangle \geq \alpha (f(x) - f^*) ,\]

in which case gradient descent converges at the slow non-smooth rate 1/\sqrt{t} (and in this case it is also robust to noise, i.e. one can write a stochastic gradient descent version). The proofs of these statements just mimic the usual convex proofs. For more on these conditions see for instance this paper.
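
To make the Polyak condition concrete, here is a minimal numpy sketch (my own toy example, not from the post): least squares with a wide matrix is convex but not strongly convex, yet it satisfies the Polyak inequality (with \alpha equal to twice the smallest non-zero eigenvalue of A^{\top}A), and gradient descent on it indeed converges linearly in function value.

    import numpy as np

    # Toy illustration of the Polyak condition: f(x) = 0.5 * ||Ax - b||^2 with a
    # wide A has many global minimizers (so it is not strongly convex), but it
    # satisfies ||grad f(x)||^2 >= alpha * (f(x) - f*) with f* = 0, hence gradient
    # descent converges linearly in function value.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 20))          # wide matrix: b is in the range of A
    b = rng.standard_normal(5)

    f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
    grad = lambda x: A.T @ (A @ x - b)

    L = np.linalg.eigvalsh(A.T @ A).max()     # smoothness constant of f
    x = np.zeros(20)
    for t in range(201):
        if t % 50 == 0:
            print(t, f(x))                    # decreases geometrically towards f* = 0
        x = x - grad(x) / L                   # gradient step with step size 1/L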

Linearized Residual Networks

Recall that a neural network is just a map x \in \mathbb{R}^n \mapsto \sigma \circ A_L \circ \sigma \circ A_{L-1} \circ \hdots \circ \sigma \circ A_0 (x) where A_0,\hdots, A_{L} are linear maps (i.e. they are the matrices parametrizing the neural network) and \sigma is some non-linear map (the most popular one, ReLU, is just the coordinate-wise positive part). Alternatively you can think of a neural network as a sequence of hidden states h_0, h_1,\hdots,h_{L+1} where h_0=x and h_{t+1} = \sigma(A_t h_{t}). In 2015 a team of researchers at MSR Asia introduced the concept of a residual neural network, where the hidden states are updated as before for t even, but for t odd one sets h_{t+1} = h_{t-1} + \sigma(A_t h_t). Apparently this trick allowed them to train much deeper networks, though it is not clear why this would help from a theoretical point of view (the intuition is that at least when the network is initialized with all matrices equal to 0 it still does something non-trivial, namely it computes the identity).
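
As a toy illustration of the last remark (my own choice of dimensions, with \sigma = ReLU and all matrices initialized to zero): the plain network then outputs zero, while in the residual network each even/odd pair of layers computes the identity, so the output is just x.

    import numpy as np

    # At the all-zeros initialization, the plain update h_{t+1} = sigma(A_t h_t)
    # collapses everything to zero, while the residual update for odd t,
    # h_{t+1} = h_{t-1} + sigma(A_t h_t), makes each even/odd pair act as the identity.
    relu = lambda v: np.maximum(v, 0.0)
    rng = np.random.default_rng(0)
    n, L = 4, 5                                    # L + 1 = 6 layers, i.e. 3 residual pairs
    x = rng.standard_normal(n)
    As = [np.zeros((n, n)) for _ in range(L + 1)]  # all-zeros initialization

    h_plain, h_res, prev = x, x, x
    for t, A in enumerate(As):
        h_plain = relu(A @ h_plain)                # plain update at every layer
        if t % 2 == 0:                             # t even: plain update
            prev, h_res = h_res, relu(A @ h_res)
        else:                                      # t odd: identity connection to h_{t-1}
            prev, h_res = h_res, prev + relu(A @ h_res)

    print(h_plain)  # the zero vector
    print(h_res)    # equals x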

In their most recent paper Moritz Hardt and Tengyu Ma try to explain why adding this “identity connection” could be a good idea from a geometric point of view. They consider an (extremely) simplified model where there is no non-linearity, i.e. \sigma is the identity map. A neural network is then just a product of matrices. In particular the landscape we are looking at for least-squares with such a model is of the form:

    \[g(A_0,\hdots, A_L) = \mathbb{E}_{(x,y) \sim \nu} \|\prod_{i=0}^L A_i x - y\|^2 ,\]

which is of course a non-convex function (just think of the function (a,b) \mapsto (ab-1)^2 and observe that on the line a=b it gives the non-convex function (a^2-1)^2). However it actually satisfies second order optimality:

Proposition [Kawaguchi 2016]

Assume that x has a full rank covariance matrix and that y=Rx for some deterministic matrix R. Then all local minima of g are global minima.

I won’t give the proof of this result as it requires taking the second derivative of g, which is a bit annoying (I will give below the first derivative computation). Now in this linearized setting the residual network version (where the identity connection is added at every layer) simply corresponds to a reparametrization around the identity; in other words we now consider the following function:

    \[f(A_0,\hdots, A_L) = \mathbb{E}_{(x,y) \sim \nu} \|\prod_{i=0}^L (A_i+\mathrm{I}) x - y\|^2 .\]

Proposition [Hardt and Ma 2016]

Assume that x has a full rank covariance matrix and that y= Rx for some deterministic matrix R. Then f has first order optimality on the set \{(A_0, \hdots, A_L) : \|A_i\| <1\}.

Thus adding the identity connection makes the objective function better behaved around the starting point where all the matrices are zero (in the sense that gradient descent doesn’t have to worry about avoiding saddle points). The proof is just a few lines of standard calculations with derivatives of functions of matrix-valued inputs.
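
To see the claim in action (my own toy experiment, not from the paper), one can run gradient descent on f from the all-zeros initialization on a small random instance, with numerically estimated gradients, and watch the objective reach its global minimum value 0; the dimensions, the choice \Sigma = \mathrm{I}, and the step size below are arbitrary.

    import numpy as np

    # Gradient descent on f(A_0, ..., A_L) = ||R - prod_i (A_i + I)||_F^2 (i.e. the
    # objective above with x isotropic, Sigma = I), started from all-zeros matrices.
    rng = np.random.default_rng(0)
    n, L = 3, 2                                    # three 3x3 layers
    R = np.eye(n) + 0.1 * rng.standard_normal((n, n))

    def f(As):
        P = np.eye(n)
        for A in As:
            P = P @ (A + np.eye(n))
        return np.sum((R - P) ** 2)

    def num_grad(As, i, eps=1e-6):                 # finite-difference gradient in A_i
        G = np.zeros((n, n))
        for r in range(n):
            for c in range(n):
                Ap = [A.copy() for A in As]
                Ap[i][r, c] += eps
                G[r, c] = (f(Ap) - f(As)) / eps
        return G

    As = [np.zeros((n, n)) for _ in range(L + 1)]  # all-zeros starting point
    for step in range(301):
        if step % 100 == 0:
            print(step, f(As))                     # decreases towards 0
        grads = [num_grad(As, i) for i in range(L + 1)]
        As = [A - 0.1 * G for A, G in zip(As, grads)]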

Proof: One has, with E = R - \prod_{i=0}^L (A_i+\mathrm{I}) and \Sigma = \mathbb{E}\, x x^{\top},

    \[f = \mathbb{E}\, (E x)^{\top} E x = \mathrm{Tr}(E \Sigma E^{\top}) =: \|E\|_{\Sigma}^2 ,\]

so with E_{<i} = \prod_{j <i} (A_j + \mathrm{I}) and E_{>i}=\prod_{j >i} (A_j + \mathrm{I}),

    \begin{eqnarray*} f(A_0,\hdots, A_i + V, \hdots, A_L) & = & \|R - \prod_{j <i} (A_j + \mathrm{I}) \times (A_i + V +\mathrm{I}) \times \prod_{j >i} (A_j + \mathrm{I})\|_{\Sigma}^2 \\ & = & \|E - E_{<i} V E_{>i}\|_{\Sigma}^2 \\ & = & \|E\|_{\Sigma}^2 - 2 \langle E \Sigma, E_{<i} V E_{>i} \rangle + \|E_{<i} V E_{>i}\|_{\Sigma}^2 , \end{eqnarray*}

which exactly means that the derivative of f with respect to A_i is equal to -2 E_{<i}^{\top} E \Sigma E_{>i}^{\top}. On the set under consideration one has that E_{<i} and E_{>i} are invertible (and so is \Sigma by assumption), and thus if this derivative is equal to 0 it must be that E=0 and thus f=0 (which is the global minimum).
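
Here is a quick finite-difference sanity check of this derivative formula (my own, on a small random instance with arbitrary dimensions):

    import numpy as np

    # Check that the directional derivative of f at (A_0, ..., A_L) in direction V
    # (in coordinate i) matches <grad, V> with grad = -2 E_{<i}^T E Sigma E_{>i}^T.
    rng = np.random.default_rng(0)
    n, L, i = 3, 4, 2
    M = rng.standard_normal((n, n))
    Sigma = M @ M.T                                          # covariance of x
    R = rng.standard_normal((n, n))
    As = [0.1 * rng.standard_normal((n, n)) for _ in range(L + 1)]
    I = np.eye(n)

    def prod(mats):
        out = I
        for P in mats:
            out = out @ P
        return out

    def f(As):
        E = R - prod([A + I for A in As])
        return np.trace(E @ Sigma @ E.T)                     # f = ||E||_Sigma^2

    E = R - prod([A + I for A in As])
    E_lt = prod([As[j] + I for j in range(i)])               # E_{<i}
    E_gt = prod([As[j] + I for j in range(i + 1, L + 1)])    # E_{>i}
    grad = -2 * E_lt.T @ E @ Sigma @ E_gt.T

    eps, V = 1e-6, rng.standard_normal((n, n))
    As_pert = [A.copy() for A in As]
    As_pert[i] = As[i] + eps * V
    print((f(As_pert) - f(As)) / eps, np.sum(grad * V))      # the two numbers should match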

Linearized recurrent neural networks

The simplest version of a recurrent neural network is as follows. It is a mapping of the form (x_1,\hdots,x_T) \mapsto (y_1,\hdots,y_T) (we are thinking of doing sequence to sequence prediction). In these networks the hidden state is updated as h_{t+1} = \sigma_1(A h_{t} + B x_{t}) (with h_1=0) and the output is y_t = \sigma_2(C h_t + D x_t). I will now describe a paper by Hardt, Ma and Recht (see also this blog post) that tries to understand the geometry of least-squares for this problem in the linearized version where \sigma_1 = \sigma_2 = \mathrm{Id}. That is we are looking at the function:

    \[f(\hat{A}, \hat{B}, \hat{C}, \hat{D})=\mathbb{E}_{(x_t)} \frac{1}{T} \sum_{t=1}^T \|\hat{y}_t - y_t\|^2 ,\]

where (y_t) is obtained from (x_t) via some unknown recurrent neural network with parameters A,B,C,D. First observe that by induction one can easily see that h_{t+1} = \sum_{k=1}^t A^{t-k} B x_k and y_t = D x_t + \sum_{k=1}^{t-1} C A^{t-1-k} B x_k. In particular, assuming that (x_t) is an i.i.d. isotropic sequence one obtains

    \[\mathbb{E} \|y_t - \hat{y}_t\|_2^2 = \|D-\hat{D}\|_F^2 + \sum_{k=1}^{t-1} \|\hat{C} \hat{A}^{t-1-k} \hat{B} - C A^{t-1-k} B\|_F^2 ,\]

and thus

    \[f(\hat{A}, \hat{B}, \hat{C}, \hat{D})=\|D-\hat{D}\|_F^2 + \sum_{k=1}^{T-1} (1- \frac{k}{T}) \|\hat{C} \hat{A}^{k-1} \hat{B} - C A^{k-1} B\|_F^2 .\]
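
As a quick sanity check of the first step of this derivation, here is a small sketch (dimensions are my own choices) comparing the unrolled expression y_t = D x_t + \sum_{k=1}^{t-1} C A^{t-1-k} B x_k with the output of the recursion:

    import numpy as np

    # Compare the recursion h_{t+1} = A h_t + B x_t, y_t = C h_t + D x_t (h_1 = 0)
    # with the unrolled formula y_t = D x_t + sum_{k=1}^{t-1} C A^{t-1-k} B x_k.
    rng = np.random.default_rng(0)
    n_h, n_in, n_out, T = 3, 2, 2, 6
    A = 0.5 * rng.standard_normal((n_h, n_h))
    B = rng.standard_normal((n_h, n_in))
    C = rng.standard_normal((n_out, n_h))
    D = rng.standard_normal((n_out, n_in))
    xs = [None] + [rng.standard_normal(n_in) for _ in range(T)]   # xs[t] = x_t, 1-indexed

    h, ys_rec = np.zeros(n_h), []                                 # h_1 = 0
    for t in range(1, T + 1):
        ys_rec.append(C @ h + D @ xs[t])
        h = A @ h + B @ xs[t]

    ys_closed = [D @ xs[t] + sum((C @ np.linalg.matrix_power(A, t - 1 - k) @ B @ xs[k]
                                  for k in range(1, t)), np.zeros(n_out))
                 for t in range(1, T + 1)]

    print(max(np.linalg.norm(u - v) for u, v in zip(ys_rec, ys_closed)))  # ~0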

In particular we see that the effect of \hat{D} is decoupled from the other variables and that it appears through a convex function, thus we will just ignore it. Next we make the natural assumption that the spectral radius of A is less than 1 (for otherwise the influence of the initial input x_1 grows over time, which doesn’t seem natural) and thus, up to a small error term (for large T), one can consider the idealized risk:

    \[(\hat{A},\hat{B}, \hat{C}) \mapsto \sum_{k=0}^{+\infty} \|\hat{C} \hat{A}^{k} \hat{B} - C A^{k} B\|_F^2 .\]

The next idea is a cute one which makes the above expression more tractable. Consider the series r_k = C A^k B and its Fourier transform:

    \[G(\theta) = \sum_{k=0}^{+\infty} r_k \exp(i k \theta) = C (\sum_{k=0}^{+\infty} (\exp(i \theta) A)^k) B = C(\mathrm{I} - \exp(i \theta) A)^{-1} B .\]
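
A quick numerical check of this geometric-series identity (my own sketch: a random A rescaled to have spectral radius 0.9, the series truncated at a large K):

    import numpy as np

    # Check that sum_k C A^k B exp(i k theta) equals C (I - exp(i theta) A)^{-1} B
    # when the spectral radius of A is below 1 (truncating the series at K terms).
    rng = np.random.default_rng(0)
    n = 4
    A = rng.standard_normal((n, n))
    A *= 0.9 / max(abs(np.linalg.eigvals(A)))     # rescale: spectral radius = 0.9
    B = rng.standard_normal((n, 1))
    C = rng.standard_normal((1, n))

    theta, K = 0.7, 2000
    z = np.exp(1j * theta)
    series = np.zeros((1, 1), dtype=complex)
    M = np.eye(n, dtype=complex)                  # M = (z A)^k, starting at k = 0
    for _ in range(K):
        series += C @ M @ B
        M = (z * A) @ M

    closed = C @ np.linalg.inv(np.eye(n) - z * A) @ B
    print(abs((series - closed).item()))          # ~0 up to truncation / rounding error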

By Parseval’s theorem the idealized risk is equal, up to a normalization constant, to the squared L_2 distance between G and \hat{G} (i.e. \int_{[-\pi, \pi]} \|G(\theta)-\hat{G}(\theta)\|_F^2 d\theta). We will now show that, under appropriate further assumptions, for any \theta the quantity \|G(\theta) - \hat{G}(\theta)\|_F^2 is weakly-quasi-convex in (\hat{A},\hat{B},\hat{C}) (in particular this shows that the idealized risk is weakly-quasi-convex). The big assumption that Hardt, Ma and Recht make is that the system is a “single-input single-output” model, that is both x_t and y_t are scalar. In this case it turns out that control theory provides a “controllable canonical form” where B=(0,\hdots,0,1), C=(c_1,\hdots,c_n) and A has zeros everywhere except on the upper diagonal where it has ones and on the last row where it has a_n,\hdots, a_1 (I don’t know the proof of this result, if some reader has a pointer for a simple proof please share in the comments!). Note that with this form the system is simple to interpret as one has Ah = (h(2),\hdots, h(n), \langle a,h\rangle) and Ah+Bx =(h(2),\hdots, h(n), \langle a,h\rangle+x). Now with just a few lines of algebra:

    \[G(\theta) = \frac{c_1 + \hdots + c_{n} z^{n-1}}{z^n + a_1 z^{n-1} + \hdots + a_n}, \ \text{where} \ z=\exp(i \theta) .\]

Thus we just have to check the weak-quasi-convexity of

    \[(\hat{a}, \hat{c}) \mapsto |\frac{\hat{c}_1 + \hdots + \hat{c}_{n} z^{n-1}}{z^n + \hat{a}_1 z^{n-1} + \hdots + \hat{a}_n} - \frac{u}{v}|^2 .\]

Weak-quasi-convexity is preserved under composition with affine maps (and the map (\hat{a}, \hat{c}) \mapsto (\hat{u}, \hat{v}) below is affine), so we just need to understand the map

    \[(\hat{u},\hat{v}) \in \mathbb{C} \times \mathbb{C} \mapsto |\frac{\hat{u}}{\hat{v}} - \frac{u}{v}|^2 ,\]

which is weakly-quasi-convex provided that \hat{v} has a positive inner product with v (viewing complex numbers as vectors in \mathbb{R}^2). In particular we just proved the following:

Theorem [Hardt, Ma, Recht 2016]

Let C(a) := \{z^n + a_1 z^{n-1} + \hdots + a_n , z \in \mathbb{C}, |z|=1\} and assume there is some cone \mathcal{C} \subset \mathbb{C} (viewed as \mathbb{R}^2) of angle less than \pi/2-\alpha such that C(a) \subset \mathcal{C}. Then the idealized risk is \alpha-weakly-quasi-convex on the set of \hat{a} such that C(\hat{a}) \subset \mathcal{C}.

(In the paper they specifically pick the cone \mathcal{C} where the imaginary part is larger than the real part.) This theorem naturally suggests that by overparametrizing the network (i.e. adding dimensions to a and c) one could get a nicer landscape (indeed in this case the above condition can be easier to satisfy), see the paper for more details!
