Geometry of linearized neural networks

This week we had the pleasure to host Tengyu Ma from Princeton University who told us about the recent progress he has made with co-authors to understand various linearized versions of neural networks. I will describe here two such results, one for Residual Neural Networks and one for Recurrent Neural Networks.

Some properties to look for in non-convex optimization

We will say that a function f admits first order optimality (respectively second order optimality) if all critical points (respectively all local minima) of f are global minima (of course first order optimality implies second order optimality for smooth functions). In particular with first order optimality one has that gradient descent converges to the global minimum, and with second order optimality this is also true provided that one avoids saddle points. To obtain rates of convergence it can be useful to make more quantitative statements. For example we say that f is \alpha-Polyak if

    \[\|\nabla f(x)\|^2 \geq \alpha (f(x) - f^*) .\]

Clearly \alpha-Polyak implies first order optimality, but more importantly it also implies a linear convergence rate for gradient descent on f. A variant of this condition is \alpha-weak-quasi-convexity:

    \[\langle \nabla f(x), x-x^*\rangle \geq \alpha (f(x) - f^*) ,\]

in which case gradient descent converges at the slow non-smooth rate 1/\sqrt{t} (and in this case it is also robust to noise, i.e. one can write a stochastic gradient descent version). The proofs of these statements just mimic the usual convex proofs. For more on these conditions see for instance this paper.
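
To make the Polyak condition concrete, here is a minimal numpy sketch (my own toy example, not from the post): least squares with a wide matrix is convex but not strongly convex, yet it satisfies the Polyak inequality (with \alpha equal to twice the smallest non-zero eigenvalue of A^{\top}A), and gradient descent on it indeed converges linearly in function value.

    import numpy as np

    # Toy illustration of the Polyak condition: f(x) = 0.5 * ||Ax - b||^2 with a
    # wide A has many global minimizers (so it is not strongly convex), but it
    # satisfies ||grad f(x)||^2 >= alpha * (f(x) - f*) with f* = 0, hence gradient
    # descent converges linearly in function value.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 20))          # wide matrix: b is in the range of A
    b = rng.standard_normal(5)

    f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
    grad = lambda x: A.T @ (A @ x - b)

    L = np.linalg.eigvalsh(A.T @ A).max()     # smoothness constant of f
    x = np.zeros(20)
    for t in range(201):
        if t % 50 == 0:
            print(t, f(x))                    # decreases geometrically towards f* = 0
        x = x - grad(x) / L                   # gradient step with step size 1/L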

Linearized Residual Networks

Recall that a neural network is just a map x \in \mathbb{R}^n \mapsto \sigma \circ A_L \circ \sigma \circ A_{L-1} \circ \hdots \circ \sigma \circ A_0 (x) where A_0,\hdots, A_{L} are linear maps (i.e. they are the matrices parametrizing the neural network) and \sigma is some non-linear map (the most popular one, ReLU, is just the coordinate-wise positive part). Alternatively you can think of a neural network as a sequence of hidden states h_0, h_1,\hdots,h_{L+1} where h_0=x and h_{t+1} = \sigma(A_t h_{t}). In 2015 a team of researchers at MSR Asia introduced the concept of a residual neural network, where the hidden states are updated as before for t even, but for t odd one sets h_{t+1} = h_{t-1} + \sigma(A_t h_t). Apparently this trick allowed them to train much deeper networks, though it is not clear why this would help from a theoretical point of view (the intuition is that at least when the network is initialized with all matrices equal to 0 it still does something non-trivial, namely it computes the identity).
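
As a toy illustration of the last remark (my own choice of dimensions, with \sigma = ReLU and all matrices initialized to zero): the plain network then outputs zero, while in the residual network each even/odd pair of layers computes the identity, so the output is just x.

    import numpy as np

    # At the all-zeros initialization, the plain update h_{t+1} = sigma(A_t h_t)
    # collapses everything to zero, while the residual update for odd t,
    # h_{t+1} = h_{t-1} + sigma(A_t h_t), makes each even/odd pair act as the identity.
    relu = lambda v: np.maximum(v, 0.0)
    rng = np.random.default_rng(0)
    n, L = 4, 5                                    # L + 1 = 6 layers, i.e. 3 residual pairs
    x = rng.standard_normal(n)
    As = [np.zeros((n, n)) for _ in range(L + 1)]  # all-zeros initialization

    h_plain, h_res, prev = x, x, x
    for t, A in enumerate(As):
        h_plain = relu(A @ h_plain)                # plain update at every layer
        if t % 2 == 0:                             # t even: plain update
            prev, h_res = h_res, relu(A @ h_res)
        else:                                      # t odd: identity connection to h_{t-1}
            prev, h_res = h_res, prev + relu(A @ h_res)

    print(h_plain)  # the zero vector
    print(h_res)    # equals x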

In their most recent paper Moritz Hardt and Tengyu Ma try to explain why adding this “identity connection” could be a good idea from a geometric point of view. They consider an (extremely) simplified model where there is no non-linearity, i.e. \sigma is the identity map. A neural network is then just a product of matrices. In particular the landscape we are looking at for least-squares with such a model is of the form:

    \[g(A_0,\hdots, A_L) = \mathbb{E}_{(x,y) \sim \nu} \|\prod_{i=0}^L A_i x - y\|^2 ,\]

which is of course a non-convex function (just think of the function (a,b) \mapsto (ab-1)^2 and observe that on the line a=b it gives the non-convex function (a^2-1)^2). However it actually satisfies second order optimality:

Proposition [Kawaguchi 2016]

Assume that x has a full rank covariance matrix and that y=Rx for some deterministic matrix R. Then all local minima of g are global minima.

I won’t give the proof of this result as it requires taking the second derivative of g, which is a bit annoying (I will give below the first derivative computation). Now in this linearized setting the residual network version (where the identity connection is added at every layer) simply corresponds to a reparametrization around the identity; in other words we now consider the following function:

    \[f(A_0,\hdots, A_L) = \mathbb{E}_{(x,y) \sim \nu} \|\prod_{i=0}^L (A_i+\mathrm{I}) x - y\|^2 .\]

Proposition [Hardt and Ma 2016]

Assume that x has a full rank covariance matrix and that y= Rx for some deterministic matrix R. Then f has first order optimality on the set \{(A_0, \hdots, A_L) : \|A_i\| <1\}.

Thus adding the identity connection makes the objective function better behaved around the starting point where all the matrices are zero (in the sense that gradient descent doesn’t have to worry about avoiding saddle points). The proof is just a few lines of standard calculations with derivatives of functions of matrix-valued inputs.
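
To see the claim in action (my own toy experiment, not from the paper), one can run gradient descent on f from the all-zeros initialization on a small random instance, with numerically estimated gradients, and watch the objective reach its global minimum value 0; the dimensions, the choice \Sigma = \mathrm{I}, and the step size below are arbitrary.

    import numpy as np

    # Gradient descent on f(A_0, ..., A_L) = ||R - prod_i (A_i + I)||_F^2 (i.e. the
    # objective above with x isotropic, Sigma = I), started from all-zeros matrices.
    rng = np.random.default_rng(0)
    n, L = 3, 2                                    # three 3x3 layers
    R = np.eye(n) + 0.1 * rng.standard_normal((n, n))

    def f(As):
        P = np.eye(n)
        for A in As:
            P = P @ (A + np.eye(n))
        return np.sum((R - P) ** 2)

    def num_grad(As, i, eps=1e-6):                 # finite-difference gradient in A_i
        G = np.zeros((n, n))
        for r in range(n):
            for c in range(n):
                Ap = [A.copy() for A in As]
                Ap[i][r, c] += eps
                G[r, c] = (f(Ap) - f(As)) / eps
        return G

    As = [np.zeros((n, n)) for _ in range(L + 1)]  # all-zeros starting point
    for step in range(301):
        if step % 100 == 0:
            print(step, f(As))                     # decreases towards 0
        grads = [num_grad(As, i) for i in range(L + 1)]
        As = [A - 0.1 * G for A, G in zip(As, grads)]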

Proof: One has, with E = R - \prod_{i=0}^L (A_i+\mathrm{I}) and \Sigma = \mathbb{E}\, x x^{\top},

    \[f = \mathbb{E}\, (E x)^{\top} E x = \mathrm{Tr}(E \Sigma E^{\top}) =: \|E\|_{\Sigma}^2 ,\]

so with E_{<i} = \prod_{j <i} (A_j + \mathrm{I}) and E_{>i}=\prod_{j >i} (A_j + \mathrm{I}),

    \begin{eqnarray*} f(A_0,\hdots, A_i + V, \hdots, A_L) & = & \|R - \prod_{j <i} (A_j + \mathrm{I}) \times (A_i + V +\mathrm{I}) \times \prod_{j >i} (A_j + \mathrm{I})\|_{\Sigma}^2 \\ & = & \|E - E_{<i} V E_{>i}\|_{\Sigma}^2 \\ & = & \|E\|_{\Sigma}^2 - 2 \langle E \Sigma, E_{<i} V E_{>i} \rangle + \|E_{<i} V E_{>i}\|_{\Sigma}^2 , \end{eqnarray*}

which exactly means that the derivative of f with respect to A_i is equal to -2 E_{<i}^{\top} E \Sigma E_{>i}^{\top}. On the set under consideration one has that E_{<i} and E_{>i} are invertible (and so is \Sigma by assumption), and thus if this derivative is equal to 0 it must be that E=0 and thus f=0 (which is the global minimum).
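
Here is a quick finite-difference sanity check of this derivative formula (my own, on a small random instance with arbitrary dimensions):

    import numpy as np

    # Check that the directional derivative of f at (A_0, ..., A_L) in direction V
    # (in coordinate i) matches <grad, V> with grad = -2 E_{<i}^T E Sigma E_{>i}^T.
    rng = np.random.default_rng(0)
    n, L, i = 3, 4, 2
    M = rng.standard_normal((n, n))
    Sigma = M @ M.T                                          # covariance of x
    R = rng.standard_normal((n, n))
    As = [0.1 * rng.standard_normal((n, n)) for _ in range(L + 1)]
    I = np.eye(n)

    def prod(mats):
        out = I
        for P in mats:
            out = out @ P
        return out

    def f(As):
        E = R - prod([A + I for A in As])
        return np.trace(E @ Sigma @ E.T)                     # f = ||E||_Sigma^2

    E = R - prod([A + I for A in As])
    E_lt = prod([As[j] + I for j in range(i)])               # E_{<i}
    E_gt = prod([As[j] + I for j in range(i + 1, L + 1)])    # E_{>i}
    grad = -2 * E_lt.T @ E @ Sigma @ E_gt.T

    eps, V = 1e-6, rng.standard_normal((n, n))
    As_pert = [A.copy() for A in As]
    As_pert[i] = As[i] + eps * V
    print((f(As_pert) - f(As)) / eps, np.sum(grad * V))      # the two numbers should match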

Linearized recurrent neural networks

The simplest version of a recurrent neural network is as follows. It is a mapping of the form (x_1,\hdots,x_T) \mapsto (y_1,\hdots,y_T) (we are thinking of doing sequence to sequence prediction). In these networks the hidden state is updated as h_{t+1} = \sigma_1(A h_{t} + B x_{t}) (with h_1=0) and the output is y_t = \sigma_2(C h_t + D x_t). I will now describe a paper by Hardt, Ma and Recht (see also this blog post) that tries to understand the geometry of least-squares for this problem in the linearized version where \sigma_1 = \sigma_2 = \mathrm{Id}. That is we are looking at the function:

    \[f(\hat{A}, \hat{B}, \hat{C}, \hat{D})=\mathbb{E}_{(x_t)} \frac{1}{T} \sum_{t=1}^T \|\hat{y}_t - y_t\|^2 ,\]

where (y_t) is obtained from (x_t) via some unknown recurrent neural network with parameters A,B,C,D. First observe that by induction one can easily see that h_{t+1} = \sum_{k=1}^t A^{t-k} B x_k and y_t = D x_t + \sum_{k=1}^{t-1} C A^{t-1-k} B x_k. In particular, assuming that (x_t) is an i.i.d. isotropic sequence one obtains

    \[\mathbb{E} \|y_t - \hat{y}_t\|_2^2 = \|D-\hat{D}\|_F^2 + \sum_{k=1}^{t-1} \|\hat{C} \hat{A}^{t-1-k} \hat{B} - C A^{t-1-k} B\|_F^2 ,\]

and thus

    \[f(\hat{A}, \hat{B}, \hat{C}, \hat{D})=\|D-\hat{D}\|_F^2 + \sum_{k=1}^{T-1} (1- \frac{k}{T}) \|\hat{C} \hat{A}^{k-1} \hat{B} - C A^{k-1} B\|_F^2 .\]
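
As a quick sanity check of the first step of this derivation, here is a small sketch (dimensions are my own choices) comparing the unrolled expression y_t = D x_t + \sum_{k=1}^{t-1} C A^{t-1-k} B x_k with the output of the recursion:

    import numpy as np

    # Compare the recursion h_{t+1} = A h_t + B x_t, y_t = C h_t + D x_t (h_1 = 0)
    # with the unrolled formula y_t = D x_t + sum_{k=1}^{t-1} C A^{t-1-k} B x_k.
    rng = np.random.default_rng(0)
    n_h, n_in, n_out, T = 3, 2, 2, 6
    A = 0.5 * rng.standard_normal((n_h, n_h))
    B = rng.standard_normal((n_h, n_in))
    C = rng.standard_normal((n_out, n_h))
    D = rng.standard_normal((n_out, n_in))
    xs = [None] + [rng.standard_normal(n_in) for _ in range(T)]   # xs[t] = x_t, 1-indexed

    h, ys_rec = np.zeros(n_h), []                                 # h_1 = 0
    for t in range(1, T + 1):
        ys_rec.append(C @ h + D @ xs[t])
        h = A @ h + B @ xs[t]

    ys_closed = [D @ xs[t] + sum((C @ np.linalg.matrix_power(A, t - 1 - k) @ B @ xs[k]
                                  for k in range(1, t)), np.zeros(n_out))
                 for t in range(1, T + 1)]

    print(max(np.linalg.norm(u - v) for u, v in zip(ys_rec, ys_closed)))  # ~0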

In particular we see that the effect of \hat{D} is decoupled from the other variables and that it appears through a convex function, thus we will just ignore it. Next we make the natural assumption that the spectral radius of A is less than 1 (for otherwise the influence of the initial input x_1 grows over time, which doesn’t seem natural) and thus, up to a small error term (for large T), one can consider the idealized risk:

    \[(\hat{A},\hat{B}, \hat{C}) \mapsto \sum_{k=0}^{+\infty} \|\hat{C} \hat{A}^{k} \hat{B} - C A^{k} B\|_F^2 .\]

The next idea is a cute one which makes the above expression more tractable. Consider the series r_k = C A^k B and its Fourier transform:

    \[G(\theta) = \sum_{k=0}^{+\infty} r_k \exp(i k \theta) = C (\sum_{k=0}^{+\infty} (\exp(i \theta) A)^k) B = C(\mathrm{I} - \exp(i \theta) A)^{-1} B .\]
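
A quick numerical check of this geometric-series identity (my own sketch: a random A rescaled to have spectral radius 0.9, the series truncated at a large K):

    import numpy as np

    # Check that sum_k C A^k B exp(i k theta) equals C (I - exp(i theta) A)^{-1} B
    # when the spectral radius of A is below 1 (truncating the series at K terms).
    rng = np.random.default_rng(0)
    n = 4
    A = rng.standard_normal((n, n))
    A *= 0.9 / max(abs(np.linalg.eigvals(A)))     # rescale: spectral radius = 0.9
    B = rng.standard_normal((n, 1))
    C = rng.standard_normal((1, n))

    theta, K = 0.7, 2000
    z = np.exp(1j * theta)
    series = np.zeros((1, 1), dtype=complex)
    M = np.eye(n, dtype=complex)                  # M = (z A)^k, starting at k = 0
    for _ in range(K):
        series += C @ M @ B
        M = (z * A) @ M

    closed = C @ np.linalg.inv(np.eye(n) - z * A) @ B
    print(abs((series - closed).item()))          # ~0 up to truncation / rounding error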

By Parseval’s theorem the idealized risk is equal, up to a normalization constant, to the squared L_2 distance between G and \hat{G} (i.e. \int_{[-\pi, \pi]} \|G(\theta)-\hat{G}(\theta)\|_F^2 d\theta). We will now show that, under appropriate further assumptions, for any \theta the quantity \|G(\theta) - \hat{G}(\theta)\|_F^2 is weakly-quasi-convex in (\hat{A},\hat{B},\hat{C}) (in particular this shows that the idealized risk is weakly-quasi-convex). The big assumption that Hardt, Ma and Recht make is that the system is a “single-input single-output” model, that is both x_t and y_t are scalar. In this case it turns out that control theory provides a “controllable canonical form” where B=(0,\hdots,0,1), C=(c_1,\hdots,c_n) and A has zeros everywhere except on the upper diagonal where it has ones and on the last row where it has a_n,\hdots, a_1 (I don’t know the proof of this result, if some reader has a pointer for a simple proof please share in the comments!). Note that with this form the system is simple to interpret as one has Ah = (h(2),\hdots, h(n), \langle a,h\rangle) and Ah+Bx =(h(2),\hdots, h(n), \langle a,h\rangle+x). Now with just a few lines of algebra:

    \[G(\theta) = \frac{c_1 + \hdots + c_{n} z^{n-1}}{z^n + a_1 z^{n-1} + \hdots + a_n}, \ \text{where} \ z=\exp(i \theta) .\]

Thus we just have to check the weak-quasi-convexity of

    \[(\hat{a}, \hat{c}) \mapsto |\frac{\hat{c}_1 + \hdots + \hat{c}_{n} z^{n-1}}{z^n + \hat{a}_1 z^{n-1} + \hdots + \hat{a}_n} - \frac{u}{v}|^2 .\]

Weak-quasi-convexity is preserved under composition with affine maps (and the map (\hat{a}, \hat{c}) \mapsto (\hat{u}, \hat{v}) below is affine), so we just need to understand the map

    \[(\hat{u},\hat{v}) \in \mathbb{C} \times \mathbb{C} \mapsto |\frac{\hat{u}}{\hat{v}} - \frac{u}{v}|^2 ,\]

which is weakly-quasi-convex provided that \hat{v} has a positive inner product with v (viewing complex numbers as vectors in \mathbb{R}^2). In particular we just proved the following:

Theorem [Hardt, Ma, Recht 2016]

Let C(a) := \{z^n + a_1 z^{n-1} + \hdots + a_n , z \in \mathbb{C}, |z|=1\} and assume there is some cone \mathcal{C} \subset \mathbb{C} (viewed as \mathbb{R}^2) of angle less than \pi/2-\alpha such that C(a) \subset \mathcal{C}. Then the idealized risk is \alpha-weakly-quasi-convex on the set of \hat{a} such that C(\hat{a}) \subset \mathcal{C}.

(In the paper they specifically pick the cone \mathcal{C} where the imaginary part is larger than the real part.) This theorem naturally suggests that by overparametrizing the network (i.e. adding dimensions to a and c) one could get a nicer landscape (indeed in this case the above condition can be easier to satisfy), see the paper for more details!
