Guest post by Julien Mairal: A Kernel Point of View on Convolutional Neural Networks, part I

 

 

I (n.b., Julien Mairal) have been interested in drawing links between neural networks and kernel methods for some time, and I am grateful to Sebastien for giving me the opportunity to say a few words about it on his blog. My initial motivation was not to provide another “why deep learning works” theory, but simply to encode into kernel methods a few successful principles from convolutional neural networks (CNNs), such as the ability to model the local stationarity of natural images at multiple scales—we may call that modeling receptive fields—along with feature compositions and invariant representations. There was also something challenging in trying to reconcile end-to-end deep neural networks and non-parametric methods based on kernels that typically decouple data representation from the learning task.

The main goal of this blog post is then to discuss the construction of a particular multilayer kernel for images that encodes the previous principles, derive some invariance and stability properties for CNNs, and also present a simple mechanism to perform feature learning in reproducing kernel Hilbert spaces. In other words, we should not see any intrinsic contradiction between kernels and representation learning.

Preliminaries on kernel methods

Given data living in a set \mathcal{X}, a positive definite kernel K: \mathcal{X} \times \mathcal{X} \to \mathbb{R} implicitly defines a Hilbert space \mathcal{H} of functions from \mathcal{X} to \mathbb{R}, called the reproducing kernel Hilbert space (RKHS), along with a mapping function \varphi: \mathcal{X} \to \mathcal{H}.

A predictive model f in \mathcal{H} associates to every point x a label in \mathbb{R}, and admits the simple form f(x) =\langle f, \varphi(x) \rangle_{\mathcal{H}}. Since f(x)-f(x') = \langle f, \varphi(x)-\varphi(x') \rangle_{\mathcal{H}}, the Cauchy-Schwarz inequality gives us a first basic stability property

    \[ \forall x, x'\in \mathcal{X},~~~~~ |f(x)-f(x')| \leq \|f\|_{\mathcal{H}} \| \varphi(x) - \varphi(x')\|_\mathcal{H}. \]

This relation exhibits a discrepancy between neural networks and kernel methods. Whereas neural networks optimize the data representation for a specific task, the term on the right involves the product of two quantities where data representation and learning are decoupled:

\|\varphi(x)-\varphi(x')\|_\mathcal{H} is a distance between two data representations \varphi(x),\varphi(x'), which are independent of the learning process, and \|f\|_\mathcal{H} is a norm on the model f (typically optimized over data) that acts as a measure of complexity.

Thinking about neural networks in terms of kernel methods then requires defining the underlying representation \varphi(x), which may depend only on the network architecture, and the model f, which will be parametrized by the network’s (learned) weights.
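The stability bound above can be checked numerically. The following is a minimal sketch, assuming a Gaussian kernel on \mathbb{R}^d and a model of the form f = \sum_i c_i K(a_i, \cdot); the names gaussian_kernel, stability_terms, anchors and coeffs are illustrative and do not come from the post. It relies on two standard RKHS identities: \|f\|_\mathcal{H}^2 = c^\top K_{aa} c and \|\varphi(x)-\varphi(x')\|_\mathcal{H}^2 = K(x,x)+K(x',x')-2K(x,x').

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=0.5):
    """Gaussian kernel matrix between the rows of X and the rows of Y."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def stability_terms(anchors, coeffs, x, xp, gamma=0.5):
    """Return (|f(x)-f(x')|, ||f||_H, ||phi(x)-phi(x')||_H)
    for the model f = sum_i coeffs[i] * K(anchors[i], .)."""
    pts = np.stack([x, xp])
    # Evaluate f at both points through the kernel expansion.
    fx, fxp = gaussian_kernel(pts, anchors, gamma) @ coeffs
    # RKHS norm of f: sqrt(c^T K_aa c).
    f_norm = np.sqrt(max(coeffs @ gaussian_kernel(anchors, anchors, gamma) @ coeffs, 0.0))
    # RKHS distance between the two representations, via the kernel trick.
    Kp = gaussian_kernel(pts, pts, gamma)
    dist = np.sqrt(max(Kp[0, 0] + Kp[1, 1] - 2.0 * Kp[0, 1], 0.0))
    return abs(fx - fxp), f_norm, dist

rng = np.random.default_rng(0)
anchors = rng.normal(size=(20, 3))   # hypothetical training points a_i
coeffs = rng.normal(size=20)         # hypothetical coefficients c_i
x, xp = rng.normal(size=3), rng.normal(size=3)

gap, f_norm, dist = stability_terms(anchors, coeffs, x, xp)
print("left-hand side :", gap)
print("right-hand side:", f_norm * dist)
```

The first factor on the right depends only on the model and the second only on the kernel, which is the decoupling discussed above.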

Building a convolutional kernel for convolutional neural networks

Following Alberto Bietti’s paper, we now consider the direct construction of a multilayer convolutional kernel for images. Given a two-dimensional image x_0, the main idea is to build a sequence of “feature maps” x_1,x_2,\ldots that are two-dimensional spatial maps carrying information about image neighborhoods (a.k.a. receptive fields) at every location. As we proceed in this sequence, the goal is to model larger neighborhoods with more “invariance”.

Formally, an input image x_0 is represented as a square-integrable function in L^2(\Omega,\mathcal{H}_0), where \Omega is a set of pixel coordinates, and \mathcal{H}_0 is a Hilbert space. \Omega may be a discrete grid or a continuous domain such as \mathbb{R}^2, and \mathcal{H}_0 may simply be \mathbb{R}^3 for RGB images. Then, a feature map x_k in L^2(\Omega,\mathcal{H}_k) is obtained from a previous layer x_{k-1} as follows:

  • modeling larger neighborhoods than in the previous layer: we map neighborhoods (patches) from x_{k-1} to a new Hilbert space \mathcal{H}_k. Concretely, we define a homogeneous dot-product kernel between patches z, z' from x_{k-1}:

        \[ K_k(z,z') = \|z\| \|z'\| \kappa_k \left( \left\langle \frac{z}{\|z\|}, \frac{z'}{\|z'\|} \right\rangle \right), \]

    where \langle \cdot , \cdot \rangle is an inner product derived from \mathcal{H}_{k-1}, and \kappa_k is a non-linear function that ensures positive definiteness, e.g., \kappa_k(\langle u,u'\rangle ) = e^{\alpha (\langle u,u'\rangle -1)} = e^{-\frac{\alpha}{2}\|u-u'\|^2} for vectors u, u' with unit norm (see this paper). By doing so, we implicitly define a kernel mapping \varphi_k that maps patches from x_{k-1} to a new Hilbert space \mathcal{H}_k. This mechanism is illustrated in the picture at the beginning of the post, and produces a spatial map that carries these patch representations.

  • increasing invariance: to gain invariance to small deformations, we smooth x_{k-1} with a linear filter, as shown in the picture at the beginning of the post, which may be interpreted as anti-aliasing (in terms of signal processing) or linear pooling (in terms of neural networks).
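The patch kernel of the first item can be sketched in a few lines. This is only an illustration, assuming patches are finite-dimensional vectors with the Euclidean inner product and using the exponential choice of \kappa_k given above; the function names kappa and homogeneous_kernel and the convention of returning 0 for a zero patch are assumptions made here.

```python
import numpy as np

def kappa(t, alpha=1.0):
    """kappa(<u, u'>) = exp(alpha * (<u, u'> - 1)), for unit-norm u, u'."""
    return np.exp(alpha * (t - 1.0))

def homogeneous_kernel(z, zp, alpha=1.0):
    """K(z, z') = ||z|| ||z'|| kappa(<z/||z||, z'/||z'||>)."""
    nz, nzp = np.linalg.norm(z), np.linalg.norm(zp)
    if nz == 0.0 or nzp == 0.0:
        return 0.0  # assumed convention for zero patches
    return nz * nzp * kappa(np.dot(z, zp) / (nz * nzp), alpha)

z = np.array([1.0, 2.0, -1.0])
zp = np.array([0.5, -1.0, 2.0])
u, up = z / np.linalg.norm(z), zp / np.linalg.norm(zp)

# The two expressions of kappa agree on unit-norm vectors.
print(np.isclose(kappa(u @ up, 2.0), np.exp(-(2.0 / 2) * np.linalg.norm(u - up) ** 2)))
# Homogeneity: scaling a patch by 3 scales the kernel by 3.
print(np.isclose(homogeneous_kernel(3 * z, zp), 3 * homogeneous_kernel(z, zp)))
# Since kappa(1) = 1, the mapping preserves norms: K(z, z) = ||z||^2.
print(np.isclose(homogeneous_kernel(z, z), z @ z))
```

The last check shows that \|\varphi_k(z)\| = \|z\|, so the kernel mapping does not change the norm of a patch.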

Formally, the previous construction amounts to applying operators P_k (patch extraction), M_k (kernel mapping), and A_k (smoothing/pooling operator) to x_{k-1} such that the n-th layer representation can be written as

    \[ \Phi_n(x_0)= x_n= A_n M_n P_n \ldots A_1 M_1 P_1 x_0~~~\text{in}~~~~L^2(\Omega,\mathcal{H}_n). \]

We may finally define a kernel for images as \mathcal{K}_n(x_0,x_0')=\langle \Phi_n(x_0), \Phi_n(x_0') \rangle, whose RKHS contains the functions f_w(x_0) = \langle w , \Phi_n(x_0) \rangle for w in L^2(\Omega,\mathcal{H}_n). Note that we have now introduced an image representation \Phi_n, which depends only on the network architecture (amount of pooling, patch size), and a predictive model f_w parametrized by w.
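For a single layer (n = 1) on a discrete grid, the image kernel can be evaluated exactly without ever forming \varphi_1, because \langle A_1 M_1 P_1 x_0, A_1 M_1 P_1 x_0' \rangle expands into a weighted sum of patch-kernel values. The sketch below makes several simplifying assumptions that are not in the post: grayscale images (\mathcal{H}_0 = \mathbb{R}), patches taken only where they fit entirely inside the image, unnormalized Gaussian pooling weights, and the exponential \kappa_1. All function names are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def extract_patches(img, size):
    """P_1: every size x size patch of a 2-D image, one flattened patch per row."""
    windows = sliding_window_view(img, (size, size))
    return windows.reshape(-1, size * size), windows.shape[:2]

def patch_kernel(Z, Zp, alpha=1.0):
    """M_1 (implicitly): homogeneous dot-product kernel between all patch pairs."""
    norms = np.outer(np.linalg.norm(Z, axis=1), np.linalg.norm(Zp, axis=1))
    cos = np.divide(Z @ Zp.T, norms, out=np.zeros_like(norms), where=norms > 0)
    return norms * np.exp(alpha * (cos - 1.0))

def pooling_matrix(grid_shape, sigma):
    """A_1: Gaussian weights h(u, v) between all pairs of patch locations."""
    rows, cols = np.indices(grid_shape)
    coords = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)
    sq = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def one_layer_kernel(x, xp, size=3, sigma=1.0, alpha=1.0):
    """K_1(x, x') = sum_{v, v'} (H^T H)[v, v'] * K(patch_v(x), patch_v'(x'))."""
    Z, grid = extract_patches(x, size)
    Zp, _ = extract_patches(xp, size)      # both images must have the same shape
    H = pooling_matrix(grid, sigma)
    return float(np.sum((H.T @ H) * patch_kernel(Z, Zp, alpha)))

rng = np.random.default_rng(0)
images = [rng.normal(size=(6, 6)) for _ in range(4)]   # toy random "images"
gram = np.array([[one_layer_kernel(a, b) for b in images] for a in images])
print(np.allclose(gram, gram.T))                        # symmetry of the Gram matrix
print(np.linalg.eigvalsh((gram + gram.T) / 2).min())    # smallest eigenvalue
```

Going beyond one layer this way quickly becomes expensive, since the patch kernel at layer k needs inner products in \mathcal{H}_{k-1}, which are themselves kernel evaluations.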

From such a construction, we will now derive stability results for classical convolutional neural networks (CNNs) and then derive non-standard CNNs based on kernel approximations that we call convolutional kernel networks (CKNs).

 

Next week, we will see how to perform end-to-end feature learning with the previous kernel representation, and also discuss other classical links between neural networks and kernel methods.

 

 
