I (n.b., Julien Mairal) have been interested in drawing links between neural networks and kernel methods for some time, and I am grateful to Sebastien for giving me the opportunity to say a few words about it on his blog. My initial motivation was not to provide another “why deep learning works” theory, but simply to encode into kernel methods a few successful principles from convolutional neural networks (CNNs), such as the ability to model the local stationarity of natural images at multiple scales—we may call that modeling receptive fields—along with feature compositions and invariant representations. There was also something challenging in trying to reconcile end-to-end deep neural networks and non-parametric methods based on kernels that typically decouple data representation from the learning task.
The main goal of this blog post is then to discuss the construction of a particular multilayer kernel for images that encodes the previous principles, derive some invariance and stability properties for CNNs, and also present a simple mechanism to perform feature learning in reproducing kernel Hilbert spaces. In other words, we should not see any intrinsic contradiction between kernels and representation learning.
Preliminaries on kernel methods
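Given data living in a set $\mathcal{X}$, a positive definite kernel $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ implicitly defines a Hilbert space $\mathcal{H}$ of functions from $\mathcal{X}$ to $\mathbb{R}$, called a reproducing kernel Hilbert space (RKHS), along with a mapping function $\varphi: \mathcal{X} \to \mathcal{H}$.
A predictive model $f$ in $\mathcal{H}$ associates to every point $z$ a label in $\mathbb{R}$, and admits a simple form $f(z) = \langle f, \varphi(z) \rangle_{\mathcal{H}}$. Then, the Cauchy-Schwarz inequality gives us a first basic stability property: for all $z, z'$ in $\mathcal{X}$,
$$|f(z) - f(z')| \leq \|f\|_{\mathcal{H}} \, \|\varphi(z) - \varphi(z')\|_{\mathcal{H}}.$$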
This relation exhibits a discrepancy between neural networks and kernel methods. Whereas neural networks optimize the data representation for a specific task, the term on the right involves the product of two quantities where data representation and learning are decoupled:
$\|\varphi(z) - \varphi(z')\|_{\mathcal{H}}$ is a distance between two data representations $\varphi(z), \varphi(z')$, which are independent of the learning process, and $\|f\|_{\mathcal{H}}$ is a norm on the model (typically optimized over data) that acts as a measure of complexity.
Thinking about neural networks in terms of kernel methods then requires defining the underlying representation $\varphi(z)$, which can only depend on the network architecture, and the model $f$, which will be parametrized by the (learned) network’s weights.
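To make the stability property concrete, here is a minimal numerical sketch. It is an illustration only: the Gaussian kernel, the finite expansion $f = \sum_i \alpha_i K(x_i, \cdot)$, and all function names are assumptions chosen for the example. It relies on two standard identities: $\|f\|_{\mathcal{H}}^2 = \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j)$ and $\|\varphi(z) - \varphi(z')\|_{\mathcal{H}}^2 = K(z, z) - 2K(z, z') + K(z', z')$.

```python
import math
import random

def gaussian_kernel(a, b, bandwidth=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2)), a positive definite kernel."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def make_model(anchors, alphas, kernel=gaussian_kernel):
    """Return f = sum_i alpha_i K(x_i, .) together with its RKHS norm ||f||_H."""
    def f(z):
        return sum(a * kernel(x, z) for a, x in zip(alphas, anchors))
    sq_norm = sum(ai * aj * kernel(xi, xj)
                  for ai, xi in zip(alphas, anchors)
                  for aj, xj in zip(alphas, anchors))
    return f, math.sqrt(max(sq_norm, 0.0))

def feature_distance(z, z_prime, kernel=gaussian_kernel):
    """||phi(z) - phi(z')||_H, computed from kernel evaluations only."""
    sq = kernel(z, z) - 2.0 * kernel(z, z_prime) + kernel(z_prime, z_prime)
    return math.sqrt(max(sq, 0.0))

# Hypothetical toy data: five anchor points in R^3 with random coefficients.
rng = random.Random(0)
anchors = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(5)]
alphas = [rng.gauss(0, 1) for _ in range(5)]
f, f_norm = make_model(anchors, alphas)

z = [rng.gauss(0, 1) for _ in range(3)]
z_prime = [zi + 0.1 * rng.gauss(0, 1) for zi in z]  # small perturbation of z
lhs = abs(f(z) - f(z_prime))
rhs = f_norm * feature_distance(z, z_prime)
print(f"|f(z) - f(z')| = {lhs:.4f}, bound = {rhs:.4f}")
```

The right-hand side factors exactly as described above: `feature_distance` never looks at `alphas`, and `f_norm` never looks at the test points.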
Building a convolutional kernel for convolutional neural networks
Following Alberto Bietti’s paper, we now consider the direct construction of a multilayer convolutional kernel for images. Given a two-dimensional image $x_0$, the main idea is to build a sequence of “feature maps” $x_1, x_2, \ldots$ that are two-dimensional spatial maps carrying information about image neighborhoods (a.k.a. receptive fields) at every location. As we proceed in this sequence, the goal is to model larger neighborhoods with more “invariance”.
Formally, an input image $x_0$ is represented as a square-integrable function in $L^2(\Omega, \mathcal{H}_0)$, where $\Omega$ is a set of pixel coordinates, and $\mathcal{H}_0$ is a Hilbert space. $\Omega$ may be a discrete grid or a continuous domain such as $\mathbb{R}^2$, and $\mathcal{H}_0$ may simply be $\mathbb{R}^3$ for RGB images. Then, a feature map $x_k$ in $L^2(\Omega, \mathcal{H}_k)$ is obtained from a previous layer $x_{k-1}$ as follows:
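- modeling larger neighborhoods than in the previous layer: we map neighborhoods (patches) from $x_{k-1}$ to a new Hilbert space $\mathcal{H}_k$. Concretely, we define a homogeneous dot-product kernel between patches $z, z'$ from $x_{k-1}$:
$$K_k(z, z') = \|z\| \, \|z'\| \, \kappa_k\left( \left\langle \frac{z}{\|z\|}, \frac{z'}{\|z'\|} \right\rangle \right),$$
where $\langle \cdot, \cdot \rangle$ is an inner-product derived from $\mathcal{H}_{k-1}$, and $\kappa_k$ is a non-linear function that ensures positive definiteness, e.g., $\kappa_k(\langle u, u' \rangle) = e^{\alpha (\langle u, u' \rangle - 1)} = e^{-\frac{\alpha}{2} \|u - u'\|^2}$ for vectors $u, u'$ with unit norm, see this paper. By doing so, we implicitly define a kernel mapping $\varphi_k$ that maps patches from $x_{k-1}$ to a new Hilbert space $\mathcal{H}_k$. This mechanism is illustrated in the picture at the beginning of the post, and produces a spatial map that carries these patch representations.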
- increasing invariance: to gain invariance to small deformations, we smooth this spatial map with a linear filter, as shown in the picture at the beginning of the post, which may be interpreted as anti-aliasing (in terms of signal processing) or linear pooling (in terms of neural networks).
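As a rough illustration of the first step, the sketch below evaluates a homogeneous dot-product kernel between two flattened patches. The choice of an exponential non-linearity $t \mapsto e^{\alpha(t - 1)}$, the value of `alpha`, the convention that the kernel is zero when a patch is zero, and the function names are all assumptions made for this example, not a reference implementation.

```python
import math

def norm(z):
    """Euclidean norm of a flattened patch."""
    return math.sqrt(sum(v * v for v in z))

def kappa(t, alpha=1.0):
    """Non-linearity applied to the cosine t between two patches (assumed form)."""
    return math.exp(alpha * (t - 1.0))

def patch_kernel(z, z_prime, alpha=1.0):
    """||z|| ||z'|| kappa(<z / ||z||, z' / ||z'||>), a homogeneous dot-product kernel."""
    nz, nzp = norm(z), norm(z_prime)
    if nz == 0.0 or nzp == 0.0:
        return 0.0  # convention assumed here: the kernel vanishes on zero patches
    cosine = sum(a * b for a, b in zip(z, z_prime)) / (nz * nzp)
    return nz * nzp * kappa(cosine, alpha)

# Hypothetical patches, flattened to vectors.
z = [1.0, 2.0, 2.0]
z_prime = [2.0, 0.0, 1.0]
k = patch_kernel(z, z_prime)
k_scaled = patch_kernel([2.0 * v for v in z], z_prime)

print(patch_kernel(z, z), norm(z) ** 2)  # the two values should agree
print(k_scaled, 2.0 * k)                 # scaling a patch scales the kernel
```

The two printed pairs check the properties that make this kernel convenient: it reduces to the squared norm on the diagonal, and it is homogeneous, so rescaling a patch rescales its representation instead of changing its direction.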
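Formally, the previous construction amounts to applying operators $P_k$ (patch extraction), $M_k$ (kernel mapping), and $A_k$ (smoothing/pooling operator) to $x_{k-1}$ such that the $n$-th layer representation can be written as
$$\Phi_n(x_0) = x_n = A_n M_n P_n \cdots A_1 M_1 P_1 x_0 \in L^2(\Omega, \mathcal{H}_n).$$
We may finally define a kernel for images as $\mathcal{K}_n(x_0, x_0') = \langle \Phi_n(x_0), \Phi_n(x_0') \rangle$, whose RKHS contains the functions $f_w(x_0) = \langle w, \Phi_n(x_0) \rangle$ for $w$ in $L^2(\Omega, \mathcal{H}_n)$. Note now that we have introduced a concept of image representation $\Phi_n$, which only depends on some network architecture (amounts of pooling, patch size), and a predictive model $f_w$ parametrized by $w$.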
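For a single layer, the image kernel can be evaluated without ever forming the infinite-dimensional feature map. Expanding the inner product between two pooled maps gives a double sum over patch locations $v, v'$ of the patch kernel, weighted by $\sum_u h(u - v) h(u - v')$, where $h$ is the pooling filter. The sketch below does this for small grayscale images. The Gaussian filter, the exponential non-linearity, the patch size, the restriction to valid patch locations, and every function name are assumptions made for illustration.

```python
import math

def patch_kernel(z, z_prime, alpha=1.0):
    """||z|| ||z'|| exp(alpha * (cos(z, z') - 1)); zero if either patch is zero."""
    nz = math.sqrt(sum(v * v for v in z))
    nzp = math.sqrt(sum(v * v for v in z_prime))
    if nz == 0.0 or nzp == 0.0:
        return 0.0
    cosine = sum(a * b for a, b in zip(z, z_prime)) / (nz * nzp)
    return nz * nzp * math.exp(alpha * (cosine - 1.0))

def extract_patches(image, size):
    """Patch extraction: map each valid location to its flattened size x size patch."""
    rows, cols = len(image), len(image[0])
    return {
        (i, j): [image[i + di][j + dj] for di in range(size) for dj in range(size)]
        for i in range(rows - size + 1)
        for j in range(cols - size + 1)
    }

def pooling_gram(locations, sigma):
    """Weights sum_u h(u - v) h(u - v') for a Gaussian filter h (effect of pooling)."""
    def h(u, v):
        return math.exp(-((u[0] - v[0]) ** 2 + (u[1] - v[1]) ** 2) / (2.0 * sigma ** 2))
    return {
        (v, vp): sum(h(u, v) * h(u, vp) for u in locations)
        for v in locations
        for vp in locations
    }

def one_layer_kernel(x, x_prime, size=3, sigma=1.0, alpha=1.0):
    """Image kernel for one layer; both images must share the same shape."""
    patches = extract_patches(x, size)
    patches_prime = extract_patches(x_prime, size)
    gram = pooling_gram(list(patches), sigma)
    return sum(
        weight * patch_kernel(patches[v], patches_prime[vp], alpha)
        for (v, vp), weight in gram.items()
    )

def square_image(top, left, n=6, side=3):
    """Hypothetical toy image: an n x n grid with a bright side x side square."""
    return [[1.0 if top <= i < top + side and left <= j < left + side else 0.0
             for j in range(n)] for i in range(n)]

x = square_image(1, 1)
x_shifted = square_image(1, 2)  # same square, translated by one pixel
k_xx = one_layer_kernel(x, x)
k_ss = one_layer_kernel(x_shifted, x_shifted)
k_xs = one_layer_kernel(x, x_shifted)
similarity = k_xs / math.sqrt(k_xx * k_ss)  # cosine similarity in the RKHS
print(f"normalized kernel between image and shifted copy: {similarity:.4f}")
```

Increasing `sigma` pools over a wider neighborhood, which is the knob this construction uses to trade spatial resolution for invariance to small translations. The cost grows quadratically with the number of patch locations, so this direct evaluation is only practical for toy images.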
From such a construction, we will now derive stability results for classical convolutional neural networks (CNNs) and then derive non-standard CNNs based on kernel approximations that we call convolutional kernel networks (CKNs).
Next week, we will see how to perform feature (end-to-end) learning with the previous kernel representation, and also discuss other classical links between neural networks and kernel methods.