Guest post by Julien Mairal: A Kernel Point of View on Convolutional Neural Networks, part II

This is a continuation of Julien Mairal‘s guest post on CNNs, see part I here.

Stability to deformations of convolutional neural networks

In their ICML paper Zhang et al. introduce a functional space for CNNs with one layer, by noticing that for some dot-product kernels, smoothed variants of rectified linear unit activation functions (ReLU) live in the corresponding RKHS, see also this paper and that one. By following a similar reasoning with multiple layers, it is then possible to show that the functional space described in part I \{ f_w: x \mapsto \langle w , \Phi_n(x_0) \rangle; w \in L^2(\Omega,\mathcal{H}_n) \} contains CNNs with such smoothed ReLU, and that the norm \|f_w\| of such networks can be controlled by the spectral norms of filter matrices. This is consistent with previous measures of complexity for CNNs, see this paper by Bartlett et al.

A perhaps more interesting finding is that the abstract representation \Phi_n(x), which only depends on the network architecture, may provide near-translation invariance and stability to small image deformations while preserving information—that is, x can be recovered from \Phi_n(x). The original characterization we use was introduced by Mallat in his paper on the scattering transform—a multilayer architecture akin to CNNs based on wavelets, and was extended to \Phi_n by Alberto Bietti, who should be credited for all the hard work here.

Our goal is to understand under which conditions it is possible to obtain a representation that (i) is near-translation invariant, (ii) is stable to deformations, (iii) preserves signal information. Given a C^1-diffeomorphism \tau: \mathbb{R}^2 \to \mathbb{R}^2 and denoting by L_\tau x(u) = x(u-\tau(u)) its action operator (for an image defined on the continuous domain \mathbb{R}^2), the main stability bound we obtain is the following one, see Theorem 7 in Mallat’s paper if \|\nabla \tau\|_\infty \leq 1/2, for all x,

    \[ \| \Phi_n(L_\tau x) - \Phi_n(x)\| \leq \left ( C_1 (1+n) \|\nabla \tau\|_\infty + \frac{C_2}{\sigma_n} \|\tau\|_\infty \right) \|x\|, \]

where C_1, C_2 are universal constants, \sigma_n is the scale parameter of the pooling operator A_n corresponding to the “amount of pooling” performed up to the last layer, \|\tau\|_\infty is the maximum pixel displacement and \|\nabla \tau\|_\infty represents the maximum amount of deformation, see the paper for the precise definitions of all these quantities. Note that when C_2/\sigma_n \to 0, the representation \Phi_n becomes translation invariant: indeed, consider the particular case of \tau being a translation, then \nabla \tau=0 and \|\Phi_n(L_\tau x) - \Phi_n(x)\| \to 0.

The stability bound and a few additional results tell us a few things about the network architecture: (a) small patches lead to more stable representations (the dependency is hidden in C_1); (b) signal preservation for discrete signals requires small subsampling factors (and thus small pooling) between layers. In such a setting, the scale parameter \sigma_n still grows exponentially with n and near translation invariance may be achieved with several layers.

Interestingly, we may now come back to the Cauchy-Schwarz inequality from part 1, and note that if \Phi_n is stable, the RKHS norm \|f\| is then a natural quantity that provides stability to deformations to the prediction function f, in addition to measuring model complexity in a traditional sense.

Feature learning in RKHSs and convolutional kernel networks

The previous paragraph is devoted to the characterization of convolutional architectures such as CNNs but the previous kernel construction can in fact be used to derive more traditional kernel methods. After all, why should one spend efforts defining a kernel between images if not to use it?

This can be achieved by considering finite-dimensional approximations of the previous feature maps. In order to shorten the presentation, we simply describe the main idea based on the Nystrom approximation and refer to the paper for more details. Approximating the infinite-dimensional feature maps x_k (see the figure at the top of part I) can be done by projecting each point in \mathcal{H}_k onto a p_k-dimensional subspace \mathcal{F}_k leading to a finite-dimensional feature map \tilde{x}_k akin to CNNs, see the figure at the top of the post.

By parametrizing \mathcal{F}_k=\text{span}(\varphi_k(z_1),\varphi_k(z_2),\ldots,\varphi_k(z_{p_k})) with p_k anchor points Z=[z_1,\ldots,z_{p_k}], and using a dot-product kernel, a patch z from \tilde{x}_{k-1} is encoded through the mapping function

    \[ \psi_k(z) = \|z\| \kappa_k( Z^\top Z)^{-1/2} \kappa_k\left( Z^\top \frac{z}{\|z\|} \right), \]

where \kappa_k is applied pointwise. Then, computing \tilde{x}_k from \tilde{x}_{k-1} admits a CNN interpretation, where only the normalization and the matrix multiplication by \kappa_k( Z^\top Z)^{-1/2} are not standard operations. It remains now to choose the anchor points:

  • kernel approximation: a first approach consists of using a variant of the Nystrom method, see this paper and that one. When plugging the corresponding image representation in a linear classifier, the resulting approach behaves as a classical kernel machine. Empirically, we observe that the higher the number of anchor points, the better the kernel approximation, and the higher the accuracy. For instance, a two-layer network with a 300k-dimensional representations achieves about 86\% accuracy on CIFAR-10 without data augmentation (see here).
  • back-propagation, feature selection: learning the anchor points Z can also be done as in a traditional CNN, by optimizing them end-to-end. This allows using deeper lower-dimensional architectures and empirically seems to perform better when enough data is available, e.g., 92\% accuracy on CIFAR-10 with simple data augmentation. There, the subspaces \mathcal{F}_k are not learned anymore to provide the best kernel approximation, but the model seems to perform a sort of feature selection in each layer’s RKHS \mathcal{H}_k, which is not well understood yet (This feature selection interpretation is due to my collaborator Laurent Jacob).

Note that the first CKN model published here was based on a different approximation principle, which was not compatible with end-to-end training. We found this to be less scalable and effective.

Other links between neural networks and kernel methods

Finally, other links between kernels and infinitely-wide neural networks with random weights are classical, but they were not the topic of this blog post (they should be the topic of another one!). In a nutshell, for a large collection of weights distributions and nonlinear functions s: \mathbb{R} \to \mathbb{R}, the following quantity admits an analytical form

    \[ K(x,x') = \E_{w}[ s(w^\top x) s(w^\top x')], \]

where the terms s(w^\top x) may be seen as an infinitely-wide single-layer neural network. The first time such a relation appears is likely to be in the PhD thesis of Radford Neal with a Gaussian process interpretation, and it was revisited later by Le Roux and Bengio and by Cho and Saul with multilayer models.

In particular, when s is the rectified linear unit and w follows a Gaussian distribution, it is known that we recover the arc-cosine kernel. We may also note that random Fourier features also yield a similar interpretation.

Other important links have also been drawn recently between kernel regression and strongly over-parametrized neural networks, see this paper and that one, which is another exciting story.

Posted in Machine learning | 5 Comments

Guest post by Julien Mairal: A Kernel Point of View on Convolutional Neural Networks, part I



I (n.b., Julien Mairal) have been interested in drawing links between neural networks and kernel methods for some time, and I am grateful to Sebastien for giving me the opportunity to say a few words about it on his blog. My initial motivation was not to provide another “why deep learning works” theory, but simply to encode into kernel methods a few successful principles from convolutional neural networks (CNNs), such as the ability to model the local stationarity of natural images at multiple scales—we may call that modeling receptive fields—along with feature compositions and invariant representations. There was also something challenging in trying to reconcile end-to-end deep neural networks and non-parametric methods based on kernels that typically decouple data representation from the learning task.

The main goal of this blog post is then to discuss the construction of a particular multilayer kernel for images that encodes the previous principles, derive some invariance and stability properties for CNNs, and also present a simple mechanism to perform feature learning in reproducing kernel Hilbert spaces. In other words, we should not see any intrinsic contradiction between kernels and representation learning.

Preliminaries on kernel methods

Given data living in a set \mathcal{X}, a positive definite kernel K: \mathcal{X} \times \mathcal{X} \to \mathbb{R} implicitly defines a Hilbert space \mathcal{H} of functions from \mathcal{X} to \mathbb{R}, called reproducing kernel Hilbert space (RKHS), along with a mapping function \varphi: \mathcal{X} \to \mathcal{H}.

A predictive model f in \mathcal{H} associates to every point x a label in \mathbb{R}, and admits a simple form f(x) =\langle f, \varphi(x) \rangle_{\mathcal{H}}. Then, Cauchy-Schwarz inequality gives us a first basic stability property

    \[ \forall x, x'\in \mathcal{X},~~~~~ |f(x)-f(x')| \leq \|f\|_{\mathcal{H}} \| \varphi(x) - \varphi(x')\|_\mathcal{H}. \]

This relation exhibits a discrepancy between neural networks and kernel methods. Whereas neural networks optimize the data representation for a specific task, the term on the right involves the product of two quantities where data representation and learning are decoupled:

\|\varphi(x)-\varphi(x')\|_\mathcal{H} is a distance between two data representations \varphi(x),\varphi(x'), which are independent of the learning process, and \|f\|_\mathcal{H} is a norm on the model f (typically optimized over data) that acts as a measure of complexity.

Thinking about neural networks in terms of kernel methods then requires defining the underlying representation \varphi(x), which can only depend on the network architecture, and the model f, which will be parametrized by (learned) network’s weights.

Building a convolutional kernel for convolutional neural networks

Following Alberto Bietti’s paper, we now consider the direct construction of a multilayer convolutional kernel for images. Given a two-dimensional image x_0, the main idea is to build a sequence of “feature maps” x_1,x_2,\ldots that are two-dimensional spatial maps carrying information about image neighborhoods (a.k.a receptive fields) at every location. As we proceed in this sequence, the goal is to model larger neighborhoods with more “invariance”.

Formally, an input image x_0 is represented as a square-integrable function in L^2(\Omega,\mathcal{H}_0), where \Omega is a set of pixel coordinates, and \mathcal{H}_0 is a Hilbert space. \Omega may be a discrete grid or a continuous domain such as \mathbb{R}^2, and \mathcal{H}_0 may simply be \mathbb{R}^3 for RGB images. Then, a feature map x_k in L^2(\Omega,\mathcal{H}_k) is obtained from a previous layer x_{k-1} as follows:

  • modeling larger neighborhoods than in the previous layer: we map neighborhoods (patches) from x_{k-1} to a new Hilbert space \mathcal{H}_k. Concretely, we define a homogeneous dot-product kernel between patches z, z' from x_{k-1}:

        \[ K_k(z,z') = \|z\| \|z'\| \kappa_k \left( \left\langle \frac{z}{\|z\|}, \frac{z'}{\|z'\|} \right\rangle \right), \]

    where \langle . , . \rangle is an inner-product derived from \mathcal{H}_{k-1}, and \kappa_k is a non-linear function that ensures positive definiteness, e.g., \kappa_k(\langle u,u'\rangle ) = e^{\alpha (\langle u,u'\rangle -1)} = e^{-\frac{\alpha}{2}\|u-u'\|^2} for vectors u, u' with unit norm, see this paper. By doing so, we implicitly define a kernel mapping \varphi_k that maps patches from x_{k-1} to a new Hilbert space \mathcal{H}_k. This mechanism is illustrated in the picture at the beginning of the post, and produces a spatial map that carries these patch representations.

  • increasing invariance: to gain invariance to small deformations, we smooth~x_{k-1} with a linear filter, as shown in the picture at the beginning of the post, which may be interpreted as anti-aliasing (in terms of signal processing) or linear pooling (in terms of neural networks).

Formally, the previous construction amounts to applying operators P_k (patch extraction), M_k (kernel mapping), and A_k (smoothing/pooling operator) to x_{k-1} such that the n-th layer representation can be written as

    \[ \Phi_n(x_0)= x_n= A_n M_n P_n \ldots A_1 M_1 P_1 x_0~~~\text{in}~~~~L^2(\Omega,\mathcal{H}_n). \]

We may finally define a kernel for images as \mathcal{K}_n(x_0,x_0')=\langle \Phi_n(x_0), \Phi_n(x_0') \rangle, whose RKHS contains the functions f_w(x_0) = \langle w , \Phi_n(x_0) \rangle for w in L^2(\Omega,\mathcal{H}_n). Note now that we have introduced a concept of image representation \Phi_n, which only depends on some network architecture (amounts of pooling, patch size), and predictive model f_w parametrized by w.

From such a construction, we will now derive stability results for classical convolutional neural networks (CNNs) and then derive non-standard CNNs based on kernel approximations that we call convolutional kernel networks (CKNs).


Next week, we will see how to perform feature (end-to-end) learning with the previous kernel representation, and also discuss other classical links between neural networks and kernel methods.



Posted in Machine learning | 1 Comment

Optimal bound for stochastic bandits with corruption

Guest post by Mark Sellke.

In the comments of the previous blog post we asked if the new viewpoint on best of both worlds can be used to get clean “interpolation” results. The context is as follows: in a STOC 2018 paper followed by a COLT 2019 paper, the following corruption model was discussed: stochastic bandits, except for C rounds which are adversarial. The state of the art bounds were of the form: optimal (or almost optimal) stochastic term plus K C, and it was mentioned as an open problem whether KC could be improved to C (there is a lower bound showing that C is necessary — when C = O(\sqrt{T})). As was discussed in the comment section, it seemed that indeed this clean best of both worlds approach should certainly shed light on the corruption model. It turns out that this is indeed the case, and a one-line calculation resolves positively the open problem from the COLT paper. The formal result is as follows (recall the notation/definitions from the previous blog post):

Lemma: Consider a strategy whose regret with respect to the optimal action i^* is upper bounded by

(1)   \begin{equation*} c \sum_{t=1}^T \sum_{i \neq i^*} \sqrt{\frac{x_{i,t}}{t}} \,. \end{equation*}

Then in the C-corruption stochastic bandit model one has that the regret is bounded by:

    \[ C + 2 c \sqrt{K C} + c^2 \sum_{i \neq i^*} \frac{\log(T)}{\Delta_i} \]


Note that by the previous blog post we know strategies that satisfy (1) with c=10 (see Lemma 2 in the previous post).

Proof: In equation (1) let us apply Jensen over the corrupt rounds, this yields a term c \sqrt{K C}. For the non-corrupt rounds, let us use that

    \[ c \sqrt{\frac{x_{i,t}}{t}} \leq \frac{1}{2} \left( \Delta_i x_{i,t} + \frac{c^2}{t \Delta_i} \right) \]

The sum of the second term on the right hand side is upper bounded by c^2 \sum_{i \neq i^*} \frac{\log(T)}{2 \Delta_i}. On the other hand the sum (over non-corrupt rounds) of the first term is equal to 1/2 of the regret over the non-corrupt rounds, which is certainly smaller than 1/2 of the total regret plus C. Thus we obtain (denoting R for the total regret):

    \[ R \leq c \sqrt{K C} + c^2 \sum_{i \neq i^*} \frac{\log(T)}{2 \Delta_i} + \frac{C}{2} + \frac{R}{2} \]

which concludes the proof.


Posted in Machine learning, Optimization, Theoretical Computer Science | 3 Comments

Amazing progress in adversarially robust stochastic multi-armed bandits

In this post I briefly discuss some recent stunning progress on robust bandits (for more background on bandits see these two posts, part 1 and part 2, in particular what is described below gives a solution to Open Problem 3 at the end of part 2).


Stochastic bandit and adversarial examples

In multi-armed bandit problems the gold standard property, going back to a seminal paper of Lai and Robbins in 1985 is to have a regret upper bounded by:

(1)   \begin{equation*} \sum_{i \neq i^*} \frac{\log(T)}{\Delta_i} \,. \end{equation*}

Let me unpack this a bit: this is for the scenario where the reward process for each action is simply an i.i.d. sequence from some fixed distribution, i^* is the index of the (unique) best action, and \Delta_i is the gap between the mean value of the best action and the one of i. Such guarantee is extremely strong, as in particular it means that actions whose average performance is a constant away from the optimal arm are very rarely played (only of order \log(T)). On the other hand, the price to pay for such an aggressive behavior (by this I mean focusing on good actions very quickly) is that all the classical algorithms attaining the above bound are extremely sensitive to adversarial examples: that is if there is some deviation from the i.i.d. assumption (even very brief in time), the algorithms can suddenly suffer linear in T regret.


Adversarial bandit

Of course there is an entirely different line of works, on adversarial multi-armed bandits, where the whole point is to prove regret guarantee for any reward process. In this case the best one can hope for is a regret of order \sqrt{K T}. The classical algorithm in this model, Exp3, attains a regret of order \sqrt{K T \log(K)}. In joint work with Jean-Yves Audibert we showed back in 2009 that the following strategy, which we called PolyINF, attains the optimal \sqrt{KT}: view Exp3 as Mirror Descent with the (negative) entropy as a regularizer, and now replace the entropy by a simple rational function namely - \sum_{i=1}^K x_i^p with p \in (0,1) (this mirror descent view was actually derived in a later paper with Gabor Lugosi). The proof becomes one line (given the appropriate knowledge of mirror descent and estimation in bandit games): the radius part of the bound is of the form \frac{1}{\eta} \sum_{i=1}^K x_i^p \leq \frac{K^{p}}{\eta}, while the variance is of the form (since the inverse of the Hessian of the mirror map is a diagonal matrix with entries x_i^{2-p}):

    \[ \eta \sum_{i=1} x_i^{2-p} \frac{1}{x_i} \leq \eta K^{1-p} \,. \]

Thus optimizing over \eta yields \sqrt{K T} for any p \in (0,1). Interestingly, the best numerical constant in the bound is obtained for p=1/2.


Best of both worlds

This was the state of affairs back in 2011, when with Alex Slivkins we started working on a best of both worlds type algorithm (which in today’s language is exactly a stochastic MAB robust to adversarial examples): namely one that gets the \log(T) guarantee (in fact \log^2(T) in our original paper) if the environment is the nice i.i.d. one, and also \sqrt{K T} (in fact \sqrt{K T \log^3(T)}) in the worst case. This original best of both worlds algorithm was of the following form: be aggressive as if it was a stochastic environment, but still sample sufficiently often the bad actions to make sure there isn’t an adversary trying to hide some non-stochastic behavior on these seemingly bad performing actions. Of course the whole difficulty was to show that it is possible to implement such a defense without hurting the stochastic performance too much (remember that bad actions can only be sampled of order \log(T) times!). Since this COLT 2012 paper there has been many improvements to the original strategy, as well as many variants/refinements (one such variant worth mentioning are the works trying to do a smooth transition between the stochastic and adversarial models, see e.g. here and here).


A stunning twist

The amazing development that I want to talk about in this post is the following: about a year ago Julian Zimmert and Yevgeny Seldin proved that the 2009 PolyINF (crucially with p=1/2) strategy actually gets the 2012 best of both worlds bound! This is truly surprising, as in principle mirror descent does not “know” anything about stochastic environments, it does not make any sophisticated concentration reasoning (say as in Lai and Robbins), yet it seems to automatically and optimally pick up on the regularity in the data. This is really amazing to me, and of course also a total surprise that the polynomial regularizer has such strong adaptivity property, while it was merely introduced to remove a benign log term.

The crucial observation of Zimmert and Seldin is that a a certain self-bounding property of the regret implies (in a one-line calculation) the best of both worlds result:

Lemma 1: Consider a strategy whose regret with respect to the optimal action i^* is upper bounded by

(2)   \begin{equation*} C \sum_{t=1}^T \sum_{i \neq i^*} \sqrt{\frac{x_{i,t}}{t}} \,. \end{equation*}

(Recall that for multi-armed bandit one selects a probability distribution x_t over the actions, so x_{i,t} denote here the probability of playing action i at time t.) Then one has that the regret is in fact bounded by 2 C \sqrt{K T} (this follows trivially by Jensen on the i sum), and moreover if the environment is stochastic one has that the regret is in fact bounded by C^2 times (1).

Proof: Assuming that the environment is stochastic we can write the regret as \sum_{i,t} \Delta_i x_{i,t}, so by assumption and using that C \sqrt{\frac{x_{i,t}}{t}} \leq \frac{1}{2} \left( \Delta_i x_{i,t} + \frac{C^2}{t \Delta_i} \right) one has:

    \[ \sum_{i \neq i^*,t} \Delta_i x_{i,t} \leq \frac{1}{2} \sum_{i \neq i^*,t} \left(\Delta_i x_{i,t} + \frac{C^2}{t \Delta_i} \right) \,, \]

which means that the left hand side is smaller than \sum_{i \neq i^*,t} \frac{C^2}{t \Delta_i} which is indeed smaller than C^2 times (1).


Yet another one-line proof (okay, maybe 5 lines)

Zimmert and Seldin proved that PolyINF with p=1/2 actually satisfies the self-bounding property of Lemma 1 (and thus obtains the best of both worlds guarantee). In another recent paper by Zimmert, in joint work with Haipeng Luo and Chen-Yu Wei, they simplify the analysis by using a very mild variation of the PolyINF regularizer, namely - \sum_{i=1}^K (\sqrt{x_i} + \sqrt{1-x_i}). In my view it’s the proof from the book for the best of both worlds result (or very close to it)! Here it is:

Lemma 2: Equation (2) with C=10 is an upper bound on the regret of mirror descent with learning \eta_t = \frac{1}{\sqrt{t}}, mirror map \Phi(x) = - \sum_{i=1}^K (\sqrt{x_i} + \sqrt{1-x_i}), and standard multi-armed bandit loss estimator.

Proof: The classical mirror descent analysis from any good book will tell you that the regret is controlled by (for \Phi(x) = \sum_{i=1}^K \phi(x_i) and with the convention \eta_0 = + \infty):

(3)   \begin{equation*} \sum_{t=1}^T \left(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\right) (\Phi(x^*) - \Phi(x_t)) + \sum_{t=1}^T \eta_t \sum_{i=1}^K \frac{1}{x_{i,t} \phi''(x_t)} \,. \end{equation*}

We now consider those terms for the specific \Phi and \eta_t suggested in the lemma. First notice that \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \leq \eta_t. Moreover \phi(x^*_i) = - 1 (since x^* is integral) so that \phi(x^*_i) - \phi(x_{i,t}) \leq \min(\sqrt{x_{i,t}}, \sqrt{1-x_{i,t}}). In other words the first term in (3) is upper bounded by

    \[ \sum_{t=1}^T \sum_{i=1}^K \sqrt{\frac{\min(x_{i,t}, 1-x_{i,t})}{t}} \leq 2 \sum_{t=1}^T \sum_{i \neq i^*} \sqrt{\frac{x_{i,t}}{t}} \, \]

where the inequality simply comes from \sqrt{1-x_{i^*,t}} = \sqrt{\sum_{i \neq i^*} x_{i,t}} \leq \sum_{i \neq i^*} \sqrt{x_{i,t}}.

Next note that \phi''(s) = \frac{1}{4} (s^{-3/2} + (1-s)^{-3/2}) \geq \frac{1}{4 \min(s,1-s)^{3/2}}, so that the second term in (3) is upper bounded by

    \[ 4 \sum_{t=1}^T \sum_{i=1}^K \sqrt{\frac{x_{i,t}}{t}} \min(1, (1-x_{i,t})/x_{i,t})^{3/2} \leq 8 \sum_{t=1}^T \sum_{i \neq i^*} \sqrt{\frac{x_{i,t}}{t}} \, \]

where the inequality follows trivially by considering the two cases whether x_{i^*,t} is smaller or larger than 1/2.

Posted in Machine learning, Optimization, Theoretical Computer Science | 2 Comments

Nemirovski’s acceleration

I will describe here the very first (to my knowledge) acceleration algorithm for smooth convex optimization, which is due to Arkadi Nemirovski (dating back to the end of the 70’s). The algorithm relies on a 2-dimensional plane-search subroutine (which, in theory, can be implemented in \log(1/\epsilon) calls to a first-order oracle). He later improved it to only require a 1-dimensional line-search in 1981, but of course the breakthrough that everyone knows about came a year after with the famous 1982 paper by Nesterov that gets rid of this extraneous logarithmic term altogether (and in addition is based on the deep insight of modifying Polyak’s momentum).

Let f be a 1-smooth function. Denote x^{+} = x - \nabla f(x). Fix a sequence (\lambda_t)_{t \in \N}, to be optimized later. We consider the “conjugate” point \sum_{s =1}^t \lambda_s \nabla f(x_s). The algorithm simply returns the optimal combination of the conjugate point and the gradient descent point, that is:

    \[ x_{t+1} = \mathrm{argmin}_{x \in P_t} f(x) \, \text{where} \, P_t = \mathrm{span}\left(x_t^+, \sum_{s =1}^t \lambda_s \nabla f(x_s)\right) \,. \]

Let us denote g_s = \nabla f(x_s) and \delta_s = f(x_s) - f(x^*) for shorthand. The key point is that g_{t+1} \in P_t^{\perp}, and in particular \|\sum_{s \leq t} \lambda_s g_s\|^2 = \sum_{s \leq t} \lambda_s^2 \|g_s\|^2. Now recognize that \|g_s\|^2 is a lower bound on the improvement \delta_s - \delta_{s+1} (here we use that x_{s+1} is better than x_s^+). Thus we get:

    \[ \|\sum_{s \leq t} \lambda_s g_s\|^2 \leq \sum_{s \leq t} \lambda_s^2 (\delta_s - \delta_{s+1}) \leq \sum_{s \leq t} \delta_s (\lambda_s^2 - \lambda_{s-1}^2) \,. \]

In other words if the sequence \lambda is chosen such that \lambda_s = \lambda_s^2 - \lambda_{s-1}^2 then we get

    \[ \|\sum_{s \leq t} \lambda_s g_s\|^2 \leq \sum_{s \leq t} \lambda_s \delta_s \,. \]

This is good because roughly the reverse inequality also holds true by convexity (and the fact that x_s \in P_s so g_s \cdot x_s = 0):

    \[ \sum_{s \leq t} \lambda_s \delta_s \leq \sum_{s \leq t} \lambda_s g_s \cdot (x_s - x^*) \leq \|x^*\| \cdot \| \sum_{s \leq t} \lambda_s g_s\| \,. \]

So finally we get \sum_{s \leq t} \lambda_s \delta_s \leq \|x^*\|^2, and it just remains to realize that \lambda_s is of order s so that \delta_t \leq \|x^*\|^2 / t^2.

Posted in Optimization | 8 Comments

A short proof for Nesterov’s momentum

Yesterday I posted the following picture on Twitter and it quickly became my most visible tweet ever (by far):

    \begin{eqnarray*} & x^+ := x - \eta \nabla f(x) & \text{(gradient step)} \\ & d_t := \gamma_t \cdot (x_{t} - x_{t-1}) & \text{(momentum term)} \\ \\ \text{ [Cauchy, 1847] } & x_{t+1} = x_t^+ & \text{(gradient descent)} \\ \text{ [Polyak, 1964] } & x_{t+1} = x_t^+ + d_{t} & \text{(momentum + gradient)} \\ \text{ [Nesterov, 1983] } & x_{t+1} = (x_t + d_{t})^+ & \text{(momentum + lookahead gradient)} \end{eqnarray*}

I thought this would be a good opportunity to revisit the proof of Nesterov’s momentum, especially since as it turns out I really don’t like the way I described it back in 2013 (and to this day the latter post also remains my most visible post ever…). So here we go, for what is hopefully a short and intuitive proof of the 1/t^2 convergence rate for Nesterov’s momentum (disclaimer: this proof is merely a rearranging of well-known calculations, nothing new is going on here).

We assume that f is \beta-smooth convex function, and we take \eta = 1/\beta in the gradient step. The momentum term \gamma_t will be set to a very particular value, which comes out naturally in the proof.


The two basic inequalities

Let us denote \delta_t = f(x_t) - f(x^*) and g_t = - \frac{1}{\beta} \nabla f(x_t + d_t) (note that x_{t+1} = x_t + g_t + d_t). Now let us write our favorite inequalities (using f(x^+) - f(x) \leq - \frac{1}{2 \beta} |\nabla f(x)|^2 and f(x) - f(y) \leq \nabla f(x) \cdot (x-y)):

    \[ \delta_{t+1} - \delta_t \leq - \frac{\beta}{2} \left( |g_t|^2 + 2 g_t \cdot d_t \right) \,, \]


    \[ \delta_{t+1} \leq - \frac{\beta}{2} \left( |g_t|^2 + 2 g_t \cdot (x_t + d_t - x^*) \right) \,. \]


On the way to a telescopic sum

Recall now that |a|^2 + 2 a \cdot b = |a+b|^2 - |b|^2, so it would be nice to somehow combine the two above inequalities to obtain a telescopic sum thanks to this simple formula. Let us try to take a convex combination of the two inequalities. In fact it will be slightly more elegant if we use the coefficient 1 on the second inequality, so let us do \lambda_t-1 times the first inequality plus 1 times the second inequality. We obtain an inequality whose right hand side is given by -\frac{\beta}{2} times

    \begin{align*} & \lambda_t |g_t|^2 + 2 g_t \cdot (x_t + \lambda_t d_t - x^*) \\ & = \frac{1}{\lambda_t} \left( |x_t + \lambda_t d_t - x^* + \lambda_t g_t|^2 - |x_t + \lambda_t d_t - x^*|^2 \right) \,. \end{align*}

Recall that our objective is to obtain a telescopic sum, and at this point we still have flexibility both to choose \lambda_t and \gamma_t. What we would like to have is:

    \[ x_t + \lambda_t d_t - x^* + \lambda_t g_t = x_{t+1} + \lambda_{t+1} d_{t+1} - x^* \,. \]

Observe that (since d_{t+1} = \gamma_{t+1} \cdot (d_t + g_t)) the right hand side can be written as x_t + g_t + d_t + \lambda_{t+1} \cdot \gamma_{t+1} \cdot (g_t + d_t) - x^*, and thus we see that we simply need to have:

    \[ \lambda_t = 1+ \lambda_{t+1} \cdot \gamma_{t+1} \,. \]


Setting the parameters and concluding the proof

Writing u_t := \frac{\beta}{2} |x_t + \lambda_t d_t - x^*|^2 we now obtain as a result of the combination of our two starting inequalities:

    \[ \lambda_t^2 \delta_{t+1} - (\lambda_t^2 - \lambda_t) \delta_t \leq u_t - u_{t+1} \,. \]

It only remains to select \lambda_t such that \lambda_t^2 - \lambda_t = \lambda_{t-1}^2 (i.e., roughly \lambda_t is of order t) so that by summing the previous inequality one obtains \delta_{t+1} \leq \frac{\beta |x_1 - x^*|^2}{2 \lambda_t^2} which is exactly the 1/t^2 rate we were looking for.

Posted in Optimization | 11 Comments

Remembering Michael

It has been a year since the tragic event of September 2017. We now know what happened, and it is a tremendously sad story of undiagnosed type 1 diabetes.

This summer at MSR Michael was still very present in our discussions, whether it was about some ideas that we discussed that last 2017 summer (acceleration, metrical task systems lower bounds, etc…), or just some random fun story.

I highly recommend to take look at the YouTube videos from the November 2017 symposium in memory of Michael. You can also take a look at his (still growing) list of publications on arxiv. In fact I know of an upcoming major paper so stay tuned (the premises are in Yin Tat’s talk at the symposium).

As always when remembering this tragic loss my thoughts go to Michael’s family.

Posted in Uncategorized | Leave a comment

How to generalize (algorithmically)

A couple of months ago I taught an introduction to statistical learning theory. I took inspiration from two very good introductory books on SLT: “Foundations of ML”, and “Understanding Machine Learning: From Theory to Algorithms”. I also covered some classical results about nearest neighbors from the seminal book “A Probabilistic Theory of Pattern Recognition”. The lectures were recorded and I put the videos on the youtube channel. Sadly I still did not learn the basic lesson to stick to a black pen when teaching on a whiteboard, so at (many) times it is hard to read what I write. I also got sick after the first lecture so my voice is quite weak in the following lectures, making it hard to hear too… Despite these fatal flaws some people still found the material useful, so check it out if you are interested. Below I give a rough syllabus for the lectures, and I also give some notes on “algorithmic generalization” which was in my opinion the highlight of the lectures (you can also check my earlier notes on SLT here and here).


Lecture 1: Introduction to the statistical learning theory framework, its basic question (sample complexity) and its canonical settings (linear classification, linear regression, logistic regression, SVM, neural networks). Two basic methods for learning: (i) Empirical Risk Minimization, (ii) Nearest neighbor classification.

Lecture 2: Uniform law of large numbers approach to control the sample complexity of ERM (includes a brief reminder of concentration inequalities). Application: analysis of bounded regression (includes the non-standard topic of type/cotype and how it relates to different regularizations such as in LASSO).

Lecture 3: Reminder of the first two lectures and relation with the famous VC dimension. How to generalize beyond uniform law of large numbers: stability and robustness approaches (see below).

Lecture 4: How to generalize beyond uniform law of large numbers: information theoretic perspective (see below), PAC-Bayes, and online learning. Brief discussion of margin theory, and an introduction to modern questions in robust machine learning.


Some notes on algorithmic generalization

Let \mathcal{X}, \mathcal{Y} be input/output spaces. Let \ell : \mathcal{Y} \times \mathcal{Y} \rightarrow [0,1] be a loss function, \mu a probability measure supported on \mathcal{X} \times \mathcal{Y}, and \underline{h} : (\mathcal{X} \times \mathcal{Y})^m \rightarrow \mathcal{Y}^\mathcal{X} a learning rule (in words \underline{h} takes as input a dataset of m examples, and output a mapping from \mathcal{X}-inputs to \mathcal{Y}-outputs). With a slight abuse of notation, for z=(x,y) \in \mathcal{X} \times \mathcal{Y} and h \in \mathcal{Y}^{\mathcal{X}}, we write \ell(h, z) := \ell(h(x), y). We define the generalization of \underline{h} on \mu by:

    \[ \mathrm{gen}(\underline{h} , \mu) = \mathbb{E}_{S=(z_1,\hdots,z_m) \sim \mu^{\otimes m}, z \sim \mu} \left[ \ell(\underline{h}(S), z) - \frac1{m} \sum_{i=1}^m \ell(\underline{h}(S), z_i) \right] \,. \]

In words, if \mathrm{gen}(\underline{h} , \mu) \leq \epsilon then we expect the empirical performance of the learned classifier \underline{h}(S) to be representative of its performance on a fresh out-of-sample data point, up to an additive \epsilon. The whole difficulty of course is that the empirical evaluation is done with the same dataset S that is used for training, leading to non-trivial dependencies. We should also note that in many situations one might be interested in the two-sided version of the generalization, as well as high probability bounds instead of bounds in expectation. For simplicity we focus on \mathrm{gen}(\underline{h} , \mu) here.

The most classical approach to controlling generalization, which we covered in details in previous notes, is via uniform law of large numbers. More precisely assuming that the range of the learning rule is some hypothesis class \mathcal{H} one trivially has

    \[ \mathrm{gen}(\underline{h} , \mu) \leq \mathbb{E}_{S, z} \,\, \sup_{h \in \mathcal{H}} \left[ \frac1{m} \sum_{i=1}^m \ell(h, z_i) - \ell(h, z) \right] \,. \]

However this approach might be too coarse when the learning rule is searching through a potentially huge space of hypothesis (such as in the case of neural networks). Certainly such uniform bound has no chance of explaining why neural networks with billions of parameters would generalize with a data set of merely millions of examples. For this one has to use algorithm-based arguments.


The classical example of algorithmic generalization is due to Bousquet and Elisseeff 2002. It is a simple rewriting of the generalization as a stability notion:

    \begin{eqnarray*} \mathrm{gen}(\underline{h} , \mu) & = & \mathbb{E}_{S, z, i \sim [n]} \left[\ell(\underline{h}(S), z) - \ell(\underline{h}(S), z_i) \right] \\ & = & \mathbb{E}_{S, z, i \sim [n]} \left[\ell(\underline{h}(S^{-i}), z_i) - \ell(\underline{h}(S), z_i) \right] \,, \end{eqnarray*}

where S^{-i}=(z_1,\hdots, z_{i-1}, z, z_{i+1}, \hdots, z_m). This viewpoint can be quite enlightening. For example in the uniform law of large numbers view, regularization enforces small capacity, while in the stability view we see that regularization ensures that the output hypothesis is not too brittle (this was covered in some details in the previous notes).


The next approach I would like to discuss is related to deep questions about current machine learning methods. One of the outstanding problem in machine learning is that current algorithms are not robust to even mild shift of distribution at test time. Intuitively this lack of robustness seem to indicate a lack of generalization. Can we formalize this intuition? I will now give one such formal link between robustness and generalization due to Xu and Mannor 2010, which shows the reverse direction (robustness implies generalization). At some level robustness can be viewed as a “stability at test time” (while in Bousquet and Elisseeff we care about “stability at training time”).

Xu and Mannor define (K, \epsilon(\cdot))-robustness as follows: assume that \mathcal{Z} can be partitioned into K sets \{C_k\}_{k \in [K]} such that if z and z_i are in the same set C_k then

    \[ | \ell(\underline{h}(S), z_i) - \ell(\underline{h}(S), z) | \leq \epsilon(S) \,. \]

A good example to have in mind would be a binary classifier with large margin, in which case K corresponds to the covering number of \mathcal{X} at the scale given by the margin. Another (related) example would be regression with a Lipschitz function. In both cases K would be typically exponential in the dimension of \mathcal{X}. The key result of Xu and Mannor that we prove next is a generalization bound of order \sqrt{K / m}. In any situation of interest this seems to me to be a pretty weak bound, yet on the other hand I find the framework to be very pretty and it is of topical interest. I would be surprised if this was the end of the road in the space of “generalization and robustness”.

Theorem (Xu and Mannor 2010):

A (K, \epsilon(\cdot))-robust learning rule satisfies

    \[ \mathrm{gen}(\underline{h} , \mu) \leq \mathbb{E}_{S} \, \epsilon(S) + \sqrt{\frac{K}{m}} \,. \]

Proof: Let N_k = |S \cap C_k| and note that \mathbb{E}_S \, N_k = m \mu(C_k). Now one has for a robust h:

    \begin{align*} \mathbb{E}_z \ \ell(h, z) & =\sum_{k=1}^K \mu(C_k) \mathbb{E}_z [\ell(h,z) | z \in C_k] \\ & = \sum_{k=1}^K \frac{N_k}{m} \mathbb{E}_z [\ell(h,z) | z \in C_k] + \sum_{k=1}^K \left(\mu(C_k) - \frac{N_k}{m}\right) \mathbb{E}_z [\ell(h,z) | z \in C_k] \\ & \leq \epsilon(S) + \frac{1}{m} \sum_{i = 1}^m \ell(h, z_i) + \sum_{k=1}^K \left|\mu(C_k) - \frac{N_k}{m}\right| \, . \end{align*}

It only remains to observe that

    \[ \mathbb{E}_S \, \sum_{k=1}^K \left|\mu(C_k) - \frac{N_k}{m}\right| \leq \frac{1}{m} \sum_{k=1}^K \sqrt{\mu(C_k)} \leq \sqrt{\frac{K}{m}} \, . \]


Information theoretic perspective

Why do we think that a lack of robustness indicate a lack of generalization? Well it seems to me that a basic issue could simply be that the dataset was memorized by the neural network (which be a very non-robust way to learn). If true then one could basically find all the information about the data in the weights of the neural network. Again, can we prove at least the opposite direction, that is if the output hypothesis does not retain much information from the dataset then it must generalize. This is exactly what Russo and Zou 2016, where they use the mutual information \mathrm{I}(\underline{h}(S), S) as a measure of the “information” retained by the trained hypothesis about the dataset. More precisely they show the following result:

Theorem (Russo and Zou 2016):

    \[ \mathrm{gen}(\underline{h} , \mu) \leq \sqrt{\frac{\mathrm{I}(\underline{h}(S), S)}{2 m}} \, . \]

Note that here we have assumed that the codomain of the learning rule \underline{h} consists of deterministic maps from inputs to outputs, in which case the mutual information \mathrm{I}(\underline{h}(S), S) is simply the entropy \mathrm{H}(\underline{h}(S)). However the proof below also applies to the case where the codomain of the learning rule consists of probability measures, see e.g., Xu and Raginsky 2017. Let us now conclude this (long) post with the proof of the above theorem.

The key point is very simple: one can view generalization as a decoupling property by writing:

    \[ \mathrm{gen}(\underline{h} , \mu) = \mathbb{E}_{S, S' \sim \mu^{\otimes m}} [ F(\underline{h}(S), S') - F(\underline{h}(S), S) ] \,, \]

where F(h, S) = \frac{1}{m} \sum_{z \in S} \ell(h, z).

Now the theorem follows straightforwardly (if one knows Hoeffding’s lemma) from an application of the following beautiful lemma:


Let f: \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}. Let (X,Y) be random variables in \mathcal{X} \times \mathcal{Y} and \bar{X}, \bar{Y} be mutually independent copies of X and Y. Assume that f(\bar{X}, \bar{Y}) is \sigma-subgaussian (i.e., \log \mathbb{E} \exp( \lambda (f(\bar{X}, \bar{Y}) - \mathbb{E} f) ) \leq \lambda^2 \sigma^2 /2, \forall \lambda) then

    \[ \mathbb{E}[ f(X,Y) - f(\bar{X}, \bar{Y}) ] \leq \sqrt{2 \sigma^2 \mathrm{I}(X, Y)} \]

Proof: The mutual information is equal to the relative entropy between the distribution of (X,Y) and the distribution of (\bar{X}, \bar{Y}). Recall also the variational representation of the relative entropy which is that the map \mu \mapsto \mathrm{Ent}(\mu, \nu) is the convex conjugate of the log-partition function \Phi_{\nu}(g) := \log(\int \exp(g(x)) d \nu(x)). In particular one has a lower bound on the mutual information for any such g which means:

    \[ \mathrm{I}(X, Y) \geq \mathbb{E}[ g(X,Y) ] - \log \mathbb{E} [ \exp(g(\bar{X},\bar{Y})) ] \,. \]

Now it only remains to use the definition of subgaussianity, that is take g = \lambda f, and optimize over \lambda.

Posted in Machine learning | Leave a comment

Some updates

The blog has been eerily silent for most of 2018, here is why:

  • The main culprit is definitely the COLT 2018 chairing. This year we received a surprise 50% increase in number of submissions. This is great news for the community, and it led to a truly fantastic program. We are looking forward to see many of you in Stockholm! Note that the early registration deadline is tomorrow (May 30th).
  • The first volume of our new Mathematical Statistics and Learning journal just got published. If you care about being evaluated by top experts, and/or get the visibility of being in a highly selective mathematical journal, you should consider submitting your highest quality work there!
  • The first set of video lectures on the youtube channel is now complete. I realize that the quality is not perfect and that it is hard to read the blackboard. I will try to improve this for the future videos (which should be posted some time during the summer).
  • Finally I am quite excited to have a first preprint (joint work with Eric Price and Ilya Razenshteyn) at least loosely related to deep learning. It is titled “Adversarial examples from computational constraints”, which sums up the paper pretty well. In a nutshell, we prove that avoiding adversarial examples is computationally hard, even in situations where there exists a very robust classifier. We are looking forward to get the deep learning community’s feedback on this work!
Posted in Announcement | 5 Comments

Youtube channel

I just started a youtube channel. The hope is that this will be a companion to the blog. Some of the posts will “graduate” into videos (e.g., the first set of videos will correspond to an expanded version of bandit lecture notes –part 1, part 2-, and the second set will correspond to the crash course in learning theory –part 1, part 2-). But the mid/long-term plan is to also have dedicated content on the youtube channel (I still have to figure out the exact format). I personally like a lot to watch math videos online, so I hope the channel will be a nice addition to this growing material (some of my favorite existing channels include the Simons Institute, IAS, Princeton TCS, Harvard Math, IHESa new MIT data science channel, feel free to add yours in the comments below).

Posted in Announcement | Leave a comment