Convex body chasing, Steiner point, Sellke point, and SODA 2020 best papers

Big congratulations to my former intern Mark Sellke, and to the CMU team (C. J. Argue, Anupam Gupta, Guru Guruganesh, and Ziye Tang) for jointly winning the best paper award at SODA 2020 (as well as the best student paper for Mark)! They obtain a competitive ratio linear in the dimension for convex body chasing, a problem which was entirely open for any d \geq 3 just two years ago. What’s more, they found the algorithm (and proof) from The Book! Let me explain.

 

Convex body chasing

In convex body chasing an online algorithm is presented at each time step with a convex body K_t \subset \mathbb{R}^d, and it must choose a point x_t \in K_t. The online algorithm is trying to minimize its total movement, namely \sum_{t \geq 0} \|x_{t+1} - x_{t}\| (all the norms here will be Euclidean, for simplicity). To evaluate the performance of the algorithm, one benchmarks it against the smallest movement one could achieve for this sequence of bodies if one knew the whole sequence in advance. A good example to have in mind: if the sequence K_t corresponds to lines rotating around some fixed point, then the best thing to do is to move to this fixed point and never move again, while a greedy algorithm might have infinite movement (in the continuous time limit). The competitive ratio is the worst case (over all possible sequences of convex bodies) ratio of the algorithm’s performance to that of the oracle optimum. This problem was introduced in 1993 by Linial and Friedman, mainly motivated by considering a more geometric version of the k-server problem.
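
To make the rotating-lines example concrete, here is a minimal numerical sketch (my own toy illustration, with made-up parameters): the bodies are lines through the origin in \mathbb{R}^2 rotating by a small angle at each step, the greedy algorithm moves to the closest point of the new line (Euclidean projection), and the offline optimum simply moves to the origin once. As the rotation gets finer the greedy movement blows up like one over the rotation angle, while the offline cost stays fixed.

```python
import numpy as np

def chase_rotating_lines(dtheta, start=np.array([1.0, 0.0])):
    """Greedily chase lines through the origin that rotate by dtheta per step,
    until the greedy point has essentially reached the rotation center."""
    x = start.copy()
    greedy_cost, angle = 0.0, 0.0
    while np.linalg.norm(x) > 1e-3:
        angle += dtheta
        u = np.array([np.cos(angle), np.sin(angle)])  # direction of the new line
        x_new = np.dot(x, u) * u                      # greedy move = projection onto the line
        greedy_cost += np.linalg.norm(x_new - x)
        x = x_new
    opt_cost = np.linalg.norm(start)                  # offline: move to the center once and stay
    return greedy_cost, opt_cost

for dtheta in [0.2, 0.05, 0.01]:
    greedy, opt = chase_rotating_lines(dtheta)
    print(f"dtheta={dtheta:5}: greedy movement ~ {greedy:6.1f}, offline optimum = {opt:.1f}")
```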

As it turns out, this problem in fact has a lot of “real” applications, which is not too surprising given how elementary the problem is. Just to give a flavor, Adam Wierman and friends show that dynamic powering of data centers can be viewed as an instance of convex *function* chasing. In convex function chasing, instead of a convex body one gets a convex function f_t, and there is then a movement cost \|x_t - x_{t-1}\| and a *service cost* f_t(x_t). While this is more general than convex body chasing, it turns out that convex body chasing in dimension d+1 is enough to solve convex function chasing in dimension d (this is left to the reader as an exercise). Roughly speaking, in dynamic power management one controls various power levels (number of servers online, etc.), and a request corresponds to a new set of jobs. Turning servers on/off has an associated cost (the movement cost), while servicing the jobs with a certain number of servers has an associated delay (the service cost). You get the idea.

 

The Steiner point

In a paper to appear at SODA too, we propose (with Yin Tat Lee, Yuanzhi Li, Bo’az Klartag, and Mark Sellke) to use the Steiner point for the nested convex body chasing problem. In nested convex body chasing the sequence of bodies is nested, and to account for irrelevant additive constants we assume that the oracle optimum starts from the worst point in the starting set K_0. In other words, opt starts at the point of K_0 that is furthest away from K_T, so its movement is exactly equal to the Hausdorff distance between K_0 and K_T. So all we have to do is to give an online algorithm whose movement is proportional to this Hausdorff distance; in other words, we need some kind of Lipschitzness in Hausdorff distance. Well, it turns out that mathematicians have been looking at this for centuries already, and Steiner back in the 1800’s introduced just what we need: a selector (a map from convex sets to points in them, i.e. S(K) \in K for any convex set K) which is Lipschitz with respect to the Hausdorff distance, that is:

    \[ \|S(K) - S(K')\| \leq L \cdot \mathrm{dist}_H(K,K') \,. \]

Note that obtaining such a selector is not trivial; for example the first natural guess would be the center of gravity, but it turns out that this is not Lipschitz (consider for example what happens when K is a very thin triangle and K' is the base of this triangle). The crazy thing is that this Steiner point selector (to be defined shortly) is not only Lipschitz, it is in fact the selector with the smallest possible Lipschitz constant. How would you prove such a thing? Well, you clearly need a miracle to happen, and the miracle here is that the Steiner point satisfies some symmetries which define it uniquely. From there all you need to show is that, starting from any selector, you can enforce some of these symmetries while only improving the Lipschitz constant (for more on this see the references in the linked paper above).

OK, so what is the Steiner point? It has many definitions, but a particularly appealing one from an optimizer’s viewpoint is as follows. For a convex set K and a vector \theta \in \mathbb{R}^d let g_K(\theta) be the maximizer of the linear function x \mapsto \theta \cdot x over the set K. Then the Steiner point of K, \mathrm{St}(K), is defined by:

    \[ \mathrm{St}(K) = \mathbb{E}_{\theta : \|\theta\| \leq 1} [g_K(\theta)] \,. \]

In words: for any direction take the furthest point in K in that direction, and average over all directions. This feels like a really nice object, and clearly it satisfies \mathrm{St}(K) \in K (i.e., it is a selector). The algorithm we proposed in our SODA paper is simply to move to x_t = \mathrm{St}(K_t). Let us now see how to analyze this algorithm.
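
Before getting to the analysis, here is a minimal Monte Carlo sketch of the Steiner point (an illustrative toy, not the implementation from the paper): the body is assumed to be a polytope \{x : Ax \leq b\}, g_K(\theta) is computed with a linear program, and since g_K(\theta) only depends on the direction of \theta, averaging over uniform directions on the sphere gives the same answer as averaging over the ball.

```python
import numpy as np
from scipy.optimize import linprog

def steiner_point(A, b, n_samples=2000, seed=0):
    """Monte Carlo estimate of St(K) for the polytope K = {x : A x <= b}."""
    rng = np.random.default_rng(seed)
    d = A.shape[1]
    total = np.zeros(d)
    for _ in range(n_samples):
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)        # uniform random direction
        # g_K(theta): maximize theta . x over K, i.e. minimize -theta . x
        res = linprog(-theta, A_ub=A, b_ub=b, bounds=[(None, None)] * d)
        total += res.x
    return total / n_samples

# Example: the triangle with vertices (0,0), (1,0) and (0,1).
A = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
b = np.array([0.0, 0.0, 1.0])
print(steiner_point(A, b))                    # a specific point inside the triangle
```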

 

The nested case proof

First we give an alternative formula for the Steiner point. Denote h_K(\theta) = \max_{x \in K} \theta \cdot x, and observe that g_K(\theta) = \nabla h_K(\theta) for almost all \theta. Thus using the divergence theorem one obtains:

    \[ \mathrm{St}(K) = d \cdot \mathbb{E}_{\theta : \|\theta\| = 1} [\theta h_K(\theta)] \,. \]

(The factor d comes from the ratio of the volume of the ball to the surface area of the sphere, and the \theta in the expectation appears because the outer normal to the sphere at the point \theta is \theta itself.)

Now we have the following one-line inequality to control the movement of the Steiner point:

    \begin{align*} \|\mathrm{St}(K) - \mathrm{St}(K')\| & = d \cdot \| \mathbb{E}_{\theta : \|\theta\| = 1} [\theta (h_K(\theta) - h_{K'}(\theta))] \| \\ & \leq d \cdot \mathbb{E}_{\theta : \|\theta\| = 1} [ |h_K(\theta) - h_{K'}(\theta)| ] \end{align*}

It only remains to observe that |h_K(\theta) - h_{K'}(\theta)| \leq \mathrm{dist}_H(K,K') to obtain a proof that Steiner is d-Lipschitz (in fact as an exercise you can try to improve the above argument to obtain O(\sqrt{d})-Lipschitz). Now for nested convex body chasing we will have K' \subset K so that |h_K(\theta) - h_{K'}(\theta)| = h_K(\theta) - h_{K'}(\theta) (i.e., no absolute values), and thus the upper bound on the movement will telescope! More precisely:

    \begin{align*} \sum_{t \geq 0} \|\mathrm{St}(K_t) - \mathrm{St}(K_{t+1})\| & \leq d \cdot \sum_{t \geq 0} \mathbb{E}_{\theta : \|\theta\| = 1} [ h_{K_t}(\theta) - h_{K_{t+1}}(\theta) ] \\ & = d \cdot \mathbb{E}_{\theta : \|\theta\| = 1} [ h_{K_0}(\theta) - h_{K_{T}}(\theta) ] \\ & \leq d \cdot \mathrm{dist}_H(K_0, K_T) \,. \end{align*}

This proves that the Steiner point is d-competitive for nested convex body chasing. How to generalize this to the non-nested case seemed difficult, and the breakthrough of Mark and the CMU team is to bring back an old friend of the online algorithm community: the work function.

 

The work function

The work function W_t : \mathbb{R}^d \rightarrow \mathbb{R} is defined as follows. For any point x \in \mathbb{R}^d it is the smallest cost one can pay to satisfy all the requests K_0, \hdots, K_t and end up at x. First observe that this function is convex. Indeed take two points x and y, and let z be the midpoint of x and y (any point on the segment between x and y is treated similarly). In the definition of the work function there is an associated trajectory for both x and y. By convexity of the requests, the sequence of midpoints of those two trajectories is a valid trajectory for z! And moreover the movement of this mid trajectory is (by the triangle inequality) at most the average of the movements of the trajectories for x and y. Hence W_t((x+y)/2) \leq (W_t(x) + W_t(y))/2.

A very simple property we will need from the work function is that the norm of its gradient carries some information about the current request. Namely, x \not\in K_t \Rightarrow \|\nabla W_t(x)\| = 1 (indeed, if x is not in the current request set, then the best way to end up there is to move to a point z in K_t and then move to x; if K_t is a polytope then moving x a little bit does not move z, so the cost changes at rate 1, hence the gradient has norm 1). Or to put it differently: \|\nabla W_t(x)\| < 1 \Rightarrow x \in K_t.
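
Here is a small sketch of the work function in dimension one (a toy on a grid, purely to build intuition; the starting point and the interval requests are arbitrary choices). It uses the recursion W_t(x) = \min_{y \in K_t} \{ W_{t-1}(y) + |x - y| \}, which is valid because W_{t-1} is 1-Lipschitz, and checks numerically that the slope has magnitude 1 outside the current request, as claimed above.

```python
import numpy as np

xs = np.linspace(-2.0, 2.0, 801)                  # grid discretization of the real line

def work_functions_1d(requests, x0=0.0):
    """Work functions W_0, W_1, ... on the grid, for interval requests K_t = [a_t, b_t],
    starting from x0 (before any request the cost to be at x is |x - x0|)."""
    W = np.abs(xs - x0)
    history = []
    for (a, b) in requests:
        inside = (xs >= a) & (xs <= b)
        # W_t(x) = min_{y in K_t} W_{t-1}(y) + |x - y|, by brute force over the grid
        W = np.min(W[inside][None, :] + np.abs(xs[:, None] - xs[inside][None, :]), axis=1)
        history.append(W.copy())
    return history

requests = [(0.5, 1.5), (-1.0, 0.0), (1.0, 2.0)]
W_T = work_functions_1d(requests)[-1]
slope = np.gradient(W_T, xs)
a, b = requests[-1]
outside = (xs < a - 0.05) | (xs > b + 0.05)
print("|slope| outside K_T:", np.abs(slope[outside]).min(), "to", np.abs(slope[outside]).max())  # both ~ 1
```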

 

The Sellke point

Mark’s beautiful idea (and the CMU team’s closely related, in fact equivalent, idea) to generalize the Steiner point to the non-nested case is to use the work function as a surrogate for the request. The Steiner point of K_t will clearly not work, since not all of K_t matters in the same way: some points might be essentially irrelevant because they are very far from the previous requests, a fact which is uncovered by the work function (while on the other hand the random direction from the Steiner point definition is oblivious to the geometry of the previous requests). So how do we pick an appropriately random point in K_t while respecting the work function structure? Well, we just saw that \|\nabla W_t(x)\| < 1 \Rightarrow x \in K_t. So how about taking a random direction \theta, and applying the inverse of the gradient map, namely the gradient of the Fenchel dual W_t^*? Recall that the Fenchel dual is defined by:

    \[ W_t^*(\theta) = \max_{x \in \mathbb{R}^d} \theta \cdot x - W_t(x) \,. \]

This is exactly the algorithm Mark proposes and it goes like this (Mark calls it the *functional Steiner point*):

    \[ \mathrm{Se}(K_t) = \mathbb{E}_{\theta : \|\theta\| < 1} [\nabla W_t^*(\theta)] \,. \]

Crucially this is a valid point, namely \mathrm{Se}(K_t) \in K_t. Moreover just like for the Steiner point we can apply the divergence theorem and obtain:

    \[ \mathrm{Se}(K_t) = d \cdot \mathbb{E}_{\theta : \|\theta\| = 1} [\theta W_t^*(\theta)] \,. \]

The beautiful thing is that we can now exactly repeat the nested convex body argument, since W_{t+1}^*(\theta) \leq W_t^*(\theta) for all \theta (just like we had in the nested case h_{K_{t+1}}(\theta) \leq h_{K_t}(\theta)) and so we get:

    \[ \sum_{t \geq 0} \|\mathrm{Se}(K_t) - \mathrm{Se}(K_{t+1})\| \leq d \cdot \mathbb{E}_{\theta : \|\theta\| = 1} [ W_{0}^*(\theta) - W^*_{{T}}(\theta) ] \]

The first term with W_0^* is just some additive constant which we can ignore, while the second term is bounded as follows:

    \begin{align*} \mathbb{E}_{\theta : \|\theta\| = 1} [ - W^*_{T}(\theta) ] & = \mathbb{E}_{\theta : \|\theta\| = 1} [ \min_{x \in \mathbb{R}^d} W_T(x) - \theta \cdot x ] \\ & \leq \min_{x \in \mathbb{R}^d} \mathbb{E}_{\theta : \|\theta\| = 1}[ W_T(x) - \theta \cdot x ] \\ & = \min_{x \in \mathbb{R}^d} W_T(x) \,. \end{align*}

Thus we exactly proved that the movement of the Sellke point is upper bounded by a constant plus d times the minimum of the work function, and the latter is nothing but the value of the oracle optimum!

Note that once everything is said and done, the proof has only two inequalities (the triangle inequality as in the nested case, and the fact that the expectation of a minimum is less than the minimum of the expectations). It doesn’t get any better than this!

Posted in Theoretical Computer Science

Guest post by Julien Mairal: A Kernel Point of View on Convolutional Neural Networks, part II

This is a continuation of Julien Mairal‘s guest post on CNNs, see part I here.

Stability to deformations of convolutional neural networks

In their ICML paper Zhang et al. introduce a functional space for CNNs with one layer, by noticing that for some dot-product kernels, smoothed variants of rectified linear unit activation functions (ReLU) live in the corresponding RKHS, see also this paper and that one. By following a similar reasoning with multiple layers, it is then possible to show that the functional space described in part I \{ f_w: x \mapsto \langle w , \Phi_n(x_0) \rangle; w \in L^2(\Omega,\mathcal{H}_n) \} contains CNNs with such smoothed ReLU, and that the norm \|f_w\| of such networks can be controlled by the spectral norms of filter matrices. This is consistent with previous measures of complexity for CNNs, see this paper by Bartlett et al.

A perhaps more interesting finding is that the abstract representation \Phi_n(x), which only depends on the network architecture, may provide near-translation invariance and stability to small image deformations while preserving information—that is, x can be recovered from \Phi_n(x). The original characterization we use was introduced by Mallat in his paper on the scattering transform—a multilayer architecture akin to CNNs based on wavelets, and was extended to \Phi_n by Alberto Bietti, who should be credited for all the hard work here.

Our goal is to understand under which conditions it is possible to obtain a representation that (i) is near-translation invariant, (ii) is stable to deformations, (iii) preserves signal information. Given a C^1-diffeomorphism \tau: \mathbb{R}^2 \to \mathbb{R}^2 and denoting by L_\tau x(u) = x(u-\tau(u)) its action operator (for an image defined on the continuous domain \mathbb{R}^2), the main stability bound we obtain is the following one (see Theorem 7 in Mallat’s paper): if \|\nabla \tau\|_\infty \leq 1/2, then for all x,

    \[ \| \Phi_n(L_\tau x) - \Phi_n(x)\| \leq \left ( C_1 (1+n) \|\nabla \tau\|_\infty + \frac{C_2}{\sigma_n} \|\tau\|_\infty \right) \|x\|, \]

where C_1, C_2 are universal constants, \sigma_n is the scale parameter of the pooling operator A_n corresponding to the “amount of pooling” performed up to the last layer, \|\tau\|_\infty is the maximum pixel displacement and \|\nabla \tau\|_\infty represents the maximum amount of deformation, see the paper for the precise definitions of all these quantities. Note that when C_2/\sigma_n \to 0, the representation \Phi_n becomes translation invariant: indeed, consider the particular case of \tau being a translation, then \nabla \tau=0 and \|\Phi_n(L_\tau x) - \Phi_n(x)\| \to 0.

The stability bound and a few additional results tell us a few things about the network architecture: (a) small patches lead to more stable representations (the dependency is hidden in C_1); (b) signal preservation for discrete signals requires small subsampling factors (and thus small pooling) between layers. In such a setting, the scale parameter \sigma_n still grows exponentially with n and near translation invariance may be achieved with several layers.

Interestingly, we may now come back to the Cauchy-Schwarz inequality from part 1, and note that if \Phi_n is stable, the RKHS norm \|f\| is then a natural quantity that provides stability to deformations to the prediction function f, in addition to measuring model complexity in a traditional sense.

Feature learning in RKHSs and convolutional kernel networks

The previous paragraph is devoted to the characterization of convolutional architectures such as CNNs but the previous kernel construction can in fact be used to derive more traditional kernel methods. After all, why should one spend efforts defining a kernel between images if not to use it?

This can be achieved by considering finite-dimensional approximations of the previous feature maps. In order to shorten the presentation, we simply describe the main idea based on the Nystrom approximation and refer to the paper for more details. Approximating the infinite-dimensional feature maps x_k (see the figure at the top of part I) can be done by projecting each point in \mathcal{H}_k onto a p_k-dimensional subspace \mathcal{F}_k leading to a finite-dimensional feature map \tilde{x}_k akin to CNNs, see the figure at the top of the post.

By parametrizing \mathcal{F}_k=\text{span}(\varphi_k(z_1),\varphi_k(z_2),\ldots,\varphi_k(z_{p_k})) with p_k anchor points Z=[z_1,\ldots,z_{p_k}], and using a dot-product kernel, a patch z from \tilde{x}_{k-1} is encoded through the mapping function

    \[ \psi_k(z) = \|z\| \kappa_k( Z^\top Z)^{-1/2} \kappa_k\left( Z^\top \frac{z}{\|z\|} \right), \]

where \kappa_k is applied pointwise. Then, computing \tilde{x}_k from \tilde{x}_{k-1} admits a CNN interpretation, where only the normalization and the matrix multiplication by \kappa_k( Z^\top Z)^{-1/2} are not standard operations. It remains now to choose the anchor points:

  • kernel approximation: a first approach consists of using a variant of the Nystrom method, see this paper and that one. When plugging the corresponding image representation in a linear classifier, the resulting approach behaves as a classical kernel machine. Empirically, we observe that the higher the number of anchor points, the better the kernel approximation, and the higher the accuracy. For instance, a two-layer network with a 300k-dimensional representation achieves about 86\% accuracy on CIFAR-10 without data augmentation (see here).
  • back-propagation, feature selection: learning the anchor points Z can also be done as in a traditional CNN, by optimizing them end-to-end. This allows using deeper lower-dimensional architectures and empirically seems to perform better when enough data is available, e.g., 92\% accuracy on CIFAR-10 with simple data augmentation. There, the subspaces \mathcal{F}_k are not learned anymore to provide the best kernel approximation, but the model seems to perform a sort of feature selection in each layer’s RKHS \mathcal{H}_k, which is not well understood yet (This feature selection interpretation is due to my collaborator Laurent Jacob).

Note that the first CKN model published here was based on a different approximation principle, which was not compatible with end-to-end training. We found this to be less scalable and less effective.
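
To make the encoding \psi_k above concrete, here is a minimal numpy sketch (my own toy rendering of the formula, with an arbitrary parameter \alpha, random anchor points, and a small ridge term added before taking the inverse square root); the last line compares the Nystrom approximation \langle \psi(z), \psi(z') \rangle with the exact kernel value, and the quality of the match depends on the anchors.

```python
import numpy as np

alpha = 1.0                                            # kernel parameter (arbitrary choice)
kappa = lambda u: np.exp(alpha * (u - 1.0))            # dot-product kernel, applied pointwise

def homogeneous_kernel(z, zp):
    """K_k(z, z') = ||z|| ||z'|| kappa(<z/||z||, z'/||z'||>)."""
    nz, nzp = np.linalg.norm(z), np.linalg.norm(zp)
    return nz * nzp * kappa(z @ zp / (nz * nzp))

def nystrom_encoder(Z, ridge=1e-8):
    """Return psi with psi(z) = ||z|| kappa(Z^T Z)^{-1/2} kappa(Z^T z/||z||), for unit-norm anchors Z (d x p)."""
    G = kappa(Z.T @ Z) + ridge * np.eye(Z.shape[1])    # p x p Gram matrix of the anchors
    w, V = np.linalg.eigh(G)
    G_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T
    def psi(z):
        nz = np.linalg.norm(z)
        return nz * (G_inv_sqrt @ kappa(Z.T @ (z / nz))) if nz > 0 else np.zeros(Z.shape[1])
    return psi

rng = np.random.default_rng(0)
d, p = 27, 64                                          # e.g. 3x3 RGB patches, 64 anchor points
Z = rng.standard_normal((d, p))
Z /= np.linalg.norm(Z, axis=0)                         # unit-norm anchor points
psi = nystrom_encoder(Z)
z, zp = rng.standard_normal(d), rng.standard_normal(d)
print(psi(z) @ psi(zp), homogeneous_kernel(z, zp))     # Nystrom approximation vs exact kernel value
```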

Other links between neural networks and kernel methods

Finally, other links between kernels and infinitely-wide neural networks with random weights are classical, but they were not the topic of this blog post (they should be the topic of another one!). In a nutshell, for a large collection of weights distributions and nonlinear functions s: \mathbb{R} \to \mathbb{R}, the following quantity admits an analytical form

    \[ K(x,x') = \mathbb{E}_{w}[ s(w^\top x) s(w^\top x')], \]

where the terms s(w^\top x) may be seen as an infinitely-wide single-layer neural network. The first time such a relation appears is likely to be in the PhD thesis of Radford Neal with a Gaussian process interpretation, and it was revisited later by Le Roux and Bengio and by Cho and Saul with multilayer models.

In particular, when s is the rectified linear unit and w follows a Gaussian distribution, it is known that we recover the arc-cosine kernel. We may also note that random Fourier features also yield a similar interpretation.

Other important links have also been drawn recently between kernel regression and strongly over-parametrized neural networks, see this paper and that one, which is another exciting story.

Posted in Machine learning

Guest post by Julien Mairal: A Kernel Point of View on Convolutional Neural Networks, part I

 

 

I (n.b., Julien Mairal) have been interested in drawing links between neural networks and kernel methods for some time, and I am grateful to Sebastien for giving me the opportunity to say a few words about it on his blog. My initial motivation was not to provide another “why deep learning works” theory, but simply to encode into kernel methods a few successful principles from convolutional neural networks (CNNs), such as the ability to model the local stationarity of natural images at multiple scales—we may call that modeling receptive fields—along with feature compositions and invariant representations. There was also something challenging in trying to reconcile end-to-end deep neural networks and non-parametric methods based on kernels that typically decouple data representation from the learning task.

The main goal of this blog post is then to discuss the construction of a particular multilayer kernel for images that encodes the previous principles, derive some invariance and stability properties for CNNs, and also present a simple mechanism to perform feature learning in reproducing kernel Hilbert spaces. In other words, we should not see any intrinsic contradiction between kernels and representation learning.

Preliminaries on kernel methods

Given data living in a set \mathcal{X}, a positive definite kernel K: \mathcal{X} \times \mathcal{X} \to \mathbb{R} implicitly defines a Hilbert space \mathcal{H} of functions from \mathcal{X} to \mathbb{R}, called reproducing kernel Hilbert space (RKHS), along with a mapping function \varphi: \mathcal{X} \to \mathcal{H}.

A predictive model f in \mathcal{H} associates to every point x a label in \mathbb{R}, and admits a simple form f(x) =\langle f, \varphi(x) \rangle_{\mathcal{H}}. Then, Cauchy-Schwarz inequality gives us a first basic stability property

    \[ \forall x, x'\in \mathcal{X},~~~~~ |f(x)-f(x')| \leq \|f\|_{\mathcal{H}} \| \varphi(x) - \varphi(x')\|_\mathcal{H}. \]

This relation exhibits a discrepancy between neural networks and kernel methods. Whereas neural networks optimize the data representation for a specific task, the term on the right involves the product of two quantities where data representation and learning are decoupled:

  • \|\varphi(x)-\varphi(x')\|_\mathcal{H} is a distance between two data representations \varphi(x),\varphi(x'), which are independent of the learning process;
  • \|f\|_\mathcal{H} is a norm on the model f (typically optimized over data) that acts as a measure of complexity.

Thinking about neural networks in terms of kernel methods then requires defining the underlying representation \varphi(x), which can only depend on the network architecture, and the model f, which will be parametrized by (learned) network’s weights.

Building a convolutional kernel for convolutional neural networks

Following Alberto Bietti’s paper, we now consider the direct construction of a multilayer convolutional kernel for images. Given a two-dimensional image x_0, the main idea is to build a sequence of “feature maps” x_1,x_2,\ldots that are two-dimensional spatial maps carrying information about image neighborhoods (a.k.a receptive fields) at every location. As we proceed in this sequence, the goal is to model larger neighborhoods with more “invariance”.

Formally, an input image x_0 is represented as a square-integrable function in L^2(\Omega,\mathcal{H}_0), where \Omega is a set of pixel coordinates, and \mathcal{H}_0 is a Hilbert space. \Omega may be a discrete grid or a continuous domain such as \mathbb{R}^2, and \mathcal{H}_0 may simply be \mathbb{R}^3 for RGB images. Then, a feature map x_k in L^2(\Omega,\mathcal{H}_k) is obtained from a previous layer x_{k-1} as follows:

  • modeling larger neighborhoods than in the previous layer: we map neighborhoods (patches) from x_{k-1} to a new Hilbert space \mathcal{H}_k. Concretely, we define a homogeneous dot-product kernel between patches z, z' from x_{k-1}:

        \[ K_k(z,z') = \|z\| \|z'\| \kappa_k \left( \left\langle \frac{z}{\|z\|}, \frac{z'}{\|z'\|} \right\rangle \right), \]

    where \langle . , . \rangle is an inner-product derived from \mathcal{H}_{k-1}, and \kappa_k is a non-linear function that ensures positive definiteness, e.g., \kappa_k(\langle u,u'\rangle ) = e^{\alpha (\langle u,u'\rangle -1)} = e^{-\frac{\alpha}{2}\|u-u'\|^2} for vectors u, u' with unit norm, see this paper. By doing so, we implicitly define a kernel mapping \varphi_k that maps patches from x_{k-1} to a new Hilbert space \mathcal{H}_k. This mechanism is illustrated in the picture at the beginning of the post, and produces a spatial map that carries these patch representations.

  • increasing invariance: to gain invariance to small deformations, we smooth x_{k-1} with a linear filter, as shown in the picture at the beginning of the post, which may be interpreted as anti-aliasing (in terms of signal processing) or linear pooling (in terms of neural networks).

Formally, the previous construction amounts to applying operators P_k (patch extraction), M_k (kernel mapping), and A_k (smoothing/pooling operator) to x_{k-1} such that the n-th layer representation can be written as

    \[ \Phi_n(x_0)= x_n= A_n M_n P_n \ldots A_1 M_1 P_1 x_0~~~\text{in}~~~~L^2(\Omega,\mathcal{H}_n). \]

We may finally define a kernel for images as \mathcal{K}_n(x_0,x_0')=\langle \Phi_n(x_0), \Phi_n(x_0') \rangle, whose RKHS contains the functions f_w(x_0) = \langle w , \Phi_n(x_0) \rangle for w in L^2(\Omega,\mathcal{H}_n). Note now that we have introduced a concept of image representation \Phi_n, which only depends on some network architecture (amounts of pooling, patch size), and predictive model f_w parametrized by w.
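
As a tiny sanity check of the patch-level kernel K_k above (a toy numpy snippet with an arbitrary \alpha, not part of the original construction), one can verify the Gaussian form of \kappa_k on unit-norm vectors and evaluate K_k on two random flattened patches:

```python
import numpy as np

alpha = 2.0                                   # arbitrary kernel parameter for this check
rng = np.random.default_rng(0)

def kernel_patch(z, zp):
    """Homogeneous dot-product kernel K_k(z, z') = ||z|| ||z'|| kappa_k(<z/||z||, z'/||z'||>)."""
    nz, nzp = np.linalg.norm(z), np.linalg.norm(zp)
    return nz * nzp * np.exp(alpha * (z @ zp / (nz * nzp) - 1.0))

# On unit vectors, kappa_k(<u,u'>) = exp(alpha(<u,u'> - 1)) coincides with a Gaussian:
u, up = rng.standard_normal(5), rng.standard_normal(5)
u, up = u / np.linalg.norm(u), up / np.linalg.norm(up)
print(np.exp(alpha * (u @ up - 1.0)), np.exp(-alpha / 2 * np.linalg.norm(u - up) ** 2))  # equal values

# Kernel value between two random 3x3 grayscale patches (flattened to vectors):
z, zp = rng.standard_normal(9), rng.standard_normal(9)
print(kernel_patch(z, zp))
```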

From such a construction, we will now derive stability results for classical convolutional neural networks (CNNs) and then derive non-standard CNNs based on kernel approximations that we call convolutional kernel networks (CKNs).

 

Next week, we will see how to perform feature (end-to-end) learning with the previous kernel representation, and also discuss other classical links between neural networks and kernel methods.

 

 

Posted in Machine learning

Optimal bound for stochastic bandits with corruption

Guest post by Mark Sellke.

In the comments of the previous blog post we asked if the new viewpoint on best of both worlds could be used to get clean “interpolation” results. The context is as follows: in a STOC 2018 paper followed by a COLT 2019 paper, the following corruption model was discussed: stochastic bandits, except for C rounds which are adversarial. The state of the art bounds were of the form: optimal (or almost optimal) stochastic term plus K C, and it was mentioned as an open problem whether the K C term could be improved to C (there is a lower bound showing that C is necessary, at least when C = O(\sqrt{T})). As was discussed in the comment section, it seemed that indeed this clean best of both worlds approach should certainly shed light on the corruption model. It turns out that this is indeed the case, and a one-line calculation resolves the open problem from the COLT paper positively. The formal result is as follows (recall the notation/definitions from the previous blog post):

Lemma: Consider a strategy whose regret with respect to the optimal action i^* is upper bounded by

(1)   \begin{equation*} c \sum_{t=1}^T \sum_{i \neq i^*} \sqrt{\frac{x_{i,t}}{t}} \,. \end{equation*}

Then in the C-corruption stochastic bandit model one has that the regret is bounded by:

    \[ C + 2 c \sqrt{K C} + c^2 \sum_{i \neq i^*} \frac{\log(T)}{\Delta_i} \]

 

Note that by the previous blog post we know strategies that satisfy (1) with c=10 (see Lemma 2 in the previous post).

Proof: In equation (1), let us apply Jensen’s inequality over the corrupt rounds; this yields a term c \sqrt{K C}. For the non-corrupt rounds, let us use that

    \[ c \sqrt{\frac{x_{i,t}}{t}} \leq \frac{1}{2} \left( \Delta_i x_{i,t} + \frac{c^2}{t \Delta_i} \right) \]

The sum of the second term on the right hand side is upper bounded by c^2 \sum_{i \neq i^*} \frac{\log(T)}{2 \Delta_i}. On the other hand the sum (over the non-corrupt rounds) of the first term is equal to 1/2 of the regret over the non-corrupt rounds, which is certainly smaller than 1/2 of (the total regret plus C). Thus we obtain (denoting R for the total regret):

    \[ R \leq c \sqrt{K C} + c^2 \sum_{i \neq i^*} \frac{\log(T)}{2 \Delta_i} + \frac{C}{2} + \frac{R}{2} \]

which concludes the proof.

 

Posted in Machine learning, Optimization, Theoretical Computer Science

Amazing progress in adversarially robust stochastic multi-armed bandits

In this post I briefly discuss some recent stunning progress on robust bandits (for more background on bandits see these two posts, part 1 and part 2, in particular what is described below gives a solution to Open Problem 3 at the end of part 2).

 

Stochastic bandit and adversarial examples

In multi-armed bandit problems the gold standard property, going back to a seminal paper of Lai and Robbins in 1985, is to have a regret upper bounded by:

(1)   \begin{equation*} \sum_{i \neq i^*} \frac{\log(T)}{\Delta_i} \,. \end{equation*}

Let me unpack this a bit: this is for the scenario where the reward process for each action is simply an i.i.d. sequence from some fixed distribution, i^* is the index of the (unique) best action, and \Delta_i is the gap between the mean value of the best action and that of action i. Such a guarantee is extremely strong, as in particular it means that actions whose average performance is a constant away from the optimal arm are very rarely played (only of order \log(T) times). On the other hand, the price to pay for such aggressive behavior (by this I mean focusing on good actions very quickly) is that all the classical algorithms attaining the above bound are extremely sensitive to adversarial examples: that is, if there is some deviation from the i.i.d. assumption (even a very brief one), the algorithms can suddenly suffer regret linear in T.

 

Adversarial bandit

Of course there is an entirely different line of work, on adversarial multi-armed bandits, where the whole point is to prove regret guarantees for any reward process. In this case the best one can hope for is a regret of order \sqrt{K T}. The classical algorithm in this model, Exp3, attains a regret of order \sqrt{K T \log(K)}. In joint work with Jean-Yves Audibert we showed back in 2009 that the following strategy, which we called PolyINF, attains the optimal \sqrt{KT}: view Exp3 as Mirror Descent with the (negative) entropy as a regularizer, and now replace the entropy by a simple rational function, namely - \sum_{i=1}^K x_i^p with p \in (0,1) (this mirror descent view was actually derived in a later paper with Gabor Lugosi). The proof becomes one line (given the appropriate knowledge of mirror descent and estimation in bandit games): the radius part of the bound is of the form \frac{1}{\eta} \sum_{i=1}^K x_i^p \leq \frac{K^{1-p}}{\eta}, while the variance is of the form (since the inverse of the Hessian of the mirror map is a diagonal matrix with entries x_i^{2-p}):

    \[ \eta \sum_{i=1}^K x_i^{2-p} \frac{1}{x_i} \leq \eta K^{p} \,. \]

Thus optimizing over \eta yields \sqrt{K T} for any p \in (0,1). Interestingly, the best numerical constant in the bound is obtained for p=1/2.

 

Best of both worlds

This was the state of affairs back in 2011, when with Alex Slivkins we started working on a best of both worlds type algorithm (which in today’s language is exactly a stochastic MAB robust to adversarial examples): namely one that gets the \log(T) guarantee (in fact \log^2(T) in our original paper) if the environment is the nice i.i.d. one, and also \sqrt{K T} (in fact \sqrt{K T \log^3(T)}) in the worst case. This original best of both worlds algorithm was of the following form: be aggressive as if it was a stochastic environment, but still sample sufficiently often the bad actions to make sure there isn’t an adversary trying to hide some non-stochastic behavior on these seemingly bad performing actions. Of course the whole difficulty was to show that it is possible to implement such a defense without hurting the stochastic performance too much (remember that bad actions can only be sampled of order \log(T) times!). Since this COLT 2012 paper there have been many improvements to the original strategy, as well as many variants/refinements (one such variant worth mentioning is the line of work trying to do a smooth transition between the stochastic and adversarial models, see e.g. here and here).

 

A stunning twist

The amazing development that I want to talk about in this post is the following: about a year ago Julian Zimmert and Yevgeny Seldin proved that the 2009 PolyINF strategy (crucially with p=1/2) actually gets the 2012 best of both worlds bound! This is truly surprising, as in principle mirror descent does not “know” anything about stochastic environments: it does not make any sophisticated concentration reasoning (say as in Lai and Robbins), yet it seems to automatically and optimally pick up on the regularity in the data. This is really amazing to me, and of course also a total surprise that the polynomial regularizer has such a strong adaptivity property, while it was merely introduced to remove a benign log term.

The crucial observation of Zimmert and Seldin is that a certain self-bounding property of the regret implies (in a one-line calculation) the best of both worlds result:

Lemma 1: Consider a strategy whose regret with respect to the optimal action i^* is upper bounded by

(2)   \begin{equation*} C \sum_{t=1}^T \sum_{i \neq i^*} \sqrt{\frac{x_{i,t}}{t}} \,. \end{equation*}

(Recall that for multi-armed bandits one selects a probability distribution x_t over the actions, so x_{i,t} denotes here the probability of playing action i at time t.) Then one has that the regret is in fact bounded by 2 C \sqrt{K T} (this follows trivially by Jensen on the i sum), and moreover if the environment is stochastic one has that the regret is in fact bounded by C^2 times (1).

Proof: Assuming that the environment is stochastic we can write the regret as \sum_{i,t} \Delta_i x_{i,t}, so by assumption and using that C \sqrt{\frac{x_{i,t}}{t}} \leq \frac{1}{2} \left( \Delta_i x_{i,t} + \frac{C^2}{t \Delta_i} \right) one has:

    \[ \sum_{i \neq i^*,t} \Delta_i x_{i,t} \leq \frac{1}{2} \sum_{i \neq i^*,t} \left(\Delta_i x_{i,t} + \frac{C^2}{t \Delta_i} \right) \,, \]

which means that the left hand side is smaller than \sum_{i \neq i^*,t} \frac{C^2}{t \Delta_i} which is indeed smaller than C^2 times (1).

 

Yet another one-line proof (okay, maybe 5 lines)

Zimmert and Seldin proved that PolyINF with p=1/2 actually satisfies the self-bounding property of Lemma 1 (and thus obtains the best of both worlds guarantee). In another recent paper by Zimmert, in joint work with Haipeng Luo and Chen-Yu Wei, they simplify the analysis by using a very mild variation of the PolyINF regularizer, namely - \sum_{i=1}^K (\sqrt{x_i} + \sqrt{1-x_i}). In my view it’s the proof from the book for the best of both worlds result (or very close to it)! Here it is:

Lemma 2: Equation (2) with C=10 is an upper bound on the regret of mirror descent with learning rate \eta_t = \frac{1}{\sqrt{t}}, mirror map \Phi(x) = - \sum_{i=1}^K (\sqrt{x_i} + \sqrt{1-x_i}), and standard multi-armed bandit loss estimator.

Proof: The classical mirror descent analysis from any good book will tell you that the regret is controlled by (for \Phi(x) = \sum_{i=1}^K \phi(x_i) and with the convention \eta_0 = + \infty):

(3)   \begin{equation*} \sum_{t=1}^T \left(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\right) (\Phi(x^*) - \Phi(x_t)) + \sum_{t=1}^T \eta_t \sum_{i=1}^K \frac{1}{x_{i,t} \phi''(x_{i,t})} \,. \end{equation*}

We now consider those terms for the specific \Phi and \eta_t suggested in the lemma. First notice that \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \leq \eta_t. Moreover \phi(x^*_i) = - 1 (since x^* is integral) so that \phi(x^*_i) - \phi(x_{i,t}) \leq \min(\sqrt{x_{i,t}}, \sqrt{1-x_{i,t}}). In other words the first term in (3) is upper bounded by

    \[ \sum_{t=1}^T \sum_{i=1}^K \sqrt{\frac{\min(x_{i,t}, 1-x_{i,t})}{t}} \leq 2 \sum_{t=1}^T \sum_{i \neq i^*} \sqrt{\frac{x_{i,t}}{t}} \, \]

where the inequality simply comes from \sqrt{1-x_{i^*,t}} = \sqrt{\sum_{i \neq i^*} x_{i,t}} \leq \sum_{i \neq i^*} \sqrt{x_{i,t}}.

Next note that \phi''(s) = \frac{1}{4} (s^{-3/2} + (1-s)^{-3/2}) \geq \frac{1}{4 \min(s,1-s)^{3/2}}, so that the second term in (3) is upper bounded by

    \[ 4 \sum_{t=1}^T \sum_{i=1}^K \sqrt{\frac{x_{i,t}}{t}} \min(1, (1-x_{i,t})/x_{i,t})^{3/2} \leq 8 \sum_{t=1}^T \sum_{i \neq i^*} \sqrt{\frac{x_{i,t}}{t}} \, \]

where the inequality follows trivially by considering the two cases whether x_{i^*,t} is smaller or larger than 1/2.
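
For readers who want to play with this numerically, here is a minimal sketch of the closely related Tsallis-INF strategy of Zimmert and Seldin (FTRL with the regularizer -2\sum_i \sqrt{x_i} and the standard importance-weighted loss estimator), rather than the exact mirror descent variant of Lemma 2; the learning rate schedule, the bisection tolerance and the toy stochastic instance are arbitrary choices.

```python
import numpy as np

def tsallis_inf_distribution(L_hat, eta, tol=1e-10):
    """FTRL step with the 1/2-Tsallis regularizer -2 * sum(sqrt(x)):
    the solution is x_i = 1 / (eta * (L_hat_i + nu))^2 with nu such that sum(x) = 1."""
    K = len(L_hat)
    lo = -L_hat.min() + 1e-12                 # need L_hat_i + nu > 0 for all i
    hi = -L_hat.min() + np.sqrt(K) / eta      # at this nu the total mass is at most 1
    while hi - lo > tol:
        nu = (lo + hi) / 2.0
        mass = np.sum(1.0 / (eta * (L_hat + nu)) ** 2)
        lo, hi = (nu, hi) if mass > 1.0 else (lo, nu)
    x = 1.0 / (eta * (L_hat + (lo + hi) / 2.0)) ** 2
    return x / x.sum()                        # tiny renormalization for numerical safety

def run_tsallis_inf(loss_means, T, seed=0):
    rng = np.random.default_rng(seed)
    K = len(loss_means)
    L_hat = np.zeros(K)                       # cumulative importance-weighted loss estimates
    pseudo_regret = 0.0
    for t in range(1, T + 1):
        x = tsallis_inf_distribution(L_hat, eta=2.0 / np.sqrt(t))
        a = rng.choice(K, p=x)
        loss = float(rng.random() < loss_means[a])    # Bernoulli loss of the played arm
        L_hat[a] += loss / x[a]                       # importance-weighted estimator
        pseudo_regret += loss_means[a] - loss_means.min()
    return pseudo_regret

print(run_tsallis_inf(np.array([0.5, 0.6, 0.6, 0.6]), T=20000))   # pseudo-regret on a stochastic instance
```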

Posted in Machine learning, Optimization, Theoretical Computer Science

Nemirovski’s acceleration

I will describe here the very first (to my knowledge) acceleration algorithm for smooth convex optimization, which is due to Arkadi Nemirovski (dating back to the end of the 70’s). The algorithm relies on a 2-dimensional plane-search subroutine (which, in theory, can be implemented in \log(1/\epsilon) calls to a first-order oracle). He later improved it to only require a 1-dimensional line-search in 1981, but of course the breakthrough that everyone knows about came a year after with the famous 1982 paper by Nesterov that gets rid of this extraneous logarithmic term altogether (and in addition is based on the deep insight of modifying Polyak’s momentum).

Let f be a 1-smooth function. Denote x^{+} = x - \nabla f(x). Fix a sequence (\lambda_t)_{t \in \mathbb{N}}, to be optimized later. We consider the “conjugate” point \sum_{s =1}^t \lambda_s \nabla f(x_s). The algorithm simply returns the best point in the plane spanned by the gradient descent point and the conjugate point, that is:

    \[ x_{t+1} = \mathrm{argmin}_{x \in P_t} f(x) \, \text{where} \, P_t = \mathrm{span}\left(x_t^+, \sum_{s =1}^t \lambda_s \nabla f(x_s)\right) \,. \]

Let us denote g_s = \nabla f(x_s) and \delta_s = f(x_s) - f(x^*) for shorthand. The key point is that g_{t+1} \in P_t^{\perp} (first order optimality of x_{t+1} on the linear subspace P_t), and in particular (by induction over t) \|\sum_{s \leq t} \lambda_s g_s\|^2 = \sum_{s \leq t} \lambda_s^2 \|g_s\|^2. Now recognize that \|g_s\|^2 is, up to a factor 2 which we ignore here, a lower bound on the improvement \delta_s - \delta_{s+1} (here we use smoothness together with the fact that x_{s+1} is at least as good as x_s^+). Thus we get:

    \[ \|\sum_{s \leq t} \lambda_s g_s\|^2 \leq \sum_{s \leq t} \lambda_s^2 (\delta_s - \delta_{s+1}) \leq \sum_{s \leq t} \delta_s (\lambda_s^2 - \lambda_{s-1}^2) \,. \]

In other words if the sequence \lambda is chosen such that \lambda_s = \lambda_s^2 - \lambda_{s-1}^2 then we get

    \[ \|\sum_{s \leq t} \lambda_s g_s\|^2 \leq \sum_{s \leq t} \lambda_s \delta_s \,. \]

This is good because roughly the reverse inequality also holds true by convexity (and the fact that g_s \cdot x_s = 0, since x_s lies in the subspace P_{s-1} which is orthogonal to g_s; for s=1 we take x_1 = 0):

    \[ \sum_{s \leq t} \lambda_s \delta_s \leq \sum_{s \leq t} \lambda_s g_s \cdot (x_s - x^*) \leq \|x^*\| \cdot \| \sum_{s \leq t} \lambda_s g_s\| \,. \]

So finally we get \sum_{s \leq t} \lambda_s \delta_s \leq \|x^*\|^2, and it just remains to realize that \lambda_s is of order s so that \delta_t \leq \|x^*\|^2 / t^2.
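
Here is a minimal numerical sketch of the method on a least-squares toy problem (my own rendering, with arbitrary problem sizes): for a quadratic f the 2-dimensional plane search is an exact small least-squares solve, and \lambda_s is taken as the positive root of \lambda_s^2 - \lambda_s = \lambda_{s-1}^2 with \lambda_0 = 0.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 80, 40
A = rng.standard_normal((n, d))
A /= np.linalg.norm(A, ord=2)                 # largest singular value 1, so f is 1-smooth
b = rng.standard_normal(n)

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)
f_star = f(np.linalg.lstsq(A, b, rcond=None)[0])

x = np.zeros(d)                               # x_1 = 0
conj = np.zeros(d)                            # running conjugate point  sum_s lambda_s grad f(x_s)
lam = 0.0
for t in range(1, 101):
    g = grad(x)
    lam = (1.0 + np.sqrt(1.0 + 4.0 * lam ** 2)) / 2.0   # lambda_t^2 - lambda_t = lambda_{t-1}^2
    conj = conj + lam * g
    x_plus = x - g                            # gradient step
    B = np.column_stack([x_plus, conj])       # basis of the 2-d plane P_t
    w = np.linalg.lstsq(A @ B, b, rcond=None)[0]        # exact plane search for a quadratic f
    x = B @ w
    if t % 20 == 0:
        print(t, f(x) - f_star)               # suboptimality; the theory predicts O(||x*||^2 / t^2)
```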

Posted in Optimization

A short proof for Nesterov’s momentum

Yesterday I posted the following picture on Twitter and it quickly became my most visible tweet ever (by far):

    \begin{eqnarray*} & x^+ := x - \eta \nabla f(x) & \text{(gradient step)} \\ & d_t := \gamma_t \cdot (x_{t} - x_{t-1}) & \text{(momentum term)} \\ \\ \text{ [Cauchy, 1847] } & x_{t+1} = x_t^+ & \text{(gradient descent)} \\ \text{ [Polyak, 1964] } & x_{t+1} = x_t^+ + d_{t} & \text{(momentum + gradient)} \\ \text{ [Nesterov, 1983] } & x_{t+1} = (x_t + d_{t})^+ & \text{(momentum + lookahead gradient)} \end{eqnarray*}

I thought this would be a good opportunity to revisit the proof of Nesterov’s momentum, especially since as it turns out I really don’t like the way I described it back in 2013 (and to this day the latter post also remains my most visible post ever…). So here we go, for what is hopefully a short and intuitive proof of the 1/t^2 convergence rate for Nesterov’s momentum (disclaimer: this proof is merely a rearranging of well-known calculations, nothing new is going on here).

We assume that f is a \beta-smooth convex function, and we take \eta = 1/\beta in the gradient step. The momentum term \gamma_t will be set to a very particular value, which comes out naturally in the proof.

 

The two basic inequalities

Let us denote \delta_t = f(x_t) - f(x^*) and g_t = - \frac{1}{\beta} \nabla f(x_t + d_t) (note that x_{t+1} = x_t + g_t + d_t). Now let us write our favorite inequalities (using f(x^+) - f(x) \leq - \frac{1}{2 \beta} |\nabla f(x)|^2 and f(x) - f(y) \leq \nabla f(x) \cdot (x-y)):

    \[ \delta_{t+1} - \delta_t \leq - \frac{\beta}{2} \left( |g_t|^2 + 2 g_t \cdot d_t \right) \,, \]

and

    \[ \delta_{t+1} \leq - \frac{\beta}{2} \left( |g_t|^2 + 2 g_t \cdot (x_t + d_t - x^*) \right) \,. \]

 

On the way to a telescopic sum

Recall now that |a|^2 + 2 a \cdot b = |a+b|^2 - |b|^2, so it would be nice to somehow combine the two above inequalities to obtain a telescopic sum thanks to this simple formula. Let us try to take a convex combination of the two inequalities. In fact it will be slightly more elegant if we use the coefficient 1 on the second inequality, so let us take (\lambda_t-1) times the first inequality plus 1 times the second inequality. We obtain an inequality whose right hand side is given by -\frac{\beta}{2} times

    \begin{align*} & \lambda_t |g_t|^2 + 2 g_t \cdot (x_t + \lambda_t d_t - x^*) \\ & = \frac{1}{\lambda_t} \left( |x_t + \lambda_t d_t - x^* + \lambda_t g_t|^2 - |x_t + \lambda_t d_t - x^*|^2 \right) \,. \end{align*}

Recall that our objective is to obtain a telescopic sum, and at this point we still have flexibility both to choose \lambda_t and \gamma_t. What we would like to have is:

    \[ x_t + \lambda_t d_t - x^* + \lambda_t g_t = x_{t+1} + \lambda_{t+1} d_{t+1} - x^* \,. \]

Observe that (since d_{t+1} = \gamma_{t+1} \cdot (d_t + g_t)) the right hand side can be written as x_t + g_t + d_t + \lambda_{t+1} \cdot \gamma_{t+1} \cdot (g_t + d_t) - x^*, and thus we see that we simply need to have:

    \[ \lambda_t = 1+ \lambda_{t+1} \cdot \gamma_{t+1} \,. \]

 

Setting the parameters and concluding the proof

Writing u_t := \frac{\beta}{2} |x_t + \lambda_t d_t - x^*|^2 we now obtain as a result of the combination of our two starting inequalities:

    \[ \lambda_t^2 \delta_{t+1} - (\lambda_t^2 - \lambda_t) \delta_t \leq u_t - u_{t+1} \,. \]

It only remains to select \lambda_t such that \lambda_t^2 - \lambda_t = \lambda_{t-1}^2 (i.e., roughly \lambda_t is of order t) so that by summing the previous inequality one obtains \delta_{t+1} \leq \frac{\beta |x_1 - x^*|^2}{2 \lambda_t^2} which is exactly the 1/t^2 rate we were looking for.
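
A minimal numpy sketch of the resulting algorithm (the lookahead-gradient recursion from the table at the top, with \lambda_t the positive root of \lambda_t^2 - \lambda_t = \lambda_{t-1}^2 and \gamma_t = (\lambda_{t-1}-1)/\lambda_t, on an arbitrary least-squares toy problem):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 50
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
beta = np.linalg.norm(A, ord=2) ** 2              # smoothness constant of f
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)
f_star = f(np.linalg.lstsq(A, b, rcond=None)[0])

x_prev = x = np.zeros(d)
lam_prev, lam = 0.0, 1.0                          # lambda_0 = 0, lambda_1 = 1
for t in range(1, 201):
    gamma = (lam_prev - 1.0) / lam                # gamma_t = (lambda_{t-1} - 1) / lambda_t
    d_t = gamma * (x - x_prev)                    # momentum term (zero for the first step)
    x_prev, x = x, (x + d_t) - grad(x + d_t) / beta   # x_{t+1} = (x_t + d_t)^+ : lookahead gradient step
    lam_prev, lam = lam, (1.0 + np.sqrt(1.0 + 4.0 * lam ** 2)) / 2.0  # lambda_{t+1}^2 - lambda_{t+1} = lambda_t^2
    if t % 50 == 0:
        print(t, f(x) - f_star)                   # suboptimality; bounded by beta |x_1 - x*|^2 / (2 lambda_t^2)
```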

Posted in Optimization

Remembering Michael

It has been a year since the tragic event of September 2017. We now know what happened, and it is a tremendously sad story of undiagnosed type 1 diabetes.

This summer at MSR Michael was still very present in our discussions, whether it was about some ideas that we discussed that last 2017 summer (acceleration, metrical task systems lower bounds, etc…), or just some random fun story.

I highly recommend taking a look at the YouTube videos from the November 2017 symposium in memory of Michael. You can also take a look at his (still growing) list of publications on arxiv. In fact I know of an upcoming major paper so stay tuned (the premises are in Yin Tat’s talk at the symposium).

As always when remembering this tragic loss my thoughts go to Michael’s family.

Posted in Uncategorized

How to generalize (algorithmically)

A couple of months ago I taught an introduction to statistical learning theory. I took inspiration from two very good introductory books on SLT: “Foundations of ML”, and “Understanding Machine Learning: From Theory to Algorithms”. I also covered some classical results about nearest neighbors from the seminal book “A Probabilistic Theory of Pattern Recognition”. The lectures were recorded and I put the videos on the youtube channel. Sadly I still did not learn the basic lesson to stick to a black pen when teaching on a whiteboard, so at (many) times it is hard to read what I write. I also got sick after the first lecture so my voice is quite weak in the following lectures, making it hard to hear too… Despite these fatal flaws some people still found the material useful, so check it out if you are interested. Below I give a rough syllabus for the lectures, and I also give some notes on “algorithmic generalization” which was in my opinion the highlight of the lectures (you can also check my earlier notes on SLT here and here).

Syllabus

Lecture 1: Introduction to the statistical learning theory framework, its basic question (sample complexity) and its canonical settings (linear classification, linear regression, logistic regression, SVM, neural networks). Two basic methods for learning: (i) Empirical Risk Minimization, (ii) Nearest neighbor classification.

Lecture 2: Uniform law of large numbers approach to control the sample complexity of ERM (includes a brief reminder of concentration inequalities). Application: analysis of bounded regression (includes the non-standard topic of type/cotype and how it relates to different regularizations such as in LASSO).

Lecture 3: Reminder of the first two lectures and relation with the famous VC dimension. How to generalize beyond uniform law of large numbers: stability and robustness approaches (see below).

Lecture 4: How to generalize beyond uniform law of large numbers: information theoretic perspective (see below), PAC-Bayes, and online learning. Brief discussion of margin theory, and an introduction to modern questions in robust machine learning.

 

Some notes on algorithmic generalization

Let \mathcal{X}, \mathcal{Y} be input/output spaces. Let \ell : \mathcal{Y} \times \mathcal{Y} \rightarrow [0,1] be a loss function, \mu a probability measure supported on \mathcal{X} \times \mathcal{Y}, and \underline{h} : (\mathcal{X} \times \mathcal{Y})^m \rightarrow \mathcal{Y}^\mathcal{X} a learning rule (in words \underline{h} takes as input a dataset of m examples, and outputs a mapping from \mathcal{X}-inputs to \mathcal{Y}-outputs). With a slight abuse of notation, for z=(x,y) \in \mathcal{X} \times \mathcal{Y} and h \in \mathcal{Y}^{\mathcal{X}}, we write \ell(h, z) := \ell(h(x), y). We define the generalization of \underline{h} on \mu by:

    \[ \mathrm{gen}(\underline{h} , \mu) = \mathbb{E}_{S=(z_1,\hdots,z_m) \sim \mu^{\otimes m}, z \sim \mu} \left[ \ell(\underline{h}(S), z) - \frac1{m} \sum_{i=1}^m \ell(\underline{h}(S), z_i) \right] \,. \]

In words, if \mathrm{gen}(\underline{h} , \mu) \leq \epsilon then we expect the empirical performance of the learned classifier \underline{h}(S) to be representative of its performance on a fresh out-of-sample data point, up to an additive \epsilon. The whole difficulty of course is that the empirical evaluation is done with the same dataset S that is used for training, leading to non-trivial dependencies. We should also note that in many situations one might be interested in the two-sided version of the generalization, as well as high probability bounds instead of bounds in expectation. For simplicity we focus on \mathrm{gen}(\underline{h} , \mu) here.

The most classical approach to controlling generalization, which we covered in details in previous notes, is via uniform law of large numbers. More precisely assuming that the range of the learning rule is some hypothesis class \mathcal{H} one trivially has

    \[ \mathrm{gen}(\underline{h} , \mu) \leq \mathbb{E}_{S, z} \,\, \sup_{h \in \mathcal{H}} \left[ \frac1{m} \sum_{i=1}^m \ell(h, z_i) - \ell(h, z) \right] \,. \]

However this approach might be too coarse when the learning rule is searching through a potentially huge space of hypotheses (such as in the case of neural networks). Certainly such a uniform bound has no chance of explaining why neural networks with billions of parameters would generalize with a data set of merely millions of examples. For this one has to use algorithm-based arguments.

Stability

The classical example of algorithmic generalization is due to Bousquet and Elisseeff 2002. It is a simple rewriting of the generalization as a stability notion:

    \begin{eqnarray*} \mathrm{gen}(\underline{h} , \mu) & = & \mathbb{E}_{S, z, i \sim [m]} \left[\ell(\underline{h}(S), z) - \ell(\underline{h}(S), z_i) \right] \\ & = & \mathbb{E}_{S, z, i \sim [m]} \left[\ell(\underline{h}(S^{-i}), z_i) - \ell(\underline{h}(S), z_i) \right] \,, \end{eqnarray*}

where S^{-i}=(z_1,\hdots, z_{i-1}, z, z_{i+1}, \hdots, z_m). This viewpoint can be quite enlightening. For example in the uniform law of large numbers view, regularization enforces small capacity, while in the stability view we see that regularization ensures that the output hypothesis is not too brittle (this was covered in some details in the previous notes).
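
As a quick sanity check of this rewriting, here is a small simulation (my own toy, with ridge regression as the learning rule and an unbounded squared loss, so the [0,1] range of \ell is ignored) that estimates both sides by Monte Carlo; the two numbers agree up to sampling error.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, lam = 5, 20, 1.0
w_true = rng.standard_normal(d)

def sample(n):                                    # z = (x, y) with y = <w_true, x> + noise
    X = rng.standard_normal((n, d))
    return X, X @ w_true + 0.5 * rng.standard_normal(n)

def learn(X, y):                                  # the learning rule: ridge regression
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def loss(w, x, y):
    return (x @ w - y) ** 2

gen_direct, gen_stability = [], []
for _ in range(20000):
    X, y = sample(m)
    Xz, yz = sample(1)                            # the fresh out-of-sample point z
    i = rng.integers(m)
    w = learn(X, y)
    gen_direct.append(loss(w, Xz[0], yz[0]) - loss(w, X[i], y[i]))
    Xi, yi = X.copy(), y.copy()
    Xi[i], yi[i] = Xz[0], yz[0]                   # S^{-i}: replace the i-th example by z
    w_i = learn(Xi, yi)
    gen_stability.append(loss(w_i, X[i], y[i]) - loss(w, X[i], y[i]))

print(np.mean(gen_direct), np.mean(gen_stability))   # two Monte Carlo estimates of gen(h, mu)
```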

Robustness

The next approach I would like to discuss is related to deep questions about current machine learning methods. One of the outstanding problems in machine learning is that current algorithms are not robust to even mild shifts of distribution at test time. Intuitively this lack of robustness seems to indicate a lack of generalization. Can we formalize this intuition? I will now give one such formal link between robustness and generalization, due to Xu and Mannor 2010, which shows the reverse direction (robustness implies generalization). At some level robustness can be viewed as “stability at test time” (while in Bousquet and Elisseeff we care about “stability at training time”).

Xu and Mannor define (K, \epsilon(\cdot))-robustness as follows: assume that \mathcal{Z} can be partitioned into K sets \{C_k\}_{k \in [K]} such that if z and z_i are in the same set C_k then

    \[ | \ell(\underline{h}(S), z_i) - \ell(\underline{h}(S), z) | \leq \epsilon(S) \,. \]

A good example to have in mind would be a binary classifier with large margin, in which case K corresponds to the covering number of \mathcal{X} at the scale given by the margin. Another (related) example would be regression with a Lipschitz function. In both cases K would be typically exponential in the dimension of \mathcal{X}. The key result of Xu and Mannor that we prove next is a generalization bound of order \sqrt{K / m}. In any situation of interest this seems to me to be a pretty weak bound, yet on the other hand I find the framework to be very pretty and it is of topical interest. I would be surprised if this was the end of the road in the space of “generalization and robustness”.

Theorem (Xu and Mannor 2010):

A (K, \epsilon(\cdot))-robust learning rule satisfies

    \[ \mathrm{gen}(\underline{h} , \mu) \leq \mathbb{E}_{S} \, \epsilon(S) + \sqrt{\frac{K}{m}} \,. \]

Proof: Let N_k = |S \cap C_k| and note that \mathbb{E}_S \, N_k = m \mu(C_k). Now one has for a robust h:

    \begin{align*} \mathbb{E}_z \ \ell(h, z) & =\sum_{k=1}^K \mu(C_k) \mathbb{E}_z [\ell(h,z) | z \in C_k] \\ & = \sum_{k=1}^K \frac{N_k}{m} \mathbb{E}_z [\ell(h,z) | z \in C_k] + \sum_{k=1}^K \left(\mu(C_k) - \frac{N_k}{m}\right) \mathbb{E}_z [\ell(h,z) | z \in C_k] \\ & \leq \epsilon(S) + \frac{1}{m} \sum_{i = 1}^m \ell(h, z_i) + \sum_{k=1}^K \left|\mu(C_k) - \frac{N_k}{m}\right| \, . \end{align*}

It only remains to observe that

    \[ \mathbb{E}_S \, \sum_{k=1}^K \left|\mu(C_k) - \frac{N_k}{m}\right| \leq \frac{1}{\sqrt{m}} \sum_{k=1}^K \sqrt{\mu(C_k)} \leq \sqrt{\frac{K}{m}} \, . \]

 

Information theoretic perspective

Why do we think that a lack of robustness indicates a lack of generalization? Well it seems to me that a basic issue could simply be that the dataset was memorized by the neural network (which would be a very non-robust way to learn). If true then one could basically find all the information about the data in the weights of the neural network. Again, can we prove at least the opposite direction, that is if the output hypothesis does not retain much information from the dataset then it must generalize? This is exactly what Russo and Zou 2016 do, where they use the mutual information \mathrm{I}(\underline{h}(S), S) as a measure of the “information” retained by the trained hypothesis about the dataset. More precisely they show the following result:

Theorem (Russo and Zou 2016):

    \[ \mathrm{gen}(\underline{h} , \mu) \leq \sqrt{\frac{\mathrm{I}(\underline{h}(S), S)}{2 m}} \, . \]

Note that here we have assumed that the codomain of the learning rule \underline{h} consists of deterministic maps from inputs to outputs, in which case the mutual information \mathrm{I}(\underline{h}(S), S) is simply the entropy \mathrm{H}(\underline{h}(S)). However the proof below also applies to the case where the codomain of the learning rule consists of probability measures, see e.g., Xu and Raginsky 2017. Let us now conclude this (long) post with the proof of the above theorem.

The key point is very simple: one can view generalization as a decoupling property by writing:

    \[ \mathrm{gen}(\underline{h} , \mu) = \mathbb{E}_{S, S' \sim \mu^{\otimes m}} [ F(\underline{h}(S), S') - F(\underline{h}(S), S) ] \,, \]

where F(h, S) = \frac{1}{m} \sum_{z \in S} \ell(h, z).

Now the theorem follows straightforwardly (if one knows Hoeffding’s lemma) from an application of the following beautiful lemma:

Lemma:

Let f: \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}. Let (X,Y) be random variables in \mathcal{X} \times \mathcal{Y} and \bar{X}, \bar{Y} be mutually independent copies of X and Y. Assume that f(\bar{X}, \bar{Y}) is \sigma-subgaussian (i.e., \log \mathbb{E} \exp( \lambda (f(\bar{X}, \bar{Y}) - \mathbb{E} f) ) \leq \lambda^2 \sigma^2 /2, \forall \lambda) then

    \[ \mathbb{E}[ f(X,Y) - f(\bar{X}, \bar{Y}) ] \leq \sqrt{2 \sigma^2 \mathrm{I}(X, Y)} \]

Proof: The mutual information is equal to the relative entropy between the distribution of (X,Y) and the distribution of (\bar{X}, \bar{Y}). Recall also the variational representation of the relative entropy, which is that the map \mu \mapsto \mathrm{Ent}(\mu, \nu) is the convex conjugate of the log-partition function \Phi_{\nu}(g) := \log(\int \exp(g(x)) d \nu(x)). In particular any function g yields a lower bound on the mutual information:

    \[ \mathrm{I}(X, Y) \geq \mathbb{E}[ g(X,Y) ] - \log \mathbb{E} [ \exp(g(\bar{X},\bar{Y})) ] \,. \]

Now it only remains to use the definition of subgaussianity, that is take g = \lambda f, and optimize over \lambda.
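
For completeness, writing out that last step: the bound above with g = \lambda f, together with the subgaussian assumption, gives

    \[ \mathrm{I}(X, Y) \geq \lambda \, \mathbb{E}[ f(X,Y) ] - \log \mathbb{E} [ \exp(\lambda f(\bar{X},\bar{Y})) ] \geq \lambda \left( \mathbb{E}[ f(X,Y) ] - \mathbb{E}[ f(\bar{X},\bar{Y}) ] \right) - \frac{\lambda^2 \sigma^2}{2} \,, \]

and choosing \lambda = (\mathbb{E}[ f(X,Y) ] - \mathbb{E}[ f(\bar{X},\bar{Y}) ])/\sigma^2 (assuming this quantity is nonnegative, otherwise the lemma is trivial) yields exactly \mathbb{E}[ f(X,Y) - f(\bar{X},\bar{Y}) ] \leq \sqrt{2 \sigma^2 \mathrm{I}(X,Y)}.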

Posted in Machine learning

Some updates

The blog has been eerily silent for most of 2018, here is why:

  • The main culprit is definitely the COLT 2018 chairing. This year we received a surprise 50% increase in the number of submissions. This is great news for the community, and it led to a truly fantastic program. We are looking forward to seeing many of you in Stockholm! Note that the early registration deadline is tomorrow (May 30th).
  • The first volume of our new Mathematical Statistics and Learning journal just got published. If you care about being evaluated by top experts, and/or get the visibility of being in a highly selective mathematical journal, you should consider submitting your highest quality work there!
  • The first set of video lectures on the youtube channel is now complete. I realize that the quality is not perfect and that it is hard to read the blackboard. I will try to improve this for the future videos (which should be posted some time during the summer).
  • Finally I am quite excited to have a first preprint (joint work with Eric Price and Ilya Razenshteyn) at least loosely related to deep learning. It is titled “Adversarial examples from computational constraints”, which sums up the paper pretty well. In a nutshell, we prove that avoiding adversarial examples is computationally hard, even in situations where there exists a very robust classifier. We are looking forward to getting the deep learning community’s feedback on this work!
Posted in Announcement