- Bandit algorithms by Tor Lattimore and Csaba Szepesvari
- Introduction to bandits by Alex Slivkins

These new references very significantly expand the 2012 survey, and they are wonderful starting points for anyone who wants to enter the field.

Here are some of the discoveries in the world of bandits that stood out for me this decade:

- We now understand very precisely Thompson Sampling, the first bandit strategy, proposed back in 1933. The most beautiful reference here is the one by Dan Russo and Ben Van Roy: An Information-Theoretic Analysis of Thompson Sampling, JMLR 2016. Another one that stands out is Analysis of Thompson Sampling for the Multi-Armed Bandit Problem by S. Agrawal and N. Goyal at COLT 2012.
- T^{2/3} lower bound for *non-stochastic* bandit with switching cost by Dekel, Ding, Koren and Peres at STOC 2014. This is a striking result for several reasons. In particular the proof has to be based on a non-trivial stochastic process, since for the classical stochastic i.i.d. model one can obtain \sqrt{T} (very easily in fact).
- We now know that bandit convex optimization is "easy", in the sense that it is a \sqrt{T}-regret type problem. What's more, in our STOC 2017 paper with Y.T. Lee and R. Eldan we introduced a new way to do function estimation based on bandit feedback, using kernels (I have written at length about this on this blog).
- A very intriguing model of computation for contextual bandit was proposed, where one can access the policy space only through an offline optimization oracle. With such access, the classical Exp4 algorithm cannot be simulated, and thus one needs new strategies. We now have a reasonable understanding that \sqrt{T} is doable with mild assumptions (see e.g. the ICML 2014 paper "Taming the Monster" by Agarwal, Hsu, Kale, Langford, L. Li and Schapire) and that it is impossible with no assumptions (work of Hazan and Koren at STOC 2016).
- Honorable mentions also go to the work of Wei and Luo showing that very strong variation bounds are possible in bandits (see this COLT 2018 paper), and to Zimmert and Seldin, who made striking progress on the best of both worlds phenomenon that we discovered with Slivkins at the beginning of the decade (I blogged about it here already).

In addition to starting the decade with the bandit survey, I also started it by being bored with the bandit topic altogether. I thought that many (if not most) of the fundamental results were now known, and that it was a good idea to move on to something else. Obviously I was totally wrong, as you can see with all the works cited above (and many many more for stochastic bandits, including a much deeper understanding of best arm identification, a topic very close to my heart, see e.g., [Kaufmann, Cappe, Garivier, JMLR 16]). In fact I am now optimistic that there is probably another decade-worth of exploration left for the bandit problem(s). Nevertheless I ventured outside, and explored the world of optimization (out of which first came a survey, and more recently video lectures) and briefly networks (another modest survey came out of this too).

Here are some of the landmark optimization results of this decade in my view:

- Perhaps the most striking result of the decade in optimization is the observation that for finite sum problems, one can reduce the variance in stochastic gradient descent by somehow centering the estimates (e.g., using a slowly moving sequence on which we can afford to compute full gradients; but this is not the only way to perform such variance reduction). This idea, while very simple, has a lot of implications, both in practice and in theory! The origins of the idea lie in the SAG algorithm of [Schmidt, Le Roux, Bach, NIPS 2012] and in SDCA [Shalev-Shwartz and Zhang, JMLR 2013]. A simpler instantiation of the idea, called SVRG, appeared shortly after in [Johnson and Zhang, NIPS 2013] (and also independently at the same conference, in [M. Madhavi, L. Zhang, R. Li, NIPS 2013]).
- An intriguing direction that I pursued fervently is the use of convex optimization for problems that have a priori nothing to do with convex optimization. A big inspiration for me was the COLT 2008 paper by Abernethy, Hazan and Rakhlin, who showed how mirror descent naturally solves bandit problems. In this decade, we (this we includes myself and co-authors, but also various other teams) explored how to use mirror descent for other online decision making problems, and made progress on some long-standing problems (k-server and MTS), see for example this set of video lectures on the “Five miracles of mirror descent”.
- Arnak Dalalyan showed how to use ideas inspired from convex optimization to analyze the Langevin Monte Carlo algorithm. This was absolutely beautiful work, that led to many many follow-ups.
- There has been a lot of rewriting of Nesterov's acceleration, to try to demystify it. Overall the enterprise is not yet a resounding success in my opinion, but certainly a lot of progress has been made (again I have written a lot about it on this blog already). We now even have optimal acceleration for higher orders of smoothness (see this 15-author paper at COLT 2019), but these techniques are clouded with the same shroud of mystery as was Nesterov's original method.
- Yin Tat Lee and Aaron Sidford obtained an efficient construction of a universal barrier.
- We now know that certain problems cannot be efficiently represented by SDPs (the so-called "extension complexity"), see e.g. this work by Lee-Raghavendra-Steurer.
- We now know how to chase convex bodies, and we can even do so very elegantly with the Steiner/Sellke point.
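The variance-reduction idea from the first bullet above fits in a few lines of code. Here is a hedged SVRG sketch on a toy least-squares problem (the instance, step size, and epoch length are my own illustrative choices, not taken from the cited papers):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def full_grad(w):   # gradient of f(w) = (1/2n) * ||Aw - b||^2
    return A.T @ (A @ w - b) / n

def grad_i(w, i):   # gradient of the i-th summand
    return A[i] * (A[i] @ w - b[i])

def svrg(eta=0.01, epochs=50):
    w = np.zeros(d)
    for _ in range(epochs):
        w_ref = w.copy()
        mu = full_grad(w_ref)  # full gradient at the slowly moving anchor point
        for _ in range(n):
            i = rng.integers(n)
            # unbiased gradient estimate whose variance vanishes as w -> w_ref
            g = grad_i(w, i) - grad_i(w_ref, i) + mu
            w -= eta * g
    return w

w_star = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(svrg() - w_star))  # small: the iterates converge
```

The centered estimate g is still unbiased for the full gradient, but its variance shrinks as the iterate approaches the anchor point, which is what allows a constant step size.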

The papers above are mostly topics on which I tried to work at some point. Here are some questions that I didn’t work on but followed closely and was fascinated by the progress:

- The stochastic block model was essentially solved during this decade, see for example this survey by Emmanuel Abbe.
- The computational/statistical tradeoffs were extensively explored, yet they remain mysterious. A nice impulse to the field was given by this COLT 2013 paper by Berthet and Rigollet relating sparse PCA and planted clique. In a similar spirit I also enjoyed the more recent work by Moitra, Jerry Li, and many co-authors on computationally efficient robust estimation (see e.g., this recent paper).
- Adaptive data analysis strikes me as both very important in practice, and quite deep theoretically, see e.g. the reusable holdout by Dwork et al. A related paper that I liked a lot is this ICML 2015 paper by Blum and Hardt, which essentially explores the regularization effect of publishing only models that beat the state of the art significantly (more generally this is an extremely interesting question, of why we can keep using the same datasets to evaluate progress in machine learning, see this provokingly titled paper “Do ImageNet Classifiers Generalize to ImageNet?“).
- A general trend has been in finding very fast (nearly linear time) methods for many classical problems. Sometimes these investigations even led to actually practical algorithms, as with the now-classical paper by Marco Cuturi at NIPS 2013 titled "Sinkhorn Distances: Lightspeed Computation of Optimal Transport".
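To make the last bullet concrete, here is a hedged sketch of Sinkhorn's matrix-scaling iterations for entropy-regularized optimal transport (the toy marginals, cost matrix, and regularization level are mine):

```python
import numpy as np

def sinkhorn(mu, nu, C, reg=0.2, iters=1000):
    """Entropy-regularized optimal transport by alternately rescaling the
    rows and columns of the Gibbs kernel exp(-C/reg)."""
    K = np.exp(-C / reg)
    u, v = np.ones_like(mu), np.ones_like(nu)
    for _ in range(iters):
        u = mu / (K @ v)        # fix the row marginals
        v = nu / (K.T @ u)      # fix the column marginals
    return u[:, None] * K * v[None, :]  # the regularized transport plan

x = np.linspace(0.0, 1.0, 5)
C = (x[:, None] - x[None, :]) ** 2        # squared-distance cost on a grid
mu = np.array([0.5, 0.2, 0.1, 0.1, 0.1])  # source distribution
nu = np.array([0.1, 0.1, 0.1, 0.2, 0.5])  # target distribution
P = sinkhorn(mu, nu, C)
print(P.sum(axis=1), P.sum(axis=0))       # marginals match mu and nu
```

Each iteration is just two matrix-vector products, which is the "lightspeed" aspect of the method.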

I also heard that, surprisingly, gradient descent can work to optimize highly non-convex functions such as training loss for neural networks. Not sure what this is about, it’s a pretty obscure topic, maybe it will catch up in the decade 2020-2029…

The above is only a tiny sample, there were many many more interesting directions being explored (tensor methods for latent variable models [Anandkumar, Ge, Hsu, Kakade, Telgarsky, JMLR 14]; phenomenon of “all local minima are good” for various non-convex learning problems, see e.g., [Ge, Lee, Ma, NIPS 2016]; etc etc). Feel free to share your favorite ML theory paper in the comments!

]]>

**Convex body chasing**

In convex body chasing an online algorithm is presented at each time step t = 1, 2, \ldots with a convex body K_t \subseteq \mathbb{R}^d, and it must choose a point x_t \in K_t. The online algorithm is trying to minimize its total movement, namely \sum_t \|x_t - x_{t-1}\| (all the norms here will be Euclidean, for simplicity). To evaluate the performance of the algorithm, one benchmarks it against the smallest movement one could achieve for this sequence of bodies, if one knew in advance the whole sequence. A good example to have in mind is a sequence of lines rotating around some fixed point: then the best thing to do is to move to this fixed point and never move again, while a greedy algorithm (always project onto the new line) might have infinite movement (in the continuous time limit). The competitive ratio is the worst case (over all possible sequences of convex bodies) ratio of the algorithm's performance to that of the oracle optimum. This problem was introduced in 1993 by Friedman and Linial, mainly motivated by considering a more geometric version of the k-server problem.
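A hedged little simulation of the rotating-lines example above (the discretization is mine): greedy projection onto each new line accumulates movement that scales like the inverse of the rotation step, while the oracle, moving once to the rotation center, pays a constant.

```python
import numpy as np

def project_to_line(p, theta):
    d = np.array([np.cos(theta), np.sin(theta)])  # unit direction of the line
    return (p @ d) * d                            # Euclidean projection

costs = {}
for delta in [0.2, 0.1, 0.05]:      # angle swept per time step
    p = np.array([1.0, 0.0])        # start on the first line, distance 1 from center
    cost = 0.0
    for t in range(1, int(20 / delta**2)):
        q = project_to_line(p, t * delta)
        cost += np.linalg.norm(q - p)
        p = q
    costs[delta] = cost

# greedy movement scales like 2/delta, so it blows up in the continuous
# limit, while moving once to the rotation center costs only 1
print(costs)
```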

As it turns out, this problem in fact has a lot of "real" applications, which is not too surprising given how elementary the problem is. Just to give a flavor, Adam Wierman and friends showed that dynamic power management of data centers can be viewed as an instance of convex *function* chasing. In convex function chasing, instead of a convex body one gets a convex function f_t, and one then pays both a movement cost \|x_t - x_{t-1}\| and a *service cost* f_t(x_t). While this is more general than convex body chasing, it turns out that convex body chasing in dimension d+1 is enough to solve convex function chasing in dimension d (this is left to the reader as an exercise). Roughly speaking, in dynamic power management one controls various power levels (number of servers online, etc.), and a request corresponds to a new set of jobs. Turning on/off servers has an associated cost (the movement cost), while servicing the jobs with a certain number of servers has an associated delay (the service cost). You get the idea.

**The Steiner point**

In a paper to appear at SODA too, we propose (with Yin Tat Lee, Yuanzhi Li, Bo'az Klartag, and Mark Sellke) to use the **Steiner point** for the nested convex body chasing problem. In nested convex body chasing the sequence of bodies is nested, K_1 \supseteq K_2 \supseteq \ldots \supseteq K_T, and to account for an irrelevant additive constant we assume that the oracle optimum starts from the worst point in the starting set K_1. In other words, opt is starting at the point of K_1 furthest away from its closest point in K_T, so its movement is exactly equal to the Hausdorff distance between K_1 and K_T. So all we have to do is to give an online algorithm whose movement is proportional to this Hausdorff distance; that is, we need some kind of Lipschitzness in Hausdorff distance. Well, it turns out that mathematicians have been looking at this for centuries already, and Steiner back in the 1800's introduced just what we need: a **selector** (a map \mathrm{st} from convex sets to points in them, i.e. \mathrm{st}(K) \in K for any convex set K) which is Lipschitz with respect to the Hausdorff distance, that is:

\|\mathrm{st}(K) - \mathrm{st}(K')\| \le C \cdot d_H(K, K').

Note that obtaining such a selector is not trivial; for example the first natural guess would be the center of gravity, but it turns out that this is **not** Lipschitz (consider for example what happens with K being a very thin triangle and K' being the base of this triangle). The crazy thing is, the Steiner point selector (to be defined shortly) is not only Lipschitz, it is in fact **the most Lipschitz selector** for convex sets. How would you prove such a thing? Well, you clearly need a miracle to happen, and the miracle here is that the Steiner point satisfies some symmetries which define it uniquely. From there all you need to show is that, starting from any selector, you can add some of these symmetries while also improving the Lipschitz constant (for more on this see the references in the linked paper above).

OK, so what is the Steiner point? It has many definitions, but a particularly appealing one from an optimizer's viewpoint is as follows. For a convex set K and a vector \theta \in \mathbb{R}^d, let g_K(\theta) be the maximizer of the linear function x \mapsto \langle \theta, x \rangle in the set K. Then the Steiner point of K is defined by:

\mathrm{st}(K) = \mathbb{E}_{\theta \sim \mathbb{S}^{d-1}} [\, g_K(\theta) \,],

where \theta is drawn uniformly from the unit sphere \mathbb{S}^{d-1}.

In words: for any direction \theta take the furthest point of K in that direction, and average over all directions. This feels like a really nice object, and clearly it satisfies \mathrm{st}(K) \in K (i.e., it is a selector), since it is an average of points of the convex set K. The algorithm we proposed in our SODA paper is simply to move to x_t = \mathrm{st}(K_t). Let us now see how to analyze this algorithm.
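Since the Steiner point is an average of linear maximizers, it is straightforward to estimate by Monte Carlo for a polytope. A hedged sketch (the example body is mine; for a centrally symmetric body the Steiner point must be the center of symmetry):

```python
import numpy as np

rng = np.random.default_rng(0)

def steiner_mc(vertices, samples=200_000):
    """Monte Carlo estimate of st(conv(vertices)): average over uniform
    directions theta of the vertex maximizing <theta, x>."""
    d = vertices.shape[1]
    theta = rng.standard_normal((samples, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # uniform on the sphere
    best = vertices[np.argmax(theta @ vertices.T, axis=1)]
    return best.mean(axis=0)

# the rectangle [0,2] x [0,1]: by symmetry its Steiner point is the center
V = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 1.0], [2.0, 1.0]])
print(steiner_mc(V))  # close to (1, 0.5)
```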

**The nested case proof**

First we give an alternative formula for the Steiner point. Denote h_K(\theta) = \max_{x \in K} \langle \theta, x \rangle for the support function of K, and observe that \nabla h_K(\theta) = g_K(\theta) for almost all \theta. Thus using the divergence theorem one obtains:

\mathrm{st}(K) = \mathbb{E}_{\theta \sim B(0,1)} [\, \nabla h_K(\theta) \,] = d \cdot \mathbb{E}_{\theta \sim \mathbb{S}^{d-1}} [\, h_K(\theta)\, \theta \,].

(The factor d comes from the ratio of the surface area of the sphere to the volume of the ball, and the \theta in the expectation is there because the outer normal to the sphere at \theta is \theta itself.)

Now we have the following one-line inequality to control the movement of the Steiner point:

\|\mathrm{st}(K) - \mathrm{st}(K')\| = d\, \big\| \mathbb{E}_{\theta} [\, (h_K(\theta) - h_{K'}(\theta))\, \theta \,] \big\| \le d\, \mathbb{E}_{\theta} \big| h_K(\theta) - h_{K'}(\theta) \big|.

It only remains to observe that |h_K(\theta) - h_{K'}(\theta)| \le d_H(K, K') to obtain a proof that Steiner is d-Lipschitz (in fact as an exercise you can try to improve the above argument to obtain \sqrt{d}-Lipschitz). Now for nested convex body chasing we will have K_{t+1} \subseteq K_t, so that h_{K_{t+1}}(\theta) \le h_{K_t}(\theta)

(i.e., no absolute values), and thus the upper bound on the movement will telescope! More precisely:

\sum_{t=1}^{T-1} \|\mathrm{st}(K_{t+1}) - \mathrm{st}(K_t)\| \le d\, \mathbb{E}_{\theta} \sum_{t=1}^{T-1} \big( h_{K_t}(\theta) - h_{K_{t+1}}(\theta) \big) = d\, \mathbb{E}_{\theta} \big[ h_{K_1}(\theta) - h_{K_T}(\theta) \big] \le d \cdot d_H(K_1, K_T).

This proves that the Steiner point is d-competitive for nested convex body chasing. How to generalize this to the non-nested case seemed difficult, and the breakthrough of Mark and the CMU team is to bring back an old friend of the online algorithms community: the work function.

**The work function**

The work function is defined as follows. For any point x it is the smallest cost one can pay to satisfy all the requests so far and end up at x; denote it W_t(x) after the t-th request. First observe that this function is convex. Indeed take two points x and y, and the middle point z = (x+y)/2 (any point on the segment between x and y would be treated similarly). In the definition of the work function there is an associated optimal trajectory for both x and y. By convexity of the requests, the sequence of midpoints between those two trajectories is a valid trajectory for z! And moreover the movement of this midpoint trajectory is (by the triangle inequality) at most the average of the movements of the trajectories for x and y. Hence W_t(z) \le (W_t(x) + W_t(y))/2.

A very simple property we will need from the work function is that the norm of its gradient carries some information on the current request. Namely, \|\nabla W_t(x)\| = 1 for any x \notin K_t (indeed, if x is not in the current request set K_t, then the best way to end up there is to move to a point in K_t and then move to x; if K_t is a polytope then when you move x a little bit the entry point into K_t does not move, so the cost is changing at rate 1, hence the norm of the gradient being 1). Or to put it differently: if \|\nabla W_t(x)\| < 1 then x \in K_t.
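These facts are easy to observe numerically in dimension one, where the work function can be computed on a grid by alternating a 1-Lipschitz envelope (an inf-convolution with the absolute value) and a restriction to the current request. A hedged sketch with a single interval request (the grid and requests are mine):

```python
import numpy as np

xs = np.linspace(-5.0, 5.0, 1001)   # discretized line
h = xs[1] - xs[0]

def lipschitz_envelope(W):
    """Computes x -> min_z W(z) + |z - x|, the largest 1-Lipschitz minorant."""
    out = W.copy()
    for i in range(1, len(out)):             # forward pass
        out[i] = min(out[i], out[i - 1] + h)
    for i in range(len(out) - 2, -1, -1):    # backward pass
        out[i] = min(out[i], out[i + 1] + h)
    return out

def serve(W, a, b):
    """Work function update for the request K_t = [a, b]."""
    reach = lipschitz_envelope(W)            # cheapest cost to reach a point y...
    reach[(xs < a) | (xs > b)] = np.inf      # ...which must lie in the request
    return lipschitz_envelope(reach)         # then walk from y to the endpoint x

W = np.abs(xs)             # W_0: all trajectories start at x = 0
W = serve(W, 2.0, 3.0)     # a single request K_1 = [2, 3]

slopes = np.abs(np.diff(W)) / h
print(W.min())                          # 2.0, the offline optimal cost
print(slopes[xs[:-1] < 1.9].min())      # 1.0: |W'| = 1 outside the request
```

The minimum of the work function is the cost of the offline optimum, and the slope has magnitude exactly 1 away from the current request, as claimed above.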

**The Sellke point**

Mark's beautiful idea (and the CMU team's very much related, in fact equivalent, idea) to generalize the Steiner point to the non-nested case is to use the work function as a surrogate for the request. The Steiner point of K_t will clearly not work, since all of K_t does not matter in the same way: some points might be essentially irrelevant because they are very far from the previous requests, a fact which will be uncovered by the work function (while on the other hand the random direction from the Steiner point definition is oblivious to the geometry of the previous requests). So how to pick an appropriately random point in K_t while respecting the work function structure? Well, we just saw that the gradient of W_t has norm smaller than 1 only inside K_t. So how about taking a random direction \theta in the unit ball, and applying the inverse of the gradient map, namely the gradient of the **Fenchel dual** W_t^*? Recall that the Fenchel dual is defined by:

W_t^*(\theta) = \sup_x \; \langle \theta, x \rangle - W_t(x).

This is exactly the algorithm Mark proposes and it goes like this (Mark calls it the *functional Steiner point*):

x_t = \mathbb{E}_{\theta \sim B(0,1)} [\, \nabla W_t^*(\theta) \,].

Crucially this is a valid point, namely x_t \in K_t (for \|\theta\| < 1 the point \nabla W_t^*(\theta) is a maximizer of x \mapsto \langle \theta, x \rangle - W_t(x), at which \nabla W_t = \theta, so it lies in K_t). Moreover just like for the Steiner point we can apply the divergence theorem and obtain:

x_t = d \cdot \mathbb{E}_{\theta \sim \mathbb{S}^{d-1}} [\, W_t^*(\theta)\, \theta \,].

The beautiful thing is that we can now **exactly** repeat the nested convex body argument, since W_{t+1}^*(\theta) \le W_t^*(\theta) for all \theta (just like we had h_{K_{t+1}} \le h_{K_t} in the nested case), and so we get:

\sum_{t=1}^{T-1} \|x_{t+1} - x_t\| \le d\, \mathbb{E}_{\theta \sim \mathbb{S}^{d-1}} \big[ W_1^*(\theta) - W_T^*(\theta) \big].

The first term, with W_1^*, is just some additive constant which we can ignore, while the second term is bounded as follows:

\mathbb{E}_{\theta} [\, - W_T^*(\theta) \,] = \mathbb{E}_{\theta} \Big[ \min_x \big( W_T(x) - \langle \theta, x \rangle \big) \Big] \le \min_x \mathbb{E}_{\theta} \big[ W_T(x) - \langle \theta, x \rangle \big] = \min_x W_T(x),

where the last equality uses \mathbb{E}_{\theta}[\theta] = 0.

Thus we exactly proved that the movement of the Sellke point is upper bounded by a constant plus d times the minimum of the work function, and the latter is nothing but the value of the oracle optimum!

Note that once everything is said and done, the proof has only **two** inequalities (the triangle inequality as in the nested case, and the fact that the expectation of a minimum is less than the minimum of the expectation). It doesn't get any better than this!

This is a continuation of Julien Mairal‘s guest post on CNNs, see part I here.

**Stability to deformations of convolutional neural networks**

In their ICML paper Zhang et al. introduce a functional space for CNNs with one layer, by noticing that for some dot-product kernels, smoothed variants of rectified linear unit activation functions (ReLU) live in the corresponding RKHS, see also this paper and that one. By following a similar reasoning with multiple layers, it is then possible to show that the functional space described in part I contains CNNs with such smoothed ReLU, and that the norm of such networks can be controlled by the spectral norms of filter matrices. This is consistent with previous measures of complexity for CNNs, see this paper by Bartlett et al.

A perhaps more interesting finding is that the abstract representation \Phi(x), which only depends on the network architecture, may provide near-translation invariance and stability to small image deformations while preserving information—that is, x can be recovered from \Phi(x). The original characterization we use was introduced by Mallat in his paper on the scattering transform—a multilayer architecture akin to CNNs based on wavelets—and was extended to our setting by Alberto Bietti, who should be credited for all the hard work here.

Our goal is to understand under which conditions it is possible to obtain a representation that (i) is near-translation invariant, (ii) is stable to deformations, (iii) preserves signal information. Given a C^1-diffeomorphism \tau and denoting by L_\tau its action operator, (L_\tau x)(u) = x(u - \tau(u)) (for an image x defined on the continuous domain \Omega = \mathbb{R}^2), the main stability bound we obtain is the following one (see Theorem 7 in Mallat's paper for the wavelet analogue): if \|\nabla \tau\|_\infty \le 1/2, then for all x,

\|\Phi(L_\tau x) - \Phi(x)\| \le \Big( C_1\, \|\nabla \tau\|_\infty + \frac{C_2}{\sigma}\, \|\tau\|_\infty \Big)\, \|x\|,

where C_1 and C_2 are constants, \sigma is the scale parameter of the pooling operator corresponding to the "amount of pooling" performed up to the last layer, \|\tau\|_\infty is the maximum pixel displacement and \|\nabla \tau\|_\infty represents the maximum amount of deformation; see the paper for the precise definitions of all these quantities. Note that when \sigma \to +\infty, the representation becomes translation invariant: indeed, consider the particular case of \tau being a translation, then \nabla \tau = 0 and the term \frac{C_2}{\sigma}\|\tau\|_\infty vanishes.

The stability bound and a few additional results tell us a few things about the network architecture: (a) small patches lead to more stable representations (the dependency is hidden in C_1); (b) signal preservation for discrete signals requires small subsampling factors (and thus small pooling) between layers. In such a setting, the scale parameter \sigma still grows exponentially with the number of layers, and near-translation invariance may be achieved with several layers.

Interestingly, we may now come back to the Cauchy-Schwarz inequality from part 1, and note that if \Phi is stable, the RKHS norm \|f\| is then a natural quantity that provides stability to deformations of the prediction function f, in addition to measuring model complexity in a traditional sense.

**Feature learning in RKHSs and convolutional kernel networks**

The previous paragraph is devoted to the characterization of convolutional architectures such as CNNs but the previous kernel construction can in fact be used to derive more traditional kernel methods. After all, why should one spend efforts defining a kernel between images if not to use it?

This can be achieved by considering finite-dimensional approximations of the previous feature maps. In order to shorten the presentation, we simply describe the main idea based on the Nystrom approximation and refer to the paper for more details. Approximating the infinite-dimensional feature maps (see the figure at the top of part I) can be done by projecting each point of the feature maps onto a finite-dimensional subspace, leading to a finite-dimensional feature map akin to CNNs, see the figure at the top of the post.

By parametrizing the subspace with anchor points Z = [z_1, \ldots, z_p], and using a dot-product kernel, a patch x is encoded through the mapping function

\psi(x) = \kappa(Z^\top Z)^{-1/2}\, \kappa(Z^\top x),

where \kappa is applied pointwise. Then, computing \psi on all patches of a feature map admits a CNN interpretation, where only the normalization and the matrix multiplication by \kappa(Z^\top Z)^{-1/2} are not standard operations. It remains now to choose the anchor points:

- **kernel approximation:** a first approach consists of using a variant of the Nystrom method, see this paper and that one. When plugging the corresponding image representation in a linear classifier, the resulting approach behaves as a classical kernel machine. Empirically, we observe that the higher the number of anchor points, the better the kernel approximation, and the higher the accuracy. For instance, a two-layer network with high-dimensional representations achieves a high accuracy on CIFAR-10 without data augmentation (see here).
- **back-propagation, feature selection:** learning the anchor points can also be done as in a traditional CNN, by optimizing them end-to-end. This allows using deeper, lower-dimensional architectures and empirically seems to perform better when enough data is available, e.g., a better accuracy on CIFAR-10 with simple data augmentation. There, the subspaces are not learned anymore to provide the best kernel approximation; instead, the model seems to perform a sort of feature selection in each layer's RKHS, which is not well understood yet (this feature selection interpretation is due to my collaborator Laurent Jacob).
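As a sanity check of the Nystrom idea above, here is a hedged numpy sketch of the patch encoding \psi(x) = \kappa(Z^\top Z)^{-1/2} \kappa(Z^\top x) (the exponential \kappa and the toy data are my choices): when the anchor points are taken to be the data points themselves, the approximate kernel \langle \psi(x), \psi(x') \rangle is exact.

```python
import numpy as np

rng = np.random.default_rng(0)

def kappa(u):
    return np.exp(u - 1.0)   # a dot-product kernel for unit-norm vectors

def inv_sqrt(M, eps=1e-7):
    """Inverse matrix square root of a symmetric PSD matrix."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

def encode(X, Z):
    """Columns are psi(x) = kappa(Z^T Z)^(-1/2) kappa(Z^T x), one per row of X."""
    return inv_sqrt(kappa(Z @ Z.T)) @ kappa(Z @ X.T)

n, d = 8, 10
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm "patches"

Psi = encode(X, Z=X)   # anchor points = the data itself: Nystrom is exact
err = np.abs(Psi.T @ Psi - kappa(X @ X.T)).max()
print(err)  # essentially zero: <psi(x), psi(x')> recovers the kernel
```

With fewer anchor points than data points the recovery is only approximate, which is exactly the trade-off discussed in the bullets above.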

Note that the first CKN model published here was based on a different approximation principle, which was not compatible with end-to-end training. We found it to be less scalable and less effective.

**Other links between neural networks and kernel methods**

Finally, other links between kernels and infinitely-wide neural networks with random weights are classical, but they were not the topic of this blog post (they should be the topic of another one!). In a nutshell, for a large collection of weight distributions p and nonlinear functions s, the following quantity admits an analytical form

K(x, y) = \mathbb{E}_{w \sim p} \big[\, s(\langle w, x \rangle)\, s(\langle w, y \rangle) \,\big],

where the terms s(\langle w, x \rangle) may be seen as the units of an infinitely-wide single-layer neural network. The first time such a relation appeared is likely in the PhD thesis of Radford Neal with a Gaussian process interpretation, and it was revisited later by Le Roux and Bengio and by Cho and Saul with multilayer models.

In particular, when s is the rectified linear unit and w follows a Gaussian distribution, it is known that we recover the arc-cosine kernel. We may also note that random Fourier features yield a similar interpretation.
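This correspondence is easy to verify numerically. A hedged sketch, where the closed form is the order-1 arc-cosine kernel of Cho and Saul, normalized so that it equals \mathbb{E}_w[\mathrm{ReLU}(w^\top x)\,\mathrm{ReLU}(w^\top y)] for w \sim \mathcal{N}(0, I):

```python
import numpy as np

rng = np.random.default_rng(0)

def arccos1(x, y):
    """Closed form of E_w[relu(w.x) * relu(w.y)] for w ~ N(0, I)."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    t = np.arccos(np.clip(x @ y / (nx * ny), -1.0, 1.0))  # angle between x and y
    return nx * ny * (np.sin(t) + (np.pi - t) * np.cos(t)) / (2 * np.pi)

d, width = 5, 500_000
x, y = rng.standard_normal(d), rng.standard_normal(d)
W = rng.standard_normal((width, d))   # a very wide random ReLU layer
mc = np.mean(np.maximum(W @ x, 0.0) * np.maximum(W @ y, 0.0))
print(mc, arccos1(x, y))  # the Monte Carlo average matches the closed form
```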

Other important links have also been drawn recently between kernel regression and strongly over-parametrized neural networks, see this paper and that one, which is another exciting story.

]]>

I (*n.b., Julien Mairal*) have been interested in drawing links between neural networks and kernel methods for some time, and I am grateful to Sebastien for giving me the opportunity to say a few words about it on his blog. My initial motivation was not to provide another “why deep learning works” theory, but simply to encode into kernel methods a few successful principles from convolutional neural networks (CNNs), such as the ability to model the local stationarity of natural images at multiple scales—we may call that modeling receptive fields—along with feature compositions and invariant representations. There was also something challenging in trying to reconcile end-to-end deep neural networks and non-parametric methods based on kernels that typically decouple data representation from the learning task.

The main goal of this blog post is then to discuss the construction of a particular multilayer kernel for images that encodes the previous principles, derive some invariance and stability properties for CNNs, and also present a simple mechanism to perform feature learning in reproducing kernel Hilbert spaces. In other words, we should not see any intrinsic contradiction between kernels and representation learning.

**Preliminaries on kernel methods**

Given data living in a set \mathcal{X}, a positive definite kernel K: \mathcal{X} \times \mathcal{X} \to \mathbb{R} implicitly defines a Hilbert space \mathcal{H} of functions from \mathcal{X} to \mathbb{R}, called the reproducing kernel Hilbert space (RKHS), along with a mapping function \varphi: \mathcal{X} \to \mathcal{H}.

A predictive model f in \mathcal{H} associates to every point x a label in \mathbb{R}, and admits the simple form f(x) = \langle f, \varphi(x) \rangle. Then, the Cauchy-Schwarz inequality gives us a first basic stability property:

|f(x) - f(x')| \le \|f\| \cdot \|\varphi(x) - \varphi(x')\|.

This relation exhibits a discrepancy between neural networks and kernel methods. Whereas neural networks optimize the data representation for a specific task, the term on the right involves the product of two quantities where data representation and learning are decoupled:

\|\varphi(x) - \varphi(x')\| is a distance between two data representations, which are independent of the learning process, and \|f\| is a norm on the model (typically optimized over data) that acts as a measure of complexity.
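A quick numerical illustration of this decoupling with a Gaussian kernel (the data and the model are arbitrary choices of mine; the inequality holds for any f in the RKHS):

```python
import numpy as np

rng = np.random.default_rng(0)

def k(a, b):   # Gaussian kernel
    return np.exp(-np.sum((a - b) ** 2) / 2.0)

# a model f = sum_i alpha_i k(x_i, .) in the RKHS
X = rng.standard_normal((20, 3))
alpha = rng.standard_normal(20)
G = np.array([[k(a, b) for b in X] for a in X])  # Gram matrix
f = lambda z: sum(al * k(xi, z) for al, xi in zip(alpha, X))
f_norm = np.sqrt(alpha @ G @ alpha)              # the RKHS norm of f

x, xp = rng.standard_normal(3), rng.standard_normal(3)
repr_dist = np.sqrt(k(x, x) + k(xp, xp) - 2.0 * k(x, xp))  # representation distance
print(abs(f(x) - f(xp)) <= f_norm * repr_dist)   # True, by Cauchy-Schwarz
```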

Thinking about neural networks in terms of kernel methods then requires defining the underlying representation \varphi(x), which can only depend on the network architecture, and the model f, which will be parametrized by the (learned) network's weights.

**Building a convolutional kernel for convolutional neural networks**

Following Alberto Bietti's paper, we now consider the direct construction of a multilayer convolutional kernel for images. Given a two-dimensional image x_0, the main idea is to build a sequence of "feature maps" x_1, \ldots, x_L that are two-dimensional spatial maps carrying information about image neighborhoods (a.k.a. receptive fields) at every location. As we proceed in this sequence, the goal is to model larger neighborhoods with more "invariance".

Formally, an input image x_0 is represented as a square-integrable function in L^2(\Omega, \mathcal{H}_0), where \Omega is a set of pixel coordinates and \mathcal{H}_0 is a Hilbert space. \Omega may be a discrete grid or a continuous domain such as \mathbb{R}^2, and \mathcal{H}_0 may simply be \mathbb{R}^3 for RGB images. Then, a feature map x_k in L^2(\Omega, \mathcal{H}_k) is obtained from the previous layer x_{k-1} as follows:

- *modeling larger neighborhoods than in the previous layer:* we map neighborhoods (patches) from x_{k-1} to a new Hilbert space \mathcal{H}_k. Concretely, we define a homogeneous dot-product kernel between patches z, z' from x_{k-1}:

K_k(z, z') = \|z\|\, \|z'\|\, \kappa\!\left( \frac{\langle z, z' \rangle}{\|z\|\,\|z'\|} \right),

where \langle \cdot, \cdot \rangle is an inner-product derived from \mathcal{H}_{k-1}, and \kappa is a non-linear function that ensures positive definiteness, *e.g.*, the exponential kernel \kappa(u) = e^{u-1} for vectors with unit norm, see this paper. By doing so, we implicitly define a kernel mapping that maps patches from x_{k-1} to a new Hilbert space \mathcal{H}_k. This mechanism is illustrated in the picture at the beginning of the post, and produces a spatial map that carries these patch representations.
- *increasing invariance:* to gain invariance to small deformations, we smooth x_k with a linear filter, as shown in the picture at the beginning of the post, which may be interpreted as anti-aliasing (in terms of signal processing) or linear pooling (in terms of neural networks).

Formally, the previous construction amounts to applying operators P_k (patch extraction), M_k (kernel mapping), and A_k (smoothing/pooling operator) to x_{k-1}, such that the k-th layer representation can be written as

x_k = A_k M_k P_k\, x_{k-1}.

We may finally define a kernel for images as \mathcal{K}(x_0, x_0') = \langle \Phi(x_0), \Phi(x_0') \rangle, where \Phi(x_0) is the final feature map of the construction above, and whose RKHS contains the functions f_w(x_0) = \langle w, \Phi(x_0) \rangle for w in L^2(\Omega, \mathcal{H}_L). Note now that we have introduced a concept of image representation \Phi(x_0), which only depends on some network architecture (amounts of pooling, patch size), and of predictive model f_w, parametrized by w.

From such a construction, we will now derive stability results for classical convolutional neural networks (CNNs) and then derive non-standard CNNs based on kernel approximations that we call convolutional kernel networks (CKNs).

Next week, we will see how to perform feature (end-to-end) learning with the previous kernel representation, and also discuss other classical links between neural networks and kernel methods.

]]>

In the comments of the previous blog post we asked if the new viewpoint on best of both worlds can be used to get clean "interpolation" results. The context is as follows: in a STOC 2018 paper followed by a COLT 2019 paper, the following corruption model was discussed: stochastic bandits, except for C rounds which are adversarial. The state of the art bounds were of the form: optimal (or almost optimal) stochastic term plus K C, and it was mentioned as an open problem whether K C could be improved to C (there is a lower bound showing that a term of order C is necessary). As was discussed in the comment section, it seemed that indeed this clean best of both worlds approach should certainly shed light on the corruption model. It turns out that this is indeed the case, and a one-line calculation resolves positively the open problem from the COLT paper. The formal result is as follows (recall the notation/definitions from the previous blog post):

Lemma: Consider a strategy whose regret R_T with respect to the optimal action i^* is upper bounded by

(1) \quad R_T \le c\, \mathbb{E} \sum_{t=1}^{T} \sum_{i \neq i^*} \sqrt{\frac{p_{i,t}}{t}},

for some constant c > 0.

Then in the C-corruption stochastic bandit model one has that the regret is bounded by:

R_T \le \sum_{i \neq i^*} \frac{c^2 \log T}{\Delta_i} + 4 c \sqrt{K C} + 2 C.

Note that by the previous blog post we know strategies that satisfy (1) with c a small numerical constant (see Lemma 2 in the previous post).

*Proof:* In equation (1) let us apply Jensen over the corrupt rounds: since \sum_{i \neq i^*} \sqrt{p_{i,t}} \le \sqrt{K}, these rounds contribute at most c \sqrt{K} \sum_{t=1}^{C} 1/\sqrt{t} \le 2 c \sqrt{K C}. For the non-corrupt rounds, let us use that

c \sqrt{\frac{p_{i,t}}{t}} \le \frac{\Delta_i\, p_{i,t}}{2} + \frac{c^2}{2 t \Delta_i}.

The sum of the second term on the right hand side is upper bounded by \sum_{i \neq i^*} \frac{c^2 \log T}{2 \Delta_i}. On the other hand the sum (over non-corrupt rounds) of the first term is equal to half of the regret over the non-corrupt rounds, which is certainly smaller than half of the total regret plus C. Thus we obtain (denoting R for the total regret):

R \le \frac{R}{2} + C + \sum_{i \neq i^*} \frac{c^2 \log T}{2 \Delta_i} + 2 c \sqrt{K C},

which concludes the proof.

]]>

**Stochastic bandit and adversarial examples**

In multi-armed bandit problems the gold standard property, going back to a seminal paper of Lai and Robbins in 1985, is to have a regret upper bounded by:

(1) \quad \sum_{i \neq i^*} \frac{\log T}{\Delta_i}

Let me unpack this a bit: this is for the scenario where the reward process for each action is simply an i.i.d. sequence from some fixed distribution, i^* is the index of the (unique) best action, and \Delta_i is the gap between the mean value of the best action and the one of action i. Such a guarantee is extremely strong, as in particular it means that actions whose average performance is a constant away from the optimal arm are very rarely played (only of order \log T times). On the other hand, the price to pay for such an aggressive behavior (by this I mean focusing on good actions very quickly) is that all the classical algorithms attaining the above bound are extremely sensitive to *adversarial examples*: that is, if there is some deviation from the i.i.d. assumption (even very brief in time), the algorithms can suddenly suffer regret linear in T.

**Adversarial bandit**

Of course there is an entirely different line of works, on *adversarial multi-armed bandits*, where the whole point is to prove regret guarantees for *any* reward process. In this case the best one can hope for is a regret of order \sqrt{K T}. The classical algorithm in this model, Exp3, attains a regret of order \sqrt{K T \log K}. In joint work with Jean-Yves Audibert we showed back in 2009 that the following strategy, which we called PolyINF, attains the optimal \sqrt{K T}: view Exp3 as Mirror Descent with the (negative) entropy as a regularizer, and now replace the entropy by a simple rational function, namely -\sum_i x_i^q with q \in (0,1) (this mirror descent view was actually derived in a later paper with Gabor Lugosi). The proof becomes one line (given the appropriate knowledge of mirror descent and estimation in bandit games): the radius part of the bound is of order K^{1-q}/\eta, while the variance part is of order \eta T K^{q} (since the inverse of the Hessian of the mirror map is a diagonal matrix with entries proportional to x_i^{2-q}):

R_T \lesssim \frac{K^{1-q}}{\eta} + \eta\, T K^{q}.

Thus optimizing over \eta yields a regret of order \sqrt{T K} for any q \in (0,1). Interestingly, the best numerical constant in the bound is obtained for q = 1/2.
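To make this concrete, here is a hedged sketch of the q = 1/2 strategy run on a toy stochastic instance (the FTRL formulation, the 1/\sqrt{t} learning-rate schedule, and all constants are my simplifications; the tuned constants in the actual papers differ):

```python
import numpy as np

rng = np.random.default_rng(1)

def tsallis_probs(Lhat, eta, iters=40):
    """FTRL step with the q = 1/2 regularizer (up to constants): solves
    p_i = (eta * (Lhat_i - z))^(-2) with z tuned so that sum(p) = 1."""
    K = len(Lhat)
    lo = Lhat.min() - np.sqrt(K) / eta   # here sum(p) <= 1
    hi = Lhat.min() - 1.0 / eta          # here sum(p) >= 1
    for _ in range(iters):               # bisection on the normalizer z
        z = 0.5 * (lo + hi)
        if np.sum((eta * (Lhat - z)) ** -2.0) < 1.0:
            lo = z
        else:
            hi = z
    p = (eta * (Lhat - 0.5 * (lo + hi))) ** -2.0
    return p / p.sum()

loss_means = np.array([0.3, 0.5, 0.5])   # stochastic Bernoulli losses, arm 0 is best
T, K = 5000, 3
Lhat = np.zeros(K)                        # importance-weighted cumulative losses
pulls = np.zeros(K, dtype=int)
for t in range(1, T + 1):
    p = tsallis_probs(Lhat, eta=1.0 / np.sqrt(t))
    i = int(rng.choice(K, p=p))
    pulls[i] += 1
    loss = float(rng.random() < loss_means[i])
    Lhat[i] += loss / p[i]                # unbiased loss estimate
print(pulls)  # the best arm receives the overwhelming majority of the pulls
```

Note that nothing in the update "knows" the environment is stochastic, yet the suboptimal arms end up being pulled rarely, which is the phenomenon discussed in the rest of this post.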

**Best of both worlds**

This was the state of affairs back in 2011, when with Alex Slivkins we started working on a *best of both worlds* type algorithm (which in today's language is exactly a stochastic MAB robust to adversarial examples): namely one that gets the logarithmic guarantee (1) if the environment is the nice i.i.d. one (with an extra logarithmic factor in our original paper), and also \sqrt{T} (up to logarithmic factors) in the worst case. This original best of both worlds algorithm was of the following form: be aggressive as if it was a stochastic environment, but still sample sufficiently often the bad actions to make sure there isn't an adversary trying to hide some non-stochastic behavior on these seemingly bad performing actions. Of course the whole difficulty was to show that it is possible to implement such a defense without hurting the stochastic performance too much (remember that bad actions can only be sampled of order \log T times!). Since this COLT 2012 paper there have been many improvements to the original strategy, as well as many variants/refinements (one such variant worth mentioning are the works trying to do a smooth transition between the stochastic and adversarial models, see e.g. here and here).

**A stunning twist**

The amazing development that I want to talk about in this post is the following: about a year ago Julian Zimmert and Yevgeny Seldin proved that the 2009 PolyINF strategy (crucially with q = 1/2) actually gets the 2012 best of both worlds bound! This is truly surprising, as in principle mirror descent does not “know” anything about stochastic environments, and it does not make any sophisticated concentration reasoning (say as in Lai and Robbins), yet it seems to automatically and optimally pick up on the regularity in the data. This is really amazing to me, and of course it is also a total surprise that the polynomial regularizer has such a strong adaptivity property, while it was merely introduced to remove a benign log term.

The crucial observation of Zimmert and Seldin is that a certain *self-bounding* property of the regret implies (in a one-line calculation) the best of both worlds result:

Lemma 1: Consider a strategy whose regret with respect to the optimal action i^* is upper bounded by

c \sum_{t=1}^{T} \sum_{i \neq i^*} \sqrt{\frac{x_{i,t}}{t}} ,     (2)

for some constant c > 0. (Recall that for multi-armed bandit one selects a probability distribution x_t over the n actions, so x_{i,t} denotes here the probability of playing action i at time t.) Then one has that the regret is in fact bounded by 2 c \sqrt{nT} (this follows trivially by Jensen on the sum over actions), and moreover if the environment is stochastic one has that the regret is in fact bounded by c^2 (1 + \log(T)) times \sum_{i \neq i^*} \frac{1}{\Delta_i} (where \Delta_i denotes the suboptimality gap of action i).

*Proof:* Assuming that the environment is stochastic we can write the regret as \sum_{t=1}^T \sum_{i \neq i^*} \Delta_i \mathbb{E}[x_{i,t}], so by assumption and using that \sqrt{ab} \le \frac{a}{2} + \frac{b}{2} (with a = \Delta_i x_{i,t} and b = \frac{c^2}{\Delta_i t}) one has:

R_T \le \sum_{t=1}^T \sum_{i \neq i^*} \left( \frac{\Delta_i \mathbb{E}[x_{i,t}]}{2} + \frac{c^2}{2 \Delta_i t} \right) ,

which means that the left hand side R_T is smaller than \sum_{i \neq i^*} \frac{c^2}{\Delta_i} \sum_{t=1}^T \frac{1}{t}, which is indeed smaller than (1 + \log(T)) times \sum_{i \neq i^*} \frac{c^2}{\Delta_i}.
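The only inequality used in this proof is the AM-GM step \sqrt{ab} \le a/2 + b/2 in its rescaled form; a quick numerical sanity check (purely illustrative) over a grid of values:

```python
import numpy as np

# AM-GM step used in the proof of Lemma 1:
# sqrt(x/t) = sqrt( (Delta*x) * (1/(Delta^2)) * (1/t) )
#          <= Delta*x/2 + 1/(2*Delta*t)   (after absorbing constants).
xs = np.linspace(1e-3, 1.0, 50)        # probabilities x in (0, 1]
ts = np.arange(1, 101)                 # rounds t
deltas = np.linspace(1e-2, 1.0, 50)    # gaps Delta in (0, 1]
X, T, D = np.meshgrid(xs, ts, deltas, indexing="ij")
lhs = np.sqrt(X / T)
rhs = D * X / 2 + 1 / (2 * D * T)
assert np.all(lhs <= rhs + 1e-12)
```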

**Yet another one-line proof (okay, maybe 5 lines)**

Zimmert and Seldin proved that PolyINF with q = 1/2 actually satisfies the self-bounding property of Lemma 1 (and thus obtains the best of both worlds guarantee). In another recent paper by Zimmert, in joint work with Haipeng Luo and Chen-Yu Wei, they simplify the analysis by using a very mild variation of the PolyINF strategy, namely the mirror map \Phi(x) = -2 \sum_i \sqrt{x_i} with a time-varying learning rate \eta_t = 1/\sqrt{t}. In my view it’s the proof from the book for the best of both worlds result (or very close to it)! Here it is:

Lemma 2: Equation (2) (with a small numerical constant c) is an upper bound on the regret of mirror descent with learning rate \eta_t = 1/\sqrt{t}, mirror map \Phi(x) = -2 \sum_i \sqrt{x_i}, and the standard (importance-weighted) multi-armed bandit loss estimator.

*Proof:* The classical mirror descent analysis from any good book will tell you that the regret is controlled by (for \eta_t = 1/\sqrt{t} and with the convention 1/\eta_0 = 0):

\sum_{t=1}^T \left( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \right) \left( \Phi(x^*) - \Phi(x_t) \right) + \frac{1}{2} \sum_{t=1}^T \eta_t\, \mathbb{E}\left[ \| \tilde{\ell}_t \|^2_{(\nabla^2 \Phi(x_t))^{-1}} \right] .     (3)

We now consider those terms for the specific \Phi and \eta_t suggested in the lemma. First notice that \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} = \sqrt{t} - \sqrt{t-1} \le \frac{1}{\sqrt{t}}. Moreover \Phi(x^*) = -2 (since x^* is integral) so that \Phi(x^*) - \Phi(x_t) = 2 \sum_i \left( \sqrt{x_{i,t}} - x_{i,t} \right). In other words the first term in (3) is upper bounded by

4 \sum_{t=1}^T \sum_{i \neq i^*} \sqrt{\frac{x_{i,t}}{t}} ,

where the inequality simply comes from \sqrt{x_{i,t}} - x_{i,t} \le \sqrt{x_{i,t}} for i \neq i^*, together with \sqrt{x_{i^*,t}} - x_{i^*,t} \le 1 - x_{i^*,t} = \sum_{i \neq i^*} x_{i,t} \le \sum_{i \neq i^*} \sqrt{x_{i,t}} for the optimal action.

Next note that (\nabla^2 \Phi(x_t))^{-1} is the diagonal matrix with entries 2 x_{i,t}^{3/2}, so that the second term in (3) is upper bounded by

\sum_{t=1}^T \frac{1}{\sqrt{t}} \sum_{i \neq i^*} \sqrt{x_{i,t}} ,

where the inequality follows trivially by considering the two cases whether x_{i,t} is smaller or larger than 1/2.
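Putting Lemma 1 and Lemma 2 together suggests that plain mirror descent with the q = 1/2 regularizer and \eta_t = 1/\sqrt{t} should already behave well in a stochastic environment. A small self-contained simulation (my own sketch, not the authors' code; the Bernoulli means below are arbitrary) is consistent with this:

```python
import numpy as np

def md_step(x, lhat, eta):
    # One q = 1/2 PolyINF mirror descent step: solve
    # 1/sqrt(x_new_i) = 1/sqrt(x_i) + eta*lhat_i - lam on the simplex
    # by binary search over the Lagrange multiplier lam.
    w = 1.0 / np.sqrt(x) + eta * lhat
    lo, hi = w.min() - len(x), w.min() - 1e-12
    for _ in range(80):
        lam = 0.5 * (lo + hi)
        if np.sum((w - lam) ** -2.0) > 1.0:
            hi = lam
        else:
            lo = lam
    x_new = (w - lam) ** -2.0
    return x_new / x_new.sum()

rng = np.random.default_rng(0)
means = np.array([0.2, 0.5, 0.5])   # arm 0 is the best (smallest mean loss)
T, n = 3000, 3
x = np.ones(n) / n
pseudo_regret = 0.0
for t in range(1, T + 1):
    pseudo_regret += means @ x - means.min()  # expected regret of playing x
    arm = rng.choice(n, p=x)
    loss = float(rng.random() < means[arm])   # Bernoulli loss in {0, 1}
    lhat = np.zeros(n)
    lhat[arm] = loss / x[arm]                 # importance-weighted estimator
    x = md_step(x, lhat, eta=1.0 / np.sqrt(t))
```

On this run the cumulative pseudo-regret stays far below the worst-case \sqrt{Tn} level, as the best of both worlds result predicts (this is only a loose sanity check, of course, not a substitute for the theorem).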

Let f be a \beta-smooth function. Denote \delta_t = f(x_t) - f(x^*). Fix a sequence (\lambda_t), to be optimized later. We consider the “conjugate” point, i.e. the point obtained from the initial point by following the weighted sum of the gradients seen so far. The algorithm simply returns the optimal combination of the conjugate point and the gradient descent point, that is:

Let us denote g_t = \nabla f(x_t), and z_t for the conjugate point, for shorthand. The key point is that the output of the line search is orthogonal to the search direction, and in particular g_{t+1} has zero inner product with the segment between the conjugate point and the gradient descent point. Now recognize that \frac{1}{2\beta} \|g_t\|^2 is a lower bound on the improvement \delta_t - \delta_{t+1} (here we use that x_{t+1} is better than the gradient descent point). Thus we get:

In other words if the sequence (\lambda_t) is chosen such that the gradient terms on the two sides match, then we get

This is good because roughly the reverse inequality also holds true by convexity (and the fact that the function values are decreasing along the iterates, so that x_{t+1} is at least as good as x_t):

So finally we get that \delta_T \sum_{t \le T} \lambda_t is controlled by \|x_1 - x^*\|^2, and it just remains to realize that \lambda_t is of order t/\beta, so that \delta_T is of order \beta \|x_1 - x^*\|^2 / T^2.


I thought this would be a good opportunity to revisit the proof of Nesterov’s momentum, especially since, as it turns out, I really don’t like the way I described it back in 2013 (and to this day that post remains my most visible post ever…). So here we go, with what is hopefully a short and intuitive proof of the convergence rate for Nesterov’s momentum (disclaimer: this proof is merely a rearranging of well-known calculations, nothing new is going on here).

We assume that f is a \beta-smooth convex function, and we take the step size 1/\beta in the gradient step. The momentum term will be set to a very particular value, which comes out naturally in the proof.

**The two basic inequalities**

Let us denote \delta_t = f(x_t) - f(x^*) and g_t = \nabla f(y_t) (note that the gradient step is x_{t+1} = y_t - \frac{1}{\beta} g_t). Now let us write our favorite inequalities (using \beta-smoothness and convexity):

\delta_{t+1} \le \langle g_t, y_t - x^* \rangle - \frac{1}{2\beta} \|g_t\|^2 ,

and

\delta_{t+1} - \delta_t \le \langle g_t, y_t - x_t \rangle - \frac{1}{2\beta} \|g_t\|^2 .

**On the way to a telescopic sum**

Recall now that 2 \langle a, b \rangle - \|a\|^2 = \|b\|^2 - \|b - a\|^2, so it would be nice to somehow combine the two above inequalities to obtain a telescopic sum thanks to this simple formula. Let us try to take a combination of the two inequalities. In fact it will be slightly more elegant if we use the coefficient \lambda_t - 1 on the second inequality, so let us do 1 times the first inequality plus \lambda_t - 1 times the second inequality. We obtain an inequality whose left hand side is \lambda_t \delta_{t+1} - (\lambda_t - 1) \delta_t, and whose right hand side is given by

\langle g_t, \lambda_t y_t - (\lambda_t - 1) x_t - x^* \rangle - \frac{\lambda_t}{2\beta} \|g_t\|^2 .

Recall that our objective is to obtain a telescopic sum, and at this point we still have flexibility both to choose the sequence (\lambda_t) and the momentum (i.e., the point y_{t+1} as a function of x_{t+1} and x_t). What we would like to have is:

\lambda_t^2 \delta_{t+1} - \lambda_{t-1}^2 \delta_t \le \frac{\beta}{2} \left( \|u_t\|^2 - \|u_{t+1}\|^2 \right) , \quad \text{where } u_t = \lambda_t y_t - (\lambda_t - 1) x_t - x^* .

Observe that (since x_{t+1} = y_t - \frac{1}{\beta} g_t) the right hand side above, once the combined inequality is multiplied by \lambda_t, can be written as \frac{\beta}{2} \left( \|u_t\|^2 - \|\lambda_t x_{t+1} - (\lambda_t - 1) x_t - x^*\|^2 \right), and thus we see that we simply need to have:

\lambda_{t+1} y_{t+1} - (\lambda_{t+1} - 1) x_{t+1} = \lambda_t x_{t+1} - (\lambda_t - 1) x_t \quad \text{and} \quad \lambda_t^2 - \lambda_t = \lambda_{t-1}^2 .

**Setting the parameters and concluding the proof**

Writing u_t = \lambda_t y_t - (\lambda_t - 1) x_t - x^*, we now obtain as a result of the combination of our two starting inequalities:

\lambda_t^2 \delta_{t+1} - \lambda_{t-1}^2 \delta_t \le \frac{\beta}{2} \left( \|u_t\|^2 - \|u_{t+1}\|^2 \right) .

It only remains to select \lambda_t such that \lambda_t^2 - \lambda_t = \lambda_{t-1}^2 (i.e., roughly \lambda_t is of order t/2), so that by summing the previous inequality one obtains \delta_{T+1} \le \frac{\beta \|u_1\|^2}{2 \lambda_T^2} = O\left( \frac{\beta \|x_1 - x^*\|^2}{T^2} \right), which is exactly the rate we were looking for.
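As a sanity check, here is a short self-contained implementation of the scheme above (my own sketch; the quadratic test function and all names are made up for illustration), verifying the O(\beta \|x_1 - x^*\|^2 / T^2) bound numerically:

```python
import numpy as np

def nesterov(grad, x0, beta, T):
    """Nesterov's momentum with step size 1/beta and the recursion
    lambda_{t}^2 - lambda_t = lambda_{t-1}^2 for the momentum weights."""
    x, y, lam = x0.copy(), x0.copy(), 1.0
    for _ in range(T):
        x_next = y - grad(y) / beta
        lam_next = (1 + np.sqrt(1 + 4 * lam ** 2)) / 2
        y = x_next + (lam - 1) / lam_next * (x_next - x)  # momentum step
        x, lam = x_next, lam_next
    return x

# A beta-smooth convex quadratic f(x) = 0.5*x^T A x - b^T x.
rng = np.random.default_rng(1)
M = rng.standard_normal((20, 10))
A = M.T @ M                       # positive definite (20 > 10 samples)
b = rng.standard_normal(10)
beta = np.linalg.eigvalsh(A).max()
xstar = np.linalg.solve(A, b)
f = lambda x: 0.5 * x @ A @ x - b @ x

x0, T = np.zeros(10), 200
xT = nesterov(lambda x: A @ x - b, x0, beta, T)
bound = 2 * beta * np.sum((x0 - xstar) ** 2) / T ** 2  # rate from the proof
assert f(xT) - f(xstar) <= bound
```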

This summer at MSR Michael was still very present in our discussions, whether it was about some ideas that we had discussed during that last summer of 2017 (acceleration, metrical task systems lower bounds, etc…), or just some random fun story.

I highly recommend taking a look at the YouTube videos from the November 2017 symposium in memory of Michael. You can also take a look at his (still growing) list of publications on arXiv. In fact I know of an upcoming major paper, so stay tuned (the premises are in Yin Tat’s talk at the symposium).

As always when remembering this tragic loss my thoughts go to Michael’s family.

**Syllabus**

Lecture 1: Introduction to the statistical learning theory framework, its basic question (sample complexity) and its canonical settings (linear classification, linear regression, logistic regression, SVM, neural networks). Two basic methods for learning: (i) Empirical Risk Minimization, (ii) Nearest neighbor classification.

Lecture 2: Uniform law of large numbers approach to control the sample complexity of ERM (includes a brief reminder of concentration inequalities). Application: analysis of bounded regression (includes the non-standard topic of type/cotype and how it relates to different regularizations such as in LASSO).

Lecture 3: Reminder of the first two lectures and relation with the famous VC dimension. How to generalize beyond uniform law of large numbers: stability and robustness approaches (see below).

Lecture 4: How to generalize beyond uniform law of large numbers: information theoretic perspective (see below), PAC-Bayes, and online learning. Brief discussion of margin theory, and an introduction to modern questions in robust machine learning.

**Some notes on algorithmic generalization**

Let \mathcal{X}, \mathcal{Y} be input/output spaces. Let \ell : \mathcal{Y} \times \mathcal{Y} \to [0,1] be a loss function, \mu a probability measure supported on \mathcal{X} \times \mathcal{Y}, and A : (\mathcal{X} \times \mathcal{Y})^m \to \mathcal{Y}^{\mathcal{X}} a learning rule (in words, A takes as input a dataset of m examples, and outputs a mapping from \mathcal{X}-inputs to \mathcal{Y}-outputs). With a slight abuse of notation, for z = (x, y) \in \mathcal{X} \times \mathcal{Y} and h : \mathcal{X} \to \mathcal{Y}, we write \ell(h, z) := \ell(h(x), y). We define the generalization of A on \mu by:

\mathrm{gen}(A, \mu) = \mathbb{E}_{S \sim \mu^{\otimes m},\ z \sim \mu} \left[ \ell(A(S), z) - \frac{1}{m} \sum_{i=1}^m \ell(A(S), z_i) \right] , \quad \text{where } S = (z_1, \ldots, z_m) .

In words, if \mathrm{gen}(A, \mu) \le \epsilon then we expect the empirical performance of the learned classifier A(S) to be representative of its performance on a fresh out-of-sample data point, up to an additive \epsilon. The whole difficulty of course is that the empirical evaluation is done with the *same* dataset S that is used for training, leading to non-trivial dependencies. We should also note that in many situations one might be interested in the two-sided version of the generalization, as well as in high probability bounds instead of bounds in expectation. For simplicity we focus here on the one-sided, in-expectation version.

The most classical approach to controlling generalization, which we covered in detail in previous notes, is via uniform laws of large numbers. More precisely, assuming that the range of the learning rule A is some hypothesis class \mathcal{H}, one trivially has

\mathrm{gen}(A, \mu) \le \mathbb{E}_{S} \sup_{h \in \mathcal{H}} \left( \mathbb{E}_{z \sim \mu}\, \ell(h, z) - \frac{1}{m} \sum_{i=1}^m \ell(h, z_i) \right) .

However this approach might be too coarse when the learning rule is searching through a potentially huge space of hypotheses (such as in the case of neural networks). Certainly such a uniform bound has no chance of explaining why neural networks with billions of parameters would generalize with a dataset of merely millions of examples. For this one has to use *algorithm-based* arguments.

**Stability**

The classical example of algorithmic generalization is due to Bousquet and Elisseeff 2002. It is a simple rewriting of the generalization as a *stability* notion:

\mathrm{gen}(A, \mu) = \frac{1}{m} \sum_{i=1}^m \mathbb{E} \left[ \ell(A(S^{(i)}), z_i) - \ell(A(S), z_i) \right] ,

where S^{(i)} denotes the dataset S with the i-th example replaced by a fresh independent sample z' \sim \mu. This viewpoint can be quite enlightening. For example in the uniform law of large numbers view, regularization enforces small capacity, while in the stability view we see that regularization ensures that the output hypothesis is not too brittle (this was covered in some detail in the previous notes).
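To make the stability viewpoint concrete, here is a small numerical sketch (entirely my own illustration, with made-up data and names) estimating the replace-one stability of ridge regression, the canonical example in Bousquet and Elisseeff: more regularization makes the learned predictor less sensitive to swapping a single training example.

```python
import numpy as np

rng = np.random.default_rng(2)
m, d = 100, 5
X = rng.standard_normal((m, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(m)
x_test, y_test = np.ones(d), 0.0          # a fixed "fresh" point z

def ridge(X, y, lam):
    # minimizer of (1/m) * ||Xw - y||^2 + lam * ||w||^2
    return np.linalg.solve(X.T @ X + lam * len(y) * np.eye(X.shape[1]), X.T @ y)

def replace_one_stability(lam):
    """Max change of the squared loss at z when one training example is
    replaced: an empirical proxy for uniform (replace-one) stability."""
    base = (x_test @ ridge(X, y, lam) - y_test) ** 2
    diffs = []
    for i in range(m):
        X2, y2 = X.copy(), y.copy()
        X2[i], y2[i] = rng.standard_normal(d), 0.0   # fresh replacement z'
        diffs.append(abs((x_test @ ridge(X2, y2, lam) - y_test) ** 2 - base))
    return max(diffs)

# more regularization => more stability (the 2002 bound scales like 1/(lam*m))
assert replace_one_stability(10.0) < replace_one_stability(0.1)
```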

**Robustness**

The next approach I would like to discuss is related to deep questions about current machine learning methods. One of the outstanding problems in machine learning is that current algorithms are not robust to even mild shifts of distribution at test time. Intuitively this lack of robustness seems to indicate a lack of generalization. Can we formalize this intuition? I will now give one such formal link between robustness and generalization, due to Xu and Mannor 2010, which shows the reverse direction (robustness implies generalization). At some level robustness can be viewed as “stability at test time” (while in Bousquet and Elisseeff we care about “stability at training time”).

Xu and Mannor define (K, \epsilon)-robustness as follows: assume that \mathcal{Z} = \mathcal{X} \times \mathcal{Y} can be partitioned into K sets C_1, \ldots, C_K such that if z and z' are in the same set C_k then

| \ell(A(S), z) - \ell(A(S), z') | \le \epsilon .

A good example to have in mind would be a binary classifier with large margin, in which case K corresponds to the covering number of \mathcal{X} at the scale given by the margin. Another (related) example would be regression with a Lipschitz function. In both cases K would typically be exponential in the dimension of \mathcal{X}. The key result of Xu and Mannor that we prove next is a generalization bound of order \epsilon + \sqrt{K/m}. In any situation of interest this seems to me to be a pretty weak bound, yet on the other hand I find the framework to be very pretty and it is of topical interest. I would be surprised if this was the end of the road in the space of “generalization and robustness”.

Theorem (Xu and Mannor 2010):

A (K, \epsilon)-robust learning rule satisfies

\mathrm{gen}(A, \mu) \le \epsilon + \sqrt{\frac{K}{m}} .

**Proof:** Let N_k = |\{ i \in [m] : z_i \in C_k \}| and note that \mathbb{E}[N_k] = m\, \mu(C_k). Now one has for a (K, \epsilon)-robust A (using the robustness within each cell, and that the loss takes values in [0,1]):

\mathbb{E}_{z}\, \ell(A(S), z) - \frac{1}{m} \sum_{i=1}^m \ell(A(S), z_i) \le \epsilon + \sum_{k=1}^K \left| \mu(C_k) - \frac{N_k}{m} \right| .

It only remains to observe that

\mathbb{E} \sum_{k=1}^K \left| \mu(C_k) - \frac{N_k}{m} \right| \le \sum_{k=1}^K \sqrt{\frac{\mu(C_k)}{m}} \le \sqrt{\frac{K}{m}} ,

where the first inequality uses \mathbb{E}|X - \mathbb{E}X| \le \sqrt{\mathrm{Var}(X)} for X = N_k / m, and the second is Cauchy–Schwarz.
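The last observation is easy to check numerically; here is a quick sketch (my own illustration) computing \mathbb{E} \sum_k | \mu(C_k) - N_k/m | exactly, using that each N_k is marginally Binomial(m, \mu(C_k)):

```python
import numpy as np
from math import comb

def mean_abs_dev(p, m):
    # exact E | B/m - p | for B ~ Binomial(m, p)
    return sum(comb(m, j) * p**j * (1 - p) ** (m - j) * abs(j / m - p)
               for j in range(m + 1))

rng = np.random.default_rng(4)
m, K = 50, 8
mu = rng.random(K)
mu /= mu.sum()                      # cell probabilities mu(C_k)
total = sum(mean_abs_dev(p, m) for p in mu)
assert 0 < total <= np.sqrt(K / m)  # the sqrt(K/m) bound from the proof
```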

**Information theoretic perspective**

Why do we think that a lack of robustness indicates a lack of generalization? Well it seems to me that a basic issue could simply be that the dataset was *memorized* by the neural network (which would be a *very* non-robust way to learn). If true, then one could basically find all the information about the data in the weights of the neural network. Again, can we prove at least the opposite direction, that is that if the output hypothesis does not retain much information from the dataset then it must generalize? This is exactly what Russo and Zou 2016 did, where they use the mutual information I(S; A(S)) as a measure of the “information” retained by the trained hypothesis about the dataset. More precisely they show the following result:

Theorem (Russo and Zou 2016):

\mathrm{gen}(A, \mu) \le \sqrt{\frac{I(S; A(S))}{2m}} .

Note that here we have assumed that the codomain of the learning rule consists of deterministic maps from inputs to outputs, in which case the mutual information I(S; A(S)) is simply the entropy H(A(S)). However the proof below also applies to the case where the codomain of the learning rule consists of probability measures over maps, see e.g. Xu and Raginsky 2017. Let us now conclude this (long) post with the proof of the above theorem.

The key point is very simple: one can view generalization as a decoupling property by writing:

\mathrm{gen}(A, \mu) = \mathbb{E}\left[ F(\bar{S}, A(S)) \right] - \mathbb{E}\left[ F(S, A(S)) \right] ,

where F(S', h) = \frac{1}{m} \sum_{i=1}^m \ell(h, z_i') and \bar{S} is an independent copy of S.

Now the theorem follows straightforwardly (if one knows Hoeffding’s lemma, which shows here that F(\bar{S}, h) is \frac{1}{2\sqrt{m}}-subgaussian) from an application of the following beautiful lemma:

Lemma: Let f : \mathcal{A} \times \mathcal{B} \to \mathbb{R}. Let X, Y be random variables in \mathcal{A} and \mathcal{B}, and let \bar{X}, \bar{Y} be mutually independent copies of X and Y. Assume that f(\bar{X}, \bar{Y}) is \sigma-subgaussian (i.e., \log \mathbb{E}\, e^{\lambda ( f(\bar{X}, \bar{Y}) - \mathbb{E} f(\bar{X}, \bar{Y}) )} \le \frac{\lambda^2 \sigma^2}{2} for all \lambda \in \mathbb{R}), then

\mathbb{E}[ f(X, Y) ] - \mathbb{E}[ f(\bar{X}, \bar{Y}) ] \le \sigma \sqrt{2\, I(X; Y)} .

**Proof:** The mutual information I(X; Y) is equal to the relative entropy between the distribution of (X, Y) and the distribution of (\bar{X}, \bar{Y}). Recall also the variational representation of the relative entropy, which is that the map P \mapsto \mathrm{KL}(P, Q) is the convex conjugate of the log-partition function g \mapsto \log \mathbb{E}_Q[ e^{g} ]. In particular one has a lower bound on the mutual information for any such function g, which means:

I(X; Y) \ge \mathbb{E}[ g(X, Y) ] - \log \mathbb{E}\left[ e^{g(\bar{X}, \bar{Y})} \right] .

Now it only remains to use the definition of subgaussianity, that is take g = \lambda f, and optimize over \lambda.
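The variational (Donsker–Varadhan) inequality at the heart of this proof is easy to check numerically on discrete distributions; a quick sketch (my own illustration):

```python
import numpy as np

# Variational representation used in the proof:
# KL(P, Q) = sup_g  E_P[g] - log E_Q[exp(g)]  >=  this quantity for any g.
rng = np.random.default_rng(3)
P = rng.random(6); P /= P.sum()
Q = rng.random(6); Q /= Q.sum()
kl = np.sum(P * np.log(P / Q))

for _ in range(100):
    g = rng.standard_normal(6)            # an arbitrary test function g
    lower = P @ g - np.log(Q @ np.exp(g))
    assert lower <= kl + 1e-12            # always a lower bound on KL

# ... and the bound is tight at g = log(P/Q)
g = np.log(P / Q)
assert abs((P @ g - np.log(Q @ np.exp(g))) - kl) < 1e-12
```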
