I’m a bandit

2020

Posted on October 13, 2020 by Sebastien Bubeck

My latest post on this blog was on December 30th 2019. It seems like a lifetime away. The rate at which paradigm shifting events have been happening in 2020 is staggering. And it might very well be that the worst of 2020 is ahead of us, especially for those of us currently in the USA.

When I started communicating online broadly (blog, twitter) I promised myself to keep it strictly about science (or very closely neighboring topics), so the few lines above are all I will say about the current worldwide situation.

In other news, as is evident from the 10-month hiatus in blogging, I have taken elsewhere (at least temporarily) my need for rapid communication about theorems that currently excite me. Namely to YouTube. Since the beginning of the pandemic I have been recording home videos of what would typically have been blog posts, with currently 5 such videos:

A law of robustness for neural networks : I explain the conjecture we recently made that, for random data, any interpolating two-layer neural network must have its Lipschitz constant larger than the square root of the ratio between the size of the data set and the number of neurons in the network. This would prove that overparametrization is *necessary* for robustness.
Provable limitations of kernel methods : I give the proof by Zeyuan Allen-Zhu and Yuanzhi Li that there are simple noisy learning tasks where *no kernel* can perform well while simple two-step procedures can learn.
Memorization with small neural networks : I explain old (classical combinatorial) and new (NTK style) constructions of optimally-sized interpolating two-layer neural networks.
Coordination without communication : This video is the only one in the current series where I don’t talk at all about neural networks. Specifically it is about the cooperative multiplayer multiarmed bandit problem. I explain the strategy we devised with Thomas Budzinski to solve this problem (for the stochastic version) without *any* collision at all between the players.
Randomized smoothing for certified robustness : Finally, in the first video chronologically, I explain the only known technique for provable robustness guarantees in neural networks that can scale up to large models.

The next video will be about basic properties of tensors, and how they can be used for smooth interpolation (in particular in the context of our law of robustness conjecture). After that, we will see, maybe more neural networks, maybe more bandits, maybe some non-convex optimization ….

Stay safe out there!

Posted in Uncategorized | Leave a comment

A decade of fun and learning

Posted on December 30, 2019 by Sebastien Bubeck

I started out this decade with the project of writing a survey of the multi-armed bandit literature, which I had read thoroughly during the graduate studies that I was about to finish. At the time we resisted the temptation to name the survey “modern banditology”, which was indeed the right call given how much this “modern” picture has evolved over the decade! It is truly wonderful to now end the decade with two new iterations on the work we did in that survey:

Bandit algorithms by Tor Lattimore and Csaba Szepesvari
Introduction to bandits by Alex Slivkins

These new references very significantly expand the 2012 survey, and they are wonderful starting points for anyone who wants to enter the field.

Here are some of the discoveries in the world of bandits that stood out for me this decade:

We now understand very precisely Thompson Sampling, the first bandit strategy that was proposed back in 1933. The most beautiful reference here is the one by Dan Russo and Ben Van Roy: An Information-Theoretic Analysis of Thompson Sampling, JMLR 2016. Another one that stands out is Analysis of thompson sampling for the multi-armed bandit problem by S. Agrawal and N. Goyal at COLT 2012.
T^{2/3} lower bound for *non-stochastic* bandit with switching cost by Dekel, Ding, Koren and Peres at STOC 2014. This is a striking result for several reasons. In particular the proof has to be based on a non-trivial stochastic process, since for the classical stochastic i.i.d. model one can obtain \sqrt{T} (very easily in fact).
We now know that bandit convex optimization is “easy”, in the sense that it is a \sqrt{T}-regret type problem. What’s more is that in our STOC 2017 paper with Y.T. Lee and R. Eldan we introduced a new way to do function estimation based on bandit feedback, using kernels (I have written at length about this on this blog).
A very intriguing model of computation for contextual bandit was proposed, where one can access the policy space only through an offline optimization oracle. With such access, the classical Exp4 algorithm cannot be simulated, and thus one needs new strategies. We now have a reasonable understanding that \sqrt{T} is doable with mild assumptions (see e.g. this ICML 2014 paper on “Taming the Monster“ by Agarwal, Hsu, Kale, Langford, L. Li and Schapire) and that it is impossible with no assumptions (work of Hazan and Koren at STOC 2016).
Honorable mentions also go to the work of Wei and Luo showing that very strong variation bounds are possible in bandits (see this COLT 2018 paper), and Zimmert and Seldin who made striking progress on the best of both worlds phenomenon that we discovered with Slivkins at the beginning of the decade (I blogged about it here already).

Life beyond bandits

In addition to starting the decade with the bandit survey, I also started it with being bored with the bandit topic altogether. I thought that many (if not most) of the fundamental results were now known, and it was a good idea to move on to something else. Obviously I was totally wrong, as you can see with all the works cited above (and many many more for stochastic bandits, including much deeper understanding of best arm identification, a topic very close to my heart, see e.g., [Kaufmann, Cappe, Garivier, JMLR 16]). In fact I am now optimistic that there is probably another decade-worth of exploration left for the bandit problem(s). Nevertheless I ventured outside, and explored the world of optimization (out of which first came a survey, and more recently video lectures) and briefly networks (another modest survey came out of this too).

Here are some of the landmark optimization results of this decade in my view:

Perhaps the most striking result of the decade in optimization is the observation that for finite sum problems, one can reduce the variance in stochastic gradient descent by somehow centering the estimates (e.g., using a slowly moving sequence on which we can afford to compute full gradients; but this is not the only way to perform such variance reduction). This idea, while very simple, has a lot of implications, both in practice and in theory! The origins of the idea are in the SAG algorithm of [Schmidt, Le Roux, Bach, NIPS 2012] and SDCA [Shalev-Shwartz and Zhang, JMLR 2013]. A simpler instantiation of the idea, called SVRG, appeared shortly after in [Johnson and Zhang, NIPS 2013] (and also independently at the same NeurIPS, in [M. Madhavi, L. Zhang, R. Li, NIPS 2013]).
An intriguing direction that I pursued fervently is the use of convex optimization for problems that have a priori nothing to do with convex optimization. A big inspiration for me was the COLT 2008 paper by Abernethy, Hazan and Rakhlin, who showed how mirror descent naturally solves bandit problems. In this decade, we (this we includes myself and co-authors, but also various other teams) explored how to use mirror descent for other online decision making problems, and made progress on some long-standing problems (k-server and MTS), see for example this set of video lectures on the “Five miracles of mirror descent”.
Arnak Dalalyan showed how to use ideas inspired from convex optimization to analyze the Langevin Monte Carlo algorithm. This was absolutely beautiful work, that led to many many follow-ups.
There has been a lot of rewriting of Nesterov’s acceleration, to try to demystify it. Overall the enterprise is not yet a resounding success in my opinion, but certainly a lot of progress has been made (again I have written a lot about it on this blog already). We now even have optimal acceleration for higher order of smoothness (see this 15-author paper at COLT 2019), but these techniques are clouded with the same shroud of mystery as was Nesterov’s original method.
Yin Tat Lee and Aaron Sidford obtained an efficient construction of a universal barrier.
We now know that certain problems cannot be efficiently represented by SDPs (the so-called “extension complexity”), see e.g. this work by Lee-Raghavendra-Steurer.
We now know how to chase convex bodies, and we can even do so very elegantly with the Steiner/Sellke point.
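To make the variance reduction idea from the first item of the list above concrete, here is a minimal SVRG-style sketch in Python on a small least-squares problem. The problem sizes, the step size, the number of inner steps, and the choice of restarting from the last inner iterate are all illustrative assumptions, not settings taken from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.standard_normal((n, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)   # unit rows, so each f_i is 1-smooth
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def grad_i(w, i):
    """Gradient of f_i(w) = 0.5 * (a_i . w - b_i)^2."""
    return (A[i] @ w - b[i]) * A[i]

def full_grad(w):
    """Gradient of the finite sum f(w) = (1/n) * sum_i f_i(w)."""
    return A.T @ (A @ w - b) / n

def svrg(epochs=40, inner_steps=2 * n, step=0.1):
    w_ref = np.zeros(d)                      # the slowly moving sequence
    for _ in range(epochs):
        mu = full_grad(w_ref)                # one full gradient per epoch
        w = w_ref.copy()
        for i in rng.integers(n, size=inner_steps):
            # Centered estimate: still unbiased for full_grad(w), and its
            # variance shrinks as w and w_ref both approach the optimum.
            g = grad_i(w, i) - grad_i(w_ref, i) + mu
            w -= step * g
        w_ref = w                            # assumption: restart from the last iterate
    return w_ref

w_star = np.linalg.lstsq(A, b, rcond=None)[0]   # exact least-squares solution
w_hat = svrg()
print(np.linalg.norm(w_hat - w_star))
```

Plain SGD with the same constant step size would stall at a noise floor on this problem, since the labels are noisy and the individual gradients do not vanish at the optimum; the centering term is what removes that floor.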

Some other things that captivated me

The papers above are mostly topics on which I tried to work at some point. Here are some questions that I didn’t work on but followed closely and was fascinated by the progress:

The stochastic block model was essentially solved during this decade, see for example this survey by Emmanuel Abbe.
The computational/statistical tradeoffs were extensively explored, yet they remain mysterious. A nice impulse to the field was given by this COLT 2013 paper by Berthet and Rigollet relating sparse PCA and planted clique. In a similar spirit I also enjoyed the more recent work by Moitra, Jerry Li, and many co-authors on computationally efficient robust estimation (see e.g., this recent paper).
Adaptive data analysis strikes me as both very important in practice, and quite deep theoretically, see e.g. the reusable holdout by Dwork et al. A related paper that I liked a lot is this ICML 2015 paper by Blum and Hardt, which essentially explores the regularization effect of publishing only models that beat the state of the art significantly (more generally this is an extremely interesting question, of why we can keep using the same datasets to evaluate progress in machine learning, see this provokingly titled paper “Do ImageNet Classifiers Generalize to ImageNet?“).
A general trend has been in finding very fast (nearly linear time) methods for many classical problems. Sometimes these investigations even led to actually practical algorithms, as with this now classical paper by Marco Cuturi at NIPS 2013 titled “Sinkhorn Distances: Lightspeed Computation of Optimal Transport“.

Oh, one last thing

I also heard that, surprisingly, gradient descent can work to optimize highly non-convex functions such as training loss for neural networks. Not sure what this is about, it’s a pretty obscure topic, maybe it will catch up in the decade 2020-2029…

Share some more in the comments!

The above is only a tiny sample, there were many many more interesting directions being explored (tensor methods for latent variable models [Anandkumar, Ge, Hsu, Kakade, Telgarsky, JMLR 14]; phenomenon of “all local minima are good” for various non-convex learning problems, see e.g., [Ge, Lee, Ma, NIPS 2016]; etc etc). Feel free to share your favorite ML theory paper in the comments!

Posted in Uncategorized | 5 Comments

Convex body chasing, Steiner point, Sellke point, and SODA 2020 best papers

Posted on November 5, 2019 by Sebastien Bubeck

Big congratulations to my former intern Mark Sellke, and to the CMU team (C. J. Argue, Anupam Gupta, Guru Guruganesh, and Ziye Tang) for jointly winning the best paper award at SODA 2020 (as well as the best student paper for Mark)! They obtain a linear in the dimension competitive ratio for convex body chasing, a problem which was entirely open for any $d \geq 3$ just two years ago. What’s more is that they found the algorithm (and proof) from The Book! Let me explain.

Convex body chasing

In convex body chasing an online algorithm is presented at each time step with a convex body $K_t \subset \mathbb{R}^d$ , and it must choose a point $x_t \in K_t$ . The online algorithm is trying to minimize its total movement, namely $\sum_{t \geq 0} \|x_{t+1} - x_{t}\|$ (all the norms here will be Euclidean, for simplicity). To evaluate the performance of the algorithm, one benchmarks it against the smallest movement one could achieve for this sequence of bodies, if one knew in advance the whole sequence. A good example to have in mind is if the sequence $K_t$ corresponds to lines rotating around some fixed point, then the best thing to do is to move to this fixed point and never move again, while a greedy algorithm might have infinite movement (in the continuous time limit). The competitive ratio is the worst case (over all possible sequences of convex bodies) ratio of the algorithm’s performance to that of the oracle optimum. This problem was introduced in 1993 by Linial and Friedman, mainly motivated by considering a more geometric version of the $k$ -server problem.

As it turns out, this problem in fact has a lot of “real” applications. It is not too surprising given how elementary the problem is. Just to give a flavor, Adam Wierman and friends show that dynamic powering of data centers can be viewed as an instance of convex *function* chasing. In convex function chasing, instead of a convex body one gets a convex function $f_t$ , and there is then a movement cost $\|x_t - x_{t-1}\|$ and a *service cost* $f_t(x_t)$ . While this is more general than convex body chasing, it turns out that convex body chasing in dimension $d+1$ is enough to solve convex function chasing in dimension $d$ (this is left to the reader as an exercise). Roughly speaking in dynamic power management, one controls various power levels (number of servers online, etc), and a request corresponds to a new set of jobs. Turning on/off servers has an associated cost (the movement cost), while servicing the jobs with a certain number of servers has an associated delay (the service cost). You get the idea.

The Steiner point

In a paper to appear at SODA too, we propose (with Yin Tat Lee, Yuanzhi Li, Bo’az Klartag, and Mark Sellke) to use the Steiner point for the nested convex body chasing problem. In nested convex body chasing the sequence of bodies is nested, and to account for an irrelevant additive constant we assume that the oracle optimum starts from the worst point in the starting set $K_0$ . In other words, opt is starting at the point the furthest away in $K_0$ from the closest point in $K_T$ , so its movement is exactly equal to the Hausdorff distance between $K_0$ and $K_T$ . So all we have to do is to give an online algorithm whose movement is proportional to this Hausdorff distance. Some kind of Lipschitzness in Hausdorff distance. Well, it turns out that mathematicians have been looking at this for centuries already, and Steiner back in the 1800’s introduced just what we need: a selector (a map from convex sets to points in them, i.e. $S(K) \in K$ for any convex set $K$ ) which is Lipschitz with respect to the Hausdorff distance, that is:

$\|S(K) - S(K')\| \leq L \cdot \mathrm{dist}_H(K,K') \,.$

Note that obtaining such a selector is not trivial, for example the first natural guess would be the center of gravity, but it turns out that this is not Lipschitz (consider for example what happens with $K$ being a very thin triangle and $K'$ being the base of this triangle). Crazy thing is, this Steiner point selector (to be defined shortly) is not only Lipschitz, it is in fact the most Lipschitz selector for convex sets. How would you prove such a thing? Well, you clearly need a miracle to happen, and the miracle here is that the Steiner point satisfies some symmetries, which define it uniquely. From there all you need to show is that, starting with any selector, you can add some of these symmetries while also improving the Lipschitz constant (for more on this see the references in the linked paper above).

OK, so what is the Steiner point? It has many definitions, but a particularly appealing one from an optimizer’s viewpoint is as follows. For a convex set $K$ and a vector $\theta \in \mathbb{R}^d$ let $g_K(\theta)$ be the maximizer of the linear function $\theta$ in the set $K$ . Then the Steiner point of $K$ , $\mathrm{St}(K)$ is defined by:

$\mathrm{St}(K) = \mathbb{E}_{\theta : \|\theta\| \leq 1} [g_K(\theta)] \,.$

In words: for any direction take the furthest point in $K$ in that direction, and average over all directions. This feels like a really nice object, and clearly it satisfies $\mathrm{St}(K) \in K$ (i.e., it is a selector). The algorithm we proposed in our SODA paper is simply to move to $x_t = \mathrm{St}(K_t)$ . Let us now see how to analyze this algorithm.
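As a quick sanity check of this definition, here is a small Monte Carlo sketch in Python for a polytope given by its vertices. The function names, the sample size, and the example square are illustrative assumptions; for a polytope the maximizer $g_K(\theta)$ can always be taken to be a vertex, which is what the code relies on.

```python
import numpy as np

def sample_unit_ball(n, d, rng):
    """Draw n points uniformly from the Euclidean unit ball in R^d."""
    g = rng.standard_normal((n, d))
    g /= np.linalg.norm(g, axis=1, keepdims=True)   # uniform on the sphere
    r = rng.random(n) ** (1.0 / d)                  # radius with density ~ r^(d-1)
    return g * r[:, None]

def steiner_point(vertices, n_samples=200_000, seed=0):
    """Monte Carlo estimate of St(K) for K = conv(vertices)."""
    rng = np.random.default_rng(seed)
    d = vertices.shape[1]
    theta = sample_unit_ball(n_samples, d, rng)
    # g_K(theta): the vertex that maximizes the linear function theta . x
    best = np.argmax(theta @ vertices.T, axis=1)
    # Average the maximizers over all sampled directions
    return vertices[best].mean(axis=0)

# Hypothetical example: an axis-aligned square centered at (2, 3)
square = np.array([[1.0, 2.0], [3.0, 2.0], [3.0, 4.0], [1.0, 4.0]])
print(steiner_point(square))
```

By symmetry the estimate for the square should be close to its center, and since the output is an average of points of $K$ it always lies in $K$.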

The nested case proof

First we give an alternative formula for the Steiner point. Denote $h_K(\theta) = \max_{x \in K} \theta \cdot x$ , and observe that $g_K(\theta) = \nabla h_K(\theta)$ for almost all $\theta$ . Thus using the divergence theorem one obtains:

$\mathrm{St}(K) = d \cdot \mathbb{E}_{\theta : \|\theta\| = 1} [\theta h_K(\theta)] \,.$

(The factor $d$ comes from the ratio of the surface area of the sphere to the volume of the ball, and the $\theta$ in the expectation appears because the outward normal to the sphere at $\theta$ is $\theta$ itself.)

Now we have the following one-line inequality to control the movement of the Steiner point:

$\begin{align*} \|\mathrm{St}(K) - \mathrm{St}(K')\| & = d \cdot \| \mathbb{E}_{\theta : \|\theta\| = 1} [\theta (h_K(\theta) - h_{K'}(\theta))] \| \\ & \leq d \cdot \mathbb{E}_{\theta : \|\theta\| = 1} [ |h_K(\theta) - h_{K'}(\theta)| ] \end{align*}$

It only remains to observe that $|h_K(\theta) - h_{K'}(\theta)| \leq \mathrm{dist}_H(K,K')$ to obtain a proof that Steiner is $d$ -Lipschitz (in fact as an exercise you can try to improve the above argument to obtain $O(\sqrt{d})$ -Lipschitz). Now for nested convex body chasing we will have $K' \subset K$ so that $|h_K(\theta) - h_{K'}(\theta)| = h_K(\theta) - h_{K'}(\theta)$ (i.e., no absolute values), and thus the upper bound on the movement will telescope! More precisely:

$\begin{align*} \sum_{t \geq 0} \|\mathrm{St}(K_t) - \mathrm{St}(K_{t+1})\| & \leq d \cdot \sum_{t \geq 0} \mathbb{E}_{\theta : \|\theta\| = 1} [ h_{K_t}(\theta) - h_{K_{t+1}}(\theta) ] \\ & = d \cdot \mathbb{E}_{\theta : \|\theta\| = 1} [ h_{K_0}(\theta) - h_{K_{T}}(\theta) ] \\ & \leq d \cdot \mathrm{dist}_H(K_0, K_T) \,. \end{align*}$

This proves that the Steiner point is $d$ -competitive for nested convex body chasing. How to generalize this to the non-nested case seemed difficult, and the breakthrough of Mark and the CMU team is to bring back an old friend of the online algorithm community: the work function.
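The telescoping argument can be checked numerically. The sketch below (Python, with made-up nested random polytopes) replaces the expectation over the sphere by an average over a fixed sample of directions, and compares the total movement of the Steiner points with the middle line of the display above, $d \cdot \mathbb{E}[h_{K_0}(\theta) - h_{K_T}(\theta)]$. Because the same directions are used for every body, the inequality holds for the empirical averages too, up to rounding. The dimension, sample size, and the way the nested bodies are generated are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_dirs, T = 3, 50_000, 6

# Directions uniform on the unit sphere, shared by all bodies
theta = rng.standard_normal((n_dirs, d))
theta /= np.linalg.norm(theta, axis=1, keepdims=True)

def support(vertices):
    """h_K(theta) = max_{x in K} theta . x for K = conv(vertices)."""
    return (theta @ vertices.T).max(axis=1)

def steiner(vertices):
    """Empirical version of St(K) = d * E[theta * h_K(theta)] on the sphere."""
    return d * (theta * support(vertices)[:, None]).mean(axis=0)

# Nested polytopes: every new vertex is a convex combination of the old ones
bodies = [rng.standard_normal((8, d))]
for _ in range(T):
    weights = rng.dirichlet(np.ones(8), size=8)   # rows sum to one
    bodies.append(weights @ bodies[-1])

points = [steiner(V) for V in bodies]
movement = sum(np.linalg.norm(p - q) for p, q in zip(points, points[1:]))
bound = d * (support(bodies[0]) - support(bodies[-1])).mean()
print(movement, bound)
```

Note that the code does not compute the Hausdorff distance itself, only the quantity that the last inequality of the display bounds by $d \cdot \mathrm{dist}_H(K_0, K_T)$.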

The work function

The work function $W_t : \mathbb{R}^d \rightarrow \mathbb{R}$ is defined as follows. For any point $x \in \mathbb{R}^d$ it is the smallest cost one can pay to satisfy all the requests $K_0, \ldots, K_t$ and end up at $x$ . First observe that this function is convex. Indeed take two points $x$ and $y$ , and $z$ the middle point between $x$ and $y$ (any point on the segment between $x$ and $y$ would be treated similarly). In the definition of the work function there is an associated trajectory for both $x$ and $y$ . By convexity of the requests, the sequence of mid points between those two trajectories is a valid trajectory for $z$ ! And moreover the movement of this mid trajectory is (by triangle inequality) less than the average of the movement of the trajectory for $x$ and $y$ . Hence $W_t((x+y)/2) \leq (W_t(x) + W_t(y))/2$ .

A very simple property we will need from the work function is that the norm of its gradient carries some information on the current request. Namely, $x \not\in K_t \Rightarrow \|\nabla W_t(x)\| = 1$ (indeed, if $x$ is not in the current request set, then the best way to end up there is to move to a point $z$ in $K_t$ and then move to $x$ , and if $K_t$ is a polytope then when you move $x$ a little bit you don’t move $z$ , so the cost is changing at rate $1$ , hence the norm of the gradient being $1$ ). Or to put it differently: $\|\nabla W_t(x)\| < 1 \Rightarrow x \in K_t$ .
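To see these two properties on a toy case, here is a one-dimensional sketch in Python where the requests are intervals with integer endpoints and the work function is computed on an integer grid by the recursion $W_t(x) = \min_{z \in K_t} W_{t-1}(z) + |x - z|$ . The grid, the starting point at the origin, and the particular intervals are assumptions chosen for illustration.

```python
import numpy as np

grid = np.arange(-30, 31)                        # integer grid on the real line
dist = np.abs(grid[:, None] - grid[None, :])     # dist[i, j] = |grid[i] - grid[j]|

def update(W_prev, a, b):
    """W_t(x) = min over z in K_t = [a, b] of W_{t-1}(z) + |x - z|."""
    inside = (grid >= a) & (grid <= b)
    return (W_prev[inside][None, :] + dist[:, inside]).min(axis=1)

requests = [(2, 6), (-4, 1), (3, 9), (-2, 0)]    # hypothetical intervals K_0..K_3
W = np.abs(grid).astype(float)                   # cost of reaching x from the start at 0
history = []
for a, b in requests:
    W = update(W, a, b)
    history.append(W)

for (a, b), W_t in zip(requests, history):
    slopes = np.diff(W_t)                        # discrete derivative of W_t
    convex = bool(np.all(np.diff(W_t, 2) >= 0))  # nonnegative second differences
    print((a, b), convex, slopes[grid[:-1] >= b][:3], slopes[grid[1:] <= a][-3:])
```

On this grid the second differences are nonnegative (convexity), and the slope is exactly $+1$ to the right of $K_t$ and $-1$ to its left, matching $\|\nabla W_t(x)\| = 1$ outside the current request.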

The Sellke point

Mark’s beautiful idea (and the CMU team’s very much related –in fact equivalent– idea) to generalize the Steiner point to the non-nested case is to use the work function as a surrogate for the request. Steiner of $K_t$ will clearly not work since all of $K_t$ does not matter in the same way, as some points might be essentially irrelevant because they are very far from the previous requests, a fact which will be uncovered by the work function (while on the other hand the random direction from the Steiner point definition is oblivious to the geometry of the previous requests). So how to pick an appropriately random point in $K_t$ while respecting the work function structure? Well we just saw that $\|\nabla W_t(x)\| < 1 \Rightarrow x \in K_t$ . So how about taking a random direction $\theta$ , and applying the inverse of the gradient map, namely the gradient of the Fenchel dual $W_t^*$ ? Recall that the Fenchel dual is defined by:

$W_t^*(\theta) = \max_{x \in \mathbb{R}^d} \theta \cdot x - W_t(x) \,.$

This is exactly the algorithm Mark proposes and it goes like this (Mark calls it the *functional Steiner point*):

$\mathrm{Se}(K_t) = \mathbb{E}_{\theta : \|\theta\| < 1} [\nabla W_t^*(\theta)] \,.$

Crucially this is a valid point, namely $\mathrm{Se}(K_t) \in K_t$ . Moreover just like for the Steiner point we can apply the divergence theorem and obtain:

$\mathrm{Se}(K_t) = d \cdot \mathbb{E}_{\theta : \|\theta\| = 1} [\theta W_t^*(\theta)] \,.$

The beautiful thing is that we can now exactly repeat the nested convex body argument, since $W_{t+1}^*(\theta) \leq W_t^*(\theta)$ for all $\theta$ (just like we had in the nested case $h_{K_{t+1}}(\theta) \leq h_{K_t}(\theta)$ ) and so we get:

$\sum_{t \geq 0} \|\mathrm{Se}(K_t) - \mathrm{Se}(K_{t+1})\| \leq d \cdot \mathbb{E}_{\theta : \|\theta\| = 1} [ W_{0}^*(\theta) - W^*_{{T}}(\theta) ]$

The first term with $W_0^*$ is just some additive constant which we can ignore, while the second term is bounded as follows:

$\begin{align*} \mathbb{E}_{\theta : \|\theta\| = 1} [ - W^*_{{T}}(\theta) ] & = \mathbb{E}_{\theta : \|\theta\| = 1} [ \min_{x \in \mathbb{R}^d} ( W_T(x) - \theta \cdot x ) ] \\ & \leq \min_{x \in \mathbb{R}^d} \mathbb{E}_{\theta : \|\theta\| = 1}[ W_T(x) - \theta \cdot x ] \\ & = \min_{x \in \mathbb{R}^d} W_T(x) \,. \end{align*}$

Thus we exactly proved that the movement of the Sellke point is upper bounded by a constant plus $d$ times the minimum of the work function, and the latter is nothing but the value of the oracle optimum!

Note that once everything is said and done, the proof has only two inequalities (the triangle inequality as in the nested case, and the fact that the expectation of the minimum is at most the minimum of the expectation). It doesn’t get any better than this!
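Here is a one-dimensional sketch in Python of the whole argument, under the same kind of toy assumptions as one would use to tabulate a work function by hand: integer grid, interval requests with integer endpoints, start at the origin. In dimension $d = 1$ the unit sphere is just $\{-1, +1\}$ , so the formula $\mathrm{Se}(K_t) = d \cdot \mathbb{E}[\theta W_t^*(\theta)]$ reduces to $(W_t^*(1) - W_t^*(-1))/2$ . All names and the request sequence are hypothetical.

```python
import numpy as np

grid = np.arange(-30, 31)
dist = np.abs(grid[:, None] - grid[None, :])

def update(W_prev, a, b):
    """Work function recursion for the interval request K_t = [a, b]."""
    inside = (grid >= a) & (grid <= b)
    return (W_prev[inside][None, :] + dist[:, inside]).min(axis=1)

def conjugate(W, theta):
    """Fenchel dual W^*(theta) = max_x theta * x - W(x), restricted to the grid."""
    return (theta * grid - W).max()

def sellke_point(W):
    """d * E[theta * W^*(theta)] with d = 1 and theta uniform on {-1, +1}."""
    return 0.5 * (conjugate(W, 1.0) - conjugate(W, -1.0))

requests = [(2, 6), (-4, 1), (3, 9), (-2, 0), (5, 7)]   # hypothetical intervals
W = np.abs(grid).astype(float)                          # start at the origin
works, points = [], []
for a, b in requests:
    W = update(W, a, b)
    works.append(W)
    points.append(sellke_point(W))

movement = sum(abs(p - q) for p, q in zip(points, points[1:]))
# Additive constant E[W_0^*] and the oracle optimum min_x W_T(x)
const = 0.5 * (conjugate(works[0], 1.0) + conjugate(works[0], -1.0))
opt = works[-1].min()
print(points, movement, const + opt)
```

The two facts used in the proof can be read off the output: each point lies in its request interval, and the total movement is at most the constant plus $d = 1$ times the minimum of the final work function.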

Posted in Theoretical Computer Science | 2 Comments

Guest post by Julien Mairal: A Kernel Point of View on Convolutional Neural Networks, part II

Posted on July 17, 2019 by Sebastien Bubeck

This is a continuation of Julien Mairal’s guest post on CNNs, see part I here.

Stability to deformations of convolutional neural networks

In their ICML paper Zhang et al. introduce a functional space for CNNs with one layer, by noticing that for some dot-product kernels, smoothed variants of rectified linear unit activation functions (ReLU) live in the corresponding RKHS, see also this paper and that one. By following a similar reasoning with multiple layers, it is then possible to show that the functional space described in part I $\{ f_w: x \mapsto \langle w , \Phi_n(x) \rangle; w \in L^2(\Omega,\mathcal{H}_n) \}$ contains CNNs with such smoothed ReLU, and that the norm $\|f_w\|$ of such networks can be controlled by the spectral norms of filter matrices. This is consistent with previous measures of complexity for CNNs, see this paper by Bartlett et al.

A perhaps more interesting finding is that the abstract representation $\Phi_n(x)$ , which only depends on the network architecture, may provide near-translation invariance and stability to small image deformations while preserving information—that is, $x$ can be recovered from $\Phi_n(x)$ . The original characterization we use was introduced by Mallat in his paper on the scattering transform—a multilayer architecture akin to CNNs based on wavelets, and was extended to $\Phi_n$ by Alberto Bietti, who should be credited for all the hard work here.

Our goal is to understand under which conditions it is possible to obtain a representation that (i) is near-translation invariant, (ii) is stable to deformations, (iii) preserves signal information. Given a $C^1$ -diffeomorphism $\tau: \mathbb{R}^2 \to \mathbb{R}^2$ and denoting by $L_\tau x(u) = x(u-\tau(u))$ its action operator (for an image defined on the continuous domain $\mathbb{R}^2$ ), the main stability bound we obtain is the following one (see Theorem 7 in Mallat’s paper): if $\|\nabla \tau\|_\infty \leq 1/2$ , then for all $x$ ,

$\| \Phi_n(L_\tau x) - \Phi_n(x)\| \leq \left ( C_1 (1+n) \|\nabla \tau\|_\infty + \frac{C_2}{\sigma_n} \|\tau\|_\infty \right) \|x\|,$

where $C_1, C_2$ are universal constants, $\sigma_n$ is the scale parameter of the pooling operator $A_n$ corresponding to the “amount of pooling” performed up to the last layer, $\|\tau\|_\infty$ is the maximum pixel displacement and $\|\nabla \tau\|_\infty$ represents the maximum amount of deformation, see the paper for the precise definitions of all these quantities. Note that when $C_2/\sigma_n \to 0$ , the representation $\Phi_n$ becomes translation invariant: indeed, consider the particular case of $\tau$ being a translation, then $\nabla \tau=0$ and $\|\Phi_n(L_\tau x) - \Phi_n(x)\| \to 0$ .

The stability bound and a few additional results tell us a few things about the network architecture: (a) small patches lead to more stable representations (the dependency is hidden in $C_1$ ); (b) signal preservation for discrete signals requires small subsampling factors (and thus small pooling) between layers. In such a setting, the scale parameter $\sigma_n$ still grows exponentially with $n$ and near translation invariance may be achieved with several layers.

Interestingly, we may now come back to the Cauchy-Schwarz inequality from part I, and note that if $\Phi_n$ is stable, the RKHS norm $\|f\|$ is then a natural quantity that provides stability to deformations to the prediction function $f$ , in addition to measuring model complexity in a traditional sense.

Feature learning in RKHSs and convolutional kernel networks

The previous paragraph is devoted to the characterization of convolutional architectures such as CNNs but the previous kernel construction can in fact be used to derive more traditional kernel methods. After all, why should one spend efforts defining a kernel between images if not to use it?

This can be achieved by considering finite-dimensional approximations of the previous feature maps. In order to shorten the presentation, we simply describe the main idea based on the Nyström approximation and refer to the paper for more details. Approximating the infinite-dimensional feature maps $x_k$ (see the figure at the top of part I) can be done by projecting each point in $\mathcal{H}_k$ onto a $p_k$ -dimensional subspace $\mathcal{F}_k$ leading to a finite-dimensional feature map $\tilde{x}_k$ akin to CNNs, see the figure at the top of the post.

By parametrizing $\mathcal{F}_k=\text{span}(\varphi_k(z_1),\varphi_k(z_2),\ldots,\varphi_k(z_{p_k}))$ with $p_k$ anchor points $Z=[z_1,\ldots,z_{p_k}]$ , and using a dot-product kernel, a patch $z$ from $\tilde{x}_{k-1}$ is encoded through the mapping function

$\psi_k(z) = \|z\| \kappa_k( Z^\top Z)^{-1/2} \kappa_k\left( Z^\top \frac{z}{\|z\|} \right),$

where $\kappa_k$ is applied pointwise. Then, computing $\tilde{x}_k$ from $\tilde{x}_{k-1}$ admits a CNN interpretation, where only the normalization and the matrix multiplication by $\kappa_k( Z^\top Z)^{-1/2}$ are not standard operations. It remains now to choose the anchor points:

kernel approximation: a first approach consists of using a variant of the Nyström method, see this paper and that one. When plugging the corresponding image representation in a linear classifier, the resulting approach behaves as a classical kernel machine. Empirically, we observe that the higher the number of anchor points, the better the kernel approximation, and the higher the accuracy. For instance, a two-layer network with a $300k$ -dimensional representation achieves about $86\%$ accuracy on CIFAR-10 without data augmentation (see here).
back-propagation, feature selection: learning the anchor points $Z$ can also be done as in a traditional CNN, by optimizing them end-to-end. This allows using deeper lower-dimensional architectures and empirically seems to perform better when enough data is available, e.g., $92\%$ accuracy on CIFAR-10 with simple data augmentation. There, the subspaces $\mathcal{F}_k$ are not learned anymore to provide the best kernel approximation, but the model seems to perform a sort of feature selection in each layer’s RKHS $\mathcal{H}_k$ , which is not well understood yet (This feature selection interpretation is due to my collaborator Laurent Jacob).

Note that the first CKN model published here was based on a different approximation principle, which was not compatible with end-to-end training. We found this to be less scalable and effective.
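The encoding $\psi_k$ defined above can be sketched in a few lines of Python. The specific dot-product kernel $\kappa(u) = e^{\alpha (u - 1)}$ (a Gaussian kernel restricted to the sphere), the unit-norm random anchor points, and the small eigenvalue floor in the inverse square root are assumptions made for illustration; the formula itself does not depend on them.

```python
import numpy as np

def kappa(u, alpha=1.0):
    """Assumed dot-product kernel on the sphere: exp(alpha * (u - 1))."""
    return np.exp(alpha * (u - 1.0))

def inv_sqrt(M, eps=1e-8):
    """Symmetric inverse square root via an eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    return (vecs / np.sqrt(np.maximum(vals, eps))) @ vecs.T

def psi(z, Z):
    """psi(z) = ||z|| * kappa(Z^T Z)^{-1/2} kappa(Z^T z / ||z||)."""
    norm = np.linalg.norm(z)
    if norm == 0.0:
        return np.zeros(Z.shape[1])
    return norm * inv_sqrt(kappa(Z.T @ Z)) @ kappa(Z.T @ (z / norm))

rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 4))      # 4 hypothetical anchor points in R^5, as columns
Z /= np.linalg.norm(Z, axis=0)       # normalize each anchor to unit norm

codes = np.stack([psi(Z[:, j], Z) for j in range(Z.shape[1])], axis=1)
print(np.abs(codes.T @ codes - kappa(Z.T @ Z)).max())
```

The printed number should be close to zero: on the anchor points themselves the inner products of the finite-dimensional codes reproduce the kernel values, which is the usual sanity check for a Nyström-type projection.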

Other links between neural networks and kernel methods

Finally, other links between kernels and infinitely-wide neural networks with random weights are classical, but they were not the topic of this blog post (they should be the topic of another one!). In a nutshell, for a large collection of weights distributions and nonlinear functions $s: \mathbb{R} \to \mathbb{R}$ , the following quantity admits an analytical form

$K(x,x') = \mathbb{E}_{w}[ s(w^\top x) s(w^\top x')],$

where the terms $s(w^\top x)$ may be seen as an infinitely-wide single-layer neural network. The first time such a relation appears is likely to be in the PhD thesis of Radford Neal with a Gaussian process interpretation, and it was revisited later by Le Roux and Bengio and by Cho and Saul with multilayer models.

In particular, when $s$ is the rectified linear unit and $w$ follows a Gaussian distribution, it is known that we recover the arc-cosine kernel. We may also note that random Fourier features yield a similar interpretation.
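The ReLU case is easy to check numerically. The Python sketch below compares a Monte Carlo estimate of $\mathbb{E}_w[s(w^\top x) s(w^\top x')]$ for standard Gaussian $w$ with the closed form $\frac{1}{2\pi} \|x\| \|x'\| (\sin \varphi + (\pi - \varphi) \cos \varphi)$ , where $\varphi$ is the angle between $x$ and $x'$ . With this normalization the expectation is half of the degree-one arc-cosine kernel as it is usually defined by Cho and Saul. The sample size and the two test vectors are arbitrary choices.

```python
import numpy as np

def relu_kernel_mc(x, y, n_samples=400_000, seed=0):
    """Monte Carlo estimate of E_w[relu(w.x) relu(w.y)] with w ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_samples, x.shape[0]))
    return np.mean(np.maximum(W @ x, 0.0) * np.maximum(W @ y, 0.0))

def relu_kernel_closed_form(x, y):
    """(1 / 2pi) ||x|| ||y|| (sin(phi) + (pi - phi) cos(phi))."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    phi = np.arccos(np.clip(x @ y / (nx * ny), -1.0, 1.0))
    return nx * ny * (np.sin(phi) + (np.pi - phi) * np.cos(phi)) / (2.0 * np.pi)

x = np.array([1.0, 2.0, -1.0])
y = np.array([0.5, -1.0, 2.0])
print(relu_kernel_mc(x, y), relu_kernel_closed_form(x, y))
```

The Monte Carlo average is exactly a (very wide) single-layer network with random weights, which is the interpretation mentioned in the text.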

Other important links have also been drawn recently between kernel regression and strongly over-parametrized neural networks, see this paper and that one, which is another exciting story.

Posted in Machine learning | Leave a comment

Guest post by Julien Mairal: A Kernel Point of View on Convolutional Neural Networks, part I

Posted on July 10, 2019 by Sebastien Bubeck

I (n.b., Julien Mairal) have been interested in drawing links between neural networks and kernel methods for some time, and I am grateful to Sebastien for giving me the opportunity to say a few words about it on his blog. My initial motivation was not to provide another “why deep learning works” theory, but simply to encode into kernel methods a few successful principles from convolutional neural networks (CNNs), such as the ability to model the local stationarity of natural images at multiple scales—we may call that modeling receptive fields—along with feature compositions and invariant representations. There was also something challenging in trying to reconcile end-to-end deep neural networks and non-parametric methods based on kernels that typically decouple data representation from the learning task.

The main goal of this blog post is then to discuss the construction of a particular multilayer kernel for images that encodes the previous principles, derive some invariance and stability properties for CNNs, and also present a simple mechanism to perform feature learning in reproducing kernel Hilbert spaces. In other words, we should not see any intrinsic contradiction between kernels and representation learning.

Preliminaries on kernel methods

Given data living in a set $\mathcal{X}$ , a positive definite kernel $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ implicitly defines a Hilbert space $\mathcal{H}$ of functions from $\mathcal{X}$ to $\mathbb{R}$ , called a reproducing kernel Hilbert space (RKHS), along with a mapping function $\varphi: \mathcal{X} \to \mathcal{H}$ .

A predictive model $f$ in $\mathcal{H}$ associates to every point $x$ a label in $\mathbb{R}$ , and admits a simple form $f(x) =\langle f, \varphi(x) \rangle_{\mathcal{H}}$ . Then, the Cauchy-Schwarz inequality gives us a first basic stability property

$\forall x, x'\in \mathcal{X},~~~~~ |f(x)-f(x')| \leq \|f\|_{\mathcal{H}} \| \varphi(x) - \varphi(x')\|_\mathcal{H}.$

This relation exhibits a discrepancy between neural networks and kernel methods. Whereas neural networks optimize the data representation for a specific task, the term on the right involves the product of two quantities where data representation and learning are decoupled:

$\|\varphi(x)-\varphi(x')\|_\mathcal{H}$ is a distance between two data representations $\varphi(x),\varphi(x')$ , which are independent of the learning process, and $\|f\|_\mathcal{H}$ is a norm on the model $f$ (typically optimized over data) that acts as a measure of complexity.
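To make the two quantities concrete, here is a small Python sketch under some assumptions that are not in the text above: a Gaussian kernel, and a model of the form $f = \sum_i \alpha_i K(z_i, \cdot)$ for a few hypothetical anchor points $z_i$. For such an $f$ one has $\|f\|_{\mathcal{H}}^2 = \sum_{i,j} \alpha_i \alpha_j K(z_i,z_j)$ and $\|\varphi(x)-\varphi(x')\|_{\mathcal{H}}^2 = K(x,x) + K(x',x') - 2K(x,x')$, so both sides of the stability inequality can be evaluated.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def rkhs_function(alphas, anchors, kernel):
    """Return f(x) = sum_i alpha_i * K(z_i, x)."""
    def f(x):
        return sum(a * kernel(z, x) for a, z in zip(alphas, anchors))
    return f


def rkhs_norm(alphas, anchors, kernel):
    """||f||_H for f = sum_i alpha_i K(z_i, .), i.e. sqrt(alpha^T K alpha)."""
    total = 0.0
    for ai, zi in zip(alphas, anchors):
        for aj, zj in zip(alphas, anchors):
            total += ai * aj * kernel(zi, zj)
    return math.sqrt(max(total, 0.0))


def feature_distance(x, y, kernel):
    """||phi(x) - phi(y)||_H computed with the kernel trick."""
    sq = kernel(x, x) + kernel(y, y) - 2.0 * kernel(x, y)
    return math.sqrt(max(sq, 0.0))


anchors = [[0.0, 0.0], [1.0, 0.5], [-0.5, 2.0]]
alphas = [1.0, -2.0, 0.5]
f = rkhs_function(alphas, anchors, gaussian_kernel)
x, x_prime = [0.2, 0.1], [0.3, -0.4]
lhs = abs(f(x) - f(x_prime))
rhs = rkhs_norm(alphas, anchors, gaussian_kernel) * feature_distance(x, x_prime, gaussian_kernel)
print(lhs, "<=", rhs)
```

Note that `feature_distance` never looks at `alphas`, and `rkhs_norm` never looks at the test points: this is the decoupling discussed above.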

Thinking about neural networks in terms of kernel methods then requires defining the underlying representation $\varphi(x)$ , which can only depend on the network architecture, and the model $f$ , which will be parametrized by the (learned) network weights.

Building a convolutional kernel for convolutional neural networks

Following Alberto Bietti’s paper, we now consider the direct construction of a multilayer convolutional kernel for images. Given a two-dimensional image $x_0$ , the main idea is to build a sequence of “feature maps” $x_1,x_2,\ldots$ that are two-dimensional spatial maps carrying information about image neighborhoods (a.k.a receptive fields) at every location. As we proceed in this sequence, the goal is to model larger neighborhoods with more “invariance”.

Formally, an input image $x_0$ is represented as a square-integrable function in $L^2(\Omega,\mathcal{H}_0)$ , where $\Omega$ is a set of pixel coordinates, and $\mathcal{H}_0$ is a Hilbert space. $\Omega$ may be a discrete grid or a continuous domain such as $\mathbb{R}^2$ , and $\mathcal{H}_0$ may simply be $\mathbb{R}^3$ for RGB images. Then, a feature map $x_k$ in $L^2(\Omega,\mathcal{H}_k)$ is obtained from a previous layer $x_{k-1}$ as follows:

modeling larger neighborhoods than in the previous layer: we map neighborhoods (patches) from $x_{k-1}$ to a new Hilbert space $\mathcal{H}_k$ . Concretely, we define a homogeneous dot-product kernel between patches $z, z'$ from $x_{k-1}$ :
$K_k(z,z') = \|z\| \|z'\| \kappa_k \left( \left\langle \frac{z}{\|z\|}, \frac{z'}{\|z'\|} \right\rangle \right),$

where $\langle . , . \rangle$ is an inner-product derived from $\mathcal{H}_{k-1}$ , and $\kappa_k$ is a non-linear function that ensures positive definiteness, e.g., $\kappa_k(\langle u,u'\rangle ) = e^{\alpha (\langle u,u'\rangle -1)} = e^{-\frac{\alpha}{2}\|u-u'\|^2}$ for vectors $u, u'$ with unit norm, see this paper. By doing so, we implicitly define a kernel mapping $\varphi_k$ that maps patches from $x_{k-1}$ to a new Hilbert space $\mathcal{H}_k$ . This mechanism is illustrated in the picture at the beginning of the post, and produces a spatial map that carries these patch representations.
increasing invariance: to gain invariance to small deformations, we smooth $x_{k-1}$ with a linear filter, as shown in the picture at the beginning of the post, which may be interpreted as anti-aliasing (in terms of signal processing) or linear pooling (in terms of neural networks).
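The patch kernel $K_k$ of the first step can be sketched in a few lines of Python. This is only an illustration under assumptions of mine: patches are plain lists of floats (so the inner product is the Euclidean one), $\kappa_k(u) = e^{\alpha(u-1)}$ as in the example above, and the name `homogeneous_kernel` and the default value of `alpha` are hypothetical.

```python
import math


def homogeneous_kernel(z, z_prime, alpha=1.0):
    """K(z, z') = |z| |z'| kappa(<z/|z|, z'/|z'|>) with kappa(u) = exp(alpha * (u - 1))."""
    norm_z = math.sqrt(sum(a * a for a in z))
    norm_zp = math.sqrt(sum(a * a for a in z_prime))
    if norm_z == 0.0 or norm_zp == 0.0:
        # Convention of this sketch: the kernel vanishes on zero patches.
        return 0.0
    cosine = sum(a * b for a, b in zip(z, z_prime)) / (norm_z * norm_zp)
    return norm_z * norm_zp * math.exp(alpha * (cosine - 1.0))


u, v = [1.0, 0.0], [0.6, 0.8]  # two unit-norm patches
sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
print(homogeneous_kernel(u, v, alpha=2.0), math.exp(-0.5 * 2.0 * sq_dist))
```

The two printed values should agree, which is the identity $e^{\alpha (\langle u,u'\rangle -1)} = e^{-\frac{\alpha}{2}\|u-u'\|^2}$ for unit-norm vectors. The kernel is also positively homogeneous: scaling a patch by $c > 0$ scales the kernel value by $c$.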

Formally, the previous construction amounts to applying operators $P_k$ (patch extraction), $M_k$ (kernel mapping), and $A_k$ (smoothing/pooling operator) to $x_{k-1}$ such that the $n$ -th layer representation can be written as

$\Phi_n(x_0)= x_n= A_n M_n P_n \ldots A_1 M_1 P_1 x_0~~~\text{in}~~~~L^2(\Omega,\mathcal{H}_n).$

We may finally define a kernel for images as $\mathcal{K}_n(x_0,x_0')=\langle \Phi_n(x_0), \Phi_n(x_0') \rangle$ , whose RKHS contains the functions $f_w(x_0) = \langle w , \Phi_n(x_0) \rangle$ for $w$ in $L^2(\Omega,\mathcal{H}_n)$ . Note now that we have introduced a concept of image representation $\Phi_n$ , which only depends on some network architecture (amounts of pooling, patch size), and a predictive model $f_w$ parametrized by $w$ .

From such a construction, we will now derive stability results for classical convolutional neural networks (CNNs) and then derive non-standard CNNs based on kernel approximations that we call convolutional kernel networks (CKNs).

Next week, we will see how to perform feature (end-to-end) learning with the previous kernel representation, and also discuss other classical links between neural networks and kernel methods.

Posted in Machine learning | 1 Comment

Optimal bound for stochastic bandits with corruption

Posted on June 20, 2019 by Sebastien Bubeck

Guest post by Mark Sellke.

In the comments of the previous blog post we asked if the new viewpoint on best of both worlds can be used to get clean “interpolation” results. The context is as follows: in a STOC 2018 paper followed by a COLT 2019 paper, the following corruption model was discussed: stochastic bandits, except for $C$ rounds which are adversarial. The state of the art bounds were of the form: optimal (or almost optimal) stochastic term plus $K C$ , and it was mentioned as an open problem whether $KC$ could be improved to $C$ (there is a lower bound showing that $C$ is necessary, at least when $C = O(\sqrt{T})$ ). As was discussed in the comment section, it seemed that this clean best of both worlds approach should shed light on the corruption model. It turns out that this is indeed the case, and a one-line calculation positively resolves the open problem from the COLT paper. The formal result is as follows (recall the notation/definitions from the previous blog post):

Lemma: Consider a strategy whose regret with respect to the optimal action $i^*$ is upper bounded by

(1) $\begin{equation*} c \sum_{t=1}^T \sum_{i \neq i^*} \sqrt{\frac{x_{i,t}}{t}} \,. \end{equation*}$

Then in the $C$ -corruption stochastic bandit model one has that the regret is bounded by:

$C + 2 c \sqrt{K C} + c^2 \sum_{i \neq i^*} \frac{\log(T)}{\Delta_i}$

Note that by the previous blog post we know strategies that satisfy (1) with $c=10$ (see Lemma 2 in the previous post).

Proof: In equation (1), applying Jensen over the corrupt rounds yields a term $c \sqrt{K C}$ . For the non-corrupt rounds, let us use that

$c \sqrt{\frac{x_{i,t}}{t}} \leq \frac{1}{2} \left( \Delta_i x_{i,t} + \frac{c^2}{t \Delta_i} \right)$

The sum of the second term on the right hand side is upper bounded by $c^2 \sum_{i \neq i^*} \frac{\log(T)}{2 \Delta_i}$ . On the other hand the sum (over non-corrupt rounds) of the first term is equal to $1/2$ of the regret over the non-corrupt rounds, which is certainly smaller than $1/2$ of (the total regret plus $C$ ). Thus we obtain (denoting $R$ for the total regret):

$R \leq c \sqrt{K C} + c^2 \sum_{i \neq i^*} \frac{\log(T)}{2 \Delta_i} + \frac{C}{2} + \frac{R}{2}$

which concludes the proof.
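For readers who want to see the last rearrangement spelled out, here is a small Python sketch. It checks that moving $R/2$ to the left hand side of the final inequality and multiplying by $2$ gives exactly the bound stated in the lemma. The function names and the example values are hypothetical, and `gaps` stands for the list of $\Delta_i$ over the suboptimal arms.

```python
import math


def stochastic_term(c, gaps, T):
    """c^2 * sum over suboptimal arms of log(T) / Delta_i."""
    return c ** 2 * sum(math.log(T) / gap for gap in gaps)


def lemma_bound(c, K, C, T, gaps):
    """Bound stated in the lemma: C + 2 c sqrt(K C) + c^2 sum_i log(T) / Delta_i."""
    return C + 2.0 * c * math.sqrt(K * C) + stochastic_term(c, gaps, T)


def bound_from_proof(c, K, C, T, gaps):
    """The proof gives R <= a + R / 2, hence R <= 2 a, with a as below."""
    a = c * math.sqrt(K * C) + 0.5 * stochastic_term(c, gaps, T) + 0.5 * C
    return 2.0 * a


gaps = [0.1, 0.2, 0.5]  # hypothetical gaps, so K = 4 arms
print(lemma_bound(10.0, 4, 50, 10 ** 5, gaps), bound_from_proof(10.0, 4, 50, 10 ** 5, gaps))
```

The corruption level $C$ enters only through the additive $C$ and the $\sqrt{K C}$ term, with no $K C$ product.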

Posted in Machine learning, Optimization, Theoretical Computer Science | Leave a comment

Amazing progress in adversarially robust stochastic multi-armed bandits

Posted on June 10, 2019 by Sebastien Bubeck

In this post I briefly discuss some recent stunning progress on robust bandits (for more background on bandits see these two posts, part 1 and part 2, in particular what is described below gives a solution to Open Problem 3 at the end of part 2).

Stochastic bandit and adversarial examples

In multi-armed bandit problems the gold standard property, going back to a seminal paper of Lai and Robbins in 1985 is to have a regret upper bounded by:

(1) $\begin{equation*} \sum_{i \neq i^*} \frac{\log(T)}{\Delta_i} \,. \end{equation*}$

Let me unpack this a bit: this is for the scenario where the reward process for each action is simply an i.i.d. sequence from some fixed distribution, $i^*$ is the index of the (unique) best action, and $\Delta_i$ is the gap between the mean value of the best action and the one of $i$ . Such a guarantee is extremely strong, as in particular it means that actions whose average performance is a constant away from the optimal arm are very rarely played (only of order $\log(T)$ times). On the other hand, the price to pay for such an aggressive behavior (by this I mean focusing on good actions very quickly) is that all the classical algorithms attaining the above bound are extremely sensitive to adversarial examples: that is if there is some deviation from the i.i.d. assumption (even very brief in time), the algorithms can suddenly suffer linear in $T$ regret.

Adversarial bandit

Of course there is an entirely different line of works, on adversarial multi-armed bandits, where the whole point is to prove regret guarantees for any reward process. In this case the best one can hope for is a regret of order $\sqrt{K T}$ . The classical algorithm in this model, Exp3, attains a regret of order $\sqrt{K T \log(K)}$ . In joint work with Jean-Yves Audibert we showed back in 2009 that the following strategy, which we called PolyINF, attains the optimal $\sqrt{KT}$ : view Exp3 as Mirror Descent with the (negative) entropy as a regularizer, and now replace the entropy by a simple rational function namely $- \sum_{i=1}^K x_i^p$ with $p \in (0,1)$ (this mirror descent view was actually derived in a later paper with Gabor Lugosi). The proof becomes one line (given the appropriate knowledge of mirror descent and estimation in bandit games): the radius part of the bound is of the form $\frac{1}{\eta} \sum_{i=1}^K x_i^p \leq \frac{K^{1-p}}{\eta}$ , while the variance is of the form (since the inverse of the Hessian of the mirror map is a diagonal matrix with entries $x_i^{2-p}$ ):

$\eta \sum_{i=1}^K x_i^{2-p} \frac{1}{x_i} \leq \eta K^{p} \,.$

Thus optimizing over $\eta$ yields $\sqrt{K T}$ for any $p \in (0,1)$ . Interestingly, the best numerical constant in the bound is obtained for $p=1/2$ .
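To give a feel for what the strategy looks like for $p=1/2$, here is a minimal Python sketch. It is written in the "regularized leader" form with a fixed $\eta$ and the regularizer $-\sum_i \sqrt{x_i}$, which is a simplification I am assuming for illustration and not necessarily the exact normalization of the papers. The first-order condition gives $x_i = (2 \eta (L_i + \mu))^{-2}$, where $L_i$ is the cumulative estimated loss of action $i$ and $\mu$ is found by bisection so that the $x_i$ sum to one. All names are hypothetical.

```python
import math


def polyinf_half_distribution(cum_losses, eta, iters=100):
    """Minimize <L, x> - (1/eta) * sum_i sqrt(x_i) over the simplex (p = 1/2)."""
    base = min(cum_losses)
    K = len(cum_losses)

    def probs(offset):
        # offset = mu + min_i L_i > 0 keeps every denominator positive.
        return [1.0 / (2.0 * eta * (L - base + offset)) ** 2 for L in cum_losses]

    # At the upper end each coordinate is at most 1/K, so the sum is at most 1.
    lo, hi = 0.0, math.sqrt(K) / (2.0 * eta)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(probs(mid)) > 1.0:
            lo = mid
        else:
            hi = mid
    x = probs(hi)
    total = sum(x)
    return [v / total for v in x]  # remove the tiny residual bisection error


def importance_weighted_update(cum_losses, probs, arm, loss):
    """Standard bandit estimator: only the played arm is updated, by loss / probability."""
    updated = list(cum_losses)
    updated[arm] += loss / probs[arm]
    return updated


x = polyinf_half_distribution([0.0, 1.0, 3.0], eta=0.5)
print(x)
```

Actions with larger estimated cumulative loss get polynomially (rather than exponentially) smaller probability, which is the only difference with Exp3 in this view.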

Best of both worlds

This was the state of affairs back in 2011, when with Alex Slivkins we started working on a best of both worlds type algorithm (which in today’s language is exactly a stochastic MAB robust to adversarial examples): namely one that gets the $\log(T)$ guarantee (in fact $\log^2(T)$ in our original paper) if the environment is the nice i.i.d. one, and also $\sqrt{K T}$ (in fact $\sqrt{K T \log^3(T)}$ ) in the worst case. This original best of both worlds algorithm was of the following form: be aggressive as if it was a stochastic environment, but still sample sufficiently often the bad actions to make sure there isn’t an adversary trying to hide some non-stochastic behavior on these seemingly bad performing actions. Of course the whole difficulty was to show that it is possible to implement such a defense without hurting the stochastic performance too much (remember that bad actions can only be sampled of order $\log(T)$ times!). Since this COLT 2012 paper there have been many improvements to the original strategy, as well as many variants/refinements (one line of variants worth mentioning is the work on a smooth transition between the stochastic and adversarial models, see e.g. here and here).

A stunning twist

The amazing development that I want to talk about in this post is the following: about a year ago Julian Zimmert and Yevgeny Seldin proved that the 2009 PolyINF (crucially with $p=1/2$ ) strategy actually gets the 2012 best of both worlds bound! This is truly surprising, as in principle mirror descent does not “know” anything about stochastic environments, it does not make any sophisticated concentration reasoning (say as in Lai and Robbins), yet it seems to automatically and optimally pick up on the regularity in the data. This is really amazing to me, and of course also a total surprise that the polynomial regularizer has such a strong adaptivity property, while it was merely introduced to remove a benign log term.

The crucial observation of Zimmert and Seldin is that a certain self-bounding property of the regret implies (in a one-line calculation) the best of both worlds result:

Lemma 1: Consider a strategy whose regret with respect to the optimal action $i^*$ is upper bounded by

(2) $\begin{equation*} C \sum_{t=1}^T \sum_{i \neq i^*} \sqrt{\frac{x_{i,t}}{t}} \,. \end{equation*}$

(Recall that for multi-armed bandit one selects a probability distribution $x_t$ over the actions, so $x_{i,t}$ denotes here the probability of playing action $i$ at time $t$ .) Then one has that the regret is in fact bounded by $2 C \sqrt{K T}$ (this follows trivially by Jensen on the $i$ sum), and moreover if the environment is stochastic one has that the regret is in fact bounded by $C^2$ times $(1)$ .

Proof: Assuming that the environment is stochastic we can write the regret as $\sum_{i,t} \Delta_i x_{i,t}$ , so by assumption and using that $C \sqrt{\frac{x_{i,t}}{t}} \leq \frac{1}{2} \left( \Delta_i x_{i,t} + \frac{C^2}{t \Delta_i} \right)$ one has:

$\sum_{i \neq i^*,t} \Delta_i x_{i,t} \leq \frac{1}{2} \sum_{i \neq i^*,t} \left(\Delta_i x_{i,t} + \frac{C^2}{t \Delta_i} \right) \,,$

which means that the left hand side is smaller than $\sum_{i \neq i^*,t} \frac{C^2}{t \Delta_i}$ which is indeed smaller than $C^2$ times $(1)$ .
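The only inequality used in this proof is the arithmetic-geometric mean inequality $\sqrt{ab} \leq \frac{1}{2}(a+b)$ with $a = \Delta_i x_{i,t}$ and $b = \frac{C^2}{t \Delta_i}$ . Here is a small Python sketch that checks it numerically on random inputs, and also evaluates the final sum $\sum_{i \neq i^*,t} \frac{C^2}{t \Delta_i}$ . Function names and test values are hypothetical.

```python
import math
import random


def amgm_slack(C, x, t, gap):
    """Right hand side minus left hand side of C sqrt(x/t) <= (gap*x + C^2/(t*gap)) / 2."""
    lhs = C * math.sqrt(x / t)
    rhs = 0.5 * (gap * x + C ** 2 / (t * gap))
    return rhs - lhs


def stochastic_regret_bound(C, gaps, T):
    """sum over suboptimal arms i and rounds t of C^2 / (t * Delta_i)."""
    harmonic = sum(1.0 / t for t in range(1, T + 1))
    return C ** 2 * harmonic * sum(1.0 / gap for gap in gaps)


rng = random.Random(0)
worst = min(
    amgm_slack(rng.uniform(0.1, 10.0), rng.random(), rng.randint(1, 1000), rng.uniform(0.01, 1.0))
    for _ in range(10000)
)
print(worst)  # never negative, up to floating point rounding
```

The slack vanishes exactly when $\Delta_i x_{i,t} = \frac{C^2}{t \Delta_i}$ , and the harmonic sum $\sum_{t \leq T} 1/t$ is at most $1 + \log(T)$ , which is where the $\log(T)$ in $(1)$ comes from.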

Yet another one-line proof (okay, maybe 5 lines)

Zimmert and Seldin proved that PolyINF with $p=1/2$ actually satisfies the self-bounding property of Lemma 1 (and thus obtains the best of both worlds guarantee). In another recent paper by Zimmert, in joint work with Haipeng Luo and Chen-Yu Wei, they simplify the analysis by using a very mild variation of the PolyINF regularizer, namely $- \sum_{i=1}^K (\sqrt{x_i} + \sqrt{1-x_i})$ . In my view it’s the proof from the book for the best of both worlds result (or very close to it)! Here it is:

Lemma 2: Equation $(2)$ with $C=10$ is an upper bound on the regret of mirror descent with learning rate $\eta_t = \frac{1}{\sqrt{t}}$ , mirror map $\Phi(x) = - \sum_{i=1}^K (\sqrt{x_i} + \sqrt{1-x_i})$ , and standard multi-armed bandit loss estimator.

Proof: The classical mirror descent analysis from any good book will tell you that the regret is controlled by (for $\Phi(x) = \sum_{i=1}^K \phi(x_i)$ and with the convention $\eta_0 = + \infty$ ):

(3) $\begin{equation*} \sum_{t=1}^T \left(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\right) (\Phi(x^*) - \Phi(x_t)) + \sum_{t=1}^T \eta_t \sum_{i=1}^K \frac{1}{x_{i,t} \phi''(x_{i,t})} \,. \end{equation*}$

We now consider those terms for the specific $\Phi$ and $\eta_t$ suggested in the lemma. First notice that $\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \leq \eta_t$ . Moreover $\phi(x^*_i) = - 1$ (since $x^*$ is integral) so that $\phi(x^*_i) - \phi(x_{i,t}) \leq \min(\sqrt{x_{i,t}}, \sqrt{1-x_{i,t}})$ . In other words the first term in $(3)$ is upper bounded by

$\sum_{t=1}^T \sum_{i=1}^K \sqrt{\frac{\min(x_{i,t}, 1-x_{i,t})}{t}} \leq 2 \sum_{t=1}^T \sum_{i \neq i^*} \sqrt{\frac{x_{i,t}}{t}} \,$

where the inequality simply comes from $\sqrt{1-x_{i^*,t}} = \sqrt{\sum_{i \neq i^*} x_{i,t}} \leq \sum_{i \neq i^*} \sqrt{x_{i,t}}$ .

Next note that $\phi''(s) = \frac{1}{4} (s^{-3/2} + (1-s)^{-3/2}) \geq \frac{1}{4 \min(s,1-s)^{3/2}}$ , so that the second term in $(3)$ is upper bounded by

$4 \sum_{t=1}^T \sum_{i=1}^K \sqrt{\frac{x_{i,t}}{t}} \min(1, (1-x_{i,t})/x_{i,t})^{3/2} \leq 8 \sum_{t=1}^T \sum_{i \neq i^*} \sqrt{\frac{x_{i,t}}{t}} \,$

where the inequality follows trivially by considering the two cases whether $x_{i^*,t}$ is smaller or larger than $1/2$ .
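The three elementary facts about $\phi(s) = -(\sqrt{s} + \sqrt{1-s})$ used in this proof are easy to check numerically. The Python sketch below (helper names are mine) compares the formula for $\phi''$ with a finite difference, and tests the bounds $\phi(x^*_i) - \phi(s) \leq \min(\sqrt{s}, \sqrt{1-s})$ (with $\phi(x_i^*) = -1$ ) and $\phi''(s) \geq \frac{1}{4 \min(s,1-s)^{3/2}}$ on a grid.

```python
import math


def phi(s):
    return -(math.sqrt(s) + math.sqrt(1.0 - s))


def phi_second(s):
    """Closed form of the second derivative of phi on (0, 1)."""
    return 0.25 * (s ** -1.5 + (1.0 - s) ** -1.5)


def finite_difference_second(f, s, h=1e-4):
    """Central finite difference approximation of f''(s)."""
    return (f(s + h) - 2.0 * f(s) + f(s - h)) / (h * h)


grid = [k / 100.0 for k in range(1, 100)]
radius_ok = all(-1.0 - phi(s) <= min(math.sqrt(s), math.sqrt(1.0 - s)) + 1e-12 for s in grid)
hessian_ok = all(phi_second(s) >= 0.25 / min(s, 1.0 - s) ** 1.5 - 1e-12 for s in grid)
print(radius_ok, hessian_ok)
```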

Posted in Machine learning, Optimization, Theoretical Computer Science | 6 Comments

Nemirovski’s acceleration

Posted on January 9, 2019 by Sebastien Bubeck

I will describe here the very first (to my knowledge) acceleration algorithm for smooth convex optimization, which is due to Arkadi Nemirovski (dating back to the end of the 70’s). The algorithm relies on a $2$ -dimensional plane-search subroutine (which, in theory, can be implemented in $\log(1/\epsilon)$ calls to a first-order oracle). He later improved it to only require a $1$ -dimensional line-search in 1981, but of course the breakthrough that everyone knows about came shortly after with the famous 1983 paper by Nesterov that gets rid of this extraneous logarithmic term altogether (and in addition is based on the deep insight of modifying Polyak’s momentum).

Let $f$ be a $1$ -smooth function. Denote $x^{+} = x - \nabla f(x)$ . Fix a sequence $(\lambda_t)_{t \in \mathbb{N}}$ , to be optimized later. We consider the “conjugate” point $\sum_{s =1}^t \lambda_s \nabla f(x_s)$ . The algorithm simply returns the optimal combination of the conjugate point and the gradient descent point, that is:

$x_{t+1} = \mathrm{argmin}_{x \in P_t} f(x) \, \text{where} \, P_t = \mathrm{span}\left(x_t^+, \sum_{s =1}^t \lambda_s \nabla f(x_s)\right) \,.$

Let us denote $g_s = \nabla f(x_s)$ and $\delta_s = f(x_s) - f(x^*)$ for shorthand. The key point is that $g_{t+1} \in P_t^{\perp}$ , and in particular $\|\sum_{s \leq t} \lambda_s g_s\|^2 = \sum_{s \leq t} \lambda_s^2 \|g_s\|^2$ . Now recognize that $\|g_s\|^2$ is (up to a factor $1/2$ , which we ignore here) a lower bound on the improvement $\delta_s - \delta_{s+1}$ (here we use that $x_{s+1}$ is better than $x_s^+$ ). Thus we get:

$\|\sum_{s \leq t} \lambda_s g_s\|^2 \leq \sum_{s \leq t} \lambda_s^2 (\delta_s - \delta_{s+1}) \leq \sum_{s \leq t} \delta_s (\lambda_s^2 - \lambda_{s-1}^2) \,.$

In other words if the sequence $\lambda$ is chosen such that $\lambda_s = \lambda_s^2 - \lambda_{s-1}^2$ then we get

$\|\sum_{s \leq t} \lambda_s g_s\|^2 \leq \sum_{s \leq t} \lambda_s \delta_s \,.$

This is good because roughly the reverse inequality also holds true by convexity (and the fact that $x_s \in P_{s-1}$ while $g_s \in P_{s-1}^{\perp}$ , so $g_s \cdot x_s = 0$ ):

$\sum_{s \leq t} \lambda_s \delta_s \leq \sum_{s \leq t} \lambda_s g_s \cdot (x_s - x^*) \leq \|x^*\| \cdot \| \sum_{s \leq t} \lambda_s g_s\| \,.$

So finally we get $\sum_{s \leq t} \lambda_s \delta_s \leq \|x^*\|^2$ , and it just remains to realize that $\lambda_s$ is of order $s$ so that $\delta_t \leq \|x^*\|^2 / t^2$ .
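To see why $\lambda_s$ is of order $s$ , note that the recursion $\lambda_s = \lambda_s^2 - \lambda_{s-1}^2$ has the positive solution $\lambda_s = \frac{1}{2}\left(1 + \sqrt{1 + 4 \lambda_{s-1}^2}\right)$ . The short Python sketch below assumes the initialization $\lambda_0 = 0$ (my choice, not stated above) and checks that $\frac{s+1}{2} \leq \lambda_s \leq s$ , as well as the telescoping identity $\sum_{s \leq t} \lambda_s = \lambda_t^2$ .

```python
import math


def lambda_sequence(n, lam0=0.0):
    """Return [lambda_1, ..., lambda_n] with lambda_s^2 - lambda_s = lambda_{s-1}^2."""
    lam = lam0
    sequence = []
    for _ in range(n):
        # Positive root of the quadratic equation lam_s^2 - lam_s - lam_{s-1}^2 = 0.
        lam = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * lam * lam))
        sequence.append(lam)
    return sequence


lams = lambda_sequence(1000)
print(lams[0], lams[9], lams[99], lams[999])
```

Since $\delta_s$ is non-increasing, $\sum_{s \leq t} \lambda_s \delta_s \geq \delta_t \sum_{s \leq t} \lambda_s = \delta_t \lambda_t^2$ , and the lower bound $\lambda_t \geq \frac{t+1}{2}$ then gives the $1/t^2$ rate.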

Posted in Optimization | 8 Comments

A short proof for Nesterov’s momentum

Posted on November 21, 2018 by Sebastien Bubeck

Yesterday I posted the following picture on Twitter and it quickly became my most visible tweet ever (by far):

$\begin{eqnarray*} & x^+ := x - \eta \nabla f(x) & \text{(gradient step)} \\ & d_t := \gamma_t \cdot (x_{t} - x_{t-1}) & \text{(momentum term)} \\ \\ \text{ [Cauchy, 1847] } & x_{t+1} = x_t^+ & \text{(gradient descent)} \\ \text{ [Polyak, 1964] } & x_{t+1} = x_t^+ + d_{t} & \text{(momentum + gradient)} \\ \text{ [Nesterov, 1983] } & x_{t+1} = (x_t + d_{t})^+ & \text{(momentum + lookahead gradient)} \end{eqnarray*}$
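In code, the three update rules of the picture read as follows. This is a plain Python sketch where points are lists of floats and `grad` is any function returning the gradient as a list; the function names are mine.

```python
def gradient_step(x, grad, eta):
    """x^+ = x - eta * grad f(x)."""
    return [xi - eta * gi for xi, gi in zip(x, grad(x))]


def momentum_term(x_t, x_prev, gamma):
    """d_t = gamma_t * (x_t - x_{t-1})."""
    return [gamma * (a - b) for a, b in zip(x_t, x_prev)]


def cauchy_update(x_t, x_prev, grad, eta, gamma):
    """Gradient descent: x_{t+1} = x_t^+ (x_prev and gamma are unused)."""
    return gradient_step(x_t, grad, eta)


def polyak_update(x_t, x_prev, grad, eta, gamma):
    """Momentum + gradient: x_{t+1} = x_t^+ + d_t."""
    d = momentum_term(x_t, x_prev, gamma)
    return [a + b for a, b in zip(gradient_step(x_t, grad, eta), d)]


def nesterov_update(x_t, x_prev, grad, eta, gamma):
    """Momentum + lookahead gradient: x_{t+1} = (x_t + d_t)^+."""
    d = momentum_term(x_t, x_prev, gamma)
    lookahead = [a + b for a, b in zip(x_t, d)]
    return gradient_step(lookahead, grad, eta)


# Toy example: f(x) = x^2 / 2 in one dimension, so grad f(x) = x.
grad = lambda x: list(x)
for update in (cauchy_update, polyak_update, nesterov_update):
    print(update.__name__, update([2.0], [1.0], grad, eta=0.5, gamma=0.5))
```

The only difference between the last two rules is the point at which the gradient is evaluated: at $x_t$ for Polyak, at $x_t + d_t$ for Nesterov.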

I thought this would be a good opportunity to revisit the proof of Nesterov’s momentum, especially since as it turns out I really don’t like the way I described it back in 2013 (and to this day the latter post also remains my most visible post ever…). So here we go, for what is hopefully a short and intuitive proof of the $1/t^2$ convergence rate for Nesterov’s momentum (disclaimer: this proof is merely a rearranging of well-known calculations, nothing new is going on here).

We assume that $f$ is a $\beta$ -smooth convex function, and we take $\eta = 1/\beta$ in the gradient step. The momentum term $\gamma_t$ will be set to a very particular value, which comes out naturally in the proof.

The two basic inequalities

Let us denote $\delta_t = f(x_t) - f(x^*)$ and $g_t = - \frac{1}{\beta} \nabla f(x_t + d_t)$ (note that $x_{t+1} = x_t + g_t + d_t$ ). Now let us write our favorite inequalities (using $f(x^+) - f(x) \leq - \frac{1}{2 \beta} |\nabla f(x)|^2$ and $f(x) - f(y) \leq \nabla f(x) \cdot (x-y)$ ):

$\delta_{t+1} - \delta_t \leq - \frac{\beta}{2} \left( |g_t|^2 + 2 g_t \cdot d_t \right) \,,$

and

$\delta_{t+1} \leq - \frac{\beta}{2} \left( |g_t|^2 + 2 g_t \cdot (x_t + d_t - x^*) \right) \,.$

On the way to a telescopic sum

Recall now that $|a|^2 + 2 a \cdot b = |a+b|^2 - |b|^2$ , so it would be nice to somehow combine the two above inequalities to obtain a telescopic sum thanks to this simple formula. Let us try to take a convex combination of the two inequalities. In fact it will be slightly more elegant if we use the coefficient $1$ on the second inequality, so let us do $\lambda_t-1$ times the first inequality plus $1$ times the second inequality. We obtain an inequality whose right hand side is given by $-\frac{\beta}{2}$ times

$\begin{align*} & \lambda_t |g_t|^2 + 2 g_t \cdot (x_t + \lambda_t d_t - x^*) \\ & = \frac{1}{\lambda_t} \left( |x_t + \lambda_t d_t - x^* + \lambda_t g_t|^2 - |x_t + \lambda_t d_t - x^*|^2 \right) \,. \end{align*}$

Recall that our objective is to obtain a telescopic sum, and at this point we still have flexibility both to choose $\lambda_t$ and $\gamma_t$ . What we would like to have is:

$x_t + \lambda_t d_t - x^* + \lambda_t g_t = x_{t+1} + \lambda_{t+1} d_{t+1} - x^* \,.$

Observe that (since $d_{t+1} = \gamma_{t+1} \cdot (d_t + g_t)$ ) the right hand side can be written as $x_t + g_t + d_t + \lambda_{t+1} \cdot \gamma_{t+1} \cdot (g_t + d_t) - x^*$ , and thus we see that we simply need to have:

$\lambda_t = 1+ \lambda_{t+1} \cdot \gamma_{t+1} \,.$

Setting the parameters and concluding the proof

Writing $u_t := \frac{\beta}{2} |x_t + \lambda_t d_t - x^*|^2$ we now obtain as a result of the combination of our two starting inequalities:

$\lambda_t^2 \delta_{t+1} - (\lambda_t^2 - \lambda_t) \delta_t \leq u_t - u_{t+1} \,.$

It only remains to select $\lambda_t$ such that $\lambda_t^2 - \lambda_t = \lambda_{t-1}^2$ (i.e., roughly $\lambda_t$ is of order $t$ ) so that by summing the previous inequality one obtains $\delta_{t+1} \leq \frac{\beta |x_1 - x^*|^2}{2 \lambda_t^2}$ which is exactly the $1/t^2$ rate we were looking for.
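Putting the pieces together, the constraint $\lambda_t = 1 + \lambda_{t+1} \gamma_{t+1}$ gives $\gamma_{t+1} = \frac{\lambda_t - 1}{\lambda_{t+1}}$ . Here is a Python sketch that runs the resulting method on a toy quadratic and records the bound $\frac{\beta |x_1 - x^*|^2}{2 \lambda_t^2}$ next to $\delta_{t+1}$ . The test function $f(x) = \frac{1}{2} \sum_i a_i x_i^2$ (for which $x^* = 0$ and $\beta = \max_i a_i$ ), the initialization $\lambda_1 = 1$ , $d_1 = 0$ , and all names are assumptions of this sketch.

```python
import math


def nesterov_on_quadratic(curvatures, x1, steps):
    """Run Nesterov's momentum on f(x) = 0.5 * sum_i a_i x_i^2.

    Returns a list of pairs (delta_{t+1}, beta |x_1 - x*|^2 / (2 lambda_t^2)).
    """
    beta = max(curvatures)

    def f(x):
        return 0.5 * sum(a * v * v for a, v in zip(curvatures, x))

    def grad(x):
        return [a * v for a, v in zip(curvatures, x)]

    radius_sq = sum(v * v for v in x1)  # |x_1 - x*|^2 with x* = 0
    lam = 1.0     # lambda_1, from lambda_1^2 - lambda_1 = lambda_0^2 = 0
    gamma = 0.0   # gamma_1 is irrelevant since x_1 - x_0 = 0
    x_prev, x = list(x1), list(x1)
    history = []
    for _ in range(steps):
        d = [gamma * (v - p) for v, p in zip(x, x_prev)]             # momentum term d_t
        y = [v + dv for v, dv in zip(x, d)]                          # lookahead point
        x_next = [yv - gv / beta for yv, gv in zip(y, grad(y))]      # (x_t + d_t)^+
        history.append((f(x_next), beta * radius_sq / (2.0 * lam * lam)))
        lam_next = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * lam * lam))    # lambda_{t+1}
        gamma = (lam - 1.0) / lam_next                               # gamma_{t+1}
        lam = lam_next
        x_prev, x = x, x_next
    return history


history = nesterov_on_quadratic([1.0, 0.1, 0.01], [1.0, 1.0, 1.0], steps=50)
print(history[0], history[-1])
```

On this example the recorded suboptimality stays below the recorded bound at every step, as the proof predicts.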

Posted in Optimization | 11 Comments

Remembering Michael

Posted on October 7, 2018 by Sebastien Bubeck

It has been a year since the tragic event of September 2017. We now know what happened, and it is a tremendously sad story of undiagnosed type 1 diabetes.

This summer at MSR Michael was still very present in our discussions, whether it was about some ideas that we discussed that last 2017 summer (acceleration, metrical task systems lower bounds, etc…), or just some random fun story.

I highly recommend taking a look at the YouTube videos from the November 2017 symposium in memory of Michael. You can also take a look at his (still growing) list of publications on arxiv. In fact I know of an upcoming major paper so stay tuned (the premises are in Yin Tat’s talk at the symposium).

As always when remembering this tragic loss my thoughts go to Michael’s family.