I started out this decade with the project of writing a survey of the multi-armed bandit literature, which I had read thoroughly during the graduate studies that I was about to finish. At the time we resisted the temptation to name the survey “modern banditology”, which was indeed the right call given how much this “modern” picture has evolved over the decade! It is truly wonderful to now end the decade with two new iterations on the work we did in that survey:

- Bandit algorithms by Tor Lattimore and Csaba Szepesvari
- Introduction to bandits by Alex Slivkins

These new references very significantly expand the 2012 survey, and they are wonderful starting points for anyone who wants to enter the field.

Here are some of the discoveries in the world of bandits that stood out for me this decade:

- We now understand very precisely Thompson Sampling, the first bandit strategy that was proposed back in 1933. The most beautiful reference here is the one by Dan Russo and Ben Van Roy: An Information-Theoretic Analysis of Thompson Sampling, JMLR 2016. Another one that stands out is Analysis of thompson sampling for the multi-armed bandit problem by S. Agrawal and N. Goyal at COLT 2012.
- T^{2/3} lower bound for *non-stochastic* bandit with switching cost by Dekel, Ding, Koren and Peres at
__STOC 201__4. This is a striking result for several reasons. In particular the proof has to be based on a non-trivial stochastic process, since for the classical stochastic i.i.d. model one can obtain \sqrt{T} (very easily in fact). - We now know that bandit convex optimization is “easy”, in the sense that it is a \sqrt{T}-regret type problem. What’s more is that in our STOC 2017 paper with Y.T. Lee and R. Eldan we introduced a new way to do function estimation based on bandit feedback, using kernels (I have written at length about this on this blog).
- A very intriguing model of computation for contextual bandit was proposed, where one can access the policy space only through an offline optimization oracle. With such access, the classical Exp4 algorithm cannot be simulated, and thus one needs new strategies. We now have a reasonable understanding that \sqrt{T} is doable with mild assumptions (see e.g. this ICML 2014 paper on “
__Taming the Monster“__by Agarwal, Hsu, Kale, Langford, L. Li and Schapire) and that it is impossible with no assumptions (work of Hazan and Koren at STOC 2016). - Honorable mentions also go to the work of Wei and Luo showing that very strong variation bounds are possible in bandits (see this COLT 2018 paper), and Zimmert and Seldin who made striking progress on the best of both worlds phenomenon that we discovered with Slivkins at the beginning of the decade (I blogged about it here already).

## Life beyond bandits

In addition to starting the decade with the bandit survey, I also started it with being bored with the bandit topic altogether. I thought that many (if not most) of the fundamental results were now known, and it was a good idea to move on to something else. Obviously I was totally wrong, as you can see with all the works cited above (and many many more for stochastic bandits, including much deeper understanding of best arm identification, a topic very close to my heart, see e.g., [Kaufmann, Cappe, Garivier, JMLR 16]). In fact I am now optimistic that there is probably another decade-worth of exploration left for the bandit problem(s). Nevertheless I ventured outside, and explored the world of optimization (out of which first came a survey, and more recently video lectures) and briefly networks (another modest survey came out of this too).

Here are some of the landmark optimization results of this decade in my view:

- Perhaps the most striking result of the decade in optimization is the observation that for finite sum problems, one can reduce the variance in stochastic gradient descent by somehow centering the estimates (e.g., using a slowly moving sequence on which we can afford to compute full gradients; but this is not the only way to perform such variance reduction). This idea, while very simple, has a lot of implications, both in practice and in theory! The origin of the idea are in the SAG algorithm of [Schmidt, Le Roux, Bach, NIPS 2012] and SDCA [Shalev-Shwartz and Zhang, JMLR 2013]. A simpler instantiation of the idea, called SVRG appeared shortly after in [Johnson and Zhang, NIPS 2013] (and also independently at the same NeurIPS, in [M. Madhavi, L. Zhang, R. Li, NIPS 2013]).
- An intriguing direction that I pursued fervently is the use of convex optimization for problems that have a priori nothing to do with convex optimization. A big inspiration for me was the COLT 2008 paper by Abernethy, Hazan and Rakhlin, who showed how mirror descent naturally solves bandit problems. In this decade, we (this we includes myself and co-authors, but also various other teams) explored how to use mirror descent for other online decision making problems, and made progress on some long-standing problems (k-server and MTS), see for example this set of video lectures on the “Five miracles of mirror descent”.
- Arnak Dalalyan showed how to use ideas inspired from convex optimization to analyze the Langevin Monte Carlo algorithm. This was absolutely beautiful work, that led to many many follow-ups.
- There has been a lot of rewriting of Nesterov’s acceleration, to try to demystify it. Overall the enterprise is not yet a resounding success in my opinion, but certainly a lot of progress has been made (again I have written a lot about it on this blog already). We now even have optimal acceleration for higher order of smoothness (see this 15 authors
__paper at COLT 2019__), but these techniques are clouded with the same shroud of mystery as was Nesterov’s original method. - Yin Tat Lee and Aaron Sidford obtained an efficient construction of a universal barrier.
- We now know that certain problems cannot be efficiently represented by SDPs (the so-called “extension complexity), see e.g. this work by Lee-Raghavendra-Steurer.
- We now know how to chase convex bodies, and we can even do so very elegantly with the Steiner/Sellke point.

## Some other things that captivated me

The papers above are mostly topics on which I tried to work at some point. Here are some questions that I didn’t work on but followed closely and was fascinated by the progress:

- The stochastic block model was essentially solved during this decade, see for example this survey by Emmanuel Abbe.
- The computational/statistical tradeoffs were extensively explored, yet they remain mysterious. A nice impulse to the field was given by this COLT 2013 paper by Berthet and Rigollet relating sparse PCA and planted clique. In a similar spirit I also enjoyed the more recent work by Moitra, Jerry Li, and many co-authors on computationally efficient robust estimation (see e.g., this recent paper)
- Adaptive data analysis strikes me as both very important in practice, and quite deep theoretically, see e.g. the reusable holdout by Dwork et al. A related paper that I liked a lot is this ICML 2015 paper by Blum and Hardt, which essentially explores the regularization effect of publishing only models that beat the state of the art significantly (more generally this is an extremely interesting question, of why we can keep using the same datasets to evaluate progress in machine learning, see this provokingly titled paper “Do ImageNet Classifiers Generalize to ImageNet?“).
- A general trend has been in finding very fast (nearly linear time) method for many classical problems. Sometimes these investigations even led to actually practical algorithm, as with this now classical paper by Marco Cuturi at NIPS 2013 titled “Sinkhorn Distances: Lightspeed Computation of Optimal Transport“.

## Oh, one last thing

I also heard that, surprisingly, gradient descent can work to optimize highly non-convex functions such as training loss for neural networks. Not sure what this is about, it’s a pretty obscure topic, maybe it will catch up in the decade 2020-2029…

## Share some more in the comments!

The above is only a tiny sample, there were many many more interesting directions being explored (tensor methods for latent variable models [Anandkumar, Ge, Hsu, Kakade, Telgarsky, JMLR 14]; phenomenon of “all local minima are good” for various non-convex learning problems, see e.g., [Ge, Lee, Ma, NIPS 2016]; etc etc). Feel free to share your favorite ML theory paper in the comments!

## By MICHAEL FROM NORTHWESTERN UNI. October 23, 2020 - 8:35 am

Hello. Thanx. I am from Northwestern university. Please also follow and be friend.

## By SUZAN FROM ILLINOIS October 23, 2020 - 8:34 am

Hello..its fantastic.

Thank you so much for sharing

http://www.goldensmile.com.tr

## By Anon January 16, 2020 - 2:16 pm

I think Adaptive Subgradient Methods for Online Learning and Stochastic Optimization (2011) by Duchi, Hazan and Singer is probably the most successful theory paper of all time in terms of how much impact it has had in practise (and possibly also in terms of citations?). Its also a fantastically readable paper, and my vote for best overall paper of the decade.

## By Sebastien Bubeck January 16, 2020 - 2:40 pm

Absolutely correct that this paper had an amazing impact in practice. I view it as the “best observation of the decade”. (Certainly saying that it is the most successful theory paper of all time is a bridge too far. Ever heard of Turing? :-).)

## By Daniel January 1, 2020 - 2:41 pm

Hi Seb,

Very nice summary!

One comment is that the “Taming the monster” paper also has Alekh Agarwal as a co-author.

Along the lines of computational-statistical trade-offs: I found the developments centered around Tensor PCA [1][2][3][4] to also be very interesting (and rather different from the sparse PCA work, which I agree is also very nice).

Happy new year!

–Daniel.

## By Sebastien Bubeck January 7, 2020 - 12:39 pm

Thanks Daniel, and so sorry for forgetting Alekh, I fixed it!

## By Avi Messica December 30, 2019 - 7:25 am

Great post!

Indeed, gradient descent does a good job in optimizing highly non-convex functions (training loss) in neural networks and that’s considered a mystery among many.

However, in high dimensional space the distance between minima and maxima gets squeezed, i.e. is short in comparison with 2D/3D and the like. I refer to this phenomenon as the “Blessing of High Dimensionality”. Therefore GD (SGD in case of NNs) finds a local minima that is very close to the global minima and that’s good enough for practical purposes.

Cheers,

Avi