I just came back from NIPS 2015, which was a clear success in terms of numbers. Note that this growth is not all because of deep learning: only about 10% of the papers were on this topic, which is about double the number on convex optimization, for example.
In this post I want to talk about some of the new emerging directions that the NIPS community is taking. Of course my view is completely biased, as I am more representative of COLT than NIPS (though obviously the two communities have a large overlap). Also, I only looked in detail at about 25% of the papers, so perhaps I missed the juiciest breakthrough. In any case, below you will find a short summary of each of these new directions with pointers to some of the relevant papers. Before going into the fun math I wanted to first share some thoughts about yesterday's big announcement.
Thoughts about OpenAI
Obvious disclaimer: the opinions expressed here represent my own and not those of my employer (or previous employer hosting this blog). Now, for those of you who missed it, yesterday Elon Musk and friends made a huge announcement: they are giving $1 billion to create a non-profit organization whose goal is the advancement of AI (see the official statement and the New York Times coverage). This is just absolutely wonderful news, and I really feel like we are watching history in the making. There are very, very few places in the world solely dedicated to basic research and with that kind of money. Examples are useful to get some perspective: the Perimeter Institute for Theoretical Physics was funded with $100 million (I believe it has a major impact in the field), the Institute for Advanced Study was funded with a gift of similar size (a simple statistic gives an idea of the impact: 41 out of 57 Fields medalists have been affiliated with IAS), and more recently, and perhaps closer to us, the Simons Institute for the Theory of Computing was created with $60 million and its influence on the field keeps growing (it was certainly a very influential place in my own career). Looking at what those places are doing with 1/10 of OpenAI’s budget sets the bar extremely high for OpenAI, and I am very excited to see what direction they take and what their long-term plans are!
Now let’s move on to what worries me a little: the 10 founding members of OpenAI are all working on deep learning. Before explaining further why this is worrisome, let me emphasize that I strongly believe that disentangling the mysteries behind the impressive practical successes of deep nets is a key challenge for the future of AI (in fact I am spending a good amount of time thinking about this issue, just like many other groups in theoretical machine learning these days). I also believe that pushing the engineering aspect of deep nets will lead to wonderful technological breakthroughs, which is why it makes sense for companies such as Facebook, Google, Baidu, Microsoft, and Amazon to invest heavily in this endeavor. However, it seems insane to think that the current understanding of deep nets will be sufficient to achieve even very weak forms of AI. AI is still far from being an engineering problem, and there are some fundamental theoretical questions that have to be resolved before we can brute force our way through this problem. In fact the mission statement of OpenAI mentions one such fundamental question about which we know very little: currently we build systems that solve one task (e.g., image segmentation), but how do we combine these systems so that they take advantage of each other and help improve the learning of future tasks? While one can cook up heuristics to attack this problem (such as using the learned weights for one task as the initialization for another one), it seems clear to me that we are lacking the mathematical framework and tools to think properly about this question. I don’t think that deep learners are the best positioned to make conceptual progress on this question (and similar ones), though I definitely admit that they are probably the best positioned right now to make some practical progress.
Again this is why all big companies are investing in this, but for an institution that wants to look into the more distant future it seems critical to diversify the portfolio (in fact this is exactly what Microsoft Research does) and not just follow companies who often have much shorter term objectives. I really hope that this is part of their plans.
I wish the best of luck to OpenAI and their members. The game-changing potential of this organization puts a lot of responsibility on them and I sincerely hope that they will try to seriously explore different paths to AI rather than to chase local-in-time advertisement (please don’t just solve Go with deep nets!!!).
Now time for some of the cool stuff that happened at NIPS.
Scaling up sampling
Variational inference is a very successful paradigm in Bayesian learning where, instead of trying to compute the posterior distribution exactly, one searches through a parametric family for the closest (in relative entropy) distribution to the true posterior. The key observation is that one can perform stochastic gradient descent for this problem without having to compute the normalization constant in the posterior distribution (which is often an intractable problem). The only catch is that one needs to be able to sample from an element (conditioned on the observed data) of the parametric family under consideration, and this might itself be a difficult problem in large-scale applications. A basic MCMC method for this type of problem is the Langevin Monte Carlo (LMC) algorithm, for which a very nice theoretical analysis was recently provided by Dalalyan in the case of a convex negative log-likelihood. The issue for large-scale applications is that each step of LMC requires going through the entire data set. This is where SGLD (Stochastic Gradient Langevin Dynamics) comes in, a very nice idea of Welling and Teh, where the gradient step on the convex negative log-likelihood is replaced by a stochastic gradient step. The issue is that this introduces a bias in the stationary distribution, and fixing this bias can be done in several ways, such as adding an accept-reject step or appropriately modifying the covariance matrix of the noise in the Langevin dynamics. The jury is still out on what is the most appropriate fix, and three papers made contributions to this question at NIPS: “On the Convergence of Stochastic Gradient MCMC Algorithms with High-Order Integrators”, “Covariance-Controlled Adaptive Langevin Thermostat for Large-Scale Bayesian Sampling”, and “A Complete Recipe for Stochastic Gradient MCMC”.
Another key related question on which progress was made is deciding when to stop the chain; see “Measuring Sample Quality with Stein’s Method” (my favorite paper at this NIPS) and “Mixing Time Estimation in Reversible Markov Chains from a Single Sample Path”. My own paper “Finite-Time Analysis of Projected Langevin Monte Carlo” was also in that space: it adds nothing to the large-scale picture but it shows how Langevin dynamics can cope with compactly supported distributions. Finally, another related paper that I found interesting is “Sampling from Probabilistic Submodular Models”.
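To make the Langevin update discussed in this section concrete, here is a minimal SGLD sketch on a toy problem. Everything in it is an assumption chosen for illustration (a one-dimensional Gaussian mean with a Gaussian prior, the function names, the step size and batch size); it is not taken from any of the papers above, and it includes none of the bias corrections they propose.

```python
import numpy as np


def exact_posterior(data, prior_var=100.0):
    """Closed-form posterior N(mean, var) of theta when x_i ~ N(theta, 1)
    and theta ~ N(0, prior_var). Used only as a reference for the sampler."""
    precision = len(data) + 1.0 / prior_var
    return data.sum() / precision, 1.0 / precision


def sgld_gaussian_mean(data, n_steps=20000, batch_size=32, step=1e-5,
                       prior_var=100.0, seed=0):
    """Toy SGLD chain targeting the posterior of theta (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    theta = 0.0
    samples = np.empty(n_steps)
    for k in range(n_steps):
        # Mini-batch drawn with replacement from the data set.
        batch = data[rng.integers(0, n, size=batch_size)]
        # Unbiased estimate of the gradient of the negative log-posterior:
        # the prior term plus the rescaled mini-batch likelihood term.
        grad = theta / prior_var + (n / batch_size) * np.sum(theta - batch)
        # Langevin step: gradient step plus Gaussian noise of variance 2 * step.
        theta = theta - step * grad + np.sqrt(2.0 * step) * rng.standard_normal()
        samples[k] = theta
    return samples


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = rng.normal(1.0, 1.0, size=1000)
    chain = sgld_gaussian_mean(data)[2000:]  # drop a burn-in period
    print("SGLD mean/var :", chain.mean(), chain.var())
    print("exact mean/var:", exact_posterior(data))
```

With a fixed step size the mini-batch noise inflates the variance of the chain somewhat compared to the exact posterior, which is the kind of bias mentioned above; shrinking the step or enlarging the batch reduces it at the cost of slower mixing or more computation per step.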
When I’m a grown-up I want to do non-convex optimization!
With deep nets in mind, all the rage is about non-convex optimization. One direction in that space is to develop more efficient algorithms for specific problems where we already know polynomial-time methods under reasonable assumptions, such as low-rank estimation (see “A Convergent Gradient Descent Algorithm for Rank Minimization and Semidefinite Programming from Random Linear Measurements”) and phase retrieval (see “Solving Random Quadratic Systems of Equations Is Nearly as Easy as Solving Linear Systems”). The nice thing about those new results is that they essentially show that gradient descent with a spectral initialization will work (similar evidence was previously shown for alternating minimization; see also “A Nonconvex Optimization Framework for Low Rank Matrix Estimation”). Another direction in non-convex optimization is to slowly extend the class of functions that one can optimize efficiently; see “Beyond Convexity: Stochastic Quasi-Convex Optimization”. Finally, a thought-provoking paper worth mentioning is “Matrix Manifold Optimization for Gaussian Mixtures” (it comes without provable guarantees but maybe something can be done there…).
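As a rough illustration of "gradient descent with a spectral initialization", here is a sketch for real-valued, noise-free phase retrieval with Gaussian measurements. It is a plain gradient method on the quartic least-squares loss, not the truncated procedure of the phase retrieval paper cited above; the function names, the step size, and the number of iterations are assumptions made for this example.

```python
import numpy as np


def spectral_init(A, y):
    """Top eigenvector of (1/m) * sum_i y_i a_i a_i^T, rescaled to the
    estimated signal norm sqrt(mean(y))."""
    m = len(y)
    Y = (A.T * y) @ A / m
    _, eigvecs = np.linalg.eigh(Y)      # eigenvalues in ascending order
    return np.sqrt(y.mean()) * eigvecs[:, -1]


def phase_retrieval_gd(A, y, z0, step=0.1, n_iters=500):
    """Gradient descent on f(z) = (1/4m) * sum_i ((a_i^T z)^2 - y_i)^2."""
    m = len(y)
    z = z0.copy()
    mu = step / np.dot(z0, z0)          # step size scaled by ||z0||^2
    for _ in range(n_iters):
        Az = A @ z
        grad = A.T @ ((Az ** 2 - y) * Az) / m
        z = z - mu * grad
    return z


def relative_error(z, x):
    """Error up to the global sign, which the measurements cannot identify."""
    return min(np.linalg.norm(z - x), np.linalg.norm(z + x)) / np.linalg.norm(x)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m = 20, 800
    x = rng.standard_normal(n)
    A = rng.standard_normal((m, n))
    y = (A @ x) ** 2                    # quadratic measurements of x
    z0 = spectral_init(A, y)
    z = phase_retrieval_gd(A, y, z0)
    print("error after init:", relative_error(z0, x))
    print("error after GD  :", relative_error(z, x))
```

The point of the initialization is that the loss is non-convex, so the gradient iterations are only trusted once they start close enough to the signal; the spectral step is a cheap way to get there.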
Convex optimization strikes back
As I said, non-convex optimization is all the rage, yet there are still many things about convex optimization that we don’t understand (an interesting example is given in the paper “Information-theoretic lower bounds for convex optimization with erroneous oracles”). I blogged recently about a new understanding of Nesterov’s acceleration, but that post said nothing about Nesterov’s accelerated gradient descent (AGD) itself. The paper “Accelerated Mirror Descent in Continuous and Discrete Time” builds on (and refines) recent advances in understanding the relation between AGD and Mirror Descent, as well as the differential equations underlying them. Talking about Mirror Descent, I was happy to see it applied to deep net optimization in “End-to-end Learning of LDA by Mirror-Descent Back Propagation over a Deep Architecture”. Another interesting trend is the revival of second-order methods (e.g., Newton’s method) by using various low-rank approximations to the Hessian; see “Convergence rates of sub-sampled Newton methods”, “Newton-Stein Method: A Second Order Method for GLMs via Stein’s Lemma”, and “Natural Neural Networks”.
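For readers who have not seen accelerated gradient descent written out, here is a small sketch comparing it with plain gradient descent on an ill-conditioned convex quadratic. The test problem, the function names, and the standard (k-1)/(k+2) momentum schedule are choices made for this illustration; none of this is specific to the papers above.

```python
import numpy as np


def make_quadratic(eigenvalues):
    """f(x) = 0.5 * sum_i q_i * (x_i - 1)^2, minimized at the all-ones vector.
    Returns f, its gradient, and the smoothness constant L = max_i q_i."""
    q = np.asarray(eigenvalues, dtype=float)
    x_star = np.ones_like(q)

    def f(x):
        return 0.5 * np.sum(q * (x - x_star) ** 2)

    def grad(x):
        return q * (x - x_star)

    return f, grad, q.max()


def plain_gradient_descent(grad, x0, L, n_iters):
    """Gradient descent with the fixed step size 1/L."""
    x = x0.copy()
    for _ in range(n_iters):
        x = x - grad(x) / L
    return x


def nesterov_agd(grad, x0, L, n_iters):
    """Accelerated gradient descent for smooth convex functions."""
    x_prev = x0.copy()
    x = x0.copy()
    for k in range(1, n_iters + 1):
        # Extrapolate along the previous displacement (momentum), then
        # take a gradient step from the extrapolated point.
        y = x + (k - 1) / (k + 2) * (x - x_prev)
        x_prev = x
        x = y - grad(y) / L
    return x


if __name__ == "__main__":
    f, grad, L = make_quadratic(np.logspace(-4, 0, 50))
    x0 = np.zeros(50)
    print("GD  gap:", f(plain_gradient_descent(grad, x0, L, 200)))
    print("AGD gap:", f(nesterov_agd(grad, x0, L, 200)))
```

The two methods use exactly the same gradient oracle and step size; the only difference is the extrapolation step, which is what improves the worst-case rate from order 1/k to order 1/k² on smooth convex functions.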
Other topics
There are a few other topics that caught my attention but I am running out of stamina. These include many papers on the analysis of cascades in networks (I am particularly curious about the COEVOLVE model), papers that further our understanding of random features, adaptive data analysis (see this), and a very healthy list of bandit papers (or Bayesian optimization as some like to call it).
By Hossein Mobahi January 1, 2016 - 3:34 am
To add to your nonconvex optimization section, this year at NIPS there was also the first workshop on “nonconvex optimization” for machine learning. It seems the organizers are interested in continuing that in the future.
By Bram van Ginneken December 14, 2015 - 11:41 am
Thanks for this post. I’m also very curious how OpenAI will develop.
I was at NIPS for the first time, giving a talk at a workshop, and I was also amazed by how busy everything related to deep learning was. I do not agree with your opening statement that the fact that only 10% of the papers were about deep learning shows that the growth is not all because of that. If you have a big music festival with tons of obscure bands and one big name, most of the audience is there for the big name. To me it was clear that the whole NIPS format of a single-track conference is completely inappropriate if you have about 4000 attendees. The main hall had about 2400 chairs, I estimated, and the overflow room was also pretty busy. Both halls were actually full of people, most of whom were working on their laptops during talks and not following the talks. I think at conferences with over 500 attendees single track already does not work anymore.
A music festival was what it reminded me of, as I have been to many of these, I know how to wiggle myself to the front near the stage and I attended several good talks lying on the floor close to the podium. Maybe we can do stage-diving and crowd-surfing next year when the deep learning rock-star demi-gods perform ;).
By John Schulman December 13, 2015 - 1:26 pm
I’m one of the members of OpenAI and a long-time reader of your blog. Our goal is not to be a representative cross-section of what we think is high-quality ML research. Rather, we’ll focus on a small set of important and impactful topics. And the research will mostly be driven by empirical results rather than theory, since in the realm of neural networks, the phenomenology is way ahead of the theory right now. On the theory side, I just hope I can convince my theory-inclined friends to think about some of the problems I’m interested in 🙂
By Sebastien Bubeck December 13, 2015 - 2:29 pm
Hi John,
of course it makes a lot of sense to choose a few directions and focus on them! The point of an enterprise like OpenAI is certainly not to give a “fair” representation of ML topics, NIPS is there for that ;).
However I disagree with your point of view on theory vs practice. Let me take an analogy (which is a bit naive but still…). Imagine that we are back in 1905, and inspired by the success of the Wright brothers someone decides to invest into perfecting this technique with the objective of getting to the moon. This is not completely crazy, and in fact with only the knowledge of 1905 (and a few more hundred years of pure engineering) this could have had some chance of success. However what was really missing was quantum mechanics, which led to efficient transistors, which in turn gave the computers and communication systems that we needed to implement the Kalman filter.
While I believe that regarding “intelligence” we are missing something as fundamental as quantum mechanics, I also agree that the current state of affairs is very different from my analogy, as the theory behind the Wright brothers’ feat had already been well understood for many decades when it happened…
By Joan December 13, 2015 - 7:46 am
There were a lot of people at NIPS. A simple solution would be to limit the number of attendees to 3000 and live stream everything for the ones who can’t attend. It’s a pretty common format in lots of successful conferences.
By NG December 13, 2015 - 6:49 am
You write “AI is still far from being an engineering problem, and there are some fundamental theoretical questions that have to be resolved before we can brute force our way through this problem.”
I am very curious to know what these questions are.
By David Relyea December 13, 2015 - 6:20 am
Nobody has addressed the elephant in the room: NIPS had almost 4000 attendees this year, and it’s set to almost double again next year. There are 100 posters per day. In that room, it was a rugby scrum to see any of the deep learning posters. This process just doesn’t scale.
If everyone at the conference suddenly had the mindset of an early-20-something (including the organizers), all the poster sessions would just be videotaped in advance. Online video does scale to any number of watchers, and questions can then be asked in person during the poster sessions (you could even have an additional online session).
I honestly don’t understand the “a paper is the best way to communicate our results, and thus it should be the only way” mentality in modern-day mathematics. A 3-5 minute overview would save the average reader so much time in the long run, allowing for more efficient knowledge dissemination.
Here’s looking forward to next year’s scrums. Bring a helmet! 😉