I just came back from NIPS 2015 which was a clear success in terms of numbers (note that this growth is not all because of deep learning, only about 10% of the papers were on this topic, which is about double of those on convex optimization for example):
In this post I want to talk about some of the new emerging directions that the NIPS community is taking. Of course my view is completely biased as I am more representative of COLT than NIPS (though obviously the two communities have a large overlap). Also I only looked in details at about 25% of the papers so perhaps I missed the most juicy breakthrough. In any case below you will find a short summary of each of these new directions with pointers to some of the relevant papers. Before going into the fun math I wanted to first share some thoughts about the big announcement of yesterday.
Thoughts about OpenAI
Obvious disclaimer: the opinions expressed here represent my own and not those of my employer (or previous employer hosting this blog). Now, for those of you who missed it, yesterday Elon Musk and friends made a huge announcement: they are giving $1 billion to create a non-profit organization whose goal is the advancement of AI (see here for the official statement, and here for the New York Times covering). This is just absolutely wonderful news, and I really feel like we are watching history in the making. There are very very few places in the world solely dedicated to basic research and with that kind of money. Examples are useful to get some perspective: the Perimeter Institute for Theoretical Physics was funded with $100 million (I believe it has a major impact in the field), the Institute for Advanced Studies was funded with a similar size gift (a simple statistic give an idea of the impact: 41 out of 57 Fields medalists have been affiliated with IAS), more recently and perhaps closer to us the Simons Institute for the Theory of Computing was created with $60 million and its influence on the field keep growing (it was certainly a very influential place in my own career). Looking at what those places are doing with 1/10 of OpenAI’s budget sets the bar extremely high for OpenAI, and I am very excited to see what direction they take and what their long term plans are!
Now let’s move on to what worries me a little: the 10 founding members of OpenAI are all working on deep learning. Before explaining further why this is worrisome let me emphasize that I strongly believe that disentangling the mysteries behind the impressive practical successes of deep nets is a key challenge for the future of AI (in fact I am spending a good amount of time thinking about this issue, just like many other groups in theoretical machine learning these days). I also believe that pushing the engineering aspect of deep nets will lead to wonderful technological breakthroughs, which is why it makes sense for companies such as Facebook, Google, Baidu, Microsoft, Amazon to invest heavily in this endeavor. However it seems insane to think that the current understanding of deep nets will be sufficient to achieve even very weak forms of AI. AI is still far from being an engineering problem, and there are some fundamental theoretical questions that have to be resolved before we can brute force our way through this problem. In fact the mission statement of OpenAI mention one such fundamental question about which we know very little: currently we build systems that solve one task (e.g., image segmentation) but how do we combine these systems so that they take advantage of each other and help improving the learning of future tasks? While one can cook up heuristics to attack this problem (such as using the learned weights for one task as the initialization for another one) it seems clear to me that we are lacking the mathematical framework and tools to think properly about this question. I don’t think that deep learners are the best positioned to make conceptual progress on this question (and similar ones), though I definitely admit that they are probably the best positioned right now to make some practical progress. Again this is why all big companies are investing in this, but for an institution that wants to look into the more distant future it seems critical to diversify the portfolio (in fact this is exactly what Microsoft Research does) and not just follow companies who often have much shorter term objectives. I really hope that this is part of their plans.
I wish the best of luck to OpenAI and their members. The game-changing potential of this organization puts a lot of responsibility on them and I sincerely hope that they will try to seriously explore different paths to AI rather than to chase local-in-time advertisement (please don’t just solve Go with deep nets!!!).
Now time for some of the cool stuff that happened at NIPS.
Scaling up sampling
Variational inference is a very successful paradigm in Bayesian learning where instead of trying to compute exactly the posterior distribution one searches through a parametric family for the closest (in relative entropy) distribution to the true posterior. The key observation is that one can perform stochastic gradient descent for this problem without having to compute the normalization constant in the posterior distribution (which is often an intractable problem). The only catch is that one needs to be able to sample from an element (conditioned on the observed data) of the parametric family under consideration, and this might itself be a difficult problem in large-scale applications. A basic MCMC method for this type of problems is the Langevin Monte Carlo (LMC) algorithm for which a very nice theoretical analysis was recently provided by Dalalyan in the case of convex negative log-likelihood. The issue for large-scale applications is that each step of LMC requires going through the entire data set. This is where SGLD (Stochastic Gradient Langevin Dynamics) comes in, a very nice idea of Welling and Whye Teh, where the gradient step on the convex negative log-likelihood is replaced by a stochastic gradient step. The issue is that this introduces a bias in the stationary distribution, and fixing this bias can be done in several ways such as adding an accept-reject step, or modifying appropriately the covariance matrix of the noise in the Langevin Dynamics. The jury is still out on what is the most appropriate fix, and three papers made contributions to this question at NIPS: “On the Convergence of Stochastic Gradient MCMC Algorithms with High-Order Integrators“, “Covariance-Controlled Adaptive Langevin Thermostat for Large-Scale Bayesian Sampling“, and “A Complete Recipe for Stochastic Gradient MCMC“. Another key related question on which progress was made is to decide when to stop the chain, see “Measuring Sample Quality with Stein’s Method” (my favorite paper at this NIPS) and “Mixing Time Estimation in Reversible Markov Chains from a Single Sample Path“. My own paper “Finite-Time Analysis of Projected Langevin Monte Carlo” was also in that space: it adds nothing to the large scale picture but it shows how Langevin dynamics can cope with compactly supported distributions. Finally another related paper that I found interesting is “Sampling from Probabilistic Submodular Models“.
When I’m a grown-up I want to do non-convex optimization!
With deep nets in mind all the rage is about non-convex optimization. One direction in that space is to develop more efficient algorithms for specific problems where we already know polynomial-time methods under reasonable assumptions, such as low rank estimation (see “A Convergent Gradient Descent Algorithm for Rank Minimization and Semidefinite Programming from Random Linear Measurements“) and phase retrieval (see “Solving Random Quadratic Systems of Equations Is Nearly as Easy as Solving Linear Systems“). The nice thing about those new results is that they essentially show that gradient descent with a spectral initialization will work (previous evidence was already shown for alternating minimization, see also “A Nonconvex Optimization Framework for Low Rank Matrix Estimation“). Another direction in non-convex optimization is to slowly extend the class of functions that one can solve efficiently, see “Beyond Convexity: Stochastic Quasi-Convex Optimization“. Finally a thought-provoking paper which is worth mentioning is “Matrix Manifold Optimization for Gaussian Mixtures” (it comes without provable guarantees but maybe something can be done there…).
Convex optimization strikes back
As I said non-convex optimization is all the rage, yet there are still many things about convex optimization that we don’t understand (an interesting example is given in this paper “Information-theoretic lower bounds for convex optimization with erroneous oracles“). I blogged recently about a new understanding of Nesterov’s acceleration, but this said nothing about the Nesterov’s accelerated gradient descent. The paper “Accelerated Mirror Descent in Continuous and Discrete Time” builds on (and refines) recent advances on understanding the relation of AGD and Mirror Descent, as well as the differential equations underlying them. Talking about Mirror Descent, I was happy to see it applied to deep nets optimization in “End-to-end Learning of LDA by Mirror-Descent Back Propagation over a Deep Architecture“. Another interesting trend is the revival of second-order methods (e.g., Newton’s method) by using various low-rank approximations to the Hessian, see “Convergence rates of sub-sampled Newton methods“, “Newton-Stein Method: A Second Order Method for GLMs via Stein’s Lemma“, and “Natural Neural Networks“.
There are a few other topics that caught my attention but I am running out of stamina. These include many papers on the analysis of cascades in networks (I am particularly curious about the COEVOLVE model), papers that further our understanding of random features, adaptive data analysis (see this), and a very healthy list of bandit papers (or Bayesian optimization as some like to call it).