Unless you live a secluded life without internet (in which case you’re not reading those lines), odds are that you have heard and read about deep learning (such as in this 2012 article in the New York Times, or the Wired’s article about the new Facebook AI lab).

So what is deep learning? Well, from my current limited understanding, it seems to be a pretty simple supervised learning technique which can be explained in a few lines. Of course, simple problems can be very deep (ever heard of the Collatz conjecture?), and it looks like this may be the case with deep learning. After explaining what is deep learning (or more accurately what is my understanding of deep learning), I will point out some recent theoretical progress towards understanding the basic phenomenona at play. Let me be clear right away: it seems to me that no one has a real clue about what’s going on.

**Deep learning: it’s all about the features**

Ok, so, you have a data set with and . Typically is an image, and is say if there is a pinguin in this image. A very efficient algorithm to learn a mapping from to is to use some kind of linear classification, such as multi-class logistic regression (see also this post for some background on linear classification). However applying a linear classification directly on typically does not work very well, and this why we usually first map the point to , where is some feature map. Designing has been part of the art of machine learning, and the idea behind deep learning is to incorporate the problem of finding a good map as part of the learning process (again in the previous post on linear classification I also discussed this possibility in the context of SVM, though we ended up with a semi-definite program which does not scale well with large data sets).

Ok so what kind of do we look at in deep learning? Well they are of the following form, for some sequence of matrices , with :

The (non-linear) mappings are fixed and part of the design choice (i.e., they are not learned). The parameters are also specified beforehand. Furthermore one usually impose extra constraints on the matrices , such as having all columns bounded in -norm (say by ).

The game now is simply to incorporate the matrices as free variables when one performs the optimization of the multiclass logistic loss. This optimization is always performed with Stochastic Gradient Descent (or a simple variant of it, such as mini-batch SGD, see this post and Section 6.2 in my survey): take an example in your data set ; compute the gradient of the logistic loss at this example (with respect to the matrices and the logistic regression weights), and do a small step in the opposite direction of the gradient.

**Design choices**

Part of the success of deep learning is in the choices left open above, such as the choice of the non-linear maps, and the set of constraints on the matrices. Let me partly describe some of these choices that Krizhevsky, Sutskever and Hinton used for their computer vision breakthrough where they basically halfed the error rate of the competition on the ImageNet dataset (the details can be found here).

First they take . Then and are the so-called ReLU function (Rectified Linear Unit), which simply take the positive part coordinate-wise. On the other hand are composition of ReLU with *max-pooling*. The max-pooling is a dimension reduction operation, which simply takes adjacent coordinates (in the case of an image, adjacent pixels) and returns the maximal element. Regarding the constraints on the matrices, and are left unconstrained (besides perhaps the energy control on the columns), while are highly structure *convolutional networks*, or ConvNets as they are usually called. Essentially this means that in these matrices, the columns up to are translations of some very sparse vector. Precisely, for , they take and each sparse vector corresponds to a image. In other words the vector corresponds to a collection of images, each one of them being the convolution of with a small patch (whose weights are being learned).

While this post is not about the history of deep learning (this has been discussed at length pretty much everywhere on the internet), I still would like to point out that the apparently arbitrary choice behind max-pooling and the ConvNets are due to Yann LeCun, who has experimented with these for many many years.

**Some mysterious steps**

So far I would say that the method described above kind of make sense (apart from the max-pooling operation which I really do not understand). But things are now going to turn out a bit more wacky than you might have expected. First note that there is one key element missing from my description of the learning algorithm: what does “small” mean in the sentence “take a small step in the opposite direction of the gradient”. If you know about convex optimization and machine learning, then you know that a lot of work usually goes into understanding precisely what kind of learning rate one should take, and in particular at which rate one should decrease it over time. In deep learning all of this theory goes out of the window since the objective function is not convex (note that there is *some *convexity in the sense that the objective is convex in – thanks to Nicolas Le Roux for fixing a mistake in the first version of the previous sentence). Here it seems that the current state of the art is to start with some guess, say , then look at how the training error decays as more examples are being processed, and if the training error plateau then just divide the learning rate by .

Another point which seems key to the success of such an architecture is to control overfitting. Again for a convex problem we know that one-pass through the data with SGD and the appropriate learning rate would work just fine in terms of generalization error. In deep learning the algorithm makes multiple passes on the data, and furthermore the problem is non-convex. This strongly suggests that overfitting will be a real problem, especially given the number of free parameters. Thankfully Hinton and his colleagues invented a quite smart technique, called “dropout”. The idea is simply to take the gradient with respect to only half of the columns of the matrices , where the half is chosen at random at each iteration. In the network view of the architecture this amounts to “dropping” each unit with probability . Intuitively this forces the network to be more robust, and to “repeat” itself, thus avoiding overfitting.

**Hacking through life**

Of course, if you want to beat everyone else in the world at something, you better have every parts of your system completely optimized. In that respect the deep learning community took the implementation of the above ideas to the next level by using GPUs, thus making the computation of the gradients much faster (by the way, the ReLU non-linearity also helps here, compared to more traditional non-linearities such as sigmoids). Various other tricks are being used, such as artificially expanding the data set by altering the (e.g., by translating the images) or by running the network backward.

**Towards an understanding of deep learning?**

I think it’s fair to say that we do not understand why deep nets work at all, and we understand even less why they are doing so much better than anything else. The three main questions are: (i) why is SGD meaningful for this non-convex problem, (ii) why is dropout enough to avoid overfitting, (iii) can we explain why ConvNets are a sensible way to learn representations?

Very limited progress has been made on these questions, though I will cite some papers that at least try to attack these problems. I try to summarize each work with one sentence, see the papers for more details (and in particular the details will reveal that these papers do not make direct progress on the above questions).

– Sanjeev Arora and his co-authors proved that a “random and sparse” deep net can be learned in polynomial time.

– Roi Livni, Shai Shalev-Shwartz and Ohad Shamir obtain some results on the approximation power of a neural network with a non-linearity of the form .

– Gerard Ben Arous, Yann LeCun and their students draw an interesting connection between fully connected neural networks (such as the last 3 layers in the above architecture) and spherical spin glasses.

– Stephane Mallat gives some insights on (iii) via the scattering transform. Arnab Paul and Suresh Venkat borrow ideas from group theory to shed some light on (iii).

– Francis Bach proves some generalization bounds for neural networks with one hidden layer.

I apologize for misrepresenting deep learning so badly in this post. I know I missed a lot, and you’re welcome to correct me in the comments!

**Edit. Some further references added thanks to various comments:**

– Approaches for (ii): David McAllester with PAC-Bayesian bounds, and Percy Liang and students with a link to AdaGrad.

– Approach to (iii) by a group at Facebook on the relevance of ConvNets for certain natural distributions.

– I said in the post that all optimization theory goes out of the window when the problem is non-convex. This is of course not true, and for instance a 1998 paper by Leon Bottou has results for SGD in more general scenario. However in the non-convex case we only have very weak guarantees which basically say nothing about the global optimum.

## By Geometry of linearized neural networks – KAYENRE Technology June 21, 2017 - 12:05 am

[…] about the recent progress he has made with co-authors to understand various linearized versions of neural networks. I will describe here two such results, one for Residual Neural Networks and one for Recurrent […]

## By Rong Ou December 15, 2015 - 12:36 pm

Have you seen this paper out of Rice (http://arxiv.org/abs/1504.00641)? They seem to have a theory behind the effectiveness of max pooling.

## By Ram December 13, 2015 - 9:33 am

Thanks for the wonderful blog posts. Regarding deep learning, the most puzzling aspect is why other methods which can have arbitrarily large VC-dimension are not doing as well. Could it be because deep nets give you a better chance at "overfitting" (complexity is exponentially larger for the same parameter set?). Perhaps an approach at understanding deep learning is by comparing this method to an alternative method on the same type of problem in a theoretical way.

## By DEEP LEARNING REVOLUTION, summer 2015, state of the art & topnotch links | Turing Machine September 1, 2015 - 5:29 pm

[…] 5. Deep stuff about deep learning? | I’m a bandit […]

## By Mark Montgomery March 28, 2015 - 8:31 am

Just an observation that in a similar way to modern economics deep learning is a live and continuous experiment, which is one reason stated by researchers for dual roles in academia and dominant corporations--the data, resources, and infrastructure. As someone who started a voyage in a modest live lab in the 90s, I fully appreciate the benefits of what is a very unique situation for rapid innovation. However, I'd suggest we need to take a hard look at the ethical relationships particularly given monopoly power, network effect, and extension to other industries relative to exploitation of human genius, and the impact of allowing it. This is particularly obvious when students of the discipline admit that they don't entirely understand why it works--even in the case of DL it's becoming more clear--it does serve as a red flag. The impact on society and economics is vastly more complex than the algorithmic work with an uncomfortably high probability that assumptions based on human bias and institutional conflicts will be considerably less accurate than the algorithms themselves. Not suggestion that the work should not continue, though am very uncomfortable with the relationships--it should give thinking beings pause.

## By Christian Osendorfer March 22, 2015 - 2:28 pm

While not directly answering either (i), (ii) or (iii), the work by Miguel Carreira-Perpinan on MAC (method of auxilliary coordinates, http://arxiv.org/pdf/1212.5921v1.pdf) for learning deeply nested systems should be mentioned here, too, in particular for its refreshing and quite rigorously motivated approach to deep learning.

## By İnanç Gümüş (@inancgumus) March 22, 2015 - 3:41 am

Thank you for the article but it's very hard to grasp what you are talking about without strong math background.

I always wondered why the authors don't/can't explain the concepts to the dummy audience who just know basic math and good coding.

If the authors could make it simpler deep learning and ml will have an impact in the size of open source movement. The current tutorials and the articles only for the academia.

## By Christos March 21, 2015 - 10:51 am

Thanks for the summary. I was always wondering if I was missing something....

By the way, is everyone using (steepest) SGD? Can't we get away with just gradient-related directions?

## By Mohammad Bavarian March 20, 2015 - 11:36 pm

I have some basic questions regarding "the dropout scheme" and (ii):

In the post, is the choice of 1/2 as the dropout rate an arbitrary constant -- or was there more to this? Is dropout probability fixed or is it optimized and changed through the course of the algorithm? If the latter is the case, is there a procedure for the algorithm to realize that it is probably overfitting enabling it to remedy this by increasing the dropout probability?

## By Mohammad Bavarian March 20, 2015 - 11:43 pm

Regarding this sentence "...and we understand even less why they are doing so much better than anything else...":

Can you mention some of these alternative methods that Deep Learning is outperforming? For example, do you know what are the close competitors to Deep Learning in its main application domains (speech recognition and vision I suppose)?

It would be interesting to know if the alternatives are theoretically better understood or not.

## By Fenchel March 20, 2015 - 4:59 pm

Hey Sebastien, here are some more:

* Bartlett's VC dimension bounds for neural nets. These proofs and the intuition within them are _unbelievable_. Maybe I can gain your interest by saying: where else have we seen Sard's theorem used in machine learning?

* Representation power results. Cybenko's proof is very nice and concise (Hahn-Banach + Riesz representation). And the Kolmogorov superposition theorem and its proof are out of this world.

There are many more. The math within neural nets is very nice.

## By Josh March 20, 2015 - 4:57 pm

If you meant 'no clue' as to why it globally works, I don't think it's true: suffices to know a Deep NN is efficiently learning features unsupervised.

I gather you want to get an understanding of some of the details of it, and if so, your approach is sound.

On the other hand, within the larger AI goals, the fact that features are now learned unsupervised is precisely the benefit we needed to derive from DL. In other words, that we do not need to know the details to be able to reap handsome rewards in their applications is the whole point: it avoids forcing us to have total comprehension of the domain model and its detailed algorithmic implementation.

Short version of my POV: we don't need to have a clue to the inner details to have a clue about how it works globally and how it can be applied.

That's the whole (and massive) advantage compared to traditional AI, and even supervised Machine Learning.

## By Sebastien Bubeck March 21, 2015 - 12:08 pm

Of course "at a high level" we understand why deep nets are successful, and also we don't need a theoretical understanding to apply them in practice. However it's a little bit like saying that you don't need to be understand chemistry to follow a recipe and cook an awesome meal; sure, I agree, but still understanding chemistry was pretty useful to mankind...

## By Ian Osband March 20, 2015 - 3:43 pm

Great article Sebastien!

I'm glad to see someone attacking the essential parts of deep learning in a simple way without getting caught up in the fancy hacks that fill up a lot of papers.

On the subject of understanding dropout regularization I thought I'd point to a nice paper that shows it's something like L2 regularization re-weighted according to the data's Fisher information:

"Dropout Training as Adaptive Regularization" http://papers.nips.cc/paper/4882-dropout-training-as-adaptive-regularization

## By Sebastien Bubeck March 20, 2015 - 7:01 pm

Thanks Ian! Actually I wanted to include this paper and the one of McAllester in the original post but apparently I just forgot... it has now been fixed.

## By Josh March 22, 2015 - 4:41 pm

I agree that understanding the details can be useful in some cases, but aren't the more important sections to investigate the topologies to solve certain types of problems, rather than anything else?

## By Nicolas Le Roux March 20, 2015 - 10:09 am

Hi Sébastien,

As far as I know, the loss is not convex in any of the W_k except W_L. There's actually an experiment you can do with a net with any number of layers: compare the decrease in training loss when you optimize only one layer, keeping the others fixed. You'll see that doing this on any layer but the last has almost no effect at all.

## By Sebastien Bubeck March 20, 2015 - 12:40 pm

Of course you're absolutely right Nicolas, thanks for pointing this mistake out! (I fixed it in the post.)

I'm interested in your second comment. Did people try some kind of alternate minimization, where several steps of gradient descent are performed alternatively on $W_L$ and the regression weights; and then once in a while a few steps are done on $W_1, \hdots, W_{L-1}$?

Also, despite your comment, one wouldn't obtain state of the art results with just $L=1$, right? Any intuition on how to reconcile your comment with this observation?

## By vzn March 20, 2015 - 10:04 am

yep great area, its great to see it taken seriously by theoretical scientists, many who somewhat disdain applied stuff although they would never admit it openly. (for example there is very little on deep learning on the stackexchange theoretical computer science site and it seems its not merely due to chance.) more here. and how about using deep learning and/ or big data techniques on hard problems like collatz? it seems ripe for application/ analysis. see eg vzn collatz analysis

## By Sebastien Bubeck March 20, 2015 - 1:26 pm

I disagree with your comment that theorists disdain applied stuff. It's just that theorists work at a different "time scale"; in the sense that we try to prove results which will be interesting 10/20 years from now (in fact the true goal is that a paper remains relevant for a 1000 years, but this rarely happens unfortunately). On the other hand with applied work one tries to be relevant NOW.

Of course, most of the time theorists' papers become quickly irrelevant, while on the other hand it sometimes happen that applied work uncovers deep phenomena that will require decades of study to be properly understood (note also that, historically, paradigm-shifting breakthroughs have been made from both sides).