Deep stuff about deep learning?

Unless you live a secluded life without internet (in which case you’re not reading those lines), odds are that you have heard and read about deep learning (such as in this 2012 article in the New York Times, or the Wired’s article about the new Facebook AI lab).

So what is deep learning? Well, from my current limited understanding, it seems to be a pretty simple supervised learning technique which can be explained in a few lines. Of course, simple problems can be very deep (ever heard of the Collatz conjecture?), and it looks like this may be the case with deep learning. After explaining what is deep learning (or more accurately what is my understanding of deep learning), I will point out some recent theoretical progress towards understanding the basic phenomenona at play. Let me be clear right away: it seems to me that no one has a real clue about what’s going on.


Deep learning: it’s all about the features

Ok, so, you have a data set (X_i, Y_i)_{1 \leq i \leq n} with X_i \in \mathbb{R}^d and Y_i \in \{1, \hdots, M\}. Typically X_i is an image, and Y_i is say 9 if there is a pinguin in this image. A very efficient algorithm to learn a mapping from X to Y is to use some kind of linear classification, such as multi-class logistic regression (see also this post for some background on linear classification). However applying a linear classification directly on X typically does not work very well, and this why we usually first map the point X to \Phi(X), where \Phi is some feature map. Designing \Phi has been part of the art of machine learning, and the idea behind deep learning is to incorporate the problem of finding a good map as part of the learning process (again in the previous post on linear classification I also discussed this possibility in the context of SVM, though we ended up with a semi-definite program which does not scale well with large data sets).

Ok so what kind of \Phi do we look at in deep learning? Well they are of the following form, for some sequence of L matrices W_1, \hdots, W_L, with W_k \in \mathbb{R}^{n_k \times n_{k+1}'}:

    \[\Phi(X) = X^{(L+1)}, \ \text{where} \ X^{(k+1)} = \sigma_k(W_k^{\top} X^{(k)}) \ \text{with} \ X^{(1)} = X .\]

The (non-linear) mappings \sigma_k : \mathbb{R}^{n_{k+1}'} \rightarrow \mathbb{R}^{n_{k+1}} are fixed and part of the design choice (i.e., they are not learned). The parameters L, n_1, n_1', \hdots, n_{L+1}, n_{L+1}' are also specified beforehand. Furthermore one usually impose extra constraints on the matrices W_1, \hdots, W_L, such as having all columns bounded in \ell_2-norm (say by 1).

The game now is simply to incorporate the matrices W_1, \hdots, W_L as free variables when one performs the optimization of the multiclass logistic loss. This optimization is always performed with Stochastic Gradient Descent (or a simple variant of it, such as mini-batch SGD, see this post and Section 6.2 in my survey): take an example in your data set (X_i, Y_i); compute the gradient of the logistic loss at this example (with respect to the matrices W_1, \hdots, W_L and the logistic regression weights), and do a small step in the opposite direction of the gradient.


Design choices

Part of the success of deep learning is in the choices left open above, such as the choice of the non-linear maps, and the set of constraints on the matrices. Let me partly describe some of these choices that Krizhevsky, Sutskever and Hinton used for their computer vision breakthrough where they basically halfed the error rate of the competition on the ImageNet dataset (the details can be found here).

First they take L=8. Then \sigma_6, \sigma_7 and \sigma_8 are the so-called ReLU function (Rectified Linear Unit), which simply take the positive part coordinate-wise. On the other hand \sigma_1, \hdots, \sigma_5 are composition of ReLU with max-pooling. The max-pooling is a dimension reduction operation, which simply takes 4 adjacent coordinates (in the case of an image, 4 adjacent pixels) and returns the maximal element. Regarding the constraints on the matrices, W_6, W_7 and W_8 are left unconstrained (besides perhaps the energy control on the columns), while W_1, \hdots, W_5 are highly structure convolutional networks, or ConvNets as they are usually called. Essentially this means that in these matrices, the columns k d + 1 up to (k+1)d are translations of some very sparse vector. Precisely, for W_1, they take n_2'= 75*d and each sparse vector corresponds to a 5x5 image. In other words the vector W_1 X corresponds to a collection of 75 images, each one of them being the convolution of X with a small 5x5 patch (whose weights are being learned).

While this post is not about the history of deep learning (this has been discussed at length pretty much everywhere on the internet), I still would like to point out that the apparently arbitrary choice behind max-pooling and the ConvNets are due to Yann LeCun, who has experimented with these for many many years.


Some mysterious steps

So far I would say that the method described above kind of make sense (apart from the max-pooling operation which I really do not understand). But things are now going to turn out a bit more wacky than you might have expected. First note that there is one key element missing from my description of the learning algorithm: what does “small” mean in the sentence “take a small step in the opposite direction of the gradient”. If you know about convex optimization and machine learning, then you know that a lot of work usually goes into understanding precisely what kind of learning rate one should take, and in particular at which rate one should decrease it over time. In deep learning all of this theory goes out of the window since the objective function is not convex (note that there is some convexity in the sense that the objective is convex in W_L – thanks to Nicolas Le Roux for fixing a mistake in the first version of the previous sentence). Here it seems that the current state of the art is to start with some guess, say 0.01, then look at how the training error decays as more examples are being processed, and if the training error plateau then just divide the learning rate by 10.

Another point which seems key to the success of such an architecture is to control overfitting. Again for a convex problem we know that one-pass through the data with SGD and the appropriate learning rate would work just fine in terms of generalization error. In deep learning the algorithm makes multiple passes on the data, and furthermore the problem is non-convex. This strongly suggests that overfitting will be a real problem, especially given the number of free parameters. Thankfully Hinton and his colleagues invented a quite smart technique, called “dropout”. The idea is simply to take the gradient with respect to only half of the columns of the matrices W_1, \hdots, W_L, where the half is chosen at random at each iteration. In the network view of the architecture this amounts to “dropping” each unit with probability 1/2. Intuitively this forces the network to be more robust, and to “repeat” itself, thus avoiding overfitting.


Hacking through life

Of course, if you want to beat everyone else in the world at something, you better have every parts of your system completely optimized. In that respect the deep learning community took the implementation of the above ideas to the next level by using GPUs, thus making the computation of the gradients much faster (by the way, the ReLU non-linearity also helps here, compared to more traditional non-linearities such as sigmoids). Various other tricks are being used, such as artificially expanding the data set by altering the X_i (e.g., by translating the images) or by running the network backward.


Towards an understanding of deep learning?

I think it’s fair to say that we do not understand why deep nets work at all, and we understand even less why they are doing so much better than anything else. The three main questions are: (i) why is SGD meaningful for this non-convex problem, (ii) why is dropout enough to avoid overfitting, (iii) can we explain why ConvNets are a sensible way to learn representations?

Very limited progress has been made on these questions, though I will cite some papers that at least try to attack these problems. I try to summarize each work with one sentence, see the papers for more details (and in particular the details will reveal that these papers do not make direct progress on the above questions).

– Sanjeev Arora and his co-authors proved that a “random and sparse” deep net can be learned in polynomial time.

– Roi Livni, Shai Shalev-Shwartz and Ohad Shamir obtain some results on the approximation power of a neural network with a non-linearity of the form \sigma(x) = x^2.

– Gerard Ben Arous, Yann LeCun and their students draw an interesting connection between fully connected neural networks (such as the last 3 layers in the above architecture) and spherical spin glasses.

– Stephane Mallat gives some insights on (iii) via the scattering transform. Arnab Paul and Suresh Venkat borrow ideas from group theory to shed some light on (iii).

– Francis Bach proves some generalization bounds for neural networks with one hidden layer.


I apologize for misrepresenting deep learning so badly in this post. I know I missed a lot, and you’re welcome to correct me in the comments!


Edit. Some further references added thanks to various comments:

– Approaches for (ii): David McAllester with PAC-Bayesian bounds, and Percy Liang and students with a link to AdaGrad.

– Approach to (iii) by a group at Facebook on the relevance of ConvNets for certain natural distributions.

– I said in the post that all optimization theory goes out of the window when the problem is non-convex. This is of course not true, and for instance a 1998 paper by Leon Bottou has results for SGD in more general scenario. However in the non-convex case we only have very weak guarantees which basically say nothing about the global optimum.

Posted in Optimization | 19 Comments

Some pictures in geometric probability

As I discussed in a previous blog post, I have been recently interested in models of randomly growing networks. As a starting point I focused my attention on the preferential attachment rule and its variants, in part because its ubiquity in the literature. However these models miss one key feature, namely the notion of geometry: when a newcomer arrives in the network, she can connect to any vertex, regardless of a possible spatial organization of the network and her current position in that space.

In this blog post I want to discuss some of the most famous (or infamous in some cases) spatial models of network growth. For simplicity, and also to draw pictures, everything will be described in the plane, and specifically on the lattice \mathbb{Z}^2. These models iteratively build an aggregate A_n \subset \mathbb{Z}^2 as follows: First A_0 = \{(0,0)\}. Then A_{n+1}=A_n \cup \{X_n\} where X_n is a random point in the boundary of the aggregate, that is X_n \in \partial A_n := \{x \in \mathbb{Z}^2 : \min_{y \in A_n} \|x-y\|_2 =1\}. Thus a model is defined by the probability measure one puts on the boundary of the aggregate to select the next point to add.

All the pictures are with half a million particles, and the particles are colored as a function of their age, with blue corresponding to old particles, and red corresponding to young ones.

Eden model

The simplest model uses the uniform measure on the boundary, and it is known as the Eden model. Here is a picture:


One of the most basic result about this model is that it admits a limit shape:

Theorem: There exists a (deterministic) convex set B \subset \mathbb{R}^2 such that for any \epsilon >0,

    \[\lim_{n \to +\infty} \mathbb{P} \left( (1- \epsilon) B \subset \frac{1}{\sqrt{n}} A_n \subset (1+\epsilon) B \right) = 1.\]

It is known that B is not an Euclidean ball (this should be clear from the picture), though nobody knows what B is exactly. How do you prove such a theorem? Well it turns out to be pretty easy once you have the right machinery. The first step is to realize that the Eden model exactly corresponds to first passage percolation with exponential clocks: imagine that you have i.i.d. exponential random variables t(e) on the edges of \mathbb{Z}^2, and consider A(t) \subset \mathbb{Z}^2 to be the set of points x such that there exists a path e_1, \hdots e_r from (0,0) to x with \sum_{s=1}^r t(e_s) \leq t. In other words A(t) is the set of points reached at time t by a fluid released at time 0 in (0,0) and with travel times on the edges given by the random variables t(e). It is an easy exercise to convince yourself that A_n has the same distribution as A(t) conditionally on |A(t)| = n (this is actually not true, one in fact needs to put the travel times on the vertices of \mathbb{Z}^2 rather than on the edges, but let me get away with that small mistake). At this point the continuous version of the above theorem can be rather easily proved via Kingman’s subadditive ergodic theorem, see the Saint Flour lecture notes by Kesten for more details.


Diffusion Limited Aggregation (DLA)

Here we consider the harmonic measure from infinity: pick a point of \mathbb{Z}^2 uniformly at random from a large circle that contains A_n, and then start a random walk from this point until it hits \partial A_n (which will happen eventually since a simple random walk in two dimensions is recurrent); let X_n be the latter point. Here is a global picture and a zoom (the same pictures in black and white are provided at the end of this section):

DLA   DLAzoom

Absolutely nothing is known about this model apart from the following simple result of Kesten:

    \[\limsup_{n \to +\infty} \frac{\max_{x \in A_n} \|x\|_2}{n^{2/3}} < + \infty .\]

A fascinating open problem is to show that for some \epsilon>0,

    \[\liminf_{n \to +\infty} \frac{\max_{x \in A_n} \|x\|_2}{n^{0.5+\epsilon}} = + \infty ,\]

which the above picture clearly seems to validate. Note that at the moment nobody can even prove that once rescaled by 1/\sqrt{n} the DLA will not converge to an Euclidean ball…




Ballistic DLA

Here is a model that, as far as I know, was introduced by Ronen Eldan. Instead of having one particle coming from infinity to hit the aggregate through a random walk, imagine now that the aggregate is constantly bombarded (ballistically). Precisely particles are coming from infinity in every direction (that is on every line) at a constant rate. In other words first select a direction uniformly at random, and then among the lines parallel to that direction that hit the current aggregate choose one uniformly at random. The added point is the boundary point at the intersection of this (oriented) line with the aggregate. Here is a picture where we can see some resemblance with the Eden model:


and here is a picture from further away where we can see local resemblance with DLA:


see also this zoom in black and white:


Needless to say, nothing is known about this model.


Internal Diffusion Limited Aggregation (IDLA)

Finally the IDLA model is yet another modification of DLA, where instead of starting the random walk from infinity one starts it from the origin. Perhaps surprisingly, and on the contrary to everything discussed so far, we know quite a lot about this model! Its limit shape is an actual honest Euclidean ball, see this paper by Lawler, Bramson and Griffeath. In fact we even know that the average fluctuation from the aggregate around its limit is of constant order, see this paper by Jerison, Levine and Sheffield (information about the distribution of these fluctuations is also provided!).


Writing this blog post was quite a bit of fun, and I thank Ronen Eldan, Laura Florescu, Shirshendu Ganguly, and Yuval Peres from whom I learned everything discussed here. To conclude the post here are some intriguing pictures from variants of the above model that I cooked up (unfortunately I’m not sure the models are very interesting though):



Posted in Probability theory, Random graphs | 2 Comments

Guest post by Sasho Nikolov: Beating Monte Carlo

If you work long enough in any mathematical science, at some point you will need to estimate an integral that does not have a simple closed form. Maybe your function is really complicated. Maybe it’s really high dimensional. Often you cannot even write it down: it could be a quantity associated with a complex system, that you can only “query” at certain points by running an experiment. But you still need your integral, and then you turn to the trustworthy old Monte Carlo method. (Check this article by Nicholas Metropolis for the history of the method and what it has to do with Stanislaw Ulam’s uncle’s gambling habbit.) My goal in this post is to tell you a little bit about how you can do better than Monte Carlo using discrepancy theory.


The Problem and the Monte Carlo Method

Let us fix some notation and look at the simplest possible setting. We have a function f, that maps the real interval [0,1] to the reals, and we want to know


We will estimate this integral with the average \frac{1}{n}\sum_{i =1}^n{f(y_i)}, where y:= y_1,y_2, \ldots is a sequence of numbers in [0,1]. The error of this estimate is

    \[\mathrm{err}(f, y, n) := \left|\int_0^1{f(x)dx} - \frac{1}{n}\sum_{i =1}^n{f(y_i)}\right|.\]

And here is the main problem I will talk about in this post:

How do we choose a sequence y of points in [0,1] so that the average \frac{1}{n}\sum_{i = 1}^n{f(y_i)} approximates the integral \int_0^1{f(x)dx} as closely as possible?

Intuitively, for larger n the approximation will be better, but what is the best rate we can achieve? Notice that we want a single sequence, so that if we want a more accurate approximation, we just take a few more terms and re-normalize, rather than start everything from scratch. The insight of the Monte Carlo method (surely familiar to every reader of this blog) is to take each y_i to be a fresh random sample from [0,1]. Then for any n, the expectation of \frac{1}{n}\sum_{i =1}^n{f(y_i)} is exactly \int{f(x)dx} (from now on I will skip the limits of my integrals: they will all be from 0 to 1). The standard deviation is of order

    \[\frac{1}{\sqrt{n}}\int{f(x)^2dx} = \frac{\|f\|_{L_2}}{\sqrt{n}},\]

and standard concentration inequalities tell us that, with high probability, \mathrm{err}(f, y,n) will not be much larger than the latter quantity.


Quasi-Monte Carlo and Discrepancy

For a fixed function of bounded L_2 norm, the Monte Carlo method achieves \mathrm{err}(f, y, n) roughly on the order of n^{-1/2}. In other words, if we want to approximate our integral to within \varepsilon, we need to take about \varepsilon^{-2} random samples. It’s clear that in general, even for smooth functions, we cannot hope to average over fewer than \varepsilon^{-1} points, but is \varepsilon^{-2} really the best we can do? It turns out that for reasonably nice f we can do a lot better using discrepancy. We define the star discrepancy function of a sequence y :=y_1,y_2,\ldots as

    \[\delta^*(y, n):= \max_{0 \leq t \leq 1}\left|t - \frac{|\{i: i\leq n, y_i < t\}|}{n}\right|.\]

Notice that this is really just a special case of \mathrm{err}(f, y, n) where f is the indicator function of the interval [0, t). A beautiful inequality due to Koksma shows that in a sense these are the only functions we need to care about:

    \[\mathrm{err}(f, y, n) \leq V(f)\delta^*(y,n).\]

V(f) is the total variation of f, a measure of smoothness, and for continuously differentiable functions it is equal to \int{|f'(x)|dx}. The important part is that we have bounded the integration error by the product of a term that quantifies how nice f is, and a term that quantifies how “random” the sequence y is. With this, our task is reduced to finding a sequence y with small star discrepancy, hopefully smaller than n^{-1/2}. This is the essence of the quasi-Monte Carlo method.


The van der Corput Sequence

A simple sequence with low star discrepancy was discovered by van der Corput in the beginning of the previous century. Let us write the integer i in binary as i = i_k\ldots i_0, i.e. i = i_02^0 + i_12^1 + i_2 2^2 + \ldots + i_k 2^k. Then we define r(i) to be number we get by flipping the binary digits of i around the radix point: r(i) := i_0 2^{-1} + i_1 2^{-2} + \ldots + i_k 2^{-k-1}, or, in binary, r(i) = 0.i_0i_1\ldots i_k. The van der Corput sequence is r(0), r(1), r(2), \ldots.


I have plotted the first eight terms of the van der Corput sequence on the left in the above picture: the index i is on the x-axis and the value r(i-1) on the y-axis. The terms alternate between [0, 1/2) and [1/2, 1); they also visit each of [0, 1/4), [1/4, 1/2), [1/2, 3/4), [3/4, 1) exactly once before they return, and so on. For example, each shaded rectangle on the right in the above picture contains exactly one point (the rectangles are open on the top). The key property of the van der Corput sequence then is that r(i) \in [j2^{-k}, (j+1)2^{-k}) if and only if i \equiv j \pmod{2^k}. So for any such dyadic interval, the discrepancy is at most 1/n: the number of integers i less than n such that i \equiv j \pmod{2^k} is either \lfloor n2^{-k} \rfloor or \lceil n2^{-k} \rceil. We can greedily decompose an interval [0, t) into O(\log n) dyadic intervals plus a leftover interval of length o(1/n); therefore the star discrepancy of the van der Corput sequence y is O((\log n)/n). Remember that, together with Koksma’s inequality, this means that we can estimate the integral of any function f with bounded variation with error \mathrm{err}(f, y, n) \ll (V(f)\log n)/n, which, for sufficiently smooth f, is almost quadratically better than the Monte Carlo method! And with a deterministic algorithm, to boot. This was not that hard, so maybe we can achieve the ultimate discrepancy bound O(1/n)? This is the question (asked by van der Corput) which essentially started discrepancy theory. The first proof that O(1/n) is not achievable was given by van Aardenne-Ehrenfest. Klaus Roth simplified her bound and strengthened it to \Omega(\sqrt{\log n}/n) using a brilliant proof based on Haar wavelets. Schmidt later proved that van der Corput’s O((\log n)/n) bound is assymptotically the best possible.


Higher Dimension, Other Measures, and Combinatorial Discrepancy

Quasi-Monte Carlo methods are used in real world applications, for example in quantitative finance, because of the better convergence rates they offer. But there are many complications that arise in practice. One issue is that we usually need to estimate integrals of high-dimensional functions, i.e. functions of a large number of variables. The Koksma inequality generalizes to higher dimensions (the generalization is known as the Koksma-Hlawka inequality), but we need to redefine both discrepancy and total variation for that purpose. The star discrepancy \delta^*(y, n) of a sequence y of d-dimensional points measures the worst-case absolute difference between the d-dimensional volume (Lebesgue measure) of any anchored box [0, t_1) \times \ldots \times [0, t_d) and the fraction of points in the sequence y_1, \ldots, y_n that fall in the box. The generalization of total variation is the Hardy-Krause total variation. Naturally, the best achievable discrepancy increases with dimension, while the class of functions of bounded total variation becomes more restrictive. However, we do not even know what is the best achievable star discrepancy for 2 or higher dimensional sequences! We know that no d-dimensional sequence has discrepancy better than \Omega(\log^{d/2 + \eta_d} n), where \eta_d > 0 is some constant that goes to 0 with d. The van der Corput construction generalizes to higher dimensions and gives sequences with discrepancy O(\log^d n) (the implied constants here and in the lower bounds depend on d). The discrepancy theory community refers to closing this significant gap as “The Great Open Problem”.

Everything so far was about integration with respect to the Lebesgue measure, but in practice we are often interested in a different measure space. We could absorb the measure into the function to be integrated, but this can affect the total variation badly. We could do a change of variables, but, unless we have a nice product measure, this will result in a notion of discrepancy in which the test sets are not boxes anymore. Maybe the most natural solution is to redefine star discrepancy with respect to the measure we care about. But how do we find a low-discrepancy sequence with the new definition? It turns out that combinatorial discrepancy is very helpful here. A classical problem in combinatorial discrepancy, Tusnady’s problem, asks for is the smallest function \Delta_d(n) such that any set of n points in \mathbb{R}^d can be colored with red and blue so that in any axis-aligned box [0, t_1) \times\ldots\times [0, t_d) the absolute difference between the number of red and the number of blue points is at most \Delta_d(n) (see this post for a generalization of this problem). A general transference theorem in discrepancy theory shows that for any probability measure in \mathbb{R}^d there exists a sequence y with star discrepancy at most O(\Delta_{d+1}(n)). The best bound for \Delta_d(n) is O(\log^{d + 0.5} n), only slightly worse than the best star discrepancy for Lebesgue measure. This transference result has long been seen as purely existential, because most non-trivial results in combinatorial discrepancy were not constructive, but recently we have seen amazing progress in algorithms for minimizing combinatorial discrepancy. While even with these advances we don’t get sequences that are nearly as explicit as the van der Corput sequence, there certainly is hope we will get there.



I have barely scratched the surface of Quasi Monte Carlo methods and geometric discrepancy. Koksma-Hlawka type inequalities, discrepancy with respect to various test sets and measures, combinatorial discrepancy are each a big topic in itself. The sheer breadth of mathematical tools that bear on discrepancy questions is impressive: diophantine approximation to construct low discrepancy sequences, reproducing kernels in Hilbert spaces to prove Koksma-Hlawka inequalities, harmonic analysis to prove discrepancy lower bounds, convex geometry for upper and lower bounds in combinatorial discrepancy. Luckily, there are some really nice references available. Matousek has a very accessible book on geometric discrepancy. Chazelle focuses on computer science applications. A new collection of surveys edited by Chen, Srivastav, and Travaglini has many of the latest developments.

Posted in Theoretical Computer Science | 8 Comments

The entropic barrier: a simple and optimal universal self-concordant barrier

Ronen Eldan and I have just uploaded our new paper on the arxiv (it should appear tomorrow, for the moment you can see it here). The abstract reads as follows:

We prove that the Fenchel dual of the log-Laplace transform of the uniform measure on a convex body in \mathbb{R}^n is a (1+o(1)) n-self-concordant barrier. This gives the first construction of a universal barrier for convex bodies with optimal self-concordance parameter. The proof is based on basic geometry of log-concave distributions, and elementary duality in exponential families.

Let me briefly put this in context. As you know if you are a reader of this blog, Interior Point Methods are based on self-concordant barriers, and a basic result in this theory is Nesterov and Nemirovski’s construction of a universal self-concordant barrier whose self-concordance parameter is always O(n), see this blog post for instance. As I said at the time of writing the latter post, “this is a difficult theorem, which we state without a proof”. Similarly, in James Renegar’s book on IPM (which I highly recommend) he qualifies this as “a most impressive and theoretically important result”, but he also says “I have yet to find a relatively transparent proof of this result, and hence a proof is not contained in this book”. We hope our paper remedies this issue. In fact, our construction, which is quite different from the Nesterov-Nemirovski’s barrier, has a few advantages:

– The new barrier has a very simple interpretation as the differential entropy of a certain element in a canonical exponential family on the convex body of interest, hence the name entropic barrier.

– Its self-concordance parameter is optimal (up to the second order term). Indeed Nesterov and Nemirovski proved that on a simplex, the self-concordance parameter is at least n, and we obtain a parameter (1+o(1))n for any convex body.

– The proof is reasonably easy. For instance the self-concordance follows immediately from the fact that higher order moments of a log-concave distribution are controlled by the second order moment. (In fairness getting the tight constant 1 requires a little bit of work, but proving  a bound of O(n) is simpler.)

Mirror Descent with the entropic barrier as a mirror map is equivalent to Cover’s continuous exponential weights, which is kind of interesting.

I would like to conclude with a comment on a result by Roland Hildebrand: he proves that for convex cones, there exists a self-concordant barrier with parameter smaller than n (yes I mean 1 \times n). This is quite amazing, and better than what we obtain since we get (1+o(1))n. However his result has two caveats: (i) as far as I understand his barrier is defined only implicitely, as the unique convex solution to a certain PDE, and (ii) while convex cones are “universal” for convex optimization, there are actually other applications where it is important to have barriers on convex bodies, see for instance some of the papers by Sasha Rakhlin. [Edit: thanks to an email by Francois Glineur we realized that a side result of our analysis is an alternative proof of Hildebrand’s bound for homogeneous cones by using a deep result of Guler. This is very mysterious and we don’t quite understand the full extent of the implication of this yet.]

Posted in Optimization | Leave a comment

What’s the (hi)story of my network?

First an announcement: we have researcher and postdoc positions available in Theory Group at MSR, and I may have a position for an intern in the summer. If you are a talented young mathematician with interests in the kind of things I blog about please get in touch with me by email for more details about these positions.

The topic of this post is the reason why the blog has been silent for the past 3 months: I have just uploaded three new network papers, written with amazing co-authors (Luc Devroye, Jian Ding, Ronen Eldan, Gabor Lugosi, Elchanan Mossel, and Miklos Racz), and which can be found here, here, and here. The overall goal of these works is to realize (a version of) the vision that Nati Linial presented in a public lecture at the Simons Institute in Berkeley about a year ago. As Nati puts it, there is a need for a statistical theory on graphs, and specifically he writes that we should “Develop a battery of generative models of graphs and methods to recover the appropriate model from the input graph”. Obviously this type of question has already been looked at by statisticians, but somehow I feel like they have been shy in the models and questions that they consider. I think that there is a huge opportunity to look at more exotic (yet practically relevant) estimation problems with network data. To simplify, and get at the heart of the matter, I will essentially discuss testing problems rather than estimation.


Is there geometry in my network?

In the space of testing problems on random graphs, the king of the hill is testing for community structure. A precise hypothesis testing problem considered by Ery Arias-Castro and Nicolas Verzelen (see here and here) is the following: Under the null hypothesis, the observed graph G on n vertices has been generated by the standard Erdos-Renyi random graph G(n,p), where each edge appears independently with probability p. Under the alternative, G comes from the model G(n,p,k,q) which is similar to G(n,p) except that there is a subset of k vertices (the community) chosen uniformly at random and such that the connection probability between two vertices from that subset is q instead of p. The case p=1/2 and q=1 corresponds to the now famous Planted Clique (PC) problem. In PC it is easy to see that the information theoretic limit for testability is when k is of order \log(n) (use for instance this), but surprisingly we do not know any polynomial-time test when k= o(\sqrt{n}). Bridging this gap is a major open problem.

Testing for community structure is of course very important in practice (and it’s also a beautiful mathematical question), but it seems a bit silly to restrict all the attention to this single question. In my joint work with Jian, Ronen and Miki we propose to test for geometric structure. This is also quite important in practice, as learning representations for the vertices of a graph is a fundamental primitive to many network analysis techniques. The abstract of our paper reads as follows, and I refer you to the paper itself for more details:

We study the problem of detecting the presence of an underlying high-dimensional geometric structure in a random graph. Under the null hypothesis, the observed graph is a realization of an Erdos-Renyi random graph G(n,p). Under the alternative, the graph is generated from the G(n,p,d) model, where each vertex corresponds to a latent independent random vector uniformly distributed on the sphere \mathbb{S}^{d-1}, and two vertices are connected if the corresponding latent vectors are close enough. In the dense regime (i.e., p is a constant), we propose a near-optimal and computationally efficient testing procedure based on a new quantity which we call signed triangles. The proof of the detection lower bound is based on a new bound on the total variation distance between a Wishart matrix and an appropriately normalized GOE matrix. In the sparse regime, we make a conjecture for the optimal detection boundary. We conclude the paper with some preliminary steps on the problem of estimating the dimension in G(n,p,d).


Where did my network come from?

The two other papers (1, 2) are follow-ups on the project that I described a few months ago in this blog post. In a nutshell the goal was to understand if the Preferential Attachment model depends on its initial condition. See for example these beautiful pictures by Igor Kortchemski of the Preferential Attachment tree on 5000 vertices with different seeds:



(click here to see the seeds for the above pictures, and see here and here for pictures with other seeds). With Elchanan and Miki we conjectured that any initial condition leads to a different model, and we were able to partially prove this conjecture. The full conjecture was proved in a beautiful follow-up paper by a group of fellow French, Nicolas Curien, Thomas Duquesne, Igor Kortchemski, and Ioan Manolescu. Using their new proof technique (but with a different statistic) we (Ronen, Elchanan, Miki, and I) were then able to extend the theorem to the Uniform Attachment model, in which case the result is perhaps even more surprising (in fact I was originally convinced that the seed could not possibly have an influence in the Uniform Attachment model!). The paper can be found here.

The Seed Theorem emerging from these works opens up the problem of actually finding the seed. This is the problem we looked at with Luc and Gabor (in the special case where the seed is a single vertex), and the abstract of the paper reads as follows:

We investigate algorithms to find the first vertex in large trees generated by either the uniform attachment or preferential attachment model. We require the algorithm to output a set of K vertices, such that, with probability at least 1-\epsilon, the first vertex is in this set. We show that for any \epsilon, there exist such algorithms with K independent of the size of the input tree. Moreover, we provide almost tight bounds for the best value of K as a function of \epsilon. In the uniform attachment case we show that the optimal K is subpolynomial in 1/\epsilon, and that it has to be at least superpolylogarithmic. On the other hand, the preferential attachment case is exponentially harder, as we prove that the best K is polynomial in 1/\epsilon. We conclude the paper with several open problems.


Open problems

The four papers (1, 2, 3, 4) contain about a dozen open problems, and we could have easily extended the list to twenty open problems (for example I have added a few open problems in my talk at Oberwolfach two weeks ago, see here for Louigi Addario-Berry‘s notes). Can you solve one of them? Or even better, can you propose new interesting graph hypothesis testing problems?

Posted in Random graphs | 3 Comments

Komlos conjecture, Gaussian correlation conjecture, and a bit of machine learning

Today I would like to talk (somewhat indirectly) about a beautiful COLT 2014 paper by Nick Harvey and Samira Samadi. The problem studied in this paper goes as follows: imagine that you have a bunch of data points in \mathbb{R}^n with a certain mean value \mu. Your goal is to generate a sequence out these data points, such that the partial averages of this new sequence are as close as possible to \mu. The “herding” procedure of Max Welling has the property that the average of the t^{th} first terms in the new sequence is at distance at most O(1/t) from \mu (this was shown by Francis Bach, Simon Lacoste-Julien and Guillaume Obozinski by viewing the herding procedure as a simple Frank-Wolfe). In their paper Nick and Samira improves the dependency on the dimension n in the O notation by using a new algorithm inspired by the theory of discrepancy. In the rest of this post I would like to discuss the basic notions of this theory. What follows is a transcription of what Ronen Eldan (a postdoc in the theory group at MSR) explained to me (of course all errors are mine). I will use c and C to denote absolute numerical constants that take whatever value I need them to take.

Discrepancy theory

Consider a large finite set of vertices V, and a size-n set \mathcal{P} of subsets of vertices (in other words (V, \mathcal{P}) is an hypergraph). For sake of simplicity we will assume here that elements of \mathcal{P} are also of size n. We are now interested in the following combinatorial optimization problem:

    \begin{align*} & \qquad \;\; \mathrm{min.} \qquad \;\;\max_{A \in \mathcal{P}} \left| \sum_{v \in A} f(v) \right| \\ & f : V \rightarrow \{-1,1\} . \end{align*}

In words, one is looking for a partitioning f of the vertices, so as to have the hyperedges A \in \mathcal{P} as balanced as possible.

Alright, so let’s try something simple, just randomly partition the vertices: clearly a simple union bounds shows that the obtained objective value in this case is of order \sqrt{n \log n}. Denote D (for Discrepancy) the smallest objective value (in particular we just saw that D \leq C \sqrt{n \log n}). A beautiful result of Joel Spencer shows that one can always do much better than random partitioning:

Theorem (Spencer’s six deviations, 1985): D \leq 6 \sqrt{n}.

We will now see a splendid (partial) proof of this result. Let \gamma be the Gaussian measure in \mathbb{R}^d (in the following we will take d=|V|). First let me remind you of a basic fact about this measure, and a related conjecture. A set of the form S_{v,t} = \{x : |v^{\top} x| \leq t\}, for some v \in \mathbb{R}^n, t >0, is called a slab.

Theorem (Sidak, 1967): Let A be a symmetric convex set and B be a slab, then

    \[\gamma(A \cap B) \geq \gamma(A) \gamma(B) .\]

Obviously there is equality if A and B are two orthogonal slabs, and the theorem says (in particular) that an angle between the slabs would imply positive correlation.

Conjecture (Gaussian correlation): Let A and B be two symmetric convex sets, then

    \[\gamma(A \cap B) \geq \gamma(A) \gamma(B) .\]

Ok, now back to the proof of Spencer’s theorem. First, let’s rewrite things suggestively in terms of the slabs S_A := S_{\frac{\mathds{1}_A}{\sqrt{n}}, C}, A \in \mathcal{P}. We want to show that there exists f \in \{-1,1\}^d such that f \in \cap_{A \in \mathcal{P}} S_A. Using Sidak’s theorem one clearly has \gamma(\cap_{A \in \mathcal{P}} S_A) \geq c^d (with c being roughly like 1-2 \exp(-C^2/2)). From there it is now a fairly short route to show that, if c \geq 0.99 (say), then the intersection of \{-1,1\}^d with \cap_{A \in \mathcal{P}} S_A must be non-empty. Ronen showed me a non-trivial (and algorithmically non-efficient) 5 lines argument for this but I think that it’s yet unpublished work so I won’t reproduce it here. Another proof of this, which yields an efficient algorithm for finding a point in the intersection, is given in this very nice paper of Lovett and Meka.


In what we discussed so far the underlying hypergraph was completely arbitrary. For some specific graphs one can clearly improve the bound on D. Indeed if all hyperedges are disjoint, then D=0. More interestingly, Beck and Fiala showed in 1981 that, if each vertex appears in at most t hyperedges, then D \leq 2 t. They also conjectured that the correct bound should be O(\sqrt{t}). In 1998 Banaszczyk showed a bound of O(\sqrt{t \log(n)}) using the following theorem (which is a variant of the result mentioned above):

Theorem (Banaszczyk, 1998): Let A be a convex set such that \gamma(A) \geq 1/2, and u_1, \hdots u_m \in \mathbb{R}^d such that \|u_i\|_2 \leq C. Then there exists f \in \{-1,1\}^m such that \sum_{i=1}^m f_i u_i \in A.

The longstanding Komlos’ conjecture says that in fact this holds for A=[-1,1]^n (despite the fact that the Gaussian measure of this set is c^n).

Finally let me mention another open problem, which is the matrix version of Spencer’s six deviations, see this blog post by Raghu Meka for details.


Posted in Optimization, Probability theory, Theoretical Computer Science | 5 Comments

A zest of number theory

I just encountered an amazing number theoretic result. It is probably very well known, but for those who never saw it it’s quite something, so I thought I would share it.

Let n be a positive integer. A partition of n is a multiset of positive integers such that the sum of its elements is equal to n (for example \{1,1,2\} is a partition of 4). The partition function p(n) is defined as the number of partitions of n. The amazing result that I want to tell you about is the formula for the asymptotic equivalent of p(n). Before giving away the formula let’s try to get some intuition about the order of magnitude one can expect. For this let me first look at the simpler quantity q(n), defined as the number of multiset of [n] with n elements (for example q(2)=3). With the stars and bars argument (thanks to Luc Devroye for reminding me about this elementary combinatorial technique!) one can easily see that q(n) \geq 2^n-1. Moreover it is also easy to see that there is an injection from multisets of [\sqrt{n}] with \sqrt{n} elements to partitions of n (observe that the sum of the elements of such a multiset is necessarily smaller than n). Thus

    \[p(n) \geq q(\sqrt{n}) \geq 2^{\sqrt{n}}-1.\]

A more involved argument, based on Young Diagram representation of partitions, shows that p(n) \leq 2^{100 \sqrt{n}} (don’t quote me on the 100, I think it’s correct but I have only rough calculations here).

Now here we go for the asymptotic formula, discovered by Hardy and Ramanujan in 1918:

    \[p(n) \sim \frac{1}{4 \sqrt{3} n} \exp\left(\pi \sqrt{2 n /3}\right) .\]

In my opinion this is a beautiful connection between \pi, e, and a basic property of integers. The proof is based on the circle method, an elementary technique in complex analysis (see here for the details).

Posted in Theoretical Computer Science | Leave a comment

Probability in high dimension

The Barcelona events have just ended, and I’m happy to report that everything went very smoothly. In my opinion the quality of the works presented at COLT and at the Foundations of Learning Theory workshop were truly outstanding. I hope to blog about the results that I found the most exciting next week.

The point of this post is to advertise the lecture notes of my friend (and colleague) Ramon van Handel. These notes cover all the basics of modern high dimensional probability. Acquaintance with this material is extremely useful in all the fields that I’m interested in (machine learning, statistics, optimization, and theoretical computer science). It is the first time that I see this material presented in such a cohesive manner, and it is an excellent companion to the book of Boucheron, Lugosi and Massart. Last but not least Ramon is an amazing mathematical writer (for instance if you want to learn about filtering I highly recommend his lecture notes on Hidden Markov Models).

Posted in Probability theory | Leave a comment

Theory of Convex Optimization for Machine Learning

I am extremely happy to release the first draft of my monograph based on the lecture notes published last year on this blog. (Comments on the draft are welcome!) The abstract reads as follows:

This monograph presents the main mathematical ideas in convex optimization. Starting from the fundamental theory of black-box optimization, the material progresses towards recent advances in structural optimization and stochastic optimization. Our presentation of black-box optimization, strongly influenced by the seminal book of Nesterov, includes the analysis of the Ellipsoid Method, as well as (accelerated) gradient descent schemes. We also pay special attention to non-Euclidean settings (relevant algorithms include Frank Wolfe, Mirror Descent, and Dual Averaging) and discuss their relevance in machine learning. We provide a gentle introduction to structural optimization with FISTA (to optimize a sum of a smooth and a simple non-smooth term), Saddle-Point Mirror Prox (Nemirovski’s alternative to Nesterov’s smoothing), and a concise description of Interior Point Methods. In stochastic optimization we discuss Stochastic Gradient Descent, mini-batches, Random Coordinate Descent, and sublinear algorithms. We also briefly touch upon convex relaxation of combinatorial problems and the use of randomness to round solutions, as well as random walks based methods.

Posted in Optimization | 13 Comments

COLT 2014 accepted papers

The accepted papers for COLT 2014 have just been posted! This year we had a record number of 140 submissions, out of which 52 were accepted (38 for a 20mn presentation and 14 for a 5mn presentation). In my opinion the program is very strong, I am looking forward to listen to all these talks in Barcelona! By the way now may be a very good time to register, the deadline for early registration is coming up fast!

Here is a list of the accepted papers together with links when I could find one (if you want me to add a link just send me an email).

COLT 2014 accepted papers

– Jacob Abernethy, Chansoo Lee, Abhinav Sinha and Ambuj Tewari. Learning with Perturbations via Gaussian Smoothing

– Alekh Agarwal, Animashree Anandkumar, Prateek Jain, Praneeth Netrapalli and Rashish Tandon. Learning Sparsely Used Overcomplete Dictionaries

– Alekh Agarwal, Ashwin Badanidiyuru, Miroslav Dudik, Robert Schapire and Aleksandrs Slivkins. Robust Multi-objective Learning with Mentor Feedback- Morteza Alamgir, –

– Ulrike von Luxburg and Gabor Lugosi. Density-preserving quantization with application to graph downsampling

– Joseph Anderson, Mikhail Belkin, Navin Goyal, Luis Rademacher and James Voss. The More, the Merrier: the Blessing of Dimensionality for Learning Large Gaussian Mixtures

– Sanjeev Arora, Rong Ge and Ankur Moitra. New Algorithms for Learning Incoherent and Overcomplete Dictionaries

– Ashwinkumar Badanidiyuru, John Langford and Aleksandrs Slivkins. Resourceful Contextual Bandits

– Shai Ben-David and Ruth Urner. The sample complexity of agnostic learning under deterministic labels

– Aditya Bhaskara, Moses Charikar and Aravindan Vijayaraghavan. Uniqueness of Tensor Decompositions with Applications to Polynomial Identifiability

– Evgeny Burnaev and Vladimir Vovk. Efficiency of conformalized ridge regression

– Karthekeyan Chandrasekaran and Richard M. Karp. Finding a most biased coin with fewest flips

– Yudong Chen, Xinyang Yi and Constantine Caramanis. A Convex Formulation for Mixed Regression: Minimax Optimal Rates

– Amit Daniely, Nati Linial and Shai Shalev-Shwartz. The complexity of learning halfspaces using generalized linear methods

– Amit Daniely and Shai Shalev-Shwartz. Optimal Learners for Multiclass Problems

– Constantinos Daskalakis and Gautam Kamath. Faster and Sample Near-Optimal Algorithms for Proper Learning Mixtures of Gaussians

– Ofer Dekel, Jian Ding, Tomer Koren and Yuval Peres. Online Learning with Composite Loss Functions Can Be Hard

– Tim van Erven, Wojciech Kotlowski and Manfred K. Warmuth. Follow the Leader with Dropout Perturbations

– Vitaly Feldman and Pravesh Kothari. Learning Coverage Functions and Private Release of Marginals

– Vitaly Feldman and David Xiao. Sample Complexity Bounds on Differentially Private Learning via Communication Complexity

– Pierre Gaillard, Gilles Stoltz and Tim van Erven. A Second-order Bound with Excess Losses

– Eyal Gofer. Higher-Order Regret Bounds with Switching Costs

– Sudipto Guha and Kamesh Munagala. Stochastic Regret Minimization via Thompson Sampling

– Moritz Hardt, Raghu Meka, Prasad Raghavendra and Benjamin Weitz. Computational Limits for Matrix Completion

– Moritz Hardt and Mary Wootters. Fast Matrix Completion Without the Condition Number

– Elad Hazan, Zohar Karnin and Raghu Meka. Volumetric Spanners: an Efficient Exploration Basis for Learning

– Prateek Jain and Sewoong Oh. Learning Mixtures of Discrete Product Distributions using Spectral Decompositions

– Kevin Jamieson, Matthew Malloy, Robert Nowak and Sebastien Bubeck. lil’ UCB: An Optimal Exploration Algorithm for Multi-Armed Bandits

– Satyen Kale. Multiarmed Bandits With Limited Expert Advice

– Varun Kanade and Justin Thaler. Distribution-Independent Reliable Learning

– Ravindran Kannan, Santosh S. Vempala and David Woodruff. Principal Component Analysis and Higher Correlations for Distributed Data

– Emilie Kaufmann, Olivier Cappé and Aurélien Garivier. On the Complexity of A/B Testing

– Matthäus Kleindessner and Ulrike von Luxburg. Uniqueness of ordinal embedding

– Kfir Levy, Elad Hazan and Tomer Koren. Logistic Regression: Tight Bounds for Stochastic and Online Optimization

– Ping Li, Cun-Hui Zhang and Tong Zhang. Compressed Counting Meets Compressed Sensing

– Che-Yu Liu and Sébastien Bubeck. Most Correlated Arms Identification

– Stefan Magureanu, Richard Combes and Alexandre Proutière. Lipschitz Bandits:Regret Lower Bounds and Optimal Algorithms

– Shie Mannor, Vianney Perchet and Gilles Stoltz. Approachability in unknown games: Online learning meets multi-objective optimization

– Brendan McMahan and Francesco Orabona. Unconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations

– Shahar Mendelson. Learning without Concentration

– Aditya Menon and Robert Williamson. Bayes-Optimal Scorers for Bipartite Ranking

– Elchanan Mossel, Joe Neeman and Allan Sly. Belief Propagation, Robust Reconstruction and Optimal Recovery of Block Models

– Andreas Maurer, Massimiliano Pontil and Bernardino Romera-Paredes. An Inequality with Applications to Structured Sparsity and Multitask Dictionary Learning

– Alexander Rakhlin and Karthik Sridharan. Online Nonparametric Regression

– Harish Ramaswamy, Balaji S.B., Shivani Agarwal and Robert Williamson. On the Consistency of Output Code Based Learning Algorithms for Multiclass Learning Problems

– Samira Samadi and Nick Harvey. Near-Optimal Herding

– Rahim Samei, Pavel Semukhin, Boting Yang and Sandra Zilles. Sample Compression for Multi-label Concept Classes

– Ingo Steinwart, Chloe Pasin and Robert Williamson. Elicitation and Identification of Properties

– Ilya Tolstikhin, Gilles Blanchard and Marius Kloft. Localized Complexities for Transductive Learning

– Robert Williamson. The Geometry of Losses

– Jiaming Xu, Marc Lelarge and Laurent Massoulie. Edge Label Inference in Generalized Stochastic Block Models: from Spectral Theory to Impossibility Results

– Se-Young Yun and Alexandre Proutiere. Community Detection via Random and Adaptive Sampling

– Yuchen Zhang, Martin Wainwright and Michael Jordan. Lower bounds on the performance of polynomial-time algorithms for sparse linear regression

Posted in Conference/workshop | 1 Comment