## COLT 2016 accepted papers

Like previous years (2014, 2015) I have compiled the list of accepted papers for this year’s edition of COLT together with links to the arxiv submission whenever I could find one. These 63 papers were selected from about 200 submissions (a healthy 11% increase in terms of submissions from last year).

COLT 2016 accepted papers

Posted in Conference/workshop | 1 Comment

## Notes on least-squares, part I

These are mainly notes for myself, but I figured that they might be of interest to some of the blog readers too. Comments on what is written below are most welcome!

Let be a pair of random variables, and let be the convex quadratic function defined by

We are interested in finding a minimizer of (we will also be satisfied with such that ). Denoting for the covariance matrix of the features (we refer to as a feature vector) and , one has (if is not invertible then the inverse should be understood as the Moore-Penrose pseudo-inverse, in which case is the smallest-norm element of the affine subspace of minimizers of ). The model of computation we are interested in corresponds to large-scale problems where one has access to an unlimited stream of i.i.d. copies of . Thus the only constraint to find is our computational ability to process samples. Furthermore we want to be able to deal with large dimensional problem where can be extremely ill-conditioned (for instance the smallest eigenvalue can be , or very close to ), and thus bounds that depend on the condition number of (ratio of largest eigenvalue to smallest eigenvalue) are irrelevant for us.

Let us introduce a few more notation before diving into the comparison of different algorithms. With the well-specified model in mind (i.e., where is independent of ) we define the noise level of the problem by assuming that . For simplicity we will assume (whenever necessary) that almost surely one has and . All our iterative algorithms will start at . We will also not worry too much about numerical constants and write for an inequality true up to a numerical constant. Finally for the computational complexity we will count the number of elementary -dimensional vector operations that one has to do (e.g. adding/scaling such vectors, or requesting a new i.i.d. copy of ).

ERM

The most natural approach (also the most terrible as it will turn out) is to take a large sample and then minimize the corresponding empirical version of , or in other words compute where is the column vector and is the matrix whose column is . Since is a quadratic function, we can use the conjugate gradient algorithm (see for example Section 2.4 here) to get in iterations. However each iteration (basically computing a gradient of ) cost elementary vector operations, thus leading overall to elementary vector operations (which, perhaps surprisingly, is also the cost of just forming the empirical covariance matrix ). What number of samples should one take? The answer (at least for the well-specified model) comes from a standard statistical analysis (e.g. with the Rademacher complexity described here, see also these nice lecture notes by Philippe Rigollet) which yield:

(Note that in the above bound we evaluate on instead of , see the comments section for more details on this issue. Since this is not the main point of the post I overlook this issue in what follows.) Wrapping up we see that this “standard” approach solves our objective by taking samples and performing elementary vector operations. This number of elementary vector operations completely dooms this approach when is very large. We note however that the statistical bound above can be shown to be minimax optimal, that is any algorithm has to use at least samples (in the worst case over all ) in order to satisfy our objective. Thus the only hope to do better than the above approach is to somehow process the data more efficiently.

A potential idea to trade a bit of the nice statistical property of for some computational easiness is to add a regularization term to in order to make the problem better conditioned (recall that conjugate gradient reaches an -optimal point in iterations, where is the condition number).

In other words we would like to minimize, for some values of and to be specified,

One can write the minimizer as , and this point is now obtained at a cost of elementary vector operations ( denotes the spectral norm of and we used the fact that with high probability ). Here one can be even smarter and use accelerated SVRG instead of conjugate gradient (for SVRG see Section 6.3 here, and here or here for how to accelerate it), which will reduce the number of elementary vector operations to . Also observe that again an easy argument based on Rademacher complexity leads to

. Thus we want to take of order which in terms of is of order leading to an overall number of vector operations of

Note that, perhaps surprisingly, anything less smart than accelerated SVRG would lead to a worse dependency on than the one obtained above (which is also the same as the one obtained by the “standard” approach). Finally we observe that at a high level the two terms in the above bound can be viewed as “variance” and “bias” terms. The variance term comes from the inherent statistical limitation while the bias comes from the approximate optimization. In particular it makes sense that appears in this bias term as it is the initial distance to the optimum for the optimization procedure.

Standard SGD

Another approach altogether, which is probably the most natural approach for anyone trained in machine learning, is to run the stochastic gradient algorithm. Writing (this is a stochastic gradient, that is it satisfies ) we recall that SGD can be written as:

The most basic guarantee for SGD (see Theorem 6.1 here) gives that with step size of order , and assuming to simplify (see below for how to get around this) that is known up to a numerical constant so that if steps outside the ball of that radius then we project it back on it, one has (we also note that , and we denote ):

This means that roughly with SGD the computational cost for our main task corresponds vector operations. This is a fairly weak result compared to accelerated SVRG mainly because of: (i) dependency in instead of , and (ii) instead of . We will fix both issues, but let us also observe that on the positive side the term in accelerated SVRG is replaced by the less conservative quantity (that is one replaces the largest eigenvalue of by the average eigenvalue of ).

Let us see quickly check whether one can improve the performance of SGD by regularization. That is we consider applying SGD on with stochastic gradient which leads to (see Theorem 6.2 here):

Optimizing over we see that we get the same rate as without regularization. Thus, unsurprisingly, standard SGD fares poorly compared to the more sophisticated approach based on accelerated SVRG described above. However we will see next that there is a very simple fix to make SGD competitive with accelerated SVRG, namely just tune up the learning rate!

Constant step size SGD (after Bach and Moulines 2013)

SGD with a constant step size of order of a constant times satisfies

We see that this result gives a similar bias/variance decomposition as the one we saw for accelerated SVRG. In particular the variance term is minimax optimal, while the bias term matches what one would expect for a basic first order method (this term is not quite optimal as one would expect that a decrease in is possible since this is the optimal rate for first-order optimization of a smooth quadratic function -without strong convexity-, and indeed a recent paper of Dieuleveut, Flammarion and Bach tackles this issue).

In part II I hope to give the intuition/proof of the above result, and in part III I will discuss other aspects of the least squares problem (dual methods, random projections, sparsity).

Posted in Optimization | 3 Comments

## AlphaGo is born

Google DeepMind is making the front page of Nature (again) with a new AI for Go, named AlphaGo (see also this Nature youtube video). Computer Go is a notoriously difficult problem, and up to now AI were faring very badly compared to good human players. In their paper the DeepMind team reports that AlphaGo won 5-0 against the best European player Fan Hui!!! This is truly a jump in performance: the previous best AI, Crazy Stone, needed several handicap stones to compete with pro players. Congratulations to the DeepMind team for this breakthrough!

How did they do it? From a very high level point of view they simply combined the previous state of the art (Monte Carlo Tree Search) with the new deep learning techniques. Recall that MCTS is a technique inspired from multi-armed bandits to efficiently explore the tree of possible action sequences in a game, for more details see this very nice survey by my PhD advisor Remi Munos: From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning. Now in MCTS there are two key elements beyond the bandit part (i.e., how to deal with exploration v.s. exploitation): one needs a way to combine all the information collected to produce a value function for each state (this value is the key term in the upper confidence bound used by the bandit algorithm); and one needs a reasonable random policy to carry out the random rollouts once the bandit strategy has gone deep enough in the tree. In AlphaGo the initial random rollout strategy is learned via supervised deep learning on a dataset of human expert games, and the value function is learned (online, with MCTS guiding the search) via convolutional neural networks (in some sense this corresponds to a very natural inductive bias for this game).

Of course there is much more to AlphaGo than what I described above and you are invited to take a look at the paper (see this reddit thread to find the paper)!

Posted in Uncategorized | 1 Comment

## On the spirit of NIPS 2015 and OpenAI

I just came back from NIPS 2015 which was a clear success in terms of numbers (note that this growth is not all because of deep learning, only about 10% of the papers were on this topic, which is about double of those on convex optimization for example):

In this post I want to talk about some of the new emerging directions that the NIPS community is taking. Of course my view is completely biased as I am more representative of COLT than NIPS (though obviously the two communities have a large overlap). Also I only looked in details at about 25% of the papers so perhaps I missed the most juicy breakthrough. In any case below you will find a short summary of each of these new directions with pointers to some of the relevant papers. Before going into the fun math I wanted to first share some thoughts about the big announcement of yesterday.

Obvious disclaimer: the opinions expressed here represent my own and not those of my employer (or previous employer hosting this blog). Now, for those of you who missed it, yesterday Elon Musk and friends made a huge announcement: they are giving $1 billion to create a non-profit organization whose goal is the advancement of AI (see here for the official statement, and here for the New York Times covering). This is just absolutely wonderful news, and I really feel like we are watching history in the making. There are very very few places in the world solely dedicated to basic research and with that kind of money. Examples are useful to get some perspective: the Perimeter Institute for Theoretical Physics was funded with$100 million (I believe it has a major impact in the field), the Institute for Advanced Studies was funded with a similar size gift (a simple statistic give an idea of the impact: 41 out of 57 Fields medalists have been affiliated with IAS), more recently and perhaps closer to us the Simons Institute for the Theory of Computing was created with \$60 million and its influence on the field keep growing (it was certainly a very influential place in my own career). Looking at what those places are doing with 1/10 of OpenAI’s budget sets the bar extremely high for OpenAI, and I am very excited to see what direction they take and what their long term plans are!

I wish the best of luck to OpenAI and their members. The game-changing potential of this organization puts a lot of responsibility on them and I sincerely hope that they will try to seriously explore different paths to AI rather than to chase local-in-time advertisement (please don’t just solve Go with deep nets!!!).

Now time for some of the cool stuff that happened at NIPS.

Scaling up sampling

When I’m a grown-up I want to do non-convex optimization!

With deep nets in mind all the rage is about non-convex optimization. One direction in that space is to develop more efficient algorithms for specific problems where we already know polynomial-time methods under reasonable assumptions, such as low rank estimation (see “A Convergent Gradient Descent Algorithm for Rank Minimization and Semidefinite Programming from Random Linear Measurements“) and phase retrieval (see “Solving Random Quadratic Systems of Equations Is Nearly as Easy as Solving Linear Systems“). The nice thing about those new results is that they essentially show that gradient descent with a spectral initialization will work (previous evidence was already shown for alternating minimization, see also “A Nonconvex Optimization Framework for Low Rank Matrix Estimation“). Another direction in non-convex optimization is to slowly extend the class of functions that one can solve efficiently, see “Beyond Convexity: Stochastic Quasi-Convex Optimization“. Finally a thought-provoking paper which is worth mentioning is “Matrix Manifold Optimization for Gaussian Mixtures” (it comes without provable guarantees but maybe something can be done there…).

Convex optimization strikes back

As I said non-convex optimization is all the rage, yet there are still many things about convex optimization that we don’t understand (an interesting example is given in this paper “Information-theoretic lower bounds for convex optimization with erroneous oracles“). I blogged recently about a new understanding of Nesterov’s acceleration, but this said nothing about the Nesterov’s accelerated gradient descent. The paper “Accelerated Mirror Descent in Continuous and Discrete Time” builds on (and refines) recent advances on understanding the relation of AGD and Mirror Descent, as well as the differential equations underlying them. Talking about Mirror Descent, I was happy to see it applied to deep nets optimization in “End-to-end Learning of LDA by Mirror-Descent Back Propagation over a Deep Architecture“. Another interesting trend is the revival of second-order methods (e.g., Newton’s method) by using various low-rank approximations to the Hessian, see “Convergence rates of sub-sampled Newton methods“, “Newton-Stein Method: A Second Order Method for GLMs via Stein’s Lemma“, and “Natural Neural Networks“.

Other topics

There are a few other topics that caught my attention but I am running out of stamina. These include many papers on the analysis of cascades in networks (I am particularly curious about the COEVOLVE model), papers that further our understanding of random features, adaptive data analysis (see this), and a very healthy list of bandit papers (or Bayesian optimization as some like to call it).

Posted in Conference/workshop | 14 Comments

## Convex Optimization: Algorithms and Complexity

I am thrilled to announce that my short introduction to convex optimization has just came out in the Foundations and Trends in Machine Learning series (free version on arxiv). This project started on this blog in 2013 with the lecture notes “The complexities of optimization”, it then morphed a year later into a first draft titled “Theory of Convex Optimization for Machine Learning”, and finally after one more iteration it has found the more appropriate name: “Convex Optimization: Algorithms and Complexity”. Notable additions since the last version include: a short treatment of conjugate gradient, an almost self-contained analysis of Vaidya’s 1989 cutting plane method (which attains the best of both center of gravity and ellipsoid method in terms of oracle complexity and computational complexity), and finally an algorithm with a simple geometric intuition which attains the rate of convergence of Nesterov’s accelerated gradient descent.

## Crash course on learning theory, part 2

It might be useful to refresh your memory on the concepts we saw in part 1 (particularly the notions of VC dimension and Rademacher complexity). In this second and last part we will discuss two of the most successful algorithm paradigms in learning: boosting and SVM. Note that just like last time each topic we cover have its own books dedicated to them (see for example the boosting book by Schapire and Freund, and the SVM book by Scholkopf and Smola). Finally we conclude our short tour of the learning’s basics with a simple observation: stable algorithms generalize well.

Boosting

Say that given a distribution supported on points one can find (efficiently) a classifier such that (here we are in the context of classification with the zero-one loss). Can we “boost” this weak learning algorithm into a strong learning algorithm with arbitrarily small for large enough? It turns out that this is possible, and even simple. The idea is to build a linear combination of hypotheses in with a greedy procedure. That is at time step our hypothesis is (the sign of) , and we are now looking to add with an approriate weight . A natural guess is to optimize over to minimize the training error of on our sample . This might be a difficult computational problem (how do you optimize over ?), and furthermore we would like to make use of our efficient weak learning algorithm. The key trick is that . More precisely:

where . From this we see that we would like to be a good predictor for the distribution . Thus we can pass to the weak learning algorithm, which in turns gives us with . Thus we now have:

Optimizing the above expression one finds that leads to (using )

The procedure we just described is called AdaBoost (introduce by Schapire and Freund) and we proved that it satisfies

(1)

In particular we see that our weak learner assumption implies that is realizable (and in fact realizable with margin , see next section for the definition of margin) with the hypothesis class:

This class can be thought of as a neural network with one (infinite size) hidden layer. To realize how expressive is compared to it’s a useful exercise to think about the very basic case of decision stumps (for which the empirical risk minimization can be implemented very efficiently):

To derive a bound on the true risk of AdaBoost it remains to calculate the VC dimension of the class where the size of the hidden layer is . This follows from more general results on the VC dimension of neural networks, and up to logarithmic factors one obtains that is of order . Putting this together with \eqref{eq:empada} we see that when is a constant, one should run AdaBoost for rounds, and then one gets .

Margin

We consider to be the set of distributions such that there exists with (again we are in the context of classification with the zero-one loss, and this assumption means that the data is almost surely realizable). The SVM idea is to search for with minimal Euclidean norm and such that . Effectively this is doing empirical risk minimization over the following data dependent hypothesis class (which we write in terms of the set of admissible weight vectors):

The key point is that we can now use the contraction lemma to bound the Rademacher complexity of this class. Indeed replacing the zero-one loss by the (Lipschitz!) “ramp loss” makes no difference for the optimum , and our estimated weight still has training error while its true loss is only surestimated. Using the argument from previous sections we see that the Rademacher complexity of our hypothesis class (with respect to the ramp loss) is bounded by (assuming the examples are normalized to be in the Euclidean ball). Now it is easy to see that the existence of with (and ) exactly corresponds to a geometric margin of between positive and negative examples (indeed the margin is exactly ). To summarize we just saw that under the -margin condition the SVM algorithm has a sample complexity of order . This suggests that from an estimation perspective one should map the points into a high-dimensional space so that one could hope to have the separability condition (with margin). However this raises computational challenges, as the QP given by the SVM can suddenly look daunting. This is where kernels come into the picture.

Kernels

So let’s go overboard and map the points to an infinite dimensional Hilbert space (as we will see in the next subsection this notation will be consistent with being the hypothesis class). Denote for this map, and let be the kernel associated with it. The key point is that we are not using all the dimensions of our Hilbert space in the SVM optimization problem, but rather we are effectively working in the subspace spanned by (this is because we are only working with inner products with those vectors, and we are trying to minimize the norm of the resulting vector). Thus we can restrict our search to (this fact is called Mercer representer theorem and the previous sentence is the proof…). The beauty is that now we only need to compute the Gram matrix as we only need to consider and . In particular we never need to compute the points (which anyway could be infinite dimensional, so we couldn’t really write them down…). Note that the same trick would work with soft SVM (i.e., regularized hinge loss). To drive the point home let’s see an example: leads to . I guess it doesn’t get much better than this :). Despite all this beauty, one should note that we now have to manipulate an object of size (the kernel matrix ) and in our big data days this can be a disaster. We really want to focus on methods with computational complexity linear in , and thus one is led to the problem of kernel approximation, which we will explore below. But first let us explore a bit further what kernel SVM is really doing.

RKHS and the inductive bias of kernels

As we just saw in kernel methods the true central element is the kernel rather than the embedding . In particular since the only thing that matter are inner products we might as well assume that , and is the completion of where the inner product is defined by (and the definition is extended to by linearity). Assuming that is positive definite (that is Gram matrices built from are positive definite) one obtains a well-defined Hilbert space . Furthermore this Hilbert space has a special property: for any , . In other words is a reproducing kernel Hilbert space (RKHS), and in fact any RKHS can be obtained with the above construction (this is a simple consequence of Riesz representation theorem). Now observe that we can rewrite the kernel SVM problem

as

While the first formulation is computationally more effective, the second sheds light on what we are really doing: simply searching the consistent (with margin) hypothesis in with smallest norm. In other words, thinking in terms of inductive bias, one should choose a kernel for which the norm represents the kind of smoothness one expects in the mapping from input to output (more on that next).

It should also be clear now that one can “kernelize” any regularized empirical risk minimization, that is instead of the boring (note that here the loss is defined on instead of )

one can consider the much more exciting

since this can be equivalently written as

This gives the kernel ridge regression, kernel logistic regression, etc…

Translation invariant kernels and low-pass filtering

We will now investigate a bit further the RKHS that one obtains with translation invariant kernels, that is . A beautiful theorem of Bochner characterizes the continuous maps (with ) for which such a is a positive definite kernel: it is necessary and sufficient that is the characteristic function of a probability measure , that is

An important example in practice is the Gaussian kernel: (this corresponds to mapping to the function ). One can check that in this case is itself a Gaussian (centered and with covariance ).

Now let us restrict our attention to the case where has a density with respect to the Lebesgue measure, that is . A standard calculation then shows that

which implies in particular

Note that for the Gaussian kernel one has , that is the high frequency in are severely penalized in the RKHS norm. Also note that smaller values of correspond to less regularization, which is what one would have expected from the feature map representation (indeed the features are more localized around the data point for larger values of ).

To summarize, SVM with translation invariant kernels correspond to some kind of soft low-pass filtering, where the exact form of the penalization for higher frequency depends on the specific kernel being used (smoother kernels lead to more penalization).

Random features

Let us now come back to computational issues. As we pointed out before, the vanilla kernel method has at least a quadratic cost in the number of data points. A common approach to reduce this cost is to use a low rank approximation of the Gram matrix (indeed thanks to the i.i.d. assumption there is presumably a lot of redundancy in the Gram matrix), or to resort to online algorithms (see for example the forgetron of Dekel, Shalev-Shwartz and Singer). Another idea is to approximate the feature map itself (a priori this doesn’t sound like a good idea, since as we explained above the beauty of kernels is that we avoid computing this feature map). We now describe an elegant and powerful approximation of the feature map (for translation invariant kernels) proposed by Rahimi and Recht which is based on random features.

Let be a translation invariant kernel and its corresponding probability measure. Let’s rewrite in a convenient form, using and Bochner’s theorem,

A simple idea is thus to build the following random feature map: given and , i.i.d. draws from respectively and , let be defined by where

For it is an easy exercise to verify that with probability at least (provided has a second moment at most polynomial in ) one will have for any in some compact set (with diameter at most polynomial in ),

The SVM problem can now be approximated by:

This optimization problem is potentially much simpler than the vanilla kernel SVM when is much bigger than (essentially replaces for most computational aspects of the problem, including space/time complexity of prediction after training).

Stability

We conclude our tour of the basic topics in statistical learning with a different point of view on generalization that was put forward by Bousquet and Elisseeff.

Let’s start with a simple observation. Let be an independent copy of , and denote . Then one can write, using the slight abuse of notation ,

This last quantity can be interpreted as a stability notion, and we see that controlling it would in turn control how different is the true risk compared to the empirical risk. Thus stable methods generalize.

We will now show that regularization can be interpreted as a stabilizer. Precisely we show that

is -stable for a convex and -Lipschitz loss. Denote for the above objective function, then one has

and thus by Lipschitzness

On the other hand by strong convexity one has

and thus with the above we get which implies (by Lipschitzness)

or in other words regularized empirical risk minimization () is stable. In particular denoting for the minimizer of we have:

Assuming and optimizing over we recover the bound we obtained previously via Rademacher complexity.

## Crash course on learning theory, part 1

This week and next week I’m giving 90 minutes lectures at MSR on the fundamentals of learning theory. Below you will find my notes for the first course, where we covered the basic setting of statistical learning theory, Glivenko-Cantelli classes, Rademacher complexity, VC dimension, linear regression, logistic regression, soft-SVM, sample complexity of classification with halfspaces, and convex learning with SGD. All of this (and much more) can be found in the excellent book by Shai Shalev-Shwartz and Shai Ben-David (freely available online) which I heavily used to prepare these lectures. Next week we will cover boosting, margin and kernels, stability, and an introduction to neural networks [note: part 2 can be found here].

The basic ingredients in statistical learning theory are as follows:

• Two measurable spaces and (the input and output sets).
• A loss function .
• A set of maps from to (the hypothesis class)
• A set of probability measures on .

The mathematically most elementary goal in learning can now be described as follows: given an i.i.d. sequence drawn from some unknown measure , one wants to find a map from to that is doing almost as well as the best map in in terms of predicting given when is drawn from the unknown and the prediction mistakes are evaluated according to the loss function . In other words, denoting for , one is interested in finding a map that minimizes:

where

Let us make a few comments/observations about this problem and some of its variants. In all our discussion we restrict to and .

• The two canonical examples of the above framework are the problem of classification with halfspaces (, , and ), and the problem of linear regression (, , and ). The loss functions just defined are respectively called the zero-one loss and the least-squares loss.
• The prior knowledge about the relation between inputs and outputs (also called the inductive bias) is expressed by either restricting the comparison set (this is called the discriminative approach) or restricting the set of possible distributions (the generative approach).
• One says that is learnable if there exists such that . The sample complexity of the problem (at scale ) is the smallest such that there exists with .
• Obviously some problems are not learnable, say in classification with the zero-one loss one cannot learn if is the set of all distributions on and . This is sometimes called the no-free lunch theorem (note however that this observation really doesn’t deserve the name “theorem”). Interestingly it turns out that it doesn’t take much (in some sense) to make the problem learnable: if one restricts to be the set of distributions such that the conditional probability is -Lipschitz, then the nearest neighbor rule (i.e., outputs the label of the closest point to ) shows that the problem is (essentially) learnable with sample complexity . However this is not so good news, as this sample complexity is exponential in the dimension (this is the so-called “curse of dimensionality”). To get anything practical we will have to strengthen our inductive bias.
• The stronger notion of PAC learnability replaces the convergence in expectation of to by a convergence in probability (and this convergence should be uniform in ). Furthermore in basic PAC learning one assumes that is restricted to measures such that , also called the realizable case in the context of classification. On the other hand in agnostic PAC learning is the set of all measures on .
• A weaker notion of learning is to simply require consistency, that is for any one has going to zero. One can show that -Nearest Neighbor is “universally consistent” provided that grows to infinity at a sublinear rate. This fact is not too surprising if you recall Lusin’s theorem.
• Proper learning refers to the situation where is restricted to map into .
• Of course we are interested in the computational complexity of learning too, and the “real” objective of machine learning is to find large and expressive hypothesis classes that are learnable in polynomial sample complexity and polynomial time.

Recall that a set of functions is a Glivenko-Cantelli class if for any measure on , denoting for the empirical measure of , one has that goes to in probability as goes to infinity. A useful technique to get a handle on Glivenko-Cantelli is the following symmetrization argument, where ,

This last quantity is known as the Rademacher complexity of on , and its supremum over is simply called the Rademacher complexity of , denoted . In particular McDiarmid’s inequality together with boundedness of directly implies that with probability at least ,

(1)

In other words if the Rademacher complexity of goes to , then is a Glivenko-Cantelli class.

In learning theory the class of interest is (in particular ). If is a Glivenko-Cantelli class (that is the true loss and the empirical loss are close for any ) then one obtains a consistent method with the rule (empirical risk minimization):

Furthermore if the Glivenko-Cantelli convergence is uniform in (which is the case if say the Rademacher complexity goes to ) then the problem is agnostic PAC learnable (thanks to (1))

Bounding the Rademacher complexity (and variants such as when the Rademacher random variables are replaced by Gaussians) is one of the main objective in the rich theory of empirical processes. Many tools have been developed for this task and we rapidly cover some of them below.

Lipschitz loss

The contraction lemma says that if and is 1-Lipschitz for any and , then . Thus, if one wants to show that linear regression is PAC learnable (say with ) it suffices to study the Rademacher complexity of linear functions. For instance if and are both the unit Euclidean ball then the Rademacher complexity of linear functions is bounded by (in particular it is independent of the dimension, more on that below). Thus summarizing this discussion we see that in this case the map

(2)

satisfies for any , with probability at least ,

Also observe that one can rewrite the optimization problem (2) by Lagrangian duality as follows, for some and with obvious notations for and ,

which is sometimes called the ridge regression problem. The added penalty is thought of as a regularizer, and from the perspective of this section it enforces a small Rademacher complexity. The tradeoff is that larger values of leads to smaller Rademacher complexity but also to a smaller comparison set . This two types of error should be thought of as the estimation and approximation errors.

Another interesting example is the LASSO problem

which corresponds to being the -unit ball (up to scaling). In this case if is the -unit ball one can show that the Rademacher complexity of linear functions is of order . This corresponds in some sense to the inductive bias that the optimal weight is sparse.

Finally let us make an important remark about scaling issues. In practice features are often normalized individually, i.e. , in which case the Rademacher complexity (with ) scales with the Euclidean radius which is . However if the examples are all sparse then in fact the dimension is replaced by the sparsity, which can make a huge difference in practice (think of the case where is the number of words in the english dictionary, and our data points are articles, which are typically much shorter than the dictionary). We see that ridge regression and LASSO are after different kind of sparsity, the former is about sparsity in the features while the latter is about sparsity of the features that matter for prediction.

Zero-one loss

For the zero-one loss we cannot use the contraction lemma as this loss is not Lipschitz. Instead we use Massart’s lemma which states that if then

Recall that in learning with the zero-one loss one has .

Thus we need to count the set of vectors . The Vapnik-Chervonenkis dimension is introduced exactly for this task: for an hypothesis class , is the largest integer such that one can find a set shattered by , that is . The Sauer-Shelah lemma then gives

In particular we obtain:

As a rule of thumb the VC dimension is of order of the number of parameters defining the hypothesis class (obviously this is not always true…). For instance one always has , and the VC dimension of halfspaces in is .

Unfortunately the ERM rule for the zero-one loss on halfspaces is computationally hard to implement (on the other hand in the realizable case it can be written as an LP, we will come back to this a bit later). In practice what we do is to relax the zero-one loss by some convex surrogate so that we can then write . Thus if somehow we manage to have a small error with the surrogate loss then we also have a small error with the zero-one loss. Two important examples of surrogate losses for classification with halfspaces are the hinge loss (which gives the soft-SVM algorithm, more on that later):

and the logistic loss (which gives regularized logistic regression):

Using the contraction lemma from the previous section one can easily control the Rademacher complexity with those losses.

Convex learning

We say that the loss is convex if for any the map is convex (in particular is a convex set).

Theorem: The following situations are learnable: convex loss with either [bounded gradients, bounded ], or [bounded second derivative, bounded ], or [strongly convex, bounded gradients].

Just as an example assume that for any , and that for any (the set of subgradients for ) one has .

Then running one-pass SGD with step-size given by gives (again this is dimension independent!). Observe that if one runs multiple passes of SGD then this bound does not apply, and one is in fact implementing the ERM rule from the previous section (for which bounds can then be derived via Rademacher complexity as we showed before).

## Some stuff I learned this week

This week I had the pleasure to spend 3 days at a wonderful workshop (disclaimer: I was part of the organization, with Ofer Dekel, Yuval Peres and James Lee, so I might be a bit biased). Below you will find three results that I particularly enjoyed (one from each day). Obviously many more topics were covered at the workshop, from finding correlations in subquadratic time, to the basic problem of learning halfspaces (see for instances these two COLT’15 papers, by Daniely and by Awasthi, Balcan, Haghtalab and Urner), or how stability can explain why deep nets trained with SGD are not overfitting (upcoming work by Hardt-Recht-Singer; also on a related note I learned from Yuval about the Kolmogrov-Arnold representation theorem, which seems very much related to the famous Barron’1990 result on the universal approximation properties of neural networks with 1 hidden layer). Finally two more things: (i) we also have a blog where we collected all the open problems discussed during the workshop, and (ii) many of the talks have been recorded (including the ones I mention below, as well as the nice tutorial by Sergiu Hart on dynamics in games). I will post the link to the videos as soon as we upload them.

Negative association

Robin Pemantle gave an excellent tutorial on negative association, see the slides here. The punchline can be summarized pretty succinctly. Consider a measure on . We say that is positively correlated if for any monotone increasing one has . On the other hand we say that is negatively correlated if for any monotone increasing and such that and depends on disjoint subset of variables, one has . Positive correlation is particularly easy to check thanks to the famous FKG theorem which states that if is log-supermodular (i.e. is a submodular function) then is positively correlated (this is nice because submodularity is a local condition). One may hope that log-submodular distributions are negatively correlated, but unfortunately this is not true in general. The route to an easily checkable condition for negative correlation is a bit more tortuous and it goes as follows. First observe that negative association clearly implies pairwise negative correlation, that is for any and ,

Writing this inequality in terms of the generating function one has (with ):

One says that is strongly Rayleigh if the above equation holds true for any instead of just at . It turns out that this condition is more restrictive than negative correlation, but it is also much more tractable, in particular because it is equivalent to the fact that is a stable polynomial (that is does not have any roots in where is the complex upper half plane). Once you know that latter equivalence it becomes trivial to check that determinantal measures are negatively correlated since they are strongly Rayleigh (just use the fact that the characteristic polynomial of a symmetric matrix only has roots on the real line). A particularly useful property of strongly Rayleigh measures is that they have the usual Gaussian tails for any Lipschitz functional, just like product measures (this was conjectured by Elchanan Mossel, and recently proved by Robin Pemantle and Yuval Peres). Recently this set of measures has also been useful in theoretical computer science, see the recent work of Shayan Oveis Gharan and his co-authors (where they use negative correlation properties of random spanning trees). Finally I briefly mention that the sampling problem for negatively associated measures is a quite rich question, see the slides by Robin and also this COLT’15 paper by Rebeschini and Karbasi.

Learning in networks

Elchanan Mossel talked about a beautiful set of open questions. The basic problem goes as follows: let be a graph, and assume that each vertex is a player that receives an independent random variable distributed according to some measure. The measure used to produce the random variables is the same for all vertices, and it is either or (say the players also know that the choice of the measure was a coin flip). Question: assuming that at the end of each day the players reveal their conditional expectation for (given all the information they currently have) to their neighbors in , how many days do we need to reach an agreement? If convergence happens, is the conditional expectation that is reached the optimal one (that is the conditional expectation given all the random variables)? What if instead of revealing your belief you only reveal what is the most likely to you between and (revealed action game)? Mossel-Sly-Tamuz show that both in the revealed belief and the revealed action game there will eventually be an agreement. Furthermore in the revealed beliefs this will be close to the optimal posterior provided that is large enough. Both of these theorems currently lack a convergence rate.

The complexity of stochastic block models

The stochastic block model is a canonical model of a random graph with communities. The general version of this model is denoted where is the number of vertices, represents the proportion of each community, and represents the matrix of connection probabilities. More precisely the random graph is defined as follows: First each vertex is assigned a latent label , where the latent label vector is drawn uniformly at random conditionally on the fact that the number of vertices with label is exactly . The edges are then drawn independently (conditionally on ), and each edge is present with probability . This model has received a LOT of attention in the last thirty years. The basic objective is to recover the latent label vector from the observation of the unlabelled graph . The following very recent result by Emmanuel Abbe and Colin Sandon answers the most basic question one can ask: when is this objective achievable? First we introduce the Abbe-Sandon divergence for vectors :

Theorem (Abbe-Sandon 2015): Asymptotically almost sure exact recovery of the latent feature vector in is possible if and only if for any , ,

In their paper Abbe and Sandon also give an efficient algorithm which achieves the information theoretic limit given above. It remains an outstanding open problem to obtain such a precise picture for the problem of weak recovery where one asks to recover correctly only a fraction of the latent label vector. This question is particularly interesting in the sparse regime where the average degree is constant (note that in the above theorem the average degree diverges at the rate ). Recently a lot of progress has been made in this mathematically challenging regime too, see in particular the papers by Mossel-Neeman-Sly, Massoulie and Bordenave-Lelarge-Massoulie.

## A solution to bandit convex optimization

Ronen Eldan and I have just uploaded to the arXiv our newest paper which finally proves that for online learning with bandit feedback, convex functions are not much harder than linear functions. The quest for this result started in 2004 with a paper by Robert Kleinberg, and one by Flaxman, Kalai, McMahan, and over the years it has become the flagship open problem in bandit theory (see for instance Hazan and Levy 2015 where they refer to the attainable regret for convex bandit as the most important open question in bandit learning).

Our result with Ronen however is far from the end of the story for this problem: currently our dependency on the dimension is , which can certainly be improved, and even more importantly we do not yet know an algorithm which attains our regret bound! Indeed our proof follows the trail of my COLT’15 paper with Ofer Dekel, Tomer Koren and Yuval Peres where we suggest to attack the dual Bayesian problem (in the latter paper we solved this problem in dimension one). Thus finding an algorithm with for bandit convex optimization remains as open as before, but now at least we know that such an algorithm exist!!

Along the way of our new proof we solve an interesting property testing problem on convex functions, which we feel could have applications beyond convex bandits. The problem can be described as follows. Assume that you are given a convex and non-negative convex function . Now suppose that there is some unknown convex function defined on the same domain as and for which you would like to know if is equal to . I also guarantee you that if is not equal to then takes a negative value somewhere on its domain (say even less than ). You can obtain information on through noisy queries, that is you can ask what is the value of at and you obtain where is a standard Gaussian. Can you design a non-adaptive query strategy such that after at most (with possibly extra logarithmic factors) you will be able to correctly decide if equals or not with probability at least ? In the paper we give a positive answer by constructing a map where is a distribution supported on the domain of which “explores at every scale simultaneously”. At this point this map is still somewhat mysterious, and there are many open questions around it (can you compute the map efficiently? can this be useful beyond Bayesian convex bandits? can you find a better construction than ours? what is the optimal number of queries?).  See the paper for all the details!

Posted in Optimization | 2 Comments

## Revisiting Nesterov’s Acceleration

Nesterov’s accelerated gradient descent (AGD) is hard to understand. Since Nesterov’s 1983 paper people have tried to explain “why” acceleration is possible, with the hope that the answer would go beyond the mysterious (but beautiful) algebraic manipulations of the original proof. This search for a good “why” has accelerated in the recent years under the impulsion of the machine learning community, as one would like to generalize Nesterov’s acceleration in various ways to deal with machine learning applications. Obviously generalizing is difficult when even the basic phenomenon is not well understood. Recent works in the “why” direction are re-interpreting the AGD algorithm from different point of views:

Allen-Zhu and Orrechia view AGD as a linear coupling of Gradient Descent and Mirror Descent. This gives a nice new proof where the acceleration results from a primal-dual point of view on the progress of the algorithm.

Su, Boyd and Candes (building on earlier work of O’Donoghue and Candes) view AGD as the discretization of a certain second-order ODE. Other papers in a somewhat similar direction include the work of Lessard, Recht and Packard; as well as Flammarion and Bach. Note that these papers do not give alternative proofs of acceleration for general smooth functions.

– Other relevant references include a recent paper of Arjevani, Shalev-Shwartz and Shamir, and a blog post by Hardt.

In joint work with Yin-Tat Lee and Mohit Singh we discovered a simple geometric reason for the possiblity of acceleration, which can be turned into a new algorithm with the same theoretical guarantees as AGD, and improved performances in practice. The proof of the convergence rate is 3 lines, assuming a completely elementary geometric fact (in dimension 2) which itself takes less than half a page to prove. The general principle is loosely inspired from the ellipsoid method (except that we work with balls, which makes the algorithm much more efficient). Let me guide you through this.

Preliminaries

Ok so we are trying to minimize on a function which is -strongly convex and smooth (see here for definitions). Denote the minimizer of , the condition number of , for a Euclidean ball centered at and of radius squared ,

which are respectively a standard and long step of gradient descent. The crucial observation is that the local information at and the assumptions on allow to enclose the minimizer in some ball, precisely:

Finally we need some fact on the intersection of balls. Take a ball with radius squared , and intersect it with some other ball which has the center of the previous ball on its boundary. This intersection is obviously contained in a ball of radius squared . An easy calculation shows that if the radius squared of one ball shrinks by then the intersection is included in a ball of radius squared . Furthermore if both balls shrink (additively) by the same amount (which is an fraction of the radius squared for one of the balls), then the intersection is included in a ball of radius squared . As we will see this fact is the reason why one can accelerate.

A suboptimal algorithm

Assume that we are given such that . As we saw before one also has

Taking the intersection of these two balls and using the fact from the previous section we see that there exists such that

Thus by iterating the map from to (defined implicitely above) we obtain a new algorithm with the same rate of convergence as gradient descent.

Acceleration

Now assume that the point from the previous section and its guarantee on the distance to the optimum came from the local information at some other point , that is and where . Assume furthermore that is at least as good as a step of gradient descent at , that is . Then we can use the local information at to reduce the ball around , indeed one has

and thus

Furthermore, as in the last section, one has

Taking the intersection of the above balls and enclosing it in another ball one obtains (see preliminaries) that there exists such that

The only issue at this point is that to iterate the map from to one would need that is at least as good as a step of gradient descent on , which is not necessarily true (just as the assumption is not necessarily true). We circumvent this difficulty by doing a line search on the line going through and . The details (which are straightforward given the above discussion) are in the paper, specifically in Section 3. We also report encouraging preliminary experimental results.

Posted in Optimization | 7 Comments