This week and next week I’m giving 90 minutes lectures at MSR on the fundamentals of learning theory. Below you will find my notes for the first course, where we covered the basic setting of statistical learning theory, Glivenko-Cantelli classes, Rademacher complexity, VC dimension, linear regression, logistic regression, soft-SVM, sample complexity of classification with halfspaces, and convex learning with SGD. All of this (and much more) can be found in the excellent book by Shai Shalev-Shwartz and Shai Ben-David (freely available online) which I heavily used to prepare these lectures. Next week we will cover boosting, margin and kernels, stability, and an introduction to neural networks [note: part 2 can be found here].

**What statistical learning is about**

The basic ingredients in statistical learning theory are as follows:

- Two measurable spaces and (the
*input*and*output*sets). - A loss function .
- A set of maps from to (the
*hypothesis class*) - A set of probability measures on .

The mathematically most elementary goal in learning can now be described as follows: given an i.i.d. sequence drawn from some unknown measure , one wants to find a map from to that is doing almost as well as the best map in in terms of predicting given when is drawn from the unknown and the prediction mistakes are evaluated according to the loss function . In other words, denoting for , one is interested in finding a map that minimizes:

where

Let us make a few comments/observations about this problem and some of its variants. In all our discussion we restrict to and .

- The two canonical examples of the above framework are the problem of classification with halfspaces (, , and ), and the problem of linear regression (, , and ). The loss functions just defined are respectively called the zero-one loss and the least-squares loss.
- The prior knowledge about the relation between inputs and outputs (also called the inductive bias) is expressed by either restricting the comparison set (this is called the
*discriminative approach*) or restricting the set of possible distributions (the*generative approach*). - One says that is learnable if there exists such that . The sample complexity of the problem (at scale ) is the smallest such that there exists with .
- Obviously some problems are not learnable, say in classification with the zero-one loss one cannot learn if is the set of all distributions on and . This is sometimes called the no-free lunch theorem (note however that this observation really doesn’t deserve the name “theorem”). Interestingly it turns out that it doesn’t take much (in some sense) to make the problem learnable: if one restricts to be the set of distributions such that the conditional probability is -Lipschitz, then the nearest neighbor rule (i.e., outputs the label of the closest point to ) shows that the problem is (essentially) learnable with sample complexity . However this is not so good news, as this sample complexity is exponential in the dimension (this is the so-called “curse of dimensionality”). To get anything practical we will have to strengthen our inductive bias.
- The stronger notion of PAC learnability replaces the convergence in expectation of to by a convergence in probability (and this convergence should be uniform in ). Furthermore in basic PAC learning one assumes that is restricted to measures such that , also called the
*realizable case*in the context of classification. On the other hand in*agnostic*PAC learning is the set of all measures on . - A weaker notion of learning is to simply require
*consistency*, that is for any one has going to zero. One can show that -Nearest Neighbor is “universally consistent” provided that grows to infinity at a sublinear rate. This fact is not too surprising if you recall Lusin’s theorem. - Proper learning refers to the situation where is restricted to map into .
- Of course we are interested in the computational complexity of learning too, and the “real” objective of machine learning is to find large and expressive hypothesis classes that are learnable in polynomial sample complexity and polynomial time.

**Glivenko-Cantelli and Rademacher complexity**

Recall that a set of functions is a *Glivenko-Cantelli* class if for any measure on , denoting for the empirical measure of , one has that goes to in probability as goes to infinity. A useful technique to get a handle on Glivenko-Cantelli is the following symmetrization argument, where ,

This last quantity is known as the *Rademacher complexity* of on , and its supremum over is simply called the Rademacher complexity of , denoted . In particular McDiarmid’s inequality together with boundedness of directly implies that with probability at least ,

(1)

In other words if the Rademacher complexity of goes to , then is a Glivenko-Cantelli class.

In learning theory the class of interest is (in particular ). If is a Glivenko-Cantelli class (that is the true loss and the empirical loss are close for any ) then one obtains a consistent method with the rule (empirical risk minimization):

Furthermore if the Glivenko-Cantelli convergence is uniform in (which is the case if say the Rademacher complexity goes to ) then the problem is agnostic PAC learnable (thanks to (1))

Bounding the Rademacher complexity (and variants such as when the Rademacher random variables are replaced by Gaussians) is one of the main objective in the rich theory of empirical processes. Many tools have been developed for this task and we rapidly cover some of them below.

**Lipschitz loss**

The contraction lemma says that if and is 1-Lipschitz for any and , then . Thus, if one wants to show that linear regression is PAC learnable (say with ) it suffices to study the Rademacher complexity of linear functions. For instance if and are both the unit Euclidean ball then the Rademacher complexity of linear functions is bounded by (in particular it is independent of the dimension, more on that below). Thus summarizing this discussion we see that in this case the map

(2)

satisfies for any , with probability at least ,

Also observe that one can rewrite the optimization problem (2) by Lagrangian duality as follows, for some and with obvious notations for and ,

which is sometimes called the *ridge regression* problem. The added penalty is thought of as a *regularizer*, and from the perspective of this section it enforces a small Rademacher complexity. The tradeoff is that larger values of leads to smaller Rademacher complexity but also to a smaller comparison set . This two types of error should be thought of as the *estimation and approximation* errors.

Another interesting example is the LASSO problem

which corresponds to being the -unit ball (up to scaling). In this case if is the -unit ball one can show that the Rademacher complexity of linear functions is of order . This corresponds in some sense to the inductive bias that the optimal weight is sparse.

Finally let us make an important remark about scaling issues. In practice features are often normalized individually, i.e. , in which case the Rademacher complexity (with ) scales with the Euclidean radius which is . However if the examples are all sparse then in fact the dimension is replaced by the sparsity, which can make a huge difference in practice (think of the case where is the number of words in the english dictionary, and our data points are articles, which are typically much shorter than the dictionary). We see that ridge regression and LASSO are after different kind of sparsity, the former is about sparsity in the features while the latter is about sparsity of the features that matter for prediction.

**Zero-one loss**

For the zero-one loss we cannot use the contraction lemma as this loss is not Lipschitz. Instead we use Massart’s lemma which states that if then

Recall that in learning with the zero-one loss one has .

Thus we need to count the set of vectors . The Vapnik-Chervonenkis dimension is introduced exactly for this task: for an hypothesis class , is the largest integer such that one can find a set *shattered* by , that is . The Sauer-Shelah lemma then gives

In particular we obtain:

As a rule of thumb the VC dimension is of order of the number of parameters defining the hypothesis class (obviously this is not always true…). For instance one always has , and the VC dimension of halfspaces in is .

Unfortunately the ERM rule for the zero-one loss on halfspaces is computationally hard to implement (on the other hand in the realizable case it can be written as an LP, we will come back to this a bit later). In practice what we do is to relax the zero-one loss by some convex surrogate so that we can then write . Thus if somehow we manage to have a small error with the surrogate loss then we also have a small error with the zero-one loss. Two important examples of surrogate losses for classification with halfspaces are the hinge loss (which gives the soft-SVM algorithm, more on that later):

and the logistic loss (which gives regularized logistic regression):

Using the contraction lemma from the previous section one can easily control the Rademacher complexity with those losses.

**Convex learning**

We say that the loss is convex if for any the map is convex (in particular is a convex set).

The following situations are learnable: convex loss with either [bounded gradients, bounded ], or [bounded second derivative, bounded ], or [strongly convex, bounded gradients].Theorem:

Just as an example assume that for any , and that for any (the set of subgradients for ) one has .

Then running *one-pass SGD *with step-size given by gives (again this is dimension independent!). Observe that if one runs multiple passes of SGD then this bound does not apply, and one is in fact implementing the ERM rule from the previous section (for which bounds can then be derived via Rademacher complexity as we showed before).

## By Dmitry February 25, 2016 - 9:25 am

Sebastien, first, thank you for the great notes.

s there a typo when you divide by $\sqrt{m}$ or $m$ in either the 1st or the 3rd display of *Zero-one loss*?

Also, it seems that $d$ and $n$ both denote the same thing.

## By Dmitry February 25, 2016 - 9:32 am

As far as I get it, there should be $latex m$ with no square root in both cases; the $latex 1/m$ rate is due to the norm of $latex w$ being bounded.

## By No-Stupid-Questions? October 31, 2015 - 1:13 am

Sorry to nitpick. I suspect that even with 1-Lipschitz conditional probabilities (say if equal to 0.25 over the entire input space), the 1-nearest neighbor rule might attain at best constant excess risk for some distributions (0.125 for the stated case).

Thanks for putting these set of notes together. I can’t imagine a shorter and as well written introduction to “rigorous” aspects of statistical learning.

## By Sebastien Bubeck October 31, 2015 - 5:22 pm

Yes you’re right, that’s why I put the term “essentially” in the sentence :).

## By Lukasz October 14, 2015 - 2:30 pm

Is there an equivalent theory for unsupervised learning / density estimation ?

## By Sebastien Bubeck October 20, 2015 - 2:21 am

Well, there is a long-standing debate about this question. As a starting point you can check this piece on “Clustering: Art of Science” http://jmlr.csail.mit.edu/proceedings/papers/v27/luxburg12a/luxburg12a.pdf

## By Paul October 14, 2015 - 8:50 am

Thanks, this was awesome! Is it possible to attach a PDF/LaTeX document for nicer printing?

## By Royi October 15, 2015 - 10:20 am

That would be really great.

## By Sebastien Bubeck October 20, 2015 - 2:20 am

Hi Paul,

I would prefer to keep this as a blog post because it is not as polished as I would like it to be if it were to be a pdf. That being said I might revisit this post (and the next one coming up) later this year and add further details, in which case I will post a pdf on my website.

Sebastien