This week and next week I’m giving 90-minute lectures at MSR on the fundamentals of learning theory. Below you will find my notes for the first lecture, where we covered the basic setting of statistical learning theory, Glivenko-Cantelli classes, Rademacher complexity, VC dimension, linear regression, logistic regression, soft-SVM, the sample complexity of classification with halfspaces, and convex learning with SGD. All of this (and much more) can be found in the excellent book by Shai Shalev-Shwartz and Shai Ben-David (freely available online), which I used heavily to prepare these lectures. Next week we will cover boosting, margin and kernels, stability, and an introduction to neural networks [note: part 2 can be found here].
What statistical learning is about
The basic ingredients in statistical learning theory are as follows:
- Two measurable spaces $\mathcal{X}$ and $\mathcal{Y}$ (the input and output sets).
- A loss function $\ell : \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}_+$.
- A set $\mathcal{H}$ of maps from $\mathcal{X}$ to $\mathcal{Y}$ (the hypothesis class).
- A set $\mathcal{D}$ of probability measures on $\mathcal{X} \times \mathcal{Y}$.
The mathematically most elementary goal in learning can now be described as follows: given an i.i.d. sequence drawn from some unknown measure $\mu \in \mathcal{D}$, one wants to find a map from $\mathcal{X}$ to $\mathcal{Y}$ that is doing almost as well as the best map in $\mathcal{H}$ in terms of predicting $Y$ given $X$ when $(X, Y)$ is drawn from the unknown $\mu$ and the prediction mistakes are evaluated according to the loss function $\ell$. In other words, denoting $L_\mu(h) = \mathbb{E}_{(X, Y) \sim \mu}\, \ell(h(X), Y)$ for a map $h : \mathcal{X} \rightarrow \mathcal{Y}$, one is interested in finding a map $A : (\mathcal{X} \times \mathcal{Y})^n \rightarrow \mathcal{Y}^{\mathcal{X}}$ that minimizes:

$$R_n(A) := \sup_{\mu \in \mathcal{D}} \left( \mathbb{E}\, L_\mu(A(S_n)) - \inf_{h \in \mathcal{H}} L_\mu(h) \right),$$

where $S_n = ((X_1, Y_1), \ldots, (X_n, Y_n))$ denotes the i.i.d. sample drawn from $\mu$ and the expectation is over this sample.
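To fix ideas, here is a tiny Monte Carlo illustration (a sketch; the distribution $\mu$, the halfspace, and the sample sizes are arbitrary illustrative choices) of the two quantities at play for a fixed predictor under the zero-one loss: the true risk $L_\mu(h)$, approximated with a very large sample, and its empirical counterpart on a small sample.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([1.0, -2.0, 0.5])                  # a fixed halfspace h(x) = sign(<w, x>)

def sample_mu(n):
    """Illustrative choice of mu: Gaussian inputs, noisy halfspace labels."""
    X = rng.standard_normal((n, 3))
    y = np.sign(X @ np.array([1.0, -1.0, 1.0]) + 0.5 * rng.standard_normal(n))
    return X, y

def zero_one_risk(w, X, y):
    """Average zero-one loss of the halfspace w on the sample (X, y)."""
    return np.mean(np.sign(X @ w) != y)

X_small, y_small = sample_mu(50)                # a small i.i.d. sample from mu
X_big, y_big = sample_mu(10**6)                 # large sample as a proxy for mu itself
print("empirical risk:", zero_one_risk(w, X_small, y_small))
print("true risk (Monte Carlo):", zero_one_risk(w, X_big, y_big))
```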
Let us make a few comments/observations about this problem and some of its variants. In all our discussion we restrict to $\mathcal{X} \subset \mathbb{R}^d$ and $\mathcal{Y} \subset \mathbb{R}$.
- The two canonical examples of the above framework are the problem of classification with halfspaces ($\mathcal{Y} = \{-1, 1\}$, $\mathcal{H} = \{x \mapsto \mathrm{sign}(\langle w, x \rangle) : w \in \mathbb{R}^d\}$, and $\ell(y, y') = \mathbf{1}\{y \neq y'\}$), and the problem of linear regression ($\mathcal{Y} = \mathbb{R}$, $\mathcal{H} = \{x \mapsto \langle w, x \rangle : w \in \mathbb{R}^d\}$, and $\ell(y, y') = (y - y')^2$). The loss functions just defined are respectively called the zero-one loss and the least-squares loss.
- The prior knowledge about the relation between inputs and outputs (also called the inductive bias) is expressed by either restricting the comparison set $\mathcal{H}$ (this is called the discriminative approach) or restricting the set of possible distributions $\mathcal{D}$ (the generative approach).
- One says that the problem is learnable if there exists $A$ such that $R_n(A) \rightarrow 0$ as $n \rightarrow \infty$. The sample complexity of the problem (at scale $\epsilon$) is the smallest $n$ such that there exists $A$ with $R_n(A) \leq \epsilon$.
- Obviously some problems are not learnable: say, in classification with the zero-one loss one cannot learn if $\mathcal{D}$ is the set of all distributions on $\mathcal{X} \times \{-1, 1\}$ and $\mathcal{H} = \mathcal{Y}^{\mathcal{X}}$. This is sometimes called the no-free-lunch theorem (note however that this observation really doesn’t deserve the name “theorem”). Interestingly it turns out that it doesn’t take much (in some sense) to make the problem learnable: if one restricts $\mathcal{D}$ to be the set of distributions such that the conditional probability $x \mapsto \mathbb{P}(Y = 1 \,|\, X = x)$ is $1$-Lipschitz, then the nearest neighbor rule (i.e., at a query point $x$ one outputs the label $Y_i$ of the closest point $X_i$ to $x$; see the sketch after this list) shows that the problem is (essentially) learnable with sample complexity of order $(\sqrt{d}/\epsilon)^{d+1}$. However this is not such good news, as this sample complexity is exponential in the dimension $d$ (this is the so-called “curse of dimensionality”). To get anything practical we will have to strengthen our inductive bias.
- The stronger notion of PAC learnability replaces the convergence in expectation of $L_\mu(A(S_n))$ to $\inf_{h \in \mathcal{H}} L_\mu(h)$ by a convergence in probability (and this convergence should be uniform in $\mu \in \mathcal{D}$). Furthermore in basic PAC learning one assumes that $\mathcal{D}$ is restricted to measures such that $\inf_{h \in \mathcal{H}} L_\mu(h) = 0$, also called the realizable case in the context of classification. On the other hand in agnostic PAC learning $\mathcal{D}$ is the set of all measures on $\mathcal{X} \times \mathcal{Y}$.
- A weaker notion of learning is to simply require consistency, that is for any fixed $\mu$ one has $L_\mu(A(S_n)) - \inf_{h \in \mathcal{H}} L_\mu(h)$ going to zero (the convergence need not be uniform in $\mu$). One can show that $k$-Nearest Neighbor is “universally consistent” provided that $k$ grows to infinity at a sublinear rate in $n$. This fact is not too surprising if you recall Lusin’s theorem.
- Proper learning refers to the situation where $A$ is restricted to map into $\mathcal{H}$.
- Of course we are interested in the computational complexity of learning too, and the “real” objective of machine learning is to find large and expressive hypothesis classes that are learnable in polynomial sample complexity and polynomial time.
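Here is the minimal sketch of the nearest neighbor rule promised above (numpy, Euclidean metric; the toy data-generating process is an arbitrary illustrative choice).

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, X_test):
    """1-NN rule: predict the label of the closest training point (Euclidean metric)."""
    # Pairwise squared distances between test points and training points.
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d2.argmin(axis=1)]

# Toy usage: labels depend (noisily) on the first coordinate only.
rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(500, 2))
y_train = np.sign(X_train[:, 0] + 0.1 * rng.standard_normal(500))
X_test = rng.uniform(-1, 1, size=(5, 2))
print(nearest_neighbor_predict(X_train, y_train, X_test))
```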
Glivenko-Cantelli and Rademacher complexity
Recall that a set of functions $\mathcal{F} \subset \mathbb{R}^{\mathcal{Z}}$ is a Glivenko-Cantelli class if for any measure $\mu$ on $\mathcal{Z}$, denoting $\mu_n = \frac{1}{n} \sum_{i=1}^n \delta_{Z_i}$ for the empirical measure of an i.i.d. sequence $Z_1, \ldots, Z_n \sim \mu$, one has that $\sup_{f \in \mathcal{F}} |\mu_n(f) - \mu(f)|$ goes to $0$ in probability as $n$ goes to infinity. A useful technique to get a handle on Glivenko-Cantelli is the following symmetrization argument, where $\epsilon_1, \ldots, \epsilon_n$ are i.i.d. Rademacher random variables (uniform on $\{-1, 1\}$), independent of the $Z_i$’s:

$$\mathbb{E} \sup_{f \in \mathcal{F}} \big( \mu(f) - \mu_n(f) \big) \leq 2\, \mathbb{E} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \epsilon_i f(Z_i).$$

This last quantity is known as the Rademacher complexity of $\mathcal{F}$ on $Z_1, \ldots, Z_n$, and its supremum over $Z_1, \ldots, Z_n \in \mathcal{Z}$ is simply called the Rademacher complexity of $\mathcal{F}$, denoted $\mathrm{Rad}_n(\mathcal{F})$. In particular McDiarmid’s inequality together with boundedness of the functions in $\mathcal{F}$ (say with values in $[0, 1]$) directly implies that with probability at least $1 - \delta$,

$$\sup_{f \in \mathcal{F}} \big( \mu(f) - \mu_n(f) \big) \leq 2\, \mathrm{Rad}_n(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2n}}. \qquad (1)$$

In other words if the Rademacher complexity of $\mathcal{F}$ goes to $0$, then $\mathcal{F}$ is a Glivenko-Cantelli class.
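To make the definition concrete, here is a small Monte Carlo estimate of the (empirical) Rademacher complexity of a finite class of functions evaluated on a fixed sample; the class of one-dimensional thresholds below is just an illustrative choice.

```python
import numpy as np

def empirical_rademacher(F_values, n_mc=2000, seed=0):
    """Monte Carlo estimate of E_eps sup_f (1/n) sum_i eps_i f(Z_i).

    F_values has shape (num_functions, n): row j holds the values f_j(Z_1), ..., f_j(Z_n).
    """
    rng = np.random.default_rng(seed)
    num_f, n = F_values.shape
    eps = rng.choice([-1.0, 1.0], size=(n_mc, n))        # i.i.d. Rademacher signs
    # For each draw of signs, take the sup over the class of the empirical correlation.
    sups = (eps @ F_values.T / n).max(axis=1)
    return sups.mean()

# Illustrative class: threshold functions f_t(z) = 1{z <= t} evaluated on n sample points.
rng = np.random.default_rng(1)
n = 200
z = rng.uniform(0, 1, size=n)
thresholds = np.linspace(0, 1, 50)
F_values = (z[None, :] <= thresholds[:, None]).astype(float)
print(empirical_rademacher(F_values))                    # decays roughly like sqrt(log(n) / n)
```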
In learning theory the class of interest is $\ell \circ \mathcal{H} := \{(x, y) \mapsto \ell(h(x), y) : h \in \mathcal{H}\}$ (in particular $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$). If $\ell \circ \mathcal{H}$ is a Glivenko-Cantelli class for every $\mu \in \mathcal{D}$ (that is the true loss and the empirical loss are close for any $h \in \mathcal{H}$) then one obtains a consistent method with the ERM rule (empirical risk minimization):

$$\mathrm{ERM}(S_n) \in \operatorname*{argmin}_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \ell(h(X_i), Y_i).$$

Furthermore if the Glivenko-Cantelli convergence is uniform in $\mu$ (which is the case if say the Rademacher complexity $\mathrm{Rad}_n(\ell \circ \mathcal{H})$ goes to $0$) then the problem is agnostic PAC learnable (thanks to (1)).
Bounding the Rademacher complexity (and variants such as when the Rademacher random variables are replaced by Gaussians) is one of the main objectives in the rich theory of empirical processes. Many tools have been developed for this task and we briefly cover some of them below.
Lipschitz loss
The contraction lemma says that if $\mathcal{F} \subset \mathbb{R}^{\mathcal{Z}}$ and $\phi_i : \mathbb{R} \rightarrow \mathbb{R}$ is $1$-Lipschitz for any $i \in [n]$ and $z_1, \ldots, z_n \in \mathcal{Z}$, then

$$\mathbb{E} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \epsilon_i \phi_i(f(z_i)) \leq \mathbb{E} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \epsilon_i f(z_i).$$

Thus, if one wants to show that linear regression is PAC learnable (say with $\mathcal{X}$, $\mathcal{Y}$ and the set of weights all bounded, so that the least-squares loss is Lipschitz on the relevant domain) it suffices to study the Rademacher complexity of linear functions. For instance if $\mathcal{X}$ and the set of weights are both the unit Euclidean ball $B_2^d$ then the Rademacher complexity of linear functions is bounded by $1/\sqrt{n}$ (in particular it is independent of the dimension, more on that below). Thus summarizing this discussion we see that in this case the map

$$A(S_n) = \operatorname*{argmin}_{w \in B_2^d} \frac{1}{n} \sum_{i=1}^n \big( \langle w, X_i \rangle - Y_i \big)^2 \qquad (2)$$

satisfies for any $\mu$, with probability at least $1 - \delta$,

$$L_\mu(A(S_n)) \leq \inf_{w \in B_2^d} L_\mu(w) + \frac{c}{\sqrt{n}} + c \sqrt{\frac{\log(1/\delta)}{n}},$$

for some numerical constant $c > 0$.
Also observe that one can rewrite the optimization problem (2) by Lagrangian duality as follows, for some $\lambda \geq 0$ and with obvious notations for the data matrix $\mathbb{X} \in \mathbb{R}^{n \times d}$ and the output vector $\mathbb{Y} \in \mathbb{R}^n$:

$$\min_{w \in \mathbb{R}^d} \frac{1}{n} \|\mathbb{X} w - \mathbb{Y}\|_2^2 + \lambda \|w\|_2^2,$$

which is sometimes called the ridge regression problem. The added penalty is thought of as a regularizer, and from the perspective of this section it enforces a small Rademacher complexity. The tradeoff is that larger values of $\lambda$ lead to a smaller Rademacher complexity but also to a smaller comparison set of weights. These two types of error should be thought of as the estimation and approximation errors.
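Since ridge regression has a closed-form solution, here is a short numpy sketch (the $1/n$ normalization of the data-fit term matches the display above; the synthetic data are an illustrative choice).

```python
import numpy as np

def ridge(X, Y, lam):
    """Minimize (1/n) ||X w - Y||_2^2 + lam ||w||_2^2 via its closed-form solution."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ Y / n)

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
Y = X @ w_star + 0.1 * rng.standard_normal(n)
w_hat = ridge(X, Y, lam=0.1)
print(np.linalg.norm(w_hat - w_star))                     # small for mild regularization
```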
Another interesting example is the LASSO problem

$$\min_{w \in \mathbb{R}^d} \frac{1}{n} \|\mathbb{X} w - \mathbb{Y}\|_2^2 + \lambda \|w\|_1,$$

which corresponds to the set of weights being the $\ell_1$-unit ball (up to scaling). In this case if $\mathcal{X}$ is the $\ell_\infty$-unit ball one can show that the Rademacher complexity of linear functions is of order $\sqrt{\log(d)/n}$. This corresponds in some sense to the inductive bias that the optimal weight is sparse.
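The LASSO has no closed-form solution, but proximal gradient descent (ISTA) solves it in a few lines; below is a minimal sketch with synthetic sparse data (the step size is set to the inverse smoothness constant of the quadratic part; everything else is an illustrative choice).

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (coordinate-wise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(X, Y, lam, n_iter=500):
    """Minimize (1/n) ||X w - Y||_2^2 + lam ||w||_1 by proximal gradient descent."""
    n, d = X.shape
    step = n / (2.0 * np.linalg.norm(X, 2) ** 2)          # 1 / Lipschitz constant of the gradient
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = (2.0 / n) * X.T @ (X @ w - Y)              # gradient of the smooth part
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(0)
n, d = 100, 50
X = rng.standard_normal((n, d))
w_star = np.zeros(d)
w_star[:3] = [2.0, -1.5, 1.0]                             # sparse ground truth
Y = X @ w_star + 0.1 * rng.standard_normal(n)
print(np.round(lasso_ista(X, Y, lam=0.1)[:6], 2))         # roughly recovers the 3 active weights
```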
Finally let us make an important remark about scaling issues. In practice features are often normalized individually, i.e. $|X_i(j)| \leq 1$ for each coordinate $j$, in which case the Rademacher complexity (with the weights in the Euclidean unit ball) scales with the Euclidean radius of the data, which is $\sqrt{d}$. However if the examples are all sparse then in fact the dimension is replaced by the sparsity, which can make a huge difference in practice (think of the case where $d$ is the number of words in the English dictionary, and our data points $X_i$ are articles, which are typically much shorter than the dictionary). We see that ridge regression and LASSO are after different kinds of sparsity: the former benefits from sparsity of the examples themselves, while the latter is about sparsity of the features that matter for prediction.
Zero-one loss
For the zero-one loss we cannot use the contraction lemma as this loss is not Lipschitz. Instead we use Massart’s lemma, which states that if $A \subset \mathbb{R}^n$ is finite then

$$\mathbb{E} \max_{a \in A} \frac{1}{n} \sum_{i=1}^n \epsilon_i a_i \leq \max_{a \in A} \|a\|_2 \, \frac{\sqrt{2 \log |A|}}{n}.$$

Recall that in learning with the zero-one loss one has $\ell(h(x), y) = \mathbf{1}\{h(x) \neq y\} \in \{0, 1\}$.

Thus we need to count the set of vectors $\{(\mathbf{1}\{h(X_1) \neq Y_1\}, \ldots, \mathbf{1}\{h(X_n) \neq Y_n\}) : h \in \mathcal{H}\}$, whose cardinality is that of $\{(h(X_1), \ldots, h(X_n)) : h \in \mathcal{H}\}$. The Vapnik-Chervonenkis dimension is introduced exactly for this task: for a hypothesis class $\mathcal{H} \subset \{-1, 1\}^{\mathcal{X}}$, $\mathrm{VC}(\mathcal{H})$ is the largest integer $k$ such that one can find a set $\{x_1, \ldots, x_k\}$ shattered by $\mathcal{H}$, that is $|\{(h(x_1), \ldots, h(x_k)) : h \in \mathcal{H}\}| = 2^k$. The Sauer-Shelah lemma then gives

$$|\{(h(x_1), \ldots, h(x_n)) : h \in \mathcal{H}\}| \leq \sum_{i=0}^{\mathrm{VC}(\mathcal{H})} \binom{n}{i} \leq \left( \frac{e n}{\mathrm{VC}(\mathcal{H})} \right)^{\mathrm{VC}(\mathcal{H})}.$$

In particular we obtain:

$$\mathrm{Rad}_n(\ell \circ \mathcal{H}) \leq \sqrt{\frac{2\, \mathrm{VC}(\mathcal{H}) \log\!\left( \frac{e n}{\mathrm{VC}(\mathcal{H})} \right)}{n}}.$$
As a rule of thumb the VC dimension is of the order of the number of parameters defining the hypothesis class (obviously this is not always true…). For instance one always has $\mathrm{VC}(\mathcal{H}) \leq \log_2 |\mathcal{H}|$, and the VC dimension of halfspaces in $\mathbb{R}^d$ is $d + 1$.
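As a quick numerical sanity check of the Sauer-Shelah bound, one can compare the number of behaviors allowed for a class of VC dimension $d$ to the trivial bound $2^n$ (the choice $d = 3$, matching halfspaces in the plane, is illustrative).

```python
from math import comb, e

def sauer_shelah(n, d):
    """Upper bound sum_{i <= d} C(n, i) on the number of behaviors on n points of a class of VC dimension d."""
    return sum(comb(n, i) for i in range(d + 1))

d = 3  # e.g. halfspaces in the plane have VC dimension 3
for n in [5, 10, 20, 50]:
    print(n, sauer_shelah(n, d), round((e * n / d) ** d), 2 ** n)   # polynomial vs. exponential growth
```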
Unfortunately the ERM rule for the zero-one loss on halfspaces is computationally hard to implement (on the other hand in the realizable case it can be written as an LP, we will come back to this a bit later). In practice what we do is to relax the zero-one loss by some convex surrogate $\ell'$ that dominates it pointwise, so that we can then write $L_\mu^{0\text{-}1}(w) \leq L_\mu^{\ell'}(w)$. Thus if somehow we manage to have a small error with the surrogate loss then we also have a small error with the zero-one loss. Two important examples of surrogate losses for classification with halfspaces are the hinge loss (which gives the soft-SVM algorithm, more on that later):

$$\ell'(\langle w, x \rangle, y) = \max\big(0, 1 - y \langle w, x \rangle\big),$$

and the logistic loss (which gives regularized logistic regression):

$$\ell'(\langle w, x \rangle, y) = \log_2\big(1 + \exp(- y \langle w, x \rangle)\big).$$

Using the contraction lemma from the previous section one can easily control the Rademacher complexity with those losses.
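To illustrate the surrogate-loss approach, here is a plain gradient descent sketch for regularized logistic regression (using the natural log rather than base 2, which only rescales the objective; the step size, regularization level and data are illustrative choices).

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_regression_gd(X, y, lam, step=0.5, n_iter=1000):
    """Minimize (1/n) sum_i log(1 + exp(-y_i <w, x_i>)) + lam ||w||_2^2 by gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        margins = y * (X @ w)
        # Gradient of the logistic part: -(1/n) sum_i sigmoid(-margin_i) y_i x_i.
        grad = -(X * (y * sigmoid(-margins))[:, None]).mean(axis=0) + 2.0 * lam * w
        w -= step * grad
    return w

rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = np.sign(X @ w_star + 0.5 * rng.standard_normal(n))
w_hat = logistic_regression_gd(X, y, lam=0.01)
print(np.mean(np.sign(X @ w_hat) != y))                   # empirical zero-one error of the learned halfspace
```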
Convex learning
We say that the loss is convex if for any $(x, y) \in \mathcal{X} \times \mathcal{Y}$ the map $w \in \mathcal{W} \mapsto \ell(\langle w, x \rangle, y)$ is convex (in particular the set of weights $\mathcal{W}$ is a convex set).

Theorem: The following situations are learnable: convex loss with either [bounded gradients, bounded $\mathcal{W}$], or [bounded second derivative, bounded $\mathcal{W}$], or [strongly convex, bounded gradients].
Just as an example assume that $\|w\|_2 \leq B$ for any $w \in \mathcal{W}$, and that for any $(x, y)$ and $g \in \partial_w \ell(\langle w, x \rangle, y)$ (the set of subgradients of the loss with respect to $w$) one has $\|g\|_2 \leq L$. Then running one-pass projected SGD with step-size given by $\eta = \frac{B}{L \sqrt{n}}$ gives, for the averaged iterate $\bar{w}_n$,

$$\mathbb{E}\, L_\mu(\bar{w}_n) \leq \inf_{w \in \mathcal{W}} L_\mu(w) + \frac{B L}{\sqrt{n}}$$

(again this is dimension independent!). Observe that if one runs multiple passes of SGD then this bound does not apply, and one is in fact implementing the ERM rule from the previous section (for which bounds can then be derived via Rademacher complexity as we showed before).
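Here is a minimal sketch of this one-pass projected SGD for the hinge loss over a Euclidean ball of radius $B$ (the data, the choice of $B$, and the Lipschitz estimate below are illustrative; the theoretical guarantee concerns the averaged iterate).

```python
import numpy as np

def one_pass_sgd_hinge(X, y, B):
    """One pass of projected subgradient SGD on the hinge loss over {w : ||w||_2 <= B}.

    Step size B / (L sqrt(n)), with L an upper bound on the subgradient norms;
    returns the average of the iterates.
    """
    n, d = X.shape
    L = np.max(np.linalg.norm(X, axis=1))         # hinge subgradients are bounded by ||x_i||_2
    eta = B / (L * np.sqrt(n))
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for i in range(n):
        if y[i] * (X[i] @ w) < 1:                 # subgradient of max(0, 1 - y <w, x>) is -y x here
            w = w + eta * y[i] * X[i]
        norm = np.linalg.norm(w)
        if norm > B:                              # projection back onto the ball of radius B
            w = w * (B / norm)
        w_sum += w
    return w_sum / n

rng = np.random.default_rng(0)
n, d = 2000, 10
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = np.sign(X @ w_star + 0.3 * rng.standard_normal(n))
w_bar = one_pass_sgd_hinge(X, y, B=3.0)
print(np.mean(np.sign(X @ w_bar) != y))           # training zero-one error of the averaged iterate
```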
By Dmitry February 25, 2016 - 9:25 am
Sebastien, first, thank you for the great notes.
Is there a typo when you divide by $\sqrt{m}$ or $m$ in either the 1st or the 3rd display of *Zero-one loss*?
Also, it seems that $d$ and $n$ both denote the same thing.
By Dmitry February 25, 2016 - 9:32 am
As far as I get it, there should be $1/m$ with no square root in both cases; the $1/\sqrt{m}$ rate is due to the norm of the vectors being bounded by $\sqrt{m}$.
By No-Stupid-Questions? October 31, 2015 - 1:13 am
Sorry to nitpick. I suspect that even with 1-Lipschitz conditional probabilities (say if equal to 0.25 over the entire input space), the 1-nearest neighbor rule might attain at best constant excess risk for some distributions (0.125 for the stated case).
Thanks for putting this set of notes together. I can’t imagine a shorter and as well-written introduction to “rigorous” aspects of statistical learning.
By Sebastien Bubeck October 31, 2015 - 5:22 pm
Yes you’re right, that’s why I put the term “essentially” in the sentence :).
By Lukasz October 14, 2015 - 2:30 pm
Is there an equivalent theory for unsupervised learning / density estimation ?
By Sebastien Bubeck October 20, 2015 - 2:21 am
Well, there is a long-standing debate about this question. As a starting point you can check this piece, “Clustering: Science or Art?” http://jmlr.csail.mit.edu/proceedings/papers/v27/luxburg12a/luxburg12a.pdf
By Paul October 14, 2015 - 8:50 am
Thanks, this was awesome! Is it possible to attach a PDF/LaTeX document for nicer printing?
By Royi October 15, 2015 - 10:20 am
That would be really great.
By Sebastien Bubeck October 20, 2015 - 2:20 am
Hi Paul,
I would prefer to keep this as a blog post because it is not as polished as I would like it to be if it were to be a pdf. That being said I might revisit this post (and the next one coming up) later this year and add further details, in which case I will post a pdf on my website.
Sebastien