Crash course on learning theory, part 1

This week and next week I'm giving 90-minute lectures at MSR on the fundamentals of learning theory. Below you will find my notes for the first lecture, where we covered the basic setting of statistical learning theory, Glivenko-Cantelli classes, Rademacher complexity, VC dimension, linear regression, logistic regression, soft-SVM, the sample complexity of classification with halfspaces, and convex learning with SGD. All of this (and much more) can be found in the excellent book by Shai Shalev-Shwartz and Shai Ben-David (freely available online), which I used heavily to prepare these lectures. Next week we will cover boosting, margin and kernels, stability, and an introduction to neural networks [note: part 2 can be found here].

 

What statistical learning is about

The basic ingredients in statistical learning theory are as follows:

  • Two measurable spaces \mathcal{X} and \mathcal{Y} (the input and output sets).
  • A loss function \ell : \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}.
  • A set \mathcal{H} \subset \mathcal{Y}^{\mathcal{X}} of maps from \mathcal{X} to \mathcal{Y} (the hypothesis class).
  • A set \mathcal{D} of probability measures on \mathcal{X} \times \mathcal{Y}.

The mathematically most elementary goal in learning can now be described as follows: given an i.i.d. sequence S := ((X_i, Y_i))_{i \in [m]} drawn from some unknown measure \mu \in \mathcal{D}, one wants to find a map from \mathcal{X} to \mathcal{Y} that does almost as well as the best map in \mathcal{H} at predicting Y given X when (X,Y) is drawn from the unknown \mu, with prediction mistakes evaluated according to the loss function \ell. In other words, denoting for h \in \mathcal{Y}^{\mathcal{X}}, L_{\mu}(h) = \mathbb{E}_{(X,Y) \sim \mu} \ell(h(X), Y), one is interested in finding a map A : (\mathcal{X} \times \mathcal{Y})^m \rightarrow \mathcal{Y}^{\mathcal{X}} that minimizes:

    \[R_m(A) := \sup_{\mu \in \mathcal{D}} R_{m}(A; \mu) ,\]

where

    \[R_{m}(A; \mu) := \mathbb{E}_{S \sim \mu^{\otimes m}} \; L_{\mu}(A(S)) - \inf_{h \in \mathcal{H}} L_{\mu}(h) .\]

Let us make a few comments/observations about this problem and some of its variants. In all our discussion we restrict to \mathcal{X} \subset \mathbb{R}^n and \mathcal{H} = \{h_w : w \in \mathcal{W} \subset \mathbb{R}^d\}.

  • The two canonical examples of the above framework are the problem of classification with halfspaces (\mathcal{Y} = \{-1,1\}, h_w(x) = \mathrm{sign}(\langle w, x\rangle), and \ell(y,y') = \mathds1\{y \neq y'\}), and the problem of linear regression (\mathcal{Y} = \mathbb{R}, h_w(x) = \langle w, x \rangle, and \ell(y,y') = (y-y')^2). The loss functions just defined are respectively called the zero-one loss and the least-squares loss.
  • The prior knowledge about the relation between inputs and outputs (also called the inductive bias) is expressed by either restricting the comparison set \mathcal{H} (this is called the discriminative approach) or restricting the set of possible distributions \mathcal{D} (the generative approach).
  • One says that (\mathcal{D}, \mathcal{H}, \ell) is learnable if there exists A such that \lim_{m\rightarrow+\infty}R_m(A) \leq 0. The sample complexity of the problem (at scale \epsilon >0) is the smallest m such that there exists A with R_m(A) \leq \epsilon.
  • Obviously some problems are not learnable: say, in classification with the zero-one loss one cannot learn if \mathcal{D} is the set of all distributions on [0,1]^n \times \{-1,1\} and \mathcal{H} = \{-1,1\}^{[0,1]^n}. This is sometimes called the no-free-lunch theorem (note however that this observation really doesn't deserve the name "theorem"). Interestingly it turns out that it doesn't take much (in some sense) to make the problem learnable: if one restricts \mathcal{D} to be the set of distributions such that the conditional probability \eta(x) = \mathbb{P}(Y=1 | X=x) is 1-Lipschitz, then the nearest neighbor rule (i.e., A(S)(x) outputs the label Y_i of the closest point X_i to x; see the sketch after this list) shows that the problem is (essentially) learnable with sample complexity 1/\epsilon^{n+1}. However this is not such good news, as this sample complexity is exponential in the dimension (the so-called "curse of dimensionality"). To get anything practical we will have to strengthen our inductive bias.
  • The stronger notion of PAC learnability replaces the convergence in expectation of L_{\mu}(A(S)) to \inf_{h \in \mathcal{H}} L_{\mu}(h) by a convergence in probability (and this convergence should be uniform in \mu \in \mathcal{D}). Furthermore in basic PAC learning one assumes that \mathcal{D} is restricted to measures such that \inf_{h \in \mathcal{H}} L_{\mu}(h) = 0, also called the realizable case in the context of classification. On the other hand in agnostic PAC learning \mathcal{D} is the set of all measures on \mathcal{X} \times \mathcal{Y}.
  • A weaker notion of learning is to simply require consistency, that is for any \mu \in \mathcal{D} one has R_m(A; \mu) going to zero. One can show that k-Nearest Neighbor is “universally consistent” provided that k grows to infinity at a sublinear rate. This fact is not too surprising if you recall Lusin’s theorem.
  • Proper learning refers to the situation where A is restricted to map into \mathcal{H}.
  • Of course we are interested in the computational complexity of learning too, and the “real” objective of machine learning is to find large and expressive hypothesis classes that are learnable in polynomial sample complexity and polynomial time.
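
To make the nearest neighbor rule above concrete, here is a minimal numpy sketch; the particular 1-Lipschitz conditional probability \eta used to generate the data is an illustrative choice of mine, not from the lecture.

    import numpy as np

    def nearest_neighbor_rule(train_X, train_Y, x):
        """1-NN prediction: output the label of the training point closest to x."""
        return train_Y[np.argmin(np.linalg.norm(train_X - x, axis=1))]

    rng = np.random.default_rng(0)
    n, m = 2, 2000
    X = rng.random((m, n))                      # inputs uniform on [0,1]^n
    eta = 0.5 + 0.4 * np.sin(X.sum(axis=1))     # an (illustrative) 1-Lipschitz eta
    Y = np.where(rng.random(m) < eta, 1, -1)    # labels in {-1, 1}

    x_test = rng.random(n)
    print(nearest_neighbor_rule(X, Y, x_test))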

 

Glivenko-Cantelli and Rademacher complexity

Recall that a set of functions \mathcal{F} \subset \mathbb{R}^{\mathcal{Z}} is a Glivenko-Cantelli class if for any measure \mu on \mathcal{Z}, denoting \hat{\mu}_m = \frac{1}{m} \sum_{i=1}^m \delta_{z_i} for the empirical measure of z \sim \mu^{\otimes m}, one has that \sup_{f \in \mathcal{F}} |\mathbb{E}_{\hat{\mu}_m} f - \mathbb{E}_{\mu} f| goes to 0 in probability as m goes to infinity. A useful technique to get a handle on Glivenko-Cantelli classes is the following symmetrization argument, where \epsilon \sim \mathrm{unif}(\{-1,1\}^m),

    \begin{eqnarray*} \mathbb{E}_z \sup_{f \in \mathcal{F}} |\mathbb{E}_{\hat{\mu}_m} f - \mathbb{E}_{\mu} f| & = & \mathbb{E}_z \sup_{f \in \mathcal{F}} \left| \mathbb{E}_{z'} \frac{1}{m} \sum_{i=1}^m (f(z_i) - f(z_i')) \right| \\ & \leq & \mathbb{E}_{z, z', \epsilon} \sup_{f \in \mathcal{F}} \left| \frac{1}{m} \sum_{i=1}^m \epsilon_i (f(z_i) - f(z_i')) \right| \\ & \leq & 2 \mathbb{E}_{z, \epsilon} \sup_{f \in \mathcal{F}} \left|\frac{1}{m} \sum_{i=1}^m \epsilon_i f(z_i)\right| . \end{eqnarray*}

This last quantity is known as the Rademacher complexity of \mathcal{F} on \mu, and its supremum over \mu is simply called the Rademacher complexity of \mathcal{F}, denoted \mathrm{Rad}_m(\mathcal{F}). In particular, McDiarmid's inequality together with boundedness of the functions in \mathcal{F} directly implies that with probability at least 1-\delta,

(1)   \begin{equation*} \sup_{f \in \mathcal{F}} |\mathbb{E}_{\hat{\mu}_m} f - \mathbb{E}_{\mu} f| \lesssim \mathrm{Rad}_m(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{m}}. \end{equation*}

In other words if the Rademacher complexity of \mathcal{F} goes to 0, then \mathcal{F} is a Glivenko-Cantelli class.
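
The empirical Rademacher complexity on a fixed sample can be estimated by Monte Carlo over the sign vector \epsilon. Below is a small sketch for a class given by its value matrix on the sample; the finite class of linear functions used to exercise it is an illustrative choice of mine.

    import numpy as np

    def empirical_rademacher(F_vals, n_mc=2000, seed=0):
        """Monte Carlo estimate of E_eps sup_f |(1/m) sum_i eps_i f(z_i)|.

        F_vals has shape (num_functions, m): the values f(z_i) for each f."""
        rng = np.random.default_rng(seed)
        k, m = F_vals.shape
        eps = rng.choice([-1.0, 1.0], size=(n_mc, m))
        return np.abs(eps @ F_vals.T / m).max(axis=1).mean()

    # Illustrative finite class: linear functions x -> <w, x> with w in a
    # random finite subset of the unit sphere, on m points of the unit ball.
    rng = np.random.default_rng(1)
    n, m = 50, 200
    X = rng.standard_normal((m, n))
    X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)
    W = rng.standard_normal((500, n))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    print(empirical_rademacher(W @ X.T))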

In learning theory the class of interest is \mathcal{F} = \{(x,y) \mapsto \ell(h(x), y), h \in \mathcal{H}\} (in particular \mathcal{Z} = \mathcal{X} \times \mathcal{Y}). If \mathcal{F} is a Glivenko-Cantelli class (that is, the true loss and the empirical loss are uniformly close over \mathcal{H}) then one obtains a consistent method with the \mathrm{ERM} rule (empirical risk minimization):

    \[\mathrm{ERM}(S) = \mathrm{argmin}_{h \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^m \ell(h(X_i), Y_i) .\]
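
For a finite hypothesis class the \mathrm{ERM} rule is just exhaustive search over \mathcal{H}; a minimal sketch (the names are mine):

    import numpy as np

    def erm(H, loss, S):
        """ERM over a finite class H (a list of callables h: x -> prediction).

        S is the sample, a list of (x, y) pairs."""
        empirical_risk = lambda h: np.mean([loss(h(x), y) for x, y in S])
        return min(H, key=empirical_risk)

For richer classes one of course replaces the exhaustive search by an optimization procedure, as in the examples below.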

Furthermore if the Glivenko-Cantelli convergence is uniform in \mu \in \mathcal{D} (which is the case if, say, the Rademacher complexity \mathrm{Rad}_m(\mathcal{F}) goes to 0) then the problem is agnostic PAC learnable (thanks to (1)).

Bounding the Rademacher complexity (and variants, such as when the Rademacher random variables are replaced by Gaussians) is one of the main objectives of the rich theory of empirical processes. Many tools have been developed for this task and we briefly cover some of them below.

 

Lipschitz loss

The contraction lemma says that if \mathcal{Y} \subset \mathbb{R} and u \mapsto \ell(u, y) is 1-Lipschitz for any y, then \mathrm{Rad}_m(\mathcal{F}) \leq \mathrm{Rad}_m(\mathcal{H}). Thus, if one wants to show that linear regression is PAC learnable (say with \mathcal{Y} = [-1,1]), it suffices to study the Rademacher complexity of linear functions. For instance, if \mathcal{X} and \mathcal{W} are both the unit Euclidean ball \mathrm{B}_2^n, then the Rademacher complexity of linear functions is bounded by 2/\sqrt{m} (in particular it is independent of the dimension, more on that below). Summarizing this discussion, we see that in this case the map

(2)   \begin{equation*} \mathrm{ERM} : (X_i, Y_i)_{i \in [m]} \mapsto \mathrm{argmin}_{w \in \mathrm{B}_2^n} \frac{1}{m} \sum_{i=1}^m (\langle w, X_i\rangle - Y_i)^2 , \end{equation*}

satisfies for any \mu, with probability at least 1- \delta,

    \[L_{\mu}(\mathrm{ERM}(S)) - \inf_{h \in \mathcal{H}} L_{\mu}(h) \lesssim \sqrt{\frac{\log(1/\delta)}{m}} .\]
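
A minimal way to implement the constrained ERM (2) is projected gradient descent on the empirical squared loss; the step size and iteration count below are illustrative choices of mine, not tuned values.

    import numpy as np

    def erm_ball(X, Y, steps=500, lr=0.1):
        """Projected gradient descent for argmin_{|w| <= 1} (1/m) |Xw - Y|^2."""
        m, n = X.shape
        w = np.zeros(n)
        for _ in range(steps):
            w -= lr * (2.0 / m) * X.T @ (X @ w - Y)   # gradient step
            norm = np.linalg.norm(w)
            if norm > 1.0:                            # project onto the unit ball
                w /= norm
        return w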

Also observe that one can rewrite the optimization problem (2) by Lagrangian duality as follows, for some \lambda>0 and with obvious notations for \mathbb{X} and \mathbb{Y},

    \[\mathrm{argmin}_{w \in \mathbb{R}^n} \frac1{m} |\mathbb{X} w - \mathbb{Y}|^2 + \lambda |w|^2 ,\]

which is sometimes called the ridge regression problem. The added penalty \lambda |\cdot|^2 is thought of as a regularizer, and from the perspective of this section it enforces a small Rademacher complexity. The tradeoff is that larger values of \lambda lead to a smaller Rademacher complexity but also to a smaller comparison set \mathcal{H}. These two types of error should be thought of as the estimation and approximation errors.
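
Ridge regression also admits a closed-form solution: setting the gradient of the objective to zero gives the normal equations (\mathbb{X}^{\top} \mathbb{X}/m + \lambda I) w = \mathbb{X}^{\top} \mathbb{Y}/m. A sketch:

    import numpy as np

    def ridge(X, Y, lam):
        """Closed-form solution of argmin_w (1/m)|Xw - Y|^2 + lam |w|^2."""
        m, n = X.shape
        return np.linalg.solve(X.T @ X / m + lam * np.eye(n), X.T @ Y / m)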

Another interesting example is the LASSO problem

    \[\mathrm{argmin}_{w \in \mathbb{R}^n} \frac1{m} |\mathbb{X} w - \mathbb{Y}|^2 + \lambda \|w\|_1 ,\]

which corresponds to \mathcal{W} being the \ell_1-unit ball (up to scaling). In this case, if \mathcal{X} is the \ell_{\infty}-unit ball, one can show that the Rademacher complexity of linear functions is of order \sqrt{\log(n) / m}. This corresponds in some sense to the inductive bias that the optimal weight vector is sparse.
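
The LASSO objective is convex but non-smooth in w, so a standard solver is proximal gradient descent (ISTA), where the proximal step is soft-thresholding; a sketch, with the step size set by the gradient Lipschitz constant of the smooth part:

    import numpy as np

    def lasso_ista(X, Y, lam, steps=1000):
        """ISTA for argmin_w (1/m)|Xw - Y|^2 + lam ||w||_1."""
        m, n = X.shape
        L = (2.0 / m) * np.linalg.norm(X, 2) ** 2   # Lipschitz constant of the gradient
        w = np.zeros(n)
        for _ in range(steps):
            grad = (2.0 / m) * X.T @ (X @ w - Y)
            z = w - grad / L                                        # gradient step
            w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-thresholding
        return w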

Finally let us make an important remark about scaling issues. In practice features are often normalized individually, i.e., \mathcal{X} = [-1,1]^n, in which case the Rademacher complexity (with \mathcal{W} = \mathrm{B}_2^n) scales with the Euclidean radius of \mathcal{X}, which is \sqrt{n}. However if the examples are all sparse then the dimension is in fact replaced by the sparsity, which can make a huge difference in practice (think of the case where n is the number of words in the English dictionary, and our data points x_i are articles, which are typically much shorter than the dictionary). We see that ridge regression and LASSO are after different kinds of sparsity: the former exploits sparsity of the data points themselves, while the latter exploits sparsity of the set of features that matter for prediction.

 

Zero-one loss

For the zero-one loss we cannot use the contraction lemma, as this loss is not Lipschitz. Instead we use Massart's lemma, which states that if A \subset \mathrm{B}_2^m then

    \[\mathbb{E}_{\epsilon}\sup_{a \in A} \left|\frac{1}{m} \sum_{i=1}^m \epsilon_i a_i\right|\leq \frac{\sqrt{2 \log |A|}}{m} .\]

Recall that in learning with the zero-one loss one has \mathcal{F} = \{(x,y) \mapsto \mathds1\{h(x) \neq y\}, h \in \mathcal{H}\}.

Thus we need to bound the cardinality of the set of vectors \{(\mathds1\{h(x_1) \neq y_1\}, \hdots, \mathds1\{h(x_m) \neq y_m\}), h \in \mathcal{H}\}. The Vapnik-Chervonenkis dimension is introduced exactly for this task: for a hypothesis class \mathcal{H} \subset \{-1,1\}^{\mathcal{X}}, \mathrm{VC}(\mathcal{H}) is the largest integer k such that one can find a set \{x_1, \hdots, x_k\} \subset \mathcal{X} shattered by \mathcal{H}, that is |\{(h(x_1), \hdots, h(x_k)), h \in \mathcal{H}\}| = 2^k. The Sauer-Shelah lemma then gives (for m \geq \mathrm{VC}(\mathcal{H}))

    \[\left|\left\{(\mathds1\{h(x_1) \neq y_1\}, \hdots, \mathds1\{h(x_m) \neq y_m\}), h \in \mathcal{H}\right\}\right| \leq \left(\frac{e m}{\mathrm{VC}(\mathcal{H})}\right)^{\mathrm{VC}(\mathcal{H})}.\]

In particular we obtain:

    \[\mathrm{Rad}_m(\mathcal{F}) \leq \sqrt{\frac{2 \mathrm{VC}(\mathcal{H}) \log\left( \frac{e m}{\mathrm{VC}(\mathcal{H})} \right)}{m}} .\]

As a rule of thumb the VC dimension is of the order of the number of parameters defining the hypothesis class (obviously this is not always true…). For instance one always has \mathrm{VC}(\mathcal{H}) \leq \log_2(|\mathcal{H}|), and the VC dimension of halfspaces in \mathbb{R}^d is d.
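
One can probe the VC dimension of halfspaces numerically by counting the sign patterns realized on a fixed set of points; here is a small sketch in \mathbb{R}^2 (sampling w at random only certifies the patterns it happens to find, so these counts are lower bounds that match the true values here with high probability).

    import numpy as np

    def num_sign_patterns(points, n_samples=100000, seed=0):
        """Count distinct (sign<w,x_1>, ..., sign<w,x_k>) patterns over random w."""
        rng = np.random.default_rng(seed)
        W = rng.standard_normal((n_samples, points.shape[1]))
        return len(np.unique(np.sign(W @ points.T), axis=0))

    rng = np.random.default_rng(2)
    print(num_sign_patterns(rng.standard_normal((2, 2))))  # 4 = 2^2: shattered
    print(num_sign_patterns(rng.standard_normal((3, 2))))  # at most 6 < 2^3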

Unfortunately the ERM rule for the zero-one loss on halfspaces is computationally hard to implement (on the other hand, in the realizable case it can be written as an LP; we will come back to this a bit later). In practice one relaxes the zero-one loss to some convex surrogate \ell that dominates it, so that L_{\mu}^{0-1} \leq L_{\mu}^{\ell}. Thus if somehow we manage to have a small error with the surrogate loss then we also have a small error with the zero-one loss. Two important examples of surrogate losses for classification with halfspaces are the hinge loss (which gives the soft-SVM algorithm, more on that later):

    \[\ell(h_w, (x,y)) = \max(0, 1 - y \langle w, x\rangle) ,\]

and the logistic loss (which gives regularized logistic regression):

    \[\ell(h_w, (x,y)) = \log_2(1+\exp(- y \langle w, x\rangle)).\]

Using the contraction lemma from the previous section one can easily control the Rademacher complexity with those losses.
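
As a sanity check, both surrogates dominate the zero-one loss pointwise as functions of the margin t = y \langle w, x\rangle, which is exactly what gives L_{\mu}^{0-1} \leq L_{\mu}^{\ell}; a quick numerical verification:

    import numpy as np

    t = np.linspace(-3.0, 3.0, 601)         # margin t = y * <w, x>
    zero_one = (t <= 0).astype(float)       # zero-one loss as a function of t
    hinge = np.maximum(0.0, 1.0 - t)
    logistic = np.log2(1.0 + np.exp(-t))    # base-2 log makes logistic(0) = 1

    assert np.all(hinge >= zero_one) and np.all(logistic >= zero_one)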

 

Convex learning

We say that the loss is convex if for any (x, y) \in \mathcal{X} \times \mathcal{Y} the map w \in \mathcal{W} \mapsto \ell(h_w(x), y) is convex (in particular \mathcal{W} is a convex set).

Theorem: The following situations are learnable: convex loss with either [bounded gradients, bounded \mathcal{W}], or [bounded second derivative, bounded \mathcal{W}], or [strongly convex, bounded gradients].

As an example, assume that for any w \in \mathcal{W}, |w| \leq R, and that for any x, y, w, and any g \in \partial_w \ell(h_w(x), y) (the set of subgradients of \ell with respect to w), one has |g| \leq L.

Then running one-pass SGD with step size \frac{R}{L} \sqrt{\frac{2}{t}} at step t gives R_m(\mathrm{SGD}) \leq R L \sqrt{\frac{2}{m}} (again, this is dimension independent!). Observe that if one runs multiple passes of SGD then this bound no longer applies, and one is in fact implementing the ERM rule from the previous section (for which bounds can then be derived via Rademacher complexity, as we showed before).
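
Here is a minimal sketch of projected one-pass SGD under the stated assumptions (iterates kept in the ball of radius R, subgradient norms bounded by L); returning the average of the iterates is the standard way to obtain a guarantee of this type. The hinge-loss subgradient at the end is an illustrative instantiation of mine.

    import numpy as np

    def one_pass_sgd(S, subgrad, w0, R, L):
        """One pass of projected SGD over the sample S; returns the averaged iterate.

        subgrad(w, z) must return a subgradient of w -> loss(w, z)."""
        w, avg = w0.copy(), np.zeros_like(w0)
        for t, z in enumerate(S, start=1):
            avg += (w - avg) / t                            # running average of w_1, ..., w_m
            w = w - (R / L) * np.sqrt(2.0 / t) * subgrad(w, z)
            norm = np.linalg.norm(w)
            if norm > R:                                    # project back onto |w| <= R
                w *= R / norm
        return avg

    # Illustrative use with the hinge loss (soft-SVM style), z = (x, y):
    def hinge_subgrad(w, z):
        x, y = z
        return -y * x if y * np.dot(w, x) < 1 else np.zeros_like(w)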
