It might be useful to refresh your memory on the concepts we saw in part 1 (particularly the notions of VC dimension and Rademacher complexity). In this second and last part we will discuss two of the most successful algorithmic paradigms in learning: boosting and SVM. Note that, just like last time, each topic we cover has its own books dedicated to it (see for example the boosting book by Schapire and Freund, and the SVM book by Scholkopf and Smola). Finally we conclude our short tour of the basics of learning theory with a simple observation: stable algorithms generalize well.
Boosting
Say that, given any distribution $D$ supported on the $m$ training points $x_1, \dots, x_m$ (with labels $y_1, \dots, y_m \in \{-1, 1\}$), one can find (efficiently) a classifier $h \in \mathcal{B}$ such that $\sum_{i=1}^m D(i) \mathbb{1}\{h(x_i) \neq y_i\} \leq \frac{1}{2} - \gamma$ (here we are in the context of classification with the zero-one loss). Can we “boost” this weak learning algorithm into a strong learning algorithm, with training error arbitrarily small for a number of rounds $T$ large enough? It turns out that this is possible, and even simple. The idea is to build a linear combination of hypotheses in $\mathcal{B}$ with a greedy procedure. That is, at time step $t$ our hypothesis is (the sign of) $f_t = \sum_{s=1}^{t} w_s h_s$, and we are now looking to add $h_{t+1} \in \mathcal{B}$ with an appropriate weight $w_{t+1} \in \mathbb{R}$. A natural guess is to optimize over $(w_{t+1}, h_{t+1})$ to minimize the training error of $f_t + w_{t+1} h_{t+1}$ on our sample $(x_1, y_1), \dots, (x_m, y_m)$. This might be a difficult computational problem (how do you optimize over $\mathcal{B}$?), and furthermore we would like to make use of our efficient weak learning algorithm. The key trick is that the zero-one loss is upper bounded by the exponential loss. More precisely:
$$\frac{1}{m} \sum_{i=1}^m \mathbb{1}\{\mathrm{sign}(f_t(x_i) + w h(x_i)) \neq y_i\} \leq \frac{1}{m} \sum_{i=1}^m \exp\big(-y_i (f_t(x_i) + w h(x_i))\big) = \frac{Z_t}{m} \sum_{i=1}^m D_t(i) \exp\big(-y_i w h(x_i)\big),$$
where $Z_t = \sum_{j=1}^m \exp(-y_j f_t(x_j))$ and $D_t(i) = \exp(-y_i f_t(x_i)) / Z_t$. From this we see that we would like $h$ to be a good predictor for the distribution $D_t$. Thus we can pass $D_t$ to the weak learning algorithm, which in turn gives us $h_{t+1}$ with
$$\epsilon_{t+1} := \sum_{i=1}^m D_t(i) \mathbb{1}\{h_{t+1}(x_i) \neq y_i\} \leq \frac{1}{2} - \gamma.$$
Thus we now have:
$$\sum_{i=1}^m D_t(i) \exp\big(-y_i w h_{t+1}(x_i)\big) = \epsilon_{t+1} e^{w} + (1 - \epsilon_{t+1}) e^{-w}.$$
Optimizing the above expression one finds that $w_{t+1} = \frac{1}{2} \log\left(\frac{1 - \epsilon_{t+1}}{\epsilon_{t+1}}\right)$ leads to (using $\epsilon_{t+1} \leq \frac{1}{2} - \gamma$)
$$\epsilon_{t+1} e^{w_{t+1}} + (1 - \epsilon_{t+1}) e^{-w_{t+1}} = 2 \sqrt{\epsilon_{t+1}(1 - \epsilon_{t+1})} \leq \sqrt{1 - 4\gamma^2} \leq \exp(-2\gamma^2).$$
The procedure we just described is called AdaBoost (introduced by Schapire and Freund), and since the above shows $Z_{t+1} \leq Z_t \exp(-2\gamma^2)$ (with $Z_0 = m$ as $f_0 = 0$), we proved that it satisfies
$$\frac{1}{m} \sum_{i=1}^m \mathbb{1}\{\mathrm{sign}(f_T(x_i)) \neq y_i\} \leq \frac{Z_T}{m} \leq \exp(-2\gamma^2 T). \qquad (1)$$
In particular we see that our weak learner assumption implies that the training sample is realizable (and in fact realizable with a margin of order $\gamma$, see the next section for the definition of margin) with the hypothesis class:
$$\mathcal{L}(\mathcal{B}) = \left\{ x \mapsto \mathrm{sign}\left( \sum_{t=1}^{T} w_t h_t(x) \right) \ : \ T \in \mathbb{N}, \ w_1, \dots, w_T \in \mathbb{R}, \ h_1, \dots, h_T \in \mathcal{B} \right\}.$$
This class can be thought of as a neural network with one (infinite size) hidden layer. To realize how expressive $\mathcal{L}(\mathcal{B})$ is compared to $\mathcal{B}$ it’s a useful exercise to think about the very basic case of decision stumps (for which empirical risk minimization can be implemented very efficiently):
$$\mathcal{B} = \left\{ x \mapsto \sigma \, \mathrm{sign}\big(\theta - x(j)\big) \ : \ \theta \in \mathbb{R}, \ j \in [n], \ \sigma \in \{-1, 1\} \right\}.$$
To derive a bound on the true risk of AdaBoost it remains to calculate the VC dimension of the class $\mathcal{L}_T(\mathcal{B})$ where the size of the hidden layer is $T$ (that is, the linear combination involves at most $T$ hypotheses from $\mathcal{B}$). This follows from more general results on the VC dimension of neural networks, and up to logarithmic factors one obtains that $\mathrm{VC}(\mathcal{L}_T(\mathcal{B}))$ is of order $T \cdot \mathrm{VC}(\mathcal{B})$. Putting this together with (1) we see that when $\gamma$ is a constant, one should run AdaBoost for $T$ of order $\log(m)$ rounds (so that the training error of $\mathrm{sign}(f_T)$ vanishes), and then one gets, up to logarithmic factors, a true risk of order $\sqrt{\mathrm{VC}(\mathcal{B}) / m}$.
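To make the procedure concrete, here is a minimal numpy sketch of AdaBoost with decision stumps as the weak learner. The exhaustive stump search and the toy data are my own illustrative choices (not code from the original exposition); the reweighting $D_t$ and the choice of $w_{t+1}$ follow the formulas above.

```python
import numpy as np

def stump_predict(X, j, theta, s):
    # decision stump: s * sign(theta - x(j)), with ties broken towards +1
    return s * np.where(theta - X[:, j] >= 0, 1.0, -1.0)

def weak_learner(X, y, D):
    # exhaustive search for the stump minimizing the D-weighted zero-one error
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for theta in np.concatenate(([X[:, j].min() - 1.0], X[:, j])):
            for s in (-1.0, 1.0):
                err = np.sum(D * (stump_predict(X, j, theta, s) != y))
                if err < best_err:
                    best, best_err = (j, theta, s), err
    return best, best_err

def adaboost(X, y, T):
    m = X.shape[0]
    f = np.zeros(m)                        # f_t evaluated on the sample
    stumps, weights = [], []
    for t in range(T):
        D = np.exp(-y * f)
        D /= D.sum()                       # D_t(i) = exp(-y_i f_t(x_i)) / Z_t
        (j, theta, s), eps = weak_learner(X, y, D)
        eps = np.clip(eps, 1e-12, 1 - 1e-12)
        w = 0.5 * np.log((1 - eps) / eps)  # w_{t+1} = (1/2) log((1 - eps_{t+1}) / eps_{t+1})
        f += w * stump_predict(X, j, theta, s)
        stumps.append((j, theta, s)); weights.append(w)
    return stumps, weights, np.mean(np.sign(f) != y)   # training error of sign(f_T)

# toy run: labels given by an interval, which a single stump cannot represent
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 1))
y = np.where(np.abs(X[:, 0]) < 0.5, 1.0, -1.0)
print(adaboost(X, y, T=20)[2])             # training error of sign(f_T), decreasing with T
```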
Margin
We consider $\mathcal{D}$ to be the set of distributions $\mu$ such that there exists $w^*$ with $\mathbb{P}_{(x, y) \sim \mu}\big(y \langle w^*, x \rangle \geq 1\big) = 1$ (again we are in the context of classification with the zero-one loss, and this assumption means that the data is almost surely realizable). The SVM idea is to search for the vector $\hat{w}$ with minimal Euclidean norm such that $y_i \langle \hat{w}, x_i \rangle \geq 1$ for all $i \in [m]$. Effectively this is doing empirical risk minimization over the following data dependent hypothesis class (which we write in terms of the set of admissible weight vectors):
$$\mathcal{W} = \big\{ w : \|w\|_2 \leq \|\hat{w}\|_2 \big\}.$$
The key point is that we can now use the contraction lemma to bound the Rademacher complexity of this class. Indeed, replacing the zero-one loss by the (Lipschitz!) “ramp loss”
$$\ell_{\mathrm{ramp}}(z) = \min\big(1, \max(0, 1 - z)\big), \quad \text{applied to } z = y \langle w, x \rangle,$$
makes no difference for the optimum $\hat{w}$, and our estimated weight still has $0$ training error while its true (zero-one) loss is only overestimated. Using the argument from previous sections we see that the Rademacher complexity of our hypothesis class (with respect to the ramp loss) is bounded by $\frac{\|\hat{w}\|_2}{\sqrt{m}}$ (assuming the examples $x_i$ are normalized to be in the Euclidean ball). Now it is easy to see that the existence of $w^*$ with $y \langle w^*, x \rangle \geq 1$ almost surely (and $\|w^*\|_2 \leq 1/\gamma$) exactly corresponds to a geometric margin of $\gamma$ between positive and negative examples (indeed the margin is exactly $1 / \|w^*\|_2$), and since $w^*$ is feasible for the SVM program one has $\|\hat{w}\|_2 \leq \|w^*\|_2 \leq 1/\gamma$. To summarize, we just saw that under the $\gamma$-margin condition the SVM algorithm has a sample complexity of order $\frac{1}{\gamma^2 \epsilon^2}$. This suggests that from an estimation perspective one should map the points into a high-dimensional space so that one could hope to have the separability condition (with margin). However this raises computational challenges, as the QP given by the SVM can suddenly look daunting. This is where kernels come into the picture.
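As a quick illustration, here is a minimal sketch of the hard-margin SVM QP on separable toy data. It assumes cvxpy is available, and the data-generation choices (the direction $w^*$ and margin $\gamma = 0.2$) are placeholders of mine.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
w_star, gamma = np.array([1.0, 0.0]), 0.2

# points in the unit ball, kept at distance at least gamma from the hyperplane <w_star, x> = 0
X = rng.uniform(-1, 1, (500, 2))
X = X[np.linalg.norm(X, axis=1) <= 1]
X = X[np.abs(X @ w_star) >= gamma]
y = np.sign(X @ w_star)

w = cp.Variable(2)
prob = cp.Problem(cp.Minimize(cp.sum_squares(w)),    # minimal Euclidean norm ...
                  [cp.multiply(y, X @ w) >= 1])      # ... subject to y_i <w, x_i> >= 1
prob.solve()
print(1.0 / np.linalg.norm(w.value))                 # geometric margin 1/||w_hat||, at least about gamma
```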
Kernels
So let’s go overboard and map the points to an infinite dimensional Hilbert space $\mathcal{H}$ (as we will see in the next subsection this notation will be consistent with $\mathcal{H}$ being the hypothesis class). Denote $\psi : \mathcal{X} \to \mathcal{H}$ for this map, and let $k(x, x') = \langle \psi(x), \psi(x') \rangle_{\mathcal{H}}$ be the kernel associated with it. The key point is that we are not using all the dimensions of our Hilbert space in the SVM optimization problem, but rather we are effectively working in the subspace spanned by $\psi(x_1), \dots, \psi(x_m)$ (this is because we are only working with inner products with those vectors, and we are trying to minimize the norm of the resulting vector). Thus we can restrict our search to $w = \sum_{i=1}^m \alpha_i \psi(x_i)$ with $\alpha \in \mathbb{R}^m$ (this fact is called the representer theorem, and the previous sentence is the proof…). The beauty is that now we only need to compute the Gram matrix $K = \big(k(x_i, x_j)\big)_{i, j \in [m]}$, as we only need to consider $\langle w, \psi(x_i) \rangle = (K\alpha)_i$ and $\|w\|^2 = \alpha^{\top} K \alpha$. In particular we never need to compute the points $\psi(x_i)$ (which anyway could be infinite dimensional, so we couldn’t really write them down…). Note that the same trick would work with soft SVM (i.e., regularized hinge loss). To drive the point home let’s see an example: $\psi(x) = \big(x(i)\, x(j)\big)_{i, j \in [n]}$ leads to $k(x, x') = \langle x, x' \rangle^2$. I guess it doesn’t get much better than this :).
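A two-line numerical check of this example (the specific vectors are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(5), rng.standard_normal(5)

psi = lambda v: np.outer(v, v).ravel()   # explicit feature map (x(i) x(j))_{i,j}, n^2 coordinates
print(psi(x) @ psi(xp), (x @ xp) ** 2)   # the two numbers coincide: <psi(x), psi(x')> = <x, x'>^2
```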
Despite all this beauty, one should note that we now have to manipulate an object of size $m^2$ (the kernel matrix $K$), and in our big data days this can be a disaster. We really want to focus on methods with computational complexity linear in $m$, and thus one is led to the problem of kernel approximation, which we will explore below. But first let us explore a bit further what kernel SVM is really doing.
RKHS and the inductive bias of kernels
As we just saw, in kernel methods the true central element is the kernel rather than the embedding $\psi$. In particular, since the only thing that matters are the inner products $\langle \psi(x), \psi(x') \rangle$, we might as well assume that $\psi(x) = k(x, \cdot)$, and that $\mathcal{H}$ is the completion of $\mathrm{span}\{k(x, \cdot), \ x \in \mathcal{X}\}$ where the inner product is defined by $\langle k(x, \cdot), k(x', \cdot) \rangle_{\mathcal{H}} = k(x, x')$ (and the definition is extended to $\mathcal{H}$ by linearity). Assuming that $k$ is positive definite (that is, Gram matrices built from $k$ are positive semi-definite) one obtains a well-defined Hilbert space $\mathcal{H}$. Furthermore this Hilbert space has a special property: for any $f \in \mathcal{H}$ and $x \in \mathcal{X}$, $f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}}$. In other words $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS), and in fact any RKHS can be obtained with the above construction (this is a simple consequence of the Riesz representation theorem). Now observe that we can rewrite the kernel SVM problem
$$\min_{\alpha \in \mathbb{R}^m} \ \alpha^{\top} K \alpha \quad \text{s.t.} \quad y_i (K\alpha)_i \geq 1, \ \forall i \in [m],$$
as
$$\min_{f \in \mathcal{H}} \ \|f\|_{\mathcal{H}}^2 \quad \text{s.t.} \quad y_i f(x_i) \geq 1, \ \forall i \in [m].$$
While the first formulation is computationally more effective, the second sheds light on what we are really doing: simply searching for the consistent (with margin) hypothesis in $\mathcal{H}$ with smallest norm. In other words, thinking in terms of inductive bias, one should choose a kernel for which the norm $\|\cdot\|_{\mathcal{H}}$ represents the kind of smoothness one expects in the mapping from input to output (more on that next).
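Before moving on, here is a minimal sketch of the first (computational) formulation above with a Gaussian kernel, again assuming cvxpy is available; the toy data and the Cholesky rewriting of $\alpha^{\top} K \alpha$ as a sum of squares are implementation choices of mine.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
m = 40
X = rng.uniform(-1, 1, (m, 2))
y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 0.5)             # labels no linear classifier can realize

d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K = np.exp(-d2 / (2 * 0.3 ** 2))                           # Gaussian kernel Gram matrix
L = np.linalg.cholesky(K + 1e-8 * np.eye(m))               # tiny jitter for numerical stability

alpha = cp.Variable(m)
prob = cp.Problem(cp.Minimize(cp.sum_squares(L.T @ alpha)),    # = alpha^T K alpha
                  [cp.multiply(y, K @ alpha) >= 1])            # y_i (K alpha)_i >= 1
prob.solve()
print(prob.status, np.mean(np.sign(K @ alpha.value) != y))     # feasible here, zero training error
```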
It should also be clear now that one can “kernelize” any regularized empirical risk minimization, that is instead of the boring (note that here the loss is defined on $\mathbb{R} \times \mathcal{Y}$ instead of $\mathcal{Y} \times \mathcal{Y}$)
$$\min_{w \in \mathbb{R}^n} \ \frac{1}{m} \sum_{i=1}^m \ell\big(\langle w, x_i \rangle, y_i\big) + \lambda \|w\|_2^2,$$
one can consider the much more exciting
$$\min_{f \in \mathcal{H}} \ \frac{1}{m} \sum_{i=1}^m \ell\big(f(x_i), y_i\big) + \lambda \|f\|_{\mathcal{H}}^2,$$
since this can be equivalently written as
$$\min_{\alpha \in \mathbb{R}^m} \ \frac{1}{m} \sum_{i=1}^m \ell\big((K\alpha)_i, y_i\big) + \lambda\, \alpha^{\top} K \alpha.$$
This gives kernel ridge regression, kernel logistic regression, etc.
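For the squared loss the last program has a closed-form solution, $\alpha = (K + \lambda m I)^{-1} y$ (set the gradient to zero to check). A minimal numpy sketch, where the Gaussian kernel and the toy one-dimensional data are placeholder choices of mine:

```python
import numpy as np

def gaussian_gram(X, Z, sigma=0.3):
    # K[i, j] = exp(-||X[i] - Z[j]||^2 / (2 sigma^2))
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
m, lam = 200, 1e-2
X = rng.uniform(-1, 1, (m, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(m)   # noisy regression data

K = gaussian_gram(X, X)
alpha = np.linalg.solve(K + lam * m * np.eye(m), y)      # kernel ridge regression

X_test = np.linspace(-1, 1, 5)[:, None]
print(gaussian_gram(X_test, X) @ alpha)                  # f(x) = sum_i alpha_i k(x, x_i), roughly sin(3x)
```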
Translation invariant kernels and low-pass filtering
We will now investigate a bit further the RKHS that one obtains with a translation invariant kernel, that is $k(x, x') = \varphi(x - x')$ for some $\varphi : \mathbb{R}^n \to \mathbb{R}$. A beautiful theorem of Bochner characterizes the continuous maps $\varphi$ (with $\varphi(0) = 1$) for which such a $k$ is a positive definite kernel: it is necessary and sufficient that $\varphi$ is the characteristic function of a probability measure $\mu$, that is
$$\varphi(x) = \int_{\mathbb{R}^n} \exp\big(i \langle \omega, x \rangle\big) \, d\mu(\omega).$$
An important example in practice is the Gaussian kernel $k(x, x') = \exp\left(- \frac{\|x - x'\|^2}{2 \sigma^2}\right)$ (this corresponds to mapping $x$ to the function $\psi(x) = k(x, \cdot) = \exp\left(- \frac{\|x - \cdot\|^2}{2 \sigma^2}\right)$). One can check that in this case $\mu$ is itself a Gaussian (centered and with covariance $\sigma^{-2} \mathrm{I}_n$).
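A quick Monte Carlo sanity check of this correspondence (the dimension, bandwidth and test points below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 3, 0.7
x, xp = rng.standard_normal(n), rng.standard_normal(n)

omega = rng.standard_normal((200_000, n)) / sigma       # omega ~ N(0, sigma^{-2} I)
mc = np.mean(np.cos(omega @ (x - xp)))                  # E_omega cos(<omega, x - x'>)
exact = np.exp(-np.sum((x - xp) ** 2) / (2 * sigma**2)) # the Gaussian kernel itself
print(mc, exact)                                        # the two numbers should be close
```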
Now let us restrict our attention to the case where $\mu$ has a density $h$ with respect to the Lebesgue measure, that is $\varphi(x) = \int_{\mathbb{R}^n} \exp(i \langle \omega, x \rangle) h(\omega) \, d\omega$. A standard calculation then shows that (up to a constant factor depending on the normalization of the Fourier transform)
$$\|f\|_{\mathcal{H}}^2 \propto \int_{\mathbb{R}^n} \frac{|\hat{f}(\omega)|^2}{h(\omega)} \, d\omega,$$
which implies in particular
$$\mathcal{H} = \left\{ f \ : \ \int_{\mathbb{R}^n} \frac{|\hat{f}(\omega)|^2}{h(\omega)} \, d\omega < +\infty \right\}.$$
Note that for the Gaussian kernel one has $h(\omega) \propto \exp\left(-\frac{\sigma^2 \|\omega\|^2}{2}\right)$, that is the high frequencies in $f$ are severely penalized in the RKHS norm. Also note that smaller values of $\sigma$ correspond to less regularization, which is what one would have expected from the feature map representation (indeed the features $\exp\left(-\frac{\|x - \cdot\|^2}{2\sigma^2}\right)$ are more localized around the data point $x$ for smaller values of $\sigma$).
To summarize, SVM with a translation invariant kernel corresponds to some kind of soft low-pass filtering, where the exact form of the penalization of the higher frequencies depends on the specific kernel being used (smoother kernels lead to more penalization).
Random features
Let us now come back to computational issues. As we pointed out before, the vanilla kernel method has at least a quadratic cost in the number of data points. A common approach to reduce this cost is to use a low rank approximation of the Gram matrix (indeed thanks to the i.i.d. assumption there is presumably a lot of redundancy in the Gram matrix), or to resort to online algorithms (see for example the forgetron of Dekel, Shalev-Shwartz and Singer). Another idea is to approximate the feature map itself (a priori this doesn’t sound like a good idea, since as we explained above the beauty of kernels is that we avoid computing this feature map). We now describe an elegant and powerful approximation of the feature map (for translation invariant kernels) proposed by Rahimi and Recht which is based on random features.
Let $k(x, x') = \varphi(x - x')$ be a translation invariant kernel and $\mu$ its corresponding probability measure. Let’s rewrite $k$ in a convenient form, using $\cos(u - v) = \mathbb{E}_{b \sim \mathrm{Unif}([0, 2\pi])}\big[2 \cos(u + b) \cos(v + b)\big]$ and Bochner’s theorem (taking real parts),
$$k(x, x') = \mathbb{E}_{\omega \sim \mu}\big[\cos(\langle \omega, x - x' \rangle)\big] = \mathbb{E}_{\omega \sim \mu, \, b \sim \mathrm{Unif}([0, 2\pi])}\big[2 \cos(\langle \omega, x \rangle + b) \cos(\langle \omega, x' \rangle + b)\big].$$
A simple idea is thus to build the following random feature map: given $\omega_1, \dots, \omega_d$ and $b_1, \dots, b_d$, i.i.d. draws from respectively $\mu$ and $\mathrm{Unif}([0, 2\pi])$, let $\tilde{\psi} : \mathbb{R}^n \to \mathbb{R}^d$ be defined by
$$\tilde{\psi}(x) = \frac{1}{\sqrt{d}} \big(z_{\omega_1, b_1}(x), \dots, z_{\omega_d, b_d}(x)\big),$$
where
$$z_{\omega, b}(x) = \sqrt{2} \cos\big(\langle \omega, x \rangle + b\big).$$
For $d$ of order $\frac{n}{\epsilon^2} \log\left(\frac{n}{\delta \epsilon}\right)$ it is an easy exercise to verify that with probability at least $1 - \delta$ (provided $\mu$ has a second moment at most polynomial in $n$) one will have, for any $x, x'$ in some compact set (with diameter at most polynomial in $n$),
$$\big| \langle \tilde{\psi}(x), \tilde{\psi}(x') \rangle - k(x, x') \big| \leq \epsilon.$$
The SVM problem can now be approximated by:
$$\min_{w \in \mathbb{R}^d} \ \|w\|_2^2 \quad \text{s.t.} \quad y_i \langle w, \tilde{\psi}(x_i) \rangle \geq 1, \ \forall i \in [m].$$
This optimization problem is potentially much simpler than the vanilla kernel SVM when $m$ is much bigger than $d$ (essentially $d$ replaces $m$ for most computational aspects of the problem, including the space/time complexity of prediction after training).
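Here is a minimal numpy sketch of the random feature map for the Gaussian kernel, checking the quality of the kernel approximation (the bandwidth $\sigma = 1$ and the number of features are arbitrary choices of mine):

```python
import numpy as np

def random_features(X, d, sigma=1.0, rng=np.random.default_rng(0)):
    # psi_tilde(x)_j = sqrt(2/d) cos(<omega_j, x> + b_j), omega_j ~ N(0, sigma^{-2} I), b_j ~ Unif[0, 2 pi]
    n = X.shape[1]
    Omega = rng.standard_normal((n, d)) / sigma
    b = rng.uniform(0, 2 * np.pi, d)
    return np.sqrt(2.0 / d) * np.cos(X @ Omega + b)

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 4))
Phi = random_features(X, d=20_000)

d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
print(np.max(np.abs(Phi @ Phi.T - np.exp(-d2 / 2))))    # max |<psi_tilde, psi_tilde> - k|, small for large d
```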
Stability
We conclude our tour of the basic topics in statistical learning with a different point of view on generalization that was put forward by Bousquet and Elisseeff.
Let’s start with a simple observation. Let $S' = (z_1', \dots, z_m')$ be an independent copy of $S = (z_1, \dots, z_m)$, and denote $S^{(i)} = (z_1, \dots, z_{i-1}, z_i', z_{i+1}, \dots, z_m)$ for the sample where the $i$-th example has been replaced by its independent copy. Then one can write, using the slight abuse of notation $\ell(h, z) = \ell(h(x), y)$ for $z = (x, y)$,
$$\mathbb{E}\big[L_{\mu}(A(S)) - L_{S}(A(S))\big] = \frac{1}{m} \sum_{i=1}^m \mathbb{E}\big[\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i)\big],$$
where $L_S$ denotes the empirical risk on $S$ (the identity follows since $z_i$ is independent of $S^{(i)}$, so that $\mathbb{E}\, \ell(A(S^{(i)}), z_i) = \mathbb{E}\, L_{\mu}(A(S))$). This last quantity can be interpreted as a stability notion, and we see that controlling it would in turn control how different the true risk is from the empirical risk. Thus stable methods generalize.
We will now show that regularization can be interpreted as a stabilizer. Precisely we show that the regularized empirical risk minimizer
$$A(S) \in \mathrm{argmin}_{w} \ F_S(w), \quad \text{where } F_S(w) = \frac{1}{m} \sum_{i=1}^m \ell\big(\langle w, x_i \rangle, y_i\big) + \lambda \|w\|_2^2,$$
is $\frac{2 L^2}{\lambda m}$-stable (in the sense that each term in the sum above is bounded by $\frac{2L^2}{\lambda m}$) for a convex and $L$-Lipschitz loss, with examples normalized so that $\|x_i\| \leq 1$. Denote $w = A(S)$ and $w^{(i)} = A(S^{(i)})$; then one has
$$F_S(w^{(i)}) - F_S(w) = F_{S^{(i)}}(w^{(i)}) - F_{S^{(i)}}(w) + \frac{1}{m}\Big(\ell(\langle w^{(i)}, x_i \rangle, y_i) - \ell(\langle w, x_i \rangle, y_i)\Big) + \frac{1}{m}\Big(\ell(\langle w, x_i' \rangle, y_i') - \ell(\langle w^{(i)}, x_i' \rangle, y_i')\Big) \leq \frac{1}{m}\Big(\ell(\langle w^{(i)}, x_i \rangle, y_i) - \ell(\langle w, x_i \rangle, y_i)\Big) + \frac{1}{m}\Big(\ell(\langle w, x_i' \rangle, y_i') - \ell(\langle w^{(i)}, x_i' \rangle, y_i')\Big),$$
since $w^{(i)}$ minimizes $F_{S^{(i)}}$, and thus by Lipschitzness
$$F_S(w^{(i)}) - F_S(w) \leq \frac{2L}{m} \|w^{(i)} - w\|.$$
On the other hand by strong convexity one has
$$F_S(w^{(i)}) - F_S(w) \geq \lambda \|w^{(i)} - w\|^2,$$
and thus with the above we get $\|w^{(i)} - w\| \leq \frac{2L}{\lambda m}$, which implies (by Lipschitzness)
$$\big|\ell(\langle w^{(i)}, x \rangle, y) - \ell(\langle w, x \rangle, y)\big| \leq \frac{2 L^2}{\lambda m}, \quad \forall (x, y),$$
or in other words regularized empirical risk minimization (RERM) is stable. In particular, denoting $w^*$ for the minimizer of $w \mapsto L_{\mu}(w) = \mathbb{E}\, \ell(\langle w, x \rangle, y)$, we have:
$$\mathbb{E}\, L_{\mu}(A(S)) \leq \mathbb{E}\, L_S(A(S)) + \frac{2L^2}{\lambda m} \leq \mathbb{E}\, F_S(A(S)) + \frac{2L^2}{\lambda m} \leq \mathbb{E}\, F_S(w^*) + \frac{2L^2}{\lambda m} = L_{\mu}(w^*) + \lambda \|w^*\|^2 + \frac{2L^2}{\lambda m}.$$
Assuming $\|w^*\| \leq B$ and optimizing over $\lambda$ we recover the bound we obtained previously via Rademacher complexity.
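As a sanity check of the $\|w^{(i)} - w\| \leq \frac{2L}{\lambda m}$ step, here is a small numpy experiment with the (convex, $1$-Lipschitz) logistic loss, solving the regularized problem by plain gradient descent; the data and optimization choices are placeholders of mine.

```python
import numpy as np

def rerm(X, y, lam, steps=20_000):
    # gradient descent on F_S(w) = (1/m) sum_i log(1 + exp(-y_i <w, x_i>)) + lam ||w||^2
    m, n = X.shape
    w = np.zeros(n)
    eta = 1.0 / (0.25 + 2 * lam)                     # valid step size by smoothness of F_S
    for _ in range(steps):
        z = y * (X @ w)
        grad = -(X * (y / (1 + np.exp(z)))[:, None]).mean(0) + 2 * lam * w
        w -= eta * grad
    return w

rng = np.random.default_rng(0)
m, n, lam = 200, 5, 0.1
X = rng.standard_normal((m, n))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))   # examples in the unit ball
y = np.sign(X[:, 0] + 0.3 * rng.standard_normal(m))

Xi, yi = X.copy(), y.copy()
v = rng.standard_normal(n)
Xi[0], yi[0] = v / max(1.0, np.linalg.norm(v)), 1.0              # replace the first example

w, w_i = rerm(X, y, lam), rerm(Xi, yi, lam)
print(np.linalg.norm(w - w_i), 2 * 1.0 / (lam * m))              # observed change vs the 2L/(lam m) bound (L = 1)
```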