A couple of months ago I taught an introduction to statistical learning theory. I took inspiration from two very good introductory books on SLT: “Foundations of Machine Learning” and “Understanding Machine Learning: From Theory to Algorithms”. I also covered some classical results about nearest neighbors from the seminal book “A Probabilistic Theory of Pattern Recognition”. The lectures were recorded and I put the videos on the YouTube channel. Sadly I still have not learned the basic lesson to stick to a black pen when teaching on a whiteboard, so at (many) times it is hard to read what I write. I also got sick after the first lecture, so my voice is quite weak in the later lectures, making them hard to hear too… Despite these fatal flaws some people still found the material useful, so check it out if you are interested. Below I give a rough syllabus for the lectures, as well as some notes on “algorithmic generalization”, which was in my opinion the highlight of the lectures (you can also check my earlier notes on SLT here and here).
Syllabus
Lecture 1: Introduction to the statistical learning theory framework, its basic question (sample complexity) and its canonical settings (linear classification, linear regression, logistic regression, SVM, neural networks). Two basic methods for learning: (i) Empirical Risk Minimization, (ii) Nearest neighbor classification.
Lecture 2: Uniform law of large numbers approach to control the sample complexity of ERM (includes a brief reminder of concentration inequalities). Application: analysis of bounded regression (includes the non-standard topic of type/cotype and how it relates to different regularizations such as in LASSO).
Lecture 3: Reminder of the first two lectures and relation with the famous VC dimension. How to generalize beyond uniform law of large numbers: stability and robustness approaches (see below).
Lecture 4: How to generalize beyond uniform law of large numbers: information theoretic perspective (see below), PAC-Bayes, and online learning. Brief discussion of margin theory, and an introduction to modern questions in robust machine learning.
Some notes on algorithmic generalization
Let $\mathcal{X}, \mathcal{Y}$ be input/output spaces. Let $\ell : \mathcal{Y} \times \mathcal{Y} \rightarrow [0,1]$ be a loss function, $\mu$ a probability measure supported on $\mathcal{X} \times \mathcal{Y}$, and $A : (\mathcal{X} \times \mathcal{Y})^n \rightarrow \mathcal{Y}^{\mathcal{X}}$ a learning rule (in words, $A$ takes as input a dataset of $n$ examples, and outputs a mapping from $\mathcal{X}$-inputs to $\mathcal{Y}$-outputs). With a slight abuse of notation, for $h \in \mathcal{Y}^{\mathcal{X}}$ and $z = (x, y) \in \mathcal{X} \times \mathcal{Y}$, we write $\ell(h, z) := \ell(h(x), y)$. We define the generalization of $A$ on $\mu$ by:

$$\mathrm{gen}(A, \mu) := \mathbb{E}_{S = (z_1, \ldots, z_n) \sim \mu^{\otimes n}, \, z \sim \mu} \left[ \ell(A(S), z) - \frac{1}{n} \sum_{i=1}^{n} \ell(A(S), z_i) \right].$$
In words, if $\mathrm{gen}(A, \mu) \leq \epsilon$ then we expect the empirical performance of the learned classifier $A(S)$ to be representative of its performance on a fresh out-of-sample data point, up to an additive $\epsilon$. The whole difficulty of course is that the empirical evaluation is done with the same dataset $S$ that is used for training, leading to non-trivial dependencies. We should also note that in many situations one might be interested in the two-sided version of the generalization, as well as in high-probability bounds instead of bounds in expectation. For simplicity we focus on $\mathrm{gen}(A, \mu)$ here.
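To make the definition concrete, here is a minimal Monte Carlo sketch (the toy distribution and all names are my own, purely for illustration) that estimates $\mathrm{gen}(A, \mu)$ for the 1-nearest-neighbor rule with the zero-one loss:

import numpy as np

rng = np.random.default_rng(0)

def sample(n, d=2):
    # Toy distribution mu: Gaussian inputs, label = sign of the first coordinate,
    # flipped with probability 0.1 (so the Bayes risk is 0.1).
    x = rng.standard_normal((n, d))
    y = np.sign(x[:, 0])
    flip = rng.random(n) < 0.1
    y[flip] = -y[flip]
    return x, y

def nn_predict(x_train, y_train, x_test):
    # Learning rule A = 1-nearest neighbor: predict the label of the closest training point.
    d2 = ((x_test[:, None, :] - x_train[None, :, :]) ** 2).sum(-1)
    return y_train[d2.argmin(axis=1)]

n, trials = 50, 200
gaps = []
for _ in range(trials):
    xs, ys = sample(n)        # dataset S ~ mu^n
    xt, yt = sample(2000)     # fresh samples z ~ mu to estimate the out-of-sample risk
    emp = np.mean(nn_predict(xs, ys, xs) != ys)   # empirical risk of A(S) on S
    out = np.mean(nn_predict(xs, ys, xt) != yt)   # risk of A(S) on fresh data
    gaps.append(out - emp)

print("estimated gen(A, mu):", np.mean(gaps))

Since 1-NN fits its training set perfectly, the empirical risk is zero and the estimated gap is just the out-of-sample risk itself, a simple illustration of why the empirical risk alone can be a poor proxy for performance.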
The most classical approach to controlling generalization, which we covered in detail in previous notes, is via a uniform law of large numbers. More precisely, assuming that the range of the learning rule is contained in some hypothesis class $\mathcal{H}$, one trivially has:

$$\mathrm{gen}(A, \mu) \leq \mathbb{E}_{S \sim \mu^{\otimes n}} \left[ \sup_{h \in \mathcal{H}} \left( \mathbb{E}_{z \sim \mu}[\ell(h, z)] - \frac{1}{n} \sum_{i=1}^{n} \ell(h, z_i) \right) \right].$$

However this approach might be too coarse when the learning rule is searching through a potentially huge space of hypotheses (such as in the case of neural networks). Certainly such a uniform bound has no chance of explaining why neural networks with billions of parameters would generalize with a dataset of merely millions of examples. For this one has to use algorithm-based arguments.
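As a point of reference (this worked example is mine, not part of the original notes), in the simplest case of a finite class $\mathcal{H}$ the right-hand side is easy to control: each centered empirical average is $\frac{1}{2\sqrt{n}}$-subgaussian by Hoeffding’s lemma, so the expected supremum over $|\mathcal{H}|$ such variables is at most $\frac{1}{2\sqrt{n}} \sqrt{2 \log |\mathcal{H}|}$, i.e.,

$$\mathrm{gen}(A, \mu) \leq \sqrt{\frac{\log |\mathcal{H}|}{2n}},$$

which already makes clear that the uniform approach charges for the full size of $\mathcal{H}$ rather than for anything the algorithm actually does.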
Stability
The classical example of algorithmic generalization is due to Bousquet and Elisseeff 2002. It is a simple rewriting of the generalization as a stability notion:

$$\mathrm{gen}(A, \mu) = \mathbb{E}_{S \sim \mu^{\otimes n}, \, z \sim \mu} \left[ \frac{1}{n} \sum_{i=1}^{n} \left( \ell(A(S^{(i)}), z_i) - \ell(A(S), z_i) \right) \right],$$

where $S^{(i)}$ denotes the dataset $S$ with the $i$-th example $z_i$ replaced by the fresh example $z$. This viewpoint can be quite enlightening. For example, in the uniform law of large numbers view, regularization enforces small capacity, while in the stability view we see that regularization ensures that the output hypothesis is not too brittle (this was covered in some detail in the previous notes).
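To spell out why this identity holds (a one-line argument, included here for completeness): the $n+1$ points $(z_1, \ldots, z_n, z)$ are i.i.d., so exchanging $z_i$ with $z$ leaves expectations unchanged,

$$\mathbb{E}\left[ \ell(A(S^{(i)}), z_i) \right] = \mathbb{E}\left[ \ell(A(S), z) \right] \quad \text{for every } i,$$

and averaging over $i$ and subtracting the empirical term recovers the definition of $\mathrm{gen}(A, \mu)$.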
Robustness
The next approach I would like to discuss is related to deep questions about current machine learning methods. One of the outstanding problems in machine learning is that current algorithms are not robust to even a mild shift of distribution at test time. Intuitively this lack of robustness seems to indicate a lack of generalization. Can we formalize this intuition? I will now give one such formal link between robustness and generalization, due to Xu and Mannor 2010, which shows the reverse direction (robustness implies generalization). At some level robustness can be viewed as “stability at test time” (while in Bousquet and Elisseeff we care about “stability at training time”).
Xu and Mannor define $(K, \epsilon)$-robustness as follows: assume that $\mathcal{X} \times \mathcal{Y}$ can be partitioned into $K$ sets $C_1, \ldots, C_K$ such that if a training point $z_i \in S$ and a test point $z$ are in the same set $C_k$, then

$$|\ell(A(S), z_i) - \ell(A(S), z)| \leq \epsilon.$$

A good example to have in mind would be a binary classifier with large margin, in which case $K$ corresponds to the covering number of $\mathcal{X}$ at the scale given by the margin. Another (related) example would be regression with a Lipschitz function. In both cases $K$ would typically be exponential in the dimension of $\mathcal{X}$. The key result of Xu and Mannor that we prove next is a generalization bound of order $\epsilon + \sqrt{K/n}$. In any situation of interest this seems to me to be a pretty weak bound, yet on the other hand I find the framework very pretty and it is of topical interest. I would be surprised if this was the end of the road in the space of “generalization and robustness”.
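To make the “weak bound” comment quantitative (a back-of-the-envelope remark of my own): for a classifier with margin $\gamma$ on the unit Euclidean ball in $\mathbb{R}^d$, the covering number at scale $\gamma$ is roughly $K \approx (3/\gamma)^d$, so the $\sqrt{K/n}$ term is vacuous unless

$$n \gtrsim K \approx (3/\gamma)^d,$$

i.e., unless the sample size is itself exponential in the dimension.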
Theorem (Xu and Mannor 2010):
A $(K, \epsilon)$-robust learning rule satisfies

$$\mathrm{gen}(A, \mu) \leq \epsilon + \sqrt{\frac{K}{n}}.$$
Proof: Let $N_k = |\{i \in [n] : z_i \in C_k\}|$ and note that $(N_1, \ldots, N_K)$ follows a multinomial distribution with $\mathbb{E}[N_k] = n \, \mu(C_k)$. Now one has, for a $(K, \epsilon)$-robust $A$ (using that the loss takes values in $[0,1]$ for the first inequality and robustness for the second):

$$\mathbb{E}_{z \sim \mu}[\ell(A(S), z)] - \frac{1}{n} \sum_{i=1}^{n} \ell(A(S), z_i) = \sum_{k=1}^{K} \mu(C_k) \, \mathbb{E}_z[\ell(A(S), z) \mid z \in C_k] - \frac{1}{n} \sum_{k=1}^{K} \sum_{i : z_i \in C_k} \ell(A(S), z_i)$$

$$\leq \sum_{k=1}^{K} \left| \mu(C_k) - \frac{N_k}{n} \right| + \sum_{k=1}^{K} \frac{N_k}{n} \, \mathbb{E}_z[\ell(A(S), z) \mid z \in C_k] - \frac{1}{n} \sum_{k=1}^{K} \sum_{i : z_i \in C_k} \ell(A(S), z_i) \leq \sum_{k=1}^{K} \left| \mu(C_k) - \frac{N_k}{n} \right| + \epsilon.$$

It only remains to observe that, by Jensen and Cauchy-Schwarz,

$$\mathbb{E} \sum_{k=1}^{K} \left| \frac{N_k}{n} - \mu(C_k) \right| \leq \sum_{k=1}^{K} \sqrt{\frac{\mu(C_k)(1 - \mu(C_k))}{n}} \leq \sqrt{\frac{K}{n}}.$$
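The last display is also easy to check numerically; here is a small simulation (entirely my own, just a sanity check) comparing $\mathbb{E} \sum_k |N_k/n - \mu(C_k)|$ to $\sqrt{K/n}$:

import numpy as np

rng = np.random.default_rng(0)

def l1_deviation(p, n, trials=2000):
    # Monte Carlo estimate of E sum_k |N_k/n - p_k| for N ~ Multinomial(n, p).
    counts = rng.multinomial(n, p, size=trials)
    return np.abs(counts / n - p).sum(axis=1).mean()

K, n = 50, 1000
p = rng.dirichlet(np.ones(K))   # a random candidate for (mu(C_1), ..., mu(C_K))
print("empirical deviation:", l1_deviation(p, n))
print("sqrt(K/n) bound    :", np.sqrt(K / n))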
Information theoretic perspective
Why do we think that a lack of robustness indicates a lack of generalization? Well, it seems to me that a basic issue could simply be that the dataset was memorized by the neural network (which would be a very non-robust way to learn). If true, then one could basically find all the information about the data in the weights of the neural network. Again, can we prove at least the opposite direction, that is, if the output hypothesis does not retain much information from the dataset then it must generalize? This is exactly what Russo and Zou 2016 do, where they use the mutual information $I(S; A(S))$ as a measure of the “information” retained by the trained hypothesis about the dataset. More precisely they show the following result:
Theorem (Russo and Zou 2016):
Any learning rule $A$ (for a loss with values in $[0,1]$) satisfies

$$\mathrm{gen}(A, \mu) \leq \sqrt{\frac{I(S; A(S))}{2n}}.$$

Note that here we have assumed that the codomain of the learning rule consists of deterministic maps from inputs to outputs, in which case the mutual information $I(S; A(S))$ is simply the entropy $H(A(S))$. However the proof below also applies to the case where the codomain of the learning rule consists of probability measures (i.e., randomized learning rules), see e.g., Xu and Raginsky 2017. Let us now conclude this (long) post with the proof of the above theorem.
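As a quick consistency check (my own remark, not from the paper): if $A$ deterministically selects a hypothesis from a finite class $\mathcal{H}$, then $I(S; A(S)) = H(A(S)) \leq \log |\mathcal{H}|$, and the theorem recovers the finite-class uniform bound

$$\mathrm{gen}(A, \mu) \leq \sqrt{\frac{H(A(S))}{2n}} \leq \sqrt{\frac{\log |\mathcal{H}|}{2n}},$$

with the improvement that only the entropy of what the algorithm actually outputs is charged, not the full size of $\mathcal{H}$.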
The key point is very simple: one can view generalization as a decoupling property by writing:

$$\mathrm{gen}(A, \mu) = \mathbb{E}\left[ \varphi(\bar{S}, A(S)) \right] - \mathbb{E}\left[ \varphi(S, A(S)) \right],$$

where $\varphi(S', h) := \frac{1}{n} \sum_{i=1}^{n} \ell(h, z'_i)$ for a dataset $S' = (z'_1, \ldots, z'_n)$, and $\bar{S}$ is an independent copy of $S$.
Now the theorem follows straightforwardly (if one knows Hoeffding’s lemma) from an application of the following beautiful lemma:
Lemma:
Let $\varphi : \mathcal{A} \times \mathcal{B} \rightarrow \mathbb{R}$. Let $(A, B)$ be a pair of random variables in $\mathcal{A} \times \mathcal{B}$, and let $\bar{A}$, $\bar{B}$ be mutually independent copies of $A$ and $B$. Assume that $\varphi(\bar{A}, \bar{B})$ is $\sigma$-subgaussian (i.e., $\mathbb{E} \exp\left( \lambda \left( \varphi(\bar{A}, \bar{B}) - \mathbb{E}[\varphi(\bar{A}, \bar{B})] \right) \right) \leq \exp(\lambda^2 \sigma^2 / 2)$ for all $\lambda \in \mathbb{R}$). Then

$$\left| \mathbb{E}[\varphi(A, B)] - \mathbb{E}[\varphi(\bar{A}, \bar{B})] \right| \leq \sigma \sqrt{2 \, I(A; B)}.$$
Proof: The mutual information $I(A; B)$ is equal to the relative entropy between the distribution of $(A, B)$ and the distribution of $(\bar{A}, \bar{B})$. Recall also the variational representation of the relative entropy, which is that the map $p \mapsto \mathrm{KL}(p, q)$ is the convex conjugate of the log-partition function $f \mapsto \log \mathbb{E}_q[\exp(f)]$. In particular, for any such $f$ one has a lower bound on the mutual information (taking $p$ to be the law of $(A, B)$ and $q$ the law of $(\bar{A}, \bar{B})$), which means:

$$I(A; B) \geq \mathbb{E}[f(A, B)] - \log \mathbb{E}[\exp(f(\bar{A}, \bar{B}))].$$

Now it only remains to use the definition of subgaussianity, that is take $f = \lambda \left( \varphi - \mathbb{E}[\varphi(\bar{A}, \bar{B})] \right)$, which turns the above into $\lambda \left( \mathbb{E}[\varphi(A, B)] - \mathbb{E}[\varphi(\bar{A}, \bar{B})] \right) \leq I(A; B) + \frac{\lambda^2 \sigma^2}{2}$, and optimize over $\lambda$.
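Spelling out the step alluded to above: take $\varphi(S', h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h, z'_i)$ and apply the lemma to the pair $(S, A(S))$. Since the loss takes values in $[0,1]$, Hoeffding’s lemma implies that for every fixed hypothesis $h$ the average $\varphi(\bar{S}, h)$ is $\frac{1}{2\sqrt{n}}$-subgaussian around its mean (this is the subgaussianity the argument needs, with the centering done conditionally on the hypothesis), so that

$$\mathrm{gen}(A, \mu) = \mathbb{E}[\varphi(\bar{S}, A(S))] - \mathbb{E}[\varphi(S, A(S))] \leq \frac{1}{2\sqrt{n}} \sqrt{2 \, I(S; A(S))} = \sqrt{\frac{I(S; A(S))}{2n}},$$

which is exactly the theorem of Russo and Zou.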