How to generalize (algorithmically)

A couple of months ago I taught an introduction to statistical learning theory. I took inspiration from two very good introductory books on SLT: “Foundations of ML”, and “Understanding Machine Learning: From Theory to Algorithms”. I also covered some classical results about nearest neighbors from the seminal book “A Probabilistic Theory of Pattern Recognition”. The lectures were recorded and I put the videos on the youtube channel. Sadly I still did not learn the basic lesson to stick to a black pen when teaching on a whiteboard, so at (many) times it is hard to read what I write. I also got sick after the first lecture so my voice is quite weak in the following lectures, making it hard to hear too… Despite these fatal flaws some people still found the material useful, so check it out if you are interested. Below I give a rough syllabus for the lectures, and I also give some notes on “algorithmic generalization” which was in my opinion the highlight of the lectures (you can also check my earlier notes on SLT here and here).

Syllabus

Lecture 1: Introduction to the statistical learning theory framework, its basic question (sample complexity) and its canonical settings (linear classification, linear regression, logistic regression, SVM, neural networks). Two basic methods for learning: (i) Empirical Risk Minimization, (ii) Nearest neighbor classification.

Lecture 2: Uniform law of large numbers approach to control the sample complexity of ERM (includes a brief reminder of concentration inequalities). Application: analysis of bounded regression (includes the non-standard topic of type/cotype and how it relates to different regularizations such as in LASSO).

Lecture 3: Reminder of the first two lectures and relation with the famous VC dimension. How to generalize beyond uniform law of large numbers: stability and robustness approaches (see below).

Lecture 4: How to generalize beyond uniform law of large numbers: information theoretic perspective (see below), PAC-Bayes, and online learning. Brief discussion of margin theory, and an introduction to modern questions in robust machine learning.

Some notes on algorithmic generalization

Let $\mathcal{X}, \mathcal{Y}$ be input/output spaces. Let $\ell : \mathcal{Y} \times \mathcal{Y} \rightarrow [0,1]$ be a loss function, $\mu$ a probability measure supported on $\mathcal{X} \times \mathcal{Y}$ , and $\underline{h} : (\mathcal{X} \times \mathcal{Y})^m \rightarrow \mathcal{Y}^\mathcal{X}$ a learning rule (in words $\underline{h}$ takes as input a dataset of $m$ examples, and output a mapping from $\mathcal{X}$ -inputs to $\mathcal{Y}$ -outputs). With a slight abuse of notation, for $z=(x,y) \in \mathcal{X} \times \mathcal{Y}$ and $h \in \mathcal{Y}^{\mathcal{X}}$ , we write $\ell(h, z) := \ell(h(x), y)$ . We define the generalization of $\underline{h}$ on $\mu$ by:

$\mathrm{gen}(\underline{h} , \mu) = \mathbb{E}_{S=(z_1,\hdots,z_m) \sim \mu^{\otimes m}, z \sim \mu} \left[ \ell(\underline{h}(S), z) - \frac1{m} \sum_{i=1}^m \ell(\underline{h}(S), z_i) \right] \,.$

In words, if $\mathrm{gen}(\underline{h} , \mu) \leq \epsilon$ then we expect the empirical performance of the learned classifier $\underline{h}(S)$ to be representative of its performance on a fresh out-of-sample data point, up to an additive $\epsilon$ . The whole difficulty of course is that the empirical evaluation is done with the same dataset $S$ that is used for training, leading to non-trivial dependencies. We should also note that in many situations one might be interested in the two-sided version of the generalization, as well as high probability bounds instead of bounds in expectation. For simplicity we focus on $\mathrm{gen}(\underline{h} , \mu)$ here.

The most classical approach to controlling generalization, which we covered in details in previous notes, is via uniform law of large numbers. More precisely assuming that the range of the learning rule is some hypothesis class $\mathcal{H}$ one trivially has

$\mathrm{gen}(\underline{h} , \mu) \leq \mathbb{E}_{S, z} \,\, \sup_{h \in \mathcal{H}} \left[ \frac1{m} \sum_{i=1}^m \ell(h, z_i) - \ell(h, z) \right] \,.$

However this approach might be too coarse when the learning rule is searching through a potentially huge space of hypothesis (such as in the case of neural networks). Certainly such uniform bound has no chance of explaining why neural networks with billions of parameters would generalize with a data set of merely millions of examples. For this one has to use algorithm-based arguments.

Stability

The classical example of algorithmic generalization is due to Bousquet and Elisseeff 2002. It is a simple rewriting of the generalization as a stability notion:

$\begin{eqnarray*} \mathrm{gen}(\underline{h} , \mu) & = & \mathbb{E}_{S, z, i \sim [n]} \left[\ell(\underline{h}(S), z) - \ell(\underline{h}(S), z_i) \right] \\ & = & \mathbb{E}_{S, z, i \sim [n]} \left[\ell(\underline{h}(S^{-i}), z_i) - \ell(\underline{h}(S), z_i) \right] \,, \end{eqnarray*}$

where $S^{-i}=(z_1,\hdots, z_{i-1}, z, z_{i+1}, \hdots, z_m)$ . This viewpoint can be quite enlightening. For example in the uniform law of large numbers view, regularization enforces small capacity, while in the stability view we see that regularization ensures that the output hypothesis is not too brittle (this was covered in some details in the previous notes).

Robustness

The next approach I would like to discuss is related to deep questions about current machine learning methods. One of the outstanding problem in machine learning is that current algorithms are not robust to even mild shift of distribution at test time. Intuitively this lack of robustness seem to indicate a lack of generalization. Can we formalize this intuition? I will now give one such formal link between robustness and generalization due to Xu and Mannor 2010, which shows the reverse direction (robustness implies generalization). At some level robustness can be viewed as a “stability at test time” (while in Bousquet and Elisseeff we care about “stability at training time”).

Xu and Mannor define $(K, \epsilon(\cdot))$ -robustness as follows: assume that $\mathcal{Z}$ can be partitioned into $K$ sets $\{C_k\}_{k \in [K]}$ such that if $z$ and $z_i$ are in the same set $C_k$ then

$| \ell(\underline{h}(S), z_i) - \ell(\underline{h}(S), z) | \leq \epsilon(S) \,.$

A good example to have in mind would be a binary classifier with large margin, in which case $K$ corresponds to the covering number of $\mathcal{X}$ at the scale given by the margin. Another (related) example would be regression with a Lipschitz function. In both cases $K$ would be typically exponential in the dimension of $\mathcal{X}$ . The key result of Xu and Mannor that we prove next is a generalization bound of order $\sqrt{K / m}$ . In any situation of interest this seems to me to be a pretty weak bound, yet on the other hand I find the framework to be very pretty and it is of topical interest. I would be surprised if this was the end of the road in the space of “generalization and robustness”.

Theorem (Xu and Mannor 2010):

A $(K, \epsilon(\cdot))$ -robust learning rule satisfies

$\mathrm{gen}(\underline{h} , \mu) \leq \mathbb{E}_{S} \, \epsilon(S) + \sqrt{\frac{K}{m}} \,.$

Proof: Let $N_k = |S \cap C_k|$ and note that $\mathbb{E}_S \, N_k = m \mu(C_k)$ . Now one has for a robust $h$ :

$\begin{align*} \mathbb{E}_z \ \ell(h, z) & =\sum_{k=1}^K \mu(C_k) \mathbb{E}_z [\ell(h,z) | z \in C_k] \\ & = \sum_{k=1}^K \frac{N_k}{m} \mathbb{E}_z [\ell(h,z) | z \in C_k] + \sum_{k=1}^K \left(\mu(C_k) - \frac{N_k}{m}\right) \mathbb{E}_z [\ell(h,z) | z \in C_k] \\ & \leq \epsilon(S) + \frac{1}{m} \sum_{i = 1}^m \ell(h, z_i) + \sum_{k=1}^K \left|\mu(C_k) - \frac{N_k}{m}\right| \, . \end{align*}$

It only remains to observe that

$\mathbb{E}_S \, \sum_{k=1}^K \left|\mu(C_k) - \frac{N_k}{m}\right| \leq \frac{1}{m} \sum_{k=1}^K \sqrt{\mu(C_k)} \leq \sqrt{\frac{K}{m}} \, .$

Information theoretic perspective

Why do we think that a lack of robustness indicate a lack of generalization? Well it seems to me that a basic issue could simply be that the dataset was memorized by the neural network (which be a very non-robust way to learn). If true then one could basically find all the information about the data in the weights of the neural network. Again, can we prove at least the opposite direction, that is if the output hypothesis does not retain much information from the dataset then it must generalize. This is exactly what Russo and Zou 2016, where they use the mutual information $\mathrm{I}(\underline{h}(S), S)$ as a measure of the “information” retained by the trained hypothesis about the dataset. More precisely they show the following result:

Theorem (Russo and Zou 2016):

$\mathrm{gen}(\underline{h} , \mu) \leq \sqrt{\frac{\mathrm{I}(\underline{h}(S), S)}{2 m}} \, .$

Note that here we have assumed that the codomain of the learning rule $\underline{h}$ consists of deterministic maps from inputs to outputs, in which case the mutual information $\mathrm{I}(\underline{h}(S), S)$ is simply the entropy $\mathrm{H}(\underline{h}(S))$ . However the proof below also applies to the case where the codomain of the learning rule consists of probability measures, see e.g., Xu and Raginsky 2017. Let us now conclude this (long) post with the proof of the above theorem.

The key point is very simple: one can view generalization as a decoupling property by writing:

$\mathrm{gen}(\underline{h} , \mu) = \mathbb{E}_{S, S' \sim \mu^{\otimes m}} [ F(\underline{h}(S), S') - F(\underline{h}(S), S) ] \,,$

where $F(h, S) = \frac{1}{m} \sum_{z \in S} \ell(h, z)$ .

Now the theorem follows straightforwardly (if one knows Hoeffding’s lemma) from an application of the following beautiful lemma:

Lemma:

Let $f: \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}$ . Let $(X,Y)$ be random variables in $\mathcal{X} \times \mathcal{Y}$ and $\bar{X}$ , $\bar{Y}$ be mutually independent copies of $X$ and $Y$ . Assume that $f(\bar{X}, \bar{Y})$ is $\sigma$ -subgaussian (i.e., $\log \mathbb{E} \exp( \lambda (f(\bar{X}, \bar{Y}) - \mathbb{E} f) ) \leq \lambda^2 \sigma^2 /2, \forall \lambda$ ) then

$\mathbb{E}[ f(X,Y) - f(\bar{X}, \bar{Y}) ] \leq \sqrt{2 \sigma^2 \mathrm{I}(X, Y)}$

Proof: The mutual information is equal to the relative entropy between the distribution of $(X,Y)$ and the distribution of $(\bar{X}, \bar{Y})$ . Recall also the variational representation of the relative entropy which is that the map $\mu \mapsto \mathrm{Ent}(\mu, \nu)$ is the convex conjugate of the log-partition function $\Phi_{\nu}(g) := \log(\int \exp(g(x)) d \nu(x))$ . In particular one has a lower bound on the mutual information for any such $g$ which means:

$\mathrm{I}(X, Y) \geq \mathbb{E}[ g(X,Y) ] - \log \mathbb{E} [ \exp(g(\bar{X},\bar{Y})) ] \,.$

Now it only remains to use the definition of subgaussianity, that is take $g = \lambda f$ , and optimize over $\lambda$ .

8 Responses to "How to generalize (algorithmically)"

By Askcorran March 3, 2021 - 1:00 am

So you are a mangastream fan as you are here searching for alternatives to this website. Mangastream was one of the most favorite websites of all Manga comic readers around the world. But now it is no longer on the internet due to some reasons. Do you know that in today’s time where everything is digital and people love watching series and watch movies on entertainment apps? Yet there are many people like you and me who love reading comics especially manga. Yes, in this age where motion pictures are number one in the list of entertainment, there are comics that are very popular and have a huge fanbase, and one of them are manga comics
Is MangaStream Dead? Best 26 Alternatives To Read Manga Instantly
By Askcorran March 2, 2021 - 12:18 am

We always mistake to understand what is ant radio service? It is always confused that the ANT radio service is the same as Bluetooth in apps and phones. And it is logically correct for us to think so. We think it is just like Bluetooth stack that allows us to connect other phones or devices via Bluetooth connectivity buy it is not so. The ANT radio service is pre-installed in our phones by the manufacture. There may be a chance that if you delete the Bluetooth support on your phone then you might also delete the ANT+ devices. Let us understand what is ant radio service and do you need it?
what is ant radio service? Do you need it?
By Jobengg March 2, 2021 - 12:16 am

2018, 2019 Batch Pass Out freshers has a piece of good news. Accenture is inviting applications from B.E/B.Tech, MCA Freshers for Accenture Off Campus Drive 2020. The drive will be conducted across various locations in India. So, candidates from all the major cities like Hyderabad, Bangalore, NCR, Vijayawada, Pune, Mumbai, Kolkata can take the test for Accenture Off Campus 2020 Drive. The details of the upcoming Accenture Pool Campus Drive are given here. So, all the candidates can check their eligibility, requirements from this page.
Accenture Off Campus Drive 2020 for 2019, 2018 Batch
By Techyxl March 2, 2021 - 12:13 am

QuickBooks is a successfully famous program that enables small and medium businesses to run their accounting processes smoothly. The software comes in different variations which include Premier, Pro, and Enterprise. All of these variants are excellent and can assist streamline a company’s financial and accounting operations.
The reason why you should use QuickBooks Premier Hosting in 2021
By Anamounto March 2, 2021 - 12:08 am

Are you searching for How to disable avast? But you are not able to find a correct process for it. Do not take any tension coz here I have mentioned the correct way to disable the Avast. You will see the different methods on how to close the avast antivirus on your computer, mobile, and Laptops. You will also find the process for how to temporarily disable avast? So read this informative article further for more information on how to disable avast properly.
How To Disable Avast? Windows, Mac, And Android
By FlyingRider March 2, 2021 - 12:03 am

Here is a list of the best smart TVs under 20000 with the description and specification, its pros and cons, the verdict, and various other aspects. This gives you a good idea about which best smart TV under 20000 will suit your needs and comes within your budget.
Top 10 Best Smart TVs Under 20000 In India To Buy (2021)
By web desing February 28, 2021 - 4:24 pm

On our Hello website, we share hundreds of different content for you, such as great content, information you need to know. We are sure that this information will affect your life positively. You can visit our website to be informed about our great ideas.
Don’t forget to turn on notifications! https://phosphorousideas.com/
By jain February 22, 2021 - 4:23 am

great

How to generalize (algorithmically)

8 Responses to "How to generalize (algorithmically)"

By Askcorran March 3, 2021 - 1:00 am

By Askcorran March 2, 2021 - 12:18 am

By Jobengg March 2, 2021 - 12:16 am

By Techyxl March 2, 2021 - 12:13 am

By Anamounto March 2, 2021 - 12:08 am

By FlyingRider March 2, 2021 - 12:03 am

By web desing February 28, 2021 - 4:24 pm

By jain February 22, 2021 - 4:23 am

Leave a replyLeave a Reply to Anonymous

Archives

Categories

Recent Posts

Subscribe to Blog via Email

Meta

Blogroll