Lecture 2. Basics / law of small numbers
Due to scheduling considerations, we postpone the proof of the entropic central limit theorem. In this lecture, we discuss basic properties of the entropy and illustrate them by proving a simple version of the law of small numbers (Poisson limit theorem). The next lecture will be devoted to Sanov’s theorem. We will return to the entropic central limit theorem in Lecture 4.
Conditional entropy and mutual information
We begin by introducing two definitions related to entropy. The first definition is a notion of entropy under conditioning.
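Definition. If $X$ and $Y$ are two discrete random variables with probability mass functions $p_X$ and $p_Y$, then the conditional entropy of $X$ given $Y$ is defined as
$$H(X|Y) := -\mathbf{E}[\log p_{X|Y}(X|Y)],$$
where $p_{X|Y}(x|y) = p_{(X,Y)}(x,y)/p_Y(y)$ is the conditional probability mass function of $X$ given $Y$.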
Remark. If $X$ and $Y$ are absolutely continuous random variables, the conditional differential entropy $h(X|Y)$ is defined analogously (where the probability mass functions are replaced by the corresponding probability densities with respect to Lebesgue measure).
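Note that
$$H(X|Y) = -\sum_{x,y} p_{(X,Y)}(x,y) \log p_{X|Y}(x|y) = -\sum_{y} p_Y(y) \sum_{x} p_{X|Y}(x|y) \log p_{X|Y}(x|y).$$
That is, the conditional entropy $H(X|Y)$ is precisely the expectation (with respect to the law of $Y$) of the entropy of the conditional distribution of $X$ given $Y$.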
We now turn to the second definition, the mutual information. It describes the degree of dependence between two random variables.
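Definition. The mutual information between two random variables $X$ and $Y$ is defined as
$$I(X,Y) := D(\mathcal{L}(X,Y) \| \mathcal{L}(X) \otimes \mathcal{L}(Y)),$$
where $\mathcal{L}(X,Y)$, $\mathcal{L}(X)$ and $\mathcal{L}(Y)$ denote the distributions of $(X,Y)$, $X$ and $Y$.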
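Conditional entropy and mutual information are closely related. For example, if $(X,Y)$ has density $f_{(X,Y)}$ with respect to Lebesgue measure, then
$$I(X,Y) = \int f_{(X,Y)}(x,y) \log \frac{f_{(X,Y)}(x,y)}{f_X(x) f_Y(y)} \, dx \, dy = \mathbf{E}\left[\log \frac{f_{X|Y}(X|Y)}{f_X(X)}\right] = h(X) - h(X|Y).$$
In particular, since $I(X,Y)$ is always nonnegative (because it is a relative entropy), we have just shown that $h(X|Y) \le h(X)$, that is, conditioning reduces entropy. The same result holds for discrete random variables when we replace $h$ by $H$.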
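The relation between mutual information and conditional entropy can be checked numerically in the discrete case. The following is a minimal sketch, assuming a hypothetical joint probability mass function on $\{0,1\}^2$ chosen only for illustration; the names `joint`, `entropy`, `h_x_given_y` and `mutual_info` are not from the lecture.

```python
import math

# Hypothetical joint pmf joint[x][y] of (X, Y) on {0,1} x {0,1} (illustrative values).
joint = [[0.4, 0.1],
         [0.2, 0.3]]

def entropy(pmf):
    """Shannon entropy (in nats) of a pmf given as a list of probabilities."""
    return -sum(p * math.log(p) for p in pmf if p > 0)

p_x = [sum(row) for row in joint]          # marginal pmf of X
p_y = [sum(col) for col in zip(*joint)]    # marginal pmf of Y

# H(X|Y): average over y of the entropy of the conditional law of X given Y = y.
h_x_given_y = sum(
    p_y[y] * entropy([joint[x][y] / p_y[y] for x in range(2)])
    for y in range(2)
)

# I(X,Y): relative entropy between the joint law and the product of the marginals.
mutual_info = sum(
    joint[x][y] * math.log(joint[x][y] / (p_x[x] * p_y[y]))
    for x in range(2) for y in range(2) if joint[x][y] > 0
)

# The two printed numbers should agree up to rounding.
print(entropy(p_x) - h_x_given_y, mutual_info)
```

Both printed values are the same quantity computed in two ways, and their nonnegativity is the statement that conditioning reduces entropy.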
Chain rules
Chain rules are formulas that relate the entropy of multiple random variables to the conditional entropies of these random variables. The most basic version is the following.
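Chain rule for entropy. $H(X_1,\ldots,X_n) = \sum_{i=1}^n H(X_i | X_1,\ldots,X_{i-1})$. In particular, $H(X_1,X_2) = H(X_1) + H(X_2|X_1)$.
Proof. Note that
$$p_{(X_1,\ldots,X_n)}(x_1,\ldots,x_n) = \prod_{i=1}^n p_{X_i|X_1,\ldots,X_{i-1}}(x_i|x_1,\ldots,x_{i-1}).$$
Thus,
$$-\log p_{(X_1,\ldots,X_n)}(x_1,\ldots,x_n) = -\sum_{i=1}^n \log p_{X_i|X_1,\ldots,X_{i-1}}(x_i|x_1,\ldots,x_{i-1}).$$
Taking the expectation on both sides under the distribution $(x_1,\ldots,x_n) \sim (X_1,\ldots,X_n)$ gives the desired result.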
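Corollary. Entropy is subadditive, that is, $H(X_1,\ldots,X_n) \le \sum_{i=1}^n H(X_i)$.
Proof. Combine the chain rule with $H(X_i | X_1,\ldots,X_{i-1}) \le H(X_i)$, which holds because conditioning reduces entropy.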
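As an illustration, the chain rule and subadditivity can be verified on a small example. This is a sketch under stated assumptions: the joint law of three dependent bits is built from arbitrary hypothetical weights, and the helper names (`marginal`, `cond_entropy`) are introduced here for illustration only. The conditional entropies are computed directly from their definition rather than as differences of joint entropies, so the check is not circular.

```python
import math
from itertools import product

# Hypothetical positive weights defining a joint pmf of three dependent bits.
weights = {x: 1 + x[0] + 2 * x[1] * x[2] + (x[0] ^ x[2])
           for x in product((0, 1), repeat=3)}
total = sum(weights.values())
joint = {x: w / total for x, w in weights.items()}

def marginal(pmf, idx):
    """Marginal pmf of the coordinates listed in idx."""
    out = {}
    for x, p in pmf.items():
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def entropy(pmf):
    """Shannon entropy (in nats) of a pmf stored as a dict."""
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)

def cond_entropy(pmf, i):
    """H(X_i | X_0, ..., X_{i-1}) (0-based), as minus the expected log conditional pmf."""
    upto = marginal(pmf, range(i + 1))
    before = marginal(pmf, range(i))
    return -sum(p * math.log(p / before[x[:i]]) for x, p in upto.items() if p > 0)

joint_entropy = entropy(joint)
chain_sum = sum(cond_entropy(joint, i) for i in range(3))
marginal_sum = sum(entropy(marginal(joint, [i])) for i in range(3))

print(joint_entropy, chain_sum, marginal_sum)
```

The first two printed values should coincide (chain rule), and the third should be at least as large (subadditivity).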
There is also a chain rule for relative entropy.
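Chain rule for relative entropy. For pairs of random variables $(X,Y)$ and $(X',Y')$,
$$D(\mathcal{L}(X,Y) \| \mathcal{L}(X',Y')) = D(\mathcal{L}(X) \| \mathcal{L}(X')) + \mathbf{E}_{x \sim X}[D(\mathcal{L}(Y|X=x) \| \mathcal{L}(Y'|X'=x))].$$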
The following identity will be useful later.
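Lemma. For discrete random variables $X_1,\ldots,X_n$,
$$D(\mathcal{L}(X_1,\ldots,X_n) \| \mathcal{L}(X_1) \otimes \cdots \otimes \mathcal{L}(X_n)) = \sum_{i=1}^n H(X_i) - H(X_1,\ldots,X_n) \ge 0.$$
Proof. Note that
$$D(\mathcal{L}(X_1,\ldots,X_n) \| \mathcal{L}(X_1) \otimes \cdots \otimes \mathcal{L}(X_n)) = \sum_{x_1,\ldots,x_n} p_{(X_1,\ldots,X_n)}(x_1,\ldots,x_n) \log \frac{p_{(X_1,\ldots,X_n)}(x_1,\ldots,x_n)}{p_{X_1}(x_1) \cdots p_{X_n}(x_n)} = \sum_{i=1}^n H(X_i) - H(X_1,\ldots,X_n).$$
The inequality holds because relative entropy is nonnegative.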
Data processing and convexity
Two important properties of the relative entropy can be obtained as consequences of the chain rule.
Data processing inequality. Let $P$ and $Q$ be two probability measures on $\mathcal{A}$ and suppose $T : \mathcal{A} \to \mathcal{A}'$ is measurable. Then $D(PT^{-1} \| QT^{-1}) \le D(P \| Q)$, where $PT^{-1}$ is the distribution of $T(X)$ when $X \sim P$.
The data processing inequality tells us that if we process the data $X$ (which might come from one of the two distributions $P$ and $Q$), then the relative entropy cannot increase. In other words, it becomes harder to identify the source distribution after processing the data. The same result (with the same proof) holds also if $P$ and $Q$ are transformed by a transition kernel, rather than by a function.
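Proof. Denote by $\mathsf{P}$ and $\mathsf{Q}$ the joint laws of $(X,T(X))$ and $(Y,T(Y))$ when $X \sim P$ and $Y \sim Q$. By the chain rule and nonnegativity of relative entropy,
$$D(\mathsf{P} \| \mathsf{Q}) = D(PT^{-1} \| QT^{-1}) + \mathbf{E}_{t \sim T(X)}[D(\mathcal{L}(X|T(X)=t) \| \mathcal{L}(Y|T(Y)=t))] \ge D(PT^{-1} \| QT^{-1}).$$
On the other hand, using again the chain rule,
$$D(\mathsf{P} \| \mathsf{Q}) = D(P \| Q) + \mathbf{E}_{x \sim X}[D(\mathcal{L}(T(X)|X=x) \| \mathcal{L}(T(Y)|Y=x))] = D(P \| Q),$$
where we used $\mathcal{L}(T(X)|X=x) = \mathcal{L}(T(Y)|Y=x)$ (both are the point mass at $T(x)$). Putting these together completes the proof.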
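A small numerical check of the data processing inequality may help make the statement concrete. This is only a sketch: the distributions `P` and `Q` on $\{0,1,2,3\}$ and the map `T` (parity) are hypothetical choices, and `pushforward` is a helper name introduced here.

```python
import math

def relative_entropy(p, q):
    """D(p || q) in nats for pmfs given as lists (assumes q > 0 wherever p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def pushforward(p, T, size):
    """pmf of T(X) when X has pmf p on {0, ..., len(p) - 1} and T maps into {0, ..., size - 1}."""
    out = [0.0] * size
    for x, px in enumerate(p):
        out[T(x)] += px
    return out

# Hypothetical example: two laws on {0, 1, 2, 3}, processed by taking the parity.
P = [0.1, 0.2, 0.3, 0.4]
Q = [0.25, 0.25, 0.25, 0.25]
T = lambda x: x % 2

d_before = relative_entropy(P, Q)
d_after = relative_entropy(pushforward(P, T, 2), pushforward(Q, T, 2))

# Processing the data should not increase the relative entropy.
print(d_before, d_after)
```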
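Convexity of relative entropy. $D(\cdot \| \cdot)$ is jointly convex in its arguments, that is, if $P_1$, $P_2$, $Q_1$, $Q_2$ are probability measures and $0 \le \lambda \le 1$, then
$$D(\lambda P_1 + (1-\lambda) P_2 \| \lambda Q_1 + (1-\lambda) Q_2) \le \lambda D(P_1 \| Q_1) + (1-\lambda) D(P_2 \| Q_2).$$
Proof. Let $T$ be a random variable that takes value $1$ with probability $\lambda$ and $2$ with probability $1-\lambda$. Conditionally on $T=i$, draw $X \sim P_i$ and $Y \sim Q_i$. Then $X \sim \lambda P_1 + (1-\lambda) P_2$ and $Y \sim \lambda Q_1 + (1-\lambda) Q_2$. Using the chain rule twice, we obtain
$$D(\mathcal{L}(X) \| \mathcal{L}(Y)) \le D(\mathcal{L}(X) \| \mathcal{L}(Y)) + \mathbf{E}_{x \sim X}[D(\mathcal{L}(T|X=x) \| \mathcal{L}(T|Y=x))] = D(\mathcal{L}(X,T) \| \mathcal{L}(Y,T)) = \mathbf{E}_{t \sim T}[D(\mathcal{L}(X|T=t) \| \mathcal{L}(Y|T=t))],$$
where the last equality uses that $T$ has the same law in both pairs. The left hand side is $D(\lambda P_1 + (1-\lambda) P_2 \| \lambda Q_1 + (1-\lambda) Q_2)$, and the right hand side is precisely $\lambda D(P_1 \| Q_1) + (1-\lambda) D(P_2 \| Q_2)$.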
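Corollary. The entropy function $H$ is concave.
Proof for a finite alphabet. When the alphabet $\mathcal{A}$ is finite, the corollary can be proven by noting that $H(P) = \log|\mathcal{A}| - D(P \| \mathrm{Unif}(\mathcal{A}))$, where $\mathrm{Unif}(\mathcal{A})$ is the uniform distribution on $\mathcal{A}$, and that $P \mapsto D(P \| \mathrm{Unif}(\mathcal{A}))$ is convex.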
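Both the joint convexity of relative entropy and the identity relating entropy to relative entropy with respect to the uniform distribution can be sanity-checked numerically. The sketch below assumes hypothetical distributions `P1`, `P2`, `Q1`, `Q2` on a three-letter alphabet; it checks the inequality only on a grid of mixing weights, so it is an illustration and not a proof.

```python
import math

def relative_entropy(p, q):
    """D(p || q) in nats for pmfs given as lists (assumes q > 0 wherever p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Shannon entropy (in nats) of a pmf given as a list."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def mix(lam, a, b):
    """Mixture lam * a + (1 - lam) * b of two pmfs."""
    return [lam * x + (1 - lam) * y for x, y in zip(a, b)]

# Hypothetical pmfs on a three-letter alphabet (illustrative values).
P1, P2 = [0.7, 0.2, 0.1], [0.1, 0.3, 0.6]
Q1, Q2 = [0.3, 0.3, 0.4], [0.2, 0.5, 0.3]

# Gap (right side minus left side) in the convexity inequality, on a grid of weights.
gaps = []
for k in range(11):
    lam = k / 10
    lhs = relative_entropy(mix(lam, P1, P2), mix(lam, Q1, Q2))
    rhs = lam * relative_entropy(P1, Q1) + (1 - lam) * relative_entropy(P2, Q2)
    gaps.append(rhs - lhs)

# Entropy versus log(alphabet size) minus relative entropy to the uniform law.
uniform = [1 / 3] * 3
identity_gap = entropy(P1) - (math.log(3) - relative_entropy(P1, uniform))

print(min(gaps), identity_gap)
```

The smallest gap should be nonnegative and the identity gap should vanish, both up to rounding.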
Relative entropy and total variation distance
Consider the hypothesis testing problem of testing the null hypothesis $H_0 : X \sim P$ against the alternative hypothesis $H_1 : X \sim Q$. A test is a measurable function $T : \mathcal{A} \to \{0,1\}$. Under the constraint $P[T(X) = 1] \le \alpha$ on the type I error, it can be shown that the optimal rate of decay of the type II error $Q[T(X) = 0]$ as a function of the sample size $n$ is of the order of $\exp(-n D(P \| Q))$. This means that $D(P \| Q)$ is a measure of how well one can distinguish between $P$ and $Q$ on the basis of data.
We will not prove this fact, but only introduce it to motivate that the relative entropy is, in some sense, like a measure of distance between probability measures. However, it is not a metric, since in general $D(P \| Q) \ne D(Q \| P)$ and the triangle inequality does not hold. So in what sense does the relative entropy represent a distance? In fact, it controls several bona fide metrics on the space of probability measures. One example of such a metric is the total variation distance.
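Definition. Let $P$ and $Q$ be probability measures on $\mathcal{A}$. The total variation distance is defined as $d_{\mathrm{TV}}(P,Q) = \sup_{A} |P(A) - Q(A)|$, where the supremum is over all measurable sets $A \subseteq \mathcal{A}$.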
The following are some simple facts about the total variation distance.
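1. $0 \le d_{\mathrm{TV}}(P,Q) \le 1$.
2. If $P$ and $Q$ have probability density functions $p$ and $q$ with respect to some common probability measure $\mu$, then $d_{\mathrm{TV}}(P,Q) = \frac{1}{2} \|p - q\|_{L^1(\mu)}$. To see this, define $A = \{x : p(x) > q(x)\}$. Then
$$\|p - q\|_{L^1(\mu)} = \int_A (p - q) \, d\mu + \int_{A^c} (q - p) \, d\mu = (P(A) - Q(A)) + (Q(A^c) - P(A^c)) = 2 (P(A) - Q(A)) = 2 \, d_{\mathrm{TV}}(P,Q),$$
where the last equality holds because the supremum in the definition is attained at $A$.
3. $d_{\mathrm{TV}}(P,Q) = \inf_{X \sim P, \, Y \sim Q} \mathbf{P}[X \ne Y]$, where the infimum is over all couplings of $P$ and $Q$.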
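On a finite space, the definition of the total variation distance as a supremum over sets can be compared by brute force with half the $L^1$ distance between the probability mass functions. The following sketch assumes two hypothetical distributions `P` and `Q` on four points; it enumerates all subsets, which is only feasible for very small spaces.

```python
from itertools import combinations

# Hypothetical pmfs on {0, 1, 2, 3} (illustrative values).
P = [0.1, 0.2, 0.3, 0.4]
Q = [0.25, 0.25, 0.25, 0.25]
points = range(len(P))

# Definition: supremum of |P(A) - Q(A)| over all subsets A, by exhaustive search.
tv_sup = max(
    abs(sum(P[x] - Q[x] for x in A))
    for r in range(len(P) + 1)
    for A in combinations(points, r)
)

# Half the L^1 distance between the two pmfs.
tv_l1 = 0.5 * sum(abs(p - q) for p, q in zip(P, Q))

# Value of P(A) - Q(A) on the set A = {x : p(x) > q(x)}.
A_star = [x for x in points if P[x] > Q[x]]
tv_star = sum(P[x] - Q[x] for x in A_star)

print(tv_sup, tv_l1, tv_star)
```

All three printed values should agree up to rounding.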
The following inequality shows that the total variation distance is controlled by the relative entropy. This shows that the relative entropy is a strong notion of distance.
Pinsker’s inequality. $d_{\mathrm{TV}}(P,Q)^2 \le \frac{1}{2} D(P \| Q)$.
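Proof. Without loss of generality, we can assume that $P$ and $Q$ have probability density functions $p$ and $q$ with respect to some common probability measure $\mu$ on $\mathcal{A}$. Let $A = \{x : p(x) > q(x)\}$ and $T = \mathbf{1}_A$.
Step 1: Prove this inequality by simple calculation in the case when $\mathcal{A}$ contains at most $2$ elements. In this case the claim reads $2(a-b)^2 \le a \log\frac{a}{b} + (1-a) \log\frac{1-a}{1-b}$ for $a,b \in [0,1]$. For fixed $a$, the difference of the two sides vanishes at $b=a$ and its derivative in $b$ is $(b-a)\left(\frac{1}{b(1-b)} - 4\right)$, which has the same sign as $b-a$ because $b(1-b) \le \frac{1}{4}$.
Step 2: Note that $PT^{-1}$ and $QT^{-1}$ are defined on the space $\{0,1\}$. So Pinsker’s inequality applies to $PT^{-1}$ and $QT^{-1}$ by Step 1. Thus, using the data processing inequality,
$$D(P \| Q) \ge D(PT^{-1} \| QT^{-1}) \ge 2 \, d_{\mathrm{TV}}(PT^{-1}, QT^{-1})^2 = 2 (P(A) - Q(A))^2 = 2 \, d_{\mathrm{TV}}(P,Q)^2.$$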
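The two-point case in Step 1 is easy to explore numerically. The sketch below evaluates the slack $\frac{1}{2} D - d_{\mathrm{TV}}^2$ for pairs of Bernoulli distributions on a grid of parameters; the grid and the function name `bernoulli_kl` are assumptions made for illustration, and a grid check is of course not a proof.

```python
import math

def bernoulli_kl(a, b):
    """D(Bern(a) || Bern(b)) in nats, with the convention 0 log 0 = 0 (requires 0 < b < 1)."""
    total = 0.0
    for u, v in ((a, b), (1 - a, 1 - b)):
        if u > 0:
            total += u * math.log(u / v)
    return total

# Interior grid points, so that no division by zero occurs.
grid = [k / 20 for k in range(1, 20)]

# For Bernoulli laws, the total variation distance is |a - b|.
# Pinsker's inequality says the slack D/2 - d_TV^2 is never negative.
min_slack = min(
    0.5 * bernoulli_kl(a, b) - (a - b) ** 2
    for a in grid for b in grid
)

print(min_slack)
```

The minimum slack over the grid is attained on the diagonal $a=b$, where both sides vanish.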
Law of small numbers
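As a first illustration of an application of entropy to probability, let us prove a simple quantitative law of small numbers. An example of the law of small numbers is the well known fact that $\mathrm{Bin}(n,\lambda/n) \to \mathrm{Po}(\lambda)$ in distribution as $n$ goes to infinity. More generally, if $X_1,\ldots,X_n$ are Bernoulli random variables with $X_i \sim \mathrm{Bern}(p_i)$, if $X_1,\ldots,X_n$ are weakly dependent, and if none of the $p_i$ dominates the rest, then $\mathcal{L}(\sum_{i=1}^n X_i) \approx \mathrm{Po}(\lambda)$ where $\lambda = \sum_{i=1}^n p_i$. This idea can be quantified easily using relative entropy.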
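Theorem. If $X_i \sim \mathrm{Bern}(p_i)$ and $X_1,\ldots,X_n$ may be dependent, then
$$D(\mathcal{L}(S_n) \| \mathrm{Po}(\lambda)) \le \sum_{i=1}^n p_i^2 + D(\mathcal{L}(X_1,\ldots,X_n) \| \mathcal{L}(X_1) \otimes \cdots \otimes \mathcal{L}(X_n)),$$
where $S_n = \sum_{i=1}^n X_i$ and $\lambda = \sum_{i=1}^n p_i$.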
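Proof. Let $Z_1,\ldots,Z_n$ be independent random variables with $Z_i \sim \mathrm{Po}(p_i)$. Then $\sum_{i=1}^n Z_i \sim \mathrm{Po}(\lambda)$. By the data processing inequality (applied to the map $(x_1,\ldots,x_n) \mapsto \sum_{i=1}^n x_i$) and the independence of the $Z_i$, we have
$$D(\mathcal{L}(S_n) \| \mathrm{Po}(\lambda)) = D\Big(\mathcal{L}\Big(\sum_{i=1}^n X_i\Big) \Big\| \mathcal{L}\Big(\sum_{i=1}^n Z_i\Big)\Big) \le D(\mathcal{L}(X_1,\ldots,X_n) \| \mathcal{L}(Z_1,\ldots,Z_n)) = \sum_{i=1}^n D(\mathcal{L}(X_i) \| \mathcal{L}(Z_i)) + D(\mathcal{L}(X_1,\ldots,X_n) \| \mathcal{L}(X_1) \otimes \cdots \otimes \mathcal{L}(X_n)).$$
To conclude, it is enough to note that
$$D(\mathcal{L}(X_i) \| \mathcal{L}(Z_i)) = (1-p_i) \log\frac{1-p_i}{e^{-p_i}} + p_i \log\frac{p_i}{p_i e^{-p_i}} = p_i^2 + (1-p_i)(p_i + \log(1-p_i)) \le p_i^2,$$
where the inequality uses $\log(1-p_i) \le -p_i$.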
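Remark. If $p_1 = \cdots = p_n = \lambda/n$ and $X_1,\ldots,X_n$ are independent, then the inequality in the theorem becomes $D(\mathrm{Bin}(n,\lambda/n) \| \mathrm{Po}(\lambda)) \le \lambda^2/n$. However, this rate of convergence is not optimal. One can show that under the same condition, $D(\mathrm{Bin}(n,\lambda/n) \| \mathrm{Po}(\lambda)) = O(1/n^2)$, using tools similar to those that will be used later to prove the entropic central limit theorem. Note that it is much harder to prove convergence of the relative entropy to zero in the entropic central limit theorem, even without rate of convergence!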
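The bound in the independent, identically distributed case can be compared with the actual relative entropy between the binomial and Poisson distributions. The sketch below is an illustration under stated assumptions: the value $\lambda = 2$ and the sample sizes are arbitrary choices, the function name `kl_binomial_poisson` is introduced here, and the probability mass functions are evaluated in log-space via `math.lgamma` to avoid overflow. It requires $\lambda < n$.

```python
import math

def kl_binomial_poisson(n, lam):
    """D(Bin(n, lam/n) || Po(lam)) in nats, summed over the binomial support (needs lam < n)."""
    p = lam / n
    total = 0.0
    for k in range(n + 1):
        # Log of the binomial pmf at k.
        log_b = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                 + k * math.log(p) + (n - k) * math.log(1 - p))
        # Log of the Poisson pmf at k.
        log_po = -lam + k * math.log(lam) - math.lgamma(k + 1)
        total += math.exp(log_b) * (log_b - log_po)
    return total

lam = 2.0  # hypothetical choice of the Poisson parameter
results = {n: (kl_binomial_poisson(n, lam), lam ** 2 / n) for n in (10, 100, 1000)}

for n, (divergence, bound) in results.items():
    # Each line shows n, the relative entropy, and the bound lam^2 / n from the theorem.
    print(n, divergence, bound)
```

In each row the computed relative entropy should lie below the bound $\lambda^2/n$, and comparing rows gives a rough impression of how much faster than $1/n$ it decays.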
Lecture by Mokshay Madiman  Scribed by Cheyu Liu