Lecture 2. Basics / law of small numbers

Due to scheduling considerations, we postpone the proof of the entropic central limit theorem. In this lecture, we discuss basic properties of the entropy and illustrate them by proving a simple version of the law of small numbers (Poisson limit theorem). The next lecture will be devoted to Sanov’s theorem. We will return to the entropic central limit theorem in Lecture 4.

Conditional entropy and mutual information

We begin by introducing two definitions related to entropy. The first definition is a notion of entropy under conditioning.

Definition. If X and Y are two discrete random variables with probability mass functions p_X and p_Y, then the conditional entropy of X given Y is defined as

    \[H(X|Y) := - \mathbb{E} [\log{p_{X|Y}(X|Y)} ]\]

where p_{X|Y}(x|y) = p_{(X,Y)}(x,y)/p_Y(y) is the conditional probability mass function of X given Y.

Remark. If X and Y are absolutely continuous random variables, the conditional differential entropy h(X|Y) is defined analogously (where the probability mass functions are replaced by the corresponding probability densities with respect to Lebesgue measure).

Note that

    \begin{equation*} \begin{split}  H(X|Y)  &=  - \sum_{x,y} p_{(X,Y)}(x,y)\log{p_{X|Y}(x|y)} \\          &= - \sum_y p_Y(y) \sum_x  p_{X|Y}(x|y)\log{p_{X|Y}(x|y)} \\          & = \sum_y p_Y(y) H(X|Y=y). \end{split} \end{equation*}

That is, the conditional entropy H(X|Y) is precisely the expectation (with respect to the law of Y) of the entropy of the conditional distribution of X given Y.
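As a small numerical illustration of this identity, the following sketch computes H(X|Y) in two ways, once from the definition and once as the average of the entropies of the conditional distributions. The joint probability mass function `joint` is a hypothetical example chosen only for illustration, and logarithms are natural.

```python
import math

# Hypothetical joint pmf p_{(X,Y)}(x, y) on {0,1} x {0,1} (illustrative values only).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def entropy(pmf):
    """Shannon entropy (in nats) of a pmf given as a dict value -> probability."""
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)

# Marginal pmf of Y.
p_y = {}
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, 0.0) + p

# Definition: H(X|Y) = -E[log p_{X|Y}(X|Y)].
h_def = -sum(p * math.log(p / p_y[y]) for (x, y), p in joint.items() if p > 0)

# Averaged form: sum_y p_Y(y) H(X | Y = y).
h_avg = 0.0
for y0, py in p_y.items():
    cond = {x: p / py for (x, y), p in joint.items() if y == y0}
    h_avg += py * entropy(cond)

print(h_def, h_avg)
```

The two printed values should agree up to floating point rounding.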

We now turn to the second definition, the mutual information. It describes the degree of dependence between two random variables.

Definition. The mutual information between two random variables X and Y is defined as

    \[I(X,Y) := D( \mathcal{L}(X,Y) || \mathcal{L}(X) \otimes \mathcal{L}(Y)),\]

where \mathcal{L}(X,Y), \mathcal{L}(X) and \mathcal{L}(Y) denote the distributions of (X,Y), X and Y.

Conditional entropy and mutual information are closely related. For example, if (X,Y) has density f_{(X,Y)} with respect to the Lebesgue measure, then

    \begin{equation*} \begin{split} I(X,Y) & = \int f_{(X,Y)}(x,y) \log{\frac{f_{(X,Y)}(x,y)}{f_X(x)f_Y(y)}} \,dx \,dy \\        & = \mathbb{E} \log{\frac{f_{(X,Y)}(X,Y)}{f_X(X)f_Y(Y)}} \\        & = \mathbb{E} \log{\frac{f_{X|Y}(X|Y)}{f_X(X)}}  \\        & = h(X)-h(X|Y). \end{split} \end{equation*}

In particular, since I(X,Y) is always nonnegative (because it is a relative entropy), we have just shown that h(X|Y) \leq h(X), that is, conditioning reduces entropy. The same result holds for discrete random variables when we replace h by H.
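The discrete version of this relation, I(X,Y) = H(X) - H(X|Y), can be checked numerically. The sketch below does so for a hypothetical joint probability mass function (`joint` and the helper `marginal` are illustrative names, not part of the lecture).

```python
import math

# Hypothetical joint pmf of (X, Y) on {0,1} x {0,1} (illustrative values only).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def marginal(joint, axis):
    """Marginal pmf of the coordinate with index `axis` (0 for X, 1 for Y)."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

p_x, p_y = marginal(joint, 0), marginal(joint, 1)

# I(X,Y) as the relative entropy between the joint law and the product of marginals.
mutual_info = sum(
    p * math.log(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items() if p > 0
)

# H(X) and H(X|Y), both computed from their definitions.
h_x = -sum(p * math.log(p) for p in p_x.values() if p > 0)
h_x_given_y = -sum(p * math.log(p / p_y[y]) for (x, y), p in joint.items() if p > 0)

print(mutual_info, h_x - h_x_given_y)
```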

Chain rules

Chain rules are formulas that relate the entropy of multiple random variables to the conditional entropies of these random variables. The most basic version is the following.

Chain rule for entropy. H(X_1, X_2, ..., X_n) = \sum_{i=1}^n H(X_i|X_1,...,X_{i-1}). In particular, H(X_2|X_1)=H(X_1, X_2)-H(X_1).

Proof. Note that

    \[p_{(X_1,...,X_n)}(x_1,...,x_n) = \prod_{i=1}^n p_{X_i|X_1,...,X_{i-1}}(x_i|x_1,...,x_{i-1}).\]

Taking logarithms on both sides gives

    \[\log{ p_{(X_1,...,X_n)}(x_1,...,x_n)} = \sum_{i=1}^n \log{p_{X_i|X_1,...,X_{i-1}}(x_i|x_1,...,x_{i-1})}.\]

Evaluating both sides at (X_1,...,X_n), taking the expectation, and multiplying by -1 gives the desired result. \qquad\square

Corollary. Entropy is sub-additive, that is, H(X_1, X_2, ..., X_n) \leq \sum_{i=1}^n H(X_i).

Proof. Combine the chain rule with H(X_i|X_1,...,X_{i-1}) \leq H(X_i). \qquad\square
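The chain rule and sub-additivity can be checked numerically on a small example. The sketch below uses a hypothetical joint probability mass function on \{0,1\}^3 (the weights are arbitrary), and computes each conditional entropy directly from its definition rather than through the chain rule.

```python
import itertools
import math

# Hypothetical joint pmf of (X_1, X_2, X_3) on {0,1}^3, built from arbitrary weights.
outcomes = list(itertools.product([0, 1], repeat=3))
weights = [1, 2, 3, 4, 5, 6, 7, 8]
total = sum(weights)
joint = {o: w / total for o, w in zip(outcomes, weights)}

def entropy(pmf):
    """Shannon entropy (in nats) of a pmf given as a dict."""
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)

def marginal(joint, idx):
    """pmf of the sub-vector (X_i : i in idx), with 0-based indices."""
    out = {}
    for o, p in joint.items():
        key = tuple(o[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def cond_entropy(joint, i):
    """H(X_i | X_1, ..., X_{i-1}) from the definition (i is 0-based)."""
    upto = marginal(joint, range(i + 1))
    before = marginal(joint, range(i))
    return -sum(p * math.log(p / before[k[:-1]]) for k, p in upto.items() if p > 0)

h_joint = entropy(joint)
chain_sum = sum(cond_entropy(joint, i) for i in range(3))
sum_marginals = sum(entropy(marginal(joint, [i])) for i in range(3))

print(h_joint, chain_sum, sum_marginals)
```

The first two values should coincide (chain rule), and neither should exceed the third (sub-additivity).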

There is also a chain rule for relative entropy.

Chain rule for relative entropy.

    \[D(\mathcal{L}(X,Y) || \mathcal{L}(X^{'},Y^{'})) =  D(\mathcal{L}(X)||\mathcal{L}(X^{'})) + \mathbb{E}_{x \sim X} [ D(\mathcal{L}(Y|X=x) || \mathcal{L}(Y^{'}|X^{'}=x))].\]

The following identity will be useful later.


    \begin{multline*}D(\mathcal{L}(X_1,...,X_n) || \mathcal{L}(Y_1) \otimes \cdots   \otimes  \mathcal{L}(Y_n)) = \\  \sum_{i=1}^n  D(\mathcal{L}(X_i) || \mathcal{L}(Y_i)) + D(\mathcal{L}(X_1,...,X_n) || \mathcal{L}(X_1) \otimes \cdots  \otimes  \mathcal{L}(X_n)).\end{multline*}

Proof. Note that

    \begin{equation*} \begin{split} & D(\mathcal{L}(X_1,...,X_n) || \mathcal{L}(Y_1) \otimes \cdots   \otimes  \mathcal{L}(Y_n)) \\ & =  \mathbb{E} \log{\frac{p_{(X_1,...,X_n)}(X_1,...,X_n)}{p_{Y_1}(X_1)\cdots p_{Y_n}(X_n)}} \\ & =  \mathbb{E} \log{\frac{p_{(X_1,...,X_n)}(X_1,...,X_n)}{p_{X_1}(X_1)\cdots p_{X_n}(X_n)}}        + \sum_{i=1}^n \mathbb{E}\log{\frac{p_{X_i}(X_i)}{p_{Y_i}(X_i)}} \\  & = D(\mathcal{L}(X_1,...,X_n) || \mathcal{L}(X_1) \otimes \cdots  \otimes  \mathcal{L}(X_n)) + \sum_{i=1}^n  D(\mathcal{L}(X_i) || \mathcal{L}(Y_i)) . \qquad\square \end{split} \end{equation*}
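For n = 2 the identity is easy to check numerically. In the sketch below, `joint_x` is a hypothetical joint law of a dependent pair (X_1, X_2), and `q1`, `q2` are hypothetical laws of Y_1 and Y_2; all values are illustrative.

```python
import math

# Hypothetical joint pmf of a dependent pair (X_1, X_2) and laws q1, q2 of Y_1, Y_2.
joint_x = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
q1 = {0: 0.7, 1: 0.3}
q2 = {0: 0.5, 1: 0.5}

# Marginal laws of X_1 and X_2.
p1 = {a: sum(p for (x1, _), p in joint_x.items() if x1 == a) for a in (0, 1)}
p2 = {b: sum(p for (_, x2), p in joint_x.items() if x2 == b) for b in (0, 1)}

def kl(p, q):
    """Relative entropy D(p||q) in nats for pmfs given as dicts on a common set."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)

# Product measures L(Y_1) x L(Y_2) and L(X_1) x L(X_2).
prod_y = {(a, b): q1[a] * q2[b] for a in (0, 1) for b in (0, 1)}
prod_x = {(a, b): p1[a] * p2[b] for a in (0, 1) for b in (0, 1)}

lhs = kl(joint_x, prod_y)
rhs = kl(p1, q1) + kl(p2, q2) + kl(joint_x, prod_x)

print(lhs, rhs)
```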

Data processing and convexity

Two important properties of the relative entropy can be obtained as consequences of the chain rule.

Data processing inequality. Let P and Q be two probability measures on \mathcal{A} and suppose T:\mathcal{A} \rightarrow \mathcal{A}^{'} is measurable. Then D(PT^{-1}||QT^{-1}) \leq D(P||Q), where PT^{-1} is the distribution of T(X) when X \sim P.

The data processing inequality tells us that if we process the data X (which might come from one of the two distributions P and Q), then the relative entropy cannot increase. In other words, it does not become easier to identify the source distribution after processing the data. The same result (with the same proof) holds also if P and Q are transformed by a transition kernel, rather than by a function.

Proof. Denote by \mathsf{P} and \mathsf{Q} the joint laws of (X,T(X)) and (Y,T(Y)) when X\sim P and Y\sim Q. By the chain rule and nonnegativity of relative entropy

    \[D(PT^{-1}||QT^{-1}) = D(\mathsf{P}||\mathsf{Q}) -    \mathbb{E}_{t \sim PT^{-1}} [ D(\mathcal{L}(X|T(X)=t) || \mathcal{L}(Y|T(Y)=t))] \le D(\mathsf{P}||\mathsf{Q}).\]

On the other hand, using again the chain rule,

    \[D(\mathsf{P}||\mathsf{Q}) = D(P||Q) + \mathbb{E}_{x\sim P} [ D(\mathcal{L}(T(X)|X=x) || \mathcal{L}(T(Y)|Y=x))] =  D(P||Q),\]

where we used \mathcal{L}(T(X)|X=x) = \mathcal{L}(T(Y)|Y=x). Putting these together completes the proof. \qquad\square
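A quick numerical illustration of the data processing inequality on a four-point space follows. The measures `P`, `Q` and the map `T` (parity) are hypothetical choices made only for this sketch.

```python
import math

def kl(p, q):
    """Relative entropy D(p||q) in nats for pmfs given as dicts on a common finite set."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

def pushforward(p, T):
    """Law of T(X) when X has pmf p, i.e. the image measure p T^{-1}."""
    out = {}
    for x, px in p.items():
        out[T(x)] = out.get(T(x), 0.0) + px
    return out

# Hypothetical P and Q on {0, 1, 2, 3}, and a map T that only keeps the parity of x.
P = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
Q = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}

def T(x):
    return x % 2

d_before = kl(P, Q)
d_after = kl(pushforward(P, T), pushforward(Q, T))

print(d_before, d_after)
```

Here `d_after` should not exceed `d_before`, since T discards information that helps to tell P and Q apart.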

Convexity of relative entropy. D(\cdot || \cdot) is jointly convex in its arguments, that is, if P_1, P_2, Q_1, Q_2 are probability measures and 0\leq \lambda \leq 1, then

    \[D(\lambda P_1 + (1-\lambda)P_2 || \lambda Q_1 + (1-\lambda)Q_2 ) \leq \lambda D(P_1 || Q_1) + (1-\lambda)D(P_2||Q_2).\]

Proof. Let T be a random variable that takes value 1 with probability \lambda and 2 with probability 1-\lambda. Conditionally on T=i, draw X\sim P_i and Y\sim Q_i. Then \mathcal{L}(X)=\lambda P_1+(1-\lambda)P_2 and \mathcal{L}(Y)=\lambda Q_1+(1-\lambda)Q_2. Using the chain rule twice, we obtain

    \[D(\mathcal{L}(X)||\mathcal{L}(Y)) \le    D(\mathcal{L}(X,T)||\mathcal{L}(Y,T)) =    \mathbb{E}_{t\sim \mathcal{L}(T)}[D(\mathcal{L}(X|T=t)||\mathcal{L}(Y|T=t))],\]

and the right hand side is precisely \lambda D(P_1 || Q_1) + (1-\lambda)D(P_2||Q_2). \qquad\square
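Joint convexity can also be observed numerically. The sketch below evaluates both sides of the inequality on a grid of values of \lambda, for hypothetical probability vectors `P1`, `P2`, `Q1`, `Q2` on a three-point space.

```python
import math

def kl(p, q):
    """Relative entropy D(p||q) in nats for pmfs given as lists of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mix(lam, a, b):
    """The mixture lam * a + (1 - lam) * b of two pmfs."""
    return [lam * ai + (1 - lam) * bi for ai, bi in zip(a, b)]

# Hypothetical probability vectors on a three-point space (illustrative values only).
P1, P2 = [0.7, 0.2, 0.1], [0.1, 0.3, 0.6]
Q1, Q2 = [0.5, 0.25, 0.25], [0.2, 0.2, 0.6]

# gaps[k] = (right hand side) - (left hand side) at lambda = k / 10.
gaps = []
for k in range(11):
    lam = k / 10
    lhs = kl(mix(lam, P1, P2), mix(lam, Q1, Q2))
    rhs = lam * kl(P1, Q1) + (1 - lam) * kl(P2, Q2)
    gaps.append(rhs - lhs)

print(min(gaps))
```

Every entry of `gaps` should be nonnegative up to rounding.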

Corollary. The entropy function H is concave.

Proof for a finite alphabet. When the alphabet \mathcal{A} is finite, the corollary can be proven by noting that H(P)=\log{|\mathcal{A}|} - D(P||\mathrm{Unif}(\mathcal{A})). \qquad\square
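To spell out the identity used in this proof, write p for the probability mass function of P (a notation introduced here only for convenience). Then

    \[D(P||\mathrm{Unif}(\mathcal{A})) = \sum_{x\in\mathcal{A}} p(x)\log{\frac{p(x)}{1/|\mathcal{A}|}} = \log{|\mathcal{A}|} + \sum_{x\in\mathcal{A}} p(x)\log{p(x)} = \log{|\mathcal{A}|} - H(P).\]

Since P \mapsto D(P||\mathrm{Unif}(\mathcal{A})) is convex (take Q_1=Q_2=\mathrm{Unif}(\mathcal{A}) in the convexity of relative entropy), it follows that H is concave.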

Relative entropy and total variation distance

Consider the hypothesis testing problem of testing the null hypothesis H_0: X \sim P against the alternative hypothesis H_1: X \sim Q. A test is a measurable function T:\mathcal{A} \rightarrow \{0,1\}. Under the constraint P(T(X)=1) \leq \alpha, it can be shown that, when the test is based on n independent samples, the optimal rate of decay of Q(T(X)=0) as a function of the sample size n is of the order of \exp{(-n\cdot D(P||Q))}. This means that D(P||Q) is a measure of how well one can distinguish between Q and P on the basis of data.

We will not prove this fact, but only introduce it to motivate that the relative entropy D is, in some sense, like a measure of distance between probability measures. However, it is not a metric, since in general D(P||Q) \neq D(Q||P) and the triangle inequality does not hold. So in what sense does the relative entropy represent a distance? In fact, it controls several bona fide metrics on the space of probability measures. One example of such a metric is the total variation distance.

Definition. Let P and Q be probability measures on \mathcal{A}. The total variation distance is defined as d_{\text{TV}}(P,Q)=\sup_{A \in \mathcal{B}(\mathcal{A})} |P(A)-Q(A)|.

The following are some simple facts about the total variation distance.

  1. 0 \leq d_{\text{TV}}(P,Q) \leq 1.
  2. If P and Q have probability density functions p and q with respect to some common probability measure \lambda, then d_{\text{TV}}(P,Q)= \frac{1}{2}||p-q||_{L^{1}(\lambda)}. To see this, define A=\{x\in \mathcal{A} : p(x)>q(x) \}. For every event B we have |P(B)-Q(B)| = |\int_B (p-q)\,d\lambda| \leq \int_A (p-q)\,d\lambda = P(A)-Q(A), so the supremum in the definition is attained at A. Then

        \begin{equation*} \begin{split} ||p-q||_{L^{1}(\lambda)} & =  \int_{A}(p(x)-q(x))\lambda(dx) + \int_{A^c}(q(x)-p(x))\lambda(dx)  \\  & = P(A) - Q(A) + (1-Q(A)-1+P(A)) \\  & = 2(P(A)-Q(A)) = 2 d_{\text{TV}}(P,Q). \end{split} \end{equation*}

  3. d_{\text{TV}}(P,Q)= \inf_{X \sim P, Y \sim Q}  \mathbb{P}(X\neq Y).
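On a finite space the facts above can be checked by brute force. The sketch below computes the total variation distance between two hypothetical probability mass functions `P` and `Q` in three ways: as a supremum over all events, as half the L^1 distance, and through the quantity 1 - \sum_x \min(p(x),q(x)). The last expression relies on the standard fact, not proved in these notes, that the infimum in item 3 is attained by a coupling that puts mass \min(p(x),q(x)) on each diagonal point (x,x).

```python
import itertools

# Hypothetical pmfs on the four-point space {0, 1, 2, 3} (illustrative values only).
P = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
Q = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
points = list(P)

# Definition: supremum of |P(A) - Q(A)| over all events A (here, all subsets).
tv_sup = max(
    abs(sum(P[x] for x in A) - sum(Q[x] for x in A))
    for r in range(len(points) + 1)
    for A in itertools.combinations(points, r)
)

# Item 2: half of the L^1 distance between the pmfs.
tv_l1 = 0.5 * sum(abs(P[x] - Q[x]) for x in points)

# Item 3 (assumed attained by a maximal coupling): P(X != Y) = 1 - sum_x min(p, q).
tv_coupling = 1 - sum(min(P[x], Q[x]) for x in points)

print(tv_sup, tv_l1, tv_coupling)
```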

The following inequality shows that the total variation distance is controlled by the relative entropy. This shows that the relative entropy is a strong notion of distance.

Pinsker’s inequality. d_{\text{TV}}(P,Q)^2 \leq \frac{1}{2} D(P||Q).

Proof. Without loss of generality, we can assume that P and Q have probability density functions p and q with respect to some common probability measure \lambda on \mathcal{A}. Let A=\{x\in \mathcal{A} : p(x)>q(x) \} and T=1_{A}.

Step 1: Prove this inequality by simple calculation in the case when \mathcal{A} contains at most 2 elements.
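One possible way to carry out the calculation in Step 1 is sketched here (the notation p, q, g below is introduced only for this sketch). On a two-point space, let p and q in [0,1] denote the masses that P and Q assign to one of the two points, so that d_{\text{TV}}(P,Q)=|p-q|. If q\in\{0,1\} and p \neq q, then D(P||Q)=\infty and there is nothing to prove. Otherwise, fix p and define for q\in(0,1)

    \[g(q) := p\log{\frac{p}{q}} + (1-p)\log{\frac{1-p}{1-q}} - 2(p-q)^2.\]

Then g(p)=0 and

    \[g'(q) = -\frac{p}{q} + \frac{1-p}{1-q} + 4(p-q) = (q-p)\bigg(\frac{1}{q(1-q)} - 4\bigg).\]

Since q(1-q) \leq \frac{1}{4}, the last factor is nonnegative, so g is nonincreasing for q<p and nondecreasing for q>p. Hence g(q) \geq g(p) = 0, which is Pinsker's inequality on a two-point space.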

Step 2: Note that PT^{-1} and QT^{-1} are defined on the space \{0,1\}, so by Step 1 Pinsker’s inequality holds for PT^{-1} and QT^{-1}. Thus, using the data processing inequality in the first step,

    \begin{equation*} \begin{split} D(P||Q) & \geq  D(PT^{-1}||QT^{-1}) \geq 2 d_{\text{TV}}(PT^{-1},QT^{-1})^2   \\  & = 2(P(A)-Q(A))^2 = 2 d_{\text{TV}}(P,Q)^2. \qquad\square \end{split} \end{equation*}
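As a sanity check (not a proof), Pinsker's inequality can be tested on randomly generated probability vectors. The sketch below uses a fixed seed and the hypothetical helper `random_pmf`; the alphabet size 5 and the number of trials are arbitrary choices.

```python
import math
import random

def kl(p, q):
    """Relative entropy D(p||q) in nats for pmfs given as lists of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tv(p, q):
    """Total variation distance, computed as half the L^1 distance."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def random_pmf(rng, k):
    """A random pmf on k points with strictly positive entries."""
    w = [rng.random() + 1e-3 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(0)  # fixed seed so that the experiment is reproducible

# slack = D(P||Q)/2 - d_TV(P,Q)^2, which Pinsker's inequality says is nonnegative.
slack = []
for _ in range(1000):
    p, q = random_pmf(rng, 5), random_pmf(rng, 5)
    slack.append(0.5 * kl(p, q) - tv(p, q) ** 2)

print(min(slack))
```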

Law of small numbers

As a first illustration of an application of entropy to probability, let us prove a simple quantitative law of small numbers. An example of the law of small numbers is the well known fact that Bin(n,\frac{\lambda}{n})  \rightarrow Po(\lambda) in distribution as n goes to infinity. More generally, if X_1,...,X_n are Bernoulli random variables with X_i \sim Bern(p_i), if X_1,...,X_n are weakly dependent, and if none of the p_i dominates the rest, then \mathcal{L}(\sum_{i=1}^n X_i) \approx Po(\lambda) where \lambda = \sum_{i=1}^{n} p_i. This idea can be quantified easily using relative entropy.

Theorem. If X_i \sim Bern(p_i) and X_1,...,X_n may be dependent, then

    \[D(\mathcal{L}(\bar X) || Po(\lambda)) \leq \sum_{i=1}^n p_i^2 + D(\mathcal{L}(X_1,...,X_n) || \mathcal{L}(X_1) \otimes \cdots \otimes \mathcal{L}(X_n) )\]

where \bar X = \sum_{i=1}^nX_i and \lambda = \sum_{i=1}^n p_i.

Proof. Let Z_1,...,Z_n be independent random variables with Z_i \sim Po(p_i). Then \bar Z = \sum_{i=1}^n Z_i \sim Po(\lambda). We have

    \begin{equation*} \begin{split} D(\mathcal{L}(\bar X) || Po(\lambda) )  & =  D(\mathcal{L}(\bar X) || \mathcal{L}(\bar Z)) \\  & \leq   D(\mathcal{L}(X_1,...,X_n) || \mathcal{L}(Z_1,...,Z_n) )\\  & = \sum_{i=1}^{n} D(\mathcal{L}(X_i) || \mathcal{L}(Z_i) ) + D(\mathcal{L}(X_1,...,X_n) || \mathcal{L}(X_1) \otimes \cdots \otimes \mathcal{L}(X_n) ). \\ \end{split} \end{equation*}

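Here the inequality is the data processing inequality applied to the map (x_1,...,x_n) \mapsto \sum_{i=1}^n x_i, and the last equality is the identity from the section on chain rules applied with Y_i = Z_i (recall that Z_1,...,Z_n are independent). To conclude, it is enough to note that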

    \begin{equation*} \begin{split} D(\mathcal{L}(X_i) || \mathcal{L}(Z_i) )  & =  (1-p_i)\log{\frac{1-p_i}{e^{-p_i}}} + p_i\log{\frac{p_i}{p_i e^{-p_i}}}  \\  & =  p_i^2 + (1-p_i)(p_i+\log{(1-p_i)})  \\  & \leq p_i^2 .  \qquad\square \end{split} \end{equation*}
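The last computation can be checked numerically. The sketch below evaluates D(Bern(p) || Po(p)) directly from the definition and compares it with the closed form and with the bound p^2, for a few arbitrary values of p (the function names are illustrative).

```python
import math

def kl_bern_poisson(p):
    """D(Bern(p) || Po(p)) in nats, summed over the support {0, 1} of Bern(p)."""
    d = 0.0
    for k, b in ((0, 1 - p), (1, p)):
        if b > 0:
            pois = math.exp(-p) * p**k  # Po(p) mass at k in {0, 1}: e^{-p} p^k / k!
            d += b * math.log(b / pois)
    return d

def closed_form(p):
    """The expression p^2 + (1-p)(p + log(1-p)) from the proof."""
    return p**2 + (1 - p) * (p + math.log(1 - p))

# Arbitrary test values of p in (0, 1).
ps = [0.01, 0.1, 0.3, 0.5, 0.9]
results = [(p, kl_bern_poisson(p), closed_form(p), p**2) for p in ps]

for row in results:
    print(row)
```

In each row, the second and third entries should agree, and neither should exceed the fourth.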

Remark. If p_1= \cdots = p_n = \frac{\lambda}{n} and X_1,...,X_n are independent, then the inequality in the theorem becomes D(Bin(n,\frac{\lambda}{n}) || Po(\lambda) ) \leq \frac{\lambda^2}{n}. However, this rate of convergence is not optimal. One can show that under the same condition, D(Bin(n,\frac{\lambda}{n}) || Po(\lambda) )= o(\frac{1}{n}), using tools similar to those that will be used later to prove the entropic central limit theorem. Note that it is much harder to prove D(\mathcal{L}(S_n)|| \mathcal{N}(0,1)) \rightarrow 0 in the entropic central limit theorem, even without a rate of convergence!
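The bound in this remark can be compared with the actual relative entropy numerically. The sketch below computes D(Bin(n,\frac{\lambda}{n}) || Po(\lambda)) for a few values of n, working with logarithms of the probability mass functions (via `math.lgamma`) to avoid overflow. The choices \lambda = 2 and n \in \{10, 100, 1000\} are arbitrary.

```python
import math

def log_binom_pmf(n, p, k):
    """log of the Bin(n, p) mass at k, for 0 < p < 1."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log(1 - p)

def log_poisson_pmf(lam, k):
    """log of the Po(lam) mass at k."""
    return -lam + k * math.log(lam) - math.lgamma(k + 1)

def kl_binom_poisson(n, lam):
    """D(Bin(n, lam/n) || Po(lam)) in nats, summed over k = 0, ..., n."""
    p = lam / n
    total = 0.0
    for k in range(n + 1):
        lb = log_binom_pmf(n, p, k)
        total += math.exp(lb) * (lb - log_poisson_pmf(lam, k))
    return total

lam = 2.0
ns = [10, 100, 1000]
divergences = [kl_binom_poisson(n, lam) for n in ns]
bounds = [lam**2 / n for n in ns]

for n, d, b in zip(ns, divergences, bounds):
    print(n, d, b)
```

In this experiment the computed relative entropy should stay below \frac{\lambda^2}{n} and shrink noticeably faster than the bound as n grows, in line with the remark that the rate \frac{\lambda^2}{n} is not optimal.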

Lecture by Mokshay Madiman | Scribed by Che-yu Liu

03. October 2013 by Ramon van Handel