Schedule for rest of the semester

Here is the schedule for the remaining three lectures of the semester:

November 21: Emmanuel Abbe on information theoretic inequalities.

November 28: No lecture (Thanksgiving).

December 5 and 12: Ramon van Handel on transportation-information inequalities and concentration.

The remaining notes on the entropic CLT (Lecture 7) will be posted soon.

20. November 2013 by Ramon van Handel
Categories: Announcement

Lecture 6. Entropic CLT (3)

In this lecture, we complete the proof of monotonicity of the Fisher information in the CLT, and begin developing the connection with entropy. The entropic CLT will be completed in the next lecture.

Variance drop inequality

In the previous lecture, we proved the following decomposition result for functions of independent random variables due to Hoeffding.

ANOVA decomposition. Let X_1, \ldots, X_n be independent random variables taking values in \mathbb{R}, and let \Psi: \mathbb{R}^n \rightarrow \mathbb{R} be such that \mathbb{E}[\Psi(X_1,\ldots,X_n)^2]< \infty. Then \Psi satisfies

    \begin{align*} \Psi(X_1,\ldots,X_n)&= \sum_{t \subseteq \{1,\ldots,n\}} \bar{E_t}[\Psi] \\ \text{where } \bar{E_t}[\Psi]&:=\bigg(\prod_{i \notin t} E_i \prod_{j \in t} (I-E_j)\bigg)\Psi \\ \text{and } E_{i} [\Psi] &:= \mathbb{E}[\Psi | X_1,\ldots,X_{i-1},X_{i+1},\ldots,X_n]. \end{align*}

Furthermore, \mathbb{E}[\bar{E_t}\Psi \cdot \bar{E_s}\Psi]=0 \text{ if } t\neq s.

Note that \bar{E_t}\Psi is a function only of X_t=(X_{i_1}, \ldots, X_{i_m}) (i_1<\ldots < i_m are the ordered elements of t).

In the previous lecture, we proved superadditivity of the inverse Fisher information I^{-1}. The key part of the proof was the observation that the score function of the sum could be written as the conditional expectation of a sum of independent random variables, whose variance is trivially computed. This does not suffice, however, to prove monotonicity in the CLT. To do the latter, we need a more refined bound on the Fisher information in terms of overlapping subsets of indices. Following the same proof, the score function of the sum can be written as the conditional expectation of a sum of terms that are now no longer independent. To estimate the variance of this sum, we will use the following “variance drop lemma” whose proof relies on the ANOVA decomposition.

Lemma. Let U(X_1,\ldots,X_n)=\sum_{s \in \mathcal{G}} \Psi_s(X_s) where \Psi_s : \mathbb{R}^{|s|} \rightarrow \mathbb{R} and \mathcal{G} is some collection of subsets of \{1,\ldots,n\}. If X_1,\ldots, X_n are independent random variables with \mathbb{E} \Psi_s(X_s)^2 < \infty \text{ } \forall s \in \mathcal{G}, then

    \[\mathrm{Var}(U(X_1,\ldots, X_n)) \leq \sum_{s \in \mathcal{G}} \frac{1}{\beta_s}\,\mathrm{Var}(\Psi_s(X_s)) ,\]

where \{\beta_s : s \in \mathcal{G} \} is a fractional packing with respect to \mathcal{G}.

Remarks.

  1. Recall that a fractional packing is a function \beta:\mathcal{G} \rightarrow [0,1] such that \sum_{s \in \mathcal{G}, s \ni i} \beta_s\leq 1 \text{ } \forall i \in [n].

    Example 1. Let d(i)=\#\{s \in \mathcal{G} : s \ni i \}, and define d_+= \max_i d(i). Taking \beta_s=\frac{1}{d_+} always defines a fractional packing, as \sum_{s \in \mathcal{G}, s \ni i} \frac{1}{d_+}=\frac{1}{d_+} \cdot d(i) \leq 1 by definition of d_+.

    Example 2. If \mathcal{G}=\{s\subseteq\{1,\ldots,n\}:|s|=m\}, then d_+=\binom{n-1}{m-1}.

  2. The original paper of Hoeffding (1948) proves the following special case where each \Psi_s is symmetric in its arguments and \mathcal{G} is as in Example 2 above: U=\frac{1}{\binom{n}{m}}\sum_{|s|=m}\Psi(X_s) (the U-statistic) satisfies

        \[\mathrm{Var}(U) \leq \frac{m}{n} \mathrm{Var}(\Psi(X_s)).\]

    Of course, if m=1 then \mathrm{Var}(U) = \frac{1}{n} \mathrm{Var}(\Psi(X_1)). Thus Hoeffding’s inequality for the variance of U-statistics above and the more general variance drop lemma should be viewed as capturing how much of a drop we get in variance of an additive-type function, when the terms are not independent but have only limited dependencies (overlaps) in their structure.
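
Though not needed for the argument, Hoeffding's bound is easy to test numerically. Here is a minimal Monte Carlo sketch in Python (assuming numpy; the kernel \Psi(x,y)=xy, the centered exponential inputs, and the constants are arbitrary choices, not taken from the lecture):

    # Monte Carlo check of Hoeffding's variance bound for U-statistics:
    # Var(U) <= (m/n) * Var(Psi(X_s)) for the symmetric kernel Psi(x, y) = x*y (so m = 2).
    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(0)
    n, m, reps = 6, 2, 20000

    def psi(x):                                    # symmetric kernel: product of the two coordinates
        return x[0] * x[1]

    U = np.empty(reps)
    for k in range(reps):
        X = rng.exponential(size=n) - 1.0          # centered inputs (any square-integrable law works)
        U[k] = np.mean([psi(X[list(s)]) for s in combinations(range(n), m)])

    var_U = U.var()
    var_psi = (rng.exponential(size=(reps, m)) - 1.0).prod(axis=1).var()   # Var of a single term
    print(var_U, "<=", (m / n) * var_psi)          # ~0.067 versus the bound ~0.333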

Proof. We may assume without loss of generality that each \Psi_s(X_s) has mean zero.

    \begin{align*}  U(X_1,\ldots,X_n)&=\sum_{s \in \mathcal{G}} \Psi_s(X_s)= \sum_{s \in \mathcal{G}} \sum_{t \subseteq s} \bar{E_t}[\Psi_s(X_s)]\qquad\text{ by ANOVA}\\ &= \sum_{t \subseteq [n]} \sum_{s \in \mathcal{G}, s \supseteq t} \bar{E_t} \Psi_s(X_s)\\  &= \sum_{t \subseteq [n]} \bar{E_t} \sum_{s \in \mathcal{G}, s \supseteq t} \Psi_{s}. \end{align*}

We then have, using orthogonality of the terms in the ANOVA decomposition:

    \begin{align*} \mathbb{E}U^2 &= \sum_{t \subseteq [n]} \mathbb{E} \bigg[\sum_{s \in \mathcal{G} , s \supseteq t} \bar{E_t}[\Psi_s(X_s)] \bigg]^2. \end{align*}

For each term, we have

    \begin{align*} \Big[\sum_{s \in \mathcal{G} , s \supseteq t} \frac{\sqrt{\beta_s}\bar{E_t}[\Psi_s(X_s)]}{\sqrt{\beta_s}}\Big]^2 &\leq \Big[\sum_{s \in \mathcal{G}, s \supseteq t} \beta_s\Big] \Big[\sum_{s \in \mathcal{G}, s \supseteq t} \frac{[\bar{E_t}[\Psi_s(X_s)]]^2}{\beta_s}\Big] \qquad\text{ by Cauchy-Schwarz}\\ &\leq \sum_{s \in \mathcal{G}, s \supseteq t}\frac{[\bar{E_t}[\Psi_s(X_s)]]^2}{\beta_s} ,  \end{align*}

where the second inequality follows from the definition of fractional packing if t is non-empty; for t=\varnothing the term vanishes altogether, since \bar{E}_{\varnothing} takes each \Psi_s to its mean, which is zero. Hence

    \begin{align*} \mathbb{E}[U^2] &\leq \sum_{t \subseteq [n]} \mathbb{E} \Big[ \sum_{s \in \mathcal{G}, s \supseteq t}\frac{[\bar{E_t}[\Psi_s(X_s)]]^2}{\beta_s} \Big]\\ &= \sum_{s \in \mathcal{G}} \frac{1}{\beta_s} \sum_{t \subseteq s} \mathbb{E} [\bar{E_t} \Psi_s]^2\\ &= \sum_{s \in \mathcal{G}} \frac{1}{\beta_s} \mathbb{E}[\Psi_s(X_s)^2] , \end{align*}

again using orthogonality of the \bar{E_t}\Psi_s in the last step. Since each \Psi_s(X_s) has mean zero, \mathbb{E}[\Psi_s(X_s)^2]=\mathrm{Var}(\Psi_s(X_s)) and \mathrm{Var}(U)=\mathbb{E}[U^2], which is the desired bound. \qquad\square

Monotonicity of Fisher information

We can now finally prove monotonicity of the Fisher information.

Corollary. Let X_i be independent random variables with I(X_i) <\infty. Then

    \[I(X_1+\ldots + X_n) \leq \sum_{s \in \mathcal{G}} \frac{\omega_s^2}{\beta_s} I\bigg(\sum_{i \in s} X_i\bigg)\]

for any hypergraph \mathcal{G} on [n], fractional packing \beta, and positive weights \{\omega_s :s \in \mathcal{G}\} summing to 1.

Proof. Recall that I(X)=\mathrm{Var}(\rho_X(X)) and \rho_X(x)=\frac{f_X'(x)}{f_X(x)}. The identity proved in the last lecture states

    \begin{align*} \rho_{X+Y}(x)&= \mathbb{E}[\rho_X(X)|X+Y=x]\\ \rho_{X+Y}(X+Y)&=\mathbb{E}[\rho_X(X)|X+Y] \end{align*}

With T_s=\sum_{i \in s} X_i, we can write

    \[\rho_{T_{[n]}}(T_{[n]})=\mathbb{E}[\rho_{T_s}(T_s)|T_{[n]}] \text{ }\qquad\forall s \in \mathcal{G}\]

since T_{[n]}=T_s+T_{s^c}. By taking a convex combination of these identities,

    \begin{align*} \rho_{T_{[n]}}(T_{[n]}) &= \sum_{s \in \mathcal{G}} \omega_s \mathbb{E}[\rho_{T_{s}}(T_s) | T_{[n]}] \\ &= \mathbb{E}\bigg[\sum_{s \in \mathcal{G}} \omega_s \rho_{T_{s}}(T_s) | T_{[n]}\bigg] . \end{align*}

Now by using the Pythagorean inequality (or Jensen’s inequality) and the variance drop lemma, we have

    \begin{align*} I(T_{[n]}) &= \mathbb{E} [\rho^2_{T_{[n]}}(T_{[n]})]\\ &\leq \mathbb{E}\bigg[\bigg(\sum_{s \in \mathcal{G}} \omega_s \rho_{T_s}(T_s)\bigg)^2\bigg]\\ &\leq \sum_{s \in \mathcal{G}} \frac{1}{\beta_s} \omega_s^2 \mathbb{E}[\rho^2_{T_s}(T_s)]\\ &= \sum_{s \in \mathcal{G}} \frac{\omega_s^2}{\beta_s} I(T_s) \end{align*}

as desired. \qquad\square

Remark. The \omega_s being arbitrary weights, we can optimize over them. This gives

    \[\frac{1}{I(T_{[n]})} \geq \sum_{s \in \mathcal{G}} \frac{\beta_s}{I(T_s)}.\]
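
(For completeness, here is the optimization step, a routine application of Cauchy-Schwarz that is not spelled out in the lecture.) Since the weights sum to one,

    \[1=\bigg(\sum_{s\in\mathcal{G}}\omega_s\bigg)^2=\bigg(\sum_{s\in\mathcal{G}}\sqrt{\frac{\beta_s}{I(T_s)}}\cdot\omega_s\sqrt{\frac{I(T_s)}{\beta_s}}\bigg)^2\leq\bigg(\sum_{s\in\mathcal{G}}\frac{\beta_s}{I(T_s)}\bigg)\bigg(\sum_{s\in\mathcal{G}}\frac{\omega_s^2}{\beta_s}I(T_s)\bigg),\]

and the choice \omega_s\propto\beta_s/I(T_s) attains equality. Combining this with the Corollary yields the inequality above.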

With \mathcal{G} being all singletons and \beta_s=1 we recover the superadditivity property of I^{-1}. With \mathcal{G} being all sets of size n-1, \beta_s=\frac{1}{\binom{n-1}{n-2}}=\frac{1}{n-1}, and X_1,\ldots,X_n i.i.d., we get

    \[\frac{1}{I(T_{[n]})} \geq \frac{n}{n-1}\frac{1}{I(T_{[n-1]})} \Rightarrow I\Big(\frac{T_{[n]}}{\sqrt{n}}\Big) \leq I\Big(\frac{T_{[n-1]}}{\sqrt{n-1}}\Big) \Leftrightarrow I(S_n) \leq I(S_{n-1}).\]

Thus we have proved the monotonicity of Fisher information in the central limit theorem.

From Fisher information to entropy

Having proved monotonicity for the CLT written in terms of Fisher information, we now want to show the analogous statement for entropy. The key tool here is the de Bruijn identity.

To formulate this identity, let us introduce some basic quantities. Let X\sim f on \mathbb{R}, and define

    \[X^t = e^{-t}X+\sqrt{1-e^{-2t}}Z\]

where Z \sim \mathcal{N}(0,1). Denote by f_t the density of X^t. The following facts are readily verified for t>0:

  1. f_t>0.
  2. f_t(\cdot) is smooth.
  3. I[f_t] < \infty.
  4. \frac{\partial f_t(x)}{\partial t}= f_t^{\prime \prime}(x)+ \frac{d}{dx} [xf_t(x)] =: (Lf_t)(x).

Observe that X^0=X has density f, and that as t \rightarrow \infty , X^t converges to Z, which has a standard Gaussian distribution. Thus X^t provides an interpolation between the density f and the normal density.
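
As an illustration (not needed for any proof), the interpolation is easy to visualize numerically. The sketch below, in Python (assuming numpy/scipy; the centered exponential starting law and all constants are arbitrary choices), estimates f_t from the representation f_t(x)=\mathbb{E}[\varphi_{\sigma_t}(x-e^{-t}X)] with \sigma_t^2=1-e^{-2t}, and shows D(f_t\|g) shrinking towards 0 as t grows:

    # Numerical illustration of the interpolation X^t = e^{-t} X + sqrt(1 - e^{-2t}) Z:
    # estimate the density f_t as a Gaussian mixture over samples of X and watch D(f_t || g) decrease.
    import numpy as np
    from scipy.stats import norm
    from scipy.integrate import trapezoid

    rng = np.random.default_rng(1)
    X = rng.exponential(size=2000) - 1.0          # a centered, unit-variance, non-Gaussian start: Exp(1) - 1
    xs = np.linspace(-6.0, 6.0, 2001)
    g = norm.pdf(xs)                              # standard normal target density

    for t in [0.1, 0.5, 1.0, 2.0, 4.0]:
        a, s = np.exp(-t), np.sqrt(1.0 - np.exp(-2.0 * t))
        f_t = norm.pdf(xs[:, None], loc=a * X[None, :], scale=s).mean(axis=1)   # f_t(x) = E[ phi_s(x - a X) ]
        D = trapezoid(f_t * np.log(f_t / g), xs)                                # estimate of D(f_t || g)
        print(f"t = {t:3.1f}   D(f_t || g) ~ {D:.4f}")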

Remark. Let us recall some standard facts from the theory of diffusions. The Ornstein-Uhlenbeck process X(t) is defined by the stochastic differential equation

    \[dX(t)= -X(t) dt + \sqrt{2} dB(t),\]

where B(t) is Brownian motion. This is, like Brownian motion, a Markov process, but the drift term (which always pushes trajectories towards 0) ensures that it has a stationary distribution, unlike Brownian motion. The Markov semigroup associated to this Markov process, namely the semigroup of operators defined on an appropriate domain by

    \[P_t \Psi(x)=\mathbb{E}[\Psi(X(t))|X_0=x] ,\]

has a generator A (defined via A= \lim_{t\downarrow 0} \frac{P_t - I}{t}) given by A\Psi(x)=\Psi^{\prime \prime}(x)-x\Psi'(x). The semigroup P_t generated by A governs the evolution of conditional expectations of functions of the process X(t), while the adjoint semigroup generated by L=A^* governs the evolution of the marginal density of X(t). The above expression for \partial f_t/\partial t follows from this remark by noting that if X(0)\sim f, then X(t) and X^t have the same distribution; however, it can also be deduced more simply just by writing down the density of X^t explicitly, and using the smoothness of the Gaussian density to verify each part of the claim.

We can now formulate the key identity.

de Bruijn identity. Let g be the density of the standard normal N(0,1).

  1. Differential form:

        \[\frac{d}{dt} D(f_t \| g)=-\frac{1}{2} J(f_t),\]

    where J(f)=Var(f) \cdot I(f)-1 is the normalized Fisher information.

  2. Integral form:

        \[D(f \| g)=\frac{1}{2} \int_0^\infty J(f_t) dt .\]

The differential form follows by using the last part of the claim together with integration by parts. The integral form follows from the differential form by the fundamental theorem of calculus, since

    \[D(g \| g)-D(f \| g)=- \frac{1}{2}\int_0^\infty J(f_t) dt ,\]

which yields the desired identity since D(g\|g)=0.

This gives us the desired link between Fisher information and entropy. In the next lecture, we will use this to complete the proof of the entropic central limit theorem.

Lecture by Mokshay Madiman | Scribed by Georgina Hall

13. November 2013 by Ramon van Handel
Categories: Information theoretic methods

Lecture 5. Entropic CLT (2)

The goal of this lecture is to prove monotonicity of Fisher information in the central limit theorem. Next lecture we will connect Fisher information to entropy, completing the proof of the entropic CLT.

Two lemmas about the score function

Recall that for a random variable X with absolutely continuous density f, the score function is defined as \rho(x)=f'(x)/f(x) and the Fisher information is I(X)=\mathbb{E}[\rho^2(X)].

The following lemma was proved in the previous lecture.

Lemma 1. Let X be a random variable with finite Fisher information. Let w:\mathbb{R} \rightarrow \mathbb{R} be measurable and let \eta(u) = \mathbb{E}[w^2(u + X)]. If \eta is bounded on the interval [a,b], then the function g(u) = \mathbb{E} w(u+X) is absolutely continuous on [a,b] with g'(u) = -\mathbb{E}[w(u+X)\rho(X)] a.e.

There is a converse to the above lemma, which gives a useful characterization of the score function.

Lemma 2. Let X be a random variable with density f and let m: \mathbb{R} \rightarrow \mathbb{R} be a measurable function with \mathbb{E}|m(X)|<\infty. Suppose that for every bounded measurable function w:\mathbb{R} \rightarrow \mathbb{R}, the function g(u) = \mathbb{E} w(u+X) is absolutely continuous on \mathbb{R} with g'(u) = -\mathbb{E}[w(u+X)m(X)] a.e. Then there must exist an absolutely continuous version of the density f, and moreover m(X)=\rho(X) a.s.

Proof. Take w(x)=\mathbb{I}_{\{x\leq t\}}. Then

    \[g(u)=\int_{-\infty}^{t-u}f(x)dx.\]

Hence g is absolutely continuous with g'(u)=-f(t-u) for almost every u. On the other hand,

    \[\begin{split} g'(u)&=-\mathbb{E}[w(u+X)m(X)]\\ &=-\int_{-\infty}^{t-u}f(x)m(x)\,dx\qquad\mbox{a.e.} \end{split}\]

by our assumption. As \mathbb{E}|m(X)|<\infty, the right-hand side is continuous in u, so comparing the two expressions for g' shows that f admits a version satisfying

    \[f(y)=\int_{-\infty}^{y}f(x)m(x)dx\]

for every y\in\mathbb{R}. In particular, this version of f is absolutely continuous with f'(x)=f(x)m(x) a.e. Since \mathbb{P}[f(X)=0]=0, it follows that m(X)=\rho(X) a.s. \square

Score function of the sum of independent variables

We now show a key property of the score function: the score function of the sum of two independent random variables is a projection.

Proposition. Let X and Y be two independent random variables. Suppose X has finite Fisher information. Then \rho_{X+Y}(X+Y)=\mathbb{E}[\rho_X(X)|X+Y] a.s.

Proof. Let m(u)=\mathbb{E}[\rho_X(X)|X+Y=u]. By Lemma 2, we only need to show that the function g(u) = \mathbb{E} w(u+X+Y) is locally absolutely continuous with g'(u) = -\mathbb{E}[w(u+X+Y)m(X+Y)] a.e. for every bounded measurable function w.

Fix a\in\mathbb{R}. By independence of X and Y, we can apply Lemma 1 to X (conditioned on Y) to obtain

    \[\mathbb{E}[w(u+X+Y)|Y]-\mathbb{E}[w(a+X+Y)|Y]=-\int_a^u\mathbb{E}[w(t+X+Y)\rho_X(X)|Y]\,dt.\]

Taking expectation of both sides and applying Fubini, we get

    \[g(u)-g(a)=-\int_a^u\mathbb{E}[w(t+X+Y)\rho_X(X)]\,dt.\]

Since

    \[\begin{split}\mathbb{E}[w(t+X+Y)\rho_X(X)]&=\mathbb{E}[\mathbb{E}[w(t+X+Y)\rho_X(X)|X+Y]]\\ &=\mathbb{E}[w(t+X+Y)\mathbb{E}[\rho_X(X)|X+Y]]\\ &=\mathbb{E}[w(t+X+Y)m(X+Y)],\end{split}\]

we arrive at

    \[g(u)-g(a)=-\int_a^u\mathbb{E}[w(t+X+Y)m(X+Y)]\,dt,\]

which implies g'(u) = -\mathbb{E}[w(u+X+Y)m(X+Y)] a.e. \qquad\square

Remark. The well-known interpretation of conditional expectation as a projection means that under the assumption of finite Fisher information (i.e., score functions in L^2(\Omega,\mathcal{F},\mathbb{P})), the score function of the sum is just the projection of the score of a summand onto the closed subspace L^2(\Omega, \sigma\{X+Y\}, \mathbb{P}). This implies directly, by the Pythagorean inequality, that convolution decreases Fisher information: I(X+Y)=\mathbb{E}[\rho_{X+Y}^2(X+Y)]\leq\mathbb{E}[\rho_{X}^2(X)]=I(X). In fact, we can do better, as we will see forthwith.

Monotonicity of FI in the CLT along a subsequence

We now make a first step towards proving monotonicity of the Fisher information in the CLT.

Theorem. Let X and Y be independent random variables both with finite Fisher information. Then

    \[I(X+Y)\leq\lambda^2I(X)+(1-\lambda)^2I(Y)\]

for any \lambda\in\mathbb{R}.

Before going into the proof, we make some remarks.

Remark. Taking \lambda=1, we get I(X+Y)\leq I(X). Hence the above theorem is a significant strengthening of the simple fact that convolution decreases Fisher information.

Remark. By taking \lambda=\frac{I(Y)}{I(X)+I(Y)}, we get the following optimized version of the theorem:

    \[\frac{1}{I(X+Y)}\geq\frac{1}{I(X)}+\frac{1}{I(Y)}.\]
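
This optimized form is also convenient for a numerical sanity check. The following sketch (assuming numpy/scipy; the logistic/Gaussian pair, the grid, and the finite-difference scheme are arbitrary choices) computes the three Fisher informations by numerical convolution and confirms the inequality:

    # Numerical check of 1/I(X+Y) >= 1/I(X) + 1/I(Y)
    # for X standard logistic (I(X) = 1/3) and Y standard normal (I(Y) = 1).
    import numpy as np
    from scipy.stats import logistic, norm
    from scipy.integrate import trapezoid

    xs = np.linspace(-30.0, 30.0, 6001)            # grid wide enough that the densities vanish at the ends
    dx = xs[1] - xs[0]
    f_X, f_Y = logistic.pdf(xs), norm.pdf(xs)

    def fisher_information(f):
        """I(f) = int (f')^2 / f, with f' computed by finite differences on the grid."""
        fp = np.gradient(f, dx)
        return trapezoid(fp ** 2 / f, xs)

    f_sum = np.convolve(f_X, f_Y, mode="same") * dx              # density of X + Y on the same grid
    I_X, I_Y, I_sum = map(fisher_information, (f_X, f_Y, f_sum))
    print(1.0 / I_sum, ">=", 1.0 / I_X + 1.0 / I_Y)              # the first number should be at least 4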

Remark. Let X_1,X_2,\ldots be an i.i.d. sequence of random variables with mean zero and unit variance. Let S_n=\frac{1}{\sqrt{n}}\sum_{i=1}^nX_i. By taking \lambda=\frac{1}{2} in the theorem and using the scaling property of Fisher information I(aX)=a^{-2}I(X), it is easy to obtain I(S_{2n})\leq I(S_n). Hence, the above theorem already implies monotonicity of Fisher information along the subsequence of times 1,2,4,8,16,\ldots: that is, I(S_{2^n}) is monotone in n.
However, the theorem is not strong enough to give monotonicity I(S_{n+1})\leq I(S_n) without passing to a subsequence. For example, if we apply the previous remark repeatedly we only get I(S_n)\leq I(S_1), which is not very interesting. To prove full monotonicity of the Fisher information, we will need a strengthening of the above Theorem. But it is instructive to first consider the proof of the simpler case.

Proof. By the projection property of the score function,

    \[\begin{split} &\rho_{X+Y}(X+Y)=\mathbb{E}[\rho_X(X)|X+Y],\\ &\rho_{X+Y}(X+Y)=\mathbb{E}[\rho_Y(Y)|X+Y].\\ \end{split}\]

Hence

    \[\rho_{X+Y}(X+Y)=\mathbb{E}[\lambda\rho_X(X)+(1-\lambda)\rho_Y(Y)|X+Y].\]

Applying the conditional Jensen inequality, we obtain,

    \[\rho_{X+Y}^2(X+Y)\leq\mathbb{E}[(\lambda\rho_X(X)+(1-\lambda)\rho_Y(Y))^2|X+Y].\]

Taking expectations of both sides and using independence, we obtain

    \[I(X+Y)\leq\mathbb{E}[(\lambda\rho_X(X)+(1-\lambda)\rho_Y(Y))^2] =\lambda^2I(X)+(1-\lambda)^2I(Y)+2\lambda(1-\lambda)\mathbb{E}[\rho_X(X)]\mathbb{E}[\rho_Y(Y)].\]

By Lemma 1, \mathbb{E}[\rho_X(X)]=0. This finishes the proof. \qquad\square

Monotonicity of Fisher information in the CLT

To prove monotonicity of the Fisher information in the CLT (without passing to a subsequence) we need a strengthening of the property of Fisher information given in the previous section.

In the following, we will use the common notation [n]:=\{1,\ldots,n\}.

Definition. Let \mathcal{G} be a collection of non-empty subsets of [n]. Then a collection of non-negative numbers \{\beta_s:s\in\mathcal{G}\} is called a fractional packing for \mathcal{G} if \sum_{s:i\in s}\beta_s\leq 1 for all 1\leq i\leq n.

We can now state the desired strengthening of our earlier theorem.

Theorem. Let \{\beta_s:s\in\mathcal{G}\} be a fractional packing for \mathcal{G} and let X_1,\ldots,X_n be independent random variables. Let \{w_s:s\in\mathcal{G}\} be a collection of real numbers such that \sum_{s\in\mathcal{G}}w_s=1. Then

    \[I(X_1+\ldots+X_n)\leq\sum_{s\in\mathcal{G}}\frac{w_s^2}{\beta_s}I\bigg(\sum_{j\in s}X_j\bigg)\]

and

    \[\frac{1}{I(X_1+\ldots+X_n)}\geq\sum_{s\in\mathcal{G}}\beta_s\frac{1}{I(\sum_{j\in s}X_j)}.\]

Remark. Suppose that the X_i are identically distributed. Take \mathcal{G}=\{[n]\backslash 1,[n]\backslash 2,\ldots,[n]\backslash n\}. For each s\in\mathcal{G}, define \beta_s=\frac{1}{n-1}. It is easy to check that \{\beta_s:s\in\mathcal{G}\} is a fractional packing of \mathcal{G}. Then

    \[\frac{1}{I(X_1+\ldots+X_n)}\geq\frac{n}{n-1}\frac{1}{I(X_1+\ldots+X_{n-1})}\]

by the above theorem. By the scaling property of Fisher information, this is equivalent to I(S_n)\leq I(S_{n-1}), i.e., monotonicity of the Fisher information. This special case was first proved by Artstein, Ball, Barthe and Naor (2004), with a more complicated argument. The proof of the more general theorem above that we will follow is due to Barron and Madiman (2007).

The proof of the above theorem is based on an analysis of variance (ANOVA) type decomposition, which dates back at least to the classic paper of Hoeffding (1948) on U-statistics. To state this decomposition, let X_1,\ldots,X_n be independent random variables, and define the Hilbert space

    \[H:=\{\phi:\mathbb{R}^n \rightarrow \mathbb{R}~~:~~\mathbb{E}[\phi^2(X_1,\ldots,X_n)]<\infty\}\]

with inner product

    \[\langle\phi,\psi\rangle:=\mathbb{E}[\phi(X_1,\ldots,X_n)\psi(X_1,\ldots,X_n)].\]

For every j\in[n], define an operator E_j on H as

    \[(E_j\phi)(x_1,\ldots,x_n)=\mathbb{E}[\phi(x_1,\ldots,x_{j-1},X_j,x_{j+1},\ldots,x_n)].\]

Proposition. Each \phi\in H can be decomposed as

    \[\phi=\sum_{T\subseteq [n]}\widetilde{E}_T\phi,\]

where

    \[\widetilde{E}_T:=\prod_{j\notin T}E_j\prod_{j\in T}(I-E_j)\]

satisfies

  1. \widetilde{E}_T\widetilde{E}_S=0 if T\neq S;
  2. \langle\widetilde{E}_T\phi,\widetilde{E}_S\psi\rangle=0 if T\neq S;
  3. \widetilde{E}_T\phi does not depend on x_i if i\notin T.

Proof. It is easy to verify that E_j is a projection operator, that is, E_j^2=E_j and \langle E_j\phi,\psi\rangle=\langle\phi,E_j\psi\rangle (self-adjointness). It is also easy to see that E_jE_i=E_iE_j for i\neq j. Hence we have

    \[\phi=\prod_{j=1}^n(E_j+I-E_j)\phi=\sum_{T\subseteq [n]}\prod_{j\notin T}E_j\prod_{j\in T}(I-E_j)\phi=\sum_{T\subseteq [n]}\widetilde{E}_T\phi.\]

If T\neq S, choose j_0\in T such that j_0\notin S. Then

    \[\widetilde{E}_T\widetilde{E}_S=(I-E_{j_0})E_{j_0}\prod_{j\notin T}E_j\prod_{j\in T\backslash\{j_0\}}(I-E_j)\prod_{j\notin S\cup\{j_0\}}E_j\prod_{j\in S}(I-E_j)=0.\]

It is easily verified that \widetilde{E}_T is itself self-adjoint, so that \langle\widetilde{E}_T\phi,\widetilde{E}_S\psi\rangle=0 follows directly. Finally, as by definition E_j\phi does not depend on x_j, it is clear that \widetilde{E}_T\phi does not depend on x_j if j\notin T. \qquad\square
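
The decomposition is also easy to verify numerically in a small discrete example. The following sketch (assuming numpy; the marginals and the function \phi below are arbitrary choices) checks both the decomposition and the orthogonality for two independent variables:

    # Verify the ANOVA decomposition phi = sum_T E~_T phi and the orthogonality of its components
    # for two independent discrete random variables (n = 2).
    import itertools
    import numpy as np

    rng = np.random.default_rng(2)
    p = [np.array([0.2, 0.5, 0.3]), np.array([0.6, 0.4])]     # marginals of X_1 (3 values) and X_2 (2 values)
    phi = rng.normal(size=(3, 2))                              # an arbitrary function phi(x_1, x_2)

    def E(j, f):
        """(E_j f): integrate out the j-th coordinate against its marginal, keeping the shape."""
        return np.expand_dims(np.tensordot(f, p[j], axes=([j], [0])), axis=j) * np.ones_like(f)

    def E_tilde(T, f):
        for j in range(2):                                     # the E_j commute, so the order is irrelevant
            f = f - E(j, f) if j in T else E(j, f)
        return f

    joint = np.outer(p[0], p[1])                               # product law of (X_1, X_2)
    parts = {T: E_tilde(T, phi) for T in [(), (0,), (1,), (0, 1)]}
    print(np.allclose(sum(parts.values()), phi))               # True: the components sum to phi
    for T, S in itertools.combinations(parts, 2):
        print(T, S, np.isclose((parts[T] * parts[S] * joint).sum(), 0.0))   # True: orthogonality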

The decomposition will be used in the form of the following variance drop lemma whose proof we postpone to next lecture. Here X_S:=(X_{i_1},\ldots,X_{i_{|S|}}) for S=\{i_1,\ldots,i_{|S|}\}\subseteq [n], i_1<\cdots<i_{|S|}.

Lemma. Let \{\beta_s:s\in\mathcal{G}\} be a fractional packing for \mathcal{G}. Let U=\sum_{s\in\mathcal{G}}\phi_s(X_s). Suppose that \mathbb{E}[\phi_s^2(X_s)]<\infty for each s\in\mathcal{G}. Then \text{Var}[U]\leq\sum_{s\in\mathcal{G}}\frac{1}{\beta_s}\text{Var}[\phi_s(X_s)].

To be continued…

Lecture by Mokshay Madiman | Scribed by Liyao Wang

06. November 2013 by Ramon van Handel
Categories: Information theoretic methods

Next lecture: November 7

A quick reminder that next week is Fall break (no lecture). The next lecture will take place on Thursday, November 7: Mokshay will resume the proof of the entropic central limit theorem.

24. October 2013 by Ramon van Handel
Categories: Announcement

Lecture 4. Entropic CLT (1)

The subject of the next lectures will be the entropic central limit theorem (entropic CLT) and its proof.

Theorem (Entropic CLT). Let X_1,X_2,\ldots be i.i.d. real-valued random variables with mean zero and unit variance. Let

    \[S_n = \frac{1}{\sqrt{n}}\sum_{i=1}^nX_i.\]

If h(S_n) > -\infty for some n, then h(S_n) \uparrow h(N(0,1)), or equivalently D( \mathcal{L}(S_n)\| N(0,1)) \downarrow 0. That is, the entropy of S_n increases monotonically to that of the standard Gaussian.

Recall that D( \mathcal{L}(X) \| \mathcal{L}(Z)) = h(Z) - h(X) when Z is Gaussian with the same mean and variance as X, which explains the equivalence stated in the theorem.

Let us note that the assumption h(S_n) > -\infty for some n represents a genuine (but not unexpected) restriction: in particular, it implies that the entropic CLT does not apply if X_i are discrete.

Entropy power inequality

Historically, the first result on monotonicity of entropy in the CLT was that h(S_{2n}) \ge h(S_n) for all n. This follows directly from an important inequality for entropy, the entropy power inequality (EPI). The rest of this lecture and part of the next lecture will be devoted to proving the EPI. While the EPI does not suffice to establish the full entropic CLT, the same tools will prove to be crucial later on.

Entropy power inequality. Let X_1 and X_2 be independent real-valued random variables such that h(X_1), h(X_2), and h(X_1 + X_2) all exist. Then

    \[e^{2h(X_1+X_2)} \ge e^{2h(X_1)} + e^{2h(X_2)},\]

with equality if and only if X_1 and X_2 are Gaussian.

Before we embark on the proof, let us make some remarks.

Remark. None of the assumptions about existence of entropies is redundant: it can happen that h(X_1) and h(X_2) exist but h(X_1+X_2) does not.

Remark. If X_1 and X_2 are i.i.d., S_1=X_1, and S_2=(X_1+X_2)/\sqrt{2}, then the EPI implies

    \[2e^{2h(X_1)} \le e^{2h(X_1+X_2)} = e^{2h(\sqrt{2}S_2)} = 2e^{2h(S_2)},\]

which implies h(S_1) \le h(S_2). Here we have used the easy-to-check equality h(aX) = h(X) + \log|a|, which of course implies e^{2h(aX)}=a^2e^{2h(X)}. From this observation, the proof of the claim that h(S_{2n}) \ge h(S_n) is immediate: simply note that S_{2n} is the sum of two independent copies of S_n.

Remark. It is easy to check that h(X_1+X_2) \ge h(X_1+X_2 | X_2) = h(X_1 | X_2) = h(X_1). In fact, this is true in much more general settings (e.g. on locally compact groups, with entropy defined relative to Haar measure). The EPI is a much stronger statement particular to real-valued random variables.

Remark. The EPI admits the following multidimensional extension.

Multidimensional EPI. Let X_1 and X_2 be independent \mathbb{R}^n-valued random vectors such that h(X_1), h(X_2), and h(X_1 + X_2) all exist. Then

    \[e^{2h(X_1+X_2)/n} \ge e^{2h(X_1)/n} + e^{2h(X_2)/n},\]

with equality if and only if X_1 and X_2 are Gaussian with proportional covariance matrices.

Define the entropy power N(X) for an \mathbb{R}^n-valued random vector X by

    \[N(X) := \exp\left(\frac{2h(X)}{n}\right).\]

The EPI says that N is superadditive under convolution.
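
A quick numerical sanity check of the EPI, for the reader who likes to see numbers (assuming numpy/scipy; the choice of two Uniform[0,1] variables is arbitrary): since h(\mathrm{Uniform}[0,1])=0, the right-hand side is 2, while the sum has the triangular density on [0,2].

    # Numerical check of the EPI for two independent Uniform[0,1] variables:
    # e^{2 h(X1+X2)} should dominate e^{2 h(X1)} + e^{2 h(X2)} = 2.
    import numpy as np
    from scipy.integrate import trapezoid

    xs = np.linspace(0.0, 2.0, 4001)
    f_sum = 1.0 - np.abs(xs - 1.0)                 # density of the sum: triangular on [0, 2]
    mask = f_sum > 0                               # drop the two endpoints where the density vanishes
    h_sum = -trapezoid(f_sum[mask] * np.log(f_sum[mask]), xs[mask])   # differential entropy (nats)
    print(np.exp(2 * h_sum), ">=", 2.0)            # roughly 2.718 >= 2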

Digression: EPI and Brunn-Minkowski

A good way to develop an appreciation for what the EPI is saying is in analogy with the Brunn-Minkowski inequality. If A,B \subset \mathbb{R}^n are Borel sets and |\cdot| denotes n-dimensional Lebesgue measure, then

    \[|A + B|^{1/n} \ge |A|^{1/n} + |B|^{1/n},\]

where A+B := \{a + b : a \in A, \ b \in B\} is the Minkowski sum. In particular, note that |A|^{1/n} is proportional, up to a constant depending only on the dimension n, to the radius of the n-dimensional Euclidean ball whose volume matches that of A. The Brunn-Minkowski inequality expresses superadditivity of this functional (and we clearly have equality for balls). The Brunn-Minkowski inequality is of fundamental importance in various areas of mathematics: for example, it implies the isoperimetric inequality in \mathbb{R}^n, which states that Euclidean balls with volume V have the minimal surface area among all subsets of \mathbb{R}^n with volume V.

In a sense, the EPI is to random variables as the Brunn-Minkowski inequality is to sets. The Gaussians play the role of the balls, and variance corresponds to radius. In one dimension, for example, since

    \[h(N(0,\sigma^2)) = \frac{1}{2}\log(2\pi e\sigma^2) \quad \Rightarrow \quad e^{2h(N(0,\sigma^2))} = 2\pi e\sigma^2,\]

we see that e^{2h(X)} is proportional to the variance of the Gaussian whose entropy matches that of X. The entropy power inequality expresses superadditivity of this functional, with equality for Gaussians.

Proposition. The EPI is equivalent to the following statement: if X_1 and X_2 are independent and Z_1 and Z_2 are independent Gaussians with h(Z_1)=h(X_1) and h(Z_2)=h(X_2), then h(X_1+X_2) \ge h(Z_1 +Z_2) provided that all of the entropies exist.

Proof. Both implications follow from

    \[\exp\left(\frac{2h(X_1)}{n}\right)+ \exp\left(\frac{2h(X_2)}{n}\right) = \exp\left(\frac{2h(Z_1)}{n}\right) + \exp\left(\frac{2h(Z_2)}{n}\right) = \exp\left(\frac{2h(Z_1+Z_2)}{n}\right). \qquad\square\]

Proof of the entropy power inequality

There are many proofs of the EPI. It was stated by Shannon (1948) but first fully proven by Stam (1959); different proofs were later provided by Blachman (1969), Lieb (1978), and many others. We will follow a simplified version of Stam’s proof. We work from now on in the one-dimensional case for simplicity.

Definition (Fisher information). Let X be a \mathbb{R}-valued random variable whose density f is an absolutely continuous function. The score function of X is defined as

    \[\rho(x) = \rho_X(x) := \begin{cases} \frac{f'(x)}{f(x)} & \text{for } x \text{ at which } f(x)>0 \text{ and } f'(x) \text{ exists,} \\ 0 & \text{otherwise.} \end{cases}\]

The Fisher information (FI) of X is defined as

    \[I(X) := \mathbb{E}\left[\rho_X^2(X)\right].\]

Remark. Let \mathcal{P} = \{p_\theta : \theta \in \Theta\} be a parametric statistical model. In statistics, the score function is usually defined by \rho(\theta,x) := \frac{\partial}{\partial \theta}\log p_\theta(x), and the Fisher information by I(\theta) = \mathbb{E}_\theta\left[\rho^2(\theta,X)\right]. This reduces to our definition in the special case of location families, where p_\theta(x) = p(x-\theta) for some probability density p(x): in this case, \rho(\theta,x) = -\frac{\partial}{\partial x}\log p(x-\theta) and we have

    \[I(\theta) = \int p_\theta(x)\left[\frac{\partial}{\partial x}\log p(x-\theta)\right]^2dx = \int p(x)\left[\frac{\partial}{\partial x}\log p(x)\right]^2dx.\]

Thus for location families I(\theta) does not depend on \theta and coincides precisely with the Fisher information I(Z) as we defined it above for a random variable Z with density p(x). The statistical interpretation allows us to derive a useful inequality. Suppose for simplicity that \mathbb{E}Z=0. Then \mathbb{E}_\theta X = \theta for every \theta, so X is an unbiased estimator of \theta. The Cramér-Rao bound therefore implies the inequality

    \[I(Z) \ge \frac{1}{\mathrm{Var}(Z)},\]

with equality if and only if Z is Gaussian. The same conclusion holds when \mathbf{E}Z\in\mathbb{R} is arbitrary, as both FI and variance are invariant under translation.
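
As a concrete illustration of this inequality (my example, not from the lecture): for the standard logistic distribution the score is \rho(x)=-\tanh(x/2), so I(Z)=1/3 while 1/\mathrm{Var}(Z)=3/\pi^2\approx 0.304, and the inequality is strict, as it must be for a non-Gaussian law. A short numerical confirmation (assuming numpy/scipy):

    # Check I(Z) >= 1/Var(Z) for the standard logistic distribution, where rho(x) = -tanh(x/2).
    import numpy as np
    from scipy.stats import logistic
    from scipy.integrate import quad

    I = quad(lambda x: np.tanh(x / 2.0) ** 2 * logistic.pdf(x), -np.inf, np.inf)[0]   # E[rho^2(Z)]
    var = logistic.var()                                                              # pi^2 / 3
    print(I, ">=", 1.0 / var)       # ~0.3333 >= ~0.3040, strict since the logistic law is not Gaussian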

Remark. There is a Fisher information analogue of the entropic CLT: in the setup of the entropic CLT, subject to an additional variance constraint, we have I(S_n) \downarrow I(N(0,1)) = 1. Moreover, Fisher information is minimized by Gaussians. This is often stated in terms of normalized Fisher information, defined as J(X) := \mathrm{Var}(X)I(X)-1. Note that J is both translation and scale invariant: J(aX)=J(X) and J(X+a)=J(X). We have J(X) \ge 0, with equality if and only if X is Gaussian, by the previous remark. The Fisher information analogue of the entropic CLT can now be restated as J(S_n) \downarrow 0.

The strategy of the proof of the EPI is as follows:

  1. We first prove an inequality for Fisher information:

        \[\frac{1}{I(X_1+X_2)} \ge \frac{1}{I(X_1)} + \frac{1}{I(X_2)}.\]

  2. Develop an integral identity relating I and h.
  3. Combine (1) and (2) to get the EPI.

The reason to concentrate on Fisher information is that I is an L^2-type quantity, as opposed to the entropy which is an L\log L-type quantity. This makes I easier to work with.

We begin with some technical results about Fisher information.

Lemma. If X has absolutely continuous density f and I(X) < \infty, then f has bounded variation. In particular, f is bounded.

Proof. Let D_f denote the set of points at which f is differentiable, so that |D_f^c|=0. Define f'(x) := 0 for x \notin D_f. Let S = \{x \in \mathbb{R} : f(x) > 0\}. Then

    \[\int_\mathbb{R}|f'(x)|dx = \int_S|f'(x)|dx = \int_S|\rho(x)|f(x)dx = \mathbb{E}|\rho(X)| \le \sqrt{I(X)} < \infty. \qquad\square\]

Lemma. Let w : \mathbb{R} \rightarrow \mathbb{R} be measurable. Let \eta : [a,b] \rightarrow [0,\infty] be defined as \eta(u) = \mathbb{E} w^2(u + X). If \eta is bounded, then g : [a,b] \rightarrow \mathbb{R} defined by g(u) = \mathbb{E} w(u+X) is absolutely continuous on (a,b) with g'(u) = -\mathbb{E}\left[w(u+X)\rho(X)\right] a.e.

Proof. Let \tilde{g}(u)=-\mathbb{E}\left[w(u+X)\rho(X)\right]; we want to show that g is absolutely continuous with derivative \tilde{g}. First observe that \tilde{g} is integrable on [a,b] from the Cauchy-Schwarz inequality and the boundedness condition. Thus we can compute

    \begin{equation*}\begin{split} \int_a^b \tilde{g}(u) du  &= - \int_a^b \bigg[ \int w(s) f'(s-u) ds \bigg] du \\ &= -  \int w(s)\bigg[ \int_a^b f'(s-u) du \bigg]  ds \\ &= -  \int w(s)\bigg[ f(s-a) - f(s-b) \bigg]  ds \\ &= \mathbb{E} w(b+X)- \mathbb{E} w(a+X) \\ &= g(b) - g(a) , \end{split}\end{equation*}

which proves the desired result. \qquad\square

Corollary. \mathbb{E}[\rho(X)] = 0 and, if \mathbb{E} X^2 < \infty, then \mathbb{E}[X\rho(X)] = -1.

Proof. Take w \equiv 1 in the lemma above for the first claim. Take w(u)=u for the second. \qquad\square

To be continued…

Lecture by Mokshay Madiman | Scribed by Dan Lacker

23. October 2013 by Ramon van Handel
Categories: Information theoretic methods

Next lecture: one-time room change

The next lecture is on October 17. Please note that we have a one time room change to Sherrerd Hall Room 101 on October 17 as the usual Bendheim room is in use for another event. We will return to the Bendheim center the following week for the rest of the semester.

10. October 2013 by Ramon van Handel
Categories: Announcement

Lecture 3. Sanov’s theorem

The goal of this lecture is to prove one of the most basic results in large deviations theory. Our motivations are threefold:

  1. It is an example of a probabilistic question where entropy naturally appears.
  2. The proof we give uses ideas typical in information theory.
  3. We will need it later to discuss the transportation-information inequalities (if we get there).

What is large deviations?

The best way to start thinking about large deviations is to consider a basic example. Let X_1,X_2,\ldots be i.i.d. random variables with \mathbf{E}[X_1] =0 and \mathbf{E}[X_1^2] < \infty. The law of large numbers states that

    \[\frac{1}{n} \sum_{k=1}^n X_k \xrightarrow{n \rightarrow \infty} 0   \quad \mbox{a.s.}\]

To say something quantitative about the rate of convergence, we need finer limit theorems. For example, the central limit theorem states that

    \[\sqrt{n} \, \bigg\{ \frac{1}{n} \sum_{k=1}^n X_k\bigg\}     \xrightarrow{n \rightarrow \infty} N(0,\sigma^2)    \quad\mbox{in distribution}.\]

Therefore, for any fixed t, we have

    \[\mathbf{P}\bigg[ \frac{1}{n} \sum_{k=1}^n X_k \ge \frac{t}{\sqrt{n}}\bigg]       = O(1)\]

(the probability converges to a value strictly between zero and one). Informally, this implies that the typical size of \frac 1n \sum_{k=1}^n X_k is of order \frac{1}{\sqrt{n}}.

Rather than considering the probability of typical events, large deviations theory allows us to understand the probability of rare events, that is, events whose probabilities are exponentially small. For example, if X_k \sim N(0,1), then the probability that \frac 1n \sum_{k=1}^n X_k is at least of order unity (a rare event, as we have just shown that the typical size is of order \frac{1}{\sqrt{n}}) can be computed as

    \[\mathbf{P}\bigg[ \frac 1n \sum_{k=1}^n X_k > t\bigg] =       \int_{t\sqrt{n}}^{\infty} \frac{e^{-x^2/2}}{\sqrt{2 \pi}}\,dx \approx    e^{-nt^2/2}.\]

The probability of this rare event decays exponentially at rate t^2/2. If the random variables X_i have a different distribution, then these tail probabilities still decay exponentially but with a different rate function. The goal of large deviations theory is to compute precisely the rate of decay of the probabilities of such rare events. In the sequel, we will consider a more general version of this problem.
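
To see this rate emerge concretely, here is a tiny numerical sketch for the Gaussian case (assuming numpy/scipy; t=1 and the values of n are arbitrary choices). It uses the exact Gaussian tail, so no simulation is needed:

    # For standard Gaussian X_k, P[(1/n) sum X_k >= t] = P[N(0,1) >= t sqrt(n)],
    # and (1/n) log of this probability tends to -t^2/2.
    import numpy as np
    from scipy.stats import norm

    t = 1.0
    for n in [10, 100, 1000, 10000]:
        logp = norm.logsf(t * np.sqrt(n))          # log of the exact tail probability of the sample mean
        print(f"n = {n:5d}   (1/n) log P = {logp / n:+.4f}   (limit -t^2/2 = {-t * t / 2:+.4f})")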

Sanov’s theorem

Let X_1,X_2,\ldots be i.i.d. random variables with values in a finite set \{1,\ldots,r\} and with distribution P (random variables in a continuous space will be considered at the end of this lecture). Denote by \mathcal{P} the set of probabilities on \{1,\ldots,r \}. Let \hat P_n be the empirical distribution of the sample X_1,\ldots,X_n:

    \[\hat P_n = \frac{1}{n}\sum_{k=1}^n \delta_{X_k}.\]

The law of large numbers states that \hat P_n \rightarrow P a.s. To define a rare event, we fix \Gamma \subseteq \mathcal{P} that does not contain P. We are interested in behavior of probabilities of the form \mathbf{P}[\hat P_n \in \Gamma], as n \rightarrow \infty.

Example. Let f:\{1,\ldots,r\} \rightarrow \mathbb{R} be such that \int f dP=0. Define \Gamma = \{Q\in\mathcal{P} : \int f dQ \ge t \} for some t>0. Then \mathbf{P}[\hat P_n \in \Gamma]=\mathbf{P}[\frac{1}{n}\sum_{k=1}^n f(X_k)\ge t]. Thus the rare events of the type described in the previous section form a special case of the present setting.

We are now in a position to state Sanov’s theorem, which explains precisely at what exponential rate the probabilities \mathbf{P}[\hat P_n \in \Gamma] decay.

Theorem (Sanov). With the above notations, it holds that

    \begin{eqnarray*} -\inf_{Q \in \mathop{\mathrm{int}} \Gamma} D(Q || P) &\le& \liminf_{n \rightarrow \infty} \frac 1n \log \mathbf{P}[\hat P_n \in \Gamma]\\ &\le& \limsup_{n \rightarrow \infty} \frac 1n \log \mathbf{P}[\hat P_n \in \Gamma]\\ &\le&-\inf_{Q \in \mathop{\mathrm{cl}} \Gamma} D(Q || P). \end{eqnarray*}

In particular, for “nice” \Gamma such that the left- and right-hand sides coincide we have the exact rate

    \[\lim_{n \rightarrow \infty} \frac 1n \log \mathbf{P}[\hat P_n \in \Gamma]    = - \inf_{Q \in \Gamma} D(Q || P).\]

In words, Sanov’s theorem states that the exponential rate of decay of the probability of a rare event \Gamma is controlled by the element of \Gamma that is closest to the true distribution P in the sense of relative entropy.

There are many proofs of Sanov’s theorem (see, for example, the excellent text by Dembo and Zeitouni). Here we will utilize an elegant approach that uses a common device in information theory.

Method of types

It is a trivial observation that each possible value \{1,\ldots,r \} must appear an integer number of times among the samples \{X_1,\ldots,X_n\}. This implies, however, that the empirical measure \hat P_n cannot take arbitrary values: evidently it is always the case that \hat P_n \in \mathcal{P}_n, where we define

    \[\mathcal{P}_n = \bigg \{Q \in \mathcal{P} :     Q(\{i\}) = \frac{k_i}{n}, ~ k_i \in \{0,\ldots,n\},~\sum_{i=1}^rk_i=n \bigg \}.\]

Each element of \mathcal{P}_n is called a type: it contains only the information about how often each value shows up in the sample, discarding the order in which they appear. The key idea behind the proof of Sanov’s theorem is that we can obtain a very good bound on the probability that the empirical measure takes the value \hat P_n=Q for each type Q\in\mathcal{P}_n.

Type theorem. For every Q \in \mathcal{P}_n, we have

    \[\frac{1}{(n+1)^r} e^{- n D(Q || P)} \le \mathbf{P}[\hat P_n = Q]     \le e^{- n D(Q || P)}.\]

That is, up to a polynomial factor, the probability of each type Q\in \mathcal{P}_n behaves like e^{- n D(Q || P)}.
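
The bounds are simple enough to check exactly in a small case (a sketch, assuming numpy; the alphabet size, P, n, and the type Q below are arbitrary choices):

    # Exact check of the type theorem bounds for r = 3 symbols, P = (0.5, 0.3, 0.2), n = 10,
    # and the type Q = (0.2, 0.5, 0.3): both bounds should bracket P[hat P_n = Q].
    import numpy as np
    from math import factorial

    P = np.array([0.5, 0.3, 0.2])
    n, r = 10, 3
    k = np.array([2, 5, 3])                                              # counts defining the type Q = k/n
    Q = k / n

    size_TQ = factorial(n) // np.prod([factorial(int(x)) for x in k])    # |T(Q)|, a multinomial coefficient
    prob = size_TQ * np.prod(P ** k)                                     # P[hat P_n = Q]
    D = np.sum(Q * np.log(Q / P))                                        # D(Q || P)

    print((n + 1) ** (-r) * np.exp(-n * D), "<=", prob, "<=", np.exp(-n * D))   # ~1.1e-4 <= ~1.2e-2 <= ~0.14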

In view of the type theorem, the conclusion of Sanov’s theorem is not surprising. The type theorem implies that types Q such that D(Q||P)\ge \inf_{Q\in\Gamma}D(Q||P)+\varepsilon have exponentially smaller probability than the “optimal” distribution that minimizes the relative entropy in \Gamma. The probability of the rare event \Gamma is therefore controlled by the probability of the most likely type. In other words, we have the following intuition, common in large deviation theory: the probability of a rare event is dominated by the most likely of the unlikely outcomes. The proof of Sanov’s theorem makes this intuition precise.

Proof of Sanov’s theorem.

Upper bound. Note that

    \begin{eqnarray*} \mathbf{P}[\hat P_n \in \Gamma] &=& \sum_{Q\in \Gamma \cap \mathcal{P}_n} \mathbf{P}[\hat P_n = Q] \le \sum_{Q\in \Gamma \cap \mathcal{P}_n} e^{- n D(Q || P)}\\ &\le& |\mathcal{P}_n| \sup_{Q \in \Gamma} e^{- n D(Q || P)}  \le (n+1)^r e^{- n \inf _{Q \in \Gamma} D(Q || P)}. \end{eqnarray*}

This yields

    \[\limsup_{n \rightarrow \infty} \frac 1n \log \mathbf{P}[\hat P_n \in \Gamma] \le -\inf_{Q \in \Gamma} D(Q || P).\]

[Note that in the finite case, by continuity, the infimum over \Gamma equals the infimum over \mathop{\mathrm{cl}}\Gamma as stated in the theorem. The closure becomes more important in the continuous setting.]

Lower bound. Note that \bigcup_{n\ge 1}\mathcal{P}_n is dense in \mathcal{P}. As \mathop{\mathrm{int}} \Gamma is open, we can choose Q_n \in  \mathop{\mathrm{int}} \Gamma \cap \mathcal{P}_n for each n, such that D( Q_n || P) \rightarrow \inf_{Q \in \mathop{\mathrm{int}} \Gamma} D(Q || P). Therefore,

    \begin{eqnarray*} \mathbf{P}[\hat P_n \in \Gamma] &\ge& \mathbf{P}[\hat P_n \in \mathop{\mathrm{int}} \Gamma] = \sum_{Q \in \mathop{\mathrm{int}} \Gamma \cap \mathcal{P}_n} \mathbf{P}[\hat P_n = Q] \ge \mathbf{P}[\hat P_n = Q_n]\\ &\ge& \frac{1}{(n+1)^r} e^{-nD(Q_n || P)}. \end{eqnarray*}

It follows that

    \[\liminf_{n \rightarrow \infty} \frac 1n \log \mathbf{P}[\hat P_n \in \Gamma] \ge -\inf_{Q \in \mathop{\mathrm{int}}\Gamma} D(Q || P).\]

[Note that even though we are in the finite case, it is essential to consider the interior of \Gamma.] \qquad\square

Of course, all the magic has now shifted to the type theorem itself: why are the probabilities of the types controlled by relative entropy? We will presently see that relative entropy arises naturally in the proof.

Proof of the type theorem. Let us define

    \[T(Q) = \bigg\{(x_1,\ldots,x_n) \in \{1,\ldots,r\}^n : \frac 1n \sum_{k=1}^n\delta_{x_k}=Q \bigg\}.\]

Then we can write

    \begin{eqnarray*} \mathbf{P}[\hat P_n = Q] &=& \mathbf{P}[(X_1,\ldots,X_n) \in T(Q)] = \sum_{x \in T(Q)} P(\{x_1\}) \ldots P(\{x_n\})\\ &=& \sum_{x \in T(Q)} \prod_{i=1}^r P(i)^{n Q(i)} = e^{n \sum_{i=1}^r Q(i) \log  P(i)} \, |T(Q)| \\ &=& e^{-n \{D(Q || P) + H(Q)\}} |T(Q)|. \end{eqnarray*}

It is therefore sufficient to prove that for every Q \in \mathcal{P}_n

    \[\frac{1}{(n+1)^r} e^{n H(Q)} \le |T(Q)| \le e^{n H(Q)}.\]

To show this, the key idea is to utilize precisely the same expression for \mathbf{P}[\hat P_n = Q] given above, for the case that the distribution P that defined the empirical measure \hat P_n is replaced by Q (which is a type). To this end, let us denote by \hat Q_n the empirical measure of n i.i.d. random variables with distribution Q.

Upper bound. We simply estimate using the above expression

    \[1 \ge \mathbf{P}[\hat Q_n = Q] = e^{-n H(Q)} |T(Q)|.\]

Lower bound. It seems intuitively plausible that \mathbf{P}[\hat Q_n = Q] \ge \mathbf{P}[\hat Q_n = R] for every Q,R \in \mathcal{P}_n, that is, the probability of the empirical distribution is maximized at the true distribution (“what else could it be?”); we will prove this below. Assuming this fact, we simply estimate

    \[1 = \sum_{R \in \mathcal{P}_n} \mathbf{P}[\hat Q_n = R] \le (n+1)^r \mathbf{P}[\hat Q_n = Q] = (n+1)^r e^{-n H(Q)} |T(Q)|.\]

Proof of the claim. It remains to prove the above claim that \mathbf{P}[\hat Q_n = Q] \ge \mathbf{P}[\hat Q_n = R] for every Q,R \in \mathcal{P}_n. To this end, note that T(Q) consists of all vectors x such that nQ(1) of the entries take the value 1, nQ(2) of the entries take the value 2, etc. The number of such vectors is

    \[|T(Q)| = \frac{n!}{(nQ(1))! \cdots (nQ(r))! }.\]

It is now straightforward to estimate

    \begin{eqnarray*}\frac{\mathbf{P}[\hat Q_n = Q]}{\mathbf{P}[\hat Q_n = R]} &=& \prod_{i=1}^r \frac{(nR(i))!}{(nQ(i))!} Q(i)^{n(Q(i)-R(i))}\\ &\ge& \prod_{i=1}^r (n Q(i))^{n(R(i)-Q(i))} Q(i)^{n(Q(i)-R(i))} \quad \text{using} \quad \frac{n!}{m!} \ge m^{n-m}\\ &=&\prod_{i=1}^r n^{n(R(i)-Q(i))} = n^{n \sum_{i=1}^r (R(i)-Q(i))} =1. \end{eqnarray*}

Thus the claim is established. \qquad\square

Remark. It is a nice exercise to work out the explicit form of the rate function \inf_{Q \in \Gamma} D(Q || P) in the example \Gamma = \{Q\in\mathcal{P} : \int f dQ \ge t \} considered at the beginning of this lecture. The resulting expression yields another basic result in large deviations theory, which is known as Cramér’s theorem.
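
For the record, here is a sketch of the answer (so that the exercise can be checked): a Lagrange multiplier or exponential tilting computation gives, for t\ge \int f dP,

    \[\inf_{Q \in \mathcal{P}:\,\int f dQ \ge t} D(Q || P) = \sup_{\lambda \ge 0}\bigg\{\lambda t - \log \int e^{\lambda f}dP\bigg\},\]

which is the Legendre transform that appears in Cramér's theorem.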

General form of Sanov’s theorem

The drawback to the method of types is that it relies heavily on the assumption that X_i take values in a finite state space. In fact, Sanov’s theorem continues to hold in a much more general setting.

Let \mathbb{X} be a Polish space (think \mathbb{R}^n), and let X_1,X_2,\ldots be i.i.d. random variables taking values in \mathbb{X} with distribution P. Denote by \mathcal{P} the space of probability measures on \mathbb{X} endowed with the topology of weak convergence: that is, P_n\to P iff \int f dP_n\to \int f dP for every bounded continuous function f. Now that we have specified the topology, it makes sense to speak of “open” and “closed” subsets of \mathcal{P}.

Theorem. In the present setting, Sanov’s theorem holds verbatim as stated above.

It turns out that the lower bound in the general Sanov theorem can be easily deduced from the finite state space version. The upper bound can also be deduced, but this is much more tricky (see this note) and a direct proof in the continuous setting using entirely different methods is more natural. [There is in fact a simple information-theoretic proof of the upper bound that is however restricted to sets \Gamma that are sufficiently convex, which is an unnecessary restriction; see this classic paper by Csiszar.]

We will need the general form of Sanov’s theorem in the development of transportation-information inequalities. Fortunately, however, we will only need the lower bound. We will therefore be content to deduce the general lower bound from the finite state space version that we proved above.

Proof of the lower bound. It evidently suffices to consider the case that \Gamma is an open set. We use the following topological fact whose proof will be given below: if \Gamma \subseteq \mathcal{P} is open and R\in\Gamma, then there is a finite (measurable) partition \{A_1,\ldots,A_r\} of \mathbb{X} and \varepsilon>0 such that

    \[\tilde \Gamma = \{Q\in\mathcal{P} : |Q(A_i)-R(A_i)|<\varepsilon~ \forall\, i=1,\ldots,r \} \subseteq \Gamma.\]

Given such a set, the idea is now to reduce to the discrete case using the data processing inequality.

Define the function T:\mathbb{X}\to\{1,\ldots,r\} such that T(x)=i for x\in A_i. Then \hat P_n \in \tilde \Gamma if and only if the empirical measure \hat P_n^\circ of T(X_1),\ldots,T(X_n) lies in \Gamma^\circ=\{Q:|Q(\{i\})-R(A_i)|<\varepsilon~\forall\,i=1,\ldots,r\}. Thus

    \[\mathbf{P}[\hat P_n \in \Gamma] \ge \mathbf{P}[\hat P_n \in \tilde \Gamma] = \mathbf{P}[\hat P_n^\circ \in \Gamma^\circ].\]

As T(X_i) take values in a finite set, and as \Gamma^\circ is open, we obtain from the finite Sanov theorem

    \[\liminf_{n \rightarrow \infty} \frac{1}{n} \log \mathbf{P}[\hat P_n \in \Gamma] \ge -\inf_{Q \in \Gamma^\circ} D(Q || PT^{-1}) = -\inf_{Q \in \tilde \Gamma} D(QT^{-1} || PT^{-1}) \ge - D(R||P),\]

where we have used the data processing inequality and R\in\tilde\Gamma in the last inequality. As R\in\Gamma was arbitrary, taking the supremum over R\in\Gamma completes the proof. \qquad\square

Proof of the topological fact. Sets of the form

    \[\bigg\{Q\in\mathcal{P} : \bigg|\int f_id Q-\int f_i dR\bigg|   <\alpha~ \forall\, i=1,\ldots,k \bigg\}\]

for R\in\mathcal{P}, k<\infty, f_1,\ldots,f_k bounded continuous functions, and \alpha>0 form a base for the weak convergence topology on \mathcal{P}. Thus any open subset \Gamma\subseteq\mathcal{P} must contain a set of this form for every R\in\Gamma (think of the analogous statement in \mathbb{R}^n: any open set B\subseteq\mathbb{R}^n must contain a ball around any x\in B).

It is now easy to see that each set of this form must contain a set of the form used in the above proof of the lower bound in Sanov’s theorem. Indeed, as f_i is a bounded function, we can find for each i a simple function \tilde f_i such that \|\tilde f_i-f_i\|_\infty\le\alpha/3. Clearly |\int \tilde f_i dQ-\int\tilde f_i dR|<\alpha/3 implies |\int f_i dQ-\int f_i dR|<\alpha, so we can replace the functions f_i by simple functions. But then forming the partition \{A_1,\ldots,A_r\} generated by the sets that define these simple functions, it is evident that if \varepsilon>0 is chosen sufficiently small, then |Q(A_j)-R(A_j)|<\varepsilon for all j implies |\int \tilde f_i dQ-\int\tilde f_i dR|<\alpha/3. The proof is complete. \square

Remark. It is also possible to work with topologies different than the topology of weak convergence. See, for example, the text by Dembo and Zeitouni for further discussion.

Lecture by Ramon van Handel | Scribed by Quentin Berthet

10. October 2013 by Ramon van Handel
Categories: Information theoretic methods

Next lecture: October 17

Next week (October 10) there will be no stochastic analysis seminar due to conflicting events.

03. October 2013 by Ramon van Handel
Categories: Announcement

Lecture 2. Basics / law of small numbers

Due to scheduling considerations, we postpone the proof of the entropic central limit theorem. In this lecture, we discuss basic properties of the entropy and illustrate them by proving a simple version of the law of small numbers (Poisson limit theorem). The next lecture will be devoted to Sanov’s theorem. We will return to the entropic central limit theorem in Lecture 4.

Conditional entropy and mutual information

We begin by introducing two definitions related to entropy. The first definition is a notion of entropy under conditioning.

Definition. If X and Y are two discrete random variables with probability mass functions p_X and p_Y, then the conditional entropy of X given Y is defined as

    \[H(X|Y) := - \mathbb{E} [\log{p_{X|Y}(X|Y)} ]\]

where p_{X|Y}(x|y) = p_{(X,Y)}(x,y)/p_Y(y) is the conditional probability mass function of X given Y.

Remark. If X and Y are absolutely continuous random variables, the conditional differential entropy h(X|Y) is defined analogously (where the probability mass functions are replaced by the corresponding probability densities with respect to Lebesgue measure).

Note that

    \begin{equation*} \begin{split}  H(X|Y)  &=  - \sum_{x,y} p_{(X,Y)}(x,y)\log{p_{X|Y}(x|y)} \\          &= - \sum_y p_Y(y) \sum_x  p_{X|Y}(x|y)\log{p_{X|Y}(x|y)} \\          & = \sum_y p_Y(y) H(X|Y=y). \end{split} \end{equation*}

That is, the conditional entropy H(X|Y) is precisely the expectation (with respect to the law of Y) of the entropy of the conditional distribution of X given Y.

We now turn to the second definition, the mutual information. It describes the degree of dependence between two random variables.

Definition. The mutual information between two random variables X and Y is defined as

    \[I(X,Y) := D( \mathcal{L}(X,Y) || \mathcal{L}(X) \otimes \mathcal{L}(Y)),\]

where \mathcal{L}(X,Y), \mathcal{L}(X) and \mathcal{L}(Y) denote the distributions of (X,Y), X and Y.

Conditional entropy and mutual information are closely related. For example, if (X,Y) has a density f_{(X,Y)} with respect to Lebesgue measure, then

    \begin{equation*} \begin{split} I(X,Y) & = \int f_{(X,Y)}(x,y) \log{\frac{f_{(X,Y)}(x,y)}{f_X(x)f_Y(y)}} \,dx \,dy \\        & = \mathbb{E} \log{\frac{f_{(X,Y)}(X,Y)}{f_X(X)f_Y(Y)}} \\        & = \mathbb{E} \log{\frac{f_{X|Y}(X|Y)}{f_X(X)}}  \\        & = h(X)-h(X|Y). \end{split} \end{equation*}

In particular, since I(X,Y) is always nonnegative (because it is a relative entropy), we have just shown that h(X|Y) \leq h(X), that is, conditioning reduces entropy. The same result holds for discrete random variables when we replace h by H.
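
These identities are easy to check numerically for discrete variables. A minimal sketch (assuming numpy; the joint distribution below is an arbitrary choice):

    # Check I(X,Y) = H(X) - H(X|Y) and H(X|Y) <= H(X) on a small joint pmf.
    import numpy as np

    pxy = np.array([[0.30, 0.10],
                    [0.20, 0.15],
                    [0.05, 0.20]])                             # joint pmf of (X, Y) on {0,1,2} x {0,1}
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)

    H_X = -np.sum(px * np.log(px))                             # entropy in nats
    H_X_given_Y = -np.sum(pxy * np.log(pxy / py))              # H(X|Y) = -E[log p_{X|Y}(X|Y)]
    I_XY = np.sum(pxy * np.log(pxy / np.outer(px, py)))        # D( L(X,Y) || L(X) x L(Y) )

    print(np.isclose(I_XY, H_X - H_X_given_Y))                 # True
    print(H_X_given_Y <= H_X)                                  # True: conditioning reduces entropy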

Chain rules

Chain rules are formulas that relate the entropy of multiple random variables to the conditional entropies of these random variables. The most basic version is the following.

Chain rule for entropy. H(X_1, X_2, ..., X_n) = \sum_{i=1}^n H(X_i|X_1,...,X_{i-1}). In particular, H(X_2|X_1)=H(X_1, X_2)-H(X_1).

Proof. Note that

    \[p_{(X_1,...,X_n)}(x_1,...,x_n) = \prod_{i=1}^n p_{X_i|X_1,...,X_{i-1}}(x_i|x_1,...,x_{i-1}).\]

Thus,

    \[\log{ p_{(X_1,...,X_n)}(x_1,...,x_n)} = \sum_{i=1}^n \log{p_{X_i|X_1,...,X_{i-1}}(x_i|x_1,...,x_{i-1})}.\]

Taking the expectation on both sides under the distribution (x_1,...,x_n) \sim (X_1,...,X_n) gives the desired result. \qquad\square

Corollary. Entropy is sub-additive, that is, H(X_1, X_2, ..., X_n) \leq \sum_{i=1}^n H(X_i).

Proof. Combine the chain rule with H(X_i|X_1,...,X_{i-1}) \leq H(X_i). \qquad\square

There is also a chain rule for relative entropy.

Chain rule for relative entropy.

    \[D(\mathcal{L}(X,Y) || \mathcal{L}(X^{'},Y^{'})) =  D(\mathcal{L}(X)||\mathcal{L}(X^{'})) + \mathbb{E}_{x \sim X} [ D(\mathcal{L}(Y|X=x) || \mathcal{L}(Y^{'}|X^{'}=x))].\]

The following identity will be useful later.

Lemma.

    \begin{multline*}D(\mathcal{L}(X_1,...,X_n) || \mathcal{L}(Y_1) \otimes \cdots   \otimes  \mathcal{L}(Y_n)) = \\  \sum_{i=1}^n  D(\mathcal{L}(X_i) || \mathcal{L}(Y_i)) + D(\mathcal{L}(X_1,...,X_n) || \mathcal{L}(X_1) \otimes \cdots  \otimes  \mathcal{L}(X_n)).\end{multline*}

Proof. Note that

    \begin{equation*} \begin{split} & D(\mathcal{L}(X_1,...,X_n) || \mathcal{L}(Y_1) \otimes \cdots   \otimes  \mathcal{L}(Y_n)) \\ & =  \mathbb{E} \log{\frac{p_{(X_1,...,X_n)}(X_1,...,X_n)}{p_{Y_1}(X_1)\cdots p_{Y_n}(X_n)}} \\ & =  \mathbb{E} \log{\frac{p_{(X_1,...,X_n)}(X_1,...,X_n)}{p_{X_1}(X_1)\cdots p_{X_n}(X_n)}}        + \sum_{i=1}^n \mathbb{E}\log{\frac{p_{X_i}(X_i)}{p_{Y_i}(X_i)}} \\  & = D(\mathcal{L}(X_1,...,X_n) || \mathcal{L}(X_1) \otimes \cdots  \otimes  \mathcal{L}(X_n)) + \sum_{i=1}^n  D(\mathcal{L}(X_i) || \mathcal{L}(Y_i)) . \qquad\square \end{split} \end{equation*}

Data processing and convexity

Two important properties of the relative entropy can be obtained as consequences of the chain rule.

Data processing inequality. Let P and Q be two probability measures on \mathcal{A} and suppose T:\mathcal{A} \rightarrow \mathcal{A}^{'} is measurable. Then D(PT^{-1}||QT^{-1}) \leq D(P||Q), where PT^{-1} is the distribution of T(X) when X \sim P.

The data processing inequality tells us that if we process the data X (which might come from one of the two distributions P and Q), then the relative entropy decreases. In other words, it becomes harder to identify the source distribution after processing the data. The same result (with the same proof) holds also if P and Q are transformed by a transition kernel, rather than by a function.

Proof. Denote by \mathsf{P} and \mathsf{Q} the joint laws of (X,T(X)) and (Y,T(Y)) when X\sim P and Y\sim Q. By the chain rule and nonnegativity of relative entropy

    \[D(PT^{-1}||QT^{-1}) = D(\mathsf{P}||\mathsf{Q}) -    \mathbb{E}_{t \sim PT^{-1}} [ D(\mathcal{L}(X|T(X)=t) || \mathcal{L}(Y|T(Y)=t))] \le D(\mathsf{P}||\mathsf{Q}).\]

On the other hand, using again the chain rule,

    \[D(\mathsf{P}||\mathsf{Q}) = D(P||Q) + \mathbb{E}_{x\sim P} [ D(\mathcal{L}(T(X)|X=x) || \mathcal{L}(T(Y)|Y=x))] =  D(P||Q),\]

where we used \mathcal{L}(T(X)|X=x) = \mathcal{L}(T(Y)|Y=x). Putting these together completes the proof. \qquad\square
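
A small numerical illustration of the data processing inequality (assuming numpy; the distributions and the merging map T below are arbitrary choices):

    # Merging symbols (a deterministic map T) can only decrease the relative entropy.
    import numpy as np

    P = np.array([0.10, 0.50, 0.30, 0.10])
    Q = np.array([0.25, 0.25, 0.25, 0.25])
    T = np.array([0, 0, 1, 1])                     # T maps symbols {0,1} -> 0 and {2,3} -> 1

    D = lambda p, q: np.sum(p * np.log(p / q))
    push = lambda p: np.bincount(T, weights=p)     # the image measure P T^{-1}

    print(D(push(P), push(Q)), "<=", D(P, Q))      # roughly 0.020 <= 0.218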

Convexity of relative entropy. D(\cdot || \cdot) is jointly convex in its arguments, that is, if P_1, P_2, Q_1, Q_2 are probability measures and 0\leq \lambda \leq 1, then

    \[D(\lambda P_1 + (1-\lambda)P_2 || \lambda Q_1 + (1-\lambda)Q_2 ) \leq \lambda D(P_1 || Q_1) + (1-\lambda)D(P_2||Q_2).\]

Proof. Let T be a random variable that takes value 1 with probability \lambda and 2 with probability 1-\lambda. Conditionally on T=i, draw X\sim P_i and Y\sim Q_i. Then \mathcal{L}(X)=\lambda P_1+(1-\lambda)P_2 and \mathcal{L}(Y)=\lambda Q_1+(1-\lambda)Q_2. Using the chain rule twice, we obtain

    \[D(\mathcal{L}(X)||\mathcal{L}(Y)) \le    D(\mathcal{L}(X,T)||\mathcal{L}(Y,T)) =    \mathbb{E}_{t\sim \mathcal{L}(T)}[D(\mathcal{L}(X|T=t)||\mathcal{L}(Y|T=t))],\]

and the right hand side is precisely \lambda D(P_1 || Q_1) + (1-\lambda)D(P_2||Q_2). \qquad\square

Corollary. The entropy function H is concave.

Proof for a finite alphabet. When the alphabet \mathcal{A} is finite, the corollary can be proven by noting that H(P)=\log{|\mathcal{A}|} - D(P||\mathrm{Unif}(\mathcal{A})). \qquad\square

Relative entropy and total variation distance

Consider the hypothesis testing problem of testing the null hypothesis H_0: X_1,\ldots,X_n \sim P (i.i.d.) against the alternative hypothesis H_1: X_1,\ldots,X_n \sim Q. A test is a measurable function T:\mathcal{A}^n \rightarrow \{0,1\}. Under the constraint P(T(X_1,\ldots,X_n)=1) \leq \alpha, it can be shown that the optimal rate of decay of Q(T(X_1,\ldots,X_n)=0) as a function of the sample size n is of the order of \exp{(-n\cdot D(P||Q))}. This means that D(P||Q) measures how well one can distinguish between P and Q on the basis of data.

We will not prove this fact, but only introduce it to motivate that the relative entropy D is, in some sense, like a measure of distance between probability measures. However, it is not a metric since D(P||Q) \neq D(Q||P) and the triangle inequality does not hold. So in what sense does the relative entropy represent a distance? In fact, it controls several bona fide metrics on the space of probability measures. One example of such a metric is the total variation distance.

Definition. Let P and Q be probability measures on \mathcal{A}. The total variation distance is defined as d_{\text{TV}}(P,Q)=\sup_{A \in \mathcal{B}(\mathcal{A})} |P(A)-Q(A)|.

The following are some simple facts about the total variation distance.

  1. 0 \leq d_{\text{TV}}(P,Q) \leq 1.
  2. If P and Q have probability density functions p and q with respect to some common probability measure \lambda, then d_{\text{TV}}(P,Q)= \frac{1}{2}||p-q||_{L^{1}(\lambda)}. To see this, define A=\{x\in \mathcal{A} : p(x)>q(x) \} (one checks that the supremum in the definition of d_{\text{TV}} is attained at this set). Then

        \begin{equation*} \begin{split} ||p-q||_{L^{1}(\lambda)} & =  \int_{A}(p(x)-q(x))\lambda(dx) + \int_{A^c}(q(x)-p(x))\lambda(dx)  \\  & = P(A) - Q(A) + (1-Q(A)-1+P(A)) \\  & = 2(P(A)-Q(A)) = 2 d_{\text{TV}}(P,Q)  \\ \end{split} \end{equation*}

  3. d_{\text{TV}}(P,Q)= \inf_{X \sim P, Y \sim Q}  \mathbb{P}(X\neq Y), where the infimum is over all couplings of P and Q, that is, over all joint laws of (X,Y) with marginals X\sim P and Y\sim Q.

The following inequality shows that the total variation distance is controlled by the relative entropy. In this sense, the relative entropy is a strong notion of distance.

Pinsker’s inequality. d_{\text{TV}}(P,Q)^2 \leq \frac{1}{2} D(P||Q).

Proof. We can always assume that P and Q have probability density functions p and q with respect to some common probability measure \lambda on \mathcal{A} (take, for instance, \lambda=\frac{P+Q}{2}). Let A=\{x\in \mathcal{A} : p(x)>q(x) \} and T=1_{A}.

Step 1: First prove the inequality when \mathcal{A} contains at most two elements. In that case we may write P=\mathrm{Bern}(p) and Q=\mathrm{Bern}(q), so that d_{\text{TV}}(P,Q)=|p-q| and the claim reduces to the elementary inequality p\log\frac{p}{q}+(1-p)\log\frac{1-p}{1-q} \geq 2(p-q)^2. To verify the latter, note that the difference of the two sides vanishes at q=p, and that its derivative in q equals (q-p)\big(\frac{1}{q(1-q)}-4\big), which has the same sign as q-p since q(1-q)\leq\frac{1}{4}; hence the difference is minimized at q=p, where it is zero.

Step 2: The push-forwards PT^{-1} and QT^{-1} are probability measures on the two-point space \{0,1\}, so Step 1 applies to them. Combining the data processing inequality, Step 1, and fact 2 above (which shows that d_{\text{TV}}(P,Q)=P(A)-Q(A) for this choice of A), we obtain

    \begin{equation*} \begin{split} D(P||Q) & \geq  D(PT^{-1}||QT^{-1}) \geq 2 d_{\text{TV}}(PT^{-1},QT^{-1})^2   \\  & = 2(P(A)-Q(A))^2 = 2 d_{\text{TV}}(P,Q)^2. \qquad\square \end{split} \end{equation*}
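
As a sanity check, the inequality is easy to test numerically on a finite alphabet; the following rough Python sketch uses randomly generated distributions and computes the relative entropy in nats.

    import numpy as np

    # Check d_TV(P,Q)^2 <= D(P||Q)/2 for random distributions on a 5-point alphabet.
    rng = np.random.default_rng(2)

    def kl(p, q):
        return float(np.sum(p * np.log(p / q)))

    def tv(p, q):
        return 0.5 * float(np.abs(p - q).sum())    # d_TV = (1/2)*||p - q||_1

    for _ in range(1000):
        p, q = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))
        assert tv(p, q) ** 2 <= 0.5 * kl(p, q) + 1e-12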

Law of small numbers

As a first illustration of an application of entropy to probability, let us prove a simple quantitative law of small numbers. An example of the law of small numbers is the well known fact that Bin(n,\frac{\lambda}{n})  \rightarrow Po(\lambda) in distribution as n goes to infinity. More generally, if X_1,...,X_n are Bernoulli random variables with X_i \sim Bern(p_i), if X_1,...,X_n are weakly dependent, and if none of the p_i dominates the rest, then \mathcal{L}(\sum_{i=1}^n X_i) \approx Po(\lambda) where \lambda = \sum_{i=1}^{n} p_i. This idea can be quantified easily using relative entropy.

Theorem. Let X_1,...,X_n be (possibly dependent) random variables with X_i \sim Bern(p_i). Then

    \[D(\mathcal{L}(\bar X) || Po(\lambda)) \leq \sum_{i=1}^n p_i^2 + D(\mathcal{L}(X_1,...,X_n) || \mathcal{L}(X_1) \otimes \cdots \otimes \mathcal{L}(X_n) )\]

where \bar X = \sum_{i=1}^nX_i and \lambda = \sum_{i=1}^n p_i.

Proof. Let Z_1,...,Z_n be independent random variables with Z_i \sim Po(p_i), so that \bar Z = \sum_{i=1}^n Z_i \sim Po(\lambda). Applying the data processing inequality with the map (x_1,\ldots,x_n)\mapsto\sum_{i=1}^n x_i, and then decomposing the relative entropy with respect to the product measure \mathcal{L}(Z_1)\otimes\cdots\otimes\mathcal{L}(Z_n), we have

    \begin{equation*} \begin{split} D(\mathcal{L}(\bar X) || Po(\lambda) )  & =  D(\mathcal{L}(\bar X) || \mathcal{L}(\bar Z)) \\  & \leq   D(\mathcal{L}(X_1,...,X_n) || \mathcal{L}(Z_1,...,Z_n) )\\  & = \sum_{i=1}^{n} D(\mathcal{L}(X_i) || \mathcal{L}(Z_i) ) + D(\mathcal{L}(X_1,...,X_n) || \mathcal{L}(X_1) \otimes \cdots \otimes \mathcal{L}(X_n) ). \\ \end{split} \end{equation*}

To conclude, it is enough to note that, using \log(1-p_i) \leq -p_i in the last inequality,

    \begin{equation*} \begin{split} D(\mathcal{L}(X_i) || \mathcal{L}(Z_i) )  & =  (1-p_i)\log{\frac{1-p_i}{e^{-p_i}}} + p_i\log{\frac{p_i}{p_i e^{-p_i}}}  \\  & =  p_i^2 + (1-p_i)(p_i+\log{(1-p_i)})  \\  & \leq p_i^2 .  \qquad\square \end{split} \end{equation*}

Remark. If p_1= \cdots = p_n = \frac{\lambda}{n} and X_1,...,X_n are independent, then the inequality in the theorem becomes D(Bin(n,\frac{\lambda}{n}) || Po(\lambda) ) \leq \frac{\lambda^2}{n}. However, this rate of convergence is not optimal. One can show that under the same assumptions, D(Bin(n,\frac{\lambda}{n}) || Po(\lambda) )= o(\frac{1}{n}), using tools similar to those that will be used later to prove the entropic central limit theorem. Note that it is much harder to prove D(\mathcal{L}(S_n)|| \mathcal{N}(0,1)) \rightarrow 0 in the entropic central limit theorem, even without a rate of convergence!
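
The bound and the remark are easy to explore numerically. The following rough Python sketch (using scipy.stats) computes D(Bin(n,\frac{\lambda}{n})||Po(\lambda)) exactly by summing over the support of the binomial and compares it with the bound \frac{\lambda^2}{n}; the values of n and \lambda are arbitrary choices for illustration.

    import numpy as np
    from scipy.stats import binom, poisson

    # D(Bin(n, lam/n) || Po(lam)) versus the bound lam^2/n (in nats).
    lam = 1.0
    for n in (10, 100, 1000):
        k = np.arange(n + 1)                     # Bin(n,.) is supported on {0,...,n}
        logpb = binom.logpmf(k, n, lam / n)
        logpp = poisson.logpmf(k, lam)
        pb = np.exp(logpb)
        d = float(np.sum(pb * (logpb - logpp)))  # exact relative entropy
        print(n, d, lam ** 2 / n)                # d decays much faster than lam^2/n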

Lecture by Mokshay Madiman | Scribed by Che-yu Liu

03. October 2013 by Ramon van Handel
Categories: Information theoretic methods | Comments Off

Lecture 1. Introduction

What is information theory?

The first question that we want to address is: “What is information?” Although there are several ways in which we might think of answering this question, the main rationale behind our approach is to distinguish information from data. We think of information as something abstract that we want to convey, while we think of data as a representation of information, something that is storable/communicable. This is best understood by some examples.

Example. Information is a certain idea, while data is the words we use to describe this idea.

Example. Information is 10^{1{,}000{,}000{,}000}. Possible data describing this information are: 10^{1{,}000{,}000{,}000}, 10\cdots 0, 10^{10^9}, 1 followed by a billion zeros.

As we are in a mathematical setting we want to rely on a quantitative approach. The main question that arises naturally is: “How can we measure information?” Making sense of this question requires us to have a model for how data is produced. Throughout this seminar we will consider the probabilistic model which we now introduce.

Definition (Probabilistic model). Data is a random variable X taking values in a space A (the alphabet) with distribution P (the source distribution). We write X\sim P.

To be precise, by the above we mean that there exist a probability space (\Omega,\mathcal{F},\mathbb{P}), a measurable space (A,\mathcal{G}), and a measurable function X:\Omega\rightarrow A such that \mathbb{P}\circ X^{-1} = P.

Remarks.

  1. While this set-up is very similar to what is done in statistics, the focus in information theory is different. In statistics it is assumed that the data X comes from one of a family of distributions (statistical model), and the goal is to infer something about the particular distribution generating the data. On the other hand, in information theory the distribution of the data X might be known or not, and the goal is to compress or communicate X.
  2. In the probabilistic model we assume that the data is generated by a random source. This is a modeling assumption, not necessarily an expression of belief about how data are actually produced; nonetheless it is often a reasonable assumption to make, and it allows us to draw useful conclusions (for example, text is clearly not produced at random, but one can still do useful things by modeling it as the output of a stochastic source).
  3. The original motivation for developing information theory on the basis of the probabilistic model came from practical engineering problems (how to compress and communicate data), rather than from the question of how to measure information, although the latter was also part of the motivation. The whole field of study was created by the 1948 paper of Claude Shannon.
  4. We are going to use the probabilistic model throughout this seminar, but other models exist as well. A popular model used in theoretical computer science is the algorithmic model (which defines the field of algorithmic information theory, as opposed to probabilistic information theory). In this model it is assumed that data is the output of some computer program (running on a Turing machine). While this approach could be thought of as a generalization of the probabilistic model (in fact, one way in which computers can work is to simulate from some probability distribution), many of the basic quantities in algorithmic information theory (like Kolmogorov complexity) are not computable. This is the reason why this field is suitable for theoretical insights, but it is not necessarily suitable for practical purposes.

How do we measure information in the probabilistic model?

In what follows we assume that the information to be conveyed coincides with the data X itself (and we now assume that X takes values in some countable set A), meaning that there is no universal hidden meaning that we are trying to convey apart from the data itself. For example, assume that the information we want to convey (a particular realization of it) is the text “This is the SAS”. A natural way of measuring the amount of information contained in this data is to look for other representations of this information, and to look for the smallest (in some sense that needs to be specified) representation. As we are in the probabilistic framework, we do not know in advance which data is going to be produced by the random source, so we look for a procedure that takes the random outcome and gives us on average the smallest or most compact representation of the information in that data.

Since the data is random, the size of a particular realization of the encoded data is random as well. One way to take this randomness into account is to consider a representation (encoding scheme) that minimizes the expected length of the encoded data among all uniquely decodable schemes. That is, we measure the amount of information in given data X as

    \[\min_{\text{valid encoding schemes}} \mathbb{E}_P [\text{length of encoded data}].\]

If we set up things in this way, the measure of information is some functional H of the source distribution P, since P is the only quantity governing the data. This functional H is called the entropy and it is defined as follows.

Definition (Entropy). If P is a discrete distribution, then the entropy of P is defined as

    \[H(P) := \sum_{x\in A} p(x) \log\frac{1}{p(x)} ,\]

where we write p(x)=P(\{x\}) for the probability mass function of P.

While it can be shown that H(P) is the minimal expected length of validly encoded data, we do not proceed this way (the ideas behind this fact are covered in the first couple of lectures of an information theory class). Instead, we will give some intuition on why H(P) is a good measure of information.

We first provide some intuition for why the information in the statement X=a should be a decreasing function of P(X=a). Recall that we presently assume the source distribution P to be known. Given P, how informative is a particular outcome from the source? If P=\delta_a (i.e., P is a point mass at a), being informed that the random outcome is a conveys no information. On the other hand, if P(X=a)=10^{-10}, being informed that the outcome is a is extremely informative/significant (something very rare has happened).

The relevant decreasing function turns out to be the following:

    \[p(x) \longrightarrow \log\frac{1}{p(x)}.\]

In this respect, \log\frac{1}{p(x)} corresponds to the information that we get from the statement X=x. So the average amount of information in the random outcome X is given by

    \[\mathbb{E}_P \left[ \log\frac{1}{p(X)} \right] = \sum_{x\in A} p(x) \log\frac{1}{p(x)} 	= H(P).\]
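
For concreteness, here is a two-line numerical illustration (in nats) of the entropy as the average of \log\frac{1}{p(X)}: a fair coin carries more information per outcome than a heavily biased one.

    import numpy as np

    # Entropy H(P) = E_P[log(1/p(X))], computed in nats.
    def H(p):
        p = np.asarray(p, dtype=float)
        return float(np.sum(p * np.log(1.0 / p)))

    print(H([0.5, 0.5]))    # log 2 ~ 0.693: fair coin, maximal uncertainty on 2 symbols
    print(H([0.99, 0.01]))  # ~ 0.056: nearly deterministic source, little information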

Connection between information theory and statistics

While the connection between information theory and statistics is not surprising as both fields rely on the probabilistic model, this correspondence is very strong and natural. We give some examples.

  1. Maximum likelihood estimators (MLE) can be seen as minimal codelength estimators. In a statistical model we assume that X\sim P_{\theta}, with \theta\in \Theta for some parameter space \Theta, and the goal is to find the parameter \theta that generated the data. A popular estimator is the MLE since it is plausible to assume that the parameter that generated the data X is the parameter \tilde\theta whose corresponding distribution would have given maximal probability to X, that is,

        \[\hat\theta := \mathop{\mathrm{argmax}}_{\tilde\theta\in\Theta} P_{\tilde\theta}(X).\]

    Note that we can rewrite the above as

        \[\hat\theta = \mathop{\mathrm{argmin}}_{\tilde\theta\in\Theta} \log \frac{1}{P_{\tilde\theta}(X)},\]

    which can be interpreted as the number of bits (or nats) needed to encode X using the code that is optimal when X is generated by P_{\tilde\theta} (its codelength). Hence the connection between the MLE in statistics and the minimal codelength estimator in information theory: we do not know the distribution generating the data, and we try to find a good code for it; the problem of finding a good code is essentially equivalent to the problem of finding the distribution itself, since knowing the distribution yields the optimal code. A toy numerical sketch of this correspondence is given below, after this list. We also mention that many penalized-MLE estimators (where the complexity of the model is taken into account by adding a penalty term) can be motivated from an information-theoretic point of view in terms of analogous coding problems; this is the idea behind the “Minimum Description Length” principle.

  2. In hypothesis testing, the optimal error exponents are information-theoretic quantities.

These are not just coincidental connections, but examples of basic relationships between fundamental limits of statistical inference on the one hand, and fundamental limits of communication and compression of data on the other hand.
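
To make the first connection concrete, here is the toy Python sketch mentioned above; the Bernoulli model, the parameter grid, and the simulated data are hypothetical choices made purely for illustration. Minimizing the codelength \log\frac{1}{P_{\tilde\theta}(X)} over the model recovers the maximum likelihood estimate.

    import numpy as np

    # MLE as minimal codelength in a Bernoulli(theta) model (toy example).
    rng = np.random.default_rng(3)
    x = rng.binomial(1, 0.3, size=100)             # hypothetical i.i.d. data

    thetas = np.linspace(0.01, 0.99, 99)           # grid of candidate parameters
    # codelength (in nats) of x under the code that is optimal for Bern(theta)
    codelength = -(x.sum() * np.log(thetas) + (len(x) - x.sum()) * np.log(1 - thetas))
    theta_hat = thetas[np.argmin(codelength)]      # argmin codelength = argmax likelihood
    print(theta_hat, x.mean())                     # both close to the true value 0.3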

We now turn to the main topic of this seminar, that is, the connection between information theory and probability theory.

Connection between information theory and probability theory

Since we are using a probabilistic model, it is clear that probability theory is the language of information theory. However, it is not so obvious that information theory can say something fundamental about probability theory. In fact, over the past half century or so, it has been realized that information theory captures many fundamental concepts in probability theory. Before turning to one key example of such a connection (the entropic central limit theorem), which will serve as motivation for the first few lectures of the seminar, we introduce some relevant quantities.

Definition (Differential or Boltzmann-Shannon entropy). If X\in\mathbb{R}^n, X\sim P and \frac{dP}{d \text{Leb}}=f (i.e., X has a density f with respect to the Lebesgue measure), then the differential entropy of P (equivalently, the differential entropy of f) is defined as

    \[h(P) := h(f) := - \int_{\mathbb{R}^n} f(x) \log f(x) dx,\]

with the conventions 0\log 0 = 0 and dx = \text{Leb}(dx).
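
As a quick numerical check of the definition, the following sketch (using scipy for the quadrature) computes the differential entropy of a centered Gaussian and compares it with the closed-form value \frac{1}{2}\log(2\pi e\sigma^2); the choice \sigma=2 is arbitrary.

    import numpy as np
    from scipy import integrate

    # Differential entropy of N(0, sigma^2), numerically and in closed form (nats).
    sigma = 2.0
    f = lambda x: np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    h, _ = integrate.quad(lambda x: -f(x) * np.log(f(x)), -30, 30)
    print(h, 0.5 * np.log(2 * np.pi * np.e * sigma**2))   # both ~ 2.112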

While we can think of h as a measure of disorder (particularly motivated by the setting introduced by Boltzmann in physics), h is not a measure of information in the same sense as H is. The reason is that in the present context of “continuous” data (recall that a possible outcome of X is a point of \mathbb{R}^n) we would need infinitely many bits to encode an outcome of X exactly, so it is not meaningful to speak of the amount of information in an outcome: it is generally infinite. Nonetheless, the differential entropy is a crucial quantity in information theory; it shows up, for example, both when considering communication over channels with continuous noise distributions, and when considering lossy data compression (the only kind of data compression possible with sources taking values in spaces like \mathbb{R}^n, where one accepts some slight distortion of the data in order to be able to encode it with finitely many bits).

The notion that unifies the continuous entropy h with the discrete entropy H previously introduced is the relative entropy which we now define.

Definition (Relative entropy). If P is a probability measure and Q is a \sigma-finite measure on A, then the relative entropy between P and Q is defined as

    \[D(P || Q) :=  	\begin{cases} 		\int f \log f \, dQ	&\text{if $P\ll Q$ with $\frac{dP}{dQ}=f$},\\ 		\infty		&\text{otherwise}. 	\end{cases}\]

Typically P and Q have respective densities p and q with respect to a given reference measure \lambda. Then the relative entropy reads

    \[D(P || Q) = \int p(x) \log \frac{p(x)}{q(x)} \,\lambda(dx).\]

The following examples show how the relative entropy relates h and H.

  1. If A is a countable set and \lambda is the counting measure, then

        \[D(P || \lambda) = \sum_{x\in A} p(x) \log p(x) = - H(P).\]

  2. If A=\mathbb{R}^n and \lambda is the Lebesgue measure, then

        \[D(P || \lambda) = \int p(x) \log p(x) \, dx = - h(P).\]

The following property of relative entropy is the most important inequality in information theory.

Lemma. Let P be a probability measure on A, and Q be a sub-probability measure on A (i.e., Q is a nonnegative, countably additive measure with 0<Q(A)\le 1). Then D(P || Q) \ge 0.

Proof. We only need to consider the case where P\ll Q. Let f=\frac{dP}{dQ} and R=Q/Q(A). Then we have

    \begin{align*} 	D(P || Q)  	&= \int f(x) \log f(x) \,Q(dx) 	= Q(A)\,\mathbb{E}_R\left[f(X)\log f(X)\right]\\ 	&\ge Q(A)\, \mathbb{E}_R[f(X)]\log\mathbb{E}_R[f(X)] = \log\frac{1}{Q(A)} \ge 0, \end{align*}

where we have applied Jensen’s inequality (which holds as R is a probability measure) using that x\mapsto x\log x is convex, and used that \mathbb{E}_R[f(X)]=P(A)/Q(A)=1/Q(A) and that Q(A)\le 1. \square
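
A one-line numerical illustration on a finite alphabet (the sub-probability measure below is an arbitrary example): the chain of inequalities in the proof in fact gives the stronger bound D(P||Q) \ge \log\frac{1}{Q(A)}.

    import numpy as np

    # D(P||Q) >= log(1/Q(A)) >= 0 for a sub-probability measure Q with mass 0.7.
    rng = np.random.default_rng(5)
    p = rng.dirichlet(np.ones(6))                  # probability measure P
    q = 0.7 * rng.dirichlet(np.ones(6))            # sub-probability measure, Q(A) = 0.7
    d = float(np.sum(p * np.log(p / q)))
    print(d, np.log(1 / 0.7))                      # d exceeds log(1/Q(A)) ~ 0.357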

As a consequence of this result we can now show that the Gaussian distribution maximizes the entropy under constraints on the first two moments.

Lemma. Let \mathcal{P}_{\mu,\sigma^2} be the class of all probability densities on \mathbb{R} (with respect to Lebesgue measure) with mean \mu and variance \sigma^2 and define

    \[g_{\mu,\sigma^2}(x) := \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}.\]

Then h(g_{\mu,\sigma^2}) \ge h(f) for any f\in\mathcal{P}_{\mu,\sigma^2}.

Proof. First of all note that

    \begin{align*} 	\int g_{\mu,\sigma^2}(x) \log g_{\mu,\sigma^2}(x) \, dx  	= \int f(x) \log g_{\mu,\sigma^2}(x) \, dx  \end{align*}

as \log g_{\mu,\sigma^2} is a quadratic function, so only the first two moments (which are the same for f and g_{\mu,\sigma^2}) are involved in computing its expectation. Hence, we have

    \begin{align*} 	h(g_{\mu,\sigma^2}) - h(f)  	&= - \int g_{\mu,\sigma^2}(x) \log g_{\mu,\sigma^2}(x) \, dx + \int f(x) \log f(x) \, dx\\ 	&= \int f(x) \log \frac{f(x)}{g_{\mu,\sigma^2}(x)} \,dx = D(f||g_{\mu,\sigma^2}) \ge 0. \qquad\square \end{align*}
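
For a concrete instance, the following sketch compares (by numerical integration) the differential entropy of a Laplace density with mean 0 and variance 1 against that of the standard Gaussian; the Laplace density is an arbitrary choice, and any density in \mathcal{P}_{0,1} would do.

    import numpy as np
    from scipy import integrate

    # h(Laplace with variance 1) < h(N(0,1)), as the lemma predicts (nats).
    b = 1 / np.sqrt(2)                              # Laplace scale: variance = 2*b^2 = 1
    f = lambda x: np.exp(-np.abs(x) / b) / (2 * b)
    g = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

    h_f, _ = integrate.quad(lambda x: -f(x) * np.log(f(x)), -12, 12, points=[0])
    h_g, _ = integrate.quad(lambda x: -g(x) * np.log(g(x)), -12, 12)
    print(h_f, h_g)                                 # ~1.347 < ~1.419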

We are now ready to present the first example of cross-fertilization where information-theoretic concepts can be used to capture fundamental properties in probability theory. Let us first recall the classical central limit theorem (CLT).

Theorem (CLT). If X_1,X_2,\ldots are i.i.d. real-valued random variables with mean 0 and variance 1, then

    \[S_n=\frac{1}{\sqrt{n}} \sum_{i=1}^n X_i \stackrel{\mathcal{D}}{\longrightarrow} N(0,1),\]

that is,

    \[\mathbb{P}\{S_n\in A\} \longrightarrow \frac{1}{\sqrt{2 \pi}} \int_A e^{-\frac{x^2}{2}} \,dx\]

for nice enough sets A.

Denote by f_{S_n} the density of the normalized partial sum S_n introduced in the statement of the theorem above (assuming this density exists). We note the following.

  1. For each n\ge 1 we have f_{S_n}\in\mathcal{P}_{0,1}: the normalized sum S_n again has mean 0 and variance 1, by linearity of the expectation and independence of the X_i.
  2. From the previous lemma it follows immediately that

        \[\mathop{\mathrm{argmax}}_{f\in \mathcal{P}_{0,1}} h(f) = g_{0,1}.\]

So, the CLT tells us that the sequence f_{S_1}, f_{S_2}, \ldots\in\mathcal{P}_{0,1} converges to the maximizer g_{0,1} of the entropy in \mathcal{P}_{0,1}. In fact, it turns out that the convergence in the central limit theorem can be studied in terms of the entropy and that the CLT is an expression of increasing entropy, as the following entropic central limit theorem describes.

Theorem (Entropic CLT). Let X_1,X_2,\ldots be i.i.d. real-valued random variables with mean 0 and variance 1, and assume the distribution of X_1 has a density (with respect to Lebesgue measure). Under minimal assumptions (specifically, that h(f_{S_n})>-\infty for some n), we have

    \[h(f_{S_n}) \uparrow h(g_{0,1}),\]

or, equivalently,

    \[D(f_{S_n}||g_{0,1}) \downarrow 0.\]

The entropic central limit theorem is remarkable as usually limit theorems do not come with an associated monotonicity statement. This suggests that the relative entropy is a natural tool to analyze the CLT.
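
Although the proof is the subject of the coming lectures, the monotone increase of entropy is already visible in simulation. The following rough sketch estimates h(f_{S_n}) for standardized sums of centered uniform random variables with a simple histogram (plug-in) estimator; the sample size, the number of bins, and the uniform starting distribution are arbitrary choices, and the estimates are only approximate.

    import numpy as np

    # Monte Carlo illustration of the entropic CLT (entropies in nats).
    rng = np.random.default_rng(4)

    def entropy_estimate(samples, bins=200):
        # crude plug-in estimate of the differential entropy from a histogram
        counts, edges = np.histogram(samples, bins=bins)
        p = counts / counts.sum()
        w = edges[1] - edges[0]                    # common bin width
        p = p[p > 0]
        return float(-np.sum(p * np.log(p / w)))

    X = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(200_000, 16))  # mean 0, variance 1
    for n in (1, 2, 4, 8, 16):
        S = X[:, :n].sum(axis=1) / np.sqrt(n)      # normalized partial sum S_n
        print(n, entropy_estimate(S))              # increases toward h(g_{0,1})
    print(0.5 * np.log(2 * np.pi * np.e))          # h(g_{0,1}) ~ 1.4189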

Of course, a natural question that presents itself is whether other limit theorems in probability can be understood from a similar information-theoretic point of view.

Plan for future lectures

In the next lecture or two we will present a full proof of the entropic central limit theorem, and also discuss briefly how other limit theorems can be understood analogously from this information-theoretic point of view. Later, we will look at behavior finer than that captured by limit theorems; for instance, we may look at how information theory can provide insight into large deviations and concentration inequalities.

Lecture by Mokshay Madiman | Scribed by Patrick Rebeschini

25. September 2013 by Ramon van Handel
Categories: Information theoretic methods | Comments Off
