Lecture 8. Entropic cone and matroids

This lecture introduces the notion of the entropic cone and its connection with entropy inequalities.

Entropic cone

Recall that if X is a discrete random variable with distribution P_X, the entropy of X is defined as

    \[H(X) = H(P_X) = -\sum_{x\in\mathrm{supp}(P_X)}P_X(x)\log_2 P_X(x).\]

Now let (X_1,\ldots,X_m)\sim \mu be (not necessarily independent) discrete random variables. It will be convenient to define the entropy function

    \[H(S) := H(X_S) = H(\mu|_{S}) \quad\mbox{where}\quad S\subseteq[m]:=     \{1,\ldots,m\},\quad X_S:=\{X_i:i\in S\}.\]

As the entropy depends only on the probabilities of the outcomes of a random variable and not on its values, we will assume without loss of generality in the sequel that X_1,\ldots,X_n take values in \mathbb{Z}_+.

For any probability \mu on \mathbb{Z}_+^m, let \vec{H_\mu} be the (2^m-1)-dimensional vector

    \[\vec{H_\mu} := \{H(S):S\subseteq[m],~S\ne\varnothing\}.\]

We can now define the entropic cone.

Definition. The entropic cone is the set

    \[\Gamma_m^* := \{\vec{H_\mu}:\mu\mbox{ a probability on }\mathbb{Z}_+^m\}   \subset\mathbb{R}^{2^m-1},\]

i.e., the set of all vectors that can be obtained as the entropy function of m variables.

Question. Can we characterize \Gamma_m^*?


  1. If m=1, then \Gamma_1^*=\mathbb{R}_+, as there exist random variables of arbitrary (nonnegative) entropy.
  2. If m=2, the vector \vec{H_\mu} for (X_1,X_2)\sim\mu is

        \[\vec{H_\mu} = \left( \begin{array}{c} H(X_1) \\ H(X_2) \\ H(X_1,X_2) \end{array} \right).\]

    What constraints must this vector satisfy? We must certainly have

        \[H(X_1) \vee H(X_2) \leq H(X_1,X_2) \leq H(X_1)+H(X_2),\]

    which follows from the chain rule together with positivity for the first inequality, and that conditioning reduces entropy for the second. Are these the only constraints? In fact, we can create many vectors given these constraints. For example:

    • The vector (\alpha,\alpha,\alpha)^T is obtained by taking Z such that H(Z)=\alpha, and letting X_1=X_2=Z.
    • The vector (\alpha,\alpha,2\alpha)^T is obtained by taking X_1,X_2 to be i.i.d. copies of Z.
    • Convex combinations of these vectors can be obtained by taking mixtures of their distributions.

    In fact, a careful analysis shows that the above constraints completely characterize the entropic cone for m=2: that is,

        \[\Gamma_2^* = \{(x,y,z)\in\mathbb{R}_+^3: x\vee y\le z\le x+y\}.\]

  3. If m=3, what happens? Here we are are still subject to the same constraints as in the case m=2, but we pick up some new constraints as well such as

        \[H(X_1,X_2) + H(X_2,X_3) \ge H(X_1,X_2,X_3) + H(X_2)\]

    which is equivalent to the inequality H(X_1|X_2) \ge H(X_1|X_2,X_3).

Evidently, the constraints on the entropic cone correspond to entropy inequalities. What type of inequalities must hold for any given m? Let us think about this question a bit more systematically.

Definition. A set function f:2^{[m]}\to\mathbb{R} is called a polymatroid if

  1. f(\varnothing)=0, f(S)\ge 0 for all S.
  2. f(S)\le f(T) when S\subseteq T (monotonicity).
  3. f(S\cup T)+f(S\cap T)\le f(S)+f(T) (submodularity).

It is not difficult to check that

Lemma. The entropy function H is a polymatroid.

Other examples of polymatroids: (1) entropy, mutual information; (2) max-weight f(S)=\max_{x\in S}w(x) for given weights w; (3) flows; (4) cuts (not monotone, but submodular); (5) Von Neumann entropy.

Let us define

    \[\Gamma_m = \{ \mbox{polymatroid vectors on }[m]\}\subset\mathbb{R}^{2^m-1}.\]

Evidently \Gamma_m is a polyhedral cone (the intersection of a finite number of halfplanes).


  1. \Gamma_3^*\ne \Gamma_3.
  2. \mathrm{cl}\,\Gamma_3^*=\Gamma_3.
  3. \mathrm{cl}\,\Gamma_m^*\ne\Gamma_m for m\ge 4.
  4. \mathrm{cl}\,\Gamma_m^* is a convex cone for all m\ge 1 (but not polyhedral for m\ge 4).

For the proof, we refer to the book for R. Yeung.

As an example of \Gamma_3^*\ne \Gamma_3, we note that the vector (\alpha,\alpha,\alpha,2\alpha,2\alpha,2\alpha,2\alpha)^T is only achievable when \alpha=\log_2M for some M\in\mathbb{Z}_+. Let us also note that convexity of \Gamma_m^* holds in its interior but not on the boundary. However, issues only arise on the boundary of \Gamma_3^*, that is, no new entropy inequalities appear beyond the polymatroid inequalities when m=3.

On the other hand, when m\ge 4, many new inequalities are introduced that actually cause holes to appear within \Gamma_m^*. One such inequality (expressed in terms of mutual information) is

    \[2I(X_3;X_4)\leq I(X_1;X_2) + I(X_1;X_3 \mid X_4) + 3I(X_3;X_4 \mid X_1) + I(X_3;X_4 \mid X_2).\]

Using the entropic cone

The entropic cone can be used to obtain new information theoretic inequalities. For example, the following question arose in the study of mechanism design in computer science.

Problem. Let X_1,\ldots,X_m be discrete random variables. Define the m\times m matrix A as A_{ij}=I(X_i;X_j)=H(X_i)+H(X_j)-H(X_i,X_j). Is A positive definite?

When m=2, we have \det(A)=H(X_1)H(X_2)-I(X_1 ; X_2)^2 \geq 0 since H(X_1) \geq I(X_1;X_2). Thus all the principal minors of A have positive determinant, so A is positive definite.

How about m=3? Note that A depends linearly on the entries of the vector \vec{H_\mu} (where \mu is the joint distribution of X_1,\ldots,X_3). Thus if A is positive definite for distributions on the extreme rays (ER) of the entropic cone, then A must be positive definite for any distribution. More generally:

Proposition. If F : \mathbb{R}_+^7 \rightarrow \mathbb{R} is convex, then F(\vec{H}) \geq 0 holds for all \vec H \in \Gamma_3^* if and only if F(\vec H_i) \geq 0 for all \vec H_i \in ER(\Gamma_3).

Proof. It suffices to note that

    \[F(\lambda H_i + (1-\lambda) H_j) \geq \lambda F(H_i) + (1-\lambda)F(H_j) \geq 0\]

for every 0\le\lambda\le 1, and to use that \Gamma_3=\mathrm{cl}\,\Gamma_3^*. \square

This necessary and sufficient condition for m=3 generalizes to a sufficient condition for m\ge 4.

Proposition. If F : \mathbb{R}_+^{2^m-1} \rightarrow \mathbb{R} is convex, then F(\vec{H}) \geq 0 holds for all \vec H \in \Gamma_m^* if F(\vec H_i) \geq 0 for all \vec H_i \in ER(\Gamma_m}).

As \Gamma_m is polyhedral, this simplifies solving problems such as checking positive definiteness of A significantly: it suffices to check a finite number of cases, which can essentially be done by hand. It can be shown this way that A is always positive definite for m=3, but this can fail for m\ge 4.


It is sometimes of interest to investigate discrete variants of the above questions.

Definition. A matroid f : 2^{[m]} \rightarrow \mathbb{Z}_+ is defined by the conditions

  1. f(S)\le |S| for all S;
  2. f is monotone;
  3. f is submodular.


  1. Vector matroids (also called \mathbb{F}-representable matroids, where \mathbb{F} is a field). Given a matrix A with values in \mathbb{F}, define f(S) = rank(A(S)) where A(S) denotes the columns induced by S.
  2. Graphical matroids. Let G=(V,E) be a graph, and choose m=|E|. Then define f(S) = \max|H|, where the maximum is taking over acyclic subgraphs H \subset S.
  3. Entropic matroids. Let f(S)=H(X_S). For what distributions of X_1,\ldots,X_m is this a matroid?

Denote by \tilde\Gamma^*_{m} the entropic cone induced by those distributions where X_1,\ldots,X_m take values in \{0,1\}. In order for such a distribution \mu to define an entropic matroid, the vector \vec{H_\mu} must take values in \mathbb{Z}^{2^m-1}. Thus we are led to consider the set

    \[\Pi_{m} := \tilde\Gamma^*_{m} \cap \mathbb{Z}^{2^m-1}.\]

Can one characterize what type of matroids can arise as entropic matroids?

Theorem. \Pi_m coincides with the set of all GF(2)-representable matroids with m elements.

For the proof and further results around this theme, see the paper of E. Abbe.

Lecture by Emmanuel Abbe | Scribed by Danny Gitelman

06. December 2013 by Ramon van Handel
Categories: Information theoretic methods | Comments Off on Lecture 8. Entropic cone and matroids

Lecture 7. Entropic CLT (4)

This lecture completes the proof of the entropic central limit theorem.

From Fisher information to entropy (continued)

In the previous lecture, we proved the following:

Theorem. If X_1, ..., X_n are independent random variables with I(X_i)<+\infty, then

(1)   \begin{equation*} I\left(\sum_{i\in[n]}X_{i}\right)\leq\sum_{s\in\mathcal{G}}\frac{\omega_{s}^{2}}{\beta_{s}}\, I\left(\sum_{i\in s}X_{i}\right), \end{equation*}

where \{\omega_{s}: s\in\mathcal{G}\} are weights satisfying \sum_{s\in\mathcal{G}}\omega_{s}=1 and \{\beta_{s}\colon s\in\mathcal{G}\} is any fractional packing of the hypergraph G in the sense that \sum_{s\in\mathcal{G}, s\ni i}\beta_{s}\leq 1 for every i\in[n]=\{1,...,n\}.

By optimizing the choice of the weights, it is easy to check that (1) is equivalent to

    \begin{equation*} \frac{1}{I\left(\sum_{i\in[n]}X_{i}\right)}\geq \sum_{s\in\mathcal{G}}\frac{\beta_{s}}{I\left(\sum_{i\in s}X_{i}\right)}. \end{equation*}

In fact, an analogous statement holds for entropy:

(2)   \begin{equation*} e^{2h\left(\sum_{i\in[n]}X_{i}\right)}\geq \sum_{s\in\mathcal{G}}\beta_{s}\, e^{2h\left(\sum_{i\in s}X_{i}\right)}. \end{equation*}

Proving (2) for general fractional packings and general hypergraphs requires some additional steps that we will avoid here. Instead, we will prove the a special case, due to Artstein, Ball, Barthe and Naor (2004), that suffices to resolve the question of monotonicity of entropy in the CLT.

In the following, let \mathcal{G} be the set of all subsets of [n] of size n-1 and take \beta_{s}=\frac{1}{n-1} for every s\in\mathcal{G}. In this case, (2) takes the following form.

Theorem EPI. If X_1, ..., X_n are independent random variables, then

    \begin{equation*} e^{2h\left(\sum_{i\in[n]}X_{i}\right)}\geq \frac{1}{n-1}\sum_{i=1}^{n}e^{2h\left(\sum_{j\neq i}X_{j}\right)}, \end{equation*}

provided that all the entropies exist.

To prove Theorem EPI, we will use the de Bruijn identity which was discussed in the previous lecture. Let us rewrite it in a useful form for the coming proof and give a proof.

Theorem (de Bruijn identity). Let X be a random variable with density f on \mathbb{R}. Let X^{t}=e^{-t}X+\sqrt{1-e^{-2t}}Z, t\geq 0, where Z\sim N(0,1) is independent of X. We have

    \begin{equation*} \frac{d}{dt}h(X^{t})=I(X^{t})-1. \end{equation*}

In particular,

(3)   \begin{equation*} h(N(0,1))-h(X)=\int_{0}^{\infty}[I(X^{t})-1]dt. \end{equation*}

Proof. Let f_{t} be the density of X^{t}. Then

    \begin{equation*} \frac{\partial f_{t}(x)}{\partial t}=(Lf_{t})(x), \end{equation*}

where (L\psi)(x)=\psi^{\prime\prime}(x)+\frac{d}{dx}[x\psi(x)]. It is easy to check that \int (Lf_{t})(x)dx=0. Hence,

    \begin{equation*} \begin{split} \frac{d}{dt}\left[-\int f_{t}(x)\log f_{t}(x)dx\right]&=-\int \frac{\partial f_{t}(x)}{\partial t}\log f_{t}(x)dx-\int f_{t}(x)\frac{1}{f_{t}(x)}\frac{\partial f_{t}(x)}{\partial x}dx\\ &=-\int (Lf_{t})(x)\log f_{t}(x)dx-\int (Lf_{t})(x)dx\\ &=-\int f_{t}^{\prime\prime}(x)dx\log f_{t}(x)dx-\int \frac{\partial}{\partial x}[xf_{t}(x)]\log f_{t}(x)dx\\ &=\int \frac{(f_{t}^{\prime}(x))^{2}}{f_{t}(x)}dx+\int xf_{t}^{\prime}(x)dx\\ &=I(X^{t})-\int f_{t}(x)dx\\ &=I(X^{t})-1. \end{split} \end{equation*}

Integrating from 0 to \infty gives (3). \square

Proof of Theorem EPI. In (1), let \mathcal{G} be the set of all subsets of [n] of size n-1 and \beta_{s}=\frac{1}{n-1}. We have

(4)   \begin{equation*} I\left(\sum_{i\in[n]}X_{i}\right)\leq(n-1)\sum_{i\in[n]}w_{i}^{2}\, I\left(\sum_{j\neq i}X_{j}\right), \end{equation*}

where, w_{i}=\omega_{[n]\setminus\{i\}} is the weight corresponding to the set [n]\setminus\{i\}\in\mathcal{G}. It is easy to check that (4) is equivalent, for every (a_{1},...,a_{n})\in\mathbb{R}^{n} with \sum_{i=1}^{n}a_{i}^{2}=1, to

    \begin{equation*} I\left(\sum_{i=1}^{n}a_{i}X_{i}\right)\leq(n-1)\sum_{i=1}^{n}w_{i}^{2}\, I\left(\sum_{j\neq i}a_{j}X_{j}\right)=(n-1)\sum_{i=1}^{n}\frac{w_{i}^{2}}{\sum_{j\neq i}a_{j}^{2}}\, I\left(\frac{\sum_{j\neq i}a_{j}X_{j}}{\sqrt{\sum_{j\neq i}a_{j}^{2}}}\right). \end{equation*}

Note the last equality is due to the scaling property of Fisher information. Recalling that w_{1},...,w_{n} are arbitrarily chosen weights with \sum_{i=1}^{n}w_{i}=1, we conclude that

(5)   \begin{equation*} I\left(\sum_{i=1}^{n}a_{i}X_{i}\right)\leq(n-1)\sum_{i=1}^{n}b_{i}^{2}\, I\left(\frac{\sum_{j\neq i}a_{j}X_{j}}{\sqrt{\sum_{j\neq i}a_{j}^{2}}}\right), \end{equation*}

for any choice of (a_1,...,a_n)\in\mathbb{R}^n with \sum_{i=1}^{n}a_{i}^{2}=1 and (b_1,...,b_n)\in \mathbb{R}^{n}_{+} with \sum_{i=1}^{n}b_{i}\sqrt{\sum_{j\neq i}a_{j}^{2}}=1.

Let us keep the choice of such (a_1,...,a_n) arbitrary for the moment and let b_{i}=\frac{1}{n-1}\sqrt{\sum_{j\neq i}a_{j}^{2}}. Then

(6)   \begin{equation*}\sum_{i=1}^{n}b_{i}\sqrt{\sum_{j\neq i}a_{j}^{2}}=\sum_{i=1}^{n}\frac{1}{n-1}\sum_{j\neq i}a_{j}^{2}=\frac{1}{n-1}\,(n-1)\sum_{i=1}^{n}a_{i}^{2}=1. \end{equation*}

Let X_{i}^{t}=e^{-t}X_{i}+\sqrt{1-e^{-2t}}Z_{i}, where Z_{i} is independent of all else. Then

    \begin{equation*}I\left(\left(\sum_{i=1}^{n}a_{i}X_{i}\right)^{t}\right)=I\left(\sum_{i=1}^{n}a_{i}X_{i}^{t}\right)\leq \frac{1}{n-1}\sum_{i=1}^{n}\left(\sum_{j\neq i}a_{j}^{2}\right)\, I\left(\frac{\sum_{j\neq i}a_{j}X_{j}^{t}}{\sqrt{\sum_{j\neq i}a_{j} ^{2}}}\right). \end{equation*}

Note that

    \begin{equation*} \begin{split}\frac{\sum_{j\neq i}a_{j}X_{j}^{t}}{\sqrt{\sum_{j\neq i}a_{j}^{2}}}&=\frac{\sum_{j\neq i}a_{j}\left(e^{-t}X_{j}+\sqrt{1-e^{-2t}}Z_{j}\right)}{\sqrt{\sum_{j\neq  i}a_{j}^{2}}}\\&=e^{-t}\cdot\frac{\sum_{j\neq i}a_{j}X_{j}}{\sqrt{\sum_{j\neq i}a_{j}^{2}}}+\sqrt{1-e^{-2t}}\cdot\frac{\sum_{j\neq i}a_{j}Z_{j}}{\sqrt{\sum_{j\neq i}a_{j}^{2}} }\\&\stackrel{(D)}{=}\left(\frac{\sum_{j\neq i}a_{j}X_{j}}{\sqrt{\sum_{j\neq i}a_{j }^{2}}}\right)^{t}, \end{split} \end{equation*}

where the equality in distribution is due to the fact that \frac{\sum_{j\neq i} a_{j}Z_{j}}{\sqrt{\sum_{j\neq i}a_{j}^{2}}}\sim N(0,1). Hence, by (6),

    \begin{equation*}I\left(\left(\sum_{i=1}^{n}a_{i}X_{i}\right)^{t}\right)-1\leq \frac{1}{n-1}\sum_{i=1}^{n}\left(\sum_{j\neq i}a_{j}^{2}\right)\left[I\left(\left(\frac{\sum_ {j\neq i}a_{j}X_{j}}{\sqrt{\sum_{j\neq i}a_{j}^{2}}}\right)^{t}\right)-1\right]. \end{equation*}

We use de Bruijn identity to integrate this from 0 to \infty over time and get

    \begin{equation*}h(N(0,1))-h\left(\sum_{i=1}^{n}a_{i}X_{i}\right)\leq\frac{1}{n-1}\sum_{i=1}^{n}\left(\sum_{j\neq i}a_{j}^{2}\right)\left[h(N(0,1))-h\left(\frac{\sum_{j\neq  i} a_{j}X_{j}}{\sqrt{\sum_{j\neq i}a_{j}^{2}}}\right)\right]. \end{equation*}

By (6) again,

(7)   \begin{equation*}h\left(\sum_{i=1}^{n}a_{i}X_{i}\right)\geq \frac{1}{n-1}\sum_{i=1}^{n}\left(\sum_{j\neq i}a_{j}^{2}\right)h\left(\frac{\sum_{j\neq i} a_{j}X_{j}}{\sqrt{\sum_{j\neq i}a_{j}^{2}}}\right) \end{equation*}

This is the entropy analog of the Fisher information inequality obtained above.

As a final step, let N_{i}=e^{2h\left(\sum_{j\neq i}X_{j}\right)}. We are to show

    \[e^{2h\left(\sum_{i=1}^{n}X_{i}\right)}\geq \frac{1}{n-1}\sum_{i=1}^{n}N_{i}.\]

If N_{k}\geq \frac{1}{n-1}\sum_{i=1}^{n}N_{i} for some k, the result is immediate due to the general fact h(Y_1+Y_2)\geq h(Y_1) (convolution increases entropy). Hence, we assume that N_{k}\leq \frac{1}{n-1}\sum_{i=1}^{n}N_{i} for every k, so

    \begin{equation*} \lambda_{k}:=\frac{N_{k}}{\sum_{i=1}^{n}N_{i}}\leq \frac{1}{n-1}. \end{equation*}

Set a_{i}=\sqrt{1-(n-1)\lambda_{i}}, i\in[n]. Note that

    \begin{equation*} \sum_{i=1}^{n}a_{i}^{2}=\sum_{i=1}^{n}1-(n-1)\sum_{i=1}^{n}\lambda_{i}=n-(n-1)=1. \end{equation*}

With this choice of a_1,...,a_n, we apply (7). After some computations, we get

    \begin{equation*} h\left(\sum_{i=1}^{n}X_{i}\right)\geq \sum_{i=1}^{n}\lambda_{i}\,h\left(\frac{1}{\sqrt{(n-1)\lambda_{i}}}\sum_{j\neq i}X_{j}\right). \end{equation*}

Using the definition of the \lambda_{i}, the final result follows from here immediately. \square

Proof of the Entropic Central Limit Theorem

Entropic CLT. Suppose that X_1, ..., X_n are i.i.d. random variables with \mathbb{E} X_{i}=0 and \text{Var}X_{i}=1. Let S_{n}=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}X_{i}. If h(S_{n})>-\infty for some n, then h(S_{n})\uparrow h(N(0,1)) as n\rightarrow\infty. Equivalently, if D(\mathcal{L}(S_{n})||N(0,1))<+\infty for some n, then D(\mathcal{L}(S_{n})||N(0,1))\downarrow 0 as n\rightarrow\infty.

In the case of i.i.d. random variables, Theorem EPI gives

    \begin{equation*} e^{2h\left(\sum_{i=1}^{n}X_{i}\right)}\geq \frac{n}{n-1}e^{2h\left(\sum_{j=1}^{n-1}X_{i}\right)}, \end{equation*}

which is equivalent to h(S_n)\geq h(S_{n-1}) because of the identity h(aX)=h(X)+\log\abs{a}. Therefore, it remains to show h(S_{n})\rightarrow h(N(0,1)) as n\rightarrow \infty. For this, we need some analytical properties of the relative entropy. Henceforth, we use \mathcal{P}(\mathcal{X}) to denote the set of all probability measures on some Polish space \mathcal{X} (take \mathcal{X}=\mathbb{R}^d for instance).

Proposition (Variational characterization of D). Let \gamma,\theta\in\mathcal{P}(\mathcal{X}). Then

    \begin{equation*} D(\gamma||\theta)=\sup_{g\in\text{BM}(\mathcal{X})}\left(\int gd\gamma-\log\int e^{g}d\theta\right), \end{equation*}

where \text{BM}(\mathcal{X}) denotes the set of all bounded measurable functions g:\mathcal{X}\rightarrow\mathbb{R}.


  1. In the variational characterization of D above, it is enough to take the supremum over the set \text{BC}(\mathcal{X}) of all bounded continuous functions g:\mathcal{X}\rightarrow\mathbb{R}.
  2. For fixed g\in\text{BM}(\mathcal{X}), the mapping (\gamma,\theta)\mapsto\int gd\gamma-\log\int e^{g}d\theta is convex and continuous. Since D(\gamma||\theta) is the supremum over a class of convex, continuous functions, it is convex and lower semicontinuous. These properties of relative entropy, made transparent by the variational characterization, are very useful in many different contexts.

Corollary. Sublevel sets of D(\cdot||\theta) are compact, that is, the set \{\mu\in\mathcal{P}(\mathcal{X})\colon D(\mu||\theta)\leq M\} is compact (with respect to the topology of weak convergence) for every M<+\infty and \theta\in\mathcal{P}(\mathcal{X}).

Before we prove the corollary, let us recall the definition of tightness and Prohorov Theorem.

Definition (Tightness). A set A\subseteq\mathcal{P}(\mathcal{X}) is called tight if for every \varepsilon>0 there exists a compact set K\subseteq\mathcal{X} such that \mu(K^{c})<\varepsilon for every \mu\in A.

Prohorov Theorem. A set A\subseteq\mathcal{P}(\mathcal{X}) is weakly precompact if and only if it is tight.

Proof of Corollary. Let (\mu_{n})_{n\in\mathbb{N}} be a sequence in \mathcal{P}(\mathcal{X}) with \sup_{n\in\mathbb{N}}D(\mu_{n}||\theta)\leq M. By the variational characterization of D, for every g\in\text{BM}(\mathcal{X}), we have

(8)   \begin{equation*} \int gd\mu_{n}-\log \int e^{g}d\theta\leq D(\mu_{n}||\theta)\leq M. \end{equation*}

Note that \{\theta\} is a tight set as a singleton. We claim that the sequence (\mu_{n})_{n\in\mathbb{N}} is also tight. Indeed, let \varepsilon>0 and let K\subseteq\mathcal{X} be a compact set with \theta(K^{c})<\varepsilon. We take g=\log\left(1+\frac{1}{\varepsilon}\right)1_{K^{c}} and apply (8) to get

    \begin{equation*} \log\left(1+\frac{1}{\varepsilon}\right)\mu_{n}(K^{c})-\log\left[1+\frac{1}{\varepsilon}\,\theta(K^{c})\right]\leq M. \end{equation*}


    \begin{equation*} \mu_{n}(K^{c})\leq\frac{1}{\log\left(1+\frac{1}{\varepsilon}\right)}\left(M+\log\left[1+\frac{1}{\varepsilon}\,\theta(K^{c})\right]\right)\leq \frac{M+\log 2}{\log\left(1+\frac{1}{\varepsilon}\right)}, \end{equation*}

where the rightmost term can be made arbitrarily small. Hence, (\mu_{n})_{n\in\mathbb{N}} is tight. By Prohorov Theorem, there exists \mu\in\mathcal{P}(\mathcal{X}) such that \mu_{n}\Rightarrow\mu as n\rightarrow\infty. By lower semicontinuity of D, we have

    \begin{equation*} D(\mu||\theta)\leq \liminf_{n\rightarrow\infty}D(\mu_{n}||\theta)\leq M. \end{equation*}

This finishes the proof of the corollary. \square

Proof of Variational Characterization of D. As a first case, suppose \gamma\not\ll\theta. So D(\gamma||\theta)=+\infty and there exists a Borel set A\subseteq \mathcal{X} with \theta(A)=0 and \gamma(A)>0. Let g_{n}=n\cdot 1_{A}\in\text{BM}(\mathcal{X}). We have

    \begin{equation*} \int g_{n}d\gamma-\log\int e^{g_{n}}d\theta=n\gamma(A)\rightarrow+\infty \end{equation*}

as n\rightarrow\infty. Hence, both sides of the variational characterization are equal to +\infty.

For the rest, we assume \gamma\ll\theta. First, we show the \geq part. If D(\gamma||\theta)=+\infty, the inequality is obvious. Suppose D(\gamma||\theta)<+\infty. Given g\in\text{BM}(\mathcal{X}), define a probability measure \gamma_{0}\sim\theta by

    \begin{equation*} \frac{d\gamma_{0}}{d\theta}=\frac{e^{g}}{\int e^{g}d\theta}. \end{equation*}


    \begin{equation*} \begin{split} D(\gamma||\theta)&=\int \log\left(\frac{d\gamma}{d\theta}\right)d\gamma\\ &=\int \log\left(\frac{d\gamma}{d\gamma_{0}}\right)d\gamma+\log\left(\frac{d\gamma_{0}}{d\theta}\right)d\gamma\\ &=D(\gamma||\gamma_{0})+\int\left(g-\log\int e^{g}d\theta\right)d\gamma\\ &\geq \int gd\gamma-\log\int e^{g}d\theta. \end{split} \end{equation*}

Since g\in\text{BM}(\mathcal{X}) is chosen arbitrarily, taking supremum on the right hand side gives the \geq part.

Next, we prove the \leq part. Note that if g=\log\frac{d\gamma}{d\theta}, then \int gd\gamma-\log\int e^{g}d\theta=\int \log\frac{d\gamma}{d\theta}d\gamma=D(\gamma||\theta). However, this choice of g may not be in \text{BM}(\mathcal{X}), that is, \frac{d\gamma}{d\theta} may fail to be bounded or bounded away from zero. So we employ the following truncation argument. Let

    \begin{equation*} g_n=\min\{\max\{g,-n\},n\}\in\text{BM}(\mathcal{X}), \end{equation*}

so that g_n\rightarrow g as n\rightarrow\infty. Note that g_{n}1_{\{g\geq 0\}}\uparrow g1_{\{g\geq 0\}} and g_{n}1_{\{g< 0\}}\downarrow g1_{\{g< 0\}}. Thus we have

    \begin{equation*} \begin{split} \lim_{n\rightarrow\infty}\int e^{g_{n}}d\theta&=\lim_{n\rightarrow\infty}\left(\int e^{g_{n}}1_{\{g\geq 0\}}d\theta+\int e^{g_{n}}1_{\{g< 0\}}d\theta\right)\\ &=\lim_{n\rightarrow\infty}\left(\int e^{\min\{g,n\}}1_{\{g\geq 0\}}d\theta+\int e^{\max\{g,-n\}}1_{\{g< 0\}}d\theta\right)\\ &=\int e^{g}1_{\{g\geq 0\}}d\theta+\int e^{g}1_{\{g< 0\}}d\theta\\ &=\int e^{g}d\theta\\ &=\int \frac{d\gamma}{d\theta}d\theta\\ &=1 \end{split} \end{equation*}

by monotone convergence. On the other hand, by Fatou’s Lemma, we have

    \begin{equation*} \liminf_{n\rightarrow\infty}\int g_{n}d\gamma\geq \int gd\gamma=D(\gamma||\theta). \end{equation*}


    \begin{equation*} \liminf_{n\rightarrow\infty}\left(\int g_{n}d\gamma-\log\int e^{g_{n}}d\theta\right)\geq D(\gamma||\theta), \end{equation*}

from which the \leq part of the result follows. \square

Building on these now standard facts (whose exposition above follows that in the book of Dupuis and Ellis), Harremoes and Vignat (2005) gave a short proof of the desired convergence, which we will follow below. It relies on the fact that for uniformly bounded densities within the appropriate moment class, pointwise convergence implies convergence of entropies.

Lemma. If Y_1, Y_2, ... are random variables with \E Y_n =0, \E Y_{n}^{2}=1, and the corresponding densities f_1, f_2, ... are uniformly bounded with f_n\rightarrow f as n\rightarrow \infty(pointwise) for some density f, then h(f_n)\rightarrow h(f) and D(f_n||N(0,1))\rightarrow D(f||N(0,1)) as n\rightarrow\infty.

Proof. Recall D(f||N(0,1))=h(N(0,1))-h(f) for f with mean 0 and variance 1. By lower semicontinuity of D, we have

    \begin{equation*} \limsup_{n\rightarrow\infty}h(f_n)\leq h(f). \end{equation*}

On the other hand, letting c=\sup_{n,x}f_{n}(x), we have

    \begin{equation*} h(f_n)=c\int \left(-\frac{f_n(x)}{c}\log\frac{f_n(x)}{c}\right)dx-\log c, \end{equation*}

and using Fatou’s Lemma,

    \begin{equation*}\liminf_{n\rightarrow\infty}h(f_n)\geq c\int \left(-\frac{f(x)}{c}\log\frac{f(x) }{c}\right)dx-\log c=-\int f(x)\log f(x)dx=h(f). \end{equation*}

Hence, h(f_n)\rightarrow h(f) as n\rightarrow\infty. \square

End of proof of Entropic CLT. Assume D(\mathcal{L}(S_N)||N(0,1))<+\infty. We will use J(X)=I(X)-1 to denote normalized Fisher information. For any t\in (0,\infty), we have that J(S^{t}_{n-1})\geq J(S^{t}_{n}) for n>N. So J(S^{t}_{n})\rightarrow g(t) as n\rightarrow \infty for every t\in(0,\infty). We want to show that g\equiv 0, since then we will get by Lebesgue’s dominated convergence theorem that

    \[D(\mathcal{L}(S_n)||N(0,1))= \int_0^\infty J(S^{t}_{n}) dt \rightarrow \int_0^\infty g(t) dt= 0\]

as n\rightarrow\infty. But since

    \[D(\mathcal{L}(S_n^u)||N(0,1))= \int_u^\infty J(S^{t}_{n}) dt \rightarrow \int_u^\infty g(t) dt ,\]

it is enough to show that D(\mathcal{L}(S_n^u)||N(0,1))\rightarrow 0 for each u>0.

By the monotonicity property we have proved, we know that

    \[D(\mathcal{L}(S_n)||N(0,1)) \leq D(\mathcal{L}(S_N)||N(0,1)) <\infty\]

for any n>N. By compactness of sublevel sets of D, the sequence S_n must therefore have a subsequence S_{n_k} whose distribution converges to a probability measure (let us call Z a random variable with this limiting measure as its distribution). For u>0, the smoothing caused by Gaussian convolution implies that the density of S_{n_k}^u converges pointwise to that of Z^u, and also that the density of S_{2n_k}^u converges pointwise to that of \frac{Z^u + \tilde{Z}^u}{\sqrt{2}}, where \tilde{Z} is an independent copy of Z. By the previous lemma

    \[D(\mathcal{L}(S_{n_k}^u)||N(0,1)) \rightarrow D(\mathcal{L}(Z^u)\|N(0,1))\]

as k\rightarrow \infty, and

    \[D(\mathcal{L}(S_{2n_k}^u)||N(0,1)) \rightarrow  D(\mathcal{L}(\tfrac{Z^u + \tilde{Z}^u}{\sqrt{2}})\|N(0,1)) ,\]

so that necessarily

    \[D(\mathcal{L}(\tfrac{Z^u + \tilde{Z}^u}{\sqrt{2}})\|N(0,1)) =D(\mathcal{L}(Z^u)\|N(0,1)) .\]

By the equality condition in the entropy power inequality, this can only happen if Z^u is Gaussian, which in turn implies that g(u)=J(Z^u)=0. \square

Lecture by Mokshay Madiman | Scribed by Cagin Ararat

05. December 2013 by Ramon van Handel
Categories: Information theoretic methods | Comments Off on Lecture 7. Entropic CLT (4)

Schedule for rest of the semester

Here is the schedule for the remaining three lectures of the semester:

November 21: Emmanuel Abbe on information theoretic inequalities.

November 28: No lecture (Thanksgiving).

December 5 and 12: Ramon van Handel on transportation-information inequalities and concentration.

The remaining notes on the entropic CLT (Lecture 7) will be posted soon.

20. November 2013 by Ramon van Handel
Categories: Announcement | Comments Off on Schedule for rest of the semester

Lecture 6. Entropic CLT (3)

In this lecture, we complete the proof of monotonicity of the Fisher information in the CLT, and begin developing the connection with entropy. The entropic CLT will be completed in the next lecture.

Variance drop inequality

In the previous lecture, we proved the following decomposition result for functions of independent random variables due to Hoeffding.

ANOVA decomposition. Let X_1, \ldots, X_n be independent random variables taking values in \mathbb{R}, and let \Psi: \mathbb{R}^n \rightarrow \mathbb{R} be such that \mathbb{E}[\Psi(X_1,\ldots,X_n)^2]< \infty. Then \Psi satisfies

    \begin{align*} \Psi(X_1,\ldots,X_n)&= \sum_{t \subseteq \{1,\ldots,n\}} \bar{E_t}[\Psi] \\ \text{where } \bar{E_t}[\Psi]&:=\bigg(\prod_{i \notin t} E_i \prod_{j \in t} (I-E_j)\bigg)\Psi \\ \text{and } E_{i} [\Psi] &:= \mathbb{E}[\Psi | X_1,\ldots,X_{i-1},X_{i+1},\ldots,X_n]. \end{align*}

Furthermore, \mathbb{E}[\bar{E_t}\Psi \cdot \bar{E_s}\Psi]=0 \text{ if } t\neq s.

Note that \bar{E_t}\Psi is a function only of X_t=(X_{i_1}, \ldots, X_{i_m}) (i_1<\ldots < i_m are the ordered elements of t).

In the previous lecture, we proved subadditivity of the inverse Fisher information I^{-1}. The key part of the proof was the observation that the score function of the sum could be written as the conditional expectation of a sum of independent random variables, whose variance is trivially computed. This does not suffice, however, to prove monotonicity in the CLT. To do the latter, we need a more refined bound on the Fisher information in terms of overlapping subsets of indices. Following the same proof, the score function of the sum can be written as the conditional expectation of a sum of terms that are now no longer independent. To estimate the variance of this sum, we will use the following “variance drop lemma” whose proof relies on the ANOVA decomposition.

Lemma. Let U(X_1,\ldots,X_n)=\sum_{s \in \mathcal{G}} \Psi_s(x_s) where \Psi_s : \mathbb{R}^{|s|} \rightarrow \mathbb{R} and \mathcal{G} is some collection of subsets of \{1,\ldots,n\}. If X_1,\ldots, X_n are independent random variables with \mathbb{E} \Psi_s(X_s)^2 < \infty \text{ } \forall s \in \mathcal{G}, then

    \[\mathrm{Var}(U(X_1,\ldots, X_n)) \leq \sum_{s \in \mathcal{G}} \frac{1}{\beta_s}\,\mathrm{Var}(\Psi_s(X_s)) ,\]

where \{\beta_s : s \in \mathcal{G} \} is a fractional packing with respect to \mathcal{G}.


  1. Recall that a fractional packing is a function \beta:\mathcal{G} \rightarrow [0;1] such that \sum_{s \in \mathcal{G}, s  \ni i} \beta_s\leq 1 \text{ } \forall i \in [n].

    Example 1. Let d(i)=\#\{s \in \mathcal{G} : s \ni i \}, and define d_+= \max d(i). Taking \beta_s=\frac{1}{d_+} always defines a fractional packing as \sum_{s \in \mathcal{G}, i  \ni s} \frac{1}{d_+}=\frac{1}{d_+} \cdot d(i) \leq 1 by definition of d_+.

    Example 2. If \mathcal{G}=\{s\subseteq\{1,\ldots,n\}:|s|=m\}, then d_+=\binom{n-1}{m-1}.

  2. The original paper of Hoeffding (1948) proves the following special case where each \Psi_s is symmetric in its arguments and \mathcal{G} is as in Example 2 above: U=\frac{1}{\binom{n}{m}}\sum_{|s|=m}\Psi(X_s) (the U-statistic) satisfies

        \[\mathrm{Var}(U) \leq \frac{m}{n} \mathrm{Var}(\Psi(X_s)).\]

    Of course, if m=1 then \mathrm{Var}(U) = \frac{1}{n} \mathrm{Var}(\Psi(X_1)). Thus Hoeffding’s inequality for the variance of U-statistics above and the more general variance drop lemma should be viewed as capturing how much of a drop we get in variance of an additive-type function, when the terms are not independent but have only limited dependencies (overlaps) in their structure.

Proof. We may assume without loss of generality that each \Psi_s(X_s) has mean zero.

    \begin{align*}  U(X_1,\ldots,X_n)&=\sum_{s \in \mathcal{G}} \Psi_s(X_s)= \sum_{s \in \mathcal{G}} \sum_{t \subseteq s} \bar{E_t}[\Psi_s(X_s)]\qquad\text{ by ANOVA}\\ &= \sum_{t \subseteq [n]} \sum_{s \in \mathcal{G}, s \supseteq t} \bar{E_t} \Psi_s(X_s)\\  &= \sum_{t \subseteq [n]} \bar{E_t} \sum_{s \in \mathcal{G}, s \supseteq t} \Psi_{s}. \end{align*}

We then have, using orthogonality of the terms in the ANOVA decomposition:

    \begin{align*} \mathbb{E}U^2 &= \sum_{t \subseteq [n]} \mathbb{E} \bigg[\sum_{s \in \mathcal{G} , s \supseteq t} \bar{E_t}[\Psi_s(X_s)] \bigg]^2. \end{align*}

For each term, we have

    \begin{align*} \Big[\sum_{s \in \mathcal{G} , s \supseteq t} \frac{\sqrt{\beta_s}\bar{E_t}[\Psi_s(X_s)]}{\sqrt{\beta_s}}\Big]^2 &\leq \Big[\sum_{s \in \mathcal{G}, s \supseteq t} \beta_s\Big] \Big[\sum_{s \in \mathcal{G}, s \supseteq t} \frac{[\bar{E_t}[\Psi_s(X_s)]]^2}{\beta_s}\Big] \qquad\text{ by Cauchy-Schwarz}\\ &\leq \sum_{s \in \mathcal{G}, s \supseteq t}\frac{[\bar{E_t}[\Psi_s(X_s)]]^2}{\beta_s} ,  \end{align*}

where the second inequality follows from the definition of fractional packing if t is non-empty, and the fact that \bar{E}_{\varnothing} takes any \Psi_s to its mean. Hence

    \begin{align*} \mathbb{E}[U^2] &\leq \sum_{t \subseteq [n]} \mathbb{E} \Big[ \sum_{s \in \mathcal{G}, s \supseteq t}\frac{[\bar{E_t}[\Psi_s(X_s)]]^2}{\beta_s} \Big]\\ &= \sum_{s \subseteq \mathcal{G}} \frac{1}{\beta_s} \sum_{t \subseteq s} \mathbb{E} [\bar{E_t} \Psi_s]^2\\ &= \sum_{s \subseteq \mathcal{G}} \frac{1}{\beta_s} \Psi_s^2 , \end{align*}

again using orthogonality of the \bar{E_t}\Psi in the last step. \qquad\square

Monotonicity of Fisher information

We can now finally prove monotonicity of the Fisher information.

Corollary. Let X_i be independent random variables with I(X_i) <\infty. Then

    \[I(X_1+\ldots + X_n) \leq \sum_{s \in \mathcal{G}} \frac{\omega_s^2}{\beta_s} I\bigg(\sum_{i \in s} X_i\bigg)\]

for any hypergraph \mathcal{G} on [n], fractional packing \beta, and positive weights \{\omega_s :s \in \mathcal{G}\} summing to 1.

Proof. Recall that I(X)=\mathrm{Var}(\rho_X(X)) and \rho_X(x)=\frac{f_X'(x)}{f_X(x)}. The identity proved in the last lecture states

    \begin{align*} \rho_{X+Y}(x)&= \mathbb{E}[\rho_X(X)|X+Y=x]\\ \rho_{X+Y}(X+Y)&=\mathbb{E}[\rho_X(X)|X+Y] \end{align*}

With T_s=\sum_{i \in S} X_i, we can write

    \[\rho_{T_{[n]}}(T_{[n]})=\mathbb{E}[\rho_{T_s}(T_s)|T_{[n]}] \text{ }\qquad\forall s \in \mathcal{G}\]

since T_{[n]}=T_s+T_s^C. By taking a convex combination of these identities,

    \begin{align*} \rho_{T_{[n]}}(T_{[n]}) &= \sum_{s \in \mathcal{G}} \omega_s \mathbb{E}[\rho_{T_{s}}(T_s) | T_{[n]}] \\ &= \mathbb{E}\bigg[\sum_{s \in \mathcal{G}} \omega_s \rho_{T_{s}}(T_s) | T_{[n]}\bigg] . \end{align*}

Now by using the Pythagorean inequality (or Jensen’s inequality) and the variance drop lemma, we have

    \begin{align*} I(T_{[n]}) &= \mathbb{E} [\rho^2_{T_{[n]}}(T_{[n]})]\\ &\leq \mathbb{E}\bigg[\bigg(\sum_{s \in \mathcal{G}} \omega_s \rho_{T_s}(T_s)\bigg)^2\bigg]\\ &\leq \sum_{s \in \mathcal{G}} \frac{1}{\beta_s} \omega_s^2 \mathbb{E}[\rho^2_{T_s}(T_s)]\\ &= \sum_{s \in \mathcal{G}} \frac{\omega_s^2}{\beta_s} I(T_s) \end{align*}

as desired. \qquad\square

Remark. The \omega_s being arbitrary weights, we can optimize over them. This gives

    \[\frac{1}{I(T_{[n]})} \geq \sum_{s \in \mathcal{G}} \frac{\beta_s}{I(T_s)}.\]

With \mathcal{G} being all singletons and \beta_s=1 we recover the superadditivity property of I^{-1}. With \mathcal{G} being all sets of size n-1 and \beta_s=\frac{1}{\binom{n-1}{n-2}}=\frac{1}{n-1}, we get

    \[\frac{1}{I(T_{[n]})} \geq \frac{n}{n-1}\frac{1}{I(T_{n-1})} \Rightarrow I\Big(\frac{T_{[n]}}{\sqrt{n}}\Big) \leq I\Big(\frac{T_{[n-1]}}{\sqrt{n-1}}\Big) \Leftrightarrow I(S_n) \leq I(S_{n-1}).\]

Thus we have proved the monotonicity of Fisher information in the central limit theorem.

From Fisher information to entropy

Having proved monotonicity for the CLT written in terms of Fisher information, we now want to show the analogous statement for entropy. The key tool here is the de Bruijn identity.

To formulate this identity, let us introduce some basic quantities. Let X\sim f on \mathbb{R}, and define

    \[X^t = e^{-t}X+\sqrt{1-e^{-2t}}Z\]

where Z \sim \mathcal{N}(0,1). Denote by f_t the density of X^t. The following facts are readily verified for t>0:

  1. f_t>0.
  2. f_t(\cdot) is smooth.
  3. I[f_t] < \infty.
  4. \frac{\partial f_t(x)}{\partial t}= f_t^{\prime \prime}(x)+ \frac{d}{dx} [xf_t(x)] =: (Lf_t)(x).

Observe that X^0=X has density f, and that as t \rightarrow \infty , X^t converges to Z, which has a standard Gaussian distribution. Thus X^t provides an interpolation between the density f and the normal density.

Remark. Let us recall some standard facts from the theory of diffusions. The Ornstein-Uhlenbeck process X(t) is defined by the stochastic differential equation

    \[dX(t)= -X(t) dt + \sqrt{2} dB(t),\]

where B(t) is Brownian motion. This is, like Brownian motion, a Markov process, but the drift term (which always pushes trajectories towards 0) ensures that it has a stationary distribution, unlike Brownian motion. The Markov semigroup associated to this Markov process, namely the semigroup of operators defined on an appropriate domain by

    \[P_t \Psi(x)=\mathbb{E}[\Psi(X(t))|X_0=x] ,\]

has a generator A (defined via A= \lim_{t\rightarrow \infty} \frac{P_t - I}{t}) given by A\Psi(x)=\Psi^{\prime \prime}(x)-x\Psi'(x). The semigroup P_t generated by A governs the evolution of conditional expectations of functions of the process X(t), while the adjoint semigroup generated by L=A^* governs the evolution of the marginal density of X(t). The above expression for \partial f_t/\partial t follows from this remark by noting that X(t) and X^t are the same in distribution; however, it can also be deduced more simply just by writing down the density of X^t explicitly, and using the smoothness of the Gaussian density to verify each part of the claim.

We can now formulate the key identity.

de Bruijn identity. Let g be the density of the standard normal N(0,1).

  1. Differential form:

        \[\frac{d}{dt} D(f_t \| g)=\frac{1}{2} J(f_t),\]

    where J(f)=Var(f) \cdot I(f)-1 is the normalized Fisher information.

  2. Integral form:

        \[D(f \| g)=\frac{1}{2} \int_0^\infty J(f_t) dt .\]

The differential form follows by using the last part of the claim together with integration by parts. The integral form follows from the differential form by the fundamental theorem of calculus, since

    \[D(g \| g)-D(f \| g)=- \frac{1}{2}\int_0^\infty J(f_t) dt ,\]

which yields the desired identity since D(g\|g)=0.

This gives us the desired link between Fisher information and entropy. In the next lecture, we will use this to complete the proof of the entropic central limit theorem.

Lecture by Mokshay Madiman | Scribed by Georgina Hall

13. November 2013 by Ramon van Handel
Categories: Information theoretic methods | Comments Off on Lecture 6. Entropic CLT (3)

Lecture 5. Entropic CLT (2)

The goal of this lecture is to prove monotonicity of Fisher information in the central limit theorem. Next lecture we will connect Fisher information to entropy, completing the proof of the entropic CLT.

Two lemmas about the score function

Recall that for a random variable X with absolutely continuous density f, the score function is defined as \rho(x)=f'(x)/f(x) and the Fisher information is I(X)=\mathbb{E}[\rho^2(X)].

The following lemma was proved in the previous lecture.

Lemma 1. Let X be a random variable with finite Fisher information. Let w:\mathbb{R} \rightarrow \mathbb{R} be measurable and let \eta(u) = \mathbb{E}[w^2(u + X)]. If \eta is bounded on the interval [a,b], then the function g(u) = \mathbb{E} w(u+X) is absolutely continuous on [a,b] with g'(u) = -\mathbb{E}[w(u+X)\rho(X)] a.e.

There is a converse to the above lemma, which gives a useful characterization of the score function.

Lemma 2. Let X be a random variable with density f and let m: \mathbb{R} \rightarrow \mathbb{R} be a measurable function with \mathbb{E}|m(X)|<\infty. Suppose that for every bounded measurable function w:\mathbb{R} \rightarrow \mathbb{R}, the function g(u) = \mathbb{E} w(u+X) is absolutely continuous on \mathbb{R} with g'(u) = -\mathbb{E}[w(u+X)m(X)] a.e. Then there must exist an absolutely continuous version of the density f, and moreover m(X)=\rho(X) a.s.

Proof. Take w(x)=\mathbb{I}_{\{x\leq t\}}. Then


Hence g(u) is in fact continuously differentiable with derivative g'(u)=-f(t-u). On the other hand,

    \[\begin{split} g'(u)&=-\mathbb{E}[w(u+X)m(X)]\\ &=-\int_{-\infty}^{t-u}f(x)m(x)\qquad\mbox{a.e.} \end{split}\]

by our assumption. By continuity of f and the assumption \mathbb{E}|m(X)|<\infty, we must have


for every y\in\mathbb{R}. Hence f(x)m(x)=f'(x) a.e. Since \mathbb{P}[f(X)=0]=0, the proof is complete. \square

Score function of the sum of independent variables

We now show a key property of the score function: the score function of the sum of two independent random variables is a projection.

Proposition. Let X and Y be two independent random variables. Suppose X has finite Fisher information. Then \rho_{X+Y}(X+Y)=\mathbb{E}[\rho_X(X)|X+Y] a.s.

Proof. Let m(u)=\mathbb{E}[\rho_X(X)|X+Y=u]. By Lemma 2, we only need to show that the function g(u) = \mathbb{E} w(u+X+Y) is locally absolutely continuous with g'(u) = -\mathbb{E}[w(u+X+Y)m(X+Y)] a.e. for every bounded measurable function w.

Fix a\in\mathbb{R}. By independence of X and Y, we can apply Lemma 1 to X (conditioned on Y) to obtain


Taking expectation of both sides and applying Fubini, we get



    \[\begin{split}\mathbb{E}[w(t+X+Y)\rho_X(X)]&=\mathbb{E}[\mathbb{E}[w(t+X+Y)\rho_X(X)|X+Y]]\\ &=\mathbb{E}[w(t+X+Y)\mathbb{E}[\rho_X(X)|X+Y]]\\ &=\mathbb{E}[w(t+X+Y)m(X+Y)],\end{split}\]

we arrive at


which implies g'(u) = -\mathbb{E}[w(u+X+Y)m(X+Y)] a.e. \qquad\square

Remark. The well-known interpretation of conditional expectation as a projection means that under the assumption of finite Fisher information (i.e., score functions in L^2(\Omega,\mathcal{F},\mathbb{P})), the score function of the sum is just the projection of the score of a summand onto the closed subspace L^2(\Omega, \sigma\{X+Y\}, \mathbb{P}). This implies directly, by the Pythagorean inequality, that convolution decreases Fisher information: I(X+Y)=\mathbb{E}[\rho_{X+Y}^2(X+Y)]\leq\mathbb{E}[\rho_{X}^2(X)]=I(X). In fact, we can do better, as we will see forthwith.

Monotonicity of FI in the CLT along a subsequence

We now make a first step towards proving monotonicity of the Fisher information in the CLT.

Theorem. Let X and Y be independent random variables both with finite Fisher information. Then


for any \lambda\in\mathbb{R}.

Before going into the proof, we make some remarks.

Remark. Taking \lambda=1, we get I(X+Y)\leq I(X). Hence the above theorem is a significant strengthening of the simple fact that convolution decreases Fisher information.

Remark. By taking \lambda=\frac{I(Y)}{I(X)+I(Y)}, we get the following optimized version of the theorem:


Remark. Let X_1,X_2,\ldots be an i.i.d. sequence of random variables with mean zero and unit variance. Let S_n=\frac{1}{\sqrt{n}}\sum_{i=1}^nX_i. By taking \lambda=\frac{1}{2} in the theorem and using scaling property of Fisher information, it is easy to obtain I(S_{2n})\leq I(S_n). Hence, the above theorem already implies monotonicity of Fisher information along the subsequence of times 1,2,4,8,16,\ldots: that is, I(S_{2^n}) is monotone in n.
However, the theorem is not strong enough to give monotonicity I(S_{n+1})\leq I(S_n) without passing to a subsequence. For example, if we apply the previous remark repeatedly we only get I(S_n)\leq I(S_1), which is not very interesting. To prove full monotonicity of the Fisher information, we will need a strengthening of the above Theorem. But it is instructive to first consider the proof of the simpler case.

Proof. By the projection property of the score function,

    \[\begin{split} &\rho_{X+Y}(X+Y)=\mathbb{E}[\rho_X(X)|X+Y],\\ &\rho_{X+Y}(X+Y)=\mathbb{E}[\rho_Y(Y)|X+Y].\\ \end{split}\]



Applying the conditional Jensen inequality, we obtain,


Taking expectations of both sides and using independence, we obtain

    \[I(X+Y)\leq\mathbb{E}[(\lambda\rho_X(X)+(1-\lambda)\rho_Y(Y))^2] =\lambda^2I(X)+(1-\lambda)^2I(Y)+2\lambda(1-\lambda)\mathbb{E}[\rho_X(X)]\mathbb{E}[\rho_Y(Y)].\]

By Lemma 1, \mathbb{E}[\rho_X(X)]=0. This finishes the proof. \qquad\square

Monotonicity of Fisher information in the CLT

To prove monotonicity of the Fisher information in the CLT (without passing to a subsequence) we need a strengthening of the property of Fisher information given in the previous section.

In the following, we will use the common notation [n]:=\{1,\ldots,n\}.

Definition. Let \mathcal{G} be a collection of non-empty subsets of [n]. Then a collection of non-negative numbers \{\beta_s:s\in\mathcal{G}\} is called a fractional packing for \mathcal{G} if \sum_{s:i\in s}\beta_s\leq 1 for all 1\leq i\leq n.

We can now state the desired strengthening of our earlier theorem.

Theorem. Let \{\beta_s:s\in\mathcal{G}\} be a fractional packing for \mathcal{G} and let X_1,\ldots,X_n be independent random variables. Let \{w_s:s\in\mathcal{G}\} be a collection of real numbers such that \sum_{s\in\mathcal{G}}w_s=1. Then

    \[I(X_1+\ldots+X_n)\leq\sum_{s\in\mathcal{G}}\frac{w_s^2}{\beta_s}I\bigg(\sum_{j\in s}X_j\bigg)\]


    \[\frac{1}{I(X_1+\ldots+X_n)}\geq\sum_{s\in\mathcal{G}}\beta_s\frac{1}{I(\sum_{j\in s}X_j)}.\]

Remark. Suppose that the X_i are identically distributed. Take \mathcal{G}=\{[n]\backslash 1,[n]\backslash 2,\ldots,[n]\backslash n\}. For each s\in\mathcal{G}, define \beta_s=\frac{1}{n-1}. It is easy to check that \{\beta_s:s\in\mathcal{G}\} is a fractional packing of \mathcal{G}. Then


by the above theorem. By the scaling property of Fisher information, this is equivalent to I(S_n)\leq I(S_{n-1}), i.e., monotonicity of the Fisher information. This special case was first proved by Artstein, Ball, Barthe and Naor (2004) with a more complicated proof. The proof we will follow of the more general theorem above is due to Barron and Madiman (2007).

The proof of the above theorem is based on an analysis of variance (ANOVA) type decomposition, which dates back at least to the classic paper of Hoeffding (1948) on U-statistics. To state this decomposition, let X_1,\ldots,X_n be independent random variables, and define the Hilbert space

    \[H:=\{\phi:\mathbb{R}^n \rightarrow \mathbb{R}~~:~~\mathbb{E}[\phi^2(X_1,\ldots,X_n)]<\infty\}\]

with inner product

    \[\langle\phi,\psi\rangle:=\mathbb{E}[\phi(X_1,\ldots,X_n)\psi(X_1,\ldots,X _n)].\]

For every j\in[n], define an operator E_j on H as

    \[(E_j\phi)(x_1,\ldots,x_n)=\mathbb{E}[\phi(x_1,\ldots,x_{j+1},X_j,x_{j-1} ,\ldots,x_n)].\]

Proposition. Each \phi\in H can be decomposed as

    \[\phi=\sum_{T\subseteq [n]}\widetilde{E}_T\phi,\]


    \[\widetilde{E}_T:=\prod_{j\notin T}E_j\prod_{j\in T}(I-E_j)\]


  1. \widetilde{E}_T\widetilde{E}_S=0 if T\neq S;
  2. \langle\widetilde{E}_T\phi,\widetilde{E}_S\psi\rangle=0 if T\neq S;
  3. \widetilde{E}_T\phi does not depend on x_i if i\notin T.

Proof. It is easy to verify that E_j is a projection operator, that is, E_j^2=E_j and \langle E_j\phi,\psi\rangle=\langle\phi,E_j\psi\rangle (self-adjointness). It is also easy to see that E_jE_i=E_iE_j for i\neq j. Hence we have

    \[\phi=\prod_{j=1}^n(E_j+I-E_j)\phi=\sum_{T\subset [n]}\prod_{j\notin T}E_j\prod_{j\in T}(I-E_j)\phi=\sum_{T\subset [n]}\widetilde{E}_T\phi.\]

If T\neq S, choose j_0\in T such that j_0\notin S. Then

    \[\widetilde{E}_T\widetilde{E}_S&=(I-E_{j_0})E_{j_0}\prod_{j\notin T}E_j\prod_{j\in T\backslash\{j_0\}}(I-E_j)\prod_{j\notin S\cup\{j_0\}}E_j\prod_{j\in S}(I-E_j)=0.\]

It is easily verified that \widetilde{E}_T is itself self-adjoint, so that \langle\widetilde{E}_T\phi,\widetilde{E}_S\psi\rangle=0 follows directly. Finally, as by definition E_j\phi does not depend on x_j, it is clear that \widetilde{E}_T\phi does not depend on x_j if j\notin T. \qquad\square

The decomposition will be used in the form of the following variance drop lemma whose proof we postpone to next lecture. Here X_S:=(X_{i_1},\ldots,X_{i_{|S|}}) for S=\{i_1,\ldots,i_{|S|}\}\subseteq [n], i_1<\cdots<i_{|S|}.

Lemma. Let \{\beta_s:s\in\mathcal{G}\} be a fractional packing for \mathcal{G}. Let U=\sum_{s\in\mathcal{G}}\phi_s(X_s). Suppose that \mathbb{E}[\phi_s^2(X_s)]<\infty for each s\in\mathcal{G}. Then \text{Var}[U]\leq\sum_{s\in\mathcal{G}}\frac{1}{\beta_s}\text{Var}[\phi_s(X_s)].

To be continued…

Lecture by Mokshay Madiman | Scribed by Liyao Wang

06. November 2013 by Ramon van Handel
Categories: Information theoretic methods | Comments Off on Lecture 5. Entropic CLT (2)

Next lecture: November 7

A quick reminder that next week is Fall break (no lecture). The next lecture will take place on Thursday, November 7: Mokshay will resume the proof of the entropic central limit theorem.

24. October 2013 by Ramon van Handel
Categories: Announcement | Comments Off on Next lecture: November 7

Lecture 4. Entropic CLT (1)

The subject of the next lectures will be the entropic central limit theorem (entropic CLT) and its proof.

Theorem (Entropic CLT). Let X_1,X_2,\ldots be i.i.d. real-valued random variables with mean zero and unit variance. Let

    \[S_n = \frac{1}{\sqrt{n}}\sum_{i=1}^nX_i.\]

If h(S_n) > -\infty for some n, then h(S_n) \uparrow h(N(0,1)), or equivalently D( \mathcal{L}(S_n)\| N(0,1)) \downarrow 0. That is, the entropy of S_n increases monotonically to that of the standard Gaussian.

Recall that D( \mathcal{L}(X) \| \mathcal{L}(Z)) = h(Z) - h(X) when Z is Gaussian with the same mean and variance as X, which explains the equivalence stated in the theorem.

Let us note that the assumption h(S_n) > -\infty for some n represents a genuine (but not unexpected) restriction: in particular, it implies that the entropic CLT does not apply if X_i are discrete.

Entropy power inequality

Historically, the first result on monotonicity of entropy in the CLT was that h(S_{2n}) \ge h(S_n) for all n. This follows directly from an important inequality for entropy, the entropy power inequality (EPI). The rest of this lecture and part of the next lecture will be devoted to proving the EPI. While the EPI does not suffice to establish the full entropic CLT, the same tools will prove to be crucial later on.

Entropy power inequality. Let X_1 and X_2 be independent real-valued random variables such that h(X_1), h(X_2), and h(X_1 + X_2) all exist. Then

    \[e^{2h(X_1+X_2)} \ge e^{2h(X_1)} + e^{2h(X_2)},\]

with equality if and only if X_1 and X_2 are Gaussian.

Before we embark on the proof, let us make some remarks.

Remark. None of the assumptions about existence of entropies is redundant: it can happen that h(X_1) and h(X_2) exist but h(X_1+X_2) does not.

Remark. If X_1 and X_2 are i.i.d., S_1=X_1, and S_2=(X_1+X_2)/\sqrt{2}, then the EPI implies

    \[2e^{2h(X_1)} \le e^{2h(X_1+X_2)} = e^{2h(\sqrt{2}S_2)} = 2e^{2h(S_2)},\]

which implies h(S_1) \le h(S_2). Here we have used the easy-to-check equality h(aX) = h(X) + \log|a|, which of course implies e^{2h(aX)}=a^2e^{2h(X)}. From this observation, the proof of the claim that h(S_{2n}) \ge h(S_n) is immediate: simply note that S_{2n} is the sum of two independent copies of S_n.

Remark. It is easy to check that h(X_1+X_2) \ge h(X_1+X_2 | X_2) = h(X_1 | X_2) = h(X_1). In fact, this is true in much more general settings (e.g. on locally compact groups, with entropy defined relative to Haar measure). The EPI is a much stronger statement particular to real-valued random variables.

Remark. The EPI admits the following multidimensional extension.

Multidimensional EPI. Let X_1 and X_2 be independent \mathbb{R}^n-valued random vectors such that h(X_1), h(X_2), and h(X_1 + X_2) all exist. Then

    \[e^{2h(X_1+X_2)/n} \ge e^{2h(X_1)/n} + e^{2h(X_2)/n},\]

with equality if and only if X_1 and X_2 are Gaussian with proportional covariance matrices.

Define the entropy power N(X) for an \mathbb{R}^n-valued random vector X by

    \[N(X) := \exp\left(\frac{2h(X)}{n}\right).\]

The EPI says that N is superadditive under convolution.

Digression: EPI and Brunn-Minkowski

A good way to develop an appreciation for what the EPI is saying is in analogy with the Brunn-Minkowski inequality. If A,B \subset \mathbb{R}^n are Borel sets and |\cdot| denotes n-dimensional Lebesgue measure, then

    \[|A + B|^{1/n} \ge |A|^{1/n} + |B|^{1/n},\]

where A+B := \{a + b : a \in A, \ b \in B\} is the Minkowski sum. In particular, note that |A|^{1/n} is proportional up to an absolute constant to the radius of the n-dimensional Euclidean ball whose volume matches that of A. The Brunn-Minkowski inequality expresses superadditivity of this functional (and we clearly have equality for balls). The Brunn-Minkowski inequality is of fundamental importance in various areas of mathematics: for example, it implies the isoperimetric inequality in \mathbb{R}^n, which states that Euclidean balls with volume V have the minimal surface area among all subsets of \mathbb{R}^n with volume V.

In a sense, the EPI is to random variables as the Brunn-Minkowski inequality is to sets. The Gaussians play the role of the balls, and variance corresponds to radius. In one dimension, for example, since

    \[h(N(0,\sigma^2)) = \frac{1}{2}\log(2\pi e\sigma^2) \quad \Rightarrow \quad e^{2h(N(0,\sigma^2))} = 2\pi e\sigma^2,\]

we see that e^{2h(X)} is proportional to the variance of the Gaussian whose entropy matches that of X. The entropy power inequality expresses superadditivity of this functional, with equality for Gaussians.

Proposition. The EPI is equivalent to the following statement: if X_1 and X_2 are independent and Z_1 and Z_2 are independent Gaussians with h(Z_1)=h(X_1) and h(Z_2)=h(X_2), then h(X_1+X_2) \ge h(Z_1 +Z_2) provided that all of the entropies exist.

Proof. Both implications follow from

    \[\exp\left(\frac{2h(X_1)}{n}\right)+ \exp\left(\frac{2h(X_2)}{n}\right) = \exp\left(\frac{2h(Z_1)}{n}\right) + \exp\left(\frac{2h(Z_2)}{n}\right) = \exp\left(\frac{2h(Z_1+Z_2)}{n}\right). \qquad\square\]

Proof of the entropy power inequality

There are many proofs of the EPI. It was stated by Shannon (1948) but first fully proven by Stam (1959); different proofs were later provided by Blachman (1969), Lieb (1978), and many others. We will follow a simplified version of Stam’s proof. We work from now on in the one-dimensional case for simplicity.

Definition (Fisher information). Let X be a \mathbb{R}-valued random variable whose density f is an absolutely continuous function. The score function of X is defined as

    \[\rho(x) = \rho_X(x) := \begin{cases} \frac{f'(x)}{f(x)} & \text{for } x \text{ at which } f(x)>0 \text{ and } f'(x) \text{ exists,} \\ 0 & \text{otherwise.} \end{cases}\]

The Fisher information (FI) of X is defined as

    \[I(X) := \mathbb{E}\left[\rho_X^2(X)\right]. \]\]

Remark. Let \mathcal{P} = \{p_\theta : \theta \in \Theta\} be a parametric statistical model. In statistics, the score function is usually defined by \rho(\theta,x) := \frac{\partial}{\partial \theta}\log p_\theta(x), and the Fisher information by I(\theta) = \mathbb{E}_\theta\left[\rho^2(\theta,X)\right]. This reduces to our definition in the special case of location families, where p_\theta(x) = p(x-\theta) for some probability density p(x): in this case, \rho(\theta,x) = -\frac{\partial}{\partial x}\log p(x-\theta) and we have

    \[I(\theta) = \int p_\theta(x)\left[\frac{\partial}{\partial x}\log p(x-\theta)\right]^2dx = \int p(x)\left[\frac{\partial}{\partial x}\log p(x)\right]^2dx.\]

Thus for location families I(\theta) does not depend on \theta and coincides precisely with the Fisher information I(Z) as we defined it above for a random variable Z with density p(x). The statistical interpretation allows us do derive a useful inequality. Suppose for simplicity that \mathbb{E}Z=0. Then \mathbb{E}_\theta X = \theta for every \theta, so X is an unbiased estimator of \theta. The Cramér-Rao bound therefore implies the inequality

    \[I(Z) \ge \frac{1}{\mathrm{Var}(Z)},\]

with equality if and only if Z is Gaussian. The same conclusion holds when \mathbf{E}Z\in\mathbb{R} is arbitrary, as both FI and variance are invariant under translation.

Remark. There is a Fisher information analogue of the entropic CLT: in the setup of the entropic CLT, subject to an additional variance constraint, we have I(S_n) \downarrow I(N(0,1)) = 1. Moreover, Fisher information is minimized by Gaussians. This is often stated in terms of normalized Fisher information, defined as J(X) := \mathrm{Var}(X)I(X)-1. Note that J is both translation and scale invariant: J(aX)=J(X) and J(X+a)=J(X). We have J(X) \ge 0, with equality if and only if X is Gaussian, by the previous remark. The Fisher information analogue of the entropic CLT can now be restated as J(S_n) \downarrow 0.

The strategy of the proof of the EPI is as follows:

  1. We first prove an inequality for Fisher information:

        \[\frac{1}{I(X_1+X_2)} \ge \frac{1}{I(X_1)} + \frac{1}{I(X_2)}.\]

  2. Develop an integral identity relating I and h.
  3. Combine (1) and (2) to get the EPI.

The reason to concentrate on Fisher information is that I is an L^2-type quantity, as opposed to the entropy which is an L\log L-type quantity. This makes I easier to work with.

We begin with some technical results about Fisher information.

Lemma. If X has absolutely continuous density f and I(X) < \infty, then f has bounded variation. In particular, f is bounded.

Proof. Let D_f denote the set of points at which f is differentiable, so that |D_f^c|=0. Define f'(x) := 0 for x \notin D_f. Let S = \{x \in \mathbb{R} : f(x) > 0\}. Then

    \[\int_\mathbb{R}|f'(x)|dx = \int_S|f'(x)|dx = \int_S|\rho(x)|f(x)dx = \mathbb{E}|\rho(X)| \le \sqrt{I(X)} < \infty. \qquad\square\]

Lemma. Let w : \mathbb{R} \rightarrow \mathbb{R} be measurable. Let \eta : [a,b] \rightarrow [0,\infty] be defined as \eta(u) = \mathbb{E} w^2(u + X). If \eta is bounded, then g : [a,b] \rightarrow \mathbb{R} defined by g(u) = \mathbb{E} w(u+X) is absolutely continuous on (a,b) with g'(u) = -\mathbb{E}\left[w(u+X)\rho(X)\right] a.e.

Proof. Let \tilde{g}(u)=-\mathbb{E}\left[w(u+X)\rho(X)\right]; we want to show that g is absolutely continuous with derivative \tilde{g}. First observe that \tilde{g} is integrable on [a,b] from the Cauchy-Schwarz inequality and the boundedness condition. Thus we can compute

    \begin{eqnarray*}\begin{split} \int_a^b \tilde{g}(u) du  &= - \int_a^b \bigg[ \int w(s) f'(s-u) ds \bigg] du \\ &= -  \int w(s)\bigg[ \int_a^b f'(s-u) du \bigg]  ds \\ &= -  \int w(s)\bigg[ f(s-a) - f(s-b) \bigg]  ds \\ &= \mathbb{E} w(b+X)- \mathbb{E} w(a+X) \\ &= g(b) - g(a) , \end{split}\end{eqnarray*}

which proves the desired result. \qquad\square

Corollary. \mathbb{E}[\rho(X)] = 0 and, if \mathbb{E} X^2 < \infty, then \mathbb{E}[X\rho(X)] = -1.

Proof. Take w \equiv 1 in the lemma above for the first claim. Take w(u)=u for the second. \qquad\square

To be continued…

Lecture by Mokshay Madiman | Scribed by Dan Lacker

23. October 2013 by Ramon van Handel
Categories: Information theoretic methods | Comments Off on Lecture 4. Entropic CLT (1)

Next lecture: one-time room change

The next lecture is on October 17. Please note that we have a one time room change to Sherrerd Hall Room 101 on October 17 as the usual Bendheim room is in use for another event. We will return to the Bendheim center the following week for the rest of the semester.

10. October 2013 by Ramon van Handel
Categories: Announcement | Comments Off on Next lecture: one-time room change

Lecture 3. Sanov’s theorem

The goal of this lecture is to prove one of the most basic results in large deviations theory. Our motivations are threefold:

  1. It is an example of a probabilistic question where entropy naturally appears.
  2. The proof we give uses ideas typical in information theory.
  3. We will need it later to discuss the transportation-information inequalities (if we get there).

What is large deviations?

The best way to start thinking about large deviations is to consider a basic example. Let X_1,X_2,\ldots be i.i.d. random variables with \mathbf{E}[X_1] =0 and \mathbf{E}[X_1^2] < \infty. The law of large numbers states that

    \[\frac{1}{n} \sum_{k=1}^n X_k \xrightarrow{n \rightarrow \infty} 0   \quad \mbox{a.s.}\]

To say something quantitative about the rate of convergence, we need finer limit theorems. For example, the central limit theorem states that

    \[\sqrt{n} \, \bigg\{ \frac{1}{n} \sum_{k=1}^n X_k\bigg\}     \xrightarrow{n \rightarrow \infty} N(0,\sigma^2)    \quad\mbox{in distribution}.\]

Therefore, for any fixed t, we have

    \[\mathbf{P}\bigg[ \frac{1}{n} \sum_{k=1}^n X_k \ge \frac{t}{\sqrt{n}}\bigg]       = O(1)\]

(the probability converges to a value strictly between zero and one). Informally, this implies that the typical size of \frac 1n \sum_{k=1}^n X_k is of order \frac{1}{\sqrt{n}}.

Rather than considering the probability of typical events, large deviations theory allows us to understand the probability of rare events, that is, events whose probabilities are exponentially small. For example, if X_k \sim N(0,1), then the probability that \frac 1n \sum_{k=1}^n X_k is at least of order unity (a rare event, as we have just shown that the typical size is of order \frac{1}{\sqrt{n}}) can be computed as

    \[\mathbf{P}\bigg[ \frac 1n \sum_{k=1}^n X_k > t\bigg] =       \int_{t\sqrt{n}}^{\infty} \frac{e^{-x^2/2}}{\sqrt{2 \pi}} \approx    e^{-nt^2/2}.\]

The probability of this rare event decays exponentially at rate t^2/2. If the random variables X_i have a different distribution, then these tail probabilities still decay exponentially but with a different rate function. The goal of large deviations theory is to compute precisely the rate of decay of the probabilities of such rare events. In the sequel, we will consider a more general version of this problem.

Sanov’s theorem

Let X_1,X_2,\ldots be i.i.d. random variables with values in a finite set \{1,\ldots,r\} and with distribution P (random variables in a continuous space will be considered at the end of this lecture). Denote by \mathcal{P} the set of probabilities on \{1,\ldots,r \}. Let \hat P_n be the empirical distribution of X:

    \[\hat P_n = \frac{1}{n}\sum_{k=1}^n \delta_{X_k}.\]

The law of large numbers states that \hat P_n \rightarrow P a.s. To define a rare event, we fix \Gamma \subseteq \mathcal{P} that does not contain P. We are interested in behavior of probabilities of the form \mathbf{P}[\hat P_n \in \Gamma], as n \rightarrow \infty.

Example. Let f:\{1,\ldots,r\} \rightarrow \mathbb{R} be such that \int f dP=0. Define \Gamma = \{Q\in\mathcal{P} : \int f dQ \ge t \} for some t>0. Then \mathbf{P}[\hat P_n \in \Gamma]=\mathbf{P}[\frac{1}{n}\sum_{k=1}^n f(X_k)\ge t]. Thus the rare events of the type described in the previous section form a special case of the present setting.

We are now in a position to state Sanov’s theorem, which explains precisely at what exponential rate the probabilities \mathbf{P}[\hat P_n \in \Gamma] decay.

Theorem (Sanov). With the above notations, it holds that

    \begin{eqnarray*} -\inf_{Q \in \mathop{\mathrm{int}} \Gamma} D(Q || P) &\le& \liminf_{n \rightarrow \infty} \frac 1n \log \mathbf{P}[\hat P_n \in \Gamma]\\ &\le& \limsup_{n \rightarrow \infty} \frac 1n \log \mathbf{P}[\hat P_n \in \Gamma]\\ &\le&-\inf_{Q \in \mathop{\mathrm{cl}} \Gamma} D(Q || P). \end{eqnarray*}

In particular, for “nice” \Gamma such that the left- and right-hand sides coincide we have the exact rate

    \[\lim_{n \rightarrow \infty} \frac 1n \log \mathbf{P}[\hat P_n \in \Gamma]    = - \inf_{Q \in \Gamma} D(Q || P).\]

In words, Sanov’s theorem states that the exponential rate of decay of the probability of a rare event \Gamma is controlled by the element of \Gamma that is closest to the true distribution P in the sense of relative entropy.

There are many proofs of Sanov’s theorem (see, for example, the excellent text by Dembo and Zeitouni). Here we will utilize an elegant approach that uses a common device in information theory.

Method of types

It is a trivial observation that each possible value \{1,\ldots,r \} must appear an integer number of times among the samples \{X_1,\ldots,X_n\}. This implies, however, that the empirical measure \hat P_n cannot take arbitrary values: evidently it is always the case that \hat P_n \in \mathcal{P}_n, where we define

    \[\mathcal{P}_n = \bigg \{Q \in \mathcal{P} :     Q(\{i\}) = \frac{k_i}{n}, ~ k_i \in \{0,\ldots,n\},~\sum_{i=1}^rk_i=1 \bigg \}.\]

Each element of \mathcal{P}_n is called a type: it contains only the information about how often each value shows up in the sample, discarding the order in which they appear. The key idea behind the proof of Sanov’s theorem is that we can obtain a very good bound on the probability that the empirical measure takes the value \hat P_n=Q for each type Q\in\mathcal{P}_n.

Type theorem. For every Q \in \mathcal{P}_n, we have

    \[\frac{1}{(n+1)^r} e^{- n D(Q || P)} \le \mathbf{P}[\hat P_n = Q]     \le e^{- n D(Q || P)}.\]

That is, up to a polynomial factor, the probability of each type Q\in \mathcal{P}_n behaves like e^{- n D(Q || P)}.

In view of the type theorem, the conclusion of Sanov’s theorem is not surprising. The type theorem implies that types Q such that D(Q||P)\ge \inf_{Q\in\Gamma}D(Q||P)+\varepsilon have exponentially smaller probability than the “optimal” distribution that minimizes the relative entropy in \Gamma. The probability of the rare event \Gamma is therefore controlled by the probability of the most likely type. In other words, we have the following intuition, common in large deviation theory: the probability of a rare event is dominated by the most likely of the unlikely outcomes. The proof of Sanov’s theorem makes this intuition precise.

Proof of Sanov’s theorem.

Upper bound. Note that

    \begin{eqnarray*} \mathbf{P}[\hat P_n \in \Gamma] &=& \sum_{Q\in \Gamma \cap \mathcal{P}_n} \mathbf{P}[\hat P_n = Q] \le \sum_{Q\in \Gamma \cap \mathcal{P}_n} e^{- n D(Q || P)}\\ &\le& |\mathcal{P}_n| \sup_{Q \in \Gamma} e^{- n D(Q || P)}  \le (n+1)^r e^{- n \inf _{Q \in \Gamma} D(Q || P)}. \end{eqnarray*}

This yields

    \[\limsup_{n \rightarrow \infty} \frac 1n \log \mathbf{P}[\hat P_n \in \Gamma] \le -\inf_{Q \in \Gamma} D(Q || P).\]

[Note that in the finite case, by continuity, the infimum over \Gamma equals the infimum over \mathop{\mathrm{cl}}\Gamma as stated in the theorem. The closure becomes more important in the continuous setting.]

Lower bound. Note that \bigcup_{n\ge 1}\mathcal{P}_n is dense in \mathcal{P}. As \mathop{\mathrm{int}} \Gamma is open, we can choose Q_n \in  \mathop{\mathrm{int}} \Gamma \cap \mathcal{P}_n for each n, such that D( Q_n || P) \rightarrow \inf_{Q \in \mathop{\mathrm{int}} \Gamma} D(Q || P). Therefore,

    \begin{eqnarray*} \mathbf{P}[\hat P_n \in \Gamma] &\ge& \mathbf{P}[\hat P_n \in \mathop{\mathrm{int}} \Gamma] = \sum_{Q \in \mathop{\mathrm{int}} \Gamma \cap \mathcal{P}_n} \mathbf{P}[\hat P_n = Q] \ge \mathbf{P}[\hat P_n = Q_n]\\ &\ge& \frac{1}{(n+1)^r} e^{-nD(Q_n || P)}. \end{eqnarray*}

It follows that

    \[\liminf_{n \rightarrow \infty} \frac 1n \log \mathbf{P}[\hat P_n \in \Gamma] \ge -\inf_{Q \in \mathop{\mathrm{int}}\Gamma} D(Q || P).\]

[Note that despite that we are in the finite case, it is essential to consider the interior of \Gamma.] \qquad\square

Of course, all the magic has now shifted to the type theorem itself: why are the probabilities of the types controlled by relative entropy? We will presently see that relative entropy arises naturally in the proof.

Proof of the type theorem. Let us define

    \[T(Q) = \bigg\{(x_1,\ldots,x_n) \in \{1,\ldots,r\}^n : \frac 1n \sum_{k=1}^n\delta_{x_k}=Q \bigg\}.\]

Then we can write

    \begin{eqnarray*} \mathbf{P}[\hat P_n = Q] &=& \mathbf{P}[(X_1,\ldots,X_n) \in T(Q)] = \sum_{x \in T(Q)} P(\{x_1\}) \ldots P(\{x_n\})\\ &=& \sum_{x \in T(Q)} \prod_{i=1}^r P(i)^{n Q(i)} = e^{n \sum_{i=1}^r Q(i) \log  P(i)} \, |T(Q)| \\ &=& e^{-n \{D(Q || P) + H(Q)\}} |T(Q)|. \end{eqnarray*}

It is therefore sufficient to prove that for every Q \in \mathcal{P}_n

    \[\frac{1}{(n+1)^r} e^{n H(Q)} \le |T(Q)| \le e^{n H(Q)}.\]

To show this, the key idea is to utilize precisely the same expression for \mathbf{P}[\hat P_n = Q] given above, for the case that the distribution P that defined the empirical measure \hat P_n is replaced by Q (which is a type). To this end, let us denote by \hat Q_n the empirical measure of n i.i.d. random variables with distribution Q.

Upper bound. We simply estimate using the above expression

    \[1 \ge \mathbf{P}[\hat Q_n = Q] = e^{-n H(Q)} |T(Q)|.\]

Lower bound. It seems intuitively plausible that \mathbf{P}[\hat Q_n = Q] \ge \mathbf{P}[\hat Q_n = R] for every Q,R \in \mathcal{P}_n, that is, the probability of the empirical distribution is maximized at the true distribution (“what else could it be?” We will prove it below.) Assuming this fact, we simply estimate

    \[1 = \sum_{R \in \mathcal{P}_n} \mathbf{P}[\hat Q_n = R] \le (n+1)^r \mathbf{P}[\hat Q_n = Q] = (n+1)^r e^{-n H(Q)} |T(Q)|.\]

Proof of the claim. It remains to prove the above claim that \mathbf{P}[\hat Q_n = Q] \ge \mathbf{P}[\hat Q_n = R] for every Q,R \in \mathcal{P}_n. To this end, note that T(Q) consists of all vectors x such that nQ(1) of the entries take the value 1, nQ(2) of the entries take the value 2, etc. The number of such vectors is

    \[T(Q) = \frac{n!}{(nQ(1))! \ldots (nQ(r))! }.\]

It is now straightforward to estimate

    \begin{eqnarray*}\frac{\mathbf{P}[\hat Q_n = Q]}{\mathbf{P}[\hat Q_n = R]} &=& \prod_{i=1}^r \frac{(nR(i))!}{(nQ(i))!} Q(i)^{n(Q(i)-R(i))}\\ &\ge& \prod_{i=1}^r (n Q(i))^{n(R(i)-Q(i))} Q(i)^{n(Q(i)-R(i))} \quad \text{using} \quad \frac{n!}{m!} \ge m^{n-m}\\ &=&\prod_{i=1}^r n^{n(R(i)-Q(i))} = n^{n \sum_{i=1}^r (R(i)-Q(i))} =1. \end{eqnarray*}

Thus the claim is established. \qquad\square

Remark. It is a nice exercise to work out the explicit form of the rate function \inf_{Q \in \Gamma} D(Q || P) in the example \Gamma = \{Q\in\mathcal{P} : \int f dQ \ge t \} considered at the beginning of this lecture. The resulting expression yields another basic result in large deviations theory, which is known as Cramèr’s theorem.

General form of Sanov’s theorem

The drawback to the method of types is that it relies heavily on the assumption that X_i take values in a finite state space. In fact, Sanov’s theorem continues to hold in a much more general setting.

Let \mathbb{X} be a Polish space (think \mathbb{R}^n), and let X_1,X_2,\ldots be i.i.d. random variables taking values in \mathbb{X} with distribution P. Denote by \mathcal{P} the space of probability measures on \mathbb{X} endowed with the topology of weak convergence: that is, P_n\to P iff \int f dP_n\to \int f dP for every bounded continuous function f. Now that we have specified the topology, it makes sense to speak of “open” and “closed” subsets of \mathcal{P}.

Theorem. In the present setting, Sanov’s theorem holds verbatim as stated above.

It turns out that the lower bound in the general Sanov theorem can be easily deduced from the finite state space version. The upper bound can also be deduced, but this is much more tricky (see this note) and a direct proof in the continuous setting using entirely different methods is more natural. [There is in fact a simple information-theoretic proof of the upper bound that is however restricted to sets \Gamma that are sufficiently convex, which is an unnecessary restriction; see this classic paper by Csiszar.]

We will need the general form of Sanov’s theorem in the development of transportation-information inequalities. Fortunately, however, we will only need the lower bound. We will therefore be content to deduce the general lower bound from the finite state space version that we proved above.

Proof of the lower bound. It evidently suffices to consider the case that \Gamma is an open set. We use the following topological fact whose proof will be given below: if \Gamma \subseteq \mathcal{P} is open and R\in\Gamma, then there is a finite (measurable) partition \{A_1,\ldots,A_r\} of \mathbb{X} and \varepsilon>0 such that

    \[\tilde \Gamma = \{Q\in\mathcal{P} : |Q(A_i)-R(A_i)|<\varepsilon~ \forall\, i=1,\ldots,r \} \subseteq \Gamma.\]

Given such a set, the idea is now to reduce to the discrete case using the data processing inequality.

Define the function T:\mathbb{X}\to\{1,\ldots,r\} such that T(x)=i for x\in A_i. Then \hat P_n \in \tilde \Gamma if and only if the empirical measure \hat P_n^\circ of T(X_1),\ldots,T(X_n) lies in \Gamma^\circ=\{Q:|Q(\{i\})-R(A_i)|<\varepsilon~\forall\,i=1,\ldots,r\}. Thus

    \[\mathbf{P}[\hat P_n \in \Gamma] \ge \mathbf{P}[\hat P_n \in \tilde \Gamma] = \mathbf{P}[\hat P_n^\circ \in \Gamma^\circ].\]

As T(X_i) take values in a finite set, and as \Gamma^\circ is open, we obtain from the finite Sanov theorem

    \[\liminf_{n \rightarrow \infty} \frac{1}{n} \log \mathbf{P}[\hat P_n \in \Gamma] \ge -\inf_{Q \in \Gamma^\circ} D(Q || PT^{-1}) = -\inf_{Q \in \tilde \Gamma} D(QT^{-1} || PT^{-1}) \ge - D(R||P),\]

where we have used the data processing inequality and R\in\tilde\Gamma in the last inequality. As R\in\Gamma was arbitrary, taking the supremum over R\in\Gamma completes the proof. \qquad\square

Proof of the topological fact. Sets of the form

    \[\bigg\{Q\in\mathcal{P} : \bigg|\int f_id Q-\int f_i dR\bigg|   <\alpha~ \forall\, i=1,\ldots,k \bigg\}\]

for R\in\mathcal{P}, k<\infty, f_1,\ldots,f_k bounded continuous functions, and \alpha>0 form a base for the weak convergence topology on \mathcal{P}. Thus any open subset \Gamma\subseteq\mathcal{P} must contain a set of this form for every R\in\Gamma (think of the analogous statement in \mathbb{R}^n: any open set B\subseteq\mathbb{R}^n must contain a ball around any x\in B).

It is now easy to see that each set of this form must contain a set of the form used in the above proof of the lower bound in Sanov’s theorem. Indeed, as f_i is a bounded function, we can find for each i a simple function \tilde f_i such that \|\tilde f_i-f_i\|_\infty\le\alpha/3. Clearly |\int \tilde f_i dQ-\int\tilde f_i dR|<\alpha/3 implies |\int f_i dQ-\int f_i dR|<\alpha, so we can replace the functions f_i by simple functions. But then forming the partition \{A_1,\ldots,A_r\} generated by the sets that define these simple functions, it is evident that if \varepsilon>0 is chosen sufficiently small, then |Q(A_j)-R(A_j)|<\varepsilon for all j implies |\int \tilde f_i dQ-\int\tilde f_i dR|<\alpha/3. The proof is complete. \square

Remark. It is also possible to work with topologies different than the topology of weak convergence. See, for example, the text by Dembo and Zeitouni for further discussion.

Lecture by Ramon van Handel | Scribed by Quentin Berthet

10. October 2013 by Ramon van Handel
Categories: Information theoretic methods | Comments Off on Lecture 3. Sanov’s theorem

Next lecture: October 17

Next week (October 10) there will be no stochastic analysis seminar due to conflicting events:

03. October 2013 by Ramon van Handel
Categories: Announcement | Comments Off on Next lecture: October 17

← Older posts

Newer posts →