Lecture 3. Sanov’s theorem

The goal of this lecture is to prove one of the most basic results in large deviations theory. Our motivations are threefold:

  1. It is an example of a probabilistic question where entropy naturally appears.
  2. The proof we give uses ideas typical in information theory.
  3. We will need it later to discuss the transportation-information inequalities (if we get there).

What is large deviations?

The best way to start thinking about large deviations is to consider a basic example. Let X_1,X_2,\ldots be i.i.d. random variables with \mathbf{E}[X_1] =0 and \mathbf{E}[X_1^2] < \infty. The law of large numbers states that

    \[\frac{1}{n} \sum_{k=1}^n X_k \xrightarrow{n \rightarrow \infty} 0   \quad \mbox{a.s.}\]

To say something quantitative about the rate of convergence, we need finer limit theorems. For example, the central limit theorem states that

    \[\sqrt{n} \, \bigg\{ \frac{1}{n} \sum_{k=1}^n X_k\bigg\}     \xrightarrow{n \rightarrow \infty} N(0,\sigma^2)    \quad\mbox{in distribution}.\]

Therefore, for any fixed t, we have

    \[\mathbf{P}\bigg[ \frac{1}{n} \sum_{k=1}^n X_k \ge \frac{t}{\sqrt{n}}\bigg]       \xrightarrow{n \rightarrow \infty} \mathbf{P}[Z \ge t], \qquad Z \sim N(0,\sigma^2)\]

(the probability converges to a value strictly between zero and one). Informally, this implies that the typical size of \frac 1n \sum_{k=1}^n X_k is of order \frac{1}{\sqrt{n}}.
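This scaling is easy to observe numerically. The following sketch (our own illustration with Uniform(-1,1) variables, not part of the lecture) checks that quadrupling n roughly halves the typical size of the sample mean.

```python
# Sketch: the typical size of the sample mean shrinks like 1/sqrt(n).
# X_k ~ Uniform(-1, 1), so E[X_1] = 0 and E[X_1^2] < infinity.
import random
import statistics

random.seed(0)

def typical_mean_size(n, trials=2000):
    """Average |(1/n) sum X_k| over many independent trials."""
    sizes = []
    for _ in range(trials):
        mean = sum(random.uniform(-1, 1) for _ in range(n)) / n
        sizes.append(abs(mean))
    return statistics.fmean(sizes)

# Quadrupling n should roughly halve the typical size (1/sqrt(n) scaling).
ratio = typical_mean_size(100) / typical_mean_size(400)
print(ratio)  # close to 2
```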

Rather than considering the probability of typical events, large deviations theory allows us to understand the probability of rare events, that is, events whose probabilities are exponentially small. For example, if X_k \sim N(0,1), then the probability that \frac 1n \sum_{k=1}^n X_k is at least of order unity (a rare event, as we have just shown that the typical size is of order \frac{1}{\sqrt{n}}) can be computed as

    \[\mathbf{P}\bigg[ \frac 1n \sum_{k=1}^n X_k > t\bigg] =       \int_{t\sqrt{n}}^{\infty} \frac{e^{-x^2/2}}{\sqrt{2 \pi}}\,dx \approx    e^{-nt^2/2}.\]

The probability of this rare event decays exponentially at rate t^2/2. If the random variables X_i have a different distribution, then these tail probabilities still decay exponentially but with a different rate function. The goal of large deviations theory is to compute precisely the rate of decay of the probabilities of such rare events. In the sequel, we will consider a more general version of this problem.
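In the Gaussian case this rate can be checked directly, since \frac 1n \sum_{k=1}^n X_k \sim N(0,1/n). The sketch below (our own check) evaluates the exact tail probability via the complementary error function and confirms that \frac 1n \log of it approaches -t^2/2.

```python
# Sketch: exact Gaussian tail check of the large deviations rate t^2/2.
# (1/n) sum X_k ~ N(0, 1/n), so P[mean > t] = P[N(0,1) > t*sqrt(n)].
import math

def log_tail_rate(t, n):
    """(1/n) log P[(1/n) sum X_k > t] for X_k ~ N(0,1), computed exactly."""
    tail = 0.5 * math.erfc(t * math.sqrt(n) / math.sqrt(2))
    return math.log(tail) / n

for n in (10, 100, 1000):
    print(n, log_tail_rate(1.0, n))  # approaches -t^2/2 = -0.5
```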

Sanov’s theorem

Let X_1,X_2,\ldots be i.i.d. random variables with values in a finite set \{1,\ldots,r\} and with distribution P (random variables in a continuous space will be considered at the end of this lecture). Denote by \mathcal{P} the set of probability measures on \{1,\ldots,r \}. Let \hat P_n be the empirical distribution of X_1,\ldots,X_n:

    \[\hat P_n = \frac{1}{n}\sum_{k=1}^n \delta_{X_k}.\]

The law of large numbers states that \hat P_n \rightarrow P a.s. To define a rare event, we fix \Gamma \subseteq \mathcal{P} that does not contain P. We are interested in the behavior of the probabilities \mathbf{P}[\hat P_n \in \Gamma] as n \rightarrow \infty.
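As a quick illustration of the setup (a sketch with our own choice of P, not from the lecture), the empirical distribution of a large i.i.d. sample is close to P:

```python
# Sketch: empirical distribution \hat P_n of an i.i.d. sample from P on {1,2,3}.
import random
from collections import Counter

random.seed(1)
P = {1: 0.5, 2: 0.3, 3: 0.2}
n = 100_000
sample = random.choices(list(P), weights=list(P.values()), k=n)
P_hat = {i: c / n for i, c in sorted(Counter(sample).items())}
print(P_hat)  # each frequency is close to P, per the law of large numbers
```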

Example. Let f:\{1,\ldots,r\} \rightarrow \mathbb{R} be such that \int f dP=0. Define \Gamma = \{Q\in\mathcal{P} : \int f dQ \ge t \} for some t>0. Then \mathbf{P}[\hat P_n \in \Gamma]=\mathbf{P}[\frac{1}{n}\sum_{k=1}^n f(X_k)\ge t]. Thus the rare events of the type described in the previous section form a special case of the present setting.

We are now in a position to state Sanov’s theorem, which explains precisely at what exponential rate the probabilities \mathbf{P}[\hat P_n \in \Gamma] decay.

Theorem (Sanov). With the above notations, it holds that

    \begin{eqnarray*} -\inf_{Q \in \mathop{\mathrm{int}} \Gamma} D(Q || P) &\le& \liminf_{n \rightarrow \infty} \frac 1n \log \mathbf{P}[\hat P_n \in \Gamma]\\ &\le& \limsup_{n \rightarrow \infty} \frac 1n \log \mathbf{P}[\hat P_n \in \Gamma]\\ &\le&-\inf_{Q \in \mathop{\mathrm{cl}} \Gamma} D(Q || P). \end{eqnarray*}

In particular, for “nice” \Gamma such that the left- and right-hand sides coincide we have the exact rate

    \[\lim_{n \rightarrow \infty} \frac 1n \log \mathbf{P}[\hat P_n \in \Gamma]    = - \inf_{Q \in \Gamma} D(Q || P).\]

In words, Sanov’s theorem states that the exponential rate of decay of the probability of a rare event \Gamma is controlled by the element of \Gamma that is closest to the true distribution P in the sense of relative entropy.
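The theorem can be seen concretely in a two-point example (our own illustration; the distributions and thresholds below are a sketch, not from the lecture). Take P uniform on \{1,2\} and \Gamma = \{Q : Q(\{2\}) \ge 0.7\}; then \mathbf{P}[\hat P_n \in \Gamma] is a binomial tail, and \frac 1n \log of it slowly approaches -\inf_{Q\in\Gamma} D(Q||P), attained at the boundary point Q^*(\{2\}) = 0.7.

```python
# Sketch: Sanov's theorem for P uniform on {1,2} and Gamma = {Q : Q({2}) >= 0.7}.
# P[\hat P_n in Gamma] = P[Bin(n, 1/2) >= 0.7 n], computed in log-space.
import math

def log_binom_tail(n, q):
    """log P[Bin(n, 1/2) >= q*n], via log-sum-exp to avoid underflow."""
    lo = math.ceil(q * n)
    logs = [math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            - n * math.log(2) for k in range(lo, n + 1)]
    m = max(logs)
    return m + math.log(sum(math.exp(L - m) for L in logs))

def D(q, p=0.5):
    """Relative entropy between two-point distributions with Q({2})=q, P({2})=p."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

rate = D(0.7)  # the minimizer of D(.||P) over Gamma sits on the boundary
for n in (10, 100, 1000, 10000):
    print(n, log_binom_tail(n, 0.7) / n)  # approaches -rate
```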

There are many proofs of Sanov’s theorem (see, for example, the excellent text by Dembo and Zeitouni). Here we will utilize an elegant approach that uses a common device in information theory.

Method of types

It is a trivial observation that each possible value in \{1,\ldots,r \} must appear an integer number of times among the samples \{X_1,\ldots,X_n\}. This implies, however, that the empirical measure \hat P_n cannot take arbitrary values: evidently it is always the case that \hat P_n \in \mathcal{P}_n, where we define

    \[\mathcal{P}_n = \bigg \{Q \in \mathcal{P} :     Q(\{i\}) = \frac{k_i}{n}, ~ k_i \in \{0,\ldots,n\},~\sum_{i=1}^rk_i=n \bigg \}.\]

Each element of \mathcal{P}_n is called a type: it contains only the information about how often each value shows up in the sample, discarding the order in which they appear. The key idea behind the proof of Sanov’s theorem is that we can obtain a very good bound on the probability that the empirical measure takes the value \hat P_n=Q for each type Q\in\mathcal{P}_n.
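For small n and r the set of types can be enumerated directly; the sketch below also checks the crude bound |\mathcal{P}_n| \le (n+1)^r that is used repeatedly in the proofs that follow.

```python
# Sketch: enumerate the types P_n for alphabet size r and sample size n.
from itertools import product
from math import comb

def types(n, r):
    """All count vectors (k_1,...,k_r) with k_i in {0,...,n}, sum k_i = n."""
    return [k for k in product(range(n + 1), repeat=r) if sum(k) == n]

n, r = 10, 3
P_n = types(n, r)
print(len(P_n))                      # C(n+r-1, r-1) = C(12, 2) = 66
assert len(P_n) == comb(n + r - 1, r - 1)
assert len(P_n) <= (n + 1) ** r      # crude polynomial bound used in the proofs
```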

Type theorem. For every Q \in \mathcal{P}_n, we have

    \[\frac{1}{(n+1)^r} e^{- n D(Q || P)} \le \mathbf{P}[\hat P_n = Q]     \le e^{- n D(Q || P)}.\]

That is, up to a polynomial factor, the probability of each type Q\in \mathcal{P}_n behaves like e^{- n D(Q || P)}.
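The type theorem is easy to verify exhaustively for small n, since \mathbf{P}[\hat P_n = Q] is an explicit multinomial probability. A sketch (our own check, with alphabet and P chosen arbitrarily):

```python
# Sketch: check (n+1)^{-r} e^{-nD(Q||P)} <= P[\hat P_n = Q] <= e^{-nD(Q||P)}
# for every type Q, with P[\hat P_n = Q] computed as a multinomial probability.
import math
from itertools import product

def type_prob(P, k):
    """P[\hat P_n = Q] where Q is the type with counts k, sampling from P."""
    coeff = math.factorial(sum(k))
    for ki in k:
        coeff //= math.factorial(ki)
    return coeff * math.prod(p ** ki for p, ki in zip(P, k))

def rel_entropy(Q, P):
    return sum(q * math.log(q / p) for q, p in zip(Q, P) if q > 0)

P, n, r = [0.5, 0.3, 0.2], 12, 3
for k in (k for k in product(range(n + 1), repeat=r) if sum(k) == n):
    Q = [ki / n for ki in k]
    bound = math.exp(-n * rel_entropy(Q, P))
    assert bound / (n + 1) ** r <= type_prob(P, k) <= bound * (1 + 1e-9)
print("type theorem bounds verified for all types")
```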

In view of the type theorem, the conclusion of Sanov’s theorem is not surprising. The type theorem implies that types Q such that D(Q||P)\ge \inf_{Q\in\Gamma}D(Q||P)+\varepsilon have exponentially smaller probability than the “optimal” distribution that minimizes the relative entropy in \Gamma. The probability of the rare event \Gamma is therefore controlled by the probability of the most likely type. In other words, we have the following intuition, common in large deviation theory: the probability of a rare event is dominated by the most likely of the unlikely outcomes. The proof of Sanov’s theorem makes this intuition precise.

Proof of Sanov’s theorem.

Upper bound. Note that

    \begin{eqnarray*} \mathbf{P}[\hat P_n \in \Gamma] &=& \sum_{Q\in \Gamma \cap \mathcal{P}_n} \mathbf{P}[\hat P_n = Q] \le \sum_{Q\in \Gamma \cap \mathcal{P}_n} e^{- n D(Q || P)}\\ &\le& |\mathcal{P}_n| \sup_{Q \in \Gamma} e^{- n D(Q || P)}  \le (n+1)^r e^{- n \inf _{Q \in \Gamma} D(Q || P)}. \end{eqnarray*}

This yields

    \[\limsup_{n \rightarrow \infty} \frac 1n \log \mathbf{P}[\hat P_n \in \Gamma] \le -\inf_{Q \in \Gamma} D(Q || P).\]

[Note that since \Gamma \subseteq \mathop{\mathrm{cl}}\Gamma, we have \inf_{Q\in\Gamma} D(Q||P) \ge \inf_{Q\in\mathop{\mathrm{cl}}\Gamma} D(Q||P), so this bound implies the upper bound stated in the theorem. The closure becomes important in the continuous setting.]

Lower bound. Note that \bigcup_{n\ge 1}\mathcal{P}_n is dense in \mathcal{P}. As \mathop{\mathrm{int}} \Gamma is open, we can choose, for every sufficiently large n, Q_n \in  \mathop{\mathrm{int}} \Gamma \cap \mathcal{P}_n such that D( Q_n || P) \rightarrow \inf_{Q \in \mathop{\mathrm{int}} \Gamma} D(Q || P). Therefore,

    \begin{eqnarray*} \mathbf{P}[\hat P_n \in \Gamma] &\ge& \mathbf{P}[\hat P_n \in \mathop{\mathrm{int}} \Gamma] = \sum_{Q \in \mathop{\mathrm{int}} \Gamma \cap \mathcal{P}_n} \mathbf{P}[\hat P_n = Q] \ge \mathbf{P}[\hat P_n = Q_n]\\ &\ge& \frac{1}{(n+1)^r} e^{-nD(Q_n || P)}. \end{eqnarray*}

It follows that

    \[\liminf_{n \rightarrow \infty} \frac 1n \log \mathbf{P}[\hat P_n \in \Gamma] \ge -\inf_{Q \in \mathop{\mathrm{int}}\Gamma} D(Q || P).\]

[Note that even though we are in the finite case, it is essential to consider the interior of \Gamma.] \qquad\square

Of course, all the magic has now shifted to the type theorem itself: why are the probabilities of the types controlled by relative entropy? We will presently see that relative entropy arises naturally in the proof.

Proof of the type theorem. Let us define

    \[T(Q) = \bigg\{(x_1,\ldots,x_n) \in \{1,\ldots,r\}^n : \frac 1n \sum_{k=1}^n\delta_{x_k}=Q \bigg\}.\]

Then we can write

    \begin{eqnarray*} \mathbf{P}[\hat P_n = Q] &=& \mathbf{P}[(X_1,\ldots,X_n) \in T(Q)] = \sum_{x \in T(Q)} P(\{x_1\}) \ldots P(\{x_n\})\\ &=& \sum_{x \in T(Q)} \prod_{i=1}^r P(i)^{n Q(i)} = e^{n \sum_{i=1}^r Q(i) \log  P(i)} \, |T(Q)| \\ &=& e^{-n \{D(Q || P) + H(Q)\}} |T(Q)|. \end{eqnarray*}

It is therefore sufficient to prove that for every Q \in \mathcal{P}_n

    \[\frac{1}{(n+1)^r} e^{n H(Q)} \le |T(Q)| \le e^{n H(Q)}.\]
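Since |T(Q)| is just a multinomial coefficient, these entropy bounds are easy to check numerically; a quick sketch with an arbitrarily chosen type:

```python
# Sketch: check (n+1)^{-r} e^{nH(Q)} <= |T(Q)| <= e^{nH(Q)} for one type,
# where |T(Q)| = n!/((nQ(1))! ... (nQ(r))!) counts sequences of type Q.
import math

def entropy(Q):
    return -sum(q * math.log(q) for q in Q if q > 0)

def num_sequences(k):
    """|T(Q)| for the type with counts k: a multinomial coefficient."""
    c = math.factorial(sum(k))
    for ki in k:
        c //= math.factorial(ki)
    return c

k = (15, 10, 5)              # counts, so n = 30 and r = 3
n, r = sum(k), len(k)
Q = [ki / n for ki in k]
count = num_sequences(k)
print(count, math.exp(n * entropy(Q)))
assert math.exp(n * entropy(Q)) / (n + 1) ** r <= count <= math.exp(n * entropy(Q))
```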

To show this, the key idea is to apply precisely the same expression for \mathbf{P}[\hat P_n = Q] as above, but with the sampling distribution P that defined the empirical measure \hat P_n replaced by the type Q itself. To this end, let us denote by \hat Q_n the empirical measure of n i.i.d. random variables with distribution Q.

Upper bound. We simply estimate using the above expression

    \[1 \ge \mathbf{P}[\hat Q_n = Q] = e^{-n H(Q)} |T(Q)|.\]

Lower bound. It seems intuitively plausible that \mathbf{P}[\hat Q_n = Q] \ge \mathbf{P}[\hat Q_n = R] for every Q,R \in \mathcal{P}_n, that is, the probability of the empirical distribution is maximized at the true distribution (“what else could it be?”); we will prove this below. Assuming this fact, we simply estimate

    \[1 = \sum_{R \in \mathcal{P}_n} \mathbf{P}[\hat Q_n = R] \le (n+1)^r \mathbf{P}[\hat Q_n = Q] = (n+1)^r e^{-n H(Q)} |T(Q)|.\]

Proof of the claim. It remains to prove the above claim that \mathbf{P}[\hat Q_n = Q] \ge \mathbf{P}[\hat Q_n = R] for every Q,R \in \mathcal{P}_n. To this end, note that T(Q) consists of all vectors x such that nQ(1) of the entries take the value 1, nQ(2) of the entries take the value 2, etc. The number of such vectors is

    \[|T(Q)| = \frac{n!}{(nQ(1))! \ldots (nQ(r))! }.\]

It is now straightforward to estimate

    \begin{eqnarray*}\frac{\mathbf{P}[\hat Q_n = Q]}{\mathbf{P}[\hat Q_n = R]} &=& \prod_{i=1}^r \frac{(nR(i))!}{(nQ(i))!} Q(i)^{n(Q(i)-R(i))}\\ &\ge& \prod_{i=1}^r (n Q(i))^{n(R(i)-Q(i))} Q(i)^{n(Q(i)-R(i))} \quad \text{using} \quad \frac{n!}{m!} \ge m^{n-m}\\ &=&\prod_{i=1}^r n^{n(R(i)-Q(i))} = n^{n \sum_{i=1}^r (R(i)-Q(i))} =1. \end{eqnarray*}

Thus the claim is established. \qquad\square

Remark. It is a nice exercise to work out the explicit form of the rate function \inf_{Q \in \Gamma} D(Q || P) in the example \Gamma = \{Q\in\mathcal{P} : \int f dQ \ge t \} considered at the beginning of this lecture. The resulting expression yields another basic result in large deviations theory, which is known as Cramér’s theorem.

General form of Sanov’s theorem

The drawback of the method of types is that it relies heavily on the assumption that the X_i take values in a finite state space. In fact, Sanov’s theorem continues to hold in a much more general setting.

Let \mathbb{X} be a Polish space (think \mathbb{R}^n), and let X_1,X_2,\ldots be i.i.d. random variables taking values in \mathbb{X} with distribution P. Denote by \mathcal{P} the space of probability measures on \mathbb{X} endowed with the topology of weak convergence: that is, P_n\to P iff \int f dP_n\to \int f dP for every bounded continuous function f. Now that we have specified the topology, it makes sense to speak of “open” and “closed” subsets of \mathcal{P}.

Theorem. In the present setting, Sanov’s theorem holds verbatim as stated above.

It turns out that the lower bound in the general Sanov theorem can be easily deduced from the finite state space version. The upper bound can also be deduced, but this is much trickier (see this note), and a direct proof in the continuous setting using entirely different methods is more natural. [There is in fact a simple information-theoretic proof of the upper bound, but it is restricted to sets \Gamma that are sufficiently convex, which is an unnecessary restriction; see this classic paper by Csiszár.]

We will need the general form of Sanov’s theorem in the development of transportation-information inequalities. Fortunately, however, we will only need the lower bound. We will therefore be content to deduce the general lower bound from the finite state space version that we proved above.

Proof of the lower bound. It evidently suffices to consider the case that \Gamma is an open set. We use the following topological fact whose proof will be given below: if \Gamma \subseteq \mathcal{P} is open and R\in\Gamma, then there is a finite (measurable) partition \{A_1,\ldots,A_r\} of \mathbb{X} and \varepsilon>0 such that

    \[\tilde \Gamma = \{Q\in\mathcal{P} : |Q(A_i)-R(A_i)|<\varepsilon~ \forall\, i=1,\ldots,r \} \subseteq \Gamma.\]

Given such a set, the idea is now to reduce to the discrete case using the data processing inequality.

Define the function T:\mathbb{X}\to\{1,\ldots,r\} such that T(x)=i for x\in A_i. Then \hat P_n \in \tilde \Gamma if and only if the empirical measure \hat P_n^\circ of T(X_1),\ldots,T(X_n) lies in \Gamma^\circ=\{Q:|Q(\{i\})-R(A_i)|<\varepsilon~\forall\,i=1,\ldots,r\}. Thus

    \[\mathbf{P}[\hat P_n \in \Gamma] \ge \mathbf{P}[\hat P_n \in \tilde \Gamma] = \mathbf{P}[\hat P_n^\circ \in \Gamma^\circ].\]

As T(X_i) take values in a finite set, and as \Gamma^\circ is open, we obtain from the finite Sanov theorem

    \[\liminf_{n \rightarrow \infty} \frac{1}{n} \log \mathbf{P}[\hat P_n \in \Gamma] \ge -\inf_{Q \in \Gamma^\circ} D(Q || PT^{-1}) = -\inf_{Q \in \tilde \Gamma} D(QT^{-1} || PT^{-1}) \ge - D(R||P),\]

where we have used the data processing inequality and R\in\tilde\Gamma in the last inequality. As R\in\Gamma was arbitrary, taking the supremum over R\in\Gamma completes the proof. \qquad\square
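The data processing step is the heart of this argument: merging states through the map T can only decrease relative entropy. A small numerical sketch (with our own choice of distributions and merging map):

```python
# Sketch: data processing inequality D(QT^{-1} || PT^{-1}) <= D(Q || P)
# for a map T that merges states of a finite space.
import math

def rel_entropy(Q, P):
    return sum(Q[x] * math.log(Q[x] / P[x]) for x in Q if Q[x] > 0)

P = {1: 0.25, 2: 0.25, 3: 0.5}
Q = {1: 0.1, 2: 0.4, 3: 0.5}
T = {1: 'a', 2: 'a', 3: 'b'}  # merge states 1 and 2 into a single state

def pushforward(R):
    """The image measure RT^{-1} on {'a', 'b'}."""
    out = {}
    for x, mass in R.items():
        out[T[x]] = out.get(T[x], 0.0) + mass
    return out

coarse = rel_entropy(pushforward(Q), pushforward(P))
fine = rel_entropy(Q, P)
print(coarse, fine)  # coarse-grained relative entropy is never larger
```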

Proof of the topological fact. Sets of the form

    \[\bigg\{Q\in\mathcal{P} : \bigg|\int f_id Q-\int f_i dR\bigg|   <\alpha~ \forall\, i=1,\ldots,k \bigg\}\]

for R\in\mathcal{P}, k<\infty, f_1,\ldots,f_k bounded continuous functions, and \alpha>0 form a base for the weak convergence topology on \mathcal{P}. Thus any open subset \Gamma\subseteq\mathcal{P} must contain a set of this form for every R\in\Gamma (think of the analogous statement in \mathbb{R}^n: any open set B\subseteq\mathbb{R}^n must contain a ball around any x\in B).

It is now easy to see that each set of this form must contain a set of the form used in the above proof of the lower bound in Sanov’s theorem. Indeed, as f_i is a bounded function, we can find for each i a simple function \tilde f_i such that \|\tilde f_i-f_i\|_\infty\le\alpha/3. Clearly |\int \tilde f_i dQ-\int\tilde f_i dR|<\alpha/3 implies |\int f_i dQ-\int f_i dR|<\alpha, so we can replace the functions f_i by simple functions. But then forming the partition \{A_1,\ldots,A_r\} generated by the sets that define these simple functions, it is evident that if \varepsilon>0 is chosen sufficiently small, then |Q(A_j)-R(A_j)|<\varepsilon for all j implies |\int \tilde f_i dQ-\int\tilde f_i dR|<\alpha/3. The proof is complete. \square

Remark. It is also possible to work with topologies different than the topology of weak convergence. See, for example, the text by Dembo and Zeitouni for further discussion.

Lecture by Ramon van Handel | Scribed by Quentin Berthet

10. October 2013 by Ramon van Handel