## Lecture 3. Sanov’s theorem

The goal of this lecture is to prove one of the most basic results in large deviations theory. Our motivations are threefold:

- It is an example of a probabilistic question where entropy naturally appears.
- The proof we give uses ideas typical in information theory.
- We will need it later to discuss the transportation-information inequalities (if we get there).

**What is large deviations?**

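The best way to start thinking about large deviations is to consider a basic example. Let $X_1,X_2,\ldots$ be i.i.d. random variables with $\mathbf{E}[X_1]=0$ and $\mathbf{E}[X_1^2]=\sigma^2\in(0,\infty)$. The law of large numbers states that

$$\frac{1}{n}\sum_{k=1}^n X_k \xrightarrow{n\to\infty} 0 \quad\text{a.s.}$$

To say something quantitative about the rate of convergence, we need finer limit theorems. For example, the central limit theorem states that

$$\frac{1}{\sqrt{n}}\sum_{k=1}^n X_k \xrightarrow{n\to\infty} N(0,\sigma^2) \quad\text{in distribution.}$$

Therefore, for any fixed $t$, we have

$$\mathbf{P}\left[\frac{1}{n}\sum_{k=1}^n X_k \ge \frac{t}{\sqrt{n}}\right] = O(1)$$

(the probability converges to a value strictly between zero and one). Informally, this implies that the *typical* size of $\frac{1}{n}\sum_{k=1}^n X_k$ is of order $1/\sqrt{n}$.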

Rather than considering the probability of *typical* events, large deviations theory allows us to understand the probability of *rare* events, that is, events whose probabilities are exponentially small. For example, if $X_k\sim N(0,1)$, then the probability that $\frac{1}{n}\sum_{k=1}^n X_k$ is at least of order unity (a rare event, as we have just shown that the typical size is of order $1/\sqrt{n}$) can be computed as

$$\mathbf{P}\left[\frac{1}{n}\sum_{k=1}^n X_k > t\right] = \int_{t\sqrt{n}}^\infty \frac{e^{-x^2/2}}{\sqrt{2\pi}}\,dx = e^{-nt^2/2+o(n)} \qquad (t>0).$$

The probability of this rare event decays exponentially at rate $t^2/2$. If the random variables have a different distribution, then these tail probabilities still decay exponentially but with a different rate function. The goal of *large deviations theory* is to compute precisely the rate of decay of the probabilities of such rare events. In the sequel, we will consider a more general version of this problem.
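The exponential decay in the Gaussian example can be checked numerically. The following is a minimal sketch, assuming standard Gaussian samples (for which the sample mean is $N(0,1/n)$ and the rate is $t^2/2$); the function name `gaussian_tail_rate` is illustrative and not part of the lecture.

```python
import math

def gaussian_tail_rate(n: int, t: float) -> float:
    """Return -(1/n) * log P[sample mean of n i.i.d. N(0,1) variables > t]."""
    # The sample mean is N(0, 1/n), so the tail equals P[Z > t*sqrt(n)]
    # for a standard Gaussian Z, i.e. 0.5 * erfc(t*sqrt(n)/sqrt(2)).
    tail = 0.5 * math.erfc(t * math.sqrt(n) / math.sqrt(2.0))
    return -math.log(tail) / n

t = 0.5
for n in (10, 100, 1000, 2000):
    # The second column should approach the third column, t^2/2, as n grows.
    print(n, gaussian_tail_rate(n, t), t * t / 2)
```

The gap between the two columns is the $o(n)/n$ correction; for much larger $n$ the tail underflows in floating point, so a log-domain computation would be needed.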

**Sanov’s theorem**

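Let $X_1,X_2,\ldots$ be i.i.d. random variables with values in a finite set $\{1,\ldots,r\}$ and with distribution $P$ (random variables in a continuous space will be considered at the end of this lecture). Denote by $\mathcal{P}$ the set of probabilities on $\{1,\ldots,r\}$. Let $\hat P_n$ be the empirical distribution of $X_1,\ldots,X_n$:

$$\hat P_n = \frac{1}{n}\sum_{k=1}^n \delta_{X_k}.$$

The law of large numbers states that $\hat P_n\to P$ a.s. To define a rare event, we fix $\Gamma\subseteq\mathcal{P}$ that does not contain $P$. We are interested in the behavior of probabilities of the form $\mathbf{P}[\hat P_n\in\Gamma]$ as $n\to\infty$.

**Example.** Let $f:\{1,\ldots,r\}\to\mathbb{R}$ be such that $\int f\,dP=0$. Define $\Gamma=\{Q\in\mathcal{P}:\int f\,dQ\ge t\}$ for some $t>0$. Then $\mathbf{P}[\hat P_n\in\Gamma]=\mathbf{P}[\frac{1}{n}\sum_{k=1}^n f(X_k)\ge t]$. Thus the rare events of the type described in the previous section form a special case of the present setting.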
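To make the example concrete, here is a small sketch that computes the empirical distribution of a sample and confirms that integrating a function against it is the same as averaging that function over the sample. The sample, the alphabet `(1, 2, 3)`, and the function `f` are made-up illustrations.

```python
from collections import Counter

def empirical_distribution(samples, alphabet):
    """Return the empirical measure as a dict: value -> fraction of samples."""
    n = len(samples)
    counts = Counter(samples)
    return {a: counts[a] / n for a in alphabet}

alphabet = (1, 2, 3)
samples = [1, 3, 3, 2, 3, 1, 3, 3]      # hypothetical observations
f = {1: -1.0, 2: 0.0, 3: 1.0}           # hypothetical test function

p_hat = empirical_distribution(samples, alphabet)
integral = sum(f[a] * p_hat[a] for a in alphabet)       # integral of f against p_hat
average = sum(f[x] for x in samples) / len(samples)     # sample average of f

print(p_hat, integral, average)
```

Membership of the empirical distribution in a set defined by a threshold on the integral of `f` is therefore the same event as the sample average of `f` exceeding that threshold.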

We are now in a position to state Sanov’s theorem, which explains precisely at what exponential rate the probabilities decay.

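**Theorem (Sanov).** With the above notation, it holds that

$$-\inf_{Q\in\mathop{\mathrm{int}}\Gamma}D(Q\|P) \le \liminf_{n\to\infty}\frac{1}{n}\log\mathbf{P}[\hat P_n\in\Gamma] \le \limsup_{n\to\infty}\frac{1}{n}\log\mathbf{P}[\hat P_n\in\Gamma] \le -\inf_{Q\in\mathop{\mathrm{cl}}\Gamma}D(Q\|P).$$

In particular, for “nice” $\Gamma$ such that the left- and right-hand sides coincide, we have the exact rate

$$\lim_{n\to\infty}\frac{1}{n}\log\mathbf{P}[\hat P_n\in\Gamma] = -\inf_{Q\in\Gamma}D(Q\|P).$$

In words, Sanov’s theorem states that the exponential rate of decay of the probability of a rare event $\Gamma$ is controlled by the element of $\Gamma$ that is closest to the true distribution $P$ in the sense of relative entropy.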
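As a sanity check on the statement, consider the simplest case: a fair coin, with the rare event that the empirical frequency of heads is at least $0.7$. The closest element of this set to the fair coin in relative entropy is the coin with bias $0.7$. The sketch below computes the exact binomial tail in the log domain and compares its decay rate with that relative entropy; the helper names are illustrative, and the small tolerance in the threshold guards against floating-point rounding of `q * n`.

```python
import math

def kl_bernoulli(q: float, p: float) -> float:
    """Relative entropy D(Ber(q) || Ber(p)) in nats, with 0 log 0 = 0."""
    total = 0.0
    for a, b in ((q, p), (1.0 - q, 1.0 - p)):
        if a > 0.0:
            total += a * math.log(a / b)
    return total

def log_prob_fraction_at_least(n: int, q: float) -> float:
    """Exact log P[fraction of heads among n fair coin flips >= q]."""
    k_min = math.ceil(q * n - 1e-12)
    # Sum binomial coefficients as exact integers; math.log accepts big ints.
    count = sum(math.comb(n, k) for k in range(k_min, n + 1))
    return math.log(count) - n * math.log(2.0)

q = 0.7
for n in (10, 100, 1000, 2000):
    # The empirical rate should approach D(Ber(0.7) || Ber(0.5)) as n grows.
    print(n, -log_prob_fraction_at_least(n, q) / n, kl_bernoulli(q, 0.5))
```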

There are many proofs of Sanov’s theorem (see, for example, the excellent text by Dembo and Zeitouni). Here we will utilize an elegant approach that uses a common device in information theory.

**Method of types**

It is a trivial observation that each possible value $i\in\{1,\ldots,r\}$ must appear an integer number of times among the samples $X_1,\ldots,X_n$. This implies, however, that the empirical measure $\hat P_n$ cannot take arbitrary values: evidently it is always the case that $\hat P_n\in\mathcal{P}_n$, where we define

$$\mathcal{P}_n = \left\{Q\in\mathcal{P} : Q(i)=\frac{k_i}{n},\ k_i\in\{0,\ldots,n\},\ \sum_{i=1}^r k_i=n\right\}.$$

Each element of $\mathcal{P}_n$ is called a *type*: it contains only the information about how often each value shows up in the sample, discarding the order in which they appear. The key idea behind the proof of Sanov’s theorem is that we can obtain a very good bound on the probability that the empirical measure takes the value $Q$ for each type $Q\in\mathcal{P}_n$.
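A type is determined by its vector of counts, so the set of types can be enumerated directly. The sketch below lists all count vectors for a small sample size and checks that their number is the stars-and-bars count $\binom{n+r-1}{r-1}$, which is at most $(n+1)^r$, that is, polynomial in $n$. The generator name `types` is illustrative.

```python
from math import comb

def types(n: int, r: int):
    """Yield every count vector (k_1, ..., k_r) of nonnegative integers summing to n."""
    if r == 1:
        yield (n,)
        return
    for k in range(n + 1):
        for rest in types(n - k, r - 1):
            yield (k,) + rest

n, r = 6, 3
all_types = list(types(n, r))
# Number of types, its closed form, and the cruder polynomial bound (n+1)^r.
print(len(all_types), comb(n + r - 1, r - 1), (n + 1) ** r)  # prints: 28 28 343
```

The polynomial growth of the number of types is what lets exponential estimates for individual types carry over to whole sets of types.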

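**Type theorem.** For every $Q\in\mathcal{P}_n$, we have

$$\frac{1}{(n+1)^r}\,e^{-nD(Q\|P)} \le \mathbf{P}[\hat P_n=Q] \le e^{-nD(Q\|P)}.$$

That is, up to a polynomial factor, the probability of each type $Q\in\mathcal{P}_n$ behaves like $e^{-nD(Q\|P)}$.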
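For small $n$ the two-sided bound can be verified exhaustively, since the exact probability of a type is given by the multinomial formula. The following sketch does this under the assumption that the true distribution has full support (so that every relative entropy is finite); the function names and the example distribution `(0.5, 0.3, 0.2)` are illustrative.

```python
import math

def compositions(n, r):
    """Yield all count vectors (k_1, ..., k_r) of nonnegative integers summing to n."""
    if r == 1:
        yield (n,)
        return
    for k in range(n + 1):
        for rest in compositions(n - k, r - 1):
            yield (k,) + rest

def relative_entropy(q, p):
    """D(q || p) in nats, with the convention 0 log 0 = 0 (assumes p > 0)."""
    return sum(a * math.log(a / b) for a, b in zip(q, p) if a > 0)

def type_probability(counts, p):
    """Exact probability that n i.i.d. samples from p produce these counts."""
    coef = math.factorial(sum(counts))
    for k in counts:
        coef //= math.factorial(k)  # remains an exact integer at every step
    prob = float(coef)
    for k, p_i in zip(counts, p):
        prob *= p_i ** k
    return prob

def check_type_theorem(n, p, rel_tol=1e-9):
    """Return True if both bounds hold for every type with denominator n."""
    r = len(p)
    for counts in compositions(n, r):
        q = [k / n for k in counts]
        prob = type_probability(counts, p)
        upper = math.exp(-n * relative_entropy(q, p))
        lower = upper / (n + 1) ** r
        # The tolerance absorbs rounding when a bound holds with equality.
        if not (lower * (1 - rel_tol) <= prob <= upper * (1 + rel_tol)):
            return False
    return True

print(check_type_theorem(12, (0.5, 0.3, 0.2)))
```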

In view of the type theorem, the conclusion of Sanov’s theorem is not surprising. The type theorem implies that types $Q$ such that $D(Q\|P)\ge\inf_{Q'\in\Gamma}D(Q'\|P)+\varepsilon$ have exponentially smaller probability than the “optimal” distribution that minimizes the relative entropy in $\Gamma$. The probability of the rare event $\{\hat P_n\in\Gamma\}$ is therefore controlled by the probability of the most likely type. In other words, we have the following intuition, common in large deviation theory: *the probability of a rare event is dominated by the most likely of the unlikely outcomes*. The proof of Sanov’s theorem makes this intuition precise.

**Proof of Sanov’s theorem.**

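*Upper bound.* Note that

$$\mathbf{P}[\hat P_n\in\Gamma] = \sum_{Q\in\Gamma\cap\mathcal{P}_n}\mathbf{P}[\hat P_n=Q] \le |\mathcal{P}_n|\sup_{Q\in\Gamma\cap\mathcal{P}_n}\mathbf{P}[\hat P_n=Q] \le (n+1)^r\,e^{-n\inf_{Q\in\Gamma}D(Q\|P)},$$

where we used the type theorem and the fact that $|\mathcal{P}_n|\le(n+1)^r$ (each of the $r$ counts takes at most $n+1$ values). This yields

$$\limsup_{n\to\infty}\frac{1}{n}\log\mathbf{P}[\hat P_n\in\Gamma] \le -\inf_{Q\in\Gamma}D(Q\|P).$$

[Note that in the finite case, by continuity, the infimum over $\Gamma$ equals the infimum over $\mathop{\mathrm{cl}}\Gamma$ as stated in the theorem. The closure becomes more important in the continuous setting.]

*Lower bound.* Note that $\bigcup_{n\ge 1}\mathcal{P}_n$ is dense in $\mathcal{P}$. As $\mathop{\mathrm{int}}\Gamma$ is open, we can choose $Q_n\in\mathcal{P}_n\cap\mathop{\mathrm{int}}\Gamma$ for each $n$ (sufficiently large), such that $D(Q_n\|P)\to\inf_{Q\in\mathop{\mathrm{int}}\Gamma}D(Q\|P)$. Therefore,

$$\mathbf{P}[\hat P_n\in\Gamma] \ge \mathbf{P}[\hat P_n\in\mathop{\mathrm{int}}\Gamma] = \sum_{Q\in\mathcal{P}_n\cap\mathop{\mathrm{int}}\Gamma}\mathbf{P}[\hat P_n=Q] \ge \mathbf{P}[\hat P_n=Q_n] \ge \frac{1}{(n+1)^r}\,e^{-nD(Q_n\|P)}.$$

It follows that

$$\liminf_{n\to\infty}\frac{1}{n}\log\mathbf{P}[\hat P_n\in\Gamma] \ge -\inf_{Q\in\mathop{\mathrm{int}}\Gamma}D(Q\|P).$$

[Note that even though we are in the finite case, it is essential to consider the interior of $\Gamma$.]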

Of course, all the magic has now shifted to the type theorem itself: why are the probabilities of the types controlled by relative entropy? We will presently see that relative entropy arises naturally in the proof.

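**Proof of the type theorem.** Let us define

$$T(Q) = \left\{(x_1,\ldots,x_n)\in\{1,\ldots,r\}^n : \frac{1}{n}\sum_{k=1}^n\delta_{x_k}=Q\right\}.$$

Then we can write

$$\begin{aligned}
\mathbf{P}[\hat P_n=Q] &= \mathbf{P}[(X_1,\ldots,X_n)\in T(Q)] = \sum_{x\in T(Q)}\prod_{k=1}^n P(x_k) \\
&= \sum_{x\in T(Q)}\prod_{i=1}^r P(i)^{nQ(i)} = |T(Q)|\,e^{n\sum_{i=1}^r Q(i)\log P(i)} = |T(Q)|\,e^{-n\{H(Q)+D(Q\|P)\}}.
\end{aligned}$$

It is therefore sufficient to prove that for every $Q\in\mathcal{P}_n$

$$\frac{1}{(n+1)^r}\,e^{nH(Q)} \le |T(Q)| \le e^{nH(Q)}.$$

To show this, the key idea is to utilize precisely the same expression for $\mathbf{P}[\hat P_n=Q]$ given above, for the case that the distribution $P$ that defined the empirical measure $\hat P_n$ is replaced by $Q$ (which is a type). To this end, let us denote by $\hat Q_n$ the empirical measure of $n$ i.i.d. random variables with distribution $Q$.

*Upper bound.* As $D(Q\|Q)=0$, we simply estimate using the above expression

$$1 \ge \mathbf{P}[\hat Q_n=Q] = e^{-nH(Q)}\,|T(Q)|.$$

*Lower bound.* It seems intuitively plausible that $\mathbf{P}[\hat Q_n=Q]\ge\mathbf{P}[\hat Q_n=R]$ for every $Q,R\in\mathcal{P}_n$, that is, the probability of the empirical distribution is maximized at the true distribution (“what else could it be?” We will prove it below). Assuming this fact, we simply estimate

$$1 = \sum_{R\in\mathcal{P}_n}\mathbf{P}[\hat Q_n=R] \le |\mathcal{P}_n|\,\mathbf{P}[\hat Q_n=Q] \le (n+1)^r\,e^{-nH(Q)}\,|T(Q)|.$$

*Proof of the claim.* It remains to prove the above claim that $\mathbf{P}[\hat Q_n=Q]\ge\mathbf{P}[\hat Q_n=R]$ for every $Q,R\in\mathcal{P}_n$. To this end, note that $T(R)$ consists of all vectors $x\in\{1,\ldots,r\}^n$ such that $nR(1)$ of the entries take the value $1$, $nR(2)$ of the entries take the value $2$, etc. The number of such vectors is

$$|T(R)| = \frac{n!}{(nR(1))!\cdots(nR(r))!}.$$

It is now straightforward to estimate, using the elementary inequality $\frac{k!}{m!}\ge m^{k-m}$ (and assuming $\mathbf{P}[\hat Q_n=R]>0$, as otherwise there is nothing to prove),

$$\frac{\mathbf{P}[\hat Q_n=Q]}{\mathbf{P}[\hat Q_n=R]} = \prod_{i=1}^r\frac{(nR(i))!}{(nQ(i))!}\,Q(i)^{n(Q(i)-R(i))} \ge \prod_{i=1}^r (nQ(i))^{n(R(i)-Q(i))}\,Q(i)^{n(Q(i)-R(i))} = n^{n\sum_{i=1}^r(R(i)-Q(i))} = 1.$$

Thus the claim is established.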
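Both ingredients of this proof, the entropy bounds on the size of a type class and the claim that the true type is the most likely one, can be checked by brute force for small sample sizes. The sketch below does so using exact multinomial coefficients; the function names and the example counts `(5, 3, 2)` are illustrative assumptions.

```python
import math

def compositions(n, r):
    """Yield all count vectors (k_1, ..., k_r) of nonnegative integers summing to n."""
    if r == 1:
        yield (n,)
        return
    for k in range(n + 1):
        for rest in compositions(n - k, r - 1):
            yield (k,) + rest

def type_class_size(counts):
    """Number of sequences with the given counts: n! / (k_1! ... k_r!)."""
    size = math.factorial(sum(counts))
    for k in counts:
        size //= math.factorial(k)
    return size

def entropy(q):
    """Shannon entropy in nats, with 0 log 0 = 0."""
    return -sum(a * math.log(a) for a in q if a > 0)

def check_type_class_bounds(counts, rel_tol=1e-9):
    """Check e^{nH(Q)} / (n+1)^r <= |T(Q)| <= e^{nH(Q)} for the type with these counts."""
    n, r = sum(counts), len(counts)
    upper = math.exp(n * entropy([k / n for k in counts]))
    return upper / (n + 1) ** r <= type_class_size(counts) <= upper * (1 + rel_tol)

def true_type_is_most_likely(counts_q, rel_tol=1e-9):
    """Sampling from Q, check that no type R is more likely than Q itself."""
    n, r = sum(counts_q), len(counts_q)
    q = [k / n for k in counts_q]

    def prob(counts_r):
        value = float(type_class_size(counts_r))
        for k, q_i in zip(counts_r, q):
            value *= q_i ** k  # Python evaluates 0.0 ** 0 as 1.0
        return value

    best = prob(counts_q)
    return all(prob(c) <= best * (1 + rel_tol) for c in compositions(n, r))

print(check_type_class_bounds((5, 3, 2)), true_type_is_most_likely((5, 3, 2)))
```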

**Remark.** It is a nice exercise to work out the explicit form of the rate function in the example considered at the beginning of this lecture. The resulting expression yields another basic result in large deviations theory, which is known as *Cramér’s theorem.*

**General form of Sanov’s theorem**

The drawback to the method of types is that it relies heavily on the assumption that $X_i$ take values in a finite state space. In fact, Sanov’s theorem continues to hold in a much more general setting.

Let $\mathbb{X}$ be a Polish space (think $\mathbb{R}^d$), and let $X_1,X_2,\ldots$ be i.i.d. random variables taking values in $\mathbb{X}$ with distribution $P$. Denote by $\mathcal{P}(\mathbb{X})$ the space of probability measures on $\mathbb{X}$ endowed with the topology of weak convergence: that is, $P_n\to P$ iff $\int f\,dP_n\to\int f\,dP$ for every bounded continuous function $f$. Now that we have specified the topology, it makes sense to speak of “open” and “closed” subsets of $\mathcal{P}(\mathbb{X})$.

**Theorem.** In the present setting, Sanov’s theorem holds verbatim as stated above.

It turns out that the lower bound in the general Sanov theorem can be easily deduced from the finite state space version. The upper bound can also be deduced, but this is much trickier (see this note), and a direct proof in the continuous setting using entirely different methods is more natural. [There is in fact a simple information-theoretic proof of the upper bound, which is however restricted to sets $\Gamma$ that are sufficiently convex, an unnecessary restriction; see this classic paper by Csiszár.]

We will need the general form of Sanov’s theorem in the development of transportation-information inequalities. Fortunately, however, we will only need the lower bound. We will therefore be content to deduce the general lower bound from the finite state space version that we proved above.

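**Proof of the lower bound.** It evidently suffices to consider the case that $\Gamma$ is an open set. We use the following topological fact whose proof will be given below: if $\Gamma\subseteq\mathcal{P}(\mathbb{X})$ is open and $Q\in\Gamma$, then there is a finite (measurable) partition $\{A_1,\ldots,A_r\}$ of $\mathbb{X}$ and $\varepsilon>0$ such that

$$\tilde\Gamma = \{R\in\mathcal{P}(\mathbb{X}) : |R(A_i)-Q(A_i)|<\varepsilon\ \text{ for all } i=1,\ldots,r\} \subseteq \Gamma.$$

Given such a set, the idea is now to reduce to the discrete case using the data processing inequality.

Define the function $T:\mathbb{X}\to\{1,\ldots,r\}$ such that $T(x)=i$ for $x\in A_i$. Then $\hat P_n\in\tilde\Gamma$ if and only if the empirical measure $\hat P_n^\circ$ of $T(X_1),\ldots,T(X_n)$ lies in $\Gamma^\circ=\{R\in\mathcal{P}(\{1,\ldots,r\}) : |R(i)-Q(A_i)|<\varepsilon\ \text{ for all } i=1,\ldots,r\}$. Thus

$$\mathbf{P}[\hat P_n\in\Gamma] \ge \mathbf{P}[\hat P_n\in\tilde\Gamma] = \mathbf{P}[\hat P_n^\circ\in\Gamma^\circ].$$

As $T(X_i)$ take values in a finite set, and as $\Gamma^\circ$ is open, we obtain from the finite Sanov theorem

$$\liminf_{n\to\infty}\frac{1}{n}\log\mathbf{P}[\hat P_n\in\Gamma] \ge -\inf_{R\in\Gamma^\circ}D(R\|PT^{-1}) \ge -D(QT^{-1}\|PT^{-1}) \ge -D(Q\|P),$$

where $PT^{-1}$ denotes the distribution of $T(X_1)$, and where we have used $QT^{-1}\in\Gamma^\circ$ and the data processing inequality in the last two inequalities. As $Q\in\Gamma$ was arbitrary, taking the supremum over $Q\in\Gamma$ completes the proof.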
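The data processing step used here says that coarse-graining two distributions through the same map can only decrease their relative entropy. The sketch below illustrates this on a four-point space partitioned into two cells; the distributions `p`, `q` and the cell labels are made-up values chosen purely for illustration.

```python
import math

def relative_entropy(q, p):
    """D(q || p) in nats, with the convention 0 log 0 = 0 (assumes p > 0 where q > 0)."""
    return sum(a * math.log(a / b) for a, b in zip(q, p) if a > 0)

def push_forward(mu, labels, r):
    """Image of the measure mu under the map x -> labels[x], with values in {0, ..., r-1}."""
    out = [0.0] * r
    for mass, label in zip(mu, labels):
        out[label] += mass
    return out

p = [0.1, 0.2, 0.3, 0.4]      # hypothetical true distribution
q = [0.4, 0.3, 0.2, 0.1]      # hypothetical alternative distribution
labels = [0, 0, 1, 1]         # partition into cells {0, 1} and {2, 3}

fine = relative_entropy(q, p)
coarse = relative_entropy(push_forward(q, labels, 2), push_forward(p, labels, 2))
print(coarse, fine)           # the first value should not exceed the second
```

In the proof, this is what allows the finite-alphabet rate for the coarse-grained samples to be bounded by the relative entropy on the original space.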

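**Proof of the topological fact.** Sets of the form

$$\left\{R\in\mathcal{P}(\mathbb{X}) : \left|\int f_i\,dR-\int f_i\,dQ\right|<\alpha\ \text{ for all } i=1,\ldots,k\right\}$$

for $Q\in\mathcal{P}(\mathbb{X})$, $k<\infty$, $f_1,\ldots,f_k$ bounded continuous functions, and $\alpha>0$ form a base for the weak convergence topology on $\mathcal{P}(\mathbb{X})$. Thus any open subset $\Gamma\subseteq\mathcal{P}(\mathbb{X})$ must contain a set of this form for every $Q\in\Gamma$ (think of the analogous statement in $\mathbb{R}^d$: any open set $\Gamma\subseteq\mathbb{R}^d$ must contain a ball around any $x\in\Gamma$).

It is now easy to see that each set of this form must contain a set of the form $\tilde\Gamma$ used in the above proof of the lower bound in Sanov’s theorem. Indeed, as $f_i$ is a bounded function, we can find for each $i$ a simple function $\tilde f_i$ such that $\|f_i-\tilde f_i\|_\infty\le\alpha/3$. Clearly $|\int\tilde f_i\,dR-\int\tilde f_i\,dQ|<\alpha/3$ implies $|\int f_i\,dR-\int f_i\,dQ|<\alpha$, so we can replace the functions $f_i$ by simple functions. But then forming the partition $\{A_1,\ldots,A_r\}$ generated by the sets that define these simple functions, it is evident that if $\varepsilon$ is chosen sufficiently small, then $|R(A_j)-Q(A_j)|<\varepsilon$ for all $j$ implies $|\int\tilde f_i\,dR-\int\tilde f_i\,dQ|<\alpha/3$ for all $i$. The proof is complete.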

**Remark.** It is also possible to work with topologies different than the topology of weak convergence. See, for example, the text by Dembo and Zeitouni for further discussion.

*Lecture by Ramon van Handel* | *Scribed by Quentin Berthet*