Next lecture: one-time room change

The next lecture is on October 17. Please note that we have a one time room change to Sherrerd Hall Room 101 on October 17 as the usual Bendheim room is in use for another event. We will return to the Bendheim center the following week for the rest of the semester.

10. October 2013 by Ramon van Handel
Categories: Announcement | Comments Off on Next lecture: one-time room change

Lecture 3. Sanov’s theorem

The goal of this lecture is to prove one of the most basic results in large deviations theory. Our motivations are threefold:

  1. It is an example of a probabilistic question where entropy naturally appears.
  2. The proof we give uses ideas typical in information theory.
  3. We will need it later to discuss the transportation-information inequalities (if we get there).

What is large deviations?

The best way to start thinking about large deviations is to consider a basic example. Let X_1,X_2,\ldots be i.i.d. random variables with \mathbf{E}[X_1] =0 and \mathbf{E}[X_1^2] < \infty. The law of large numbers states that

    \[\frac{1}{n} \sum_{k=1}^n X_k \xrightarrow{n \rightarrow \infty} 0   \quad \mbox{a.s.}\]

To say something quantitative about the rate of convergence, we need finer limit theorems. For example, the central limit theorem states that

    \[\sqrt{n} \, \bigg\{ \frac{1}{n} \sum_{k=1}^n X_k\bigg\}     \xrightarrow{n \rightarrow \infty} N(0,\sigma^2)    \quad\mbox{in distribution}.\]

Therefore, for any fixed t, we have

    \[\mathbf{P}\bigg[ \frac{1}{n} \sum_{k=1}^n X_k \ge \frac{t}{\sqrt{n}}\bigg]       = O(1)\]

(the probability converges to a value strictly between zero and one). Informally, this implies that the typical size of \frac 1n \sum_{k=1}^n X_k is of order \frac{1}{\sqrt{n}}.

Rather than considering the probability of typical events, large deviations theory allows us to understand the probability of rare events, that is, events whose probabilities are exponentially small. For example, if X_k \sim N(0,1), then the probability that \frac 1n \sum_{k=1}^n X_k is at least of order unity (a rare event, as we have just shown that the typical size is of order \frac{1}{\sqrt{n}}) can be computed as

    \[\mathbf{P}\bigg[ \frac 1n \sum_{k=1}^n X_k > t\bigg] =       \int_{t\sqrt{n}}^{\infty} \frac{e^{-x^2/2}}{\sqrt{2 \pi}} \approx    e^{-nt^2/2}.\]

The probability of this rare event decays exponentially at rate t^2/2. If the random variables X_i have a different distribution, then these tail probabilities still decay exponentially but with a different rate function. The goal of large deviations theory is to compute precisely the rate of decay of the probabilities of such rare events. In the sequel, we will consider a more general version of this problem.

Sanov’s theorem

Let X_1,X_2,\ldots be i.i.d. random variables with values in a finite set \{1,\ldots,r\} and with distribution P (random variables in a continuous space will be considered at the end of this lecture). Denote by \mathcal{P} the set of probabilities on \{1,\ldots,r \}. Let \hat P_n be the empirical distribution of X:

    \[\hat P_n = \frac{1}{n}\sum_{k=1}^n \delta_{X_k}.\]

The law of large numbers states that \hat P_n \rightarrow P a.s. To define a rare event, we fix \Gamma \subseteq \mathcal{P} that does not contain P. We are interested in behavior of probabilities of the form \mathbf{P}[\hat P_n \in \Gamma], as n \rightarrow \infty.

Example. Let f:\{1,\ldots,r\} \rightarrow \mathbb{R} be such that \int f dP=0. Define \Gamma = \{Q\in\mathcal{P} : \int f dQ \ge t \} for some t>0. Then \mathbf{P}[\hat P_n \in \Gamma]=\mathbf{P}[\frac{1}{n}\sum_{k=1}^n f(X_k)\ge t]. Thus the rare events of the type described in the previous section form a special case of the present setting.

We are now in a position to state Sanov’s theorem, which explains precisely at what exponential rate the probabilities \mathbf{P}[\hat P_n \in \Gamma] decay.

Theorem (Sanov). With the above notations, it holds that

    \begin{eqnarray*} -\inf_{Q \in \mathop{\mathrm{int}} \Gamma} D(Q || P) &\le& \liminf_{n \rightarrow \infty} \frac 1n \log \mathbf{P}[\hat P_n \in \Gamma]\\ &\le& \limsup_{n \rightarrow \infty} \frac 1n \log \mathbf{P}[\hat P_n \in \Gamma]\\ &\le&-\inf_{Q \in \mathop{\mathrm{cl}} \Gamma} D(Q || P). \end{eqnarray*}

In particular, for “nice” \Gamma such that the left- and right-hand sides coincide we have the exact rate

    \[\lim_{n \rightarrow \infty} \frac 1n \log \mathbf{P}[\hat P_n \in \Gamma]    = - \inf_{Q \in \Gamma} D(Q || P).\]

In words, Sanov’s theorem states that the exponential rate of decay of the probability of a rare event \Gamma is controlled by the element of \Gamma that is closest to the true distribution P in the sense of relative entropy.

There are many proofs of Sanov’s theorem (see, for example, the excellent text by Dembo and Zeitouni). Here we will utilize an elegant approach that uses a common device in information theory.

Method of types

It is a trivial observation that each possible value \{1,\ldots,r \} must appear an integer number of times among the samples \{X_1,\ldots,X_n\}. This implies, however, that the empirical measure \hat P_n cannot take arbitrary values: evidently it is always the case that \hat P_n \in \mathcal{P}_n, where we define

    \[\mathcal{P}_n = \bigg \{Q \in \mathcal{P} :     Q(\{i\}) = \frac{k_i}{n}, ~ k_i \in \{0,\ldots,n\},~\sum_{i=1}^rk_i=1 \bigg \}.\]

Each element of \mathcal{P}_n is called a type: it contains only the information about how often each value shows up in the sample, discarding the order in which they appear. The key idea behind the proof of Sanov’s theorem is that we can obtain a very good bound on the probability that the empirical measure takes the value \hat P_n=Q for each type Q\in\mathcal{P}_n.

Type theorem. For every Q \in \mathcal{P}_n, we have

    \[\frac{1}{(n+1)^r} e^{- n D(Q || P)} \le \mathbf{P}[\hat P_n = Q]     \le e^{- n D(Q || P)}.\]

That is, up to a polynomial factor, the probability of each type Q\in \mathcal{P}_n behaves like e^{- n D(Q || P)}.

In view of the type theorem, the conclusion of Sanov’s theorem is not surprising. The type theorem implies that types Q such that D(Q||P)\ge \inf_{Q\in\Gamma}D(Q||P)+\varepsilon have exponentially smaller probability than the “optimal” distribution that minimizes the relative entropy in \Gamma. The probability of the rare event \Gamma is therefore controlled by the probability of the most likely type. In other words, we have the following intuition, common in large deviation theory: the probability of a rare event is dominated by the most likely of the unlikely outcomes. The proof of Sanov’s theorem makes this intuition precise.

Proof of Sanov’s theorem.

Upper bound. Note that

    \begin{eqnarray*} \mathbf{P}[\hat P_n \in \Gamma] &=& \sum_{Q\in \Gamma \cap \mathcal{P}_n} \mathbf{P}[\hat P_n = Q] \le \sum_{Q\in \Gamma \cap \mathcal{P}_n} e^{- n D(Q || P)}\\ &\le& |\mathcal{P}_n| \sup_{Q \in \Gamma} e^{- n D(Q || P)}  \le (n+1)^r e^{- n \inf _{Q \in \Gamma} D(Q || P)}. \end{eqnarray*}

This yields

    \[\limsup_{n \rightarrow \infty} \frac 1n \log \mathbf{P}[\hat P_n \in \Gamma] \le -\inf_{Q \in \Gamma} D(Q || P).\]

[Note that in the finite case, by continuity, the infimum over \Gamma equals the infimum over \mathop{\mathrm{cl}}\Gamma as stated in the theorem. The closure becomes more important in the continuous setting.]

Lower bound. Note that \bigcup_{n\ge 1}\mathcal{P}_n is dense in \mathcal{P}. As \mathop{\mathrm{int}} \Gamma is open, we can choose Q_n \in  \mathop{\mathrm{int}} \Gamma \cap \mathcal{P}_n for each n, such that D( Q_n || P) \rightarrow \inf_{Q \in \mathop{\mathrm{int}} \Gamma} D(Q || P). Therefore,

    \begin{eqnarray*} \mathbf{P}[\hat P_n \in \Gamma] &\ge& \mathbf{P}[\hat P_n \in \mathop{\mathrm{int}} \Gamma] = \sum_{Q \in \mathop{\mathrm{int}} \Gamma \cap \mathcal{P}_n} \mathbf{P}[\hat P_n = Q] \ge \mathbf{P}[\hat P_n = Q_n]\\ &\ge& \frac{1}{(n+1)^r} e^{-nD(Q_n || P)}. \end{eqnarray*}

It follows that

    \[\liminf_{n \rightarrow \infty} \frac 1n \log \mathbf{P}[\hat P_n \in \Gamma] \ge -\inf_{Q \in \mathop{\mathrm{int}}\Gamma} D(Q || P).\]

[Note that despite that we are in the finite case, it is essential to consider the interior of \Gamma.] \qquad\square

Of course, all the magic has now shifted to the type theorem itself: why are the probabilities of the types controlled by relative entropy? We will presently see that relative entropy arises naturally in the proof.

Proof of the type theorem. Let us define

    \[T(Q) = \bigg\{(x_1,\ldots,x_n) \in \{1,\ldots,r\}^n : \frac 1n \sum_{k=1}^n\delta_{x_k}=Q \bigg\}.\]

Then we can write

    \begin{eqnarray*} \mathbf{P}[\hat P_n = Q] &=& \mathbf{P}[(X_1,\ldots,X_n) \in T(Q)] = \sum_{x \in T(Q)} P(\{x_1\}) \ldots P(\{x_n\})\\ &=& \sum_{x \in T(Q)} \prod_{i=1}^r P(i)^{n Q(i)} = e^{n \sum_{i=1}^r Q(i) \log  P(i)} \, |T(Q)| \\ &=& e^{-n \{D(Q || P) + H(Q)\}} |T(Q)|. \end{eqnarray*}

It is therefore sufficient to prove that for every Q \in \mathcal{P}_n

    \[\frac{1}{(n+1)^r} e^{n H(Q)} \le |T(Q)| \le e^{n H(Q)}.\]

To show this, the key idea is to utilize precisely the same expression for \mathbf{P}[\hat P_n = Q] given above, for the case that the distribution P that defined the empirical measure \hat P_n is replaced by Q (which is a type). To this end, let us denote by \hat Q_n the empirical measure of n i.i.d. random variables with distribution Q.

Upper bound. We simply estimate using the above expression

    \[1 \ge \mathbf{P}[\hat Q_n = Q] = e^{-n H(Q)} |T(Q)|.\]

Lower bound. It seems intuitively plausible that \mathbf{P}[\hat Q_n = Q] \ge \mathbf{P}[\hat Q_n = R] for every Q,R \in \mathcal{P}_n, that is, the probability of the empirical distribution is maximized at the true distribution (“what else could it be?” We will prove it below.) Assuming this fact, we simply estimate

    \[1 = \sum_{R \in \mathcal{P}_n} \mathbf{P}[\hat Q_n = R] \le (n+1)^r \mathbf{P}[\hat Q_n = Q] = (n+1)^r e^{-n H(Q)} |T(Q)|.\]

Proof of the claim. It remains to prove the above claim that \mathbf{P}[\hat Q_n = Q] \ge \mathbf{P}[\hat Q_n = R] for every Q,R \in \mathcal{P}_n. To this end, note that T(Q) consists of all vectors x such that nQ(1) of the entries take the value 1, nQ(2) of the entries take the value 2, etc. The number of such vectors is

    \[T(Q) = \frac{n!}{(nQ(1))! \ldots (nQ(r))! }.\]

It is now straightforward to estimate

    \begin{eqnarray*}\frac{\mathbf{P}[\hat Q_n = Q]}{\mathbf{P}[\hat Q_n = R]} &=& \prod_{i=1}^r \frac{(nR(i))!}{(nQ(i))!} Q(i)^{n(Q(i)-R(i))}\\ &\ge& \prod_{i=1}^r (n Q(i))^{n(R(i)-Q(i))} Q(i)^{n(Q(i)-R(i))} \quad \text{using} \quad \frac{n!}{m!} \ge m^{n-m}\\ &=&\prod_{i=1}^r n^{n(R(i)-Q(i))} = n^{n \sum_{i=1}^r (R(i)-Q(i))} =1. \end{eqnarray*}

Thus the claim is established. \qquad\square

Remark. It is a nice exercise to work out the explicit form of the rate function \inf_{Q \in \Gamma} D(Q || P) in the example \Gamma = \{Q\in\mathcal{P} : \int f dQ \ge t \} considered at the beginning of this lecture. The resulting expression yields another basic result in large deviations theory, which is known as Cramèr’s theorem.

General form of Sanov’s theorem

The drawback to the method of types is that it relies heavily on the assumption that X_i take values in a finite state space. In fact, Sanov’s theorem continues to hold in a much more general setting.

Let \mathbb{X} be a Polish space (think \mathbb{R}^n), and let X_1,X_2,\ldots be i.i.d. random variables taking values in \mathbb{X} with distribution P. Denote by \mathcal{P} the space of probability measures on \mathbb{X} endowed with the topology of weak convergence: that is, P_n\to P iff \int f dP_n\to \int f dP for every bounded continuous function f. Now that we have specified the topology, it makes sense to speak of “open” and “closed” subsets of \mathcal{P}.

Theorem. In the present setting, Sanov’s theorem holds verbatim as stated above.

It turns out that the lower bound in the general Sanov theorem can be easily deduced from the finite state space version. The upper bound can also be deduced, but this is much more tricky (see this note) and a direct proof in the continuous setting using entirely different methods is more natural. [There is in fact a simple information-theoretic proof of the upper bound that is however restricted to sets \Gamma that are sufficiently convex, which is an unnecessary restriction; see this classic paper by Csiszar.]

We will need the general form of Sanov’s theorem in the development of transportation-information inequalities. Fortunately, however, we will only need the lower bound. We will therefore be content to deduce the general lower bound from the finite state space version that we proved above.

Proof of the lower bound. It evidently suffices to consider the case that \Gamma is an open set. We use the following topological fact whose proof will be given below: if \Gamma \subseteq \mathcal{P} is open and R\in\Gamma, then there is a finite (measurable) partition \{A_1,\ldots,A_r\} of \mathbb{X} and \varepsilon>0 such that

    \[\tilde \Gamma = \{Q\in\mathcal{P} : |Q(A_i)-R(A_i)|<\varepsilon~ \forall\, i=1,\ldots,r \} \subseteq \Gamma.\]

Given such a set, the idea is now to reduce to the discrete case using the data processing inequality.

Define the function T:\mathbb{X}\to\{1,\ldots,r\} such that T(x)=i for x\in A_i. Then \hat P_n \in \tilde \Gamma if and only if the empirical measure \hat P_n^\circ of T(X_1),\ldots,T(X_n) lies in \Gamma^\circ=\{Q:|Q(\{i\})-R(A_i)|<\varepsilon~\forall\,i=1,\ldots,r\}. Thus

    \[\mathbf{P}[\hat P_n \in \Gamma] \ge \mathbf{P}[\hat P_n \in \tilde \Gamma] = \mathbf{P}[\hat P_n^\circ \in \Gamma^\circ].\]

As T(X_i) take values in a finite set, and as \Gamma^\circ is open, we obtain from the finite Sanov theorem

    \[\liminf_{n \rightarrow \infty} \frac{1}{n} \log \mathbf{P}[\hat P_n \in \Gamma] \ge -\inf_{Q \in \Gamma^\circ} D(Q || PT^{-1}) = -\inf_{Q \in \tilde \Gamma} D(QT^{-1} || PT^{-1}) \ge - D(R||P),\]

where we have used the data processing inequality and R\in\tilde\Gamma in the last inequality. As R\in\Gamma was arbitrary, taking the supremum over R\in\Gamma completes the proof. \qquad\square

Proof of the topological fact. Sets of the form

    \[\bigg\{Q\in\mathcal{P} : \bigg|\int f_id Q-\int f_i dR\bigg|   <\alpha~ \forall\, i=1,\ldots,k \bigg\}\]

for R\in\mathcal{P}, k<\infty, f_1,\ldots,f_k bounded continuous functions, and \alpha>0 form a base for the weak convergence topology on \mathcal{P}. Thus any open subset \Gamma\subseteq\mathcal{P} must contain a set of this form for every R\in\Gamma (think of the analogous statement in \mathbb{R}^n: any open set B\subseteq\mathbb{R}^n must contain a ball around any x\in B).

It is now easy to see that each set of this form must contain a set of the form used in the above proof of the lower bound in Sanov’s theorem. Indeed, as f_i is a bounded function, we can find for each i a simple function \tilde f_i such that \|\tilde f_i-f_i\|_\infty\le\alpha/3. Clearly |\int \tilde f_i dQ-\int\tilde f_i dR|<\alpha/3 implies |\int f_i dQ-\int f_i dR|<\alpha, so we can replace the functions f_i by simple functions. But then forming the partition \{A_1,\ldots,A_r\} generated by the sets that define these simple functions, it is evident that if \varepsilon>0 is chosen sufficiently small, then |Q(A_j)-R(A_j)|<\varepsilon for all j implies |\int \tilde f_i dQ-\int\tilde f_i dR|<\alpha/3. The proof is complete. \square

Remark. It is also possible to work with topologies different than the topology of weak convergence. See, for example, the text by Dembo and Zeitouni for further discussion.

Lecture by Ramon van Handel | Scribed by Quentin Berthet

10. October 2013 by Ramon van Handel
Categories: Information theoretic methods | Comments Off on Lecture 3. Sanov’s theorem

Next lecture: October 17

Next week (October 10) there will be no stochastic analysis seminar due to conflicting events:

03. October 2013 by Ramon van Handel
Categories: Announcement | Comments Off on Next lecture: October 17

Lecture 2. Basics / law of small numbers

Due to scheduling considerations, we postpone the proof of the entropic central limit theorem. In this lecture, we discuss basic properties of the entropy and illustrate them by proving a simple version of the law of small numbers (Poisson limit theorem). The next lecture will be devoted to Sanov’s theorem. We will return to the entropic central limit theorem in Lecture 4.

Conditional entropy and mutual information

We begin by introducing two definitions related to entropy. The first definition is a notion of entropy under conditioning.

Definition. If X and Y are two discrete random variables with probability mass functions p_X and p_Y, then the conditional entropy of X given Y is defined as

    \[H(X|Y) := - \mathbb{E} [\log{p_{X|Y}(X|Y)} ]\]

where p_{X|Y}(x|y) = p_{(X,Y)}(x,y)/p_Y(y) is the conditional probability mass function of X given Y.

Remark. If X and Y are absolutely continuous random variables, the conditional differential entropy h(X|Y) is defined analogously (where the probability mass functions are replaced by the corresponding probability densities with respect to Lebesgue measure).

Note that

    \begin{equation*} \begin{split}  H(X|Y)  &=  - \sum_{x,y} p_{(X,Y)}(x,y)\log{p_{X|Y}(x|y)} \\          &= - \sum_y p_Y(y) \sum_x  p_{X|Y}(x|y)\log{p_{X|Y}(x|y)} \\          & = \sum_y p_Y(y) H(X|Y=y). \end{split} \end{equation*}

That is, the conditional entropy H(X|Y) is precisely the expectation (with respect to the law of Y) of the entropy of the conditional distribution of X given Y.

We now turn to the second definition, the mutual information. It describes the degree of dependence between two random variables.

Definition. The mutual information between two random variables X and Y is defined as

    \[I(X,Y) := D( \mathcal{L}(X,Y) || \mathcal{L}(X) \otimes \mathcal{L}(Y)),\]

where \mathcal{L}(X,Y), \mathcal{L}(X) and \mathcal{L}(Y) denote the distributions of (X,Y), X and Y.

Conditional entropy and mutual information are closely related. For example, suppose that (X,Y) has density f_{(X,Y)} with respect to the Lebesgue measure, then

    \begin{equation*} \begin{split} I(X,Y) & = \int f_{(X,Y)}(x,y) \log{\frac{f_{(X,Y)}(x,y)}{f_X(x)f_Y(y)}} \,dx \,dy \\        & = \mathbb{E} \log{\frac{f_{(X,Y)}(X,Y)}{f_X(X)f_Y(Y)}} \\        & = \mathbb{E} \log{\frac{f_{X|Y}(X|Y)}{f_X(X)}}  \\        & = h(X)-h(X|Y). \end{split} \end{equation*}

In particular, since I(X,Y) is always positive (because it is a relative entropy), we have just shown that h(X|Y) \leq h(X), that is, conditioning reduces entropy. The same result holds for discrete random variables when we replace h by H.

Chain rules

Chain rules are formulas that relate the entropy of multiple random variables to the conditional entropies of these random variables. The most basic version is the following.

Chain rule for entropy. H(X_1, X_2, ..., X_n) = \sum_{i=1}^n H(X_i|X_1,...,X_{i-1}). In particular, H(X_2|X_1)=H(X_1, X_2)-H(X_1).

Proof. Note that

    \[p_{(X_1,...,X_n)}(x_1,...,x_n) = \prod_{i=1}^n p_{X_i|X_1,...,X_{i-1}}(x_i|x_1,...,x_{i-1}).\]


    \[\log{ p_{(X_1,...,X_n)}(x_1,...,x_n)} = \sum_{i=1}^n \log{p_{X_i|X_1,...,X_{i-1}}(x_i|x_1,...,x_{i-1})}.\]

Taking the expectation on both sides under the distribution (x_1,...,x_n) \sim (X_1,...,X_n) gives the desired result. \qquad\square

Corollary. Entropy is sub-additive, that is, H(X_1, X_2, ..., X_n) \leq \sum_{i=1}^n H(X_i).

Proof. Combine the chain rule with H(X_i|X_1,...,X_{i-1}) \leq H(X_i). \qquad\square

There is also a chain rule for relative entropy.

Chain rule for relative entropy.

    \[D(\mathcal{L}(X,Y) || \mathcal{L}(X^{'},Y^{'})) =  D(\mathcal{L}(X)||\mathcal{L}(X^{'})) + \mathbb{E}_{x \sim X} [ D(\mathcal{L}(Y|X=x) || \mathcal{L}(Y^{'}|X^{'}=x))].\]

The following identity will be useful later.


    \begin{multline*}D(\mathcal{L}(X_1,...,X_n) || \mathcal{L}(Y_1) \otimes \cdots   \otimes  \mathcal{L}(Y_n)) = \\  \sum_{i=1}^n  D(\mathcal{L}(X_i) || \mathcal{L}(Y_i)) + D(\mathcal{L}(X_1,...,X_n) || \mathcal{L}(X_1) \otimes \cdots  \otimes  \mathcal{L}(X_n)).\end{multline*}

Proof. Note that

    \begin{equation*} \begin{split} & D(\mathcal{L}(X_1,...,X_n) || \mathcal{L}(Y_1) \otimes \cdots   \otimes  \mathcal{L}(Y_n)) \\ & =  \mathbb{E} \log{\frac{p_{(X_1,...,X_n)}(X_1,...,X_n)}{p_{Y_1}(X_1)\cdots p_{Y_n}(X_n)}} \\ & =  \mathbb{E} \log{\frac{p_{(X_1,...,X_n)}(X_1,...,X_n)}{p_{X_1}(X_1)\cdots p_{X_n}(X_n)}}        + \sum_{i=1}^n \mathbb{E}\log{\frac{p_{X_i}(X_i)}{p_{Y_i}(X_i)}} \\  & = D(\mathcal{L}(X_1,...,X_n) || \mathcal{L}(X_1) \otimes \cdots  \otimes  \mathcal{L}(X_n)) + \sum_{i=1}^n  D(\mathcal{L}(X_i) || \mathcal{L}(Y_i)) . \qquad\square \end{split} \end{equation*}

Data processing and convexity

Two important properties of the relative entropy can be obtained as consequences of the chain rule.

Data processing inequality. Let P and Q be two probability measures on \mathcal{A} and suppose T:\mathcal{A} \rightarrow \mathcal{A}^{'} is measurable. Then D(PT^{-1}||QT^{-1}) \leq D(P||Q), where PT^{-1} is the distribution of T(X) when X \sim P.

The data processing inequality tells us that if we process the data X (which might come from one of the two distributions P and Q), then the relative entropy decreases. In other words, it becomes harder to identify the source distribution after processing the data. The same result (with the same proof) holds also if P and Q are transformed by a transition kernel, rather than by a function.

Proof. Denote by \mathsf{P} and \mathsf{Q} the joint laws of (X,T(X)) and (Y,T(Y)) when X\sim P and Y\sim Q. By the chain rule and nonnegativity of relative entropy

    \[D(PT^{-1}|QT^{-1}) = D(\mathsf{P}||\mathsf{Q}) -    \mathbb{E}_{t \sim PT^{-1}} [ D(\mathcal{L}(X|T(X)=t) || \mathcal{L}(Y|T(Y)=t))] \le D(\mathsf{P}||\mathsf{Q}).\]

On the other hand, using again the chain rule,

    \[D(\mathsf{P}||\mathsf{Q}) = D(P||Q) + \mathbb{E}_{x\sim P} [ D(\mathcal{L}(T(X)|X=x) || \mathcal{L}(T(Y)|Y=x))] =  D(P||Q),\]

where we used \mathcal{L}(T(X)|X=x) = \mathcal{L}(T(Y)|Y=x). Putting these together completes the proof. \qquad\square

Convexity of relative entropy. D(\cdot || \cdot) is jointly convex in its arguments, that is, if P_1, P_2, Q_1, Q_2 are probability measures and 0\leq \lambda \leq 1, then

    \[D(\lambda P_1 + (1-\lambda)P_2 || \lambda Q_1 + (1-\lambda)Q_2 ) \leq \lambda D(P_1 || Q_1) + (1-\lambda)D(P_2||Q_2).\]

Proof. Let T be a random variable that takes value 1 with probability \lambda and 2 with probability 1-\lambda. Conditionally on T=i, draw X\sim P_i and Y\sim Q_i. Then \mathcal{L}(X)=\lambda P_1+(1-\lambda)P_2 and \mathcal{L}(Y)=\lambda Q_1+(1-\lambda)Q_2. Using the chain rule twice, we obtain

    \[D(\mathcal{L}(X)||\mathcal{L}(Y)) \le    D(\mathcal{L}(X,T)||\mathcal{L}(Y,T)) =    \mathbb{E}_{t\sim \mathcal{L}(T)}[D(\mathcal{L}(X|T=t)||\mathcal{L}(Y|T=t))],\]

and the right hand side is precisely \lambda D(P_1 || Q_1) + (1-\lambda)D(P_2||Q_2). \qquad\square

Corollary. The entropy function H is concave.

Proof for a finite alphabet. When the alphabet \mathcal{A} is finite, the corollary can be proven by noting that H(P)=\log{|\mathcal{A}|} - D(P||\mathrm{Unif}(\mathcal{A})). \qquad\square

Relative entropy and total variation distance

Consider the hypothesis testing problem of testing the null hypothesis H_0: X \sim P against the alternative hypothesis H_1: X \sim Q. A test is a measurable function T:\mathcal{A} \rightarrow \{0,1\}. Under the constraint P(T(X)=1) \leq \alpha, it can be shown that the optimal rate of decay of Q(T(X)=0) as a function of the sample size n is of the order of \exp{(-n\cdot D(P||Q))}. This means that D(P||Q) is the measure of how well one can distinguish between Q and P on the basis of data.

We will not prove this fact, but only introduce it to motivate that the relative entropy D is, in some sense, like a measure of distance between probability measures. However, it is not a metric since D(P||Q) \neq D(Q||P) and the triangle inequality does not hold. So in what sense does the relative entropy represent a distance? In fact, it controls several bona fide metrics on the space of probability measures. One example of such metric is the total variation distance.

Definition. Let P and Q be probability measures on \mathcal{A}. The total variation distance is defined as d_{\text{TV}}(P,Q)=\sup_{A \in \mathcal{B}(\mathcal{A})} |P(A)-Q(A)|.

The following are some simple facts about the total variation distance.

  1. 0 \leq d_{\text{TV}}(P,Q) \leq 1.
  2. If P and Q have probability density functions p and q with respect to some common probability measure \lambda, then d_{\text{TV}}(P,Q)= \frac{1}{2}||p-q||_{L^{1}(\lambda)}. To see this, define A=\{x\in \mathcal{A}, p(x)>q(x) \}. Then

        \begin{equation*} \begin{split} ||p-q||_{L^{1}(\lambda)} & =  \int_{A}(p(x)-q(x))\lambda(dx) + \int_{A^c}(q(x)-p(x))\lambda(dx)  \\  & = P(A) - Q(A) + (1-Q(A)-1+P(A)) \\  & = 2(P(A)-Q(A)) = 2 d_{\text{TV}}(P,Q)  \\ \end{split} \end{equation*}

  3. d_{\text{TV}}(P,Q)= \inf_{X \sim P, Y \sim Q}  \mathbb{P}(X\neq Y).

The following inequality shows that total variance distance is controlled by the relative entropy. This shows that the relative entropy is a strong notion of distance.

Pinsker’s inequality. d_{\text{TV}}(P,Q)^2 \leq \frac{1}{2} D(P||Q).

Proof. Without loss of generality, we can assume that P and Q have probability density functions p and q with respect to some common probability measure \lambda on \mathcal{A}. Let A=\{x\in \mathcal{A}, p(x)>q(x) \} and T=1_{A}(x).

Step 1: Prove this inequality by simple calculation in the case when \mathcal{A} contains at most 2 elements.

Step 2: Note that PT^{-1} and QT^{-1} are defined on the space \{0,1\}. So Pinsker’s inequality applies to PT^{-1} and QT^{-1}. Thus,

    \begin{equation*} \begin{split} D(P||Q) & \geq  D(PT^{-1}||QT^{-1}) \geq 2 d_{\text{TV}}(PT^{-1},QT^{-1})^2   \\  & = 2(P(A)-Q(A))^2 = 2 d_{\text{TV}}(P,Q)^2. \qquad\square \end{split} \end{equation*}

Law of small numbers

As a first illustration of an application of entropy to probability, let us prove a simple quantitative law of small numbers. An example of the law of small numbers is the well known fact that Bin(n,\frac{\lambda}{n})  \rightarrow Po(\lambda) in distribution as n goes to infinity. More generally, if X_1,...,X_n are Bernoulli random variables with X_i \sim Bern(p_i), if X_1,...,X_n are weakly dependent, and if none of the p_i dominates the rest, then \mathcal{L}(\sum_{i=1}^n X_i) \approx Po(\lambda) where \lambda = \sum_{i=1}^{n} p_i. This idea can be quantified easily using relative entropy.

Theorem. If X_i \sim Bern(p_i) and X_1,...,X_n may be dependent, then

    \[D(\mathcal{L}(\bar X) || Po(\lambda)) \leq \sum_{i=1}^n p_i^2 + D(\mathcal{L}(X_1,...,X_n) || \mathcal{L}(X_1) \otimes \cdots \otimes \mathcal{L}(X_n) )\]

where \bar X = \sum_{i=1}^nX_i and \lambda = \sum_{i=1}^n p_i.

Proof. Let Z_1,...,Z_n be independent random variables with Z_i \sim Po(p_i). Then \bar Z = \sum_{i=1}^n Z_i \sim Po(\lambda). We have

    \begin{equation*} \begin{split} D(\mathcal{L}(\bar X) || Po(\lambda) )  & =  D(\mathcal{L}(\bar X) || \mathcal{L}(\bar Z)) \\  & \leq   D(\mathcal{L}(X_1,...,X_n) || \mathcal{L}(Z_1,...,Z_n) )\\  & = \sum_{i=1}^{n} D(\mathcal{L}(X_i) || \mathcal{L}(Z_i) ) + D(\mathcal{L}(X_1,...,X_n) || \mathcal{L}(X_1) \otimes \cdots \otimes \mathcal{L}(X_n) ). \\ \end{split} \end{equation*}

To conclude, it is enough to note that

    \begin{equation*} \begin{split} D(\mathcal{L}(X_i) || \mathcal{L}(Z_i) )  & =  (1-p_i)\log{\frac{1-p_i}{e^{-p_i}}} + p_i\log{\frac{p_i}{p_i e^{-p_i}}}  \\  & =  p_i^2 + (1-p_i)(p_i+\log{(1-p_i)})  \\  & \leq p_i^2 .  \qquad\square \end{split} \end{equation*}

Remark. If p_1= \cdots = p_n = \frac{\lambda}{n} and X_1,...,X_n are independent, then the inequality in the theorem becomes D(Bin(n,\frac{\lambda}{n}) || Po(\lambda) ) \leq \frac{\lambda^2}{n}. However, this rate of convergence is not optimal. One can show that under the same condition, D(Bin(n,\frac{\lambda}{n}) || Po(\lambda) )= o(\frac{1}{n}), using tools similar to those that will be used later to prove the entropic central limit theorem. Note that it is much harder to prove D(\mathcal{L}(S_n)|| \mathcal{N}(0,1)) \rightarrow 0 in the entropic central limit theorem, even without rate of convergence!

Lecture by Mokshay Madiman | Scribed by Che-yu Liu

03. October 2013 by Ramon van Handel
Categories: Information theoretic methods | Comments Off on Lecture 2. Basics / law of small numbers

Lecture 1. Introduction

What is information theory?

The first question that we want to address is: “What is information?” Although there are several ways in which we might think of answering this question, the main rationale behind our approach is to distinguish information from data. We think of information as something abstract that we want to convey, while we think of data as a representation of information, something that is storable/communicable. This is best understood by some examples.

Example. Information is a certain idea, while data is the words we use to describe this idea.

Example. Information is 10^{1000,000,000}. Possible data describing this information are: 10^{1000,000,000}, 10\cdots 0, 10^{10^9}, 1 followed by a billion zeros.

As we are in a mathematical setting we want to rely on a quantitative approach. The main question that arises naturally is: “How can we measure information?” Making sense of this question requires us to have a model for how data is produced. Throughout this seminar we will consider the probabilistic model which we now introduce.

Definition (Probabilistic model). Data is a random variable X taking values on the space A (alphabet) having distribution P (source distribution). We write X\sim P.

To be precise, with the above we mean that there exists a probability space (\Omega,\mathcal{F},\mathbb{P}) and a measurable space (A,\mathcal{G}) with some measurable function X:\Omega\rightarrow A such that we have \mathbb{P}\circ X^{-1} = P.


  1. While this set-up is very similar to what is done in statistics, the focus in information theory is different. In statistics it is assumed that the data X comes from one of a family of distributions (statistical model), and the goal is to infer something about the particular distribution generating the data. On the other hand, in information theory the distribution of the data X might be known or not, and the goal is to compress or communicate X.
  2. In the probabilistic model we assume that the data is generated by a certain random source. This is a particular modeling assumption and it is not necessarily an expression of belief in how data are actually produced. This is a reasonable modeling assumption to make and it allows us to draw reasonable conclusions (for example, text data is clearly not randomly produced, but you can still do useful things by making the modeling assumption that the data was produced by a stochastic source).
  3. The original motivation behind the development of information theory as based on the probabilistic model came from a practical engineering problem (how to compress/communicate data), and not from the idea of how we measure information (although this aspect was also part of the motivation). The whole field of study was created by the 1948 paper of Claude Shannon.
  4. We are going to use the probabilistic model throughout this seminar, but other models exist as well. A popular model used in theoretical computer science is the algorithmic model (which defines the field of algorithmic information theory, as opposed to probabilistic information theory). In this model it is assumed that data is the output of some computer program (running on a Turing machine). While this approach could be thought of as a generalization of the probabilistic model (in fact, one way in which computers can work is to simulate from some probability distribution), many of the basic quantities in algorithmic information theory (like Kolmogorov complexity) are not computable. This is the reason why this field is suitable for theoretical insights, but it is not necessarily suitable for practical purposes.

How do we measure information in the probabilistic model?

In what follows we assume that the information to be conveyed coincides with the data X itself (and we now assume that X takes values in some countable set A), meaning that there is no universal hidden meaning that we are trying to convey apart from the data itself. For example, assume that the information we want to convey (a particular realization of it) is the text “This is the SAS”. A natural way of measuring the amount of information contained in this data is to look for other representations of this information, and to look for the smallest (in some sense that needs to be specified) representation. As we are in the probabilistic framework, we do not know in advance which data is going to be produced by the random source, so we look for a procedure that takes the random outcome and gives us on average the smallest or most compact representation of the information in that data.

Since the data is random, also the size of a particular realization (encoded data) is random. One way to take into account the randomness is to consider a representation (encoding scheme) that minimizes the expected length/size of the encoded data (and that is uniquely decodable). That is, we measure the amount of information in a given data X as

    \[\min_{\text{valid encoding schemes}} \mathbb{E}_P [\text{length of encoded data}].\]

If we set up things in this way, the measure of information is some functional H of the source distribution P, since P is the only quantity governing the data. This functional H is called the entropy and it is defined as follows.

Definition (Entropy). If P is a discrete distribution, then the entropy of P is defined as

    \[H(P) := \sum_{x\in A} p(x) \log\frac{1}{p(x)} ,\]

where we write p(x)=P(\{x\}) for the probability mass function of P.

While it can be shown that H(P) is the minimal expected length of validly encoded data, we do not proceed this way (the ideas behind this fact are covered in the first couple of lectures of an information theory class). Instead, we will give some intuition on why H(P) is a good measure of information.

We first provide some intuition on why the information in the statement X=a should be decreasing as a function of P(a). In fact, recall that presently we assume to know the source distribution P. If we know P, how informative is a particular outcome x form the source distribution? If P(X=x)=\delta_a(x) (i.e., P is a point mass at a), being informed that the random outcome is a is not informative. On the other hand, if P(X=a)=10^{-10}, being informed that the outcome is a is extremely informative/significant (something very rare has happened).

The relevant decreasing function turns out to be the following:

    \[p(x) \longrightarrow \log\frac{1}{p(x)}.\]

In this respect, \log\frac{1}{p(x)} corresponds to the information that we get from the statement X=x. So the average amount of information in the random outcome X is given by

    \[\mathbb{E}_P \left[ \log\frac{1}{p(X)} \right] = \sum_{x\in A} p(x) \log\frac{1}{p(x)} 	= H(P).\]

Connection between information theory and statistics

While the connection between information theory and statistics is not surprising as both fields rely on the probabilistic model, this correspondence is very strong and natural. We give some examples.

  1. Maximum likelihood estimators (MLE) can be seen as minimal codelength estimators. In a statistical model we assume that X\sim P_{\theta}, with \theta\in \Theta for some parameter space \Theta, and the goal is to find the parameter \theta that generated the data. A popular estimator is the MLE since it is plausible to assume that the parameter that generated the data X is the parameter \tilde\theta whose corresponding distribution would have given maximal probability to X, that is,

        \[\hat\theta := \mathop{\mathrm{argmax}}_{\tilde\theta\in\Theta} P_{\tilde\theta}(X).\]

    Note that we can rewrite the above as

        \[\hat\theta = \mathop{\mathrm{argmin}}_{\tilde\theta\in\Theta} \log \frac{1}{P_{\tilde\theta}(X)},\]

    which can be seen to correspond to the minimal number of bits required to represent X assuming that it was generated by P_{\tilde\theta} (codelength). Hence the connection between MLE in statistics and the minimal codelength estimator in information theory. In this setting we assume that we do not know the distribution generating the data and we try to find a good code to encode the data. The problem of finding a good code is in some sense equivalent to the problem of finding the distribution itself, since once you know the distribution you know the best code (in some sense). Also, we mention that many penalized-MLE estimators (where we take into account the complexity of the model by adding a penalty term to the MLE estimator) can be motivated from an information-theoretic point of view in terms of analogue of coding problems; this is the idea behind the “Minimum Description Length” principle.

  2. In Hypothesis testing, the optimal error exponents are information-theoretic quantities.

These are not just coincidental connections, but examples of basic relationships between fundamental limits of statistical inference on the one hand, and fundamental limits of communication and compression of data on the other hand.

We now turn to the main topic of this seminar, that is, the connection between information theory and probability theory.

Connection between information theory and probability theory

Since we are using a probabilistic model it is clear that probability theory is the language of information theory. However, it is not so obvious that information theory can say something fundamental about probability theory. In fact, in the past half century or so, it has been realized that information theory captures many fundamental concepts in probability theory. Before turning to one key example of such connection (the entropic central limit theorem) which will serve as motivation for the initial few lectures of the seminar, we introduce some relevant quantities.

Definition (Differential or Boltzmann-Shannon entropy). If X\in\mathbb{R}^n, X\sim P and \frac{dP}{d \text{Leb}}=f (i.e., X has a density f with respect to the Lesbegue measure), then the differential entropy of P (equivalently, differential entropy of f) is defined as

    \[h(P) := h(f) := - \int_{R^n} f(x) \log f(x) dx,\]

with the conventions 0\log 0 = 0 and dx = \text{Leb}(dx).

While we can think of h as a measure of disorder (particularly motivated by the setting introduced by Boltzmann in physics), h is not a measure of information in the same sense as H is. The reason is that in the present context of “continuous” data (recall that we are in \mathbb{R}^n and a possible outcome of X is a real number) we need infinitely many bits to encode each outcome of X, so it is not meaningful to talk of the amount of information in an outcome as this is generally infinity. Nonetheless, the differential entropy represents a crucial quantity in information theory, and shows up for example both when considering communication over channels with continuous noise distributions, and when considering lossy data compression (the only kind of data compression possible with sources like \mathbb{R}^n, where one accepts some slight distortion of the data in order to be able to encode it with finitely many bits).

The notion that unifies the continuous entropy h with the discrete entropy H previously introduced is the relative entropy which we now define.

Definition (Relative entropy). If P is a probability measure and Q is a \sigma-finite measure on A, then the relative entropy between P and Q is defined as

    \[D(P || Q) :=  	\begin{cases} 		\int f \log f \, dQ	&\text{if $P\ll Q$ with $\frac{dP}{dQ}=f$},\\ 		\infty		&\text{otherwise}. 	\end{cases}\]

Typically P and Q have respective densities p and q with respect to a given reference measure \lambda. Then the relative entropy reads

    \[D(P || Q) = \int p(x) \log \frac{p(x)}{q(x)} \,\lambda(dx).\]

The following examples show how the relative entropy relates h and H.

  1. If A is a countable set and \lambda is the counting measure, then

        \[D(P || \lambda) = \sum_{x\in A} p(x) \log p(x) = - H(P).\]

  2. If A=\mathbb{R}^n and \lambda is the Lesbegue measure, then

        \[D(P || \lambda) = \int p(x) \log p(x) \, dx = - h(P).\]

The following property of relative entropy is the most important inequality in information theory.

Lemma. Let P be a probability measure on A, and Q be a sub-probability measure on A (i.e., Q is a nonnegative, countably additive measure with 0<Q(A)\le 1). Then D(P || Q) \ge 0.

Proof. We only need to consider the case where P\ll Q. Let f=\frac{dP}{dQ} and R=Q/Q(A). Then we have

    \begin{align*} 	D(P || Q)  	&= \int f(x) \log f(x) \,Q(dx) 	= Q(A)\,\mathbb{E}_R\left[f(X)\log f(X)\right]\\ 	&\ge Q(A)\, \mathbb{E}_R[f(X)]\log\mathbb{E}_R[f(X)] = \log\frac{1}{Q(A)} \ge 0, \end{align*}

where we have applied Jensen’s inequality (which holds as R is a probability measure) using that x\mapsto x\log x is convex, and used that \mathbb{E}_R[f(X)]=P(A)/Q(A)=1/Q(A) and that Q(A)\le 1. \square

As a consequence of this result we can now show that the Gaussian distribution maximizes the entropy under constraints on the first two moments.

Lemma. Let \mathcal{P}_{\mu,\sigma^2} be the class of all probability densities on \mathbb{R} (with respect to Lebesgue measure) with mean \mu and variance \sigma^2 and define

    \[g_{\mu,\sigma^2}(x) := \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}.\]

Then h(g_{\mu,\sigma^2}) \ge h(f) for any f\in\mathcal{P}.

Proof. First of all note that

    \begin{align*} 	\int g_{\mu,\sigma^2}(x) \log g_{\mu,\sigma^2}(x) \, dx  	= \int f(x) \log g_{\mu,\sigma^2}(x) \, dx  \end{align*}

as \log g_{\mu,\sigma^2} is quadratic function and, consequently, only the first two moments are involved in computing its expectation. Hence, we have

    \begin{align*} 	h(g_{\mu,\sigma^2}) - h(f)  	&= - \int g_{\mu,\sigma^2}(x) \log g_{\mu,\sigma^2}(x) \, dx + \int f(x) \log f(x) \, dx\\ 	&= \int f(x) \log \frac{f(x)}{g_{\mu,\sigma^2}(x)} \,dx = D(f||g_{\mu,\sigma^2}) \ge 0. \qquad\square \end{align*}

We are now ready to present the first example of cross-fertilization where information-theoretic concepts can be used to capture fundamental properties in probability theory. Let us first recall the classical central limit theorem (CLT).

Theorem (CLT). If X_1,X_2,\ldots are i.i.d. real-valued random variables with mean 0 and variance 1, then

    \[S_n=\frac{1}{\sqrt{N}} \sum_{i=1}^N X_i \stackrel{\mathcal{D}}{\longrightarrow} N(0,1),\]

that is,

    \[\mathbb{P}\{S_n\in A\} \longrightarrow \frac{1}{\sqrt{2 \pi}} \int_A e^{-\frac{x^2}{2}} \,dx\]

for nice enough sets A.

If we denote by f_{S_n} the density of the normalized partial sum S_n introduced in the statement of the theorem above, we note the following.

  1. For each n\ge 1 we have S_n\in\mathcal{P}_{0,1}. This follows immediately from basic properties of expected values.
  2. From the previous lemma it follows immediately that

        \[\mathop{\mathrm{argmax}}_{f\in \mathcal{P}_{0,1}} h(f) = g_{0,1}.\]

So, the CLT tells us that the sequence f_{S_1}, f_{S_2}, \ldots\in\mathcal{P}_{0,1} converges to the maximizer g_{0,1} of the entropy in \mathcal{P}_{0,1}. In fact, it turns out that the convergence in the central limit theorem can be studied in terms of the entropy and that the CLT is an expression of increasing entropy, as the following entropic central limit theorem describes.

Theorem (Entropic CLT). Let X_1,X_2,\ldots be i.i.d. real-valued random variables with mean 0 and variance 1, and assume the distribution of X_1 has a density (with respect to Lebesgue measure). Under minimal assumptions (specifically, that h(f_{S_n})>-\infty for some n), we have

    \[h(f_{S_n}) \uparrow h(g_{0,1}),\]

or, equivalently,

    \[D(f_{S_n}||g_{0,1}) \downarrow 0.\]

The entropic central limit theorem is remarkable as usually limit theorems do not come with an associated monotonicity statement. This suggests that the relative entropy is a natural tool to analyze the CLT.

Of course, a natural question that presents itself is whether other limit theorems in probability can be understood from a similar information-theoretic point of view.

Plan for future lectures

In the next lecture or two we will present a full proof of the entropic central limit theorem, and also discuss briefly how other limit theorems can be analogously understood from this information-theoretic point of view. Later, we will look at finer behavior than limit theorems, for instance we may look at how information theory can provide insights into large deviations and concentration inequalities.

Lecture by Mokshay Madiman | Scribed by Patrick Rebeschini

25. September 2013 by Ramon van Handel
Categories: Information theoretic methods | Comments Off on Lecture 1. Introduction

Fall 2013: Information theoretic methods

While information theory has traditionally been based on probabilistic methods, ideas and methods from information theory have recently played an increasingly important role in various areas of probability theory itself (as well as in statistics, combinatorics, and other areas of mathematics). The goal of these informal lectures is to introduce some topics and tools at the intersection of probability theory and information theory.  No prior knowledge of information theory will be assumed.  Potential topics include: entropic central limit theorems, entropic Poisson approximation, and related information-theoretic and probabilistic inequalities; connections to logarithmic Sobolev inequalities and Stein’s method; entropic inequalities for additive, matroidal, and tree structures, and their applications; transportation cost-information inequalities and their relation to concentration of measure; basic large deviations theory.

Prerequisites: Probability at the level of ORF 526 is assumed.

Time and location: Thursdays, 4:30-6:00, Bendheim Center classroom 103.
The first lecture will be on September 19.


07. September 2013 by Ramon van Handel
Categories: Information theoretic methods | Comments Off on Fall 2013: Information theoretic methods

End of Spring 2013 seminar

That was it for the Spring 2013 stochastic analysis seminar! We will be back in the Fall. The topic and location/time slot will be announced in September.

03. May 2013 by Ramon van Handel
Categories: Announcement | Comments Off on End of Spring 2013 seminar

Lectures 7/8. Games on random graphs

The following two lectures by Rene Carmona are on games on random graphs.
Many thanks to Patrick Rebes­chini for scrib­ing these lectures!

In the following we discuss some results from the paper “Connectivity and equilibrium in random games” by Daskalakis, Dimakis, and Mossel. We define a random game on a random graph and we characterize the graphs that are likely to exhibit Nash equilibria for this game. We show that if the random graph is drawn from the Erdös-Rényi distribution, then in the high connectivity regime the law of the number of pure Nash equilibria converges toward a Poisson distribution, asymptotically, as the size of the graph is increased.

Let G=(V,E) be a simple (that is, undirected and with no self-edges) graph, and for each v\in V denote by N(v) the set of neighbors of v, that is, N(v):=\{v'\in V : (v,v')\in E\}. We think of each vertex in V as a player in the game that we are about to introduce. At the same time, we think of each edge (v,v')\in E as a strategic interaction between players v and v'.

Definition (Game on a graph). For each v\in V let S_v represent the set of strategies for player v, assumed to be a finite set. We naturally extend this definition to include families of players: for each A\subseteq V, let S_A= \times_{v\in A} S_v be the set of strategies for each player in A. For each v\in V, denote by u_v:  S_v\times S_{N(v)}\ni(\sigma_v,\sigma_{N(v)})\rightarrow u_v(\sigma_v,\sigma_{N(v)}) \in \mathbb{R} the reward function for player v. A game is a collection (S_v,u_v)_{v\in V}.

The above definition describes a game that is static, in the sense that the game is played only once, and local, in the sense that the reward function of each player depends only on its own strategy and on the strategy of the players in its neighbors. We now introduce the notion of pure Nash equilibrium.

Definition (Pure Nash equilibrium). We say that \sigma \in S_V is a pure Nash equilibrium (PNE) if for each v\in V we have

    \[u_v(\sigma_v,\sigma_{N(v)}) \ge u_v(\tau,\sigma_{N(v)}) 	\qquad\text{for each $\tau\in S_v$}.\]

A pure Nash equilibrium represents a state where no player can be better off by changing his own strategy if he is the only one who is allowed to do so. In order to investigate the existence of a pure Nash equilibrium it suffices to study the best response function defined below.

Definition (Best response function). Given a reward function u_v for player v\in V, we define the best response function \operatorname{BR}_v for v as

    \[\operatorname{BR}_v(\sigma_v,\sigma_{N(v)}) := 	\begin{cases}  		1	& \text{if } \sigma_v\in\arg\sup_{\tau\in S_v} u_v(\tau,\sigma_{N(v)}),\\ 		0	& \text{otherwise}. 	\end{cases}\]

Clearly, \sigma is a pure Nash equilibrium if and only if \operatorname{BR}_v(\sigma_v,\sigma_{N(v)})=1 for each v\in V. We now define the type of random games that we will be interested in; in order to do so, we need to specify the set of strategies and the reward function for each player.

Definition (Random game on a fixed graph). For a graph G=(V,E) and an atomless probability measure \mu on \mathbb{R}, let \mathcal{D}_{G,\mu} be the associated random game defined as follows:

  1. S_v=\{0,1\} for each v\in V;
  2. \{ u_v(\sigma_v,\sigma_{N(v)}) \}_{v\in V, \sigma_v \in S_v, \sigma_{N(v)} \in S_{N(v)}} is a collection of independent identically distributed random variables with distribution \mu.

Remark. For each game \mathcal{D}_{G,\mu} the family \{\operatorname{BR}_v(0,\sigma_{N(v)})\}_{v\in V, \sigma_{N(v)}\in S_{N(v)}} is a collection of independent random variables that are uniformly distributed in \{0,1\}, and for each v\in V, \sigma_{N(v)}\in S_{N(v)} we have \operatorname{BR}_v(1,\sigma_{N(v)}) = 1-\operatorname{BR}_v(0,\sigma_{N(v)}) almost surely. In fact, note that \operatorname{BR}_v(0,\sigma_{N(v)})=1 if and only if u_v(0,\sigma_{N(v)}) \ge u_v(1,\sigma_{N(v)}) and this event has probability 1/2 since the two random variables appearing on both sides of the inequality sign are independent with the same law \mu and \mu is atomless. As far as the analysis of the existence of pure Nash equilibria is concerned, we could take the present notion of best response functions as the definition of our random game on a fix graph. In fact, note that the choice of \mu in \mathcal{D}_{G,\mu} does not play a role in our analysis, and we would obtain the same results by choosing different (atomless) distributions for sampling (independently) the reward function of each player.

Denote by G(n,p) the distribution of a Erdös-Rényi random graph with n vertices where each edge is present independently with probability p. We now introduce the notion of a random game on a random graph.

Definition (Random game on a ramon graph). For each n\in\mathbb{N}, p\in(0,1) and each probability measure \mu on \mathbb{R}, do the following:

  1. choose a graph G from G(n,p);
  2. choose a random game from \mathcal{D}_{G,\mu} for the graph G.

Henceforth, given a random variable X let \mathcal{L}(X) represent its distribution. Given two measures \mu and \nu on a measurable space, define the total variation distance between \mu and \nu as

    \[\| \mu - \nu \|_{TV} := \sup_{f: \| f\| \le 1} | \mu f - \nu f |,\]

where the supremum is taken over measurable functions such that the supremum norm is less than or equal to 1.

We are now ready to state the main theorem that we will prove in the following (Theorem 1.9 in Daskalakis, Dimakis, and Mossel).

Theorem 1 (High connectivity regime). Let Z^{n,p} be the number of pure Nash equilibria in the random game on random graph defined above. Let define the high-connectivity regime as

    \[p(n)= \frac{(2+\varepsilon(n)) \log n}{n},\]

where \varepsilon:\mathbb{N}\rightarrow\mathbb{R} satisfies the following two properties:

    \begin{align*} 	\varepsilon (n) &> c &\text{for each $n\in\mathbb{N}$, for some $c>0$,}\\ 	\varepsilon (n) &\le \frac{n}{\log n} - 2 &\text{for each $n\in\mathbb{N}$.} \end{align*}

Then, we have

    \[\mathbf{P}_{G(n,p)} \{ \| \mathcal{L}(Z^{n,p(n)}) - \mathcal{L}(N_1) \|_{TV}  	\le O(n^{-\varepsilon/8}) + e^{-\Omega(n)} \}  	\ge 1- \frac{2}{n^{\varepsilon/8}},\]

where \mathbf{P}_{G(n,p)} denotes the conditional probability given the graph G(n,p) and N_1 is a Poisson random variable with mean 1. In particular,

    \[\lim_{n\rightarrow\infty} \mathcal{L}(Z^{n,p(n)}) = \mathcal{L}(N_1),\]

which shows that in this regime a pure Nash equilibrium exists with probability converging to 1-\frac{1}{e} as the size of the network increases.

Remark. Using the terminology of statistical mechanics, the first result in Theorem 1 represents a quenched-type result since it involves the conditional distribution of a system (i.e., the game) given its environment (i.e., the graph). On the other hand, the second result represents an annealed-type result, where the unconditional probability is considered.

In order to prove Theorem 1 we need the following lemma on Poisson approximations. The lemma is adapted from the results of R. Arratia, L. Goldstein, and L. Gordon (“Two moments suffice for Poisson approximations: the Chen-Stein method“, Ann. Probab. 17, 9-25, 1989) and it shows how the total variation distance between the law of a sum of Bernoulli random variables and a Poisson distribution can be bounded by the first and second moments of the Bernoulli random variables. This result is a particular instance of the Stein’s method in probability theory.

Lemma 2 (Arratia, Goldstein, and Gordon, 1989). Let \{X_i\}_{i=0,1,\ldots,N} be a collection of Bernoulli random variables with p_i:=\mathbf{P}\{X_i=1\}. For each i\in\{0,1,\ldots,N\} let B_i\subseteq \{0,1,\ldots,N\} be such that \{X_j\}_{j\in B^c_i} is independent of X_i. Define

    \begin{align*} 	b_1 := \sum_{i=0}^N \sum_{j\in B_i} p_ip_j,\qquad\qquad 	b_2 := \sum_{i=0}^N \sum_{j\in B_i\setminus \{i\}} p_{ij}, \end{align*}

where p_{ij} := \mathbf{P}\{X_i=1,X_j=1\}. Define Z:=\sum_{i=0}^N X_i and \lambda := \mathbf{E} Z = \sum_{i=0}^N p_i. If N_{\lambda} is a Poisson random variable with mean \lambda, then

    \[\| \mathcal{L} (Z) - \mathcal{L} (N_\lambda) \|_{TV} \le 2 (b_1+b_2).\]

Proof. We define the following operators that act on each function f : \mathbb{N}\rightarrow\mathbb{R}:

    \begin{align*} 	[Df](n) &:= f(n+1)-f(n)&\text{for each $n\in\mathbb{N}$},\\ 	[Tf](n) &:= nf(n)-\lambda f(n+1)&\text{for each $n\in\mathbb{N}$},\\ 	[Sf](n+1) &:= - \frac{\mathbf{E}[f(N_\lambda) \mathbf{1}_{N_\lambda \le n}]} 	{\lambda \mathbf{P}\{N_\lambda = n\}} &\text{for each $n\in\mathbb{N}$}, 	& &[Sf](0):=0. \end{align*}

We point out that T characterizes N_\lambda in the sense that \mathbf{E} [[Tf](M)] = 0 if and only if M is a Poisson random variable with mean \lambda; T is an example of Stein’s operator. First of all, we show that for each f : \mathbb{N}\rightarrow\mathbb{R} we have TSf = f. In fact, if n=0 we have

    \[[TSf](0) = T[Sf](0) = - \lambda [Sf](1)  	= \frac{\mathbf{E}[f(N_\lambda) \mathbf{1}_{N_\lambda \le 0}]} 	{\mathbf{P}\{N_\lambda = 0\}} = f(0)\]

and if n\ge 1 we have

    \begin{align*} 	[TSf](n) &= T[Sf](n) =  n[Sf](n) - \lambda [Sf](n+1)\\ 	&= -n \frac{\mathbf{E}[f(N_\lambda) \mathbf{1}_{N_\lambda \le n-1}]} 	{\lambda \mathbf{P}\{N_\lambda = n-1\}}  	+ \frac{\mathbf{E}[f(N_\lambda) \mathbf{1}_{N_\lambda \le n}]} 	{\mathbf{P}\{N_\lambda = n\}}\\ 	&= \frac{\mathbf{E}[f(N_\lambda) \mathbf{1}_{N_\lambda \le n}] 	- \mathbf{E}[f(N_\lambda) \mathbf{1}_{N_\lambda \le n-1}]} 	{\mathbf{P}\{N_\lambda = n\}}\\ 	&= f(n). \end{align*}

For each i\in\{0,1,\ldots,N\} define \tilde Z_i := \sum_{j\in B^c_i}X_j and Y_i := Z-X_i = \sum_{j\in\{0,1,\ldots,N\}\setminus \{i\}} X_j. The following properties hold:

  1. \tilde Z_i \le Y_i \le Z,
  2. X_i f(Z) = X_i f(Y_i+1),
  3. f(Y_i+1)-f(Z+1) = X_i[f(Y_i+1)-f(Y_i+2)].

In what follows, consider any given function h:\mathbb{N}\rightarrow \mathbb{R} such that \| h\| := \sup_{n\in\mathbb{N}} |h(n)| \le 1. Define the function \tilde h as \tilde h(n):=h(n)-\mathbf{E}h(N_\lambda) for each n\in\mathbb{N}, and let f:= S \tilde h. From what was seen above we have Tf=\tilde h and we get

    \begin{align*} 	\mathbf{E} [h(Z)-h(N_\lambda)] =& \ \mathbf{E} \tilde h (Z) = \mathbf{E} [Tf] (Z) 	= \mathbf{E} [Zf(Z)-\lambda f (Z+1)] \\ 	=& \ \sum_{i=0}^N \mathbf{E} [X_i f(Z) - p_i f(Z+1)] 	\stackrel{\text{(ii)}}{=} \sum_{i=0}^N \mathbf{E} [X_i f(Y_i+1) - p_i f(Z+1)]\\ 	=& \ \sum_{i=0}^N \mathbf{E} [p_i f(Y_i+1) - p_i f(Z+1)]  	+ \sum_{i=0}^N \mathbf{E} [(X_i - p_i)f(Y_i+1)]\\ 	\stackrel{\text{(iii)}}{=}& \ \sum_{i=0}^N p_i \mathbf{E} [X_i (f(Y_i+1) - f(Y_i+2))] \\ 	& \ + \sum_{i=0}^N \mathbf{E} [(X_i - p_i)(f(Y_i+1)-f(\tilde Z_i + 1))]  	+ \sum_{i=0}^N \mathbf{E} [(X_i - p_i)f(\tilde Z_i + 1)]. \end{align*}

The first term is bounded above by \| Df \|\sum_{i=0}^N p^2_i while the third term is equal to 0 since \tilde Z_i is independent of X_i. In order to bound the second term we want to rewrite each term f(Y_i+1)-f(\tilde Z_i + 1) as a telescoping sum. In what follow fix i\in\{0,1,\ldots,N\}, label the elements of B_i\setminus\{i\} as \{j_1,\ldots,j_K\} and define

    \begin{align*} 	U_0 &:= \tilde Z_i = \sum_{j\in B^c_i} X_{j},\\ 	U_{k} &:= U_{k-1} + X_{j_k} \qquad \text{for $k\in\{1,\ldots,K\}$}. \end{align*}

Noticing that U_K=Y_i, we have

    \[f(Y_i+1)-f(\tilde Z_i + 1) = \sum_{k=1}^K [ f(U_{k-1}+X_{j_k}+1)  	- f(U_{k-1}+1)]\]

and we get

    \begin{align*} 	\lefteqn{\mathbf{E} [(X_i - p_i)(f(Y_i+1)-f(\tilde Z_i + 1))]=}\\ 	&\qquad\qquad= 	\sum_{k=1}^K \mathbf{E} [(X_i - p_i)(f(U_{k-1}+X_{j_k}+1) - f(U_{k-1}+1))]\\ 	&\qquad\qquad=  	\sum_{k=1}^K \mathbf{E} [(X_i - p_i) X_{j_k} (f(U_{k-1}+2) - f(U_{k-1}+1))]\\ 	&\qquad\qquad\le \| Df \| \sum_{k=1}^K \mathbf{E} [(X_i + p_i)X_{j_k}] 	= \| Df \| \sum_{k=1}^K (p_{ij_k} + p_ip_{j_k})\\ 	&\qquad\qquad= \| Df \| \sum_{j\in B_i\setminus \{i\}} ( p_{ij} + p_i p_j). \end{align*}

Therefore, combining all together we get

    \begin{align*} 	\mathbf{E} [h(Z)-h(N_\lambda)] &\le \| Df \|\sum_{i=0}^N p^2_i  	+ \| Df \| \sum_{i=0}^N \sum_{j\in B_i\setminus \{i\}} ( p_{ij} + p_i p_j)\\ 	&= \| Df \| \bigg( \sum_{i=0}^N \sum_{j\in B_i \setminus \{i\}} p_{ij}  	+ \sum_{i=0}^N\sum_{j\in B_i} p_i p_j \bigg)\\ 	&= \| Df \| (b_1+b_2). \end{align*}

Since the total variation distance can be characterized in terms of sets as

    \[\| \mathcal{L} (Z) - \mathcal{L} (N_\lambda) \|_{TV}  	= 2 \sup_{A\subset \mathbb{N}} | \mathbf{P}(Z\in A) - \mathbf{P}(N_\lambda\in A) |,\]

from this point on we restrict our analysis to indicator functions, which are easier to deal with than generic functions. For each A\subset\mathbb{N} define h_A:=\mathbf{1}_A, \tilde h_A := h_A-\mathbf{E}h_A(N_\lambda) = \mathbf{1}_A-\mathbf{P}\{N_\lambda \in A\} and f_A := S \tilde h_A. The previous result yields

    \[\| \mathcal{L} (Z) - \mathcal{L} (N_\lambda) \|_{TV}  	= 2 \sup_{A\subset \mathbb{N}} |\mathbf{E} [h_A(Z)-h_A(N_\lambda)]| 	\le 2 (b_1+b_2) \sup_{A\subset \mathbb{N}} \| Df_A \|\]

and the proof of the Lemma is concluded if we show that \| Df_A \| \le 1 for each A\subset \mathbb{N}. In fact, in what follows we will show that

    \[\sup_{A\subset \mathbb{N}} \| Df_A \| \le \frac{1-e^{-\lambda}}{\lambda},\]

where the right hand side is clearly upper bounded by 1. The proof that we are going to present is contained in the Appendix of “Poisson Approximation for Some Statistics Based on Exchangeable Trials” by Barbour and Eagleson.

First of all, note that for each A\subset \mathbb{N} we have f_A(0) = 0 and

    \begin{align*} 	f_A(n+1) &= - \frac{\mathbf{E}[\tilde h_A(N_\lambda) \mathbf{1}_{N_\lambda \le n}]} 	{\lambda \mathbf{P}\{N_\lambda = n\}} = \frac{e^\lambda n!}{\lambda^{n+1}}  	(\mathbf{P}\{N_\lambda \in A\} \mathbf{P}\{N_\lambda \le n\}  	- \mathbf{P}\{N_\lambda \in A \cap \{0,1,\ldots,n\}\}), \end{align*}

for each n\in\mathbb{N}. From this expression it is clear that f_A = \sum_{j\in A} f_{\{j\}} for each A\subset\mathbb{N}, which suggests that we can restrict our analysis to singletons. For each j\in\mathbb{N} we have

    \begin{align*} 	-f_{\{j\}}(n+1) = 	\begin{cases} 		-\frac{\lambda^j n!}{\lambda^{n+1}j!} \mathbf{P}\{N_\lambda \le n\}  		& \text{if }n < j\\ 		\frac{\lambda^j n!}{\lambda^{n+1}j!} \mathbf{P}\{N_\lambda > n\}  		& \text{if }n \ge j\\ 	\end{cases} \end{align*}

and from the series expansion of the Poisson probabilities it is easy seen that the function n\in\mathbb{N} \longrightarrow -f_{\{j\}}(n) is negative and decreasing if n \le j and is positive and decreasing if n \ge j+1. Hence, the only positive value taken by the difference function n\in\mathbb{N} \longrightarrow -f_{\{j\}}(n+1)+f_{\{j\}}(n) corresponds to the case n=j that can be bounded as follows:

    \begin{align*} 	-f_{\{n\}}(n+1)+f_{\{n\}}(n) &= \frac{1}{\lambda} \mathbb{P}\{N_\lambda > n\} 	+ \frac{1}{n} \mathbb{P}\{N_\lambda \le n-1\}\\ 	&= \frac{e^{-\lambda}}{\lambda} \left( \sum_{k=n+1}^\infty \frac{\lambda^k}{k!} 	+ \sum_{k=1}^n \frac{\lambda^k}{k!} \frac{k}{n} \right) \\ 	&\le \frac{e^{-\lambda}}{\lambda} (e^\lambda - 1) = \frac{1-e^{-\lambda}}{\lambda}. \end{align*}

Therefore, for each A\subset\mathbb{N}, n\in\mathbb{N} we have

    \begin{align*} 	- Df_A(n) &= -f_A(n+1) + f_A(n) = \sum_{j\in A} (-f_{\{j\}}(n+1) + f_{\{j\}}(n))\\ 	&= \mathbf{1}_A (n) (-f_{\{n\}}(n+1) + f_{\{n\}}(n)) 	+ \sum_{j \in A : j\neq n} (-f_{\{j\}}(n+1) + f_{\{j\}}(n))\\ 	&\le \frac{1-e^{-\lambda}}{\lambda}. \end{align*}

Noticing that f_{A^c} = - f_A, then for each A\subset\mathbb{N}, n\in\mathbb{N} we also get

    \begin{align*} 	Df_A(n) \le - Df_{A^c}(n) \le \frac{1-e^{-\lambda}}{\lambda}, \end{align*}

and we proved that \sup_{A\subset \mathbb{N}} |Df_A| \le \frac{1-e^{-\lambda}}{\lambda}. \square

We now introduce the notation that will naturally allow us to use Lemma 2 to prove Theorem 1. We label the pure strategy profiles in S_V=\times_{v\in V} S_v as \{\sigma^0,\sigma^1,\ldots,\sigma^N\}, where N:=2^n-1 and as always n=|V|:=\mathop{\mathrm{card}} V. Often times, it will be convenient to use the labels 1,2,\ldots,n to enumerate the vertices of the graph G. Accordingly, one can think of the strategy profiles \sigma^i defined in a specific way, for example by positing that \sigma^i(1)\sigma^i(2)\ldots\sigma^i(n) is the binary decomposition of i. In particular \sigma^0 becomes the strategy where each player plays zero, that is, \sigma^0_v=0 for each v\in V, and \sigma^N the strategy where each player plays one, that is, \sigma^N_v=1 for all v\in V. For each i\in\{0,1,\ldots,N\} define

    \[X_i := 	\begin{cases}  		1	& \text{if $\sigma^i$ is a pure Nash equilibrium},\\ 		0	& \text{otherwise}. 	\end{cases}\]

Clearly the quantity Z:=\sum_{i=0}^N X_i identifies the number of pure Nash equilibria and Z>0 corresponds to the existence of a pure Nash equilibrium. We recall that both the randomness in the choice of the graph and the randomness in the choice of the game are embedded in the random variables \{X_i\}_{i\in\{0,1,\ldots,N\}}. Note that conditionally on a given graph G sampled form G(n,p) we have, for each i\in\{0,1,\ldots,N\},

    \begin{align*} 	\mathbf{E}_{G} X_i  	&= \mathbf{P}_{G} \{\text{$\sigma^i$ is a pure Nash equlibrium}\}\\ 	&= \mathbf{P}_{G} \{\operatorname{BR}_v(\sigma^i_v,\sigma^i_{N(v)})  	= 1 \  v\in V \}\\ 	&= 2^{-n}, \end{align*}

from which it follows that

    \begin{align*} 	\mathbf{E}_{G} Z  	= \sum_{i=0}^N \mathbf{E}_{G} X_i = (1+N) 2^{-n} = 1. \end{align*}

That is, the current definition of a game on a fixed graph implies that the expected number of pure Nash equilibria is 1 for any given graph. It follows that also \mathbf{E} Z = 1. Notice that Theorem 1 adds more information on the table since it describes the asymptotic distribution of Z\equiv Z^{n,p(n)} on a particular regime of the Erdös-Rényi random graph.

In the way we set up the stage it seems tempting to apply Lemma 2 to the random variables \{X_i\}_{i\in\{0,1,\ldots,N\}} just defined. However, this approach is not fruitful since, apart from trivial cases, any given random variable X_i has a neighborhood of dependence B_i that coincides with the entire set \{0,1,\ldots,N\}. To see this, consider any two strategy profiles \sigma^i and \sigma^j. As long as there exists v\in V such that \sigma^i_v=\sigma^j_v, then we can always find a realization of the graph G such that v does not have any edges attached to it, that is, N(v)=\varnothing; this implies that \operatorname{BR}_v(0,\sigma^i_{N(v)}) = \operatorname{BR}_v(0,\sigma^j_{N(v)}) and, consequently, it implies that X_i and X_j are not independent. Therefore, only X_0 and X_N are independent, where \sigma^0_v:=0 and \sigma^N_v:=1 for each v\in V. However, Lemma 2 can be fruitfully applied to the random variables \{X_i\}_{i\in\{0,1,\ldots,N\}} when we look at them conditionally on a given graph realization, as the following Lemma demonstrates (Lemma 2.2 of Daskalakis, Dimakis, Mossel).

Lemma 3. Let G=(V,E) be a graph. Define

    \[B_0 := \{ j\in\{0,1,\ldots,N\}  	: \exists v \in V \text{ such that } \sigma^j_{v'}=0  	\text{ for each $v'$ satisfying } (v,v') \in E \}\]

and for each i\in\{1,2,\ldots,N\} define

    \[B_i := \{ j \in \{0,1,\ldots,N\} : \sigma^j = \sigma^i \oplus \sigma^k  	\text{ for some } k\in B_0\},\]


    \[\sigma^i \oplus \sigma^j := (\sigma^i(1) \oplus \sigma^j(1),\ldots,  	\sigma^i(|V|) \oplus \sigma^j(|V|))\]

and \oplus is the exclusive-or operation. Then, for each i\in\{0,1,\ldots,N\} we have that X_i is independent of \{X_j\}_{j\in B^c_i}.

Proof.We first show that X_0 is independent of \{X_j\}_{j\in B^c_0}. In order to make the independency structure manifest, we characterize B^c_0. Recall that B_0 is the index set of the pure strategy profiles for which there exists a player having all his neighbors playing 0. Therefore, each strategy profile corresponding to an index set in B^c_0 is characterized by the fact that each player has at least one neighbor who is playing 1. Hence, for each j\in B^c_0 we have \sigma^0_{N(v)} \neq \sigma^j_{N(v)} for all v\in V and, consequently, the events

    \[\{X_0=\alpha\} = \{\operatorname{BR}_v(\sigma^0_v,\sigma^0_{N(v)})  	= \alpha \ \forall v\in V\} \qquad \alpha\in \{0,1\}\]

are independent of the events

    \[\{X_j=\alpha_j\} = \{\operatorname{BR}_v(\sigma^j_v,\sigma^j_{N(v)}) = \alpha_j 	 \ \forall v\in V\}  	 \qquad \alpha_j\in \{0,1\}, j\in B^c_0,\]

which proves our claim.

We now generalize this result to show that X_i is independent of \{X_j\}_{j\in B^c_i}, for each i\in \{1,2,\ldots,N\}. For any i\in\{0,1,\ldots,N\}, note that the exclusive-or map with respect to \sigma^i preserves the differences in the strategy profiles (and, of course, it also preserves the equalities). That is, if \sigma^j and \sigma^k are such that \sigma^j_v\neq \sigma^k_v for some v\in V, then also \sigma^i\oplus\sigma^j and \sigma^i\oplus\sigma^k are such that (\sigma^i\oplus\sigma^j)_v \neq (\sigma^i\oplus\sigma^k)_v. Therefore,

    \[\sigma^0_{N(v)} \neq \sigma^j_{N(v)} \qquad \text{for each $v\in V$}\]

holds true if and only if

    \[\sigma^i_{N(v)}  	= (\sigma^i\oplus\sigma^0)_{N(v)} \neq (\sigma^i\oplus\sigma^j)_{N(v)} 	\qquad\text{for each $v\in V$}\]

holds true. Equivalently stated, j\in B^c_0 if and only if k\in B^c_i, where k is the index set such that \sigma^k = \sigma^i\oplus\sigma^j. Hence, the proof is concluded once we notice that \sigma^i=\sigma^i\oplus\sigma^0 and that B_i is defined as the index set of the pure strategy profiles that are obtained by an exclusive-or map with respect to \sigma^i of a strategy profile in B_0. \square

For a given graph G=(V,E) with |V|=n, define p_i(G):=\mathbf{P}_G\{X_i=1\}=2^{-n}=(N+1)^{-1}, where N:=2^{n}-1 and \mathbf{P}_G represents the conditional probability conditionally on the graph G. Define B_i(G)\subseteq \{0,1,\ldots,N\} as in Lemma 3; then, conditionally on G we have that \{X_j\}_{j\in B^c_i(G)} is independent of X_i. Define

    \begin{align*} 	b_1(G) &:= \sum_{i=0}^N \sum_{j\in B_i(G)} p_i(G)p_j(G)  	= \frac{1}{(N+1)^2}\sum_{i=0}^N |B_i(G)|,\\ 	b_2(G) &:= \sum_{i=0}^N \sum_{j\in B_i(G)\setminus \{i\}} p_{ij}(G), \end{align*}

where p_{ij}(G) := \mathbf{P}_G\{X_i=1,X_j=1\}. Define Z:=\sum_{i=0}^N X_i and recall that \mathbf{E} Z = \mathbf{E} \mathbf{E}_G Z = 1. If N_1 is a Poisson random variable with mean 1, then Lemma 2 yields

    \[\| \mathcal{L}_G (Z) - \mathcal{L} (N_1) \|_{TV} \le 2 (b_1(G)+b_2(G)).\]

At this point, let us introduce the following two lemmas (Lemmas 2.4 and 2.5 of Daskalakis, Dimakis, Mossel).

Lemma 4. If G is sampled from the Erdös-Rényi distribution G(n,p), we have

    \begin{align*} 	\mathbf{E}[b_1(G)] &\le  	R(n,p) := \sum_{s=0}^n \binom{n}{s} 2^{-n} \min\{1,n(1-p)^{s-1}\},\\ 	\mathbf{E}[b_2(G)] &\le  	S(n,p) := \sum_{s=1}^n \binom{n}{s} 2^{-n} [(1+(1-p)^s)^{n-s} - (1-(1-p)^s)^{n-s}]. \end{align*}

Lemma 5. Under the assumptions of Theorem 1 there exists \alpha',\alpha'',\beta',\beta''\in \mathbb{R}_+ and n_0',n_0''\in\mathbb{N}_+ such that

    \begin{align*} 	R(n,p) &\le \alpha' n^{-\varepsilon/4} + e^{-\beta' n} &\text{for each $n>n_0'$},\\ 	S(n,p) &\le \alpha'' n^{-\varepsilon/4} + e^{-\beta'' n} &\text{for each $n>n_0''$}. \end{align*}

We now show how the proof of Theorem 1 follows easily from the two lemmas above.

Proof of Theorem 1. Let \alpha',\alpha'',\beta',\beta''\in \mathbb{R}_+ and n_0',n_0''\in\mathbb{N}_+ as in Lemma 5. Define \alpha^\star:=\max\{\alpha',\alpha''\}, \beta^\star:=\min\{\beta',\beta''\} and n_0^\star:=\max\{n_0',n_0''\}. Clearly, by Lemma 5 we have

    \begin{align*} 	R(n,p) &\le \alpha^\star n^{-\varepsilon/4}  	+ e^{-\beta^\star n} &\text{for each $n>n_0^\star$},\\ 	S(n,p) &\le \alpha^\star n^{-\varepsilon/4}  	+ e^{-\beta^\star n} &\text{for each $n>n_0^\star$}. \end{align*}

Define the event

    \[A_n := \{ \max\{b_1(G),b_2(G)\} 	\le 2\alpha^\star n^{-\varepsilon/8} + e^{- \beta^\star n} \}.\]

By the Markov inequality and the previous asymptotic bounds, for each n>n_0^\star we have

    \begin{align*} 	\mathbf{P}\{A^c_n\} &\le \mathbf{P}\{ b_1(G) > 	2\alpha^\star n^{-\varepsilon/8} + e^{- \beta^\star n} \} 	+ \mathbf{P}\{ b_2(G) > 	2\alpha^\star n^{-\varepsilon/8} + e^{- \beta^\star n} \}\\ 	&\le \mathbf{P}\{ b_1(G) > 	2\alpha^\star n^{-\varepsilon/8} \} 	+ \mathbf{P}\{ b_2(G) > 	2\alpha^\star n^{-\varepsilon/8} \}\\ 	&\le \frac{\mathbf{E}[b_1(G)]}{2\alpha^\star n^{-\varepsilon/8}} 	+ \frac{\mathbf{E}[b_2(G)]}{2\alpha^\star n^{-\varepsilon/8}}\\ 	&\le n^{-\varepsilon/8} + \frac{e^{-\beta^\star n}}{\alpha^\star  	n^{-\frac{\varepsilon}{8}}}, \end{align*}

where we used the result of Lemma 4 and the above estimates for R(n,p) and S(n,p). Since \varepsilon> c for some c\in\mathbb{R}_+, then there clearly exists n_0^{\star'} such that

    \[\mathbf{P}\{A^c_n\} \le 2 n^{-\varepsilon/8}\qquad\text{for each $n>n_0^{\star'}$}.\]

Hence, we have that \mathbf{P}\{A_n\} \ge 1-2n^{-\frac{\varepsilon}{8}} for n>n_0^{\star'}. Let us now define \alpha,\beta\in\mathbb{R}_+ and n_0\in\mathbb{N}_+ such that n_0>n_0^{\star'} and

    \[4\alpha^\star n^{-\varepsilon/8} + 2e^{- \beta^\star n} 	\le \alpha n^{-\varepsilon/8} + e^{- \beta n} 	\qquad\text{for each $n>n_0$}.\]


    \[\mathbf{P} \{ \| \mathcal{L}_G(Z^{n,p(n)}) - \mathcal{L}(N_1) \|_{TV}  	\le \alpha n^{-\varepsilon/8} + e^{-\beta n} \}  	\ge 1- \frac{2}{n^{\varepsilon/8}},\]

which proves the first statement in Theorem 1. In fact, we have

    \begin{align*} 	&\mathbf{P} \{ \| \mathcal{L}_{G}(Z^{n,p(n)}) - \mathcal{L}(N_1) \|_{TV}  	\le \alpha n^{-\varepsilon/8} + e^{-\beta n} \} \\ 	&\qquad\ge \mathbf{P} \{ \| \mathcal{L}_{G}(Z^{n,p(n)}) - \mathcal{L}(N_1) \|_{TV}  	\le \alpha n^{-\varepsilon/8} + e^{-\beta n} | A_n \} \mathbf{P} \{A_n\}\\ 	&\qquad\ge 1- 2n^{-\frac{\varepsilon}{8}}, \end{align*}

since by definition of A_n, on the event A_n we have

    \begin{align*} 	\| \mathcal{L}_G(Z^{n,p(n)}) - \mathcal{L}(N_1) \|_{TV} &\le 2 (b_1(G)+b_2(G))  	\le 4\alpha^\star n^{-\varepsilon/8} + 2e^{- \beta^\star n} 	\le \alpha n^{-\varepsilon/8} + e^{- \beta n}. \end{align*}

By the properties of conditional expectations we can now prove the convergence in total variation of the unconditional law of Z^{n,p(n)} to the law of N_1. In fact, for n>n_0 we have

    \begin{align*} 	&\| \mathcal{L}(Z^{n,p(n)}) - \mathcal{L}(N_1) \|_{TV} \\ 	&\qquad= \sup_{h: \| h\| \le 1} | \mathbf{E} [h(Z^{n,p(n)}) - h(N_1)] |\\ 	&\qquad\le \sup_{h: \| h\| \le 1} | \mathbf{E} [(h(Z^{n,p(n)}) - h(N_1)) \mathbf{1}_{A_n}] | 	+ \sup_{h: \| h\| \le 1} \mathbf{E} [|h(Z^{n,p(n)}) - h(N_1)| \mathbf{1}_{A_n^c}] \\ 	&\qquad\le \sup_{h: \| h\| \le 1} \mathbf{E} [ 	\mathbf{1}_{A_n} | \mathbf{E}_G [h(Z^{n,p(n)}) - h(N_1)] |] 	+ 2 \mathbf{P}\{A^c_n\}\\ 	&\qquad\le \mathbf{E} [ 	\mathbf{1}_{A_n} \| \mathcal{L}_G(Z^{n,p(n)}) - \mathcal{L}(N_1) \|_{TV} ] 	+ 2 \mathbf{P}\{A^c_n\}\\ 	&\qquad\le \alpha n^{-\varepsilon/8} + e^{-\beta n} + 4 n^{-\varepsilon/8}, \end{align*}

from which it follows that

    \[\lim_{n\rightarrow\infty} \| \mathcal{L}(Z^{n,p(n)}) - \mathcal{L}(N_1) \|_{TV} = 0.\]

Since convergence in total variation implies convergence in distribution, the previous result implies that Z^{n,p(n)} converges in distribution to N_1, which concludes the proof of Theorem 1. \square

We now provide the proof of Lemma 4, while we refer the reader to Daskalakis, Dimakis, Mossel for the proof of Lemma 5.

Proof of Lemma 4. We begin with the study of \mathbf{E} [b_1(G)]. By the symmetry of the model we have

    \begin{align*} 	\mathbf{E} [b_1(G)] &= \frac{1}{(N+1)^2}\sum_{i=0}^N \mathbf{E} [|B_i(G)|]  	= \frac{1}{N+1} \mathbf{E} [|B_0(G)|]. \end{align*}

Since |B_0(G)| = \sum_{i=0}^N \mathbf{1}_{i\in B_0(G)}, we have \mathbf{E} [|B_0(G)|] = \sum_{i=0}^N \mathbf{P}\{i\in B_0(G)\}. By the symmetry of the Erdös-Rényi distribution we also have that \mathbf{P}\{i\in B_0(G)\} = \mathbf{P}\{j\in B_0(G)\} if \sigma^i and \sigma^j have the same number of players playing 1 (equivalently, the same number of players playing 0).
Therefore, if we label the vertices of the graph as \{1,2,\ldots,n\}, we have

    \[\mathbf{E} [|B_0(G)|] = \sum_{s=0}^n \binom{n}{s}  	\mathbf{E} [\mathbf{1}_{i_s\in B_0(G)}]\]

where for each s\in\{0,1,\ldots,n\} the index set i_s\in\{0,1,\ldots,N\} is such that the strategy \sigma^{i_s} satisfies \sigma^{i_s}_{k}=1 if k\le s and \sigma^{i_s}_{k}=0 if k> s. Hence, the bound for \mathbf{E} [b_1(G)] in the statement of the Lemma is proved if we show that \mathbf{P} \{i_s\in B_0(G)\} \le n (1-p)^{s-1}. In fact, by definition of B_0(G) we have

    \begin{align*} 	\mathbf{P} \{i_s\in B_0(G)\}  	&= \mathbf{P} \{\text{$G : $ $\exists$ player $k\in\{1,\ldots,n\}$ such that  	$N(k)\cap\{1,\ldots,s\}=\varnothing$}\}\\ 	&\le \sum_{k=1}^n  	\mathbf{P} \{\text{$G : N(k)\cap\{1,\ldots,s\}=\varnothing$}\}\\ 	&\le n (1-p)^{s-1}. \end{align*}

We now study the term \mathbf{E} [b_2(G)]. Proceeding as above, by symmetry we have

    \begin{align*} 	\mathbf{E} [b_2(G)] &= (N+1) \sum_{j=1}^N  	\mathbf{E}[ \mathbf{P}_G\{X_0=1,X_j=1\} \mathbf{1}_{j\in B_0(G)} ]\\ 	&= 2^n \sum_{s=1}^n \binom{n}{s} 	\mathbf{E}[ \mathbf{P}_G\{X_0=1,X_{i_s}=1\} \mathbf{1}_{i_s\in B_0(G)} ] \end{align*}

We now analyze the term \mathbf{E}[ \mathbf{P}_G\{X_0=1,X_{i_s}=1\} \mathbf{1}_{i_s\in B_0(G)} ]. As noticed above, i_s\in B_0(G) if and only if the graph G is such that there exists a player k\in\{1,\ldots,n\} such that N(k)\cap \{1, \dots,s\}=\varnothing. In the case in which such k also satisfies the property k\in\{1,\ldots,s\}, then \sigma^{i_s}_k=1 and \sigma^{i_s}_{k'}=0 for each k'\in N(k), and it follows that \mathbf{P}_G\{X_0=1,X_{i_s}=1\} = 0. In fact, the event \{X_0=1,X_{i_s}=1\} corresponds to the realizations where both strategy \sigma^0 and strategy \sigma^{i_s} are pure Nash equilibria, that is, \operatorname{BR}_v(\sigma^0_v,\sigma^0_{N(v)})=\operatorname{BR}_v(\sigma^{i_s}_v,\sigma^{i_s}_{N(v)})=1 for each v\in V. But it can not be that both 0 and 1 are best responses for the player k when all player in the neighbor play 0. Hence, \mathbf{P}_G\{X_0=1,X_{i_s}=1\} \mathbf{1}_{i_s\in B_0(G)} is different from 0 on the event

    \begin{align*} 	A:=& \ \{G: k\in\{1,\ldots,s\} \text{ implies } N(k)\cap \{1, 	\dots,s\} \neq \varnothing \}\\ 	=& \ \{ \text{$\not\exists$ isolated node in the subgraph induced by $\{1,\ldots,s\}$} \} \end{align*}

and on the event

    \[B:=\{G: \text{$\exists k\in\{s+1,\ldots,n\}$ such that $N(k)\cap \{1, 	\dots,s\} = \varnothing$} \}.\]

Define p_s:=\mathbf{P} \{A\} and m_s := |\{k\in \{s+1,\ldots,n\} : N(k)\cap \{1,\dots,s\} = \varnothing\}|. Note that we have B=\bigcup_{t=1}^{n-s} \{G: m_s=t\}. On the events A and B we have \mathbf{P}_G\{X_0=1,X_{i_s}=1\} \mathbf{1}_{i_s\in B_0(G)}=\mathbf{P}_G\{X_0=1,X_{i_s}=1\} and we get

    \begin{align*} 	&\mathbf{E}[ \mathbf{P}_G\{X_0=1,X_{i_s}=1\} \mathbf{1}_{i_s\in B_0(G)} ]\\ 	&\qquad= \mathbf{E}[ \mathbf{P}_G\{X_0=1,X_{i_s}=1\}  	\mathbf{1}_{i_s\in B_0(G)} | A] \, p_s\\ 	&\qquad= \sum_{t=1}^{n-s} \mathbf{E}[ \mathbf{P}_G\{X_0=1,X_{i_s}=1\}  	| A, m_s = t] \, \mathbf{P}\{m_s=t | A\} \, p_s. \end{align*}

Because of the independency structure in the Erdös-Rényi random graph we have

    \[\mathbf{P}\{m_s=t | A\} = \mathbf{P}\{m_s=t \} = \binom{n-s}{t}  	[(1-p)^s]^t [1-(1-p)^s]^{n-s-t}.\]

Furthermore, we have

    \[\mathbf{E}[ \mathbf{P}_G\{X_0=1,X_{i_s}=1\}  	| A, m_s = t] = \frac{1}{2^t} \frac{1}{2^{n-t}} \frac{1}{2^{n-t}}.\]

This follows immediately from the definition of pure Nash equilibrium in terms of the best response functions once noticed that on the event \{A, m_s = t\} there are exactly t players (each k\in\{s+1,\ldots,n\} such that N(k)\cap \{1,\dots,s\} = \varnothing\}) such that (\sigma^0_k,\sigma^0_{N(k)})=(\sigma^{i_s}_k,\sigma^{i_s}_{N(k)}), while for the remaining n-t players we have (\sigma^0_k,\sigma^0_{N(k)})\neq(\sigma^{i_s}_k,\sigma^{i_s}_{N(k)}). Putting everything together we get

    \begin{align*} 	&\mathbf{E}[ \mathbf{P}_G\{X_0=1,X_{i_s}=1\} \mathbf{1}_{i_s\in B_0(G)} ]\\ 	&\qquad= \sum_{t=1}^{n-s} \mathbf{E}[ \mathbf{P}_G\{X_0=1,X_{i_s}=1\}  	\mathbf{1}_{i_s\in B_0(G)} | A, m_s = t] \, \mathbf{P}\{m_s=t | A\} \, p_s\\ 	&\qquad= p_s \sum_{t=1}^{n-s} \binom{n-s}{t} [(1-p)^s]^t [1-(1-p)^s]^{n-s-t} 	\frac{1}{4^{n-t}}\\ 	&\qquad= \frac{p_s}{4^n} [(1+(1-p)^s)^{n-s} - (1-(1-p)^s)^{n-s}]. \end{align*}

Using the fact that p_s\le 1, it clearly follows that \mathbf{E} [b_2(G)] \le S(n,p). \square

01. May 2013 by Ramon van Handel
Categories: Random graphs | Comments Off on Lectures 7/8. Games on random graphs

Giant component: final remarks

The past three lectures were devoted to the giant component theorem:

Theorem Let C_v be the connected component of G(n, \frac{c}{n}) that contains v\in [n].

  1. If c<1, then \max_v|C_v|= O(\log n) in probability.
  2. If c>1, then \max_v|C_v|\sim (1-\rho) n in probability, for some 0<\rho<1.
  3. If c=1, then \max_v|C_v|\sim n^{2/3} in distribution.

We proved only the first (subcritical) and second (supercritical) cases: our presentation was largely inspired by the treatment of Janson, Luczak, and Rucinski and Durrett. We have omitted the critical case, however, as the last two lectures of the semester will be on another topic. The goal of this post is to provide some final remarks and references on the giant component theorem.


At first sight, the “double jump” in the giant component theorem looks quite shocking. In hindsight, however, this does not seem quite so miraculous, as it mirrors an elementary phenomenon that is covered in many introductory probability courses: given a (nice) random walk S_t with initial condition S_0=a, define the hitting time \tau:=\inf\{t:S_t=b\} for some b<a. Then there are three cases:

  1. If S_t has negative drift, then \mathbb{E}[\tau]<\infty. In fact, the random variable \tau has a light (exponential) tail.
  2. If S_t has positive drift, then \mathbb{P}[\tau=\infty]>0.
  3. If S_t has zero drift, then \tau<\infty a.s. but \mathbb{E}[\tau]=\infty. That is, the random variable \tau has a heavy tail.

This “double jump” in the behavior of the hitting probabilities of a random walk is directly analogous to the behavior of the connected components of an Erdös-Rényi graph, and this was indeed the basic idea behind the proofs given in the previous lectures. Of course, it remains a bit of a miracle that the random walk approximation of the exploration process, which only holds for small times, is sufficiently powerful that it describes so completely the behavior of the random graph.

The critical case

In the subcritical case, the size of the largest component is of order \log n because the hitting time of a random walk with negative drift has an exponential tail: that is, we proved

    \[\mathbb{P}[\tau\ge k] \le e^{-\alpha k}    \quad\Longrightarrow\quad    \mathbb{P}\Big[\max_v|C_v| \ge \gamma\log n\Big] \le    n\,\mathbb{P}[\tau\ge\gamma\log n] \le n^{1-\alpha\gamma}\]

which goes to zero as n\to\infty for \gamma>1/\alpha.

Similarly, we would expect that we can obtain the size of the largest component in the critical case if we understand the heavy tail behavior of the hitting time of a random walk with zero drift. This is in fact the case. Indeed, when there is zero drift, one can show that

    \[P[\tau\ge k] \lesssim \frac{1}{\sqrt{k}}.\]

The crude union bound argument used above now does not give the correct answer, but an only slightly better argument is needed. Indeed, note that

    \begin{eqnarray*}     \mathbb{P}\Big[\max_v|C_v| \ge \gamma n^{2/3}\Big] &\le&     \mathbb{P}[|\{v:|C_v|\ge \gamma n^{2/3}\}|\ge \gamma n^{2/3}] \\     &\le&     \frac{\mathbb{E}[|\{v:|C_v|\ge \gamma n^{2/3}\}|]}{\gamma n^{2/3}} \\     &=&     \frac{n\,\mathbb{P}[\tau\ge \gamma n^{2/3}]}{\gamma n^{2/3}}     \lesssim \frac{1}{\gamma^{3/2}}. \end{eqnarray*}

Therefore, n^{-2/3}\max_v|C_v|=O(1). With some further work, a corresponding lower bound can also be proved. See the paper by Nachmias and Peres or the notes by van der Hofstad for the details.

It turns out that in the critical case n^{-2/3}\max_v|C_v| is not only bounded in probability, but in fact converges weakly to some limiting distribution. This distribution, and much more, is beautifully described by Aldous in terms of Brownian excursions. This is an interesting example of the application of stochastic analysis to discrete probability; unfortunately, we do not have the time to cover it.

In a different direction, it turns out that various additional phase transitions appear when we consider a finer scaling, for example, in the “critical window” c=1\pm c' n^{-1/3}. For an overview of the various transitions, see, for example, section 11.1 in Alon and Spencer.

Connectivity threshold

Rather than considering the size of the largest component, one could ask when the entire Erdös-Rényi graph is connected. Note that when 1<c<\infty, the constant \rho in the size (1-\rho)n of the giant component is always strictly positive, so the graph is not connected. Therefore, in order for the entire graph to be connected, we must let c\to\infty (that is, the edge probability p must be superlinear). It turns out that the appropriate scaling for this question is c\sim \log n, and another phase transition arises here.

Theorem. Let p=\frac{a\log n}{n}. If a>1, then the Erdös-Rényi graph G(n,p) is connected with probability tending to 1 as n\to\infty. If a<1, the graph is connected with probability tending to 0 as n\to\infty.

To get some intuition, consider the probability that a vertex v is isolated (that is, disconnected from every other vertex):

    \[\mathbb{P}[v\text{ is isolated}] =    \mathbb{P}[\eta_{vw}=0\text{ for all }w\ne v] =    \bigg(1-\frac{c}{n}\bigg)^{n-1}\sim e^{-c}.\]

Thus for c=a\log n, we have

    \[\mathbb{E}[|\{v:v\text{ is isolated}\}|] \sim    n \,e^{-a\log n} = n^{1-a}.\]

In particular, if a<1, then there must exist an isolated vertex with positive probability as n\to\infty, in which case the graph is not connected (in fact, it is not hard to show that the variance of the number of isolated components is of the same order as its mean, so that the probability that the graph is connected tends to zero). Somewhat miraculously, it turns out that when there are no isolated vertices, then the graph must already be connected, so that we do indeed obtain the sharp transition described in the Theorem above. For a proof of this fact (by a clever combinatorial argument) see, for example, the lecture notes by van der Hofstad. Alternatively, one can use a random walk argument entirely in the spirit of the proofs in the previous lectures to prove that the random graph is connected for a>1: by running simultaneous exploration processes from different vertices as we did in the proof of the supercritical case, one can show that all connected components must intersect when a>1 and thus the entire graph must be connected. See section 2.8 in Durrett for such an argument.

28. April 2013 by Ramon van Handel
Categories: Random graphs | Comments Off on Giant component: final remarks

Lecture 6. Giant component (3)

Let us begin with a brief recap from the previous lecture. We consider the Erdös-Rényi random graph G(n,\frac{c}{n}) in the supercritical case c>1. Recall that C_v denotes the connected component of the graph that contains the vertex v. Our goal is to prove the existence of the giant component with size \max_v|C_v|\sim (1-\rho)n, while the remaining components have size \lesssim\log n.

Fix \beta>0 sufficiently large (to be chosen in the proof), and define the set

    \[K=\{v :  |C_v| > \beta \log n\}\]

of vertices contained in “large” components. The proof consists of two parts:

  • Part 1: \mathbb{P}[C_v = C_{v'}~\forall v,v' \in K] \to 1.
  • Part 2: \frac{|K|}{n} \to 1-\rho in probability.

Part 1 states that all the sufficiently large components must intersect, forming the giant component. Part 2 counts the number of vertices in the giant component. Part 2 was proved in the previous lecture. The goal of this lecture is to prove Part 1, which completes the proof of the giant component.


As in the previous lectures, the central idea in the study of the giant component is the exploration process (U_t,A_t,R_t), where

    \[|C_v| = \tau := \inf\{ t  :  |A_t|=0\}.\]

We have seen that |A_t| \approx S_t, where S_t is a random walk with increments

    \[X_t = S_t -S_{t-1} \sim \text{Binomial}(n,\tfrac{c}{n})-1, 	\qquad |A_0| = S_0 =1.\]

When c>1, we have \mathbb{E}[X_t] = c-1 > 0. Thus |A_t| is approximately a random walk with positive drift. The intuitive idea behind the proof of Part 1 is as follows. Initially, the random walk can hit 0 rapidly, in which case the component is small. However, if the random walk drifts away from zero, then with high probability it will never hit zero, in which case the component must keep growing until the random walk approximation is no longer accurate. Thus there do not exist any components of intermediate size: each component is either very small (|C_v|\le \beta\log n) or very large (we will show |C_v|\ge n^{2/3}, but the precise exponent is not important).

We now want to argue that any pair of large components must necessarily intersect. Consider two disjoint sets I and J of vertices of size |I|,|J|\ge n^{2/3}. As each edge is present in the graph with probability c/n, the probability that there is no edge between I and J is

    \[\mathbb{P}[\eta_{ij}=0\mbox{ for all }i\in I,~j\in J]= 	\bigg(1-\frac{c}{n}\bigg)^{|I|\,|J|} \le 	\bigg(1-\frac{c}{n}\bigg)^{n^{4/3}} \le 	e^{-cn^{1/3}}.\]

We therefore expect that any pair of large components must intersect with high probability. The problem with this argument is that we assumed that the sets I and J are nonrandom, while the random sets C_v themselves depend on the edge structure of the random graph (so the events \{C_v=I,C_{v'}=J\} and \{\mbox{no edges between }I,J\} are highly correlated). To actually implement this idea, we therefore need a little bit more sophisticated approach.

To make the proof work, we revisit more carefully our earlier random walk argument. The process |A_t|\approx S_t has positive drift as \mathbb{E}[S_t] = S_0 + (c-1)t. Thus the process |A_t|-(c-1)t/2 is still approximately a random walk with positive drift! Applying the above intuition, either |A_t| dies rapidly (the component is small), or |A_t| grows linearly in t as is illustrated in the following figure:

Rendered by

This means that the exploration process for a component of size >\beta\log n will not only grow large (|A_{n^{2/3}}|>0) with high probability, but that the exploration process will also possess a large number of active vertices (|A_{n^{2/3}}|\gtrsim n^{2/3}). To prove that all large components intersect, we will run different exploration processes simultaneously starting from different vertices. We will show that if two of these processes reach a large number of active vertices then there must be an edge between them with high probability, and thus the corresponding components must coincide. This resolves the dependence problem in our naive argument, as the edges between the sets of active vertices have not yet been explored and are therefore independent of the history of the exploration process.

The component size dichotomy

We now begin the proof in earnest. We will first show the dichotomy between large and small components: either the component size is \le\beta\log n, or the number of active vertices |A_t| grows linearly up to time n^{2/3}. To be precise, we consider the following event:

    \[B := \Big\{\text{either} ~  	|C_v| \le \beta \log n , 	~ \text{or} ~  	|A_t| \ge \Big(\frac{c-1}{2}\Big) t  	~ \text{for all} ~ \beta \log n \le t  	\le n^{2/3}\Big\}.\]

Our goal is to show that \mathbb{P}[B] is large.

Define the stopping time

    \[\sigma := \inf\Big\{t \ge \beta \log n  : 	|A_t| < \Big(\frac{c-1}{2}\Big)t \Big\}.\]

We can write

    \[B = \{ \tau \le \beta \log n ~ \text{or} ~ \sigma > n^{2/3} \}.\]

Now suppose \tau >\beta \log n and \sigma =t. Then \tau \ge t, as exploration process is alive at time \beta \log n and stays alive until time t. We can therefore write

    \[\mathbb{P}[B^c] = \sum_{s=\beta \log n}^{n^{2/3}}  	\mathbb{P}[\tau>\beta \log n ~ \text{and} ~ \sigma = s] \le 	\sum_{s=\beta \log n}^{n^{2/3}}  	\mathbb{P}\Big[|A_s|  	< \Big(\frac{c-1}{2}\Big)s 	~ \text{and} ~ 	s\le\tau\Big].\]

To bound the probabilities inside the sum, we compare |A_s| to a suitable random walk.

The random walk argument

To bound the probability that |A_s| < (c-1)s/2, we must introduce a comparison random walk that lies beneath |A_t|. We use the same construction as was used in the previous lecture. Let

    \begin{equation*} 	\bar S_{t+1}=\left\{ 	\begin{array}{ll} 	\bar S_t -1 +  	\sum_{w \in U_t} \eta_{wi_t} +  	\sum_{w=1}^{n-(\frac{c+1}{2})n^{2/3} - |U_t|}  	\tilde \eta _w^t &\quad\text{if }  	|U_t|<n-(\frac{c+1}{2}) n^{2/3},\\ 	\bar S_t -1 +  	\sum_{w \in \bar U_t} \eta_{wi_t} & \quad   	\text{if } 	|U_t|\ge n-(\frac{c+1}{2}) n^{2/3} . 	\end{array}  	\right. \end{equation*}

where \bar S_0=1, \tilde \eta_w^t are i.i.d. \text{Bernoulli}(\frac{c}{n}) random variables independent of \eta_{ij}, i_t=\min A_t (the same i_t used in the exploration process), and \bar U_t is the set of the first n-(\frac{c+1}{2})n^{2/3} components of U_t (if t \ge \tau, then A_t=\varnothing and thus i_t is undefined; then we simply add n-(\frac{c+1}{2})n^{2/3} variables \tilde\eta_w^t).

As in the previous lecture, we have:

  • \bar S_t is a random walk with \text{Binomial}(n-(\frac{c+1}{2})n^{2/3},\frac{c}{n})-1 increments.
  • \bar S_t \le |A_t| whenever t \le \tau and |U_t| \ge n-(\frac{c+1}{2})n^{2/3}.

Now suppose that s\le n^{2/3} and |A_s|<(\frac{c-1}{2})s. Then

    \[|U_s| = n - |A_s| - |R_s| \ge 	n - (\tfrac{c+1}{2})s \ge n - (\tfrac{c+1}{2})n^{2/3}.\]

We therefore obtain for s\le n^{2/3}

    \[\mathbb{P}\Big[|A_s|  	< \Big(\frac{c-1}{2}\Big)s 	~ \text{and} ~ 	s\le\tau\Big] \le 	\mathbb{P}\Big[\bar S_s < \Big(\frac{c-1}{2}\Big)s\Big].\]

Thus computing \mathbb{P}[B^c] reduces to compute the tail probability of a random walk (or, in less fancy terms, a sum of i.i.d. random variables). That is something we know how to do.

Lemma (Chernoff bound). Let X\sim \text{Binomial}(n,p). Then

    \[\mathbb{P}[X\le np-t] \le e^{-t^2/2np}.\]

Proof. Let \gamma>0. Then

    \begin{eqnarray*} 	\mathbb{P}[X\le np-t] &=& 	\mathbb{P}[e^{-\gamma X} \ge e^{-\gamma np + \gamma t}] \\ 	&\le& e^{\gamma np - \gamma t} \, \mathbb{E}[e^{-\gamma X}]\\ 	&=& e^{\gamma np - \gamma t} (1- (1-e^{-\gamma})p)^n \\ 	&\le& e^{\{\gamma-(1-e^{-\gamma})\}np - \gamma t}\\ 	&\le& e^{\gamma^2 np/2 - \gamma t}. \end{eqnarray*}

The result follows by optimizing over \gamma>0. \quad\square

Note that \bar S_s \sim 1 - s + \text{Binomial}(\{n-(\frac{c+1}{2})n^{2/3}\}s,\frac{c}{n}). We therefore have by the Chernoff bound

    \[\mathbb{P}\Big[\bar S_s < \Big(\frac{c-1}{2}\Big)s\Big] \le 	\mathbb{P}\Big[ 	\text{Binomial}(\{n-(\tfrac{c+1}{2})n^{2/3}\}s,\tfrac{c}{n}) 	\le \Big(\frac{c+1}{2}\Big)s\Big] \le 	e^{-(c-1-o(1))^2 s/8c}\]

for all s (here o(1) depends only on n and c). In particular, we have

    \[\mathbb{P}\Big[\bar S_s < \Big(\frac{c-1}{2}\Big)s\Big] \le 	n^{-\beta(c-1)^2/9c} 	\quad\text{for all }s\ge\beta\log n\]

provided n is sufficiently large. Thus we can estimate

    \[\mathbb{P}[B^c] \le 	\sum_{s=\beta \log n}^{n^{2/3}}  	\mathbb{P}\Big[\bar S_s < \Big(\frac{c-1}{2}\Big)s\Big]  	\le 	n^{2/3-\beta(c-1)^2/9c},\]

which goes to zero as n\to\infty provided that \beta is chosen sufficiently large. In particular, the component size dichotomy follows: choosing any \beta>15c/(c-1)^2, we obtain

    \[\mathbb{P}[|C_v|\le\beta\log n 	\text{ or }|C_v|\ge n^{2/3}\text{ for all }v] \ge 	1-n\mathbb{P}[B^c] \ge  	1-n^{5/3-\beta(c-1)^2/9c}\xrightarrow{n\to\infty}0.\]

Remark: Unlike in the proof of Part 2 in the previous lecture, here we do need to choose \beta sufficiently large for the proof to work. If \beta is too small, then the random walk \bar S_t cannot move sufficiently far away from zero to ensure that it will never return. In particular, even in the supercritical case, the second largest component has size of order \log n.

Large components must intersect

To complete the proof, it remains to show that all large components must intersect. To do this, we will run several exploration processes at once starting from different vertices. If the sets of active vertices of two of these processes grow large, then there must be an edge between them with high probability, and thus the corresponding components intersect. Let us make this argument precise.

In the following, we denote by (U_t^v,A_t^v,R_t^v) the exploration process started at A_0=\{v\}. For each such process, we define the corresponding set B^v that we have investigated above:

    \[B_v := \Big\{\text{either} ~  	|C_v| \le \beta \log n , 	~ \text{or} ~  	|A_t^v| \ge \Big(\frac{c-1}{2}\Big) t  	~ \text{for all} ~ \beta \log n \le t  	\le n^{2/3}\Big\}.\]

We have shown above that, provided \beta>15c/(c-1)^2, we have

    \[\mathbb{P}\bigg[\bigcap_v B_v\bigg] 	\ge 1 - \sum_v \mathbb{P}[B_v^c] \ge 	1-o(1).\]

We can therefore estimate

    \begin{eqnarray*} 	&& 	\mathbb{P}[\exists\, v,v' \in K \text{ such that } C_v \neq C_{v'}] \\ 	&& \mbox{}= 	\mathbb{P}\big[\exists\, v,v' \in K \text{ such that } C_v \neq C_{v'}, 	\text{ and } 	|A^v_{n^{2/3}}| \ge (\tfrac{c-1}{2})n^{2/3} 	~\forall\, v \in K\big] + o(1) \phantom{\sum}\\ 	&& \mbox{}\le 	\sum_{v,v'} 	\mathbb{P}\big[C_v \neq C_{v'} \text{ and } 	 |A^v_{n^{2/3}}| \wedge |A_{n^{2/3}}^{v'}| \ge  	(\tfrac{c-1}{2})n^{2/3}\big]+o(1). \end{eqnarray*}

Now note that by time t, the exploration process (U_t^v,A_t^v,R_t^v) has only explored edges \eta_{ij} where i\in R_t^v (or j\in R_t^v), and similarly for (U_t^{v'},A_t^{v'},R_t^{v'}). It follows that

    \[\text{The conditional law of } 	\{\eta_{ij} : i,j \not\in R_t^v\cup R_t^{v'}\} 	\text{ given } 	(A_t^v,R_t^v,A_t^{v'},R_t^{v'}) 	\text{ is i.i.d.\ Binomial}(\tfrac{c}{n}).\]

In particular, if I^v,J^v,I^{v'},J^{v'} are disjoint subsets of vertices, then

    \[\mathbb{P}[\text{no edge between } I^v, I^{v'}  	| A_t^v =I^v, R_t^v=J^v,A_t^{v'}=I^{v'},R_t^{v'}=J^{v'}]  	= \bigg(1-\frac cn \bigg)^{|I^v|\, |I^{v'}|}.\]

On the other hand, C_v \ne C_{v'} implies that R_t^v,A_t^v,R_t^{v'},A_t^{v'} must be disjoint at every time t. Thus if C_v \ne C_{v'}, there can be no edges between vertices in A_t^v and A_t^{v'} at any time t (if such an edge exists, then the vertices connected by this edge will eventually be explored by both exploration processes, and then the sets of removed vertices will no longer be disjoint). Therefore,

    \begin{eqnarray*} 	&&  	\mathbb{P}\big[C_v \neq C_{v'} \text{ and } 	 |A^v_{n^{2/3}}| \wedge |A_{n^{2/3}}^{v'}| \ge  	(\tfrac{c-1}{2})n^{2/3}\big] \\ 	&& \mbox{} \le 	\mathbb{P}\big[ 	\text{no edge between }A^v_{n^{2/3}},A^{v'}_{n^{2/3}},\\ 	&& \phantom{\mbox{} \le \mathbb{P}\big[} 	A^v_{n^{2/3}},R^v_{n^{2/3}}, 	A^{v'}_{n^{2/3}},R^{v'}_{n^{2/3}}\text{ are disjoint}, ~ 	|A^v_{n^{2/3}}| \wedge |A_{n^{2/3}}^{v'}| \ge  	(\tfrac{c-1}{2})n^{2/3}\big] \\ 	&& \mbox{}\le 	\bigg(1-\frac{c}{n} \bigg)^{(c-1)^2n^{4/3}/4} 	\le e^{-c(c-1)^2n^{1/3}/4}. \end{eqnarray*}

Thus we finally obtain

    \[\mathbb{P}[C_v = C_{v'}~\forall v,v' \in K] \ge 	1- n^2e^{-c(c-1)^2n^{1/3}/4} - o(1) 	\xrightarrow{n\to\infty}1,\]

and the proof of the giant component theorem is complete.

Many thanks to Quentin Berthet for scribing this lecture!

27. April 2013 by Ramon van Handel
Categories: Random graphs | Comments Off on Lecture 6. Giant component (3)

← Older posts

Newer posts →