Next lecture: October 17
Next week (October 10) there will be no stochastic analysis seminar due to conflicting events:
Lecture 2. Basics / law of small numbers
Due to scheduling considerations, we postpone the proof of the entropic central limit theorem. In this lecture, we discuss basic properties of the entropy and illustrate them by proving a simple version of the law of small numbers (Poisson limit theorem). The next lecture will be devoted to Sanov’s theorem. We will return to the entropic central limit theorem in Lecture 4.
Conditional entropy and mutual information
We begin by introducing two definitions related to entropy. The first definition is a notion of entropy under conditioning.
Definition. If and are two discrete random variables with probability mass functions and , then the conditional entropy of given is defined as
where is the conditional probability mass function of given .
Remark. If and are absolutely continuous random variables, the conditional differential entropy is defined analogously (where the probability mass functions are replaced by the corresponding probability densities with respect to Lebesgue measure).
Note that
That is, the conditional entropy is precisely the expectation (with respect to the law of ) of the entropy of the conditional distribution of given .
We now turn to the second definition, the mutual information. It describes the degree of dependence between two random variables.
Definition. The mutual information between two random variables and is defined as
where , and denote the distributions of , and .
Conditional entropy and mutual information are closely related. For example, suppose that has density with respect to the Lebesgue measure, then
In particular, since is always positive (because it is a relative entropy), we have just shown that , that is, conditioning reduces entropy. The same result holds for discrete random variables when we replace by .
Chain rules
Chain rules are formulas that relate the entropy of multiple random variables to the conditional entropies of these random variables. The most basic version is the following.
Chain rule for entropy. . In particular, .
Proof. Note that
Thus,
Taking the expectation on both sides under the distribution gives the desired result.
Corollary. Entropy is subadditive, that is, .
Proof. Combine the chain rule with .
There is also a chain rule for relative entropy.
Chain rule for relative entropy.
The following identity will be useful later.
Lemma.
Proof. Note that
Data processing and convexity
Two important properties of the relative entropy can be obtained as consequences of the chain rule.
Data processing inequality. Let and be two probability measures on and suppose is measurable. Then , where is the distribution of when .
The data processing inequality tells us that if we process the data (which might come from one of the two distributions and ), then the relative entropy decreases. In other words, it becomes harder to identify the source distribution after processing the data. The same result (with the same proof) holds also if and are transformed by a transition kernel, rather than by a function.
Proof. Denote by and the joint laws of and when and . By the chain rule and nonnegativity of relative entropy
On the other hand, using again the chain rule,
where we used . Putting these together completes the proof.
Convexity of relative entropy. is jointly convex in its arguments, that is, if , , , are probability measures and , then
Proof. Let be a random variable that takes value with probability and with probability . Conditionally on , draw and . Then and . Using the chain rule twice, we obtain
and the right hand side is precisely .
Corollary. The entropy function is concave.
Proof for a finite alphabet. When the alphabet is finite, the corollary can be proven by noting that .
Relative entropy and total variation distance
Consider the hypothesis testing problem of testing the null hypothesis against the alternative hypothesis . A test is a measurable function . Under the constraint , it can be shown that the optimal rate of decay of as a function of the sample size is of the order of . This means that is the measure of how well one can distinguish between and on the basis of data.
We will not prove this fact, but only introduce it to motivate that the relative entropy is, in some sense, like a measure of distance between probability measures. However, it is not a metric since and the triangle inequality does not hold. So in what sense does the relative entropy represent a distance? In fact, it controls several bona fide metrics on the space of probability measures. One example of such metric is the total variation distance.
Definition. Let and be probability measures on . The total variation distance is defined as .
The following are some simple facts about the total variation distance.
 .

If and have probability density functions and with respect to some common probability measure , then . To see this, define . Then
 .
The following inequality shows that total variance distance is controlled by the relative entropy. This shows that the relative entropy is a strong notion of distance.
Pinsker’s inequality. .
Proof. Without loss of generality, we can assume that and have probability density functions and with respect to some common probability measure on . Let and .
Step 1: Prove this inequality by simple calculation in the case when contains at most elements.
Step 2: Note that and are defined on the space . So Pinsker’s inequality applies to and . Thus,
Law of small numbers
As a first illustration of an application of entropy to probability, let us prove a simple quantitative law of small numbers. An example of the law of small numbers is the well known fact that in distribution as goes to infinity. More generally, if are Bernoulli random variables with , if are weakly dependent, and if none of the dominates the rest, then where . This idea can be quantified easily using relative entropy.
Theorem. If and may be dependent, then
where and .
Proof. Let be independent random variables with . Then . We have
To conclude, it is enough to note that
Remark. If and are independent, then the inequality in the theorem becomes . However, this rate of convergence is not optimal. One can show that under the same condition, , using tools similar to those that will be used later to prove the entropic central limit theorem. Note that it is much harder to prove in the entropic central limit theorem, even without rate of convergence!
Lecture by Mokshay Madiman  Scribed by Cheyu Liu
Lecture 1. Introduction
What is information theory?
The first question that we want to address is: “What is information?” Although there are several ways in which we might think of answering this question, the main rationale behind our approach is to distinguish information from data. We think of information as something abstract that we want to convey, while we think of data as a representation of information, something that is storable/communicable. This is best understood by some examples.
Example. Information is a certain idea, while data is the words we use to describe this idea.
Example. Information is . Possible data describing this information are: , , , followed by a billion zeros.
As we are in a mathematical setting we want to rely on a quantitative approach. The main question that arises naturally is: “How can we measure information?” Making sense of this question requires us to have a model for how data is produced. Throughout this seminar we will consider the probabilistic model which we now introduce.
Definition (Probabilistic model). Data is a random variable taking values on the space (alphabet) having distribution (source distribution). We write .
To be precise, with the above we mean that there exists a probability space and a measurable space with some measurable function such that we have .
Remarks.
 While this setup is very similar to what is done in statistics, the focus in information theory is different. In statistics it is assumed that the data comes from one of a family of distributions (statistical model), and the goal is to infer something about the particular distribution generating the data. On the other hand, in information theory the distribution of the data might be known or not, and the goal is to compress or communicate .
 In the probabilistic model we assume that the data is generated by a certain random source. This is a particular modeling assumption and it is not necessarily an expression of belief in how data are actually produced. This is a reasonable modeling assumption to make and it allows us to draw reasonable conclusions (for example, text data is clearly not randomly produced, but you can still do useful things by making the modeling assumption that the data was produced by a stochastic source).
 The original motivation behind the development of information theory as based on the probabilistic model came from a practical engineering problem (how to compress/communicate data), and not from the idea of how we measure information (although this aspect was also part of the motivation). The whole field of study was created by the 1948 paper of Claude Shannon.
 We are going to use the probabilistic model throughout this seminar, but other models exist as well. A popular model used in theoretical computer science is the algorithmic model (which defines the field of algorithmic information theory, as opposed to probabilistic information theory). In this model it is assumed that data is the output of some computer program (running on a Turing machine). While this approach could be thought of as a generalization of the probabilistic model (in fact, one way in which computers can work is to simulate from some probability distribution), many of the basic quantities in algorithmic information theory (like Kolmogorov complexity) are not computable. This is the reason why this field is suitable for theoretical insights, but it is not necessarily suitable for practical purposes.
How do we measure information in the probabilistic model?
In what follows we assume that the information to be conveyed coincides with the data itself (and we now assume that takes values in some countable set ), meaning that there is no universal hidden meaning that we are trying to convey apart from the data itself. For example, assume that the information we want to convey (a particular realization of it) is the text “This is the SAS”. A natural way of measuring the amount of information contained in this data is to look for other representations of this information, and to look for the smallest (in some sense that needs to be specified) representation. As we are in the probabilistic framework, we do not know in advance which data is going to be produced by the random source, so we look for a procedure that takes the random outcome and gives us on average the smallest or most compact representation of the information in that data.
Since the data is random, also the size of a particular realization (encoded data) is random. One way to take into account the randomness is to consider a representation (encoding scheme) that minimizes the expected length/size of the encoded data (and that is uniquely decodable). That is, we measure the amount of information in a given data as
If we set up things in this way, the measure of information is some functional of the source distribution , since is the only quantity governing the data. This functional is called the entropy and it is defined as follows.
Definition (Entropy). If is a discrete distribution, then the entropy of is defined as
where we write for the probability mass function of .
While it can be shown that is the minimal expected length of validly encoded data, we do not proceed this way (the ideas behind this fact are covered in the first couple of lectures of an information theory class). Instead, we will give some intuition on why is a good measure of information.
We first provide some intuition on why the information in the statement should be decreasing as a function of . In fact, recall that presently we assume to know the source distribution . If we know , how informative is a particular outcome form the source distribution? If (i.e., is a point mass at ), being informed that the random outcome is is not informative. On the other hand, if , being informed that the outcome is is extremely informative/significant (something very rare has happened).
The relevant decreasing function turns out to be the following:
In this respect, corresponds to the information that we get from the statement . So the average amount of information in the random outcome is given by
Connection between information theory and statistics
While the connection between information theory and statistics is not surprising as both fields rely on the probabilistic model, this correspondence is very strong and natural. We give some examples.
 Maximum likelihood estimators (MLE) can be seen as minimal codelength estimators. In a statistical model we assume that , with for some parameter space , and the goal is to find the parameter that generated the data. A popular estimator is the MLE since it is plausible to assume that the parameter that generated the data is the parameter whose corresponding distribution would have given maximal probability to , that is,
Note that we can rewrite the above as
which can be seen to correspond to the minimal number of bits required to represent assuming that it was generated by (codelength). Hence the connection between MLE in statistics and the minimal codelength estimator in information theory. In this setting we assume that we do not know the distribution generating the data and we try to find a good code to encode the data. The problem of finding a good code is in some sense equivalent to the problem of finding the distribution itself, since once you know the distribution you know the best code (in some sense). Also, we mention that many penalizedMLE estimators (where we take into account the complexity of the model by adding a penalty term to the MLE estimator) can be motivated from an informationtheoretic point of view in terms of analogue of coding problems; this is the idea behind the “Minimum Description Length” principle.
 In Hypothesis testing, the optimal error exponents are informationtheoretic quantities.
These are not just coincidental connections, but examples of basic relationships between fundamental limits of statistical inference on the one hand, and fundamental limits of communication and compression of data on the other hand.
We now turn to the main topic of this seminar, that is, the connection between information theory and probability theory.
Connection between information theory and probability theory
Since we are using a probabilistic model it is clear that probability theory is the language of information theory. However, it is not so obvious that information theory can say something fundamental about probability theory. In fact, in the past half century or so, it has been realized that information theory captures many fundamental concepts in probability theory. Before turning to one key example of such connection (the entropic central limit theorem) which will serve as motivation for the initial few lectures of the seminar, we introduce some relevant quantities.
Definition (Differential or BoltzmannShannon entropy). If , and (i.e., has a density with respect to the Lesbegue measure), then the differential entropy of (equivalently, differential entropy of ) is defined as
with the conventions and .
While we can think of as a measure of disorder (particularly motivated by the setting introduced by Boltzmann in physics), is not a measure of information in the same sense as is. The reason is that in the present context of “continuous” data (recall that we are in and a possible outcome of is a real number) we need infinitely many bits to encode each outcome of , so it is not meaningful to talk of the amount of information in an outcome as this is generally infinity. Nonetheless, the differential entropy represents a crucial quantity in information theory, and shows up for example both when considering communication over channels with continuous noise distributions, and when considering lossy data compression (the only kind of data compression possible with sources like , where one accepts some slight distortion of the data in order to be able to encode it with finitely many bits).
The notion that unifies the continuous entropy with the discrete entropy previously introduced is the relative entropy which we now define.
Definition (Relative entropy). If is a probability measure and is a finite measure on , then the relative entropy between and is defined as
Typically and have respective densities and with respect to a given reference measure . Then the relative entropy reads
The following examples show how the relative entropy relates and .
 If is a countable set and is the counting measure, then
 If and is the Lesbegue measure, then
The following property of relative entropy is the most important inequality in information theory.
Lemma. Let be a probability measure on , and be a subprobability measure on (i.e., is a nonnegative, countably additive measure with ). Then .
Proof. We only need to consider the case where . Let and . Then we have
where we have applied Jensen’s inequality (which holds as is a probability measure) using that is convex, and used that and that .
As a consequence of this result we can now show that the Gaussian distribution maximizes the entropy under constraints on the first two moments.
Lemma. Let be the class of all probability densities on (with respect to Lebesgue measure) with mean and variance and define
Then for any .
Proof. First of all note that
as is quadratic function and, consequently, only the first two moments are involved in computing its expectation. Hence, we have
We are now ready to present the first example of crossfertilization where informationtheoretic concepts can be used to capture fundamental properties in probability theory. Let us first recall the classical central limit theorem (CLT).
Theorem (CLT). If are i.i.d. realvalued random variables with mean and variance , then
that is,
for nice enough sets .
If we denote by the density of the normalized partial sum introduced in the statement of the theorem above, we note the following.
 For each we have . This follows immediately from basic properties of expected values.
 From the previous lemma it follows immediately that
So, the CLT tells us that the sequence converges to the maximizer of the entropy in . In fact, it turns out that the convergence in the central limit theorem can be studied in terms of the entropy and that the CLT is an expression of increasing entropy, as the following entropic central limit theorem describes.
Theorem (Entropic CLT). Let be i.i.d. realvalued random variables with mean and variance , and assume the distribution of has a density (with respect to Lebesgue measure). Under minimal assumptions (specifically, that for some ), we have
or, equivalently,
The entropic central limit theorem is remarkable as usually limit theorems do not come with an associated monotonicity statement. This suggests that the relative entropy is a natural tool to analyze the CLT.
Of course, a natural question that presents itself is whether other limit theorems in probability can be understood from a similar informationtheoretic point of view.
Plan for future lectures
In the next lecture or two we will present a full proof of the entropic central limit theorem, and also discuss briefly how other limit theorems can be analogously understood from this informationtheoretic point of view. Later, we will look at finer behavior than limit theorems, for instance we may look at how information theory can provide insights into large deviations and concentration inequalities.
Lecture by Mokshay Madiman  Scribed by Patrick Rebeschini
Fall 2013: Information theoretic methods
While information theory has traditionally been based on probabilistic methods, ideas and methods from information theory have recently played an increasingly important role in various areas of probability theory itself (as well as in statistics, combinatorics, and other areas of mathematics). The goal of these informal lectures is to introduce some topics and tools at the intersection of probability theory and information theory. No prior knowledge of information theory will be assumed. Potential topics include: entropic central limit theorems, entropic Poisson approximation, and related informationtheoretic and probabilistic inequalities; connections to logarithmic Sobolev inequalities and Stein’s method; entropic inequalities for additive, matroidal, and tree structures, and their applications; transportation costinformation inequalities and their relation to concentration of measure; basic large deviations theory.
Prerequisites: Probability at the level of ORF 526 is assumed.
Time and location: Thursdays, 4:306:00, Bendheim Center classroom 103.
The first lecture will be on September 19.
References:
 N. Gozlan and C. Leonard, “Transport Inequalities. A Survey”
 O. Johnson, “Information Theory and the Central Limit Theorem”
 M. Ledoux, “The Concentration of Measure Phenomenon”
 C. Villani, “Topics in Optimal Transportation”
End of Spring 2013 seminar
That was it for the Spring 2013 stochastic analysis seminar! We will be back in the Fall. The topic and location/time slot will be announced in September.
Lectures 7/8. Games on random graphs
The following two lectures by Rene Carmona are on games on random graphs.
Many thanks to Patrick Rebeschini for scribing these lectures!
In the following we discuss some results from the paper “Connectivity and equilibrium in random games” by Daskalakis, Dimakis, and Mossel. We define a random game on a random graph and we characterize the graphs that are likely to exhibit Nash equilibria for this game. We show that if the random graph is drawn from the ErdösRényi distribution, then in the high connectivity regime the law of the number of pure Nash equilibria converges toward a Poisson distribution, asymptotically, as the size of the graph is increased.
Let be a simple (that is, undirected and with no selfedges) graph, and for each denote by the set of neighbors of , that is, . We think of each vertex in as a player in the game that we are about to introduce. At the same time, we think of each edge as a strategic interaction between players and .
Definition (Game on a graph). For each let represent the set of strategies for player , assumed to be a finite set. We naturally extend this definition to include families of players: for each , let be the set of strategies for each player in . For each , denote by the reward function for player . A game is a collection .
The above definition describes a game that is static, in the sense that the game is played only once, and local, in the sense that the reward function of each player depends only on its own strategy and on the strategy of the players in its neighbors. We now introduce the notion of pure Nash equilibrium.
Definition (Pure Nash equilibrium). We say that is a pure Nash equilibrium (PNE) if for each we have
A pure Nash equilibrium represents a state where no player can be better off by changing his own strategy if he is the only one who is allowed to do so. In order to investigate the existence of a pure Nash equilibrium it suffices to study the best response function defined below.
Definition (Best response function). Given a reward function for player , we define the best response function for as
Clearly, is a pure Nash equilibrium if and only if for each . We now define the type of random games that we will be interested in; in order to do so, we need to specify the set of strategies and the reward function for each player.
Definition (Random game on a fixed graph). For a graph and an atomless probability measure on , let be the associated random game defined as follows:
 for each ;
 is a collection of independent identically distributed random variables with distribution .
Remark. For each game the family is a collection of independent random variables that are uniformly distributed in , and for each , we have almost surely. In fact, note that if and only if and this event has probability since the two random variables appearing on both sides of the inequality sign are independent with the same law and is atomless. As far as the analysis of the existence of pure Nash equilibria is concerned, we could take the present notion of best response functions as the definition of our random game on a fix graph. In fact, note that the choice of in does not play a role in our analysis, and we would obtain the same results by choosing different (atomless) distributions for sampling (independently) the reward function of each player.
Denote by the distribution of a ErdösRényi random graph with vertices where each edge is present independently with probability . We now introduce the notion of a random game on a random graph.
Definition (Random game on a ramon graph). For each , and each probability measure on , do the following:
 choose a graph from ;
 choose a random game from for the graph .
Henceforth, given a random variable let represent its distribution. Given two measures and on a measurable space, define the total variation distance between and as
where the supremum is taken over measurable functions such that the supremum norm is less than or equal to .
We are now ready to state the main theorem that we will prove in the following (Theorem 1.9 in Daskalakis, Dimakis, and Mossel).
Theorem 1 (High connectivity regime). Let be the number of pure Nash equilibria in the random game on random graph defined above. Let define the highconnectivity regime as
where satisfies the following two properties:
Then, we have
where denotes the conditional probability given the graph and is a Poisson random variable with mean . In particular,
which shows that in this regime a pure Nash equilibrium exists with probability converging to as the size of the network increases.
Remark. Using the terminology of statistical mechanics, the first result in Theorem 1 represents a quenchedtype result since it involves the conditional distribution of a system (i.e., the game) given its environment (i.e., the graph). On the other hand, the second result represents an annealedtype result, where the unconditional probability is considered.
In order to prove Theorem 1 we need the following lemma on Poisson approximations. The lemma is adapted from the results of R. Arratia, L. Goldstein, and L. Gordon (“Two moments suffice for Poisson approximations: the ChenStein method“, Ann. Probab. 17, 925, 1989) and it shows how the total variation distance between the law of a sum of Bernoulli random variables and a Poisson distribution can be bounded by the first and second moments of the Bernoulli random variables. This result is a particular instance of the Stein’s method in probability theory.
Lemma 2 (Arratia, Goldstein, and Gordon, 1989). Let be a collection of Bernoulli random variables with . For each let be such that is independent of . Define
where . Define and . If is a Poisson random variable with mean , then
Proof. We define the following operators that act on each function :
We point out that characterizes in the sense that if and only if is a Poisson random variable with mean ; is an example of Stein’s operator. First of all, we show that for each we have . In fact, if we have
and if we have
For each define and . The following properties hold:
 ,
 ,
 .
In what follows, consider any given function such that . Define the function as for each , and let . From what was seen above we have and we get
The first term is bounded above by while the third term is equal to since is independent of . In order to bound the second term we want to rewrite each term as a telescoping sum. In what follow fix , label the elements of as and define
Noticing that , we have
and we get
Therefore, combining all together we get
Since the total variation distance can be characterized in terms of sets as
from this point on we restrict our analysis to indicator functions, which are easier to deal with than generic functions. For each define , and . The previous result yields
and the proof of the Lemma is concluded if we show that for each . In fact, in what follows we will show that
where the right hand side is clearly upper bounded by . The proof that we are going to present is contained in the Appendix of “Poisson Approximation for Some Statistics Based on Exchangeable Trials” by Barbour and Eagleson.
First of all, note that for each we have and
for each . From this expression it is clear that for each , which suggests that we can restrict our analysis to singletons. For each we have
and from the series expansion of the Poisson probabilities it is easy seen that the function is negative and decreasing if and is positive and decreasing if . Hence, the only positive value taken by the difference function corresponds to the case that can be bounded as follows:
Therefore, for each , we have
Noticing that , then for each , we also get
and we proved that .
We now introduce the notation that will naturally allow us to use Lemma 2 to prove Theorem 1. We label the pure strategy profiles in as , where and as always . Often times, it will be convenient to use the labels to enumerate the vertices of the graph . Accordingly, one can think of the strategy profiles defined in a specific way, for example by positing that is the binary decomposition of . In particular becomes the strategy where each player plays zero, that is, for each , and the strategy where each player plays one, that is, for all . For each define
Clearly the quantity identifies the number of pure Nash equilibria and corresponds to the existence of a pure Nash equilibrium. We recall that both the randomness in the choice of the graph and the randomness in the choice of the game are embedded in the random variables . Note that conditionally on a given graph sampled form we have, for each ,
from which it follows that
That is, the current definition of a game on a fixed graph implies that the expected number of pure Nash equilibria is for any given graph. It follows that also . Notice that Theorem 1 adds more information on the table since it describes the asymptotic distribution of on a particular regime of the ErdösRényi random graph.
In the way we set up the stage it seems tempting to apply Lemma 2 to the random variables just defined. However, this approach is not fruitful since, apart from trivial cases, any given random variable has a neighborhood of dependence that coincides with the entire set . To see this, consider any two strategy profiles and . As long as there exists such that , then we can always find a realization of the graph such that does not have any edges attached to it, that is, ; this implies that and, consequently, it implies that and are not independent. Therefore, only and are independent, where and for each . However, Lemma 2 can be fruitfully applied to the random variables when we look at them conditionally on a given graph realization, as the following Lemma demonstrates (Lemma 2.2 of Daskalakis, Dimakis, Mossel).
Lemma 3. Let be a graph. Define
and for each define
where
and is the exclusiveor operation. Then, for each we have that is independent of .
Proof.We first show that is independent of . In order to make the independency structure manifest, we characterize . Recall that is the index set of the pure strategy profiles for which there exists a player having all his neighbors playing . Therefore, each strategy profile corresponding to an index set in is characterized by the fact that each player has at least one neighbor who is playing . Hence, for each we have for all and, consequently, the events
are independent of the events
which proves our claim.
We now generalize this result to show that is independent of , for each . For any , note that the exclusiveor map with respect to preserves the differences in the strategy profiles (and, of course, it also preserves the equalities). That is, if and are such that for some , then also and are such that . Therefore,
holds true if and only if
holds true. Equivalently stated, if and only if , where is the index set such that . Hence, the proof is concluded once we notice that and that is defined as the index set of the pure strategy profiles that are obtained by an exclusiveor map with respect to of a strategy profile in .
For a given graph with , define , where and represents the conditional probability conditionally on the graph . Define as in Lemma 3; then, conditionally on we have that is independent of . Define
where . Define and recall that . If is a Poisson random variable with mean , then Lemma 2 yields
At this point, let us introduce the following two lemmas (Lemmas 2.4 and 2.5 of Daskalakis, Dimakis, Mossel).
Lemma 4. If is sampled from the ErdösRényi distribution , we have
Lemma 5. Under the assumptions of Theorem 1 there exists and such that
We now show how the proof of Theorem 1 follows easily from the two lemmas above.
Proof of Theorem 1. Let and as in Lemma 5. Define , and . Clearly, by Lemma 5 we have
Define the event
By the Markov inequality and the previous asymptotic bounds, for each we have
where we used the result of Lemma 4 and the above estimates for and . Since for some , then there clearly exists such that
Hence, we have that for . Let us now define and such that and
Then
which proves the first statement in Theorem 1. In fact, we have
since by definition of , on the event we have
By the properties of conditional expectations we can now prove the convergence in total variation of the unconditional law of to the law of . In fact, for we have
from which it follows that
Since convergence in total variation implies convergence in distribution, the previous result implies that converges in distribution to , which concludes the proof of Theorem 1.
We now provide the proof of Lemma 4, while we refer the reader to Daskalakis, Dimakis, Mossel for the proof of Lemma 5.
Proof of Lemma 4. We begin with the study of . By the symmetry of the model we have
Since , we have . By the symmetry of the ErdösRényi distribution we also have that if and have the same number of players playing (equivalently, the same number of players playing ).
Therefore, if we label the vertices of the graph as , we have
where for each the index set is such that the strategy satisfies if and if . Hence, the bound for in the statement of the Lemma is proved if we show that . In fact, by definition of we have
We now study the term . Proceeding as above, by symmetry we have
We now analyze the term . As noticed above, if and only if the graph is such that there exists a player such that . In the case in which such also satisfies the property , then and for each , and it follows that . In fact, the event corresponds to the realizations where both strategy and strategy are pure Nash equilibria, that is, for each . But it can not be that both and are best responses for the player when all player in the neighbor play . Hence, is different from on the event
and on the event
Define and . Note that we have . On the events and we have and we get
Because of the independency structure in the ErdösRényi random graph we have
Furthermore, we have
This follows immediately from the definition of pure Nash equilibrium in terms of the best response functions once noticed that on the event there are exactly players (each such that ) such that , while for the remaining players we have . Putting everything together we get
Using the fact that , it clearly follows that .
Giant component: final remarks
The past three lectures were devoted to the giant component theorem:
Theorem Let be the connected component of that contains .
 If , then in probability.
 If , then in probability, for some .
 If , then in distribution.
We proved only the first (subcritical) and second (supercritical) cases: our presentation was largely inspired by the treatment of Janson, Luczak, and Rucinski and Durrett. We have omitted the critical case, however, as the last two lectures of the semester will be on another topic. The goal of this post is to provide some final remarks and references on the giant component theorem.
Retrospective
At first sight, the “double jump” in the giant component theorem looks quite shocking. In hindsight, however, this does not seem quite so miraculous, as it mirrors an elementary phenomenon that is covered in many introductory probability courses: given a (nice) random walk with initial condition , define the hitting time for some . Then there are three cases:
 If has negative drift, then . In fact, the random variable has a light (exponential) tail.
 If has positive drift, then .
 If has zero drift, then a.s. but . That is, the random variable has a heavy tail.
This “double jump” in the behavior of the hitting probabilities of a random walk is directly analogous to the behavior of the connected components of an ErdösRényi graph, and this was indeed the basic idea behind the proofs given in the previous lectures. Of course, it remains a bit of a miracle that the random walk approximation of the exploration process, which only holds for small times, is sufficiently powerful that it describes so completely the behavior of the random graph.
The critical case
In the subcritical case, the size of the largest component is of order because the hitting time of a random walk with negative drift has an exponential tail: that is, we proved
which goes to zero as for .
Similarly, we would expect that we can obtain the size of the largest component in the critical case if we understand the heavy tail behavior of the hitting time of a random walk with zero drift. This is in fact the case. Indeed, when there is zero drift, one can show that
The crude union bound argument used above now does not give the correct answer, but an only slightly better argument is needed. Indeed, note that
Therefore, . With some further work, a corresponding lower bound can also be proved. See the paper by Nachmias and Peres or the notes by van der Hofstad for the details.
It turns out that in the critical case is not only bounded in probability, but in fact converges weakly to some limiting distribution. This distribution, and much more, is beautifully described by Aldous in terms of Brownian excursions. This is an interesting example of the application of stochastic analysis to discrete probability; unfortunately, we do not have the time to cover it.
In a different direction, it turns out that various additional phase transitions appear when we consider a finer scaling, for example, in the “critical window” . For an overview of the various transitions, see, for example, section 11.1 in Alon and Spencer.
Connectivity threshold
Rather than considering the size of the largest component, one could ask when the entire ErdösRényi graph is connected. Note that when , the constant in the size of the giant component is always strictly positive, so the graph is not connected. Therefore, in order for the entire graph to be connected, we must let (that is, the edge probability must be superlinear). It turns out that the appropriate scaling for this question is , and another phase transition arises here.
Theorem. Let . If , then the ErdösRényi graph is connected with probability tending to as . If , the graph is connected with probability tending to as .
To get some intuition, consider the probability that a vertex is isolated (that is, disconnected from every other vertex):
Thus for , we have
In particular, if , then there must exist an isolated vertex with positive probability as , in which case the graph is not connected (in fact, it is not hard to show that the variance of the number of isolated components is of the same order as its mean, so that the probability that the graph is connected tends to zero). Somewhat miraculously, it turns out that when there are no isolated vertices, then the graph must already be connected, so that we do indeed obtain the sharp transition described in the Theorem above. For a proof of this fact (by a clever combinatorial argument) see, for example, the lecture notes by van der Hofstad. Alternatively, one can use a random walk argument entirely in the spirit of the proofs in the previous lectures to prove that the random graph is connected for : by running simultaneous exploration processes from different vertices as we did in the proof of the supercritical case, one can show that all connected components must intersect when and thus the entire graph must be connected. See section 2.8 in Durrett for such an argument.
Lecture 6. Giant component (3)
Let us begin with a brief recap from the previous lecture. We consider the ErdösRényi random graph in the supercritical case . Recall that denotes the connected component of the graph that contains the vertex . Our goal is to prove the existence of the giant component with size , while the remaining components have size .
Fix sufficiently large (to be chosen in the proof), and define the set
of vertices contained in “large” components. The proof consists of two parts:
 Part 1:
 Part 2: in probability.
Part 1 states that all the sufficiently large components must intersect, forming the giant component. Part 2 counts the number of vertices in the giant component. Part 2 was proved in the previous lecture. The goal of this lecture is to prove Part 1, which completes the proof of the giant component.
Overview
As in the previous lectures, the central idea in the study of the giant component is the exploration process , where
We have seen that , where is a random walk with increments
When , we have . Thus is approximately a random walk with positive drift. The intuitive idea behind the proof of Part 1 is as follows. Initially, the random walk can hit rapidly, in which case the component is small. However, if the random walk drifts away from zero, then with high probability it will never hit zero, in which case the component must keep growing until the random walk approximation is no longer accurate. Thus there do not exist any components of intermediate size: each component is either very small () or very large (we will show , but the precise exponent is not important).
We now want to argue that any pair of large components must necessarily intersect. Consider two disjoint sets and of vertices of size . As each edge is present in the graph with probability , the probability that there is no edge between and is
We therefore expect that any pair of large components must intersect with high probability. The problem with this argument is that we assumed that the sets and are nonrandom, while the random sets themselves depend on the edge structure of the random graph (so the events and are highly correlated). To actually implement this idea, we therefore need a little bit more sophisticated approach.
To make the proof work, we revisit more carefully our earlier random walk argument. The process has positive drift as . Thus the process is still approximately a random walk with positive drift! Applying the above intuition, either dies rapidly (the component is small), or grows linearly in as is illustrated in the following figure:
This means that the exploration process for a component of size will not only grow large () with high probability, but that the exploration process will also possess a large number of active vertices (. To prove that all large components intersect, we will run different exploration processes simultaneously starting from different vertices. We will show that if two of these processes reach a large number of active vertices then there must be an edge between them with high probability, and thus the corresponding components must coincide. This resolves the dependence problem in our naive argument, as the edges between the sets of active vertices have not yet been explored and are therefore independent of the history of the exploration process.
The component size dichotomy
We now begin the proof in earnest. We will first show the dichotomy between large and small components: either the component size is , or the number of active vertices grows linearly up to time . To be precise, we consider the following event:
Our goal is to show that is large.
Define the stopping time
We can write
Now suppose and . Then , as exploration process is alive at time and stays alive until time . We can therefore write
To bound the probabilities inside the sum, we compare to a suitable random walk.
The random walk argument
To bound the probability that , we must introduce a comparison random walk that lies beneath . We use the same construction as was used in the previous lecture. Let
where , are i.i.d. random variables independent of , (the same used in the exploration process), and is the set of the first components of (if , then and thus is undefined; then we simply add variables ).
As in the previous lecture, we have:
 is a random walk with increments.
 whenever and .
Now suppose that and . Then
We therefore obtain for
Thus computing reduces to compute the tail probability of a random walk (or, in less fancy terms, a sum of i.i.d. random variables). That is something we know how to do.
Lemma (Chernoff bound). Let . Then
Proof. Let . Then
The result follows by optimizing over .
Note that . We therefore have by the Chernoff bound
for all (here depends only on and ). In particular, we have
provided is sufficiently large. Thus we can estimate
which goes to zero as provided that is chosen sufficiently large. In particular, the component size dichotomy follows: choosing any , we obtain
Remark: Unlike in the proof of Part 2 in the previous lecture, here we do need to choose sufficiently large for the proof to work. If is too small, then the random walk cannot move sufficiently far away from zero to ensure that it will never return. In particular, even in the supercritical case, the second largest component has size of order .
Large components must intersect
To complete the proof, it remains to show that all large components must intersect. To do this, we will run several exploration processes at once starting from different vertices. If the sets of active vertices of two of these processes grow large, then there must be an edge between them with high probability, and thus the corresponding components intersect. Let us make this argument precise.
In the following, we denote by the exploration process started at . For each such process, we define the corresponding set that we have investigated above:
We have shown above that, provided , we have
We can therefore estimate
Now note that by time , the exploration process has only explored edges where (or ), and similarly for . It follows that
In particular, if are disjoint subsets of vertices, then
On the other hand, implies that must be disjoint at every time . Thus if , there can be no edges between vertices in and at any time (if such an edge exists, then the vertices connected by this edge will eventually be explored by both exploration processes, and then the sets of removed vertices will no longer be disjoint). Therefore,
Thus we finally obtain
and the proof of the giant component theorem is complete.
Many thanks to Quentin Berthet for scribing this lecture!
Lecture 5. Giant component (2)
Consider the ErdösRényi graph model , and denote as usual by the connected component of the graph that contains the vertex . In the last lecture, we focused mostly on the subcritical case , where we showed that . Today we will begin developing the supercritical case , where for a suitable constant . In particular, our aim for this and next lecture is to prove the following theorem.
Theorem. Let . Then
where is the smallest positive solution of the equation . Moreover, there is a such that all but one of the components have size with probability tending to as .
Beginning of the proof. Define the set
The proof of the Theorem consists of two main parts:
 Part 1: .
 Part 2: in probability.
Part 1 states that all “large” components of the graph must intersect, forming one giant component. Some intuition for why this is the case was given at the end of the previous lecture. Part 2 computes the size of this giant component. In this lecture, we will concentrate on proving Part 2, and we will find out where the mysterious constant comes from. In the next lecture, we will prove Part 1, and we will develop a detailed understanding of why all large components must intersect.
Before we proceed, let us complete the proof of the Theorem assuming Parts 1 and 2 have been proved. First, note that with probability tending to one, the set is itself a connected component. Indeed, if and then must lie in disjoint connected components by the definition of . On the other hand, with probability tending to one, all must lie in the same connected component by Part 1. Therefore, with probability tending to one, the set forms a single connected component of the graph. By Part 2, the size of this component is , while by the definition of , all other components have size . This completes the proof.
The remainder of this lecture is devoted to proving Part 2 above. We will first prove that the claim holds on average, and then prove concentration around the average. More precisely, we will show:
 .
 .
Together, these two claims evidently prove Part 2.
Mean size of the giant component
We begin by writing out the mean size of the giant component:
where we note that does not depend on the vertex by the symmetry of the ErdösRényi model. Therefore, to prove convergence of the mean size of the giant component, it suffices to prove that
This is what we will now set out to accomplish.
In the previous lecture we defined exploration process . We showed that
and that for
where is an arbitrary nonanticipating choice, say, (recall that denotes the adjacency matrix of the random graph). As are i.i.d. and as edges emanating from the set of unexplored vertices have not yet appeared in previous steps, the process is “almost'' a random walk: it fails to be a random walk as we only add Bernoulli variables in each iteration, rather than a constant number. In the last lecture, we noted that we can estimate from above by a genuine random walk by adding some fictitious vertices. To be precise, we define
where are i.i.d. independent of the (if , then and thus is undefined; in this case, we simply add all variables ). In the present lecture, we also need to bound from below. To this end, we introduce another process as follows:
where is the set consisting of the first elements of in increasing order of the vertices (if , we add variables ). The idea behind these processes is that is engineered, by including “fictitious'' vertices, to always add i.i.d. Bernoulli variables in every iteration, while is engineered, by including “fictitious'' vertices when is small and omitting vertices when is large, to always add i.i.d. Bernoulli variables in every iteration. The following facts are immediate:
 is a random walk with i.i.d. increments.
 is a random walk with i.i.d. increments.
 for all .
 for all on the event .
To see the last property, note that the exploration process can only explore as many vertices as are present in the connected component , so that for all ; therefore, in this situation only the second possibility in the definition of occurs, and it is obvious by construction that then (nonetheless, the first possibility in the definition must be included to ensure that is a random walk).
We now define the hitting times
Then we evidently have
(Note how we cleverly chose the random walk precisely so that whenever ). We have therefore reduced the problem of computing to computing the hitting probabilities of random walks. Now we are in business, as this is something we know how to do using martingales!
The hitting time computation
Let us take a moment to gather some intuition. The random walks and have increments distributed as and , respectively. As , both increment distributions converge to a distribution, so we expect that where is the first hitting time of the Poisson random walk. On the other hand, as , we expect that . The problem then reduces to computing the probability that a Poisson random walk ever hits the origin. This computation can be done explicitly, and this is precisely where the mysterious constant comes from!
We now proceed to make this intuition precise. First, we show that the probability can indeed be replaced by , as one might expect.
Lemma. as .
Proof. We need to show that
Note that as when ,
We can evidently write
where and
Choosing , we obtain for . Therefore,
This completes the proof.
By the above Lemma, and a trivial upper bound, we obtain
To complete computation of the mean size of the giant component, it therefore remains to show that and converge to . In fact, we can compute these quantities exactly.
Lemma. Let . Then
where is the smallest positive solution of .
Proof. Recall the martingale used in last lecture:
Suppose that and . Then
The first equality holds since if then and , while if then and . The second equality holds by dominated convergence since , and the third equality is by the optional stopping theorem.
Now suppose we can find such that as . Then we have
by dominated convergence. Thus, evidently, it suffices to find with the requisite properties. Now note that as , , and for , we evidently must have
We can find such by inspecting the following illustration:
Evidently the requisite assumptions are satisfied when is the smallest root of the equation (but not for the larger root at !)
Remark. Note that the supercritical case is essential here. If then the equation for has no solutions , and the argument in the proof does not work. In fact, when , we have .
By an immediate adaptation of the proof of the previous lemma, we obtain
where is the smallest positive solution of . Letting , we see that
where is the smallest solution of the equation (which is precisely the probability that the Poisson random walk hits zero, by the identical proof to the lemma above). We have therefore proved
Variance of the giant component size
To complete the proof of Part 2 of the giant component theorem, it remains to show that
To this end, let us consider
To estimate the terms in this sum, we condition on one of the components:
To proceed, note that the event can be written as
In particular, the event is independent of the edges for . Therefore, for , the conditional law of given coincides with the (unconditional) law of , the conncted component containing in the induced subgraph on the vertices :
As this quantity only depends on by the symmetry of the ErdösRényi model, we can evidently write
for , . In particular, we obtain
Now note that, by its definition, is distributed precisely as the component containing vertex in the random graph model. We can therefore show, repeating exactly the proof of the mean size of the giant component above, that
We have therefore shown that
which evidently implies
This is what we set out to prove.
Remark. It should be noted that the proof of Part 2 did not depend on the value of , or even on the rate, in the definition of the set : any sequence that grows sublinearly to infinity would have given the same result. This suggests that all but a vanishing fraction of vertices are contained in connected components of order or . We find out only in the next lecture why the rate (for sufficiently large!) is important: only sufficiently large connected components are guaranteed to intersect, while there might (and do) exist components of order that are disjoint from the giant component. If we do not exclude the latter, we will not be able to prove Part 1.
Many thanks to Weichen Wang for scribing this lecture!
Lecture 4. Giant component (1)
Consider the ErdösRényi graph model . In previous lectures, we focused on the “high complexity regime”, i.e., as goes to infinity, is fixed. We discussed topics such as clique numbers and chromatic numbers. From now on, we shall consider the “low complexity regime”, where as goes to infinity, for a fixed constant . As before, let be the adjacency matrix of . Then are i.i.d. Bernoulli random variables with success probability .
Theorem 1 Let be the connected component of that contains .
 If , then in probability.
 If , then in probability, for some .
 If , then in distribution.
In the following lectures, we will aim to prove at least parts 1 and 2.
The exploration process
How to study ? We will explore by starting an “exploration process'' at that moves around until all its sites have been visited. This walk will be constructed so that it hits each site once. So, the time it takes to explore all of is exactly . As a consequence, studying reduces to studying a hitting time of a certain random walk, and to the latter we can apply martingale theory.
At each time , we maintain three sets of vertices:
Below is an illustration of how these sets are updated on a simple example.
 At , initialize , and . Namely, only is active, all the vertices other than are unexplored, and no vertices have been removed.
 At , update , and . Namely, all neighbors of are moved from the unexplored set to the active set, and itself is removed.
 At , pick some and update , and . Namely, all unexplored neighbors of are moved into the active set, and itself is removed.
 At time , we pick some vertex from the current active set , activate all unexplored neighbors of and remove itself.
This is a sort of local search along the connected component : much like playing a game of Minesweeper! At each , the choice of can be made arbitrarily (e.g., selecting the vertex with the smallest index or randomly selecting a vertex in ). The only requirement is that it is nonanticipating (only depending on the edges visited up to time ). For example, we cannot pick the vertex in which has the largest number of unexplored neighbors, as this choice relies on unexplored edges.
A formal description of the “exploration process'':
 Initialize , and .
 For , we set
where is a nonanticipating but otherwise arbitrary choice.
This process stops when there are no more active vertices. It hits each vertex in once and only once. At each time , we remove one vertex in . So the stopping time is exactly equal to the size of :
So, we only need to study the stopping time .
Recall that indicates whether there is an edge between vertices and , and . By construction,
Now, let's do a thought experiment (wrong, but intuitive). Let's forget for the moment that some sites were previously visited, and assume that in each step all neighbors of are unvisited still (note that when is really large and is relatively small, this assumption makes sense). Then is the sum of independent ) variables, which has a distribution. This binomial variable is independent of the past because it only depends on unexplored edges; in addition, its distribution does not depend on . Therefore, would be a random walk with increment distribution . Then, studying boils down to studying first time a Poisson random walk hits zero! Of course, we cannot really ignore previously visited sites, but this rough intuition nonetheless captures the right idea as and will serve as a guiding principle for the proof.
A comparison random walk
The reason that is not a random walk is that there are only edges (not ) to explore at time . We can artificially create a random walk by adding “fictitious'' points at each time as follows.
Let be i.i.d. for , , which are independent of . Define
(When , then and thus is undefined; in this case, we simply add all variables .)
Note that is the sum of edges from to . Since we have not explored yet, those edges are independent of all edges explored up to time (here we use that is nonanticipating). We therefore see that is indeed a random walk with increment
Moreover, since all are nonnegative,
as long as . It follows that is dominated by the genuine random walk , that is,
We can now obtain bounds on by analyzing hitting times of the random walk .
The subcritical case
Define the first time the comparison walk hits zero as
Since for , it is obvious that
Now we study . The intuition is that as , is a random walk with negative drift in the subcritical case . Thus , and in fact the hitting time has nice tails!
Lemma 2 Let and . Then for any positive integer ,
We will prove this lemma below. Using the lemma, we immediately obtain:
Corollary 3 If , then for any
Proof. Applying the Lemma 2 and the union bound,
This corollary proves part 1 of Theorem 1. In fact, it turns out that the constant is tight: by using the second moment method, one can prove a matching lower bound on (see, for example, the lecture notes by van der Hofstad), which implies that in fact in probability. The proof is not much more difficult, but we prefer to move on to the supercritical case.
Remark. It might seem somewhat surprising that the result we obtain is so sharp, considering that we have blindly replaced by the larger quantity . However, in going from to we do not lose as much as one might think at first sight. When is large and is relatively small, the excess term in the definition of is zero with high probability, as most vertices are unexplored and the Bernoulli success probability of the is very small. With a bit of work, one can show that and will actually stick together for times with probability going to one as . Thus, in the subcritical case where the random walk only lives for time steps, nothing is lost in going from to , and our rough intuition that should behave like a random walk as is vindicated.
To wrap up the subcritical case, it remains to prove the lemma.
Proof of Lemma 2. By the Markov inequality,
It remains to bound , which is a standard exercise in martingale theory.
Recall that
where are i.i.d. Define the moment generating function , and let
Since and is independent of ,
where we have used . So is a martingale.
In the case and ,
The inequality is by Fatou's lemma and the second equality is by the optional stopping theorem. To see the first equality, note that if , then and as , while if , then for all and . Thus .
Next, we compute . Recall that . It has the same distribution as the sum of i.i.d. variables. For , we have . Therefore,
where the last inequality is because for any . Now, by setting , we obtain that . Thus we have shown .
The supercritical case
The goal of the following lectures is to prove part 2 of Theorem 1. More precisely, we will prove:
Theorem 4 Let . Then
in probability, where is the smallest positive solution of the equation . Moreover, there is such that all but one of the components have size , with probability tending to as .
This theorem says that with probability tending to , there is a unique giant component whose size is , and all other components are small with size .
Here we provide some vague intuition for this theorem. When , the random walk satisfies , i.e., has positive drift. Then ! In fact, the further away it starts from , the smaller the probability it will ever hit . Consider the two situations:
 dies quickly: this implies that the component is small.
 lives long: then it must live very long, as once it gets far away from , the probability of returning is very small. This implies that the component must be very large (if we pretend that ).
Of course, is not (obviously eventually hits ). But the intuition explains that there cannot be components of intermediate size: given any vertex , either is small (), or it must get very large (, say). In fact, we will find that all components of size must grow all the way to . However, any pair of large components must intersect with high probability, as there are many potential edges between them! Therefore, all vertices with should be in the same giant component. We then show that the number of such vertices is with high probability.
Many thanks to Tracy Ke for scribing this lecture!