# Guest post by Miklos Racz: A primer on exact recovery in the general stochastic block model

Foreword: this will be a series of blog posts written by our Theory Group postdoc Miklos Racz. It is essentially the transcript of a mini-course that Miki gave at the University of Washington and at the XX Brazilian School of Probability, and that I gave at the University of Nice. These notes are also posted on arxiv. I now leave the floor to Miki:

Community detection is a fundamental problem in many sciences, such as sociology (e.g., finding tight-knit groups in social networks), biology (e.g., detecting protein complexes), and beyond. Given its importance, there have been a plethora of algorithms developed in the past few decades to detect communities. In the past few years there have also been many works studying the fundamental limits of these recovery algorithms.

This post describes the recent results of Abbe and Sandon which characterize the fundamental limits of exact recovery in the most canonical and popular probabilistic generative model, the stochastic block model (SBM).

The stochastic block model and exact recovery

The general stochastic block model is a distribution on graphs with latent community structure, and it has three parameters:

, the number of vertices;

a probability distribution that describes the relative sizes of the communities;

and , a symmetric matrix that describes the probabilities with which two given vertices are connected, depending on which communities they belong to.

The number of communities, , is implicit in this notation; in these notes we assume that is a fixed constant.

A random graph from is defined as follows:

• The vertex set of the graph is .
• Every vertex is independently assigned a (hidden) label from the probability distribution on . That is, for every .
• Given the labels of the vertices, each (unordered) pair of vertices is connected independently with probability .

Example [Symmetric communities]

A simple example to keep in mind is that of symmetric communities, with more edges within communities than between communities.

This is modeled by the SBM with for all and if and otherwise, with .

We write for a graph generated according to the SBM without the hidden vertex labels revealed.

The goal of a statistical inference algorithm is to recover as many labels as possible using only the underlying graph as an observation.

There are various notions of success that are worth studying. In this post we focus on exact recovery: we aim to recover the labels of all vertices exactly with high probability (whp).

More precisely, in evaluating the success of an algorithm, the agreement of a partition with the true partition is maximized over all relabellings of the communities, since we are not interested in the specific original labelling per se, but rather the partition (community structure) it induces.

For exact recovery, all vertices in all but one community should be non-isolated (in the symmetric case this means that the graph should be connected), requiring the edge probabilities to be .

It is thus natural to scale the edge probability matrix accordingly, i.e., to consider , where .

We also assume that the communities have linear size, i.e., that is independent of , and for all .

From exact recovery to testing multivariate Poisson distributions

As a thought experiment, imagine that not only is the graph given, but also all vertex labels are revealed, except for that of a given vertex . Is it possible to determine the label of ?

Understanding this question is key for understanding exact recovery, since if the error probability of this is too high, then exact recovery will not be possible. On the other hand, it turns out that in this regime it is possible to recover all but labels using an initial partial recovery algorithm. The setup of the thought experiment then becomes relevant, and if we can determine the label of given the labels of all the other nodes with low error probability, then we can correct all errors made in the initial partial recovery algorithm, leading to exact recovery. We will come back to the connection between the thought experiment and exact recovery; for now we focus on understanding this thought experiment.

Given the labels of all vertices except , the information we have about is the number of nodes in each community it is connected to. In other words, we know the degree profile of , where, for a given labelling of the graph’s vertices, the -th component is the number of edges between and the vertices in community .

The distribution of the degree profile depends on the community that belongs to. Recall that the community sizes are given by a multinomial distribution with parameters and , and hence the relative size of community concentrates on . Thus if , the degree profile can be approximated by independent binomials, with approximately distributed as

where denotes the binomial distribution with trials and success probability . In this regime, the binomial distribution is well-approximated by a Poisson distribution of the same mean. Thus the degree profile of a vertex in community is approximately Poisson distributed with mean

where is the -th unit vector. Defining , this can be abbreviated as , where denotes the -th column of the matrix .

We call the quantity the community profile of community ; this is the quantity that determines the distribution of the degree profile of vertices from a given community.

Our thought experiment has thus been reduced to a Bayesian hypothesis testing problem between multivariate Poisson distributions.

The prior on the label of is given by , and we get to observe the degree profile , which comes from one of multivariate Poisson distributions, which have mean times the community profiles , .

Testing multivariate Poisson distributions

We now turn to understanding the testing problem described above; the setup is as follows.

We consider a Bayesian hypothesis testing problem with hypotheses.

The random variable takes values in with prior given by , i.e., .

We do not observe , but instead we observe a draw from a multivariate Poisson distribution whose mean depends on the realization of :

given , the mean is .

In short:

In more detail:

where

and

Our goal is to infer the value of from a realization of .

The error probability is minimized by the maximum a posteriori (MAP) rule, which, upon observing , selects

as an estimate for the value of , with ties broken arbitrarily.

Let denote the error of the MAP estimator.

One can think of the MAP estimator as a tournament of pairwise comparisons of the hypotheses:

if

then the MAP estimate is not .

The probability that one makes an error during such a comparison is exactly

(1)

For finite , the error of the MAP estimator is on the same order as the largest pairwise comparison error, i.e., .

In particular, we have that

(2)

Thus we desire to understand the magnitude of the error probability in (1) in the particular case when the conditional distribution of given is a multivariate Poisson distribution with mean vector on the order of . The following result determines this error up to first order in the exponent.

Lemma [Abbe and Sandon 2015]

For any with and , we have

(3)

where

(4)

Due to connections with other, well-known measures of divergence (which we do not discuss here), Abbe and Sandon termed the Chernoff-Hellinger divergence.

We do not go over the proof of this statement—which we leave to the reader as a challenging exercise—but we provide some intuition in the univariate case.

Observe that

decays rapidly away from

so we can obtain a good estimate of the sum

by simply estimating the term

Now observe that must satisfy

after some algebra this is equivalent to

Let denote the maximizer in the expression of in~\eqref{eq:D_+}.

By differentiating in , we obtain that satisfies

and so

Thus we see that

from which, after some algebra, we get that

The proof of (3) in the multivariate case follows along the same lines: the single term corresponding to

gives the lower bound. For the upper bound of (3) one has to show that the other terms do not contribute much more.

Characterizing exact recoverability using CH-divergence

Our conclusion is thus that the error exponent in testing multivariate Poisson distributions is given by the explicit quantity in (4).

The discussion above then implies that plays an important role in the threshold for exact recovery.

In particular, it intuitively follows from the above Lemma that a necessary condition for exact recovery should be that

Suppose on the contrary that

for some and .

This implies that the error probability in the testing problem is for some for all vertices in communities and . Since the number of vertices in these communities is linear in , and most of the hypothesis testing problems are approximately independent, one expects there to be no error in the testing problems with probability at most

.

It turns out that this indeed gives the recoverability threshold.

Theorem [Abbe and Sandon 2015]

Let denote the number of communities,

let with denote the community prior,

let ,

and let be a symmetric matrix with no two rows equal.

Exact recovery is solvable in if and only if

(5)

This theorem thus provides an operational meaning to the CH-divergence for the community recovery problem.

Example [Symmetric communities]

Consider again symmetric communities, that is,

for all ,

if ,

and otherwise, with .

Then exact recovery is solvable in if and only if

(6)

We note that in this case is the same as the Hellinger divergence.

Above we have heuristically described why the condition (5) is necessary.

To see why it is sufficient, first note that if (5) holds, then the Lemma tells us that in the hypothesis testing problem between Poisson distributions the error of the MAP estimate is .

Thus if the setting of the thought experiment applies to every vertex, then by looking at the degree profiles of the vertices we can correctly reclassify all vertices, and the probability that we make an error is by a union bound.

However, the setting of the thought experiment does not quite apply. Nonetheless, in this logarithmic degree regime it is possible to partially reconstruct the labels of the vertices, with only vertices being misclassified. The details of this partial reconstruction procedure would require a separate post—in brief, it determines whether two vertices are in the same community or not by looking at how their size neighborhoods interact.

One can then apply two rounds of the degree-profiling step to each vertex, and it can be shown that all vertices are then correctly labelled whp.

This entry was posted in Probability theory, Random graphs. Bookmark the permalink.