Foreword: this will be a series of blog posts written by our Theory Group postdoc Miklos Racz. It is essentially the transcript of a mini-course that Miki gave at the University of Washington and at the XX Brazilian School of Probability, and that I gave at the University of Nice. These notes are also posted on arXiv. I now leave the floor to Miki:
Community detection is a fundamental problem in many sciences, such as sociology (e.g., finding tight-knit groups in social networks), biology (e.g., detecting protein complexes), and beyond. Given its importance, a plethora of algorithms has been developed over the past few decades to detect communities. In the past few years there have also been many works studying the fundamental limits of such recovery algorithms.
This post describes the recent results of Abbe and Sandon which characterize the fundamental limits of exact recovery in the most canonical and popular probabilistic generative model, the stochastic block model (SBM).
The stochastic block model and exact recovery
The general stochastic block model is a distribution on graphs with latent community structure, and it has three parameters: $n$, the number of vertices; a probability distribution $p = (p_1, \dots, p_k)$ that describes the relative sizes of the communities; and $W$, a symmetric $k \times k$ matrix that describes the probabilities with which two given vertices are connected, depending on which communities they belong to. The number of communities, $k$, is implicit in this notation; in these notes we assume that $k$ is a fixed constant.
A random graph from $\mathrm{SBM}(n, p, W)$ is defined as follows:
- The vertex set of the graph is $V = [n] := \{1, \dots, n\}$.
- Every vertex $v \in [n]$ is independently assigned a (hidden) label $\sigma_v \in [k]$ from the probability distribution $p$ on $[k]$. That is, $\mathbb{P}(\sigma_v = i) = p_i$ for every $i \in [k]$.
- Given the labels of the vertices, each (unordered) pair of vertices $\{u, v\}$ is connected independently with probability $W_{\sigma_u, \sigma_v}$.
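To make the definition concrete, here is a minimal sampler sketch in Python/NumPy. The function name sample_sbm and the dense-matrix representation are illustrative choices of mine, not part of the model.

```python
import numpy as np

def sample_sbm(n, p, W, rng=None):
    """Sample (labels, adjacency matrix) from SBM(n, p, W).

    n: number of vertices; p: community prior of length k;
    W: symmetric k x k matrix of connection probabilities.
    """
    rng = np.random.default_rng(rng)
    # Step 1: hidden labels, i.i.d. from p.
    labels = rng.choice(len(p), size=n, p=p)
    # Step 2: pair {u, v} is present independently with prob. W[sigma_u, sigma_v].
    probs = W[labels[:, None], labels[None, :]]
    upper = np.triu(rng.random((n, n)) < probs, k=1)  # one coin per unordered pair
    A = (upper | upper.T).astype(int)                 # symmetric adjacency matrix
    return labels, A
```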
Example [Symmetric communities]
A simple example to keep in mind is that of symmetric communities, with more edges within communities than between communities.
This is modeled by the SBM with $p_i = 1/k$ for all $i \in [k]$, and $W_{i,j} = a$ if $i = j$ and $W_{i,j} = b$ otherwise, with $a > b > 0$.
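The symmetric model is just a particular choice of $p$ and $W$; a possible instantiation of the sketch above (the values of a and b are arbitrary illustrations):

```python
import numpy as np

k, a, b = 2, 0.6, 0.1                          # illustrative values with a > b > 0
p = np.full(k, 1 / k)                          # equal-sized communities
W = np.full((k, k), b) + (a - b) * np.eye(k)   # a on the diagonal, b elsewhere
labels, A = sample_sbm(500, p, W)              # sample_sbm from the sketch above
```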
We write $G \sim \mathrm{SBM}(n, p, W)$ for a graph generated according to the SBM without the hidden vertex labels revealed.
The goal of a statistical inference algorithm is to recover as many labels as possible using only the underlying graph as an observation.
There are various notions of success that are worth studying. In this post we focus on exact recovery: we aim to recover the labels of all vertices exactly with high probability (whp).
More precisely, in evaluating the success of an algorithm, the agreement of a partition with the true partition is maximized over all relabellings of the communities, since we are not interested in the specific original labelling per se, but rather the partition (community structure) it induces.
For exact recovery to be possible, all vertices in all but one community should be non-isolated (in the symmetric case this means that the graph should be connected), requiring the edge probabilities to be $\Omega(\ln(n)/n)$. It is thus natural to scale the edge probability matrix accordingly, i.e., to consider $W = \ln(n) \, Q / n$, where $Q$ is a fixed $k \times k$ matrix. We also assume that the communities have linear size, i.e., that $p$ is independent of $n$, and $p_i \in (0,1)$ for all $i \in [k]$.
From exact recovery to testing multivariate Poisson distributions
As a thought experiment, imagine that not only is the graph $G$ given, but also all vertex labels are revealed, except for that of a given vertex $v$. Is it possible to determine the label of $v$?
Understanding this question is key for understanding exact recovery, since if the error probability of this idealized test is too high, then exact recovery will not be possible. On the other hand, it turns out that in this regime it is possible to recover all but $o(n)$ labels using an initial partial recovery algorithm. The setup of the thought experiment then becomes relevant, and if we can determine the label of $v$ given the labels of all the other nodes with low error probability, then we can correct all errors made in the initial partial recovery algorithm, leading to exact recovery. We will come back to the connection between the thought experiment and exact recovery; for now we focus on understanding this thought experiment.
Given the labels of all vertices except $v$, the information we have about $v$ is the number of nodes in each community it is connected to. In other words, we know the degree profile $d(v) = (d_1(v), \dots, d_k(v))$ of $v$, where, for a given labelling of the graph's vertices, the $i$-th component $d_i(v)$ is the number of edges between $v$ and the vertices in community $i$.
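Computing degree profiles given a labelling is a one-liner; a minimal sketch (the helper name degree_profiles is mine):

```python
import numpy as np

def degree_profiles(A, labels, k):
    """d[v, i] = number of edges between vertex v and vertices in community i."""
    onehot = np.eye(k, dtype=int)[labels]  # n x k community indicator matrix
    return A @ onehot                      # row v is the degree profile d(v)
```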
The distribution of the degree profile depends on the community that $v$ belongs to. Recall that the community sizes are given by a multinomial distribution with parameters $n$ and $p$, and hence the relative size of community $i$ concentrates on $p_i$. Thus if $\sigma_v = j$, the degree profile $d(v)$ can be approximated by independent binomials, with $d_i(v)$ approximately distributed as

$\mathrm{Bin}\left( n p_i, \ln(n) Q_{i,j} / n \right),$

where $\mathrm{Bin}(m, q)$ denotes the binomial distribution with $m$ trials and success probability $q$. In this regime, the binomial distribution is well-approximated by a Poisson distribution of the same mean. Thus the degree profile of a vertex in community $j$ is approximately Poisson distributed with mean

$\ln(n) \sum_{i \in [k]} p_i Q_{i,j} e_i,$

where $e_i$ is the $i$-th unit vector. Defining $P = \mathrm{diag}(p)$, this can be abbreviated as $\ln(n) (PQ)_j$, where $(PQ)_j$ denotes the $j$-th column of the matrix $PQ$.
We call the quantity $(PQ)_j$ the community profile of community $j$; this is the quantity that determines the distribution of the degree profile of vertices from a given community.
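In code, the community profiles are simply the columns of $PQ$; a sketch with illustrative parameters:

```python
import numpy as np

def community_profiles(p, Q):
    """Return PQ with P = diag(p); column j is the community profile (PQ)_j."""
    return np.diag(p) @ Q

p = np.array([0.5, 0.5])            # illustrative two-community prior
Q = np.array([[6.0, 2.0],
              [2.0, 6.0]])
n = 10_000
PQ = community_profiles(p, Q)
# Approximation above: a vertex in community j has degree profile
# d(v) ~ Poi(ln(n) * (PQ)_j), with independent coordinates.
mean_comm_0 = np.log(n) * PQ[:, 0]
```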
Our thought experiment has thus been reduced to a Bayesian hypothesis testing problem between multivariate Poisson distributions.
The prior on the label of $v$ is given by $p$, and we get to observe the degree profile $d(v)$, which comes from one of $k$ multivariate Poisson distributions, which have mean $\ln(n)$ times the community profiles $(PQ)_j$, $j \in [k]$.
Testing multivariate Poisson distributions
We now turn to understanding the testing problem described above; the setup is as follows.
We consider a Bayesian hypothesis testing problem with $k$ hypotheses.
The random variable $H$ takes values in $[k]$ with prior given by $p$, i.e., $\mathbb{P}(H = j) = p_j$.
We do not observe $H$, but instead we observe a draw from a multivariate Poisson distribution whose mean depends on the realization of $H$: given $H = j$, the mean is $\lambda(j) := \ln(n) \, (PQ)_j$.
In short:

$D \mid H = j \; \sim \; \mathrm{Poi}\left( \ln(n) (PQ)_j \right).$

In more detail:

$\mathbb{P}\left( D = d \mid H = j \right) = \mathcal{P}_{\ln(n)(PQ)_j}(d),$

where $d \in \mathbb{Z}_+^k$ and

$\mathcal{P}_{\lambda}(d) = \prod_{i \in [k]} \frac{\lambda_i^{d_i}}{d_i!} e^{-\lambda_i}.$

Our goal is to infer the value of $H$ from a realization of $D$.
The error probability is minimized by the maximum a posteriori (MAP) rule, which, upon observing $D = d$, selects

$\arg\max_{j \in [k]} \; \mathbb{P}\left( D = d \mid H = j \right) p_j$

as an estimate for the value of $H$, with ties broken arbitrarily. Let $P_e$ denote the error of the MAP estimator.
One can think of the MAP estimator as a tournament of $\binom{k}{2}$ pairwise comparisons of the hypotheses: if

$\mathbb{P}\left( D = d \mid H = i \right) p_i > \mathbb{P}\left( D = d \mid H = j \right) p_j,$

then the MAP estimate is not $j$.
The probability that one makes an error during such a comparison is exactly

$P_e(i,j) := \sum_{d \in \mathbb{Z}_+^k} \min\left\{ p_i \, \mathbb{P}\left( D = d \mid H = i \right), \; p_j \, \mathbb{P}\left( D = d \mid H = j \right) \right\}. \qquad (1)$
For finite $k$, the error of the MAP estimator is on the same order as the largest pairwise comparison error, i.e., $\max_{i \neq j} P_e(i,j)$. In particular, we have that

$\max_{i \neq j} P_e(i,j) \; \leq \; P_e \; \leq \; \sum_{i \neq j} P_e(i,j). \qquad (2)$
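A sketch of the MAP rule for this testing problem, together with a Monte Carlo estimate of the pairwise error (1) via the identity $\sum_d \min\{p_i f_i(d), p_j f_j(d)\} = \mathbb{E}_{d \sim f_i} \min\{p_i, \, p_j f_j(d)/f_i(d)\}$; the function names are mine.

```python
import numpy as np
from scipy.stats import poisson

def map_estimate(d, p, means):
    """MAP rule: argmax_j p_j * P(D = d | H = j), coordinates independent Poisson.

    d: observed profile (k,); p: prior (k,); means[j]: mean vector given H = j.
    """
    log_post = [np.log(p[j]) + poisson.logpmf(d, means[j]).sum()
                for j in range(len(p))]
    return int(np.argmax(log_post))

def pairwise_error(i, j, p, means, trials=200_000, rng=None):
    """Monte Carlo estimate of P_e(i, j) in (1)."""
    rng = np.random.default_rng(rng)
    d = rng.poisson(means[i], size=(trials, len(means[i])))
    log_ratio = (poisson.logpmf(d, means[j])
                 - poisson.logpmf(d, means[i])).sum(axis=1)
    return np.minimum(p[i], p[j] * np.exp(log_ratio)).mean()
```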
Thus we desire to understand the magnitude of the error probability $P_e(i,j)$ in (1) in the particular case when the conditional distribution of $D$ given $H$ is a multivariate Poisson distribution with mean vector on the order of $\ln(n)$. The following result determines this error up to first order in the exponent.
Lemma [Abbe and Sandon 2015]
For any $c_1, c_2 \in (0,\infty)^k$ with $c_1 \neq c_2$ and $p_1, p_2 > 0$, we have

$\sum_{d \in \mathbb{Z}_+^k} \min\left\{ p_1 \, \mathcal{P}_{\ln(n) c_1}(d), \; p_2 \, \mathcal{P}_{\ln(n) c_2}(d) \right\} = n^{-D_+(c_1, c_2) + o(1)}, \qquad (3)$

where

$D_+(c_1, c_2) = \max_{t \in [0,1]} \sum_{i \in [k]} \left( t \, c_1(i) + (1-t) \, c_2(i) - c_1(i)^t \, c_2(i)^{1-t} \right). \qquad (4)$
Due to connections with other, well-known measures of divergence (which we do not discuss here), Abbe and Sandon termed $D_+$ the Chernoff-Hellinger divergence (CH-divergence).
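The CH-divergence (4) is a one-dimensional maximization of a concave function of $t$, so it is easy to evaluate numerically; a minimal sketch:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def ch_divergence(c1, c2):
    """D_+(c1, c2) of (4): the objective is concave in t, so a bounded
    scalar maximization over [0, 1] finds the global maximum."""
    c1, c2 = np.asarray(c1, float), np.asarray(c2, float)
    def neg(t):
        return -np.sum(t * c1 + (1 - t) * c2 - c1**t * c2**(1 - t))
    return -minimize_scalar(neg, bounds=(0.0, 1.0), method="bounded").fun
```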
We do not go over the proof of this statement—which we leave to the reader as a challenging exercise—but we provide some intuition in the univariate case.
Observe that

$\min\left\{ p_1 \, \mathcal{P}_{\ln(n)\lambda_1}(d), \; p_2 \, \mathcal{P}_{\ln(n)\lambda_2}(d) \right\}$

decays rapidly away from the crossing point $d^*$ of the two weighted densities, so we can obtain a good estimate of the sum

$\sum_{d} \min\left\{ p_1 \, \mathcal{P}_{\ln(n)\lambda_1}(d), \; p_2 \, \mathcal{P}_{\ln(n)\lambda_2}(d) \right\}$

by simply estimating the term corresponding to $d^*$. Now observe that $d^*$ must satisfy

$p_1 \, \mathcal{P}_{\ln(n)\lambda_1}(d^*) \approx p_2 \, \mathcal{P}_{\ln(n)\lambda_2}(d^*);$

since $p_1$ and $p_2$ are constants, after some algebra this is equivalent to

$d^* = (1 + o(1)) \, \ln(n) \, \frac{\lambda_1 - \lambda_2}{\ln \lambda_1 - \ln \lambda_2}.$

Let $t^*$ denote the maximizer in the expression of $D_+(\lambda_1, \lambda_2)$ in (4). By differentiating in $t$, we obtain that $t^*$ satisfies

$\lambda_1 - \lambda_2 = \lambda_1^{t^*} \lambda_2^{1-t^*} \left( \ln \lambda_1 - \ln \lambda_2 \right),$

and so

$\lambda_1^{t^*} \lambda_2^{1-t^*} = \frac{\lambda_1 - \lambda_2}{\ln \lambda_1 - \ln \lambda_2}.$

Thus we see that

$d^* = (1 + o(1)) \, \ln(n) \, \lambda_1^{t^*} \lambda_2^{1-t^*},$

from which, after some algebra, we get that

$p_1 \, \mathcal{P}_{\ln(n)\lambda_1}(d^*) = n^{-\left( t^* \lambda_1 + (1-t^*) \lambda_2 - \lambda_1^{t^*} \lambda_2^{1-t^*} \right) + o(1)} = n^{-D_+(\lambda_1, \lambda_2) + o(1)}.$
The proof of (3) in the multivariate case follows along the same lines: the single term corresponding to $d^*$, with coordinates $d^*(i) = (1+o(1)) \ln(n) \, c_1(i)^{t^*} c_2(i)^{1-t^*}$, gives the lower bound. For the upper bound of (3) one has to show that the other terms do not contribute much more.
Characterizing exact recoverability using CH-divergence
Our conclusion is thus that the error exponent in testing multivariate Poisson distributions is given by the explicit quantity $D_+$ in (4). The discussion above then implies that $D_+$ plays an important role in the threshold for exact recovery.
In particular, it intuitively follows from the above Lemma that a necessary condition for exact recovery should be that

$D_+\left( (PQ)_i, (PQ)_j \right) \geq 1 \quad \text{for all } i \neq j.$

Suppose on the contrary that $D_+\left( (PQ)_i, (PQ)_j \right) < 1$ for some $i$ and $j$ with $i \neq j$. This implies that the error probability in the testing problem is $n^{-1+\varepsilon+o(1)}$ for some $\varepsilon > 0$ for all vertices in communities $i$ and $j$. Since the number of vertices in these communities is linear in $n$, and most of the hypothesis testing problems are approximately independent, one expects there to be no error in the testing problems with probability at most

$\left( 1 - n^{-1+\varepsilon+o(1)} \right)^{\Theta(n)} = \exp\left( -\Theta\left( n^{\varepsilon+o(1)} \right) \right) = o(1).$
It turns out that this indeed gives the recoverability threshold.
Theorem [Abbe and Sandon 2015]
Let $k \in \mathbb{Z}_+$ denote the number of communities,
let $p \in (0,1)^k$ with $\sum_{i \in [k]} p_i = 1$ denote the community prior,
let $P = \mathrm{diag}(p)$,
and let $Q \in (0,\infty)^{k \times k}$ be a symmetric $k \times k$ matrix with no two rows equal.
Exact recovery is solvable in $\mathrm{SBM}(n, p, \ln(n) Q / n)$ if and only if

$\min_{i \neq j} D_+\left( (PQ)_i, (PQ)_j \right) \geq 1. \qquad (5)$
This theorem thus provides an operational meaning to the CH-divergence for the community recovery problem.
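Given $p$ and $Q$, condition (5) is directly checkable; a small sketch reusing ch_divergence from above (the function name exact_recovery_solvable is mine):

```python
import numpy as np

def exact_recovery_solvable(p, Q):
    """Check condition (5): min over i < j of D_+((PQ)_i, (PQ)_j) >= 1."""
    PQ = np.diag(p) @ Q
    k = len(p)
    return min(ch_divergence(PQ[:, i], PQ[:, j])
               for i in range(k) for j in range(i + 1, k)) >= 1
```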
Example [Symmetric communities]
Consider again $k$ symmetric communities, that is, $p_i = 1/k$ for all $i \in [k]$, $Q_{i,j} = a$ if $i = j$, and $Q_{i,j} = b$ otherwise, with $a, b > 0$ and $a \neq b$. Then exact recovery is solvable in $\mathrm{SBM}(n, p, \ln(n) Q / n)$ if and only if

$\frac{\left( \sqrt{a} - \sqrt{b} \right)^2}{k} \geq 1, \quad \text{i.e.,} \quad \left| \sqrt{a} - \sqrt{b} \right| \geq \sqrt{k}. \qquad (6)$
We note that in this case the maximizer in (4) is $t^* = 1/2$ (by symmetry), and so $D_+$ is the same as the Hellinger divergence.
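One can check numerically that the general condition (5) reduces to (6) in the symmetric case; for instance, with illustrative values and ch_divergence from the sketch above:

```python
import numpy as np

k, a, b = 3, 16.0, 2.0                       # illustrative values
p = np.full(k, 1 / k)
Q = np.full((k, k), b) + (a - b) * np.eye(k)
PQ = np.diag(p) @ Q
print(ch_divergence(PQ[:, 0], PQ[:, 1]))     # numerically ~ 2.229
print((np.sqrt(a) - np.sqrt(b)) ** 2 / k)    # (sqrt(a) - sqrt(b))^2 / k, same value
```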
Above we have heuristically described why the condition (5) is necessary.
To see why it is sufficient, first note that if (5) holds, then the Lemma tells us that in the hypothesis testing problem between Poisson distributions the error of the MAP estimate is $n^{-D_+ + o(1)} = o(1/n)$ (we gloss over the boundary case of equality in (5), which requires additional care).
Thus if the setting of the thought experiment applies to every vertex, then by looking at the degree profiles of the vertices we can correctly reclassify all vertices, and the probability that we make an error is $o(1)$ by a union bound over the $n$ vertices.
However, the setting of the thought experiment does not quite apply. Nonetheless, in this logarithmic degree regime it is possible to partially reconstruct the labels of the vertices, with only $o(n)$ vertices being misclassified. The details of this partial reconstruction procedure would require a separate post; in brief, it determines whether two vertices are in the same community or not by looking at how their neighborhoods (of appropriately chosen size) interact.
One can then apply two rounds of the degree-profiling step to each vertex, and it can be shown that all vertices are then correctly labelled whp.
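To close, here is a toy end-to-end sketch of the final degree-profiling step. It is emphatically not Abbe and Sandon's actual algorithm: the partial recovery stage is replaced by artificially corrupting a small fraction of the true labels, and only one reclassification round is shown for simplicity.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)
n, k, a, b = 2000, 2, 16.0, 2.0                  # (sqrt(a) - sqrt(b))^2 / k > 1
p = np.full(k, 1 / k)
Q = np.full((k, k), b) + (a - b) * np.eye(k)

labels, A = sample_sbm(n, p, np.log(n) * Q / n, rng)  # sampler from the first sketch

noisy = labels.copy()                            # stand-in for partial recovery:
flip = rng.random(n) < 0.05                      # corrupt 5% of the labels
noisy[flip] = rng.integers(0, k, size=flip.sum())

profiles = A @ np.eye(k, dtype=int)[noisy]       # degree profiles w.r.t. noisy labels
means = np.log(n) * (np.diag(p) @ Q)             # ln(n) times the community profiles
log_post = np.log(p) + np.stack(
    [poisson.logpmf(profiles, means[:, j]).sum(axis=1) for j in range(k)], axis=1)
relabelled = log_post.argmax(axis=1)             # MAP reclassification per vertex

# Agreement up to a global label swap (k = 2 here):
errors = min((relabelled != labels).sum(), (relabelled == labels).sum())
print("misclassified after one degree-profiling round:", errors)
```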