Following the previous post in which we studied community detection, in this post we study the fundamental limits of inferring geometric structure in networks. Many networks coming from physical considerations naturally have an underlying geometry, such as the network of major roads in a country. In other networks this stems from a latent feature space of the nodes. For instance, in social networks a person might be represented by a feature vector of their interests, and two people are connected if their interests are close enough; this latent metric space is referred to as the social space. We are particularly interested in the high-dimensional regime, which brings about a host of new questions, such as estimating the dimension.
A simple random geometric graph model and basic questions
We study perhaps the simplest model of a random geometric graph, where the underlying metric space is the $(d-1)$-dimensional unit sphere $\mathbb{S}^{d-1} = \left\{ x \in \mathbb{R}^d : \left\| x \right\|_2 = 1 \right\}$, and where the latent labels of the nodes are i.i.d. uniform random vectors in $\mathbb{S}^{d-1}$. More precisely, the random geometric graph $G(n, p, d)$ is defined as follows. Let $X_1, \dots, X_n$ be independent random vectors, uniformly distributed on $\mathbb{S}^{d-1}$. In $G(n, p, d)$, distinct nodes $i$ and $j$ are connected by an edge if and only if $\langle X_i, X_j \rangle \geq t_{p,d}$, where the threshold value $t_{p,d} \in [-1, 1]$ is such that
\[
\mathbb{P}\left( \langle X_1, X_2 \rangle \geq t_{p,d} \right) = p.
\]
For example, when $p = 1/2$ we have $t_{1/2,d} = 0$.
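To make the model concrete, here is a minimal simulation sketch of $G(n, 1/2, d)$ (the function names are ours, not from any library): sample i.i.d. uniform points on the sphere by normalizing standard Gaussian vectors, and connect two nodes when their inner product is nonnegative, using the fact that $t_{1/2,d} = 0$.

```python
import random
import math

def sample_sphere_point(d):
    """Uniform point on S^{d-1}: normalize a standard Gaussian vector."""
    g = [random.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]

def sample_geometric_graph(n, d):
    """G(n, 1/2, d): connect i ~ j iff <X_i, X_j> >= 0 (t_{1/2,d} = 0)."""
    points = [sample_sphere_point(d) for _ in range(n)]
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(points[i], points[j]))
            if dot >= 0:
                adj[i][j] = adj[j][i] = 1
    return adj
```

For general $p$ one would need the threshold $t_{p,d}$ numerically, which we skip here; for $p = 1/2$ the threshold is exactly $0$, and the resulting graph has edge density concentrating around $1/2$.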
The most natural random graph model without any structure is the standard Erdős–Rényi random graph $G(n, p)$, where any two of the $n$ vertices are independently connected with probability $p$.
We can thus formalize the question of detecting underlying geometry as a simple hypothesis testing question. The null hypothesis is that the graph is drawn from the Erdős–Rényi model, while the alternative is that it is drawn from $G(n, p, d)$. In brief:
\[
H_0 : G \sim G(n, p), \qquad H_1 : G \sim G(n, p, d). \qquad (1)
\]
To understand this question, the basic quantity we need to study is the total variation distance between the two distributions on graphs, $G(n,p)$ and $G(n,p,d)$, denoted by $\mathrm{TV}\left( G(n,p), G(n,p,d) \right)$; recall that the total variation distance between two probability measures $P$ and $Q$ is defined as $\mathrm{TV}(P, Q) = \sup_{A} \left| P(A) - Q(A) \right| = \frac{1}{2} \left\| P - Q \right\|_1$. We are interested in particular in the case when the dimension $d$ is large, growing with $n$.
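For distributions on a finite set the total variation distance can be computed directly from this definition; here is a small helper (our own, purely for illustration) that takes two probability mass functions as dicts:

```python
def total_variation(p, q):
    """TV(P, Q) = (1/2) * sum_x |P(x) - Q(x)|, which equals
    sup_A |P(A) - Q(A)| for distributions on a finite set."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

# For instance, two coins with heads-probabilities 0.5 and 0.8:
# total_variation({"H": 0.5, "T": 0.5}, {"H": 0.8, "T": 0.2}) -> 0.3
```

Of course, for the graph distributions above the support is the set of all graphs on $n$ vertices, so the distance cannot be computed by brute force; bounding it requires the arguments sketched below.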
It is intuitively clear that if the geometry is too high-dimensional, then it is impossible to detect it, while a low-dimensional geometry will have a strong effect on the generated graph and will be detectable. How fast can the dimension grow with $n$ while still being able to detect it? Most of this post will focus on this question.
If we can detect geometry, then it is natural to ask for more information. Perhaps the ultimate goal would be to find an embedding of the vertices into a sphere of appropriate dimension that is a true representation, in the sense that the geometric graph formed from the embedded points is indeed the original graph. More modestly, can the dimension $d$ be estimated? We touch on this question at the end of the post.
The dimension threshold for detecting underlying geometry
The high-dimensional setting of the random geometric graph $G(n,p,d)$ was first studied by Devroye, György, Lugosi, and Udina, who showed that geometry is indeed lost in high dimensions: if $n$ is fixed and $d \to \infty$, then $\mathrm{TV}\left( G(n,p), G(n,p,d) \right) \to 0$. More precisely, they show that this convergence happens when $d$ is exponentially large in $n^2$, but this is not tight. The dimension threshold for dense graphs was recently found by Bubeck, Ding, Eldan, and Rácz, and it turns out that it is $d \approx n^3$, in the following sense.
Theorem [Bubeck, Ding, Eldan, and Rácz 2014]
Let $p \in (0,1)$ be fixed. Then
\[
\mathrm{TV}\left( G(n,p), G(n,p,d) \right) \to 0 \quad \text{if } d / n^3 \to \infty, \qquad (2)
\]
\[
\mathrm{TV}\left( G(n,p), G(n,p,d) \right) \to 1 \quad \text{if } d / n^3 \to 0. \qquad (3)
\]
Moreover, in the latter case there exists a computationally efficient test to detect underlying geometry (with running time $O(n^3)$).
Most of this post is devoted to understanding (3), that is, how the two models can be distinguished; the impossibility result of (2) will be discussed in a future post. At the end we will also consider this same question for sparse graphs (where $p = c/n$), where determining the dimension threshold is an intriguing open problem.
The triangle test
A natural test to uncover geometric structure is to count the number of triangles in $G$. Indeed, in a purely random scenario, vertex $u$ being connected to both $v$ and $w$ says nothing about whether $v$ and $w$ are connected. On the other hand, in a geometric setting this implies that $X_v$ and $X_w$ are both close to $X_u$, and hence close to each other due to the triangle inequality, thus increasing the probability of a connection between them. This, in turn, implies that the expected number of triangles is larger in the geometric setting, given the same edge density. Let us now compute what this statistic gives us.
Given that $u$ is connected to both $v$ and $w$, $v$ and $w$ are more likely to be connected under $G(n,p,d)$ than under $G(n,p)$.
For a graph $G$, let $A$ denote its adjacency matrix. Then $A_{i,j} A_{i,k} A_{j,k}$ is the indicator variable that the three vertices $i$, $j$, and $k$ form a triangle, and so the number of triangles in $G$ is
\[
T(G) = \sum_{i < j < k} A_{i,j} A_{i,k} A_{j,k}.
\]
By linearity of expectation, for both models the expected number of triangles is $\binom{n}{3}$ times the probability of a triangle between three specific vertices. For the Erdős–Rényi random graph the edges are independent, so the probability of a triangle is $p^3$, and thus we have
\[
\mathbb{E}\left[ T\left( G(n,p) \right) \right] = \binom{n}{3} p^3.
\]
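As a sanity check on this computation, one can count triangles by brute force and compare against $\binom{n}{3} p^3$ in the Erdős–Rényi case. A sketch (helper names are ours):

```python
import random
from itertools import combinations
from math import comb

def count_triangles(adj):
    """T(G) = sum over triples i<j<k of A_ij * A_ik * A_jk."""
    n = len(adj)
    return sum(adj[i][j] * adj[i][k] * adj[j][k]
               for i, j, k in combinations(range(n), 3))

def sample_er_graph(n, p):
    """Erdos-Renyi G(n, p): each pair connected independently."""
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if random.random() < p:
                adj[i][j] = adj[j][i] = 1
    return adj
```

Averaging `count_triangles(sample_er_graph(n, p))` over a few samples should land close to `comb(n, 3) * p**3`, matching the display above.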
For $G(n,p,d)$ it turns out that for any fixed $p \in (0,1)$ the probability of a triangle between three specific vertices satisfies
\[
\mathbb{P}\left( \triangle_{123} \right) \geq p^3 \left( 1 + \frac{c_p}{\sqrt{d}} \right) \qquad (4)
\]
for some constant $c_p > 0$, which gives that
\[
\mathbb{E}\left[ T\left( G(n,p,d) \right) \right] \geq \binom{n}{3} p^3 \left( 1 + \frac{c_p}{\sqrt{d}} \right).
\]
Showing (4) is somewhat involved, but in essence it follows from the concentration of measure phenomenon on the sphere, namely that most of the mass on the high-dimensional sphere is located in a band of width $O(1/\sqrt{d})$ around the equator. We sketch here the main intuition for $p = 1/2$, which is illustrated in the figure below.
Let $X_1$, $X_2$, and $X_3$ be independent uniformly distributed points in $\mathbb{S}^{d-1}$. Then
\[
\mathbb{P}\left( \triangle_{123} \right) = \mathbb{P}\left( \langle X_1, X_2 \rangle \geq 0, \, \langle X_1, X_3 \rangle \geq 0, \, \langle X_2, X_3 \rangle \geq 0 \right) = \frac{1}{4} \, \mathbb{P}\left( \langle X_2, X_3 \rangle \geq 0 \,\middle|\, \langle X_1, X_2 \rangle \geq 0, \, \langle X_1, X_3 \rangle \geq 0 \right),
\]
where the last equality follows by independence, since $\mathbb{P}\left( \langle X_1, X_2 \rangle \geq 0, \langle X_1, X_3 \rangle \geq 0 \right) = 1/4$. So what remains is to show that this latter conditional probability is approximately $\frac{1}{2} + \frac{c}{\sqrt{d}}$. To compute this conditional probability, what we really need to know is the typical angle between $X_1$ and $X_2$. By rotational invariance we may assume that $X_1 = (1, 0, \dots, 0)$, and hence $\langle X_1, X_2 \rangle = X_2(1)$, the first coordinate of $X_2$. One way to generate $X_2$ is to sample a $d$-dimensional standard Gaussian and then normalize it by its length. Since the norm of a $d$-dimensional standard Gaussian is very well concentrated around $\sqrt{d}$, it follows that $X_2(1)$ is on the order of $1/\sqrt{d}$. Conditioned on $\langle X_1, X_2 \rangle \geq 0$ and $\langle X_1, X_3 \rangle \geq 0$, this typical angle gives the boost in the conditional probability that we see.
If $X_1$ and $X_2$ are two independent uniform points on $\mathbb{S}^{d-1}$, then their inner product $\langle X_1, X_2 \rangle$ is on the order of $1/\sqrt{d}$ due to the concentration of measure phenomenon on the sphere.
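The $1/\sqrt{d}$ scaling is easy to check numerically. The following Monte Carlo sketch (helper names ours) estimates the typical size of $\langle X_1, X_2 \rangle$; for large $d$ the first coordinate of a uniform point is approximately Gaussian with variance $1/d$, so the average of $|\langle X_1, X_2 \rangle|$ should be roughly $\sqrt{2/(\pi d)}$.

```python
import random
import math

def sphere_point(d):
    """Uniform point on S^{d-1}: normalized standard Gaussian vector."""
    g = [random.gauss(0.0, 1.0) for _ in range(d)]
    r = math.sqrt(sum(x * x for x in g))
    return [x / r for x in g]

def typical_inner_product(d, trials=2000):
    """Monte Carlo estimate of E|<X1, X2>| for independent uniform
    points X1, X2 on S^{d-1}; should scale like 1/sqrt(d)."""
    total = 0.0
    for _ in range(trials):
        x, y = sphere_point(d), sphere_point(d)
        total += abs(sum(a * b for a, b in zip(x, y)))
    return total / trials
```

Quadrupling the dimension should roughly halve the estimate, in line with the $1/\sqrt{d}$ scaling.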
Thus we see that the boost in the number of triangles in the geometric setting is of order $n^3/\sqrt{d}$ in expectation:
\[
\mathbb{E}\left[ T\left( G(n,p,d) \right) \right] - \mathbb{E}\left[ T\left( G(n,p) \right) \right] \gtrsim \binom{n}{3} p^3 \frac{c_p}{\sqrt{d}} \asymp \frac{n^3}{\sqrt{d}}.
\]
To be able to tell apart the two graph distributions based on the number of triangles, the boost in expectation needs to be much greater than the standard deviation. A simple calculation shows that
\[
\mathrm{Var}\left( T\left( G(n,p) \right) \right) \asymp n^4,
\]
and also
\[
\mathrm{Var}\left( T\left( G(n,p,d) \right) \right) \lesssim n^4 + \frac{n^5}{\sqrt{d}}.
\]
Thus we see that the triangle statistic distinguishes $G(n,p)$ and $G(n,p,d)$ if $n^3/\sqrt{d} \gg \sqrt{n^4 + n^5/\sqrt{d}}$, which is equivalent to $d \ll n^2$.
Signed triangles are more powerful
While triangles detect geometry up until $d \ll n^2$, are there even more powerful statistics that detect geometry for larger dimensions? One can check that longer cycles also only work when $d \ll n^2$, as do several other natural statistics. Yet the underlying geometry can be detected even when $d \ll n^3$.
The simple idea that leads to this improvement is to consider signed triangles. We have already noticed that triangles are more likely in the geometric setting than in the purely random setting. This also means that induced wedges (i.e., when there are exactly two edges among the three possible ones) are less likely in the geometric setting. Similarly, induced single edges are more likely, and induced independent sets on three vertices are less likely in the geometric setting. The following figure summarizes these observations.
The signed triangles statistic incorporates these observations by giving the different patterns positive or negative weights. More precisely, we define
\[
\tau(G) := \sum_{i < j < k} \left( A_{i,j} - p \right) \left( A_{i,k} - p \right) \left( A_{j,k} - p \right).
\]
The key insight motivating this definition is that the variance of signed triangles is much smaller than the variance of triangles, due to the cancellations introduced by the centering of the adjacency matrix: the $n^4$ term vanishes, leaving only the $n^3$ term. It is a simple exercise to show that
\[
\mathbb{E}\left[ \tau\left( G(n,p) \right) \right] = 0
\]
and
\[
\mathrm{Var}\left( \tau\left( G(n,p) \right) \right) = \binom{n}{3} \left( p(1-p) \right)^3 \asymp n^3.
\]
On the other hand it can be shown that
\[
\mathbb{E}\left[ \tau\left( G(n,p,d) \right) \right] \geq c_p \frac{n^3}{\sqrt{d}}, \qquad (5)
\]
so the gap between the expectations remains. Furthermore, it can also be shown that the variance also decreases for $G(n,p,d)$ and we have
\[
\mathrm{Var}\left( \tau\left( G(n,p,d) \right) \right) \lesssim n^3 + \frac{n^4}{d}.
\]
Putting everything together, we get that $G(n,p)$ and $G(n,p,d)$ can be distinguished using signed triangles if $n^3/\sqrt{d} \gg \sqrt{n^3 + n^4/d}$, which is equivalent to $d \ll n^3$. This concludes the proof of (3) from Theorem 1.
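To see the signed triangle statistic in action, here is a small simulation sketch (helper names ours, not from the paper): under $G(n, p)$ the statistic averages to roughly zero, while under $G(n, 1/2, d)$ with small $d$ it is noticeably positive.

```python
import random
import math
from itertools import combinations

def signed_triangles(adj, p):
    """tau(G) = sum over i<j<k of (A_ij - p)(A_ik - p)(A_jk - p)."""
    n = len(adj)
    return sum((adj[i][j] - p) * (adj[i][k] - p) * (adj[j][k] - p)
               for i, j, k in combinations(range(n), 3))

def geometric_adj(n, d):
    """Adjacency matrix of G(n, 1/2, d): connect i ~ j iff <X_i, X_j> >= 0."""
    pts = []
    for _ in range(n):
        g = [random.gauss(0.0, 1.0) for _ in range(d)]
        r = math.sqrt(sum(x * x for x in g))
        pts.append([x / r for x in g])
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if sum(a * b for a, b in zip(pts[i], pts[j])) >= 0:
                adj[i][j] = adj[j][i] = 1
    return adj

def er_adj(n, p):
    """Adjacency matrix of the Erdos-Renyi graph G(n, p)."""
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if random.random() < p:
                adj[i][j] = adj[j][i] = 1
    return adj
```

With, say, $n = 50$ and $d = 3$, the signed triangle count for the geometric graph is typically in the hundreds, while for the Erdős–Rényi graph it fluctuates around zero with standard deviation of order $\sqrt{\binom{n}{3}/64}$, so a single sample already separates the two models.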
Estimating the dimension
Until now we discussed detecting geometry. However, the insights gained above allow us to also touch upon the more subtle problem of estimating the underlying dimension $d$.
Dimension estimation can also be done by counting the "number" of signed triangles as above. However, here it is necessary to have a bound on the difference of the expected number of signed triangles between consecutive dimensions; the lower bound on $\mathbb{E}\left[ \tau\left( G(n,p,d) \right) \right]$ in (5) is not enough. Still, we believe that the lower bound should give the true value of the expected value for an appropriate constant $c_p$, and hence we expect to have that
\[
\mathbb{E}\left[ \tau\left( G(n,p,d) \right) \right] - \mathbb{E}\left[ \tau\left( G(n,p,d+1) \right) \right] \approx c_p \frac{n^3}{d^{3/2}}. \qquad (6)
\]
Thus, using the variance bound from above, we get that dimension estimation should be possible using signed triangles whenever $n^3/d^{3/2} \gg \sqrt{n^3 + n^4/d}$, which is equivalent to $d \ll n$. Showing (6) for general $p$ seems involved; Bubeck et al. showed that it holds for $p = 1/2$, which can be considered as a proof of concept.
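Following this heuristic, one can sketch a toy matching estimator (entirely hypothetical, not the estimator analyzed in the paper): since $\mathbb{E}\left[ \tau\left( G(n,1/2,d) \right) \right]$ decreases in $d$, compare the observed signed triangle count against averages simulated at candidate dimensions and pick the closest.

```python
import random
import math
from itertools import combinations

def signed_triangles(adj, p=0.5):
    """tau(G) = sum over i<j<k of (A_ij - p)(A_ik - p)(A_jk - p)."""
    n = len(adj)
    return sum((adj[i][j] - p) * (adj[i][k] - p) * (adj[j][k] - p)
               for i, j, k in combinations(range(n), 3))

def geometric_adj(n, d):
    """Adjacency matrix of G(n, 1/2, d): connect i ~ j iff <X_i, X_j> >= 0."""
    pts = []
    for _ in range(n):
        g = [random.gauss(0.0, 1.0) for _ in range(d)]
        r = math.sqrt(sum(x * x for x in g))
        pts.append([x / r for x in g])
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if sum(a * b for a, b in zip(pts[i], pts[j])) >= 0:
                adj[i][j] = adj[j][i] = 1
    return adj

def estimate_dimension(adj, candidates, reps=4):
    """Toy estimator: return the candidate dimension whose average
    simulated signed-triangle count is closest to the observed count."""
    n = len(adj)
    observed = signed_triangles(adj)
    best, best_gap = None, float("inf")
    for d in candidates:
        avg = sum(signed_triangles(geometric_adj(n, d))
                  for _ in range(reps)) / reps
        if abs(observed - avg) < best_gap:
            best, best_gap = d, abs(observed - avg)
    return best
```

This only works when the candidate dimensions are well separated relative to the fluctuations of $\tau$, which is exactly the tension that the $d \ll n$ condition above quantifies.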
Theorem [Bubeck, Ding, Eldan, and Rácz 2014]
There exists a universal constant $C > 0$ such that for all integers $n$ and $d \leq n / (C \log^2 n)$, one has
\[
\mathrm{TV}\left( G(n, 1/2, d), \, G(n, 1/2, d+1) \right) \geq 1 - \frac{C}{n}.
\]
This result is tight, as demonstrated by a result of Eldan, which implies that $G(n, 1/2, d)$ and $G(n, 1/2, d+1)$ are indistinguishable when $d \gg n$, up to logarithmic factors.
The mysterious sparse regime
We conclude this post by discussing an intriguing conjecture for sparse graphs. It is again natural to consider the number of triangles as a way to distinguish between $G(n, c/n)$ and $G(n, c/n, d)$. Bubeck et al. show that this statistic works whenever $d \ll \log^3 n$, and conjecture that this is tight.
Conjecture [Bubeck, Ding, Eldan, and Rácz 2014]
Let $c > 0$ be fixed and assume $d / \log^3 n \to \infty$. Then
\[
\mathrm{TV}\left( G(n, c/n), \, G(n, c/n, d) \right) \to 0.
\]
The main reason for this conjecture is that, when $d \gg \log^3 n$, $G(n, c/n)$ and $G(n, c/n, d)$ seem to be locally equivalent; in particular, they both have the same Poisson number of triangles asymptotically. Thus the only way to distinguish between them would be to find an emergent global property which is significantly different under the two models, but this seems unlikely to exist. Proving or disproving this conjecture remains a challenging open problem. The best known bound is $d \gg n^3$ from (2) (which holds uniformly over $p$), which is very far from $\log^3 n$!