Following the previous post in which we studied community detection, in this post we study the fundamental limits of inferring *geometric structure* in networks. Many networks coming from physical considerations naturally have an underlying geometry, such as the network of major roads in a country. In other networks this stems from a latent feature space of the nodes. For instance, in social networks a person might be represented by a feature vector of their interests, and two people are connected if their interests are close enough; this latent metric space is referred to as the *social space*. We are particularly interested in the *high-dimensional regime*, which brings about a host of new questions, such as estimating the dimension.

**A simple random geometric graph model and basic questions**

We study perhaps the simplest model of a random geometric graph, where the underlying metric space is the unit sphere $\mathbb{S}^{d-1} = \{ x \in \mathbb{R}^d : \| x \|_2 = 1 \}$, and where the latent labels of the nodes are i.i.d. uniform random vectors in $\mathbb{S}^{d-1}$. More precisely, the random geometric graph $G(n,p,d)$ is defined as follows. Let $X_1, \dots, X_n$ be independent random vectors, uniformly distributed on $\mathbb{S}^{d-1}$. In $G(n,p,d)$, distinct nodes $i$ and $j$ are connected by an edge if and only if $\langle X_i, X_j \rangle \geq t_{p,d}$, where the threshold value $t_{p,d} \in [-1,1]$ is such that $\mathbb{P}\left( \langle X_1, X_2 \rangle \geq t_{p,d} \right) = p$. For example, when $p = 1/2$ we have $t_{1/2,d} = 0$.
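As a quick illustration, here is a minimal NumPy sketch (the function name and parameters are our own, for illustration only) that samples $G(n,1/2,d)$. We restrict to $p = 1/2$, since then the threshold is simply $t_{1/2,d} = 0$ and we avoid computing $t_{p,d}$ for general $p$.

```python
import numpy as np

def sample_geometric_graph(n, d, rng):
    """Sample the adjacency matrix of G(n, 1/2, d): latent positions are
    i.i.d. uniform points on the unit sphere S^{d-1}, and distinct nodes
    i, j are joined iff <X_i, X_j> >= 0 (for p = 1/2 the threshold is 0)."""
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # normalized Gaussians are uniform on the sphere
    A = (X @ X.T >= 0).astype(int)
    np.fill_diagonal(A, 0)  # no self-loops
    return A

rng = np.random.default_rng(0)
A = sample_geometric_graph(200, 8, rng)
print(A.sum() / (200 * 199))  # edge density, close to p = 1/2
```

Sampling uniform points by normalizing standard Gaussian vectors works because the standard Gaussian distribution is rotationally invariant.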

The most natural random graph model without any structure is the standard Erdős–Rényi random graph $G(n,p)$, where any two of the $n$ vertices are independently connected with probability $p$.

We can thus formalize the question of detecting underlying geometry as a simple hypothesis testing question. The null hypothesis is that the graph is drawn from the Erdős–Rényi model, while the alternative is that it is drawn from $G(n,p,d)$. In brief:

$$ H_0 : G \sim G(n,p), \qquad H_1 : G \sim G(n,p,d). \qquad (1) $$

To understand this question, the basic quantity we need to study is the total variation distance between the two distributions on graphs, $G(n,p)$ and $G(n,p,d)$, denoted by $\mathrm{TV}\left( G(n,p), G(n,p,d) \right)$; recall that the total variation distance between two probability measures $P$ and $Q$ is defined as $\mathrm{TV}(P,Q) := \sup_{A} \left| P(A) - Q(A) \right| = \frac{1}{2} \left\| P - Q \right\|_1$. We are interested in particular in the case when the dimension $d$ is *large*, growing with $n$.
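For intuition, on a finite space the total variation distance is just half the $\ell_1$ distance between the probability vectors. A tiny self-contained example (the distributions are chosen arbitrarily):

```python
import numpy as np

def tv_distance(P, Q):
    """TV(P, Q) = sup_A |P(A) - Q(A)| = (1/2) * ||P - Q||_1
    for two distributions P, Q on a common finite set."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    return 0.5 * np.abs(P - Q).sum()

P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]
# the optimal distinguishing event is {0} (or equivalently {1}): |0.5 - 0.4| = 0.1
print(tv_distance(P, Q))  # 0.1
```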

It is intuitively clear that if the geometry is too high-dimensional, then it is impossible to detect it, while a low-dimensional geometry will have a strong effect on the generated graph and will be detectable. How fast can the dimension $d$ grow with $n$ while still being able to detect geometry? Most of this post will focus on this question.

If we can detect geometry, then it is natural to ask for more information. Perhaps the ultimate goal would be to find an embedding of the vertices into a sphere of appropriate dimension that is a *true representation*, in the sense that the geometric graph formed from the embedded points is indeed the original graph. More modestly, can the dimension $d$ be estimated? We touch on this question at the end of the post.

**The dimension threshold for detecting underlying geometry**

The high-dimensional setting of the random geometric graph $G(n,p,d)$ was first studied by Devroye, György, Lugosi, and Udina, who showed that geometry is indeed lost in high dimensions: if $n$ is fixed and $d \to \infty$, then $\mathrm{TV}\left( G(n,p,d), G(n,p) \right) \to 0$. More precisely, they show that this convergence happens when $d \gg 2^{n^2/2}$, but this is not tight. The dimension threshold for dense graphs was recently found by Bubeck, Ding, Eldan, and Rácz, and it turns out that it is $d \asymp n^3$, in the following sense.

**Theorem 1** [Bubeck, Ding, Eldan, and Rácz 2014]. Let $p \in (0,1)$ be fixed. Then

$$ \mathrm{TV}\left( G(n,p), G(n,p,d) \right) \to 0 \quad \text{if } d / n^3 \to \infty, \qquad (2) $$

$$ \mathrm{TV}\left( G(n,p), G(n,p,d) \right) \to 1 \quad \text{if } d / n^3 \to 0. \qquad (3) $$

Moreover, in the latter case there exists a computationally efficient test to detect underlying geometry (with running time $O(n^3)$).

Most of this post is devoted to understanding (3), that is, how the two models can be distinguished when $d \ll n^3$; the impossibility result of (2) will be discussed in a future post. At the end we will also consider this same question for *sparse graphs* (where $p = c/n$), where determining the dimension threshold is an intriguing open problem.

**The triangle test**

A natural test to uncover geometric structure is to count the number of triangles in $G$. Indeed, in a purely random scenario, vertex $i$ being connected to both $j$ and $k$ says nothing about whether $j$ and $k$ are connected. On the other hand, in a geometric setting this implies that $X_j$ and $X_k$ are both close to $X_i$, and hence close to each other due to the triangle inequality, thus increasing the probability of a connection between them. This, in turn, implies that the expected number of triangles is larger in the geometric setting, given the same edge density. Let us now compute what this statistic gives us.

Given that $i$ is connected to both $j$ and $k$, the vertices $j$ and $k$ are more likely to be connected under $G(n,p,d)$ than under $G(n,p)$.

For a graph $G$, let $A = A(G)$ denote its adjacency matrix. Then

$$ A_{i,j} A_{i,k} A_{j,k} $$

is the indicator variable that the three vertices $i$, $j$, and $k$ form a triangle, and so the number of triangles in $G$ is

$$ T(G) := \sum_{i < j < k} A_{i,j} A_{i,k} A_{j,k}. $$

By linearity of expectation, for both models the expected number of triangles is $\binom{n}{3}$ times the probability of a triangle between three specific vertices. For the Erdős–Rényi random graph the edges are independent, so the probability of a triangle is $p^3$, and thus we have

$$ \mathbb{E}\left[ T\left( G(n,p) \right) \right] = \binom{n}{3} p^3. $$
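In code, the sum over triples is conveniently computed via the trace of the cubed adjacency matrix, since $\operatorname{trace}(A^3)$ counts each triangle six times (a standard identity; the helper below is ours, not from the post):

```python
import numpy as np

def count_triangles(A):
    """T(G) = sum_{i<j<k} A_ij A_ik A_jk; trace(A^3) counts each triangle
    six times (3 starting vertices x 2 orientations)."""
    return int(round(np.trace(A @ A @ A) / 6))

# sanity check: the complete graph K_4 has binom(4,3) = 4 triangles
K4 = np.ones((4, 4), dtype=int) - np.eye(4, dtype=int)
print(count_triangles(K4))  # 4
```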

For $G(n,p,d)$ it turns out that for any fixed $p \in (0,1)$ we have

$$ \mathbb{P}\left( \triangle_{123} \subseteq G(n,p,d) \right) \geq p^3 \left( 1 + \frac{C_p}{\sqrt{d}} \right) \qquad (4) $$

for some constant $C_p > 0$, where $\triangle_{123}$ denotes the event that vertices $1$, $2$, and $3$ form a triangle, which gives that

$$ \mathbb{E}\left[ T\left( G(n,p,d) \right) \right] \geq \binom{n}{3} p^3 \left( 1 + \frac{C_p}{\sqrt{d}} \right). $$

Showing (4) is somewhat involved, but in essence it follows from the *concentration of measure* phenomenon on the sphere, namely that most of the mass on the high-dimensional sphere is located in a band of width $O(1/\sqrt{d})$ around the equator. We sketch here the main intuition for $p = 1/2$, which is illustrated in the figure below.

Let $X_1$, $X_2$, and $X_3$ be independent uniformly distributed points on $\mathbb{S}^{d-1}$. Then

$$ \mathbb{P}\left( \triangle_{123} \subseteq G(n,1/2,d) \right) = \mathbb{P}\left( \langle X_1, X_2 \rangle \geq 0, \langle X_1, X_3 \rangle \geq 0, \langle X_2, X_3 \rangle \geq 0 \right) = \mathbb{P}\left( \langle X_2, X_3 \rangle \geq 0 \,\middle|\, \langle X_1, X_2 \rangle \geq 0, \langle X_1, X_3 \rangle \geq 0 \right) \times \frac{1}{4}, $$

where the last equality follows by independence. So what remains is to show that this latter conditional probability is approximately $\frac{1}{2} + \frac{c}{\sqrt{d}}$. To compute this conditional probability, what we really need to know is the typical angle between $X_1$ and $X_2$. By rotational invariance we may assume that $X_1 = (1, 0, \dots, 0)$, and hence $\langle X_1, X_2 \rangle = X_2(1)$, the first coordinate of $X_2$. One way to generate $X_2$ is to sample a $d$-dimensional standard Gaussian and then normalize it by its length. Since the norm of a $d$-dimensional standard Gaussian is very well concentrated around $\sqrt{d}$, it follows that $X_2(1)$ is on the order of $1/\sqrt{d}$. Conditioned on $\langle X_1, X_2 \rangle \geq 0$ and $\langle X_1, X_3 \rangle \geq 0$, both $X_2$ and $X_3$ have a bias of order $1/\sqrt{d}$ in the direction of $X_1$, and this common alignment gives the boost in the conditional probability that we see.

If $X_1$ and $X_2$ are two independent uniform points on $\mathbb{S}^{d-1}$, then their inner product $\langle X_1, X_2 \rangle$ is on the order of $1/\sqrt{d}$ due to the concentration of measure phenomenon on the sphere.
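This $1/\sqrt{d}$ scaling is easy to verify numerically; the sketch below (our own illustration) estimates $\mathbb{E}\left| \langle X_1, X_2 \rangle \right|$ for a few dimensions. For large $d$ the inner product is approximately $N(0, 1/d)$, so the rescaled value should hover near $\sqrt{2/\pi} \approx 0.8$.

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_abs_inner_product(d, trials=4000):
    """Estimate E|<X_1, X_2>| for two independent uniform points on S^{d-1}."""
    X = rng.standard_normal((trials, d))
    Y = rng.standard_normal((trials, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)
    return np.abs((X * Y).sum(axis=1)).mean()

for d in [10, 100, 1000]:
    # rescaled by sqrt(d), the typical inner product is roughly constant
    print(d, mean_abs_inner_product(d) * np.sqrt(d))
```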

Thus we see that the boost in the number of triangles in the geometric setting is $\Theta\left( n^3 / \sqrt{d} \right)$ in expectation:

$$ \mathbb{E}\left[ T\left( G(n,1/2,d) \right) \right] - \mathbb{E}\left[ T\left( G(n,1/2) \right) \right] \gtrsim \binom{n}{3} \frac{1}{\sqrt{d}} \asymp \frac{n^3}{\sqrt{d}}. $$

To be able to tell apart the two graph distributions based on the number of triangles, the boost in expectation needs to be much greater than the standard deviation. A simple calculation shows that

$$ \mathrm{Var}\left( T\left( G(n,1/2) \right) \right) \asymp n^4 $$

and also

$$ \mathrm{Var}\left( T\left( G(n,1/2,d) \right) \right) \lesssim n^4 + \frac{n^5}{\sqrt{d}}. $$

Thus we see that $\mathrm{TV}\left( G(n,1/2), G(n,1/2,d) \right) \to 1$ if $n^3 / \sqrt{d} \gg n^2$, which is equivalent to $d \ll n^2$.

**Signed triangles are more powerful**

While triangles detect geometry when $d \ll n^2$, are there even more powerful statistics that detect geometry for larger dimensions? One can check that longer cycles also only work when $d \ll n^2$, as do several other natural statistics. Yet the underlying geometry can be detected even when $d \ll n^3$.

The simple idea that leads to this improvement is to consider *signed triangles*. We have already noticed that triangles are more likely in the geometric setting than in the purely random setting. This also means that induced wedges (i.e., when there are exactly two edges among the three possible ones) are less likely in the geometric setting. Similarly, induced single edges are more likely, and induced independent sets on three vertices are less likely in the geometric setting. The following figure summarizes these observations.

The signed triangles statistic incorporates these observations by giving the different patterns positive or negative weights. More precisely, we define

$$ \tau(G) := \sum_{i < j < k} \left( A_{i,j} - p \right) \left( A_{i,k} - p \right) \left( A_{j,k} - p \right). $$

The key insight motivating this definition is that the variance of signed triangles is *much smaller* than the variance of triangles, due to the cancellations introduced by the centering of the adjacency matrix: the $n^4$ term vanishes, leaving only the $n^3$ term. It is a simple exercise to show that

$$ \mathbb{E}\left[ \tau\left( G(n,p) \right) \right] = 0 $$

and

$$ \mathrm{Var}\left( \tau\left( G(n,p) \right) \right) = \binom{n}{3} p^3 (1-p)^3 \asymp n^3. $$

On the other hand it can be shown that

$$ \mathbb{E}\left[ \tau\left( G(n,p,d) \right) \right] \gtrsim \frac{n^3}{\sqrt{d}}, \qquad (5) $$

so the gap between the expectations remains. Furthermore, it can also be shown that the variance also decreases for $G(n,p,d)$ and we have

$$ \mathrm{Var}\left( \tau\left( G(n,p,d) \right) \right) \lesssim n^3 + \frac{n^4}{d}. $$

Putting everything together we get that $\mathrm{TV}\left( G(n,p), G(n,p,d) \right) \to 1$ if $n^3 / \sqrt{d} \gg \sqrt{n^3 + n^4/d}$, which is equivalent to $d \ll n^3$. This concludes the proof of (3) from Theorem 1.

**Estimating the dimension**

Until now we discussed *detecting* geometry. However, the insights gained above allow us to also touch upon the more subtle problem of *estimating* the underlying dimension $d$.

Dimension estimation can also be done by counting the "number" of signed triangles as above. However, here it is necessary to have a bound on the difference of the expected number of signed triangles between consecutive dimensions; the lower bound on $\mathbb{E}\left[ \tau\left( G(n,p,d) \right) \right]$ in (5) is not enough. Still, we believe that the lower bound in (5) should give the true value of the expected value for an appropriate constant $c_p$, and hence we expect to have that

$$ \mathbb{E}\left[ \tau\left( G(n,p,d) \right) \right] - \mathbb{E}\left[ \tau\left( G(n,p,d+1) \right) \right] \approx c_p n^3 \left( \frac{1}{\sqrt{d}} - \frac{1}{\sqrt{d+1}} \right) \asymp \frac{n^3}{d^{3/2}}. \qquad (6) $$

Thus, using the variance bound from above, we get that dimension estimation should be possible using signed triangles whenever $n^3 / d^{3/2} \gg \sqrt{n^3 + n^4/d}$, which is equivalent to $d \ll n$. Showing (6) for general $d$ seems involved; Bubeck et al. showed that it holds for $d = 1$, which can be considered as a proof of concept.
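The monotonicity in $d$ underlying (6) is visible even in small simulations (an illustration under our own choice of parameters, not a verification of (6) itself):

```python
import numpy as np

rng = np.random.default_rng(5)

def mean_signed_triangles(n, d, trials):
    """Monte Carlo estimate of E[tau(G(n, 1/2, d))], with tau computed as
    trace(B^3)/6 for the centered, zero-diagonal adjacency matrix B."""
    total = 0.0
    for _ in range(trials):
        X = rng.standard_normal((n, d))
        X /= np.linalg.norm(X, axis=1, keepdims=True)
        A = (X @ X.T >= 0).astype(float)
        B = A - 0.5
        np.fill_diagonal(B, 0.0)
        total += np.trace(B @ B @ B) / 6.0
    return total / trials

tau_d2 = mean_signed_triangles(40, 2, 200)
tau_d20 = mean_signed_triangles(40, 20, 200)
print(tau_d2, tau_d20)  # both positive, and decreasing in d
```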

**Theorem 2** [Bubeck, Ding, Eldan, and Rácz 2014]. There exists a universal constant $C > 0$ such that for all integers $n$ and $d \geq 2$, one has

$$ \mathrm{TV}\left( G(n, 1/2, 1), G(n, 1/2, d) \right) \geq 1 - \frac{C}{n}. $$

This result is tight, as demonstrated by a result of Eldan, which implies that $G(n,1/2,1)$ and $G(n,1/2,d)$ cannot be distinguished with probability greater than $1 - c/n$ for some universal constant $c > 0$.

**The mysterious sparse regime**

We conclude this post by discussing an intriguing conjecture for *sparse graphs*. It is again natural to consider the number of triangles as a way to distinguish between $G(n, c/n)$ and $G(n, c/n, d)$. Bubeck et al. show that this statistic works whenever $d \ll \log^3 n$, and conjecture that this is tight.

**Conjecture** [Bubeck, Ding, Eldan, and Rácz 2014]. Let $c > 0$ be fixed and assume $p = c/n$. Then

$$ \mathrm{TV}\left( G(n, c/n), G(n, c/n, d) \right) \to \begin{cases} 0 & \text{if } d / \log^3 n \to \infty, \\ 1 & \text{if } d / \log^3 n \to 0. \end{cases} $$

The main reason for this conjecture is that, when $d \gg \log^3 n$, $G(n, c/n)$ and $G(n, c/n, d)$ seem to be locally equivalent; in particular, they both have the same asymptotically Poisson number of triangles. Thus the only way to distinguish between them would be to find an emergent global property which is significantly different under the two models, but this seems unlikely to exist. Proving or disproving this conjecture remains a challenging open problem. The best known bound is $d \gg n^3$, from (2) (which holds uniformly over $p$), which is very far from $\log^3 n$!