Guest post by Miklos Racz: Estimating the dimension of a random geometric graph on a high-dimensional sphere

Following the previous post in which we studied community detection, in this post we study the fundamental limits of inferring geometric structure in networks. Many networks coming from physical considerations naturally have an underlying geometry, such as the network of major roads in a country. In other networks this stems from a latent feature space of the nodes. For instance, in social networks a person might be represented by a feature vector of their interests, and two people are connected if their interests are close enough; this latent metric space is referred to as the social space. We are particularly interested in the high-dimensional regime, which brings about a host of new questions, such as estimating the dimension.

A simple random geometric graph model and basic questions

We study perhaps the simplest model of a random geometric graph, where the underlying metric space is the d-dimensional unit sphere \mathbb{S}^{d-1} = \left\{ x \in \mathbb{R}^d : \left\| x \right\|_2 = 1 \right\}, and where the latent labels of the nodes are i.i.d. uniform random vectors in \mathbb{S}^{d-1}. More precisely, the random geometric graph G \left( n, p, d \right) is defined as follows. Let X_1, \dots, X_n be independent random vectors, uniformly distributed on \mathbb{S}^{d-1}. In G\left( n, p, d \right), distinct nodes i \in \left[n\right] and j \in \left[n \right] are connected by an edge if and only if \left\langle X_i, X_j \right\rangle \geq t_{p,d}, where the threshold value t_{p,d} \in \left[-1,1\right] is such that \mathbb{P} \left( \left\langle X_1, X_2 \right\rangle \geq t_{p,d} \right) = p. For example, when p = 1/2 we have t_{p,d} = 0.

The most natural random graph model without any structure is the standard Erdos-Renyi random graph G(n,p), where any two of the n vertices are independently connected with probability p.

We can thus formalize the question of detecting underlying geometry as a simple hypothesis testing question. The null hypothesis is that the graph is drawn from the Erdos-Renyi model, while the alternative is that it is drawn from G(n,p,d). In brief:

(1)   \begin{equation*} H_{0} : G \sim G(n,p), \qquad \qquad H_{1} : G \sim G(n,p,d). \end{equation*}

To understand this question, the basic quantity we need to study is the total variation distance between the two distributions on graphs, G(n,p) and G(n,p,d), denoted by \mathrm{TV} \left( G(n,p), G(n,p,d) \right); recall that the total variation distance between two probability measures P and Q is defined as \mathrm{TV} \left( P, Q \right) = \tfrac{1}{2} \left\| P - Q \right\|_{1} = \sup_{A} \left| P(A) - Q(A) \right|. We are interested in particular in the case when the dimension d is large, growing with n.

It is intuitively clear that if the geometry is too high-dimensional, then it is impossible to detect it, while a low-dimensional geometry will have a strong effect on the generated graph and will be detectable. How fast can the dimension grow with n while still being able to detect it? Most of this post will focus on this question.

If we can detect geometry, then it is natural to ask for more information. Perhaps the ultimate goal would be to find an embedding of the vertices into an appropriate dimensional sphere that is a true representation, in the sense that the geometric graph formed from the embedded points is indeed the original graph. More modestly, can the dimension be estimated? We touch on this question at the end of the post.

The dimension threshold for detecting underlying geometry

The high-dimensional setting of the random geometric graph G(n,p,d) was first studied by Devroye, Gyorgy, Lugosi, and Udina, who showed that geometry is indeed lost in high dimensions: if n is fixed and d \to \infty, then \mathrm{TV} \left( G(n,p), G(n,p,d) \right) \to 0. More precisely, they show that this convergence happens when d \gg n^{7} 2^{n^2 / 2}, but this is not tight. The dimension threshold for dense graphs was recently found by Bubeck, Ding, Eldan, and Racz, and it turns out that it is d \approx n^3, in the following sense.

Theorem [Bubeck, Ding, Eldan, and Racz 2014]

Let p \in (0,1) be fixed. Then

    \[\mathrm{TV} \left( G(n,p), G(n,p,d) \right) \to \left\{ \begin{array}  0 \;\; \text{ if } \;\; d \gg n^3  \qquad \qquad (2) \\ 1 \;\; \text{ if } \;\; d \ll n^3 \qquad \qquad (3) \end{array} \right.\]

Moreover, in the latter case there exists a computationally efficient test to detect underlying geometry (with running time O\left( n^{3} \right)).

(2)   \begin{equation*} \end{equation*}

(3)   \begin{equation*} \end{equation*}

Most of this post is devoted to understanding (3), that is, how the two models can be distinguished; the impossibility result of (2) will be discussed in a future post. At the end we will also consider this same question for sparse graphs (where p = c/n), where determining the dimension threshold is an intriguing open problem.

The triangle test

A natural test to uncover geometric structure is to count the number of triangles in G. Indeed, in a purely random scenario, vertex u being connected to both v and w says nothing about whether v and w are connected. On the other hand, in a geometric setting this implies that v and w are close to each other due to the triangle inequality, thus increasing the probability of a connection between them. This, in turn, implies that the expected number of triangles is larger in the geometric setting, given the same edge density. Let us now compute what this statistic gives us.

triangle_question

Given that u is connected to both v and w, v and w are more likely to be connected under G(n,p,d) than under G(n,p).

 

For a graph G, let A denote its adjacency matrix. Then

    \[T_{G} \left( i,j,k \right) := A_{i,j} A_{i,k} A_{j,k}\]

is the indicator variable that three vertices i, j, and k form a triangle, and so the number of triangles in G is

    \[T(G) := \sum_{\{i,j,k\} \in \binom{[n]}{3}} T_{G} \left( i,j,k \right).\]

By linearity of expectation, for both models the expected number of triangles is \binom{n}{3} times the probability of a triangle between three specific vertices. For the Erd\H{o}s-R\’enyi random graph the edges are independent, so the probability of a triangle is p^3, and thus we have

    \[ \mathbb{E} \left[ T \left( G(n,p) \right) \right] = \binom{n}{3} p^3. \]

For G(n,p,d) it turns out that for any fixed p \in \left( 0, 1 \right) we have

(4)   \begin{equation*} \mathbb{P} \left( T_{G(n,p,d)} \left( 1, 2, 3 \right) = 1 \right) \approx p^{3} \left( 1 + \frac{C_{p}}{\sqrt{d}} \right) \end{equation*}

for some constant C_{p} > 0, which gives that

    \[ \mathbb{E} \left[ T \left( G(n,p,d) \right) \right] \geq \binom{n}{3} p^3 \left( 1 + \frac{C_{p}}{\sqrt{d}} \right). \]

Showing (4) is somewhat involved, but in essence it follows from the concentration of measure phenomenon on the sphere, namely that most of the mass on the high-dimensional sphere is located in a band of O \left( 1 / \sqrt{d} \right) around the equator. We sketch here the main intuition for p=1/2, which is illustrated in the figure below.

Let X_1, X_2, and X_3 be independent uniformly distributed points in \mathbb{S}^{d-1}. Then

    \begin{multline*} \mathbb{P} \left( T_{G(n,1/2,d)} \left( 1, 2, 3 \right) = 1 \right) \\ \begin{aligned} &= \mathbb{P} \left( \langle X_1, X_2 \rangle \geq 0, \langle X_1, X_3 \rangle \geq 0, \langle X_2, X_3 \rangle \geq 0 \right) \\ &= \mathbb{P} \left( \langle X_2, X_3 \rangle \geq 0 \, \middle| \, \langle X_1, X_2 \rangle \geq 0, \langle X_1, X_3 \rangle \geq 0 \right) \mathbb{P} \left( \langle X_1, X_2 \rangle \geq 0, \langle X_1, X_3 \rangle \geq 0 \right) \\ &= \frac{1}{4} \times \mathbb{P} \left( \langle X_2, X_3 \rangle \geq 0 \, \middle| \, \langle X_1, X_2 \rangle \geq 0, \langle X_1, X_3 \rangle \geq 0 \right), \end{aligned} \end{multline*}

where the last equality follows by independence. So what remains is to show that this latter conditional probability is approximately 1/2 + c / \sqrt{d}. To compute this conditional probability what we really need to know is the typical angle is between X_1 and X_2. By rotational invariance we may assume that X_1 = (1,0,0, \dots, 0), and hence \langle X_1, X_2 \rangle = X_{2} (1), the first coordinate of X_{2}. One way to generate X_2 is to sample a d-dimensional standard Gaussian and then normalize it by its length. Since the norm of a d-dimensional standard Gaussian is very well concentrated around \sqrt{d}, it follows that X_{2}(1) is on the order of 1/\sqrt{d}. Conditioned on X_{2}(1) \geq 0, this typical angle gives the boost in the conditional probability that we see.

triangle_gnpd

 

 

If X_{1} and X_{2} are two independent uniform points on \mathbb{S}^{d-1}, then their inner product \left\langle X_1, X_2 \right\rangle is on the order of 1/\sqrt{d} due to the concentration of measure phenomenon on the sphere.

 

 

Thus we see that the boost in the number of triangles in the geometric setting is \Theta \left( n^{3} / \sqrt{d} \right) in expectation:

    \[\mathbb{E} \left[ T \left( G(n,p,d) \right) \right] - \E \left[ T \left( G(n,p) \right) \right] \geq \binom{n}{3} \frac{C_p}{\sqrt{d}}.\]

To be able to tell apart the two graph distributions based on the number of triangles, the boost in expectation needs to be much greater than the standard deviation. A simple calculation shows that

    \[\mathrm{Var} \left( T \left( G \left( n, p \right) \right) \right) = \binom{n}{3} \left( p^{3} - p^{6} \right) + \binom{n}{4} \binom{4}{2} \left( p^{5} - p^{6} \right)\]

and also

    \[\mathrm{Var} \left( T \left( G \left( n, p, d \right) \right) \right) \leq n^4.\]

Thus we see that \mathrm{TV} \left( G(n,p), G(n,p,d) \right) \to 1 if n^{3} / \sqrt{d} \gg \sqrt{n^4}, which is equivalent to d \ll n^2.

Signed triangles are more powerful

While triangles detect geometry up until d \ll n^2, are there even more powerful statistics that detect geometry for larger dimensions? One can check that longer cycles also only work when d \ll n^2, as do several other natural statistics. Yet the underlying geometry can be detected even when d \ll n^{3}.

The simple idea that leads to this improvement is to consider signed triangles. We have already noticed that triangles are more likely in the geometric setting than in the purely random setting. This also means that induced wedges (i.e., when there are exactly two edges among the three possible ones) are less likely in the geometric setting. Similarly, induced single edges are more likely, and induced independent sets on three vertices are less likely in the geometric setting. The following figure summarizes these observations.

triangles_signed

The signed triangles statistic incorporates these observations by giving the different patterns positive or negative weights. More precisely, we define

    \[ \tau \left( G \right) := \sum_{\{i,j,k\} \in \binom{[n]}{3}} \left( A_{i,j} - p \right) \left( A_{i,k} - p \right) \left( A_{j,k} - p \right). \]

The key insight motivating this definition is that the variance of signed triangles is much smaller than the variance of triangles, due to the cancellations introduced by the centering of the adjacency matrix: the \Theta \left( n^{4} \right) term vanishes, leaving only the \Theta \left( n^{3} \right) term. It is a simple exercise to show that

    \[\mathbb{E} \left[ \tau \left( G(n,p) \right) \right] = 0\]

and

    \[\mathrm{Var} \left( \tau \left( G(n,p) \right) \right) = \binom{n}{3} p^{3} \left( 1 - p \right)^{3}.\]

On the other hand it can be shown that

(5)   \begin{equation*} \mathbb{E} \left[ \tau \left( G(n,p,d) \right) \right] \geq c_{p} n^{3} / \sqrt{d}, \end{equation*}

so the gap between the expectations remains. Furthermore, it can also be shown that the variance also decreases for G(n,p,d) and we have

    \[\mathrm{Var} \left( \tau \left( G(n,p,d) \right) \right) \leq n^{3} + \frac{3 n^{4}}{d}.\]

Putting everything together we get that \mathrm{TV} \left( G(n,p), G(n,p,d) \right) \to 1 if n^{3} / \sqrt{d} \gg \sqrt{n^3 + n^{4} / d}, which is equivalent to d \ll n^3. This concludes the proof of (3) from Theorem 1.

Estimating the dimension

Until now we discussed detecting geometry. However, the insights gained above allow us to also touch upon the more subtle problem of estimating the underlying dimension d.

Dimension estimation can also be done by counting the “number” of signed triangles as above. However, here it is necessary to have a bound on the difference of the expected number of signed triangles between consecutive dimensions; the lower bound on \mathbb{E} \left[ \tau \left( G(n,p,d) \right) \right] in~(5) is not enough. Still, we believe that the lower bound should give the true value of the expected value for an appropriate constant c_{p}, and hence we expect to have that

(6)   \begin{equation*} \mathbb{E} \left[ \tau \left( G(n,p,d) \right) \right] - \E \left[ \tau \left( G(n,p,d+1) \right) \right] = \Theta \left( \frac{n^{3}}{d^{3/2}} \right). \end{equation*}

Thus, using the variance bound from above, we get that dimension estimation should be possible using signed triangles whenever n^{3} / d^{3/2} \gg \sqrt{n^3 + n^{4} / d}, which is equivalent to d \ll n. Showing (6) for general p seems involved; Bubeck et al. showed that it holds for p = 1/2, which can be considered as a proof of concept.

Theorem [Bubeck, Ding, Eldan, and Racz 2014]

There exists a universal constant C > 0 such that for all integers n and d_1 < d_2, one has

    \[ \mathrm{TV} \left( G (n, 1/2, d_1), G(n, 1/2, d_2) \right) \geq 1 - C \left( \frac{d_1}{n} \right)^{2}. \]

This result is tight, as demonstrated by a result of Eldan, which implies that G(n,1/2,d) and G(n,1/2,d+1) are indistinguishable when d \gg n.

The mysterious sparse regime

We conclude this post by discussing an intriguing conjecture for sparse graphs. It is again natural to consider the number of triangles as a way to distinguish between G(n,c/n) and G(n, c/n,d). Bubeck et al. show that this statistic works whenever d \ll \log^{3} \left( n \right), and conjecture that this is tight.

Conjecture [Bubeck, Ding, Eldan, and Racz 2014]

Let c > 0 be fixed and assume d / \log^{3} \left( n \right) \to \infty. Then

    \[ \mathrm{TV} \left( G \left( n, \frac{c}{n} \right), G \left( n, \frac{c}{n}, d \right) \right) \to 0. \]

The main reason for this conjecture is that, when d \gg \log^{3} \left( n \right), G(n, c/n) and G(n, c/n, d) seem to be locally equivalent; in particular, they both have the same Poisson number of triangles asymptotically. Thus the only way to distinguish between them would be to find an emergent global property which is significantly different under the two models, but this seems unlikely to exist. Proving or disproving this conjecture remains a challenging open problem. The best known bound is n^{3} from (2) (which holds uniformly over p), which is very far from \log^{3} \left( n \right)!

This entry was posted in Probability theory, Random graphs. Bookmark the permalink.

Leave a reply