On the influence of the seed graph in the preferential attachment model

The preferential attachment model, introduced in 1992 by Mahmoud and popularized in 1999 by Barabási and Albert, has attracted a lot of attention in the last decade. In its simplest form it describes the evolution of a random tree. Formally we denote by \mathrm{PA}(n) the preferential attachment tree on n vertices which is defined by induction as follows. First \mathrm{PA}(2) is the unique tree on two vertices. Then, given \mathrm{PA}(n), \mathrm{PA}(n+1) is formed from \mathrm{PA}(n) by adding a new vertex u and a new edge uv where v is selected at random among vertices in \mathrm{PA}(n) according to the following probability distribution:

    \[\mathbb{P}\left(v = i \ \middle| \, \mathrm{PA}(n) \right) = \frac{d_{\mathrm{PA}(n)}(i)}{2 \left( n - 1 \right)} ,\]

where d_T(u) denotes the degree of vertex u in a tree T. In other words, vertices of large degree are more likely to attract the new nodes. This model of evolution is argued to be a good approximation for things such as a network of citations, or the internet.
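The growth rule above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `preferential_attachment_tree`, the 0-based vertex labels, and the `endpoints` list are my own choices, not notation from the paper. The trick is that a vertex of degree k appears exactly k times in the list of edge endpoints, so a uniform draw from that list selects v with probability proportional to its degree.

```python
import random

def preferential_attachment_tree(n, rng=None):
    """Sample PA(n); return its edges as (new_vertex, chosen_vertex) pairs.

    Vertices are labelled 0, ..., n-1 in order of arrival (a labelling
    assumed here for convenience).
    """
    if n < 2:
        raise ValueError("PA(n) is defined for n >= 2")
    rng = rng or random.Random()
    edges = [(1, 0)]        # PA(2): the unique tree on two vertices
    endpoints = [0, 1]      # vertex i appears deg(i) times in this list
    for u in range(2, n):
        # The tree currently has u vertices, so len(endpoints) == 2 * (u - 1)
        # and vertex i is drawn with probability deg(i) / (2 * (u - 1)).
        v = rng.choice(endpoints)
        edges.append((u, v))
        endpoints.extend((u, v))
    return edges

print(preferential_attachment_tree(6, random.Random(0)))
```

Each step costs constant time, so sampling PA(n) this way takes time linear in n.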

One of the main reasons for the success of the preferential attachment model is the following theorem, which shows that the degree distribution in \mathrm{PA}(n) follows a power law, a feature that many real-world networks (such as the internet) exhibit but which is not reproduced by standard random graph models such as the Erdős-Rényi model.

Theorem [Bollobás, Riordan, Spencer and Tusnády (2001)] Let d be fixed. Then as n \to +\infty, the proportion of vertices with degree d tends in probability to

    \[\frac{4}{d(d+1)(d+2)} \sim \frac{4}{d^3} .\]
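One way to get a feel for this theorem is to grow a large tree and tabulate the empirical proportion of vertices of each small degree, which can then be compared with the limiting formula. The sketch below is a rough simulation, not a proof of anything: the function name `degree_fractions`, the tree size of 100,000, and the fixed random seed are arbitrary assumptions of mine.

```python
import random
from collections import Counter

def degree_fractions(n, rng):
    """Grow PA(n) and return {degree: fraction of vertices with that degree}."""
    endpoints = [0, 1]                 # PA(2): one edge between vertices 0 and 1
    for u in range(2, n):
        v = rng.choice(endpoints)      # degree-proportional choice of v
        endpoints.extend((u, v))       # record the new edge uv
    degrees = Counter(endpoints)       # deg(i) = number of occurrences of i
    counts = Counter(degrees.values()) # how many vertices have each degree
    return {d: c / n for d, c in sorted(counts.items())}

fractions = degree_fractions(100_000, random.Random(0))
for d in range(1, 6):
    print(d, round(fractions.get(d, 0.0), 4))
```

Since the theorem is a statement about a fixed d as n grows, the agreement should be best for small degrees; the counts for large d are based on few vertices and are noisy.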

While the above theorem is a fine and interesting mathematical result, I do not view it as the critical aspect of the preferential attachment model (note that Wikipedia disagrees). In my opinion \mathrm{PA}(n) is interesting simply because of its natural rule of evolution.

Now think about the application of the PA model to the internet. Of course there are a few obvious objections, such as the fact that in \mathrm{PA}(n) a website can only link to one other website. While this is clearly ridiculous, I think that \mathrm{PA}(n) still contains the essence of what one would like to capture to model the evolution of the internet. However there is one potentially important aspect which is overlooked: in the early days of the internet the PA model was probably very far from being a good approximation to the evolution of the network. It is perhaps reasonable to assume that after 1995 the network was evolving according to PA, but certainly from 1970 to 1995 the evolution followed fundamentally different rules. This observation suggests studying the preferential attachment model with a seed.

Thus we are now interested in \mathrm{PA}(n, T), where T is a finite seed tree. Formally \mathrm{PA}(n, T) is also defined by induction, where \mathrm{PA}(|T|, T) = T and \mathrm{PA}(n+1,T) is formed from \mathrm{PA}(n, T) as before. A very basic question which seems to have been overlooked in the literature is the following: what is the influence of the seed T on \mathrm{PA}(n, T) as n goes to infinity?

In our recent joint work with Elchanan Mossel and Miklos Racz we look at exactly this question. More precisely, we ask the following: given two seed trees T and S, do the distributions of \mathrm{PA}(n, S) and \mathrm{PA}(n, T) remain separated (say in total variation distance) as n goes to infinity? In other words we are interested in the following quantity:

    \[\delta(S, T) = \lim_{n \to \infty} \mathrm{TV}(\mathrm{PA}(n, S), \mathrm{PA}(n, T)) .\]

A priori it could be that \delta(S, T) = 0 for any S and T, which would mean that the seed has no influence and that the preferential attachment “forgets” its initial conditions. We prove that this is far from true:

Theorem [Bubeck, Mossel and Racz (2014)] Let S and T be two finite trees on at least 3 vertices. If the degree distributions in S and T are different, then \delta \left( S, T \right) > 0.
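To make the statement concrete, here is a small Monte Carlo sketch comparing two seeds on 4 vertices with different degree distributions, a star and a path. The statistic I look at, the maximum degree, is chosen here purely for illustration, and the names `max_degree` and `empirical_tv`, the tree size, and the number of trials are my own assumptions. The total variation distance between the laws of any statistic is a lower bound on the total variation distance between the trees themselves, but note that the empirical estimate below is biased upward at finite sample size, so it is a heuristic and not evidence on its own.

```python
import random
from collections import Counter

def max_degree(n, seed_edges, rng):
    """Grow PA(n, T) from the seed tree T and return its maximum degree.

    T is given by seed_edges on vertices 0, ..., |T|-1 (assumed labelling).
    """
    endpoints = [x for edge in seed_edges for x in edge]
    size = len(seed_edges) + 1           # a tree on |T| vertices has |T|-1 edges
    for u in range(size, n):
        v = rng.choice(endpoints)        # degree-proportional choice
        endpoints.extend((u, v))
    return max(Counter(endpoints).values())

def empirical_tv(xs, ys):
    """Total variation distance between the empirical laws of two samples."""
    px, py = Counter(xs), Counter(ys)
    support = set(px) | set(py)
    return 0.5 * sum(abs(px[k] / len(xs) - py[k] / len(ys)) for k in support)

star = [(0, 1), (0, 2), (0, 3)]          # degrees 3, 1, 1, 1
path = [(0, 1), (1, 2), (2, 3)]          # degrees 1, 2, 2, 1
rng = random.Random(0)
n, trials = 300, 1000
star_max = [max_degree(n, star, rng) for _ in range(trials)]
path_max = [max_degree(n, path, rng) for _ in range(trials)]
print("mean max degree:", sum(star_max) / trials, sum(path_max) / trials)
print("empirical TV of max degree:", empirical_tv(star_max, path_max))
```

Rerunning with larger n gives a rough sense of whether the gap between the two seeds persists as the trees grow.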

If I wanted to make a bold statement I could say that this theorem implies the following: by looking at the internet network today, one can still “see” the influence of the topological structure of the internet back in the 90’s. In other words, to a certain extent one can go back in time and potentially infer some properties that people may have believed to be lost (perhaps some army’s secrets hidden in the structure of the ARPANET?). Of course at this stage this is pure science fiction, but the theorem certainly leaves that possibility open. Note that we believe that the theorem can even be strengthened to the following statement:

Conjecture Let S and T be two finite trees on at least 3 vertices. If S and T are non-isomorphic, then \delta \left( S, T \right) > 0.

These statements show that even when n is large one can still “see” in \mathrm{PA}(n, T) the influence of the original seed T. However, by considering the total variation distance we allow global statistics that depend on the entire tree \mathrm{PA}(n, T). What about local statistics that could be computed by an agent looking only at a finite neighborhood around her? Mathematically this question can be interpreted in the framework of the Benjamini-Schramm limit. Recall that a sequence of random graphs (G_n) tends to a random infinite rooted tree (\mathcal{T}, Z) (Z is the random root) if for any r \in \mathbb{N}, the random ball of radius r around a random vertex k_n in G_n tends in distribution to the random ball of radius r around Z in \mathcal{T}. In other words, when n is large enough a random agent cannot tell if she is in G_n or in \mathcal{T} by looking at a finite neighborhood around her. One has the following theorem for the weak limit of the PA model:

Theorem [Berger, Borgs, Chayes and Saberi (2014)] The Benjamini-Schramm limit of \mathrm{PA}(n) is the Pólya-point graph with m=1.

We extend this result to an arbitrary seed and show that locally the seed has no influence:

Theorem [Bubeck, Mossel and Racz (2014)] For any tree T the Benjamini-Schramm limit of \mathrm{PA}(n, T) is the Pólya-point graph with m=1.
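The local object behind the Benjamini-Schramm limit, the ball of radius r around a vertex, is easy to compute with a breadth-first search. The sketch below only illustrates what a local agent "sees"; the function name `ball` and the edge-list representation are assumptions of mine, and the example tree is a hand-written path rather than a sample from the PA model.

```python
from collections import deque

def ball(edges, root, r):
    """Return the set of vertices within graph distance r of root."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    dist = {root: 0}
    queue = deque([root])
    while queue:
        x = queue.popleft()
        if dist[x] == r:
            continue                    # do not explore beyond radius r
        for y in adj.get(x, []):
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return set(dist)

path = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(sorted(ball(path, 2, 1)))  # [1, 2, 3]
```

To mimic the random agent, one would apply `ball` at a uniformly random vertex of a sampled tree and compare the resulting neighborhoods across seeds.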

Thus, while the army’s secrets of the 90’s might be at risk if one looks at the overall topology of the current internet network, these secrets are safe from any local agent who would only access a (random) finite part of the network.

These new results on the PA model naturally lead to a host of new problems. We end the paper with a list of 7 open problems, and I recommend taking a look at them (and trying to solve them)!
