In this post we give a proof of the fundamental limits of dimension estimation in random geometric graphs, based on the recent work of Bubeck and Ganguly. We refer the reader to the previous post for a detailed introduction; here we just recall the main theorem we will prove.
Theorem [Bubeck and Ganguly]
If the distribution is log-concave, i.e., if it has density for some convex function , and if , then
(1)
where is an appropriately scaled Wishart matrix coming from vectors having i.i.d. entries and is a GOE matrix, both with the diagonal removed.
The proof hinges on a high-dimensional entropic central limit theorem, so a large part of the post is devoted to entropic central limit theorems and ways of proving them. Without further ado let us jump right in.
Pinsker’s inequality: from total variation to relative entropy
Our goal is now to bound from above. In the general setting considered here there is no nice formula for the density of the Wishart ensemble, so cannot be computed directly. Coupling these two random matrices also seems challenging.
In light of these observations, it is natural to switch to a different metric on probability distributions that is easier to handle in this case. Here we use Pinsker’s inequality to switch to relative entropy:
(2)
where denotes the relative entropy of with respect to . We next take a detour to entropic central limit theorems and techniques involved in their proof, before coming back to bounding the right hand side in (2).
An introduction to entropic CLTs
Let denote the density of , the -dimensional standard Gaussian distribution, and let be an isotropic density with mean zero, i.e., a density for which the covariance matrix is the identity . Then
where the second equality follows from the fact that is quadratic in , and the first two moments of and are the same by assumption. We thus see that the standard Gaussian maximizes entropy among isotropic densities. It turns out that much more is true.
The central limit theorem states that if are i.i.d. real-valued random variables with zero mean and unit variance, then converges in distribution to a standard Gaussian random variable as . There are many other senses in which converges to a standard Gaussian, the entropic CLT being one of them.
Theorem [Entropic CLT]
Let be i.i.d. real-valued random variables with zero mean and unit variance, and let . If , then
as . Moreover, the entropy of increases monotonically, i.e., for every .
The condition is necessary for an entropic CLT to hold; for instance, if the are discrete, then for all .
The entropic CLT originates with Shannon in the 1940s and was first proven by Linnik (without the monotonicity part of the statement). The first proofs that gave explicit convergence rates were given independently and at roughly the same time by Artstein, Ball, Barthe, and Naor, and Johnson and Barron in the early 2000s, using two different techniques.
The fact that follows from the entropy power inequality, which goes back to Shannon in 1948. This implies that for all , and so it was naturally conjectured that increases monotonically. However, proving this turned out to be challenging. Even the inequality was unknown for over fifty years, until Artstein, Ball, Barthe, and Naor proved in general that for all .
In the following we sketch some of the main ideas that go into the proof of these results, in particular following the technique introduced by Ball, Barthe, and Naor.
From relative entropy to Fisher information
Our goal is to show that some random variable , which is a convolution of many i.i.d. random variables, is close to a Gaussian . One way to approach this is to interpolate between the two. There are several ways of doing this; for our purposes interpolation along the Ornstein-Uhlenbeck semigroup is most useful. Define
for , and let denote the density of . We have and . This semigroup has several desirable properties. For instance, if the density of is isotropic, then so is . Before we can state the next desirable property that we will use, we need to introduce a few more useful quantities.
For a density function , let
be the Fisher information matrix. The Cramer-Rao bound states that
More generally this holds for the covariance of any unbiased estimator of the mean. The Fisher information is defined as
It is sometimes more convenient to work with the Fisher information distance, defined as . Similarly to the discussion above, one can show that the standard Gaussian minimizes the Fisher information among isotropic densities, and hence the Fisher information distance is always nonnegative.
Now we are ready to state the De Bruijn identity, which characterizes the change of entropy along the Ornstein-Uhlenbeck semigroup via the Fisher information distance:
This implies that the relative entropy between and —which is our quantity of interest—can be expressed as follows:
(3)
Thus our goal is to bound the Fisher information distance .
Bounding the Fisher information distance
We first recall a classical result by Blachman and Stam that shows that Fisher information decreases under convolution.
Theorem [Blachman; Stam]
Let be independent random variables taking values in , and let be such that . Then
In the i.i.d. case, this bound becomes .
Ball, Barthe, and Naor gave the following variational characterization of the Fisher information, which gives a particularly simple proof of Theorem 3. (See Bubeck and Ganguly for a short proof.)
Theorem [Variational characterization of Fisher information]
Let be a sufficiently smooth density on , let be a unit vector, and let be the marginal of in direction . Then we have
(4)
for any continuously differentiable vector field with the property that for every , . Moreover, if satisfies , then there is equality for some suitable vector field .
The Blachman-Stam theorem follows from this characterization by taking the constant vector field . Then we have , and so the right hand side of (4) becomes , where recall that is the Fisher information matrix. In the setting of Theorem 3 the density of is a product density: , where is the density of . Consequently the Fisher information matrix is a diagonal matrix, , and thus , concluding the proof of Theorem 3 using Theorem 4.
Given the characterization of Theorem 4, one need not take the vector field to be constant; one can obtain more by optimizing over the vector field. Doing this leads to the following theorem, which gives a rate of decrease of the Fisher information distance under convolutions.
Theorem [Artstein, Ball, Barthe, and Naor]
Let be i.i.d. random variables with a density having a positive spectral gap . (We say that a random variable has spectral gap if for every sufficiently smooth , we have . In particular, log-concave random variables have a positive spectral gap, see Bobkov (1999).) Then for any with we have that
When , then , and thus using (3) we obtain a rate of convergence of in the entropic CLT.
A result similar to Theorem 5 was proven independently and roughly at the same time by Johnson and Barron using a different approach involving score functions.
A high-dimensional entropic CLT
The techniques of Artstein, Ball, Barthe, and Naor generalize to higher dimensions, as was recently shown by Bubeck and Ganguly.
A result similar to Theorem 5 can be proven, from which a high-dimensional entropic CLT follows, together with a rate of convergence, by using (3) again.
Theorem [Bubeck and Ganguly]
Let be a random vector with i.i.d. entries from a distribution with zero mean, unit variance, and spectral gap . Let be a matrix such that , the identity matrix. Let
and
Then we have that
where denotes the standard Gaussian measure in .
To interpret this result, consider the case where the matrix is built by picking rows one after the other uniformly at random on the Euclidean sphere in , conditionally on being orthogonal to previous rows (to satisfy the isotropicity condition ). We then expect to have and (we leave the details as an exercise for the reader), and so Theorem 7 tells us that .
Back to Wishart and GOE
We now turn our attention back to bounding the relative entropy ; recall (2). Since the Wishart matrix contains the (scaled) inner products of vectors in , it is natural to relate and , since the former comes from the latter by adding an additional -dimensional vector to the vectors already present. Specifically, we have the following:
where is a -dimensional random vector with i.i.d. entries from , which are also independent from . Similarly we can write the matrix using :
This naturally suggests to use the chain rule for relative entropy and bound
by induction on . We get that
By convexity of the relative entropy we also have that
Thus our goal is to understand and bound for , and then apply the bound to (followed by taking expectation over ). This is precisely what was done in Theorem 6, the high-dimensional entropic CLT, for satisfying . Since does not necessarily satisfy , we have to correct for the lack of isotropicity. This is the content of the following lemma, the proof of which we leave as an exercise for the reader.
Lemma
Let and be such that . Then for any isotropic random variable taking values in we have that
(5)
We then apply this lemma with and . Observe that
and hence in expectation the middle two terms of the right hand side of (5) cancel each other out.
The last term in (5),
should be understood as the relative entropy between a centered Gaussian with covariance given by and a standard Gaussian in . Controlling the expectation of this term requires studying the probability that is close to being non-invertible, which requires bounds on the left tail of the smallest singular of . Understanding the extreme singular values of random matrices is a fascinating topic, but it is outside of the scope of these notes, and so we refer the reader to Bubeck and Ganguly for more details on this point.
Finally, the high-dimensional entropic CLT can now be applied to see that
From the induction on we get another factor of , arriving at
We conclude that the dimension threshold is , and the information-theoretic proof that we have outlined sheds light on why this threshold is .