In this post we give a proof of the fundamental limits of dimension estimation in random geometric graphs, based on the recent work of Bubeck and Ganguly. We refer the reader to the previous post for a detailed introduction; here we just recall the main theorem we will prove.
Theorem 1 [Bubeck and Ganguly]
If the distribution $\mu$ is log-concave, i.e., if it has density $e^{-\varphi(\cdot)}$ for some convex function $\varphi$, and if $\frac{n^{3} \log^{2}(d)}{d} \to 0$, then

$$\mathrm{TV}\left( W_{n,d}, M_{n} \right) \to 0, \qquad (1)$$

where $W_{n,d}$ is an appropriately scaled $n \times n$ Wishart matrix coming from $n$ vectors having i.i.d. $\mu$ entries and $M_{n}$ is an $n \times n$ GOE matrix, both with the diagonal removed.
The proof hinges on a high-dimensional entropic central limit theorem, so a large part of the post is devoted to entropic central limit theorems and ways of proving them. Without further ado let us jump right in.
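To make the objects in Theorem 1 concrete, here is a minimal numerical sketch of the two ensembles being compared. It is not part of the argument: it assumes the scaling $\langle X_i, X_j \rangle / \sqrt{d}$ for the off-diagonal Wishart entries from the previous post, uses a uniform entry distribution as a stand-in for a generic log-concave $\mu$, and the helper names are ours.

```
import numpy as np

rng = np.random.default_rng(0)

def wishart_no_diag(n, d, sampler):
    """n x n matrix of scaled inner products <X_i, X_j>/sqrt(d), diagonal zeroed out."""
    X = sampler((n, d))                      # n vectors in R^d with i.i.d. entries
    W = X @ X.T / np.sqrt(d)
    np.fill_diagonal(W, 0.0)
    return W

def goe_no_diag(n):
    """Symmetric matrix with N(0,1) entries off the diagonal, diagonal zeroed out."""
    G = rng.standard_normal((n, n))
    M = (G + G.T) / np.sqrt(2)               # off-diagonal entries are N(0,1)
    np.fill_diagonal(M, 0.0)
    return M

# A log-concave entry distribution with zero mean and unit variance (uniform law).
mu = lambda size: rng.uniform(-np.sqrt(3), np.sqrt(3), size=size)

W = wishart_no_diag(n=5, d=10_000, sampler=mu)
M = goe_no_diag(n=5)
print(np.round(W, 2))
print(np.round(M, 2))
```

For $d$ much larger than $n^{3}$ the two printed matrices come from nearly indistinguishable distributions, which is exactly what Theorem 1 quantifies.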
Pinsker’s inequality: from total variation to relative entropy
Our goal is now to bound $\mathrm{TV}\left( W_{n,d}, M_{n} \right)$ from above. In the general setting considered here there is no nice formula for the density of the Wishart ensemble, so $\mathrm{TV}\left( W_{n,d}, M_{n} \right)$ cannot be computed directly. Coupling these two random matrices also seems challenging.
In light of these observations, it is natural to switch to a different notion of distance between probability distributions that is easier to handle in this case. Here we use Pinsker's inequality to switch to relative entropy:

$$\mathrm{TV}\left( W_{n,d}, M_{n} \right)^{2} \leq \frac{1}{2} \mathrm{Ent}\left( W_{n,d} \,\|\, M_{n} \right), \qquad (2)$$

where $\mathrm{Ent}\left( W_{n,d} \,\|\, M_{n} \right)$ denotes the relative entropy of $W_{n,d}$ with respect to $M_{n}$. We next take a detour to entropic central limit theorems and techniques involved in their proofs, before coming back to bounding the right hand side in (2).
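As a quick sanity check of (2) in a case where both sides are available in closed form, consider the one-dimensional Gaussians $N(\mu, 1)$ and $N(0, 1)$: the total variation distance is $2\Phi(\mu/2) - 1$ and the relative entropy is $\mu^{2}/2$. The following short sketch (our own illustration, not part of the proof) verifies Pinsker's inequality numerically.

```
from math import erf, sqrt

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# P = N(mu, 1) versus Q = N(0, 1): TV(P, Q) = 2*Phi(mu/2) - 1 and Ent(P || Q) = mu^2 / 2.
for mu in [0.1, 0.5, 1.0, 2.0]:
    tv = 2 * Phi(mu / 2) - 1
    kl = mu ** 2 / 2
    assert tv ** 2 <= kl / 2                 # Pinsker: TV^2 <= (1/2) Ent
    print(f"mu = {mu}:  TV^2 = {tv ** 2:.4f}  <=  Ent/2 = {kl / 2:.4f}")
```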
An introduction to entropic CLTs
Let $\gamma_n$ denote the density of $N(0, I_n)$, the $n$-dimensional standard Gaussian distribution, and let $f$ be an isotropic density with mean zero, i.e., a density for which the covariance matrix is the identity $I_n$. Then

$$\mathrm{Ent}\left( f \,\|\, \gamma_n \right) = \int f \log f - \int f \log \gamma_n = -H(f) + H(\gamma_n),$$

where the second equality follows from the fact that $\log \gamma_n$ is quadratic in $x$ and the first two moments of $f$ and $\gamma_n$ are the same by assumption (so $\int f \log \gamma_n = \int \gamma_n \log \gamma_n$); here $H(f) := -\int f \log f$ denotes the differential entropy of $f$. Since relative entropy is always nonnegative, we thus see that the standard Gaussian maximizes entropy among isotropic densities. It turns out that much more is true.
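As a small illustration of this maximum-entropy property in one dimension, the following sketch compares the differential entropy of the standard Gaussian with that of two other zero-mean, unit-variance densities; the values are the standard closed-form expressions, and the example is ours.

```
import numpy as np

# Differential entropies of some zero-mean, unit-variance densities.
entropies = {
    "standard Gaussian": 0.5 * np.log(2 * np.pi * np.e),          # ~1.4189
    "uniform on [-sqrt(3), sqrt(3)]": np.log(2 * np.sqrt(3.0)),   # ~1.2425
    "Laplace with scale 1/sqrt(2)": 1.0 + 0.5 * np.log(2.0),      # ~1.3466
}
for name, h in entropies.items():
    print(f"{name:32s} H = {h:.4f}")
```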
The central limit theorem states that if $X_1, X_2, \ldots$ are i.i.d. real-valued random variables with zero mean and unit variance, then $S_n := \frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i$ converges in distribution to a standard Gaussian random variable as $n \to \infty$. There are many other senses in which $S_n$ converges to a standard Gaussian, the entropic CLT being one of them.
Theorem 2 [Entropic CLT]
Let $X_1, X_2, \ldots$ be i.i.d. real-valued random variables with zero mean and unit variance, and let $S_n := \frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i$. If $H(S_1) > -\infty$, then $H(S_n) \to H(\gamma_1) = \frac{1}{2} \log(2 \pi e)$ as $n \to \infty$. Moreover, the entropy of $S_n$ increases monotonically, i.e., $H(S_n) \leq H(S_{n+1})$ for every $n \geq 1$.
The condition $H(S_1) > -\infty$ is necessary for an entropic CLT to hold; for instance, if the $X_i$ are discrete, then $H(S_n) = -\infty$ for all $n \geq 1$.
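The monotone convergence of $H(S_n)$ can be observed numerically. The sketch below is our own illustration, under arbitrary discretization choices, with $X_i$ uniform on $[-\sqrt{3}, \sqrt{3}]$: it computes the density of $X_1 + \cdots + X_n$ by repeated convolution on a grid and then uses $H(S_n) = H(X_1 + \cdots + X_n) - \frac{1}{2}\log n$.

```
import numpy as np

# Entropy of S_n = (X_1 + ... + X_n)/sqrt(n) for X_i uniform on [-sqrt(3), sqrt(3)]
# (zero mean, unit variance), via iterated numerical convolution of the density.
dx = 0.001
half_width = np.sqrt(3.0)
x = np.arange(-half_width, half_width + dx, dx)
f = np.full_like(x, 1.0 / (2 * half_width))

def entropy(p):
    p = p[p > 1e-300]
    return -np.sum(p * np.log(p)) * dx

gaussian_entropy = 0.5 * np.log(2 * np.pi * np.e)
conv = f.copy()
for n in range(1, 7):
    if n > 1:
        conv = np.convolve(conv, f) * dx          # density of X_1 + ... + X_n
    H_Sn = entropy(conv) - 0.5 * np.log(n)        # rescaling by 1/sqrt(n)
    print(f"n = {n}:  H(S_n) = {H_Sn:.4f}   (Gaussian: {gaussian_entropy:.4f})")
```

The printed values increase with $n$ and approach the Gaussian entropy $\frac{1}{2}\log(2\pi e) \approx 1.4189$.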
The entropic CLT originates with Shannon in the 1940s and was first proven by Linnik (without the monotonicity part of the statement). The first proofs that gave explicit convergence rates were given independently, and at roughly the same time, by Artstein, Ball, Barthe, and Naor, and by Johnson and Barron in the early 2000s, using two different techniques.
The fact that $H(S_n) \leq H(S_{2n})$ follows from the entropy power inequality, which goes back to Shannon in 1948. This implies that $H(S_{2^k}) \leq H(S_{2^{k+1}})$ for all $k \geq 0$, and so it was naturally conjectured that $H(S_n)$ increases monotonically. However, proving this turned out to be challenging. Even the inequality $H(S_2) \leq H(S_3)$ was unknown for over fifty years, until Artstein, Ball, Barthe, and Naor proved in general that $H(S_n) \leq H(S_{n+1})$ for all $n \geq 1$.
In the following we sketch some of the main ideas that go into the proof of these results, in particular following the technique introduced by Ball, Barthe, and Naor.
From relative entropy to Fisher information
Our goal is to show that some random variable $X$, which is a normalized sum of many i.i.d. random variables, is close to a Gaussian $G \sim N(0, I_n)$. One way to approach this is to interpolate between the two. There are several ways of doing this; for our purposes interpolation along the Ornstein-Uhlenbeck semigroup is most useful. Define

$$P_t X := e^{-t} X + \sqrt{1 - e^{-2t}} \, G$$

for $t \in [0, \infty)$, where $G$ is independent of $X$, and let $f_t$ denote the density of $P_t X$. We have $P_0 X = X$ and $P_t X \to G$ in distribution as $t \to \infty$. This semigroup has several desirable properties. For instance, if the density of $X$ is isotropic, then so is $f_t$. Before we can state the next desirable property that we will use, we need to introduce a few more useful quantities.
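Here is a minimal sampling sketch of this interpolation (the function name and the choice of starting distribution are ours): starting from a non-Gaussian isotropic law, the samples of $P_t X$ look more and more Gaussian as $t$ grows, while the variance stays equal to one.

```
import numpy as np

rng = np.random.default_rng(1)

def ou(x, t, rng):
    """Samples of P_t X = e^{-t} X + sqrt(1 - e^{-2t}) G for G standard Gaussian."""
    g = rng.standard_normal(x.shape)
    return np.exp(-t) * x + np.sqrt(1 - np.exp(-2 * t)) * g

x = rng.uniform(-np.sqrt(3), np.sqrt(3), size=200_000)   # isotropic, non-Gaussian start
for t in [0.0, 0.25, 1.0, 4.0]:
    xt = ou(x, t, rng)
    excess_kurtosis = np.mean(xt ** 4) / np.var(xt) ** 2 - 3
    print(f"t = {t}:  variance = {np.var(xt):.3f},  excess kurtosis = {excess_kurtosis:.3f}")
```

The excess kurtosis starts at $-1.2$ (its value for the uniform distribution) and tends to $0$, its Gaussian value.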
For a density function $f : \mathbb{R}^{n} \to \mathbb{R}_{+}$, let

$$\mathcal{I}(f) := \int \frac{\nabla f \, (\nabla f)^{\top}}{f}$$

be the Fisher information matrix. The Cramér-Rao bound states that

$$\mathrm{Cov}(f) \succeq \mathcal{I}(f)^{-1}.$$

More generally this holds for the covariance of any unbiased estimator of the mean. The Fisher information is defined as

$$I(f) := \mathrm{Tr}\left( \mathcal{I}(f) \right).$$

It is sometimes more convenient to work with the Fisher information distance, defined as $J(f) := I(f) - I(\gamma_n) = I(f) - n$. Similarly to the discussion above, one can show that the standard Gaussian minimizes the Fisher information among isotropic densities, and hence the Fisher information distance is always nonnegative.
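For a Gaussian density $f = N(0, \Sigma)$ the Fisher information matrix is exactly $\Sigma^{-1}$, so the Cramér-Rao bound holds with equality. The following Monte Carlo sketch (our own illustration) estimates $\mathcal{I}(f) = \int f \, \nabla \log f \, (\nabla \log f)^{\top}$ from samples and compares it with $\Sigma^{-1}$.

```
import numpy as np

rng = np.random.default_rng(2)

# For f = N(0, Sigma): grad log f(x) = -Sigma^{-1} x, so the Fisher information
# matrix is Sigma^{-1} and Cov(f) = I(f)^{-1} (Cramer-Rao with equality).
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
X = rng.multivariate_normal(np.zeros(2), Sigma, size=200_000)
scores = -X @ Sigma_inv                      # rows are (grad log f)(x)^T
I_hat = scores.T @ scores / len(X)
print("Monte Carlo Fisher information matrix:\n", np.round(I_hat, 3))
print("Sigma^{-1}:\n", np.round(Sigma_inv, 3))
```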
Now we are ready to state the de Bruijn identity, which characterizes the change of entropy along the Ornstein-Uhlenbeck semigroup via the Fisher information distance:

$$\frac{\partial}{\partial t} \mathrm{Ent}\left( f_t \,\|\, \gamma_n \right) = - J(f_t).$$

This implies that the relative entropy between $f$ and $\gamma_n$, which is our quantity of interest, can be expressed as follows:

$$\mathrm{Ent}\left( f \,\|\, \gamma_n \right) = \int_{0}^{\infty} J(f_t) \, dt. \qquad (3)$$

Thus our goal is to bound the Fisher information distance $J(f_t)$.
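Identity (3) can be checked numerically for a concrete isotropic starting density. The sketch below is our own verification, with arbitrary parameter choices: it takes $f$ to be a symmetric two-component Gaussian mixture with unit variance; along the Ornstein-Uhlenbeck semigroup $f_t$ remains such a mixture in closed form, and since $f_t$ stays isotropic, $J(f_t) = I(f_t) - 1$ can be written as $\int f_t \left( (\log f_t)' + x \right)^{2}$.

```
import numpy as np

# f = (1/2) N(-a, s2) + (1/2) N(a, s2) with a^2 + s2 = 1 (zero mean, unit variance).
# Along the OU semigroup: f_t = (1/2) N(-a e^{-t}, v_t) + (1/2) N(a e^{-t}, v_t),
# with v_t = s2 e^{-2t} + 1 - e^{-2t}.
a = 0.8
s2 = 1.0 - a ** 2
x = np.linspace(-8, 8, 16001)
dx = x[1] - x[0]
gamma = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

def mixture(t):
    m, v = a * np.exp(-t), s2 * np.exp(-2 * t) + 1 - np.exp(-2 * t)
    pdf = lambda mean: np.exp(-(x - mean) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)
    dpdf = lambda mean: -(x - mean) / v * pdf(mean)
    return 0.5 * (pdf(-m) + pdf(m)), 0.5 * (dpdf(-m) + dpdf(m))

def relative_entropy(f):
    return np.sum(f * np.log(f / gamma)) * dx

def J(t):
    f, df = mixture(t)
    return np.sum(f * (df / f + x) ** 2) * dx     # Fisher information distance of f_t

ts = np.arange(0.005, 12.0, 0.01)                 # midpoint rule on [0, 12]
integral = np.sum([J(t) for t in ts]) * 0.01
print(f"Ent(f || gamma)       = {relative_entropy(mixture(0.0)[0]):.5f}")
print(f"integral of J(f_t) dt = {integral:.5f}")
```

The two printed values agree closely, as (3) predicts.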
Bounding the Fisher information distance
We first recall a classical result by Blachman and Stam that shows that Fisher information decreases under convolution.
Theorem 3 [Blachman; Stam]
Let $Y_1, \ldots, Y_n$ be independent random variables taking values in $\mathbb{R}$, and let $a = (a_1, \ldots, a_n) \in \mathbb{R}^{n}$ be such that $\|a\|_{2} = 1$. Then

$$I\left( \sum_{i=1}^{n} a_i Y_i \right) \leq \sum_{i=1}^{n} a_i^{2} \, I(Y_i).$$

In the i.i.d. case, taking $a_i = 1/\sqrt{n}$, this bound becomes $I\left( \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Y_i \right) \leq I(Y_1)$.
Ball, Barthe, and Naor gave the following variational characterization of the Fisher information, which gives a particularly simple proof of Theorem 3. (See Bubeck and Ganguly for a short proof.)
Theorem 4 [Variational characterization of Fisher information]
Let $w$ be a sufficiently smooth density on $\mathbb{R}^{n}$, let $a \in \mathbb{R}^{n}$ be a unit vector, and let $h$ be the marginal of $w$ in direction $a$. Then we have

$$I(h) \leq \int_{\mathbb{R}^{n}} \frac{\left( \mathrm{div}\left( p \, w \right) \right)^{2}}{w} \qquad (4)$$

for any continuously differentiable vector field $p : \mathbb{R}^{n} \to \mathbb{R}^{n}$ with the property that for every $x$, $\langle p(x), a \rangle = 1$. Moreover, if $w$ satisfies appropriate regularity and integrability conditions, then there is equality for some suitable vector field $p$.
The Blachman-Stam theorem follows from this characterization by taking the constant vector field $p \equiv a$. Then we have $\mathrm{div}\left( p \, w \right) = \langle a, \nabla w \rangle$, and so the right hand side of (4) becomes $a^{\top} \mathcal{I}(w) \, a$, where recall that $\mathcal{I}(w)$ is the Fisher information matrix. In the setting of Theorem 3 the density $w$ of $(Y_1, \ldots, Y_n)$ is a product density: $w(x) = f_1(x_1) \cdots f_n(x_n)$, where $f_i$ is the density of $Y_i$. Consequently the Fisher information matrix is a diagonal matrix, $\mathcal{I}(w) = \mathrm{diag}\left( I(Y_1), \ldots, I(Y_n) \right)$, and thus $a^{\top} \mathcal{I}(w) \, a = \sum_{i=1}^{n} a_i^{2} I(Y_i)$, concluding the proof of Theorem 3 using Theorem 4.
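In the Gaussian case everything in this argument is explicit, which makes for a quick sanity check (our own, with arbitrary variances): for independent $Y_i \sim N(0, \sigma_i^{2})$ one has $I(Y_i) = 1/\sigma_i^{2}$, the Fisher information matrix of the product density is $\mathrm{diag}(1/\sigma_1^{2}, \ldots, 1/\sigma_n^{2})$, and $\sum_i a_i Y_i \sim N(0, \sum_i a_i^{2} \sigma_i^{2})$.

```
import numpy as np

rng = np.random.default_rng(3)

# Blachman-Stam for independent Gaussians: I(sum a_i Y_i) <= sum a_i^2 I(Y_i).
sigma2 = np.array([0.5, 1.0, 2.0, 4.0])        # variances of Y_1, ..., Y_4
a = rng.standard_normal(4)
a /= np.linalg.norm(a)                          # unit vector, ||a||_2 = 1

lhs = 1.0 / np.sum(a ** 2 * sigma2)             # I(sum a_i Y_i): Gaussian of variance sum a_i^2 sigma_i^2
rhs = np.sum(a ** 2 / sigma2)                   # a^T I(w) a for the diagonal Fisher information matrix
print(f"I(sum a_i Y_i) = {lhs:.4f}  <=  sum a_i^2 I(Y_i) = {rhs:.4f}")
```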
Given the characterization of Theorem 4, one need not take the vector field to be constant; one can obtain more by optimizing over the vector field. Doing this leads to the following theorem, which gives a rate of decrease of the Fisher information distance under convolutions.
Theorem 5 [Artstein, Ball, Barthe, and Naor]
Let $Y_1, \ldots, Y_n$ be i.i.d. random variables with a density having a positive spectral gap $c$. (We say that a random variable $Y$ has spectral gap $c$ if for every sufficiently smooth $g$ we have $c \, \mathrm{Var}\left( g(Y) \right) \leq \mathbb{E}\left[ g'(Y)^{2} \right]$. In particular, log-concave random variables have a positive spectral gap, see Bobkov (1999).) Then for any $a = (a_1, \ldots, a_n) \in \mathbb{R}^{n}$ with $\|a\|_{2} = 1$ we have that

$$J\left( \sum_{i=1}^{n} a_i Y_i \right) \leq \frac{2 \, \|a\|_{4}^{4}}{c + (2 - c) \, \|a\|_{4}^{4}} \, J(Y_1).$$
When $a = \left( \tfrac{1}{\sqrt{n}}, \ldots, \tfrac{1}{\sqrt{n}} \right)$, then $\|a\|_{4}^{4} = \tfrac{1}{n}$, and thus using (3) we obtain a rate of convergence of $O\!\left( \tfrac{1}{n} \right)$ in the entropic CLT.
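To illustrate the spectral gap condition appearing in Theorem 5, recall that the standard Gaussian has spectral gap $c = 1$. The sketch below (our own illustration, with test functions chosen arbitrarily) checks the Poincaré inequality $\mathrm{Var}\left( g(Y) \right) \leq \mathbb{E}\left[ g'(Y)^{2} \right]$ by Monte Carlo for a few smooth $g$.

```
import numpy as np

rng = np.random.default_rng(4)
Y = rng.standard_normal(1_000_000)

# Poincare inequality for the standard Gaussian (spectral gap c = 1):
# Var(g(Y)) <= E[g'(Y)^2] for sufficiently smooth g.
tests = [
    ("sin(y)", np.sin, np.cos),
    ("y^3", lambda y: y ** 3, lambda y: 3 * y ** 2),
    ("tanh(y)", np.tanh, lambda y: 1 - np.tanh(y) ** 2),
]
for name, g, dg in tests:
    print(f"g = {name:8s}  Var(g) = {np.var(g(Y)):.3f}  <=  E[g'^2] = {np.mean(dg(Y) ** 2):.3f}")
```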
A result similar to Theorem 5 was proven independently and roughly at the same time by Johnson and Barron using a different approach involving score functions.
A high-dimensional entropic CLT
The techniques of Artstein, Ball, Barthe, and Naor generalize to higher dimensions, as was recently shown by Bubeck and Ganguly.
A result similar to Theorem 5 can be proven, from which a high-dimensional entropic CLT follows, together with a rate of convergence, by using (3) again.
Theorem 6 [Bubeck and Ganguly]
Let $Y \in \mathbb{R}^{n}$ be a random vector with i.i.d. entries from a distribution $\mu$ with zero mean, unit variance, and spectral gap $c > 0$. Let $A \in \mathbb{R}^{d \times n}$ be a matrix such that $A A^{\top} = I_{d}$, the $d \times d$ identity matrix. Let

$$\varepsilon := \max_{i \in [n]} \left( A^{\top} A \right)_{i,i} \qquad \text{and} \qquad \zeta := \max_{i, j \in [n], \, i \neq j} \left| \left( A^{\top} A \right)_{i,j} \right|.$$

Then we have that $\mathrm{Ent}\left( A Y \,\|\, \gamma_{d} \right)$ is bounded from above by an explicit quantity depending only on $c$, $d$, $\varepsilon$, and $\zeta$ (we do not reproduce its precise form here and refer to Bubeck and Ganguly for it), where $\gamma_{d}$ denotes the standard Gaussian measure in $\mathbb{R}^{d}$.
To interpret this result, consider the case where the matrix $A$ is built by picking its rows one after the other uniformly at random on the Euclidean sphere in $\mathbb{R}^{n}$, conditionally on being orthogonal to the previous rows (to satisfy the isotropicity condition $A A^{\top} = I_{d}$). We then expect to have $\varepsilon \approx d/n$ and $\zeta \approx \sqrt{d}/n$ (we leave the details as an exercise for the reader), and so Theorem 6 tells us that $\mathrm{Ent}\left( A Y \,\|\, \gamma_{d} \right) \lesssim d^{2}/n$, up to constant and logarithmic factors.
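The exercise can also be checked empirically. The sketch below (our own, with arbitrary sizes) builds a $d \times n$ matrix $A$ with orthonormal rows by transposing the Q factor of a Gaussian matrix, a convenient proxy for the sequential construction described above, and compares $\varepsilon$ and $\zeta$ with $d/n$ and $\sqrt{d}/n$; they match up to constants and the logarithmic factors coming from taking maxima.

```
import numpy as np

rng = np.random.default_rng(5)
d, n = 20, 5000

G = rng.standard_normal((n, d))
Q, _ = np.linalg.qr(G)                  # n x d with orthonormal columns
A = Q.T                                 # d x n with orthonormal rows: A A^T = I_d
assert np.allclose(A @ A.T, np.eye(d))

B = A.T @ A                             # n x n
eps = np.max(np.diag(B))
zeta = np.max(np.abs(B - np.diag(np.diag(B))))
print(f"eps  = {eps:.2e}   vs  d/n       = {d / n:.2e}")
print(f"zeta = {zeta:.2e}   vs  sqrt(d)/n = {np.sqrt(d) / n:.2e}")
```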
Back to Wishart and GOE
We now turn our attention back to bounding the relative entropy $\mathrm{Ent}\left( W_{n,d} \,\|\, M_{n} \right)$; recall (2). Since the Wishart matrix contains the (scaled) inner products of $n$ vectors in $\mathbb{R}^{d}$, it is natural to relate $W_{n,d}$ and $W_{n-1,d}$, since the former comes from the latter by adding an additional $d$-dimensional vector to the $n-1$ vectors already present. Specifically, writing $X_1, \ldots, X_n$ for the underlying i.i.d. $d$-dimensional vectors and $\mathbb{X}_{n-1}$ for the $(n-1) \times d$ matrix whose rows are $X_1, \ldots, X_{n-1}$, we have the following:

$$W_{n,d} = \begin{pmatrix} W_{n-1,d} & \frac{1}{\sqrt{d}} \mathbb{X}_{n-1} X_n \\ \frac{1}{\sqrt{d}} \left( \mathbb{X}_{n-1} X_n \right)^{\top} & 0 \end{pmatrix},$$

where $X_n$ is a $d$-dimensional random vector with i.i.d. entries from $\mu$, which are also independent from $\mathbb{X}_{n-1}$. Similarly we can write the matrix $M_{n}$ using $M_{n-1}$:

$$M_{n} = \begin{pmatrix} M_{n-1} & Z_{n-1} \\ Z_{n-1}^{\top} & 0 \end{pmatrix},$$

where $Z_{n-1}$ is an $(n-1)$-dimensional standard Gaussian vector independent of $M_{n-1}$.
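The block decomposition is easy to verify numerically; the short sketch below (with a uniform entry distribution standing in for $\mu$ and the scaling $\langle X_i, X_j \rangle / \sqrt{d}$) checks that the top-left block of $W_{n,d}$ is $W_{n-1,d}$ and that the last column is $\frac{1}{\sqrt{d}} \mathbb{X}_{n-1} X_n$.

```
import numpy as np

rng = np.random.default_rng(6)
n, d = 6, 50
X = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, d))    # rows are X_1, ..., X_n

def wishart_no_diag(M):
    W = M @ M.T / np.sqrt(M.shape[1])
    np.fill_diagonal(W, 0.0)
    return W

W_n = wishart_no_diag(X)                 # W_{n,d}
W_top = wishart_no_diag(X[:-1])          # W_{n-1,d}
new_col = X[:-1] @ X[-1] / np.sqrt(d)    # (1/sqrt(d)) X_{n-1} X_n

print(np.allclose(W_n[:-1, :-1], W_top))     # True: top-left block
print(np.allclose(W_n[:-1, -1], new_col))    # True: last column above the diagonal
print(W_n[-1, -1])                           # 0.0: diagonal removed
```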
The decomposition above naturally suggests using the chain rule for relative entropy and bounding $\mathrm{Ent}\left( W_{n,d} \,\|\, M_{n} \right)$ by induction on $n$. We get that

$$\mathrm{Ent}\left( W_{n,d} \,\|\, M_{n} \right) = \mathrm{Ent}\left( W_{n-1,d} \,\|\, M_{n-1} \right) + \mathbb{E}\left[ \mathrm{Ent}\left( \mathcal{L}\left( \tfrac{1}{\sqrt{d}} \mathbb{X}_{n-1} X_n \,\middle|\, W_{n-1,d} \right) \Big\| \, \gamma_{n-1} \right) \right],$$

where $\mathcal{L}\left( \cdot \mid \cdot \right)$ denotes conditional law. By convexity of the relative entropy we also have that

$$\mathbb{E}\left[ \mathrm{Ent}\left( \mathcal{L}\left( \tfrac{1}{\sqrt{d}} \mathbb{X}_{n-1} X_n \,\middle|\, W_{n-1,d} \right) \Big\| \, \gamma_{n-1} \right) \right] \leq \mathbb{E}\left[ \mathrm{Ent}\left( \mathcal{L}\left( \tfrac{1}{\sqrt{d}} \mathbb{X}_{n-1} X_n \,\middle|\, \mathbb{X}_{n-1} \right) \Big\| \, \gamma_{n-1} \right) \right].$$
Thus our goal is to understand and bound $\mathrm{Ent}\left( A X_n \,\|\, \gamma_{n-1} \right)$ for a fixed $(n-1) \times d$ matrix $A$, and then apply the bound to $A = \tfrac{1}{\sqrt{d}} \mathbb{X}_{n-1}$ (followed by taking expectation over $\mathbb{X}_{n-1}$). This is precisely what was done in Theorem 6, the high-dimensional entropic CLT, for $A$ satisfying $A A^{\top} = I_{n-1}$. Since $\tfrac{1}{\sqrt{d}} \mathbb{X}_{n-1}$ does not necessarily satisfy $\tfrac{1}{d} \mathbb{X}_{n-1} \mathbb{X}_{n-1}^{\top} = I_{n-1}$, we have to correct for the lack of isotropicity. This is the content of the following lemma, the proof of which we leave as an exercise for the reader.
Lemma
Let $Q \in \mathbb{R}^{d \times d}$ be a positive definite matrix and let $A \in \mathbb{R}^{d \times n}$ be such that $A A^{\top} = I_{d}$. Then for any isotropic random variable $Y$ taking values in $\mathbb{R}^{n}$ we have that

$$\mathrm{Ent}\left( Q A Y \,\|\, \gamma_{d} \right) \leq \mathrm{Ent}\left( A Y \,\|\, \gamma_{d} \right) + \tfrac{1}{2} \mathrm{Tr}\left( Q Q^{\top} \right) - \tfrac{d}{2} - \tfrac{1}{2} \log \det\left( Q Q^{\top} \right). \qquad (5)$$
We then apply this lemma with $Q = \left( \tfrac{1}{d} \mathbb{X}_{n-1} \mathbb{X}_{n-1}^{\top} \right)^{1/2}$ and $A = Q^{-1} \cdot \tfrac{1}{\sqrt{d}} \mathbb{X}_{n-1}$, so that $A A^{\top} = I_{n-1}$ and $Q A Y = \tfrac{1}{\sqrt{d}} \mathbb{X}_{n-1} X_n$ for $Y = X_n$. Observe that

$$\mathbb{E}\left[ \mathrm{Tr}\left( Q Q^{\top} \right) \right] = \tfrac{1}{d} \, \mathbb{E}\left[ \mathrm{Tr}\left( \mathbb{X}_{n-1} \mathbb{X}_{n-1}^{\top} \right) \right] = n - 1,$$

and hence in expectation the middle two terms of the right hand side of (5) cancel each other out. The last term in (5),

$$- \tfrac{1}{2} \log \det\left( Q Q^{\top} \right) = - \tfrac{1}{2} \log \det\left( \tfrac{1}{d} \mathbb{X}_{n-1} \mathbb{X}_{n-1}^{\top} \right),$$

should be understood as the relative entropy between a centered Gaussian with covariance given by $\tfrac{1}{d} \mathbb{X}_{n-1} \mathbb{X}_{n-1}^{\top}$ and a standard Gaussian in $\mathbb{R}^{n-1}$. Controlling the expectation of this term requires studying the probability that $\tfrac{1}{d} \mathbb{X}_{n-1} \mathbb{X}_{n-1}^{\top}$ is close to being non-invertible, which requires bounds on the left tail of the smallest singular value of $\mathbb{X}_{n-1}$. Understanding the extreme singular values of random matrices is a fascinating topic, but it is outside of the scope of these notes, and so we refer the reader to Bubeck and Ganguly for more details on this point.
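To get a feel for this last point, the sketch below (our own, with a uniform entry distribution and a value of $d$ chosen merely to keep the computation light, well below the $n^{3}$ regime) estimates the smallest singular value of $\tfrac{1}{\sqrt{d}} \mathbb{X}_{n-1}$ over a few independent draws; for $d \gg n$ it concentrates near $1$, so $\tfrac{1}{d} \mathbb{X}_{n-1} \mathbb{X}_{n-1}^{\top}$ is far from being non-invertible with high probability.

```
import numpy as np

rng = np.random.default_rng(7)
n, d, trials = 50, 20_000, 10

smallest = []
for _ in range(trials):
    X = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n - 1, d))
    s = np.linalg.svd(X / np.sqrt(d), compute_uv=False)   # singular values, descending
    smallest.append(s[-1])

print("smallest singular values:", np.round(smallest, 3))
print("asymptotic Bai-Yin edge 1 - sqrt(n/d) =", round(1 - np.sqrt(n / d), 3))
```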
Finally, the high-dimensional entropic CLT can now be applied to see that

$$\mathbb{E}\left[ \mathrm{Ent}\left( \mathcal{L}\left( \tfrac{1}{\sqrt{d}} \mathbb{X}_{n-1} X_n \,\middle|\, \mathbb{X}_{n-1} \right) \Big\| \, \gamma_{n-1} \right) \right] \lesssim \frac{n^{2} \log^{2}(d)}{d}.$$

From the induction on $n$ we get another factor of $n$, arriving at

$$\mathrm{Ent}\left( W_{n,d} \,\|\, M_{n} \right) \lesssim \frac{n^{3} \log^{2}(d)}{d}.$$

Together with Pinsker's inequality (2), this proves Theorem 1. We conclude that the dimension threshold is $d \approx n^{3}$, up to logarithmic factors, and the information-theoretic proof that we have outlined sheds light on why this threshold is $n^{3}$.