
Guest post by Miklos Racz: Entropic central limit theorems and a proof of the fundamental limits of dimension estimation

In this post we give a proof of the fundamental limits of dimension estimation in random geometric graphs, based on the recent work of Bubeck and Ganguly. We refer the reader to the previous post for a detailed introduction; here we just recall the main theorem we will prove.

Theorem [Bubeck and Ganguly]

If the distribution \mu is log-concave, i.e., if it has density f(\cdot) = e^{-\varphi(\cdot)} for some convex function \varphi, and if \frac{d}{n^{3} \log^{2} \left( d \right)} \to \infty, then

(1)   \begin{equation*} \mathrm{TV} \left( \mathcal{W}_{n,d}, \mathcal{G}_{n} \right) \to 0, \end{equation*}

where \mathcal{W}_{n,d} is an appropriately scaled n \times n Wishart matrix built from n i.i.d. d-dimensional vectors with i.i.d. entries from \mu, and \mathcal{G}_{n} is an n \times n GOE matrix, both with the diagonal removed.
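
To make the two ensembles concrete, here is a minimal numerical sketch in Python with NumPy (the function names and the choice of entry distribution are ours), using the scaling we believe matches the previous post: off-diagonal Wishart entries \langle X_i, X_j \rangle / \sqrt{d} and off-diagonal GOE entries that are standard Gaussians.

    import numpy as np

    rng = np.random.default_rng(0)

    def wishart_no_diag(n, d, sampler):
        """Scaled Wishart matrix with the diagonal removed: entry (i,j), i != j,
        equals <X_i, X_j> / sqrt(d) for n i.i.d. d-dimensional vectors X_1, ..., X_n."""
        X = sampler((n, d))               # rows are the n vectors, i.i.d. entries
        W = X @ X.T / np.sqrt(d)
        np.fill_diagonal(W, 0.0)
        return W

    def goe_no_diag(n):
        """GOE matrix with the diagonal removed: symmetric, N(0,1) off-diagonal entries."""
        G = rng.standard_normal((n, n))
        G = (G + G.T) / np.sqrt(2)
        np.fill_diagonal(G, 0.0)
        return G

    # Standard Gaussian entries are one example of a log-concave distribution mu.
    print(np.round(wishart_no_diag(5, 10**4, rng.standard_normal), 2))
    print(np.round(goe_no_diag(5), 2))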

The proof hinges on a high-dimensional entropic central limit theorem, so a large part of the post is devoted to entropic central limit theorems and ways of proving them. Without further ado let us jump right in.

 

Pinsker’s inequality: from total variation to relative entropy

Our goal is now to bound \mathrm{TV} \left( \mathcal{W}_{n,d}, \mathcal{G}_{n} \right) from above. In the general setting considered here there is no nice formula for the density of the Wishart ensemble, so \mathrm{TV} \left( \mathcal{W}_{n,d}, \mathcal{G}_{n} \right) cannot be computed directly. Coupling these two random matrices also seems challenging.

In light of these observations, it is natural to switch to a different metric on probability distributions that is easier to handle in this case. Here we use Pinsker’s inequality to switch to relative entropy:

(2)   \begin{equation*} \mathrm{TV} \left( \mathcal{W}_{n,d}, \mathcal{G}_{n} \right)^{2} \leq \frac{1}{2} \mathrm{Ent} \left( \mathcal{W}_{n,d} \, \| \, \mathcal{G}_{n} \right), \end{equation*}

where \mathrm{Ent} \left( \mathcal{W}_{n,d} \, \| \, \mathcal{G}_{n} \right) denotes the relative entropy of \mathcal{W}_{n,d} with respect to \mathcal{G}_{n}. We next take a detour to entropic central limit theorems and techniques involved in their proof, before coming back to bounding the right hand side in (2).
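
As a toy illustration of (2), unrelated to the Wishart/GOE setting, one can check Pinsker's inequality numerically for two small discrete distributions (the distributions below are arbitrary):

    import numpy as np

    def tv(p, q):
        """Total variation distance between two discrete distributions."""
        return 0.5 * np.abs(p - q).sum()

    def relative_entropy(p, q):
        """Ent(p || q) in nats; assumes the support of p is contained in that of q."""
        mask = p > 0
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.4, 0.4, 0.2])
    print(tv(p, q) ** 2, 0.5 * relative_entropy(p, q))   # 0.01 <= ~0.0126, as Pinsker predicts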

 

An introduction to entropic CLTs

Let \phi denote the density of \gamma_{n}, the n-dimensional standard Gaussian distribution, and let f be an isotropic density with mean zero, i.e., a density for which the covariance matrix is the identity I_{n}. Then

    \begin{align*} 0 \leq \mathrm{Ent} \left( f \, \| \, \phi \right) &= \int f \log f - \int f \log \phi \\ &= \int f \log f - \int \phi \log \phi = \mathrm{Ent} \left( \phi \right) - \mathrm{Ent} \left( f \right), \end{align*}

where the second equality follows from the fact that \log \phi \left( x \right) is quadratic in x, and the first two moments of f and \phi are the same by assumption. We thus see that the standard Gaussian maximizes entropy among isotropic densities. It turns out that much more is true.
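
As a one-dimensional sanity check of this extremal property (where isotropic simply means unit variance, together with the zero-mean assumption), one can compare closed-form differential entropies of a few unit-variance densities; the Gaussian value is the largest. This is only an illustration, not part of the argument.

    import numpy as np

    # Differential entropies (in nats) of some zero-mean, unit-variance densities.
    ent_gaussian = 0.5 * np.log(2 * np.pi * np.e)    # ~1.4189
    ent_uniform  = np.log(2 * np.sqrt(3.0))          # uniform on [-sqrt(3), sqrt(3)]: ~1.2425
    ent_laplace  = 1 + np.log(np.sqrt(2.0))          # Laplace with scale 1/sqrt(2): ~1.3466
    print(ent_gaussian, ent_uniform, ent_laplace)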

The central limit theorem states that if Z_{1}, Z_{2}, \dots are i.i.d. real-valued random variables with zero mean and unit variance, then S_{m} := \left( Z_{1} + \dots + Z_{m} \right) / \sqrt{m} converges in distribution to a standard Gaussian random variable as m \to \infty. There are many other senses in which S_{m} converges to a standard Gaussian, the entropic CLT being one of them.

Theorem [Entropic CLT]

Let Z_{1}, Z_{2}, \dots be i.i.d. real-valued random variables with zero mean and unit variance, and let S_{m} := \left( Z_{1} + \dots + Z_{m} \right) / \sqrt{m}. If \mathrm{Ent} \left( Z_{1} \, \| \, \phi \right) < \infty, then

    \[ \mathrm{Ent} \left( S_{m} \right) \nearrow \mathrm{Ent} \left( \phi \right) \]

as m \to \infty. Moreover, the entropy of S_{m} increases monotonically, i.e., \mathrm{Ent} \left( S_{m} \right) \leq \mathrm{Ent} \left( S_{m+1} \right) for every m \geq 1.

The condition \mathrm{Ent} \left( Z_{1} \, \| \, \phi \right) < \infty is necessary for an entropic CLT to hold; for instance, if the Z_{i} are discrete, then \mathrm{Ent} \left( S_{m} \right) = - \infty for all m.
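
For a concrete, non-rigorous illustration of the entropic CLT, one can track \mathrm{Ent}(S_m) numerically when Z_1 is uniform on [-\sqrt{3}, \sqrt{3}]: the density of the sum is computed by repeated convolution on a grid, and the estimated entropies increase toward \mathrm{Ent}(\phi) = \tfrac{1}{2} \log (2 \pi e) \approx 1.419. The grid-based estimator below is our own crude choice.

    import numpy as np

    h = 0.001                                                    # grid spacing
    f = np.full(int(2 * np.sqrt(3) / h), 1 / (2 * np.sqrt(3)))   # density of Z_1, uniform on [-sqrt(3), sqrt(3)]

    def entropy(density):
        """Differential entropy -int f log f of a density sampled on a grid."""
        mask = density > 0
        return -np.sum(density[mask] * np.log(density[mask])) * h

    f_sum = f.copy()
    for m in range(1, 6):
        if m > 1:
            f_sum = np.convolve(f_sum, f) * h       # density of Z_1 + ... + Z_m
        # Ent(S_m) = Ent(Z_1 + ... + Z_m) - (1/2) log m, since S_m = (sum) / sqrt(m).
        print(m, entropy(f_sum) - 0.5 * np.log(m))

    print("Gaussian:", 0.5 * np.log(2 * np.pi * np.e))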

The entropic CLT originates with Shannon in the 1940s and was first proven by Linnik (without the monotonicity part of the statement). The first proofs that gave explicit convergence rates were given independently and at roughly the same time by Artstein, Ball, Barthe, and Naor, and Johnson and Barron in the early 2000s, using two different techniques.

The fact that \mathrm{Ent} \left( S_{1} \right) \leq \mathrm{Ent} \left( S_{2} \right) follows from the entropy power inequality, which goes back to Shannon in 1948. This implies that \mathrm{Ent} \left( S_{m} \right) \leq \mathrm{Ent} \left( S_{2m} \right) for all m \geq 1, and so it was naturally conjectured that \mathrm{Ent} \left( S_{m} \right) increases monotonically. However, proving this turned out to be challenging. Even the inequality \mathrm{Ent} \left( S_{2} \right) \leq \mathrm{Ent} \left( S_{3} \right) was unknown for over fifty years, until Artstein, Ball, Barthe, and Naor proved in general that \mathrm{Ent} \left( S_{m} \right) \leq \mathrm{Ent} \left( S_{m+1} \right) for all m \geq 1.

In the following we sketch some of the main ideas that go into the proof of these results, in particular following the technique introduced by Ball, Barthe, and Naor.

 

From relative entropy to Fisher information

Our goal is to show that some random variable Z, which is a normalized sum of many i.i.d. random variables (so that its law is a convolution), is close to a Gaussian G. One way to approach this is to interpolate between the two. There are several ways of doing this; for our purposes interpolation along the Ornstein-Uhlenbeck semigroup is most useful. Define

    \[ P_{t} Z := e^{-t} Z + \sqrt{1 - e^{-2t}} G \]

for t \in [0,\infty), and let f_{t} denote the density of P_{t} Z. We have P_{0} Z = Z and P_{\infty} Z = G. This semigroup has several desirable properties. For instance, if the density of Z is isotropic, then so is f_{t}. Before we can state the next desirable property that we will use, we need to introduce a few more useful quantities.
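
Here is a minimal sampling sketch of the Ornstein-Uhlenbeck interpolation in one dimension (the uniform starting distribution is our arbitrary choice): the variance stays 1 for all t, while other statistics, such as \mathbb{E} |P_t Z|, move from those of Z to those of G.

    import numpy as np

    rng = np.random.default_rng(1)

    def ou_interpolate(z, t):
        """Sample P_t Z = e^{-t} Z + sqrt(1 - e^{-2t}) G for a standard Gaussian G."""
        g = rng.standard_normal(z.shape)
        return np.exp(-t) * z + np.sqrt(1 - np.exp(-2 * t)) * g

    z = rng.uniform(-np.sqrt(3), np.sqrt(3), size=10**6)   # isotropic: zero mean, unit variance
    for t in [0.0, 0.5, 2.0, 10.0]:
        zt = ou_interpolate(z, t)
        # Variance stays ~1; E|P_t Z| moves from sqrt(3)/2 ~ 0.866 to sqrt(2/pi) ~ 0.798.
        print(t, zt.var(), np.abs(zt).mean())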

For a density function f : \mathbb{R}^{n} \to \mathbb{R}_{+}, let

    \[ \mathcal{I} \left( f \right) := \int \frac{\nabla f (\nabla f)^{T}}{f} = \mathbb{E} \left[ \left( \nabla \log f \right) \left( \nabla \log f \right)^{T} \right] \]

be the Fisher information matrix. The Cramér-Rao bound states that

    \[ \mathrm{Cov} \left( f \right) \succeq \mathcal{I} \left( f \right)^{-1}. \]

More generally this holds for the covariance of any unbiased estimator of the mean. The Fisher information is defined as

    \[ I \left( f \right) := \mathrm{Tr} \left( \mathcal{I} \left( f \right) \right). \]

It is sometimes more convenient to work with the Fisher information distance, defined as J(f) := I(f) - I(\phi) = I(f) - n. Similarly to the discussion above, one can show that the standard Gaussian minimizes the Fisher information among isotropic densities, and hence the Fisher information distance is always nonnegative.
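
A quick numerical check of these definitions in one dimension (the grid-based estimator is our own choice): the standard Gaussian has I(\phi) = 1, hence J(\phi) = 0, while a unit-variance Laplace density has I = 2, hence J = 1.

    import numpy as np

    def fisher_information(density, h):
        """I(f) = int (f')^2 / f for a one-dimensional density sampled on a grid with spacing h."""
        df = np.gradient(density, h)
        mask = density > 1e-12
        return np.sum(df[mask] ** 2 / density[mask]) * h

    h = 0.001
    x = np.arange(-10, 10, h)
    phi = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)        # standard Gaussian: I = 1
    b = 1 / np.sqrt(2)                                  # Laplace with unit variance: I = 1/b^2 = 2
    laplace = np.exp(-np.abs(x) / b) / (2 * b)
    print(fisher_information(phi, h) - 1, fisher_information(laplace, h) - 1)   # J ~ 0 and J ~ 1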

Now we are ready to state the de Bruijn identity, which characterizes the change of entropy along the Ornstein-Uhlenbeck semigroup via the Fisher information distance:

    \[ \partial_{t} \mathrm{Ent} \left( f_{t} \right) = J \left( f_{t} \right). \]

This implies that the relative entropy between f and \phi, which is our quantity of interest, can be expressed as follows:

(3)   \begin{equation*} \mathrm{Ent} \left( f \, \| \, \phi \right) = \mathrm{Ent} \left( \phi \right) - \mathrm{Ent} \left( f \right) = \int_{0}^{\infty} J \left( f_{t} \right) dt. \end{equation*}

Thus our goal is to bound the Fisher information distance J(f_{t}).
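
To make the de Bruijn identity concrete, here is a crude numerical verification in one dimension (our own finite-difference check, for Z uniform on [-\sqrt{3}, \sqrt{3}]): the derivative of \mathrm{Ent}(f_t) in t, estimated by finite differences, matches J(f_t) computed from the density.

    import numpy as np

    h = 0.001
    x = np.arange(-12, 12, h)

    def ou_density(t):
        """Density of P_t Z on the grid, for Z uniform on [-sqrt(3), sqrt(3)]."""
        a, s2 = np.exp(-t), 1 - np.exp(-2 * t)
        fz = (np.abs(x / a) <= np.sqrt(3)) / (2 * np.sqrt(3) * a)   # density of e^{-t} Z
        g = np.exp(-x**2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)      # density of sqrt(1-e^{-2t}) G
        ft = np.convolve(fz, g, mode="same") * h
        return ft / (ft.sum() * h)                                  # correct small grid errors

    def entropy(f):
        m = f > 0
        return -np.sum(f[m] * np.log(f[m])) * h

    def fisher_distance(f):
        """J(f) = I(f) - 1 in one dimension."""
        df = np.gradient(f, h)
        m = f > 1e-12
        return np.sum(df[m] ** 2 / f[m]) * h - 1.0

    t, dt = 0.5, 1e-3
    lhs = (entropy(ou_density(t + dt)) - entropy(ou_density(t - dt))) / (2 * dt)
    print(lhs, fisher_distance(ou_density(t)))    # the two numbers should nearly agree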

 

Bounding the Fisher information distance

We first recall a classical result by Blachman and Stam that shows that Fisher information decreases under convolution.

Theorem [Blachman; Stam]

Let Y_{1}, \dots, Y_{d} be independent random variables taking values in \mathbb{R}, and let a \in \mathbb{R}^{d} be such that \left\| a \right\|_{2} = 1. Then

    \[ I \left( \sum_{i=1}^{d} a_{i} Y_{i} \right) \leq \sum_{i=1}^{d} a_{i}^{2} I \left( Y_{i} \right). \]

In the i.i.d. case, the right hand side becomes \left\| a \right\|_{2}^{2} I \left( Y_{1} \right) = I \left( Y_{1} \right), so Fisher information does not increase under this normalized convolution.
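
In the Gaussian case both sides of the Blachman-Stam inequality are available in closed form, since I \left( \mathcal{N}(0, s^{2}) \right) = 1/s^{2} and a weighted sum of independent Gaussians is again Gaussian. This gives a quick sanity check (the specific variances below are arbitrary):

    import numpy as np

    rng = np.random.default_rng(2)
    s2 = rng.uniform(0.5, 2.0, size=5)     # variances of Y_1, ..., Y_5
    a = rng.standard_normal(5)
    a /= np.linalg.norm(a)                 # unit vector
    lhs = 1.0 / np.sum(a**2 * s2)          # I(sum_i a_i Y_i): Gaussian with variance sum_i a_i^2 s_i^2
    rhs = np.sum(a**2 / s2)                # sum_i a_i^2 I(Y_i)
    print(lhs, rhs)                        # lhs <= rhs, as the theorem asserts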

Ball, Barthe, and Naor gave the following variational characterization of the Fisher information, which yields a particularly simple proof of the Blachman-Stam theorem. (See Bubeck and Ganguly for a short proof.)

Theorem [Variational characterization of Fisher information]

Let w : \mathbb{R}^{d} \to \left( 0, \infty \right) be a sufficiently smooth density on \mathbb{R}^{d}, let a \in \mathbb{R}^{d} be a unit vector, and let h be the marginal of w in direction a. Then we have

(4)   \begin{equation*} I \left( h \right) \leq \int_{\mathbb{R}^{d}} \left( \frac{\mathrm{div} \left( pw \right)}{w} \right)^{2} w \end{equation*}

for any continuously differentiable vector field p : \mathbb{R}^{d} \to \mathbb{R}^{d} with the property that for every x, \left\langle p \left( x \right), a \right\rangle = 1. Moreover, if w satisfies \int \left\| x \right\|^{2} w \left( x \right) < \infty, then there is equality for some suitable vector field p.

The Blachman-Stam theorem follows from this characterization by taking the constant vector field p \equiv a. Then we have \mathrm{div} \left( pw \right) = \left\langle \nabla w, a \right\rangle, and so the right hand side of (4) becomes a^{T} \mathcal{I} \left( w \right) a, where recall that \mathcal{I} is the Fisher information matrix. In the setting of the Blachman-Stam theorem the density w of \left( Y_{1}, \dots, Y_{d} \right) is a product density: w \left( x_{1}, \dots, x_{d} \right) = f_{1} \left( x_{1} \right) \times \dots \times f_{d} \left( x_{d} \right), where f_{i} is the density of Y_{i}. Consequently the Fisher information matrix is diagonal, \mathcal{I} \left( w \right) = \mathrm{diag} \left( I \left( f_{1} \right), \dots, I \left( f_{d} \right) \right), and thus a^{T} \mathcal{I} \left( w \right) a = \sum_{i=1}^{d} a_{i}^{2} I \left( f_{i} \right), which concludes the proof of the Blachman-Stam theorem via the variational characterization.

Given this variational characterization, one need not take the vector field to be constant; optimizing over the vector field yields more. Doing so leads to the following theorem, which gives a rate of decrease of the Fisher information distance under convolution.

Theorem [Artstein, Ball, Barthe, and Naor]

Let Y_{1}, \dots, Y_{d} be i.i.d. random variables with a density having a positive spectral gap c. (We say that a random variable Y has spectral gap c if for every sufficiently smooth g we have \mathrm{Var} \left( g \left( Y \right) \right) \leq \tfrac{1}{c} \mathbb{E} \left[ g' \left( Y \right)^{2} \right]. In particular, log-concave random variables have a positive spectral gap, see Bobkov (1999).) Then for any a \in \mathbb{R}^{d} with \left\| a \right\|_{2} = 1 we have that

    \[ J \left( \sum_{i=1}^{d} a_{i} Y_{i} \right) \leq \frac{2 \left\| a \right\|_{4}^{4}}{c + (2-c) \left\| a \right\|_{4}^{4}} J \left( Y_{1} \right). \]

When a = \frac{1}{\sqrt{d}} \mathbf{1}, we have \left\| a \right\|_{4}^{4} = 1/d, so \frac{2 \left\| a \right\|_{4}^{4}}{c + (2-c) \left\| a \right\|_{4}^{4}} = \frac{2/d}{c + (2-c)/d} = O \left( 1 / d \right) for fixed c > 0, and thus using (3) we obtain a rate of convergence of O \left( 1 / d \right) in the entropic CLT.

A similar result was proven independently and at roughly the same time by Johnson and Barron using a different approach involving score functions.

 

A high-dimensional entropic CLT

The techniques of Artstein, Ball, Barthe, and Naor generalize to higher dimensions, as was recently shown by Bubeck and Ganguly.

An analogue of the theorem above can be proven in this setting, from which a high-dimensional entropic CLT, together with a rate of convergence, follows by using (3) again.

Theorem [Bubeck and Ganguly]

Let Y \in \mathbb{R}^{d} be a random vector with i.i.d. entries from a distribution \nu with zero mean, unit variance, and spectral gap c \in (0,1]. Let A \in \mathbb{R}^{n \times d} be a matrix such that AA^{T} = I_{n}, the n \times n identity matrix. Let

    \[\varepsilon = \max_{i \in [d]} \left( A^T A \right)_{i,i}\]

and

    \[\zeta = \max_{i,j \in [d], i \neq j} \left| \left( A^T A \right)_{i,j} \right|.\]

Then we have that

    \[ \mathrm{Ent} \left( A Y \, \| \, \gamma_{n} \right) \leq n \min \left\{ 2 \left( \varepsilon + \zeta^{2} d \right) / c, 1 \right\} \mathrm{Ent} \left( \nu \, \| \, \gamma_{1} \right), \]

where \gamma_{n} denotes the standard Gaussian measure in \mathbb{R}^{n}.

To interpret this result, consider the case where the matrix A is built by picking rows one after the other uniformly at random on the Euclidean sphere in \mathbb{R}^{d}, conditionally on being orthogonal to the previous rows (to satisfy the isotropy condition AA^{T} = I_{n}). We then expect to have \varepsilon \simeq n/d and \zeta \simeq \sqrt{n} / d (we leave the details as an exercise for the reader), and so the theorem above tells us that \mathrm{Ent} \left( A Y \, \| \, \gamma_{n} \right) \lesssim n^{2}/d.
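
The heuristics \varepsilon \simeq n/d and \zeta \simeq \sqrt{n}/d can be checked empirically. One convenient way to generate such an A with orthonormal rows is via a QR decomposition of a Gaussian matrix (our choice of construction); the comparisons below are only indicative, and \zeta in particular carries logarithmic factors.

    import numpy as np

    rng = np.random.default_rng(3)
    n, d = 20, 4000
    Q, _ = np.linalg.qr(rng.standard_normal((d, n)))   # d x n with orthonormal columns
    A = Q.T                                            # n x d with A A^T = I_n
    M = A.T @ A                                        # d x d
    eps = np.max(np.diag(M))
    zeta = np.max(np.abs(M - np.diag(np.diag(M))))
    print(eps, n / d)                # eps is on the order of n/d
    print(zeta, np.sqrt(n) / d)      # zeta is on the order of sqrt(n)/d, up to log factors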

 

Back to Wishart and GOE

We now turn our attention back to bounding the relative entropy \mathrm{Ent} \left( \mathcal{W}_{n,d} \, \| \, \mathcal{G}_{n} \right); recall (2). Since the Wishart matrix contains the (scaled) inner products of n vectors in \mathbb{R}^{d}, it is natural to relate \mathcal{W}_{n+1,d} and \mathcal{W}_{n,d}, since the former comes from the latter by adding an additional d-dimensional vector to the n vectors already present. Specifically, we have the following:

    \[ \mathcal{W}_{n+1,d} = \begin{pmatrix} \mathcal{W}_{n,d} & \frac{1}{\sqrt{d}} \mathbb{X} X \\ \frac{1}{\sqrt{d}} \left( \mathbb{X} X \right)^{T} & 0 \end{pmatrix}, \]

where \mathbb{X} is the n \times d matrix whose rows are the n d-dimensional vectors defining \mathcal{W}_{n,d}, and X is a d-dimensional random vector with i.i.d. entries from \mu, independent of \mathbb{X}. Similarly we can write the matrix \mathcal{G}_{n+1} using \mathcal{G}_{n}:

    \[ \mathcal{G}_{n+1} = \begin{pmatrix} \mathcal{G}_{n} & \gamma_{n} \\ \gamma_{n}^{T} & 0 \end{pmatrix}. \]

where, abusing notation slightly, \gamma_{n} here denotes an n-dimensional standard Gaussian vector independent of \mathcal{G}_{n}. This naturally suggests using the chain rule for relative entropy and bounding \mathrm{Ent} \left( \mathcal{W}_{n,d} \, \| \, \mathcal{G}_{n} \right) by induction on n. We get that

    \[ \mathrm{Ent} \left( \mathcal{W}_{n+1,d} \, \| \, \mathcal{G}_{n+1} \right) = \mathrm{Ent} \left( \mathcal{W}_{n,d} \, \| \, \mathcal{G}_{n} \right) + \mathbb{E}_{\mathcal{W}_{n,d}} \left[ \mathrm{Ent} \left( \tfrac{1}{\sqrt{d}} \mathbb{X} X \, | \, \mathcal{W}_{n,d} \, \| \, \gamma_{n} \right) \right]. \]

By convexity of the relative entropy we also have that

    \[ \mathbb{E}_{\mathcal{W}_{n,d}} \left[ \mathrm{Ent} \left( \tfrac{1}{\sqrt{d}} \mathbb{X} X \, | \, \mathcal{W}_{n,d} \, \| \, \gamma_{n} \right) \right] \leq \mathbb{E}_{\mathbb{X}} \left[ \mathrm{Ent} \left( \tfrac{1}{\sqrt{d}} \mathbb{X} X \, | \, \mathbb{X} \, \| \, \gamma_{n} \right) \right]. \]

Thus our goal is to understand and bound \mathrm{Ent} \left( A X \, \| \, \gamma_{n} \right) for A \in \mathbb{R}^{n \times d}, and then apply the bound to A = \tfrac{1}{\sqrt{d}} \mathbb{X} (followed by taking expectation over \mathbb{X}). This is precisely what the high-dimensional entropic CLT above does for A satisfying AA^T = I_{n}. Since A = \tfrac{1}{\sqrt{d}} \mathbb{X} does not necessarily satisfy AA^T = I_{n}, we have to correct for the lack of isotropy. This is the content of the following lemma, the proof of which we leave as an exercise for the reader.

Lemma

Let A \in \mathbb{R}^{n \times d} and Q \in \mathbb{R}^{n \times n} be such that QA \left( QA \right)^{T} = I_{n}. Then for any isotropic random variable X taking values in \mathbb{R}^{d} we have that

(5)   \begin{equation*} \mathrm{Ent} \left( A X \, \| \, \gamma_{n} \right) = \mathrm{Ent} \left( QA X \, \| \, \gamma_{n} \right) + \frac{1}{2} \mathrm{Tr} \left( A A^{T} \right) - \frac{n}{2} + \log \left| \det \left( Q \right) \right|. \end{equation*}

We then apply this lemma with A = \tfrac{1}{\sqrt{d}} \mathbb{X} and Q = \left( \tfrac{1}{d} \mathbb{X} \mathbb{X}^{T} \right)^{-1/2}. Observe that

    \[\mathbb{E} \mathrm{Tr} \left( A A^{T} \right) = \tfrac{1}{d} \mathbb{E} \mathrm{Tr} \left( \mathbb{X} \mathbb{X}^{T} \right) = \tfrac{1}{d} \times n \times d = n,\]

and hence in expectation the middle two terms of the right hand side of (5) cancel each other out.

The last term in (5),

    \[- \tfrac{1}{2} \log \det \left( \tfrac{1}{d} \mathbb{X} \mathbb{X}^{T} \right),\]

should be understood as the relative entropy between a centered Gaussian with covariance given by \tfrac{1}{d} \mathbb{X} \mathbb{X}^{T} and a standard Gaussian in \mathbb{R}^{n}. Controlling the expectation of this term requires studying the probability that \mathbb{X} \mathbb{X}^{T} is close to being non-invertible, which in turn requires bounds on the left tail of the smallest singular value of \mathbb{X}. Understanding the extreme singular values of random matrices is a fascinating topic, but it is outside of the scope of these notes, and so we refer the reader to Bubeck and Ganguly for more details on this point.
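
To see this interpretation, recall the standard formula for the relative entropy between a centered Gaussian \mathcal{N} \left( 0, \Sigma \right) with covariance \Sigma and the standard Gaussian measure \gamma_{n}:

    \[ \mathrm{Ent} \left( \mathcal{N} \left( 0, \Sigma \right) \, \| \, \gamma_{n} \right) = \frac{1}{2} \left( \mathrm{Tr} \left( \Sigma \right) - n - \log \det \left( \Sigma \right) \right). \]

With \Sigma = \tfrac{1}{d} \mathbb{X} \mathbb{X}^{T} the trace term has expectation n, so in expectation only - \tfrac{1}{2} \log \det \left( \tfrac{1}{d} \mathbb{X} \mathbb{X}^{T} \right) remains, matching the last term of (5).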

Finally, the high-dimensional entropic CLT can now be applied to see that

    \[\mathrm{Ent} \left( QA X \, \| \, \gamma_{n} \right) \lesssim n^{2} / d.\]

From the induction on n we get another factor of n, arriving at

    \[\mathrm{Ent} \left( \mathcal{W}_{n,d} \, \| \, \mathcal{G}_{n} \right) \lesssim n^{3} / d.\]

We conclude that the dimension threshold is d \approx n^{3} (up to logarithmic factors), and the information-theoretic proof outlined here sheds light on where this threshold comes from.
