Guest post by Philippe Rigollet: Estimating structured graphs

Last Wednesday, Sourav Chatterjee was speaking in the Mathematics Colloquium here at Princeton. While he is most famous for his contributions to probability, especially using Stein’s method, this talk was rather statistical. The problem was interesting and its solution simple enough that it can fit in this blog post. Sourav’s full paper can be found on arXiv.

I present a slightly simpler proof of his main result. It yields a bound in terms of a more natural quantity, which may be much smaller than the nuclear-norm quantity that appears in the original paper, Chatterjee (2012).

The problem

Let ${P=\{p_{i,j}\}}$ be an ${n \times n}$ symmetric matrix with elements ${p_{i,j} \in [0,1]}$ . Let ${G}$ be the undirected random graph on ${n}$ vertices obtained by placing each edge ${\{i,j\}}$ independently with probability ${p_{i,j}}$ .
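The sampling model can be sketched in a few lines of NumPy. This is only an illustration: the function name `sample_graph` is hypothetical, and the sketch assumes that self-loops ${\{i,i\}}$ are drawn as well (with probability ${p_{i,i}}$ ), which matches the diagonal terms ${\xi_{i,i}}$ used later in the post.

```python
import numpy as np

def sample_graph(P, rng):
    """Draw a symmetric 0/1 adjacency matrix with A[i, j] ~ Bernoulli(P[i, j]).

    Assumption: diagonal entries (self-loops) are sampled too.
    """
    n = P.shape[0]
    # One independent coin flip per entry; only the upper triangle is used.
    U = (rng.random((n, n)) < P).astype(float)
    # Keep the upper triangle (with diagonal) and mirror the strict upper part.
    return np.triu(U) + np.triu(U, k=1).T

rng = np.random.default_rng(0)
P = np.full((5, 5), 0.5)   # toy example: every edge has probability 1/2
A = sample_graph(P, rng)
print(A)
```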

The goal is to produce an estimator of ${P}$ , i.e., an ${n \times n}$ matrix ${\hat P}$ , that is measurable with respect to ${G}$ and such that the Frobenius norm ${\|\hat P - P\|_F}$ is small with high probability.

The title of this post suggests that we need some structure on ${P}$ . Sourav’s article points to a variety of examples where structure appears in the spectrum of ${P}$ . As a result, the method and analysis described below are spectral.

Thresholded eigen-decomposition

Let ${A}$ denote the adjacency matrix of ${G}$ and observe that the matrix ${A-P}$ is symmetric, has zero mean, and its elements ${\xi_{i,j}}$ , ${i \le j}$ , are independent and lie in the interval ${[-1,1]}$ almost surely. This is not quite a Wigner matrix (the variances are not identical) but, up to a multiplicative constant, its operator norm ${\|A-P\|}$ is of order ${\sqrt{n}}$ , just like that of a Wigner matrix. Formally, the following lemma holds; its proof is postponed to the end of the post.

Lemma With probability at least ${1-\delta}$ , it holds that

$\displaystyle \|A-P\|^2 \le 2\log(6)n +\log(4/\delta)=:C_\delta^2(n)\,.$
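As a sanity check rather than a proof, one can compare the threshold ${C_\delta(n)}$ with the operator norm of a simulated noise matrix. The sketch below assumes a constant matrix ${P}$ with all entries equal to ${1/2}$ and samples self-loops as well; the helper names `c_delta` and `noise_operator_norm` are hypothetical.

```python
import numpy as np

def c_delta(n, delta):
    """C_delta(n) from the lemma: sqrt(2 * log(6) * n + log(4 / delta))."""
    return np.sqrt(2.0 * np.log(6.0) * n + np.log(4.0 / delta))

def noise_operator_norm(P, rng):
    """Sample A from P (self-loops included, an assumption) and return ||A - P||."""
    n = P.shape[0]
    U = (rng.random((n, n)) < P).astype(float)
    A = np.triu(U) + np.triu(U, k=1).T
    # ord=2 gives the largest singular value, i.e. the operator norm.
    return np.linalg.norm(A - P, 2)

rng = np.random.default_rng(0)
n = 200
P = np.full((n, n), 0.5)
print("||A - P||  :", noise_operator_norm(P, rng))
print("C_delta(n) :", c_delta(n, delta=0.1))
```

In this toy setting the simulated norm should land well below ${C_\delta(n)}$ , which is consistent with the lemma being a non-asymptotic bound with unoptimized constants.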

Since we can decompose our observed matrix as ${A=P+(A-P)}$ , the eigenvalues of ${P}$ that are smaller than ${\|A-P\|}$ should disappear in the noise. Moreover, those that are larger than ${\|A-P\|}$ should be visible in the spectrum of ${A}$ .

Consider the eigen-decompositions of ${P}$ and ${A}$ :

$\displaystyle P=\sum_{i=1}^n \mu_i u_iu_i^\top\,,\quad A=\sum_{i=1}^n \lambda_i v_iv_i^\top\,,$

where the eigenvalues are ordered by decreasing magnitude, and define the following estimator:

$\displaystyle \hat P=\sum_{i=1}^n \lambda_i \mathbf{1}( |\lambda_i|>2C_\delta(n)) v_i v_i^\top\,.$

Note that Sourav Chatterjee uses ${1.001 \sqrt{n}}$ instead of ${2C_\delta(n)}$ but his proof requires much more complicated arguments for essentially the same result (see the original paper). Also, his result holds in expectation and we allow here a tunable level of confidence.
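The estimator is short enough to sketch in NumPy. The code below is a minimal illustration under assumptions that are not in the post: the function names are hypothetical, self-loops are sampled, and the toy matrix ${P}$ has all entries equal to ${1/2}$ .

```python
import numpy as np

def c_delta(n, delta):
    """C_delta(n) from the lemma: sqrt(2 * log(6) * n + log(4 / delta))."""
    return np.sqrt(2.0 * np.log(6.0) * n + np.log(4.0 / delta))

def threshold_estimator(A, delta=0.1):
    """Keep only the eigen-components of A with |lambda_i| > 2 * C_delta(n)."""
    n = A.shape[0]
    lam, V = np.linalg.eigh(A)               # A = V @ diag(lam) @ V.T
    keep = np.abs(lam) > 2.0 * c_delta(n, delta)
    # Sum of lambda_i * v_i v_i^T over the retained indices.
    return (V[:, keep] * lam[keep]) @ V[:, keep].T

# Toy example: rank-one P with a single large eigenvalue n / 2.
rng = np.random.default_rng(0)
n = 200
P = np.full((n, n), 0.5)
U = (rng.random((n, n)) < P).astype(float)
A = np.triu(U) + np.triu(U, k=1).T           # symmetric adjacency matrix
P_hat = threshold_estimator(A, delta=0.1)
print("Frobenius error:", np.linalg.norm(P_hat - P, "fro"))
```

Using `eigh` rather than a general eigen-solver relies on ${A}$ being symmetric, and the ordering of the eigenvalues it returns is irrelevant here because the threshold is applied to each ${|\lambda_i|}$ separately.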

Theorem With probability at least ${1-\delta}$ , the estimator ${\hat P}$ satisfies

$\displaystyle \|\hat P -P\|_F \le 12\sqrt{ \sum_{i=1}^n \{\mu_i^2 \wedge C_\delta^2(n)\}}$

Proof: Assume that we are on the event ${\{\|A-P\|\le C_\delta(n)\}}$ . We begin with a simple consequence of the lemma. Let ${S =\{ i\,:\, |\lambda_i|>2C_\delta(n)\}}$ and define the matrix

$\displaystyle P_{S}=\sum_{i \in {S}}\mu_iu_iu_i^\top\,.$

The lemma together with Weyl’s inequality implies that

$\displaystyle |\lambda_i -\mu_i|\le \|A-P\| \le C_\delta(n)\,, \forall\, i \ \ \ \ \ (1)$

It yields ${|\lambda_i| \le |\mu_i| + C_\delta(n)}$ , so that ${|\mu_i|>C_\delta(n)}$ for all ${i \in S}$ (by definition of ${S}$ ). Therefore, ${S \subset \{ i\,:\, |\mu_i|>C_\delta(n)\}}$ . Similarly, (1) yields ${|\mu_i| \le |\lambda_i| + C_\delta(n) \le 3C_\delta(n)}$ for all ${i \notin S}$ , so that the complement of ${S}$ satisfies ${S^c\subset \{ i\,:\, |\mu_i|\le 3C_\delta(n)\}}$ .

We now return to the main part of the proof. By the triangle inequality, we have

$\displaystyle \|\hat P-P\|_F\le \|\hat P- P_{S}\|_F+\|P_{S}-P\|_F\,,$

and we treat each term separately. For the first term, we have

$\displaystyle \|\hat P-P_{S}\|_F \le \sqrt{\mathrm{rank}(\hat P-P_{S})}\|\hat P-P_{S}\|$

On the one hand, ${\mathrm{rank}(\hat P-P_{S}) \le 2|S|}$ . On the other hand,

$\displaystyle \begin{array}{rcl} \|\hat P-P_{S}\|&\le &\|\hat P-A\|+\|A- P\|+\|P-P_S\| \\ &\le &2C_\delta(n)+C_\delta(n)+\|P-P_S\| \\ &= &3C_\delta(n)+\max_{i \in S^c} |\mu_i|\le 6C_\delta(n)\,. \end{array}$

Here we used the definition of ${\hat P}$ , the lemma and the inclusion for ${S^c}$ obtained above. Together, the above three displays yield

$\displaystyle \|\hat P-P\|_F\le 6\sqrt{2\sum_{i \in S}C_\delta^2(n)} + \sqrt{\sum_{i\in S^c}\mu_i^2} \le (6\sqrt{2}+3)\sqrt{\sum_{i=1}^n \{\mu_i^2 \wedge C_\delta^2(n)\}}\,.$

$\Box$
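For completeness, here is one way to unpack the last inequality of the proof, with ${C_\delta^2(n)}$ as in the lemma. For ${i \in S}$ we have ${|\mu_i|>C_\delta(n)}$ , so ${\mu_i^2 \wedge C_\delta^2(n)=C_\delta^2(n)}$ . For ${i \in S^c}$ we have ${|\mu_i|\le 3C_\delta(n)}$ , so ${\mu_i^2 \le 9\{\mu_i^2 \wedge C_\delta^2(n)\}}$ . Writing ${a=\sum_{i \in S}\{\mu_i^2 \wedge C_\delta^2(n)\}}$ and ${b=\sum_{i \in S^c}\{\mu_i^2 \wedge C_\delta^2(n)\}}$ , this gives

$\displaystyle 6\sqrt{2\sum_{i \in S}C_\delta^2(n)} + \sqrt{\sum_{i\in S^c}\mu_i^2} \le 6\sqrt{2}\sqrt{a}+3\sqrt{b} \le (6\sqrt{2}+3)\sqrt{a+b}\,,$

and the constant in the theorem follows from ${6\sqrt{2}+3 \approx 11.49 \le 12}$ .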

Discussion

This is not exactly the bound from Chatterjee (2012). Indeed, the main Theorem 2.1 in Chatterjee (2012) yields

$\displaystyle \mathbb{E}\|\hat P -P\|_F^2\le C \|P\|_*\sqrt{n}+ Ce^{-Cn}\,,$

where ${\|P\|_*=\sum_{i=1}^n|\mu_i|}$ is the nuclear norm of ${P}$ .

Note first that, taking ${\delta=1/2}$ , our theorem yields a bound on the median:

$\displaystyle \mathbb{M}\|\hat P -P\|_F^2 \le C \sum_{i=1}^n \{\mu_i^2 \wedge n\}\,.$

How do the two bounds compare? Up to constants, our bound is always at least as good as Chatterjee’s bound:

$\displaystyle \sum_{i=1}^n \{\mu_i^2 \wedge n\}\le \sum_{i:|\mu_i|\le \sqrt{n}}\mu_i^2 + \sum_{i:|\mu_i|> \sqrt{n}}n\le \|P\|_*\sqrt{n}$
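To spell out the last inequality: if ${|\mu_i|\le \sqrt{n}}$ , then ${\mu_i^2 \le |\mu_i|\sqrt{n}}$ , and if ${|\mu_i|> \sqrt{n}}$ , then ${n = \sqrt{n}\sqrt{n} < |\mu_i|\sqrt{n}}$ . Summing over the two groups of indices gives

$\displaystyle \sum_{i:|\mu_i|\le \sqrt{n}}\mu_i^2 + \sum_{i:|\mu_i|> \sqrt{n}}n \le \sum_{i=1}^n |\mu_i|\sqrt{n} = \|P\|_*\sqrt{n}\,.$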

Moreover, there may be a large gap between the two quantities. For example, if ${P=I_n}$ is the identity matrix, then

$\displaystyle \|P\|_*\sqrt{n} =n^{3/2}\,, \qquad \text{and} \qquad \sum_{i=1}^n \{\mu_i^2 \wedge n\}=\sum_{i=1}^n\mu_i^2 =n\,.$

This can also be the case with low rank matrices. Take the all-ones matrix ${P=\mathbf{1}_n\mathbf{1}_n^\top}$ , where ${\mathbf{1}_n \in \mathbb{R}^n}$ is a vector of ones. In this case ${n}$ is the only nonzero eigenvalue and

$\displaystyle \|P\|_*\sqrt{n} =n^{3/2}\,, \qquad \text{and} \qquad \sum_{i=1}^n \{\mu_i^2 \wedge n\}=n\,.$

Moreover, the ratio between these quantities can be made arbitrarily close to 0 by taking ${\|P\|/\sqrt{n}}$ arbitrarily close to 0. Indeed,

$\displaystyle \sum_{i=1}^n \{\mu_i^2 \wedge n\}\le \big( \max_i|\mu_i| \big) \sum_{i=1}^n |\mu_i| = \frac{\|P\|}{\sqrt{n}} \|P\|_*\sqrt{n}\,.$
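The two examples of this section (the identity and the all-ones matrix) can be checked numerically. The sketch below computes both quantities, ignoring constants, from the eigenvalues of ${P}$ ; the function name `compare_bounds` is hypothetical.

```python
import numpy as np

def compare_bounds(P):
    """Return (||P||_* sqrt(n), sum_i min(mu_i^2, n)) for a symmetric matrix P."""
    n = P.shape[0]
    mu = np.linalg.eigvalsh(P)                    # eigenvalues of symmetric P
    nuclear_term = np.sum(np.abs(mu)) * np.sqrt(n)
    truncated_term = np.sum(np.minimum(mu ** 2, n))
    return nuclear_term, truncated_term

n = 100
print("identity :", compare_bounds(np.eye(n)))
print("all-ones :", compare_bounds(np.ones((n, n))))
```

In both cases the first quantity is ${n^{3/2}}$ and the second is ${n}$ , up to floating-point error.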

Proof of the lemma

Let ${\mathcal{N}}$ be a ${\frac12}$ -net of the unit Euclidean ball ${\mathcal{B}}$ of ${\mathbb{R}^n}$ . Using a volume packing argument (see Vershynin (2011)), it is not hard to show that we can always take ${\mathcal{N}}$ of cardinality ${|\mathcal{N}| \le 6^n}$ . Moreover, for any vector ${t \in \mathbb{R}^n}$ ,

$\displaystyle \sup_{x \in \mathcal{B}}x^\top t\le 2\sup_{x \in \mathcal{N}}x^\top t\,.$
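One standard way to see this, sketched here for completeness: for any ${x \in \mathcal{B}}$ , pick ${x_0 \in \mathcal{N}}$ such that ${\|x-x_0\|_2\le \frac12}$ , so that ${2(x-x_0) \in \mathcal{B}}$ . Then

$\displaystyle x^\top t = x_0^\top t + \frac12 \big(2(x-x_0)\big)^\top t \le \sup_{x \in \mathcal{N}}x^\top t + \frac12 \sup_{x \in \mathcal{B}}x^\top t\,.$

Taking the supremum over ${x \in \mathcal{B}}$ on the left-hand side and rearranging gives the factor 2.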

Applying this bound once in ${x}$ and once in ${y}$ implies

$\displaystyle \|A-P\|=\sup_{(x,y) \in \mathcal{B}^2} x^\top(A-P)y\le 4\sup_{(x,y) \in \mathcal{N}^2} x^\top(A-P)y\,.$

Using a union bound, we find for any ${\varepsilon>0}$ ,

$\displaystyle \mathbb{P}(\|A-P\|> \varepsilon)\le 4\sum_{(x,y) \in \mathcal{N}^2}\mathbb{P}(x^\top(A-P)y>\varepsilon)$

We conclude with a Chernoff bound: for any ${s>0}$ ,

$\displaystyle \begin{array}{rcl} \mathbb{P}(x^\top(A-P)y>\varepsilon)&\le &\mathbb{E}(e^{sx^\top(A-P)y}) e^{-s\varepsilon}\\ &= &\prod_{i}\mathbb{E}(e^{sx_iy_i\xi_{i,i}}) \prod_{i<j}\mathbb{E}(e^{2sx_iy_j\xi_{i,j}}) e^{-s\varepsilon}\,. \end{array}$