Some stuff I learned this week

This week I had the pleasure to spend 3 days at a wonderful workshop (disclaimer: I was part of the organization, with Ofer Dekel, Yuval Peres and James Lee, so I might be a bit biased). Below you will find three results that I particularly enjoyed (one from each day). Obviously many more topics were covered at the workshop, from finding correlations in subquadratic time, to the basic problem of learning halfspaces (see for instances these two COLT’15 papers, by Daniely and by Awasthi, Balcan, Haghtalab and Urner), or how stability can explain why deep nets trained with SGD are not overfitting (upcoming work by Hardt-Recht-Singer; also on a related note I learned from Yuval about the Kolmogrov-Arnold representation theorem, which seems very much related to the famous Barron’1990 result on the universal approximation properties of neural networks with 1 hidden layer). Finally two more things: (i) we also have a blog where we collected all the open problems discussed during the workshop, and (ii) many of the talks have been recorded (including the ones I mention below, as well as the nice tutorial by Sergiu Hart on dynamics in games). I will post the link to the videos as soon as we upload them.


Negative association

Robin Pemantle gave an excellent tutorial on negative association, see the slides here. The punchline can be summarized pretty succinctly. Consider a measure \mu on \{0,1\}^n. We say that \mu is positively correlated if for any f,g monotone increasing one has \int f g d\mu \geq \int f d\mu \int g d\mu. On the other hand we say that \mu is negatively correlated if for any f,g monotone increasing and such that f and g depends on disjoint subset of variables, one has \int f g d\mu \leq \int f d\mu \int g d\mu. Positive correlation is particularly easy to check thanks to the famous FKG theorem which states that if \mu is log-supermodular (i.e. - \log(\mu) is a submodular function) then \mu is positively correlated (this is nice because submodularity is a local condition). One may hope that log-submodular distributions are negatively correlated, but unfortunately this is not true in general. The route to an easily checkable condition for negative correlation is a bit more tortuous and it goes as follows. First observe that negative association clearly implies pairwise negative correlation, that is for any i,j and X = (X_1, \hdots, X_n) \sim \mu,

    \[\mathbb{E} X_i X_j \leq \mathbb{E} X_i \mathbb{E} X_j .\]

Writing this inequality in terms of the generating function F(x) = \mathbb{E} \prod_{k=1}^n x_i^{X_i} one has (with \mathds{1}=(1,\hdots,1)):

    \[F(\mathds{1}) \frac{\partial^2 F}{\partial x_i \partial x_j}(\mathds{1}) \leq \frac{\partial F}{\partial x_i}(\mathds{1}) \frac{\partial F}{\partial x_j}(\mathds{1}) .\]

One says that \mu is strongly Rayleigh if the above equation holds true for any x \in \mathbb{R}^n instead of just at x= \mathds{1}. It turns out that this condition is more restrictive than negative correlation, but it is also much more tractable, in particular because it is equivalent to the fact that F is a stable polynomial (that is F does not have any roots in \mathbb{H}^n where \mathbb{H} is the complex upper half plane). Once you know that latter equivalence it becomes trivial to check that determinantal measures are negatively correlated since they are strongly Rayleigh (just use the fact that the characteristic polynomial of a symmetric matrix only has roots on the real line). A particularly useful property of strongly Rayleigh measures is that they have the usual Gaussian tails for any Lipschitz functional, just like product measures (this was conjectured by Elchanan Mossel, and recently proved by Robin Pemantle and Yuval Peres). Recently this set of measures has also been useful in theoretical computer science, see the recent work of Shayan Oveis Gharan and his co-authors (where they use negative correlation properties of random spanning trees). Finally I briefly mention that the sampling problem for negatively associated measures is a quite rich question, see the slides by Robin and also this COLT’15 paper by Rebeschini and Karbasi.


Learning in networks

Elchanan Mossel talked about a beautiful set of open questions. The basic problem goes as follows: let G be a graph, and assume that each vertex is a player that receives an independent random variable distributed according to some measure. The measure used to produce the random variables is the same for all vertices, and it is either \nu_0 or \nu_1 (say the players also know that the choice I \in \{0,1\} of the measure was a coin flip). Question: assuming that at the end of each day the players reveal their conditional expectation for I (given all the information they currently have) to their neighbors in G, how many days do we need to reach an agreement? If convergence happens, is the conditional expectation that is reached the optimal one (that is the conditional expectation given all the random variables)? What if instead of revealing your belief you only reveal what is the most likely to you between I=0 and I=1 (revealed action game)? Mossel-Sly-Tamuz show that both in the revealed belief and the revealed action game there will eventually be an agreement. Furthermore in the revealed beliefs this will be close to the optimal posterior provided that G is large enough. Both of these theorems currently lack a convergence rate.


The complexity of stochastic block models

The stochastic block model is a canonical model of a random graph with communities. The general version of this model is denoted \mathrm{SBM}(n,p,W) where n \in \mathbb{N} is the number of vertices, p \in [0,1]^k represents the proportion of each community, and W \in [0,1]^{k \times k} represents the matrix of connection probabilities. More precisely the random graph \mathrm{SBM}(n,p,W) is defined as follows: First each vertex i \in [n] is assigned a latent label \sigma_i \in [k], where the latent label vector \sigma \in [k]^n is drawn uniformly at random conditionally on the fact that the number of vertices with label \sigma \in [k] is exactly p_{\sigma} n. The edges are then drawn independently (conditionally on \sigma), and each edge \{i,j\} is present with probability W_{\sigma_i, \sigma_j}. This model has received a LOT of attention in the last thirty years. The basic objective is to recover the latent label vector \sigma from the observation of the unlabelled graph \mathrm{SBM}(n,p,W). The following very recent result by Emmanuel Abbe and Colin Sandon answers the most basic question one can ask: when is this objective achievable? First we introduce the Abbe-Sandon divergence for vectors x, y \in \mathbb{R}^k:

    \[D(x,y) = \max_{t \in [0,1]} \sum_{i=1}^k (t x_i + (1-t) y_i - x_i^t y_i^{1-t}) .\]

Theorem (Abbe-Sandon 2015): Asymptotically almost sure exact recovery of the latent feature vector in \mathrm{SBM}(n,p, Q \frac{\log(n)}{n}) is possible if and only if for any i, j \in [k], i \neq j,

    \[D \big( (\mathrm{diag}(p) Q)_i, (\mathrm{diag}(p) Q)_j \big) \geq 1 .\]

In their paper Abbe and Sandon also give an efficient algorithm which achieves the information theoretic limit given above. It remains an outstanding open problem to obtain such a precise picture for the problem of weak recovery where one asks to recover correctly only a fraction 1-\epsilon of the latent label vector. This question is particularly interesting in the sparse regime where the average degree is constant (note that in the above theorem the average degree diverges at the rate \log(n)). Recently a lot of progress has been made in this mathematically challenging regime too, see in particular the papers by Mossel-Neeman-Sly, Massoulie and Bordenave-Lelarge-Massoulie.

This entry was posted in Conference/workshop, Probability theory. Bookmark the permalink.

Leave a reply