Guest post by Miklos Racz: A primer on exact recovery in the general stochastic block model

Foreword: this will be a series of blog posts written by our Theory Group postdoc Miklos Racz. It is essentially the transcript of a mini-course that Miki gave at the University of Washington and at the XX Brazilian School of Probability, and that I gave at the University of Nice. These notes are also posted on arxiv. I now leave the floor to Miki:


Community detection is a fundamental problem in many sciences, such as sociology (e.g., finding tight-knit groups in social networks), biology (e.g., detecting protein complexes), and beyond. Given its importance, there have been a plethora of algorithms developed in the past few decades to detect communities. In the past few years there have also been many works studying the fundamental limits of these recovery algorithms.

This post describes the recent results of Abbe and Sandon which characterize the fundamental limits of exact recovery in the most canonical and popular probabilistic generative model, the stochastic block model (SBM).

The stochastic block model and exact recovery

The general stochastic block model is a distribution on graphs with latent community structure, and it has three parameters:

n, the number of vertices;

a probability distribution p = (p_1, \dots, p_k) that describes the relative sizes of the communities;

and \overline{Q} \in \left[0,1\right]^{k \times k}, a symmetric k \times k matrix that describes the probabilities with which two given vertices are connected, depending on which communities they belong to.

The number of communities, k, is implicit in this notation; in these notes we assume that k is a fixed constant.

A random graph from \mathrm{SBM}( n, p, \overline{Q}) is defined as follows:

  • The vertex set of the graph is V = \left\{ 1, \dots, n \right\} \equiv \left[ n \right].
  • Every vertex v \in V is independently assigned a (hidden) label \sigma_v \in \left[ k \right] from the probability distribution p on \left[k \right]. That is, \mathbb{P} \left( \sigma_v = i \right) = p_i for every i \in \left[k \right].
  • Given the labels of the vertices, each (unordered) pair of vertices \left( u, v \right) \in V \times V is connected independently with probability \overline{Q}_{\sigma_{u}, \sigma_{v}}.

Example [Symmetric communities]

A simple example to keep in mind is that of symmetric communities, with more edges within communities than between communities.

This is modeled by the SBM with p_i = 1/k for all i \in \left[ k \right] and \overline{Q}_{i,j} = a if i = j and \overline{Q}_{i,j} = b otherwise, with a > b > 0.

We write G \sim \mathrm{SBM}(n,p,\overline{Q}) for a graph generated according to the SBM without the hidden vertex labels revealed.

The goal of a statistical inference algorithm is to recover as many labels as possible using only the underlying graph as an observation.

There are various notions of success that are worth studying. In this post we focus on exact recovery: we aim to recover the labels of all vertices exactly with high probability (whp).

More precisely, in evaluating the success of an algorithm, the agreement of a partition with the true partition is maximized over all relabellings of the communities, since we are not interested in the specific original labelling per se, but rather the partition (community structure) it induces.

For exact recovery, all vertices in all but one community should be non-isolated (in the symmetric case this means that the graph should be connected), requiring the edge probabilities to be \Omega \left( \ln(n) / n \right).

It is thus natural to scale the edge probability matrix accordingly, i.e., to consider \mathrm{SBM} \left( n, p, \ln \left( n \right) Q / n \right), where Q \in \mathbb{R}_{+}^{k \times k}.

We also assume that the communities have linear size, i.e., that p is independent of n, and p_i \in (0,1) for all i.

From exact recovery to testing multivariate Poisson distributions

As a thought experiment, imagine that not only is the graph G given, but also all vertex labels are revealed, except for that of a given vertex v \in V. Is it possible to determine the label of v?

Understanding this question is key for understanding exact recovery, since if the error probability of this is too high, then exact recovery will not be possible. On the other hand, it turns out that in this regime it is possible to recover all but o(n) labels using an initial partial recovery algorithm. The setup of the thought experiment then becomes relevant, and if we can determine the label of v given the labels of all the other nodes with low error probability, then we can correct all errors made in the initial partial recovery algorithm, leading to exact recovery. We will come back to the connection between the thought experiment and exact recovery; for now we focus on understanding this thought experiment.

Given the labels of all vertices except v, the information we have about v is the number of nodes in each community it is connected to. In other words, we know the degree profile d(v) of v, where, for a given labelling of the graph’s vertices, the i-th component d_{i} (v) is the number of edges between v and the vertices in community i.

The distribution of the degree profile d(v) depends on the community that v belongs to. Recall that the community sizes are given by a multinomial distribution with parameters n and p, and hence the relative size of community i \in [k] concentrates on p_i. Thus if \sigma_v = j, the degree profile d(v) = (d_1(v), \dots, d_k(v)) can be approximated by independent binomials, with d_i (v) approximately distributed as

    \[\mathrm{Bin} \left( np_i, \ln(n) Q_{i,j} / n \right) ,\]

where \mathrm{Bin}(m,q) denotes the binomial distribution with m trials and success probability q. In this regime, the binomial distribution is well-approximated by a Poisson distribution of the same mean. Thus the degree profile of a vertex in community j is approximately Poisson distributed with mean

    \[\ln \left( n \right) \sum_{i \in [k]} p_i Q_{i,j} e_i ,\]

where e_i is the i-th unit vector. Defining P = \mathrm{diag}(p), this can be abbreviated as \ln \left( n \right) \left( PQ \right)_{j}, where \left( PQ \right)_{j} denotes the j-th column of the matrix PQ.

We call the quantity \left( PQ \right)_{j} the community profile of community j; this is the quantity that determines the distribution of the degree profile of vertices from a given community.

Our thought experiment has thus been reduced to a Bayesian hypothesis testing problem between k multivariate Poisson distributions.

The prior on the label of v is given by p, and we get to observe the degree profile d(v), which comes from one of k multivariate Poisson distributions, which have mean \ln(n) times the community profiles \left( PQ \right)_j, j \in [k].

Testing multivariate Poisson distributions

We now turn to understanding the testing problem described above; the setup is as follows.

We consider a Bayesian hypothesis testing problem with k hypotheses.

The random variable H takes values in [k] with prior given by p, i.e., \mathbb{P} \left( H = j \right) = p_j.

We do not observe H, but instead we observe a draw from a multivariate Poisson distribution whose mean depends on the realization of H:

given H = j, the mean is \lambda(j) \in \mathbb{R}_{+}^{k}.

In short:

    \[ D \, | \, H = j \sim \mathrm{Poi} \left( \lambda(j) \right), \qquad j \in [k]. \]

In more detail:

    \[ \mathbb{P} \left( D = d \, \middle| \, H = j \right) = \mathcal{P}_{\lambda(j)} \left( d \right), \qquad d \in \mathbb{Z}_{+}^{k}, \]


    \[ \mathcal{P}_{\lambda(j)} \left( d \right) = \prod_{i \in [k]} \mathcal{P}_{\lambda_{i}(j)} \left( d_{i} \right) \]


    \[ \mathcal{P}_{\lambda_{i}(j)} \left( d_{i} \right) = \frac{\lambda_{i}(j)^{d_{i}}}{d_{i}!} e^{-\lambda_{i}(j)}. \]

Our goal is to infer the value of H from a realization of D.

The error probability is minimized by the maximum a posteriori (MAP) rule, which, upon observing D = d, selects

    \[ \mathrm{argmax}_{j \in [k]} \mathbb{P} \left( D = d \, \middle| \, H = j \right) p_j \]

as an estimate for the value of H, with ties broken arbitrarily.

Let P_e denote the error of the MAP estimator.

One can think of the MAP estimator as a tournament of k-1 pairwise comparisons of the hypotheses:


    \[\mathbb{P} \left( D = d \, \middle| \, H = i \right) p_i >\mathbb{P} \left( D = d \, \middle| \, H = j \right) p_j\]

then the MAP estimate is not j.

The probability that one makes an error during such a comparison is exactly

(1)   \begin{equation*} P_{e} \left( i, j \right) := \sum_{x \in \mathbb{Z}_{+}^{k}} \min \left\{ \mathbb{P} \left( D = x \, \middle| \, H = i \right) p_i , \mathbb{P} \left( D = x \, \middle| \, H = j \right) p_j \right\}. \end{equation*}

For finite k, the error of the MAP estimator is on the same order as the largest pairwise comparison error, i.e., \max_{i,j} P_{e} \left( i, j \right).

In particular, we have that

(2)   \begin{equation*} \frac{1}{k-1} \sum_{i < j} P_{e} \left( i, j \right) \leq P_{e} \leq \sum_{i < j} P_{e} \left( i, j \right). \end{equation*}

Thus we desire to understand the magnitude of the error probability P_{e} \left( i, j \right) in (1) in the particular case when the conditional distribution of D given H is a multivariate Poisson distribution with mean vector on the order of \ln \left( n \right). The following result determines this error up to first order in the exponent.

Lemma [Abbe and Sandon 2015]

For any c_{1}, c_{2} \in \left( 0, \infty \right)^{k} with c_1 \neq c_2 and p_1, p_2 > 0, we have

(3)   \begin{align*} \sum_{x \in \mathbb{Z}_{+}^{k}} \min \left\{ \mathcal{P}_{\ln \left( n \right) c_{1}} \left( x \right) p_{1}, \mathcal{P}_{\ln \left( n \right) c_{2}} \left( x \right) p_{2} \right\} &= O \left( n^{-D_{+} \left( c_1, c_2 \right) - \tfrac{\ln \ln \left( n \right)}{2 \ln \left( n \right)}} \right), \\ \sum_{x \in \mathbb{Z}_{+}^{k}} \min \left\{ \mathcal{P}_{\ln \left( n \right) c_{1}} \left( x \right) p_{1}, \mathcal{P}_{\ln \left( n \right) c_{2}} \left( x \right) p_{2} \right\} &= \Omega \left( n^{-D_{+} \left( c_1, c_2 \right) - \tfrac{k \ln \ln \left( n \right)}{2 \ln \left( n \right)}} \right), \label{eq:LB} \end{align*}


(4)   \begin{equation*} D_{+} \left( c_{1}, c_{2} \right) = \max_{t \in \left[ 0, 1 \right]} \sum_{i \in \left[ k \right]} \left( t c_{1} \left( i \right) + \left( 1 - t \right) c_{2} \left( i \right) - c_{1} \left( i \right)^{t} c_{2} \left( i \right)^{1-t} \right). \end{equation*}

Due to connections with other, well-known measures of divergence (which we do not discuss here), Abbe and Sandon termed D_{+} the Chernoff-Hellinger divergence.

We do not go over the proof of this statement—which we leave to the reader as a challenging exercise—but we provide some intuition in the univariate case.

Observe that

    \[\min \left\{ \mathcal{P}_{\lambda} \left( x \right), \mathcal{P}_{\mu} \left( x \right) \right\}\]

decays rapidly away from

    \[x_{\mathrm{max}} := \mathrm{argmax}_{x \in \mathbb{Z}_{+}} \min \left\{ \mathcal{P}_{\lambda} \left( x \right), \mathcal{P}_{\mu} \left( x \right) \right\} ,\]

so we can obtain a good estimate of the sum

    \[\sum_{x \in \mathbb{Z}_{+}} \min \left\{ \mathcal{P}_{\lambda} \left( x \right), \mathcal{P}_{\mu} \left( x \right) \right\}\]

by simply estimating the term

    \[\min \left\{ \mathcal{P}_{\lambda} \left( x_{\mathrm{max}} \right), \mathcal{P}_{\mu} \left( x_{\mathrm{max}} \right) \right\} .\]

Now observe that x_{\mathrm{max}} must satisfy

    \[\mathcal{P}_{\lambda} \left( x_{\mathrm{max}} \right) \approx \mathcal{P}_{\mu} \left( x_{\mathrm{max}} \right) ;\]

after some algebra this is equivalent to

    \[x_{\mathrm{max}} \approx \frac{\lambda - \mu}{\log \left( \lambda / \mu \right)} .\]

Let t^{*} denote the maximizer in the expression of D_{+} \left( \lambda, \mu \right) in~\eqref{eq:D_+}.

By differentiating in t, we obtain that t^{*} satisfies

    \[\lambda - \mu - \log \left( \lambda / \mu \right) \cdot \lambda^{t^{*}} \mu^{1 - t^{*}} = 0 ,\]

and so

    \[\lambda^{t^{*}} \mu^{1 - t^{*}} = \frac{\lambda - \mu}{\log \left( \lambda / \mu \right)} .\]

Thus we see that

    \[x_{\mathrm{max}} \approx \lambda^{t^{*}} \mu^{1 - t^{*}} ,\]

from which, after some algebra, we get that

    \[\mathcal{P}_{\lambda} \left( x_{\mathrm{max}} \right) \approx \mathcal{P}_{\mu} \left( x_{\mathrm{max}} \right) \approx \exp \left( - D_{+} \left( \lambda, \mu \right) \right) .\]

The proof of (3) in the multivariate case follows along the same lines: the single term corresponding to

    \[x_{\mathrm{max}} := \mathrm{argmax}_{x \in \mathbb{Z}_{+}^{k}} \min \left\{ \mathcal{P}_{\ln \left( n \right) c_{1}} \left( x \right), \mathcal{P}_{\ln \left( n \right) c_{2}} \left( x \right) \right\}\]

gives the lower bound. For the upper bound of (3) one has to show that the other terms do not contribute much more.

Characterizing exact recoverability using CH-divergence

Our conclusion is thus that the error exponent in testing multivariate Poisson distributions is given by the explicit quantity D_{+} in (4).

The discussion above then implies that D_{+} plays an important role in the threshold for exact recovery.

In particular, it intuitively follows from the above Lemma that a necessary condition for exact recovery should be that

    \[ \min_{i,j \in [k], i \neq j} D_{+} \left( \left( PQ \right)_{i}, \left( PQ \right)_{j} \right) \geq 1. \]

Suppose on the contrary that

D_{+} \left( \left( PQ \right)_{i}, \left( PQ \right)_{j} \right) < 1

for some i and j.

This implies that the error probability in the testing problem is \Omega \left( n^{\varepsilon - 1} \right) for some \varepsilon > 0 for all vertices in communities i and j. Since the number of vertices in these communities is linear in n, and most of the hypothesis testing problems are approximately independent, one expects there to be no error in the testing problems with probability at most

\left( 1 - \Omega \left( n^{\varepsilon - 1} \right) \right)^{\Omega \left( n \right)} = \exp\left( - \Omega \left( n^{\varepsilon} \right) \right) = o(1).

It turns out that this indeed gives the recoverability threshold.

Theorem [Abbe and Sandon 2015]

Let k \in \mathbb{Z}_{+} denote the number of communities,

let p \in (0,1)^{k} with \left\| p \right\|_{1} = 1 denote the community prior,

let P = \mathrm{diag}(p),

and let Q \in \left( 0, \infty \right)^{k \times k} be a symmetric k \times k matrix with no two rows equal.

Exact recovery is solvable in \mathrm{SBM} \left( n, p, \ln(n) Q / n \right) if and only if

(5)   \begin{equation*} \min_{i,j \in [k], i \neq j} D_{+} \left( \left( PQ \right)_{i}, \left( PQ \right)_{j} \right) \geq 1. \end{equation*}

This theorem thus provides an operational meaning to the CH-divergence for the community recovery problem.

Example [Symmetric communities]

Consider again k symmetric communities, that is,

p_i = 1/k for all i \in [k],

Q_{i,j} = a if i = j,

and Q_{i,j} = b otherwise, with a, b > 0.

Then exact recovery is solvable in \mathrm{SBM} \left( n, p, \ln(n) Q / n \right) if and only if

(6)   \begin{equation*} \left| \sqrt{a} - \sqrt{b} \right| \geq \sqrt{k}. \end{equation*}

We note that in this case D_{+} is the same as the Hellinger divergence.

Above we have heuristically described why the condition (5) is necessary.

To see why it is sufficient, first note that if (5) holds, then the Lemma tells us that in the hypothesis testing problem between Poisson distributions the error of the MAP estimate is o(1/n).

Thus if the setting of the thought experiment applies to every vertex, then by looking at the degree profiles of the vertices we can correctly reclassify all vertices, and the probability that we make an error is o(1) by a union bound.

However, the setting of the thought experiment does not quite apply. Nonetheless, in this logarithmic degree regime it is possible to partially reconstruct the labels of the vertices, with only o(n) vertices being misclassified. The details of this partial reconstruction procedure would require a separate post—in brief, it determines whether two vertices are in the same community or not by looking at how their \log(n) size neighborhoods interact.

One can then apply two rounds of the degree-profiling step to each vertex, and it can be shown that all vertices are then correctly labelled whp.

Posted in Probability theory, Random graphs | Leave a comment

Kernel-based methods for convex bandits, part 3

(This post absolutely requires to have read Part 1 and Part 2.) A key assumption at the end of Part 1 was that, after rescaling space so that the current exponential weights distribution p_t is isotropic, one has

(1)   \begin{equation*} \mathrm{supp}(p_t) \subset \{x \in \mathbb{R}^n : |x| \leq R\} =:F_{t-1} , \end{equation*}

for some R = \tilde{O}(\sqrt{n}). Effectively this means that the estimated cumulative loss outside of the ball F_t is infinite (recall that p_t is proportional to \exp(- \eta \sum_{s=1}^{t-1} \tilde{\ell}_s)). Thus to enforce (1) (at time t+1) we will actually set the loss estimate \tilde{\ell}_t to +\infty on F_t^c. The price one pays for this is that the loss estimator is now unbiased only on F_t, which in turn means that we control the cumulative regret with respect to points in F_T only. We believe that in the so-called stochastic setting where the loss sequence \ell_t is i.i.d. one should be able to prove that the minimizer of the expected loss function remains in F_t at all round t (for R=\tilde{O}(\sqrt{n})) which would imply with the calculations from Part 1 and Part 2 a cumulative regret of order \tilde{O}(n^{4.5} \sqrt{T}) (note: we were not able to prove this and I think it is a great and accessible open problem). However this will certainly not be true in the general as one could imagine an adversary that makes us zoom in on a small portion of space using the first T^{1-\epsilon} rounds and then move out the optimum very far from this region for the remaining (1-o(1)) T rounds. To solve this issue we introduce a restart condition: essentially if the algorithm believes that the optimum might be outside the region F_t then it restarts (i.e. it plays as if the next round was the first one in a game with time horizon T-t). We will modify the algorithm so we can ensure that when a restart occurs the algorithm actually had negative regret, and thus in the final regret bound we can ignore everything that happened before the last restart. The definition of the focus region F_t given in (1) will not quite work, and in fact we will construct a region which verifies

    \[F_{t-1} \subset \{x \in \mathbb{R}^n : |x| \leq O(n R) \} .\]

Furthermore we will need to take R = \tilde{O}(n^2) which in turn forces us to take \lambda = \tilde{O}(1/n^8) (indeed recall that in Part 1 we showed that the magnitude of the estimated loss is essentially \exp(nR \times \sqrt{\lambda n}) where nR instead of R comes from the above display, and furthermore in Part 2 we explained that with the Gaussian core one needs to take \lambda to be 1/n times smaller than the predicted value from Part 1) and hence the \tilde{O}(n^{9.5} \sqrt{T}) final regret bound (indeed recall that we got a bound of \frac{n}{\lambda} \sqrt{n T} where the first n comes from the fact that we need to take a learning rate 1/n times smaller than the optimal learning rate to ensure approximate log-concavity of the exponential weights, and the \sqrt{n} comes from the relative entropy distance to the optimum at the beginning).

We will use extensively the following result:

Lemma: Let p = \exp(-f) be an isotropic log-concave measure. Then for any x such that |x| = \tilde{\Omega}(n) one has f(x) - \min_y f(y) = \Omega(|x|).

Restart condition

The simplest idea to restart goes as follows. Let us consider some x \in \partial F_t (with F_t defined as in (1)). Why is it there? Well there must be some time s in the past where we estimated the cumulative regret of x to be at least \Omega(R/\eta). In particular if this x now has a small estimated regret, say smaller than O(1/\eta) then we can reasonably believe that something weird is going on, for instance the adversary could be trying to move the optimum outside of the current region of focus F_t. Thus our restart condition looks like this: restart if

    \[\exists \ x \in \partial F_t : \eta \left(\sum_{s=1}^t \tilde{\ell}_t(x) - \min_{y \in F_t} \sum_{s=1}^t \tilde{\ell}_t(y) \right) = O(1) .\]

A key observation (which is the real reason to introduce the above restart condition) is that by convexity of the losses, by the concentration property of Part 2, and by taking the O(1) constant to be large enough, we know that the optimum (over all of \mathcal{K}) of the “true” cumulative loss \sum_{s=1}^t K^*_s \ell_s(x) is still in F_t at the time when we restart.

Hoping for negative regret

If we catch the adversary’s presumptive attempt to move the optimum out of F_t quickly enough we can hope to get negative regret. Indeed we initially zoomed in on region F_t for a good reason, and thus if we compare ourselves to the point x which triggers the restart we have accumulated a lot of negative regret during this zooming procedure (since x was performing badly during this time). Mathematically this shows up in the following term which appears in the regret bound calculation, where s is the time at which x enters the boundary of the focus region,

(2)   \begin{equation*} \mathrm{Ent}(\delta_x \Vert p_1) - \mathrm{Ent}(\delta_x \Vert p_{s+1}) = \log(p_{s+1}(x) / p_1(x)) . \end{equation*}

Roughly (essentially thanks to the Lemma above) we should expect this quantity to be as small as \tilde{O}(n) - R for R = \Omega(n), and thus this term divided by \eta (which is then O(- R \sqrt{T n})) could easily compensate the variance term which is O(\eta T) = O(\sqrt{T / n}). This sounds great, but unfortunately the difference of entropy in (2) does not appear in our current regret bound (because of the telescopic sum of entropies). However it is easy to see that if at time step s we had increased the learning rate from \eta to (1+\gamma) \eta then we would have the term - \mathrm{Ent}(\delta_x \Vert p_{s+1}) multiplied by \gamma in the final regret bound and thus we would only need to have \gamma R \geq \mathrm{Ent}(\delta_x \Vert p_1) /\eta = \tilde{O}(n) to ensure negative regret (note that compensating the entropy at the beginning is more difficult than compensating the variance term because of our choice of \eta). Let us make this slightly more formal.

Turning up the learning rate

Let us assume that we update the exponential weights distribution with a time-dependent learning rate as follows: p_{t+1}(x) \propto p_t(x) \exp(- \eta_t \tilde{\ell}_t(x)). One can see with the same calculations than in Part 1 that:

    \begin{align*} \langle p_t - \delta_x , \tilde{\ell}_t \rangle & \leq \frac{\mathrm{Ent}(\delta_x \Vert p_t) - \mathrm{Ent}(\delta_x \Vert p_{t+1}) }{\eta_t} + \eta_t \langle p_t, \tilde{\ell}_t^2 \rangle \\ & = \frac{\log(p_{t+1}(x)) - \log(p_t(x))}{\eta_t} + \eta_t \langle p_t, \tilde{\ell}_t^2 \rangle . \end{align*}

In particular if we increase the learning rate at some times \tau_1, \hdots, \tau_N by \eta_{\tau_i+1} = (1+\gamma) \eta_{\tau_i} then we have

    \begin{align*} & \sum_{t=1}^T \frac{\log(p_{t+1}(x)) - \log(p_t(x))}{\eta_t} \\ & = \frac{\log(p_{T+1}(x))}{\eta_{T}} - \frac{\log(p_1(x))}{\eta_1} + \sum_{i=1}^N \log(p_{\tau_i+1}) \left(\frac{1}{\eta}_{\tau_i} - \frac{1}{\eta_{\tau_i+1}} \right) \\ & = \frac{\log(p_{T+1}(x))}{\eta_{T}} - \frac{\log(p_1(x))}{\eta_1} + \frac{\gamma}{1+\gamma} \sum_{i=1}^N \frac{\log(p_{\tau_i+1})}{\eta_{\tau_i}} . \end{align*}

Using that the normalizing constants are all larger than T^{-O(n)} (this can be proved using the fact that covariance of the exponential weights is at least some T^{ -O(1)}) and assuming that \eta_T / \eta_1 = (1+\gamma)^N = O(1) (i.e. N \approx 1/\gamma) one gets roughly, with Q_t(x) = \sum_{s=1}^t \eta_s \tilde{\ell}_s(x),

    \begin{align*} \sum_{t=1}^T \langle p_t - \delta_x , \tilde{\ell}_t \rangle & \lesssim \eta_1 T + \frac{n \log(T)}{\eta_1} - \frac{\gamma}{\eta_1} \sum_{i=1}^N Q_{\tau_i}(x) \\ & \lesssim \sqrt{T} \bigg( (n \log(T))^{3/2} - \gamma \sqrt{n \log(T)} \max_{i \in [N]} Q_{\tau_i}(x) \bigg), \end{align*}

where the second inequality uses that \eta_1 \approx \frac{1}{\sqrt{n \log(T)}} (see Part 2).

We see that by taking \gamma R = \Omega(n \log(T)) we can guarantee negative regret with respect to any x whose first time at which it belonged to the boundary of the focus region is also a time at which we increased the learning rate (i.e., one of the times \tau_1, \hdots, \tau_N). Thus we see that we should take \gamma = \tilde{\Theta}(1/\mathrm{poly}(n)) if we want to have a regret of order \tilde{O}(\mathrm{poly}(n) \sqrt{T}). Recall also that we can afford to update the learning rate only N \approx 1/\gamma = \tilde{\Theta}(\mathrm{poly}(n)) times. Thus the next idea is to update the focus region only when it is necessary. We will see that we can take N = O(n \log(T)) (and thus \gamma = \Theta(1/(n \log(T)), R = \Theta( (n \log(T))^2)) while guaranteeing that the focus region satisfies

    \[F_{t+1} \subset \{x \in \mathbb{R}^n : \|x - \mathrm{mean}(p_t)\|_{\mathrm{Cov}(p_t)^{-1}} \leq O(n R) \}\]

As we explained at the beginning of this post this will conclude the proof of the n^{9.5} \sqrt{T} regret bound.

Updating the focus region more carefully

We update the focus region F_t only at times such that, once space is rescaled so that p_{t+1} is isotropic,

    \[\mathrm{Vol} \bigg(F_t \cap\{x \in \mathbb{R}^n : |x| \leq R\}\bigg) \leq \frac{1}{2} \mathrm{Vol}(F_t) ,\]

and in this case we set

    \[F_{t+1} := F_t \cap \{x \in \mathbb{R}^n : |x| \leq R\} .\]

Somewhat clearly we won’t make more than N=O(n \log(T)) updates (since after that many updates the focus region is really tiny) so the only thing to verify is that one always has (whether we update or not)

    \[F_{t+1} \subset \{x \in \mathbb{R}^n : |x| \leq O(n R) \} .\]

This follows from:

Lemma: Let \mathcal{K} be a convex body and \mathcal{E} be the centered unit ball. Suppose that \mathrm{Vol}(\mathcal{K} \cap \mathcal{E}) \geq \tfrac{1}{2} \mathrm{Vol}(\mathcal{K}). Then \mathcal{K} \subset 10 n \mathcal{E}.

Proof: Let us prove the contrapositive and assume that there is a point x \in \mathcal{K} with |x| > 10n. Denote s_i = \frac{2i}{10n}, i=1,..,5n and consider the sets \mathcal{K}_i = (1 - s_i) (\mathcal{E} \cap \mathcal{K}) + s_i x.

Note that those sets are disjoint. Indeed, the intervals (1 - s_i) [-1,1] + |x| s_i are disjoint, which implies that the projections of the ellipsoids (1 - s_i) \mathcal{E} + s_i x onto the span of x are disjoint. So, we have

    \[\mathrm{Vol}(\mathcal{K}) \geq \sum_{i=1}^{5 n} \mathrm{Vol}(\mathcal{K}_i) = \sum_{i=1}^{5 n} (1 - s_i)^{n} \mathrm{Vol}(\mathcal{E} \cap \mathcal{K}) \geq 2 \mathrm{Vol}(\mathcal{E} \cap \mathcal{K}) ,\]

which concludes the proof.

That’s it folks!

This concludes the informal proof! In the real proof we unfortunately need to deal with the various places where we said “events of probability 1/\mathrm{poly}(T) do not matter”. There is also the discrepancy between approximately log-concave and exactly log-concave measure. Finally doing the concentration argument properly (with the induction) is one more place that add some unwanted (but relatively minor) technical complications.

Another important part which I completely omitted in these lecture notes is the computational complexity of the resulting algorithm. Using known results on sampling from approximately log-concave measures it is fairly easy to sample x_t in time \mathrm{poly}(n) T. Checking whether the focus region has to be updated can also be done in poly time. The only real difficulty is checking the restart condition which asks to minimize an approximately convex function on the boundary of a convex set! The trick here is that one can replace ellipses by boxes, and thus we are left with minimizing an approximately convex function on \mathrm{poly}(n) convex sets (which are n-1 dimensional). All of this can be found in the paper of course!

Posted in Optimization, Theoretical Computer Science | Leave a comment

Kernel-based methods for convex bandits, part 2

The goal of this second lecture is to explain how to do the variance calculation we talked about at the end of Part 1 for the case where the exponential weights distribution is non-Gaussian. We will lose here a factor n^2 in the regret bound compared to the optimistic calculations from Part 1, and thus by the end of this lecture we could still hope to prove a regret bound of order n^{4.5} \sqrt{T}.

Gaussian core

First let us revisit the introduction of the core. Recall that we are considering a kernel K such that K \delta_x is the distribution of (1- \lambda) Z + \lambda x for some random variable Z to be defined. To control the regret we want to have the following inequality for any convex function f:

    \[K^* f(x) \leq (1-\lambda) \langle Kp, f\rangle + \lambda f(x) ,\]

which means that the distribution q of the random variable Z should satisfy

(1)   \begin{equation*} \langle q, f \rangle \leq \langle Kp, f\rangle . \end{equation*}

The core is defined so that (1) is in fact an equality. As we discussed in the previous post the core is a deep mathematical construction, but unfortunately we could not find a way to generalize the Part 1’s variance calculation when the core is non-Gaussian. In what follows we describe a slightly different construction which will allow us to satisfy (1) with q being Gaussian. A key observation is that if r is convexly dominated by p, i.e. for any convex f one has \langle r, f \rangle \leq \langle p, f\rangle, then (1) is satisfied by taking q to be the core of r:

    \[\langle q, f \rangle = \mathbb{E}_{X \sim q} f(X) = \mathbb{E}_{X \sim q, Y \sim r} f((1-\lambda) X + \lambda Y) \leq \mathbb{E}_{X \sim q, Y \sim p} f((1-\lambda) X + \lambda Y) = \langle Kp, f\rangle .\]

Thus it suffices to show that for any p that we care about one can find a Gaussian r “inside” of p (in the sense that r is convexly dominated by p). Then instead of taking the core of p to define the kernel one can take the core of r (which we will also refer to as the Gaussian core of p).

Next we show that we can essentially restrict our attention to the case where p is log-concave, and then we show how to find a Gaussian inside a log-concave measure.

Approximate log-concavity

Recall that in Part 1 we replaced the variance calculation by simply showing that the loss estimates \tilde{\ell}_t(y) are bounded. Thus we see by induction and using Hoeffding-Azuma that with probability at least 1-\delta one has

    \[\left|\sum_{s=1}^{t-1} (\tilde{\ell}_s (y) - K_s^* \ell_s(y)) \right| \lesssim \sqrt{T \log(1/\delta)} .\]

Using an union bound over an \epsilon-net together with Lipschitzness of \tilde{\ell}_s (which we didn’t prove but it follows the same lines than boundedness) one has in fact that with high probability (say 1-1/T^{10}) there is a convex function f (recall that \mathbb{E}_{x_t \sim K_t p_t} \tilde{\ell}_t = K_t^* \ell_t is a convex function) such that

    \[\|f - \sum_{s=1}^{t-1} \tilde{\ell}_s\|_{\infty} \lesssim \sqrt{T n \log(T)} .\]

In particular provided that \eta \lesssim \frac{1}{\sqrt{T n \log(T)}} the above display shows that p_{t} (whose density is proportional to \exp(-\eta \sum_{s=1}^{t-1} \tilde{\ell}_s)) is O(1)-approximately log-concave (recall that p is said to be \epsilon-approximately log-concave if there is a log-concave function q such that \epsilon \leq p/q \leq 1/\epsilon). To simplify the discussion in these lectures we will in fact assume that p_t is exactly log-concave.

We note that the above concentration argument makes us lose a n factor in the regret. Indeed in Part 1 we used \eta \approx \sqrt{n / T} to (informally) obtain the regret n^{2.5} \sqrt{T} (or n^{1.5} \sqrt{T} with even more optimistic calculations). We see that now we are forced to take a learning rate which is 1/n smaller than this, which in turn multiplies the regret bound by a factor n.

Finding a Gaussian inside a log-concave

Let p be an isotropic log-concave measure. We will show that a centered Gaussian G with covariance \tilde{O}(\frac{1}{n}) I_n is (approximately) convexly dominated by p. We note in passing that by going back to the calculations at the end of Part 1 it is easy to see that this factor \tilde{O}(\frac{1}{n}) in the covariance will in turn force us to take \lambda to be \tilde{O}(\frac{1}{n}) smaller than what we hoped at the end of Part 1, leading to yet again an extra n in the regret bound.

In fact we will show that any centered measure q supported on a ball of radius \epsilon=O(1) is (exactly) convexly dominated by p. This implies the (approximate) Gaussian convex domination we mentioned above since most of the mass of G is in such a ball (it is easy to deal with the remaining \mathrm{poly}(1/T) mass but we won’t worry about it here).

Let us fix some convex function f. We want to show that \langle q, f \rangle \leq \langle p, f \rangle. Since both measures are centered we may add any linear function to f without affecting this inequality, so we may legitimately assume that f(0)=0 and f(x) \geq 0 for all x\in \mathbb{R}^n. By scaling we might also assume that the maximum on the centered ball of radius \epsilon is 1 and thus it is enough to show \langle p, f \rangle \geq 1. By convexity f is above the linear function x \mapsto 1+ \nabla f(y)^{\top} (x - y) where y is the maximizer of f on the centered ball of radius \epsilon. Note also that |\nabla f(y)| \geq 1/\epsilon and thus it is enough to show (recall that f \geq 0):

    \[\langle p, x \mapsto \max(0, \theta^{\top} (x-y)) \rangle \geq \epsilon ,\]

where \theta = \nabla f(y) /|\nabla f(y)|, which is implied by

    \[\langle p, x \mapsto \max(0, \theta^{\top} x) \rangle \geq 2 \epsilon .\]

It only remains to use the fact that an isotropic real log-concave random variable X verifies \mathbb{E} (X \textbf{1}\{X \geq 0\}) \approx 1, and thus the above display holds true with \epsilon = O(1).

Posted in Optimization, Theoretical Computer Science | Leave a comment

Kernel-based methods for bandit convex optimization, part 1

A month ago Ronen Eldan, Yin Tat Lee and myself posted our latest work on bandit convex optimization. I’m quite happy with the result (first polynomial time method with poly(dimension)\times\sqrt{T}-regret) but I’m even more excited by the techniques we developed. Next week I will give three 60 minutes lectures at MSR to explain those techniques. I thought the notes for these lectures could be useful to people who are going through the main paper as the informal style of blog posts/lectures allow to gloss over some of the annoying details we had to deal with in the real proof. Each lecture covers essentially one major idea of the paper. Here we go for the notes of the first lecture!

Bandit convex optimization
Let \mathcal{K} a convex body in \mathbb{R}^n. For t=1,\hdots, T, the player selects (at random) x_t \in \mathcal{K} and simultaneously an adversary selects a convex function \ell_t : \mathcal{K} \rightarrow [0,1]. The player observes/suffers the loss \ell_t(x_t). The player’s performance is measured by the regret:

    \[R_T = \sum_{t=1}^T \ell_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^T \ell_t(x) .\]

The goal of these notes is to give a “simple” informal proof of a \mathrm{poly}(n) \sqrt{T}-regret (with high probability).

The full information engine
Most previous works on bandit convex optimization were using Online Gradient Descent as a full information primitive (in the full information setting one observes the complete loss function \ell_t). Here we will use continuous exponential weights. Let us recall what this means and the basic regret bound one can obtain. For sake of notation we use the same symbol for a probability measure p and for its density with respect to the Lebesgue measure (when it has one). We denote \langle \cdot, \cdot \rangle for the usual L^2 inner product (i.e. \langle f, g\rangle = \int_{x \in \mathbb{R}^n} f(x) g(x) dx) and \mathrm{Ent}(f \Vert g) = \langle f, \log(f/g) \rangle.

Theorem: Let \ell_1, \hdots, \ell_T be arbitrary functions from \mathcal{K} to \mathbb{R}_+. Let p_1 be some measure on \mathcal{K} and define by induction

    \[p_{t+1}(x) = \frac{1}{Z_{t+1}} p_t(x) \exp(- \eta \ell_t(x)) ,\]

where Z_{t+1} is a normalization constant and \eta>0 is some fixed learning rate. Then

    \[\sum_{t=1}^T \langle p_t - q , \ell_t \rangle \leq \frac{\mathrm{Ent}(q \Vert p_1) - \mathrm{Ent}(q \Vert p_{T+1})}{\eta} + \frac{\eta}{2} \sum_{t=1}^T \langle p_t, \ell_t^2 \rangle .\]

Note that the quantity \sum_{t=1}^T \langle p_t - q , \ell_t \rangle should be interpreted as the expected regret of playing x_t at random from p_t compared to playing the fixed distribution q. Just as a teaser for the experts reading this, a key point of the argument in the third lecture will be to use the term - \mathrm{Ent}(q \Vert p_{T+1}) in the regret bound (usually we simply drop this term as it is negative).

Proof: Observe that

    \[{\mathrm{Ent}(q \Vert p_{t}) - \mathrm{Ent}(q \Vert p_{t+1}) = \langle q , \log(p_{t+1} / {p_t}) \rangle = - \log Z_{t+1} - \eta \langle q, \ell_t \rangle}.\]

Using the basic inequalities \log(1+s) \leq s and \exp(-s) \leq 1 - s + \frac{s^2}{2} for any s \geq 0, as well as the fact that \ell_t \geq 0 one obtains

    \[\log(Z_{t+1}) = \log(\langle p_t , \exp(- \eta \ell_t) \rangle) \leq - \eta \langle p_t, \ell_t \rangle + \frac{\eta^2}{2} \langle p_t, \ell_t^2 \rangle .\]

Rearranging the two above displays we proved

    \[\eta \langle p_t - q, \ell_t \rangle \leq \mathrm{Ent}(q \Vert p_{t}) - \mathrm{Ent}(q \Vert p_{t+1}) + \frac{\eta^2}{2} \langle p_t, \ell_t^2 \rangle ,\]

which concludes the proof.

The usual bandit argument is to use the bandit feedback \ell_t(x_t) to build some unbiased estimator \tilde{\ell}_t of \ell_t (i.e. such that \mathbb{E}_{x_t \sim p_t} \tilde{\ell}_t = \ell_t) and then use the full information primitive with the estimated losses instead of the real ones. The previous theorem indicates that the only difficulty is then to control the variance of these estimators, i.e. \mathbb{E} \langle p_t, \tilde{\ell_t}^2 \rangle. This is where the shape assumption on the losses has to be used: the observation of the loss at x_t has to be somehow propagated to nearby points so as to have a low variance estimator. The bottleneck of bandit convex optimization is exactly that it is highly unclear how to leverage the convexity assumption to propagate information to nearby points. Our first idea is that one should not estimate \ell_t directly, but rather it is enough to estimate a coarse approximation to \ell_t, where the level of coarseness depends on our level of uncertainty about where lies the optimum of the current cumulative loss. In particular “far” (where distances are measured with respect to the variance of our current exponential weights) from our current believed optimum (which is basically the center of mass of our current exponential weights) a very rough estimation of \ell_t is probably enough. We found that the best way to realize this idea is via kernels as we now explain.

Let K_t : \mathcal{K} \times \mathcal{K} \rightarrow \mathbb{R}_+ be a kernel (which will depend on the current exponential weights distribution p_t), which we think of as a linear operator over probability measures through the formula K_t q(x) = \int K_t(x,y) q(y) dy. The adjoint K_t^* acts on functions: K_t^* f(y) = \int f(x) K_t(x,y) dy (indeed \langle K_t q , f \rangle = \langle q, K_t^* f \rangle). The key point of K_t^* f is that there is a canonical unbiased estimator of it based on the observation of f(x_t) with x_t sampled from some distribution q, namely f(x_t) K_t(x_t, \cdot) / q(x_t). That’s great, and we will see soon how to control the variance of this estimator, but first let us talk about a more pressing issue: with these estimators one would control the regret as if the adversary was playing the losses K_t^* \ell_t, but what we want is to control the regret with respect to real losses \ell_t. In other words we control terms of the form \langle p_t - q, K_t^* \ell_t \rangle = \langle K_t (p_t - q), \ell_t \rangle but what we really care about is something like \langle q_t - q, \ell_t \rangle for some q_t potentially different from p_t (indeed we can play from some q_t which can be different from the exponential weights distribution). A natural guess is to take q_t = K_t p_t, in which case we would be satisfied by an inequality of the form \langle K_t p_t - q, \ell_t \rangle \lesssim \langle K_t (p_t - q), \ell_t \rangle.

Generalized Bernoulli convolutions
(What follows is an almost exact copy of Section 1.2.2. in the main paper.) As we just explained above for a given measure p we want to find a kernel K such that \langle Kp - \delta_x, f \rangle \lesssim \langle K(p - \delta_x), f \rangle for all convex functions f and all points x \in \mathcal{K}. We note that for any \lambda \in (0,1) one has

(1)   \begin{equation*}  \langle Kp - \delta_x, f \rangle \leq \frac{1}{\lambda} \langle K(p - \delta_x), f \rangle \Leftrightarrow K^* f(x) \leq (1-\lambda) \langle Kp, f\rangle + \lambda f(x) . \end{equation*}

Leveraging the fact that f is convex we see that a natural kernel to consider is such that K \delta_x is the distribution of (1- \lambda) Z + \lambda x for some random variable Z to be defined. Indeed in this case one has

    \[K^* f(x) = \mathbb{E} f((1-\lambda) Z + \lambda x) \leq (1-\lambda) \mathbb{E} f(Z) + \lambda f(x) .\]

Thus this kernel satisfies the right hand side of (1) if Z is defined to be equal to K p, that is Z satisfies the following distributional identity, where X \sim p,

(2)   \begin{equation*}  Z \; \overset{D}{=} \; (1-\lambda) Z + \lambda X . \end{equation*}

If (2) holds true we say that Z is the core of p. It is easy to see that the core always exists and is unique by taking Z = \sum_{k=0}^{+\infty} (1-\lambda)^k \lambda X_k where X_0, X_1, \hdots are i.i.d. copies of X. Interestingly such random variables have a long history for the special case of a random sign X where they are called Bernoulli convolutions (they were introduced by Erdos in a 1939 paper). Our notion of core can thus be viewed as a generalized Bernoulli convolution. We refer the reader to the following very nice a survey on Bernoulli convolutions by Peres, Schlag and Solomyak, and we simply mention that the main objective in this literature is to understand for which values of \lambda is the random variable Z “smooth” (say for instance absolutely continuous with respect to the Lebesgue measure).

Summarizing the discussion so far, we see that by playing at round t from K_t p_t, where K_t is the kernel described above and p_t is the continuous exponential weights strategy on the estimated losses \tilde{\ell}_s = \ell_s(x_s) \frac{K_s(x_s, \cdot)}{K_s p_s(x_s)} one has for any q,

    \[\mathbb{E} \sum_{t=1}^T (\ell_t(x_t) - \langle q, \ell_t \rangle) \leq \frac{1}{\lambda} \mathbb{E} \left(\frac{\mathrm{Ent}(q \Vert p_1)}{\eta} + \frac{\eta}{2} \sum_{t=1}^T \langle p_t, \left(\frac{K_t(x_t, \cdot)}{K_t p_t(x_t)}\right)^2 \rangle\right) .\]

The term in the left hand side is almost exactly the expected regret by taking q to be uniform on a small ball around the minimizer of \sum_{t=1}^T \ell_t, and in this case the term \mathrm{Ent}(q \Vert p_1) is of order n \log(T).

Variance control in the Gaussian case
All that remains to be done is to control the variance term \mathbb{E}_{x \sim K p} \langle p, \tilde{\ell}^2 \rangle where \tilde{\ell}(y) = \frac{K(x, y)}{K p(x)} = \frac{K(x,y)}{\int K(x,y') p(y') dy}. More precisely if this quantity is O(1) then we obtain a regret of \tilde{O}\left( \frac{1}{\lambda} \sqrt{n T}\right). This variance control will turn out to be more challenging than one might expect and our goal for the rest of this first lecture is to derive some intuition from the case where p is a standard Gaussian. In the next lecture we will see how to somehow reduce the general case to the Gaussian case. In the calculations below we will also assume that p is compactly supported on the Euclidean ball of radius R = \mathrm{poly}(n, \log(T)). Of course this is at odds with the Gaussian assumption, but again this issue will be dealt with in the next lecture. The more tricky part will be to ensure that the exponential weights distribution is indeed truncated outside of a ball of such radius (observe that p is assumed to be isotropic, so with respect to the distances in the original space this ball might represent a tiny ball in \mathcal{K}). This truncation (and dealing with the issues that result from it) will be the topic of the third lecture (this is where the negative entropic term in the regret bound of exponential weights will turn out to be useful!).
First observe that to bound the variance term it is in fact sufficient to show that K(x,y) / K(x,y') is bounded for all y,y' in the support of p and all x in the support of Kp. In fact it is fine if we have this control with probability at least 1-1/T^{10} with respect to the draw of x from Kp (recall that eventually x is the played point, so events with small probability on x can intuitively be ignored). Next we need to understand what is K(x,y) when p is a standard Gaussian. Observe that, with c denoting the core of p, one always has K(x,y) = K \delta_y (x) = \mathrm{cst} \times c\left(\frac{x-\lambda y}{1-\lambda}\right). Furthermore it is a straightforward calculation that the core of standard Gaussian is a centered Gaussian with covariance \frac{\lambda}{2-\lambda} I_n. Thus we obtain:

    \begin{align*} \frac{K(x,y)}{K(x,y')} & = \exp \left(\frac{2-\lambda}{2 \lambda} \left(\left\vert \frac{x-\lambda y'}{1-\lambda} \right\vert^2 - \left\vert \frac{x-\lambda y}{1-\lambda} \right\vert^2 \right) \right) \\ & \leq \exp\left( \frac{1}{(1-\lambda)^2} (4 R |x| + 2 \lambda R^2) \right). \end{align*}

Finally note that with high probability one has |x| \lesssim \lambda R + \sqrt{\lambda n \log(T)}. Ignoring logarithmic factors, and provided that somehow one could take R = \tilde{O}(\sqrt{n}) (which is the ball that contains most of the mass of a standard Gaussian so it seems reasonable), we see that for \lambda = \tilde{O}(1/n^2) we have a constant variance, which in turn lead to a regret of order n^{2.5} \sqrt{T}. In fact by being more careful in bounding the variance one could hope that \lambda = \tilde{O}(1/n) is enough (basically in the display above one could do a moment generating function calculation instead of Cauchy-Schwarz to replace the term R|x| by |x|^2) which in turn would give a n^{1.5} \sqrt{T} regret. This latter result is what we conjecture to be optimal, but we are quite far from being able to prove this!

In the next lecture we will do a little bit of geometry of log-concave measures to show how to reduce the general p case to the Gaussian case. In the third and final lecture we will deal with the truncation of the exponential weights distribution (this will force us to significantly modify the algorithm).

Posted in Optimization, Theoretical Computer Science | Leave a comment

COLT 2016 videos

This is a one-sentence post for the set {blog followers}\{Twitter/Facebook followers} to advertise the COLT 2016 videos which have just been posted!

Posted in Uncategorized | Leave a comment

Notes on least-squares, part II

As promised at the end of Part I, I will give some intuition for the Bach and Moulines analysis of constant step size SGD for least-squares. Let us recall the setting. Let (X, Y) \in \mathbb{R}^d \times \mathbb{R} be a pair of random variables with Y=X^{\top} w^* + \epsilon where \mathbb{E}[\epsilon X] = 0. Let f: \mathbb{R}^d \rightarrow \mathbb{R} be the convex quadratic function defined by

    \[f(w) = \frac12 \mathbb{E} (X^{\top} w - Y)^2 .\]

Observe that, with the notation \Sigma =\mathbb{E} X X^{\top} and \|x\|_{\Sigma} = \sqrt{x^{\top} \Sigma x} one has

    \[f(w) - f(w^*) = \|w - w^*\|_{\Sigma}^2 .\]

We assume that \mathbb{E}[ |X|^2 X X^{\top} ] \preceq R^2 \Sigma (in particular this implies \mathbb{E} |X|^2 = \mathrm{Tr}(\Sigma) \leq R^2) and \mathbb{E}[ \epsilon^2 X X^{\top} ] \preceq \sigma^2 \Sigma. For instance these assumptions are respectively satisfied with |X| \leq R almost surely, and \epsilon \sim \mathcal{N}(0, \sigma^2) independent of X. Recall that constant step-size SGD on f is defined by the following iteration (where (x_t, y_t) is a sequence of i.i.d. copies of (X,Y)), for t \geq 0,

    \[w_{t+1} = w_{t} - \eta (x_t^{\top} w_{t} - y_t) x_t .\]

Our goal is to show that for \eta = \Theta(1/R^2) one has (for a sequence (u_t)_{t \in [n]} we denote by \bar{u}_n its average \frac{1}{n} \sum_{t=1}^n u_t)

    \[\mathbb{E} \|\bar{w}_n - w^*\|_{\Sigma}^2 \lesssim \frac{\sigma^2 d}{n} + \frac{|w_0 - w^*|^2}{n} .\]

Decomposition into a noise-free SGD and a pure noise SGD

With the notation b_t = y_t x_t, A_t=I_d - \eta x_t x_t^{\top}, and A_{i:j} = A_j \hdots A_i (with the convention A_{i:i-1} = I_d) one can write an iteration of SGD as

    \[w_{t+1} = A_{t} w_{t} + \eta b_{t} = A_{1:t} w_0 + \eta \sum_{k=1}^t A_{k+1:t} b_k .\]

Interestingly, since y_t = w_*^{\top} x_t + \epsilon_t, we have with \xi_t = \epsilon_t x_t,

    \[w_{t+1}-w_* = A_{t} (w_{t}-w_*) + \eta \xi_{t} = A_{1:t} (w_0-w_*) + \eta \sum_{k=1}^t A_{k+1:t} \xi_k.\]

In other words the convergence analysis decomposes into two terms: A_{1:t} (w_0-w_*) which corresponds to a noise-free SGD (it is as if we were running SGD with \epsilon_t = 0) and \sum_{k=1}^t A_{k+1:t} \xi_k which corresponds to a pure noise SGD (it is as if we were running SGD with w_0=w^*=0).

Noise-free SGD

The analysis of the noise-free SGD is the usual drill: let \Delta_t = w_t - w^* and recall that \Delta_{t+1} = \Delta_t - \eta x_t x_t^{\top} \Delta_t (recall that in the noise-free case one has \epsilon_t=0), thus

    \[|\Delta_{t+1}|^2 = |\Delta_t|^2 - 2 \eta \Delta_t^{\top} x_t x_t^{\top} \Delta_t + \eta^2 \Delta_t^{\top} (x_t x_t^{\top})^2 \Delta_t ,\]

which directly yields \mathbb{E} |\Delta_{t+1}|^2 \leq \mathbb{E}[ |\Delta_t|^2 - 2 \eta \|\Delta_t\|_{\Sigma}^2 + \eta^2 R^2 \|\Delta_t\|_{\Sigma}^2 ] and thus for \eta \leq 1/R^2 one gets \eta \mathbb{E} \|\Delta_t\|_{\Sigma}^2 \leq \mathbb{E}[ |\Delta_t|^2 - |\Delta_{t+1}|^2 ] which finally yields (by convexity of x \mapsto \|x\|_{\Sigma}^2)

    \[\mathbb{E} \|\bar{w}_n - w^*\|_{\Sigma}^2 = \mathbb{E} \|\bar{\Delta}_n\|_{\Sigma}^2 \leq\frac{|\Delta_1|^2}{n \eta} .\]

Thus from now on we can restrict our attention to the pure noise case, and the goal is to show that in this case one has \mathbb{E} \|\bar{w}_n - w^*\|_{\Sigma}^2 \lesssim \frac{\sigma^2 d}{n}.

Full gradient and pure noise

The pure noise case is the difficult part in the proof (this was to be expected, indeed in the SGD analysis 101 one usually cancels the effect of the noise by taking a very small step size). For this case Bach and Moulines (following the footsteps of Polyak and Juditsky [1992]) suggest to first study a “full gradient” method, that is to replace the term x_t x_t^{\top} by \Sigma in the SGD iteration. More precisely the “full gradient” analysis would amount to study the sequence \Delta_t^{(1)} defined by \Delta_0^{(1)} = 0 (recall that we are in the pure noise case and thus w_0= w^*) and

    \[\Delta_{t+1}^{(1)} =(I_d - \eta \Sigma) \Delta_t^{(1)} + \eta \xi_t .\]

We expect that \Delta_t^{(1)} will manage to stay pretty close to the optimum since we are doing full gradient steps, and thus it should only suffer the inherent variance of the problem (recall that the statistical lower bound for this problem is \sigma^2 d / n). It is actually easy to prove that this intuition is correct, first note that

(1)   \begin{equation*} \Delta_{t}^{(1)} = \eta \sum_{s=1}^t (I_d - \eta \Sigma)^{t-s} \xi_s , \end{equation*}

and thus for \eta \leq 1/R^2 (which implies \eta \Sigma \preceq I_d)

    \begin{align*} \bar{\Delta}_n^{(1)} & = \frac{\eta}{n} \sum_{t=1}^n \sum_{s=1}^t (I_d - \eta \Sigma)^{t-s} \xi_s\\ & = \frac{\eta}{n} \sum_{s=1}^n \sum_{k=0}^{n-s} (I_d - \eta \Sigma)^{k} \xi_s \\ & = \frac{\eta}{n} \sum_{s=1}^n (\eta \Sigma)^{-1} (I_d - (I_d - \eta \Sigma)^{n-s+1}) \xi_s , \end{align*}

which implies (the cross terms vanish since \xi_s is centered, and also recall that \mathbb{E} \xi_s \xi_s^{\top} \preceq \sigma^2 \Sigma)

(2)   \begin{equation*} \mathbb{E} \|\bar{\Delta}_n^{(1)}\|_{\Sigma}^2 = \frac{1}{n^2} \sum_{s=1}^n \mathbb{E} \| \Sigma^{-1} (I_d - (I_d - \eta \Sigma)^{n-s+1}) \xi_s \|_{\Sigma}^2 \leq \frac{\sigma^2 d}{n} . \end{equation*}

Difference between full gradient and SGD

Let us now see how different is the sequence \Delta_t^{(1)} from our real target \Delta_t:

(3)   \begin{align*} \Delta_{t+1} - \Delta_{t+1}^{(1)} & = (I_d - \eta x_t x_t^{\top}) \Delta_t - (I_d - \eta \Sigma) \Delta_t^{(1)} \notag \\ & = (I_d - \eta x_t x_t^{\top}) (\Delta_t - \Delta_t^{(1)}) + \eta (\Sigma - x_t x_t^{\top}) \Delta_t^{(1)} . \end{align*}

In words we see that the difference between the two sequences satisfy exactly the same recursion as SGD, except that the noise term \xi_t is replaced by (\Sigma - x_t x_t^{\top}) \Delta_t^{(1)}. This is pretty good news since we expect \Delta_t^{(1)} to be of order \eta: indeed \Delta_t^{(1)} is a full gradient method starting from 0, so it should stay close to 0 up to the noise term \eta \xi_t; again this intuition is easy to make precise since with (X) one sees easily that \mathbb{E}[ \Delta_t^{(1)} \Delta_t^{(1) \top} ] \preceq \eta \sigma^2 I_d. In other words while \Delta_t corresponds to SGD with noise \xi_t (the covariance of this noise is bounded by \sigma^2 \Sigma), we see that \Delta_t - \Delta_t^{(1)} corresponds to SGD with noise (\Sigma - x_t x_t^{\top}) \Delta_t^{(1)} (now the covariance of this noise is bounded by \eta R^2 \sigma^2 \Sigma). Thus we see that the new noise term is a factor \eta R^2 smaller. In particular we expect to have

    \[\mathbb{E} \|\bar{\Delta}_n - \bar{\Delta}_n^{(1)}\|_{\Sigma}^2 \lesssim \eta \times \eta R^2 \times \sigma^2 R^2\]

(this come from the usual decomposition of the “average regret” for SGD as distance to optimum over n \times \eta –which is 0 here– plus \eta times the average of the norm squared of the noise –which is \frac{\eta}{n} \sum_{t=1}^n \mathbb{E} |(\Sigma - x_t x_t^{\top}) \Delta_{t}^{(1)}|^2 \leq \eta R^2 \times \sigma^2 R^2.)

Not quite there but close

So far we have obtained for the pure noise case:

    \[\mathbb{E} \|\bar{\Delta}_n\|_{\Sigma}^2 \leq \mathbb{E} \|\bar{\Delta}_n^{(1)}\|_{\Sigma}^2 + \mathbb{E} \|\bar{\Delta}_n - \bar{\Delta}_n^{(1)}\|_{\Sigma}^2 \lesssim \frac{\sigma^2 d}{n} + \eta \times \eta R^2 \times \sigma^2 R^2 .\]

We fall short of what we were hoping to prove (which is \mathbb{E} \|\bar{\Delta}_n\|_{\Sigma}^2 \lesssim \frac{\sigma^2 d}{n}) because \Delta_t^{(1)} is not close enough to \Delta_t. Morally the above shows that \Delta_t^{(1)} can be viewed as a 1^{st} order approximation to \Delta_t, with a remainder term of order \eta R^2. What about continuing this expansion of \Delta_t as a polynomial in \eta R^2? Let us see what the next term \Delta_t^{(2)} should look like. First we hope that this term will also satisfy (2) and thus it should correspond to a full gradient method. On the other hand the noise term in this full gradient method should absorb the noise term in (3) so that we can hope that the noise term in the new (3) will be smaller. This leads to:

(4)   \begin{equation*} \Delta_{t+1}^{(2)} = (I_d - \eta \Sigma) \Delta_t^{(2)} + \eta (\Sigma - x_t x_t^{\top}) \Delta_t^{(1)} , \end{equation*}

and thus (1) rewrites as

    \[\Delta_{t+1} - \Delta_{t+1}^{(1)} - \Delta_{t+1}^{(2)} = (I_d - \eta x_t x_t^{\top}) (\Delta_t - \Delta_t^{(1)} - \Delta_{t+1}^{(2)}) + \eta (\Sigma - x_t x_t^{\top}) \Delta_t^{(2)} .\]

In fact there is no reason to stop there, and for any k \geq 1 one can define

    \[\Delta_{t+1}^{(k)} = (I_d - \eta \Sigma) \Delta_t^{(k)} + \eta (\Sigma - x_t x_t^{\top}) \Delta_t^{(k-1)} ,\]

which leads to

    \[\Delta_{t+1} - \sum_{k=1}^r \Delta_{t+1}^{(k)} = (I_d - \eta x_t x_t^{\top}) \left(\Delta_t - \sum_{k=1}^r \Delta_t^{(k)} \right) + \eta (\Sigma - x_t x_t^{\top}) \Delta_t^{(r)} .\]

Denoting \lambda_k for a positive number such that the covariance matrix of (\Sigma - x_t x_t^{\top}) \Delta_t^{(k-1)} (which is the noise term in the “full gradient with noise” corresponding to both \Delta_t^{(k)} and \Delta_{t} - \sum_{s=1}^k \Delta_{t}^{(s)}) is bounded by \lambda_k \sigma^2 \Sigma, one gets with the same arguments as above

    \[\mathbb{E} \left\|\bar{\Delta}_{n} - \sum_{k=1}^r \bar{\Delta}_{n}^{(k)} \right\|_{\Sigma}^2 \lesssim \eta \times \lambda_r \times \sigma^2 R^2\]


    \[\mathbb{E} \|\bar{\Delta}_n^{(k)}\|_{\Sigma}^2 \lesssim \frac{\lambda_k \sigma^2 d}{n} .\]

In particular to conclude it suffices to show that \lambda_k is geometrically decreasing. In fact an easy induction shows that one can take \lambda_k = (\eta R^2)^k (we already did the base case k=1).


That’s it for today. In part III I will discuss how to use random projections to speed up computations for least squares in both the large sample regime n \gg d and in the high-dimensional regime d \gg n.

Posted in Optimization | Leave a comment

Bandit theory, part II

These are the lecture notes for the second part of my minicourse on bandit theory (see here for Part 1).

The linear bandit problem, Auer [2002]

We will mostly study the following far-reaching extension of the n-armed bandit problem.

Known parameters: compact action set \mathcal{A} \subset \mathbb{R}^n, adversary’s action set \mathcal{L} \subset \mathbb{R}^n, number of rounds T.

Protocol: For each round t=1,2,\hdots, T, the adversary chooses a loss vector \ell_t \in \mathcal{L} and simultaneously the player chooses a_t \in \mathcal{A} based on past observations and receives a loss/observation Y_t = \ell_t^{\top} a_t.

    \[R_T = \mathbb{E} \sum_{t=1}^T \ell_t^{\top} a_t - \min_{a \in \mathcal{A}} \mathbb{E} \sum_{t=1}^T \ell_t^{\top} a .\]

Other models: In the i.i.d. model we assume that there is some underlying \theta \in \mathcal{L} such that \mathbb{E}(Y_t | a_t) = \theta^{\top} a_t. In the Bayesian model we assume that we have a prior distribution \nu over the sequence (\ell_1,\hdots, \ell_T) (in this case the expectation in R_T is also over (\ell_1,\hdots,\ell_T)\sim\nu). Alternatively we could assume a prior over \theta.

Example: Part 1 was about \mathcal{A} = \{e_1,\hdots,e_n\} and \mathcal{L}=[0,1]^n. Another simple example is path planning: say you have a graph with n edges, and at each step one has to pick a path from some source node to a target node. The action set \mathcal{A} can be represented as a subset of the hypercube \{0,1\}^n. The adversary chooses delays on the edges, and the delay of the chosen path is the sum of the delays on the edges that the path visits (this is indeed a linear loss).

Assumption: unless specified otherwise we assume \mathcal{L}=\mathcal{A}^{\circ} :=\{\ell : \sup_{a \in \mathcal{A}} |\ell^{\top} a| \leq 1\}.

Other feedback model: in the case where \mathcal{A} \subset \{0,1\}^n one can assume that the loss \ell_t is observed at every coordinate i where a_t(i) = 1. This is the so-called semi-bandit feedback.

Thompson Sampling for linear bandit after RVR14

Assume \mathcal{A}=\{a_1,\hdots,a_{|\mathcal{A}|}\}. Recall from Part 1 that TS satisfies

    \begin{align*} & \sum_i \pi_t(i) (\bar{\ell}_t(i) - \bar{\ell}_t(i,i)) \leq \sqrt{C \ \sum_{i,j} \pi_t(i) \pi_t(j) (\bar{\ell}_t(i,j) - \bar{\ell}_t(i))^2} \\ & \Rightarrow R_T \leq \sqrt{C \ T \ \log(|\mathcal{A}|) / 2} , \end{align*}

where \bar{\ell}_t(i) = \mathbb{E}_t \ell_t(i) and \bar{\ell}_t(i,j) = \mathbb{E}_t (\ell_t(i) | i^*=j).

Writing \bar{\ell}_t(i) = a_i^{\top}\bar{\ell}_t, \bar{\ell}_t(i,j) = a_i^{\top} \bar{\ell}_t^{j}, and (M_{i,j}) = \left(\sqrt{\pi_t(i) \pi_t(j)} a_i^{\top} (\bar{\ell}_t - \bar{\ell}_t^j) \right) we want to show that

    \[\mathrm{Tr}(M) \leq \sqrt{C} \|M\|_F .\]

Using the eigenvalue formula for the trace and the Frobenius norm one can see that \mathrm{Tr}(M)^2 \leq \mathrm{rank}(M) \|M\|_F^2. Moreover the rank of M is at most n since M = U V^{\top} where U, V \in \mathbb{R}^{|\mathcal{A}| \times n} (the i^{th} row of U is \sqrt{\pi_t(i)} a_i and for V it is \sqrt{\pi_t(i)} (\bar{\ell}_t - \bar{\ell}_t^i)).

Let us make some observations.

  1. TS satisfies R_T \leq \sqrt{n T \log(|\mathcal{A}|)}. To appreciate the improvement recall that without the linear structure one would get a regret of order \sqrt{|\mathcal{A}| T} and that \mathcal{A} can be exponential in the dimension n (think of the path planning example).
  2. Provided that one can efficiently sample from the posterior on \ell_t (or on \theta), TS just requires at each step one linear optimization over \mathcal{A}.
  3. TS regret bound is optimal in the following sense. W.l.og. one can assume |\mathcal{A}| \leq (10 T)^n and thus TS satisfies R_T = O(n \sqrt{T \log(T)}) for any action set. Furthermore one can show that there exists an action set and a prior such that for any strategy one has R_T = \Omega(n \sqrt{T}), see Dani, Hayes and Kakade [2008], Rusmevichientong and Tsitsiklis [2010], and Audibert, Bubeck and Lugosi [2011, 2014].

Adversarial linear bandit after Dani, Hayes, Kakade [2008]

Recall from Part 1 that exponential weights satisfies for any \tilde{\ell}_t such that \mathbb{E} \tilde{\ell}_t(i) = \ell_t(i) and \tilde{\ell}_t(i) \geq 0,

    \[R_T \leq \frac{\max_i \mathrm{Ent}(\delta_{i} \Vert p_{1})}{\eta} + \frac{\eta}{2} \mathbb{E} \sum_{t} \mathbb{E}_{I \sim p_t} \tilde{\ell}_t(I)^2 .\]

DHK08 proposed the following (beautiful) unbiased estimator for the linear case:

    \[\tilde{\ell}_t = \Sigma_t^{-1} a_t a_t^{\top} \ell_t \; \text{where} \; \Sigma_t = \mathbb{E}_{a \sim p_t} (a a^{\top}) .\]

Again, amazingly, the variance is automatically controlled:

    \[\mathbb{E}(\mathbb{E}_{a \sim p_t} (\tilde{\ell}_t^{\top} a)^2) = \mathbb{E} \tilde{\ell}_t^{\top} \Sigma_t \tilde{\ell}_t \leq \mathbb{E} a_t^{\top} \Sigma_t^{-1} a_t = \mathbb{E} \mathrm{Tr} (\Sigma_t^{-1} a_t a_t) = n .\]

Up to the issue that \tilde{\ell}_t can take negative values this suggests the “optimal” \sqrt{n T \log(|\mathcal{A}|)} regret bound.

Adversarial linear bandit, further development

  1. The non-negativity issue of \tilde{\ell}_t is a manifestation of the need for an added exploration. DHK08 used a suboptimal exploration which led to an additional \sqrt{n} in the regret. This was later improved in Bubeck, Cesa-Bianchi, and Kakade [2012] with an exploration based on the John’s ellipsoid (smallest ellipsoid containing \mathcal{A}). You can check this video for some more details on this.
  2. Sampling the exp. weights is usually computationally difficult, see Cesa-Bianchi and Lugosi [2009] for some exceptions.
  3. Abernethy, Hazan and Rakhlin [2008] proposed an alternative (beautiful) strategy based on mirror descent. The key idea is to use a n-self-concordant barrier for \mathrm{conv}(\mathcal{A}) as a mirror map and to sample points uniformly in Dikin ellipses. This method’s regret is suboptimal by a factor \sqrt{n} and the computational efficiency depends on the barrier being used.
  4. Bubeck and Eldan [2014]‘s entropic barrier allows for a much more information-efficient sampling than AHR08. This gives another strategy with optimal regret which is efficient when \mathcal{A} is convex (and one can do linear optimization on \mathcal{A}). You can check this video for some more on the entropic barier.

Adversarial combinatorial bandit after Audibert, Bubeck and Lugosi [2011, 2014]

Combinatorial setting: \mathcal{A} \subset \{0,1\}^n, \max_a \|a\|_1 = m, \mathcal{L} = [0,1]^n.

  1. Full information case goes back to the end of the 90’s (Warmuth and co-authors), semi-bandit and bandit were introduced in Audibert, Bubeck and Lugosi [2011] (following several papers that studied specific sets \mathcal{A}).
  2. This is a natural setting to study FPL-type (Follow the Perturbed Leader) strategies, see e.g. Kalai and Vempala [2004] and more recently Devroye, Lugosi and Neu [2013].
  3. ABL11: Exponential weights is provably suboptimal in this setting! This is in sharp contrast with the case where \mathcal{L} = \mathcal{A}^{\circ}.
  4. Optimal regret in the semi-bandit case is \sqrt{m n T} and it can be achieved with mirror descent and the natural unbiased estimator for the semi-bandit situation.
  5. For the bandit case the bound for exponential weights from the previous slides gives m \sqrt{m n T} (you should read this as “range of the loss times square root dimension times time times logsize of the set). However the lower bound from ABL14 is m \sqrt{n T}, which is conjectured to be tight.

Preliminaries for the i.i.d. case: a primer on least squares

Assume Y_t = \theta^{\top} a_t + \xi_t where (\xi_t) is an i.i.d. sequence of centered and sub-Gaussian real-valued random variables. The (regularized) least squares estimator for \theta based on \mathbb{Y}_t = (Y_1, \hdots, Y_{t-1})^{\top} is, with \mathbb{A}_t = (a_1 \hdots a_{t-1}) \in \mathbb{R}^{n \times t-1} and \Sigma_t = \lambda I_n + \sum_{s=1}^{t-1} a_s a_s^{\top}:

    \[\hat{\theta}_t = \Sigma_t^{-1} \mathbb{A}_t \mathbb{Y}_t\]

Observe that we can also write \theta = \Sigma_t^{-1} (\mathbb{A}_t (\mathbb{Y}_t+\epsilon_t) + \lambda \theta) where \epsilon_t = (\mathbb{E}(Y_1|a_1)-Y_1, \hdots, \mathbb{E}(Y_{t-1}|a_{t-1})-Y_{t-1})^{\top} so that

    \[\|\theta-\hat{\theta}_t\|_{\Sigma_t} = \|\mathbb{A}_t \epsilon_t + \lambda \theta\|_{\Sigma_t^{-1}} \leq \|\mathbb{A}_t \epsilon_t\|_{\Sigma_t^{-1}} + \sqrt{\lambda} \|\theta\| .\]

A basic martingale argument (see e.g., Abbasi-Yadkori, Pal and Szepesvari [2011]) shows that w.p. \geq 1-\delta, \forall t \geq 1,

    \[\|\mathbb{A}_t \epsilon_t\|_{\Sigma_t^{-1}} \leq \sqrt{\mathrm{logdet}(\Sigma_t) + \log(1/(\delta^2 \lambda^n))} .\]

Note that \mathrm{logdet}(\Sigma_t) \leq n \log(\mathrm{Tr}(\Sigma_t)/n) \leq n \log(\lambda + t /n) (w.l.o.g. we assumed \|a_t\|\leq1).

i.i.d. linear bandit after DHK08, RT10, AYPS11

Let \beta = 2 \sqrt{n \log(T)}, and \mathcal{E}_t = \{\theta' : \|\theta' - \hat{\theta}_t\|_{\Sigma_t} \leq \beta\}. We showed that w.p. \geq 1-1/T^2 one has \theta \in \mathcal{E}_t for all t \in [T].

The appropriate generalization of UCB is to select: (\tilde{\theta}_t, a_t) = \mathrm{argmin}_{(\theta', a) \in \mathcal{E}_t \times \mathcal{A}} \theta'^{\top} a (this optimization is NP-hard in general, more on that next slide). Then one has on the high-probability event:

    \[\sum_{t=1}^T \theta^{\top} (a_t - a^*) \leq \sum_{t=1}^T (\theta-\tilde{\theta}_t)^{\top} a_t \leq \beta \sum_{t=1}^T \|a_t\|_{\Sigma_t^{-1}} \leq \beta \sqrt{T \sum_{t} \|a_t\|_{\Sigma_t^{-1}}^2}.\]

To control the sum of squares we observe that:

    \[\det(\Sigma_{t+1}) = \det(\Sigma_t) \det(I_n + \Sigma_t^{-1/2} a_t (\Sigma_t^{-1/2} a_t)^{\top}) = \det(\Sigma_t) (1+\|a_t\|^2_{\Sigma_t^{-1}})\]

so that (assuming \lambda \geq 1)

    \[\log\det(\Sigma_{T+1}) - \log\det(\Sigma_1) = \sum_t \log (1+\|a_t\|^2_{\Sigma_t^{-1}}) \geq \frac12 \sum_t \|a_t\|^2_{\Sigma_t^{-1}} .\]

Putting things together we see that the regret is O(n \log(T) \sqrt{T}).

What’s the point of i.i.d. linear bandit?

So far we did not get any real benefit from the i.i.d. assumption (the regret guarantee we obtained is the same as for the adversarial model). To me the key benefit is in the simplicity of the i.i.d. algorithm which makes it easy to incorporate further assumptions.

  1. Sparsity of \theta: instead of regularization with \ell_2-norm to define \hat{\theta} one could regularize with \ell_1-norm, see e.g., Johnson, Sivakumar and Banerjee [2016] (see also Carpentier and Munos [2012] and Abbasi-Yadkori, Pal and Szepesvari [2012].
  2. Computational constraint: instead of optimizing over \mathcal{E}_t to define \tilde{\theta}_t one could optimize over an \ell_1-ball containing \mathcal{E}_t (this would cost an extra \sqrt{n} in the regret bound).
  3. Generalized linear model: \mathbb{E}(Y_t | a_t) = \sigma(\theta^{\top} a_t) for some known increasing \sigma : \mathbb{R} \rightarrow \mathbb{R}, see Filippi, Cappe, Garivier and Szepesvari [2011].
  4. \log(T)-regime: if \mathcal{A} is finite (note that a polytope is effectively finite for us) one can get n^2 \log^2(T) / \Delta regret:

    \[R_T \leq \mathbb{E} \sum_{t=1}^T \frac{(\theta^{\top} (a_t - a^*))^2}{\Delta} \leq \frac{\beta^2}{\Delta} \mathbb{E} \sum_{t=1}^T \|a_t\|^2_{\Sigma_t^{-1}} \lesssim \frac{n^2 \log^2(T)}{\Delta} .\]

Some non-linear bandit problems

  1. Lipschitz bandit: Kleinberg, Slivkins and Upfal [2008, 2016], Bubeck, Munos, Stoltz and Szepesvari [2008, 2011], Magureanu, Combes and Proutiere [2014]
  2. Gaussian process bandit: Srinivas, Krause, Kakade and Seeger [2010]
  3. Convex bandit: see the videos by myself and Ronen Eldan here and our arxiv paper.

Contextual bandit

We now make the game-changing assumption that at the beginning of each round t a {\em context} x_t \in \mathcal{X} is revealed to the player. The ideal notion of regret is now:

    \[R_T^{\mathrm{ctx}} = \sum_{t=1}^T \ell_t(a_t) - \inf_{\Phi : \mathcal{X} \rightarrow \mathcal{A}} \sum_{t=1}^T \ell_t(\Phi(x_t)) .\]

Sometimes it makes sense to restrict the mapping from contexts to actions, so that the infimum is taken over some {\em policy set} \Pi \subset \mathcal{A}^{\mathcal{X}}.

As far as I can tell the contextual bandit problem is an infinite playground and there is no canonical solution (or at least not yet!). Thankfully all we have learned so far can give useful guidance in this challenging problem.

Linear model after embedding

A natural assumption in several application domains is to suppose linearity in the loss after a correct embedding. Say we know mappings (\phi_a)_{a \in \mathcal{A}} such that \mathbb{E}_t(\ell_t(a)) = \phi_a(x_t)^{\top} \theta for some unknown \theta \in \mathbb{R}^n (or in the adversarial case that \ell_t(a) = \ell_t^{\top} \phi_a(x_t)).

This is nothing but a linear bandit problem where the action set is changing over time. All the strategies we described are robust to this modification and thus in this case one can get a regret of \sqrt{n T \log(|\mathcal{A}|)} \lesssim n \sqrt{T \log(T)} (and for the stochastic case one can get efficiently n^{3/2} \sqrt{T}).

A much more challenging case is when the correct embedding \phi=(\phi_a)_{a \in \mathcal{A}} is only known to belong to some class \Phi. Without further assumptions on \Phi we are basically back to the general model. Also note that a natural impulse is to run “bandits on top of bandits”, that is first select some \phi_t \in \Phi and then select a_t based on the assumption that \phi_t is correct. We won’t get into this here, but let us investigate a related idea.

Exp4, Auer, Cesa-Bianchi, Freund and Schapire [2001]

One can play exponential weights on the set of policies with the following unbiased estimator (obvious notation: \ell_t(\pi) = \ell_t(\pi(x_t)), \pi_t \sim p_t, and a_t=\pi_t(x_t))

    \[\tilde{\ell}_t(\pi) = \frac{1\{\pi(x_t) = a_t\}}{\sum_{\pi' : \pi'(x_t) = a_t} p_t(\pi')} \ell_t(a_t) .\]

Easy exercise: R_T^{\mathrm{ctx}} \leq \sqrt{2 T |\mathcal{A}| \log(|\Pi|)} (indeed the relative entropy term is smaller than \log(|\Pi|) while the variance term is exactly |\mathcal{A}|).

The only issue of this strategy is that the computationally complexity is linear in the policy space, which might be huge. A year and half ago a major paper by Agarwal, Hsu, Kale, Langford, Li and Schapire was posted, with a strategy obtaining the same regret as Exp4 (in the i.i.d. model) but which is also computationally efficient with an oracle for the offline problem (i.e., \min_{\pi \in \Pi} \sum_{t=1}^T \ell_t(\pi(x_t))). Unfortunately the algorithm is not simple enough yet to be included in these slides.

The statistician perspective, after Goldenshluger and Zeevi [2009, 2011], Perchet and Rigollet [2011]

Let \mathcal{X} \subset \mathbb{R}^d, \mathcal{A}=[n], (x_t) i.i.d. from some \mu absolutely continuous w.r.t. Lebesgue. The reward for playing arm a under context x is drawn from some distribution \nu_{a}(x) on [0,1] with mean function f_a(x) which is assumed to be \beta-Holder smooth. Let \Delta(x) be the “gap” function.

A key parameter is the proportion of contexts with a small gap. The margin assumption is that for some \alpha>0, one has

    \[\mu(\{x : \Delta(x) \in (0,\delta)\}) \leq C \delta^{\alpha}, \forall \delta \in (0,1] .\]

One can achieve a regret of order T \left(\frac{n \log(n)}{T} \right)^{\frac{\beta (\alpha+1)}{2\beta + d}}, which is optimal at least in the dependency on T. It can be achieved by running Successive Elimination on an adaptively refined partition of the space, see Perchet and Rigollet [2011] for the details.

The online multi-class classification perspective after Kakade, Shalev-Shwartz, and Tewari [2008]

Here the loss is assumed to be of the following very simple form: \ell_t(a) = 1\{a\neq a^*_t\}. In other words using the context x_t one has to predict the best action (which can be interpreted as a class) a_t^* \in [n].

KSST08 introduces the banditron, a bandit version of the multi-class perceptron for this problem. While with full information the online multi-class perceptron can be shown to satisfy a “regret” bound on of order \sqrt{T}, the banditron attains only a regret of order T^{2/3}. See also Chapter 4 in Bubeck and Cesa-Bianchi [2012] for more on this.

Summary of advanced results

  1. The optimal regret for the linear bandit problem is \tilde{O}(n \sqrt{T}). In the Bayesian context Thompson Sampling achieves this bound. In the i.i.d. case one can use an algorithm based on the optimism in face of uncertainty together with concentration properties of the least squares estimator.
  2. The i.i.d. algorithm can easily be modified to be computationally efficient, or to deal with sparsity in the unknown vector \theta.
  3. Extensions/variants: semi-bandit model, non-linear bandit (Lipschitz, Gaussian process, convex).
  4. Contextual bandit is still a very active subfield of bandit theory.
  5. Many important things were omitted. Example: knapsack bandit, see Badanidiyuru, Kleinberg and Slivkins [2013].

Some open problems we discussed

  1. Prove the lower bound \mathbb{E} R_T = \Omega(\sqrt{T n \log(n)}) for the adversarial n-armed bandit with adaptive adversary.
  2. Guha and Munagala [2014] conjecture: for product priors, TS is a 2-approximation to the optimal Bayesian strategy for the objective of minimizing the number of pulls on suboptimal arms.
  3. Find a “simple” strategy achieving the Bubeck and Slivkins [2012] best of both worlds result (see Seldin and Slivkins [2014] for some partial progress on this question).
  4. For the combinatorial bandit problem, find a strategy with regret at most n^{3/2}\sqrt{T} (current best is n^2 \sqrt{T}).
  5. Is there a computationally efficient strategy for i.i.d. linear bandit with optimal n \sqrt{T} gap-free regret and with \log(T) gap-based regret?
  6. Is there a natural framework to think about “bandits on top of bandits” (while keeping \sqrt{T}-regret)?
Posted in Optimization, Probability theory | 5 Comments

Bandit theory, part I

This week I’m giving two 90 minutes lectures on bandit theory at MLSS Cadiz. Despite my 2012 survey with Nicolo I thought it would be a good idea to post my lectures notes here. Indeed while much of the material is similar, the style of a mini-course is quite different from the style of a survey. Also, bandit theory has surprisingly progressed since 2012 and many things can now be explained better. Finally in the survey we completely omitted the Bayesian model as we thought that we didn’t have much to add on this topic compared to existing sources (such as the 2011 book by Gittins, Glazebrook, Weber). For a mini-course this concern is irrelevant so I quickly discuss the famous Gittins index and its proof of optimality.

i.i.d. multi-armed bandit, Robbins [1952]

Known parameters: number of arms n and (possibly) number of rounds T \geq n.

Unknown parameters: n probability distributions \nu_1,\ldots,\nu_n on [0,1] with mean \mu_1, \hdots, \mu_n (notation: \mu^* = \max_{i \in [n]} \mu_i).

Protocol: For each round t=1,2,\hdots, T, the player chooses I_t \in [n] based on past observations and receives a reward/observation Y_t \sim \nu_{I_t} (independently from the past).

Performance measure: The cumulative regret is the difference between the player’s accumulated reward and the maximum the player could have obtained had she known all the parameters,

    \[\overline{R}_T = T \mu^* - \mathbb{E} \sum_{t \in [T]} Y_t .\]

This problem models the fundamental tension between exploration and exploitation (one wants to pick arms that performed well in the past, yet one needs to make sure that no good option has been missed). Almost every week new applications are found that fit this simple framework and I’m sure you already have some in mind (the most popular one being ad placement on the internet).

i.i.d. multi-armed bandit: fundamental limitations

How small can we expect \overline{R}_T to be? Consider the 2-armed case where \nu_1 = Ber(1/2) and \nu_2 = Ber(1/2+ \xi \Delta) where \xi \in \{-1,1\} is unknown. Recall from Probability 101 (or perhaps 102) that with \tau expected observations from the second arm there is a probability at least \exp(- \tau \Delta^2) to make the wrong guess on the value of \xi. Now let \tau(t) be the expected number of pulls of arm 2 up to time t when \xi=-1. One has

    \begin{eqnarray*} \overline{R}_T(\xi=+1) + \overline{R}_T(\xi = -1) & \geq & \Delta \tau(T) + \Delta \sum_{t=1}^T \exp(- \tau(t) \Delta^2) \\ & \geq & \Delta \min_{t \in [T]} (t + T \exp(- t \Delta^2)) \\ & \approx & \frac{\log(T \Delta^2)}{\Delta} . \end{eqnarray*}

We refer to Bubeck, Perchet and Rigollet [2013] for the details. The important message is that for \Delta fixed the lower bound is \frac{\log(T)}{\Delta}, while for the worse \Delta (which is of order 1/\sqrt{T}) it is \sqrt{T}. In the n-armed case this worst-case lower bound becomes \sqrt{T n} (see Auer, Cesa-Bianchi, Freund and Schapire [1995]). The \log(T)-lower bound is slightly “harder” to generalize to the n-armed case (as far as I know there is no known finite-time lower bound of this type), but thankfully it was already all done 30 years ago. First some notation: let \Delta_i = \mu^* - \mu_i and N_i(t) the number of pulls of arm i up to time t. Note that one has \overline{R}_T = \sum_{i=1}^n \Delta_i \mathbb{E} N_i(T). For p, q \in [0,1] let

    \[\mathrm{kl}(p,q) := p \log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q} .\]

Theorem [Lai and Robbins [1985]]

Consider a strategy s.t. \forall a>0, we have \mathbb{E} N_i(T) = o(T^a) if \Delta_i>0. Then for any Bernoulli distributions,

    \[\liminf_{T \to +\infty} \frac{\overline{R}_T}{\log(T)} \geq \sum_{i : \Delta_i > 0} \frac{\Delta_i}{\mathrm{kl}(\mu_i,\mu^*)}.\]

Note that \frac{1}{2 \Delta_i} \geq \frac{\Delta_i}{\mathrm{kl}(\mu_i,\mu^*)} \geq \frac{\mu^*(1-\mu^*)}{2 \Delta_i} so up to a variance-like term the Lai and Robbins lower bound is \sum_{i :\Delta_i>0} \frac{\log(T)}{2 \Delta_i}. This lower bound holds more generally than just for Bernoulli distributions, see for example Burnetas and Katehakis [1996].

i.i.d. multi-armed bandit: fundamental strategy

Hoeffding’s inequality teaches us that with probability at least 1-1/T, \forall t \in [T], i \in [n],

    \[\mu_i \leq \frac{1}{N_i(t)} \sum_{s < t : I_s = i} Y_s + \sqrt{\frac{2 \log(T)}{N_i(t)}} =: \mathrm{UCB}_i(t).\]

The UCB (Upper Confidence Bound) strategy (Lai and Robbins [1985], Agarwal [1995], Auer, Cesa-Bianchi and Fischer [2002]) is:

    \[I_t \in \mathrm{argmax}_{i \in [n]} \mathrm{UCB}_i(t) .\]

The regret analysis is straightforward: on a 1-2/T probability event one has

    \[N_i(t) \geq 8 \log(T) / \Delta_i^2 \Rightarrow \mathrm{UCB}_i(t) < \mu^* \leq \mathrm{UCB}_{i^*}(t) ,\]

so that \mathbb{E} N_i(T) \leq 2 + 8 \log(T)/\Delta_i^2 and in fact

    \[\overline{R}_T \leq 2 + \sum_{i : \Delta_i >0} \frac{8 \log(T)}{\Delta_i} .\]

i.i.d. multi-armed bandit: going further

  • The numerical constant 8 in the UCB regret bound can be replaced by 1/2 (which is the best one can hope for), and more importantly by slightly modifying the derivation of the UCB one can obtain the Lai and Robbins variance-like term (that is replacing \Delta_i by \mathrm{kl}(\mu_i,\mu^*)): see Cappe, Garivier, Maillard, Munos and Stoltz [2013].
  •  In many applications one is merely interested in finding the best arm (instead of maximizing cumulative reward): this is the best arm identification problem. For the fundamental strategies see Even-Dar, Mannor and Mansour [2006] for the fixed-confidence setting (see also Jamieson and Nowak [2014] for a recent short survey) and Audibert, Bubeck and Munos [2010] for the fixed budget setting. Key takeaway: one needs of order \mathbf{H}:=\sum_i \Delta_i^{-2} rounds to find the best arm.
  • The UCB analysis extends to sub-Gaussian reward distributions. For heavy-tailed distributions, say with 1+\epsilon moment for some \epsilon \in (0,1], one can get a regret that scales with \Delta_i^{-1/\epsilon} (instead of \Delta_i^{-1}) by using a robust mean estimator, see Bubeck, Cesa-Bianchi and Lugosi [2012].


Adversarial multi-armed bandit, Auer, Cesa-Bianchi, Freund and Schapire [1995, 2001]

For t=1,\hdots, T, the player chooses I_t \in [n] based on previous observations, and simultaneously an adversary chooses a loss vector \ell_t \in [0,1]^n. The player’s loss/observation is \ell_t(I_t). The regret and pseudo-regret are defined as:

    \[R_T = \max_{i \in [n]} \sum_{t\in[T]} (\ell_t(I_t) - \ell_t(i)), \;\; \overline{R}_T = \max_{i \in [n]} \mathbb{E} \sum_{t\in[T]} (\ell_t(I_t) - \ell_t(i)) .\]

Obviously \mathbb{E} R_T \geq \overline{R}_T and there is equality in the oblivious case (\equiv adversary’s choices are independent of the player’s choices). The case where \ell_1,\hdots, \ell_T is an i.i.d. sequence corresponds to the i.i.d. model we just studied. In particular we already know that \sqrt{T n} is a lower bound on the attainable pseudo-regret.

Adversarial multi-armed bandit, fundamental strategy

The exponential weights strategy for the full information case where \ell_t is observed at the end of round t is defined by: play I_t at random from p_t where

    \[p_{t+1}(i) = \frac{1}{Z_{t+1}} p_t(i) \exp(- \eta \ell_t(i)) .\]

In five lines one can show \overline{R}_T \leq \sqrt{2 T \log(n)} with p_1(i)=1/n and a well-chosen learning rate \eta (recall that \mathrm{Ent}(p \Vert q) = \sum_i p(i) \log(p(i)/q(i))):

    \begin{align*} & \mathrm{Ent}(\delta_j \Vert p_{t}) - \mathrm{Ent}(\delta_j \Vert p_{t+1}) = \log \frac{p_{t+1}(j)}{p_t(j)} = \log \frac{1}{Z_{t+1}} - \eta \ell_t(j) \\ &\psi_t:= \log \mathbb{E}_{I\sim p_t} \exp(- \eta (\ell_t(I) - \mathbb{E}_{I' \sim p_t} \ell_t(I'))) = \eta \mathbb{E} \ell_t(I') + \log (Z_{t+1})\\ &\eta \sum_{t} \left(\sum_{i} p_t(i) \ell_t(i) - \ell_t(j)\right) = \mathrm{Ent}(\delta_j \Vert p_{1}) - \mathrm{Ent}(\delta_j \Vert p_{T+1}) + \sum_{t} \psi_t \\ & \text{Using that} \; \ell_t \geq 0 \; \text{one has} \; \psi_t \leq \frac{\eta^2}{2} \mathbb{E} \ell_t(i)^2 \; \text{thus} \; \overline{R}_T \leq \frac{\log(n)}{\eta} + \frac{\eta T}{2} \end{align*}

For the bandit case we replace \ell_t by \tilde{\ell}_t in the exponential weights strategy, where

    \[\tilde{\ell}_t(i) = \frac{\ell_t(I_t)}{p_t(i)} {1}\{i = I_t\} .\]

The resulting strategy is called Exp3. The key property of \tilde{\ell}_t is that it is an unbiased estimator of \ell_t:

    \[\mathbb{E}_{I_t \sim p_t} \tilde{\ell}_t(i) = \ell_t(i) .\]

Furthermore with the analysis described above one gets

    \[\overline{R}_T \leq \frac{\log(n)}{\eta} + \frac{\eta}{2} \mathbb{E} \sum_{t} \mathbb{E}_{I \sim p_t} \tilde{\ell}_t(I)^2 .\]

It only remains to control the variance term, and quite amazingly this is straightforward:

    \[\mathbb{E}_{I_t, I \sim p_t} \tilde{\ell}_t(I)^2 \leq \mathbb{E}_{I_t, I \sim p_t} \frac{{1}\{I = I_t\}}{p_t(I_t)^2} = n .\]

Thus with \eta = \sqrt{2 n \log(n) / T} one gets \overline{R}_T \leq \sqrt{2 T n \log(n)}.

Adversarial multi-armed bandit, going further

  • With the modified loss estimate \frac{\ell_t(I_t) {1}\{i = I_t\} + \beta}{p_t(I_t)} one can prove high probability bounds on R_T, and by integrating the deviations one can show \mathbb{E} R_T = O(\sqrt{T n \log(n)}).
  • The extraneous logarithmic factor in the pseudo-regret upper can be removed, see Audibert and Bubeck [2009]. Conjecture: one cannot remove the log factor for the expected regret, that is for any strategy there exists an adaptive adversary such that \mathbb{E} R_T = \Omega(\sqrt{T n \log(n)}).
  • T can be replaced by various measure of “variance” in the loss sequence, see e.g., Hazan and Kale [2009].
  • There exist strategies which guarantee simultaneously \overline{R}_T = \tilde{O}(\sqrt{T n}) in the adversarial model and \overline{R}_T = \tilde{O}(\sum_i \Delta_i^{-1}) in the i.i.d. model, see Bubeck and Slivkins [2012].
  • Many interesting variants: graph feedback structure of Mannor and Shamir [2011] (there is a graph on the set of arms, and when an arm is played one observes the loss for all its neighbors), regret with respect to best sequence of actions with at most S switches, switching cost (interestingly in this case the best regret is \Omega(T^{2/3}), see Dekel, Ding, Koren and Peres [2013]), and much more!


Bayesian multi-armed bandit, Thompson [1933]

Here we assume a set of “models” \{(\nu_1(\theta), \hdots, \nu_n(\theta)), \theta \in \Theta\} and prior distribution \pi_0 over \Theta. The Bayesian regret is defined as

    \[{BR}_T(\pi_0) = \mathbb{E}_{\theta \sim \pi_0} \overline{R}_T(\nu_1(\theta), \hdots, \nu_n(\theta)) ,\]

where \overline{R}_T(\nu) simply denotes the regret for the i.i.d. model when the underlying reward distributions are \nu_1,\hdots, \nu_n. In principle the strategy minimizing the Bayesian regret can be computed by dynamic programming on the potentially huge state space \mathcal{P}(\Theta). The celebrated Gittins index theorem gives sufficient condition to dramatically reduce the computational complexity of implementing the optimal Bayesian strategy under a strong product assumption on \pi_0. Notation: \pi_t denotes the posterior distribution on \theta at time t.

Theorem [Gittins [1979]]

Consider the product and \gamma-discounted case: \Theta= \times_{i} \Theta_i, \nu_i(\theta) :=\nu(\theta_i), \pi_0 = \otimes_{i} \pi_0(i), and furthermore one is interested in maximizing \mathbb{E} \sum_{t \geq 0} \gamma^t Y_t. The optimal Bayesian strategy is to pick at time s the arm maximizing the Gittins index:

    \[\sup \left\{ \lambda \in \mathbb{R} : \sup_{\tau} \mathbb{E} \left(\sum_{t < \tau} \gamma^t X_t + \frac{\gamma^{\tau}}{1-\gamma} \lambda \right) \geq \frac{1}{1-\gamma} \lambda \right\} ,\]

where the expectation is over (X_t) drawn from \nu(\theta) with \theta \sim \pi_s(i), and the supremum is taken over all stopping times \tau.

Note that the stopping time \tau in the Gittins index definition gives the optimal strategy for a 2-armed game, where one arm’s reward distribution is \delta_{\lambda} while the other arm reward’s distribution is \nu(\theta) with \pi_s(i) as a prior for \theta.

Proof: The following exquisite proof was discovered by Weber [1992]. Let

    \[\lambda_t(i):=\sup \left\{ \lambda \in \mathbb{R} : \sup_{\tau} \mathbb{E} \sum_{t < \tau} \gamma^t (X_t-\lambda) \geq 0 \right\}\]

be the Gittins index of arm i at time t, which we interpret as the maximum charge one is willing to pay to play arm i given the current information. The prevailing charge is defined as \overline{\lambda}_t(i) = \min_{s \leq t} \lambda_s(i) (i.e. whenever the prevailing charge is too high we just drop it to the fair level). We now make three simple observations which together conclude the proof:

  • The discounted sum of prevailing charge for played arms \sum_t \gamma^t \overline{\lambda}_t(I_t) is an upper bound (in expectation) on the discounted sum of rewards. Indeed the times at which the prevailing charge are updated are stopping times, and so between two such times (s,t) the expected sum of discounted reward is smaller than the discounted sum of the fair charge at time s which is equal to the prevailing charge at any time in [s,t-1].
  • Since the prevailing charge is nonincreasing, the discounted sum of prevailing charge is maximized if we always pick the arm with maximum prevailing charge. Also note that the sequence of prevailing charge (\overline{\lambda}_t(i))_{i,t} does not depend on the algorithm.
  • Gittins index does exactly 2. (since we stop playing an arm only at times at which the prevailing charge is updated) and in this case 1. is an equality. Q.E.D.


For much more (implementation for exponential families, interpretation as a multitoken Markov game, …) see Dumitriu, Tetali and Winkler [2003], Gittins, Glazebrook, Weber [2011], Kaufmann [2014].

Bayesian multi-armed bandit, Thompson Sampling (TS)

In machine learning we want (i) strategies that can deal with complicated priors, and (ii) guarantees for misspecified priors. This is why we have to go beyond the Gittins index theory.

In his 1933 paper Thompson proposed the following strategy: sample \theta' \sim \pi_t and play I_t \in \mathrm{argmax} \mu_i(\theta').

Theoretical guarantees for this highly practical strategy have long remained elusive. Recently Agrawal and Goyal [2012] and Kaufmann, Korda and Munos [2012] proved that TS with Bernoulli reward distributions and uniform prior on the parameters achieves \overline{R}_T = O \left( \sum_i \frac{\log(T)}{\Delta_i}\right) (note that this is the frequentist regret!). We also note that Liu and Li [2015] takes some steps in analyzing the effect of misspecification for TS.

Let me also mention a beautiful conjecture of Guha and Munagala [2014]: for product priors, TS is a 2-approximation to the optimal Bayesian strategy for the objective of minimizing the number of pulls on suboptimal arms.

Bayesian multi-armed bandit, Russo and Van Roy [2014] information ratio analysis

Assume a prior in the adversarial model, that is a prior over (\ell_1,\hdots, \ell_T) \in [0,1]^{n \times T}, and let \mathbb{E}_t denote the posterior distribution (given \ell_1(I_1),\hdots,\ell_{t-1}(I_{t-1})). We introduce

    \[r_t(i) = \mathbb{E}_t (\ell_t(i) - \ell_t({i}^*)) , \;\; \text{and} \;\; v_t(i) = \mathrm{Var}_t (\mathbb{E}_t(\ell_t(i) | {i}^*)) .\]

The key observation is that (recall that H(p) = - \sum_i p(i) \log(p(i)))

    \[\mathbb{E} \sum_{t \leq T} v_t(I_t) \leq \frac12 H(i^*)\]

Indeed, equipped with Pinsker’s inequality and basic information theory concepts one has (we denote \mathbbm{I}_t for the mutual information conditionally on everything up to time t, also \mathcal{L}_t(X) denotes the law of X conditionally on everything up to time t):

    \begin{eqnarray*} v_t(i) & = & \sum_{j} \pi_t(j) (\mathbb{E}_t(\ell_t(i)|i^*=j) - \mathbb{E}_t(\ell_t(i)))^2 \\ & \leq & \frac12 \sum_{j} \pi_t(j) \mathrm{Ent}(\mathcal{L}_t(\ell_t(i) | i^* = j) \Vert \mathcal{L}_t(\ell_t(i))) \\ & = & \frac12 \mathbbm{I}_t(\ell_t(i), i^*) = \frac12 (H_t(i^*) - H_t(i^* | \ell_t(i))) . \end{eqnarray*}

Thus \mathbb{E} v_t(I_t) \leq \frac12 \mathbb{E} (H_t(i^*) - H_{t+1}(i^*)) which gives the claim thanks to a telescopic sum. We will use this key observation as follows (we give a sequence of implications leading to a regret bound so that all that is left to do is to check that the first statement in this sequence is true for TS):

    \begin{align*} & \forall t, \mathbb{E}_{t} r_t(I_t) \leq \sqrt{C \ \mathbb{E}_{t} v_t(I_t)} \\ & \Rightarrow \;\; \mathbb{E} \sum_{t=1}^T r_t(I_t) \leq \sum_{t=1}^T \sqrt{C \ \mathbb{E} v_t(I_t)} \\ & \Rightarrow \;\; {BR}_T \leq \sqrt{C \ T \ H(i^*) / 2} . \end{align*}

Thus writing \bar{\ell}_t(i) = \mathbb{E}_t \ell_t(i) and \bar{\ell}_t(i,j) = \mathbb{E}_t (\ell_t(i) | i^*=j) we have

    \begin{align*} & \mathbb{E}_{t} r_t(I_t) \leq \sqrt{C \ \mathbb{E}_{t} v_t(I_t)} \\ & \Leftrightarrow \mathbb{E}_{I_t} \bar{\ell}_t(I_t) - \sum_i \pi_t(i) \bar{\ell}_t(i,i) \leq \sqrt{C \ \mathbb{E}_{I_t}\sum_j \pi_t(j) (\bar{\ell}_t(I_t,j) - \bar{\ell}_t(I_t))^2} \end{align*}

For TS the following shows that one can take C=n:

    \begin{eqnarray*} \mathbb{E}_{I_t} \bar{\ell}_t(I_t) - \sum_i \pi_t(i) \bar{\ell}_t(i,i) & = & \sum_i \pi_t(i) (\bar{\ell}_t(i) - \bar{\ell}_t(i,i)) \\ & \leq & \sqrt{n \sum_i \pi_t(i)^2 (\bar{\ell}_t(i) - \bar{\ell}_t(i,i))^2} \\ & \leq & \sqrt{n \sum_{i,j} \pi_t(i) \pi_t(j) (\bar{\ell}_t(i) - \bar{\ell}_t(i,j))^2} . \end{eqnarray*}

Thus TS always satisfies BR_T \leq \sqrt{T n H(i^*)} \leq \sqrt{T n \log(n)}. Side note: by the minimax theorem this implies the existence of a strategy for the oblivious adversarial model with regret \sqrt{T n \log(n)} (of course we already proved that such a strategy exist, in fact we even constructed one via exponential weights, but the point is that the proof here does not require any “miracle” –yes exponential weights are kind of a miracle, especially when you consider how the variance of the unbiased estimator gets automatically controlled).


Summary of basic results

  • In the i.i.d. model UCB attains a regret of O \left( \sum_i \frac{\log(T)}{\Delta_i}\right) and by Lai and Robbins’ lower bound this is optimal (up to a multiplicative variance-like term).
  • In the adversarial model Exp3 attains a regret of O(\sqrt{T n \log(n)}) and this is optimal up to the logarithmic term.
  • In the Bayesian model, Gittins index gives an optimal strategy for the case of product priors. For general priors Thompson Sampling is a more flexible strategy. Its Bayesian regret is controlled by the entropy of the optimal decision. Moreover TS with an uninformative prior has frequentist guarantees comparable to UCB.
Posted in Optimization, Probability theory | 6 Comments

COLT 2016 accepted papers

Like previous years (2014, 2015) I have compiled the list of accepted papers for this year’s edition of COLT together with links to the arxiv submission whenever I could find one. These 63 papers were selected from about 200 submissions (a healthy 11% increase in terms of submissions from last year).

COLT 2016 accepted papers

Posted in Conference/workshop | 1 Comment

Notes on least-squares, part I

These are mainly notes for myself, but I figured that they might be of interest to some of the blog readers too. Comments on what is written below are most welcome!

Let (X, Y) \in \mathbb{R}^d \times \mathbb{R} be a pair of random variables, and let f: \mathbb{R}^d \rightarrow \mathbb{R} be the convex quadratic function defined by

    \[f(w) = \frac12 \mathbb{E} (X^{\top} w - Y)^2 .\]

We are interested in finding a minimizer w^* of f (we will also be satisfied with w such that f(w) - f(w^*) \leq \epsilon). Denoting \Sigma = \mathbb{E} X X^{\top} for the covariance matrix of the features (we refer to X as a feature vector) and \mu = \mathbb{E} Y X, one has w^* = \Sigma^{-1} \mu (if \Sigma is not invertible then the inverse should be understood as the Moore-Penrose pseudo-inverse, in which case w^* is the smallest-norm element of the affine subspace of minimizers of f). The model of computation we are interested in corresponds to large-scale problems where one has access to an unlimited stream of i.i.d. copies of (X, Y). Thus the only constraint to find w^* is our computational ability to process samples. Furthermore we want to be able to deal with large dimensional problem where \Sigma can be extremely ill-conditioned (for instance the smallest eigenvalue can be 0, or very close to 0), and thus bounds that depend on the condition number of \Sigma (ratio of largest eigenvalue to smallest eigenvalue) are irrelevant for us.

Let us introduce a few more notation before diving into the comparison of different algorithms. With the well-specified model in mind (i.e., Y = X^{\top} w^* + \epsilon where \epsilon \sim \mathcal{N}(0, \sigma^2) is independent of X) we define the noise level \sigma of the problem by assuming that \mathbb{E} \ (X^{\top} w^* - Y)^2 X X^{\top} \preceq \sigma^2 \Sigma. For simplicity we will assume (whenever necessary) that almost surely one has |X| \leq R and |Y| \leq B. All our iterative algorithms will start at w_0 = 0. We will also not worry too much about numerical constants and write \lesssim for an inequality true up to a numerical constant. Finally for the computational complexity we will count the number of elementary d-dimensional vector operations that one has to do (e.g. adding/scaling such vectors, or requesting a new i.i.d. copy of (X,Y)).


The most natural approach (also the most terrible as it will turn out) is to take a large sample (x_i, y_i)_{i \in [n]} and then minimize the corresponding empirical version \hat{f} of f, or in other words compute \hat{w} = \left(\mathbb{X} \mathbb{X}^{\top} \right)^{-1} \mathbb{X} \mathbb{Y} where \mathbb{Y} \in \mathbb{R}^n is the column vector (y_i) and \mathbb{X} \in \mathbb{R}^{d \times n} is the matrix whose i^{th} column is x_i. Since \hat{f} is a quadratic function, we can use the conjugate gradient algorithm (see for example Section 2.4 here) to get \hat{w} in d iterations. However each iteration (basically computing a gradient of \hat{f}) cost O(n) elementary vector operations, thus leading overall to O(n d) elementary vector operations (which, perhaps surprisingly, is also the cost of just forming the empirical covariance matrix \hat{\Sigma} = \mathbb{X} \mathbb{X}^{\top}). What number n of samples should one take? The answer (at least for the well-specified model) comes from a standard statistical analysis (e.g. with the Rademacher complexity described here, see also these nice lecture notes by Philippe Rigollet) which yield:

    \[\mathbb{E} \hat{f}(\hat{w}) - f(w^*) \lesssim \frac{\sigma^2 \mathrm{rank}(\Sigma)}{n} .\]

(Note that in the above bound we evaluate \hat{w} on \hat{f} instead of f, see the comments section for more details on this issue. Since this is not the main point of the post I overlook this issue in what follows.) Wrapping up we see that this “standard” approach solves our objective by taking n= \sigma^2 d / \epsilon samples and performing O(\sigma^2 d^2 / \epsilon) elementary vector operations. This d^2 number of elementary vector operations completely dooms this approach when d is very large. We note however that the statistical bound above can be shown to be minimax optimal, that is any algorithm has to use at least \sigma^2 d / \epsilon samples (in the worst case over all \Sigma) in order to satisfy our objective. Thus the only hope to do better than the above approach is to somehow process the data more efficiently.

A potential idea to trade a bit of the nice statistical property of \hat{w} for some computational easiness is to add a regularization term to \hat{f} in order to make the problem better conditioned (recall that conjugate gradient reaches an \epsilon-optimal point in \sqrt{\kappa} \log(1/\epsilon) iterations, where \kappa is the condition number).

In other words we would like to minimize, for some values of n and \lambda to be specified,

    \[\frac{1}{2n} \sum_{i=1}^n (x_i^{\top} w - y_i)^2 + \frac{\lambda}{2} |w|^2 .\]

One can write the minimizer as \hat{w}(\lambda) = \left(\mathbb{X} \mathbb{X}^{\top} + n \lambda \mathrm{I}_d \right)^{-1} \mathbb{X} \mathbb{Y}, and this point is now obtained at a cost of \tilde{O} \left(n \sqrt{\frac{\|\Sigma\|}{\lambda}}\right) elementary vector operations (\|\Sigma\| denotes the spectral norm of \Sigma and we used the fact that with high probability \|\hat{\Sigma}\| \lesssim \|\Sigma\|). Here one can be even smarter and use accelerated SVRG instead of conjugate gradient (for SVRG see Section 6.3 here, and here or here for how to accelerate it), which will reduce the number of elementary vector operations to \tilde{O}\left(n +\sqrt{n \frac{\|\Sigma\|}{\lambda}} \right). Also observe that again an easy argument based on Rademacher complexity leads to

\mathbb{E} f(\hat{w}(\lambda)) - f(w^*) \lesssim \lambda |w^*|^2 + \frac{\sigma^2 d}{n}. Thus we want to take \lambda of order \frac{\sigma^2 d}{n |w^*|^2} which in terms of \epsilon is of order \epsilon/ |w^*|^2 leading to an overall number of vector operations of

    \[\frac{\sigma^2 d}{\epsilon} + \frac{|w^*| \sqrt{\|\Sigma\| \sigma^2 d}}{\epsilon} .\]

Note that, perhaps surprisingly, anything less smart than accelerated SVRG would lead to a worse dependency on \epsilon than the one obtained above (which is also the same as the one obtained by the “standard” approach). Finally we observe that at a high level the two terms in the above bound can be viewed as “variance” and “bias” terms. The variance term comes from the inherent statistical limitation while the bias comes from the approximate optimization. In particular it makes sense that |w^*| appears in this bias term as it is the initial distance to the optimum for the optimization procedure.

Standard SGD

Another approach altogether, which is probably the most natural approach for anyone trained in machine learning, is to run the stochastic gradient algorithm. Writing g_i(w) = x_i (x_i^{\top} w - y_i) (this is a stochastic gradient, that is it satisfies \mathbb{E}[g_i(w) | w] = \nabla f(w)) we recall that SGD can be written as:

    \[w_{i+1} = w_i - \eta g_i = (\mathrm{I}_d - \eta x_i x_i^{\top}) w_i + \eta y_i x_i .\]

The most basic guarantee for SGD (see Theorem 6.1 here) gives that with step size \eta of order 1/\sqrt{n}, and assuming to simplify (see below for how to get around this) that |w^*| is known up to a numerical constant so that if w_{i+1} steps outside the ball of that radius then we project it back on it, one has (we also note that \mathbb{E}( |g_i(w)|^2 | w) \leq (B + |w| R)^2 \mathrm{Tr}(\Sigma), and we denote \bar{\sigma}^2 = (B + |w^*| R)^2):

    \[\mathbb{E} f\left(\frac1{n} \sum_{i=1}^n w_i\right) - f(w^*) \lesssim \sqrt{\frac{|w^*|^2 \bar{\sigma}^2 \mathrm{Tr}(\Sigma)}{n}} .\]

This means that roughly with SGD the computational cost for our main task corresponds O(|w^*|^2 \bar{\sigma}^2 \mathrm{Tr}(\Sigma) / \epsilon^2) vector operations. This is a fairly weak result compared to accelerated SVRG mainly because of: (i) dependency in \epsilon^{-2} instead of \epsilon^{-1}, and (ii) \bar{\sigma}^2 instead of \sigma^2. We will fix both issues, but let us also observe that on the positive side the term d \|\Sigma\| in accelerated SVRG is replaced by the less conservative quantity \mathrm{Tr}(\Sigma) (that is one replaces the largest eigenvalue of \Sigma by the average eigenvalue of \Sigma).

Let us see quickly check whether one can improve the performance of SGD by regularization. That is we consider applying SGD on f(w) + \frac{\lambda}{2} |w|^2 with stochastic gradient g_i(w) = (\lambda \mathrm{I}_d + x_i x_i^{\top}) w - y_i x_i which leads to (see Theorem 6.2 here):

    \[\mathbb{E} f\left(\frac1{n} \sum_{i=1}^n w_i\right) - f(w^*) \lesssim \lambda |w^*|^2 + \frac{\bar{\sigma}^2 \mathrm{Tr}(\Sigma)}{\lambda n} .\]

Optimizing over \lambda we see that we get the same rate as without regularization. Thus, unsurprisingly, standard SGD fares poorly compared to the more sophisticated approach based on accelerated SVRG described above. However we will see next that there is a very simple fix to make SGD competitive with accelerated SVRG, namely just tune up the learning rate!

Constant step size SGD (after Bach and Moulines 2013)

SGD with a constant step size \eta of order of a constant times 1/R^2 satisfies

    \[\mathbb{E} f\left(\frac1{n} \sum_{i=1}^n w_i\right) - f(w^*) \lesssim \frac{\sigma^2 d + R^2 |w^*|^2}{n}.\]

We see that this result gives a similar bias/variance decomposition as the one we saw for accelerated SVRG. In particular the variance term \sigma^2 d / n is minimax optimal, while the bias term R^2 |w^*|^2 / n matches what one would expect for a basic first order method (this term is not quite optimal as one would expect that a decrease in 1/n^2 is possible since this is the optimal rate for first-order optimization of a smooth quadratic function -without strong convexity-, and indeed a recent paper of Dieuleveut, Flammarion and Bach tackles this issue).

In part II I hope to give the intuition/proof of the above result, and in part III I will discuss other aspects of the least squares problem (dual methods, random projections, sparsity).

Posted in Optimization | 3 Comments