COLT 2016 videos

This is a one-sentence post for the set {blog followers}\{Twitter/Facebook followers} to advertise the COLT 2016 videos which have just been posted!

Posted in Uncategorized

Notes on least-squares, part II

As promised at the end of Part I, I will give some intuition for the Bach and Moulines analysis of constant step size SGD for least-squares. Let us recall the setting. Let (X, Y) \in \mathbb{R}^d \times \mathbb{R} be a pair of random variables with Y=X^{\top} w^* + \epsilon where \mathbb{E}[\epsilon X] = 0. Let f: \mathbb{R}^d \rightarrow \mathbb{R} be the convex quadratic function defined by

    \[f(w) = \frac12 \mathbb{E} (X^{\top} w - Y)^2 .\]

Observe that, with the notation \Sigma =\mathbb{E} X X^{\top} and \|x\|_{\Sigma} = \sqrt{x^{\top} \Sigma x} one has

    \[f(w) - f(w^*) = \frac12 \|w - w^*\|_{\Sigma}^2 .\]

We assume that \mathbb{E}[ |X|^2 X X^{\top} ] \preceq R^2 \Sigma (in particular this implies \mathbb{E} |X|^2 = \mathrm{Tr}(\Sigma) \leq R^2) and \mathbb{E}[ \epsilon^2 X X^{\top} ] \preceq \sigma^2 \Sigma. For instance these assumptions are respectively satisfied when |X| \leq R almost surely, and when \epsilon \sim \mathcal{N}(0, \sigma^2) is independent of X. Recall that constant step-size SGD on f is defined by the following iteration (where (x_t, y_t)_{t \geq 1} is a sequence of i.i.d. copies of (X,Y), and with the convention w_1 := w_0 for the initial point), for t \geq 1,

    \[w_{t+1} = w_{t} - \eta (x_t^{\top} w_{t} - y_t) x_t .\]

Our goal is to show that for \eta = \Theta(1/R^2) one has (for a sequence (u_t)_{t \in [n]} we denote by \bar{u}_n its average \frac{1}{n} \sum_{t=1}^n u_t)

    \[\mathbb{E} \|\bar{w}_n - w^*\|_{\Sigma}^2 \lesssim \frac{\sigma^2 d}{n} + \frac{|w_0 - w^*|^2}{n} .\]
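To make this concrete, here is a toy simulation of averaged constant step-size SGD (the Gaussian features, covariance, and all numerical choices below are made up for the illustration; this is a sanity check, not the Bach and Moulines setting in full generality):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 5, 20000, 0.5
w_star = rng.standard_normal(d)
Sigma = np.diag(np.linspace(0.1, 1.0, d))   # covariance of X (toy choice)

# i.i.d. samples (x_t, y_t) with y_t = x_t^T w_star + noise
X = rng.standard_normal((n, d)) * np.sqrt(np.diag(Sigma))
y = X @ w_star + sigma * rng.standard_normal(n)

R2 = np.trace(Sigma)     # crude proxy for R^2 (here E|X|^2 = Tr(Sigma))
eta = 1 / (2 * R2)       # constant step size, eta = Theta(1/R^2)

w = np.zeros(d)
w_bar = np.zeros(d)
for t in range(n):
    w = w - eta * (X[t] @ w - y[t]) * X[t]   # SGD iteration
    w_bar += w / n                           # average of the iterates

excess = (w_bar - w_star) @ Sigma @ (w_bar - w_star)
print(excess, sigma**2 * d / n + R2 * (w_star @ w_star) / n)
```

On such a toy instance the excess risk of the averaged iterate should be of the order of the bound above (up to constants).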

Decomposition into a noise-free SGD and a pure noise SGD

With the notation b_t = y_t x_t, A_t=I_d - \eta x_t x_t^{\top}, and A_{i:j} = A_j \hdots A_i (with the convention A_{i:i-1} = I_d) one can write an iteration of SGD as

    \[w_{t+1} = A_{t} w_{t} + \eta b_{t} = A_{1:t} w_0 + \eta \sum_{k=1}^t A_{k+1:t} b_k .\]

Interestingly, since y_t = w_*^{\top} x_t + \epsilon_t, we have with \xi_t = \epsilon_t x_t,

    \[w_{t+1}-w_* = A_{t} (w_{t}-w_*) + \eta \xi_{t} = A_{1:t} (w_0-w_*) + \eta \sum_{k=1}^t A_{k+1:t} \xi_k.\]

In other words the convergence analysis decomposes into two terms: A_{1:t} (w_0-w_*) which corresponds to a noise-free SGD (it is as if we were running SGD with \epsilon_t = 0) and \sum_{k=1}^t A_{k+1:t} \xi_k which corresponds to a pure noise SGD (it is as if we were running SGD with w_0=w^*=0).

Noise-free SGD

The analysis of the noise-free SGD is the usual drill: let \Delta_t = w_t - w^* and recall that \Delta_{t+1} = \Delta_t - \eta x_t x_t^{\top} \Delta_t (recall that in the noise-free case one has \epsilon_t=0), thus

    \[|\Delta_{t+1}|^2 = |\Delta_t|^2 - 2 \eta \Delta_t^{\top} x_t x_t^{\top} \Delta_t + \eta^2 \Delta_t^{\top} (x_t x_t^{\top})^2 \Delta_t ,\]

which directly yields \mathbb{E} |\Delta_{t+1}|^2 \leq \mathbb{E}[ |\Delta_t|^2 - 2 \eta \|\Delta_t\|_{\Sigma}^2 + \eta^2 R^2 \|\Delta_t\|_{\Sigma}^2 ] and thus for \eta \leq 1/R^2 one gets \eta \mathbb{E} \|\Delta_t\|_{\Sigma}^2 \leq \mathbb{E}[ |\Delta_t|^2 - |\Delta_{t+1}|^2 ] which finally yields (by convexity of x \mapsto \|x\|_{\Sigma}^2)

    \[\mathbb{E} \|\bar{w}_n - w^*\|_{\Sigma}^2 = \mathbb{E} \|\bar{\Delta}_n\|_{\Sigma}^2 \leq\frac{|\Delta_1|^2}{n \eta} .\]

Thus from now on we can restrict our attention to the pure noise case, and the goal is to show that in this case one has \mathbb{E} \|\bar{w}_n - w^*\|_{\Sigma}^2 \lesssim \frac{\sigma^2 d}{n}.

Full gradient and pure noise

The pure noise case is the difficult part of the proof (this was to be expected: indeed in the SGD analysis 101 one usually cancels the effect of the noise by taking a very small step size). For this case Bach and Moulines (following in the footsteps of Polyak and Juditsky [1992]) suggest to first study a “full gradient” method, that is to replace the term x_t x_t^{\top} by \Sigma in the SGD iteration. More precisely the “full gradient” analysis amounts to studying the sequence \Delta_t^{(1)} defined by \Delta_0^{(1)} = 0 (recall that we are in the pure noise case and thus w_0= w^*) and

    \[\Delta_{t+1}^{(1)} =(I_d - \eta \Sigma) \Delta_t^{(1)} + \eta \xi_t .\]

We expect that \Delta_t^{(1)} will manage to stay pretty close to the optimum since we are doing full gradient steps, and thus it should only suffer the inherent variance of the problem (recall that the statistical lower bound for this problem is \sigma^2 d / n). It is actually easy to prove that this intuition is correct: first note that

(1)   \begin{equation*} \Delta_{t}^{(1)} = \eta \sum_{s=1}^t (I_d - \eta \Sigma)^{t-s} \xi_s , \end{equation*}

and thus for \eta \leq 1/R^2 (which implies \eta \Sigma \preceq I_d)

    \begin{align*} \bar{\Delta}_n^{(1)} & = \frac{\eta}{n} \sum_{t=1}^n \sum_{s=1}^t (I_d - \eta \Sigma)^{t-s} \xi_s\\ & = \frac{\eta}{n} \sum_{s=1}^n \sum_{k=0}^{n-s} (I_d - \eta \Sigma)^{k} \xi_s \\ & = \frac{\eta}{n} \sum_{s=1}^n (\eta \Sigma)^{-1} (I_d - (I_d - \eta \Sigma)^{n-s+1}) \xi_s , \end{align*}

which implies (the cross terms vanish since \xi_s is centered, and also recall that \mathbb{E} \xi_s \xi_s^{\top} \preceq \sigma^2 \Sigma)

(2)   \begin{equation*} \mathbb{E} \|\bar{\Delta}_n^{(1)}\|_{\Sigma}^2 = \frac{1}{n^2} \sum_{s=1}^n \mathbb{E} \| \Sigma^{-1} (I_d - (I_d - \eta \Sigma)^{n-s+1}) \xi_s \|_{\Sigma}^2 \leq \frac{\sigma^2 d}{n} . \end{equation*}
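Since (2) is an exact variance computation, it can be verified numerically on a small example: when \mathbb{E} \xi_s \xi_s^{\top} = \sigma^2 \Sigma exactly, each term is a deterministic trace, \mathbb{E} \|M_s \xi_s\|_{\Sigma}^2 = \sigma^2 \mathrm{Tr}(M_s^{\top} \Sigma M_s \Sigma) with M_s = \Sigma^{-1}(I_d - (I_d - \eta \Sigma)^{n-s+1}). A quick check (dimensions and step size made up):

```python
import numpy as np

d, n, sigma, eta = 4, 50, 0.7, 0.3
Sigma = np.diag([0.2, 0.5, 1.0, 2.0])   # toy covariance, chosen so that eta * Sigma <= I

I = np.eye(d)
total = 0.0
for s in range(1, n + 1):
    M = np.linalg.inv(Sigma) @ (I - np.linalg.matrix_power(I - eta * Sigma, n - s + 1))
    # E ||M xi||_Sigma^2 = sigma^2 Tr(M^T Sigma M Sigma) when E[xi xi^T] = sigma^2 Sigma
    total += sigma**2 * np.trace(M.T @ Sigma @ M @ Sigma)

lhs = total / n**2
print(lhs, sigma**2 * d / n)
```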

Difference between full gradient and SGD

Let us now see how different the sequence \Delta_t^{(1)} is from our real target \Delta_t:

(3)   \begin{align*} \Delta_{t+1} - \Delta_{t+1}^{(1)} & = (I_d - \eta x_t x_t^{\top}) \Delta_t - (I_d - \eta \Sigma) \Delta_t^{(1)} \notag \\ & = (I_d - \eta x_t x_t^{\top}) (\Delta_t - \Delta_t^{(1)}) + \eta (\Sigma - x_t x_t^{\top}) \Delta_t^{(1)} . \end{align*}

In words we see that the difference between the two sequences satisfies exactly the same recursion as SGD, except that the noise term \xi_t is replaced by (\Sigma - x_t x_t^{\top}) \Delta_t^{(1)}. This is pretty good news since we expect \Delta_t^{(1)} to be of order \eta: indeed \Delta_t^{(1)} is a full gradient method starting from 0, so it should stay close to 0 up to the noise term \eta \xi_t; again this intuition is easy to make precise since with (1) one sees easily that \mathbb{E}[ \Delta_t^{(1)} \Delta_t^{(1) \top} ] \preceq \eta \sigma^2 I_d. In other words while \Delta_t corresponds to SGD with noise \xi_t (the covariance of this noise is bounded by \sigma^2 \Sigma), we see that \Delta_t - \Delta_t^{(1)} corresponds to SGD with noise (\Sigma - x_t x_t^{\top}) \Delta_t^{(1)} (the covariance of this new noise is bounded by \eta R^2 \sigma^2 \Sigma). Thus the new noise term is a factor \eta R^2 smaller. In particular we expect to have

    \[\mathbb{E} \|\bar{\Delta}_n - \bar{\Delta}_n^{(1)}\|_{\Sigma}^2 \lesssim \eta \times \eta R^2 \times \sigma^2 R^2\]

(this comes from the usual decomposition of the “average regret” for SGD as the squared distance to the optimum divided by n \eta –which is 0 here– plus \eta times the average of the squared norm of the noise –which is \frac{\eta}{n} \sum_{t=1}^n \mathbb{E} |(\Sigma - x_t x_t^{\top}) \Delta_{t}^{(1)}|^2 \leq \eta R^2 \times \sigma^2 R^2.)

Not quite there but close

So far we have obtained for the pure noise case:

    \[\mathbb{E} \|\bar{\Delta}_n\|_{\Sigma}^2 \leq \mathbb{E} \|\bar{\Delta}_n^{(1)}\|_{\Sigma}^2 + \mathbb{E} \|\bar{\Delta}_n - \bar{\Delta}_n^{(1)}\|_{\Sigma}^2 \lesssim \frac{\sigma^2 d}{n} + \eta \times \eta R^2 \times \sigma^2 R^2 .\]

We fall short of what we were hoping to prove (which is \mathbb{E} \|\bar{\Delta}_n\|_{\Sigma}^2 \lesssim \frac{\sigma^2 d}{n}) because \Delta_t^{(1)} is not close enough to \Delta_t. Morally the above shows that \Delta_t^{(1)} can be viewed as a 1^{st} order approximation to \Delta_t, with a remainder term of order \eta R^2. What about continuing this expansion of \Delta_t as a polynomial in \eta R^2? Let us see what the next term \Delta_t^{(2)} should look like. First we hope that this term will also satisfy (2) and thus it should correspond to a full gradient method. On the other hand the noise term in this full gradient method should absorb the noise term in (3) so that we can hope that the noise term in the new (3) will be smaller. This leads to:

(4)   \begin{equation*} \Delta_{t+1}^{(2)} = (I_d - \eta \Sigma) \Delta_t^{(2)} + \eta (\Sigma - x_t x_t^{\top}) \Delta_t^{(1)} , \end{equation*}

and thus (3) rewrites as

    \[\Delta_{t+1} - \Delta_{t+1}^{(1)} - \Delta_{t+1}^{(2)} = (I_d - \eta x_t x_t^{\top}) \left(\Delta_t - \Delta_t^{(1)} - \Delta_t^{(2)}\right) + \eta (\Sigma - x_t x_t^{\top}) \Delta_t^{(2)} .\]

In fact there is no reason to stop there, and for any k \geq 1 one can define

    \[\Delta_{t+1}^{(k)} = (I_d - \eta \Sigma) \Delta_t^{(k)} + \eta (\Sigma - x_t x_t^{\top}) \Delta_t^{(k-1)} ,\]

which leads to

    \[\Delta_{t+1} - \sum_{k=1}^r \Delta_{t+1}^{(k)} = (I_d - \eta x_t x_t^{\top}) \left(\Delta_t - \sum_{k=1}^r \Delta_t^{(k)} \right) + \eta (\Sigma - x_t x_t^{\top}) \Delta_t^{(r)} .\]

Denoting by \lambda_k a positive number such that the covariance matrix of (\Sigma - x_t x_t^{\top}) \Delta_t^{(k-1)} (which is the noise term in the “full gradient with noise” recursion corresponding to both \Delta_t^{(k)} and \Delta_{t} - \sum_{s=1}^k \Delta_{t}^{(s)}) is bounded by \lambda_k \sigma^2 \Sigma, one gets with the same arguments as above

    \[\mathbb{E} \left\|\bar{\Delta}_{n} - \sum_{k=1}^r \bar{\Delta}_{n}^{(k)} \right\|_{\Sigma}^2 \lesssim \eta \times \lambda_r \times \sigma^2 R^2\]

and

    \[\mathbb{E} \|\bar{\Delta}_n^{(k)}\|_{\Sigma}^2 \lesssim \frac{\lambda_k \sigma^2 d}{n} .\]

In particular to conclude it suffices to show that \lambda_k is geometrically decreasing. In fact an easy induction shows that one can take \lambda_k = (\eta R^2)^k (we already did the base case k=1).
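For completeness, here is a sketch of the induction step (my own rendering, using only the assumption \mathbb{E}[ |X|^2 X X^{\top} ] \preceq R^2 \Sigma): assuming \lambda_k = (\eta R^2)^k, the same computation as for \Delta^{(1)} gives \mathbb{E}[ \Delta_t^{(k)} \Delta_t^{(k) \top} ] \preceq \eta \lambda_k \sigma^2 I_d, and since x_t is independent of \Delta_t^{(k)},

    \begin{align*} \mathbb{E}\left[ (\Sigma - x_t x_t^{\top}) \Delta_t^{(k)} \Delta_t^{(k) \top} (\Sigma - x_t x_t^{\top}) \right] & \preceq \eta \lambda_k \sigma^2 \, \mathbb{E}\left[ (\Sigma - x_t x_t^{\top})^2 \right] \\ & = \eta \lambda_k \sigma^2 \left( \mathbb{E}[ |x_t|^2 x_t x_t^{\top} ] - \Sigma^2 \right) \\ & \preceq \eta \lambda_k \sigma^2 R^2 \Sigma , \end{align*}

so that one can indeed take \lambda_{k+1} = \eta R^2 \lambda_k = (\eta R^2)^{k+1}.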


That’s it for today. In part III I will discuss how to use random projections to speed up computations for least squares in both the large sample regime n \gg d and in the high-dimensional regime d \gg n.

Posted in Optimization

Bandit theory, part II

These are the lecture notes for the second part of my minicourse on bandit theory (see here for Part 1).

The linear bandit problem, Auer [2002]

We will mostly study the following far-reaching extension of the n-armed bandit problem.

Known parameters: compact action set \mathcal{A} \subset \mathbb{R}^n, adversary’s action set \mathcal{L} \subset \mathbb{R}^n, number of rounds T.

Protocol: For each round t=1,2,\hdots, T, the adversary chooses a loss vector \ell_t \in \mathcal{L} and simultaneously the player chooses a_t \in \mathcal{A} based on past observations and receives a loss/observation Y_t = \ell_t^{\top} a_t.

Performance measure: The regret is

    \[R_T = \mathbb{E} \sum_{t=1}^T \ell_t^{\top} a_t - \min_{a \in \mathcal{A}} \mathbb{E} \sum_{t=1}^T \ell_t^{\top} a .\]

Other models: In the i.i.d. model we assume that there is some underlying \theta \in \mathcal{L} such that \mathbb{E}(Y_t | a_t) = \theta^{\top} a_t. In the Bayesian model we assume that we have a prior distribution \nu over the sequence (\ell_1,\hdots, \ell_T) (in this case the expectation in R_T is also over (\ell_1,\hdots,\ell_T)\sim\nu). Alternatively we could assume a prior over \theta.

Example: Part 1 was about \mathcal{A} = \{e_1,\hdots,e_n\} and \mathcal{L}=[0,1]^n. Another simple example is path planning: say you have a graph with n edges, and at each step one has to pick a path from some source node to a target node. The action set \mathcal{A} can be represented as a subset of the hypercube \{0,1\}^n. The adversary chooses delays on the edges, and the delay of the chosen path is the sum of the delays on the edges that the path visits (this is indeed a linear loss).

Assumption: unless specified otherwise we assume \mathcal{L}=\mathcal{A}^{\circ} :=\{\ell : \sup_{a \in \mathcal{A}} |\ell^{\top} a| \leq 1\}.

Other feedback model: in the case where \mathcal{A} \subset \{0,1\}^n one can assume that the loss \ell_t is observed at every coordinate i where a_t(i) = 1. This is the so-called semi-bandit feedback.

Thompson Sampling for linear bandit after Russo and Van Roy [2014]

Assume \mathcal{A}=\{a_1,\hdots,a_{|\mathcal{A}|}\}. Recall from Part 1 that TS satisfies

    \begin{align*} & \sum_i \pi_t(i) (\bar{\ell}_t(i) - \bar{\ell}_t(i,i)) \leq \sqrt{C \ \sum_{i,j} \pi_t(i) \pi_t(j) (\bar{\ell}_t(i,j) - \bar{\ell}_t(i))^2} \\ & \Rightarrow R_T \leq \sqrt{C \ T \ \log(|\mathcal{A}|) / 2} , \end{align*}

where \bar{\ell}_t(i) = \mathbb{E}_t \ell_t(i) and \bar{\ell}_t(i,j) = \mathbb{E}_t (\ell_t(i) | i^*=j).

Writing \bar{\ell}_t(i) = a_i^{\top}\bar{\ell}_t, \bar{\ell}_t(i,j) = a_i^{\top} \bar{\ell}_t^{j}, and (M_{i,j}) = \left(\sqrt{\pi_t(i) \pi_t(j)} a_i^{\top} (\bar{\ell}_t - \bar{\ell}_t^j) \right) we want to show that

    \[\mathrm{Tr}(M) \leq \sqrt{C} \|M\|_F .\]

Using the eigenvalue formula for the trace and the Frobenius norm one can see that \mathrm{Tr}(M)^2 \leq \mathrm{rank}(M) \|M\|_F^2. Moreover the rank of M is at most n since M = U V^{\top} where U, V \in \mathbb{R}^{|\mathcal{A}| \times n} (the i^{th} row of U is \sqrt{\pi_t(i)} a_i and for V it is \sqrt{\pi_t(i)} (\bar{\ell}_t - \bar{\ell}_t^i)).
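The two facts used here (\mathrm{Tr}(M)^2 \leq \mathrm{rank}(M) \|M\|_F^2, and \mathrm{rank}(M) \leq n when M = U V^{\top}) are easy to check numerically; here is a small sanity check with made-up matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
N, r = 50, 4                        # toy sizes: |A| = 50, dimension n = 4
U = rng.standard_normal((N, r))
V = rng.standard_normal((N, r))
M = U @ V.T                         # rank at most r by construction

tr, fro = np.trace(M), np.linalg.norm(M, "fro")
rank = np.linalg.matrix_rank(M)
print(tr**2, rank * fro**2, rank)
```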

Let us make some observations.

  1. TS satisfies R_T \leq \sqrt{n T \log(|\mathcal{A}|)}. To appreciate the improvement recall that without the linear structure one would get a regret of order \sqrt{|\mathcal{A}| T} and that \mathcal{A} can be exponential in the dimension n (think of the path planning example).
  2. Provided that one can efficiently sample from the posterior on \ell_t (or on \theta), TS just requires at each step one linear optimization over \mathcal{A}.
  3. TS’s regret bound is optimal in the following sense. W.l.o.g. one can assume |\mathcal{A}| \leq (10 T)^n and thus TS satisfies R_T = O(n \sqrt{T \log(T)}) for any action set. Furthermore one can show that there exist an action set and a prior such that for any strategy one has R_T = \Omega(n \sqrt{T}), see Dani, Hayes and Kakade [2008], Rusmevichientong and Tsitsiklis [2010], and Audibert, Bubeck and Lugosi [2011, 2014].

Adversarial linear bandit after Dani, Hayes, Kakade [2008]

Recall from Part 1 that exponential weights satisfies for any \tilde{\ell}_t such that \mathbb{E} \tilde{\ell}_t(i) = \ell_t(i) and \tilde{\ell}_t(i) \geq 0,

    \[R_T \leq \frac{\max_i \mathrm{Ent}(\delta_{i} \Vert p_{1})}{\eta} + \frac{\eta}{2} \mathbb{E} \sum_{t} \mathbb{E}_{I \sim p_t} \tilde{\ell}_t(I)^2 .\]

DHK08 proposed the following (beautiful) unbiased estimator for the linear case:

    \[\tilde{\ell}_t = \Sigma_t^{-1} a_t a_t^{\top} \ell_t \; \text{where} \; \Sigma_t = \mathbb{E}_{a \sim p_t} (a a^{\top}) .\]

Again, amazingly, the variance is automatically controlled:

    \[\mathbb{E}(\mathbb{E}_{a \sim p_t} (\tilde{\ell}_t^{\top} a)^2) = \mathbb{E} \tilde{\ell}_t^{\top} \Sigma_t \tilde{\ell}_t \leq \mathbb{E} a_t^{\top} \Sigma_t^{-1} a_t = \mathbb{E} \mathrm{Tr} (\Sigma_t^{-1} a_t a_t^{\top}) = n .\]

Up to the issue that \tilde{\ell}_t can take negative values this suggests the “optimal” \sqrt{n T \log(|\mathcal{A}|)} regret bound.
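For a finite action set, unbiasedness of this estimator can even be checked in exact arithmetic, since the expectation over a \sim p_t is a finite sum: \mathbb{E}_{a \sim p_t}[\Sigma_t^{-1} a a^{\top} \ell_t] = \Sigma_t^{-1} \Sigma_t \ell_t = \ell_t. A small check with made-up actions, distribution, and loss vector:

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 3, 6                                    # toy dimension and number of actions
A = rng.standard_normal((K, n))                # made-up action set
p = rng.random(K); p /= p.sum()                # made-up distribution p_t
ell = rng.standard_normal(n)                   # made-up loss vector

Sigma_t = sum(p[i] * np.outer(A[i], A[i]) for i in range(K))
Sinv = np.linalg.inv(Sigma_t)
# exact expectation of the estimator over the finite support of p_t
est_mean = sum(p[i] * Sinv @ np.outer(A[i], A[i]) @ ell for i in range(K))
print(est_mean, ell)
```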

Adversarial linear bandit, further development

  1. The non-negativity issue of \tilde{\ell}_t is a manifestation of the need for an added exploration. DHK08 used a suboptimal exploration which led to an additional \sqrt{n} in the regret. This was later improved in Bubeck, Cesa-Bianchi, and Kakade [2012] with an exploration based on John’s ellipsoid (the smallest ellipsoid containing \mathcal{A}). You can check this video for some more details on this.
  2. Sampling the exp. weights is usually computationally difficult, see Cesa-Bianchi and Lugosi [2009] for some exceptions.
  3. Abernethy, Hazan and Rakhlin [2008] proposed an alternative (beautiful) strategy based on mirror descent. The key idea is to use an n-self-concordant barrier for \mathrm{conv}(\mathcal{A}) as a mirror map and to sample points uniformly in Dikin ellipsoids. This method’s regret is suboptimal by a factor \sqrt{n} and the computational efficiency depends on the barrier being used.
  4. Bubeck and Eldan [2014]’s entropic barrier allows for a much more information-efficient sampling than AHR08. This gives another strategy with optimal regret which is efficient when \mathcal{A} is convex (and one can do linear optimization on \mathcal{A}). You can check this video for some more on the entropic barrier.

Adversarial combinatorial bandit after Audibert, Bubeck and Lugosi [2011, 2014]

Combinatorial setting: \mathcal{A} \subset \{0,1\}^n, \max_a \|a\|_1 = m, \mathcal{L} = [0,1]^n.

  1. Full information case goes back to the end of the 90’s (Warmuth and co-authors), semi-bandit and bandit were introduced in Audibert, Bubeck and Lugosi [2011] (following several papers that studied specific sets \mathcal{A}).
  2. This is a natural setting to study FPL-type (Follow the Perturbed Leader) strategies, see e.g. Kalai and Vempala [2004] and more recently Devroye, Lugosi and Neu [2013].
  3. ABL11: Exponential weights is provably suboptimal in this setting! This is in sharp contrast with the case where \mathcal{L} = \mathcal{A}^{\circ}.
  4. Optimal regret in the semi-bandit case is \sqrt{m n T} and it can be achieved with mirror descent and the natural unbiased estimator for the semi-bandit situation.
  5. For the bandit case the bound for exponential weights given above yields m \sqrt{m n T} (you should read this as “range of the loss times the square root of dimension times time times log-size of the set”). However the lower bound from ABL14 is m \sqrt{n T}, which is conjectured to be tight.

Preliminaries for the i.i.d. case: a primer on least squares

Assume Y_t = \theta^{\top} a_t + \xi_t where (\xi_t) is an i.i.d. sequence of centered and sub-Gaussian real-valued random variables. The (regularized) least squares estimator for \theta based on \mathbb{Y}_t = (Y_1, \hdots, Y_{t-1})^{\top} is, with \mathbb{A}_t = (a_1 \hdots a_{t-1}) \in \mathbb{R}^{n \times (t-1)} and \Sigma_t = \lambda I_n + \sum_{s=1}^{t-1} a_s a_s^{\top}:

    \[\hat{\theta}_t = \Sigma_t^{-1} \mathbb{A}_t \mathbb{Y}_t\]

Observe that we can also write \theta = \Sigma_t^{-1} (\mathbb{A}_t (\mathbb{Y}_t+\epsilon_t) + \lambda \theta) where \epsilon_t = (\mathbb{E}(Y_1|a_1)-Y_1, \hdots, \mathbb{E}(Y_{t-1}|a_{t-1})-Y_{t-1})^{\top} so that

    \[\|\theta-\hat{\theta}_t\|_{\Sigma_t} = \|\mathbb{A}_t \epsilon_t + \lambda \theta\|_{\Sigma_t^{-1}} \leq \|\mathbb{A}_t \epsilon_t\|_{\Sigma_t^{-1}} + \sqrt{\lambda} \|\theta\| .\]

A basic martingale argument (see e.g., Abbasi-Yadkori, Pal and Szepesvari [2011]) shows that w.p. \geq 1-\delta, \forall t \geq 1,

    \[\|\mathbb{A}_t \epsilon_t\|_{\Sigma_t^{-1}} \leq \sqrt{\mathrm{logdet}(\Sigma_t) + \log(1/(\delta^2 \lambda^n))} .\]

Note that \mathrm{logdet}(\Sigma_t) \leq n \log(\mathrm{Tr}(\Sigma_t)/n) \leq n \log(\lambda + t /n) (w.l.o.g. we assumed \|a_t\|\leq1).
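Here is a small numerical illustration of this confidence bound (unit-norm actions, Gaussian noise and all constants below are made-up illustrative choices; the rigorous bound involves the sub-Gaussian constant of the noise, which I simply take to be 1 here):

```python
import numpy as np

rng = np.random.default_rng(3)
n, T, lam, delta = 4, 2000, 1.0, 0.01
theta = rng.standard_normal(n)
theta /= np.linalg.norm(theta)                       # normalize so ||theta|| = 1

A_t = rng.standard_normal((T, n))
A_t /= np.linalg.norm(A_t, axis=1, keepdims=True)    # unit-norm actions
Y = A_t @ theta + 0.1 * rng.standard_normal(T)       # sub-Gaussian noise

Sigma_t = lam * np.eye(n) + A_t.T @ A_t
theta_hat = np.linalg.solve(Sigma_t, A_t.T @ Y)      # regularized least squares

err = theta - theta_hat
lhs = np.sqrt(err @ Sigma_t @ err)                   # ||theta - theta_hat||_{Sigma_t}
rhs = np.sqrt(np.linalg.slogdet(Sigma_t)[1] + np.log(1 / (delta**2 * lam**n))) + np.sqrt(lam)
print(lhs, rhs)
```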

i.i.d. linear bandit after DHK08, RT10, AYPS11

Let \beta = 2 \sqrt{n \log(T)}, and \mathcal{E}_t = \{\theta' : \|\theta' - \hat{\theta}_t\|_{\Sigma_t} \leq \beta\}. We showed that w.p. \geq 1-1/T^2 one has \theta \in \mathcal{E}_t for all t \in [T].

The appropriate generalization of UCB is to select: (\tilde{\theta}_t, a_t) = \mathrm{argmin}_{(\theta', a) \in \mathcal{E}_t \times \mathcal{A}} \theta'^{\top} a (this optimization is NP-hard in general, more on that below). Then one has on the high-probability event:

    \[\sum_{t=1}^T \theta^{\top} (a_t - a^*) \leq \sum_{t=1}^T (\theta-\tilde{\theta}_t)^{\top} a_t \leq \beta \sum_{t=1}^T \|a_t\|_{\Sigma_t^{-1}} \leq \beta \sqrt{T \sum_{t} \|a_t\|_{\Sigma_t^{-1}}^2}.\]

To control the sum of squares we observe that:

    \[\det(\Sigma_{t+1}) = \det(\Sigma_t) \det(I_n + \Sigma_t^{-1/2} a_t (\Sigma_t^{-1/2} a_t)^{\top}) = \det(\Sigma_t) (1+\|a_t\|^2_{\Sigma_t^{-1}})\]

so that (assuming \lambda \geq 1)

    \[\log\det(\Sigma_{T+1}) - \log\det(\Sigma_1) = \sum_t \log (1+\|a_t\|^2_{\Sigma_t^{-1}}) \geq \frac12 \sum_t \|a_t\|^2_{\Sigma_t^{-1}} .\]

Putting things together we see that the regret is O(n \log(T) \sqrt{T}).
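A minimal simulation of this optimistic strategy for a finite action set (all parameters below are made up; note that for a finite \mathcal{A} the inner minimization over the ellipsoid has the closed form \min_{\theta' \in \mathcal{E}_t} \theta'^{\top} a = \hat{\theta}_t^{\top} a - \beta \|a\|_{\Sigma_t^{-1}}, which sidesteps the NP-hardness issue):

```python
import numpy as np

rng = np.random.default_rng(4)
n, K, T, lam = 3, 10, 3000, 1.0
theta = np.array([0.5, -0.3, 0.2])             # made-up true parameter
A = rng.standard_normal((K, n))
A /= np.linalg.norm(A, axis=1, keepdims=True)  # unit-norm actions
beta = 2 * np.sqrt(n * np.log(T))

Sigma = lam * np.eye(n)
b = np.zeros(n)
loss = np.zeros(T)
for t in range(T):
    Sinv = np.linalg.inv(Sigma)
    theta_hat = Sinv @ b                       # regularized least squares estimate
    widths = np.sqrt(np.einsum("ij,jk,ik->i", A, Sinv, A))   # ||a||_{Sigma_t^{-1}}
    i = int(np.argmin(A @ theta_hat - beta * widths))        # optimism (losses)
    a = A[i]
    y = a @ theta + 0.1 * rng.standard_normal()              # noisy loss observation
    Sigma += np.outer(a, a)
    b += y * a
    loss[t] = a @ theta

regret = loss.sum() - T * (A @ theta).min()
print(regret / T)
```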

What’s the point of i.i.d. linear bandit?

So far we did not get any real benefit from the i.i.d. assumption (the regret guarantee we obtained is the same as for the adversarial model). To me the key benefit is in the simplicity of the i.i.d. algorithm which makes it easy to incorporate further assumptions.

  1. Sparsity of \theta: instead of regularization with the \ell_2-norm to define \hat{\theta} one could regularize with the \ell_1-norm, see e.g., Johnson, Sivakumar and Banerjee [2016] (see also Carpentier and Munos [2012] and Abbasi-Yadkori, Pal and Szepesvari [2012]).
  2. Computational constraint: instead of optimizing over \mathcal{E}_t to define \tilde{\theta}_t one could optimize over an \ell_1-ball containing \mathcal{E}_t (this would cost an extra \sqrt{n} in the regret bound).
  3. Generalized linear model: \mathbb{E}(Y_t | a_t) = \sigma(\theta^{\top} a_t) for some known increasing \sigma : \mathbb{R} \rightarrow \mathbb{R}, see Filippi, Cappe, Garivier and Szepesvari [2011].
  4. \log(T)-regime: if \mathcal{A} is finite (note that a polytope is effectively finite for us) one can get n^2 \log^2(T) / \Delta regret:

    \[R_T \leq \mathbb{E} \sum_{t=1}^T \frac{(\theta^{\top} (a_t - a^*))^2}{\Delta} \leq \frac{\beta^2}{\Delta} \mathbb{E} \sum_{t=1}^T \|a_t\|^2_{\Sigma_t^{-1}} \lesssim \frac{n^2 \log^2(T)}{\Delta} .\]

Some non-linear bandit problems

  1. Lipschitz bandit: Kleinberg, Slivkins and Upfal [2008, 2016], Bubeck, Munos, Stoltz and Szepesvari [2008, 2011], Magureanu, Combes and Proutiere [2014]
  2. Gaussian process bandit: Srinivas, Krause, Kakade and Seeger [2010]
  3. Convex bandit: see the videos by myself and Ronen Eldan here and our arxiv paper.

Contextual bandit

We now make the game-changing assumption that at the beginning of each round t a context x_t \in \mathcal{X} is revealed to the player. The ideal notion of regret is now:

    \[R_T^{\mathrm{ctx}} = \sum_{t=1}^T \ell_t(a_t) - \inf_{\Phi : \mathcal{X} \rightarrow \mathcal{A}} \sum_{t=1}^T \ell_t(\Phi(x_t)) .\]

Sometimes it makes sense to restrict the mapping from contexts to actions, so that the infimum is taken over some policy set \Pi \subset \mathcal{A}^{\mathcal{X}}.

As far as I can tell the contextual bandit problem is an infinite playground and there is no canonical solution (or at least not yet!). Thankfully all we have learned so far can give useful guidance in this challenging problem.

Linear model after embedding

A natural assumption in several application domains is to suppose linearity in the loss after a correct embedding. Say we know mappings (\phi_a)_{a \in \mathcal{A}} such that \mathbb{E}_t(\ell_t(a)) = \phi_a(x_t)^{\top} \theta for some unknown \theta \in \mathbb{R}^n (or in the adversarial case that \ell_t(a) = \ell_t^{\top} \phi_a(x_t)).

This is nothing but a linear bandit problem where the action set is changing over time. All the strategies we described are robust to this modification and thus in this case one can get a regret of \sqrt{n T \log(|\mathcal{A}|)} \lesssim n \sqrt{T \log(T)} (and for the stochastic case one can efficiently get n^{3/2} \sqrt{T}).

A much more challenging case is when the correct embedding \phi=(\phi_a)_{a \in \mathcal{A}} is only known to belong to some class \Phi. Without further assumptions on \Phi we are basically back to the general model. Also note that a natural impulse is to run “bandits on top of bandits”, that is first select some \phi_t \in \Phi and then select a_t based on the assumption that \phi_t is correct. We won’t get into this here, but let us investigate a related idea.

Exp4, Auer, Cesa-Bianchi, Freund and Schapire [2001]

One can play exponential weights on the set of policies with the following unbiased estimator (obvious notation: \ell_t(\pi) = \ell_t(\pi(x_t)), \pi_t \sim p_t, and a_t=\pi_t(x_t))

    \[\tilde{\ell}_t(\pi) = \frac{1\{\pi(x_t) = a_t\}}{\sum_{\pi' : \pi'(x_t) = a_t} p_t(\pi')} \ell_t(a_t) .\]

Easy exercise: R_T^{\mathrm{ctx}} \leq \sqrt{2 T |\mathcal{A}| \log(|\Pi|)} (indeed the relative entropy term is smaller than \log(|\Pi|) while the variance term is exactly |\mathcal{A}|).
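A minimal implementation sketch of Exp4 (the policies, contexts, and loss table below are all made up for the illustration; the estimator divides the observed loss by the probability of the played action, i.e. the total weight of the agreeing policies):

```python
import numpy as np

rng = np.random.default_rng(5)
T = 5000
policies = [{0: 0, 1: 0}, {0: 0, 1: 1}, {0: 2, 1: 1}]   # made-up policy set Pi
mean_loss = {0: np.array([0.2, 0.8, 0.9]),              # made-up Bernoulli loss means,
             1: np.array([0.9, 0.1, 0.8])}              # per context and action
N, n_actions = len(policies), 3
eta = np.sqrt(2 * np.log(N) / (T * n_actions))

w = np.zeros(N)                                         # log-weights over policies
for t in range(T):
    x = t % 2                                           # alternating contexts
    p = np.exp(w - w.max()); p /= p.sum()
    pi = rng.choice(N, p=p)                             # sample a policy, play its action
    a = policies[pi][x]
    loss = float(rng.random() < mean_loss[x][a])
    agree = np.array([policies[j][x] == a for j in range(N)])
    est = loss / p[agree].sum()                         # importance-weighted loss estimate
    w -= eta * np.where(agree, est, 0.0)

p = np.exp(w - w.max()); p /= p.sum()
print(p)
```

On this toy instance the weights should concentrate on the context-dependent policy, which has the smallest expected loss.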

The only issue with this strategy is that its computational complexity is linear in the size of the policy space, which might be huge. A year and a half ago a major paper by Agarwal, Hsu, Kale, Langford, Li and Schapire was posted, with a strategy obtaining the same regret as Exp4 (in the i.i.d. model) but which is also computationally efficient given an oracle for the offline problem (i.e., \min_{\pi \in \Pi} \sum_{t=1}^T \ell_t(\pi(x_t))). Unfortunately the algorithm is not simple enough yet to be included in these notes.

The statistician perspective, after Goldenshluger and Zeevi [2009, 2011], Perchet and Rigollet [2011]

Let \mathcal{X} \subset \mathbb{R}^d, \mathcal{A}=[n], and (x_t) i.i.d. from some \mu absolutely continuous w.r.t. the Lebesgue measure. The reward for playing arm a under context x is drawn from some distribution \nu_{a}(x) on [0,1] with mean function f_a(x), which is assumed to be \beta-Holder smooth. Let \Delta(x) be the “gap” function (the gap between the mean rewards of the best and second best arms at context x).

A key parameter is the proportion of contexts with a small gap. The margin assumption is that for some \alpha>0, one has

    \[\mu(\{x : \Delta(x) \in (0,\delta)\}) \leq C \delta^{\alpha}, \forall \delta \in (0,1] .\]

One can achieve a regret of order T \left(\frac{n \log(n)}{T} \right)^{\frac{\beta (\alpha+1)}{2\beta + d}}, which is optimal at least in the dependency on T. It can be achieved by running Successive Elimination on an adaptively refined partition of the space, see Perchet and Rigollet [2011] for the details.

The online multi-class classification perspective after Kakade, Shalev-Shwartz, and Tewari [2008]

Here the loss is assumed to be of the following very simple form: \ell_t(a) = 1\{a\neq a^*_t\}. In other words using the context x_t one has to predict the best action (which can be interpreted as a class) a_t^* \in [n].

KSST08 introduces the banditron, a bandit version of the multi-class perceptron for this problem. While with full information the online multi-class perceptron can be shown to satisfy a “regret” bound of order \sqrt{T}, the banditron attains only a regret of order T^{2/3}. See also Chapter 4 in Bubeck and Cesa-Bianchi [2012] for more on this.

Summary of advanced results

  1. The optimal regret for the linear bandit problem is \tilde{O}(n \sqrt{T}). In the Bayesian context Thompson Sampling achieves this bound. In the i.i.d. case one can use an algorithm based on the optimism in face of uncertainty together with concentration properties of the least squares estimator.
  2. The i.i.d. algorithm can easily be modified to be computationally efficient, or to deal with sparsity in the unknown vector \theta.
  3. Extensions/variants: semi-bandit model, non-linear bandit (Lipschitz, Gaussian process, convex).
  4. Contextual bandit is still a very active subfield of bandit theory.
  5. Many important things were omitted. Example: knapsack bandit, see Badanidiyuru, Kleinberg and Slivkins [2013].

Some open problems we discussed

  1. Prove the lower bound \mathbb{E} R_T = \Omega(\sqrt{T n \log(n)}) for the adversarial n-armed bandit with adaptive adversary.
  2. Guha and Munagala [2014] conjecture: for product priors, TS is a 2-approximation to the optimal Bayesian strategy for the objective of minimizing the number of pulls on suboptimal arms.
  3. Find a “simple” strategy achieving the Bubeck and Slivkins [2012] best of both worlds result (see Seldin and Slivkins [2014] for some partial progress on this question).
  4. For the combinatorial bandit problem, find a strategy with regret at most n^{3/2}\sqrt{T} (current best is n^2 \sqrt{T}).
  5. Is there a computationally efficient strategy for i.i.d. linear bandit with optimal n \sqrt{T} gap-free regret and with \log(T) gap-based regret?
  6. Is there a natural framework to think about “bandits on top of bandits” (while keeping \sqrt{T}-regret)?
Posted in Optimization, Probability theory

Bandit theory, part I

This week I’m giving two 90-minute lectures on bandit theory at MLSS Cadiz. Despite my 2012 survey with Nicolo I thought it would be a good idea to post my lecture notes here. Indeed while much of the material is similar, the style of a mini-course is quite different from the style of a survey. Also, bandit theory has surprisingly progressed since 2012 and many things can now be explained better. Finally in the survey we completely omitted the Bayesian model as we thought that we didn’t have much to add on this topic compared to existing sources (such as the 2011 book by Gittins, Glazebrook and Weber). For a mini-course this concern is irrelevant so I quickly discuss the famous Gittins index and its proof of optimality.

i.i.d. multi-armed bandit, Robbins [1952]

Known parameters: number of arms n and (possibly) number of rounds T \geq n.

Unknown parameters: n probability distributions \nu_1,\ldots,\nu_n on [0,1] with means \mu_1, \hdots, \mu_n (notation: \mu^* = \max_{i \in [n]} \mu_i).

Protocol: For each round t=1,2,\hdots, T, the player chooses I_t \in [n] based on past observations and receives a reward/observation Y_t \sim \nu_{I_t} (independently from the past).

Performance measure: The cumulative regret is the difference between the player’s accumulated reward and the maximum the player could have obtained had she known all the parameters,

    \[\overline{R}_T = T \mu^* - \mathbb{E} \sum_{t \in [T]} Y_t .\]

This problem models the fundamental tension between exploration and exploitation (one wants to pick arms that performed well in the past, yet one needs to make sure that no good option has been missed). Almost every week new applications are found that fit this simple framework and I’m sure you already have some in mind (the most popular one being ad placement on the internet).

i.i.d. multi-armed bandit: fundamental limitations

How small can we expect \overline{R}_T to be? Consider the 2-armed case where \nu_1 = Ber(1/2) and \nu_2 = Ber(1/2+ \xi \Delta) where \xi \in \{-1,1\} is unknown. Recall from Probability 101 (or perhaps 102) that with \tau expected observations from the second arm there is probability at least \exp(- \tau \Delta^2) of making the wrong guess on the value of \xi. Now let \tau(t) be the expected number of pulls of arm 2 up to time t when \xi=-1. One has

    \begin{eqnarray*} \overline{R}_T(\xi=+1) + \overline{R}_T(\xi = -1) & \geq & \Delta \tau(T) + \Delta \sum_{t=1}^T \exp(- \tau(t) \Delta^2) \\ & \geq & \Delta \min_{t \in [T]} (t + T \exp(- t \Delta^2)) \\ & \approx & \frac{\log(T \Delta^2)}{\Delta} . \end{eqnarray*}

We refer to Bubeck, Perchet and Rigollet [2013] for the details. The important message is that for \Delta fixed the lower bound is \frac{\log(T)}{\Delta}, while for the worst \Delta (which is of order 1/\sqrt{T}) it is \sqrt{T}. In the n-armed case this worst-case lower bound becomes \sqrt{T n} (see Auer, Cesa-Bianchi, Freund and Schapire [1995]). The \log(T) lower bound is slightly “harder” to generalize to the n-armed case (as far as I know there is no known finite-time lower bound of this type), but thankfully it was already all done 30 years ago. First some notation: let \Delta_i = \mu^* - \mu_i and let N_i(t) be the number of pulls of arm i up to time t. Note that one has \overline{R}_T = \sum_{i=1}^n \Delta_i \mathbb{E} N_i(T). For p, q \in [0,1] let

    \[\mathrm{kl}(p,q) := p \log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q} .\]

Theorem [Lai and Robbins [1985]]

Consider a strategy s.t. \forall a>0, we have \mathbb{E} N_i(T) = o(T^a) if \Delta_i>0. Then for any Bernoulli distributions,

    \[\liminf_{T \to +\infty} \frac{\overline{R}_T}{\log(T)} \geq \sum_{i : \Delta_i > 0} \frac{\Delta_i}{\mathrm{kl}(\mu_i,\mu^*)}.\]

Note that \frac{1}{2 \Delta_i} \geq \frac{\Delta_i}{\mathrm{kl}(\mu_i,\mu^*)} \geq \frac{\mu^*(1-\mu^*)}{2 \Delta_i} so up to a variance-like term the Lai and Robbins lower bound is \sum_{i :\Delta_i>0} \frac{\log(T)}{2 \Delta_i}. This lower bound holds more generally than just for Bernoulli distributions, see for example Burnetas and Katehakis [1996].
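A quick numeric sanity check of this sandwich takes a couple of lines (the values of \mu_i and \mu^* below are arbitrary):

```python
import math

def kl(p, q):
    """Bernoulli relative entropy kl(p, q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

mu_i, mu_star = 0.4, 0.7
delta = mu_star - mu_i
ratio = delta / kl(mu_i, mu_star)
# sandwich: 1/(2 delta) >= delta/kl(mu_i, mu^*) >= mu^*(1 - mu^*)/(2 delta)
print(1 / (2 * delta), ratio, mu_star * (1 - mu_star) / (2 * delta))
```

The left inequality is just Pinsker's inequality \mathrm{kl}(p,q) \geq 2(p-q)^2 in disguise.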

i.i.d. multi-armed bandit: fundamental strategy

Hoeffding’s inequality teaches us that with probability at least 1-1/T, \forall t \in [T], i \in [n],

    \[\mu_i \leq \frac{1}{N_i(t)} \sum_{s < t : I_s = i} Y_s + \sqrt{\frac{2 \log(T)}{N_i(t)}} =: \mathrm{UCB}_i(t).\]

The UCB (Upper Confidence Bound) strategy (Lai and Robbins [1985], Agrawal [1995], Auer, Cesa-Bianchi and Fischer [2002]) is:

    \[I_t \in \mathrm{argmax}_{i \in [n]} \mathrm{UCB}_i(t) .\]

The regret analysis is straightforward: on a 1-2/T probability event one has

    \[N_i(t) \geq 8 \log(T) / \Delta_i^2 \Rightarrow \mathrm{UCB}_i(t) < \mu^* \leq \mathrm{UCB}_{i^*}(t) ,\]

so that \mathbb{E} N_i(T) \leq 2 + 8 \log(T)/\Delta_i^2 and in fact

    \[\overline{R}_T \leq 2 + \sum_{i : \Delta_i >0} \frac{8 \log(T)}{\Delta_i} .\]
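To see the bound in action, here is a minimal simulation sketch of UCB on Bernoulli arms (the arm means and horizon are made up for illustration; the index uses the \sqrt{2\log(T)/N_i(t)} width from the display above):

```python
import numpy as np

def ucb(means, T, rng):
    """Simulate UCB on Bernoulli arms; returns the sequence of pulled arms."""
    n = len(means)
    counts = np.zeros(n)   # N_i(t): number of pulls of arm i so far
    sums = np.zeros(n)     # cumulative reward of arm i
    pulls = []
    for t in range(T):
        if t < n:
            i = t          # pull each arm once to initialize the indices
        else:
            index = sums / counts + np.sqrt(2 * np.log(T) / counts)
            i = int(np.argmax(index))
        r = float(rng.random() < means[i])   # Bernoulli(means[i]) reward
        counts[i] += 1
        sums[i] += r
        pulls.append(i)
    return np.array(pulls)

rng = np.random.default_rng(0)
pulls = ucb([0.2, 0.7, 0.3], T=5000, rng=rng)
print(np.bincount(pulls, minlength=3))   # the middle (best) arm dominates
```

As predicted by the analysis, the suboptimal arms receive on the order of \log(T)/\Delta_i^2 pulls each.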

i.i.d. multi-armed bandit: going further

  • The numerical constant 8 in the UCB regret bound can be replaced by 1/2 (which is the best one can hope for), and more importantly by slightly modifying the derivation of the UCB one can obtain the Lai and Robbins variance-like term (that is replacing \Delta_i by \mathrm{kl}(\mu_i,\mu^*)): see Cappe, Garivier, Maillard, Munos and Stoltz [2013].
  •  In many applications one is merely interested in finding the best arm (instead of maximizing cumulative reward): this is the best arm identification problem. For the fundamental strategies see Even-Dar, Mannor and Mansour [2006] for the fixed-confidence setting (see also Jamieson and Nowak [2014] for a recent short survey) and Audibert, Bubeck and Munos [2010] for the fixed budget setting. Key takeaway: one needs of order \mathbf{H}:=\sum_i \Delta_i^{-2} rounds to find the best arm.
  • The UCB analysis extends to sub-Gaussian reward distributions. For heavy-tailed distributions, say with 1+\epsilon moment for some \epsilon \in (0,1], one can get a regret that scales with \Delta_i^{-1/\epsilon} (instead of \Delta_i^{-1}) by using a robust mean estimator, see Bubeck, Cesa-Bianchi and Lugosi [2012].

 

Adversarial multi-armed bandit, Auer, Cesa-Bianchi, Freund and Schapire [1995, 2001]

For t=1,\hdots, T, the player chooses I_t \in [n] based on previous observations, and simultaneously an adversary chooses a loss vector \ell_t \in [0,1]^n. The player’s loss/observation is \ell_t(I_t). The regret and pseudo-regret are defined as:

    \[R_T = \max_{i \in [n]} \sum_{t\in[T]} (\ell_t(I_t) - \ell_t(i)), \;\; \overline{R}_T = \max_{i \in [n]} \mathbb{E} \sum_{t\in[T]} (\ell_t(I_t) - \ell_t(i)) .\]

Obviously \mathbb{E} R_T \geq \overline{R}_T and there is equality in the oblivious case (\equiv adversary’s choices are independent of the player’s choices). The case where \ell_1,\hdots, \ell_T is an i.i.d. sequence corresponds to the i.i.d. model we just studied. In particular we already know that \sqrt{T n} is a lower bound on the attainable pseudo-regret.

Adversarial multi-armed bandit, fundamental strategy

The exponential weights strategy for the full information case where \ell_t is observed at the end of round t is defined by: play I_t at random from p_t where

    \[p_{t+1}(i) = \frac{1}{Z_{t+1}} p_t(i) \exp(- \eta \ell_t(i)) .\]

In five lines one can show \overline{R}_T \leq \sqrt{2 T \log(n)} with p_1(i)=1/n and a well-chosen learning rate \eta (recall that \mathrm{Ent}(p \Vert q) = \sum_i p(i) \log(p(i)/q(i))):

    \begin{align*} & \mathrm{Ent}(\delta_j \Vert p_{t}) - \mathrm{Ent}(\delta_j \Vert p_{t+1}) = \log \frac{p_{t+1}(j)}{p_t(j)} = \log \frac{1}{Z_{t+1}} - \eta \ell_t(j) \\ &\psi_t:= \log \mathbb{E}_{I\sim p_t} \exp(- \eta (\ell_t(I) - \mathbb{E}_{I' \sim p_t} \ell_t(I'))) = \eta \mathbb{E}_{I' \sim p_t} \ell_t(I') + \log (Z_{t+1})\\ &\eta \sum_{t} \left(\sum_{i} p_t(i) \ell_t(i) - \ell_t(j)\right) = \mathrm{Ent}(\delta_j \Vert p_{1}) - \mathrm{Ent}(\delta_j \Vert p_{T+1}) + \sum_{t} \psi_t \\ & \text{Using that} \; \ell_t \geq 0 \; \text{one has} \; \psi_t \leq \frac{\eta^2}{2} \mathbb{E}_{I \sim p_t} \ell_t(I)^2 \; \text{thus} \; \overline{R}_T \leq \frac{\log(n)}{\eta} + \frac{\eta T}{2} \end{align*}

For the bandit case we replace \ell_t by \tilde{\ell}_t in the exponential weights strategy, where

    \[\tilde{\ell}_t(i) = \frac{\ell_t(I_t)}{p_t(i)} {1}\{i = I_t\} .\]

The resulting strategy is called Exp3. The key property of \tilde{\ell}_t is that it is an unbiased estimator of \ell_t:

    \[\mathbb{E}_{I_t \sim p_t} \tilde{\ell}_t(i) = \ell_t(i) .\]

Furthermore with the analysis described above one gets

    \[\overline{R}_T \leq \frac{\log(n)}{\eta} + \frac{\eta}{2} \mathbb{E} \sum_{t} \mathbb{E}_{I \sim p_t} \tilde{\ell}_t(I)^2 .\]

It only remains to control the variance term, and quite amazingly this is straightforward:

    \[\mathbb{E}_{I_t, I \sim p_t} \tilde{\ell}_t(I)^2 \leq \mathbb{E}_{I_t, I \sim p_t} \frac{{1}\{I = I_t\}}{p_t(I_t)^2} = n .\]

Thus with \eta = \sqrt{2 n \log(n) / T} one gets \overline{R}_T \leq \sqrt{2 T n \log(n)}.
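The Exp3 strategy fits in a few lines of code. The following sketch is purely illustrative (the loss sequence and horizon are made up), with log-domain weights for numerical stability:

```python
import numpy as np

def exp3(losses, eta, rng):
    """Run Exp3 on a T x n matrix of oblivious losses in [0,1]; return the total loss."""
    T, n = losses.shape
    log_w = np.zeros(n)             # log-weights of exponential weights, for stability
    total = 0.0
    for t in range(T):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()                # p_t
        i = rng.choice(n, p=p)      # play I_t ~ p_t, observe only losses[t, i]
        total += losses[t, i]
        tilde = np.zeros(n)
        tilde[i] = losses[t, i] / p[i]   # importance-weighted unbiased estimate
        log_w -= eta * tilde
    return total

rng = np.random.default_rng(0)
T, n = 3000, 5
losses = rng.random((T, n))
losses[:, 2] *= 0.5                 # arm 2 is the best on average
eta = np.sqrt(2 * n * np.log(n) / T)
alg_loss = exp3(losses, eta, rng)
best_loss = losses.sum(axis=0).min()
print(alg_loss - best_loss)         # realized regret; the bound sqrt(2 T n log n) is ~220 here
```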

Adversarial multi-armed bandit, going further

  • With the modified loss estimate \frac{\ell_t(I_t) {1}\{i = I_t\} + \beta}{p_t(I_t)} one can prove high probability bounds on R_T, and by integrating the deviations one can show \mathbb{E} R_T = O(\sqrt{T n \log(n)}).
  • The extraneous logarithmic factor in the pseudo-regret upper bound can be removed, see Audibert and Bubeck [2009]. Conjecture: one cannot remove the log factor for the expected regret, that is for any strategy there exists an adaptive adversary such that \mathbb{E} R_T = \Omega(\sqrt{T n \log(n)}).
  • T can be replaced by various measures of “variance” in the loss sequence, see e.g., Hazan and Kale [2009].
  • There exist strategies which guarantee simultaneously \overline{R}_T = \tilde{O}(\sqrt{T n}) in the adversarial model and \overline{R}_T = \tilde{O}(\sum_i \Delta_i^{-1}) in the i.i.d. model, see Bubeck and Slivkins [2012].
  • Many interesting variants: graph feedback structure of Mannor and Shamir [2011] (there is a graph on the set of arms, and when an arm is played one observes the loss for all its neighbors), regret with respect to best sequence of actions with at most S switches, switching cost (interestingly in this case the best regret is \Omega(T^{2/3}), see Dekel, Ding, Koren and Peres [2013]), and much more!

 

Bayesian multi-armed bandit, Thompson [1933]

Here we assume a set of “models” \{(\nu_1(\theta), \hdots, \nu_n(\theta)), \theta \in \Theta\} and prior distribution \pi_0 over \Theta. The Bayesian regret is defined as

    \[{BR}_T(\pi_0) = \mathbb{E}_{\theta \sim \pi_0} \overline{R}_T(\nu_1(\theta), \hdots, \nu_n(\theta)) ,\]

where \overline{R}_T(\nu) simply denotes the regret for the i.i.d. model when the underlying reward distributions are \nu_1,\hdots, \nu_n. In principle the strategy minimizing the Bayesian regret can be computed by dynamic programming on the potentially huge state space \mathcal{P}(\Theta). The celebrated Gittins index theorem gives a sufficient condition to dramatically reduce the computational complexity of implementing the optimal Bayesian strategy under a strong product assumption on \pi_0. Notation: \pi_t denotes the posterior distribution on \theta at time t.

Theorem [Gittins [1979]]

Consider the product and \gamma-discounted case: \Theta= \times_{i} \Theta_i, \nu_i(\theta) :=\nu(\theta_i), \pi_0 = \otimes_{i} \pi_0(i), and furthermore one is interested in maximizing \mathbb{E} \sum_{t \geq 0} \gamma^t Y_t. The optimal Bayesian strategy is to pick at time s the arm maximizing the Gittins index:

    \[\sup \left\{ \lambda \in \mathbb{R} : \sup_{\tau} \mathbb{E} \left(\sum_{t < \tau} \gamma^t X_t + \frac{\gamma^{\tau}}{1-\gamma} \lambda \right) \geq \frac{1}{1-\gamma} \lambda \right\} ,\]

where the expectation is over (X_t) drawn from \nu(\theta) with \theta \sim \pi_s(i), and the supremum is taken over all stopping times \tau.

Note that the stopping time \tau in the Gittins index definition gives the optimal strategy for a 2-armed game, where one arm’s reward distribution is \delta_{\lambda} while the other arm’s reward distribution is \nu(\theta) with \pi_s(i) as a prior for \theta.

Proof: The following exquisite proof was discovered by Weber [1992]. Let

    \[\lambda_t(i):=\sup \left\{ \lambda \in \mathbb{R} : \sup_{\tau} \mathbb{E} \sum_{s < \tau} \gamma^s (X_s-\lambda) \geq 0 \right\}\]

be the Gittins index of arm i at time t, which we interpret as the maximum charge one is willing to pay to play arm i given the current information. The prevailing charge is defined as \overline{\lambda}_t(i) = \min_{s \leq t} \lambda_s(i) (i.e. whenever the prevailing charge is too high we just drop it to the fair level). We now make three simple observations which together conclude the proof:

  • The discounted sum of prevailing charges for played arms \sum_t \gamma^t \overline{\lambda}_t(I_t) is an upper bound (in expectation) on the discounted sum of rewards. Indeed the times at which the prevailing charge is updated are stopping times, and so between two such times (s,t) the expected sum of discounted rewards is smaller than the discounted sum of the fair charge at time s, which is equal to the prevailing charge at any time in [s,t-1].
  • Since the prevailing charge is nonincreasing, the discounted sum of prevailing charge is maximized if we always pick the arm with maximum prevailing charge. Also note that the sequence of prevailing charge (\overline{\lambda}_t(i))_{i,t} does not depend on the algorithm.
  • The Gittins index strategy does exactly 2. (since we stop playing an arm only at times at which its prevailing charge is updated), and in this case 1. is an equality. Q.E.D.

 

For much more (implementation for exponential families, interpretation as a multitoken Markov game, …) see Dumitriu, Tetali and Winkler [2003], Gittins, Glazebrook, Weber [2011], Kaufmann [2014].

Bayesian multi-armed bandit, Thompson Sampling (TS)

In machine learning we want (i) strategies that can deal with complicated priors, and (ii) guarantees for misspecified priors. This is why we have to go beyond the Gittins index theory.

In his 1933 paper Thompson proposed the following strategy: sample \theta' \sim \pi_t and play I_t \in \mathrm{argmax}_{i \in [n]} \mu_i(\theta').
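For Bernoulli reward distributions with a uniform prior, this is just a conjugate Beta posterior update; here is an illustrative sketch (the arm means and horizon are made up):

```python
import numpy as np

def thompson(means, T, rng):
    """Thompson sampling for Bernoulli arms with uniform (Beta(1,1)) priors."""
    n = len(means)
    a, b = np.ones(n), np.ones(n)   # Beta(a_i, b_i) posterior on each arm's mean
    pulls = np.zeros(n, dtype=int)
    for _ in range(T):
        theta = rng.beta(a, b)      # sample theta' ~ pi_t
        i = int(np.argmax(theta))   # play I_t in argmax_i mu_i(theta')
        r = float(rng.random() < means[i])
        a[i] += r                   # conjugate Beta-Bernoulli posterior update
        b[i] += 1.0 - r
        pulls[i] += 1
    return pulls

rng = np.random.default_rng(1)
pulls = thompson([0.3, 0.8, 0.4], T=2000, rng=rng)
print(pulls)   # the best arm (index 1) gets the vast majority of the pulls
```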

Theoretical guarantees for this highly practical strategy have long remained elusive. Recently Agrawal and Goyal [2012] and Kaufmann, Korda and Munos [2012] proved that TS with Bernoulli reward distributions and uniform prior on the parameters achieves \overline{R}_T = O \left( \sum_i \frac{\log(T)}{\Delta_i}\right) (note that this is the frequentist regret!). We also note that Liu and Li [2015] takes some steps in analyzing the effect of misspecification for TS.

Let me also mention a beautiful conjecture of Guha and Munagala [2014]: for product priors, TS is a 2-approximation to the optimal Bayesian strategy for the objective of minimizing the number of pulls on suboptimal arms.

Bayesian multi-armed bandit, Russo and Van Roy [2014] information ratio analysis

Assume a prior in the adversarial model, that is a prior over (\ell_1,\hdots, \ell_T) \in [0,1]^{n \times T}, and let \mathbb{E}_t denote the posterior distribution (given \ell_1(I_1),\hdots,\ell_{t-1}(I_{t-1})). We introduce

    \[r_t(i) = \mathbb{E}_t (\ell_t(i) - \ell_t({i}^*)) , \;\; \text{and} \;\; v_t(i) = \mathrm{Var}_t (\mathbb{E}_t(\ell_t(i) | {i}^*)) .\]

The key observation is that (recall that H(p) = - \sum_i p(i) \log(p(i)))

    \[\mathbb{E} \sum_{t \leq T} v_t(I_t) \leq \frac12 H(i^*) .\]

Indeed, equipped with Pinsker’s inequality and basic information theory concepts one has (we denote \mathbbm{I}_t for the mutual information conditionally on everything up to time t, also \mathcal{L}_t(X) denotes the law of X conditionally on everything up to time t):

    \begin{eqnarray*} v_t(i) & = & \sum_{j} \pi_t(j) (\mathbb{E}_t(\ell_t(i)|i^*=j) - \mathbb{E}_t(\ell_t(i)))^2 \\ & \leq & \frac12 \sum_{j} \pi_t(j) \mathrm{Ent}(\mathcal{L}_t(\ell_t(i) | i^* = j) \Vert \mathcal{L}_t(\ell_t(i))) \\ & = & \frac12 \mathbbm{I}_t(\ell_t(i), i^*) = \frac12 (H_t(i^*) - H_t(i^* | \ell_t(i))) . \end{eqnarray*}

Thus \mathbb{E} v_t(I_t) \leq \frac12 \mathbb{E} (H_t(i^*) - H_{t+1}(i^*)) which gives the claim thanks to a telescoping sum. We will use this key observation as follows (we give a sequence of implications leading to a regret bound, so that all that is left to do is to check that the first statement in this sequence is true for TS):

    \begin{align*} & \forall t, \mathbb{E}_{t} r_t(I_t) \leq \sqrt{C \ \mathbb{E}_{t} v_t(I_t)} \\ & \Rightarrow \;\; \mathbb{E} \sum_{t=1}^T r_t(I_t) \leq \sum_{t=1}^T \sqrt{C \ \mathbb{E} v_t(I_t)} \\ & \Rightarrow \;\; {BR}_T \leq \sqrt{C \ T \ H(i^*) / 2} . \end{align*}

Thus writing \bar{\ell}_t(i) = \mathbb{E}_t \ell_t(i) and \bar{\ell}_t(i,j) = \mathbb{E}_t (\ell_t(i) | i^*=j) we have

    \begin{align*} & \mathbb{E}_{t} r_t(I_t) \leq \sqrt{C \ \mathbb{E}_{t} v_t(I_t)} \\ & \Leftrightarrow \mathbb{E}_{I_t} \bar{\ell}_t(I_t) - \sum_i \pi_t(i) \bar{\ell}_t(i,i) \leq \sqrt{C \ \mathbb{E}_{I_t}\sum_j \pi_t(j) (\bar{\ell}_t(I_t,j) - \bar{\ell}_t(I_t))^2} \end{align*}

For TS the following shows that one can take C=n:

    \begin{eqnarray*} \mathbb{E}_{I_t} \bar{\ell}_t(I_t) - \sum_i \pi_t(i) \bar{\ell}_t(i,i) & = & \sum_i \pi_t(i) (\bar{\ell}_t(i) - \bar{\ell}_t(i,i)) \\ & \leq & \sqrt{n \sum_i \pi_t(i)^2 (\bar{\ell}_t(i) - \bar{\ell}_t(i,i))^2} \\ & \leq & \sqrt{n \sum_{i,j} \pi_t(i) \pi_t(j) (\bar{\ell}_t(i) - \bar{\ell}_t(i,j))^2} . \end{eqnarray*}

Thus TS always satisfies BR_T \leq \sqrt{T n H(i^*)} \leq \sqrt{T n \log(n)}. Side note: by the minimax theorem this implies the existence of a strategy for the oblivious adversarial model with regret \sqrt{T n \log(n)} (of course we already proved that such a strategy exists, in fact we even constructed one via exponential weights, but the point is that the proof here does not require any “miracle”; yes, exponential weights are kind of a miracle, especially when you consider how the variance of the unbiased estimator gets automatically controlled).

 

Summary of basic results

  • In the i.i.d. model UCB attains a regret of O \left( \sum_i \frac{\log(T)}{\Delta_i}\right) and by Lai and Robbins’ lower bound this is optimal (up to a multiplicative variance-like term).
  • In the adversarial model Exp3 attains a regret of O(\sqrt{T n \log(n)}) and this is optimal up to the logarithmic term.
  • In the Bayesian model, Gittins index gives an optimal strategy for the case of product priors. For general priors Thompson Sampling is a more flexible strategy. Its Bayesian regret is controlled by the entropy of the optimal decision. Moreover TS with an uninformative prior has frequentist guarantees comparable to UCB.
Posted in Optimization, Probability theory | 5 Comments

COLT 2016 accepted papers

Like previous years (2014, 2015) I have compiled the list of accepted papers for this year’s edition of COLT together with links to the arxiv submission whenever I could find one. These 63 papers were selected from about 200 submissions (a healthy 11% increase in terms of submissions from last year).

COLT 2016 accepted papers

Posted in Conference/workshop | 1 Comment

Notes on least-squares, part I

These are mainly notes for myself, but I figured that they might be of interest to some of the blog readers too. Comments on what is written below are most welcome!

Let (X, Y) \in \mathbb{R}^d \times \mathbb{R} be a pair of random variables, and let f: \mathbb{R}^d \rightarrow \mathbb{R} be the convex quadratic function defined by

    \[f(w) = \frac12 \mathbb{E} (X^{\top} w - Y)^2 .\]

We are interested in finding a minimizer w^* of f (we will also be satisfied with w such that f(w) - f(w^*) \leq \epsilon). Denoting \Sigma = \mathbb{E} X X^{\top} for the covariance matrix of the features (we refer to X as a feature vector) and \mu = \mathbb{E} Y X, one has w^* = \Sigma^{-1} \mu (if \Sigma is not invertible then the inverse should be understood as the Moore-Penrose pseudo-inverse, in which case w^* is the smallest-norm element of the affine subspace of minimizers of f). The model of computation we are interested in corresponds to large-scale problems where one has access to an unlimited stream of i.i.d. copies of (X, Y). Thus the only constraint to find w^* is our computational ability to process samples. Furthermore we want to be able to deal with large-dimensional problems where \Sigma can be extremely ill-conditioned (for instance the smallest eigenvalue can be 0, or very close to 0), and thus bounds that depend on the condition number of \Sigma (ratio of largest eigenvalue to smallest eigenvalue) are irrelevant for us.

Let us introduce a bit more notation before diving into the comparison of different algorithms. With the well-specified model in mind (i.e., Y = X^{\top} w^* + \epsilon where \epsilon \sim \mathcal{N}(0, \sigma^2) is independent of X) we define the noise level \sigma of the problem by assuming that \mathbb{E} \ (X^{\top} w^* - Y)^2 X X^{\top} \preceq \sigma^2 \Sigma. For simplicity we will assume (whenever necessary) that almost surely one has |X| \leq R and |Y| \leq B. All our iterative algorithms will start at w_0 = 0. We will also not worry too much about numerical constants and write \lesssim for an inequality true up to a numerical constant. Finally for the computational complexity we will count the number of elementary d-dimensional vector operations that one has to do (e.g. adding/scaling such vectors, or requesting a new i.i.d. copy of (X,Y)).

ERM

The most natural approach (also the most terrible as it will turn out) is to take a large sample (x_i, y_i)_{i \in [n]} and then minimize the corresponding empirical version \hat{f} of f, or in other words compute \hat{w} = \left(\mathbb{X} \mathbb{X}^{\top} \right)^{-1} \mathbb{X} \mathbb{Y} where \mathbb{Y} \in \mathbb{R}^n is the column vector (y_i) and \mathbb{X} \in \mathbb{R}^{d \times n} is the matrix whose i^{th} column is x_i. Since \hat{f} is a quadratic function, we can use the conjugate gradient algorithm (see for example Section 2.4 here) to get \hat{w} in d iterations. However each iteration (basically computing a gradient of \hat{f}) costs O(n) elementary vector operations, thus leading overall to O(n d) elementary vector operations (which, perhaps surprisingly, is also the cost of just forming the empirical covariance matrix \hat{\Sigma} = \mathbb{X} \mathbb{X}^{\top}). What number n of samples should one take? The answer (at least for the well-specified model) comes from a standard statistical analysis (e.g. with the Rademacher complexity described here, see also these nice lecture notes by Philippe Rigollet) which yields:

    \[\mathbb{E} \hat{f}(\hat{w}) - f(w^*) \lesssim \frac{\sigma^2 \mathrm{rank}(\Sigma)}{n} .\]

(Note that in the above bound we evaluate \hat{w} on \hat{f} instead of f, see the comments section for more details on this issue. Since this is not the main point of the post I overlook this issue in what follows.) Wrapping up we see that this “standard” approach solves our objective by taking n= \sigma^2 d / \epsilon samples and performing O(\sigma^2 d^2 / \epsilon) elementary vector operations. This d^2 number of elementary vector operations completely dooms this approach when d is very large. We note however that the statistical bound above can be shown to be minimax optimal, that is any algorithm has to use at least \sigma^2 d / \epsilon samples (in the worst case over all \Sigma) in order to satisfy our objective. Thus the only hope to do better than the above approach is to somehow process the data more efficiently.
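For concreteness, here is what the ERM estimator looks like on simulated well-specified data. This is a sketch with made-up dimensions and noise level, and `np.linalg.lstsq` stands in for the conjugate gradient solver:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 20, 10000, 0.5
w_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))                   # rows are the samples x_i
Y = X @ w_star + sigma * rng.standard_normal(n)   # well-specified model

# ERM: lstsq minimizes the empirical squared loss, falling back to the
# pseudo-inverse if the empirical covariance is rank deficient (as in the text)
w_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# since Sigma = I_d here, f(w_hat) - f(w^*) = |w_hat - w^*|^2 / 2,
# which should be of order sigma^2 d / (2n)
excess = 0.5 * np.sum((w_hat - w_star) ** 2)
print(excess)
```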

A potential idea to trade a bit of the nice statistical property of \hat{w} for some computational easiness is to add a regularization term to \hat{f} in order to make the problem better conditioned (recall that conjugate gradient reaches an \epsilon-optimal point in \sqrt{\kappa} \log(1/\epsilon) iterations, where \kappa is the condition number).

In other words we would like to minimize, for some values of n and \lambda to be specified,

    \[\frac{1}{2n} \sum_{i=1}^n (x_i^{\top} w - y_i)^2 + \frac{\lambda}{2} |w|^2 .\]

One can write the minimizer as \hat{w}(\lambda) = \left(\mathbb{X} \mathbb{X}^{\top} + n \lambda \mathrm{I}_d \right)^{-1} \mathbb{X} \mathbb{Y}, and this point is now obtained at a cost of \tilde{O} \left(n \sqrt{\frac{\|\Sigma\|}{\lambda}}\right) elementary vector operations (\|\Sigma\| denotes the spectral norm of \Sigma and we used the fact that with high probability \|\hat{\Sigma}\| \lesssim \|\Sigma\|). Here one can be even smarter and use accelerated SVRG instead of conjugate gradient (for SVRG see Section 6.3 here, and here or here for how to accelerate it), which will reduce the number of elementary vector operations to \tilde{O}\left(n +\sqrt{n \frac{\|\Sigma\|}{\lambda}} \right). Also observe that again an easy argument based on Rademacher complexity leads to

    \[\mathbb{E} f(\hat{w}(\lambda)) - f(w^*) \lesssim \lambda |w^*|^2 + \frac{\sigma^2 d}{n} .\]

Thus we want to take \lambda of order \frac{\sigma^2 d}{n |w^*|^2}, which in terms of \epsilon is of order \epsilon / |w^*|^2, leading to an overall number of vector operations of

    \[\frac{\sigma^2 d}{\epsilon} + \frac{|w^*| \sqrt{\|\Sigma\| \sigma^2 d}}{\epsilon} .\]

Note that, perhaps surprisingly, anything less smart than accelerated SVRG would lead to a worse dependency on \epsilon than the one obtained above (which is also the same as the one obtained by the “standard” approach). Finally we observe that at a high level the two terms in the above bound can be viewed as “variance” and “bias” terms. The variance term comes from the inherent statistical limitation while the bias comes from the approximate optimization. In particular it makes sense that |w^*| appears in this bias term as it is the initial distance to the optimum for the optimization procedure.
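The regularized estimator \hat{w}(\lambda) itself is a one-liner. Below is an illustrative sketch with made-up problem sizes, solving the d \times d system directly rather than via accelerated SVRG (which only matters for scalability), and cheating by using the true |w^*|^2 to set \lambda (in practice one would use an upper bound):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 20, 2000, 0.5
w_star = rng.standard_normal(d)
X = rng.standard_normal((d, n))                    # columns are the samples x_i, as in the text
Y = X.T @ w_star + sigma * rng.standard_normal(n)

lam = sigma**2 * d / (n * (w_star @ w_star))       # lambda of order sigma^2 d / (n |w^*|^2)
# closed form of the regularized minimizer from the text
w_ridge = np.linalg.solve(X @ X.T + n * lam * np.eye(d), X @ Y)
print(np.linalg.norm(w_ridge - w_star))            # small: the estimator is consistent
```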

Standard SGD

Another approach altogether, which is probably the most natural approach for anyone trained in machine learning, is to run the stochastic gradient algorithm. Writing g_i(w) = x_i (x_i^{\top} w - y_i) (this is a stochastic gradient, that is it satisfies \mathbb{E}[g_i(w) | w] = \nabla f(w)) we recall that SGD can be written as:

    \[w_{i+1} = w_i - \eta g_i = (\mathrm{I}_d - \eta x_i x_i^{\top}) w_i + \eta y_i x_i .\]

The most basic guarantee for SGD (see Theorem 6.1 here) gives that, with a step size \eta of order 1/\sqrt{n}, and assuming for simplicity that |w^*| is known up to a numerical constant (so that if w_{i+1} steps outside the ball of that radius we project it back onto it; see below for how to get around this), one has the following (we also note that \mathbb{E}( |g_i(w)|^2 | w) \leq (B + |w| R)^2 \mathrm{Tr}(\Sigma), and we denote \bar{\sigma}^2 = (B + |w^*| R)^2):

    \[\mathbb{E} f\left(\frac1{n} \sum_{i=1}^n w_i\right) - f(w^*) \lesssim \sqrt{\frac{|w^*|^2 \bar{\sigma}^2 \mathrm{Tr}(\Sigma)}{n}} .\]

This means that with SGD the computational cost for our main task corresponds to roughly O(|w^*|^2 \bar{\sigma}^2 \mathrm{Tr}(\Sigma) / \epsilon^2) vector operations. This is a fairly weak result compared to accelerated SVRG, mainly because of: (i) the dependency on \epsilon^{-2} instead of \epsilon^{-1}, and (ii) \bar{\sigma}^2 instead of \sigma^2. We will fix both issues, but let us also observe that on the positive side the term d \|\Sigma\| in accelerated SVRG is replaced by the less conservative quantity \mathrm{Tr}(\Sigma) (that is, one replaces the largest eigenvalue of \Sigma by the average eigenvalue of \Sigma).

Let us quickly check whether one can improve the performance of SGD by regularization. That is, we consider applying SGD on f(w) + \frac{\lambda}{2} |w|^2 with the stochastic gradient g_i(w) = (\lambda \mathrm{I}_d + x_i x_i^{\top}) w - y_i x_i, which leads to (see Theorem 6.2 here):

    \[\mathbb{E} f\left(\frac1{n} \sum_{i=1}^n w_i\right) - f(w^*) \lesssim \lambda |w^*|^2 + \frac{\bar{\sigma}^2 \mathrm{Tr}(\Sigma)}{\lambda n} .\]

Optimizing over \lambda we see that we get the same rate as without regularization. Thus, unsurprisingly, standard SGD fares poorly compared to the more sophisticated approach based on accelerated SVRG described above. However we will see next that there is a very simple fix to make SGD competitive with accelerated SVRG, namely just tune up the learning rate!

Constant step size SGD (after Bach and Moulines 2013)

SGD with a constant step size \eta of order 1/R^2 satisfies

    \[\mathbb{E} f\left(\frac1{n} \sum_{i=1}^n w_i\right) - f(w^*) \lesssim \frac{\sigma^2 d + R^2 |w^*|^2}{n}.\]

We see that this result gives a similar bias/variance decomposition as the one we saw for accelerated SVRG. In particular the variance term \sigma^2 d / n is minimax optimal, while the bias term R^2 |w^*|^2 / n matches what one would expect for a basic first-order method (this term is not quite optimal: a decrease in 1/n^2 should be possible, since this is the optimal rate for first-order optimization of a smooth quadratic function without strong convexity, and indeed a recent paper of Dieuleveut, Flammarion and Bach tackles this issue).
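The averaged constant step-size iteration is easy to simulate. The following sketch (made-up dimensions, \Sigma = \mathrm{I}_d, Gaussian noise) illustrates the 1/n decay of the excess risk:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 10, 20000, 0.5
w_star = rng.standard_normal(d)

R2 = d                      # for X ~ N(0, I_d): Tr(Sigma) = d, so R^2 is of order d
eta = 1.0 / (2 * R2)        # constant step size, a constant times 1/R^2
w = np.zeros(d)             # w_0 = 0
w_bar = np.zeros(d)
for t in range(1, n + 1):
    x = rng.standard_normal(d)
    y = x @ w_star + sigma * rng.standard_normal()
    w = w - eta * (x @ w - y) * x       # one SGD step on a fresh sample
    w_bar += (w - w_bar) / t            # running average of the iterates

# here Sigma = I_d, so f(w_bar) - f(w^*) = |w_bar - w^*|^2 / 2,
# expected to be of order (sigma^2 d + R^2 |w^*|^2) / n
excess = 0.5 * np.sum((w_bar - w_star) ** 2)
print(excess)
```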

In part II I hope to give the intuition/proof of the above result, and in part III I will discuss other aspects of the least squares problem (dual methods, random projections, sparsity).

Posted in Optimization | 3 Comments

AlphaGo is born

Google DeepMind is making the front page of Nature (again) with a new AI for Go, named AlphaGo (see also this Nature youtube video). Computer Go is a notoriously difficult problem, and up to now AIs were faring very badly compared to good human players. In their paper the DeepMind team reports that AlphaGo won 5-0 against the best European player Fan Hui!!! This is truly a jump in performance: the previous best AI, Crazy Stone, needed several handicap stones to compete with pro players. Congratulations to the DeepMind team for this breakthrough!

How did they do it? From a very high level point of view they simply combined the previous state of the art (Monte Carlo Tree Search) with the new deep learning techniques. Recall that MCTS is a technique inspired by multi-armed bandits to efficiently explore the tree of possible action sequences in a game; for more details see this very nice survey by my PhD advisor Remi Munos: From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning. Now in MCTS there are two key elements beyond the bandit part (i.e., how to deal with exploration vs. exploitation): one needs a way to combine all the information collected to produce a value function for each state (this value is the key term in the upper confidence bound used by the bandit algorithm); and one needs a reasonable random policy to carry out the random rollouts once the bandit strategy has gone deep enough in the tree. In AlphaGo the initial random rollout strategy is learned via supervised deep learning on a dataset of human expert games, and the value function is learned (online, with MCTS guiding the search) via convolutional neural networks (in some sense this corresponds to a very natural inductive bias for this game).

Of course there is much more to AlphaGo than what I described above and you are invited to take a look at the paper (see this reddit thread to find the paper)!

Posted in Uncategorized | 1 Comment

On the spirit of NIPS 2015 and OpenAI

I just came back from NIPS 2015, which was a clear success in terms of numbers (note that this growth is not all due to deep learning: only about 10% of the papers were on this topic, which is about double the number on convex optimization, for example):

[Figure: growth of NIPS over the years]

In this post I want to talk about some of the new emerging directions that the NIPS community is taking. Of course my view is completely biased as I am more representative of COLT than NIPS (though obviously the two communities have a large overlap). Also I only looked in details at about 25% of the papers so perhaps I missed the most juicy breakthrough. In any case below you will find a short summary of each of these new directions with pointers to some of the relevant papers. Before going into the fun math I wanted to first share some thoughts about the big announcement of yesterday.

 

Thoughts about OpenAI

Obvious disclaimer: the opinions expressed here represent my own and not those of my employer (or previous employer hosting this blog). Now, for those of you who missed it, yesterday Elon Musk and friends made a huge announcement: they are giving $1 billion to create a non-profit organization whose goal is the advancement of AI (see here for the official statement, and here for the New York Times coverage). This is just absolutely wonderful news, and I really feel like we are watching history in the making. There are very very few places in the world solely dedicated to basic research and with that kind of money. Examples are useful to get some perspective: the Perimeter Institute for Theoretical Physics was funded with $100 million (I believe it has a major impact in the field), the Institute for Advanced Study was funded with a similar-size gift (a simple statistic gives an idea of the impact: 41 out of 57 Fields medalists have been affiliated with IAS), and more recently, and perhaps closer to us, the Simons Institute for the Theory of Computing was created with $60 million and its influence on the field keeps growing (it was certainly a very influential place in my own career). Looking at what those places are doing with 1/10 of OpenAI’s budget sets the bar extremely high for OpenAI, and I am very excited to see what direction they take and what their long term plans are!

Now let’s move on to what worries me a little: the 10 founding members of OpenAI are all working on deep learning. Before explaining further why this is worrisome let me emphasize that I strongly believe that disentangling the mysteries behind the impressive practical successes of deep nets is a key challenge for the future of AI (in fact I am spending a good amount of time thinking about this issue, just like many other groups in theoretical machine learning these days). I also believe that pushing the engineering aspect of deep nets will lead to wonderful technological breakthroughs, which is why it makes sense for companies such as Facebook, Google, Baidu, Microsoft, Amazon to invest heavily in this endeavor. However it seems insane to think that the current understanding of deep nets will be sufficient to achieve even very weak forms of AI. AI is still far from being an engineering problem, and there are some fundamental theoretical questions that have to be resolved before we can brute force our way through this problem. In fact the mission statement of OpenAI mentions one such fundamental question about which we know very little: currently we build systems that solve one task (e.g., image segmentation), but how do we combine these systems so that they take advantage of each other and help improve the learning of future tasks? While one can cook up heuristics to attack this problem (such as using the learned weights for one task as the initialization for another one) it seems clear to me that we are lacking the mathematical framework and tools to think properly about this question. I don’t think that deep learners are the best positioned to make conceptual progress on this question (and similar ones), though I definitely admit that they are probably the best positioned right now to make some practical progress. Again this is why all big companies are investing in this, but for an institution that wants to look into the more distant future it seems critical to diversify the portfolio (in fact this is exactly what Microsoft Research does) and not just follow companies who often have much shorter-term objectives. I really hope that this is part of their plans.

I wish the best of luck to OpenAI and their members. The game-changing potential of this organization puts a lot of responsibility on them and I sincerely hope that they will try to seriously explore different paths to AI rather than to chase local-in-time advertisement (please don’t just solve Go with deep nets!!!).

Now time for some of the cool stuff that happened at NIPS.

 

Scaling up sampling

Variational inference is a very successful paradigm in Bayesian learning where instead of trying to compute exactly the posterior distribution one searches through a parametric family for the closest (in relative entropy) distribution to the true posterior. The key observation is that one can perform stochastic gradient descent for this problem without having to compute the normalization constant in the posterior distribution (which is often an intractable problem). The only catch is that one needs to be able to sample from an element (conditioned on the observed data) of the parametric family under consideration, and this might itself be a difficult problem in large-scale applications. A basic MCMC method for this type of problem is the Langevin Monte Carlo (LMC) algorithm, for which a very nice theoretical analysis was recently provided by Dalalyan in the case of a convex negative log-likelihood. The issue for large-scale applications is that each step of LMC requires going through the entire data set. This is where SGLD (Stochastic Gradient Langevin Dynamics) comes in, a very nice idea of Welling and Teh, where the gradient step on the convex negative log-likelihood is replaced by a stochastic gradient step. The problem is that this introduces a bias in the stationary distribution, and fixing this bias can be done in several ways such as adding an accept-reject step, or modifying appropriately the covariance matrix of the noise in the Langevin Dynamics. The jury is still out on what is the most appropriate fix, and three papers made contributions to this question at NIPS: “On the Convergence of Stochastic Gradient MCMC Algorithms with High-Order Integrators“, “Covariance-Controlled Adaptive Langevin Thermostat for Large-Scale Bayesian Sampling“, and “A Complete Recipe for Stochastic Gradient MCMC“. 
Another key related question on which progress was made is to decide when to stop the chain, see “Measuring Sample Quality with Stein’s Method” (my favorite paper at this NIPS) and “Mixing Time Estimation in Reversible Markov Chains from a Single Sample Path“. My own paper “Finite-Time Analysis of Projected Langevin Monte Carlo” was also in that space: it adds nothing to the large scale picture but it shows how Langevin dynamics can cope with compactly supported distributions. Finally another related paper that I found interesting is “Sampling from Probabilistic Submodular Models“.
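To make the SGLD idea concrete, here is a minimal Python sketch on a toy one-dimensional problem (estimating the mean of Gaussian data under a flat prior, so that the posterior is N(mean(y), 1/n)); the step size, batch size, and burn-in are arbitrary illustrative choices, not tuned values from any of the papers above.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy problem: estimate the mean theta of Gaussian data under a flat prior,
# so the posterior is N(mean(y), 1/n)
n = 100
y = rng.standard_normal(n) + 1.0

def stoch_grad(theta, idx):
    # unbiased minibatch estimate of the gradient of the negative log-posterior
    return (n / len(idx)) * np.sum(theta - y[idx])

def sgld(theta0, step, n_iter, batch=10):
    """SGLD: an SGD step on the negative log-posterior plus Gaussian noise
    of variance 2*step, so the chain approximately samples the posterior."""
    theta, samples = theta0, []
    for _ in range(n_iter):
        idx = rng.integers(0, n, size=batch)
        theta = theta - step * stoch_grad(theta, idx) \
                + np.sqrt(2 * step) * rng.standard_normal()
        samples.append(theta)
    return np.array(samples)

samples = sgld(0.0, step=1e-3, n_iter=20000)[2000:]  # discard burn-in
print(samples.mean(), samples.var())  # roughly mean(y), roughly 1/n
```

The injected noise of variance 2*step is what distinguishes SGLD from plain SGD; without the bias fixes discussed above, the minibatch gradient noise slightly inflates the variance of the stationary distribution.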


When I’m a grown-up I want to do non-convex optimization!

With deep nets in mind all the rage is about non-convex optimization. One direction in that space is to develop more efficient algorithms for specific problems where we already know polynomial-time methods under reasonable assumptions, such as low rank estimation (see “A Convergent Gradient Descent Algorithm for Rank Minimization and Semidefinite Programming from Random Linear Measurements“) and phase retrieval (see “Solving Random Quadratic Systems of Equations Is Nearly as Easy as Solving Linear Systems“). The nice thing about those new results is that they essentially show that gradient descent with a spectral initialization will work (previous evidence was already shown for alternating minimization, see also “A Nonconvex Optimization Framework for Low Rank Matrix Estimation“). Another direction in non-convex optimization is to slowly extend the class of functions that one can solve efficiently, see “Beyond Convexity: Stochastic Quasi-Convex Optimization“. Finally a thought-provoking paper which is worth mentioning is “Matrix Manifold Optimization for Gaussian Mixtures” (it comes without provable guarantees but maybe something can be done there…).
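To illustrate the “gradient descent with a spectral initialization” recipe, here is a minimal Python sketch for real-valued phase retrieval, in the spirit of the Wirtinger-flow-type methods cited above (the dimensions, step size, and iteration count are arbitrary illustrative choices, not the constants from those papers).

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 10, 600
x = rng.standard_normal(n)                    # unknown signal
A = rng.standard_normal((m, n))               # Gaussian measurement vectors
y = (A @ x) ** 2                              # phaseless measurements

# spectral initialization: top eigenvector of (1/m) sum_i y_i a_i a_i^T,
# whose expectation is |x|^2 I + 2 x x^T, so its top eigenvector points
# in the direction of x
M = (A.T * y) @ A / m
eigvals, eigvecs = np.linalg.eigh(M)          # eigenvalues in ascending order
z = np.sqrt(y.mean()) * eigvecs[:, -1]        # y.mean() estimates |x|^2

# gradient descent on f(z) = (1/4m) sum_i ((a_i . z)^2 - y_i)^2
for _ in range(400):
    r = (A @ z) ** 2 - y
    z -= 0.1 / y.mean() * (A.T @ (r * (A @ z))) / m

dist = min(np.linalg.norm(z - x), np.linalg.norm(z + x))  # global sign ambiguity
print(dist / np.linalg.norm(x))  # small relative error
```

The point of the spectral step is to land in a basin where the (non-convex) least-squares objective behaves like a strongly convex function, after which plain gradient descent converges.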

 

Convex optimization strikes back

As I said non-convex optimization is all the rage, yet there are still many things about convex optimization that we don’t understand (an interesting example is given in this paper “Information-theoretic lower bounds for convex optimization with erroneous oracles“). I blogged recently about a new understanding of Nesterov’s acceleration, but this said nothing about Nesterov’s accelerated gradient descent itself. The paper “Accelerated Mirror Descent in Continuous and Discrete Time” builds on (and refines) recent advances on understanding the relation between AGD and Mirror Descent, as well as the differential equations underlying them. Talking about Mirror Descent, I was happy to see it applied to deep nets optimization in “End-to-end Learning of LDA by Mirror-Descent Back Propagation over a Deep Architecture“. Another interesting trend is the revival of second-order methods (e.g., Newton’s method) by using various low-rank approximations to the Hessian, see “Convergence rates of sub-sampled Newton methods“, “Newton-Stein Method: A Second Order Method for GLMs via Stein’s Lemma“, and “Natural Neural Networks“.
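For reference, here is a minimal Python sketch of Nesterov’s accelerated gradient descent (in the standard \lambda_t-sequence form) on an ill-conditioned quadratic, compared with vanilla gradient descent; the test function and iteration count are arbitrary illustrative choices.

```python
import numpy as np

def agd(grad, x0, L, n_iter):
    """Nesterov's accelerated gradient descent for an L-smooth convex f."""
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(n_iter):
        y_next = x - grad(x) / L                        # gradient step
        t_next = (1 + np.sqrt(1 + 4 * t ** 2)) / 2
        x = y_next + ((t - 1) / t_next) * (y_next - y)  # momentum step
        y, t = y_next, t_next
    return y

def gd(grad, x0, L, n_iter):
    x = x0.copy()
    for _ in range(n_iter):
        x -= grad(x) / L
    return x

# ill-conditioned quadratic f(x) = 0.5 * sum_i d_i x_i^2
d = np.logspace(0, 3, 50)            # eigenvalues from 1 to 1000
f = lambda x: 0.5 * np.sum(d * x * x)
grad = lambda x: d * x
x0 = np.ones(50)

f_agd = f(agd(grad, x0, d.max(), 500))
f_gd = f(gd(grad, x0, d.max(), 500))
print(f_agd, f_gd)  # AGD reaches a smaller value
```

The O(1/t^2) guarantee of AGD versus the O(1/t) rate of gradient descent shows up clearly on the small-eigenvalue directions of this quadratic.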

 

Other topics

There are a few other topics that caught my attention but I am running out of stamina. These include many papers on the analysis of cascades in networks (I am particularly curious about the COEVOLVE model), papers that further our understanding of random features, adaptive data analysis (see this), and a very healthy list of bandit papers (or Bayesian optimization as some like to call it).

 

Posted in Conference/workshop | 14 Comments

Convex Optimization: Algorithms and Complexity

I am thrilled to announce that my short introduction to convex optimization has just come out in the Foundations and Trends in Machine Learning series (free version on arxiv). This project started on this blog in 2013 with the lecture notes “The complexities of optimization”, it then morphed a year later into a first draft titled “Theory of Convex Optimization for Machine Learning”, and finally after one more iteration it has found the more appropriate name: “Convex Optimization: Algorithms and Complexity”. Notable additions since the last version include: a short treatment of conjugate gradient, an almost self-contained analysis of Vaidya’s 1989 cutting plane method (which attains the best of both center of gravity and ellipsoid method in terms of oracle complexity and computational complexity), and finally an algorithm with a simple geometric intuition which attains the rate of convergence of Nesterov’s accelerated gradient descent.

Posted in Optimization | Leave a comment

Crash course on learning theory, part 2

It might be useful to refresh your memory on the concepts we saw in part 1 (particularly the notions of VC dimension and Rademacher complexity). In this second and last part we will discuss two of the most successful algorithmic paradigms in learning: boosting and SVM. Note that just like last time each topic we cover has entire books dedicated to it (see for example the boosting book by Schapire and Freund, and the SVM book by Scholkopf and Smola). Finally we conclude our short tour of the basics of learning theory with a simple observation: stable algorithms generalize well.

 

Boosting

Say that given a distribution D supported on m points (x_i, y_i)_{i \in [m]} one can find (efficiently) a classifier h \in \mathcal{B} such that L_{D}(h) \leq 1/2 - \gamma (here we are in the context of classification with the zero-one loss). Can we “boost” this weak learning algorithm into a strong learning algorithm with L_{\mu}(A) arbitrarily small for m large enough? It turns out that this is possible, and even simple. The idea is to build a linear combination of hypotheses in \mathcal{B} with a greedy procedure. That is at time step t-1 our hypothesis is (the sign of) f_{t-1} = \sum_{s=1}^{t-1} w_s h_s, and we are now looking to add h_{t} \in \mathcal{B} with an appropriate weight w_{t} \in \mathbb{R}. A natural guess is to optimize over (w_t, h_t) to minimize the training error of f_t on our sample (x_i, y_i)_{i \in [m]}. This might be a difficult computational problem (how do you optimize over \mathcal{B}?), and furthermore we would like to make use of our efficient weak learning algorithm. The key trick is that \mathds1\{x < 0\} \leq \exp(-x). More precisely:

    \begin{eqnarray*} L_{\hat{\mu}_m}(f_t) = \frac{1}{m} \sum_{i=1}^m \mathds1\{\mathrm{sign}(f_t(x_i)) \neq y_i\} & \leq & \frac{1}{m} \sum_{i=1}^m \exp( - y_i f_t(x_i) ) \; (=: Z_t / m) \\ & = & \frac{Z_{t-1}}{m} \sum_{i=1}^m D_t(i) \exp( - y_i w_t h_t(x_i) ) , \end{eqnarray*}

where D_t(i) = \frac{1}{Z_{t-1}} \exp(- y_i f_{t-1}(x_i)). From this we see that we would like h_t to be a good predictor for the distribution D_t. Thus we can pass D_t to the weak learning algorithm, which in turns gives us h_t with L_{D_t}(h_t) = \epsilon_t \leq 1/2 - \gamma. Thus we now have:

    \[Z_t = Z_{t-1} (\epsilon_t \exp(w_t) + (1-\epsilon_t) \exp(-w_t) ).\]

Optimizing the above expression one finds that w_t = \frac{1}{2} \log\left(\frac{1-\epsilon_t}{\epsilon_t}\right) leads to (using \epsilon_t \leq 1/2-\gamma)

    \[Z_t \leq Z_{t-1} \exp(-2 \gamma^2) \leq m \exp(-2 \gamma^2 t) .\]

The procedure we just described is called AdaBoost (introduced by Freund and Schapire) and we proved that it satisfies

(1)   \begin{equation*}  L_{\hat{\mu}_m}(f_t) \leq \exp(- 2 \gamma^2 t) . \end{equation*}

In particular we see that our weak learner assumption implies that \hat{\mu}_m is realizable (and in fact realizable with margin \gamma, see next section for the definition of margin) with the hypothesis class:

    \[\mathcal{H} = \left\{ \mathrm{sign} \left( \sum_{t \in \mathbb{N}} w_t h_t(\cdot) \right) : \forall t \in \mathbb{N}, w_t \in \mathbb{R}, h_t \in \mathcal{B} \right\} .\]

This class can be thought of as a neural network with one (infinite size) hidden layer. To realize how expressive \mathcal{H} is compared to \mathcal{B} it’s a useful exercise to think about the very basic case of decision stumps (for which the empirical risk minimization can be implemented very efficiently):

    \[\mathcal{B} = \{x \mapsto \mathrm{sign}(\theta - x_i), i \in [n], \theta \in \mathbb{R}\} .\]

To derive a bound on the true risk L_{\mu}(f_t) of AdaBoost it remains to calculate the VC dimension of the class \mathcal{H}_t where the size of the hidden layer is t. This follows from more general results on the VC dimension of neural networks, and up to logarithmic factors one obtains that \mathrm{VC}(\mathcal{H}_t) is of order t \times \mathrm{VC}(\mathcal{B}). Putting this together with (1) we see that when \gamma is a constant, one should run AdaBoost for t=\Theta(\log(m)) rounds, and then one gets L_{\mu}(f_t) \lesssim \sqrt{\frac{\mathrm{VC}(\mathcal{B})}{m}}.
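The AdaBoost procedure just described, with decision stumps as the base class \mathcal{B}, can be sketched in a few lines of Python (the toy one-dimensional data set and the number of rounds are arbitrary illustrative choices, and the clipping of \epsilon_t is a numerical guard, not part of the analysis above):

```python
import numpy as np

def best_stump(X, y, D):
    """Weak learner: the stump s*sign(theta - x_i) with smallest D-weighted error."""
    best_err, best_h = 1.0, None
    for i in range(X.shape[1]):
        for theta in np.unique(X[:, i]):
            for s in (1.0, -1.0):
                pred = s * np.sign(theta - X[:, i] + 1e-12)
                err = D[pred != y].sum()
                if err < best_err:
                    best_err, best_h = err, (i, theta, s)
    return best_err, best_h

def adaboost(X, y, T):
    F = np.zeros(len(y))                          # f_t evaluated on the sample
    for _ in range(T):
        D = np.exp(-y * F)
        D /= D.sum()                              # D_t(i) prop. to exp(-y_i f_{t-1}(x_i))
        eps, (i, theta, s) = best_stump(X, y, D)
        eps = min(max(eps, 1e-3), 0.5 - 1e-3)     # numerical guard
        w = 0.5 * np.log((1 - eps) / eps)         # the optimal weight derived above
        F += w * s * np.sign(theta - X[:, i] + 1e-12)
    return F

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 1))
y = np.where(np.abs(X[:, 0]) < 1, 1.0, -1.0)      # an interval: no single stump works
F = adaboost(X, y, T=100)
train_err = np.mean(np.sign(F) != y)
print(train_err)  # driven essentially to zero, as the exponential bound predicts
```

No single stump can classify the interval target, yet each reweighted distribution admits a stump with error bounded away from 1/2, which is exactly the weak learning assumption at work.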

 

Margin

We consider \mathcal{D} to be the set of distributions \mu such that there exists w^* with \mathbb{P}_{\mu}(Y \langle w^{*} , X \rangle > 0) =1 (again we are in the context of classification with the zero-one loss, and this assumption means that the data is almost surely realizable). The SVM idea is to search for w with minimal Euclidean norm and such that Y_i \langle w , X_i \rangle \geq 1, \forall i \in [m]. Effectively this is doing empirical risk minimization over the following data dependent hypothesis class (which we write in terms of the set of admissible weight vectors):

    \[\mathcal{W}_S = \{w \in \mathbb{R}^n : Y_i \langle w , X_i \rangle \geq 1, \forall i \in [m] \; \text{and} \; |w| \leq |w^*|\} .\]

The key point is that we can now use the contraction lemma to bound the Rademacher complexity of this class. Indeed replacing the zero-one loss \mathds1\{\mathrm{sign}(\langle w, x\rangle) \neq y\} by the (Lipschitz!) “ramp loss” \min(1,\max(0,1-y\langle w , x \rangle)) makes no difference for the optimum w^*, and our estimated weight still has 0 training error while its true loss is only overestimated. Using the argument from previous sections we see that the Rademacher complexity of our hypothesis class (with respect to the ramp loss) is bounded by 2 |w^*| / \sqrt{m} (assuming the examples X_i are normalized to be in the Euclidean ball). Now it is easy to see that the existence of w^* with |w^*| \leq 1/\gamma (and \mathbb{P}_{\mu}(Y \langle w^{*} , X \rangle \geq 1) =1) exactly corresponds to a geometric margin of \gamma between positive and negative examples (indeed the margin is exactly \min_i y_i \langle \frac{w^*}{|w^*|} , x_i \rangle). To summarize we just saw that under the \gamma-margin condition the SVM algorithm has a sample complexity of order 1/ (\gamma \epsilon)^2. This suggests that from an estimation perspective one should map the points into a high-dimensional space so that one could hope to have the separability condition (with margin). However this raises computational challenges, as the QP given by the SVM can suddenly look daunting. This is where kernels come into the picture.

 

Kernels

So let’s go overboard and map the points to an infinite dimensional Hilbert space \mathcal{H} (as we will see in the next subsection this notation will be consistent with \mathcal{H} being the hypothesis class). Denote \psi for this map, and let K(x,x') = \langle \psi(x), \psi(x') \rangle_{\mathcal{H}} be the kernel associated with it. The key point is that we are not using all the dimensions of our Hilbert space in the SVM optimization problem, but rather we are effectively working in the subspace spanned by \psi(x_1), \hdots, \psi(x_m) (this is because we are only working with inner products with those vectors, and we are trying to minimize the norm of the resulting vector). Thus we can restrict our search to w=\sum_{i=1}^m \alpha_i \psi(x_i) (this fact is known as the representer theorem and the previous sentence is the proof…). The beauty is that now we only need to compute the Gram matrix (K(x_i, x_j))_{i, j \in [m]} as we only need to consider \langle w, \psi(x_i)\rangle = \sum_{j=1}^m \alpha_j K(x_j, x_i) and |w|^2 = \sum_{i,j=1}^m \alpha_i \alpha_j K(x_i, x_j). In particular we never need to compute the points \psi(x_i) (which anyway could be infinite dimensional, so we couldn’t really write them down…). Note that the same trick would work with soft SVM (i.e., regularized hinge loss). To drive the point home let’s see an example: \psi(x) = (\prod_{j \in J} x_j)_{J \in [n]^k} leads to K(x,x') = (\langle x, x'\rangle)^k. I guess it doesn’t get much better than this :). Despite all this beauty, one should note that we now have to manipulate an object of size m^2 (the kernel matrix (K(x_i,x_j))) and in our big data days this can be a disaster. We really want to focus on methods with computational complexity linear in m, and thus one is led to the problem of kernel approximation, which we will explore below. But first let us explore a bit further what kernel SVM is really doing.
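As a sanity check, when the index J in the example above ranges over all k-tuples in [n]^k, the identity K(x,x') = \langle x, x'\rangle^k holds exactly, and one can verify it numerically (a pure-Python sketch; the sizes n=3 and k=2 are arbitrary):

```python
import itertools
import math
import random

def poly_features(x, k):
    # psi(x) indexed by tuples J in [n]^k: products of the selected coordinates
    n = len(x)
    return [math.prod(x[j] for j in J)
            for J in itertools.product(range(n), repeat=k)]

def poly_kernel(x, xp, k):
    # the kernel <x, x'>^k, computed without the explicit feature map
    return sum(a * b for a, b in zip(x, xp)) ** k

random.seed(0)
x = [random.gauss(0, 1) for _ in range(3)]
xp = [random.gauss(0, 1) for _ in range(3)]
k = 2
lhs = sum(a * b for a, b in zip(poly_features(x, k), poly_features(xp, k)))
rhs = poly_kernel(x, xp, k)
print(abs(lhs - rhs))  # numerically zero
```

Note that the explicit feature map has n^k coordinates while the kernel evaluation costs O(n): this is the kernel trick in its simplest form.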

 

RKHS and the inductive bias of kernels

As we just saw in kernel methods the true central element is the kernel K rather than the embedding \psi. In particular since the only things that matter are the inner products \langle \psi(x), \psi(x') \rangle_{\mathcal{H}} we might as well assume that \psi(x) := K(x,\cdot), and \mathcal{H} is the completion of \mathrm{Span}(K(x, \cdot), x \in \mathcal{X}) where the inner product is defined by \langle K(x, \cdot), K(x', \cdot) \rangle_{\mathcal{H}} = K(x,x') (and the definition is extended to \mathcal{H} by linearity). Assuming that K is positive definite (that is Gram matrices built from K are positive definite) one obtains a well-defined Hilbert space \mathcal{H}. Furthermore this Hilbert space has a special property: for any f \in \mathcal{H}, x \in \mathcal{X}, f(x) = \langle f, K(x, \cdot) \rangle_{\mathcal{H}}. In other words \mathcal{H} is a reproducing kernel Hilbert space (RKHS), and in fact any RKHS can be obtained with the above construction (this is a simple consequence of Riesz representation theorem). Now observe that we can rewrite the kernel SVM problem

    \begin{align*} & \mathrm{min}. \; \sum_{i,j=1}^m \alpha_i \alpha_j K(x_i,x_j) \\ & \text{s.t.} \; \alpha \in \mathbb{R}^m \; \text{and} \; \forall j \in [m], y_j \sum_{i=1}^m \alpha_i K(x_i, x_j) \geq 1 , \end{align*}

as

    \begin{align*} & \mathrm{min}. \; \|f\|_{\mathcal{H}} \\ & \text{s.t.} \; f \in \mathcal{H} \; \text{and} \; \forall j \in [m], y_j f(x_j) \geq 1 . \end{align*}

While the first formulation is computationally more effective, the second sheds light on what we are really doing: simply searching the consistent (with margin) hypothesis in \mathcal{H} with smallest norm. In other words, thinking in terms of inductive bias, one should choose a kernel for which the norm \|\cdot\|_{\mathcal{H}} represents the kind of smoothness one expects in the mapping from input to output (more on that next).

It should also be clear now that one can “kernelize” any regularized empirical risk minimization, that is instead of the boring (note that here the loss \ell is defined on \mathcal{Y} \times \mathbb{R} instead of \mathcal{Y} \times \mathcal{Y})

    \[\min_{w \in \mathbb{R}^n}. \; \frac1{m} \sum_{i=1}^m \ell(y_i, \langle w, x_i \rangle) + \lambda |w|^2 ,\]

one can consider the much more exciting

    \[\min_{f \in \mathcal{H}}. \; \frac1{m} \sum_{i=1}^m \ell(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}}^2 ,\]

since this can be equivalently written as

    \[\min_{\alpha \in \mathbb{R}^m}. \; \frac1{m} \sum_{i=1}^m \ell\left(y_i, \sum_{j=1}^m \alpha_j K(x_j, x_i)\right) + \lambda \sum_{i,j=1}^m \alpha_i \alpha_j K(x_i,x_j) .\]

This gives the kernel ridge regression, kernel logistic regression, etc…
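For the squared loss the kernelized problem above even has a closed form: setting the gradient in \alpha to zero, one checks that \alpha = (K + \lambda m I)^{-1} y is a minimizer. A minimal Python sketch of kernel ridge regression with the Gaussian kernel (the bandwidth, regularization, and toy data are arbitrary illustrative choices):

```python
import numpy as np

def gaussian_gram(X, Xp, sigma=0.5):
    # Gram matrix K(x_i, x'_j) = exp(-|x_i - x'_j|^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - Xp[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0])

lam = 1e-4
m = len(X)
K = gaussian_gram(X, X)
alpha = np.linalg.solve(K + lam * m * np.eye(m), y)  # closed-form minimizer

X_test = np.linspace(-3, 3, 50)[:, None]
pred = gaussian_gram(X_test, X) @ alpha              # f(x) = sum_j alpha_j K(x_j, x)
err = np.max(np.abs(pred - np.sin(X_test[:, 0])))
print(err)  # small uniform error on the test grid
```

The prediction f(x) = \sum_j \alpha_j K(x_j, x) is exactly the RKHS function produced by the equivalent formulation above, evaluated via the Gram matrix only.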

 

Translation invariant kernels and low-pass filtering

We will now investigate a bit further the RKHS that one obtains with translation invariant kernels, that is K(x,x')=k(x-x'). A beautiful theorem of Bochner characterizes the continuous maps k: \mathbb{R}^n \rightarrow \mathbb{R} (with k(0)=1) for which such a K is a positive definite kernel: it is necessary and sufficient that k is the characteristic function of a probability measure \mu, that is

    \[k(x) = \int_{\mathbb{R}^n} \exp(i \langle \omega, x \rangle ) d\mu(\omega) .\]

An important example in practice is the Gaussian kernel: K(x,x')=\exp\left( - \frac{|x-x'|^2}{2 \sigma^2}\right) (this corresponds to mapping x to the function \exp\left( - \frac{|x- \cdot |^2}{2 \sigma^2}\right)). One can check that in this case \mu is itself a Gaussian (centered and with covariance \sigma^{-2} \mathrm{I}_n).
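One can check the claim that \mu is Gaussian with covariance \sigma^{-2} \mathrm{I}_n numerically in one dimension: drawing \omega from N(0, \sigma^{-2}) and averaging \cos(\omega x) should recover \exp(-x^2/(2\sigma^2)) (a Monte Carlo sketch; the value of \sigma and the sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.5
omega = rng.standard_normal(200_000) / sigma     # omega ~ N(0, sigma^{-2})

max_dev = 0.0
for x in (0.5, 1.0, 2.0):
    mc = np.cos(omega * x).mean()                # real part of E exp(i omega x)
    exact = np.exp(-x ** 2 / (2 * sigma ** 2))   # k(x) for the Gaussian kernel
    max_dev = max(max_dev, abs(mc - exact))
print(max_dev)  # small Monte Carlo error
```

The imaginary part vanishes by symmetry of \mu, which is why averaging the cosine alone suffices.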

Now let us restrict our attention to the case where \mu has a density h with respect to the Lebesgue measure, that is \frac{d\mu}{d\omega}=h. A standard calculation then shows that

    \[K(x,x')= \frac{1}{(2\pi)^{n}} \int_{\mathbb{R}^n} \frac{\widehat{K(x, \cdot)}(\omega)\overline{\widehat{K(x', \cdot)}(\omega)}}{h(\omega)} d\omega ,\]

which implies in particular

    \[\| f \|_{\mathcal{H}}^2 = \frac{1}{(2\pi)^{n}} \int_{\mathbb{R}^n} \frac{|\widehat{f}(\omega)|^2}{h(\omega)} d\omega .\]

Note that for the Gaussian kernel one has h(\omega)^{-1} \propto \exp\left(\frac{\sigma^2 |\omega|^2}{2}\right), that is the high frequencies in f are severely penalized in the RKHS norm. Also note that smaller values of \sigma correspond to less regularization, which is what one would have expected from the feature map representation (indeed the features \exp\left( - \frac{|x- \cdot |^2}{2 \sigma^2}\right) are more localized around the data point x for smaller values of \sigma).

To summarize, SVM with translation invariant kernels correspond to some kind of soft low-pass filtering, where the exact form of the penalization for higher frequency depends on the specific kernel being used (smoother kernels lead to more penalization).

 

Random features

Let us now come back to computational issues. As we pointed out before, the vanilla kernel method has at least a quadratic cost in the number of data points. A common approach to reduce this cost is to use a low rank approximation of the Gram matrix (indeed thanks to the i.i.d. assumption there is presumably a lot of redundancy in the Gram matrix), or to resort to online algorithms (see for example the forgetron of Dekel, Shalev-Shwartz and Singer). Another idea is to approximate the feature map itself (a priori this doesn’t sound like a good idea, since as we explained above the beauty of kernels is that we avoid computing this feature map). We now describe an elegant and powerful approximation of the feature map (for translation invariant kernels) proposed by Rahimi and Recht which is based on random features.

Let K be a translation invariant kernel and \mu its corresponding probability measure. Let’s rewrite K in a convenient form, using \cos(\theta-\theta') = \frac{1}{\pi} \int_{0}^{\pi} 2 \cos(\theta + \phi) \cos(\theta'+\phi) d\phi and Bochner’s theorem,

    \[K(x,x') = \mathbb{E}_{\omega \sim \mu} \cos(\langle \omega, x-x' \rangle) = 2 \mathbb{E}_{\omega \sim \mu, \phi \sim \mathrm{unif}([0,\pi])} \cos(\langle \omega, x \rangle + \phi) \cos(\langle \omega, x' \rangle + \phi).\]

A simple idea is thus to build the following random feature map: given \omega_1, \hdots, \omega_d and \phi_1, \hdots, \phi_d, i.i.d. draws from respectively \mu and \mathrm{unif}([0, \pi]), let \Psi: \mathbb{R}^n \rightarrow \mathbb{R}^d be defined by \Psi(x)=(\Psi_1(x), \hdots, \Psi_d(x)) where

    \[\Psi_i(x) = \sqrt{\frac{2}{d}} \cos(\langle \omega_i, x \rangle + \phi_i) .\]

For d \gtrsim n / \epsilon^2 it is an easy exercise to verify that with probability at least 1-\epsilon (provided \mu has a second moment at most polynomial in n) one will have for any x, x' in some compact set (with diameter at most polynomial in n),

    \[|K(x,x') - \langle \Psi(x), \Psi(x') \rangle | \leq \epsilon .\]

The SVM problem can now be approximated by:

    \begin{align*} & \mathrm{min}. \; |w| \\ & \text{s.t.} \; w \in \mathbb{R}^d \; \text{and} \; y_j \langle w, \Psi(x_j) \rangle \geq 1 , \forall j \in [m]. \end{align*}

This optimization problem is potentially much simpler than the vanilla kernel SVM when m is much bigger than d (essentially d replaces m for most computational aspects of the problem, including the space/time complexity of prediction after training).
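The random feature map \Psi above is short to implement; the following Python sketch checks the kernel approximation for the Gaussian kernel (the dimensions n and d are arbitrary illustrative choices, and the 1/\sqrt{d} error rate is the expected Monte Carlo behavior):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20_000
sigma = 1.0

# for the Gaussian kernel, mu = N(0, sigma^{-2} I_n) by Bochner's theorem
Omega = rng.standard_normal((d, n)) / sigma
phi = rng.uniform(0, np.pi, size=d)

def Psi(x):
    # the random feature map: d cosine features with random frequencies/phases
    return np.sqrt(2.0 / d) * np.cos(Omega @ x + phi)

x, xp = rng.standard_normal(n), rng.standard_normal(n)
approx = Psi(x) @ Psi(xp)
exact = np.exp(-np.linalg.norm(x - xp) ** 2 / (2 * sigma ** 2))
print(abs(approx - exact))  # small, of order 1/sqrt(d)
```

Training any linear method on \Psi(x_1), \hdots, \Psi(x_m) then approximates the corresponding kernel method at a cost linear in m.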

 

Stability

We conclude our tour of the basic topics in statistical learning with a different point of view on generalization that was put forward by Bousquet and Elisseeff.

Let’s start with a simple observation. Let S'=(Z_1', \hdots, Z_m') be an independent copy of S, and denote S^i = (Z_1, \hdots, Z_{i-1}, Z_i', Z_{i+1}, \hdots, Z_m). Then one can write, using the slight abuse of notation \ell(h, (x,y)) :=\ell(h(x), y),

    \begin{eqnarray*} \mathbb{E}_S (L_{\mu}(A(S)) - L_{\hat{\mu}_m}(A(S))) & = & \mathbb{E}_{S, S', i \sim \mathrm{unif}([m])} (\ell(A(S), Z_i') - \ell(A(S), Z_i)) \\ & = & \mathbb{E}_{S, S', i \sim \mathrm{unif}([m])} (\ell(A(S^i), Z_i) - \ell(A(S), Z_i)) . \end{eqnarray*}

This last quantity can be interpreted as a stability notion, and we see that controlling it would in turn control how different is the true risk compared to the empirical risk. Thus stable methods generalize.

We will now show that regularization can be interpreted as a stabilizer. Precisely we show that

    \[w(S) \in \argmin_{w \in \mathbb{R}^n} L_{\hat{\mu}_m}(h_w) + \lambda |w|^2 ,\]

is 2 / (\lambda m)-stable for a convex and 1-Lipschitz loss. Denote f_S(w) for the above objective function, then one has

    \[f_S(w) - f_S(w') = f_{S^i}(w) - f_{S^i}(w') + \frac{\ell(h_w, Z_i) - \ell(h_{w'}, Z_i)}{m} + \frac{\ell(h_{w'}, Z_i') - \ell(h_{w}, Z_i')}{m} ,\]

and thus by Lipschitzness

    \begin{eqnarray*} f_S(w(S^i))-f_S(w(S)) & \leq & \frac{\ell(h_{w(S^i)}, Z_i) - \ell(h_{w(S)}, Z_i)}{m} + \frac{\ell(h_{w(S)}, Z_i') - \ell(h_{w(S^i)}, Z_i')}{m} \\ & \leq & \frac{2}{m} |w(S) - w(S^i)| . \end{eqnarray*}

On the other hand by strong convexity one has

    \[f_S(w) - f_S(w(S)) \geq \lambda |w-w(S)|^2 ,\]

and thus with the above we get |w(S) - w(S^i)| \leq 2 / (\lambda m) which implies (by Lipschitzness)

    \[\ell(h_{w(S^i)}, Z_i) - \ell(h_{w(S)}, Z_i) \leq \frac{2}{\lambda m} ,\]

or in other words regularized empirical risk minimization (\mathrm{R-ERM}) is stable. In particular denoting w^* for the minimizer of w \mapsto L_{\mu}(h_w) we have:

    \begin{eqnarray*} L_{\mu}(\mathrm{R-ERM}) & \leq & \mathbb{E}_S L_{\hat{\mu}_m}(\mathrm{R-ERM}) + \frac{2}{\lambda m} \\ & \leq & \mathbb{E}_S L_{\hat{\mu}_m}(w^*) + \lambda |w^*|^2 + \frac{2}{\lambda m} \\ & = & L_{\mu}(w^*) + \lambda |w^*|^2 + \frac{2}{\lambda m} . \end{eqnarray*}

Assuming |w^*| \leq 1 and optimizing over \lambda we recover the bound we obtained previously via Rademacher complexity.
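The 2 / (\lambda m) stability bound can be observed numerically. The sketch below uses the logistic loss (convex and 1-Lipschitz in its second argument), fits the regularized objective by plain gradient descent on S and on a neighboring sample S^i, and compares |w(S) - w(S^i)| to 2 / (\lambda m); the toy data and optimization parameters are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 50, 5, 0.1
X = rng.standard_normal((m, n))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)  # |x_i| <= 1
y = np.sign(X[:, 0] + 0.1 * rng.standard_normal(m))

def fit(X, y, iters=5000, step=0.1):
    """Gradient descent on (1/m) sum_i log(1+exp(-y_i <w,x_i>)) + lam*|w|^2."""
    w = np.zeros(n)
    for _ in range(iters):
        u = y * (X @ w)
        g = -(X.T @ (y / (1 + np.exp(u)))) / len(y) + 2 * lam * w
        w -= step * g
    return w

w_S = fit(X, y)

# build S^i: replace the first example by a fresh (normalized) one
X2, y2 = X.copy(), y.copy()
x_new = rng.standard_normal(n)
X2[0] = x_new / max(np.linalg.norm(x_new), 1.0)
y2[0] = -y[0]
w_Si = fit(X2, y2)

print(np.linalg.norm(w_S - w_Si), 2 / (lam * m))  # difference vs the bound
```

Changing a single example moves the regularized minimizer by far less than 2 / (\lambda m), consistent with the argument above (the bound only needs convexity, 1-Lipschitzness, and the strong convexity supplied by the regularizer).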

Posted in Optimization, Probability theory, Theoretical Computer Science | Leave a comment