# Bandit theory, part II

These are the lecture notes for the second part of my minicourse on bandit theory (see here for Part 1).

The linear bandit problem, Auer [2002]

We will mostly study the following far-reaching extension of the $n$-armed bandit problem.

Known parameters: compact action set $\mathcal{A} \subset \mathbb{R}^n$, adversary’s action set $\mathcal{L} \subset \mathbb{R}^n$, number of rounds $T$.

Protocol: For each round $t = 1, \dots, T$, the adversary chooses a loss vector $\ell_t \in \mathcal{L}$ and simultaneously the player chooses $a_t \in \mathcal{A}$ based on past observations, and receives the loss/observation $\ell_t^{\top} a_t$.

Other models: In the i.i.d. model we assume that there is some underlying $\bar{\ell} \in \mathcal{L}$ such that $\mathbb{E}\,\ell_t = \bar{\ell}$. In the Bayesian model we assume that we have a prior distribution $\nu$ over the sequence $(\ell_1, \dots, \ell_T)$ (in this case the expectation in the regret is also over the draw from $\nu$). Alternatively we could assume a prior over $\bar{\ell}$.

Example: Part 1 was about $\mathcal{A} = \{e_1, \dots, e_n\}$ and $\mathcal{L} = [0,1]^n$. Another simple example is path planning: say you have a graph with $n$ edges, and at each step one has to pick a path from some source node to a target node. The action set can be represented as a subset of the hypercube $\{0,1\}^n$. The adversary chooses delays on the edges, and the delay of the chosen path is the sum of the delays on the edges that the path visits (this is indeed a linear loss).
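The path-planning example can be made very concrete: encode a path by its edge-incidence vector, and the total delay is a plain inner product. A minimal numpy sketch (the graph size, edge indices and delay values are made up for illustration):

```python
import numpy as np

# Toy path-planning instance: a graph with 5 edges.
# A path is encoded as a 0/1 incidence vector in {0,1}^5,
# and the adversary's edge delays form a vector ell in R^5.
ell = np.array([0.2, 0.5, 0.1, 0.9, 0.3])   # delay on each edge
path = np.array([1, 0, 1, 0, 1])            # a path using edges 0, 2, 4

# Total delay of the path = sum of delays on its edges = <ell, path>,
# which is exactly a linear loss in the incidence vector.
total_delay = ell @ path
assert np.isclose(total_delay, 0.2 + 0.1 + 0.3)
```

This is why the hypercube representation turns path planning into a linear bandit, even though the number of paths can be exponential in the number of edges.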

Assumption: unless specified otherwise we assume $|\ell^{\top} a| \le 1$ for all $(a, \ell) \in \mathcal{A} \times \mathcal{L}$.

Other feedback model: in the case where $\mathcal{A} \subset \{0,1\}^n$ one can assume that the coordinate loss $\ell_t(i)$ is observed for every $i$ such that $a_t(i) = 1$. This is the so-called semi-bandit feedback.

Thompson Sampling for linear bandit after RVR14

Assume . Recall from Part 1 that TS satisfies

where and .

Writing , , and we want to show that

Using the eigenvalue formula for the trace and the Frobenius norm one can see that . Moreover the rank of is at most since where (the row of is and for it is ).

Let us make some observations.

1. TS satisfies $R_T \lesssim \sqrt{n T \log|\mathcal{A}|}$. To appreciate the improvement recall that without the linear structure one would get a regret of order $\sqrt{T |\mathcal{A}|}$ and that $|\mathcal{A}|$ can be exponential in the dimension $n$ (think of the path planning example).
2. Provided that one can efficiently sample from the posterior on $\bar{\ell}$ (or on $a^*$), TS just requires at each step one linear optimization over $\mathcal{A}$.
3. TS’s regret bound is optimal in the following sense. W.l.o.g. one can assume $\log|\mathcal{A}| = O(n \log T)$ (by discretizing $\mathcal{A}$) and thus TS satisfies $R_T \lesssim n \sqrt{T \log T}$ for any action set. Furthermore one can show that there exist an action set and a prior such that for any strategy one has $R_T \gtrsim n \sqrt{T}$, see Dani, Hayes and Kakade [2008], Rusmevichientong and Tsitsiklis [2010], and Audibert, Bubeck and Lugosi [2011, 2014].
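As a concrete illustration of observation 2, here is a minimal numpy sketch of TS for an i.i.d. linear bandit with a Gaussian prior and Gaussian noise, where the posterior stays Gaussian and each round costs one posterior sample plus one linear optimization. The action set, loss vector, and all constants are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, noise = 3, 2000, 0.1

# Hypothetical finite action set and unknown loss vector (made-up numbers).
actions = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.], [.5, .5, 0.]])
theta = np.array([0.3, 0.7, 0.5])          # mean loss of action a is a @ theta

# Gaussian prior N(0, I) on theta with Gaussian noise keeps the posterior
# Gaussian, so sampling from it is cheap (conjugate updates below).
prec = np.eye(n)                           # posterior precision matrix
b = np.zeros(n)                            # precision-weighted posterior mean

regret = 0.0
best = (actions @ theta).min()
for t in range(T):
    mean = np.linalg.solve(prec, b)
    theta_sample = rng.multivariate_normal(mean, np.linalg.inv(prec))
    a = actions[np.argmin(actions @ theta_sample)]   # one linear optimization
    y = a @ theta + noise * rng.normal()             # observed loss
    prec += np.outer(a, a) / noise**2                # conjugate posterior update
    b += a * y / noise**2
    regret += a @ theta - best

assert regret / T < 0.1                    # per-round regret becomes small
```

With a continuous action set the argmin over `actions` would be replaced by a call to a linear optimization oracle, which is the whole computational appeal of TS here.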

Recall from Part 1 that exponential weights satisfies, for any loss estimate $\tilde{\ell}_t$ such that $\tilde{\ell}_t(a) \ge 0$ and $\mathbb{E}_{a_t \sim p_t} \tilde{\ell}_t = \ell_t$,

DHK08 proposed the following (beautiful) unbiased estimator for the linear case: $\tilde{\ell}_t = \Sigma_t^{-1} a_t (a_t^{\top} \ell_t)$, where $\Sigma_t = \mathbb{E}_{a \sim p_t}[a a^{\top}]$.

Again, amazingly, the variance is automatically controlled: $\mathbb{E}_{a \sim p_t}\big[(\tilde{\ell}_t^{\top} a)^2\big] \le n$.

Up to the issue that $\tilde{\ell}_t$ can take negative values this suggests the “optimal” $\sqrt{n T \log|\mathcal{A}|}$ regret bound.

1. The non-negativity issue of $\tilde{\ell}_t$ is a manifestation of the need for added exploration. DHK08 used a suboptimal exploration which led to an additional $\sqrt{n}$ in the regret. This was later improved in Bubeck, Cesa-Bianchi, and Kakade [2012] with an exploration based on the John ellipsoid (the smallest ellipsoid containing $\mathcal{A}$). You can check this video for some more details on this.
2. Sampling the exp. weights is usually computationally difficult, see Cesa-Bianchi and Lugosi [2009] for some exceptions.
3. Abernethy, Hazan and Rakhlin [2008] proposed an alternative (beautiful) strategy based on mirror descent. The key idea is to use a $\nu$-self-concordant barrier for $\mathrm{conv}(\mathcal{A})$ as a mirror map and to sample points uniformly in Dikin ellipsoids. This method’s regret is suboptimal by a factor $\sqrt{n}$ and the computational efficiency depends on the barrier being used.
4. Bubeck and Eldan [2014]‘s entropic barrier allows for a much more information-efficient sampling than AHR08. This gives another strategy with optimal regret which is efficient when $\mathcal{A}$ is convex (and when one can do linear optimization on $\mathcal{A}$). You can check this video for some more on the entropic barrier.
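The key property of the estimator discussed above can be checked numerically. The sketch below uses the form $\tilde{\ell} = \Sigma^{-1} a\,(a^{\top}\ell)$ with $\Sigma = \mathbb{E}_{a \sim p}[a a^{\top}]$ (my reading of the DHK08 construction); the action set, exploration distribution and loss vector are made up, chosen so that $\Sigma$ is invertible:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical exploration distribution p over three actions spanning R^2
# (Sigma = E[a a^T] must be invertible for the estimator to make sense).
actions = np.array([[1., 0.], [0., 1.], [1., 1.]])
p = np.array([0.3, 0.3, 0.4])
ell = np.array([0.2, -0.5])                      # adversary's loss vector

Sigma = (actions * p[:, None]).T @ actions       # E_{a ~ p}[a a^T]
Sinv = np.linalg.inv(Sigma)

# Draw a ~ p, observe only the scalar a^T ell, and average
# tilde_ell = Sigma^{-1} a (a^T ell); its expectation is exactly ell.
N = 200_000
idx = rng.choice(3, size=N, p=p)
obs = actions[idx] @ ell                         # the scalar losses observed
est = Sinv @ (actions[idx] * obs[:, None]).mean(axis=0)

assert np.allclose(est, ell, atol=0.02)
```

Note that the estimator is unbiased ($\mathbb{E}\,\tilde{\ell} = \Sigma^{-1}\Sigma\,\ell = \ell$) even though only the one-dimensional projection $a^{\top}\ell$ is ever observed; negativity of individual estimates is exactly the issue mentioned in item 1 above.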

Adversarial combinatorial bandit after Audibert, Bubeck and Lugosi [2011, 2014]

Combinatorial setting: $\mathcal{A} \subset \{0,1\}^n$, $\|a\|_1 \le m$ for all $a \in \mathcal{A}$, $\mathcal{L} = [0,1]^n$.

1. Full information case goes back to the end of the 90’s (Warmuth and co-authors); semi-bandit and bandit were introduced in Audibert, Bubeck and Lugosi [2011] (following several papers that studied specific sets $\mathcal{A}$).
2. This is a natural setting to study FPL-type (Follow the Perturbed Leader) strategies, see e.g. Kalai and Vempala [2004] and more recently Devroye, Lugosi and Neu [2013].
3. ABL11: Exponential weights is provably suboptimal in this setting! This is in sharp contrast with the basic case where $\mathcal{A} = \{e_1, \dots, e_n\}$.
4. Optimal regret in the semi-bandit case is $\sqrt{m n T}$ and it can be achieved with mirror descent and the natural unbiased estimator for the semi-bandit situation.
5. For the bandit case the bound for exponential weights from the previous slides gives $m \sqrt{n T \log|\mathcal{A}|}$ (you should read this as “range of the loss times square root of dimension times time times log-size of the set”). However the lower bound from ABL14 is $m^{3/2} \sqrt{n T}$, which is conjectured to be tight.
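The “natural unbiased estimator” in the semi-bandit case divides each observed coordinate by its marginal probability of being observed. A minimal Monte Carlo sanity check (the action set, sampling distribution and loss vector are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Semi-bandit toy check: actions live in {0,1}^3 and every coordinate i
# with a(i) = 1 is observed. The natural estimator divides each observed
# coordinate by its marginal observation probability.
actions = np.array([[1., 1., 0.], [1., 0., 1.], [0., 1., 1.]])
q = np.array([0.5, 0.3, 0.2])          # distribution over the three actions
ell = np.array([0.4, 0.1, 0.7])        # adversary's loss vector

marg = q @ actions                     # P(coordinate i is observed)

N = 200_000
idx = rng.choice(3, size=N, p=q)
observed = actions[idx] * ell          # only the visited coordinates are seen
est = (observed / marg).mean(axis=0)   # unbiased coordinate-wise estimate

assert np.allclose(est, ell, atol=0.02)
```

Because every visited coordinate is observed (rather than a single scalar), the estimator's variance is much smaller than in the full bandit case, which is reflected in the better $\sqrt{mnT}$ rate.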

Preliminaries for the i.i.d. case: a primer on least squares

Assume $Y_s = a_s^{\top} \theta^* + \xi_s$ where $(\xi_s)$ is an i.i.d. sequence of centered and sub-Gaussian real-valued random variables. The (regularized) least squares estimator for $\theta^*$ based on $(a_1, Y_1), \dots, (a_t, Y_t)$ is, with $\lambda > 0$ and $\Sigma_t = \lambda \mathrm{I}_n + \sum_{s \le t} a_s a_s^{\top}$: $\hat{\theta}_t = \Sigma_t^{-1} \sum_{s \le t} Y_s a_s$.

Observe that we can also write $\hat{\theta}_t = \Sigma_t^{-1} \big( (\Sigma_t - \lambda \mathrm{I}_n) \theta^* + \sum_{s \le t} \xi_s a_s \big)$, so that $\hat{\theta}_t - \theta^* = -\lambda \Sigma_t^{-1} \theta^* + \Sigma_t^{-1} \sum_{s \le t} \xi_s a_s$.

A basic martingale argument (see e.g., Abbasi-Yadkori, Pal and Szepesvari [2011]) shows that w.p. at least $1 - \delta$, for all $t \ge 1$, $\big\| \Sigma_t^{-1/2} \sum_{s \le t} \xi_s a_s \big\| \lesssim \sqrt{n \log(T/\delta)}$.

Note that $\|\lambda \Sigma_t^{-1} \theta^*\|_{\Sigma_t} \le \sqrt{\lambda}$ (w.l.o.g. we assumed $\|\theta^*\| \le 1$).
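The estimator itself is two lines of linear algebra. A minimal sketch on synthetic data (dimensions, regularization, and noise level are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, t, lam, noise = 4, 500, 1.0, 0.1

# Synthetic data for the model Y_s = <a_s, theta*> + noise (made-up sizes).
theta_star = rng.normal(size=n)
A = rng.normal(size=(t, n))                       # actions a_1, ..., a_t as rows
Y = A @ theta_star + noise * rng.normal(size=t)

# Regularized least squares: Sigma_t = lam*I + sum_s a_s a_s^T and
# hat_theta = Sigma_t^{-1} sum_s Y_s a_s.
Sigma = lam * np.eye(n) + A.T @ A
theta_hat = np.linalg.solve(Sigma, A.T @ Y)

assert np.linalg.norm(theta_hat - theta_star) < 0.1
```

In the bandit application below, the rows of `A` are of course chosen sequentially by the algorithm rather than sampled at random, which is exactly why the self-normalized (martingale) concentration argument is needed.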

i.i.d. linear bandit after DHK08, RT10, AYPS11

Let $\beta \simeq \sqrt{n \log(T/\delta)}$, and $\mathcal{C}_t = \{\theta : \|\hat{\theta}_t - \theta\|_{\Sigma_t} \le \beta\}$. We showed that w.p. at least $1-\delta$ one has $\theta^* \in \mathcal{C}_t$ for all $t$.

The appropriate generalization of UCB is to select: $(a_t, \tilde{\theta}_t) \in \mathrm{argmin}_{(a, \theta) \in \mathcal{A} \times \mathcal{C}_{t-1}} a^{\top} \theta$ (this optimization is NP-hard in general, more on that next slide). Then one has on the high-probability event:

To control the sum of squares we observe that: $\det \Sigma_t = \det \Sigma_{t-1} \big( 1 + \|a_t\|^2_{\Sigma_{t-1}^{-1}} \big)$

so that (assuming $\lambda \ge 1$, so that $\|a_t\|^2_{\Sigma_{t-1}^{-1}} \le 1$, and using $x \le 2 \log(1+x)$ for $x \in [0,1]$) $\sum_{t \le T} \|a_t\|^2_{\Sigma_{t-1}^{-1}} \le 2 \log \frac{\det \Sigma_T}{\det(\lambda \mathrm{I}_n)} \le 2 n \log(1 + T/\lambda)$.

Putting things together we see that the regret is $\tilde{O}(n \sqrt{T})$.
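For a finite action set the optimistic selection can be computed directly: for each action, take the lowest loss consistent with the confidence ellipsoid, i.e. $a^{\top}\hat{\theta} - \beta \|a\|_{\Sigma^{-1}}$. A minimal sketch (action set, $\theta^*$, and a crude constant confidence width $\beta$ are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n, T, noise, lam, beta = 2, 3000, 0.1, 1.0, 1.0

# Hypothetical finite action set and unknown theta (made-up numbers);
# beta is the confidence-width parameter, kept constant for simplicity.
actions = np.array([[1., 0.], [0., 1.], [.7, .7]])
theta = np.array([0.6, 0.2])

Sigma = lam * np.eye(n)          # regularized design matrix Sigma_t
bvec = np.zeros(n)               # running sum of Y_s * a_s
regret = 0.0
best = (actions @ theta).min()
for t in range(T):
    theta_hat = np.linalg.solve(Sigma, bvec)         # least squares estimate
    Sinv = np.linalg.inv(Sigma)
    widths = np.sqrt(np.einsum('ai,ij,aj->a', actions, Sinv, actions))
    optimistic = actions @ theta_hat - beta * widths  # lowest plausible loss
    a = actions[np.argmin(optimistic)]
    y = a @ theta + noise * rng.normal()              # observed loss
    Sigma += np.outer(a, a)
    bvec += y * a
    regret += a @ theta - best

assert regret / T < 0.05
```

With a continuous or combinatorial $\mathcal{A}$ this per-action enumeration is no longer available, which is the NP-hardness issue mentioned above.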

What’s the point of i.i.d. linear bandit?

So far we did not get any real benefit from the i.i.d. assumption (the regret guarantee we obtained is the same as for the adversarial model). To me the key benefit is in the simplicity of the i.i.d. algorithm which makes it easy to incorporate further assumptions.

1. Sparsity of $\theta^*$: instead of regularization with the $\ell_2$-norm to define $\hat{\theta}_t$ one could regularize with the $\ell_1$-norm, see e.g., Johnson, Sivakumar and Banerjee [2016] (see also Carpentier and Munos [2012] and Abbasi-Yadkori, Pal and Szepesvari [2012]).
2. Computational constraint: instead of optimizing over the confidence ellipsoid $\mathcal{C}_t$ to define $a_t$ one could optimize over an $\ell_2$-ball containing $\mathcal{C}_t$ (this would cost an extra $\sqrt{n}$ in the regret bound).
3. Generalized linear model: $\mathbb{E}[Y_s \mid a_s] = \sigma(a_s^{\top} \theta^*)$ for some known increasing link function $\sigma$, see Filippi, Cappe, Garivier and Szepesvari [2011].
4. $\log T$-regime: if $\mathcal{A}$ is finite (note that a polytope is effectively finite for us) one can get a regret of order $\mathrm{poly}(n) \frac{\log T}{\Delta}$, where $\Delta$ is the gap between the best and the second best action.

Some non-linear bandit problems

1. Lipschitz bandit: Kleinberg, Slivkins and Upfal [2008, 2016], Bubeck, Munos, Stoltz and Szepesvari [2008, 2011], Magureanu, Combes and Proutiere [2014]
2. Gaussian process bandit: Srinivas, Krause, Kakade and Seeger [2010]
3. Convex bandit: see the videos by myself and Ronen Eldan here and our arxiv paper.

Contextual bandit

We now make the game-changing assumption that at the beginning of each round a *context* is revealed to the player. The ideal notion of regret is now:

Sometimes it makes sense to restrict the mapping from contexts to actions, so that the infimum is taken over some *policy set* $\Pi$.

As far as I can tell the contextual bandit problem is an infinite playground and there is no canonical solution (or at least not yet!). Thankfully all we have learned so far can give useful guidance in this challenging problem.

Linear model after embedding

A natural assumption in several application domains is to suppose linearity in the loss after a correct embedding. Say we know mappings such that for some unknown (or in the adversarial case that ).

This is nothing but a linear bandit problem where the action set is changing over time. All the strategies we described are robust to this modification and thus in this case one can get a regret of (and for the stochastic case one can get efficiently ).

A much more challenging case is when the correct embedding is only known to belong to some class . Without further assumptions on we are basically back to the general model. Also note that a natural impulse is to run “bandits on top of bandits”, that is first select some and then select based on the assumption that is correct. We won’t get into this here, but let us investigate a related idea.

Exp4, Auer, Cesa-Bianchi, Freund and Schapire [2001]

One can play exponential weights on the set of policies with the following unbiased estimator: $\tilde{\ell}_t(\pi) = \frac{\ell_t(a_t)}{p_t(a_t)} \mathbf{1}\{\pi(x_t) = a_t\}$ (obvious notation: $x_t$ is the context, $a_t$ the played action, and $p_t(a)$ the probability of playing $a$).

Easy exercise: $R_T \le \sqrt{2 T K \log|\Pi|}$ (indeed the relative entropy term is smaller than $\log|\Pi|$ while the variance term is exactly $K$).

The only issue with this strategy is that the computational complexity is linear in the size of the policy space, which might be huge. A year and a half ago a major paper by Agarwal, Hsu, Kale, Langford, Li and Schapire was posted, with a strategy obtaining the same regret as Exp4 (in the i.i.d. model) but which is also computationally efficient given an oracle for the offline problem. Unfortunately the algorithm is not simple enough yet to be included in these slides.
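Exp4’s mechanics (weights over policies, induced distribution over arms, importance-weighted loss estimates) fit in a few lines of numpy. In the sketch below the contexts, policies, loss table, and learning rate are all made up for illustration, and the learning rate is a hand-picked constant rather than the theoretically tuned one:

```python
import numpy as np

rng = np.random.default_rng(5)
K, T, eta = 3, 5000, 0.01

# Hypothetical toy instance: two contexts, three arms, four policies.
# policies[pi][x] = arm chosen by policy pi on context x (all made up).
policies = np.array([[0, 0], [0, 1], [1, 2], [2, 2]])
mean_loss = np.array([[0.2, 0.8, 0.5],     # loss of each arm under context 0
                      [0.9, 0.6, 0.1]])    # loss of each arm under context 1

L = np.zeros(len(policies))    # cumulative estimated loss per policy
total = 0.0                    # realized loss of the played arms
for t in range(T):
    x = rng.integers(2)                                # i.i.d. context
    w = np.exp(-eta * (L - L.min()))                   # stable exp. weights
    probs = w / w.sum()
    arms = policies[:, x]                              # each policy's arm
    p = np.bincount(arms, weights=probs, minlength=K)  # induced arm dist.
    a = rng.choice(K, p=p)
    loss = mean_loss[x, a]                             # only this arm's loss is seen
    total += loss
    est = np.zeros(K)
    est[a] = loss / p[a]                               # importance-weighted estimate
    L += est[arms]                                     # charge each policy its arm's estimate

# The best policy maps context 0 -> arm 2 and context 1 -> arm 2,
# with average loss 0.3; Exp4 should do nearly as well.
assert total / T < 0.4
```

Note that the per-round work is linear in the number of policies, which is exactly the computational bottleneck discussed above.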

The statistician perspective, after Goldenshluger and Zeevi [2009, 2011], Perchet and Rigollet [2011]

Let the contexts be i.i.d. from some distribution absolutely continuous w.r.t. the Lebesgue measure. The reward for playing arm $a$ under context $x$ is drawn from some distribution on $[0,1]$ with mean function $f_a(x)$, which is assumed to be $\beta$-Hölder smooth. Let $\Delta(x)$ be the “gap” function (the gap between the best and the second best arm at context $x$).

A key parameter is the proportion of contexts with a small gap. The margin assumption is that for some $\alpha > 0$, one has $\mathbb{P}(\Delta(x) \le \delta) \le C \delta^{\alpha}$ for all $\delta \in (0,1]$.

One can achieve a regret of order $T \big( \frac{K \log K}{T} \big)^{\frac{\beta(1+\alpha)}{2\beta + d}}$, which is optimal at least in its dependency on $T$. It can be achieved by running Successive Elimination on an adaptively refined partition of the context space, see Perchet and Rigollet [2011] for the details.

The online multi-class classification perspective after Kakade, Shalev-Shwartz, and Tewari [2008]

Here the loss is assumed to be of the following very simple form: $\ell_t(a) = \mathbf{1}\{a \neq a_t^*\}$. In other words, using the context $x_t$ one has to predict the best action (which can be interpreted as a class) $a_t^*$.

KSST08 introduces the banditron, a bandit version of the multi-class perceptron for this problem. While with full information the online multi-class perceptron can be shown to satisfy a “regret” bound of order $\sqrt{T}$, the banditron attains only a regret of order $T^{2/3}$. See also Chapter 4 in Bubeck and Cesa-Bianchi [2012] for more on this.

1. The optimal regret for the linear bandit problem is $\tilde{\Theta}(n \sqrt{T})$. In the Bayesian context Thompson Sampling achieves this bound. In the i.i.d. case one can use an algorithm based on optimism in the face of uncertainty together with concentration properties of the least squares estimator.
2. The i.i.d. algorithm can easily be modified to be computationally efficient, or to deal with sparsity in the unknown vector $\theta^*$.
3. Extensions/variants: semi-bandit model, non-linear bandit (Lipschitz, Gaussian process, convex).
4. Contextual bandit is still a very active subfield of bandit theory.
5. Many important things were omitted. Example: knapsack bandit, see Badanidiyuru, Kleinberg and Slivkins [2013].

Some open problems we discussed