This week I’m giving two 90 minutes lectures on bandit theory at MLSS Cadiz. Despite my 2012 survey with Nicolo I thought it would be a good idea to post my lectures notes here. Indeed while much of the material is similar, the style of a mini-course is quite different from the style of a survey. Also, bandit theory has surprisingly progressed since 2012 and many things can now be explained better. Finally in the survey we completely omitted the Bayesian model as we thought that we didn’t have much to add on this topic compared to existing sources (such as the 2011 book by Gittins, Glazebrook, Weber). For a mini-course this concern is irrelevant so I quickly discuss the famous Gittins index and its proof of optimality.
i.i.d. multi-armed bandit, Robbins 
Known parameters: number of arms and (possibly) number of rounds .
Unknown parameters: probability distributions on with mean (notation: ).
Protocol: For each round , the player chooses based on past observations and receives a reward/observation (independently from the past).
Performance measure: The cumulative regret is the difference between the player’s accumulated reward and the maximum the player could have obtained had she known all the parameters,
This problem models the fundamental tension between exploration and exploitation (one wants to pick arms that performed well in the past, yet one needs to make sure that no good option has been missed). Almost every week new applications are found that fit this simple framework and I’m sure you already have some in mind (the most popular one being ad placement on the internet).
i.i.d. multi-armed bandit: fundamental limitations
How small can we expect to be? Consider the -armed case where and where is unknown. Recall from Probability 101 (or perhaps 102) that with expected observations from the second arm there is a probability at least to make the wrong guess on the value of . Now let be the expected number of pulls of arm up to time when . One has
We refer to Bubeck, Perchet and Rigollet  for the details. The important message is that for fixed the lower bound is , while for the worse (which is of order ) it is . In the -armed case this worst-case lower bound becomes (see Auer, Cesa-Bianchi, Freund and Schapire ). The -lower bound is slightly “harder” to generalize to the -armed case (as far as I know there is no known finite-time lower bound of this type), but thankfully it was already all done 30 years ago. First some notation: let and the number of pulls of arm up to time . Note that one has . For let
Theorem [Lai and Robbins ]
Consider a strategy s.t. , we have if . Then for any Bernoulli distributions,
Note that so up to a variance-like term the Lai and Robbins lower bound is . This lower bound holds more generally than just for Bernoulli distributions, see for example Burnetas and Katehakis .
i.i.d. multi-armed bandit: fundamental strategy
Hoeffding’s inequality teaches us that with probability at least , ,
The regret analysis is straightforward: on a probability event one has
so that and in fact
i.i.d. multi-armed bandit: going further
- The numerical constant in the UCB regret bound can be replaced by (which is the best one can hope for), and more importantly by slightly modifying the derivation of the UCB one can obtain the Lai and Robbins variance-like term (that is replacing by ): see Cappe, Garivier, Maillard, Munos and Stoltz .
- In many applications one is merely interested in finding the best arm (instead of maximizing cumulative reward): this is the best arm identification problem. For the fundamental strategies see Even-Dar, Mannor and Mansour  for the fixed-confidence setting (see also Jamieson and Nowak  for a recent short survey) and Audibert, Bubeck and Munos  for the fixed budget setting. Key takeaway: one needs of order rounds to find the best arm.
- The UCB analysis extends to sub-Gaussian reward distributions. For heavy-tailed distributions, say with moment for some , one can get a regret that scales with (instead of ) by using a robust mean estimator, see Bubeck, Cesa-Bianchi and Lugosi .
Adversarial multi-armed bandit, Auer, Cesa-Bianchi, Freund and Schapire [1995, 2001]
For , the player chooses based on previous observations, and simultaneously an adversary chooses a loss vector . The player’s loss/observation is . The regret and pseudo-regret are defined as:
Obviously and there is equality in the oblivious case ( adversary’s choices are independent of the player’s choices). The case where is an i.i.d. sequence corresponds to the i.i.d. model we just studied. In particular we already know that is a lower bound on the attainable pseudo-regret.
Adversarial multi-armed bandit, fundamental strategy
The exponential weights strategy for the full information case where is observed at the end of round is defined by: play at random from where
In five lines one can show with and a well-chosen learning rate (recall that ):
For the bandit case we replace by in the exponential weights strategy, where
The resulting strategy is called Exp3. The key property of is that it is an unbiased estimator of :
Furthermore with the analysis described above one gets
It only remains to control the variance term, and quite amazingly this is straightforward:
Thus with one gets .
Adversarial multi-armed bandit, going further
- With the modified loss estimate one can prove high probability bounds on , and by integrating the deviations one can show .
- The extraneous logarithmic factor in the pseudo-regret upper can be removed, see Audibert and Bubeck . Conjecture: one cannot remove the log factor for the expected regret, that is for any strategy there exists an adaptive adversary such that .
- can be replaced by various measure of “variance” in the loss sequence, see e.g., Hazan and Kale .
- There exist strategies which guarantee simultaneously in the adversarial model and in the i.i.d. model, see Bubeck and Slivkins .
- Many interesting variants: graph feedback structure of Mannor and Shamir  (there is a graph on the set of arms, and when an arm is played one observes the loss for all its neighbors), regret with respect to best sequence of actions with at most switches, switching cost (interestingly in this case the best regret is , see Dekel, Ding, Koren and Peres ), and much more!
Bayesian multi-armed bandit, Thompson 
Here we assume a set of “models” and prior distribution over . The Bayesian regret is defined as
where simply denotes the regret for the i.i.d. model when the underlying reward distributions are . In principle the strategy minimizing the Bayesian regret can be computed by dynamic programming on the potentially huge state space . The celebrated Gittins index theorem gives sufficient condition to dramatically reduce the computational complexity of implementing the optimal Bayesian strategy under a strong product assumption on . Notation: denotes the posterior distribution on at time .
Theorem [Gittins ]
Consider the product and -discounted case: , , , and furthermore one is interested in maximizing . The optimal Bayesian strategy is to pick at time the arm maximizing the Gittins index:
where the expectation is over drawn from with , and the supremum is taken over all stopping times .
Note that the stopping time in the Gittins index definition gives the optimal strategy for a 2-armed game, where one arm’s reward distribution is while the other arm reward’s distribution is with as a prior for .
Proof: The following exquisite proof was discovered by Weber . Let
be the Gittins index of arm at time , which we interpret as the maximum charge one is willing to pay to play arm given the current information. The prevailing charge is defined as (i.e. whenever the prevailing charge is too high we just drop it to the fair level). We now make three simple observations which together conclude the proof:
- The discounted sum of prevailing charge for played arms is an upper bound (in expectation) on the discounted sum of rewards. Indeed the times at which the prevailing charge are updated are stopping times, and so between two such times the expected sum of discounted reward is smaller than the discounted sum of the fair charge at time which is equal to the prevailing charge at any time in .
- Since the prevailing charge is nonincreasing, the discounted sum of prevailing charge is maximized if we always pick the arm with maximum prevailing charge. Also note that the sequence of prevailing charge does not depend on the algorithm.
- Gittins index does exactly 2. (since we stop playing an arm only at times at which the prevailing charge is updated) and in this case 1. is an equality. Q.E.D.
Bayesian multi-armed bandit, Thompson Sampling (TS)
In machine learning we want (i) strategies that can deal with complicated priors, and (ii) guarantees for misspecified priors. This is why we have to go beyond the Gittins index theory.
In his 1933 paper Thompson proposed the following strategy: sample and play .
Theoretical guarantees for this highly practical strategy have long remained elusive. Recently Agrawal and Goyal  and Kaufmann, Korda and Munos  proved that TS with Bernoulli reward distributions and uniform prior on the parameters achieves (note that this is the frequentist regret!). We also note that Liu and Li  takes some steps in analyzing the effect of misspecification for TS.
Let me also mention a beautiful conjecture of Guha and Munagala : for product priors, TS is a 2-approximation to the optimal Bayesian strategy for the objective of minimizing the number of pulls on suboptimal arms.
Bayesian multi-armed bandit, Russo and Van Roy  information ratio analysis
Assume a prior in the adversarial model, that is a prior over , and let denote the posterior distribution (given ). We introduce
The key observation is that (recall that )
Indeed, equipped with Pinsker’s inequality and basic information theory concepts one has (we denote for the mutual information conditionally on everything up to time , also denotes the law of conditionally on everything up to time ):
Thus which gives the claim thanks to a telescopic sum. We will use this key observation as follows (we give a sequence of implications leading to a regret bound so that all that is left to do is to check that the first statement in this sequence is true for TS):
Thus writing and we have
For TS the following shows that one can take :
Thus TS always satisfies . Side note: by the minimax theorem this implies the existence of a strategy for the oblivious adversarial model with regret (of course we already proved that such a strategy exist, in fact we even constructed one via exponential weights, but the point is that the proof here does not require any “miracle” –yes exponential weights are kind of a miracle, especially when you consider how the variance of the unbiased estimator gets automatically controlled).
Summary of basic results
- In the i.i.d. model UCB attains a regret of and by Lai and Robbins’ lower bound this is optimal (up to a multiplicative variance-like term).
- In the adversarial model Exp3 attains a regret of and this is optimal up to the logarithmic term.
- In the Bayesian model, Gittins index gives an optimal strategy for the case of product priors. For general priors Thompson Sampling is a more flexible strategy. Its Bayesian regret is controlled by the entropy of the optimal decision. Moreover TS with an uninformative prior has frequentist guarantees comparable to UCB.