Amazing progress in adversarially robust stochastic multi-armed bandits

In this post I briefly discuss some recent stunning progress on robust bandits (for more background on bandits see these two posts, part 1 and part 2; in particular, what is described below gives a solution to Open Problem 3 at the end of part 2).

 

Stochastic bandit and adversarial examples

In multi-armed bandit problems the gold standard property, going back to the seminal 1985 paper of Lai and Robbins, is to have a regret upper bounded by:

(1)   \begin{equation*} \sum_{i \neq i^*} \frac{\log(T)}{\Delta_i} \,. \end{equation*}

Let me unpack this a bit: this is for the scenario where the reward process for each action is simply an i.i.d. sequence from some fixed distribution, i^* is the index of the (unique) best action, and \Delta_i is the gap between the mean value of the best action and that of action i. Such a guarantee is extremely strong, as in particular it means that actions whose average performance is a constant away from the optimal arm are very rarely played (only of order \log(T) times). On the other hand, the price to pay for such an aggressive behavior (by this I mean focusing on good actions very quickly) is that all the classical algorithms attaining the above bound are extremely sensitive to adversarial examples: that is, if there is some deviation from the i.i.d. assumption (even a very brief one), the algorithms can suddenly suffer regret linear in T.
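To make the notion of a “classical algorithm” concrete, here is a minimal Python sketch (my own illustration, not from the post) of a UCB-style index strategy, one standard way to attain a bound of the form (1); the interface reward_fn is a hypothetical stand-in for the environment.

import numpy as np

def ucb1(reward_fn, K, T):
    """Minimal UCB1-style sketch: pull each arm once, then pull the arm
    maximizing its empirical mean plus an exploration bonus sqrt(2*log(t)/n_i).
    With i.i.d. rewards in [0, 1] this kind of index policy has regret of
    order sum_{i != i*} log(T)/Delta_i, i.e. of the form (1)."""
    counts = np.zeros(K)   # number of pulls of each arm
    means = np.zeros(K)    # empirical mean reward of each arm
    total = 0.0
    for t in range(1, T + 1):
        if t <= K:
            i = t - 1      # initialization: pull each arm once
        else:
            bonus = np.sqrt(2.0 * np.log(t) / counts)
            i = int(np.argmax(means + bonus))
        r = reward_fn(i, t)                      # observe only the played arm
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]   # online update of the mean
        total += r
    return total

This aggressive concentration on the empirically best arm is exactly what makes such strategies fragile outside the i.i.d. setting, as discussed above.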

 

Adversarial bandit

Of course there is an entirely different line of work, on adversarial multi-armed bandits, where the whole point is to prove regret guarantees for any reward process. In this case the best one can hope for is a regret of order \sqrt{K T}. The classical algorithm in this model, Exp3, attains a regret of order \sqrt{K T \log(K)}. In joint work with Jean-Yves Audibert we showed back in 2009 that the following strategy, which we called PolyINF, attains the optimal \sqrt{KT}: view Exp3 as Mirror Descent with the (negative) entropy as a regularizer, and now replace the entropy by a simple rational function, namely - \sum_{i=1}^K x_i^p with p \in (0,1) (this mirror descent view was actually derived in a later paper with Gabor Lugosi). The proof becomes one line (given the appropriate knowledge of mirror descent and estimation in bandit games): the radius part of the bound is of the form \frac{1}{\eta} \sum_{i=1}^K x_i^p \leq \frac{K^{1-p}}{\eta}, while the variance is of the form (since the inverse of the Hessian of the mirror map is a diagonal matrix with entries proportional to x_i^{2-p}):

    \[ \eta \sum_{i=1}^K x_i^{2-p} \frac{1}{x_i} = \eta \sum_{i=1}^K x_i^{1-p} \leq \eta K^{p} \,. \]

Thus optimizing over \eta yields \sqrt{K T} for any p \in (0,1). Interestingly, the best numerical constant in the bound is obtained for p=1/2.
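To make this concrete, here is a minimal Python sketch of the p=1/2 instance of this strategy (the one that will matter below), written as an FTRL/mirror descent update over the simplex with the importance-weighted loss estimator. The interface is my own: the names ftrl_step, poly_inf and loss_fn, as well as the bisection used to solve the update, are illustration choices and not from the post or the papers; I also use the anytime learning rate \eta_t = 1/\sqrt{t} that appears later in the post rather than a fixed \eta tuned with T and K.

import numpy as np

def ftrl_step(L_hat, eta, iters=60):
    """Solve min_{x in simplex} <L_hat, x> - (1/eta) * sum_i sqrt(x_i).
    The first-order condition gives x_i = 1 / (2*eta*(L_hat_i + lam))^2 for a
    Lagrange multiplier lam, found by bisection so that the x_i sum to one."""
    K = len(L_hat)
    lo = -np.min(L_hat) + 1e-12                   # need L_hat_i + lam > 0 for all i
    hi = -np.min(L_hat) + np.sqrt(K) / (2 * eta)  # at this lam the sum is <= 1
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        x = 1.0 / (2.0 * eta * (L_hat + lam)) ** 2
        if x.sum() > 1.0:
            lo = lam
        else:
            hi = lam
    x = 1.0 / (2.0 * eta * (L_hat + hi)) ** 2
    return x / x.sum()                            # tiny renormalization for safety

def poly_inf(loss_fn, K, T, seed=0):
    """Run the p=1/2 strategy for T rounds; loss_fn(i, t) returns the loss in
    [0, 1] of arm i at round t (only the played arm's loss is ever observed)."""
    rng = np.random.default_rng(seed)
    L_hat = np.zeros(K)                           # cumulative loss estimates
    total = 0.0
    for t in range(1, T + 1):
        x = ftrl_step(L_hat, eta=1.0 / np.sqrt(t))
        i = rng.choice(K, p=x)                    # play i ~ x_t
        loss = loss_fn(i, t)                      # bandit feedback
        L_hat[i] += loss / x[i]                   # importance-weighted estimator
        total += loss
    return total

The twist discussed in the rest of the post is that this very strategy, with p=1/2, turns out to be best of both worlds.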

 

Best of both worlds

This was the state of affairs back in 2011, when with Alex Slivkins we started working on a best of both worlds type algorithm (which in today’s language is exactly a stochastic MAB robust to adversarial examples): namely one that gets the \log(T) guarantee (in fact \log^2(T) in our original paper) if the environment is the nice i.i.d. one, and also \sqrt{K T} (in fact \sqrt{K T \log^3(T)}) in the worst case. This original best of both worlds algorithm was of the following form: be aggressive as if it were a stochastic environment, but still sample the bad actions sufficiently often to make sure there isn’t an adversary trying to hide some non-stochastic behavior on these seemingly bad-performing actions. Of course the whole difficulty was to show that such a defense can be implemented without hurting the stochastic performance too much (remember that bad actions can only be sampled of order \log(T) times!). Since this COLT 2012 paper there have been many improvements to the original strategy, as well as many variants/refinements (one family of variants worth mentioning are the works trying to interpolate smoothly between the stochastic and adversarial models, see e.g. here and here).

 

A stunning twist

The amazing development that I want to talk about in this post is the following: about a year ago Julian Zimmert and Yevgeny Seldin proved that the 2009 PolyINF strategy (crucially with p=1/2) actually gets the 2012 best of both worlds bound! This is truly surprising, as in principle mirror descent does not “know” anything about stochastic environments, and it does not make any sophisticated concentration argument (say, as in Lai and Robbins), yet it seems to automatically and optimally pick up on the regularity in the data. This is really amazing to me, and it is also a total surprise that the polynomial regularizer has such a strong adaptivity property, given that it was merely introduced to remove a benign log term.

The crucial observation of Zimmert and Seldin is that a certain self-bounding property of the regret implies (in a one-line calculation) the best of both worlds result:

Lemma 1: Consider a strategy whose regret with respect to the optimal action i^* is upper bounded by

(2)   \begin{equation*} C \sum_{t=1}^T \sum_{i \neq i^*} \sqrt{\frac{x_{i,t}}{t}} \,. \end{equation*}

(Recall that for multi-armed bandits one selects a probability distribution x_t over the actions, so x_{i,t} denotes here the probability of playing action i at time t.) Then the regret is in fact bounded by 2 C \sqrt{K T} (this follows by Jensen on the sum over i, as spelled out below), and moreover if the environment is stochastic the regret is in fact bounded by C^2 times (1).
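Indeed, for the adversarial part, Jensen (or Cauchy-Schwarz) on the sum over i gives

    \[ \sum_{i \neq i^*} \sqrt{\frac{x_{i,t}}{t}} \leq \sqrt{\frac{K \sum_{i \neq i^*} x_{i,t}}{t}} \leq \sqrt{\frac{K}{t}} \,, \]

so that (2) is at most C \sum_{t=1}^T \sqrt{K/t} \leq 2 C \sqrt{K T}.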

Proof: Assuming that the environment is stochastic we can write the regret as \sum_{i,t} \Delta_i x_{i,t}, so by the assumed bound (2) and the AM-GM inequality C \sqrt{\frac{x_{i,t}}{t}} \leq \frac{1}{2} \left( \Delta_i x_{i,t} + \frac{C^2}{t \Delta_i} \right) one has:

    \[ \sum_{i \neq i^*,t} \Delta_i x_{i,t} \leq \frac{1}{2} \sum_{i \neq i^*,t} \left(\Delta_i x_{i,t} + \frac{C^2}{t \Delta_i} \right) \,, \]

which means that the left-hand side is at most \sum_{i \neq i^*,t} \frac{C^2}{t \Delta_i}, and since \sum_{t=1}^T \frac{1}{t} \leq 1 + \log(T) this is indeed of order C^2 times (1).

 

Yet another one-line proof (okay, maybe 5 lines)

Zimmert and Seldin proved that PolyINF with p=1/2 actually satisfies the self-bounding property of Lemma 1 (and thus obtains the best of both worlds guarantee). In another recent paper, joint with Haipeng Luo and Chen-Yu Wei, Zimmert simplifies the analysis by using a very mild variation of the PolyINF regularizer, namely - \sum_{i=1}^K (\sqrt{x_i} + \sqrt{1-x_i}). In my view it’s the proof from the book for the best of both worlds result (or very close to it)! Here it is:

Lemma 2: Equation (2) with C=10 is an upper bound on the regret of mirror descent with learning rate \eta_t = \frac{1}{\sqrt{t}}, mirror map \Phi(x) = - \sum_{i=1}^K (\sqrt{x_i} + \sqrt{1-x_i}), and the standard multi-armed bandit loss estimator.
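Here, by the standard loss estimator I mean the usual importance-weighted one: if i_t denotes the action played at round t (drawn from x_t) and \ell_{i_t,t} its observed loss, then

    \[ \hat{\ell}_{i,t} = \frac{\ell_{i,t}}{x_{i,t}} \mathbf{1}\{i_t = i\} \,, \]

which is unbiased and satisfies \mathbb{E}_{i_t \sim x_t}[\hat{\ell}_{i,t}^2] \leq \frac{1}{x_{i,t}} for losses in [0,1]; this is where the \frac{1}{x_{i,t}} factor in the variance term of (3) below comes from.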

Proof: The classical mirror descent analysis from any good book will tell you that the regret is controlled by (for \Phi(x) = \sum_{i=1}^K \phi(x_i) and with the convention \eta_0 = + \infty):

(3)   \begin{equation*} \sum_{t=1}^T \left(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\right) (\Phi(x^*) - \Phi(x_t)) + \sum_{t=1}^T \eta_t \sum_{i=1}^K \frac{1}{x_{i,t} \phi''(x_{i,t})} \,. \end{equation*}

We now consider those terms for the specific \Phi and \eta_t suggested in the lemma. First notice that \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} = \sqrt{t} - \sqrt{t-1} = \frac{1}{\sqrt{t} + \sqrt{t-1}} \leq \eta_t. Moreover \phi(x^*_i) = - 1 (since x^* is integral) so that \phi(x^*_i) - \phi(x_{i,t}) = \sqrt{x_{i,t}} + \sqrt{1-x_{i,t}} - 1 \leq \min(\sqrt{x_{i,t}}, \sqrt{1-x_{i,t}}) (using that both square roots are at most 1). In other words the first term in (3) is upper bounded by

    \[ \sum_{t=1}^T \sum_{i=1}^K \sqrt{\frac{\min(x_{i,t}, 1-x_{i,t})}{t}} \leq 2 \sum_{t=1}^T \sum_{i \neq i^*} \sqrt{\frac{x_{i,t}}{t}} \, \]

where the factor 2 comes from bounding the i = i^* term separately, using \sqrt{1-x_{i^*,t}} = \sqrt{\sum_{i \neq i^*} x_{i,t}} \leq \sum_{i \neq i^*} \sqrt{x_{i,t}} (the terms i \neq i^* are simply bounded via \min(x_{i,t}, 1-x_{i,t}) \leq x_{i,t}).

Next note that \phi''(s) = \frac{1}{4} (s^{-3/2} + (1-s)^{-3/2}) \geq \frac{1}{4 \min(s,1-s)^{3/2}}, so that the second term in (3) is upper bounded by

    \[ 4 \sum_{t=1}^T \sum_{i=1}^K \sqrt{\frac{x_{i,t}}{t}} \min(1, (1-x_{i,t})/x_{i,t})^{3/2} \leq 8 \sum_{t=1}^T \sum_{i \neq i^*} \sqrt{\frac{x_{i,t}}{t}} \, \]

where the inequality follows by considering the two cases of whether x_{i^*,t} is smaller or larger than 1/2, as spelled out below.
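In detail: for i \neq i^* one simply bounds \min(1, (1-x_{i,t})/x_{i,t})^{3/2} \leq 1. For i = i^*, if x_{i^*,t} \geq 1/2 then

    \[ \sqrt{x_{i^*,t}} \left(\frac{1-x_{i^*,t}}{x_{i^*,t}}\right)^{3/2} = \sqrt{1-x_{i^*,t}} \cdot \frac{1-x_{i^*,t}}{x_{i^*,t}} \leq \sqrt{1-x_{i^*,t}} \,, \]

while if x_{i^*,t} \leq 1/2 the i^* term is at most \sqrt{x_{i^*,t}} \leq \sqrt{1-x_{i^*,t}}. In both cases the i^* term is at most \sqrt{1-x_{i^*,t}} \leq \sum_{i \neq i^*} \sqrt{x_{i,t}} (as in the first term), which accounts for the factor 8 = 4 \times 2. Putting the two terms together gives (2) with C = 2 + 8 = 10, concluding the proof.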
