Optimal bound for stochastic bandits with corruption

Guest post by Mark Sellke.

In the comments of the previous blog post we asked if the new viewpoint on best of both worlds can be used to get clean “interpolation” results. The context is as follows: in a STOC 2018 paper followed by a COLT 2019 paper, the following corruption model was discussed: stochastic bandits, except for C rounds which are adversarial. The state of the art bounds were of the form: optimal (or almost optimal) stochastic term plus K C, and it was mentioned as an open problem whether KC could be improved to C (there is a lower bound showing that C is necessary — when C = O(\sqrt{T})). As was discussed in the comment section, it seemed that indeed this clean best of both worlds approach should certainly shed light on the corruption model. It turns out that this is indeed the case, and a one-line calculation resolves positively the open problem from the COLT paper. The formal result is as follows (recall the notation/definitions from the previous blog post):

Lemma: Consider a strategy whose regret with respect to the optimal action i^* is upper bounded by

(1)   \begin{equation*} c \sum_{t=1}^T \sum_{i \neq i^*} \sqrt{\frac{x_{i,t}}{t}} \,. \end{equation*}

Then in the C-corruption stochastic bandit model one has that the regret is bounded by:

    \[ C + 2 c \sqrt{K C} + c^2 \sum_{i \neq i^*} \frac{\log(T)}{\Delta_i} \]


Note that by the previous blog post we know strategies that satisfy (1) with c=10 (see Lemma 2 in the previous post).

Proof: In equation (1) let us apply Jensen over the corrupt rounds, this yields a term c \sqrt{K C}. For the non-corrupt rounds, let us use that

    \[ c \sqrt{\frac{x_{i,t}}{t}} \leq \frac{1}{2} \left( \Delta_i x_{i,t} + \frac{c^2}{t \Delta_i} \right) \]

The sum of the second term on the right hand side is upper bounded by c^2 \sum_{i \neq i^*} \frac{\log(T)}{2 \Delta_i}. On the other hand the sum (over non-corrupt rounds) of the first term is equal to 1/2 of the regret over the non-corrupt rounds, which is certainly smaller than 1/2 of the total regret plus C. Thus we obtain (denoting R for the total regret):

    \[ R \leq c \sqrt{K C} + c^2 \sum_{i \neq i^*} \frac{\log(T)}{2 \Delta_i} + \frac{C}{2} + \frac{R}{2} \]

which concludes the proof.


This entry was posted in Machine learning, Optimization, Theoretical Computer Science. Bookmark the permalink.

Leave a reply