ORF523: Optimization with bandit feedback

In this last lecture we consider the case where one can only access the function with a noisy 0^{th}-order oracle (see this lecture for a definition). This assumption models the ‘physical reality’ of many practical problems (on the contrary to the case of a 1^{st}-order oracle which was essentially used as an ‘approximation’). Indeed there are many situations where the function f to be optimized is not known, but one can simulate the function. Imagine for example that we are trying to find the correct dosage of a few chemical elements in order to produce the most effective drug. We can imagine that given specific dosage one can produce the corresponding drug, and then test it to estimate its effectiveness. This corresponds precisely to assuming that we have a noisy 0^{th}-order oracle for the function \mathrm{dosage} \mapsto \mathrm{effectiveness}.

As one can see with the above example, the type of applications for 0^{th}-order optimization are fundamentally different from the type of applications for 1^{st}-order optimization. In particular in 1^{st}-order optimization the function f is known (we need to compute gradients), and thus it must have been directly engineered by humans (who wrote down the function!). This observation shows that in these applications one has some flexibility in the choice of f, and since we know that (roughly speaking) convex functions can be optimized efficiently, one has a strong incentive into engineering convex functions. On the other hand in 0^{th}-order optimization the function is produced by ‘Nature’. As it turns out, Nature does not care that we can only solve convex problems, and in most applications of 0^{th}-order optimization I would argue that it does not make sense to make a convexity assumption.

As you can imagine 0^{th}-order optimization appear in various contexts, from experiments in physics or biology, to the design of Artificial Intelligence for games, and the list goes on. In particular different communities have been looking at this problem, with different terminology and different assumptions. Being completely biased by my work on the multi-armed bandit problem, I believe that the bandit terminology is the nicest one to describe 0^{th}-order optimization and I will now refer to it as optimization with bandit feedback (for more explanation on this you can check my survey with Cesa-Bianchi on bandit problems). In order to simplify the discussion we will focus in this lecture on optimization over a finite set, that is |\mathcal{X}| < + \infty (we give some references on the continuous case at the end of the post).


Discrete optimization with bandit feedback

We want to solve the problem:

    \[\max_{i = 1, \hdots, K} \mu_i ,\]

where the \mu_i‘s are unknown quantities on which we can obtain information as follows: if one makes a query to action i (we will also say that one pulls arm i), then one receives an independent random variable Y such that \mathbb{E} Y = \mu_i. We will also assume that the distribution of Y is subgaussian (for example this include Gaussian distributions with variance less than 1 and distributions supported on an interval of length less than 2). The precise definition is not important, we will only use the fact that if one receives Y_1, \hdots, Y_s from pullings s times arm i then \hat{\mu}_{i,s} = \frac{1}{s} \sum_{t=1}^s Y_s satisfies:

(1)   \begin{equation*}  \mathbb{P}(\hat{\mu}_{i,s} - \mu_i > u) \leq \exp \left(- \frac{s u^2}{2} \right) . \end{equation*}

For sake of simplicity we assume that \mathrm{argmax}_{1 \leq i \leq K} \mu_i = \{i^*\}. An important parameter in our analysis is the suboptimality gap for arm i: \Delta_i = \mu_{i^*} - \mu_i (we also denote \Delta = \min_{i \neq i^*} \Delta_i).


We will be interested in two very different models:

  • In the PAC (Probably Approximately Correct) setting the algorithm can make as many queries as necessary, so that when it stops querying it can output an arm J \in \{1, \hdots, K\} such that \mathbb{P}(\mu_J < \mu_{i^*} - \epsilon) \leq \delta where \epsilon and \delta are prespecified accuracies. We denote by \mathcal{N} the number of queries that the algorithm made before stopping, and the objective is to have (\epsilon,\delta)-PAC algorithms with \mathcal{N} as small as possible. This formulation goes back to Bechhofer (1954).
  • In the fixed budget setting the algorithm can make at most n queries, and then it has to output an arm J \in \{1, \hdots, K\}. The goal here is to minimize the optimization error \mu_{i^*} - \mu_J. Strangely this formulation is much more recent: it was proposed in this paper that I wrote with Munos and Stoltz.

While at first sight the two models might seem similar, we will see that in fact there is a key fundamental difference.


Trivial algorithms

A trivial (\epsilon,\delta)-PAC algorithm would be to query each arm of order of \frac{1}{\epsilon^2} \log \frac{K}{\delta} times and then output the empirical best arm. Using (1) it is obvious that this algorithm is indeed (\epsilon,\delta)-PAC, and furthermore it satisfies \mathcal{N} \leq c \frac{K}{\epsilon^2} \log \frac{K}{\delta}.


In the fixed budget setting a trivial algorithm is to divide the budget evenly among the K arms. Using (1) it is immediate that this strategy satisfies \mathbb{P}(\mu_{i^*} - \mu_J \geq \epsilon) \leq K \exp( - c \frac{n}{K} \epsilon^2), which equivalently states that to have \mathbb{P}(\mu_{i^*} - \mu_J \geq \epsilon) \leq \delta one needs the budget n to be at least of order of \frac{K}{\epsilon^2} \log \frac{K}{\delta}.


We will now see that these trivial algorithms can be dramatically improved by taking into account the potential heterogeneity in the \mu_i‘s. For sake of simplicty we focus now on finding the best arm i^*, that is in the PAC setting we take \epsilon = 0, and in the fixed budget setting we consider \mathbb{P}(J \neq i^*). The critical quantity will be the hardness measure:

    \[H = \sum_{i \neq i^*} \frac{1}{\Delta_i^2} ,\]

that we introduced with Audibert and Munos in this paper (though it appeared previously in many other papers). We will show that, roughly speaking, while the trivial algorithms need of order of \frac{K}{\Delta^2} queries to find the optimal arm, our improved algorithms can find it with only H queries. Note that for information-theoretic reasons H is a lower bound on the number of queries that one has to make to find the optimal arm (see this paper by Mannor and Tsitsiklis for a lower bound in the PAC setting, and my paper with Audibert and Munos for a lower bound in the fixed budget setting).


Successive Elimination (SE) for the PAC setting

The Successive Elimination algorithm was proposed in this paper by Even-Dar, Mannor and Mansour. The idea is very simple: start with a set of active arms A_1 = \{1, \hdots, K\}. At each round t, pull once each arm in A_t. Now construct confidence intervals of size c \sqrt{\frac{\log (K t / \delta)}{t}} for each arm, and build A_{t+1} from A_t by eliminating arms in A_t for which the confidence interval does not overlap with the confidence interval of the currently best empirical arm in A_t. The algorithm stops when |A_t| = 1, and it outputs the single element of A_t. Using an union bound it is an easy exercise to verify that this algorithm is \delta-PAC, and furthermore with probability at least 1- \delta one has \mathcal{N} \leq c H \log \frac{K}{\delta \Delta^2}.


Successive Rejects (SR) for the fixed budget setting

The Successive Elimination algorithm cannot be easily adapted in the fixed budget setting. The reason is that in the fixed budget framework we do not know a priori what is a reasonable value for the confidence parameter \delta. Indeed in the end an optimal algorithm should have a probability of error of order \exp(- c n / H), which depends on the unknown hardness parameter H. As a result one cannot use strategy based on confidence intervals in the fixed budget setting unless one knows H (note that estimating H from data is basically as hard as finding the best arm). With Audibert and Munos we proposed an alternative to SE for the fixed budget that we called Successive Rejects.

The idea is to divide the budget n into K-1 chunks n_1, \hdots, n_{K-1} such that n_1 + \hdots + n_{K-1} = n. The algorithm then runs in K-1 phases. Let A_k be the set of active arms in phase k (with A_1 = \{1, \hdots, K\}). In phase k each arm in A_k is sampled n_k - n_{k-1} times, and the end of the phase the arm i with the worst empirical performance in A_k is rejected, that is A_{k+1} = A_k \setminus \{i\}. The output of the algorithm is the unique element of A_K.

Let \Delta_{(1)} \leq \Delta_{(2)} \leq \hdots \leq \Delta_{(K)} be the ordered \Delta_i‘s. Remark that in phase k, one of the k worst arms must still be in the active set, and thus using a trivial union bound one obtains that:

    \[\mathbb{P}(J \neq i^*) \leq \sum_{k=1}^{K-1} 2 k \exp\left( - n_k \frac{\Delta_{(K-k+1)}^2}{8} \right) .\]

Now the key observation is that by taking n_k proportional to \frac{1}{K+1-k} one obtains a bound of the form \exp( - \tilde{O} (n / H)). Precisely let n_k = \alpha \frac{n}{K+1 - k} where \alpha is such that the n_k‘s sum to n (morally \alpha is of order 1 / \log K). Then we have

    \[n_k \Delta_{(K-k+1)}^2 \geq \frac{\alpha n}{\max_{2 \leq i \leq K} i \Delta_{(i)}^{-2}} \geq \frac{\alpha n}{\sum_{i=2}^K \frac{1}{\Delta_{(i)}^2}} \geq c \frac{n}{\log(K) H} ,\]

for some numerical constant c>0. Thus we proved that SR satisfies

\mathbb{P}(J \neq i^*) \leq K^2 \exp \left(- c \frac{n}{\log(K) H} \right), which equivalently states that SR finds the best arm with probability at least 1-\delta provided that the budget is at least of order H \log(K) \log(K / \delta).


The continuous case

Many questions are still open in the continuous case. As I explained at the beginning of the post, convexity might not be the best assumption from a practical point of view, but it is nonetheless a very natural mathematical problem to try to understand the best rate of convergence in this case. This is still an open problem, and you can see the state of the art for upper bounds in this paper by Agarwal, Foster, Hsu, Kakade and Rakhlin, and for lower bounds in this paper by Shamir. In my opinion a more ‘practical’ assumption on the function is simply to assume that it is Lipschitz in some metric. The HOO (Hierarchical Optimistic Optimization) algorithm attains interesting performances when the metric is known, see this paper by myself, Munos, Stoltz and Szepesvari (similar results were obtained independently by Kleinberg, Slivkins and Upfal in this paper). Recently progress has been made for the case where the metric is unknown, see this paper by Slivkins, this paper by Munos, and this one by Bull. Finally let me remark that this Lipschitzian assumption is very weak, and thus one cannot hope to solve high-dimensional problem with the above algorithms. In fact, these algorithms are designed for small dimensional problems (say dimension less than 5 or so). In standard optimization one can solve problems in very large dimension because of the convexity assumption. For 0^{th}-order optimization I think that we still don’t have a natural assumption that would allow us to scale-up the algorithms to higher dimensional problems.

This entry was posted in Optimization. Bookmark the permalink.

One Response to "ORF523: Optimization with bandit feedback"