Kernel-based methods for convex bandits, part 3

(This post absolutely requires having read Part 1 and Part 2.) A key assumption at the end of Part 1 was that, after rescaling space so that the current exponential weights distribution p_t is isotropic, one has

(1)   \begin{equation*} \mathrm{supp}(p_t) \subset \{x \in \mathbb{R}^n : |x| \leq R\} =:F_{t-1} , \end{equation*}

for some R = \tilde{O}(\sqrt{n}). Effectively this means that the estimated cumulative loss outside of the ball F_{t-1} is infinite (recall that p_t is proportional to \exp(- \eta \sum_{s=1}^{t-1} \tilde{\ell}_s)). Thus to enforce (1) (at time t+1) we will actually set the loss estimate \tilde{\ell}_t to +\infty on F_t^c. The price one pays for this is that the loss estimator is now unbiased only on F_t, which in turn means that we control the cumulative regret with respect to points in F_T only. We believe that in the so-called stochastic setting, where the loss sequence \ell_t is i.i.d., one should be able to prove that the minimizer of the expected loss function remains in F_t at all rounds t (for R=\tilde{O}(\sqrt{n})), which together with the calculations from Part 1 and Part 2 would imply a cumulative regret of order \tilde{O}(n^{4.5} \sqrt{T}) (note: we were not able to prove this and I think it is a great and accessible open problem). However this will certainly not be true in general, as one could imagine an adversary that makes us zoom in on a small portion of space during the first T^{1-\epsilon} rounds and then moves the optimum far away from this region for the remaining (1-o(1)) T rounds. To solve this issue we introduce a restart condition: essentially, if the algorithm believes that the optimum might be outside the region F_t then it restarts (i.e. it plays as if the next round were the first one in a game with time horizon T-t). We will modify the algorithm so as to ensure that when a restart occurs the algorithm actually had negative regret, and thus in the final regret bound we can ignore everything that happened before the last restart. The definition of the focus region F_t given in (1) will not quite work, and in fact we will construct a region which satisfies

    \[F_{t-1} \subset \{x \in \mathbb{R}^n : |x| \leq O(n R) \} .\]

Furthermore we will need to take R = \tilde{O}(n^2), which in turn forces us to take \lambda = \tilde{O}(1/n^8) (indeed, recall that in Part 1 we showed that the magnitude of the estimated loss is essentially \exp(nR \times \sqrt{\lambda n}), where nR instead of R comes from the above display, and furthermore in Part 2 we explained that with the Gaussian core one needs to take \lambda to be 1/n times smaller than the value predicted in Part 1), and hence the \tilde{O}(n^{9.5} \sqrt{T}) final regret bound (recall that we obtained a bound of \frac{n}{\lambda} \sqrt{n T}, where the first n comes from the fact that we need to take a learning rate 1/n times smaller than the optimal learning rate to ensure approximate log-concavity of the exponential weights, and the \sqrt{n} comes from the relative entropy distance to the optimum at the beginning).
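As a quick sanity check on this exponent bookkeeping (a side computation, not part of the argument), plugging \lambda = \tilde{O}(1/n^8) into the bound \frac{n}{\lambda} \sqrt{n T} indeed gives \tilde{O}(n^{9.5} \sqrt{T}):

```python
from fractions import Fraction as F

# Exponent of n in (n / lambda) * sqrt(n * T) with lambda ~ 1/n^8
# (log factors suppressed): n^1 * n^8 * n^{1/2} = n^{19/2} = n^{9.5}.
exponent_of_n = F(1) + F(8) + F(1, 2)
assert exponent_of_n == F(19, 2)
print(float(exponent_of_n))  # 9.5
```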

We will make extensive use of the following result:

Lemma: Let p = \exp(-f) be an isotropic log-concave measure. Then for any x such that |x| = \tilde{\Omega}(n) one has f(x) - \min_y f(y) = \Omega(|x|).
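As a quick sanity check (for illustration only, not part of the proof), here is a minimal numerical verification of the Lemma on one concrete isotropic log-concave measure, the product of Laplace distributions rescaled to unit variance, for which the linear growth of the potential is explicit:

```python
import math
import random

def potential_gap(x):
    # f(x) - min_y f(y) for the isotropic product-Laplace measure
    # p(x) proportional to exp(-sqrt(2) * ||x||_1): each coordinate is
    # Laplace with scale 1/sqrt(2), hence mean 0 and variance 1.
    return math.sqrt(2) * sum(abs(xi) for xi in x)

n = 50
random.seed(0)
for _ in range(1000):
    # draw a random direction and push it out so that |x| is between n and 2n
    x = [random.gauss(0, 1) for _ in range(n)]
    norm2 = math.sqrt(sum(xi * xi for xi in x))
    target = n * (1 + random.random())
    x = [xi * target / norm2 for xi in x]
    # the Lemma predicts linear growth: f(x) - min f = Omega(|x|);
    # here ||x||_1 >= ||x||_2 makes the constant explicit (sqrt(2))
    assert potential_gap(x) >= math.sqrt(2) * target - 1e-9
print("linear growth verified for |x| = Omega(n)")
```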

Restart condition

The simplest restart idea goes as follows. Let us consider some x \in \partial F_t (with F_t defined as in (1)). Why is it there? Well, there must be some time s in the past at which we estimated the cumulative regret of x to be at least \Omega(R/\eta). In particular if this x now has a small estimated regret, say smaller than O(1/\eta), then we can reasonably believe that something weird is going on: for instance, the adversary could be trying to move the optimum outside of the current focus region F_t. Thus our restart condition looks like this: restart if

    \[\exists \ x \in \partial F_t : \eta \left(\sum_{s=1}^t \tilde{\ell}_s(x) - \min_{y \in F_t} \sum_{s=1}^t \tilde{\ell}_s(y) \right) = O(1) .\]

A key observation (which is the real reason to introduce the above restart condition) is that by convexity of the losses, by the concentration property of Part 2, and by taking the O(1) constant to be large enough, we know that the optimum (over all of \mathcal{K}) of the “true” cumulative loss \sum_{s=1}^t K^*_s \ell_s(x) is still in F_t at the time when we restart.
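In code, the restart condition is just a comparison of estimated cumulative losses. Here is a hedged Python sketch; the function name, the discretization of \partial F_t by a finite list of points, and the interface for the minimum over F_t are all hypothetical simplifications (the real algorithm handles the boundary via the box decomposition discussed at the end of this post):

```python
def should_restart(boundary_points, cum_loss, interior_min, eta, C=10.0):
    """Restart check.  `cum_loss(x)` stands for the cumulative loss estimate
    sum_{s<=t} tilde_ell_s(x), `interior_min` for the minimum of that quantity
    over F_t, and `boundary_points` discretizes the boundary of F_t
    (hypothetical interface).  C is the O(1) constant of the condition."""
    return any(eta * (cum_loss(x) - interior_min) <= C
               for x in boundary_points)

# toy 1-D usage: boundary points {-5, 5}, cumulative loss estimate x -> x^2
assert should_restart([-5.0, 5.0], lambda x: x * x, 0.0, eta=1.0) is False
assert should_restart([-5.0, 5.0], lambda x: x * x, 0.0, eta=0.1) is True
```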

Hoping for negative regret

If we catch the adversary’s presumptive attempt to move the optimum out of F_t quickly enough we can hope to get negative regret. Indeed we initially zoomed in on region F_t for a good reason, and thus if we compare ourselves to the point x which triggers the restart we have accumulated a lot of negative regret during this zooming procedure (since x was performing badly during this time). Mathematically this shows up in the following term which appears in the regret bound calculation, where s is the time at which x enters the boundary of the focus region,

(2)   \begin{equation*} \mathrm{Ent}(\delta_x \Vert p_1) - \mathrm{Ent}(\delta_x \Vert p_{s+1}) = \log(p_{s+1}(x) / p_1(x)) . \end{equation*}

Roughly (essentially thanks to the Lemma above) we should expect this quantity to be as small as \tilde{O}(n) - R for R = \Omega(n), and thus this term divided by \eta (which is then O(- R \sqrt{T n})) could easily compensate the variance term, which is O(\eta T) = O(\sqrt{T / n}). This sounds great, but unfortunately the difference of entropies in (2) does not appear in our current regret bound (because of the telescopic sum of entropies). However it is easy to see that if at time step s we had increased the learning rate from \eta to (1+\gamma) \eta then the term - \mathrm{Ent}(\delta_x \Vert p_{s+1}) would appear multiplied by \gamma in the final regret bound, and thus we would only need \gamma R \geq \mathrm{Ent}(\delta_x \Vert p_1) = \tilde{O}(n) to ensure negative regret (note that compensating the entropy at the beginning is more difficult than compensating the variance term because of our choice of \eta). Let us make this slightly more formal.

Turning up the learning rate

Let us assume that we update the exponential weights distribution with a time-dependent learning rate as follows: p_{t+1}(x) \propto p_t(x) \exp(- \eta_t \tilde{\ell}_t(x)). One can see with the same calculations as in Part 1 that:

    \begin{align*} \langle p_t - \delta_x , \tilde{\ell}_t \rangle & \leq \frac{\mathrm{Ent}(\delta_x \Vert p_t) - \mathrm{Ent}(\delta_x \Vert p_{t+1}) }{\eta_t} + \eta_t \langle p_t, \tilde{\ell}_t^2 \rangle \\ & = \frac{\log(p_{t+1}(x)) - \log(p_t(x))}{\eta_t} + \eta_t \langle p_t, \tilde{\ell}_t^2 \rangle . \end{align*}

In particular if we increase the learning rate at some times \tau_1, \hdots, \tau_N by \eta_{\tau_i+1} = (1+\gamma) \eta_{\tau_i} then we have

    \begin{align*} & \sum_{t=1}^T \frac{\log(p_{t+1}(x)) - \log(p_t(x))}{\eta_t} \\ & = \frac{\log(p_{T+1}(x))}{\eta_{T}} - \frac{\log(p_1(x))}{\eta_1} + \sum_{i=1}^N \log(p_{\tau_i+1}(x)) \left(\frac{1}{\eta_{\tau_i}} - \frac{1}{\eta_{\tau_i+1}} \right) \\ & = \frac{\log(p_{T+1}(x))}{\eta_{T}} - \frac{\log(p_1(x))}{\eta_1} + \frac{\gamma}{1+\gamma} \sum_{i=1}^N \frac{\log(p_{\tau_i+1}(x))}{\eta_{\tau_i}} . \end{align*}
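The second equality is plain Abel summation (the terms 1/\eta_{t-1} - 1/\eta_t vanish except at the times \tau_i where the learning rate changes). Since it is easy to get the indices wrong here, the following self-contained numerical check, with made-up values standing in for \log p_t(x), may be reassuring:

```python
import random

random.seed(1)
T, gamma = 20, 0.5
taus = {4, 9, 14}                      # times at which the learning rate is increased
eta = [0.0] * (T + 2)
eta[1] = 1.0
for t in range(1, T + 1):
    eta[t + 1] = (1 + gamma) * eta[t] if t in taus else eta[t]
logp = [random.uniform(-5, 0) for _ in range(T + 2)]  # stand-ins for log p_t(x)

lhs = sum((logp[t + 1] - logp[t]) / eta[t] for t in range(1, T + 1))
rhs = (logp[T + 1] / eta[T] - logp[1] / eta[1]
       + gamma / (1 + gamma) * sum(logp[tau + 1] / eta[tau] for tau in taus))
assert abs(lhs - rhs) < 1e-9
print("identity verified")
```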

Using that the normalizing constants are all larger than T^{-O(n)} (this can be proved using the fact that the covariance of the exponential weights is at least T^{-O(1)}) and assuming that \eta_T / \eta_1 = (1+\gamma)^N = O(1) (i.e. N \approx 1/\gamma), one gets roughly, with Q_t(x) = \sum_{s=1}^t \eta_s \tilde{\ell}_s(x),

    \begin{align*} \sum_{t=1}^T \langle p_t - \delta_x , \tilde{\ell}_t \rangle & \lesssim \eta_1 T + \frac{n \log(T)}{\eta_1} - \frac{\gamma}{\eta_1} \sum_{i=1}^N Q_{\tau_i}(x) \\ & \lesssim \sqrt{T} \bigg( (n \log(T))^{3/2} - \gamma \sqrt{n \log(T)} \max_{i \in [N]} Q_{\tau_i}(x) \bigg), \end{align*}

where the second inequality uses that \eta_1 \approx \frac{1}{\sqrt{T n \log(T)}} (see Part 2).

We see that by taking \gamma R = \Omega(n \log(T)) we can guarantee negative regret with respect to any x such that the first time it belonged to the boundary of the focus region was also a time at which we increased the learning rate (i.e., one of the times \tau_1, \hdots, \tau_N). Thus we see that we should take \gamma = \tilde{\Theta}(1/\mathrm{poly}(n)) if we want a regret of order \tilde{O}(\mathrm{poly}(n) \sqrt{T}). Recall also that we can afford to update the learning rate only N \approx 1/\gamma = \tilde{\Theta}(\mathrm{poly}(n)) times. Thus the next idea is to update the focus region only when it is necessary. We will see that we can take N = O(n \log(T)) (and thus \gamma = \Theta(1/(n \log(T))) and R = \Theta((n \log(T))^2)) while guaranteeing that the focus region satisfies

    \[F_{t+1} \subset \{x \in \mathbb{R}^n : \|x - \mathrm{mean}(p_t)\|_{\mathrm{Cov}(p_t)^{-1}} \leq O(n R) \} .\]

As we explained at the beginning of this post, this will conclude the proof of the \tilde{O}(n^{9.5} \sqrt{T}) regret bound.

Updating the focus region more carefully

We update the focus region F_t only at times such that, once space is rescaled so that p_{t+1} is isotropic,

    \[\mathrm{Vol} \bigg(F_t \cap\{x \in \mathbb{R}^n : |x| \leq R\}\bigg) \leq \frac{1}{2} \mathrm{Vol}(F_t) ,\]

and in this case we set

    \[F_{t+1} := F_t \cap \{x \in \mathbb{R}^n : |x| \leq R\} .\]
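A Monte Carlo sketch of this update rule (hypothetical interface: we assume access to points drawn uniformly from F_t after the rescaling that makes p_{t+1} isotropic, so that the fraction of samples landing in the ball estimates the volume ratio):

```python
def update_focus_region(samples, R):
    """One step of the focus-region update.  `samples` are points drawn
    uniformly from F_t, expressed in the coordinates in which p_{t+1} is
    isotropic (hypothetical interface); the fraction inside the ball of
    radius R estimates Vol(F_t intersect ball) / Vol(F_t)."""
    inside = sum(1 for x in samples if sum(xi * xi for xi in x) <= R * R)
    if inside <= len(samples) / 2:
        return "intersect"   # volume at least halved: F_{t+1} = F_t cap {|x| <= R}
    return "keep"            # otherwise F_{t+1} = F_t

# toy 1-D usage: 81 grid points on [-4, 4]; the ball of radius 1 captures
# only 21 of them, so the update is triggered
pts = [[x / 10.0] for x in range(-40, 41)]
assert update_focus_region(pts, 1.0) == "intersect"
```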

Clearly we won’t make more than N=O(n \log(T)) updates: each update at least halves the volume of the focus region, and this volume always remains at least T^{-O(n)} (since the covariance of the exponential weights is at least T^{-O(1)}). So the only thing to verify is that one always has (whether we update or not)

    \[F_{t+1} \subset \{x \in \mathbb{R}^n : |x| \leq O(n R) \} .\]

This follows from:

Lemma: Let \mathcal{K} be a convex body and \mathcal{E} be the centered unit ball. Suppose that \mathrm{Vol}(\mathcal{K} \cap \mathcal{E}) \geq \tfrac{1}{2} \mathrm{Vol}(\mathcal{K}). Then \mathcal{K} \subset 10 n \mathcal{E}.

Proof: Let us prove the contrapositive and assume that there is a point x \in \mathcal{K} with |x| > 10n. Denote s_i = \frac{2i}{10n}, i=1,\ldots,5n, and consider the sets \mathcal{K}_i = (1 - s_i) (\mathcal{E} \cap \mathcal{K}) + s_i x (note that \mathcal{K}_i \subset \mathcal{K} by convexity).

Note that those sets are disjoint. Indeed, the intervals (1 - s_i) [-1,1] + |x| s_i are disjoint, which implies that the projections of the ellipsoids (1 - s_i) \mathcal{E} + s_i x onto the span of x are disjoint. So, we have

    \[\mathrm{Vol}(\mathcal{K}) \geq \sum_{i=1}^{5 n} \mathrm{Vol}(\mathcal{K}_i) = \sum_{i=1}^{5 n} (1 - s_i)^{n} \mathrm{Vol}(\mathcal{E} \cap \mathcal{K}) \geq 2 \mathrm{Vol}(\mathcal{E} \cap \mathcal{K}) ,\]

which concludes the proof.
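The only non-obvious step in the last display is the inequality \sum_{i=1}^{5n} (1 - s_i)^{n} \geq 2, which can be checked numerically (for n = 1 it holds with equality: 0.8 + 0.6 + 0.4 + 0.2 + 0 = 2):

```python
def packing_sum(n):
    # sum_{i=1}^{5n} (1 - s_i)^n with s_i = 2i/(10n) = i/(5n);
    # a Riemann-sum comparison gives roughly 5n * integral_0^1 u^n du
    # = 5n/(n+1) >= 2
    return sum((1 - i / (5 * n)) ** n for i in range(1, 5 * n + 1))

for n in range(1, 200):
    assert packing_sum(n) >= 2 - 1e-9
print("inequality verified for n = 1, ..., 199")
```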

That’s it folks!

This concludes the informal proof! In the real proof we unfortunately need to deal with the various places where we said “events of probability 1/\mathrm{poly}(T) do not matter”. There is also the discrepancy between approximately log-concave and exactly log-concave measures. Finally, doing the concentration argument properly (with the induction) is one more place that adds some unwanted (but relatively minor) technical complications.

Another important part which I completely omitted in these lecture notes is the computational complexity of the resulting algorithm. Using known results on sampling from approximately log-concave measures it is fairly easy to sample x_t in time \mathrm{poly}(n) T. Checking whether the focus region has to be updated can also be done in polynomial time. The only real difficulty is checking the restart condition, which asks us to minimize an approximately convex function on the boundary of a convex set! The trick here is that one can replace ellipsoids by boxes, and thus we are left with minimizing an approximately convex function on \mathrm{poly}(n) convex sets (which are (n-1)-dimensional). All of this can be found in the paper of course!

This entry was posted in Optimization, Theoretical Computer Science.