Kernel-based methods for bandit convex optimization, part 1

A month ago Ronen Eldan, Yin Tat Lee and I posted our latest work on bandit convex optimization. I’m quite happy with the result (the first polynomial-time method with poly(dimension)\times\sqrt{T}-regret) but I’m even more excited by the techniques we developed. Next week I will give three 60-minute lectures at MSR to explain those techniques. I thought the notes for these lectures could be useful to people who are going through the main paper, as the informal style of blog posts/lectures allows us to gloss over some of the annoying details we had to deal with in the real proof. Each lecture covers essentially one major idea of the paper. Here we go with the notes for the first lecture!

Bandit convex optimization
Let \mathcal{K} be a convex body in \mathbb{R}^n. For t=1,\hdots, T, the player selects (at random) x_t \in \mathcal{K} and simultaneously an adversary selects a convex function \ell_t : \mathcal{K} \rightarrow [0,1]. The player observes/suffers the loss \ell_t(x_t). The player’s performance is measured by the regret:

    \[R_T = \sum_{t=1}^T \ell_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^T \ell_t(x) .\]

The goal of these notes is to give a “simple” informal proof of a \mathrm{poly}(n) \sqrt{T} regret bound (with high probability).

The full information engine
Most previous work on bandit convex optimization used Online Gradient Descent as a full information primitive (in the full information setting one observes the complete loss function \ell_t). Here we will use continuous exponential weights. Let us recall what this means and the basic regret bound one can obtain. To lighten notation we use the same symbol for a probability measure p and for its density with respect to the Lebesgue measure (when it has one). We write \langle \cdot, \cdot \rangle for the usual L^2 inner product (i.e. \langle f, g\rangle = \int_{x \in \mathbb{R}^n} f(x) g(x) dx) and \mathrm{Ent}(f \Vert g) = \langle f, \log(f/g) \rangle.

Theorem: Let \ell_1, \hdots, \ell_T be arbitrary functions from \mathcal{K} to \mathbb{R}_+. Let p_1 be some probability measure on \mathcal{K} and define by induction

    \[p_{t+1}(x) = \frac{1}{Z_{t+1}} p_t(x) \exp(- \eta \ell_t(x)) ,\]

where Z_{t+1} is a normalization constant and \eta>0 is some fixed learning rate. Then

    \[\sum_{t=1}^T \langle p_t - q , \ell_t \rangle \leq \frac{\mathrm{Ent}(q \Vert p_1) - \mathrm{Ent}(q \Vert p_{T+1})}{\eta} + \frac{\eta}{2} \sum_{t=1}^T \langle p_t, \ell_t^2 \rangle .\]

Note that the quantity \sum_{t=1}^T \langle p_t - q , \ell_t \rangle should be interpreted as the expected regret of playing x_t at random from p_t compared to playing the fixed distribution q. Just as a teaser for the experts reading this, a key point of the argument in the third lecture will be to use the term - \mathrm{Ent}(q \Vert p_{T+1}) in the regret bound (usually we simply drop this term as it is negative).

Proof: Observe that

    \[{\mathrm{Ent}(q \Vert p_{t}) - \mathrm{Ent}(q \Vert p_{t+1}) = \langle q , \log(p_{t+1} / {p_t}) \rangle = - \log Z_{t+1} - \eta \langle q, \ell_t \rangle}.\]

Using the basic inequalities \log(1+s) \leq s (valid for any s > -1) and \exp(-s) \leq 1 - s + \frac{s^2}{2} (valid for any s \geq 0), as well as the fact that \ell_t \geq 0, one obtains

    \[\log(Z_{t+1}) = \log(\langle p_t , \exp(- \eta \ell_t) \rangle) \leq - \eta \langle p_t, \ell_t \rangle + \frac{\eta^2}{2} \langle p_t, \ell_t^2 \rangle .\]

Combining the two displays above, we have proved

    \[\eta \langle p_t - q, \ell_t \rangle \leq \mathrm{Ent}(q \Vert p_{t}) - \mathrm{Ent}(q \Vert p_{t+1}) + \frac{\eta^2}{2} \langle p_t, \ell_t^2 \rangle ,\]

and summing over t=1,\hdots,T (the entropy terms telescope) and dividing by \eta concludes the proof.
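As a sanity check, the theorem can be tested numerically on a finite grid, where the same proof goes through with sums in place of integrals. The sketch below is only an illustration under assumptions that are not in the notes: the grid on [0,1], the quadratic losses, the uniform p_1, and the function names are all hypothetical choices, and the comparator q is a point mass on the best grid point (so that \mathrm{Ent}(q \Vert p) = -\log p(\text{best})).

```python
import math

def exp_weights_regret_check(losses, eta):
    """Run exponential weights on a finite grid; return both sides of the bound.

    losses: list of T lists (one per round), each with m nonnegative entries.
    The comparator q is a point mass on the best fixed grid point.
    """
    m = len(losses[0])
    p = [1.0 / m] * m                       # p_1: uniform on the grid (assumption)
    p_first = p[:]
    totals = [sum(l[i] for l in losses) for i in range(m)]
    best = min(range(m), key=lambda i: totals[i])
    lhs = 0.0                               # sum_t <p_t - q, l_t>
    second_moment = 0.0                     # sum_t <p_t, l_t^2>
    for l in losses:
        lhs += sum(pi * li for pi, li in zip(p, l)) - l[best]
        second_moment += sum(pi * li * li for pi, li in zip(p, l))
        w = [pi * math.exp(-eta * li) for pi, li in zip(p, l)]
        z = sum(w)                          # normalization constant Z_{t+1}
        p = [wi / z for wi in w]            # p_{t+1}
    ent_first = -math.log(p_first[best])    # Ent(q || p_1) for q = delta_best
    ent_last = -math.log(p[best])           # Ent(q || p_{T+1})
    rhs = (ent_first - ent_last) / eta + 0.5 * eta * second_moment
    return lhs, rhs

def quadratic_losses(T, m):
    """Hypothetical convex losses (x - c_t)^2 with a deterministic moving center."""
    grid = [i / (m - 1) for i in range(m)]
    losses = []
    for t in range(T):
        c = 0.3 + 0.4 * ((t * 7) % 10) / 10.0
        losses.append([(x - c) ** 2 for x in grid])
    return losses

lhs, rhs = exp_weights_regret_check(quadratic_losses(200, 51), eta=0.1)
print(lhs, rhs)  # the first number should not exceed the second
```

Note that the check keeps the term -\mathrm{Ent}(q \Vert p_{T+1}) rather than dropping it, so it exercises the bound exactly as stated.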

Kernels
The usual bandit argument is to use the bandit feedback \ell_t(x_t) to build some unbiased estimator \tilde{\ell}_t of \ell_t (i.e. such that \mathbb{E}_{x_t \sim p_t} \tilde{\ell}_t = \ell_t) and then use the full information primitive with the estimated losses instead of the real ones. The previous theorem indicates that the only difficulty is then to control the variance of these estimators, i.e. \mathbb{E} \langle p_t, \tilde{\ell}_t^2 \rangle. This is where the shape assumption on the losses has to be used: the observation of the loss at x_t has to be somehow propagated to nearby points so as to have a low variance estimator. The bottleneck of bandit convex optimization is exactly that it is highly unclear how to leverage the convexity assumption to propagate information to nearby points. Our first idea is that one should not estimate \ell_t directly: it is enough to estimate a coarse approximation to \ell_t, where the level of coarseness depends on our level of uncertainty about where the optimum of the current cumulative loss lies. In particular, “far” (where distances are measured with respect to the variance of our current exponential weights) from our currently believed optimum (which is basically the center of mass of our current exponential weights) a very rough estimate of \ell_t is probably enough. We found that the best way to realize this idea is via kernels, as we now explain.

Let K_t : \mathcal{K} \times \mathcal{K} \rightarrow \mathbb{R}_+ be a kernel (which will depend on the current exponential weights distribution p_t), which we think of as a linear operator over probability measures through the formula K_t q(x) = \int K_t(x,y) q(y) dy. The adjoint K_t^* acts on functions: K_t^* f(y) = \int f(x) K_t(x,y) dx (indeed \langle K_t q , f \rangle = \langle q, K_t^* f \rangle). The key point of K_t^* f is that there is a canonical unbiased estimator of it based on the observation of f(x_t) with x_t sampled from some distribution q, namely f(x_t) K_t(x_t, \cdot) / q(x_t). That’s great, and we will see soon how to control the variance of this estimator, but first let us talk about a more pressing issue: with these estimators one would control the regret as if the adversary was playing the losses K_t^* \ell_t, but what we want is to control the regret with respect to the real losses \ell_t. In other words we control terms of the form \langle p_t - q, K_t^* \ell_t \rangle = \langle K_t (p_t - q), \ell_t \rangle but what we really care about is something like \langle q_t - q, \ell_t \rangle for some q_t potentially different from p_t (indeed we can play from some q_t which can be different from the exponential weights distribution). A natural guess is to take q_t = K_t p_t, in which case we would be satisfied by an inequality of the form \langle K_t p_t - q, \ell_t \rangle \lesssim \langle K_t (p_t - q), \ell_t \rangle.
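The adjoint identity and the unbiasedness of the estimator f(x_t) K_t(x_t, \cdot) / q(x_t) are easy to verify on a finite state space, where the kernel is just a matrix. The following is a minimal sketch with made-up numbers (the 3x3 matrix, f, p and q are all hypothetical); the expectation over x \sim q is computed exactly by summing over x, and q is assumed to put positive mass on every point.

```python
def kernel_apply(K, q):
    """(K q)(x) = sum_y K[x][y] q[y]: the kernel acting on a probability vector."""
    return [sum(K[x][y] * q[y] for y in range(len(q))) for x in range(len(K))]

def adjoint_apply(K, f):
    """(K^* f)(y) = sum_x f[x] K[x][y]: the adjoint acting on a function."""
    return [sum(f[x] * K[x][y] for x in range(len(K))) for y in range(len(K[0]))]

def inner(u, v):
    """Discrete analogue of the L^2 inner product."""
    return sum(a * b for a, b in zip(u, v))

def expected_estimator(K, f, q):
    """E_{x ~ q}[ f(x) K(x, .) / q(x) ], computed exactly (requires q > 0)."""
    m = len(K[0])
    out = [0.0] * m
    for x in range(len(K)):
        for y in range(m):
            out[y] += q[x] * (f[x] * K[x][y] / q[x])
    return out

# Hypothetical example: each column of K sums to 1, so K maps
# probability vectors to probability vectors.
K = [[0.7, 0.2, 0.1],
     [0.2, 0.6, 0.3],
     [0.1, 0.2, 0.6]]
f = [0.2, 0.5, 1.0]        # a "loss" observed only at the sampled point
p = [0.6, 0.3, 0.1]
q = [0.5, 0.3, 0.2]        # sampling distribution

print(inner(kernel_apply(K, p), f), inner(p, adjoint_apply(K, f)))  # <Kp, f> and <p, K^* f>
print(expected_estimator(K, f, q), adjoint_apply(K, f))             # estimator mean and K^* f
```

The estimator only ever uses the single value f(x), which is what makes it available under bandit feedback.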

Generalized Bernoulli convolutions
(What follows is an almost exact copy of Section 1.2.2 of the main paper.) As explained above, for a given measure p we want to find a kernel K such that \langle Kp - \delta_x, f \rangle \lesssim \langle K(p - \delta_x), f \rangle for all convex functions f and all points x \in \mathcal{K}. We note that for any \lambda \in (0,1) one has

(1)   \begin{equation*}  \langle Kp - \delta_x, f \rangle \leq \frac{1}{\lambda} \langle K(p - \delta_x), f \rangle \Leftrightarrow K^* f(x) \leq (1-\lambda) \langle Kp, f\rangle + \lambda f(x) . \end{equation*}

Leveraging the fact that f is convex we see that a natural kernel to consider is such that K \delta_x is the distribution of (1- \lambda) Z + \lambda x for some random variable Z to be defined. Indeed in this case one has

    \[K^* f(x) = \mathbb{E} f((1-\lambda) Z + \lambda x) \leq (1-\lambda) \mathbb{E} f(Z) + \lambda f(x) .\]

Thus this kernel satisfies the right hand side of (1) if Z is distributed according to K p, that is, if Z satisfies the following distributional identity, where X \sim p is independent of the copy of Z on the right hand side,

(2)   \begin{equation*}  Z \; \overset{D}{=} \; (1-\lambda) Z + \lambda X . \end{equation*}

If (2) holds true we say that Z is the core of p. It is easy to see that the core always exists and is unique by taking Z = \sum_{k=0}^{+\infty} (1-\lambda)^k \lambda X_k where X_0, X_1, \hdots are i.i.d. copies of X. Interestingly, such random variables have a long history in the special case of a random sign X, where they are called Bernoulli convolutions (they were introduced by Erdős in a 1939 paper). Our notion of core can thus be viewed as a generalized Bernoulli convolution. We refer the reader to the very nice survey on Bernoulli convolutions by Peres, Schlag and Solomyak, and we simply mention that the main objective in this literature is to understand for which values of \lambda the random variable Z is “smooth” (say for instance absolutely continuous with respect to the Lebesgue measure).
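The series representation of the core suggests a direct way to simulate it. The sketch below is an illustration only: it truncates the series (an approximation), takes X to be a random sign as in the classical Bernoulli convolution case, and uses hypothetical function names. It compares the second moments of Z and of (1-\lambda) Z + \lambda X, which should agree if (2) holds, with the value \frac{\lambda}{2-\lambda} \mathrm{Var}(X) obtained by summing \lambda^2 \sum_k (1-\lambda)^{2k}.

```python
import random

def sample_core(sample_x, lam, rng, terms=100):
    """Approximate draw of Z = sum_k (1-lam)^k * lam * X_k, truncated at `terms`."""
    z, weight = 0.0, lam
    for _ in range(terms):
        z += weight * sample_x(rng)
        weight *= 1.0 - lam
    return z

def core_variance(var_x, lam):
    """Variance of the core of a centered X: lam^2 * sum_k (1-lam)^(2k) * var_x."""
    return lam / (2.0 - lam) * var_x

def random_sign(rng):
    """X uniform on {-1, +1} (the Bernoulli convolution case)."""
    return rng.choice((-1.0, 1.0))

def second_moment(values):
    """Empirical second moment; equals the variance for centered variables."""
    return sum(v * v for v in values) / len(values)

rng = random.Random(0)       # fixed seed so the run is reproducible
lam = 0.3
zs = [sample_core(random_sign, lam, rng) for _ in range(10000)]
# Apply the map Z -> (1 - lam) Z + lam X with a fresh independent X.
zs_mapped = [(1.0 - lam) * z + lam * random_sign(rng) for z in zs]

print(second_moment(zs), second_moment(zs_mapped), core_variance(1.0, lam))
```

Matching second moments is of course much weaker than equality in distribution; it is only meant as a quick plausibility check of (2).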

Summarizing the discussion so far, we see that by playing at round t from K_t p_t, where K_t is the kernel described above and p_t is the continuous exponential weights strategy on the estimated losses \tilde{\ell}_s = \ell_s(x_s) \frac{K_s(x_s, \cdot)}{K_s p_s(x_s)} one has for any q,

    \[\mathbb{E} \sum_{t=1}^T (\ell_t(x_t) - \langle q, \ell_t \rangle) \leq \frac{1}{\lambda} \mathbb{E} \left(\frac{\mathrm{Ent}(q \Vert p_1)}{\eta} + \frac{\eta}{2} \sum_{t=1}^T \langle p_t, \left(\frac{K_t(x_t, \cdot)}{K_t p_t(x_t)}\right)^2 \rangle\right) .\]

Taking q to be uniform on a small ball around the minimizer of \sum_{t=1}^T \ell_t, the term on the left hand side is almost exactly the expected regret, and in this case the term \mathrm{Ent}(q \Vert p_1) is of order n \log(T).
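To see where the n \log(T) comes from, here is a short worked version under simplifying assumptions that are not spelled out above: p_1 is uniform on \mathcal{K}, x^* denotes a minimizer of \sum_{t=1}^T \ell_t, and q is uniform on the shrunk copy (1-\epsilon) x^* + \epsilon \mathcal{K} (used here in place of a ball because it makes the volume ratio exact). Then

    \[\mathrm{Ent}(q \Vert p_1) = \log \frac{\mathrm{vol}(\mathcal{K})}{\mathrm{vol}(\epsilon \mathcal{K})} = n \log(1/\epsilon) ,\]

and by convexity, together with 0 \leq \ell_t \leq 1, for any y \in \mathcal{K},

    \[\ell_t((1-\epsilon) x^* + \epsilon y) \leq (1-\epsilon) \ell_t(x^*) + \epsilon \ell_t(y) \leq \ell_t(x^*) + \epsilon .\]

With \epsilon = 1/T the comparator q thus loses at most 1 in total against x^*, while \mathrm{Ent}(q \Vert p_1) = n \log(T).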

Variance control in the Gaussian case
All that remains to be done is to control the variance term \mathbb{E}_{x \sim K p} \langle p, \tilde{\ell}^2 \rangle where \tilde{\ell}(y) = \frac{K(x, y)}{K p(x)} = \frac{K(x,y)}{\int K(x,y') p(y') dy'}. More precisely, if this quantity is O(1) then we obtain a regret of \tilde{O}\left( \frac{1}{\lambda} \sqrt{n T}\right) (by taking \eta of order \sqrt{n \log(T) / T} in the previous display). This variance control will turn out to be more challenging than one might expect and our goal for the rest of this first lecture is to derive some intuition from the case where p is a standard Gaussian. In the next lecture we will see how to somehow reduce the general case to the Gaussian case. In the calculations below we will also assume that p is compactly supported on the Euclidean ball of radius R = \mathrm{poly}(n, \log(T)). Of course this is at odds with the Gaussian assumption, but again this issue will be dealt with in the next lecture. The trickier part will be to ensure that the exponential weights distribution is indeed truncated outside of a ball of such radius (observe that p is assumed to be isotropic, so with respect to the distances in the original space this ball might represent a tiny ball in \mathcal{K}). This truncation (and dealing with the issues that result from it) will be the topic of the third lecture (this is where the negative entropic term in the regret bound of exponential weights will turn out to be useful!).

First observe that to bound the variance term it is in fact sufficient to show that K(x,y) / K(x,y') is bounded for all y,y' in the support of p and all x in the support of Kp. In fact it is fine if we have this control with probability at least 1-1/T^{10} with respect to the draw of x from Kp (recall that eventually x is the played point, so events with small probability on x can intuitively be ignored). Next we need to understand what K(x,y) is when p is a standard Gaussian. Observe that, with c denoting the density of the core of p, one always has K(x,y) = K \delta_y (x) = \mathrm{cst} \times c\left(\frac{x-\lambda y}{1-\lambda}\right). Furthermore it is a straightforward calculation that the core of a standard Gaussian is a centered Gaussian with covariance \frac{\lambda}{2-\lambda} I_n. Thus we obtain:

    \begin{align*} \frac{K(x,y)}{K(x,y')} & = \exp \left(\frac{2-\lambda}{2 \lambda} \left(\left\vert \frac{x-\lambda y'}{1-\lambda} \right\vert^2 - \left\vert \frac{x-\lambda y}{1-\lambda} \right\vert^2 \right) \right) \\ & \leq \exp\left( \frac{1}{(1-\lambda)^2} (4 R |x| + 2 \lambda R^2) \right). \end{align*}
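The inequality in the display above can be spot-checked numerically. The sketch below is a hedged illustration: it evaluates the logarithm of the first line (using the Gaussian core with covariance \frac{\lambda}{2-\lambda} I_n) and compares it to the exponent of the second line, for y, y' drawn from the ball of radius R. The sampling scheme, the range from which x is drawn, and all function names are assumptions made for the example.

```python
import math
import random

def sq_norm(u):
    return sum(a * a for a in u)

def log_kernel_ratio(x, y, y_prime, lam):
    """log( K(x,y) / K(x,y') ) when the core is N(0, lam/(2-lam) I)."""
    scale = 1.0 - lam
    u = [(xi - lam * yi) / scale for xi, yi in zip(x, y)]
    u_prime = [(xi - lam * yi) / scale for xi, yi in zip(x, y_prime)]
    return (2.0 - lam) / (2.0 * lam) * (sq_norm(u_prime) - sq_norm(u))

def log_ratio_bound(x, radius, lam):
    """Exponent of the upper bound: (4 R |x| + 2 lam R^2) / (1 - lam)^2."""
    norm_x = math.sqrt(sq_norm(x))
    return (4.0 * radius * norm_x + 2.0 * lam * radius ** 2) / (1.0 - lam) ** 2

def random_point_in_ball(n, radius, rng):
    """Uniform draw from the Euclidean ball of the given radius in R^n."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    r = radius * rng.random() ** (1.0 / n)
    norm = math.sqrt(sq_norm(g))
    return [r * a / norm for a in g]

def worst_slack(n, radius, lam, trials, seed=0):
    """Smallest value of (bound - log ratio) over random draws; should be >= 0."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        x = random_point_in_ball(n, 2.0 * radius, rng)   # arbitrary range for x
        y = random_point_in_ball(n, radius, rng)
        y_prime = random_point_in_ball(n, radius, rng)
        slack = log_ratio_bound(x, radius, lam) - log_kernel_ratio(x, y, y_prime, lam)
        worst = min(worst, slack)
    return worst

print(worst_slack(n=5, radius=3.0, lam=0.1, trials=2000))
```

A nonnegative output is consistent with the bound; random sampling cannot prove it, but a negative value would signal a mistake in the algebra.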

Finally note that with high probability one has |x| \lesssim \lambda R + \sqrt{\lambda n \log(T)}. Ignoring logarithmic factors, and provided that somehow one could take R = \tilde{O}(\sqrt{n}) (which is the ball that contains most of the mass of a standard Gaussian so it seems reasonable), we see that for \lambda = \tilde{O}(1/n^2) we have a constant variance, which in turn leads to a regret of order n^{2.5} \sqrt{T}. In fact by being more careful in bounding the variance one could hope that \lambda = \tilde{O}(1/n) is enough (basically in the display above one could do a moment generating function calculation instead of Cauchy-Schwarz to replace the term R|x| by |x|^2), which in turn would give an n^{1.5} \sqrt{T} regret. This latter result is what we conjecture to be optimal, but we are quite far from being able to prove this!
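The arithmetic behind the choice of \lambda can be spelled out as follows (a rough sketch that drops constants and logarithmic factors, takes R = \sqrt{n}, and assumes \lambda \leq 1/2 so that the prefactor 1/(1-\lambda)^2 is bounded). Plugging |x| \lesssim \lambda R + \sqrt{\lambda n} into the exponent of the last display gives

    \[4 R |x| + 2 \lambda R^2 \lesssim \lambda n + n \sqrt{\lambda} ,\]

which is O(1) exactly when \lambda \lesssim 1/n^2 (the term n \sqrt{\lambda} is the binding one). The regret is then of order

    \[\frac{1}{\lambda} \sqrt{n T} = n^2 \sqrt{n T} = n^{2.5} \sqrt{T} ,\]

and in the same way \lambda of order 1/n would give n \sqrt{n T} = n^{1.5} \sqrt{T}.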

In the next lecture we will do a little bit of geometry of log-concave measures to show how to reduce the general p case to the Gaussian case. In the third and final lecture we will deal with the truncation of the exponential weights distribution (this will force us to significantly modify the algorithm).
