A month ago Ronen Eldan, Yin Tat Lee and myself posted our latest work on bandit convex optimization. I’m quite happy with the result (first polynomial time method with poly(dimension)-regret) but I’m even more excited by the techniques we developed. Next week I will give three 60 minutes lectures at MSR to explain those techniques. I thought the notes for these lectures could be useful to people who are going through the main paper as the informal style of blog posts/lectures allow to gloss over some of the annoying details we had to deal with in the real proof. Each lecture covers essentially one major idea of the paper. Here we go for the notes of the first lecture!
Bandit convex optimization
Let a convex body in . For , the player selects (at random) and simultaneously an adversary selects a convex function . The player observes/suffers the loss . The player’s performance is measured by the regret:
The goal of these notes is to give a “simple” informal proof of a -regret (with high probability).
The full information engine
Most previous works on bandit convex optimization were using Online Gradient Descent as a full information primitive (in the full information setting one observes the complete loss function ). Here we will use continuous exponential weights. Let us recall what this means and the basic regret bound one can obtain. For sake of notation we use the same symbol for a probability measure and for its density with respect to the Lebesgue measure (when it has one). We denote for the usual inner product (i.e. ) and .
Theorem: Let be arbitrary functions from to . Let be some measure on and define by induction
where is a normalization constant and is some fixed learning rate. Then
Note that the quantity should be interpreted as the expected regret of playing at random from compared to playing the fixed distribution . Just as a teaser for the experts reading this, a key point of the argument in the third lecture will be to use the term in the regret bound (usually we simply drop this term as it is negative).
Proof: Observe that
Using the basic inequalities and for any , as well as the fact that one obtains
Rearranging the two above displays we proved
which concludes the proof.
The usual bandit argument is to use the bandit feedback to build some unbiased estimator of (i.e. such that ) and then use the full information primitive with the estimated losses instead of the real ones. The previous theorem indicates that the only difficulty is then to control the variance of these estimators, i.e. . This is where the shape assumption on the losses has to be used: the observation of the loss at has to be somehow propagated to nearby points so as to have a low variance estimator. The bottleneck of bandit convex optimization is exactly that it is highly unclear how to leverage the convexity assumption to propagate information to nearby points. Our first idea is that one should not estimate directly, but rather it is enough to estimate a coarse approximation to , where the level of coarseness depends on our level of uncertainty about where lies the optimum of the current cumulative loss. In particular “far” (where distances are measured with respect to the variance of our current exponential weights) from our current believed optimum (which is basically the center of mass of our current exponential weights) a very rough estimation of is probably enough. We found that the best way to realize this idea is via kernels as we now explain.
Let be a kernel (which will depend on the current exponential weights distribution ), which we think of as a linear operator over probability measures through the formula . The adjoint acts on functions: (indeed ). The key point of is that there is a canonical unbiased estimator of it based on the observation of with sampled from some distribution , namely . That’s great, and we will see soon how to control the variance of this estimator, but first let us talk about a more pressing issue: with these estimators one would control the regret as if the adversary was playing the losses , but what we want is to control the regret with respect to real losses . In other words we control terms of the form but what we really care about is something like for some potentially different from (indeed we can play from some which can be different from the exponential weights distribution). A natural guess is to take , in which case we would be satisfied by an inequality of the form .
Generalized Bernoulli convolutions
(What follows is an almost exact copy of Section 1.2.2. in the main paper.) As we just explained above for a given measure we want to find a kernel such that for all convex functions and all points . We note that for any one has
Leveraging the fact that is convex we see that a natural kernel to consider is such that is the distribution of for some random variable to be defined. Indeed in this case one has
If (2) holds true we say that is the core of . It is easy to see that the core always exists and is unique by taking where are i.i.d. copies of . Interestingly such random variables have a long history for the special case of a random sign where they are called Bernoulli convolutions (they were introduced by Erdos in a 1939 paper). Our notion of core can thus be viewed as a generalized Bernoulli convolution. We refer the reader to the following very nice a survey on Bernoulli convolutions by Peres, Schlag and Solomyak, and we simply mention that the main objective in this literature is to understand for which values of is the random variable “smooth” (say for instance absolutely continuous with respect to the Lebesgue measure).
Summarizing the discussion so far, we see that by playing at round from , where is the kernel described above and is the continuous exponential weights strategy on the estimated losses one has for any ,
The term in the left hand side is almost exactly the expected regret by taking to be uniform on a small ball around the minimizer of , and in this case the term is of order .
Variance control in the Gaussian case
All that remains to be done is to control the variance term where . More precisely if this quantity is then we obtain a regret of . This variance control will turn out to be more challenging than one might expect and our goal for the rest of this first lecture is to derive some intuition from the case where is a standard Gaussian. In the next lecture we will see how to somehow reduce the general case to the Gaussian case. In the calculations below we will also assume that is compactly supported on the Euclidean ball of radius . Of course this is at odds with the Gaussian assumption, but again this issue will be dealt with in the next lecture. The more tricky part will be to ensure that the exponential weights distribution is indeed truncated outside of a ball of such radius (observe that is assumed to be isotropic, so with respect to the distances in the original space this ball might represent a tiny ball in ). This truncation (and dealing with the issues that result from it) will be the topic of the third lecture (this is where the negative entropic term in the regret bound of exponential weights will turn out to be useful!).
First observe that to bound the variance term it is in fact sufficient to show that is bounded for all in the support of and all in the support of . In fact it is fine if we have this control with probability at least with respect to the draw of from (recall that eventually is the played point, so events with small probability on can intuitively be ignored). Next we need to understand what is when is a standard Gaussian. Observe that, with denoting the core of , one always has . Furthermore it is a straightforward calculation that the core of standard Gaussian is a centered Gaussian with covariance . Thus we obtain:
Finally note that with high probability one has . Ignoring logarithmic factors, and provided that somehow one could take (which is the ball that contains most of the mass of a standard Gaussian so it seems reasonable), we see that for we have a constant variance, which in turn lead to a regret of order . In fact by being more careful in bounding the variance one could hope that is enough (basically in the display above one could do a moment generating function calculation instead of Cauchy-Schwarz to replace the term by ) which in turn would give a regret. This latter result is what we conjecture to be optimal, but we are quite far from being able to prove this!
In the next lecture we will do a little bit of geometry of log-concave measures to show how to reduce the general case to the Gaussian case. In the third and final lecture we will deal with the truncation of the exponential weights distribution (this will force us to significantly modify the algorithm).