A month ago Ronen Eldan, Yin Tat Lee and myself posted our latest work on bandit convex optimization. I’m quite happy with the result (first polynomial time method with $\mathrm{poly}(n) \sqrt{T}$-regret) but I’m even more excited by the techniques we developed. Next week I will give three 60-minute lectures at MSR to explain those techniques. I thought the notes for these lectures could be useful to people who are going through the main paper, as the informal style of blog posts/lectures allows me to gloss over some of the annoying details we had to deal with in the real proof. Each lecture covers essentially one major idea of the paper. Here we go for the notes of the first lecture!
Bandit convex optimization
Let $\mathcal{K}$ be a convex body in $\mathbb{R}^n$. For $t = 1, \ldots, T$, the player selects (at random) $x_t \in \mathcal{K}$ and simultaneously an adversary selects a convex function $\ell_t : \mathcal{K} \rightarrow [0,1]$. The player observes/suffers the loss $\ell_t(x_t)$. The player’s performance is measured by the regret:

$$R_T = \sum_{t=1}^{T} \ell_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^{T} \ell_t(x).$$

The goal of these notes is to give a “simple” informal proof of a $\mathrm{poly}(n) \sqrt{T}$-regret (with high probability).
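To fix ideas, here is the interaction protocol written as a short simulation loop. This is a sketch of my own (not from the paper), with hypothetical `player` and `adversary` callables:

```python
def run_protocol(player, adversary, T, rng):
    """One run of the bandit convex optimization protocol.

    `player(rng)` returns a (random) point x_t in K and `adversary(t)`
    returns a convex function ell_t mapping K to [0, 1]. Crucially, the
    player only ever observes the scalar ell_t(x_t), never ell_t itself.
    """
    history = []
    for t in range(T):
        x_t = player(rng)                # player's random point in K
        ell_t = adversary(t)             # adversary's convex loss
        history.append((ell_t, x_t, ell_t(x_t)))
    return history
```

The regret then compares the sum of the suffered losses $\ell_t(x_t)$ to the loss of the best fixed point in hindsight.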
The full information engine
Most previous works on bandit convex optimization were using Online Gradient Descent as a full information primitive (in the full information setting one observes the complete loss function $\ell_t$). Here we will use continuous exponential weights. Let us recall what this means and the basic regret bound one can obtain. For sake of notation we use the same symbol for a probability measure $p$ and for its density with respect to the Lebesgue measure (when it has one). We denote $\langle \cdot, \cdot \rangle$ for the usual $L^2$ inner product (i.e. $\langle f, g \rangle = \int f(x) g(x) dx$) and $\mathrm{Ent}(q \| p) = \int q(x) \log \frac{q(x)}{p(x)} dx$ for the relative entropy.
Theorem: Let $\ell_1, \ldots, \ell_T$ be arbitrary functions from $\mathcal{K}$ to $\mathbb{R}_+$. Let $p_1$ be some measure on $\mathcal{K}$ and define by induction

$$p_{t+1}(x) = \frac{1}{Z_{t+1}} \, p_t(x) \exp(-\eta \ell_t(x)),$$

where $Z_{t+1}$ is a normalization constant and $\eta > 0$ is some fixed learning rate. Then

$$\sum_{t=1}^{T} \langle p_t - q, \ell_t \rangle \leq \frac{\mathrm{Ent}(q \| p_1) - \mathrm{Ent}(q \| p_{T+1})}{\eta} + \eta \sum_{t=1}^{T} \langle p_t, \ell_t^2 \rangle.$$
Note that the quantity $\sum_{t=1}^{T} \langle p_t - q, \ell_t \rangle$ should be interpreted as the expected regret of playing $x_t$ at random from $p_t$ compared to playing the fixed distribution $q$. Just as a teaser for the experts reading this, a key point of the argument in the third lecture will be to use the term $- \mathrm{Ent}(q \| p_{T+1})$ in the regret bound (usually we simply drop this term as it is negative).
Proof: Observe that

$$\mathrm{Ent}(q \| p_{t+1}) - \mathrm{Ent}(q \| p_t) = \int q(x) \log \frac{p_t(x)}{p_{t+1}(x)} dx = \log Z_{t+1} + \eta \langle q, \ell_t \rangle.$$

Using the basic inequalities $\exp(-x) \leq 1 - x + x^2$ for $x \geq 0$ and $\log(1+x) \leq x$ for any $x > -1$, as well as the fact that $\ell_t \geq 0$, one obtains

$$\log Z_{t+1} = \log \langle p_t, \exp(-\eta \ell_t) \rangle \leq \log\left(1 - \eta \langle p_t, \ell_t \rangle + \eta^2 \langle p_t, \ell_t^2 \rangle\right) \leq - \eta \langle p_t, \ell_t \rangle + \eta^2 \langle p_t, \ell_t^2 \rangle.$$

Rearranging the two above displays we proved

$$\langle p_t - q, \ell_t \rangle \leq \frac{\mathrm{Ent}(q \| p_t) - \mathrm{Ent}(q \| p_{t+1})}{\eta} + \eta \langle p_t, \ell_t^2 \rangle,$$

and summing over $t$ concludes the proof.
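As a quick sanity check of the theorem, here is a small numerical experiment of my own: continuous exponential weights discretized on a one-dimensional grid (so that integrals become sums), with $q$ a point mass at the best grid point. All the specific choices (domain, losses, learning rate) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
T, eta = 500, 0.05
grid = np.linspace(-1.0, 1.0, 400)     # discretization of K = [-1, 1]
p = np.ones_like(grid) / len(grid)     # p_1: uniform measure on the grid

# a sequence of convex losses on the grid
losses = [np.abs(grid - rng.uniform(-0.5, 0.5)) for _ in range(T)]
q_idx = np.argmin(sum(losses))         # best fixed point in hindsight

lhs = 0.0                              # sum_t <p_t - q, ell_t>
var_term = 0.0                         # eta * sum_t <p_t, ell_t^2>
for ell in losses:
    lhs += p @ ell - ell[q_idx]
    var_term += eta * (p @ ell**2)
    p = p * np.exp(-eta * ell)         # exponential weights update
    p /= p.sum()                       # normalization constant Z_{t+1}

ent_term = np.log(len(grid)) / eta     # Ent(q || p_1) for q a point mass
print(lhs, "<=", ent_term + var_term)  # the bound (negative entropy term dropped)
```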
Kernels
The usual bandit argument is to use the bandit feedback $\ell_t(x_t)$ to build some unbiased estimator $\tilde{\ell}_t$ of $\ell_t$ (i.e. such that $\mathbb{E}[\tilde{\ell}_t] = \ell_t$) and then use the full information primitive with the estimated losses instead of the real ones. The previous theorem indicates that the only difficulty is then to control the variance of these estimators, i.e. $\mathbb{E} \langle p_t, \tilde{\ell}_t^2 \rangle$. This is where the shape assumption on the losses has to be used: the observation of the loss at $x_t$ has to be somehow propagated to nearby points so as to have a low variance estimator. The bottleneck of bandit convex optimization is exactly that it is highly unclear how to leverage the convexity assumption to propagate information to nearby points. Our first idea is that one should not estimate $\ell_t$ directly, but rather it is enough to estimate a coarse approximation to $\ell_t$, where the level of coarseness depends on our level of uncertainty about where the optimum of the current cumulative loss lies. In particular “far” (where distances are measured with respect to the variance of our current exponential weights) from our current believed optimum (which is basically the center of mass of our current exponential weights) a very rough estimation of $\ell_t$ is probably enough. We found that the best way to realize this idea is via kernels, as we now explain.
Let $K$ be a kernel (which will depend on the current exponential weights distribution $p$), which we think of as a linear operator over probability measures through the formula $Kq(x) = \int K(x,y) q(y) dy$. The adjoint $K^*$ acts on functions: $K^* f(y) = \int K(x,y) f(x) dx$ (indeed $\langle Kq, f \rangle = \langle q, K^* f \rangle$). The key point of $K^* \ell$ is that there is a canonical unbiased estimator of it based on the observation of $\ell(x)$ with $x$ sampled from some distribution $\mu$, namely

$$\tilde{\ell}(y) = \frac{\ell(x) K(x, y)}{\mu(x)},$$

since then $\mathbb{E}_{x \sim \mu} \tilde{\ell}(y) = \int K(x,y) \ell(x) dx = K^* \ell(y)$. That’s great, and we will see soon how to control the variance of this estimator, but first let us talk about a more pressing issue: with these estimators one would control the regret as if the adversary was playing the losses $K^* \ell_t$, but what we want is to control the regret with respect to the real losses $\ell_t$. In other words we control terms of the form $\langle p_t - q, K^* \ell_t \rangle = \langle K p_t - K q, \ell_t \rangle$, but what we really care about is something like $\langle \mu_t - q, \ell_t \rangle$ for some $\mu_t$ potentially different from $p_t$ (indeed we can play from some $\mu_t$ which can be different from the exponential weights distribution). A natural guess is to take $\mu_t = K p_t$, in which case we would be satisfied by an inequality of the form

$$\langle K p - K q, \ell \rangle \geq \lambda \, \langle K p - q, \ell \rangle \quad \text{for some } \lambda \in (0,1].$$
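In a discrete toy model the kernel is just a (column-stochastic) matrix, and the unbiasedness of the estimator above is easy to check by Monte Carlo. Here is a short script of my own doing so, with an arbitrary kernel and $\mu = Kp$ (the sampling distribution we will end up using):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 6
K = rng.random((N, N)); K /= K.sum(axis=0)  # column y is the measure K(delta_y)
p = rng.random(N); p /= p.sum()             # current exponential weights
ell = rng.random(N)                         # an arbitrary loss vector

Kp = K @ p                                  # the distribution we sample from
Kstar_ell = K.T @ ell                       # adjoint applied to the loss

# observe ell[x] at x ~ Kp and output the estimator ell[x] * K[x, :] / Kp[x]
xs = rng.choice(N, size=200_000, p=Kp)
est = (ell[xs, None] * K[xs, :] / Kp[xs, None]).mean(axis=0)
print(np.abs(est - Kstar_ell).max())        # ~ 0: the estimator is unbiased
```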
Generalized Bernoulli convolutions
(What follows is an almost exact copy of Section 1.2.2. in the main paper.) As we just explained above, for a given measure $p$ we want to find a kernel $K$ such that

$$\langle K p - K \delta_x, f \rangle \geq \lambda \, \langle K p - \delta_x, f \rangle$$

for all convex functions $f$ and all points $x$ (by linearity in $q$ it suffices to consider point masses). We note that for any $x$ one has

$$\langle K p - K \delta_x, f \rangle \geq \lambda \, \langle K p - \delta_x, f \rangle \;\Longleftrightarrow\; K^* f(x) \leq \lambda f(x) + (1 - \lambda) \langle K p, f \rangle. \qquad (1)$$

Leveraging the fact that $f$ is convex we see that a natural kernel to consider is such that $K \delta_x$ is the distribution of $\lambda x + (1 - \lambda) Z$ for some random variable $Z$ to be defined. Indeed in this case one has

$$K^* f(x) = \mathbb{E} \, f(\lambda x + (1 - \lambda) Z) \leq \lambda f(x) + (1 - \lambda) \, \mathbb{E} \, f(Z).$$

Thus this kernel satisfies the right hand side of (1) if $Z$ is defined to be equal to $K p$ in distribution, that is $Z$ satisfies the following distributional identity, where $X \sim p$,

$$Z \overset{(d)}{=} \lambda X + (1 - \lambda) Z. \qquad (2)$$
If (2) holds true we say that $Z$ is the core of $p$. It is easy to see that the core always exists and is unique by taking

$$Z = \lambda \sum_{k=0}^{+\infty} (1 - \lambda)^k X_k,$$

where $X_0, X_1, \ldots$ are i.i.d. copies of $X$. Interestingly such random variables have a long history for the special case of a random sign $X$, where they are called Bernoulli convolutions (they were introduced by Erdős in a 1939 paper). Our notion of core can thus be viewed as a generalized Bernoulli convolution. We refer the reader to the very nice survey on Bernoulli convolutions by Peres, Schlag and Solomyak, and we simply mention that the main objective in this literature is to understand for which values of $\lambda$ the random variable $Z$ is “smooth” (say for instance absolutely continuous with respect to the Lebesgue measure).
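The geometric series above also gives a simple way to approximately sample from the core. The following sketch (mine, with arbitrary choices of $p$, $\lambda$ and the truncation level) checks the distributional identity (2) empirically in one dimension:

```python
import numpy as np

rng = np.random.default_rng(2)
lam, m, n_samples = 0.3, 200, 100_000

def sample_p(size):                   # X ~ p; here p = uniform on [0, 1]
    return rng.random(size)

# core: Z = lam * sum_k (1 - lam)^k X_k, series truncated after m terms
X = sample_p((n_samples, m))
Z = lam * (X * (1 - lam) ** np.arange(m)).sum(axis=1)

# identity (2): lam * X + (1 - lam) * Z should have the same law as Z
W = lam * sample_p(n_samples) + (1 - lam) * Z
print(Z.mean(), W.mean())             # matching moments, up to Monte Carlo error
print(Z.std(), W.std())
```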
Summarizing the discussion so far, we see that by playing at round $t$ a point $x_t$ drawn from $K_t p_t$, where $K_t$ is the kernel described above (for the measure $p_t$) and $p_t$ is the continuous exponential weights strategy on the estimated losses

$$\tilde{\ell}_t(y) = \frac{\ell_t(x_t) K_t(x_t, y)}{K_t p_t(x_t)},$$

one has for any $q$,

$$\lambda \, \mathbb{E} \sum_{t=1}^{T} \langle K_t p_t - q, \ell_t \rangle \leq \mathbb{E} \sum_{t=1}^{T} \langle K_t p_t - K_t q, \ell_t \rangle = \mathbb{E} \sum_{t=1}^{T} \langle p_t - q, \tilde{\ell}_t \rangle \leq \frac{\mathrm{Ent}(q \| p_1)}{\eta} + \eta \, \mathbb{E} \sum_{t=1}^{T} \langle p_t, \tilde{\ell}_t^2 \rangle.$$

The term in the left hand side is almost exactly the expected regret by taking $q$ to be uniform on a small ball around the minimizer of $\sum_{t=1}^{T} \ell_t$, and in this case the term $\mathrm{Ent}(q \| p_1)$ is of order $n \log T$.
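Putting the pieces together, here is a discretized one-dimensional sketch of the resulting algorithm (my own illustration, not the paper’s implementation: the grid, the fixed loss, the Monte Carlo construction of the kernel, and the tuning of $\eta$ and $\lambda$ are all arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
grid = np.linspace(-1.0, 1.0, 50); N = len(grid)
T, eta, lam, m, n_mc = 100, 0.05, 0.3, 60, 5000

def kernel_matrix(p):
    """Column y: law of lam * grid[y] + (1 - lam) * Z, with Z the
    (truncated) core of p, binned back onto the grid."""
    X = grid[rng.choice(N, size=(n_mc, m), p=p)]
    Z = lam * (X * (1 - lam) ** np.arange(m)).sum(axis=1)
    K = np.empty((N, N))
    for y in range(N):
        idx = np.clip(np.searchsorted(grid, lam * grid[y] + (1 - lam) * Z), 0, N - 1)
        K[:, y] = np.bincount(idx, minlength=N) / n_mc
    return K

ell = np.abs(grid - 0.4)                  # one fixed convex loss, for simplicity
p, total = np.ones(N) / N, 0.0
for t in range(T):
    K = kernel_matrix(p)
    Kp = K @ p
    x = rng.choice(N, p=Kp / Kp.sum())    # play x_t ~ K_t p_t
    total += ell[x]
    ell_hat = ell[x] * K[x, :] / Kp[x]    # unbiased estimator of K_t^* ell_t
    p = p * np.exp(-eta * ell_hat)        # exponential weights on estimated losses
    p /= p.sum()
print(total - T * ell.min())              # regret against the best fixed point
```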
Variance control in the Gaussian case
All that remains to be done is to control the variance term $\mathbb{E} \langle p, \tilde{\ell}^2 \rangle$ where

$$\tilde{\ell}(y) = \frac{\ell(x) K(x, y)}{K p(x)}, \qquad x \sim K p.$$

More precisely if this quantity is $O(1)$ (up to logarithmic factors) then we obtain a regret of order $\frac{1}{\lambda} \sqrt{n T}$. This variance control will turn out to be more challenging than one might expect and our goal for the rest of this first lecture is to derive some intuition from the case where $p$ is a standard Gaussian. In the next lecture we will see how to somehow reduce the general case to the Gaussian case. In the calculations below we will also assume that $p$ is compactly supported on the Euclidean ball of radius $O(\sqrt{n})$. Of course this is at odds with the Gaussian assumption, but again this issue will be dealt with in the next lecture. The more tricky part will be to ensure that the exponential weights distribution is indeed truncated outside of a ball of such radius (observe that $p$ is assumed to be isotropic, so with respect to the distances in the original space this ball might represent a tiny ball in $\mathcal{K}$). This truncation (and dealing with the issues that result from it) will be the topic of the third lecture (this is where the negative entropic term in the regret bound of exponential weights will turn out to be useful!).
First observe that to bound the variance term it is in fact sufficient to show that $\frac{\ell(x) K(x, y)}{K p(x)}$ is bounded for all $y$ in the support of $p$ and all $x$ in the support of $K p$. In fact it is fine if we have this control with probability at least $1 - 1/\mathrm{poly}(T)$ with respect to the draw of $x$ from $K p$ (recall that eventually $x$ is the played point, so events with small probability on $x$ can intuitively be ignored). Next we need to understand what is $\frac{K(x, y)}{K p(x)}$ when $p$ is a standard Gaussian. Observe that, with $Z$ denoting the core of $p$, one always has $K p = \mathrm{law}(Z)$ (this is exactly the distributional identity (2)). Furthermore it is a straightforward calculation that the core of a standard Gaussian is a centered Gaussian with covariance $\frac{\lambda}{2 - \lambda} I_n$. Since $K \delta_y$ is the law of $\lambda y + (1 - \lambda) Z$, we thus obtain:

$$\frac{K(x, y)}{K p(x)} = \frac{1}{(1 - \lambda)^n} \exp\left( \frac{2 - \lambda}{2 \lambda} \left( |x|^2 - \frac{|x - \lambda y|^2}{(1 - \lambda)^2} \right) \right).$$
Finally note that with high probability one has $|x| = O(\sqrt{\lambda n})$ (since $x \sim K p = \mathcal{N}(0, \frac{\lambda}{2 - \lambda} I_n)$). Ignoring logarithmic factors, and provided that somehow one could take $|y| = O(\sqrt{n})$ (which is the ball that contains most of the mass of a standard Gaussian so it seems reasonable), we see that for $\lambda = 1/n^2$ we have a constant variance, which in turn leads to a regret of order $n^{2.5} \sqrt{T}$. (Indeed, expanding the square in the display above, the prefactor $(1 - \lambda)^{-n} \approx e^{\lambda n}$ is compensated by the term coming from $-|x|^2$, and by Cauchy-Schwarz the exponent is then dominated by a term of order $|x| \cdot |y| \approx \sqrt{\lambda} \, n$.) In fact by being more careful in bounding the variance one could hope that $\lambda = 1/n$ is enough (basically in the display above one could do a moment generating function calculation instead of Cauchy-Schwarz to replace the term $|x| \cdot |y|$ by $\langle x, y \rangle$) which in turn would give a $n^{1.5} \sqrt{T}$ regret. This latter result is what we conjecture to be optimal, but we are quite far from being able to prove this!
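Before moving on, here is a quick numerical check of the two Gaussian facts used above (that $Kp$ is the law of the core, and that the core of a standard Gaussian has covariance $\frac{\lambda}{2 - \lambda} I_n$). Again this is a toy script of my own, in one dimension:

```python
import numpy as np

rng = np.random.default_rng(3)
lam, m, n_samples = 0.2, 300, 200_000

# core of the standard Gaussian: Z = lam * sum_k (1 - lam)^k X_k, X_k ~ N(0, 1)
X = rng.standard_normal((n_samples, m))
Z = lam * (X * (1 - lam) ** np.arange(m)).sum(axis=1)
print(Z.var(), lam / (2 - lam))       # empirical vs predicted variance

# Kp is the law of lam * X + (1 - lam) * Z with X ~ p: same law as the core
W = lam * rng.standard_normal(n_samples) + (1 - lam) * Z
print(W.var())                        # again ~ lam / (2 - lam)
```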
In the next lecture we will do a little bit of geometry of log-concave measures to show how to reduce the general case to the Gaussian case. In the third and final lecture we will deal with the truncation of the exponential weights distribution (this will force us to significantly modify the algorithm).