(This post absolutely requires having read Part 1 and Part 2.) A key assumption at the end of Part 1 was that, after rescaling space so that the current exponential weights distribution is isotropic, one has
(1)
for some . Effectively this means that the estimated cumulative loss outside of the ball
is infinite (recall that
is proportional to
). Thus to enforce (1) (at time
) we will actually set the loss estimate
to
on
. The price one pays for this is that the loss estimator is now unbiased only on
, which in turn means that we control the cumulative regret with respect to points in
only. We believe that in the so-called stochastic setting where the loss sequence
is i.i.d. one should be able to prove that the minimizer of the expected loss function remains in
at all rounds
(for
) which would imply with the calculations from Part 1 and Part 2 a cumulative regret of order
(note: we were not able to prove this and I think it is a great and accessible open problem). However this will certainly not be true in general, as one could imagine an adversary that makes us zoom in on a small portion of space using the first
rounds and then moves the optimum very far away from this region for the remaining
rounds. To solve this issue we introduce a restart condition: essentially if the algorithm believes that the optimum might be outside the region
then it restarts (i.e. it plays as if the next round was the first one in a game with time horizon
). We will modify the algorithm so we can ensure that when a restart occurs the algorithm actually had negative regret, and thus in the final regret bound we can ignore everything that happened before the last restart. The definition of the focus region
given in (1) will not quite work, and in fact we will construct a region which verifies
Furthermore we will need to take which in turn forces us to take
(indeed recall that in Part 1 we showed that the magnitude of the estimated loss is essentially
where
instead of
comes from the above display, and furthermore in Part 2 we explained that with the Gaussian core one needs to take
to be
times smaller than the predicted value from Part 1) and hence the
final regret bound (indeed recall that we got a bound of
where the first
comes from the fact that we need to take a learning rate
times smaller than the optimal learning rate to ensure approximate log-concavity of the exponential weights, and the
comes from the relative entropy distance to the optimum at the beginning).
We will use extensively the following result:
Lemma: Let
be an isotropic log-concave measure. Then for any
such that
one has
.
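The precise quantities in the lemma have not survived in this rendering, but its flavor is the standard concentration of isotropic log-concave measures: the mass far from the origin decays exponentially. As a generic Monte Carlo illustration (using a standard Gaussian, the simplest isotropic log-concave measure; this is not the lemma's exact statement):

```python
import numpy as np

rng = np.random.default_rng(0)
n, samples = 20, 200_000

# Standard Gaussian in R^n: the simplest isotropic log-concave measure
# (zero mean, identity covariance).
pts = rng.standard_normal((samples, n))
norms = np.linalg.norm(pts, axis=1)

# Isotropy puts the typical norm near sqrt(n); log-concavity then makes
# the mass beyond c * sqrt(n) decay exponentially in c.
tails = {c: float(np.mean(norms > c * np.sqrt(n))) for c in (1.0, 1.5, 2.0)}
for c, t in tails.items():
    print(f"P(|X| > {c} * sqrt(n)) ~= {t:.5f}")
```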
Restart condition
The simplest idea to restart goes as follows. Let us consider some (with
defined as in (1)). Why is it there? Well, there must be some time
in the past where we estimated the cumulative regret of
to be at least
. In particular if this
now has a small estimated regret, say smaller than
then we can reasonably believe that something weird is going on, for instance the adversary could be trying to move the optimum outside of the current region of focus
. Thus our restart condition looks like this: restart if
A key observation (which is the real reason to introduce the above restart condition) is that by convexity of the losses, by the concentration property of Part 2, and by taking the constant to be large enough, we know that the optimum (over all of
) of the “true” cumulative loss
is still in
at the time when we restart.
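To make the restart condition concrete, here is a minimal sketch on a finite grid of candidate points. All names and thresholds (`should_restart`, `entry_threshold`, `slack`) are hypothetical stand-ins, since the exact constants from the text are not reproduced here.

```python
import numpy as np

def should_restart(est_cumloss, boundary_ids, entry_threshold, slack):
    """Restart check, sketched on a finite grid of candidate points.

    est_cumloss    : current estimated cumulative loss of each grid point
    boundary_ids   : points sitting on the boundary of the focus region;
                     each was cut off when its estimated regret was at
                     least entry_threshold
    slack          : if such a point's estimated regret has dropped below
                     entry_threshold / slack, something weird is going on
                     (the adversary may be moving the optimum out), so restart
    """
    est_regret = est_cumloss - est_cumloss.min()
    return bool(np.any(est_regret[boundary_ids] < entry_threshold / slack))

# Toy run, all numbers hypothetical: point 3 was cut off with estimated
# regret >= 10, but its current estimated regret is only 0.5 -> restart.
print(should_restart(np.array([5.0, 6.0, 4.0, 4.5]), [3], 10.0, 2.0))
```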
Hoping for negative regret
If we catch the adversary’s presumptive attempt to move the optimum out of quickly enough we can hope to get negative regret. Indeed we initially zoomed in on region
for a good reason, and thus if we compare ourselves to the point
which triggers the restart we have accumulated a lot of negative regret during this zooming procedure (since
was performing badly during this time). Mathematically this shows up in the following term which appears in the regret bound calculation, where
is the time at which
enters the boundary of the focus region,
(2)
Roughly (essentially thanks to the Lemma above) we should expect this quantity to be as small as for
, and thus this term divided by
(which is then
) could easily compensate the variance term which is
. This sounds great, but unfortunately the difference of entropy in (2) does not appear in our current regret bound (because of the telescopic sum of entropies). However it is easy to see that if at time step
we had increased the learning rate from
to
then we would have the term
multiplied by
in the final regret bound and thus we would only need to have
to ensure negative regret (note that compensating the entropy at the beginning is more difficult than compensating the variance term because of our choice of
). Let us make this slightly more formal.
Turning up the learning rate
Let us assume that we update the exponential weights distribution with a time-dependent learning rate as follows: . One can see, with the same calculations as in Part 1, that:
In particular if we increase the learning rate at some times by
then we have
Using that the normalizing constants are all larger than (this can be proved using the fact that the covariance of the exponential weights is at least some
) and assuming that
(i.e.
) one gets roughly, with
,
where the second inequality uses that (see Part 2).
We see that by taking we can guarantee negative regret with respect to any
whose first time on the boundary of the focus region is also a time at which we increased the learning rate (i.e., one of the times
). Thus we see that we should take
if we want to have a regret of order
. Recall also that we can afford to update the learning rate only
times. Thus the next idea is to update the focus region only when it is necessary. We will see that we can take
(and thus
,
) while guaranteeing that the focus region satisfies
As we explained at the beginning of this post this will conclude the proof of the regret bound.
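The "turning up the learning rate" trick can be sanity-checked numerically: with a time-dependent rate, the exponential weights update takes a recursive form in which the current density is raised to the power of the ratio of consecutive learning rates before the usual update. A discretized sketch (the grid, losses, and rate schedule are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
T, K = 6, 50                       # rounds and grid size for this sketch
losses = rng.uniform(size=(T, K))  # hypothetical loss estimates on the grid
# eta[t] is the learning rate in force after round t; it is increased
# at two (hypothetical) times in this schedule.
eta = np.array([0.1, 0.1, 0.2, 0.2, 0.2, 0.4])

# Direct form: the distribution after round t is proportional to
# exp(-eta[t] * cumulative loss up to t).
cum = losses.cumsum(axis=0)
direct = np.exp(-eta[-1] * cum[-1])
direct /= direct.sum()

# Recursive form: raise the current density to the power eta_new / eta_old,
# then take the usual exponential-weights step with the new rate.
p = np.exp(-eta[0] * losses[0])
p /= p.sum()
for t in range(1, T):
    p = p ** (eta[t] / eta[t - 1]) * np.exp(-eta[t] * losses[t])
    p /= p.sum()

print(np.allclose(p, direct))  # -> True: the two forms agree
```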
Updating the focus region more carefully
We update the focus region only at times such that, once space is rescaled so that
is isotropic,
and in this case we set
Somewhat clearly we won’t make more than updates (since after that many updates the focus region is really tiny) so the only thing to verify is that one always has (whether we update or not)
This follows from:
Lemma: Let
be a convex body and
be the centered unit ball. Suppose that
. Then
.
Proof: Let us prove the contrapositive and assume that there is a point with
. Denote
,
and consider the sets
.
Note that those sets are disjoint. Indeed, the intervals are disjoint, which implies that the projections of the ellipsoids
onto the span of
are disjoint. So, we have
which concludes the proof.
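A rough sketch of the focus-region update on a finite set of representative points: the whitening step (rescaling so that the exponential weights distribution is isotropic) is standard, while the threshold `radius` is a hypothetical stand-in for the formulas not reproduced here.

```python
import numpy as np

def maybe_update_focus(points, mean, cov, radius):
    """Sketch of the focus-region update; `radius` is a hypothetical
    threshold standing in for the lost formulas.

    `points` is a finite set representing the current focus region.  We
    rescale space so the exponential-weights distribution (mean, cov) is
    isotropic, and shrink the region only if it sticks out beyond
    `radius` in those coordinates.
    """
    # Whitening map x -> cov^{-1/2} (x - mean) makes the distribution isotropic.
    w, v = np.linalg.eigh(cov)
    whiten = v @ np.diag(w ** -0.5) @ v.T
    z = (points - mean) @ whiten.T
    norms = np.linalg.norm(z, axis=1)
    if norms.max() <= radius:          # region already inside the ball: no update
        return points
    return points[norms <= radius]     # intersect with the ball (whitened space)

# Toy run with hypothetical numbers: the far point is cut off.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
kept = maybe_update_focus(pts, mean=np.zeros(2), cov=np.eye(2), radius=3.0)
print(len(kept))
```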
That’s it folks!
This concludes the informal proof! In the real proof we unfortunately need to deal with the various places where we said "events of probability do not matter". There is also the discrepancy between approximately log-concave and exactly log-concave measures. Finally, doing the concentration argument properly (with the induction) is one more place that adds some unwanted (but relatively minor) technical complications.
Another important part which I completely omitted in these lecture notes is the computational complexity of the resulting algorithm. Using known results on sampling from approximately log-concave measures it is fairly easy to sample in time
. Checking whether the focus region has to be updated can also be done in poly time. The only real difficulty is checking the restart condition, which requires minimizing an approximately convex function on the boundary of a convex set! The trick here is that one can replace ellipsoids by boxes, and thus we are left with minimizing an approximately convex function on
convex sets (which are
dimensional). All of this can be found in the paper of course!