Kernel-based methods for convex bandits, part 2

The goal of this second lecture is to explain how to carry out the variance calculation from the end of Part 1 in the case where the exponential weights distribution is non-Gaussian. Here we will lose a factor of n^2 in the regret bound compared to the optimistic calculations from Part 1, and thus by the end of this lecture we can still hope to prove a regret bound of order n^{4.5} \sqrt{T}.

Gaussian core

First let us revisit the introduction of the core. Recall that we are considering a kernel K such that K \delta_x is the distribution of (1- \lambda) Z + \lambda x for some random variable Z to be defined. To control the regret we want to have the following inequality for any convex function f:

    \[K^* f(x) \leq (1-\lambda) \langle Kp, f\rangle + \lambda f(x) ,\]

which means that the distribution q of the random variable Z should satisfy

(1)   \begin{equation*} \langle q, f \rangle \leq \langle Kp, f\rangle . \end{equation*}

The core is defined so that (1) is in fact an equality. As we discussed in the previous post the core is a deep mathematical construction, but unfortunately we could not find a way to generalize the variance calculation of Part 1 to the case where the core is non-Gaussian. In what follows we describe a slightly different construction which will allow us to satisfy (1) with q being Gaussian. A key observation is that if r is convexly dominated by p (i.e., for any convex f one has \langle r, f \rangle \leq \langle p, f\rangle), then (1) is satisfied by taking q to be the core of r:

    \[\langle q, f \rangle = \mathbb{E}_{X \sim q} f(X) = \mathbb{E}_{X \sim q, Y \sim r} f((1-\lambda) X + \lambda Y) \leq \mathbb{E}_{X \sim q, Y \sim p} f((1-\lambda) X + \lambda Y) = \langle Kp, f\rangle .\]

Thus it suffices to show that for any p that we care about one can find a Gaussian r “inside” of p (in the sense that r is convexly dominated by p). Then instead of taking the core of p to define the kernel one can take the core of r (which we will also refer to as the Gaussian core of p).
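As a sanity check, the two inequalities at play (domination transferring through K, and the regret-side inequality K^* f(x) \leq (1-\lambda) \langle Kp, f\rangle + \lambda f(x)) can be probed by Monte Carlo. The sketch below is a hypothetical one-dimensional instance of our own choosing: p is a standard Gaussian, q is a narrow centered Gaussian playing the role of a Gaussian measure "inside" p (no actual core is constructed), and \lambda = 0.1.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, N = 0.1, 200_000

# Hypothetical 1-d instance (our choices, not from the lecture): p is a
# standard Gaussian, q is a narrow centered Gaussian "inside" p, and
# K sends delta_x to the law of (1 - lam) * Z + lam * x with Z ~ q.
Z = rng.normal(0.0, 0.1, N)            # Z ~ q
X = rng.normal(0.0, 1.0, N)            # X ~ p
Kp_samples = (1 - lam) * Z + lam * X   # samples from Kp

checks = []
for f in (np.abs, np.square):          # two convex test functions
    q_f = f(Z).mean()                  # <q, f>
    Kp_f = f(Kp_samples).mean()        # <Kp, f>
    checks.append(q_f <= Kp_f + 1e-3)  # domination transfers through K
    # regret-side inequality K^* f(x) <= (1 - lam) <Kp, f> + lam f(x), at x = 1
    Kstar_f_at_1 = f((1 - lam) * Z + lam * 1.0).mean()
    checks.append(Kstar_f_at_1 <= (1 - lam) * Kp_f + lam * f(1.0) + 1e-3)
```

With these particular parameters all four Monte Carlo checks pass; of course this is an illustration, not a proof, and the inequality \langle q, f\rangle \leq \langle Kp, f\rangle holds here because q is narrow, not because it is a core.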

Next we show that we can essentially restrict our attention to the case where p is log-concave, and then we show how to find a Gaussian inside a log-concave measure.

Approximate log-concavity

Recall that in Part 1 we replaced the variance calculation by simply showing that the loss estimates \tilde{\ell}_t(y) are bounded. Thus, by induction together with Hoeffding-Azuma, with probability at least 1-\delta one has

    \[\left|\sum_{s=1}^{t-1} (\tilde{\ell}_s (y) - K_s^* \ell_s(y)) \right| \lesssim \sqrt{T \log(1/\delta)} .\]
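For intuition, this concentration bound is easy to simulate. The increments below are a hypothetical stand-in for the bounded martingale differences \tilde{\ell}_s(y) - K_s^* \ell_s(y) (i.i.d. uniform on [-1,1] is our own illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)
T, trials, delta = 1000, 2000, 0.01

# Bounded, centered increments standing in for the martingale differences;
# i.i.d. uniform on [-1, 1] is a hypothetical choice for illustration.
increments = rng.uniform(-1.0, 1.0, size=(trials, T))
sums = np.abs(increments.sum(axis=1))

# Two-sided Azuma bound for increments bounded by 1 in absolute value:
# P(|sum| >= sqrt(2 T log(2/delta))) <= delta.
bound = np.sqrt(2 * T * np.log(2.0 / delta))
frac_violating = np.mean(sums > bound)
ok = frac_violating <= delta
```

In this simulation essentially no trajectory exceeds the \sqrt{2 T \log(2/\delta)} threshold, consistent with the bound.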

Using a union bound over an \epsilon-net together with the Lipschitzness of \tilde{\ell}_s (which we did not prove, but it follows along the same lines as boundedness) one has in fact that with high probability (say 1-1/T^{10}) there is a convex function f (recall that \mathbb{E}_{x_t \sim K_t p_t} \tilde{\ell}_t = K_t^* \ell_t is a convex function) such that

    \[\|f - \sum_{s=1}^{t-1} \tilde{\ell}_s\|_{\infty} \lesssim \sqrt{T n \log(T)} .\]

In particular, provided that \eta \lesssim \frac{1}{\sqrt{T n \log(T)}}, the above display shows that p_{t} (whose density is proportional to \exp(-\eta \sum_{s=1}^{t-1} \tilde{\ell}_s)) is O(1)-approximately log-concave (recall that p is said to be \epsilon-approximately log-concave if there is a log-concave function q such that \epsilon \leq p/q \leq 1/\epsilon). To simplify the discussion in these lectures we will in fact assume that p_t is exactly log-concave.
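The mechanism is elementary: if g is within C of a convex function f in sup-norm and \eta \leq 1/C, then the unnormalized densities e^{-\eta g} and e^{-\eta f} differ pointwise by a factor at most e. Here is a toy numerical sketch with our own hypothetical choice of f and of the bounded perturbation:

```python
import numpy as np

xs = np.linspace(-3.0, 3.0, 1001)

# Hypothetical instance: f is convex and g = f + (bounded perturbation), with
# ||g - f||_inf <= C, mimicking a sum of loss estimates that is uniformly
# close to a convex function.
C = 5.0
f = xs ** 2                        # convex
g = f + C * np.sin(7 * xs)         # |g - f| <= C everywhere
eta = 1.0 / C                      # learning rate eta <= 1/C

p = np.exp(-eta * g)               # unnormalized exponential-weights density
q = np.exp(-eta * f)               # log-concave comparison density
ratio = p / q                      # equals exp(-eta * (g - f)), in [1/e, e]
ok = ratio.min() >= np.exp(-1.0) - 1e-12 and ratio.max() <= np.exp(1.0) + 1e-12
```

(Normalizing the two densities only costs another bounded factor in the ratio.)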

We note that the above concentration argument makes us lose a factor of n in the regret. Indeed in Part 1 we used \eta \approx \sqrt{n / T} to (informally) obtain the regret n^{2.5} \sqrt{T} (or n^{1.5} \sqrt{T} with even more optimistic calculations). We see that now we are forced to take a learning rate which is n times smaller than this, which in turn multiplies the regret bound by a factor of n.

Finding a Gaussian inside a log-concave

Let p be an isotropic log-concave measure. We will show that a centered Gaussian G with covariance \tilde{O}(\frac{1}{n}) I_n is (approximately) convexly dominated by p. We note in passing that, going back to the calculations at the end of Part 1, it is easy to see that this factor \tilde{O}(\frac{1}{n}) in the covariance will in turn force us to take \lambda to be \tilde{O}(\frac{1}{n}) smaller than what we hoped for at the end of Part 1, leading to yet another factor of n in the regret bound.

In fact we will show that any centered measure q supported on a ball of radius \epsilon=O(1) is (exactly) convexly dominated by p. This implies the (approximate) Gaussian convex domination we mentioned above since most of the mass of G is in such a ball (it is easy to deal with the remaining \mathrm{poly}(1/T) mass but we won’t worry about it here).
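This claim is easy to probe numerically. The sketch below is a hypothetical instance of our own choosing (p standard Gaussian in \mathbb{R}^5, which is isotropic and log-concave; q uniform on a centered sphere of radius \epsilon = 0.3; a few convex test functions), checking \langle q, f \rangle \leq \langle p, f \rangle by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(2)
n, N, eps = 5, 200_000, 0.3

# Hypothetical instance: q is uniform on a centered sphere of radius eps
# (a centered measure supported on the eps-ball) and p is a standard
# Gaussian on R^n (isotropic and log-concave).
U = rng.normal(size=(N, n))
q_samples = eps * U / np.linalg.norm(U, axis=1, keepdims=True)
p_samples = rng.normal(size=(N, n))

a = np.ones(n) / np.sqrt(n)        # a fixed unit direction
convex_tests = [
    lambda x: np.abs(x @ a),
    lambda x: np.sum(x ** 2, axis=1),
    lambda x: np.maximum(0.0, x @ a - 0.1),
]
# Monte Carlo check of <q, f> <= <p, f> for each convex test function
ok = all(f(q_samples).mean() <= f(p_samples).mean() for f in convex_tests)
```

Checking a handful of test functions is of course no substitute for the proof over all convex f, which follows below.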

Let us fix a convex function f. We want to show that \langle q, f \rangle \leq \langle p, f \rangle. Since both measures are centered we may add any affine function to f without affecting this inequality, so we may assume that f(0)=0 and f(x) \geq 0 for all x\in \mathbb{R}^n. By scaling we may also assume that the maximum of f on the centered ball of radius \epsilon is 1, and thus it is enough to show \langle p, f \rangle \geq 1. By convexity f lies above the linear function x \mapsto 1+ \nabla f(y)^{\top} (x - y), where y is the maximizer of f on the centered ball of radius \epsilon. Note also that |\nabla f(y)| \geq 1/\epsilon (indeed, by convexity 1 = f(y) - f(0) \leq \nabla f(y)^{\top} y \leq \epsilon |\nabla f(y)|) and thus it is enough to show (recall that f \geq 0):

    \[\langle p, x \mapsto \max(0, \theta^{\top} (x-y)) \rangle \geq \epsilon ,\]

where \theta = \nabla f(y) /|\nabla f(y)|. Since |\theta^{\top} y| \leq \epsilon, this is in turn implied by

    \[\langle p, x \mapsto \max(0, \theta^{\top} x) \rangle \geq 2 \epsilon .\]

It only remains to use the fact that an isotropic real log-concave random variable X satisfies \mathbb{E} (X \textbf{1}\{X \geq 0\}) = \Theta(1), and thus the above display holds true with \epsilon = O(1).
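This last fact is easy to check on concrete examples: for the three isotropic log-concave laws below (the choices are ours, for illustration) the value of \mathbb{E} (X \textbf{1}\{X \geq 0\}) is an absolute constant, roughly between 0.35 and 0.44.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 1_000_000

# Three isotropic (mean 0, variance 1) real log-concave laws; in each case
# E[X 1{X >= 0}] = E[max(X, 0)] is an absolute constant of order 1.
samples = {
    "gaussian": rng.normal(0.0, 1.0, N),                     # exact: 1/sqrt(2 pi) ~ 0.399
    "uniform": rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), N),  # exact: sqrt(3)/4 ~ 0.433
    "laplace": rng.laplace(0.0, 1.0 / np.sqrt(2.0), N),      # exact: 1/(2 sqrt(2)) ~ 0.354
}
vals = {name: np.maximum(x, 0.0).mean() for name, x in samples.items()}
ok = all(0.3 <= v <= 0.5 for v in vals.values())
```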
