Big congratulations to my former intern Mark Sellke, and to the CMU team (C. J. Argue, Anupam Gupta, Guru Guruganesh, and Ziye Tang) for jointly winning the best paper award at SODA 2020 (as well as the best student paper for Mark)! They obtain a competitive ratio linear in the dimension for convex body chasing, a problem which was entirely open for any dimension $d \geq 3$ just two years ago. What's more is that they found the algorithm (and proof) from The Book! Let me explain.
Convex body chasing
In convex body chasing an online algorithm is presented at each time step $t$ with a convex body $K_t \subset \mathbb{R}^d$, and it must choose a point $x_t \in K_t$. The online algorithm is trying to minimize its total movement, namely $\sum_{t \geq 0} \|x_t - x_{t+1}\|$ (all the norms here will be Euclidean, for simplicity). To evaluate the performance of the algorithm, one benchmarks it against the smallest movement one could achieve for this sequence of bodies, if one knew in advance the whole sequence. A good example to have in mind: if the sequence $(K_t)$ corresponds to lines rotating around some fixed point, then the best thing to do is to move to this fixed point and never move again, while a greedy algorithm might have infinite movement (in the continuous time limit). The competitive ratio is the worst case (over all possible sequences of convex bodies) ratio of the algorithm's movement to that of the oracle optimum. This problem was introduced in 1993 by Friedman and Linial, mainly motivated by considering a more geometric version of the $k$-server problem.
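To make the rotating-lines example concrete, here is a small Python sketch (a toy instance of my own, not taken from any of the papers): the greedy algorithm projects its current point onto each new line, and its total movement blows up as the rotation step shrinks, while moving to the fixed point once costs a constant.

```python
import numpy as np

# Greedy chasing of lines through the origin that rotate by an angle
# `delta` each step: project the current point onto the new line.
# The oracle optimum simply moves to the origin once, at cost 1.
def greedy_movement(delta, steps, x0=(1.0, 0.0)):
    x, total = np.array(x0), 0.0
    for t in range(1, steps + 1):
        u = np.array([np.cos(t * delta), np.sin(t * delta)])  # unit direction of the line
        proj = np.dot(x, u) * u                               # closest point on the line
        total += np.linalg.norm(proj - x)
        x = proj
    return total

for delta in [0.5, 0.1, 0.02]:
    print(f"delta={delta}: greedy movement ~ {greedy_movement(delta, int(200 / delta)):.1f}")
```

The per-step movement shrinks like the rotation angle, but the distance to the origin decays too slowly for the sum to stay bounded: the total grows roughly like $1/\delta$.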
As it turns out, this problem in fact has a lot of “real” applications, which is not too surprising given how elementary the problem is. Just to give a flavor, Adam Wierman and friends show that dynamic powering of data centers can be viewed as an instance of convex *function* chasing. In convex function chasing, instead of a convex body one gets a convex function $f_t : \mathbb{R}^d \rightarrow \mathbb{R}$, and there is then a movement cost $\|x_{t-1} - x_t\|$ and a *service cost* $f_t(x_t)$. While this is more general than convex body chasing, it turns out that convex body chasing in dimension $d+1$ is enough to solve convex function chasing in dimension $d$ (this is left to the reader as an exercise). Roughly speaking, in dynamic power management one controls various power levels (number of servers online, etc.), and a request corresponds to a new set of jobs. Turning servers on/off has an associated cost (the movement cost), while servicing the jobs with a certain number of servers has an associated delay (the service cost). You get the idea.
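To make the two costs concrete, here is a minimal sketch of a convex function chasing instance (my own toy choice of quadratic requests, not the data center model itself):

```python
import numpy as np

# Toy convex function chasing instance in R^2: the request at time t is
# f_t(x) = ||x - a_t||^2 for a drifting target a_t.
# Total cost of a trajectory = movement cost + service cost.
rng = np.random.default_rng(0)
targets = np.cumsum(0.3 * rng.standard_normal((20, 2)), axis=0)  # the a_t's

def total_cost(points, start=np.zeros(2)):
    path = np.vstack([start, points])
    movement = np.linalg.norm(np.diff(path, axis=0), axis=1).sum()
    service = ((points - targets) ** 2).sum()
    return movement + service

print("chase each minimizer:", total_cost(targets))              # x_t = a_t, zero service cost
print("never move at all:   ", total_cost(np.zeros_like(targets)))
```

Neither extreme is optimal in general; the whole game is to trade movement against service.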
The Steiner point
In a paper also to appear at SODA, we propose (with Yin Tat Lee, Yuanzhi Li, Bo’az Klartag, and Mark Sellke) to use the Steiner point for the nested convex body chasing problem. In nested convex body chasing the sequence of bodies is nested, $K_{t+1} \subset K_t$, and to account for the irrelevant additive constant we assume that the oracle optimum starts from the worst point in the starting set $K_0$. In other words, opt starts at the point of $K_0$ furthest away from the closest point in $K_T$, so its movement is exactly equal to the Hausdorff distance between $K_0$ and $K_T$. So all we have to do is to give an online algorithm whose movement is proportional to this Hausdorff distance: some kind of Lipschitzness in Hausdorff distance. Well, it turns out that mathematicians have been looking at this for centuries already, and Steiner back in the 1800s introduced just what we need: a selector (a map from convex sets to points in them, i.e., $S(K) \in K$ for any convex set $K$) which is Lipschitz with respect to the Hausdorff distance, that is:
\[ \|S(K) - S(K')\| \leq L \cdot \mathrm{dist}_H(K,K') \,. \]
Note that obtaining such a selector is not trivial. For example, the first natural guess would be the center of gravity, but it turns out that this is not Lipschitz (consider for example what happens with $K$ being a very thin triangle and $K'$ being the base of this triangle; see the numerical sketch below). The crazy thing is that this Steiner point selector (to be defined shortly) is not only Lipschitz, it is in fact the most Lipschitz selector for convex sets. How would you prove such a thing? Well, you clearly need a miracle to happen, and the miracle here is that the Steiner point satisfies some symmetries which define it uniquely. From there, all you need to show is that starting from any selector, you can add some of these symmetries while also improving the Lipschitz constant (for more on this see the references in the linked paper above).
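Here is a quick numerical check of the thin-triangle example (I place the apex above an endpoint of the base so that the centroids separate horizontally): the centroid's displacement stays of constant order while the Hausdorff distance goes to zero, so no Lipschitz constant can work.

```python
import numpy as np

# K = thin triangle with vertices (0,0), (1,0), (1,eps); K' = its base,
# the segment [(0,0), (1,0)]. Then dist_H(K, K') = eps, but the centroid
# of K tends to x = 2/3 while the centroid of K' sits at x = 1/2.
for eps in [1e-1, 1e-3, 1e-6]:
    centroid_K = np.array([(0 + 1 + 1) / 3, (0 + 0 + eps) / 3])  # triangle centroid
    centroid_Kp = np.array([1 / 2, 0.0])                         # midpoint of the base
    ratio = np.linalg.norm(centroid_K - centroid_Kp) / eps
    print(f"eps={eps:.0e}: displacement / Hausdorff distance = {ratio:.1e}")
```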
OK, so what is the Steiner point? It has many definitions, but a particularly appealing one from an optimizer's viewpoint is as follows. For a convex set $K$ and a vector $\theta \in \mathbb{R}^d$, let $g_K(\theta)$ be the maximizer of the linear function $x \mapsto \theta \cdot x$ in the set $K$. Then the Steiner point of $K$, denoted $\mathrm{St}(K)$, is defined by:
\[ \mathrm{St}(K) = \mathbb{E}_{\theta : \|\theta\| \leq 1} [g_K(\theta)] \,. \]
In words: for any direction take the furthest point in $K$ in that direction, and average over all directions. This feels like a really nice object, and clearly it satisfies $\mathrm{St}(K) \in K$ (i.e., it is a selector). The algorithm we proposed in our SODA paper is simply to move to $\mathrm{St}(K_t)$. Let us now see how to analyze this algorithm.
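Before diving into the analysis, here is a minimal Monte Carlo sketch of the Steiner point for a polytope given by its vertices (toy code of mine, not the paper's implementation; for a vertex representation, $g_K(\theta)$ is simply the vertex maximizing $\theta \cdot x$):

```python
import numpy as np

# Monte Carlo estimate of St(K) = E_{||theta|| <= 1}[g_K(theta)] for a
# polytope given by its vertices. Since g_K(theta) only depends on the
# direction of theta, sampling on the sphere gives the same average.
def steiner_point(vertices, n_samples=200_000, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal((n_samples, vertices.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # uniform directions
    winners = np.argmax(theta @ vertices.T, axis=1)        # index of g_K(theta)
    return vertices[winners].mean(axis=0)

triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(steiner_point(triangle))  # a point inside the triangle, as a selector must give
```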
The nested case proof
First we give an alternative formula for the Steiner point. Denote $h_K(\theta) = \max_{x \in K} \theta \cdot x$ for the support function of $K$, and observe that $\nabla h_K(\theta) = g_K(\theta)$ for almost all $\theta$. Thus using the divergence theorem one obtains:
\[ \mathrm{St}(K) = d \cdot \mathbb{E}_{\theta : \|\theta\| = 1} [\theta h_K(\theta)] \,. \]
(The factor $d$ comes from the ratio of the surface area of the unit sphere to the volume of the unit ball, and the $\theta$ in the expectation appears because the outer normal to the sphere at $\theta$ is $\theta$ itself.)
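This identity is easy to sanity-check numerically; the sketch below compares the two formulas in $d = 2$ on the same triangle as above (again toy code of mine):

```python
import numpy as np

# Check that E_{||theta|| <= 1}[g_K(theta)] agrees with
# d * E_{||theta|| = 1}[theta h_K(theta)] for a triangle in d = 2.
rng = np.random.default_rng(1)
V = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # triangle vertices
d, n = 2, 500_000

theta = rng.standard_normal((n, d))
theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # uniform on the unit sphere
inner = theta @ V.T
support = inner.max(axis=1)                            # h_K(theta)
maximizers = V[inner.argmax(axis=1)]                   # g_K(theta)

print("ball formula:  ", maximizers.mean(axis=0))      # g_K depends only on theta/||theta||
print("sphere formula:", d * (theta * support[:, None]).mean(axis=0))
```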
Now we have the following one-line inequality to control the movement of the Steiner point:
\begin{align*} \|\mathrm{St}(K) - \mathrm{St}(K')\| & = d \cdot \left\| \mathbb{E}_{\theta : \|\theta\| = 1} [\theta (h_K(\theta) - h_{K'}(\theta))] \right\| \\ & \leq d \cdot \mathbb{E}_{\theta : \|\theta\| = 1} [ |h_K(\theta) - h_{K'}(\theta)| ] \,. \end{align*}
It only remains to observe that $|h_K(\theta) - h_{K'}(\theta)| \leq \mathrm{dist}_H(K,K')$ to obtain a proof that the Steiner point is $d$-Lipschitz (in fact, as an exercise, you can try to improve the above argument to obtain $\sqrt{d}$-Lipschitz). Now for nested convex body chasing we will have $K_{t+1} \subset K_t$, so that $h_{K_{t+1}}(\theta) \leq h_{K_t}(\theta)$ (i.e., no absolute values), and thus the upper bound on the movement will telescope! More precisely:
\begin{align*} \sum_{t \geq 0} \|\mathrm{St}(K_t) - \mathrm{St}(K_{t+1})\| & \leq d \cdot \sum_{t \geq 0} \mathbb{E}_{\theta : \|\theta\| = 1} [ h_{K_t}(\theta) - h_{K_{t+1}}(\theta) ] \\ & = d \cdot \mathbb{E}_{\theta : \|\theta\| = 1} [ h_{K_0}(\theta) - h_{K_{T}}(\theta) ] \\ & \leq d \cdot \mathrm{dist}_H(K_0, K_T) \,. \end{align*}
This proves that the Steiner point is $d$-competitive for nested convex body chasing. How to generalize this to the non-nested case seemed difficult, and the breakthrough of Mark and the CMU team is to bring back an old friend of the online algorithms community: the work function.
The work function
The work function $W_t : \mathbb{R}^d \rightarrow \mathbb{R}$ is defined as follows. For any point $x \in \mathbb{R}^d$, $W_t(x)$ is the smallest cost one can pay to satisfy all the requests $K_1, \ldots, K_t$ and end up at $x$. First observe that this function is convex. Indeed, take two points $x$ and $y$, and let $z$ be the midpoint between $x$ and $y$ (any point on the segment between $x$ and $y$ would be treated similarly). In the definition of the work function there is an associated trajectory for both $x$ and $y$. By convexity of the requests, the sequence of midpoints between those two trajectories is a valid trajectory for $z$! Moreover, the movement of this mid trajectory is (by the triangle inequality) less than the average of the movements of the trajectories for $x$ and $y$. Hence $W_t(z) \leq \tfrac{1}{2}(W_t(x) + W_t(y))$.
A very simple property we will need from the work function is that the norm of its gradient carries some information on the current request. Namely, $\|\nabla W_t(x)\| = 1$ for any $x \not\in K_t$ (indeed, if $x$ is not in the current request set, then the best way to end up there is to move to a point $y$ in $K_t$ and then move to $x$; if $K_t$ is a polytope, then when you move $x$ a little bit you don't move $y$, so the cost is changing at rate $1$, hence the norm of the gradient is $1$). Or to put it differently: $\|\nabla W_t(x)\| < 1 \Rightarrow x \in K_t$.
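In dimension one both properties are easy to see on a grid: the recursion $W_t(x) = \min_{y \in K_t} \{ W_{t-1}(y) + \|x - y\| \}$ follows directly from the definition, and the gradient property implies in particular that the minimizer of $W_t$ (where $0$ is a subgradient) must lie in $K_t$. A toy sketch, with interval requests of my own choosing:

```python
import numpy as np

# 1-D work function on a grid, for interval requests K_t = [lo, hi],
# starting from the origin: restrict the endpoint to K_t, then
# inf-convolve with |x - y| (two linear-time sweeps).
xs = np.linspace(-5, 5, 2001)
dx = xs[1] - xs[0]
W = np.abs(xs)                        # W_0(x) = |x|: cost to end at x with no request yet

for lo, hi in [(-2.0, 1.0), (0.5, 3.0), (-4.0, -1.0)]:
    prev = np.where((xs >= lo) & (xs <= hi), W, np.inf)
    for i in range(1, len(prev)):                        # left-to-right sweep
        prev[i] = min(prev[i], prev[i - 1] + dx)
    for i in range(len(prev) - 2, -1, -1):               # right-to-left sweep
        prev[i] = min(prev[i], prev[i + 1] + dx)
    W = prev
    m = xs[np.argmin(W)]              # 0 is a subgradient here, so this point must be in K_t
    print(f"K_t = [{lo:+.1f}, {hi:+.1f}]: convex = {np.all(np.diff(W, 2) > -1e-9)}, "
          f"argmin = {m:+.2f}, in K_t = {lo - 1e-9 <= m <= hi + 1e-9}")
```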
The Sellke point
Mark’s beautiful idea (and the CMU team's very much related, in fact equivalent, idea) to generalize the Steiner point to the non-nested case is to use the work function as a surrogate for the request. Taking the Steiner point of $K_t$ will clearly not work, since not all of $K_t$ matters in the same way: some points might be essentially irrelevant because they are very far from the previous requests, a fact which will be uncovered by the work function (while on the other hand the random direction from the Steiner point definition is oblivious to the geometry of the previous requests). So how to pick an appropriately random point in $K_t$ while respecting the work function structure? Well, we just saw that $\|\nabla W_t(x)\| < 1 \Rightarrow x \in K_t$. So how about taking a random direction $\theta$ with $\|\theta\| < 1$, and applying the inverse of the gradient map, namely the gradient of the Fenchel dual $\nabla W_t^*$? Recall that the Fenchel dual is defined by:
\[ W_t^*(\theta) = \max_{x \in \mathbb{R}^d} \theta \cdot x - W_t(x) \,. \]
This is exactly the algorithm Mark proposes (he calls it the *functional Steiner point*):
\[ \mathrm{Se}(K_t) = \mathbb{E}_{\theta : \|\theta\| < 1} [\nabla W_t^*(\theta)] \,. \]
Crucially this is a valid point, namely $\mathrm{Se}(K_t) \in K_t$ (indeed $\nabla W_t^*(\theta)$ is a point where the gradient of $W_t$ equals $\theta$ with $\|\theta\| < 1$, hence a point of $K_t$ by the property above, and $K_t$ is convex). Moreover, just like for the Steiner point, we can apply the divergence theorem and obtain:
\[ \mathrm{Se}(K_t) = d \cdot \mathbb{E}_{\theta : \|\theta\| = 1} [\theta W_t^*(\theta)] \,. \]
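In dimension one the unit sphere is just $\{-1, +1\}$, so the formula reads $\mathrm{Se}(K_t) = \tfrac{1}{2}(W_t^*(1) - W_t^*(-1))$, which we can evaluate on a grid (same toy requests as in the work function sketch above; the small tolerance in the membership test accounts for discretization):

```python
import numpy as np

# In d = 1: Se(K_t) = (W_t^*(1) - W_t^*(-1)) / 2. Brute-force O(n^2)
# work function recursion on a grid, then the Fenchel dual at +1 and -1.
xs = np.linspace(-5, 5, 2001)
W = np.abs(xs)                                   # start at the origin
for lo, hi in [(-2.0, 1.0), (0.5, 3.0), (-4.0, -1.0)]:
    prev = np.where((xs >= lo) & (xs <= hi), W, np.inf)
    W = (prev[None, :] + np.abs(xs[:, None] - xs[None, :])).min(axis=1)
    se = (np.max(xs - W) - np.max(-xs - W)) / 2  # W^*(1) and W^*(-1) on the grid
    inside = lo - 1e-9 <= se <= hi + 1e-9
    print(f"K_t = [{lo:+.1f}, {hi:+.1f}]: Se = {se:+.3f}, Se in K_t: {inside}")
```

Note how the point tracks the requests while staying lazy: it only moves to the boundary of the new request when it has to.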
The beautiful thing is that we can now exactly repeat the nested convex body argument, since $W_{t+1}^*(\theta) \leq W_t^*(\theta)$ for all $\theta$ (just like we had $h_{K_{t+1}}(\theta) \leq h_{K_t}(\theta)$ in the nested case), and so we get:
\[ \sum_{t \geq 0} \|\mathrm{Se}(K_t) - \mathrm{Se}(K_{t+1})\| \leq d \cdot \mathbb{E}_{\theta : \|\theta\| = 1} [ W_{0}^*(\theta) - W^*_{T}(\theta) ] \,. \]
The first term with $W_0^*$ is just some additive constant which we can ignore, while the second term is bounded as follows:
\begin{align*} \mathbb{E}_{\theta : \|\theta\| = 1} [ - W^*_{T}(\theta) ] & = \mathbb{E}_{\theta : \|\theta\| = 1} \Big[ \min_{x \in \mathbb{R}^d} W_T(x) - \theta \cdot x \Big] \\ & \leq \min_{x \in \mathbb{R}^d} \mathbb{E}_{\theta : \|\theta\| = 1}[ W_T(x) - \theta \cdot x ] \\ & = \min_{x \in \mathbb{R}^d} W_T(x) \,. \end{align*}
(The last equality uses $\mathbb{E}_{\theta : \|\theta\|=1}[\theta] = 0$.) Thus we have proved that the movement of the Sellke point is upper bounded by a constant plus $d$ times the minimum of the work function, and the latter is nothing but the value of the oracle optimum!
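As a final sanity check, here is the full one-dimensional algorithm on random, non-nested interval requests (toy code of mine). In $d = 1$, starting from the origin, the additive constant $\mathbb{E}[W_0^*]$ vanishes and the bound says that the movement is at most the oracle optimum $\min_x W_T(x)$:

```python
import numpy as np

# Functional Steiner point on random non-nested intervals: compare its
# total movement to the oracle optimum min_x W_T(x).
rng = np.random.default_rng(2)
xs = np.linspace(-20, 20, 2001)
W = np.abs(xs)                                   # start at the origin
pos, movement = 0.0, 0.0

for _ in range(30):
    c = rng.uniform(-10, 10)                     # a random interval request
    lo, hi = c - rng.uniform(0.5, 3), c + rng.uniform(0.5, 3)
    prev = np.where((xs >= lo) & (xs <= hi), W, np.inf)
    W = (prev[None, :] + np.abs(xs[:, None] - xs[None, :])).min(axis=1)
    se = (np.max(xs - W) - np.max(-xs - W)) / 2  # the functional Steiner point
    movement += abs(se - pos)
    pos = se

print(f"functional Steiner movement: {movement:.2f}")
print(f"oracle optimum min_x W_T(x): {W.min():.2f}")  # movement <= this, up to grid error
```

(In one dimension the two numbers essentially coincide; the factor $d$ in the guarantee only starts to bite in higher dimension.)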
Note that once everything is said and done, the proof has only two inequalities (the triangle inequality as in the nested case, and the expectation of the minimum is less than the minimum of the expectation). It doesn't get any better than this!