k-server, part 3: entropy regularization for weighted k-paging

If you have been following the first two posts (post 1, post 2), now is the time to reap the rewards! I will show here how to obtain an $O(\log(k))$-competitive algorithm for (weighted) paging, i.e., when the metric space corresponds to the leaves of a weighted star. This was viewed as a breakthrough result 10 years ago (with a JACM publication by Bansal, Buchbinder and Naor in 2012), and for good reason: this simplest instance of $k$-server was in fact the one studied in the seminal 1985 paper by Sleator and Tarjan, which introduced the competitive analysis of online algorithms (to be precise, Sleator and Tarjan considered the unweighted case, for which an $O(\log(k))$-competitive algorithm was known long before).

State space for weighted $k$-paging

Let $w_i$ be the weight of the edge from leaf $i \in [n]$ to the root. Recall from the previous post that we want to find a norm and a convex body that can express the Wasserstein-$1$ distance between two fractional $k$-server configurations.

Consider a fractional move from $z \in \Delta_n(k)$ to $z + \xi$. Clearly a mass of $|\xi_i|$ has to cross the edge $(i, \mathrm{root})$, so the Wasserstein-$1$ distance is at least $\sum_i w_i |\xi_i|$. Furthermore there is trivially a transport plan whose cost is exactly that amount. In other words we just proved that in this case the appropriate norm is a weighted $\ell_1$ norm (namely $\|\xi\| := \sum_{i=1}^n w_i |\xi_i|$) and one can simply use the basic state space $K = \{x \in [0,1]^n : \sum_i x_i = n-k\}$ (recall from the previous post that we have to work with anticonfigurations, and that the mapping to a configuration is simply given by $z=1-x$).
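The identity between the transport cost on the star and the weighted $\ell_1$ norm can be checked numerically. The following is a minimal sketch, not part of the original argument: the function names `weighted_l1` and `star_transport_cost` are hypothetical, and the greedy pairing of sources and sinks is just one arbitrary transport plan (on a star every plan routes each unit of mass through the root, so the pairing does not affect the cost).

```python
def weighted_l1(xi, w):
    """Weighted l1 norm: sum_i w_i * |xi_i|."""
    return sum(wi * abs(v) for wi, v in zip(w, xi))


def star_transport_cost(xi, w, tol=1e-12):
    """Cost of a greedy transport plan realizing the move xi on a weighted star.

    xi must sum to zero. Mass leaves the leaves with xi_i < 0 and arrives at
    the leaves with xi_j > 0; moving one unit from leaf i to leaf j goes
    through the root and costs w_i + w_j.
    """
    sources = [[i, -v] for i, v in enumerate(xi) if v < 0]
    sinks = [[j, v] for j, v in enumerate(xi) if v > 0]
    cost, s, t = 0.0, 0, 0
    while s < len(sources) and t < len(sinks):
        i, j = sources[s][0], sinks[t][0]
        m = min(sources[s][1], sinks[t][1])  # mass moved from i to j
        cost += m * (w[i] + w[j])
        sources[s][1] -= m
        sinks[t][1] -= m
        if sources[s][1] <= tol:
            s += 1
        if sinks[t][1] <= tol:
            t += 1
    return cost


if __name__ == "__main__":
    xi = [-0.5, 0.25, 0.25, 0.0]   # illustrative move, sums to zero
    w = [1.0, 2.0, 4.0, 8.0]       # illustrative edge weights
    print(weighted_l1(xi, w), star_transport_cost(xi, w))
```

Both quantities agree on this example because each unit of mass pays $w_i$ when leaving leaf $i$ and $w_j$ when arriving at leaf $j$, regardless of how sources and sinks are paired.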

Applying the general mirror descent framework

Given a request at location $r \in [n]$ and a current anticonfiguration $x_0$, our proposed (fractional) algorithm is to run the mirror descent dynamics with a continuous-time linear cost $c(t) = e_r$ from $x(0)=x_0$ (i.e., $x'(t) = -(\nabla^2 \Phi(x(t)))^{-1} (e_r + \lambda(t))$ for some $\lambda(t) \in N_K(x(t))$) until the first time $t$ at which $x_r(t) = 0$ (i.e., one has a server at location $r$ in $z(t) := 1-x(t)$). By the lemma at the end of the previous post one has (writing $r(t)$ for the request being serviced at time $t$)

$\int x_{r(t)}(t) dt \leq \mathrm{Lip}_{\|\cdot\|}(\Phi) \times \mathrm{OPT} + O(1)$

One can think of $\int x_{r(t)}(t) dt$ as a “virtual service cost”. In $k$-server this quantity has no real meaning, but the above inequality shows that this quantity, which only depends on the algorithm, is tightly related to the value of OPT (provided that $\Phi$ has a small Lipschitz norm). Thus we see that we have two key desiderata for the choice of the mirror map $\Phi$: (i) it should have a small Lipschitz norm, and (ii) one should be able to relate the movement cost $\|x'(t)\| = \|(\nabla^2 \Phi(x(t)))^{-1} (e_r + \lambda(t))\|$ to this “virtual service cost” $x_{r}(t)$, say up to a multiplicative factor $\alpha$. One would then obtain an $\alpha \times \mathrm{Lip}_{\|\cdot\|}(\Phi)$-competitive algorithm.

Entropy regularization for weighted $k$-paging

Let us look at (ii); we shall see that the entropy regularization comes out very naturally. Ignore for a moment the Lagrange multiplier $\lambda(t)$ and let us search for a separable regularizer of the form $\Phi(x) = \sum_i \phi_i(x_i)$. We want to relate $x_r(t)$ to $w_r/\phi_r''(x_r(t)) (= \|(\nabla^2 \Phi(x(t)))^{-1} e_r\|)$. Making those two quantities equal gives $\phi_i(x_i) = w_i x_i \log(x_i)$ and thus the regularizer is a weighted negentropy: $\Phi(x) = \sum_i w_i x_i \log x_i$.
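For completeness, here is the short calculation behind that choice, written for a generic coordinate $i$ and a scalar variable $s \in (0,1]$ (the integration constants $c$ and $c'$ are notation introduced only for this remark). Requiring $w_i/\phi_i''(s) = s$ and integrating twice gives

$\begin{align*} & \phi_i''(s) = \frac{w_i}{s} \\ & \phi_i'(s) = w_i \log(s) + c \\ & \phi_i(s) = w_i \left( s \log(s) - s \right) + c s + c' \end{align*}$

The affine part $(c - w_i) s + c'$ does not change the Hessian, so it plays no role in the quantity $w_i/\phi_i''(s)$ considered here, and one can keep just $\phi_i(s) = w_i s \log(s)$.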

We now need to verify that this relation between the virtual service cost and the movement cost remains true even when the Lagrange multiplier $\lambda(t)$ is taken into account. Note that because of the form of the state space $K$ the multiplier contains a term of the form $(\mu(t), \mu(t), \ldots, \mu(t))$ (which corresponds to the constraint $\sum_{i=1}^n x_i = n-k$) and for each location a term forcing the derivative to be $0$ if the value of the missing mass has reached $1$. In other words we obtain the following dynamics for mirror descent with the weighted negentropy regularization:

$\begin{align*} & x_r'(t) = \frac{x_r(t)}{w_r} (-1 + \mu(t)) \\ & x_i'(t) = \frac{x_i(t)}{w_i} \mu(t) 1\{x_i(t) < 1\} \ \text{for} \ i \neq r \end{align*}$

Notice that up to a factor $2$ one can focus on controlling $\|(x'(t))_-\|$ (that is, only movement into a location is charged). From that viewpoint the Lagrange multipliers simply have no effect: one has $0 \leq \mu(t) \leq 1$ (indeed recall that $\sum_{i=1}^n x_i'(t) = 0$, and the only negative contribution is the $-x_r(t)/w_r$ term), so $x_r$ is the only coordinate that can decrease and $\|(x'(t))_-\| = w_r |x_r'(t)| = x_r(t) (1 - \mu(t)) \leq x_r(t)$. Thus we see that the movement cost $\|(x'(t))_-\|$ is bounded by the virtual service cost $x_r(t)$ in this case.
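The dynamics above can be simulated to see the bound in action. This is only a rough sketch under several assumptions that are not in the post: time is discretized with a forward Euler step `dt`, the run stops once $x_r$ falls below a small threshold `delta` (with these multiplicative dynamics $x_r$ decays but does not hit $0$ in finite time), and $\mu(t)$ is computed by solving $\sum_i x_i'(t) = 0$ over the coordinates that are allowed to move. The function name `simulate_request` and all numerical values are hypothetical.

```python
def simulate_request(x0, w, r, delta=0.05, dt=1e-3, max_steps=10**6):
    """Euler simulation of the weighted-negentropy dynamics for one request.

    x0: anticonfiguration (entries in [0, 1]), w: edge weights, r: requested leaf.
    Returns the final anticonfiguration, the accumulated virtual service cost
    (integral of x_r dt) and the accumulated movement cost ||(x')_-||.
    """
    x = list(x0)
    n = len(x)
    service, movement = 0.0, 0.0
    for _ in range(max_steps):
        if x[r] <= delta:
            break
        # Coordinates that may move: the request, and leaves not saturated at 1.
        active = [i for i in range(n) if i == r or x[i] < 1.0]
        # mu is chosen so that the velocities sum to zero (mass conservation).
        mu = (x[r] / w[r]) / sum(x[i] / w[i] for i in active)
        v = [0.0] * n
        for i in active:
            v[i] = (x[i] / w[i]) * mu
        v[r] -= x[r] / w[r]
        service += x[r] * dt
        # Only decreasing coordinates are charged, weighted by w_i.
        movement += sum(w[i] * max(-v[i], 0.0) for i in range(n)) * dt
        x = [min(1.0, x[i] + dt * v[i]) for i in range(n)]
    return x, service, movement


if __name__ == "__main__":
    # n = 4 leaves, k = 2 servers, so the anticonfiguration sums to n - k = 2.
    x, service, movement = simulate_request(
        [0.9, 0.6, 0.3, 0.2], [1.0, 2.0, 1.0, 3.0], r=0
    )
    print(x, service, movement)
```

In this discretization each step charges $x_r (1-\mu)\,dt$ of movement against $x_r\,dt$ of virtual service cost, so the accumulated movement never exceeds the accumulated service cost, mirroring the continuous-time argument.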

Making $\Phi$ Lipschitz

It remains to deal with a non-trivial issue, namely the entropy is not Lipschitz on the simplex! A similar issue is faced in online learning when one tries to prove tracking expert regret bounds, i.e., bounds with respect to a shifting expert. The standard solution (perhaps first used by Herbster and Warmuth in 98, see also Blum and Burch 00) is to shift the variables so that they never get below some $\delta>0$, in which case the Lipschitz constant is $O(\log(1/\delta))$. In the $k$-server scenario one can stop the dynamics when $x_{r(t)}(t)=\delta$ (instead of $x_{r(t)}(t)=0$) provided that the mapping from $x$ to the actual fractional configuration $z$ is now given by $z=\frac{1-x}{1-\delta}$. This raises a final issue: the total amount of server mass in such a $z$ is $\frac{k}{1-\delta} > k$. Next we show that if $\delta$ is small enough then $z$ can be “rounded” online to a fractional $k$-server configuration at the expense of a multiplicative $O(1)$ movement. Precisely we show that $\delta<1/(2k)$ is sufficient, which in turn gives a final competitive ratio of $O(\log(k))$ for weighted $k$-paging.
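To make the two numerical claims of this paragraph concrete, here is a small sketch (the function names and the example values are hypothetical). It uses the fact that the dual of the weighted $\ell_1$ norm is $\max_i |g_i|/w_i$, so for the weighted negentropy, whose gradient is $w_i (1 + \log x_i)$, the weights cancel and the Lipschitz constant on $\{x : x_i \geq \delta\}$ is at most $1 + \log(1/\delta)$. It also checks that the rescaled configuration $z = \frac{1-x}{1-\delta}$ has total mass $\frac{k}{1-\delta}$.

```python
import math


def dual_norm_gradient(x, w):
    """Dual norm max_i |dPhi/dx_i| / w_i for Phi(x) = sum_i w_i x_i log x_i."""
    return max(abs(wi * (1.0 + math.log(xi))) / wi for xi, wi in zip(x, w))


def to_configuration(x, delta):
    """Map an anticonfiguration x (entries in [delta, 1]) to z = (1 - x) / (1 - delta)."""
    return [(1.0 - xi) / (1.0 - delta) for xi in x]


if __name__ == "__main__":
    k, delta = 2, 0.25               # illustrative values, delta = 1 / (2k)
    x = [0.25, 0.25, 1.0, 0.5]       # n = 4, sums to n - k = 2, entries >= delta
    w = [1.0, 2.0, 4.0, 8.0]
    print(dual_norm_gradient(x, w), 1.0 + math.log(1.0 / delta))
    print(sum(to_configuration(x, delta)), k / (1.0 - delta))
```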

From $k+\epsilon$ servers to $k$ servers

Consider a fractional server configuration $z$ with total mass $k+\epsilon$ (i.e., $\sum_{i=1}^n z_i = k+\epsilon$), which we want to transform into a server configuration $\sigma(z)$ with total mass $k$. A natural way to “round down” is to put $0$ at each location whose mass is at most $\epsilon$. Furthermore a mass of $1$ should stay $1$. This suggests the mapping $\sigma(z_i) = \frac{z_i-\epsilon}{1-\epsilon} 1\{z_i > \epsilon\}$, which is $\frac{1}{1-\epsilon}$-Lipschitz. Thus the movement of $\sigma(z(t))$ is controlled by the movement of $z(t)$ up to a multiplicative factor $\frac{1}{1-\epsilon}$. Moreover one clearly has $\sum_{i=1}^n \sigma(z_i) \leq k$ (in fact the inequality can be strict, in which case one should think of the “lost mass” as being stored at the root; this incurs no additional movement cost).
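As a quick sanity check of the rounding map, the sketch below applies $\sigma$ coordinate-wise to an example with $k = 2$ and $\epsilon = 0.2$ (the function names `sigma` and `round_down` and the example values are illustrative choices, not taken from the post).

```python
def sigma(zi, eps):
    """Coordinate-wise rounding: 0 if zi <= eps, else (zi - eps) / (1 - eps)."""
    return (zi - eps) / (1.0 - eps) if zi > eps else 0.0


def round_down(z, eps):
    """Apply sigma to every coordinate of a configuration z with mass k + eps."""
    return [sigma(zi, eps) for zi in z]


if __name__ == "__main__":
    k, eps = 2, 0.2
    z = [1.0, 0.7, 0.4, 0.1]          # total mass k + eps = 2.2
    rounded = round_down(z, eps)
    # A full server stays a full server, and the rounded mass is at most k.
    print(rounded, sum(rounded))
```

On this example the rounded mass is strictly below $k$, which is the case where the remaining mass is thought of as sitting at the root.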

10 Responses to "k-server, part 3: entropy regularization for weighted k-paging"

By Ziye December 16, 2018 - 8:50 am

Hi Sebastien, I have a question about the Lagrange multiplier.

The mathematical characterization says the multipliers look like $yA$ where $y$ satisfies the complementary slackness condition for the constraints. In your post, $\mu(t)$ is for the equality constraint. For the set of upper bound constraints $x_i\leq 1$, when one holds with equality, there exists a corresponding non-negative scalar as the multiplier.

From the above point of view, I understand that $x_r'(t)$ looks the way you wrote it. But why does $x_i'(t)=0$ when the missing amount in position $i$ is equal to one? (i.e. I don’t understand this sentence ‘for each location a term forcing the derivative to be 0 if the value of the missing mass has reached 1’)

Thank you very much,
- By Sebastien Bubeck January 9, 2019 - 1:58 pm
  
  Hi Ziye. Actually what I wrote is not quite correct, thanks for catching this. The Lagrange multiplier simply ensures that $x_i'(t) \leq 0$ (rather than equal to $0$ as I wrote) when the missing mass is equal to $1$.
By Anonymous August 28, 2018 - 4:38 pm

Why is that “Notice that up to a factor 2 one can focus on controlling $\|(x'(t))_-\|$”?
- By Sebastien Bubeck August 28, 2018 - 5:05 pm
  
  It’s because all the mass that comes in eventually has to come out (unless it stays there forever and then that’s just an additive constant).
By Anonymous January 31, 2018 - 2:59 pm

Do you think the primal-dual framework for algorithm design for solving exploration-exploitation settings has not been used much in machine learning literature?
- By Sebastien Bubeck February 3, 2018 - 12:58 am
  
  I think that’s true yes (it has been used, but not much by core ML researchers).
By Anonymous January 29, 2018 - 5:55 pm

Why is $\int x_{r(t)}(t) dt$ called the “virtual service cost”? It equals $\int dt$, which seems to be the actual service cost.
- By Sebastien Bubeck January 30, 2018 - 8:01 am
  
  In the k-server problem there is no real service cost, only movement cost. The “virtual service cost” is merely a quantity that naturally relates to OPT, and which in turn can be related to the movement cost. The name “virtual service cost” comes from the analogy with MTS.
