If you have been following the first two posts (post 1, post 2), now is time to reap the rewards! I will show here how to obtain a -competitive algorithm for (weighted) paging, i.e., when the metric space corresponds to the leafs of a weighted star. This was viewed as a breakthrough result 10 years ago (with a JACM publication by Bansal, Buchbinder and Naor in 2012), and for good reasons as this simplest instance of -server was in fact the one studied in the seminal paper by Sleator and Tarjan in 1985 which introduced the competitive analysis of online algorithms (actually to be precise Sleator and Tarjan considered the unweighted case, for which a algorithm was known much before).

**State space for weighted -paging**

Let be the weight of the edge from leaf to the root. Recall from the previous post that we want to find a norm and a convex body that can express the Wasserstein- distance between two fractional -server configurations.

Consider a fractional move from to . Then clearly, the amount of mass has to transfer through the edge , so that the Wasserstein- distance is at least . Furthermore there is trivially a transport plan achieving that much total mass transfer. In other words we just proved that in this case the appropriate norm is a weighted norm (namely ) and one can simply use the basic state space (recall from the previous post that we have to work with anticonfiguration, and that the mapping to a configuration is simply given by ).

**Applying the general mirror descent framework**

Given a request at location and a current anticonfiguration , our proposed (fractional) algorithm is to run the mirror descent dynamics with a continuous time linear cost from (i.e., for some ) and until the first time at which (i.e., one has a server at location in ). By the lemma at the end of the previous post one has (denote for the request being serviced at time )

One can think of as a “virtual service cost”. In -server this quantity has no real meaning, but the above inequality shows that this quantity, which only depends on the algorithm, is tightly related to the value of OPT (provided that has a small Lipschitz norm). Thus we see that we have two key desiderata for the choice of the mirror map : (i) it should have a small Lipschitz norm, (ii) one should be able to relate the movement cost to this “virtual service cost” , say up to a multiplicative factor . One would then obtain a -competitive algorithm.

**Entropy regularization for weighted -paging**

Let us look at (ii), and we shall see that the entropy regularization comes out very naturally. Ignore for a moment the Lagrange multiplier and let us search for a separable regularizer of the form . We want to relate to . Making those two quantities equal gives and thus the regularizer is a *weighted negentropy*: .

We now need to verify that this relation between the virtual service cost and the movement cost remains true even when the Lagrange mutilplier is taken into account. Note that because of the form of the state space the multiplier contains a term of the form (which corresponds to the constraint ) and for each location a term forcing the derivative to be if the value of the missing mass has reached . In other words we obtain the following dynamics for mirror descent with the weighted negentropy regularization:

Notice that up to a factor one can focus on controlling (that is only movement *into* a location is charged). In that view the Lagrange multipliers simply have no effect, since one has (indeed recall that ). Thus we see that the movement cost is exactly bounded by the virtual service cost in this case.

**Making Lipschitz**

It remains to deal with a non-trivial issue, namely the entropy is not Lipschitz on the simplex! A similar issue is faced in online learning when one tries to prove *tracking expert regret bounds*, i.e., bounds with respect to a shifting expert. The standard solution (perhaps first used by Herbster and Warmuth in 98, see also Blum and Burch 00) is to shift the variables so that they never get below some , in which case the Lipschitz constant would be . In the -server scenario one can stop the dynamics when (instead of ) provided that the mapping from to the actual fractional configuration is now given by . This raises a final issue: the total amount of server mass in such a is . Next we show that if is small enough then can be “rounded” online to a fractional -server configuration at the expense of a multiplicative movement. Precisely we show that is sufficient, which in turns gives a final competitive ratio of for weighted -paging.

**From servers to servers**

Consider a fractional server configuration with total mass (i.e., ), which we want to transform into a server configuration with total mass . A natural way to “round down” is at each location to put if the mass was . Furthermore a mass of should stay . This suggests the mapping , which is -Lipschitz. Thus the movement of is controlled by the movement of up to a multiplicative factor . Moreover one clearly has (in fact the inequality can be strict, in which case one should think of the “lost mass” as being stored at the root, this incurs no additional movement cost).

## By Beylikdüzü Evden Eve Nakliyat January 17, 2020 - 5:58 am

“Notice that up to a factor 2 one can focus on controlling \|(x'(t))_-\| “?

## By Ziye December 16, 2018 - 8:50 am

Hi Sebastien, I have a question about the lagrangian multiplier.

The math characterization say the multipliers look like $yA$ where $y$ satisfies the complementary slackness condition to the constraints. In your post, $\mu(t)$ is for the equality constraint. For the set of upper bound constraint $x_i\leq 1$, when it holds as an equality, there exists a corresponding non-negative scalar as the multiplier.

From the above point of view, I understand that x’_r(t) looks the way you wrote it. But why does $x_i'(t)=0$ when the missing amount in position $i$ is equal to one? (i.e. I don’t understand this sentence ‘for each location a term forcing the derivative to be 0 if the value of the missing mass has reached 1’)

Thank you very much,

## By Sebastien Bubeck January 9, 2019 - 1:58 pm

Hi Ziye. Actually what I wrote is not quite correct, thanks for catching this. The Lagrange multiplier simply ensures that $x_i'(t) <=0$ (rather than equal to $0$ as I wrote) when the missing mass is equal to $1$.

## By Anonymous August 28, 2018 - 4:38 pm

Why is that “Notice that up to a factor 2 one can focus on controlling \|(x'(t))_-\| “?

## By Sebastien Bubeck August 28, 2018 - 5:05 pm

It’s because all the mass that come in eventually has to come out (unless it stays there forever and then that’s just an additive constant).

## By Anonymous January 31, 2018 - 2:59 pm

Do you think the primal-dual framework for algorithm design for solving exploration-exploitation settings has not been used much in machine learning literature?

## By Sebastien Bubeck February 3, 2018 - 12:58 am

I think that’s true yes (it has been used, but not much by core ML researchers).

## By Anonymous January 29, 2018 - 5:55 pm

Why \int x_{r(t)}(t)dt is called the “virtual service cost”? It equals to \int dt, which seems to be the actual service cost.

## By Sebastien Bubeck January 30, 2018 - 8:01 am

In the k-server problem there is no real service cost, only movement cost. The “virtual service cost” is merely a quantity that naturally relates to OPT, and which in turn can be related to the movement cost. The name “virtual service cost” comes from the analogy with MTS.