k-server, part 2: continuous time mirror descent

We continue our k-server series (see post 1 here). In this post we briefly discuss the concept of a fractional solution for k-server, which by analogy with MTS will in fact be a fractional “anti-solution”. Then we introduce the continuous time version of MTS and explain how it applies to k-server. Finally, the most important part of the post is the description of the basic potential based analysis of mirror descent and how to interpret it in the context of MTS.


State representation for MTS

Recall the MTS problem: one maintains a random state \rho_t \in [n] (where [n] is equipped with a distance d), and given a new cost function c_{t+1} : [n] \rightarrow \mathbb{R}_+, one can update the state to \rho_{t+1} with the corresponding price to pay: c_{t+1}(\rho_{t+1}) + d(\rho_t, \rho_{t+1}). Equivalently one can maintain the probability distribution p_t \in \Delta_n := \{x \in \mathbb{R}_+^n : \sum_{i=1}^n x_i = 1\} of \rho_t: indeed given \rho_t, p_t, p_{t+1} one can obtain \rho_{t+1} by optimal coupling, that is the pair of random variables (\rho_t, \rho_{t+1}) minimizes the expected distance \mathbb{E} \, d(Y,Z) over all pairs (Y,Z) such that Y has marginal p_t and Z has marginal p_{t+1} (the latter minimal quantity is called the Wasserstein-1 distance between p_t and p_{t+1}). In this view the (expected) service cost is now a linear function, that is c_t \cdot p_t where c_t \in \mathbb{R}_+^n, and the movement cost between p_t and p_{t+1} is the Wasserstein-1 distance.
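On a finite metric space the Wasserstein-1 distance is a small linear program, and for points on a line it even has a closed form: integrate the absolute difference of the two cumulative distribution functions. A minimal sketch of the line case (the function name and setup are illustrative, not from the post; general metrics need an optimal transport solver):

```python
# Wasserstein-1 distance between two distributions supported on points of a
# line metric, d(i, j) = |pos[i] - pos[j]|: the cost is the total mass that
# must cross each gap, times the length of that gap.

def wasserstein_1_line(p, q, pos):
    """W1 between distributions p, q on sorted positions pos."""
    assert len(p) == len(q) == len(pos)
    w1, cum = 0.0, 0.0
    for i in range(len(pos) - 1):
        cum += p[i] - q[i]                    # net mass crossing gap i
        w1 += abs(cum) * (pos[i + 1] - pos[i])
    return w1

# Moving all mass one unit to the right costs exactly 1:
print(wasserstein_1_line([1.0, 0.0], [0.0, 1.0], [0.0, 1.0]))  # 1.0
```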

We will further assume the existence of a convex body K and a norm \|\cdot\| in \mathbb{R}^N (N \geq n) such that the Wasserstein-1 distance between two distributions x,y \in \Delta_n is equal to \|\hat{x} - \hat{y}\| where \hat{x}, \hat{y} \in K are “expanded states” with x=(\hat{x}_1,\hdots, \hat{x}_n) and y=(\hat{y}_1, \hdots, \hat{y}_n). For a weighted star metric it suffices to take K = \Delta_n, but we will see in the fourth post that for trees one indeed needs to expand the state space.

Fractional solution for k-server

Recall the k-server problem: one maintains a random configuration \rho_t \in [n]^k, and given a new request r_{t+1} \in [n] one must update the configuration to \rho_{t+1} such that r_{t+1} \in \{\rho_{t+1}(1), \hdots, \rho_{t+1}(k)\}. One could equivalently maintain a distribution p_t \in \Delta_{n^k} as before. In the fractional problem one in fact only maintains the 1^{st} moment of this distribution, z_t \in \Delta_n(k) := \{z \in [0,1]^n : \sum_{i=1}^n z_i = k\}, defined by z_t(i) = \sum_{\rho \in [n]^k : i \in \{\rho(1),\hdots, \rho(k)\}} p_t(\rho) (in particular servicing a request at location i means that one must have z_t(i) = 1). Again the metric here on the z variables is the Wasserstein distance induced by d (a.k.a., the optimal transport distance). Importantly we note that this view is not equivalent to maintaining a full distribution (indeed a lot of information is lost by recording only the first moment). This raises the subtle issue of whether one can round online a fractional solution to a proper integral solution whose total movement is of the same order of magnitude. We will not touch upon this question here, and we focus on the fractional k-server problem, see for example Section 5.2 here for more. We note however that the existence of a polynomial time algorithm for this rounding task is an open problem.
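Concretely, collapsing a distribution over configurations to its first moment can be sketched as follows (the names and the uniform example are illustrative assumptions, treating configurations as k-subsets of [n]):

```python
from itertools import combinations

# Collapse a distribution p over k-server configurations (k-subsets of the
# n locations) to its first moment z, with z(i) = total probability that
# some server sits at location i.

def first_moment(p, n):
    """p maps each configuration (frozenset of locations) to its probability."""
    z = [0.0] * n
    for config, prob in p.items():
        for i in config:
            z[i] += prob
    return z

# Uniform distribution over all 2-subsets of {0, 1, 2}:
n, k = 3, 2
configs = [frozenset(c) for c in combinations(range(n), k)]
p = {c: 1.0 / len(configs) for c in configs}
z = first_moment(p, n)
print(z)  # each z(i) = 2/3, and sum(z) = k = 2
```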

To think of the request as a linear cost function (like in MTS) it is more appropriate to work with the anticonfiguration x=1-z. In this view a request r \in [n] is equivalent to a cost vector c = +\infty \times e_r. Finally like in MTS we will assume the existence of an expanded state space K \subset \mathbb{R}^N and a norm \|\cdot\| that measures movement in this expanded view.

Continuous time decision making

We will now move to a continuous time setting, where the (discrete time) sequence of cost vectors is revealed continuously as a path (c(t))_{t \in \mathbb{R}_+}, with c(t) \in \mathbb{R}_+^N (and c_i(t) \neq 0 \Rightarrow i \leq n). The decision maker’s response is a path (x(t))_{t \in \mathbb{R}_+} that lives in K \subset \mathbb{R}^N. In this setting the service cost of the algorithm is \int_{\mathbb{R}_+} c(t) \cdot x(t) dt and the movement cost is equal to \int_{\mathbb{R}_+} \|x'(t)\| dt where x'(t) is the time derivative of the path x(t). We note that there is a small subtlety here in translating the continuous time service cost into a meaningful discrete time service cost, but we will not worry about this here since it does not affect the argument for k-server (where there is only a movement cost). If you are curious see the appendix here.

For k-server we will use c(t) = e_{r(t)} where r(t) is the currently requested location, and we move to the next request at the first time such that x_{r(t)}(t) = 0 (which means that z=1-x satisfies z_{r(t)}(t)=1, i.e., there is a server at the requested location).

Mirror descent

If you have never seen the mirror descent framework now is a good time to take a quick look here.

Very succinctly, mirror descent with mirror map \Phi : K \rightarrow \mathbb{R} can be written as follows, with a step-size \eta:

    \[ x(t+\eta) = \mathrm{argmin}_{x \in K} D_{\Phi}(x, \nabla \Phi^*(\nabla \Phi(x(t)) - \eta c(t))) \]

where we recall that \Phi^* is the Legendre-Fenchel transform of \Phi (i.e., the map whose gradient is the inverse map of the gradient of \Phi) and D_{\Phi}(x,y) = \Phi(x) - \Phi(y) - \nabla \Phi(y) \cdot (x-y) is the Bregman divergence associated to \Phi.
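For a concrete instance of this update, take the negative-entropy mirror map \Phi(x) = \sum_i x_i \log x_i on K = \Delta_n: then \nabla \Phi^* is a coordinate-wise exponential and the Bregman projection onto the simplex is a simple renormalization, so the update becomes the familiar multiplicative weights rule. A minimal sketch under this choice of \Phi (which the post does not fix):

```python
import math

# Mirror descent step for the negative-entropy mirror map on the simplex:
# the two-stage update (gradient step in dual space, then Bregman projection)
# collapses to   x(t+eta)_i  proportional to  x(t)_i * exp(-eta * c_i(t)).

def md_entropy_step(x, c, eta):
    w = [xi * math.exp(-eta * ci) for xi, ci in zip(x, c)]
    s = sum(w)
    return [wi / s for wi in w]

x = [1 / 3, 1 / 3, 1 / 3]
x = md_entropy_step(x, c=[1.0, 0.0, 0.0], eta=1.0)
print(x)  # mass shifts away from the costly coordinate 0
```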

We now want to take \eta to 0 in the above definition to find the continuous time mirror descent update. For that let us recall the first order optimality condition for constrained optimization. Denote N_K(x) = \{\theta : \theta \cdot (y-x) \leq 0, \ \forall y \in K\} for the normal cone of K at x, i.e., the set of directions which are negatively correlated with any direction going into the body. One then has for any convex function f,

    \[ x^* \in \mathrm{argmin}_{x \in K} f(x) \Leftrightarrow -\nabla f(x^*) \in N_K(x^*) \]

In particular we see that (note that \nabla_x D_{\Phi}(x,y) = \nabla \Phi(x) - \nabla \Phi(y))

    \[ \nabla \Phi(x(t+\eta)) - \nabla \Phi(x(t)) + \eta c(t) \in - N_{K}(x(t+\eta)) \]

and thus taking \eta to 0 we morally get

    \[ \nabla^2 \Phi(x(t)) x'(t) \in - c(t) - N_{K}(x(t)) \]

This type of equation is known as a differential inclusion, and with the added constraint that the path must live in the constraint set K we get a viability problem. In our paper we show that a solution indeed exists (and is unique) under mild assumptions on \Phi.

Potential based analysis

The mirror descent algorithm is fully described by:

    \[ \nabla^2 \Phi(x(t)) x'(t) = - c(t) - \lambda(t), \; \lambda(t) \in N_{K}(x(t)) \]
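To get a feel for this equation, one can discretize it for a concrete mirror map. With the entropy \Phi(x) = \sum_i x_i \log x_i on the simplex one has \nabla^2 \Phi(x) = \mathrm{diag}(1/x_i), and in the interior of \Delta_n the normal cone is spanned by the all-ones vector; solving for \lambda(t) so that \sum_i x_i'(t) = 0 gives the replicator-type dynamics x_i' = x_i (c \cdot x - c_i). A sketch of an Euler discretization under these (illustrative) assumptions:

```python
# Euler discretization of continuous-time mirror descent for the entropy
# mirror map, in the interior of the simplex, where the dynamics reduce to
#   x_i'(t) = x_i(t) * (c(t) . x(t) - c_i(t)).

def euler_step(x, c, dt):
    cx = sum(ci * xi for ci, xi in zip(c, x))
    return [xi + dt * xi * (cx - ci) for xi, ci in zip(x, c)]

x = [1 / 3, 1 / 3, 1 / 3]
c = [1.0, 0.0, 0.0]              # cost concentrated on coordinate 0
for _ in range(1000):
    x = euler_step(x, c, dt=0.01)
print(x)  # x stays on the simplex and mass drains from coordinate 0
```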

Denoting \hat{D}_{\Phi}(y ; x) = - \Phi(x) - \nabla \Phi(x) \cdot (y -x ) we see that for any fixed y,

    \begin{eqnarray*} \partial_t \hat{D}_{\Phi}(y ; x(t)) & = & - \nabla^2 \Phi(x(t)) x'(t) \cdot (y - x(t)) \\ & = & (c(t) + \lambda(t)) \cdot (y - x(t)) \\ & \leq & c(t) \cdot (y - x(t)) \end{eqnarray*}

The above calculation is the key to understanding mirror descent: it says that if the algorithm is currently paying more than y, i.e., c(t) \cdot (y - x(t)) < 0, then it is actually getting closer to y in the sense that \hat{D}_{\Phi}(y ; x(t)) is decreasing. Put differently: when the algorithm pays, it also learns. This key insight is sufficient for online learning, where one competes against a fixed point y. However in MTS and k-server we compete against a path y(t), and thus we also need to evaluate by how much the Bregman divergence can go up when y is moving. This is captured by the following calculation:

    \begin{eqnarray*} \partial_t \hat{D}_{\Phi}(y(t) ; x) & = & - \nabla \Phi(x) \cdot y'(t) \\ & \leq & \mathrm{Lip}_{\|\cdot\|}(\Phi) \times \|y'(t)\| \end{eqnarray*}
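Both calculations are easy to sanity-check numerically. Here is a sketch for the first one, using the entropy mirror map on the interior of the simplex (an illustrative choice, not fixed by the post; there the normal-cone term \lambda(t) is a multiple of the all-ones vector, so \lambda(t) \cdot (y - x(t)) = 0 and the inequality is in fact an equality):

```python
import math

# Check that along the entropy dynamics x_i' = x_i * (c . x - c_i) the
# potential  Dhat(y; x) = -Phi(x) - <grad Phi(x), y - x>,  with
# Phi(x) = sum_i x_i log x_i, satisfies  d/dt Dhat(y; x(t)) = c . (y - x(t))
# in the interior of the simplex.

def dhat(y, x):
    phi = sum(xi * math.log(xi) for xi in x)
    grad = [1.0 + math.log(xi) for xi in x]
    return -phi - sum(g * (yi - xi) for g, yi, xi in zip(grad, y, x))

def euler_step(x, c, dt):
    cx = sum(ci * xi for ci, xi in zip(c, x))
    return [xi + dt * xi * (cx - ci) for xi, ci in zip(x, c)]

x, y, c, dt = [0.5, 0.3, 0.2], [0.2, 0.2, 0.6], [1.0, 0.0, 0.0], 1e-4
x2 = euler_step(x, c, dt)
lhs = (dhat(y, x2) - dhat(y, x)) / dt                  # finite difference
rhs = sum(ci * (yi - xi) for ci, yi, xi in zip(c, y, x))
print(lhs, rhs)  # the two agree up to O(dt); both are negative here, so the
                 # algorithm pays more than y and the potential decreases
```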

Putting together the two above calculations we obtain the following control on the service cost of mirror descent in terms of the service cost and movement cost of the optimal path:

Lemma: The mirror descent path (x(t))_{t \geq 0} satisfies for any comparator path (y(t))_{t \geq 0},

    \[ \int c(t) \cdot x(t) dt \leq \mathrm{Lip}_{\|\cdot\|}(\Phi) \int \|y'(t)\| dt + \int c(t) \cdot y(t) dt + O(1) \]

At this point the big question is: how do we control the movement of mirror descent? In the next post we will see how this plays out on a weighted star.

