Nemirovski’s acceleration

I will describe here the very first (to my knowledge) acceleration algorithm for smooth convex optimization, which is due to Arkadi Nemirovski (dating back to the end of the 70’s). The algorithm relies on a 2-dimensional plane-search subroutine (which, in theory, can be implemented in \log(1/\epsilon) calls to a first-order oracle). He later improved it to only require a 1-dimensional line-search in 1981, but of course the breakthrough that everyone knows about came a year after with the famous 1982 paper by Nesterov that gets rid of this extraneous logarithmic term altogether (and in addition is based on the deep insight of modifying Polyak’s momentum).

Let f be a 1-smooth function. Denote x^{+} = x - \nabla f(x). Fix a sequence (\lambda_t)_{t \in \N}, to be optimized later. We consider the “conjugate” point \sum_{s =1}^t \lambda_s \nabla f(x_s). The algorithm simply returns the optimal combination of the conjugate point and the gradient descent point, that is:

    \[ x_{t+1} = \mathrm{argmin}_{x \in P_t} f(x) \, \text{where} \, P_t = \mathrm{span}\left(x_t^+, \sum_{s =1}^t \lambda_s \nabla f(x_s)\right) \,. \]

Let us denote g_s = \nabla f(x_s) and \delta_s = f(x_s) - f(x^*) for shorthand. The key point is that g_{t+1} \in P_t^{\perp}, and in particular \|\sum_{s \leq t} \lambda_s g_s\|^2 = \sum_{s \leq t} \lambda_s^2 \|g_s\|^2. Now recognize that \|g_s\|^2 is a lower bound on the improvement \delta_s - \delta_{s+1} (here we use that x_{s+1} is better than x_s^+). Thus we get:

    \[ \|\sum_{s \leq t} \lambda_s g_s\|^2 \leq \sum_{s \leq t} \lambda_s^2 (\delta_s - \delta_{s+1}) \leq \sum_{s \leq t} \delta_s (\lambda_s^2 - \lambda_{s-1}^2) \,. \]

In other words if the sequence \lambda is chosen such that \lambda_s = \lambda_s^2 - \lambda_{s-1}^2 then we get

    \[ \|\sum_{s \leq t} \lambda_s g_s\|^2 \leq \sum_{s \leq t} \lambda_s \delta_s \,. \]

This is good because roughly the reverse inequality also holds true by convexity (and the fact that x_s \in P_s so g_s \cdot x_s = 0):

    \[ \sum_{s \leq t} \lambda_s \delta_s \leq \sum_{s \leq t} \lambda_s g_s \cdot (x_s - x^*) \leq \|x^*\| \cdot \| \sum_{s \leq t} \lambda_s g_s\| \,. \]

So finally we get \sum_{s \leq t} \lambda_s \delta_s \leq \|x^*\|^2, and it just remains to realize that \lambda_s is of order s so that \delta_t \leq \|x^*\|^2 / t^2.

Posted in Optimization | 17 Comments

A short proof for Nesterov’s momentum

Yesterday I posted the following picture on Twitter and it quickly became my most visible tweet ever (by far):

    \begin{eqnarray*} & x^+ := x - \eta \nabla f(x) & \text{(gradient step)} \\ & d_t := \gamma_t \cdot (x_{t} - x_{t-1}) & \text{(momentum term)} \\ \\ \text{ [Cauchy, 1847] } & x_{t+1} = x_t^+ & \text{(gradient descent)} \\ \text{ [Polyak, 1964] } & x_{t+1} = x_t^+ + d_{t} & \text{(momentum + gradient)} \\ \text{ [Nesterov, 1983] } & x_{t+1} = (x_t + d_{t})^+ & \text{(momentum + lookahead gradient)} \end{eqnarray*}

I thought this would be a good opportunity to revisit the proof of Nesterov’s momentum, especially since as it turns out I really don’t like the way I described it back in 2013 (and to this day the latter post also remains my most visible post ever…). So here we go, for what is hopefully a short and intuitive proof of the 1/t^2 convergence rate for Nesterov’s momentum (disclaimer: this proof is merely a rearranging of well-known calculations, nothing new is going on here).

We assume that f is \beta-smooth convex function, and we take \eta = 1/\beta in the gradient step. The momentum term \gamma_t will be set to a very particular value, which comes out naturally in the proof.


The two basic inequalities

Let us denote \delta_t = f(x_t) - f(x^*) and g_t = - \frac{1}{\beta} \nabla f(x_t + d_t) (note that x_{t+1} = x_t + g_t + d_t). Now let us write our favorite inequalities (using f(x^+) - f(x) \leq - \frac{1}{2 \beta} |\nabla f(x)|^2 and f(x) - f(y) \leq \nabla f(x) \cdot (x-y)):

    \[ \delta_{t+1} - \delta_t \leq - \frac{\beta}{2} \left( |g_t|^2 + 2 g_t \cdot d_t \right) \,, \]


    \[ \delta_{t+1} \leq - \frac{\beta}{2} \left( |g_t|^2 + 2 g_t \cdot (x_t + d_t - x^*) \right) \,. \]


On the way to a telescopic sum

Recall now that |a|^2 + 2 a \cdot b = |a+b|^2 - |b|^2, so it would be nice to somehow combine the two above inequalities to obtain a telescopic sum thanks to this simple formula. Let us try to take a convex combination of the two inequalities. In fact it will be slightly more elegant if we use the coefficient 1 on the second inequality, so let us do \lambda_t-1 times the first inequality plus 1 times the second inequality. We obtain an inequality whose right hand side is given by -\frac{\beta}{2} times

    \begin{align*} & \lambda_t |g_t|^2 + 2 g_t \cdot (x_t + \lambda_t d_t - x^*) \\ & = \frac{1}{\lambda_t} \left( |x_t + \lambda_t d_t - x^* + \lambda_t g_t|^2 - |x_t + \lambda_t d_t - x^*|^2 \right) \,. \end{align*}

Recall that our objective is to obtain a telescopic sum, and at this point we still have flexibility both to choose \lambda_t and \gamma_t. What we would like to have is:

    \[ x_t + \lambda_t d_t - x^* + \lambda_t g_t = x_{t+1} + \lambda_{t+1} d_{t+1} - x^* \,. \]

Observe that (since d_{t+1} = \gamma_{t+1} \cdot (d_t + g_t)) the right hand side can be written as x_t + g_t + d_t + \lambda_{t+1} \cdot \gamma_{t+1} \cdot (g_t + d_t) - x^*, and thus we see that we simply need to have:

    \[ \lambda_t = 1+ \lambda_{t+1} \cdot \gamma_{t+1} \,. \]


Setting the parameters and concluding the proof

Writing u_t := \frac{\beta}{2} |x_t + \lambda_t d_t - x^*|^2 we now obtain as a result of the combination of our two starting inequalities:

    \[ \lambda_t^2 \delta_{t+1} - (\lambda_t^2 - \lambda_t) \delta_t \leq u_t - u_{t+1} \,. \]

It only remains to select \lambda_t such that \lambda_t^2 - \lambda_t = \lambda_{t-1}^2 (i.e., roughly \lambda_t is of order t) so that by summing the previous inequality one obtains \delta_{t+1} \leq \frac{\beta |x_1 - x^*|^2}{2 \lambda_t^2} which is exactly the 1/t^2 rate we were looking for.

Posted in Optimization | 20 Comments

Remembering Michael

It has been a year since the tragic event of September 2017. We now know what happened, and it is a tremendously sad story of undiagnosed type 1 diabetes.

This summer at MSR Michael was still very present in our discussions, whether it was about some ideas that we discussed that last 2017 summer (acceleration, metrical task systems lower bounds, etc…), or just some random fun story.

I highly recommend to take look at the YouTube videos from the November 2017 symposium in memory of Michael. You can also take a look at his (still growing) list of publications on arxiv. In fact I know of an upcoming major paper so stay tuned (the premises are in Yin Tat’s talk at the symposium).

As always when remembering this tragic loss my thoughts go to Michael’s family.

Posted in Uncategorized | 1 Comment

How to generalize (algorithmically)

A couple of months ago I taught an introduction to statistical learning theory. I took inspiration from two very good introductory books on SLT: “Foundations of ML”, and “Understanding Machine Learning: From Theory to Algorithms”. I also covered some classical results about nearest neighbors from the seminal book “A Probabilistic Theory of Pattern Recognition”. The lectures were recorded and I put the videos on the youtube channel. Sadly I still did not learn the basic lesson to stick to a black pen when teaching on a whiteboard, so at (many) times it is hard to read what I write. I also got sick after the first lecture so my voice is quite weak in the following lectures, making it hard to hear too… Despite these fatal flaws some people still found the material useful, so check it out if you are interested. Below I give a rough syllabus for the lectures, and I also give some notes on “algorithmic generalization” which was in my opinion the highlight of the lectures (you can also check my earlier notes on SLT here and here).


Lecture 1: Introduction to the statistical learning theory framework, its basic question (sample complexity) and its canonical settings (linear classification, linear regression, logistic regression, SVM, neural networks). Two basic methods for learning: (i) Empirical Risk Minimization, (ii) Nearest neighbor classification.

Lecture 2: Uniform law of large numbers approach to control the sample complexity of ERM (includes a brief reminder of concentration inequalities). Application: analysis of bounded regression (includes the non-standard topic of type/cotype and how it relates to different regularizations such as in LASSO).

Lecture 3: Reminder of the first two lectures and relation with the famous VC dimension. How to generalize beyond uniform law of large numbers: stability and robustness approaches (see below).

Lecture 4: How to generalize beyond uniform law of large numbers: information theoretic perspective (see below), PAC-Bayes, and online learning. Brief discussion of margin theory, and an introduction to modern questions in robust machine learning.


Some notes on algorithmic generalization

Let \mathcal{X}, \mathcal{Y} be input/output spaces. Let \ell : \mathcal{Y} \times \mathcal{Y} \rightarrow [0,1] be a loss function, \mu a probability measure supported on \mathcal{X} \times \mathcal{Y}, and \underline{h} : (\mathcal{X} \times \mathcal{Y})^m \rightarrow \mathcal{Y}^\mathcal{X} a learning rule (in words \underline{h} takes as input a dataset of m examples, and output a mapping from \mathcal{X}-inputs to \mathcal{Y}-outputs). With a slight abuse of notation, for z=(x,y) \in \mathcal{X} \times \mathcal{Y} and h \in \mathcal{Y}^{\mathcal{X}}, we write \ell(h, z) := \ell(h(x), y). We define the generalization of \underline{h} on \mu by:

    \[ \mathrm{gen}(\underline{h} , \mu) = \mathbb{E}_{S=(z_1,\hdots,z_m) \sim \mu^{\otimes m}, z \sim \mu} \left[ \ell(\underline{h}(S), z) - \frac1{m} \sum_{i=1}^m \ell(\underline{h}(S), z_i) \right] \,. \]

In words, if \mathrm{gen}(\underline{h} , \mu) \leq \epsilon then we expect the empirical performance of the learned classifier \underline{h}(S) to be representative of its performance on a fresh out-of-sample data point, up to an additive \epsilon. The whole difficulty of course is that the empirical evaluation is done with the same dataset S that is used for training, leading to non-trivial dependencies. We should also note that in many situations one might be interested in the two-sided version of the generalization, as well as high probability bounds instead of bounds in expectation. For simplicity we focus on \mathrm{gen}(\underline{h} , \mu) here.

The most classical approach to controlling generalization, which we covered in details in previous notes, is via uniform law of large numbers. More precisely assuming that the range of the learning rule is some hypothesis class \mathcal{H} one trivially has

    \[ \mathrm{gen}(\underline{h} , \mu) \leq \mathbb{E}_{S, z} \,\, \sup_{h \in \mathcal{H}} \left[ \frac1{m} \sum_{i=1}^m \ell(h, z_i) - \ell(h, z) \right] \,. \]

However this approach might be too coarse when the learning rule is searching through a potentially huge space of hypothesis (such as in the case of neural networks). Certainly such uniform bound has no chance of explaining why neural networks with billions of parameters would generalize with a data set of merely millions of examples. For this one has to use algorithm-based arguments.


The classical example of algorithmic generalization is due to Bousquet and Elisseeff 2002. It is a simple rewriting of the generalization as a stability notion:

    \begin{eqnarray*} \mathrm{gen}(\underline{h} , \mu) & = & \mathbb{E}_{S, z, i \sim [n]} \left[\ell(\underline{h}(S), z) - \ell(\underline{h}(S), z_i) \right] \\ & = & \mathbb{E}_{S, z, i \sim [n]} \left[\ell(\underline{h}(S^{-i}), z_i) - \ell(\underline{h}(S), z_i) \right] \,, \end{eqnarray*}

where S^{-i}=(z_1,\hdots, z_{i-1}, z, z_{i+1}, \hdots, z_m). This viewpoint can be quite enlightening. For example in the uniform law of large numbers view, regularization enforces small capacity, while in the stability view we see that regularization ensures that the output hypothesis is not too brittle (this was covered in some details in the previous notes).


The next approach I would like to discuss is related to deep questions about current machine learning methods. One of the outstanding problem in machine learning is that current algorithms are not robust to even mild shift of distribution at test time. Intuitively this lack of robustness seem to indicate a lack of generalization. Can we formalize this intuition? I will now give one such formal link between robustness and generalization due to Xu and Mannor 2010, which shows the reverse direction (robustness implies generalization). At some level robustness can be viewed as a “stability at test time” (while in Bousquet and Elisseeff we care about “stability at training time”).

Xu and Mannor define (K, \epsilon(\cdot))-robustness as follows: assume that \mathcal{Z} can be partitioned into K sets \{C_k\}_{k \in [K]} such that if z and z_i are in the same set C_k then

    \[ | \ell(\underline{h}(S), z_i) - \ell(\underline{h}(S), z) | \leq \epsilon(S) \,. \]

A good example to have in mind would be a binary classifier with large margin, in which case K corresponds to the covering number of \mathcal{X} at the scale given by the margin. Another (related) example would be regression with a Lipschitz function. In both cases K would be typically exponential in the dimension of \mathcal{X}. The key result of Xu and Mannor that we prove next is a generalization bound of order \sqrt{K / m}. In any situation of interest this seems to me to be a pretty weak bound, yet on the other hand I find the framework to be very pretty and it is of topical interest. I would be surprised if this was the end of the road in the space of “generalization and robustness”.

Theorem (Xu and Mannor 2010):

A (K, \epsilon(\cdot))-robust learning rule satisfies

    \[ \mathrm{gen}(\underline{h} , \mu) \leq \mathbb{E}_{S} \, \epsilon(S) + \sqrt{\frac{K}{m}} \,. \]

Proof: Let N_k = |S \cap C_k| and note that \mathbb{E}_S \, N_k = m \mu(C_k). Now one has for a robust h:

    \begin{align*} \mathbb{E}_z \ \ell(h, z) & =\sum_{k=1}^K \mu(C_k) \mathbb{E}_z [\ell(h,z) | z \in C_k] \\ & = \sum_{k=1}^K \frac{N_k}{m} \mathbb{E}_z [\ell(h,z) | z \in C_k] + \sum_{k=1}^K \left(\mu(C_k) - \frac{N_k}{m}\right) \mathbb{E}_z [\ell(h,z) | z \in C_k] \\ & \leq \epsilon(S) + \frac{1}{m} \sum_{i = 1}^m \ell(h, z_i) + \sum_{k=1}^K \left|\mu(C_k) - \frac{N_k}{m}\right| \, . \end{align*}

It only remains to observe that

    \[ \mathbb{E}_S \, \sum_{k=1}^K \left|\mu(C_k) - \frac{N_k}{m}\right| \leq \frac{1}{m} \sum_{k=1}^K \sqrt{\mu(C_k)} \leq \sqrt{\frac{K}{m}} \, . \]


Information theoretic perspective

Why do we think that a lack of robustness indicate a lack of generalization? Well it seems to me that a basic issue could simply be that the dataset was memorized by the neural network (which be a very non-robust way to learn). If true then one could basically find all the information about the data in the weights of the neural network. Again, can we prove at least the opposite direction, that is if the output hypothesis does not retain much information from the dataset then it must generalize. This is exactly what Russo and Zou 2016, where they use the mutual information \mathrm{I}(\underline{h}(S), S) as a measure of the “information” retained by the trained hypothesis about the dataset. More precisely they show the following result:

Theorem (Russo and Zou 2016):

    \[ \mathrm{gen}(\underline{h} , \mu) \leq \sqrt{\frac{\mathrm{I}(\underline{h}(S), S)}{2 m}} \, . \]

Note that here we have assumed that the codomain of the learning rule \underline{h} consists of deterministic maps from inputs to outputs, in which case the mutual information \mathrm{I}(\underline{h}(S), S) is simply the entropy \mathrm{H}(\underline{h}(S)). However the proof below also applies to the case where the codomain of the learning rule consists of probability measures, see e.g., Xu and Raginsky 2017. Let us now conclude this (long) post with the proof of the above theorem.

The key point is very simple: one can view generalization as a decoupling property by writing:

    \[ \mathrm{gen}(\underline{h} , \mu) = \mathbb{E}_{S, S' \sim \mu^{\otimes m}} [ F(\underline{h}(S), S') - F(\underline{h}(S), S) ] \,, \]

where F(h, S) = \frac{1}{m} \sum_{z \in S} \ell(h, z).

Now the theorem follows straightforwardly (if one knows Hoeffding’s lemma) from an application of the following beautiful lemma:


Let f: \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}. Let (X,Y) be random variables in \mathcal{X} \times \mathcal{Y} and \bar{X}, \bar{Y} be mutually independent copies of X and Y. Assume that f(\bar{X}, \bar{Y}) is \sigma-subgaussian (i.e., \log \mathbb{E} \exp( \lambda (f(\bar{X}, \bar{Y}) - \mathbb{E} f) ) \leq \lambda^2 \sigma^2 /2, \forall \lambda) then

    \[ \mathbb{E}[ f(X,Y) - f(\bar{X}, \bar{Y}) ] \leq \sqrt{2 \sigma^2 \mathrm{I}(X, Y)} \]

Proof: The mutual information is equal to the relative entropy between the distribution of (X,Y) and the distribution of (\bar{X}, \bar{Y}). Recall also the variational representation of the relative entropy which is that the map \mu \mapsto \mathrm{Ent}(\mu, \nu) is the convex conjugate of the log-partition function \Phi_{\nu}(g) := \log(\int \exp(g(x)) d \nu(x)). In particular one has a lower bound on the mutual information for any such g which means:

    \[ \mathrm{I}(X, Y) \geq \mathbb{E}[ g(X,Y) ] - \log \mathbb{E} [ \exp(g(\bar{X},\bar{Y})) ] \,. \]

Now it only remains to use the definition of subgaussianity, that is take g = \lambda f, and optimize over \lambda.

Posted in Machine learning | 1 Comment

Some updates

The blog has been eerily silent for most of 2018, here is why:

  • The main culprit is definitely the COLT 2018 chairing. This year we received a surprise 50% increase in number of submissions. This is great news for the community, and it led to a truly fantastic program. We are looking forward to see many of you in Stockholm! Note that the early registration deadline is tomorrow (May 30th).
  • The first volume of our new Mathematical Statistics and Learning journal just got published. If you care about being evaluated by top experts, and/or get the visibility of being in a highly selective mathematical journal, you should consider submitting your highest quality work there!
  • The first set of video lectures on the youtube channel is now complete. I realize that the quality is not perfect and that it is hard to read the blackboard. I will try to improve this for the future videos (which should be posted some time during the summer).
  • Finally I am quite excited to have a first preprint (joint work with Eric Price and Ilya Razenshteyn) at least loosely related to deep learning. It is titled “Adversarial examples from computational constraints”, which sums up the paper pretty well. In a nutshell, we prove that avoiding adversarial examples is computationally hard, even in situations where there exists a very robust classifier. We are looking forward to get the deep learning community’s feedback on this work!
Posted in Announcement | 5 Comments

Youtube channel

I just started a youtube channel. The hope is that this will be a companion to the blog. Some of the posts will “graduate” into videos (e.g., the first set of videos will correspond to an expanded version of bandit lecture notes –part 1, part 2-, and the second set will correspond to the crash course in learning theory –part 1, part 2-). But the mid/long-term plan is to also have dedicated content on the youtube channel (I still have to figure out the exact format). I personally like a lot to watch math videos online, so I hope the channel will be a nice addition to this growing material (some of my favorite existing channels include the Simons Institute, IAS, Princeton TCS, Harvard Math, IHESa new MIT data science channel, feel free to add yours in the comments below).

Posted in Announcement | 2 Comments

k-server, part 3: entropy regularization for weighted k-paging

If you have been following the first two posts (post 1, post 2), now is time to reap the rewards! I will show here how to obtain a O(\log(k))-competitive algorithm for (weighted) paging, i.e., when the metric space corresponds to the leafs of a weighted star. This was viewed as a breakthrough result 10 years ago (with a JACM publication by Bansal, Buchbinder and Naor in 2012), and for good reasons as this simplest instance of k-server was in fact the one studied in the seminal paper by Sleator and Tarjan in 1985 which introduced the competitive analysis of online algorithms (actually to be precise Sleator and Tarjan considered the unweighted case, for which a O(\log(k)) algorithm was known much before).

State space for weighted k-paging

Let w_i be the weight of the edge from leaf i \in [n] to the root. Recall from the previous post that we want to find a norm and a convex body that can express the Wasserstein-1 distance between two fractional k-server configurations.

Consider a fractional move from z \in \Delta_n(k) to z + \xi. Then clearly, the amount of mass |\xi_i| has to transfer through the edge (i, \mathrm{root}), so that the Wasserstein-1 distance is at least \sum_i w_i |\xi_i|. Furthermore there is trivially a transport plan achieving that much total mass transfer. In other words we just proved that in this case the appropriate norm is a weighted \ell_1 norm (namely \|\xi\| := \sum_{i=1}^n w_i |\xi_i|) and one can simply use the basic state space K = \{x \in [0,1]^n : \sum_i x_i = n-k\} (recall from the previous post that we have to work with anticonfiguration, and that the mapping to a configuration is simply given by z=1-x).

Applying the general mirror descent framework

Given a request at location r \in [n] and a current anticonfiguration x_0, our proposed (fractional) algorithm is to run the mirror descent dynamics with a continuous time linear cost c(t) = e_r from x(0)=x_0 (i.e., x'(t) = (\nabla^2 \Phi(x(t)))^{-1} (e_r + \lambda(t)) for some \lambda(t) \in N_K(x(t))) and until the first time t at which x_r(t) = 0 (i.e., one has a server at location r in z(t) := 1-x(t)). By the lemma at the end of the previous post one has (denote r(t) for the request being serviced at time t)

    \[ \int x_{r(t)}(t) dt \leq \mathrm{Lip}_{\|\cdot\|}(\Phi) \times \mathrm{OPT} + O(1) \]

One can think of \int x_{r(t)}(t) dt as a “virtual service cost”. In k-server this quantity has no real meaning, but the above inequality shows that this quantity, which only depends on the algorithm, is tightly related to the value of OPT (provided that \Phi has a small Lipschitz norm). Thus we see that we have two key desiderata for the choice of the mirror map \Phi: (i) it should have a small Lipschitz norm, (ii) one should be able to relate the movement cost \|x'(t)\| = \|(\nabla^2 \Phi(x(t)))^{-1} (e_r + \lambda(t))\| to this “virtual service cost” x_{r}(t), say up to a multiplicative factor \alpha. One would then obtain a \alpha \times \mathrm{Lip}_{\|\cdot\|}(\Phi)-competitive algorithm.

Entropy regularization for weighted k-paging

Let us look at (ii), and we shall see that the entropy regularization comes out very naturally. Ignore for a moment the Lagrange multiplier \lambda(t) and let us search for a separable regularizer of the form \Phi(x) = \sum_i \phi_i(x_i). We want to relate x_r(t) to w_r/\phi_i''(x_r(t)) (= (\nabla^2 \Phi(x(t)))^{-1} e_r). Making those two quantities equal gives \phi_i(x) = w_i x_i \log(x_i) and thus the regularizer is a weighted negentropy: \Phi(x) = \sum_i w_i x_i \log x_i.

We now need to verify that this relation between the virtual service cost and the movement cost remains true even when the Lagrange mutilplier \lambda(t) is taken into account. Note that because of the form of the state space K the multiplier contains a term of the form (\mu(t), \mu(t), \hdots, \mu(t)) (which corresponds to the constraint \sum_{i=1}^n x_i = n-k) and for each location a term forcing the derivative to be 0 if the value of the missing mass has reached 1. In other words we obtain the following dynamics for mirror descent with the weighted negentropy regularization:

    \begin{align*} & x_r'(t) = \frac{x_r(t)}{w_r} (-1 + \mu(t)) \\ & x_i'(t) = \frac{x_i(t)}{w_i} \mu(t) 1\{x_i(t) < 1\} \ \text{for} \ i \neq r \end{align*}

Notice that up to a factor 2 one can focus on controlling \|(x'(t))_-\| (that is only movement into a location is charged). In that view the Lagrange multipliers simply have no effect, since one has \mu(t) \geq 0 (indeed recall that \sum_{i=1}^n x_i'(t) = 0). Thus we see that the movement cost \|(x'(t))_-\| is exactly bounded by the virtual service cost x_r(t) in this case.

Making \Phi Lipschitz

It remains to deal with a non-trivial issue, namely the entropy is not Lipschitz on the simplex! A similar issue is faced in online learning when one tries to prove tracking expert regret bounds, i.e., bounds with respect to a shifting expert. The standard solution (perhaps first used by Herbster and Warmuth in 98, see also Blum and Burch 00) is to shift the variables so that they never get below some \delta>0, in which case the Lipschitz constant would be O(\log(1/\delta)). In the k-server scenario one can stop the dynamics when x_{r(t)}(t)=\delta (instead of x_{r(t)}(t)=0) provided that the mapping from x to the actual fractional configuration z is now given by z=\frac{1-x}{1-\delta}. This raises a final issue: the total amount of server mass in such a z is \frac{k}{1-\delta} > k. Next we show that if \delta is small enough then z can be “rounded” online to a fractional k-server configuration at the expense of a multiplicative O(1) movement. Precisely we show that \delta<1/(2k) is sufficient, which in turns gives a final competitive ratio of O(\log(k)) for weighted k-paging.

From k+\epsilon servers to k servers

Consider a fractional server configuration z with total mass k+\epsilon (i.e., \sum_{i=1}^n z_i = k+\epsilon), which we want to transform into a server configuration \sigma(z) with total mass k. A natural way to “round down” is at each location to put 0 if the mass was \leq \epsilon. Furthermore a mass of 1 should stay 1. This suggests the mapping \sigma(z_i) = \frac{z_i-\epsilon}{1-\epsilon} 1\{z_i > \epsilon\}, which is \frac{1}{1-\epsilon}-Lipschitz. Thus the movement of \sigma(z(t)) is controlled by the movement of z(t) up to a multiplicative factor \frac{1}{1-\epsilon}. Moreover one clearly has \sum_{i=1}^n \sigma(z_i) \leq k (in fact the inequality can be strict, in which case one should think of the “lost mass” as being stored at the root, this incurs no additional movement cost).

Posted in Theoretical Computer Science | 18 Comments

k-server, part 2: continuous time mirror descent

We continue our k-server series (see post 1 here). In this post we briefly discuss the concept of a fractional solution for k-server, which by analogy with MTS will in fact be a fractional “anti-solution”. Then we introduce the continuous time version of MTS and explain how it applies for k-server. Finally the most important part of the post is the description of the basic potential based analysis of mirror descent and how to interpret it in the context of MTS.


State representation for MTS

Recall the MTS problem: one maintains a random state \rho_t \in [n] (where [n] is equipped with a distance d), and given a new cost function c_{t+1} : [n] \rightarrow \mathbb{R}_+, one can update the state to \rho_{t+1} with the corresponding price to pay: c_{t+1}(\rho_{t+1}) + d(\rho_t, \rho_{t+1}). Equivalently one can maintain the probability distribution p_t \in \Delta_n := \{x \in \mathbb{R}_+^n : \sum_{i=1}^n x_i = 1\} of \rho_t: indeed given \rho_t, p_t, p_{t+1} one can obtain \rho_{t+1} by optimal coupling, that is the couple of random variables (\rho_t, \rho_{t+1}) minimizes the expected distance \E \ d(Y,Z) over all couples (Y,Z) such that Y has marginal p_t and Z has marginal p_{t+1} (the latter quantity is called the Wasserstein-1 distance between p_t and p_{t+1}). In this view the (expected) service cost is now a linear function, that is c_t \cdot p_t where c_t \in \mathbb{R}_+^n, and the movement cost between p_t and p_{t+1} is the Wasserstein-1 distance.

We will further assume the existence of a convex body K and a norm \|\cdot\| in \mathbb{R}^N (N \geq n) such that the Wasserstein-1 distance between two distributions x,y \in \Delta_n is equal to \|\hat{x} - \hat{y}\| where \hat{x}, \hat{y} \in K are “expanded states” with x=(\hat{x}_1,\hdots, \hat{x}_n) and y=(\hat{y}_1, \hdots, \hat{y}_n). For a weighted star metric it suffices to take K = \Delta_n, but we will see in the fourth post that for trees one indeed need to expand the state space.

Fractional solution for k-server

Recall the k-server problem: one maintains a random configuration \rho_t \in [n]^k, and given a new request r_{t+1} \in [n] one must update the configuration to \rho_{t+1} such that r_{t+1} \in \{\rho_{t+1}(1), \hdots, \rho_{t+1}(k)\}. One could equivalently maintain a distribution p_t \in \Delta_{n^k} as before. In the fractional problem one in fact only maintains the 1^{st} moment of this distribution, z_t \in \Delta_n(k) := \{z \in [0,1]^n : \sum_{i=1}^n z_i = k\}, defined by z_t(i) = \sum_{\rho \in [n]^k : i \in \{\rho(1),\hdots, \rho(k)\}} p_t(\rho) (in particular servicing a request a location i means that one must have z_t(i) = 1). Again the metric here on the z variables is the Wasserstein distance induced by d (a.k.a., the optimal transport distance). Importantly we note that this view is not equivalent to maintaining a full distribution (indeed a lot of information is lost by recording only the first moment). This raises the subtle issue of whether one can round online a fractional solution to a proper integral solution whose total movement is of the same order of magnitude. We will not touch upon this question here, and we focus on the fractional k-server problem, see for example Section 5.2 here for more. We note however that the existence of a polynomial time algorithm for this rounding task is an open problem.

To think of the request as a linear cost function (like in MTS) it is more appropriate to work with the anticonfiguration x=1-z. In this view a request r \in [n] is equivalent to a cost vector c = +\infty \times e_r. Finally like in MTS we will assume the existence of an expanded state space K \subset \mathbb{R}^N and a norm \|\cdot\| that measures movement in this expanded view.

Continuous time decision making

We will now move to a continuous time setting, where the (discrete time) sequence of cost vectors is revealed continuously as a path (c(t))_{t \in \mathbb{R}_+}, with c(t) \in \mathbb{R}_+^N (and c_i(t) \neq 0 \Rightarrow i \leq n). The decision maker’s reponse is a path (x(t))_{t \in \mathbb{R}_+} that lives in K \subset \mathbb{R}^N. In this setting the service cost of the algorithm is \int_{\mathbb{R}^+} c_t \cdot x_t dt and the movement cost is equal to \int_{\mathbb{R}^+} \|x'(t)\| dt where x'(t) is the time derivative of the path x(t). We note that there is small subtelty here to translate the continuous time service cost into a meaningful discrete time service cost, but we will not worry about this here since it does not affect the argument for k-server (where there is only a movement cost). If you are curious see the appendix here.

For k-server we will use c(t) = e_{r(t)} where r(t) is the currently requested location, and we move to the next request at the first time such that x_{r(t)}(t) = 0 (which means that z=1-x satisfies z_{r(t)}(t)=1, i.e., there is a server at the requested location.

Mirror descent

If you have never seen the mirror descent framework now is a good time to take a quick look here.

Very succintly mirror descent with mirror map \Phi : K \rightarrow \mathbb{R} can be written as follows, with a step-size \eta:

    \[ x(t+\eta) = \mathrm{argmin}_{x \in K} D_{\Phi}(x, \nabla \Phi^*(\nabla \Phi(x(t)) - \eta c(t))) \]

where we recall that \Phi^* is the Legendre-Fenchel transform of \Phi (i.e., the map whose gradient is the inverse map of the gradient of \Phi) and D_{\Phi}(x,y) = \Phi(x) - \Phi(y) - \nabla \Phi(y) \cdot (x-y) is the Bregman divergence associated to \Phi.

We now want to take \eta to 0 in the above definition to find the continuous time mirror descent update. For that let us recall the first order optimality condition for constrained optimization. Denote N_K(x) = \{\theta : \theta \cdot (y-x) \leq 0, \ \forall y \in K\} for the normal cone of K at x, i.e., the set of directions which are negatively correlated with any direction going into the body. One then has for any convex function f,

    \[ x^* \in \mathrm{argmin}_{x \in K} f(x) \Leftrightarrow -\nabla f(x^*) \in N_K(x^*) \]

In particular we see that (note that \nabla_x D_{\Phi}(x,y) = \nabla \Phi(x) - \nabla \Phi(y))

    \[ \nabla \Phi(x(t+\eta)) - \nabla \Phi(x(t)) + \eta c(t) \in - N_{K}(x(t+\eta)) \]

and thus taking \eta to 0 we morally get

    \[ \nabla^2 \Phi(x(t)) x'(t) \in - c(t) - N_{K}(x(t)) \]

This type of equation is known as a differential inclusion, and with the added constraint that the path must live in the constraint set K we get a viability problem. In our paper we show that a solution indeed exist (and is unique) under mild assumptions on \Phi.

Potential based analysis

The mirror descent algorithm is fully described by:

    \[ \nabla^2 \Phi(x(t)) x'(t) = - c(t) - \lambda(t), \; \lambda(t) \in N_{K}(x(t)) \]

Denoting \hat{D}_{\Phi}(y ; x) = - \Phi(x) - \nabla \Phi(x) \cdot (y -x ) we see that for any fixed y,

    \begin{eqnarray*} \partial_t \hat{D}_{\Phi}(y ; x(t)) & = & - \nabla^2 \Phi(x(t)) x'(t) \cdot (y - x(t)) \\ & = & (c(t) + \lambda(t)) \cdot (y - x(t)) \\ & \leq & c(t) \cdot (y - x(t)) \end{eqnarray*}

The above calculation is the key to understand mirror descent: it says that if the algorithm is currently paying more than y, i.e., c(t) \cdot (y - x(t)) <0, then it is actually getting closer to y in the sense that \hat{D}_{\Phi}(y ; x(t)) is decreasing. Put differently: when the algorithm pays, it also learns. This key insight is sufficient for online learning, where one competes against a fixed point y. However in MTS and k-server we compete against a path y(t), and thus we also need to evaluate by how much the Bregman divergence can go up when y is moving. This is captured by the following calculation:

    \begin{eqnarray*} \partial_t \hat{D}_{\Phi}(y(t) ; x) & = & - \nabla \Phi(x(t)) \cdot y'(t) \\ & \leq & \mathrm{Lip}_{\|\cdot\|}(\Phi) \times \|y'(t)\| \end{eqnarray*}

Putting together the two above calculations we obtain the following control on the service cost of mirror descent in terms of the service cost and movement cost of the optimal path:

Lemma: The mirror descent path (x(t))_{t \geq 0} satisfies for any comparator path (y(t))_{t \geq 0},

    \[ \int c(t) \cdot x(t) dt \leq \mathrm{Lip}_{\|\cdot\|}(\Phi) \int \|y'(t)\| dt + \int c(t) \cdot y(t) dt + O(1) \]

At this point the big question is: how do we control the movement of mirror descent? In the next post we will see how this plays out on a weighted star.


Posted in Theoretical Computer Science | 1 Comment

k-server, part 1: online learning and online algorithms

The k-server problem is a classical and very attractive instance of online decision making. The decisions to be made in this problem are simple: given a requested location in some finite metric space and a fleet of k servers currently sitting in this metric space, one has to choose which server to use to service the request (that is the server will move from its current location to the requested location). The problem is to minimize the total amount of movement of the servers on a stream of such requests. This formulation is both abstract enough that it can model many real-life problems (e.g., virtual memory management) yet concrete/simple enough that it seems that everything ought to be understood about it.

In the coming series of 5 blog posts I will explain the main ideas behind our k-server preprint with Michael Cohen, James Lee, Yin Tat Lee, and Aleksander Madry. In this first post I will briefly put the problem in the broader context of online learning and online algorithms, which will be helpful as it suggests an approach based on the mirror descent algorithm. In the next post I will explain the general framework of mirror descent and how it can be applied to a problem such as k-server. The third post will show how to use this general framework to recover effortlessly the state of the art for weighted k-paging (i.e., when the underlying metric space is a weighted star). The fourth post will show how to extend the analysis to arbitrary trees. Finally in the fifth post we will discuss both classical embedding techniques to reduce the problem to the tree case, as well as our new dynamic embedding based on multiscale restarts on the classical technique. The content of the first three posts was essentially covered in my TCS+ seminar talk.

Online decision making: two probability free models

I will start by introducing a very general framework for online algorithms due to Borodin, Linial and Saks 92, called metrical task systems (MTS). At the same time I will recall the online learning framework, so that one can see the similarities/differences between the two settings. In fact this connection was already made at the end of the Nineties, see Blum and Burch 00. At the time it was natural to explore the power of multiplicative weights. With today’s perspective it is natural to explore the much more general mirror descent algorithm (we note that the closely related concept of regularization was already brought to bear for these problems in Abernethy, Bartlett, Buchbinder and Stanton 10, and Buchbinder, Chen and Naor 14).

Let X be a set which we think of as either a state space (for online algorithms) or an action space (for online learning). Let d be a metric on X. Finally let \mathcal{C} \subset \mathbb{R}^{X} be a set of possible cost functions (e.g., arbitrary bounded functions, or linear functions if X \subset \mathbb{R}^N, or convex functions, etc). Online decision making is about making decisions in an uncertain environment (the future is not known), and here we model this uncertain environment as an unknown sequence of cost functions c_1, \hdots, c_T \in \mathcal{C}. The interaction protocol can then be described as follows: For each t \in [T], the player chooses (possibly randomly) x_t \in X based on past observations (to be made precise soon) and pays the cost c_t(x_t) \in \mathbb{R}, plus possibly a movement penalty: d(x_{t-1}, x_t).

There are now two key differences between online algorithms and online learning: (i) the observation model, and (ii) the performance metric. In online learning one assumes that the cost c_t is unknown at the decision time, and is only revealed after the decision is made (in bandits an even weaker signal is revealed, namely only the paid cost c_t(x_t)). The type of applications one has in mind is say online recommendations, where a user’s preference is only revealed once some item has been recommended to her. On the other hand in online algorithms the cost c_t is known at decision time. In this context the cost corresponds to a request, and the player has to find a way to satisfy this request (in the language of MTS the cost represents a task, and one gets the option to update its state so as to complete this task more efficiently, i.e., at a lower cost). Now let us discuss the performance metric. In online learning one assumes that there is some “hidden” good action (it is hidden in the noise, i.e., a single cost function does not say much about which actions are good, but if one takes the whole sequence into account then there is some good action that hopefully emerges). Thus it is natural to consider the following regret notion:

    \[ \sum_{t=1}^T c_t(x_t) - \min_{x \in X} \sum_{t=1}^T c_t(x) \]

This regret notion does not make sense for online algorithms where one may have to keep changing states to satisfy the request sequence. There one must compare to the best offline strategy, in which case additive guarantees are not attainable and one resorts to a multiplicative guarantee, the so-called competitive ratio:

    \[ \sum_{t=1}^T \left( c_t(x_t) + d(x_t,x_{t-1}) \right) / \min_{x^* \in X^T} \sum_{t=1}^T \left( c_t(x_t^*) + d(x_t^*,x_{t-1}^*) \right) ~. \]

(Note that in MTS one always assumes nonnegative cost functions so that the multicative guarantee makes sense.) The k-server problem, introduced in Manasse, McGeoch and Sleator 90, corresponds to a metrical task system on the product state space X^k equipped with the Earthmover distance inherited by (X,d), and with cost functions

    \[ \mathcal{C} = \{X^k \ni x \mapsto + \infty \times 1\{ r \not\in \{x(1),\hdots,x(k)\} \}; \ r \in X \} ~. \]

Known results

The online learning setting is by now fairly well-understood. We know that the optimal regret for a finite action set |X| = n and bounded cost functions is O(\sqrt{T \log(n)}) (see e.g., Cesa-Bianchi, Freund, Haussler, Helmbold, Schapire and Warmuth 97). In fact we even know the optimal constant (for large n and T), and we have both algorithmic and information theoretic proofs of this result. Moreover we know how to leverage combinatorial information, e.g., when X \subset \{0,1\}^d and \mathcal{C} is a set of linear functions. A unifying principle that was put forth in the last decade or so is the mirror descent strategy.

On the other hand the situation for MTS is much worse. The optimal guarantees for finite sets are not known: the worst case (over all metric spaces of size n) competitive ratio is known to be \Omega(\log(n)) (trivial coupon-collector style lower bound on the uniform metric) and O(\log^2(n) \log\log(n)) (the latter bound is due to Fiat and Mendel 03). No information theoretic analysis is known, even for the uniform metric. With combinatorial structure the situation becomes even more disastrous. That’s where the k-server problem comes in, as a combinatorial MTS striking the best balance between simplicity and importance. For k-server it is conjectured that one could get a competitive ratio of O(\log(k)) (the coupon-collector lower bound on uniform metric gives here \Omega(\log(k))), while the best result prior to our work was O(\log^3(n) \log^2(k)) (due to Bansal, Buchbinder, Naor and Madry 11), and O(k) if one insists for a bound independent of n (due to Koutsoupias and Papadimitriou 95).

Posted in Theoretical Computer Science | 5 Comments

Algorithms, Machine Learning, and Optimization: we are hiring!

As some of you already know there has been some movement at MSR lately, specifically for the theory group. We have now branched out into two groups, one in Machine Learning and Optimization -MLO- (with Zeyuan Allen-Zhu, myself, Ofer Dekel, Yuval Peres, Ilya Razenshteyn, Lin Xiao, and until recently postdoc Yin Tat Lee) and one in Algorithms (Nikhil Devanur, Sergey Yekhanin, postdocs Janardhan Kulkarni and Xiaorui Sun). The MLO group is part of the new MSR AI organization headed by Eric Horvitz, with grand goals at all levels of AIs (including a broad view of the foundations of AI which very much includes the type of mathematics showcased on this blog). Both groups are now hiring at all levels. From my point of view MSR is a unique place where you can devote all your time and energy to your research. The growth opportunities are truly amazing here: constant stream of world-class experts in algorithms, optimization, probability; lots of resources for travel, interns, postodcs; closeness to real-world products and possibilities to see your research directly applied. Below is the official call for application of the MLO group (the Algorithms’ one is here). Looking forward to receive your application!

Call for applications for researcher and postdoc positions in Machine Learning and Optimization

Application deadline: December 1st, 2017

The Machine Learning and Optimization Group at Microsoft Research Redmond invites applications for researcher positions at all levels (postdoc, full-time junior, full-time senior). The current full-time research members are Zeyuan Allen-Zhu, Sebastien Bubeck, Ofer Dekel, Yuval Peres, Ilya Razenshteyn, and Lin Xiao.

All applicants working on machine learning and optimization, including their algorithmic and mathematical foundations, will be considered. We expect candidates to have demonstrated excellence in research and have developed their own original research agenda. Our current areas of focus include statistical and online learning, convex and non-convex optimization, high dimensional data analysis, combinatorial optimization and its applications in AI, statistics, and probability. We are also looking to expand our domain of expertise to other areas of modern machine learning, including more applied research areas.

Microsoft Research offers wonderful resources to develop your research, opportunities for collaborations across all MSR labs and top academic institutions, and the potential to realize your ideas in products and services used worldwide. Our group is part of Microsoft Research AI, a new organization that brings together the breadth of talent across Microsoft Research to pursue game-changing advances in artificial intelligence.

Please apply directly on the Microsoft Research Careers website and include Ofer Dekel as a Microsoft Research contact. In addition, send a copy of your application to To be assured full consideration, all your materials, including at least two reference letters, should arrive by December 1st, 2017. We recommend including a brief academic research statement and links to publications.

Microsoft is an equal opportunity employer. We welcome applications from all qualified candidates, and in particular from women and underrepresented groups.

Posted in Announcement | 4 Comments