How to generalize (algorithmically)

A couple of months ago I taught an introduction to statistical learning theory. I took inspiration from two very good introductory books on SLT: “Foundations of ML”, and “Understanding Machine Learning: From Theory to Algorithms”. I also covered some classical results about nearest neighbors from the seminal book “A Probabilistic Theory of Pattern Recognition”. The lectures were recorded and I put the videos on the youtube channel. Sadly I still did not learn the basic lesson to stick to a black pen when teaching on a whiteboard, so at (many) times it is hard to read what I write. I also got sick after the first lecture so my voice is quite weak in the following lectures, making it hard to hear too… Despite these fatal flaws some people still found the material useful, so check it out if you are interested. Below I give a rough syllabus for the lectures, and I also give some notes on “algorithmic generalization” which was in my opinion the highlight of the lectures (you can also check my earlier notes on SLT here and here).


Lecture 1: Introduction to the statistical learning theory framework, its basic question (sample complexity) and its canonical settings (linear classification, linear regression, logistic regression, SVM, neural networks). Two basic methods for learning: (i) Empirical Risk Minimization, (ii) Nearest neighbor classification.

Lecture 2: Uniform law of large numbers approach to control the sample complexity of ERM (includes a brief reminder of concentration inequalities). Application: analysis of bounded regression (includes the non-standard topic of type/cotype and how it relates to different regularizations such as in LASSO).

Lecture 3: Reminder of the first two lectures and relation with the famous VC dimension. How to generalize beyond uniform law of large numbers: stability and robustness approaches (see below).

Lecture 4: How to generalize beyond uniform law of large numbers: information theoretic perspective (see below), PAC-Bayes, and online learning. Brief discussion of margin theory, and an introduction to modern questions in robust machine learning.


Some notes on algorithmic generalization

Let \mathcal{X}, \mathcal{Y} be input/output spaces. Let \ell : \mathcal{Y} \times \mathcal{Y} \rightarrow [0,1] be a loss function, \mu a probability measure supported on \mathcal{X} \times \mathcal{Y}, and \underline{h} : (\mathcal{X} \times \mathcal{Y})^m \rightarrow \mathcal{Y}^\mathcal{X} a learning rule (in words \underline{h} takes as input a dataset of m examples, and output a mapping from \mathcal{X}-inputs to \mathcal{Y}-outputs). With a slight abuse of notation, for z=(x,y) \in \mathcal{X} \times \mathcal{Y} and h \in \mathcal{Y}^{\mathcal{X}}, we write \ell(h, z) := \ell(h(x), y). We define the generalization of \underline{h} on \mu by:

    \[ \mathrm{gen}(\underline{h} , \mu) = \mathbb{E}_{S=(z_1,\hdots,z_m) \sim \mu^{\otimes m}, z \sim \mu} \left[ \ell(\underline{h}(S), z) - \frac1{m} \sum_{i=1}^m \ell(\underline{h}(S), z_i) \right] \,. \]

In words, if \mathrm{gen}(\underline{h} , \mu) \leq \epsilon then we expect the empirical performance of the learned classifier \underline{h}(S) to be representative of its performance on a fresh out-of-sample data point, up to an additive \epsilon. The whole difficulty of course is that the empirical evaluation is done with the same dataset S that is used for training, leading to non-trivial dependencies. We should also note that in many situations one might be interested in the two-sided version of the generalization, as well as high probability bounds instead of bounds in expectation. For simplicity we focus on \mathrm{gen}(\underline{h} , \mu) here.

The most classical approach to controlling generalization, which we covered in details in previous notes, is via uniform law of large numbers. More precisely assuming that the range of the learning rule is some hypothesis class \mathcal{H} one trivially has

    \[ \mathrm{gen}(\underline{h} , \mu) \leq \mathbb{E}_{S, z} \,\, \sup_{h \in \mathcal{H}} \left[ \frac1{m} \sum_{i=1}^m \ell(h, z_i) - \ell(h, z) \right] \,. \]

However this approach might be too coarse when the learning rule is searching through a potentially huge space of hypothesis (such as in the case of neural networks). Certainly such uniform bound has no chance of explaining why neural networks with billions of parameters would generalize with a data set of merely millions of examples. For this one has to use algorithm-based arguments.


The classical example of algorithmic generalization is due to Bousquet and Elisseeff 2002. It is a simple rewriting of the generalization as a stability notion:

    \begin{eqnarray*} \mathrm{gen}(\underline{h} , \mu) & = & \mathbb{E}_{S, z, i \sim [n]} \left[\ell(\underline{h}(S), z) - \ell(\underline{h}(S), z_i) \right] \\ & = & \mathbb{E}_{S, z, i \sim [n]} \left[\ell(\underline{h}(S^{-i}), z_i) - \ell(\underline{h}(S), z_i) \right] \,, \end{eqnarray*}

where S^{-i}=(z_1,\hdots, z_{i-1}, z, z_{i+1}, \hdots, z_m). This viewpoint can be quite enlightening. For example in the uniform law of large numbers view, regularization enforces small capacity, while in the stability view we see that regularization ensures that the output hypothesis is not too brittle (this was covered in some details in the previous notes).


The next approach I would like to discuss is related to deep questions about current machine learning methods. One of the outstanding problem in machine learning is that current algorithms are not robust to even mild shift of distribution at test time. Intuitively this lack of robustness seem to indicate a lack of generalization. Can we formalize this intuition? I will now give one such formal link between robustness and generalization due to Xu and Mannor 2010, which shows the reverse direction (robustness implies generalization). At some level robustness can be viewed as a “stability at test time” (while in Bousquet and Elisseeff we care about “stability at training time”).

Xu and Mannor define (K, \epsilon(\cdot))-robustness as follows: assume that \mathcal{Z} can be partitioned into K sets \{C_k\}_{k \in [K]} such that if z and z_i are in the same set C_k then

    \[ | \ell(\underline{h}(S), z_i) - \ell(\underline{h}(S), z) | \leq \epsilon(S) \,. \]

A good example to have in mind would be a binary classifier with large margin, in which case K corresponds to the covering number of \mathcal{X} at the scale given by the margin. Another (related) example would be regression with a Lipschitz function. In both cases K would be typically exponential in the dimension of \mathcal{X}. The key result of Xu and Mannor that we prove next is a generalization bound of order \sqrt{K / m}. In any situation of interest this seems to me to be a pretty weak bound, yet on the other hand I find the framework to be very pretty and it is of topical interest. I would be surprised if this was the end of the road in the space of “generalization and robustness”.

Theorem (Xu and Mannor 2010):

A (K, \epsilon(\cdot))-robust learning rule satisfies

    \[ \mathrm{gen}(\underline{h} , \mu) \leq \mathbb{E}_{S} \, \epsilon(S) + \sqrt{\frac{K}{m}} \,. \]

Proof: Let N_k = |S \cap C_k| and note that \mathbb{E}_S \, N_k = m \mu(C_k). Now one has for a robust h:

    \begin{align*} \mathbb{E}_z \ \ell(h, z) & =\sum_{k=1}^K \mu(C_k) \mathbb{E}_z [\ell(h,z) | z \in C_k] \\ & = \sum_{k=1}^K \frac{N_k}{m} \mathbb{E}_z [\ell(h,z) | z \in C_k] + \sum_{k=1}^K \left(\mu(C_k) - \frac{N_k}{m}\right) \mathbb{E}_z [\ell(h,z) | z \in C_k] \\ & \leq \epsilon(S) + \frac{1}{m} \sum_{i = 1}^m \ell(h, z_i) + \sum_{k=1}^K \left|\mu(C_k) - \frac{N_k}{m}\right| \, . \end{align*}

It only remains to observe that

    \[ \mathbb{E}_S \, \sum_{k=1}^K \left|\mu(C_k) - \frac{N_k}{m}\right| \leq \frac{1}{m} \sum_{k=1}^K \sqrt{\mu(C_k)} \leq \sqrt{\frac{K}{m}} \, . \]


Information theoretic perspective

Why do we think that a lack of robustness indicate a lack of generalization? Well it seems to me that a basic issue could simply be that the dataset was memorized by the neural network (which be a very non-robust way to learn). If true then one could basically find all the information about the data in the weights of the neural network. Again, can we prove at least the opposite direction, that is if the output hypothesis does not retain much information from the dataset then it must generalize. This is exactly what Russo and Zou 2016, where they use the mutual information \mathrm{I}(\underline{h}(S), S) as a measure of the “information” retained by the trained hypothesis about the dataset. More precisely they show the following result:

Theorem (Russo and Zou 2016):

    \[ \mathrm{gen}(\underline{h} , \mu) \leq \sqrt{\frac{\mathrm{I}(\underline{h}(S), S)}{2 m}} \, . \]

Note that here we have assumed that the codomain of the learning rule \underline{h} consists of deterministic maps from inputs to outputs, in which case the mutual information \mathrm{I}(\underline{h}(S), S) is simply the entropy \mathrm{H}(\underline{h}(S)). However the proof below also applies to the case where the codomain of the learning rule consists of probability measures, see e.g., Xu and Raginsky 2017. Let us now conclude this (long) post with the proof of the above theorem.

The key point is very simple: one can view generalization as a decoupling property by writing:

    \[ \mathrm{gen}(\underline{h} , \mu) = \mathbb{E}_{S, S' \sim \mu^{\otimes m}} [ F(\underline{h}(S), S') - F(\underline{h}(S), S) ] \,, \]

where F(h, S) = \frac{1}{m} \sum_{z \in S} \ell(h, z).

Now the theorem follows straightforwardly (if one knows Hoeffding’s lemma) from an application of the following beautiful lemma:


Let f: \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}. Let (X,Y) be random variables in \mathcal{X} \times \mathcal{Y} and \bar{X}, \bar{Y} be mutually independent copies of X and Y. Assume that f(\bar{X}, \bar{Y}) is \sigma-subgaussian (i.e., \log \mathbb{E} \exp( \lambda (f(\bar{X}, \bar{Y}) - \mathbb{E} f) ) \leq \lambda^2 \sigma^2 /2, \forall \lambda) then

    \[ \mathbb{E}[ f(X,Y) - f(\bar{X}, \bar{Y}) ] \leq \sqrt{2 \sigma^2 \mathrm{I}(X, Y)} \]

Proof: The mutual information is equal to the relative entropy between the distribution of (X,Y) and the distribution of (\bar{X}, \bar{Y}). Recall also the variational representation of the relative entropy which is that the map \mu \mapsto \mathrm{Ent}(\mu, \nu) is the convex conjugate of the log-partition function \Phi_{\nu}(g) := \log(\int \exp(g(x)) d \nu(x)). In particular one has a lower bound on the mutual information for any such g which means:

    \[ \mathrm{I}(X, Y) \geq \mathbb{E}[ g(X,Y) ] - \log \mathbb{E} [ \exp(g(\bar{X},\bar{Y})) ] \,. \]

Now it only remains to use the definition of subgaussianity, that is take g = \lambda f, and optimize over \lambda.

Posted in Machine learning | 49 Comments

Some updates

The blog has been eerily silent for most of 2018, here is why:

  • The main culprit is definitely the COLT 2018 chairing. This year we received a surprise 50% increase in number of submissions. This is great news for the community, and it led to a truly fantastic program. We are looking forward to see many of you in Stockholm! Note that the early registration deadline is tomorrow (May 30th).
  • The first volume of our new Mathematical Statistics and Learning journal just got published. If you care about being evaluated by top experts, and/or get the visibility of being in a highly selective mathematical journal, you should consider submitting your highest quality work there!
  • The first set of video lectures on the youtube channel is now complete. I realize that the quality is not perfect and that it is hard to read the blackboard. I will try to improve this for the future videos (which should be posted some time during the summer).
  • Finally I am quite excited to have a first preprint (joint work with Eric Price and Ilya Razenshteyn) at least loosely related to deep learning. It is titled “Adversarial examples from computational constraints”, which sums up the paper pretty well. In a nutshell, we prove that avoiding adversarial examples is computationally hard, even in situations where there exists a very robust classifier. We are looking forward to get the deep learning community’s feedback on this work!
Posted in Announcement | 21 Comments

Youtube channel

I just started a youtube channel. The hope is that this will be a companion to the blog. Some of the posts will “graduate” into videos (e.g., the first set of videos will correspond to an expanded version of bandit lecture notes –part 1, part 2-, and the second set will correspond to the crash course in learning theory –part 1, part 2-). But the mid/long-term plan is to also have dedicated content on the youtube channel (I still have to figure out the exact format). I personally like a lot to watch math videos online, so I hope the channel will be a nice addition to this growing material (some of my favorite existing channels include the Simons Institute, IAS, Princeton TCS, Harvard Math, IHESa new MIT data science channel, feel free to add yours in the comments below).

Posted in Announcement | 2 Comments

k-server, part 3: entropy regularization for weighted k-paging

If you have been following the first two posts (post 1, post 2), now is time to reap the rewards! I will show here how to obtain a O(\log(k))-competitive algorithm for (weighted) paging, i.e., when the metric space corresponds to the leafs of a weighted star. This was viewed as a breakthrough result 10 years ago (with a JACM publication by Bansal, Buchbinder and Naor in 2012), and for good reasons as this simplest instance of k-server was in fact the one studied in the seminal paper by Sleator and Tarjan in 1985 which introduced the competitive analysis of online algorithms (actually to be precise Sleator and Tarjan considered the unweighted case, for which a O(\log(k)) algorithm was known much before).

State space for weighted k-paging

Let w_i be the weight of the edge from leaf i \in [n] to the root. Recall from the previous post that we want to find a norm and a convex body that can express the Wasserstein-1 distance between two fractional k-server configurations.

Consider a fractional move from z \in \Delta_n(k) to z + \xi. Then clearly, the amount of mass |\xi_i| has to transfer through the edge (i, \mathrm{root}), so that the Wasserstein-1 distance is at least \sum_i w_i |\xi_i|. Furthermore there is trivially a transport plan achieving that much total mass transfer. In other words we just proved that in this case the appropriate norm is a weighted \ell_1 norm (namely \|\xi\| := \sum_{i=1}^n w_i |\xi_i|) and one can simply use the basic state space K = \{x \in [0,1]^n : \sum_i x_i = n-k\} (recall from the previous post that we have to work with anticonfiguration, and that the mapping to a configuration is simply given by z=1-x).

Applying the general mirror descent framework

Given a request at location r \in [n] and a current anticonfiguration x_0, our proposed (fractional) algorithm is to run the mirror descent dynamics with a continuous time linear cost c(t) = e_r from x(0)=x_0 (i.e., x'(t) = (\nabla^2 \Phi(x(t)))^{-1} (e_r + \lambda(t)) for some \lambda(t) \in N_K(x(t))) and until the first time t at which x_r(t) = 0 (i.e., one has a server at location r in z(t) := 1-x(t)). By the lemma at the end of the previous post one has (denote r(t) for the request being serviced at time t)

    \[ \int x_{r(t)}(t) dt \leq \mathrm{Lip}_{\|\cdot\|}(\Phi) \times \mathrm{OPT} + O(1) \]

One can think of \int x_{r(t)}(t) dt as a “virtual service cost”. In k-server this quantity has no real meaning, but the above inequality shows that this quantity, which only depends on the algorithm, is tightly related to the value of OPT (provided that \Phi has a small Lipschitz norm). Thus we see that we have two key desiderata for the choice of the mirror map \Phi: (i) it should have a small Lipschitz norm, (ii) one should be able to relate the movement cost \|x'(t)\| = \|(\nabla^2 \Phi(x(t)))^{-1} (e_r + \lambda(t))\| to this “virtual service cost” x_{r}(t), say up to a multiplicative factor \alpha. One would then obtain a \alpha \times \mathrm{Lip}_{\|\cdot\|}(\Phi)-competitive algorithm.

Entropy regularization for weighted k-paging

Let us look at (ii), and we shall see that the entropy regularization comes out very naturally. Ignore for a moment the Lagrange multiplier \lambda(t) and let us search for a separable regularizer of the form \Phi(x) = \sum_i \phi_i(x_i). We want to relate x_r(t) to w_r/\phi_i''(x_r(t)) (= (\nabla^2 \Phi(x(t)))^{-1} e_r). Making those two quantities equal gives \phi_i(x) = w_i x_i \log(x_i) and thus the regularizer is a weighted negentropy: \Phi(x) = \sum_i w_i x_i \log x_i.

We now need to verify that this relation between the virtual service cost and the movement cost remains true even when the Lagrange mutilplier \lambda(t) is taken into account. Note that because of the form of the state space K the multiplier contains a term of the form (\mu(t), \mu(t), \hdots, \mu(t)) (which corresponds to the constraint \sum_{i=1}^n x_i = n-k) and for each location a term forcing the derivative to be 0 if the value of the missing mass has reached 1. In other words we obtain the following dynamics for mirror descent with the weighted negentropy regularization:

    \begin{align*} & x_r'(t) = \frac{x_r(t)}{w_r} (-1 + \mu(t)) \\ & x_i'(t) = \frac{x_i(t)}{w_i} \mu(t) 1\{x_i(t) < 1\} \ \text{for} \ i \neq r \end{align*}

Notice that up to a factor 2 one can focus on controlling \|(x'(t))_-\| (that is only movement into a location is charged). In that view the Lagrange multipliers simply have no effect, since one has \mu(t) \geq 0 (indeed recall that \sum_{i=1}^n x_i'(t) = 0). Thus we see that the movement cost \|(x'(t))_-\| is exactly bounded by the virtual service cost x_r(t) in this case.

Making \Phi Lipschitz

It remains to deal with a non-trivial issue, namely the entropy is not Lipschitz on the simplex! A similar issue is faced in online learning when one tries to prove tracking expert regret bounds, i.e., bounds with respect to a shifting expert. The standard solution (perhaps first used by Herbster and Warmuth in 98, see also Blum and Burch 00) is to shift the variables so that they never get below some \delta>0, in which case the Lipschitz constant would be O(\log(1/\delta)). In the k-server scenario one can stop the dynamics when x_{r(t)}(t)=\delta (instead of x_{r(t)}(t)=0) provided that the mapping from x to the actual fractional configuration z is now given by z=\frac{1-x}{1-\delta}. This raises a final issue: the total amount of server mass in such a z is \frac{k}{1-\delta} > k. Next we show that if \delta is small enough then z can be “rounded” online to a fractional k-server configuration at the expense of a multiplicative O(1) movement. Precisely we show that \delta<1/(2k) is sufficient, which in turns gives a final competitive ratio of O(\log(k)) for weighted k-paging.

From k+\epsilon servers to k servers

Consider a fractional server configuration z with total mass k+\epsilon (i.e., \sum_{i=1}^n z_i = k+\epsilon), which we want to transform into a server configuration \sigma(z) with total mass k. A natural way to “round down” is at each location to put 0 if the mass was \leq \epsilon. Furthermore a mass of 1 should stay 1. This suggests the mapping \sigma(z_i) = \frac{z_i-\epsilon}{1-\epsilon} 1\{z_i > \epsilon\}, which is \frac{1}{1-\epsilon}-Lipschitz. Thus the movement of \sigma(z(t)) is controlled by the movement of z(t) up to a multiplicative factor \frac{1}{1-\epsilon}. Moreover one clearly has \sum_{i=1}^n \sigma(z_i) \leq k (in fact the inequality can be strict, in which case one should think of the “lost mass” as being stored at the root, this incurs no additional movement cost).

Posted in Theoretical Computer Science | 30 Comments

k-server, part 2: continuous time mirror descent

We continue our k-server series (see post 1 here). In this post we briefly discuss the concept of a fractional solution for k-server, which by analogy with MTS will in fact be a fractional “anti-solution”. Then we introduce the continuous time version of MTS and explain how it applies for k-server. Finally the most important part of the post is the description of the basic potential based analysis of mirror descent and how to interpret it in the context of MTS.


State representation for MTS

Recall the MTS problem: one maintains a random state \rho_t \in [n] (where [n] is equipped with a distance d), and given a new cost function c_{t+1} : [n] \rightarrow \mathbb{R}_+, one can update the state to \rho_{t+1} with the corresponding price to pay: c_{t+1}(\rho_{t+1}) + d(\rho_t, \rho_{t+1}). Equivalently one can maintain the probability distribution p_t \in \Delta_n := \{x \in \mathbb{R}_+^n : \sum_{i=1}^n x_i = 1\} of \rho_t: indeed given \rho_t, p_t, p_{t+1} one can obtain \rho_{t+1} by optimal coupling, that is the couple of random variables (\rho_t, \rho_{t+1}) minimizes the expected distance \E \ d(Y,Z) over all couples (Y,Z) such that Y has marginal p_t and Z has marginal p_{t+1} (the latter quantity is called the Wasserstein-1 distance between p_t and p_{t+1}). In this view the (expected) service cost is now a linear function, that is c_t \cdot p_t where c_t \in \mathbb{R}_+^n, and the movement cost between p_t and p_{t+1} is the Wasserstein-1 distance.

We will further assume the existence of a convex body K and a norm \|\cdot\| in \mathbb{R}^N (N \geq n) such that the Wasserstein-1 distance between two distributions x,y \in \Delta_n is equal to \|\hat{x} - \hat{y}\| where \hat{x}, \hat{y} \in K are “expanded states” with x=(\hat{x}_1,\hdots, \hat{x}_n) and y=(\hat{y}_1, \hdots, \hat{y}_n). For a weighted star metric it suffices to take K = \Delta_n, but we will see in the fourth post that for trees one indeed need to expand the state space.

Fractional solution for k-server

Recall the k-server problem: one maintains a random configuration \rho_t \in [n]^k, and given a new request r_{t+1} \in [n] one must update the configuration to \rho_{t+1} such that r_{t+1} \in \{\rho_{t+1}(1), \hdots, \rho_{t+1}(k)\}. One could equivalently maintain a distribution p_t \in \Delta_{n^k} as before. In the fractional problem one in fact only maintains the 1^{st} moment of this distribution, z_t \in \Delta_n(k) := \{z \in [0,1]^n : \sum_{i=1}^n z_i = k\}, defined by z_t(i) = \sum_{\rho \in [n]^k : i \in \{\rho(1),\hdots, \rho(k)\}} p_t(\rho) (in particular servicing a request a location i means that one must have z_t(i) = 1). Again the metric here on the z variables is the Wasserstein distance induced by d (a.k.a., the optimal transport distance). Importantly we note that this view is not equivalent to maintaining a full distribution (indeed a lot of information is lost by recording only the first moment). This raises the subtle issue of whether one can round online a fractional solution to a proper integral solution whose total movement is of the same order of magnitude. We will not touch upon this question here, and we focus on the fractional k-server problem, see for example Section 5.2 here for more. We note however that the existence of a polynomial time algorithm for this rounding task is an open problem.

To think of the request as a linear cost function (like in MTS) it is more appropriate to work with the anticonfiguration x=1-z. In this view a request r \in [n] is equivalent to a cost vector c = +\infty \times e_r. Finally like in MTS we will assume the existence of an expanded state space K \subset \mathbb{R}^N and a norm \|\cdot\| that measures movement in this expanded view.

Continuous time decision making

We will now move to a continuous time setting, where the (discrete time) sequence of cost vectors is revealed continuously as a path (c(t))_{t \in \mathbb{R}_+}, with c(t) \in \mathbb{R}_+^N (and c_i(t) \neq 0 \Rightarrow i \leq n). The decision maker’s reponse is a path (x(t))_{t \in \mathbb{R}_+} that lives in K \subset \mathbb{R}^N. In this setting the service cost of the algorithm is \int_{\mathbb{R}^+} c_t \cdot x_t dt and the movement cost is equal to \int_{\mathbb{R}^+} \|x'(t)\| dt where x'(t) is the time derivative of the path x(t). We note that there is small subtelty here to translate the continuous time service cost into a meaningful discrete time service cost, but we will not worry about this here since it does not affect the argument for k-server (where there is only a movement cost). If you are curious see the appendix here.

For k-server we will use c(t) = e_{r(t)} where r(t) is the currently requested location, and we move to the next request at the first time such that x_{r(t)}(t) = 0 (which means that z=1-x satisfies z_{r(t)}(t)=1, i.e., there is a server at the requested location.

Mirror descent

If you have never seen the mirror descent framework now is a good time to take a quick look here.

Very succintly mirror descent with mirror map \Phi : K \rightarrow \mathbb{R} can be written as follows, with a step-size \eta:

    \[ x(t+\eta) = \mathrm{argmin}_{x \in K} D_{\Phi}(x, \nabla \Phi^*(\nabla \Phi(x(t)) - \eta c(t))) \]

where we recall that \Phi^* is the Legendre-Fenchel transform of \Phi (i.e., the map whose gradient is the inverse map of the gradient of \Phi) and D_{\Phi}(x,y) = \Phi(x) - \Phi(y) - \nabla \Phi(y) \cdot (x-y) is the Bregman divergence associated to \Phi.

We now want to take \eta to 0 in the above definition to find the continuous time mirror descent update. For that let us recall the first order optimality condition for constrained optimization. Denote N_K(x) = \{\theta : \theta \cdot (y-x) \leq 0, \ \forall y \in K\} for the normal cone of K at x, i.e., the set of directions which are negatively correlated with any direction going into the body. One then has for any convex function f,

    \[ x^* \in \mathrm{argmin}_{x \in K} f(x) \Leftrightarrow -\nabla f(x^*) \in N_K(x^*) \]

In particular we see that (note that \nabla_x D_{\Phi}(x,y) = \nabla \Phi(x) - \nabla \Phi(y))

    \[ \nabla \Phi(x(t+\eta)) - \nabla \Phi(x(t)) + \eta c(t) \in - N_{K}(x(t+\eta)) \]

and thus taking \eta to 0 we morally get

    \[ \nabla^2 \Phi(x(t)) x'(t) \in - c(t) - N_{K}(x(t)) \]

This type of equation is known as a differential inclusion, and with the added constraint that the path must live in the constraint set K we get a viability problem. In our paper we show that a solution indeed exist (and is unique) under mild assumptions on \Phi.

Potential based analysis

The mirror descent algorithm is fully described by:

    \[ \nabla^2 \Phi(x(t)) x'(t) = - c(t) - \lambda(t), \; \lambda(t) \in N_{K}(x(t)) \]

Denoting \hat{D}_{\Phi}(y ; x) = - \Phi(x) - \nabla \Phi(x) \cdot (y -x ) we see that for any fixed y,

    \begin{eqnarray*} \partial_t \hat{D}_{\Phi}(y ; x(t)) & = & - \nabla^2 \Phi(x(t)) x'(t) \cdot (y - x(t)) \\ & = & (c(t) + \lambda(t)) \cdot (y - x(t)) \\ & \leq & c(t) \cdot (y - x(t)) \end{eqnarray*}

The above calculation is the key to understand mirror descent: it says that if the algorithm is currently paying more than y, i.e., c(t) \cdot (y - x(t)) <0, then it is actually getting closer to y in the sense that \hat{D}_{\Phi}(y ; x(t)) is decreasing. Put differently: when the algorithm pays, it also learns. This key insight is sufficient for online learning, where one competes against a fixed point y. However in MTS and k-server we compete against a path y(t), and thus we also need to evaluate by how much the Bregman divergence can go up when y is moving. This is captured by the following calculation:

    \begin{eqnarray*} \partial_t \hat{D}_{\Phi}(y(t) ; x) & = & - \nabla \Phi(x(t)) \cdot y'(t) \\ & \leq & \mathrm{Lip}_{\|\cdot\|}(\Phi) \times \|y'(t)\| \end{eqnarray*}

Putting together the two above calculations we obtain the following control on the service cost of mirror descent in terms of the service cost and movement cost of the optimal path:

Lemma: The mirror descent path (x(t))_{t \geq 0} satisfies for any comparator path (y(t))_{t \geq 0},

    \[ \int c(t) \cdot x(t) dt \leq \mathrm{Lip}_{\|\cdot\|}(\Phi) \int \|y'(t)\| dt + \int c(t) \cdot y(t) dt + O(1) \]

At this point the big question is: how do we control the movement of mirror descent? In the next post we will see how this plays out on a weighted star.


Posted in Theoretical Computer Science | 1 Comment

k-server, part 1: online learning and online algorithms

The k-server problem is a classical and very attractive instance of online decision making. The decisions to be made in this problem are simple: given a requested location in some finite metric space and a fleet of k servers currently sitting in this metric space, one has to choose which server to use to service the request (that is the server will move from its current location to the requested location). The problem is to minimize the total amount of movement of the servers on a stream of such requests. This formulation is both abstract enough that it can model many real-life problems (e.g., virtual memory management) yet concrete/simple enough that it seems that everything ought to be understood about it.

In the coming series of 5 blog posts I will explain the main ideas behind our k-server preprint with Michael Cohen, James Lee, Yin Tat Lee, and Aleksander Madry. In this first post I will briefly put the problem in the broader context of online learning and online algorithms, which will be helpful as it suggests an approach based on the mirror descent algorithm. In the next post I will explain the general framework of mirror descent and how it can be applied to a problem such as k-server. The third post will show how to use this general framework to recover effortlessly the state of the art for weighted k-paging (i.e., when the underlying metric space is a weighted star). The fourth post will show how to extend the analysis to arbitrary trees. Finally in the fifth post we will discuss both classical embedding techniques to reduce the problem to the tree case, as well as our new dynamic embedding based on multiscale restarts on the classical technique. The content of the first three posts was essentially covered in my TCS+ seminar talk.

Online decision making: two probability free models

I will start by introducing a very general framework for online algorithms due to Borodin, Linial and Saks 92, called metrical task systems (MTS). At the same time I will recall the online learning framework, so that one can see the similarities/differences between the two settings. In fact this connection was already made at the end of the Nineties, see Blum and Burch 00. At the time it was natural to explore the power of multiplicative weights. With today’s perspective it is natural to explore the much more general mirror descent algorithm (we note that the closely related concept of regularization was already brought to bear for these problems in Abernethy, Bartlett, Buchbinder and Stanton 10, and Buchbinder, Chen and Naor 14).

Let X be a set which we think of as either a state space (for online algorithms) or an action space (for online learning). Let d be a metric on X. Finally let \mathcal{C} \subset \mathbb{R}^{X} be a set of possible cost functions (e.g., arbitrary bounded functions, or linear functions if X \subset \mathbb{R}^N, or convex functions, etc). Online decision making is about making decisions in an uncertain environment (the future is not known), and here we model this uncertain environment as an unknown sequence of cost functions c_1, \hdots, c_T \in \mathcal{C}. The interaction protocol can then be described as follows: For each t \in [T], the player chooses (possibly randomly) x_t \in X based on past observations (to be made precise soon) and pays the cost c_t(x_t) \in \mathbb{R}, plus possibly a movement penalty: d(x_{t-1}, x_t).

There are now two key differences between online algorithms and online learning: (i) the observation model, and (ii) the performance metric. In online learning one assumes that the cost c_t is unknown at the decision time, and is only revealed after the decision is made (in bandits an even weaker signal is revealed, namely only the paid cost c_t(x_t)). The type of applications one has in mind is say online recommendations, where a user’s preference is only revealed once some item has been recommended to her. On the other hand in online algorithms the cost c_t is known at decision time. In this context the cost corresponds to a request, and the player has to find a way to satisfy this request (in the language of MTS the cost represents a task, and one gets the option to update its state so as to complete this task more efficiently, i.e., at a lower cost). Now let us discuss the performance metric. In online learning one assumes that there is some “hidden” good action (it is hidden in the noise, i.e., a single cost function does not say much about which actions are good, but if one takes the whole sequence into account then there is some good action that hopefully emerges). Thus it is natural to consider the following regret notion:

    \[ \sum_{t=1}^T c_t(x_t) - \min_{x \in X} \sum_{t=1}^T c_t(x) \]

This regret notion does not make sense for online algorithms where one may have to keep changing states to satisfy the request sequence. There one must compare to the best offline strategy, in which case additive guarantees are not attainable and one resorts to a multiplicative guarantee, the so-called competitive ratio:

    \[ \sum_{t=1}^T \left( c_t(x_t) + d(x_t,x_{t-1}) \right) / \min_{x^* \in X^T} \sum_{t=1}^T \left( c_t(x_t^*) + d(x_t^*,x_{t-1}^*) \right) ~. \]

(Note that in MTS one always assumes nonnegative cost functions so that the multicative guarantee makes sense.) The k-server problem, introduced in Manasse, McGeoch and Sleator 90, corresponds to a metrical task system on the product state space X^k equipped with the Earthmover distance inherited by (X,d), and with cost functions

    \[ \mathcal{C} = \{X^k \ni x \mapsto + \infty \times 1\{ r \not\in \{x(1),\hdots,x(k)\} \}; \ r \in X \} ~. \]

Known results

The online learning setting is by now fairly well-understood. We know that the optimal regret for a finite action set |X| = n and bounded cost functions is O(\sqrt{T \log(n)}) (see e.g., Cesa-Bianchi, Freund, Haussler, Helmbold, Schapire and Warmuth 97). In fact we even know the optimal constant (for large n and T), and we have both algorithmic and information theoretic proofs of this result. Moreover we know how to leverage combinatorial information, e.g., when X \subset \{0,1\}^d and \mathcal{C} is a set of linear functions. A unifying principle that was put forth in the last decade or so is the mirror descent strategy.

On the other hand the situation for MTS is much worse. The optimal guarantees for finite sets are not known: the worst case (over all metric spaces of size n) competitive ratio is known to be \Omega(\log(n)) (trivial coupon-collector style lower bound on the uniform metric) and O(\log^2(n) \log\log(n)) (the latter bound is due to Fiat and Mendel 03). No information theoretic analysis is known, even for the uniform metric. With combinatorial structure the situation becomes even more disastrous. That’s where the k-server problem comes in, as a combinatorial MTS striking the best balance between simplicity and importance. For k-server it is conjectured that one could get a competitive ratio of O(\log(k)) (the coupon-collector lower bound on uniform metric gives here \Omega(\log(k))), while the best result prior to our work was O(\log^3(n) \log^2(k)) (due to Bansal, Buchbinder, Naor and Madry 11), and O(k) if one insists for a bound independent of n (due to Koutsoupias and Papadimitriou 95).

Posted in Theoretical Computer Science | 4 Comments

Algorithms, Machine Learning, and Optimization: we are hiring!

As some of you already know there has been some movement at MSR lately, specifically for the theory group. We have now branched out into two groups, one in Machine Learning and Optimization -MLO- (with Zeyuan Allen-Zhu, myself, Ofer Dekel, Yuval Peres, Ilya Razenshteyn, Lin Xiao, and until recently postdoc Yin Tat Lee) and one in Algorithms (Nikhil Devanur, Sergey Yekhanin, postdocs Janardhan Kulkarni and Xiaorui Sun). The MLO group is part of the new MSR AI organization headed by Eric Horvitz, with grand goals at all levels of AIs (including a broad view of the foundations of AI which very much includes the type of mathematics showcased on this blog). Both groups are now hiring at all levels. From my point of view MSR is a unique place where you can devote all your time and energy to your research. The growth opportunities are truly amazing here: constant stream of world-class experts in algorithms, optimization, probability; lots of resources for travel, interns, postodcs; closeness to real-world products and possibilities to see your research directly applied. Below is the official call for application of the MLO group (the Algorithms’ one is here). Looking forward to receive your application!

Call for applications for researcher and postdoc positions in Machine Learning and Optimization

Application deadline: December 1st, 2017

The Machine Learning and Optimization Group at Microsoft Research Redmond invites applications for researcher positions at all levels (postdoc, full-time junior, full-time senior). The current full-time research members are Zeyuan Allen-Zhu, Sebastien Bubeck, Ofer Dekel, Yuval Peres, Ilya Razenshteyn, and Lin Xiao.

All applicants working on machine learning and optimization, including their algorithmic and mathematical foundations, will be considered. We expect candidates to have demonstrated excellence in research and have developed their own original research agenda. Our current areas of focus include statistical and online learning, convex and non-convex optimization, high dimensional data analysis, combinatorial optimization and its applications in AI, statistics, and probability. We are also looking to expand our domain of expertise to other areas of modern machine learning, including more applied research areas.

Microsoft Research offers wonderful resources to develop your research, opportunities for collaborations across all MSR labs and top academic institutions, and the potential to realize your ideas in products and services used worldwide. Our group is part of Microsoft Research AI, a new organization that brings together the breadth of talent across Microsoft Research to pursue game-changing advances in artificial intelligence.

Please apply directly on the Microsoft Research Careers website and include Ofer Dekel as a Microsoft Research contact. In addition, send a copy of your application to To be assured full consideration, all your materials, including at least two reference letters, should arrive by December 1st, 2017. We recommend including a brief academic research statement and links to publications.

Microsoft is an equal opportunity employer. We welcome applications from all qualified candidates, and in particular from women and underrepresented groups.

Posted in Announcement | Comments Off on Algorithms, Machine Learning, and Optimization: we are hiring!

COLT 2018 call for papers

Philippe Rigollet and myself will be the program chairs for this year’s edition of COLT. It will be in Stockholm in July, which I hear is absolutely gorgeous at that time of the year. We also have a fantastic lineup of invited speakers: Stephane Mallat, Susan Murphy, and Johan Hastad. We believe we will also guarantee high quality reviews for your submissions as we have assembled a great (and extended) program committee to make sure that we cover all aspects of modern learning theory (in particular we expect the number of subreviewers to be minimal). See below for the PC, as well as the call for papers. Looking forward to read all your submissions in February!

Program Committee:

Jacob Abernethy (Georgia Tech)
Shivani Agarwal (University of Pennsylvania)
Shipra Agrawal (Columbia University)
Alexandr Andoni (Columbia University)
Pranjal Awasthi (Rutgers University)
Francis Bach (INRIA)
Nina Balcan (Carnegie Mellon University)
Afonso Bandeira (New-York University)
Mihail Belkin (Ohio State University)
Shai Ben-David (University of Waterloo)
Quentin Berthet (University of Cambridge)
Alina Beygelzimer (Yahoo! Research)
Avrim Blum (TTI Chicago)
Guy Bresler (MIT)
Olivier Cappé (Université Paris-Saclay)
Constantine Caramanis (UT Austin)
Alexandra Carpentier (OvGU Magdeburg)
Nicolo Cesa-Bianchi (University of Milan)
Arnak Dalalyan (ENSAE)
Amit Daniely (Hebrew University)
Ronen Eldan (Weizman Institute)
Tim van Erven (Leiden University)
Vitaly Feldman (IBM Research)
Aurelien Garivier (University Paul-Sabatier)
Rong Ge (Duke University)
Claudio Gentile (Universita degli Studi dell’Insubria)
Steve Hanneke
Elad Hazan (Princeton University)
Daniel Hsu (Columbia University)
Prateek Jain (Microsoft Research)
Satyen Kale (Google)
Varun Kanade (Oxford University) Vladimir Koltchinskii (Georgia Tech)
Wouter Koolen (CWI, Amsterdam)
Tomer Koren (Google)
John Lafferty (Yale University)
Po-Ling Loh (UW Madison)
Gabor Lugosi (ICREA and Pompeu Fabra University)
Shie Mannor (Technion)
Yishay Mansour (Tel Aviv University and Google)
Ankur Moitra (MIT)
Robert Nowak (UW Madison)
Vianney Perchet (ENS Paris-Saclay and Criteo)
Alexandre Proutiere (KTH)
Luis Rademacher (UC Davis)
Maxim Raginsky (University of Illinois)
Sasha Rakhlin (University of Pennsylvania)
Lorenzo Rosasco (MIT and Universita’ di Genova)
Robert Schapire (Microsoft Research)
Ohad Shamir (Weizman Institute)
David Steurer (ETH Zurich)
Suvrit Sra (MIT)
Nathan Srebro (TTI Chicago)
Karthik Sridharan (Cornell University)
Csaba Szepesvari (University of Alberta)
Matus Telgarsky (University of Illinois)
Ambuj Tewari (University of Michigan)
Alexandre Tsybakov (ENSAE-CREST)
Ruth Urner (Max Planck Institute)
Santosh Vempala (Georgia Tech)
Roman Vershynin (UC Irvine)
Manfred Warmuth (UC Santa Cruz)
Yihong Wu (Yale University)


Call for papers:

The 31st Annual Conference on Learning Theory (COLT 2018) will take place in Stockholm, Sweden, on July 5-9, 2018 (with a welcome reception on the 4th), immediately before ICML 2018, which takes place in the same city. We invite submissions of papers addressing theoretical aspects of machine learning and related topics. We strongly support a broad definition of learning theory, including, but not limited to:
-Design and analysis of learning algorithms
-Statistical and computational complexity of learning
-Optimization methods for learning
-Unsupervised, semi-supervised, online and active learning
-Interactions with other mathematical fields
-Interactions with statistical physics
-Artificial neural networks, including deep learning
-High-dimensional and non-parametric statistics
-Learning with algebraic or combinatorial structure
-Geometric and topological data analysis
-Bayesian methods in learning
-Planning and control, including reinforcement learning
-Learning with system constraints: e.g. privacy, memory or communication budget
-Learning from complex data: e.g., networks, time series, etc.
-Learning in other settings: e.g. social, economic, and game-theoretic

Submissions by authors who are new to COLT are encouraged. While the primary focus of the conference is theoretical, the authors may support their analysis by including relevant experimental results.
All accepted papers will be presented in a single track at the conference. At least one of each paper’s authors should be present at the conference to present the work. Accepted papers will be published electronically in the Proceedings of Machine Learning Research (PMLR). The authors of accepted papers will have the option of opting-out of the proceedings in favor of a 1-page extended abstract. The full paper reviewed for COLT will then be placed on the arXiv repository.

Paper Awards:

COLT will award both best paper and best student paper awards. To be eligible for the best student paper award, the primary contributor(s) must be full-time students at the time of submission. For eligible papers, authors must indicate at submission time if they wish their paper to be considered for a student paper award. The program committee may decline to make these awards, or may split them among several papers.

Dual Submissions:

Submissions that are substantially similar to papers that have been previously published, accepted for publication, or submitted in parallel to other peer-reviewed conferences with proceedings may not be submitted to COLT. The same policy applies to journals, unless the submission is a short version of a paper submitted to a journal, and not yet published. Authors must declare such dual submissions either through the Easychair submission form, or via email to the program chairs.


Submissions are limited to 12 PMLR-formatted pages, plus unlimited additional pages for references and appendices. All details, proofs and derivations required to substantiate the results must be included in the submission, possibly in the appendices. However, the contribution, novelty and significance of submissions will be judged primarily based on the main text (without appendices), and so enough details, including proof details, must be provided in the main text to convince the reviewers of the submissions’ merits. Formatting and submission instructions can be found here.

Rebuttal Phase:

As in previous years, there will be a rebuttal phase during the review process. Initial reviews will be sent to authors before final decisions have been made. Authors will have the opportunity to provide a short response on the PC’s initial evaluation.

Open Problems Session:

We also invite submission of open problems. A separate call for open problems will be made available on the conference website.

Important Dates:

-Paper submission deadline: February 16, 2018, 11:00 PM EST
-Author feedback: April 18-21, 2018
-Author notification: May 2, 2018
-Conference: July 5-9, 2018 (welcome reception on the 4th)


Papers should be submitted through EasyChair at


Posted in Conference/workshop | Comments Off on COLT 2018 call for papers

Michael B. Cohen

This is an incredibly difficult post to write. Michael Benjamin Cohen, an amazing student and person passed away this week in Berkeley. Below is the official MSR statement (where he spent the summer together with many friends) and some personal remembrance. Also this is a picture of one of the many many happy moments we had together this summer at MSR (with Yin Tat Lee too):

MSR statement:

The Microsoft Research family is shocked and deeply saddened by the tragic loss of our colleague and friend Michael Cohen.  Michael was a brilliant mathematician and a rising star in his field.  He spent this past summer with us as a Microsoft Research Fellow, doing what he loved most. Over the summer, he made sweeping progress in online learning and online algorithms, two fields he had just recently become acquainted with. In addition to solving five open problems in these areas, he continued his substantial progress on the k-server problem, one of the most celebrated and notoriously difficult challenges in the space of adaptive algorithms.

Michael was a truly exceptional individual.  We will remember Michael for his infectious smile and his larger-than-life personality. We will never forget his unrelenting curiosity, his thirst for knowledge, and his deep love for mathematics and theoretical foundations of computing.  We are stunned by his loss.  We will hold onto our memories of Michael, and know that his ideas and scientific accomplishments will continue on as important advances.

We extend our most sincere condolences to Michael’s family and friends.

Personal notes:

I have known Michael for less than a year, but in that short time span he had a profound impact on me, like only a handful of people have had in my scientific life. I will always remember our first meeting, on October 26th 2016 at MIT. We were about to start lunch with a small group of graduate students and Michael entered the room, he (gently) interrupted the conversation and his first sentence to me was a question about mirror descent that I was not able to answer (we now know the answer, and as it turns out his question was pretty deep and the answer highly non-trivial). This is a typical Michael story, and I’m sure anyone who has interacted with him had similar experiences, where a seemingly innocuous question (at a perhaps inopportune time) turned out to be extremely deep and interesting.

Michael’s way of doing mathematics was truly remarkable. He was relentlessly searching for the right way to look at a problem, and he was never satisfied with a hacky explanation. As far as I know he never wrote anything on paper while doing research. I assume that if the calculations were too long to fit in his head then it was probably the wrong approach anyway. Doing research with him was an immense pleasure. He was always willing to share his new and original point of view on his current readings. He was also interested in a breadth of topics that is only matched by a tiny fraction of researchers in theoretical computer science. As James Lee shared in his comment on Luca’s blog, I too was feeling that all of this was just warm-up for him. I am incredibly sad to have lost a friend and a great collaborator, but I’m also sad that we will never see Michael’s next steps.

Beyond his mathematical gift he was also a kind and gentle human being. Perhaps it was not obvious to see at first sight with his exuberant personality, but he was compassionate and caring.

We truly lost one of the best. It is really hard to cope with the unfairness of these events. We will do our best to share Michael’s wonderful ideas with the world, and to make his scientific accomplishements live on. Needless to say my heart breaks for his family and all my thoughts are with them at this moment.

Posted in Uncategorized | 7 Comments

Smooth distributed convex optimization

A couple of months ago we (Kevin Scaman, Francis Bach, Yin Tat Lee, Laurent Massoulie and myself) uploaded a new paper on distributed convex optimization. We came up with a pretty clean picture for the optimal oracle complexity of this setting, which I will describe below. I should note that there are hundreds of papers on this topic, but the point of the post is to show our simple cute calculations and not to survey the immense literature on distributed optimization, see the paper itself for a number of pointers to other recent works.

Distributed optimization setting

Let G=(V,E) be an undirected graph on n vertices (|V| =n) and with diameter \Delta. We will think of the nodes in G as the computing units. To each vertex v \in V there is an associated convex function f_v : \mathbb{R}^d \rightarrow \mathbb{R}. For machine learning applications one can think that each computing unit has access to a “private” dataset, and f_v(x) represents the fit of the model corresponding to x \in \mathbb{R}^d on this dataset (say measured on least squares loss, or logistic loss for example). The goal will be to find in a distributed way the optimal “consensus” point:

    \[x^* \in \mathrm{argmin}_{x \in \mathbb{R}^d} \sum_{v \in V} f_v(x) .\]

The distributed processing protocol is as follows: asynchronously/in parallel, each node v can (i) compute a (local) gradient \nabla f_i in time 1, and (ii) communicate a vector in \mathbb{R}^d to its neighbors in G in time \tau. We denote by x_{v,t} the local model (essentially its guess for x^*) of node v at time t \in \mathbb{R}_+. We aim to characterize the smallest time T_{\epsilon} such that one can guarantee that all nodes v satisfy \bar{f}(x_{v,T}) - \bar{f}(x^*) \leq \epsilon where \bar{f} = \frac1n \sum_{v \in V} f_v.

We focus on the case where \bar{f} is \beta-smooth and \alpha-strongly convex (\kappa =\beta/\alpha is the condition number), which is arguably the most challenging case since one expects linear convergence (i.e., the scaling of T_{\epsilon} in \epsilon should be \log(1/\epsilon)) which a priori makes the interaction of optimization error and communication error potentially delicate (one key finding is that in fact it is not delicate!). Also, having in mind applications outside of large-scale machine learning (such as “federated” learning), we will make no assumptions about the functions at different vertices relate to each other.

A trivial answer

Recall that Nesterov’s accelerated gradient descent solves the serial problem in time O(\sqrt{\kappa} \log(1/\epsilon)). Trivially one can distribute a step of Nesterov’s accelerated gradient descent in time O(\tau \Delta) (simply designate a master node at the beginning, and everybody sends its local gradient to the master node in time O(\tau \Delta)). Thus we arrive at the upper bound T_{\epsilon} = O((1+\tau \Delta) \sqrt{\kappa} \log(1/\epsilon)) using a trivial (centralized) algorithm. We now show (slightly informally, see the paper for proper definitions) that this in fact optimal!

First let us recall the lower bound proof in the serial case (see for example Theorem 3.15 here). The idea is to introduce the function f(x) = \beta L[x,x] - 2 \alpha x(1) + \alpha |x|^2 where L is the Laplacian of the path graph on \N, or in other words

    \[L[x,x] = \sum_{i \in \mathbb{N}} (x(i) - x(i+1))^2 .\]

First it is easy to see that this function is indeed O(\beta)-smooth and \Omega(\alpha)-strongly convex. The key point is that, for any algorithm starting at x_0=0 and such that each iteration stays in the linear span of the previously computed gradients (a very natural assumption) then

    \[x_t(i) = 0 \;\; \text{for all} \;\; i > t ~.\]

In words one can say that each gradient calculation “discovers” a new edge of the path graph involved in the definition of f. Concluding the serial proof is then just a matter of brute force calculations.

Now let us move to the distributed setting, and consider two vertices u and v that realize the diameter of G. The idea goes as follows: let L_1 (respectively L_2) be the Laplacian of even edges of the path graph on \mathbb{N} (respectively the odd edges), that is

    \[L_1[x,x] = \sum_{i \in \mathbb{N} : i \; \text{even}} (x(i) - x(i+1))^2 .\]

Now define \frac{1}{n} f_u = \beta L_1[x,x] - 2 \alpha x(1) + \frac{\alpha}{2} |x|^2, \frac{1}{n} f_v = \beta L_2[x,x] + \frac{\alpha}{2} |x|^2, and f_w = 0 for any w \not\in \{u,v\}. The key observation is that node u does not “know” about the even edges until it receives a message from v and vice versa. Thus it fairly easy to show that in this case one has:

    \[x_{u,t}(i) = 0 \;\; \text{for all} \;\; i > t / (1+\tau \Delta) ~,\]

which effectively amounts to a slowdown by a factor (1+\tau \Delta) compared to the serial case and proves the lower bound T_{\epsilon} = \Omega((1+\tau \Delta) \sqrt{\kappa} \log(1/\epsilon).

Not so fast!

One can say that the algorithm proposed above defeats a bit the purpose of the distributed setting. Indeed the centralized communication protocol it relies on is not robust to various real-life issues such as machine failures, time-varying graphs, edges with different latency, etc. An elegant and practical solution is to restrict communication to be gossip-like. That is local computations have now to be communicated via matrix multiplication with a walk matrix W which we define as satisfying the following three conditions: (i) W(i,j) \neq 0 \Rightarrow (i,j) \in E, (ii) \mathrm{Ker}(W) =\mathrm{Span}((1,\hdots,1)), and (iii) W \succeq 0. Let us briefly discuss these conditions: (i) simply means that if x \in \mathbb{R}^V represents real values stored at the vertices, then W x can be calculated with a distributed communication protocol; (ii) says that if there is consensus (that is all vertices hold the same value) then no communication occurs with this matrix multiplication; and (iii) will turn out to be natural in a just a moment for our algorithm based on duality. A prominent example of a walk matrix would be the (normalized) Laplacian of G

We denote by \gamma the inverse condition number of W on (1,\hdots,1)^{\perp} (that is the ratio of the smallest non-zero eigvenvalue of W to its largest eigenvalue), also known as the spectral gap of G when W is the Laplacian. Notice that \gamma naturally controls the number of gossip steps to reach consensus, in the sense that gossip steps corresponds to gradient descent steps on f(x) = x^{\top} W x, which will converge in O(\gamma^{-1} \log(1/\epsilon)) steps. Doing an “accelerated gossip” (also known as Chebyshev gossiping) one could thus hope to essentially replace the diameter \Delta by O(\gamma^{-1/2} \log(1/\epsilon)). Notice that this is hopeful thinking because in the centralized model \Delta steps gets you to an exact consensus, while in the gossip model one only reaches an \epsilon-approximate consensus and errors might compound. In fact with a bit of graph theory one can immediately see that simply replacing \Delta by O(\gamma^{-1/2} \log(1/\epsilon)) is too good to be true: there are graphs (namely expanders) where \Delta is of order \log(n) while \gamma is of order of a constant, and thus an upper bound of the form (say) \Omega((1+\tau \gamma^{-1/2} \log(1/\epsilon)) \sqrt{\kappa} \log(1/\epsilon) would violate our previous lower bound by a factor 1/\log(n).

To save the day we will make extra assumptions, namely that each local function f_v has condition number \kappa and that in addition to computing local gradient the vertices can also compute local gradients of the Fenchel dual functions f_v^*. The latter assumption can be removed at the expense of extra logarithmic factors but we will ignore this point here (see the paper for some hints as well as further discussion on this point). For the former assumption we note that the lower bound proof given above completely breaks under this assumption. However one can save the construction for some specific graphs (finding the correct generalization to arbitrary graphs is one of our open problems). For example imagine a line graph, and cluster the vertices into three groups, the first third, the middle, and the last third. Then one could distribute the even part of the Laplacian on \mathbb{N} in the first group, and the odd part on the last group, as well as distribute the Euclidean norm evenly among all vertices. This construction verifies that each vertex function has condition number O(\kappa) and furthermore the rest of the argument still goes through. Interestingly in this case one also has \Delta \approx 1/\sqrt{\gamma} and thus this proves that for the line graph one has T_{\epsilon} = \Omega((1+\tau \gamma^{-1/2}) \sqrt{\kappa} \log(1/\epsilon) for gossip algorithms. We will now show a matching upper bound (which holds for arbitrary graphs).

Dual algorithm

For X \in \mathbb{R}^{d \times n} (which we think of as a set of column vectors, one for each vertex i \in V = [n]), denote X_i for the i^{th} column and let F(X) := \sum_{i =1}^n f_i(X_i). We are interested in minimizing F under the constraint that all columns are equal, which can be written as X \sqrt{W} = 0. By definition of the Fenchel dual F^* and a simple change of variable one has:

    \[\min_{X \in \mathbb{R}^{d\times n} : X \sqrt{W} = 0} F(X) = \max_{\lambda \in \mathbb{R}^{d \times n}} - F^*(\lambda \sqrt{W}) .\]

Next observe that gradient ascent on \lambda can be written as

    \[\lambda^+ = \lambda + \eta \nabla F^*(\lambda \sqrt{W}) \sqrt{W}\]

and with the notation y = \lambda \sqrt{W} this is simply y^+ = y - \eta \nabla F^*(y) W. Crucially \nabla F^*(y) W exactly corresponds to gossiping the local conjugate gradients (which are also the local models) \nabla f_i^*(y_{i,t}) =: X_{i,t}. In other words we only have to understand the condition number of the function \lambda \mapsto F^*(\lambda \sqrt{W}). The beauty of all of this is that this condition number is precisely \kappa / \gamma (i.e. it naturally combines the condition number of the vertex functions with the “condition number” of the graph). Thus by accelerating gradient ascent we arrive at a time complexity of O((1+\tau) \sqrt{\kappa / \gamma} \log(1/\epsilon)) (recall that a gossip step takes time \tau). We call the corresponding algorithm SSDA (Single-Step Dual Accelerated). One can improve it slightly in the case of low communication cost by doing multiple rounds of communication between two gradient computations (essentially replacing W by W^k). We call the corresponding algorithm MSDA (Multi-Step Dual Accelerated) and its attains the optimal (in the worst case over graphs) complexity of O((1+\tau \gamma^{-1/2}) \sqrt{\kappa} \log(1/\epsilon)).

Posted in Optimization | 2 Comments