I just started a YouTube channel. The hope is that it will be a companion to the blog. Some of the posts will “graduate” into videos (e.g., the first set of videos will correspond to an expanded version of the bandit lecture notes –part 1, part 2–, and the second set will correspond to the crash course in learning theory –part 1, part 2–). But the mid/long-term plan is also to have dedicated content on the YouTube channel (I still have to figure out the exact format). I personally very much enjoy watching math videos online, so I hope the channel will be a nice addition to this growing body of material (some of my favorite existing channels include the Simons Institute, IAS, Princeton TCS, Harvard Math, IHES, and a new MIT data science channel; feel free to add yours in the comments below).
If you have been following the first two posts (post 1, post 2), now is the time to reap the rewards! I will show here how to obtain an O(log k)-competitive algorithm for (weighted) paging, i.e., when the metric space corresponds to the leaves of a weighted star. This was viewed as a breakthrough result 10 years ago (with a JACM publication by Bansal, Buchbinder and Naor in 2012), and for good reason, as this simplest instance of k-server was in fact the one studied in the seminal paper by Sleator and Tarjan in 1985 which introduced the competitive analysis of online algorithms (actually, to be precise, Sleator and Tarjan considered the unweighted case, for which a k-competitive algorithm was known much before the weighted result).
State space for weighted k-paging
Let w_i be the weight of the edge from leaf i to the root. Recall from the previous post that we want to find a norm and a convex body that can express the Wasserstein-1 distance between two fractional k-server configurations.
Consider a fractional move from x to x'. Then clearly, an amount of mass |x_i − x'_i| has to transfer through the edge i, so that the Wasserstein-1 distance is at least Σ_i w_i |x_i − x'_i|. Furthermore there is trivially a transport plan achieving exactly that much total mass transfer (route all surplus and deficit through the root). In other words we just proved that in this case the appropriate norm is a weighted ℓ1 norm (namely ‖h‖ = Σ_i w_i |h_i|) and one can simply use the basic state space K = {x ∈ [0,1]^n : Σ_i x_i = n − k} (recall from the previous post that we have to work with anticonfigurations, and that the mapping to a configuration is simply given by y_i = 1 − x_i).
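To make the weighted-ℓ1 identity concrete, here is a minimal Python sketch (the function and variable names are mine, not from the post). Every unit of mass that leaves or enters leaf i must cross the edge of weight w[i], and routing everything through the root achieves exactly this lower bound.

```python
def star_w1(w, x, xp):
    """Wasserstein-1 distance between two mass vectors on the leaves of a
    weighted star with edge weights w: it equals the weighted l1 distance."""
    assert abs(sum(x) - sum(xp)) < 1e-9, "transport requires equal total mass"
    return sum(wi * abs(a - b) for wi, a, b in zip(w, x, xp))

w  = [1.0, 2.0, 5.0]   # weight of the edge from leaf i to the root
x  = [0.5, 1.0, 0.5]
xp = [1.0, 0.0, 1.0]
print(star_w1(w, x, xp))  # 0.5*1 + 1.0*2 + 0.5*5 = 5.0
```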
Applying the general mirror descent framework
Given a request at location r and a current anticonfiguration x(t), our proposed (fractional) algorithm is to run the mirror descent dynamics with a continuous time linear cost c(t) = e_r from x(t) and until the first time at which x_r = 0 (i.e., one has a server at location r in y = 1 − x). By the lemma at the end of the previous post one has (denote r(t) for the request being serviced at time t)

∫ x_{r(t)}(t) dt ≤ 2 L · OPT + O(1),

where L denotes the Lipschitz norm of the mirror map: apply the lemma with the comparator u given by the optimal path, whose service cost ∫ u_{r(t)}(t) dt vanishes since it always keeps a server at the currently requested location.
One can think of ∫ x_{r(t)}(t) dt as a “virtual service cost”. In k-server this quantity has no real meaning, but the above inequality shows that this quantity, which only depends on the algorithm, is tightly related to the value of OPT (provided that the mirror map has a small Lipschitz norm). Thus we see that we have two key desiderata for the choice of the mirror map Φ: (i) it should have a small Lipschitz norm L, and (ii) one should be able to relate the movement cost to this “virtual service cost” ∫ x_{r(t)}(t) dt, say up to a multiplicative factor α. One would then obtain an O(α L)-competitive algorithm.
Entropy regularization for weighted k-paging
Let us look at (ii), and we shall see that the entropy regularization comes out very naturally. Ignore for a moment the Lagrange multipliers and let us search for a separable regularizer of the form Φ(x) = Σ_i φ_i(x_i). The (unconstrained) dynamics read φ_i''(x_i) x_i' = −c_i, so on a request at location i we want to relate the instantaneous movement cost w_i |x_i'| = w_i / φ_i''(x_i) to the virtual service cost x_i. Making those two quantities equal gives φ_i''(x) = w_i / x and thus the regularizer is a weighted negentropy: Φ(x) = Σ_i w_i x_i log x_i.
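In display form, the heuristic that pins down the entropy (in my notation; the Lagrange multipliers are ignored, as in the text):

```latex
% Unconstrained mirror descent: \nabla^2\Phi(x(t))\,\dot{x}(t) = -c(t).
% For a separable \Phi(x)=\sum_i \phi_i(x_i) and cost c = e_i this reads
\[
\phi_i''(x_i)\,\dot{x}_i = -1
\qquad\Longrightarrow\qquad
w_i\,|\dot{x}_i| = \frac{w_i}{\phi_i''(x_i)} .
\]
% Equating this instantaneous movement cost with the virtual service cost x_i:
\[
\frac{w_i}{\phi_i''(x_i)} = x_i
\;\Longrightarrow\;
\phi_i''(x_i) = \frac{w_i}{x_i}
\;\Longrightarrow\;
\phi_i(x_i) = w_i\, x_i \log x_i \quad (\text{up to affine terms}).
\]
```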
We now need to verify that this relation between the virtual service cost and the movement cost remains true even when the Lagrange multipliers are taken into account. Note that because of the form of the state space the multiplier contains a term of the form μ(t)·(1,…,1) (which corresponds to the constraint Σ_i x_i = n − k) and, for each location i, a term β_i(t) ≥ 0 forcing the derivative to be 0 if the value of the missing mass has reached 1 (i.e., if x_i = 1). In other words we obtain the following dynamics for mirror descent with the weighted negentropy regularization:

x_i'(t) = (x_i(t)/w_i) · (μ(t) − c_i(t) − β_i(t)).
Notice that up to a factor 2 one can focus on controlling Σ_i w_i (x_i'(t))_−, that is, only movement into a location is charged (over the whole trajectory, every unit of mass that leaves a location must have entered it at some earlier time, so outgoing movement is bounded by incoming movement up to an additive constant). In that view the Lagrange multipliers simply have no effect, since one has w_{r(t)} (x_{r(t)}'(t))_− ≤ x_{r(t)}(t) (indeed recall that μ(t) ≥ 0, β_i(t) ≥ 0, and c(t) = e_{r(t)}, so only the requested coordinate can decrease). Thus we see that the movement cost is exactly bounded by the virtual service cost in this case.
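As a sanity check, one can discretize these dynamics with an Euler step. This is my own toy discretization (assuming the form x_i' = (x_i/w_i)(μ − c_i) with μ chosen to conserve the total mass, and with the cap x_i ≤ 1 ignored), not the algorithm from the paper:

```python
def euler_step(x, w, r, dt):
    # One Euler step of x_i' = (x_i / w_i) * (mu - 1{i == r}), with mu chosen
    # so that sum_i x_i' = 0 (the total anticonfiguration mass is conserved).
    rates = [xi / wi for xi, wi in zip(x, w)]
    mu = rates[r] / sum(rates)  # solves sum_i rates_i * (mu - 1{i==r}) = 0
    return [xi + dt * ri * (mu - (1.0 if i == r else 0.0))
            for i, (xi, ri) in enumerate(zip(x, rates))]

x, w, r, dt = [0.5, 0.5, 1.0], [1.0, 2.0, 4.0], 0, 1e-3
service, movement = 0.0, 0.0
for _ in range(1000):
    nx = euler_step(x, w, r, dt)
    movement += w[r] * max(0.0, x[r] - nx[r])  # weighted mass moving into r
    service += x[r] * dt                       # virtual service cost x_r dt
    x = nx
print(service, movement)  # the movement into r never exceeds the service cost
```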
It remains to deal with a non-trivial issue, namely that the entropy is not Lipschitz on the simplex! A similar issue is faced in online learning when one tries to prove tracking expert regret bounds, i.e., bounds with respect to a shifting expert. The standard solution (perhaps first used by Herbster and Warmuth in 98, see also Blum and Burch 00) is to shift the variables so that they never get below some δ > 0, in which case the Lipschitz constant would be O(log(1/δ)). In the k-server scenario one can stop the dynamics when x_{r(t)} ≤ δ (instead of x_{r(t)} = 0) provided that the mapping from x to the actual fractional configuration is now given by y_i = min(1, (1 − x_i)/(1 − δ)). This raises a final issue: the total amount of server mass in such a y can be as large as k/(1 − δ) ≈ k(1 + δ). Next we show that if δ is small enough then y can be “rounded” online to a fractional k-server configuration at the expense of a constant multiplicative factor in movement. Precisely we show that δ of order 1/k is sufficient, which in turn gives a final competitive ratio of O(log k) for weighted k-paging.
From k(1 + δ) servers to k servers
Consider a fractional server configuration y with total mass k(1 + δ) (i.e., Σ_i y_i = k(1 + δ)), which we want to transform into a server configuration z with total mass k. A natural way to “round down” is, at each location, to put a mass of 0 if the mass was below some small threshold. Furthermore a mass of 1 should stay 1 (a location holding a full server must keep it). This suggests the mapping z_i = min(1, (y_i − δ')_+ / (1 − δ')) for a suitable threshold δ', which is 1/(1 − δ')-Lipschitz. Thus the movement of z is controlled by the movement of y up to a multiplicative factor 1/(1 − δ'). Moreover, for an appropriate choice of δ', one has Σ_i z_i ≤ k (in fact the inequality can be strict, in which case one should think of the “lost mass” as being stored at the root; this incurs no additional movement cost).
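A sketch of one natural rounding map of this kind (my reconstruction; the exact threshold used in the paper may differ). It kills mass below the threshold, keeps full servers intact, and is 1/(1 − δ)-Lipschitz coordinatewise:

```python
def round_down(y, delta):
    """Map a fractional configuration y to a smaller one: mass at most delta
    becomes 0, a full server (mass 1) stays 1, and the map is
    1/(1-delta)-Lipschitz in each coordinate. Since (t-delta)/(1-delta) <= t
    for t in [0,1], the total mass can only decrease."""
    return [min(1.0, max(0.0, (yi - delta) / (1.0 - delta))) for yi in y]

y = [0.3, 0.7, 1.0, 0.05]
z = round_down(y, delta=0.1)
print(z)  # small masses shrink (the last one to 0), the full server stays 1
```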
We continue our k-server series (see post 1 here). In this post we briefly discuss the concept of a fractional solution for k-server, which by analogy with MTS will in fact be a fractional “anti-solution”. Then we introduce the continuous time version of MTS and explain how it applies to k-server. Finally, the most important part of the post is the description of the basic potential based analysis of mirror descent and how to interpret it in the context of MTS.
State representation for MTS
Recall the MTS problem: one maintains a random state X_t ∈ X (where X is equipped with a distance d), and given a new cost function c_t : X → R_+, one can update the state to X_{t+1} with the corresponding price to pay: c_t(X_{t+1}) + d(X_t, X_{t+1}). Equivalently one can maintain the probability distribution μ_t of X_t: indeed given μ_t and μ_{t+1} one can obtain (X_t, X_{t+1}) by optimal coupling, that is, the couple of random variables which minimizes the expected distance E[d(X, Y)] over all couples (X, Y) such that X has marginal μ_t and Y has marginal μ_{t+1} (the latter quantity is called the Wasserstein-1 distance between μ_t and μ_{t+1}). In this view the (expected) service cost is now a linear function, that is ⟨c_t, μ_{t+1}⟩ where c_t = (c_t(x))_{x ∈ X}, and the movement cost between μ_t and μ_{t+1} is their Wasserstein-1 distance.
We will further assume the existence of a convex body K ⊂ R^N and a norm ‖·‖ on R^N (for some N ≥ |X|) such that the Wasserstein-1 distance between two distributions μ and ν is equal to ‖a − b‖, where a, b ∈ K are “expanded states” representing μ and ν. For a weighted star metric it suffices to take N = |X|, but we will see in the fourth post that for trees one indeed needs to expand the state space.
Fractional solution for k-server
Recall the k-server problem: one maintains a random configuration Z_t ∈ X^k, and given a new request r_t ∈ X one must update the configuration so that r_t ∈ Z_{t+1}. One could equivalently maintain a distribution over configurations as before. In the fractional problem one in fact only maintains the first moment of this distribution, y_t ∈ [0,1]^X, defined by y_{t,i} = P(i ∈ Z_t) (in particular servicing a request at location i means that one must have y_{t+1,i} = 1). Again the metric here on the y variables is the Wasserstein distance induced by d (a.k.a., the optimal transport distance). Importantly we note that this view is not equivalent to maintaining a full distribution (indeed a lot of information is lost by recording only the first moment). This raises the subtle issue of whether one can round online a fractional solution to a proper integral solution whose total movement is of the same order of magnitude. We will not touch upon this question here, and we focus on the fractional k-server problem, see for example Section 5.2 here for more. We note however that the existence of a polynomial time algorithm for this rounding task is an open problem.
To think of the request as a linear cost function (like in MTS) it is more appropriate to work with the anticonfiguration x = 1 − y. In this view a request at location r is equivalent to the linear cost x ↦ x_r (which vanishes exactly when a server sits at r). Finally, like in MTS, we will assume the existence of an expanded state space K and a norm ‖·‖ that measures movement in this expanded view.
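A tiny sketch of the configuration/anticonfiguration bookkeeping (the numbers are a made-up example, not from the post):

```python
from fractions import Fraction  # exact arithmetic for the toy example

# A distribution over k=2 server configurations on locations {0,1,2,3}:
# with prob 1/2 servers at {0,1}, with prob 1/2 servers at {0,2}.
dist = {frozenset({0, 1}): Fraction(1, 2), frozenset({0, 2}): Fraction(1, 2)}
n, k = 4, 2

# First moment: y_i = P(a server sits at i); anticonfiguration: x = 1 - y.
y = [sum(p for conf, p in dist.items() if i in conf) for i in range(n)]
x = [1 - yi for yi in y]

print(y)  # [1, 1/2, 1/2, 0]
print(x)  # [0, 1/2, 1/2, 1]; note sum(x) = n - k, and a request at r is
          # served (as a linear cost x_r) exactly when x_r = 0
```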
Continuous time decision making
We will now move to a continuous time setting, where the (discrete time) sequence of cost vectors is revealed continuously as a path c(t) ∈ R_+^N, t ≥ 0. The decision maker’s response is a path x(t) that lives in K, with x(0) given. In this setting the service cost of the algorithm is ∫ ⟨c(t), x(t)⟩ dt and the movement cost is equal to ∫ ‖x'(t)‖ dt, where x'(t) is the time derivative of the path. We note that there is a small subtlety here in translating the continuous time service cost into a meaningful discrete time service cost, but we will not worry about this since it does not affect the argument for k-server (where there is only a movement cost). If you are curious see the appendix here.
For k-server we will use c(t) = e_{r(t)}, where r(t) is the currently requested location, and we move to the next request at the first time t such that x_{r(t)}(t) = 0 (which means that y = 1 − x satisfies y_{r(t)} = 1, i.e., there is a server at the requested location).
If you have never seen the mirror descent framework now is a good time to take a quick look here.
Very succinctly, mirror descent with mirror map Φ can be written as follows, with a step-size η:

x_{t+1} = argmin_{x ∈ K} D_Φ(x, ∇Φ*(∇Φ(x_t) − η c_t)),
where we recall that Φ* is the Legendre–Fenchel transform of Φ (i.e., the map whose gradient is the inverse map of the gradient of Φ) and D_Φ(x, y) = Φ(x) − Φ(y) − ⟨∇Φ(y), x − y⟩ is the Bregman divergence associated to Φ.
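For the negentropy mirror map on the simplex, the update above specializes to multiplicative weights: the dual gradient step multiplies each coordinate by exp(−η c_i), and the Bregman projection back onto the simplex is a plain renormalization. Here is a minimal sketch (variable names are mine):

```python
import math

def md_entropy_step(x, c, eta):
    """One mirror descent step on the simplex with Phi(x) = sum_i x_i log x_i.
    Since grad Phi(x)_i = 1 + log x_i, the unprojected dual step gives
    x_i * exp(-eta * c_i); the entropy Bregman projection onto the simplex
    is a renormalization (i.e., the multiplicative weights update)."""
    z = [xi * math.exp(-eta * ci) for xi, ci in zip(x, c)]
    s = sum(z)
    return [zi / s for zi in z]

x = [0.25, 0.25, 0.25, 0.25]
for _ in range(50):
    x = md_entropy_step(x, c=[1.0, 0.0, 0.5, 0.5], eta=0.1)
print(x)  # mass concentrates on coordinate 1, the cheapest one
```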
We now want to take η to 0 in the above definition to find the continuous time mirror descent update. For that let us recall the first order optimality condition for constrained optimization. Denote N_K(x) for the normal cone of K at x, i.e., the set of directions which are negatively correlated with any direction going into the body. One then has, for any convex function f,

x* ∈ argmin_{x ∈ K} f(x) ⟺ −∇f(x*) ∈ N_K(x*).
In particular we see that (note that ∇_x D_Φ(x, y) = ∇Φ(x) − ∇Φ(y))

∇Φ(x_{t+1}) − ∇Φ(x_t) + η c_t ∈ −N_K(x_{t+1}),
and thus taking η to 0 we morally get

d/dt ∇Φ(x(t)) ∈ −c(t) − N_K(x(t)).
This type of equation is known as a differential inclusion, and with the added constraint that the path must live in the constraint set K we get a viability problem. In our paper we show that a solution indeed exists (and is unique) under mild assumptions on Φ.
Potential based analysis
The mirror descent algorithm is fully described by:

∇²Φ(x(t)) x'(t) = −(c(t) + λ(t)), where λ(t) ∈ N_K(x(t)) and x(0) = x_0.
Denoting D_Φ for the Bregman divergence, we see that for any fixed y ∈ K,

∂_t D_Φ(y, x(t)) = −⟨∂_t ∇Φ(x(t)), y − x(t)⟩ = ⟨c(t) + λ(t), y − x(t)⟩ ≤ ⟨c(t), y − x(t)⟩,

where the last inequality uses λ(t) ∈ N_K(x(t)) together with y ∈ K.
The above calculation is the key to understanding mirror descent: it says that if the algorithm is currently paying more than y, i.e., ⟨c(t), x(t)⟩ > ⟨c(t), y⟩, then it is actually getting closer to y in the sense that D_Φ(y, x(t)) is decreasing. Put differently: when the algorithm pays, it also learns. This key insight is sufficient for online learning, where one competes against a fixed point y. However in MTS and k-server we compete against a path u(t), and thus we also need to evaluate by how much the Bregman divergence can go up when u is moving. This is captured by the following calculation:

∂_t D_Φ(u(t), x(t)) ≤ ⟨c(t), u(t) − x(t)⟩ + ⟨∇Φ(u(t)) − ∇Φ(x(t)), u'(t)⟩ ≤ ⟨c(t), u(t) − x(t)⟩ + 2L ‖u'(t)‖,

where L is the Lipschitz norm of Φ (so that ‖∇Φ‖_* ≤ L on K).
Putting together the two above calculations we obtain the following control on the service cost of mirror descent in terms of the service cost and movement cost of the optimal path:
Lemma: The mirror descent path x(t) satisfies, for any comparator path u(t) in K,

∫_0^T ⟨c(t), x(t)⟩ dt ≤ ∫_0^T ⟨c(t), u(t)⟩ dt + 2L ∫_0^T ‖u'(t)‖ dt + D_Φ(u(0), x(0)).
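For completeness, here is the integration step behind the lemma, in hedged notation (L again denotes a bound on the dual norm of ∇Φ over the body):

```latex
% Combining the two calculations above:
\[
\partial_t D_\Phi(u(t),x(t)) \;\le\; \langle c(t),\,u(t)-x(t)\rangle \;+\; 2L\,\|u'(t)\| .
\]
% Integrating from 0 to T and using D_\Phi(u(T),x(T)) \ge 0:
\[
\int_0^T \langle c(t),x(t)\rangle\,dt
\;\le\;
\int_0^T \langle c(t),u(t)\rangle\,dt
\;+\; 2L\int_0^T \|u'(t)\|\,dt
\;+\; D_\Phi(u(0),x(0)).
\]
```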
At this point the big question is: how do we control the movement of mirror descent? In the next post we will see how this plays out on a weighted star.
The k-server problem is a classical and very attractive instance of online decision making. The decisions to be made in this problem are simple: given a requested location in some finite metric space and a fleet of k servers currently sitting in this metric space, one has to choose which server to use to service the request (that is, the chosen server will move from its current location to the requested location). The problem is to minimize the total amount of movement of the servers on a stream of such requests. This formulation is both abstract enough that it can model many real-life problems (e.g., virtual memory management) yet concrete/simple enough that it seems that everything ought to be understood about it.
In the coming series of 5 blog posts I will explain the main ideas behind our k-server preprint with Michael Cohen, James Lee, Yin Tat Lee, and Aleksander Madry. In this first post I will briefly put the problem in the broader context of online learning and online algorithms, which will be helpful as it suggests an approach based on the mirror descent algorithm. In the next post I will explain the general framework of mirror descent and how it can be applied to a problem such as k-server. The third post will show how to use this general framework to recover effortlessly the state of the art for weighted k-paging (i.e., when the underlying metric space is a weighted star). The fourth post will show how to extend the analysis to arbitrary trees. Finally in the fifth post we will discuss both classical embedding techniques to reduce the problem to the tree case, as well as our new dynamic embedding based on multiscale restarts of the classical technique. The content of the first three posts was essentially covered in my TCS+ seminar talk.
Online decision making: two probability free models
I will start by introducing a very general framework for online algorithms due to Borodin, Linial and Saks 92, called metrical task systems (MTS). At the same time I will recall the online learning framework, so that one can see the similarities/differences between the two settings. In fact this connection was already made at the end of the Nineties, see Blum and Burch 00. At the time it was natural to explore the power of multiplicative weights. With today’s perspective it is natural to explore the much more general mirror descent algorithm (we note that the closely related concept of regularization was already brought to bear for these problems in Abernethy, Bartlett, Buchbinder and Stanton 10, and Buchbinder, Chen and Naor 14).
Let X be a set which we think of as either a state space (for online algorithms) or an action space (for online learning). Let d be a metric on X. Finally let C be a set of possible cost functions (e.g., arbitrary bounded functions, or linear functions if X ⊂ R^n, or convex functions, etc.). Online decision making is about making decisions in an uncertain environment (the future is not known), and here we model this uncertain environment as an unknown sequence of cost functions c_1, c_2, … ∈ C. The interaction protocol can then be described as follows: for each t = 1, 2, …, the player chooses x_t ∈ X (possibly randomly) based on past observations (to be made precise soon) and pays the cost c_t(x_t), plus possibly a movement penalty d(x_{t−1}, x_t).
There are now two key differences between online algorithms and online learning: (i) the observation model, and (ii) the performance metric. In online learning one assumes that the cost c_t is unknown at decision time, and is only revealed after the decision is made (in bandits an even weaker signal is revealed, namely only the paid cost c_t(x_t)). The type of applications one has in mind is, say, online recommendations, where a user’s preference is only revealed once some item has been recommended to her. On the other hand in online algorithms the cost is known at decision time. In this context the cost corresponds to a request, and the player has to find a way to satisfy this request (in the language of MTS the cost represents a task, and one gets the option to update one’s state so as to complete this task more efficiently, i.e., at a lower cost). Now let us discuss the performance metric. In online learning one assumes that there is some “hidden” good action (it is hidden in the noise, i.e., a single cost function does not say much about which actions are good, but if one takes the whole sequence into account then there is some good action that hopefully emerges). Thus it is natural to consider the following regret notion:

R_T = Σ_{t=1}^T c_t(x_t) − min_{x ∈ X} Σ_{t=1}^T c_t(x).
This regret notion does not make sense for online algorithms, where one may have to keep changing states to satisfy the request sequence. There one must compare to the best offline strategy, in which case additive guarantees are not attainable and one resorts to a multiplicative guarantee, the so-called competitive ratio: the smallest C such that, for all request sequences,

Σ_t [c_t(x_t) + d(x_{t−1}, x_t)] ≤ C · min_{(y_t)} Σ_t [c_t(y_t) + d(y_{t−1}, y_t)] + O(1).
(Note that in MTS one always assumes nonnegative cost functions so that the multiplicative guarantee makes sense.) The k-server problem, introduced in Manasse, McGeoch and Sleator 90, corresponds to a metrical task system on the product state space X^k equipped with the Earthmover distance inherited from d, and with cost functions of the form c_r(Z) = 0 if r ∈ Z and +∞ otherwise (one such cost for each possible request r ∈ X).
The online learning setting is by now fairly well-understood. We know that the optimal regret for a finite action set of size n and bounded cost functions is of order √(T log n) (see e.g., Cesa-Bianchi, Freund, Haussler, Helmbold, Schapire and Warmuth 97). In fact we even know the optimal constant (for large T and n), and we have both algorithmic and information theoretic proofs of this result. Moreover we know how to leverage combinatorial information, e.g., when X ⊂ {0,1}^n and C is a set of linear functions. A unifying principle that was put forth in the last decade or so is the mirror descent strategy.
On the other hand the situation for MTS is much worse. The optimal guarantees for finite sets are not known: the worst case (over all metric spaces of size n) competitive ratio is known to be at least Ω(log n) (trivial coupon-collector style lower bound on the uniform metric) and at most O(log² n) (the latter bound is due to Fiat and Mendel 03). No information theoretic analysis is known, even for the uniform metric. With combinatorial structure the situation becomes even more disastrous. That’s where the k-server problem comes in, as a combinatorial MTS striking the best balance between simplicity and importance. For k-server it is conjectured that one could get a competitive ratio of O(log k) (the coupon-collector lower bound on the uniform metric gives here Ω(log k)), while the best result prior to our work was polylogarithmic in n and k (due to Bansal, Buchbinder, Naor and Madry 11), and 2k − 1 if one insists on a bound independent of n (due to Koutsoupias and Papadimitriou 95).
As some of you already know there has been some movement at MSR lately, specifically for the theory group. We have now branched out into two groups, one in Machine Learning and Optimization -MLO- (with Zeyuan Allen-Zhu, myself, Ofer Dekel, Yuval Peres, Ilya Razenshteyn, Lin Xiao, and until recently postdoc Yin Tat Lee) and one in Algorithms (Nikhil Devanur, Sergey Yekhanin, postdocs Janardhan Kulkarni and Xiaorui Sun). The MLO group is part of the new MSR AI organization headed by Eric Horvitz, with grand goals at all levels of AI (including a broad view of the foundations of AI which very much includes the type of mathematics showcased on this blog). Both groups are now hiring at all levels. From my point of view MSR is a unique place where you can devote all your time and energy to your research. The growth opportunities are truly amazing here: a constant stream of world-class experts in algorithms, optimization, and probability; lots of resources for travel, interns, postdocs; closeness to real-world products and possibilities to see your research directly applied. Below is the official call for applications of the MLO group (the Algorithms one is here). Looking forward to receiving your application!
Call for applications for researcher and postdoc positions in Machine Learning and Optimization
Application deadline: December 1st, 2017
The Machine Learning and Optimization Group at Microsoft Research Redmond invites applications for researcher positions at all levels (postdoc, full-time junior, full-time senior). The current full-time research members are Zeyuan Allen-Zhu, Sebastien Bubeck, Ofer Dekel, Yuval Peres, Ilya Razenshteyn, and Lin Xiao.
All applicants working on machine learning and optimization, including their algorithmic and mathematical foundations, will be considered. We expect candidates to have demonstrated excellence in research and have developed their own original research agenda. Our current areas of focus include statistical and online learning, convex and non-convex optimization, high dimensional data analysis, combinatorial optimization and its applications in AI, statistics, and probability. We are also looking to expand our domain of expertise to other areas of modern machine learning, including more applied research areas.
Microsoft Research offers wonderful resources to develop your research, opportunities for collaborations across all MSR labs and top academic institutions, and the potential to realize your ideas in products and services used worldwide. Our group is part of Microsoft Research AI, a new organization that brings together the breadth of talent across Microsoft Research to pursue game-changing advances in artificial intelligence.
Please apply directly on the Microsoft Research Careers website and include Ofer Dekel as a Microsoft Research contact. In addition, send a copy of your application to email@example.com. To be assured full consideration, all your materials, including at least two reference letters, should arrive by December 1st, 2017. We recommend including a brief academic research statement and links to publications.
Microsoft is an equal opportunity employer. We welcome applications from all qualified candidates, and in particular from women and underrepresented groups.
Philippe Rigollet and I will be the program chairs for this year’s edition of COLT. It will be in Stockholm in July, which I hear is absolutely gorgeous at that time of the year. We also have a fantastic lineup of invited speakers: Stephane Mallat, Susan Murphy, and Johan Hastad. We believe we can also guarantee high quality reviews for your submissions, as we have assembled a great (and extended) program committee to make sure that we cover all aspects of modern learning theory (in particular we expect the number of subreviewers to be minimal). See below for the PC, as well as the call for papers. Looking forward to reading all your submissions in February!
Shivani Agarwal (University of Pennsylvania)
Shipra Agrawal (Columbia University)
Alexandr Andoni (Columbia University)
Pranjal Awasthi (Rutgers University)
Francis Bach (INRIA)
Nina Balcan (Carnegie Mellon University)
Afonso Bandeira (New York University)
Mihail Belkin (Ohio State University)
Shai Ben-David (University of Waterloo)
Quentin Berthet (University of Cambridge)
Alina Beygelzimer (Yahoo! Research)
Avrim Blum (TTI Chicago)
Guy Bresler (MIT)
Olivier Cappé (Université Paris-Saclay)
Constantine Caramanis (UT Austin)
Alexandra Carpentier (OvGU Magdeburg)
Nicolo Cesa-Bianchi (University of Milan)
Arnak Dalalyan (ENSAE)
Amit Daniely (Hebrew University)
Ronen Eldan (Weizmann Institute)
Tim van Erven (Leiden University)
Aurelien Garivier (University Paul-Sabatier)
Rong Ge (Duke University)
Claudio Gentile (Universita degli Studi dell’Insubria)
Elad Hazan (Princeton University)
Daniel Hsu (Columbia University)
Prateek Jain (Microsoft Research)
Satyen Kale (Google)
Varun Kanade (Oxford University)
Vladimir Koltchinskii (Georgia Tech)
Wouter Koolen (CWI, Amsterdam)
Tomer Koren (Google)
John Lafferty (Yale University)
Po-Ling Loh (UW Madison)
Gabor Lugosi (ICREA and Pompeu Fabra University)
Shie Mannor (Technion)
Yishay Mansour (Tel Aviv University and Google)
Ankur Moitra (MIT)
Robert Nowak (UW Madison)
Vianney Perchet (ENS Paris-Saclay and Criteo)
Luis Rademacher (UC Davis)
Maxim Raginsky (University of Illinois)
Sasha Rakhlin (University of Pennsylvania)
Lorenzo Rosasco (MIT and Universita’ di Genova)
Robert Schapire (Microsoft Research)
Ohad Shamir (Weizmann Institute)
David Steurer (ETH Zurich)
Suvrit Sra (MIT)
Nathan Srebro (TTI Chicago)
Karthik Sridharan (Cornell University)
Csaba Szepesvari (University of Alberta)
Matus Telgarsky (University of Illinois)
Ambuj Tewari (University of Michigan)
Alexandre Tsybakov (ENSAE-CREST)
Ruth Urner (Max Planck Institute)
Santosh Vempala (Georgia Tech)
Roman Vershynin (UC Irvine)
Manfred Warmuth (UC Santa Cruz)
Yihong Wu (Yale University)
Call for papers:
The 31st Annual Conference on Learning Theory (COLT 2018) will take place in Stockholm, Sweden, on July 5-9, 2018 (with a welcome reception on the 4th), immediately before ICML 2018, which takes place in the same city. We invite submissions of papers addressing theoretical aspects of machine learning and related topics. We strongly support a broad definition of learning theory, including, but not limited to:
-Design and analysis of learning algorithms
-Statistical and computational complexity of learning
-Optimization methods for learning
-Unsupervised, semi-supervised, online and active learning
-Interactions with other mathematical fields
-Interactions with statistical physics
-Artificial neural networks, including deep learning
-High-dimensional and non-parametric statistics
-Learning with algebraic or combinatorial structure
-Geometric and topological data analysis
-Bayesian methods in learning
-Planning and control, including reinforcement learning
-Learning with system constraints: e.g. privacy, memory or communication budget
-Learning from complex data: e.g., networks, time series, etc.
-Learning in other settings: e.g. social, economic, and game-theoretic
Submissions by authors who are new to COLT are encouraged. While the primary focus of the conference is theoretical, the authors may support their analysis by including relevant experimental results.
All accepted papers will be presented in a single track at the conference. At least one of each paper’s authors should be present at the conference to present the work. Accepted papers will be published electronically in the Proceedings of Machine Learning Research (PMLR). The authors of accepted papers will have the option of opting-out of the proceedings in favor of a 1-page extended abstract. The full paper reviewed for COLT will then be placed on the arXiv repository.
COLT will award both best paper and best student paper awards. To be eligible for the best student paper award, the primary contributor(s) must be full-time students at the time of submission. For eligible papers, authors must indicate at submission time if they wish their paper to be considered for a student paper award. The program committee may decline to make these awards, or may split them among several papers.
Submissions that are substantially similar to papers that have been previously published, accepted for publication, or submitted in parallel to other peer-reviewed conferences with proceedings may not be submitted to COLT. The same policy applies to journals, unless the submission is a short version of a paper submitted to a journal, and not yet published. Authors must declare such dual submissions either through the Easychair submission form, or via email to the program chairs.
Submissions are limited to 12 PMLR-formatted pages, plus unlimited additional pages for references and appendices. All details, proofs and derivations required to substantiate the results must be included in the submission, possibly in the appendices. However, the contribution, novelty and significance of submissions will be judged primarily based on the main text (without appendices), and so enough details, including proof details, must be provided in the main text to convince the reviewers of the submissions’ merits. Formatting and submission instructions can be found here.
As in previous years, there will be a rebuttal phase during the review process. Initial reviews will be sent to authors before final decisions have been made. Authors will have the opportunity to provide a short response on the PC’s initial evaluation.
Open Problems Session:
We also invite submission of open problems. A separate call for open problems will be made available on the conference website.
-Paper submission deadline: February 16, 2018, 11:00 PM EST
-Author feedback: April 18-21, 2018
-Author notification: May 2, 2018
-Conference: July 5-9, 2018 (welcome reception on the 4th)
Papers should be submitted through EasyChair at https://easychair.org/conferences/?conf=colt2018
This is an incredibly difficult post to write. Michael Benjamin Cohen, an amazing student and person, passed away this week in Berkeley. Below is the official statement from MSR (where he spent the summer together with many friends) and some personal remembrance. Also here is a picture of one of the many, many happy moments we had together this summer at MSR (with Yin Tat Lee too):
The Microsoft Research family is shocked and deeply saddened by the tragic loss of our colleague and friend Michael Cohen. Michael was a brilliant mathematician and a rising star in his field. He spent this past summer with us as a Microsoft Research Fellow, doing what he loved most. Over the summer, he made sweeping progress in online learning and online algorithms, two fields he had just recently become acquainted with. In addition to solving five open problems in these areas, he continued his substantial progress on the k-server problem, one of the most celebrated and notoriously difficult challenges in the space of adaptive algorithms.
Michael was a truly exceptional individual. We will remember Michael for his infectious smile and his larger-than-life personality. We will never forget his unrelenting curiosity, his thirst for knowledge, and his deep love for mathematics and theoretical foundations of computing. We are stunned by his loss. We will hold onto our memories of Michael, and know that his ideas and scientific accomplishments will continue on as important advances.
We extend our most sincere condolences to Michael’s family and friends.
I have known Michael for less than a year, but in that short time span he had a profound impact on me, like only a handful of people have had in my scientific life. I will always remember our first meeting, on October 26th 2016 at MIT. We were about to start lunch with a small group of graduate students and Michael entered the room, he (gently) interrupted the conversation and his first sentence to me was a question about mirror descent that I was not able to answer (we now know the answer, and as it turns out his question was pretty deep and the answer highly non-trivial). This is a typical Michael story, and I’m sure anyone who has interacted with him had similar experiences, where a seemingly innocuous question (at a perhaps inopportune time) turned out to be extremely deep and interesting.
Michael’s way of doing mathematics was truly remarkable. He was relentlessly searching for the right way to look at a problem, and he was never satisfied with a hacky explanation. As far as I know he never wrote anything on paper while doing research. I assume that if the calculations were too long to fit in his head then it was probably the wrong approach anyway. Doing research with him was an immense pleasure. He was always willing to share his new and original point of view on his current readings. He was also interested in a breadth of topics that is only matched by a tiny fraction of researchers in theoretical computer science. As James Lee shared in his comment on Luca’s blog, I too was feeling that all of this was just warm-up for him. I am incredibly sad to have lost a friend and a great collaborator, but I’m also sad that we will never see Michael’s next steps.
Beyond his mathematical gift he was also a kind and gentle human being. Perhaps it was not obvious to see at first sight with his exuberant personality, but he was compassionate and caring.
We truly lost one of the best. It is really hard to cope with the unfairness of these events. We will do our best to share Michael’s wonderful ideas with the world, and to make his scientific accomplishments live on. Needless to say my heart breaks for his family, and all my thoughts are with them at this moment.
A couple of months ago we (Kevin Scaman, Francis Bach, Yin Tat Lee, Laurent Massoulie and myself) uploaded a new paper on distributed convex optimization. We came up with a pretty clean picture for the optimal oracle complexity of this setting, which I will describe below. I should note that there are hundreds of papers on this topic, but the point of the post is to show our simple cute calculations and not to survey the immense literature on distributed optimization; see the paper itself for a number of pointers to other recent works.
Distributed optimization setting
Let $G = (V, E)$ be an undirected graph on $n$ vertices ($|V| = n$) and with diameter $\Delta$. We will think of the nodes in $V$ as the computing units. To each vertex $v \in V$ there is an associated convex function $f_v : \mathbb{R}^d \rightarrow \mathbb{R}$. For machine learning applications one can think that each computing unit has access to a “private” dataset, and $f_v(x)$ represents the fit of the model corresponding to $x$ on this dataset (say measured with the least squares loss, or the logistic loss for example). The goal will be to find in a distributed way the optimal “consensus” point:

$$x^* \in \operatorname*{argmin}_{x \in \mathbb{R}^d} \sum_{v \in V} f_v(x).$$
The distributed processing protocol is as follows: asynchronously/in parallel, each node $v$ can (i) compute a (local) gradient $\nabla f_v$ in time $1$, and (ii) communicate a vector in $\mathbb{R}^d$ to its neighbors in $G$ in time $\tau$. We denote by $x_{v,t}$ the local model (essentially its guess for $x^*$) of node $v$ at time $t$. We aim to characterize the smallest time $T_{\epsilon}$ such that one can guarantee that all nodes $v$ satisfy $f(x_{v, T_{\epsilon}}) - f(x^*) \leq \epsilon$ where $f = \sum_{v \in V} f_v$.
We focus on the case where $f$ is $\beta$-smooth and $\alpha$-strongly convex ($\kappa = \beta/\alpha$ is the condition number), which is arguably the most challenging case since one expects linear convergence (i.e., the scaling of $T_{\epsilon}$ in $\epsilon$ should be $\log(1/\epsilon)$), which a priori makes the interaction of optimization error and communication error potentially delicate (one key finding is that in fact it is not delicate!). Also, having in mind applications outside of large-scale machine learning (such as “federated” learning), we will make no assumptions about how the functions at different vertices relate to each other.
A trivial answer
Recall that Nesterov’s accelerated gradient descent solves the serial problem in time $O(\sqrt{\kappa} \log(1/\epsilon))$. Trivially one can distribute a step of Nesterov’s accelerated gradient descent in time $\Delta \tau$ (simply designate a master node at the beginning, and everybody sends its local gradient to the master node in time $\Delta \tau$). Thus we arrive at the upper bound $O(\sqrt{\kappa} (1 + \Delta \tau) \log(1/\epsilon))$ using a trivial (centralized) algorithm. We now show (slightly informally, see the paper for proper definitions) that this is in fact optimal!
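To make the trivial centralized algorithm concrete, here is a toy numpy simulation; the quadratic local functions $f_v(x) = \frac{1}{2} x^{\top} A_v x - b_v^{\top} x$ and all numerical choices below are mine, purely for illustration, and the "communication" is of course only simulated:

```python
import numpy as np

# Toy instance: each of the n nodes holds a quadratic local function
# f_v(x) = 0.5 x^T A_v x - b_v^T x (hypothetical choice, for illustration).
rng = np.random.default_rng(0)
n, d = 5, 10
As = [np.eye(d) + 0.1 * (M @ M.T) for M in rng.standard_normal((n, d, d))]
bs = [rng.standard_normal(d) for _ in range(n)]

def master_gradient(x):
    # Centralized round: every node ships grad f_v(x) to the master node
    # along a shortest path (cost Delta * tau in the timing model).
    return sum(A @ x - b for A, b in zip(As, bs))

# Smoothness and strong convexity of f = sum_v f_v.
H = sum(As)
evals = np.linalg.eigvalsh(H)
alpha, beta = evals[0], evals[-1]
q = (np.sqrt(beta / alpha) - 1) / (np.sqrt(beta / alpha) + 1)

# Nesterov's accelerated gradient descent, run by the master node.
x, y_prev = np.zeros(d), np.zeros(d)
for _ in range(300):
    y = x - master_gradient(x) / beta      # gradient step
    x = y + q * (y - y_prev)               # momentum step
    y_prev = y

x_star = np.linalg.solve(H, sum(bs))       # exact minimizer of the quadratic
assert np.linalg.norm(y_prev - x_star) < 1e-6   # linear convergence kicked in
```

Each loop iteration costs $1 + 2\Delta\tau$ in the timing model (one gradient, one round trip to the master), which is exactly the trivial upper bound above.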
First let us recall the lower bound proof in the serial case (see for example Theorem 3.15 here). The idea is to introduce the function $f(x) = \frac{\beta - \alpha}{8} \left( \langle A x, x \rangle - 2 \langle e_1, x \rangle \right) + \frac{\alpha}{2} \|x\|^2$ where $A$ is the Laplacian of the path graph on $\mathbb{N}$, or in other words the tridiagonal operator with $2$ on the diagonal and $-1$ above and below the diagonal.
First it is easy to see that this function is indeed $\beta$-smooth and $\alpha$-strongly convex. The key point is that, for any algorithm starting at $x_0 = 0$ and such that each iteration stays in the linear span of the previously computed gradients (a very natural assumption), one has $x_t \in \mathrm{span}(e_1, \ldots, e_t)$.
In words one can say that each gradient calculation “discovers” a new edge of the path graph involved in the definition of . Concluding the serial proof is then just a matter of brute force calculations.
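For the record, the brute force calculation rests on the following estimate (I am quoting it from memory of the lecture notes linked above, so the exact constants may differ): minimizing $f$ over the span of the first $t$ coordinates leaves an error

```latex
x_t \in \mathrm{span}(e_1, \ldots, e_t)
\quad \Longrightarrow \quad
f(x_t) - f(x^*) \;\geq\; \frac{\alpha}{2} \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^{2t} \|x_0 - x^*\|^2 ,
```

so that reaching precision $\epsilon$ indeed requires $t = \Omega(\sqrt{\kappa} \log(1/\epsilon))$ gradient computations.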
Now let us move to the distributed setting, and consider two vertices $u$ and $v$ that realize the diameter of $G$. The idea goes as follows: let $A_1$ (respectively $A_2$) be the Laplacian of the even edges of the path graph on $\mathbb{N}$ (respectively the odd edges), that is $A_1 = \sum_{i \text{ even}} (e_i - e_{i+1})(e_i - e_{i+1})^{\top}$ and $A_2 = \sum_{i \text{ odd}} (e_i - e_{i+1})(e_i - e_{i+1})^{\top}$.
Now define $f_u(x) = \frac{\beta - \alpha}{8} \left( \langle A_1 x, x \rangle - 2 \langle e_1, x \rangle \right) + \frac{\alpha}{2n} \|x\|^2$, $f_v(x) = \frac{\beta - \alpha}{8} \langle A_2 x, x \rangle + \frac{\alpha}{2n} \|x\|^2$, and $f_w(x) = \frac{\alpha}{2n} \|x\|^2$ for any $w \in V \setminus \{u, v\}$. The key observation is that node $u$ does not “know” about the odd edges until it receives a message from $v$ (and vice versa for the even edges). Thus it is fairly easy to show that in this case the span of the computed gradients grows by at most one new coordinate per $\Delta \tau$ units of time,
which effectively amounts to a slowdown by a factor $\Delta \tau$ compared to the serial case and proves the lower bound $\Omega(\sqrt{\kappa} (1 + \Delta \tau) \log(1/\epsilon))$.
Not so fast!
One can say that the algorithm proposed above somewhat defeats the purpose of the distributed setting. Indeed the centralized communication protocol it relies on is not robust to various real-life issues such as machine failures, time-varying graphs, edges with different latencies, etc. An elegant and practical solution is to restrict communication to be gossip-like. That is, local computations now have to be communicated via matrix multiplication with a walk matrix $W \in \mathbb{R}^{V \times V}$, which we define as satisfying the following three conditions: (i) $W_{u,v} \neq 0$ only if $u = v$ or $(u,v) \in E$, (ii) $W \mathbf{1} = 0$, and (iii) $W$ is symmetric positive semi-definite. Let us briefly discuss these conditions: (i) simply means that if $x \in \mathbb{R}^{V}$ represents real values stored at the vertices, then $W x$ can be calculated with a distributed communication protocol; (ii) says that if there is consensus (that is all vertices hold the same value) then no communication occurs with this matrix multiplication; and (iii) will turn out to be natural in just a moment for our algorithm based on duality. A prominent example of a walk matrix $W$ would be the (normalized) Laplacian of $G$.
We denote by $\gamma$ the inverse condition number of $W$ on $\mathbf{1}^{\perp}$ (that is the ratio of the smallest non-zero eigenvalue of $W$ to its largest eigenvalue), also known as the spectral gap of $G$ when $W$ is the Laplacian. Notice that $1/\gamma$ naturally controls the number of gossip steps to reach consensus, in the sense that gossip steps correspond to gradient descent steps on the quadratic $x \mapsto \frac{1}{2} x^{\top} W x$, which will converge in $O((1/\gamma) \log(1/\epsilon))$ steps. Doing an “accelerated gossip” (also known as Chebyshev gossiping) one could thus hope to essentially replace the diameter $\Delta$ by $1/\sqrt{\gamma}$. Notice that this is hopeful thinking because in the centralized model $\Delta$ communication steps get you to an exact consensus, while in the gossip model one only reaches an $\epsilon$-approximate consensus and errors might compound. In fact with a bit of graph theory one can immediately see that simply replacing $\Delta$ by $1/\sqrt{\gamma}$ is too good to be true: there are graphs (namely expanders) where $\Delta$ is of order $\log n$ while $1/\sqrt{\gamma}$ is of order a constant, and thus an upper bound of the form (say) $O(\sqrt{\kappa}(1 + \tau/\sqrt{\gamma}) \log(1/\epsilon))$ would violate our previous lower bound by a factor $\log n$.
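To see the $O((1/\gamma)\log(1/\epsilon))$ consensus rate in action, here is a small numpy experiment with plain (non-accelerated) gossip; the cycle graph, the choice of the Laplacian as walk matrix, and all constants are my own choices for illustration:

```python
import numpy as np

# Toy gossip on the cycle graph C_n, taking the walk matrix W to be the
# graph Laplacian (one of the examples in the text).
n = 20
adj = np.zeros((n, n))
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1.0
W = np.diag(adj.sum(axis=1)) - adj          # Laplacian: W 1 = 0, W >= 0

eigs = np.linalg.eigvalsh(W)
gamma = eigs[1] / eigs[-1]                  # smallest nonzero / largest eigenvalue

# One gossip step x <- x - (1/lambda_max) x W is a gradient descent step on
# the quadratic x -> 0.5 x W x^T; the disagreement contracts by (1 - gamma).
rng = np.random.default_rng(1)
x = rng.standard_normal(n)                  # one scalar stored per node
target = x.mean()                           # gossip preserves the average
for _ in range(int(8 / gamma)):
    x = x - (x @ W) / eigs[-1]

assert abs(x.mean() - target) < 1e-9        # average preserved (since W 1 = 0)
assert np.abs(x - target).max() < 1e-2      # near-consensus after ~1/gamma steps
```

Chebyshev acceleration would replace the $\sim 1/\gamma$ iterations here by $\sim 1/\sqrt{\gamma}$, by applying a well-chosen polynomial of $W$ instead of $W$ itself.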
To save the day we will make extra assumptions, namely that each local function $f_v$ has condition number $\kappa$ and that in addition to computing local gradients the vertices can also compute local gradients of the Fenchel dual functions $f_v^*$. The latter assumption can be removed at the expense of extra logarithmic factors but we will ignore this point here (see the paper for some hints as well as further discussion on this point). For the former assumption we note that the lower bound proof given above completely breaks under this assumption. However one can save the construction for some specific graphs (finding the correct generalization to arbitrary graphs is one of our open problems). For example imagine a line graph, and cluster the vertices into three groups: the first third, the middle, and the last third. Then one could distribute the even part of the Laplacian on $\mathbb{N}$ among the first group, and the odd part among the last group, as well as distribute the Euclidean norm evenly among all vertices. This construction verifies that each vertex function has condition number of order $\kappa$ and furthermore the rest of the argument still goes through. Interestingly in this case one also has $1/\sqrt{\gamma} = \Theta(\Delta)$ and thus this proves that for the line graph one has $T_{\epsilon} = \Omega(\sqrt{\kappa} (1 + \tau/\sqrt{\gamma}) \log(1/\epsilon))$ for gossip algorithms. We will now show a matching upper bound (which holds for arbitrary graphs).
For $x \in \mathbb{R}^{d \times n}$ (which we think of as a set of column vectors, one for each vertex $v \in V$), denote $x_v$ for the column corresponding to $v$ and let $F(x) = \sum_{v \in V} f_v(x_v)$. We are interested in minimizing $F$ under the constraint that all columns are equal, which can be written as $x W^{1/2} = 0$ (recall that the kernel of $W$ is spanned by $\mathbf{1}$). By definition of the Fenchel dual and a simple change of variable one has:

$$\min_{x \,:\, x W^{1/2} = 0} F(x) \;=\; \min_{x} \max_{\lambda \in \mathbb{R}^{d \times n}} F(x) - \langle \lambda, x W^{1/2} \rangle \;=\; \max_{\lambda \in \mathbb{R}^{d \times n}} - F^*(\lambda W^{1/2}).$$
Next observe that gradient ascent on $\lambda \mapsto - F^*(\lambda W^{1/2})$ can be written as

$$\lambda_{t+1} = \lambda_t - \eta \, \nabla F^*(\lambda_t W^{1/2}) \, W^{1/2},$$
and with the notation $y_t = \lambda_t W^{1/2}$ this is simply $y_{t+1} = y_t - \eta \, \nabla F^*(y_t) \, W$. Crucially the multiplication by $W$ exactly corresponds to gossiping the local conjugate gradients (which are also the local models) $\nabla f_v^*((y_t)_v)$. In other words we only have to understand the condition number of the function $\lambda \mapsto F^*(\lambda W^{1/2})$. The beauty of all of this is that this condition number is precisely $\kappa / \gamma$ (i.e., it naturally combines the condition number of the vertex functions with the “condition number” of the graph). Thus by accelerating gradient ascent we arrive at a time complexity of $O(\sqrt{\kappa/\gamma} \, (1 + \tau) \log(1/\epsilon))$ (recall that a gossip step takes time $\tau$). We call the corresponding algorithm SSDA (Single-Step Dual Accelerated). One can improve it slightly in the case of low communication cost by doing multiple rounds of communication between two gradient computations (essentially replacing $W$ by a Chebyshev polynomial of $W$). We call the corresponding algorithm MSDA (Multi-Step Dual Accelerated) and it attains the optimal (in the worst case over graphs) complexity of $O(\sqrt{\kappa} \, (1 + \tau/\sqrt{\gamma}) \log(1/\epsilon))$.
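Here is a toy numpy sketch of the dual gossip iteration (without the acceleration, so this is plain dual gradient ascent rather than SSDA proper). The quadratic local functions, for which $\nabla f_v^*(y) = A_v^{-1}(y + b_v)$ is available in closed form, the cycle graph, and the step size are all my own choices for illustration:

```python
import numpy as np

# Quadratic local functions f_v(x) = 0.5 x^T A_v x - b_v^T x, so that the
# conjugate gradient has the closed form grad f_v^*(y) = A_v^{-1}(y + b_v).
rng = np.random.default_rng(2)
n, d = 6, 4
As = [np.eye(d) + 0.09 * (M @ M.T) for M in rng.standard_normal((n, d, d))]
bs = rng.standard_normal((d, n))            # column v is b_v

# Walk matrix: Laplacian of the cycle on n vertices.
adj = np.zeros((n, n))
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1.0
W = np.diag(adj.sum(axis=1)) - adj

def conj_grads(y):
    # Column v is grad f_v^*(y_v): these are also the local models x_v.
    return np.column_stack([np.linalg.solve(As[v], y[:, v] + bs[:, v])
                            for v in range(n)])

# Dual gradient ascent, y_{t+1} = y_t - eta * grad F^*(y_t) W: the
# multiplication by W is one gossip round of the local conjugate gradients.
y = np.zeros((d, n))
eta = 0.05                                  # small enough for this toy instance
for _ in range(3000):
    y = y - eta * conj_grads(y) @ W

# All local models should now agree with the consensus minimizer.
x_star = np.linalg.solve(sum(As), bs.sum(axis=1))
assert np.abs(conj_grads(y) - x_star[:, None]).max() < 1e-4
```

SSDA/MSDA would replace this plain ascent by Nesterov acceleration (and, for MSDA, Chebyshev gossip), but the fixed point is the same: when $\nabla F^*(y) W = 0$ all columns agree, and since the columns of $y$ sum to zero the common value must be $x^*$.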
Discrepancy algorithm inspired by gradient descent and multiplicative weights; after Levy, Ramadas and Rothvoss
A week or so ago at our Theory Lunch we had the pleasure of listening to Harishchandra Ramadas (student of Thomas Rothvoss) who told us about their latest discrepancy algorithm. I think the algorithm is quite interesting as it combines ideas from gradient descent and multiplicative weights in a non-trivial (yet very simple) way. Below I reprove Spencer’s six standard deviations theorem with their machinery (in the actual paper Levy, Ramadas and Rothvoss do more than this).
First let me remind you of the setting (see also this previous blog post for some motivation on discrepancy and a bit more context; by the way it is funny to read the comments in that post after this): given $a_1, \ldots, a_n \in \mathbb{R}^n$ with $\|a_i\|_2 \leq 1$ one wants to find $x \in \{-1, 1\}^n$ (think of it as a “coloring” of the coordinates) such that $\max_i |\langle a_i, x \rangle| \leq C$ for some numerical constant $C$ (when $a_i$ is a normalized vector of $0$’s and $1$’s the quantity $\langle a_i, x \rangle$ represents the unbalancedness of the coloring in the set corresponding to $a_i$). Clearly it suffices to give a method to find $x \in [-1,1]^n$ with at least half of its coordinates equal to $-1$ or $+1$ and such that $\max_i |\langle a_i, x \rangle| \leq C$ for some numerical constant $C$ (indeed one can then simply recurse on the coordinates not yet set to $-1$ or $+1$; this is the so-called “partial coloring” argument). Note also that one can drop the absolute value by taking both $a_i$ and $-a_i$ (the number of constraints then becomes $2n$ but this is easy to deal with and we ignore it here for sake of simplicity).
Let $x_0 = 0$ and $w_{i,0} = 1$ for all $i \in [n]$. We run an iterative algorithm which keeps at every time step $t$ a subspace $U_t$ of valid update directions and then proceeds as follows. First find $u_t \in U_t$ with $\|u_t\| = 1$ (using for instance an orthonormal basis of $U_t$ and an averaging argument) such that

$$\sum_i w_{i,t} \langle a_i, u_t \rangle^2 \leq \frac{1}{\dim(U_t)} \sum_i w_{i,t}. \qquad (1)$$
Then update $x_{t+1} = x_t + \delta_t u_t$ where $\delta_t > 0$ is maximal so that $x_{t+1}$ remains in $[-1,1]^n$. Finally update the exponential weights by $w_{i,t+1} = w_{i,t} \exp(\delta_t \langle a_i, u_t \rangle)$.
It remains to describe the subspace $U_t$. For this we introduce the set $I_t$ containing the $n/4$ largest coordinates of $w_t$ (the “inactive” constraints) and the set $F_t$ containing the coordinates of $x_t$ equal to $-1$ or $+1$ (the “frozen” coordinates). The subspace $U_t$ is now described as the set of points orthogonal to (i) $x_t$, (ii) $e_j$ for $j \in F_t$, (iii) $a_i$ for $i \in I_t$, (iv) $\sum_i w_{i,t} a_i$. The intuition for (i) and (ii) is rather clear: for (i) one simply wants to ensure that the method keeps making progress towards the boundary of the cube (i.e., $\|x_{t+1}\|^2 = \|x_t\|^2 + \delta_t^2$), while for (ii) one wants to make sure that coordinates which are already “colored” (i.e., set to $-1$ or $+1$) are not updated. In particular (i) and (ii) together ensure that at each step either the norm squared of $x_t$ increases (and it can never exceed $n$), or one fixes forever one of the coordinates to $-1$ or $+1$. In particular this means that after at most $O(n)$ iterations one will have a partial coloring (i.e., half of the coordinates set to $-1$ or $+1$, which was our objective). Property (iii) is about ensuring that we stop walking in the directions where we are not making good progress (there are many ways to ensure this and this precise form will make sense towards the end of the analysis). Property (iv) is closely related, and while it might be only a technical condition it can also be understood as ensuring that locally one is not increasing the softmax of the constraints; indeed (iv) exactly says that one should move orthogonally to the gradient of $x \mapsto \log \left( \sum_i \exp(\langle a_i, x \rangle) \right)$.
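Putting the pieces together, here is a schematic numpy implementation of one partial coloring run. The size $n/4$ of the inactive set, the tolerances, and the lazy choice of $u_t$ as the null-space basis vector of least weighted energy (the averaging argument shows some basis vector works) are my own choices for this sketch:

```python
import numpy as np

def partial_coloring(A, rounds=None):
    # One partial-coloring run, schematically following the algorithm above.
    # Parameter choices (inactive-set size, tolerances) are mine.
    m, n = A.shape
    x, w = np.zeros(n), np.ones(m)
    for _ in range(rounds or 2 * n):
        frozen = np.flatnonzero(np.abs(x) >= 1 - 1e-9)
        if len(frozen) >= n // 2:
            break                              # half the coordinates are colored
        inactive = np.argsort(w)[-(n // 4):]   # largest weights, protected
        # U_t: orthogonal to x, to sum_i w_i a_i, to frozen e_j, inactive a_i.
        C = np.vstack([x, w @ A, np.eye(n)[frozen], A[inactive]])
        _, s, Vt = np.linalg.svd(C)
        U = Vt[(s > 1e-9).sum():]              # orthonormal basis of U_t
        if len(U) == 0:
            break
        # Basis vector of least weighted energy: satisfies the averaging bound.
        u = U[np.argmin(w @ (A @ U.T) ** 2)]
        # Maximal step keeping x in [-1,1]^n; some coordinate hits +1 or -1.
        active = np.abs(u) > 1e-12
        delta = ((np.sign(u[active]) - x[active]) / u[active]).min()
        x = x + delta * u
        w = w * np.exp(delta * (A @ u))        # exponential weights update
    return x

# Demo on ell_2-normalized random 0/1 rows, doubling the rows to drop the
# absolute values as described in the post.
rng = np.random.default_rng(3)
n = 64
S = rng.integers(0, 2, size=(n, n)).astype(float)
S /= np.linalg.norm(S, axis=1, keepdims=True)
x = partial_coloring(np.vstack([S, -S]))
assert (np.abs(x) > 1 - 1e-6).sum() >= n // 2   # at least half colored
assert np.abs(x).max() <= 1 + 1e-9              # stayed in the cube
```

One would then recurse on the uncolored coordinates, and round the remaining fractional entries, to get a full coloring.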
Let $W_t = \sum_i w_{i,t}$. Note that since $u_t$ is on the sphere and $\|a_i\| \leq 1$ one has $|\langle a_i, u_t \rangle| \leq 1$. Thus using $e^z \leq 1 + z + z^2$ for $z \leq 1$, as well as property (iv) (i.e., $\sum_i w_{i,t} \langle a_i, u_t \rangle = 0$) one obtains:

$$W_{t+1} = \sum_i w_{i,t} e^{\delta_t \langle a_i, u_t \rangle} \leq W_t + \delta_t^2 \sum_i w_{i,t} \langle a_i, u_t \rangle^2.$$
Observe now that the subspace $U_t$ has dimension at least $n/4 - 2$ (say $\geq n/8$ for $n \geq 16$) and thus by (1) and the above inequalities one gets:

$$W_{t+1} \leq W_t \left( 1 + \frac{8 \delta_t^2}{n} \right) \leq W_t \exp\left( \frac{8 \delta_t^2}{n} \right), \quad \text{and thus} \quad W_T \leq W_0 \exp\left( \frac{8}{n} \sum_{t < T} \delta_t^2 \right) \leq n e^8,$$

where the last inequality uses $W_0 = n$ and $\sum_{t < T} \delta_t^2 = \|x_T\|^2 \leq n$.
In particular $W_t \leq C n$ for any $t$, for some numerical constant $C$. It only remains to observe that this ensures $\langle a_i, x_t \rangle \leq C'$ for any $i$ and $t$, for some numerical constant $C'$ (this concludes the proof since we already observed that at time $T$ at least half of the coordinates are colored). For this last implication we simply use property (iii). Indeed assume that some coordinate $i$ satisfies $\langle a_i, x_t \rangle \geq C'$ at some time $t$. Since each update increases the weights (multiplicatively) by at most a constant factor it means that there is a previous time (say $s$) where this weight was larger than $e^{C' - 2}$ and yet it got updated, meaning that it was not among the top $n/4$ weights, and in particular one had $W_s \geq \frac{n}{4} e^{C' - 2}$ which contradicts $W_s \leq C n$ for $C'$ large enough (namely $C' = 2 + \log(4C)$).