Youtube channel

I just started a youtube channel. The hope is that this will be a companion to the blog. Some of the posts will “graduate” into videos (e.g., the first set of videos will correspond to an expanded version of bandit lecture notes –part 1, part 2-, and the second set will correspond to the crash course in learning theory –part 1, part 2-). But the mid/long-term plan is to also have dedicated content on the youtube channel (I still have to figure out the exact format). I personally like a lot to watch math videos online, so I hope the channel will be a nice addition to this growing material (some of my favorite existing channels include the Simons Institute, IAS, Princeton TCS, Harvard Math, IHESa new MIT data science channel, feel free to add yours in the comments below).

Posted in Announcement | 7 Comments

k-server, part 3: entropy regularization for weighted k-paging

If you have been following the first two posts (post 1, post 2), now is time to reap the rewards! I will show here how to obtain a O(\log(k))-competitive algorithm for (weighted) paging, i.e., when the metric space corresponds to the leafs of a weighted star. This was viewed as a breakthrough result 10 years ago (with a JACM publication by Bansal, Buchbinder and Naor in 2012), and for good reasons as this simplest instance of k-server was in fact the one studied in the seminal paper by Sleator and Tarjan in 1985 which introduced the competitive analysis of online algorithms (actually to be precise Sleator and Tarjan considered the unweighted case, for which a O(\log(k)) algorithm was known much before).

State space for weighted k-paging

Let w_i be the weight of the edge from leaf i \in [n] to the root. Recall from the previous post that we want to find a norm and a convex body that can express the Wasserstein-1 distance between two fractional k-server configurations.

Consider a fractional move from z \in \Delta_n(k) to z + \xi. Then clearly, the amount of mass |\xi_i| has to transfer through the edge (i, \mathrm{root}), so that the Wasserstein-1 distance is at least \sum_i w_i |\xi_i|. Furthermore there is trivially a transport plan achieving that much total mass transfer. In other words we just proved that in this case the appropriate norm is a weighted \ell_1 norm (namely \|\xi\| := \sum_{i=1}^n w_i |\xi_i|) and one can simply use the basic state space K = \{x \in [0,1]^n : \sum_i x_i = n-k\} (recall from the previous post that we have to work with anticonfiguration, and that the mapping to a configuration is simply given by z=1-x).

Applying the general mirror descent framework

Given a request at location r \in [n] and a current anticonfiguration x_0, our proposed (fractional) algorithm is to run the mirror descent dynamics with a continuous time linear cost c(t) = e_r from x(0)=x_0 (i.e., x'(t) = (\nabla^2 \Phi(x(t)))^{-1} (e_r + \lambda(t)) for some \lambda(t) \in N_K(x(t))) and until the first time t at which x_r(t) = 0 (i.e., one has a server at location r in z(t) := 1-x(t)). By the lemma at the end of the previous post one has (denote r(t) for the request being serviced at time t)

    \[ \int x_{r(t)}(t) dt \leq \mathrm{Lip}_{\|\cdot\|}(\Phi) \times \mathrm{OPT} + O(1) \]

One can think of \int x_{r(t)}(t) dt as a “virtual service cost”. In k-server this quantity has no real meaning, but the above inequality shows that this quantity, which only depends on the algorithm, is tightly related to the value of OPT (provided that \Phi has a small Lipschitz norm). Thus we see that we have two key desiderata for the choice of the mirror map \Phi: (i) it should have a small Lipschitz norm, (ii) one should be able to relate the movement cost \|x'(t)\| = \|(\nabla^2 \Phi(x(t)))^{-1} (e_r + \lambda(t))\| to this “virtual service cost” x_{r}(t), say up to a multiplicative factor \alpha. One would then obtain a \alpha \times \mathrm{Lip}_{\|\cdot\|}(\Phi)-competitive algorithm.

Entropy regularization for weighted k-paging

Let us look at (ii), and we shall see that the entropy regularization comes out very naturally. Ignore for a moment the Lagrange multiplier \lambda(t) and let us search for a separable regularizer of the form \Phi(x) = \sum_i \phi_i(x_i). We want to relate x_r(t) to w_r/\phi_i''(x_r(t)) (= (\nabla^2 \Phi(x(t)))^{-1} e_r). Making those two quantities equal gives \phi_i(x) = w_i x_i \log(x_i) and thus the regularizer is a weighted negentropy: \Phi(x) = \sum_i w_i x_i \log x_i.

We now need to verify that this relation between the virtual service cost and the movement cost remains true even when the Lagrange mutilplier \lambda(t) is taken into account. Note that because of the form of the state space K the multiplier contains a term of the form (\mu(t), \mu(t), \hdots, \mu(t)) (which corresponds to the constraint \sum_{i=1}^n x_i = n-k) and for each location a term forcing the derivative to be 0 if the value of the missing mass has reached 1. In other words we obtain the following dynamics for mirror descent with the weighted negentropy regularization:

    \begin{align*} & x_r'(t) = \frac{x_r(t)}{w_r} (-1 + \mu(t)) \\ & x_i'(t) = \frac{x_i(t)}{w_i} \mu(t) 1\{x_i(t) < 1\} \ \text{for} \ i \neq r \end{align*}

Notice that up to a factor 2 one can focus on controlling \|(x'(t))_-\| (that is only movement into a location is charged). In that view the Lagrange multipliers simply have no effect, since one has \mu(t) \geq 0 (indeed recall that \sum_{i=1}^n x_i'(t) = 0). Thus we see that the movement cost \|(x'(t))_-\| is exactly bounded by the virtual service cost x_r(t) in this case.

Making \Phi Lipschitz

It remains to deal with a non-trivial issue, namely the entropy is not Lipschitz on the simplex! A similar issue is faced in online learning when one tries to prove tracking expert regret bounds, i.e., bounds with respect to a shifting expert. The standard solution (perhaps first used by Herbster and Warmuth in 98, see also Blum and Burch 00) is to shift the variables so that they never get below some \delta>0, in which case the Lipschitz constant would be O(\log(1/\delta)). In the k-server scenario one can stop the dynamics when x_{r(t)}(t)=\delta (instead of x_{r(t)}(t)=0) provided that the mapping from x to the actual fractional configuration z is now given by z=\frac{1-x}{1-\delta}. This raises a final issue: the total amount of server mass in such a z is \frac{k}{1-\delta} > k. Next we show that if \delta is small enough then z can be “rounded” online to a fractional k-server configuration at the expense of a multiplicative O(1) movement. Precisely we show that \delta<1/(2k) is sufficient, which in turns gives a final competitive ratio of O(\log(k)) for weighted k-paging.

From k+\epsilon servers to k servers

Consider a fractional server configuration z with total mass k+\epsilon (i.e., \sum_{i=1}^n z_i = k+\epsilon), which we want to transform into a server configuration \sigma(z) with total mass k. A natural way to “round down” is at each location to put 0 if the mass was \leq \epsilon. Furthermore a mass of 1 should stay 1. This suggests the mapping \sigma(z_i) = \frac{z_i-\epsilon}{1-\epsilon} 1\{z_i > \epsilon\}, which is \frac{1}{1-\epsilon}-Lipschitz. Thus the movement of \sigma(z(t)) is controlled by the movement of z(t) up to a multiplicative factor \frac{1}{1-\epsilon}. Moreover one clearly has \sum_{i=1}^n \sigma(z_i) \leq k (in fact the inequality can be strict, in which case one should think of the “lost mass” as being stored at the root, this incurs no additional movement cost).

Posted in Theoretical Computer Science | 7 Comments

k-server, part 2: continuous time mirror descent

We continue our k-server series (see post 1 here). In this post we briefly discuss the concept of a fractional solution for k-server, which by analogy with MTS will in fact be a fractional “anti-solution”. Then we introduce the continuous time version of MTS and explain how it applies for k-server. Finally the most important part of the post is the description of the basic potential based analysis of mirror descent and how to interpret it in the context of MTS.


State representation for MTS

Recall the MTS problem: one maintains a random state \rho_t \in [n] (where [n] is equipped with a distance d), and given a new cost function c_{t+1} : [n] \rightarrow \mathbb{R}_+, one can update the state to \rho_{t+1} with the corresponding price to pay: c_{t+1}(\rho_{t+1}) + d(\rho_t, \rho_{t+1}). Equivalently one can maintain the probability distribution p_t \in \Delta_n := \{x \in \mathbb{R}_+^n : \sum_{i=1}^n x_i = 1\} of \rho_t: indeed given \rho_t, p_t, p_{t+1} one can obtain \rho_{t+1} by optimal coupling, that is the couple of random variables (\rho_t, \rho_{t+1}) minimizes the expected distance \E \ d(Y,Z) over all couples (Y,Z) such that Y has marginal p_t and Z has marginal p_{t+1} (the latter quantity is called the Wasserstein-1 distance between p_t and p_{t+1}). In this view the (expected) service cost is now a linear function, that is c_t \cdot p_t where c_t \in \mathbb{R}_+^n, and the movement cost between p_t and p_{t+1} is the Wasserstein-1 distance.

We will further assume the existence of a convex body K and a norm \|\cdot\| in \mathbb{R}^N (N \geq n) such that the Wasserstein-1 distance between two distributions x,y \in \Delta_n is equal to \|\hat{x} - \hat{y}\| where \hat{x}, \hat{y} \in K are “expanded states” with x=(\hat{x}_1,\hdots, \hat{x}_n) and y=(\hat{y}_1, \hdots, \hat{y}_n). For a weighted star metric it suffices to take K = \Delta_n, but we will see in the fourth post that for trees one indeed need to expand the state space.

Fractional solution for k-server

Recall the k-server problem: one maintains a random configuration \rho_t \in [n]^k, and given a new request r_{t+1} \in [n] one must update the configuration to \rho_{t+1} such that r_{t+1} \in \{\rho_{t+1}(1), \hdots, \rho_{t+1}(k)\}. One could equivalently maintain a distribution p_t \in \Delta_{n^k} as before. In the fractional problem one in fact only maintains the 1^{st} moment of this distribution, z_t \in \Delta_n(k) := \{z \in [0,1]^n : \sum_{i=1}^n z_i = k\}, defined by z_t(i) = \sum_{\rho \in [n]^k : i \in \{\rho(1),\hdots, \rho(k)\}} p_t(\rho) (in particular servicing a request a location i means that one must have z_t(i) = 1). Again the metric here on the z variables is the Wasserstein distance induced by d (a.k.a., the optimal transport distance). Importantly we note that this view is not equivalent to maintaining a full distribution (indeed a lot of information is lost by recording only the first moment). This raises the subtle issue of whether one can round online a fractional solution to a proper integral solution whose total movement is of the same order of magnitude. We will not touch upon this question here, and we focus on the fractional k-server problem, see for example Section 5.2 here for more. We note however that the existence of a polynomial time algorithm for this rounding task is an open problem.

To think of the request as a linear cost function (like in MTS) it is more appropriate to work with the anticonfiguration x=1-z. In this view a request r \in [n] is equivalent to a cost vector c = +\infty \times e_r. Finally like in MTS we will assume the existence of an expanded state space K \subset \mathbb{R}^N and a norm \|\cdot\| that measures movement in this expanded view.

Continuous time decision making

We will now move to a continuous time setting, where the (discrete time) sequence of cost vectors is revealed continuously as a path (c(t))_{t \in \mathbb{R}_+}, with c(t) \in \mathbb{R}_+^N (and c_i(t) \neq 0 \Rightarrow i \leq n). The decision maker’s reponse is a path (x(t))_{t \in \mathbb{R}_+} that lives in K \subset \mathbb{R}^N. In this setting the service cost of the algorithm is \int_{\mathbb{R}^+} c_t \cdot x_t dt and the movement cost is equal to \int_{\mathbb{R}^+} \|x'(t)\| dt where x'(t) is the time derivative of the path x(t). We note that there is small subtelty here to translate the continuous time service cost into a meaningful discrete time service cost, but we will not worry about this here since it does not affect the argument for k-server (where there is only a movement cost). If you are curious see the appendix here.

For k-server we will use c(t) = e_{r(t)} where r(t) is the currently requested location, and we move to the next request at the first time such that x_{r(t)}(t) = 0 (which means that z=1-x satisfies z_{r(t)}(t)=1, i.e., there is a server at the requested location.

Mirror descent

If you have never seen the mirror descent framework now is a good time to take a quick look here.

Very succintly mirror descent with mirror map \Phi : K \rightarrow \mathbb{R} can be written as follows, with a step-size \eta:

    \[ x(t+\eta) = \mathrm{argmin}_{x \in K} D_{\Phi}(x, \nabla \Phi^*(\nabla \Phi(x(t)) - \eta c(t))) \]

where we recall that \Phi^* is the Legendre-Fenchel transform of \Phi (i.e., the map whose gradient is the inverse map of the gradient of \Phi) and D_{\Phi}(x,y) = \Phi(x) - \Phi(y) - \nabla \Phi(y) \cdot (x-y) is the Bregman divergence associated to \Phi.

We now want to take \eta to 0 in the above definition to find the continuous time mirror descent update. For that let us recall the first order optimality condition for constrained optimization. Denote N_K(x) = \{\theta : \theta \cdot (y-x) \leq 0, \ \forall y \in K\} for the normal cone of K at x, i.e., the set of directions which are negatively correlated with any direction going into the body. One then has for any convex function f,

    \[ x^* \in \mathrm{argmin}_{x \in K} f(x) \Leftrightarrow -\nabla f(x^*) \in N_K(x^*) \]

In particular we see that (note that \nabla_x D_{\Phi}(x,y) = \nabla \Phi(x) - \nabla \Phi(y))

    \[ \nabla \Phi(x(t+\eta)) - \nabla \Phi(x(t)) + \eta c(t) \in - N_{K}(x(t+\eta)) \]

and thus taking \eta to 0 we morally get

    \[ \nabla^2 \Phi(x(t)) x'(t) \in - c(t) - N_{K}(x(t)) \]

This type of equation is known as a differential inclusion, and with the added constraint that the path must live in the constraint set K we get a viability problem. In our paper we show that a solution indeed exist (and is unique) under mild assumptions on \Phi.

Potential based analysis

The mirror descent algorithm is fully described by:

    \[ \nabla^2 \Phi(x(t)) x'(t) = - c(t) - \lambda(t), \; \lambda(t) \in N_{K}(x(t)) \]

Denoting \hat{D}_{\Phi}(y ; x) = - \Phi(x) - \nabla \Phi(x) \cdot (y -x ) we see that for any fixed y,

    \begin{eqnarray*} \partial_t \hat{D}_{\Phi}(y ; x(t)) & = & - \nabla^2 \Phi(x(t)) x'(t) \cdot (y - x(t)) \\ & = & (c(t) + \lambda(t)) \cdot (y - x(t)) \\ & \leq & c(t) \cdot (y - x(t)) \end{eqnarray*}

The above calculation is the key to understand mirror descent: it says that if the algorithm is currently paying more than y, i.e., c(t) \cdot (y - x(t)) <0, then it is actually getting closer to y in the sense that \hat{D}_{\Phi}(y ; x(t)) is decreasing. Put differently: when the algorithm pays, it also learns. This key insight is sufficient for online learning, where one competes against a fixed point y. However in MTS and k-server we compete against a path y(t), and thus we also need to evaluate by how much the Bregman divergence can go up when y is moving. This is captured by the following calculation:

    \begin{eqnarray*} \partial_t \hat{D}_{\Phi}(y(t) ; x) & = & - \nabla \Phi(x(t)) \cdot y'(t) \\ & \leq & \mathrm{Lip}_{\|\cdot\|}(\Phi) \times \|y'(t)\| \end{eqnarray*}

Putting together the two above calculations we obtain the following control on the service cost of mirror descent in terms of the service cost and movement cost of the optimal path:

Lemma: The mirror descent path (x(t))_{t \geq 0} satisfies for any comparator path (y(t))_{t \geq 0},

    \[ \int c(t) \cdot x(t) dt \leq \mathrm{Lip}_{\|\cdot\|}(\Phi) \int \|y'(t)\| dt + \int c(t) \cdot y(t) dt + O(1) \]

At this point the big question is: how do we control the movement of mirror descent? In the next post we will see how this plays out on a weighted star.


Posted in Theoretical Computer Science | 2 Comments

k-server, part 1: online learning and online algorithms

The k-server problem is a classical and very attractive instance of online decision making. The decisions to be made in this problem are simple: given a requested location in some finite metric space and a fleet of k servers currently sitting in this metric space, one has to choose which server to use to service the request (that is the server will move from its current location to the requested location). The problem is to minimize the total amount of movement of the servers on a stream of such requests. This formulation is both abstract enough that it can model many real-life problems (e.g., virtual memory management) yet concrete/simple enough that it seems that everything ought to be understood about it.

In the coming series of 5 blog posts I will explain the main ideas behind our k-server preprint with Michael Cohen, James Lee, Yin Tat Lee, and Aleksander Madry. In this first post I will briefly put the problem in the broader context of online learning and online algorithms, which will be helpful as it suggests an approach based on the mirror descent algorithm. In the next post I will explain the general framework of mirror descent and how it can be applied to a problem such as k-server. The third post will show how to use this general framework to recover effortlessly the state of the art for weighted k-paging (i.e., when the underlying metric space is a weighted star). The fourth post will show how to extend the analysis to arbitrary trees. Finally in the fifth post we will discuss both classical embedding techniques to reduce the problem to the tree case, as well as our new dynamic embedding based on multiscale restarts on the classical technique. The content of the first three posts was essentially covered in my TCS+ seminar talk.

Online decision making: two probability free models

I will start by introducing a very general framework for online algorithms due to Borodin, Linial and Saks 92, called metrical task systems (MTS). At the same time I will recall the online learning framework, so that one can see the similarities/differences between the two settings. In fact this connection was already made at the end of the Nineties, see Blum and Burch 00. At the time it was natural to explore the power of multiplicative weights. With today’s perspective it is natural to explore the much more general mirror descent algorithm (we note that the closely related concept of regularization was already brought to bear for these problems in Abernethy, Bartlett, Buchbinder and Stanton 10, and Buchbinder, Chen and Naor 14).

Let X be a set which we think of as either a state space (for online algorithms) or an action space (for online learning). Let d be a metric on X. Finally let \mathcal{C} \subset \mathbb{R}^{X} be a set of possible cost functions (e.g., arbitrary bounded functions, or linear functions if X \subset \mathbb{R}^N, or convex functions, etc). Online decision making is about making decisions in an uncertain environment (the future is not known), and here we model this uncertain environment as an unknown sequence of cost functions c_1, \hdots, c_T \in \mathcal{C}. The interaction protocol can then be described as follows: For each t \in [T], the player chooses (possibly randomly) x_t \in X based on past observations (to be made precise soon) and pays the cost c_t(x_t) \in \mathbb{R}, plus possibly a movement penalty: d(x_{t-1}, x_t).

There are now two key differences between online algorithms and online learning: (i) the observation model, and (ii) the performance metric. In online algorithm one assumes that the cost c_t is unknown at the decision time, and is only revealed after the decision is made (in bandits an even weaker signal is revealed, namely only the paid cost c_t(x_t)). The type of applications one has in mind is say online recommendations, where a user’s preference is only revealed once some item has been recommended to her. On the other hand in online algorithms the cost c_t is known at decision time. In this context the cost corresponds to a request, and the player has to find a way to satisfy this request (in the language of MTS the cost represents a task, and one gets the option to update its state so as to complete this task more efficiently, i.e., at a lower cost). Now let us discuss the performance metric. In online learning one assumes that there is some “hidden” good action (it is hidden in the noise, i.e., a single cost function does not say much about which actions are good, but if one takes the whole sequence into account then there is some good action that hopefully emerges). Thus it is natural to consider the following regret notion:

    \[ \sum_{t=1}^T c_t(x_t) - \min_{x \in X} \sum_{t=1}^T c_t(x) \]

This regret notion does not make sense for online algorithms where one may have to keep changing states to satisfy the request sequence. There one must compare to the best offline strategy, in which case additive guarantees are not attainable and one resorts to a multiplicative guarantee, the so-called competitive ratio:

    \[ \sum_{t=1}^T \left( c_t(x_t) + d(x_t,x_{t-1}) \right) / \min_{x^* \in X^T} \sum_{t=1}^T \left( c_t(x_t^*) + d(x_t^*,x_{t-1}^*) \right) ~. \]

(Note that in MTS one always assumes nonnegative cost functions so that the multicative guarantee makes sense.) The k-server problem, introduced in Manasse, McGeoch and Sleator 90, corresponds to a metrical task system on the product state space X^k equipped with the Earthmover distance inherited by (X,d), and with cost functions

    \[ \mathcal{C} = \{X^k \ni x \mapsto + \infty \times 1\{ r \not\in \{x(1),\hdots,x(k)\} \}; \ r \in X \} ~. \]

Known results

The online learning setting is by now fairly well-understood. We know that the optimal regret for a finite action set |X| = n and bounded cost functions is O(\sqrt{T \log(n)}) (see e.g., Cesa-Bianchi, Freund, Haussler, Helmbold, Schapire and Warmuth 97). In fact we even know the optimal constant (for large n and T), and we have both algorithmic and information theoretic proofs of this result. Moreover we know how to leverage combinatorial information, e.g., when X \subset \{0,1\}^d and \mathcal{C} is a set of linear functions. A unifying principle that was put forth in the last decade or so is the mirror descent strategy.

On the other hand the situation for MTS is much worse. The optimal guarantees for finite sets are not known: the worst case (over all metric spaces of size n) competitive ratio is known to be \Omega(\log(n)) (trivial coupon-collector style lower bound on the uniform metric) and O(\log^2(n) \log\log(n)) (the latter bound is due to Fiat and Mendel 03). No information theoretic analysis is known, even for the uniform metric. With combinatorial structure the situation becomes even more disastrous. That’s where the k-server problem comes in, as a combinatorial MTS striking the best balance between simplicity and importance. For k-server it is conjectured that one could get a competitive ratio of O(\log(k)) (the coupon-collector lower bound on uniform metric gives here \Omega(\log(k))), while the best result prior to our work was O(\log^3(n) \log^2(k)) (due to Bansal, Buchbinder, Naor and Madry 11), and O(k) if one insists for a bound independent of n (due to Koutsoupias and Papadimitriou 95).

Posted in Theoretical Computer Science | 1 Comment

Algorithms, Machine Learning, and Optimization: we are hiring!

As some of you already know there has been some movement at MSR lately, specifically for the theory group. We have now branched out into two groups, one in Machine Learning and Optimization -MLO- (with Zeyuan Allen-Zhu, myself, Ofer Dekel, Yuval Peres, Ilya Razenshteyn, Lin Xiao, and until recently postdoc Yin Tat Lee) and one in Algorithms (Nikhil Devanur, Sergey Yekhanin, postdocs Janardhan Kulkarni and Xiaorui Sun). The MLO group is part of the new MSR AI organization headed by Eric Horvitz, with grand goals at all levels of AIs (including a broad view of the foundations of AI which very much includes the type of mathematics showcased on this blog). Both groups are now hiring at all levels. From my point of view MSR is a unique place where you can devote all your time and energy to your research. The growth opportunities are truly amazing here: constant stream of world-class experts in algorithms, optimization, probability; lots of resources for travel, interns, postodcs; closeness to real-world products and possibilities to see your research directly applied. Below is the official call for application of the MLO group (the Algorithms’ one is here). Looking forward to receive your application!

Call for applications for researcher and postdoc positions in Machine Learning and Optimization

Application deadline: December 1st, 2017

The Machine Learning and Optimization Group at Microsoft Research Redmond invites applications for researcher positions at all levels (postdoc, full-time junior, full-time senior). The current full-time research members are Zeyuan Allen-Zhu, Sebastien Bubeck, Ofer Dekel, Yuval Peres, Ilya Razenshteyn, and Lin Xiao.

All applicants working on machine learning and optimization, including their algorithmic and mathematical foundations, will be considered. We expect candidates to have demonstrated excellence in research and have developed their own original research agenda. Our current areas of focus include statistical and online learning, convex and non-convex optimization, high dimensional data analysis, combinatorial optimization and its applications in AI, statistics, and probability. We are also looking to expand our domain of expertise to other areas of modern machine learning, including more applied research areas.

Microsoft Research offers wonderful resources to develop your research, opportunities for collaborations across all MSR labs and top academic institutions, and the potential to realize your ideas in products and services used worldwide. Our group is part of Microsoft Research AI, a new organization that brings together the breadth of talent across Microsoft Research to pursue game-changing advances in artificial intelligence.

Please apply directly on the Microsoft Research Careers website and include Ofer Dekel as a Microsoft Research contact. In addition, send a copy of your application to To be assured full consideration, all your materials, including at least two reference letters, should arrive by December 1st, 2017. We recommend including a brief academic research statement and links to publications.

Microsoft is an equal opportunity employer. We welcome applications from all qualified candidates, and in particular from women and underrepresented groups.

Posted in Announcement | Leave a comment

COLT 2018 call for papers

Philippe Rigollet and myself will be the program chairs for this year’s edition of COLT. It will be in Stockholm in July, which I hear is absolutely gorgeous at that time of the year. We also have a fantastic lineup of invited speakers: Stephane Mallat, Susan Murphy, and Johan Hastad. We believe we will also guarantee high quality reviews for your submissions as we have assembled a great (and extended) program committee to make sure that we cover all aspects of modern learning theory (in particular we expect the number of subreviewers to be minimal). See below for the PC, as well as the call for papers. Looking forward to read all your submissions in February!

Program Committee:

Jacob Abernethy (Georgia Tech)
Shivani Agarwal (University of Pennsylvania)
Shipra Agrawal (Columbia University)
Alexandr Andoni (Columbia University)
Pranjal Awasthi (Rutgers University)
Francis Bach (INRIA)
Nina Balcan (Carnegie Mellon University)
Afonso Bandeira (New-York University)
Mihail Belkin (Ohio State University)
Shai Ben-David (University of Waterloo)
Quentin Berthet (University of Cambridge)
Alina Beygelzimer (Yahoo! Research)
Avrim Blum (TTI Chicago)
Guy Bresler (MIT)
Olivier Cappé (Université Paris-Saclay)
Constantine Caramanis (UT Austin)
Alexandra Carpentier (OvGU Magdeburg)
Nicolo Cesa-Bianchi (University of Milan)
Arnak Dalalyan (ENSAE)
Amit Daniely (Hebrew University)
Ronen Eldan (Weizman Institute)
Tim van Erven (Leiden University)
Vitaly Feldman (IBM Research)
Aurelien Garivier (University Paul-Sabatier)
Rong Ge (Duke University)
Claudio Gentile (Universita degli Studi dell’Insubria)
Steve Hanneke
Elad Hazan (Princeton University)
Daniel Hsu (Columbia University)
Prateek Jain (Microsoft Research)
Satyen Kale (Google)
Varun Kanade (Oxford University) Vladimir Koltchinskii (Georgia Tech)
Wouter Koolen (CWI, Amsterdam)
Tomer Koren (Google)
John Lafferty (Yale University)
Po-Ling Loh (UW Madison)
Gabor Lugosi (ICREA and Pompeu Fabra University)
Shie Mannor (Technion)
Yishay Mansour (Tel Aviv University and Google)
Ankur Moitra (MIT)
Robert Nowak (UW Madison)
Vianney Perchet (ENS Paris-Saclay and Criteo)
Alexandre Proutiere (KTH)
Luis Rademacher (UC Davis)
Maxim Raginsky (University of Illinois)
Sasha Rakhlin (University of Pennsylvania)
Lorenzo Rosasco (MIT and Universita’ di Genova)
Robert Schapire (Microsoft Research)
Ohad Shamir (Weizman Institute)
David Steurer (ETH Zurich)
Suvrit Sra (MIT)
Nathan Srebro (TTI Chicago)
Karthik Sridharan (Cornell University)
Csaba Szepesvari (University of Alberta)
Matus Telgarsky (University of Illinois)
Ambuj Tewari (University of Michigan)
Alexandre Tsybakov (ENSAE-CREST)
Ruth Urner (Max Planck Institute)
Santosh Vempala (Georgia Tech)
Roman Vershynin (UC Irvine)
Manfred Warmuth (UC Santa Cruz)
Yihong Wu (Yale University)


Call for papers:

The 31st Annual Conference on Learning Theory (COLT 2018) will take place in Stockholm, Sweden, on July 5-9, 2018 (with a welcome reception on the 4th), immediately before ICML 2018, which takes place in the same city. We invite submissions of papers addressing theoretical aspects of machine learning and related topics. We strongly support a broad definition of learning theory, including, but not limited to:
-Design and analysis of learning algorithms
-Statistical and computational complexity of learning
-Optimization methods for learning
-Unsupervised, semi-supervised, online and active learning
-Interactions with other mathematical fields
-Interactions with statistical physics
-Artificial neural networks, including deep learning
-High-dimensional and non-parametric statistics
-Learning with algebraic or combinatorial structure
-Geometric and topological data analysis
-Bayesian methods in learning
-Planning and control, including reinforcement learning
-Learning with system constraints: e.g. privacy, memory or communication budget
-Learning from complex data: e.g., networks, time series, etc.
-Learning in other settings: e.g. social, economic, and game-theoretic

Submissions by authors who are new to COLT are encouraged. While the primary focus of the conference is theoretical, the authors may support their analysis by including relevant experimental results.
All accepted papers will be presented in a single track at the conference. At least one of each paper’s authors should be present at the conference to present the work. Accepted papers will be published electronically in the Proceedings of Machine Learning Research (PMLR). The authors of accepted papers will have the option of opting-out of the proceedings in favor of a 1-page extended abstract. The full paper reviewed for COLT will then be placed on the arXiv repository.

Paper Awards:

COLT will award both best paper and best student paper awards. To be eligible for the best student paper award, the primary contributor(s) must be full-time students at the time of submission. For eligible papers, authors must indicate at submission time if they wish their paper to be considered for a student paper award. The program committee may decline to make these awards, or may split them among several papers.

Dual Submissions:

Submissions that are substantially similar to papers that have been previously published, accepted for publication, or submitted in parallel to other peer-reviewed conferences with proceedings may not be submitted to COLT. The same policy applies to journals, unless the submission is a short version of a paper submitted to a journal, and not yet published. Authors must declare such dual submissions either through the Easychair submission form, or via email to the program chairs.


Submissions are limited to 12 PMLR-formatted pages, plus unlimited additional pages for references and appendices. All details, proofs and derivations required to substantiate the results must be included in the submission, possibly in the appendices. However, the contribution, novelty and significance of submissions will be judged primarily based on the main text (without appendices), and so enough details, including proof details, must be provided in the main text to convince the reviewers of the submissions’ merits. Formatting and submission instructions can be found here.

Rebuttal Phase:

As in previous years, there will be a rebuttal phase during the review process. Initial reviews will be sent to authors before final decisions have been made. Authors will have the opportunity to provide a short response on the PC’s initial evaluation.

Open Problems Session:

We also invite submission of open problems. A separate call for open problems will be made available on the conference website.

Important Dates:

-Paper submission deadline: February 16, 2018, 11:00 PM EST
-Author feedback: April 18-21, 2018
-Author notification: May 2, 2018
-Conference: July 5-9, 2018 (welcome reception on the 4th)


Papers should be submitted through EasyChair at


Posted in Conference/workshop | Leave a comment

Michael B. Cohen

This is an incredibly difficult post to write. Michael Benjamin Cohen, an amazing student and person passed away this week in Berkeley. Below is the official MSR statement (where he spent the summer together with many friends) and some personal remembrance. Also this is a picture of one of the many many happy moments we had together this summer at MSR (with Yin Tat Lee too):

MSR statement:

The Microsoft Research family is shocked and deeply saddened by the tragic loss of our colleague and friend Michael Cohen.  Michael was a brilliant mathematician and a rising star in his field.  He spent this past summer with us as a Microsoft Research Fellow, doing what he loved most. Over the summer, he made sweeping progress in online learning and online algorithms, two fields he had just recently become acquainted with. In addition to solving five open problems in these areas, he continued his substantial progress on the k-server problem, one of the most celebrated and notoriously difficult challenges in the space of adaptive algorithms.

Michael was a truly exceptional individual.  We will remember Michael for his infectious smile and his larger-than-life personality. We will never forget his unrelenting curiosity, his thirst for knowledge, and his deep love for mathematics and theoretical foundations of computing.  We are stunned by his loss.  We will hold onto our memories of Michael, and know that his ideas and scientific accomplishments will continue on as important advances.

We extend our most sincere condolences to Michael’s family and friends.

Personal notes:

I have known Michael for less than a year, but in that short time span he had a profound impact on me, like only a handful of people have had in my scientific life. I will always remember our first meeting, on October 26th 2016 at MIT. We were about to start lunch with a small group of graduate students and Michael entered the room, he (gently) interrupted the conversation and his first sentence to me was a question about mirror descent that I was not able to answer (we now know the answer, and as it turns out his question was pretty deep and the answer highly non-trivial). This is a typical Michael story, and I’m sure anyone who has interacted with him had similar experiences, where a seemingly innocuous question (at a perhaps inopportune time) turned out to be extremely deep and interesting.

Michael’s way of doing mathematics was truly remarkable. He was relentlessly searching for the right way to look at a problem, and he was never satisfied with a hacky explanation. As far as I know he never wrote anything on paper while doing research. I assume that if the calculations were too long to fit in his head then it was probably the wrong approach anyway. Doing research with him was an immense pleasure. He was always willing to share his new and original point of view on his current readings. He was also interested in a breadth of topics that is only matched by a tiny fraction of researchers in theoretical computer science. As James Lee shared in his comment on Luca’s blog, I too was feeling that all of this was just warm-up for him. I am incredibly sad to have lost a friend and a great collaborator, but I’m also sad that we will never see Michael’s next steps.

Beyond his mathematical gift he was also a kind and gentle human being. Perhaps it was not obvious to see at first sight with his exuberant personality, but he was compassionate and caring.

We truly lost one of the best. It is really hard to cope with the unfairness of these events. We will do our best to share Michael’s wonderful ideas with the world, and to make his scientific accomplishements live on. Needless to say my heart breaks for his family and all my thoughts are with them at this moment.

Posted in Uncategorized | 7 Comments

Smooth distributed convex optimization

A couple of months ago we (Kevin Scaman, Francis Bach, Yin Tat Lee, Laurent Massoulie and myself) uploaded a new paper on distributed convex optimization. We came up with a pretty clean picture for the optimal oracle complexity of this setting, which I will describe below. I should note that there are hundreds of papers on this topic, but the point of the post is to show our simple cute calculations and not to survey the immense literature on distributed optimization, see the paper itself for a number of pointers to other recent works.

Distributed optimization setting

Let G=(V,E) be an undirected graph on n vertices (|V| =n) and with diameter \Delta. We will think of the nodes in G as the computing units. To each vertex v \in V there is an associated convex function f_v : \mathbb{R}^d \rightarrow \mathbb{R}. For machine learning applications one can think that each computing unit has access to a “private” dataset, and f_v(x) represents the fit of the model corresponding to x \in \mathbb{R}^d on this dataset (say measured on least squares loss, or logistic loss for example). The goal will be to find in a distributed way the optimal “consensus” point:

    \[x^* \in \mathrm{argmin}_{x \in \mathbb{R}^d} \sum_{v \in V} f_v(x) .\]

The distributed processing protocol is as follows: asynchronously/in parallel, each node v can (i) compute a (local) gradient \nabla f_i in time 1, and (ii) communicate a vector in \mathbb{R}^d to its neighbors in G in time \tau. We denote by x_{v,t} the local model (essentially its guess for x^*) of node v at time t \in \mathbb{R}_+. We aim to characterize the smallest time T_{\epsilon} such that one can guarantee that all nodes v satisfy \bar{f}(x_{v,T}) - \bar{f}(x^*) \leq \epsilon where \bar{f} = \frac1n \sum_{v \in V} f_v.

We focus on the case where \bar{f} is \beta-smooth and \alpha-strongly convex (\kappa =\beta/\alpha is the condition number), which is arguably the most challenging case since one expects linear convergence (i.e., the scaling of T_{\epsilon} in \epsilon should be \log(1/\epsilon)) which a priori makes the interaction of optimization error and communication error potentially delicate (one key finding is that in fact it is not delicate!). Also, having in mind applications outside of large-scale machine learning (such as “federated” learning), we will make no assumptions about the functions at different vertices relate to each other.

A trivial answer

Recall that Nesterov’s accelerated gradient descent solves the serial problem in time O(\sqrt{\kappa} \log(1/\epsilon)). Trivially one can distribute a step of Nesterov’s accelerated gradient descent in time O(\tau \Delta) (simply designate a master node at the beginning, and everybody sends its local gradient to the master node in time O(\tau \Delta)). Thus we arrive at the upper bound T_{\epsilon} = O((1+\tau \Delta) \sqrt{\kappa} \log(1/\epsilon)) using a trivial (centralized) algorithm. We now show (slightly informally, see the paper for proper definitions) that this in fact optimal!

First let us recall the lower bound proof in the serial case (see for example Theorem 3.15 here). The idea is to introduce the function f(x) = \beta L[x,x] - 2 \alpha x(1) + \alpha |x|^2 where L is the Laplacian of the path graph on \N, or in other words

    \[L[x,x] = \sum_{i \in \mathbb{N}} (x(i) - x(i+1))^2 .\]

First it is easy to see that this function is indeed O(\beta)-smooth and \Omega(\alpha)-strongly convex. The key point is that, for any algorithm starting at x_0=0 and such that each iteration stays in the linear span of the previously computed gradients (a very natural assumption) then

    \[x_t(i) = 0 \;\; \text{for all} \;\; i > t ~.\]

In words one can say that each gradient calculation “discovers” a new edge of the path graph involved in the definition of f. Concluding the serial proof is then just a matter of brute force calculations.

Now let us move to the distributed setting, and consider two vertices u and v that realize the diameter of G. The idea goes as follows: let L_1 (respectively L_2) be the Laplacian of even edges of the path graph on \mathbb{N} (respectively the odd edges), that is

    \[L_1[x,x] = \sum_{i \in \mathbb{N} : i \; \text{even}} (x(i) - x(i+1))^2 .\]

Now define \frac{1}{n} f_u = \beta L_1[x,x] - 2 \alpha x(1) + \frac{\alpha}{2} |x|^2, \frac{1}{n} f_v = \beta L_2[x,x] + \frac{\alpha}{2} |x|^2, and f_w = 0 for any w \not\in \{u,v\}. The key observation is that node u does not “know” about the even edges until it receives a message from v and vice versa. Thus it fairly easy to show that in this case one has:

    \[x_{u,t}(i) = 0 \;\; \text{for all} \;\; i > t / (1+\tau \Delta) ~,\]

which effectively amounts to a slowdown by a factor (1+\tau \Delta) compared to the serial case and proves the lower bound T_{\epsilon} = \Omega((1+\tau \Delta) \sqrt{\kappa} \log(1/\epsilon).

Not so fast!

One can say that the algorithm proposed above defeats a bit the purpose of the distributed setting. Indeed the centralized communication protocol it relies on is not robust to various real-life issues such as machine failures, time-varying graphs, edges with different latency, etc. An elegant and practical solution is to restrict communication to be gossip-like. That is local computations have now to be communicated via matrix multiplication with a walk matrix W which we define as satisfying the following three conditions: (i) W(i,j) \neq 0 \Rightarrow (i,j) \in E, (ii) \mathrm{Ker}(W) =\mathrm{Span}((1,\hdots,1)), and (iii) W \succeq 0. Let us briefly discuss these conditions: (i) simply means that if x \in \mathbb{R}^V represents real values stored at the vertices, then W x can be calculated with a distributed communication protocol; (ii) says that if there is consensus (that is all vertices hold the same value) then no communication occurs with this matrix multiplication; and (iii) will turn out to be natural in a just a moment for our algorithm based on duality. A prominent example of a walk matrix would be the (normalized) Laplacian of G

We denote by \gamma the inverse condition number of W on (1,\hdots,1)^{\perp} (that is the ratio of the smallest non-zero eigvenvalue of W to its largest eigenvalue), also known as the spectral gap of G when W is the Laplacian. Notice that \gamma naturally controls the number of gossip steps to reach consensus, in the sense that gossip steps corresponds to gradient descent steps on f(x) = x^{\top} W x, which will converge in O(\gamma^{-1} \log(1/\epsilon)) steps. Doing an “accelerated gossip” (also known as Chebyshev gossiping) one could thus hope to essentially replace the diameter \Delta by O(\gamma^{-1/2} \log(1/\epsilon)). Notice that this is hopeful thinking because in the centralized model \Delta steps gets you to an exact consensus, while in the gossip model one only reaches an \epsilon-approximate consensus and errors might compound. In fact with a bit of graph theory one can immediately see that simply replacing \Delta by O(\gamma^{-1/2} \log(1/\epsilon)) is too good to be true: there are graphs (namely expanders) where \Delta is of order \log(n) while \gamma is of order of a constant, and thus an upper bound of the form (say) \Omega((1+\tau \gamma^{-1/2} \log(1/\epsilon)) \sqrt{\kappa} \log(1/\epsilon) would violate our previous lower bound by a factor 1/\log(n).

To save the day we will make extra assumptions, namely that each local function f_v has condition number \kappa and that in addition to computing local gradient the vertices can also compute local gradients of the Fenchel dual functions f_v^*. The latter assumption can be removed at the expense of extra logarithmic factors but we will ignore this point here (see the paper for some hints as well as further discussion on this point). For the former assumption we note that the lower bound proof given above completely breaks under this assumption. However one can save the construction for some specific graphs (finding the correct generalization to arbitrary graphs is one of our open problems). For example imagine a line graph, and cluster the vertices into three groups, the first third, the middle, and the last third. Then one could distribute the even part of the Laplacian on \mathbb{N} in the first group, and the odd part on the last group, as well as distribute the Euclidean norm evenly among all vertices. This construction verifies that each vertex function has condition number O(\kappa) and furthermore the rest of the argument still goes through. Interestingly in this case one also has \Delta \approx 1/\sqrt{\gamma} and thus this proves that for the line graph one has T_{\epsilon} = \Omega((1+\tau \gamma^{-1/2}) \sqrt{\kappa} \log(1/\epsilon) for gossip algorithms. We will now show a matching upper bound (which holds for arbitrary graphs).

Dual algorithm

For X \in \mathbb{R}^{d \times n} (which we think of as a set of column vectors, one for each vertex i \in V = [n]), denote X_i for the i^{th} column and let F(X) := \sum_{i =1}^n f_i(X_i). We are interested in minimizing F under the constraint that all columns are equal, which can be written as X \sqrt{W} = 0. By definition of the Fenchel dual F^* and a simple change of variable one has:

    \[\min_{X \in \mathbb{R}^{d\times n} : X \sqrt{W} = 0} F(X) = \max_{\lambda \in \mathbb{R}^{d \times n}} - F^*(\lambda \sqrt{W}) .\]

Next observe that gradient ascent on \lambda can be written as

    \[\lambda^+ = \lambda + \eta \nabla F^*(\lambda \sqrt{W}) \sqrt{W}\]

and with the notation y = \lambda \sqrt{W} this is simply y^+ = y - \eta \nabla F^*(y) W. Crucially \nabla F^*(y) W exactly corresponds to gossiping the local conjugate gradients (which are also the local models) \nabla f_i^*(y_{i,t}) =: X_{i,t}. In other words we only have to understand the condition number of the function \lambda \mapsto F^*(\lambda \sqrt{W}). The beauty of all of this is that this condition number is precisely \kappa / \gamma (i.e. it naturally combines the condition number of the vertex functions with the “condition number” of the graph). Thus by accelerating gradient ascent we arrive at a time complexity of O((1+\tau) \sqrt{\kappa / \gamma} \log(1/\epsilon)) (recall that a gossip step takes time \tau). We call the corresponding algorithm SSDA (Single-Step Dual Accelerated). One can improve it slightly in the case of low communication cost by doing multiple rounds of communication between two gradient computations (essentially replacing W by W^k). We call the corresponding algorithm MSDA (Multi-Step Dual Accelerated) and its attains the optimal (in the worst case over graphs) complexity of O((1+\tau \gamma^{-1/2}) \sqrt{\kappa} \log(1/\epsilon)).

Posted in Optimization | Leave a comment

COLT 2017 accepted papers

The list of accepted papers at COLT 2017 has been published and it looks particularly good (see below with links to arxiv version)! The growth trend of previous years continues with 228 submissions (14% increase from 2016) and 73 accepted papers. Note also that this year’s edition will be in Amsterdam beginning of July which should be fun. The deadline for early registration is in two days, so hurry up!

COLT 2017 accepted papers
– Constantinos Daskalakis, Manolis Zampetakis and Christos Tzamos. Ten Steps of EM Suffice for Mixtures of Two Gaussians
– Shachar Lovett and Jiapeng Zhang. Noisy Population Recovery from Unknown Noise
– Jonathan Scarlett, Ilija Bogunovic and Volkan Cevher. Lower Bounds on Regret for Noisy Gaussian Process Bandit Optimization
– Avrim Blum and Yishay Mansour. Efficient Co-Training of Linear Separators under Weak Dependence
– Michal Moshkovitz and Dana Moshkovitz. Mixing Implies Lower Bounds for Space Bounded Learning
– Mitali Bafna and Jonathan Ullman. The Price of Selection in Differential Privacy
– Nader Bshouty, Dana Drachsler Cohen, Martin Vechev and Eran Yahav. Learning Disjunctions of Predicates
– Avinatan Hassidim and Yaron Singer. Submodular Optimization under Noise
– Debarghya Ghoshdastidar, Ulrike von Luxburg, Maurilio Gutzeit and Alexandra Carpentier. Two-Sample Tests for Large Random Graphs using Network Statistics
– Andreas Maurer. A second-order look at stability and generalization
– Eric Balkanski and Yaron Singer. The Sample Complexity of Optimizing a Convex Function
– Daniel Vainsencher, Shie Mannor and Huan Xu. Ignoring Is a Bliss: Learning with Large Noise Through Reweighting-Minimization
– Alekh Agarwal, Haipeng Luo, Behnam Neyshabur and Robert Schapire. Corralling a Band of Bandit Algorithms
– Nikita Zhivotovskiy. Optimal learning via local entropies and sample compression
– Max Simchowitz, Kevin Jamieson and Benjamin Recht. The Simulator: Understanding Adaptive Sampling in the Moderate-Confidence Regime
– Lunjia Hu, Ruihan Wu, Tianhong Li and Liwei Wang. Quadratic Upper Bound for Recursive Teaching Dimension of Finite VC Classes
– Nicolas Brosse, Alain Durmus, Eric Moulines and Marcelo Pereyra. Sampling from a log-concave distribution with compact support with proximal Langevin Monte Carlo
– Maxim Raginsky, Alexander Rakhlin and Matus Telgarsky. Non-Convex Learning via Stochastic Gradient Langevin Dynamics: A Nonasymptotic Analysis
– Alon Cohen, Tamir Hazan and Tomer Koren. Tight Bounds for Bandit Combinatorial Optimization
– Moran Feldman, Christopher Harshaw and Amin Karbasi. Greed Is Good: Near-Optimal Submodular Maximization via Greedy Optimization
– Bin Hu, Peter Seiler and Anders Rantzer. A Unified Analysis of Stochastic Optimization Methods Using Jump System Theory and Quadratic Constraints
– Jerry Li. Robust Sparse Estimation Tasks in High Dimensions (*to be merge)
– Yeshwanth Cherapanamjeri, Prateek Jain and Praneeth Netrapalli. Thresholding based Efficient Outlier Robust PCA
– Amir Globerson, Roi Livni and Shai Shalev-Shwartz. Effective Semisupervised Learning on Manifolds
– Yuchen Zhang, Percy Liang and Moses Charikar. A Hitting Time Analysis of Stochastic Gradient Langevin Dynamics
– Ashok Cutkosky and Kwabena Boahen. Online Learning Without Prior Information
Joon Kwon, Vianney Perchet and Claire Vernade. Sparse Stochastic Bandits
– Blake Woodworth, Suriya Gunasekar, Mesrob I. Ohannessian and Nathan Srebro. Learning Non-Discriminatory Predictors
– Arpit Agarwal, Shivani Agarwal, Sepehr Assadi and Sanjeev Khanna. Learning with Limited Rounds of Adaptivity: Coin Tossing, Multi-Armed Bandits, and Ranking from Pairwise Comparisons
– Jerry Li and Ludwig Schmidt. Robust Proper Learning for Mixtures of Gaussians via Systems of Polynomial Inequalities
Alexandr Andoni, Daniel Hsu, Kevin Shi and Xiaorui Sun. Correspondence retrieval
– Andrea Locatelli, Alexandra Carpentier and Samory Kpotufe. Adaptivity to Noise Parameters in Nonparametric Active Learning
Salil Vadhan. On Learning versus Refutation
– Sebastian Casalaina-Martin, Rafael Frongillo, Tom Morgan and Bo Waggoner. Multi-Observation Elicitation
Vitaly Feldman and Thomas Steinke. Generalization for Adaptively-chosen Estimators via Stable Median
Shipra Agrawal, Vashist Avadhanula, Vineet Goyal and Assaf Zeevi. Thompson Sampling for the MNL-Bandit
Rafael Frongillo and Andrew Nobel. Memoryless Sequences for Differentiable Losses
– Pranjal Awasthi, Avrim Blum, Nika Haghtalab and Yishay Mansour. Efficient PAC Learning from the Crowd
– Tselil Schramm and David Steurer. Fast and robust tensor decomposition with applications to dictionary learning
– Yury Polyanskiy, Ananda Theertha Suresh and Yihong Wu. Sample complexity of population recovery
– Holden Lee, Rong Ge, Tengyu Ma, Andrej Risteski and Sanjeev Arora. On the Ability of Neural Nets to Express Distributions
Lijie Chen, Anupam Gupta, Jian Li, Mingda Qiao and Ruosong Wang. Nearly Optimal Sampling Algorithms for Combinatorial Pure Exploration
Dylan Foster, Alexander Rakhlin and Karthik Sridharan. ZIGZAG: A new approach to adaptive online learning
Posted in Conference/workshop | 1 Comment

Discrepancy algorithm inspired by gradient descent and multiplicative weights; after Levy, Ramadas and Rothvoss

A week or so ago at our Theory Lunch we had the pleasure to listen to Harishchandra Ramadas (student of Thomas Rothvoss) who told us about their latest discrepancy algorithm. I think the algorithm is quite interesting as it combines ideas from gradient descent and multiplicative weights in a non-trivial (yet very simple) way. Below I reprove Spencer’s 6 deviations theorem with their machinery (in the actual paper Levy, Ramadas and Rothvoss do more than this).

First let me remind you the setting (see also this previous blog post for some motivation on discrepancy and a bit more context; by the way it is funny to read the comments in that post after this): given v_1, \hdots, v_n \in \mathbb{S}^{n-1} one wants to find x \in \{-1,1\}^n (think of it as a “coloring” of the coordinates) such that \max_{i \in [n]} |x \cdot v_i| \leq C for some numerical constant C>0 (when v_i is a normalized vectors of 1‘s and 0‘s the quantity |x \cdot v_i| represents the unbalancedness of the coloring in the set corresponding to v_i). Clearly it suffices to give a method to find x \in [-1,1]^n with at least half of its coordinates equal to -1 and 1 and such that \max_{i \in [n]} |x \cdot v_i| \leq C' for some numerical constant C'>0 (indeed one can then simply recurse on the coordinates not yet set to -1 or 1; this is the so-called “partial coloring” argument). Note also that one can drop the absolute value by taking v_i and -v_i (the number of constraints then becomes 2n but this is easy to deal with and we ignore it here for sake of simplicity).

The algorithm

Let x_0 = 0, w_0 = 1 \in \mathbb{R}^n. We run an iterative algorithm which keeps at every time step t \in \mathbb{N} a subspace U_t of valid update directions and then proceeds as follows. First find (using for instance a basis for U_t) z_t \in \mathbb{S}^{n-1} \bigcap U_t such that

(1)   \begin{equation*}  \sum_{i=1}^n \frac{w_t(i)}{\|w\|_1} (v_i \cdot z_t)^2 \leq \frac{1}{\mathrm{dim}(U_t)} . \end{equation*}

Then update x_{t+1}= x_t + \lambda_t z_t where \lambda_t \in [0,1] is maximal so that x_{t+1} remains in [-1,1]^n. Finally update the exponential weights by w_{t+1}(i) = w_t(i) \exp( v_i \cdot (x_{t+1} - x_t) ).


It remains to describe the subspace U_t. For this we introduce the set I_t \subset [n] containing the n/16^{th} largest coordinates of w_t (the “inactive” coordinates) and the set F_t \subset [n] containing the coordinates of x_t equal to -1 or 1 (the “frozen” coordinates). The subspace U_t is now described as the set of points orthogonal to (i) x_t, (ii) e_j, j \in F_t, (iii) v_i, i \in I_t, (iv) \sum_{i=1}^n w_t(i) v_i. The intuition for (i) and (ii) is rather clear: for (i) one simply wants to ensure that the method keeps making progress towards the boundary of the cube (i.e., |x_{t+1}| > |x_t|) while for (ii) one wants to make sure that coordinates which are already “colored” (i.e., set to -1 or 1) are not updated. In particular (i) and (ii) together ensures that at each step either the norm squared of x_t augments by 1 (in particular \lambda_t=1) or that one fixes forever one of the coordinates to -1 or 1. In particular this means that after at most 3 n /2 iterations one will have a partial coloring (i.e., half of the coordinates set to -1 or 1, which was our objective). Property (iii) is about ensuring that we stop walking in the directions where we are not making good progress (there are many ways to ensure this and this precise form will make sense towards the end of the analysis). Property (iv) is closely related, and while it might be only a technical condition it can also be understood as ensuring that locally one is not increasing the softmax of the constraints, indeed (iv) exactly says that one should move orthogonally to the gradient of \log(\sum_{i=1}^n \exp(x \cdot v_i)).

The analysis

Let Z_t = \sum_{i=1}^n w_t(i). Note that since z_t is on the sphere and \lambda_t \in [0,1] one has that |v_i \cdot (x_{t+1} - x_t)| \leq 1. Thus using \exp(x) \leq 1 + x + x^2 for x \in [-1,1], as well as property (iv) (i.e., \sum_{i=1}^n w_t(i) v_i \cdot z_t = 0) and \lambda_t \in [0,1] one obtains:

    \[Z_{t+1} = \sum_{i=1}^n w_t(i) \exp(v_i \cdot (x_{t+1} - x_t)) \leq \sum_{i=1}^n w_t(i) (1 + (v_i \cdot z_t)^2) .\]

Observe now that the subspace U_t has dimension at least n/4 (say for n \geq 16) and thus by (1) and the above inequalities one gets:

    \[Z_{t+1} \leq (1+ 4/n) Z_t .\]

In particular for any t \leq 2n, Z_{t} \leq C n for some numerical constant C >0. It only remains to observe that this ensures w_{2n}(i) = O(1) for any i \in [n] (this concludes the proof since we already observed that at time 2 n at least half of the coordinates are colored). For this last implication we simply use property (iii). Indeed assume that some coordinate i satisfies at some time t \leq 2n, w_t(i) > c e for some c>0. Since each update increases the weights (multiplicatively) by at most e it means that there is a previous time (say s) where this weight was larger than c and yet it got updated, meaning that it was not in the top n/16 weights, and in particular one had Z_s \geq c n / 16 which contradicts Z_{s} \leq C n for c large enough (namely c > 16 C).

Posted in Theoretical Computer Science | 2 Comments