I thought this would be a good opportunity to revisit the proof of Nesterov’s momentum, especially since, as it turns out, I really don’t like the way I described it back in 2013 (and to this day the latter post remains my most visible post ever…). So here we go, for what is hopefully a short and intuitive proof of the convergence rate of Nesterov’s momentum (disclaimer: this proof is merely a rearrangement of well-known calculations, nothing new is going on here).

We assume that $f$ is a $\beta$-smooth convex function, and we take the step size $1/\beta$ in the gradient step. The momentum term will be set to a very particular value, which comes out naturally in the proof.

**The two basic inequalities**

Let us denote and (note that ). Now let us write our favorite inequalities (using and ):

and

**On the way to a telescopic sum**

Recall now that , so it would be nice to somehow combine the two above inequalities to obtain a telescopic sum thanks to this simple formula. Let us try to take a convex combination of the two inequalities. In fact it will be slightly more elegant if we use the coefficient on the second inequality, so let us do times the first inequality plus times the second inequality. We obtain an inequality whose right hand side is given by times

Recall that our objective is to obtain a telescopic sum, and at this point we still have flexibility both to choose and . What we would like to have is:

Observe that (since ) the right hand side can be written as , and thus we see that we simply need to have:

**Setting the parameters and concluding the proof**

Writing we now obtain as a result of the combination of our two starting inequalities:

It only remains to select such that (i.e., roughly is of order ) so that by summing the previous inequality one obtains which is exactly the rate we were looking for.
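To make the rate concrete, here is a minimal numerical sketch (not part of the original argument; the quadratic test function and all parameters are illustrative) of Nesterov's momentum with the standard momentum sequence, run with step size one over the smoothness:

```python
import numpy as np

def nesterov(grad, x0, beta, steps):
    """Nesterov's accelerated gradient with step size 1/beta."""
    x_prev = x0.copy()
    y = x0.copy()
    lam = 1.0  # lambda_1 = 1, so the first momentum coefficient is 0
    for _ in range(steps):
        lam_next = (1 + np.sqrt(1 + 4 * lam**2)) / 2
        gamma = (lam - 1) / lam_next      # momentum coefficient
        x = y - grad(y) / beta            # gradient step at the extrapolated point
        y = x + gamma * (x - x_prev)      # momentum extrapolation
        x_prev, lam = x, lam_next
    return x_prev

# Toy instance: f(x) = 0.5 x^T A x, which is beta-smooth with beta = 10
# (the largest eigenvalue of A). The minimizer is the origin.
A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
x = nesterov(lambda x: A @ x, np.array([1.0, 1.0]), beta=10.0, steps=500)
```

After $T$ steps the guarantee proved above is of order $\beta \|x_0 - x^*\|^2 / T^2$, so with 500 steps the suboptimality on this toy instance is far below $10^{-3}$.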

This summer at MSR Michael was still very present in our discussions, whether about ideas we discussed during that last summer of 2017 (acceleration, metrical task systems lower bounds, etc…), or just some random fun story.

I highly recommend taking a look at the YouTube videos from the November 2017 symposium in memory of Michael. You can also take a look at his (still growing) list of publications on arXiv. In fact I know of an upcoming major paper, so stay tuned (the premises are in Yin Tat’s talk at the symposium).

As always when remembering this tragic loss my thoughts go to Michael’s family.

**Syllabus**

Lecture 1: Introduction to the statistical learning theory framework, its basic question (sample complexity) and its canonical settings (linear classification, linear regression, logistic regression, SVM, neural networks). Two basic methods for learning: (i) Empirical Risk Minimization, (ii) Nearest neighbor classification.

Lecture 2: Uniform law of large numbers approach to control the sample complexity of ERM (includes a brief reminder of concentration inequalities). Application: analysis of bounded regression (includes the non-standard topic of type/cotype and how it relates to different regularizations such as in LASSO).

Lecture 3: Reminder of the first two lectures and relation with the famous VC dimension. How to generalize beyond uniform law of large numbers: stability and robustness approaches (see below).

Lecture 4: How to generalize beyond uniform law of large numbers: information theoretic perspective (see below), PAC-Bayes, and online learning. Brief discussion of margin theory, and an introduction to modern questions in robust machine learning.

**Some notes on algorithmic generalization**

Let be input/output spaces. Let be a loss function, a probability measure supported on , and a learning rule (in words, takes as input a dataset of examples, and outputs a mapping from -inputs to -outputs). With a slight abuse of notation, for and , we write . We define the generalization of on by:

In words, if then we expect the empirical performance of the learned classifier to be representative of its performance on a fresh out-of-sample data point, up to an additive term. The whole difficulty, of course, is that the empirical evaluation is done with the *same* dataset that is used for training, leading to non-trivial dependencies. We should also note that in many situations one might be interested in the two-sided version of the generalization, as well as high probability bounds instead of bounds in expectation. For simplicity we focus on the one-sided version in expectation here.

The most classical approach to controlling generalization, which we covered in detail in previous notes, is via a uniform law of large numbers. More precisely, assuming that the range of the learning rule is some hypothesis class, one trivially has

However this approach might be too coarse when the learning rule is searching through a potentially huge space of hypotheses (such as in the case of neural networks). Certainly such a uniform bound has no chance of explaining why neural networks with billions of parameters would generalize with a dataset of merely millions of examples. For this one has to use *algorithm-based* arguments.

**Stability**

The classical example of algorithmic generalization is due to Bousquet and Elisseeff 2002. It is a simple rewriting of the generalization as a *stability* notion:

where . This viewpoint can be quite enlightening. For example, in the uniform law of large numbers view, regularization enforces small capacity, while in the stability view we see that regularization ensures that the output hypothesis is not too brittle (this was covered in some detail in the previous notes).
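As an illustration of the stability viewpoint (a toy experiment, not from the notes; the data and all parameters are made up), one can check numerically that the regularized ERM of ridge regression barely moves when a single training example is replaced, in line with the Bousquet-Elisseeff picture of stability for regularized ERM:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 1.0
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def ridge(X, y, lam):
    # argmin_w (1/n) ||Xw - y||^2 + lam ||w||^2, in closed form
    n = len(y)
    return np.linalg.solve(X.T @ X / n + lam * np.eye(X.shape[1]), X.T @ y / n)

w = ridge(X, y, lam)
X2, y2 = X.copy(), y.copy()
X2[0], y2[0] = rng.normal(size=d), rng.normal()  # replace one training example
w2 = ridge(X2, y2, lam)
diff = np.linalg.norm(w - w2)  # small: the learned predictor is stable
```

The change in the learned predictor scales like $1/(\lambda n)$, so with $n = 200$ the two weight vectors are very close.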

**Robustness**

The next approach I would like to discuss is related to deep questions about current machine learning methods. One of the outstanding problems in machine learning is that current algorithms are not robust to even mild shifts of distribution at test time. Intuitively this lack of robustness seems to indicate a lack of generalization. Can we formalize this intuition? I will now give one such formal link between robustness and generalization, due to Xu and Mannor 2010, which shows the reverse direction (robustness implies generalization). At some level robustness can be viewed as “stability at test time” (while in Bousquet and Elisseeff we care about “stability at training time”).

Xu and Mannor define -robustness as follows: assume that can be partitioned into sets such that if and are in the same set then

A good example to have in mind would be a binary classifier with large margin, in which case corresponds to the covering number of at the scale given by the margin. Another (related) example would be regression with a Lipschitz function. In both cases would typically be exponential in the dimension of . The key result of Xu and Mannor that we prove next is a generalization bound of order . In any situation of interest this seems to me to be a pretty weak bound, yet on the other hand I find the framework to be very pretty and it is of topical interest. I would be surprised if this was the end of the road in the space of “generalization and robustness”.

Theorem (Xu and Mannor 2010):

A -robust learning rule satisfies

**Proof:** Let and note that . Now one has for a robust :

It only remains to observe that

**Information theoretic perspective**

Why do we think that a lack of robustness indicates a lack of generalization? Well, it seems to me that a basic issue could simply be that the dataset was *memorized* by the neural network (which would be a *very* non-robust way to learn). If true, then one could basically find all the information about the data in the weights of the neural network. Again, can we prove at least the opposite direction, that is, that if the output hypothesis does not retain much information from the dataset then it must generalize? This is exactly what Russo and Zou 2016 did, using the mutual information as a measure of the “information” retained by the trained hypothesis about the dataset. More precisely they show the following result:

Theorem (Russo and Zou 2016):

Note that here we have assumed that the codomain of the learning rule consists of deterministic maps from inputs to outputs, in which case the mutual information is simply the entropy . However the proof below also applies to the case where the codomain of the learning rule consists of probability measures, see e.g., Xu and Raginsky 2017. Let us now conclude this (long) post with the proof of the above theorem.

The key point is very simple: one can view generalization as a decoupling property by writing:

where .

Now the theorem follows straightforwardly (if one knows Hoeffding’s lemma) from an application of the following beautiful lemma:

Lemma: Let . Let be random variables in and , be mutually independent copies of and . Assume that is -subgaussian (i.e., ); then

**Proof:** The mutual information is equal to the relative entropy between the distribution of and the distribution of . Recall also the variational representation of the relative entropy, which is that the map is the convex conjugate of the log-partition function . In particular one has a lower bound on the mutual information for any such , which means:

Now it only remains to use the definition of subgaussianity, that is take , and optimize over .
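For concreteness, the end of the proof can be written out as follows (a reconstruction in standard notation, with $g$ the function of the lemma, $\sigma$ its subgaussianity parameter, and $\lambda > 0$ the free parameter to optimize). The variational lower bound together with subgaussianity of $g(X',Y')$ gives

```latex
I(X;Y) \;\ge\; \mathbb{E}\big[\lambda\, g(X,Y)\big] - \log \mathbb{E}\big[e^{\lambda\, g(X',Y')}\big]
       \;\ge\; \lambda\big(\mathbb{E}\, g(X,Y) - \mathbb{E}\, g(X',Y')\big) - \frac{\lambda^2 \sigma^2}{2} \, ,
```

so that $\mathbb{E}\, g(X,Y) - \mathbb{E}\, g(X',Y') \le \inf_{\lambda > 0}\big( I(X;Y)/\lambda + \lambda \sigma^2 / 2 \big) = \sigma \sqrt{2\, I(X;Y)}$.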

- The main culprit is definitely the COLT 2018 chairing. This year we received a surprise 50% increase in the number of submissions. This is great news for the community, and it led to a truly fantastic program. We are looking forward to seeing many of you in Stockholm! Note that the early registration deadline is tomorrow (May 30th).
- The first volume of our new Mathematical Statistics and Learning journal just got published. If you care about being evaluated by top experts, and/or about the visibility of being in a highly selective mathematical journal, you should consider submitting your highest quality work there!
- The first set of video lectures on the YouTube channel is now complete. I realize that the quality is not perfect and that it is hard to read the blackboard. I will try to improve this for future videos (which should be posted some time during the summer).
- Finally I am quite excited to have a first preprint (joint work with Eric Price and Ilya Razenshteyn) at least loosely related to deep learning. It is titled “Adversarial examples from computational constraints”, which sums up the paper pretty well. In a nutshell, we prove that avoiding adversarial examples is computationally hard, even in situations where there exists a very robust classifier. We are looking forward to getting the deep learning community’s feedback on this work!

**State space for weighted -paging**

Let be the weight of the edge from leaf to the root. Recall from the previous post that we want to find a norm and a convex body that can express the Wasserstein- distance between two fractional -server configurations.

Consider a fractional move from to . Then clearly, the amount of mass has to transfer through the edge , so that the Wasserstein- distance is at least . Furthermore there is trivially a transport plan achieving that much total mass transfer. In other words we just proved that in this case the appropriate norm is a weighted norm (namely ) and one can simply use the basic state space (recall from the previous post that we have to work with anticonfiguration, and that the mapping to a configuration is simply given by ).

**Applying the general mirror descent framework**

Given a request at location and a current anticonfiguration , our proposed (fractional) algorithm is to run the mirror descent dynamics with a continuous time linear cost from (i.e., for some ) and until the first time at which (i.e., one has a server at location in ). By the lemma at the end of the previous post one has (denote for the request being serviced at time )

One can think of as a “virtual service cost”. In -server this quantity has no real meaning, but the above inequality shows that this quantity, which only depends on the algorithm, is tightly related to the value of OPT (provided that has a small Lipschitz norm). Thus we see that we have two key desiderata for the choice of the mirror map : (i) it should have a small Lipschitz norm, (ii) one should be able to relate the movement cost to this “virtual service cost” , say up to a multiplicative factor . One would then obtain a -competitive algorithm.

**Entropy regularization for weighted -paging**

Let us look at (ii), and we shall see that the entropy regularization comes out very naturally. Ignore for a moment the Lagrange multiplier and let us search for a separable regularizer of the form . We want to relate to . Making those two quantities equal gives and thus the regularizer is a *weighted negentropy*: .
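One can sanity-check this relation numerically (a sketch with made-up weights and costs; we assume the unconstrained mirror descent dynamics $\dot{x} = -(\nabla^2 \Phi(x))^{-1} c$, with $\Phi$ the weighted negentropy whose Hessian is diagonal with entries $w_i / x_i$):

```python
import numpy as np

# With Phi(x) = sum_i w_i x_i log x_i, the Hessian is diag(w_i / x_i), so the
# unconstrained dynamics x' = -Hessian^{-1} c give w_i |x_i'| = x_i c_i:
# movement cost equals the virtual service cost at each location.
w = np.array([2.0, 5.0, 1.0])   # edge weights (illustrative)
x = np.array([0.3, 0.5, 0.2])   # current anticonfiguration (illustrative)
c = np.array([1.0, 0.0, 4.0])   # cost vector (illustrative)

hess_inv = np.diag(x / w)
x_dot = -hess_inv @ c
movement = w * np.abs(x_dot)    # weighted movement cost
service = x * c                 # virtual service cost
```

The two vectors `movement` and `service` coincide coordinate-wise, which is exactly the relation that singles out the weighted negentropy.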

We now need to verify that this relation between the virtual service cost and the movement cost remains true even when the Lagrange multiplier is taken into account. Note that because of the form of the state space the multiplier contains a term of the form (which corresponds to the constraint ) and for each location a term forcing the derivative to be if the value of the missing mass has reached . In other words we obtain the following dynamics for mirror descent with the weighted negentropy regularization:

Notice that up to a factor one can focus on controlling (that is only movement *into* a location is charged). In that view the Lagrange multipliers simply have no effect, since one has (indeed recall that ). Thus we see that the movement cost is exactly bounded by the virtual service cost in this case.

**Making Lipschitz**

It remains to deal with a non-trivial issue, namely the entropy is not Lipschitz on the simplex! A similar issue is faced in online learning when one tries to prove *tracking expert regret bounds*, i.e., bounds with respect to a shifting expert. The standard solution (perhaps first used by Herbster and Warmuth in 98, see also Blum and Burch 00) is to shift the variables so that they never get below some , in which case the Lipschitz constant would be . In the -server scenario one can stop the dynamics when (instead of ) provided that the mapping from to the actual fractional configuration is now given by . This raises a final issue: the total amount of server mass in such a is . Next we show that if is small enough then can be “rounded” online to a fractional -server configuration at the expense of a multiplicative movement. Precisely we show that is sufficient, which in turn gives a final competitive ratio of for weighted -paging.

**From servers to servers**

Consider a fractional server configuration with total mass (i.e., ), which we want to transform into a server configuration with total mass . A natural way to “round down” is at each location to put if the mass was . Furthermore a mass of should stay . This suggests the mapping , which is -Lipschitz. Thus the movement of is controlled by the movement of up to a multiplicative factor . Moreover one clearly has (in fact the inequality can be strict, in which case one should think of the “lost mass” as being stored at the root, this incurs no additional movement cost).
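In code, the rounding map just described reads as follows (a sketch; the notation is assumed: `x` is the fractional configuration with total mass $k + \delta$ and coordinates in $[0,1]$, and `delta` is the excess mass):

```python
import numpy as np

# Coordinate-wise rounding map: x_i -> max(0, (x_i - delta) / (1 - delta)).
# Mass <= delta is sent to 0, a mass of 1 stays at 1, and the map is
# 1/(1-delta)-Lipschitz, so the movement of the rounded configuration is
# controlled by the movement of x up to that multiplicative factor.
def round_down(x, delta):
    return np.maximum(0.0, (x - delta) / (1.0 - delta))

delta = 0.1
x = np.array([1.0, 0.05, 0.55])  # illustrative fractional masses
y = round_down(x, delta)
```

Note that `round_down` can only decrease total mass, with the "lost mass" thought of as stored at the root at no additional movement cost.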


**State representation for MTS**

Recall the MTS problem: one maintains a random state (where is equipped with a distance ), and given a new cost function , one can update the state to with the corresponding price to pay: . Equivalently one can maintain the probability distribution of : indeed given one can obtain by optimal coupling, that is the couple of random variables minimizes the expected distance over all couples such that has marginal and has marginal (the latter quantity is called the Wasserstein- distance between and ). In this view the (expected) service cost is now a linear function, that is where , and the movement cost between and is the Wasserstein- distance.

We will further assume the existence of a convex body and a norm in () such that the Wasserstein- distance between two distributions is equal to where are “expanded states” with and . For a weighted star metric it suffices to take , but we will see in the fourth post that for trees one indeed needs to expand the state space.

**Fractional solution for -server**

Recall the -server problem: one maintains a random configuration , and given a new request one must update the configuration to such that . One could equivalently maintain a distribution as before. In the *fractional* problem one in fact only maintains the first moment of this distribution, , defined by (in particular servicing a request at location means that one must have ). Again the metric here on the variables is the Wasserstein distance induced by (a.k.a., the optimal transport distance). Importantly we note that this view is *not* equivalent to maintaining a full distribution (indeed a lot of information is lost by recording only the first moment). This raises the subtle issue of whether one can *round online* a fractional solution to a proper integral solution whose total movement is of the same order of magnitude. We will not touch upon this question here, and we focus on the fractional -server problem; see for example Section 5.2 here for more. We note however that the existence of a polynomial time algorithm for this rounding task is an open problem.

To think of the request as a linear cost function (like in MTS) it is more appropriate to work with the *anticonfiguration* . In this view a request is equivalent to a cost vector . Finally like in MTS we will assume the existence of an expanded state space and a norm that measures movement in this expanded view.

**Continuous time decision making**

We will now move to a continuous time setting, where the (discrete time) sequence of cost vectors is revealed continuously as a path , with (and ). The decision maker’s response is a path that lives in . In this setting the service cost of the algorithm is and the movement cost is equal to where is the time derivative of the path . We note that there is a small subtlety here in translating the continuous time service cost into a meaningful discrete time service cost, but we will not worry about this here since it does not affect the argument for -server (where there is only a movement cost). If you are curious see the appendix here.

For -server we will use where is the currently requested location, and we move to the next request at the first time such that (which means that satisfies , i.e., there is a server at the requested location).

**Mirror descent**

If you have never seen the mirror descent framework now is a good time to take a quick look here.

Very succinctly, mirror descent with mirror map can be written as follows, with a step-size :

where we recall that is the Legendre-Fenchel transform of (i.e., the map whose gradient is the inverse map of the gradient of ) and is the Bregman divergence associated to .
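For readers who prefer code, here is the discrete-time update in the simplest instance (a sketch, not the continuous-time dynamics used in this series): with the negentropy mirror map on the probability simplex, mirror descent is exactly the multiplicative-weights update.

```python
import numpy as np

# Mirror descent on the simplex with Phi(x) = sum_i x_i log x_i.
# Gradient step in the dual space, pulled back by grad(Phi*) = exp,
# followed by the Bregman projection onto the simplex (= renormalization).
def md_entropy_step(x, cost, eta):
    y = x * np.exp(-eta * cost)
    return y / y.sum()

x = np.full(3, 1.0 / 3.0)                       # start uniform
x = md_entropy_step(x, np.array([0.0, 1.0, 2.0]), eta=1.0)
```

After one step the mass has shifted toward the cheap coordinate, exactly as the multiplicative-weights intuition suggests.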

We now want to take to in the above definition to find the continuous time mirror descent update. For that let us recall the first order optimality condition for constrained optimization. Denote for the normal cone of at , i.e., the set of directions which are negatively correlated with any direction going into the body. One then has for any convex function ,

In particular we see that (note that )

and thus taking to we morally get

This type of equation is known as a differential inclusion, and with the added constraint that the path must live in the constraint set we get a *viability problem*. In our paper we show that a solution indeed exists (and is unique) under mild assumptions on .

**Potential based analysis**

The mirror descent algorithm is fully described by:

Denoting we see that for any fixed ,
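In standard notation (a reconstruction; we write $K$ for the body, $c(t)$ for the cost path, $N_K(x)$ for the normal cone, and use the differential inclusion $\nabla^2 \Phi(x) \dot{x} \in -c(t) - N_K(x)$ from the previous post), this calculation reads:

```latex
\frac{d}{dt} D_{\Phi}(y, x(t))
  = -\big\langle \nabla^2 \Phi(x(t))\, \dot{x}(t),\; y - x(t) \big\rangle
  = \big\langle c(t) + \lambda(t),\; y - x(t) \big\rangle
  \le \big\langle c(t),\, y \big\rangle - \big\langle c(t),\, x(t) \big\rangle \, ,
```

where $\lambda(t) \in N_K(x(t))$, so that $\langle \lambda(t), y - x(t) \rangle \le 0$ for any $y \in K$.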

The above calculation is the key to understanding mirror descent: it says that if the algorithm is currently paying more than , i.e., , then it is actually getting closer to in the sense that is decreasing. Put differently: when the algorithm pays, it also learns. This key insight is sufficient for online learning, where one competes against a fixed point . However in MTS and -server we compete against a path , and thus we also need to evaluate by how much the Bregman divergence can go up when is moving. This is captured by the following calculation:

Putting together the two above calculations we obtain the following control on the service cost of mirror descent in terms of the service cost and movement cost of the optimal path:

Lemma: The mirror descent path satisfies for any comparator path ,

At this point the big question is: how do we control the movement of mirror descent? In the next post we will see how this plays out on a weighted star.


In the coming series of 5 blog posts I will explain the main ideas behind our -server preprint with Michael Cohen, James Lee, Yin Tat Lee, and Aleksander Madry. In this first post I will briefly put the problem in the broader context of online learning and online algorithms, which will be helpful as it suggests an approach based on the mirror descent algorithm. In the next post I will explain the general framework of mirror descent and how it can be applied to a problem such as k-server. The third post will show how to use this general framework to recover effortlessly the state of the art for weighted k-paging (i.e., when the underlying metric space is a weighted star). The fourth post will show how to extend the analysis to arbitrary trees. Finally in the fifth post we will discuss both classical embedding techniques to reduce the problem to the tree case, as well as our new dynamic embedding based on multiscale restarts on the classical technique. The content of the first three posts was essentially covered in my TCS+ seminar talk.

**Online decision making: two probability free models**

I will start by introducing a very general framework for online algorithms due to Borodin, Linial and Saks 92, called *metrical task systems* (MTS). At the same time I will recall the online learning framework, so that one can see the similarities/differences between the two settings. In fact this connection was already made at the end of the Nineties, see Blum and Burch 00. At the time it was natural to explore the power of multiplicative weights. With today’s perspective it is natural to explore the much more general *mirror descent* algorithm (we note that the closely related concept of regularization was already brought to bear for these problems in Abernethy, Bartlett, Buchbinder and Stanton 10, and Buchbinder, Chen and Naor 14).

Let be a set which we think of as either a state space (for online algorithms) or an action space (for online learning). Let be a metric on . Finally let be a set of possible cost functions (e.g., arbitrary bounded functions, or linear functions if , or convex functions, etc). Online decision making is about making decisions in an uncertain environment (*the future is not known*), and here we model this uncertain environment as an unknown sequence of cost functions . The interaction protocol can then be described as follows: For each , the player chooses (possibly randomly) based on past observations (to be made precise soon) and pays the cost , plus possibly a movement penalty: .

There are now two key differences between online algorithms and online learning: (i) the observation model, and (ii) the performance metric. In online learning one assumes that the cost is unknown at the decision time, and is only revealed after the decision is made (in *bandits* an even weaker signal is revealed, namely only the paid cost ). The type of applications one has in mind is say online recommendations, where a user’s preference is only revealed once some item has been recommended to her. On the other hand in online algorithms the cost is known at decision time. In this context the cost corresponds to a *request*, and the player has to find a way to satisfy this request (in the language of MTS the cost represents a *task*, and one gets the option to update its state so as to complete this task more efficiently, i.e., at a lower cost). Now let us discuss the performance metric. In online learning one assumes that there *is* some “hidden” good action (it is hidden in the noise, i.e., a single cost function does not say much about which actions are good, but if one takes the whole sequence into account then there is some good action that hopefully emerges). Thus it is natural to consider the following *regret* notion:

This regret notion does not make sense for online algorithms where one may have to keep changing states to satisfy the request sequence. There one must compare to the best offline strategy, in which case additive guarantees are not attainable and one resorts to a multiplicative guarantee, the so-called competitive ratio:

(Note that in MTS one always assumes nonnegative cost functions so that the multiplicative guarantee makes sense.) The -server problem, introduced in Manasse, McGeoch and Sleator 90, corresponds to a metrical task system on the product state space equipped with the Earthmover distance inherited by , and with cost functions

**Known results**

The online learning setting is by now fairly well-understood. We know that the optimal regret for a finite action set and bounded cost functions is (see e.g., Cesa-Bianchi, Freund, Haussler, Helmbold, Schapire and Warmuth 97). In fact we even know the optimal constant (for large and ), and we have both algorithmic and information theoretic proofs of this result. Moreover we know how to leverage combinatorial information, e.g., when and is a set of linear functions. A unifying principle that was put forth in the last decade or so is the mirror descent strategy.

On the other hand the situation for MTS is much worse. The optimal guarantees for finite sets are not known: the worst case (over all metric spaces of size ) competitive ratio is known to be (trivial coupon-collector style lower bound on the uniform metric) and (the latter bound is due to Fiat and Mendel 03). No information theoretic analysis is known, even for the uniform metric. With combinatorial structure the situation becomes even more disastrous. That’s where the -server problem comes in, as a combinatorial MTS striking the best balance between simplicity and importance. For -server it is conjectured that one could get a competitive ratio of (the coupon-collector lower bound on uniform metric gives here ), while the best result prior to our work was (due to Bansal, Buchbinder, Naor and Madry 11), and if one insists on a bound independent of (due to Koutsoupias and Papadimitriou 95).

**Call for applications for researcher and postdoc positions in Machine Learning and Optimization**

Application deadline: December 1st, 2017

The Machine Learning and Optimization Group at Microsoft Research Redmond invites applications for researcher positions at all levels (postdoc, full-time junior, full-time senior). The current full-time research members are Zeyuan Allen-Zhu, Sebastien Bubeck, Ofer Dekel, Yuval Peres, Ilya Razenshteyn, and Lin Xiao.

All applicants working on machine learning and optimization, including their algorithmic and mathematical foundations, will be considered. We expect candidates to have demonstrated excellence in research and have developed their own original research agenda. Our current areas of focus include statistical and online learning, convex and non-convex optimization, high dimensional data analysis, combinatorial optimization and its applications in AI, statistics, and probability. We are also looking to expand our domain of expertise to other areas of modern machine learning, including more applied research areas.

Microsoft Research offers wonderful resources to develop your research, opportunities for collaborations across all MSR labs and top academic institutions, and the potential to realize your ideas in products and services used worldwide. Our group is part of Microsoft Research AI, a new organization that brings together the breadth of talent across Microsoft Research to pursue game-changing advances in artificial intelligence.

Please apply directly on the Microsoft Research Careers website and include Ofer Dekel as a Microsoft Research contact. In addition, send a copy of your application to mloapp@microsoft.com. To be assured full consideration, all your materials, including at least two reference letters, should arrive by December 1st, 2017. We recommend including a brief academic research statement and links to publications.

Microsoft is an equal opportunity employer. We welcome applications from all qualified candidates, and in particular from women and underrepresented groups.

**Program Committee:**

Jacob Abernethy (Georgia Tech)

Shivani Agarwal (University of Pennsylvania)

Shipra Agrawal (Columbia University)

Alexandr Andoni (Columbia University)

Pranjal Awasthi (Rutgers University)

Francis Bach (INRIA)

Nina Balcan (Carnegie Mellon University)

Afonso Bandeira (New York University)

Mihail Belkin (Ohio State University)

Shai Ben-David (University of Waterloo)

Quentin Berthet (University of Cambridge)

Alina Beygelzimer (Yahoo! Research)

Avrim Blum (TTI Chicago)

Guy Bresler (MIT)

Olivier Cappé (Université Paris-Saclay)

Constantine Caramanis (UT Austin)

Alexandra Carpentier (OvGU Magdeburg)

Nicolo Cesa-Bianchi (University of Milan)

Arnak Dalalyan (ENSAE)

Amit Daniely (Hebrew University)

Ronen Eldan (Weizmann Institute)

Tim van Erven (Leiden University)

Vitaly Feldman (IBM Research)

Aurelien Garivier (University Paul-Sabatier)

Rong Ge (Duke University)

Claudio Gentile (Universita degli Studi dell’Insubria)

Steve Hanneke

Elad Hazan (Princeton University)

Daniel Hsu (Columbia University)

Prateek Jain (Microsoft Research)

Satyen Kale (Google)

Varun Kanade (Oxford University)

Vladimir Koltchinskii (Georgia Tech)

Wouter Koolen (CWI, Amsterdam)

Tomer Koren (Google)

John Lafferty (Yale University)

Po-Ling Loh (UW Madison)

Gabor Lugosi (ICREA and Pompeu Fabra University)

Shie Mannor (Technion)

Yishay Mansour (Tel Aviv University and Google)

Ankur Moitra (MIT)

Robert Nowak (UW Madison)

Vianney Perchet (ENS Paris-Saclay and Criteo)

Alexandre Proutiere (KTH)

Luis Rademacher (UC Davis)

Maxim Raginsky (University of Illinois)

Sasha Rakhlin (University of Pennsylvania)

Lorenzo Rosasco (MIT and Universita’ di Genova)

Robert Schapire (Microsoft Research)

Ohad Shamir (Weizmann Institute)

David Steurer (ETH Zurich)

Suvrit Sra (MIT)

Nathan Srebro (TTI Chicago)

Karthik Sridharan (Cornell University)

Csaba Szepesvari (University of Alberta)

Matus Telgarsky (University of Illinois)

Ambuj Tewari (University of Michigan)

Alexandre Tsybakov (ENSAE-CREST)

Ruth Urner (Max Planck Institute)

Santosh Vempala (Georgia Tech)

Roman Vershynin (UC Irvine)

Manfred Warmuth (UC Santa Cruz)

Yihong Wu (Yale University)

**Call for papers:**

The 31st Annual Conference on Learning Theory (COLT 2018) will take place in Stockholm, Sweden, on July 5-9, 2018 (with a welcome reception on the 4th), immediately before ICML 2018, which takes place in the same city. We invite submissions of papers addressing theoretical aspects of machine learning and related topics. We strongly support a broad definition of learning theory, including, but not limited to:

-Design and analysis of learning algorithms

-Statistical and computational complexity of learning

-Optimization methods for learning

-Unsupervised, semi-supervised, online and active learning

-Interactions with other mathematical fields

-Interactions with statistical physics

-Artificial neural networks, including deep learning

-High-dimensional and non-parametric statistics

-Learning with algebraic or combinatorial structure

-Geometric and topological data analysis

-Bayesian methods in learning

-Planning and control, including reinforcement learning

-Learning with system constraints: e.g. privacy, memory or communication budget

-Learning from complex data: e.g., networks, time series, etc.

-Learning in other settings: e.g. social, economic, and game-theoretic

Submissions by authors who are new to COLT are encouraged. While the primary focus of the conference is theoretical, the authors may support their analysis by including relevant experimental results.

All accepted papers will be presented in a single track at the conference. At least one author of each paper should be present at the conference to present the work. Accepted papers will be published electronically in the Proceedings of Machine Learning Research (PMLR). The authors of accepted papers will have the option of opting out of the proceedings in favor of a 1-page extended abstract. The full paper reviewed for COLT will then be placed on the arXiv repository.

COLT will award both best paper and best student paper awards. To be eligible for the best student paper award, the primary contributor(s) must be full-time students at the time of submission. For eligible papers, authors must indicate at submission time if they wish their paper to be considered for a student paper award. The program committee may decline to make these awards, or may split them among several papers.

Submissions that are substantially similar to papers that have been previously published, accepted for publication, or submitted in parallel to other peer-reviewed conferences with proceedings may not be submitted to COLT. The same policy applies to journals, unless the submission is a short version of a paper submitted to a journal, and not yet published. Authors must declare such dual submissions either through the Easychair submission form, or via email to the program chairs.

Submissions are limited to 12 PMLR-formatted pages, plus unlimited additional pages for references and appendices. All details, proofs and derivations required to substantiate the results must be included in the submission, possibly in the appendices. However, the contribution, novelty and significance of submissions will be judged primarily based on the main text (without appendices), and so enough details, including proof details, must be provided in the main text to convince the reviewers of the submissions’ merits. Formatting and submission instructions can be found here.

As in previous years, there will be a rebuttal phase during the review process. Initial reviews will be sent to authors before final decisions have been made, and authors will have the opportunity to provide a short response to the PC's initial evaluation.

We also invite submission of open problems. A separate call for open problems will be made available on the conference website.

-Paper submission deadline: February 16, 2018, 11:00 PM EST

-Author feedback: April 18-21, 2018

-Author notification: May 2, 2018

-Conference: July 5-9, 2018 (welcome reception on the 4th)

Papers should be submitted through EasyChair at https://easychair.org/conferences/?conf=colt2018
