COLT (with bombs, deep learning and other exciting stuff)

The 2013 edition of COLT at Princeton just finished a few days ago and I’m happy to report that everything went smoothly! We had a sort of stressful start (at least for the organizers..) since on the day before the start of the conference a bomb threat was declared at Princeton and the whole campus had to be evacuated. Thankfully the campus was entirely checked within a few hours and it reopened on the same day, which allowed the conference to go as planned.

Next year COLT will be held in Barcelona (most likely on June 13-14-15). As I will be again involved in the organization (but this time just to help out a bit the main organizer -Gabor Lugosi-) I already have some tidbits on the conference and based on this my advice is that you should already start preparing your best theorems for COLT 2014!!! Barcelona in June 2014 will be the place to be for learning theory ;).

Back to the 2013 edition: I wanted to do a summary of all the interesting things that I learned during these 3 days but I quickly realized that this is a way too ambitious goal. Instead I’m going to do a much shorter summary that include only the things that I ‘strictly’ learned during COLT in the sense that I had never read about these contributions in the past (which automatically dismiss papers on Online Learning that I read as they appear on arxiv, or the COLT best paper by Quentin and Philippe since I have seen the making of it).

Deep learning and related stuff

Deep learning was not in any contributed paper, but it made three apparitions during the conference:

– Yann LeCun gave an invited talk on this topic and he discussed the success of his deep architecture in recent competitions. Most of you have probably already read about this in mass media (such as the New York Times or Wired) so I’m not going to bore you with that. From a theoretical point of view what I sort of understood is that for example with sparse auto-encoders they are trying to optimize a function of the form $f(D,W) = \|X - D W \|_2^2 + \|W\|_1$ which is convex separately in $D$ and $W$ but not jointly in $D$ and $W$ (just think of $(x, y) \in \mathbb{R}^2 \mapsto xy$ to see the issue). As far as I understand they alternate the optimization over $D$ and $W$ and it works just fine. Understanding theoretically what is going on remains a challenge.

Yann also said something which is in my opinion a striking observation: he says that all the ‘interesting’ learning must be non-convex since for non-trivial tasks the order in which one sees the examples matter a lot (just think of watching a movie in reverse..) while this is of course not at all the case for convex learning.

– Sanjeev Arora gave an invited talk on ‘provable bounds for machine learning’. What Sanjeev means by this title is that several objective functions in machine learning are NP-hard to optimize and yet people seems to be able to optimize them in practice. One example which is sort of related to the sparse auto-encoders discussed above is the problem of non-negative matrix factorization, where one is given $X \in \mathbb{R}^{n \times m}_+$ and one tries to find $D \in \mathbb{R}^{n \times r}_+$ and $W \in \mathbb{R}^{r \times m}_+$ such that $X = D W$ with $r$ as small as possible. Of course without the positivness constraint this is an easy problem but with this added constraint the task is NP-hard (see this paper by Vavasis). However Sanjeev and his co-authors (Ge, Kannan and Moitra) found a natural (in terms of applications) assumption for $X$ such that under this assumption one can design a polynomial-time algorithm for this problem, see this paper for the details. At the end of the talk Sanjeev also mentioned deep learning in the list of exciting open problems in machine learning.

– Ohad Shamir briefly talked in the impromtu session about his recent paper on deep learning (with Roi Livni and Shai Shalev-Shwartz). The general idea of the paper is to show that one can very efficiently (both in terms of space and time) find a basis for the set of values attainable by polynomials of a fixed degree on a given data set. This gives a potentially interesting representation of the data which can then be used together with a linear classifier for supervised learning tasks.

A new estimator for Good-Turing

The Good-Turing problem is very simple: given an i.i.d. sequence $X_1, \hdots, X_n$ drawn from some unknown probability distribution $\nu$ over a countable set $\mathcal{X}$ , one wants to estimate the numbers $M_i = \nu(\{ x : \sum_{j=1}^n \mathbf{1}\{X_j = x\} = i \})$ . That is one wants to estimate the probability mass of the symbols that appear exactly $i$ times in the observed sequence. The term $M_0$ is of particular interest (see for example this paper by myself, Ernst and Garivier for a cool application) but sometimes one really needs to estimate the entire vector $M = (M_0, \hdots, M_n)$ . Good and Turing worked on this problem during the Second World War while they were trying to break the Enigma code and they came up with an estimator $\hat{M}$ which is ‘almost’ unbiased. Fifty years later McAllester and Schapire were able to prove concentration bounds for the Good-Turing estimator. For instance they showed that $\hat{M}_0 - M_0$ is of order $\sqrt{\frac{\log 1/\delta}{n}}$ with probability at least $1-\delta$ which is in my opinion an amazing result (and the proof is not too difficult, $\hat{M}_0$ concentrates around its expectation thanks to McDiarmid’s inequality, and $M_0$ concentrates around its expectation thanks to a version of Bernstein’s inequality for negatively correlated random variables). However if one is looking for concentration of the vector $\hat{M}$ around $M$ these results are not useful. One of the COLT paper by Acharya, Jafarpour, Orlitsky and Suresh design a new estimator $M'$ for this task for which they show that $\mathrm{KL}(M, M') = O(1/\sqrt{n})$ (and thus by Pinsker’s inequality $\|M - M'\|_1 = O(1/n^{1/4})$ ) and furthermore they also show that these two rates (for $\mathrm{KL}$ and $\ell_1$ ) are optimal. This is a really neat result, and the estimator is a non-trivial modification of the basic Good-Turing estimator.

Other papers

I wanted to do a summary of two other papers, Algorithms and Hardness for Robust Subspace Recovery by Hardt and Moitra, and Polynomial Time Optimal Query Algorithms for Finding Graphs with Arbitrary Real Weights by Choi, but I’m running out of stamina so you’ll have to check the papers yourself!

4 Responses to "COLT (with bombs, deep learning and other exciting stuff)"

By Deep Learning and Shallow Learning 初探 - 好好学习天天向上 November 3, 2013 - 4:51 am

[…] Learning 在 NLP 里也取得了相当的成功。甚至连纯机器学习理论的会 COLT 也开始凑这趟热闹了。deeplearning.net 上有一个 reading list，里面列举了一些各个领域关于 deep […]
By Igor June 28, 2013 - 5:01 am

This was half way joking really, without going too far in the example, there is a genre in movies where the ending is known from the very start and what is discovered and appreciated is how the “tree of events” built itself into that endng. One could even argue that most blockbusters follow that route as well ( most of the time heroes get out well) and then I agree with you there is the Games of Thrones “red wedding”, but then again,one could argue that it is so traumatizing that it is an exception to the rule.
By Igor June 26, 2013 - 11:09 pm

so watching a movie backwards doesn’t lead you to the same understanding of the plot ?

Igor
- By Sebastien Bubeck June 27, 2013 - 7:32 am
  
  Yeah apparently humans have this weird time-dependent behavior ;). More seriously this observation is perhaps more a criticism of the usual i.i.d. assumption, but I think that it also have some relevance for convexity assumptions.

COLT (with bombs, deep learning and other exciting stuff)

4 Responses to "COLT (with bombs, deep learning and other exciting stuff)"

By Deep Learning and Shallow Learning 初探 - 好好学习天天向上 November 3, 2013 - 4:51 am

By Igor June 28, 2013 - 5:01 am

By Igor June 26, 2013 - 11:09 pm

By Sebastien Bubeck June 27, 2013 - 7:32 am

Categories

Recent Posts

Archives

Blogroll