First week of activity at the Simons Institute

This first week at the Simons Institute was a lot of fun! I attended the first workshop in the Real Analysis program which was about Testing, Learning and Inapproximability. There was plenty of good talks and I learned a lot of new things.


First I learned about parallel repetition. Let \Sigma be a finite alphabet. Consider a bipartite graph G with vertex sets U and V and a labeling of the edges with mappings from \Sigma to \Sigma, that is edge uv is labelled with some mapping \pi_{uv} : \Sigma \rightarrow \Sigma \pi_{uv} : \Sigma \rightarrow \Sigma. Now consider the following two players game (Alice and Bob are the players). First an edge uv is drawn at random from some known probability distribution \nu over the set of edges in G (the graph G and the ‘labels’ are also known to the players). The vertex u is revealed to Alice and the vertex v is revealed to Bob. Without communication each one of them has to output an alphabet symbol, say \alpha for Alice and \beta for Bob, and they win if \beta = \pi_{uv} ( \alpha ) . In other words they win if the pair of symbols is consistent with the labeling of the drawn edge. The value of the game v( G ) is the probability that Alice and Bob wins, which can be rewritten as:

    \[v(G) = \max_{f : U \rightarrow \Sigma, g : V \rightarrow \Sigma} \mathbb{P}_{uv \sim \nu}\big(g(v) = \pi_{uv} (f(u)) \big) .\]

Now we define the value of the repeated game as follows: the players simply play the above game k times (with no communication at all except before the game starts when they agree on their strategy, and with the sequence of edges being i.i.d.) and the value of the repeated game, denoted v( G^{\otimes k} ) is the probability that they win all the k games. This can be rewritten as:

    \begin{eqnarray*} v(G^{\otimes k}) = \max_{f : U^k \rightarrow \Sigma^k, g : V^k \rightarrow \Sigma^k} & & \mathbb{P}_{(u_1v_1, \hdots, u_kv_k) \sim \nu^{\otimes k}}\big(\forall i \in [k], \\ & & [g(v_1, \hdots, v_k)]_i = \pi_{u_iv_i}([f(v_1, \hdots, v_k)]_i) \big) . \end{eqnarray*}

One is now interested in understanding the relation between v( G^{\otimes k} ) and v(G).

At this point it’s probably a good idea to pause. If this is the first time that you read about parallel repetition you are probably like ‘wtf, it’s obvious that v( G^{\otimes k} ) = v(G)^k !!!’ (at least this was my initial reaction). It turns out that this is very far from being true. Clearly it’s true that v( G^{\otimes k} ) \geq v(G)^k since you can simply play the k games independently. But in some cases v( G^{\otimes k} ) might be much larger than v(G)^k, which is kind of amazing! Here is a beautiful example (which is due to Lance Fortnow I think) where v(G) = 1/2 and v( G^{\otimes 2} ) = 1/2. The graph G is the complete graph on the set of vertices U=V=\{0,1\} and the distribution \nu is uniform on the set of edges (in other words Alice receives a random bit u \in \{0,1\} and Bob receives a random bit v \in \{0,1\}). The alphabet is \Sigma = (\{A,B\} \times \{0,1\}) \cup \{\mathrm{err}\} where the symbol \mathrm{err} is introduced for sake of simplicity and we assume that the players cannot respond with \mathrm{err}. Now the mappings \pi_{uv} are given by:

    \[\pi_{uv}((A, u) )=(A, u), \ \pi_{uv}((B, v) )=(B, v), \ \pi_{uv}((A, \bar{u}) )= \pi_{uv}((B, \bar{v}) )=\mathrm{err}.\]

In words the players must choose either Alice or Bob and then they both have to output the bit of the chosen player. For instance if they choose Alice then Bob has to guess the bit of Alice (and Alice can just report her bit). Clearly there is not much one can do for the one-shot game: one has v(G) = 1/2. Now the situation becomes more interesting for the two rounds game: the players can agree that in the first round Bob will try to guess Alice’s first bit, and in the second round Alice will try to guess Bob’s second bit. The trick is that when Bob tries to guess Alice’s first bit he can use his second bit as a guess, and respectively Alice’s guess for Bob’s second bit can be her own first bit. That way they effectively reduced the two rounds game to a one-round game, since both predictions are correct if Bob’s second bit is equal to Alice’s first bit which happens with probability 1/2. In other words we proved v( G^{\otimes 2})  = 1/2 while naively one would have guessed that v( G^{\otimes 2} ) = 1/4.

Now that we have a better understanding of why the original question on the relation between v( G^{\otimes k} ) and v(G) is a non-trivial one, it is clear that the interesting thing to study is to upper bound the value v( G^{\otimes k} ) as a function of v(G) (recall the obvious lower bound v( G^{\otimes k} ) \geq v(G)^k). In essence we are asking to show a limit to the power of tricks such as the one described above. Such a result was first proved by Ran Raz in 1995. He showed that, essentially, if v(G) \leq 1- \epsilon then v( G^{\otimes k} ) \leq (1-\epsilon^2 /2)^k (the original proof gave \epsilon^{32} instead of \epsilon^2). At the Simons workshop David Steurer talked about a new proof of this result based on viewing the value of the game as the norm of some matrix, see this paper. This new proof method also lead to new upper bounds, including v( G^{\otimes k} ) \leq v(G)^{\sqrt{k}} (which matches some known lower bounds from Ran Raz). As far as I understand there are still plenty of open questions on this fundamental problem of parallel repetition.


Joe Neeman talked about a new proof of Borell’s theorem which states that if X and Y are two standard Gaussian vectors in \mathbb{R}^n with \mathrm{cov}(X, Y) = \rho I_n, then half-spaces are maximizers of the function A \subset \mathbb{R}^n \mapsto \mathbb{P}(X \in A, Y \in A). The proof is really slick, it’s basically one page of fairly simple calculations, see Section 2 hereElchanan Mossel then talked about the ‘natural’ extension of this result to the case of uniform variables on the hypercube \{-1,1\}^n, see this paper.


- James Lee talked about a very interesting problem: we know how to prove lower bounds for the size of extended formulations (see this blog post for instance), but such bounds are interesting only for exact optimization. A natural question is whether we can prove lower bounds on the size of LPs even for approximate optimization. Apparently there is a rich literature on this for specific type of LP relaxations such the Sherali-Adams hierarchy. Now in their new paper (which is not yet online) James Lee and his co-authors are able to prove general bounds instead of the one that were previously proved specifically for the Sherali-Adams hierarchy. Of course LPs are nice but SDPs are even nicer (see here) and a natural question is how to generalize these results to SDPs instead of LPs. This was touched upon in Raghu Meka‘s talk where he discussed lower bounds for the number of rounds to solve the hidden clique problem with the Lasserre hierarchy. The paper, joint work with Avi Wigderson, is available here.


- Nati Linial gave an absolutely wonderful talk, probably one of the best that I have seen in months. My advice is to look at the video as soon as it is released on the Simons website (this should be a link to it hopefully). Just as a teaser here is one object that Linial is looking at. First a definition: for a graph G=(V,E) let p_i(G) be the fraction of the set of 3 vertices that contains exactly i-1 edges. The vector p(G) = (p_1(G), \hdots, p_4(G)) is called the 3-profile of G. Now the object that we look at is the set S \subset \mathbb{R}^4 of possible ‘asymptotic profile’, more precisely:

    \[S = \{x \in \mathbb{R}^4 : \exists (G_n = (V_n, E_n))_{n \in \mathbb{N}}, p(G_n) \xrightarrow[n \to \infty]{} x\} .\]

There is not much that we know about S. It is not convex: (1,0,0,0) and (0,0,0,1) are clearly in S but (1/2, 0,0,1/2) is of course not in S. We also know since the 60′s that for x \in S one has x_1 + x_4 \geq 1/4 and Linial recently proved that \min(x_2, x_4) \leq 0.26. A better understanding of the properties of S could lead to breakthroughs in the emerging theory of statistics for networks (see Nati’s talk for more details!).


Tomorrow the Big Data program will start with its Boot Camp. Apparently there has been a lot of hype around this boot camp and the seats have been ‘sold out’. If you can’t enter the auditorium, or if you are not in Berkeley, the event will be streamed live here.

Posted in Theoretical Computer Science | 4 Comments

Random-Approx 2013

Last week I attended the Random-Approx conference at Berkeley. I missed quite a few talks as I was also settling in my new office for the semester at the Simons Institute so I will just report on the three invited talks:

Luca Trevisan gave a talk on spectral graph theory. He first went through the basics: the eigenvalues of the normalized Laplacian L= I_n - D^{-1/2} A D^{-1/2} (where D is the diagonal matrix of degrees and A the adjacency matrix of the graph) tells you something about connectivity in the graph, precisely the eigenvalue 0 has multiplicity k if and only if the graph has k connected components, and the largest eigenvalue is equal to 2 if and only if the graph has a bipartite connected component (both statement are really easy to prove). This is a very nice result but a much more powerful statement would be a ‘soft’ version where one says something useful about the graph when say some eigenvalue is close to 0 (instead of being exactly 0). Luckily this has been extensively studied. A famous example is the Cheeger’s inequality which states that a small second eigenvalue \lambda_2 implies a small expansion (the expansion \Phi(S) of a set S of vertices is the probability that a random neighbor of a random vertex in S is outside of S), precisely there exists S such that

    \[\frac{\lambda_2}{2} \leq \Phi(S) \leq \sqrt{2 \lambda_2}.\]

Another example is the result of Arora, Barak and Steurer which says that a small value for the k^{th} eigenvalue \lambda_k implies that there is a set of roughly size n/k with small expansion.

One thing that I find fascinating is that as far as I understand we still have pretty much no idea of what ‘type’ of properties are encoded in the spectrum of a graph. Of course there is a good reason for this: computing the spectrum is computationally easy so this question is directly related to asking what properties of a graph are easy to compute, and this latter question is far far beyond our current knowledge. One example of a highly non-trivial property which is ‘hidden’ in the spectrum (and the corresponding eigenbasis) is a Szemeredi decomposition. Recall that the Szemeredi regularity lemma states roughly the following: fix a precision \epsilon >0, then there exists k such that for any graph on n vertices, with n large enough, one can partition the vertices into k clusters of equal sizes, such that for most pairs of clusters (i,j) (except a fraction \epsilon of the pairs), the graph of edges between cluster i and cluster j looks like a random graph (in a certain sense and up to an error of \epsilon) with a certain density of edges d_{i,j}. Another way to put it is that any large graph of size n can be viewed as a small graph of size k (with k independent of n!!!!) with weighted edges d_{i,j}. You can find the precise statement of this result here together with a proof based on the spectrum of the graph.


Santosh Vempala gave a talk on high-dimensional sampling algorithms. I already blogged about this since he gave the same talk at ICML, see here.


Persi Diaconis gave a very nice board talk about how to use algorithms to prove theorems in probability theory. Of course the standard approach is rather to go the other way around and use probability to derive new algorithms! I discussed a somewhat similar topic in the early days of this blog when I suggested that one could use results in statistics to prove new theorems in probability theory (again one usually do the opposite), see here. In his talk Diaconis illustrated the power of algorithms as a proof method for probability statements with the following problem. Let \Pi(n) be the set of partitions of [n]. The cardinality of \Pi(n) is known as the Bell number B(n). For various reasons one can be interested in computing quantities of the form

(1)   \begin{equation*}  \sum_{\lambda \in \Pi(n)} f(\lambda) , \end{equation*}

for some function f : \Pi(n) \rightarrow \mathbb{R}. Apparently there is a huge litterature that shows that such statistics can be expressed as polynomial functions of the Bell numbers. For each f one has to work hard to prove the expression. What Diaconis did in this paper is to prove an existence theorem for such an expression for ‘reasonable’ functions f. Of course one can view (1) as the expectation of f(\lambda) when \lambda is a set partition drawn uniformly at random. The key for the proof is then to have a ‘convenient algorithm’ to generate such a set partition at random, which in turns will make the analysis easy. The algorithm that they used is the following beautiful strategy of Stam. First one has to know the Dobinski’s formula:

    \[B(n) = \frac{1}{e} \sum_{j=1}^{+\infty} \frac{j^n}{j!} .\]

(Yup that’s right, it’s an amazing formula!) Now pick an integer j at random with probability \frac{1}{e B(n)} \frac{j^n}{j!} (this defines a probability distribution thanks to the Dobinski’s formula). Next drop n labeled balls into j boxes. This gives a partition of [n], and the distribution is uniform. The point is that there are many possible algorithms to generate a partition at random, but this particular clever algorithm leads to a new and interesting point of view to prove statements about uniformly drawn set partitions.


P.S: In the last post I gave links to the videos for COLT 2013 and ICML 2013, in case you are interested here are the direct links to my talks in these events:

- Bounded regret in stochastic multi-armed bandits (COLT), paper is here.

Multiple Identifications in Multi-Armed Bandits (ICML), unfortunately for this one the sound is not correctly synchronized so it’s probably better to just look at the paper directly.

- Here is another talk I did during the summer on an older paper, it’s about 0^{th} order Lipschitz optimization.


Posted in Random graphs, Theoretical Computer Science | 2 Comments

COLT 2013/ICML 2013 videos

The videos for COLT 2013 were just published on The videos for ICML 2013 are also available on

Posted in Uncategorized | 2 Comments

Two summer readings on Big Data and Deep Learning

This is the first (short) post dedicated to the Big Data program of the Simons Institute. We received from the program organizer Mike Jordan our first reading assignment which is a report published by the National Academy of Sciences on the “Frontiers in Massive Data Analysis“. This paper was written by a committee of academics and industrials chaired by Mike Jordan. I find the report quite interesting because it clearly identifies the new mathematical and algorithmic challenges that the Big Data point of view brings. This includes in particular many issues related to data representation, but also distributed algorithms, crowdsourcing, or tradeoffs between statistical accuracy and computational cost.


Talking about data representation I would like to link another paper, this one from Stéphane Mallat, Group Invariant Scattering. Stephane’s idea can be described roughly as follows: a useful data representation, say for sound signals, should be of course invariant to translations but also robust to small diffeomorphisms which are close to translations. In other words one is looking for mappings \Phi : L_2(\mathbb{R}^d) \rightarrow \mathcal{H} which are invariant to translations and Lipschitz continuous to C^2 diffeomorphisms of the form x \mapsto x - \tau(x) (with the weak topology on C^2 diffeomorphisms). As an example consider the modulus of the Fourier transform: this is a mapping invariant to translations but it is not Lipschitz continuous with respect to diffeomorphisms as one can ‘expand’ arbitrary high frequency by a simple transformation of the form x \mapsto (1-s) x. Mallat’s construction turns out to be much more complicated than simple Fourier or Wavelet transforms. Interestingly it builds on ideas from the Deep Learning literature. It also generalizes to other groups of transforms than translations, such as rotations (which can be useful for images).

Posted in Uncategorized | 2 Comments

ICML and isotropic position of a convex body

ICML 2013 just finished a few days ago. The presentation that inspired me the most was the invited talk by Santosh Vempala. He talked about the relations between sampling, optimization, integration, learning, and rounding. I strongly recommend Vempala’s short survey on these topics, it is an excellent read. I also would like to point out two papers by Vempala on optimization that are very interesting and exceptionally well written: the first paper (with Bertsimas) presents an optimal (in terms of oracle complexity) algorithm for convex optimization with a much better computational complexity than the Center of Gravity method (but still slightly worse computational complexity than the Ellipsoid Method); the second paper (with Kalai) gives a more efficient algorithm than the Ellipsoid Method for linear optimization with a membership oracle (instead of a separation oracle).

Vempala’s talk gave me the motivation to write on a related topic which has many applications in optimization, probability and statistics: this is the notion of an isotropic convex body. A convex body K \subset \mathbb{R}^n is centered and isotropic if it satisfies:

    \[\int_{K} x \ dx = 0, \;\; \frac{1}{\mathrm{vol}(K)}\int_{K} x x^{\top} \ dx = I_n .\]

First observe that the hypercube [-c,c]^n is isotropic (with c of order 1/2), and it is a one line calculation to show that c \sqrt{n} B_n (with B_n being the Euclidean ball of unit radius and c some numerical constant) is also isotropic. It is also immediate that any convex body can be put in isotropic position up to an affine transformation. Interestingly it was shown in this paper that one can find an approximate isotropic transformation with a linear (in n) number of samples from the uniform distribution on K.


If you read papers about isotropic convex set they often use a definition where \mathrm{vol}(K) = 1, in which case the second moment condition writes

    \[\int_{K} x x^{\top} \ dx = L_K \ I_n .\]

The term L_K is called the isotropic constant. However this definition of isotropy is not easy to work with because there is a deep conjecture related to the order of magnitude of the isotropic constant. It is easy to see that L_K has to be at least some numerical constant, but the statement that L_K is also upper bounded by a numerical constant is equivalent to the Hyperplane Conjecture (which I describe below). For this reason we stick to the definition with the second moment normalization.


The question we want to explore now is the following: ‘what looks like an isotropic convex body K‘? We are especially interested in the answer for the high-dimensional regime where n is very large. The following three results give interesting and useful results in that direction. I use c to denote some numerical constant.

KLS lemma

    \[\sqrt{\frac{n+2}{n}} B_n \subset K \subset \sqrt{n (n+2)} B_n .\]

It can be shown that the above inequalities are both tight in the simplex case.

We describe the next results in terms of a random variable X uniformly distributed on K. Since K is isotropic one has \mathbb{E} \|X\|_2^2 = n. The next result shows that in fact X has norm of order \sqrt{n} with high probability.

Theorem (Paouris’ inequality)

For any t \geq 1,

    \[\mathbb{P}(\|X\|_2 \geq c t \sqrt{n} ) \leq \exp(- t \sqrt{n}) .\]

The next statement is still a conjecture, it concerns a much more refined concentration inequality for the norm of an isotropic random variable.

Thin Shell Conjecture

For any t > 0,

    \[\mathbb{P}\bigg(\big|\|X\|_2 - \sqrt{n}\big| \geq t \bigg) \leq \exp(- c t) .\]

Since we are now talking about conjectures I cannot resist giving you the most important conjecture in convex geometry:

Hyperplane Conjecture

For any centered isotropic K there exists an hyperplane H such that

    \[\mathrm{vol}(K \cap H) \geq c \ \mathrm{vol}(K) .\]

As of today the best result we know replaces c by n^{-1/4} in the above inequality (this is a theorem of Bourgain up to a log factor, and the log factor was removed by Klartag). To the unexperienced eye as mine it seems unbelievable that this is still a conjecture: it looks clearly true! The following theorem reminds us that the set of statements that look true is not equal to the set of true statements…

Theorem (Busemann-Petty problem)

If n> 4 there exists centered convex sets K_1 and K_2 such that for all hyperplanes H one has

    \[\mathrm{vol}(K_1 \cap H) \leq \mathrm{vol}(K_2 \cap H) ,\]


    \[\mathrm{vol}(K_1) > \mathrm{vol}(K_2) .\]

Posted in Uncategorized | 5 Comments

COLT (with bombs, deep learning and other exciting stuff)

The 2013 edition of COLT at Princeton just finished a few days ago and I’m happy to report that everything went smoothly! We had a sort of stressful start (at least for the organizers..) since on the day before the start of the conference a bomb threat was declared at Princeton and the whole campus had to be evacuated. Thankfully the campus was entirely checked within a few hours and it reopened on the same day, which allowed the conference to go as planned.

Next year COLT will be held in Barcelona (most likely on June 13-14-15). As I will be again involved in the organization (but this time just to help out a bit the main organizer -Gabor Lugosi-) I already have some tidbits on the conference and based on this my advice is that you should already start preparing your best theorems for COLT 2014!!! Barcelona in June 2014 will be the place to be for learning theory ;).

Back to the 2013 edition: I wanted to do a summary of all the interesting things that I learned during these 3 days but I quickly realized that this is a way too ambitious goal. Instead I’m going to do a much shorter summary that include only the things that I ‘strictly’ learned during COLT in the sense that I had never read about these contributions in the past (which automatically dismiss papers on Online Learning that I read as they appear on arxiv, or the COLT best paper by Quentin and Philippe since I have seen the making of it).


Deep learning and related stuff

Deep learning was not in any contributed paper, but it made three apparitions during the conference:

- Yann LeCun gave an invited talk on this topic and he discussed the success of his deep architecture in recent competitions. Most of you have probably already read about this in mass media (such as the New York Times or Wired) so I’m not going to bore you with that. From a theoretical point of view what I sort of understood is that for example with sparse auto-encoders they are trying to optimize a function of the form f(D,W) = \|X - D W \|_2^2 + \|W\|_1 which is convex separately in D and W but not jointly in D and W (just think of (x, y) \in \mathbb{R}^2 \mapsto xy to see the issue). As far as I understand they alternate the optimization over D and W and it works just fine. Understanding theoretically what is going on remains a challenge.

Yann also said something which is in my opinion a striking observation: he says that all the ‘interesting’ learning must be non-convex since for non-trivial tasks the order in which one sees the examples matter a lot (just think of watching a movie in reverse..) while this is of course not at all the case for convex learning.

- Sanjeev Arora gave an invited talk on ‘provable bounds for machine learning’. What Sanjeev means by this title is that several objective functions in machine learning are NP-hard to optimize and yet people seems to be able to optimize them in practice. One example which is sort of related to the sparse auto-encoders discussed above is the problem of non-negative matrix factorization, where one is given X \in \mathbb{R}^{n \times m}_+ and one tries to find D \in \mathbb{R}^{n \times r}_+ and W \in \mathbb{R}^{r \times m}_+ such that X = D W with r as small as possible. Of course without the positivness constraint this is an easy problem but with this added constraint the task is NP-hard (see this paper by Vavasis). However Sanjeev and his co-authors (Ge, Kannan and Moitra) found a natural (in terms of applications) assumption for X such that under this assumption one can design a polynomial-time algorithm for this problem, see this paper for the details. At the end of the talk Sanjeev also mentioned deep learning in the list of exciting open problems in machine learning.

- Ohad Shamir briefly talked in the impromtu session about his recent paper on deep learning (with Roi Livni and Shai Shalev-Shwartz). The general idea of the paper is to show that one can very efficiently (both in terms of space and time) find a basis for the set of values attainable by polynomials of a fixed degree on a given data set. This gives a potentially interesting representation of the data which can then be used together with a linear classifier for supervised learning tasks.


A new estimator for Good-Turing

The Good-Turing problem is very simple: given an i.i.d. sequence X_1, \hdots, X_n drawn from some unknown probability distribution \nu over a countable set \mathcal{X}, one wants to estimate the numbers M_i = \nu(\{ x : \sum_{j=1}^n \mathbf{1}\{X_j = x\} = i \}). That is one wants to estimate the probability mass of the symbols that appear exactly i times in the observed sequence. The term M_0 is of particular interest (see for example this paper by myself, Ernst and Garivier for a cool application) but sometimes one really needs to estimate the entire vector M = (M_0, \hdots, M_n). Good and Turing worked on this problem during the Second World War while they were trying to break the Enigma code and they came up with an estimator \hat{M} which is ‘almost’ unbiased. Fifty years later McAllester and Schapire were able to prove concentration bounds for the Good-Turing estimator. For instance they showed that \hat{M}_0 - M_0 is of order \sqrt{\frac{\log 1/\delta}{n}} with probability at least 1-\delta which is in my opinion an amazing result (and the proof is not too difficult, \hat{M}_0 concentrates around its expectation thanks to McDiarmid’s inequality, and M_0 concentrates around its expectation thanks to a version of Bernstein’s inequality for negatively correlated random variables). However if one is looking for concentration of the vector \hat{M} around M these results are not useful. One of the COLT paper by Acharya, Jafarpour, Orlitsky and Suresh design a new estimator M' for this task for which they show that \mathrm{KL}(M, M') = O(1/\sqrt{n}) (and thus by Pinsker’s inequality \|M - M'\|_1 = O(1/n^{1/4})) and furthermore they also show that these two rates (for \mathrm{KL} and \ell_1) are optimal. This is a really neat result, and the estimator is a non-trivial modification of the basic Good-Turing estimator.


Other papers

I wanted to do a summary of two other papers, Algorithms and Hardness for Robust Subspace Recovery by Hardt and Moitra, and Polynomial Time Optimal Query Algorithms for Finding Graphs with Arbitrary Real Weights by Choi, but I’m running out of stamina so you’ll have to check the papers yourself!

Posted in Uncategorized | 4 Comments

Embeddings of finite metric spaces in Hilbert space

In this post we discuss the following notion: Let (X,d_X) and (Y,d_Y) be two metric spaces, one says that X embeds into Y with distortion D>0 if there exists \lambda >0 and f : X \rightarrow Y such that for any x,y \in X,

    \[1 \leq \frac{d_Y(f(x), f(y))}{\lambda \ d_X(x,y)} \leq D .\]

We write this as X \overset{D}\hookrightarrow Y.

Note that if Y is a normed vector space (over the reals) then the scaling parameter \lambda in the above definition is unnecessary since one can always perform this scaling with the embedding f (indeed in this case d_Y(\lambda f(x), \lambda f(y)) = \lambda d_Y(f(x), f(y))).

Informally if X embeds into Y with distortion D then up to an ‘error’ of order D one can view X (once mapped with the embedding f) as a subset of Y. This is particularly useful when Y has a specific known structure that we may want to exploit, such as Y = \ell_p.

The case p=2 is of particular interest since then one embeds into a Hilbert space which comes with a lot of ‘structure’. Thus we mainly focus on embeddings in \ell_2, and we also focus on (X,d) being a finite metric space with n = |X|.

First observe that the case p=\infty (embeddings in \ell_{\infty}) is trivial:

Proposition X \overset{1}\hookrightarrow \ell_{\infty}^n

Proof: Consider the embedding f(x) = (d(x,z))_{z \in X} \in \mathbb{R}^n. It is obvious that

    \[\|(d(x,z) - d(y,z))_{z \in X}\|_{\infty} \geq d(x,y) ,\]

and the other side of the inequality follows from the triangle inequality for d. Thus one has \|f(x) - f(y)\|_{\infty} = d(x,y) which concludes the proof.


Could it be that the situation is that simple for p=2? It seems very unlikely because \ell_2 is a much more structured space than an arbitrary metric space. Consider for example X=\{r,x,y,z\} with d(r,x) = d(r,y) = d(r,z) = 1 and d(x,y) = d(x,z) = d(z,y) = 2 (that is X is a tree with root r and three leaves x,y,z). It is easy to see that it is impossible to embed X into \ell_2 with distortion 1. Indeed f(x), f(y) and f(z) would be on a Euclidean ball of radius 1 and they would be at distance 2 from each other, which is impossible for 3 points.

So what can we say in general about embeddings into \ell_2? The following beautiful result gives the answer:

Theorem (Bourgain 1985) X \overset{O(\log n)}\hookrightarrow \ell_{2}

It can be shown that this result is unimprovable in general. However for specific classes of finite metric spaces one can hope for some improvement and we refer to the survey by Linial for more details on this.


Embeddings in \ell_2 with small distortion

While Bourgain’s theorem is a very strong statement in the sense that it applies to any finite metric space, it is also quite weak in the sense that the distortion is large, of order \log n. For some applications it is desirable to have a bounded distortion that does not grow with the size of X, or even a distortion of order 1+\epsilon with small \epsilon. As we said previously the only way to obtain such a statement is to make further assumptions on X. We will not take this route in this post but let me just mention the Johnson-Lindenstrauss lemma which should be well-known to Machine Learners and alike:

Lemma (Johnson-Lindenstrauss 1984) Let X \subset \ell_2 then X\overset{1+\epsilon}\hookrightarrow \ell_{2}^{O(\log(n) / \epsilon^2)}.

Now what can we say in general about embeddings with small distortion into \ell_2? There is a well-known result in convex geometry with the same flavor than this question, this is Dvoretzky’s theorem: any symmetric convex body in \mathbb{R}^n has a section of dimension O(c(\epsilon) \log n) which is Euclidean up to an error of order \epsilon. The analogue for the theory of metric embeddings is called the non-linear Dvoretzky’s theorem and it goes as follows:

Theorem (Bourgain–Figiel–Milman 1986) Let \epsilon > 0. There exists Y \subset X with |Y| \geq c(\epsilon) \log n and Y\overset{1+\epsilon}\hookrightarrow \ell_{2}.

Again this theorem has a weakness: while the distortion is small it also only applies to a small subset of the original metric space (of size O(\log n)). In some sense we have now two extreme results, Bourgain’s theorem which applies to the entire metric space but with a large distortion, and Bourgain–Figiel–Milman which applies to a small small subset but with a small distortion. A natural question at this point is to try to ‘interpolate’ between these two results. Such a statement was first obtained by Bartal, Linial, Mendel and Naor (2002) and later improved in Mendel and Naor (2006):

Theorem (Mendel and Naor 2006) Let \epsilon > 0. There exists Y \subset X with |Y| \geq n^{1-\epsilon} and Y\overset{O(1/\epsilon)}\hookrightarrow \ell_{2}.


In future posts I will try to prove some of these theorems. This might take a while as in the coming weeks I will be traveling to various conferences (including COLT 2013 and ICML 2013) and I will try to blog about some of the stuff that I learn there.

Posted in Theoretical Computer Science | Leave a comment

The aftermath of ORF523 and the final

It has been two weeks since my last post on the blog, and I have to admit that it felt really good to take a break from my 2-posts-per-week regime of the previous months. Now that ORF523 is over the blog will probably be much more quiet in the near future, though I have a few cool topics that I want to discuss in the coming weeks and I’m also hoping to get interesting guest posts. In any case the next period of intense activity will probably be next Fall, as I will be visiting the newly created Simons Institute for the Theory of Computing at UC Berkeley and I plan to blog about the new tricks that I will learn from the program on the Theoretical Foundations of Big Data Analysis and the one on Real Analysis in Computer Science. More details on this to come soon!

For those of you who followed assiduously my lectures I thought that you might want to take a look the final that I gave to my students this morning. The first part is based on this paper by Alon and Naor, and the second part is based on this paper by Nesterov (how could I not finish with a result from Nesterov??). Enjoy!


An SDP relaxation based on Grothendieck’s inequality

In this problem we consider the space of rectangular matrices \mathbb{R}^{m \times n} for m, n \geq 1. We denote by \langle \cdot, \cdot \rangle the Frobenius inner product on \mathbb{R}^{m \times n}, that is

    \[\langle A, B \rangle = \sum_{i=1}^m \sum_{j=1}^n A_{i,j} B_{i,j} .\]

Let p, p' \in \mathbb{R}_{++} \cup \{\infty\}. We consider the following norm on \mathbb{R}^{m \times n}:

    \[\|A\|_{p \rightarrow p'} = \max_{x \in \mathbb{R}^n : \|x\|_p \leq 1} \|A x\|_{p'} .\]

In other words this is the operator norm of A when we view it as a mapping from \ell_p^n to \ell_{p'}^m. In this problem we are interested in computing this operator norm for a given matrix.

1. Is the problem of computing the operator norm of A a convex problem?

2. For what values of p and p' can you easily propose a polynomial-time algorithm?

3. Let q' be such that 1/p' + 1/q' =1. Recall why one has

    \[\|A\|_{p \rightarrow p'} = \max_{x \in \mathbb{R}^n : \|x\|_p \leq 1, y \in \mathbb{R}^m : \|y\|_{q'} \leq 1} \langle A , y x^{\top} \rangle .\]

We will now focus on the case p=\infty and p'=1, that is

    \[\|A\|_{\infty \rightarrow 1} = \max_{x \in \mathbb{R}^n : \|x\|_{\infty} \leq 1, y \in \mathbb{R}^m : \|y\|_{\infty} \leq 1} \langle A , y x^{\top} \rangle.\]

We also consider the following quantity:

    \begin{eqnarray*} \text{SDP}(A) = & \max & \langle A , V U^{\top} \rangle \\ & \text{subject to} & U \in \mathbb{R}^{n \times (n+m)}, V \in \mathbb{R}^{m \times (n+m)} \\ & & \text{with unit norm rows (in} \; \ell_2). \end{eqnarray*}

4. Show that

    \[\|A\|_{\infty \rightarrow 1} \leq \text{SDP}(A) .\]

5. Show that the problem of computing \text{SDP}(A) for a given matrix A is an SDP.

Grothendieck’s inequality says that there exists a constant K_G \in [1.67, 1.79] such that

    \[\text{SDP}(A) \leq K_G \|A\|_{\infty \rightarrow 1} .\]

6. What is the implication of Grothendieck’s inequality for the problem of computing the norm \|A\|_{\infty \rightarrow 1}?


Coordinate Descent

Let f be a differentiable convex function on \mathbb{R}^n that admits a minimizer x^*. We denote \partial_i f(x) for the partial derivative in direction e_i of f at x (it is the i^{th} coordinate of \nabla f(x)). In this problem we consider the Coordinate Descent algorithm to solve the problem \min_{x \in \mathbb{R}^n} f(x): Let x_1 \in \mathbb{R}^n and for t>1 let

    \begin{eqnarray*} i_t & = & \mathrm{argmax}_{1 \leq i \leq n} | \partial_i f(x_t) | , \\ x_{t+1} & = & x_t - \eta \ \partial_{i_t} f (x_t) e_{i_t} . \end{eqnarray*}

We say that f is coordinate-wise \beta-smooth if for any x \in \mathbb{R}^n, h \in \mathbb{R}, i \in [n],

    \[|\partial_i f(x+h e_i) - \partial_i f(x) | \leq \beta |h| .\]

1. Is coordinate-wise smoothness a stronger or weaker assumption than standard smoothness?

2. Assume that f is coordinate-wise \beta-smooth and let \eta = \frac{1}{\beta}. Show first that

    \[f(x_{t+1}) - f(x_t) \leq - \frac{1}{2 \beta n} \| \nabla f(x_t) \|^2_2 .\]

Let R = \max \{\|x - x^*\|_2 : x \in \mathbb{R}^n \; \text{and} \; f(x) \leq f(x_1) \}, show next that

    \[f(x_{t+1}) - f(x_t) \leq - \frac{1}{2 \beta R^2 n} (f(x_t) - f(x^*))^2 .\]

Conclude by proving that

    \[f(x_t) - f(x^*) \leq \frac{2 \beta R^2 n}{t+3} .\]

Hint: to deal with the term f(x_1) - f(x^*) you can first show that f has to be (\beta n)-smooth.

3. Can you use this algorithm and the convergence result described above to show existence of sparse minimizers in the spirit of what we did with the Conditional Gradient Descent algorithm?

4. In general how does the computational complexity of Coordinate Descent compares to the computational complexity of Gradient Descent to attain an \epsilon-optimal point for a \beta-smooth function? What happens if you assume access to an oracle to compute i_t?

5. Same question but in the special case where f(x) = \sum_{i=1}^n f_i(x(i)) + \|A x - b \|_2^2 with A \in \mathbb{R}^{m \times n} and b \in \mathbb{R}^m.

6. Can you propose a variant of Coordinate Descent where the computation of i_t would be much faster and yet this new i_t would still allow us to prove the rate of convergence of question 2.? Hint: use randomization!

Posted in Optimization | Leave a comment

ORF523: Optimization with bandit feedback

In this last lecture we consider the case where one can only access the function with a noisy 0^{th}-order oracle (see this lecture for a definition). This assumption models the ‘physical reality’ of many practical problems (on the contrary to the case of a 1^{st}-order oracle which was essentially used as an ‘approximation’). Indeed there are many situations where the function f to be optimized is not known, but one can simulate the function. Imagine for example that we are trying to find the correct dosage of a few chemical elements in order to produce the most effective drug. We can imagine that given specific dosage one can produce the corresponding drug, and then test it to estimate its effectiveness. This corresponds precisely to assuming that we have a noisy 0^{th}-order oracle for the function \mathrm{dosage} \mapsto \mathrm{effectiveness}.

As one can see with the above example, the type of applications for 0^{th}-order optimization are fundamentally different from the type of applications for 1^{st}-order optimization. In particular in 1^{st}-order optimization the function f is known (we need to compute gradients), and thus it must have been directly engineered by humans (who wrote down the function!). This observation shows that in these applications one has some flexibility in the choice of f, and since we know that (roughly speaking) convex functions can be optimized efficiently, one has a strong incentive into engineering convex functions. On the other hand in 0^{th}-order optimization the function is produced by ‘Nature’. As it turns out, Nature does not care that we can only solve convex problems, and in most applications of 0^{th}-order optimization I would argue that it does not make sense to make a convexity assumption.

As you can imagine 0^{th}-order optimization appear in various contexts, from experiments in physics or biology, to the design of Artificial Intelligence for games, and the list goes on. In particular different communities have been looking at this problem, with different terminology and different assumptions. Being completely biased by my work on the multi-armed bandit problem, I believe that the bandit terminology is the nicest one to describe 0^{th}-order optimization and I will now refer to it as optimization with bandit feedback (for more explanation on this you can check my survey with Cesa-Bianchi on bandit problems). In order to simplify the discussion we will focus in this lecture on optimization over a finite set, that is |\mathcal{X}| < + \infty (we give some references on the continuous case at the end of the post).


Discrete optimization with bandit feedback

We want to solve the problem:

    \[\max_{i = 1, \hdots, K} \mu_i ,\]

where the \mu_i‘s are unknown quantities on which we can obtain information as follows: if one makes a query to action i (we will also say that one pulls arm i), then one receives an independent random variable Y such that \mathbb{E} Y = \mu_i. We will also assume that the distribution of Y is subgaussian (for example this include Gaussian distributions with variance less than 1 and distributions supported on an interval of length less than 2). The precise definition is not important, we will only use the fact that if one receives Y_1, \hdots, Y_s from pullings s times arm i then \hat{\mu}_{i,s} = \frac{1}{s} \sum_{t=1}^s Y_s satisfies:

(1)   \begin{equation*}  \mathbb{P}(\hat{\mu}_{i,s} - \mu_i > u) \leq \exp \left(- \frac{s u^2}{2} \right) . \end{equation*}

For sake of simplicity we assume that \mathrm{argmax}_{1 \leq i \leq K} \mu_i = \{i^*\}. An important parameter in our analysis is the suboptimality gap for arm i: \Delta_i = \mu_{i^*} - \mu_i (we also denote \Delta = \min_{i \neq i^*} \Delta_i).


We will be interested in two very different models:

  • In the PAC (Probably Approximately Correct) setting the algorithm can make as many queries as necessary, so that when it stops querying it can output an arm J \in \{1, \hdots, K\} such that \mathbb{P}(\mu_J < \mu_{i^*} - \epsilon) \leq \delta where \epsilon and \delta are prespecified accuracies. We denote by \mathcal{N} the number of queries that the algorithm made before stopping, and the objective is to have (\epsilon,\delta)-PAC algorithms with \mathcal{N} as small as possible. This formulation goes back to Bechhofer (1954).
  • In the fixed budget setting the algorithm can make at most n queries, and then it has to output an arm J \in \{1, \hdots, K\}. The goal here is to minimize the optimization error \mu_{i^*} - \mu_J. Strangely this formulation is much more recent: it was proposed in this paper that I wrote with Munos and Stoltz.

While at first sight the two models might seem similar, we will see that in fact there is a key fundamental difference.


Trivial algorithms

A trivial (\epsilon,\delta)-PAC algorithm would be to query each arm of order of \frac{1}{\epsilon^2} \log \frac{K}{\delta} times and then output the empirical best arm. Using (1) it is obvious that this algorithm is indeed (\epsilon,\delta)-PAC, and furthermore it satisfies \mathcal{N} \leq c \frac{K}{\epsilon^2} \log \frac{K}{\delta}.


In the fixed budget setting a trivial algorithm is to divide the budget evenly among the K arms. Using (1) it is immediate that this strategy satisfies \mathbb{P}(\mu_{i^*} - \mu_J \geq \epsilon) \leq K \exp( - c \frac{n}{K} \epsilon^2), which equivalently states that to have \mathbb{P}(\mu_{i^*} - \mu_J \geq \epsilon) \leq \delta one needs the budget n to be at least of order of \frac{K}{\epsilon^2} \log \frac{K}{\delta}.


We will now see that these trivial algorithms can be dramatically improved by taking into account the potential heterogeneity in the \mu_i‘s. For sake of simplicty we focus now on finding the best arm i^*, that is in the PAC setting we take \epsilon = 0, and in the fixed budget setting we consider \mathbb{P}(J \neq i^*). The critical quantity will be the hardness measure:

    \[H = \sum_{i \neq i^*} \frac{1}{\Delta_i^2} ,\]

that we introduced with Audibert and Munos in this paper (though it appeared previously in many other papers). We will show that, roughly speaking, while the trivial algorithms need of order of \frac{K}{\Delta^2} queries to find the optimal arm, our improved algorithms can find it with only H queries. Note that for information-theoretic reasons H is a lower bound on the number of queries that one has to make to find the optimal arm (see this paper by Mannor and Tsitsiklis for a lower bound in the PAC setting, and my paper with Audibert and Munos for a lower bound in the fixed budget setting).


Successive Elimination (SE) for the PAC setting

The Successive Elimination algorithm was proposed in this paper by Even-Dar, Mannor and Mansour. The idea is very simple: start with a set of active arms A_1 = \{1, \hdots, K\}. At each round t, pull once each arm in A_t. Now construct confidence intervals of size c \sqrt{\frac{\log (K t / \delta)}{t}} for each arm, and build A_{t+1} from A_t by eliminating arms in A_t for which the confidence interval does not overlap with the confidence interval of the currently best empirical arm in A_t. The algorithm stops when |A_t| = 1, and it outputs the single element of A_t. Using an union bound it is an easy exercise to verify that this algorithm is \delta-PAC, and furthermore with probability at least 1- \delta one has \mathcal{N} \leq c H \log \frac{K}{\delta \Delta^2}.


Successive Rejects (SR) for the fixed budget setting

The Successive Elimination algorithm cannot be easily adapted in the fixed budget setting. The reason is that in the fixed budget framework we do not know a priori what is a reasonable value for the confidence parameter \delta. Indeed in the end an optimal algorithm should have a probability of error of order \exp(- c n / H), which depends on the unknown hardness parameter H. As a result one cannot use strategy based on confidence intervals in the fixed budget setting unless one knows H (note that estimating H from data is basically as hard as finding the best arm). With Audibert and Munos we proposed an alternative to SE for the fixed budget that we called Successive Rejects.

The idea is to divide the budget n into K-1 chunks n_1, \hdots, n_{K-1} such that n_1 + \hdots + n_{K-1} = n. The algorithm then runs in K-1 phases. Let A_k be the set of active arms in phase k (with A_1 = \{1, \hdots, K\}). In phase k each arm in A_k is sampled n_k - n_{k-1} times, and the end of the phase the arm i with the worst empirical performance in A_k is rejected, that is A_{k+1} = A_k \setminus \{i\}. The output of the algorithm is the unique element of A_K.

Let \Delta_{(1)} \leq \Delta_{(2)} \leq \hdots \leq \Delta_{(K)} be the ordered \Delta_i‘s. Remark that in phase k, one of the k worst arms must still be in the active set, and thus using a trivial union bound one obtains that:

    \[\mathbb{P}(J \neq i^*) \leq \sum_{k=1}^{K-1} 2 k \exp\left( - n_k \frac{\Delta_{(K-k+1)}^2}{8} \right) .\]

Now the key observation is that by taking n_k proportional to \frac{1}{K+1-k} one obtains a bound of the form \exp( - \tilde{O} (n / H)). Precisely let n_k = \alpha \frac{n}{K+1 - k} where \alpha is such that the n_k‘s sum to n (morally \alpha is of order 1 / \log K). Then we have

    \[n_k \Delta_{(K-k+1)}^2 \geq \frac{\alpha n}{\max_{2 \leq i \leq K} i \Delta_{(i)}^{-2}} \geq \frac{\alpha n}{\sum_{i=2}^K \frac{1}{\Delta_{(i)}^2}} \geq c \frac{n}{\log(K) H} ,\]

for some numerical constant c>0. Thus we proved that SR satisfies

\mathbb{P}(J \neq i^*) \leq K^2 \exp \left(- c \frac{n}{\log(K) H} \right), which equivalently states that SR finds the best arm with probability at least 1-\delta provided that the budget is at least of order H \log(K) \log(K / \delta).


The continuous case

Many questions are still open in the continuous case. As I explained at the beginning of the post, convexity might not be the best assumption from a practical point of view, but it is nonetheless a very natural mathematical problem to try to understand the best rate of convergence in this case. This is still an open problem, and you can see the state of the art for upper bounds in this paper by Agarwal, Foster, Hsu, Kakade and Rakhlin, and for lower bounds in this paper by Shamir. In my opinion a more ‘practical’ assumption on the function is simply to assume that it is Lipschitz in some metric. The HOO (Hierarchical Optimistic Optimization) algorithm attains interesting performances when the metric is known, see this paper by myself, Munos, Stoltz and Szepesvari (similar results were obtained independently by Kleinberg, Slivkins and Upfal in this paper). Recently progress has been made for the case where the metric is unknown, see this paper by Slivkins, this paper by Munos, and this one by Bull. Finally let me remark that this Lipschitzian assumption is very weak, and thus one cannot hope to solve high-dimensional problem with the above algorithms. In fact, these algorithms are designed for small dimensional problems (say dimension less than 5 or so). In standard optimization one can solve problems in very large dimension because of the convexity assumption. For 0^{th}-order optimization I think that we still don’t have a natural assumption that would allow us to scale-up the algorithms to higher dimensional problems.

Posted in Optimization | Leave a comment

ORF523: Acceleration by randomization for a sum of smooth and strongly convex functions

In this lecture we are interested in the following problem:

    \[\min_{x \in \mathbb{R}^n} \frac{1}{m} \sum_{i=1}^m f_i(x) ,\]

where each f_i is \beta-smooth and \alpha strongly-convex. We denote by Q=\frac{\beta}{\alpha} the condition number of these functions (which is also the condition number of the average).

Using a gradient descent directly on the sum with step-size \frac{2}{\alpha + \beta} one obtains a rate of convergence of order \exp( - c t / Q) for some numerical constant c>0 (see this lecture). In particular one can reach an \epsilon-optimal point in time O(m Q N \log(1/\epsilon)) with this method, where O(N) is the time to compute the gradient of a single function f_i(x). Using Nesterov’s Accelerated Gradient Descent (see this lecture) one can reduce the time complexity to O(m \sqrt{Q} N \log(1/\epsilon)).

Using the SGD method described in the previous lecture (with the step-size/averaging described here) one can reach an \epsilon-optimal point in time O( \log(m) N / (\alpha \epsilon)) which is much worse than the basic gradient descent if one is looking for a fairly accurate solution. We are interested in designing a randomized algorithm in the spirit of SGD that would leverage the smoothness of the function to attain a linear rate of convergence (i.e., the dependency in \epsilon would be of the form \log(1/\epsilon)).


Stochastic Average Gradient (SAG)

We first describe the SAG method proposed in this paper by Le Roux, Schmidt and Bach (2012). The method is simple and elegant: let y_0 = 0, x_1 \in \mathbb{R}^n and for t \geq 1 let i_t be drawn uniformly at random in \{1, \hdots, m\} and

    \[y_t(i) = \left\{ \begin{array}{ccc} \nabla f_i(x_t) & \text{if} & i=i_t \\ y_{t-1}(i) & \text{if} & i\neq i_t \end{array} \right. ; \qquad x_{t+1} = x_t - \frac{\eta}{m} \sum_{i=1}^m y_t(i) .\]

They show that in the regime where m \geq 8 Q, SAG (with step-size \eta = \frac{1}{2 m \alpha} and with x_1 obtained as the average of m iterates of SGD) has a rate of convergence of order \exp( - c t / m) (see Proposition 2 in their paper for a more precise statement). Since each step has a time-complexity of O(N \log(m)) this results in an overall time-complexity of O(m N \log(m) \log(1/\epsilon)), which is much better than the complexity of the plain gradient descent (or even of Nesterov’s Accelerated Gradient Descent). Unfortunately the proof of this result is quite difficult. We will now describe another strategy with similar performance but with a much simpler analysis.


Short detour through Fenchel’s duality

Before describing the next strategy we first recall a few basic facts about Fenchel duality. First the Fenchel conjugate of a function f: \mathbb{R}^n \rightarrow \mathbb{R} is defined by:

    \[f^*(u) = \sup_{x \in \mathbb{R}^n} u^{\top} x - f(x) .\]

Note that f^* is always a convex function (possibly taking value +\infty). If f is convex, then for any u \in \partial f(x) one clearly has f^*(u) + f(x) = u^{\top} x. Finally one can show that if f is \beta-smooth then f^* is \frac{1}{\beta}-strongly convex (one can use the second Lemma from this lecture).


Stochastic Dual Coordinate Ascent (SDCA)

We describe now the SDCA method proposed in this paper by Shalev-Shwartz and Zhang. We need to make a few more assumptions on the functions in the sum, and precisely we will now assume that we want to solve the following problem (note that all norms here are Euclidean):

    \[\min_{w \in \mathbb{R}^n} \frac{1}{m} \sum_{i=1}^m f_i(w^{\top} x_i) + \frac{\alpha}{2} \|w\|^2 ,\]


  • f_i : \mathbb{R} \rightarrow \mathbb{R} is convex, \beta-smooth.
  • f_i(s) \geq 0, \forall s \in \mathbb{R} and f_i(0) \leq 1.
  • \|x_i\| \leq 1.


    \[P(w) = \frac{1}{m} \sum_{i=1}^m f_i(w^{\top} x_i) + \frac{\alpha}{2} \|w\|^2\]

the primal objective. For the dual objective we introduce a dual variable \nu(i) for each data point x_i (as well as a temporary primal variable z(i)):

    \begin{eqnarray*} \min_{w \in \mathbb{R}^n} P(w) & = & \min_{z \in \mathbb{R}^m, w} \max_{\nu \in \mathbb{R}^m} \frac{1}{m} \sum_{i=1}^m \bigg( f_i(z(i)) + \nu(i) (z(i) - w^{\top} x_i) \bigg) + \frac{\alpha}{2} \|w\|^2 \\ & = & \max_{\nu \in \mathbb{R}^m} \min_{z \in \mathbb{R}^m, w} \frac{1}{m} \sum_{i=1}^m \bigg( f_i(z(i)) + \nu(i) (z(i) - w^{\top} x_i) \bigg) + \frac{\alpha}{2} \|w\|^2 \\ & = & \max_{\nu \in \mathbb{R}^m} \min_{w \in \mathbb{R}^n} \frac{1}{m} \sum_{i=1}^m \bigg( - f_i^*(- \nu(i)) - \nu(i) w^{\top} x_i \bigg) + \frac{\alpha}{2} \|w\|^2 . \end{eqnarray*}

Next observe that by the KKT conditions one has at the optimum (w^*, \nu^*):

    \[w^* = \frac{1}{\alpha m} \sum_{i=1}^m \nu^*(i) x_i ,\]

and thus denoting

    \[D(\nu) = \frac{1}{m} \sum_{i=1}^m - f_i^*(- \nu(i)) - \frac{\alpha}{2} \left\|\frac{1}{\alpha m} \sum_{i=1}^m \nu(i) x_i \right\|^2 ,\]

we just proved

    \[\min_{w \in \mathbb{R}^n} P(w) = \max_{\nu \in \mathbb{R}^m} D(\nu) .\]


We can now describe SDCA: let \nu_0 = 0 and for t \geq 0 let i_t be drawn uniformly at random in \{1, \hdots, m\} and

    \[\nu_{t+1}(i) = \left\{ \begin{array}{ccc} (1-\eta) \nu_t(i) - \eta f_i'(w_t^{\top} x_i) & \text{if} & i=i_t , \\ \nu_t(i) & \text{if} & i \neq i_t . \end{array} \right.\]

Let w_t = \frac{1}{\alpha m} \sum_{i=1}^m \nu_t(i) x_i be the corresponding primal variable. The following result justifies the procedure.

Theorem (Shalev-Shwartz and Zhang 2013) SDCA with \eta = \frac{m}{Q + m} satisfies:

    \[\mathbb{E} P(w_t) - \min_{w \in \mathbb{R}^n} P(w) \leq (Q+m) \exp\left( - \frac{t}{Q+m} \right) .\]

The above convergence rate yields an overall time-complexity of O((Q+m) N \log(m) \log((Q+m)/\epsilon)) for SDCA (where O(N) is the complexity of computing a gradient of w \mapsto f_i(w^{\top} x_i)), which can be much better than the O(m Q N \log(1/\epsilon)) complexity of the plain gradient descent.

Proof: The key property that we will prove is that the primal/dual gap is controlled by the improvement in the dual objective, precisely:

(1)   \begin{equation*} \mathbb{E} \left( P(w_t) - D(\nu_t) \right) \leq (Q+m) \mathbb{E} \left( D(\nu_{t+1}) - D(\nu_t) \right) . \end{equation*}

Let us see how the above inequality yields the result. First we clearly have D(\nu) \leq P(w), and thus the above inequality immediately yields:

    \[\mathbb{E} \left( D(\nu^*) - D(\nu_{t+1}) \right) \leq \left(1 - \frac{1}{Q+m}\right) \mathbb{E} \left( D(\nu^*) - D(\nu_{t}) \right) ,\]

which by induction gives (using D(0) \geq 0 and D(\nu^*) \leq 1):

    \[\mathbb{E} \left( D(\nu^*) - D(\nu_{t+1}) \right) \leq \left(1 - \frac{1}{Q+m}\right)^{t+1} \leq \exp\left( - \frac{t+1}{Q+m} \right) ,\]

and thus using once more (1):

    \[\mathbb{E} P(w_t) - P(w^*) \leq (Q+m) \exp\left( - \frac{t}{Q+m} \right) .\]

Thus it only remains to prove (1). First observe that the primal/dual gap can be written as:

    \[P(w_t) - D(\nu_t) = \frac{1}{m} \sum_{i=1}^m \bigg( f_i(w_t^{\top} x_i) + f_i^*(- \nu_t(i)) \bigg) + \alpha \|w_t\|^2 .\]

Let us now evaluate the improvement in the dual objective assuming that coordinate i is updated in the (t+1)^{th} step:

    \begin{align*} & m \left( D(\nu_{t+1}) - D(\nu_t) \right) \\ & = - f_i^*(-\nu_{t+1}(i)) - \frac{\alpha m}{2} \|w_{t+1}\|^2 - \left(- f_i^*(-\nu_t(i)) - \frac{\alpha m}{2} \|w_t\|^2\right) \\ & = - f_i^*(-(1-\eta) \nu_t(i) + \eta f_i'(w_t^{\top} x_i)) - \frac{\alpha m}{2} \left\|w_t - \frac{\eta}{\alpha m} (f_i'(w_t^{\top} x_i) + \nu_t(i)) x_i \right\|^2 \\ & \qquad - \left(- f_i^*(-\nu_t(i)) - \frac{\alpha m}{2} \|w_t\|^2\right) . \end{align*}

Now using that f_i^* is \frac{1}{\beta}-strongly convex one has

    \begin{eqnarray*} f_i^*(-(1-\eta) \nu_t(i) + \eta f_i'(w_t^{\top} x_i)) & \leq& (1 -\eta) f_i^*(- \nu_t(i)) + \eta f_i^*(f_i'(w_t^{\top} x_i)) \\ & & \;- \frac{1}{2 \beta} \eta (1-\eta) (f_i'(w_t^{\top} x_i) + \nu_t(i))^2 , \end{eqnarray*}

which yields

    \begin{align*} & m \left( D(\nu_{t+1}) - D(\nu_t) \right) \\ & \geq \eta f_i^*(- \nu_t(i)) - \eta f_i^*(f_i'(w_t^{\top} x_i)) + \frac{\eta (1-\eta)}{2 \beta} (f_i'(w_t^{\top} x_i) - \nu_t(i))^2 \\ & + \eta (f_i'(w_t^{\top} x_i) + \nu_t(i)) w_t^{\top} x_i - \frac{\eta^2}{2\alpha m} (f_i'(w_t^{\top} x_i) + \nu_t(i))^2 \|x_i\|^2 \\ & \geq \eta f_i^*(- \nu_t(i)) - \eta f_i^*(f_i'(w_t^{\top} x_i)) + \eta (f_i'(w_t^{\top} x_i) + \nu_t(i)) w_t^{\top} x_i , \end{align*}

where in the last line we used that \eta = \frac{m}{Q+m} and \|x_i\| \leq 1. Observe now that

    \[f_i^*(f_i'(w_t^{\top} x_i)) = (w_t^{\top} x_i) f_i'(w_t^{\top} x_i) - f_i(w_t^{\top} x_i) ,\]

which finally gives

    \[m \left( D(\nu_{t+1}) - D(\nu_t) \right) \geq \eta \bigg(f_i^*(- \nu_t(i)) + f_i(w_t^{\top} x_i) + \nu_t(i) w_t^{\top} x_i \bigg),\]

and thus taking expectation one obtains

    \begin{align*} & (Q+m) \mathbb{E}_{i_t} \left( D(\nu_{t+1}) - D(\nu_t) \right) \\ & \geq \frac{1}{m} \sum_{i=1}^m \bigg( f_i(w_t^{\top} x_i) + f_i^*(- \nu_t(i)) + \nu_t(i) w_t^{\top} x_i \bigg)\\ & = P(w_t) - D(\nu_t), \end{align*}

which concludes the proof of (1) and the proof of the theorem.


Posted in Optimization | 2 Comments