## COLT (with bombs, deep learning and other exciting stuff)

The 2013 edition of COLT at Princeton finished just a few days ago and I’m happy to report that everything went smoothly! We had a somewhat stressful start (at least for the organizers!) since on the day before the conference a bomb threat was declared at Princeton and the whole campus had to be evacuated. Thankfully the campus was entirely checked within a few hours and it reopened the same day, which allowed the conference to go on as planned.

Next year COLT will be held in Barcelona (most likely on June 13-14-15). As I will again be involved in the organization (though this time just to help out the main organizer, Gabor Lugosi), I already have some tidbits on the conference, and based on these my advice is that you should already start preparing your best theorems for COLT 2014!!! Barcelona in June 2014 will be the place to be for learning theory ;).

Back to the 2013 edition: I wanted to write a summary of all the interesting things that I learned during these 3 days, but I quickly realized that this is way too ambitious a goal. Instead I’m going to do a much shorter summary that includes only the things I ‘strictly’ learned during COLT, in the sense that I had never read about these contributions before (which automatically dismisses the papers on Online Learning that I read as they appear on arXiv, as well as the COLT best paper by Quentin and Philippe, since I have seen the making of it).

Deep learning and related stuff

Deep learning was not in any contributed paper, but it made three appearances during the conference:

- Yann LeCun gave an invited talk on this topic and he discussed the success of his deep architectures in recent competitions. Most of you have probably already read about this in the mass media (such as the New York Times or Wired) so I’m not going to bore you with that. From a theoretical point of view, what I sort of understood is that, for example with sparse auto-encoders, they are trying to optimize a function of the form $(W, B) \mapsto f(W, B)$ which is convex separately in $W$ and in $B$ but not jointly in $(W, B)$ (just think of $(a, b) \mapsto ab$ to see the issue). As far as I understand they alternate the optimization over $W$ and over $B$ and it works just fine. Understanding theoretically what is going on remains a challenge.

Yann also made what is in my opinion a striking observation: he said that all the ‘interesting’ learning must be non-convex, since for non-trivial tasks the order in which one sees the examples matters a lot (just think of watching a movie in reverse..), while this is of course not at all the case for convex learning.
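To make the alternating idea concrete, here is a minimal sketch on the simplest bilinear model $\min_{W,B} \|X - WB\|_F^2$ (my own toy stand-in for the auto-encoder objective, not LeCun’s actual architecture): each half-step is a convex least-squares problem, and alternating them decreases the loss monotonically even though the joint problem is non-convex.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 15))   # data to be factorized
k = 5
W = rng.standard_normal((20, k))
B = rng.standard_normal((k, 15))

def loss(W, B):
    return float(np.linalg.norm(X - W @ B) ** 2)

errors = [loss(W, B)]
for _ in range(20):
    # with B fixed, minimizing over W is a convex least-squares problem...
    W = np.linalg.lstsq(B.T, X.T, rcond=None)[0].T
    # ...and with W fixed, minimizing over B is convex as well
    B = np.linalg.lstsq(W, X, rcond=None)[0]
    errors.append(loss(W, B))
```

Of course, as the talk emphasized, no such monotonicity argument explains why this kind of scheme finds *good* solutions in practice.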

- Sanjeev Arora gave an invited talk on ‘provable bounds for machine learning’. What Sanjeev means by this title is that several objective functions in machine learning are NP-hard to optimize, and yet people seem to be able to optimize them in practice. One example, which is somewhat related to the sparse auto-encoders discussed above, is the problem of non-negative matrix factorization: one is given a matrix $M$ with nonnegative entries and one tries to find nonnegative matrices $A$ and $W$ such that $M \approx AW$, with $\|M - AW\|$ as small as possible. Of course without the nonnegativity constraint this is an easy problem, but with this added constraint the task is NP-hard (see this paper by Vavasis). However Sanjeev and his co-authors (Ge, Kannan and Moitra) found a natural (in terms of applications) assumption on $M$ such that under this assumption one can design a polynomial-time algorithm for this problem; see this paper for the details. At the end of the talk Sanjeev also mentioned deep learning in the list of exciting open problems in machine learning.
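For illustration, here is the classical multiplicative-update heuristic for non-negative matrix factorization (this is the generic method practitioners use, not the polynomial-time algorithm of Arora, Ge, Kannan and Moitra, which relies on their extra assumption on the input matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.random((12, 10))            # nonnegative data matrix
k = 4
A = rng.random((12, k)) + 0.1       # nonnegative initializations
W = rng.random((k, 10)) + 0.1

def err(A, W):
    return float(np.linalg.norm(M - A @ W))

before = err(A, W)
for _ in range(200):
    # each multiplicative update preserves nonnegativity entrywise
    # and does not increase the Frobenius error
    W *= (A.T @ M) / (A.T @ A @ W + 1e-12)
    A *= (M @ W.T) / (A @ W @ W.T + 1e-12)
after = err(A, W)
```

Like the alternating scheme above, this comes with no guarantee of reaching a global optimum, which is exactly the gap the ‘provable bounds’ program aims to close.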

- Ohad Shamir briefly talked in the impromptu session about his recent paper on deep learning (with Roi Livni and Shai Shalev-Shwartz). The general idea of the paper is to show that one can very efficiently (both in terms of space and time) find a basis for the set of values attainable by polynomials of a fixed degree on a given data set. This gives a potentially interesting representation of the data, which can then be used together with a linear classifier for supervised learning tasks.

A new estimator for Good-Turing

The Good-Turing problem is very simple: given an i.i.d. sequence $X_1, \dots, X_n$ drawn from some unknown probability distribution $p$ over a countable set $\mathcal{X}$, one wants to estimate the numbers $M_k$, the probability mass of the symbols that appear exactly $k$ times in the observed sequence. The ‘missing mass’ $M_0$ is of particular interest (see for example this paper by myself, Ernst and Garivier for a cool application), but sometimes one really needs to estimate the entire vector $(M_0, M_1, \dots)$. Good and Turing worked on this problem during the Second World War while they were trying to break the Enigma code, and they came up with an estimator which is ‘almost’ unbiased. Fifty years later McAllester and Schapire were able to prove concentration bounds for the Good-Turing estimator $G_0$. For instance they showed that $|G_0 - M_0|$ is of order $\sqrt{\log(1/\delta)/n}$ with probability at least $1 - \delta$, which is in my opinion an amazing result (and the proof is not too difficult: $G_0$ concentrates around its expectation thanks to McDiarmid’s inequality, and $M_0$ concentrates around its expectation thanks to a version of Bernstein’s inequality for negatively correlated random variables). However if one is looking for concentration of the entire vector $(G_0, G_1, \dots)$ around $(M_0, M_1, \dots)$ these results are not useful. One of the COLT papers, by Acharya, Jafarpour, Orlitsky and Suresh, designs a new estimator for this task for which they control the KL divergence between the estimated vector and the true one (and thus by Pinsker’s inequality the $\ell_1$ distance as well), and furthermore they also show that these two rates (for KL and $\ell_1$) are optimal. This is a really neat result, and the estimator is a non-trivial modification of the basic Good-Turing estimator.
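As a toy illustration, the basic Good-Turing estimate of the missing mass is simply the fraction of samples that are singletons; a quick simulation (on an arbitrary toy distribution of my choosing) shows it tracking the true missing mass:

```python
from collections import Counter
import random

def good_turing_missing_mass(sample):
    """G_0 = (number of symbols seen exactly once) / n."""
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)

random.seed(0)
# toy source: three frequent symbols plus a tail of 30 rare ones;
# each entry of `population` carries probability 1/100
population = [0] * 40 + [1] * 20 + [2] * 10 + list(range(3, 33))
sample = [random.choice(population) for _ in range(500)]

g0 = good_turing_missing_mass(sample)
# true probability mass of the symbols never observed
seen = set(sample)
true_m0 = sum(1 for v in population if v not in seen) / len(population)
```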

Other papers

I wanted to do a summary of two other papers, Algorithms and Hardness for Robust Subspace Recovery by Hardt and Moitra, and Polynomial Time Optimal Query Algorithms for Finding Graphs with Arbitrary Real Weights by Choi, but I’m running out of stamina so you’ll have to check the papers yourself!


## Embeddings of finite metric spaces in Hilbert space

In this post we discuss the following notion: let $(X, d_X)$ and $(Y, d_Y)$ be two metric spaces. One says that $X$ embeds into $Y$ with distortion $D$ if there exist a map $f : X \to Y$ and a scaling factor $s > 0$ such that for any $x, x' \in X$,

$$s \, d_X(x, x') \le d_Y(f(x), f(x')) \le D \, s \, d_X(x, x').$$

We write this as $X \hookrightarrow_D Y$.

Note that if $Y$ is a normed vector space (over the reals) then the scaling parameter $s$ in the above definition is unnecessary, since one can always fold this scaling into the embedding (indeed in this case one can replace $f$ by $f/s$).

Informally, if $X$ embeds into $Y$ with distortion $D$ then, up to an ‘error’ of order $D$, one can view $X$ (once mapped with the embedding $f$) as a subset of $Y$. This is particularly useful when $Y$ has a specific known structure that we may want to exploit, such as $Y = \ell_2$.

The case $Y = \ell_2$ is of particular interest since then one embeds into a Hilbert space, which comes with a lot of ‘structure’. Thus we mainly focus on embeddings into $\ell_2$, and we also focus on $X$ being a finite metric space with $|X| = n$.

First observe that the case $Y = \ell_\infty^n$ (embeddings into $\ell_\infty$) is trivial:

Proposition. Any finite metric space $(X, d)$ with $X = \{x_1, \dots, x_n\}$ embeds isometrically into $\ell_\infty^n$, that is $X \hookrightarrow_1 \ell_\infty^n$.

Proof: Consider the embedding $f(x) = (d(x, x_1), \dots, d(x, x_n))$. It is obvious that

$$\|f(x) - f(x')\|_\infty \ge d(x, x')$$

(look at the coordinate corresponding to $x'$), and the other side of the inequality follows from the triangle inequality for $d$. Thus one has $\|f(x) - f(x')\|_\infty = d(x, x')$, which concludes the proof.

Could it be that the situation is that simple for $\ell_2$? It seems very unlikely, because $\ell_2$ is a much more structured space than an arbitrary metric space. Consider for example $X = \{r, a, b, c\}$ with $d(r, a) = d(r, b) = d(r, c) = 1$ and $d(a, b) = d(a, c) = d(b, c) = 2$ (that is, $X$ is a tree with root $r$ and three leaves $a, b, c$, equipped with the shortest-path distance). It is easy to see that it is impossible to embed $X$ into $\ell_2$ with distortion $1$. Indeed $a$, $b$ and $c$ would lie on a Euclidean ball of radius $1$ around the image of $r$, and they would be at distance $2$ from each other: this forces each pair of them to be antipodal on that ball, which is impossible for 3 points.

So what can we say in general about embeddings into $\ell_2$? The following beautiful result gives the answer:

Theorem (Bourgain 1985). Any finite metric space $X$ with $|X| = n$ embeds into $\ell_2$ with distortion $O(\log n)$.

It can be shown that this result is unimprovable in general. However for specific classes of finite metric spaces one can hope for some improvement and we refer to the survey by Linial for more details on this.

Embeddings into $\ell_2$ with small distortion

While Bourgain’s theorem is a very strong statement in the sense that it applies to any finite metric space, it is also quite weak in the sense that the distortion is large, of order $\log n$. For some applications it is desirable to have a bounded distortion that does not grow with the size of $X$, or even a distortion of order $1 + \epsilon$ with small $\epsilon$. As we said previously, the only way to obtain such a statement is to make further assumptions on $X$. We will not take this route in this post, but let me just mention the Johnson-Lindenstrauss lemma, which should be well-known to Machine Learners and alike:

Lemma (Johnson-Lindenstrauss 1984). Let $X \subset \ell_2$ with $|X| = n$ and let $\epsilon \in (0, 1)$. Then $X \hookrightarrow_{1+\epsilon} \ell_2^k$ with $k = O(\epsilon^{-2} \log n)$.
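A sketch of the standard proof-by-random-projection: project with an i.i.d. Gaussian matrix scaled by $1/\sqrt{k}$ and check all pairwise distance ratios (the constant 8 in the choice of $k$ below is an arbitrary safe choice, not the optimal one):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 50, 1000, 0.5
k = int(8 * np.log(n) / eps ** 2)        # target dimension O(log n / eps^2)
X = rng.standard_normal((n, d))          # n arbitrary points in high dimension
G = rng.standard_normal((d, k)) / np.sqrt(k)
Y = X @ G                                # the random linear embedding

# ratio of embedded to original distance, for every pair of points
ratios = [
    np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j])
    for i in range(n) for j in range(i + 1, n)
]
```

Note that the embedding is linear and oblivious to the point set, which is exactly why this lemma is so useful in algorithmic applications.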

Now what can we say in general about embeddings with small distortion into $\ell_2$? There is a well-known result in convex geometry with the same flavor as this question, namely Dvoretzky’s theorem: any symmetric convex body in $\mathbb{R}^n$ has a section of dimension of order $\log n$ which is Euclidean up to an error $\epsilon$ (with the constant in the dimension depending on $\epsilon$). The analogue for the theory of metric embeddings is called the non-linear Dvoretzky theorem and it goes as follows:

Theorem (Bourgain–Figiel–Milman 1986). Let $X$ be a finite metric space with $|X| = n$ and let $\epsilon > 0$. There exists $Y \subset X$ with $|Y| \ge c(\epsilon) \log n$ and $Y \hookrightarrow_{1+\epsilon} \ell_2$.

Again this theorem has a weakness: while the distortion is small, it also applies only to a small subset of the original metric space (of size logarithmic in $n$). In some sense we now have two extreme results: Bourgain’s theorem, which applies to the entire metric space but with a large distortion, and Bourgain–Figiel–Milman, which applies to a small subset but with a small distortion. A natural question at this point is to try to ‘interpolate’ between these two results. Such a statement was first obtained by Bartal, Linial, Mendel and Naor (2002) and later improved in Mendel and Naor (2006):

Theorem (Mendel and Naor 2006). Let $X$ be a finite metric space with $|X| = n$ and let $\epsilon \in (0, 1)$. There exists $Y \subset X$ with $|Y| \ge n^{1-\epsilon}$ and $Y \hookrightarrow_{O(1/\epsilon)} \ell_2$.

In future posts I will try to prove some of these theorems. This might take a while as in the coming weeks I will be traveling to various conferences (including COLT 2013 and ICML 2013) and I will try to blog about some of the stuff that I learn there.

## The aftermath of ORF523 and the final

It has been two weeks since my last post on the blog, and I have to admit that it felt really good to take a break from my 2-posts-per-week regime of the previous months. Now that ORF523 is over the blog will probably be much more quiet in the near future, though I have a few cool topics that I want to discuss in the coming weeks and I’m also hoping to get interesting guest posts. In any case the next period of intense activity will probably be next Fall, as I will be visiting the newly created Simons Institute for the Theory of Computing at UC Berkeley and I plan to blog about the new tricks that I will learn from the program on the Theoretical Foundations of Big Data Analysis and the one on Real Analysis in Computer Science. More details on this to come soon!

For those of you who followed my lectures assiduously, I thought that you might want to take a look at the final that I gave to my students this morning. The first part is based on this paper by Alon and Naor, and the second part is based on this paper by Nesterov (how could I not finish with a result from Nesterov??). Enjoy!

An SDP relaxation based on Grothendieck’s inequality

In this problem we consider the space of rectangular matrices $\mathbb{R}^{n \times m}$ for $n, m \ge 1$. We denote by $\langle \cdot, \cdot \rangle$ the Frobenius inner product on $\mathbb{R}^{n \times m}$, that is

$$\langle A, B \rangle = \mathrm{Tr}(A^\top B).$$

Let $p, q \ge 1$. We consider the following norm on $\mathbb{R}^{n \times m}$:

$$\|A\|_{p \to q} = \max_{x \,:\, \|x\|_p \le 1} \|Ax\|_q.$$

In other words this is the operator norm of $A$ when we view it as a mapping from $\ell_p^m$ to $\ell_q^n$. In this problem we are interested in computing this operator norm for a given matrix.

1. Is the problem of computing the operator norm $\|A\|_{p \to q}$ a convex problem?

2. For what values of $p$ and $q$ can you easily propose a polynomial-time algorithm?

3. Let $q^* \ge 1$ be such that $\frac{1}{q} + \frac{1}{q^*} = 1$. Recall why one has

$$\|A\|_{p \to q} = \max_{\|x\|_p \le 1, \; \|y\|_{q^*} \le 1} y^\top A x.$$

We will now focus on the case $p = \infty$ and $q = 1$, that is

$$\|A\|_{\infty \to 1} = \max_{x \in \{-1, 1\}^m, \; y \in \{-1, 1\}^n} y^\top A x.$$

We also consider the following quantity, where the signs are relaxed to unit vectors in a Euclidean space:

$$\mathrm{SDP}(A) = \max_{\|u_1\| = \dots = \|u_m\| = \|v_1\| = \dots = \|v_n\| = 1} \; \sum_{i, j} A_{ij} \, \langle v_i, u_j \rangle.$$

4. Show that

$$\|A\|_{\infty \to 1} \le \mathrm{SDP}(A).$$

5. Show that the problem of computing $\mathrm{SDP}(A)$ for a given matrix is an SDP.

Grothendieck’s inequality says that there exists a universal constant $K_G$ such that

$$\mathrm{SDP}(A) \le K_G \, \|A\|_{\infty \to 1}.$$

6. What is the implication of Grothendieck’s inequality for the problem of computing the norm $\|A\|_{\infty \to 1}$?
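To fix ideas on the quantity being relaxed, here is a brute-force computation of $\|A\|_{\infty \to 1}$ over sign vectors (exponential in the dimension, which is exactly why the SDP relaxation of questions 4-6 matters):

```python
import itertools
import numpy as np

def inf_to_one_norm(A):
    """||A||_{inf -> 1} by enumerating the 2^m sign vectors x."""
    best = 0.0
    for x in itertools.product([-1.0, 1.0], repeat=A.shape[1]):
        # for a fixed x, the maximizing y simply matches the signs of Ax,
        # so the inner maximum is the l1 norm of Ax
        best = max(best, float(np.abs(A @ np.array(x)).sum()))
    return best

A = np.array([[1.0, -2.0], [3.0, 0.5]])
val = inf_to_one_norm(A)
```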

Coordinate Descent

Let $f$ be a differentiable convex function on $\mathbb{R}^n$ that admits a minimizer $x^*$. We denote by $\nabla_i f(x)$ the partial derivative of $f$ at $x$ in the direction $e_i$ (it is the $i^{th}$ coordinate of $\nabla f(x)$). In this problem we consider the Coordinate Descent algorithm to solve the problem $\min_{x \in \mathbb{R}^n} f(x)$: let $x_1 \in \mathbb{R}^n$ and for $t \ge 1$ let

$$i_t = \operatorname{argmax}_{1 \le i \le n} |\nabla_i f(x_t)|, \qquad x_{t+1} = x_t - \eta \, \nabla_{i_t} f(x_t) \, e_{i_t}.$$

We say that $f$ is coordinate-wise $\beta$-smooth if for any $x \in \mathbb{R}^n$, $i \in \{1, \dots, n\}$ and $\eta \in \mathbb{R}$,

$$|\nabla_i f(x + \eta e_i) - \nabla_i f(x)| \le \beta |\eta|.$$

1. Is coordinate-wise smoothness a stronger or weaker assumption than standard smoothness?

2. Assume that $f$ is coordinate-wise $\beta$-smooth and let $\eta = \frac{1}{\beta}$. Show first that

$$f(x_{t+1}) - f(x_t) \le - \frac{1}{2 \beta} \|\nabla f(x_t)\|_\infty^2 .$$

Let $t \ge 1$, show next that

$$f(x_t) - f(x^*) \le \|x_t - x^*\|_1 \, \|\nabla f(x_t)\|_\infty .$$

Conclude by proving that $f(x_t) - f(x^*)$ decreases at a rate of order $\beta n \|x_1 - x^*\|^2 / t$.

Hint: to deal with the term $f(x_1) - f(x^*)$ you can first show that $f$ has to be $n \beta$-smooth.

3. Can you use this algorithm and the convergence result described above to show the existence of sparse minimizers, in the spirit of what we did with the Conditional Gradient Descent algorithm?

4. In general how does the computational complexity of Coordinate Descent compare to the computational complexity of Gradient Descent to attain an $\epsilon$-optimal point for a smooth function? What happens if you assume access to an oracle that computes a single partial derivative $\nabla_i f(x)$ much faster than the full gradient?

5. Same question but in the special case where $f(x) = \frac{1}{2} \|Ax - b\|^2$ with $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$.

6. Can you propose a variant of Coordinate Descent where the computation of $i_t$ would be much faster, and yet this new $i_t$ would still allow us to prove the rate of convergence of question 2? Hint: use randomization!
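In the spirit of question 6, here is a sketch of the randomized variant on a small quadratic: pick the coordinate uniformly at random, so that each step costs only one partial derivative (the specific step-size and toy instance are my own choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
Q = np.array([[2.0, 0.5], [0.5, 1.0]])   # positive definite quadratic
b = np.array([1.0, 1.0])
beta = Q.diagonal().max()                 # a valid coordinate-wise smoothness constant

def f(x):
    return 0.5 * x @ Q @ x - b @ x

x = np.zeros(2)
for _ in range(500):
    i = rng.integers(2)                   # uniformly random coordinate: O(1) to pick
    grad_i = Q[i] @ x - b[i]              # only one partial derivative is computed
    x[i] -= grad_i / beta
x_star = np.linalg.solve(Q, b)            # exact minimizer, for comparison
```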

## ORF523: Optimization with bandit feedback

In this last lecture we consider the case where one can only access the function with a noisy $0^{th}$-order oracle (see this lecture for a definition). This assumption models the ‘physical reality’ of many practical problems (contrary to the case of a $1^{st}$-order oracle, which is essentially used as an ‘approximation’). Indeed there are many situations where the function $f$ to be optimized is not known, but one can simulate it. Imagine for example that we are trying to find the correct dosage of a few chemical elements in order to produce the most effective drug. Given a specific dosage $x$ one can produce the corresponding drug, and then test it to estimate its effectiveness. This corresponds precisely to assuming that we have a noisy $0^{th}$-order oracle for the function $f$ that maps a dosage to the effectiveness of the corresponding drug.

As one can see with the above example, the type of applications for $0^{th}$-order optimization is fundamentally different from the type of applications for $1^{st}$-order optimization. In particular in $1^{st}$-order optimization the function is known (we need to compute gradients), and thus it must have been directly engineered by humans (who wrote down the function!). This observation shows that in these applications one has some flexibility in the choice of $f$, and since we know that (roughly speaking) convex functions can be optimized efficiently, one has a strong incentive to engineer convex functions. On the other hand in $0^{th}$-order optimization the function is produced by ‘Nature’. As it turns out, Nature does not care that we can only solve convex problems, and in most applications of $0^{th}$-order optimization I would argue that it does not make sense to make a convexity assumption.

As you can imagine, $0^{th}$-order optimization appears in various contexts, from experiments in physics or biology to the design of Artificial Intelligence for games, and the list goes on. In particular different communities have been looking at this problem, with different terminology and different assumptions. Being completely biased by my work on the multi-armed bandit problem, I believe that the bandit terminology is the nicest one to describe $0^{th}$-order optimization and I will now refer to it as optimization with bandit feedback (for more explanation on this you can check my survey with Cesa-Bianchi on bandit problems). In order to simplify the discussion we will focus in this lecture on optimization over a finite set, that is $f : \{1, \dots, n\} \to \mathbb{R}$ (we give some references on the continuous case at the end of the post).

Discrete optimization with bandit feedback

We want to solve the problem:

$$\max_{1 \le i \le n} \mu_i,$$

where the $\mu_i$’s are unknown quantities on which we can obtain information as follows: if one makes a query to action $i$ (we will also say that one pulls arm $i$), then one receives an independent random variable $X$ such that $\mathbb{E}[X] = \mu_i$. We will also assume that the distribution of $X$ is subgaussian (for example this includes Gaussian distributions with variance less than $1$ and distributions supported on an interval of length less than $1$). The precise definition is not important; we will only use the fact that if one receives $X_1, \dots, X_t$ from pulling $t$ times arm $i$, then the empirical mean $\widehat{\mu}_{i,t} = \frac{1}{t} \sum_{s=1}^t X_s$ satisfies:

(1) $\qquad \mathbb{P}\left( |\widehat{\mu}_{i,t} - \mu_i| > \epsilon \right) \le 2 \exp\left( - \frac{t \epsilon^2}{2} \right).$

For sake of simplicity we assume that arm $1$ is the unique best arm, that is $\mu_1 > \mu_i$ for all $i \ne 1$. An important parameter in our analysis is the suboptimality gap for arm $i$: $\Delta_i = \mu_1 - \mu_i$ (we also denote $\Delta = \min_{i \ne 1} \Delta_i$).

We will be interested in two very different models:

• In the PAC (Probably Approximately Correct) setting the algorithm can make as many queries as necessary, so that when it stops querying it can output an arm $I$ such that $\mathbb{P}(\mu_I < \mu_1 - \epsilon) \le \delta$, where $\epsilon$ and $\delta$ are prespecified accuracies. We denote by $T$ the number of queries that the algorithm made before stopping, and the objective is to have $(\epsilon, \delta)$-PAC algorithms with $T$ as small as possible. This formulation goes back to Bechhofer (1954).
• In the fixed budget setting the algorithm can make at most $T$ queries, and then it has to output an arm $I_T$. The goal here is to minimize the optimization error $\mu_1 - \mathbb{E}[\mu_{I_T}]$ (or the probability of misidentification $\mathbb{P}(I_T \ne 1)$). Strangely this formulation is much more recent: it was proposed in this paper that I wrote with Munos and Stoltz.

While at first sight the two models might seem similar, we will see that in fact there is a key fundamental difference.

Trivial algorithms

A trivial $(\epsilon, \delta)$-PAC algorithm would be to query each arm of order of $\frac{1}{\epsilon^2} \log \frac{n}{\delta}$ times and then output the empirical best arm. Using (1) it is obvious that this algorithm is indeed $(\epsilon, \delta)$-PAC, and furthermore it satisfies $T = O\!\left( \frac{n}{\epsilon^2} \log \frac{n}{\delta} \right)$.

In the fixed budget setting a trivial algorithm is to divide the budget evenly among the arms and output the empirical best arm. Using (1) it is immediate that this strategy satisfies $\mathbb{P}(I_T \ne 1) \le 2 n \exp\left( - \frac{T \Delta^2}{c \, n} \right)$ for some numerical constant $c$, which equivalently states that to have a probability of error at most $\delta$ one needs the budget to be at least of order of $\frac{n}{\Delta^2} \log \frac{n}{\delta}$.

We will now see that these trivial algorithms can be dramatically improved by taking into account the potential heterogeneity in the $\Delta_i$’s. For sake of simplicity we focus now on finding the best arm: in the PAC setting we take $\epsilon < \Delta$ (so that the output must be the best arm), and in the fixed budget setting we consider the probability of error $\mathbb{P}(I_T \ne 1)$. The critical quantity will be the hardness measure:

$$H = \sum_{i \ne 1} \frac{1}{\Delta_i^2},$$

that we introduced with Audibert and Munos in this paper (though it appeared previously in many other papers). We will show that, roughly speaking, while the trivial algorithms need of order of $\frac{n}{\Delta^2}$ queries to find the optimal arm, our improved algorithms can find it with only of order of $H$ queries (up to logarithmic factors). Note that for information-theoretic reasons $H$ is a lower bound on the number of queries that one has to make to find the optimal arm (see this paper by Mannor and Tsitsiklis for a lower bound in the PAC setting, and my paper with Audibert and Munos for a lower bound in the fixed budget setting).

Successive Elimination (SE) for the PAC setting

The Successive Elimination algorithm was proposed in this paper by Even-Dar, Mannor and Mansour. The idea is very simple: start with the set of active arms $S_1 = \{1, \dots, n\}$. At each round $t$, pull once each arm in $S_t$. Now construct confidence intervals of size of order $\sqrt{\frac{1}{t} \log \frac{n t}{\delta}}$ around the empirical mean of each arm, and build $S_{t+1}$ from $S_t$ by eliminating the arms in $S_t$ whose confidence interval does not overlap with the confidence interval of the currently best empirical arm in $S_t$. The algorithm stops when $|S_t| = 1$, and it outputs the single element of $S_t$. Using a union bound it is an easy exercise to verify that this algorithm is $(0, \delta)$-PAC, and furthermore with probability at least $1 - \delta$ one has $T = O(H)$ up to logarithmic factors.
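Here is a minimal simulation of Successive Elimination with Gaussian rewards; the exact form of the confidence radius below is one standard choice, not necessarily the one from the Even-Dar et al. paper:

```python
import math
import random

random.seed(0)
mu = [1.0, 0.3, 0.2, 0.1]            # arm 0 is the best
n, delta = len(mu), 0.05
active = list(range(n))
sums = [0.0] * n
t = 0
while len(active) > 1 and t < 10000:
    t += 1
    for i in active:                  # pull every active arm once
        sums[i] += random.gauss(mu[i], 1.0)
    rad = math.sqrt(2.0 * math.log(4.0 * n * t * t / delta) / t)
    leader = max(active, key=lambda i: sums[i])
    # keep only the arms whose interval overlaps the leader's interval
    active = [i for i in active if sums[i] / t >= sums[leader] / t - 2.0 * rad]
best_arm = active[0]
```

Note how the arms with the largest gaps are eliminated first, which is exactly why the sample complexity scales with $H$ rather than $n / \Delta^2$.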

Successive Rejects (SR) for the fixed budget setting

The Successive Elimination algorithm cannot be easily adapted to the fixed budget setting. The reason is that in the fixed budget framework we do not know a priori what is a reasonable value for the confidence parameter $\delta$. Indeed in the end an optimal algorithm should have a probability of error of order $\exp(-T / H)$, which depends on the unknown hardness parameter $H$. As a result one cannot use a strategy based on confidence intervals in the fixed budget setting unless one knows $H$ (note that estimating $H$ from data is basically as hard as finding the best arm). With Audibert and Munos we proposed an alternative to SE for the fixed budget setting that we called Successive Rejects.

The idea is to divide the budget $T$ into chunks $n_1 \le n_2 \le \dots \le n_{n-1}$, chosen so that the total number of pulls does not exceed $T$. The algorithm then runs in $n - 1$ phases. Let $S_k$ be the set of active arms in phase $k$ (with $S_1 = \{1, \dots, n\}$). In phase $k$ each arm in $S_k$ is sampled so that it has been pulled $n_k$ times in total, and at the end of the phase the arm with the worst empirical performance in $S_k$ is rejected, that is $S_{k+1}$ is $S_k$ minus this arm. The output of the algorithm is the unique element of $S_n$.

Let $\Delta_{(1)} \le \Delta_{(2)} \le \dots \le \Delta_{(n)}$ be the ordered $\Delta_i$’s. Remark that in phase $k$, at least one of the $k$ worst arms must still be in the active set, and thus using a trivial union bound one obtains that:

$$\mathbb{P}(\text{the best arm is rejected in phase } k) \le k \exp\left( - \frac{n_k \, \Delta_{(n+1-k)}^2}{c} \right).$$

Now the key observation is that by taking $n_k$ proportional to $\frac{T}{n+1-k}$ one obtains a bound of the form $\exp\left( - \frac{T}{c \, \overline{\log}(n) \, H} \right)$. Precisely let

$$n_k = \left\lceil \frac{T - n}{\overline{\log}(n) \, (n + 1 - k)} \right\rceil, \qquad \text{where } \overline{\log}(n) = \frac{1}{2} + \sum_{i=2}^{n} \frac{1}{i}$$

is such that the $n_k$’s sum to at most $T$ (morally $\overline{\log}(n)$ is of order $\log n$). Then we have

$$\mathbb{P}(I_T \ne 1) \le c' \, n^2 \exp\left( - \frac{T - n}{c \, \overline{\log}(n) \, H} \right)$$

for some numerical constants $c, c'$. Thus we proved that SR finds the best arm with probability at least $1 - \delta$ provided that the budget is at least of order $H \log(n) \log \frac{n}{\delta}$.
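And a minimal simulation of Successive Rejects with phase lengths $n_k$ proportional to $\frac{T - n}{n + 1 - k}$, as in the Audibert–Bubeck–Munos paper (cosmetic details, like sampling incrementally across phases so that each active arm reaches $n_k$ total pulls, follow the description above):

```python
import math
import random

random.seed(1)
mu = [1.0, 0.3, 0.2, 0.1]            # arm 0 is the best
n, T = len(mu), 4000
logbar = 0.5 + sum(1.0 / i for i in range(2, n + 1))
active = list(range(n))
sums = [0.0] * n
pulls = [0] * n
prev = 0
for k in range(1, n):                # n - 1 phases
    n_k = math.ceil((T - n) / (logbar * (n + 1 - k)))
    for i in active:
        for _ in range(n_k - prev):  # bring every active arm up to n_k pulls
            sums[i] += random.gauss(mu[i], 1.0)
            pulls[i] += 1
    prev = n_k
    # reject the empirically worst active arm
    worst = min(active, key=lambda i: sums[i] / pulls[i])
    active.remove(worst)
winner = active[0]
```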

The continuous case

Many questions are still open in the continuous case. As I explained at the beginning of the post, convexity might not be the best assumption from a practical point of view, but it is nonetheless a very natural mathematical problem to try to understand the best rate of convergence in this case. This is still an open problem, and you can see the state of the art for upper bounds in this paper by Agarwal, Foster, Hsu, Kakade and Rakhlin, and for lower bounds in this paper by Shamir. In my opinion a more ‘practical’ assumption on the function is simply to assume that it is Lipschitz in some metric. The HOO (Hierarchical Optimistic Optimization) algorithm attains interesting performance when the metric is known, see this paper by myself, Munos, Stoltz and Szepesvari (similar results were obtained independently by Kleinberg, Slivkins and Upfal in this paper). Recently progress has been made for the case where the metric is unknown, see this paper by Slivkins, this paper by Munos, and this one by Bull. Finally let me remark that this Lipschitzian assumption is very weak, and thus one cannot hope to solve high-dimensional problems with the above algorithms. In fact, these algorithms are designed for small-dimensional problems. In standard optimization one can solve problems in very large dimension because of the convexity assumption. For $0^{th}$-order optimization I think that we still don’t have a natural assumption that would allow us to scale up the algorithms to higher-dimensional problems.

## ORF523: Acceleration by randomization for a sum of smooth and strongly convex functions

In this lecture we are interested in the following problem:

$$\min_{x \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} f_i(x),$$

where each $f_i$ is $\beta$-smooth and $\alpha$-strongly convex. We denote by $\kappa = \beta / \alpha$ the condition number of these functions (which is also the condition number of the average).

Using a gradient descent directly on the sum with step-size $1/\beta$, one obtains a rate of convergence of order $\exp(- c \, t / \kappa)$ for some numerical constant $c$ (see this lecture). In particular one can reach an $\epsilon$-optimal point in time $O\!\left( \kappa \, n N \log \frac{1}{\epsilon} \right)$ with this method, where $N$ is the time to compute the gradient of a single function $f_i$. Using Nesterov’s Accelerated Gradient Descent (see this lecture) one can reduce the time complexity to $O\!\left( \sqrt{\kappa} \, n N \log \frac{1}{\epsilon} \right)$.

Using the SGD method described in the previous lecture (with the step-size/averaging described here) one can reach an $\epsilon$-optimal point in time $O\!\left( \frac{N}{\alpha \epsilon} \right)$, which is much worse than the basic gradient descent if one is looking for a fairly accurate solution. We are interested in designing a randomized algorithm in the spirit of SGD that would leverage the smoothness of the function to attain a linear rate of convergence (i.e., the dependency in $\epsilon$ would be of the form $\log \frac{1}{\epsilon}$).

We first describe the SAG method proposed in this paper by Le Roux, Schmidt and Bach (2012). The method is simple and elegant: let $x_1 \in \mathbb{R}^d$, $y_i^0 = 0$ for all $i$, and for $t \ge 1$ let $i_t$ be drawn uniformly at random in $\{1, \dots, n\}$ and

$$y_i^t = \begin{cases} \nabla f_i(x_t) & \text{if } i = i_t, \\ y_i^{t-1} & \text{otherwise,} \end{cases} \qquad x_{t+1} = x_t - \frac{\eta}{n} \sum_{i=1}^{n} y_i^t.$$

They show that in the regime where $n \ge 8 \kappa$, SAG (with a step-size of order $\frac{1}{n \alpha}$ and with $x_1$ obtained as the average of iterates of SGD) has a rate of convergence of order $\left( 1 - \frac{1}{8n} \right)^t$ (see Proposition 2 in their paper for a more precise statement). Since each step has a time-complexity of $O(N)$, this results in an overall time-complexity of $O\!\left( n N \log \frac{1}{\epsilon} \right)$, which is much better than the complexity of the plain gradient descent (or even of Nesterov’s Accelerated Gradient Descent). Unfortunately the proof of this result is quite difficult. We will now describe another strategy with similar performance but with a much simpler analysis.

Short detour through Fenchel’s duality

Before describing the next strategy we first recall a few basic facts about Fenchel duality. First, the Fenchel conjugate of a function $f : \mathbb{R}^d \to \mathbb{R}$ is defined by:

$$f^*(y) = \sup_{x \in \mathbb{R}^d} \left( \langle x, y \rangle - f(x) \right).$$

Note that $f^*$ is always a convex function (possibly taking value $+\infty$). If $f$ is convex, then for any $x$ one clearly has $f^{**}(x) = f(x)$. Finally one can show that if $f$ is $\beta$-smooth then $f^*$ is $\frac{1}{\beta}$-strongly convex (one can use the second Lemma from this lecture).

Stochastic Dual Coordinate Ascent (SDCA)

We describe now the SDCA method proposed in this paper by Shalev-Shwartz and Zhang. We need to make a few more assumptions on the functions in the sum; precisely we will now assume that we want to solve the following problem (note that all norms here are Euclidean):

$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \phi_i(\langle w, x_i \rangle) + \frac{\lambda}{2} \|w\|^2,$$

where

• $\phi_i : \mathbb{R} \to \mathbb{R}$ is convex and $\beta$-smooth,
• $x_1, \dots, x_n \in \mathbb{R}^d$ with $\|x_i\| \le 1$,
• $\lambda > 0$ is a regularization parameter.

Denote

$$P(w) = \frac{1}{n} \sum_{i=1}^{n} \phi_i(\langle w, x_i \rangle) + \frac{\lambda}{2} \|w\|^2$$

the primal objective. For the dual objective we introduce a dual variable $\alpha_i$ for each data point (as well as a temporary primal variable $w(\alpha) = \frac{1}{\lambda n} \sum_{i=1}^{n} \alpha_i x_i$):

$$D(\alpha) = \frac{1}{n} \sum_{i=1}^{n} - \phi_i^*(-\alpha_i) - \frac{\lambda}{2} \|w(\alpha)\|^2 .$$

Next observe that by the KKT conditions one has at the optimum:

$$w^* = w(\alpha^*),$$

and thus denoting

$$\alpha^* \in \operatorname{argmax}_{\alpha \in \mathbb{R}^n} D(\alpha),$$

we just proved

$$D(\alpha^*) = P(w^*) = \min_{w \in \mathbb{R}^d} P(w).$$

We can now describe SDCA: let $\alpha^0 = 0 \in \mathbb{R}^n$ and for $t \ge 0$ let $i_t$ be drawn uniformly at random in $\{1, \dots, n\}$ and

$$\alpha^{t+1}_i = \alpha^t_i \;\; \text{for } i \ne i_t, \qquad \alpha^{t+1}_{i_t} = \operatorname{argmax}_{a \in \mathbb{R}} \; D\big( \alpha^t_1, \dots, \alpha^t_{i_t - 1}, a, \alpha^t_{i_t + 1}, \dots, \alpha^t_n \big).$$

Let $w_t = w(\alpha^t)$ be the corresponding primal variable. The following result justifies the procedure.

Theorem (Shalev-Shwartz and Zhang 2013). SDCA with $T \ge \left( n + \frac{\beta}{\lambda} \right) \log \left( \frac{n + \beta / \lambda}{\epsilon} \right)$ satisfies:

$$\mathbb{E}\left[ P(w_T) - D(\alpha^T) \right] \le \epsilon .$$

The above convergence rate yields an overall time-complexity of $O\!\left( \left( n + \frac{\beta}{\lambda} \right) N \log \frac{n + \beta/\lambda}{\epsilon} \right)$ for SDCA (where $N$ is the complexity of computing a gradient of $\phi_i$), which can be much better than the complexity of the plain gradient descent.

Proof: The key property that we will prove is that the primal/dual gap is controlled by the (expected) improvement in the dual objective; precisely:

(1) $\qquad \mathbb{E}\left[ D(\alpha^{t+1}) - D(\alpha^t) \,\middle|\, \alpha^t \right] \ge \frac{1}{n + \beta / \lambda} \left( P(w_t) - D(\alpha^t) \right).$

Let us see how the above inequality yields the result. First we clearly have $D(\alpha^t) \le P(w_t)$ (weak duality), and thus the above inequality immediately yields:

which by induction gives (using and ):

and thus using once more (1):

Thus it only remains to prove (1). First observe that the primal/dual gap can be written as:

Let us now evaluate the improvement in the dual objective assuming that coordinate is updated in the step:

Now using that is -strongly convex one has

which yields

where in the last line we used that and . Observe now that

which finally gives

and thus taking expectation one obtains

which concludes the proof of (1) and the proof of the theorem.


## ORF523: Noisy oracles

Today we start the third (and last) part of our class, titled ‘Optimization and Randomness’. An important setting that we will explore is the one of optimization with noisy oracles:

• Noisy $0^{th}$-order oracle: given $x \in \mathcal{X}$ it outputs a random variable $\tilde{f}(x)$ such that $\mathbb{E}[\tilde{f}(x)] = f(x)$.
• Noisy $1^{st}$-order oracle: given $x \in \mathcal{X}$ it outputs a random variable $\tilde{g}(x)$ such that $\mathbb{E}[\tilde{g}(x)] \in \partial f(x)$.

Somewhat surprisingly, several of the first-order algorithms we studied so far are robust to noise and attain the same performance if one feeds them with the answers of a noisy $1^{st}$-order oracle. This was first observed by Robbins and Monro in the early Fifties, and the resulting set of techniques is often referred to as stochastic approximation methods. For instance let us consider the Stochastic Mirror Descent, which simply replaces the gradient in the Mirror Descent update by the noisy estimate $\tilde{g}(x_t)$.

The analysis of this algorithm follows the exact same lines as the standard version. The key observation is that:

$$\mathbb{E}\left[ \langle \tilde{g}(x_t), x_t - x \rangle \,\middle|\, x_t \right] = \langle g(x_t), x_t - x \rangle, \qquad g(x_t) \in \partial f(x_t),$$

and now one can repeat the standard proof, since this inner product is all that the analysis uses. This yields the following result.

Theorem. Let $\Phi$ be a mirror map $\rho$-strongly convex on $\mathcal{X}$ with respect to $\|\cdot\|$, and let $R^2 = \sup_{x \in \mathcal{X}} \Phi(x) - \Phi(x_1)$. Assume that $f$ is convex and that the noisy oracle is such that $\mathbb{E} \|\tilde{g}(x)\|_*^2 \le B^2$. Then Stochastic Mirror Descent with $\eta = \frac{R}{B} \sqrt{\frac{2 \rho}{t}}$ satisfies

$$\mathbb{E}\, f\!\left( \frac{1}{t} \sum_{s=1}^{t} x_s \right) - \min_{x \in \mathcal{X}} f(x) \le R B \sqrt{\frac{2}{\rho \, t}} .$$

Of course the exact same reasoning applies to the Mirror Descent for Saddle Points. One can also get a rate of order $\frac{1}{\alpha t}$ for $\alpha$-strongly convex functions using the idea described in this lecture. High-probability results can be obtained using Markov’s inequality, and stronger concentration results can be proved under a subgaussian assumption on the tails of the estimates $\tilde{g}(x)$, see Proposition 2.2 here.

Let us now see some applications of this noisy framework.

Statistical Learning

We are interested in solving the following problem:

$$\min_{w \in \mathcal{W}} f(w), \qquad \text{where } f(w) = \mathbb{E}_{Z}\, \ell(w, Z),$$

where $\ell$ is a known loss function such that $\ell(\cdot, z)$ is a convex function for any $z$, and $Z$ is a random variable whose distribution is unknown. While the distribution of $Z$ is unknown, we assume that we have access to a sequence of i.i.d. examples $Z_1, Z_2, \dots$. A typical example within this framework is the case where $z = (x, y)$ represents a pair of (input, output) and the function $\ell$ is one of the loss functions we saw in this lecture, for example the logistic loss $\ell(w, (x, y)) = \log\left( 1 + \exp(- y \langle w, x \rangle) \right)$ with $x \in \mathbb{R}^d$ and $y \in \{-1, 1\}$.

Let $g(w, z)$ be a subgradient of $\ell(\cdot, z)$ at $w$, and let $G(w) = \mathbb{E}_Z\, g(w, Z)$. We assume that $G(w) \in \partial f(w)$. Note that we cannot perform a gradient descent on $f$ as we would need to know the distribution of $Z$ to compute a subgradient of $f$. On the other hand one has access to unbiased estimates of these subgradients with the i.i.d. examples $Z_1, Z_2, \dots$. Thus one can use the Stochastic Mirror Descent, where at each iteration a new example is used to compute the noisy subgradient; precisely we take $\tilde{g}(w_t) = g(w_t, Z_t)$. In particular with $m$ examples one can only run the algorithm for $m$ steps, and one gets to a precision of order $\frac{1}{\sqrt{m}}$. If in addition $f$ is strongly convex then one can in fact obtain a precision of order $\frac{1}{m}$ with the step sizes/averaging described in this lecture.
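A minimal sketch of this recipe with the logistic loss, where each fresh i.i.d. example provides the unbiased stochastic gradient (Euclidean setup with step size of order $1/\sqrt{t}$; the data-generating model is invented for the illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 5000
w_true = np.array([1.0, -2.0, 0.5])   # hypothetical ground-truth model

w = np.zeros(d)
correct = 0
for t in range(1, m + 1):
    x = rng.standard_normal(d)        # a fresh i.i.d. example Z_t = (x, y)
    y = 1.0 if rng.random() < 1.0 / (1.0 + np.exp(-w_true @ x)) else -1.0
    if t > m // 2:                    # measure accuracy on the second half
        correct += int(np.sign(w @ x) == y)
    # stochastic gradient of the logistic loss at the current iterate
    g = -y * x / (1.0 + np.exp(y * (w @ x)))
    w -= g / np.sqrt(t)
accuracy = correct / (m - m // 2)
```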

Acceleration by randomization

In the previous example there was no alternative to the use of noisy subgradients, as the distribution of $Z$ was unknown. However in some cases, even if this distribution is known, it might be beneficial to artificially sample examples from it in order to avoid computing difficult integrals. Consider for example the problem of minimizing an average of convex functions:

$$\min_{x \in \mathcal{X}} \; \frac{1}{n} \sum_{i=1}^{n} f_i(x).$$

We have seen several important examples of this form where $n$ might be very large. Note that computing a subgradient of the average takes time of order $n N$, where $N$ is the time to compute the subgradient of a single function $f_i$. Thus a standard gradient descent algorithm will take time of order $\frac{n N}{\epsilon^2}$ to reach an $\epsilon$-optimal point. On the other hand one can simply sample (with replacement) an index $i$ at every iteration (which takes constant time), and perform a step of gradient descent using only the subgradient of $f_i$. This method will reach an $\epsilon$-optimal point (in expectation) in time of order $\frac{N}{\epsilon^2}$, which is much cheaper than the requirement for GD!

Note that if the function is also strongly convex, then GD needs time of order $\frac{n N}{\alpha \epsilon}$ to reach an $\epsilon$-optimal point, and SGD needs only time of order $\frac{N}{\alpha \epsilon}$ (with appropriate step sizes/averaging as discussed above). The situation becomes more favorable to GD if in addition the function is smooth (recall that for strongly convex and smooth functions one has an exponential rate of convergence for GD), and we will come back to this in the next lecture.

Let us consider another example, which we studied at the end of the previous lecture:

$$\min_{x \in \Delta_n} \max_{y \in \Delta_n} \; x^\top A y,$$

where $A \in \mathbb{R}^{n \times n}$ is symmetric (the symmetry assumption is just for sake of notation). It is clear that the rate of convergence of MD-SP for this problem is of order $\sqrt{\frac{\log n}{t}}$. Furthermore each step requires computing a subgradient in $x$ and in $y$, which are of the form $A y$ and $A^\top x$. Thus the time complexity of one step is of order $n^2$ (which is the time complexity of a matrix-vector multiplication). Overall, to reach an $\epsilon$-optimal point one needs of order of $\frac{n^2 \log n}{\epsilon^2}$ elementary operations. Now the key observation is that the matrix-vector multiplication can easily be randomized, especially since $y \in \Delta_n$ is a probability distribution. Let $A_1, \dots, A_n$ be the column vectors of $A$; then $A y = \sum_{i=1}^{n} y_i A_i$, and if $I$ is a random variable such that $\mathbb{P}(I = i) = y_i$ one has $\mathbb{E}\, A_I = A y$. Using this randomization in a Stochastic MD-SP one obtains the exact same rate of convergence (in expectation) but now each step has time complexity of order $n$, resulting in an overall complexity to reach an $\epsilon$-optimal point (in expectation) of order $\frac{n \log n}{\epsilon^2}$. This result is especially amazing since we obtain a sublinear time algorithm (the complexity is $o(n^2)$ while the ‘size’ of the problem is $n^2$). One can propose an even better algorithm based on Mirror Prox, and we refer the reader to this book chapter by Juditsky and Nemirovski for more details.

Posted in Optimization | 1 Comment

## ORF523: Mirror Prox

Today I’m going to talk about a variant of Mirror Descent invented by Nemirovski in 2004 (see this paper). This new algorithm is called Mirror Prox and it is designed to minimize a smooth convex function $f$ over a compact convex set $\mathcal{X}$. We assume that the smoothness is measured in some arbitrary norm $\|\cdot\|$, precisely we assume that $\|\nabla f(x) - \nabla f(y)\|_* \leq \beta \|x - y\|$ for all $x, y \in \mathcal{X}$.

Let $\Phi$ be a mirror map on $\mathcal{X}$ (see this post for a definition) and let $x_1 \in \operatorname{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} \Phi(x)$. Mirror Prox is described by the following equations:

$$\nabla \Phi(y_{t+1}') = \nabla \Phi(x_t) - \eta \nabla f(x_t), \qquad y_{t+1} \in \operatorname{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} D_{\Phi}(x, y_{t+1}'),$$

$$\nabla \Phi(x_{t+1}') = \nabla \Phi(x_t) - \eta \nabla f(y_{t+1}), \qquad x_{t+1} \in \operatorname{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} D_{\Phi}(x, x_{t+1}').$$

In words the algorithm first makes a step of Mirror Descent to go from $x_t$ to $y_{t+1}$, and then it makes a similar step to obtain $x_{t+1}$, starting again from $x_t$ but this time using the gradient of $f$ evaluated at $y_{t+1}$ (instead of $x_t$). The following result justifies the procedure.

Theorem. Let $\Phi$ be a mirror map $\rho$-strongly convex on $\mathcal{X} \cap \mathcal{D}$ with respect to $\|\cdot\|$. Let $R^2 = \sup_{x \in \mathcal{X} \cap \mathcal{D}} \Phi(x) - \Phi(x_1)$, and let $f$ be convex and $\beta$-smooth w.r.t. $\|\cdot\|$. Then Mirror Prox with $\eta = \frac{\rho}{\beta}$ satisfies (with $\bar{y}_t = \frac{1}{t} \sum_{s=1}^{t} y_{s+1}$)

$$f(\bar{y}_t) - \min_{x \in \mathcal{X}} f(x) \leq \frac{\beta R^2}{\rho\, t}.$$

Proof: Let . We write

We will now bound separately these three terms. For the first one we have (by definition of the method, first-order optimality condition of the projection, and the definition of the Bregman divergence)

For the second term, using the same properties as above and the strong convexity of the mirror map, one has

Finally for the last term, using Hölder’s inequality, $\beta$-smoothness, and the fact that $\eta = \rho/\beta$, one has

Thus summing up these three terms and using that one gets

and this inequality is also true for (see the bound for the first term). The proof is then concluded as usual.
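As a sanity check, one can run Mirror Prox in the simplest setup where the mirror map is $\frac{1}{2}\|\cdot\|_2^2$, in which case the two steps above reduce to the extragradient method. The following sketch (a toy least-squares objective over the Euclidean unit ball; the dimensions and horizon are illustrative choices) compares the averaged Mirror Prox point to a long run of projected gradient descent:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 20
M = rng.standard_normal((d, d))
b = rng.standard_normal(d)
f = lambda z: 0.5 * np.sum((M @ z - b) ** 2)
grad = lambda z: M.T @ (M @ z - b)
beta = np.linalg.norm(M, 2) ** 2       # smoothness constant of f w.r.t. ||.||_2

def proj_ball(z):                      # Euclidean projection onto the unit ball
    nz = np.linalg.norm(z)
    return z if nz <= 1 else z / nz

eta = 1.0 / beta                       # step size rho/beta with rho = 1
x, avg, T = np.zeros(d), np.zeros(d), 2000
for _ in range(T):
    y = proj_ball(x - eta * grad(x))   # first step: gradient taken at x_t
    x = proj_ball(x - eta * grad(y))   # second step from x_t, gradient at y_{t+1}
    avg += y                           # the guarantee is for the average of the y's
avg /= T

# reference solution from many projected gradient steps
x_ref = np.zeros(d)
for _ in range(20000):
    x_ref = proj_ball(x_ref - eta * grad(x_ref))
gap = f(avg) - f(x_ref)
print(gap)
```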

Following the generalization of Mirror Descent for Saddle Points (MD-SP, see this post) one can propose a version of Mirror Prox for Saddle Points (MP-SP). In fact the importance of Mirror Prox lies primarily in its application to computing the saddle point of a smooth convex-concave function. Indeed, the key observation is that many non-smooth convex functions admit a smooth saddle-point representation (see examples below). With such a representation one can then use MP-SP to obtain a rate of convergence of order $1/t$, despite the fact that the original function is non-smooth! Recall that we proved that for non-smooth convex functions the optimal black-box rate is of order $1/\sqrt{t}$. However MP-SP is not a black-box method in the sense that it makes use of the structure of the function (by representing it as a saddle point). Nonetheless MP-SP still enjoys the nice time-complexity properties of black-box methods since it is a first-order method. It is a win-win situation!

Let us now describe MP-SP more precisely. Let $\mathcal{X}$ be a compact and convex set endowed with some norm $\|\cdot\|_{\mathcal{X}}$ and a mirror map $\Phi_{\mathcal{X}}$ (defined on $\mathcal{D}_{\mathcal{X}}$), and let $\mathcal{Y}$ be a compact and convex set endowed with some norm $\|\cdot\|_{\mathcal{Y}}$ and a mirror map $\Phi_{\mathcal{Y}}$ (defined on $\mathcal{D}_{\mathcal{Y}}$). Let $\phi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ be a continuously differentiable function such that $\phi(\cdot, y)$ is convex and $\phi(x, \cdot)$ is concave. We assume that $\phi$ is smooth in the sense that:

We consider the mirror map $\Phi(z) = \Phi_{\mathcal{X}}(x) + \Phi_{\mathcal{Y}}(y)$ on $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ (defined on $\mathcal{D}_{\mathcal{X}} \times \mathcal{D}_{\mathcal{Y}}$). Let $z_1 \in \operatorname{argmin}_{z \in \mathcal{Z} \cap \mathcal{D}} \Phi(z)$. Then for $t \geq 1$, let $w_{t+1}$ and $z_{t+1}$ be defined by the following equations (with $z_t = (x_t, y_t)$):

Theorem. Assume that $\Phi_{\mathcal{X}}$ is $\rho_{\mathcal{X}}$-strongly convex on $\mathcal{X} \cap \mathcal{D}_{\mathcal{X}}$ with respect to $\|\cdot\|_{\mathcal{X}}$ and let $R_{\mathcal{X}}^2 = \sup_{x \in \mathcal{X}} \Phi_{\mathcal{X}}(x) - \min_{x \in \mathcal{X}} \Phi_{\mathcal{X}}(x)$ (with similar definitions for $\mathcal{Y}$). Then MP-SP with a suitable choice of $\eta$ satisfies

Proof: First we endow $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ with the norm $\|\cdot\|_{\mathcal{Z}}$ defined by

It is immediate that $\Phi$ is $1$-strongly convex with respect to $\|\cdot\|_{\mathcal{Z}}$ on $\mathcal{Z} \cap \mathcal{D}$. Furthermore one can easily check that

and thus the vector fields used in MP-SP satisfy:

It suffices now to use the exact same proof as for the standard Mirror Prox, starting with:

Smooth saddle-point representation of non-smooth functions

Assume that the function to be optimized is of the form:

$$f(x) = \max_{1 \leq i \leq m} f_i(x),$$

where the functions $f_i$ are convex, $\beta$-smooth and $L$-Lipschitz in some norm $\|\cdot\|_{\mathcal{X}}$. Minimizing $f$ over $\mathcal{X}$ is equivalent to finding a saddle point over $\mathcal{X} \times \Delta_m$ since one has:

$$\min_{x \in \mathcal{X}} f(x) = \min_{x \in \mathcal{X}} \max_{y \in \Delta_m} \sum_{i=1}^m y_i f_i(x).$$

Thus we are looking for the saddle point of $\phi(x, y) = \sum_{i=1}^m y_i f_i(x)$, which is convex in $x$, and concave (in fact linear!) in $y$. Furthermore $\phi$ is smooth and it is an easy exercise to verify that in this case one has $\beta_{11} = \beta$, $\beta_{12} = L$, $\beta_{21} = L$, and $\beta_{22} = 0$. Thus a rate of convergence (to the minimizer of $f$) of order $1/t$ can be obtained by applying MP-SP with appropriate mirror maps (in particular the update in the $y$-space has to be done with the negative entropy). Note that this rate is particularly striking since the oracle lower bound of order $1/\sqrt{t}$ was proved precisely for this type of function! Here we are able to get around this lower bound by making use of the structure of the function. Remark also that if you think about what this algorithm is doing to optimize the function you get a very elegant procedure for which it is highly non-trivial a priori to prove anything about its rate of convergence.

Finally note that if the radii of $\mathcal{X}$ and $\Delta_m$ are of different orders of magnitude one can obtain a better result by considering the weighted mirror map $a \Phi_{\mathcal{X}}(x) + b \Phi_{\Delta_m}(y)$ and optimizing over the values of $a$ and $b$.

Another interesting example arises when $\phi(x, y) = x^{\top} A y$ with $x \in \Delta_n$, $y \in \Delta_m$, and $A \in \mathbb{R}^{n \times m}$. With the $\ell_1$-norm on both spaces it is easy to see that the smoothness is governed by $\|A\|_{\max}$ (the entrywise max norm), and of course $R_{\Delta_n}^2 = \log n$ and $R_{\Delta_m}^2 = \log m$. Thus we obtain here a convergence rate of order $\frac{\|A\|_{\max} \log(nm)}{t}$. Again if $\log n$ and $\log m$ are of different orders of magnitude one can improve the result to obtain a rate of order $\frac{\|A\|_{\max} \sqrt{\log(n) \log(m)}}{t}$ by considering the weighted mirror map $a \Phi_{\Delta_n}(x) + b \Phi_{\Delta_m}(y)$ and optimizing over the values of $a$ and $b$.

Many more interesting examples can be derived and we refer the reader to the original paper of Nemirovski as well as this book chapter by Juditsky and Nemirovski.

Posted in Optimization | 2 Comments

## COLT 2013 accepted papers

The accepted papers for COLT 2013 have just been revealed. You can also find the list below with a link to the arxiv version when I could find one (if you want me to add a link just send me an email).

I also take the opportunity of this post to remind my student readers that you can apply for a ‘student travel award’ to come participate in COLT 2013. The instructions can be found here. So far I have received very few applications so I highly encourage you to give it a try! You can check out this blog post which gives a few reasons to make the effort to travel to and participate in conferences.

Update with a message from Philippe Rigollet: Sebastien Bubeck and I are local organizers of this year’s edition of the Conference On Learning Theory (COLT 2013) on June 12-14.
This means that it will take place at Princeton (in Friend 101), and of course, that it’ll be awesome.
The list of accepted papers is very exciting and has been posted on the official COLT 2013 website:

Thanks to generous sponsors, we have been able to lower the registration costs to a historic minimum (regular: $300, students: $150).
These costs include:
-a cocktail party followed by a banquet on Thursday night
-coffee breaks (with poster presentations)
-lunches (with poster presentations)
-47 amazing presentations
-3 Keynote lectures by Sanjeev Arora (Princeton), Ralf Herbrich (amazon.com) and Yann LeCun (NYU) respectively

The deadline for early registration is May 13 (even if you’re local, help us support the conference by registering and not just crashing the party).
To register, go to:
http://orfe.princeton.edu/conferences/colt2013/?q=for-participants/registration

COLT 2013 accepted papers

## ORF523: Mirror Descent, part II/II

We start with some of the standard setups for the Mirror Descent algorithm we described in the previous post.

Standard setups for Mirror Descent

‘Ball’ setup. The simplest version of Mirror Descent is obtained by taking $\Phi(x) = \frac{1}{2} \|x\|_2^2$ on $\mathcal{D} = \mathbb{R}^n$. The function $\Phi$ is a mirror map $1$-strongly convex w.r.t. $\|\cdot\|_2$, and furthermore the associated Bregman divergence is given by $D_{\Phi}(x, y) = \frac{1}{2} \|x - y\|_2^2$. Thus in that case Mirror Descent is exactly equivalent to Projected Subgradient Descent, and the rate of convergence obtained in the previous post recovers our earlier result on Projected Subgradient Descent.

‘Simplex’ setup. A more interesting choice of a mirror map is given by the negative entropy

$$\Phi(x) = \sum_{i=1}^n x_i \log x_i$$

on $\mathcal{D} = \mathbb{R}^n_{++}$. In that case the gradient update $\nabla \Phi(y_{t+1}) = \nabla \Phi(x_t) - \eta g_t$ can be written equivalently as

$$y_{t+1, i} = x_{t, i} \exp(-\eta g_{t, i}), \quad i = 1, \ldots, n.$$

The Bregman divergence of this mirror map is given by $D_{\Phi}(x, y) = \sum_{i=1}^n x_i \log \frac{x_i}{y_i} - \sum_{i=1}^n x_i + \sum_{i=1}^n y_i$ (also known as the generalized Kullback-Leibler divergence). It is easy to verify that the projection with respect to this Bregman divergence on the simplex $\Delta_n = \{x \in \mathbb{R}^n_+ : \sum_{i=1}^n x_i = 1\}$ amounts to a simple renormalization $y \mapsto y / \|y\|_1$. Furthermore it is also easy to verify that $\Phi$ is $1$-strongly convex w.r.t. $\|\cdot\|_1$ on $\Delta_n$ (this result is known as Pinsker’s inequality). Note also that when $\mathcal{X} = \Delta_n$ one has $x_1 = (1/n, \ldots, 1/n)$ and $R^2 = \log n$.

The above observations imply that when minimizing on the simplex a function $f$ with subgradients bounded in $\ell_\infty$-norm, Mirror Descent with the negative entropy achieves a rate of convergence of order $\sqrt{\frac{\log n}{t}}$. On the other hand the regular Subgradient Descent achieves only a rate of order $\sqrt{\frac{n}{t}}$ in this case!
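Concretely, the entropic update and the renormalization-projection take one line each. Here is a toy sketch on a linear objective over the simplex (the instance, step size and horizon are illustrative choices; since the minimum of a linear function over $\Delta_n$ is $\min_i c_i$, the suboptimality is easy to check):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 100
c = rng.random(d)                  # linear objective f(x) = <c, x>, ||c||_inf <= 1

x = np.ones(d) / d                 # uniform start: the minimizer of the entropy
avg = np.zeros(d)
T = 1000
eta = np.sqrt(2 * np.log(d) / T)   # step size suggested by the sqrt(log(n)/t) rate
for _ in range(T):
    g = c                          # (sub)gradient of the linear objective
    x = x * np.exp(-eta * g)       # multiplicative 'entropic' update
    x /= x.sum()                   # Bregman projection onto the simplex
    avg += x
avg /= T

gap = c @ avg - c.min()            # suboptimality of the averaged iterate
print(gap)
```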

‘Spectrahedron’ setup. We consider here functions defined on matrices, and we are interested in minimizing a function $f$ on the spectrahedron $\mathcal{S}_n$ defined as:

$$\mathcal{S}_n = \{ X \in \mathbb{S}^n_+ : \mathrm{Tr}(X) = 1 \}.$$

In this setting we consider the mirror map on $\mathcal{D} = \{X \in \mathbb{S}^n : X \succ 0\}$ given by the negative von Neumann entropy:

$$\Phi(X) = \sum_{i=1}^n \lambda_i(X) \log \lambda_i(X),$$

where $\lambda_1(X), \ldots, \lambda_n(X)$ are the eigenvalues of $X$. It can be shown that the gradient update $\nabla \Phi(Y_{t+1}) = \nabla \Phi(X_t) - \eta \nabla f(X_t)$ can be written equivalently as

$$Y_{t+1} = \exp\big( \log X_t - \eta \nabla f(X_t) \big),$$

where the matrix exponential and matrix logarithm are defined as usual. Furthermore the projection on $\mathcal{S}_n$ is a simple trace renormalization $X \mapsto X / \mathrm{Tr}(X)$.

With highly non-trivial computation one can show that $\Phi$ is $\frac{1}{2}$-strongly convex with respect to the Schatten $1$-norm defined as

$$\|X\|_1 = \sum_{i=1}^n |\lambda_i(X)|.$$

It is easy to see that for $\mathcal{X} = \mathcal{S}_n$ one has $x_1 = \frac{1}{n} \mathrm{I}_n$ and $R^2 = \log n$. In other words the rate of convergence for optimization on the spectrahedron is the same as on the simplex!
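As a sketch of the matrix analogue (the matrix exponential and logarithm are computed here via an eigendecomposition, with a small eigenvalue clip added purely for numerical safety; the instance, step size and horizon are illustrative choices), one can minimize the linear function $\langle C, X \rangle$ over the spectrahedron, whose optimal value is $\lambda_{\min}(C)$:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
C = rng.standard_normal((n, n)); C = (C + C.T) / 2   # f(X) = <C, X>, gradient C

def expm_sym(S):                      # matrix exponential via eigendecomposition
    w, V = np.linalg.eigh(S)
    return (V * np.exp(w)) @ V.T

def logm_sym(S):                      # matrix log; clip tiny eigenvalues so that
    w, V = np.linalg.eigh(S)          # rounding errors never produce log(<=0)
    return (V * np.log(np.maximum(w, 1e-18))) @ V.T

X = np.eye(n) / n                     # 'uniform' point of the spectrahedron
eta, T = 0.05, 600
avg = np.zeros((n, n))
for _ in range(T):
    X = expm_sym(logm_sym(X) - eta * C)   # multiplicative update in matrix form
    X /= np.trace(X)                      # projection = trace renormalization
    avg += X
avg /= T

# the minimum of <C, X> over the spectrahedron is the smallest eigenvalue of C
gap = np.sum(C * avg) - np.linalg.eigvalsh(C).min()
print(gap)   # small suboptimality
```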

We consider now the following saddle-point problem: Let $\mathcal{X} \subseteq \mathbb{R}^n$, $\mathcal{Y} \subseteq \mathbb{R}^m$ be compact convex sets, and $\phi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ be a continuous function such that $\phi(\cdot, y)$ is convex and $\phi(x, \cdot)$ is concave. We are interested in finding $(x^*, y^*) \in \mathcal{X} \times \mathcal{Y}$ such that

$$\phi(x^*, y^*) = \min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} \phi(x, y) = \max_{y \in \mathcal{Y}} \min_{x \in \mathcal{X}} \phi(x, y).$$

(Note that the above equality uses Sion’s minimax theorem from this lecture.) We measure the quality of a candidate solution $(\tilde{x}, \tilde{y})$ through the primal/dual gap

$$\max_{y \in \mathcal{Y}} \phi(\tilde{x}, y) - \min_{x \in \mathcal{X}} \phi(x, \tilde{y}).$$

The key observation now is that for any $(x, y) \in \mathcal{X} \times \mathcal{Y}$, one has with $g_{\mathcal{X}} \in \partial_x \phi(x, y)$ and $g_{\mathcal{Y}} \in \partial_y (-\phi(x, y))$,

and

In particular, denoting $z = (x, y)$ and $g = (g_{\mathcal{X}}, g_{\mathcal{Y}})$ we just proved

for some choice of the subgradients above. This inequality suggests that one can solve the saddle-point problem with a Mirror Descent in the space $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ using the ‘vector field’ given by $g = (g_{\mathcal{X}}, g_{\mathcal{Y}})$. Precisely this gives the following SP Mirror Descent:

Let $\Phi_{\mathcal{X}}$ be a mirror map on $\mathcal{X}$ (defined on $\mathcal{D}_{\mathcal{X}}$) and let $\Phi_{\mathcal{Y}}$ be a mirror map on $\mathcal{Y}$ (defined on $\mathcal{D}_{\mathcal{Y}}$). We consider now the mirror map $\Phi(z) = \Phi_{\mathcal{X}}(x) + \Phi_{\mathcal{Y}}(y)$ on $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ (defined on $\mathcal{D}_{\mathcal{X}} \times \mathcal{D}_{\mathcal{Y}}$).

Let $z_1 \in \operatorname{argmin}_{z \in \mathcal{Z} \cap \mathcal{D}} \Phi(z)$. Then for $t \geq 1$, let $z_{t+1}$ be such that

where $g_t = (g_{\mathcal{X}, t}, g_{\mathcal{Y}, t})$ with $g_{\mathcal{X}, t} \in \partial_x \phi(x_t, y_t)$ and $g_{\mathcal{Y}, t} \in \partial_y (-\phi(x_t, y_t))$. Finally let $\bar{z}_t = \frac{1}{t} \sum_{s=1}^{t} z_s$.

The following result describes the performance of SP Mirror Descent.

Theorem. Assume that $\Phi_{\mathcal{X}}$ is $\rho_{\mathcal{X}}$-strongly convex on $\mathcal{X} \cap \mathcal{D}_{\mathcal{X}}$ with respect to $\|\cdot\|_{\mathcal{X}}$, and let $R_{\mathcal{X}}^2 = \sup_{x \in \mathcal{X}} \Phi_{\mathcal{X}}(x) - \min_{x \in \mathcal{X}} \Phi_{\mathcal{X}}(x)$. Assume that $\phi(\cdot, y)$ is $L_{\mathcal{X}}$-Lipschitz w.r.t. $\|\cdot\|_{\mathcal{X}}$. Define $\rho_{\mathcal{Y}}$, $R_{\mathcal{Y}}$ and $L_{\mathcal{Y}}$ similarly. Then SP Mirror Descent with a suitable choice of $\eta$ satisfies

Proof: First we endow $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ with the norm $\|\cdot\|_{\mathcal{Z}}$ defined by

It is immediate that $\Phi$ is $1$-strongly convex with respect to $\|\cdot\|_{\mathcal{Z}}$ on $\mathcal{Z} \cap \mathcal{D}$. Furthermore one can easily check that

and thus the vector field used in the SP mirror descent satisfies:

It suffices now to use the exact same proof as for the standard Mirror Descent, starting with:
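To make the scheme concrete, here is a toy sketch of SP Mirror Descent on a bilinear matrix game $\min_{x \in \Delta_n} \max_{y \in \Delta_m} x^{\top} A y$, using the entropic mirror map on each simplex (the payoff matrix, step size and horizon are illustrative choices; the primal/dual gap of the averaged pair can be evaluated exactly at the vertices of the simplices):

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 10, 8
A = rng.random((n, m))                 # payoff matrix, entries in [0, 1]

def md_step(p, g, eta):                # one entropic Mirror Descent step on a simplex
    p = p * np.exp(-eta * g)
    return p / p.sum()                 # projection = renormalization

x, y = np.ones(n) / n, np.ones(m) / m
xs, ys = np.zeros(n), np.zeros(m)
eta, T = 0.05, 4000
for _ in range(T):
    gx, gy = A @ y, -(A.T @ x)         # the vector field (d_x phi, d_y(-phi))
    x, y = md_step(x, gx, eta), md_step(y, gy, eta)
    xs += x; ys += y
xs /= T; ys /= T

# primal/dual gap of the averaged pair: max_y xs^T A y - min_x x^T A ys
gap = (A.T @ xs).max() - (A @ ys).min()
print(gap)
```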

Posted in Optimization | 3 Comments

## ORF523: Mirror Descent, part I/II

In this post we are interested in minimizing a convex function $f$ over a compact convex set $\mathcal{X}$ under the assumption that $f$ is $L$-Lipschitz on $\mathcal{X}$ with respect to some arbitrary norm $\|\cdot\|$ (precisely we assume that every subgradient $g \in \partial f(x)$ satisfies $\|g\|_* \leq L$). As we have seen in this post, using the correct norm to measure the variation of a function can make an enormous difference. We describe now a method invented by Nemirovski and Yudin to deal with this situation.

Let us for a moment abstract the situation a little bit and forget that we are trying to do optimization in finite dimension. We already observed that Projected Subgradient Descent works in an arbitrary Hilbert space $\mathcal{H}$ (think of $\mathcal{H} = \mathbb{R}^n$). However we are now interested in the more general situation where we are optimizing the function in some Banach space $\mathcal{B}$. In that case the Gradient Descent strategy does not even make sense: indeed the gradients (more formally the Fréchet derivatives) are elements of the dual space $\mathcal{B}^*$ and thus one cannot perform the computation $x - \eta \nabla f(x)$ (it simply does not make sense). We did not have this problem for optimization in a Hilbert space since by the Riesz representation theorem $\mathcal{H}^*$ is isometric to $\mathcal{H}$. The great insight of Nemirovski and Yudin is that one can still do a gradient descent by first mapping the point $x \in \mathcal{B}$ into the dual space $\mathcal{B}^*$, then performing the gradient update in the dual space, and finally mapping back the resulting point to the primal space $\mathcal{B}$. Of course the new point in the primal space might lie outside of the constraint set $\mathcal{X} \subseteq \mathcal{B}$ and thus we need a way to project back the point on the constraint set $\mathcal{X}$. The rest of this lecture will make this idea more precise.

Preliminaries

We consider a non-empty convex open set $\mathcal{D} \subseteq \mathbb{R}^n$ such that the constraint set $\mathcal{X}$ is included in its closure, that is $\mathcal{X} \subseteq \overline{\mathcal{D}}$, and such that $\mathcal{X} \cap \mathcal{D} \neq \emptyset$.

Now we consider a continuously differentiable and strictly convex function $\Phi$ defined on $\mathcal{D}$. This function will play the role of the mirror map between the primal and dual space. Precisely a point $x \in \mathcal{D}$ in the primal is mapped to $\nabla \Phi(x)$ in the ‘dual’ (note that here all points lie in $\mathbb{R}^n$ and the notions of ‘primal’ and ‘dual’ space only have an intuitive meaning). In other words the gradient update of Mirror Descent with the mirror map $\Phi$ will take the form $\nabla \Phi(y_{t+1}) = \nabla \Phi(x_t) - \eta g_t$. Next this dual point needs to be mapped back to the primal space, which is always possible under the condition that $\nabla \Phi(\mathcal{D}) = \mathbb{R}^n$ (weaker assumptions are possible but we will not need them), that is one can find $y_{t+1} \in \mathcal{D}$ such that

$$\nabla \Phi(y_{t+1}) = \nabla \Phi(x_t) - \eta g_t.$$

Now this point might lie outside of the constraint set $\mathcal{X}$. To project back on the constraint set we use the Bregman divergence associated to $\Phi$, defined by

$$D_{\Phi}(x, y) = \Phi(x) - \Phi(y) - \nabla \Phi(y)^{\top} (x - y).$$

The projection of $y \in \mathcal{D}$ onto $\mathcal{X}$ then takes the following form:

$$\Pi_{\mathcal{X}}^{\Phi}(y) = \operatorname*{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} D_{\Phi}(x, y).$$

To ensure the existence of this projection we assume now that $\Phi$ satisfies

$$\lim_{x \to \partial \mathcal{D}} \|\nabla \Phi(x)\| = +\infty.$$

With this assumption $D_{\Phi}(x, \cdot)$ is locally increasing on the boundary of $\mathcal{D}$, which together with the fact that $\mathcal{X}$ is compact and $\Phi$ is strictly convex implies the existence and uniqueness of the projection defined above.
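Two classical instances of the Bregman divergence can be checked numerically (a small sketch with arbitrary points; the identities themselves are the standard ones): with $\Phi = \frac{1}{2}\|\cdot\|_2^2$ one recovers half the squared Euclidean distance, and with the negative entropy on the positive orthant one recovers the generalized Kullback-Leibler divergence.

```python
import numpy as np

rng = np.random.default_rng(5)

# D_Phi(x, y) = Phi(x) - Phi(y) - <grad Phi(y), x - y>
def bregman(phi, grad_phi, x, y):
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

x, y = rng.standard_normal(5), rng.standard_normal(5)

# Phi = 0.5 ||.||_2^2 gives half the squared Euclidean distance, which is why
# Mirror Descent with this mirror map is Projected Subgradient Descent
d_euc = bregman(lambda z: 0.5 * z @ z, lambda z: z, x, y)
ok_euc = np.isclose(d_euc, 0.5 * np.sum((x - y) ** 2))

# negative entropy on the positive orthant gives the generalized KL divergence
p, q = rng.random(5), rng.random(5)
d_kl = bregman(lambda z: np.sum(z * np.log(z)), lambda z: np.log(z) + 1, p, q)
ok_kl = np.isclose(d_kl, np.sum(p * np.log(p / q)) - p.sum() + q.sum())
print(ok_euc, ok_kl)   # True True
```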

We say that is a mirror map if it satisfies all the above assumptions.

Mirror Descent

We can now describe the Mirror Descent strategy based on a mirror map $\Phi$. Let $x_1 \in \operatorname{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} \Phi(x)$. Then for $t \geq 1$, let $y_{t+1} \in \mathcal{D}$ such that

$$\nabla \Phi(y_{t+1}) = \nabla \Phi(x_t) - \eta g_t, \quad \text{where } g_t \in \partial f(x_t),$$

and

$$x_{t+1} = \Pi_{\mathcal{X}}^{\Phi}(y_{t+1}).$$

Theorem. Let $\Phi$ be a mirror map. Assume also that $\Phi$ is $\rho$-strongly convex on $\mathcal{X} \cap \mathcal{D}$ with respect to $\|\cdot\|$, that is

$$\Phi(x) - \Phi(y) \leq \nabla \Phi(x)^{\top} (x - y) - \frac{\rho}{2} \|x - y\|^2.$$

Let $R^2 = \sup_{x \in \mathcal{X} \cap \mathcal{D}} \Phi(x) - \Phi(x_1)$ and let $f$ be convex and $L$-Lipschitz w.r.t. $\|\cdot\|$, then Mirror Descent with $\eta = \frac{R}{L} \sqrt{\frac{2\rho}{t}}$ satisfies

$$f\left( \frac{1}{t} \sum_{s=1}^{t} x_s \right) - \min_{x \in \mathcal{X}} f(x) \leq R L \sqrt{\frac{2}{\rho t}}.$$

Proof: Let $x \in \mathcal{X} \cap \mathcal{D}$. The claimed bound will be obtained by taking a limit $x \to x^*$. Now by convexity of $f$, the definition of Mirror Descent, and the definition of the Bregman divergence one has

Now the key observation is that by writing down the first-order optimality condition for $x_{t+1}$ one obtains

which can be written equivalently as

Thus the Bregman divergence terms will lead to a telescopic sum when summing from $s = 1$ to $t$, and it remains to bound the other term as follows:

We proved

which concludes the proof up to trivial computation.

Mirror Descent as a proximal method

One can rewrite Mirror Descent as follows:

$$x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} \; \eta \langle g_t, x \rangle + D_{\Phi}(x, x_t).$$

Nowadays this last expression is usually taken as the definition of Mirror Descent (see this paper by Beck and Teboulle). It gives a proximal point of view on Mirror Descent: the algorithm is trying to minimize the local linearization of the function while not moving too far away from the previous point (with distances measured via the Bregman divergence of the mirror map).
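For the Euclidean mirror map this proximal formulation can be checked directly: the minimizer of the linearization-plus-divergence objective is exactly the plain gradient step (a tiny sketch with arbitrary data; the random-perturbation comparison is only a numerical sanity check, not a proof):

```python
import numpy as np

rng = np.random.default_rng(6)
d = 4
g, xt = rng.standard_normal(d), rng.random(d)
eta = 0.3

# proximal objective with the Euclidean mirror map:
#   eta * <g, x> + D_Phi(x, xt)  with  D_Phi(x, xt) = 0.5 * ||x - xt||^2
obj = lambda x: eta * g @ x + 0.5 * np.sum((x - xt) ** 2)
x_star = xt - eta * g      # closed-form minimizer: the plain gradient step

# the closed form beats random nearby points, as the proximal view predicts
trials = [obj(x_star + 0.01 * rng.standard_normal(d)) for _ in range(1000)]
ok = obj(x_star) <= min(trials)
print(ok)   # True
```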

Next time we will see some of the ‘standard setups’ for Mirror Descent, that is pairs of (constraint set, mirror map) that ‘go well together’.

Posted in Optimization | 2 Comments