I am extremely happy to release the first draft of my monograph based on the lecture notes published last year on this blog. (Comments on the draft are welcome!) The abstract reads as follows:
This monograph presents the main mathematical ideas in convex optimization. Starting from the fundamental theory of black-box optimization, the material progresses towards recent advances in structural optimization and stochastic optimization. Our presentation of black-box optimization, strongly influenced by the seminal book of Nesterov, includes the analysis of the Ellipsoid Method, as well as (accelerated) gradient descent schemes. We also pay special attention to non-Euclidean settings (relevant algorithms include Frank-Wolfe, Mirror Descent, and Dual Averaging) and discuss their relevance in machine learning. We provide a gentle introduction to structural optimization with FISTA (to optimize a sum of a smooth and a simple non-smooth term), Saddle-Point Mirror Prox (Nemirovski’s alternative to Nesterov’s smoothing), and a concise description of Interior Point Methods. In stochastic optimization we discuss Stochastic Gradient Descent, mini-batches, Random Coordinate Descent, and sublinear algorithms. We also briefly touch upon convex relaxation of combinatorial problems and the use of randomness to round solutions, as well as methods based on random walks.
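As a small illustration of the basic black-box schemes surveyed in the abstract, here is a minimal sketch of plain gradient descent. This is a generic textbook sketch, not code from the monograph; the function names, step size, and test objective are illustrative assumptions.

```python
import numpy as np

def gradient_descent(grad, x0, step=0.1, iters=100):
    """Plain gradient descent: x_{t+1} = x_t - step * grad(x_t).

    A generic black-box scheme: it only queries the gradient oracle.
    The fixed step size assumes a smooth objective (step <= 1/L
    for an L-smooth function f).
    """
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        x = x - step * grad(x)
    return x

# Toy example: minimize f(x) = ||x||^2 / 2, whose gradient is x
# and whose unique minimizer is the origin.
x_star = gradient_descent(lambda x: x, x0=[1.0, -2.0], step=0.5, iters=50)
```

On this strongly convex quadratic the iterates contract geometrically towards the origin, which is the kind of convergence rate the black-box analysis in the monograph quantifies.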
By Lee Zamparo August 7, 2014 - 12:56 pm
Thanks for releasing this, it’s really useful. Do you have a preferred way for people to cite this? I notice it’s on arXiv; will citing that submission do?
By Sebastien Bubeck August 7, 2014 - 2:57 pm
The monograph will eventually be published, but for the moment it is best to cite the arXiv version.
By Erik August 4, 2014 - 11:38 pm
Hi Sebastien,
Excellent work! I’ve been reading this for the past few days and have found it to be quite a helpful resource. One comment:
On page 43, in your discussion of Nesterov’s Accelerated Gradient Descent, you say that, intuitively, \Phi_s becomes a finer and finer approximation of f “from below”, in the sense of inequality (3.17).
I found the wording here a bit confusing (“from below”, in particular), since for some values of x we have \Phi_s(x) > f(x). For those x where \Phi_s(x) > f(x), the inequality (3.17) is bounding how good an approximation \Phi_s is to f “from above”.
Erik
By Sebastien Bubeck August 7, 2014 - 3:14 pm
You are definitely right that \Phi_s(x) is not a lower bound on f; however, I would still argue that it is “essentially” a lower bound. I’m not sure how to make this more precise in words, but I think that equations (3.17), (3.18), and the calculations at the top of page 44 speak for themselves :).
By GMC June 13, 2014 - 2:43 am
Hi, thank you very much for your work on this fascinating subject.
Giving free access to the monograph is a powerful way to spread this knowledge to everyone.
Besides, I understand your definition of the subgradient (1.2) as “the variation of the function f between x and y is upper bounded by a linear function of the distance between x and y”, which seems counter-intuitive given the word “sub”.
Is my interpretation correct?
Thank you
By Sebastien Bubeck June 21, 2014 - 9:42 am
Think of x as being fixed and y as being the variable; then you get a linear lower bound on f(y).
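To spell this out, assuming (1.2) is stated in the form suggested by GMC’s paraphrase (the variation of f upper bounded by a linear function), the rearrangement goes as follows:

```latex
% Definition (1.2): g \in \partial f(x) means that for all y,
%   f(x) - f(y) \le g^\top (x - y).
% With x fixed and y the variable, this rearranges to
\[
  f(x) - f(y) \le g^\top (x - y)
  \quad \Longleftrightarrow \quad
  f(y) \ge f(x) + g^\top (y - x),
\]
% so the affine map y \mapsto f(x) + g^\top (y - x) lies below f
% everywhere: hence the "sub" in subgradient.
```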
By GMC June 26, 2014 - 2:46 am
Thanks!
By Lukasz Lew June 2, 2014 - 8:21 am
You might want to look into the table on page 11. It overflows the margin and may run off the page.
By Abhinav Maurya May 31, 2014 - 1:11 am
Nice! I recently discovered this via arXiv. I have just started reading it, and will follow up with any feedback that might be helpful.
By petrux May 19, 2014 - 5:11 am
Simply amazing! Hope I have time to read and give you some feedback. Thanks a lot!
By Lukasz Lew May 16, 2014 - 1:09 pm
Can you provide a more printer-friendly version?
By Sebastien Bubeck May 18, 2014 - 12:36 pm
What do you mean Lukasz? What is not printer-friendly in the current version?
By Lukasz Lew May 18, 2014 - 3:40 pm
Ah sorry, I did not explain. I mainly meant the wide margins.
Also, a double-column format with narrower margins would make it easier to read, but that is minor.