Today we start the third (and last) part of our class titled: ‘Optimization and Randomness’. An important setting that we will explore is the one of optimization with noisy oracles:
- Noisy -order oracle: given it outputs such that .
- Noisy -order oracle: given it outputs such that .
Somewhat surprisingly several of the first-order algorithms we studied so far are robust to noise and attain the same performance if one feeds them with the answers of a noisy order oracle. This was first observed by Robbins and Monro in the early Fifties, and the resulting set of techniques is often referred to as stochastic approximation methods. For instance let us consider the Stochastic Mirror Descent:
The analysis of this algorithm follows the exact same lines than the standard version. The key observation is that:
and now one can repeat the standard proof since was used in the algorithm. This yields the following result.
Theorem. Let be a mirror map -strongly convex on with respect to , and let . Assume that is convex and that the noisy oracle is such that . Then Stochastic Mirror Descent with satisfies
Of course the exact same reasoning applies to the Mirror Descent for Saddle Points. One can also get the rate for strongly convex functions using the idea described in this lecture. High-probability results can be obtained using Markov’s inequality, and stronger concentration results can be proved under a subgaussian assumption on the tails of the estimates , see Proposition 2.2. here.
Let us now see some application of this noisy framework.
We are interested in the solving the following problem:
where is a known loss function such that is a convex function , and is a random variable whose distribution is unknown. While the distribution of is unknown, we assume that we have access to a sequence of i.i.d. examples . A typical example within this framework is the case where represents a pair of (input, output) and the function is one of the loss function we saw in this lecture, for example the logistic loss with , .
Let be a subgradient of at , and let . We assume that . Note that we cannot perform a gradient descent on as we would need to know the distribution of to compute a subgradient of . On the other hand one has access to unbiased estimates of these subgradients with the i.i.d. examples . Thus one can use the Stochastic Mirror Descent, where at each iteration a new example is used to compute the noisy subgradient, precisely we take . In particular with examples one can only run the algorithm for steps, and one gets to a precision of order . If in addition is strongly convex then one can in fact obtain a precision of order with the step sizes/averaging described in this lecture.
Acceleration by randomization
In the previous example there was no alternative to the use of noisy subgradients as the distribution of was unknown. However in some cases, even if the distribution of is known, it might be beneficial to artificially sample examples from the distribution in order to avoid computing difficult integrals. Consider for example the problem of minimizing an average of convex functions:
We have seen several important examples of this form where might be very large. Note that computing a subgradient of takes time where is the time to compute the subgradient of a single function . Thus a standard gradient descent algorithm will take time to reach an -optimal point. On the other hand one can simply sample (with replacement) an index at every iteration (which takes time ), and perform a step of gradient descent using only the subgradient of . This method will reach an -optimal point (in expectation) in time , which is much cheaper than the requirement for GD!
Note that if is also strongly convex, then GD needs time to reach an -optimal point, and SGD needs only time (with appropriate step sizes/averaging as discussed above). The situation becomes more favorable to GD if in addition is smooth (recall that for strongly convex and smooth functions one has an exponential rate of convergence for GD), and we will come back to this in the next lecture.
Let us consider another example which we studied at the end of the previous lecture:
where (the symmetry assumption is just for sake of notation). It is clear that the rate of convergence of MD-SP for this problem is of order . Furthermore each step requires to compute a subgradient in and , which are of the form and . Thus the time complexity of one step is of order (which is the time complexity of a matrix-vector multiplication). Overall to reach an -optimal points one needs elementary operations. Now the key observation is that the matrix-vector multiplication can easily be randomized, especially since . Let be the column vectors of , then , and if is a random variable such that one has . Using this randomization in a Stochastic MD-SP one obtains the exact same rate of convergence (in expectation) but now each step has time complexity of order , resulting in an overall complexity to reach an -optimal point (in expectation) of . This result is especially amazing since we obtain a sublinear time algorithm (the complexity is while the ‘size’ of the problem is ). One can propose an even better algorithm based on Mirror Prox with a complexity of and we refer the reader to this book chapter by Juditsky and Nemirovski for more details.