In this lecture we are interested in the following problem:
where each is -smooth and strongly-convex. We denote by the condition number of these functions (which is also the condition number of the average).
Using a gradient descent directly on the sum with step-size one obtains a rate of convergence of order for some numerical constant (see this lecture). In particular one can reach an -optimal point in time with this method, where is the time to compute the gradient of a single function . Using Nesterov’s Accelerated Gradient Descent (see this lecture) one can reduce the time complexity to .
Using the SGD method described in the previous lecture (with the step-size/averaging described here) one can reach an -optimal point in time which is much worse than the basic gradient descent if one is looking for a fairly accurate solution. We are interested in designing a randomized algorithm in the spirit of SGD that would leverage the smoothness of the function to attain a linear rate of convergence (i.e., the dependency in would be of the form ).
Stochastic Average Gradient (SAG)
We first describe the SAG method proposed in this paper by Le Roux, Schmidt and Bach (2012). The method is simple and elegant: let , and for let be drawn uniformly at random in and
They show that in the regime where , SAG (with step-size and with obtained as the average of iterates of SGD) has a rate of convergence of order (see Proposition 2 in their paper for a more precise statement). Since each step has a time-complexity of this results in an overall time-complexity of , which is much better than the complexity of the plain gradient descent (or even of Nesterov’s Accelerated Gradient Descent). Unfortunately the proof of this result is quite difficult. We will now describe another strategy with similar performance but with a much simpler analysis.
Short detour through Fenchel’s duality
Before describing the next strategy we first recall a few basic facts about Fenchel duality. First the Fenchel conjugate of a function is defined by:
Note that is always a convex function (possibly taking value ). If is convex, then for any one clearly has . Finally one can show that if is -smooth then is -strongly convex (one can use the second Lemma from this lecture).
Stochastic Dual Coordinate Ascent (SDCA)
We describe now the SDCA method proposed in this paper by Shalev-Shwartz and Zhang. We need to make a few more assumptions on the functions in the sum, and precisely we will now assume that we want to solve the following problem (note that all norms here are Euclidean):
- is convex, -smooth.
- and .
the primal objective. For the dual objective we introduce a dual variable for each data point (as well as a temporary primal variable ):
Next observe that by the KKT conditions one has at the optimum :
and thus denoting
we just proved
We can now describe SDCA: let and for let be drawn uniformly at random in and
Let be the corresponding primal variable. The following result justifies the procedure.
Theorem (Shalev-Shwartz and Zhang 2013) SDCA with satisfies:
The above convergence rate yields an overall time-complexity of for SDCA (where is the complexity of computing a gradient of ), which can be much better than the complexity of the plain gradient descent.
Proof: The key property that we will prove is that the primal/dual gap is controlled by the improvement in the dual objective, precisely:
Let us see how the above inequality yields the result. First we clearly have , and thus the above inequality immediately yields:
which by induction gives (using and ):
and thus using once more (1):
Thus it only remains to prove (1). First observe that the primal/dual gap can be written as:
Let us now evaluate the improvement in the dual objective assuming that coordinate is updated in the step:
Now using that is -strongly convex one has
where in the last line we used that and . Observe now that
which finally gives
and thus taking expectation one obtains
which concludes the proof of (1) and the proof of the theorem.