In this lecture we are interested in the following problem:

$$\min_{x \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n f_i(x),$$

where each $f_i$ is $\beta$-smooth and $\alpha$-strongly convex. We denote by $\kappa = \beta / \alpha$ the condition number of these functions (which is also the condition number of the average).
Using a gradient descent directly on the sum with step-size $1/\beta$ one obtains a rate of convergence of order $\exp(- c\, t / \kappa)$ for some numerical constant $c > 0$ (see this lecture). In particular one can reach an $\epsilon$-optimal point in time $O(n \kappa \log(1/\epsilon)\, \tau)$ with this method, where $\tau$ is the time to compute the gradient of a single function $f_i$. Using Nesterov's Accelerated Gradient Descent (see this lecture) one can reduce the time complexity to $O(n \sqrt{\kappa} \log(1/\epsilon)\, \tau)$.
Using the SGD method described in the previous lecture (with the step-size/averaging described here) one can reach an $\epsilon$-optimal point in time $O(\tau / (\alpha \epsilon))$, which is much worse than the basic gradient descent if one is looking for a fairly accurate solution. We are interested in designing a randomized algorithm in the spirit of SGD that would leverage the smoothness of the functions to attain a linear rate of convergence (i.e., the dependency in $\epsilon$ would be of the form $\log(1/\epsilon)$).
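To fix ideas, here is a minimal Python sketch of this setting on a toy regularized least-squares instance (the data, the constants, and names such as `grad_fi` are illustrative choices, not from the lecture), running plain gradient descent with step-size $1/\beta$; each iteration evaluates all $n$ gradients, i.e., costs $n \tau$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
lam = 0.1  # ridge term, makes each f_i strongly convex

def grad_fi(x, i):
    """Gradient of f_i(x) = 0.5*(<a_i, x> - b_i)^2 + 0.5*lam*||x||^2."""
    return A[i] * (A[i] @ x - b[i]) + lam * x

# smoothness / strong-convexity constants for this toy instance
beta = max(float(np.dot(A[i], A[i])) for i in range(n)) + lam
alpha = lam
kappa = beta / alpha  # condition number of each f_i

# plain gradient descent with step-size 1/beta:
# each iteration evaluates all n gradients, i.e. costs n * tau
x = np.zeros(d)
for t in range(200):
    g = sum(grad_fi(x, i) for i in range(n)) / n
    x = x - g / beta
```

The randomized methods below will be sketched on the same kind of toy instance.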
Stochastic Average Gradient (SAG)
We first describe the SAG method proposed in this paper by Le Roux, Schmidt and Bach (2012). The method is simple and elegant: let $x_1 \in \mathbb{R}^d$, $y_i^0 = 0$ for all $i \in \{1, \dots, n\}$, and for $t \geq 1$ let $i_t$ be drawn uniformly at random in $\{1, \dots, n\}$ and

$$y_i^t = \begin{cases} \nabla f_i(x_t) & \text{if } i = i_t, \\ y_i^{t-1} & \text{otherwise,} \end{cases} \qquad x_{t+1} = x_t - \frac{\eta}{n} \sum_{i=1}^n y_i^t.$$
They show that in the regime where $n \geq 8 \kappa$, SAG (with step-size $\eta = \frac{1}{2 n \alpha}$ and with $x_1$ obtained as the average of $n$ iterates of SGD) has a rate of convergence of order $\exp(-t/(8n))$ (see Proposition 2 in their paper for a more precise statement). Since each step has a time-complexity of $\tau$, this results in an overall time-complexity of $O(n \log(1/\epsilon)\, \tau)$, which is much better than the complexity of the plain gradient descent (or even of Nesterov's Accelerated Gradient Descent). Unfortunately the proof of this result is quite difficult. We will now describe another strategy with similar performance but with a much simpler analysis.
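Before moving on, here is a minimal Python sketch of the SAG update above, on the same kind of toy regularized least-squares instance as before (the data and all names are illustrative, and we simply initialize $x_1 = 0$ rather than with the SGD warm-start required by Proposition 2):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
lam = 0.1

def grad_fi(x, i):
    """Gradient of f_i(x) = 0.5*(<a_i, x> - b_i)^2 + 0.5*lam*||x||^2."""
    return A[i] * (A[i] @ x - b[i]) + lam * x

alpha = lam
eta = 1.0 / (2 * n * alpha)  # the step-size from Proposition 2

x = np.zeros(d)              # the paper warm-starts x_1 with n SGD iterates
y = np.zeros((n, d))         # table of stored gradients, y_i^0 = 0
y_avg = np.zeros(d)          # (1/n) * sum_i y_i, maintained incrementally

for t in range(2000):
    i = rng.integers(n)
    g = grad_fi(x, i)
    y_avg += (g - y[i]) / n  # O(d) update of the average
    y[i] = g
    x = x - eta * y_avg
```

Note that the average of the stored gradients is maintained incrementally, so each step costs one gradient evaluation plus an $O(d)$ overhead; the price to pay is the $O(nd)$ memory for the table of gradients.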
Short detour through Fenchel’s duality
Before describing the next strategy we first recall a few basic facts about Fenchel duality. First, the Fenchel conjugate of a function $f : \mathbb{R}^d \rightarrow \mathbb{R}$ is defined by:

$$f^*(y) = \sup_{x \in \mathbb{R}^d} \left( \langle x, y \rangle - f(x) \right).$$

Note that $f^*$ is always a convex function (possibly taking the value $+\infty$). If $f$ is convex and differentiable, then for any $x$ one clearly has the Fenchel-Young equality $f^*(\nabla f(x)) = \langle \nabla f(x), x \rangle - f(x)$ (by convexity the supremum in the definition of $f^*(\nabla f(x))$ is attained at $x$). Finally one can show that if $f$ is $\beta$-smooth then $f^*$ is $\frac{1}{\beta}$-strongly convex (one can use the second Lemma from this lecture).
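As a quick sanity check of this last fact, consider the one-dimensional example $f(x) = \frac{\beta}{2} x^2$ (a worked example added here for illustration), which is $\beta$-smooth; the supremum in the conjugate is attained at $x = y/\beta$, so

$$f^*(y) = \sup_{x \in \mathbb{R}} \left( x y - \frac{\beta}{2} x^2 \right) = \frac{y^2}{2 \beta},$$

which is indeed $\frac{1}{\beta}$-strongly convex. This property of the conjugates is exactly what drives the analysis of SDCA below.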
Stochastic Dual Coordinate Ascent (SDCA)
We describe now the SDCA method proposed in this paper by Shalev-Shwartz and Zhang. We need to make a few more assumptions on the functions in the sum, and precisely we will now assume that we want to solve the following problem (note that all norms here are Euclidean):

$$\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \phi_i(\langle w, x_i \rangle) + \frac{\lambda}{2} \|w\|^2,$$

where each $\phi_i : \mathbb{R} \rightarrow \mathbb{R}_+$ is convex and $\beta$-smooth with $\phi_i(0) \leq 1$, and each $x_i \in \mathbb{R}^d$ satisfies $\|x_i\| \leq 1$.
Denote $P(w) = \frac{1}{n} \sum_{i=1}^n \phi_i(\langle w, x_i \rangle) + \frac{\lambda}{2} \|w\|^2$ the primal objective. For the dual objective we introduce a dual variable $\alpha_i \in \mathbb{R}$ for each data point $x_i$ (as well as a temporary primal variable $w(\alpha)$):

$$D(\alpha) = \frac{1}{n} \sum_{i=1}^n - \phi_i^*(-\alpha_i) - \frac{\lambda}{2} \|w(\alpha)\|^2, \qquad \text{where } w(\alpha) = \frac{1}{\lambda n} \sum_{i=1}^n \alpha_i x_i.$$
Next observe that by the KKT conditions one has at the optimum $w^*$ of the primal:

$$w^* = \frac{1}{\lambda n} \sum_{i=1}^n \alpha_i^* x_i, \qquad \text{where } \alpha_i^* = - \phi_i'(\langle w^*, x_i \rangle),$$

and thus, denoting $\alpha^* = (\alpha_1^*, \dots, \alpha_n^*)$ and using the Fenchel-Young equality recalled above, we just proved

$$D(\alpha^*) = P(w^*).$$
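As a concrete illustration (this example is not in the original lecture), for the squared loss $\phi_i(z) = \frac{1}{2}(z - b_i)^2$ one computes $\phi_i^*(u) = \frac{u^2}{2} + b_i u$, so the dual objective takes the explicit form

$$D(\alpha) = \frac{1}{n} \sum_{i=1}^n \left( \alpha_i b_i - \frac{\alpha_i^2}{2} \right) - \frac{1}{2 \lambda n^2} \left\| \sum_{i=1}^n \alpha_i x_i \right\|^2,$$

a concave quadratic in $\alpha$ that one could in fact maximize exactly over a single coordinate, as in the original paper.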
We can now describe SDCA: let $\alpha^0 = 0$, and for $t \geq 1$ let $i_t$ be drawn uniformly at random in $\{1, \dots, n\}$ and

$$\alpha_i^t = \begin{cases} (1 - s)\, \alpha_i^{t-1} - s\, \phi_i'(\langle w(\alpha^{t-1}), x_i \rangle) & \text{if } i = i_t, \\ \alpha_i^{t-1} & \text{otherwise,} \end{cases}$$

where $s \in [0, 1]$ is a step-size parameter specified below.
Let $w^t = w(\alpha^t)$ be the corresponding primal variable. The following result justifies the procedure.
Theorem (Shalev-Shwartz and Zhang 2013). SDCA with $s = \frac{\lambda n}{\beta + \lambda n}$ satisfies:

$$\mathbb{E}\left[ P(w^T) - P(w^*) \right] \leq \left( n + \frac{\beta}{\lambda} \right) \exp\left( - \frac{T}{n + \beta/\lambda} \right).$$
The above convergence rate yields an overall time-complexity of $O\left( \left( n + \frac{\beta}{\lambda} \right) \log\left( \frac{n + \beta/\lambda}{\epsilon} \right) \tau \right)$ for SDCA (where $\tau$ is the complexity of computing a gradient of $\phi_i$), which can be much better than the $O\left( n \frac{\beta}{\lambda} \log(1/\epsilon)\, \tau \right)$ complexity of the plain gradient descent.
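For concreteness, here is a minimal Python sketch of the update described above on a toy instance satisfying our assumptions ($\phi_i(z) = \frac{1}{2}(z - b_i)^2$ with $|b_i| \leq 1$ is nonnegative, $1$-smooth, and has $\phi_i(0) \leq 1$; the data and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10
X = rng.standard_normal((n, d))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)  # enforce ||x_i|| <= 1
b = rng.uniform(-1.0, 1.0, n)
lam = 0.1

beta = 1.0  # phi_i(z) = 0.5 * (z - b_i)^2 is 1-smooth

def phi_prime(z, i):
    """Derivative of phi_i at z."""
    return z - b[i]

s = lam * n / (beta + lam * n)  # the step-size from the theorem
a = np.zeros(n)                 # dual variables, alpha^0 = 0
w = np.zeros(d)                 # w^0 = w(alpha^0) = 0

for t in range(2000):
    i = rng.integers(n)
    u = -phi_prime(X[i] @ w, i)       # u_i = -phi_i'(<w^{t-1}, x_i>)
    delta = s * (u - a[i])            # coordinate move: alpha_i^t - alpha_i^{t-1}
    a[i] += delta
    w += (delta / (lam * n)) * X[i]   # maintain w = w(alpha) in O(d) per step
```

In practice one can monitor the duality gap $P(w^t) - D(\alpha^t)$, which by the proof below upper-bounds the primal suboptimality and thus provides a clean stopping criterion.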
Proof: The key property that we will prove is that the primal/dual gap is controlled by the improvement in the dual objective, precisely:

$$P(w^{t-1}) - D(\alpha^{t-1}) \leq \frac{n}{s}\, \mathbb{E}\left[ D(\alpha^t) - D(\alpha^{t-1}) \,\middle|\, \alpha^{t-1} \right]. \qquad (1)$$
Let us see how the above inequality yields the result. First we clearly have $P(w^{t-1}) - D(\alpha^{t-1}) \geq D(\alpha^*) - D(\alpha^{t-1})$ (since $P(w^{t-1}) \geq P(w^*) = D(\alpha^*)$), and thus the above inequality immediately yields:

$$\mathbb{E}\left[ D(\alpha^*) - D(\alpha^t) \right] \leq \left( 1 - \frac{s}{n} \right) \mathbb{E}\left[ D(\alpha^*) - D(\alpha^{t-1}) \right],$$

which by induction gives (using $D(\alpha^*) - D(\alpha^0) \leq 1$, which holds since $D(\alpha^*) = P(w^*) \leq P(0) \leq 1$ and $D(0) \geq 0$ under our assumptions on the $\phi_i$, together with $1 - x \leq \exp(-x)$):

$$\mathbb{E}\left[ D(\alpha^*) - D(\alpha^T) \right] \leq \exp\left( - \frac{s T}{n} \right),$$

and thus using once more (1) (together with $D(\alpha^{T+1}) \leq D(\alpha^*)$):

$$\mathbb{E}\left[ P(w^T) - P(w^*) \right] \leq \mathbb{E}\left[ P(w^T) - D(\alpha^T) \right] \leq \frac{n}{s}\, \mathbb{E}\left[ D(\alpha^{T+1}) - D(\alpha^T) \right] \leq \frac{n}{s} \exp\left( - \frac{s T}{n} \right).$$
Thus it only remains to prove (1). First observe that, using $\frac{1}{n} \sum_{i=1}^n \alpha_i \langle w(\alpha), x_i \rangle = \lambda \|w(\alpha)\|^2$, the primal/dual gap can be written as:

$$P(w(\alpha)) - D(\alpha) = \frac{1}{n} \sum_{i=1}^n \left( \phi_i(\langle w(\alpha), x_i \rangle) + \phi_i^*(-\alpha_i) + \alpha_i \langle w(\alpha), x_i \rangle \right).$$

(Note that each summand is nonnegative by the Fenchel-Young inequality, which proves in particular the weak duality $D(\alpha) \leq P(w(\alpha))$.)
Let us now evaluate the improvement in the dual objective assuming that coordinate $i$ is updated at the $t$-th step. Writing $\alpha_i^t = \alpha_i^{t-1} + \Delta_i$, so that $w^t = w^{t-1} + \frac{\Delta_i}{\lambda n} x_i$, one has:

$$n \left( D(\alpha^t) - D(\alpha^{t-1}) \right) = \phi_i^*(-\alpha_i^{t-1}) - \phi_i^*(-\alpha_i^t) - \Delta_i \langle w^{t-1}, x_i \rangle - \frac{\Delta_i^2 \|x_i\|^2}{2 \lambda n}.$$
Now using that $\phi_i^*$ is $\frac{1}{\beta}$-strongly convex, and that $-\alpha_i^t = (1-s)(-\alpha_i^{t-1}) + s(-u_i)$ where $u_i = -\phi_i'(\langle w^{t-1}, x_i \rangle)$, one has

$$\phi_i^*(-\alpha_i^t) \leq (1-s)\, \phi_i^*(-\alpha_i^{t-1}) + s\, \phi_i^*(-u_i) - \frac{s(1-s)}{2 \beta} (u_i - \alpha_i^{t-1})^2,$$

which yields (recall that $\Delta_i = s (u_i - \alpha_i^{t-1})$)

$$n \left( D(\alpha^t) - D(\alpha^{t-1}) \right) \geq s \left( \phi_i^*(-\alpha_i^{t-1}) - \phi_i^*(-u_i) - (u_i - \alpha_i^{t-1}) \langle w^{t-1}, x_i \rangle \right) + \left( \frac{s(1-s)}{2\beta} - \frac{s^2 \|x_i\|^2}{2 \lambda n} \right) (u_i - \alpha_i^{t-1})^2$$
$$\geq s \left( \phi_i^*(-\alpha_i^{t-1}) - \phi_i^*(-u_i) - (u_i - \alpha_i^{t-1}) \langle w^{t-1}, x_i \rangle \right),$$
where in the last line we used that $s = \frac{\lambda n}{\beta + \lambda n}$ (so that $\frac{s(1-s)}{2\beta} = \frac{s^2}{2 \lambda n}$) and $\|x_i\| \leq 1$. Observe now that, by the Fenchel-Young equality recalled above (applied to $\phi_i$ at the point $\langle w^{t-1}, x_i \rangle$, where the derivative of $\phi_i$ equals $-u_i$),

$$\phi_i^*(-u_i) = - u_i \langle w^{t-1}, x_i \rangle - \phi_i(\langle w^{t-1}, x_i \rangle),$$
which finally gives

$$D(\alpha^t) - D(\alpha^{t-1}) \geq \frac{s}{n} \left( \phi_i(\langle w^{t-1}, x_i \rangle) + \phi_i^*(-\alpha_i^{t-1}) + \alpha_i^{t-1} \langle w^{t-1}, x_i \rangle \right),$$
and thus taking expectation with respect to the random draw of $i_t$ one obtains

$$\mathbb{E}\left[ D(\alpha^t) - D(\alpha^{t-1}) \,\middle|\, \alpha^{t-1} \right] \geq \frac{s}{n} \cdot \frac{1}{n} \sum_{i=1}^n \left( \phi_i(\langle w^{t-1}, x_i \rangle) + \phi_i^*(-\alpha_i^{t-1}) + \alpha_i^{t-1} \langle w^{t-1}, x_i \rangle \right) = \frac{s}{n} \left( P(w^{t-1}) - D(\alpha^{t-1}) \right),$$

which concludes the proof of (1) and the proof of the theorem.
By Alexander Gasnikov (MIPT) January 9, 2014 - 2:12 pm
I guess you will be interested in the recent work
Konecny J., Richtarik P. Semi-stochastic gradient descent methods // e-print, 2013.
http://www.maths.ed.ac.uk/%7Eprichtar/papers/s2gd.pdf
In this work one can find a generalization of the results mentioned above.
Another interesting work is
Rakhlin A., Shamir O., Sridharan K. Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization // e-print, 2012. http://icml.cc/2012/papers/261.pdf
By sm August 9, 2013 - 11:46 am
Is this SDCA identical to the one in the paper? The original one seems to perform exact maximization w.r.t. the chosen coordinate at each update.