ORF523: Acceleration by randomization for a sum of smooth and strongly convex functions

In this lecture we are interested in the following problem:

$\min_{x \in \mathbb{R}^n} \frac{1}{m} \sum_{i=1}^m f_i(x) ,$

where each $f_i$ is $\beta$ -smooth and $\alpha$ strongly-convex. We denote by $Q=\frac{\beta}{\alpha}$ the condition number of these functions (which is also the condition number of the average).

Using a gradient descent directly on the sum with step-size $\frac{2}{\alpha + \beta}$ one obtains a rate of convergence of order $\exp( - c t / Q)$ for some numerical constant $c>0$ (see this lecture). In particular one can reach an $\epsilon$ -optimal point in time $O(m Q N \log(1/\epsilon))$ with this method, where $O(N)$ is the time to compute the gradient of a single function $f_i(x)$ . Using Nesterov’s Accelerated Gradient Descent (see this lecture) one can reduce the time complexity to $O(m \sqrt{Q} N \log(1/\epsilon))$ .

Using the SGD method described in the previous lecture (with the step-size/averaging described here) one can reach an $\epsilon$ -optimal point in time $O( \log(m) N / (\alpha \epsilon))$ which is much worse than the basic gradient descent if one is looking for a fairly accurate solution. We are interested in designing a randomized algorithm in the spirit of SGD that would leverage the smoothness of the function to attain a linear rate of convergence (i.e., the dependency in $\epsilon$ would be of the form $\log(1/\epsilon)$ ).

Stochastic Average Gradient (SAG)

We first describe the SAG method proposed in this paper by Le Roux, Schmidt and Bach (2012). The method is simple and elegant: let $y_0 = 0$ , $x_1 \in \mathbb{R}^n$ and for $t \geq 1$ let $i_t$ be drawn uniformly at random in $\{1, \hdots, m\}$ and

$y_t(i) = \left\{ \begin{array}{ccc} \nabla f_i(x_t) & \text{if} & i=i_t \\ y_{t-1}(i) & \text{if} & i\neq i_t \end{array} \right. ; \qquad x_{t+1} = x_t - \frac{\eta}{m} \sum_{i=1}^m y_t(i) .$

They show that in the regime where $m \geq 8 Q$ , SAG (with step-size $\eta = \frac{1}{2 m \alpha}$ and with $x_1$ obtained as the average of $m$ iterates of SGD) has a rate of convergence of order $\exp( - c t / m)$ (see Proposition 2 in their paper for a more precise statement). Since each step has a time-complexity of $O(N \log(m))$ this results in an overall time-complexity of $O(m N \log(m) \log(1/\epsilon))$ , which is much better than the complexity of the plain gradient descent (or even of Nesterov’s Accelerated Gradient Descent). Unfortunately the proof of this result is quite difficult. We will now describe another strategy with similar performance but with a much simpler analysis.

Short detour through Fenchel’s duality

Before describing the next strategy we first recall a few basic facts about Fenchel duality. First the Fenchel conjugate of a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is defined by:

$f^*(u) = \sup_{x \in \mathbb{R}^n} u^{\top} x - f(x) .$

Note that $f^*$ is always a convex function (possibly taking value $+\infty$ ). If $f$ is convex, then for any $u \in \partial f(x)$ one clearly has $f^*(u) + f(x) = u^{\top} x$ . Finally one can show that if $f$ is $\beta$ -smooth then $f^*$ is $\frac{1}{\beta}$ -strongly convex (one can use the second Lemma from this lecture).

Stochastic Dual Coordinate Ascent (SDCA)

We describe now the SDCA method proposed in this paper by Shalev-Shwartz and Zhang. We need to make a few more assumptions on the functions in the sum, and precisely we will now assume that we want to solve the following problem (note that all norms here are Euclidean):

$\min_{w \in \mathbb{R}^n} \frac{1}{m} \sum_{i=1}^m f_i(w^{\top} x_i) + \frac{\alpha}{2} \|w\|^2 ,$

where

$f_i : \mathbb{R} \rightarrow \mathbb{R}$ is convex, $\beta$ -smooth.
$f_i(s) \geq 0, \forall s \in \mathbb{R}$ and $f_i(0) \leq 1$ .
$\|x_i\| \leq 1$ .

Denote

$P(w) = \frac{1}{m} \sum_{i=1}^m f_i(w^{\top} x_i) + \frac{\alpha}{2} \|w\|^2$

the primal objective. For the dual objective we introduce a dual variable $\nu(i)$ for each data point $x_i$ (as well as a temporary primal variable $z(i)$ ):

$\begin{eqnarray*} \min_{w \in \mathbb{R}^n} P(w) & = & \min_{z \in \mathbb{R}^m, w} \max_{\nu \in \mathbb{R}^m} \frac{1}{m} \sum_{i=1}^m \bigg( f_i(z(i)) + \nu(i) (z(i) - w^{\top} x_i) \bigg) + \frac{\alpha}{2} \|w\|^2 \\ & = & \max_{\nu \in \mathbb{R}^m} \min_{z \in \mathbb{R}^m, w} \frac{1}{m} \sum_{i=1}^m \bigg( f_i(z(i)) + \nu(i) (z(i) - w^{\top} x_i) \bigg) + \frac{\alpha}{2} \|w\|^2 \\ & = & \max_{\nu \in \mathbb{R}^m} \min_{w \in \mathbb{R}^n} \frac{1}{m} \sum_{i=1}^m \bigg( - f_i^*(- \nu(i)) - \nu(i) w^{\top} x_i \bigg) + \frac{\alpha}{2} \|w\|^2 . \end{eqnarray*}$

Next observe that by the KKT conditions one has at the optimum $(w^*, \nu^*)$ :

$w^* = \frac{1}{\alpha m} \sum_{i=1}^m \nu^*(i) x_i ,$

and thus denoting

$D(\nu) = \frac{1}{m} \sum_{i=1}^m - f_i^*(- \nu(i)) - \frac{\alpha}{2} \left\|\frac{1}{\alpha m} \sum_{i=1}^m \nu(i) x_i \right\|^2 ,$

we just proved

$\min_{w \in \mathbb{R}^n} P(w) = \max_{\nu \in \mathbb{R}^m} D(\nu) .$

We can now describe SDCA: let $\nu_0 = 0$ and for $t \geq 0$ let $i_t$ be drawn uniformly at random in $\{1, \hdots, m\}$ and

$\nu_{t+1}(i) = \left\{ \begin{array}{ccc} (1-\eta) \nu_t(i) - \eta f_i'(w_t^{\top} x_i) & \text{if} & i=i_t , \\ \nu_t(i) & \text{if} & i \neq i_t . \end{array} \right.$

Let $w_t = \frac{1}{\alpha m} \sum_{i=1}^m \nu_t(i) x_i$ be the corresponding primal variable. The following result justifies the procedure.

Theorem (Shalev-Shwartz and Zhang 2013) SDCA with $\eta = \frac{m}{Q + m}$ satisfies:

$\mathbb{E} P(w_t) - \min_{w \in \mathbb{R}^n} P(w) \leq (Q+m) \exp\left( - \frac{t}{Q+m} \right) .$

The above convergence rate yields an overall time-complexity of $O((Q+m) N \log(m) \log((Q+m)/\epsilon))$ for SDCA (where $O(N)$ is the complexity of computing a gradient of $w \mapsto f_i(w^{\top} x_i)$ ), which can be much better than the $O(m Q N \log(1/\epsilon))$ complexity of the plain gradient descent.

Proof: The key property that we will prove is that the primal/dual gap is controlled by the improvement in the dual objective, precisely:

(1) $\begin{equation*} \mathbb{E} \left( P(w_t) - D(\nu_t) \right) \leq (Q+m) \mathbb{E} \left( D(\nu_{t+1}) - D(\nu_t) \right) . \end{equation*}$

Let us see how the above inequality yields the result. First we clearly have $D(\nu) \leq P(w)$ , and thus the above inequality immediately yields:

$\mathbb{E} \left( D(\nu^*) - D(\nu_{t+1}) \right) \leq \left(1 - \frac{1}{Q+m}\right) \mathbb{E} \left( D(\nu^*) - D(\nu_{t}) \right) ,$

which by induction gives (using $D(0) \geq 0$ and $D(\nu^*) \leq 1$ ):

$\mathbb{E} \left( D(\nu^*) - D(\nu_{t+1}) \right) \leq \left(1 - \frac{1}{Q+m}\right)^{t+1} \leq \exp\left( - \frac{t+1}{Q+m} \right) ,$

and thus using once more (1):

$\mathbb{E} P(w_t) - P(w^*) \leq (Q+m) \exp\left( - \frac{t}{Q+m} \right) .$

Thus it only remains to prove (1). First observe that the primal/dual gap can be written as:

$P(w_t) - D(\nu_t) = \frac{1}{m} \sum_{i=1}^m \bigg( f_i(w_t^{\top} x_i) + f_i^*(- \nu_t(i)) \bigg) + \alpha \|w_t\|^2 .$

Let us now evaluate the improvement in the dual objective assuming that coordinate $i$ is updated in the $(t+1)^{th}$ step:

$\begin{align*} & m \left( D(\nu_{t+1}) - D(\nu_t) \right) \\ & = - f_i^*(-\nu_{t+1}(i)) - \frac{\alpha m}{2} \|w_{t+1}\|^2 - \left(- f_i^*(-\nu_t(i)) - \frac{\alpha m}{2} \|w_t\|^2\right) \\ & = - f_i^*(-(1-\eta) \nu_t(i) + \eta f_i'(w_t^{\top} x_i)) - \frac{\alpha m}{2} \left\|w_t - \frac{\eta}{\alpha m} (f_i'(w_t^{\top} x_i) + \nu_t(i)) x_i \right\|^2 \\ & \qquad - \left(- f_i^*(-\nu_t(i)) - \frac{\alpha m}{2} \|w_t\|^2\right) . \end{align*}$

Now using that $f_i^*$ is $\frac{1}{\beta}$ -strongly convex one has

$\begin{eqnarray*} f_i^*(-(1-\eta) \nu_t(i) + \eta f_i'(w_t^{\top} x_i)) & \leq& (1 -\eta) f_i^*(- \nu_t(i)) + \eta f_i^*(f_i'(w_t^{\top} x_i)) \\ & & \;- \frac{1}{2 \beta} \eta (1-\eta) (f_i'(w_t^{\top} x_i) + \nu_t(i))^2 , \end{eqnarray*}$

which yields

$\begin{align*} & m \left( D(\nu_{t+1}) - D(\nu_t) \right) \\ & \geq \eta f_i^*(- \nu_t(i)) - \eta f_i^*(f_i'(w_t^{\top} x_i)) + \frac{\eta (1-\eta)}{2 \beta} (f_i'(w_t^{\top} x_i) - \nu_t(i))^2 \\ & + \eta (f_i'(w_t^{\top} x_i) + \nu_t(i)) w_t^{\top} x_i - \frac{\eta^2}{2\alpha m} (f_i'(w_t^{\top} x_i) + \nu_t(i))^2 \|x_i\|^2 \\ & \geq \eta f_i^*(- \nu_t(i)) - \eta f_i^*(f_i'(w_t^{\top} x_i)) + \eta (f_i'(w_t^{\top} x_i) + \nu_t(i)) w_t^{\top} x_i , \end{align*}$

where in the last line we used that $\eta = \frac{m}{Q+m}$ and $\|x_i\| \leq 1$ . Observe now that

$f_i^*(f_i'(w_t^{\top} x_i)) = (w_t^{\top} x_i) f_i'(w_t^{\top} x_i) - f_i(w_t^{\top} x_i) ,$

which finally gives

$m \left( D(\nu_{t+1}) - D(\nu_t) \right) \geq \eta \bigg(f_i^*(- \nu_t(i)) + f_i(w_t^{\top} x_i) + \nu_t(i) w_t^{\top} x_i \bigg),$

and thus taking expectation one obtains

$\begin{align*} & (Q+m) \mathbb{E}_{i_t} \left( D(\nu_{t+1}) - D(\nu_t) \right) \\ & \geq \frac{1}{m} \sum_{i=1}^m \bigg( f_i(w_t^{\top} x_i) + f_i^*(- \nu_t(i)) + \nu_t(i) w_t^{\top} x_i \bigg)\\ & = P(w_t) - D(\nu_t), \end{align*}$

which concludes the proof of (1) and the proof of the theorem.

$\Box$

2 Responses to "ORF523: Acceleration by randomization for a sum of smooth and strongly convex functions"

By Alexander Gasnikov (MIPT) January 9, 2014 - 2:12 pm

I guess you will be interested in recent work
Konecny J., Richtarik P. Semi-stochastic gradient descent methods // e-print, 2013.
http://www.maths.ed.ac.uk/%7Eprichtar/papers/s2gd.pdf
In work one can find a generalization of mentioned above results
Another interesting work is
Rakhlin A., Shamir O., Sridharan K. Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization // e-print, 2012. http://icml.cc/2012/papers/261.pdf
By sm August 9, 2013 - 11:46 am

is this SDCA identical to the one in the paper? The original one seems having strict maximization w.r.t. a chosen coordinate at each update.

Aug	APR	Nov
	15
2014	2021	2022

ORF523: Acceleration by randomization for a sum of smooth and strongly convex functions

2 Responses to "ORF523: Acceleration by randomization for a sum of smooth and strongly convex functions"

By Alexander Gasnikov (MIPT) January 9, 2014 - 2:12 pm

By sm August 9, 2013 - 11:46 am

Leave a reply

Archives

Categories

Recent Posts

Subscribe to Blog via Email

Meta

Blogroll