The Wayback Machine - https://web.archive.org/web/20210415073609/https://blogs.princeton.edu/imabandit/2013/04/30/orf523-acceleration-by-randomization-for-a-sum-of-smooth-and-strongly-functions/

ORF523: Acceleration by randomization for a sum of smooth and strongly convex functions

In this lecture we are interested in the following problem:

    \[\min_{x \in \mathbb{R}^n} \frac{1}{m} \sum_{i=1}^m f_i(x) ,\]

where each f_i is \beta-smooth and \alpha strongly-convex. We denote by Q=\frac{\beta}{\alpha} the condition number of these functions (which is also the condition number of the average).

Using a gradient descent directly on the sum with step-size \frac{2}{\alpha + \beta} one obtains a rate of convergence of order \exp( - c t / Q) for some numerical constant c>0 (see this lecture). In particular one can reach an \epsilon-optimal point in time O(m Q N \log(1/\epsilon)) with this method, where O(N) is the time to compute the gradient of a single function f_i(x). Using Nesterov’s Accelerated Gradient Descent (see this lecture) one can reduce the time complexity to O(m \sqrt{Q} N \log(1/\epsilon)).

Using the SGD method described in the previous lecture (with the step-size/averaging described here) one can reach an \epsilon-optimal point in time O( \log(m) N / (\alpha \epsilon)) which is much worse than the basic gradient descent if one is looking for a fairly accurate solution. We are interested in designing a randomized algorithm in the spirit of SGD that would leverage the smoothness of the function to attain a linear rate of convergence (i.e., the dependency in \epsilon would be of the form \log(1/\epsilon)).

 

Stochastic Average Gradient (SAG)

We first describe the SAG method proposed in this paper by Le Roux, Schmidt and Bach (2012). The method is simple and elegant: let y_0 = 0, x_1 \in \mathbb{R}^n and for t \geq 1 let i_t be drawn uniformly at random in \{1, \hdots, m\} and

    \[y_t(i) = \left\{ \begin{array}{ccc} \nabla f_i(x_t) & \text{if} & i=i_t \\ y_{t-1}(i) & \text{if} & i\neq i_t \end{array} \right. ; \qquad x_{t+1} = x_t - \frac{\eta}{m} \sum_{i=1}^m y_t(i) .\]

They show that in the regime where m \geq 8 Q, SAG (with step-size \eta = \frac{1}{2 m \alpha} and with x_1 obtained as the average of m iterates of SGD) has a rate of convergence of order \exp( - c t / m) (see Proposition 2 in their paper for a more precise statement). Since each step has a time-complexity of O(N \log(m)) this results in an overall time-complexity of O(m N \log(m) \log(1/\epsilon)), which is much better than the complexity of the plain gradient descent (or even of Nesterov’s Accelerated Gradient Descent). Unfortunately the proof of this result is quite difficult. We will now describe another strategy with similar performance but with a much simpler analysis.

 

Short detour through Fenchel’s duality

Before describing the next strategy we first recall a few basic facts about Fenchel duality. First the Fenchel conjugate of a function f: \mathbb{R}^n \rightarrow \mathbb{R} is defined by:

    \[f^*(u) = \sup_{x \in \mathbb{R}^n} u^{\top} x - f(x) .\]

Note that f^* is always a convex function (possibly taking value +\infty). If f is convex, then for any u \in \partial f(x) one clearly has f^*(u) + f(x) = u^{\top} x. Finally one can show that if f is \beta-smooth then f^* is \frac{1}{\beta}-strongly convex (one can use the second Lemma from this lecture).

 

Stochastic Dual Coordinate Ascent (SDCA)

We describe now the SDCA method proposed in this paper by Shalev-Shwartz and Zhang. We need to make a few more assumptions on the functions in the sum, and precisely we will now assume that we want to solve the following problem (note that all norms here are Euclidean):

    \[\min_{w \in \mathbb{R}^n} \frac{1}{m} \sum_{i=1}^m f_i(w^{\top} x_i) + \frac{\alpha}{2} \|w\|^2 ,\]

where

  • f_i : \mathbb{R} \rightarrow \mathbb{R} is convex, \beta-smooth.
  • f_i(s) \geq 0, \forall s \in \mathbb{R} and f_i(0) \leq 1.
  • \|x_i\| \leq 1.

Denote

    \[P(w) = \frac{1}{m} \sum_{i=1}^m f_i(w^{\top} x_i) + \frac{\alpha}{2} \|w\|^2\]

the primal objective. For the dual objective we introduce a dual variable \nu(i) for each data point x_i (as well as a temporary primal variable z(i)):

    \begin{eqnarray*} \min_{w \in \mathbb{R}^n} P(w) & = & \min_{z \in \mathbb{R}^m, w} \max_{\nu \in \mathbb{R}^m} \frac{1}{m} \sum_{i=1}^m \bigg( f_i(z(i)) + \nu(i) (z(i) - w^{\top} x_i) \bigg) + \frac{\alpha}{2} \|w\|^2 \\ & = & \max_{\nu \in \mathbb{R}^m} \min_{z \in \mathbb{R}^m, w} \frac{1}{m} \sum_{i=1}^m \bigg( f_i(z(i)) + \nu(i) (z(i) - w^{\top} x_i) \bigg) + \frac{\alpha}{2} \|w\|^2 \\ & = & \max_{\nu \in \mathbb{R}^m} \min_{w \in \mathbb{R}^n} \frac{1}{m} \sum_{i=1}^m \bigg( - f_i^*(- \nu(i)) - \nu(i) w^{\top} x_i \bigg) + \frac{\alpha}{2} \|w\|^2 . \end{eqnarray*}

Next observe that by the KKT conditions one has at the optimum (w^*, \nu^*):

    \[w^* = \frac{1}{\alpha m} \sum_{i=1}^m \nu^*(i) x_i ,\]

and thus denoting

    \[D(\nu) = \frac{1}{m} \sum_{i=1}^m - f_i^*(- \nu(i)) - \frac{\alpha}{2} \left\|\frac{1}{\alpha m} \sum_{i=1}^m \nu(i) x_i \right\|^2 ,\]

we just proved

    \[\min_{w \in \mathbb{R}^n} P(w) = \max_{\nu \in \mathbb{R}^m} D(\nu) .\]

 

We can now describe SDCA: let \nu_0 = 0 and for t \geq 0 let i_t be drawn uniformly at random in \{1, \hdots, m\} and

    \[\nu_{t+1}(i) = \left\{ \begin{array}{ccc} (1-\eta) \nu_t(i) - \eta f_i'(w_t^{\top} x_i) & \text{if} & i=i_t , \\ \nu_t(i) & \text{if} & i \neq i_t . \end{array} \right.\]

Let w_t = \frac{1}{\alpha m} \sum_{i=1}^m \nu_t(i) x_i be the corresponding primal variable. The following result justifies the procedure.

Theorem (Shalev-Shwartz and Zhang 2013) SDCA with \eta = \frac{m}{Q + m} satisfies:

    \[\mathbb{E} P(w_t) - \min_{w \in \mathbb{R}^n} P(w) \leq (Q+m) \exp\left( - \frac{t}{Q+m} \right) .\]

The above convergence rate yields an overall time-complexity of O((Q+m) N \log(m) \log((Q+m)/\epsilon)) for SDCA (where O(N) is the complexity of computing a gradient of w \mapsto f_i(w^{\top} x_i)), which can be much better than the O(m Q N \log(1/\epsilon)) complexity of the plain gradient descent.

Proof: The key property that we will prove is that the primal/dual gap is controlled by the improvement in the dual objective, precisely:

(1)   \begin{equation*} \mathbb{E} \left( P(w_t) - D(\nu_t) \right) \leq (Q+m) \mathbb{E} \left( D(\nu_{t+1}) - D(\nu_t) \right) . \end{equation*}

Let us see how the above inequality yields the result. First we clearly have D(\nu) \leq P(w), and thus the above inequality immediately yields:

    \[\mathbb{E} \left( D(\nu^*) - D(\nu_{t+1}) \right) \leq \left(1 - \frac{1}{Q+m}\right) \mathbb{E} \left( D(\nu^*) - D(\nu_{t}) \right) ,\]

which by induction gives (using D(0) \geq 0 and D(\nu^*) \leq 1):

    \[\mathbb{E} \left( D(\nu^*) - D(\nu_{t+1}) \right) \leq \left(1 - \frac{1}{Q+m}\right)^{t+1} \leq \exp\left( - \frac{t+1}{Q+m} \right) ,\]

and thus using once more (1):

    \[\mathbb{E} P(w_t) - P(w^*) \leq (Q+m) \exp\left( - \frac{t}{Q+m} \right) .\]

Thus it only remains to prove (1). First observe that the primal/dual gap can be written as:

    \[P(w_t) - D(\nu_t) = \frac{1}{m} \sum_{i=1}^m \bigg( f_i(w_t^{\top} x_i) + f_i^*(- \nu_t(i)) \bigg) + \alpha \|w_t\|^2 .\]

Let us now evaluate the improvement in the dual objective assuming that coordinate i is updated in the (t+1)^{th} step:

    \begin{align*} & m \left( D(\nu_{t+1}) - D(\nu_t) \right) \\ & = - f_i^*(-\nu_{t+1}(i)) - \frac{\alpha m}{2} \|w_{t+1}\|^2 - \left(- f_i^*(-\nu_t(i)) - \frac{\alpha m}{2} \|w_t\|^2\right) \\ & = - f_i^*(-(1-\eta) \nu_t(i) + \eta f_i'(w_t^{\top} x_i)) - \frac{\alpha m}{2} \left\|w_t - \frac{\eta}{\alpha m} (f_i'(w_t^{\top} x_i) + \nu_t(i)) x_i \right\|^2 \\ & \qquad - \left(- f_i^*(-\nu_t(i)) - \frac{\alpha m}{2} \|w_t\|^2\right) . \end{align*}

Now using that f_i^* is \frac{1}{\beta}-strongly convex one has

    \begin{eqnarray*} f_i^*(-(1-\eta) \nu_t(i) + \eta f_i'(w_t^{\top} x_i)) & \leq& (1 -\eta) f_i^*(- \nu_t(i)) + \eta f_i^*(f_i'(w_t^{\top} x_i)) \\ & & \;- \frac{1}{2 \beta} \eta (1-\eta) (f_i'(w_t^{\top} x_i) + \nu_t(i))^2 , \end{eqnarray*}

which yields

    \begin{align*} & m \left( D(\nu_{t+1}) - D(\nu_t) \right) \\ & \geq \eta f_i^*(- \nu_t(i)) - \eta f_i^*(f_i'(w_t^{\top} x_i)) + \frac{\eta (1-\eta)}{2 \beta} (f_i'(w_t^{\top} x_i) - \nu_t(i))^2 \\ & + \eta (f_i'(w_t^{\top} x_i) + \nu_t(i)) w_t^{\top} x_i - \frac{\eta^2}{2\alpha m} (f_i'(w_t^{\top} x_i) + \nu_t(i))^2 \|x_i\|^2 \\ & \geq \eta f_i^*(- \nu_t(i)) - \eta f_i^*(f_i'(w_t^{\top} x_i)) + \eta (f_i'(w_t^{\top} x_i) + \nu_t(i)) w_t^{\top} x_i , \end{align*}

where in the last line we used that \eta = \frac{m}{Q+m} and \|x_i\| \leq 1. Observe now that

    \[f_i^*(f_i'(w_t^{\top} x_i)) = (w_t^{\top} x_i) f_i'(w_t^{\top} x_i) - f_i(w_t^{\top} x_i) ,\]

which finally gives

    \[m \left( D(\nu_{t+1}) - D(\nu_t) \right) \geq \eta \bigg(f_i^*(- \nu_t(i)) + f_i(w_t^{\top} x_i) + \nu_t(i) w_t^{\top} x_i \bigg),\]

and thus taking expectation one obtains

    \begin{align*} & (Q+m) \mathbb{E}_{i_t} \left( D(\nu_{t+1}) - D(\nu_t) \right) \\ & \geq \frac{1}{m} \sum_{i=1}^m \bigg( f_i(w_t^{\top} x_i) + f_i^*(- \nu_t(i)) + \nu_t(i) w_t^{\top} x_i \bigg)\\ & = P(w_t) - D(\nu_t), \end{align*}

which concludes the proof of (1) and the proof of the theorem.

\Box

This entry was posted in Optimization. Bookmark the permalink.

2 Responses to "ORF523: Acceleration by randomization for a sum of smooth and strongly convex functions"

Leave a reply