ORF523: ISTA and FISTA

As we said in the previous lecture, it seems stupid to assume that we are in a black-box situation when in fact we know the function to be optimized entirely. Consider for instance the LASSO objective \| X w - Y \|_2^2 + \lambda \|w\|_1, which one wants to minimize over w \in \mathbb{R}^n. By resorting to black-box procedures one would solve this problem with Subgradient Descent, and the rate of convergence would be of order 1/\sqrt{t} since this function is non-smooth (and potentially not strongly convex). However, we will now see that one can take advantage of the particular form of the LASSO objective, namely a sum of a smooth part and a simple non-smooth part, to obtain rates as fast as 1/t^2.

In this lecture (which follows the paper “A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems” by Beck and Teboulle) we consider the unconstrained minimization of a sum of two functions f and g that satisfy the following requirements:

(i) f+g admits a minimizer x^* on \mathbb{R}^n.

(ii) f and g are convex, and f is \beta-smooth.

(iii) g is known and f is accessible with a first-order oracle.

As we will see next, for the proposed algorithm to be computationally efficient g also needs to be ‘simple’. For instance, a separable function (i.e., g(x) = \sum_{i=1}^n g_i(x_i)) is considered simple, as one can then work coordinate by coordinate. Our prime example will be g(x) = \lambda \|x\|_1.

 

ISTA (Iterative Shrinkage-Thresholding Algorithm)

Recall that the Gradient Descent algorithm to optimize the smooth function f is simply given by

    \[x_{t+1} = x_t - \eta \nabla f(x_t) ,\]

which can be written in the proximal form as

    \[x_{t+1} = \mathrm{argmin}_{x \in \mathbb{R}^n} \ f(x_t) + \nabla f(x_t)^{\top} (x-x_t) + \frac{1}{2\eta} \|x - x_t\|^2_2 .\]
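
Indeed, the right-hand side is a strongly convex quadratic in x, and setting its gradient to zero gives back the gradient step:

    \[\nabla f(x_t) + \frac{1}{\eta} (x - x_t) = 0 \quad \Leftrightarrow \quad x = x_t - \eta \nabla f(x_t) .\]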

Here one wants to minimize f+g, where g is assumed to be known and ‘simple’. It is thus very natural to consider the following iterative procedure, known as ISTA:

    \begin{eqnarray*} x_{t+1} & = & \mathrm{argmin}_{x \in \mathbb{R}^n} \ f(x_t) + \nabla f(x_t)^{\top} (x-x_t) + \frac{1}{2\eta} \|x - x_t\|^2_2 + g(x) \\ & = & \mathrm{argmin}_{x \in \mathbb{R}^n} \ g(x) + \frac{1}{2\eta} \|x - (x_t - \eta \nabla f(x_t)) \|_2^2 . \end{eqnarray*}
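
To make the iteration concrete, here is a minimal Python sketch of ISTA; the names grad_f and prox_g are illustrative (not from the lecture), with prox_g(v, eta) standing for the map v \mapsto \mathrm{argmin}_{x} \ g(x) + \frac{1}{2\eta} \|x - v\|_2^2:

    def ista(grad_f, prox_g, x0, eta, n_iters):
        # ISTA sketch: x_{t+1} = prox_g(x_t - eta * grad_f(x_t), eta).
        # grad_f(x) returns the gradient of the smooth part f at x;
        # prox_g(v, eta) solves argmin_x g(x) + ||x - v||^2 / (2 * eta);
        # eta is the step size (eta = 1 / beta below).
        x = x0
        for _ in range(n_iters):
            x = prox_g(x - eta * grad_f(x), eta)
        return x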

In terms of convergence rate it is not too hard to show that ISTA has the same convergence rate on f+g as Gradient Descent has on f; more precisely, with \eta=\frac{1}{\beta} one has

    \[f(x_t) + g(x_t) - (f(x^*) + g(x^*)) \leq \frac{\beta \|x_1 - x^*\|^2_2}{2 t} .\]

This improved convergence rate over Subgradient Descent comes at a price: computing x_{t+1} may itself be a difficult optimization problem in general, and this is why one needs to assume that g is ‘simple’. For instance, if g can be written as g(x) = \sum_{i=1}^n g_i(x_i) then one can compute x_{t+1} by solving n convex problems in dimension 1. In the case where g(x) = \lambda \|x\|_1 this one-dimensional problem is given by:

    \[\min_{x \in \mathbb{R}} \ \lambda |x| + \frac{1}{2 \eta}(x - x_0)^2, \ \text{where} \ x_0 \in \mathbb{R} .\]

Elementary computations show that this problem has an analytical solution given by \tau_{\lambda \eta}(x_0), where \tau is the shrinkage operator (also known as soft-thresholding) defined by

    \[\tau_{\alpha}(x) = (|x|-\alpha)_+ \mathrm{sign}(x) .\]
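
In code the shrinkage operator, and hence the prox step for g(x) = \lambda \|x\|_1, is a one-liner. Here is a NumPy sketch (lam stands for \lambda):

    import numpy as np

    def soft_threshold(x, alpha):
        # tau_alpha(x) = (|x| - alpha)_+ * sign(x), applied coordinate-wise.
        return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

    def prox_l1(v, eta, lam):
        # argmin_x lam * ||x||_1 + ||x - v||_2^2 / (2 * eta) = tau_{lam * eta}(v).
        return soft_threshold(v, lam * eta)

Combined with the ISTA sketch above this solves the LASSO from the introduction, e.g. with grad_f = lambda w: 2 * X.T @ (X @ w - Y) and prox_g = lambda v, eta: prox_l1(v, eta, lam).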

 

FISTA (Fast ISTA)

As we have seen in a previous lecture, the optimal rate of convergence for smooth functions can be obtained with Nesterov’s Accelerated Gradient Descent. Combining this idea with ISTA one gets FISTA, which is described as follows. Let

    \[\lambda_0 = 0, \ \lambda_{s} = \frac{1 + \sqrt{1+ 4 \lambda_{s-1}^2}}{2}, \ \text{and} \ \gamma_s = \frac{1 - \lambda_s}{\lambda_{s+1}}.\]

Let x_1 = y_1 be an arbitrary initial point, and

    \begin{eqnarray*} y_{s+1} & = & \mathrm{argmin}_{x \in \mathbb{R}^n} \ g(x) + \frac{\beta}{2} \left\|x - \left(x_s - \frac{1}{\beta} \nabla f(x_s)\right) \right\|_2^2 , \\ x_{s+1} & = & (1 - \gamma_s) y_{s+1} + \gamma_s y_s . \end{eqnarray*}
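
In the same style as the ISTA sketch, here is a minimal Python sketch of FISTA (illustrative names again; recall \eta = 1/\beta, and note that \lambda_1 = 1 since \lambda_0 = 0):

    import math

    def fista(grad_f, prox_g, x0, beta, n_iters):
        # FISTA sketch: y_{s+1} = prox_g(x_s - grad_f(x_s) / beta, 1 / beta),
        #               x_{s+1} = (1 - gamma_s) * y_{s+1} + gamma_s * y_s,
        # with gamma_s = (1 - lambda_s) / lambda_{s+1}.
        x, y = x0, x0
        lam = 1.0  # lambda_1 = (1 + sqrt(1 + 4 * lambda_0^2)) / 2 = 1
        for _ in range(n_iters):
            y_next = prox_g(x - grad_f(x) / beta, 1.0 / beta)
            lam_next = (1.0 + math.sqrt(1.0 + 4.0 * lam ** 2)) / 2.0
            gamma = (1.0 - lam) / lam_next  # gamma_s (zero at s = 1, negative after)
            x = (1.0 - gamma) * y_next + gamma * y
            y, lam = y_next, lam_next
        return y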

Again it is not hard to show that the rate of convergence of FISTA is similar to that of Nesterov’s Accelerated Gradient Descent; more precisely:

    \[f(y_t) + g(y_t) - (f(x^*) + g(x^*)) \leq \frac{2 \beta \|x_1 - x^*\|_2^2}{t^2} .\]
