ORF523: ISTA and FISTA

As we said in the previous lecture, it seems stupid to consider that we are in a black-box situation when in fact we know entirely the function to be optimized. Consider for instance the LASSO objective $\| X w - Y \|_2^2 + \lambda \|w\|_1$ , which one wants to minimize over $w \in \mathbb{R}^n$ . Resorting to a black-box procedure, one would solve this problem with Subgradient Descent, and the rate of convergence would be of order $1/\sqrt{t}$ since this function is non-smooth (and potentially non-strongly convex). However, we will now see that one can take advantage of the form of the LASSO objective, which is a sum of a smooth part and a simple non-smooth part, to obtain rates as fast as $1/t^2$ .

In this lecture (which follows this paper by Beck and Teboulle) we consider the unconstrained minimization of a sum of two functions $f$ and $g$ that satisfy the following requirements:

(i) $f+g$ admits a minimizer $x^*$ on $\mathbb{R}^n$ .

(ii) $f$ and $g$ are convex, and $f$ is $\beta$ -smooth.

(iii) $g$ is known and $f$ is accessible through a first-order oracle.

As we will see next, for the proposed algorithm to be computationally efficient, $g$ also needs to be ‘simple’. For instance, a separable function (i.e., $g(x) = \sum_{i=1}^n g_i(x_i)$ ) will be considered simple. Our prime example will be $g(x) = \|x\|_1$ .

ISTA (Iterative Shrinkage-Thresholding Algorithm)

Recall that the Gradient Descent algorithm to optimize the smooth function $f$ is simply given by

$x_{t+1} = x_t - \eta \nabla f(x_t) ,$

which can be written in the proximal form as

$x_{t+1} = \mathrm{argmin}_{x \in \mathbb{R}^n} \ f(x_t) + \nabla f(x_t)^{\top} (x-x_t) + \frac{1}{2\eta} \|x - x_t\|^2_2 .$
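To see why the two forms of Gradient Descent agree, note that the objective inside the argmin is a strongly convex quadratic in $x$ , so its unique minimizer is found by setting the gradient with respect to $x$ to zero:

$\nabla f(x_t) + \frac{1}{\eta} (x - x_t) = 0 \ \Longleftrightarrow \ x = x_t - \eta \nabla f(x_t) .$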

Now here one wants to minimize $f+g$ , and $g$ is assumed to be known and ‘simple’. It seems very natural to consider the following iterative procedure known as ISTA:

$\begin{eqnarray*} x_{t+1} & = & \mathrm{argmin}_{x \in \mathbb{R}^n} \ f(x_t) + \nabla f(x_t)^{\top} (x-x_t) + \frac{1}{2\eta} \|x - x_t\|^2_2 + g(x) \\ & = & \mathrm{argmin}_{x \in \mathbb{R}^n} \ g(x) + \frac{1}{2\eta} \|x - (x_t - \eta \nabla f(x_t)) \|_2^2 . \end{eqnarray*}$
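The second equality in the ISTA update follows by completing the square. A short check, using only the quantities already defined:

$\nabla f(x_t)^{\top} (x-x_t) + \frac{1}{2\eta} \|x - x_t\|^2_2 = \frac{1}{2\eta} \|x - (x_t - \eta \nabla f(x_t)) \|_2^2 - \frac{\eta}{2} \|\nabla f(x_t)\|_2^2 .$

The terms $f(x_t)$ and $\frac{\eta}{2} \|\nabla f(x_t)\|_2^2$ do not depend on $x$ , so dropping them does not change the minimizer.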

In terms of convergence rate, it is not too hard to show that ISTA has the same convergence rate on $f+g$ as Gradient Descent on $f$ . More precisely, with $\eta=\frac{1}{\beta}$ one has

$f(x_t) + g(x_t) - (f(x^*) + g(x^*)) \leq \frac{\beta \|x_1 - x^*\|^2_2}{2 t} .$

This improved convergence rate over Subgradient Descent comes at a price: computing $x_{t+1}$ may be a difficult optimization problem by itself in general, and this is why one needs to assume that $g$ is ‘simple’. For instance if $g$ can be written as $g(x) = \sum_{i=1}^n g_i(x_i)$ then one can compute $x_{t+1}$ by solving $n$ convex problems in dimension $1$ . In the case where $g(x) = \lambda \|x\|_1$ this one-dimensional problem is given by:

$\min_{x \in \mathbb{R}} \ \lambda |x| + \frac{1}{2 \eta}(x - x_0)^2, \ \text{where} \ x_0 \in \mathbb{R} .$

Elementary computations show that this problem has an analytical solution given by $\tau_{\lambda \eta}(x_0)$ , where $\tau$ is the shrinkage operator defined by

$\tau_{\alpha}(x) = (|x|-\alpha)_+ \mathrm{sign}(x) .$
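As an illustration, here is a minimal sketch of ISTA on the LASSO objective $\| X w - Y \|_2^2 + \lambda \|w\|_1$ , written in plain Python. The function names, the toy data and the parameter `reg` (standing for $\lambda$ ) are assumptions made for this example. The smoothness constant is replaced by the conservative upper bound $2 \|X\|_F^2$ , which is an assumption of the sketch and not part of the lecture.

```python
def soft_threshold(x, alpha):
    """Scalar shrinkage operator: (|x| - alpha)_+ * sign(x)."""
    magnitude = max(abs(x) - alpha, 0.0)
    return magnitude if x >= 0 else -magnitude


def grad_least_squares(X, Y, w):
    """Gradient of f(w) = ||X w - Y||_2^2, i.e. 2 X^T (X w - Y)."""
    residual = [sum(xij * wj for xij, wj in zip(row, w)) - yi
                for row, yi in zip(X, Y)]
    return [2.0 * sum(row[j] * r for row, r in zip(X, residual))
            for j in range(len(w))]


def ista(X, Y, reg, num_steps):
    """Run ISTA from w = 0 with step size eta = 1 / beta."""
    # Assumption: beta = 2 * ||X||_F^2 upper-bounds the smoothness constant
    # 2 * lambda_max(X^T X); it is conservative but needs no eigen-solver.
    beta = 2.0 * sum(xij ** 2 for row in X for xij in row)
    eta = 1.0 / beta
    w = [0.0] * len(X[0])
    for _ in range(num_steps):
        grad = grad_least_squares(X, Y, w)
        # Gradient step on f, then coordinate-wise shrinkage for g.
        w = [soft_threshold(wj - eta * gj, reg * eta)
             for wj, gj in zip(w, grad)]
    return w


# Toy data, made up for illustration only.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [1.0, 0.0, 1.0]
w_hat = ista(X, Y, reg=0.1, num_steps=500)
```

On this toy instance the iterate should settle near $(0.975, 0)$ : the second coordinate is set exactly to zero by the shrinkage step, which is the sparsity-inducing effect of the $\ell_1$ term.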

FISTA (Fast ISTA)

As we have seen in an earlier lecture, the optimal rate of convergence for smooth functions can be obtained with Nesterov’s Accelerated Gradient Descent. Combining this idea with ISTA one gets FISTA, which is described as follows. Let

$\lambda_0 = 0, \ \lambda_{s} = \frac{1 + \sqrt{1+ 4 \lambda_{s-1}^2}}{2}, \ \text{and} \ \gamma_s = \frac{1 - \lambda_s}{\lambda_{s+1}}.$

Let $x_1 = y_1$ be an arbitrary initial point, and

$\begin{eqnarray*} y_{s+1} & = & \mathrm{argmin}_{x \in \mathbb{R}^n} \ g(x) + \frac{\beta}{2} \|x - (x_s - \frac{1}{\beta} \nabla f(x_s)) \|_2^2 , \\ x_{s+1} & = & (1 - \gamma_s) y_{s+1} + \gamma_s y_s . \end{eqnarray*}$

Again it is not hard to show that the rate of FISTA is similar to the one of Nesterov’s Accelerated Gradient Descent, more precisely:

$f(y_t) + g(y_t) - (f(x^*) + g(x^*)) \leq \frac{2 \beta \|x_1 - x^*\|^2}{t^2} .$
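The FISTA recursion can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the generic `prox_g` interface and the toy problem (a diagonal quadratic $f$ plus an $\ell_1$ term, chosen because its minimizer has a closed form) are made up for this example and do not come from the lecture.

```python
import math


def soft_threshold(x, alpha):
    """Scalar shrinkage operator: (|x| - alpha)_+ * sign(x)."""
    magnitude = max(abs(x) - alpha, 0.0)
    return magnitude if x >= 0 else -magnitude


def fista(grad_f, prox_g, beta, x1, num_steps):
    """FISTA with the lambda_s / gamma_s sequences defined in the text.

    prox_g(v, eta) must return argmin_x g(x) + ||x - v||^2 / (2 * eta).
    Returns the y iterate after `num_steps` proximal steps.
    """
    x = list(x1)
    y = list(x1)
    lam = 1.0  # lambda_1, since lambda_0 = 0 gives lambda_1 = 1
    for _ in range(num_steps):
        lam_next = (1.0 + math.sqrt(1.0 + 4.0 * lam ** 2)) / 2.0
        gamma = (1.0 - lam) / lam_next  # gamma_s <= 0: an extrapolation
        grad = grad_f(x)
        y_next = prox_g([xi - gi / beta for xi, gi in zip(x, grad)],
                        1.0 / beta)
        x = [(1.0 - gamma) * yn + gamma * yo for yn, yo in zip(y_next, y)]
        y, lam = y_next, lam_next
    return y


# Toy problem (made up): f(x) = 0.5 * sum_i a_i (x_i - b_i)^2,
# g(x) = reg * ||x||_1, so f is beta-smooth with beta = max_i a_i.
a = [1.0, 4.0]
b = [3.0, -0.1]
reg = 1.0
beta = max(a)


def grad_f(x):
    return [ai * (xi - bi) for ai, xi, bi in zip(a, x, b)]


def prox_g(v, eta):
    return [soft_threshold(vi, reg * eta) for vi in v]


def objective(x):
    smooth = 0.5 * sum(ai * (xi - bi) ** 2 for ai, xi, bi in zip(a, x, b))
    return smooth + reg * sum(abs(xi) for xi in x)


y_hat = fista(grad_f, prox_g, beta, x1=[0.0, 0.0], num_steps=200)
# The problem is separable, so the minimizer is known coordinate-wise.
x_star = [soft_threshold(bi, reg / ai) for ai, bi in zip(a, b)]
```

Comparing `objective(y_hat) - objective(x_star)` with $2 \beta \|x_1 - x^*\|^2 / t^2$ for a few values of `num_steps` gives a quick sanity check of the rate stated above.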

6 Responses to "ORF523: ISTA and FISTA"

By Muhammad Kasim January 9, 2019 - 10:32 am

Hi,

I wonder if there’s an analytical solution if g(x) = || Bx ||_1, with B is a matrix with an arbitrary shape (not necessarily invertible) in the ISTA step equation above.
Thank you.

By Royi June 1, 2016 - 2:54 am

Hi,

Could you show the use of FISTA in the case of the f(x) and g(x) you chose above (Namely g(x) is the l1 norm)?

Thank You.
By First-order methods for regularization | Sketches, polytopes July 21, 2014 - 12:27 am

[…] NESTA: A Fast and Accurate First-Order Method for Sparse Recovery. Technical Report, Caltech, 2009 [3] http://blogs.princeton.edu/imabandit/2013/04/11/orf523-ista-and-fista/ [4] M. Teboulle. […]
By NIPS 2013 and ICML 2014 | spider's space January 28, 2014 - 8:47 am

[…] Decomposing the Proximal Map by Yaoliang Yu. Algorithms such as ISTA and FISTA (see this post) require to compute the proximal […]
By Maziar November 17, 2013 - 6:43 pm

FISTA method is a special case of Nesterov’s method.
- By Khue July 10, 2014 - 6:56 pm
  
  Do you mean this one of Nesterov: http://www.ecore.be/DPs/dp_1191313936.pdf?
