As we said in the previous lecture, it seems stupid to consider that we are in a black-box situation when in fact we know entirely the function to be optimized. Consider for instance the LASSO objective $\|Ax - b\|_2^2 + \lambda \|x\|_1$, which one wants to minimize over $x \in \mathbb{R}^n$. By resorting to black-box procedures one would solve this problem with Subgradient Descent, and the rate of convergence would be of order $1/\sqrt{t}$ as this function is non-smooth (and potentially non-strongly convex). However we will see now that one can take advantage of the form of the LASSO objective, which is a sum of a smooth part and a simple non-smooth part, to obtain rates as fast as $1/t^2$.
In this lecture (which follows this paper by Beck and Teboulle) we consider the unconstrained minimization of a sum of two functions $f$ and $g$ that satisfy the following requirements:

(i) $f + g$ admits a minimizer $x^*$ on $\mathbb{R}^n$.

(ii) $f$ and $g$ are convex, and $f$ is $\beta$-smooth.

(iii) $g$ is known and $f$ is accessible with a first-order oracle.
As we will see next, for the proposed algorithm to be computationally efficient $g$ also needs to be ‘simple’. For instance a separable function (i.e., $g(x) = \sum_{i=1}^n g_i(x_i)$) will be considered as a simple function. Our prime example will be $g(x) = \lambda \|x\|_1$.
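To fix ideas, the LASSO objective from the introduction fits this framework; with the (convenient, but not mandatory) $1/2$ scaling of the quadratic term one can take

$$f(x) = \frac{1}{2} \|A x - b\|_2^2, \qquad g(x) = \lambda \|x\|_1,$$

and $f$ is $\beta$-smooth with $\beta = \lambda_{\max}(A^\top A)$, since $\nabla^2 f(x) = A^\top A$.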
ISTA (Iterative Shrinkage-Thresholding Algorithm)
Recall that the Gradient Descent algorithm to optimize the smooth function $f$ is simply given by

$$x_{t+1} = x_t - \frac{1}{\beta} \nabla f(x_t),$$

which can be written in the proximal form as

$$x_{t+1} = \mathop{\mathrm{argmin}}_{x \in \mathbb{R}^n} \; f(x_t) + \nabla f(x_t)^\top (x - x_t) + \frac{\beta}{2} \|x - x_t\|_2^2 .$$

Now here one wants to minimize $f + g$, and $g$ is assumed to be known and ‘simple’. It seems very natural to consider the following iterative procedure, known as ISTA:

$$x_{t+1} = \mathop{\mathrm{argmin}}_{x \in \mathbb{R}^n} \; g(x) + f(x_t) + \nabla f(x_t)^\top (x - x_t) + \frac{\beta}{2} \|x - x_t\|_2^2 .$$
In terms of convergence rate it is not too hard to show that ISTA has the same convergence rate on $f + g$ as Gradient Descent on $f$; more precisely, with $x^*$ a minimizer of $f + g$ one has

$$f(x_t) + g(x_t) - \big( f(x^*) + g(x^*) \big) \leq \frac{\beta \|x_1 - x^*\|_2^2}{2 t} .$$
This improved convergence rate over Subgradient Descent comes at a price: in general, computing $x_{t+1}$ may be a difficult optimization problem by itself, and this is why one needs to assume that $g$ is ‘simple’. For instance if $g$ can be written as $g(x) = \sum_{i=1}^n g_i(x_i)$ then one can compute $x_{t+1}$ by solving $n$ convex problems in dimension $1$. In the case where $g(x) = \lambda \|x\|_1$ this one-dimensional problem takes the form

$$\min_{x \in \mathbb{R}} \; \lambda |x| + \frac{\beta}{2} (x - y)^2, \quad \text{for some} \; y \in \mathbb{R}.$$
Elementary computations show that this problem has the analytical solution $\tau_{\lambda / \beta}(y)$, where $\tau$ is the shrinkage operator (soft-thresholding) defined by

$$\tau_\alpha(y) = \mathrm{sign}(y) \, \max(|y| - \alpha, 0).$$
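As a concrete illustration, here is a minimal NumPy sketch of ISTA for the LASSO example above (assuming $f(x) = \frac{1}{2}\|Ax - b\|_2^2$ so that $\beta = \lambda_{\max}(A^\top A)$; the function and variable names such as `ista` and `reg` are illustrative choices, not from the original text):

```python
import numpy as np

def soft_threshold(y, alpha):
    # Shrinkage operator: tau_alpha(y) = sign(y) * max(|y| - alpha, 0)
    return np.sign(y) * np.maximum(np.abs(y) - alpha, 0.0)

def ista(A, b, reg, n_iter=500):
    """ISTA for min_x 0.5*||Ax - b||_2^2 + reg*||x||_1."""
    beta = np.linalg.norm(A, 2) ** 2          # smoothness constant: largest eigenvalue of A^T A
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)              # gradient of the smooth part f
        x = soft_threshold(x - grad / beta, reg / beta)   # coordinate-wise shrinkage (prox step)
    return x
```

Each iteration is a gradient step on $f$ followed by coordinate-wise shrinkage, i.e., exactly the closed-form solution of the one-dimensional problems above.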
FISTA (Fast ISTA)
As we have seen in this lecture, the optimal rate of convergence for smooth functions can be obtained with Nesterov’s Accelerated Gradient Descent. Combining this idea with ISTA one gets FISTA, which is described as follows. Let

$$\lambda_0 = 0, \quad \lambda_t = \frac{1 + \sqrt{1 + 4 \lambda_{t-1}^2}}{2}, \quad \text{and} \quad \gamma_t = \frac{1 - \lambda_t}{\lambda_{t+1}}.$$

Let $x_1 = y_1$ be an arbitrary initial point, and

$$\begin{aligned} y_{t+1} & = \mathop{\mathrm{argmin}}_{x \in \mathbb{R}^n} \; g(x) + f(x_t) + \nabla f(x_t)^\top (x - x_t) + \frac{\beta}{2} \|x - x_t\|_2^2, \\ x_{t+1} & = (1 - \gamma_t) y_{t+1} + \gamma_t y_t. \end{aligned}$$

Again it is not hard to show that the rate of convergence of FISTA is similar to the one of Nesterov’s Accelerated Gradient Descent; more precisely:

$$f(y_t) + g(y_t) - \big( f(x^*) + g(x^*) \big) \leq \frac{2 \beta \|x_1 - x^*\|_2^2}{t^2} .$$
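Continuing with the same illustrative LASSO setup (and reusing the hypothetical `soft_threshold` and `reg` names from the ISTA sketch), FISTA only adds the $\lambda_t / \gamma_t$ schedule and the extrapolation step to the loop:

```python
import numpy as np

def soft_threshold(y, alpha):
    return np.sign(y) * np.maximum(np.abs(y) - alpha, 0.0)

def fista(A, b, reg, n_iter=500):
    """FISTA for min_x 0.5*||Ax - b||_2^2 + reg*||x||_1."""
    beta = np.linalg.norm(A, 2) ** 2           # smoothness constant of f
    x = np.zeros(A.shape[1])                   # x_t
    y = x.copy()                               # y_t
    lam_t = 0.0                                # lambda_0 = 0 (schedule, not the regularization parameter)
    for _ in range(n_iter):
        lam_t = (1 + np.sqrt(1 + 4 * lam_t ** 2)) / 2       # lambda_t
        lam_next = (1 + np.sqrt(1 + 4 * lam_t ** 2)) / 2    # lambda_{t+1}
        gamma = (1 - lam_t) / lam_next                       # gamma_t (non-positive)
        grad = A.T @ (A @ x - b)
        y_next = soft_threshold(x - grad / beta, reg / beta) # ISTA step from x_t
        x = (1 - gamma) * y_next + gamma * y                 # extrapolation step
        y = y_next
    return y                                   # the guarantee above is stated for the y_t iterates
```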