As we said in the previous lecture it seems stupid to consider that we are in a black-box situation when in fact we know entirely the function to be optimized. Consider for instance the LASSO objective where one wants to minimize over . By resorting to black-box procedures one would solve this problem with Subgradient Descent and the rate of convergence would be of order as this function is non-smooth (and potentially non-strongly convex). However we will see now that one can take advantage of the form of the LASSO objective which is a sum of a smooth part and a simple non-smooth part to obtain rates as fast as .
In this lecture (which follows this paper by Beck and Teboulle) we consider the unconstrained minimization of a sum of two functions and that satisfy the following requirements:
(i) admits a minimizer on .
(ii) and are convex, and is -smooth.
(iii) is known and is accessible with a -order oracle.
As we will see next, for the proposed algorithm to be computationally efficient also needs to be ‘simple’. For instance a separable function (i.e., ) will be considered as a simple function. Our prime example will be .
ISTA (Iterative Shrinkage-Thresholding Algorithm)
Recall that the Gradient Descent algorithm to optimize the smooth function is simply given by
which can be written in the proximal form as
Now here one wants to minimize , and is assumed to be known and ‘simple’. It seems very natural to consider the following iterative procedure known as ISTA:
In terms of convergence rate it is not too hard to show that ISTA has the same convergence rate on than Gradient Descent on , more precisely with one has
This improved convergence rate over Subgradient Descent comes at a price: computing may be a difficult optimization problem by itself in general, and this is why one needs to assume that is ‘simple’. For instance if can be written as then one can compute by solving convex problems in dimension . In the case where this one-dimensional problem is given by:
Elementary computations show that this problem has an analytical solution given by , where is the shrinkage operator defined by
FISTA (Fast ISTA)
As we have seen in this lecture, the optimal rate of convergence for smooth functions can be obtained with Nesterov’s Accelerated Gradient Descent. Combining this idea with ISTA one gets FISTA which is described as follows. Let
Let an arbitrary initial point, and
Again it is not hard to show that the rate of FISTA is similar to the one of Nesterov’s Accelerated Gradient Descent, more precisely: