Today we will talk about another property of convex functions that can significantly speed-up the convergence of first-order methods: strong convexity. We say that is –strongly convex if it satisfies
(1)
Of course this definition does not require differentiability of the function , and one can replace in the inequality above by . It is immediate to verify that a function is -strongly convex if and only if is convex.
Note that (1) can be interpreted as follows: at any point one can find a (convex) quadratic lower bound to the function , i.e. (and ). Thus in some sense strong convexity is a dual assumption to the smoothness assumption from previous lectures. Indeed recall that smoothness can be defined via the inequality:
which implies that at any point one can find a (convex) quadratic upper bound to the function , i.e. (and ). In fact we will see later a precise sense in which smoothness and strong convexity are dual notions (via Fenchel duality). Remark also that clearly one always has .
Projected Subgradient Descent for strongly convex and Lipschitz functions
In this section we investigate the setting where is strongly convex but potentially non-smooth. As we have already seen in a previous lecture, in the case of non-smooth functions we have to project back on the set where we control the norm of the gradients. Precisely let us assume that is a compact and convex set such that . We consider the Projected Subgradient Descent algorithm with time-varying step size, that is
The following result is extracted from a recent paper of Simon Lacoste-Julien, Mark Schmidt, and Francis Bach.
Theorem Let , then Projected Subgradient Descent satisfies for
\noindent ,
Note that one can immediately see from the analysis in this lecture that the rate is optimal. Indeed, one can always find a function and set (an ball) that satisfies the above assumptions and such that no black-box procedure can go at a rate faster than for (in fact the constant can be improved to ).
Proof: Let . Coming back to our original analysis of Projected Subgradient Descent and using the strong convexity assumption one immediately obtains
Multiplying this inequality by yields
Now sum the resulting inequality over to , and apply Jensen’s inequality to obtain the claimed statement.
Gradient Descent for strongly convex and smooth functions
As will see now, having both strong convexity and smoothness allows for a drastic improvement in the convergence rate. The key observation is the following lemma.
Lemma Let be -smooth and -strongly convex. Then for all , one has
Proof: Using the definitions it is easy to prove that is convex and -smooth, and thus using a result from the previous lecture one has
which gives the claimed result with straightforward computations. (Note that if then one just has to apply directly the above inequality to .)
Theorem Let be -smooth and -strongly convex, and let be the condition number of . Then Gradient Descent with satisfies
Proof: First note that by -smoothness one has
Now using the previous lemma one obtains
which concludes the proof.
Oracle lower bound for strongly convex and smooth functions
To simplify the proof of the next theorem we will consider the case of where . More precisely we assume now that we are working in rather than in . Note that in fact everything that we have done so far for first-order methods could have been done in an arbitrary Hilbert space rather than in , and we chose to work in only for the clarity of exposition.
Theorem For any black-box optimization procedure satisfying
there exists a -smooth and -strongly convex function such that for any one has
A consequence of this result is that the bound we derived in the previous section for Gradient Descent is not optimal, as the rate had a similar form than the above lower bound but with instead (which can be an important difference for large values of ). One can obtain a matching upper bound by resorting to Nesterov’s Accelerated Gradient Descent instead of the plain Gradient Descent, and we refer the interested reader to the book by Nesterov for more details on this.
Proof: The overall argument is similar to what we did in the lower bound for smooth functions, but the details are different. As usual we assume without loss of generality that . Let be the linear operator that corresponds to the infinite tridiagonal matrix with on the diagonal and on the upper and lower diagonals. Let . We consider now the following function:
We already proved that which easily implies that is -strongly convex and -smooth. Now as always the key observation is that for this function, thanks to our assumption on the black-box procedure, one necessarily has . This implies in particular:
Furthermore since is -strongly convex, one has
Thus it only remains to compute . This can be done by differentiating and setting the gradient to , which gives the following infinite set of equations
It is easy to verify that defined by satisfy this infinite set of equations, and the conclusion of the theorem then follows by straightforward computations.
By Bo February 7, 2018 - 3:48 pm
Prof. Bubeck
Can you list some methods to estimate the Lipschitz constant? Thanks a lot!
By joe September 14, 2016 - 9:24 am
Hi for me it is unclear what the first symbol after the first inequality means as you only define f and nothing else. It would be nice if you could clarify 🙂
By Sara August 13, 2013 - 4:19 pm
Could you give a reference for the Lemma in the Gradient Descent for strongly convex and smooth functions section?
By zcc August 12, 2013 - 8:46 pm
Does the example function f in oracle lower bound contain a linear term, i.e. has an additional \frac{\alpha(Q-1)}{4} x^Te_1?