In this lecture we consider the unconstrained minimization of a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ that satisfies the following requirements:
- (i) $f$ admits a minimizer $x^*$ on $\mathbb{R}^n$.
- (ii) $f$ is continuously differentiable and convex on $\mathbb{R}^n$.
- (iii) $f$ is smooth in the sense that the gradient mapping $\nabla f$ is $\beta$-Lipschitz: for any $x, y \in \mathbb{R}^n$, one has $\|\nabla f(x) - \nabla f(y)\| \leq \beta \|x - y\|$.
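As a concrete running example (my own addition, not from the lecture), the least-squares objective $f(x) = \frac{1}{2}\|Ax - b\|^2$ satisfies (i)-(ii)-(iii) with $\beta = \lambda_{\max}(A^{\top}A)$; the short sketch below spot-checks the $\beta$-Lipschitz property of its gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 20
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

def f(x):
    # least-squares objective: convex and continuously differentiable
    return 0.5 * np.linalg.norm(A @ x - b) ** 2

def grad(x):
    # gradient of f; its Lipschitz constant is the largest eigenvalue of A^T A
    return A.T @ (A @ x - b)

beta = np.linalg.eigvalsh(A.T @ A).max()

# spot-check the beta-Lipschitz property of the gradient on random pairs
for _ in range(100):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    assert np.linalg.norm(grad(x) - grad(y)) <= beta * np.linalg.norm(x - y) + 1e-6
```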
Oracle lower bound in high dimension
We first prove a lower bound on the best possible dimension-uniform rate of convergence one can obtain for smooth convex functions.
Theorem For any black-box optimization procedure satisfying
\[
x_{s+1} \in x_1 + \mathrm{Span}\left( \nabla f(x_1), \dots, \nabla f(x_s) \right),
\]
and for any $t \leq \frac{n-1}{2}$, there exists a function $f$ that satisfies (i)-(ii)-(iii) and such that
\[
\min_{1 \leq s \leq t} f(x_s) - f(x^*) \geq \frac{3 \beta}{32} \, \frac{\|x_1 - x^*\|^2}{(t+1)^2} .
\]
Proof: Without loss of generality we assume that $x_1 = 0$. We denote here $x(i)$ for the $i^{th}$ coordinate of $x \in \mathbb{R}^n$. Let $e_1, \dots, e_n$ be the canonical basis of $\mathbb{R}^n$, and for $h \leq n$ let $A_h \in \mathbb{R}^{n \times n}$ be the symmetric and tridiagonal matrix defined by
\[
(A_h)_{i,j} =
\begin{cases}
2, & i = j, \; i \leq h, \\
-1, & j \in \{i-1, i+1\}, \; i \leq h, \; j \neq h+1, \\
0, & \text{otherwise}.
\end{cases}
\]
It is easy to verify that $0 \preceq A_h \preceq 4 I_n$ since
\[
x^{\top} A_h x = 2 \sum_{i=1}^{h} x(i)^2 - 2 \sum_{i=1}^{h-1} x(i) x(i+1) = x(1)^2 + x(h)^2 + \sum_{i=1}^{h-1} \left( x(i) - x(i+1) \right)^2 .
\]
We consider now the following $\beta$-smooth convex function:
\[
f(x) = \frac{\beta}{8} x^{\top} A_{2t+1} x - \frac{\beta}{4} x^{\top} e_1 .
\]
Similarly to what happened in the proof of the oracle lower bound of the previous lecture, one can see here too that $x_s$ must lie in the linear span of $e_1, \dots, e_{s-1}$ (because of our assumption on the black-box procedure). In particular for $s \leq t$ we necessarily have $x_s(i) = 0$ for $i = s, \dots, n$, which implies $x_s^{\top} A_{2t+1} x_s = x_s^{\top} A_s x_s$. In other words, if we denote
\[
f_h(x) = \frac{\beta}{8} x^{\top} A_h x - \frac{\beta}{4} x^{\top} e_1 , \qquad f_h^* = \min_{x \in \mathbb{R}^n} f_h(x),
\]
then we just proved that
\[
f(x_s) - f(x^*) = f_s(x_s) - f_{2t+1}^* \geq f_s^* - f_{2t+1}^* .
\]
Thus it simply remains to compute the minimizer $x_h^*$ of $f_h$, its norm, and the corresponding function value $f_h^*$.
The point $x_h^*$ is the unique solution in the span of $e_1, \dots, e_h$ of $A_h x = e_1$. It is easy to verify that it is defined by $x_h^*(i) = 1 - \frac{i}{h+1}$ for $i = 1, \dots, h$ (and $x_h^*(i) = 0$ for $i > h$). Thus we immediately have:
\[
f_h^* = \frac{\beta}{8} (x_h^*)^{\top} A_h x_h^* - \frac{\beta}{4} (x_h^*)^{\top} e_1 = - \frac{\beta}{8} (x_h^*)^{\top} e_1 = - \frac{\beta}{8} \left( 1 - \frac{1}{h+1} \right) .
\]
Furthermore note that
\[
\|x_h^*\|^2 = \sum_{i=1}^{h} \left( 1 - \frac{i}{h+1} \right)^2 = \sum_{i=1}^{h} \left( \frac{i}{h+1} \right)^2 \leq \frac{h+1}{3} .
\]
Thus one obtains:
\[
f_s^* - f_{2t+1}^* = \frac{\beta}{8} \left( \frac{1}{s+1} - \frac{1}{2t+2} \right) \geq \frac{\beta}{8} \left( \frac{1}{t+1} - \frac{1}{2t+2} \right) = \frac{\beta}{16 (t+1)} \geq \frac{3 \beta}{32} \, \frac{\|x_{2t+1}^*\|^2}{(t+1)^2} ,
\]
which concludes the proof (recall that $x^* = x_{2t+1}^*$ and $x_1 = 0$, so that $\|x_1 - x^*\| = \|x_{2t+1}^*\|$).
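To make the construction concrete, here is a small numerical sanity check (my own addition, written against the formulas as reconstructed above): it builds $A_h$, verifies $0 \preceq A_h \preceq 4 I_n$, and checks that the point $x_h^*(i) = 1 - i/(h+1)$ indeed solves $A_h x = e_1$.

```python
import numpy as np

def A_h(n, h):
    # symmetric tridiagonal matrix: 2 on the first h diagonal entries,
    # -1 on the off-diagonals inside the top-left h x h block, 0 elsewhere
    A = np.zeros((n, n))
    for i in range(h):
        A[i, i] = 2.0
        if i + 1 < h:
            A[i, i + 1] = A[i + 1, i] = -1.0
    return A

n, h = 15, 7
A = A_h(n, h)
e1 = np.zeros(n)
e1[0] = 1.0

# the eigenvalues lie in [0, 4], i.e. 0 <= A_h <= 4 I_n
eig = np.linalg.eigvalsh(A)
assert eig.min() >= -1e-12 and eig.max() <= 4 + 1e-12

# claimed minimizer of f_h: x(i) = 1 - i/(h+1) for i = 1, ..., h and 0 afterwards
x_star = np.array([1 - (i + 1) / (h + 1) if i < h else 0.0 for i in range(n)])
assert np.allclose(A @ x_star, e1)  # it solves A_h x = e_1
```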
Analysis of Gradient Descent for smooth functions
We now analyze the plain Gradient Descent algorithm that iterates the equation $x_{s+1} = x_s - \eta \nabla f(x_s)$. The first key observation is the following lemma, which shows that a smooth function (not necessarily convex) is 'sandwiched' between two quadratic functions.
Lemma 1 Let $f$ be a $\beta$-smooth function on $\mathbb{R}^n$. Then for any $x, y \in \mathbb{R}^n$, one has
\[
\left| f(x) - f(y) - \nabla f(y)^{\top} (x - y) \right| \leq \frac{\beta}{2} \|x - y\|^2 .
\]
Proof: If the function were convex one could use the definition of subgradients, the Cauchy-Schwarz inequality and $\beta$-smoothness to obtain
\[
0 \leq f(x) - f(y) - \nabla f(y)^{\top}(x - y) \leq \left( \nabla f(x) - \nabla f(y) \right)^{\top} (x - y) \leq \beta \|x - y\|^2 .
\]
A slightly more clever proof applies Cauchy-Schwarz 'continuously', which improves the constant and also gets rid of the convexity assumption. The key idea is to represent the quantity $f(x) - f(y)$ as an integral; this gives:
\[
\left| f(x) - f(y) - \nabla f(y)^{\top}(x - y) \right|
= \left| \int_0^1 \left( \nabla f\big(y + t(x - y)\big) - \nabla f(y) \right)^{\top} (x - y) \, dt \right|
\leq \int_0^1 \big\| \nabla f\big(y + t(x - y)\big) - \nabla f(y) \big\| \, \|x - y\| \, dt
\leq \int_0^1 \beta t \, \|x - y\|^2 \, dt
= \frac{\beta}{2} \|x - y\|^2 .
\]
Note that in fact it can be shown that $f$ is convex and $\beta$-smooth if and only if for any $x, y \in \mathbb{R}^n$, one has
\[
0 \leq f(x) - f(y) - \nabla f(y)^{\top}(x - y) \leq \frac{\beta}{2} \|x - y\|^2 ,
\]
and this inequality is often used as a definition. (The fact that the above inequality implies $\beta$-smoothness follows from Lemma 2 below.)
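As a quick numerical illustration (my own addition), the 'sandwich' of Lemma 1 does not require convexity; the sketch below tests it on the $1$-smooth but non-convex function $\sin$.

```python
import numpy as np

rng = np.random.default_rng(1)
beta = 1.0  # |sin''| <= 1, so sin is 1-smooth (and clearly not convex)

for _ in range(1000):
    x, y = rng.uniform(-10, 10, size=2)
    lhs = abs(np.sin(x) - np.sin(y) - np.cos(y) * (x - y))
    assert lhs <= 0.5 * beta * (x - y) ** 2 + 1e-12
```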
The next lemma shows that for smooth convex functions the basic inequality for subgradients can be improved.
Lemma 2 Let $f$ be a convex and $\beta$-smooth function on $\mathbb{R}^n$. Then for any $x, y \in \mathbb{R}^n$, one has
\[
f(x) - f(y) \leq \nabla f(x)^{\top} (x - y) - \frac{1}{2 \beta} \| \nabla f(x) - \nabla f(y) \|^2 .
\]
Proof: The proof is quite clever. Let $g(z) = f(z) - \nabla f(x)^{\top} z$. Note that $g$ is convex and $\beta$-smooth. Remark also that $x$ is the minimizer of $g$ (since $\nabla g(x) = 0$). Thus, using Lemma 1, one has:
\[
g(x) \leq g\left( y - \frac{1}{\beta} \nabla g(y) \right) \leq g(y) - \frac{1}{\beta} \|\nabla g(y)\|^2 + \frac{\beta}{2} \cdot \frac{1}{\beta^2} \|\nabla g(y)\|^2 = g(y) - \frac{1}{2 \beta} \|\nabla g(y)\|^2 ,
\]
that is, since $\nabla g(y) = \nabla f(y) - \nabla f(x)$,
\[
f(x) - \nabla f(x)^{\top} x \leq f(y) - \nabla f(x)^{\top} y - \frac{1}{2 \beta} \|\nabla f(y) - \nabla f(x)\|^2 ,
\]
which concludes the proof.
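As a small sanity check (my own addition), one can test the inequality of Lemma 2 numerically on a convex quadratic, for which the smoothness constant is $\beta = \lambda_{\max}(A^{\top}A)$.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 30, 10
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)
beta = np.linalg.eigvalsh(A.T @ A).max()  # smoothness constant of this quadratic

# check f(x) - f(y) <= <grad f(x), x - y> - ||grad f(x) - grad f(y)||^2 / (2 beta)
for _ in range(1000):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    lhs = f(x) - f(y)
    rhs = grad(x) @ (x - y) - np.linalg.norm(grad(x) - grad(y)) ** 2 / (2 * beta)
    assert lhs <= rhs + 1e-6
```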
We can now prove the following result.
Theorem Assume that $f$ satisfies (i)-(ii)-(iii). Then Gradient Descent with $\eta = \frac{1}{\beta}$ satisfies
\[
f(x_t) - f(x^*) \leq \frac{2 \beta \|x_1 - x^*\|^2}{t+3} .
\]
Note that here the step-size is much larger than for the Projected Subgradient Descent. This makes sense since for smooth functions one can ‘trust’ the size of the gradient to be a good indicator of how big of a step one can make (this is definitely not the case for non-smooth functions!). Also remark that the decision point is simply the last point visited, again in contrast with what happened for Projected Subgradient Descent.
Proof: Using Lemma 1 and the definition of the method one has
\[
f(x_{s+1}) - f(x_s) \leq \nabla f(x_s)^{\top} (x_{s+1} - x_s) + \frac{\beta}{2} \|x_{s+1} - x_s\|^2 = - \frac{1}{2 \beta} \|\nabla f(x_s)\|^2 .
\]
In particular, denoting $\delta_s = f(x_s) - f(x^*)$, this shows:
\[
\delta_{s+1} \leq \delta_s - \frac{1}{2 \beta} \|\nabla f(x_s)\|^2 .
\]
Of course one also has by convexity
\[
\delta_s \leq \nabla f(x_s)^{\top} (x_s - x^*) \leq \|x_s - x^*\| \cdot \|\nabla f(x_s)\| .
\]
We will prove that $\|x_s - x^*\|$ is decreasing with $s$, which with the two above displays will imply
\[
\delta_{s+1} \leq \delta_s - \frac{\delta_s^2}{2 \beta \|x_1 - x^*\|^2} .
\]
Let us see how to use this last inequality to conclude the proof. Let $\omega = \frac{1}{2 \beta \|x_1 - x^*\|^2}$, then
\[
\omega \delta_s^2 + \delta_{s+1} \leq \delta_s
\;\Leftrightarrow\;
\omega \frac{\delta_s}{\delta_{s+1}} + \frac{1}{\delta_s} \leq \frac{1}{\delta_{s+1}}
\;\Rightarrow\;
\frac{1}{\delta_{s+1}} - \frac{1}{\delta_s} \geq \omega
\;\Rightarrow\;
\frac{1}{\delta_t} \geq \frac{1}{\delta_1} + \omega (t - 1) ,
\]
where the second implication uses $\delta_{s+1} \leq \delta_s$. The last inequality concludes the proof since by $\beta$-smoothness (and Lemma 1) one has
\[
\delta_1 = f(x_1) - f(x^*) \leq \nabla f(x^*)^{\top} (x_1 - x^*) + \frac{\beta}{2} \|x_1 - x^*\|^2 = \frac{\beta}{2} \|x_1 - x^*\|^2 ,
\]
that is $\frac{1}{\delta_1} \geq 4 \omega$, and thus $\frac{1}{\delta_t} \geq \omega (t+3)$.
Thus it only remains to show that $\|x_s - x^*\|$ is decreasing with $s$. Note that for a convex and $\beta$-smooth function, using Lemma 2 one immediately gets
\[
\left( \nabla f(x) - \nabla f(y) \right)^{\top} (x - y) \geq \frac{1}{\beta} \|\nabla f(x) - \nabla f(y)\|^2
\]
(simply add the inequality of Lemma 2 to the same inequality with the roles of $x$ and $y$ exchanged). We use this as follows (recall that $\nabla f(x^*) = 0$):
\[
\|x_{s+1} - x^*\|^2
= \left\| x_s - \frac{1}{\beta} \nabla f(x_s) - x^* \right\|^2
= \|x_s - x^*\|^2 - \frac{2}{\beta} \nabla f(x_s)^{\top} (x_s - x^*) + \frac{1}{\beta^2} \|\nabla f(x_s)\|^2
\leq \|x_s - x^*\|^2 - \frac{1}{\beta^2} \|\nabla f(x_s)\|^2
\leq \|x_s - x^*\|^2 ,
\]
which concludes the proof.
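To see the theorem in action, here is a minimal sketch (my own addition, with names of my choosing) of plain gradient descent with step size $1/\beta$ on a least-squares problem; it prints the optimality gap next to the bound $2\beta\|x_1 - x^*\|^2/(t+3)$ as stated above, and checks, as in the proof, that $\|x_s - x^*\|$ is non-increasing.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 40, 20
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)
beta = np.linalg.eigvalsh(A.T @ A).max()

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)  # minimizer of the least-squares problem
f_star = f(x_star)

x = np.zeros(n)                        # x_1 = 0
R2 = np.linalg.norm(x - x_star) ** 2   # ||x_1 - x^*||^2
d_prev = np.sqrt(R2)
for t in range(1, 201):
    if t in (1, 10, 50, 200):
        print(f"t={t:4d}  gap={f(x) - f_star:.3e}  bound={2 * beta * R2 / (t + 3):.3e}")
    x = x - grad(x) / beta             # x_{t+1} = x_t - (1/beta) * grad f(x_t)
    d = np.linalg.norm(x - x_star)
    assert d <= d_prev + 1e-10         # ||x_s - x^*|| is non-increasing, as in the proof
    d_prev = d
```

Note that, in line with the remark above, the reported point is simply the last iterate, with no averaging.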
Contrary to what happened in the non-smooth case, here the gradient descent idea does not attain a rate that matches the lower bound ($1/t$ v.s. $1/t^2$). In the next lecture we will see how to fix this situation and we will derive a highly non-trivial algorithm with a $1/t^2$ convergence rate.