In this lecture we consider the unconstrained minimization of a function that satisfy the following requirements:

- (i) admits a minimizer on .
- (ii) is continuously differentiable and convex on .
- (iii) is
*smooth*in the sense that the gradient mapping is -Lipschitz: for any , one has .

**Oracle lower bound in high dimension**

We first prove a lower bound on the best possible dimension-uniform rate of convergence one can obtain for smooth convex functions.

TheoremFor any black-box optimization procedure satisfying

and for any , there exists a function that satisfies (i)-(ii)-(iii) and such that

*Proof: *Without loss of generality we assume that . We denote here for the coordinate of . Let be the canonical basis of , and for let be the symmetric and tridiagonal matrix defined by

It is easy to verify that since

We consider now the following -smooth convex function:

Similarly to what happened in the proof of the oracle lower bound of the previous lecture, one can see here too that must lie in the linear span of (because of our assumption on the black-box procedure). In particular for we necessarily have for , which implies . In other words, if we denote

then we just proved that

Thus it simply remains to compute the minimizer of , its norm, and the corresponding function value .

The point is the unique solution in the span of of . It is easy to verify that it is defined by for . Thus we immediately have:

Furthermore note that

Thus one obtains:

which concludes the proof.

**Analysis of Gradient Descent for smooth functions**

We analyze now the plain Gradient Descent algorithm that iterates the equation . The first key observation is the following lemma which shows that a smooth function (not necessarily convex) is ‘sandwiched’ between two quadratic functions.

Lemma 1Let be a -smooth function on . Then for any , one has

*Proof: *If the function was convex one could use the definition of subgradient, Cauchy-Schwarz inequality and -smoothness to obtain

A slightly more clever proof applies Cauchy-Schwarz ‘continuously’, which improves the constant and also get rid of the convexity condition. The key idea is to represent the quantity as an integral, this gives:

Note that in fact it can be showed that is convex and -smooth if and only if for any , one has

and this inequality is often used as a definition. (The fact that the above inequality implies -smoothness follows from Lemma 2 below.)

The next lemma shows that for smooth convex functions the basic inequality for subgradients can be improved.

Lemma 2Let be a convex and -smooth function on . Then for any , one has

*Proof: *The proof is quite clever. Let . Note that is convex and -smooth. Remark also that a the minimizer of . Thus, using Lemma 1, one has:

which concludes the proof.

We can now prove the following result.

Assume that satisfies (i)-(ii)-(iii). Then Gradient Descent with satisfies

Theorem

Note that here the step-size is much larger than for the Projected Subgradient Descent. This makes sense since for smooth functions one can ‘trust’ the size of the gradient to be a good indicator of how big of a step one can make (this is definitely not the case for non-smooth functions!). Also remark that the decision point is simply the last point visited, again in contrast with what happened for Projected Subgradient Descent.

*Proof: *Using Lemma 1 and the definition of the method one has

In particular, denoting , this shows:

Of course one also has by convexity

We will prove that is decreasing with , which with the two above displays will imply

Let us see how to use this last inequality to conclude the proof. Let , then

The last inequality concludes the proof since by -smoothness (and Lemma 1) one has

Thus it only remains to show that is decreasing with . Note that for -smooth function, using Lemma 2 one immediately gets

We use this as follows

which concludes the proof.

On the contrary to what happened in the non-smooth case, here the gradient descent idea does not attain a rate that matches the lower bound ( v.s. ). In the next lecture we will see how to fix this situation and we will derive a highly non-trivial algorithm with a convergence rate.

## By Grigory Yaroslavtsev October 28, 2015 - 5:39 pm

Great post! I am going to use this in my big data class: http://grigory.us/big-data-class.html