ORF523: Oracle complexity of smooth convex functions

In this lecture we consider the unconstrained minimization of a function f that satisfies the following requirements:

  •  (i) f admits a minimizer x^* on \mathbb{R}^n.
  • (ii) f is continuously differentiable and convex on \mathbb{R}^n.
  • (iii) f is smooth in the sense that the gradient mapping \nabla f is \beta-Lipschitz: for any x, y \in \mathbb{R}^n, one has \|\nabla f(x) - \nabla f(y) \| \leq \beta \|x - y\|.

 

Oracle lower bound in high dimension

We first prove a lower bound on the best possible dimension-uniform rate of convergence one can obtain for smooth convex functions.

Theorem For any black-box optimization procedure satisfying

    \[x_{s+1} \in x_1 + \mathrm{Span}(\nabla f(x_1), \hdots, \nabla f(x_s))\]

and for any t \leq (n-1)/2, there exists a function f that satisfies (i)-(ii)-(iii) and such that

    \[\min_{1 \leq s \leq t} f(x_s) - f^* \geq \frac{3 \beta}{32} \frac{\|x_1 - x^*\|^2}{(t+1)^2} .\]

Proof: Without loss of generality we assume that x_1 = 0. We denote here x(i) for the i^{th} coordinate of x \in \mathbb{R}^n. Let e_1, \hdots, e_n be the canonical basis of \mathbb{R}^n, and for k \leq n let A_k \in \mathbb{R}^{n \times n} be the symmetric and tridiagonal matrix defined by

    \[(A_k)_{i,j} = \left\{\begin{array}{ll} 2, & i = j, i \leq k \\ -1, & j \in \{i-1, i+1\}, i \leq k, j \neq k+1\\ 0, & \text{otherwise}. \end{array}\right.\]

It is easy to verify that 0 \preceq A_k \preceq 4 I_n since

    \[x^{\top} A_k x = 2 \sum_{i=1}^k x(i)^2 - 2 \sum_{i=1}^{k-1} x(i) x(i+1) = x(1)^2 + x(k)^2 + \sum_{i=1}^{k-1} (x(i) - x(i+1))^2 .\]
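As a quick numerical sanity check (a sketch, not part of the proof; the helper `A` below is an illustrative builder for A_k), one can form the matrix and verify both the quadratic-form identity and the bounds 0 \preceq A_k \preceq 4 I_n:

```python
import numpy as np

def A(k, n):
    # Tridiagonal A_k: 2 on the first k diagonal entries, -1 on the
    # adjacent off-diagonals within the top-left k x k block, 0 elsewhere.
    M = np.zeros((n, n))
    for i in range(k):
        M[i, i] = 2.0
        if i + 1 < k:
            M[i, i + 1] = M[i + 1, i] = -1.0
    return M

n, k = 10, 7
x = np.random.default_rng(0).standard_normal(n)

# quadratic-form identity: x^T A_k x = x(1)^2 + x(k)^2 + sum_i (x(i) - x(i+1))^2
lhs = x @ A(k, n) @ x
rhs = x[0] ** 2 + x[k - 1] ** 2 + sum((x[i] - x[i + 1]) ** 2 for i in range(k - 1))
assert abs(lhs - rhs) < 1e-9

# hence 0 <= A_k <= 4 I_n in the PSD order
eigs = np.linalg.eigvalsh(A(k, n))
assert eigs.min() >= -1e-12 and eigs.max() <= 4 + 1e-12
```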

We consider now the following \beta-smooth convex function:

    \[f(x) = \frac{\beta}{8} x^{\top} A_{2 t + 1} x - \frac{\beta}{4} x^{\top} e_1 .\]

Similarly to what happened in the proof of the oracle lower bound of the previous lecture, one can see here too that x_s must lie in the linear span of e_1, \hdots, e_{s-1} (because of our assumption on the black-box procedure). In particular for s \leq t we necessarily have x_s(i) = 0 for i=s, \hdots, n, which implies x_s^{\top} A_{2 t+1} x_s = x_s^{\top} A_{s} x_s. In other words, if we denote

    \[f_k(x) = \frac{\beta}{8} x^{\top} A_{k} x - \frac{\beta}{4} x^{\top} e_1 ,\]

then we just proved that

    \[f(x_s) - f^* = f_s(x_s) - f_{2t+1}^* \geq f_{s}^* - f_{2 t + 1}^* \geq f_{t}^* - f_{2 t + 1}^* .\]

Thus it simply remains to compute the minimizer x^*_k of f_k, its norm, and the corresponding function value f_k^*.

The point x^*_k is the unique solution in the span of e_1, \hdots, e_k of A_k x = e_1. It is easy to verify that it is defined by x^*_k(i) = 1 - \frac{i}{k+1} for i=1, \hdots, k. Thus we immediately have:

    \[f^*_k = \frac{\beta}{8} (x^*_k)^{\top} A_{k} x^*_k - \frac{\beta}{4} (x^*_k)^{\top} e_1 = - \frac{\beta}{8} (x^*_k)^{\top} e_1 = - \frac{\beta}{8} \left(1 - \frac{1}{k+1}\right) .\]

Furthermore note that

    \[\|x^*_k\|^2 = \sum_{i=1}^k \left(1 - \frac{i}{k+1}\right)^2 = \sum_{i=1}^k \left( \frac{i}{k+1}\right)^2 \leq \frac{k+1}{3} .\]

Thus, using the bound \|x^*_{2 t + 1}\|^2 \leq \frac{2 t + 2}{3}, one obtains:

    \[f_{t}^* - f_{2 t+1}^* = \frac{\beta}{8} \left(\frac{1}{t+1} - \frac{1}{2 t + 2} \right) = \frac{\beta}{16 (t+1)} \geq \frac{3 \beta}{32} \frac{\|x^*_{2 t + 1}\|^2}{(t+1)^2} .\]

Since x_1 = 0 and x^* = x^*_{2 t + 1}, this concludes the proof.

\Box
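The closed-form expressions for x^*_k, f^*_k, and the norm bound used in this proof can be verified numerically; a minimal sketch (the helper `A` below simply rebuilds the matrix A_k of the construction, and the constants are arbitrary):

```python
import numpy as np

def A(k, n):
    # Rebuild the tridiagonal matrix A_k from the lower-bound construction.
    M = np.zeros((n, n))
    for i in range(k):
        M[i, i] = 2.0
        if i + 1 < k:
            M[i, i + 1] = M[i + 1, i] = -1.0
    return M

beta, n, k = 1.0, 12, 5
e1 = np.zeros(n); e1[0] = 1.0

# x*_k(i) = 1 - i/(k+1) for i = 1, ..., k, and 0 afterwards
xstar = np.array([1 - i / (k + 1) for i in range(1, k + 1)] + [0.0] * (n - k))
assert np.allclose(A(k, n) @ xstar, e1)                # solves A_k x = e_1

f_k = beta / 8 * xstar @ A(k, n) @ xstar - beta / 4 * xstar @ e1
assert np.isclose(f_k, -beta / 8 * (1 - 1 / (k + 1)))  # closed-form f*_k
assert xstar @ xstar <= (k + 1) / 3                    # norm bound
```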

Analysis of Gradient Descent for smooth functions

We analyze now the plain Gradient Descent algorithm that iterates the equation x_{t+1} = x_t - \eta \nabla f(x_t). The first key observation is the following lemma which shows that a smooth function (not necessarily convex) is ‘sandwiched’ between two quadratic functions.

Lemma 1 Let f be a \beta-smooth function on \mathbb{R}^n. Then for any x, y \in \mathbb{R}^n, one has

    \[|f(x) - f(y) - \nabla f(y)^{\top} (x - y)| \leq \frac{\beta}{2} \|x - y\|^2 .\]

Proof: If f were convex, one could use the definition of subgradients, the Cauchy-Schwarz inequality, and \beta-smoothness to obtain

    \[0 \leq f(x) - f(y) - \nabla f(y)^{\top} (x - y) \leq (\nabla f(x) - \nabla f(y))^{\top} (x - y) \leq \beta \|x-y\|^2.\]

A slightly more clever proof applies Cauchy-Schwarz ‘continuously’, which improves the constant and also gets rid of the convexity assumption. The key idea is to represent the quantity f(x) - f(y) as an integral, which gives:

    \begin{align*} & |f(x) - f(y) - \nabla f(y)^{\top} (x - y)| \\ & = \left|\int_0^1 \nabla f(y + t(x-y))^{\top} (x-y) dt - \nabla f(y)^{\top} (x - y)\right| \\ & \leq \int_0^1 \beta t \|x-y\|^2 dt = \frac{\beta}{2} \|x-y\|^2 . \end{align*}

\Box

Note that in fact it can be shown that f is convex and \beta-smooth if and only if for any x, y \in \mathbb{R}^n, one has

    \[0 \leq f(x) - f(y) - \nabla f(y)^{\top} (x - y) \leq \frac{\beta}{2} \|x - y\|^2 ,\]

and this inequality is often used as a definition. (The fact that the above inequality implies \beta-smoothness follows from Lemma 2 below.)
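Lemma 1 requires no convexity; a quick numeric illustration (a sketch with an arbitrary test function) on f(x) = \sin(x), which is 1-smooth since |f''| \leq 1, but not convex:

```python
import numpy as np

beta = 1.0  # sin has a 1-Lipschitz derivative, hence beta = 1
rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.uniform(-10.0, 10.0, size=2)
    # |f(x) - f(y) - f'(y)(x - y)| <= (beta/2) (x - y)^2
    gap = abs(np.sin(x) - np.sin(y) - np.cos(y) * (x - y))
    assert gap <= beta / 2 * (x - y) ** 2 + 1e-12
```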

The next lemma shows that for smooth convex functions the basic inequality for subgradients can be improved.

Lemma 2 Let f be a convex and \beta-smooth function on \mathbb{R}^n. Then for any x, y \in \mathbb{R}^n, one has

    \[f(x) - f(y) \leq \nabla f(x)^{\top} (x - y) - \frac{1}{2 \beta} \|\nabla f(x) - \nabla f(y)\|^2 .\]

Proof: The proof is quite clever. Let \phi(y) = f(y) - \nabla f(x)^{\top} y. Note that \phi is convex and \beta-smooth. Remark also that x is a minimizer of \phi, since \nabla \phi(x) = \nabla f(x) - \nabla f(x) = 0. Thus, using Lemma 1, one has:

    \begin{align*} & f(x) - f(y) - \nabla f(x)^{\top} (x - y) \\ & = \phi(x) - \phi(y) \\ & \leq \phi\left(y - \frac{1}{\beta} \nabla \phi(y)\right) - \phi(y) \\ & \leq \nabla \phi(y)^{\top} \left(y - \frac{1}{\beta} \nabla \phi(y) - y\right) + \frac{\beta}{2} \left\|y - \frac{1}{\beta} \nabla \phi(y) - y\right\|^2 \\ & = - \frac{1}{2 \beta} \| \nabla \phi(y) \|^2 \\ & = - \frac{1}{2 \beta} \|\nabla f(x) - \nabla f(y)\|^2 , \end{align*}

which concludes the proof.

\Box

 We can now prove the following result.


Theorem 
Assume that f satisfies (i)-(ii)-(iii). Then Gradient Descent with \eta = \frac{1}{\beta} satisfies

    \[f(x_t) - f^* \leq \frac{2 \beta \|x_1 - x^*\|^2}{t+3} .\]

Note that here the step-size is much larger than for the Projected Subgradient Descent. This makes sense since for smooth functions one can ‘trust’ the size of the gradient to be a good indicator of how big of a step one can make (this is definitely not the case for non-smooth functions!). Also remark that the decision point is simply the last point visited, again in contrast with what happened for Projected Subgradient Descent.
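A minimal numeric sketch of this guarantee (not the worst-case construction from the lower bound; just an arbitrary convex quadratic with x^* = 0 and f^* = 0):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((20, 20))
Q = B @ B.T                          # positive semidefinite Hessian
beta = np.linalg.eigvalsh(Q).max()   # smoothness constant of f

f = lambda x: 0.5 * x @ Q @ x        # minimized at x* = 0 with f* = 0
grad = lambda x: Q @ x

x = rng.standard_normal(20)          # x_1
r2 = x @ x                           # ||x_1 - x*||^2
for t in range(1, 101):
    # the guarantee f(x_t) - f* <= 2 beta ||x_1 - x*||^2 / (t + 3)
    assert f(x) <= 2 * beta * r2 / (t + 3) + 1e-9
    x = x - grad(x) / beta           # gradient step with eta = 1/beta
```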

Proof: Using Lemma 1 and the definition of the method one has

    \begin{eqnarray*} f(x_{s+1}) - f(x_s) & \leq & \nabla f(x_s)^{\top} (x_{s+1} - x_s) + \frac{\beta}{2} \|x_{s+1} - x_s\|^2 \\ & = & \left(- \eta + \frac{\beta}{2} \eta^2 \right) \|\nabla f(x_s)\|^2 = - \frac{1}{2 \beta} \|\nabla f(x_s)\|^2. \end{eqnarray*}

In particular, denoting \delta_s = f(x_s) - f^*, this shows:

    \[\delta_{s+1} \leq \delta_s - \frac{1}{2 \beta} \|\nabla f(x_s)\|^2.\]

Of course one also has by convexity

    \[\delta_s \leq \nabla f(x_s)^{\top} (x_s - x^*) \leq \|x_s - x^*\| \|\nabla f(x_s)\| .\]

We will prove that \|x_s - x^*\| is decreasing with s, which together with the two displays above implies

    \[\delta_{s+1} \leq \delta_s - \frac{1}{2 \beta \|x_1 - x^*\|^2} \delta_s^2.\]

Let us see how to use this last inequality to conclude the proof. Let \omega = \frac{1}{2 \beta \|x_1 - x^*\|^2}, then

    \begin{eqnarray*} \omega \delta_s^2 + \delta_{s+1} \leq \delta_s \Leftrightarrow \omega \frac{\delta_s}{\delta_{s+1}} + \frac{1}{\delta_{s}} \leq \frac{1}{\delta_{s+1}} & \Rightarrow & \frac{1}{\delta_{s+1}} - \frac{1}{\delta_{s}} \geq \omega \quad \left(\text{since } \frac{\delta_s}{\delta_{s+1}} \geq 1\right) \\ & \Rightarrow & \frac{1}{\delta_t} \geq \omega (t-1) + \frac{1}{f(x_1) - f^*} . \end{eqnarray*}

The last inequality concludes the proof since by \beta-smoothness (Lemma 1) and \nabla f(x^*) = 0 one has

    \[f(x_1) - f(x^*) \leq \nabla f(x^*)^{\top} (x_1 - x^*) + \frac{\beta}{2} \|x_1 - x^*\|^2 = \frac{\beta}{2} \|x_1 - x^*\|^2 = \frac{1}{4 \omega} ,\]

and thus \frac{1}{\delta_t} \geq \omega (t - 1) + 4 \omega = \omega (t + 3).
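The telescoping step above can be checked by iterating the worst case \delta_{s+1} = \delta_s - \omega \delta_s^2 numerically (a sketch with arbitrary illustrative constants chosen so that \delta_1 \leq 1/(4 \omega)):

```python
# Worst case of the recursion delta_{s+1} <= delta_s - omega * delta_s^2,
# with illustrative constants satisfying delta_1 <= 1/(4 * omega).
omega, delta1 = 0.05, 3.0
delta = delta1
for t in range(1, 200):
    # the telescoped bound: 1/delta_t >= omega * (t - 1) + 1/delta_1
    assert 1 / delta >= omega * (t - 1) + 1 / delta1 - 1e-12
    delta = delta - omega * delta ** 2
```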

Thus it only remains to show that \|x_s - x^*\| is decreasing with s. Note that for a convex and \beta-smooth function, applying Lemma 2 twice (once with the roles of x and y exchanged) and summing the two inequalities immediately gives

    \[(\nabla f(x) - \nabla f(y))^{\top} (x - y) \geq \frac{1}{\beta} \|\nabla f(x) - \nabla f(y)\|^2 .\]

We use this as follows

    \begin{eqnarray*} \|x_{s+1} - x^*\|^2& = & \|x_{s} - \eta \nabla f(x_s) - x^*\|^2 \\ & = & \|x_{s} - x^*\|^2 - 2 \eta \nabla f(x_s)^{\top} (x_s - x^*) + \eta^2 \|\nabla f(x_s)\|^2 \\ & \leq & \|x_{s} - x^*\|^2 + \left(\eta^2 - 2 \frac{\eta}{\beta}\right) \|\nabla f(x_s)\|^2 \\ & \leq & \|x_{s} - x^*\|^2 , \end{eqnarray*}

which concludes the proof.

\Box
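Both facts used in this last step, the co-coercivity inequality and the monotone decrease of \|x_s - x^*\| along the iterates, can be illustrated on a convex quadratic (a sketch; the matrix Q and starting points are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((15, 15))
Q = B @ B.T                          # f(x) = 0.5 x^T Q x, grad f(x) = Q x
beta = np.linalg.eigvalsh(Q).max()   # smoothness constant

# co-coercivity: (grad f(x) - grad f(y))^T (x - y) >= ||grad f(x) - grad f(y)||^2 / beta
x, y = rng.standard_normal(15), rng.standard_normal(15)
g = Q @ (x - y)
assert g @ (x - y) >= (g @ g) / beta - 1e-9

# ||x_s - x*|| decreases along gradient descent with eta = 1/beta (here x* = 0)
x = rng.standard_normal(15)
for _ in range(50):
    x_next = x - (Q @ x) / beta
    assert x_next @ x_next <= x @ x + 1e-12
    x = x_next
```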

In contrast to what happened in the non-smooth case, here Gradient Descent does not attain a rate that matches the lower bound (1/t vs. 1/t^2). In the next lecture we will see how to fix this situation by deriving a highly non-trivial algorithm with a 1/t^2 convergence rate.
