ORF523: Oracle complexity of smooth convex functions

In this lecture we consider the unconstrained minimization of a function $f$ that satisfy the following requirements:

(i) $f$ admits a minimizer $x^*$ on $\mathbb{R}^n$ .
(ii) $f$ is continuously differentiable and convex on $\mathbb{R}^n$ .
(iii) $f$ is smooth in the sense that the gradient mapping $\nabla f$ is $\beta$ -Lipschitz: for any $x, y \in \mathbb{R}^n$ , one has $\|\nabla f(x) - \nabla f(y) \| \leq \beta \|x - y\|$ .

Oracle lower bound in high dimension

We first prove a lower bound on the best possible dimension-uniform rate of convergence one can obtain for smooth convex functions.

Theorem For any black-box optimization procedure satisfying

$x_{s+1} \in x_1 + \mathrm{Span}(\nabla f(x_1), \hdots, \nabla f(x_s))$

and for any $t \leq (n-1)/2$ , there exists a function $f$ that satisfies (i)-(ii)-(iii) and such that

$\min_{1 \leq s \leq t} f(x_s) - f^* \geq \frac{3 \beta}{32} \frac{\|x_1 - x^*\|^2}{(t+1)^2} .$

Proof: Without loss of generality we assume that $x_1 = 0$ . We denote here $x(i)$ for the $i^{th}$ coordinate of $x \in \mathbb{R}^n$ . Let $e_1, \hdots, e_n$ be the canonical basis of $\mathbb{R}^n$ , and for $k \leq n$ let $A_k \in \mathbb{R}^{n \times n}$ be the symmetric and tridiagonal matrix defined by

$(A_k)_{i,j} = \left\{\begin{array}{ll} 2, & i = j, i \leq k \\ -1, & j \in \{i-1, i+1\}, i \leq k, j \neq k+1\\ 0, & \text{otherwise}. \end{array}\right.$

It is easy to verify that $0 \preceq A_k \preceq 4 I_n$ since

$x^{\top} A_k x = 2 \sum_{i=1}^k x(i)^2 - 2 \sum_{i=1}^{k-1} x(i) x(i+1) = x(1)^2 + x(k)^2 + \sum_{i=1}^{k-1} (x(i) - x(i+1))^2 .$

We consider now the following $\beta$ -smooth convex function:

$f(x) = \frac{\beta}{8} x^{\top} A_{2 t + 1} x - \frac{\beta}{4} x^{\top} e_1 .$

Similarly to what happened in the proof of the oracle lower bound of the previous lecture, one can see here too that $x_s$ must lie in the linear span of $e_1, \hdots, e_{s-1}$ (because of our assumption on the black-box procedure). In particular for $s \leq t$ we necessarily have $x_s(i) = 0$ for $i=s, \hdots, n$ , which implies $x_s^{\top} A_{2 t+1} x_s = x_s^{\top} A_{s} x_s$ . In other words, if we denote

$f_k(x) = \frac{\beta}{8} x^{\top} A_{k} x - \frac{\beta}{4} x^{\top} e_1 ,$

then we just proved that

$f(x_s) - f^* = f_s(x_s) - f_{2t+1}^* \geq f_{s}^* - f_{2 t + 1}^* \geq f_{t}^* - f_{2 t + 1}^* .$

Thus it simply remains to compute the minimizer $x^*_k$ of $f_k$ , its norm, and the corresponding function value $f_k^*$ .

The point $x^*_k$ is the unique solution in the span of $e_1, \hdots, e_k$ of $A_k x = e_1$ . It is easy to verify that it is defined by $x^*_k(i) = 1 - \frac{i}{k+1}$ for $i=1, \hdots, k$ . Thus we immediately have:

$f^*_k = \frac{\beta}{8} (x^*_k)^{\top} A_{k} x^*_k - \frac{\beta}{4} (x^*_k)^{\top} e_1 = - \frac{\beta}{8} (x^*_k)^{\top} e_1 = - \frac{\beta}{8} \left(1 - \frac{1}{k+1}\right) .$

Furthermore note that

$\|x^*_k\|^2 = \sum_{i=1}^k \left(1 - \frac{i}{k+1}\right)^2 = \sum_{i=1}^k \left( \frac{i}{k+1}\right)^2 \leq \frac{k+1}{3} .$

Thus one obtains:

$f_{t}^* - f_{2 t+1}^* = \frac{\beta}{8} \left(\frac{1}{t+1} - \frac{1}{2 t + 2} \right) = \frac{3 \beta}{32} \frac{\|x^*_{2 t + 1}\|^2}{(t+1)^2},$

which concludes the proof.

$\Box$

Analysis of Gradient Descent for smooth functions

We analyze now the plain Gradient Descent algorithm that iterates the equation $x_{t+1} = x_t - \eta \nabla f(x_t)$ . The first key observation is the following lemma which shows that a smooth function (not necessarily convex) is ‘sandwiched’ between two quadratic functions.

Lemma 1 Let $f$ be a $\beta$ -smooth function on $\mathbb{R}^n$ . Then for any $x, y \in \mathbb{R}^n$ , one has

$|f(x) - f(y) - \nabla f(y)^{\top} (x - y)| \leq \frac{\beta}{2} \|x - y\|^2 .$

Proof: If the function $f$ was convex one could use the definition of subgradient, Cauchy-Schwarz inequality and $\beta$ -smoothness to obtain

$0 \leq f(x) - f(y) - \nabla f(y)^{\top} (x - y) \leq (\nabla f(x) - \nabla f(y))^{\top} (x - y) \leq \beta \|x-y\|^2.$

A slightly more clever proof applies Cauchy-Schwarz ‘continuously’, which improves the constant and also get rid of the convexity condition. The key idea is to represent the quantity $f(x) - f(y)$ as an integral, this gives:

$\begin{align*} & |f(x) - f(y) - \nabla f(y)^{\top} (x - y)| \\ & = \left|\int_0^1 \nabla f(y + t(x-y))^{\top} (x-y) dt - \nabla f(y)^{\top} (x - y)\right| \\ & \leq \int_0^1 \beta t \|x-y\|^2 dt = \frac{\beta}{2} \|x-y\|^2 . \end{align*}$

$\Box$

Note that in fact it can be showed that $f$ is convex and $\beta$ -smooth if and only if for any $x, y \in \mathbb{R}^n$ , one has

$0 \leq f(x) - f(y) - \nabla f(y)^{\top} (x - y) \leq \frac{\beta}{2} \|x - y\|^2 ,$

and this inequality is often used as a definition. (The fact that the above inequality implies $\beta$ -smoothness follows from Lemma 2 below.)

The next lemma shows that for smooth convex functions the basic inequality for subgradients can be improved.

Lemma 2 Let $f$ be a convex and $\beta$ -smooth function on $\mathbb{R}^n$ . Then for any $x, y \in \mathbb{R}^n$ , one has

$f(x) - f(y) \leq \nabla f(x)^{\top} (x - y) - \frac{1}{2 \beta} \|\nabla f(x) - \nabla f(y)\|^2 .$

Proof: The proof is quite clever. Let $\phi(y) = f(y) - \nabla f(x)^{\top} y$ . Note that $\phi$ is convex and $\beta$ -smooth. Remark also that $x$ a the minimizer of $\phi$ . Thus, using Lemma 1, one has:

$\begin{align*} & f(x) - f(y) - \nabla f(x)^{\top} (x - y) \\ & = \phi(x) - \phi(y) \\ & \leq \phi\left(y - \frac{1}{\beta} \nabla \phi(y)\right) - \phi(y) \\ & \leq \nabla \phi(y)^{\top} \left(y - \frac{1}{\beta} \nabla \phi(y) - y\right) + \frac{\beta}{2} \left\|y - \frac{1}{\beta} \nabla \phi(y) - y\right\|^2 \\ & = - \frac{1}{2 \beta} \| \nabla \phi(y) \|^2 \\ & = - \frac{1}{2 \beta} \|\nabla f(x) - \nabla f(y)\|^2 , \end{align*}$

which concludes the proof.

$\Box$

We can now prove the following result.

Theorem Assume that $f$ satisfies (i)-(ii)-(iii). Then Gradient Descent with $\eta = \frac{1}{\beta}$ satisfies

$f(x_t) - f^* \leq \frac{2 \beta \|x_1 - x^*\|^2}{t+3} .$

Note that here the step-size is much larger than for the Projected Subgradient Descent. This makes sense since for smooth functions one can ‘trust’ the size of the gradient to be a good indicator of how big of a step one can make (this is definitely not the case for non-smooth functions!). Also remark that the decision point is simply the last point visited, again in contrast with what happened for Projected Subgradient Descent.

Proof: Using Lemma 1 and the definition of the method one has

$\begin{eqnarray*} f(x_{s+1}) - f(x_s) & \leq & \nabla f(x_s)^{\top} (x_{s+1} - x_s) + \frac{\beta}{2} \|x_{s+1} - x_s\|^2 \\ & = & \left(- \eta + \frac{\beta}{2} \eta^2 \right) \|\nabla f(x_s)\|^2 = - \frac{1}{2 \beta} \|\nabla f(x_s)\|^2. \end{eqnarray*}$

In particular, denoting $\delta_s = f(x_s) - f^*$ , this shows:

$\delta_{s+1} \leq \delta_s - \frac{1}{2 \beta} \|\nabla f(x_s)\|^2.$

Of course one also has by convexity

$\delta_s \leq \nabla f(x_s)^{\top} (x_s - x^*) \leq \|x_s - x^*\| \|\nabla f(x_s)\| .$

We will prove that $\|x_s - x^*\|$ is decreasing with $s$ , which with the two above displays will imply

$\delta_{s+1} \leq \delta_s - \frac{1}{2 \beta \|x_1 - x^*\|^2} \delta_s^2.$

Let us see how to use this last inequality to conclude the proof. Let $\omega = \frac{1}{2 \beta \|x_1 - x^*\|^2}$ , then

$\begin{eqnarray*} \omega \delta_s^2 + \delta_{s+1} \leq \delta_s \Leftrightarrow \omega \frac{\delta_s}{\delta_{s+1}} + \frac{1}{\delta_{s}} \leq \frac{1}{\delta_{s+1}} & \Rightarrow & \frac{1}{\delta_{s+1}} - \frac{1}{\delta_{s}} \geq \omega \\ & \Rightarrow & \frac{1}{\delta_t} \geq \omega (t-1) + \frac{1}{f(x_1) - f^*} . \end{eqnarray*}$

The last inequality concludes the proof since by $\beta$ -smoothness (and Lemma 1) one has

$f(x_1) - f(x^*) \leq \nabla f(x^*)^{\top} (x_1 - x^*) + \frac{\beta}{2} \|x_1 - x^*\|^2 = \frac{1}{4 \omega} .$

Thus it only remains to show that $\|x_s - x^*\|$ is decreasing with $s$ . Note that for $\beta$ -smooth function, using Lemma 2 one immediately gets

$(\nabla f(x) - \nabla f(y))^{\top} (x - y) \geq \frac{1}{\beta} \|\nabla f(x) - \nabla f(y)\|^2 .$

We use this as follows

$\begin{eqnarray*} \|x_{s+1} - x^*\|^2& = & \|x_{s} - \eta \nabla f(x_s) - x^*\|^2 \\ & = & \|x_{s} - x^*\|^2 - 2 \eta \nabla f(x_s)^{\top} (x_s - x^*) + \eta^2 \|\nabla f(x_s)\|^2 \\ & \leq & \|x_{s} - x^*\|^2 + \left(\eta^2 - 2 \frac{\eta}{\beta}\right) \|\nabla f(x_s)\|^2 \\ & \leq & \|x_{s} - x^*\|^2 , \end{eqnarray*}$

which concludes the proof.

$\Box$

On the contrary to what happened in the non-smooth case, here the gradient descent idea does not attain a rate that matches the lower bound ( $1/t$ v.s. $1/t^2$ ). In the next lecture we will see how to fix this situation and we will derive a highly non-trivial algorithm with a $1/t^2$ convergence rate.

ORF523: Oracle complexity of smooth convex functions

One Response to "ORF523: Oracle complexity of smooth convex functions"

By Grigory Yaroslavtsev October 28, 2015 - 5:39 pm

Leave a reply

Archives

Categories

Recent Posts

Subscribe to Blog via Email

Meta

Blogroll