ORF523: Strong convexity

Today we will talk about another property of convex functions that can significantly speed-up the convergence of first-order methods: strong convexity. We say that $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is $\alpha$ –strongly convex if it satisfies

(1) $\begin{equation*} f(x) - f(y) \leq \nabla f(x)^{\top} (x - y) - \frac{\alpha}{2} \|x - y \|^2 . \end{equation*}$

Of course this definition does not require differentiability of the function $f$ , and one can replace $\nabla f(x)$ in the inequality above by $g \in \partial f(x)$ . It is immediate to verify that a function $f$ is $\alpha$ -strongly convex if and only if $x \mapsto f(x) - \frac{\alpha}{2} \|x\|^2$ is convex.

Note that (1) can be interpreted as follows: at any point $x$ one can find a (convex) quadratic lower bound $q_x^-(y) = f(x) + \nabla f(x)^{\top} (y - x) + \frac{\alpha}{2} \|x - y \|^2$ to the function $f$ , i.e. $q_x^-(y) \leq f(y), \forall y \in \mathbb{R}^n$ (and $q_x^-(x) = f(x)$ ). Thus in some sense strong convexity is a dual assumption to the smoothness assumption from previous lectures. Indeed recall that smoothness can be defined via the inequality:

$f(x) - f(y) \leq \nabla f(y)^{\top} (x - y) + \frac{\beta}{2} \|x - y \|^2 ,$

which implies that at any point $y$ one can find a (convex) quadratic upper bound $q_y^+(x) = f(y) + \nabla f(y)^{\top} (x - y) + \frac{\beta}{2} \|x - y \|^2$ to the function $f$ , i.e. $q_y^+(x) \geq f(x), \forall x \in \mathbb{R}^n$ (and $q_y^+(y) = f(y)$ ). In fact we will see later a precise sense in which smoothness and strong convexity are dual notions (via Fenchel duality). Remark also that clearly one always has $\beta \geq \alpha$ .

Projected Subgradient Descent for strongly convex and Lipschitz functions

In this section we investigate the setting where $f$ is strongly convex but potentially non-smooth. As we have already seen in a previous lecture, in the case of non-smooth functions we have to project back on the set where we control the norm of the gradients. Precisely let us assume that $\mathcal{X}$ is a compact and convex set such that $\forall x \in \mathcal{X}, \forall g \in \partial f(x), \|g\| \leq L$ . We consider the Projected Subgradient Descent algorithm with time-varying step size, that is

$\begin{align*} & y_{t+1} = x_t - \eta_t g_t , \ \text{where} \ g_t \in \partial f(x_t) \\ & x_{t+1} = \mathrm{argmin}_{x \in \mathcal{X}} \|x - y_{t+1}\|. \end{align*}$

The following result is extracted from a recent paper of Simon Lacoste-Julien, Mark Schmidt, and Francis Bach.

Theorem Let $\eta_s = \frac{2}{\alpha (s+1)}$ , then Projected Subgradient Descent satisfies for

\noindent $\bar{x}_t \in \left\{\mathrm{argmin}_{1 \leq s \leq t} f(x_s) ; \sum_{s=1}^t \frac{2 s}{t(t-1)} x_s \right\}$ ,

$f (\bar{x}_t ) - \min_{x \in \mathcal{X}} f(x) \leq \frac{2 L^2}{\alpha (t+1)} .$

Note that one can immediately see from the analysis in this lecture that the rate is optimal. Indeed, one can always find a function $f$ and set $\mathcal{X}$ (an $\ell_2$ ball) that satisfies the above assumptions and such that no black-box procedure can go at a rate faster than $\frac{L^2}{8 \alpha t}$ for $t \leq n$ (in fact the constant $1/8$ can be improved to $1/2$ ).

Proof: Let $x^* \in \mathrm{argmin}_{x \in \mathcal{X}} f(x)$ . Coming back to our original analysis of Projected Subgradient Descent and using the strong convexity assumption one immediately obtains

$f(x_s) - f(x^*) \leq \frac{\eta_s}{2} L^2 + \left( \frac{1}{2 \eta_s} - \frac{\alpha}{2} \right) \|x_s - x^*\|^2 - \frac{1}{2 \eta_s} \|x_{s+1} - x^*\|^2 .$

Multiplying this inequality by $s$ yields

$s( f(x_s) - f(x^*) ) \leq \frac{L^2}{\alpha} + \frac{\alpha}{4} \bigg( s(s-1) \|x_s - x^*\|^2 - s (s+1) \|x_{s+1} - x^*\|^2 \bigg),$

Now sum the resulting inequality over $s=1$ to $s=t$ , and apply Jensen’s inequality to obtain the claimed statement.

$\Box$

Gradient Descent for strongly convex and smooth functions

As will see now, having both strong convexity and smoothness allows for a drastic improvement in the convergence rate. The key observation is the following lemma.

Lemma Let $f$ be $\beta$ -smooth and $\alpha$ -strongly convex. Then for all $x, y \in \mathbb{R}^n$ , one has

$(\nabla f(x) - \nabla f(y))^{\top} (x - y) \geq \frac{\alpha \beta}{\beta + \alpha} \|x-y\|^2 + \frac{1}{\beta + \alpha} \|\nabla f(x) - \nabla f(y)\|^2 .$

Proof: Using the definitions it is easy to prove that $\phi(x) = f(x) - \frac{\alpha}{2} \|x\|^2$ is convex and $(\beta-\alpha)$ -smooth, and thus using a result from the previous lecture one has

$(\nabla \phi(x) - \nabla \phi(y))^{\top} (x - y) \geq \frac{1}{\beta - \alpha} \|\nabla \phi(x) - \nabla \phi(y)\|^2 ,$

which gives the claimed result with straightforward computations. (Note that if $\alpha = \beta$ then one just has to apply directly the above inequality to $f$ .)

$\Box$

Theorem Let $f$ be $\beta$ -smooth and $\alpha$ -strongly convex, and let $Q = \frac{\beta}{\alpha}$ be the condition number of $f$ . Then Gradient Descent with $\eta = \frac{2}{\alpha + \beta}$ satisfies

$f(x_t) - f(x^*) \leq \frac{\beta}{2} \left(\frac{Q - 1}{Q+1}\right)^{2 (t-1)} \|x_1 - x^*\|^2 .$

Proof: First note that by $\beta$ -smoothness one has

$f(x_t) - f(x^*) \leq \frac{\beta}{2} \|x_t - x^*\|^2 .$

Now using the previous lemma one obtains

$\begin{eqnarray*} \|x_{t} - x^*\|^2& = & \|x_{t-1} - \eta \nabla f(x_{t-1}) - x^*\|^2 \\ & = & \|x_{t-1} - x^*\|^2 - 2 \eta \nabla f(x_{t-1})^{\top} (x_{t-1} - x^*) + \eta^2 \|\nabla f(x_{t-1})\|^2 \\ & \leq & \left(1 - 2 \frac{\eta \alpha \beta}{\beta + \alpha}\right)\|x_{t-1} - x^*\|^2 + \left(\eta^2 - 2 \frac{\eta}{\beta + \alpha}\right) \|\nabla f(x_{t-1})\|^2 \\ & = & \left(\frac{Q - 1}{Q+1}\right)^2 \|x_{t-1} - x^*\|^2 , \end{eqnarray*}$

which concludes the proof.

$\Box$

Oracle lower bound for strongly convex and smooth functions

To simplify the proof of the next theorem we will consider the case of $\mathbb{R}^n$ where $n \to +\infty$ . More precisely we assume now that we are working in $\ell_2 = \{ x = (x(n))_{n \in \mathbb{N}} : \sum_{i=1}^{+\infty} x(i) < + \infty\}$ rather than in $\mathbb{R}^n$ . Note that in fact everything that we have done so far for first-order methods could have been done in an arbitrary Hilbert space $\mathcal{H}$ rather than in $\mathbb{R}^n$ , and we chose to work in $\mathbb{R}^n$ only for the clarity of exposition.

Theorem For any black-box optimization procedure satisfying

$x_{s+1} \in x_1 + \mathrm{Span}(\nabla f(x_1), \hdots, \nabla f(x_s)) ,$

there exists a $\beta$ -smooth and $\alpha$ -strongly convex function $f: \ell_2 \rightarrow \mathbb{R}$ such that for any $t \geq 1$ one has

$f(x_t) - f(x^*) \geq \frac{\alpha}{2} \left(\frac{\sqrt{Q} - 1}{\sqrt{Q}+1}\right)^{2 (t-1)} \|x_1 - x^*\|^2 .$

A consequence of this result is that the bound we derived in the previous section for Gradient Descent is not optimal, as the rate had a similar form than the above lower bound but with $Q$ instead $\sqrt{Q}$ (which can be an important difference for large values of $Q$ ). One can obtain a matching upper bound by resorting to Nesterov’s Accelerated Gradient Descent instead of the plain Gradient Descent, and we refer the interested reader to the book by Nesterov for more details on this.

Proof: The overall argument is similar to what we did in the lower bound for smooth functions, but the details are different. As usual we assume without loss of generality that $x_1 = 0$ . Let $A : \ell_2 \rightarrow \ell_2$ be the linear operator that corresponds to the infinite tridiagonal matrix with $2$ on the diagonal and $-1$ on the upper and lower diagonals. Let $Q = \beta / \alpha$ . We consider now the following function:

$f(x) = \frac{\alpha (Q-1)}{8} \langle Ax, x\rangle + \frac{\alpha}{2} \|x\|^2 .$

We already proved that $0 \preceq A \preceq 4 I$ which easily implies that $f$ is $\alpha$ -strongly convex and $\beta$ -smooth. Now as always the key observation is that for this function, thanks to our assumption on the black-box procedure, one necessarily has $x_t(i) = 0, \forall i \geq t$ . This implies in particular:

$\|x_t - x^*\|^2 \geq \sum_{i=t}^{+\infty} x^*(i)^2 .$

Furthermore since $f$ is $\alpha$ -strongly convex, one has

$f(x_t) - f(x^*) \geq \frac{\alpha}{2} \|x_t - x^*\|^2 .$

Thus it only remains to compute $x^*$ . This can be done by differentiating $f$ and setting the gradient to $0$ , which gives the following infinite set of equations

$\begin{align*} & 1 - 2 \frac{Q+1}{Q-1} x^*(1) + x^*(2) = 0 , \\ & x^*(k-1) - 2 \frac{Q+1}{Q-1} x^*(k) + x^*(k+1) = 0, \forall k \geq 2 . \end{align*}$

It is easy to verify that $x^*$ defined by $x^*(i) = \left(\frac{\sqrt{Q} - 1}{\sqrt{Q} + 1}\right)^i$ satisfy this infinite set of equations, and the conclusion of the theorem then follows by straightforward computations.

$\Box$

4 Responses to "ORF523: Strong convexity"

By Bo February 7, 2018 - 3:48 pm

Prof. Bubeck

Can you list some methods to estimate the Lipschitz constant? Thanks a lot!
By joe September 14, 2016 - 9:24 am

Hi for me it is unclear what the first symbol after the first inequality means as you only define f and nothing else. It would be nice if you could clarify 🙂
By Sara August 13, 2013 - 4:19 pm

Could you give a reference for the Lemma in the Gradient Descent for strongly convex and smooth functions section?
By zcc August 12, 2013 - 8:46 pm

Does the example function f in oracle lower bound contain a linear term, i.e. has an additional \frac{\alpha(Q-1)}{4} x^Te_1?

ORF523: Strong convexity

4 Responses to "ORF523: Strong convexity"

By Bo February 7, 2018 - 3:48 pm

By joe September 14, 2016 - 9:24 am

By Sara August 13, 2013 - 4:19 pm

By zcc August 12, 2013 - 8:46 pm

Leave a reply

Archives

Categories

Recent Posts

Subscribe to Blog via Email

Meta

Blogroll