ORF523: Strong convexity

Today we will talk about another property of convex functions that can significantly speed up the convergence of first-order methods: strong convexity. We say that f: \mathbb{R}^n \rightarrow \mathbb{R} is \alpha-strongly convex if it satisfies

(1)   \begin{equation*} f(x) - f(y) \leq \nabla f(x)^{\top} (x - y) - \frac{\alpha}{2} \|x - y \|^2 . \end{equation*}

Of course this definition does not require differentiability of the function f, and one can replace \nabla f(x) in the inequality above by g \in \partial f(x). It is immediate to verify that a function f is \alpha-strongly convex if and only if x \mapsto f(x) - \frac{\alpha}{2} \|x\|^2 is convex.
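For a quadratic, inequality (1) reduces to an eigenvalue condition, so it can be checked numerically. Below is a minimal sketch (not from the lecture), assuming the illustrative quadratic f(x) = (1/2) x^T M x with a hypothetical positive definite M, for which \alpha = \lambda_{\min}(M):

```python
import math
import random

# Numeric sanity check of inequality (1) for the illustrative quadratic
# f(x) = (1/2) x^T M x, which is alpha-strongly convex with
# alpha = lambda_min(M). Here M is a hypothetical 2x2 positive definite matrix.
M = [[3.0, 1.0], [1.0, 2.0]]
alpha = (5 - math.sqrt(5)) / 2  # smallest eigenvalue of M

def f(x):
    return 0.5 * sum(x[i] * M[i][j] * x[j] for i in range(2) for j in range(2))

def grad(x):
    return [sum(M[i][j] * x[j] for j in range(2)) for i in range(2)]

random.seed(0)
for _ in range(1000):
    x = [random.uniform(-5, 5) for _ in range(2)]
    y = [random.uniform(-5, 5) for _ in range(2)]
    d = [x[i] - y[i] for i in range(2)]
    lhs = f(x) - f(y)
    rhs = sum(grad(x)[i] * d[i] for i in range(2)) - 0.5 * alpha * sum(s * s for s in d)
    assert lhs <= rhs + 1e-9  # inequality (1)
print("inequality (1) verified on 1000 random pairs")
```

The same check with any \alpha larger than \lambda_{\min}(M) would fail, reflecting that \lambda_{\min}(M) is the best strong convexity constant for this f.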

Note that (1) can be interpreted as follows: at any point x one can find a (convex) quadratic lower bound q_x^-(y) = f(x) + \nabla f(x)^{\top} (y - x) + \frac{\alpha}{2} \|x - y \|^2 to the function f, i.e. q_x^-(y) \leq f(y), \forall y \in \mathbb{R}^n (and q_x^-(x) = f(x)). Thus in some sense strong convexity is a dual assumption to the smoothness assumption from previous lectures. Indeed recall that smoothness can be defined via the inequality:

    \[f(x) - f(y) \leq \nabla f(y)^{\top} (x - y) + \frac{\beta}{2} \|x - y \|^2 ,\]

which implies that at any point y one can find a (convex) quadratic upper bound q_y^+(x) = f(y) + \nabla f(y)^{\top} (x - y) + \frac{\beta}{2} \|x - y \|^2 to the function f, i.e. q_y^+(x) \geq f(x), \forall x \in \mathbb{R}^n (and q_y^+(y) = f(y)). In fact we will see later a precise sense in which smoothness and strong convexity are dual notions (via Fenchel duality). Remark also that clearly one always has \beta \geq \alpha.


Projected Subgradient Descent for strongly convex and Lipschitz functions

In this section we investigate the setting where f is strongly convex but potentially non-smooth. As we have already seen in a previous lecture, in the case of non-smooth functions we have to project back onto a compact set on which we can control the norm of the subgradients. Precisely, let us assume that \mathcal{X} is a compact and convex set such that \forall x \in \mathcal{X}, \forall g \in \partial f(x), \|g\| \leq L. We consider the Projected Subgradient Descent algorithm with time-varying step size, that is

    \begin{align*} & y_{t+1} = x_t - \eta_t g_t , \ \text{where} \ g_t \in \partial f(x_t) \\ & x_{t+1} = \mathrm{argmin}_{x \in \mathcal{X}} \|x - y_{t+1}\|. \end{align*}

The following result is extracted from a recent paper of Simon Lacoste-Julien, Mark Schmidt, and Francis Bach.

Theorem Let \eta_s = \frac{2}{\alpha (s+1)}. Then Projected Subgradient Descent satisfies, for

\noindent \bar{x}_t \in \left\{\mathrm{argmin}_{1 \leq s \leq t} f(x_s) ; \sum_{s=1}^t \frac{2 s}{t(t+1)} x_s \right\},

    \[f (\bar{x}_t ) - \min_{x \in \mathcal{X}} f(x) \leq \frac{2 L^2}{\alpha (t+1)} .\]
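As a sanity check (not part of the lecture), here is a minimal sketch of the algorithm on a hypothetical one-dimensional example, f(x) = (alpha/2) x^2 + |x| on \mathcal{X} = [-1, 1], for which the theorem's bound can be verified directly:

```python
# Minimal 1-D sketch of Projected Subgradient Descent with eta_t = 2/(alpha*(t+1)).
# Hypothetical test problem: f(x) = (alpha/2) x^2 + |x| on X = [-1, 1], which is
# alpha-strongly convex with subgradients bounded by L = alpha + 1 on X, and
# whose minimizer is x* = 0 with f(x*) = 0.
alpha = 1.0
L = alpha + 1.0

def f(x):
    return 0.5 * alpha * x * x + abs(x)

def subgrad(x):
    # one valid element of the subdifferential (at 0 we pick 0)
    return alpha * x + (1.0 if x > 0 else -1.0 if x < 0 else 0.0)

def project(y):
    return max(-1.0, min(1.0, y))  # Euclidean projection onto [-1, 1]

x = 1.0          # x_1
best = f(x)      # tracks min_{1 <= s <= t} f(x_s)
T = 1000
for t in range(1, T + 1):
    x = project(x - (2.0 / (alpha * (t + 1))) * subgrad(x))
    best = min(best, f(x))

assert best <= 2 * L ** 2 / (alpha * (T + 1))  # the theorem's bound
print("best value after", T, "steps:", best)
```

The iterates oscillate around x^* = 0 with an amplitude shrinking like 1/t, consistent with the 1/t rate of the theorem.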

Note that this rate is optimal up to a constant factor: one can always find a function f and a set \mathcal{X} (an \ell_2 ball) that satisfy the above assumptions and such that no black-box procedure can achieve a rate faster than \frac{L^2}{8 \alpha t} for t \leq n (in fact the constant 1/8 can be improved to 1/2).

Proof: Let x^* \in \mathrm{argmin}_{x \in \mathcal{X}} f(x). Coming back to our original analysis of Projected Subgradient Descent and using the strong convexity assumption (which strengthens the convexity inequality to g_s^{\top} (x_s - x^*) \geq f(x_s) - f(x^*) + \frac{\alpha}{2} \|x_s - x^*\|^2), one immediately obtains

    \[f(x_s) - f(x^*) \leq \frac{\eta_s}{2} L^2 + \left( \frac{1}{2 \eta_s} - \frac{\alpha}{2} \right) \|x_s - x^*\|^2 - \frac{1}{2 \eta_s} \|x_{s+1} - x^*\|^2 .\]

Multiplying this inequality by s yields

    \[s( f(x_s) - f(x^*) ) \leq \frac{L^2}{\alpha} + \frac{\alpha}{4} \bigg( s(s-1) \|x_s - x^*\|^2 - s (s+1) \|x_{s+1} - x^*\|^2 \bigg).\]

Now sum the resulting inequality over s=1 to s=t, and apply Jensen’s inequality to obtain the claimed statement.
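In more detail, the summation step reads:

```latex
\sum_{s=1}^t s \big( f(x_s) - f(x^*) \big)
\leq \frac{t L^2}{\alpha}
   + \frac{\alpha}{4} \sum_{s=1}^t \Big( s(s-1) \|x_s - x^*\|^2 - s(s+1) \|x_{s+1} - x^*\|^2 \Big)
\leq \frac{t L^2}{\alpha} ,
```

since the last sum telescopes to - t(t+1) \|x_{t+1} - x^*\|^2 \leq 0 (the s=1 term vanishes). Dividing by \sum_{s=1}^t s = \frac{t(t+1)}{2} and applying Jensen's inequality to the weighted average (or simply taking the best iterate) gives f(\bar{x}_t) - f(x^*) \leq \frac{2 L^2}{\alpha (t+1)}.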



Gradient Descent for strongly convex and smooth functions

As we will see now, having both strong convexity and smoothness allows for a drastic improvement in the convergence rate. The key observation is the following lemma.

Lemma Let f be \beta-smooth and \alpha-strongly convex. Then for all x, y \in \mathbb{R}^n, one has

    \[(\nabla f(x) - \nabla f(y))^{\top} (x - y) \geq \frac{\alpha \beta}{\beta + \alpha} \|x-y\|^2 + \frac{1}{\beta + \alpha} \|\nabla f(x) - \nabla f(y)\|^2 .\]

Proof: Using the definitions it is easy to prove that \phi(x) = f(x) - \frac{\alpha}{2} \|x\|^2 is convex and (\beta-\alpha)-smooth, and thus using a result from the previous lecture one has

    \[(\nabla \phi(x) - \nabla \phi(y))^{\top} (x - y) \geq \frac{1}{\beta - \alpha} \|\nabla \phi(x) - \nabla \phi(y)\|^2 ,\]

which gives the claimed result with straightforward computations. (Note that if \alpha = \beta then one just has to apply directly the above inequality to f.)
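For completeness, the computation can be spelled out. Write u = \nabla f(x) - \nabla f(y) and v = x - y, so that \nabla \phi(x) - \nabla \phi(y) = u - \alpha v and the inequality above becomes

```latex
(u - \alpha v)^{\top} v
\;\geq\; \frac{1}{\beta - \alpha} \|u - \alpha v\|^2
= \frac{1}{\beta - \alpha} \big( \|u\|^2 - 2 \alpha \, u^{\top} v + \alpha^2 \|v\|^2 \big) .
```

Multiplying through by \beta - \alpha and collecting terms yields (\beta + \alpha) u^{\top} v \geq \|u\|^2 + \alpha \beta \|v\|^2, which is the claimed inequality after dividing by \beta + \alpha.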



Theorem Let f be \beta-smooth and \alpha-strongly convex, and let Q = \frac{\beta}{\alpha} be the condition number of f. Then Gradient Descent with \eta = \frac{2}{\alpha + \beta} satisfies

    \[f(x_t) - f(x^*) \leq \frac{\beta}{2} \left(\frac{Q - 1}{Q+1}\right)^{2 (t-1)} \|x_1 - x^*\|^2 .\]

Proof: First note that by \beta-smoothness one has

    \[f(x_t) - f(x^*) \leq \frac{\beta}{2} \|x_t - x^*\|^2 .\]

Now using the previous lemma one obtains

    \begin{eqnarray*} \|x_{t} - x^*\|^2& = & \|x_{t-1} - \eta \nabla f(x_{t-1}) - x^*\|^2 \\ & = & \|x_{t-1} - x^*\|^2 - 2 \eta \nabla f(x_{t-1})^{\top} (x_{t-1} - x^*) + \eta^2 \|\nabla f(x_{t-1})\|^2 \\ & \leq & \left(1 - 2 \frac{\eta \alpha \beta}{\beta + \alpha}\right)\|x_{t-1} - x^*\|^2 + \left(\eta^2 - 2 \frac{\eta}{\beta + \alpha}\right) \|\nabla f(x_{t-1})\|^2 \\ & = & \left(\frac{Q - 1}{Q+1}\right)^2 \|x_{t-1} - x^*\|^2 , \end{eqnarray*}

which concludes the proof.
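To see the theorem in action, here is a numeric sketch on a hypothetical two-dimensional quadratic (not from the lecture), checking the claimed rate at every iteration:

```python
# Numeric check of the theorem's rate on a hypothetical 2-D quadratic
# f(x) = (alpha/2) x_1^2 + (beta/2) x_2^2, which is alpha-strongly convex
# and beta-smooth with minimizer x* = 0 and f(x*) = 0.
alpha, beta = 1.0, 10.0
Q = beta / alpha
eta = 2.0 / (alpha + beta)

def f(x):
    return 0.5 * (alpha * x[0] ** 2 + beta * x[1] ** 2)

def grad(x):
    return [alpha * x[0], beta * x[1]]

x = [1.0, 1.0]                    # x_1
dist1_sq = x[0] ** 2 + x[1] ** 2  # ||x_1 - x*||^2
for t in range(1, 51):
    bound = 0.5 * beta * ((Q - 1) / (Q + 1)) ** (2 * (t - 1)) * dist1_sq
    assert f(x) <= bound + 1e-12  # the theorem's guarantee at iterate x_t
    g = grad(x)
    x = [x[0] - eta * g[0], x[1] - eta * g[1]]
print("rate verified for 50 iterations")
```

For this separable quadratic each coordinate contracts by exactly a factor (Q-1)/(Q+1) in absolute value per step, so the bound is essentially tight.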



Oracle lower bound for strongly convex and smooth functions

To simplify the proof of the next theorem we will consider the case of \mathbb{R}^n with n \to +\infty. More precisely, we now assume that we are working in \ell_2 = \{ x = (x(i))_{i \in \mathbb{N}} : \sum_{i=1}^{+\infty} x(i)^2 < + \infty\} rather than in \mathbb{R}^n. Note that in fact everything that we have done so far for first-order methods could have been done in an arbitrary Hilbert space \mathcal{H} rather than in \mathbb{R}^n, and we chose to work in \mathbb{R}^n only for clarity of exposition.

 Theorem For any black-box optimization procedure satisfying

    \[x_{s+1} \in x_1 + \mathrm{Span}(\nabla f(x_1), \hdots, \nabla f(x_s)) ,\]

there exists a \beta-smooth and \alpha-strongly convex function f: \ell_2 \rightarrow \mathbb{R} such that for any t \geq 1 one has

    \[f(x_t) - f(x^*) \geq \frac{\alpha}{2} \left(\frac{\sqrt{Q} - 1}{\sqrt{Q}+1}\right)^{2 (t-1)} \|x_1 - x^*\|^2 .\]

A consequence of this result is that the bound we derived in the previous section for Gradient Descent is not optimal, as the rate has a similar form to the above lower bound but with Q instead of \sqrt{Q} (which can be an important difference for large values of Q). One can obtain a matching upper bound by resorting to Nesterov’s Accelerated Gradient Descent instead of plain Gradient Descent, and we refer the interested reader to the book by Nesterov for more details on this.
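For illustration, here is a hedged sketch of the constant-momentum form of Nesterov's scheme (a gradient step of size 1/\beta from an auxiliary point, followed by momentum with coefficient (\sqrt{Q}-1)/(\sqrt{Q}+1); see Nesterov's book for the exact method and its analysis), run on a hypothetical ill-conditioned quadratic:

```python
import math

# Hedged sketch of the constant-momentum form of Nesterov's Accelerated
# Gradient Descent for a beta-smooth, alpha-strongly convex f; the exact
# scheme and its analysis are in Nesterov's book. Test problem: hypothetical
# quadratic f(x) = (alpha/2) x_1^2 + (beta/2) x_2^2 with minimizer x* = 0.
alpha, beta = 1.0, 100.0
Q = beta / alpha
momentum = (math.sqrt(Q) - 1) / (math.sqrt(Q) + 1)

def f(x):
    return 0.5 * (alpha * x[0] ** 2 + beta * x[1] ** 2)

def grad(x):
    return [alpha * x[0], beta * x[1]]

x = [1.0, 1.0]   # x_1
y = list(x)      # auxiliary sequence, y_1 = x_1
for _ in range(400):
    g = grad(y)
    x_new = [y[i] - g[i] / beta for i in range(2)]                   # gradient step at y
    y = [x_new[i] + momentum * (x_new[i] - x[i]) for i in range(2)]  # momentum step
    x = x_new
print("f after 400 accelerated steps:", f(x))
```

The accelerated scheme contracts the optimality gap by roughly a factor 1 - 1/\sqrt{Q} per iteration, versus roughly 1 - 1/Q per iteration for plain Gradient Descent, which is the improvement discussed above.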

Proof: The overall argument is similar to what we did in the lower bound for smooth functions, but the details are different. As usual we assume without loss of generality that x_1 = 0. Let A : \ell_2 \rightarrow \ell_2 be the linear operator that corresponds to the infinite tridiagonal matrix with 2 on the diagonal and -1 on the upper and lower diagonals, and let e_1 \in \ell_2 be the first vector of the canonical basis. Let Q = \beta / \alpha. We consider now the following function:

    \[f(x) = \frac{\alpha (Q-1)}{8} \left( \langle Ax, x\rangle - 2 \langle e_1, x \rangle \right) + \frac{\alpha}{2} \|x\|^2 .\]

We already proved that 0 \preceq A \preceq 4 I which easily implies that f is \alpha-strongly convex and \beta-smooth. Now as always the key observation is that for this function, thanks to our assumption on the black-box procedure, one necessarily has x_t(i) = 0, \forall i \geq t. This implies in particular:

    \[\|x_t - x^*\|^2 \geq \sum_{i=t}^{+\infty} x^*(i)^2 .\]

Furthermore since f is \alpha-strongly convex, one has

    \[f(x_t) - f(x^*) \geq \frac{\alpha}{2} \|x_t - x^*\|^2 .\]

Thus it only remains to compute x^*. This can be done by differentiating f and setting the gradient to 0, which gives the following infinite set of equations

    \begin{align*} & 1 - 2 \frac{Q+1}{Q-1} x^*(1) + x^*(2) = 0 , \\ & x^*(k-1) - 2 \frac{Q+1}{Q-1} x^*(k) + x^*(k+1) = 0, \forall k \geq 2 . \end{align*}

It is easy to verify that x^* defined by x^*(i) = \left(\frac{\sqrt{Q} - 1}{\sqrt{Q} + 1}\right)^i satisfies this infinite set of equations, and the conclusion of the theorem then follows by straightforward computations.
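The verification is mechanical: r = \frac{\sqrt{Q}-1}{\sqrt{Q}+1} satisfies r + 1/r = 2\frac{Q+1}{Q-1}, so r^i solves the recurrence. A quick numeric check (for the hypothetical value Q = 10) of the first few equations:

```python
import math

# Numeric check (hypothetical value Q = 10) that x*(i) = r^i with
# r = (sqrt(Q) - 1)/(sqrt(Q) + 1) satisfies the optimality equations above,
# for the first 50 coordinates.
Q = 10.0
r = (math.sqrt(Q) - 1) / (math.sqrt(Q) + 1)
c = 2 * (Q + 1) / (Q - 1)   # the coefficient 2(Q+1)/(Q-1)

def xstar(i):
    return r ** i

assert abs(1 - c * xstar(1) + xstar(2)) < 1e-12            # equation for k = 1
for k in range(2, 51):
    assert abs(xstar(k - 1) - c * xstar(k) + xstar(k + 1)) < 1e-12
print("optimality equations verified")
```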

