Today we start the third (and last) part of our class, titled ‘Optimization and Randomness’. An important setting that we will explore is that of optimization with noisy oracles:
- Noisy $0^{th}$-order oracle: given $x \in \mathcal{X}$ it outputs $\tilde{f}(x)$ such that $\mathbb{E}\, \tilde{f}(x) = f(x)$.
- Noisy $1^{st}$-order oracle: given $x \in \mathcal{X}$ it outputs $\tilde{g}(x)$ such that $\mathbb{E}\, \tilde{g}(x) \in \partial f(x)$.
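As a concrete illustration (not from the lecture), here is a minimal Python sketch of both oracles for the toy function $f(x) = \frac{1}{2}\|x\|_2^2$, where additive zero-mean Gaussian noise makes the unbiasedness conditions hold exactly; the function names and the noise model are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_zeroth_order(x, sigma=0.1):
    # Returns f~(x) with E[f~(x)] = f(x) for f(x) = 0.5 * ||x||^2.
    return 0.5 * float(x @ x) + sigma * rng.normal()

def noisy_first_order(x, sigma=0.1):
    # Returns g~(x) with E[g~(x)] = grad f(x) = x.
    return x + sigma * rng.normal(size=x.shape)
```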
Somewhat surprisingly, several of the first-order algorithms we studied so far are robust to noise and attain the same performance if one feeds them with the answers of a noisy $1^{st}$-order oracle. This was first observed by Robbins and Monro in the early fifties, and the resulting set of techniques is often referred to as stochastic approximation methods. For instance let us consider the Stochastic Mirror Descent:

$$x_{t+1} = \mathrm{argmin}_{x \in \mathcal{X} \cap \mathcal{D}} \; \eta\, \tilde{g}(x_t)^{\top} x + D_{\Phi}(x, x_t),$$

where $\tilde{g}(x_t)$ is the answer of the noisy $1^{st}$-order oracle at $x_t$.
The analysis of this algorithm follows the exact same lines as the standard version. The key observation is that, by the tower rule and since $\mathbb{E}[\tilde{g}(x_s) \,|\, x_s] \in \partial f(x_s)$,

$$\mathbb{E} \sum_{s=1}^{t} \tilde{g}(x_s)^{\top} (x_s - x) = \mathbb{E} \sum_{s=1}^{t} \mathbb{E}[\tilde{g}(x_s) \,|\, x_s]^{\top} (x_s - x) \geq \mathbb{E} \sum_{s=1}^{t} \big( f(x_s) - f(x) \big),$$

and now one can repeat the standard proof since $\tilde{g}(x_s)$ was used in the algorithm. This yields the following result.
Theorem. Let $\Phi$ be a mirror map $\rho$-strongly convex on $\mathcal{X} \cap \mathcal{D}$ with respect to $\|\cdot\|$, and let $R^2 = \sup_{x \in \mathcal{X} \cap \mathcal{D}} \Phi(x) - \Phi(x_1)$. Assume that $f$ is convex and that the noisy oracle is such that $\mathbb{E} \|\tilde{g}(x)\|_*^2 \leq B^2$. Then Stochastic Mirror Descent with $\eta = \frac{R}{B} \sqrt{\frac{2\rho}{t}}$ satisfies

$$\mathbb{E}\, f\left( \frac{1}{t} \sum_{s=1}^{t} x_s \right) - \min_{x \in \mathcal{X}} f(x) \leq R B \sqrt{\frac{2}{\rho t}}.$$
Of course the exact same reasoning applies to the Mirror Descent for Saddle Points. One can also get a faster $1/t$ rate for strongly convex functions using the idea described in this lecture. High-probability results can be obtained using Markov’s inequality, and stronger concentration results can be proved under a subgaussian assumption on the tails of the estimates $\tilde{g}(x)$, see Proposition 2.2 here.
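To make the theorem concrete, here is a minimal Python sketch of Stochastic Mirror Descent on the simplex with the negative-entropy mirror map, for which the mirror step is a multiplicative update. The linear objective $f(x) = c^{\top} x$, the Gaussian noise, and the crude bound $B$ are illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n, t = 10, 2000
c = rng.uniform(size=n)            # toy objective f(x) = c^T x on the simplex

# Step size from the theorem with R^2 = log n (range of the negative entropy),
# rho = 1 (strong convexity w.r.t. ||.||_1), and B a crude noise bound.
B = 1.3
eta = (np.sqrt(np.log(n)) / B) * np.sqrt(2.0 / t)

x = np.full(n, 1.0 / n)            # x_1 minimizes the entropy: uniform point
avg = np.zeros(n)
for _ in range(t):
    g = c + rng.normal(scale=0.3, size=n)  # noisy oracle: unbiased gradient
    x = x * np.exp(-eta * g)               # mirror step for negative entropy,
    x /= x.sum()                           # then Bregman projection on simplex
    avg += x / t                           # theorem bounds f at the average

print(c @ avg - c.min())           # expected suboptimality, small for large t
```

The multiplicative form of the update is exactly the mirror step associated with the negative entropy; no explicit projection beyond the renormalization is needed on the simplex.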
Let us now see some applications of this noisy framework.
Statistical Learning
We are interested in solving the following problem:

$$\min_{x \in \mathcal{X}} \mathbb{E}_{\xi}\, \ell(x, \xi),$$

where $\ell$ is a known loss function such that $\ell(\cdot, \xi)$ is a convex function for every $\xi$, and $\xi$ is a random variable whose distribution is unknown. While the distribution of $\xi$ is unknown, we assume that we have access to a sequence of $T$ i.i.d. examples $\xi_1, \ldots, \xi_T$. A typical example within this framework is the case where $\xi = (z, y)$ represents a pair of (input, output) and the function $\ell$ is one of the loss functions we saw in this lecture, for example the logistic loss $\ell(x, (z, y)) = \log\big(1 + \exp(-y\, x^{\top} z)\big)$ with $x \in \mathbb{R}^n$, $(z, y) \in \mathbb{R}^n \times \{-1, 1\}$.
Let $g(x, \xi)$ be a subgradient of $\ell(\cdot, \xi)$ at $x$, and let $f(x) = \mathbb{E}_{\xi}\, \ell(x, \xi)$. We assume that $\mathbb{E}\, g(x, \xi) \in \partial f(x)$. Note that we cannot perform a gradient descent on $f$, as we would need to know the distribution of $\xi$ to compute a subgradient of $f$. On the other hand, one has access to unbiased estimates of these subgradients through the i.i.d. examples $\xi_1, \ldots, \xi_T$. Thus one can use the Stochastic Mirror Descent, where at each iteration a new example is used to compute the noisy subgradient; precisely, we take $\tilde{g}(x_t) = g(x_t, \xi_t)$. In particular, with $T$ examples one can only run the algorithm for $T$ steps, and one gets to a precision of order $1/\sqrt{T}$. If in addition $\ell(\cdot, \xi)$ is strongly convex, then one can in fact obtain a precision of order $1/T$ with the step sizes/averaging described in this lecture.
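The following Python sketch instantiates this scheme for the logistic loss above. The synthetic data generator (a hypothetical ground-truth vector `x_star`) and the constant step size of order $1/\sqrt{T}$ are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 5, 10_000
x_star = rng.normal(size=n)        # used only to generate synthetic examples

x = np.zeros(n)
avg = np.zeros(n)
eta = 1.0 / np.sqrt(T)             # step size of order 1/sqrt(T)
for s in range(T):
    z = rng.normal(size=n)         # fresh i.i.d. example (z_s, y_s)
    y = 1 if rng.random() < 1.0 / (1.0 + np.exp(-(z @ x_star))) else -1
    g = -y * z / (1.0 + np.exp(y * (x @ z)))  # gradient of the logistic loss
    x = x - eta * g                # one step per example: T examples, T steps
    avg += x / T                   # averaged iterate: precision O(1/sqrt(T))

x_hat = avg                        # the output of the method
```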
Acceleration by randomization
In the previous example there was no alternative to the use of noisy subgradients, as the distribution of $\xi$ was unknown. However in some cases, even if the distribution of $\xi$ is known, it might be beneficial to artificially sample examples from the distribution in order to avoid computing difficult integrals. Consider for example the problem of minimizing an average of convex functions:

$$\min_{x \in \mathcal{X}} \frac{1}{m} \sum_{i=1}^{m} f_i(x).$$
We have seen several important examples of this form where $m$ might be very large. Note that computing a subgradient of $f = \frac{1}{m} \sum_{i=1}^{m} f_i$ takes $O(m N)$ time, where $N$ is the time to compute the subgradient of a single function $f_i$. Thus a standard gradient descent algorithm will take $O\left(\frac{m N}{\epsilon^2}\right)$ time to reach an $\epsilon$-optimal point. On the other hand, one can simply sample (with replacement) an index $i \in \{1, \ldots, m\}$ uniformly at random at every iteration (which takes time $O(1)$), and perform a step of gradient descent using only the subgradient of $f_i$; by uniformity this is an unbiased estimate of a subgradient of $f$. This method will reach an $\epsilon$-optimal point (in expectation) in time $O\left(\frac{N}{\epsilon^2}\right)$, which is much cheaper than the $O\left(\frac{m N}{\epsilon^2}\right)$ requirement for GD!

Note that if $f$ is also strongly convex, then GD needs $O\left(\frac{m N}{\epsilon}\right)$ time to reach an $\epsilon$-optimal point, and SGD needs only $O\left(\frac{N}{\epsilon}\right)$ time (with appropriate step sizes/averaging as discussed above). The situation becomes more favorable to GD if in addition $f$ is smooth (recall that for strongly convex and smooth functions one has an exponential rate of convergence for GD), and we will come back to this in the next lecture.
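Here is a minimal sketch of the sampling trick on a hypothetical least-squares instance, $f_i(x) = \frac{1}{2}(a_i^{\top} x - b_i)^2$; the instance and step sizes are illustrative. The point is that one call to `sampled_subgrad` costs $O(n)$ while a call to `full_subgrad` costs $O(mn)$.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 100_000, 10
A = rng.normal(size=(m, n))        # row a_i defines f_i(x) = 0.5*(a_i^T x - b_i)^2
b = rng.normal(size=m)

def full_subgrad(x):
    # Exact gradient of f = (1/m) sum_i f_i: a full pass, time O(m n).
    return A.T @ (A @ x - b) / m

def sampled_subgrad(x):
    # Gradient of a single uniformly sampled f_i: time O(n), and
    # E[grad f_I(x)] = grad f(x) since I is uniform on {1, ..., m}.
    i = rng.integers(m)
    return A[i] * (A[i] @ x - b[i])

x = np.zeros(n)
for s in range(1, 1001):
    x -= (0.5 / np.sqrt(s)) * sampled_subgrad(x)  # m-times cheaper per step than GD
```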
Let us consider another example which we studied at the end of the previous lecture:

$$\min_{x \in \Delta_n} \max_{y \in \Delta_n} x^{\top} A y,$$

where $A \in [-1, 1]^{n \times n}$ (the assumption of a square matrix is just for sake of notation). It is clear that the rate of convergence of MD-SP for this problem is of order $\sqrt{\frac{\log n}{t}}$. Furthermore, each step requires computing a subgradient in $x$ and in $y$, which are of the form $A y$ and $A^{\top} x$. Thus the time complexity of one step is of order $O(n^2)$ (which is the time complexity of a matrix-vector multiplication). Overall, to reach an $\epsilon$-optimal point one needs $O\left(\frac{n^2 \log n}{\epsilon^2}\right)$ elementary operations. Now the key observation is that the matrix-vector multiplication can easily be randomized, especially since $x, y \in \Delta_n$. Let $A_1, \ldots, A_n$ be the column vectors of $A$; then $A y = \sum_{i=1}^{n} y_i A_i$, and if $I$ is a random variable such that $\mathbb{P}(I = i) = y_i$ one has $\mathbb{E}\, A_I = A y$. Using this randomization in a Stochastic MD-SP one obtains the exact same rate of convergence (in expectation), but now each step has time complexity of order $O(n)$, resulting in an overall complexity to reach an $\epsilon$-optimal point (in expectation) of $O\left(\frac{n \log n}{\epsilon^2}\right)$. This result is especially amazing since we obtain a sublinear-time algorithm (the complexity is $\tilde{O}(n)$ while the ‘size’ of the problem is $n^2$). One can propose an even better algorithm based on Mirror Prox, and we refer the reader to this book chapter by Juditsky and Nemirovski for more details.
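A minimal Python sketch of this randomized matrix-vector product follows; the random instance is purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
A = rng.uniform(-1.0, 1.0, size=(n, n))
y = rng.dirichlet(np.ones(n))      # a point of the simplex Delta_n

i = rng.choice(n, p=y)             # sample I with P(I = i) = y_i: O(n) time
estimate = A[:, i]                 # unbiased: E[A_I] = sum_i y_i A_i = A y

# Sanity check: averaging many i.i.d. copies of A_I recovers A y.
idx = rng.choice(n, size=20_000, p=y)
print(np.max(np.abs(A[:, idx].mean(axis=1) - A @ y)))
```

Each step of the stochastic method thus touches a single column of $A$ rather than the whole matrix, which is exactly where the $O(n)$ per-step cost comes from.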