Convex Optimization: Algorithms and Complexity

I am thrilled to announce that my short introduction to convex optimization has just come out in the Foundations and Trends in Machine Learning series (free version on arxiv). This project started on this blog in 2013 with the lecture notes “The complexities of optimization”; it then morphed a year later into a first draft titled “Theory of Convex Optimization for Machine Learning”, and finally, after one more iteration, it has found its more appropriate name: “Convex Optimization: Algorithms and Complexity”. Notable additions since the last version include: a short treatment of conjugate gradient, an almost self-contained analysis of Vaidya’s 1989 cutting plane method (which attains the best of both the center of gravity method and the ellipsoid method in terms of oracle complexity and computational complexity), and finally an algorithm with a simple geometric intuition which attains the rate of convergence of Nesterov’s accelerated gradient descent.


Crash course on learning theory, part 2

It might be useful to refresh your memory on the concepts we saw in part 1 (particularly the notions of VC dimension and Rademacher complexity). In this second and last part we will discuss two of the most successful algorithm paradigms in learning: boosting and SVM. Note that, just like last time, each topic we cover has entire books dedicated to it (see for example the boosting book by Schapire and Freund, and the SVM book by Scholkopf and Smola). Finally we conclude our short tour of the basics of learning theory with a simple observation: stable algorithms generalize well.



Boosting

Say that given a distribution D supported on m points (x_i, y_i)_{i \in [m]} one can find (efficiently) a classifier h \in \mathcal{B} such that L_{D}(h) \leq 1/2 - \gamma (here we are in the context of classification with the zero-one loss). Can we “boost” this weak learning algorithm into a strong learning algorithm with L_{\mu}(A) arbitrarily small for m large enough? It turns out that this is possible, and even simple. The idea is to build a linear combination of hypotheses in \mathcal{B} with a greedy procedure. That is, at time step t-1 our hypothesis is (the sign of) f_{t-1} = \sum_{s=1}^{t-1} w_s h_s, and we are now looking to add h_{t} \in \mathcal{B} with an appropriate weight w_{t} \in \mathbb{R}. A natural guess is to optimize over (w_t, h_t) to minimize the training error of f_t on our sample (x_i, y_i)_{i \in [m]}. This might be a difficult computational problem (how do you optimize over \mathcal{B}?), and furthermore we would like to make use of our efficient weak learning algorithm. The key trick is that \mathds1\{x \leq 0\} \leq \exp(-x). More precisely:

    \begin{eqnarray*} L_{\hat{\mu}_m}(f_t) = \frac{1}{m} \sum_{i=1}^m \mathds1\{\mathrm{sign}(f_t(x_i)) \neq y_i\} & \leq & \frac{1}{m} \sum_{i=1}^m \exp( - y_i f_t(x_i) ) \; (=: Z_t / m) \\ & = & \frac{Z_{t-1}}{m} \sum_{i=1}^m D_t(i) \exp( - y_i w_t h_t(x_i) ) , \end{eqnarray*}

where D_t(i) = \frac{1}{Z_{t-1}} \exp(- y_i f_{t-1}(x_i)). From this we see that we would like h_t to be a good predictor for the distribution D_t. Thus we can pass D_t to the weak learning algorithm, which in turn gives us h_t with L_{D_t}(h_t) = \epsilon_t \leq 1/2 - \gamma. Thus we now have:

    \[Z_t = Z_{t-1} (\epsilon_t \exp(w_t) + (1-\epsilon_t) \exp(-w_t) ).\]

Optimizing the above expression one finds that w_t = \frac{1}{2} \log\left(\frac{1-\epsilon_t}{\epsilon_t}\right) leads to (using \epsilon_t \leq 1/2-\gamma)

    \[Z_t \leq Z_{t-1} \exp(-2 \gamma^2) \leq m \exp(-2 \gamma^2 t) .\]

The procedure we just described is called AdaBoost (introduced by Schapire and Freund) and we just proved that it satisfies

(1)   \begin{equation*}  L_{\hat{\mu}_m}(f_t) \leq \exp(- 2 \gamma^2 t) . \end{equation*}

In particular we see that our weak learner assumption implies that \hat{\mu}_m is realizable (and in fact realizable with margin \gamma, see next section for the definition of margin) with the hypothesis class:

    \[\mathcal{H} = \left\{ \mathrm{sign} \left( \sum_{t \in \mathbb{N}} w_t h_t(\cdot) \right) : \forall t \in \mathbb{N}, w_t \in \mathbb{R}, h_t \in \mathcal{B} \right\} .\]

This class can be thought of as a neural network with one (infinite size) hidden layer. To realize how expressive \mathcal{H} is compared to \mathcal{B}, it’s a useful exercise to think about the very basic case of decision stumps (for which empirical risk minimization can be implemented very efficiently):

    \[\mathcal{B} = \{x \mapsto \mathrm{sign}(\theta - x_i), i \in [n], \theta \in \mathbb{R}\} .\]

To derive a bound on the true risk L_{\mu}(f_t) of AdaBoost it remains to calculate the VC dimension of the class \mathcal{H}_t where the size of the hidden layer is t. This follows from more general results on the VC dimension of neural networks, and up to logarithmic factors one obtains that \mathrm{VC}(\mathcal{H}_t) is of order t \times \mathrm{VC}(\mathcal{B}). Putting this together with (1) we see that when \gamma is a constant, one should run AdaBoost for t=\Theta(\log(m)) rounds, and then one gets L_{\mu}(f_t) \lesssim \sqrt{\frac{\mathrm{VC}(\mathcal{B})}{m}}.
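
To make this concrete, here is a minimal numpy sketch of AdaBoost with decision stumps playing the role of \mathcal{B}; the weak learner below simply performs weighted ERM over stumps by brute force, and all names and parameters are illustrative (not taken from any library):

    import numpy as np

    def weak_learner(X, y, D):
        """Weighted ERM over decision stumps h(x) = s * sign(theta - x[j])."""
        best = None
        for j in range(X.shape[1]):
            for theta in np.unique(X[:, j]):
                for s in (-1.0, 1.0):
                    pred = s * np.sign(theta - X[:, j])
                    pred[pred == 0] = s
                    err = np.sum(D * (pred != y))
                    if best is None or err < best[0]:
                        best = (err, j, theta, s)
        return best  # (eps_t, coordinate, threshold, sign)

    def adaboost(X, y, T):
        m = X.shape[0]
        D = np.full(m, 1.0 / m)          # D_1 is uniform (f_0 = 0)
        f = np.zeros(m)                  # f_t evaluated at the sample points
        hypotheses = []
        for t in range(T):
            eps, j, theta, s = weak_learner(X, y, D)
            eps = np.clip(eps, 1e-10, 1 - 1e-10)
            w = 0.5 * np.log((1 - eps) / eps)        # the optimal weight w_t
            pred = s * np.sign(theta - X[:, j])
            pred[pred == 0] = s
            f += w * pred
            D = np.exp(-y * f)
            D /= D.sum()                 # D_{t+1}(i) proportional to exp(-y_i f_t(x_i))
            hypotheses.append((w, j, theta, s))
        return hypotheses, np.mean(np.sign(f) != y)  # hypotheses and training error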



SVM and the margin assumption

We consider \mathcal{D} to be the set of distributions \mu such that there exists w^* with \mathbb{P}_{\mu}(Y \langle w^{*} , X \rangle > 0) =1 (again we are in the context of classification with the zero-one loss, and this assumption means that the data is almost surely realizable). The SVM idea is to search for w with minimal Euclidean norm and such that Y_i \langle w , X_i \rangle \geq 1, \forall i \in [m]. Effectively this is doing empirical risk minimization over the following data-dependent hypothesis class (which we write in terms of the set of admissible weight vectors):

    \[\mathcal{W}_S = \{w \in \mathbb{R}^n : Y_i \langle w , X_i \rangle \geq 1, \forall i \in [m] \; \text{and} \; |w| \leq |w^*|\} .\]

The key point is that we can now use the contraction lemma to bound the Rademacher complexity of this class. Indeed replacing the zero-one loss \mathds1\{\mathrm{sign}(\langle w, x\rangle) \neq y\} by the (Lipschitz!) “ramp loss” \min(1,\max(0,1-y\langle w , x \rangle)) makes no difference for the optimum w^*, and our estimated weight still has 0 training error while its true loss is only overestimated. Using the argument from the previous sections we see that the Rademacher complexity of our hypothesis class (with respect to the ramp loss) is bounded by 2 |w^*| / \sqrt{m} (assuming the examples X_i are normalized to be in the Euclidean ball). Now it is easy to see that the existence of w^* with |w^*| \leq 1/\gamma (and \mathbb{P}_{\mu}(Y \langle w^{*} , X \rangle \geq 1) =1) exactly corresponds to a geometric margin of \gamma between positive and negative examples (indeed the margin is exactly \min_i y_i \langle \frac{w^*}{|w^*|} , x_i \rangle). To summarize, we just saw that under the \gamma-margin condition the SVM algorithm has a sample complexity of order 1/ (\gamma \epsilon)^2. This suggests that from an estimation perspective one should map the points into a high-dimensional space so that one could hope to have the separability condition (with margin). However this raises computational challenges, as the QP given by the SVM can suddenly look daunting. This is where kernels come into the picture.



The kernel trick

So let’s go overboard and map the points to an infinite dimensional Hilbert space \mathcal{H} (as we will see in the next subsection this notation will be consistent with \mathcal{H} being the hypothesis class). Denote \psi for this map, and let K(x,x') = \langle \psi(x), \psi(x') \rangle_{\mathcal{H}} be the kernel associated with it. The key point is that we are not using all the dimensions of our Hilbert space in the SVM optimization problem, but rather we are effectively working in the subspace spanned by \psi(x_1), \hdots, \psi(x_m) (this is because we are only working with inner products with those vectors, and we are trying to minimize the norm of the resulting vector). Thus we can restrict our search to w=\sum_{i=1}^m \alpha_i \psi(x_i) (this fact is called the representer theorem and the previous sentence is the proof…). The beauty is that now we only need to compute the Gram matrix (K(x_i, x_j))_{i, j \in [m]} as we only need to consider \langle w, \psi(x_i)\rangle = \sum_{j=1}^m \alpha_j K(x_j, x_i) and |w|^2 = \sum_{i,j=1}^m \alpha_i \alpha_j K(x_i, x_j). In particular we never need to compute the points \psi(x_i) (which anyway could be infinite dimensional, so we couldn’t really write them down…). Note that the same trick would work with soft SVM (i.e., regularized hinge loss). To drive the point home let’s see an example: \psi(x) = \left(\prod_{j=1}^k x_{i_j}\right)_{(i_1, \hdots, i_k) \in [n]^k} leads to K(x,x') = (\langle x, x'\rangle)^k. I guess it doesn’t get much better than this :). Despite all this beauty, one should note that we now have to manipulate an object of size m^2 (the kernel matrix (K(x_i,x_j))) and in our big data days this can be a disaster. We really want to focus on methods with computational complexity linear in m, and thus one is led to the problem of kernel approximation, which we will explore below. But first let us explore a bit further what kernel SVM is really doing.
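
As a quick numerical sanity check of the kernel trick, here is a small (purely illustrative) numpy snippet comparing the explicit degree-k monomial feature map above, which has n^k coordinates, with the kernel evaluation (\langle x, x'\rangle)^k, which costs O(n):

    import itertools
    import numpy as np

    def poly_features(x, k):
        """Explicit feature map: one coordinate per tuple (i_1, ..., i_k) in [n]^k."""
        return np.array([np.prod(x[list(idx)])
                         for idx in itertools.product(range(len(x)), repeat=k)])

    rng = np.random.default_rng(0)
    n, k = 5, 3
    x, xp = rng.standard_normal(n), rng.standard_normal(n)

    explicit = poly_features(x, k) @ poly_features(xp, k)  # inner product of n^k-dimensional features
    kernel = (x @ xp) ** k                                  # same number, computed in O(n)
    print(abs(explicit - kernel))                           # tiny (floating point error)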


RKHS and the inductive bias of kernels

As we just saw, in kernel methods the true central element is the kernel K rather than the embedding \psi. In particular, since the only things that matter are the inner products \langle \psi(x), \psi(x') \rangle_{\mathcal{H}}, we might as well assume that \psi(x) := K(x,\cdot), and that \mathcal{H} is the completion of \mathrm{Span}(K(x, \cdot), x \in \mathcal{X}) where the inner product is defined by \langle K(x, \cdot), K(x', \cdot) \rangle_{\mathcal{H}} = K(x,x') (and the definition is extended to \mathcal{H} by linearity). Assuming that K is positive definite (that is, Gram matrices built from K are positive definite) one obtains a well-defined Hilbert space \mathcal{H}. Furthermore this Hilbert space has a special property: for any f \in \mathcal{H}, x \in \mathcal{X}, f(x) = \langle f, K(x, \cdot) \rangle_{\mathcal{H}}. In other words \mathcal{H} is a reproducing kernel Hilbert space (RKHS), and in fact any RKHS can be obtained with the above construction (this is a simple consequence of the Riesz representation theorem). Now observe that we can rewrite the kernel SVM problem

    \begin{align*} & \mathrm{min}. \; \sum_{i,j=1}^m \alpha_i \alpha_j K(x_i,x_j) \\ & \text{s.t.} \; \alpha \in \mathbb{R}^m \; \text{and} \; \forall j \in [m], y_j \sum_{i=1}^m \alpha_i K(x_i, x_j) \geq 1 , \end{align*}


as

    \begin{align*} & \mathrm{min}. \; \|f\|_{\mathcal{H}} \\ & \text{s.t.} \; f \in \mathcal{H} \; \text{and} \; \forall j \in [m], y_j f(x_j) \geq 1 . \end{align*}

While the first formulation is computationally more tractable, the second sheds light on what we are really doing: simply searching for the hypothesis in \mathcal{H} which is consistent (with margin) and has the smallest norm. In other words, thinking in terms of inductive bias, one should choose a kernel for which the norm \|\cdot\|_{\mathcal{H}} represents the kind of smoothness one expects in the mapping from input to output (more on that next).

It should also be clear now that one can “kernelize” any regularized empirical risk minimization, that is instead of the boring (note that here the loss \ell is defined on \mathcal{Y} \times \mathbb{R} instead of \mathcal{Y} \times \mathcal{Y})

    \[\min_{w \in \mathbb{R}^n}. \; \frac1{m} \sum_{i=1}^m \ell(y_i, \langle w, x_i \rangle) + \lambda |w|^2 ,\]

one can consider the much more exciting

    \[\min_{f \in \mathcal{H}}. \; \frac1{m} \sum_{i=1}^m \ell(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}}^2 ,\]

since this can be equivalently written as

    \[\min_{\alpha \in \mathbb{R}^m}. \; \frac1{m} \sum_{i=1}^m \ell\left(y_i, \sum_{j=1}^m \alpha_j K(x_j, x_i)\right) + \lambda \sum_{i,j=1}^m \alpha_i \alpha_j K(x_i,x_j) .\]

This gives kernel ridge regression, kernel logistic regression, etc.
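
For instance, for the least-squares loss one can check that the kernelized problem above (kernel ridge regression) has the closed form \alpha = (K + \lambda m I)^{-1} y. Here is a short numpy sketch with a Gaussian kernel (the function names and the toy data are of course just for illustration):

    import numpy as np

    def gaussian_kernel(A, B, sigma=1.0):
        """Gram matrix K(a, b) = exp(-|a-b|^2 / (2 sigma^2)) between the rows of A and B."""
        sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
        return np.exp(-sq / (2 * sigma**2))

    def kernel_ridge_fit(X, y, lam, sigma=1.0):
        m = len(y)
        K = gaussian_kernel(X, X, sigma)
        return np.linalg.solve(K + lam * m * np.eye(m), y)   # alpha = (K + lam*m*I)^{-1} y

    def kernel_ridge_predict(alpha, X_train, X_test, sigma=1.0):
        return gaussian_kernel(X_test, X_train, sigma) @ alpha

    # toy usage: learn f(x) = sin(3x) from noisy samples
    rng = np.random.default_rng(0)
    X = rng.uniform(-2, 2, (60, 1))
    y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(60)
    alpha = kernel_ridge_fit(X, y, lam=1e-3)
    print(kernel_ridge_predict(alpha, X, np.array([[0.5]])))  # should be close to sin(1.5)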


Translation invariant kernels and low-pass filtering

We will now investigate a bit further the RKHS that one obtains with translation invariant kernels, that is K(x,x')=k(x-x'). A beautiful theorem of Bochner characterizes the continuous maps k: \mathbb{R}^n \rightarrow \mathbb{R} (with k(0)=1) for which such a K is a positive definite kernel: it is necessary and sufficient that k is the characteristic function of a probability measure \mu, that is

    \[k(x) = \int_{\mathbb{R}^n} \exp(i \langle \omega, x \rangle ) d\mu(\omega) .\]

An important example in practice is the Gaussian kernel: K(x,x')=\exp\left( - \frac{|x-x'|^2}{2 \sigma^2}\right) (this corresponds to mapping x to the function \exp\left( - \frac{|x- \cdot |^2}{2 \sigma^2}\right)). One can check that in this case \mu is itself a Gaussian (centered and with covariance \sigma^{-2} \mathrm{I}_n).

Now let us restrict our attention to the case where \mu has a density h with respect to the Lebesgue measure, that is \frac{d\mu}{d\omega}=h. A standard calculation then shows that

    \[K(x,x')= \frac{1}{(2\pi)^{n}} \int_{\mathbb{R}^n} \frac{\widehat{K(x, \cdot)}(\omega)\overline{\widehat{K(x', \cdot)}(\omega)}}{h(\omega)} d\omega ,\]

and more generally, for any f \in \mathcal{H},

    \[\| f \|_{\mathcal{H}}^2 = \frac{1}{(2\pi)^{n}} \int_{\mathbb{R}^n} \frac{|\widehat{f}(\omega)|^2}{h(\omega)} d\omega .\]

Note that for the Gaussian kernel one has h(\omega)^{-1} \propto \exp\left(\frac{\sigma^2 |\omega|^2}{2}\right), that is the high frequencies in f are severely penalized in the RKHS norm. Also note that smaller values of \sigma correspond to less regularization, which is what one would have expected from the feature map representation (indeed the features \exp\left( - \frac{|x- \cdot |^2}{2 \sigma^2}\right) are more localized around the data point x for smaller values of \sigma).

To summarize, SVM with a translation invariant kernel corresponds to some kind of soft low-pass filtering, where the exact form of the penalization of higher frequencies depends on the specific kernel being used (smoother kernels lead to more penalization).


Random features

Let us now come back to computational issues. As we pointed out before, the vanilla kernel method has at least a quadratic cost in the number of data points. A common approach to reduce this cost is to use a low rank approximation of the Gram matrix (indeed thanks to the i.i.d. assumption there is presumably a lot of redundancy in the Gram matrix), or to resort to online algorithms (see for example the forgetron of Dekel, Shalev-Shwartz and Singer). Another idea is to approximate the feature map itself (a priori this doesn’t sound like a good idea, since as we explained above the beauty of kernels is that we avoid computing this feature map). We now describe an elegant and powerful approximation of the feature map (for translation invariant kernels) proposed by Rahimi and Recht which is based on random features.

Let K be a translation invariant kernel and \mu its corresponding probability measure. Let’s rewrite K in a convenient form, using \cos(\theta-\theta') = \frac{1}{\pi} \int_{0}^{\pi} 2 \cos(\theta + \phi) \cos(\theta'+\phi) d\phi and Bochner’s theorem,

    \[K(x,x') = \mathbb{E}_{\omega \sim \mu} \cos(\langle \omega, x-x' \rangle) = 2 \mathbb{E}_{\omega \sim \mu, \phi \sim \mathrm{unif}([0,\pi])} \cos(\langle \omega, x \rangle + \phi) \cos(\langle \omega, x' \rangle + \phi).\]

A simple idea is thus to build the following random feature map: given \omega_1, \hdots, \omega_d and \phi_1, \hdots, \phi_d, i.i.d. draws from respectively \mu and \mathrm{unif}([0, \pi]), let \Psi: \mathbb{R}^n \rightarrow \mathbb{R}^d be defined by \Psi(x)=(\Psi_1(x), \hdots, \Psi_d(x)) where

    \[\Psi_i(x) = \sqrt{\frac{2}{d}} \cos(\langle \omega_i, x \rangle + \phi_i) .\]

For d \gtrsim n / \epsilon^2 it is an easy exercise to verify that with probability at least 1-\epsilon (provided \mu has a second moment at most polynomial in n) one will have for any x, x' in some compact set (with diameter at most polynomial in n),

    \[|K(x,x') - \langle \Psi(x), \Psi(x') \rangle | \leq \epsilon .\]

The SVM problem can now be approximated by:

    \begin{align*} & \mathrm{min}. \; |w| \\ & \text{s.t.} \; w \in \mathbb{R}^d \; \text{and} \; y_j \langle w, \Psi(x_j) \rangle \geq 1 , \forall j \in [m]. \end{align*}

This optimization problem is potentially much simpler than the vanilla kernel SVM when m is much bigger than d (essentially d replaces m for most computational aspects of the problem, including the space/time complexity of prediction after training).
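
Here is a small numpy sketch of this random feature map for the Gaussian kernel (for which \mu is a centered Gaussian with covariance \sigma^{-2} \mathrm{I}_n, as noted earlier); it simply checks that \langle \Psi(x), \Psi(x') \rangle approximates K(x,x') up to an error of order 1/\sqrt{d} (all names are illustrative):

    import numpy as np

    def random_feature_map(n, d, sigma, rng):
        """Psi_i(x) = sqrt(2/d) cos(<omega_i, x> + phi_i), omega_i ~ N(0, sigma^{-2} I_n), phi_i ~ unif([0, pi])."""
        omega = rng.standard_normal((d, n)) / sigma
        phi = rng.uniform(0, np.pi, d)
        return lambda x: np.sqrt(2.0 / d) * np.cos(omega @ x + phi)

    n, d, sigma = 10, 5000, 1.0
    rng = np.random.default_rng(0)
    Psi = random_feature_map(n, d, sigma, rng)
    x, xp = rng.standard_normal(n), rng.standard_normal(n)

    K_exact = np.exp(-np.sum((x - xp) ** 2) / (2 * sigma**2))
    K_approx = Psi(x) @ Psi(xp)
    print(K_exact, K_approx)   # the two numbers agree up to an error of order 1/sqrt(d)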



Stability

We conclude our tour of the basic topics in statistical learning with a different point of view on generalization that was put forward by Bousquet and Elisseeff.

Let’s start with a simple observation. Let S'=(Z_1', \hdots, Z_m') be an independent copy of S, and denote S^i = (Z_1, \hdots, Z_{i-1}, Z_i', Z_{i+1}, \hdots, Z_m). Then one can write, using the slight abuse of notation \ell(h, (x,y)) :=\ell(h(x), y),

    \begin{eqnarray*} \mathbb{E}_S (L_{\mu}(A(S)) - L_{\hat{\mu}_m}(A(S))) & = & \mathbb{E}_{S, S', i \sim \mathrm{unif}([m])} (\ell(A(S), Z_i') - \ell(A(S), Z_i)) \\ & = & \mathbb{E}_{S, S', i \sim \mathrm{unif}([m])} (\ell(A(S^i), Z_i) - \ell(A(S), Z_i)) . \end{eqnarray*}

This last quantity can be interpreted as a stability notion, and we see that controlling it would in turn control how different the true risk is from the empirical risk. Thus stable methods generalize.

We will now show that regularization can be interpreted as a stabilizer. Precisely we show that

    \[w(S) \in \mathrm{argmin}_{w \in \mathbb{R}^n} \; L_{\hat{\mu}_m}(h_w) + \lambda |w|^2 ,\]

is 2 / (\lambda m)-stable for a convex and 1-Lipschitz loss. Denote by f_S(w) the above objective function; then one has

    \[f_S(w) - f_S(w') = f_{S^i}(w) - f_{S^i}(w') + \frac{\ell(h_w, Z_i) - \ell(h_{w'}, Z_i)}{m} + \frac{\ell(h_{w'}, Z_i') - \ell(h_{w}, Z_i')}{m} ,\]

and thus, using that w(S^i) minimizes f_{S^i} and then the Lipschitzness of the loss,

    \begin{eqnarray*} f_S(w(S^i))-f_S(w(S)) & \leq & \frac{\ell(h_{w(S^i)}, Z_i) - \ell(h_{w(S)}, Z_i)}{m} + \frac{\ell(h_{w(S)}, Z_i') - \ell(h_{w(S^i)}, Z_i')}{m} \\ & \leq & \frac{2}{m} |w(S) - w(S^i)| . \end{eqnarray*}

On the other hand by strong convexity one has

    \[f_S(w) - f_S(w(S)) \geq \lambda |w-w(S)|^2 ,\]

and thus with the above we get |w(S) - w(S^i)| \leq 2 / (\lambda m) which implies (by Lipschitzness)

    \[\ell(h_{w(S^i)}, Z_i) - \ell(h_{w(S)}, Z_i) \leq \frac{2}{\lambda m} ,\]

or in other words regularized empirical risk minimization (\mathrm{R-ERM}) is stable. In particular denoting w^* for the minimizer of w \mapsto L_{\mu}(h_w) we have:

    \begin{eqnarray*} L_{\mu}(\mathrm{R-ERM}) & \leq & \mathbb{E}_S L_{\hat{\mu}_m}(\mathrm{R-ERM}) + \frac{2}{\lambda m} \\ & \leq & \mathbb{E}_S L_{\hat{\mu}_m}(w^*) + \lambda |w^*|^2 + \frac{2}{\lambda m} \\ & = & L_{\mu}(w^*) + \lambda |w^*|^2 + \frac{2}{\lambda m} . \end{eqnarray*}

Assuming |w^*| \leq 1 and optimizing over \lambda we recover the bound we obtained previously via Rademacher complexity.
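
To see this stability bound in action, here is a small, purely illustrative numpy experiment: we minimize the regularized logistic objective by plain gradient descent on S and on S^i, and check that |w(S) - w(S^i)| \leq 2/(\lambda m) (the logistic loss used below is 1-Lipschitz as long as the examples lie in the unit ball):

    import numpy as np

    def grad(w, X, y, lam):
        """Gradient of (1/m) sum_i log(1 + exp(-y_i <w, x_i>)) + lam |w|^2."""
        z = y * (X @ w)
        return -(X * (y / (1 + np.exp(z)))[:, None]).mean(axis=0) + 2 * lam * w

    def fit(X, y, lam, steps=20000, lr=0.1):
        w = np.zeros(X.shape[1])
        for _ in range(steps):
            w -= lr * grad(w, X, y, lam)
        return w

    rng = np.random.default_rng(0)
    m, n, lam = 100, 5, 0.1
    X = rng.standard_normal((m, n))
    X /= np.maximum(1.0, np.linalg.norm(X, axis=1))[:, None]   # put the examples in the unit ball
    y = np.sign(X[:, 0] + 0.3 * rng.standard_normal(m))

    w_S = fit(X, y, lam)
    Xi, yi = X.copy(), y.copy()
    v = rng.standard_normal(n)
    Xi[0], yi[0] = v / max(1.0, np.linalg.norm(v)), 1.0        # S^i: replace the first example
    w_Si = fit(Xi, yi, lam)

    print(np.linalg.norm(w_S - w_Si), 2 / (lam * m))   # the left-hand side should be the smaller one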


Crash course on learning theory, part 1

This week and next week I’m giving 90-minute lectures at MSR on the fundamentals of learning theory. Below you will find my notes for the first course, where we covered the basic setting of statistical learning theory, Glivenko-Cantelli classes, Rademacher complexity, VC dimension, linear regression, logistic regression, soft-SVM, the sample complexity of classification with halfspaces, and convex learning with SGD. All of this (and much more) can be found in the excellent book by Shai Shalev-Shwartz and Shai Ben-David (freely available online), which I heavily used to prepare these lectures. Next week we will cover boosting, margin and kernels, stability, and an introduction to neural networks [note: part 2 can be found here].


What statistical learning is about

The basic ingredients in statistical learning theory are as follows:

  • Two measurable spaces \mathcal{X} and \mathcal{Y} (the input and output sets).
  • A loss function \ell : \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}.
  • A set \mathcal{H} \subset \mathcal{Y}^{\mathcal{X}} of maps from \mathcal{X} to \mathcal{Y} (the hypothesis class).
  • A set \mathcal{D} of probability measures on \mathcal{X} \times \mathcal{Y}.

The mathematically most elementary goal in learning can now be described as follows: given an i.i.d. sequence S := ((X_i, Y_i))_{i \in [m]} drawn from some unknown measure \mu \in \mathcal{D}, one wants to find a map from \mathcal{X} to \mathcal{Y} that is doing almost as well as the best map in \mathcal{H} in terms of predicting Y given X when (X,Y) is drawn from the unknown \mu and the prediction mistakes are evaluated according to the loss function \ell. In other words, denoting for h \in \mathcal{Y}^{\mathcal{X}}, L_{\mu}(h) = \mathbb{E}_{(X,Y) \sim \mu} \ell(h(X), Y), one is interested in finding a map A : (\mathcal{X} \times \mathcal{Y})^m \rightarrow \mathcal{Y}^{\mathcal{X}} that minimizes:

    \[R_m(A) := \sup_{\mu \in \mathcal{D}} R_{m}(A; \mu) ,\]


where

    \[R_{m}(A; \mu) := \mathbb{E}_{S \sim \mu^{\otimes m}} \; L_{\mu}(A(S)) - \inf_{h \in \mathcal{H}} L_{\mu}(h) .\]

Let us make a few comments/observations about this problem and some of its variants. In all our discussion we restrict to \mathcal{X} \subset \mathbb{R}^n and \mathcal{H} = \{h_w : w \in \mathcal{W} \subset \mathbb{R}^d\}.

  • The two canonical examples of the above framework are the problem of classification with halfspaces (\mathcal{Y} = \{-1,1\}, h_w(x) = \mathrm{sign}(\langle w, x\rangle), and \ell(y,y') = \mathds1\{y \neq y'\}), and the problem of linear regression (\mathcal{Y} = \mathbb{R}, h_w(x) = \langle w, x \rangle, and \ell(y,y') = (y-y')^2). The loss functions just defined are respectively called the zero-one loss and the least-squares loss.
  • The prior knowledge about the relation between inputs and outputs (also called the inductive bias) is expressed by either restricting the comparison set \mathcal{H} (this is called the discriminative approach) or restricting the set of possible distributions \mathcal{D} (the generative approach).
  • One says that (\mathcal{D}, \mathcal{H}, \ell) is learnable if there exists A such that \lim_{m\rightarrow+\infty}R_m(A) \leq 0. The sample complexity of the problem (at scale \epsilon >0) is the smallest m such that there exists A with R_m(A) \leq \epsilon.
  • Obviously some problems are not learnable, say in classification with the zero-one loss one cannot learn if \mathcal{D} is the set of all distributions on [0,1]^n \times \{-1,1\} and \mathcal{H} = \{-1,1\}^{[0,1]^n}. This is sometimes called the no-free lunch theorem (note however that this observation really doesn’t deserve the name “theorem”). Interestingly it turns out that it doesn’t take much (in some sense) to make the problem learnable: if one restricts \mathcal{D} to be the set of distributions such that the conditional probability \eta(x) = \mathbb{P}(Y=1 | X=x) is 1-Lipschitz, then the nearest neighbor rule (i.e., A(S)(x) outputs the label Y_i of the closest point X_i to x) shows that the problem is (essentially) learnable with sample complexity 1/\epsilon^{n+1}. However this is not so good news, as this sample complexity is exponential in the dimension (this is the so-called “curse of dimensionality”). To get anything practical we will have to strengthen our inductive bias.
  • The stronger notion of PAC learnability replaces the convergence in expectation of L_{\mu}(A(S)) to \inf_{h \in \mathcal{H}} L_{\mu}(h) by a convergence in probability (and this convergence should be uniform in \mu \in \mathcal{D}). Furthermore in basic PAC learning one assumes that \mathcal{D} is restricted to measures such that \inf_{h \in \mathcal{H}} L_{\mu}(h) = 0, also called the realizable case in the context of classification. On the other hand in agnostic PAC learning \mathcal{D} is the set of all measures on \mathcal{X} \times \mathcal{Y}.
  • A weaker notion of learning is to simply require consistency, that is for any \mu \in \mathcal{D} one has R_m(A; \mu) going to zero. One can show that k-Nearest Neighbor is “universally consistent” provided that k grows to infinity at a sublinear rate. This fact is not too surprising if you recall Lusin’s theorem.
  • Proper learning refers to the situation where A is restricted to map into \mathcal{H}.
  • Of course we are interested in the computational complexity of learning too, and the “real” objective of machine learning is to find large and expressive hypothesis classes that are learnable in polynomial sample complexity and polynomial time.


Glivenko-Cantelli and Rademacher complexity

Recall that a set of functions \mathcal{F} \subset \mathbb{R}^{\mathcal{Z}} is a Glivenko-Cantelli class if for any measure \mu on \mathcal{Z}, denoting \hat{\mu}_m = \frac{1}{m} \sum_{i=1}^m \delta_{z_i} for the empirical measure of z \sim \mu^{\otimes m}, one has that \sup_{f \in \mathcal{F}} |\mathbb{E}_{\hat{\mu}_m} f - \mathbb{E}_{\mu} f| goes to 0 in probability as m goes to infinity. A useful technique to get a handle on Glivenko-Cantelli is the following symmetrization argument, where \epsilon \sim \mathrm{unif}(\{-1,1\}^m),

    \begin{eqnarray*} \mathbb{E}_z \sup_{f \in \mathcal{F}} |\mathbb{E}_{\hat{\mu}_m} f - \mathbb{E}_{\mu} f| & = & \mathbb{E}_z \sup_{f \in \mathcal{F}} \left| \mathbb{E}_{z'} \frac{1}{m} \sum_{i=1}^m (f(z_i) - f(z_i')) \right| \\ & \leq & \mathbb{E}_{z, z', \epsilon} \sup_{f \in \mathcal{F}} \left| \frac{1}{m} \sum_{i=1}^m \epsilon_i (f(z_i) - f(z_i')) \right| \\ & \leq & 2 \mathbb{E}_{z, \epsilon} \sup_{f \in \mathcal{F}} \left|\frac{1}{m} \sum_{i=1}^m \epsilon_i f(z_i)\right| . \end{eqnarray*}

This last quantity is known as the Rademacher complexity of \mathcal{F} on \mu, and its supremum over \mu is simply called the Rademacher complexity of \mathcal{F}, denoted \mathrm{Rad}_m(\mathcal{F}). In particular McDiarmid’s inequality together with boundedness of f directly implies that with probability at least 1-\delta,

(1)   \begin{equation*} \sup_{f \in \mathcal{F}} |\mathbb{E}_{\hat{\mu}_m} f - \mathbb{E}_{\mu} f| \lesssim \mathrm{Rad}_m(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{m}}. \end{equation*}

In other words if the Rademacher complexity of \mathcal{F} goes to 0, then \mathcal{F} is a Glivenko-Cantelli class.

In learning theory the class of interest is \mathcal{F} = \{(x,y) \mapsto \ell(h(x), y), h \in \mathcal{H}\} (in particular \mathcal{Z} = \mathcal{X} \times \mathcal{Y}). If \mathcal{F} is a Glivenko-Cantelli class (that is the true loss and the empirical loss are close for any h \in \mathcal{H}) then one obtains a consistent method with the \mathrm{ERM} rule (empirical risk minimization):

    \[\mathrm{ERM}(S) = \mathrm{argmin}_{h \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^m \ell(h(X_i), Y_i) .\]

Furthermore if the Glivenko-Cantelli convergence is uniform in \mu \in \mathcal{D} (which is the case if, say, the Rademacher complexity \mathrm{Rad}_m(\mathcal{F}) goes to 0) then the problem is agnostic PAC learnable (thanks to (1)).

Bounding the Rademacher complexity (and variants such as when the Rademacher random variables are replaced by Gaussians) is one of the main objectives in the rich theory of empirical processes. Many tools have been developed for this task and we rapidly cover some of them below.


Lipschitz loss

The contraction lemma says that if \mathcal{Y} \subset \mathbb{R} and a \mapsto \ell(a, y) is 1-Lipschitz for any y, then \mathrm{Rad}_m(\mathcal{F}) \leq \mathrm{Rad}_m(\mathcal{H}). Thus, if one wants to show that linear regression is PAC learnable (say with \mathcal{Y} = [-1,1]) it suffices to study the Rademacher complexity of linear functions. For instance if \mathcal{X} and \mathcal{W} are both the unit Euclidean ball \mathrm{B}_2^n then the Rademacher complexity of linear functions is bounded by 2/\sqrt{m} (in particular it is independent of the dimension, more on that below). Thus summarizing this discussion we see that in this case the map

(2)   \begin{equation*} \mathrm{ERM} : (X_i, Y_i)_{i \in [m]} \mapsto \mathrm{argmin}_{w \in \mathrm{B}_2^n} \frac{1}{m} \sum_{i=1}^m (\langle w, X_i\rangle - Y_i)^2 , \end{equation*}

satisfies for any \mu, with probability at least 1- \delta,

    \[L_{\mu}(\mathrm{ERM}(S)) - \inf_{h \in \mathcal{H}} L_{\mu}(h) \lesssim \sqrt{\frac{\log(1/\delta)}{m}} .\]
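
Here is a quick Monte Carlo illustration of the dimension-free bound on the Rademacher complexity of linear functions mentioned above (a sketch with illustrative names): for weights in \mathrm{B}_2^n the supremum over w of \frac{1}{m} |\sum_i \epsilon_i \langle w, X_i\rangle| is just the Euclidean norm of \frac{1}{m} \sum_i \epsilon_i X_i, and its expectation stays below 2/\sqrt{m} whether n=10 or n=1000:

    import numpy as np

    def rademacher_linear(X, num_draws=2000, rng=np.random.default_rng(0)):
        """Monte Carlo estimate of E_eps sup_{|w| <= 1} |(1/m) sum_i eps_i <w, x_i>| = E_eps |(1/m) sum_i eps_i x_i|."""
        m = X.shape[0]
        vals = [np.linalg.norm(rng.choice([-1.0, 1.0], size=m) @ X) / m for _ in range(num_draws)]
        return np.mean(vals)

    rng = np.random.default_rng(1)
    m = 200
    for n in (10, 1000):                               # the estimate barely depends on the dimension
        X = rng.standard_normal((m, n))
        X /= np.linalg.norm(X, axis=1)[:, None]        # points on the unit Euclidean sphere
        print(n, rademacher_linear(X), 2 / np.sqrt(m))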

Also observe that one can rewrite the optimization problem (2) by Lagrangian duality as follows, for some \lambda>0 and with obvious notations for \mathbb{X} and \mathbb{Y},

    \[\mathrm{argmin}_{w \in \mathbb{R}^n} \frac1{m} |\mathbb{X} w - \mathbb{Y}|^2 + \lambda |w|^2 ,\]

which is sometimes called the ridge regression problem. The added penalty \lambda | \cdot |^2 is thought of as a regularizer, and from the perspective of this section it enforces a small Rademacher complexity. The tradeoff is that larger values of \lambda lead to a smaller Rademacher complexity but also to a smaller comparison set \mathcal{H}. These two types of error should be thought of as the estimation and approximation errors.

Another interesting example is the LASSO problem

    \[\mathrm{argmin}_{w \in \mathbb{R}^n} \frac1{m} |\mathbb{X} w - \mathbb{Y}|^2 + \lambda \|w\|_1 ,\]

which corresponds to \mathcal{W} being the \ell_1-unit ball (up to scaling). In this case if \mathcal{X} is the \ell_{\infty}-unit ball one can show that the Rademacher complexity of linear functions is of order \sqrt{\log(n) / m}. This corresponds in some sense to the inductive bias that the optimal weight is sparse.

Finally let us make an important remark about scaling issues. In practice features are often normalized individually, i.e. \mathcal{X} = [-1,1]^n, in which case the Rademacher complexity (with \mathcal{W} = \mathrm{B}_2^n) scales with the Euclidean radius, which is \sqrt{n}. However if the examples are all sparse then in fact the dimension is replaced by the sparsity, which can make a huge difference in practice (think of the case where n is the number of words in the English dictionary, and our data points x_i are articles, which are typically much shorter than the dictionary). We see that ridge regression and LASSO are after different kinds of sparsity: the former benefits from sparsity of the examples themselves, while the latter expresses the inductive bias that only a few features matter for prediction.
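
A tiny illustration of these two inductive biases (using scikit-learn for convenience; the synthetic data and parameters are of course just for illustration): when the true weight vector is sparse, LASSO recovers a sparse solution while ridge regression spreads the weight over all coordinates.

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(0)
    m, n = 100, 50
    w_true = np.zeros(n)
    w_true[:3] = 1.0                                  # sparse ground truth: only 3 relevant features
    X = rng.standard_normal((m, n))
    y = X @ w_true + 0.1 * rng.standard_normal(m)

    ridge = Ridge(alpha=1.0).fit(X, y)
    lasso = Lasso(alpha=0.1).fit(X, y)

    print("ridge non-zeros:", np.sum(np.abs(ridge.coef_) > 1e-3))   # typically all 50 coordinates
    print("lasso non-zeros:", np.sum(np.abs(lasso.coef_) > 1e-3))   # typically close to 3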


Zero-one loss

For the zero-one loss we cannot use the contraction lemma as this loss is not Lipschitz. Instead we use Massart’s lemma which states that if A \subset \mathrm{B}_2^m then

    \[\mathbb{E}_{\epsilon}\sup_{a \in A} \left|\frac{1}{m} \sum_{i=1}^m \epsilon_i a_i\right|\leq \frac{\sqrt{2 \log |A|}}{m} .\]

Recall that in learning with the zero-one loss one has \mathcal{F} = \{(x,y) \mapsto \mathds1\{h(x) \neq y\}, h \in \mathcal{H}\}.

Thus we need to count the set of vectors \{(\mathds1\{h(x_1) \neq y_1\}, \hdots, \mathds1\{h(x_m) \neq y_m\}), h \in \mathcal{H}\}. The Vapnik-Chervonenkis dimension is introduced exactly for this task: for a hypothesis class \mathcal{H} \subset \{-1,1\}^{\mathcal{X}}, \mathrm{VC}(\mathcal{H}) is the largest integer k such that one can find a set \{x_1, \hdots, x_k\} \subset \mathcal{X} shattered by \mathcal{H}, that is |\{(h(x_1), \hdots, h(x_k)), h \in \mathcal{H}\}| = 2^k. The Sauer-Shelah lemma then gives

    \[\left|\left\{(\mathds1\{h(x_1) \neq y_1\}, \hdots, \mathds1\{h(x_m) \neq y_m\}), h \in \mathcal{H}\right\}\right| \leq \left(\frac{e m}{\mathrm{VC}(\mathcal{H})}\right)^{\mathrm{VC}(\mathcal{H})}.\]

In particular (applying Massart’s lemma to this set of vectors rescaled by 1/\sqrt{m}) we obtain:

    \[\mathrm{Rad}(\mathcal{F}) \leq \sqrt{\frac{2 \mathrm{VC}(\mathcal{H}) \log\left( \frac{e m}{\mathrm{VC}(\mathcal{H})} \right)}{m}} .\]

As a rule of thumb the VC dimension is of the order of the number of parameters defining the hypothesis class (obviously this is not always true…). For instance one always has \mathrm{VC}(\mathcal{H}) \leq \log_2(|\mathcal{H}|), and the VC dimension of halfspaces in \mathbb{R}^d is d.

Unfortunately the ERM rule for the zero-one loss on halfspaces is computationally hard to implement (on the other hand in the realizable case it can be written as an LP, we will come back to this a bit later). In practice what we do is to relax the zero-one loss by some convex surrogate \ell so that we can then write L_{\mu}^{0-1} \leq L_{\mu}^{\ell}. Thus if somehow we manage to have a small error with the surrogate loss then we also have a small error with the zero-one loss. Two important examples of surrogate losses for classification with halfspaces are the hinge loss (which gives the soft-SVM algorithm, more on that later):

    \[\ell(h_w, (x,y)) = \max(0, 1 - y \langle w, x\rangle) ,\]

and the logistic loss (which gives regularized logistic regression):

    \[\ell(h_w, (x,y)) = \log_2(1+\exp(- y \langle w, x\rangle)).\]

Using the contraction lemma from the previous section one can easily control the Rademacher complexity with those losses.


Convex learning

We say that the loss is convex if for any (x, y) \in \mathcal{X} \times \mathcal{Y} the map w \in \mathcal{W} \mapsto \ell(h_w(x), y) is convex (in particular \mathcal{W} is a convex set).

Theorem: The following situations are learnable: convex loss with either [bounded gradients, bounded \mathcal{W}], or [bounded second derivative, bounded \mathcal{W}], or [strongly convex, bounded gradients].

Just as an example assume that for any w \in \mathcal{W}, |w| \leq R, and that for any x, y, w, and any g \in \partial_w \ell(h_w(x), y) (the set of subgradients of \ell), one has |g| \leq L.

Then running one-pass SGD with step-size given by \frac{R}{L} \sqrt{\frac{2}{t}} gives R_m(SGD) \leq R L \sqrt{\frac{2}{m}} (again this is dimension independent!). Observe that if one runs multiple passes of SGD then this bound does not apply, and one is in fact implementing the ERM rule from the previous section (for which bounds can then be derived via Rademacher complexity as we showed before).
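
Here is a sketch of this one-pass projected SGD, instantiated with the hinge loss from the previous section and the step-size schedule \frac{R}{L}\sqrt{2/t}; following the standard analysis the algorithm outputs the average of the iterates (all names and the toy data are illustrative):

    import numpy as np

    def project_ball(w, R):
        norm = np.linalg.norm(w)
        return w if norm <= R else w * (R / norm)

    def one_pass_sgd_hinge(X, y, R):
        """One pass of projected SGD on ell(w; x, y) = max(0, 1 - y <w, x>), returning the averaged iterate."""
        m, n = X.shape
        L = np.max(np.linalg.norm(X, axis=1))   # Lipschitz constant of the hinge loss in w
        w, avg = np.zeros(n), np.zeros(n)
        for t in range(1, m + 1):
            x_t, y_t = X[t - 1], y[t - 1]
            g = -y_t * x_t if y_t * (w @ x_t) < 1 else np.zeros(n)   # a subgradient at w
            w = project_ball(w - (R / L) * np.sqrt(2.0 / t) * g, R)
            avg += w / m
        return avg

    # toy usage on linearly separable data
    rng = np.random.default_rng(0)
    m, n = 2000, 20
    w_star = rng.standard_normal(n)
    X = rng.standard_normal((m, n))
    y = np.sign(X @ w_star)
    w_hat = one_pass_sgd_hinge(X, y, R=5.0)
    print("training zero-one error:", np.mean(np.sign(X @ w_hat) != y))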


Some stuff I learned this week

This week I had the pleasure to spend 3 days at a wonderful workshop (disclaimer: I was part of the organization, with Ofer Dekel, Yuval Peres and James Lee, so I might be a bit biased). Below you will find three results that I particularly enjoyed (one from each day). Obviously many more topics were covered at the workshop, from finding correlations in subquadratic time, to the basic problem of learning halfspaces (see for instance these two COLT’15 papers, by Daniely and by Awasthi, Balcan, Haghtalab and Urner), or how stability can explain why deep nets trained with SGD are not overfitting (upcoming work by Hardt-Recht-Singer; also on a related note I learned from Yuval about the Kolmogorov-Arnold representation theorem, which seems very much related to the famous Barron 1990 result on the universal approximation properties of neural networks with 1 hidden layer). Finally two more things: (i) we also have a blog where we collected all the open problems discussed during the workshop, and (ii) many of the talks have been recorded (including the ones I mention below, as well as the nice tutorial by Sergiu Hart on dynamics in games). I will post the link to the videos as soon as we upload them.


Negative association

Robin Pemantle gave an excellent tutorial on negative association, see the slides here. The punchline can be summarized pretty succinctly. Consider a measure \mu on \{0,1\}^n. We say that \mu is positively correlated if for any f,g monotone increasing one has \int f g d\mu \geq \int f d\mu \int g d\mu. On the other hand we say that \mu is negatively correlated if for any f,g monotone increasing and such that f and g depend on disjoint subsets of variables, one has \int f g d\mu \leq \int f d\mu \int g d\mu. Positive correlation is particularly easy to check thanks to the famous FKG theorem, which states that if \mu is log-supermodular (i.e. - \log(\mu) is a submodular function) then \mu is positively correlated (this is nice because submodularity is a local condition). One may hope that log-submodular distributions are negatively correlated, but unfortunately this is not true in general. The route to an easily checkable condition for negative correlation is a bit more tortuous and it goes as follows. First observe that negative association clearly implies pairwise negative correlation, that is for any i \neq j and X = (X_1, \hdots, X_n) \sim \mu,

    \[\mathbb{E} X_i X_j \leq \mathbb{E} X_i \mathbb{E} X_j .\]

Writing this inequality in terms of the generating function F(x) = \mathbb{E} \prod_{k=1}^n x_k^{X_k} one has (with \mathds{1}=(1,\hdots,1)):

    \[F(\mathds{1}) \frac{\partial^2 F}{\partial x_i \partial x_j}(\mathds{1}) \leq \frac{\partial F}{\partial x_i}(\mathds{1}) \frac{\partial F}{\partial x_j}(\mathds{1}) .\]

One says that \mu is strongly Rayleigh if the above equation holds true for any x \in \mathbb{R}^n instead of just at x= \mathds{1}. It turns out that this condition is more restrictive than negative correlation, but it is also much more tractable, in particular because it is equivalent to the fact that F is a stable polynomial (that is F does not have any roots in \mathbb{H}^n where \mathbb{H} is the complex upper half plane). Once you know that latter equivalence it becomes trivial to check that determinantal measures are negatively correlated since they are strongly Rayleigh (just use the fact that the characteristic polynomial of a symmetric matrix only has roots on the real line). A particularly useful property of strongly Rayleigh measures is that they have the usual Gaussian tails for any Lipschitz functional, just like product measures (this was conjectured by Elchanan Mossel, and recently proved by Robin Pemantle and Yuval Peres). Recently this set of measures has also been useful in theoretical computer science, see the recent work of Shayan Oveis Gharan and his co-authors (where they use negative correlation properties of random spanning trees). Finally I briefly mention that the sampling problem for negatively associated measures is quite a rich question, see the slides by Robin and also this COLT’15 paper by Rebeschini and Karbasi.
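
As a tiny numerical illustration of the determinantal case (a sketch using the standard convention that \mathbb{P}(X_i = X_j = 1) is the determinant of the corresponding 2 \times 2 submatrix of the marginal kernel K): pairwise negative correlation then amounts to \mathbb{E} X_i X_j - \mathbb{E} X_i \mathbb{E} X_j = -K_{ij}^2 \leq 0.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 6
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))
    K = U @ np.diag(rng.uniform(0, 1, n)) @ U.T          # marginal kernel: symmetric, eigenvalues in [0, 1]

    for i in range(n):
        for j in range(i + 1, n):
            pair = np.linalg.det(K[np.ix_([i, j], [i, j])])   # E[X_i X_j] for the determinantal measure
            assert pair - K[i, i] * K[j, j] <= 1e-12          # equals -K_ij^2 <= 0
    print("all pairs are negatively correlated")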


Learning in networks

Elchanan Mossel talked about a beautiful set of open questions. The basic problem goes as follows: let G be a graph, and assume that each vertex is a player that receives an independent random variable distributed according to some measure. The measure used to produce the random variables is the same for all vertices, and it is either \nu_0 or \nu_1 (say the players also know that the choice I \in \{0,1\} of the measure was a coin flip). Question: assuming that at the end of each day the players reveal their conditional expectation for I (given all the information they currently have) to their neighbors in G, how many days do we need to reach an agreement? If convergence happens, is the conditional expectation that is reached the optimal one (that is the conditional expectation given all the random variables)? What if instead of revealing your belief you only reveal which of I=0 and I=1 is the most likely to you (the revealed action game)? Mossel-Sly-Tamuz show that both in the revealed belief and the revealed action game there will eventually be an agreement. Furthermore in the revealed beliefs game this will be close to the optimal posterior provided that G is large enough. Both of these theorems currently lack a convergence rate.


The complexity of stochastic block models

The stochastic block model is a canonical model of a random graph with communities. The general version of this model is denoted \mathrm{SBM}(n,p,W) where n \in \mathbb{N} is the number of vertices, p \in [0,1]^k represents the proportion of each community, and W \in [0,1]^{k \times k} represents the matrix of connection probabilities. More precisely the random graph \mathrm{SBM}(n,p,W) is defined as follows: first each vertex i \in [n] is assigned a latent label \sigma_i \in [k], where the latent label vector \sigma \in [k]^n is drawn uniformly at random conditionally on the fact that the number of vertices with label s \in [k] is exactly p_{s} n. The edges are then drawn independently (conditionally on \sigma), and each edge \{i,j\} is present with probability W_{\sigma_i, \sigma_j}. This model has received a LOT of attention in the last thirty years. The basic objective is to recover the latent label vector \sigma from the observation of the unlabelled graph \mathrm{SBM}(n,p,W). The following very recent result by Emmanuel Abbe and Colin Sandon answers the most basic question one can ask: when is this objective achievable? First we introduce the Abbe-Sandon divergence for vectors x, y \in \mathbb{R}^k:

    \[D(x,y) = \max_{t \in [0,1]} \sum_{i=1}^k (t x_i + (1-t) y_i - x_i^t y_i^{1-t}) .\]

Theorem (Abbe-Sandon 2015): Asymptotically almost sure exact recovery of the latent label vector in \mathrm{SBM}(n,p, Q \frac{\log(n)}{n}) is possible if and only if for any i, j \in [k], i \neq j,

    \[D \big( (\mathrm{diag}(p) Q)_i, (\mathrm{diag}(p) Q)_j \big) \geq 1 .\]

In their paper Abbe and Sandon also give an efficient algorithm which achieves the information theoretic limit given above. It remains an outstanding open problem to obtain such a precise picture for the problem of weak recovery where one asks to recover correctly only a fraction 1-\epsilon of the latent label vector. This question is particularly interesting in the sparse regime where the average degree is constant (note that in the above theorem the average degree diverges at the rate \log(n)). Recently a lot of progress has been made in this mathematically challenging regime too, see in particular the papers by Mossel-Neeman-Sly, Massoulie and Bordenave-Lelarge-Massoulie.
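
Here is a short numpy sketch (names and parameters are illustrative) that evaluates the Abbe-Sandon divergence by a grid search over t and checks the recovery criterion of the theorem above for a symmetric two-community example with within/across connectivities a \log(n)/n and b \log(n)/n:

    import numpy as np

    def abbe_sandon_divergence(x, y, grid=1001):
        """D(x, y) = max_{t in [0,1]} sum_i (t x_i + (1-t) y_i - x_i^t y_i^(1-t)), via grid search over t."""
        t = np.linspace(0.0, 1.0, grid)[:, None]
        vals = np.sum(t * x + (1 - t) * y - x**t * y**(1 - t), axis=1)
        return np.max(vals)

    p = np.array([0.5, 0.5])
    a, b = 9.0, 1.0
    Q = np.array([[a, b], [b, a]])
    rows = np.diag(p) @ Q
    crit = abbe_sandon_divergence(rows[0], rows[1])      # here equal to (sqrt(a) - sqrt(b))^2 / 2 = 2
    print(crit, "exact recovery possible" if crit >= 1 else "exact recovery impossible")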


A solution to bandit convex optimization

Ronen Eldan and I have just uploaded to the arXiv our newest paper which finally proves that for online learning with bandit feedback, convex functions are not much harder than linear functions. The quest for this result started in 2004 with a paper by Robert Kleinberg, and one by Flaxman, Kalai, McMahan, and over the years it has become the flagship open problem in bandit theory (see for instance Hazan and Levy 2015 where they refer to the attainable regret for convex bandit as the most important open question in bandit learning).

Our result with Ronen however is far from the end of the story for this problem: currently our dependency on the dimension n is n^{11}, which can certainly be improved, and even more importantly we do not yet know an algorithm which attains our regret bound! Indeed our proof follows the trail of my COLT’15 paper with Ofer Dekel, Tomer Koren and Yuval Peres where we suggest to attack the dual Bayesian problem (in the latter paper we solved this problem in dimension one). Thus finding an algorithm with \mathrm{poly}(n) \sqrt{T} regret for bandit convex optimization remains as open as before, but now at least we know that such an algorithm exists!!

Along the way of our new proof we solve an interesting property testing problem on convex functions, which we feel could have applications beyond convex bandits. The problem can be described as follows. Assume that you are given a non-negative convex function f. Now suppose that there is some unknown convex function g defined on the same domain as f and for which you would like to know if g is equal to f. I also guarantee you that if g is not equal to f then g takes a negative value somewhere on its domain (say even less than - \epsilon). You can obtain information on g through noisy queries, that is you can ask what is the value of g at x and you obtain g(x) + \xi where \xi is a standard Gaussian. Can you design a non-adaptive query strategy such that after at most \mathrm{poly}(n) / \epsilon^2 queries (with possibly extra logarithmic factors) you will be able to correctly decide if g equals f or not with probability at least 1- \epsilon? In the paper we give a positive answer by constructing a map f \mapsto \mu where \mu is a distribution supported on the domain of f which “explores f at every scale simultaneously”. At this point this map is still somewhat mysterious, and there are many open questions around it (can you compute the map efficiently? can this be useful beyond Bayesian convex bandits? can you find a better construction than ours? what is the optimal number of queries?). See the paper for all the details!


Revisiting Nesterov’s Acceleration

Nesterov’s accelerated gradient descent (AGD) is hard to understand. Since Nesterov’s 1983 paper people have tried to explain “why” acceleration is possible, with the hope that the answer would go beyond the mysterious (but beautiful) algebraic manipulations of the original proof. This search for a good “why” has accelerated in recent years under the impetus of the machine learning community, as one would like to generalize Nesterov’s acceleration in various ways to deal with machine learning applications. Obviously generalizing is difficult when even the basic phenomenon is not well understood. Recent works in the “why” direction re-interpret the AGD algorithm from different points of view:

– Allen-Zhu and Orecchia view AGD as a linear coupling of Gradient Descent and Mirror Descent. This gives a nice new proof where the acceleration results from a primal-dual point of view on the progress of the algorithm.

– Su, Boyd and Candes (building on earlier work of O’Donoghue and Candes) view AGD as the discretization of a certain second-order ODE. Other papers in a somewhat similar direction include the work of Lessard, Recht and Packard; as well as Flammarion and Bach. Note that these papers do not give alternative proofs of acceleration for general smooth functions.

– Other relevant references include a recent paper of Arjevani, Shalev-Shwartz and Shamir, and a blog post by Hardt.

In joint work with Yin-Tat Lee and Mohit Singh we discovered a simple geometric reason for the possibility of acceleration, which can be turned into a new algorithm with the same theoretical guarantees as AGD, and improved performance in practice. The proof of the convergence rate is 3 lines, assuming a completely elementary geometric fact (in dimension 2) which itself takes less than half a page to prove. The general principle is loosely inspired by the ellipsoid method (except that we work with balls, which makes the algorithm much more efficient). Let me guide you through this.


Preliminaries

Ok so we are trying to minimize on \mathbb{R}^n a function f which is \alpha-strongly convex and \beta-smooth (see here for definitions). Denote x^* the minimizer of f, \kappa = \beta / \alpha the condition number of f, \mathrm{B}(x, r^2) for a Euclidean ball centered at x and of radius squared r^2,

    \[x^+ = x - \frac{1}{\beta} \nabla f(x), \;\; x^{++} = x - \frac{1}{\alpha} \nabla f(x) ,\]

which are respectively a standard and a long step of gradient descent. The crucial observation is that the local information at x and the assumptions on f allow us to enclose the minimizer in some ball, precisely:

    \[x^* \in \mathrm{B}\left(x^{++}, \frac{|\nabla f(x)|^2}{\alpha^2} \left(1 - \frac{1}{\kappa}\right) - \frac{2}{\alpha} (f(x^+) - f(x^*)) \right) .\]
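
This enclosure is easy to check numerically; here is a small sketch (on a random strongly convex quadratic, where x^*, \alpha and \beta are known exactly, and all names are illustrative) verifying that x^* indeed lies in the stated ball for random query points x:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10
    A = rng.standard_normal((n, n))
    H = A.T @ A + np.eye(n)                      # Hessian of the quadratic, positive definite
    x_star = rng.standard_normal(n)
    f = lambda x: 0.5 * (x - x_star) @ H @ (x - x_star)
    grad = lambda x: H @ (x - x_star)
    eigs = np.linalg.eigvalsh(H)
    alpha, beta = eigs[0], eigs[-1]              # strong convexity and smoothness parameters
    kappa = beta / alpha

    for _ in range(1000):
        x = x_star + rng.standard_normal(n)
        g = grad(x)
        x_p = x - g / beta                       # the gradient step x^+
        x_pp = x - g / alpha                     # the long step x^{++}
        radius_sq = (g @ g) / alpha**2 * (1 - 1 / kappa) - (2 / alpha) * (f(x_p) - f(x_star))
        assert np.sum((x_star - x_pp) ** 2) <= radius_sq + 1e-9
    print("x* always lies in the claimed ball")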

Finally we need some facts on the intersection of balls. Take a ball with radius squared R^2, and intersect it with some other ball which has the center of the previous ball on its boundary. This intersection is obviously contained in a ball of radius squared R^2. An easy calculation shows that if the radius squared of one ball shrinks by a factor 1-\epsilon then the intersection is included in a ball of radius squared (1-\epsilon) R^2. Furthermore if both balls shrink (additively) by the same amount (which is an \epsilon fraction of the radius squared for one of the balls), then the intersection is included in a ball of radius squared (1-\sqrt{\epsilon}) R^2. As we will see, this fact is the reason why one can accelerate.

A suboptimal algorithm

Assume that we are given x \in \mathbb{R}^n such that x^* \in \mathrm{B}(x,R^2). As we saw before one also has

    \[x^* \in \mathrm{B}\left(x - \frac{1}{\alpha} \nabla f(x), \frac{|\nabla f(x)|^2}{\alpha^2} \left(1 - \frac{1}{\kappa}\right) \right) .\]

Taking the intersection of these two balls and using the fact from the previous section we see that there exists x' such that

    \[x^* \in \mathrm{B}\left(x', \left(1 - \frac{1}{\kappa}\right) R^2\right) .\]

Thus by iterating the map from x to x' (defined implicitly above) we obtain a new algorithm with the same rate of convergence as gradient descent.


Acceleration

Now assume that the point x from the previous section and its guarantee R^2 on the distance to the optimum came from the local information at some other point x_0, that is x = x_0^{++} and R^2 = R_0^2 - \frac{2}{\alpha} (f(x_0^+) - f(x^*)) where R_0^2 = \frac{|\nabla f(x_0)|^2}{\alpha^2} \left(1 - \frac{1}{\kappa}\right). Assume furthermore that x is at least as good as a step of gradient descent at x_0, that is f(x) \leq f(x_0^+). Then we can use the local information at x to reduce the ball around x, indeed one has

    \[f(x^*) \leq f(x^+) \leq f(x) - \frac{1}{2 \beta} |\nabla f(x)|^2 \leq f(x_0^+) - \frac{1}{2 \beta} |\nabla f(x)|^2 ,\]

and thus

    \[x^* \in \mathrm{B}\left(x, R_0^2 - \frac{|\nabla f(x)|^2}{\alpha^2 \kappa} \right) .\]

Furthermore, as in the last section, one has

    \[x^* \in \mathrm{B}\left(x - \frac{1}{\alpha} \nabla f(x), \frac{|\nabla f(x)|^2}{\alpha^2} \left(1 - \frac{1}{\kappa}\right) \right) .\]

Taking the intersection of the above balls and enclosing it in another ball one obtains (see preliminaries) that there exists x' such that

    \[x^* \in \mathrm{B}\left(x', \left(1 - \frac{1}{\sqrt{\kappa}}\right) R_0^2 \right) .\]

The only issue at this point is that to iterate the map from x to x' one would need that x' is at least as good as a step of gradient descent on x, which is not necessarily true (just as the assumption f(x_0^{++}) \leq f(x_0^+) is not necessarily true). We circumvent this difficulty by doing a line search on the line going through x' and x^+. The details (which are straightforward given the above discussion) are in the paper, specifically in Section 3. We also report encouraging preliminary experimental results.


COLT 2015 accepted papers and some cool videos

Like last year I compiled a list of the COLT 2015 accepted papers together with links to the arxiv version whenever I could find one. These papers were selected from 180 submissions, a number that keeps rising in recent years (of course this is true for all the major learning conferences, for instance ICML had over 1000 submissions this year). This strong program, together with a pretty good location (Paris, 5th), should make COLT 2015 quite attractive! Also, following the trend of COLT 2013 and COLT 2014, we will have some “pre-COLT” activity, with an optimization workshop co-organized by Vianney Perchet (also the main organizer for COLT itself) and Patrick Louis Combettes.

On a completely different topic, I wanted to share some videos which many readers of this blog will enjoy. These are the videos of the 2015 Breakthrough Prize in Mathematics Symposium, with speakers (and prize winners) Jacob Lurie, Terence Tao, Maxim Kontsevich, Richard Taylor, and Simon Donaldson. They were asked to give talks to a general audience, and they succeeded at different levels. Both Taylor and Lurie took this request very seriously, perhaps a bit too much even, and their talks (here and here) are very elementary (yet still entertaining!). Tao talks about the Polymath projects, and the video can be safely skipped unless you have never heard of Polymath. I understood nothing of Kontsevich’s talk (it’s pretty funny to think that his talk was prepared with the guideline of aiming at a general audience). My favorite talk by far was the one by Donaldson. Thanks to him I finally understand what the extra 7 unobserved dimensions of our universe could look like! There is also a panel discussion led by Yuri Milner with the 5 mathematicians. Unfortunately the questions are a bit dull, so there is not much that the panelists can do to make this interesting. Yet there are a few gems in the answers, such as Tao claiming that *universality* (such as in the Central Limit Theorem) is behind the unreasonable effectiveness of mathematics in physics, and Kontsevich replying to Tao that this is a valid point at the macroscopic level, but that the fact that mathematics works so well at a microscopic level (e.g., quantum mechanics) makes him question whether we live in a simulation. Kontsevich also says that there is no fundamental obstacle to building an A.I., and he even claims that he gave some thought to this problem, though I could not find any paper written by him on this matter.

COLT 2015 accepted papers

– Arpit Agarwal and Shivani Agarwal. On Consistent Surrogate Risk Minimization and Property Elicitation
– Noga Alon, Nicolò Cesa-Bianchi, Ofer Dekel and Tomer Koren. Online Learning with Feedback Graphs: Beyond Bandits
– Sanjeev Arora, Rong Ge, Tengyu Ma and Ankur Moitra. Simple, Efficient, and Neural Algorithms for Sparse Coding
– Pranjal Awasthi, Maria Florina Balcan, Nika Haghtalab and Ruth Urner. Efficient Learning of Linear Separators under Bounded Noise
– Pranjal Awasthi, Moses Charikar, Kevin Lai and Andrej Risteski. Label optimal regret bounds for online local learning
– Maria-Florina Balcan, Avrim Blum and Santosh Vempala. Efficient Representations for Lifelong Learning and Autoencoding
– Akshay Balsubramani and Yoav Freund. Optimally Combining Classifiers Using Unlabeled Data
– Peter Bartlett, Wouter Koolen, Alan Malek, Eiji Takimoto and Manfred Warmuth. Minimax Fixed-Design Linear Regression
– Alexandre Belloni, Tengyuan Liang, Hari Narayanan and Alexander Rakhlin. Escaping the Local Minima via Simulated Annealing: Optimization of Approximately Convex Functions
– Sébastien Bubeck, Ofer Dekel, Tomer Koren and Yuval Peres. Bandit Convex Optimization: sqrt{T} Regret in One Dimension
– Yang Cai, Constantinos Daskalakis and Christos Papadimitriou. Optimum Statistical Estimation with Strategic Data Sources
– Nicolo Cesa-Bianchi, Yishay Mansour and Ohad Shamir. On the Complexity of Learning with Kernels
– Yuxin Chen, Hamed Hassani, Amin Karbasi and Andreas Krause. Sequential Information Maximization: When is Greedy Near-optimal?
– Hubie Chen and Matt Valeriote. Learnability of Solutions to Conjunctive Queries: The Full Dichotomy
– Dehua Cheng, Yu Cheng, Yan Liu, Richard Peng and Shang-Hua Teng. Efficient Sampling for Gaussian Graphical Models via Spectral Sparsification
– Corinna Cortes, Vitaly Kuznetsov, Mehryar Mohri and Manfred Warmuth. On-Line Learning Algorithms for Path Experts with Non-Additive Losses
– Rachel Cummings, Stratis Ioannidis and Katrina Ligett. Truthful Linear Regression
– Gautam Dasarathy, Robert Nowak and Xiaojin Zhu. $S^2$: An Efficient Graph Based Active Learning Algorithm with Application to Nonparametric Classification
– Miroslav Dudik, Katja Hofmann, Robert E. Schapire, Aleksandrs Slivkins and Masrour Zoghi. Contextual Dueling Bandits
– Miroslav Dudík, Robert Schapire and Matus Telgarsky. Convex Risk Minimization and Conditional Probability Estimation
– Justin Eldridge, Mikhail Belkin and Yusu Wang. Beyond Hartigan Consistency: Merge Distortion Metric for Hierarchical Clustering
– Moein Falahatgar, Ashkan Jafarpour, Alon Orlitsky, Venkatadheeraj Pichapati and Ananda Theertha Suresh. Faster Algorithms for Testing under Conditional Sampling
– Uriel Feige, Yishay Mansour and Robert Schapire. Learning and inference in the presence of corrupted inputs
– Nicolas Flammarion and Francis Bach. From Averaging to Acceleration, There is Only a Step-size
– Dean Foster, Howard Karloff and Justin Thaler. Variable Selection is Hard
– Rafael Frongillo and Ian Kash. Vector-Valued Property Elicitation
– Roy Frostig, Rong Ge, Sham Kakade and Aaron Sidford. Competing with the Empirical Risk Minimizer in a Single Pass
– Pierre Gaillard and Sébastien Gerchinovitz. A Chaining Algorithm for Online Nonparametric Regression
– Nicolas Goix, Anne Sabourin and Stéphan Clémençon. Learning the dependence structure of rare events: a non-asymptotic study
– Zaid Harchaoui, Anatoli Juditsky, Arkadi Nemirovski and Dmitry Ostrovsky. Adaptive recovery of signals by convex optimization
– Sam Hopkins, Jonathan Shi and David Steurer. Tensor principal component analysis
– Prateek Jain and Praneeth Netrapalli. Fast Exact Matrix Completion with Finite Samples
– Majid Janzamin, Animashree Anandkumar and Rong Ge. Learning Overcomplete Latent Variable Models through Tensor Methods
– Parameswaran Kamalaruban, Robert Williamson and Xinhua Zhang. Exp-Concavity of Proper Composite Losses
– Sudeep Kamath, Alon Orlitsky, Venkatadheeraj Pichapati and Ananda Theertha Suresh. On Learning Distributions from their Samples
– Varun Kanade and Elchanan Mossel. MCMC Learning
– Zohar Karnin and Edo Liberty. Online PCA with Spectral Bounds
– Junpei Komiyama, Junya Honda, Hisashi Kashima and Hiroshi Nakagawa. Regret Lower Bound and Optimal Algorithm in Dueling Bandit Problem
– Samory Kpotufe, Ruth Urner and Shai Ben-David. Hierarchical label queries with data-dependent partitions
– Rasmus Kyng, Anup Rao, Sushant Sachdeva and Daniel A. Spielman. Algorithms for Lipschitz Learning on Graphs
– Jan Leike and Marcus Hutter. Bad Universal Priors and Notions of Optimality
– Tengyuan Liang, Alexander Rakhlin and Karthik Sridharan. Learning with Square Loss: Localization through Offset Rademacher Complexity
– Haipeng Luo and Robert Schapire. Achieving All with No Parameters: Adaptive NormalHedge
– Mehrdad Mahdavi, Lijun Zhang and Rong Jin. Lower and Upper Bounds on the Generalization of Stochastic Exponentially Concave Optimization
– Konstantin Makarychev, Yury Makarychev and Aravindan Vijayaraghavan. Correlation Clustering with Noisy Partial Information
– Issei Matsumoto, Kohei Hatano and Eiji Takimoto. Online Density Estimation of Bradley-Terry Models
– Behnam Neyshabur, Ryota Tomioka and Nathan Srebro. Norm-Based Capacity Control in Neural Networks
– Christos Papadimitriou and Santosh Vempala. Cortical Learning via Prediction
– Richard Peng, He Sun and Luca Zanetti. Partitioning Well-Clustered Graphs: Spectral Clustering Works!
– Vianney Perchet, Philippe Rigollet, Sylvain Chassang and Erik Snowberg. Batched Bandit Problems
– Patrick Rebeschini and Amin Karbasi. Fast Mixing for Discrete Point Processes
– Mark Reid, Rafael Frongillo, Robert Williamson and Nishant Mehta. Generalized Mixability via Entropic Duality
– Hans Simon. An Almost Optimal PAC Algorithm
– Jacob Steinhardt and John Duchi. Minimax rates for memory-bounded sparse linear regression
– Christos Thrampoulidis, Samet Oymak and Babak Hassibi. Regularized Linear Regression: A Precise Analysis of the Estimation Error
– Santosh Vempala and Ying Xiao. Max vs Min: Tensor Decomposition and ICA with nearly Linear Sample Complexity
– Huizhen Yu and Richard Sutton. On Convergence of Emphatic Temporal-Difference Learning
Posted in Conference/workshop | 3 Comments

Deep stuff about deep learning?

Unless you live a secluded life without internet (in which case you’re not reading these lines), odds are that you have heard and read about deep learning (such as in this 2012 article in the New York Times, or Wired’s article about the new Facebook AI lab).

So what is deep learning? Well, from my current limited understanding, it seems to be a pretty simple supervised learning technique which can be explained in a few lines. Of course, simple problems can be very deep (ever heard of the Collatz conjecture?), and it looks like this may be the case with deep learning. After explaining what deep learning is (or more accurately what my understanding of deep learning is), I will point out some recent theoretical progress towards understanding the basic phenomena at play. Let me be clear right away: it seems to me that no one has a real clue about what’s going on.


Deep learning: it’s all about the features

Ok, so, you have a data set (X_i, Y_i)_{1 \leq i \leq n} with X_i \in \mathbb{R}^d and Y_i \in \{1, \hdots, M\}. Typically X_i is an image, and Y_i is say 9 if there is a penguin in this image. A very efficient algorithm to learn a mapping from X to Y is to use some kind of linear classification, such as multi-class logistic regression (see also this post for some background on linear classification). However applying a linear classification directly on X typically does not work very well, and this is why we usually first map the point X to \Phi(X), where \Phi is some feature map. Designing \Phi has been part of the art of machine learning, and the idea behind deep learning is to incorporate the problem of finding a good map as part of the learning process (again in the previous post on linear classification I also discussed this possibility in the context of SVM, though we ended up with a semi-definite program which does not scale well with large data sets).

Ok so what kind of \Phi do we look at in deep learning? Well, it is of the following form, for some sequence of L matrices W_1, \hdots, W_L, with W_k \in \mathbb{R}^{n_k \times n_{k+1}'}:

    \[\Phi(X) = X^{(L+1)}, \ \text{where} \ X^{(k+1)} = \sigma_k(W_k^{\top} X^{(k)}) \ \text{with} \ X^{(1)} = X .\]

The (non-linear) mappings \sigma_k : \mathbb{R}^{n_{k+1}'} \rightarrow \mathbb{R}^{n_{k+1}} are fixed and part of the design choice (i.e., they are not learned). The parameters L, n_1, n_1', \hdots, n_{L+1}, n_{L+1}' are also specified beforehand. Furthermore one usually imposes extra constraints on the matrices W_1, \hdots, W_L, such as having all columns bounded in \ell_2-norm (say by 1).
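
To make the shape of \Phi concrete, here is a minimal numpy sketch of this forward map. The layer sizes, the random initialization, and the choice of ReLU for every \sigma_k are made-up illustrations, not the architecture discussed below.

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)  # coordinate-wise positive part

    # Hypothetical sizes: d = 32 input dimension, L = 3 layers, n_k = n_k' throughout.
    d = 32
    sizes = [d, 64, 64, 10]
    rng = np.random.default_rng(0)
    Ws = [rng.standard_normal((sizes[k], sizes[k + 1])) for k in range(3)]
    # Normalize columns to unit ell_2 norm, which satisfies the constraint mentioned above.
    Ws = [W / np.linalg.norm(W, axis=0, keepdims=True) for W in Ws]

    def phi(X, Ws):
        """Feature map: X^{(k+1)} = sigma_k(W_k^T X^{(k)}), with X^{(1)} = X."""
        for W in Ws:
            X = relu(W.T @ X)
        return X

    X = rng.standard_normal(d)  # a fake input point
    print(phi(X, Ws).shape)     # (10,)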

The game now is simply to incorporate the matrices W_1, \hdots, W_L as free variables when one performs the optimization of the multiclass logistic loss. This optimization is always performed with Stochastic Gradient Descent (or a simple variant of it, such as mini-batch SGD, see this post and Section 6.2 in my survey): take an example in your data set (X_i, Y_i); compute the gradient of the logistic loss at this example (with respect to the matrices W_1, \hdots, W_L and the logistic regression weights), and do a small step in the opposite direction of the gradient.
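
For illustration, here is what one such SGD step could look like in numpy. To keep the manual gradient computation short I use a single hidden layer and the squared loss rather than the multiclass logistic loss, so this is only a hedged sketch of the idea, not the actual training procedure.

    import numpy as np

    rng = np.random.default_rng(0)
    d, h, lr = 32, 64, 0.01            # input dimension, hidden width, learning rate (made up)
    W1 = rng.standard_normal((d, h)) / np.sqrt(d)
    w2 = rng.standard_normal(h) / np.sqrt(h)

    def sgd_step(x, y):
        """One SGD step on a single example (x, y): squared loss, one hidden ReLU layer."""
        global W1, w2
        z = W1.T @ x                   # pre-activations
        a = np.maximum(z, 0.0)         # ReLU
        err = w2 @ a - y               # derivative of 0.5 * (prediction - y)^2
        grad_w2 = err * a
        grad_z = err * w2 * (z > 0)    # back-propagate through the ReLU
        grad_W1 = np.outer(x, grad_z)
        w2 -= lr * grad_w2             # small step in the opposite direction of the gradient
        W1 -= lr * grad_W1

    for _ in range(1000):              # one pass over a fake data set
        x = rng.standard_normal(d)
        sgd_step(x, y=np.sin(x[0]))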


Design choices

Part of the success of deep learning is in the choices left open above, such as the choice of the non-linear maps, and the set of constraints on the matrices. Let me partly describe some of these choices that Krizhevsky, Sutskever and Hinton used for their computer vision breakthrough where they basically halved the error rate of the competition on the ImageNet dataset (the details can be found here).

First they take L=8. Then \sigma_6, \sigma_7 and \sigma_8 are the so-called ReLU function (Rectified Linear Unit), which simply takes the positive part coordinate-wise. On the other hand \sigma_1, \hdots, \sigma_5 are compositions of ReLU with max-pooling. The max-pooling is a dimension reduction operation, which simply takes 4 adjacent coordinates (in the case of an image, 4 adjacent pixels) and returns the maximal element. Regarding the constraints on the matrices, W_6, W_7 and W_8 are left unconstrained (besides perhaps the energy control on the columns), while W_1, \hdots, W_5 are highly structured convolutional networks, or ConvNets as they are usually called. Essentially this means that in these matrices, the columns k d + 1 up to (k+1)d are translations of some very sparse vector. Precisely, for W_1, they take n_2'= 75*d and each sparse vector corresponds to a 5x5 image. In other words the vector W_1 X corresponds to a collection of 75 images, each one of them being the convolution of X with a small 5x5 patch (whose weights are being learned).
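
For concreteness, here is a small numpy sketch of the two building blocks just mentioned: 2x2 max-pooling, and a convolution written as a matrix whose columns are translations of one small patch. The sizes are made up and this only illustrates the structure, it is not the actual ImageNet architecture.

    import numpy as np

    def max_pool_2x2(img):
        """Max-pooling: each 2x2 block of pixels is replaced by its maximum."""
        h, w = img.shape
        return img[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

    def conv_as_matrix(side, patch):
        """Matrix whose columns are translations of a single sparse patch; multiplying
        the flattened image by it computes a convolution with that patch."""
        d, k = side * side, patch.shape[0]
        W = np.zeros((d, d))
        for r in range(side - k + 1):
            for c in range(side - k + 1):
                col = np.zeros((side, side))
                col[r:r + k, c:c + k] = patch
                W[:, r * side + c] = col.ravel()
        return W

    rng = np.random.default_rng(0)
    img = rng.standard_normal((8, 8))
    patch = rng.standard_normal((5, 5))      # a 5x5 patch whose weights would be learned
    W = conv_as_matrix(8, patch)
    out = W.T @ img.ravel()                  # one "channel" of the convolution
    print(max_pool_2x2(img).shape, out.shape)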

While this post is not about the history of deep learning (this has been discussed at length pretty much everywhere on the internet), I still would like to point out that the apparently arbitrary choices behind max-pooling and ConvNets are due to Yann LeCun, who has experimented with these ideas for many, many years.


Some mysterious steps

So far I would say that the method described above kind of makes sense (apart from the max-pooling operation which I really do not understand). But things are now going to turn out a bit more wacky than you might have expected. First note that there is one key element missing from my description of the learning algorithm: what does “small” mean in the sentence “take a small step in the opposite direction of the gradient”? If you know about convex optimization and machine learning, then you know that a lot of work usually goes into understanding precisely what kind of learning rate one should take, and in particular at which rate one should decrease it over time. In deep learning all of this theory goes out of the window since the objective function is not convex (note that there is some convexity in the sense that the objective is convex in W_L – thanks to Nicolas Le Roux for fixing a mistake in the first version of the previous sentence). Here it seems that the current state of the art is to start with some guess, say 0.01, then look at how the training error decays as more examples are being processed, and if the training error plateaus then just divide the learning rate by 10.
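
In code, this folk heuristic might look like the following sketch; the starting value, the plateau test, and the factor 10 are just the rules of thumb mentioned above, not a principled schedule.

    lr = 0.01                  # initial guess for the learning rate
    history = []               # training error, measured every so often

    def maybe_decay(history, lr, window=5, tol=1e-3):
        """Divide the learning rate by 10 when the training error has plateaued,
        i.e. has not improved by more than tol over the last `window` measurements."""
        if len(history) >= window and history[-window] - history[-1] < tol:
            return lr / 10.0
        return lr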

Another point which seems key to the success of such an architecture is to control overfitting. Again for a convex problem we know that one pass through the data with SGD and the appropriate learning rate would work just fine in terms of generalization error. In deep learning the algorithm makes multiple passes over the data, and furthermore the problem is non-convex. This strongly suggests that overfitting will be a real problem, especially given the number of free parameters. Thankfully Hinton and his colleagues invented a quite smart technique, called “dropout”. The idea is simply to take the gradient with respect to only half of the columns of the matrices W_1, \hdots, W_L, where the half is chosen at random at each iteration. In the network view of the architecture this amounts to “dropping” each unit with probability 1/2. Intuitively this forces the network to be more robust, and to “repeat” itself, thus avoiding overfitting.
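
A hedged sketch of the idea, phrased exactly as in the description above (drop each column independently with probability 1/2 before taking the gradient step); details such as the rescaling used at test time are omitted.

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout_columns(W, p=0.5):
        """Zero out each column of W independently with probability p; the gradient step
        at this iteration is then taken only with respect to the surviving columns."""
        keep = rng.random(W.shape[1]) > p
        return W * keep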


Hacking through life

Of course, if you want to beat everyone else in the world at something, you better have every part of your system completely optimized. In that respect the deep learning community took the implementation of the above ideas to the next level by using GPUs, thus making the computation of the gradients much faster (by the way, the ReLU non-linearity also helps here, compared to more traditional non-linearities such as sigmoids). Various other tricks are being used, such as artificially expanding the data set by altering the X_i (e.g., by translating the images) or by running the network backward.


Towards an understanding of deep learning?

I think it’s fair to say that we do not understand why deep nets work at all, and we understand even less why they are doing so much better than anything else. The three main questions are: (i) why is SGD meaningful for this non-convex problem, (ii) why is dropout enough to avoid overfitting, (iii) can we explain why ConvNets are a sensible way to learn representations?

Very limited progress has been made on these questions, though I will cite some papers that at least try to attack these problems. I try to summarize each work with one sentence, see the papers for more details (and in particular the details will reveal that these papers do not make direct progress on the above questions).

– Sanjeev Arora and his co-authors proved that a “random and sparse” deep net can be learned in polynomial time.

– Roi Livni, Shai Shalev-Shwartz and Ohad Shamir obtain some results on the approximation power of a neural network with a non-linearity of the form \sigma(x) = x^2.

– Gerard Ben Arous, Yann LeCun and their students draw an interesting connection between fully connected neural networks (such as the last 3 layers in the above architecture) and spherical spin glasses.

– Stephane Mallat gives some insights on (iii) via the scattering transform. Arnab Paul and Suresh Venkat borrow ideas from group theory to shed some light on (iii).

– Francis Bach proves some generalization bounds for neural networks with one hidden layer.


I apologize for misrepresenting deep learning so badly in this post. I know I missed a lot, and you’re welcome to correct me in the comments!


Edit. Some further references added thanks to various comments:

– Approaches for (ii): David McAllester with PAC-Bayesian bounds, and Percy Liang and students with a link to AdaGrad.

– Approach to (iii) by a group at Facebook on the relevance of ConvNets for certain natural distributions.

– I said in the post that all optimization theory goes out of the window when the problem is non-convex. This is of course not true, and for instance a 1998 paper by Leon Bottou has results for SGD in more general scenarios. However in the non-convex case we only have very weak guarantees which basically say nothing about the global optimum.

Posted in Optimization | 17 Comments

Some pictures in geometric probability

As I discussed in a previous blog post, I have been recently interested in models of randomly growing networks. As a starting point I focused my attention on the preferential attachment rule and its variants, in part because of its ubiquity in the literature. However these models miss one key feature, namely the notion of geometry: when a newcomer arrives in the network, she can connect to any vertex, regardless of a possible spatial organization of the network and her current position in that space.

In this blog post I want to discuss some of the most famous (or infamous in some cases) spatial models of network growth. For simplicity, and also to draw pictures, everything will be described in the plane, and specifically on the lattice \mathbb{Z}^2. These models iteratively build an aggregate A_n \subset \mathbb{Z}^2 as follows: First A_0 = \{(0,0)\}. Then A_{n+1}=A_n \cup \{X_n\} where X_n is a random point in the boundary of the aggregate, that is X_n \in \partial A_n := \{x \in \mathbb{Z}^2 : \min_{y \in A_n} \|x-y\|_2 =1\}. Thus a model is defined by the probability measure one puts on the boundary of the aggregate to select the next point to add.
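
Here is a minimal Python sketch of this generic growth loop, instantiated with the uniform measure on the boundary (which is the Eden model described below); the other models differ only in how the next boundary point is selected.

    import random

    def neighbors(x):
        i, j = x
        return [(i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)]

    def grow(n_particles):
        aggregate = {(0, 0)}
        boundary = set(neighbors((0, 0)))       # points at distance 1 from the aggregate
        for _ in range(n_particles):
            x = random.choice(list(boundary))   # uniform measure on the boundary (Eden model)
            aggregate.add(x)
            boundary.discard(x)
            boundary.update(y for y in neighbors(x) if y not in aggregate)
        return aggregate

    A = grow(10000)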

All the pictures are with half a million particles, and the particles are colored as a function of their age, with blue corresponding to old particles, and red corresponding to young ones.

Eden model

The simplest model uses the uniform measure on the boundary, and it is known as the Eden model. Here is a picture:


One of the most basic results about this model is that it admits a limit shape:

Theorem: There exists a (deterministic) convex set B \subset \mathbb{R}^2 such that for any \epsilon >0,

    \[\lim_{n \to +\infty} \mathbb{P} \left( (1- \epsilon) B \subset \frac{1}{\sqrt{n}} A_n \subset (1+\epsilon) B \right) = 1.\]

It is known that B is not a Euclidean ball (this should be clear from the picture), though nobody knows what B is exactly. How do you prove such a theorem? Well it turns out to be pretty easy once you have the right machinery. The first step is to realize that the Eden model exactly corresponds to first passage percolation with exponential clocks: imagine that you have i.i.d. exponential random variables t(e) on the edges of \mathbb{Z}^2, and consider A(t) \subset \mathbb{Z}^2 to be the set of points x such that there exists a path e_1, \hdots, e_r from (0,0) to x with \sum_{s=1}^r t(e_s) \leq t. In other words A(t) is the set of points reached at time t by a fluid released at time 0 in (0,0) and with travel times on the edges given by the random variables t(e). It is an easy exercise to convince yourself that A_n has the same distribution as A(t) conditionally on |A(t)| = n (this is actually not true, one in fact needs to put the travel times on the vertices of \mathbb{Z}^2 rather than on the edges, but let me get away with that small mistake). At this point the continuous version of the above theorem can be rather easily proved via Kingman’s subadditive ergodic theorem, see the Saint Flour lecture notes by Kesten for more details.


Diffusion Limited Aggregation (DLA)

Here we consider the harmonic measure from infinity: pick a point of \mathbb{Z}^2 uniformly at random from a large circle that contains A_n, and then start a random walk from this point until it hits \partial A_n (which will happen eventually since a simple random walk in two dimensions is recurrent); let X_n be this hitting point. Here is a global picture and a zoom (the same pictures in black and white are provided at the end of this section):


Absolutely nothing is known about this model apart from the following simple result of Kesten:

    \[\limsup_{n \to +\infty} \frac{\max_{x \in A_n} \|x\|_2}{n^{2/3}} < + \infty .\]

A fascinating open problem is to show that for some \epsilon>0,

    \[\liminf_{n \to +\infty} \frac{\max_{x \in A_n} \|x\|_2}{n^{0.5+\epsilon}} = + \infty ,\]

which the above picture clearly seems to validate. Note that at the moment nobody can even prove that once rescaled by 1/\sqrt{n} the DLA will not converge to a Euclidean ball…
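
For intuition, here is a rough Python sketch of one DLA step. Releasing the walk from a finite circle and restarting it when it wanders too far is a common practical shortcut for the harmonic measure from infinity, not the exact definition; the returned point would then be added to the aggregate, with the boundary updated exactly as in the Eden sketch above.

    import math
    import random

    def dla_step(aggregate, boundary):
        """Pick one new particle via a random walk started far away, stopped on the boundary."""
        radius = max(math.hypot(*x) for x in aggregate) + 5
        while True:
            theta = random.uniform(0, 2 * math.pi)
            x = (round(2 * radius * math.cos(theta)), round(2 * radius * math.sin(theta)))
            while True:
                if x in boundary:                  # the walk hits the boundary of the aggregate
                    return x
                if math.hypot(*x) > 10 * radius:   # wandered too far: restart the walk
                    break
                dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
                x = (x[0] + dx, x[1] + dy)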




Ballistic DLA

Here is a model that, as far as I know, was introduced by Ronen Eldan. Instead of having one particle coming from infinity to hit the aggregate through a random walk, imagine now that the aggregate is constantly bombarded (ballistically). Precisely, particles come from infinity along every direction (that is, along every line) at a constant rate. In other words, first select a direction uniformly at random, and then, among the lines parallel to that direction that hit the current aggregate, choose one uniformly at random. The added point is the boundary point at the intersection of this (oriented) line with the aggregate. Here is a picture where we can see some resemblance with the Eden model:


and here is a picture from further away where we can see local resemblance with DLA:


see also this zoom in black and white:


Needless to say, nothing is known about this model.


Internal Diffusion Limited Aggregation (IDLA)

Finally the IDLA model is yet another modification of DLA, where instead of starting the random walk from infinity one starts it from the origin. Perhaps surprisingly, and in contrast to everything discussed so far, we know quite a lot about this model! Its limit shape is an actual honest Euclidean ball, see this paper by Lawler, Bramson and Griffeath. In fact we even know that the average fluctuation of the aggregate around its limit is of constant order, see this paper by Jerison, Levine and Sheffield (information about the distribution of these fluctuations is also provided!).


Writing this blog post was quite a bit of fun, and I thank Ronen Eldan, Laura Florescu, Shirshendu Ganguly, and Yuval Peres from whom I learned everything discussed here. To conclude the post here are some intriguing pictures from variants of the above model that I cooked up (unfortunately I’m not sure the models are very interesting though):



Posted in Probability theory, Random graphs | 2 Comments

Guest post by Sasho Nikolov: Beating Monte Carlo

If you work long enough in any mathematical science, at some point you will need to estimate an integral that does not have a simple closed form. Maybe your function is really complicated. Maybe it’s really high dimensional. Often you cannot even write it down: it could be a quantity associated with a complex system, that you can only “query” at certain points by running an experiment. But you still need your integral, and then you turn to the trustworthy old Monte Carlo method. (Check this article by Nicholas Metropolis for the history of the method and what it has to do with Stanislaw Ulam’s uncle’s gambling habit.) My goal in this post is to tell you a little bit about how you can do better than Monte Carlo using discrepancy theory.


The Problem and the Monte Carlo Method

Let us fix some notation and look at the simplest possible setting. We have a function f that maps the real interval [0,1] to the reals, and we want to know

    \[\int_0^1{f(x)dx}.\]

We will estimate this integral with the average \frac{1}{n}\sum_{i =1}^n{f(y_i)}, where y:= y_1,y_2, \ldots is a sequence of numbers in [0,1]. The error of this estimate is

    \[\mathrm{err}(f, y, n) := \left|\int_0^1{f(x)dx} - \frac{1}{n}\sum_{i =1}^n{f(y_i)}\right|.\]

And here is the main problem I will talk about in this post:

How do we choose a sequence y of points in [0,1] so that the average \frac{1}{n}\sum_{i = 1}^n{f(y_i)} approximates the integral \int_0^1{f(x)dx} as closely as possible?

Intuitively, for larger n the approximation will be better, but what is the best rate we can achieve? Notice that we want a single sequence, so that if we want a more accurate approximation, we just take a few more terms and re-normalize, rather than start everything from scratch. The insight of the Monte Carlo method (surely familiar to every reader of this blog) is to take each y_i to be a fresh random sample from [0,1]. Then for any n, the expectation of \frac{1}{n}\sum_{i =1}^n{f(y_i)} is exactly \int{f(x)dx} (from now on I will skip the limits of my integrals: they will all be from 0 to 1). The standard deviation is of order

    \[\frac{1}{\sqrt{n}}\left(\int{f(x)^2dx}\right)^{1/2} = \frac{\|f\|_{L_2}}{\sqrt{n}},\]

and standard concentration inequalities tell us that, with high probability, \mathrm{err}(f, y,n) will not be much larger than the latter quantity.
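
As a quick sanity check, here is a tiny Python sketch of the plain Monte Carlo estimate and its error; the test function is an arbitrary choice for illustration.

    import random

    def mc_estimate(f, n):
        """Plain Monte Carlo: average f over n fresh uniform samples from [0, 1]."""
        return sum(f(random.random()) for _ in range(n)) / n

    f = lambda x: x * x                            # arbitrary test function, integral 1/3
    for n in (100, 10000, 1000000):
        print(n, abs(mc_estimate(f, n) - 1 / 3))   # error shrinks roughly like 1/sqrt(n)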


Quasi-Monte Carlo and Discrepancy

For a fixed function of bounded L_2 norm, the Monte Carlo method achieves \mathrm{err}(f, y, n) roughly on the order of n^{-1/2}. In other words, if we want to approximate our integral to within \varepsilon, we need to take about \varepsilon^{-2} random samples. It’s clear that in general, even for smooth functions, we cannot hope to average over fewer than \varepsilon^{-1} points, but is \varepsilon^{-2} really the best we can do? It turns out that for reasonably nice f we can do a lot better using discrepancy. We define the star discrepancy function of a sequence y :=y_1,y_2,\ldots as

    \[\delta^*(y, n):= \max_{0 \leq t \leq 1}\left|t - \frac{|\{i: i\leq n, y_i < t\}|}{n}\right|.\]

Notice that this is really just a special case of \mathrm{err}(f, y, n) where f is the indicator function of the interval [0, t). A beautiful inequality due to Koksma shows that in a sense these are the only functions we need to care about:

    \[\mathrm{err}(f, y, n) \leq V(f)\delta^*(y,n).\]

V(f) is the total variation of f, a measure of smoothness, and for continuously differentiable functions it is equal to \int{|f'(x)|dx}. The important part is that we have bounded the integration error by the product of a term that quantifies how nice f is, and a term that quantifies how “random” the sequence y is. With this, our task is reduced to finding a sequence y with small star discrepancy, hopefully smaller than n^{-1/2}. This is the essence of the quasi-Monte Carlo method.


The van der Corput Sequence

A simple sequence with low star discrepancy was discovered by van der Corput in the beginning of the previous century. Let us write the integer i in binary as i = i_k\ldots i_0, i.e. i = i_02^0 + i_12^1 + i_2 2^2 + \ldots + i_k 2^k. Then we define r(i) to be the number we get by flipping the binary digits of i around the radix point: r(i) := i_0 2^{-1} + i_1 2^{-2} + \ldots + i_k 2^{-k-1}, or, in binary, r(i) = 0.i_0i_1\ldots i_k. The van der Corput sequence is r(0), r(1), r(2), \ldots.


I have plotted the first eight terms of the van der Corput sequence on the left in the above picture: the index i is on the x-axis and the value r(i-1) on the y-axis. The terms alternate between [0, 1/2) and [1/2, 1); they also visit each of [0, 1/4), [1/4, 1/2), [1/2, 3/4), [3/4, 1) exactly once before they return, and so on. For example, each shaded rectangle on the right in the above picture contains exactly one point (the rectangles are open on the top). The key property of the van der Corput sequence then is that r(i) \in [j2^{-k}, (j+1)2^{-k}) if and only if i \equiv j \pmod{2^k}. So for any such dyadic interval, the discrepancy is at most 1/n: the number of integers i less than n such that i \equiv j \pmod{2^k} is either \lfloor n2^{-k} \rfloor or \lceil n2^{-k} \rceil. We can greedily decompose an interval [0, t) into O(\log n) dyadic intervals plus a leftover interval of length o(1/n); therefore the star discrepancy of the van der Corput sequence y is O((\log n)/n). Remember that, together with Koksma’s inequality, this means that we can estimate the integral of any function f with bounded variation with error \mathrm{err}(f, y, n) \ll (V(f)\log n)/n, which, for sufficiently smooth f, is almost quadratically better than the Monte Carlo method! And with a deterministic algorithm, to boot. This was not that hard, so maybe we can achieve the ultimate discrepancy bound O(1/n)? This is the question (asked by van der Corput) which essentially started discrepancy theory. The first proof that O(1/n) is not achievable was given by van Aardenne-Ehrenfest. Klaus Roth simplified her bound and strengthened it to \Omega(\sqrt{\log n}/n) using a brilliant proof based on Haar wavelets. Schmidt later proved that van der Corput’s O((\log n)/n) bound is asymptotically the best possible.
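
Here is a short Python sketch of the bit-reversal map r(i) together with the star discrepancy of the resulting sequence; it is only meant as a numerical illustration of the O((\log n)/n) behavior.

    def r(i):
        """Bit-reversal: flip the binary digits of i around the radix point."""
        x, base = 0.0, 0.5
        while i:
            x += (i & 1) * base
            i >>= 1
            base /= 2
        return x

    def star_discrepancy(points):
        """Exact one-dimensional star discrepancy, computed from the sorted points."""
        ys = sorted(points)
        n = len(ys)
        return max(max((i + 1) / n - y, y - i / n) for i, y in enumerate(ys))

    vdc = [r(i) for i in range(256)]
    print(vdc[:8])                  # 0, 0.5, 0.25, 0.75, 0.125, 0.625, 0.375, 0.875
    print(star_discrepancy(vdc))    # much smaller than the Monte Carlo scale 1/sqrt(n)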


Higher Dimension, Other Measures, and Combinatorial Discrepancy

Quasi-Monte Carlo methods are used in real world applications, for example in quantitative finance, because of the better convergence rates they offer. But there are many complications that arise in practice. One issue is that we usually need to estimate integrals of high-dimensional functions, i.e. functions of a large number of variables. The Koksma inequality generalizes to higher dimensions (the generalization is known as the Koksma-Hlawka inequality), but we need to redefine both discrepancy and total variation for that purpose. The star discrepancy \delta^*(y, n) of a sequence y of d-dimensional points measures the worst-case absolute difference between the d-dimensional volume (Lebesgue measure) of any anchored box [0, t_1) \times \ldots \times [0, t_d) and the fraction of points in the sequence y_1, \ldots, y_n that fall in the box. The generalization of total variation is the Hardy-Krause total variation. Naturally, the best achievable discrepancy increases with dimension, while the class of functions of bounded total variation becomes more restrictive. However, we do not even know what the best achievable star discrepancy is for sequences in dimension 2 or higher! We know that no d-dimensional sequence has discrepancy better than \Omega((\log^{d/2 + \eta_d} n)/n), where \eta_d > 0 is some constant that goes to 0 with d. The van der Corput construction generalizes to higher dimensions and gives sequences with discrepancy O((\log^d n)/n) (the implied constants here and in the lower bounds depend on d). The discrepancy theory community refers to closing this significant gap as “The Great Open Problem”.

Everything so far was about integration with respect to the Lebesgue measure, but in practice we are often interested in a different measure space. We could absorb the measure into the function to be integrated, but this can affect the total variation badly. We could do a change of variables, but, unless we have a nice product measure, this will result in a notion of discrepancy in which the test sets are not boxes anymore. Maybe the most natural solution is to redefine star discrepancy with respect to the measure we care about. But how do we find a low-discrepancy sequence with the new definition? It turns out that combinatorial discrepancy is very helpful here. A classical problem in combinatorial discrepancy, Tusnady’s problem, asks for the smallest function \Delta_d(n) such that any set of n points in \mathbb{R}^d can be colored with red and blue so that in any axis-aligned box [0, t_1) \times\ldots\times [0, t_d) the absolute difference between the number of red and the number of blue points is at most \Delta_d(n) (see this post for a generalization of this problem). A general transference theorem in discrepancy theory shows that for any probability measure in \mathbb{R}^d there exists a sequence y with star discrepancy at most O(\Delta_{d+1}(n)/n). The best bound for \Delta_d(n) is O(\log^{d + 0.5} n), which after dividing by n is only slightly worse than the best star discrepancy known for the Lebesgue measure. This transference result has long been seen as purely existential, because most non-trivial results in combinatorial discrepancy were not constructive, but recently we have seen amazing progress in algorithms for minimizing combinatorial discrepancy. While even with these advances we don’t get sequences that are nearly as explicit as the van der Corput sequence, there certainly is hope we will get there.



I have barely scratched the surface of quasi-Monte Carlo methods and geometric discrepancy. Koksma-Hlawka type inequalities, discrepancy with respect to various test sets and measures, and combinatorial discrepancy are each a big topic in itself. The sheer breadth of mathematical tools that bear on discrepancy questions is impressive: diophantine approximation to construct low discrepancy sequences, reproducing kernels in Hilbert spaces to prove Koksma-Hlawka inequalities, harmonic analysis to prove discrepancy lower bounds, convex geometry for upper and lower bounds in combinatorial discrepancy. Luckily, there are some really nice references available. Matousek has a very accessible book on geometric discrepancy. Chazelle focuses on computer science applications. A new collection of surveys edited by Chen, Srivastav, and Travaglini has many of the latest developments.

Posted in Theoretical Computer Science | 7 Comments