Guest post by Sasho Nikolov: Beating Monte Carlo

If you work long enough in any mathematical science, at some point you will need to estimate an integral that does not have a simple closed form. Maybe your function is really complicated. Maybe it’s really high dimensional. Often you cannot even write it down: it could be a quantity associated with a complex system that you can only “query” at certain points by running an experiment. But you still need your integral, and then you turn to the trustworthy old Monte Carlo method. (Check this article by Nicholas Metropolis for the history of the method and what it has to do with Stanislaw Ulam’s uncle’s gambling habit.) My goal in this post is to tell you a little bit about how you can do better than Monte Carlo using discrepancy theory.

The Problem and the Monte Carlo Method

Let us fix some notation and look at the simplest possible setting. We have a function $f$ , that maps the real interval $[0,1]$ to the reals, and we want to know

$\int_0^1{f(x)dx}.$

We will estimate this integral with the average $\frac{1}{n}\sum_{i =1}^n{f(y_i)}$ , where $y:= y_1,y_2, \ldots$ is a sequence of numbers in $[0,1]$ . The error of this estimate is

$\mathrm{err}(f, y, n) := \left|\int_0^1{f(x)dx} - \frac{1}{n}\sum_{i =1}^n{f(y_i)}\right|.$

And here is the main problem I will talk about in this post:

How do we choose a sequence $y$ of points in $[0,1]$ so that the average $\frac{1}{n}\sum_{i = 1}^n{f(y_i)}$ approximates the integral $\int_0^1{f(x)dx}$ as closely as possible?

Intuitively, for larger $n$ the approximation will be better, but what is the best rate we can achieve? Notice that we want a single sequence, so that if we want a more accurate approximation, we just take a few more terms and re-normalize, rather than start everything from scratch. The insight of the Monte Carlo method (surely familiar to every reader of this blog) is to take each $y_i$ to be a fresh random sample from $[0,1]$ . Then for any $n$ , the expectation of $\frac{1}{n}\sum_{i =1}^n{f(y_i)}$ is exactly $\int{f(x)dx}$ (from now on I will skip the limits of my integrals: they will all be from $0$ to $1$ ). The standard deviation is of order

$\frac{1}{\sqrt{n}}\left(\int{f(x)^2dx}\right)^{1/2} = \frac{\|f\|_{L_2}}{\sqrt{n}},$

and standard concentration inequalities tell us that, with high probability, $\mathrm{err}(f, y,n)$ will not be much larger than the latter quantity.
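Here is a minimal sketch of the Monte Carlo estimate described above. The integrand `square` (with exact integral $1/3$) and the fixed seed are assumptions chosen only for illustration; any square-integrable function on $[0,1]$ would do.

```python
import math
import random


def monte_carlo_estimate(f, n, seed=0):
    """Average f over n independent uniform samples from [0, 1)."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return sum(f(rng.random()) for _ in range(n)) / n


def square(x):
    # Hypothetical test integrand; its integral over [0, 1] is exactly 1/3.
    return x * x


if __name__ == "__main__":
    exact = 1.0 / 3.0
    for n in (100, 10_000):
        err = abs(monte_carlo_estimate(square, n) - exact)
        # Print the observed error next to the n^{-1/2} scale for comparison.
        print(n, err, 1.0 / math.sqrt(n))
```

The observed error fluctuates from seed to seed, but it should typically sit within a small multiple of $n^{-1/2}$.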

Quasi-Monte Carlo and Discrepancy

For a fixed function of bounded $L_2$ norm, the Monte Carlo method achieves $\mathrm{err}(f, y, n)$ roughly on the order of $n^{-1/2}$ . In other words, if we want to approximate our integral to within $\varepsilon$ , we need to take about $\varepsilon^{-2}$ random samples. It’s clear that in general, even for smooth functions, we cannot hope to average over fewer than $\varepsilon^{-1}$ points, but is $\varepsilon^{-2}$ really the best we can do? It turns out that for reasonably nice $f$ we can do a lot better using discrepancy. We define the star discrepancy function of a sequence $y :=y_1,y_2,\ldots$ as

$\delta^*(y, n):= \max_{0 \leq t \leq 1}\left|t - \frac{|\{i: i\leq n, y_i < t\}|}{n}\right|.$

Notice that this is really just a special case of $\mathrm{err}(f, y, n)$ where $f$ is the indicator function of the interval $[0, t)$ . A beautiful inequality due to Koksma shows that in a sense these are the only functions we need to care about:

$\mathrm{err}(f, y, n) \leq V(f)\delta^*(y,n).$

$V(f)$ is the total variation of $f$ , a measure of smoothness, and for continuously differentiable functions it is equal to $\int{|f'(x)|dx}$ . The important part is that we have bounded the integration error by the product of a term that quantifies how nice $f$ is, and a term that quantifies how “random” the sequence $y$ is. With this, our task is reduced to finding a sequence $y$ with small star discrepancy, hopefully smaller than $n^{-1/2}$ . This is the essence of the quasi-Monte Carlo method.
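To make Koksma's inequality concrete, here is a small sketch that computes the star discrepancy of a finite point set in $[0,1]$ and checks the inequality on one example. It relies on the standard observation that, after sorting the points, the supremum over $t$ is approached at (or just after) one of the points. The test function $f(x) = x^2$, for which $V(f) = \int{|2x|dx} = 1$, and the choice of midpoints are assumptions made for illustration.

```python
def star_discrepancy(points):
    """Star discrepancy of a finite list of points in [0, 1].

    With the points sorted as x_1 <= ... <= x_n, the supremum over t of
    |t - #{i : x_i < t} / n| equals max_i max(i/n - x_i, x_i - (i-1)/n).
    """
    xs = sorted(points)
    n = len(xs)
    # enumerate is 0-based, so (i + 1) / n and i / n play the roles of
    # i/n and (i-1)/n in the formula above.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


def integration_error(f, points, exact):
    """Absolute error of the plain average of f over the points."""
    return abs(exact - sum(f(x) for x in points) / len(points))


if __name__ == "__main__":
    n = 4
    midpoints = [(2 * i + 1) / (2 * n) for i in range(n)]
    disc = star_discrepancy(midpoints)
    err = integration_error(lambda x: x * x, midpoints, 1.0 / 3.0)
    # Koksma: err <= V(f) * disc, and V(f) = 1 for f(x) = x^2.
    print(disc, err, err <= disc)
```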

The van der Corput Sequence

A simple sequence with low star discrepancy was discovered by van der Corput in the beginning of the previous century. Let us write the integer $i$ in binary as $i = i_k\ldots i_0$ , i.e. $i = i_02^0 + i_12^1 + i_2 2^2 + \ldots + i_k 2^k$ . Then we define $r(i)$ to be the number we get by flipping the binary digits of $i$ around the radix point: $r(i) := i_0 2^{-1} + i_1 2^{-2} + \ldots + i_k 2^{-k-1}$ , or, in binary, $r(i) = 0.i_0i_1\ldots i_k$ . The van der Corput sequence is $r(0), r(1), r(2), \ldots$ .

I have plotted the first eight terms of the van der Corput sequence on the left in the above picture: the index $i$ is on the $x$ -axis and the value $r(i-1)$ on the $y$ -axis. The terms alternate between $[0, 1/2)$ and $[1/2, 1)$ ; they also visit each of $[0, 1/4)$ , $[1/4, 1/2)$ , $[1/2, 3/4)$ , $[3/4, 1)$ exactly once before they return, and so on. For example, each shaded rectangle on the right in the above picture contains exactly one point (the rectangles are open on the top). The key property of the van der Corput sequence then is that $r(i) \in [j2^{-k}, (j+1)2^{-k})$ if and only if $i \equiv j' \pmod{2^k}$ , where $j'$ is the integer obtained by reversing the $k$ -bit binary representation of $j$ . So for any such dyadic interval, the discrepancy is at most $1/n$ : the number of integers $i$ less than $n$ in a fixed residue class modulo $2^k$ is either $\lfloor n2^{-k} \rfloor$ or $\lceil n2^{-k} \rceil$ . We can greedily decompose an interval $[0, t)$ into $O(\log n)$ dyadic intervals plus a leftover interval of length $o(1/n)$ ; therefore the star discrepancy of the van der Corput sequence $y$ is $O((\log n)/n)$ . Remember that, together with Koksma’s inequality, this means that we can estimate the integral of any function $f$ with bounded variation with error $\mathrm{err}(f, y, n) \ll (V(f)\log n)/n$ , which, for sufficiently smooth $f$ , is almost quadratically better than the Monte Carlo method! And with a deterministic algorithm, to boot.

This was not that hard, so maybe we can achieve the ultimate discrepancy bound $O(1/n)$ ? This is the question (asked by van der Corput) which essentially started discrepancy theory. The first proof that $O(1/n)$ is not achievable was given by van Aardenne-Ehrenfest. Klaus Roth simplified her bound and strengthened it to $\Omega(\sqrt{\log n}/n)$ using a brilliant proof based on Haar wavelets. Schmidt later proved that van der Corput’s $O((\log n)/n)$ bound is asymptotically the best possible.
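The definition of $r(i)$ translates almost directly into code. The sketch below generates the van der Corput sequence by peeling off the binary digits of $i$ from the least significant end, and then uses the first $n$ terms as quasi-Monte Carlo sample points. The integrand $x^2$ is again just an assumed test case.

```python
def van_der_corput(i):
    """Radical inverse r(i): mirror the binary digits of i around the radix point."""
    r, scale = 0.0, 0.5
    while i > 0:
        r += (i & 1) * scale  # lowest remaining digit of i goes to the next place after the point
        i >>= 1
        scale *= 0.5
    return r


def qmc_estimate(f, n):
    """Average f over the first n terms r(0), ..., r(n-1) of the sequence."""
    return sum(f(van_der_corput(i)) for i in range(n)) / n


if __name__ == "__main__":
    print([van_der_corput(i) for i in range(8)])
    exact = 1.0 / 3.0
    for n in (100, 10_000):
        print(n, abs(qmc_estimate(lambda x: x * x, n) - exact))
```

Because the sequence is deterministic, extending the estimate from $n$ to $n + m$ points only requires evaluating $f$ at the $m$ new terms and re-normalizing.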

Higher Dimension, Other Measures, and Combinatorial Discrepancy

Quasi-Monte Carlo methods are used in real world applications, for example in quantitative finance, because of the better convergence rates they offer. But there are many complications that arise in practice. One issue is that we usually need to estimate integrals of high-dimensional functions, i.e. functions of a large number of variables. The Koksma inequality generalizes to higher dimensions (the generalization is known as the Koksma-Hlawka inequality), but we need to redefine both discrepancy and total variation for that purpose. The star discrepancy $\delta^*(y, n)$ of a sequence $y$ of $d$ -dimensional points measures the worst-case absolute difference between the $d$ -dimensional volume (Lebesgue measure) of any anchored box $[0, t_1) \times \ldots \times [0, t_d)$ and the fraction of points in the sequence $y_1, \ldots, y_n$ that fall in the box. The generalization of total variation is the Hardy-Krause total variation. Naturally, the best achievable discrepancy increases with dimension, while the class of functions of bounded total variation becomes more restrictive. However, we do not even know the best achievable star discrepancy for sequences in dimension $2$ or higher! We know that no $d$ -dimensional sequence has discrepancy better than $\Omega((\log^{d/2 + \eta_d} n)/n)$ , where $\eta_d > 0$ is some constant that goes to $0$ with $d$ . The van der Corput construction generalizes to higher dimensions and gives sequences with discrepancy $O((\log^d n)/n)$ (the implied constants here and in the lower bounds depend on $d$ ). The discrepancy theory community refers to closing this significant gap as “The Great Open Problem”.
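One well-known way to carry the van der Corput idea to higher dimensions is the Halton sequence, which applies the radical inverse in a different prime base to each coordinate. The post does not say which generalization it has in mind, so treat the choice of Halton, the bases $(2, 3)$ and the test integrand $f(x, y) = xy$ (exact integral $1/4$) as assumptions made for this sketch.

```python
def radical_inverse(i, base):
    """Mirror the base-`base` digits of i around the radix point."""
    r, scale = 0.0, 1.0 / base
    while i > 0:
        i, digit = divmod(i, base)
        r += digit * scale
        scale /= base
    return r


def halton_point(i, bases=(2, 3)):
    """i-th point of the Halton sequence, one coprime base per coordinate."""
    return tuple(radical_inverse(i, b) for b in bases)


def qmc_estimate_2d(f, n, bases=(2, 3)):
    """Average a function of two variables over the first n Halton points."""
    return sum(f(*halton_point(i, bases)) for i in range(n)) / n


if __name__ == "__main__":
    exact = 0.25  # integral of x * y over the unit square
    for n in (100, 10_000):
        print(n, abs(qmc_estimate_2d(lambda x, y: x * y, n) - exact))
```

With base $2$ alone, `radical_inverse` reduces to the one-dimensional van der Corput sequence.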

Everything so far was about integration with respect to the Lebesgue measure, but in practice we are often interested in a different measure space. We could absorb the measure into the function to be integrated, but this can affect the total variation badly. We could do a change of variables, but, unless we have a nice product measure, this will result in a notion of discrepancy in which the test sets are not boxes anymore. Maybe the most natural solution is to redefine star discrepancy with respect to the measure we care about. But how do we find a low-discrepancy sequence with the new definition? It turns out that combinatorial discrepancy is very helpful here. A classical problem in combinatorial discrepancy, Tusnady’s problem, asks for the smallest function $\Delta_d(n)$ such that any set of $n$ points in $\mathbb{R}^d$ can be colored with red and blue so that in any axis-aligned box $[0, t_1) \times\ldots\times [0, t_d)$ the absolute difference between the number of red and the number of blue points is at most $\Delta_d(n)$ (see this post for a generalization of this problem). A general transference theorem in discrepancy theory shows that for any probability measure in $\mathbb{R}^d$ there exists a sequence $y$ with star discrepancy at most $O(\Delta_{d+1}(n)/n)$ . The best bound for $\Delta_d(n)$ is $O(\log^{d + 0.5} n)$ , only slightly worse than the best star discrepancy for Lebesgue measure. This transference result has long been seen as purely existential, because most non-trivial results in combinatorial discrepancy were not constructive, but recently we have seen amazing progress in algorithms for minimizing combinatorial discrepancy. While even with these advances we don’t get sequences that are nearly as explicit as the van der Corput sequence, there certainly is hope we will get there.
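To pin down the quantity in Tusnady's problem, the sketch below evaluates the discrepancy of a given red/blue coloring of planar points over anchored boxes, and finds the best coloring of a tiny point set by brute force. This only illustrates the definition for $d = 2$; it is exponential in the number of points and has nothing to do with the efficient algorithms mentioned above. Encoding red and blue as $+1$ and $-1$, and assuming nonnegative coordinates, are choices made for this example.

```python
from itertools import product


def coloring_discrepancy(points, colors):
    """Max over anchored boxes [0, tx) x [0, ty) of |#red - #blue|.

    colors[i] is +1 (red) or -1 (blue). Only thresholds equal to a point
    coordinate, or infinity, can change which points a box contains.
    """
    xs = sorted({p[0] for p in points}) + [float("inf")]
    ys = sorted({p[1] for p in points}) + [float("inf")]
    worst = 0
    for tx, ty in product(xs, ys):
        signed = sum(c for (x, y), c in zip(points, colors) if x < tx and y < ty)
        worst = max(worst, abs(signed))
    return worst


def best_coloring(points):
    """Brute-force search over all 2^n colorings (tiny inputs only)."""
    best = None
    for colors in product((1, -1), repeat=len(points)):
        d = coloring_discrepancy(points, colors)
        if best is None or d < best[0]:
            best = (d, colors)
    return best


if __name__ == "__main__":
    pts = [(0.1, 0.4), (0.2, 0.3), (0.3, 0.2), (0.4, 0.1)]
    print(best_coloring(pts))
```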

Conclusion

I have barely scratched the surface of quasi-Monte Carlo methods and geometric discrepancy. Koksma-Hlawka type inequalities, discrepancy with respect to various test sets and measures, and combinatorial discrepancy are each a big topic in their own right. The sheer breadth of mathematical tools that bear on discrepancy questions is impressive: diophantine approximation to construct low discrepancy sequences, reproducing kernels in Hilbert spaces to prove Koksma-Hlawka inequalities, harmonic analysis to prove discrepancy lower bounds, convex geometry for upper and lower bounds in combinatorial discrepancy. Luckily, there are some really nice references available. Matousek has a very accessible book on geometric discrepancy. Chazelle focuses on computer science applications. A new collection of surveys edited by Chen, Srivastav, and Travaglini has many of the latest developments.

7 Responses to "Guest post by Sasho Nikolov: Beating Monte Carlo"

By Jiantao Jiao February 20, 2015 - 2:12 am

Dear Sasho, many thanks for this wonderful post! I have a basic question regarding the Monte Carlo method and the quadrature/cubature methods in numerical integration. When we use a quadrature rule to do numerical integration, we not only design the points y_i, but also design the weight w_i, and sometimes we want to select the minimum number of nodes such that the integration is exact for polynomials of the highest possible order. But it seems that in Quasi Monte Carlo all the w_i have been set to be equal. Probably in practice sometimes quadrature rule is better, and sometimes Monte Carlo or Quasi Monte Carlo is better? In general how can we compare these two paradigm? Look forward to hearing from you! 🙂
- By Sasho Nikolov April 6, 2015 - 3:18 pm
  
  Dear Jintao,
  
  I think you mean things like Gaussian quadrature rules? As far as I know, the issue with them is that (1) they work best if the function being integrated has a lot of structure, for example is close to a polynomial; (2) there are no sufficiently accurate variants for higher dimensions. However, if you have a one dimensional integral and you think your function is close to a polynomial, a Gaussian quadrature rule might be the right thing to use.
  
  Sasho
By Ravi Ganti January 12, 2015 - 5:15 pm

I think there should be a square root over the integral in the standard deviation formula. No?

While the Koksma inequality is interesting, Are there other discrepancy measures that take into account the function f? The reason, why I am asking this question is that while the Koksma inequality is interesting, it seems like that the only way the sequence enters the picture has got nothing to do with the function at all. Is that correct?
- By Sasho January 17, 2015 - 11:36 pm
  
  Thanks, Ravi, you are absolutely right about the missing square root.
  
  The fact that the guarantee works more or less regardless of the function is sort of the point here: in many applications you don’t know much about the function because it’s a complicated quantity that we don’t fully understand. The picture I have in mind is that in one extreme you have the Monte Carlo method in which you assume the least about the function (that it is square-integrable). In the other extreme you have simple functions with closed form expression for the integral. The Koksma inequality I wrote about in the post is an interesting point on this tradeoff in which you still make a fairly mild smoothness assumption on f but can dramatically improve the error guarantee.
  
  But there are other tradeoffs possible. You can define discrepancy with respect to different test functions and then you can derive an appropriate Koksma-Hlawka type inequality using a standard (although not necessarily trivial to apply) method using reproducing kernel spaces. Identifying natural classes of functions and corresponding discrepancy measures and coming up with low discrepancy constructions is a challenging but important question. And then there is the field of Information Based Complexity (that I am decidedly not an expert in) which deals exactly with this problem of computing with continuous functions from discrete samples.
By vzn December 28, 2014 - 12:46 pm

its great to hear a very tangible motivating example for discrepancy theory which sometimes seems quite abstract.
this reminds me of methods of improving convergence efficiency of differential equation solvers eg runge kutta although encountered it many yrs ago.
see also this neat recent advance in discrepancy via SAT solvers, havent seen a lot of commentary in the blogosphere on it, undernoticed it seems.
By Suresh December 22, 2014 - 7:09 pm

Nice post, Sasho. I like the idea of coupling two separate terms (one for the function and one for the sequence).
- By Sasho December 26, 2014 - 2:56 pm
  
  Thanks Suresh! It’s the magic of Holder :). Maybe I should mention that the total variation can be replaced by the two-norm of the first derivative, and the worst-case discrepancy I defined by average discrepancy. In general you can use an $\ell_p$ version of total variation and a $\ell_q$ version of discrepancy, for conjugate pairs p and q.
