This is the first (short) post dedicated to the Big Data program of the Simons Institute. We received from the program organizer Mike Jordan our first reading assignment, which is a report published by the National Academy of Sciences on the “Frontiers in Massive Data Analysis”. This report was written by a committee of academics and industry researchers chaired by Mike Jordan. I find the report quite interesting because it clearly identifies the new mathematical and algorithmic challenges that the Big Data point of view brings. These include in particular many issues related to data representation, but also distributed algorithms, crowdsourcing, and tradeoffs between statistical accuracy and computational cost.

Speaking of data representation, I would like to link another paper, this one from Stéphane Mallat: Group Invariant Scattering. Stéphane’s idea can be described roughly as follows: a useful data representation, say for sound signals, should of course be invariant to translations but also robust to small diffeomorphisms which are close to translations. In other words one is looking for mappings which are invariant to translations and Lipschitz continuous to diffeomorphisms of the form $x \mapsto x - \tau(x)$ (with the weak topology on diffeomorphisms). As an example consider the modulus of the Fourier transform: this is a mapping invariant to translations but it is not Lipschitz continuous with respect to diffeomorphisms, as one can ‘expand’ arbitrarily high frequencies by a simple transformation of the form $x \mapsto (1-s)x$. Mallat’s construction turns out to be much more complicated than simple Fourier or wavelet transforms. Interestingly it builds on ideas from the Deep Learning literature. It also generalizes to groups of transforms other than translations, such as rotations (which can be useful for images).
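
The instability of the Fourier modulus is easy to see numerically. The sketch below is an illustration only and is not taken from Mallat's paper: the test signal (a Gaussian window times a cosine), the sampling grid, the value `s = 0.05`, and the helper names `gabor`, `fourier_modulus`, `relative_distance` and `dilation_gap` are all assumptions chosen for the demo. It compares the Fourier modulus of a signal with that of a circularly shifted copy, and with that of a copy dilated by $x \mapsto (1-s)x$.

```python
import numpy as np


def gabor(x, xi, sigma=1.0):
    """Gaussian window of width sigma modulated at angular frequency xi."""
    return np.exp(-x**2 / (2 * sigma**2)) * np.cos(xi * x)


def fourier_modulus(signal):
    """Modulus of the discrete Fourier transform of a sampled signal."""
    return np.abs(np.fft.fft(signal))


def relative_distance(a, b):
    """Euclidean distance between a and b, relative to the norm of a."""
    return np.linalg.norm(a - b) / np.linalg.norm(a)


# Sampling grid (assumed): wide enough that the window has decayed to ~0
# at the edges, fine enough that xi = 40 is far below the Nyquist frequency.
n = 4096
x = np.linspace(-20.0, 20.0, n, endpoint=False)
s = 0.05  # small dilation parameter in x -> (1 - s) x

# 1) Translation: a circular shift changes only the phase of the DFT,
#    so the modulus is unchanged up to rounding error.
f = gabor(x, xi=20.0)
translation_gap = relative_distance(
    fourier_modulus(f), fourier_modulus(np.roll(f, 100))
)


# 2) Dilation: sampling the same signal at (1 - s) x moves its spectral
#    bump from xi to (1 - s) xi, i.e. by s * xi.
def dilation_gap(xi):
    original = gabor(x, xi)
    dilated = gabor((1 - s) * x, xi)
    return relative_distance(fourier_modulus(original), fourier_modulus(dilated))


low_gap = dilation_gap(2.0)    # shift s * xi = 0.1, small vs. bandwidth ~1
high_gap = dilation_gap(40.0)  # shift s * xi = 2.0, larger than bandwidth

print("translation gap:            ", translation_gap)
print("dilation gap, low frequency:", low_gap)
print("dilation gap, high frequency:", high_gap)
```

With these settings the translation gap should sit at floating-point level, the low-frequency dilation gap should be small, and the high-frequency gap should be of order one, even though the same small `s` is used in both cases. Raising `xi` further does not require a larger `s` to keep the gap large, which is the failure of Lipschitz continuity described above.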

## By Sebastien Bubeck July 18, 2013 - 3:32 pm

Hey Csaba,

I'm still trying to figure out what the most exciting mathematical problem in the big data framework is, so I cannot give you a *very specific* challenge that I really like. However, in general I'm quite excited by the data representation problems (hence the link to Mallat's paper), which I think are fundamental both theoretically and for making progress in some applications. Finding (or even better, learning) the properties we want for the data representation is one of the main challenges in my opinion. Mallat suggests invariance with respect to a prespecified group but there could be other properties...

For the second question: I'm thinking of $s$ as a small real number, and $sx$ is just the standard multiplication. The operation $x \mapsto (1-s)x$ is close to a translation because it is close to the identity for $s$ small (closeness is with respect to the weak topology of diffeomorphisms). We would like the representations of $f(x)$ and $f((1-s)x)$ to be close up to $O(s)$. The issue is that with this operation you translate the largest frequency from $\omega$ to $(1-s)\omega$, hence the moduli of the Fourier transforms of the two signals ($f(x)$ and $f((1-s)x)$) can be as far apart as $s\omega$, which can be arbitrarily large.
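
To make the frequency shift concrete, here is a short worked computation. It is only a sketch under assumptions that are not stated in the thread: the signal is a modulated window $f(x) = e^{i\omega x}\theta(x)$ with $\theta$ smooth and concentrated near the origin, the dilated signal is $f_s(x) = f((1-s)x)$ with $0 < s < 1$, and the Fourier transform is normalized as $\hat f(\xi) = \int f(x)\, e^{-i\xi x}\,dx$.

$$
\hat f(\xi) = \hat\theta(\xi - \omega),
\qquad
\hat f_s(\xi) = \frac{1}{1-s}\,\hat f\!\left(\frac{\xi}{1-s}\right)
= \frac{1}{1-s}\,\hat\theta\!\left(\frac{\xi - (1-s)\,\omega}{1-s}\right).
$$

So $|\hat f|$ is a bump centered at $\omega$ while $|\hat f_s|$ is a bump centered at $(1-s)\omega$: the two are displaced by $s\omega$. Once $s\omega$ exceeds the bandwidth of $\hat\theta$, the bumps barely overlap and the distance between $|\hat f|$ and $|\hat f_s|$ is of the order of $\|f\|$, however small $s$ is, provided $\omega$ is large enough. Hence no bound of the form $C\,s\,\|f\|$ can hold uniformly in $\omega$.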

## By Csaba July 18, 2013 - 9:31 am

Hi Seb,

So what is a specific challenge that you liked most?

On the second topic, in what sense is $(1-s)x$ close to a translation? Does $s x$ denote here the differentiation of $x$? Or?

Cheers,

Csaba