This is the first (short) post dedicated to the Big Data program of the Simons Institute. We received from the program organizer Mike Jordan our first reading assignment which is a report published by the National Academy of Sciences on the “Frontiers in Massive Data Analysis“. This paper was written by a committee of academics and industrials chaired by Mike Jordan. I find the report quite interesting because it clearly identifies the new mathematical and algorithmic challenges that the Big Data point of view brings. This includes in particular many issues related to data representation, but also distributed algorithms, crowdsourcing, or tradeoffs between statistical accuracy and computational cost.
Talking about data representation I would like to link another paper, this one from Stéphane Mallat, Group Invariant Scattering. Stephane’s idea can be described roughly as follows: a useful data representation, say for sound signals, should be of course invariant to translations but also robust to small diffeomorphisms which are close to translations. In other words one is looking for mappings which are invariant to translations and Lipschitz continuous to diffeomorphisms of the form (with the weak topology on diffeomorphisms). As an example consider the modulus of the Fourier transform: this is a mapping invariant to translations but it is not Lipschitz continuous with respect to diffeomorphisms as one can ‘expand’ arbitrary high frequency by a simple transformation of the form . Mallat’s construction turns out to be much more complicated than simple Fourier or Wavelet transforms. Interestingly it builds on ideas from the Deep Learning literature. It also generalizes to other groups of transforms than translations, such as rotations (which can be useful for images).