A good NIPS!

This year’s edition of NIPS was a big success. As you probably already know, we had the surprise visit of Mark Zuckerberg (see here for the reason behind this visit). More interestingly (or perhaps less interestingly, depending on who you are), here are some of the papers that I liked the most.

Optimization

- Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n) by Francis Bach and Eric Moulines. With noisy gradients (see this post) it is known that the best rate of convergence to minimize an \alpha-strongly convex function is of order 1/(\alpha t) while for convex functions it is of order 1/\sqrt{t}. Unfortunately in Machine Learning applications the strong convexity parameter \alpha is often a regularization parameter that can be as small as 1/\sqrt{t}, in which case the standard analysis using strong convexity does not yield any acceleration. In this paper Bach and Moulines show that in the case of the square loss (whose strong convexity parameter depends on the smallest non-zero eigenvalue of the covariance matrix of the covariates) one can obtain a rate of order 1/t (i.e. with no dependency on the smallest eigenvalue \alpha). The proposed algorithm is simply Stochastic Gradient Descent with a constant step-size (and averaging for the output). A more intricate algorithm is also proposed for the logistic loss.
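
Here is a minimal Python sketch of the algorithm on the square loss (my own toy illustration: the synthetic data model is an assumption of mine, and the constant step-size of order 1/R^2 is only the rough scaling suggested by the theory, not the paper's exact setup):

    import numpy as np

    # Constant step-size SGD with averaging for least squares (a sketch, with
    # an illustrative data model; not the exact setup of the paper).
    rng = np.random.default_rng(0)
    d, n = 10, 100000
    theta_star = rng.standard_normal(d)
    X = rng.standard_normal((n, d))
    y = X @ theta_star + 0.1 * rng.standard_normal(n)

    R2 = np.mean(np.sum(X**2, axis=1))   # estimate of E ||x||^2
    gamma = 1.0 / (4 * R2)               # constant step size, of order 1/R^2

    theta = np.zeros(d)
    theta_avg = np.zeros(d)
    for t in range(n):
        x_t, y_t = X[t], y[t]
        grad = (x_t @ theta - y_t) * x_t            # stochastic gradient of the square loss
        theta = theta - gamma * grad
        theta_avg += (theta - theta_avg) / (t + 1)  # running average: the output

    print(np.linalg.norm(theta_avg - theta_star))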

- On Decomposing the Proximal Map by Yaoliang Yu. Algorithms such as ISTA and FISTA (see this post) require computing the proximal operator

    \[\mathcal{P}_f(y) = \argmin_x \|x - y\|^2_2 + f(x) .\]

Recall that this proximal operator arises naturally for the minimization of a function of the form F + f, i.e. when one wants to minimize some function F while enforcing some of the ‘properties’ of f in the solution. For instance with f = \|\cdot\|_1 one would like to output a sparse solution. Thus it is very natural to try to understand the relation between \mathcal{P}_{f+g} and \mathcal{P}_f, \mathcal{P}_g. This paper considers various conditions under which one has \mathcal{P}_{f+g} = \mathcal{P}_f \circ \mathcal{P}_g.
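
For a feel of when such a decomposition holds, here is a small numerical check (my own toy example, not necessarily one from the paper, with the convention \mathcal{P}_f(y) = \argmin_x \|x - y\|^2_2 + f(x) used above): take f = \lambda \|\cdot\|_1 and g the indicator of the nonnegative orthant, in which case \mathcal{P}_{f+g} = \mathcal{P}_f \circ \mathcal{P}_g.

    import numpy as np

    # Check P_{f+g} = P_f o P_g for f = lam*||.||_1 and g = indicator of {x >= 0}.
    # With the convention P_f(y) = argmin ||x - y||^2 + f(x), the soft-threshold
    # level is lam/2. Illustrative example, not taken from the paper.
    lam = 1.0

    def prox_l1(y):        # P_f: componentwise soft-thresholding at lam/2
        return np.sign(y) * np.maximum(np.abs(y) - lam / 2, 0.0)

    def prox_nonneg(y):    # P_g: Euclidean projection onto the nonnegative orthant
        return np.maximum(y, 0.0)

    def prox_sum(y):       # P_{f+g}, computed in closed form
        return np.maximum(y - lam / 2, 0.0)

    y = np.random.default_rng(1).standard_normal(5)
    print(prox_sum(y))
    print(prox_l1(prox_nonneg(y)))   # identical output: the decomposition holds here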

- Accelerating Stochastic Gradient Descent using Predictive Variance Reduction by Rie Johnson and Tong Zhang. This paper gives a beautiful new algorithm achieving the same performance as SDCA and SAG (see this post). The algorithm and its analysis are much more intuitive than those of SDCA and SAG. I will make a more detailed post on this paper later next year.
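
In the meantime, here is a minimal sketch of the algorithm (now widely known as SVRG) on least squares; the step size and epoch length are illustrative guesses of mine, not the tuned choices of the paper:

    import numpy as np

    # Sketch of variance-reduced SGD (SVRG) on least squares.
    rng = np.random.default_rng(0)
    n, d = 1000, 20
    X = rng.standard_normal((n, d))
    w_star = rng.standard_normal(d)
    y = X @ w_star + 0.01 * rng.standard_normal(n)

    def grad_i(w, i):                      # gradient of the i-th square loss
        return (X[i] @ w - y[i]) * X[i]

    w_tilde = np.zeros(d)
    eta, m = 0.01, 2 * n                   # step size and inner-loop length (guesses)
    for epoch in range(30):
        mu = X.T @ (X @ w_tilde - y) / n   # full gradient at the snapshot
        w = w_tilde.copy()
        for _ in range(m):
            i = rng.integers(n)
            # variance-reduced stochastic gradient
            w -= eta * (grad_i(w, i) - grad_i(w_tilde, i) + mu)
        w_tilde = w                        # new snapshot

    print(np.linalg.norm(w_tilde - w_star))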

- Accelerated Mini-Batch Stochastic Dual Coordinate Ascent by Shai Shalev-Shwartz and Tong Zhang. Both SDCA and SAG have a linear dependency on the condition number Q. For the deterministic case Nesterov’s accelerated gradient descent attains a linear dependency on \sqrt{Q}. This paper partially bridges the gap between these results and presents an accelerated version of SDCA using mini-batches.

- Mixed Optimization for Smooth Functions by Mehrdad Mahdavi, Lijun Zhang and Rong Jin. This paper considers a new setting which seems quite natural: what if, on top of a noisy first-order oracle, one can also access a regular first-order oracle? Mehrdad will do a guest post on this problem soon, but the short answer is that with only a logarithmic number of calls to the regular oracle one can attain a rate of order 1/t for smooth optimization (while with only the noisy oracle the rate is 1/\sqrt{t}).

- Estimation, Optimization, and Parallelism when Data is Sparse by John Duchi, Mike Jordan and Brendan McMahan. For this paper too I am hoping to have a guest post (by John Duchi this time) that would explain the new algorithm and its properties. The idea is roughly to do a gradient descent where the step-size adapts to the sparsity of the observed gradient, allowing for much faster rates in certain situations.

 

Bandits

- Online Learning in Episodic Markovian Decision Processes by Relative Entropy Policy Search by Alexander Zimin and Gergely Neu. This paper shows that one can solve episodic loop-free MDPs by simply using a combinatorial semi-bandit strategy (see this paper by Audibert, myself and Lugosi where we solved the semi-bandit problem). I believe that this paper initiates a research direction that will be very fruitful in the future, namely reducing (or rather reformulating) a complicated sequential decision making problem as a linear bandit (or semi-bandit). A similar approach is very popular in optimization, where everyone knows that one should try very hard to formulate the problem of interest as a convex program. On the other hand such an approach has not yet gained the same recognition in online learning/sequential decision making. I believe that at the moment the most promising direction is to try to formulate the problem as a linear bandit, as it is both an extremely general problem and one for which we have seemingly canonical algorithms. A related paper is Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions by Abbasi-Yadkori, Bartlett, Kanade, Seldin and Szepesvari.

- Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards by Thomas Bonald and Alexandre Proutiere. This paper considers the famous Bayesian setting of Berry, Chen, Zame, Heath and Shepp where one has a countable set of arms with Bernoulli distributions and means drawn uniformly on [0,1]. This paper exhibits the first optimal strategy, with a Bayesian regret of order \sqrt{2 n}. I recommend taking a look at the strategy, it is both very elegant and quite smart.

- Sequential Transfer in Multi-armed Bandit with Finite Set of Models by Mohammad Azar, Alessandro Lazaric and Emma Brunskill. This paper considers an elegant and simple model for transfer learning in multi-armed bandit problems. To put it simply, the setting is the one of a Bayesian multi-armed bandit where the underlying parameter \theta is replaced by a fresh independent sample every N steps. If the prior is known then this problem is fairly straightforward (given our current knowledge of the multi-armed bandit). The issue is what to do when the prior is unknown. The paper proposes an interesting strategy based on learning latent variable models via the method of moments (see this paper by Anandkumar, Ge, Hsu, Kakade, and Telgarsky for instance). While this gives non-trivial results I believe that much more can be said by fully embracing the sequential nature of the problem rather than trying to blend together a batch method with standard bandit strategies. I suspect that the technical difficulties to obtain an optimal strategy for this setting will be tremendous (which is quite exciting!).

- Eluder Dimension and the Sample Complexity of Optimistic Exploration by Daniel Russo and Benjamin Van Roy; Prior-free and prior-dependent regret bounds for Thompson Sampling by myself and Che-Yu Liu. These two papers ask the same question (with the latter paper being inspired by the former): we know that Thompson Sampling can be as good as UCB when initialized with a well-chosen prior, but can we show that it has a significant advantage in the cases where the prior is actually informative? The first paper addresses this question in the context of linear bandits (which is, as pointed out above, the most fundamental bandit problem), while the second paper considers a much more restricted class of problems but for which surprisingly strong results can be obtained.

 

Online Learning

- Dimension-Free Exponentiated Gradient by Francesco Orabona. This paper introduces a new regularizer for Mirror Descent and shows that it adapts automatically to the norm of the unknown comparator. In my opinion we know very few interesting regularizers for MD (especially in the full-information setting) and any non-trivial addition to this set seems difficult. This paper manages to do just that. It would be interesting to see if this gives new insights for bandit regularizers.

- Minimax Optimal Algorithms for Unconstrained Linear Optimization by Brendan McMahan and Jacob Abernethy. My advice for this paper is to look at Section 3.2 which gives one very concrete and very cool application of the method they develop.

 

Other

I’m running out of stamina so I will now just list the other papers that I found very interesting.

- Density estimation from unweighted k-nearest neighbor graphs: a roadmap by von Luxburg and Alamgir. 

- Stochastic blockmodel approximation of a graphon: Theory and consistent estimation by Airoldi, Costa and Chan. 

- Near-Optimal Entrywise Sampling for Data Matrices by Achlioptas, Karnin and Liberty.

- Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic by Sharpnack, Krishnamurthy and Singh.

- Estimating the Unseen: Improved Estimators for Entropy and other Properties by Valiant and Valiant. 

- Information-theoretic lower bounds for distributed statistical estimation with communication constraints by Zhang, Duchi, Jordan and Wainwright.


The hunter and the rabbit

In this post I will tell you the story of the hunter and the rabbit. To keep you interested let me say that in the story we will see a powerful (yet trivial) inequality that I learned today from Yuval Peres, which he calls a reversed second moment method (we will also review the basic second moment method).

We consider a sequential game between a hunter and a rabbit. At each time step t \geq 0 the hunter (respectively the rabbit) is at a certain vertex H_t (respectively at R_t) in the cycle C_n (with vertices labeled \{0, \hdots, n-1\}). We impose the constraint that H_{t+1} and H_t must be neighbors, that is the hunter can only move one step at a time. We impose no restriction on the rabbit (he can jump around like a real rabbit). Let \tau = \min \{t \geq 0 : R_t = H_t \} be the capture time. We allow for randomized strategies for the two players. Of course the hunter wants to minimize the expected capture time while the rabbit wants to maximize it.

December 3rd clarification: the processes R_t and H_t are independent, in other words the hunter does not see the rabbit and vice versa (the game would be trivial otherwise!).

Theorem [Adler et al., 2003]: The hunter can always ensure \mathbb{E} \tau \leq c_1 n \log n and the rabbit can always ensure \mathbb{E} \tau \geq c_2 n \log n.

We will now sketch the proof of this beautiful theorem following the method of [Babichenko et al. 2012]. Quite intuitively it is enough to show that there exists a strategy for the hunter that ensures for any deterministic rabbit \mathbb{P}(\tau < n) \geq c_1' / \log n, and respectively that there exists a strategy for the rabbit that ensures for any deterministic hunter \mathbb{P}(\tau < n ) \leq c_2' / \log n (see Lemma 2.2. here for a formal proof of this).

The hunter’s strategy

Let a, b be independent uniform random variables on [0,1]. The location of the hunter at time t is H_t = \lceil a n + b t \rceil \ \mathrm{mod} \ n. In some sense this hunter moves at a random speed determined by b and from a random starting point determined by a. The key idea to analyze this strategy is to consider the number of collisions before time n, that is K_n = \sum_{t=0}^{n-1} \mathbf{1}\{R_t = H_t\}. Clearly

    \[\mathbb{P}( \tau < n ) = \mathbb{P}(K_n > 0) \geq \frac{(\mathbb{E} (K_n) )^2}{\mathbb{E} (K_n^2)},\]

where the inequality is a ‘strong’ form of the second moment method which can be proved with Cauchy-Schwarz as follows (recall that the basic second moment simply says \mathbb{P}(K_n = 0) \leq \mathbb{P}(|K_n - \mathbb{E} (K_n)| \geq |\mathbb{E} (K_n)|) \leq \mathrm{Var}(K_n) / (\mathbb{E} (K_n))^2):

    \[(\mathbb{E} (K_n))^2 = (\mathbb{E} (\mathbf{1}\{K_n \neq 0\} K_n))^2 \leq \mathbb{E}(\mathbf{1}\{K_n \neq 0\}^2) \mathbb{E} (K_n^2) = \mathbb{P}(K_n \neq 0) \mathbb{E} (K_n^2) .\]

Now it is just a matter of computing \mathbb{E} K_n and \mathbb{E} K_n^2 which can be done very easily, see page 9 here. 
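
To see the hunter in action, here is a quick Monte Carlo sanity check (my own, with fresh randomness drawn every n steps as in the restart argument mentioned above, and with two test rabbits of my own choosing):

    import math
    import random

    # Estimate the capture time of the hunter H_t = ceil(a*n + b*t) mod n,
    # restarted with fresh (a, b) every n steps.
    def capture_time(n, rabbit, t_max=10**7):
        for t in range(t_max):
            if t % n == 0:
                a, b = random.random(), random.random()   # fresh start point and speed
            h = math.ceil(a * n + b * (t % n)) % n
            if rabbit(t) == h:
                return t
        return t_max

    n, trials = 100, 300
    rabbits = {"sitting": lambda t: 0,                    # rabbit that never moves
               "jumping": lambda t: random.randrange(n)}  # rabbit jumping uniformly
    for name, r in rabbits.items():
        avg = sum(capture_time(n, r) for _ in range(trials)) / trials
        print(name, avg / (n * math.log(n)))              # stays bounded, as the theorem predicts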

The rabbit’s strategy

The rabbit’s strategy is an order of magnitude more complicated/more beautiful. The basic intuition to design a good rabbit’s strategy is the reversed second moment method which is the following trivial inequality:

    \[\mathbb{P}( \tau < n) = \mathbb{P}(K_n > 0) \leq \frac{\mathbb{E}(K_{2 n})}{\mathbb{E} (K_{2 n} | K_n > 0)} .\]

Let us look into this last term in more detail. Of course the rabbit’s location at a given time step should be uniformly distributed, thus the numerator will be equal to 2 (recall that here the hunter is deterministic). Thus we want to maximize the denominator, in other words once the rabbit has collided with the hunter, he should try to collide as much as possible after that! This is in my opinion a truly beautiful intuition. The way to achieve this, proposed in [Babichenko et al. 2012], goes as follows. Let U be a random variable uniformly distributed in \{0, \hdots, n-1\} and Z=(X,Y) be a simple random walk (independent of U) in \mathbb{Z}^2. Let T_0 = 0 and for k \geq 0,

    \[T_{k+1} = \inf\{ t \geq T_k : Y_t = k+1\} .\]

The rabbit is then defined by R_t = (X_{T_t} + U) \ \mathrm{mod} \ n (with R_0 = U). Using basic properties of the simple random walk in 2 dimensions one can analyze the term \mathbb{E} (K_{2 n} | K_n > 0) in a few lines, see page 7 here. I suspect that one could come up with other rabbit strategies that achieve the same expected capture time.
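
For the curious, here is a short sketch (my own implementation) of how to generate this rabbit's trajectory:

    import random

    # The Babichenko et al. rabbit: run a simple random walk (X_t, Y_t) on Z^2,
    # let T_k be the first time Y reaches level k, and set R_t = (X_{T_t} + U) mod n.
    # Terminates almost surely, though the hitting times T_k are heavy-tailed.
    def rabbit_trajectory(n):
        U = random.randrange(n)
        X = Y = level = 0
        traj = [U]                          # R_0 = U since T_0 = 0
        while level < n - 1:
            dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
            X, Y = X + dx, Y + dy
            if Y == level + 1:              # this step is the time T_{level+1}
                level += 1
                traj.append((X + U) % n)
        return traj

    print(rabbit_trajectory(20))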


Guest post by Dan Garber and Elad Hazan: The Conditional Gradient Method, A Linear Convergent Algorithm – Part II/II

The goal of this post is to develop a Conditional Gradient method that converges exponentially fast while basically solving only a linear minimization problem over the domain \mathcal{X} on each iteration.

To this end we consider the following relaxation of the problem \min_{y\in\mathcal{X}\cap\mathbb{B}_{r_t}(x_t)}\nabla{}f(x_t)^{\top}y which we term a Local Linear Oracle (LLO).

Definition 1: An LLO with parameter \rho for the convex domain \mathcal{X} is a procedure that given a point x\in\mathcal{X}, a scalar r>0 and a linear objective c\in\mathbb{R}^n returns a point y\in\mathcal{X} that satisfies:

1. \forall{z\in\mathcal{X}\cap\mathbb{B}_{r}(x)}:\, c^{\top}y \leq c^{\top}z

2. \Vert{x - y}\Vert_2 \leq \rho\cdot{}r

 

Clearly by taking y_t to be the output of such an LLO and choosing \gamma_t appropriately we would get exponentially fast convergence of the form \exp\left({-O\left({\frac{t}{Q\rho^2}}\right)}\right) (see Part I for more details).

In the following we show that an LLO can be constructed for any polytope such that each call to the LLO amounts to a single linear minimization step over the domain \mathcal{X}, and we specify the parameter \rho.

As an easy first case, let us consider the very specific setting in which \mathcal{X} = \Delta_n, that is the domain is just the n-dimensional simplex. An LLO in this case can be constructed by solving the following optimization problem,

    \begin{eqnarray*} &&\min_{y\in\Delta_n}y^{\top}c \\ && \textrm{s.t. } \Vert{x-y}\Vert_1 \leq \sqrt{n}r \end{eqnarray*}

where \Vert{x-y}\Vert_1 = \sum_{i=1}^n\vert{x(i)-y(i)}\vert.

Denote by y^* the optimal solution to the above problem. Since any z\in\Delta_n\cap\mathbb{B}_r(x) satisfies \Vert{x-z}\Vert_1 \leq \sqrt{n}\Vert{x-z}\Vert_2 \leq \sqrt{n}r, such a z is feasible for the above problem and thus c^{\top}y^* \leq c^{\top}z \forall{}z\in\Delta_n\cap\mathbb{B}_r(x). Moreover it holds that \Vert{y^*-x}\Vert_2 \leq \Vert{y^*-x}\Vert_1 \leq \sqrt{n}r. Thus solving the above \ell_1-constrained linear problem yields an LLO for the simplex with parameter \rho = \sqrt{n}. Most importantly, the above \ell_1-constrained problem is solvable using only a single linear minimization step over \Delta_n and additional computation that is polynomial in the number of non-zeros in the input point x. To see this, observe that y^* is just the outcome of taking weight of \sqrt{n}r/2 from the non-zero entries of x that correspond to the largest (signed) entries in the vector c and moving it entirely to a single entry that corresponds to the smallest (signed) entry in c. This just requires checking, for each non-zero entry in x, the value of the corresponding entry in c, sorting these values, and reducing weights until a total weight of \sqrt{n}r/2 has been reduced. Finding the smallest entry in c, although a trivial operation, is just a linear minimization problem over the simplex.
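
Here is a small sketch (my own) of this weight-moving procedure:

    import numpy as np

    # LLO for the simplex: minimize <c, y> subject to y in the simplex and
    # ||x - y||_1 <= sqrt(n)*r, by moving weight from the non-zero entries of x
    # with the largest (signed) entries of c to the single smallest entry of c.
    def simplex_llo(x, c, r):
        n = len(x)
        budget = min(np.sqrt(n) * r / 2, 1.0)   # total weight allowed to move
        y = x.copy()
        best = int(np.argmin(c))                # linear minimization over the simplex
        for i in np.argsort(-c):                # largest entries of c first
            if i == best:
                continue
            move = min(y[i], budget)
            y[i] -= move
            y[best] += move
            budget -= move
        return y

    x = np.ones(5) / 5
    c = np.array([3.0, -1.0, 2.0, 0.5, 1.0])
    print(simplex_llo(x, c, 0.5))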

What about the general case in which \mathcal{X} is some arbitrary polytope? We would like to generalize the construction for the simplex to arbitrary polytopes.

Given the input point to the LLO x\in\mathcal{X} let us write x as a convex combination of vertices of the polytope, that is x = \sum_i\lambda_iv_i where \lambda_i > 0, \sum_i\lambda_i=1 and each v_i is a vertex of \mathcal{X}. Suppose now that there exists a constant \mu = \mu(\mathcal{X}) such that given a point z\in\mathcal{X} which satisfies \Vert{x-z}\Vert_2 \leq r, z can also be written as a convex combination of vertices of the polytope, z=\sum_i\gamma_iv_i, such that \Vert{\lambda - \gamma}\Vert_1 \leq \mu\cdot{}r. Denote by m the number of vertices of \mathcal{X} and by \lbrace{v_i}\rbrace_{i=1}^m the set of these vertices. Consider the following optimization problem,

    \begin{eqnarray*} &&\min_{\gamma\in\Delta_m}\sum_{i=1}^m\gamma_iv_i^{\top}c \\ && \textrm{s.t. } \Vert{\lambda-\gamma}\Vert_1 \leq \mu\cdot{}r \end{eqnarray*}

Since the above problem is again a linear \ell_1-constrained optimization problem over the simplex we know that it is solvable using only a single call to the linear oracle of \mathcal{X} plus time that is polynomial in the number of non-zeros in \lambda (which is at most t+1 after t iterations of the algorithm and in particular does not depend on m which may be exponential in n). Let \gamma^* be an optimal solution to the above problem and let y^* = \sum_{i=1}^m\gamma^*_iv_i. Then clearly from our definition of \mu we have that \forall{z}\in\mathcal{X}\cap\mathbb{B}_r(x) it holds that c^{\top}y^* \leq c^{\top}z. Also it is not hard to show that \Vert{x-y^*}\Vert_2 \leq \mu{}Dr. Thus if indeed such a constant \mu exists then we have a construction of an LLO with parameter \rho = \mu{}D that requires only a single call to a linear optimization oracle of \mathcal{X}.

The following theorem states the convergence rate of the proposed method.

Theorem 1: Given a polytope \mathcal{X} with parameter \mu as defined above, there exists an algorithm that after t linear minimization steps over the domain \mathcal{X} produces a point x_t\in\mathcal{X} such that

    \begin{eqnarray*} f(x_t) - f(x^*) \leq \left({f(x_0)-f(x^*)}\right)e^{-\frac{1}{4Q\mu^2D^2}t} \end{eqnarray*}

The following theorem states that indeed a constant \mu as suggested above exists for any polyhedral set and gives its dependence on geometric properties of the polytope.

Theorem 2: Let \mathcal{X} be a polytope defined as \mathcal{X} = \lbrace{x\in\mathbb{R}^n \, | \, A_1x = b_1, \, A_2x \leq b_2}\rbrace such that A_2\in\mathbb{R}^{m_2\times{n}}. Assume that there exist parameters \psi >0,\xi >0 such that:

1. for any matrix M which consists of at most n linearly independent rows of A_2 it holds that \Vert{M}\Vert_2 \leq \psi (here \Vert{M}\Vert_2 denotes the spectral norm of M).

2. For every vertex v of \mathcal{X} and every row A_2(i) of A_2 it holds that either A_2(i)^{\top}v = b_2(i) or A_2(i)^{\top}v \leq b_2(i) - \xi (that is, given a vertex and an inequality constraint, the vertex either satisfies the constraint with equality or is far from satisfying it with an equality).

Then \mu \leq \frac{\sqrt{n}\psi}{\xi}.

We now turn to point out several examples of polyhedral sets for which tailored combinatorial algorithms for linear optimization exist and for which the bound on \mu given in Theorem 2 is reasonable.

 

The simplex

The n-dimensional simplex can be written in the form of linear constraints as \Delta_n = \lbrace{x \, | \, -\textbf{I}x \leq \vec{0}, \, \vec{1}\cdot{}x = 1}\rbrace. It’s not hard to see that for the simplex it holds that \psi = \xi = 1 and thus by Theorem 2 \mu \leq \sqrt{n}. In particular we get that when applying the proposed algorithm to the problem \min_{x\in\Delta_n}\Vert{x}\Vert_2^2 an error \epsilon is achieved after O(n\log(1/\epsilon)) iterations, which, as stated here, is nearly optimal.

 

The flow polytope

Given a directed acyclic graph with m edges, a source node marked s and a target node marked t, every path from s to t in the graph can be represented by its identifying vector, that is a vector in \lbrace{0,1}\rbrace^m in which the entries that are set to 1 correspond to edges of the path. The flow polytope of the graph is the convex-hull of all such identifying vectors of the simple paths from s to t. This polytope is also exactly the set of all unit s-t flows in the graph if we assume that each edge has a unit flow capacity (a flow is represented here as a vector in \mathbb{R}^m in which each entry is the amount of flow through the corresponding edge). For this polytope too it is not hard to see that \psi = \xi = 1 and thus \mu \leq \sqrt{n}. Since the flow polytope is just the convex-hull of s-t paths in the graph, minimizing a linear objective over it amounts to finding a minimum weight path given weights for the edges.

 

The doubly-stochastic matrices polytope

Doubly-stochastic matrices are square real-valued matrices with non-negative entries in which each row and each column sums to 1. Writing down the linear inequalities and equalities that define this polytope yields that here also \mu \leq \sqrt{n}.

The Birkhoff-Von Neumann theorem states that this polytope is the convex hull of exactly all n\times{n} permutation matrices. Since a permutation matrix corresponds to a perfect matching in a fully connected bipartite graph, linear minimization over this polytope corresponds to finding a minimum weight perfect matching in a bipartite graph.

 

Matroid polytopes

A matroid is a pair (E,I) where E is a set of elements and I is a set of subsets of E, called the independent sets, which satisfy various interesting properties that resemble the concept of linear independence in vector spaces.

Matroids have been studied extensively in combinatorial optimization and a key example of a matroid is the graphical matroid, in which the set E is the set of edges of a given graph and the set I is the set of all subsets of E which are cycle-free. In particular I in this case contains all the spanning trees of the graph. A subset S\in{I} can be represented by its identifying vector, which lies in \lbrace{0,1}\rbrace^{\vert{E}\vert}, and this gives rise to the matroid polytope which is just the convex hull of all identifying vectors of sets in I. It can be shown that the matroid polytope can be defined by exponentially many linear inequalities (exponential in \vert{E}\vert), which makes optimization over it a non-trivial issue since one needs to resort to the ellipsoid method, which is highly impractical. A separation oracle for this polytope which runs in time polynomial in \vert{E}\vert exists, however it is also fairly complicated. In contrast, linear optimization over matroid polytopes is very easy using a simple greedy procedure which runs in nearly linear time; a sketch for the graphical case is given below. It can be shown that for the matroid polytope it holds that \xi =1 and \psi < n^{3/2} where n = \vert{E}\vert. Thus \mu < n^{3/2}.
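
Here is a toy sketch (my own) of the greedy procedure for the graphical matroid, where linear minimization amounts to computing a minimum-weight spanning forest:

    # Greedy linear minimization over the bases of a graphical matroid,
    # i.e. Kruskal's minimum spanning forest algorithm (toy union-find).
    def find(parent, u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]          # path compression
            u = parent[u]
        return u

    def min_weight_spanning_forest(n_vertices, weighted_edges):
        parent = list(range(n_vertices))
        basis = []
        for w, u, v in sorted(weighted_edges):     # scan edges by increasing weight
            ru, rv = find(parent, u), find(parent, v)
            if ru != rv:                           # keep the edge iff it stays cycle-free
                parent[ru] = rv
                basis.append((u, v, w))
        return basis

    edges = [(1.0, 0, 1), (2.0, 1, 2), (0.5, 0, 2), (3.0, 2, 3)]
    print(min_weight_spanning_forest(4, edges))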

An interesting application of the above method, presented here, which was also our initial motivation for studying this problem, is to the setting of online convex optimization (sometimes termed online learning). Combining the above algorithm with the analysis of an online Frank-Wolfe method presented in this paper by Hazan and Kale from ICML 2012 results in an algorithm for online convex optimization with an iteration complexity that amounts to a single linear optimization step over the domain instead of a projection computation, which can be much more involved. This algorithm has optimal regret in terms of the game length, which answers an open question posed by Kalai and Vempala in this paper from COLT 2003 and by Hazan and Kale in their ICML 2012 paper. Further applications of the method include Frank-Wolfe-like algorithms for stochastic and non-smooth optimization.

 

We end this post with some further research questions.

The method presented in this post clearly holds only for “nicely shaped” polytopes because of the dependency of the constant \mu on \psi, \xi. In particular if we take \mathcal{X} to be the euclidean unit ball, which can only be described by infinitely many linear inequalities, we will have that \xi = 0 and the analysis breaks down. So, an interesting open question is

Question 3: Is there a CG method with a linear convergence rate for smooth and strongly-convex optimization over arbitrary convex sets? In particular is the rate suggested in Question 2 (see Part I) attainable?


Guest post by Dan Garber and Elad Hazan: The Conditional Gradient Method, A Linear Convergent Algorithm – Part I/II

In a previous post Sebastien presented and analysed the conditional gradient method for minimizing a smooth convex function f over a compact and convex domain \mathcal{X}. The update step of the method is as follows,

    \begin{eqnarray*} y_t \in \arg\min_{y\in\mathcal{X}}\nabla{}f(x_t)^{\top}y \\ x_{t+1} = (1-\gamma_t)x_t + \gamma_ty_t \end{eqnarray*}

where \forall{t}, \gamma_t\in(0,1).
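
For concreteness, here is a minimal sketch (my own toy example) of this update on the simplex, with the step size \gamma_t = \frac{2}{t+1} discussed below:

    import numpy as np

    # Conditional gradient on the simplex for f(x) = ||x - b||^2 (toy objective).
    rng = np.random.default_rng(0)
    n = 50
    b = rng.random(n)
    b /= b.sum()                                   # the minimizer lies in the simplex

    x = np.zeros(n)
    x[0] = 1.0                                     # start at a vertex
    for t in range(1, 500):
        grad = 2 * (x - b)
        y = np.zeros(n)
        y[np.argmin(grad)] = 1.0                   # linear minimization over the simplex
        gamma = 2.0 / (t + 1)
        x = (1 - gamma) * x + gamma * y

    print(np.sum((x - b) ** 2))                    # decreases like O(1/t)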

The appeal of the conditional gradient method is twofold:

i) the update step of the method only requires minimizing a linear objective over the domain, which for many domains of interest is computationally cheap (examples are various polytopes that arise in combinatorial optimization such as the flow polytope, the spanning tree polytope and the matching polytope, and the set of unit-trace positive semidefinite matrices), whereas other first order methods require computing a projection onto the convex set with respect to a certain norm on each iteration, which can be computationally much more expensive, and

ii) the method yields sparse solutions in the sense that after t iterations, and provided that the first iterate is a vertex of the convex domain, the current iterate is naturally given as a convex combination of at most t+1 vertices of the convex domain (for the simplex this means at most t+1 non-zero entries and for the set of unit-trace psd matrices this means rank at most t+1).

If f is \beta-smooth and the diameter of \mathcal{X} is D (i.e. \max_{x,y\in\mathcal{X}}\Vert{x-y}\Vert_2 = D) then choosing \gamma_t = \frac{2}{t+1} guarantees that f(x_t) - f(x^*) \leq \frac{2\beta{}D^2}{t+1}. This is also the convergence rate of projected gradient descent for smooth optimization. In case the objective function f is both \beta-smooth and \alpha-strongly convex (with respect to the same norm) it is known that projected gradient descent has an error that decreases exponentially fast: an additive error \epsilon is guaranteed after at most \tilde{O}\left({Q\log{\frac{1}{\epsilon}}}\right) iterations where Q = \beta/\alpha is the condition number of f (the complexity notation \tilde{O}() is the same as O() but omits factors that are logarithmic in the diameter of \mathcal{X} and in \beta). What about the conditional gradient method in this case? So our first question for this post is the following,

Question 1: Is there a CG method that given a \beta-smooth and \alpha-strongly convex function f guarantees an \epsilon additive error after at most \tilde{O}\left({Q\log{\frac{1}{\epsilon}}}\right) linear optimization steps over the domain? 

The answer is no, as observed here. In particular the convergence rate of a linearly converging CG method must depend polynomially on the dimension, which is not the case for the projected gradient descent method. This brings us to our second question,

Question 2: Is there a CG method that given a \beta-smooth and \alpha-strongly convex function f guarantees an \epsilon additive error after at most \textrm{poly}(n)\cdot\tilde{O}\left({Q\log{\frac{1}{\epsilon}}}\right) linear optimization steps over the domain?

Note that despite the \textrm{poly}(n) factor such a result is still interesting since the time to compute a euclidean projection onto the domain (or non-euclidean in case of mirror descent methods) may be longer than the time to minimize a linear objective over the domain by a \textrm{poly}(n) multiplicative factor.

Here is the place to remark that several linearly converging CG methods for smooth and strongly convex optimization were proposed before, but they rely on assumptions regarding the location of x^* with respect to the boundary of \mathcal{X}. For example if x^* lies in the interior of \mathcal{X} (which means that the problem is essentially an unconstrained one) then the original CG method due to Frank and Wolfe converges exponentially fast, however the number of iterations depends polynomially on the distance of x^* from the boundary, see this paper by Guélat and Marcotte. In case \mathcal{X} is a polytope, a modification of the CG method presented in the same paper gives a linear convergence rate that is polynomial in the distance of x^* from the boundary of the smallest facet of \mathcal{X} that contains x^*. Here however we don’t want to rely on such strong assumptions on the location of x^* and we aim at a linearly converging method that is free from such restrictions.

 

In the rest of this post we follow our new paper which describes a new CG method for smooth and strongly convex optimization, with a convergence rate of the form stated in Question 2, in case \mathcal{X} is a polyhedral set. The convergence rate will depend on geometric properties of the set. The latter dependence is very reasonable for many polytopes that arise in combinatorial optimization problems; indeed, domains for which fast and simple combinatorial algorithms for linear optimization exist are part of what makes CG methods an attractive approach for non-linear optimization.

To begin the derivation of this new CG method, let us recall that as Sebastien showed here, the conditional gradient method satisfies the following inequality for each iteration t,

    \begin{eqnarray*} \epsilon_{t+1} \leq (1-\gamma_t)\epsilon_t + \frac{\gamma_t^2\beta}{2}\Vert{y_t-x_t}\Vert^2 \end{eqnarray*}

where \epsilon_t = f(x_t) - f(x^*).

The fact that \Vert{y_t-x_t}\Vert^2 might be as large as the squared diameter of \mathcal{X} while the approximation error \epsilon_t may be arbitrarily small forces one to choose step sizes \gamma_t that decrease roughly like 1/t in order to get the known 1/t convergence rate.

Let us now consider the case that f is also \alpha-strongly convex. That is,

    \begin{eqnarray*} f(y) \geq f(x) + \nabla{}f(x)^{\top}(y-x) + \frac{\alpha}{2}\Vert{x-y}\Vert^2 \end{eqnarray*}

In particular the above inequality implies that,

    \begin{eqnarray*} \forall{x\in\mathcal{X}}\quad f(x) - f(x^*) \geq \frac{\alpha}{2}\Vert{x-x^*}\Vert^2 \end{eqnarray*}

Now assume that the iterate x_t satisfies that f(x_t)-f(x^*) \leq \epsilon_t for some \epsilon_t \geq 0 and denote by \mathbb{B}_r(x) the euclidean ball of radius r centred at point x. Define r_t = \sqrt{\frac{2\epsilon_t}{\alpha}} and let us now choose y_t as the optimal solution to the following problem,

    \begin{eqnarray*} \min_{y\in\mathcal{X}\cap\mathbb{B}_{r_t}(x_t)}\nabla{}f(x_t)^{\top}y \end{eqnarray*}

Note that the CG inequality still holds, since by the strong convexity we have that x^*\in\mathbb{B}_{r_t}(x_t), and the only thing required of the selection of y_t for the CG inequality to hold is that \nabla{}f(x_t)^{\top}y_t \leq \nabla{}f(x_t)^{\top}x^*. We then get that,

    \begin{eqnarray*} f(x_{t+1}) - f(x^*) \leq \left({1-\gamma_t + \gamma_t^2Q}\right)\epsilon_t \end{eqnarray*}

Thus choosing a constant step size like \gamma_t = \frac{1}{2Q} and an initial error bound \epsilon_1 \geq f(x_1)-f(x^*) results in exponentially fast convergence.

The problem with the above approach is of course that now the optimization problem that needs to be solved on each iteration is no longer a linear minimization step over the original domain \mathcal{X} but a much more difficult problem due to the additional constraint on the distance of y_t from x_t. What we would like to get is an exponentially fast converging method that still requires to solve only a linear minimization problem over \mathcal{X} on each iteration. This will be the subject of our next post.


5 announcements

First two self-centered announcements:

A new problem around subgraph densities

Nati Linial and I have just uploaded our first paper together, titled ‘On the local profiles of trees‘. Some background on the paper: recently there has been a lot of interest in subgraph densities for very large graphs, mainly because of their importance for the emerging theory of graph limits, see here for a bunch of videos on this, here for a nice survey by Lovasz, and here for the book version of this survey. A basic issue is that we know almost nothing about subgraph densities. Consider for instance the set \Delta(k) of distributions on k-vertex subgraphs induced by very large graphs (that is, a distribution p \in \Delta(k) corresponds to a large graph G in the sense that for a k-vertex subgraph H, p(H) is the probability that by picking k vertices at random in G one obtains H as the induced subgraph). We know very little about \Delta(k). It is non-convex (think of the complete graph, then the empty graph, and then try to take convex combinations of the corresponding distributions). Even worse than that, Hatami and Norine proved that it is undecidable to determine the validity of a linear inequality for the set \Delta(k). Alright, so subgraph densities in general graphs are in some sense intractable. Can we make some simplifying assumptions and recover tractability? It turns out that the answer is essentially yes if you restrict your attention to trees and subtrees! For instance we prove in our paper with Nati that in this case the set of possible distributions becomes convex! We also initiate the study of the defining inequalities for this set, but we are still far from a complete picture. The paper contains a list of 7 open problems, and I strongly recommend reading our short paper and trying to solve some of them: they are really really fun to work with!

New paper on the multi-armed bandit

Che-Yu Liu (my first graduate student) and I uploaded a few weeks ago our first paper together, titled ‘Prior-free and prior-dependent regret bounds for Thompson Sampling‘. Let me say that there are still plenty of open questions around the theme developed in this paper. The abstract reads as follows: We consider the stochastic multi-armed bandit problem with a prior distribution on the reward distributions. We are interested in studying prior-free and prior-dependent regret bounds, very much in the same spirit as the usual distribution-free and distribution-dependent bounds for the non-Bayesian stochastic bandit. Building on the techniques of Audibert and Bubeck [2009] and Russo and Van Roy [2013], we first show that Thompson Sampling attains an optimal prior-free bound in the sense that for any prior distribution its Bayesian regret is bounded from above by 14 \sqrt{n K}. This result is unimprovable in the sense that there exists a prior distribution such that any algorithm has a Bayesian regret bounded from below by \frac{1}{20} \sqrt{n K}. We also study the case of priors for the setting of Bubeck et al. [2013] (where the optimal mean is known as well as a lower bound on the smallest gap) and we show that in this case the regret of Thompson Sampling is in fact uniformly bounded over time, thus showing that Thompson Sampling can greatly take advantage of the nice properties of these priors.

 

Next are three announcements related to different workshops/conferences:

Simons-Berkeley Research Fellowship 2014-2015

The Simons Institute at UC Berkeley has just posted their call for the Fellowships for next year. Next year’s programs are “Algorithmic Spectral Graph Theory”, “Algorithms & Complexity in Algebraic Geometry” and “Information Theory”. The deadline is December 15. Check out what Justin Thaler has to say about his experience as a research fellow this semester. Let me add that I am also very much enjoying my stay there, and the paper with Nati I talked about above would not have been possible without the Simons Institute.

COLT 2014

The call for papers for COLT 2014 is out, see the official website here. Let me also remind you that this edition will be in Barcelona, which should be a lot of fun.

Mathematics of Machine Learning

Together with Nicolo Cesa-Bianchi, Gabor Lugosi, and Sasha Rakhlin, we are organizing a special program in Barcelona on the Mathematics of Machine Learning from April 7, 2014 to July 14, 2014, see the website for all the details. We are still in the process of inviting people but if you are interested in participating please feel free to send us an email. We should also soon have more details on the large workshop that will take place right after COLT.


First Big Data workshop at the Simons Institute

This week at the Simons Institute we had the first Big Data workshop on Succinct Data Representations and Applications. Here I would like to briefly talk about one of the ‘stars’ of this workshop: the squared-length sampling technique. I will illustrate this method on three examples (taken from three seminal papers).

 

Fast low rank approximation

Frieze, Kannan and Vempala proposed the following algorithm to compute an approximate low rank decomposition of a matrix A \in \mathbb{R}^{m \times n} (this specific version is taken from Chapter 6 here). We denote by A_1, \hdots, A_n the columns of A.

Let j_1, \hdots, j_s be s i.i.d. random variables taking values in [n] whose distribution is proportional to the squared-length of the columns, more precisely the probability that j_1 is equal to i \in [n] is proportional to \|A_i\|_2^2. Let B \in \mathbb{R}^{m \times s} be such that its i^{th} column is \frac{1}{\sqrt{s p_{j_i}}} A_{j_i}. Next compute an SVD of B and let u_1, \hdots, u_k be the k top left singular vectors of B. The low rank approximation to A is finally given by

    \[\tilde{A} = \sum_{i=1}^k u_i u_i^{\top} A .\]

One can prove that for s = 4 k / \epsilon^2 this approximation satisfies:

    \[\mathbb{E} \|A - \tilde{A}\|_F^2 \leq \|A - A_k\|^2_F + \epsilon \|A\|_F^2 ,\]

where A_k is the best rank k approximation to A.

The amazing feature of this algorithm is that the complexity of computing the projection matrix \sum_{i=1}^k u_i u_i^{\top} is independent of n (one needs to be careful with how the sampling of j_1, \hdots, j_s is done but this can be taken care of). In fact it is even possible to obtain an algorithm whose complexity is independent of both n and m and depends only polynomially in k and 1/\epsilon (though the version described above might be better in practice because the complexity is simply m k^2 /\epsilon^4).
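
Here is a minimal sketch (my own, with illustrative choices for m, n, k and s) of the sampling step and the resulting low rank approximation:

    import numpy as np

    # Squared-length column sampling for low rank approximation (a sketch).
    rng = np.random.default_rng(0)
    m, n, k = 200, 500, 5
    A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))   # near rank-k matrix
    A += 0.01 * rng.standard_normal((m, n))

    p = np.sum(A**2, axis=0)
    p /= p.sum()                                    # squared-length distribution on columns
    s = 4 * k * 25                                  # s = 4k/eps^2 with eps = 1/5
    js = rng.choice(n, size=s, p=p)
    B = A[:, js] / np.sqrt(s * p[js])               # rescaled sampled columns

    U, _, _ = np.linalg.svd(B, full_matrices=False)
    Uk = U[:, :k]                                   # top k left singular vectors of B
    A_tilde = Uk @ (Uk.T @ A)                       # the low rank approximation
    print(np.linalg.norm(A - A_tilde, "fro") ** 2 / np.linalg.norm(A, "fro") ** 2)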

 

Fast graph sparsification

Spielman and Teng looked at the following graph sparsification problem: given a weighted graph G, find a ‘sparse’ weighted graph H such that

    \[(1-\epsilon) L_G \preceq L_H \preceq (1+\epsilon) L_G ,\]

where L_H and L_G are the graph Laplacians of H and G. The idea is that if one is able to find ‘fast enough’ a sparse approximation H to G then one can solve many problems on G (which is dense) by instead solving them much faster on H (think of solving linear systems of the form L_G x = b for example).

Spielman and Srivastava proposed a method to carry this graph sparsification task using the squared-length sampling technique. First they reduce the problem to the following: given vectors v_1, \hdots, v_N that form a decomposition of the identity, that is

    \[\sum_{i=1}^N v_i v_i^{\top} = I_n ,\]

find a small subset of them such that appropriately reweighted they form an approximate decomposition of the identity. Rudelson showed that one can solve this problem by simply sampling s = O \left(n \log(n) / \epsilon^2 \right) vectors from \{v_1, \hdots, v_N\} i.i.d. proportionally to their squared-length, and reweighting each sampled vector by the square root of the inverse of s times the probability of selecting this vector. In other words, he showed that if j_1, \hdots, j_s is an i.i.d. sequence in [N] such that p_i = \mathbb{P}(j_1 = i) is proportional to \|v_i\|_2^2 then with high probability one has

    \[(1-\epsilon) I_n \preceq \frac{1}{s} \sum_{t=1}^s \frac{1}{p_{j_t}} v_{j_t} v_{j_t}^{\top} \preceq (1+\epsilon) I_n.\]

Batson, Spielman and Srivastava also showed how to find this decomposition with a deterministic algorithm, see also this very nice survey of Naor.
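
Here is a small numerical sketch (my own) of this sampling scheme, starting from an exact decomposition of the identity:

    import numpy as np

    # Sample from a decomposition of the identity proportionally to squared
    # lengths and check the spectral approximation (a sketch).
    rng = np.random.default_rng(0)
    n, N = 20, 2000
    M = rng.standard_normal((N, n))
    L = np.linalg.cholesky(M.T @ M)
    V = M @ np.linalg.inv(L.T)                  # rows v_i now satisfy sum_i v_i v_i^T = I_n

    p = np.sum(V**2, axis=1)
    p /= p.sum()                                # squared-length distribution
    s = 2000
    js = rng.choice(N, size=s, p=p)
    approx = (V[js].T / (s * p[js])) @ V[js]    # (1/s) * sum_t v_{j_t} v_{j_t}^T / p_{j_t}
    print(np.linalg.eigvalsh(approx)[[0, -1]])  # extreme eigenvalues, close to 1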

 

Approximation guarantee for k-means

The last example is the now famous k-means++ algorithm of Arthur and Vassilvitskii. Given x_1, \hdots, x_N \in \mathbb{R}^n one would like to find c_1, \hdots, c_k \in \mathbb{R}^n minimizing the following quantity:

    \[\sum_{t=1}^N \min_{1 \leq i \leq k} \|x_t - c_i\|_2^2 .\]

The following strategy gives a randomized O(\log(k))-approximation algorithm for this NP-hard problem. First select c_1 uniformly at random from x_1, \hdots, x_N. Then select the new centers iteratively, each time sampling a point at random from x_1, \hdots, x_N proportionally to its squared distance to the nearest current center.
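
The seeding step fits in a few lines; here is a sketch (my own, on toy data):

    import numpy as np

    # k-means++ seeding: each new center is sampled with probability proportional
    # to the squared distance to the nearest center chosen so far.
    def kmeans_pp_seeding(X, k, rng):
        centers = [X[rng.integers(len(X))]]           # first center uniformly at random
        for _ in range(k - 1):
            d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
            centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
        return np.array(centers)

    rng = np.random.default_rng(1)
    X = np.concatenate([rng.normal(c, 0.2, size=(50, 2)) for c in [(0, 0), (3, 0), (0, 3)]])
    print(kmeans_pp_seeding(X, 3, rng))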


First week of activity at the Simons Institute

This first week at the Simons Institute was a lot of fun! I attended the first workshop in the Real Analysis program which was about Testing, Learning and Inapproximability. There were plenty of good talks and I learned a lot of new things.

 

First I learned about parallel repetition. Let \Sigma be a finite alphabet. Consider a bipartite graph G with vertex sets U and V and a labeling of the edges with mappings from \Sigma to \Sigma, that is edge uv is labeled with some mapping \pi_{uv} : \Sigma \rightarrow \Sigma. Now consider the following two-player game (Alice and Bob are the players). First an edge uv is drawn at random from some known probability distribution \nu over the set of edges in G (the graph G and the ‘labels’ are also known to the players). The vertex u is revealed to Alice and the vertex v is revealed to Bob. Without communication each one of them has to output an alphabet symbol, say \alpha for Alice and \beta for Bob, and they win if \beta = \pi_{uv}(\alpha). In other words they win if the pair of symbols is consistent with the labeling of the drawn edge. The value of the game v(G) is the probability that Alice and Bob win under optimal play, which can be rewritten as:

    \[v(G) = \max_{f : U \rightarrow \Sigma, g : V \rightarrow \Sigma} \mathbb{P}_{uv \sim \nu}\big(g(v) = \pi_{uv} (f(u)) \big) .\]

Now we define the value of the repeated game as follows: the players simply play the above game k times (with no communication at all except before the game starts when they agree on their strategy, and with the sequence of edges being i.i.d.) and the value of the repeated game, denoted v(G^{\otimes k}), is the probability that they win all k games. This can be rewritten as:

    \begin{eqnarray*} v(G^{\otimes k}) = \max_{f : U^k \rightarrow \Sigma^k, g : V^k \rightarrow \Sigma^k} & & \mathbb{P}_{(u_1v_1, \hdots, u_kv_k) \sim \nu^{\otimes k}}\big(\forall i \in [k], \\ & & [g(v_1, \hdots, v_k)]_i = \pi_{u_iv_i}([f(u_1, \hdots, u_k)]_i) \big) . \end{eqnarray*}

One is now interested in understanding the relation between v( G^{\otimes k} ) and v(G).

At this point it’s probably a good idea to pause. If this is the first time that you read about parallel repetition you are probably like ‘wtf, it’s obvious that v( G^{\otimes k} ) = v(G)^k !!!’ (at least this was my initial reaction). It turns out that this is very far from being true. Clearly it’s true that v( G^{\otimes k} ) \geq v(G)^k since you can simply play the k games independently. But in some cases v( G^{\otimes k} ) might be much larger than v(G)^k, which is kind of amazing! Here is a beautiful example (which is due to Lance Fortnow I think) where v(G) = 1/2 and v( G^{\otimes 2} ) = 1/2. The graph G is the complete graph on the set of vertices U=V=\{0,1\} and the distribution \nu is uniform on the set of edges (in other words Alice receives a random bit u \in \{0,1\} and Bob receives a random bit v \in \{0,1\}). The alphabet is \Sigma = (\{A,B\} \times \{0,1\}) \cup \{\mathrm{err}\} where the symbol \mathrm{err} is introduced for the sake of simplicity and we assume that the players cannot respond with \mathrm{err}. Now the mappings \pi_{uv} are given by:

    \[\pi_{uv}((A, u) )=(A, u), \ \pi_{uv}((B, v) )=(B, v), \ \pi_{uv}((A, \bar{u}) )= \pi_{uv}((B, \bar{v}) )=\mathrm{err}.\]

In words the players must choose either Alice or Bob and then they both have to output the bit of the chosen player. For instance if they choose Alice then Bob has to guess the bit of Alice (and Alice can just report her bit). Clearly there is not much one can do for the one-shot game: one has v(G) = 1/2. Now the situation becomes more interesting for the two-round game: the players can agree that in the first round Bob will try to guess Alice’s first bit, and in the second round Alice will try to guess Bob’s second bit. The trick is that when Bob tries to guess Alice’s first bit he can use his second bit as a guess, and respectively Alice’s guess for Bob’s second bit can be her own first bit. That way they effectively reduce the two-round game to a one-round game, since both predictions are correct if Bob’s second bit is equal to Alice’s first bit, which happens with probability 1/2. In other words we proved v( G^{\otimes 2}) = 1/2 while naively one would have guessed that v( G^{\otimes 2} ) = 1/4.
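
A tiny Monte Carlo simulation (my own) makes this concrete:

    import random

    # Check the two-round trick: Bob uses his second bit as a guess for Alice's
    # first bit, and Alice uses her first bit as a guess for Bob's second bit.
    trials, wins = 10**5, 0
    for _ in range(trials):
        u1, u2 = random.getrandbits(1), random.getrandbits(1)   # Alice's bits
        v1, v2 = random.getrandbits(1), random.getrandbits(1)   # Bob's bits
        round1 = (v2 == u1)        # round 1: they choose Alice, Bob must guess u1
        round2 = (u1 == v2)        # round 2: they choose Bob, Alice must guess v2
        wins += round1 and round2  # the two events coincide (u2 and v1 play no role)
    print(wins / trials)           # close to 1/2, not 1/4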

Now that we have a better understanding of why the original question on the relation between v( G^{\otimes k} ) and v(G) is a non-trivial one, it is clear that the interesting thing to study is to upper bound the value v( G^{\otimes k} ) as a function of v(G) (recall the obvious lower bound v( G^{\otimes k} ) \geq v(G)^k). In essence we are asking to show a limit to the power of tricks such as the one described above. Such a result was first proved by Ran Raz in 1995. He showed that, essentially, if v(G) \leq 1- \epsilon then v( G^{\otimes k} ) \leq (1-\epsilon^2 /2)^k (the original proof gave \epsilon^{32} instead of \epsilon^2). At the Simons workshop David Steurer talked about a new proof of this result based on viewing the value of the game as the norm of some matrix, see this paper. This new proof method also led to new upper bounds, including v( G^{\otimes k} ) \leq v(G)^{\sqrt{k}} (which matches some known lower bounds from Ran Raz). As far as I understand there are still plenty of open questions on this fundamental problem of parallel repetition.

 

Joe Neeman talked about a new proof of Borell’s theorem which states that if X and Y are two standard Gaussian vectors in \mathbb{R}^n with \mathrm{cov}(X, Y) = \rho I_n, then half-spaces are maximizers of the function A \subset \mathbb{R}^n \mapsto \mathbb{P}(X \in A, Y \in A). The proof is really slick, it’s basically one page of fairly simple calculations, see Section 2 here. Elchanan Mossel then talked about the ‘natural’ extension of this result to the case of uniform variables on the hypercube \{-1,1\}^n, see this paper.

 

- James Lee talked about a very interesting problem: we know how to prove lower bounds for the size of extended formulations (see this blog post for instance), but such bounds are interesting only for exact optimization. A natural question is whether we can prove lower bounds on the size of LPs even for approximate optimization. Apparently there is a rich literature on this for specific types of LP relaxations such as the Sherali-Adams hierarchy. Now in their new paper (which is not yet online) James Lee and his co-authors are able to prove general bounds instead of the ones that were previously proved specifically for the Sherali-Adams hierarchy. Of course LPs are nice but SDPs are even nicer (see here) and a natural question is how to generalize these results to SDPs instead of LPs. This was touched upon in Raghu Meka‘s talk where he discussed lower bounds for the number of rounds needed to solve the hidden clique problem with the Lasserre hierarchy. The paper, joint work with Avi Wigderson, is available here.

 

- Nati Linial gave an absolutely wonderful talk, probably one of the best that I have seen in months. My advice is to look at the video as soon as it is released on the Simons website (this should be a link to it hopefully). Just as a teaser here is one object that Linial is looking at. First a definition: for a graph G=(V,E) let p_i(G) be the fraction of 3-vertex subsets that span exactly i-1 edges. The vector p(G) = (p_1(G), \hdots, p_4(G)) is called the 3-profile of G. Now the object that we look at is the set S \subset \mathbb{R}^4 of possible ‘asymptotic profiles’, more precisely:

    \[S = \{x \in \mathbb{R}^4 : \exists (G_n = (V_n, E_n))_{n \in \mathbb{N}}, p(G_n) \xrightarrow[n \to \infty]{} x\} .\]

There is not much that we know about S. It is not convex: (1,0,0,0) and (0,0,0,1) are clearly in S but (1/2, 0,0,1/2) is of course not in S. We also know since the 60’s that for x \in S one has x_1 + x_4 \geq 1/4 and Linial recently proved that \min(x_2, x_4) \leq 0.26. A better understanding of the properties of S could lead to breakthroughs in the emerging theory of statistics for networks (see Nati’s talk for more details!).
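
To play with this definition, here is a short sketch (my own) computing the 3-profile of a small graph:

    from itertools import combinations

    # 3-profile of a graph: p_i is the fraction of 3-vertex subsets spanning
    # exactly i-1 edges, for i = 1, ..., 4.
    def three_profile(n, edges):
        E = {frozenset(e) for e in edges}
        counts = [0, 0, 0, 0]
        for triple in combinations(range(n), 3):
            k = sum(frozenset(pair) in E for pair in combinations(triple, 2))
            counts[k] += 1
        total = sum(counts)
        return [c / total for c in counts]

    # 5-cycle: every triple of vertices spans exactly 1 or 2 edges
    print(three_profile(5, [(i, (i + 1) % 5) for i in range(5)]))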

 

Tomorrow the Big Data program will start with its Boot Camp. Apparently there has been a lot of hype around this boot camp and the seats have been ‘sold out’. If you can’t enter the auditorium, or if you are not in Berkeley, the event will be streamed live here.


Random-Approx 2013

Last week I attended the Random-Approx conference at Berkeley. I missed quite a few talks as I was also settling into my new office for the semester at the Simons Institute, so I will just report on the three invited talks:

Luca Trevisan gave a talk on spectral graph theory. He first went through the basics: the eigenvalues of the normalized Laplacian L= I_n - D^{-1/2} A D^{-1/2} (where D is the diagonal matrix of degrees and A the adjacency matrix of the graph) tell you something about connectivity in the graph, precisely the eigenvalue 0 has multiplicity k if and only if the graph has k connected components, and the largest eigenvalue is equal to 2 if and only if the graph has a bipartite connected component (both statements are really easy to prove). This is a very nice result but a much more powerful statement would be a ‘soft’ version where one says something useful about the graph when say some eigenvalue is close to 0 (instead of being exactly 0). Luckily this has been extensively studied. A famous example is Cheeger’s inequality, which states that a small second eigenvalue \lambda_2 implies a small expansion (the expansion \Phi(S) of a set S of vertices is the probability that a random neighbor of a random vertex in S is outside of S), precisely there exists S such that

    \[\frac{\lambda_2}{2} \leq \Phi(S) \leq \sqrt{2 \lambda_2}.\]
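
To see both sides of this inequality numerically, here is a small sketch (the graph and the code are my own illustrative choices) computing \lambda_2 and the expansion of a sweep cut of the second eigenvector:

    import numpy as np

    # lambda_2 of the normalized Laplacian vs. the expansion of a sweep cut,
    # on two 5-cliques joined by a single edge (a natural low-expansion graph).
    def normalized_laplacian(A):
        d = A.sum(axis=1)
        Dinv = np.diag(1.0 / np.sqrt(d))
        return np.eye(len(A)) - Dinv @ A @ Dinv

    n = 10
    A = np.zeros((n, n))
    A[:5, :5] = 1
    A[5:, 5:] = 1
    A[0, 5] = A[5, 0] = 1
    np.fill_diagonal(A, 0)

    evals, evecs = np.linalg.eigh(normalized_laplacian(A))
    lam2 = evals[1]
    S = np.argsort(evecs[:, 1])[:5]           # sweep cut: bottom half of the eigenvector
    rest = np.setdiff1d(np.arange(n), S)
    phi = A[np.ix_(S, rest)].sum() / A.sum(axis=1)[S].sum()   # expansion of S
    print(lam2 / 2, phi, np.sqrt(2 * lam2))   # lam2/2 <= phi <= sqrt(2*lam2)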

Another example is the result of Arora, Barak and Steurer which says that a small value for the k^{th} eigenvalue \lambda_k implies that there is a set of roughly size n/k with small expansion.

One thing that I find fascinating is that as far as I understand we still have pretty much no idea of what ‘type’ of properties are encoded in the spectrum of a graph. Of course there is a good reason for this: computing the spectrum is computationally easy so this question is directly related to asking what properties of a graph are easy to compute, and this latter question is far far beyond our current knowledge. One example of a highly non-trivial property which is ‘hidden’ in the spectrum (and the corresponding eigenbasis) is a Szemeredi decomposition. Recall that the Szemeredi regularity lemma states roughly the following: fix a precision \epsilon >0, then there exists k such that for any graph on n vertices, with n large enough, one can partition the vertices into k clusters of equal sizes, such that for most pairs of clusters (i,j) (except a fraction \epsilon of the pairs), the graph of edges between cluster i and cluster j looks like a random graph (in a certain sense and up to an error of \epsilon) with a certain density of edges d_{i,j}. Another way to put it is that any large graph of size n can be viewed as a small graph of size k (with k independent of n!!!!) with weighted edges d_{i,j}. You can find the precise statement of this result here together with a proof based on the spectrum of the graph.

 

Santosh Vempala gave a talk on high-dimensional sampling algorithms. I already blogged about this since he gave the same talk at ICML, see here.

 

Persi Diaconis gave a very nice board talk about how to use algorithms to prove theorems in probability theory. Of course the standard approach is rather to go the other way around and use probability to derive new algorithms! I discussed a somewhat similar topic in the early days of this blog when I suggested that one could use results in statistics to prove new theorems in probability theory (again one usually does the opposite), see here. In his talk Diaconis illustrated the power of algorithms as a proof method for probability statements with the following problem. Let \Pi(n) be the set of partitions of [n]. The cardinality of \Pi(n) is known as the Bell number B(n). For various reasons one can be interested in computing quantities of the form

(1)   \begin{equation*}  \sum_{\lambda \in \Pi(n)} f(\lambda) , \end{equation*}

for some function f : \Pi(n) \rightarrow \mathbb{R}. Apparently there is a huge literature that shows that such statistics can be expressed as polynomial functions of the Bell numbers. For each f one has to work hard to prove the expression. What Diaconis did in this paper is to prove an existence theorem for such an expression for ‘reasonable’ functions f. Of course one can view (1) as the expectation of f(\lambda) when \lambda is a set partition drawn uniformly at random. The key for the proof is then to have a ‘convenient algorithm’ to generate such a set partition at random, which in turn will make the analysis easy. The algorithm that they used is the following beautiful strategy of Stam. First one has to know Dobinski’s formula:

    \[B(n) = \frac{1}{e} \sum_{j=1}^{+\infty} \frac{j^n}{j!} .\]

(Yup that’s right, it’s an amazing formula!) Now pick an integer j at random with probability \frac{1}{e B(n)} \frac{j^n}{j!} (this defines a probability distribution thanks to Dobinski’s formula). Next drop n labeled balls into j boxes, uniformly and independently at random. This gives a partition of [n] (two elements are in the same block if their balls landed in the same box), and the distribution of this partition is uniform. The point is that there are many possible algorithms to generate a partition at random, but this particular clever algorithm leads to a new and interesting point of view to prove statements about uniformly drawn set partitions.
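
Here is a sketch (my own implementation) of Stam's algorithm:

    import math
    import random

    # Stam's algorithm for a uniformly random partition of [n]: pick j with
    # probability j^n / (j! * e * B(n)), then drop n labeled balls into j boxes.
    def bell(n):                                   # Bell numbers via the usual recurrence
        B = [1]
        for m in range(n):
            B.append(sum(math.comb(m, k) * B[k] for k in range(m + 1)))
        return B[n]

    def stam_partition(n):
        u = random.random() * math.e * bell(n)     # inverse-CDF sampling of j,
        j = 0                                      # valid thanks to Dobinski's formula
        while u > 0:
            j += 1
            u -= j**n / math.factorial(j)
        boxes = [random.randrange(j) for _ in range(n)]   # drop the n labeled balls
        blocks = {}
        for ball, box in enumerate(boxes):
            blocks.setdefault(box, []).append(ball)
        return sorted(blocks.values())

    print(stam_partition(8))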

 

P.S.: In the last post I gave links to the videos for COLT 2013 and ICML 2013, in case you are interested here are the direct links to my talks in these events:

- Bounded regret in stochastic multi-armed bandits (COLT), paper is here.

- Multiple Identifications in Multi-Armed Bandits (ICML), unfortunately for this one the sound is not correctly synchronized so it’s probably better to just look at the paper directly.

- Here is another talk I did during the summer on an older paper, it’s about 0^{th} order Lipschitz optimization.

 


COLT 2013/ICML 2013 videos

The videos for COLT 2013 were just published on videolectures.net. The videos for ICML 2013 are also available on TechTalks.tv.


Two summer readings on Big Data and Deep Learning

This is the first (short) post dedicated to the Big Data program of the Simons Institute. We received from the program organizer Mike Jordan our first reading assignment, which is a report published by the National Academy of Sciences on the “Frontiers in Massive Data Analysis“. This report was written by a committee of academics and industry researchers chaired by Mike Jordan. I find the report quite interesting because it clearly identifies the new mathematical and algorithmic challenges that the Big Data point of view brings. This includes in particular many issues related to data representation, as well as distributed algorithms, crowdsourcing, and tradeoffs between statistical accuracy and computational cost.

 

Talking about data representation, I would like to link another paper, this one from Stéphane Mallat, Group Invariant Scattering. Stéphane’s idea can be described roughly as follows: a useful data representation, say for sound signals, should of course be invariant to translations but also robust to small diffeomorphisms which are close to translations. In other words one is looking for mappings \Phi : L_2(\mathbb{R}^d) \rightarrow \mathcal{H} which are invariant to translations and Lipschitz continuous with respect to C^2 diffeomorphisms of the form x \mapsto x - \tau(x) (with the weak topology on C^2 diffeomorphisms). As an example consider the modulus of the Fourier transform: this mapping is invariant to translations but it is not Lipschitz continuous with respect to diffeomorphisms, as one can ‘expand’ arbitrarily high frequencies with a simple transformation of the form x \mapsto (1-s) x. Mallat’s construction turns out to be much more complicated than simple Fourier or Wavelet transforms. Interestingly it builds on ideas from the Deep Learning literature. It also generalizes to other groups of transforms than translations, such as rotations (which can be useful for images).
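
Here is a tiny numerical illustration (my own toy example) of that remark about the modulus of the Fourier transform:

    import numpy as np

    # |FFT| is invariant to (circular) translations but unstable to a small
    # dilation x -> (1-s)x when the signal has high frequencies.
    N, freq, s = 1024, 200, 0.01
    t = np.arange(N)
    sig = np.cos(2 * np.pi * freq * t / N)
    shifted = np.roll(sig, 37)                            # a translation
    dilated = np.cos(2 * np.pi * freq * (1 - s) * t / N)  # a tiny dilation

    F = lambda x: np.abs(np.fft.fft(x))
    print(np.linalg.norm(F(sig) - F(shifted)))   # ~ 0: translation invariance
    print(np.linalg.norm(F(sig) - F(dilated)))   # large: the frequency peak moved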
