Information Theoretical Estimators (ITE) Toolbox

[Admin note: This post comes to us from Zoltan Szabo, of UCL.  He is the creator of a new toolbox called Information Theoretical Estimators (ITE).]

I am an applied mathematician working as a research associate at the Gatsby Unit, UCL with Arthur Gretton. My research interest includes dictionary learning problems, kernel methods and information theory. I got acquainted with the estimation of information theoretical quantities while I was working on independent subspace analysis (ISA) and its extensions with Barnabás Póczos who introduced me to this field. ISA is a blind signal separation problem, which can be formulated as the optimization of Shannon’s differential entropy or mutual information objectives. Provided that one can efficiently estimate these quantities, it is possible to solve a wide variety of ISA-type tasks. While I was working on these problems, I gradually realized that the available software packages focus on (i) discrete variables, or (ii) quite specialized quantities and estimation methods. This is what motivated me to write the Information Theoretical Estimators (ITE) toolbox – as a leisure activity.

ITE is a recently released, free and open source, multi-platform, Matlab/Octave toolbox released under GPLv3(>=) license. The goal in ITE is to (i) cover the state-of-the-art nonparametric information theoretical estimators, (ii) in a highly modular, and (iii) user-friendly way. The focus in ITE is on continuous random variables. At the moment the package provide estimators for the following quantities:

  • entropy: Shannon entropy, Rényi entropy, Tsallis entropy (Havrda and Charvát entropy), complex entropy, \Phi-entropy (f-entropy), Sharma-Mittal entropy,
  • mutual information: generalized variance, kernel canonical correlation analysis, kernel generalized variance, Hilbert-Schmidt independence criterion, Shannon mutual information (total correlation, multi-information), L_2 mutual information, Rényi mutual information, Tsallis mutual information, copula-based kernel dependency, multivariate version of Hoeffding’s \Phi, Schweizer-Wolff’s \sigma and \kappa, complex mutual information, Cauchy-Schwartz quadratic mutual information, Euclidean distance based quadratic mutual information, distance covariance, distance correlation, approximate correntropy independence measure, \chi^2 mutual information (Hilbert-Schmidt norm of the normalized cross-covariance operator, squared-loss mutual information, mean square contingency),
  • divergence: Kullback-Leibler divergence (relative entropy, I directed divergence), L_2 divergence, Rényi divergence, Tsallis divergence, Hellinger distance, Bhattacharyya distance, maximum mean discrepancy (kernel distance), J-distance (symmetrised Kullback-Leibler divergence, J divergence), Cauchy-Schwartz divergence, Euclidean distance based divergence, energy distance (specially the Cramer-Von Mises distance), Jensen-Shannon divergence, Jensen-Rényi divergence, K divergence, L divergence, certain f-divergences (Csiszár-Morimoto divergence, Ali-Silvey distance), non-symmetric Bregman distance (Bregman divergence), Jensen-Tsallis divergence, symmetric Bregman distance, Pearson \chi^2 divergence (\chi^2 distance), Sharma-Mittal divergence,
  • association measures including measures of concordance: multivariate extensions of Spearman’s \rho (Spearman’s rank correlation coefficient, grade correlation coefficient), correntropy, centered correntropy, correntropy coefficient, correntropy induced metric, centered correntropy induced metric, multivariate extension of Blomqvist’s \beta (medial correlation coefficient), multivariate conditional version of Spearman’s \rho, lower/upper tail dependence via conditional Spearman’s \rho,
  • cross-quantities: cross-entropy,
  • kernels on distributions: expected kernel (summation kernel, mean map kernel), Bhattacharyya kernel, probability product kernel, Jensen-Shannon kernel, exponentiated Jensen-Shannon kernel, exponentiated Jensen-Renyi kernel(s), Jensen-Tsallis kernel, exponentiated Jensen-Tsallis kernel(s), and
  • +some auxiliary quantities: Bhattacharyya coefficient (Hellinger affinity), alpha-divergence.

The toolbox also offers

  • solution methods for ISA and its generalizations – as a prototype application how to formulate and solve an information theoretical optimization problem family in a high-level way.
  • several consistency tests (analytical vs estimated value), and
  • a further demonstration in image registration – information theoretical similarity measures exhibit exciting outlier-robust characteristics in this domain.

ITE has been accepted for publication in JMLR: Zoltán Szabó. Information Theoretical Estimators Toolbox. Journal of Machine Learning Research 15:217-221, 2014. It was also presented at NIPS-2013: MLOSS workshop.

A few more details: ITE

Feel free to use the toolbox.


An Extremal Conjecture: Experimenting with Online Collaboration

Update (4/3/2014): I believe I have solved the conjecture, and proven it to be correct.  I will make a preprint available shortly. The original blog post remains available below. — Tom

I have an extremal conjecture that I have been working on intermittently with some colleagues (including Jiantao Jiao, Tsachy Weissman, Chandra Nair, and Kartik Venkat). Despite our efforts, we have not been able to prove it. Hence, I thought I would experiment with online collaboration by offering it to the broader IT community.

In order to make things interesting, we are offering a $1000 prize for the first correct proof or counterexample! Feel free to post your thoughts in the public comments. You can also email me if you have questions or want to bounce some ideas off me.

Although I have no way of enforcing them, please abide by the following ground rules:

  1. If you decide to work on this conjecture, please send me an email to let me know that you are doing so. As part of this experiment with online collaboration, I want to gauge how many people become involved at various degrees.
  2. If you solve the conjecture or make significant progress, please keep me informed.
  3. If you repost this conjecture, or publish any results, please cite this blog post appropriately.

One final disclaimer: this post is meant to be a brief introduction to the conjecture, with a few partial results to get the conversation started; it is not an exhaustive account of the approaches we have tried.

1. The Conjecture

Conjecture 1. Suppose {X,Y} are jointly Gaussian, each with unit variance and correlation {\rho}. Then, for any {U,V} satisfying {U-X-Y-V}, the following inequality holds:

\displaystyle 2^{-2I(Y;U)} 2^{-2I(X;V|U)} \geq (1-\rho^2)+ \rho^2 2^{-2I(X;U)} 2^{-2I(Y;V|U)} . \ \ \ \ \ (1)

2. Partial Results

There are several partial results which suggest the validity of Conjecture 1. Moreover, numerical experiments have not produced a counterexample.

Conjecture 1 extends the following well-known consequence of the conditional entropy power inequality to include long Markov chains.

Lemma 1 (Oohama, 1997). Suppose {X,Y} are jointly Gaussian, each with unit variance and correlation {\rho}. Then, for any {U} satisfying {U-X-Y}, the following inequality holds:

\displaystyle 2^{-2 I(Y;U)} \geq 1-\rho^2+\rho^2 2^{-2I(X;U)}. \ \ \ \ \ (2)

Proof: Consider any {U} satisfying {U-X-Y}. Let {Y_u, X_u} denote the random variables {X,Y} conditioned on {U=u}. By Markovity and definition of {X,Y}, we have that {Y_u = \rho X_u + Z}, where {Z\sim N(0,1-\rho^2)} is independent of {X_u}. Hence, the conditional entropy power inequality implies that

\displaystyle 2^{2h(Y|U)} \geq \rho^2 2^{2h(X|U)} + 2 \pi e(1-\rho^2) = 2 \pi e \rho^2 2^{-2I(X;U)} + 2 \pi e(1-\rho^2). \ \ \ \ \ (3)

From here, the lemma easily follows. \Box

Lemma 1 can be applied to prove the following special case of Conjecture 1. This result subsumes most of the special cases I can think of analyzing analytically.

Proposition 1. Suppose {X,Y} are jointly Gaussian, each with unit variance and correlation {\rho}. If {U-X-Y} are jointly Gaussian and {U-X-Y-V}, then

\displaystyle 2^{-2I(Y;U)} 2^{-2I(X;V|U)} \geq (1-\rho^2)+ \rho^2 2^{-2I(X;U)} 2^{-2I(Y;V|U)}. \ \ \ \ \ (4)

Proof: Without loss of generality, we can assume that {U} has zero mean and unit variance. Define {\rho_u = E[XU]}. Since {U-X-Y} are jointly Gaussian, we have

\displaystyle I(X;U) =\frac{1}{2}\log\frac{1}{1-\rho_u^2} \ \ \ \ \ (5)

\displaystyle I(Y;U) =\frac{1}{2}\log\frac{1}{1-\rho^2\rho_u^2}. \ \ \ \ \ (6)

Let {X_u,Y_u,V_u} denote the random variables {X,Y,V} conditioned on {U=u}, respectively. Define {\rho_{XY|u}} to be the correlation coefficient between {X_u} and {Y_u}. It is readily verified that

\displaystyle \rho_{XY|u} = \frac{\rho\sqrt{1-\rho_u^2}}{\sqrt{1-\rho^2\rho_u^2}}, \ \ \ \ \ (7)

which does not depend on the particular value of {u}. By plugging (5)-(7) into (4), we see that (4) is equivalent to

\displaystyle 2^{-2I(X;V|U)} \geq (1-\rho_{XY|u}^2)+ \rho_{XY|u}^2 2^{-2I(Y;V|U)}. \ \ \ \ \ (8)

For every {u}, the variables {X_u,Y_u} are jointly Gaussian with correlation coefficient {\rho_{XY|u}} and {X_u-Y_u-V_u} form a Markov chain, hence Lemma 1 implies

\displaystyle 2^{-2I(X_u;V_u)} \geq (1-\rho_{XY|u}^2)+ \rho_{XY|u}^2 2^{-2I(Y_u;V_u)}. \ \ \ \ \ (9)

The desired inequality (8) follows by convexity of

\displaystyle \log\left[(1-\rho_{XY|u}^2)+ \rho_{XY|u}^2 2^{-2z}\right] \ \ \ \ \ (10)

as a function of {z}. \Box

3. Equivalent Forms

There are many equivalent forms of Conjecture 1. For example, (1) can be replaced by the symmetric inequality

\displaystyle 2^{-2(I(X;V)+I(Y;U))} \geq (1-\rho^2)2^{-2I(U;V)}+ \rho^2 2^{-2(I(X;U)+I(Y;V))}. \ \ \ \ \ (11)

Alternatively, we can consider dual forms of Conjecture 1. For instance, one such form is stated as follows:

Conjecture 1′. Suppose {X,Y} are jointly Gaussian, each with unit variance and correlation {\rho}. For {\lambda\in [{1}/({1+\rho^2}),1]}, the infimum of

\displaystyle I(X,Y;U,V)-\lambda\Big(I(X;UV)+I(Y;UV)\Big), \ \ \ \ \ (12)

taken over all {U,V} satisfying {U-X-Y-V} is attained when {U,X,Y,V} are jointly Gaussian.