# Claude Shannon and a new theory of everything (Scientific American)

http://www.scientificamerican.com/article/a-meta-law-to-rule-them-all-physicists-devise-a-theory-of-everything/

# Rate-distortion Theory with Non-finite Distortion Measures

Shannon’s rate-distortion theory gives the optimal tradeoff between compression rate and fidelity, as measured by expected average distortion, provided the distortion function is bounded. Here we show how to extend this result to non-finite distortion measures that strictly forbid certain reconstructions.

Distortion:
We will assume that the information source $X$ has finite support, and the observed samples $X_t$ are i.i.d. with a known distribution $P_X$.  Let $d(x,y)$ be a distortion function used for measuring fidelity, where $y$ is the reconstruction.  The average distortion for a sequence of source samples and reconstructions is the average of $d(x_t,y_t)$.
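As a small illustration (my own sketch, not from the post), the average distortion of a block is just the per-symbol distortion averaged over the block; here with Hamming distortion $d(x,y) = 0$ if $x = y$ and $1$ otherwise:

```python
import numpy as np

# Hypothetical example: a binary source with Hamming distortion.
rng = np.random.default_rng(0)
n = 8
x = rng.integers(0, 2, size=n)   # source samples X_1 .. X_n
y = x.copy()
y[0] ^= 1                        # reconstruction with a single symbol error

# Average distortion: (1/n) * sum_t d(x_t, y_t)
avg_distortion = np.mean(x != y)
print(avg_distortion)            # 1/8 = 0.125
```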

Rate Constraint:
The rate constraint imposed by rate-distortion theory allows $n$ information symbols to be encoded as a block, where $n$ is specified by the designer of the encoding. The encoder maps the $n$ symbols to $nR$ bits. A decoder then produces the best possible reconstruction, with respect to the distortion metric, based only on the $nR$ bits of description.

Expected Distortion:
We say that $D$ is achievable at rate $R$ if there exist a block-length $n$ and an encoder and decoder operating with rate constraint $R$ such that the average distortion is less than $D$ in expected value.  The average distortion is random because the information source is random, and the encoder and decoder may achieve a different average distortion for different source sequences, thus the need to consider the expected value.

Rate-distortion Function:
Let $R(D)$ be the infimum of rates $R$ such that $D$ is achievable at rate $R$.

Rate-Distortion Theorem (expected distortion, known result):
For a source distribution $P_X$, if $d(x,y)$ is bounded,

$\displaystyle R(D) = \min_{P_{Y|X}: E d(X,Y) \leq D} I(X;Y),$
where mutual information $I(X;Y)$ is measured with respect to $P_X P_{Y|X}$.
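The minimization in the theorem can be evaluated numerically with the standard Blahut-Arimoto algorithm. Below is a minimal sketch (my own, not from the post; the example and variable names are mine) that traces one point of the $R(D)$ curve, parameterized by a Lagrange slope `beta`:

```python
import numpy as np

def blahut_arimoto(p_x, d, beta, n_iter=2000):
    """Trace one point of the R(D) curve for source p_x and distortion
    matrix d, at Lagrange slope beta (larger beta -> smaller distortion)."""
    nx, ny = d.shape
    q_y = np.full(ny, 1.0 / ny)                  # output marginal, start uniform
    for _ in range(n_iter):
        w = q_y * np.exp(-beta * d)              # unnormalized test channel
        p_y_given_x = w / w.sum(axis=1, keepdims=True)
        q_y = p_x @ p_y_given_x                  # re-estimate output marginal
    joint = p_x[:, None] * p_y_given_x
    D = float(np.sum(joint * d))
    R = float(np.sum(joint * np.log2(p_y_given_x / q_y)))  # I(X;Y) in bits
    return R, D

# Binary source with Hamming distortion, where R(D) = h(p) - h(D) in closed form.
p_x = np.array([0.8, 0.2])
d = np.array([[0.0, 1.0], [1.0, 0.0]])
R, D = blahut_arimoto(p_x, d, beta=3.0)
```

For this symmetric example the fixed point at slope `beta` satisfies $D = 1/(1+e^{\beta})$, so the returned rate can be checked against the closed form $h(0.2) - h(D)$.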

The above theorem is restricted to bounded distortion functions. Since $X$ has finite support and the optimal choice of $Y$ also has finite support, an unbounded distortion function implies that the distortion is infinite for some pairs of values. That is, some pairs $(x,y)$ are strictly forbidden, no matter how unlikely they are to occur.  This cannot be guaranteed at the rates demanded by the rate-distortion theorem stated above.

Excess Distortion:
An alternative definition for achievability, known as excess distortion, requires that the average distortion be less than $D$ with high probability.  That is, $D$ is achievable at rate $R$ if for any $\epsilon > 0$ there exist a block-length $n$ and an encoder and a decoder operating with rate constraint $R$ such that the average distortion is greater than $D$ with probability less than $\epsilon$.

For bounded distortion functions, excess distortion is a stronger requirement than expected distortion.  It is notable that the rate-distortion function does not change even under this stronger requirement (see theorem below).  However, for unbounded distortion measures, the excess distortion requirement allows an $\epsilon$ chance of “error.”  Even if the distortion is infinite during those occurrences, the achievability criterion is still met, so it cannot strictly forbid any reconstructions.  This is why the following theorem does not require any condition on the distortion function.

Rate-Distortion Theorem (excess distortion, known result):
Under the excess distortion requirement, for a source distribution $P_X$,

$\displaystyle R(D) = \min_{P_{Y|X}: E d(X,Y) \leq D} I(X;Y).$

Under the expected distortion requirement, one may wonder what the rate-distortion function is in general, for non-finite distortion measures.  This curiosity is reflected in classic information theory texts from decades past and is the subject of this post.  Csiszár and Körner’s 1981 textbook dances around this question, containing all the pieces but not connecting them together.  I will make references to the second edition of their book.

Safe Value for $y$:
The first idea one might consider is to identify cases where the rate-distortion function in the theorems above is still accurate.  You might arrive at the following:  If $d(x,y)$ has a safe value of $y$ that always produces finite distortion for any $x$, then the encoder can add a single additional message.  That message tells the decoder to produce the safe value of $y$ for the entire sequence.  This message is only used when the encoder cannot find a valid reconstruction in the codebook with distortion less than infinity, which should be rare.  One additional message does not change the asymptotic rate.

This observation is exercise 7.13 in Csiszár-Körner, which credits Gallager’s 1968 textbook.  I will touch on Gallager’s bound momentarily.
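The safe-value fallback can be sketched in a few lines (my illustration, not the post’s or Gallager’s construction; the toy distortion below is an assumption): the encoder reserves one extra index beyond the codebook, and when no codeword achieves finite distortion it sends that index, telling the decoder to output the safe value for the whole block.

```python
import math

def encode(x_seq, codebook, d):
    """Return the index of the best codeword, or len(codebook) as fallback."""
    best, best_idx = math.inf, len(codebook)
    for i, y_seq in enumerate(codebook):
        dist = sum(d(x, y) for x, y in zip(x_seq, y_seq)) / len(x_seq)
        if dist < best:
            best, best_idx = dist, i
    return best_idx              # stays at the fallback index if all were infinite

def decode(idx, codebook, safe_y, n):
    if idx == len(codebook):     # the one extra message
        return [safe_y] * n      # safe value for the entire sequence
    return list(codebook[idx])

# Toy distortion on alphabet {0, 1, 2}: an exact match costs 0, the "safe"
# symbol 2 costs 1 for every x, and any other mismatch is forbidden (infinite).
def d(x, y):
    if x == y:
        return 0.0
    return 1.0 if y == 2 else math.inf

codebook = [(0, 0), (1, 1)]
safe_y = 2
```

The extra message costs $\log(|{\cal C}|+1) - \log|{\cal C}|$ bits per block, which vanishes as the block-length grows, matching the claim that one additional message does not change the asymptotic rate.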

Example of Conflict:
Before continuing, it will be beneficial to give a simple example where a non-finite distortion measure requires a rate higher than the rate-distortion function in the theorems above.  We need look no further than a binary source.  Let the distortion function $d(x,y)$ be zero if $x=y$ and infinity if $x \neq y$.  Under the excess distortion requirement, any finite distortion can be achieved with rate $R = H(X)$.  In other words, near-lossless compression is necessary and sufficient to achieve finite average distortion $D$ with high probability.  On the other hand, it is not sufficient to guarantee finite distortion with probability one.  Under the expected distortion requirement, perfect lossless compression is required, and a rate $R=\log|{\cal X}|$ is necessary.
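The gap in the binary example is easy to quantify numerically (my sketch; the bias $p = 0.1$ is an assumed value for illustration):

```python
import numpy as np

# Binary example with d(x,y) = 0 if x == y and infinity otherwise:
# excess distortion needs rate H(X); expected distortion needs log2|X| = 1 bit.
p = 0.1                                    # P(X = 1), an assumed bias
H_X = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
lossless_rate = np.log2(2)                 # log2 of the support size

print(H_X)            # about 0.469 bits: enough under excess distortion
print(lossless_rate)  # 1 bit: required under expected distortion
```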

Gallager’s Bound (1968 textbook, Theorem 9.6.2):
We must assume that for each $x$ in the support of $X$ there is a value of $y$ that yields finite distortion under $d(x,y)$.  If not, then the average distortion will be infinite with high probability regardless of the encoder.  Now suppose that there is a subset ${\cal Y}_s$ such that for each $x$ in the support of $X$ there is a $y \in {\cal Y}_s$ that yields finite distortion under $d(x,y)$.  This is a generalization of the safe value of $y$ mentioned previously—instead it is a safe set in which you can always find a safe $y$, though it will need to be specified by the encoder.

Theorem (Gallager):
Under the expected distortion requirement, for a source distribution $P_X$,

$\displaystyle R(D) \leq \max \left\{ \min_{P_{Y|X}: E d(X,Y) \leq D} I(X;Y), \quad \log|{\cal Y}_s| \right\}.$

The theorem above is achieved by creating a reserve codebook that enumerates the sequences in the set ${\cal Y}_s$.  In the rare cases where the encoder cannot find a good reconstruction sequence in the primary codebook, it instead specifies a safe sequence from the reserve codebook.

Now the main point of this post.

Non-Finite Rate-Distortion Theorem (average distortion, new result):
Under the expected distortion requirement, for a source distribution $P_X$,

$\displaystyle R(D) = \max \left\{ \min_{P_{Y|X}: E d(X,Y) \leq D} I(X;Y), \quad \max_{P_X} \min_{P_{Y|X}: E d(X,Y) < \infty} I(X;Y) \right\}.$

Proof:
The above rate can be achieved by encoding separately for typical $x^n$ sequences and atypical sequences. For the typical ones, design the encoder and decoder to assure an average distortion $D$, and for the atypical ones just assure a finite distortion. The encoder can first specify the type to the decoder using negligible rate, so that the decoder knows which codebook to use.  The maximization in the second term on the right-hand side accounts for each type being encoded separately.  The rate of compression is determined by the highest rate among the types.

Source coding is fundamentally different from channel coding in that there is a zero-error aspect of it that does not require any extra rate.  We can efficiently obtain a perfect covering.  If we consider error events to be instances where the average distortion exceeds $D$, as in the excess distortion definition, then we can separate error events into two cases, those where $x^n$ is typical and those where it is not.  The error probability when $x^n$ is typical can be made exactly zero using the same compression rate that achieves nearly zero error.  The typical set is perfectly covered.  So all of the error in source coding comes from $x^n$ being atypical.  The proof that this occurs comes from the standard analysis using strong typicality.  For an arbitrary typical $x^n$, the probability of error is doubly-exponentially small, where the probability is with respect to a randomly generated codebook.  The union bound over all typical sequences gives a doubly-exponential confidence that no typical $x^n$ will produce an error.  Existence arguments complete the proof.

Since the scheme for achieving the theorem handles each type separately, there will be no errors (I’ve skipped over a uniform convergence bound that is needed to formalize this).  Typical source sequences will yield distortion less than $D$, and atypical sequences will yield finite distortion.  The expected average distortion will meet the constraint.

One possible starting point to arrive at this result would have been to identify the rate required to guarantee finite distortion.  This problem is combinatorial in nature: the distortion function only specifies which reconstructions are forbidden, and these constraints must be met with probability one.  Because of this, the source distribution is no longer relevant.  There are several equivalent formulas to express the minimum rate needed.  A simple one can be obtained from Theorem 9.2 in Csiszár-Körner:

$\displaystyle R = \max_{P_X} \min_{P_{Y|X}: E d(X,Y)< \infty} I(X;Y).$
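To make this term concrete (an assumed toy example, not from the post): for the identity distortion of the earlier binary example, generalized to a $4$-ary alphabet, finite distortion forces $Y = X$, so the inner minimum is $I(X;X) = H(X)$, and the outer maximization over $P_X$ is attained by the uniform distribution, giving $\log_2 4 = 2$ bits:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Inner minimum equals H(X) for each candidate source distribution;
# the outer maximum over P_X is attained at the uniform distribution.
candidates = [np.array([0.7, 0.1, 0.1, 0.1]),
              np.array([0.4, 0.3, 0.2, 0.1]),
              np.full(4, 0.25)]            # uniform
rates = [entropy(p) for p in candidates]
# The uniform distribution gives the largest rate, log2(4) = 2 bits.
```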

Then the formula for $R(D)$ in the theorem can be understood in the following way:  Encode for distortion $D$, and when the encoder fails, use a reserve codebook that guarantees finite distortion.  This sounds similar to Gallager’s method.  The difference is that Gallager only gave a loose upper bound on the rate of the reserve codebook.

The converse to this theorem is immediate.  Each of the two terms in the maximization corresponds to a necessary achievability requirement—average distortion $D$ and finite distortion with probability one. $\blacksquare$

# Institute of Mathematical Statistics

IMS provides a premier global network of researchers. IMS runs premier journals such as the Annals of Statistics, the Annals of Probability, the Annals of Applied Statistics, and the Annals of Applied Probability. IMS also gives prestigious awards and runs or jointly runs major statistical conferences such as the Joint Statistical Meetings (JSM) and the World Congress of Probability and Statistics.

IMS membership is free for students, and reduced rates are available for new graduates and for people with permanent residence in a number of countries designated by the IMS Council (http://imstat.org/membership/designated_countries.htm).


# HBO Series: Silicon Valley

THE REALITY OF SILICON VALLEY [POPULAR SCIENCE, May 2014]
When Mike Judge, creator of Office Space and Beavis and Butt-head, set out to write his HBO comedy series, Silicon Valley, about Bay Area coders, he wanted to conceive a simple, believable widget for his characters to invent. So he teamed with Stanford electrical engineering professor Tsachy Weissman and Ph.D. student Vinith Misra; they came up with a superfast compression algorithm.
Data compression makes files—text, music, movies—smaller and more transmissible. The best algorithms are fast but don’t harm fidelity. For the show, Weissman and Misra devised a fictional algorithm that excels at just that. The writers coined a metric, the “Weissman Score,” for characters to use when comparing compression codes.
Researchers are always refining compression. Last year, for instance, UCLA unveiled a way to save images more efficiently than JPEGs do. So something like what’s in Silicon Valley, which is more than two times faster than current models, is possible. “New, effective ways of compression are out there,” says Weissman. ADAM HADHAZY

# IEEE North American Summer School of Information Theory

The 2014 IEEE North American Summer School of Information Theory will take place June 18-21, 2014 at the Fields Institute in Toronto, Canada. Applications are now open.

https://www.fields.utoronto.ca/programs/scientific/13-14/infotheory/

# Information Theoretical Estimators (ITE) Toolbox

[Admin note: This post comes to us from Zoltan Szabo, of UCL.  He is the creator of a new toolbox called Information Theoretical Estimators (ITE).]

I am an applied mathematician working as a research associate at the Gatsby Unit, UCL with Arthur Gretton. My research interests include dictionary learning problems, kernel methods and information theory. I got acquainted with the estimation of information theoretical quantities while I was working on independent subspace analysis (ISA) and its extensions with Barnabás Póczos, who introduced me to this field. ISA is a blind signal separation problem, which can be formulated as the optimization of Shannon’s differential entropy or mutual information objectives. Provided that one can efficiently estimate these quantities, it is possible to solve a wide variety of ISA-type tasks. While I was working on these problems, I gradually realized that the available software packages focus on (i) discrete variables, or (ii) quite specialized quantities and estimation methods. This is what motivated me to write the Information Theoretical Estimators (ITE) toolbox – as a leisure activity.

ITE is a recently released, free and open source, multi-platform Matlab/Octave toolbox distributed under the GPLv3(>=) license. The goal of ITE is to (i) cover the state-of-the-art nonparametric information theoretical estimators, (ii) in a highly modular, and (iii) user-friendly way. The focus in ITE is on continuous random variables. At the moment the package provides estimators for the following quantities:

• entropy: Shannon entropy, Rényi entropy, Tsallis entropy (Havrda and Charvát entropy), complex entropy, $\Phi$-entropy ($f$-entropy), Sharma-Mittal entropy,
• mutual information: generalized variance, kernel canonical correlation analysis, kernel generalized variance, Hilbert-Schmidt independence criterion, Shannon mutual information (total correlation, multi-information), $L_2$ mutual information, Rényi mutual information, Tsallis mutual information, copula-based kernel dependency, multivariate version of Hoeffding’s $\Phi$, Schweizer-Wolff’s $\sigma$ and $\kappa$, complex mutual information, Cauchy-Schwartz quadratic mutual information, Euclidean distance based quadratic mutual information, distance covariance, distance correlation, approximate correntropy independence measure, $\chi^2$ mutual information (Hilbert-Schmidt norm of the normalized cross-covariance operator, squared-loss mutual information, mean square contingency),
• divergence: Kullback-Leibler divergence (relative entropy, I directed divergence), $L_2$ divergence, Rényi divergence, Tsallis divergence, Hellinger distance, Bhattacharyya distance, maximum mean discrepancy (kernel distance), J-distance (symmetrised Kullback-Leibler divergence, J divergence), Cauchy-Schwartz divergence, Euclidean distance based divergence, energy distance (specially the Cramer-Von Mises distance), Jensen-Shannon divergence, Jensen-Rényi divergence, K divergence, L divergence, certain f-divergences (Csiszár-Morimoto divergence, Ali-Silvey distance), non-symmetric Bregman distance (Bregman divergence), Jensen-Tsallis divergence, symmetric Bregman distance, Pearson $\chi^2$ divergence ($\chi^2$ distance), Sharma-Mittal divergence,
• association measures including measures of concordance: multivariate extensions of Spearman’s $\rho$ (Spearman’s rank correlation coefficient, grade correlation coefficient), correntropy, centered correntropy, correntropy coefficient, correntropy induced metric, centered correntropy induced metric, multivariate extension of Blomqvist’s $\beta$ (medial correlation coefficient), multivariate conditional version of Spearman’s $\rho$, lower/upper tail dependence via conditional Spearman’s $\rho$,
• cross-quantities: cross-entropy,
• kernels on distributions: expected kernel (summation kernel, mean map kernel), Bhattacharyya kernel, probability product kernel, Jensen-Shannon kernel, exponentiated Jensen-Shannon kernel, exponentiated Jensen-Renyi kernel(s), Jensen-Tsallis kernel, exponentiated Jensen-Tsallis kernel(s), and
• plus some auxiliary quantities: Bhattacharyya coefficient (Hellinger affinity), alpha-divergence.

The toolbox also offers

• solution methods for ISA and its generalizations – as a prototype application how to formulate and solve an information theoretical optimization problem family in a high-level way.
• several consistency tests (analytical vs estimated value), and
• a further demonstration in image registration – information theoretical similarity measures exhibit exciting outlier-robust characteristics in this domain.

ITE has been accepted for publication in JMLR: Zoltán Szabó. Information Theoretical Estimators Toolbox. Journal of Machine Learning Research 15:217-221, 2014. It was also presented at the NIPS 2013 MLOSS workshop.

A few more details: ITE

Feel free to use the toolbox.

Zoltán

# Claude Shannon’s collected works, volume II

Volume I was published in 1993 edited by Neil Sloane and Aaron Wyner.
Volume II (consisting of unpublished miscellaneous manuscripts) has just been made available by Neil Sloane.
Thanks to Matthieu Bloch, the various manuscripts can be individually accessed from

https://www.dropbox.com/sh/hqbglxnh0g1ikfp/BMUSRjAZNF