Lecture 1. Introduction

What is information theory?

The first question that we want to address is: “What is information?” Although there are several ways in which we might think of answering this question, the main rationale behind our approach is to distinguish information from data. We think of information as something abstract that we want to convey, while we think of data as a representation of information, something that is storable/communicable. This is best understood by some examples.

Example. Information is a certain idea, while data is the words we use to describe this idea.

Example. Information is 10^{1,000,000,000}. Possible data describing this information are: 10^{1,000,000,000}, 10\cdots 0, 10^{10^9}, 1 followed by a billion zeros.

As we are in a mathematical setting we want to rely on a quantitative approach. The main question that arises naturally is: “How can we measure information?” Making sense of this question requires us to have a model for how data is produced. Throughout this seminar we will consider the probabilistic model which we now introduce.

Definition (Probabilistic model). Data is a random variable X taking values on the space A (alphabet) having distribution P (source distribution). We write X\sim P.

To be precise, with the above we mean that there exists a probability space (\Omega,\mathcal{F},\mathbb{P}) and a measurable space (A,\mathcal{G}) with some measurable function X:\Omega\rightarrow A such that we have \mathbb{P}\circ X^{-1} = P.


  1. While this set-up is very similar to what is done in statistics, the focus in information theory is different. In statistics it is assumed that the data X comes from one of a family of distributions (statistical model), and the goal is to infer something about the particular distribution generating the data. On the other hand, in information theory the distribution of the data X might be known or not, and the goal is to compress or communicate X.
  2. In the probabilistic model we assume that the data is generated by a certain random source. This is a particular modeling assumption and it is not necessarily an expression of belief in how data are actually produced. This is a reasonable modeling assumption to make and it allows us to draw reasonable conclusions (for example, text data is clearly not randomly produced, but you can still do useful things by making the modeling assumption that the data was produced by a stochastic source).
  3. The original motivation behind the development of information theory as based on the probabilistic model came from a practical engineering problem (how to compress/communicate data), and not from the idea of how we measure information (although this aspect was also part of the motivation). The whole field of study was created by the 1948 paper of Claude Shannon.
  4. We are going to use the probabilistic model throughout this seminar, but other models exist as well. A popular model used in theoretical computer science is the algorithmic model (which defines the field of algorithmic information theory, as opposed to probabilistic information theory). In this model it is assumed that data is the output of some computer program (running on a Turing machine). While this approach could be thought of as a generalization of the probabilistic model (in fact, one way in which computers can work is to simulate from some probability distribution), many of the basic quantities in algorithmic information theory (like Kolmogorov complexity) are not computable. This is the reason why this field is suitable for theoretical insights, but it is not necessarily suitable for practical purposes.

How do we measure information in the probabilistic model?

In what follows we assume that the information to be conveyed coincides with the data X itself (and we now assume that X takes values in some countable set A), meaning that there is no universal hidden meaning that we are trying to convey apart from the data itself. For example, assume that the information we want to convey (a particular realization of it) is the text “This is the SAS”. A natural way of measuring the amount of information contained in this data is to look for other representations of this information, and to look for the smallest (in some sense that needs to be specified) representation. As we are in the probabilistic framework, we do not know in advance which data is going to be produced by the random source, so we look for a procedure that takes the random outcome and gives us on average the smallest or most compact representation of the information in that data.

Since the data is random, the size of a particular realization (encoded data) is also random. One way to take this randomness into account is to consider a representation (encoding scheme) that is uniquely decodable and minimizes the expected length/size of the encoded data. That is, we measure the amount of information in the data X as

    \[\min_{\text{valid encoding schemes}} \mathbb{E}_P [\text{length of encoded data}].\]

If we set up things in this way, the measure of information is some functional H of the source distribution P, since P is the only quantity governing the data. This functional H is called the entropy and it is defined as follows.

Definition (Entropy). If P is a discrete distribution, then the entropy of P is defined as

    \[H(P) := \sum_{x\in A} p(x) \log\frac{1}{p(x)} ,\]

where we write p(x)=P(\{x\}) for the probability mass function of P.

While it can be shown that H(P) is the minimal expected length of validly encoded data, we do not proceed this way (the ideas behind this fact are covered in the first couple of lectures of an information theory class). Instead, we will give some intuition on why H(P) is a good measure of information.
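Although we do not prove the coding interpretation here, a small numerical sketch may make it more concrete. The code below uses the Shannon code lengths \lceil \log_2 (1/p(x)) \rceil, which are one standard (generally suboptimal) choice of valid codeword lengths, not the optimal code itself. For these lengths the Kraft sum is at most 1 and the expected length lies between H(P) and H(P)+1 bits. The two probability mass functions and all function names are hypothetical choices made for illustration.

```python
import math

def entropy_bits(p):
    """Entropy H(P) in bits of a pmf given as a dict {symbol: probability}."""
    return sum(px * math.log2(1.0 / px) for px in p.values() if px > 0.0)

def shannon_code_lengths(p):
    """Codeword lengths ceil(log2(1/p(x))) for every symbol with p(x) > 0."""
    return {x: math.ceil(math.log2(1.0 / px)) for x, px in p.items() if px > 0.0}

def expected_length(p, lengths):
    """Expected codeword length E_P[length] in bits."""
    return sum(p[x] * n for x, n in lengths.items())

def kraft_sum(lengths):
    """Sum of 2^(-length); a value <= 1 means a prefix code with these lengths exists."""
    return sum(2.0 ** (-n) for n in lengths.values())

# Illustrative pmfs (hypothetical, not taken from the lecture).
dyadic = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
skewed = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}

for name, p in [("dyadic", dyadic), ("skewed", skewed)]:
    lengths = shannon_code_lengths(p)
    print(name, entropy_bits(p), expected_length(p, lengths), kraft_sum(lengths))
```

When all probabilities are powers of 1/2, as in the first example, the expected length of this code coincides with H(P); otherwise there is a gap of less than one bit.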

We first provide some intuition on why the information in the statement X=a should be decreasing as a function of P(X=a). Recall that for now we assume that the source distribution P is known. If we know P, how informative is a particular outcome x from the source distribution? If P(X=x)=\delta_a(x) (i.e., P is a point mass at a), being informed that the random outcome is a is not informative. On the other hand, if P(X=a)=10^{-10}, being informed that the outcome is a is extremely informative/significant (something very rare has happened).

The relevant decreasing function turns out to be the following:

    \[p(x) \longmapsto \log\frac{1}{p(x)}.\]

In this respect, \log\frac{1}{p(x)} corresponds to the information that we get from the statement X=x. So the average amount of information in the random outcome X is given by

    \[\mathbb{E}_P \left[ \log\frac{1}{p(X)} \right] = \sum_{x\in A} p(x) \log\frac{1}{p(x)} = H(P).\]
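To make the last display concrete, here is a minimal sketch that computes \log\frac{1}{p(x)} for single outcomes and averages it to obtain H(P). Natural logarithms are used (so the unit is nats), which is an assumption since the base of the logarithm is not fixed above; the function names and the example distributions are hypothetical.

```python
import math

def surprisal(prob):
    """Information log(1/p) in nats carried by an outcome of probability p > 0."""
    return math.log(1.0 / prob)

def entropy(p):
    """H(P) = E_P[log(1/p(X))] in nats, skipping zero-probability symbols."""
    return sum(px * surprisal(px) for px in p.values() if px > 0.0)

# Illustrative distributions (hypothetical).
point_mass = {"a": 1.0, "b": 0.0}
fair_coin = {"heads": 0.5, "tails": 0.5}

print(surprisal(1.0))     # an outcome that was certain to happen
print(surprisal(1e-10))   # a very rare outcome
print(entropy(point_mass), entropy(fair_coin))
```

The point mass has zero entropy, in line with the remark that learning its outcome is not informative, while the rare outcome carries a large amount of information.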

Connection between information theory and statistics

While the connection between information theory and statistics is not surprising as both fields rely on the probabilistic model, this correspondence is very strong and natural. We give some examples.

  1. Maximum likelihood estimators (MLE) can be seen as minimal codelength estimators. In a statistical model we assume that X\sim P_{\theta}, with \theta\in \Theta for some parameter space \Theta, and the goal is to find the parameter \theta that generated the data. A popular estimator is the MLE since it is plausible to assume that the parameter that generated the data X is the parameter \tilde\theta whose corresponding distribution would have given maximal probability to X, that is,

        \[\hat\theta := \mathop{\mathrm{argmax}}_{\tilde\theta\in\Theta} P_{\tilde\theta}(X).\]

    Note that we can rewrite the above as

        \[\hat\theta = \mathop{\mathrm{argmin}}_{\tilde\theta\in\Theta} \log \frac{1}{P_{\tilde\theta}(X)},\]

    which can be seen to correspond to the minimal number of bits required to represent X assuming that it was generated by P_{\tilde\theta} (codelength). Hence the connection between the MLE in statistics and the minimal codelength estimator in information theory. In this setting we assume that we do not know the distribution generating the data and we try to find a good code to encode the data. The problem of finding a good code is in some sense equivalent to the problem of finding the distribution itself, since once you know the distribution you know the best code (in some sense). We also mention that many penalized maximum likelihood estimators (where the complexity of the model is taken into account by adding a penalty term to the maximum likelihood objective) can be motivated from an information-theoretic point of view in terms of analogous coding problems; this is the idea behind the “Minimum Description Length” principle.

  2. In hypothesis testing, the optimal error exponents are information-theoretic quantities.

These are not just coincidental connections, but examples of basic relationships between fundamental limits of statistical inference on the one hand, and fundamental limits of communication and compression of data on the other hand.
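The first connection above can be illustrated with a small sketch, under the assumption of an i.i.d. Bernoulli(\theta) model (a choice made here for illustration, not one made in the lecture). The code minimizes the ideal codelength \log_2 \frac{1}{P_{\tilde\theta}(X)} over a grid of candidate parameters and compares the minimizer with the closed-form MLE, which for this model is the sample mean. The data and the grid are hypothetical.

```python
import math

def codelength_bits(data, theta):
    """Ideal codelength log2(1/P_theta(data)) for i.i.d. Bernoulli(theta) bits."""
    ones = sum(data)
    zeros = len(data) - ones
    return -(ones * math.log2(theta) + zeros * math.log2(1.0 - theta))

def min_codelength_estimate(data, grid):
    """Parameter on the grid that gives the data the shortest ideal codelength."""
    return min(grid, key=lambda theta: codelength_bits(data, theta))

data = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]      # hypothetical sample: 7 ones out of 10
grid = [k / 100 for k in range(1, 100)]   # candidate values 0.01, ..., 0.99

theta_hat = min_codelength_estimate(data, grid)
print(theta_hat, sum(data) / len(data))   # grid minimizer next to the sample mean
```

A grid search is used only to keep the sketch short; any minimizer of the codelength would serve the same purpose.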

We now turn to the main topic of this seminar, that is, the connection between information theory and probability theory.

Connection between information theory and probability theory

Since we are using a probabilistic model it is clear that probability theory is the language of information theory. However, it is not so obvious that information theory can say something fundamental about probability theory. In fact, in the past half century or so, it has been realized that information theory captures many fundamental concepts in probability theory. Before turning to one key example of such connection (the entropic central limit theorem) which will serve as motivation for the initial few lectures of the seminar, we introduce some relevant quantities.

Definition (Differential or Boltzmann-Shannon entropy). If X\in\mathbb{R}^n, X\sim P and \frac{dP}{d \text{Leb}}=f (i.e., X has a density f with respect to the Lebesgue measure), then the differential entropy of P (equivalently, differential entropy of f) is defined as

    \[h(P) := h(f) := - \int_{\mathbb{R}^n} f(x) \log f(x) \, dx,\]

with the conventions 0\log 0 = 0 and dx = \text{Leb}(dx).

While we can think of h as a measure of disorder (particularly motivated by the setting introduced by Boltzmann in physics), h is not a measure of information in the same sense as H is. The reason is that in the present context of “continuous” data (recall that we are in \mathbb{R}^n and a possible outcome of X is a vector of real numbers) we need infinitely many bits to encode each outcome of X, so it is not meaningful to talk of the amount of information in an outcome as this is generally infinite. Nonetheless, the differential entropy represents a crucial quantity in information theory, and shows up for example both when considering communication over channels with continuous noise distributions, and when considering lossy data compression (the only kind of data compression possible for sources taking values in \mathbb{R}^n, where one accepts some slight distortion of the data in order to be able to encode it with finitely many bits).

The notion that unifies the continuous entropy h with the discrete entropy H previously introduced is the relative entropy which we now define.

Definition (Relative entropy). If P is a probability measure and Q is a \sigma-finite measure on A, then the relative entropy between P and Q is defined as

    \[D(P || Q) :=  	\begin{cases} 		\int f \log f \, dQ	&\text{if $P\ll Q$ with $\frac{dP}{dQ}=f$},\\ 		\infty		&\text{otherwise}. 	\end{cases}\]

Typically P and Q have respective densities p and q with respect to a given reference measure \lambda. Then the relative entropy reads

    \[D(P || Q) = \int p(x) \log \frac{p(x)}{q(x)} \,\lambda(dx).\]

The following examples show how the relative entropy relates h and H.

  1. If A is a countable set and \lambda is the counting measure, then

        \[D(P || \lambda) = \sum_{x\in A} p(x) \log p(x) = - H(P).\]

  2. If A=\mathbb{R}^n and \lambda is the Lebesgue measure, then

        \[D(P || \lambda) = \int p(x) \log p(x) \, dx = - h(P).\]
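Both identities can be checked numerically. The sketch below evaluates \sum_x p(x)\log p(x) for a uniform distribution on four points, and approximates \int p(x)\log p(x)\,dx for the standard normal density on \mathbb{R} by a midpoint Riemann sum. The truncation to [-10,10], the number of cells and the function names are assumptions made for illustration, and the comparison value \frac{1}{2}\log(2\pi e) is the standard closed form for the differential entropy of the standard normal in nats.

```python
import math

def neg_entropy_counting(p):
    """D(P || counting measure) = sum_x p(x) log p(x), with 0 log 0 = 0."""
    return sum(px * math.log(px) for px in p.values() if px > 0.0)

def neg_entropy_lebesgue(density, lo, hi, cells=20000):
    """Midpoint Riemann-sum approximation of the integral of p log p over [lo, hi]."""
    dx = (hi - lo) / cells
    total = 0.0
    for k in range(cells):
        fx = density(lo + (k + 0.5) * dx)
        if fx > 0.0:                      # convention 0 log 0 = 0
            total += fx * math.log(fx) * dx
    return total

def std_normal(x):
    """Density of N(0, 1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

uniform4 = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}   # hypothetical example
print(neg_entropy_counting(uniform4), -math.log(4.0))
print(neg_entropy_lebesgue(std_normal, -10.0, 10.0),
      -0.5 * math.log(2.0 * math.pi * math.e))
```

Both values are negative, which is possible here because neither the counting measure on four points nor the Lebesgue measure is a sub-probability measure.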

The following property of relative entropy is the most important inequality in information theory.

Lemma. Let P be a probability measure on A, and Q be a sub-probability measure on A (i.e., Q is a nonnegative, countably additive measure with 0<Q(A)\le 1). Then D(P || Q) \ge 0.

Proof. We only need to consider the case where P\ll Q. Let f=\frac{dP}{dQ} and R=Q/Q(A). Then we have

    \begin{align*} 	D(P || Q)  	&= \int f(x) \log f(x) \,Q(dx) 	= Q(A)\,\mathbb{E}_R\left[f(X)\log f(X)\right]\\ 	&\ge Q(A)\, \mathbb{E}_R[f(X)]\log\mathbb{E}_R[f(X)] = \log\frac{1}{Q(A)} \ge 0, \end{align*}

where we have applied Jensen’s inequality (which holds as R is a probability measure) using that x\mapsto x\log x is convex, and used that \mathbb{E}_R[f(X)]=P(A)/Q(A)=1/Q(A) and that Q(A)\le 1. \square
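As a sanity check of the lemma, and of the slightly stronger bound D(P || Q) \ge \log\frac{1}{Q(A)} obtained in the proof, here is a minimal sketch for measures on a finite set. The measures p and q below are hypothetical, with q of total mass 0.5, and natural logarithms are used.

```python
import math

def relative_entropy(p, q):
    """D(P || Q) = sum_x p(x) log(p(x)/q(x)) in nats; infinity unless P << Q."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue                  # convention 0 log 0 = 0
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf           # P is not absolutely continuous w.r.t. Q
        total += px * math.log(px / qx)
    return total

# Hypothetical example: P is a probability measure, Q has total mass 0.5.
p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.2, "b": 0.2, "c": 0.1}

d = relative_entropy(p, q)
lower_bound = math.log(1.0 / sum(q.values()))
print(d, lower_bound)
```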

As a consequence of this result we can now show that the Gaussian distribution maximizes the entropy under constraints on the first two moments.

Lemma. Let \mathcal{P}_{\mu,\sigma^2} be the class of all probability densities on \mathbb{R} (with respect to Lebesgue measure) with mean \mu and variance \sigma^2 and define

    \[g_{\mu,\sigma^2}(x) := \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}.\]

Then h(g_{\mu,\sigma^2}) \ge h(f) for any f\in\mathcal{P}_{\mu,\sigma^2}.

Proof. First of all note that

    \begin{align*} 	\int g_{\mu,\sigma^2}(x) \log g_{\mu,\sigma^2}(x) \, dx  	= \int f(x) \log g_{\mu,\sigma^2}(x) \, dx  \end{align*}

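as \log g_{\mu,\sigma^2} is a quadratic function and, consequently, only the first two moments are involved in computing its expectation. Explicitly, taking \log to be the natural logarithm, for any f\in\mathcal{P}_{\mu,\sigma^2} we have

    \[\int f(x) \log g_{\mu,\sigma^2}(x) \, dx = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\int (x-\mu)^2 f(x) \, dx = -\frac{1}{2}\log(2\pi e\sigma^2),\]

which does not depend on f. Hence, we have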

    \begin{align*} 	h(g_{\mu,\sigma^2}) - h(f)  	&= - \int g_{\mu,\sigma^2}(x) \log g_{\mu,\sigma^2}(x) \, dx + \int f(x) \log f(x) \, dx\\ 	&= \int f(x) \log \frac{f(x)}{g_{\mu,\sigma^2}(x)} \,dx = D(f||g_{\mu,\sigma^2}) \ge 0. \qquad\square \end{align*}
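The lemma can be illustrated with standard closed-form differential entropies (in nats) of three densities sharing the same variance \sigma^2: Gaussian, uniform on an interval, and Laplace. The uniform and Laplace families are examples chosen here for illustration and are not discussed in the lecture; the function names are hypothetical.

```python
import math

def h_gaussian(sigma2):
    """Differential entropy of a Gaussian with variance sigma2."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma2)

def h_uniform(sigma2):
    """Uniform on an interval of width w has variance w^2/12 and entropy log w."""
    return math.log(math.sqrt(12.0 * sigma2))

def h_laplace(sigma2):
    """Laplace with scale b has variance 2 b^2 and entropy 1 + log(2 b)."""
    b = math.sqrt(sigma2 / 2.0)
    return 1.0 + math.log(2.0 * b)

for sigma2 in (0.25, 1.0, 4.0):
    print(sigma2, h_gaussian(sigma2), h_uniform(sigma2), h_laplace(sigma2))
```

By the proof above, the gap h(g_{\mu,\sigma^2}) - h(f) equals D(f||g_{\mu,\sigma^2}); for these two families it does not depend on \sigma^2, since rescaling shifts both entropies by the same amount.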

We are now ready to present the first example of cross-fertilization where information-theoretic concepts can be used to capture fundamental properties in probability theory. Let us first recall the classical central limit theorem (CLT).

Theorem (CLT). If X_1,X_2,\ldots are i.i.d. real-valued random variables with mean 0 and variance 1, then

    \[S_n=\frac{1}{\sqrt{n}} \sum_{i=1}^n X_i \stackrel{\mathcal{D}}{\longrightarrow} N(0,1),\]

that is,

    \[\mathbb{P}\{S_n\in A\} \longrightarrow \frac{1}{\sqrt{2 \pi}} \int_A e^{-\frac{x^2}{2}} \,dx\]

for nice enough sets A.

If we denote by f_{S_n} the density of the normalized partial sum S_n introduced in the statement of the theorem above, we note the following.

  1. For each n\ge 1 we have f_{S_n}\in\mathcal{P}_{0,1}. This follows immediately from basic properties of expected values.
  2. From the previous lemma it follows immediately that

        \[\mathop{\mathrm{argmax}}_{f\in \mathcal{P}_{0,1}} h(f) = g_{0,1}.\]

So, the CLT tells us that the sequence f_{S_1}, f_{S_2}, \ldots\in\mathcal{P}_{0,1} converges to the maximizer g_{0,1} of the entropy in \mathcal{P}_{0,1}. In fact, it turns out that the convergence in the central limit theorem can be studied in terms of the entropy and that the CLT is an expression of increasing entropy, as the following entropic central limit theorem describes.

Theorem (Entropic CLT). Let X_1,X_2,\ldots be i.i.d. real-valued random variables with mean 0 and variance 1, and assume the distribution of X_1 has a density (with respect to Lebesgue measure). Under minimal assumptions (specifically, that h(f_{S_n})>-\infty for some n), we have

    \[h(f_{S_n}) \uparrow h(g_{0,1}),\]

or, equivalently,

    \[D(f_{S_n}||g_{0,1}) \downarrow 0.\]

The entropic central limit theorem is remarkable as usually limit theorems do not come with an associated monotonicity statement. This suggests that the relative entropy is a natural tool to analyze the CLT.
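The monotone convergence can be observed numerically in a simple case. The sketch below takes X_i uniform on [-\sqrt{3},\sqrt{3}] (mean 0, variance 1; a choice made here for illustration), discretizes the density on a grid, obtains the law of X_1+\cdots+X_n by repeated discrete convolution, and uses the scaling rule h(aX) = h(X) + \log a to pass to S_n. The grid size and function names are assumptions, and the result is only a numerical approximation of h(f_{S_n}), not a proof of anything.

```python
import math

def convolve(a, b):
    """Discrete convolution of two lists of cell masses."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def entropies_of_normalized_sums(n_max=4, cells=200):
    """Approximate h(f_{S_n}), n = 1..n_max, for X_i uniform on [-sqrt(3), sqrt(3)]."""
    dx = 2.0 * math.sqrt(3.0) / cells
    step = [1.0 / cells] * cells       # cell masses of a single uniform summand
    masses = step[:]
    entropies = []
    for n in range(1, n_max + 1):
        if n > 1:
            masses = convolve(masses, step)
        # The density of X_1 + ... + X_n on the grid is approximately mass / dx.
        h_sum = -sum(p * math.log(p / dx) for p in masses if p > 0.0)
        # Dividing the sum by sqrt(n) lowers the differential entropy by log sqrt(n).
        entropies.append(h_sum - 0.5 * math.log(n))
    return entropies

h_values = entropies_of_normalized_sums()
h_limit = 0.5 * math.log(2.0 * math.pi * math.e)   # h(g_{0,1}) in nats
print(h_values, h_limit)
```

The printed values increase with n and stay below h(g_{0,1}), as the theorem predicts.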

Of course, a natural question that presents itself is whether other limit theorems in probability can be understood from a similar information-theoretic point of view.

Plan for future lectures

In the next lecture or two we will present a full proof of the entropic central limit theorem, and also discuss briefly how other limit theorems can be analogously understood from this information-theoretic point of view. Later, we will look at finer behavior than limit theorems; for instance, we may look at how information theory can provide insights into large deviations and concentration inequalities.

Lecture by Mokshay Madiman | Scribed by Patrick Rebeschini

25. September 2013 by Ramon van Handel