Problem Set 3

(4.1)

Verify that the entropy function satisfies the required properties of continuity, non-negativity, monotonicity, and independence.

I will prove these properties for the discrete case.

First, I want to show that $p \log p \to 0$ as $p \to 0^+$ for any base of the logarithm, since otherwise entropy is not defined for distributions that assign zero probability to any outcome. This can be done using L’Hôpital’s Rule, and it justifies the convention $0 \log 0 = 0$.
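For completeness, here is the L’Hôpital computation sketched in the natural-log base (other bases only change the result by a constant factor):

$$\lim_{p \to 0^+} p \ln p = \lim_{p \to 0^+} \frac{\ln p}{1/p} = \lim_{p \to 0^+} \frac{1/p}{-1/p^2} = \lim_{p \to 0^+} (-p) = 0.$$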

Continuity

To talk about the continuity of entropy, we need to define a topology for probability distributions. Let $X$ be a set of finite cardinality $n$. The space of probability distributions over $X$ can be viewed as the set of vectors in $[0, 1]^n$ whose components sum to 1. In this way entropy is a function that maps a subset of $\mathbb{R}^n$ to $\mathbb{R}$. So I will prove the continuity of entropy with respect to the standard topologies of $\mathbb{R}^n$ and $\mathbb{R}$.

First let’s show that $p \log p$ is continuous on $[0, 1]$. I take as given that $\log p$ is a continuous function on its domain (after all it’s the inverse of the exponential function, which is strictly monotonic and continuous). Then $p \log p$ is also continuous, since finite products of continuous functions are continuous. This suffices for $p > 0$. At zero, $p \log p$ is continuous because we have defined it to be equal to the limit we found above.

Thus each term of the entropy function is a continuous function from $[0, 1]$ to $\mathbb{R}$. But we can also view each term as a function from $[0, 1]^n$ to $\mathbb{R}$. Each one ignores most of its inputs, but this doesn’t change its continuity. (The epsilon-delta proof follows easily from the triangle inequality, since the only part of the distance between inputs that matters is that along the active coordinate.) So entropy is a sum of continuous functions, and is thus continuous.

Non-negativity

The probability of each individual outcome must be between zero and one. Thus $-p_i \log p_i \geq 0$ for all $i$, and so $H = -\sum_i p_i \log p_i \geq 0$. Since $-p \log p$ is only equal to zero when $p$ is zero or one, the entropy can only be zero when a single outcome has probability one.

Monotonicity

Note that $\frac{d}{dp}(-p \log p) = -\log p - 1$ for any $p > 0$. This is a strictly decreasing function, so each term of the entropy is strictly concave, and entropy is strictly concave on all of $(0, 1]^n$. The constraint that $\sum_i p_i$ is one is linear, so entropy is strictly concave on this subset of $(0, 1]^n$ as well. Thus there is a unique global maximum.

We can locate it using a Lagrange multiplier. Our Lagrange function is

$$\mathcal{L}(p_1, \ldots, p_n, \lambda) = -\sum_{i=1}^{n} p_i \log p_i + \lambda \left( \sum_{i=1}^{n} p_i - 1 \right).$$

The partial derivative with respect to any $p_i$ is $-\log p_i - 1 + \lambda$. Setting this to zero gives $\log p_i = \lambda - 1$. Since this depends only on $\lambda$, it implies that all the $p_i$ must be the same. Taking our constraint into account this means there’s only one possibility: $p_i = 1/n$ for all $i$. This is the maximum entropy distribution that we seek.

Call this distribution $u$. Its entropy is $H(u) = -\sum_{i=1}^{n} \frac{1}{n} \log \frac{1}{n} = \log n$. Thus

$$H(p) \leq \log n$$

for all probability distributions $p$ over $n$ outcomes. Equality is only achieved for $u$ itself, since the maximum is unique. Note that $\log n$ grows without bound as $n$ increases.
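As a numerical sanity check (my own addition, not part of the original argument), the short script below compares the entropy of randomly drawn distributions against $\log n$; the function names are mine.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, using the convention 0 * log(0) = 0."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(0)
n = 5
uniform = np.full(n, 1.0 / n)
print("H(uniform) =", entropy(uniform), " log(n) =", np.log(n))

# Random distributions over n outcomes never exceed log(n).
for _ in range(5):
    p = rng.dirichlet(np.ones(n))
    assert entropy(p) <= np.log(n) + 1e-12
    print("H(random)  =", entropy(p))
```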

Independence

If $X$ and $Y$ are independent, their joint probability distribution is the product of the individual distributions. Thus

$$H(X, Y) = -\sum_{x, y} p(x)\, p(y) \log \left[ p(x)\, p(y) \right] = -\sum_{x} p(x) \log p(x) - \sum_{y} p(y) \log p(y) = H(X) + H(Y).$$

(4.2)

Prove the relationships in Equation (4.10).

I take $I(X; Y) = H(X) + H(Y) - H(X, Y)$ as the definition of mutual information. By the definition of conditional entropy,

$$H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y).$$

Thus

$$I(X; Y) = H(Y) - H(Y \mid X) = H(X) - H(X \mid Y).$$

Finally, using the definition of the marginal distributions we can show that

$$I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}.$$
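As a quick numerical check of these identities (my own sketch, using an arbitrary $2 \times 3$ joint distribution), all four expressions for the mutual information agree:

```python
import numpy as np

# An arbitrary joint distribution p(x, y) over a 2x3 alphabet.
pxy = np.array([[0.10, 0.25, 0.15],
                [0.20, 0.05, 0.25]])
px = pxy.sum(axis=1)   # marginal p(x)
py = pxy.sum(axis=0)   # marginal p(y)

def H(p):
    # Entropy in bits; all probabilities here are strictly positive.
    return -np.sum(p * np.log2(p))

Hxy = H(pxy)
Hy_given_x = Hxy - H(px)   # H(Y|X) = H(X,Y) - H(X)
Hx_given_y = Hxy - H(py)   # H(X|Y) = H(X,Y) - H(Y)

I1 = H(px) + H(py) - Hxy                             # definition above
I2 = H(py) - Hy_given_x                              # H(Y) - H(Y|X)
I3 = H(px) - Hx_given_y                              # H(X) - H(X|Y)
I4 = np.sum(pxy * np.log2(pxy / np.outer(px, py)))   # sum over the joint
print(I1, I2, I3, I4)
```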

(4.3)

Consider a binary channel that has a small probability of making a bit error.

For reasons that will become clear I will call the error probability $p_0$.

(a)

What is the probability of an error if a bit is sent independently three times and the value determined by majority voting?

Majority voting can recover the message if at most a single instance of the bit is flipped. So the probability of an error is the probability of having two or three bits flipped. This can be expressed using the binomial distribution. Let’s call it $p_1$:

$$p_1 = \binom{3}{2} p_0^2 (1 - p_0) + \binom{3}{3} p_0^3 = 3 p_0^2 - 2 p_0^3.$$

(b)

How about if that is done three times, and majority voting is done on the majority voting?

The answer is the same as above, just using $p_1$ instead of $p_0$: the probability of an error is $p_2 = 3 p_1^2 - 2 p_1^3$.

(c)

If majority voting on majority voting on … on majority voting is done N times, how many bits are needed, and what is the probability of an error? How does this probability depend on $N$?

Each round triples the total number of bits sent. So $n$ rounds of voting requires $3^n$ bits.

The probability of an error can be expressed as a recurrence relation. Define

$$f(p) = 3 p^2 - 2 p^3, \qquad p_{n+1} = f(p_n).$$

Then with a base error rate (i.e. per individual bit) of $p_0$, the probability of an error after $n$ rounds of voting is $p_n$. If there is a closed form solution to this relation, I don’t know how to find it. But it’s still possible to say a lot about its behavior.
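In code, the recurrence is just a loop (a minimal sketch with my own function names):

```python
def f(p):
    """Error probability after one round of 3-way majority voting."""
    return 3 * p**2 - 2 * p**3

def error_after_rounds(p0, n):
    """Error probability p_n after n nested rounds of voting."""
    p = p0
    for _ in range(n):
        p = f(p)
    return p

print(error_after_rounds(0.1, 3))   # error rate after 3 rounds (27 bits)
```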

Convergence to a step function

As $n$ approaches infinity, $p_n$ converges pointwise on $[0, 1]$ to

$$p_\infty(p_0) = \begin{cases} 0 & p_0 < \tfrac{1}{2} \\ \tfrac{1}{2} & p_0 = \tfrac{1}{2} \\ 1 & p_0 > \tfrac{1}{2}. \end{cases}$$

So as long as the error rate isn’t exactly $\tfrac{1}{2}$ (the one case where zero information gets through), with enough rounds of majority voting the error rate can be made arbitrarily small (for $p_0 > \tfrac{1}{2}$ it instead converges to one, which is just as good once the output is inverted).

Let’s prove this. Since $f$ is a polynomial, it’s continuous. By inspection there are three fixed points: zero, one half, and one. This suffices to show that for any $p_0$, if $p_n$ converges it must converge to zero, one half, or one (I’ll prove this below).

Now fix any $x$ such that $0 < x < \tfrac{1}{2}$. Because $f$ is a cubic polynomial, it can’t cross the line $f(p) = p$ more than three times. We’ve already noted that it does cross this line exactly three times, namely at its fixed points. So the fact that

$$f\!\left(\tfrac{1}{4}\right) = \tfrac{3}{16} - \tfrac{2}{64} = \tfrac{5}{32} < \tfrac{1}{4}$$

is sufficient to prove that $f(x) < x$ on all of $(0, \tfrac{1}{2})$. Furthermore, $f(x) = x^2 (3 - 2x)$ is greater than zero on this interval, so $0 < f(x) < x$. Thus $f(x)$ is also in $(0, \tfrac{1}{2})$. By induction this shows that

$$0 < p_{n+1} < p_n < \tfrac{1}{2} \quad \text{for all } n, \text{ whenever } 0 < p_0 < \tfrac{1}{2}.$$

Thus $\{p_n\}$ is a bounded monotonic sequence, and must converge. Since every term is below $\tfrac{1}{2}$, the only fixed point it can converge to is zero.

All that remains is to show that all points in $(\tfrac{1}{2}, 1)$ converge to one. Note that

$$f(1 - x) = 3(1 - x)^2 - 2(1 - x)^3 = 1 - 3x^2 + 2x^3 = 1 - f(x).$$

This symmetry establishes the claim: if $p_0 \in (\tfrac{1}{2}, 1)$ then $1 - p_0 \in (0, \tfrac{1}{2})$, so $1 - p_n \to 0$ and hence $p_n \to 1$.

For completeness let’s prove the claim that a sequence generated by recursive application of a continuous function $f$ can only converge to a fixed point of $f$. I’ll do this in two steps.

First, we must establish that if a sequence $\{x_n\}$ in the domain of $f$ converges to a point $x$ (also in the domain of $f$), then the sequence $\{f(x_n)\}$ converges to $f(x)$. Fix any $\epsilon > 0$. Since $f$ is continuous, there is some $\delta > 0$ such that $|y - x| < \delta$ implies $|f(y) - f(x)| < \epsilon$. Since $\{x_n\}$ converges to $x$, there is some $N$ such that $n > N$ implies $|x_n - x| < \delta$. Thus $n > N$ implies $|f(x_n) - f(x)| < \epsilon$, which establishes the claim.

Second, we can show that any sequence generated by successive application of $f$ that converges must converge to a fixed point of $f$. Say $\{p_n\}$ converges to $p$. Since $p_{n+1} = f(p_n)$ for all $n$, this means that $\{f(p_n)\}$ converges to $p$. But by the result just proven, $\{f(p_n)\}$ converges to $f(p)$. Since limits are unique this means $f(p) = p$.

Behavior of leading term

For small $p$, the cubic term in $f$ is approximately zero. So let’s consider the function $g(p) = 3p^2$, with $q_{n+1} = g(q_n)$ and $q_0 = p_0$ defined recursively in a similar fashion as $p_n$.

For any $x \geq 0$, it’s clear that $f(x) \leq g(x)$. And for any $x$ and $y$ such that $0 \leq x \leq y$,

$$g(x) = 3x^2 \leq 3y^2 = g(y),$$

so $f(x) \leq g(x) \leq g(y)$. Thus by induction $p_n \leq q_n$ for any $p_0 = q_0$ in $[0, 1]$ and for all $n$. So $q_n$ is an upper bound for the probability of an error after $n$ rounds of majority voting with a base error rate of $p_0$.

The interesting thing about $q_n$ is that it has a closed form solution that’s easy to find:

$$q_n = 3^{2^n - 1} p_0^{2^n} = \frac{(3 p_0)^{2^n}}{3}.$$

So while each round of voting increases the number of bits by a factor of three, the exponent on top of the base error rate grows by a factor of two. This is an astonishing win for majority voting.

Just keep in mind that $q_n$ is only a reasonable upper bound for very small $p_0$. Indeed for $\tfrac{1}{3} < p_0 < \tfrac{1}{2}$, $q_n$ goes to infinity while $p_n$ goes to zero.
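A short check of the closed form and of the bound $p_n \leq q_n$ for a small base rate (my own sketch):

```python
def f(p):
    """Exact error probability after one round of majority voting."""
    return 3 * p**2 - 2 * p**3

def q(p0, n):
    """Closed-form upper bound q_n = (3 * p0)**(2**n) / 3."""
    return (3 * p0) ** (2 ** n) / 3

p0 = 0.01
p = p0
for n in range(1, 6):
    p = f(p)
    assert p <= q(p0, n)   # the bound holds for this small base rate
    print(n, p, q(p0, n))
```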

Examples

Here is a table showing error rates after various numbers of voting rounds for a variety of base rates $p_0$. These are calculated exactly from the recurrence for $p_n$, not from the approximate upper bound $q_n$.

| voting rounds | 0 | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- | --- |
| bits | 1 | 3 | 9 | 27 | 81 | 243 |
| $p_0 = 0.25$ | 0.25 | 0.156 | 0.0656 | 0.0124 | $4.5 \times 10^{-4}$ | $6.2 \times 10^{-7}$ |
| $p_0 = 0.1$ | 0.1 | 0.028 | $2.3 \times 10^{-3}$ | $1.6 \times 10^{-5}$ | $7.6 \times 10^{-10}$ | $1.8 \times 10^{-18}$ |
| $p_0 = 0.01$ | 0.01 | $3.0 \times 10^{-4}$ | $2.7 \times 10^{-7}$ | $2.1 \times 10^{-13}$ | $1.4 \times 10^{-25}$ | $5.5 \times 10^{-50}$ |
| $p_0 = 0.001$ | 0.001 | $3.0 \times 10^{-6}$ | $2.7 \times 10^{-11}$ | $2.2 \times 10^{-21}$ | $1.4 \times 10^{-41}$ | $6.1 \times 10^{-82}$ |
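The table can be regenerated from the recurrence with a few lines of code (my own sketch):

```python
def f(p):
    return 3 * p**2 - 2 * p**3

for p0 in (0.25, 0.1, 0.01, 0.001):
    row, p = [p0], p0
    for _ in range(5):          # five rounds of voting
        p = f(p)
        row.append(p)
    print(p0, ["%.2g" % v for v in row])
```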

(4.4)

Calculate the differential entropy of a Gaussian process.

Since we’re integrating the Gaussian over the whole real line, translation is irrelevant. So without loss of generality I’ll calculate the differential entropy of a Gaussian with zero mean:

$$H(x) = -\int_{-\infty}^{\infty} p(x) \ln p(x)\, dx, \qquad p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-x^2 / 2\sigma^2}.$$

Since $\ln p(x) = -\tfrac{1}{2}\ln(2\pi\sigma^2) - x^2/2\sigma^2$,

$$H(x) = \frac{1}{2}\ln(2\pi\sigma^2) \int_{-\infty}^{\infty} p(x)\, dx + \frac{1}{2\sigma^2}\int_{-\infty}^{\infty} x^2\, p(x)\, dx.$$

The second term is integrated by parts. Note that

$$\int_{-\infty}^{\infty} x^2 e^{-x^2/2\sigma^2}\, dx = \left[ -\sigma^2 x\, e^{-x^2/2\sigma^2} \right]_{-\infty}^{\infty} + \sigma^2 \int_{-\infty}^{\infty} e^{-x^2/2\sigma^2}\, dx = \sigma^2 \sqrt{2\pi\sigma^2},$$

so $\int x^2\, p(x)\, dx = \sigma^2$, and therefore

$$H(x) = \frac{1}{2}\ln(2\pi\sigma^2) + \frac{1}{2} = \frac{1}{2}\ln\!\left(2\pi e \sigma^2\right).$$

(Of course one could also just note that this integral is the distribution’s variance, but now we know how to perform that calculation ourselves. Going forward I’ll skip this derivation.)
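As a numerical cross-check of the result $\tfrac{1}{2}\ln(2\pi e \sigma^2)$ (my own addition, with an arbitrary choice of $\sigma$):

```python
import numpy as np

sigma = 1.7
x = np.linspace(-12 * sigma, 12 * sigma, 200001)
p = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

dx = x[1] - x[0]
h_numeric = -np.sum(p * np.log(p)) * dx          # Riemann sum of -p ln p
h_closed = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(h_numeric, h_closed)                       # agree closely
```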

(4.5)

A standard telephone line is specified to have a bandwidth of 3300 Hz and an SNR of 20 dB.

(a)

What is the capacity?

A signal to noise ratio of 20 dB means that $10 \log_{10}(S/N) = 20$. So $S/N = 100$. Presumably the noise power in this figure is calculated over the relevant bandwidth, so we may use this ratio directly in Shannon’s formula. Then the channel capacity per second is given by

$$C = B \log_2\!\left(1 + \frac{S}{N}\right) = 3300 \log_2(101) \approx 2.2 \times 10^4 \text{ bits/s}.$$
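The arithmetic, as a quick script (my own):

```python
import math

B = 3300.0                    # bandwidth, Hz
snr = 10 ** (20 / 10)         # 20 dB -> S/N = 100
C = B * math.log2(1 + snr)    # Shannon capacity, bits/s
print(C)                      # about 2.2e4 bits/s
```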

(b)

What SNR would be necessary for the capacity to be 1 Gbit/s?

If the SNR in decibels is $R$, then $S/N = 10^{R/10}$. So we need to solve

$$10^9 = 3300 \log_2\!\left(1 + 10^{R/10}\right)$$

for $R$. This gives us

$$R = 10 \log_{10}\!\left(2^{10^9 / 3300} - 1\right) \approx 9.1 \times 10^5 \text{ dB}.$$

This is unrealistically high. A more reasonable option would be to increase the bandwidth instead, to roughly $1.5 \times 10^8$ Hz at the original 20 dB SNR.
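And the inverse calculation for part (b), again as my own sketch; since the required $S/N$ overflows a floating-point number, the decibel value is computed directly:

```python
import math

B = 3300.0
C_target = 1e9                                   # 1 Gbit/s
# From C = B log2(1 + S/N): S/N = 2**(C/B) - 1. The "-1" is utterly
# negligible here, so work in decibels directly.
snr_db = 10 * (C_target / B) * math.log10(2)
print(snr_db)                                    # roughly 9.1e5 dB

# Alternatively, keep the 20 dB SNR and widen the band instead:
B_needed = C_target / math.log2(1 + 100)
print(B_needed)                                  # about 1.5e8 Hz
```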

(4.6)

Let $x_1, \ldots, x_N$ be drawn from a Gaussian distribution with variance $\sigma^2$ and unknown mean value $\mu$. Show that the sample mean $\hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} x_n$ is an estimator for $\mu$ that is unbiased and achieves the Cramér–Rao lower bound.

The probability of seeing a particular sequence of $N$ independent samples is

$$p(x_1, \ldots, x_N \mid \mu) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x_n - \mu)^2 / 2\sigma^2}.$$

So the expected value of this estimator is

$$\langle \hat{\mu} \rangle = \frac{1}{N} \sum_{n=1}^{N} \prod_{m=1}^{N} \int x_m^{\delta_{nm}}\, p(x_m \mid \mu)\, dx_m = \frac{1}{N} \sum_{n=1}^{N} \mu = \mu.$$

The Kronecker delta is used to indicate that, for each $n$, only one term in the product contains a factor of $x_m$ in its integrand. This term integrates to $\mu$ (since it is the expected value of a single Gaussian), while the others integrate to one.

Now let’s show that this estimator achieves the Cramér–Rao bound. That is, that the variance of this estimator is equal to one over the Fisher information. To do this it’s helpful to know that in this case

$$\frac{\partial}{\partial \mu} \ln p(x \mid \mu) = \frac{x - \mu}{\sigma^2}.$$

Then the Fisher information for a single sample is

$$J(\mu) = \int_{-\infty}^{\infty} p(x \mid \mu) \left( \frac{x - \mu}{\sigma^2} \right)^2 dx = \frac{1}{\sigma^4} \int_{-\infty}^{\infty} (x - \mu)^2\, p(x \mid \mu)\, dx.$$

The integral evaluates to $\sigma^2$ since it’s the distribution’s variance, so $J(\mu) = 1/\sigma^2$ per sample. We have $N$ independent samples, so in this case the Cramér–Rao bound states that the variance of our estimator can be no smaller than $\sigma^2 / N$.

We know that the expected value of our estimator is $\mu$, so to find its variance let’s compute the expected value of our estimator squared:

$$\langle \hat{\mu}^2 \rangle = \frac{1}{N^2} \sum_{n, m} \langle x_n x_m \rangle = \frac{1}{N^2} \left[ N \left( \sigma^2 + \mu^2 \right) + N (N - 1)\, \mu^2 \right] = \frac{\sigma^2}{N} + \mu^2.$$

Thus the variance of our estimator is

$$\langle \hat{\mu}^2 \rangle - \langle \hat{\mu} \rangle^2 = \frac{\sigma^2}{N}.$$

This is equal to the inverse of the Fisher information for $N$ samples, $N/\sigma^2$, so this estimator achieves the Cramér–Rao bound.
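As an empirical sanity check (my own addition, not part of the solution), a quick Monte Carlo run confirms that the sample mean is unbiased and that its variance sits at the Cramér–Rao bound $\sigma^2/N$:

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma, N, trials = 2.0, 3.0, 25, 200_000

# Each row is one experiment of N samples; estimate mu by the sample mean.
samples = rng.normal(mu, sigma, size=(trials, N))
estimates = samples.mean(axis=1)

print(estimates.mean())     # close to mu (unbiased)
print(estimates.var())      # close to sigma**2 / N
print(sigma**2 / N)         # the Cramér-Rao bound: 0.36
```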