Verify that the entropy function satisfies the required properties of continuity, non-negativity, monotonicity, and independence.
I will prove these properties for the discrete case.
First, I want to show that $\lim_{p \to 0^+} p \log p = 0$ (for any base of the logarithm), so that we can take $0 \log 0 = 0$ by convention; otherwise entropy is not defined for distributions that assign zero probability to any outcome. This can be done using L'Hôpital's Rule:
$$\lim_{p \to 0^+} p \log p = \lim_{p \to 0^+} \frac{\log p}{1/p} = \lim_{p \to 0^+} \frac{1/p}{-1/p^2} = \lim_{p \to 0^+} (-p) = 0.$$
To talk about the continuity of entropy, we need to define a topology for probability distributions. Let $X$ be a set of finite cardinality $n$. The space of probability distributions over $X$ can be viewed as the set of vectors in $[0, 1]^n$ with $\ell_1$ norm 1. In this way entropy is a function that maps a subset of $\mathbb{R}^n$ to $\mathbb{R}$. So I will prove the continuity of entropy with respect to the standard topologies of $\mathbb{R}^n$ and $\mathbb{R}$.
First let's show that $p \mapsto p \log p$ is continuous on $[0, 1]$. I take as given that $\log$ is a continuous function on its domain (after all it's the inverse of the exponential function, which is strictly monotonic and continuous). Then $p \log p$ is also continuous on $(0, 1]$, since finite products of continuous functions are continuous. This suffices for $p > 0$. At zero, $p \log p$ is continuous because we have defined it to be equal to the limit we found above.
Thus each term $-p_i \log p_i$ of the entropy function is a continuous function from $[0, 1]$ to $\mathbb{R}$. But we can also view each term as a function from $[0, 1]^n$ to $\mathbb{R}$. Each one ignores most of its inputs, but this doesn't change its continuity. (The epsilon-delta proof follows easily from the triangle inequality, since the only part of the distance between inputs that matters is the part along the active coordinate.) So entropy is a sum of continuous functions, and is thus continuous.
The probability of each individual outcome must be between zero and one. Thus $-p_i \log p_i \geq 0$ for all $i$, so $H \geq 0$. Since $-p \log p$ is only equal to zero when $p$ is zero or one, the entropy can only be zero when a single outcome has probability one.
Note that $\frac{\partial}{\partial p_i}(-p_i \log p_i) = -\log p_i - 1$ for any $i$. This is a strictly decreasing function of $p_i$, so entropy is strictly concave on all of $[0, 1]^n$. The constraint that $\sum_i p_i$ is one is linear, so entropy is strictly concave on this subset of $[0, 1]^n$ as well. Thus there is a unique global maximum.
We can locate it using a Lagrange multiplier. Our Lagrange function is
$$\mathcal{L} = -\sum_{i=1}^{n} p_i \log p_i + \lambda \left( \sum_{i=1}^{n} p_i - 1 \right).$$
The partial derivative with respect to any $p_i$ is $-\log p_i - 1 + \lambda$, and setting it to zero gives $p_i = e^{\lambda - 1}$. Since this depends only on $\lambda$, it implies that all the $p_i$ must be the same. Taking our constraint into account this means there's only one possibility: $p_i = 1/n$ for all $i$. This is the maximum entropy distribution that we seek.
Call this distribution $u$. Its entropy is $H(u) = -\sum_{i=1}^{n} \frac{1}{n} \log \frac{1}{n} = \log n$. Thus
$$H(p) \leq \log n$$
for all probability distributions $p$ over $n$ outcomes. Equality is only achieved for $u$ itself, since the maximum is unique. Note that $\log n$ grows without bound as $n$ increases.
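As a quick numerical sanity check of the bound $H(p) \leq \log n$ (an illustration I'm adding, not part of the original argument; the `entropy` helper and the example distributions are my own), here's a short Python sketch:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, using the convention 0 * log 0 = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

n = 4
distributions = {
    "uniform": np.full(n, 1.0 / n),
    "skewed":  np.array([0.7, 0.2, 0.05, 0.05]),
    "certain": np.array([1.0, 0.0, 0.0, 0.0]),
}

for name, p in distributions.items():
    print(f"{name:8s} H = {entropy(p):.4f} bits   (log2 n = {np.log2(n):.4f})")
```

Only the uniform distribution reaches $\log_2 n = 2$ bits; the others fall strictly below it, and the certain distribution has zero entropy, matching the non-negativity argument above.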
If $x$ and $y$ are independent, their joint probability distribution is the product of the individual distributions: $p(x, y) = p(x)\,p(y)$. Thus
$$H(x, y) = -\sum_{x, y} p(x)\,p(y) \left[ \log p(x) + \log p(y) \right] = -\sum_x p(x) \log p(x) - \sum_y p(y) \log p(y) = H(x) + H(y).$$
Prove the relationships in Equation (4.10).
I take
$$I(x, y) = H(x) + H(y) - H(x, y)$$
as the definition of mutual information. By the definition of conditional entropy,
$$H(y \mid x) = H(x, y) - H(x) \qquad \text{and} \qquad H(x \mid y) = H(x, y) - H(y).$$
Thus
$$I(x, y) = H(y) - H(y \mid x) = H(x) - H(x \mid y).$$
Finally, using the definition of marginal distributions, $p(x) = \sum_y p(x, y)$ and $p(y) = \sum_x p(x, y)$, we can show that
$$I(x, y) = \sum_{x, y} p(x, y) \log p(x, y) - \sum_{x, y} p(x, y) \log p(x) - \sum_{x, y} p(x, y) \log p(y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}.$$
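These identities are easy to verify numerically; the sketch below (my own, with an arbitrary example joint distribution) computes three forms of $I(x, y)$ and confirms they agree:

```python
import numpy as np

def H(p):
    """Entropy in bits of a probability array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# An arbitrary joint distribution p(x, y) over 2 x 3 outcomes (sums to 1).
pxy = np.array([[0.10, 0.25, 0.15],
                [0.20, 0.05, 0.25]])
px = pxy.sum(axis=1)   # marginal p(x)
py = pxy.sum(axis=0)   # marginal p(y)

I_entropies = H(px) + H(py) - H(pxy)                     # H(x) + H(y) - H(x,y)
I_conditional = H(py) - (H(pxy) - H(px))                 # H(y) - H(y|x)
I_ratio = np.sum(pxy * np.log2(pxy / np.outer(px, py)))  # sum p log p(x,y)/(p(x)p(y))

print(I_entropies, I_conditional, I_ratio)  # all three print the same value
```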
Consider a binary channel that has a small probability of making a bit error.
For reasons that will become clear, I will call the error probability $x$.
What is the probability of an error if a bit is sent independently three times and the value determined by majority voting?
Majority voting can recover the message if at most a single instance of the bit is flipped. So the probability of an error is the probability of having two or three bits flipped. This can be expressed using the binomial distribution. Let's call it $f$:
$$f(x) = \binom{3}{2} x^2 (1 - x) + \binom{3}{3} x^3 = 3x^2(1 - x) + x^3 = 3x^2 - 2x^3.$$
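As a sanity check on this formula, here's a quick Monte Carlo simulation of the triple-send-and-vote scheme (an illustration of mine; the flip probability $x = 0.1$ is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = 0.1            # per-bit error probability
trials = 200_000

# For each trial, whether each of the three transmitted copies was flipped.
flips = rng.random((trials, 3)) < x
vote_wrong = flips.sum(axis=1) >= 2   # majority vote fails if 2 or 3 copies flipped

print("simulated error rate:", vote_wrong.mean())
print("formula 3x^2 - 2x^3 :", 3 * x**2 - 2 * x**3)
```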
How about if that is done three times, and majority voting is done on the majority voting?
The answer is the same as above, just using $f(x)$ in place of $x$: the probability of an error is $f(f(x))$.
If majority voting on majority voting on … on majority voting is done $N$ times, how many bits are needed, and what is the probability of an error? How does this probability depend on $N$?
Each round triples the total number of bits sent. So $N$ rounds of voting requires $3^N$ bits.
The probability of an error can be expressed as a recurrence relation. Define
$$f_1(x) = f(x), \qquad f_{n+1}(x) = f(f_n(x)).$$
Then with a base error rate (i.e. per individual bit) of $x$, the probability of an error after $N$ rounds of voting is $f_N(x)$. If there is a closed form solution to this recurrence, I don't know how to find it. But it's still possible to say a lot about its behavior.
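Even without a closed form, the recurrence is trivial to iterate numerically. The sketch below (function names are mine) shows the error rate collapsing toward zero for base rates below one half and rising toward one above it, which previews the convergence argument that follows:

```python
def f(x):
    """Error probability after one round of triple-redundancy majority voting."""
    return 3 * x**2 - 2 * x**3

def f_n(x, n):
    """Error probability after n nested rounds of majority voting."""
    for _ in range(n):
        x = f(x)
    return x

for base_rate in (0.1, 0.4, 0.5, 0.6):
    print(base_rate, [round(f_n(base_rate, n), 6) for n in range(6)])
```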
As $n$ approaches infinity, $f_n$ converges pointwise on $[0, 1]$ to
$$f_\infty(x) = \begin{cases} 0 & 0 \leq x < 1/2 \\ 1/2 & x = 1/2 \\ 1 & 1/2 < x \leq 1. \end{cases}$$
So as long as the error rate isn't exactly $1/2$ (i.e. zero information gets through), with enough rounds of majority voting the error rate can be made arbitrarily small (or, for base rates above $1/2$, arbitrarily close to one, which is just as good once the decoded bit is flipped).
Let's prove this. Since $f$ is a polynomial, it's continuous. By inspection there are three fixed points: zero, one half, and one. This suffices to show that for any $x$, if $\{f_n(x)\}$ converges it must converge to zero, one half, or one (I'll prove this below).
Now fix any $x$ such that $0 < x < 1/2$. Because $f$ is a cubic polynomial, it can't cross the line $y = x$ more than three times. We've already noted that it does cross this line exactly three times, namely at its fixed points. So the fact that
$$f(1/4) = 3 \left( \tfrac{1}{4} \right)^2 - 2 \left( \tfrac{1}{4} \right)^3 = \tfrac{5}{32} < \tfrac{1}{4}$$
is sufficient to prove that $f(x) < x$ on all of $(0, 1/2)$. Furthermore, $f(x) = x^2 (3 - 2x)$ is greater than zero on this interval, so $0 < f(x) < x < 1/2$. Thus $f(x)$ lands back in $(0, 1/2)$. By induction this shows that
$$0 < f_{n+1}(x) < f_n(x) < x \quad \text{for all } n.$$
Thus $\{f_n(x)\}$ is a bounded monotonic sequence, and must converge. Since every term is less than $1/2$, the only fixed point it can converge to is zero.
All that remains is to show that all points in $(1/2, 1]$ converge to one. Note that
$$f(1 - x) = 3(1 - x)^2 - 2(1 - x)^3 = 1 - 3x^2 + 2x^3 = 1 - f(x).$$
This symmetry establishes the claim: if $x \in (1/2, 1]$ then $1 - x \in [0, 1/2)$, so $f_n(1 - x) \to 0$ and hence $f_n(x) = 1 - f_n(1 - x) \to 1$.
For completeness let's prove the claim that a sequence generated by recursive application of a continuous function $f$ can only converge to a fixed point of $f$. I'll do this in two steps.
First, we must establish that if a sequence $\{x_n\}$ in the domain of $f$ converges to a point $x^*$ (also in the domain of $f$), then the sequence $\{f(x_n)\}$ converges to $f(x^*)$. Fix any $\epsilon > 0$. Since $f$ is continuous, there is some $\delta > 0$ such that $|x - x^*| < \delta$ implies $|f(x) - f(x^*)| < \epsilon$. Since $\{x_n\}$ converges to $x^*$, there is some $M$ such that $n > M$ implies $|x_n - x^*| < \delta$. Thus $n > M$ implies $|f(x_n) - f(x^*)| < \epsilon$, which establishes the claim.
Second, we can show that any sequence generated by successive application of $f$ that converges must converge to a fixed point of $f$. Say $\{f_n(x)\}$ converges to $x^*$. Since $f_{n+1}(x) = f(f_n(x))$ for all $n$, this means that $\{f(f_n(x))\}$ converges to $x^*$. But by the result just proven, $\{f(f_n(x))\}$ converges to $f(x^*)$. Since limits are unique this means $f(x^*) = x^*$.
For small $x$, the cubic term in $f$ is negligible. So let's consider the function $g(x) = 3x^2$, with $g_n$ defined recursively in the same fashion as $f_n$.
For any $x \in [0, 1]$, it's clear that $f(x) \leq g(x)$. And for any $x$ and $y$ such that $0 \leq x \leq y$,
$$g(x) = 3x^2 \leq 3y^2 = g(y),$$
so $g$ is non-decreasing on the non-negative reals. Thus by induction $f_n(x) \leq g_n(x)$ for any $x$ in $[0, 1]$ and for all $n$. So $g_n(x)$ is an upper bound for the probability of an error after $n$ rounds of majority voting with a base error rate of $x$.
The interesting thing about $g$ is that its iterates have a closed form solution that's easy to find:
$$g_n(x) = 3^{2^n - 1} x^{2^n} = \tfrac{1}{3} (3x)^{2^n}.$$
So while each round of voting increases the number of bits by a factor of three, the exponent on the base error rate grows by a factor of two. This is an astonishing win for majority voting.
Just keep in mind that $g_n$ is only a reasonable upper bound for very small $x$. Indeed for $1/3 < x < 1/2$, $g_n(x)$ goes to infinity while $f_n(x)$ goes to zero.
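Here's a brief check of the closed form and the bound (my own sketch, for a single small base rate):

```python
import math

def f(x):
    return 3 * x**2 - 2 * x**3   # exact per-round error recurrence

def g(x):
    return 3 * x**2              # small-x upper bound

def iterate(func, x, n):
    for _ in range(n):
        x = func(x)
    return x

x = 0.01
for n in range(1, 6):
    g_closed = 3 ** (2**n - 1) * x ** (2**n)           # g_n(x) = 3^(2^n - 1) x^(2^n)
    print(n,
          iterate(f, x, n) <= iterate(g, x, n),        # the bound f_n <= g_n holds
          math.isclose(iterate(g, x, n), g_closed))    # closed form matches iteration
```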
Here is a table showing error rates after various numbers of voting rounds for a variety of base rates. These are calculated exactly from $f_n$, not from the approximate upper bound $g_n$.
| voting rounds | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| bits | 1 | 3 | 9 | 27 | 81 | 243 |
| 0.25 | 0.25 | 0.16 | 0.066 | 0.012 | 4.5e-4 | 6.2e-7 |
| 0.1 | 0.1 | 0.028 | 0.0023 | 1.6e-5 | 7.6e-10 | 1.8e-18 |
| 0.01 | 0.01 | 3.0e-4 | 2.7e-7 | 2.1e-13 | 1.4e-25 | 5.5e-50 |
| 0.001 | 0.001 | 3.0e-6 | 2.7e-11 | 2.2e-21 | 1.4e-41 | 6.1e-82 |
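The table can be reproduced with a few lines of Python (a sketch of mine; values are rounded to two significant figures):

```python
def f(x):
    return 3 * x**2 - 2 * x**3

print("bits:", [3**n for n in range(6)])
for base_rate in (0.25, 0.1, 0.01, 0.001):
    row, p = [], base_rate
    for _ in range(6):
        row.append(f"{p:.2g}")   # error rate after 0, 1, ..., 5 rounds
        p = f(p)
    print(base_rate, row)
```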
Calculate the differential entropy of a Gaussian process.
Since we're integrating the Gaussian over the whole real line, translation is irrelevant. So without loss of generality I'll calculate the differential entropy of a Gaussian with zero mean,
$$p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-x^2 / 2 \sigma^2}.$$
Then
$$h(p) = -\int_{-\infty}^{\infty} p(x) \log p(x) \, dx = \frac{1}{2} \log(2 \pi \sigma^2) \int_{-\infty}^{\infty} p(x) \, dx + \frac{\log e}{2 \sigma^2} \int_{-\infty}^{\infty} x^2 p(x) \, dx.$$
The first term is $\frac{1}{2} \log(2 \pi \sigma^2)$, since $p$ integrates to one. The second term is integrated by parts. Note that
$$\int_{-\infty}^{\infty} x^2 e^{-x^2 / 2 \sigma^2} \, dx = \left[ -\sigma^2 x \, e^{-x^2 / 2 \sigma^2} \right]_{-\infty}^{\infty} + \sigma^2 \int_{-\infty}^{\infty} e^{-x^2 / 2 \sigma^2} \, dx = \sigma^2 \sqrt{2 \pi \sigma^2},$$
so the second term equals $\frac{1}{2} \log e$. Altogether,
$$h(p) = \frac{1}{2} \log(2 \pi e \sigma^2).$$
(Of course one could also just note that the integral in the second term is the distribution's variance, but now we know how to perform that calculation ourselves. Going forward I'll skip this derivation.)
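A quick numerical cross-check of this result (my addition; $\sigma = 1.7$ is arbitrary): evaluate $-\int p \ln p \, dx$ on a fine grid and compare it to $\tfrac{1}{2} \ln(2 \pi e \sigma^2)$.

```python
import numpy as np

sigma = 1.7
x = np.linspace(-12 * sigma, 12 * sigma, 200_001)
dx = x[1] - x[0]
p = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

h_numeric = np.sum(-p * np.log(p)) * dx          # differential entropy in nats
h_closed = 0.5 * np.log(2 * np.pi * np.e * sigma**2)

print(h_numeric, h_closed)   # the two agree closely
```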
A standard telephone line is specified to have a bandwidth of 3300 Hz and an SNR of 20 dB.
What is the capacity?
A signal to noise ratio of 20 dB means that $10 \log_{10}(S/N) = 20$, so $S/N = 100$. Presumably the noise power in this figure is calculated over the relevant bandwidth, so we may use $S/N = 100$ directly in the Shannon capacity formula. Then the signal capacity per second is given by
$$C = W \log_2 \left( 1 + \frac{S}{N} \right) = 3300 \times \log_2(101) \approx 2.2 \times 10^4 \ \text{bits/s}.$$
What SNR would be necessary for the capacity to be 1 Gbit/s?
If the SNR in decibels is $d$, then $S/N = 10^{d/10}$. So we need to solve
$$10^9 = 3300 \log_2 \left( 1 + 10^{d/10} \right)$$
for $d$. This gives us
$$d = 10 \log_{10} \left( 2^{10^9 / 3300} - 1 \right) \approx 9.1 \times 10^5 \ \text{dB}.$$
This is unrealistically high. A more reasonable option would be to increase the bandwidth: keeping 20 dB of SNR, a capacity of 1 Gbit/s requires a bandwidth of about $10^9 / \log_2(101) \approx 1.5 \times 10^8$ Hz.
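Here's a short script (my own) that reproduces these numbers:

```python
import math

W = 3300.0                        # bandwidth in Hz
snr = 10 ** (20 / 10)             # 20 dB -> S/N = 100

capacity = W * math.log2(1 + snr)
print(f"capacity at 20 dB over 3300 Hz: {capacity:.3g} bit/s")

# SNR needed for 1 Gbit/s over the same band. 2^(C/W) - 1 overflows a float,
# so compute the decibel value directly: 10*log10(2^(C/W) - 1) ~ 10*(C/W)*log10(2).
C = 1e9
snr_needed_db = 10 * (C / W) * math.log10(2)
print(f"SNR needed for 1 Gbit/s: {snr_needed_db:.3g} dB")

# Alternatively keep 20 dB and widen the band.
W_needed = C / math.log2(1 + snr)
print(f"bandwidth needed at 20 dB: {W_needed:.3g} Hz")
```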
Let $x_1, \ldots, x_N$ be drawn from a Gaussian distribution with variance $\sigma^2$ and unknown mean value $\mu$. Show that $\hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} x_n$ is an estimator for $\mu$ that is unbiased and achieves the Cramér–Rao lower bound.
The probability of seeing a particular sequence of $N$ independent samples is
$$p(x_1, \ldots, x_N \mid \mu) = \prod_{n=1}^{N} \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-(x_n - \mu)^2 / 2 \sigma^2}.$$
So the expected value of this estimator is
$$\langle \hat{\mu} \rangle = \int \left( \frac{1}{N} \sum_{n=1}^{N} x_n \right) p(x_1, \ldots, x_N \mid \mu) \, dx_1 \cdots dx_N = \frac{1}{N} \sum_{n=1}^{N} \prod_{m=1}^{N} \int x_m^{\delta_{nm}} \, \frac{e^{-(x_m - \mu)^2 / 2 \sigma^2}}{\sqrt{2 \pi \sigma^2}} \, dx_m = \frac{1}{N} \sum_{n=1}^{N} \mu = \mu.$$
The Kronecker delta is used to indicate that only one term in each product contains $x_m$ to the first power. This term integrates to $\mu$ (since it is the expected value of a single Gaussian), while the others integrate to one. So the estimator is unbiased.
Now let's show that this estimator achieves the Cramér–Rao bound. That is, that the variance of this estimator is equal to one over the Fisher information. To do this it's helpful to know that in this case
$$\frac{\partial}{\partial \mu} \log p(x \mid \mu) = \frac{x - \mu}{\sigma^2}.$$
Then the Fisher information for a single sample is
$$J(\mu) = \left\langle \left( \frac{\partial \log p(x \mid \mu)}{\partial \mu} \right)^2 \right\rangle = \frac{1}{\sigma^4} \int_{-\infty}^{\infty} (x - \mu)^2 \, p(x \mid \mu) \, dx = \frac{1}{\sigma^2}.$$
The integral evaluates to $\sigma^2$ since it's the distribution's variance. We have $N$ independent samples, so in this case the Cramér–Rao bound states that the variance of our estimator can be no smaller than $1 / (N J(\mu)) = \sigma^2 / N$.
We know that the expected value of our estimator is $\mu$, so to find its variance let's compute the expected value of our estimator squared:
$$\langle \hat{\mu}^2 \rangle = \frac{1}{N^2} \sum_{n, m} \langle x_n x_m \rangle = \frac{1}{N^2} \left[ N (\sigma^2 + \mu^2) + N(N - 1) \mu^2 \right] = \frac{\sigma^2}{N} + \mu^2.$$
Thus the variance of our estimator is
$$\langle \hat{\mu}^2 \rangle - \langle \hat{\mu} \rangle^2 = \frac{\sigma^2}{N}.$$
This is equal to the inverse of the Fisher information, so this estimator achieves the Cramér–Rao bound.
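A quick simulation (mine, not part of the derivation) illustrates the result: across many repeated experiments the sample mean is centered on $\mu$ and its variance sits right at the Cramér–Rao bound $\sigma^2 / N$.

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma, N = 2.0, 1.5, 50      # true mean, standard deviation, samples per experiment
experiments = 100_000

samples = rng.normal(mu, sigma, size=(experiments, N))
estimates = samples.mean(axis=1)           # the estimator (1/N) * sum of x_n

print("mean of estimates      :", estimates.mean())   # close to mu (unbiased)
print("variance of estimates  :", estimates.var())    # close to sigma^2 / N
print("Cramer-Rao lower bound :", sigma**2 / N)
```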