Problem Set 7 (Function Fitting)

1

Generate 100 points $x$ uniformly distributed between 0 and 1, and let $y = 2 + 3x + \zeta$, where $\zeta$ is a Gaussian random variable with a standard deviation of 0.5. Use an SVD to fit $y = a + bx$ to this data set, finding $a$ and $b$. Evaluate the errors in $a$ and $b$ using equation (12.34), by bootstrapping to generate 100 data sets, and from fitting an ensemble of 100 independent data sets.

I used Eigen to compute the SVD (in C++). My code is here; it reports the fitted $a$ and $b$ along with the error estimates from each of the three methods.
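For reference, a minimal sketch of this kind of SVD fit with Eigen might look like the following (illustrative only, not the linked code; the error estimates from equation (12.34), bootstrapping, and ensembles are omitted):

```cpp
// Sketch: least-squares fit of y = a + b*x via SVD of the design matrix.
#include <Eigen/Dense>
#include <iostream>
#include <random>

int main() {
    const int N = 100;
    std::mt19937 rng(0);
    std::uniform_real_distribution<double> ux(0.0, 1.0);
    std::normal_distribution<double> noise(0.0, 0.5);

    // Design matrix A = [1, x] and observations y = 2 + 3x + noise.
    Eigen::MatrixXd A(N, 2);
    Eigen::VectorXd y(N);
    for (int n = 0; n < N; ++n) {
        double x = ux(rng);
        A(n, 0) = 1.0;
        A(n, 1) = x;
        y(n) = 2.0 + 3.0 * x + noise(rng);
    }

    // Thin SVD A = U S V^T; the pseudoinverse solution minimizes ||A c - y||.
    Eigen::JacobiSVD<Eigen::MatrixXd> svd(A, Eigen::ComputeThinU | Eigen::ComputeThinV);
    Eigen::VectorXd coeffs = svd.solve(y);  // coeffs = (a, b)

    std::cout << "a = " << coeffs(0) << ", b = " << coeffs(1) << "\n";
    return 0;
}
```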

2

Generate 100 points $x$ uniformly distributed between 0 and 1, and let $y = \sin(2 + 3x) + \zeta$, where $\zeta$ is a Gaussian random variable with a standard deviation of 0.1. Write a Levenberg-Marquardt routine to fit $y = \sin(a + bx)$ to this data set starting from $a = b = 1$ (remembering that the second-derivative term can be dropped in the Hessian), and investigate the convergence for both fixed and adaptively adjusted values of $\lambda$.

We want to minimize

$$\chi^2(a, b) = \sum_{n=1}^{100} \big[y_n - \sin(a + b x_n)\big]^2 .$$

The first derivatives of this function are

$$\frac{\partial \chi^2}{\partial a} = -2\sum_n \big[y_n - \sin(a + b x_n)\big]\cos(a + b x_n), \qquad
\frac{\partial \chi^2}{\partial b} = -2\sum_n \big[y_n - \sin(a + b x_n)\big]\cos(a + b x_n)\,x_n .$$

The second derivatives (dropping the second-derivative term, as the problem suggests) are

$$\frac{\partial^2 \chi^2}{\partial a^2} \approx 2\sum_n \cos^2(a + b x_n), \qquad
\frac{\partial^2 \chi^2}{\partial a\,\partial b} \approx 2\sum_n \cos^2(a + b x_n)\,x_n, \qquad
\frac{\partial^2 \chi^2}{\partial b^2} \approx 2\sum_n \cos^2(a + b x_n)\,x_n^2 .$$

Using these we can construct our matrix $M$,

$$M_{ij} = \frac{1}{2}\,\frac{\partial^2 \chi^2}{\partial a_i\,\partial a_j}\,\big(1 + \lambda\,\delta_{ij}\big),$$

with $(a_1, a_2) = (a, b)$, and take steps $\delta\vec{a} = -M^{-1}\cdot\tfrac{1}{2}\nabla\chi^2$, where $\lambda$ is either held fixed or adjusted adaptively (increased when a step fails to reduce $\chi^2$, decreased when it succeeds).

My implementation lives here.
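To make the update concrete, here is a minimal sketch of the iteration (illustrative, not the linked implementation; the halve-on-success / double-on-failure schedule for $\lambda$ is one common choice among several):

```cpp
// Sketch: Levenberg-Marquardt fit of y = sin(a + b*x), starting from a = b = 1.
#include <Eigen/Dense>
#include <cmath>
#include <iostream>
#include <random>
#include <vector>

int main() {
    const int N = 100;
    std::mt19937 rng(0);
    std::uniform_real_distribution<double> ux(0.0, 1.0);
    std::normal_distribution<double> noise(0.0, 0.1);

    std::vector<double> x(N), y(N);
    for (int n = 0; n < N; ++n) {
        x[n] = ux(rng);
        y[n] = std::sin(2.0 + 3.0 * x[n]) + noise(rng);
    }

    auto chi2 = [&](double a, double b) {
        double s = 0.0;
        for (int n = 0; n < N; ++n) {
            double r = y[n] - std::sin(a + b * x[n]);
            s += r * r;
        }
        return s;
    };

    double a = 1.0, b = 1.0, lambda = 1.0;
    for (int iter = 0; iter < 100; ++iter) {
        // Accumulate -0.5 * grad(chi^2) and the first-derivative-only Hessian approximation.
        Eigen::Vector2d rhs = Eigen::Vector2d::Zero();
        Eigen::Matrix2d H = Eigen::Matrix2d::Zero();
        for (int n = 0; n < N; ++n) {
            double c = std::cos(a + b * x[n]);
            double r = y[n] - std::sin(a + b * x[n]);
            Eigen::Vector2d J(c, c * x[n]);   // derivatives of sin(a + b*x) w.r.t. (a, b)
            rhs += r * J;
            H += J * J.transpose();
        }
        Eigen::Matrix2d M = H;
        M.diagonal() *= (1.0 + lambda);       // Levenberg-Marquardt damping

        Eigen::Vector2d step = M.ldlt().solve(rhs);
        if (chi2(a + step(0), b + step(1)) < chi2(a, b)) {
            a += step(0); b += step(1);       // accept: trust the quadratic model more
            lambda *= 0.5;
        } else {
            lambda *= 2.0;                    // reject: lean toward gradient descent
        }
    }
    std::cout << "a = " << a << ", b = " << b << "\n";
    return 0;
}
```

Holding $\lambda$ fixed interpolates between Gauss-Newton-like steps (small $\lambda$) and scaled gradient descent (large $\lambda$); the adaptive schedule lets the routine switch between the two regimes automatically.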

3

An alternative way to choose among models is to select the one that makes the weakest assumptions about the data; this is the purpose of maximum entropy methods. Assume that what is measured is a set of expectation values for functions $f_i$ of a random variable $x$,

$$\langle f_i(x) \rangle = \int_{-\infty}^{\infty} f_i(x)\,p(x)\,dx .$$

(a)

Given these measurements, find the compatible normalized probability distribution $p(x)$ that maximizes the differential entropy

$$S = -\int_{-\infty}^{\infty} p(x)\,\log p(x)\,dx .$$

This is an optimization problem subject to multiple equality constraints, so Lagrange multipliers are a natural choice. Our Lagrangian is

$$\mathcal{L}[p] = -\int p\,\log p\,dx + \lambda_0\left(\int p\,dx - 1\right) + \sum_i \lambda_i\left(\int f_i\,p\,dx - \langle f_i(x)\rangle\right).$$

Note that I’ve added an additional constraint (the $\lambda_0$ term) that $p$ is normalized.

The solution will satisfy

$$\frac{\delta\mathcal{L}}{\delta p} = 0, \qquad \frac{\partial\mathcal{L}}{\partial \lambda_0} = 0, \qquad \frac{\partial\mathcal{L}}{\partial \lambda_i} = 0 .$$

The derivatives with respect to the Lagrange multipliers are simple enough; they just reproduce our original constraints. But what is the derivative with respect to a function? The functional derivative has a rigorous definition in the calculus of variations. Ignoring the motivation for a moment, it turns out that in this case we can just throw away everything that’s not an integral, and differentiate the expressions inside the integrals as if $p$ were a scalar:

$$\frac{\delta\mathcal{L}}{\delta p} = -\log p(x) - 1 + \lambda_0 + \sum_i \lambda_i f_i(x) .$$

This should be equal to zero, or equivalently,

$$p(x) = e^{\lambda_0 - 1}\,\exp\!\left(\sum_i \lambda_i f_i(x)\right).$$

Thus the solution to such a problem takes the form of an exponential family, with the multipliers fixed by the normalization and the measured expectation values.

Now, let’s dig into that functional derivative a little more. I assume we’re working in $L^2(\mathbb{R})$, i.e. the space of square integrable real valued functions. (This is a Banach space. Calculus of variations is often considered in Banach spaces, but can be defined more generally as well.) Let $F$ be a function that maps $L^2(\mathbb{R})$ to $\mathbb{R}$ via an integral

$$F[p] = \int f(p(x))\,dx .$$

Here $p$ is any function in $L^2(\mathbb{R})$, and $f$ is a smooth real-valued function. For any square integrable function $\eta$, consider the modified function $p + \epsilon\eta$ (for some real $\epsilon$). Think of this as $p$ with a small perturbation in the direction of $\eta$. Then the equivalent of a directional derivative of $F$ (in the direction of $\eta$) can be found by taking the first terms of a Taylor expansion in $\epsilon$:

$$F[p + \epsilon\eta] = \int f(p + \epsilon\eta)\,dx = \int \left[f(p) + \epsilon\,\eta\,f'(p) + O(\epsilon^2)\right]dx .$$

Recall that the derivative of a function $g$ at $x$ can be defined as the number $g'(x)$ such that $g(x + \epsilon) = g(x) + \epsilon\,g'(x) + o(\epsilon)$. So it shouldn’t seem totally crazy that the derivative of $F$ (in the direction of $\eta$) is

$$\frac{d}{d\epsilon}\,F[p + \epsilon\eta]\bigg|_{\epsilon = 0} = \int \eta(x)\,f'(p(x))\,dx .$$

Now if we’re trying to maximize (or minimize) $F$, we want to find a $p$ such that all directional derivatives of $F$ are zero. The fundamental lemma of calculus of variations tells us that if

$$\int \eta(x)\,f'(p(x))\,dx = 0$$

for all $\eta$, then $f'(p(x))$ must be identically zero. So this is why (at least in handwaving form) we ended up setting the derivatives of the expressions inside of integrals to zero. (If we wanted to be more rigorous, at least in the Banach space setting, we would show that the above formula agrees with the definition of the Fréchet derivative.)
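As a concrete check, applying this recipe term by term to the Lagrangian from part (a) gives

$$\frac{d}{d\epsilon}\,\mathcal{L}[p + \epsilon\eta]\bigg|_{\epsilon = 0}
= \int \eta(x)\left[-\log p(x) - 1 + \lambda_0 + \sum_i \lambda_i f_i(x)\right]dx ,$$

and requiring this to vanish for every $\eta$ recovers exactly the stationarity condition used above.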

(b)

What is the maximum entropy distribution if we know only the second moment?

Assume the second moment is

$$\sigma^2 = \int_{-\infty}^{\infty} x^2\,p(x)\,dx .$$

We know the solution has the form

$$p(x) = e^{\lambda_0 - 1}\,e^{\lambda_1 x^2} .$$

We just need to find $\lambda_0$ and $\lambda_1$.

For $p$ to be square integrable, it has to disappear as $x$ approaches positive or negative infinity. So let’s flip the sign of $\lambda_1$, writing $\lambda = -\lambda_1$, and assume this new $\lambda$ is positive from here on out. So our modified general form is

$$p(x) = A\,e^{-\lambda x^2}, \qquad A \equiv e^{\lambda_0 - 1}, \quad \lambda > 0 .$$

To ensure that the distribution is normalized, we require

$$1 = \int_{-\infty}^{\infty} A\,e^{-\lambda x^2}\,dx = A\sqrt{\frac{\pi}{\lambda}} \qquad\Longrightarrow\qquad A = \sqrt{\frac{\lambda}{\pi}} .$$

The $\sqrt{\pi/\lambda}$ comes from the well known Gaussian integral $\int_{-\infty}^{\infty} e^{-\lambda x^2}\,dx = \sqrt{\pi/\lambda}$.

And for the second moment to be $\sigma^2$,

$$\sigma^2 = \int_{-\infty}^{\infty} x^2\,A\,e^{-\lambda x^2}\,dx = \frac{A}{2\lambda}\int_{-\infty}^{\infty} e^{-\lambda x^2}\,dx = \frac{A}{2\lambda}\sqrt{\frac{\pi}{\lambda}} .$$

Here I integrated by parts (then changed variables). Note that

$$\int x^2\,e^{-\lambda x^2}\,dx = \left[-\frac{x}{2\lambda}\,e^{-\lambda x^2}\right] + \frac{1}{2\lambda}\int e^{-\lambda x^2}\,dx .$$

The boundary terms disappear, so I omitted them.

Plugging in the first result to the second,

$$\sigma^2 = \sqrt{\frac{\lambda}{\pi}}\,\frac{1}{2\lambda}\sqrt{\frac{\pi}{\lambda}} = \frac{1}{2\lambda} .$$

So then

$$\lambda = \frac{1}{2\sigma^2}, \qquad A = \sqrt{\frac{\lambda}{\pi}} = \frac{1}{\sqrt{2\pi\sigma^2}} .$$

Thus the solution is

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-x^2 / (2\sigma^2)} .$$

This is a Gaussian with zero mean and variance $\sigma^2$. Note that the entropy of a Gaussian doesn’t depend on its mean, so if we had constrained the variance (the second moment about the mean) rather than the second moment about zero, any mean would do; with $\langle x^2 \rangle$ fixed, though, a nonzero mean would eat into the variance and lower the entropy, so the zero-mean Gaussian is the maximizer.
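For completeness, the entropy this distribution achieves is

$$S = -\int p\,\log p\,dx = \int p(x)\left[\frac{x^2}{2\sigma^2} + \frac{1}{2}\log\!\left(2\pi\sigma^2\right)\right]dx = \frac{1}{2} + \frac{1}{2}\log\!\left(2\pi\sigma^2\right) = \frac{1}{2}\log\!\left(2\pi e\,\sigma^2\right),$$

which depends only on $\sigma$.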

4

Now consider the reverse situation. Let’s say that we know that a data set $\{x_n\}_{n=1}^{N}$ was drawn from a Gaussian distribution with variance $\sigma^2$ and unknown mean $\mu$. Try to find an optimal estimator of the mean (one that is unbiased and has the smallest possible error in the estimate).

Let our estimator be the sample mean

$$\hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x_n .$$

The sum of two independent normally distributed random variables $X \sim \mathcal{N}(\mu_X, \sigma_X^2)$ and $Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2)$ is also normally distributed:

$$X + Y \sim \mathcal{N}\!\left(\mu_X + \mu_Y,\; \sigma_X^2 + \sigma_Y^2\right).$$

So by induction, $\sum_n x_n \sim \mathcal{N}(N\mu, N\sigma^2)$. Thus

$$\hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x_n \sim \mathcal{N}\!\left(\mu, \frac{\sigma^2}{N}\right),$$

since $\mathbb{E}[aX] = a\,\mathbb{E}[X]$ and $\operatorname{Var}(aX) = a^2\operatorname{Var}(X)$.

As such the mean of our estimator is $\mu$, which means it’s unbiased. Its variance is $\sigma^2/N$. Let’s see how this compares to the Cramér–Rao bound.

The score of a Gaussian (with respect to $\mu$) is

$$\frac{\partial}{\partial\mu}\log p(x \mid \mu) = \frac{\partial}{\partial\mu}\left[-\frac{(x-\mu)^2}{2\sigma^2} - \frac{1}{2}\log\!\left(2\pi\sigma^2\right)\right] = \frac{x - \mu}{\sigma^2} .$$

The Fisher information is the variance of the score. And since the expectation of the score is always zero, its variance is just its expected square:

$$I_1(\mu) = \mathbb{E}\!\left[\left(\frac{x - \mu}{\sigma^2}\right)^2\right] = \frac{\mathbb{E}\!\left[(x - \mu)^2\right]}{\sigma^4} = \frac{1}{\sigma^2} .$$

(Recall that $\mathbb{E}[x^2]$ for a Gaussian $x$ with zero mean is its variance; here $x - \mu$ has zero mean and variance $\sigma^2$.) So the Fisher information of the collection of $N$ independent samples is $N I_1(\mu) = N/\sigma^2$.

The Cramér–Rao bound states that the variance of any unbiased estimator of $\mu$ is no less than $1/(N I_1(\mu)) = \sigma^2/N$. So it turns out our estimator is as good as possible, since its variance is exactly $\sigma^2/N$.
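As a quick sanity check on the $\sigma^2/N$ variance, a short Monte Carlo simulation works (the specific values $\mu = 2$, $\sigma = 0.5$, $N = 100$ below are arbitrary choices for illustration):

```cpp
// Monte Carlo check: the empirical variance of the sample mean of N Gaussian
// draws should land near sigma^2 / N. Parameter values are arbitrary.
#include <iostream>
#include <random>

int main() {
    const int N = 100;          // samples per data set
    const int trials = 100000;  // number of independent data sets
    const double mu = 2.0, sigma = 0.5;

    std::mt19937 rng(0);
    std::normal_distribution<double> gauss(mu, sigma);

    double sum = 0.0, sumsq = 0.0;
    for (int t = 0; t < trials; ++t) {
        double mean = 0.0;
        for (int n = 0; n < N; ++n) mean += gauss(rng);
        mean /= N;                            // sample mean of one data set
        sum += mean;
        sumsq += mean * mean;
    }
    const double avg = sum / trials;
    const double var = sumsq / trials - avg * avg;  // empirical Var(mu_hat)

    std::cout << "empirical Var = " << var
              << ", sigma^2/N = " << sigma * sigma / N << "\n";
    return 0;
}
```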