Optimization Algorithms

Parabolic Interpolation

A task that comes up frequently in optimization problems is guessing a function minimum based on a parabola fit through three points. So I’ll derive this technique here.

First, we need a parabola that goes through three points, say $(a, f(a))$, $(b, f(b))$, and $(c, f(c))$. In the interest of generality, I’ll construct it using Lagrange polynomials.

$$P(x) = f(a)\,\frac{(x-b)(x-c)}{(a-b)(a-c)} + f(b)\,\frac{(x-a)(x-c)}{(b-a)(b-c)} + f(c)\,\frac{(x-a)(x-b)}{(c-a)(c-b)}$$

Differentiating, we find that the derivative is equal to the following.

$$P'(x) = f(a)\,\frac{2x-b-c}{(a-b)(a-c)} + f(b)\,\frac{2x-a-c}{(b-a)(b-c)} + f(c)\,\frac{2x-a-b}{(c-a)(c-b)}$$

Setting this to zero, we find that the solution is surprisingly simple.

$$x_{\min} = \frac{f(a)\,(b^2 - c^2) + f(b)\,(c^2 - a^2) + f(c)\,(a^2 - b^2)}{2\left[f(a)\,(b - c) + f(b)\,(c - a) + f(c)\,(a - b)\right]}$$

This can also be factored so that it only involves differences of coordinates and function values.

$$x_{\min} = b - \frac{1}{2}\,\frac{(b-a)^2\left[f(b)-f(c)\right] - (b-c)^2\left[f(b)-f(a)\right]}{(b-a)\left[f(b)-f(c)\right] - (b-c)\left[f(b)-f(a)\right]}$$
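Here is a quick Python sketch of the difference form (the test function and point values are just an illustrative toy example of my own); for an exactly quadratic function it recovers the vertex in a single step.

```python
def parabolic_min(a, b, c, fa, fb, fc):
    """Abscissa of the vertex of the parabola through (a, fa), (b, fb), (c, fc)."""
    p = (b - a) ** 2 * (fb - fc) - (b - c) ** 2 * (fb - fa)
    q = (b - a) * (fb - fc) - (b - c) * (fb - fa)
    if q == 0.0:
        raise ValueError("the three points are collinear")
    return b - 0.5 * p / q

# Sanity check on an exact parabola: the estimate is the true minimum.
f = lambda x: 3.0 * (x - 1.7) ** 2 + 2.0
print(parabolic_min(0.0, 1.0, 3.0, f(0.0), f(1.0), f(3.0)))  # approximately 1.7
```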

Nelder-Mead

TODO: Demonstrate that the Numerical Recipes formulation is equivalent to the standard formulation.

Conjugate Gradient Descent

Motivation

Since one-dimensional optimization problems can be approached systematically, it’s reasonable to consider schemes for multidimensional optimization that consist of iterative line searches.

Say we have a function $f$ that maps $\mathbb{R}^n$ to $\mathbb{R}$, and some arbitrary starting point $x_0$. If we choose linearly independent directions $d_1, d_2, \ldots, d_n$, we can perform successive line minimizations. This will give us points

$$x_i = x_{i-1} + \alpha_i d_i,$$

where the coefficients $\alpha_i$ are chosen via some one-dimensional optimization routine. Since each point $x_i$ is a local minimum along the direction $d_i$, it must be true that

$$\nabla f(x_i) \cdot d_i = 0.$$

Otherwise we could have gone farther along $d_i$ and made $f$ smaller. The problem with choosing random directions is that, for $j < i$, in general

$$\nabla f(x_i) \cdot d_j \neq 0.$$

This says that the directional derivative at along is nonzero. So each successive line minimization undoes the work of the previous ones, in the sense that it’s necessary to cycle back and minimize along the previous direction again. This can lead to a very inefficient search.
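To see how bad this can get, here is a small Python experiment (the ill-conditioned quadratic and the fixed coordinate directions are my own toy choices): each cycle of line minimizations only shrinks the error by a constant factor, so many cycles are needed.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# An ill-conditioned quadratic with its minimum at the origin.
f = lambda x: x[0] ** 2 + 100.0 * x[1] ** 2 + 18.0 * x[0] * x[1]

x = np.array([1.0, 1.0])
directions = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # fixed, non-conjugate

for cycle in range(10):
    for d in directions:
        # Line minimization of f along x + alpha * d.
        alpha = minimize_scalar(lambda a: f(x + a * d)).x
        x = x + alpha * d
    print(cycle, x, f(x))
# Each cycle undoes part of the previous one, so f only shrinks by a
# constant factor per cycle instead of converging in two line searches.
```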

This raises the question: can we find directions for which successive line minimizations don’t disturb the previous ones? In particular, for $j \le i$ we want

$$\nabla f(x_i) \cdot d_j = 0.$$

If we can do this, after performing $i$ line minimizations, the resulting point $x_i$ will still be minimal along all directions considered so far. This implies that $x_i$ is the minimum within the affine subspace $x_0 + \operatorname{span}\{d_1, \ldots, d_i\}$. So each successive line search expands the space within which we’ve minimized by one dimension. And after $n$ minimizations, we’ve covered the whole space: the directional derivative at $x_n$ must be zero in all directions (i.e. $\nabla f(x_n) = 0$), so $x_n$ is a local minimum.

It turns out it’s possible to do this exactly for quadratic minima, and it can be approximated for other functions (after all, near a local extremum the gradient vanishes, so any smooth function whose Hessian doesn’t vanish there looks quadratic sufficiently close by). In the latter case, repeating the whole process multiple times yields better and better solutions.

Derivation

Preliminaries

Let’s express $f$ as a second-order Taylor expansion about $x_0$:

$$f(x) = f(x_0) + \nabla f(x_0)^T (x - x_0) + \tfrac{1}{2}\,(x - x_0)^T H\,(x - x_0).$$

I’ll assume this is exact in the following discussion, but if you have higher order terms as well, you can view everything as an approximation.

$H$ is the Hessian matrix at $x_0$. I assume we don’t have any way to compute it directly, but it’s important to consider its presence as we derive the algorithm. Notationally, I don’t attach an $x_0$ to it since we will never consider the Hessian at any other location.

By differentiating, we find that

$$\nabla f(x) = \nabla f(x_0) + H\,(x - x_0),$$

or equivalently, for any two points $x$ and $y$,

$$\nabla f(y) - \nabla f(x) = H\,(y - x).$$

Now if we could compute $H$, we could set the gradient to zero and find the minimum directly by solving $H\,(x - x_0) = -\nabla f(x_0)$. By our assumption, this is forbidden to us, but it still brings up an important point: if $H$ isn’t invertible, there isn’t a unique solution.
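For contrast, here is what that direct solve looks like when the Hessian of a toy quadratic is known (a hypothetical situation; the algorithm we are deriving never gets to do this):

```python
import numpy as np

# Toy quadratic f(x) = 1/2 x^T H x - b^T x with a known, invertible Hessian.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: H @ x - b

x0 = np.array([5.0, 5.0])
# Solve H (x - x0) = -grad(x0) for the step that lands on the minimum.
step = np.linalg.solve(H, -grad(x0))
x_min = x0 + step
print(x_min, grad(x_min))  # gradient is (numerically) zero at x_min
```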

The easiest way out is to assume that $H$ is positive definite. However, to handle the case of cubic or higher-order minima, we need to relax this. Everything in the following derivation works even when the Hessian vanishes, as long as the line searches terminate, i.e. there’s no direction along which the function decreases forever, and you aren’t unlucky enough to shoot a search exactly along the floor of a flat valley. Better yet, if your line search is smart enough to quit on flat functions, then you just need to ensure you can’t go downhill forever, i.e. $H$ is positive semidefinite.

Moving along, define $x_i = x_{i-1} + \alpha_i d_i$ via line minimizations as before. For any $i$,

$$\nabla f(x_i) - \nabla f(x_{i-1}) = H\,(x_i - x_{i-1}) = \alpha_i H\, d_i.$$

From this it follows that

$$\nabla f(x_i) = \nabla f(x_{i-1}) + \alpha_i H\, d_i,$$

or more generally, for $j \le i$,

$$\nabla f(x_i) = \nabla f(x_j) + \sum_{k=j+1}^{i} \alpha_k H\, d_k.$$

Thus for $j \le i$,

$$\nabla f(x_i) \cdot d_j = \nabla f(x_j) \cdot d_j + \sum_{k=j+1}^{i} \alpha_k\, d_j^T H\, d_k = \sum_{k=j+1}^{i} \alpha_k\, d_j^T H\, d_k.$$

(Recall that $\nabla f(x_j) \cdot d_j = 0$ because $\alpha_j$ is chosen to make $x_j$ a local minimum along $d_j$.) The whole point of this exercise is to make this inner product zero, since then each line minimization won’t spoil the previous ones. So our goal will be achieved if we can ensure that

$$d_j^T H\, d_k = 0$$

for all $j \neq k$. Such vectors are called conjugate vectors (conjugate with respect to $H$), from which this algorithm derives its name. (Though perhaps the name is applied sloppily, since it’s not the gradients themselves that are conjugate.)
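To make the definition concrete, here is a small numerical check (my own toy quadratic, with $H$ written out explicitly just for the demonstration): two directions that are conjugate with respect to $H$ let successive exact line minimizations land on the minimum without spoiling each other.

```python
import numpy as np

H = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
grad = lambda x: H @ x - b          # gradient of f(x) = 1/2 x^T H x - b^T x

d1 = np.array([1.0, 0.0])
d2 = np.array([-H[0, 1], H[0, 0]])  # built so that d1^T H d2 = 0
assert abs(d1 @ H @ d2) < 1e-12     # the two directions are conjugate

x = np.array([7.0, -3.0])
for d in (d1, d2):
    # Exact line minimization of the quadratic along d.
    alpha = -(grad(x) @ d) / (d @ H @ d)
    x = x + alpha * d
print(x, np.linalg.solve(H, b))     # the two agree: no cycling back needed
```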

Base Case

We will now construct a set of conjugate vectors inductively. If $\nabla f(x_0) = 0$, we are done. So take $d_1 = \nabla f(x_0)$ as the first direction, and find $x_1 = x_0 + \alpha_1 d_1$ via a line search. Yes, $d_1$ points uphill; I assume the line search is smart enough to find a negative $\alpha_1$. Also note that if the line search fails, then clearly $f$ wasn’t strictly convex. Finally, since the gradient at $x_0$ isn’t zero, $d_1$ will not be zero.

By construction,

$$\nabla f(x_1) \cdot d_1 = 0.$$

So trivially, $\{d_1\}$ spans a subspace of dimension one, and the following properties hold:

$$\nabla f(x_1) \cdot d_j = 0 \quad \text{for } j \le 1, \qquad \nabla f(x_1) \cdot \nabla f(x_j) = 0 \quad \text{for } j < 1.$$

(The second follows because $\nabla f(x_0) = d_1$, and the conjugacy condition $d_j^T H\, d_k = 0$ is vacuous when there is only one direction.)

Induction

Now assume that we’ve constructed points $x_1, \ldots, x_i$ and directions $d_1, \ldots, d_i$ that span a subspace of dimension $i$, and that the following properties hold:

$$d_j^T H\, d_k = 0 \quad \text{for } j \neq k \le i,$$

$$\nabla f(x_i) \cdot d_j = 0 \quad \text{for } j \le i,$$

$$\nabla f(x_i) \cdot \nabla f(x_j) = 0 \quad \text{for } j < i.$$

We will choose a new direction of the form

$$d_{i+1} = \nabla f(x_i) + \gamma\, d_i$$

for some undetermined scalar $\gamma$. As before, if the gradient at $x_i$ is zero, then that point is a minimum and we are done. Additionally, since $\nabla f(x_i) \cdot d_i = 0$, no value of $\gamma$ can make $d_{i+1}$ zero. And since $\nabla f(x_i) \cdot d_j = 0$ for all $j \le i$, no value of $\gamma$ can make $d_{i+1}$ a linear combination of the prior $d_j$.

Our primary concern is that $d_{i+1}^T H\, d_i = 0$. So we expand

$$d_{i+1}^T H\, d_i = \nabla f(x_i)^T H\, d_i + \gamma\, d_i^T H\, d_i.$$

If $H$ is positive definite, by definition $d_i^T H\, d_i > 0$. Even without this assumption, though, we will soon rewrite the denominator and show that it is nonzero. So I will go ahead and solve

$$\gamma = -\frac{\nabla f(x_i)^T H\, d_i}{d_i^T H\, d_i}.$$

Since we can’t compute $H$, we can’t use this equation directly. But recall

$$\nabla f(x_i) - \nabla f(x_{i-1}) = \alpha_i H\, d_i.$$

Thus

$$\gamma = -\frac{\nabla f(x_i)^T \left(\nabla f(x_i) - \nabla f(x_{i-1})\right)}{d_i^T \left(\nabla f(x_i) - \nabla f(x_{i-1})\right)} = \frac{\nabla f(x_i)^T \left(\nabla f(x_i) - \nabla f(x_{i-1})\right)}{\left\lvert \nabla f(x_{i-1}) \right\rvert^2},$$

where the denominator simplifies because $d_i \cdot \nabla f(x_i) = 0$ and $d_i \cdot \nabla f(x_{i-1}) = \lvert \nabla f(x_{i-1}) \rvert^2$ (expand $d_i = \nabla f(x_{i-1}) + \gamma\, d_{i-1}$ and recall $d_{i-1} \cdot \nabla f(x_{i-1}) = 0$; for $i = 1$, $d_1 = \nabla f(x_0)$ gives the same result directly).

We can simplify one step further, using the orthogonality of successive gradients, and write

$$\gamma = \frac{\left\lvert \nabla f(x_i) \right\rvert^2}{\left\lvert \nabla f(x_{i-1}) \right\rvert^2}.$$

This makes sense if the function we’re minimizing really is quadratic. But when that assumption is only approximately true, empirically it tends to work better to use the first form. (The first form is the Polak-Ribière formula; the simplified one is Fletcher-Reeves.) Also, note that the denominator (in both forms) is guaranteed to be nonzero, since if $\nabla f(x_{i-1})$ were zero we would already have found our minimum.
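In code, the two forms of $\gamma$ look like this (gradients as NumPy arrays; the function names are mine):

```python
import numpy as np

def gamma_first_form(g_new, g_old):
    """The first form above (Polak-Ribiere); preferred when f is only
    approximately quadratic."""
    return g_new @ (g_new - g_old) / (g_old @ g_old)

def gamma_simplified(g_new, g_old):
    """The simplified form (Fletcher-Reeves); equivalent to the first
    when successive gradients are exactly orthogonal."""
    return (g_new @ g_new) / (g_old @ g_old)

g_old = np.array([1.0, -2.0])
g_new = np.array([0.5, 0.25])  # on a truly quadratic f these would be orthogonal
print(gamma_first_form(g_new, g_old), gamma_simplified(g_new, g_old))
```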

Now that we have determined $\gamma$, we can compute $d_{i+1}$ explicitly and find $x_{i+1} = x_i + \alpha_{i+1} d_{i+1}$ by line minimization. Because $x_{i+1}$ is a minimum along $d_{i+1}$, $\nabla f(x_{i+1}) \cdot d_{i+1} = 0$. And our choice of $\gamma$ ensures that $d_{i+1}^T H\, d_i = 0$, which as we saw earlier implies that $\nabla f(x_{i+1}) \cdot d_i = 0$. Now we must show that these relations hold for the earlier directions as well.

First we show the conjugacy conditions. For any $j < i$,

$$d_{i+1}^T H\, d_j = \nabla f(x_i)^T H\, d_j + \gamma\, d_i^T H\, d_j = \frac{1}{\alpha_j}\,\nabla f(x_i)^T \left(\nabla f(x_j) - \nabla f(x_{j-1})\right) = 0,$$

where the second term vanishes by the conjugacy of the old directions and the last step uses the orthogonality of the gradients. As we have seen, this implies that

$$\nabla f(x_{i+1}) \cdot d_j = 0$$

for $j \le i+1$.

So finally we must show the orthogonality of the gradients. For any $j \le i$, the construction of the directions gives

$$\nabla f(x_j) = d_{j+1} - \gamma_{j+1}\, d_j,$$

where $\gamma_{j+1}$ denotes the scalar used to build $d_{j+1}$ (for $j = 0$ this is just $\nabla f(x_0) = d_1$), so each such gradient lies in the span of $d_1, \ldots, d_{i+1}$. And

$$\nabla f(x_{i+1}) \cdot d_j = \nabla f(x_{i+1}) \cdot d_{j+1} = 0.$$

Thus we have proven

$$\nabla f(x_{i+1}) \cdot \nabla f(x_j) = 0 \quad \text{for } j \le i.$$

By induction this shows that these properties hold up to $i = n$. (We cannot induct further than this, since the gradient at $x_n$ will be zero, and so $d_{n+1}$ would be in the span of $d_1, \ldots, d_n$.)
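Putting it all together, here is a rough Python sketch of the resulting algorithm: a nonlinear conjugate gradient loop that restarts its direction set every $n$ line searches, using the first form of $\gamma$. The helper names and the use of scipy’s scalar minimizer as the line search are my own choices, not part of the derivation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def conjugate_gradient_minimize(f, grad, x0, cycles=20, tol=1e-8):
    """Minimize f from x0 by successive line searches along conjugate
    directions, restarting the direction set after every n searches."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    for _ in range(cycles):
        g_old = grad(x)
        d = g_old.copy()                   # first direction: the gradient itself
        for _ in range(n):
            if np.linalg.norm(g_old) < tol:
                return x                   # gradient vanished: x is a minimum
            # Line minimization of f along x + alpha * d (alpha comes out
            # negative whenever d points uphill, as it does on the first step).
            alpha = minimize_scalar(lambda a: f(x + a * d)).x
            x = x + alpha * d
            g_new = grad(x)
            # First form of gamma from the derivation above.
            gamma = g_new @ (g_new - g_old) / (g_old @ g_old)
            d = g_new + gamma * d          # next conjugate direction
            g_old = g_new
    return x

# On an exactly quadratic function, one cycle of n line searches suffices.
H = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ H @ x - b @ x
grad = lambda x: H @ x - b
print(conjugate_gradient_minimize(f, grad, [5.0, 5.0]), np.linalg.solve(H, b))
```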