Problem Set 8 (Optimization)

This week’s questions are all about exploring various optimization algorithms, so I won’t reproduce them here. This page summarizes my results. For notes on derivation and implementation, see here. Code is here.

I implemented Nelder-Mead, naive gradient descent (always step by a constant factor times the gradient), and conjugate gradient descent. For the latter I initially used a golden-section line search, but replaced it with Brent’s method; I still want to update it to use Brent’s method with derivatives. I also played around with the reference CMA-ES implementation for comparison.
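For reference, the two gradient-based updates look roughly like the sketch below. This is illustrative Python rather than my actual code: SciPy’s `minimize_scalar` stands in for the Brent line search, and the constant step size and the Polak-Ribière β (with restart) are assumptions on my part, not necessarily what my implementation uses.

```python
import numpy as np
from scipy.optimize import minimize_scalar


def naive_gd(grad, x0, step=1e-3, gtol=1e-8, max_iters=10**6):
    """Naive gradient descent: always step by a constant factor times the gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) < gtol:
            break
        x = x - step * g
    return x


def conjugate_gd(f, grad, x0, gtol=1e-8, max_iters=10**6):
    """Nonlinear conjugate gradient with a Brent line search along each direction."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    d = -g
    for _ in range(max_iters):
        if np.linalg.norm(g) < gtol:
            break
        # Derivative-free Brent's method on the 1-d slice of f through x along d.
        alpha = minimize_scalar(lambda a: f(x + a * d), method="brent").x
        x = x + alpha * d
        g_new = grad(x)
        # Polak-Ribiere beta, clipped at zero so the direction resets when it misbehaves.
        beta = max(0.0, g_new @ (g_new - g) / (g @ g))
        d = -g_new + beta * d
        g = g_new
    return x
```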

Visualizations

Nelder-Mead 2d

Nelder-Mead 10d

Naive Gradient Descent 2d

Naive Gradient Descent 10d

Conjugate Gradient Descent 2d

Conjugate Gradient Descent 10d

CMA-ES 2d

CMA-ES 10d

Performance Comparisons

I haven’t made a pretty chart for this yet since I’m still playing around with the parameters (to make it as fair a fight as possible). But initial results suggest that on the multi-dimensional Rosenbrock function, conjugate gradient descent is a clear winner in scaling, at least through one thousand dimensions.

Here are some raw numbers. Nelder-Mead and CMA-ES are asked to reach a tolerance of 1e-8, while the gradient descent methods are asked to find points where the gradient norm is less than 1e-8. The numbers below are the function evaluations required to reach those goals, capped at 1e6 (nm = Nelder-Mead, gd = naive gradient descent, cgd = conjugate gradient descent, cma = CMA-ES).

Dim       nm         gd         cgd       cma
2         193        28527      1021      870
4         562        35724      9448      2760
8         1591       45049      5987      5790
16        27440      124637     13083     14292
32        78969      200567     13898     57876
64        561195     455578     15194     211616
128       1000001    564451     19565     953694
256       1000001    782196     28110     1000000
512       1000001    1000000    46918     1000000
1024      1000001    1000000    10055     1000000
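
To make the setup concrete, here is a rough sketch of the kind of evaluation-counting harness these numbers come from. The n-dimensional Rosenbrock definition is the standard one; the `CountingFunction` wrapper and the way the 1e6 cap is enforced are just illustrative, not necessarily how my benchmark code is organized.

```python
import numpy as np


def rosenbrock(x):
    """Standard n-dimensional Rosenbrock: sum_i 100*(x[i+1] - x[i]^2)^2 + (1 - x[i])^2."""
    x = np.asarray(x, dtype=float)
    return np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2)


class CountingFunction:
    """Wrap an objective so every evaluation is counted against a budget."""

    def __init__(self, f, cap=10**6):
        self.f = f
        self.cap = cap
        self.evals = 0

    def __call__(self, x):
        self.evals += 1
        if self.evals > self.cap:
            raise RuntimeError("evaluation budget exhausted")
        return self.f(x)


# Usage: wrap the objective, hand f to a solver, and read off f.evals afterwards.
f = CountingFunction(rosenbrock)
```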

As a final note, all the algorithms except CMA-ES spent the majority of their time evaluating the function. But CMA-ES imposes large computational costs of its own, since it essentially has to invert a covariance matrix (or at least solve an eigenvalue problem).
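
By “solve an eigenvalue problem” I mean something like the toy illustration below (not code from the reference implementation). Sampling in CMA-ES uses an eigendecomposition of the d × d covariance matrix, which is O(d³) no matter how cheap the objective is; as far as I understand, real implementations refresh it lazily, but it still dominates at high dimension.

```python
import numpy as np

# CMA-ES maintains a d x d covariance matrix C and samples candidates from N(m, sigma^2 * C).
# In the usual formulation that means decomposing C = B * D^2 * B^T.
d = 256
C = np.eye(d)                      # placeholder covariance, purely for illustration
D2, B = np.linalg.eigh(C)          # eigenvalues and eigenvectors of C -- the O(d^3) step
z = np.random.standard_normal(d)
offset = B @ (np.sqrt(D2) * z)     # one candidate offset, distributed as N(0, C)
```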