We’ve learned a lot about simulating physical systems and optimizing functions. The two can be combined in an interesting way: we can use a simulation to define a function from a design of some device to an evaluation of its performance, then optimize this function. This idea is by no means new; design optimization is commonplace in industry and has been explored extensively in academia.
A particular topic in this field that is more nascent is the use of gradient based optimization methods. As we’ve seen, these can converge much faster than gradient-free methods (see for example my comparisons of optimization algorithms at the bottom of problem set 8). But calculating the derivative of a simulation with respect to its initial conditions (or its physical parameters, or the parameters of its control laws) isn’t trivial. In particular, for derivates to be defined at all your whole simulation has to avoid non-differentiable operations (which rules out the equation solvers used by many implicit methods), and once you have an analytical expression for the derivative, you have to have a numerically stable way of calculating it.
I’m particularly interested in the latter issue. So my main focus will be to see if I can get any use out of gradients in a classic chaotic system: the three body problem. This will primarily be a qualitative study aimed at informing future research.
Differentiable simulations for narrowly defined tasks are common. The most widely known example is topology optimization, which is usually used to minimize the compliance of a structure as calculated by a finite element model. However this simulation isn’t dynamic, and it assumes a particular design representation (voxels).
Researchers that apply deep learning to robotics are increasingly interested in dynamic differentiable simulations. Here are some examples:
Degrave et. al. (2016) model robot dynamics via a simulation written in an autodifferentiation framework, but use only a few spheres per robot (placed at key joints).
Belbute-Perez et. al. (2018) implemented a differentiable simulation in PyTorch that they use for solving classic reinforcement learning problems like cartpole.
Simulations such as these tend to be for very simple systems, since the emphasis is more on the deep learning applications than the simulations themselves. I’m interested in using bigger, messier simulations to optimize more freeform structures.
Some of the most exciting differentiable simulation work today comes from CSAIL. Yuanming Hu in particular has driven a series of differentiable physics based studies:
ChainQueen: A real-time differentiable physical simulator for soft robotics (2019)
Taichi: a language for high-performance computation on spatially sparse data structures (2019)
DiffTaichi: Differentiable Programming for Physical Simulation (2020)
The latter implements 10 different differentiable physics simulations. Some of them implement fairly stiff models (like billiard balls). The author uses on the order of 1,000 time steps.
We’ve seen how under the hood, most simulations are ODE or PDE integration problems that are discretized and solved numerically. The integrators (like Euler, RK4, or finite differences rules) at the end of the day tend to consist of compositions of simple linear combinations of function evaluations, so there’s no reason we can’t differentiate through a whole simulation by repeatedly applying the chain rule. This tells us how the results of the simulation (whether that’s the final configuration of matter, or an integral that was computed over the duration of the simulation) change with respect to all the initial conditions (like initial displacements and momenta) and/or the parameters of the simulation itself (like step size, or constants of physical laws). This enables us to use gradient descent to tune these parameters to achieve desired goals.
Ok, time for the math. Let’s consider ODEs of the form . Assuming a fixed step size , the most common integration schemes are basically functions such that
For instance, Euler integration uses .
The function is applied recursively to produce a series of points, starting from some initial condition .
We can write this more compactly as the recurrence relation
By differentiating both sides of this relation and applying the chain rule, we find that
Note that we can take the partial derivative of any component of with respect to any component of . So the equation above relates Jacobian matrices. The left hand side is the Jacobian of of as a function of . The right hand side is the product of the Jacobian of as a function of with the Jacobian of as a function of . Note that necessarily involves the gradient of as a function of time (i.e. , the object that normal non-differential simulation integrates forward), so the Jacobian of will involve second order derivatives of (first evaluated with respect to time, then with respect to the prior state).
Assuming that we can compute on demand, this gives a convenient recurrence relation for in terms of . So to differentiate through such a simulation, all we need to do is figure out , and compute the series of derivate values alongside the values .
As mentioned above, for Euler integration . Thus
And
I stated with the Euler integration method I wrote during ODE week, and extended it to support the calculation of the full Jacobian matrix of the state variables (i.e. the partial derivative of every phase space component with every other.) The code is in this directory.
As I mentioned before, I want to investigate n-body problems. Let’s take our phase space coordinates to be position and velocity. Given positions and masses , Newton’s law of universal gravitation (along with Newton’s second law of motion) tells us how to compute the acceleration experienced by a particular particle.
To start, let’s consider two particles. I’ll make one 100 times heavier than the other, and our goal will be to find a stable circular orbit by varying the initial velocity of the satellite. As a baseline, here’s a really bad trajectory, in which the satellite gets flung very far away.
Differentiating through the simulation makes it very easy to find the desired solution. This orbit was found by gradient descent, but Nelder Mead and CMA-ES find the same optimum.
Now let’s see if we can solve some three body problems. I’ll ask it to make all three particles arrive at particular places at a particular time. Taking only 1,000 steps, and letting the optimizer adjust all the initial positions and velocities, it quickly finds a solution.
Here’s the initial (radomly generated) configuration.
Here’s the solution, found by conjugate gradient descent.
If it can move all the particles, it just moves them far away from each other so they don’t interact. To prevent this I’ll fix one of the particle’s initial positions to be more near the center. Let’s also consider longer time horizons, say, a still modest 2,000 steps.
Now CGD only finds solutions for certain boundary conditions. In particular, as feared the gradients can become very large very fast. Here’s a sample of the gradients from four consecutive search steps (iterations 4 through 7 of an unsuccessful optimization attempt).
Loss gradient
2.48672
-0.0576553
-0.180537
-1.27643
0.610839
3.42989
1.83012
0.637792
1.42442
1.57643
Loss gradient
0.0911027
0.953993
0.78372
3.82267
0.515682
0.519563
-0.398867
0.524445
1.18138
3.85341
Loss gradient
-8.16494
-1.95158
111.631
-29.8753
23.6031
7.00313
-2.90714
-0.627171
4.63286
-5.53816
loss gradient
-2.29063e+15
1.17083e+17
5.39334e+15
2.03754e+17
4.70533e+14
-1.79634e+16
-8.96676e+14
1.85524e+16
2.15819e+15
9.71962e+16
Things start out fine, but then out of nowhere the gradient explodes. From here it gets caught in a death spiral. Looking at the trajectories, it’s clear what’s going on. When we start, the particles interact with each other, but never get too close. But during optimization we end up in the following situation:
Two particles have a very close flyby, and scatter. Their trajectories after this event depend very sensitively on their prior states. The derivate is useless here, since in only a couple steps the rounding error from one step will be amplified to order unity.
CMA-ES does a little better of avoiding these pitfalls, but it too sometimes fails to reach a good solution, and other times just gives up:
(16,32)-CMA-ES(mu_eff=9.4)
ConditionNumber: maximal condition number 9.01e+12 reached.
Interestingly, Nelder Mead does at least as well as CMA-ES. I would have guessed that it would be more susceptible to chaos than CMA-ES, but it performs admirably. Here’s a path it found (of 10,000 time steps) that reaches the goal despite some complex and chaotic interactions beforehand.
I couldn’t get gradient based methods to converge at all on this time horizon.
During the course of this project, I started working on a GPU based particle system that will be used for a variety of research projects at CBA. Its whole purpose is to be used for optimization problems, but to date I’ve only used derivative free techniques. I was hoping I’d have time to add derivatives for this presentation, but unfortunately I’m not there yet.
Still, taking a look at what I have so far, I can identify some interesting differences between it and the simulation I used here that could change the results. This video shows two gears interacting, first with ~5k particles, and then with ~40k. The blue particles on the left gear are driven by a PD controller to follow a sinusoidal trajectory.
Running a single simulation, I get around 400 million particle updates per second (as you can see in the simulation monitor). If I run multiple simulations at once, the GPU can overlap some computation and data transfer operations so I can get over 600 million particle updates per second. Getting this performance was an interesting project in its own right, and there are still plenty of optimizations that could be applied. Now that I know CUDA, not using GPUs for simulation seems criminally negligent.
The contact forces between the gears can be very high, but there aren’t any chaotic trajectories like there are in the three body problem. So it’s possible that derivatives could be much more useful.
This also illustrates another potential way out. As anyone who’s rung a bell knows, stiff materials like metals do have vibrational modes, even if the vibrations are so fast and so small as to be invisible. This behavior is reproduced in particle systems like the one shown above (though practically speaking, the dynamic range between the smallest timescales and largest is often kept smaller than it is in the real world, so that larger time steps can be used). So if one were to ask where exactly one particle will end up on a small scale, the problem is chaotic. But the larger scale behavior can still be very predictable. Can gradients flow through systems like these without diverging along the way? Essentially, can extremely high magnitude noise in the gradient noise be averaged away with the right objective?
Overall, though the gradient free techniques I tried were more robust, they were not immune to the fundamental problem: the lack of a discernible relationship between inputs and outputs. So techniques for keeping gradient magnitudes down could be of use even for gradient free optimization methods. The magnitude of the gradient can be used as a metric of the sensitivity of the system, even if it isn’t needed for optimization.
So ultimately I think the question I want to answer is simply how to set up simulations to provide meaningful maps from inputs to outputs. This gets at the fundamental limits of simulation optimization.
Even in truly chaotic systems, there are many things I could do to help my optimization algorithms. Right now, my objective function is only a function of the final state. If I added a term that was integrated throughout the whole simulation, and which penalized near collisions, then the gradient might point the optimizer away from these non-differentiable events, rather than toward them. This is related to the idea of averaging out noise, but is aimed more directly at avoiding particularly problematic scenarios.
Finally, the optimization problems in the current study are compounded by the fact that I’m using Euler integration, since the integrator itself becomes inaccurate in near collisions. In particular, energy is artificially added to the simulation which further amplifies the already very large gradients involved. I’ve tried to get around this by using small time steps, but a better solution would be to use a higher order (and ideally symplectic) integrator. Verlet integration would work nicely.
There are also a few methods out there that can provide some of the benefits of gradients without actually needing to differentiate through the simulation. Most notably, reinforcement learning algorithms (such as REINFORCE), get around the need for calculating derivatives by using a statistical estimator of the gradient instead. This is sometimes necessary, for instance if the initial conditions of the problem are produced by a non-differential function of the parameters you want to optimize. I’m keeping notes on this technique here.
Finally, I’m also interested in geometric integration, specifically symplectic integration (as mentioned earlier). These techniques guarantee (up to numeric precision) that physically conserved quantities (e.g. energy) are preserved by the integrator. This prevents a wide range of issues inherent in general purpose integrators, such as the amplification that is so easy to see with Euler Integration. The math involved is pretty formidable, so I’m working towards a full understanding from the ground up. My notes on the subject are here. I want to understand what sorts of guarantees these integrators can provide when energy is intentionally added and removed from the system (so certainly isn’t conserved).