A GPU Lattice-Gas Implementation

This semester we had the chance to do a basic Lattice Gas Automata implementation, which I did in C and OpenGL. Here I have implemented two different Lattice Gas algorithms in OpenCL to run on the GPU.

The goal was to explore Lattice Gasses as a model for acoustic propagation.

OpenCL vs. CUDA

In the GPGPU programming world, CUDA and OpenCL are the two biggest approaches. CUDA has a more pleasant API, but at the cost of portability, as CUDA programs can only run on NVIDIA hardware. I'm interested in learning GPGPU programming for possible mobile applications, so I chose OpenCL for this project.


I couldn't just scatter calls to the terrible OpenCL API throughout my code, so I wrote a thin wrapper called EZCL that provides a much more pleasant interface to application code. The kernels are still written in standard OpenCL, but I've wrapped the C API used to create and destroy buffers, read in and compile kernels, etc. As an example, a kernel to scale an input buffer by a coefficient would look like:

__kernel void scale(
                    __global float* input,
                    __global float* output,
                    float coef)
                {
                    int i = get_global_id(0);
                    output[i] = input[i] * coef;
                }

If the kernel is in a file "kernels.cl", you could use EZCL to load and execute it like so:

#include "EZCL.h"
            void scaleBuf(float *in, float *out, float coef, size_t size) {
                EZCL_Buffer *inBuf = EZCL_Malloc(size, CL_MEM_READ_ONLY);
                EZCL_Buffer *outBuf = EZCL_Malloc(size, CL_MEM_WRITE_ONLY);
                EZCL_Kernel *scale = EZCL_LoadKernel("kernels.cl", "scale", "bbf");
                EZCL_CopyToBuffer(inBuf, (uint8_t *)in, size*sizeof(float));
                EZCL_Enqueue1DKernel(scale, size, 32, inBuf, outBuf, coef);
                EZCL_CopyFromBuffer((uint8_t *)out, outBuf, size*sizeof(float));
            }

Note that when we load the kernel we have to tell EZCL what the argument types are with the third argument, which is a format string. In this case "bbf" declares two buffers and a float. The buffers are untyped and it's up to the kernel to interpret them.

Lattice Gas Automata

With my OpenCL wrapper in hand, it was time to port my existing code over to OpenCL. The initial porting went smoothly, and I was able to convert my existing C code to parallel OpenCL without too much trouble. Once I had the basic behavior replicated I wanted to see what the effect of barriers in the space would be, so I created a double-slit barrier spanning the space.

Comparing the CPU algorithm to the GPU gave about a 25x speedup (the GPU executes in about 4% of the time). The output from my speed benchmark is:

CreateWorld took 0 ms on CPU
            500 generations took 4915 ms on the CPU
            Found Device: NVIDIA Corporation - GeForce GTX 660
            CreateWorld took 86 ms on GPU
            500 generations took 191 ms on the GPU

This is without much in the way of optimization for memory access or work item grouping, so I expect that it could be sped up substantially from here.

Boundary Implementation

To implement a reflective boundary I added an extra bit to my lookup table that represented whether the cell was a reflective boundary. For reflective cells the lookup simply reversed whatever particles came in, and did not do any other colliding.
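As a sketch of what that lookup can look like (the bit positions and table layout here are assumptions for illustration, not my exact implementation): with six particle bits per cell plus one boundary bit, the reflective entries just send each direction i to its opposite, (i+3) mod 6, which for a 6-bit state is a rotate by 3.

```c
#include <stdint.h>

/* Hypothetical layout: bits 0-5 hold the six lattice directions,
 * bit 6 flags a reflective boundary cell. */
#define BOUNDARY_BIT (1u << 6)

/* Reverse all six directions: direction i -> (i+3) mod 6, i.e. a
 * 3-bit rotate of the 6-bit state. */
static uint8_t reverse6(uint8_t s)
{
    return (uint8_t)(((s << 3) | (s >> 3)) & 0x3F);
}

/* Build the 128-entry lookup table. Boundary cells bounce everything
 * straight back and do no other colliding; the normal collision rules
 * are elided here with a pass-through placeholder. */
void build_table(uint8_t table[128])
{
    for (unsigned s = 0; s < 128; s++) {
        if (s & BOUNDARY_BIT)
            table[s] = reverse6((uint8_t)(s & 0x3F));
        else
            table[s] = (uint8_t)s;  /* placeholder for real collision rules */
    }
}
```

Because a full cell reversed is still a full cell, reflective sites conserve particle count by construction.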

The below video looks pretty crappy because the compression squeezes out a lot of the interesting structure.

Periodic boundaries are the simplest to implement, but to further explore the behavior I wanted to create an oscillating boundary on one side, with absorbing constant boundaries on the other three. This was where the Lattice Gas Automata algorithm started getting in the way. Because the particle sites are either on or off, a constant boundary condition would require probabilistically emitting particles at the desired density. I was also interested in being able to perturb the lattice from within (simulating a vibrating object, for instance), and there was no simple way to move the particles around in a physically meaningful way.
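To make the problem concrete, that probabilistic emission would look something like this (a sketch, not something I implemented): each of a boundary cell's six direction bits is set with probability equal to the target density, so the boundary only matches the density in time-average.

```c
#include <stdint.h>
#include <stdlib.h>

/* Fill a boundary cell's six direction bits probabilistically so that
 * the expected occupancy of each direction equals `density` (0..1).
 * Any single step is noisy; only the long-run average is constant. */
uint8_t emit_boundary(double density)
{
    uint8_t state = 0;
    for (int i = 0; i < 6; i++)
        if ((double)rand() / RAND_MAX < density)
            state |= (uint8_t)(1u << i);
    return state;
}
```

The per-step noise this injects is exactly why a boolean lattice makes clean boundary conditions awkward.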

Going Continuous

My first attempt was to implement the same mass- and momentum-conserving collision rules, but in a continuous-valued, deterministic way. I used 6 floats instead of 6 bits to represent my states, and went through each axis calculating how much particle density was colliding and how much was left over to pass through. Colliding particles were evenly distributed between the other 2 axes. This determinism obliterated the dispersive qualities, though. The next step was to try randomizing, so that a random amount from each collision was redirected, but that still did not disperse as I'd hoped. It was time to revisit the literature for other approaches.
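A rough reconstruction of that deterministic rule (my sketch of the approach described, not the original code): on each axis, the head-on overlap min(f[i], f[i+3]) counts as colliding and is split evenly over the four directions of the other two axes. Since a head-on pair carries no net momentum, this conserves both mass and momentum.

```c
/* Continuous, deterministic collision over 6 float densities, where
 * f[i] and f[i+3] point in opposite directions along one axis. */
void collide(float f[6])
{
    float out[6];
    for (int i = 0; i < 6; i++)
        out[i] = f[i];
    for (int a = 0; a < 3; a++) {
        /* Head-on colliding density on this axis (from the pre-collision state). */
        float c = f[a] < f[a + 3] ? f[a] : f[a + 3];
        out[a]     -= c;
        out[a + 3] -= c;
        /* Redistribute the mass 2c evenly over the other 4 directions. */
        for (int b = 0; b < 3; b++) {
            if (b == a)
                continue;
            out[b]     += c / 2.0f;
            out[b + 3] += c / 2.0f;
        }
    }
    for (int i = 0; i < 6; i++)
        f[i] = out[i];
}
```

The rule is exactly conservative, but because every collision resolves the same way, it lacks the randomized microdynamics that give the boolean LGA its dispersion.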

Lattice Boltzmann Methods

Lattice Boltzmann Methods are a further development of Lattice Gasses where the particles are continuous-valued but still live on a discrete lattice. Erlend Magnus Viggen's thesis "The Lattice Boltzmann Method with Applications in Acoustics" is a very good read on the topic, and does an excellent job of breaking down the algorithm. Carolin Wolf's thesis "Platform-independent parallelization of the Lattice Boltzmann method with OpenCL" was also very helpful; I borrowed some of her ideas for implementing the boundary-copying kernels for both the LBM and LGA implementations.

Most of the literature in 2 dimensions uses the D2Q9 lattice, which has 9 velocity vectors, one of which is the zero vector. The reason seems to be mostly convenience, as a D2Q7 hexagonal lattice has the additional complexity of fitting hexagons into a square grid in RAM. I went with the D2Q7 because I already had the mapping from the LGA implementation. Because LBM has weights on the lattice connections (see below), there is more flexibility in the lattice configuration while still maintaining isotropy. This also enables 3D lattices without the extra complexity LGAs have of projecting a 4D lattice into 3D space.

Lattice Boltzmann Collisions

Like the LGA algorithm, the LBM update step is separated into collision and streaming steps. The streaming step is more or less the same as with LGA, but the collision is more interesting.
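The streaming step really is simple: each direction's density just moves to the neighboring cell in that direction. A sketch with double buffering and periodic wrap (the axial hex addressing and direction table here are assumptions for illustration; my actual code packs the hexagonal grid into a square array):

```c
#define NX 8
#define NY 8
#define Q  7

/* Assumed axial-coordinate neighbor offsets for a hex grid: one rest
 * direction (0,0) plus six neighbors. */
static const int DQ[Q] = { 0, 1, 0, -1, -1, 0, 1 };
static const int DR[Q] = { 0, 0, 1, 1, 0, -1, -1 };

/* Push every direction's density to the neighbor in that direction.
 * Double buffering (src -> dst) avoids overwriting values that still
 * need to be read this step. */
void stream(float src[NY][NX][Q], float dst[NY][NX][Q])
{
    for (int r = 0; r < NY; r++)
        for (int q = 0; q < NX; q++)
            for (int i = 0; i < Q; i++) {
                int nq = (q + DQ[i] + NX) % NX;  /* periodic boundaries */
                int nr = (r + DR[i] + NY) % NY;
                dst[nr][nq][i] = src[r][q][i];
            }
}
```

On the GPU this maps naturally to one work item per cell, which is why streaming ports over from the LGA version almost unchanged.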

Viggen has a very good description of the collision, but I'll highlight some important points. Like the LGA, the lattice state variables encode particles with discrete velocities. Rather than simply boolean state as with LGA, in the LBM we have a continuous density for each of the 7 velocity vectors (one of which is the zero vector).

The general idea is that for each cell we first calculate the net density value and velocity vector, and then calculate a new equilibrium distribution that conserves mass and momentum. Finding the equilibrium distribution is where most of the variation in LBM implementations seems to be, and in this case I used the expression from Viggen, which is common in the literature. An interesting fact is that this is equivalent to maximizing entropy within the conservation constraints.

Takeaway Lessons

One of the most frustrating parts of this project was the lack of good debugging tools when working on the GPU. There's no way to step-debug the code, or even use printfs. There is a good tool called gDEBugger that can break on host-side OpenCL calls and inspect the buffers, which was very useful for tracking down issues, but it's no substitute for a decent debugger. I was hoping to have more time to experiment with the memory organization and task queuing, but just getting all of the timing and alignment issues sorted out ended up taking a substantial amount of time.

There was a substantial computation cost to moving from LGA to LBM. My maximum achievable frame rate with LBM on the GPU is currently similar to my LGA implementation on the CPU. There are still many optimizations that I haven't explored that would likely speed up the processing. Nevertheless, I think that the extra flexibility in the simulation (easier boundary conditions, perturbations, interactions with objects, lattice configurations) makes it a powerful tool.