Graphics Processing Units

While the CPU offers a complex instruction set that includes vector extensions (MMX, SSE, AVX), rendering the graphical components of operating systems and video games still demanded dedicated, massively parallel hardware. The GPU can be seen as a collection of lightweight CPUs running the same simple program (a shader) with varying input parameters. They can store data in both dedicated and shared memory.
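
As a minimal sketch of that execution model in CUDA (names and sizes here are illustrative, not from any particular codebase): every thread runs the same program, and only its computed index, its "input parameter", differs.

    #include <cuda_runtime.h>

    // Every thread executes this same program; the only thing that
    // varies is the index derived from its block and thread IDs.
    __global__ void scale(const float *in, float *out, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                    // guard: the grid may overshoot n
            out[i] = factor * in[i];
    }

    // Launch enough 256-thread blocks to cover n elements, e.g.:
    //   scale<<<(n + 255) / 256, 256>>>(d_in, d_out, 2.0f, n);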

GPUs were originally designed for video games, but there is now a shift towards AI computing and real-time simulation, offering viable alternatives to CPU-based supercomputers. Modern GPUs include “tensor cores” (in NVIDIA terminology) that accelerate the operations common in deep learning training and inference. Specifically, they add support for fused multiply-add operations on 4x4 matrices. They are also designed for 16-bit precision (with 16- or 32-bit precision in the accumulator), since lower precision has been shown to reduce the energy cost and compute time of modern neural networks without hurting their accuracy.
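
A hedged sketch of what programming these units looks like, using NVIDIA's warp-level WMMA API (the API exposes the hardware's 4x4 blocks as 16x16x16 warp tiles; sizes and layouts here are illustrative):

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // One warp computes D = A*B + C on a 16x16x16 tile: half-precision
    // inputs, float accumulator, one fused multiply-add per tile.
    __global__ void tensor_tile(const half *A, const half *B, float *D) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;

        wmma::fill_fragment(c, 0.0f);      // C = 0 for this sketch
        wmma::load_matrix_sync(a, A, 16);  // 16 = leading dimension
        wmma::load_matrix_sync(b, B, 16);
        wmma::mma_sync(c, a, b, c);        // the tensor-core FMA
        wmma::store_matrix_sync(D, c, 16, wmma::mem_row_major);
    }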

Corporations that are heavily invested in training and deploying large neural networks have now developed their own hardware that is even more specialized for deep learning. The flagship example is Google’s TPU. TPUs are built around fused matrix multiply-adds, but also include hardware support for convolution operations. Their memory architectures also differ from GPUs’, in that they have operations specifically for loading matrices of weights to be used in matrix operations. They also have hardware implementations of nonlinear activation functions. Google isn’t the only game in town: Apple and Huawei ship dedicated neural processing chips in their cellphones (and Apple includes similar hardware in Apple silicon). A variety of startups offer even more specialized hardware, such as Groq’s Language Processing Unit, which is specifically tailored to the transformer architecture used in the current generation of large language models.

Current status

  • NVIDIA
    • Lovelace (2022, 40-series, e.g. RTX 4090): 16,384 shader cores
    • Ampere (2020, 30-series, e.g. RTX 3090 Ti): 10,752 shader cores

source: NVIDIA

  • AMD
    • RDNA 3 (2022, 7000 series, e.g. RX 7900 XTX): 6,144 shader cores

Shaders

Shaders are specialized programs running on the GPU, each with its own task. Object geometry is represented by a set of points (vertices), usually linked through edges and faces. The first shaders in the pipeline offer control over this geometry information, while the last shader is responsible for producing a color for each pixel (or fragment).

With some trickery, general-purpose computation can be done with shaders. The core idea is to encode the inputs as textures, run the computation in a shader, and read the result back from the final texture, which never has to reach the screen thanks to render-to-texture.

CUDA, OpenCL

These frameworks offer a more direct way of using the GPU for computing, with a cleaner API and advanced math libraries. More info on this later.
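
For a first taste, here is a minimal complete CUDA program (a sketch, not a tutorial): allocate memory visible to both CPU and GPU, launch a kernel like the one above, and synchronize before reading the result.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void add(const float *x, const float *y, float *z, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) z[i] = x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y, *z;
        // Unified memory keeps the sketch short; real code often
        // manages host and device buffers explicitly.
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, n * sizeof(float));
        cudaMallocManaged(&z, n * sizeof(float));
        for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        add<<<(n + 255) / 256, 256>>>(x, y, z, n);
        cudaDeviceSynchronize();          // wait for the GPU to finish

        printf("z[0] = %f\n", z[0]);      // expect 3.0
        cudaFree(x); cudaFree(y); cudaFree(z);
        return 0;
    }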

High Performance Computing (HPC)

Scaling beyond what fits on a single chip or motherboard requires parallelization across machines, so supercomputers are composed of many nodes connected by a high-speed network. This makes programming a supercomputer very different from programming a regular computer, since the nodes don’t share memory. MPI is the de facto standard for synchronizing and exchanging data between nodes in an HPC system; a minimal example follows.
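
A minimal sketch of the MPI model (built with a compiler wrapper like mpicxx and launched with mpirun): each rank computes a private partial result, and an explicit message exchange, here a reduction, combines them.

    #include <cstdio>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // who am I?
        MPI_Comm_size(MPI_COMM_WORLD, &size);  // how many of us?

        double local = rank + 1.0;  // stand-in for a per-node partial result
        double total = 0.0;
        // No shared memory: combining results is an explicit message exchange.
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0) printf("sum over %d ranks = %g\n", size, total);
        MPI_Finalize();
        return 0;
    }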

Individual nodes function for the most part like powerful PCs, with some exceptions. Today most new supercomputers get most of their FLOPS from GPUs rather than CPUs, so each node will typically carry multiple powerful GPUs. NVIDIA V100s are standard today, though the newer A100s are faster still and will presumably be common in the next generation. Using GPUs introduces another division of memory, but that’s no different from the split between RAM and VRAM in any regular computer. Beyond that, it’s now common to see NUMA (non-uniform memory access) nodes, meaning that certain portions of RAM are physically closer to certain processors and can be accessed more quickly.

A single node in Summit, showing NUMA architecture

Supercomputers are most commonly compared by FLOPS, either theoretical peak or sustained performance measured on LINPACK or a similar benchmark. FLOPS is traditionally the most important metric, but FLOPS per Watt is becoming increasingly important. Both metrics are tracked and updated closely at top500.org (see Top500 and Green500).
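
For intuition, theoretical peak is just arithmetic over the datasheet. Using approximate figures for a single NVIDIA V100:

    R_peak = SMs x FP64 units per SM x 2 FLOPs per fused multiply-add x clock
           = 80 x 32 x 2 x 1.53 GHz
           ~ 7.8 TFLOPS (double precision)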

MIT has a new supercomputer called Satori, built in collaboration with IBM. It currently ranks #9 in the world for FLOPS per Watt.

CBA keeps its own table of FLOPS measurements, based on a massively parallel pi calculation. The table currently has entries ranging from Intel 8088 (a variant of the Intel 8086 mentioned above) to Summit, currently the world’s second most powerful computer (in peak FLOPS).
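
Details of CBA’s code aside, a minimal serial sketch of this kind of benchmark (assuming the usual arctangent-flavored series, whose terms are independent and therefore trivially parallelizable):

    #include <cstdio>

    // sum_{i=1..n} 0.5 / ((i - 0.75) * (i - 0.25)) converges to pi;
    // each term depends only on i, so the loop splits naturally
    // across threads, GPUs, or MPI ranks.
    int main() {
        const long n = 100000000;
        double sum = 0.0;
        for (long i = 1; i <= n; i++)
            sum += 0.5 / ((i - 0.75) * (i - 0.25));
        printf("pi ~ %.9f\n", sum);
        return 0;
    }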