HTMAA 2023 > Yohan > Week 10

Networking & Communications

RP2040 Together Strong

An RP2040 Supercomputer

This week I'm prototyping a "supercomputer," or at least whatever that means for microcontrollers. The idea is very simple: one RP2040 ($0.80 SoC, 264kB SRAM, dual ARM Cortex-M0+ @ 133MHz) can take a while to render expensive images, and using both cores reduces those times by a factor of at most two. But N RP2040s might be able to render things like the Mandelbrot set or ray-traced images in real time!
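For a sense of the per-pixel work, here is a minimal escape-time Mandelbrot sketch in C. It assumes Q16.16 fixed-point arithmetic (one common way to avoid software floating point on the Cortex-M0+, which has no FPU), and the names `fix16`, `fix_mul` and `mandel_iters` are hypothetical, not taken from the author's code:

```c
#include <stdint.h>

/* Q16.16 fixed point (an assumption for this sketch): 1.0 == 65536. */
typedef int32_t fix16;
#define FIX_ONE ((fix16)1 << 16)

/* Multiply two Q16.16 values. Assumes an arithmetic right shift of
   negative numbers, which is what GCC and Clang do. */
static fix16 fix_mul(fix16 a, fix16 b) {
    return (fix16)(((int64_t)a * b) >> 16);
}

/* Hypothetical helper: escape-time iteration count for c = cr + ci*i.
   Returns max_iter for points that never escaped (treated as inside the set). */
static int mandel_iters(fix16 cr, fix16 ci, int max_iter) {
    fix16 zr = 0, zi = 0;
    for (int n = 0; n < max_iter; n++) {
        fix16 zr2 = fix_mul(zr, zr);
        fix16 zi2 = fix_mul(zi, zi);
        if (zr2 + zi2 > 4 * FIX_ONE)      /* |z|^2 > 4 means |z| > 2: escaped */
            return n;
        fix16 next_zr = zr2 - zi2 + cr;   /* Re(z^2 + c) */
        zi = 2 * fix_mul(zr, zi) + ci;    /* Im(z^2 + c) */
        zr = next_zr;
    }
    return max_iter;
}
```

Each pixel depends only on its own coordinate, which is what makes splitting a frame across independent workers straightforward.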

For now I'll be using the Seeed XIAO RP2040, but during Wildcard Week I'm hoping to build the board directly from bare RP2040 chips, which should bring the total cost under $4. The board has three RP2040s with two independent cores each, for a total of six workers, each rendering a subsection of the screen.
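One simple way to hand each worker its subsection is to split the frame into horizontal bands. This sketch assumes a 160-row panel and numbers workers as chip × 2 + core; `worker_band` and `worker_id` are hypothetical helpers, not the author's code:

```c
/* Assumed layout: 3 chips x 2 cores = 6 workers. */
#define NUM_WORKERS 6

typedef struct { int y0, y1; } band;   /* rows [y0, y1) */

/* Split `height` rows into NUM_WORKERS contiguous bands whose sizes
   differ by at most one row. */
static band worker_band(int worker, int height) {
    int base = height / NUM_WORKERS;
    int extra = height % NUM_WORKERS;  /* the first `extra` workers get one more row */
    int y0 = worker * base + (worker < extra ? worker : extra);
    int rows = base + (worker < extra ? 1 : 0);
    band b = { y0, y0 + rows };
    return b;
}

/* Hypothetical numbering: chip 2, core 1 -> worker 5. */
static int worker_id(int chip, int core) { return chip * 2 + core; }
```

With 160 rows, workers 0 to 3 get 27 rows each and workers 4 and 5 get 26. Contiguous bands keep each worker's display writes inside one rectangular window, but Mandelbrot cost varies a lot across the image, so interleaving rows is one alternative if some bands finish much later than others.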

Bus Arbitration

The naive approach to this project is to dedicate one RP2040 as the "host" that talks to the LCD and to every other core. This is what I did initially, using UART to communicate between processors. But this wastes an RP2040, doubles the transfer overhead (every pixel crosses UART and then SPI), and complicates the implementation because one MCU runs different code than the other two. So instead I want every core to be a "worker" that talks (via SPI) to the display directly. Now the problem is: how do I ensure that only one worker talks to the display at a time?
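A back-of-the-envelope version of the "double the transfer overhead" point, assuming a 128×160 ST7735 panel at 16 bits per pixel (RGB565); the panel size is an assumption, not a measurement from this build:

```latex
\begin{aligned}
B_{\text{frame}} &= 128 \times 160 \times 2\ \text{bytes} = 40{,}960\ \text{bytes} \\
B_{\text{host}} &= \underbrace{B_{\text{frame}}}_{\text{workers} \to \text{host (UART)}}
                 + \underbrace{B_{\text{frame}}}_{\text{host} \to \text{display (SPI)}}
                 = 81{,}920\ \text{bytes per frame} \\
B_{\text{direct}} &= B_{\text{frame}} = 40{,}960\ \text{bytes per frame}
\end{aligned}
```

At 30 frames per second, even the direct design moves 40,960 × 30 = 1,228,800 bytes/s (about 9.8 Mbit/s) over SPI, so paying for every byte twice adds up quickly.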

This problem is known as distributed bus arbitration, as opposed to centralized bus arbitration, which requires a dedicated arbiter. But the existing schemes I found all require extra hardware (e.g. a MUX), and I'm a couple thousand miles away from the lab this week (happy Thanksgiving). So I'm inventing my own.

Relay Arbitration

Final circuit of this scheme

How this works: every core has two pins, ping and pong, wired in a loop so that each worker's pong drives the next worker's ping. Every time a worker is pinged, it pulses a pong to the next worker in the chain. However, a worker that needs to talk to the display (i.e. it's done rendering its subsection of the frame) can delay that pong, but it must wait to be pinged before doing so. So it works like runners in a relay race, where the baton is the right to talk to the display.
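The relay can be modeled in a few lines of C as a baton passed around a ring. This is an abstract, step-counting sketch under stated assumptions (worker 0 starts with the baton, and times are arbitrary steps), not the PIO implementation itself; `worker` and `relay_frame` are hypothetical names:

```c
#include <stdbool.h>

/* Hypothetical model of one worker in the ring. Times are in abstract steps. */
typedef struct {
    int ready_at;    /* step at which it finishes rendering its subsection */
    int blit_steps;  /* how long it holds the baton while talking to the display */
    bool done;       /* already sent its subsection this frame */
    int started_at;  /* step at which it took the bus (-1 = not yet) */
} worker;

/* Passes the baton around a ring of n workers until each has used the
   display once, and returns the total number of steps. Only the worker that
   was just pinged can hold its pong, so at most one worker owns the bus. */
static int relay_frame(worker *w, int n) {
    int holder = 0;          /* assumption: worker 0 injects the first ping */
    int step = 0;
    int remaining = n;
    while (remaining > 0) {
        worker *cur = &w[holder];
        if (!cur->done && step >= cur->ready_at) {
            cur->started_at = step;    /* pinged and ready: delay the pong... */
            step += cur->blit_steps;   /* ...while talking to the display */
            cur->done = true;
            remaining--;
        }
        step += 1;                     /* pulse pong: ping the next worker */
        holder = (holder + 1) % n;
    }
    return step;
}
```

With three workers ready at steps 5, 0 and 2 and each blitting for 3 steps, they take the bus at steps 9, 1 and 5 respectively, and `relay_frame` returns 13. Worker 0 wasn't ready when the baton first passed, so it waited a full lap: that is the price of having no priority, but the wait is bounded by one trip around the ring.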

This scheme is really nice because there's no notion of priority (e.g. one worker always getting first dibs on the display) and it requires minimal hardware. However, it requires constantly watching the ping line, and even interrupts wouldn't work because the baton circulates at a high frequency (and I want the RP2040 cores spending as much time as possible on rendering). If only there were programmable "mini" processors...

PIO

It turns out that every RP2040 comes with eight PIO (programmable I/O) state machines, in two blocks of four, that are programmed in their own dialect of assembly and run independently of the CPU. The hardware is extremely limited (a set of 9 instructions, and each block's 32-instruction memory is shared by its four state machines), but clever use of it can do really cool stuff: the official examples include PIO implementations of SPI, UART and I2C, and combining state machines can even implement VGA! This is what I used to implement relay arbitration. I tested it first with a single RP2040 (two cores, so two workers), since in theory the same scheme scales to N workers, adding only a small propagation delay per hop for reasonable N.

Here the two cores independently control one of the built-in LEDs, at a super low frequency for demonstration purposes. It looks like a simple blinky program, but I promise you there's bus arbitration implemented here! So far, the PIO code is also very simple:

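ping:
    wait 1 pin 0        ; wait for the previous worker's pong
public hold:
    wait 0 irq 0 rel    ; The CPU raises this IRQ when it needs the
                        ; resource, waits for the state machine to
                        ; reach this instruction, and clears the IRQ
                        ; once it's done with the display. "rel"
                        ; offsets the IRQ number by the SM number,
                        ; so each core gets its own flag.
pong:
    set pins, 1         ; pulse pong to the next worker
    set pins, 0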
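Normally pioasm assembles a program like this, and the SDK's `pio_add_program` loads the result into a PIO block's instruction memory. Purely to show how little space it takes, here is a hand-assembly sketch in C based on my reading of the RP2040 datasheet's instruction encoding (opcode in bits 15:13, delay/side-set in bits 12:8, left at zero here). The helper names are hypothetical, and the values are worth checking against pioasm's own output:

```c
#include <stdint.h>

/* Opcode field, bits 15:13 of every PIO instruction. */
enum { OP_WAIT = 1, OP_SET = 7 };

/* WAIT: bit 7 = polarity, bits 6:5 = source, bits 4:0 = index. */
enum { WAIT_SRC_GPIO = 0, WAIT_SRC_PIN = 1, WAIT_SRC_IRQ = 2 };
/* SET: bits 7:5 = destination, bits 4:0 = 5-bit immediate. */
enum { SET_DEST_PINS = 0, SET_DEST_PINDIRS = 4 };

static uint16_t pio_wait(unsigned polarity, unsigned source, unsigned index) {
    return (uint16_t)((OP_WAIT << 13) | ((polarity & 1u) << 7) |
                      ((source & 3u) << 5) | (index & 0x1fu));
}

/* WAIT IRQ with "rel": bit 4 of the index marks the IRQ number as
   relative to the state machine's own number. */
static uint16_t pio_wait_irq_rel(unsigned polarity, unsigned irq) {
    return pio_wait(polarity, WAIT_SRC_IRQ, 0x10u | (irq & 7u));
}

static uint16_t pio_set(unsigned dest, unsigned value) {
    return (uint16_t)((OP_SET << 13) | ((dest & 7u) << 5) | (value & 0x1fu));
}

/* The relay program above, hand-assembled (no delays, no side-set). */
static void assemble_relay(uint16_t out[4]) {
    out[0] = pio_wait(1, WAIT_SRC_PIN, 0);   /* ping: wait 1 pin 0     */
    out[1] = pio_wait_irq_rel(0, 0);         /* hold: wait 0 irq 0 rel */
    out[2] = pio_set(SET_DEST_PINS, 1);      /* pong: set pins, 1      */
    out[3] = pio_set(SET_DEST_PINS, 0);      /*       set pins, 0      */
}
```

The whole arbiter is four 16-bit words, leaving 28 of the block's 32 instruction slots free, and both state machines in a block can run the same copy of the program.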

Then I added the display using my ST7735 drivers from week 2 to make it a little more interesting:

Above: the second core is disabled to demonstrate that the first core only touches its part of the screen. If I enable both cores, they take turns on the display while rendering their images in parallel!

Will it Scale?

Now I needed to test whether this would scale to N microcontrollers. I was on vacation, so all I could bring was a breadboard, but that's good enough for an LED demo:

Above: four cores across two RP2040s, each controlling its own LED and turning it on while it holds the bus. Again, the frequency is kept super low to show that only one worker holds the bus at a time.

Super Important Note

Something that might be overlooked here is that you cannot connect two output pins together, or they may short. Even if you ensure that only one of them drives high at a time, the others are still actively driving low, so this will not work. PIO state machines can dynamically switch GPIO pins between output and input, which lets my implementation not only work but also avoid frying pins. That being said, due to hardware limitations (PIO maps pins as consecutive ranges), all my pins had to be consecutive, so breadboard prototyping was no longer feasible.
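A tiny model of the problem: each GPIO on a shared line is either an input (high impedance) or an output driving 0 or 1. This is a host-side C sketch with hypothetical names (`pin_state`, `resolve_line`, `grant_line`), not PIO or SDK code, and it treats outputs at the same level as harmless, which only holds while they never disagree:

```c
#include <stdbool.h>

/* One GPIO wired to the shared line. */
typedef struct { bool is_output; int level; } pin_state;

typedef enum { LINE_FLOATING, LINE_LOW, LINE_HIGH, LINE_CONTENTION } line_state;

/* Resolve what the shared line does given every pin attached to it. */
static line_state resolve_line(const pin_state *pins, int n) {
    int drivers = 0, level = -1;
    for (int i = 0; i < n; i++) {
        if (!pins[i].is_output) continue;          /* inputs don't drive */
        if (drivers > 0 && pins[i].level != level)
            return LINE_CONTENTION;                /* high vs low: a short */
        level = pins[i].level;
        drivers++;
    }
    if (drivers == 0) return LINE_FLOATING;
    return level ? LINE_HIGH : LINE_LOW;
}

/* Arbitrated setup: only the baton holder's pin is an output; every other
   worker leaves its pin as an input (what PIO's pin-direction control allows). */
static void grant_line(pin_state *pins, int n, int holder, int level) {
    for (int i = 0; i < n; i++) {
        pins[i].is_output = (i == holder);
        pins[i].level = (i == holder) ? level : 0;
    }
}
```

With every pin left as an output, one worker driving high against the others driving low resolves to contention; with only the baton holder configured as an output, the line simply follows that worker.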

PCB

So I designed a PCB in KiCad that supports three RP2040s and a TFT display.

Same demo from above, but now with six workers!

Moving Everything to PIO

So far, I'd been using PIO only for the real-time, high-frequency arbitration logic and bothering the CPU (via interrupts) only when needed. But I wanted to experiment with moving everything to PIO: re-implement the TFT drivers in PIO and add a FIFO that the CPU pushes to (well, actually DMA does the pushing, so all of this is asynchronous and the CPU spends 100% of its time rendering). And uh... this was the beginning of the downfall for this week.
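A host-side model of that pipeline, under stated assumptions: pixels are RGB565 (one of the formats the ST7735 accepts), two pixels are packed per 32-bit FIFO word with the first pixel in the upper half (this ordering depends on the state machine's shift configuration and is an assumption), and the FIFO is 8 words deep, as a PIO TX FIFO is when joined with its RX FIFO. All names are hypothetical; on the real chip, DMA is paced by the state machine's TX DREQ rather than by checking a return value:

```c
#include <stdbool.h>
#include <stdint.h>

/* 8-bit-per-channel colour to RGB565 (5 bits red, 6 green, 5 blue). */
static uint16_t rgb565(uint8_t r, uint8_t g, uint8_t b) {
    return (uint16_t)(((r & 0xF8) << 8) | ((g & 0xFC) << 3) | (b >> 3));
}

/* Assumed packing: first pixel in the upper 16 bits. */
static uint32_t pack_pixels(uint16_t first, uint16_t second) {
    return ((uint32_t)first << 16) | second;
}

#define FIFO_DEPTH 8   /* TX FIFO joined with RX FIFO */

typedef struct {
    uint32_t slots[FIFO_DEPTH];
    unsigned head, tail, count;
} word_fifo;

/* Producer side (DMA in the real design). */
static bool fifo_push(word_fifo *f, uint32_t w) {
    if (f->count == FIFO_DEPTH) return false;   /* full: DMA waits for DREQ */
    f->slots[f->tail] = w;
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    f->count++;
    return true;
}

/* Consumer side (the display state machine). */
static bool fifo_pull(word_fifo *f, uint32_t *w) {
    if (f->count == 0) return false;            /* empty: a blocking pull stalls */
    *w = f->slots[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return true;
}
```

Because the producer only ever stalls when the FIFO is full, the renderer never waits on the display directly; it only waits when the state machine falls behind.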

Granted, I got it to work perfectly with just one worker...

And also four workers (I've shorted the ping/pong pins of the middle MCU to essentially remove it)...

But as soon as I add a third MCU, regardless of its position, everything breaks. And that glosses over quite a few hours of work, but at this point I'd run out of both time and sanity for the week. So I'll return to this supercomputer and do some cool stuff with it during Wildcard Week.