maxl over unreliable networks

2024-03-14

Jake published a paper on MAXL, a distributed/networked machine controller that does time sync and manages trajectory buffers.

As I wrote about here, I'm working on meshing + IP routing on embedded. The short version is that I want this because it's compatible with, you know, the internet: you can reach essentially everything, from anywhere. Security, obviously, would be a huge concern with exposing this to the internet, and it would need to be locked down before doing so. But you could, and more relevantly, IP allows arbitrary network topologies, up to the limitations of your link budget. It would make it comparatively easy to plug in a new actuator, rearrange cables, etc.

The apparent problem with doing this is that dynamically destination-routed packets in an unreliable network may never arrive, and if we're using them for machine control, that seems bad — MAXL depends on realtime network qualities that an IP network (that hides the PHY layer) totally trashes. My answer to this problem is to make the MAXL system CAM- and kinematics-aware, in order to ensure that in the worst case, the machine pauses briefly in a safe location while waiting for new packets.

trajectory planning

How? As part of the CAM process, partition your tool trajectories into those with tool engagement and those without. Plunge -> side-mill -> retract might be a trajectory with engagement, and a rapid above the plane of the part is an example of one without.

(For simplicity we'll treat in-plane rapids as engaged moves, since the actuators still require some synchronization, even if not precise synchronization; we can cut corners here to optimize, but that can come later.)
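
To make that partition concrete, here's a minimal sketch in Python; the `Segment` type and `engaged` flag are my own stand-ins for whatever the CAM layer actually emits, not anything from MAXL:

```python
from dataclasses import dataclass
from itertools import groupby

@dataclass
class Segment:
    start: tuple[float, float, float]  # XYZ at the start of the move
    end: tuple[float, float, float]    # XYZ at the end of the move
    engaged: bool                      # True for cuts (and, per above, in-plane rapids),
                                       # False for rapids above the plane of the part

def partition(segments: list[Segment]) -> list[list[Segment]]:
    """Group consecutive segments by engagement, so a run like
    plunge -> side-mill -> retract becomes one atomic engaged trajectory."""
    return [list(run) for _, run in groupby(segments, key=lambda s: s.engaged)]
```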

If you're going to execute an engaged trajectory (using this plunge -> side-mill -> retract as our example), you make sure all motors can execute it atomically before committing to execution. That is, the whole trajectory must be loaded into memory on all distributed actuators, and you only command execution of this one safe trajectory at a time. Now if you let the machine run, it will do one safe thing — it will end in the retracted state, not engaged with the part.

If the command is disengaged, then you can execute the command without waiting for global synchronization from all motors.

In general, as long as you can ensure that all actuators have loaded or executed all programs up to the end of a given engaged trajectory in the program, you can queue execution of that trajectory. This means that you can keep loading trajectory data as long as the motor has available memory, and queue execution as all other actuators report readiness.
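
As controller-side logic it might look roughly like this; the actuator interface (load / is_loaded / execute) is hypothetical, just names to make the idea concrete:

```python
import time
from typing import Protocol, Sequence

class Actuator(Protocol):
    """Hypothetical network interface to one motor (illustrative only)."""
    def load(self, traj_id: int, data: bytes) -> None: ...   # buffer trajectory data
    def is_loaded(self, traj_id: int) -> bool: ...           # whole trajectory in memory?
    def execute(self, traj_id: int) -> None: ...             # queue execution

def run_program(program: Sequence[tuple[bytes, bool]], actuators: Sequence[Actuator]) -> None:
    """program is a list of (trajectory_data, engaged) pairs, in execution order."""
    for traj_id, (data, engaged) in enumerate(program):
        for a in actuators:
            a.load(traj_id, data)   # best-effort over the unreliable network; may retransmit

        if engaged:
            # Engaged trajectories are atomic: don't command execution until every
            # actuator reports that it holds the whole trajectory in memory.
            while not all(a.is_loaded(traj_id) for a in actuators):
                time.sleep(0.01)
        # Disengaged moves skip the barrier: no global synchronization needed.

        for a in actuators:
            a.execute(traj_id)
```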

These assumptions mean that we don't need a realtime network. If we only execute when it's safe to do so, and every trajectory ends in a safe place, then we can tolerate arbitrary delays and dropped packets: in the worst case the machine pauses, disengaged, while it waits for data.

estop

In order to avoid Byzantine problems, I'm assuming the machine has an estop reachable by the controller, and I treat it as an absolutely reliable communication channel. (For robustness to Byzantine faults we can give it watchdog-timer functionality, so a network partition means estop.) This solves the problem of starting a trajectory: the controller wants to tell the motors to start, but only wants them all to start if they're all ready, and establishing that readiness requires another roundtrip, which the motors may also fail to acknowledge, and so on.

Instead, send the "go" as normal, and if any motors don't acknowledge within a window, trigger estop — this doesn't sacrifice functionality for us, as this is definitely a failure condition for a distributed machine.
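
Sketched out, with made-up method names and an arbitrary ack window:

```python
import time

GO_ACK_WINDOW_S = 0.1  # illustrative; derive the real number from your link budget

def commanded_start(actuators, estop, traj_id: int) -> bool:
    """Send 'go' optimistically; if any actuator fails to acknowledge within the
    window, fire the (reliable) estop rather than negotiating forever."""
    for a in actuators:
        a.execute(traj_id)                 # the "go" message, best-effort

    deadline = time.monotonic() + GO_ACK_WINDOW_S
    pending = set(actuators)
    while pending and time.monotonic() < deadline:
        pending = {a for a in pending if not a.acked(traj_id)}
        time.sleep(0.001)

    if pending:
        estop.trigger()                    # a missed ack is a machine fault, not a retry
        return False
    return True
```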

safe tails + leaders

What happens if an atomic trajectory is too long to fit in memory? The scheme relies on every atomic trajectory fitting. As required, compute "safe tails" for too-long atomic trajectories (power off the spindle, raise Z, etc.) and inject them into the program in order to create synthetic safe points (you'll need to create safe lead-ins as well, which undo the tail operation). This has the effect of producing new atomic subtrajectories, which you intentionally size to fit into memory on your actuators.
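
A rough sketch of that CAM-side splitting; sizeof, make_tail and make_leader are hypothetical callbacks standing in for whatever your CAM layer provides:

```python
def inject_safe_points(segments, sizeof, memory_budget, make_tail, make_leader):
    """Split an over-long atomic trajectory into subtrajectories that each fit in
    the smallest actuator's memory. make_tail emits the retract-to-safety moves
    (spindle off, raise Z); make_leader emits the lead-in that undoes them."""
    subtrajectories, current, used = [], [], 0
    for seg in segments:
        if current and used + sizeof(seg) > memory_budget:
            current.extend(make_tail(current[-1]))   # synthetic safe point...
            subtrajectories.append(current)
            current = list(make_leader(seg))         # ...and the lead-in that undoes it
            used = sum(sizeof(s) for s in current)
        current.append(seg)
        used += sizeof(seg)
    if current:
        subtrajectories.append(current)
    return subtrajectories
```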

In the normal case, you're aiming not to execute these — you try to load the whole next segment, acquire global actuator acknowledgement, and cancel the next tail/leader pair before the distributed machine starts to run it.

If the tail does get executed, though, because the machine is stuck waiting for new data, the result is that it automatically goes to a safe spot to wait.