Custom / Application Message Formats

I don’t like calling this page “application-layer” because that’s only true in the conventional OSI sense—you could do L2 networking with JSON. You wouldn’t want to, but you could.

In particular, I often use protobuf for what are essentially L2 protocols over serial, wrapped in COBS.

Self-describing formats

A message format is self-describing if it can be parsed from a common description without application-specific knowledge. For example, JSON, YAML, msgpack, CBOR, and INI are all self-describing because you can always parse a valid document, even if it doesn’t fit into your model.

By contrast, a custom-specified binary struct format is not self-describing because you need a priori knowledge about the encoding itself in order to decode data.
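As a rough illustration of the difference, here is a minimal Python sketch using only the standard library. The struct layout `<Ifh` is a made-up example, not something from a real protocol.

```python
import json
import struct

# Self-describing: the parser recovers types and structure with no schema.
doc = json.loads('{"key": 123, "another_key": [true, 1, {}, null]}')
print(doc)

# Not self-describing: these bytes are meaningless without the layout.
# LAYOUT is a hypothetical format: little-endian uint32, float32, int16.
LAYOUT = "<Ifh"
raw = struct.pack(LAYOUT, 1000, 21.5, -3)
print(raw.hex())

# Decoding only works because we already know LAYOUT.
fields = struct.unpack(LAYOUT, raw)
print(fields)
```

Anyone can load the JSON document, but a receiver that guesses the wrong layout for `raw` will silently get garbage values back.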

JSON

{
    "key": 123,
    "another_key": [true, 1, {}, null]
}

JSON (JavaScript Object Notation) is an extremely common format on the web, where memory is plentiful and strings are cheap. As the name implies, it was inspired by a subset of JavaScript.

In approximate, quasi-EBNF:

json := object | array | string | number | bool | null
object := '{' + kvs + '}'
kvs := kv + ',' + kvs | kv
kv := '"' + string + '":' + json
array := '[' + vals + ']'
vals := json + ',' + vals | json

JSON does not have dedicated types for binary data or integers, but conventionally users select base64-encoded strings or occasionally integer (0-255) arrays to represent bytes.
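Both conventions are easy to try out. This is a small sketch with Python's standard library; the `"data"` key is just a placeholder name.

```python
import base64
import json

payload = bytes([0, 255, 16, 32])

# Option 1: base64 string. Compact, but opaque to a human reader.
as_b64 = json.dumps({"data": base64.b64encode(payload).decode("ascii")})

# Option 2: array of integers in 0-255. Readable, but larger on the wire.
as_ints = json.dumps({"data": list(payload)})

# Both round-trip back to the original bytes.
decoded_b64 = base64.b64decode(json.loads(as_b64)["data"])
decoded_ints = bytes(json.loads(as_ints)["data"])

print(as_b64)
print(as_ints)
```

Either way, the receiver has to know by convention that the field holds bytes, since the JSON itself only says "string" or "array of numbers".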

Please note that all of the following are valid JSON documents (for instance):

true

null

[1, true, []]

JSON does not have to have a root-level object, though this is commonly the most convenient thing.
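A quick way to confirm this is to feed such documents to a standard parser. The sketch below uses Python's `json` module.

```python
import json

# Each of these is a complete, valid JSON document with a non-object root.
documents = ["true", "null", "[1, true, []]"]

parsed = [json.loads(text) for text in documents]
for text, value in zip(documents, parsed):
    print(f"{text!r} -> {value!r}")
```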

JSON has usability problems (no trailing commas, no comments, not compact, costly to parse), but it’s very common, and it concisely expresses nearly the minimum set of types required to serialize almost any data structure.

msgpack

msgpack, as the linked website says, is like JSON, but (relatively) fast and small. It’s a binary format that has mostly the same types as JSON, but with a dedicated binary type, so is somewhat more useful in an embedded context.

It still suffers from the main cost of self-describing formats, however: every message has to carry all of its dictionary field names as strings. This wastes a lot of bandwidth and memory when the same fields are sent over and over.
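To put a number on that overhead, here is a hand-rolled sketch that encodes one small map in msgpack and compares it with a bare packed struct. It is not the msgpack library, only the three type codes this example needs (fixmap, fixstr, float 32), and the field names are borrowed from the temperature/humidity example used later on this page.

```python
import struct

def pack_str(s: str) -> bytes:
    """msgpack fixstr: 0xa0 | length, then the UTF-8 bytes (length < 32 only)."""
    data = s.encode("utf-8")
    if len(data) >= 32:
        raise ValueError("this sketch only handles short strings")
    return bytes([0xA0 | len(data)]) + data

def pack_f32(x: float) -> bytes:
    """msgpack float 32: 0xca, then a big-endian IEEE 754 single."""
    return b"\xca" + struct.pack(">f", x)

def pack_map(d: dict) -> bytes:
    """msgpack fixmap: 0x80 | entry count, then key/value pairs (count < 16 only)."""
    if len(d) >= 16:
        raise ValueError("this sketch only handles small maps")
    out = bytes([0x80 | len(d)])
    for key, value in d.items():
        out += pack_str(key) + pack_f32(value)
    return out

reading = {"temp_c": 21.5, "relative_humidity": 40.0}
as_msgpack = pack_map(reading)
as_struct = struct.pack("<ff", *reading.values())

print(len(as_msgpack), "bytes as msgpack")
print(len(as_struct), "bytes as a bare struct")
```

The two floats take 8 bytes either way. Everything else in the msgpack message is type codes and field names.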

COBS

Consistent Overhead Byte Stuffing is a method for packet framing. It’s often used on top of a serial PHY to build a protocol that can send and receive multi-byte packets.

The problem solved by COBS is how to know where each packet begins and ends in a stream of bytes. If there’s a byte value you know you’ll never use otherwise, you can use it as a packet delimiter. This can work well if you’re only sending data from some discrete set of values. But if you’re going to send arbitrary numeric values, sooner or later any byte value may show up in one of your packets. So it’s helpful to encode our data in some way as to reserve a byte to be an unambiguous packet delimiter.

COBS is just such an algorithm: it replaces all zeros in a packet with other values, so that the null byte can be used as the delimiter. Each zero gets replaced with the distance to the next zero (or to the packet delimiter), and we prepend a byte that stores the distance to the first zero. So if we want to encode 5 0 3 6 0 0 4 (in decimal notation), we end up with 2 5 3 3 6 1 2 4 0, where the final 0 is the delimiter. (When there are more than 254 nonzero bytes in a row, we have to insert additional control bytes to break up the data.) When decoding, we read a control byte, copy the data bytes it covers, and emit a zero in place of the next control byte, unless that control byte was one of the extra ones inserted to break up a long run.
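The scheme is short enough to sketch in full. The Python below is a minimal, unoptimized take on it rather than a reference implementation; the function names are my own, and the trailing delimiter is added by the caller rather than by the encoder.

```python
def cobs_encode(data: bytes) -> bytes:
    """Encode data so it contains no zero bytes (delimiter not included)."""
    out = bytearray([0])  # placeholder for the first control byte
    code_idx = 0          # position of the control byte being filled in
    code = 1              # distance from that control byte to the next zero
    for byte in data:
        if byte == 0:
            out[code_idx] = code      # close the current block at this zero
            code_idx = len(out)
            out.append(0)             # placeholder for the next control byte
            code = 1
        else:
            out.append(byte)
            code += 1
            if code == 0xFF:          # 254 nonzero bytes: forced break
                out[code_idx] = code
                code_idx = len(out)
                out.append(0)
                code = 1
    out[code_idx] = code
    return bytes(out)


def cobs_decode(encoded: bytes) -> bytes:
    """Decode one frame (delimiter already stripped)."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        code = encoded[i]
        block = encoded[i + 1 : i + code]
        if code == 0 or 0 in block:
            raise ValueError("zero byte inside COBS frame")
        if len(block) != code - 1:
            raise ValueError("truncated COBS frame")
        out += block
        i += code
        # A control byte below 0xFF stands for a zero, unless it ends the frame.
        if code < 0xFF and i < len(encoded):
            out.append(0)
    return bytes(out)


packet = bytes([5, 0, 3, 6, 0, 0, 4])
frame = cobs_encode(packet) + b"\x00"  # append the delimiter for the wire
print(list(frame))
print(list(cobs_decode(frame[:-1])))
```

A receiver can split the incoming byte stream on `0x00` and pass each chunk to `cobs_decode`, since a zero can only ever be a delimiter.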

Generated formats

Protobuf

(Largely reproduced from post on Nathan’s class site).

Protobuf is a language-independent serialization and deserialization format with code generation for most programming languages and a relatively compact wire (binary) encoded format.

Why?

syntax = "proto3";

package dev.npry.machines.example;

message Vec3 {
    float x = 1;
    float y = 2;
    float z = 3;
}

message IMUReading {
    Vec3 accel = 1;
    Vec3 gyro = 2;
    Vec3 mag = 3;
}

message TempHumReading {
    float temp_c = 1;
    float relative_humidity = 2;
}

message SensorReading {
    uint64 timestamp_epoch_millis = 1;

    oneof reading {
        IMUReading imu = 2;
        TempHumReading temp_hum = 3;
    };
}

Automatically generates something like this:

$ nanopb_generator sensor.proto
# produces -> sensor.pb.h, sensor.pb.c

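// sensor.pb.h
// (simplified: real nanopb output derives type names from the package and message names)
#include <pb.h>

typedef struct {
    float x;
    float y;
    float z;
} vec3;

typedef struct {
    vec3 accel;
    vec3 gyro;
    vec3 mag;
} imu_reading;

typedef struct {
    float temp_c;
    float relative_humidity;
} temp_hum_reading;

typedef struct {
    uint64_t timestamp_epoch_millis;

    pb_size_t which_reading; // tag of the oneof member that is set
    union {
        imu_reading imu;
        temp_hum_reading temp_hum;
    } reading;
} sensor_reading;

#define sensor_reading_imu_tag 2
#define sensor_reading_temp_hum_tag 3
#define sensor_reading_fields // encoded field info
// cont'd for other structs

Which lets you write code like this:

// main.c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <pb_encode.h>
#include <pb_decode.h>
#include "sensor.pb.h"

// Board-specific helpers, assumed to be defined elsewhere (hypothetical names).
imu_reading read_imu(void);
uint64_t now(void);
void uart_send(const uint8_t *buf, size_t len);
size_t uart_receive(uint8_t *buf, size_t cap); // returns the number of bytes received

int main(void) {
    uint8_t buf[256];
    pb_ostream_t ostream = pb_ostream_from_buffer(buf, sizeof(buf));

    sensor_reading reading = {
        .timestamp_epoch_millis = now(),

        // select the oneof member, then fill it in
        .which_reading = sensor_reading_imu_tag,
        .reading.imu = read_imu(),
    };

    if (!pb_encode(&ostream, sensor_reading_fields, &reading)) {
        fprintf(stderr, "failed to encode protobuf: %s\n", PB_GET_ERROR(&ostream));
        return EXIT_FAILURE;
    }

    uart_send(buf, ostream.bytes_written);
    size_t received = uart_receive(buf, sizeof(buf));

    // only decode the bytes that actually arrived
    pb_istream_t istream = pb_istream_from_buffer(buf, received);
    if (!pb_decode(&istream, sensor_reading_fields, &reading)) {
        fprintf(stderr, "failed to decode protobuf: %s\n", PB_GET_ERROR(&istream));
        return EXIT_FAILURE;
    }

    if (reading.which_reading == sensor_reading_temp_hum_tag) {
        printf("remote temperature: %fC\n", reading.reading.temp_hum.temp_c);
    }

    return EXIT_SUCCESS;
}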

Or in Python, you get classes, something like this (API inexact):

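$ protoc --python_out=gen sensor.proto
# betterproto is preferred (get the beta!)

from typing import Optional

class TempHumReading:
    @property
    def temp_c(self) -> float:
        ...

    # ...

class SensorReading:
    @property
    def timestamp_epoch_millis(self) -> int:
        ...

    @property
    def temp_hum(self) -> Optional[TempHumReading]:
        ...

    # ...

def main():
    bytes_ = read_serial()  # hypothetical helper returning one encoded message
    # inexact: the official protobuf runtime spells this SensorReading.FromString(bytes_)
    reading = SensorReading.decode(bytes_)

    # temp_hum is None when the oneof holds an IMU reading instead
    if reading.temp_hum is not None:
        print(reading.temp_hum.temp_c)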

Or in Rust, Go, Java, JavaScript, C#, C++ – code generation and bindings exist for every mainstream language, and it means you don’t need to rewrite your serialization code everywhere if you need to change something.
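To see where the compactness comes from, the TempHumReading message above can be encoded by hand. This sketch covers only `float` fields with field numbers up to 15, and the helper name is my own. Each field becomes a one-byte tag (field number and wire type) followed by four little-endian bytes, with no field names on the wire.

```python
import struct

WIRE_TYPE_I32 = 5  # protobuf wire type for fixed 32-bit values such as float

def encode_float_field(field_number: int, value: float) -> bytes:
    """Encode one protobuf float field: tag byte, then 4 little-endian bytes."""
    tag = (field_number << 3) | WIRE_TYPE_I32
    if tag >= 0x80:
        raise ValueError("field numbers above 15 need a multi-byte tag")
    return bytes([tag]) + struct.pack("<f", value)

# TempHumReading { temp_c = 21.5 (field 1), relative_humidity = 40.0 (field 2) }
encoded = encode_float_field(1, 21.5) + encode_float_field(2, 40.0)
print(encoded.hex(" "))
print(len(encoded), "bytes")
```

The same reading as a msgpack or JSON map would spend most of its bytes on the strings `temp_c` and `relative_humidity`. Here the field numbers from the `.proto` file stand in for them, which is why both sides need the generated code.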

Flatbuffers

Flatbuffers is another Google project similar to Protobuf. It’s intended for resource-constrained environments like video games that want to minimize parsing overhead. As a result, the generated code interfaces tend to be less ergonomic, but they also carry significantly less overhead and give you more control over memory (a plus for embedded).

The best way to see usage is just to read the tutorial.

The set of supported languages is slightly smaller than Protobuf’s, but it hits everything mainstream.