Interface for debugging CV on ESP32S3

A web interface served from the ESP32S3 to preview the camera feed, adjust transforms, and test a computer vision pipeline in real time.

Tags

esp32s3, wifi, camera ov2640, computer vision, debugging, web interface

Assignment

  • individual assignment:
    • write an application that interfaces a user with an input &/or output device that you made

Context

This is part of HyperCarousel, my final project. The HyperCarousel system uses a camera to read handwritten lookup tables on custom slide mounts. Getting the computer vision pipeline to work requires constant iteration — adjusting crop regions, threshold values, and transform parameters. Recompiling and reflashing firmware for every tweak is tedious.

The solution: a web interface served directly from the ESP32S3 that lets me preview the camera feed, adjust parameters live, and see results immediately.

Hardware

I'm using a Seeed XIAO ESP32S3 Sense — the variant with a detachable OV2640 camera module. The ESP32S3 has enough RAM for image processing and enough flash for serving a web interface.

I designed a simple breakout board to expose I2C for communication with the main controller:

ESP32S3 camera breakout schematic
ESP32S3 camera breakout PCB
Front — OV2640 module and FPC connector
Back — JST connector for I2C

What Needs Debugging

The CV pipeline reads handwritten digits from a lookup table. It has five stages, and each stage has parameters that need tuning:

Step 0: Raw camera capture (480×640, rotated 90° CCW)

Step 1: Crop — Remove borders to focus on the card area. Crop percentages depend on camera mounting position.

Step 1: Cropped to 389×378 pixels
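
As a sketch of how this stage might look on-device, assuming the GrayImage type used in the server code below plus an alloc() helper I've invented (cropPercent and its margin parameters are illustrative names, not the project's actual implementation):

// Hypothetical crop stage: remove a percentage margin from each edge.
// Assumes GrayImage holds 8-bit pixels in row-major order.
void cropPercent(const GrayImage& in, GrayImage& out,
                 float left, float top, float right, float bottom) {
  int x0 = (int)(in.width * left);
  int y0 = (int)(in.height * top);
  out.width  = in.width  - x0 - (int)(in.width * right);
  out.height = in.height - y0 - (int)(in.height * bottom);
  out.alloc();  // assumed helper: reserves width * height bytes
  for (int y = 0; y < out.height; y++)
    memcpy(out.data + y * out.width,
           in.data + (y0 + y) * in.width + x0,
           out.width);
}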

Step 2: Binary threshold — Adaptive thresholding converts grayscale to binary. Block size and constant C need tuning.

Step 2: Binary image — black for dark features, white for background
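
A minimal sketch of adaptive mean thresholding with the two parameters the interface tunes; the function name and the plain windowed mean (rather than an integral image) are my simplifications:

// Hypothetical adaptive threshold: compare each pixel against the mean of
// a blockSize x blockSize window, offset by the constant C.
// Assumes out is already allocated to the same dimensions as in.
void adaptiveThreshold(const GrayImage& in, GrayImage& out,
                       int blockSize, int C) {
  int r = blockSize / 2;
  for (int y = 0; y < in.height; y++) {
    for (int x = 0; x < in.width; x++) {
      long sum = 0;
      int n = 0;
      for (int dy = -r; dy <= r; dy++) {
        for (int dx = -r; dx <= r; dx++) {
          int yy = y + dy, xx = x + dx;
          if (yy < 0 || yy >= in.height || xx < 0 || xx >= in.width) continue;
          sum += in.data[yy * in.width + xx];
          n++;
        }
      }
      // Dark features go to 0, background to 255, matching the debug view.
      out.data[y * out.width + x] =
          (in.data[y * in.width + x] < sum / n - C) ? 0 : 255;
    }
  }
}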

Step 3: L-mark detection — Custom corner markers for perspective correction. Arm length thresholds affect detection reliability.

TL kernel response (brighter = stronger match)
BR kernel response
Step 3: All four L-marks detected (red dots mark corners)
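
One way such a kernel response could be computed for the top-left mark, counting dark pixels along two arms (tlResponse and armLen are names I've assumed; the project's actual kernel shape may differ):

// Hypothetical top-left L-mark response: count dark binary pixels along a
// horizontal and a vertical arm extending from (x, y). A higher score means
// a stronger match, which is what the response images above visualize.
int tlResponse(const GrayImage& bin, int x, int y, int armLen) {
  if (x + armLen >= bin.width || y + armLen >= bin.height) return 0;
  int score = 0;
  for (int i = 0; i < armLen; i++) {
    if (bin.data[y * bin.width + (x + i)] == 0) score++;  // horizontal arm
    if (bin.data[(y + i) * bin.width + x] == 0) score++;  // vertical arm
  }
  return score;
}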

Step 4: Perspective transform — Warp the detected quad to a rectangle.

Step 4: Perspective-corrected grayscale
Step 4: Warped binary for digit detection
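
A lightweight way to warp the detected quad, bilinearly interpolating the four corner positions instead of solving a full homography (Point, warpQuad, and the TL/TR/BR/BL corner order are my assumptions; the real pipeline may do this differently):

// Hypothetical warp: map each output pixel back into the quad by blending
// the corner coordinates, then sample the nearest source pixel.
struct Point { float x, y; };

void warpQuad(const GrayImage& in, GrayImage& out, const Point c[4]) {
  for (int v = 0; v < out.height; v++) {
    float t = (float)v / (out.height - 1);
    for (int u = 0; u < out.width; u++) {
      float s = (float)u / (out.width - 1);
      // Blend top edge (TL to TR) and bottom edge (BL to BR), then vertically.
      float sx = (1 - t) * ((1 - s) * c[0].x + s * c[1].x)
               + t * ((1 - s) * c[3].x + s * c[2].x);
      float sy = (1 - t) * ((1 - s) * c[0].y + s * c[1].y)
               + t * ((1 - s) * c[3].y + s * c[2].y);
      int xi = (int)sx, yi = (int)sy;
      if (xi < 0 || yi < 0 || xi >= in.width || yi >= in.height) continue;
      out.data[v * out.width + u] = in.data[yi * in.width + xi];
    }
  }
}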

Step 5: Digit recognition — Connected component analysis and heuristic classification.

Step 5: Detected digit bounding boxes
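
A sketch of the connected-component pass, collecting bounding boxes with a stack-based flood fill (Box, findComponents, and the speck filter are illustrative; the heuristic classifier that follows is the project's own):

#include <vector>

// Hypothetical component pass: flood-fill each dark blob in the binary
// image (destructively, marking visited pixels white) and record its
// bounding box. Tiny boxes are dropped as noise.
struct Box { int x0, y0, x1, y1; };

void findComponents(GrayImage& bin, std::vector<Box>& boxes) {
  std::vector<std::pair<int, int>> stack;
  for (int y = 0; y < bin.height; y++) {
    for (int x = 0; x < bin.width; x++) {
      if (bin.data[y * bin.width + x] != 0) continue;
      Box b = {x, y, x, y};
      stack.push_back({x, y});
      bin.data[y * bin.width + x] = 255;  // mark visited
      while (!stack.empty()) {
        int cx = stack.back().first, cy = stack.back().second;
        stack.pop_back();
        if (cx < b.x0) b.x0 = cx;
        if (cx > b.x1) b.x1 = cx;
        if (cy < b.y0) b.y0 = cy;
        if (cy > b.y1) b.y1 = cy;
        const int dx[] = {1, -1, 0, 0}, dy[] = {0, 0, 1, -1};
        for (int k = 0; k < 4; k++) {
          int nx = cx + dx[k], ny = cy + dy[k];
          if (nx < 0 || ny < 0 || nx >= bin.width || ny >= bin.height) continue;
          if (bin.data[ny * bin.width + nx] != 0) continue;
          bin.data[ny * bin.width + nx] = 255;
          stack.push_back({nx, ny});
        }
      }
      if (b.x1 - b.x0 > 2 && b.y1 - b.y0 > 2) boxes.push_back(b);
    }
  }
}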

The Web Interface

The ESP32S3 runs a web server that serves an HTML/JS interface. The interface fetches processed images from the device and displays them in the browser.

Web debug interface — camera feed, orientation controls, keystone correction

Key Features

Orientation controls — The camera is mounted sideways, so the image needs rotation. But I wasn't sure which combination of rotation and flip would give the correct orientation. The interface offers 8 combinations (90° CW/CCW × none/180°/flipH/flipV) so I can find the right one without recompiling.

Keystone correction slider — If the camera isn't perfectly perpendicular to the card, the image appears trapezoidal. The slider applies a horizontal keystone transform to correct this.
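
A sketch of what keystoneH (the function the handler below calls) might do, inverse-mapping each output pixel to a source column whose horizontal scale varies linearly from top to bottom; the exact math is my guess at one reasonable implementation:

// Hypothetical implementation: rescale each row around the image center by
// a factor that varies linearly with y, so the top and bottom edges end up
// with different effective widths. Assumes out matches in's dimensions.
void keystoneH(const GrayImage& in, GrayImage& out, float k) {
  float cx = in.width / 2.0f;
  for (int y = 0; y < in.height; y++) {
    float t = (float)y / (in.height - 1);        // 0 at the top, 1 at the bottom
    float scale = 1.0f + k * (t - 0.5f);         // k's sign picks which edge widens
    for (int x = 0; x < out.width; x++) {
      int sx = (int)((x - cx) * scale + cx);     // inverse-map to the source row
      out.data[y * out.width + x] =
          (sx >= 0 && sx < in.width) ? in.data[y * in.width + sx] : 255;
    }
  }
}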

Live preview — The browser polls the ESP32S3's /processed endpoint, which returns raw grayscale bytes with dimensions in HTTP headers. JavaScript renders them to a canvas.

Implementation

The server exposes several endpoints:

Endpoint     Returns
/            HTML/JS interface
/raw         Raw camera JPEG
/processed   Grayscale with transforms applied
/debug       Intermediate pipeline stages (for testing)
/orient      Set orientation mode (0-7)
/keystone    Set keystone correction value
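
Wiring these up with the Arduino WebServer library might look like this; the query-parameter names mode and value, the INDEX_HTML constant, and the handler names other than handleProcessed are assumptions:

#include <WebServer.h>

WebServer server(80);

void setupRoutes() {
  server.on("/", []() { server.send(200, "text/html", INDEX_HTML); });
  server.on("/raw", handleRaw);
  server.on("/processed", handleProcessed);
  server.on("/debug", handleDebug);
  server.on("/orient", []() {
    orientMode = server.arg("mode").toInt();      // e.g. /orient?mode=5
    server.send(200, "text/plain", "ok");
  });
  server.on("/keystone", []() {
    keystoneVal = server.arg("value").toFloat();  // e.g. /keystone?value=0.1
    server.send(200, "text/plain", "ok");
  });
  server.begin();
}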

The /processed handler applies the configured transforms:

void handleProcessed() {
  GrayImage gray, rotated, transformed, output;
  if (!captureGray(gray)) { server.send(500, "text/plain", "capture failed"); return; }

  // Apply configured rotation (camera is mounted sideways)
  if (orientMode < 4) rotate90CCW(gray, rotated);
  else                rotate90CW(gray, rotated);

  // Apply secondary transform based on mode
  switch (orientMode % 4) {
    case 0: transformed = std::move(rotated); break;
    case 1: rotate180(rotated, transformed); break;
    case 2: flipH(rotated, transformed); break;
    case 3: flipV(rotated, transformed); break;
  }

  // Apply keystone if set; otherwise pass the image through unchanged
  if (keystoneVal != 0.0f) keystoneH(transformed, output, keystoneVal);
  else                     output = std::move(transformed);

  // Send with dimensions in headers
  server.sendHeader("X-Width", String(output.width));
  server.sendHeader("X-Height", String(output.height));
  server.send(200, "application/octet-stream", (const char*)output.data, output.size());
}
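
Returning raw grayscale rather than JPEG means the device never re-encodes after processing, and the browser gets pixels it can paint straight onto a canvas; the larger payload is a reasonable trade for a debug tool on the local network.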

The frontend renders the grayscale bytes directly to a canvas:

async function updatePreview() {
  const res = await fetch("/processed");
  const width = parseInt(res.headers.get("X-Width"));
  const height = parseInt(res.headers.get("X-Height"));
  const data = new Uint8Array(await res.arrayBuffer());

  const canvas = document.getElementById("preview");
  canvas.width = width;
  canvas.height = height;
  const ctx = canvas.getContext("2d");
  const imgData = ctx.createImageData(width, height);

  for (let i = 0; i < data.length; i++) {
    imgData.data[i * 4] = data[i]; // R
    imgData.data[i * 4 + 1] = data[i]; // G
    imgData.data[i * 4 + 2] = data[i]; // B
    imgData.data[i * 4 + 3] = 255; // A
  }
  ctx.putImageData(imgData, 0, 0);
}

Cloud OCR Fallback

I also added a cloud fallback for when on-device recognition fails. The ESP32S3 can POST the processed image to a Vercel serverless function that calls Google Cloud Vision API:

// Vercel API route - /api/ocr
import vision from "@google-cloud/vision";

export async function POST(request: Request) {
  // imageWidth is sent along with the image so the column cutoff can be computed
  const { image, imageWidth } = await request.json();
  const client = new vision.ImageAnnotatorClient();
  const [result] = await client.textDetection({ image: { content: image } });

  // Extract numbers from the right column based on x-position
  const numbers = result.textAnnotations
    ?.filter((a) => /^\d+$/.test(a.description || ""))
    .filter((a) => (a.boundingPoly?.vertices?.[0]?.x ?? 0) > imageWidth * 0.6)
    .map((a) => parseInt(a.description!, 10));

  return Response.json({ numbers });
}

This didn't work well either — Google Vision concatenates all digits instead of parsing the table structure — but the interface made it easy to test and rule out.

Reflection

Building this debug interface was more valuable than the CV algorithm itself. Every time I made a change to the pipeline, I could see the result immediately in my browser. No compile-flash-wait cycle.

The ugly emoji buttons in the screenshot? I vibe-coded the frontend and the ESP32's web server doesn't handle UTF-8 correctly. The interface looks broken but functions fine. Good enough for debugging.

The interface also revealed problems I wouldn't have caught otherwise — like discovering that my keystone correction was being applied in the wrong direction, or that certain lighting conditions caused the L-mark detector to find false positives in the film transparency area.

Design Files