My TRMNL e-ink display runs off my own Deno BYOS server. The render pipeline is: lay the dashboard out as HTML, screenshot it in headless Chrome over CDP, then turn that RGB screenshot into the 1/2/4-bit grayscale the panel actually wants. That last step (grayscale + Floyd-Steinberg dithering) was the slowest thing in the whole request, and it was pure JS.
After this rabbit hole the dither pipeline does an 1800×1480 4bpp frame in ~15.7ms instead of ~64ms (M2 Pro), the working set fits in L1, and the kernel is bit-exact against the JS reference it replaced.
Notes from the journey, mostly so future me remembers why this kernel looks the way it does.
E-ink at 1bpp can only show black or white. To fake a gray it scatters black and white pixels so the average looks gray. Floyd-Steinberg is the classic: walk the pixels left-to-right, top-to-bottom, snap each one to the nearest available level, and push the rounding error onto the not-yet-visited neighbours with these weights:
           *    7/16
    3/16  5/16  1/16
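For reference, a minimal textbook version of that walk in TypeScript. This is a sketch, not the code from my server: the function name, the float working buffer and the `levels` parameter (2, 4 or 16 for 1/2/4bpp) are all choices made for illustration.

```typescript
// Textbook Floyd-Steinberg over a grayscale buffer with values 0..255.
// Returns one level index (0..levels-1) per pixel.
function floydSteinberg(
  gray: Float32Array,
  width: number,
  height: number,
  levels: number,
): Uint8Array {
  const work = Float32Array.from(gray); // copy, because error is added in place
  const out = new Uint8Array(width * height);
  const step = 255 / (levels - 1); // distance between adjacent gray levels

  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const i = y * width + x;
      const old = work[i];
      // Snap to the nearest available level, clamped to the valid range.
      const idx = Math.min(levels - 1, Math.max(0, Math.round(old / step)));
      out[i] = idx;
      const err = old - idx * step;

      // Push the rounding error onto the not-yet-visited neighbours.
      if (x + 1 < width) work[i + 1] += (err * 7) / 16; // right
      if (y + 1 < height) {
        if (x > 0) work[i + width - 1] += (err * 3) / 16; // down-left
        work[i + width] += (err * 5) / 16; // down
        if (x + 1 < width) work[i + width + 1] += (err * 1) / 16; // down-right
      }
    }
  }
  return out;
}
```

A flat mid-gray input at `levels = 2` comes out as a mix of 0s and 1s, which is the whole trick.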
Two things matter for performance. The luma step (RGB → one gray value per pixel) is embarrassingly parallel:
every pixel is independent, which is perfect for SIMD. The Floyd-Steinberg step is the opposite: pixel x can’t start until
pixel x-1 has pushed its error, so each row is a serial dependency chain. The whole optimisation is about
making the parallel part fly and the serial part cheap.
The JS version ran as stages: decode to RGBA, build an f32 luma buffer, run Floyd-Steinberg over it, pack. Each stage streamed the whole image through memory again.
I collapsed luma + Floyd-Steinberg into a single AssemblyScript kernel that takes RGB in and writes quantized indices out, no intermediate luma buffer. Then I leaned on it:
- f32x4 and i8x16 channel shuffles (--enable simd in the asc flags)
- v128 store

Funny detail: the first working WASM version was slower than the JS (69ms vs 64ms). WASM isn’t magic; a naive port just moves the same memory traffic behind an FFI boundary. The wins were all in what came after.
The big one. The old kernel kept a (W+2)·(H+1) f32 error arena (about 1.5 MB for TRMNL X) and streamed it
through L2 twice per frame. I replaced it with two (W+2) i16 row buffers, ~7 KB total, small enough to live in
L1. Row y+1’s luma gets computed straight into the “next” buffer one row ahead of the Floyd-Steinberg reader;
after each row the two buffers just swap pointers.
Then the part I’m smug about (though the idea isn’t mine; it’s straight out of Itamar Turner-Trauring’s Speeding up your code when multiple cores aren’t an option, which argues Floyd-Steinberg is basically impossible to parallelize and the real win is keeping the error in registers instead of hammering a memory array). Each cell in the row below collects error from the three pixels above it (it is the down-left, down, or down-right neighbour of each of them). The obvious code writes three f32 cells per pixel. Instead I carry the un-divided error sum in two scalar registers across the loop:
    next[x] = 1·err(x-2) + 5·err(x-1) + 3·err(x)
              └──────────┬──────────┘   └─ this pixel
        carried in dlPrev, un-divided      ×3, added at commit
So each pixel commits one i16 cell and does one divide (a >>4 shift, because the weights sum to 16),
instead of three writes and three divides. That is 6× less down-row write bandwidth (one 2-byte i16 store instead of three 4-byte f32 stores), and one rounding point per cell
instead of three.
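To make the rearrangement concrete, here is a sketch in plain TypeScript rather than AssemblyScript. It is not the real kernel: the function names, the `quantize` helper, the way the rightward 7/16 term is handled, and the restriction to 2, 4 or 16 levels (so the step is an integer) are all assumptions for illustration. `ditherThreeWrites` is the obvious version with an arena and three down-row writes per pixel; `ditherCarried` keeps two (W+2) i16 rows and carries the un-divided sum in `dlPrev` and `ePrev`.

```typescript
// Hypothetical helper: nearest level index for an integer step, clamped.
function quantize(v: number, step: number, maxIdx: number): number {
  const idx = Math.floor((v + (step >> 1)) / step); // round-half-up
  return idx < 0 ? 0 : idx > maxIdx ? maxIdx : idx;
}

// Baseline: three un-divided writes per pixel into a (W+2)·(H+1) arena.
function ditherThreeWrites(luma: Uint8Array, w: number, h: number, levels: number): Uint8Array {
  const maxIdx = levels - 1, step = 255 / maxIdx; // 255, 85 or 17
  const out = new Uint8Array(w * h);
  const arena = new Int32Array((w + 2) * (h + 1)); // error in 1/16ths
  for (let y = 0; y < h; y++) {
    let ePrev = 0; // error of the pixel to the left
    for (let x = 0; x < w; x++) {
      const cell = y * (w + 2) + x + 1;
      const v = luma[y * w + x] + (arena[cell] >> 4) + ((7 * ePrev) >> 4);
      const idx = quantize(v, step, maxIdx);
      out[y * w + x] = idx;
      const e = v - idx * step;
      const below = cell + (w + 2);
      arena[below - 1] += 3 * e; // down-left
      arena[below] += 5 * e;     // down
      arena[below + 1] += e;     // down-right
      ePrev = e;
    }
  }
  return out;
}

// Rolling version: two i16 rows, one committed cell and one shift per pixel.
function ditherCarried(luma: Uint8Array, w: number, h: number, levels: number): Uint8Array {
  const maxIdx = levels - 1, step = 255 / maxIdx;
  const out = new Uint8Array(w * h);
  let cur = new Int16Array(w + 2);  // column c lives at index c + 1
  let next = new Int16Array(w + 2);
  for (let x = 0; x < w; x++) cur[x + 1] = luma[x];
  for (let y = 0; y < h; y++) {
    // Stage row y+1's luma one row ahead (zeros after the last row).
    next.fill(0);
    if (y + 1 < h) for (let x = 0; x < w; x++) next[x + 1] = luma[(y + 1) * w + x];
    let dlPrev = 0; // 1·err(x-2) + 5·err(x-1), un-divided
    let ePrev = 0;  // err(x-1)
    for (let x = 0; x < w; x++) {
      const v = cur[x + 1] + ((7 * ePrev) >> 4);
      const idx = quantize(v, step, maxIdx);
      out[y * w + x] = idx;
      const e = v - idx * step;
      next[x] += (dlPrev + 3 * e) >> 4; // cell for column x-1 is now complete
      dlPrev = ePrev + 5 * e;
      ePrev = e;
    }
    next[w] += dlPrev >> 4; // flush the last column, which has no right neighbour
    [cur, next] = [next, cur];
  }
  return out;
}
```

Both functions add the same un-divided sum and shift it once, so they should produce identical indices; the only difference is whether that sum lives in memory or in two locals.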
Going all-integer had a bonus I didn’t expect: the kernel became bit-deterministic. Round-half-up via
(x + bias) / step matches JS Math.round exactly, so the parity test went from “visually identical (≤1 level on
≤1% of pixels)” back to genuinely bit-exact, including the all-black and all-white cases the f32 path could only
get “close” on.
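The rounding claim is easy to check in isolation. A small sketch, assuming non-negative integer inputs and taking `bias` as half the step rounded down; `roundDiv` is a name made up for this example:

```typescript
// Integer round-half-up division: floor((x + bias) / step).
// Assumes x >= 0 and step > 0, both integers.
function roundDiv(x: number, step: number): number {
  const bias = step >> 1; // half a step, rounded down
  return Math.floor((x + bias) / step);
}

// Compare against Math.round for the three step sizes a 1/2/4bpp panel needs.
function matchesMathRound(): boolean {
  for (const step of [255, 85, 17]) {
    for (let x = 0; x <= 255; x++) {
      if (roundDiv(x, step) !== Math.round(x / step)) return false;
    }
  }
  return true;
}
```

With odd steps there are no exact ties, and with even steps the tie lands on the upper level, which is also what Math.round does for positive values.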
The screenshot comes from CDP’s Page.captureScreenshot, and CDP always emits the same PNG: 8-bit colour
type 2 (RGB), non-interlaced, level-1 zlib. A general PNG decoder handles a grab-bag of formats it’ll never see
here. So I wrote a decode specialized on exactly those invariants: inflate via DecompressionStream, defilter
straight into RGB output, no ping-pong scratch, no RGB→RGBA widen. 2.9× faster than the general decoder, and the
kernel lost a quarter of its input bytes (and all the wasted alpha math) by going ditherFromRgba → ditherFromRgb.
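The defilter half of that is small enough to sketch. This is an illustration under the stated invariants (8-bit RGB, non-interlaced), not my decoder verbatim; `defilterRgb8` and `paeth` are hypothetical names, and `raw` is assumed to be the already-inflated IDAT stream: one filter-type byte followed by width·3 bytes per row.

```typescript
// Paeth predictor from the PNG spec: pick whichever of left (a), above (b),
// above-left (c) is closest to a + b - c.
function paeth(a: number, b: number, c: number): number {
  const p = a + b - c;
  const pa = Math.abs(p - a), pb = Math.abs(p - b), pc = Math.abs(p - c);
  if (pa <= pb && pa <= pc) return a;
  return pb <= pc ? b : c;
}

// Undo PNG scanline filtering for 8-bit RGB, writing packed RGB directly.
// The previous *output* row doubles as the "above" row, so no scratch buffer.
function defilterRgb8(raw: Uint8Array, width: number, height: number): Uint8Array {
  const bpp = 3; // bytes per pixel for colour type 2 at 8 bits
  const stride = width * bpp;
  const out = new Uint8Array(stride * height);
  for (let y = 0; y < height; y++) {
    const filter = raw[y * (stride + 1)];
    const src = y * (stride + 1) + 1;
    const dst = y * stride;
    const up = dst - stride; // only read when y > 0
    for (let i = 0; i < stride; i++) {
      const a = i >= bpp ? out[dst + i - bpp] : 0;        // left
      const b = y > 0 ? out[up + i] : 0;                  // above
      const c = y > 0 && i >= bpp ? out[up + i - bpp] : 0; // above-left
      let pred: number;
      switch (filter) {
        case 0: pred = 0; break;               // None
        case 1: pred = a; break;               // Sub
        case 2: pred = b; break;               // Up
        case 3: pred = (a + b) >> 1; break;    // Average
        case 4: pred = paeth(a, b, c); break;  // Paeth
        default: throw new Error(`bad PNG filter type ${filter}`);
      }
      out[dst + i] = (raw[src + i] + pred) & 0xff;
    }
  }
  return out;
}
```

The `raw` bytes would come from piping the concatenated IDAT payload through `new DecompressionStream("deflate")`, which is the zlib-wrapped format PNG uses.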
The luma SIMD was reading 8 pixels per chunk with overlapping loads. I rewrote it to do 16 pixels per chunk:
three back-to-back v128 loads (48 bytes, every byte used, no overlap), a two-stage i8x16.shuffle that
deinterleaves the RGBRGB… stream into separate R, G, B byte planes, then luma computed in Q15
fixed-point.
Q15 just means: instead of multiplying by 0.2126 in float, multiply by round(0.2126 × 2^15) as an integer and
shift back down by 15 at the end. The Rec.709 weights become 6966 / 23436 / 2366, they sum to exactly 32768, and
i32x4.extmul_*_i16x8_u gives the widening multiply for free. R, G and B sum in i32 before a single
(+0x4000) >> 15 round.
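The same arithmetic in scalar TypeScript, as a sketch of what each SIMD lane computes. The function names and the packed-RGB row helper are made up for this example; only the three constants and the rounding step come from the text above.

```typescript
// Rec.709 luma weights scaled by 2^15 and rounded.
const WR = 6966;  // round(0.2126 * 32768)
const WG = 23436; // round(0.7152 * 32768)
const WB = 2366;  // round(0.0722 * 32768)

// One pixel: widen, multiply, sum in 32-bit, then a single rounding shift.
function lumaQ15(r: number, g: number, b: number): number {
  return (WR * r + WG * g + WB * b + 0x4000) >> 15;
}

// One row of packed RGBRGB... bytes into one luma byte per pixel.
function lumaRow(rgb: Uint8Array, width: number): Uint8Array {
  const out = new Uint8Array(width);
  for (let x = 0; x < width; x++) {
    out[x] = lumaQ15(rgb[3 * x], rgb[3 * x + 1], rgb[3 * x + 2]);
  }
  return out;
}
```

Because the weights sum to exactly 32768, any neutral gray (r = g = b = v) maps back to v, so white stays 255 and black stays 0. The largest intermediate is 32768 × 255 + 16384, which fits comfortably in an i32.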
That dropped the luma pass from ~3.75 to ~2.94 SIMD ops/pixel, and because it’s now integer, made it bit-exact with the float reference across every bit depth and panel size, including odd widths that exercise the scalar tail. The f32 path had always needed ±1 ULP of slack; this one doesn’t.
A handful of small ones that needed no cleverness:
- memory.fill to zero the last row’s staging instead of a scalar loop.
- Page, reused. Same memoization I already had for the browser: open the page once, goto for each render instead of new-page / close every time. Drops two CDP round-trips off every frame.

Once the kernel was the hot path I wanted to step it in DevTools. AssemblyScript emits a source map, but Deno’s
raw-imports instantiates the module from bytes, so there’s no URL for Chrome to resolve ./dither.wasm.map
against (it tries wasm://wasm/dither.wasm.map and fails). Two fixes made it work: build with --baseDir so the
map records source paths relative to itself, and embed an absolute file:// sourceMappingURL so DevTools
fetches the map directly. Now breakpoints land in the .as.ts.
1800×1480 @ 4bpp, M2 Pro:
| | before | after |
|---|---|---|
| dither pipeline (luma + Floyd-Steinberg) | ~64ms (JS) | ~15.7ms |
| kernel only | ~60.5ms | ~15.4ms |
| Floyd-Steinberg working set | ~1.5 MB f32 arena | ~7 KB (two i16 rows, L1-resident) |
| PNG decode | general decoder | 2.9× faster (CDP-specialized) |
| luma SIMD throughput | ~3.75 ops/px | ~2.94 ops/px |
| output vs JS reference | “visually identical” | bit-exact |
The luma pass is already SIMD and near its op floor. The Floyd-Steinberg pass is inherently serial: each pixel depends on its left neighbour’s error, so you can’t vectorize across the row, and the scalar body is sitting at the inter-pixel dependency-chain floor. Going faster from here means a different algorithm (a parallel-friendly dither, or quantizing on the GPU), not a faster Floyd-Steinberg. For a frame that refreshes every few minutes on an e-ink panel, 16ms is so far under the budget that the next bottleneck is Chrome taking the screenshot. So I stopped.
- Rec. 709: the 0.2126 / 0.7152 / 0.0722 luma weights (my 6966 / 23436 / 2366 Q15 constants).
- The v128 ops used (i8x16.shuffle, extend_*, extmul_*, narrow, splat).
- Page.captureScreenshot, and the 8-bit RGB / level-1 zlib invariant the specialized PNG decoder exploits.
- The .as.ts coda.