Tachyon

The same haversine kernel walked from a naïve pandas `.apply` through C++, Rust, and Zig SIMD, and finally to an analyzer-driven V7 in Zig, tuned by a tool that reads the compiled assembly, landing at 150 GB/s; plus a WebGPU compute lab in the browser. An end-to-end demo of the optimisation work I do for clients.

Optimization DevOps Fullstack
Python Zig Rust C++ WebGPU FastAPI Astro Fly.io

9,100 ns → 0.29 ns: Python V0 → Zig V7, per pair
7 stages: naïve Zig → analyzer-driven V7
150 GB/s: V7 peak on a Ryzen 9 9950X3D
WGSL: WebGPU compute kernels in the browser

What this site is

This site is itself one of the case studies: an end-to-end demonstration of the kind of optimisation work I do for clients, on a problem small enough to walk a reader through every stage of. The interactive demos at /lab run real benchmarks against real binaries; the writeups explain what each layer costs, and why. The point isn't to flex Zig. It's to give a non-specialist reader an honest sense of where time actually goes when you ask a function to run fast, and what each rung of the optimisation ladder costs in maintenance.

Start with Python

If you ask a typical engineer to compute the great-circle distance between two latitude/longitude pairs for every row in a dataset, the answer is almost always a few lines of Python like this:

```python
# Python (pandas)
df["distance_km"] = df.apply(
    lambda r: haversine(r.lat1, r.lon1, r.lat2, r.lon2),
    axis=1,
)
```

That's roughly 5–30 microseconds per pair. On a few million rows, it's the function that blocks shipping. Switching to a vectorised NumPy version (drop `.apply`, operate on whole columns) is a 10–100× win, typically 0.1–1 microsecond per pair, essentially for free.
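The vectorised version might look like this (a sketch; the column names mirror the `.apply` snippet above, and the 6,371 km mean Earth radius is an assumption of this sketch):

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a constant this sketch assumes

def haversine_vectorised(lat1, lon1, lat2, lon2):
    """Great-circle distance in km, computed over whole columns at once."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Example frame; column names mirror the .apply version above.
df = pd.DataFrame({"lat1": [51.5074], "lon1": [-0.1278],   # London
                   "lat2": [48.8566], "lon2": [2.3522]})   # Paris
df["distance_km"] = haversine_vectorised(df.lat1, df.lon1, df.lat2, df.lon2)
```

One NumPy call per column replaces millions of Python-level lambda invocations, which is where the 10–100× comes from.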

A naïve, single-threaded Zig haversine using the language's builtin trig functions runs at ~19.5 nanoseconds per pair on a Ryzen 9 9950X3D. Six iterations of architectural rework later (SIMD, multi-threading, Estrin's scheme on the FMA chains, and finally a critical-path analyzer that reads the compiled binary and finds LLVM emitting separate mul+add where it could emit a fused FMA), the same function runs at ~0.29 nanoseconds per pair across all 16 cores, sustaining ~150 GB/s of haversine throughput. That's another two to three orders of magnitude beyond NumPy. Cumulatively, from the pandas `.apply` the average engineer writes to the analyzer-tuned V7 AVX-512 kernel at the end of the journey, the gap is roughly 30,000×.
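Estrin's scheme, mentioned above, trades Horner's strictly serial dependency chain for a shallower tree of independent multiply-adds. The idea, sketched in Python for a degree-7 polynomial (the coefficient values in the test are arbitrary placeholders, not anything from the kernel):

```python
def horner(c, x):
    # Horner: each step depends on the previous one -> one long serial FMA chain.
    acc = c[-1]
    for coef in reversed(c[:-1]):
        acc = acc * x + coef
    return acc

def estrin_deg7(c, x):
    # Estrin: pair up coefficients, then combine the pairs with powers of x.
    # The four leaf multiply-adds are independent and can issue in parallel.
    x2 = x * x
    x4 = x2 * x2
    p01 = c[0] + c[1] * x
    p23 = c[2] + c[3] * x
    p45 = c[4] + c[5] * x
    p67 = c[6] + c[7] * x
    return (p01 + p23 * x2) + (p45 + p67 * x2) * x4
```

Both compute the same polynomial; Estrin's version just has a shorter critical path, which matters once the FMA units would otherwise sit idle waiting on the chain.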

The interactive runner at /lab/benchmarks invokes the C++, Rust, and Zig binaries live; the deep-dive writeup walks the full Python → AVX-512 → "read the assembly" arc with code at every stage, including the small Python tool that does the critical-path analysis on the compiled binary. The point isn't that Python is bad: it's a great default. The point is that knowing what the journey looks like lets you decide deliberately when to leave Python and when not to.
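The real analyzer is covered in the writeup; as a flavour of the idea only, a toy pass over `objdump -d` output that flags a `vmulpd` whose destination register feeds a nearby `vaddpd` (a pair a fused `vfmadd` could replace) might look like this. The instruction matching is deliberately simplistic and hypothetical; real analysis needs proper operand parsing and liveness:

```python
import re

# Match the mnemonics we care about plus their operand string.
INSN = re.compile(r"\b(vmulpd|vaddpd|vfmadd\w*)\s+(.*)")

def find_fusable_pairs(disasm: str, window: int = 3):
    """Return (mul_index, add_index) pairs where the mul's result feeds the add."""
    insns = []
    for line in disasm.splitlines():
        m = INSN.search(line)
        if m:
            op, args = m.group(1), m.group(2)
            regs = re.findall(r"%?[xyz]mm\d+", args)
            insns.append((op, regs))
    pairs = []
    for i, (op, regs) in enumerate(insns):
        if op == "vmulpd" and regs:
            dest = regs[-1]  # AT&T syntax: destination is the last operand
            for j in range(i + 1, min(i + 1 + window, len(insns))):
                op2, regs2 = insns[j]
                if op2 == "vaddpd" and dest in regs2:
                    pairs.append((i, j))
                    break
    return pairs

sample = """\
  401000: vmulpd %zmm1, %zmm2, %zmm3
  401004: vaddpd %zmm3, %zmm4, %zmm5
  401008: vfmadd231pd %zmm0, %zmm1, %zmm2
"""
pairs = find_fusable_pairs(sample)  # the mul at index 0 feeds the add at index 1
```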

WebGPU compute lab

The /lab/gpu page runs three compute shaders in your browser: matrix multiplication, image convolution, and a parallel Fibonacci reduction. Each benchmark batches all iterations into a single GPU command buffer; submitting per iteration would be dominated by submission overhead and would not reflect any realistic workload's performance.

Shaders are plain WGSL; the runner is a small TypeScript wrapper around the WebGPU API. CPU baselines are computed on a Fly.io-hosted Zig binary so the comparison is fair: same workload sizes, same precision, same data. For small problem sizes the CPU wins on latency (kernel launch overhead dominates); for large ones the GPU wins by orders of magnitude. The page makes that crossover legible.
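That crossover can be captured with a two-parameter model: GPU time is a fixed launch overhead plus elements divided by throughput, and break-even is where it equals the CPU's time. All constants below are illustrative assumptions, not the lab's measurements:

```python
# Toy latency model for the CPU/GPU crossover.
# All constants are illustrative assumptions, not measured numbers.
GPU_LAUNCH_OVERHEAD_S = 200e-6   # fixed cost per submitted command buffer
GPU_RATE = 5e9                   # elements/second once the kernel is running
CPU_RATE = 1e8                   # elements/second, no launch cost

def cpu_time(n): return n / CPU_RATE
def gpu_time(n): return GPU_LAUNCH_OVERHEAD_S + n / GPU_RATE

# Break-even: overhead + n/gpu = n/cpu  =>  n = overhead / (1/cpu - 1/gpu)
break_even = GPU_LAUNCH_OVERHEAD_S / (1 / CPU_RATE - 1 / GPU_RATE)
```

Below `break_even` elements the fixed overhead dominates and the CPU wins on latency; above it, the GPU's throughput advantage takes over.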

Deploy

The frontend is Astro 6 with React islands, built with Bun, deployed to Cloudflare Pages. The backend is a multi-stage Docker build: a builder stage installs Zig 0.15, Rust nightly, CMake, protobuf, and abseil; compiles all benchmarks; then the runtime stage copies just the binaries into a slim Python image. Deployment is a single flyctl deploy; remote builds run on Fly.io's builders so the local toolchain stays uninvolved.
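The rough shape of that build, as a sketch (stage names, paths, the build script, and the server entry point are illustrative assumptions, not the actual Dockerfile):

```dockerfile
# Builder stage: full toolchain, compiles every benchmark binary.
FROM debian:bookworm-slim AS builder
RUN apt-get update && apt-get install -y cmake protobuf-compiler curl
# ... install Zig 0.15, Rust nightly, and abseil here (installers omitted) ...
COPY . /src
WORKDIR /src
RUN ./build-all.sh   # hypothetical script building the C++, Rust, and Zig targets

# Runtime stage: slim Python image, binaries only -- no compilers on board.
FROM python:3.12-slim
COPY --from=builder /src/zig-out/bin/ /app/bin/
COPY server/ /app/server/
WORKDIR /app
CMD ["python", "-m", "uvicorn", "server.main:app", "--host", "0.0.0.0"]
```

The split keeps the deployed image small: the heavy toolchain lives only in the builder stage, which is discarded after the binaries are copied out.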

Why this exists as a case

The Python-to-assembly journey is the same conversation I end up having with most engineers I work with. Knowing the shape of it (what's a 10× win, what's a 1000× win, when does each become available, and what each rung costs in maintenance) is what separates "I optimised it" from "I optimised it deliberately." This case page exists so I can point at a worked example of that conversation when it comes up, with real numbers attached.

Related work