This site
Tachyon
The same haversine kernel walked from a naïve pandas `.apply` through C++, Rust, Zig SIMD, and finally an analyzer-driven V7 in Zig that reads its own compiled assembly to land at 150 GB/s, plus a WebGPU compute lab in the browser. End-to-end demo of the optimisation work I do for clients.
9,100 ns → 0.29 ns
Python V0 → Zig V7 per pair
7 stages
Naïve Zig → analyzer-driven V7
150 GB/s
V7 peak on Ryzen 9950X3D
WGSL
WebGPU compute kernels in browser
What this site is
This site is itself one of the case studies: an end-to-end demonstration of the kind of optimisation work I do for clients, on a problem small enough to walk a reader through every stage of. The interactive demos at /lab run real benchmarks against real binaries; the writeups explain what each layer costs, and why. The point isn't to flex Zig. It's to give a non-specialist reader an honest sense of where time actually goes when you ask a function to run fast, and what each rung of the optimisation ladder costs in maintenance.
Start with Python
If you ask a typical engineer to compute the great-circle distance between two latitude/longitude pairs for every row in a dataset, the answer is almost always a few lines of Python like this:
# Python (pandas) — assumes a scalar haversine(lat1, lon1, lat2, lon2) helper is defined
df["distance_km"] = df.apply(
    lambda r: haversine(r.lat1, r.lon1, r.lat2, r.lon2),
    axis=1,
)
That's roughly 5–30 microseconds per pair. On a few million rows it's the function that blocks shipping. Switching to a vectorised NumPy version (drop .apply, operate on whole columns) gets you a 10–100× win, typically 0.1–1 microsecond per pair, for free.
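For concreteness, here is a minimal sketch of what that vectorised version can look like. The column names follow the snippet above; the 6,371 km Earth radius is the usual mean value, not a number taken from the actual benchmark code.

# Python (NumPy) — vectorised haversine over whole columns
import numpy as np

def haversine_np(lat1, lon1, lat2, lon2, radius_km=6371.0):
    # Convert degrees to radians, column-wise.
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine: a = sin²(Δφ/2) + cos φ1 · cos φ2 · sin²(Δλ/2)
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# One call over whole columns instead of a Python-level call per row.
df["distance_km"] = haversine_np(df["lat1"], df["lon1"], df["lat2"], df["lon2"])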
A naïve, single-threaded Zig haversine that uses the language-builtin trig functions runs at ~19.5 nanoseconds per pair on a Ryzen 9 9950X3D. Six iterations of architectural rework later (SIMD, multi-threading, Estrin's scheme on the FMA chains, and finally a critical-path analyzer that reads the compiled binary and finds LLVM emitting mul+add where it could be emitting fused FMA), the same function runs at ~0.29 nanoseconds per pair across all 16 cores, sustaining ~150 GB/s of haversine throughput. That's another two to three orders of magnitude beyond NumPy. Cumulatively, from the pandas .apply the average engineer writes to the analyzer-tuned V7 AVX-512 kernel at the end of the journey, the gap is roughly 30,000×.
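To make the Estrin point concrete: Horner's rule evaluates a polynomial as one serial chain of fused multiply-adds, each waiting on the previous one, while Estrin's scheme rebalances the same polynomial into a shallower tree of independent FMAs. A minimal sketch in Python for readability (the real kernel applies this to the polynomial approximations inside the trig code, in Zig):

# Horner: c0 + x*(c1 + x*(c2 + x*c3)) — three dependent FMAs in a row.
def horner(x, c0, c1, c2, c3):
    return c0 + x * (c1 + x * (c2 + x * c3))

# Estrin: (c0 + c1*x) and (c2 + c3*x) are independent, so they can issue in
# parallel; the critical path drops from three dependent FMAs to two levels.
def estrin(x, c0, c1, c2, c3):
    x2 = x * x
    return (c0 + c1 * x) + x2 * (c2 + c3 * x)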
The interactive runner at /lab/benchmarks invokes the C++, Rust, and Zig binaries live; the deep-dive writeup walks the full Python → AVX-512 → "read the assembly" arc with code at every stage, including the small Python tool that does the critical-path analysis on the compiled binary. The point isn't that Python is bad: it's a great default. The point is that knowing what the journey looks like lets you decide deliberately when to leave Python and when not to.
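As a flavour of what the analysis tool does (this is a toy sketch of my own, not the actual tool from the writeup), the core idea is: disassemble the binary, then flag a vector multiply whose destination register is immediately consumed by a vector add — exactly the pattern a single fused vfmadd would collapse.

# Toy FMA-opportunity finder over objdump output (illustrative only)
import re
import subprocess
import sys

INSN = re.compile(r"\t(v?(?:mul|add)[ps][sd])\s+(.*)")

def fusable_pairs(binary):
    asm = subprocess.run(
        ["objdump", "-d", "--no-show-raw-insn", binary],
        capture_output=True, text=True, check=True,
    ).stdout
    prev = None  # (mnemonic, destination register) of the previous mul/add
    for line in asm.splitlines():
        m = INSN.search(line)
        if not m:
            prev = None
            continue
        mnemonic, operands = m.groups()
        dest = operands.split(",")[-1].strip()  # AT&T syntax: destination last
        if prev and prev[0].lstrip("v").startswith("mul") \
                and mnemonic.lstrip("v").startswith("add") \
                and prev[1] in operands:
            print(f"possible FMA: {prev[0]} -> {mnemonic} via {prev[1]}")
        prev = (mnemonic, dest)

if __name__ == "__main__":
    fusable_pairs(sys.argv[1])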
WebGPU compute lab
The /lab/gpu page runs three compute shaders in your browser: matrix multiplication, image convolution, and a parallel Fibonacci reduction. Each benchmark batches all iterations into a single GPU command buffer; submitting once per iteration would leave the measurement dominated by submission overhead, which no realistic workload would show.
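A back-of-the-envelope model shows why batching matters (the numbers here are invented for illustration, not measured on the lab page): per-iteration submission pays the fixed cost every time, batching pays it once.

# Hypothetical costs, purely illustrative
submit_overhead_us = 200.0  # fixed cost per queue submission
kernel_us = 50.0            # GPU time for one iteration of the kernel
iterations = 100

per_iteration_submits = iterations * (submit_overhead_us + kernel_us)  # 25,000 µs
one_batched_submit = submit_overhead_us + iterations * kernel_us       #  5,200 µs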
Shaders are plain WGSL; the runner is a small TypeScript wrapper around the WebGPU API. CPU baselines are computed on a Fly.io-hosted Zig binary so the comparison is fair: same workload sizes, same precision, same data. For small problem sizes the CPU wins on latency (kernel launch overhead dominates); for large ones the GPU wins by orders of magnitude. The page makes that crossover legible.
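The crossover falls out of the same kind of model: the GPU pays a fixed launch cost but a far lower per-element cost, so there is a problem size where the lines cross. A sketch with made-up constants (not the lab page's measured values):

# Made-up constants: CPU has no launch cost but a higher per-element cost.
gpu_launch_us = 500.0
gpu_per_elem_us = 0.0001
cpu_per_elem_us = 0.01

# cpu_per_elem * n == gpu_launch + gpu_per_elem * n  =>  solve for n
crossover = gpu_launch_us / (cpu_per_elem_us - gpu_per_elem_us)
print(f"GPU wins above ~{crossover:,.0f} elements")  # ≈ 50,505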
Deploy
The frontend is Astro 6 with React islands, built with Bun, deployed to Cloudflare Pages. The backend is a multi-stage Docker build: a builder stage installs Zig 0.15, Rust nightly, CMake, protobuf, and abseil and compiles all benchmarks; the runtime stage then copies just the binaries into a slim Python image. Deployment is a single flyctl deploy; remote builds run on Fly.io's builders so the local toolchain stays uninvolved.
Why this exists as a case study
The Python-to-assembly journey is the same conversation I end up having with most engineers I work with. Knowing the shape of it (what's a 10× win, what's a 1000× win, when does each become available, and what each rung costs in maintenance) is what separates "I optimised it" from "I optimised it deliberately." This case page exists so I can point at a worked example of that conversation when it comes up, with real numbers attached.
Related work
Engineer on the optimisation arc
Provstiskyen: performance work on a 10-year SaaS
Profiled and fixed the cold-start path on a 44,000-line R Shiny production app: 50-second logins down to 18, and 35-minute deploys down to 80 seconds, all on the existing codebase. The full rewrite that came later was made possible by a year of targeted optimisation work first.
Optimization Fullstack DevOps
Enterprise CI cluster
Jenkins pipeline right-sizing
Took 2,600 production pipelines from 8% to ~60% memory utilisation by building per-build telemetry, then designing bins from real percentile data. Same hardware, several multiples more headroom, no rewrite of any pipeline required.
DevOps Observability Optimization