This site
Tachyon
The same haversine kernel walked from a naïve pandas `.apply` through C++, Rust, Zig SIMD, and finally an analyzer-driven V7 in Zig that reads its own compiled assembly to land at 150 GB/s, plus a WebGPU compute lab in the browser. End-to-end demo of the optimisation work I do for clients.
9,840 ns → 210 ps
Python V0 → Zig V7 per pair
7 stages
Naïve Zig → analyzer-driven V7
150 GB/s
V7 peak on Ryzen 9950X3D
WGSL
WebGPU compute kernels in browser
What this site is
This site is itself one of the case studies: an end-to-end demonstration of the kind of optimisation work I do for clients, on a problem small enough to walk a reader through every stage of. The interactive demos at /lab run real benchmarks against real binaries; the writeups explain what each layer costs, and why. The point isn't to flex Zig. It's to give a non-specialist reader an honest sense of where time actually goes when you ask a function to run fast, and what each rung of the optimisation ladder costs in maintenance.
Start with Python
If you ask a typical engineer to compute the great-circle distance between two latitude/longitude pairs for every row in a dataset, the answer is almost always a few lines of Python like this:
# Python (pandas)
import math

def haversine(lat1, lon1, lat2, lon2):
    R = 6371.0  # mean Earth radius, km
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) ** 2
         + math.cos(math.radians(lat1))
         * math.cos(math.radians(lat2))
         * math.sin(dlon / 2) ** 2)
    return 2 * R * math.asin(math.sqrt(a))

# df is an existing DataFrame with lat1, lon1, lat2, lon2 columns (degrees)
df["distance_km"] = df.apply(
    lambda r: haversine(r.lat1, r.lon1, r.lat2, r.lon2),
    axis=1,
)
That's about 9,840 nanoseconds per pair, call it 10 microseconds. On a few million rows it's the function that blocks shipping. Switching to a vectorised NumPy version (drop `.apply`, operate on whole columns) gets you a ~230× win, down to about 40 ns per pair, for free.
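A minimal sketch of what "operate on whole columns" can look like for the function above. The name `haversine_np` and the constant `EARTH_RADIUS_KM` are illustrative, not taken from the benchmarked code, and the timings quoted on this page were not measured against this exact snippet.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # same mean radius as the scalar version


def haversine_np(lat1, lon1, lat2, lon2):
    """Great-circle distance in km for whole arrays of coordinates in degrees."""
    # Convert every column to radians once, as float64 arrays.
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(col, dtype=np.float64))
        for col in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Each call below runs over the whole array in compiled code,
    # so there is no per-row Python function call.
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Usage with the DataFrame columns from the pandas snippet:
# df["distance_km"] = haversine_np(df["lat1"], df["lon1"], df["lat2"], df["lon2"])
```

The arithmetic is identical to the scalar version; the only change is that the loop over rows moves out of the Python interpreter and into NumPy's compiled routines.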
A naïve, single-threaded Zig haversine that uses the language-builtin trig functions runs at ~19.5 nanoseconds per pair on a Ryzen 9 9950X3D. Six iterations of architectural rework later (SIMD, multi-threading, Estrin's scheme on the FMA chains, and finally a critical-path analyzer that reads the compiled binary and finds LLVM emitting mul+add where it could be emitting a single fused multiply-add) the same function runs at ~210 picoseconds per pair across all 16 cores, sustaining ~150 GB/s of haversine throughput. That's roughly another 190× beyond NumPy, a little over two orders of magnitude. Cumulatively, from the pandas `.apply` the average engineer writes to the analyzer-tuned V7 AVX-512 kernel at the end of the journey, the gap is roughly 46,000×.
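For readers who haven't met Estrin's scheme, here is a small sketch of the idea in Python, chosen for readability only; the Zig kernels are not reproduced here. The coefficients (`EXP_TAYLOR`, a degree-7 Taylor series for exp) are purely illustrative and are not the polynomials any of the kernels use. Horner's rule evaluates a polynomial as one serial chain of multiply-adds, each waiting on the previous one. Estrin's scheme pairs terms so that the multiply-adds within each level are independent of one another, which gives a SIMD or superscalar core more work to overlap.

```python
def horner(coeffs, x):
    """Evaluate c0 + c1*x + ... + cn*x**n as one serial multiply-add chain."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c  # every step depends on the previous step
    return acc


def estrin(coeffs, x):
    """Evaluate the same polynomial by combining adjacent pairs level by level."""
    terms = list(coeffs)
    if not terms:
        return 0.0
    power = x
    while len(terms) > 1:
        if len(terms) % 2:
            terms.append(0.0)  # pad so every term has a partner
        # The pairs on one level do not depend on each other.
        terms = [terms[i] + terms[i + 1] * power for i in range(0, len(terms), 2)]
        power = power * power  # x, x**2, x**4, ...
    return terms[0]


# Illustrative coefficients only: 1/k! for k = 0..7.
EXP_TAYLOR = [1.0, 1.0, 1 / 2, 1 / 6, 1 / 24, 1 / 120, 1 / 720, 1 / 5040]

horner_value = horner(EXP_TAYLOR, 0.5)
estrin_value = estrin(EXP_TAYLOR, 0.5)
```

For a degree-7 polynomial, Horner needs seven dependent steps, while this form needs three levels of paired multiply-adds plus the squarings of `x`. The two can differ in the last few bits because the floating-point operations are reordered.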
The interactive runner at /lab/benchmarks invokes the C++, Rust, and AVX-512 Zig binaries live; the deep-dive writeup walks the full Python → AVX-512 → "read the assembly" arc with code at every stage, including the small Python tool that does the critical-path analysis on the compiled binary. The point isn't that Python is bad: it's a great default. The point is that knowing what the journey looks like lets you decide deliberately when to leave Python and when not to.
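To give a feel for what "critical-path analysis" means here, the following is a toy sketch and not the tool from the deep-dive: it does not read a binary. It takes a hand-written instruction list and computes the longest dependency chain. The names (`critical_path`, `UNFUSED`, `FUSED`) are hypothetical, and the latencies in `LATENCY` are placeholder values chosen for illustration, not measurements for any CPU.

```python
# Placeholder latencies in cycles; illustrative only, not measured values.
LATENCY = {"load": 0, "mul": 4, "add": 4, "fma": 4}


def critical_path(instrs):
    """instrs: list of (name, op, [input names]) in program order.

    Returns (length_in_cycles, names along the longest dependency chain).
    """
    finish = {}  # name -> cycle at which the result is ready
    parent = {}  # name -> the input that became ready last
    for name, op, deps in instrs:
        start = max((finish[d] for d in deps), default=0)
        parent[name] = max(deps, key=lambda d: finish[d]) if deps else None
        finish[name] = start + LATENCY[op]
    node = max(finish, key=finish.get)  # instruction that finishes last
    length = finish[node]
    chain = []
    while node is not None:
        chain.append(node)
        node = parent[node]
    return length, chain[::-1]


INPUTS = [("c2", "load", []), ("x", "load", []), ("c1", "load", []), ("c0", "load", [])]

# Two Horner steps, (c2*x + c1)*x + c0, as separate mul and add instructions.
UNFUSED = INPUTS + [
    ("m1", "mul", ["c2", "x"]),
    ("a1", "add", ["m1", "c1"]),
    ("m2", "mul", ["a1", "x"]),
    ("a2", "add", ["m2", "c0"]),
]

# The same two steps with each mul+add pair fused into one instruction.
FUSED = INPUTS + [
    ("f1", "fma", ["c2", "x", "c1"]),
    ("f2", "fma", ["f1", "x", "c0"]),
]

unfused_length, unfused_chain = critical_path(UNFUSED)
fused_length, fused_chain = critical_path(FUSED)
```

Under these placeholder latencies the fused chain is half as long as the unfused one. That is the kind of gap such an analysis is meant to surface: the compiler output is correct either way, but one version has a longer chain of instructions that must wait on each other.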
WebGPU compute lab
The /lab/gpu page runs three compute shaders in your browser: matrix multiplication, image convolution, and a parallel Fibonacci reduction. Each benchmark batches all iterations into a single GPU command buffer; submitting per iteration would be dominated by submission overhead and would not reflect any realistic workload's performance.
Shaders are plain WGSL; the runner is a small TypeScript wrapper around the WebGPU API. CPU baselines are computed on a Fly.io-hosted Zig binary so the comparison is fair: same workload sizes, same precision, same data. For small problem sizes the CPU wins on latency (kernel launch overhead dominates); for large ones the GPU wins by orders of magnitude. The page makes that crossover legible.
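The crossover can be reasoned about with a very simple cost model: the CPU pays a per-item cost only, while the GPU pays a fixed launch overhead plus a much smaller per-item cost. The sketch below is that model and nothing more. The function names and every parameter are hypothetical, and it ignores data transfer, occupancy, and cache effects that a real measurement would include.

```python
import math


def cpu_time_ns(n, cpu_ns_per_item):
    """CPU cost under the model: proportional to problem size."""
    return n * cpu_ns_per_item


def gpu_time_ns(n, launch_overhead_ns, gpu_ns_per_item):
    """GPU cost under the model: fixed launch overhead plus per-item work."""
    return launch_overhead_ns + n * gpu_ns_per_item


def crossover_size(launch_overhead_ns, cpu_ns_per_item, gpu_ns_per_item):
    """Smallest n at which the GPU is at least as fast, or None if it never is."""
    if gpu_ns_per_item >= cpu_ns_per_item:
        return None  # the GPU never amortises its launch overhead
    return math.ceil(launch_overhead_ns / (cpu_ns_per_item - gpu_ns_per_item))
```

Below the crossover size the fixed overhead dominates and the CPU wins; above it, the per-item advantage takes over. The same reasoning explains the batching choice: paying the submission overhead once per iteration instead of once per command buffer moves the measured crossover without telling you anything about the kernel itself.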
Deploy
The frontend is Astro 6 with React islands, built with Bun, deployed to Cloudflare Pages. The backend is a multi-stage Docker build: a builder stage installs Zig 0.15, Rust nightly, CMake, protobuf, and abseil; compiles all benchmarks; then the runtime stage copies just the binaries into a slim Python image. Deployment is a single `flyctl deploy`; remote builds run on Fly.io's builders so the local toolchain stays uninvolved.
Why this exists as a case
The Python-to-assembly journey is the same conversation I end up having with most engineers I work with. Knowing the shape of it (what's a 10× win, what's a 1000× win, when does each become available, and what each rung costs in maintenance) is what separates "I optimised it" from "I optimised it deliberately." This case page exists so I can point at a worked example of that conversation when it comes up, with real numbers attached.
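The rung-to-rung ratios can be recomputed from the per-pair timings quoted earlier on this page. The sketch below does only that. The one thing it assumes beyond the page is `BYTES_PER_PAIR`: four 64-bit floats per coordinate pair, which is a guess about the input layout rather than something stated here. The quoted timings are rounded, so the recomputed ratios land near the quoted multipliers rather than exactly on them.

```python
# Per-pair timings in nanoseconds, as quoted on this page.
PANDAS_APPLY_NS = 9840.0
NUMPY_NS = 40.0
ZIG_NAIVE_NS = 19.5
ZIG_V7_NS = 0.210  # 210 picoseconds

ratios = {
    "pandas .apply -> NumPy": PANDAS_APPLY_NS / NUMPY_NS,
    "NumPy -> Zig V7": NUMPY_NS / ZIG_V7_NS,
    "naive Zig -> Zig V7": ZIG_NAIVE_NS / ZIG_V7_NS,
    "pandas .apply -> Zig V7": PANDAS_APPLY_NS / ZIG_V7_NS,
}

# Assumption: each pair is four float64 values (lat1, lon1, lat2, lon2).
BYTES_PER_PAIR = 4 * 8

# Bytes per nanosecond is numerically the same as GB/s (10**9 bytes per second).
throughput_gb_s = BYTES_PER_PAIR / ZIG_V7_NS

for label, ratio in ratios.items():
    print(f"{label}: {ratio:,.0f}x")
print(f"V7 input throughput under the 32-byte assumption: {throughput_gb_s:.0f} GB/s")
```

Under that layout assumption, the 210 ps figure and the ~150 GB/s figure describe the same measurement from two directions: time per pair, and bytes of input consumed per second.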
Related work
Two acts on a parish-admin platform
Provstiskyen: optimising then rewriting a 10-year SaaS
Two acts on a 44,000-line R Shiny platform that runs about half of Denmark's deaneries. Act I cut cold start from 50s to 18s and deploys from 35min to 80s on the existing codebase. Act II, once the architecture itself was the ceiling, is a full rewrite onto FastAPI, Polars, and React: performant by default, far more maintainable, with the legacy app retiring as the last module ports across.
DevOps Optimization Fullstack
Enterprise CI cluster
Jenkins pipeline right-sizing
Took 2,600 production pipelines from 8% to ~60% memory utilisation by building per-build telemetry, then designing bins from real percentile data. Same hardware, several multiples more headroom, no rewrite of any pipeline required.
DevOps Observability Optimization