Tachyon

What this site is

This site is itself one of the case studies: an end-to-end demonstration of the kind of optimisation work I do for clients, on a problem small enough that a reader can follow every stage. The interactive demos at /lab run real benchmarks against real binaries; the writeups explain what each layer costs, and why. The point isn't to flex Zig. It's to give a non-specialist reader an honest sense of where time actually goes when you ask a function to run fast, and what each rung of the optimisation ladder costs in maintenance.

Start with Python

If you ask a typical engineer to compute the great-circle distance between two latitude/longitude pairs for every row in a dataset, the answer is almost always a few lines of Python like this:

# Python (pandas)
import math

def haversine(lat1, lon1, lat2, lon2):
    R = 6371.0  # km
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) ** 2
         + math.cos(math.radians(lat1))
         * math.cos(math.radians(lat2))
         * math.sin(dlon / 2) ** 2)
    return 2 * R * math.asin(math.sqrt(a))

df["distance_km"] = df.apply(
    lambda r: haversine(r.lat1, r.lon1, r.lat2, r.lon2),
    axis=1,
)

The pandas .apply version costs about 9,840 nanoseconds per pair; call it 10 microseconds. On a few million rows, that makes it the function that blocks shipping. Switching to a vectorised NumPy version (drop .apply, operate on whole columns) gets you a ~230× win, down to about 40 ns per pair, essentially for free.
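
A minimal sketch of what "drop .apply, operate on whole columns" can look like. The function name haversine_np and the EARTH_RADIUS_KM constant are illustrative choices, not taken from the site's code, and the column names in the usage note are assumed to match the pandas snippet above.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # same mean radius as the scalar version


def haversine_np(lat1, lon1, lat2, lon2):
    """Great-circle distance in km for whole arrays of coordinates (degrees)."""
    # Convert every input column to float64 radians in one pass each.
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=np.float64))
        for v in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Each operation below runs over the full column inside NumPy,
    # so there is no per-row Python function call.
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
```

With a DataFrame shaped like the one above, the call would be something like df["distance_km"] = haversine_np(df["lat1"], df["lon1"], df["lat2"], df["lon2"]). The arithmetic is unchanged; the only difference is that the loop over rows moves out of the Python interpreter.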

A naïve, single-threaded Zig haversine that uses the language's builtin trig functions runs at ~19.5 nanoseconds per pair on a Ryzen 9 9950X3D. Six iterations of architectural rework later, the same function runs at ~210 picoseconds per pair across all 16 cores, sustaining ~150 GB/s of haversine throughput. Those iterations add SIMD, multi-threading, Estrin's scheme on the fused multiply-add (FMA) chains, and finally a critical-path analyzer that reads the compiled binary and finds places where LLVM emits a separate mul and add where it could emit a single FMA. That's roughly another 200×, a little over two orders of magnitude, beyond NumPy. Cumulatively, from the pandas .apply the average engineer writes to the analyzer-tuned V7 AVX-512 kernel at the end of the journey, the gap is roughly 46,000×.
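
For readers who haven't met Estrin's scheme, here is a small Python sketch of the idea. It is not the site's Zig kernel, and the function names and coefficients are made up for illustration. Horner's rule evaluates a polynomial as one serial chain in which every multiply-add waits on the previous one. Estrin's scheme pairs terms so that the multiply-adds within each level are independent of each other, which cuts the dependency depth from about n to about log2(n) and lets a CPU with several FMA units work on them at once. Python shows only the structure, not the speedup.

```python
def horner(coeffs, x):
    """Evaluate sum(coeffs[i] * x**i) as one serial multiply-add chain."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c  # each step depends on the previous result
    return acc


def estrin(coeffs, x):
    """Evaluate the same polynomial by combining terms pairwise, level by level."""
    terms = list(coeffs)
    power = x  # x, then x^2, x^4, ... one squaring per level
    while len(terms) > 1:
        if len(terms) % 2:
            terms.append(0.0)  # pad so every term has a partner
        # All multiply-adds in this list are independent of one another.
        terms = [terms[i] + terms[i + 1] * power
                 for i in range(0, len(terms), 2)]
        power = power * power
    return terms[0] if terms else 0.0


# Hypothetical coefficients: 1 + 2x + 3x^2 + 4x^3 + 5x^4
coeffs = [1.0, 2.0, 3.0, 4.0, 5.0]
print(horner(coeffs, 2.0), estrin(coeffs, 2.0))
```

Both functions compute the same polynomial, but because they round in a different order the results can differ in the last bits for general floating-point inputs. That is one of the maintenance costs of this rung of the ladder.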

The interactive runner at /lab/benchmarks invokes the C++, Rust, and AVX-512 Zig binaries live; the deep-dive writeup walks the full Python → AVX-512 → "read the assembly" arc with code at every stage, including the small Python tool that does the critical-path analysis on the compiled binary. The point isn't that Python is bad: it's a great default. The point is that knowing what the journey looks like lets you decide deliberately when to leave Python and when not to.

WebGPU compute lab

The /lab/gpu page runs three compute shaders in your browser: matrix multiplication, image convolution, and a parallel Fibonacci reduction. Each benchmark batches all iterations into a single GPU command buffer; submitting per iteration would be dominated by submission overhead and would not reflect any realistic workload's performance.

Shaders are plain WGSL; the runner is a small TypeScript wrapper around the WebGPU API. CPU baselines are computed on a Fly.io-hosted Zig binary so the comparison is fair: same workload sizes, same precision, same data. For small problem sizes the CPU wins on latency (kernel launch overhead dominates); for large ones the GPU wins by orders of magnitude. The page makes that crossover legible.

Deploy

The frontend is Astro 6 with React islands, built with Bun, deployed to Cloudflare Pages. The backend is a multi-stage Docker build: a builder stage installs Zig 0.15, Rust nightly, CMake, protobuf, and abseil and compiles all benchmarks; the runtime stage then copies just the binaries into a slim Python image. Deployment is a single flyctl deploy; remote builds run on Fly.io's builders, so nothing depends on the local toolchain.

Why this exists as a case

The Python-to-assembly journey is the same conversation I end up having with most engineers I work with. Knowing the shape of it (what a 10× win looks like, what a 1000× win looks like, when each becomes available, and what each rung costs in maintenance) is what separates "I optimised it" from "I optimised it deliberately." This case page exists so I can point at a worked example of that conversation when it comes up, with real numbers attached.

Related work

Provstiskyen: optimising then rewriting a 10-year SaaS

Jenkins pipeline right-sizing