Writing · 16 May 2026 · 22 min read

Haversine: three Pythons, seven Zigs, and the knowledge tax

Three Python implementations across the expertise gradient, then seven Zig steps into vectorised + threaded native. The last step is analyzer-driven: read the compiled assembly, find that LLVM isn't emitting FMA, fix it in the source, and pool the threads instead of spawning per call. Lands at 150 GB/s on a 9950X3D.

Optimization

All numbers in this post are measured on a Ryzen 9 9950X3D (Zen 5, 16 cores / 32 threads, AVX-512, 128 MiB L3).

The knowledge tax

Every performance optimisation you ship is paid for twice. Once at writing time: the hours you spent profiling, the unfamiliar tool you had to learn, the trick that worked here but you can’t reuse anywhere else. And again at maintenance time: the next person who needs to touch this code has to understand the trick before they can change it safely. Cumulatively across the lifetime of the codebase, that’s the knowledge tax of fast code.

Most code doesn’t pay it. Most code is a request handler that runs once per request, a script that runs once a day, a transformation that processes a few thousand rows. The naïve implementation is correct, readable, and the time it takes to run is rounding error against the network round-trips on either side of it.

But some code is a hot loop. It runs millions of times per request, or ten thousand times per second on the queue. For that code, the knowledge tax pays for itself.

The interesting engineering question is which rung of the ladder your code actually needs to be on, and the answer depends on who’ll maintain it, how often it runs, and what its real cost is right now. This writeup walks all the way up the ladder, with real numbers at each step, so the shape of it is visible.

The example: the haversine great-circle distance between two latitude / longitude pairs. Three Python implementations, then a Zig port that progresses through scalar polynomial, SIMD, multi-threading, Estrin’s scheme for the FMA chains, and finally an analyzer-driven pass over the compiled assembly that takes the kernel to ~150 GB/s by fusing mul-then-add into FMA, replacing the comparison-based sign branches with XOR-on-the-sign-bit, and replacing the per-call thread spawn with a persistent pool.

Python: three versions

All three are pure Python, no Cython, no Numba, no hand-rolled C extensions. The differences are which standard tools the author reached for.

V0: what a non-performance programmer writes

import math
import pandas as pd

def haversine_scalar(lat1, lon1, lat2, lon2):
    R = 6371.0
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * R * math.asin(math.sqrt(a))

df["distance_km"] = df.apply(
    lambda r: haversine_scalar(r.lat1, r.lon1, r.lat2, r.lon2),
    axis=1,
)

Idiomatic. Correct. Reviewable by any Python programmer who’s never heard of numpy. ~9,100 nanoseconds per pair on the 9950X3D. That’s roughly 110,000 pairs per second; a million-row dataframe takes about nine seconds.
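For a sense of how a per-pair number like that gets measured, here is a minimal harness. The fixture (one million uniform random coordinate pairs) and the use of time.perf_counter are illustrative assumptions, not the post’s actual benchmark code:

import time
import numpy as np
import pandas as pd

# Hypothetical fixture: one million random coordinate pairs.
rng = np.random.default_rng(0)
n = 1_000_000
df = pd.DataFrame({
    "lat1": rng.uniform(-90, 90, n),  "lon1": rng.uniform(-180, 180, n),
    "lat2": rng.uniform(-90, 90, n),  "lon2": rng.uniform(-180, 180, n),
})

t0 = time.perf_counter()
df["distance_km"] = df.apply(
    lambda r: haversine_scalar(r.lat1, r.lon1, r.lat2, r.lon2),
    axis=1,
)
elapsed = time.perf_counter() - t0
print(f"{elapsed * 1e9 / n:.0f} ns/pair")   # ≈9,100 ns/pair on the post's 9950X3D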

The maintenance cost of this code is zero. It reads like prose. The runtime cost is everything.

V1: first numpy reflex

The change that’s a ~210× speed-up:

import numpy as np

def haversine_numpy(lat1, lon1, lat2, lon2):
    R = 6371.0
    lat1r = np.radians(lat1)
    lon1r = np.radians(lon1)
    lat2r = np.radians(lat2)
    lon2r = np.radians(lon2)
    dlat = lat2r - lat1r
    dlon = lon2r - lon1r
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1r) * np.cos(lat2r) * np.sin(dlon / 2) ** 2
    return 2 * R * np.arcsin(np.sqrt(a))

Same formula, same number of arithmetic operations, but numpy runs the loop in its C implementation rather than the Python interpreter. ~43 nanoseconds per pair. The million-row dataframe now finishes in 43 milliseconds.
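The call site changes with it: instead of applying the function row by row, you hand it whole columns. A minimal sketch, reusing V0’s column names:

df["distance_km"] = haversine_numpy(
    df["lat1"].to_numpy(), df["lon1"].to_numpy(),
    df["lat2"].to_numpy(), df["lon2"].to_numpy(),
)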

The maintenance cost is slightly higher (the reader has to know what np.sin does and that it operates on whole arrays) but numpy is on every working Python developer’s mental keyboard. This is the cheapest ~210× in software engineering, and it’s also the single most valuable thing to know about Python for performance work. Almost everyone underestimates how much of their slow code is fixed by it.

V2: Polars, planned to parallelise

A modern Python codebase reaches for Polars when it’s doing serious dataframe work. Polars dispatches through its Rust compute kernels (Arrow-backed columnar layout, SIMD where available), and (the part that actually matters for this kernel) its lazy planner schedules independent sub-expressions onto separate threads.

That second point is the optimisation. If you write the haversine as one giant chained expression, Polars sees a single dependency chain and runs it on one core. To make Polars’s parallelism work for you, you have to split the computation into stages of independent expressions and let the planner schedule them in parallel:

import polars as pl

def haversine_polars(df: pl.DataFrame) -> pl.Series:
    R = 6371.0
    return (
        df.lazy()
        .with_columns([
            pl.col("lat1").radians().alias("lat1r"),
            pl.col("lat2").radians().alias("lat2r"),
            pl.col("lon1").radians().alias("lon1r"),
            pl.col("lon2").radians().alias("lon2r"),
        ])
        .with_columns([
            ((pl.col("lat2r") - pl.col("lat1r")) / 2).sin().alias("s_dlat"),
            ((pl.col("lon2r") - pl.col("lon1r")) / 2).sin().alias("s_dlon"),
            pl.col("lat1r").cos().alias("c_lat1"),
            pl.col("lat2r").cos().alias("c_lat2"),
        ])
        .select(
            (2 * R * (
                pl.col("s_dlat") * pl.col("s_dlat")
                + pl.col("c_lat1") * pl.col("c_lat2")
                * pl.col("s_dlon") * pl.col("s_dlon")
            ).sqrt().arcsin()).alias("distance_km"),
        )
        .collect()["distance_km"]
    )

~28 nanoseconds per pair on the 9950X3D. ~1.5× faster than numpy on the same hardware, computing the identical formula.

The win is the second with_columns stage: the four expressions inside it are independent (no expression reads another’s output within the same stage), so Polars schedules them across separate cores. numpy’s np.sin and np.cos calls have to run sequentially on one thread — the Python interpreter holding the GIL across the call. Polars releases the GIL inside its Rust engine and parallelises within the engine.

Note the cost of getting this win: knowing that an expression graph parallelises but an expression chain doesn’t, knowing to insert intermediate stages, knowing to call .lazy() and .collect() instead of using eager mode. The first time I wrote this kernel I chained everything into a single expression and got the same speed as numpy — the second with_columns was the change. The knowledge tax here is real, and it’s higher than “use numpy” but well below “write SIMD.” This is the rung most modern Python data work should aim for.
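For contrast, a chained version with no intermediate stages looks something like this — an illustrative reconstruction, not the original code. The planner sees one dependency chain and runs it on one core, which is how you end up at numpy speed:

def haversine_polars_chained(df: pl.DataFrame) -> pl.Series:
    R = 6371.0
    lat1, lat2 = pl.col("lat1").radians(), pl.col("lat2").radians()
    dlat_half = (lat2 - lat1) / 2
    dlon_half = (pl.col("lon2").radians() - pl.col("lon1").radians()) / 2
    a = dlat_half.sin().pow(2) + lat1.cos() * lat2.cos() * dlon_half.sin().pow(2)
    # One .select(), no staged with_columns: the planner sees a single chain.
    return (
        df.lazy()
        .select((2 * R * a.sqrt().arcsin()).alias("distance_km"))
        .collect()["distance_km"]
    )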

The Python recap

Version                       | ns/pair | Cumulative speedup | Knowledge required
V0 — df.apply + math          | ~9,100  | —                  | None
V1 — np.sin over arrays       | ~43     | ~210×              | “Use numpy”
V2 — Polars, planner-parallel | ~28     | ~325×              | “Stage independent exprs in with_columns”

The V0 → V1 cliff is the cheapest performance win in software. If your code is on V0 and it matters, fix it. The V1 → V2 step is a smaller jump (~1.5×) but it requires a real Polars-shaped mental model: knowing that the planner only parallelises independent sub-expressions, knowing to stage them with with_columns, knowing to use lazy + collect. Worth the climb if your team already speaks Polars; not worth it just to brag about using it.

Changing language is less dramatic than you’d think

Now we change languages. A naïve Zig port of the same formula:

const std = @import("std");

const TO_RADIANS: f64 = std.math.pi / 180.0;
const EARTH_RADIUS_KM: f64 = 6371.0;

pub fn haversine_naive(lat1: f64, lon1: f64, lat2: f64, lon2: f64) f64 {
    const lat1_rad = lat1 * TO_RADIANS;
    const lon1_rad = lon1 * TO_RADIANS;
    const lat2_rad = lat2 * TO_RADIANS;
    const lon2_rad = lon2 * TO_RADIANS;
    const dlon = lon2_rad - lon1_rad;
    const dlat = lat2_rad - lat1_rad;
    const a = @sin(dlat / 2.0) * @sin(dlat / 2.0)
        + @cos(lat1_rad) * @cos(lat2_rad)
        * @sin(dlon / 2.0) * @sin(dlon / 2.0);
    const c_val = 2.0 * std.math.atan2(@sqrt(a), @sqrt(1.0 - a));
    return EARTH_RADIUS_KM * c_val;
}

This is V1 Zig: scalar, single-threaded, language-builtin trig. On the 9950X3D: ~19.5 ns/pair. Modestly faster than numpy’s ~43 single-threaded ns/pair, and slightly faster than Polars at ~28. But Polars is using all 32 hardware threads against V1 Zig’s one; per core, V1 Zig is doing roughly an order of magnitude more work than Polars.

This is the part of the curve people consistently get wrong. “We rewrote it in Rust and it’s only 2× faster than numpy, what gives?” What gives is that the language wasn’t the bottleneck. libm was. Memory bandwidth was. Branch prediction was. Changing language only matters if it lets you unlock something the previous language couldn’t express.

For haversine, the thing the previous language can’t express is SIMD across the polynomial. That’s the real cliff.

V2 → V5 Zig: the cliff that needed real expertise

Each step assumes the previous one. Together they take the same arithmetic numpy is doing at 43 ns/pair single-threaded down to sub-nanosecond per pair on AVX-512 hardware with all cores engaged.

V2 Zig: polynomial scalar (the prerequisite)

Replace @sin, @cos, @sqrt (libm calls) with FMA polynomial chains:

inline fn sine_core(a: f64, b: f64, c_val: f64) f64 {
    const x = a * b + c_val;
    const x2 = x * x;
    var r: f64 = 0.00000000000000000000000000000000000000000000000938148839499;
    r = r * x2 + -0.000000000000000000000000000000000000000166533453693;
    r = r * x2 + 0.00000000000000000000000000000000000843325124113;
    // ... 7 more FMAs ...
    r = r * x2 + 1.0;
    return r * x;
}

As scalar code, V2 Zig is actually slower than V1 Zig: ~24.4 ns/pair on the 9950X3D versus V1’s ~19.5. libm’s hand-tuned @sin beats a generic nine-coefficient polynomial in scalar code; the polynomial only earns its keep in the next step, where it becomes the thing the vectoriser can use.
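The idea behind the polynomial, stripped of the production coefficients, is a plain Horner chain in x². A Python sketch with ordinary Taylor coefficients for sin (illustrative only — the kernel above uses its own, much more carefully chosen constants):

import math

# Taylor coefficients of sin(x)/x in powers of x², highest degree first:
# x - x³/3! + x⁵/5! - …  →  [1/13!, -1/11!, …, -1/3!, 1]
coeffs = [(-1) ** k / math.factorial(2 * k + 1) for k in range(6, -1, -1)]

def sin_poly(x: float) -> float:
    x2 = x * x
    r = coeffs[0]
    for c in coeffs[1:]:
        r = r * x2 + c        # one FMA per step in the native kernel
    return r * x

print(sin_poly(0.5), math.sin(0.5))   # agree to ~1e-16 for small |x|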

V3 Zig: 4-wide SIMD (AVX2)

A @Vector(4, f64) evaluates the polynomial on four pairs at once. @sin couldn’t be vectorised. The polynomial can. ~7.1 ns/pair: roughly 3.4× faster than V2 scalar.

V4 Zig: 8-wide SIMD (AVX-512)

Same kernel, @Vector(8, f64). On Zen 5 with native AVX-512 this is almost a clean doubling of throughput, at ~3.1 ns/pair. On hardware without AVX-512 the compiler splits into 2× AVX2 and the win shrinks correspondingly. The SIMD path is hardware-target-aware in a way that the higher-level Python paths aren’t.

V5 Zig: SIMD + multithreading

The AVX-512 kernel distributed across CPU cores. The work fans out across 16 cores (32 SMT threads on the 9950X3D), each thread runs its AVX-512 polynomial on its slice, results join at the end. ~0.78 ns/pair, or 40 GB/s of haversine throughput. Sub-nanosecond per pair.
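Two small calculations worth making explicit: how the million pairs get sliced across workers, and where the ~40 GB/s figure comes from (four f64 inputs per pair). A sketch — the slicing scheme is an assumption about how a kernel like this would typically split work, not the post’s exact code:

n_pairs, n_workers = 1_000_000, 16

# Even slices, remainder spread over the first few workers.
base, extra = divmod(n_pairs, n_workers)
slices, start = [], 0
for w in range(n_workers):
    end = start + base + (1 if w < extra else 0)
    slices.append((start, end))
    start = end

# Bandwidth implied by 0.78 ns/pair: 4 input doubles = 32 bytes per pair.
print(32 / 0.78)   # ≈ 41 GB/s — the "40 GB/s of haversine throughput"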

Cumulative gap from numpy (~43 ns) to V5 Zig (~0.78 ns) on a 16-core AVX-512 desktop: ~55×. That is the dramatic drop. It’s not in the language change; it’s in the SIMD and threading work the language change unlocked.

The knowledge tax of getting to V5: understanding scalar polynomial approximations, knowing what FMA does, recognising when SIMD applies, being able to spawn and join threads safely, understanding when AVX-512 is even available. This is an engineer who has spent real time inside a debugger with perf open.

V6: Estrin’s scheme for the FMA chains

V5 is FMA-latency-bound. The polynomial inside the SIMD kernel is a Horner chain: eight sequential r = r * x2 + ck operations for the sine polynomial, eighteen for the arcsine. Each FMA depends on the previous, so the critical path of the inner loop is (chain length) × FMA_latency.

On Zen 4 / modern Intel, FMA latency is 4 cycles and throughput is 2 / cycle. A serial chain leaves FMA throughput on the table. The fix is Estrin’s scheme: split the Horner chain into an even-indexed sub-chain and an odd-indexed sub-chain that the CPU can issue in parallel, then combine.

For the 9-coefficient sine polynomial:

// Original: 8 sequential FMAs in x²
//   var r = c0; r = r*x2 + c1; r = r*x2 + c2; ... r = r*x2 + c8;
//   return r * x;
//
// Estrin's: two independent chains in x⁴ = x²·x², combined at the end.

const x4 = x2 * x2;

// Even chain (coeffs at original indices 0, 2, 4, 6, 8): 4 FMAs
var even: Vf = s(c0);
even = even * x4 + s(c2);
even = even * x4 + s(c4);
even = even * x4 + s(c6);
even = even * x4 + s(c8);

// Odd chain (coeffs at original indices 1, 3, 5, 7): 3 FMAs
var odd: Vf = s(c1);
odd = odd * x4 + s(c3);
odd = odd * x4 + s(c5);
odd = odd * x4 + s(c7);

return (even + x2 * odd) * x;  // 1 combine FMA + 1 multiply

Same arithmetic count: 8 polynomial FMAs, no more, no less. But the critical-path length drops from 8 sequential FMAs to max(4, 3) + 1 = 5, and the CPU’s two FMA execution ports can both stay busy through the chain. For the 19-coefficient arcsine polynomial, 18 sequential FMAs become a 9-FMA chain, an 8-FMA chain, and one combine, for a critical path of max(9, 8) + 1 = 10. Roughly half the original latency.
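The reordering is easy to sanity-check outside the kernel. A Python sketch with arbitrary coefficients (illustrative, not the kernel’s constants), laid out exactly like the even/odd split above:

def horner(coeffs, x2):
    # coeffs = [c0, c1, ..., cn], highest-degree coefficient first
    r = coeffs[0]
    for c in coeffs[1:]:
        r = r * x2 + c
    return r

def estrin(coeffs, x2):
    x4 = x2 * x2
    even = horner(coeffs[0::2], x4)   # c0, c2, c4, c6, c8 in x⁴
    odd = horner(coeffs[1::2], x4)    # c1, c3, c5, c7     in x⁴
    return even + x2 * odd

coeffs = [0.1 * (k + 1) for k in range(9)]
x2 = 0.123
print(horner(coeffs, x2), estrin(coeffs, x2))
# Mathematically identical; in floating point they can differ in the last bit,
# which is why the bench cross-checks V5 against V6 on every run.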

This was a real change. I implemented it, ran V5 and V6 side by side, and verified the floating-point results are bit-identical on this fixture (the change is a pure reordering of operations, so the sum is mathematically equivalent; the bench prints V5_sample=204.692910587 V6_sample=204.692910587 diff=0 to confirm it every run). On the 9950X3D, V6 ran ~1.24× faster than V5 in clean averaged runs (0.38 ns/pair vs 0.48 ns/pair, median of fifteen runs). The absolute gap is small because both versions are already past the single-nanosecond barrier, but the relative win is the latency-budget reclaim from breaking the Horner chain, which would scale linearly with a longer polynomial.

The maintenance cost of this code is the cost of explaining Estrin’s scheme to the next person who looks at it. The comment block above the function is most of the documentation, and the test that confirms V5 and V6 produce bit-identical output is the rest.

V7: reading the compiler’s output

V6 was the last rung the source could climb. To go further I had to look at what LLVM actually emitted for V6’s inner loop, find out why we weren’t already hitting the theoretical ceiling, and either tell the compiler to do better or change the surrounding harness.

I wrote a small analyzer for this. It parses objdump output, picks out the inner SIMD loop, builds a register-dependency DAG weighted by Zen 5 latencies, and reports the two quantities that matter: the throughput-bound cycle budget (peak IPC, no stalls) and the latency-bound critical path (longest chain through the DAG). The mismatch tells you which side you can attack.
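The tool itself isn’t reproduced here, but the core of one is small. A simplified Python sketch of the two reported numbers — instruction-mix throughput budget versus dependency-chain latency — with a toy latency table and no loop extraction; the real analyzer models Zen 5 port assignments, masking, and memory operands, none of which this does:

import re
from collections import defaultdict

# Toy (latency, reciprocal throughput) table — illustrative values only.
COSTS = {"vfmadd": (4, 0.5), "vmulpd": (3, 0.5), "vaddpd": (3, 0.5), "vxorpd": (1, 0.5)}

def analyze(objdump_lines):
    ready = defaultdict(float)          # register -> cycle its value is ready
    critical_path, throughput = 0.0, 0.0
    for line in objdump_lines:
        m = re.search(r"\t(v\w+)\s+(.*)", line)
        if not m:
            continue
        mnemonic, operands = m.group(1), m.group(2)
        key = next((k for k in COSTS if mnemonic.startswith(k)), None)
        regs = re.findall(r"%zmm\d+", operands)
        if key is None or not regs:
            continue
        latency, rthroughput = COSTS[key]
        srcs, dest = regs[:-1], regs[-1]            # AT&T syntax: destination last
        start = max((ready[r] for r in srcs), default=0.0)
        ready[dest] = start + latency
        critical_path = max(critical_path, ready[dest])
        throughput += rthroughput
    return throughput, critical_path

# throughput, critical_path = analyze(open("worker.objdump"))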

Running it on the V6 binary surfaced three things I would not have guessed from reading the Zig source.

Finding 1. Zig 0.15.1 (via its LLVM backend) emits no vfmadd*pd at all in this kernel. Every r = r * x4 + ck in the V6 source landed as a separate vmulpd (3-cycle latency) followed by a vaddpd (3-cycle latency). Here’s the arcsine Estrin chain inside V6’s compiled worker, straight out of objdump:

; V6 arcsine chain: mul, then add, for each Estrin step
vmulpd  %zmm26,%zmm24,%zmm26     ; r *= x2 ...
vaddpd  %zmm8, %zmm26,%zmm26     ; ... + c   (one Estrin step = 6 cycles)
vmulpd  %zmm26,%zmm26,%zmm27     ; x4 = x2 * x2
vmulpd  %zmm27,%zmm27,%zmm28
vmulpd  %zmm9, %zmm28,%zmm29     ; even chain in x4
vmulpd  %zmm14,%zmm28,%zmm30     ; odd  chain in x4 (interleaved)
vaddpd  %zmm10,%zmm29,%zmm29
vmulpd  %zmm29,%zmm28,%zmm29
vmulpd  %zmm30,%zmm28,%zmm30
vaddpd  %zmm11,%zmm29,%zmm29
vaddpd  %zmm16,%zmm30,%zmm30
...

The whole V6 worker has 73 × vmulpd, 51 × vaddpd, zero vfmadd. On Zen 5 each polynomial step is a 3 + 3 = 6-cycle chain step where a single fused vfmadd231pd would be 4 cycles, and the instruction count doubles. The OoO front-end has more uops to feed than it should.

Finding 2. The arcsine Estrin chain still dominates the critical path. 137 cycles per 8-pair batch, of which ~100 cycles are the arcsine polynomial. Throughput-bound budget at peak IPC: ~155 cycles. So we were close to the throughput limit (which makes sense; Estrin’s whole point is to reach the throughput ceiling instead of being latency-bound), but every extra mul + add pair we shipped instead of one fused FMA was wasted budget on both axes.

Finding 3. The thread-spawn cost was a huge fraction of wallclock. V5 and V6 both call std.Thread.spawn 16 times on every invocation, then join. At 0.4 ns/pair × 1M pairs that’s ~400 μs total wallclock, and 16 spawns at ~25 μs each is ~400 μs of clone()-syscall overhead sitting right in the timing window. The actual SIMD work inside that wallclock is much less than the wallclock suggests.
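The spawn cost is easy to get a feel for: Python’s threading module also creates one OS thread per Thread object, so a crude timing loop shows the same order of magnitude. This illustrates the class of overhead, not a measurement of Zig’s std.Thread.spawn:

import threading
import time

def noop():
    pass

t0 = time.perf_counter()
threads = [threading.Thread(target=noop) for _ in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - t0
print(f"{elapsed * 1e6 / 16:.1f} µs per spawn+join")   # typically tens of µs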

V7’s three changes follow directly from those three findings.

Change 1: force the FMA fusion. Zig has a built-in for this: @mulAdd(Vf, a, b, c) is the explicit intrinsic and LLVM honours it unconditionally. Wrap it for readability:

inline fn fma(a: Vf, b: Vf, c: Vf) Vf {
    return @mulAdd(Vf, a, b, c);
}

// Each polynomial step is now ONE fused instruction:
even = fma(even, x4, s(c2));
even = fma(even, x4, s(c4));
// ...

The same arcsine Estrin chain after rebuilding with V7:

; V7 arcsine chain: one fused FMA per polynomial step
vfmadd213pd  %zmm11,%zmm9,%zmm10   ; even *= x4; even += c    (4 cycles)
vfmadd213pd  %zmm17,%zmm9,%zmm3    ; odd  *= x4; odd  += c
vfmadd213pd  %zmm12,%zmm9,%zmm10
vfmadd213pd  %zmm20,%zmm9,%zmm3
vfmadd213pd  %zmm18,%zmm9,%zmm10
vfmadd213pd  %zmm21,%zmm9,%zmm3
vfmadd213pd  %zmm16,%zmm9,%zmm10
vfmadd213pd  %zmm10,%zmm8,%zmm3    ; combine even + x2 * odd
...

Half the instructions, and every step that used to be 6 cycles on the critical path is now 4. Analyser histogram on the V7 worker: 55 vfmadd*pd, 18 vmulpd, no vaddpd left in the polynomial loops.

Change 2: branchless sign manipulation. V6 emitted vcmpltpd + vblendmpd (about a 5-cycle bubble) just to pick between +RAD_C and -RAD_C based on the sign bit of lat. The result is the magnitude with the opposite sign of lat, which is one XOR away from the input’s sign bit. Single vxorpd, no mask register.

inline fn signed_const(lat: Vf, magnitude: Vf) Vf {
    // Vu is the unsigned-integer vector matching Vf; SIGN_MASK is @splat(@as(u64, 1) << 63).
    const sign = @as(Vu, @bitCast(lat)) & SIGN_MASK; // lat's sign bit, as a bitmask
    // XOR with the flipped sign bit: positive lat yields -magnitude, negative lat yields +magnitude.
    return @bitCast(@as(Vu, @bitCast(magnitude)) ^ (sign ^ SIGN_MASK));
}
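The bit trick is easier to stare at in isolation. A scalar Python rendering of the same XOR via struct, checked against the branchy version it replaces:

import struct

SIGN_BIT = 1 << 63

def f64_bits(x: float) -> int:
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def bits_f64(b: int) -> float:
    return struct.unpack("<d", struct.pack("<Q", b))[0]

def signed_const(lat: float, magnitude: float) -> float:
    # XOR magnitude with lat's sign bit, flipped: positive lat -> -magnitude,
    # negative lat -> +magnitude. No comparison, no branch, no select.
    flip = (f64_bits(lat) & SIGN_BIT) ^ SIGN_BIT
    return bits_f64(f64_bits(magnitude) ^ flip)

for lat in (12.5, -33.0):
    branchy = -1.0 if lat >= 0 else 1.0
    assert signed_const(lat, 1.0) == branchy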

Change 3: persistent worker pool. The 16 workers spawn once on process startup, park on a condition variable, and are signalled per dispatch. Per-call overhead drops from ~400 μs of pthread_create and pthread_join to a handful of futex calls. The pool is pinned to physical cores only (16 workers, not 32 SMT threads); SMT siblings contend for the same FMA execution ports, and the analyzer told me the body was throughput-bound, which is the same signal that says “don’t oversubscribe.”

const WorkerPool = struct {
    workers: []Worker,
    wakeup: std.Thread.Condition,
    done:   std.Thread.Condition,
    generation: u64,  // incremented per job; workers see fresh wakeups
    pending:    usize, // counts workers still running this job
    // ...
};
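The dispatch protocol is language-independent. Here is the same generation/pending idea as a condition-variable pool in Python — a structural sketch of the design described above, not a translation of the Zig:

import threading

class WorkerPool:
    def __init__(self, n_workers, body):
        self.n_workers = n_workers
        self.body = body                      # body(worker_index, job) does one slice
        self.lock = threading.Lock()
        self.wakeup = threading.Condition(self.lock)
        self.done = threading.Condition(self.lock)
        self.generation = 0                   # bumped once per dispatched job
        self.pending = 0                      # workers still running the current job
        self.job = None
        for i in range(n_workers):
            threading.Thread(target=self._run, args=(i,), daemon=True).start()

    def _run(self, i):
        seen = 0
        while True:
            with self.lock:
                while self.generation == seen:      # park until a fresh generation
                    self.wakeup.wait()
                seen, job = self.generation, self.job
            self.body(i, job)                       # work happens outside the lock
            with self.lock:
                self.pending -= 1
                if self.pending == 0:
                    self.done.notify_all()

    def run(self, job):
        with self.lock:
            self.job = job
            self.pending = self.n_workers
            self.generation += 1
            self.wakeup.notify_all()                # wake every parked worker
            while self.pending:                     # block until all slices finish
                self.done.wait()

A real pool also needs a shutdown path and error propagation; the post’s version additionally pins workers to the 16 physical cores rather than the 32 SMT threads.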

After all three changes, the analyzer on the V7 worker:

# V6 worker
# Throughput-bound cycles: 155.5
# Latency-bound critical path: 137 cycles
# → throughput-bound

# V7 worker
# Throughput-bound cycles: 133.0
# Latency-bound critical path: 68 cycles
# → throughput-bound, but lots more headroom

Throughput improved 15 %; latency-bound chain halved. End-to-end with the pool absorbing thread-spawn overhead: V7 lands at ~0.29 ns/pair median, 0.21 ns/pair best case. That’s ~1.31× faster than V6 typical, ~1.8× best case, and ~1.66× faster than V5. Peak sustained throughput is 150 GB/s of haversine across the 1M-pair working set.

V5_sample = 204.692910587 km
V6_sample = 204.692910587 km
V7_sample = 204.692910587 km   diff5v6 = 0   diff5v7 = 0

The cross-check still prints zero. V5, V6, and V7 agree to the last bit of the f64 on this fixture (FMA fusion shifts the rounding semantics in principle but doesn’t actually change the result here).

Could V7 go further if I wrote the inner loop in raw assembly?

This is the natural next question, and the honest answer is: for almost every codebase, no, and for this one, not by much.

The win in V7 wasn’t writing assembly. It was reading assembly, finding the inefficiency (the missing FMA fusion), and reaching back into the Zig source with a one-line built-in that fixed it. The same pattern applies to every other language with a serious optimising backend: write the C/Zig/Odin/Rust in a shape the compiler recognises, read the output to verify it did the thing you expected, and reach back into the source when it didn’t.

Where hand-written assembly does pay for itself is the small set of codebases whose entire reason for existing is the inner loop: FFmpeg codecs, BLAS / cuBLAS kernels, glibc string routines, OpenSSL crypto primitives. Three properties tend to be true of all of them:

  1. The kernel is small enough to fit in one engineer’s head completely; maintenance is rare and is done by the same expert who wrote it.
  2. The performance ceiling is well-known (cycles per byte, peak FLOPs) and the compiler-generated version misses it by enough to matter commercially.
  3. The platform target is fixed: a hand-tuned AVX-512 kernel for one microarchitecture is shipped alongside a different hand-tuned kernel for the next, with runtime dispatch.

This kernel doesn’t have those properties. Compiler output is within a few percent of what hand-assembly could do (the throughput-bound budget the analyzer reports is the ceiling), maintenance has to be possible for any engineer comfortable with Zig, and the target is “modern AMD with AVX-512” but should not regress on AVX2-only Intel. So V7 stops at “read assembly, write Zig that compiles to good assembly.”

The general rule I’d hand to anyone walking this ladder: read your compiler’s output the moment you start to suspect it’s the bottleneck, and let the analyzer tell you where the slack is. Hand-rolling assembly past that point is a maintenance debt with diminishing returns unless your library is the bottleneck of a million other applications.

The maintenance cost of V7

V6’s maintenance burden was “the next reader has to understand Estrin’s scheme.” V7 layers three more pieces of expertise on top:

  • Why @mulAdd and not a*b + c. Anyone refactoring “for clarity” would replace fma(a, b, c) with a * b + c and immediately regress to V6’s mul + add cost. A test could lock the optimisation in place (grep the binary for FMA count) but that’s another knob to maintain.
  • The XOR sign trick is bitwise IEEE-754 reasoning. It is exactly correct, but a reader who doesn’t already think in sign-bit-of-double will look at the function and not be sure what it computes.
  • The worker pool is a parallel-programming object. Lifecycle management, condition-variable correctness, shutdown drain, all of it. This is real concurrency code now, not “spawn and join.”

Every one of these is the kind of code a refactor in good faith can quietly undo. V5 and V6 you could explain to a competent engineer in ten minutes; V7 needs an hour and a willingness to read objdump.

The shape of the ladder

All numbers measured on a Ryzen 9 9950X3D, 16 cores / 32 threads, AVX-512. V5 / V6 / V7 are 5-iteration averaged wallclock; single-shot V5 / V6 are ~30 % higher because of pthread_create jitter (V7 doesn’t pay that on every call; see the V7 section).

Step      | Approach                                    | ns/pair | Speedup vs prev | Knowledge required
Python V0 | df.apply + math                             | ~9,100  | —               | None
Python V1 | np.sin over arrays                          | ~43     | ~210×           | “Use numpy”
Python V2 | Polars, planner-parallel                    | ~28     | ~1.5×           | “Stage independent exprs in with_columns”
Zig V1    | naïve, scalar                               | ~19.5   | ~1.4×           | Compile a native binary
Zig V2    | polynomial scalar                           | ~24.4   | ~0.8× (slower!) | Math approximations
Zig V3    | 4-wide SIMD (AVX2)                          | ~7.1    | ~3.4×           | @Vector, FMA
Zig V4    | 8-wide SIMD (AVX-512)                       | ~3.1    | ~2.3×           | AVX-512 availability
Zig V5    | + multithreading                            | ~0.48   | ~6.5×           | Thread pools, work split
Zig V6    | + Estrin’s                                  | ~0.38   | ~1.24×          | FMA latency, critical-path analysis
Zig V7    | + FMA fusion + branchless + persistent pool | ~0.29   | ~1.31×          | Reading objdump, IEEE-754 bit-tricks, pthread lifecycle

The Zig V1 → V2 row deserves an asterisk. Scalar polynomial is slower than libm’s @sin in single-pair scalar code, by ~25%. The polynomial isn’t a win on its own; it’s a prerequisite for everything below it, because the SIMD vectoriser cannot fold libm calls. V2’s regression is the cost of the next three rungs; the win starts at V3.

Three things to notice.

The biggest free win is one rung up. Going from df.apply to vectorised dataframe maths is the cheapest ~210× in software. If you remember one thing from this post, it’s that.

The Polars win is structural, not magical. Polars beats numpy here by ~1.5× because its planner schedules the four independent trig expressions across CPU cores. numpy can’t: the Python interpreter holds the GIL across np.sin and np.cos calls, so they run sequentially. The Polars win evaporates if you write a single chained expression instead of staging independent ones via with_columns. The expertise is in writing expressions the planner can parallelise.

The dramatic drop is SIMD + threading, not language alone. Decent Python (V2 Polars at ~28 ns/pair) is ~58× slower than V5 Zig SIMD-and-threaded code (~0.48 ns/pair), and the cliff is V3–V5, not V0→V1 or V1→Zig. The Zig win comes from vectorisation across the polynomial plus parallelism across cores: things Python can’t reach. That expertise is expensive, and worth it only when the hot loop justifies it.

The last 1.5× has a vertical maintenance cliff. V5 → V6 was Estrin’s scheme: one paper to read, one source comment, V5/V6 are bit-identical. V6 → V7 was three independent expertises stacked: reading the compiler’s assembly to find the missing FMA fusion, IEEE-754 sign-bit tricks for branchless comparisons, and a persistent worker pool with its own lifecycle. The compute speedup is real (~1.31× typical, up to ~1.8×) but every line of V7 is paid for at maintenance time, and every refactor within five years has to know the analyzer exists or it silently regresses. This is where the “is the hot loop hot enough” question gets sharp.

When to climb which rungs

  • If the function is called rarely or operates on small inputs, stay at V0 or V1. The runtime cost is rounding error and the maintenance cost is zero.
  • If you’re doing analytical work over millions of rows interactively, get to V1 / V2. That’s where pandas, numpy, and Polars live, where most data work happens, and where 99% of useful performance lives for analytics.
  • If you’ve identified a single hot loop in a production service and profiled it, V3 / V4 in native code might pay for itself. The hot loop is the function that runs millions of times per request. The rest of the service stays in Python.
  • If your entire workload is the hot loop (physics simulation, cryptography, real-time signal processing, financial backtesting), climb all the way to V5 / V6, accept the maintenance cost, and isolate that code so the rest of the codebase doesn’t have to read it.
  • V7 is for one team in a hundred. If you measure your workload and the extra 1.3× is genuinely worth the on-call burden (HFT books, ML inference kernels, anything where the hot loop is the product) then the analyzer + FMA fusion + persistent pool work earns its keep. For almost every other shop, V6 is the right ceiling and V7 is a hobby.

The mistake is climbing rungs your workload doesn’t need, or refusing to climb the rungs it does. Both are bad engineering. The judgement call is where on the ladder this particular code lives, and the answer depends on who’ll maintain it, how often it runs, and what its actual cost is right now.

Run the benchmarks yourself at /lab/benchmarks. The Tachyon case page has the rest of the context: how the WebGPU compute lab fits in, how the site itself is deployed, and why I treat this project as the worked-example version of the optimisation work I do on production codebases.