Provstiskyen: optimising then rewriting a 10-year SaaS

The starting point

Provstiskyen is a Danish administration platform for parish councils: the bookkeeping, reporting, and appropriations workflow that runs about half of the country's deaneries. It had been built and run single-handed for about ten years, in R and Shiny, with ShinyProxy hosting one R process per user on a 16-core / 64 GB host. It worked, and it had steady paying customers. Its problem was one of success at scale: every new user added a fixed per-user resource cost, and the cold-start experience was rough.

I came on in late 2023 as its first engineer, formally from January 2024, originally to move the platform to Kubernetes. The remit grew. Over the next two years the work split cleanly into two acts. Act I made the existing R Shiny platform cheaper, faster, and more reliable, without rewriting it. Act II, once the architecture itself had become the ceiling, was a full rewrite onto a modern stack built to carry the platform another decade. They are two different kinds of engineering, and the judgement that connects them is the point of this case: optimise the system you have first, and rewrite only once you have proven it is the architecture, not the code, that is holding you back.

Act I: a cheaper, faster platform on the existing stack

Kubernetes migration (completed April 2024)

The flat 16-core host was replaced with Google Kubernetes Engine. Pods scale up on demand and scale to zero when nobody is logged in. The application didn't change yet, but the bill did: from a fixed monthly cost regardless of usage, to paying only for what the platform actually serves. Same app, same code, much lower floor.

Kubernetes solved cost and elasticity. Cold start was still the visible problem. Users felt every second of the 50-second login wait, and deploys took long enough that we shipped monthly instead of daily. The next stretch, through 2024 and into 2025, was a sustained optimisation pass on the legacy stack.

Pre-warmed pod pool

ShinyProxy normally spins up a fresh R process on user login. That process is the 50-second cost. I changed the cluster topology to keep a small pool of warm pods ready ahead of demand: when a user logs in, an already-running pod is claimed instantly and a new warm one starts in the background. R's boot time drops out of the login path, so claiming a pod takes only as long as the load-balancer redirect. The trade-off is real (you pay for warm pods that aren't being used yet), but the pool is small relative to total capacity and the UX win is dramatic.
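
The claim-then-replenish behaviour described above can be sketched as a toy in-memory model. This is an illustration under stated assumptions, not the cluster configuration actually used: the class name WarmPool, the pod naming, and the synchronous replenish step are all hypothetical, and a real pool would start replacement pods in the background.

```python
from collections import deque
from itertools import count


class WarmPool:
    """Toy model of a pre-warmed pool (hypothetical names throughout)."""

    def __init__(self, target_size):
        self.target_size = target_size
        self.ready = deque()   # pods that are already running and unclaimed
        self.claimed = {}      # user -> pod
        self._ids = count(1)
        self.replenish()

    def _start_pod(self):
        # Stand-in for asking the cluster to boot a pod (the slow step in the text).
        return f"pod-{next(self._ids)}"

    def replenish(self):
        # In a real cluster this would run off the login path; here it is synchronous.
        while len(self.ready) < self.target_size:
            self.ready.append(self._start_pod())

    def claim(self, user):
        # Fast path: hand over a running pod. Fall back to a cold start if the pool is empty.
        pod = self.ready.popleft() if self.ready else self._start_pod()
        self.claimed[user] = pod
        self.replenish()
        return pod


pool = WarmPool(target_size=2)
first = pool.claim("alice")
```

The cost side of the trade-off is visible in the model: target_size pods are always running whether or not anyone logs in.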

Base image: 35-minute builds to 80 seconds

Every deploy of the legacy app reinstalled the entire R-package dependency tree from source: about 35 minutes per build, which made shipping anything during the workday painful. I split the build in two: a base image with R and all package dependencies installed once (rebuilt only when the dependency manifest changes), and a thin app layer on top that contains only the source code. Cold base-image rebuilds still take 35 minutes; routine app rebuilds run in roughly 80 seconds. That is a 26× drop on the hot path (2,100 seconds down to 80), and it unblocked the deploy cadence. We went from shipping monthly to shipping multiple times per day without trying.
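
One common way to implement "rebuilt only when the dependency manifest changes" is to derive the base-image tag from a hash of the manifest. The sketch below assumes that approach; the source does not say how the rebuild is triggered, and the function names, the tag format, and the example manifest contents are hypothetical.

```python
import hashlib


def base_image_tag(manifest_text):
    """Derive a base-image tag from the manifest contents (hypothetical scheme)."""
    digest = hashlib.sha256(manifest_text.encode("utf-8")).hexdigest()
    return f"base-{digest[:12]}"


def needs_base_rebuild(manifest_text, existing_tags):
    """The slow build is needed only when no image exists for this exact manifest."""
    return base_image_tag(manifest_text) not in existing_tags


# Speed-up quoted in the text: 35 minutes against roughly 80 seconds.
speedup = (35 * 60) / 80

manifest = "shiny 1.8\ndplyr 1.1\n"  # made-up manifest contents
tag = base_image_tag(manifest)
```

With a scheme like this, a routine source-code change leaves the tag unchanged, so only the thin app layer is rebuilt.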

Flame graph, Polars, and a single stored procedure

With the pod pool in place, the user-facing wait was no longer "wait for R to start"; it was "wait for the newly-claimed pod to load the app's data into memory before serving the first page", about 32 seconds. I profiled that startup path with R's flame-graph tooling and found two things worth fixing:

Many separate database round-trips during initialisation. Each was small and harmless on its own, but they were sequential, and sequential network round-trips against MariaDB stack up fast. I consolidated them into a single stored procedure that returns every table the app needs in one response.
Two compute-heavy R functions on the startup path: financial data transformations the app needed before any UI could render. I ported both to Polars (the Rust-based dataframe library, called from R via its arrow integration). More than 10 seconds saved on those two functions alone.

Together those drop pod-claim-to-first-page from 32 seconds to 18. Combined with the pod pool already delivering an instant pod, the end-to-end cold start the user experienced went from 50 seconds to 18, on the existing codebase, before a single line of the rewrite existed.
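
The round-trip consolidation can be illustrated with a stand-in connection that only counts network round-trips. This is a sketch, not the real data layer: FakeConnection, its method names, the table names, and the 5 ms round-trip figure are assumptions made for the example, and query execution time is ignored.

```python
class FakeConnection:
    """Stand-in for a database connection that counts round-trips (hypothetical)."""

    def __init__(self, tables):
        self._tables = tables
        self.round_trips = 0

    def query_table(self, name):
        # One request and one response: a single round-trip per table.
        self.round_trips += 1
        return self._tables[name]

    def call_load_all(self):
        # Models one stored-procedure call returning every result set together.
        self.round_trips += 1
        return dict(self._tables)


def load_sequentially(conn, names):
    return {name: conn.query_table(name) for name in names}


def estimated_wait_ms(round_trips, rtt_ms):
    # Network wait only, under the assumption that each round-trip costs rtt_ms.
    return round_trips * rtt_ms


tables = {"accounts": [{"id": 1}], "budgets": [{"id": 2}], "postings": [{"id": 3}]}

before = FakeConnection(tables)
data_before = load_sequentially(before, list(tables))

after = FakeConnection(tables)
data_after = after.call_load_all()
```

Both paths return the same data; only the number of times the app waits on the network changes, and that number grows with every table added to the sequential version.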

Act II: the rewrite (started July 2025, first release January 2026)

After a year of optimising R Shiny, the architecture itself was the ceiling. Per-user R processes don't scale to ten times the user count. The R ecosystem for web-first concerns (auth, tenancy, caching) is thinner than the Python or TypeScript equivalents. And a 44,000-line R codebase (one large app.R plus surrounding modules and helper scripts) was slowing every new feature. By mid-2025 I was convinced the platform needed a clean rebuild, and in July I got the go-ahead to start one. I did not argue that R Shiny had been a mistake; it had run a real business for ten years. I argued that the features customers would want over the next decade needed a foundation it could not give them.

The stack I chose, with the reasoning:

Layer	Choice	Why
Backend	FastAPI + Polars + Pydantic	Async-friendly, automatic OpenAPI, dataframe operations that beat anything R offers for the analytics workloads. Pydantic validates at the boundaries.
Frontend	React 19 + Vite + TanStack Query/Router	Tailwind and shadcn/ui for the component layer. A typed, modular SPA in place of one monolithic Shiny UI.
Database	MariaDB (Cloud SQL)	Same engine as the legacy app, so a shared production database bridges the two while modules move across.
Cache	DragonflyDB	Drop-in Redis protocol, much higher throughput per node. Cache-aside with Arrow IPC serialisation.
Auth	Auth0	Separate staging and production tenants, so local development can never touch real user data.
Hosting	GKE	A managed control plane buys a tiny team what it could not cheaply build or keep running: high availability, automatic security patches, node auto-repair. Self-hosting that reliability is a full-time job, and one thunderstorm should not be able to take the product down.

The win is maintainability, not raw speed

The rewrite needed no clever performance work at all. Polars on FastAPI is fast enough by default: response times stay sub-second across every migrated module without anyone tuning for it. What it bought instead was a codebase that is larger than the 44k-line original yet far easier to live in. It is modular, typed, and independently testable, and a new feature now lands in hours rather than after a wrestling match with one giant app.R. Act I was targeted optimisation on a system I could not replace yet: a pre-warmed pod pool, a Polars port of the hot startup path, one stored procedure in place of a dozen sequential round-trips. Act II was the opposite discipline: choosing an architecture solid enough that none of that was necessary. Both are the same call made twice: put the effort where the leverage actually is.

Architecture

Dumb router, smart service. Routers do HTTP semantics only (request parsing, status codes, content types). Services own business logic and call data loaders. Data loaders are layered: BaseDataLoader abstract, MariaDBDataLoader for SQL, CachedDataLoader wrapping it with DragonflyDB cache-aside. Each layer is independently testable; tests mock at the data-loader boundary so real service and router code runs against known fixtures.
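
A minimal sketch of that layering, using only the standard library. BaseDataLoader and CachedDataLoader are the names from the text; everything else is hypothetical: FixtureDataLoader stands where MariaDBDataLoader would sit, a plain dict stands in for DragonflyDB, and BudgetService, get_total_route, and the load(key) signature are invented for the example.

```python
from abc import ABC, abstractmethod
from typing import Optional


class BaseDataLoader(ABC):
    @abstractmethod
    def load(self, key):
        """Return the rows for a key (hypothetical interface)."""


class FixtureDataLoader(BaseDataLoader):
    """Test double at the data-loader boundary, in place of the SQL-backed loader."""

    def __init__(self, rows_by_key):
        self._rows_by_key = rows_by_key
        self.calls = 0

    def load(self, key):
        self.calls += 1
        return self._rows_by_key[key]


class CachedDataLoader(BaseDataLoader):
    """Cache-aside: read the cache first, on a miss load from the inner loader and store."""

    def __init__(self, inner, cache: Optional[dict] = None):
        self._inner = inner
        self._cache = {} if cache is None else cache  # dict stands in for the real cache

    def load(self, key):
        if key in self._cache:
            return self._cache[key]
        rows = self._inner.load(key)
        self._cache[key] = rows
        return rows


class BudgetService:
    """Business logic; it knows only the loader interface."""

    def __init__(self, loader):
        self._loader = loader

    def total(self, parish_id):
        return sum(row["amount"] for row in self._loader.load(parish_id))


def get_total_route(service, parish_id):
    """Router-style function: status code and response shape only."""
    try:
        return 200, {"parish_id": parish_id, "total": service.total(parish_id)}
    except KeyError:
        return 404, {"detail": "unknown parish"}


fixture = FixtureDataLoader({"p1": [{"amount": 100}, {"amount": 250}]})
service = BudgetService(CachedDataLoader(fixture))
first_response = get_total_route(service, "p1")
second_response = get_total_route(service, "p1")
```

The second request is served from the cache, so the fixture loader is hit once, while the service and router code that runs is the real code rather than a mock.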

One React tree, app-wide. Under TanStack Router the whole platform is a single React root. Auth0, the React Query cache, and the nanostore state are created once at the root route, so they survive navigation: the cache stays warm, sidebar and filter state persist, and route changes are client-side and sub-100ms.

Snapshot-immutable analysis runs. Each Analyse run persists its input configuration and its output together in an analysis_runs table (config as JSON, results gzip-compressed), so any two runs can be diffed to explain why their numbers differ. The new engines reproduce the legacy R calculations module by module, each checked against a fixture from the legacy R app as it lands, with the last few still in progress.
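
The snapshot idea can be sketched with an in-memory SQLite database standing in for MariaDB. Only the table name analysis_runs, the JSON config, and the gzip-compressed results come from the text; the column names, helper functions, and example values are assumptions.

```python
import gzip
import json
import sqlite3


def open_store():
    # SQLite in memory is a stand-in here; the platform itself uses MariaDB.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE analysis_runs ("
        "id INTEGER PRIMARY KEY, config TEXT NOT NULL, results BLOB NOT NULL)"
    )
    return conn


def save_run(conn, config, results):
    # Config is stored as JSON text; results are JSON-encoded, then gzip-compressed.
    blob = gzip.compress(json.dumps(results, sort_keys=True).encode("utf-8"))
    cur = conn.execute(
        "INSERT INTO analysis_runs (config, results) VALUES (?, ?)",
        (json.dumps(config, sort_keys=True), blob),
    )
    return cur.lastrowid


def load_run(conn, run_id):
    config, blob = conn.execute(
        "SELECT config, results FROM analysis_runs WHERE id = ?", (run_id,)
    ).fetchone()
    return json.loads(config), json.loads(gzip.decompress(blob))


def diff_configs(a, b):
    # Keys whose values differ between two runs, as (value_in_a, value_in_b).
    keys = sorted(a.keys() | b.keys())
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


conn = open_store()
run_a = save_run(conn, {"year": 2025, "rate": 0.02}, {"total": 100})
run_b = save_run(conn, {"year": 2025, "rate": 0.03}, {"total": 150})
config_a, results_a = load_run(conn, run_a)
config_b, results_b = load_run(conn, run_b)
changed = diff_configs(config_a, config_b)
```

Because each run carries its own inputs, explaining a changed number starts from the config diff rather than from guessing what the settings were at the time.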

The cutover: bifurcated, not a migration weekend

The cutover was not a gradual module-by-module migration. It was a bifurcation. At the first release in January 2026, an NGINX ingress in front of both apps sent everything to the new platform except the Analyse module, which kept routing to the legacy R app. Analyse is the rarer, more complex, slower-to-port use case, so it was deliberately left for last. Users never saw a migration weekend; they saw the whole product move to the new stack while one module stayed where it was.
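
The routing rule itself is small enough to state as a function. This is a Python model of the decision, not the NGINX configuration; the "/analyse" path prefix and the upstream names are assumptions, since the text does not give the actual routes.

```python
# Hypothetical path prefix for the Analyse module.
LEGACY_PREFIXES = ("/analyse",)


def pick_upstream(path):
    """Return which app serves a request path during the bifurcated phase."""
    for prefix in LEGACY_PREFIXES:
        # Match the prefix itself or anything beneath it, but not "/analyse-help".
        if path == prefix or path.startswith(prefix + "/"):
            return "legacy-shiny"
    return "new-platform"


routes = {p: pick_upstream(p) for p in ("/analyse", "/analyse/run/7", "/bookkeeping")}
```

In this model, retiring the legacy app amounts to emptying LEGACY_PREFIXES, after which every path resolves to the new platform.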

Where things stand

The full product runs on the new platform today, with over 1,000 tests passing, including a parametrised access-matrix suite that checks every user persona against every endpoint group so no permission regression can ship undetected. The one piece still on the legacy app is Analyse, and it is the active work: several engines are in place, a couple still run a simpler interim model, and each is checked against the legacy R output as it is finished. When the last engine and the cross-module summary land, the NGINX route to the old app is removed and the legacy R Shiny platform is switched off for good. That release is 2.0.
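
An access-matrix check of that kind can be sketched as a cross product of personas and endpoint groups compared against a hand-written policy. The personas, endpoint groups, and the is_allowed stand-in below are all hypothetical; the real suite and its names are not shown in the text.

```python
from itertools import product

PERSONAS = ("admin", "bookkeeper", "read_only")           # hypothetical personas
ENDPOINT_GROUPS = ("reports", "bookkeeping", "settings")  # hypothetical groups

# Expected policy, written out by hand so the check has an independent source of truth.
EXPECTED_ALLOWED = {
    ("admin", "reports"), ("admin", "bookkeeping"), ("admin", "settings"),
    ("bookkeeper", "reports"), ("bookkeeper", "bookkeeping"),
    ("read_only", "reports"),
}


def is_allowed(persona, group):
    """Stand-in for the permission check the API would perform."""
    grants = {
        "admin": {"reports", "bookkeeping", "settings"},
        "bookkeeper": {"reports", "bookkeeping"},
        "read_only": {"reports"},
    }
    return group in grants.get(persona, set())


def access_matrix_failures():
    """Check every persona against every endpoint group; return the mismatches."""
    return [
        (persona, group)
        for persona, group in product(PERSONAS, ENDPOINT_GROUPS)
        if is_allowed(persona, group) != ((persona, group) in EXPECTED_ALLOWED)
    ]


cases = list(product(PERSONAS, ENDPOINT_GROUPS))
failures = access_matrix_failures()
```

Under pytest, the same product could feed pytest.mark.parametrize so that each persona and group pair reports as its own test case.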

What this case shows

Optimise first, rewrite once the architecture is the ceiling. The R Shiny app got a year of focused performance and infrastructure work before any rewrite began. The 50-second login was 18 seconds before a single FastAPI route existed. That year bought the runway, and the rewrite was the right call only once per-user R processes could no longer scale, not before.
The rewrite's value was maintainability, not speed. No Zig, no hand-tuning. A boring, modern, modular stack that is performant by default and that a small team can extend for years.
A 26× build-time win changes how a team works. Going from monthly to daily shipping is not merely 30× faster delivery; it is a different culture. For a small team, a fast build is the most valuable single optimisation available.