Engineer on the optimisation arc
Provstiskyen: performance work on a 10-year SaaS
Profiled and fixed the cold-start path on a 44,000-line R Shiny production app: 50-second logins down to 18, and 35-minute deploys down to 80 seconds, all on the existing codebase. The full rewrite that came later was made possible by a year of targeted optimisation work first.
50s → 18s
App startup time
35 min → 80 s
CI build time
~44k
Lines of R in the legacy app
17
Modules migrated to the new platform
The starting point
Provstiskyen is a Danish administration platform for parish councils: the bookkeeping, reporting, and appropriations workflow that runs ~40 of the country's ~100 deaneries. It had been built and maintained for roughly ten years by the founder, single-handed, in R and Shiny, with ShinyProxy hosting one R process per user on a 16-core / 64 GB host. (R is a popular default for business and biology in Denmark; less so for serving a production multi-tenant web app to thousands of users.) It worked. It had steady customers. The problem was success at scale: every new user added a fixed per-user resource bill, and the cold-start experience was rough.
I joined as the first hired engineer in May 2024, originally to help move the platform to Kubernetes. The remit grew from there. Across the next year and a half I led three successive waves of work: an infrastructure migration, a deep performance and build-pipeline optimisation pass on the existing R Shiny app, and finally, once the architecture itself had become the ceiling, a full rewrite onto a modern stack designed to carry the platform another decade. The optimisation work is what matters most about this case. The rewrite is the natural follow-on once the existing system has been measured, profiled, and pushed as far as its architecture allows; the optimisation work would have shipped on its own merits even if the rewrite had never been authorised.
Wave 1: Kubernetes migration (May 2024)
The flat 16-core host was replaced with Google Kubernetes Engine. Pods scale up on demand and scale to zero when nobody is logged in. The architecture didn't change yet, but the bill did: from a fixed monthly cost regardless of usage, to paying only for what the platform actually serves. Same app, same code, lower floor.
Wave 2: Three optimisations on the existing R Shiny app
Kubernetes solved the cost-and-elasticity story. Cold-start was still the visible problem. Users felt every second of the 50-second login wait, and deploys took long enough that we were shipping monthly instead of daily. The next year was a sustained optimisation pass on the legacy stack, before any rewrite.
Pre-warmed pod pool
ShinyProxy normally spins up a fresh R process on user login. That process is the 50-second cost. I changed the cluster topology to keep a small pool of warm pods ready ahead of demand: when a user logs in, an already-running pod is claimed instantly and a new warm one is started in the background. The login latency the user experiences becomes the speed of the load-balancer redirect, not the speed of R booting.
The cost trade-off is real: you pay for the warm pods that aren't being used yet. But the pool size is small relative to total capacity, and the UX win is dramatic.
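The production pool lives in cluster configuration rather than application code, but the claim-and-replenish pattern looks roughly like this. A minimal Python sketch using the Kubernetes client, assuming hypothetical `pool`/`claimed` labels and a `shiny-warm-pool` deployment; none of these names are from the real cluster, and releasing pods when sessions end is left out:

```python
# Hedged sketch of the warm-pool claim pattern, not the production ShinyProxy setup.
# Assumes: pods are created with labels pool=warm, claimed=false; the deployment's
# selector is pool=warm only, so relabelling a claimed pod does not detach it.
from kubernetes import client, config

config.load_incluster_config()  # running inside the cluster
core = client.CoreV1Api()
apps = client.AppsV1Api()

NAMESPACE = "shiny"
WARM_SELECTOR = "pool=warm,claimed=false"
WARM_DEPLOYMENT = "shiny-warm-pool"

def claim_warm_pod() -> str | None:
    """Hand the user an already-running pod; replenish the pool in the background."""
    pods = core.list_namespaced_pod(NAMESPACE, label_selector=WARM_SELECTOR).items
    ready = [p for p in pods if p.status.phase == "Running"]
    if not ready:
        return None  # fall back to the normal cold start

    pod = ready[0]
    # Mark the pod as claimed so no other login grabs it.
    core.patch_namespaced_pod(
        pod.metadata.name,
        NAMESPACE,
        body={"metadata": {"labels": {"claimed": "true"}}},
    )
    # Scale the warm pool up by one so a replacement starts warming immediately.
    scale = apps.read_namespaced_deployment_scale(WARM_DEPLOYMENT, NAMESPACE)
    apps.patch_namespaced_deployment_scale(
        WARM_DEPLOYMENT,
        NAMESPACE,
        body={"spec": {"replicas": scale.spec.replicas + 1}},
    )
    return pod.metadata.name
```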
Base image: 35-minute builds → 80 seconds
Every deploy of the legacy app reinstalled the entire R-package dependency tree from source: about 35 minutes per build, which made shipping anything during the workday painful. I split the Dockerfile into two layers: a base image with R and all package dependencies installed once (rebuilt only when the dependency manifest changes), and a thin app layer on top that contains only the source code.
Cold base-image rebuilds still take 35 minutes; routine app rebuilds run in roughly 80 seconds. That's a 26× drop on the hot path, and it unblocked the deploy cadence. We went from monthly to multiple times per day without trying.
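The split itself lives in the two Dockerfiles; the CI side only has to decide when the slow base build is actually needed. A hedged Python sketch of that decision, with illustrative names (`renv.lock`, `Dockerfile.base`, `Dockerfile.app`, the registry path) rather than the real pipeline:

```python
# Hedged sketch of the CI-side logic, not the production pipeline.
# File, image, and registry names are illustrative.
import hashlib
import subprocess
from pathlib import Path

REGISTRY = "europe-docker.pkg.dev/example/app"  # hypothetical registry path

def manifest_hash() -> str:
    """The base image is keyed by the dependency manifest, not by the app source."""
    return hashlib.sha256(Path("renv.lock").read_bytes()).hexdigest()[:12]

def base_exists(tag: str) -> bool:
    """Ask the registry whether a base image with this manifest hash already exists."""
    return subprocess.run(
        ["docker", "manifest", "inspect", tag], capture_output=True
    ).returncode == 0

def build() -> None:
    base_tag = f"{REGISTRY}/base:{manifest_hash()}"
    if not base_exists(base_tag):
        # Slow path (~35 min): only runs when the dependency manifest changed.
        subprocess.run(["docker", "build", "-f", "Dockerfile.base", "-t", base_tag, "."], check=True)
        subprocess.run(["docker", "push", base_tag], check=True)
    # Hot path (~80 s): thin app layer built FROM the prebuilt base.
    subprocess.run(
        ["docker", "build", "-f", "Dockerfile.app",
         "--build-arg", f"BASE_IMAGE={base_tag}",
         "-t", f"{REGISTRY}/app:latest", "."],
        check=True,
    )

if __name__ == "__main__":
    build()
```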
Flame graph + Polars + a single stored procedure
Startup time was 50 seconds even after the pod-pool work, because every newly-claimed pod had to load the app's data into memory before serving the first page. I profiled the startup path with R's flame-graph tooling and found two things worth fixing:
- Many separate database round-trips during the initialisation phase. Each one was small and harmless on its own, but they were sequential, and sequential network round-trips against MariaDB stack up fast. I consolidated them into a single stored procedure that returns every table the app needs in one response.
- Two compute-heavy R functions on the startup path: financial data transformations the app needed before any UI could render. I ported both to Polars (the Rust-based dataframe library, called from R via its arrow integration). More than 10 seconds saved on those two functions alone.
Combined effect on warm-pool-claimed pods: 50 seconds → 18 seconds of startup, on top of the pre-warming already giving users an instant pod.
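The real functions are the platform's financial transformations, so only the shape of the port is worth showing. A sketch in Python Polars (the production port is called from R) with hypothetical column names: a lazy group-by aggregation plus a join, where the legacy code did row-wise work on startup:

```python
# Hedged illustration of the shape of the Polars port, not the real financial logic.
# Column names (parish_id, account, year, amount, budgeted) are hypothetical.
import polars as pl

def appropriation_summary(ledger: pl.DataFrame, budget: pl.DataFrame) -> pl.DataFrame:
    """Aggregate actuals per parish/account and join them against the budget.

    In the legacy app this ran as row-wise R work on the startup path;
    expressed as Polars expressions it runs vectorised in Rust.
    """
    actuals = (
        ledger.lazy()
        .group_by("parish_id", "account", "year")
        .agg(pl.col("amount").sum().alias("actual"))
    )
    return (
        budget.lazy()
        .join(actuals, on=["parish_id", "account", "year"], how="left")
        .with_columns(
            (pl.col("actual").fill_null(0) - pl.col("budgeted")).alias("deviation")
        )
        .collect()
    )
```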
Wave 3: The platform rewrite (July 2025 onwards)
After a year of optimising R Shiny, it became clear the architecture itself had a ceiling. Per-user processes don't scale to ten times the user count, the R ecosystem for web-first concerns (auth, tenancy, caching) is thinner than the Python or TypeScript equivalents, and a ~44,000-line R codebase (the legacy app is one large app.R plus the surrounding modules and helper scripts) was slowing iteration on new features. I proposed a from-scratch rewrite, and framed the argument as choosing the right tool for where the platform is going, not as a verdict that the old one was wrong.
The stack I chose, with the reasoning:
| Layer | Choice | Why |
|---|---|---|
| Backend | FastAPI + Polars | Async-friendly, automatic OpenAPI, dataframe operations that beat anything R can offer for the analytics workloads. |
| Frontend | Astro 6 + React 19 + TanStack Query | Static-first shell, React islands only where pages need interactivity. Charts via Plotly.js with WebGL fallback for older hardware. |
| Database | MariaDB (Cloud SQL) | Same engine as the legacy app — the shared production database is the strangler-fig bridge while modules migrate one at a time. |
| Cache | DragonflyDB | Drop-in Redis protocol, much higher throughput per node. Cache-aside pattern with Arrow IPC serialisation. |
| Auth | Auth0 | Separate staging vs production tenants so local development can't ever touch real user data. |
| Hosting | GKE | Already in place from Wave 1; no reason to change. |
Migration strategy: Strangler Fig
A big-bang rewrite of a live system is irresponsible at this scale. Instead, the NGINX ingress sits in front of both apps and routes per feature: when a new module goes live, the ingress sends that path to the new platform and the corresponding legacy module is switched off. The legacy app shrinks as the new one grows. There is never a "migration weekend", and users never see the migration happening.
Backend architecture
Dumb router, smart service. Routers do HTTP semantics only (request parsing, status codes, content types). Services own business logic and call data loaders. Data loaders are split: BaseDataLoader abstract → MariaDBDataLoader for SQL → CachedDataLoader wrapping the above with DragonflyDB cache-aside. Each layer is independently testable; tests mock at the data-loader boundary so real service and router code runs against known fixtures.
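A minimal sketch of those three layers. The class names (BaseDataLoader, MariaDBDataLoader, CachedDataLoader) are the ones above; the route, service, method names, cache-key scheme, and TTL are assumptions for illustration, not the production code:

```python
# Hedged sketch of the layering; only the loader class names come from the text above.
from abc import ABC, abstractmethod
import hashlib

import polars as pl
import redis  # DragonflyDB speaks the Redis wire protocol
from fastapi import APIRouter, Depends

# --- Data loaders: the boundary the tests mock -------------------------------
class BaseDataLoader(ABC):
    @abstractmethod
    def load(self, query: str) -> pl.DataFrame: ...

class MariaDBDataLoader(BaseDataLoader):
    def __init__(self, uri: str):
        self._uri = uri  # hypothetical connection URI

    def load(self, query: str) -> pl.DataFrame:
        # Polars runs the SQL through a connector behind read_database_uri.
        return pl.read_database_uri(query, self._uri)

class CachedDataLoader(BaseDataLoader):
    """Cache-aside wrapper: Arrow IPC bytes in DragonflyDB, SQL only on a miss."""
    def __init__(self, inner: BaseDataLoader, cache: redis.Redis, ttl_s: int = 300):
        self._inner, self._cache, self._ttl = inner, cache, ttl_s

    def load(self, query: str) -> pl.DataFrame:
        key = "loader:" + hashlib.sha256(query.encode()).hexdigest()[:16]
        if (hit := self._cache.get(key)) is not None:
            return pl.read_ipc(hit)           # hit: deserialise the Arrow IPC blob
        df = self._inner.load(query)          # miss: fall through to MariaDB
        self._cache.setex(key, self._ttl, df.write_ipc(None).getvalue())
        return df

# --- Service: owns the business logic ----------------------------------------
class BudgetService:  # hypothetical module service
    def __init__(self, loader: BaseDataLoader):
        self._loader = loader

    def totals_by_parish(self) -> list[dict]:
        df = self._loader.load("SELECT parish_id, amount FROM budget_lines")
        return df.group_by("parish_id").agg(pl.col("amount").sum()).to_dicts()

# --- Router: HTTP semantics only ----------------------------------------------
router = APIRouter(prefix="/api/budget")

def get_budget_service() -> BudgetService:
    # Wired to the real cache and database at app startup; overridden in tests.
    raise NotImplementedError

@router.get("/totals")
def budget_totals(service: BudgetService = Depends(get_budget_service)) -> list[dict]:
    return service.totals_by_parish()
```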
Single-page-per-island. Every Astro page renders exactly one client:only="react" component that wraps an AuthenticatedLayout. Two islands on a page would mean two React trees with disjoint Auth0 / QueryClient / nanostore contexts: a class of bug we ran into early and locked out structurally.
Snapshot-immutable analysis runs. The Analyse module (the last major piece, currently in progress) runs four engine modules (buildings, activities, cemeteries, administration) against a shared cached data loader. Each run's input config and output blob are persisted together (gzipped JSON in the analysis_runs.results column), so two runs can be diffed to explain why their outputs differ. Outputs are parity-tested against a fixture from the legacy R app, so we know the new engines agree with the old ones to within tolerance.
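The persist-and-diff shape is small enough to sketch. Only the analysis_runs.results column is from the description above; the blob layout, helper names, and tolerance are assumptions:

```python
# Hedged sketch of the snapshot blob handling, not the production Analyse module.
import gzip
import json
import math
from typing import Any

def pack_results(results: dict[str, Any]) -> bytes:
    """Gzipped JSON blob stored in analysis_runs.results alongside the input config."""
    return gzip.compress(json.dumps(results, sort_keys=True).encode("utf-8"))

def unpack_results(blob: bytes) -> dict[str, Any]:
    return json.loads(gzip.decompress(blob))

def diff_runs(blob_a: bytes, blob_b: bytes) -> dict[str, tuple[Any, Any]]:
    """Explain why two snapshot runs differ: every key whose value changed."""
    a, b = unpack_results(blob_a), unpack_results(blob_b)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }

def parity_ok(new: float, legacy: float, rel_tol: float = 1e-6) -> bool:
    """Parity check of a new-engine value against a legacy-R fixture value."""
    return math.isclose(new, legacy, rel_tol=rel_tol)
```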
Where things stand today
17 modules are migrated and live on the new platform. The Analyse module's four sub-engines are implemented and parity-tested; the remaining work is the parsonages engine and the cross-module summary. 271 backend tests pass (non-integration). The access matrix is locked behind a parametrised test suite covering 11 user personas against 13 endpoint groups (149 tests total) so no permission regression can ship without flagging itself.
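The real matrix covers 11 personas against 13 endpoint groups; the sketch below only shows its shape, with made-up personas, paths, expectations, and an assumed auth_header_for fixture, not the production table:

```python
# Hedged sketch of the access-matrix suite's shape, not the production matrix.
import pytest
from fastapi.testclient import TestClient

from app.main import app  # hypothetical application module

PERSONAS = ["dean", "parish_treasurer", "read_only_auditor"]  # 11 in production
ENDPOINT_GROUPS = {                                            # 13 in production
    "budget": "/api/budget",
    "appropriations": "/api/appropriations",
}
# expected[(persona, group)] -> allowed?  Illustrative values only.
EXPECTED = {
    ("dean", "budget"): True,
    ("dean", "appropriations"): True,
    ("parish_treasurer", "budget"): True,
    ("parish_treasurer", "appropriations"): False,
    ("read_only_auditor", "budget"): True,
    ("read_only_auditor", "appropriations"): False,
}

client = TestClient(app)

@pytest.mark.parametrize("persona", PERSONAS)
@pytest.mark.parametrize("group", sorted(ENDPOINT_GROUPS))
def test_access_matrix(persona: str, group: str, auth_header_for):
    """auth_header_for is an assumed conftest fixture that mints a token per persona.

    A permission regression shows up as a flipped cell in this matrix.
    """
    response = client.get(ENDPOINT_GROUPS[group], headers=auth_header_for(persona))
    assert (response.status_code < 400) == EXPECTED[(persona, group)]
```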
DragonflyDB hit rates run above 95%, and every migrated module responds in under a second. The legacy R app is on a defined sunset path; we expect to switch it off within the next two quarters.
What I think this case shows
Three things, mostly:
- Optimisation first, rewrite later (or never). The R Shiny app got twelve months of focused performance and infrastructure work before any rewrite began. The 50-second login was 18 seconds before a single FastAPI route existed. The optimisation wins were the headline outcome; the rewrite is the natural follow-on once the architecture itself is the bottleneck.
- A 26× build-time win is the most valuable optimisation you can do for a small team, because it changes how the team works. Shipping monthly versus shipping daily isn't 30× faster delivery; it's a different culture.
- The "boring" stack is the right answer for a 10-year platform. FastAPI + Polars + Astro + React isn't novel. It's specifically chosen because it can be built on for many years without bet-the-company technology decisions.
Related work
This site
Tachyon
The same haversine kernel walked from a naïve pandas `.apply` through C++, Rust, Zig SIMD, and finally an analyzer-driven V7 in Zig that reads its own compiled assembly to land at 150 GB/s, plus a WebGPU compute lab in the browser. End-to-end demo of the optimisation work I do for clients.
Optimization DevOps Fullstack
Enterprise CI cluster
Jenkins pipeline right-sizing
Took 2,600 production pipelines from 8% to ~60% memory utilisation by building per-build telemetry, then designing bins from real percentile data. Same hardware, several multiples more headroom, no rewrite of any pipeline required.
DevOps Observability Optimization