Inference at the edge of what we can afford

Introduction

We started the year with a simple question: is our fleet sized right? We ended it with a rather more awkward one: what does 'right' even mean when the workload changes shape every two weeks?

By the numbers

14k hrs

Observed inference telemetry, 2025

70%

Cost reduction not from a better model

P95 → P99

We now size to a budgeted tail, not peak

A hyperscaler hall we walked during capacity planning week.

Eight hours of rack telemetry compressed into a fifteen-second loop.

What the telemetry told us

Across four clients and roughly fourteen thousand hours of inference, the utilisation graphs told a consistent story. The cost-per-useful-answer curve was dominated not by model choice but by how aggressively we batched, cached and routed around slow paths. The biggest wins came from boring, unglamorous infrastructure work.
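To make the shape of that argument concrete, here is a minimal sketch of a cost-per-useful-answer calculation with the model held fixed. Everything in it is an assumption for illustration: the function name, the parameters, the simplification that a batch costs the same GPU time regardless of how full it is, the simplification that cache hits cost nothing, and all of the numbers. None of it is taken from the telemetry.

```python
import math


def cost_per_useful_answer(requests, useful_fraction, batch_size,
                           cache_hit_rate, gpu_seconds_per_batch,
                           gpu_cost_per_second):
    """Toy model: total GPU spend divided by the number of useful answers.

    Assumptions (hypothetical, for illustration only):
    - cache hits are served at zero GPU cost;
    - every batch costs the same GPU time, however full it is.
    """
    if requests <= 0 or batch_size <= 0:
        raise ValueError("requests and batch_size must be positive")
    if not 0 < useful_fraction <= 1:
        raise ValueError("useful_fraction must be in (0, 1]")
    if not 0 <= cache_hit_rate <= 1:
        raise ValueError("cache_hit_rate must be in [0, 1]")

    # Only cache misses reach the GPU, and they are served in batches.
    misses = requests * (1 - cache_hit_rate)
    batches = math.ceil(misses / batch_size)
    total_cost = batches * gpu_seconds_per_batch * gpu_cost_per_second
    return total_cost / (requests * useful_fraction)


# Same hypothetical model and traffic in both cases; only the
# infrastructure settings (batching and caching) differ.
baseline = cost_per_useful_answer(
    requests=1000, useful_fraction=0.8, batch_size=1,
    cache_hit_rate=0.0, gpu_seconds_per_batch=2.0,
    gpu_cost_per_second=0.001)
tuned = cost_per_useful_answer(
    requests=1000, useful_fraction=0.8, batch_size=8,
    cache_hit_rate=0.5, gpu_seconds_per_batch=2.0,
    gpu_cost_per_second=0.001)
```

In this toy model the tuned configuration is cheaper per useful answer without any change to the model. Real batches get slower as they fill and caches have their own cost, so treat the gap as directional only.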

How we size now

We have stopped sizing to peak and started sizing to a budgeted tail. It is less tidy on a capacity chart and more honest about what the business is actually buying. The finance team, for what it is worth, prefers it.
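As a rough illustration of the difference, the sketch below sizes a fleet to a chosen percentile of observed demand and compares that with sizing to the peak. The helper names, the nearest-rank percentile method, the optional headroom multiplier and the demand samples are all assumptions made for the example, not our production tooling or real telemetry.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile of samples, for 0 < pct <= 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def gpus_needed(demand_samples, pct, headroom=1.0):
    """GPUs required to cover demand at the given percentile.

    headroom is a hypothetical multiplicative safety margin.
    """
    return math.ceil(percentile_nearest_rank(demand_samples, pct) * headroom)


# Hypothetical samples of concurrent GPUs in use: mostly steady,
# a short busy stretch, and one extreme spike.
demand = [10] * 95 + [14] * 4 + [40]

peak_size = max(demand)              # sizing to peak
tail_size = gpus_needed(demand, 99)  # sizing to a budgeted P99 tail
```

With these made-up samples, sizing to peak buys capacity for a single spike, while sizing to P99 covers 99 of the 100 samples and leaves the spike as an explicit, budgeted risk. Which percentile to budget for is a business decision, not something the code can answer.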

What the data shows

GPU utilisation, tail versus peak

Mean utilisation curve across four clients, weekday workloads, March 2026.

Cost contribution by control, not by model

Share of the total cost reduction attributed to each infrastructure lever, 2025.

Inference routing map, taken the week we rebuilt the scheduler.

Our cost curve flattened the week we stopped trying to make one agent do everything.

Principal engineer — Public sector body

Where we land

We will keep writing these as we find them. If any of this lands close to a problem you are working on, the team is always happy to talk it through.
