Introduction
We started the year with a simple question: is our fleet sized right? We ended it with a rather more awkward one: what does 'right' even mean when the workload changes shape every two weeks?
By the numbers
14k hrs
Observed inference telemetry, 2025
70%
Cost reduction not from a better model
P95 → P99
We now size to a budgeted tail, not peak
What the telemetry told us
Across four clients and roughly fourteen thousand hours of inference, the utilisation graphs told a consistent story. The cost-per-useful-answer curve was dominated not by model choice but by how aggressively we batched, cached and routed around slow paths. The biggest wins came from unglamorous infrastructure work.
How we size now
We have stopped sizing to peak and started sizing to a budgeted tail. It is less tidy on a capacity chart and more honest about what the business is actually buying. The finance team, for what it is worth, prefers it.
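To make the difference concrete, here is a minimal sketch of the two sizing rules side by side. The demand samples, the per-GPU throughput, and the function names are all hypothetical, and treating P99 as the budgeted tail is an assumption drawn from the summary figures above rather than a description of our production tooling.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def gpus_needed(demand, per_gpu_capacity):
    """Round up: a fraction of a GPU still means provisioning a whole one."""
    return math.ceil(demand / per_gpu_capacity)


# Hypothetical demand samples in requests per second: mostly steady,
# a few busy intervals, and one extreme spike.
demand_rps = [40] * 95 + [60] * 4 + [200]

# Hypothetical sustained throughput of one GPU, in requests per second.
PER_GPU_RPS = 10

# Sizing to peak: provision for the single worst interval ever observed.
peak_fleet = gpus_needed(max(demand_rps), PER_GPU_RPS)

# Sizing to a budgeted tail: provision for P99 and accept that the
# remaining intervals are queued, shed, or served more slowly.
tail_fleet = gpus_needed(percentile(demand_rps, 99), PER_GPU_RPS)

# Share of intervals where demand exceeds what the tail-sized fleet can serve.
over_budget = sum(d > tail_fleet * PER_GPU_RPS for d in demand_rps) / len(demand_rps)

print(f"peak-sized fleet: {peak_fleet} GPUs")
print(f"tail-sized fleet: {tail_fleet} GPUs")
print(f"intervals over budget: {over_budget:.0%}")
```

The percentile is the budget. Choosing it is a decision about how much tail pain the business will pay to avoid, which is why the number belongs in a conversation with finance and not only on a capacity chart.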
What the data shows
Figure: GPU utilisation, tail versus peak
Figure: Cost contribution by control, not by model
Our cost curve flattened the week we stopped trying to make one agent do everything.
Where we land
We will keep writing these as we find them. If any of this lands close to a problem you are working on, the team is always happy to talk it through.