economicsMay 20, 2026· 3 min read

Before you downgrade the model, cache the good one

Cost pressure pushes teams toward cheaper, dumber models. Caching offers the opposite trade: keep frontier quality, pay small-model prices on the traffic that repeats.

The standard cost-reduction playbook reaches for a smaller model: accept a quality haircut everywhere, save per token. It's a real lever, but examine what it trades, every answer gets worse, including the novel ones where quality is the product. You've made the bill smaller by making the product smaller.

In plain words: A cheaper model makes every answer worse. A cache makes repeated answers free while keeping the good model. Try the free option first.

Caching inverts the trade. The repeated head of traffic, where the frontier model's answer is already banked, serves at effectively zero cost, full quality. Spend concentrates on the novel tail, where the good model earns its price. Blended cost falls toward small-model territory while quality stays at the frontier where users can tell the difference.

model upgrades without the cold start

1
gpt-4o cache · warm
2
canary: slice of traffic · on the new model
3
quality holds?
4
migrate entries with leasing
5
new model · cache still warm
6
stay, nothing lost

The upgrade is a workflow, not a leap of faith.

The math compounds with hit rate: at a 50% hit rate, your effective per-query cost halves with zero quality loss, equivalent to a 50%-cheaper model that's somehow exactly as good. No distillation project, no eval suite, no regression risk. One container.

The bottom line

Enterprise's arbitrage router lets you do both deliberately: easy intents route to cheap models, hard ones to the frontier, all behind the cache. But the order of operations matters, cache first, downgrade only what the cache and the router prove can survive it.