Prompt injection meets your cache: the attack nobody threat-modeled
Injected instructions in one response become served truth for every similar query — unless the cache can smell an answer that doesn't answer.
Prompt injection analysis usually ends at the model: attacker smuggles instructions, model misbehaves once, incident closed. Add a naive cache and the blast radius changes category — the injected output gets stored, and the cache faithfully serves it to every future query in the semantic neighbourhood. One successful injection becomes standing infrastructure for the attacker.
Crowkis's coherence stage exists for precisely this shape of attack: injected content characteristically fails to actually answer the question it's stored under, and coherence — weighted 0.30, the joint-heaviest stage — scores that mismatch before storage. Content heuristics add a second look, and neighbourhood agreement flags answers that contradict their semantic peers.
Five stages score every write before it can ever be served.
Source trust closes the loop on persistence: a writer whose outputs keep getting refused accumulates ledger history and faces an ever-higher bar, so injection campaigns burn their own access. Every refusal is logged with its stage — your security team sees the attempt pattern, not just the absence of damage.
The bottom line
The principle generalizes: any system that stores model output and re-serves it needs an immune response, because the model will eventually be made to say something hostile. Ours is five stages deep and keeps receipts.