Streaming cache hits: instant answers that still feel like typing
Users expect LLM answers to arrive as a typing stream. CGETSTREAM serves cached answers chunk by chunk, so a sub-millisecond hit doesn't break the interface's rhythm.
Caching collides with a UX convention: most LLM interfaces stream tokens, and users have learned to read the typing rhythm as 'the AI is thinking.' Return a cached answer as one instantaneous block and the experience feels uncanny: the seam between hit and miss becomes visible, and a visible seam erodes trust in both halves.
CGETSTREAM and the SDKs' streaming helpers serve hits in chunks of configurable size (chunk_tokens) with a configurable pause between them (delay_ms), so a cache hit walks onto the screen with the same gait as a model response. Your frontend keeps one rendering path; users keep one mental model; the seam disappears.
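To make the pacing idea concrete, here is a minimal sketch of how a cached answer could be replayed in chunks. It is not the SDK's implementation: the function name paced_chunks, the whitespace-based tokenizer, and the injectable sleep parameter are all assumptions for illustration. Only the chunk_tokens and delay_ms names come from the text above.

```python
import re
import time
from typing import Callable, Iterator


def paced_chunks(
    cached_text: str,
    chunk_tokens: int = 4,
    delay_ms: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests can skip real waits
) -> Iterator[str]:
    """Yield a cached answer chunk_tokens tokens at a time, pausing delay_ms between chunks."""
    if chunk_tokens < 1:
        raise ValueError("chunk_tokens must be >= 1")
    # Stand-in tokenizer (assumption): a token is a run of non-whitespace plus
    # any whitespace that follows it, so joining the chunks restores the text exactly.
    tokens = re.findall(r"\S+\s*|\s+", cached_text)
    for start in range(0, len(tokens), chunk_tokens):
        if start > 0:
            sleep(delay_ms / 1000.0)  # pause between chunks, not before the first one
        yield "".join(tokens[start:start + chunk_tokens])
```

A frontend that already renders a model stream could consume this generator through the same code path, which is the point of pacing a hit in the first place.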
Four doors in, one cache, and the model only sees genuinely new questions.
Behind the curtain it composes with real streams: stream_get_or_compute passes a genuine model stream through on a miss, captures it for the write pipeline, and replays the banked version for every future paraphrase. The first asker gets the live stream; everyone after gets the recording, which looks the same on screen and comes from a lookup that takes a fraction of a millisecond.
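The miss-then-replay flow can be sketched as a single generator. This is an illustrative approximation under stated assumptions, not the SDK's stream_get_or_compute: the signature, the plain dict used as a cache, the chunk_chars parameter, and the normalized-string key are hypothetical. In particular, the key here only catches differences in case and spacing, whereas the real feature is described as matching paraphrases.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def stream_get_or_compute(
    cache: Dict[str, str],
    prompt: str,
    compute_stream: Callable[[str], Iterable[str]],
    chunk_chars: int = 16,
) -> Iterator[str]:
    """Replay a cached answer on a hit; on a miss, pass the live stream through and record it."""
    # Stand-in for paraphrase matching (assumption): lowercase and collapse whitespace.
    key = " ".join(prompt.lower().split())

    cached = cache.get(key)
    if cached is not None:
        # Hit: replay the recording in fixed-size slices.
        for start in range(0, len(cached), chunk_chars):
            yield cached[start:start + chunk_chars]
        return

    # Miss: forward each piece to the caller as it arrives, keeping a copy.
    captured: List[str] = []
    for piece in compute_stream(prompt):
        captured.append(piece)
        yield piece

    # Write only after the stream completes, so an abandoned or failed
    # stream never banks a partial answer.
    cache[key] = "".join(captured)
```

Writing to the cache only after the model stream finishes is a deliberate choice in this sketch: if the consumer stops reading midway, the generator is closed before the write and nothing truncated is stored.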
The bottom line
It's a small feature with an honest insight inside: latency wins must be spent carefully in interfaces built around latency. We give you the win and the dimmer switch.