engineering · April 27, 2026 · 3 min read

Streaming cache hits: instant answers that still feel like typing

Users expect LLM answers to arrive as a typing stream. CGETSTREAM serves cached answers chunk by chunk, so a sub-millisecond hit doesn't break the interface's rhythm.

Caching collides with a UX convention: every LLM interface streams tokens, and users have learned to read the typing rhythm as "the AI is thinking." Return a cached answer as one instantaneous block and the experience goes uncanny: the seam between hit and miss becomes visible, and visible seams erode trust in both halves.

In plain words: Cached answers arrive instantly, but they're typed out like the model is answering, so the speed-up never makes the interface feel weird or broken.

CGETSTREAM and the SDKs' streaming helpers serve hits in chunks of configurable size and pacing (chunk_tokens and delay_ms), so a cache hit walks onto the screen with the same gait as a model response. Your frontend keeps one rendering path; users keep one mental model; the seam disappears.
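To make the pacing idea concrete, here is a minimal sketch of chunked replay. It is not the Crowkis implementation: the function name `replay_cached` is hypothetical, and a "token" is approximated by a whitespace-delimited word, which a real tokenizer would not do. Only the parameter names chunk_tokens and delay_ms come from the text above.

```python
import time
from typing import Iterator


def replay_cached(text: str, chunk_tokens: int = 4, delay_ms: float = 20.0) -> Iterator[str]:
    """Yield an already-cached answer in paced chunks.

    Hypothetical helper for illustration only; not a Crowkis SDK function.
    """
    if chunk_tokens < 1:
        raise ValueError("chunk_tokens must be >= 1")
    # Assumption: approximate a token by a space-delimited word.
    words = text.split(" ")
    for start in range(0, len(words), chunk_tokens):
        chunk = " ".join(words[start:start + chunk_tokens])
        is_last = start + chunk_tokens >= len(words)
        if not is_last:
            # Restore the separator that split() removed between chunks.
            chunk += " "
        yield chunk
        # Pause between chunks so the hit arrives with a typing rhythm.
        if not is_last and delay_ms > 0:
            time.sleep(delay_ms / 1000.0)
```

A frontend would consume this the same way it consumes a live model stream, for example `for chunk in replay_cached(answer, chunk_tokens=3, delay_ms=30): render(chunk)`, where `render` stands in for whatever the UI already does with streamed text.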

[Diagram: adoption is one port change. Labeled nodes: (1) your app · redis-py · ioredis · Lettuce, (2) crowkis, (3) claude code · agents, (4) services, (5) your LLM provider.]

Four doors in, one cache, and the model only sees genuinely new questions.

Behind the curtain, replay composes with real streams: on a miss, stream_get_or_compute passes the genuine model stream through, captures it for the write pipeline, and replays the banked version for every future paraphrase. The first asker gets the live stream; everyone after gets the recording, indistinguishable from the original, at a fraction of a millisecond's cost.
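The pass-through-and-capture pattern can be sketched in a few lines. This is an illustration under loud assumptions, not the SDK's stream_get_or_compute: the function name `stream_get_or_compute_sketch` and its signature are hypothetical, the cache is a plain in-memory dict, and lookup is by exact key, whereas the text describes hits on paraphrases.

```python
import time
from typing import Callable, Dict, Iterable, Iterator, List


def stream_get_or_compute_sketch(
    cache: Dict[str, List[str]],
    key: str,
    compute: Callable[[], Iterable[str]],
    delay_ms: float = 20.0,
) -> Iterator[str]:
    """Replay a recorded stream on a hit; pass through and record on a miss.

    Hypothetical stand-in for illustration; exact-key lookup only.
    """
    if key in cache:
        # Hit: replay the recorded chunks with pacing between them.
        for chunk in cache[key]:
            yield chunk
            if delay_ms > 0:
                time.sleep(delay_ms / 1000.0)
        return
    # Miss: forward each live chunk to the caller while capturing it.
    captured: List[str] = []
    for chunk in compute():
        captured.append(chunk)
        yield chunk
    # Store only after the live stream finished, so an abandoned or
    # failed stream never leaves a partial answer in the cache.
    cache[key] = captured
```

Called twice with the same key, the first call drains `compute()` and the second never invokes it. Writing to the cache only after the stream completes is a design choice made for this sketch; the source does not say how the real write pipeline handles interrupted streams.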

The bottom line

It's a small feature with an honest insight inside: latency wins must be spent carefully in interfaces built around latency. We give you the win and the dimmer switch.