Voice assistants: caching as a conversational necessity
Voice gives you about a second before silence feels broken. Model round-trips don't fit. Cache hits do — with room to spare for the speech stack.
Voice interfaces live under a brutal latency budget: ASR, understanding, synthesis, and playback all share roughly a second before users perceive the assistant as broken. A multi-second LLM round-trip blows the budget on its own. Voice traffic is also command-like and dominated by repeated intents, so paying that cost in time and tokens on every repeat is doubly wasteful.
Crowkis hands voice stacks their latency budget back: semantic hits return in under a millisecond, leaving nearly the whole second for speech processing. 'What's on my calendar', 'play the news', 'how do I get downtown' and their endless phrasings become instant, while only genuinely novel requests wait on a model.
Reuse only when meaning, structure, confidence, and trust all agree.
Voice phrasing variability is the semantic layer's home turf. Spoken language is messier than typed, ASR adds its own noise, and exact matching is hopeless. Matching on intent, template, and embedding was built for exactly this looseness, and the confidence gate keeps looseness from turning into wrongness.
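The layered match described above can be pictured with a small sketch. This is not Crowkis's implementation: the `Entry` fields, the `lookup` function, the filler-word list, the 0.6 threshold, and the toy bag-of-words embedding are all assumptions made for illustration, and a real deployment would use a proper sentence encoder and labels from its own NLU stage.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass

# Hypothetical filler words that ASR transcripts often carry.
FILLERS = {"um", "uh", "please", "hey", "the", "a", "my"}


def normalize(utterance: str) -> list[str]:
    """Lowercase, strip punctuation, and drop filler words."""
    tokens = re.findall(r"[a-z']+", utterance.lower())
    return [t for t in tokens if t not in FILLERS]


def embed(tokens: list[str]) -> Counter:
    """Toy bag-of-words vector standing in for a real sentence embedding."""
    return Counter(tokens)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


@dataclass
class Entry:
    intent: str      # meaning: label assumed to come from an upstream NLU step
    template: str    # structure: slot shape of the request (assumed label)
    utterance: str   # phrasing that produced the cached response
    response: str
    trusted: bool = True  # trust: entry has not been flagged as unsafe to reuse


def lookup(intent, template, utterance, entries, threshold=0.6):
    """Return a cached response only if every gate agrees, else None."""
    query = embed(normalize(utterance))
    best, best_score = None, 0.0
    for entry in entries:
        # Meaning, structure, and trust are hard filters.
        if entry.intent != intent or entry.template != template or not entry.trusted:
            continue
        score = cosine(query, embed(normalize(entry.utterance)))
        if score > best_score:
            best, best_score = entry, score
    # Confidence gate: a loose match below the threshold is treated as a miss.
    if best is not None and best_score >= threshold:
        return best.response
    return None
```

With a single cached entry for "what's on my calendar", a noisy transcript such as "um what's on the calendar please" normalizes to the same tokens and is served from cache, while "play the news" under the same labels scores zero and falls through to the model.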
The bottom line
Streamed cache hits complete the illusion: CGETSTREAM feeds the TTS chunk by chunk, so the assistant starts speaking immediately and naturally. Users call the product 'snappy.' The dashboard calls it a 90th-percentile latency you didn't have to engineer twice.
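To make the chunk-by-chunk handoff concrete, here is a minimal sketch of a streamed hit feeding a synthesizer. The real CGETSTREAM interface is not shown in this article, so `cget_stream` below is a hypothetical stand-in that slices an already-cached string, and `synthesize` is a placeholder for whatever callable the TTS engine exposes.

```python
from typing import Callable, Iterable, Iterator


def cget_stream(cached_text: str, words_per_chunk: int = 4) -> Iterator[str]:
    """Hypothetical stand-in for a streamed cache read.

    Yields the cached response in small word-aligned chunks instead of
    returning it all at once.
    """
    words = cached_text.split()
    for i in range(0, len(words), words_per_chunk):
        yield " ".join(words[i:i + words_per_chunk])


def speak_stream(chunks: Iterable[str], synthesize: Callable[[str], None]) -> int:
    """Pass each chunk to the synthesizer as soon as it arrives.

    Returns the number of chunks spoken. Playback can begin after the
    first chunk rather than after the full response.
    """
    count = 0
    for chunk in chunks:
        synthesize(chunk)
        count += 1
    return count
```

Because `cget_stream` is a generator, the first chunk reaches the synthesizer before the rest of the response has been sliced, which is the property that lets speech start immediately. In a test, passing `spoken.append` as `synthesize` collects the chunks in order for inspection.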