Caching what the model saw: multimodal image-plus-text lookups
Vision queries are expensive and repetitive — the same product photo, the same screenshot, asked about again and again. Crowkis caches image-plus-text lookups so a repeated visual question is a hit.
Multimodal calls are among the most expensive a model API offers, and in production they repeat constantly: the same product image run through 'describe this,' the same screenshot asked 'what's the error,' the same chart queried 'summarize this.' A text-only cache can't see the image, so it has nothing to match on and misses every time. Crowkis caches the image-plus-text pair, so the second identical visual question is a hit.
CSET accepts an IMAGE argument alongside the query, and CGET and CIMGGET both match on the combination: the image's content and the accompanying text together. Asking 'what's in this photo' over an identical image therefore returns the cached answer instead of re-running an expensive vision pass, while the same question over a different image is correctly treated as a different key.
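To make the idea of a combined key concrete, here is a minimal sketch in Python. It is not Crowkis's implementation or client API: the names multimodal_key and ImageTextCache are hypothetical, and two choices are assumptions made purely for illustration, namely hashing the exact image bytes and lightly normalizing the text. Crowkis may match images and text quite differently.

```python
import hashlib
from typing import Dict, Optional


def multimodal_key(image_bytes: bytes, text: str) -> str:
    """Build one cache key from the image's content and the accompanying text."""
    # Assumption: "image content" is approximated by a hash of the raw bytes.
    image_digest = hashlib.sha256(image_bytes).hexdigest()
    # Assumption: collapse whitespace and ignore case before hashing the text.
    normalized_text = " ".join(text.split()).lower()
    text_digest = hashlib.sha256(normalized_text.encode("utf-8")).hexdigest()
    return f"{image_digest}:{text_digest}"


class ImageTextCache:
    """Toy in-memory stand-in for an image-plus-text lookup (hypothetical)."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def set(self, image_bytes: bytes, text: str, answer: str) -> None:
        self._store[multimodal_key(image_bytes, text)] = answer

    def get(self, image_bytes: bytes, text: str) -> Optional[str]:
        # Returns None on a miss, i.e. when either the image or the text differs.
        return self._store.get(multimodal_key(image_bytes, text))


cache = ImageTextCache()
photo = b"placeholder bytes standing in for a product photo"
other_photo = b"placeholder bytes standing in for a different photo"

cache.set(photo, "What's in this photo?", "A red kettle on a white table.")

same_image = cache.get(photo, "what's in this  photo?")        # same image, same question
different_image = cache.get(other_photo, "What's in this photo?")  # same question, new image
```

In this sketch the first lookup is a hit and the second is a miss, because the image digest is part of the key. Note the limitation of the byte-hash assumption: a resized or re-encoded copy of the same picture would hash differently and miss.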
The economics are the same as for the rest of the product, only sharper: an image typically adds a large number of input tokens on top of the text, so a vision call generally costs more than a text-only one and each avoided call saves more. Anywhere the same images recur (catalogues, dashboards, document pipelines, support screenshots), the multimodal cache turns a repeated expensive call into a cheap lookup.
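The arithmetic behind that claim can be sketched in a few lines of Python. The function names are hypothetical, and the numbers are arbitrary cost units chosen for illustration, not real prices or measured hit rates. The simple model assumed here is that every query pays for a cache lookup and only misses pay for the model call.

```python
def expected_cost_per_query(hit_rate: float, model_call_cost: float, lookup_cost: float) -> float:
    """Average cost per query: always pay the lookup, pay the model only on a miss."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return lookup_cost + (1.0 - hit_rate) * model_call_cost


def savings_per_query(hit_rate: float, model_call_cost: float, lookup_cost: float) -> float:
    """Cost without a cache minus the expected cost with one."""
    return model_call_cost - expected_cost_per_query(hit_rate, model_call_cost, lookup_cost)


# Illustrative units only: a vision call priced at 10, a text call at 1,
# a lookup at 0.25, and half of all queries repeating.
vision_savings = savings_per_query(hit_rate=0.5, model_call_cost=10.0, lookup_cost=0.25)
text_savings = savings_per_query(hit_rate=0.5, model_call_cost=1.0, lookup_cost=0.25)
```

Under this model the saving works out to hit_rate × model_call_cost − lookup_cost, so at the same hit rate the more expensive call saves proportionally more, which is the point the paragraph above makes.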
The bottom line
An LLM cache that goes blind the moment an image appears isn't a cache for how models are actually used now. Crowkis remembers what the model saw, not just what it was told — because the picture is increasingly part of the question.