vs the field · April 22, 2026 · 3 min read

Crowkis vs vLLM prefix caching: different layers, different physics

vLLM's prefix caching saves GPU work inside one inference server. Crowkis saves the inference itself. You probably want both — but only one cuts the bill to zero on a hit.

If you self-host models, vLLM's automatic prefix caching is straight-up good engineering: shared prompt prefixes reuse KV-cache blocks on the GPU, throughput rises, latency falls. Run it. But understand its layer — it accelerates inference that is still happening, on one server, for requests sharing literal prefixes, with state that lives and dies with GPU memory.

The ceiling is physics: even a perfectly prefix-cached request still decodes its output tokens one at a time, still occupies the GPU, and still takes its hundreds of milliseconds. On hosted APIs you can't deploy vLLM at all; that layer belongs to your provider. And paraphrases share no literal prefix, so the semantic repetition that dominates real traffic gets nothing.
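To make the "literal prefixes" point concrete, here is a minimal sketch of block-level prefix matching. It is loosely modeled on the idea of hashing each block of tokens together with everything before it; it is not vLLM's code. The whitespace tokenizer, the `BLOCK_SIZE` value, and the function names are all assumptions chosen for illustration.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; an illustrative value, not a vLLM default


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash each full block chained with the hash of all blocks before it."""
    hashes = []
    parent = b""
    full = len(tokens) - len(tokens) % block_size  # ignore the partial tail block
    for start in range(0, full, block_size):
        block = " ".join(tokens[start:start + block_size]).encode("utf-8")
        parent = hashlib.sha256(parent + b"|" + block).digest()
        hashes.append(parent)
    return hashes


def reusable_blocks(cached_prompt, new_prompt):
    """Count the leading blocks of new_prompt that match the cached prompt."""
    # str.split() stands in for a real tokenizer (assumption).
    cached = block_hashes(cached_prompt.split())
    new = block_hashes(new_prompt.split())
    count = 0
    for a, b in zip(cached, new):
        if a != b:
            break  # the first mismatch ends the shared prefix
        count += 1
    return count
```

Under these assumptions, two prompts that open with the same ten-word system preamble share their first two blocks, while "what is the capital of France" and "the capital of France is what" share none, even though they ask the same thing.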

[Figure: the Crowkis read path. Five gates, every one can veto. Reuse only when meaning, structure, confidence, and trust all agree.]

Crowkis sits above the inference layer entirely: when meaning matches and the gates pass, no inference happens anywhere — not on your GPUs, not on theirs. The answer returns in under a millisecond from a durable local store that survives restarts and works identically whether the model behind it is self-hosted, hosted, or both on alternating Tuesdays.
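The "every gate can veto" idea can be sketched as a lookup that returns a stored answer only when all checks agree. This is a toy illustration, not Crowkis's implementation: the `Entry` fields, the thresholds, the bag-of-words `embed` function, and the in-memory list used as a store are all hypothetical. It covers the four gates named above (meaning, structure, confidence, trust); the fifth is not named in this post, so the sketch leaves it out.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Entry:
    prompt: str
    answer: str
    shape: str         # hypothetical "structure" key, e.g. model plus output format
    confidence: float  # hypothetical score recorded when the answer was stored
    trusted: bool      # hypothetical flag: the entry came from a vetted source


def embed(text):
    """Toy stand-in for a real embedding model: a bag of lowercase words."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def lookup(store, prompt, shape, min_similarity=0.9, min_confidence=0.8):
    """Return a cached answer only if every gate passes; otherwise None."""
    query = embed(prompt)
    for entry in store:
        gates = (
            cosine(query, embed(entry.prompt)) >= min_similarity,  # meaning
            entry.shape == shape,                                   # structure
            entry.confidence >= min_confidence,                     # confidence
            entry.trusted,                                          # trust
        )
        if all(gates):  # any single False vetoes reuse
            return entry.answer
    return None  # miss: the caller falls through to real inference
```

A `None` result is the signal to run inference as usual, which is where a prefix-cached vLLM server would still earn its keep.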

The bottom line

Layered correctly: Crowkis eliminates the repeated questions; vLLM accelerates the novel ones that remain. The GPU does less work twice over, and the bill notices both.