Why Crowkis is Rust all the way down
A cache lives in the hot path of every request. The language choice isn't aesthetic — it's the difference between predictable microseconds and mystery pauses.
Read it →
Notes from the nest · 100 posts
Engineering notes written by the people building Crowkis. Comparisons with everything else, use cases, economics, internals, security, operations — and nothing written to rank on a search engine.

A cache lives in the hot path of every request. The language choice isn't aesthetic — it's the difference between predictable microseconds and mystery pauses.
Redis is magnificent infrastructure for exact-match workloads. LLM traffic isn't one. Here's why speaking the same protocol doesn't mean solving the same problem.
Pull, run, first hit in the dashboard — with no config file, no signup, and no environment variables you're required to set. We timed it. It holds.
Nowhere else do thousands of people ask the same fifty questions, all day, in every phrasing imaginable. Crowkis was practically designed in a support queue.
Take your daily query volume, multiply by the repeat fraction, multiply by your blended price per call. That number, twelve times a year, is the cache argument.
Injected instructions in one response become served truth for every similar query — unless the cache can smell an answer that doesn't answer.
GPTCache proved developers want semantic caching. Crowkis is what happens when that idea grows up, moves out of your Python process, and gets a security model.
docker pull, restart, done — no schema migrations, no export/import, no upgrade runbook. The on-disk format is a stability promise, not an implementation detail.
HR policy, expense rules, deploy commands, VPN setup — every employee rediscovers them through your copilot, billed per discovery. Give the company one memory.
Your vector store finds the chunks fast. Then the model re-synthesizes the same answer from the same chunks, thousands of times. That second step is the bill.
After the 2026 gateway compromise, 'how many packages are in your hot path?' became a real procurement question. Our answer is a number: zero.
347 integration tests, a smoke suite that kills the process on purpose, and a Docker image hardened before anyone asked. The receipts behind 'production-ready.'
A cache exists to make costs predictable. Metering the cache would be self-defeating. So Community is free and Enterprise is flat per cluster — priced on a call, not a meter.
The built-in dashboard for humans, /metrics for your Grafana, one JSON line per event for your pipeline — same truth, three consumers, zero adapters.
Python gateways treat caching as one feature among forty. Crowkis treats it as the product — and ships it without a Python supply chain attached.
Most semantic caches call out to a vector database. Crowkis embeds the HNSW graph in-process — and that placement decision is worth more than any algorithm tweak.
Agents re-ask, re-plan, and re-fetch with industrial enthusiasm. Multiply by a fleet and you get the most cacheable traffic in existence — if the cache understands agents.
A WHERE clause is a promise; a namespace is a wall. How Crowkis makes cross-tenant leakage structurally impossible rather than procedurally unlikely.
Every cache vendor promises a hit rate. Crowkis Replay computes yours — on your real queries, before you spend anything. The pitch is a number with your name on it.
One container, a PVC, real health probes, hard memory bounds, graceful shutdown. Everything your cluster expects from a tenant that's read the manual.
Durability isn't a checkbox — it's a sequence of writes in the right order with checksums at every step. Here's the boring machinery that makes restarts uneventful.
Portkey is a control panel for LLM calls. Crowkis is the memory underneath them. Confusing the two costs you the savings both promise.
Users put personal data in prompts whether you like it or not. The cache's job is a full lifecycle: keep it out of shared entries, find it on demand, erase it provably.
Every developer on your team asks the assistant the same questions about the same codebase. With Crowkis behind MCP, the second ask is free for everyone.
The fastest disk read is the one that never happens. A few bits per key let Crowkis skip files that can't contain your answer — at a 1% false-positive cost we chose on purpose.
Every team has a runaway-loop story that ends with a shocking invoice. Per-key budgets with hard TPM and dollar walls end the genre.
Slice the traffic, compare against cached baselines, promote or retreat — model upgrades as a controlled experiment with the cache as your measuring instrument.
Observability tools show you beautiful charts of money leaving. Crowkis is the component that makes the chart go down.
Most self-hosted breaches are defaults, not exploits. Crowkis inverts the failure direction: forget to configure auth and you get a locked deployment, not an open one.
Shipping times, return windows, size guides, 'does this come in blue?' — commerce traffic is seasonal, spiky, and gloriously repetitive. Cache accordingly.
One similarity threshold for all traffic is how caches embarrass themselves. Crowkis classifies every query into one of twelve intents, each with its own rules of reuse.
Every multi-second model wait is paid twice — once in tokens, once in user patience. The cache refunds both, but only one shows up in accounting.
Providers have incidents; your product doesn't have to. Health-aware backend routing plus a warm cache turns upstream outages into degraded modes users barely notice.
Pinecone answers 'what's similar?'. A production cache must answer 'is this safe to serve?'. Those are different questions with different architectures.
Every accept and refuse, per source, append-only. Trust with memory changes attacker economics — and gives auditors the artifact they actually want.
Every cohort asks why the quadratic formula works. Teach the model once per concept, not once per student — while keeping personalized work personal.
Embeddings blur exactly where caches need precision — numbers, dates, entities. Template abstraction catches what cosine similarity structurally cannot.
Cost pressure pushes teams toward cheaper, dumber models. Caching offers the opposite trade: keep frontier quality, pay small-model prices on the traffic that repeats.
CROWKIS_MEMORY_LIMIT means what it says — no GC mood swings, no mystery RSS, eviction that engages before the kernel has opinions.
Every DIY semantic cache is a vector database, a Redis, a cron job, and a prayer. Crowkis is the version where the parts were designed for each other.
No phone-home, offline license verification, one binary. The deployment story for networks that treat outbound packets as incidents.
Every sane checklist says don't write your own storage engine. We did it anyway. Here's the actual reasoning, the architecture, and the parts that were painful.
Clinical-adjacent assistants repeat administrative and informational answers constantly — but every cached byte is regulated. This is what compliance-mode caching looks like.
Chain-of-thought tokens are the most expensive ones you buy. Crowkis extracts the thought's skeleton, abstracts the specifics, and recomposes it for the next input that shares its shape.
Agents multiply model calls per user action by 10–50x. Without aggressive reuse, the unit economics of agentic products simply don't close.
Live verdicts, hit-type economics, top misses, safety blocks, tenant accounting, system pressure — what each panel answers and who keeps it open.
pgvector is a lovely extension for storing embeddings next to your data. Routing every LLM query through Postgres is how lovely things die.
Each regime wants specific retention, audit, and erasure behavior. Enterprise compliance modes preset the whole posture, so the auditor's checklist maps to a flag.
Money questions repeat endlessly and tolerate zero staleness. Fintech is where freshness control stops being a feature and becomes the product.
LRU evicts by recency and nothing else. But cache entries have wildly different replacement costs — and forgetting a $0.40 answer to keep a $0.0004 one is just bad accounting.
Full engine, production use, no license, no meter, no time bomb. Here's why giving the small end away is the rational structure, not a teaser.
Fail-open design means most 'incidents' are the absence of savings, not the presence of errors. Here's the whole decision tree, which fits on an index card.
Serverless caches meter every operation. A cache that charges per request in front of an API that charges per request is a strange kind of savings.
RESP, gRPC, REST, and the dashboard each get auth that fits their use — constant-time tokens for the data plane, RBAC for the control plane, mandatory locks past loopback.
Air-gapped networks, FedRAMP postures, and zero phone-home tolerance rule out most AI infrastructure on page one. Crowkis was designed to pass that page.
Answers age at different speeds — prices in days, math never. A single TTL knob can't express that, so Crowkis ships five policies plus version pinning and webhooks.
Three sentences, one dashboard number, and a flat price. The rare infrastructure purchase that finance understands faster than engineering does.
Exciting infrastructure is a contradiction in terms. Every Crowkis design decision optimizes for the same review: 'it just runs.'
AWS will happily run an exact-match cache for you at any scale. It will miss your LLM traffic at any scale, too.
'Many eyes' assumes the eyes show up. For your hot path, a signed single binary with zero dependencies is a smaller attack surface than a thousand auditable packages nobody audits.
Caching across customers multiplies savings and multiplies risk. Tenant isolation has to be architecture, not a WHERE clause.
Every new API is a tax on adoption: clients, docs, muscle memory, tooling. RESP3 meant inheriting twenty years of all four on day one.
Model prices vary 50x for overlapping quality on easy queries. The arbitrage router exploits the spread automatically, with a quality bar you set per intent.
Memcached is the purest cache ever written — and purity is exactly the problem when your keys are sentences.
Seed-stage AI products routinely spend salary-sized sums recomputing known answers. Free Community edition exists precisely for this moment of your company.
Crowkis serves thousands of connections through async IO — then funnels every cache decision through a single deterministic actor. Here's why that's a feature.
Swap models with a normal cache and you re-purchase your entire corpus at the new model's prices. Migration leasing is the line item that prevents the line item.
The new Redis-compatibles race each other on throughput. On LLM traffic they all hit the same wall at full speed: the keys never repeat.
Every product team is duct-taping its own LLM cache right now. Platform engineering exists to end exactly this kind of duplication.
MCP turns Crowkis into something an AI assistant can use deliberately — check the cache, store the answer — over plain stdio, with the banner silenced so JSON-RPC stays clean.
Caching ROI isn't a hockey stick — it's a staircase that starts the first hour. Here's the honest schedule of when each saving shows up.
Provider prompt caching discounts your repeated prefixes. You still call the model, still wait, and still pay — just slightly less. There's a bigger idea available.
Semantic caching has an obvious failure mode nobody likes to talk about: one bad write, served forever to everyone nearby. This is how Crowkis decides what to trust.
At consumer scale, traffic converges on shared intents while costs and latency multiply by millions. The cache becomes load-bearing infrastructure.
LSM compaction is where storage engines breed complexity. Crowkis ships exactly one strategy across three levels — chosen for cache workloads, closed for configuration.
Anthropic's prompt caching is excellent at its actual job — cheap long contexts. It was never designed to be your response cache, and the pricing says so.
Voice gives you about a second before silence feels broken. Model round-trips don't fit. Cache hits do — with room to spare for the speech stack.
Users expect LLM answers to arrive as a typing stream. CGETSTREAM serves cached answers chunk by chunk, so a sub-millisecond hit doesn't break the interface's rhythm.
Google bills cached context per token per hour — a parking meter for your own prompts. Compare that with a cache you simply own.
Product copy, help docs, and templates get re-translated continuously as releases churn. Most of the content didn't change. Stop paying as if it did.
Bottom-heavy by design: the layers that hold your data get the most hostile coverage, and the smoke suite's signature move is killing the process to prove a point.
vLLM's prefix caching saves GPU work inside one inference server. Crowkis saves the inference itself. You probably want both — but only one cuts the bill to zero on a hit.
Reports, tickets, calls, and articles get summarized on every view, by every viewer, in every digest. The document didn't change between viewers. The bill did.
LangSmith shows you every span of every chain, beautifully. The spans are still billed. There's a component whose job is making the spans not happen.
Routing tickets, tagging content, extracting fields — LLM classification runs millions of small calls over heavily repeating inputs. The cache hit rate is absurd, in your favor.
Cloudflare's gateway adds caching at the CDN layer: exact-match, eventually evicted, on someone else's network. Useful plumbing; not a reuse brain.
Every docs site has the same hit parade — auth, rate limits, pagination, that one confusing endpoint. The assistant answering them should not bill like a consultant.
Kong added AI plugins to a great API gateway. A semantic-cache plugin in a proxy is a feature; a semantic cache engine is a product. The difference shows in production.
If your product is answering questions, your COGS is the model bill and your UX is the latency. The cache moves both — which makes it strategy, not plumbing.
Every team builds the in-house semantic cache once. The prototype takes a week. The production version takes the year you didn't budget. We know — we budgeted it.
Redis shipping a semantic cache service confirms the problem is real. Their answer is a managed add-on; ours is a from-scratch engine. The difference is in the bones.
LangChain, LlamaIndex, and Semantic Kernel all offer cache hooks. Framework caches live and die with the framework. Infrastructure shouldn't.
Bedrock's caching cuts repeated-prefix costs inside one cloud's model garden. Your cache strategy deserves a longer horizon than a vendor's feature page.
One import gives you LangChain's in-memory exact cache. It's the caching equivalent of a sticky note — gone on restart, blind to paraphrase, local to one process.
Serverless Redis with per-request pricing is elegant for occasional workloads. An LLM cache is the opposite of an occasional workload.
Somewhere in your repo is a script that hashes prompts and skips duplicates. It's doing its best. Here's everything it can't see.
Chroma is wonderful for getting embeddings working before lunch. The qualities that make it great for prototypes are the ones a cache in production can't keep.
The default strategy — every query goes to the model — has a precise cost. It's on your invoice, itemized as everything.
Fine-tuning a smaller model is a months-long bet on cheaper tokens. Caching is a five-minute bet on zero tokens. One of these compounds weekly.
Million-token contexts tempt teams to ship the whole knowledge base with every call. That's not memory — that's paying to re-read the library daily.
100 posts in the roost · crows remember faces. we remember production incidents.