The write-ahead log: how the cache survives a kill -9
Durability isn't a checkbox — it's a sequence of writes in the right order with checksums at every step. Here's the boring machinery that makes restarts uneventful.
Caches traditionally shrug at durability ('it's just a cache'), but an LLM cache's contents cost real money to rebuild. Losing a warm corpus to a pod reschedule means re-purchasing it from your provider at full price. So CrowkisDB treats every entry as worth keeping: every write lands in the write-ahead log before it touches anything else, each record is framed with a CRC32 checksum, and the log is fsynced according to the configured sync policy.
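A minimal sketch of that write path in Python follows. It assumes a hypothetical frame layout (4-byte length, 4-byte CRC32, payload) and uses a single boolean as a stand-in for the sync policy. `WriteAheadLog`, `Cache`, and the NUL-separated record encoding are illustrative names, not CrowkisDB's API. The ordering is the point: the record reaches the log before the entry becomes visible in memory.

```python
import os
import struct
import zlib

# Hypothetical frame layout (not CrowkisDB's documented format):
# <u32 payload length><u32 CRC32 of payload><payload>, little-endian.
HEADER = struct.Struct("<II")


class WriteAheadLog:
    def __init__(self, path: str, fsync_every_write: bool = True):
        self._f = open(path, "ab")  # append-only log file
        # Stand-in for a configurable sync policy (an assumption).
        self._fsync_every_write = fsync_every_write

    def append(self, payload: bytes) -> None:
        frame = HEADER.pack(len(payload), zlib.crc32(payload)) + payload
        self._f.write(frame)
        self._f.flush()  # hand the bytes to the OS
        if self._fsync_every_write:
            os.fsync(self._f.fileno())  # force them to stable storage

    def close(self) -> None:
        self._f.close()


class Cache:
    """Toy cache: every put goes through the WAL before the memtable."""

    def __init__(self, wal: WriteAheadLog):
        self.wal = wal
        self.memtable: dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        # Hypothetical record encoding; assumes keys contain no NUL byte.
        record = key.encode() + b"\x00" + value.encode()
        self.wal.append(record)     # 1. log (and fsync, per policy) first
        self.memtable[key] = value  # 2. only then make the entry visible
```

With `fsync_every_write=False`, a write still survives a process crash (the bytes are already in the OS page cache) but not a power cut; that tradeoff is what a sync policy tunes.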
Recovery is the payoff: on startup the engine replays the log from the last checkpoint, validating every checksum, rebuilding the memtable to the exact pre-crash state. A torn write at the tail — the classic power-cut artifact — fails its checksum and truncates cleanly instead of poisoning the replay. The vector index recovers alongside, so semantic search resumes where it stopped.
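The replay side can be sketched the same way. Again, the frame layout and the `frame`/`replay` names are assumptions for illustration, and checkpointing is left out, so this version replays the whole file. It returns every record whose checksum validates and truncates the file at the first frame that is either short or fails its CRC.

```python
import os
import struct
import zlib

# Hypothetical frame layout: <u32 payload length><u32 CRC32 of payload><payload>.
HEADER = struct.Struct("<II")


def frame(payload: bytes) -> bytes:
    """Encode one record the way the (assumed) writer does."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def replay(path: str) -> list[bytes]:
    """Return every intact record and truncate the log after the last good frame."""
    records: list[bytes] = []
    good_end = 0  # offset just past the last frame that validated
    with open(path, "r+b") as f:
        data = f.read()
        pos = 0
        while pos + HEADER.size <= len(data):
            length, crc = HEADER.unpack_from(data, pos)
            start = pos + HEADER.size
            end = start + length
            if end > len(data):
                break  # short frame: the write was cut off mid-payload
            payload = data[start:end]
            if zlib.crc32(payload) != crc:
                break  # checksum mismatch: treat everything from here as torn
            records.append(payload)
            pos = good_end = end
        if good_end < len(data):
            f.truncate(good_end)  # next append starts on a clean frame boundary
            f.flush()
            os.fsync(f.fileno())
    return records
```

Stopping at the first bad frame, rather than skipping it, is deliberate: once a length field can't be trusted, the next frame boundary is unknowable. The cost is that this sketch can't tell mid-log corruption from a torn tail; it truncates both.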
We don't trust this machinery; we attack it. The integration suite kills the process mid-write and verifies every acknowledged entry survives; the Docker smoke test does the same to the whole container. Durability claims that aren't tested by murder are marketing.
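A stripped-down kill test, sketched in Python for POSIX systems, looks like the block below. The child writer, its frame layout, and its "print the index after fsync" acknowledgement protocol are all assumptions for illustration, not CrowkisDB's suite. The parent collects acknowledgements, sends SIGKILL, and then checks that every acknowledged key can be read back from the log.

```python
import os
import signal
import struct
import subprocess
import sys
import tempfile
import textwrap
import zlib

HEADER = struct.Struct("<II")  # hypothetical <u32 length><u32 crc32> frame header

# Child writer: append a framed record, fsync, then print its index as the "ack".
CHILD = textwrap.dedent("""
    import os, struct, sys, zlib
    header = struct.Struct("<II")
    with open(sys.argv[1], "ab") as f:
        i = 0
        while True:
            payload = b"key-%d" % i
            f.write(header.pack(len(payload), zlib.crc32(payload)) + payload)
            f.flush()
            os.fsync(f.fileno())
            print(i, flush=True)  # acknowledge only after fsync returns
            i += 1
""")


def recovered_keys(path: str) -> set[str]:
    """Read back every frame whose checksum validates; stop at a torn tail."""
    with open(path, "rb") as f:
        data = f.read()
    keys, pos = set(), 0
    while pos + HEADER.size <= len(data):
        length, crc = HEADER.unpack_from(data, pos)
        payload = data[pos + HEADER.size : pos + HEADER.size + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # the kill landed mid-write
        keys.add(payload.decode())
        pos += HEADER.size + length
    return keys


def kill_mid_write_test(acks_before_kill: int = 200) -> int:
    path = os.path.join(tempfile.mkdtemp(), "wal.log")
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD, path], stdout=subprocess.PIPE, text=True
    )
    acked = set()
    for _ in range(acks_before_kill):
        acked.add("key-%d" % int(child.stdout.readline()))
    os.kill(child.pid, signal.SIGKILL)  # no flush, no cleanup, no atexit
    child.wait()
    child.stdout.close()
    missing = acked - recovered_keys(path)
    assert not missing, f"lost acknowledged writes: {sorted(missing)[:5]}"
    return len(acked)
```

One caveat: SIGKILL leaves the kernel's page cache intact, so this exercises process-crash durability rather than fsync itself. Power-loss behavior needs fault injection below the process, such as truncating or corrupting the log file directly.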
The bottom line
The result is operationally liberating: deploys, reschedules, and crashes are non-events. The cache you had is the cache you have. Boring by design, and provably so.