The world's shortest cache runbook
Fail-open design means most 'incidents' are the absence of savings, not the presence of errors. Here's the whole decision tree, which fits on an index card.
Runbook length measures architectural fear, and ours is short on purpose. Symptom one: hit rate dropped. Check the dashboard's top misses — usually a traffic shift (new product launch, new question genre) doing exactly what it should; pre-warm if it matters, shrug if it doesn't. The cache underperforming is a cost regression, not an outage.
Symptom two: Crowkis is unreachable. Your get_or_compute calls fall through to the provider — the app keeps working at pre-cache prices while you docker restart and the WAL replays the corpus intact. The failure mode of the savings layer is the absence of savings; we consider this the only acceptable failure mode for something in your hot path.
Reuse only when meaning, structure, confidence, and trust all agree.
Symptom three: something looks wrong with a specific answer. crowkis why <key> explains the serving decision — which gates passed, at what scores; crowkis dump shows the entry; the ledger shows its write history; Enterprise Live Edit corrects or kills it in place, audited. Investigation is built into the CLI, not reconstructed from logs.
The bottom line
That's the card. crowkis doctor covers the diagnostics, the durability drill is pre-run on every release, and the pager stays quiet — the entire operational ambition of this product is to be the component nobody talks about in retros.