Bloom filters: how the engine knows what it doesn't know
The fastest disk read is the one that never happens. A few bits per key let Crowkis skip files that can't contain your answer — at a 1% false-positive cost we chose on purpose.
LSM trees scatter data across many sorted files, so a naive lookup might probe each one — and most probes would find nothing. Bloom filters fix the economics: a compact bit-array per SSTable answers 'could this key be here?' with no false negatives and a tunable false-positive rate. A 'no' skips the file entirely; only plausible files get touched.
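To make the mechanics concrete, here is a minimal sketch of a Bloom filter guarding a lookup across several files. It is not Crowkis's implementation: the `BloomFilter` class, the `lookup` helper, the SHA-256 double-hashing scheme, and the use of plain dicts to stand in for SSTables are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: each key sets k positions in an m-bit array."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Double hashing: derive k positions from two 64-bit slices of one digest.
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # forced odd so it is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        # A single unset bit proves the key was never added (no false negatives).
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def lookup(key: bytes, tables):
    """tables: (BloomFilter, dict) pairs standing in for SSTables, newest first."""
    for bloom, data in tables:
        if not bloom.might_contain(key):
            continue  # definite miss: skip this file without reading it
        if key in data:  # stands in for the actual disk read
            return data[key]
    return None
```

A filter built with about 10 bits per key and 7 hash positions (for example `BloomFilter(num_bits=10_000, num_hashes=7)` for 1,000 keys) should land near the 1% false-positive rate discussed here; a "maybe" that turns out wrong costs one wasted read and nothing more.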
Crowkis tunes its filters to roughly 1% false positives, a deliberate spot on the memory/IO curve. For a standard Bloom filter that works out to about 10 bits per key. Tighter filters buy little: each tenfold cut in the false-positive rate costs roughly 5 more bits per key, while the occasional wasted probe it avoids costs only one read. Looser ones start leaking real IO. One percent keeps filters small enough to live comfortably in RAM, where they make misses nearly free.
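The memory side of that trade-off can be checked with the textbook sizing formulas for an optimally configured Bloom filter. The sketch below assumes a standard filter; the function names are hypothetical, and whether Crowkis sizes its filters exactly this way is not stated here.

```python
import math


def bits_per_key(false_positive_rate: float) -> float:
    """Bits per key for an optimally configured filter: -ln(p) / (ln 2)^2."""
    return -math.log(false_positive_rate) / (math.log(2) ** 2)


def optimal_num_hashes(bpk: float) -> int:
    """Hash count that minimizes false positives at a given bits-per-key: bpk * ln 2."""
    return max(1, round(bpk * math.log(2)))


# Compare a looser, the chosen, and a tighter false-positive target.
for p in (0.1, 0.01, 0.001):
    bpk = bits_per_key(p)
    print(f"p={p:<6} bits/key={bpk:5.2f} hashes={optimal_num_hashes(bpk)}")
```

Under these formulas, 10% costs about 4.8 bits per key, 1% about 9.6, and 0.1% about 14.4. Moving from 1% to 0.1% therefore adds roughly half again as much memory in order to avoid about nine wasted reads per thousand misses.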
Cheap misses matter more in a cache than anywhere: every genuinely novel question is a guaranteed miss on its way to the model, and that miss shouldn't pay a disk-tour tax before the model call it was always going to make. Filters keep the miss path as fast as the hit path is famous for.
The bottom line
It's about ten bits of math per key standing between you and a pile of pointless reads: the kind of unglamorous engineering that compounds into the latency numbers on the homepage.