Engineering · May 18, 2026 · 11 min read

Why we wrote our own LSM tree instead of bolting onto RocksDB

Every sane checklist says don't write your own storage engine. We did it anyway. Here's the actual reasoning, the architecture, and the parts that were painful.

The standard advice is correct: do not write your own storage engine. RocksDB exists, it is battle-tested, and the people who built it are smarter about compaction than you will ever need to be. We started there too.

In plain words: If you're not a database person, a storage engine is the part of a system that decides how data physically lands on disk and how it's found again. Most products borrow one off the shelf. We built ours, and this post is the why.

The problem is that a semantic LLM cache is not a generic key-value workload. Every entry carries an embedding, a structural template hash, an intent class, a confidence history, and a trust score. The interesting reads are not 'get key'; they are 'find the neighbours of this vector, then check whether their templates agree, then check their trust ledger entries.' With a generic engine, each of those steps becomes a separate lookup with its own serialization boundary.
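To make the "one lookup instead of several" point concrete, here is a minimal sketch of packing every field a reuse check needs into a single stored value, so one read returns all of it. The CacheEntry type, the encode/decode helpers and the byte layout are hypothetical illustrations, not CrowkisDB's actual on-disk format.

```python
import struct
from dataclasses import dataclass, field

@dataclass
class CacheEntry:
    # Hypothetical field set mirroring the list above; not the real schema.
    embedding: list[float]
    template_hash: int
    intent_class: int
    trust_score: float
    confidence_history: list[float] = field(default_factory=list)

# Fixed header: embedding dim, history length, template hash, intent class, trust.
_HEADER = struct.Struct("<IIQHf")

def encode(entry: CacheEntry) -> bytes:
    """Pack all fields into one value so a single storage read returns everything."""
    dim = len(entry.embedding)
    n_hist = len(entry.confidence_history)
    return (
        _HEADER.pack(dim, n_hist, entry.template_hash, entry.intent_class, entry.trust_score)
        + struct.pack(f"<{dim}f", *entry.embedding)
        + struct.pack(f"<{n_hist}f", *entry.confidence_history)
    )

def decode(buf: bytes) -> CacheEntry:
    """Inverse of encode; floats are stored as 32-bit, so precision is float32."""
    dim, n_hist, template_hash, intent_class, trust = _HEADER.unpack_from(buf, 0)
    off = _HEADER.size
    embedding = list(struct.unpack_from(f"<{dim}f", buf, off))
    off += 4 * dim
    history = list(struct.unpack_from(f"<{n_hist}f", buf, off))
    return CacheEntry(embedding, template_hash, intent_class, trust, history)
```

The vector and its metadata share one record, so the scoring code never has to join results from separate stores.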

The shape of CrowkisDB

CrowkisDB is a deliberately small LSM tree. Writes land in a write-ahead log first (that's the crash-safety guarantee), then in an in-memory table, the MemTable, which flushes to sorted, compressed files on disk (SSTables) once it reaches 64 MB. A three-level compactor merges those files in the background. The HNSW vector index lives beside the tree and persists with it.
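The ordering in that paragraph (log first, then MemTable, then flush at a size budget) can be sketched as a toy write path. MiniStore and its methods are hypothetical names; the log framing is a simplified length prefix with no checksum, and an in-memory list stands in for on-disk SSTables, so compression, bloom filters and WAL rotation are left out.

```python
import os

FLUSH_THRESHOLD = 64 * 1024 * 1024  # 64 MB, the MemTable size from the text

class MiniStore:
    """Toy write path: append to the log, update the MemTable, flush when full."""

    def __init__(self, wal_path: str, flush_threshold: int = FLUSH_THRESHOLD):
        self.wal = open(wal_path, "ab")
        self.memtable: dict[bytes, bytes] = {}
        self.mem_bytes = 0  # approximate: key + value sizes only
        self.flush_threshold = flush_threshold
        self.l0: list[list[tuple[bytes, bytes]]] = []  # sorted runs standing in for SSTables

    def put(self, key: bytes, value: bytes) -> None:
        # 1. Log first, and fsync, so an acknowledged write survives a crash.
        #    Simplified framing: two little-endian u32 lengths, then key and value.
        record = len(key).to_bytes(4, "little") + len(value).to_bytes(4, "little") + key + value
        self.wal.write(record)
        self.wal.flush()
        os.fsync(self.wal.fileno())
        # 2. Then the in-memory table, keeping a running size estimate.
        old = self.memtable.get(key)
        if old is not None:
            self.mem_bytes -= len(key) + len(old)
        self.memtable[key] = value
        self.mem_bytes += len(key) + len(value)
        # 3. Flush once the size budget is reached.
        if self.mem_bytes >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        # A real engine writes a compressed SSTable to disk and only then
        # retires the WAL segment; this toy keeps the sorted run in memory.
        self.l0.append(sorted(self.memtable.items()))
        self.memtable.clear()
        self.mem_bytes = 0

    def close(self) -> None:
        self.wal.close()
```

Sorting happens once per flush rather than per write here; a production MemTable usually keeps keys ordered as they arrive (a skip list is the common choice) so reads can scan it in order too.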

the write path

1. client write, CSET
2. write-ahead log · crash-safe append · CRC32 per record
3. MemTable · sorted, in-memory
4. SSTables, L0 → L1 → L2 · LZ4 blocks · bloom filters ~1% FP
5. HNSW vector index · persists with the store

One write, three durable destinations: log, tree, vector index.
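The "CRC32 per record" step is what lets recovery tell a complete log record from one that was half-written when the process died. A minimal sketch, assuming a hypothetical [crc32][length][payload] frame; the layout and the encode_record/replay names are illustrative, not CrowkisDB's WAL format.

```python
import struct
import zlib

# Hypothetical frame: u32 CRC32 of the payload, u32 payload length, payload bytes.
_FRAME = struct.Struct("<II")

def encode_record(payload: bytes) -> bytes:
    return _FRAME.pack(zlib.crc32(payload), len(payload)) + payload

def replay(log: bytes) -> list[bytes]:
    """Return every intact record, stopping at the first torn or corrupt one."""
    records, off = [], 0
    while off + _FRAME.size <= len(log):
        crc, length = _FRAME.unpack_from(log, off)
        start = off + _FRAME.size
        end = start + length
        if end > len(log):
            break  # torn write: the header reached disk, the payload did not
        payload = log[start:end]
        if zlib.crc32(payload) != crc:
            break  # corrupt record: nothing after it can be trusted
        records.append(payload)
        off = end
    return records
```

Replay stops at the first bad record rather than skipping past it: once one record fails its checksum, the offsets of everything after it are suspect.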

Why owning the engine pays

Owning the engine means the scoring pipeline reads index data without crossing a serialization boundary, and compaction understands that an entry's value includes its vector. The read path can interleave the five reuse checks with storage lookups instead of round-tripping a foreign API five times per query.
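One way to picture "compaction understands that an entry's value includes its vector" is a standard k-way merge of sorted runs in which the embedding travels inside the value, so it is kept or discarded together with the rest of the entry. The run format, the None-as-tombstone convention and the compact function are assumptions for this sketch; it is textbook LSM merge logic, not CrowkisDB's compactor.

```python
import heapq
from typing import Optional

# Hypothetical run format: a list of (key, value) pairs sorted by key.
# A value is (embedding, payload); None marks a tombstone (a delete).
Value = Optional[tuple[tuple[float, ...], str]]
Run = list[tuple[str, Value]]

def compact(runs: list[Run], bottom_level: bool) -> Run:
    """Merge sorted runs into one; runs[0] is the newest.

    The newest version of each key wins, and because the embedding is part of
    the value it is carried (or discarded) together with the rest of the entry.
    """
    # Tag every item with the age of its run so equal keys come out newest-first.
    tagged = [[(key, age, val) for key, val in run] for age, run in enumerate(runs)]
    out: Run = []
    last_key = None
    for key, _age, val in heapq.merge(*tagged, key=lambda item: (item[0], item[1])):
        if key == last_key:
            continue  # an older version, shadowed by one already handled
        last_key = key
        if val is None and bottom_level:
            continue  # nothing older remains below, so the tombstone can go
        out.append((key, val))
    return out
```

Tombstones survive compactions above the bottom level because an older version of the key may still sit in a deeper level and must stay hidden.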

the read path: five gates, every one can veto

1. query: 'how long do refunds take?'
2. intent classifier
3. template match
4. HNSW neighbours
5. confidence gate ≥ 0.88
6. trust + freshness
7. all gates pass → answer · 0.4 ms
8. any gate vetoes → (nil) → your model

All local, all sub-millisecond; a miss costs almost nothing.
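The five gates can be sketched as a chain in which any check can return a miss early. This is a minimal illustration, assuming a brute-force cosine search in place of the HNSW index and hypothetical Entry fields, trust floor (min_trust) and freshness window (max_age_s); only the 0.88 threshold comes from the diagram. It follows the diagram's order, filtering on intent and template before the neighbour search.

```python
import math
import time
from dataclasses import dataclass
from typing import Optional

CONFIDENCE_THRESHOLD = 0.88  # the confidence gate from the diagram

@dataclass
class Entry:
    # Hypothetical cached-entry fields for this sketch.
    intent: str
    template_hash: int
    embedding: list[float]
    trust: float
    written_at: float  # unix seconds
    answer: str

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def lookup(query_vec: list[float], intent: str, template_hash: int, entries: list[Entry], *,
           min_trust: float = 0.5, max_age_s: float = 86400.0,
           now: Optional[float] = None) -> Optional[str]:
    """Return a cached answer, or None (a miss) as soon as any gate vetoes."""
    now = time.time() if now is None else now
    # Gate 1: intent. The classifier is assumed to have produced `intent` already.
    candidates = [e for e in entries if e.intent == intent]
    # Gate 2: structural template must match.
    candidates = [e for e in candidates if e.template_hash == template_hash]
    if not candidates:
        return None
    # Gate 3: nearest neighbour (brute force here, standing in for HNSW).
    best = max(candidates, key=lambda e: cosine(query_vec, e.embedding))
    # Gate 4: similarity must clear the confidence threshold.
    if cosine(query_vec, best.embedding) < CONFIDENCE_THRESHOLD:
        return None
    # Gate 5: trust and freshness.
    if best.trust < min_trust or now - best.written_at > max_age_s:
        return None
    return best.answer
```

Every early `return None` is a veto; the caller treats it as (nil) and forwards the query to the model.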

Owning the engine also means our test suite exercises crash recovery against our own WAL format. Of the 347 integration tests in the suite, the deepest coverage sits exactly here: WAL replay, flush correctness, compaction invariants, write batches, and HNSW persistence across restarts.
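Compaction invariants are easiest to test as an explicit checker run after every flush and compaction. A minimal sketch, assuming a leveled layout in which only L0 files may overlap, as is usual for leveled LSM trees; check_invariants and the list-of-keys file model are hypothetical stand-ins.

```python
# Hypothetical model: levels[i] is a list of files, each file a list of keys.
Level = list[list[str]]

def check_invariants(levels: list[Level]) -> list[str]:
    """Return a description of each violation; an empty list means consistent."""
    problems = []
    for lvl, files in enumerate(levels):
        # Every file must hold strictly increasing keys (sorted, no duplicates).
        for i, keys in enumerate(files):
            if any(a >= b for a, b in zip(keys, keys[1:])):
                problems.append(f"L{lvl} file {i}: keys not strictly sorted")
        if lvl == 0:
            continue  # L0 files are raw MemTable flushes and may overlap
        # Below L0, files must be in key order with disjoint key ranges.
        ranges = [(f[0], f[-1]) for f in files if f]
        for (_, prev_hi), (next_lo, _) in zip(ranges, ranges[1:]):
            if prev_hi >= next_lo:
                problems.append(f"L{lvl}: key ranges overlap at {prev_hi!r} >= {next_lo!r}")
    return problems
```

Returning every violation instead of raising on the first one makes a failing test report the whole picture after a bad compaction.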

The honest costs

It was not free. We spent weeks on problems RocksDB solved a decade ago: manifest atomicity, bloom filter tuning, compaction scheduling. The discipline that made it survivable was ruthless scope-cutting: three levels, not seven; one compaction strategy, not five; and a written invariant list that every storage PR is checked against.
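Much of bloom filter tuning comes down to sizing arithmetic. For the ~1% false-positive rate in the write-path diagram, the textbook formulas give roughly 9.6 bits per key and 7 hash functions. A minimal sketch of the calculation, using the standard formulas rather than anything specific to CrowkisDB:

```python
import math

def bloom_params(n_keys: int, fp_rate: float) -> tuple[int, int]:
    """Bit count m and hash count k for n keys at a target false-positive rate p.

    m = -n * ln(p) / (ln 2)^2,  k = (m / n) * ln 2, rounded to a whole number
    """
    m = math.ceil(-n_keys * math.log(fp_rate) / (math.log(2) ** 2))
    k = max(1, round(m / n_keys * math.log(2)))
    return m, k

def expected_fp(m: int, n: int, k: int) -> float:
    """Standard approximation of the false-positive rate: (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k
```

For a million keys at 1%, this comes to about 9.59 million bits (roughly 1.2 MB of filter) and 7 hashes; each halving of the false-positive rate costs about 1.44 more bits per key.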

We would not recommend this path to anyone who can express their workload in a generic engine. We couldn't, and that's the whole story.

The payoff shows up in the numbers users actually feel: cache hits served from a single process in well under a millisecond, no GC pauses because there's no garbage collector, and a binary you can drop on a laptop or an air-gapped server with zero external dependencies.