Benchmarks · June 8, 2026 · 6 min read

84 of 84: correctness and isolation under a hostile harness

Empty strings, 100 KB values, null bytes, emoji, 16 threads hammering across tenants. The stress harness throws 84 nasty checks at Crowkis and counts the cross-tenant leaks. The leak count is zero.

Speed is negotiable; correctness is not. A cache that occasionally serves tenant A's answer to tenant B isn't a fast cache, it's a data breach with good latency. So the harshest part of our harness isn't about performance at all — it's 84 correctness and robustness checks designed to make Crowkis misbehave.

Stress harness results (v0.2.2), in checks

Checks passed: 84

Checks failed: 0

Cross-tenant leaks (16 threads): 0

Robustness, correctness, concurrency, and benchmark sections — all green.

The robustness checks feed Crowkis the inputs that break naive servers: empty and whitespace-only queries, 100 KB values, Unicode and emoji, embedded CRLF and null bytes, absurd thresholds, extreme TTLs. None crash it, none corrupt a neighbouring entry. The correctness checks verify the invariants that actually matter: exact round-trips, negative-cache behaviour, pinning, and agent-memory isolation.
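To make the shape of those robustness checks concrete, here is a minimal sketch in Python. It is not the Crowkis harness or its API: `ToyStore`, `HOSTILE_VALUES`, and `run_robustness_checks` are hypothetical names invented for illustration, and the store is a plain in-memory stand-in. The idea is simply to write each hostile value, read it back byte for byte, and confirm that a neighbouring entry and a second tenant are untouched.

```python
# Hypothetical sketch: none of these names come from Crowkis itself.
HOSTILE_VALUES = {
    "empty": "",
    "whitespace": " \t\n",
    "large": "x" * (100 * 1024),          # 100 KB value
    "emoji": "caf\u00e9 \U0001F680",      # non-ASCII plus an emoji
    "crlf": "line1\r\nline2",             # embedded CRLF
    "null_byte": "before\x00after",       # embedded null byte
}


class ToyStore:
    """In-memory stand-in for a tenant-scoped cache (assumption, not Crowkis)."""

    def __init__(self):
        self._data = {}

    def set(self, tenant, key, value):
        self._data[(tenant, key)] = value

    def get(self, tenant, key):
        return self._data.get((tenant, key))


def run_robustness_checks(store):
    """Return a list of failure descriptions; an empty list means all green."""
    failures = []
    store.set("tenant-a", "neighbour", "sentinel")
    for name, value in HOSTILE_VALUES.items():
        store.set("tenant-a", name, value)
        # Exact round-trip: the value must come back unchanged.
        if store.get("tenant-a", name) != value:
            failures.append(f"{name}: round-trip mismatch")
        # The write must not corrupt an unrelated entry.
        if store.get("tenant-a", "neighbour") != "sentinel":
            failures.append(f"{name}: neighbour corrupted")
        # Another tenant must not see the value at all.
        if store.get("tenant-b", name) is not None:
            failures.append(f"{name}: visible to another tenant")
    return failures


if __name__ == "__main__":
    print(run_robustness_checks(ToyStore()))
```

A real harness would send these inputs over the wire rather than into a dict, which is where CRLF and null bytes tend to do their damage; the assertions, though, would look much the same.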

In plain words: A cross-tenant leak is when one customer's cached answer is served to another. For a semantic cache it's the cardinal sin, because entries match by meaning: a leak doesn't stay confined to one exact key, it can surface for every similar question.

The concurrency check is the one we lose sleep over: 16 threads running 60 operations each (960 in total), deliberately interleaving reads and writes across tenant boundaries, then auditing whether anything crossed. Zero leaks. That result isn't an accident of timing. It's the single-writer actor making races structurally impossible, validated under exactly the load that would expose them.
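The single-writer idea and the audit around it can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Crowkis's implementation: `SingleWriterCache` and `run_leak_audit` are hypothetical names, the store is an in-memory dict, and values are tagged with their tenant so the audit can recognise a crossing. The thread and operation counts mirror the 16 × 60 figures above.

```python
import queue
import threading


class SingleWriterCache:
    """Toy actor: one worker thread owns the dict; callers only send messages."""

    def __init__(self):
        self._inbox = queue.Queue()
        self._store = {}  # touched only by the worker thread
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            op, tenant, key, value, reply = self._inbox.get()
            if op == "stop":
                reply.put(None)
                return
            if op == "set":
                self._store[(tenant, key)] = value
                reply.put(None)
            else:  # "get"
                reply.put(self._store.get((tenant, key)))

    def _call(self, op, tenant=None, key=None, value=None):
        reply = queue.Queue(maxsize=1)
        self._inbox.put((op, tenant, key, value, reply))
        return reply.get()  # block until the worker has handled the message

    def set(self, tenant, key, value):
        self._call("set", tenant, key, value)

    def get(self, tenant, key):
        return self._call("get", tenant, key)

    def close(self):
        self._call("stop")


def run_leak_audit(threads=16, ops_per_thread=60, tenants=4):
    """Interleave reads and writes across tenants; return the number of leaks."""
    cache = SingleWriterCache()
    leaks = []
    leaks_lock = threading.Lock()  # guards only the audit list, not the cache

    def worker(idx):
        tenant = f"tenant-{idx % tenants}"
        for i in range(ops_per_thread):
            key = f"q{i % 5}"  # the same key names are reused by every tenant
            if i % 2 == 0:
                cache.set(tenant, key, f"{tenant}:{i}")
            else:
                got = cache.get(tenant, key)
                # A value tagged with another tenant's name is a leak.
                if got is not None and not got.startswith(tenant + ":"):
                    with leaks_lock:
                        leaks.append((tenant, key, got))

    pool = [threading.Thread(target=worker, args=(n,)) for n in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    cache.close()
    return len(leaks)


if __name__ == "__main__":
    print("cross-tenant leaks:", run_leak_audit())
```

Because every read and write is a message handled by one thread, no two operations ever touch the store at the same time; the audit then only has to confirm that tenant scoping holds under interleaving. A toy like this shows the structure of the check, not evidence about any real server.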

We score ourselves 9 out of 10 on correctness and isolation, and 4 out of 10 on hot-path latency. The point of a brutal scorecard is that you can trust the high numbers because you can see the low ones.

That honesty is the product. The same harness that returns 84/84 on correctness flags CDEDUP's latency stall and the throughput ceiling without flinching, because a scorecard you can only pass isn't measuring anything. The cache earns the critical path by being boring exactly where boring is the whole job: never the wrong answer, never the wrong tenant.