features · June 18, 2026 · 3 min read

CEVAL: nine evaluators that grade your LLM output without a second LLM

LLM-as-judge is expensive and leaks your data. CEVAL ships nine deterministic evaluators — toxicity, PII, injection-safety, relevance, JSON validity and more — that score input/output pairs locally and track the results over time.

The default way to evaluate LLM output in production is to call another LLM to grade it. That's a second bill, a second source of latency, and a second copy of your users' data leaving the building. CEVAL takes the other road: nine deterministic evaluators that score an input/output pair in-process, no model, no egress.

In plain words: Instead of paying a second AI to grade the first one's answers, Crowkis runs fast local checks — for toxicity, leaked personal data, valid JSON, relevance — and charts the pass rates over time.

The roster covers the checks teams actually run: non_empty, json_valid, toxicity, pii_leak, injection_safe, answered, exact_match, contains, and relevance. Call one by name, or run CEVAL SUITE to fire all of them. Each evaluator returns a name, a score, a pass/fail verdict, and a detail string you can log or alert on.
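To make the idea of a deterministic evaluator concrete, here is a minimal Python sketch. It is not Crowkis's implementation or API: EvalResult, the function signatures, and run_suite are hypothetical names chosen for illustration, and only three of the nine checks are shown. The result fields mirror the four described above (name, score, pass/fail, detail).

```python
import json
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Hypothetical result shape: name, score, pass/fail, detail."""
    name: str
    score: float
    passed: bool
    detail: str


def non_empty(output: str) -> EvalResult:
    # Passes when the output contains anything other than whitespace.
    ok = bool(output.strip())
    detail = "output has content" if ok else "output is empty or whitespace"
    return EvalResult("non_empty", 1.0 if ok else 0.0, ok, detail)


def json_valid(output: str) -> EvalResult:
    # Passes when the output parses as JSON with the standard library parser.
    try:
        json.loads(output)
    except json.JSONDecodeError as exc:
        return EvalResult("json_valid", 0.0, False, f"invalid JSON: {exc.msg}")
    return EvalResult("json_valid", 1.0, True, "parsed as JSON")


def contains(output: str, expected: str) -> EvalResult:
    # Passes when the expected substring appears verbatim in the output.
    ok = expected in output
    detail = f"found {expected!r}" if ok else f"missing {expected!r}"
    return EvalResult("contains", 1.0 if ok else 0.0, ok, detail)


def run_suite(output: str, expected: str) -> list[EvalResult]:
    # Assumed stand-in for a "run everything" call; runs the three checks above.
    return [non_empty(output), json_valid(output), contains(output, expected)]


if __name__ == "__main__":
    for result in run_suite('{"city": "Paris"}', "Paris"):
        print(result.name, result.score, result.passed, result.detail)
```

None of these checks calls a model or the network, so the same output always produces the same result. That property is what makes the scores comparable from one day to the next.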

[Figure: the Crowkis read path. Five gates, every one can veto. Reuse only when meaning, structure, confidence, and trust all agree.]

Because the evaluators are deterministic, the same input/output pair always gets the same score, so a shift in a pass rate reflects a change in your prompts or traffic rather than grader noise. CEVAL's per-evaluator counters surface on /metrics as crowkis_eval_* series, so you can watch your toxicity rate or JSON-validity rate as a time series on the same dashboard as your cache hits, and catch a regression the day a prompt change ships, not the week the complaints arrive.
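As a sketch of how per-evaluator counters could turn into time series, the Python below keeps run and failure counts and renders them in the Prometheus-style text format commonly served on /metrics. Only the crowkis_eval_ prefix comes from this post; the metric names crowkis_eval_runs_total and crowkis_eval_failures_total, the evaluator label, and the EvalCounters class are assumptions made for the example.

```python
from collections import Counter
from typing import Optional


class EvalCounters:
    """Hypothetical in-process counters, one pair per evaluator."""

    def __init__(self) -> None:
        self.runs: Counter = Counter()
        self.failures: Counter = Counter()

    def record(self, evaluator: str, passed: bool) -> None:
        # Count every run, and count failures separately.
        self.runs[evaluator] += 1
        if not passed:
            self.failures[evaluator] += 1

    def pass_rate(self, evaluator: str) -> Optional[float]:
        # Returns None when the evaluator has not run yet.
        runs = self.runs[evaluator]
        if runs == 0:
            return None
        return 1 - self.failures[evaluator] / runs

    def render(self) -> str:
        # One line per series: name{label="value"} count
        lines = []
        for name in sorted(self.runs):
            lines.append(
                f'crowkis_eval_runs_total{{evaluator="{name}"}} {self.runs[name]}'
            )
            lines.append(
                f'crowkis_eval_failures_total{{evaluator="{name}"}} {self.failures[name]}'
            )
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    counters = EvalCounters()
    counters.record("toxicity", True)
    counters.record("toxicity", False)
    counters.record("json_valid", True)
    print(counters.render(), end="")
```

Exporting raw counts rather than a precomputed rate leaves the windowing to the dashboard, which can divide failures by runs over whatever interval makes a regression visible.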

The bottom line

An eval you can afford to run on every request is worth more than a perfect one you run on a sample. CEVAL is cheap enough to be always-on, which is the only way evals catch the regression that happens at 2 a.m.