CEVAL: nine evaluators that grade your LLM output without a second LLM
LLM-as-judge is expensive and sends your users' data to a second model. CEVAL ships nine deterministic evaluators (toxicity, PII, injection-safety, relevance, JSON validity and more) that score input/output pairs locally and track the results over time.
The default way to evaluate LLM output in production is to call another LLM to grade it. That's a second bill, a second source of latency, and a second copy of your users' data leaving the building. CEVAL takes the other road: nine deterministic evaluators that score an input/output pair in-process, no model, no egress.
The roster covers the checks teams actually run: non_empty, json_valid, toxicity, pii_leak, injection_safe, answered, exact_match, contains, and relevance. Call one by name, or run CEVAL SUITE to fire all of them, and each returns a name, a score, a pass/fail, and a detail string you can log or alert on.
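To make the result shape concrete, here is a minimal Python sketch of what two of these checks could look like. It is an illustration only, not CEVAL's implementation: the `EvalResult` type, the function signatures, and the `run_suite` helper are all hypothetical names. Only the evaluator names and the four result fields (name, score, pass/fail, detail) come from the description above.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalResult:
    """Hypothetical result record: the four fields each evaluator returns."""
    name: str
    score: float
    passed: bool
    detail: str


def non_empty(inp: str, out: str) -> EvalResult:
    # Passes when the output contains anything other than whitespace.
    ok = bool(out.strip())
    detail = "output has content" if ok else "output is empty or whitespace"
    return EvalResult("non_empty", 1.0 if ok else 0.0, ok, detail)


def json_valid(inp: str, out: str) -> EvalResult:
    # Passes when the output parses as JSON; the detail carries the parse error.
    try:
        json.loads(out)
    except json.JSONDecodeError as exc:
        return EvalResult("json_valid", 0.0, False, f"parse error: {exc.msg}")
    return EvalResult("json_valid", 1.0, True, "output parses as JSON")


# Hypothetical registry: evaluator name -> function of (input, output).
EVALUATORS: Dict[str, Callable[[str, str], EvalResult]] = {
    "non_empty": non_empty,
    "json_valid": json_valid,
}


def run_suite(inp: str, out: str) -> List[EvalResult]:
    # Run every registered evaluator on one input/output pair.
    return [fn(inp, out) for fn in EVALUATORS.values()]


if __name__ == "__main__":
    for result in run_suite("Return the user as JSON", '{"name": "Ada"}'):
        print(result)
```

Because each check is a pure function of the input/output pair, the same pair always yields the same result, which is what makes the scores comparable from one day to the next.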
Because they're deterministic, the results are trackable rather than noisy. CEVAL's per-evaluator counters surface on /metrics as crowkis_eval_* series, so you can watch your toxicity rate or JSON-validity rate as a time series on the same dashboard as your cache hits, and catch a regression the day a prompt change ships, not the week the complaints arrive.
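As a rough picture of how per-evaluator counters turn into a rate you can chart, the sketch below keeps pass/fail counts and renders them in Prometheus-style text. The metric name `crowkis_eval_results_total`, its labels, and the `EvalCounters` class are assumptions made for illustration; the text above only establishes the `crowkis_eval_` prefix, so check the real /metrics output for the actual series names.

```python
from collections import Counter
from typing import Optional


class EvalCounters:
    """Hypothetical in-process pass/fail counters, one pair per evaluator."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, evaluator: str, passed: bool) -> None:
        # Increment the pass or fail counter for this evaluator.
        self.counts[(evaluator, "pass" if passed else "fail")] += 1

    def pass_rate(self, evaluator: str) -> Optional[float]:
        # Fraction of runs that passed, or None if the evaluator never ran.
        passed = self.counts[(evaluator, "pass")]
        failed = self.counts[(evaluator, "fail")]
        total = passed + failed
        return passed / total if total else None

    def render(self) -> str:
        # Prometheus-style text exposition; the metric name is assumed.
        lines = ["# TYPE crowkis_eval_results_total counter"]
        for (evaluator, result), n in sorted(self.counts.items()):
            lines.append(
                f'crowkis_eval_results_total{{evaluator="{evaluator}",result="{result}"}} {n}'
            )
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    counters = EvalCounters()
    for passed in (True, True, False):
        counters.record("json_valid", passed)
    print(counters.render())
    print(counters.pass_rate("json_valid"))
```

On a dashboard, the same ratio would typically be computed from the rate of the two counters over a time window, so a drop in the pass rate shows up as soon as the failing requests start arriving.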
The bottom line
An eval you can afford to run on every request is worth more than a perfect one you run on a sample. CEVAL is cheap enough to be always-on, which is the only way evals catch the regression that happens at 2 a.m.