guides · June 21, 2026 · 5 min read

Cache an LLM call in three lines: the Python SDK

The Python SDK wraps the semantic cache in an ergonomic client — get-or-compute, streaming, tenants, models. Here's the three-line version and the production version.

The fastest way to feel the savings is the get-or-compute pattern: ask the cache first, and only call the model on a miss — banking the result for every future paraphrase. The Python SDK makes it three lines around your existing model call.

the three-line version

from crowkis import CrowkisClient

cache = CrowkisClient(host="127.0.0.1", port=6379, tenant="demo", model="gpt-4o")

prompt = "Explain vector caches in one paragraph"
answer = cache.get_or_compute(
    prompt,
    compute=lambda: call_your_model(prompt),  # only runs on a miss
)

On the first call, `compute` runs and the answer is stored with the full anti-poisoning pipeline. On every semantically similar call after, the cache returns the stored answer in well under a millisecond — no model call, no token cost.
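To make the hit/miss behaviour concrete, here is a minimal sketch of the get-or-compute pattern on its own. It is not the SDK's implementation: `ToyCache` and `fake_model` are hypothetical names, and the lookup is a normalised exact match standing in for real semantic matching. It only shows that `compute` runs once and later lookups skip it.

```python
from typing import Callable, Dict, Tuple


class ToyCache:
    """Exact-match stand-in for a semantic cache (illustration only)."""

    def __init__(self, tenant: str, model: str) -> None:
        self.tenant = tenant
        self.model = model
        self._store: Dict[Tuple[str, str, str], str] = {}

    def _key(self, prompt: str) -> Tuple[str, str, str]:
        # Normalise case and whitespace; a real semantic cache would
        # compare embeddings instead of strings (assumption for this sketch).
        return (self.tenant, self.model, " ".join(prompt.lower().split()))

    def get_or_compute(self, prompt: str, compute: Callable[[], str]) -> str:
        key = self._key(prompt)
        if key in self._store:
            return self._store[key]  # hit: compute is never called
        value = compute()            # miss: call the model once
        self._store[key] = value     # store it for later lookups
        return value


calls = {"n": 0}


def fake_model() -> str:
    # Hypothetical model call that counts how often it is invoked.
    calls["n"] += 1
    return "A vector cache stores answers keyed by embeddings."


cache = ToyCache(tenant="demo", model="gpt-4o")
first = cache.get_or_compute("Explain vector caches", compute=fake_model)
second = cache.get_or_compute("  explain VECTOR caches ", compute=fake_model)
```

Keying the store on tenant and model as well as the prompt is one simple way to keep answers from leaking across tenants or models; how the real client isolates them is not shown here.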

the production version — explicit set/get with confidence

# store with a TTL; the model is the one configured on the client
cache.cset("Explain vector caches", answer, ttl=3600)

def cached_answer(prompt):
    # read, gated on confidence; fall back to the model if unsure
    hit = cache.cget(prompt, with_confidence=True)
    if hit and hit.confidence >= 0.88:
        return hit.value
    return call_your_model(prompt)

print(cached_answer("what is a semantic cache?"))

cache.close()

In plain words: Ask the cache first; only pay the model on a real miss. The second time anyone asks the same thing — even worded differently — it's free and instant.
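The confidence gate can be illustrated without the SDK. The sketch below is an assumption-heavy toy: `embed` uses word counts rather than a learned embedding, and `best_match` and `answer` are hypothetical helpers, not part of the client. It shows only the decision rule: serve the stored value when similarity clears the threshold, otherwise call the model.

```python
import math
from collections import Counter
from typing import Callable, List, Optional, Tuple


def embed(text: str) -> Counter:
    # Toy embedding: lowercase word counts. A real semantic cache would
    # use a learned embedding model (assumption for this sketch).
    return Counter(text.lower().replace("?", "").split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def best_match(query: str, entries: List[Tuple[str, str]]) -> Optional[Tuple[str, float]]:
    """Return (value, confidence) for the closest stored prompt, or None if empty."""
    scored = [(value, cosine(embed(query), embed(prompt))) for prompt, value in entries]
    return max(scored, key=lambda pair: pair[1]) if scored else None


def answer(
    query: str,
    entries: List[Tuple[str, str]],
    model: Callable[[str], str],
    threshold: float = 0.88,
) -> Tuple[str, str]:
    """Serve from the cache only when confidence clears the threshold."""
    hit = best_match(query, entries)
    if hit is not None and hit[1] >= threshold:
        return hit[0], "cache"
    return model(query), "model"


entries = [("what is a semantic cache", "stored answer")]
served = answer("What is a semantic cache?", entries, model=lambda q: "fresh answer")
missed = answer("how do I bake bread", entries, model=lambda q: "fresh answer")
```

A higher threshold trades hit rate for safety: fewer paraphrases are served from the cache, but a wrong answer is less likely to be returned for a merely similar question.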

Install with `pip install crowkis` (or `pip install ./sdk/python` from a checkout of the repo). The client can be used sync or async, supports per-call `tenant` and `model` overrides for isolation and accounting, and exposes streaming helpers so a cached answer can be typed out like a live one.
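As a rough picture of what "typed out like a live one" means, a cached string can be replayed in small chunks through a generator. `replay_stream` below is a hypothetical helper written for this illustration, not the SDK's streaming API, whose names and signatures are not shown in this guide.

```python
import time
from typing import Iterator


def replay_stream(text: str, chunk_size: int = 8, delay: float = 0.0) -> Iterator[str]:
    """Yield a cached answer in small chunks, like tokens arriving from a model."""
    for start in range(0, len(text), chunk_size):
        if delay:
            time.sleep(delay)  # optional pacing between chunks
        yield text[start:start + chunk_size]


cached = "A semantic cache returns stored answers for similar prompts."
chunks = list(replay_stream(cached, chunk_size=10))
```

A UI that already consumes a model's token stream could consume this generator the same way, so the cached path and the live path share one rendering code path.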