Semantic Caching Is Harder Than It Looks
Semantic similarity measures intent, not validity. Safe AI caching needs source versions, schemas, model policy, evidence, permissions, review state, and wrong-hit tracking.
A contract review workflow extracts clause-level risk flags from supplier agreements. A new contract arrives from the same supplier. The semantic similarity score is high: same parties, same structure, same question. The system returns the cached result.
What it does not know: the supplier updated the liability cap three weeks ago. The old extraction had been acceptable. The new version is not. The cache did exactly what it was configured to do. The workflow had no way to know that the meaning of "safe to reuse" had changed.
Semantic caching is powerful only when the system knows what makes an answer reusable. Most systems do not.
Semantic similarity measures intent. It does not prove that the source, policy, schema, model, evidence, permissions, review state, and freshness requirements still make the cached output valid.
The thing prompt-level caching gets wrong
The default semantic cache is simple: embed the incoming request, find the nearest stored embedding, and return the cached result if the similarity score clears a threshold. That works in demos because demos rarely change the world around the answer.
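To make the failure mode concrete, here is a minimal sketch of that default lookup. The function names, the toy vectors, and the 0.9 threshold are illustrative assumptions, not a specific library's API; a real system would call an embedding model and a vector index.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def naive_lookup(
    query_vec: list[float],
    entries: list[tuple[list[float], dict]],
    threshold: float = 0.9,
) -> Optional[dict]:
    """Return the nearest entry's output if it clears the threshold.

    Nothing here looks at source version, schema, permissions, or
    review state: similarity is the whole decision.
    """
    best_score, best_output = 0.0, None
    for vec, output in entries:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_output = score, output
    return best_output if best_score >= threshold else None


# Hypothetical toy data: one stored extraction and a near-identical request.
entries = [([1.0, 0.0, 0.2], {"liability_cap": "acceptable"})]
hit = naive_lookup([0.98, 0.05, 0.21], entries)
```

The lookup returns the stored extraction for any sufficiently similar request, whether or not the underlying contract is the one that produced it.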
Two requests can mean the same thing and still need different answers because the document changed, the schema changed, the user has different permissions, or a reviewer corrected the old output. The drift usually falls into four categories:
- The source document, retrieval window, parsed artifact, or chunk set differs from the version that produced the cached output.
- The prompt, output schema, fallback rule, model route, or review condition changed since the result was stored.
- The cached answer may contain information the current user or tenant should not see.
- The supplier record, pricing table, contract clause, policy document, or business fact has changed.
A safe cache key is multi-dimensional
Semantic similarity can be one input into the cache decision. It should not be the whole decision.
The question every hit has to answer is why the stored output is still valid for this request. A cache that cannot answer that question is not an AI systems optimization. It is a shortcut around the trust boundary.
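One way to keep similarity as a single input is to run exact-match gates first and only then consult the score. The sketch below is an assumption about how that could look; `ReuseContext`, `reuse_decision`, the reason strings, and the threshold are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReuseContext:
    """Dimensions that must match exactly before similarity is considered."""
    source_version: str
    schema_version: str
    model_policy: str
    tenant_id: str


def reuse_decision(
    stored: ReuseContext,
    current: ReuseContext,
    similarity: float,
    threshold: float = 0.9,
) -> tuple[bool, str]:
    """Return (allowed, reason). Hard gates run before the similarity check."""
    for field in ("tenant_id", "source_version", "schema_version", "model_policy"):
        if getattr(stored, field) != getattr(current, field):
            return False, f"{field}_mismatch"
    if similarity < threshold:
        return False, "similarity_below_threshold"
    return True, "all_dimensions_match"


stored = ReuseContext("contract-v1", "schema-3", "policy-x", "tenant-a")
current = ReuseContext("contract-v2", "schema-3", "policy-x", "tenant-a")
decision = reuse_decision(stored, current, similarity=0.99)
```

Here a 0.99 similarity score still produces a refusal, because the source version changed. The returned reason is what would appear in a trace.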
Cache the right layer
Not everything in an AI workflow should be cached at the same level. Some outputs are safe to reuse. Others need freshness checks. A few should almost never be cached as final answers.
| Cache candidate | Safer cache boundary | Risk to watch |
|---|---|---|
| Document parsing | Cache parsed artifacts by file hash, parser version, and document version. | Parser upgrades can make old artifacts incomplete or incompatible. |
| Embeddings | Cache by source version, embedding model, chunking policy, and tenant boundary. | Embedding model changes can shift similarity distances. |
| Field extraction | Cache field claims only with evidence refs, schema version, source version, and support status. | Fields marked `missing_evidence` or `schema_failure` can be reused as if they were `success`. |
| Summaries | Cache only when source set, audience, policy, user scope, and freshness window match. | A summary can omit facts that matter in a newer version or different context. |
| Risk decisions | Prefer caching inputs, traces, and intermediate extractions, not final verdicts. | Stale decisions can become business risk. |
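For the lowest-risk row in the table, document parsing, the cache key can be built deterministically from the three inputs listed. This is a sketch under assumptions: the function name and the `|` delimiter are illustrative, and it presumes version strings never contain that delimiter.

```python
import hashlib


def parsed_artifact_key(
    file_bytes: bytes, parser_version: str, document_version: str
) -> str:
    """Derive a cache key from file hash, parser version, and document version.

    Changing any one of the three yields a different key, so a parser
    upgrade cannot silently serve artifacts from the old parser.
    """
    file_hash = hashlib.sha256(file_bytes).hexdigest()
    raw = "|".join([file_hash, parser_version, document_version])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


# Hypothetical inputs for illustration.
key_old_parser = parsed_artifact_key(b"%PDF-1.7 ...", "parser-1.4", "doc-v2")
key_new_parser = parsed_artifact_key(b"%PDF-1.7 ...", "parser-1.5", "doc-v2")
```

Because the parser version is part of the key, an upgrade results in cache misses rather than incompatible artifacts.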
Represent cache eligibility in code
A workflow should be able to say, in structured terms, why an output can or cannot be reused. That reason should travel with the output and appear in traces.
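```python
from pydantic import BaseModel


class CacheKey(BaseModel):
    """Every dimension that must match before a cached output is reused."""

    intent_embedding_hash: str
    source_version: str
    prompt_version: str
    schema_version: str
    model_policy: str
    evidence_required: bool
    tenant_id: str
    max_age_seconds: int | None = None  # optional freshness window


class CachedResult(BaseModel):
    """A cached output plus the metadata that explains why reuse was allowed."""

    output: dict
    cache_key: CacheKey
    evidence_refs: list[str]
    produced_at: str
    review_status: str | None = None
    invalidation_reason: str | None = None  # set when the entry is invalidated
```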
A source update invalidates by `source_version`. A schema release invalidates by `schema_version`. A reviewer correction can invalidate the specific entry and semantic neighbors. The cache hit carries enough metadata to explain why reuse was allowed.
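The freshness and review fields above can be turned into an explicit eligibility check. The following is a sketch, not a definitive implementation: it assumes `produced_at` is an ISO 8601 timestamp with a UTC offset, and the `"rejected"` review status and the reason strings are hypothetical values.

```python
from datetime import datetime, timezone
from typing import Optional


def eligibility(
    produced_at: str,
    max_age_seconds: Optional[int],
    review_status: Optional[str],
    invalidation_reason: Optional[str],
    now: datetime,
) -> tuple[bool, str]:
    """Return (eligible, reason) for a cached entry at time `now`."""
    if invalidation_reason is not None:
        return False, f"invalidated:{invalidation_reason}"
    if review_status == "rejected":  # hypothetical status value
        return False, "reviewer_rejected"
    if max_age_seconds is not None:
        # Assumes produced_at carries an offset, e.g. "+00:00".
        age = (now - datetime.fromisoformat(produced_at)).total_seconds()
        if age > max_age_seconds:
            return False, "max_age_exceeded"
    return True, "eligible"


now = datetime(2024, 1, 1, 1, 0, tzinfo=timezone.utc)
result = eligibility("2024-01-01T00:00:00+00:00", 1800, None, None, now)
```

The entry is one hour old against a 30-minute window, so the check refuses reuse and says why.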
Invalidation is the real design problem
Most cache bugs are not caused by storing the wrong thing. They are caused by failing to invalidate when the meaning of the stored output changes.
| Change | What should invalidate | Why |
|---|---|---|
| Prompt or schema update | Cached structured outputs for that route. | The expected object, acceptance rules, and semantics changed. |
| Source document update | Answers, summaries, embeddings, and evidence claims tied to the old source. | The cached output may no longer reflect the current source. |
| Model or embedding version change | Similarity matches and model-dependent outputs. | Distances and output behavior can shift even if the API shape is stable. |
| Access policy change | User-visible cached output that crosses permission boundaries. | Fast reuse cannot override authorization. |
| Reviewer correction | The corrected entry and related semantic-neighbor entries. | A human found the cached behavior wrong or incomplete. |
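Several rows of this table reduce to the same operation: find every live entry tied to the old value of one key field and mark it with a reason. A minimal in-memory sketch follows; `Entry`, `invalidate`, and the sample values are assumptions for illustration. Semantic-neighbor invalidation after a reviewer correction would need a similarity search and is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    """Hypothetical stand-in for a stored cache entry."""
    entry_id: str
    source_version: str
    schema_version: str
    model_policy: str
    invalidation_reason: Optional[str] = None


def invalidate(
    entries: list[Entry], field: str, old_value: str, reason: str
) -> list[str]:
    """Mark live entries whose `field` equals `old_value`; return their IDs."""
    affected = []
    for entry in entries:
        if entry.invalidation_reason is None and getattr(entry, field) == old_value:
            entry.invalidation_reason = reason
            affected.append(entry.entry_id)
    return affected


entries = [
    Entry("a", "contract-v1", "schema-3", "policy-x"),
    Entry("b", "contract-v2", "schema-3", "policy-x"),
    Entry("c", "contract-v1", "schema-4", "policy-x"),
]
changed = invalidate(entries, "source_version", "contract-v1", "source_updated")
```

Marking entries instead of deleting them keeps the reason available for traces and for wrong-hit analysis.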
Measure whether the cache is safe
Hit rate, latency saved, and cost saved tell you whether the cache is efficient. They do not tell you whether it is safe.
Wrong-hit rate is the percentage of cache hits that returned output that was incorrect, stale, unauthorized, unsupported, or policy-invalid for the request. If a reviewer corrects output that came from cache, that is a wrong hit. If an eval fixture proves the answer should have been regenerated, that is a wrong hit.
Tracking it takes three habits:
- Store the full key, triggering request, returned output, evidence refs, and reuse reason.
- Add eval fixtures where a similar request should not come from cache because source, schema, or access changed.
- Feed reviewer corrections and rejections back into invalidation and wrong-hit tracking.
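Wrong-hit rate as defined above is a simple ratio over logged cache hits. The sketch below assumes a reduced hit record; `HitRecord` and its field names are hypothetical, and a real log would also carry the full key, request, output, and evidence refs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HitRecord:
    """Hypothetical log record for one cache hit."""
    request_id: str
    reuse_reason: str
    # Set when a reviewer or eval fixture shows the hit was wrong,
    # e.g. "stale" or "unauthorized". None means no problem was found.
    wrong_hit_cause: Optional[str] = None


def wrong_hit_rate(hits: list[HitRecord]) -> float:
    """Fraction of cache hits later found to be wrong."""
    if not hits:
        return 0.0
    wrong = sum(1 for h in hits if h.wrong_hit_cause is not None)
    return wrong / len(hits)


hits = [
    HitRecord("r1", "all_dimensions_match"),
    HitRecord("r2", "all_dimensions_match", wrong_hit_cause="stale"),
    HitRecord("r3", "all_dimensions_match"),
    HitRecord("r4", "all_dimensions_match"),
]
rate = wrong_hit_rate(hits)
```

One wrong hit out of four gives a rate of 0.25. Breaking the count down by cause shows which invalidation rule is missing.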
Avoid the following:
- Cache by prompt string or embedding similarity alone.
- Reuse answers without checking source version, schema version, model policy, evidence, review state, and user permissions.
- Cache unsupported claims as if they were verified facts.
- Let semantic similarity override tenant isolation or access control.
- Ignore reviewer corrections when invalidating related cache entries.
- Measure cache hit rate without measuring wrong-hit rate.
What to actually build first
Start with the layer that has the lowest risk and highest reuse value. That is usually document parsing and embeddings, not final answers.
| Build order | Implementation options | What to evaluate |
|---|---|---|
| Parsed artifact cache | Cache parsed documents by file hash, parser version, and document version. | Whether repeated processing drops without stale artifacts affecting downstream steps. |
| Embedding cache | Cache embeddings by source version, embedding model, chunking policy, and tenant ID. | Whether retrieval behavior stays stable across source and model updates. |
| Field extraction cache | Cache only successful field claims with evidence refs, source version, schema version, and review-aware invalidation. | Whether `missing_evidence` and `schema_failure` are never reused as success. |
| Summary cache | Cache summaries only with audience, source set, policy version, access scope, and freshness window. | Whether reused summaries remain complete enough for the current context. |
| Risk decision cache | Cache final decisions last, if at all, and only behind explicit revalidation. | Whether wrong-hit rate stays within the workflow's risk tolerance. |
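The evaluation criterion for the field extraction cache can be enforced at write time. This sketch assumes each field claim is a dict with `status` and `evidence_refs` keys; that shape and the function name are illustrative, while the status values are the ones used in the tables above.

```python
def cacheable_field_claims(claims: list[dict]) -> list[dict]:
    """Keep only claims that succeeded and carry at least one evidence ref.

    Claims marked missing_evidence or schema_failure are never written,
    so they cannot be reused later as if they were successes.
    """
    return [
        claim
        for claim in claims
        if claim.get("status") == "success" and claim.get("evidence_refs")
    ]


# Hypothetical extraction output for illustration.
claims = [
    {"field": "liability_cap", "status": "success", "evidence_refs": ["clause-12"]},
    {"field": "termination", "status": "missing_evidence", "evidence_refs": []},
    {"field": "governing_law", "status": "schema_failure", "evidence_refs": ["clause-3"]},
    {"field": "payment_terms", "status": "success", "evidence_refs": []},
]
to_cache = cacheable_field_claims(claims)
```

Only the `liability_cap` claim is written. The last claim reports success but has no evidence refs, so it is treated as unsupported and skipped.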
Where this shows up
Semantic caching pays off wherever workflows repeat, but every reuse still has to respect the workflow's trust boundaries.
PolicyTrace could safely cache parsed artifacts and some extraction evidence by source hash, but final claims still need schema, provenance, source version, and review-aware invalidation.
A contract workflow would need clause, amendment, policy, tenant, access, and reviewer-correction-aware cache keys before reusing risk summaries.
An invoice workflow could cache supplier matching and document parsing, while treating totals, tax, PO matching, and exception decisions as freshness-sensitive.
The practical takeaway
Semantic caching and output quality are not naturally in tension. They become enemies when the cache ignores the trust boundaries the rest of the system enforces.
Cache the artifact layer first. Cache answers carefully, with full key metadata. Measure wrong-hit rate, not just hit rate. And make sure that when a reviewer finds an error, the cache learns from it.
This closes the Efficiency Layer batch: route the model, control token economics, and cache only when trust boundaries still hold.