Semantic Caching Is Harder Than It Looks

Semantic similarity measures intent, not validity. Safe AI caching needs source versions, schemas, model policy, evidence, permissions, review state, and wrong-hit tracking.

The cached answer was fast. It was also wrong now.

A contract review workflow extracts clause-level risk flags from supplier agreements. A new contract arrives from the same supplier. The semantic similarity score is high: same parties, same structure, same question. The system returns the cached result.

What it does not know: the supplier updated the liability cap three weeks ago. The old extraction was acceptable for the contract it was produced from. It is not acceptable for the new version. The cache did exactly what it was configured to do. The workflow had no way to know that the meaning of "safe to reuse" had changed.

Semantic caching is powerful only when the system knows what makes an answer reusable. Most systems do not.

A cache hit is a trust decision, not just a similarity score.

Semantic similarity measures intent. It does not prove that the source, policy, schema, model, evidence, permissions, review state, and freshness requirements still make the cached output valid.

The thing prompt-level caching gets wrong

The default semantic cache is simple: embed the incoming request, find the nearest stored embedding, and return the cached result if the similarity score clears a threshold. That works in demos because demos rarely change the world around the answer.
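As a point of reference, the default lookup described above can be sketched in a few lines. This is a minimal illustration, not any particular library's API: `naive_lookup`, the `0.9` threshold, and the toy 3-dimensional embeddings are all assumptions made for the example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def naive_lookup(query_embedding, entries, threshold=0.9):
    """Return the output of the nearest entry if its score clears the threshold.

    `entries` is a list of (embedding, output) pairs. Nothing here checks
    source version, schema, permissions, or review state.
    """
    best_output, best_score = None, -1.0
    for embedding, output in entries:
        score = cosine_similarity(query_embedding, embedding)
        if score > best_score:
            best_output, best_score = output, score
    return best_output if best_score >= threshold else None


# Toy embeddings stand in for a real embedding model (hypothetical values).
entries = [([1.0, 0.0, 0.0], "liability cap: acceptable")]
hit = naive_lookup([0.99, 0.05, 0.0], entries)   # near-identical request
miss = naive_lookup([0.0, 1.0, 0.0], entries)    # unrelated request
```

The near-identical request gets the stored answer back regardless of whether the underlying contract has changed, which is the gap the rest of this article addresses.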

A similarity score is not a validity check.

Two requests can mean the same thing and still need different answers because the document changed, the schema changed, the user has different permissions, or a reviewer corrected the old output.

1. Context changes

The source document, retrieval window, parsed artifact, or chunk set differs from the version that produced the cached output.

2. Policy changes

The prompt, output schema, fallback rule, model route, or review condition changed since the result was stored.

3. Permission changes

The cached answer may contain information the current user or tenant should not see.

4. Truth changes

The supplier record, pricing table, contract clause, policy document, or business fact has changed.

A safe cache key is multi-dimensional

Semantic similarity can be one input into the cache decision. It should not be the whole decision.

01 Intent: Meaning, task type, and route.
02 Source: Document, chunk, or artifact version.
03 Policy: Prompt, schema, fallback, and review rules.
04 Model: Provider, model version, parameters.
05 Evidence: Refs, support status, and quality.
06 Access: User, tenant, and permission boundary.
07 Review: Accepted, corrected, rejected, or pending.
08 Freshness: Max age and invalidation triggers.

The question is not "is this similar?" The question is "is this still valid for this request?"

A cache that cannot answer that question is not an AI systems optimization. It is a shortcut around the trust boundary.
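One way to make that question answerable is to treat similarity as a necessary condition and require every other dimension to match as well. The sketch below assumes hypothetical names (`Dimensions`, `StoredEntry`, `reuse_decision`) and uses standard-library dataclasses. Blocking only `corrected` and `rejected` review states is an assumption; a stricter workflow might also block `pending`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Dimensions:
    """Exact-match dimensions; field names are illustrative."""
    source_version: str
    policy_version: str  # prompt, schema, fallback, and review rules
    model_policy: str    # provider, model version, parameters
    tenant_id: str       # access boundary


@dataclass(frozen=True)
class StoredEntry:
    dims: Dimensions
    evidence_supported: bool
    review_status: str   # "accepted", "corrected", "rejected", or "pending"
    age_seconds: int


def reuse_decision(similarity: float, stored: StoredEntry, current: Dimensions,
                   max_age_seconds: Optional[int] = None,
                   threshold: float = 0.9) -> tuple[bool, list[str]]:
    """Return (allowed, reasons); reuse is allowed only when no reason blocks it."""
    reasons: list[str] = []
    if similarity < threshold:
        reasons.append("intent_mismatch")
    for name in ("source_version", "policy_version", "model_policy", "tenant_id"):
        if getattr(stored.dims, name) != getattr(current, name):
            reasons.append(f"{name}_mismatch")
    if not stored.evidence_supported:
        reasons.append("unsupported_evidence")
    if stored.review_status in ("corrected", "rejected"):
        reasons.append(f"review_{stored.review_status}")
    if max_age_seconds is not None and stored.age_seconds > max_age_seconds:
        reasons.append("stale")
    return (not reasons, reasons)


stored = StoredEntry(
    Dimensions("contract-v1", "policy-3", "model-a", "tenant-1"),
    evidence_supported=True, review_status="accepted", age_seconds=60,
)
current = Dimensions("contract-v2", "policy-3", "model-a", "tenant-1")
allowed, reasons = reuse_decision(0.97, stored, current, max_age_seconds=3600)
```

Here the similarity score is high, but the source version differs, so reuse is refused and the reason list can be written to the trace.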

Cache the right layer

Not everything in an AI workflow should be cached at the same level. Some outputs are safe to reuse. Others need freshness checks. A few should almost never be cached as final answers.

| Cache candidate | Safer cache boundary | Risk to watch |
| --- | --- | --- |
| Document parsing | Cache parsed artifacts by file hash, parser version, and document version. | Parser upgrades can make old artifacts incomplete or incompatible. |
| Embeddings | Cache by source version, embedding model, chunking policy, and tenant boundary. | Embedding model changes can shift similarity distances. |
| Field extraction | Cache field claims only with evidence refs, schema version, source version, and support status. | Fields marked `missing_evidence` or `schema_failure` can be reused as if they were `success`. |
| Summaries | Cache only when source set, audience, policy, user scope, and freshness window match. | A summary can omit facts that matter in a newer version or different context. |
| Risk decisions | Prefer caching inputs, traces, and intermediate extractions, not final verdicts. | Stale decisions can become business risk. |
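For the document parsing row, a deterministic key can be derived from the three listed inputs. This is a sketch under assumptions: `parsed_artifact_key` and the version strings are hypothetical, and SHA-256 over a JSON-encoded list is just one reasonable way to combine the parts without delimiter collisions.

```python
import hashlib
import json


def parsed_artifact_key(file_bytes: bytes, parser_version: str,
                        document_version: str) -> str:
    """Derive a cache key from the file hash, parser version, and document version."""
    file_hash = hashlib.sha256(file_bytes).hexdigest()
    # JSON-encode the parts so that no separator character can cause collisions.
    raw = json.dumps([file_hash, parser_version, document_version])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


# Same file, different parser version: the keys differ, so a parser upgrade
# never serves artifacts produced by the old parser.
key_old_parser = parsed_artifact_key(b"contract text", "parser-1.4", "doc-v1")
key_new_parser = parsed_artifact_key(b"contract text", "parser-2.0", "doc-v1")
```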

Represent cache eligibility in code

A workflow should be able to say, in structured terms, why an output can or cannot be reused. That reason should travel with the output and appear in traces.

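```python
from pydantic import BaseModel


class CacheKey(BaseModel):
    """Every dimension that must match before a cached output is reused."""

    intent_embedding_hash: str  # identifies the request intent
    source_version: str  # document, chunk, or artifact version
    prompt_version: str
    schema_version: str
    model_policy: str  # provider, model version, parameters
    evidence_required: bool
    tenant_id: str  # access boundary
    max_age_seconds: int | None = None  # optional freshness limit


class CachedResult(BaseModel):
    """A stored output together with the metadata that justifies reusing it."""

    output: dict
    cache_key: CacheKey
    evidence_refs: list[str]
    produced_at: str  # timestamp string
    review_status: str | None = None  # accepted, corrected, rejected, or pending
    invalidation_reason: str | None = None  # set when the entry is invalidated
```

This turns invalidation from guesswork into workflow logic.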

A source update invalidates by `source_version`. A schema release invalidates by `schema_version`. A reviewer correction can invalidate the specific entry and semantic neighbors. The cache hit carries enough metadata to explain why reuse was allowed.
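Invalidation by key field could look roughly like the following. It is a sketch, not a production store: `InMemoryCache` and `invalidate_where` are hypothetical names, and the example uses standard-library dataclasses so it runs without extra dependencies.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entry:
    output: dict
    source_version: str
    schema_version: str
    invalidation_reason: Optional[str] = None


@dataclass
class InMemoryCache:
    entries: dict[str, Entry] = field(default_factory=dict)

    def invalidate_where(self, field_name: str, value: str, reason: str) -> int:
        """Mark every live entry whose `field_name` equals `value`; return the count."""
        count = 0
        for entry in self.entries.values():
            if entry.invalidation_reason is None and getattr(entry, field_name) == value:
                entry.invalidation_reason = reason
                count += 1
        return count

    def get(self, entry_id: str) -> Optional[Entry]:
        """Return the entry only if it exists and has not been invalidated."""
        entry = self.entries.get(entry_id)
        if entry is None or entry.invalidation_reason is not None:
            return None
        return entry


cache = InMemoryCache()
cache.entries["a"] = Entry({"risk": "low"}, "contract-v1", "schema-2")
cache.entries["b"] = Entry({"risk": "high"}, "contract-v2", "schema-2")

# The supplier updated the contract: everything tied to the old source goes.
invalidated = cache.invalidate_where("source_version", "contract-v1", "source_updated")
```

Marking entries with a reason instead of deleting them keeps the explanation available for traces and later review.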

Invalidation is the real design problem

Most cache bugs are not caused by storing the wrong thing. They are caused by failing to invalidate when the meaning of the stored output changes.

| Change | What should invalidate | Why |
| --- | --- | --- |
| Prompt or schema update | Cached structured outputs for that route. | The expected object, acceptance rules, and semantics changed. |
| Source document update | Answers, summaries, embeddings, and evidence claims tied to the old source. | The cached output may no longer reflect the current source. |
| Model or embedding version change | Similarity matches and model-dependent outputs. | Distances and output behavior can shift even if the API shape is stable. |
| Access policy change | User-visible cached output that crosses permission boundaries. | Fast reuse cannot override authorization. |
| Reviewer correction | The corrected entry and related semantic-neighbor entries. | A human found the cached behavior wrong or incomplete. |
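The reviewer-correction row is the least mechanical of these, because "semantic neighbors" needs a definition. One possible reading, sketched below, is every entry whose stored embedding is within a similarity threshold of the corrected entry. The function name, the `0.95` threshold, and the toy 2-dimensional embeddings are assumptions for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def neighbors_to_invalidate(corrected_id: str,
                            embeddings: dict[str, list[float]],
                            neighbor_threshold: float = 0.95) -> list[str]:
    """Return the corrected entry's ID plus the IDs of entries close to it."""
    anchor = embeddings[corrected_id]
    return sorted(
        entry_id
        for entry_id, embedding in embeddings.items()
        if entry_id == corrected_id
        or cosine_similarity(anchor, embedding) >= neighbor_threshold
    )


# Toy embeddings keyed by cache entry ID (hypothetical values).
embeddings = {"q1": [1.0, 0.0], "q2": [0.98, 0.2], "q3": [0.0, 1.0]}
to_invalidate = neighbors_to_invalidate("q1", embeddings)
```

With these values, `q2` is close enough to `q1` to be invalidated alongside it, while `q3` is left alone. The right threshold depends on how costly a wrong hit is compared with a regeneration.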

Measure whether the cache is safe

Hit rate, latency saved, and cost saved tell you whether the cache is efficient. They do not tell you whether it is safe.

The metric that matters most is wrong-hit rate.

Wrong-hit rate is the percentage of cache hits that returned output that was incorrect, stale, unauthorized, unsupported, or policy-invalid for the request. If a reviewer corrects output that came from cache, that is a wrong hit. If an eval fixture proves the answer should have been regenerated, that is a wrong hit.
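The arithmetic is simple once hits are logged with an outcome. A minimal sketch, assuming each logged hit carries a hypothetical `wrong_reason` field that stays `None` unless a reviewer or eval fixture flags the hit:

```python
def wrong_hit_rate(hits: list[dict]) -> float:
    """Fraction of cache hits later judged wrong; 0.0 when there are no hits."""
    if not hits:
        return 0.0
    wrong = sum(1 for hit in hits if hit["wrong_reason"] is not None)
    return wrong / len(hits)


# Hypothetical hit log: two of the four hits were flagged after the fact.
hits = [
    {"id": "h1", "wrong_reason": None},
    {"id": "h2", "wrong_reason": "stale"},
    {"id": "h3", "wrong_reason": None},
    {"id": "h4", "wrong_reason": "reviewer_corrected"},
]
rate = wrong_hit_rate(hits)  # 2 wrong hits out of 4
```

The function returns a fraction; multiply by 100 to report it as a percentage next to hit rate.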

1. Log cache hits

Store the full key, triggering request, returned output, evidence refs, and reuse reason.

2. Test negative cases

Add eval fixtures where a similar request should not come from cache because source, schema, or access changed.

3. Use review feedback

Feed reviewer corrections and rejections back into invalidation and wrong-hit tracking.
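Negative-case fixtures from step 2 can be very small. The sketch below assumes a toy `lookup` function standing in for the cache under test, with a normalized intent label in place of a real semantic match. All names and values are hypothetical.

```python
def lookup(cache: dict, intent: str, source_version: str, tenant_id: str):
    """Toy cache under test: every dimension must match exactly."""
    return cache.get((intent, source_version, tenant_id))


cache = {("liability_risk", "contract-v1", "tenant-1"): {"risk": "low"}}

# Each fixture names a request and states whether a cache hit is acceptable.
fixtures = [
    {"name": "same request",
     "args": ("liability_risk", "contract-v1", "tenant-1"), "expect_hit": True},
    {"name": "source changed",
     "args": ("liability_risk", "contract-v2", "tenant-1"), "expect_hit": False},
    {"name": "tenant changed",
     "args": ("liability_risk", "contract-v1", "tenant-2"), "expect_hit": False},
]

# A fixture fails when the cache hits where it should miss, or the reverse.
failures = [
    fixture["name"]
    for fixture in fixtures
    if (lookup(cache, *fixture["args"]) is not None) != fixture["expect_hit"]
]
```

An empty `failures` list means the cache refused every reuse it was supposed to refuse. A similarity-only cache would fail the second and third fixtures.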

Do not do this.
  • Cache by prompt string or embedding similarity alone.
  • Reuse answers without checking source version, schema version, model policy, evidence, review state, and user permissions.
  • Cache unsupported claims as if they were verified facts.
  • Let semantic similarity override tenant isolation or access control.
  • Ignore reviewer corrections when invalidating related cache entries.
  • Measure cache hit rate without measuring wrong-hit rate.

What to actually build first

Start with the layer that has the lowest risk and highest reuse value. That is usually document parsing and embeddings, not final answers.

| Build order | Implementation options | What to evaluate |
| --- | --- | --- |
| Parsed artifact cache | Cache parsed documents by file hash, parser version, and document version. | Whether repeated processing drops without stale artifacts affecting downstream steps. |
| Embedding cache | Cache embeddings by source version, embedding model, chunking policy, and tenant ID. | Whether retrieval behavior stays stable across source and model updates. |
| Field extraction cache | Cache only successful field claims with evidence refs, source version, schema version, and review-aware invalidation. | Whether `missing_evidence` and `schema_failure` are never reused as success. |
| Summary cache | Cache summaries only with audience, source set, policy version, access scope, and freshness window. | Whether reused summaries remain complete enough for the current context. |
| Risk decision cache | Cache final decisions last, if at all, and only behind explicit revalidation. | Whether wrong-hit rate stays within the workflow's risk tolerance. |
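For the field extraction row, the eligibility rule can be enforced at write time so that failed claims never enter the cache. A minimal sketch, assuming each claim is a dict with hypothetical `status` and `evidence_refs` keys and the status values named in the table:

```python
def cacheable_field_claims(claims: list[dict]) -> list[dict]:
    """Keep only claims that succeeded and carry at least one evidence ref."""
    return [
        claim for claim in claims
        if claim["status"] == "success" and claim["evidence_refs"]
    ]


# Hypothetical extraction output for one contract.
claims = [
    {"field": "liability_cap", "status": "success", "evidence_refs": ["clause-12"]},
    {"field": "termination_notice", "status": "missing_evidence", "evidence_refs": []},
    {"field": "governing_law", "status": "schema_failure", "evidence_refs": ["clause-30"]},
    {"field": "payment_terms", "status": "success", "evidence_refs": []},
]
to_cache = cacheable_field_claims(claims)
```

Only `liability_cap` survives the filter. The last claim is dropped even though its status is `success`, because a claim without evidence refs cannot explain itself on reuse.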

Where this shows up

Semantic caching pays off when workflows repeat, but the trust boundaries still apply on every reuse.

PolicyTrace

PolicyTrace could safely cache parsed artifacts and some extraction evidence by source hash, but final claims still need schema, provenance, source version, and review-aware invalidation.

Future ContractCopilot

A contract workflow would need clause, amendment, policy, tenant, access, and reviewer-correction-aware cache keys before reusing risk summaries.

Future invoice intelligence

An invoice workflow could cache supplier matching and document parsing, while treating totals, tax, PO matching, and exception decisions as freshness-sensitive.

The practical takeaway

Semantic caching and output quality are not naturally in tension. They become enemies when the cache ignores the trust boundaries the rest of the system enforces.

A fast wrong answer is still a production failure.

Cache the artifact layer first. Cache answers carefully, with full key metadata. Measure wrong-hit rate, not just hit rate. And make sure that when a reviewer finds an error, the cache learns from it.

Use caching as part of the system, not a shortcut around it.

This closes the Efficiency Layer batch: route the model, control token economics, and cache only when trust boundaries still hold.