Designing Fallbacks for AI Systems
AI failures need designed paths: retry, narrower scope, simpler model, stronger model, human review, partial answer, safe message, or stop.
Consider a document extraction pipeline that returns a confident invoice total. The source PDF has two tables: a draft estimate on page one and a revised final total on page three. The model blends the conflicting evidence. No error is thrown, no flag is raised, and the wrong amount moves forward.
That is not only a model failure. It is a system design failure. The model produced a plausible answer. The workflow had no designed path for when that answer should not be trusted.
A fallback is not what the system does after it breaks. It is part of how the system works.
"Try harder" is not a fallback. A production AI workflow needs explicit decisions about when to retry, narrow scope, switch path, ask for review, return partial output, or stop safely.
Why fallback design gets skipped
Fallbacks feel like a second-order problem during development because demos are engineered to avoid them. The input is clean, the scope is narrow, and the model cooperates.
"Send it to a human" is not a fallback unless the reviewer gets the input, output, evidence, trigger reason, and available actions.
Teams often prefer a weak answer over a safe stop, even when the workflow has no supporting evidence.
A loop that runs until something passes is deferred uncertainty dressed as reliability.
Fallbacks are conditional, not just sequential
A fallback ladder is useful as a menu of options. But production systems should not blindly walk the same ladder for every failure. The routing policy should choose the fallback based on the failure type.
- Retry: use when the failure is repairable, such as malformed JSON, missing required fields, or schema violations.
- Narrow scope: use when ambiguity is too high. Restrict the task to one field family, one page range, one table, one row, or one source section.
- Switch path: use when the workflow needs a different parser, prompt, model policy, authority rule, retrieval scope, or specialist extractor.
- Ask for review: use when the system has enough context to help a human decide and the decision is worth human time.
- Return partial output or stop: use when the workflow can support only part of the answer, or when automation should not continue.
Missing evidence should not trigger the same behavior as a rate limit. A conflicting value should not be handled like malformed JSON. An unsupported user request should not go to a reviewer who has no authority to make it supported.
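One way to make that routing explicit is a small lookup from failure type to first fallback step. The sketch below is illustrative only: the `Failure` and `Fallback` names and the specific pairings are assumptions made for this example, not a prescribed taxonomy.

```python
from enum import Enum


class Failure(Enum):
    INVALID_OUTPUT = "invalid_structured_output"
    RATE_LIMIT = "provider_rate_limit"
    MISSING_EVIDENCE = "missing_evidence"
    CONFLICTING_VALUES = "conflicting_values"
    UNSUPPORTED_REQUEST = "unsupported_request"
    HIGH_RISK_UNCERTAINTY = "high_risk_uncertainty"


class Fallback(Enum):
    RETRY_WITH_ERRORS = "retry_with_validation_errors"
    BACKOFF = "backoff_then_degrade"
    NARROW_SCOPE = "narrow_scope"
    APPLY_AUTHORITY_RULES = "apply_authority_rules"
    SAFE_MESSAGE = "safe_message"
    STOP_FOR_REVIEW = "stop_for_review"


# Hypothetical routing policy: one explicit first step per failure type.
ROUTES: dict[Failure, Fallback] = {
    Failure.INVALID_OUTPUT: Fallback.RETRY_WITH_ERRORS,
    Failure.RATE_LIMIT: Fallback.BACKOFF,
    Failure.MISSING_EVIDENCE: Fallback.NARROW_SCOPE,
    Failure.CONFLICTING_VALUES: Fallback.APPLY_AUTHORITY_RULES,
    Failure.UNSUPPORTED_REQUEST: Fallback.SAFE_MESSAGE,
    Failure.HIGH_RISK_UNCERTAINTY: Fallback.STOP_FOR_REVIEW,
}


def choose_fallback(failure: Failure) -> Fallback:
    # An unmapped failure stops for review instead of defaulting to a retry.
    return ROUTES.get(failure, Fallback.STOP_FOR_REVIEW)
```

Keeping the policy in data rather than in prompt text means it can be reviewed, versioned, and tested like any other configuration.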
The fallback table
Different failures need different paths. The important part is not just what the system does, but what state it records when it does it.
| Failure | Fallback path | What to record |
|---|---|---|
| Invalid structured output | Retry once with validation errors, then store the failure and route to review. | Raw output, schema version, validation errors, retry count, final status. |
| Provider rate limit or capacity spike | Use exponential backoff with jitter, respect retry headers, cap concurrency, then degrade or stop safely. | Provider error, status code, wait time, retry schedule, concurrency level, final fallback. |
| Missing evidence | Narrow retrieval scope, run source-specific extraction, or mark the field unsupported. | Evidence requirement, source searched, missing field, unsupported status. |
| Conflicting values | Apply authority rules first; preserve unresolved candidates and route to review. | Candidate values, source documents, authority rule applied, reviewer decision. |
| Unsupported user request | Return a safe message naming the supported scope and the input needed. | Request type, unsupported reason, user-facing response, suggested next step. |
| High-risk uncertainty | Stop automation and require review rather than generating a confident answer from weak evidence. | Risk reason, uncertainty signal, review priority, reviewer outcome. |
A fallback that does not record why it fired, what state it saw, and what outcome it produced cannot be improved. You cannot regression-test behavior you cannot observe.
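A minimal sketch of such a record, assuming one JSON line per fallback event. The `FallbackEvent` class, its field names, and the example values are hypothetical; a real workflow would align them with its own tracing format.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class FallbackEvent:
    failure_type: str     # why the fallback fired
    fallback_path: str    # which path was taken
    trigger_reason: str   # human-readable detail
    observed_state: dict  # what the workflow saw at that moment
    outcome: str          # what the fallback produced
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        # Sorted keys keep log lines stable and easy to diff.
        return json.dumps(asdict(self), sort_keys=True)


# Illustrative values only.
event = FallbackEvent(
    failure_type="invalid_structured_output",
    fallback_path="retry_with_validation_errors",
    trigger_reason="required field 'invoice_total' missing",
    observed_state={"schema_version": "v3", "retry_count": 1},
    outcome="routed_to_review",
)
line = event.to_json()
```

Because each event is a flat, serializable record, the same lines can feed dashboards and be replayed as regression fixtures.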
Narrowing the task means decomposing the failure
When a broad extraction fails, the natural instinct is to retry the same call with a better prompt. Often the better move is to split the work into smaller pieces that can succeed or fail independently.
An invoice schema that tries to extract supplier name, line items, totals, tax, purchase order number, and payment terms in one call can fail as a unit when one field is ambiguous. Break it up. Extract totals separately from line items. Extract named parties separately from dates. Smaller schemas isolate the failing field and let the workflow return supported parts instead of discarding everything.
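The decomposition can be sketched as a loop over field groups, where each group is extracted by its own call and fails on its own. The grouping, the `Extractor` signature, and `fake_extract` (a stand-in for a real model call) are assumptions for illustration.

```python
from typing import Callable

# Hypothetical field groups; each is extracted by its own call.
FIELD_GROUPS: dict[str, list[str]] = {
    "parties": ["supplier_name"],
    "totals": ["invoice_total", "tax_amount"],
    "references": ["purchase_order", "payment_terms"],
}

Extractor = Callable[[str, list[str]], dict[str, str]]


def extract_by_group(document: str, extract: Extractor) -> dict[str, dict]:
    results: dict[str, dict] = {}
    for group, fields in FIELD_GROUPS.items():
        try:
            values = extract(document, fields)
            results[group] = {"status": "success", "values": values}
        except ValueError as exc:
            # Only this group fails; the others keep their results.
            results[group] = {"status": "requires_review", "error": str(exc)}
    return results


def fake_extract(document: str, fields: list[str]) -> dict[str, str]:
    # Stand-in for a model call: the totals are ambiguous in this document.
    if "invoice_total" in fields:
        raise ValueError("two candidate totals found")
    return {name: f"<{name}>" for name in fields}


outcome = extract_by_group("invoice text", fake_extract)
```

Here the ambiguous totals are routed to review while the parties and references still come back as supported output.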
If retrieval across a full document cannot locate the relevant section, narrow the search to a page range, heading, table, or source document. A failed full-document extraction does not always mean the answer is missing. It may mean the search surface was too wide.
Partial output is a feature, not a failure
A system that returns three supported fields and clearly marks two unresolved fields is more trustworthy than one that returns all five with quiet uncertainty.
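from typing import Literal

from pydantic import BaseModel


class FieldResult(BaseModel):
    # The extracted value, or None when nothing supported was found.
    value: str | None = None
    # Why the field is in its current state.
    status: Literal[
        "success",
        "missing_evidence",
        "schema_failure",
        "unsupported",
        "requires_review",
    ]
    # Pointers to the source evidence behind the value.
    evidence_refs: list[str] = []
    error_message: str | None = None


class InvoiceResult(BaseModel):
    # Every field carries its own status, so partial results stay explicit.
    supplier_name: FieldResult
    invoice_total: FieldResult
    tax_amount: FieldResult
    purchase_order: FieldResult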
A missing field could mean not found, failed validation, unsupported, outside the attempted scope, or pending review. Wrap fields with status, evidence, and error metadata so partial output survives serialization and reaches the user safely.
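As a rough sketch of how a consumer might use that metadata, the helper below separates supported values from unresolved ones. It assumes the result has already been serialized to plain dictionaries with the same `value` and `status` keys as above; `split_by_status` and the sample data are hypothetical.

```python
def split_by_status(
    result: dict[str, dict],
) -> tuple[dict[str, str], dict[str, str]]:
    """Return (supported values, unresolved field -> status)."""
    supported: dict[str, str] = {}
    unresolved: dict[str, str] = {}
    for name, field_result in result.items():
        if field_result["status"] == "success":
            supported[name] = field_result["value"]
        else:
            # Keep the reason so the user sees why the field is empty.
            unresolved[name] = field_result["status"]
    return supported, unresolved


# Illustrative serialized result.
serialized = {
    "supplier_name": {"value": "Acme Ltd", "status": "success"},
    "invoice_total": {"value": None, "status": "requires_review"},
    "tax_amount": {"value": None, "status": "missing_evidence"},
    "purchase_order": {"value": "PO-1", "status": "success"},
}
supported, unresolved = split_by_status(serialized)
```

The user-facing layer can then render the supported fields normally and show each unresolved field with its reason instead of a blank.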
Human review is a designed step, not a safety net
Review should not be where difficult cases disappear. It should be where the workflow exposes enough information for a human to decide and enough structure for the system to improve.
The reviewer sees the original input, parsed artifacts, model output, evidence, validation errors, and route reason.
The reviewer can accept, correct a field, reject the result, escalate, or mark the request unsupported.
The outcome becomes an evaluation fixture, routing signal, prompt improvement candidate, or schema update.
A review queue with structured outcomes is a quality loop. If reviewers clear cases but the system sees no usable signal, next week's queue will look the same.
Retries are not free
A retry can improve output quality. It can also increase cost, latency, and instability. Invisible retries make it impossible to tell whether the workflow is robust or simply trying until it gets lucky.
Record the triggering error, prompt version, model version, attempt count, what changed between attempts, what the output difference was, and whether the final result was accepted.
For HTTP 429s, capacity spikes, and network timeouts, immediate retries from concurrent jobs can make the outage worse. Use exponential backoff with randomized jitter, respect retry headers when available, and cap concurrency so recovery behavior does not create a traffic spike.
Common mistakes to avoid:
- Retry every failure without distinguishing schema errors, evidence gaps, conflicts, provider errors, and unsupported requests.
- Hide fallback decisions inside prompt instructions instead of workflow policy.
- Send cases to review without source context, validation errors, and clear reviewer actions.
- Use a stronger model as a substitute for product and risk policy.
- Return a confident final answer when required evidence is missing.
- Drop fallback events instead of turning them into traces and evaluation examples.
What to test before you ship
Fallback logic should be testable the same way business logic is testable: with fixtures, not hope.
| Need | Implementation options | What to evaluate |
|---|---|---|
| Retry policy | Validation-error retries, max attempt counts, retry reasons, stored attempt history, and exponential backoff with jitter for provider failures. | Whether retries fix known repairable failures without hiding instability. |
| Failure routing | Route tables based on failure type, task, evidence requirement, user role, impact level, and workflow state. | Whether each failure type takes the intended path. |
| Partial output | Field-level status, unsupported reasons, validation errors, and source evidence per accepted claim. | Whether users can distinguish supported output from missing, failed, or uncertain output. |
| Review queue | Review reasons, priority, reviewer actions, correction capture, escalation states, and structured outcomes. | Whether review clears work and improves the future system. |
| Fallback evaluation | Fixtures for invalid output, missing evidence, conflicts, unsupported requests, provider failures, and review cases. | Whether fallback behavior can be regression-tested before release. |
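The fixture approach can be sketched as a table-driven check: each fixture pairs a known bad or good input with the failure type the workflow is expected to detect. The `classify` function, its labels, and the fixture contents are hypothetical and deliberately minimal.

```python
import json


def classify(raw_output: str, required: list[str]) -> str:
    """Label a raw model output as ok or as a specific failure type."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return "invalid_structured_output"
    if not isinstance(data, dict):
        return "invalid_structured_output"
    missing = [name for name in required if data.get(name) is None]
    if missing:
        return "missing_required_field"
    return "ok"


# Each fixture: (name, raw output, expected classification).
FIXTURES = [
    ("malformed json", "{not json", "invalid_structured_output"),
    ("top-level list", "[1, 2]", "invalid_structured_output"),
    ("null total", '{"invoice_total": null}', "missing_required_field"),
    ("complete", '{"invoice_total": "1200.00"}', "ok"),
]


def run_fixtures() -> list[str]:
    failures = []
    for name, raw, expected in FIXTURES:
        actual = classify(raw, ["invoice_total"])
        if actual != expected:
            failures.append(f"{name}: expected {expected}, got {actual}")
    return failures
```

An empty list from `run_fixtures()` means every known failure still takes its intended label. New review cases can be appended to `FIXTURES` as they are resolved.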
Where this shows up
Fallback design is where AI system behavior becomes product behavior.
PolicyTrace surfaces conflicts, provenance, and review state so a production version can route uncertain fields, unsupported claims, and source disagreements instead of forcing a final answer.
A contract workflow would need fallbacks for missing amendments, conflicting clauses, ambiguous obligations, unsupported risk flags, and reviewer escalation.
An invoice workflow would need fallbacks for missing POs, mismatched totals, uncertain suppliers, duplicate invoices, tax ambiguity, and exception queues.
The practical takeaway
AI systems become trustworthy when failure is a designed path, not a surprising event.
Well-designed fallbacks are conditional, bounded, observable, and learnable. Conditional means matched to the failure type. Bounded means retries, review, and stop conditions have limits. Observable means every fallback emits a trace. Learnable means review outcomes, retry records, and unsupported request logs feed evaluation, routing, and schema improvements.
Users can tell where the boundary is. Reviewers have what they need. Engineers can see what is actually happening. That is what makes fallback design part of the system, not a cleanup task after launch.
Fallbacks only work if the system can trace the failure that triggered them. Tracing is the subject of the next Runtime & Operations layer.