How to Trace a Failed LLM Run
When an AI workflow fails, the final answer is not enough. You need the input, route, prompt, model, context, schema errors, evidence gaps, review action, and eval case.
The product manager sees a bad output. The engineer sees a model response. The data scientist asks what context was sent. The reviewer says the source document did contain the right value. The support ticket has a screenshot, but no run ID.
This is what happens when an AI workflow has output but no trace. The team has a symptom, not an explanation.
To debug it, you need to reconstruct the input, parsing, routing, prompt, model, context, structured output, validation, evidence, fallback, review action, and final state.
Why final-answer debugging fails
If the only artifact is the final answer, every investigation turns into speculation. The model may not be the real failure point.
- The upload, parser, OCR, extraction text, or user request may have lost the fact before the model saw it.
- The workflow may have chosen the wrong task, prompt, model policy, evidence requirement, or fallback path.
- Retrieval, chunking, filtering, page limits, or masking may have removed the needed source material.
- The model may have done what it was asked, while the product policy failed to say when to stop.
The trace map
A useful trace follows the workflow boundary by boundary. It should show what entered each step, what came out, and what decision moved the run forward.
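One way to make those boundaries concrete is to record each step as a small entry: what went in, what came out, and what decision moved the run forward. The sketch below uses only the standard library. `StepRecord`, `RunTrace`, and the `ref://` strings are hypothetical names chosen for illustration, not part of any tracing library.

```python
from __future__ import annotations

import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class StepRecord:
    name: str
    input_ref: str                  # what entered the step
    output_ref: str | None = None   # what came out
    decision: str | None = None     # why the run moved forward
    status: str = "started"
    error_type: str | None = None
    latency_ms: float | None = None


@dataclass
class RunTrace:
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    steps: list[StepRecord] = field(default_factory=list)

    @contextmanager
    def step(self, name: str, input_ref: str):
        """Record one workflow boundary, including failures and latency."""
        record = StepRecord(name=name, input_ref=input_ref)
        self.steps.append(record)
        started = time.perf_counter()
        try:
            yield record
            record.status = "ok"
        except Exception as exc:
            record.status = "error"
            record.error_type = type(exc).__name__
            raise
        finally:
            record.latency_ms = (time.perf_counter() - started) * 1000

    def first_failure(self) -> StepRecord | None:
        """Return the earliest step that raised, if any."""
        return next((s for s in self.steps if s.status == "error"), None)


trace = RunTrace()

with trace.step("parse", input_ref="ref://uploads/doc-1") as s:
    s.output_ref = "ref://parsed/doc-1"
    s.decision = "text layer found, OCR skipped"

try:
    with trace.step("validate", input_ref="ref://raw-output/1") as s:
        raise ValueError("missing required field: total")
except ValueError:
    pass  # the failure is already captured on the step record
```

Calling `trace.first_failure()` then points at the boundary where the run broke, which is usually a better starting point than the final answer.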
The minimum trace record
The trace does not need to be fancy at first. It needs to be complete enough to answer what changed, what the model saw, why the workflow took that route, and what happened after validation.
```python
from pydantic import BaseModel


class LlmRunTrace(BaseModel):
    # Identity and input (the `str | None` syntax needs Python 3.10+)
    run_id: str
    input_ref: str
    parsed_artifact_ref: str | None = None

    # Routing decision
    route: str
    route_reason: str | None = None

    # Prompt and model anchors
    prompt_id: str
    prompt_version: str
    model_provider: str
    model_name: str
    model_policy: str
    temperature: float | None = None
    top_p: float | None = None
    system_fingerprint: str | None = None

    # What the model saw and what it returned
    context_refs: list[str] = []
    raw_output_ref: str

    # What happened after the model call
    validation_errors: list[str] = []
    evidence_refs: list[str] = []
    fallback_taken: str | None = None
    review_outcome: str | None = None
    eval_case_id: str | None = None
```
This maps cleanly to observability standards
A custom trace object does not mean building a custom observability universe. The same structure can map onto OpenTelemetry traces and LLM-oriented conventions such as OpenInference.
| Trace concept | Observability mapping | Why it helps |
|---|---|---|
| Run ID | Global trace ID for the user-facing workflow. | Support, engineering, review, and evaluation can discuss the same incident. |
| Parse, route, call, validate | Parent-child spans with attributes for status, latency, cost, model, prompt, and schema versions. | Failures can be isolated to a workflow boundary instead of blamed on the final answer. |
| Artifact refs | Span attributes or linked records pointing to stored artifacts rather than raw sensitive content. | Logs stay searchable without copying full prompts, documents, or PII into every event. |
| Review and eval IDs | Linked attributes that connect production incidents to review outcomes and regression cases. | The trace can flow into Datadog, Honeycomb, OpenSearch, or another pipeline while still feeding the evaluation loop. |
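As a rough illustration of the "artifact refs" row, the sketch below flattens trace fields into a span-attribute dictionary and refuses values that look like raw content. The `app.` key prefix, the `MAX_ATTR_CHARS` cap, and the field names are assumptions made for this example; they are not OpenTelemetry or OpenInference convention names, so a real integration should map them to whichever convention the team adopts.

```python
from __future__ import annotations

# Assumed cap: a very long string is more likely raw content than a reference.
MAX_ATTR_CHARS = 256


def to_span_attributes(run_id: str, step: str, fields: dict) -> dict:
    """Build flat, primitive-valued attributes for one workflow step."""
    attrs = {"app.run_id": run_id, "app.step": step}
    for key, value in fields.items():
        if value is None:
            continue  # skip unset fields rather than logging empty values
        if isinstance(value, str) and len(value) > MAX_ATTR_CHARS:
            raise ValueError(f"{key} looks like raw content; store it and pass a ref")
        if isinstance(value, (list, tuple)):
            value = [str(v) for v in value]
        attrs[f"app.{step}.{key}"] = value
    return attrs


attrs = to_span_attributes(
    run_id="run-123",
    step="model_call",
    fields={
        "prompt_id": "extract_totals",
        "prompt_version": "7",
        "model_name": "example-model",
        "temperature": 0.0,
        "system_fingerprint": None,
        "context_refs": ["ref://chunks/12", "ref://chunks/13"],
    },
)
```

With the OpenTelemetry Python API, a dictionary like this could be applied to a span one key at a time via `span.set_attribute(key, value)`.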
Start with the failure question
A trace is useful when it lets the team ask sharper questions. The goal is not collecting every possible field. The goal is reducing ambiguity during debugging.
| Question | Trace evidence | Likely fix path |
|---|---|---|
| Did the model see the right source? | Parsed artifact, retrieved chunks, document refs, page refs, masking logs. | Parser, chunking, retrieval, context selection, or masking fix. |
| Did the workflow choose the right route? | Classification output, route decision, route reason, risk policy, branch taken. | Router fixture, route rule, classifier prompt, or task boundary fix. |
| Did the model produce the wrong shape? | Raw output, schema version, validation errors, retry history. | Schema, structured output, retry policy, or prompt contract fix. |
| Was the answer unsupported? | Evidence refs, provenance scores, missing source fields, unsupported claims. | Evidence requirement, provenance matching, review policy, or safe-stop fix. |
| Did review correct the system? | Reviewer action, override value, reviewer note, final state, eval case ID. | Eval fixture, UI change, policy update, or reviewer training signal. |
| Is behavior drifting even when schemas still pass? | Distribution of route labels, confidence scores, refusal rates, fallback rates, and reviewer overrides over time. | Semantic drift alert, model policy review, prompt comparison, or evaluation refresh. |
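The per-run questions in the table can double as a completeness check: if a trace lacks the evidence a question needs, that question cannot be answered during an incident. The sketch below is a minimal version of that check. The field names follow the trace record shown earlier where possible; `schema_version` and the `TRIAGE` structure are assumptions for illustration. The drift question is left out because it concerns many runs, not one.

```python
from __future__ import annotations

# (question, trace fields that would answer it, likely fix path)
TRIAGE = [
    ("Did the model see the right source?",
     ["parsed_artifact_ref", "context_refs"],
     "parser, chunking, retrieval, context selection, or masking"),
    ("Did the workflow choose the right route?",
     ["route", "route_reason"],
     "router fixture, route rule, classifier prompt, or task boundary"),
    ("Did the model produce the wrong shape?",
     ["raw_output_ref", "schema_version"],
     "schema, structured output, retry policy, or prompt contract"),
    ("Was the answer unsupported?",
     ["evidence_refs"],
     "evidence requirement, provenance matching, review policy, or safe-stop"),
    ("Did review correct the system?",
     ["review_outcome", "eval_case_id"],
     "eval fixture, UI change, policy update, or reviewer training"),
]


def unanswerable_questions(trace: dict) -> list[tuple[str, list[str]]]:
    """Return each question the trace cannot answer, with the missing fields.

    A field counts as missing when it is absent, None, or empty.
    """
    gaps = []
    for question, fields, _fix_path in TRIAGE:
        missing = [f for f in fields if not trace.get(f)]
        if missing:
            gaps.append((question, missing))
    return gaps


trace = {
    "parsed_artifact_ref": "ref://parsed/doc-1",
    "context_refs": ["ref://chunks/12"],
    "route": "invoice_extraction",
    "route_reason": None,
    "raw_output_ref": "ref://raw-output/1",
    "schema_version": "3",
    "evidence_refs": [],
    "review_outcome": "overridden",
    "eval_case_id": None,
}

gaps = unanswerable_questions(trace)
```

Here the trace can show what the model saw and what shape it returned, but it cannot explain the route, support the answer, or link the review to an eval case.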
Trace semantic drift before it becomes an outage
Some regressions do not show up as schema failures. The model can keep returning valid JSON while its judgment changes: different classification tags, lower confidence, more hedging, more refusals, or more reviewer overrides.
Track distributions over time for categorical outputs, confidence bands, evidence quality, fallback reasons, and review outcomes. A quiet shift in those distributions can reveal provider behavior changes, prompt drift, input drift, or routing problems before users report obvious failures.
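A simple way to quantify such a shift for categorical outputs is to compare label distributions between a baseline window and a recent window. The sketch below uses total variation distance, which is one reasonable choice among several. The labels, window sizes, and the `DRIFT_THRESHOLD` value are made up for illustration and would need tuning against real traffic.

```python
from collections import Counter


def distribution(labels):
    """Convert a list of labels into label -> share of total."""
    if not labels:
        raise ValueError("cannot build a distribution from an empty window")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(baseline, recent):
    """Half the sum of absolute differences between the two distributions.

    0.0 means identical label shares; 1.0 means no overlap at all.
    """
    p, q = distribution(baseline), distribution(recent)
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in labels)


# Hypothetical route labels from two time windows.
baseline = ["invoice"] * 80 + ["contract"] * 15 + ["needs_review"] * 5
recent = ["invoice"] * 60 + ["contract"] * 15 + ["needs_review"] * 25

DRIFT_THRESHOLD = 0.1  # assumed alert level, not a recommended default
drift = total_variation(baseline, recent)
alert = drift > DRIFT_THRESHOLD
```

Every output in both windows could pass schema validation, yet a fifth of the traffic has moved from `invoice` to `needs_review`, which is the kind of quiet shift worth alerting on.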
Prompt and model versions are not optional
If you cannot tell which prompt and model produced the output, you cannot know whether a failure still exists. You can only rerun the current system and hope it behaves the same way.
Store prompt IDs, prompt versions, model names, model policy, parser version, schema version, and route decision. Exact replay may not always be possible, but useful diagnosis is impossible without these anchors.
Record temperature, top_p, seed when the provider supports it, and provider-returned system fingerprints when available. Those values do not guarantee perfect replay, but they help separate random sampling variance, provider-side routing changes, and application logic bugs.
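With those anchors stored, the first diagnostic step can be a plain diff between the failed run and the current configuration. The sketch below assumes both are available as dictionaries. The `ANCHOR_FIELDS` tuple and the example values are illustrative, and fields such as `seed` or `system_fingerprint` will simply be absent for providers that do not expose them.

```python
from __future__ import annotations

ANCHOR_FIELDS = (
    "prompt_id", "prompt_version",
    "model_provider", "model_name", "model_policy",
    "parser_version", "schema_version", "route",
    "temperature", "top_p", "seed", "system_fingerprint",
)


def changed_anchors(failed_run: dict, current: dict) -> dict:
    """Map each anchor that differs to a (failed_run, current) pair of values."""
    return {
        name: (failed_run.get(name), current.get(name))
        for name in ANCHOR_FIELDS
        if failed_run.get(name) != current.get(name)
    }


failed_run = {
    "prompt_id": "extract_totals", "prompt_version": "6",
    "model_provider": "example-provider", "model_name": "example-model",
    "temperature": 0.0, "system_fingerprint": "fp_a",
}
current = {
    "prompt_id": "extract_totals", "prompt_version": "7",
    "model_provider": "example-provider", "model_name": "example-model",
    "temperature": 0.0, "system_fingerprint": "fp_b",
}

diff = changed_anchors(failed_run, current)
```

An empty diff suggests looking at sampling variance or input differences first. A non-empty diff, as in this example, means a rerun of the current system is not testing the configuration that failed.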
Evidence gaps are trace events
A missing citation, weak provenance match, unsupported field, or source conflict is not just a quality issue. It is a workflow event that should change the next step.
- Missing citation or unsupported field: the value may be correct, but the workflow cannot support it. Route to search, retry, partial output, or review.
- Weak provenance match: the source match exists but confidence is low. Show it to review instead of hiding it behind the final answer.
- Source conflict: multiple sources disagree. Preserve the conflict, apply authority rules, or escalate the decision.
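The cases above can be sketched as a small decision function that turns the evidence attached to a value into a next step. The evidence shape (`ref`, `value`, `score`), the step names, and the `min_score` threshold are all assumptions for illustration. A real workflow would apply its own authority rules before escalating a conflict.

```python
from __future__ import annotations


def next_step_for_evidence(evidence: list[dict], min_score: float = 0.8) -> tuple[str, str]:
    """Return (next_step, reason) for one extracted value.

    Each evidence item is assumed to look like:
    {"ref": "ref://...", "value": "...", "score": 0.0-1.0}
    """
    if not evidence:
        return ("retry_or_review", "no evidence ref supports the value")

    distinct_values = {item["value"] for item in evidence}
    if len(distinct_values) > 1:
        return ("escalate", "sources disagree; conflict preserved for a decision")

    if max(item["score"] for item in evidence) < min_score:
        return ("review", "source match exists but confidence is low")

    return ("accept", "value supported by evidence")


missing = next_step_for_evidence([])
weak = next_step_for_evidence([{"ref": "ref://p3", "value": "100", "score": 0.4}])
conflict = next_step_for_evidence([
    {"ref": "ref://p3", "value": "100", "score": 0.95},
    {"ref": "ref://p7", "value": "120", "score": 0.90},
])
supported = next_step_for_evidence([{"ref": "ref://p3", "value": "100", "score": 0.95}])
```

Recording the returned reason on the trace keeps the evidence gap visible as an event instead of burying it inside the final answer.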
Turn failures into eval examples
A trace should not end with a bug fix. The best trace records become regression examples so the system does not forget the lesson.
Attach the run ID to an eval case with the input, expected route, expected output shape, evidence requirement, fallback expectation, and review outcome. That turns an incident into a future release gate.
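A minimal sketch of that conversion follows, assuming the trace and the reviewer's expectations are both available as dictionaries. The function names, the `eval-` ID prefix, and the specific checks are hypothetical, and a production gate would likely compare output values as well as shape.

```python
from __future__ import annotations


def eval_case_from_trace(trace: dict, expected: dict) -> dict:
    """Build a regression case that stays linked to the original run."""
    return {
        "eval_case_id": f"eval-{trace['run_id']}",
        "source_run_id": trace["run_id"],
        "input_ref": trace["input_ref"],
        "expected_route": expected["route"],
        "expected_schema_version": expected["schema_version"],
        "evidence_required": expected["evidence_required"],
        "expected_fallback": expected.get("fallback"),
        "review_outcome": trace.get("review_outcome"),
    }


def gate_failures(case: dict, rerun: dict) -> list[str]:
    """List which expectations a rerun of the case fails to meet."""
    failures = []
    if rerun.get("route") != case["expected_route"]:
        failures.append("route")
    if (rerun.get("schema_version") != case["expected_schema_version"]
            or rerun.get("validation_errors")):
        failures.append("output_shape")
    if case["evidence_required"] and not rerun.get("evidence_refs"):
        failures.append("evidence")
    if rerun.get("fallback_taken") != case["expected_fallback"]:
        failures.append("fallback")
    return failures


failed_trace = {
    "run_id": "run-123", "input_ref": "ref://uploads/doc-1",
    "route": "contract_review", "schema_version": "3",
    "validation_errors": [], "evidence_refs": [],
    "fallback_taken": None, "review_outcome": "overridden",
}
expected = {"route": "invoice_extraction", "schema_version": "3",
            "evidence_required": True, "fallback": None}

case = eval_case_from_trace(failed_trace, expected)
before_fix = gate_failures(case, failed_trace)

fixed_rerun = dict(failed_trace, route="invoice_extraction",
                   evidence_refs=["ref://chunks/12"])
after_fix = gate_failures(case, fixed_rerun)
```

Running the gate against the original trace should fail for the same reasons the incident did, which confirms the case actually captures the failure before it is trusted as a release check.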
Tracing mistakes to avoid:
- Save only the final answer and raw model response.
- Drop prompt version, model version, parser version, schema version, or route decision.
- Debug a failed answer without checking whether the model saw the right input and context.
- Treat validation errors as temporary noise rather than workflow signals.
- Let reviewer corrections stay outside the trace.
- Fix production failures without adding regression examples.
Implementation options to test
Tracing can start with structured run records and artifact references. The point is to preserve enough data to explain a run without leaking sensitive content into every log line.
| Need | Implementation options | What to evaluate |
|---|---|---|
| Run identity | Generate a run ID and pass it through parsing, routing, model calls, validation, review, and eval. | Whether support, engineering, and review can refer to the same incident. |
| Artifact references | Store refs to source files, parsed text, chunks, prompts, raw outputs, evidence records, and review actions. | Whether the team can reconstruct a run without duplicating sensitive data in logs. |
| Step status | Record start/end status, error type, retry count, fallback taken, latency, and cost per step. | Whether failures can be isolated to a workflow boundary. |
| Version anchors | Record prompt version, model provider/name, temperature, top_p, system fingerprint when available, parser version, schema version, and route policy version. | Whether regressions can be tied to a concrete change. |
| Observability integration | Represent the workflow as OpenTelemetry-style traces and spans, with LLM-specific attributes where useful. | Whether AI traces can join existing production observability pipelines instead of living in a separate debugging silo. |
| Drift monitoring | Track distributions for route labels, confidence bands, fallback reasons, evidence quality, and review outcomes. | Whether behavior shifts are visible before they become support tickets or quality incidents. |
| Learning loop | Link failed runs to eval fixtures, reviewer corrections, and release-gate outcomes. | Whether tracing improves the system rather than only explaining the past. |
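For the run identity row, one lightweight option in Python is to hold the run ID in a context variable so every step can stamp its events without threading the ID through each function signature. The sketch below is illustrative only: the `EVENTS` list stands in for a real log or trace sink, and the step functions and `ref://` values are hypothetical.

```python
import contextvars
import json
import uuid

_current_run_id = contextvars.ContextVar("run_id", default=None)
EVENTS = []  # stand-in for a log pipeline or trace exporter


def start_run() -> str:
    """Generate a run ID and make it visible to every later step."""
    run_id = uuid.uuid4().hex
    _current_run_id.set(run_id)
    return run_id


def emit(step: str, status: str, **fields) -> str:
    """Record one step event stamped with the current run ID."""
    run_id = _current_run_id.get()
    if run_id is None:
        raise RuntimeError("emit() called outside a run; call start_run() first")
    event = {"run_id": run_id, "step": step, "status": status, **fields}
    EVENTS.append(event)
    return json.dumps(event, sort_keys=True)


# Hypothetical steps: note that none of them takes run_id as an argument.
def parse(input_ref: str) -> None:
    emit("parse", "ok", input_ref=input_ref, output_ref="ref://parsed/doc-1")


def call_model() -> None:
    emit("model_call", "ok", prompt_version="7", raw_output_ref="ref://raw-output/1")


def validate() -> None:
    emit("validate", "error", error_type="schema_mismatch", retry_count=1)


run_id = start_run()
parse("ref://uploads/doc-1")
call_model()
validate()
```

Events carry references and status fields only, so the same run ID can appear in support tickets, logs, and review tools without copying document content into each of them.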
Where this shows up
Tracing matters anywhere a model call sits inside a larger workflow.
PolicyTrace's parsing, masking, classification, specialist extraction, arbitration, provenance, and review steps give a natural trace shape for debugging where a document AI result went wrong.
A contract workflow would need traces across clause routing, amendment context, retrieved sections, risk flags, evidence references, and reviewer decisions.
An invoice workflow would need traces across supplier matching, line-item extraction, tax checks, PO matching, exception reasons, and review outcomes.
The practical takeaway
Tracing turns a bad answer into an inspectable chain of causes. Without it, teams debug production AI by folklore.
An inspectable chain is what lets a team fix the right layer, design better fallbacks, and add the right evaluation case before the next release.
Once runs can be traced, the next question is whether every task should use the same model, context, and cost profile. That opens the Efficiency Layer.