How to Trace a Failed LLM Run

When an AI workflow fails, the final answer is not enough. You need the input, route, prompt, model, context, schema errors, evidence gaps, review action, and eval case.

The answer was wrong. Now everyone is guessing why.

The product manager sees a bad output. The engineer sees a model response. The data scientist asks what context was sent. The reviewer says the source document did contain the right value. The support ticket has a screenshot, but no run ID.

This is what happens when an AI workflow has output but no trace. The team has a symptom, not an explanation.

A failed LLM run is not one event. It is a chain of decisions.

To debug it, you need to reconstruct the input, parsing, routing, prompt, model, context, structured output, validation, evidence, fallback, review action, and final state.

Why final-answer debugging fails

If the only artifact is the final answer, every investigation turns into speculation. The model may not be the real failure point.

1. The input may be broken

The upload, parser, OCR, extracted text, or user request may have lost the fact before the model saw it.

2. The route may be wrong

The workflow may have chosen the wrong task, prompt, model policy, evidence requirement, or fallback path.

3. The context may be missing

Retrieval, chunking, filtering, page limits, or masking may have removed the needed source material.

4. The policy may be vague

The model may have done what it was asked, while the product policy failed to say when to stop.

The trace map

A useful trace follows the workflow boundary by boundary. It should show what entered each step, what came out, and what decision moved the run forward.

| Layer | Step | Artifact | What to record |
| --- | --- | --- | --- |
| Input | Source | User input | File, message, request, role, timestamp. |
| Input | Parse | Parsed artifact | Text, pages, tables, chunks, OCR quality. |
| Input | Mask | Transformations | PII masking, redaction, normalization. |
| Workflow | Route | Decision | Task type, branch, risk level, model policy. |
| Workflow | Prompt | Instruction | Prompt ID, version, variables, examples. |
| Workflow | Context | Evidence sent | Retrieved chunks, documents, page refs. |
| Model | Call | Model run | Provider, model, parameters, latency, cost. |
| Model | Output | Raw response | Text, tool call, JSON, refusal, error. |
| Model | Validate | Acceptance gate | Schema errors, business rules, retry reason. |
| Trust | Evidence | Support check | Source refs, provenance, unsupported fields. |
| Trust | Review | Human action | Accept, override, reject, flag, escalate. |
| Trust | Eval | Learning record | Fixture, expected output, regression signal. |
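
One way to capture what entered each step, what came out, and what decision moved the run forward is a small per-step recorder. The sketch below is a minimal illustration using only the standard library; `StepRecord`, `RunRecorder`, and the reference strings such as `upload/123` are hypothetical names, not part of any existing tool.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StepRecord:
    """One boundary in the trace map: input, output, decision, and status."""
    step: str
    input_ref: str
    output_ref: Optional[str] = None
    decision: Optional[str] = None
    status: str = "started"
    error_type: Optional[str] = None
    latency_ms: Optional[float] = None


@dataclass
class RunRecorder:
    """Collects step records under a single run ID (hypothetical helper)."""
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    steps: List[StepRecord] = field(default_factory=list)

    @contextmanager
    def step(self, name: str, input_ref: str):
        record = StepRecord(step=name, input_ref=input_ref)
        self.steps.append(record)
        started = time.perf_counter()
        try:
            yield record  # the caller fills in output_ref and decision
            record.status = "ok"
        except Exception as exc:
            record.status = "error"
            record.error_type = type(exc).__name__
            raise
        finally:
            record.latency_ms = (time.perf_counter() - started) * 1000


recorder = RunRecorder()
with recorder.step("parse", input_ref="upload/123") as parse:
    parse.output_ref = "parsed/123"
with recorder.step("route", input_ref="parsed/123") as route:
    route.decision = "invoice_extraction"
```

Because a failing step is recorded before the exception propagates, the trace still shows which boundary broke even when the run never reaches a final answer.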

The minimum trace record

The trace does not need to be fancy at first. It needs to be complete enough to answer what changed, what the model saw, why the workflow took that route, and what happened after validation.

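from pydantic import BaseModel


# Note: some Pydantic v2 releases warn about field names starting with "model_";
# setting model_config = ConfigDict(protected_namespaces=()) silences that warning.
class LlmRunTrace(BaseModel):
    # Identity and input
    run_id: str
    input_ref: str
    parsed_artifact_ref: str | None = None
    parser_version: str | None = None

    # Routing and prompt
    route: str
    route_reason: str | None = None
    prompt_id: str
    prompt_version: str

    # Model call and sampling parameters
    model_provider: str
    model_name: str
    model_policy: str
    temperature: float | None = None
    top_p: float | None = None
    seed: int | None = None  # only when the provider supports it
    system_fingerprint: str | None = None  # provider-returned, when available

    # Context, output, and validation (references, not raw content)
    context_refs: list[str] = []
    raw_output_ref: str
    schema_version: str | None = None
    validation_errors: list[str] = []

    # Evidence, fallback, review, and eval
    evidence_refs: list[str] = []
    fallback_taken: str | None = None
    review_outcome: str | None = None
    eval_case_id: str | None = None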

This maps cleanly to observability standards

A custom trace object does not mean building a custom observability universe. The same structure can map onto OpenTelemetry traces and LLM-oriented conventions such as OpenInference.

| Trace concept | Observability mapping | Why it helps |
| --- | --- | --- |
| Run ID | Global trace ID for the user-facing workflow. | Support, engineering, review, and evaluation can discuss the same incident. |
| Parse, route, call, validate | Parent-child spans with attributes for status, latency, cost, model, prompt, and schema versions. | Failures can be isolated to a workflow boundary instead of blamed on the final answer. |
| Artifact refs | Span attributes or linked records pointing to stored artifacts rather than raw sensitive content. | Logs stay searchable without copying full prompts, documents, or PII into every event. |
| Review and eval IDs | Linked attributes that connect production incidents to review outcomes and regression cases. | The trace can flow into Datadog, Honeycomb, OpenSearch, or another pipeline while still feeding the evaluation loop. |
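
As a rough illustration of the mapping, a trace record can be flattened into span-style attributes before it is handed to whatever tracing client the team already uses. This sketch deliberately avoids any real SDK; the `llm_run.` prefix and the key names are placeholders chosen for this example, not official OpenTelemetry or OpenInference attribute names.

```python
def to_span_attributes(trace: dict) -> dict:
    """Flatten a trace record into span-style attributes.

    The "llm_run." prefix is a placeholder namespace for illustration only.
    Values are expected to be references and short labels, not raw content.
    """
    attributes = {}
    for key, value in trace.items():
        if value is None or value == []:
            continue  # unset fields are omitted rather than sent as empty values
        if isinstance(value, list):
            value = [str(item) for item in value]  # keep sequences homogeneous
        attributes["llm_run." + key] = value
    return attributes


trace = {
    "run_id": "run-001",
    "route": "invoice_extraction",
    "prompt_version": "v3",
    "context_refs": ["chunk/12", "chunk/40"],
    "raw_output_ref": "artifact/raw/run-001",
    "review_outcome": None,
    "validation_errors": [],
}
attrs = to_span_attributes(trace)
```

Only references such as `raw_output_ref` end up on the span, so the attributes stay searchable without carrying prompt or document text.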

Start with the failure question

A trace is useful when it lets the team ask sharper questions. The goal is not collecting every possible field. The goal is reducing ambiguity during debugging.

| Question | Trace evidence | Likely fix path |
| --- | --- | --- |
| Did the model see the right source? | Parsed artifact, retrieved chunks, document refs, page refs, masking logs. | Parser, chunking, retrieval, context selection, or masking fix. |
| Did the workflow choose the right route? | Classification output, route decision, route reason, risk policy, branch taken. | Router fixture, route rule, classifier prompt, or task boundary fix. |
| Did the model produce the wrong shape? | Raw output, schema version, validation errors, retry history. | Schema, structured output, retry policy, or prompt contract fix. |
| Was the answer unsupported? | Evidence refs, provenance scores, missing source fields, unsupported claims. | Evidence requirement, provenance matching, review policy, or safe-stop fix. |
| Did review correct the system? | Reviewer action, override value, reviewer note, final state, eval case ID. | Eval fixture, UI change, policy update, or reviewer training signal. |
| Is behavior drifting even when schemas still pass? | Distribution of route labels, confidence scores, refusal rates, fallback rates, and reviewer overrides over time. | Semantic drift alert, model policy review, prompt comparison, or evaluation refresh. |
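
The questions above can be asked in order against a stored trace. The following is a simplified sketch: it treats the trace as a plain dict with the field names from the minimum trace record, and it assumes a hypothetical `expected_route` value supplied by a reviewer or an eval fixture.

```python
def first_failing_boundary(trace: dict) -> str:
    """Walk the failure questions in order and name the first suspect boundary."""
    if not trace.get("context_refs"):
        return "context: the model saw no source material"
    expected_route = trace.get("expected_route")  # hypothetical reviewer-supplied field
    if expected_route and trace.get("route") != expected_route:
        return "route: the workflow took an unexpected branch"
    if trace.get("validation_errors"):
        return "shape: the output failed validation"
    if not trace.get("evidence_refs"):
        return "evidence: the answer is unsupported"
    if trace.get("review_outcome") in {"override", "reject"}:
        return "review: a human corrected the result"
    return "no obvious failure: check drift metrics"


failed_run = {
    "route": "general_qa",
    "expected_route": "policy_extraction",
    "context_refs": ["chunk/12"],
    "validation_errors": [],
    "evidence_refs": ["doc-1#p3"],
}
suspect = first_failing_boundary(failed_run)
```

A real triage would look at more evidence per question, but even this ordering keeps the investigation from starting at the model when the context or route was already wrong.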

Trace semantic drift before it becomes an outage

Some regressions do not show up as schema failures. The model can keep returning valid JSON while its judgment changes: different classification tags, lower confidence, more hedging, more refusals, or more reviewer overrides.

Valid shape is not stable behavior.

Track distributions over time for categorical outputs, confidence bands, evidence quality, fallback reasons, and review outcomes. A quiet shift in those distributions can reveal provider behavior changes, prompt drift, input drift, or routing problems before users report obvious failures.
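
A simple way to watch a categorical output is to compare its distribution across two time windows. The sketch below uses total variation distance, which is one reasonable choice among several; the sample labels and the `DRIFT_THRESHOLD` value are made up for illustration and would need tuning against real traffic.

```python
from collections import Counter


def label_distribution(labels):
    """Return the share of each label, or an empty dict for an empty window."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


def total_variation_distance(baseline, current):
    """Half the sum of absolute differences between two label distributions."""
    p = label_distribution(baseline)
    q = label_distribution(current)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


# Route labels from two windows (made-up sample data).
baseline = ["claim"] * 80 + ["inquiry"] * 20
current = ["claim"] * 60 + ["inquiry"] * 30 + ["refusal"] * 10

DRIFT_THRESHOLD = 0.1  # assumed value, not a recommendation
drift = total_variation_distance(baseline, current)
drift_alert = drift > DRIFT_THRESHOLD
```

The same comparison applies to fallback reasons, confidence bands, or review outcomes, as long as each run records the label in its trace.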

Prompt and model versions are not optional

If you cannot tell which prompt and model produced the output, you cannot know whether a failure still exists. You can only rerun the current system and hope it behaves the same way.

Reproducibility starts with versioned behavior.

Store prompt IDs, prompt versions, model names, model policy, parser version, schema version, and route decision. Exact replay may not always be possible, but useful diagnosis is impossible without these anchors.

Replay is rarely fully deterministic.

Record temperature, top_p, seed when the provider supports it, and provider-returned system fingerprints when available. Those values do not guarantee perfect replay, but they help separate random sampling variance, provider-side routing changes, and application logic bugs.
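
With those anchors stored, the first debugging step can be a plain diff between the failed run and a rerun. This is a minimal sketch that assumes both traces are available as dicts; the `VERSION_ANCHORS` tuple simply lists the fields named above, and the sample values are invented.

```python
# Fields treated as version and sampling anchors (names follow the text above).
VERSION_ANCHORS = (
    "prompt_id", "prompt_version", "model_provider", "model_name",
    "model_policy", "parser_version", "schema_version",
    "temperature", "top_p", "seed", "system_fingerprint",
)


def changed_anchors(failed: dict, rerun: dict) -> dict:
    """Return {anchor: (failed_value, rerun_value)} for anchors that differ."""
    return {
        key: (failed.get(key), rerun.get(key))
        for key in VERSION_ANCHORS
        if failed.get(key) != rerun.get(key)
    }


failed = {"prompt_version": "v3", "model_name": "model-a", "system_fingerprint": "fp_a"}
rerun = {"prompt_version": "v4", "model_name": "model-a", "system_fingerprint": "fp_a"}
changes = changed_anchors(failed, rerun)
```

An empty diff with a different output points toward sampling variance or changed inputs; a non-empty diff names the concrete change to investigate first.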

Evidence gaps are trace events

A missing citation, weak provenance match, unsupported field, or source conflict is not just a quality issue. It is a workflow event that should change the next step.

Missing evidence

The value may be correct, but the workflow cannot support it. Route to search, retry, partial output, or review.

Weak evidence

The source match exists but confidence is low. Show it to review instead of hiding it behind the final answer.

Conflicting evidence

Multiple sources disagree. Preserve the conflict, apply authority rules, or escalate the decision.
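
The three cases above can be expressed as an explicit state that drives the next step. The sketch below is illustrative only: the `(source_ref, value, score)` tuple shape, the `min_score` threshold of 0.8, and the step names in `NEXT_STEP` are all assumptions made for this example.

```python
from enum import Enum


class EvidenceState(Enum):
    SUPPORTED = "supported"
    MISSING = "missing"
    WEAK = "weak"
    CONFLICTING = "conflicting"


# Assumed mapping from evidence state to the next workflow step.
NEXT_STEP = {
    EvidenceState.SUPPORTED: "continue",
    EvidenceState.MISSING: "search_or_review",
    EvidenceState.WEAK: "show_to_reviewer",
    EvidenceState.CONFLICTING: "apply_authority_rules_or_escalate",
}


def classify_evidence(matches, min_score=0.8):
    """Classify (source_ref, value, score) matches for one extracted field."""
    if not matches:
        return EvidenceState.MISSING
    if len({value for _, value, _ in matches}) > 1:
        return EvidenceState.CONFLICTING  # sources disagree on the value
    if max(score for _, _, score in matches) < min_score:
        return EvidenceState.WEAK
    return EvidenceState.SUPPORTED


state = classify_evidence([
    ("doc-1#p3", "2024-05-01", 0.93),
    ("doc-2#p1", "2024-06-01", 0.91),
])
event = {"type": "evidence_gap", "state": state.value, "next_step": NEXT_STEP[state]}
```

Writing the state into the trace as an event, rather than a score buried in the output, is what lets the workflow branch on it.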

Turn failures into eval examples

A trace should not end with a bug fix. The best trace records become regression examples so the system does not forget the lesson.

A production failure is a candidate golden example.

Attach the run ID to an eval case with the input, expected route, expected output shape, evidence requirement, fallback expectation, and review outcome. That turns an incident into a future release gate.
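
A minimal sketch of that conversion, assuming the trace is a dict with the field names from the minimum trace record. The function name, the `eval-` ID prefix, and the sample values are hypothetical.

```python
def eval_case_from_trace(trace, expected_output, expected_route=None,
                         expected_fallback=None, evidence_required=True):
    """Build a regression case from a failed run plus the reviewer's corrections."""
    return {
        "eval_case_id": "eval-" + trace["run_id"],
        "source_run_id": trace["run_id"],
        "input_ref": trace["input_ref"],
        # Fall back to the observed route only when the reviewer agreed with it.
        "expected_route": expected_route or trace["route"],
        "expected_output": expected_output,
        "evidence_required": evidence_required,
        "expected_fallback": expected_fallback,
        "review_outcome": trace.get("review_outcome"),
    }


failed_trace = {
    "run_id": "run-001",
    "input_ref": "upload/123",
    "route": "general_qa",
    "review_outcome": "override",
}
case = eval_case_from_trace(
    failed_trace,
    expected_output={"policy_number": "PN-1042"},  # reviewer's corrected value (sample)
    expected_route="policy_extraction",
)
```

Keeping `source_run_id` on the case means a later regression can be traced back to the original incident and its artifacts.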

Do not do this.
  • Save only the final answer and raw model response.
  • Drop prompt version, model version, parser version, schema version, or route decision.
  • Debug a failed answer without checking whether the model saw the right input and context.
  • Treat validation errors as temporary noise rather than workflow signals.
  • Let reviewer corrections stay outside the trace.
  • Fix production failures without adding regression examples.

Implementation options to test

Tracing can start with structured run records and artifact references. The point is to preserve enough data to explain a run without leaking sensitive content into every log line.

| Need | Implementation options | What to evaluate |
| --- | --- | --- |
| Run identity | Generate a run ID and pass it through parsing, routing, model calls, validation, review, and eval. | Whether support, engineering, and review can refer to the same incident. |
| Artifact references | Store refs to source files, parsed text, chunks, prompts, raw outputs, evidence records, and review actions. | Whether the team can reconstruct a run without duplicating sensitive data in logs. |
| Step status | Record start/end status, error type, retry count, fallback taken, latency, and cost per step. | Whether failures can be isolated to a workflow boundary. |
| Version anchors | Record prompt version, model provider/name, temperature, top_p, system fingerprint when available, parser version, schema version, and route policy version. | Whether regressions can be tied to a concrete change. |
| Observability integration | Represent the workflow as OpenTelemetry-style traces and spans, with LLM-specific attributes where useful. | Whether AI traces can join existing production observability pipelines instead of living in a separate debugging silo. |
| Drift monitoring | Track distributions for route labels, confidence bands, fallback reasons, evidence quality, and review outcomes. | Whether behavior shifts are visible before they become support tickets or quality incidents. |
| Learning loop | Link failed runs to eval fixtures, reviewer corrections, and release-gate outcomes. | Whether tracing improves the system rather than only explaining the past. |
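
For the artifact-reference option, one possible starting point is a content-addressed store: the log line carries a reference, and the content lives elsewhere under access control. The sketch below uses an in-memory dict as a stand-in for real storage; `ArtifactStore` and the reference format are hypothetical.

```python
import hashlib


class ArtifactStore:
    """In-memory stand-in for a blob store keyed by content hash (illustrative)."""

    def __init__(self):
        self._blobs = {}

    def put(self, kind: str, content: str) -> str:
        """Store content and return a reference that is safe to log."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        ref = kind + "/sha256:" + digest
        self._blobs[ref] = content
        return ref

    def get(self, ref: str) -> str:
        return self._blobs[ref]


store = ArtifactStore()
prompt_text = "Extract the policy number from the attached document."
prompt_ref = store.put("prompt", prompt_text)

# The log line carries the reference, never the prompt text itself.
log_line = {"run_id": "run-001", "step": "call", "prompt_ref": prompt_ref}
```

Identical content always produces the same reference, which also makes it easy to see when two runs were sent exactly the same prompt or context.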

Where this shows up

Tracing matters anywhere a model call sits inside a larger workflow.

PolicyTrace

PolicyTrace's parsing, masking, classification, specialist extraction, arbitration, provenance, and review steps give a natural trace shape for debugging where a document AI result went wrong.

Future ContractCopilot

A contract workflow would need traces across clause routing, amendment context, retrieved sections, risk flags, evidence references, and reviewer decisions.

Future invoice intelligence

An invoice workflow would need traces across supplier matching, line-item extraction, tax checks, PO matching, exception reasons, and review outcomes.

The practical takeaway

Tracing turns a bad answer into an inspectable chain of causes. Without it, teams debug production AI by folklore.

The trace is the difference between "the model failed" and "this step failed for this reason."

That difference is what lets a team fix the right layer, design better fallbacks, and add the right evaluation case before the next release.

Once runs can be traced, the next question is whether every task should use the same model, context, and cost profile. That opens the Efficiency Layer.