PolicyTrace implementation note D

PolicyTrace Prompt Design: From PDF Text to Reviewable Evidence

The closing note in the PolicyTrace series: how prompts support document classification, specialist extraction, typed outputs, field citations, provenance, review, and evaluation.


Prompt Design Map

The prompt layer is useful because it is bounded: classify the document, route to a specialist prompt, return a typed record, preserve source phrases, and let the rest of the system arbitrate and review.

Stage groups: Classification and routing, Specialist extraction, Evidence preservation, Evaluation gate.

PDF: Parsed text. Docling turns the source document into Markdown and layout artifacts. (PDF -> Markdown)

CLS: Classify. Keyword heuristic and small LLM classifier identify the document type. (Schedule | Certificate | SOF)

RTE: Route. The document type selects the matching prompt from the versioned registry. (PromptRegistry)

JSON: Typed output. Instructor asks the model for a UKMotorGoldenRecord shaped by Pydantic. (response_model)

CIT: Field citations. The model also returns verbatim source phrases for extracted fields. (field_citations)

EV: Provenance. Verbatim phrases help match values back to PDF text geometry. (partial_ratio)

GATE: Evaluate. Prompt changes should pass golden examples, conflict checks, and evidence checks. (release gate)

Current implementation boundary

PolicyTrace has versioned prompts, a PromptRegistry lookup, and field citation mechanics. A production version would formalize that registry with an approval workflow, trace logs, and evaluation gates around prompt changes.

Why this matters

The prompt is not asked to be judge, database, auditor, and UI. It produces structured candidates and source phrases that the rest of the system can inspect.

Core thesis

Prompts should feed the system, not pretend to be the system.

The common shortcut in Document AI is to treat prompt design as the whole product: paste a PDF into a model, ask for JSON, and hope the answer is good enough. PolicyTrace takes a different path.

The prompt layer is important, but bounded. It helps classify documents, extract typed candidate records, and preserve source phrases. It does not decide final trust. Arbitration, provenance, review, evaluation, and deployment all remain outside the prompt.

The prompt produces candidates. The architecture produces trust.

That is the closing idea of the PolicyTrace series. Reliable Document AI is not one clever instruction. It is a chain of responsibilities.

1. Prompts ask for typed partial records, not magical final truth.
2. Prompts preserve verbatim source phrases so evidence can be matched later.
3. Prompt changes should be evaluated like code changes because they affect system behavior.

Avoid the giant prompt

One prompt over the whole PDF pack is the wrong abstraction.

A policy pack contains different document types with different responsibilities. A Schedule is not a Certificate. A Statement of Fact is not a Policy Booklet. If all documents are concatenated into one prompt, the model has to extract, classify, arbitrate, and ignore boilerplate at the same time.

PolicyTrace separates those jobs. It classifies the document type first, then routes the text to a specialist prompt. That keeps the prompt smaller, more explicit, and easier to improve without breaking unrelated document behavior.

SCH: Schedule prompt

Extracts the core policy facts: vehicle, drivers, cover, excesses, premium, NCB, and field citations.

CER: Certificate prompt

Focuses on legal-use fields such as class of use, driving other cars, and named driver entitlement.

GEN: Fallback prompt

Handles less-primary document types without pretending every source is equally authoritative.
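To make the classify-then-route step concrete, here is a minimal sketch of the keyword heuristic and the prompt lookup. The keyword lists, prompt texts, and function names are assumptions for illustration, not the PolicyTrace code, and the small LLM classifier that backs up the heuristic is left out.

```python
from enum import Enum


class DocumentType(Enum):
    SCHEDULE = "schedule"
    CERTIFICATE = "certificate"
    STATEMENT_OF_FACT = "statement_of_fact"
    OTHER = "other"


# Hypothetical keyword lists; real ones would be tuned on real policy packs.
KEYWORDS = {
    DocumentType.SCHEDULE: ("policy schedule", "premium", "excess"),
    DocumentType.CERTIFICATE: ("certificate of motor insurance", "entitled to drive"),
    DocumentType.STATEMENT_OF_FACT: ("statement of fact",),
}

# Hypothetical prompt texts keyed by document type.
PROMPTS = {
    DocumentType.SCHEDULE: "Extract vehicle, drivers, cover, excesses, premium, NCB.",
    DocumentType.CERTIFICATE: "Extract class of use, driving other cars, named drivers.",
}
FALLBACK_PROMPT = "Extract only fields that are explicitly stated."


def classify(text: str) -> DocumentType:
    """Score each document type by how many of its keywords appear in the text."""
    lowered = text.lower()
    scores = {
        doc_type: sum(keyword in lowered for keyword in keywords)
        for doc_type, keywords in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # No keyword hit at all means the heuristic cannot decide.
    return best if scores[best] > 0 else DocumentType.OTHER


def route(doc_type: DocumentType) -> str:
    """Pick the specialist prompt, or the fallback for less-primary documents."""
    return PROMPTS.get(doc_type, FALLBACK_PROMPT)


certificate_text = "Certificate of Motor Insurance. Persons entitled to drive: the policyholder."
selected_prompt = route(classify(certificate_text))
```

Each specialist prompt only ever sees one document type, so changing the Certificate prompt cannot alter how a Schedule is extracted.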

Typed extraction

The prompt is constrained by a schema.

PolicyTrace uses Instructor with a Pydantic response model. That means the model is not asked to invent an arbitrary JSON shape. It is asked to populate a known UKMotorGoldenRecord structure.

This is useful for engineering because downstream code can depend on a typed contract. The arbiter can merge records field by field. The provenance matcher can walk the record. The review UI can flatten it into field rows. The prompt is one producer inside a typed workflow.

Schema-constrained output changes the prompt's job.

The model is no longer writing a report. It is filling a structured contract that other parts of the system will inspect.

1. Pydantic defines the canonical output shape.
2. Optional fields allow partial per-document extraction without forcing hallucinated values.
3. Instructor uses the schema to guide model output and retries validation failures.
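The three points above can be sketched with the standard library alone. This is a stand-in, not the real stack: a dataclass plays the role of the Pydantic model, and a small loop imitates the idea of retrying when validation fails. The field names, `parse_candidate`, and `extract_with_retries` are hypothetical, and the loop is not Instructor's implementation.

```python
from dataclasses import dataclass, fields
from typing import Callable, Optional


@dataclass
class PartialRecord:
    # Every field is optional, so a document can leave unknown values empty
    # instead of being pushed to invent them.
    registration: Optional[str] = None
    total_premium: Optional[float] = None
    class_of_use: Optional[str] = None


def parse_candidate(raw: dict) -> PartialRecord:
    """Validate a raw model reply against the record shape."""
    allowed = {f.name for f in fields(PartialRecord)}
    unknown = set(raw) - allowed
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    premium = raw.get("total_premium")
    if premium is not None and not isinstance(premium, (int, float)):
        raise ValueError("total_premium must be a number")
    return PartialRecord(**raw)


def extract_with_retries(
    call_model: Callable[[Optional[str]], dict], max_retries: int = 2
) -> PartialRecord:
    """Call the model, feeding the last validation error back on each retry."""
    error: Optional[str] = None
    for _ in range(max_retries + 1):
        try:
            return parse_candidate(call_model(error))
        except ValueError as exc:
            error = str(exc)
    raise ValueError(f"no valid record after retries: {error}")


# A fake model: the first reply breaks the contract, the second one honors it.
replies = iter([{"total_premium": "GBP 703.28"}, {"total_premium": 703.28}])
record = extract_with_retries(lambda error: next(replies))
```

The fields the document did not mention stay `None`, which is what lets an arbiter merge several partial records field by field later.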

Field citations

The important trick is asking for two representations.

The model often canonicalizes values correctly. A PDF may say 15/04/2026 at 00:00 hours, while the schema value becomes an ISO datetime. A PDF may say GBP 703.28, while the schema value becomes a float.

That is useful for downstream systems, but it makes evidence matching harder. The canonical value may not look like the raw PDF text anymore. PolicyTrace solves this by asking the model to populate field_citations: a dictionary from dotted field path to a verbatim phrase copied from the document.

VAL: Canonical value

The typed field value is used by the arbiter, API, UI, and downstream consumers.

TXT: Verbatim phrase

The citation preserves the raw text that appeared in the PDF for provenance matching.

MAP: Field path

The dotted path connects the schema value, citation quote, provenance match, and review row.
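A small sketch shows why the two representations are needed. The record layout, the dotted paths, and the helper names are assumptions for illustration. The scoring uses `difflib` from the standard library as a stand-in for a partial_ratio style matcher, so the numbers will not match a dedicated fuzzy-matching library exactly.

```python
from difflib import SequenceMatcher
from typing import Tuple

page_text = (
    "Period of cover: from 15/04/2026 at 00:00 hours. "
    "Total premium: GBP 703.28 including IPT."
)

# Canonical values, shaped for downstream code (hypothetical layout).
record = {"cover": {"start": "2026-04-15T00:00:00"}, "premium": {"total": 703.28}}

# Verbatim phrases keyed by dotted field path, as the model would return them.
field_citations = {
    "cover.start": "15/04/2026 at 00:00 hours",
    "premium.total": "GBP 703.28",
}


def resolve(data: dict, dotted_path: str):
    """Walk a nested dict using a dotted field path."""
    value = data
    for key in dotted_path.split("."):
        value = value[key]
    return value


def partial_score(phrase: str, text: str) -> Tuple[float, int]:
    """Slide a phrase-sized window over the text; return (best ratio, start offset)."""
    phrase, text = phrase.lower(), text.lower()
    if not phrase or len(phrase) > len(text):
        return SequenceMatcher(None, phrase, text).ratio(), 0
    best = (0.0, 0)
    for start in range(len(text) - len(phrase) + 1):
        window = text[start:start + len(phrase)]
        score = SequenceMatcher(None, phrase, window).ratio()
        if score > best[0]:
            best = (score, start)
    return best


# For each cited field: canonical value, match score, and offset in the page text.
matches = {
    path: (resolve(record, path),) + partial_score(phrase, page_text)
    for path, phrase in field_citations.items()
}
```

The ISO datetime never appears in the page text, but its cited phrase does, so the offset found for the phrase is what can be mapped back to layout geometry.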

Hidden evidence helpers

field_citations should support review without leaking into the final record.

PolicyTrace declares field_citations with Field(exclude=True). That detail matters. The field is available to the model schema during extraction, but it is excluded when the Golden Record is serialized.

In other words, the citation map is an internal evidence helper. It helps provenance.py match source phrases back to Docling geometry. It does not become part of the downstream Golden Record payload.

The final output stays clean, but the review layer keeps the trace.

That is a reusable pattern for Document AI systems: separate canonical output from evidence artifacts.

1. The serialized Golden Record contains the canonical business values.
2. The citation sidecar helps provenance matching and reviewer inspection.
3. If citations are missing, fallback matching still works for some fields, but amounts and booleans remain harder because their canonical form rarely appears verbatim in the PDF text.
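The sidecar pattern can be imitated with a plain dataclass. This sketch is a standard-library stand-in for the `Field(exclude=True)` behavior described above, not the PolicyTrace model: the class name, the `exclude` metadata flag, and `to_payload` are all hypothetical.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, Optional


@dataclass
class GoldenRecordSketch:
    registration: Optional[str] = None
    total_premium: Optional[float] = None
    # Evidence helper: available during extraction and review,
    # flagged so it is skipped when the record is serialized.
    field_citations: Dict[str, str] = field(
        default_factory=dict, metadata={"exclude": True}
    )

    def to_payload(self) -> dict:
        """Serialize only the canonical business values."""
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if not f.metadata.get("exclude")
        }


extracted = GoldenRecordSketch(
    registration="AB12 CDE",
    total_premium=703.28,
    field_citations={"total_premium": "GBP 703.28"},
)
payload = extracted.to_payload()
```

The in-memory object still carries the citations for the provenance step, while the payload handed to downstream consumers contains only the two business fields.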

Prompt versioning

Prompts need release discipline.

PolicyTrace moved prompts into config/prompts.yaml with an active version and document-type keys. That is much safer than burying prompt strings in Python code.

A production version would go further. It would treat prompts like versioned artifacts: reviewed, tested, traced, and rolled back when needed. A prompt change can alter extraction quality, citation coverage, conflict handling, cost, and latency. That is a release event.

Current project has

1. Versioned YAML prompts with active_version: "v2".
2. PromptRegistry lookup by DocumentType with reload and switch_version methods.
3. FIELD_CITATIONS instructions in primary extraction prompts.
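As a rough picture of versioned lookup, the sketch below keeps the configuration in a Python dict that mirrors one possible YAML layout. The layout, the prompt texts, and the class itself are assumptions: only the `active_version` key and the idea of a `switch_version` method come from the project, and the real `reload` method, which would re-read the file, is omitted.

```python
# Hypothetical layout; the real config/prompts.yaml may be organized differently.
CONFIG = {
    "active_version": "v2",
    "versions": {
        "v1": {
            "schedule": "Extract the core policy facts.",
            "fallback": "Extract only explicitly stated fields.",
        },
        "v2": {
            "schedule": "Extract the core policy facts. Also fill FIELD_CITATIONS.",
            "certificate": "Extract legal-use fields. Also fill FIELD_CITATIONS.",
            "fallback": "Extract only explicitly stated fields.",
        },
    },
}


class PromptRegistrySketch:
    """Looks up a prompt by document type within the active version."""

    def __init__(self, config: dict) -> None:
        self._versions = config["versions"]
        self.active_version = config["active_version"]
        if self.active_version not in self._versions:
            raise KeyError(f"unknown active version: {self.active_version}")

    def switch_version(self, version: str) -> None:
        """Activate another prompt version, for example to roll back."""
        if version not in self._versions:
            raise KeyError(f"unknown prompt version: {version}")
        self.active_version = version

    def get(self, document_type: str) -> str:
        """Return the specialist prompt, or the version's fallback prompt."""
        prompts = self._versions[self.active_version]
        return prompts.get(document_type, prompts["fallback"])


registry = PromptRegistrySketch(CONFIG)
v2_certificate = registry.get("certificate")
registry.switch_version("v1")
v1_certificate = registry.get("certificate")  # v1 has no certificate prompt
```

Because the version is a single switch, a rollback is one call rather than an edit to prompt strings scattered through code.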

Production would add

1. Prompt approvals, owners, change notes, and rollback history.
2. Trace logs linking each run to prompt version, model version, schema version, and settings.
3. Evaluation gates before a prompt version is promoted.
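The trace log item is the easiest of the three to picture. The sketch below is one possible shape for such a record, not something PolicyTrace has today: the field names, the example model and schema identifiers, and the choice to store a SHA-256 digest of the prompt text are all assumptions.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunTrace:
    run_id: str
    prompt_version: str
    prompt_sha256: str  # digest of the exact prompt text that was sent
    model_version: str
    schema_version: str
    settings: dict


def build_trace(
    run_id: str,
    prompt_version: str,
    prompt_text: str,
    model_version: str,
    schema_version: str,
    settings: dict,
) -> RunTrace:
    """Link one extraction run to everything that shaped its output."""
    digest = hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()
    return RunTrace(
        run_id, prompt_version, digest, model_version, schema_version, settings
    )


def to_log_line(trace: RunTrace) -> str:
    """Serialize the trace as one JSON line with stable key order."""
    return json.dumps(asdict(trace), sort_keys=True)


trace = build_trace(
    run_id="run-0001",
    prompt_version="v2",
    prompt_text="Extract the core policy facts. Also fill FIELD_CITATIONS.",
    model_version="example-model-1",  # placeholder identifier
    schema_version="example-schema-1",  # placeholder identifier
    settings={"temperature": 0.0},
)
log_line = to_log_line(trace)
```

Hashing the prompt text guards against a version label being reused after a silent edit: two runs with the same `prompt_version` but different digests did not use the same prompt.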

Evaluation gates

A prompt change should prove it did not break the workflow.

The previous post focused on evaluation. Prompt design is where that matters most. A prompt can improve one field family while damaging another. It can raise citation coverage while also raising latency. It can reduce missing fields by encouraging hallucination. The only honest way to know is to run it against golden examples.

For PolicyTrace, I would gate prompt changes on Golden Record field checks, conflict fixtures, citation coverage, provenance quality, reviewer override signals, and cost or latency budgets.

REC: Record diff

Did the canonical field values improve, regress, or change unexpectedly?

CIT: Citation coverage

Did the prompt preserve enough verbatim phrases for source matching?

COST: Runtime budget

Did the prompt increase latency, retries, context pressure, or model spend?
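The three checks above can be sketched as one gate function over flat, dotted-path records. Everything here is illustrative: the function names, the 0.8 coverage threshold, and the 30 second latency budget are assumptions, and a real gate would also cover conflict fixtures, provenance quality, and reviewer override signals.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def diff_records(golden: Record, candidate: Record) -> Dict[str, Tuple[object, object]]:
    """Return every field path whose value differs from the golden example."""
    paths = sorted(set(golden) | set(candidate))
    return {
        path: (golden.get(path), candidate.get(path))
        for path in paths
        if golden.get(path) != candidate.get(path)
    }


def citation_coverage(candidate: Record, citations: Dict[str, str]) -> float:
    """Share of populated fields that come with a non-empty verbatim citation."""
    populated = [path for path, value in candidate.items() if value is not None]
    if not populated:
        return 1.0
    cited = sum(1 for path in populated if citations.get(path))
    return cited / len(populated)


def gate(
    golden: Record,
    candidate: Record,
    citations: Dict[str, str],
    latency_s: float,
    min_coverage: float = 0.8,   # hypothetical threshold
    max_latency_s: float = 30.0,  # hypothetical budget
) -> List[str]:
    """Return the list of failed checks; an empty list means the prompt may ship."""
    failures = []
    changed = diff_records(golden, candidate)
    if changed:
        failures.append(f"record diff: {sorted(changed)}")
    coverage = citation_coverage(candidate, citations)
    if coverage < min_coverage:
        failures.append(f"citation coverage {coverage:.2f} below {min_coverage:.2f}")
    if latency_s > max_latency_s:
        failures.append(f"latency {latency_s:.1f}s over {max_latency_s:.1f}s")
    return failures


golden = {"vehicle.registration": "AB12 CDE", "premium.total": 703.28}
good_run = gate(
    golden,
    {"vehicle.registration": "AB12 CDE", "premium.total": 703.28},
    {"vehicle.registration": "AB12 CDE", "premium.total": "GBP 703.28"},
    latency_s=12.0,
)
bad_run = gate(
    golden,
    {"vehicle.registration": "AB12 CDE", "premium.total": 730.28},
    {"vehicle.registration": "AB12 CDE"},
    latency_s=45.0,
)
```

Returning the failed checks by name, rather than a single boolean, makes it visible when a prompt change trades one budget for another.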

Series closure

PolicyTrace is not a prompt demo. That is the point.

The series started with architecture because the real lesson was never "write a better prompt." The lesson was to build a workflow where prompts are bounded, evidence is preserved, conflicts are visible, reviewers can intervene, deployment constraints are explicit, production gaps are named, and evaluation protects future changes.

That is the AI Tool Stack position in miniature: practical AI engineering is not about making a model sound confident. It is about building systems that can be inspected, reviewed, improved, and operated.