PolicyTrace system design chapter 03
PolicyTrace Evidence and Provenance: Why JSON Is Not Enough
The third core PolicyTrace chapter: how extracted fields become reviewable evidence through hidden source citations, Docling geometry, provenance matching, and PDF highlights.
Evidence and Provenance Map (figure)
Figure labels: 2026-04-15T00:00:00; Certificate of Motor Insurance.
build_provenance prefers citation quotes, then falls back to normalized values.
The reviewer's question: where did this value come from, and can I inspect the source without searching the PDF by hand?
Clicking a field selects the PDF, scrolls to the source page, overlays the matched region, and shows the matched text plus confidence signal.
Problem
The answer is not finished until it can be traced.
A lot of AI extraction demos stop when the model returns JSON. That is useful for a prototype, but it is not enough for a workflow where a reviewer has to approve, reject, or override a field.
If the system says the policy starts on 2026-04-15, the reviewer needs the next question answered immediately: where did that value come from?
PolicyTrace treats evidence as part of the product.
The Golden Record contains the canonical value. The provenance layer explains where the value was found. The review UI keeps both visible at the same time.
Failure example
Output-only JSON creates a trust gap.
Output-only extraction
{
  "start_date": "2026-04-15",
  "class_of_use": "Social",
  "voluntary_excess": "250"
}
The values may be correct, but the reviewer still has to search the PDFs manually to confirm them.
Evidence-backed extraction
{
  "field_path": "policy_header.period_of_cover.start_date",
  "value": "2026-04-15T00:00:00",
  "matched_text": "15/04/2026 at 00:00 hours",
  "source_filename": "Certificate.pdf",
  "page": 1,
  "bbox": [18.2, 31.4, 62.9, 34.1]
}
The field becomes inspectable: value, source phrase, document, page, and highlight location travel together.
Architecture concept
PolicyTrace uses two representations of the same fact.
A canonical value and a source phrase are not the same thing. The canonical value is shaped for systems. The source phrase is shaped for review.
Dates are the easiest example. A PDF may say 15/04/2026 at 00:00 hours. The Golden Record may store an ISO-like value. If the system tries to find the normalized value in the PDF, it may fail. If it preserves the original phrase, provenance matching has something real to locate.
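A small sketch makes the gap concrete. The page text below is hypothetical and only the date phrase comes from this chapter. The parsing format string is an assumption about how such a phrase could be normalized, not a description of the PolicyTrace normalizer.

```python
from datetime import datetime

# Hypothetical page text; only the date phrase is taken from the chapter.
page_text = "Effective from 15/04/2026 at 00:00 hours until further notice."

# Verbatim phrase as it appears in the PDF: shaped for review.
source_phrase = "15/04/2026 at 00:00 hours"

# Canonical value: shaped for downstream systems.
# The format string is an assumption for this sketch.
canonical = datetime.strptime(source_phrase, "%d/%m/%Y at %H:%M hours").isoformat()

# Searching the PDF text for the canonical value finds nothing,
# while the preserved source phrase is located directly.
canonical_found = canonical in page_text
phrase_found = source_phrase in page_text
```

Here `canonical_found` is false and `phrase_found` is true, which is why the source phrase has to survive extraction alongside the normalized value.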
Where this lives in the repo
- config/prompts.yaml: Specialist prompts ask the model to populate field_citations with verbatim phrases copied from the source document.
- src/schema.py: field_citations exists on the model but is excluded from the final serialized Golden Record.
- src/provenance.py: build_provenance uses citations and Docling text geometry to produce field-level provenance.
- src/api.py: The API returns GoldenRecordWithProvenance so the UI can show field values and source locations together.
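The "present on the model, absent from the export" idea behind field_citations can be sketched with a plain dataclass. This is an illustration only: the class name `PeriodOfCover` and the `to_golden_record` method are hypothetical, and the real src/schema.py may use a different mechanism to exclude the field from serialization.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class PeriodOfCover:  # hypothetical model, not the real schema class
    start_date: str
    # Verbatim source phrases keyed by field name.
    # Kept for provenance matching, never exported.
    field_citations: dict[str, str] = field(default_factory=dict)

    def to_golden_record(self) -> dict:
        """Serialize the canonical values only, dropping the hidden citations."""
        data = asdict(self)
        data.pop("field_citations")
        return data

period = PeriodOfCover(
    start_date="2026-04-15T00:00:00",
    field_citations={"start_date": "15/04/2026 at 00:00 hours"},
)

# The export carries the canonical value; the citation stays on the
# in-memory model where the provenance step can still read it.
exported = period.to_golden_record()
```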
Important boundary
The model does not get to invent coordinates.
This is the key design decision. PolicyTrace does not ask the LLM to say "page 3, x 42, y 18" and hope that geometry is real. The model supplies typed values and source phrases. The system builds geometry from Docling and matches after extraction.
That separation matters because coordinates are an artifact problem, not a language problem. The PDF parser owns text and layout. The model owns candidate extraction. The provenance layer connects them.
PDF: Parser-owned geometry
Docling provides text elements with page and bounding box data before review begins.
LLM: Model-owned citation
The model provides a verbatim phrase copied from the source, which is easier to verify than invented coordinates.
UI: Reviewer-owned trust
The reviewer sees the matched source and can decide whether the field is good enough for the workflow.
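The division of ownership above can be sketched as a matcher that never reads coordinates from the model. It only looks up a phrase in parser-provided text elements, trying the citation first and the normalized value second. Everything here is an assumption for illustration: `TextElement`, `locate`, the `difflib` scoring, and the 0.8 threshold are hypothetical and are not the actual build_provenance implementation or the Docling data model.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional

@dataclass
class TextElement:  # hypothetical stand-in for a parser text element
    text: str
    filename: str
    page: int
    bbox: tuple[float, float, float, float]

def _norm(s: str) -> str:
    """Lowercase and collapse whitespace so layout noise does not block a match."""
    return " ".join(s.lower().split())

def _score(needle: str, haystack: str) -> float:
    n, h = _norm(needle), _norm(haystack)
    if not n:
        return 0.0
    if n in h:
        return 1.0  # the phrase appears verbatim inside the element
    return SequenceMatcher(None, n, h).ratio()

def locate(citation: Optional[str], value: str,
           elements: list[TextElement], threshold: float = 0.8):
    """Prefer the citation quote, then fall back to the normalized value."""
    for candidate in (citation, value):
        if not candidate:
            continue
        best = max(elements, key=lambda e: _score(candidate, e.text), default=None)
        if best is not None:
            score = _score(candidate, best.text)
            if score >= threshold:
                return best, score
    return None  # no good match: the field stays without location data

elements = [
    TextElement("Certificate of Motor Insurance", "Certificate.pdf", 1,
                (10.0, 5.0, 90.0, 9.0)),
    TextElement("Period of cover: 15/04/2026 at 00:00 hours", "Certificate.pdf", 1,
                (18.2, 31.4, 62.9, 34.1)),
]

hit = locate("15/04/2026 at 00:00 hours", "2026-04-15T00:00:00", elements)
miss = locate(None, "2026-04-15T00:00:00", elements)
```

With the citation, `hit` resolves to the second element and its parser-owned bbox. Without it, the normalized date alone finds nothing above the threshold and `miss` is `None`.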
Real walkthrough
What happens when a reviewer clicks a field.
The review UI starts with the Golden Record fields on one side and the source PDF on the other. Each field can carry provenance: the matched text, source filename, page, bounding box, and match score.
When a field is selected, the frontend finds its FieldProvenance, switches to the right PDF, scrolls to the page, and overlays the percentage-based bbox on the rendered PDF page. The reviewer is not forced to trust a JSON blob. They see the answer and its source together.
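The overlay arithmetic is simple enough to show directly. The sketch below is written in Python for consistency with the rest of the chapter, although the frontend would do the equivalent in the browser. It assumes the bbox is ordered as left, top, right, bottom in percent of the rendered page, which matches the example values in this chapter but is not confirmed by the source. The function name and the page dimensions are hypothetical.

```python
def percent_bbox_to_pixels(bbox, page_width_px, page_height_px):
    """Convert a percent-based bbox into a pixel rectangle for an overlay.

    Assumes bbox = [left, top, right, bottom], each in percent of the page.
    """
    left, top, right, bottom = bbox
    return {
        "left": left / 100 * page_width_px,
        "top": top / 100 * page_height_px,
        "width": (right - left) / 100 * page_width_px,
        "height": (bottom - top) / 100 * page_height_px,
    }

# Example bbox from this chapter, on a hypothetical 1000 x 1400 px render.
rect = percent_bbox_to_pixels([18.2, 31.4, 62.9, 34.1], 1000, 1400)
```

Because the bbox is stored as percentages, the same provenance record works at any zoom level: only the rendered page size changes.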
The provenance object is intentionally small.
- field_path: The dotted path back to the Golden Record field.
- matched_text: The source text snippet the matcher found in the PDF corpus.
- match_score: A reviewer-visible confidence signal for the source match, not a guarantee of business correctness.
- source_filename: The PDF that contains the matched source text.
- location: Page plus browser-percent bounding box for the highlight overlay.
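As a rough sketch, the fields listed above fit in a small record type, and selecting a field becomes a lookup by field_path. The types, the nested `Location` class, and the match_score value are assumptions for illustration; the real FieldProvenance definition in the repo may differ.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Location:  # hypothetical shape for "page plus bounding box"
    page: int
    bbox: tuple[float, float, float, float]  # percent of rendered page; order assumed

@dataclass(frozen=True)
class FieldProvenance:  # sketch only; field names follow the list above
    field_path: str
    matched_text: str
    match_score: float
    source_filename: str
    location: Optional[Location] = None  # None when no good match exists

records = [
    FieldProvenance(
        field_path="policy_header.period_of_cover.start_date",
        matched_text="15/04/2026 at 00:00 hours",
        match_score=1.0,  # illustrative value, not a measured score
        source_filename="Certificate.pdf",
        location=Location(page=1, bbox=(18.2, 31.4, 62.9, 34.1)),
    ),
]

# Index by dotted path so a clicked field resolves to its evidence in one step.
by_path = {p.field_path: p for p in records}
selected = by_path.get("policy_header.period_of_cover.start_date")
unknown = by_path.get("policy_header.insurer_name")  # hypothetical path with no provenance
```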
Source honesty
Provenance is review evidence, not legal proof.
It is tempting to oversell provenance. PolicyTrace should not do that. A matched highlight means the system found text that supports a field. It does not mean the business interpretation is automatically correct, or that the workflow has legal-grade audit controls.
The current implementation is honest about those boundaries. If no good match exists, the field can remain without location data. If the document type is broad boilerplate, such as a Policy Booklet, it can be excluded from matching to reduce false positives. If production required stronger guarantees, the next layer would be evaluation, audit logging, reviewer approval history, and retention policy.
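Two of those boundaries are easy to express in code: skip broad boilerplate documents before matching, and leave the location empty when the match is weak. The sketch below is hypothetical throughout. `SourceDocument`, `EXCLUDED_DOC_TYPES`, the `Booklet.pdf` filename, and the 0.8 cutoff are assumptions, not the project's configuration.

```python
from dataclasses import dataclass

# Assumption: excluded document types are configured as a set of labels.
EXCLUDED_DOC_TYPES = {"Policy Booklet"}

@dataclass
class SourceDocument:  # hypothetical document descriptor
    filename: str
    doc_type: str

def matchable(documents):
    """Drop boilerplate-heavy document types to reduce false-positive matches."""
    return [d for d in documents if d.doc_type not in EXCLUDED_DOC_TYPES]

def location_or_none(location, score, min_score=0.8):
    """Keep a highlight only when the match is strong enough to show a reviewer."""
    return location if score >= min_score else None

docs = [
    SourceDocument("Certificate.pdf", "Certificate of Motor Insurance"),
    SourceDocument("Booklet.pdf", "Policy Booklet"),
]
candidates = matchable(docs)

bbox = (18.2, 31.4, 62.9, 34.1)
strong = location_or_none(bbox, score=0.95)
weak = location_or_none(bbox, score=0.40)
```

A missing highlight is a deliberate outcome here: showing no location is more honest than showing a low-confidence one.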
A good evidence layer narrows the review problem.
It does not remove judgment. It removes search. That is already a major workflow improvement.
Reusable pattern
The same evidence pattern travels beyond insurance.
PolicyTrace uses UK motor insurance because the document pack makes the problem visible: schedules, certificates, statements, and policy wording overlap. But the evidence pattern is broader.
Any AI system that extracts from documents should separate the value used downstream from the evidence used for review. Lending packs, claims files, compliance evidence, onboarding documents, audit bundles, and healthcare forms all need the same split.
1. Extract candidate values
Let the model do bounded extraction into a known schema.
2. Preserve source phrases
Keep verbatim text for matching, even if the final value is normalized.
3. Review with evidence
Make the source inspectable where the reviewer is already making the decision.