PolicyTrace system design chapter 03

PolicyTrace Evidence and Provenance: Why JSON Is Not Enough

The third core PolicyTrace chapter: how extracted fields become reviewable evidence through hidden source citations, Docling geometry, provenance matching, and PDF highlights.

Evidence and Provenance Map

Canonical value · Hidden citation · Geometry match · Reviewer view

Golden Record field

  • policy_header.period_of_cover.start_date: canonical value used by downstream systems (2026-04-15T00:00:00)
  • field_citations: internal source phrase excluded from serialized JSON (15/04/2026 at 00:00 hours)

Source PDF text

Certificate of Motor Insurance

  • Policy number: NBM-DEMO-0427
  • Insurer: Northbridge Mutual Motor Insurance Ltd
  • Effective from: 15/04/2026 at 00:00 hours
  • Expires: 14/04/2027 at 23:59 hours
  • Registration: ZX24 DEM

1. Docling corpus: PDF text elements are stored with page and browser-percent bounding boxes.
2. Source phrase: The model supplies verbatim citation text, not coordinates.
3. Matcher: build_provenance prefers citation quotes, then falls back to normalized values.
4. FieldProvenance: The review UI receives field path, matched text, score, file, page, and bbox.

Reviewer question

Where did this value come from, and can I inspect the source without searching the PDF by hand?

source file · page 1 · bbox highlight

PolicyTrace answer

Clicking a field selects the PDF, scrolls to the source page, overlays the matched region, and shows the matched text plus confidence signal.

matched text · match score · review state

Problem

The answer is not finished until it can be traced.

A lot of AI extraction demos stop when the model returns JSON. That is useful for a prototype, but it is not enough for a workflow where a reviewer has to approve, reject, or override a field.

If the system says the policy starts on 2026-04-15, the reviewer needs the next question answered immediately: where did that value come from?

PolicyTrace treats evidence as part of the product.

The Golden Record contains the canonical value. The provenance layer explains where the value was found. The review UI keeps both visible at the same time.

Failure example

Output-only JSON creates a trust gap.

Output-only extraction

{
  "start_date": "2026-04-15",
  "class_of_use": "Social",
  "voluntary_excess": "250"
}

The values may be correct, but the reviewer still has to search the PDFs manually to confirm them.

Evidence-backed extraction

{
  "field_path": "policy_header.period_of_cover.start_date",
  "value": "2026-04-15T00:00:00",
  "matched_text": "15/04/2026 at 00:00 hours",
  "source_filename": "Certificate.pdf",
  "page": 1,
  "bbox": [18.2, 31.4, 62.9, 34.1]
}

The field becomes inspectable: value, source phrase, document, page, and highlight location travel together.

Architecture concept

PolicyTrace uses two representations of the same fact.

A canonical value and a source phrase are not the same thing. The canonical value is shaped for systems. The source phrase is shaped for review.

Dates are the easiest example. A PDF may say 15/04/2026 at 00:00 hours. The Golden Record may store an ISO-like value such as 2026-04-15T00:00:00. If the system searches the PDF for the normalized value, it will usually fail, because that string never appears in the document. If it preserves the original phrase, provenance matching has something real to locate.
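To make the gap concrete, here is a minimal sketch using the date from the sample certificate. The function name normalize_cover_date and the PDF_TEXT constant are illustrative assumptions, not names from the PolicyTrace repo, and the parsing format only covers this one phrasing.

```python
from datetime import datetime

# Verbatim phrase as it appears in the sample certificate.
SOURCE_PHRASE = "15/04/2026 at 00:00 hours"
# Illustrative stand-in for the text a PDF parser would return for that line.
PDF_TEXT = "Effective from 15/04/2026 at 00:00 hours"


def normalize_cover_date(phrase: str) -> str:
    """Parse a 'DD/MM/YYYY at HH:MM hours' phrase into an ISO 8601 string."""
    parsed = datetime.strptime(phrase, "%d/%m/%Y at %H:%M hours")
    return parsed.isoformat()


canonical = normalize_cover_date(SOURCE_PHRASE)

print(canonical)                  # 2026-04-15T00:00:00
print(canonical in PDF_TEXT)      # False: the canonical form is not in the PDF
print(SOURCE_PHRASE in PDF_TEXT)  # True: the verbatim phrase is
```

The canonical string is what downstream systems want, but only the verbatim phrase can be located in the document text.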

Where this lives in the repo

  • config/prompts.yaml: Specialist prompts ask the model to populate field_citations with verbatim phrases copied from the source document.
  • src/schema.py: field_citations exists on the model but is excluded from the final serialized Golden Record.
  • src/provenance.py: build_provenance uses citations and Docling text geometry to produce field-level provenance.
  • src/api.py: The API returns GoldenRecordWithProvenance so the UI can show field values and source locations together.
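
The "present on the model, absent from the serialized record" idea can be sketched with the standard library alone. This is an assumption-heavy illustration: the PeriodOfCover class, its to_golden_record method, and the shape of field_citations (field name mapped to source phrase) are hypothetical, and the real src/schema.py may rely on a schema library's own exclusion mechanism instead.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class PeriodOfCover:
    """Hypothetical schema fragment, not the actual PolicyTrace model."""

    start_date: str
    # Assumed shape: field name -> verbatim phrase copied from the source PDF.
    field_citations: dict[str, str] = field(default_factory=dict)

    def to_golden_record(self) -> dict:
        """Serialize the canonical values only; citations stay internal."""
        data = asdict(self)
        data.pop("field_citations", None)
        return data


period = PeriodOfCover(
    start_date="2026-04-15T00:00:00",
    field_citations={"start_date": "15/04/2026 at 00:00 hours"},
)

serialized = period.to_golden_record()
print(serialized)                            # {'start_date': '2026-04-15T00:00:00'}
print(period.field_citations["start_date"])  # 15/04/2026 at 00:00 hours
```

The citation remains available in memory for the provenance step, while consumers of the Golden Record never see it.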

Important boundary

The model does not get to invent coordinates.

This is the key design decision. PolicyTrace does not ask the LLM to say "page 3, x 42, y 18" and hope that geometry is real. The model supplies typed values and source phrases. The system builds geometry from Docling and matches after extraction.

That separation matters because coordinates are a property of the PDF artifact, not something a language model can reliably infer from text. The PDF parser owns text and layout. The model owns candidate extraction. The provenance layer connects them.

PDF: Parser-owned geometry

Docling provides text elements with page and bounding box data before review begins.

LLM: Model-owned citation

The model provides a verbatim phrase copied from the source, which is easier to verify than invented coordinates.

UI: Reviewer-owned trust

The reviewer sees the matched source and can decide whether the field is good enough for the workflow.
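
The division of labour above can be sketched as a small matcher that tries the model's citation first and only then falls back to the normalized value. Everything here is an assumption for illustration: TextElement, match_field, the substring-then-fuzzy scoring, and the 0.8 threshold are not taken from build_provenance, and the first bounding box in the sample corpus is made up.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class TextElement:
    """Hypothetical parser output: text plus where it sits in the PDF."""

    text: str
    source_filename: str
    page: int
    bbox: tuple[float, float, float, float]  # assumed percent coordinates


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout noise does not hurt matching."""
    return " ".join(text.lower().split())


def _score(query: str, candidate: str) -> float:
    """Return 1.0 for a contained phrase, otherwise a fuzzy similarity ratio."""
    query, candidate = _normalize(query), _normalize(candidate)
    if not query or not candidate:
        return 0.0
    if query in candidate:
        return 1.0
    return SequenceMatcher(None, query, candidate).ratio()


def match_field(
    citation: Optional[str],
    normalized_value: str,
    corpus: list[TextElement],
    threshold: float = 0.8,
) -> Optional[tuple[TextElement, float]]:
    """Prefer the citation; fall back to the normalized value; else no match."""
    for query in (citation, normalized_value):
        if not query:
            continue
        scored = [(element, _score(query, element.text)) for element in corpus]
        if not scored:
            continue
        element, score = max(scored, key=lambda pair: pair[1])
        if score >= threshold:
            return element, score
    return None  # the field simply stays without location data


corpus = [
    # Bounding box below is illustrative only.
    TextElement("Policy number NBM-DEMO-0427", "Certificate.pdf", 1, (18.2, 27.0, 55.0, 29.7)),
    TextElement("Effective from 15/04/2026 at 00:00 hours", "Certificate.pdf", 1, (18.2, 31.4, 62.9, 34.1)),
]

with_citation = match_field("15/04/2026 at 00:00 hours", "2026-04-15T00:00:00", corpus)
value_fallback = match_field(None, "NBM-DEMO-0427", corpus)
no_match = match_field(None, "2026-04-15T00:00:00", corpus)
```

In this sketch the date is found only because the citation was preserved. The policy number is found through the fallback, since its normalized value happens to appear verbatim. The date without a citation returns no match at all. The geometry always comes from the parsed corpus, never from the model.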

Real walkthrough

What happens when a reviewer clicks a field.

The review UI starts with the Golden Record fields on one side and the source PDF on the other. Each field can carry provenance: the matched text, source filename, page, bounding box, and match score.

When a field is selected, the frontend finds its FieldProvenance, switches to the right PDF, scrolls to the page, and overlays the percentage-based bbox on the rendered PDF page. The reviewer is not forced to trust a JSON blob. They see the answer and its source together.
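
The overlay arithmetic is simple enough to sketch. The real overlay lives in the frontend; the version below is in Python only to show the calculation, and it assumes the bbox is ordered left, top, right, bottom in percent of the rendered page with the origin at the top left. Neither that ordering nor the names PixelRect and bbox_percent_to_pixels are confirmed by the source.

```python
from typing import NamedTuple


class PixelRect(NamedTuple):
    """Absolute overlay rectangle for one rendered page."""

    left: float
    top: float
    width: float
    height: float


def bbox_percent_to_pixels(
    bbox: tuple[float, float, float, float],
    page_width_px: float,
    page_height_px: float,
) -> PixelRect:
    """Scale an assumed (left, top, right, bottom) percent bbox to pixels."""
    x0, y0, x1, y1 = bbox
    if not (0 <= x0 <= x1 <= 100 and 0 <= y0 <= y1 <= 100):
        raise ValueError(f"bbox is not a valid percent rectangle: {bbox}")
    return PixelRect(
        left=x0 / 100 * page_width_px,
        top=y0 / 100 * page_height_px,
        width=(x1 - x0) / 100 * page_width_px,
        height=(y1 - y0) / 100 * page_height_px,
    )


# The bbox from the evidence-backed example, on a page rendered at 1000 x 1400 px.
rect = bbox_percent_to_pixels((18.2, 31.4, 62.9, 34.1), 1000, 1400)
print([round(value, 1) for value in rect])  # [182.0, 439.6, 447.0, 37.8]
```

Because the stored box is a percentage of the page, the same provenance record works at any zoom level: only the page dimensions passed in change.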

The provenance object is intentionally small.

  • field_path: The dotted path back to the Golden Record field.
  • matched_text: The source text snippet the matcher found in the PDF corpus.
  • match_score: A reviewer-visible confidence signal for the source match, not a guarantee of business correctness.
  • source_filename: The PDF that contains the matched source text.
  • location: Page plus browser-percent bounding box for the highlight overlay.
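
As a rough sketch, the fields listed above could be modelled like this. The field names follow the list, but the nested Location type, the optional location, the index_by_field_path helper, and the sample score of 1.0 are assumptions rather than the actual FieldProvenance definition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Location:
    """Page number plus an assumed percent-based bounding box."""

    page: int
    bbox: tuple[float, float, float, float]


@dataclass(frozen=True)
class FieldProvenance:
    """Hypothetical shape of one provenance entry."""

    field_path: str
    matched_text: str
    match_score: float
    source_filename: str
    location: Optional[Location] = None  # absent when no good match was found


def index_by_field_path(entries: list[FieldProvenance]) -> dict[str, FieldProvenance]:
    """Let the UI look up provenance directly from the clicked field's path."""
    return {entry.field_path: entry for entry in entries}


entries = [
    FieldProvenance(
        field_path="policy_header.period_of_cover.start_date",
        matched_text="15/04/2026 at 00:00 hours",
        match_score=1.0,  # illustrative score
        source_filename="Certificate.pdf",
        location=Location(page=1, bbox=(18.2, 31.4, 62.9, 34.1)),
    ),
]

by_path = index_by_field_path(entries)
selected = by_path["policy_header.period_of_cover.start_date"]
print(selected.source_filename, selected.location.page)  # Certificate.pdf 1
```

Keeping the object this small means a click handler needs only the field path to know which PDF to open, which page to scroll to, and which region to highlight.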

Source honesty

Provenance is review evidence, not legal proof.

It is tempting to oversell provenance. PolicyTrace should not do that. A matched highlight means the system found text that supports a field. It does not mean the business interpretation is automatically correct, or that the workflow has legal-grade audit controls.

The current implementation is honest about those boundaries. If no good match exists, the field can remain without location data. If the document type is broad boilerplate, such as a Policy Booklet, it can be excluded from matching to reduce false positives. If production required stronger guarantees, the next layer would be evaluation, audit logging, reviewer approval history, and retention policy.
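
Excluding boilerplate documents before matching can be as simple as a filter on document type. This is a sketch under assumptions: SourceDocument, EXCLUDED_DOCUMENT_TYPES, matchable_documents, and every filename other than Certificate.pdf are hypothetical, and the source does not say how PolicyTrace classifies documents.

```python
from dataclasses import dataclass

# Assumed configuration: document types too generic to serve as field evidence.
EXCLUDED_DOCUMENT_TYPES = {"Policy Booklet"}


@dataclass(frozen=True)
class SourceDocument:
    """Hypothetical record for one PDF in the document pack."""

    filename: str
    document_type: str


def matchable_documents(documents: list[SourceDocument]) -> list[SourceDocument]:
    """Drop boilerplate documents so they cannot produce false-positive highlights."""
    return [
        document
        for document in documents
        if document.document_type not in EXCLUDED_DOCUMENT_TYPES
    ]


pack = [
    SourceDocument("Certificate.pdf", "Certificate"),
    SourceDocument("Schedule.pdf", "Schedule"),
    SourceDocument("Policy Booklet.pdf", "Policy Booklet"),
]

allowed = [document.filename for document in matchable_documents(pack)]
print(allowed)  # ['Certificate.pdf', 'Schedule.pdf']
```

The trade-off is deliberate: a field whose only support sits in an excluded document ends up with no highlight, which is preferable to a confident highlight on generic wording.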

A good evidence layer narrows the review problem.

It does not remove judgment. It removes search. That is already a major workflow improvement.

Reusable pattern

The same evidence pattern travels beyond insurance.

PolicyTrace uses UK motor insurance because the document pack makes the problem visible: schedules, certificates, statements, and policy wording overlap. But the evidence pattern is broader.

Any AI system that extracts from documents should separate the value used downstream from the evidence used for review. Lending packs, claims files, compliance evidence, onboarding documents, audit bundles, and healthcare forms all need the same split.

1. Extract candidate values

Let the model do bounded extraction into a known schema.

2. Preserve source phrases

Keep verbatim text for matching, even if the final value is normalized.

3. Review with evidence

Make the source inspectable where the reviewer is already making the decision.

Next core chapter

Conflict and arbitration.

The next core chapter focuses on disagreement: what happens when two source documents support different values, and how the system should expose that conflict to a reviewer.