Problem

Two documents can both look valid and still disagree.

A policy pack is not a single source of truth. It is a small bundle of overlapping documents. A Schedule may contain detailed customer, vehicle, premium, and excess information. A Certificate may carry legal-use wording and entitlement fields. The values overlap, but they do not always match.

If the system extracts each document and then simply returns one JSON object, the disagreement can disappear. That is dangerous because the final value may look clean even when the evidence behind it is contested.

PolicyTrace treats disagreement as state.

The useful output is not only the selected value. It is the selected value plus the competing value, the field path, the winning source, and a review path.

Failure example

Silent overwrite is the failure mode.

The simplest extraction pipeline asks a model to read a pack and produce JSON. That can work for demos, but it makes conflict handling vague. If two documents disagree, which value should the model choose? The latest one? The clearest one? The one mentioned twice?

Output-only extraction

The result looks authoritative, but the source disagreement is gone.

{ "class_of_use": "Social only" }

Arbitrated extraction

The Golden Record has a winner, and the conflict remains inspectable.

{
  "record": { "class_of_use": "Social, Domestic and Pleasure" },
  "conflicts": [
    {
      "field": "cover_and_excesses.class_of_use",
      "schedule_value": "Social only",
      "certificate_value": "Social, Domestic and Pleasure",
      "winner": "certificate"
    }
  ]
}

Architecture concept

Arbitration is a system rule, not an LLM opinion.

PolicyTrace uses the model to extract structured fields from each source document, but it does not ask the model to decide the final truth of the policy pack. That responsibility belongs to the arbiter.

This matters because confidence and authority are different ideas. A model might be very confident that it read a value correctly. That does not mean the value should win. The winning source should be chosen by a domain rule that the team can test, review, and change deliberately.

Schedule wins: vehicle details, cover type, NCB, excess breakdown, financial summary. The Schedule carries much of the detailed policy and premium structure.

Certificate wins: `class_of_use`, `driving_other_cars`. The Certificate is treated as master for key legal-use and entitlement fields.

Conflict recorded: both values present and different. The arbiter appends a `ConflictEntry` instead of letting the losing value vanish.
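One way to picture "a domain rule that the team can test" is a plain lookup table. The sketch below is illustrative only: `FIELD_AUTHORITY`, `winning_source`, the fallback to the Schedule, and every field path other than `cover_and_excesses.class_of_use` are assumptions, not names taken from the repo.

```python
from typing import Literal

Source = Literal["schedule", "certificate"]

# Hypothetical authority table: field path -> source that wins a disagreement.
# Only cover_and_excesses.class_of_use appears in the article; the other paths
# are made-up placeholders for the categories listed above.
FIELD_AUTHORITY: dict[str, Source] = {
    "vehicle.registration": "schedule",
    "cover_and_excesses.cover_type": "schedule",
    "cover_and_excesses.class_of_use": "certificate",
    "cover_and_excesses.driving_other_cars": "certificate",
}


def winning_source(field_path: str, default: Source = "schedule") -> Source:
    """Return the configured winner for a field.

    Falling back to the Schedule for unlisted fields is an assumption
    made for this sketch, not a documented rule.
    """
    return FIELD_AUTHORITY.get(field_path, default)


legal_use_winner = winning_source("cover_and_excesses.class_of_use")
unlisted_winner = winning_source("some.unlisted.field")
```

Because the rule is data rather than prompt text, a change to the hierarchy shows up as a reviewable diff and can be covered by an ordinary unit test.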

Implementation

The current repo makes conflict handling deterministic.

The key implementation is `PolicyArbiter` in `src/arbiter.py`. It receives one Schedule extraction and one Certificate extraction, builds a single `UKMotorGoldenRecord`, and returns the merged record plus a list of conflicts.

The helper `_check_conflict` does a narrow but important job: when both source values exist and differ after simple normalization, it records the field path, the Schedule value, the Certificate value, and the configured winner. Then it returns the winning value for the Golden Record.

ConflictEntry(
    field="cover_and_excesses.class_of_use",
    schedule_value="Social only",
    certificate_value="Social, Domestic and Pleasure",
    winner="certificate",
)

That design is intentionally boring in the best sense. It is inspectable. It is testable. It can be explained to a business reviewer. And when the hierarchy changes, the change belongs in system rules and tests, not in a vague prompt instruction.
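To make the behaviour described above concrete, here is a minimal sketch of a conflict check. It is not the code in `src/arbiter.py`: the function name `check_conflict_sketch`, the `_normalize` rule (trim, collapse whitespace, ignore case), and the handling of a missing value are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ConflictEntry:
    # Mirrors the four fields shown in the article's example.
    field: str
    schedule_value: str
    certificate_value: str
    winner: str  # "schedule" or "certificate"


def _normalize(value: str) -> str:
    # Assumed "simple normalization": trim, collapse whitespace, ignore case.
    return " ".join(value.split()).casefold()


def check_conflict_sketch(
    field: str,
    schedule_value: Optional[str],
    certificate_value: Optional[str],
    winner: str,
    conflicts: List[ConflictEntry],
) -> Optional[str]:
    # Assumption: when only one source has a value, use it without a conflict.
    if schedule_value is None:
        return certificate_value
    if certificate_value is None:
        return schedule_value
    # Both present: record a conflict only if they differ after normalization.
    if _normalize(schedule_value) != _normalize(certificate_value):
        conflicts.append(
            ConflictEntry(field, schedule_value, certificate_value, winner)
        )
    # The configured winner supplies the value whether or not a conflict exists.
    return certificate_value if winner == "certificate" else schedule_value


conflicts: List[ConflictEntry] = []
chosen = check_conflict_sketch(
    "cover_and_excesses.class_of_use",
    "Social only",
    "Social, Domestic and Pleasure",
    "certificate",
    conflicts,
)
```

In this sketch the winner is passed in by the caller, so the function never decides authority itself; it only applies and records it.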

Reviewer visibility

A reviewer needs the conflict, not just the chosen answer.

The API already returns conflicts inside `GoldenRecordWithProvenance`, alongside the record, provenance list, and session id. That creates the foundation for a reviewer-visible conflict experience.

The current React UI already defines the conflict type and provides the main review surface for fields, evidence, verification, and overrides. A dedicated conflict panel or conflict queue is still a clear UI hardening step. That is the honest production lesson: recording conflict state is necessary, but the product still has to make it hard to miss.

FIELD: Field-level signal

Mark the Golden Record row when the chosen value came from a conflict rather than a single uncontested source.

WHY: Authority explanation

Show the rule that decided the winner, such as Schedule wins for vehicle detail or Certificate wins for legal-use fields.

SRC: Both source values

Keep the losing value visible so the reviewer can see what was discarded and where it came from.

ACT: Review action

Let the reviewer verify the winner, flag the field, override it, or escalate the pack when the rule is not enough.
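The four signals above could be derived from data the arbiter already produces. The sketch below is one possible shape, not the project's UI model: `ReviewRow`, `build_review_rows`, and the assumption that the record is a flat mapping keyed by field path are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewRow:
    field: str
    value: str
    contested: bool                      # FIELD: row came from a conflict
    rule: Optional[str] = None           # WHY: rule that picked the winner
    losing_value: Optional[str] = None   # SRC: value that was discarded
    losing_source: Optional[str] = None  # SRC: where the discarded value came from


def build_review_rows(record: dict, conflicts: list) -> list:
    """Join a flat record with conflict entries so contested rows stand out."""
    by_field = {c["field"]: c for c in conflicts}
    rows = []
    for field, value in record.items():
        conflict = by_field.get(field)
        if conflict is None:
            rows.append(ReviewRow(field, value, contested=False))
            continue
        winner = conflict["winner"]
        loser = "schedule" if winner == "certificate" else "certificate"
        rows.append(
            ReviewRow(
                field=field,
                value=value,
                contested=True,
                rule=f"{winner.capitalize()} wins for {field}",
                losing_value=conflict[f"{loser}_value"],
                losing_source=loser,
            )
        )
    return rows


rows = build_review_rows(
    {
        "cover_and_excesses.class_of_use": "Social, Domestic and Pleasure",
        "cover_and_excesses.cover_type": "Comprehensive",
    },
    [
        {
            "field": "cover_and_excesses.class_of_use",
            "schedule_value": "Social only",
            "certificate_value": "Social, Domestic and Pleasure",
            "winner": "certificate",
        }
    ],
)
```

The review actions (verify, flag, override, escalate) are left out here; they would attach to each row in whatever way the reviewer workflow defines.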

Production pattern

The hierarchy of truth should become configuration and audit history.

The current implementation is a reference project, so its rules live in Python. That is fine for a compact system. In production, those rules usually need stronger governance: versioned configuration, domain owner approval, regression fixtures, and audit history for rule changes.

That is especially true in regulated or operational workflows. When a reviewer asks why a value won, the answer should not be "the model chose it." It should be "rule version 12 says Certificate is authoritative for this field, the Schedule disagreed, and this reviewer approved or overrode the result."
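A small sketch can show how a versioned rule set might produce that kind of answer. Everything here is hypothetical: `RuleSet`, `explain_decision`, and the wording of the message are illustrations of the idea, not part of the reference project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleSet:
    # Hypothetical versioned rules: field path -> "schedule" or "certificate".
    version: int
    authority: dict


def explain_decision(
    rules: RuleSet,
    field: str,
    schedule_value: str,
    certificate_value: str,
    reviewer_action: str,
) -> str:
    """Build an audit sentence naming the rule version, both values, and the action."""
    winner = rules.authority[field]
    return (
        f"rule version {rules.version}: {winner.capitalize()} is authoritative "
        f"for {field}; Schedule said {schedule_value!r}, "
        f"Certificate said {certificate_value!r}; "
        f"reviewer action: {reviewer_action}"
    )


rules_v12 = RuleSet(
    version=12,
    authority={"cover_and_excesses.class_of_use": "certificate"},
)
audit_line = explain_decision(
    rules_v12,
    "cover_and_excesses.class_of_use",
    "Social only",
    "Social, Domestic and Pleasure",
    "approved",
)
```

Storing the rule version next to each decision means an old result can still be explained after the hierarchy changes.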

RULE: Rules are explicit

Authority belongs in code or configuration where it can be reviewed, tested, and versioned.

LOG: Conflicts are durable

Conflict entries should travel with the result and later become part of an audit trail.

EVAL: Conflicts are test cases

Evaluation data should include disagreement fixtures, not only clean extraction examples.

UI: Review is designed

The reviewer surface should make conflict obvious, explain the winner, and preserve action history.
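As a rough illustration of "conflicts are test cases", disagreement fixtures can sit next to clean ones and run against any arbiter with the same interface. The fixture format, `run_fixtures`, and the stand-in `toy_arbitrate` below are assumptions for this sketch; a real suite would call the project's own arbiter instead.

```python
# Hypothetical authority rule used only by the stand-in arbiter below.
AUTHORITY = {"cover_and_excesses.class_of_use": "certificate"}

FIXTURES = [
    {
        "name": "class_of_use_disagrees",
        "schedule": {"cover_and_excesses.class_of_use": "Social only"},
        "certificate": {
            "cover_and_excesses.class_of_use": "Social, Domestic and Pleasure"
        },
        "expected_value": "Social, Domestic and Pleasure",
        "expected_conflicts": 1,
    },
    {
        "name": "class_of_use_agrees_after_normalization",
        "schedule": {"cover_and_excesses.class_of_use": "social only "},
        "certificate": {"cover_and_excesses.class_of_use": "Social only"},
        "expected_value": "Social only",
        "expected_conflicts": 0,
    },
]


def toy_arbitrate(schedule: dict, certificate: dict):
    """Stand-in arbiter: flat dicts in, (record, conflicts) out."""
    record, conflicts = {}, []
    for field in sorted(set(schedule) | set(certificate)):
        s, c = schedule.get(field), certificate.get(field)
        winner = AUTHORITY.get(field, "schedule")
        differs = (
            s is not None and c is not None
            and s.strip().lower() != c.strip().lower()
        )
        if differs:
            conflicts.append({"field": field, "winner": winner})
        preferred, fallback = (c, s) if winner == "certificate" else (s, c)
        record[field] = preferred if preferred is not None else fallback
    return record, conflicts


def run_fixtures(arbitrate) -> list:
    """Return the names of fixtures whose value or conflict count is wrong."""
    failures = []
    for fx in FIXTURES:
        record, conflicts = arbitrate(fx["schedule"], fx["certificate"])
        value = record["cover_and_excesses.class_of_use"]
        if value != fx["expected_value"] or len(conflicts) != fx["expected_conflicts"]:
            failures.append(fx["name"])
    return failures


failing = run_fixtures(toy_arbitrate)
```

The second fixture matters as much as the first: it checks that cosmetic differences do not flood reviewers with false conflicts.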

Reusable idea

This pattern travels beyond insurance documents.

Insurance is just a useful test case because the hierarchy is easy to explain. The same pattern appears in lending, claims, onboarding, medical administration, legal intake, procurement, compliance, and finance operations.

Any workflow with multiple source documents needs the same separation: extract each source, preserve evidence, apply authority rules, record conflict state, and give reviewers a clear way to inspect and correct the result.

The broader AI Tool Stack lesson

A production AI system should not pretend disagreement does not exist. It should make disagreement visible, bounded, testable, and reviewable.

Next core chapter

Designing the Human Review Loop for AI Extraction.

The next core chapter moves from conflict state to reviewer workflow: how people inspect evidence, verify fields, flag uncertainty, and override values when automation reaches its limit.