Problem
Two documents can both look valid and still disagree.
A policy pack is not a single source of truth. It is a small bundle of overlapping documents. A Schedule may contain detailed customer, vehicle, premium, and excess information. A Certificate may carry legal-use wording and entitlement fields. The values overlap, but they do not always match.
If the system extracts each document and then simply returns one JSON object, the disagreement can disappear. That is dangerous because the final value may look clean even when the evidence behind it is contested.
The useful output is not only the selected value. It is the selected value plus the competing value, the field path, the winning source, and a review path.
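As a minimal sketch of that output shape, a single field result could carry the winner and the evidence against it in one record. The class name, field names, and sample values below are assumptions for illustration and are not taken from the PolicyTrace codebase.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldResult:
    """Hypothetical per-field output: the chosen value plus its contested evidence."""
    field_path: str
    selected_value: str
    winning_source: str
    competing_value: Optional[str] = None   # the value that lost, if any
    competing_source: Optional[str] = None  # the document the losing value came from
    review_status: str = "unreviewed"       # the review path for this field

    @property
    def is_contested(self) -> bool:
        # A field is contested when a losing value was recorded.
        return self.competing_value is not None


# Invented sample values, for illustration only.
result = FieldResult(
    field_path="vehicle.registration",
    selected_value="AB12 CDE",
    winning_source="schedule",
    competing_value="AB12 CDF",
    competing_source="certificate",
    review_status="needs_review",
)
```

The point of the shape is that a consumer cannot read `selected_value` without the contested state sitting next to it.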
Failure example
Silent overwrite is the failure mode.
The simplest extraction pipeline asks a model to read a pack and produce JSON. That can work for demos, but it leaves conflict handling undefined. If two documents disagree, which value should the model choose? The latest one? The clearest one? The one mentioned twice?
Output-only extraction
The result looks authoritative, but the source disagreement is gone.
Arbitrated extraction
The Golden Record has a winner, and the conflict remains inspectable.
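The contrast between the two approaches can be sketched in a few lines. This is an illustration under stated assumptions: the function names, the flat dictionaries, and the sample values are all hypothetical, and the real pipeline works on structured extractions rather than plain dicts.

```python
from typing import Any

# Invented sample extractions, for illustration only.
schedule = {"registration": "AB12 CDE", "excess": 250}
certificate = {"registration": "AB12 CDF", "permitted_use": "Social, domestic and pleasure"}


def output_only_merge(*sources: dict[str, Any]) -> dict[str, Any]:
    """Later sources silently overwrite earlier ones; no trace of the loser remains."""
    merged: dict[str, Any] = {}
    for source in sources:
        merged.update(source)
    return merged


def arbitrated_merge(
    schedule: dict[str, Any],
    certificate: dict[str, Any],
    winner_by_field: dict[str, str],
) -> tuple[dict[str, Any], list[dict[str, Any]]]:
    """Return the merged record and a list of the disagreements behind it."""
    record: dict[str, Any] = {}
    conflicts: list[dict[str, Any]] = []
    for field in sorted(schedule.keys() | certificate.keys()):
        s_val, c_val = schedule.get(field), certificate.get(field)
        if s_val is not None and c_val is not None and s_val != c_val:
            winner = winner_by_field[field]
            conflicts.append(
                {"field": field, "schedule": s_val, "certificate": c_val, "winner": winner}
            )
            record[field] = s_val if winner == "schedule" else c_val
        else:
            record[field] = s_val if s_val is not None else c_val
    return record, conflicts


naive = output_only_merge(schedule, certificate)
record, conflicts = arbitrated_merge(schedule, certificate, {"registration": "schedule"})
```

In the output-only result the Certificate's registration wins only because it was merged last, and nothing records that the Schedule disagreed. The arbitrated result carries the same kind of record plus one conflict entry that names both values and the winner.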
Architecture concept
Arbitration is a system rule, not an LLM opinion.
PolicyTrace uses the model to extract structured fields from each source document, but it does not ask the model to decide the final truth of the policy pack. That responsibility belongs to the arbiter.
This matters because confidence and authority are different ideas. A model might be very confident that it read a value correctly. That does not mean the value should win. The winning source should be chosen by a domain rule that the team can test, review, and change deliberately.
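To make the distinction concrete, here is a small sketch in which the model's confidence is recorded but plays no part in choosing the winner. The `Extracted` type, the `AUTHORITY` table, and the field paths are hypothetical and only illustrate the idea of a testable domain rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extracted:
    value: str
    source: str        # "schedule" or "certificate"
    confidence: float  # how sure the model is that it read the value correctly


# Hypothetical authority table, for illustration only.
AUTHORITY = {
    "vehicle.registration": "schedule",
    "cover.permitted_use": "certificate",
}


def pick_winner(field_path: str, candidates: list[Extracted]) -> Extracted:
    """Choose by configured authority; confidence is deliberately ignored."""
    authoritative_source = AUTHORITY[field_path]
    for candidate in candidates:
        if candidate.source == authoritative_source:
            return candidate
    raise LookupError(f"no {authoritative_source} value for {field_path}")


candidates = [
    Extracted("AB12 CDF", source="certificate", confidence=0.99),
    Extracted("AB12 CDE", source="schedule", confidence=0.80),
]
winner = pick_winner("vehicle.registration", candidates)
```

Here the Schedule value wins even though the Certificate reading has the higher confidence, because the rule, not the score, decides authority.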
Implementation
The current repo makes conflict handling deterministic.
The key implementation is `PolicyArbiter` in `src/arbiter.py`. It receives one Schedule extraction and one Certificate extraction, builds a single `UKMotorGoldenRecord`, and returns the merged record plus a list of conflicts.
The helper `_check_conflict` does a narrow but important job: when both source values exist and differ after simple normalization, it records the field path, the Schedule value, the Certificate value, and the configured winner. Then it returns the winning value for the Golden Record.
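The behaviour described above might look roughly like the following. This is a sketch, not the code in `src/arbiter.py`: the function name `check_conflict_sketch`, the `Conflict` fields, the signature, and the normalization (collapse whitespace, ignore case) are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Conflict:
    field_path: str
    schedule_value: str
    certificate_value: str
    winner: str  # "schedule" or "certificate"


def _normalize(value: str) -> str:
    # Assumed normalization: collapse runs of whitespace and ignore case.
    return " ".join(value.split()).casefold()


def check_conflict_sketch(
    field_path: str,
    schedule_value: Optional[str],
    certificate_value: Optional[str],
    winner: str,
    conflicts: list[Conflict],
) -> Optional[str]:
    """Record a conflict when both values exist and differ; return the winning value."""
    # With only one source present there is nothing to arbitrate.
    if schedule_value is None:
        return certificate_value
    if certificate_value is None:
        return schedule_value
    if _normalize(schedule_value) != _normalize(certificate_value):
        conflicts.append(Conflict(field_path, schedule_value, certificate_value, winner))
    return schedule_value if winner == "schedule" else certificate_value


conflicts: list[Conflict] = []
same = check_conflict_sketch("vehicle.registration", "AB12 CDE", "ab12  cde", "schedule", conflicts)
chosen = check_conflict_sketch("vehicle.registration", "AB12 CDE", "AB12 CDF", "schedule", conflicts)
```

The first call records nothing because the values match after normalization. The second appends one `Conflict` and still returns the Schedule value, so the Golden Record gets a winner and the disagreement stays on file.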
That design is intentionally boring in the best sense. It is inspectable. It is testable. It can be explained to a business reviewer. And when the hierarchy changes, the change belongs in system rules and tests, not in a vague prompt instruction.
Reviewer visibility
A reviewer needs the conflict, not just the chosen answer.
The API already returns conflicts inside `GoldenRecordWithProvenance`, alongside the record, provenance list, and session ID. That creates the foundation for a reviewer-visible conflict experience.
The current React UI already defines the conflict type and provides the main review surface for fields, evidence, verification, and overrides. A dedicated conflict panel or conflict queue is still an outstanding UI hardening step. That is the honest production lesson: recording conflict state is necessary, but the product still has to make it hard to miss.
FIELD: Field-level signal
Mark the Golden Record row when the chosen value came from a conflict rather than a single uncontested source.
WHY: Authority explanation
Show the rule that decided the winner, such as Schedule wins for vehicle detail or Certificate wins for legal-use fields.
SRC: Both source values
Keep the losing value visible so the reviewer can see what was discarded and where it came from.
ACT: Review action
Let the reviewer verify the winner, flag the field, override it, or escalate the pack when the rule is not enough.
Production pattern
The hierarchy of truth should become configuration and audit history.
The current implementation is a reference project, so its rules live in Python. That is fine for a compact system. In production, those rules usually need stronger governance: versioned configuration, domain owner approval, regression fixtures, and audit history for rule changes.
That is especially true in regulated or operational workflows. When a reviewer asks why a value won, the answer should not be "the model chose it." It should be "rule version 12 says Certificate is authoritative for this field, the Schedule disagreed, and this reviewer approved or overrode the result."
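One way to make that answer reproducible is to store the rule version and the reviewer action next to each decision. The sketch below assumes hypothetical `AuthorityRule` and `AuditEntry` records with invented sample values; a production system would persist these records rather than build them in memory.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AuthorityRule:
    version: int
    field_path: str
    authoritative_source: str  # "schedule" or "certificate"
    approved_by: str           # domain owner who signed off this rule version


@dataclass(frozen=True)
class AuditEntry:
    field_path: str
    rule: AuthorityRule
    losing_source: str
    reviewer: Optional[str] = None
    reviewer_action: Optional[str] = None  # e.g. "approved" or "overrode"


def explain(entry: AuditEntry) -> str:
    """Build a human-readable answer to 'why did this value win?'."""
    rule = entry.rule
    parts = [
        f"rule version {rule.version} says {rule.authoritative_source} "
        f"is authoritative for {entry.field_path}",
        f"{entry.losing_source} disagreed",
    ]
    if entry.reviewer and entry.reviewer_action:
        parts.append(f"{entry.reviewer} {entry.reviewer_action} the result")
    else:
        parts.append("no reviewer has acted yet")
    return "; ".join(parts)


# Invented sample data, for illustration only.
rule_v12 = AuthorityRule(12, "cover.permitted_use", "certificate", approved_by="underwriting-lead")
entry = AuditEntry("cover.permitted_use", rule_v12, losing_source="schedule",
                   reviewer="reviewer-7", reviewer_action="approved")
explanation = explain(entry)
```

Because the explanation is derived from stored records, the same question asked a year later yields the same answer, even if the rule has since moved to a newer version.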
RULE: Rules are explicit
Authority belongs in code or configuration where it can be reviewed, tested, and versioned.
LOG: Conflicts are durable
Conflict entries should travel with the result and later become part of an audit trail.
EVAL: Conflicts are test cases
Evaluation data should include disagreement fixtures, not only clean extraction examples.
UI: Review is designed
The reviewer surface should make conflict obvious, explain the winner, and preserve action history.
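The idea of treating conflicts as test cases can be sketched as a small fixture table that mixes clean packs with deliberate disagreements. Everything here is hypothetical: `arbitrate` is a stand-in for whatever arbiter is under test, and the authority table and fixture values are invented.

```python
from typing import Any, Optional

# Hypothetical authority table for the stand-in arbiter below.
AUTHORITY = {"vehicle.registration": "schedule", "cover.permitted_use": "certificate"}


def arbitrate(field_path: str, schedule_value: Optional[str],
              certificate_value: Optional[str]) -> tuple[Optional[str], bool]:
    """Stand-in arbiter: returns (winning value, whether a conflict was recorded)."""
    contested = (schedule_value is not None and certificate_value is not None
                 and schedule_value != certificate_value)
    if AUTHORITY[field_path] == "schedule":
        value = schedule_value if schedule_value is not None else certificate_value
    else:
        value = certificate_value if certificate_value is not None else schedule_value
    return value, contested


# Disagreement fixtures sit next to clean ones, so both paths are exercised.
FIXTURES: list[dict[str, Any]] = [
    {"name": "clean match", "field_path": "vehicle.registration",
     "schedule": "AB12 CDE", "certificate": "AB12 CDE",
     "expected_value": "AB12 CDE", "expected_conflict": False},
    {"name": "registration disagreement", "field_path": "vehicle.registration",
     "schedule": "AB12 CDE", "certificate": "AB12 CDF",
     "expected_value": "AB12 CDE", "expected_conflict": True},
    {"name": "permitted use disagreement", "field_path": "cover.permitted_use",
     "schedule": "Social only", "certificate": "Social and commuting",
     "expected_value": "Social and commuting", "expected_conflict": True},
    {"name": "certificate silent", "field_path": "vehicle.registration",
     "schedule": "AB12 CDE", "certificate": None,
     "expected_value": "AB12 CDE", "expected_conflict": False},
]


def run_fixtures(fixtures: list[dict[str, Any]]) -> list[str]:
    """Return the names of fixtures whose outcome differs from the expectation."""
    failures = []
    for fx in fixtures:
        got = arbitrate(fx["field_path"], fx["schedule"], fx["certificate"])
        if got != (fx["expected_value"], fx["expected_conflict"]):
            failures.append(fx["name"])
    return failures


failures = run_fixtures(FIXTURES)
```

Each fixture asserts two things: which value wins and whether a conflict was recorded. A rule change that flips a winner or drops a conflict entry then shows up as a named regression.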
Reusable idea
This pattern travels beyond insurance documents.
Insurance is just a useful test case because the hierarchy is easy to explain. The same pattern appears in lending, claims, onboarding, medical administration, legal intake, procurement, compliance, and finance operations.
Any workflow with multiple source documents needs the same separation: extract each source, preserve evidence, apply authority rules, record conflict state, and give reviewers a clear way to inspect and correct the result.
A production AI system should not pretend disagreement does not exist. It should make disagreement visible, bounded, testable, and reviewable.