PolicyTrace implementation note C

Evaluating PolicyTrace: Golden Examples, Conflicts, and Review Outcomes

A practical evaluation design for Document AI systems that need more than field accuracy: source evidence, conflict handling, reviewer decisions, and release gates.

PolicyTrace · Evaluation · Evidence Trail · Human Review

Evaluation Map

For PolicyTrace, the useful question is not only "did the model extract the value?" It is "did the system produce a reviewable, evidence-backed result that survives conflict and change?"

Golden expected output · Evidence and provenance · Reviewer outcome · Release gate

01. Golden examples (fixtures): Representative policy packs with expected records, conflicts, citations, and review decisions.

02. Field checks (record diff): Compare extracted Golden Record fields against expected values and acceptable normalizations.

03. Conflict checks (ConflictEntry): Test hierarchy-of-truth decisions, losing values, winning source, and conflict visibility.

04. Evidence checks (FieldProvenance): Confirm field paths, source filenames, matched text, match score, page, and bounding box quality.

05. Review outcomes (review_state): Measure which fields are verified, flagged, overridden, or missing evidence.

06. Release gates (regression suite): Block risky changes to prompts, models, schemas, arbiter rules, or provenance logic.

Current implementation boundary

The repo already has deterministic arbiter tests and the data structures needed for evaluation. A full production evaluation harness is the next layer; the current repo does not claim to provide one.

Why this matters

Simple accuracy misses the real workflow. A value can be correct but unsupported, supported but overridden, or arbitrated correctly but still risky enough to review.

Core thesis

Document AI evaluation is not just field accuracy.

For PolicyTrace, a field-level accuracy score would be useful, but incomplete. The workflow also needs to know whether the field has evidence, whether conflicting documents were handled correctly, whether the reviewer trusted the result, and whether a change made any of those signals worse.

That is why I would evaluate PolicyTrace as a system, not just as a model response. The model extracts candidates. The arbiter chooses winners. Provenance links values back to PDFs. Reviewers verify, flag, or override the output. Each layer needs its own checks.

Evaluation should follow the same path as the workflow.

If the product promise is reviewable extraction, the evaluation plan should test reviewability, not only JSON similarity.

1. Compare extracted fields to expected Golden Record values.
2. Check source evidence, conflict metadata, and reviewer-facing decisions.
3. Run the same checks before changing prompts, models, schemas, or arbitration rules.

Golden examples

The evaluation set should be made of realistic document packs.

A single policy PDF is not enough. PolicyTrace is interesting because it works across a pack: Schedule, Certificate, Statement of Fact, and boilerplate wording. The evaluation set should preserve that multi-document shape.

Each golden example should include the input PDFs, the expected canonical Golden Record, expected conflicts, expected source citations, and expected review outcomes for fields that are ambiguous, missing, or high impact.
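One way to make that list concrete is a small fixture type. The sketch below is illustrative only: the class names, field names, filenames, and values are assumptions made for this example, not the repo's actual schema.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ExpectedConflict:
    field_path: str       # hypothetical path, e.g. "policy.class_of_use"
    winning_source: str   # document type expected to win
    losing_value: Any     # value that should be recorded as overruled


@dataclass
class ExpectedCitation:
    field_path: str
    source_filename: str
    page: int


@dataclass
class GoldenExample:
    name: str
    input_pdfs: list[str]
    expected_record: dict[str, Any]
    expected_conflicts: list[ExpectedConflict] = field(default_factory=list)
    expected_citations: list[ExpectedCitation] = field(default_factory=list)
    # field path -> expected review outcome ("verified", "flagged", "overridden")
    expected_review: dict[str, str] = field(default_factory=dict)


# Made-up pack: every value here is invented for illustration.
example = GoldenExample(
    name="conflict-pack-01",
    input_pdfs=["schedule.pdf", "certificate.pdf", "statement_of_fact.pdf"],
    expected_record={"policy": {"policy_number": "PT-0001", "class_of_use": "SDP+C"}},
    expected_conflicts=[ExpectedConflict("policy.class_of_use", "certificate", "SDP")],
    expected_citations=[ExpectedCitation("policy.class_of_use", "certificate.pdf", 1)],
    expected_review={"policy.class_of_use": "flagged"},
)
```

Keeping conflicts, citations, and review expectations next to the expected record means one fixture can drive every check in the evaluation map.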

PACKDocument pack

Include clean packs, conflict packs, missing-field packs, and documents with awkward wording or formatting.

RECExpected record

Define expected values in the schema, including accepted normalizations for dates, currency, booleans, and names.

REVExpected review

Mark fields that should be verified, flagged, overridden, or escalated because evidence is weak.

Field evaluation

The Golden Record check should be field-aware.

PolicyTrace uses a nested Pydantic schema for the Golden Record. That makes evaluation easier because every value has a field path. The evaluation harness should flatten the record, compare each field to the expected value, and report the result by section and field family.
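A minimal sketch of the flattening step, assuming the record has already been dumped to plain dicts and lists (for a Pydantic v2 model, `model_dump()` would produce that shape). The field names and the dotted path format are assumptions for illustration; the repo's own field paths may look different.

```python
import re
from typing import Any, Iterator


def flatten(record: dict[str, Any], prefix: str = "") -> Iterator[tuple[str, Any]]:
    """Yield (field_path, value) pairs for every leaf in a nested record."""
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        elif isinstance(value, list):
            for index, item in enumerate(value):
                item_path = f"{path}[{index}]"
                if isinstance(item, dict):
                    yield from flatten(item, item_path)
                else:
                    yield item_path, item
        else:
            yield path, value


def section_of(path: str) -> str:
    """Top-level section name, used to group results in a report."""
    return re.split(r"[.\[]", path)[0]


# Hypothetical record shape.
record = {
    "policy": {"policy_number": "PT-0001"},
    "drivers": [{"name": "A Smith"}],
}
flat = dict(flatten(record))
# flat == {"policy.policy_number": "PT-0001", "drivers[0].name": "A Smith"}
```

Once every value has a path, per-section and per-field-family reporting is a matter of grouping paths with something like `section_of`.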

Some fields need exact matches. Others need normalized comparison. A date may be extracted as an ISO value even though the PDF text uses a UK date phrase. A currency value may drop a symbol. A driver name may need whitespace or initial handling. The evaluation should encode those rules explicitly.

EXACT: Exact fields

Policy number, VRM, insurer, and boolean fields should usually match directly after trimming.

NORM: Normalized fields

Dates, currency, names, and class-of-use labels may need canonical comparison rules.

MISS: Missingness

Missing required or expected fields should be reported separately from incorrect values.
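The three categories can be sketched as a per-field comparison over flattened records. Everything named here is hypothetical: the normalizer functions, the accepted date formats, the field paths, and the sample values. The real rules would come from the schema and the golden fixtures.

```python
import re
from datetime import datetime
from decimal import Decimal, InvalidOperation
from typing import Any, Callable

_MISSING = object()


def norm_exact(value: Any) -> Any:
    """Exact fields: trim surrounding whitespace, nothing else."""
    return value.strip() if isinstance(value, str) else value


def norm_date(value: str) -> str:
    """Canonicalise a few assumed date formats to ISO 8601."""
    text = value.strip()
    # "%B" month names depend on locale; English is assumed here.
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%d %B %Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return text  # unknown format: compare the raw text


def norm_currency(value: Any) -> Any:
    """Drop symbols and thousands separators, compare as Decimal."""
    digits = re.sub(r"[^\d.\-]", "", str(value))
    try:
        return Decimal(digits)
    except InvalidOperation:
        return str(value).strip()


def compare_field(path: str, expected: dict[str, Any], actual: dict[str, Any],
                  normalizers: dict[str, Callable[[Any], Any]]) -> str:
    """Return "match", "mismatch", or "missing" for one field path."""
    value = actual.get(path, _MISSING)
    if value is _MISSING or value is None:
        return "missing"
    normalize = normalizers.get(path, norm_exact)
    return "match" if normalize(value) == normalize(expected[path]) else "mismatch"


normalizers = {"policy.start_date": norm_date, "policy.premium": norm_currency}
expected = {"policy.policy_number": "PT-0001", "policy.start_date": "2024-03-01",
            "policy.premium": "1250.00", "policy.insurer": "Acme Insurance",
            "vehicle.vrm": "AB12 CDE"}
actual = {"policy.policy_number": " PT-0001 ", "policy.start_date": "1 March 2024",
          "policy.premium": "£1,250", "policy.insurer": "Acme Ins"}

report = {path: compare_field(path, expected, actual, normalizers) for path in expected}
```

In this made-up run the insurer comes back as a mismatch and the VRM as missing, which keeps "wrong value" and "no value" as separate lines in the report.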

Conflict evaluation

Conflict handling deserves its own tests.

The repo already has deterministic tests for the PolicyArbiter. That is the right seed for a broader evaluation layer. The production version should keep testing cases where Schedule and Certificate agree, disagree, or each fill gaps the other leaves.

The key is to evaluate more than the final winner. The system should record the field path, both source values, and the winning source. A correct final value is not enough if the disagreement was silently buried.

A conflict test should assert the decision record, not only the output.

PolicyTrace should be able to explain which source won and what value lost.

1. When policy numbers disagree, the expected winner should match the source authority rule.
2. When class of use disagrees, the Certificate should win because it is authoritative for that field.
3. When values match, no unnecessary conflict should be emitted.
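A sketch of what such tests could look like against a toy arbiter. The `arbitrate` function, the `ConflictRecord` type, and the authority table are stand-ins written for this example and are not the repo's PolicyArbiter or ConflictEntry. The rule that the Schedule wins for policy number is an assumption; only the Certificate rule for class of use comes from the text above.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical authority table: which document wins each field on disagreement.
AUTHORITY = {"policy_number": "schedule", "class_of_use": "certificate"}


@dataclass(frozen=True)
class ConflictRecord:
    field_path: str
    schedule_value: Any
    certificate_value: Any
    winning_source: str


def arbitrate(field_path: str, schedule_value: Any,
              certificate_value: Any) -> tuple[Any, Optional[ConflictRecord]]:
    """Return (winning value, conflict record or None if nothing disagreed)."""
    if schedule_value is None:
        return certificate_value, None   # partial fill, not a conflict
    if certificate_value is None:
        return schedule_value, None
    if schedule_value == certificate_value:
        return schedule_value, None      # agreement emits no conflict
    winner = AUTHORITY.get(field_path, "schedule")
    value = schedule_value if winner == "schedule" else certificate_value
    return value, ConflictRecord(field_path, schedule_value, certificate_value, winner)


def test_class_of_use_conflict_records_decision():
    value, conflict = arbitrate("class_of_use", "SDP", "SDP+C")
    assert value == "SDP+C"
    # Assert the whole decision record: path, both values, and the winner.
    assert conflict == ConflictRecord("class_of_use", "SDP", "SDP+C", "certificate")


def test_policy_number_follows_authority_rule():
    value, conflict = arbitrate("policy_number", "PT-0001", "PT-0002")
    assert value == "PT-0001"
    assert conflict is not None and conflict.winning_source == "schedule"


def test_matching_values_emit_no_conflict():
    assert arbitrate("class_of_use", "SDP", "SDP") == ("SDP", None)


def test_partial_fill_is_not_a_conflict():
    assert arbitrate("policy_number", None, "PT-0001") == ("PT-0001", None)
```

The first test would still fail if the arbiter returned the right value but dropped the losing value or named the wrong source, which is the point of asserting the record.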

Evidence evaluation

Evidence quality is a separate dimension from value quality.

A field can be correct and still be hard to review if the system cannot show where it came from. PolicyTrace already stores provenance with a field path, extracted value, matched text, match score, source filename, page, and bounding box. Those should become evaluation targets.

I would track citation coverage, match quality, source correctness, and location sanity. If a value came from the Schedule, the provenance should not point at the Certificate. If a field has no match, the reviewer should see that clearly instead of being given false confidence.

COV: Citation coverage

What share of expected fields have usable source evidence?

SRC: Source correctness

Does the provenance point to the expected document and page?

BOX: Location sanity

Is the bounding box present, inside page bounds, and visually useful for review?
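The first two checks and the mechanical part of the third can be sketched as plain functions. The `Provenance` class below is a simplified stand-in for this example, not the repo's FieldProvenance model, and the 0.8 score threshold, the page size, and the sample values are assumptions. Whether a box is visually useful still needs a human look.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Provenance:
    field_path: str
    source_filename: Optional[str]
    page: Optional[int]
    match_score: float
    bbox: Optional[tuple[float, float, float, float]]  # (x0, y0, x1, y1)


def has_usable_evidence(p: Provenance, min_score: float = 0.8) -> bool:
    """Usable means: a source, a page, and a match score above the threshold."""
    return p.source_filename is not None and p.page is not None and p.match_score >= min_score


def citation_coverage(expected_paths: list[str], provenance: list[Provenance],
                      min_score: float = 0.8) -> float:
    """Share of expected fields that have usable source evidence."""
    if not expected_paths:
        return 0.0
    by_path = {p.field_path: p for p in provenance}
    covered = sum(
        1 for path in expected_paths
        if path in by_path and has_usable_evidence(by_path[path], min_score)
    )
    return covered / len(expected_paths)


def source_is_correct(p: Provenance, expected_filename: str, expected_page: int) -> bool:
    return p.source_filename == expected_filename and p.page == expected_page


def bbox_is_sane(p: Provenance, page_width: float, page_height: float) -> bool:
    """Box must exist, have positive area, and sit inside the page."""
    if p.bbox is None:
        return False
    x0, y0, x1, y1 = p.bbox
    return 0 <= x0 < x1 <= page_width and 0 <= y0 < y1 <= page_height


expected_paths = ["policy.policy_number", "policy.class_of_use",
                  "vehicle.vrm", "policy.start_date"]
provenance = [
    Provenance("policy.policy_number", "schedule.pdf", 1, 0.97, (72, 100, 220, 116)),
    Provenance("policy.class_of_use", "certificate.pdf", 1, 0.91, (72, 300, 400, 316)),
    Provenance("vehicle.vrm", "schedule.pdf", 1, 0.42, None),  # weak match, no box
]

coverage = citation_coverage(expected_paths, provenance)  # 2 of 4 fields
```

Reporting coverage separately from field accuracy makes the "correct but unsupported" case visible instead of letting it hide inside a match rate.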

Review outcomes

The reviewer is part of the evaluation signal.

PolicyTrace supports verify, flag (reject), and override actions in its session review state. In a production evaluation loop, those actions should be treated as feedback. Frequent overrides on a field are not just UI events. They are evidence that the extraction, arbitration, prompt, schema, or source evidence needs attention.

Review outcomes also help prioritize. A low-impact field with weak evidence may be acceptable for a demo. A high-impact field with frequent overrides should become a release blocker.

VER: Verified rate

Which fields are reviewers consistently willing to approve?

FLG: Flag rate

Which fields create uncertainty, missing evidence, or repeated escalation?

OVR: Override rate

Which fields are often corrected, and what pattern appears in the corrected values?
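These rates are simple to compute once review actions are logged per field. The sketch below assumes a log of `(field_path, action)` pairs. The log format, the action names, the high-impact field set, and the 0.2 threshold are all assumptions for illustration.

```python
from collections import Counter, defaultdict

ACTIONS = ("verified", "flagged", "overridden")


def review_rates(events: list[tuple[str, str]]) -> dict[str, dict[str, float]]:
    """Per-field share of each review action."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for field_path, action in events:
        counts[field_path][action] += 1
    rates = {}
    for field_path, actions in counts.items():
        total = sum(actions.values())
        rates[field_path] = {a: actions[a] / total for a in ACTIONS}
    return rates


def release_blockers(rates: dict[str, dict[str, float]], high_impact: set[str],
                     max_override_rate: float = 0.2) -> list[str]:
    """High-impact fields whose override rate exceeds the threshold."""
    return sorted(
        path for path, r in rates.items()
        if path in high_impact and r["overridden"] > max_override_rate
    )


# Made-up review log.
events = [
    ("policy.class_of_use", "overridden"),
    ("policy.class_of_use", "overridden"),
    ("policy.class_of_use", "verified"),
    ("policy.class_of_use", "flagged"),
    ("policy.policy_number", "verified"),
    ("policy.policy_number", "verified"),
]

rates = review_rates(events)
blockers = release_blockers(rates, high_impact={"policy.class_of_use"})
```

With this log, class of use is overridden half the time and lands in the blocker list, while policy number is verified every time.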

Regression gates

Evaluation should run before changing the system.

PolicyTrace has several moving parts that can improve one thing while breaking another: prompts, model choice, schema fields, source authority rules, PII masking, page caps, and provenance matching. A production workflow needs regression gates before those parts change.

The gate should not be a single pass/fail score. It should say what changed: field accuracy moved, conflict behavior changed, citation coverage dropped, override risk increased, or latency moved outside the budget.
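A minimal sketch of a gate that reports per-dimension findings by comparing a candidate run against a stored baseline. The metric names, tolerances, and numbers are placeholders chosen for this example, not measured results or agreed thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    metric: str
    higher_is_better: bool
    tolerance: float  # how much movement in the bad direction is allowed


# Hypothetical rule set; real thresholds would be agreed per field family.
RULES = [
    Rule("field_match_rate", True, 0.01),
    Rule("conflict_correctness", True, 0.0),
    Rule("citation_coverage", True, 0.02),
    Rule("override_rate", False, 0.01),
    Rule("p95_latency_seconds", False, 2.0),
]


def gate(baseline: dict[str, float], candidate: dict[str, float],
         rules: list[Rule] = RULES) -> list[dict]:
    """Return one finding per metric instead of a single pass/fail flag."""
    findings = []
    for rule in rules:
        delta = candidate[rule.metric] - baseline[rule.metric]
        regression = -delta if rule.higher_is_better else delta
        # Round before comparing so float noise does not trip the gate.
        status = "regressed" if round(regression, 6) > rule.tolerance else "ok"
        findings.append({"metric": rule.metric, "delta": round(delta, 4), "status": status})
    return findings


# Placeholder numbers for illustration only.
baseline = {"field_match_rate": 0.94, "conflict_correctness": 1.0,
            "citation_coverage": 0.90, "override_rate": 0.05,
            "p95_latency_seconds": 12.0}
candidate = {"field_match_rate": 0.95, "conflict_correctness": 1.0,
             "citation_coverage": 0.84, "override_rate": 0.055,
             "p95_latency_seconds": 19.5}

findings = gate(baseline, candidate)
regressed = [f["metric"] for f in findings if f["status"] == "regressed"]
```

In this example the candidate improves field match rate but regresses citation coverage and latency, so the gate names those two dimensions instead of returning a bare failure.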

Current repo has

1. Deterministic unit tests for arbiter merge and conflict behavior.
2. Schema objects for Golden Record, FieldProvenance, ConflictEntry, and review state.
3. Debug metrics that can seed a broader evaluation and reporting layer.

Production would add

1. Golden document packs with expected outputs, conflicts, citations, and review outcomes.
2. Automated comparison reports by field family, document type, and failure mode.
3. Release thresholds for prompt, model, schema, arbiter, and provenance changes.

Production metrics

The useful dashboard is a workflow dashboard.

Once PolicyTrace is used repeatedly, the evaluation layer should feed an operational dashboard. The important metrics are not only model metrics. They are workflow metrics: how many fields were extracted, how many had evidence, how often conflicts appeared, how often reviewers overrode values, how long extraction took, and how much each pack cost.

Those metrics help the team decide what to improve next. Maybe the model is fine but provenance is weak. Maybe arbitration is correct but reviewers are still overriding one field family. Maybe latency is acceptable for small packs but not large booklets.

Evaluation becomes useful when it points to engineering action.

The goal is not a beautiful report. The goal is to know what broke, what improved, and where to spend engineering time.

1. Quality: field match rate, missing fields, citation coverage, conflict correctness.
2. Review: verified rate, flag rate, override rate, unresolved exceptions.
3. Operations: latency, pages processed, retries, failures, token usage, cost per pack.
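The operations row can be rolled up from per-pack run records. This is a sketch only: the record keys, pack names, and numbers are invented, and the cost unit is left abstract.

```python
from statistics import mean, median


def summarise(runs: list[dict]) -> dict:
    """Aggregate per-pack run records into dashboard-level numbers."""
    if not runs:
        raise ValueError("no runs to summarise")
    completed = [r for r in runs if not r["failed"]]
    return {
        "packs": len(runs),
        "failure_rate": (len(runs) - len(completed)) / len(runs),
        # Latency is only meaningful for runs that finished.
        "median_latency_s": median([r["latency_s"] for r in completed]) if completed else None,
        "pages_processed": sum(r["pages"] for r in completed),
        "retries": sum(r["retries"] for r in runs),
        "mean_cost_per_pack": mean([r["cost"] for r in runs]),
    }


# Invented run records; "cost" is in arbitrary units.
runs = [
    {"pack": "clean-01", "latency_s": 8.0, "pages": 12, "retries": 0, "failed": False, "cost": 0.25},
    {"pack": "conflict-01", "latency_s": 11.0, "pages": 14, "retries": 1, "failed": False, "cost": 0.5},
    {"pack": "booklet-01", "latency_s": 42.0, "pages": 80, "retries": 2, "failed": False, "cost": 1.0},
    {"pack": "scan-01", "latency_s": 0.0, "pages": 0, "retries": 3, "failed": True, "cost": 0.25},
]

summary = summarise(runs)
```

Using the median for latency keeps one large booklet from dominating the headline number, while the per-pack records remain available for the "small packs versus large booklets" question.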

Next implementation note

Prompt design as a bounded layer.

The next note closes the series by showing how prompts feed typed outputs, source citations, provenance, and evaluation gates without pretending to be the whole system.
