PolicyTrace implementation note C
Evaluating PolicyTrace: Golden Examples, Conflicts, and Review Outcomes
A practical evaluation design for Document AI systems that need more than field accuracy: source evidence, conflict handling, reviewer decisions, and release gates.
Evaluation Map
For PolicyTrace, the useful question is not only "did the model extract the value?" It is "did the system produce a reviewable, evidence-backed result that survives conflict and change?"
fixtures, record diff, ConflictEntry, FieldProvenance, review_state, regression suite
Current implementation boundary
The repo already has deterministic arbiter tests and the data structures needed for evaluation. A full production evaluation harness is the next layer; the current repo does not claim to include one.
Why this matters
Simple accuracy misses the real workflow. A value can be correct but unsupported, supported but overridden, or arbitrated correctly but still risky enough to review.
Core thesis
Document AI evaluation is not just field accuracy.
For PolicyTrace, a field-level accuracy score would be useful, but incomplete. The workflow also needs to know whether the field has evidence, whether conflicting documents were handled correctly, whether the reviewer trusted the result, and whether a change made any of those signals worse.
That is why I would evaluate PolicyTrace as a system, not just as a model response. The model extracts candidates. The arbiter chooses winners. Provenance links values back to PDFs. Reviewers verify, flag, or override the output. Each layer needs its own checks.
Evaluation should follow the same path as the workflow.
If the product promise is reviewable extraction, the evaluation plan should test reviewability, not only JSON similarity.
1. Compare extracted fields to expected Golden Record values.
2. Check source evidence, conflict metadata, and reviewer-facing decisions.
3. Run the same checks before changing prompts, models, schemas, or arbitration rules.
Golden examples
The evaluation set should be made of realistic document packs.
A single policy PDF is not enough. PolicyTrace is interesting because it works across a pack: Schedule, Certificate, Statement of Fact, and boilerplate wording. The evaluation set should preserve that multi-document shape.
Each golden example should include the input PDFs, the expected canonical Golden Record, expected conflicts, expected source citations, and expected review outcomes for fields that are ambiguous, missing, or high impact.
PACK: Document pack
Include clean packs, conflict packs, missing-field packs, and documents with awkward wording or formatting.
REC: Expected record
Define expected values in the schema, including accepted normalizations for dates, currency, booleans, and names.
REV: Expected review
Mark fields that should be verified, flagged, overridden, or escalated because evidence is weak.
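To make the shape of a golden example concrete, here is a minimal sketch of one as a plain data structure. Every name in it (`GoldenExample`, `ExpectedConflict`, `ExpectedCitation`, the pack kinds, and the review outcome labels) is an assumption for illustration, not something the repo defines.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical vocabularies; the real labels would come from the project.
PACK_KINDS = {"clean", "conflict", "missing_field", "awkward_format"}
REVIEW_OUTCOMES = {"verify", "flag", "override", "escalate"}


@dataclass(frozen=True)
class ExpectedConflict:
    field_path: str
    winning_source: str  # e.g. "schedule" or "certificate"
    losing_value: Any


@dataclass(frozen=True)
class ExpectedCitation:
    field_path: str
    source_filename: str
    page: int


@dataclass
class GoldenExample:
    pack_id: str
    kind: str
    pdf_paths: list[str]
    expected_record: dict[str, Any]
    expected_conflicts: list[ExpectedConflict] = field(default_factory=list)
    expected_citations: list[ExpectedCitation] = field(default_factory=list)
    expected_review: dict[str, str] = field(default_factory=dict)

    def problems(self) -> list[str]:
        """Return consistency problems in the fixture itself."""
        found = []
        if self.kind not in PACK_KINDS:
            found.append(f"unknown pack kind: {self.kind}")
        if not self.pdf_paths:
            found.append("pack has no input PDFs")
        if self.kind == "conflict" and not self.expected_conflicts:
            found.append("conflict pack declares no expected conflicts")
        for path, outcome in self.expected_review.items():
            if outcome not in REVIEW_OUTCOMES:
                found.append(f"{path}: unknown review outcome {outcome}")
        known_files = set(self.pdf_paths)
        for cite in self.expected_citations:
            if cite.source_filename not in known_files:
                found.append(
                    f"{cite.field_path}: cites {cite.source_filename}, not in pack"
                )
        return found
```

Validating the fixture before it is used keeps a broken golden example from being mistaken for a system regression.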
Field evaluation
The Golden Record check should be field-aware.
PolicyTrace uses a nested Pydantic schema for the Golden Record. That makes evaluation easier because every value has a field path. The evaluation harness should flatten the record, compare each field to the expected value, and report the result by section and field family.
Some fields need exact matches. Others need normalized comparison. A date may be extracted as an ISO value even though the PDF text uses a UK date phrase. A currency value may drop a symbol. A driver name may need whitespace or initial handling. The evaluation should encode those rules explicitly.
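A minimal sketch of the flattening step and a few explicit normalization rules follows. The function names and the accepted date formats are assumptions; if the Golden Record is a Pydantic v2 model, `model_dump()` would give the nested dict this starts from.

```python
import re
from datetime import date, datetime
from decimal import Decimal
from typing import Any


def flatten(record: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Turn a nested dict into {"section.field": value} pairs."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def norm_date(value: str) -> date:
    """Accept ISO dates and two UK-style forms (assumed formats)."""
    # Drop ordinal suffixes so "1st March 2024" parses like "1 March 2024".
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", value.strip())
    # "%B" month names depend on the locale; this assumes English names.
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%d %B %Y"):
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {value!r}")


def norm_currency(value: Any) -> Decimal:
    """Strip symbols and thousands separators, keep digits, dot, and sign."""
    return Decimal(re.sub(r"[^\d.\-]", "", str(value)))


def norm_name(value: str) -> str:
    """Collapse whitespace and ignore case."""
    return " ".join(value.split()).casefold()
```

Keeping each rule as a small named function makes it clear which leniency was applied to which field, which matters when a mismatch is later disputed.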
EXACT: Exact fields
Policy number, VRM, insurer, and boolean fields should usually match directly after trimming.
NORM: Normalized fields
Dates, currency, names, and class-of-use labels may need canonical comparison rules.
MISS: Missingness
Missing required or expected fields should be reported separately from incorrect values.
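The three categories above could be reported with a comparison like the following sketch. It assumes both records are already flat `{"section.field": value}` dicts; `compare_fields`, `summarize_by_section`, and the status labels are hypothetical names.

```python
from collections import Counter
from typing import Any, Callable, Optional


def _default_norm(value: Any) -> Any:
    # Exact fields: compare directly after trimming strings.
    return value.strip() if isinstance(value, str) else value


def compare_fields(
    expected: dict[str, Any],
    actual: dict[str, Any],
    normalizers: Optional[dict[str, Callable[[Any], Any]]] = None,
) -> dict[str, str]:
    """Label each expected field as match, mismatch, or missing."""
    normalizers = normalizers or {}
    results: dict[str, str] = {}
    for path, want in expected.items():
        got = actual.get(path)
        if got is None or got == "":
            # Missing is reported separately from a wrong value.
            results[path] = "missing"
            continue
        norm = normalizers.get(path, _default_norm)
        results[path] = "match" if norm(got) == norm(want) else "mismatch"
    return results


def summarize_by_section(results: dict[str, str]) -> dict[str, Counter]:
    """Count statuses per top-level section of the field path."""
    summary: dict[str, Counter] = {}
    for path, status in results.items():
        section = path.split(".", 1)[0]
        summary.setdefault(section, Counter())[status] += 1
    return summary
```

A per-section summary is what lets a report say "vehicle fields are missing" instead of only "accuracy dropped".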
Conflict evaluation
Conflict handling deserves its own tests.
The repo already has deterministic tests for the PolicyArbiter. That is the right seed for a broader evaluation layer. The production version should keep testing cases where Schedule and Certificate agree, disagree, or partially fill each other.
The key is to evaluate more than the final winner. The system should record the field path, both source values, and the winning source. A correct final value is not enough if the disagreement was silently buried.
A conflict test should assert the decision record, not only the output.
PolicyTrace should be able to explain which source won and which value lost.
1. When policy numbers disagree, the expected winner should match the source authority rule.
2. When class of use disagrees, the Certificate should win because it is authoritative for that field.
3. When values match, no unnecessary conflict should be emitted.
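The three cases above can be sketched as assertions on a decision record. The `arbitrate` function below is a toy stand-in written for this example, not the repo's PolicyArbiter, and `ConflictRecord` only mirrors the fields the text says a conflict should carry. The authority map is an assumption, including the choice of the Schedule as authoritative for the policy number.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ConflictRecord:
    field_path: str
    schedule_value: Any
    certificate_value: Any
    winning_source: str


def arbitrate(
    schedule: dict[str, Any],
    certificate: dict[str, Any],
    authority: dict[str, str],
    default_source: str = "schedule",
) -> tuple[dict[str, Any], list[ConflictRecord]]:
    """Toy merge: fill gaps, keep agreements, record every disagreement."""
    merged: dict[str, Any] = {}
    conflicts: list[ConflictRecord] = []
    for path in sorted(set(schedule) | set(certificate)):
        s, c = schedule.get(path), certificate.get(path)
        if s is None or c is None:
            # One side is empty: partial fill, not a conflict.
            merged[path] = c if s is None else s
        elif s == c:
            merged[path] = s
        else:
            winner = authority.get(path, default_source)
            merged[path] = c if winner == "certificate" else s
            conflicts.append(ConflictRecord(path, s, c, winner))
    return merged, conflicts


schedule = {
    "policy.number": "PN-1",
    "policy.class_of_use": "SDP",
    "vehicle.vrm": "AB12 CDE",
    "policy.insurer": None,
}
certificate = {
    "policy.number": "PN-2",
    "policy.class_of_use": "SDP + commuting",
    "vehicle.vrm": "AB12 CDE",
    "policy.insurer": "Acme",
}
# Assumed authority rules for the example only.
authority = {"policy.class_of_use": "certificate", "policy.number": "schedule"}

merged, conflicts = arbitrate(schedule, certificate, authority)
by_path = {c.field_path: c for c in conflicts}

assert merged["policy.class_of_use"] == "SDP + commuting"
assert by_path["policy.class_of_use"].winning_source == "certificate"
assert by_path["policy.class_of_use"].schedule_value == "SDP"  # the losing value
assert "vehicle.vrm" not in by_path      # agreement emits no conflict
assert "policy.insurer" not in by_path   # gap filling emits no conflict
```

The last two assertions matter as much as the first three: a suite that only checks winners would not notice spurious conflicts being emitted.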
Evidence evaluation
Evidence quality is a separate dimension from value quality.
A field can be correct and still be hard to review if the system cannot show where it came from. PolicyTrace already stores provenance with a field path, extracted value, matched text, match score, source filename, page, and bounding box. Those should become evaluation targets.
I would track citation coverage, match quality, source correctness, and location sanity. If a value came from the Schedule, the provenance should not point at the Certificate. If a field has no match, the reviewer should see that clearly instead of being given false confidence.
COV: Citation coverage
What share of expected fields have usable source evidence?
SRC: Source correctness
Does the provenance point to the expected document and page?
BOX: Location sanity
Is the bounding box present, inside page bounds, and visually useful for review?
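Here is a minimal sketch of how those three measures might be computed. The `Provenance` class mirrors the provenance fields listed earlier but its attribute names are assumptions, as are the `(x0, y0, x1, y1)` box convention, the single page size, and the `min_score` cutoff for "usable" evidence. Whether a box is visually useful still needs a human check.

```python
from dataclasses import dataclass
from typing import Optional

Box = tuple[float, float, float, float]  # assumed (x0, y0, x1, y1)


@dataclass
class Provenance:
    field_path: str
    source_filename: Optional[str]
    page: Optional[int]
    match_score: float
    bbox: Optional[Box]


def bbox_is_sane(bbox: Optional[Box], page_width: float, page_height: float) -> bool:
    """Present, non-empty, and inside the page."""
    if bbox is None:
        return False
    x0, y0, x1, y1 = bbox
    return 0 <= x0 < x1 <= page_width and 0 <= y0 < y1 <= page_height


def evidence_metrics(
    expected_sources: dict[str, tuple[str, int]],  # path -> (filename, page)
    provenance: list[Provenance],
    page_size: tuple[float, float],
    min_score: float = 0.8,  # assumed threshold for usable evidence
) -> dict[str, float]:
    by_path = {p.field_path: p for p in provenance}
    covered = source_ok = box_ok = 0
    for path, (filename, page) in expected_sources.items():
        p = by_path.get(path)
        if p is None or p.source_filename is None or p.match_score < min_score:
            continue  # no usable citation for this field
        covered += 1
        source_ok += (p.source_filename, p.page) == (filename, page)
        box_ok += bbox_is_sane(p.bbox, *page_size)
    total = len(expected_sources)
    return {
        "citation_coverage": covered / total if total else 0.0,
        # The next two are measured over covered fields only.
        "source_correctness": source_ok / covered if covered else 0.0,
        "location_sanity": box_ok / covered if covered else 0.0,
    }
```

Source correctness and location sanity are computed over covered fields so that a coverage drop does not hide inside the other two numbers.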
Review outcomes
The reviewer is part of the evaluation signal.
PolicyTrace supports three review actions in session review state: verify, reject or flag, and override. In a production evaluation loop, those actions should be treated as feedback. Frequent overrides on a field are not just UI events. They are evidence that the extraction, arbitration, prompt, schema, or source evidence needs attention.
Review outcomes also help prioritize. A low-impact field with weak evidence may be acceptable for a demo. A high-impact field with frequent overrides should become a release blocker.
VER: Verified rate
Which fields are reviewers consistently willing to approve?
FLG: Flag rate
Which fields create uncertainty, missing evidence, or repeated escalation?
OVR: Override rate
Which fields are often corrected, and what pattern appears in the corrected values?
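As a rough sketch, those rates could be computed from a log of reviewer actions. The `(field_path, action)` event shape, the action labels, and the 20% override threshold are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from typing import Iterable

ACTIONS = ("verify", "flag", "override")  # assumed action labels


def review_rates(events: Iterable[tuple[str, str]]) -> dict[str, dict[str, float]]:
    """Per-field share of each reviewer action."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for field_path, action in events:
        counts[field_path][action] += 1
    rates: dict[str, dict[str, float]] = {}
    for path, c in counts.items():
        total = sum(c.values())
        rates[path] = {a: c[a] / total for a in ACTIONS}
    return rates


def release_blockers(
    rates: dict[str, dict[str, float]],
    high_impact: set[str],
    max_override_rate: float = 0.2,  # assumed threshold
) -> list[str]:
    """High-impact fields whose override rate exceeds the threshold."""
    return sorted(
        path
        for path in high_impact
        if rates.get(path, {}).get("override", 0.0) > max_override_rate
    )


events = [
    ("policy.number", "verify"),
    ("policy.number", "verify"),
    ("policy.class_of_use", "verify"),
    ("policy.class_of_use", "override"),
    ("policy.class_of_use", "override"),
    ("policy.class_of_use", "flag"),
    ("driver.name", "override"),
    ("driver.name", "verify"),
]
rates = review_rates(events)
blockers = release_blockers(rates, {"policy.number", "policy.class_of_use"})
```

In this made-up log, `driver.name` is overridden just as often as `policy.class_of_use`, but only the latter is listed as high impact, so only it would block a release.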
Regression gates
Evaluation should run before changing the system.
PolicyTrace has several moving parts that can improve one thing while breaking another: prompts, model choice, schema fields, source authority rules, PII masking, page caps, and provenance matching. A production workflow needs regression gates before those parts change.
The gate should not be one giant pass or fail score. It should say what changed: field accuracy moved, conflict behavior changed, citation coverage dropped, override risk increased, or latency moved outside the budget.
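A gate of that kind might look like the sketch below: each metric gets its own direction and tolerance, and the result names what moved. The metric names, tolerances, and the `Rule` and `run_gate` helpers are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    higher_is_better: bool
    tolerance: float  # how much worse a metric may get before it blocks


def run_gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    rules: dict[str, Rule],
) -> tuple[bool, dict[str, tuple[str, float]]]:
    """Return (passed, {metric: (status, delta)}) for every gated metric."""
    findings: dict[str, tuple[str, float]] = {}
    for name, rule in rules.items():
        delta = candidate[name] - baseline[name]
        # Positive worse_by means the candidate moved in the bad direction.
        worse_by = -delta if rule.higher_is_better else delta
        if worse_by > rule.tolerance:
            status = "regressed"
        elif worse_by < 0:
            status = "improved"
        else:
            status = "within_tolerance"
        findings[name] = (status, round(delta, 4))
    passed = all(status != "regressed" for status, _ in findings.values())
    return passed, findings


rules = {
    "field_accuracy": Rule(higher_is_better=True, tolerance=0.01),
    "citation_coverage": Rule(higher_is_better=True, tolerance=0.02),
    "override_rate": Rule(higher_is_better=False, tolerance=0.02),
    "p95_latency_s": Rule(higher_is_better=False, tolerance=5.0),
}
baseline = {"field_accuracy": 0.94, "citation_coverage": 0.90,
            "override_rate": 0.10, "p95_latency_s": 40.0}
candidate = {"field_accuracy": 0.95, "citation_coverage": 0.82,
             "override_rate": 0.11, "p95_latency_s": 40.0}

passed, findings = run_gate(baseline, candidate, rules)
```

With these made-up numbers, field accuracy improves but citation coverage drops past its tolerance, so the gate fails and the findings say why. A single blended score would have hidden that trade.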
Current repo has
1. Deterministic unit tests for arbiter merge and conflict behavior.
2. Schema objects for Golden Record, FieldProvenance, ConflictEntry, and review state.
3. Debug metrics that can seed a broader evaluation and reporting layer.
Production would add
1. Golden document packs with expected outputs, conflicts, citations, and review outcomes.
2. Automated comparison reports by field family, document type, and failure mode.
3. Release thresholds for prompt, model, schema, arbiter, and provenance changes.
Production metrics
The useful dashboard is a workflow dashboard.
Once PolicyTrace is used repeatedly, the evaluation layer should feed an operational dashboard. The important metrics are not only model metrics. They are workflow metrics: how many fields were extracted, how many had evidence, how often conflicts appeared, how often reviewers overrode values, how long extraction took, and how much each pack cost.
Those metrics help the team decide what to improve next. Maybe the model is fine but provenance is weak. Maybe arbitration is correct but reviewers are still overriding one field family. Maybe latency is acceptable for small packs but not large booklets.
Evaluation becomes useful when it points to engineering action.
The goal is not a beautiful report. The goal is to know what broke, what improved, and where to spend engineering time.
1. Quality: field match rate, missing fields, citation coverage, conflict correctness.
2. Review: verified rate, flag rate, override rate, unresolved exceptions.
3. Operations: latency, pages processed, retries, failures, token usage, cost per pack.