
PolicyTrace system design chapter 05

Designing the Human Review Loop for AI Extraction

A practical look at the reviewer surface inside PolicyTrace: PDF evidence, structured fields, confidence signals, source highlighting, verification, flags, and overrides.

Review Loop Control Surface

The UI makes review a first-class workflow: source PDF on the left, Golden Record fields on the right, and reviewer actions attached to the exact field being inspected.

PolicyTrace review dashboard
42 fields, 36 located, 18 verified, 3 overridden
Documents: Schedule.pdf, Certificate.pdf, Statement-of-Fact.pdf
Golden Record: click any field to highlight its source location in the PDF.

Policy number: PTM-2026-00419 (98%). Source: Schedule.pdf, p.1, "Policy number PTM-2026-00419". Actions: V E F
Insurer: Example Insurance Ltd (96%). Source: Schedule.pdf, p.1, "Insurer: Example Insurance Ltd". Actions: V E F
Class of use: Social, Domestic and Pleasure (91%). Source: Certificate.pdf, p.1, "Use for social, domestic and pleasure". Actions: V E F
Annual mileage: Not extracted (62%). No location data; route to reviewer or cross-check rule. Actions: V E F
Voluntary excess: 250 GBP (93%). Source: Schedule.pdf, p.3, "Voluntary excess GBP 250". Actions: V E F
  • 1
    Select field: The reviewer chooses a Golden Record field that needs inspection.
  • 2
    Jump to evidence: The PDF pane opens the right document and page, then highlights the matched region.
  • 3
    Check confidence: The reviewer sees whether provenance matching is strong, weak, or missing.
  • 4
    Act on field: Verify, flag, or override the value with a field-level action.
  • 5
    Persist state: The session stores review decisions so the workflow can report what changed.
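The "check confidence" step can be sketched as a small classification function. This is a minimal illustration, not PolicyTrace's actual code: the names `EvidenceStrength` and `classify_evidence` and the 0.90 threshold are assumptions chosen for the example.

```python
from enum import Enum
from typing import Optional


class EvidenceStrength(Enum):
    STRONG = "strong"
    WEAK = "weak"
    MISSING = "missing"


def classify_evidence(match_score: Optional[float], has_location: bool,
                      strong_threshold: float = 0.90) -> EvidenceStrength:
    """Bucket a field's provenance match for display in the review UI.

    The threshold is hypothetical; tune it against reviewed samples.
    """
    # Without a score or a source location there is nothing to inspect.
    if match_score is None or not has_location:
        return EvidenceStrength.MISSING
    if match_score >= strong_threshold:
        return EvidenceStrength.STRONG
    return EvidenceStrength.WEAK


# Values mirror the dashboard above: a located 98% field and an
# unlocated 62% field.
print(classify_evidence(0.98, True).value)
print(classify_evidence(0.62, False).value)
```

Under these assumptions, a field with a score but no location is reported as missing evidence rather than weak evidence, because the reviewer has no region to check.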

Core thesis

Human review is not a failure of automation. It is the control surface.

Many AI extraction demos make review sound like an exception path: the model fails, so a person fixes it. In production workflows, that framing is too narrow. Review is where evidence, uncertainty, conflicts, and business judgement become visible.

PolicyTrace treats review as part of the system design. The reviewer is not asked to trust a JSON blob. They can inspect the source PDF, see which field was extracted, check the matched text, and make a field-level decision.

The reviewer should see the same chain the system used.

A useful review loop does not just show the final answer. It shows the field, the source, the page, the matched phrase, the confidence signal, and the available action.

  • 1
    Keep the source document visible next to the structured record.
  • 2
    Make field selection drive source highlighting, not manual PDF searching.
  • 3
    Persist review actions as workflow state, not just UI decoration.

Why review UX matters

The cost of review is mostly attention switching.

If a reviewer has to copy a value, search the PDF, compare it by eye, remember the conflict, and then update a spreadsheet, the AI system has not removed much work. It has just moved the work into a worse interface.

The split-screen pattern helps because the question is always local: this field, this document, this page, this evidence. That is the difference between reviewing a system and hunting through a pile of artifacts.

PDF: Source stays visible

The original document remains in the workflow, so the reviewer can inspect the evidence rather than trusting an extracted value.

REC: Record stays structured

The Golden Record is organized by field sections, which makes review a systematic pass instead of a manual search task.

ACT: Actions stay local

Verify, flag, and override live next to the field, where the reviewer has context.

LOG: State stays inspectable

Review state is summarized in the session as verified and overridden counts; a production system would harden it into durable audit history.

Source highlighting

Field selection should move the reviewer to the evidence.

PolicyTrace builds provenance records with a field path, extracted value, matched text, match score, source filename, page, and bounding box. The UI uses that to connect a structured field to a specific region in the PDF.
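A provenance record with those attributes might look like the sketch below. The attribute list comes from the paragraph above; the class name, field names, types, the example field path, and the bounding-box coordinates are all assumptions made for illustration.

```python
from dataclasses import dataclass, asdict
from typing import Optional, Tuple

# (x0, y0, x1, y1) in PDF page coordinates; the exact convention is an assumption.
BBox = Tuple[float, float, float, float]


@dataclass(frozen=True)
class ProvenanceRecord:
    field_path: str                  # hypothetical path, e.g. "policy.policy_number"
    extracted_value: Optional[str]
    matched_text: Optional[str]
    match_score: float               # 0.0 to 1.0
    source_filename: Optional[str]
    page: Optional[int]              # 1-based page number (assumed)
    bbox: Optional[BBox]

    @property
    def is_located(self) -> bool:
        """True when the UI has enough information to draw a highlight."""
        return all(v is not None for v in (self.source_filename, self.page, self.bbox))


record = ProvenanceRecord(
    field_path="policy.policy_number",
    extracted_value="PTM-2026-00419",
    matched_text="Policy number PTM-2026-00419",
    match_score=0.98,
    source_filename="Schedule.pdf",
    page=1,
    bbox=(72.0, 96.0, 310.0, 112.0),  # made-up coordinates for illustration
)
print(asdict(record))
```

Keeping the location fields optional lets the same record type describe a field that was extracted but never located.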

When the reviewer selects a field, the PDF pane can switch documents, scroll to the page, and highlight the matched area. This is a small interaction, but it changes the review workload. The reviewer spends less time searching and more time deciding.

Evidence should be navigable, not merely stored.

A provenance entry hidden in JSON helps engineers debug. A provenance entry connected to a PDF highlight helps reviewers work.

  • 1
    Source filename chooses the active PDF.
  • 2
    Page location moves the reviewer to the right part of the document.
  • 3
    Bounding boxes turn evidence into an inspectable visual target.
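The three steps above can be sketched as one lookup that turns a selected field into a highlight target. The function name `resolve_highlight`, the dictionary layout, the field paths, and the coordinates are hypothetical; the PDF viewer that consumes the target is out of scope here.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class HighlightTarget:
    filename: str                             # chooses the active PDF
    page: int                                 # page to scroll to
    bbox: Tuple[float, float, float, float]   # region to highlight


def resolve_highlight(field_path: str,
                      provenance: Dict[str, dict]) -> Optional[HighlightTarget]:
    """Map a selected field to the PDF region the viewer should show.

    Returns None when the field has no usable location, so the UI can
    show a "no source located" state instead of guessing.
    """
    entry = provenance.get(field_path)
    if entry is None:
        return None
    filename = entry.get("source_filename")
    page = entry.get("page")
    bbox = entry.get("bbox")
    if filename is None or page is None or bbox is None:
        return None
    return HighlightTarget(filename=filename, page=page, bbox=tuple(bbox))


# Illustrative provenance index keyed by field path (coordinates are made up).
provenance = {
    "cover.voluntary_excess": {
        "source_filename": "Schedule.pdf", "page": 3,
        "bbox": [60.0, 410.0, 280.0, 426.0],
    },
    "usage.annual_mileage": {"source_filename": None, "page": None, "bbox": None},
}
target = resolve_highlight("cover.voluntary_excess", provenance)
missing = resolve_highlight("usage.annual_mileage", provenance)
print(target, missing)
```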

Reviewer actions

Verify, flag, and override are three different decisions.

A good review loop should avoid turning every exception into the same generic "needs human" bucket. In PolicyTrace, field actions have separate meanings. Verify means the extracted value is acceptable. Flag means it needs attention. Override means the reviewer has supplied a corrected value.

Those actions change the role of the reviewer from passive checker to active decision maker. They also create workflow signals that a production system can use later: which fields are frequently overridden, which document types produce weak evidence, and where rules or prompts need improvement.
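One way to keep the three decisions distinct is to model them as separate enum values and store the corrected value alongside the extracted one rather than over it. The sketch below assumes a plain dictionary as session state; the names `ReviewAction`, `FieldReview`, and `apply_action`, and the example mileage value, are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class ReviewAction(Enum):
    VERIFY = "verify"
    FLAG = "flag"
    OVERRIDE = "override"


@dataclass
class FieldReview:
    extracted_value: Optional[str]
    action: Optional[ReviewAction] = None
    corrected_value: Optional[str] = None
    note: str = ""

    @property
    def effective_value(self) -> Optional[str]:
        """Value downstream consumers should use after review."""
        if self.action is ReviewAction.OVERRIDE:
            return self.corrected_value
        return self.extracted_value


def apply_action(state: Dict[str, FieldReview], field_path: str,
                 action: ReviewAction, corrected_value: Optional[str] = None,
                 note: str = "") -> FieldReview:
    """Record one field-level decision in a session-style dict."""
    if action is ReviewAction.OVERRIDE and corrected_value is None:
        raise ValueError("override requires a corrected value")
    review = state[field_path]
    review.action = action
    # extracted_value is never modified, so the fact of a change stays visible.
    review.corrected_value = corrected_value if action is ReviewAction.OVERRIDE else None
    review.note = note
    return review


state = {
    "policy.insurer": FieldReview(extracted_value="Example Insurance Ltd"),
    "usage.annual_mileage": FieldReview(extracted_value=None),
}
apply_action(state, "policy.insurer", ReviewAction.VERIFY)
apply_action(state, "usage.annual_mileage", ReviewAction.OVERRIDE,
             corrected_value="8000", note="made-up value for illustration")
print(state["usage.annual_mileage"].effective_value)
```

Because overrides keep both values, later analysis can count which fields are corrected most often without reconstructing history from logs.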

V: Verify

The reviewer confirms the field can be trusted for the current workflow.

F: Flag

The reviewer marks a value as questionable, missing, low-confidence, or unsuitable for downstream use.

O: Override

The reviewer replaces the extracted value with a corrected value while preserving that a change happened.

S: Summarize

The dashboard can show progress through counts such as located, verified, and overridden fields.
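The dashboard counts can be derived directly from per-field review state. This is a rough sketch under an assumed data shape (a list of dictionaries with `located` and `action` keys); the function name `summarize` is hypothetical.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Mapping


def summarize(fields: Iterable[Mapping[str, Any]]) -> Dict[str, int]:
    """Count dashboard totals from per-field review state.

    Each item is assumed to look like
    {"located": bool, "action": "verify" | "flag" | "override" | None}.
    """
    counts = Counter(fields=0, located=0, verified=0, flagged=0, overridden=0)
    labels = {"verify": "verified", "flag": "flagged", "override": "overridden"}
    for item in fields:
        counts["fields"] += 1
        if item.get("located"):
            counts["located"] += 1
        # Fields with no action yet only contribute to the totals above.
        label = labels.get(item.get("action"))
        if label:
            counts[label] += 1
    return dict(counts)


session_fields = [
    {"located": True, "action": "verify"},
    {"located": True, "action": "override"},
    {"located": False, "action": "flag"},
    {"located": True, "action": None},
]
summary = summarize(session_fields)
print(summary)
```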

Changing the model's role

The model becomes a first-pass worker, not the final decision-maker.

Once the review loop exists, the model does not need to pretend to be certain about everything. It can extract candidates, preserve citations, and let the rest of the system decide what deserves trust.

That is healthier engineering. The architecture can separate extraction quality, evidence quality, arbitration quality, and review quality. Each part can be measured and improved without hiding every problem inside one prompt.

Review turns AI output into an operational workflow.

The final user experience is not "AI returned JSON." It is "a reviewer inspected the evidence, resolved exceptions, and left a record of the decision."

  • 1
    The model extracts structured candidates and source phrases.
  • 2
    The system maps candidates back to source evidence and conflict context.
  • 3
    The reviewer applies human judgement where automation should not silently decide.

Production hardening

A production review loop needs queues, ownership, and audit history.

The current PolicyTrace implementation proves the interaction pattern with session-based review state. That is enough for a reference project, but not enough for a regulated or high-volume workflow.

What PolicyTrace demonstrates

  • 1
    A split-screen review UI with PDF evidence and structured Golden Record fields.
  • 2
    Field-level provenance hints, match confidence, active PDF switching, and highlighted source regions.
  • 3
    Session review actions for verification, flagging, and overrides.

What production would add

  • 1
    Reviewer assignment, queues, service levels, RBAC, and escalation rules.
  • 2
    Durable audit logs with reviewer identity, timestamps, old values, new values, and evidence snapshots.
  • 3
    Sampling, reviewer metrics, override analytics, monitoring, and feedback into evaluation sets.
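As an illustration of the durable audit log item, the sketch below records one review decision as an immutable entry and serializes it as a JSON line. It is not part of the current PolicyTrace implementation; the class, function names, field names, reviewer identifier, and storage format are all assumptions.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class AuditEntry:
    field_path: str
    action: str                  # "verify", "flag", or "override"
    reviewer_id: str
    timestamp: str               # ISO 8601, UTC
    old_value: Optional[str]
    new_value: Optional[str]
    evidence_snapshot: dict      # provenance as the reviewer saw it


def make_audit_entry(field_path: str, action: str, reviewer_id: str,
                     old_value: Optional[str], new_value: Optional[str],
                     evidence: dict) -> AuditEntry:
    return AuditEntry(
        field_path=field_path,
        action=action,
        reviewer_id=reviewer_id,
        timestamp=datetime.now(timezone.utc).isoformat(),
        old_value=old_value,
        new_value=new_value,
        # Copy the evidence so later changes to the caller's dict
        # do not alter what the entry recorded.
        evidence_snapshot=dict(evidence),
    )


def to_jsonl(entry: AuditEntry) -> str:
    """Serialize one entry as a single JSON line for an append-only log."""
    return json.dumps(asdict(entry), sort_keys=True)


entry = make_audit_entry(
    field_path="cover.voluntary_excess",      # hypothetical field path
    action="verify",
    reviewer_id="reviewer-001",               # hypothetical identifier
    old_value="250 GBP",
    new_value="250 GBP",
    evidence={"source_filename": "Schedule.pdf", "page": 3,
              "matched_text": "Voluntary excess GBP 250"},
)
print(to_jsonl(entry))
```

An append-only line format is only one option; a regulated workflow would more likely write these entries to a database with access controls.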

Reusable pattern

Design the review surface before the system reaches production.

Review loops are hard to bolt on after the fact. If the extraction system did not preserve source filenames, matched text, page geometry, conflicts, and field paths, the UI cannot magically explain the output later.

That is the broader AI Tool Stack lesson: practical AI engineering is not just model selection. It is evidence design, state design, workflow design, and a clear path for humans to intervene when the system reaches its limits.

Next implementation note

Deploying PolicyTrace with GitHub, Docker, and Hugging Face Spaces.

The next post moves from product workflow to delivery: how the app is packaged, where the demo runs, and what production deployment would still require.