Code-First AI Systems: Why the Model Should Not Run the Workflow

The model should classify, extract, summarize, judge, or propose. Code should own routing, state, validation, retries, evidence, review, and evaluation.

By Teja Sagiraju

May 25, 2026

The demo works until the model starts making process decisions nobody can inspect.

At first, the prompt feels like the fastest place to put the workflow. Ask the model to read the input, decide what kind of job it is, choose the right rules, extract the answer, check itself, recover from mistakes, and explain the result. One prompt. One response. A clean demo.

Then the first real workflow arrives. The input is ambiguous. The model picks the wrong path. A retry changes the answer but nobody can see which rule fired. A reviewer asks why the system skipped a document. The only answer is buried somewhere inside a long instruction block and a final response.

The model should be a worker inside the system, not the workflow engine.

A production AI workflow should be owned by code. The model can classify, extract, summarize, judge, or propose. The application should own routing, state, validation, retries, evidence, review, and evaluation.

Why model-run workflows work in demos

Model-run workflows are seductive because demos are usually short, single-path, and forgiving. There is one input, one happy path, one user watching closely, and no operational history to preserve.

1. The state is tiny

The model can keep the whole story in context because the demo has only a few facts and no long-running lifecycle.

2. The path is obvious

The demo input is chosen to make the next step clear. Production inputs are messier and often need explicit routing.

3. The cost of being wrong is hidden

A bad branch can be explained away in a demo. In production, it can create bad state, missed evidence, or review work.

4. The human fills the gaps

The presenter knows what the workflow was meant to do. Future maintainers, reviewers, and evaluators do not have that context.

The responsibility split

The practical difference is not whether the system uses an LLM. It is whether the LLM is allowed to decide the process invisibly.

Workflow style	Path	What happens operationally
Model-run workflow	User input → giant prompt → model decides everything → final answer	Routing, validation, retries, evidence, and review are mixed into prose. Failures are hard to isolate because the workflow is not represented as system state.
Code-first workflow	User input → explicit state → router → task-specific model call → validation → evidence → review/eval → final output	Each step has a contract, logs, failure modes, retry policy, and evaluation boundary. The model performs bounded work inside an inspectable workflow.

What should live in code

If a decision changes the workflow, creates state, affects risk, or must be inspected later, it belongs in application code or a workflow engine, not only in a prompt.

State

What has been received, parsed, classified, extracted, validated, reviewed, rejected, or accepted.

Routing

Which task runs next, which prompt or tool is allowed, and which branch handles exceptions.

Validation

Schema checks, deterministic business rules, required evidence fields, and rejection conditions.

Retries and fallbacks

When to retry, when to use a cheaper or stronger model, when to ask for review, and when to stop.

Evidence

Source references, citations, artifact links, provenance, confidence signals, and reviewer-visible context.

Review and evaluation

Human checkpoints, review outcomes, golden examples, regression gates, and step-level quality signals.

What the model should own

The model is still useful. It just needs a bounded job with clear inputs, clear outputs, and a surrounding system that decides what happens next.

Classify

Propose a document type, request type, intent, risk class, or task category for code to route.

Extract

Pull fields, clauses, entities, dates, values, or table facts into a typed output contract.

Judge

Score a bounded claim, compare two candidates, or flag uncertainty for review.

Propose

Suggest a summary, next action, draft response, or explanation that the workflow can accept, revise, or reject.
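A bounded classification job might look like the following minimal sketch. It uses only the standard library to stay dependency-free; the label set, the `ClassificationClaim` type, and the `fake_model` stub are all hypothetical stand-ins, not part of any real API. The point is that the model returns text, and code decides whether that text qualifies as a claim at all.

```python
import json
from dataclasses import dataclass

# Hypothetical label set; a real system would define its own.
ALLOWED_TYPES = {"schedule", "invoice", "contract", "unknown"}


@dataclass(frozen=True)
class ClassificationClaim:
    document_type: str
    confidence: float


def fake_model(text: str) -> str:
    """Stand-in for a real model call; returns raw JSON text."""
    label = "invoice" if "invoice" in text.lower() else "unknown"
    return json.dumps({"document_type": label, "confidence": 0.9})


def parse_classification(raw: str) -> ClassificationClaim:
    """Parse raw model text into a typed claim, or raise ValueError."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("model output must be a JSON object")

    doc_type = data.get("document_type")
    if not isinstance(doc_type, str) or doc_type not in ALLOWED_TYPES:
        raise ValueError(f"unexpected document_type: {doc_type!r}")

    confidence = data.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be a number in [0, 1]: {confidence!r}")

    return ClassificationClaim(doc_type, float(confidence))


# The model proposes a label; code accepts it only if it fits the contract.
claim = parse_classification(fake_model("Invoice 123 from Acme"))
```

Anything outside the contract raises before it can reach routing or state, which keeps a malformed response from quietly becoming a decision.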

Workflow state should be explicit

Once the workflow has state, the application can make decisions from facts it owns instead of hoping the model remembers the process correctly.

01 Received: Input stored with a run ID and source reference.

02 Parsed: Text, structure, pages, or artifacts are available.

03 Classified: Task type and routing decision are recorded.

04 Extracted: Model output exists as a bounded claim.

05 Validated: Schema and deterministic checks have run.

06 Reviewed: Evidence, exceptions, and outcomes are attached.

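from pydantic import BaseModel, Field


class WorkflowState(BaseModel):
    """Facts the application owns about a single workflow run."""

    run_id: str
    input_ref: str
    parsed_artifact_ref: str | None = None
    route: str | None = None
    extraction_status: str = "pending"
    # default_factory states explicitly that every run gets its own lists.
    validation_errors: list[str] = Field(default_factory=list)
    evidence_refs: list[str] = Field(default_factory=list)
    review_required: bool = False
    evaluation_case_id: str | None = None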

Use a state mutation contract

A model node should not directly mutate the database, workflow context, or business record. It should return a claim. Code decides whether that claim is allowed to change state.

from typing import Literal
from pydantic import BaseModel, Field


class ExtractionClaim(BaseModel):
    """What a model node returns: a proposal, never a state change."""

    field_name: str
    proposed_value: str | None = None
    status: Literal["success", "missing_evidence", "requires_review"]
    evidence_refs: list[str] = Field(default_factory=list)


def apply_claim(state: WorkflowState, claim: ExtractionClaim) -> WorkflowState:
    """Decide in code whether a claim may change state (updates `state` in place)."""
    # The model itself reported a problem: escalate instead of accepting.
    if claim.status != "success":
        state.review_required = True
        return state

    # A "success" without supporting evidence is rejected and sent to review.
    if not claim.evidence_refs:
        state.validation_errors.append(f"{claim.field_name}: missing evidence")
        state.review_required = True
        return state

    # Only deterministic code gets to update accepted workflow state.
    state.evidence_refs.extend(claim.evidence_refs)
    state.extraction_status = "accepted"
    return state

Deterministic state mutation is the wall between prediction and record.

The LLM can propose a value, status, confidence, or explanation. The parent workflow validates the claim, checks evidence, applies policy, records the decision, and only then mutates state.

Routing should be testable

Routing is where many AI workflows quietly become unmaintainable. If the model decides the route and the prompt decides what each route means, the team cannot reliably test the workflow before a real user hits it.

A routing decision should be data, not a hidden side effect.

The model can propose document_type = "schedule" or risk_level = "high". Code should decide which extractor runs, whether extra evidence is required, whether a stronger model is needed, and whether review is mandatory.
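One way to make that decision testable is to express it as a pure function from proposed labels to a route record. The sketch below is illustrative only: the extractor names, model tiers, and the `RouteDecision` fields are assumptions, not taken from any particular system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteDecision:
    extractor: str
    model_tier: str
    require_evidence: bool
    review_required: bool


# Hypothetical mapping from a proposed document type to an extractor name.
EXTRACTORS = {
    "schedule": "schedule_extractor",
    "invoice": "invoice_extractor",
}


def decide_route(document_type: str, risk_level: str) -> RouteDecision:
    """Pure function: the same proposed labels always yield the same route."""
    extractor = EXTRACTORS.get(document_type)
    if extractor is None:
        # Unknown types never reach an extractor; they go to a human.
        return RouteDecision("manual_triage", "none", True, True)

    high_risk = risk_level == "high"
    return RouteDecision(
        extractor=extractor,
        model_tier="strong" if high_risk else "standard",
        require_evidence=high_risk,
        review_required=high_risk,
    )
```

Because `decide_route` has no model call and no hidden state, every branch can be covered by ordinary unit tests, and the returned record can be stored alongside the run.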

Agentic routing is not the same as graph routing

If a business process can be mapped as explicit states and transitions, asking a model to invent the sequence is not innovation. It gives up the part of the system that software is already good at.

Attribute	Model-driven agentic routing	Code-first graph routing
Control flow	Probabilistic. The model reasons in text and decides which tool or step comes next.	Deterministic. Code routes from explicit state, typed outputs, rules, and policy.
Testability	High variance. Requires broad behavioral sampling to gain confidence.	Standard unit and integration tests can cover edge-case paths and terminal states.
Failure mode	Wrong tool, looped tool calls, silent dead ends, or a plausible explanation after a bad branch.	Validation error, timeout, known fallback, review route, or explicit terminal failure.
Best fit	Open-ended exploration, research, creative workflows, and ad-hoc discovery.	Compliance, data pipelines, document processing, financial workflows, and reviewable automation.

Retries and fallbacks should be policy, not vibes

A retry loop inside a prompt is not a retry policy. Production workflows need explicit limits, reasons, escalation paths, and traces.

Failure	Code-owned policy	Why it matters
Invalid shape	Retry once with validation errors, then store the failure and route to review.	The system knows whether the model failed the contract rather than silently repairing the final answer.
Missing evidence	Ask for source-specific extraction or mark the field as unsupported.	The workflow preserves the difference between an answer and an answer with support.
Conflicting values	Apply deterministic authority rules or send the conflict to review.	The model does not invent a tie-breaker that nobody can audit.
High-risk route	Use a stricter prompt, stronger validation, mandatory evidence, and human checkpoint.	Risk changes the workflow explicitly instead of relying on a more careful-sounding instruction.
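The "invalid shape" row can be sketched as a small retry helper with an explicit limit and a trace. Everything here is a hypothetical illustration: `call` stands in for a model invocation that receives the previous validation errors as feedback, and `validate` stands in for whatever schema or rule check the workflow uses.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class AttemptLog:
    attempts: int = 0
    errors: list[str] = field(default_factory=list)
    outcome: str = "pending"  # becomes "accepted" or "review"


def run_with_retry(
    call: Callable[[list[str]], dict],
    validate: Callable[[dict], list[str]],
    max_retries: int = 1,
) -> tuple[Optional[dict], AttemptLog]:
    """Call, validate, retry with the errors as feedback, then escalate."""
    log = AttemptLog()
    feedback: list[str] = []
    for _ in range(max_retries + 1):
        log.attempts += 1
        output = call(feedback)
        errors = validate(output)
        if not errors:
            log.outcome = "accepted"
            return output, log
        log.errors.extend(errors)  # every failure stays in the trace
        feedback = errors
    log.outcome = "review"  # limit reached: stop and route to a human
    return None, log


def validate_total(output: dict) -> list[str]:
    """Toy contract check: the output must contain a 'total' field."""
    return [] if "total" in output else ["total: field missing"]


def flaky_model(feedback: list[str]) -> dict:
    """Stub that only produces a valid shape once it has seen feedback."""
    return {"total": "12.50"} if feedback else {}


result, log = run_with_retry(flaky_model, validate_total)
```

The log keeps the failed first attempt even though the second one succeeded, so the instability remains visible instead of being repaired silently.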

Evidence and review should be workflow outputs

A final answer is not enough when a user, reviewer, operator, or evaluator needs to inspect why the system behaved the way it did. Evidence and review state should be produced by the workflow, not added as decorative explanation after the fact.

Do not ask the model to be the only witness.

The workflow should carry source snippets, document references, page context, extraction traces, validation failures, conflicts, reviewer actions, and final acceptance state. The model can help generate or interpret evidence, but the application should decide how evidence is stored and exposed.
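As a rough illustration, evidence can be carried as a record that code both stores and checks. The field names below are assumptions chosen for the example, and the substring check is deliberately naive; a real system would likely need normalization for dates, currency, and whitespace.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceRecord:
    run_id: str
    field_name: str
    value: str
    source_ref: str  # hypothetical document identifier
    page: int
    snippet: str     # verbatim source text offered as support


def snippet_supports_value(record: EvidenceRecord) -> bool:
    """Deterministic check: the claimed value must appear in the quoted snippet."""
    value = record.value.strip().lower()
    return bool(value) and value in record.snippet.lower()


supported = EvidenceRecord(
    run_id="run-001",
    field_name="effective_date",
    value="1 March 2025",
    source_ref="doc-17",
    page=3,
    snippet="This policy takes effect on 1 March 2025 for all members.",
)
unsupported = EvidenceRecord(
    run_id="run-001",
    field_name="effective_date",
    value="1 April 2025",
    source_ref="doc-17",
    page=3,
    snippet="This policy takes effect on 1 March 2025 for all members.",
)
```

A record that fails the check can be routed to review with the snippet attached, so the reviewer sees what the model cited rather than only what it concluded.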

Evaluation needs step-level boundaries

If the only thing you evaluate is the final answer, you cannot tell whether a regression came from parsing, routing, extraction, validation, arbitration, review policy, or the model call itself.

Route checks

Did the system send the input to the right task, prompt, model policy, and review path?

Extraction checks

Did the model extract the right fields into the right typed contract with required evidence?

Policy checks

Did validation, fallback, conflict handling, and review behavior follow the expected rules?
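A route check, for instance, can run against golden cases without touching a model at all. In this sketch the `route` function, the case IDs, and the route names are hypothetical; the same pattern would apply to extraction and policy checks with their own fixtures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    document_type: str
    expected_route: str


def route(document_type: str) -> str:
    """Hypothetical router under test."""
    table = {"schedule": "schedule_extractor", "invoice": "invoice_extractor"}
    return table.get(document_type, "manual_triage")


def check_routes(cases: list[GoldenCase]) -> list[str]:
    """Return the IDs of cases whose actual route differs from the expected one."""
    return [c.case_id for c in cases if route(c.document_type) != c.expected_route]


GOLDEN_ROUTES = [
    GoldenCase("route-001", "schedule", "schedule_extractor"),
    GoldenCase("route-002", "invoice", "invoice_extractor"),
    GoldenCase("route-003", "handwritten_note", "manual_triage"),
]

failures = check_routes(GOLDEN_ROUTES)
```

An empty `failures` list means routing still matches the fixtures; a non-empty one names exactly which cases regressed, before any final answer is compared.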

Do not do this.

Ask one giant prompt to classify, route, extract, validate, retry, produce evidence, and decide review.
Let the model choose tools, branches, or fallback behavior without recording the decision as workflow state.
Hide retry loops inside natural-language instructions instead of explicit code policy.
Use the same model call for low-risk summaries and high-risk structured decisions.
Only evaluate the final answer when the workflow has multiple failure points.
Treat a fluent explanation as a substitute for source evidence, trace logs, or reviewer-visible state.

For long-running workflows, use durable execution

Code-first does not mean hand-rolling every workflow loop in one Python process. Mission-critical workflows often need durable state, retries, timers, concurrency control, and recovery after crashes.

The state machine can live in production workflow infrastructure.

Tools such as Temporal, AWS Step Functions, or LangGraph can hold the graph, retries, checkpoints, and long-running state. The principle stays the same: code owns the transitions; model calls are bounded nodes inside the workflow.
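To show the idea without depending on any of those tools, here is a toy checkpointing loop that persists completed steps to a JSON file and resumes from it. It is only a sketch of the principle: the step names mirror the state list earlier in this post, the handlers are stubs, and the file write is not atomic, which a real durable-execution engine would handle for you.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional

STEPS = ["received", "parsed", "classified", "extracted", "validated", "reviewed"]


def load_checkpoint(path: Path) -> dict:
    """Read persisted state, or start a fresh run if none exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {"completed": []}


def run_workflow(
    path: Path,
    handlers: dict[str, Callable[[dict], None]],
    stop_after: Optional[str] = None,
) -> dict:
    """Run each step once, persisting progress after every step."""
    state = load_checkpoint(path)
    for step in STEPS:
        if step in state["completed"]:
            continue  # finished by an earlier process; do not repeat it
        handlers[step](state)
        state["completed"].append(step)
        path.write_text(json.dumps(state))
        if step == stop_after:
            break  # used here only to simulate a crash mid-run
    return state


calls: list[str] = []
handlers = {step: (lambda state, s=step: calls.append(s)) for step in STEPS}

with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "run-001.json"
    run_workflow(checkpoint, handlers, stop_after="classified")  # "crash"
    final_state = run_workflow(checkpoint, handlers)  # resume from the file
```

After the resume, each handler has run exactly once, because the transitions are owned by code and recorded outside the process that executes them.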

Implementation options to test

You do not need a complicated agent platform to start. You need explicit state, bounded model calls, and testable transitions. Add frameworks when they help the workflow become clearer, not just more impressive.

Need	Implementation options	What to evaluate
Predictable sequence	Plain Python or TypeScript functions with typed inputs, outputs, and stored run state.	Whether a developer can read the workflow and identify every state transition.
Branching workflow	LangGraph workflows, state machines, or a small custom router.	Whether routing, human checkpoints, retries, and terminal states are explicit and testable.
Durable execution	Temporal, AWS Step Functions, durable queues, or workflow engines that persist state between steps.	Whether the workflow can resume after crashes, wait for review, retry safely, and handle long-running jobs without losing state.
Typed model calls	Pydantic models, OpenAI structured outputs, Instructor, or JSON Schema contracts.	Whether model output fails loudly before it becomes application state.
Retries and fallback	Validation-error retry loops, model policy tables, fallback branches, and stop conditions.	Whether repeated attempts improve the result without hiding unstable behavior.
Evidence and review	Artifact storage, source references, provenance records, review queues, and review outcomes.	Whether reviewers can inspect what happened and evaluators can learn from corrections.
Release safety	Golden examples, step-level regression tests, prompt versions, route fixtures, and evaluation gates.	Whether a prompt, model, parser, or routing change can be tested before shipping.

Where this shows up

Code-owned orchestration matters anywhere the workflow has multiple steps, multiple document types, multiple risk levels, or reviewer-facing output.

PolicyTrace

PolicyTrace uses code-owned orchestration: parsing, masking, classification, specialist extraction, arbitration, provenance, and review are separate workflow steps.

Future ContractCopilot

A contract workflow would need code-owned routing for clauses, obligations, amendments, risk flags, and evidence.

Future invoice intelligence

An invoice workflow would need code-owned handling for suppliers, line items, totals, tax, PO matching, exceptions, and review queues.

The practical takeaway

The more important the workflow becomes, the less you should ask the model to run it. Let code own the process. Let the model do bounded work inside that process.

Useful AI is not the prompt. It is the system around it.

A code-first workflow gives the team places to test, trace, retry, review, evaluate, and improve. That is what turns a model call into a working system.

This post sits inside the Orchestration Layer. The next step is connecting code-owned workflow boundaries to typed outputs, evidence-backed claims, and regression-safe evaluation.
