Code-First AI Systems: Why the Model Should Not Run the Workflow
The model should classify, extract, summarize, judge, or propose. Code should own routing, state, validation, retries, evidence, review, and evaluation.
At first, the prompt feels like the fastest place to put the workflow. Ask the model to read the input, decide what kind of job it is, choose the right rules, extract the answer, check itself, recover from mistakes, and explain the result. One prompt. One response. A clean demo.
Then the first real workflow arrives. The input is ambiguous. The model picks the wrong path. A retry changes the answer but nobody can see which rule fired. A reviewer asks why the system skipped a document. The only answer is buried somewhere inside a long instruction block and a final response.
A production AI workflow should be owned by code. The model can classify, extract, summarize, judge, or propose. The application should own routing, state, validation, retries, evidence, review, and evaluation.
Why model-run workflows work in demos
Model-run workflows are seductive because demos are usually short, single-path, and forgiving. There is one input, one happy path, one user watching closely, and no operational history to preserve.
The model can keep the whole story in context because the demo has only a few facts and no long-running lifecycle.
The demo input is chosen to make the next step clear. Production inputs are messier and often need explicit routing.
A bad branch can be explained away in a demo. In production, it can create bad state, missed evidence, or review work.
The presenter knows what the workflow was meant to do. Future maintainers, reviewers, and evaluators do not have that context.
The responsibility split
The practical difference is not whether the system uses an LLM. It is whether the LLM is allowed to decide the process invisibly.
| Workflow style | Path | What happens operationally |
|---|---|---|
| Model-run workflow | User input → giant prompt → model decides everything → final answer | Routing, validation, retries, evidence, and review are mixed into prose. Failures are hard to isolate because the workflow is not represented as system state. |
| Code-first workflow | User input → explicit state → router → task-specific model call → validation → evidence → review/eval → final output | Each step has a contract, logs, failure modes, retry policy, and evaluation boundary. The model performs bounded work inside an inspectable workflow. |
What should live in code
If a decision changes the workflow, creates state, affects risk, or must be inspected later, it belongs in application code or a workflow engine, not only in a prompt.
State: what has been received, parsed, classified, extracted, validated, reviewed, rejected, or accepted.
Routing: which task runs next, which prompt or tool is allowed, and which branch handles exceptions.
Validation: schema checks, deterministic business rules, required evidence fields, and rejection conditions.
Retries and fallbacks: when to retry, when to use a cheaper or stronger model, when to ask for review, and when to stop.
Evidence: source references, citations, artifact links, provenance, confidence signals, and reviewer-visible context.
Review and evaluation: human checkpoints, review outcomes, golden examples, regression gates, and step-level quality signals.
What the model should own
The model is still useful. It just needs a bounded job with clear inputs, clear outputs, and a surrounding system that decides what happens next.
Classify: propose a document type, request type, intent, risk class, or task category for code to route.
Extract: pull fields, clauses, entities, dates, values, or table facts into a typed output contract.
Judge: score a bounded claim, compare two candidates, or flag uncertainty for review.
Propose: suggest a summary, next action, draft response, or explanation that the workflow can accept, revise, or reject.
Workflow state should be explicit
Once the workflow has state, the application can make decisions from facts it owns instead of hoping the model remembers the process correctly.
```python
from pydantic import BaseModel, Field


class WorkflowState(BaseModel):
    """Facts the application owns about one workflow run."""

    run_id: str
    input_ref: str
    parsed_artifact_ref: str | None = None
    route: str | None = None
    extraction_status: str = "pending"
    validation_errors: list[str] = Field(default_factory=list)
    evidence_refs: list[str] = Field(default_factory=list)
    review_required: bool = False
    evaluation_case_id: str | None = None
```
Use a state mutation contract
A model node should not directly mutate the database, workflow context, or business record. It should return a claim. Code decides whether that claim is allowed to change state.
```python
from typing import Literal

from pydantic import BaseModel, Field

# WorkflowState is the state model from the "Workflow state should be explicit" section.


class ExtractionClaim(BaseModel):
    """What a model node returns: a proposal, not a state change."""

    field_name: str
    proposed_value: str | None = None
    status: Literal["success", "missing_evidence", "requires_review"]
    evidence_refs: list[str] = Field(default_factory=list)


def apply_claim(state: WorkflowState, claim: ExtractionClaim) -> WorkflowState:
    # Anything other than a clean success is routed to review.
    if claim.status != "success":
        state.review_required = True
        return state

    # A "success" without evidence is recorded as a validation error.
    if not claim.evidence_refs:
        state.validation_errors.append(f"{claim.field_name}: missing evidence")
        state.review_required = True
        return state

    # Only deterministic code gets to update accepted workflow state.
    state.evidence_refs.extend(claim.evidence_refs)
    state.extraction_status = "accepted"
    return state
```
The LLM can propose a value, status, confidence, or explanation. The parent workflow validates the claim, checks evidence, applies policy, records the decision, and only then mutates state.
Routing should be testable
Routing is where many AI workflows quietly become unmaintainable. If the model decides the route and the prompt decides what each route means, the team cannot reliably test the workflow before a real user hits it.
The model can propose document_type = "schedule" or risk_level = "high". Code should decide which extractor runs, whether extra evidence is required, whether a stronger model is needed, and whether review is mandatory.
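As a rough illustration of that split, the sketch below treats the model's classification as input and keeps every routing decision in a pure function that can be unit tested. The names `ClassificationClaim`, `RouteDecision`, `EXTRACTORS`, and `manual_triage` are hypothetical, and the specific rules (high risk means a stronger model plus mandatory review) are assumptions chosen for the example, not a recommended policy. It uses standard-library dataclasses to stay dependency-free; Pydantic models would work the same way.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class ClassificationClaim:
    """What the model proposes. Code decides what to do with it."""
    document_type: str
    risk_level: Literal["low", "medium", "high"]


@dataclass(frozen=True)
class RouteDecision:
    extractor: str
    model_tier: Literal["standard", "strong"]
    extra_evidence_required: bool
    review_mandatory: bool


# Hypothetical mapping from proposed document type to extractor name.
EXTRACTORS = {
    "schedule": "schedule_extractor",
    "invoice": "invoice_extractor",
}


def route(claim: ClassificationClaim) -> RouteDecision:
    """Pure function: same claim in, same route out."""
    extractor = EXTRACTORS.get(claim.document_type)
    if extractor is None:
        # Unknown types never reach a model extractor; a person looks first.
        return RouteDecision("manual_triage", "standard", True, True)

    high_risk = claim.risk_level == "high"
    return RouteDecision(
        extractor=extractor,
        model_tier="strong" if high_risk else "standard",
        extra_evidence_required=high_risk,
        review_mandatory=high_risk,
    )
```

Because `route` has no model call inside it, each branch can be covered by an ordinary test fixture before any real input arrives.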
Agentic routing is not the same as graph routing
If a business process can be mapped as explicit states and transitions, asking a model to invent the sequence is not innovation. It is giving up the part of the system that software is already good at.
| Attribute | Model-driven agentic routing | Code-first graph routing |
|---|---|---|
| Control flow | Probabilistic. The model reasons in text and decides which tool or step comes next. | Deterministic. Code routes from explicit state, typed outputs, rules, and policy. |
| Testability | High variance. Requires broad behavioral sampling to gain confidence. | Standard unit and integration tests can cover edge-case paths and terminal states. |
| Failure mode | Wrong tool, looped tool calls, silent dead ends, or a plausible explanation after a bad branch. | Validation error, timeout, known fallback, review route, or explicit terminal failure. |
| Best fit | Open-ended exploration, research, creative workflows, and ad-hoc discovery. | Compliance, data pipelines, document processing, financial workflows, and reviewable automation. |
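One way to picture the code-first column is a transition table that the application checks before any step change. This is a minimal sketch: the step names are borrowed from the state list earlier in the post, but the allowed transitions, the `advance` helper, and the `IllegalTransition` exception are assumptions made for illustration.

```python
from enum import Enum


class Step(str, Enum):
    RECEIVED = "received"
    PARSED = "parsed"
    CLASSIFIED = "classified"
    EXTRACTED = "extracted"
    VALIDATED = "validated"
    IN_REVIEW = "in_review"
    ACCEPTED = "accepted"
    REJECTED = "rejected"


# Assumed graph: every legal move is written down; anything else is an error.
TRANSITIONS: dict[Step, set[Step]] = {
    Step.RECEIVED: {Step.PARSED, Step.REJECTED},
    Step.PARSED: {Step.CLASSIFIED, Step.REJECTED},
    Step.CLASSIFIED: {Step.EXTRACTED, Step.IN_REVIEW},
    Step.EXTRACTED: {Step.VALIDATED, Step.IN_REVIEW},
    Step.VALIDATED: {Step.ACCEPTED, Step.IN_REVIEW},
    Step.IN_REVIEW: {Step.ACCEPTED, Step.REJECTED},
    Step.ACCEPTED: set(),
    Step.REJECTED: set(),
}


class IllegalTransition(Exception):
    """Raised when code or a model proposes a move the graph does not allow."""


def advance(current: Step, proposed: Step) -> Step:
    """Return the new step only if the graph allows the move."""
    if proposed not in TRANSITIONS[current]:
        raise IllegalTransition(f"{current.value} -> {proposed.value} is not allowed")
    return proposed


def is_terminal(step: Step) -> bool:
    """A step with no outgoing transitions ends the run."""
    return not TRANSITIONS[step]
```

A model can still suggest the next step, but a suggestion that is not in the table fails loudly instead of producing a plausible explanation after a bad branch.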
Retries and fallbacks should be policy, not vibes
A retry loop inside a prompt is not a retry policy. Production workflows need explicit limits, reasons, escalation paths, and traces.
| Failure | Code-owned policy | Why it matters |
|---|---|---|
| Invalid shape | Retry once with validation errors, then store the failure and route to review. | The system records that the model failed the contract instead of silently repairing the final answer. |
| Missing evidence | Ask for source-specific extraction or mark the field as unsupported. | The workflow preserves the difference between an answer and an answer with support. |
| Conflicting values | Apply deterministic authority rules or send the conflict to review. | The model does not invent a tie-breaker that nobody can audit. |
| High-risk route | Use a stricter prompt, stronger validation, mandatory evidence, and human checkpoint. | Risk changes the workflow explicitly instead of relying on a more careful-sounding instruction. |
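The "Invalid shape" row could look roughly like the sketch below: one retry that feeds the validation errors back, a hard attempt limit, and a log of every attempt. `call_model` is a hypothetical callable standing in for a real model client, and the shape check (a JSON object with a non-empty `value` string) is an assumption chosen to keep the example small.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Optional

MAX_ATTEMPTS = 2  # first attempt plus one retry


@dataclass
class AttemptLog:
    attempts: list[dict] = field(default_factory=list)
    outcome: str = "pending"  # becomes "accepted" or "review"


def validate_shape(raw: str) -> tuple[Optional[dict], list[str]]:
    """Deterministic check: a JSON object with a non-empty 'value' string."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict) or not isinstance(data.get("value"), str) or not data["value"]:
        return None, ["'value' must be a non-empty string"]
    return data, []


def run_with_retry(call_model: Callable[[list[str]], str]) -> tuple[Optional[dict], AttemptLog]:
    """Call the model at most MAX_ATTEMPTS times, then hand off to review."""
    log = AttemptLog()
    errors: list[str] = []
    for attempt in range(1, MAX_ATTEMPTS + 1):
        raw = call_model(errors)  # errors from the previous attempt, empty at first
        data, errors = validate_shape(raw)
        log.attempts.append({"attempt": attempt, "raw": raw, "errors": errors})
        if data is not None:
            log.outcome = "accepted"
            return data, log
    # The limit is policy: no further attempts, the failure is stored for review.
    log.outcome = "review"
    return None, log
```

The attempt log is the point of the exercise: a reviewer can see that the first response failed the contract, which a repaired final answer alone would hide.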
Evidence and review should be workflow outputs
A final answer is not enough when a user, reviewer, operator, or evaluator needs to inspect why the system behaved the way it did. Evidence and review state should be produced by the workflow, not added as decorative explanation after the fact.
The workflow should carry source snippets, document references, page context, extraction traces, validation failures, conflicts, reviewer actions, and final acceptance state. The model can help generate or interpret evidence, but the application should decide how evidence is stored and exposed.
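A possible shape for that record is sketched below. The field names, the `acceptance_state` rules, and the `reviewer_view` helper are all assumptions for illustration; a real system would likely store more (extraction traces, conflicts, timestamps) and persist it somewhere durable.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass(frozen=True)
class EvidenceRef:
    document_ref: str
    page: int
    snippet: str


@dataclass
class FieldRecord:
    field_name: str
    value: Optional[str]
    evidence: list[EvidenceRef] = field(default_factory=list)
    validation_failures: list[str] = field(default_factory=list)
    reviewer_action: Optional[str] = None  # for example "approved" or "corrected"

    @property
    def acceptance_state(self) -> str:
        """Derived by code from stored facts, never asserted by the model."""
        if self.reviewer_action is not None:
            return f"reviewed:{self.reviewer_action}"
        if self.validation_failures or not self.evidence:
            return "needs_review"
        return "accepted"


def reviewer_view(record: FieldRecord) -> dict:
    """Flatten the record into a plain dict a review UI or log could consume."""
    payload = asdict(record)
    payload["acceptance_state"] = record.acceptance_state
    return payload
```

In this sketch a value with no evidence can never reach `accepted` on its own, which keeps the difference between an answer and an answer with support visible in the data.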
Evaluation needs step-level boundaries
If the only thing you evaluate is the final answer, you cannot tell whether a regression came from parsing, routing, extraction, validation, arbitration, review policy, or the model call itself.
Routing: did the system send the input to the right task, prompt, model policy, and review path?
Extraction: did the model extract the right fields into the right typed contract with required evidence?
Validation and policy: did validation, fallback, conflict handling, and review behavior follow the expected rules?
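Those questions can be turned into per-step golden cases. The sketch below is one minimal way to do it, assuming each step can be called as a function from a dict to a dict; `GoldenCase`, `evaluate_steps`, and `gate` are hypothetical names, and exact-match comparison is a simplification that real extraction evals often need to relax.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    step: str  # for example "routing", "extraction", "validation"
    step_input: dict
    expected: dict


def evaluate_steps(
    cases: list[GoldenCase],
    step_fns: dict[str, Callable[[dict], dict]],
) -> dict[str, dict[str, int]]:
    """Run each case against its own step and count results per step."""
    report: dict[str, dict[str, int]] = {}
    for case in cases:
        bucket = report.setdefault(case.step, {"passed": 0, "failed": 0})
        actual = step_fns[case.step](case.step_input)
        bucket["passed" if actual == case.expected else "failed"] += 1
    return report


def gate(report: dict[str, dict[str, int]], max_failures: int = 0) -> bool:
    """Release gate: every step must stay within its failure budget."""
    return all(bucket["failed"] <= max_failures for bucket in report.values())
```

Because results are grouped by step, a regression shows up as "routing failed 1 case" rather than as an unexplained change in the final answer.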
Anti-patterns to avoid
- Ask one giant prompt to classify, route, extract, validate, retry, produce evidence, and decide review.
- Let the model choose tools, branches, or fallback behavior without recording the decision as workflow state.
- Hide retry loops inside natural-language instructions instead of explicit code policy.
- Use the same model call for low-risk summaries and high-risk structured decisions.
- Only evaluate the final answer when the workflow has multiple failure points.
- Treat a fluent explanation as a substitute for source evidence, trace logs, or reviewer-visible state.
For long-running workflows, use durable execution
Code-first does not mean hand-rolling every workflow loop in one Python process. Mission-critical workflows often need durable state, retries, timers, concurrency control, and recovery after crashes.
Tools such as Temporal, AWS Step Functions, or LangGraph can hold the graph, retries, checkpoints, and long-running state. The principle stays the same: code owns the transitions; model calls are bounded nodes inside the workflow.
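To make the idea of durable state concrete without depending on any of those tools, here is a toy checkpointing loop that writes state to a JSON file after each step and skips completed steps on restart. It is a stand-in for illustration only, not how Temporal, Step Functions, or LangGraph work internally; `run_durably` and the file layout are assumptions, and it ignores concurrency, atomic writes, timers, and idempotency, which real engines handle.

```python
import json
from pathlib import Path
from typing import Callable

StepFn = Callable[[dict], dict]


def run_durably(run_id: str, steps: list[tuple[str, StepFn]], store: Path) -> dict:
    """Run named steps in order, persisting state after each one.

    If the process dies mid-run, calling this again with the same run_id
    resumes after the last completed step instead of starting over.
    """
    path = store / f"{run_id}.json"
    if path.exists():
        state = json.loads(path.read_text())
    else:
        state = {"completed": [], "data": {}}

    for name, step in steps:
        if name in state["completed"]:
            continue  # finished in an earlier process; do not repeat it
        state["data"] = step(state["data"])
        state["completed"].append(name)
        path.write_text(json.dumps(state))  # checkpoint before moving on
    return state
```

A model call fits in as one of the `steps`: if it raises, nothing is marked complete, and the next run retries only that step.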
Implementation options to test
You do not need a complicated agent platform to start. You need explicit state, bounded model calls, and testable transitions. Add frameworks when they help the workflow become clearer, not just more impressive.
| Need | Implementation options | What to evaluate |
|---|---|---|
| Predictable sequence | Plain Python or TypeScript functions with typed inputs, outputs, and stored run state. | Whether a developer can read the workflow and identify every state transition. |
| Branching workflow | LangGraph workflows, state machines, or a small custom router. | Whether routing, human checkpoints, retries, and terminal states are explicit and testable. |
| Durable execution | Temporal, AWS Step Functions, durable queues, or workflow engines that persist state between steps. | Whether the workflow can resume after crashes, wait for review, retry safely, and handle long-running jobs without losing state. |
| Typed model calls | Pydantic models, OpenAI structured outputs, Instructor, or JSON Schema contracts. | Whether model output fails loudly before it becomes application state. |
| Retries and fallback | Validation-error retry loops, model policy tables, fallback branches, and stop conditions. | Whether repeated attempts improve the result without hiding unstable behavior. |
| Evidence and review | Artifact storage, source references, provenance records, review queues, and review outcomes. | Whether reviewers can inspect what happened and evaluators can learn from corrections. |
| Release safety | Golden examples, step-level regression tests, prompt versions, route fixtures, and evaluation gates. | Whether a prompt, model, parser, or routing change can be tested before shipping. |
Where this shows up
Code-owned orchestration matters anywhere the workflow has multiple steps, multiple document types, multiple risk levels, or reviewer-facing output.
PolicyTrace uses code-owned orchestration: parsing, masking, classification, specialist extraction, arbitration, provenance, and review are separate workflow steps.
A contract workflow would need code-owned routing for clauses, obligations, amendments, risk flags, and evidence.
An invoice workflow would need code-owned handling for suppliers, line items, totals, tax, PO matching, exceptions, and review queues.
The practical takeaway
The more important the workflow becomes, the less you should ask the model to run it. Let code own the process. Let the model do bounded work inside that process.
A code-first workflow gives the team places to test, trace, retry, review, evaluate, and improve. That is what turns a model call into a working system.
This post sits inside the Orchestration Layer. The next step is connecting code-owned workflow boundaries to typed outputs, evidence-backed claims, and regression-safe evaluation.