The Death of the 1,000-Line Prompt
Giant prompts feel productive at demo stage. In production, workflow logic needs to move into code, schemas, routing, validation, evidence, review, and evaluation.
It starts innocently. Add instructions for classification. Add extraction rules. Add edge cases. Add validation hints. Add citation requirements. Add fallbacks. Add reviewer notes. The prompt grows until nobody can tell which part controls the workflow, which part controls the model, and which part is only there because something broke last week.
That might survive a demo. It does not survive production pressure. Serious AI systems need orchestration: code-owned workflow, schema-owned boundaries, prompt-owned task instructions, and reviewable evidence when the output matters.
The prompt should not own the workflow
A prompt is a useful interface to a model. It is not a durable place to store business control flow, document routing, validation logic, evidence rules, retry policy, or release gates.
When workflow logic lives inside a giant prompt, you cannot test it, version it, debug it, or explain it cleanly. When orchestration lives in code and schemas, the model becomes one component in a controlled system.
The demo path feels faster
The giant prompt is attractive because it collapses the entire system into one artifact. It feels like progress. The problem is that every new responsibility makes it harder to change safely.
- Giant prompt: the demo path.
- Orchestrated workflow: the system path.
- Reviewable system: the trust path.
What the giant prompt hides
Most giant prompts are not long because the task is naturally long. They are long because the system has not been decomposed yet.
- The prompt decides document type, task type, and which rules apply, instead of code making those decisions explicitly.
- Intermediate facts, source context, retries, and previous decisions exist only inside model context.
- The prompt asks the model to check itself, instead of running deterministic checks against typed output.
- Citation and provenance requirements are written as prose, but not enforced as workflow artifacts.
- When the output changes, it is hard to know whether routing, extraction, validation, or prompt wording caused the regression.
- Every call carries rules for tasks that may not apply. A 4,000-token prompt used for routing, extraction, validation, and review makes even simple requests pay for the whole monolith.
Long prompts create attention and regression problems
Long context does not make every instruction equally reliable. Research on lost-in-the-middle behavior shows that model performance can depend on where relevant information appears in the input. In workflow prompts, that means a rule buried halfway through a long instruction block should not be treated like a hard system boundary.
- Critical constraints compete with unrelated instructions. The model may follow the visible shape of the task while missing a rule hidden in the middle.
- A one-line fix for an extraction edge case can change classification, citation, or validation behavior elsewhere in the same prompt.
- A small router plus specialist prompts can avoid sending every rule on every call. The exact savings depend on traffic shape, but the design lever is real.
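The router idea can be sketched in a few lines. This is an illustrative example under stated assumptions: the prompt texts, the document types, and the helper names `route` and `rough_size` are all hypothetical, the document type is assumed to come from an upstream classification step, and word count stands in for a real tokenizer.

```python
# Hypothetical prompt texts and document types, for illustration only.
SPECIALIST_PROMPTS = {
    "invoice": "Extract supplier, line items, totals, and taxes from the invoice.",
    "contract": "Extract parties, obligations, and amendment references from the contract.",
}
FALLBACK_PROMPT = "Summarize the document and flag it for human review."


def route(document_type: str) -> str:
    """Pick the task-specific prompt in code; unknown types go to the fallback."""
    return SPECIALIST_PROMPTS.get(document_type, FALLBACK_PROMPT)


def rough_size(text: str) -> int:
    """Crude size proxy (word count), not a real tokenizer."""
    return len(text.split())


# The monolith carries every rule on every call; the router sends only one prompt.
monolith = " ".join([*SPECIALIST_PROMPTS.values(), FALLBACK_PROMPT])
selected = route("invoice")
print(rough_size(selected), "words sent instead of", rough_size(monolith))
```

The routing decision is now an ordinary function that can be unit tested without calling a model.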
If a prompt change for one document type breaks another task hundreds of lines away, the prompt has become a shared mutable dependency. Split the workflow until each step can be tested with its own examples.
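Giving each step its own examples might look like the sketch below. It makes several assumptions: `classify` is a keyword stub standing in for whatever the real classification step does, and `CLASSIFY_EXAMPLES` and `run_examples` are hypothetical names rather than part of any framework.

```python
from typing import Callable


def classify(text: str) -> str:
    """Stand-in for a classification step; a real step might call a model."""
    lowered = text.lower()
    if "invoice" in lowered:
        return "invoice"
    if "agreement" in lowered:
        return "contract"
    return "unknown"


# Hypothetical regression examples owned by this one step.
CLASSIFY_EXAMPLES = [
    ("Invoice #1001 from Acme", "invoice"),
    ("Master Services Agreement", "contract"),
    ("Meeting notes", "unknown"),
]


def run_examples(step: Callable[[str], str], examples: list[tuple[str, str]]) -> list[str]:
    """Return a description of every example the step gets wrong."""
    failures = []
    for text, expected in examples:
        actual = step(text)
        if actual != expected:
            failures.append(f"{text!r}: expected {expected}, got {actual}")
    return failures


failures = run_examples(classify, CLASSIFY_EXAMPLES)
print("classification failures:", failures)
```

A change to the extraction prompt never touches these examples, so a classification regression can only come from a classification change.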
The orchestration layer owns the sequence
A production AI workflow should be readable as a system diagram, not only as a prompt. Each step should have a purpose, input, output, failure mode, and evidence path.
1. Prepare input and source structure.
2. Identify task and document type.
3. Choose specialist prompt or tool.
4. Ask the model for bounded output.
5. Check schema, formats, and rules.
6. Resolve duplicates or conflicts.
7. Attach source support and provenance.
8. Let humans verify and correct.
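In plain Python, a code-owned sequence can be as simple as an ordered list of functions over explicit state. The sketch below covers only four of the steps above for brevity, and every function body is a placeholder: the step names, the `State` dictionary, and the keyword-based `classify` are assumptions for illustration, not a prescribed implementation.

```python
from typing import Callable

State = dict[str, object]
Step = Callable[[State], State]


def prepare(state: State) -> State:
    return {**state, "text": str(state["raw"]).strip()}


def classify(state: State) -> State:
    doc_type = "invoice" if "invoice" in str(state["text"]).lower() else "unknown"
    return {**state, "doc_type": doc_type}


def extract(state: State) -> State:
    # Placeholder for a bounded model call chosen by doc_type.
    return {**state, "fields": {"doc_type": state["doc_type"]}}


def validate(state: State) -> State:
    return {**state, "valid": state["doc_type"] != "unknown"}


# The order of operations lives here, in code, not in prompt prose.
PIPELINE: list[tuple[str, Step]] = [
    ("prepare", prepare),
    ("classify", classify),
    ("extract", extract),
    ("validate", validate),
]


def run(raw: str) -> tuple[State, list[str]]:
    """Run the steps in order and record which ones executed."""
    state: State = {"raw": raw}
    trace: list[str] = []
    for name, step in PIPELINE:
        state = step(state)
        trace.append(name)
    return state, trace


final_state, trace = run("  Invoice #1001  ")
print(trace, final_state["valid"])
```

Each step reads and writes named keys, so any one of them can be replaced or tested on its own.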
A prompt should be a component
The goal is not to stop writing prompts. The goal is to stop making one prompt responsible for the entire system.
| Responsibility | Bad home | Better home |
|---|---|---|
| Workflow sequence | One prompt explains every step and asks the model to follow the sequence. | Code or workflow graph controls the order of operations. |
| Document routing | The prompt tells the model to infer which rules apply. | A classifier or deterministic router selects the task-specific extractor. |
| Output shape | The prompt says "return valid JSON" and hopes for compliance. | Structured outputs, Pydantic models, or schema validation define the boundary. |
| Business validation | The prompt asks the model to verify dates, totals, conflicts, and required fields. | Code validates known formats, required values, and deterministic business rules. |
| Evidence | The prompt asks for citations as prose. | The workflow stores field-level citations, provenance, and review context. |
| Regression isolation | One giant prompt where every edit can affect every task. | Small task prompts with evaluation examples for each workflow step. |
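
The "Output shape" and "Business validation" rows can be illustrated together. This is a hedged sketch: `InvoiceResult`, `LineItem`, `business_errors`, and `check` are hypothetical names, the one-cent tolerance is an arbitrary assumption, and the JSON strings stand in for raw model output.

```python
import json

from pydantic import BaseModel, ValidationError


class LineItem(BaseModel):
    description: str
    amount: float


class InvoiceResult(BaseModel):
    supplier: str
    line_items: list[LineItem]
    total: float


def business_errors(invoice: InvoiceResult) -> list[str]:
    """Deterministic rules that the model is never asked to verify."""
    errors = []
    if not invoice.line_items:
        errors.append("at least one line item is required")
    items_sum = sum(item.amount for item in invoice.line_items)
    if abs(items_sum - invoice.total) > 0.01:  # assumed tolerance
        errors.append(f"total {invoice.total} does not match line items {items_sum}")
    return errors


def check(raw_model_output: str) -> list[str]:
    """Parse, apply the schema, then apply business rules."""
    try:
        invoice = InvoiceResult(**json.loads(raw_model_output))
    except (json.JSONDecodeError, ValidationError, TypeError) as exc:
        return [f"schema error: {type(exc).__name__}"]
    return business_errors(invoice)


good = '{"supplier": "Acme", "line_items": [{"description": "Widget", "amount": 40.0}], "total": 40.0}'
bad_total = '{"supplier": "Acme", "line_items": [{"description": "Widget", "amount": 40.0}], "total": 45.0}'
bad_shape = '{"supplier": "Acme"}'
print(check(good), check(bad_total), check(bad_shape))
```

Note that `bad_total` is valid JSON and matches the schema. Only the deterministic rule catches it.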
A minimal orchestration contract
An orchestration layer does not need to be complicated. It needs to make state explicit enough that each step can be tested, traced, and changed without rewriting the whole prompt.
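from pydantic import BaseModel


class WorkflowStep(BaseModel):
    """Declarative contract for one step in the workflow."""

    name: str
    input_keys: list[str]  # state keys the step reads
    output_keys: list[str]  # state keys the step writes
    prompt_version: str | None = None  # prompt version, if the step uses one
    model_policy: str | None = None
    validation_rules: list[str] = []
    evidence_required: bool = False
    fallback_step: str | None = None  # name of the step to run on failure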
Anti-patterns to avoid:
- Put classification, extraction, validation, evidence, fallback, and review policy into one prompt.
- Ask the model to enforce rules that code can check deterministically.
- Hide retry behavior and fallback decisions inside prose instructions.
- Change prompt wording without regression examples.
- Use the same prompt for every document type, user state, or risk level.
- Treat "valid JSON" as the same thing as a validated workflow result.
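Retry and fallback behavior is a good candidate for moving out of prose first. The following is a minimal sketch under stated assumptions: `RetryPolicy` and `run_with_policy` are hypothetical names, `flaky_call` is a stub standing in for a model call, and the date check is deliberately simplistic.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical policy object; the field names are illustrative."""
    max_attempts: int = 2
    fallback: str = "human_review"


def run_with_policy(
    call: Callable[[int], str],
    is_valid: Callable[[str], bool],
    policy: RetryPolicy,
) -> tuple[str, str]:
    """Return (outcome, value); outcome is 'ok' or the fallback step name."""
    for attempt in range(1, policy.max_attempts + 1):
        output = call(attempt)
        if is_valid(output):
            return "ok", output
    return policy.fallback, ""


def flaky_call(attempt: int) -> str:
    """Stand-in for a model call that only succeeds on the second attempt."""
    return "2024-05-01" if attempt >= 2 else "sometime in May"


def looks_like_iso_date(value: str) -> bool:
    """Simplistic shape check: three dash-separated numeric parts."""
    parts = value.split("-")
    return len(parts) == 3 and all(part.isdigit() for part in parts)


recovered = run_with_policy(flaky_call, looks_like_iso_date, RetryPolicy(max_attempts=2))
gave_up = run_with_policy(flaky_call, looks_like_iso_date, RetryPolicy(max_attempts=1))
print(recovered, gave_up)
```

The number of attempts and the fallback destination are now values a reviewer can read and a test can assert on.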
Implementation options to test
The right orchestration approach depends on risk, state, team skill, and how much of the workflow must be inspected later. Start simple, but make the boundaries explicit.
| Need | Implementation options | What to evaluate |
|---|---|---|
| Simple predictable workflow | Plain Python or TypeScript functions with typed inputs and outputs. | Whether the sequence is readable, testable, and easy to trace before adding a framework. |
| Stateful workflow graph | LangGraph workflows or a similar explicit graph/state-machine pattern. | Whether graph state, branching, retries, and human checkpoints are easier to reason about than custom glue code. |
| Typed output boundary | Pydantic models, OpenAI structured outputs, or Instructor. | Whether model output can be parsed, validated, versioned, retried, and compared across releases. |
| Prompt versioning | Prompt registry files, explicit prompt IDs, changelogs, and evaluation gates. | Whether a prompt change can be tied to a test run and rolled back when behavior regresses. |
| Observability | Run traces, step logs, artifact storage, evaluation examples, and review outcomes. | Whether the team can answer which step failed and why, not just that the final answer was wrong. |
Where this shows up
This is the layer that turns an AI feature from a single model call into a workflow a team can operate.
PolicyTrace separates document parsing, PII masking, classification, specialist extraction, arbitration, provenance, and human review instead of asking one prompt to own the whole workflow.
A contract workflow would need routing by clause type, obligation type, amendment context, evidence requirement, and reviewer risk level.
An invoice workflow would need separate handling for supplier identity, line items, totals, taxes, PO matching, exceptions, and review queues.
The practical takeaway
Prompts are still part of production AI systems. They should be small, versioned, task-specific components inside a workflow that code can inspect, validate, and evaluate.
The safer path is to split the workflow into explicit steps, keep prompt responsibilities narrow, validate typed outputs, attach evidence, and use evaluation examples before changing behavior.
This post opens the Orchestration Layer. Its natural code-level companion, and the next post to read, is Pydantic as an AI architecture boundary.