Why One Model Should Not Handle Every AI Task
Using one model for every AI task turns model choice into a hidden default. Production workflows need explicit model policies for task type, risk, latency, evidence, fallback, and evaluation.
The problem is not that capable models are bad. It is that one model for every task means every task pays the same latency and cost and shares the same failure modes, whether it is classifying a document type or reasoning through conflicting clauses across multiple sources.
A contract analysis workflow that routes classification, extraction, clause matching, risk assessment, and reviewer summaries through the same model looks simple at first. In production, the fast tasks wait behind slow tasks, cheap tasks pay for expensive reasoning, and failures become harder to isolate because every trace looks like one generic model call.
The one-model approach feels like simplicity. In production, it is complexity deferred.
Each task should have an explicit policy for model tier, context budget, output limit, evidence requirement, fallback behavior, latency tier, and review conditions.
What the one-model trap actually costs
The cost is not only the model bill. It shows up as added latency, under-served hard cases, blurred traces, and workflow behavior that cannot adapt to risk.
Classification and routing steps need consistency, speed, and a constrained label set, not broad reasoning capability.
Risk assessment and conflict handling need stricter evidence, stronger reasoning, and clearer escalation than routine extraction.
A workflow with no fast path makes cheap, synchronous steps wait behind expensive reasoning calls.
When every task uses the same model and prompt, traces cannot easily show which task introduced the error.
Model choice as routing policy
The fix is not to find the perfect single model. It is to make model selection explicit, per-task, testable, and tied to workflow state.
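from pydantic import BaseModel

class ModelPolicy(BaseModel):
    task: str                # workflow task this policy governs
    model_tier: str          # e.g. a fast tier or a strong tier
    max_input_tokens: int    # context budget for the call
    max_output_tokens: int   # output limit for the call
    evidence_required: bool  # output must cite source evidence
    review_required: bool    # output must pass human review
    fallback_policy: str     # e.g. return partial output, stop, or ask for input
    latency_tier: str        # e.g. fast, slow, batch, or background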
Every call can record which policy governed it, what token budget was allocated, whether evidence was required, and what fallback was configured. A model change becomes something the team can evaluate instead of guess.
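A minimal sketch of how such a policy could be looked up and attached to a trace. It mirrors the ModelPolicy fields with a standard-library dataclass so it runs without dependencies. The table contents, the `POLICY_VERSION` string, and the helper names `policy_for` and `trace_record` are assumptions made for illustration, not part of any particular framework.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ModelPolicy:
    # Same fields as the ModelPolicy model above, as a stdlib dataclass.
    task: str
    model_tier: str
    max_input_tokens: int
    max_output_tokens: int
    evidence_required: bool
    review_required: bool
    fallback_policy: str
    latency_tier: str


# Hypothetical version label and policy table; every value is illustrative.
POLICY_VERSION = "v1-example"
POLICIES = {
    "classification": ModelPolicy(
        task="classification", model_tier="fast",
        max_input_tokens=2_000, max_output_tokens=50,
        evidence_required=False, review_required=False,
        fallback_policy="route_unknown", latency_tier="fast",
    ),
    "risk_assessment": ModelPolicy(
        task="risk_assessment", model_tier="strong",
        max_input_tokens=30_000, max_output_tokens=1_500,
        evidence_required=True, review_required=True,
        fallback_policy="escalate_to_review", latency_tier="background",
    ),
}


def policy_for(task: str) -> ModelPolicy:
    """Return the policy for a task; unknown tasks fail loudly, with no silent default."""
    try:
        return POLICIES[task]
    except KeyError:
        raise ValueError(f"no model policy defined for task {task!r}") from None


def trace_record(task: str, call_id: str) -> dict:
    """Build the trace entry stored alongside a model call."""
    return {
        "call_id": call_id,
        "policy_version": POLICY_VERSION,
        **asdict(policy_for(task)),
    }
```

Raising on an unknown task is a deliberate choice in this sketch: a missing policy surfaces as an error instead of quietly becoming a new hidden default.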
Route by task type
Different task types in a production workflow have different requirements. Treating them as a spectrum makes routing decisions concrete.
A routing table reveals the workflow
Building a model policy table forces explicit answers to questions most teams defer until something breaks.
| Question | What it reveals | Policy decision |
|---|---|---|
| Which tasks are latency-sensitive? | Whether the workflow has fast, slow, batch, and background paths. | Assign latency tiers instead of routing all calls through the same queue. |
| Which tasks require evidence? | Whether outputs can be traced to source context and reviewed safely. | Require evidence refs for extraction, reasoning, and risk routes. |
| Which tasks escalate to review? | Whether the workflow admits that some model outputs need human authority. | Define review thresholds by risk, evidence gap, confidence, and impact. |
| Which tasks have route-level evals? | Whether model policy can change without relying only on end-to-end smoke tests. | Maintain golden examples by task, tier, prompt, schema, and risk level. |
| Which tasks can fail safely? | Whether the system knows when to return partial output, stop, or ask for input. | Attach fallback policy to each route rather than hiding it in the prompt. |
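
The review-threshold row can be made concrete with a small code-owned check. This is a sketch under stated assumptions: the `RouteResult` shape, the three-level risk scale, and the confidence thresholds are all hypothetical, and real thresholds would have to come from route-level evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteResult:
    task: str
    risk_level: str          # "low", "medium", or "high" (assumed scale)
    confidence: float        # 0.0 to 1.0, however the workflow estimates it
    evidence_refs: tuple     # source references attached to the output
    evidence_required: bool  # taken from the route's policy


# Hypothetical minimum confidence per risk level.
MIN_CONFIDENCE = {"low": 0.5, "medium": 0.7, "high": 0.9}


def review_reasons(result: RouteResult) -> list[str]:
    """Return the reasons an output needs human review; an empty list means none."""
    reasons = []
    if result.evidence_required and not result.evidence_refs:
        reasons.append("evidence_gap")
    if result.confidence < MIN_CONFIDENCE[result.risk_level]:
        reasons.append("low_confidence")
    if result.risk_level == "high":
        reasons.append("high_impact")
    return reasons
```

Returning reasons instead of a bare boolean keeps the escalation explainable in the trace: a reviewer can see whether an item arrived because of missing evidence, low confidence, or impact alone.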
A model policy table beats a model default
The system should be able to explain why a task used a cheap model, a strong model, a short prompt, a long context, or mandatory review.
| Task | Routing policy | Evaluation signal |
|---|---|---|
| Intent or document classification | Use a fast model with strict labels, small context, and an unknown route. | Route accuracy, unknown rate, downstream correction rate. |
| Field extraction | Use structured output, schema validation, source evidence, and bounded retries. | Field accuracy, validation failures, evidence coverage. |
| Risk judgement | Use a stronger model policy, stricter evidence requirements, and a human checkpoint for high-impact cases. | Reviewer disagreement, false-safe rate, escalation rate. |
| Drafting or summarization | Choose model tier by audience, source complexity, stakes, and allowed latency. | Review edits, factual support, length, clarity, latency. |
| Fallback and exception handling | Use policy code first; ask the model only for bounded diagnosis or explanation. | Fallback success, stop correctness, review resolution time. |
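
As one illustration of the classification row, the sketch below maps raw model text onto a strict label set and sends anything unexpected to an unknown route. The label names and next-step names are hypothetical; the point is that code, not the model, owns the mapping from label to next step.

```python
from enum import Enum


class DocType(str, Enum):
    # Hypothetical label set for a document classification route.
    CONTRACT = "contract"
    AMENDMENT = "amendment"
    INVOICE = "invoice"
    UNKNOWN = "unknown"


def parse_label(raw_model_output: str) -> DocType:
    """Map raw model text to a strict label; anything unexpected becomes UNKNOWN."""
    cleaned = raw_model_output.strip().lower()
    try:
        return DocType(cleaned)
    except ValueError:
        return DocType.UNKNOWN


# Code-owned routing table: label -> name of the next workflow step (hypothetical).
NEXT_STEP = {
    DocType.CONTRACT: "extract_clauses",
    DocType.AMENDMENT: "amendment_reasoning",
    DocType.INVOICE: "extract_line_items",
    DocType.UNKNOWN: "human_triage",
}


def route(raw_model_output: str) -> str:
    """Pick the next step from the validated label, never from free-form model text."""
    return NEXT_STEP[parse_label(raw_model_output)]
```

Counting how often `parse_label` returns `DocType.UNKNOWN` gives the unknown rate listed as an evaluation signal for this route.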
The trap at the other end
The answer is not infinite routing. A workflow can be over-routed into too many bespoke paths, each with its own policy, prompt, and eval suite.
Split tasks when their quality, cost, latency, risk, evidence, or failure modes need independent control. Consolidate paths when the operational complexity of maintaining the routing layer exceeds the value it creates.
Anti-patterns to avoid at either end:
- Use the strongest model for every task because routing feels inconvenient.
- Use the cheapest model everywhere and hide quality gaps behind review.
- Let the model choose its own next model, tool, or risk tier without code-owned policy.
- Evaluate model quality only at the final answer level.
- Ignore latency tiers when the workflow has user-facing steps.
- Create so many routes that the routing layer becomes harder to operate than the workflow.
Implementation options to test
Start with explicit policy tables before adopting complex routing systems. The first win is making model choice visible.
| Need | Implementation options | What to evaluate |
|---|---|---|
| Simple routing | Typed route enums, model policy tables, per-task prompt IDs, and traceable policy versions. | Whether every model call can explain why it used that policy. |
| Cost control | Token budgets by route, model tier caps, and cost per trusted completed unit. | Whether cheap paths stay cheap without pushing errors to review. |
| Latency tiers | Fast path, slow path, background path, and review path. | Whether user-facing tasks avoid unnecessary slow calls. |
| Risk-based escalation | Evidence requirements, stronger model policy, and human review for high-impact routes. | Whether high-risk tasks receive stricter treatment consistently. |
| Route evaluation | Golden examples by task, model tier, context budget, prompt version, and risk level. | Whether model policy changes are safe before release. |
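
Route evaluation can start very small. The sketch below scores golden examples per task so that a policy change is judged route by route. The example data, the exact-match scoring, and the `run_task` callable (a stand-in for the real model call under a candidate policy) are assumptions for illustration.

```python
from collections import defaultdict
from typing import Callable

# Hypothetical golden examples: (task, input text, expected output).
GOLDEN = [
    ("classification", "Master services agreement between two parties", "contract"),
    ("classification", "Invoice for consulting services, total due", "invoice"),
    ("extraction", "Term: 24 months", "24 months"),
]


def evaluate_routes(golden, run_task: Callable[[str, str], str]) -> dict:
    """Return exact-match accuracy per task."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for task, text, expected in golden:
        total[task] += 1
        if run_task(task, text) == expected:
            passed[task] += 1
    return {task: passed[task] / total[task] for task in total}


def release_blockers(scores: dict, thresholds: dict) -> list:
    """List tasks whose score is below threshold; tasks without a threshold require 1.0."""
    return sorted(
        task for task, score in scores.items()
        if score < thresholds.get(task, 1.0)
    )


def stub_run_task(task: str, text: str) -> str:
    # Stand-in for a real model call, used only to exercise the harness.
    if task == "classification":
        return "invoice" if "invoice" in text.lower() else "contract"
    return "unknown"
```

With the stub above, `evaluate_routes(GOLDEN, stub_run_task)` scores classification at 1.0 and extraction at 0.0, so `release_blockers` would flag only the extraction route. Exact match is the simplest possible scorer; drafting or summarization routes would need a different signal.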
Where this shows up
Model routing appears anywhere an AI workflow has multiple task types, latency needs, or risk levels.
PolicyTrace separates classification, specialist extraction, arbitration, provenance, and review, giving each step a different model-policy shape.
A contract workflow would need different policies for clause routing, obligation extraction, risk flags, amendment reasoning, and review support.
An invoice workflow would not need the same model tier for supplier matching, line-item extraction, tax ambiguity, PO matching, and exceptions.
The practical takeaway
A production AI system that runs every task through the same model is making a default choice, not a policy choice.
Once model choice becomes workflow policy, the team can tune quality, cost, latency, review, and risk independently, and trace failures to task boundaries instead of blaming the workflow as a whole.
Model routing reduces waste only if the workflow also controls prompt size, context, retries, and output length.
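Two of those controls, context size and retries, can be enforced in plain code around the model call. This is a rough sketch: `count_tokens`, `call`, and `validate` are hypothetical callables supplied by the workflow, and the whitespace-based token count in the usage note is only an approximation of a real tokenizer.

```python
from typing import Callable, Optional


def trim_context(chunks: list, max_input_tokens: int,
                 count_tokens: Callable[[str], int]) -> list:
    """Keep chunks in priority order until the next one would exceed the input budget."""
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_input_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept


def call_with_retries(call: Callable[[], str],
                      validate: Callable[[str], bool],
                      max_attempts: int) -> Optional[str]:
    """Call the model at most max_attempts times; return the first valid output or None."""
    for _ in range(max_attempts):
        output = call()
        if validate(output):
            return output
    return None  # the caller applies the route's fallback policy
```

For example, `trim_context(["a b c", "d e", "f g h i"], 5, lambda s: len(s.split()))` keeps the first two chunks and drops the third. Returning `None` after the last attempt hands control back to policy code instead of letting retries grow without bound.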