Token Economics: Why Prompt Bloat Kills AI Margins
Token waste is architecture debt. Oversized prompts, broad context, verbose outputs, retries, review, and eval runs compound into the cost per trusted completed unit.
A document classification workflow can look clean in development: one model call, one structured prompt, reliable output. Then production volume arrives and three quiet decisions compound.
The system prompt carries routing and fallback instructions for many task types, even when the current call needs only one. The retrieval window uses a broad default, even when the answer is near the top of the document. The retry loop resubmits the full prompt and full context after validation failures. None of these choices look dramatic in isolation. Together, they turn token cost into architecture debt.
Token economics is not about optimizing pennies per call. It is about recognizing that every workflow decision has a token cost that compounds at production volume.
When routing, validation, context assembly, retry behavior, review, and evaluation are invisible, token cost becomes invisible too. The bill is often the first place the architecture tells the truth.
Where waste actually compounds
Token cost rarely comes from one obvious mistake. It accumulates across layers, and each layer multiplies the ones before it.
A bloated prompt plus a broad retrieval window plus a recurring retry path means each repair attempt pays for the same excess again. The retry token share is often where prompt and schema problems show up first.
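The compounding can be made concrete with a small back-of-the-envelope model. This is a minimal sketch, not a measurement: it assumes every retry resends the full prompt and context, that each attempt fails independently at the same rate, and every number in it is hypothetical.

```python
def expected_tokens_per_unit(prompt_tokens, context_tokens, output_tokens,
                             failure_rate, max_retries):
    """Expected tokens for one unit of work when every retry resends everything."""
    per_attempt = prompt_tokens + context_tokens + output_tokens
    # Attempt k (0-based) only runs if the k attempts before it all failed.
    expected_attempts = sum(failure_rate ** k for k in range(max_retries + 1))
    return per_attempt * expected_attempts


# Hypothetical numbers, chosen only to show the shape of the effect.
bloated = expected_tokens_per_unit(3000, 8000, 400, failure_rate=0.2, max_retries=2)
lean = expected_tokens_per_unit(600, 1500, 400, failure_rate=0.2, max_retries=2)
print(f"bloated: {bloated:.0f} tokens per unit, lean: {lean:.0f} tokens per unit")
```

The retry multiplier is identical in both cases (1 + 0.2 + 0.04 = 1.24), so the retry path does not create the waste. It multiplies whatever excess the prompt and context already carry.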
Prompt bloat hides architecture decisions
The most reliable sign of a prompt that has grown beyond its purpose is that it contains logic that would be cheaper and more reliable as code.
- Move document type, task choice, and route selection into code so every call does not pay for every branch.
- Move deterministic checks into schemas, validators, and acceptance gates rather than long natural-language instructions.
- Move retry, review, unsupported, and stop behavior into workflow policy with observable state.
- Send the smallest source context that can support the claim, not the largest context that might contain it.
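One way to picture the first and last of these moves is a route table that lives in code. The sketch below is illustrative only: the route names, prompts, and context limits are hypothetical, and truncating by character count stands in for whatever evidence selection a real workflow would use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    name: str
    system_prompt: str      # only the instructions this task needs
    max_context_chars: int  # crude stand-in for a real token or evidence limit


# Hypothetical routes; a real table would come from the workflow's own task types.
ROUTES = {
    "invoice": Route("invoice_extraction", "Extract the invoice fields as JSON.", 4000),
    "contract": Route("contract_clause", "Extract the named clause as JSON.", 12000),
}


def build_request(doc_type: str, source_text: str) -> Optional[dict]:
    """Pick the route in code, so the prompt never carries the other branches."""
    route = ROUTES.get(doc_type)
    if route is None:
        # Unsupported input stops here, in code, without spending any tokens.
        return None
    return {
        "route": route.name,
        "system": route.system_prompt,
        "context": source_text[: route.max_context_chars],
    }
```

With this shape, adding a task type means adding a table entry, not lengthening a prompt that every other call also pays for.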
Classification can be cheap and narrow. Extraction can be schema-bound and evidence-focused. Reasoning can spend more tokens only when the workflow actually needs judgment.
The one metric that matters
Cost per model call is the wrong denominator. It tells you how much each call costs, not how much each useful outcome costs.
The better denominator is cost per trusted completed unit: the total token cost, including retries, review, and evaluation, required to produce one result that passed validation, carried evidence, and did not require correction.
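As a rough sketch of the arithmetic, the function below divides total spend by the number of trusted results. The cost categories follow the definition above; the two routes and all of their figures are hypothetical.

```python
def cost_per_trusted_unit(model_cost, retry_cost, review_cost, eval_cost, trusted_units):
    """Total spend divided by results that passed validation and needed no correction."""
    if trusted_units <= 0:
        raise ValueError("no trusted units were produced")
    return (model_cost + retry_cost + review_cost + eval_cost) / trusted_units


# Hypothetical comparison over 1,000 documents, in arbitrary currency units.
cheap_route = cost_per_trusted_unit(
    model_cost=100, retry_cost=30, review_cost=400, eval_cost=20, trusted_units=800)
strong_route = cost_per_trusted_unit(
    model_cost=300, retry_cost=10, review_cost=100, eval_cost=20, trusted_units=950)
print(cheap_route, strong_route)
```

In this made-up case the route with the lower model bill costs more per trusted result, because review and retries dominate once they are counted.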
| Metric | What it reveals | Better decision |
|---|---|---|
| Tokens per route | Which task paths consume the most context and output. | Split routes, shrink prompts, or choose task-specific model policy. |
| Retry token share | How much spend goes to the workflow repairing its own failures. | Improve schema prompts, evidence selection, task scope, or fallback policy. |
| Review-adjusted cost | Which cheap-looking routes create expensive human correction. | Change model tier, source scope, evidence requirements, or review routing. |
| Evidence quality per token | Whether more context actually improves source support. | Improve chunking, retrieval, deduplication, or provenance matching. |
| Eval run cost | How expensive it is to safely change prompts, models, and schemas. | Use targeted eval slices and route-specific regression suites. |
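The first two metrics in the table can be derived from per-call records. The following is a minimal sketch that assumes each record carries a route name, token counts, and an attempt number where 0 is the first try; those field names are assumptions, not an established schema.

```python
from collections import defaultdict


def route_metrics(calls):
    """Aggregate tokens per route and the share of those tokens spent on retries."""
    totals = defaultdict(lambda: {"tokens": 0, "retry_tokens": 0})
    for call in calls:
        tokens = call["input_tokens"] + call["output_tokens"]
        bucket = totals[call["route"]]
        bucket["tokens"] += tokens
        if call["attempt"] > 0:  # anything after the first try is repair spend
            bucket["retry_tokens"] += tokens
    return {
        route: {
            "tokens": t["tokens"],
            "retry_share": t["retry_tokens"] / t["tokens"] if t["tokens"] else 0.0,
        }
        for route, t in totals.items()
    }


# Hypothetical call records.
calls = [
    {"route": "classification", "input_tokens": 300, "output_tokens": 10, "attempt": 0},
    {"route": "extraction", "input_tokens": 1000, "output_tokens": 200, "attempt": 0},
    {"route": "extraction", "input_tokens": 1000, "output_tokens": 200, "attempt": 1},
]
print(route_metrics(calls))
```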
Token budgets should be workflow policy
Once cost per trusted completed unit is measurable by route, token budgets stop being accounting constraints and become design decisions.
| Route | Token policy | Quality signal |
|---|---|---|
| Classification | Minimum context, strict label set, short output. | If it needs a large context or powerful model, decompose the task. |
| Extraction | Targeted source snippets, structured output, field-level evidence refs. | If fields lack citations, retrieval is too broad, too narrow, or poorly ranked. |
| Reasoning | Higher model tier and context budget only for conflict, risk, or synthesis tasks. | If reasoning routes spike review, evidence or authority rules are weak. |
| Summarization | Audience-specific source scope and length cap. | If reviewers need missing details, the cap or source scope is too tight. |
| Evaluation | Route-specific examples and targeted regression slices. | If eval cost blocks testing, the suite is not segmented enough. |
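Expressed as code, a route budget is a small policy object that the workflow can check before or after a call. The ceilings and tier names below are placeholders invented for illustration; real values would come from measured traffic on each route.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RouteBudget:
    max_input_tokens: int
    max_output_tokens: int
    model_tier: str


# Hypothetical ceilings, not recommendations.
ROUTE_BUDGETS = {
    "classification": RouteBudget(1_000, 20, "small"),
    "extraction": RouteBudget(4_000, 500, "small"),
    "reasoning": RouteBudget(16_000, 1_500, "large"),
    "summarization": RouteBudget(8_000, 400, "small"),
}


def budget_violations(route: str, input_tokens: int, output_tokens: int) -> List[str]:
    """Return a human-readable list of ceilings this call exceeded."""
    budget = ROUTE_BUDGETS[route]
    problems = []
    if input_tokens > budget.max_input_tokens:
        problems.append(f"{route}: input {input_tokens} > {budget.max_input_tokens}")
    if output_tokens > budget.max_output_tokens:
        problems.append(f"{route}: output {output_tokens} > {budget.max_output_tokens}")
    return problems
```

A violation is better treated as a design signal than as a hard failure: a classification call that keeps exceeding its ceiling is a candidate for decomposition, as the table suggests.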
The tension worth naming
Every token optimization exists in tension with output quality, evidence reliability, review burden, and latency. The answer is not always "make it shorter."
- Tighter retrieval windows reduce spend, but can miss the passage that would have grounded the answer.
- Shorter answers reduce generation cost, but can remove reviewer context when review is part of the workflow.
- Smaller prompts are cheaper, but require cleaner routing and stronger validation upstream.
The useful question is not whether the prompt is too long in the abstract. It is whether reducing it changes retry rate, review rate, evidence quality, or regression behavior on the cases that matter.
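That question can be turned into a simple before-and-after check on a fixed evaluation slice. The sketch below assumes three signals are already measured as rates between 0 and 1; the metric names, the tolerance, and the sample values are all hypothetical.

```python
def regressed_signals(before, after, tolerance=0.01):
    """List quality signals that got worse by more than `tolerance`."""
    higher_is_worse = ("retry_rate", "review_rate")
    lower_is_worse = ("evidence_quality",)
    worse = [name for name in higher_is_worse
             if after[name] - before[name] > tolerance]
    worse += [name for name in lower_is_worse
              if before[name] - after[name] > tolerance]
    return worse


# Hypothetical measurements on the same slice, before and after shortening a prompt.
baseline = {"retry_rate": 0.05, "review_rate": 0.10, "evidence_quality": 0.92}
shorter_prompt = {"retry_rate": 0.12, "review_rate": 0.10, "evidence_quality": 0.92}
print(regressed_signals(baseline, shorter_prompt))  # ['retry_rate']
```

A shorter prompt that raises the retry rate may cost more in total than the prompt it replaced, which is why the check looks at outcomes and not at prompt length.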
Anti-patterns to avoid:
- Use one long prompt for every route because it is easier during development.
- Send entire documents when a field-level extraction needs one section.
- Let retries grow invisibly without cost, latency, and failure-reason tracking.
- Measure model spend without including review corrections and eval runs.
- Optimize for cheaper calls while increasing fallback and reviewer burden.
- Cut evidence context so aggressively that the system becomes cheap but untrustworthy.
What to build first
Do not start with a spreadsheet of vendor prices. Start by making token use visible by route and outcome.
| Need | Implementation options | What to evaluate |
|---|---|---|
| Token tracing | Record input tokens, output tokens, route, model version, schema version, retry count, retry trigger, validation status, and review outcome. | Whether cost can be tied to trusted workflow outcomes. |
| Route budgets | Define token ceilings by task type: classification, extraction, reasoning, summarization, evaluation. | Whether each route has the budget it needs, not a shared worst-case default. |
| Prompt decomposition | Move routing logic, validation rules, and fallback conditions from prompt prose into code and schemas. | Whether smaller prompts preserve quality and reduce regression surface. |
| Context control | Use section-aware retrieval, source deduplication, evidence windows, and field-specific context. | Whether context shrinks without lowering evidence quality. |
| Targeted evaluation | Build route-specific regression slices from known failures, schema edges, and reviewer corrections. | Whether changes can be tested cheaply enough to run often. |
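Token tracing is the natural first build, and it can start as one structured log line per model call. The record below is a minimal sketch using the fields listed in the table; the class name, the sample values, and the choice of JSON lines as the log format are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TokenTrace:
    route: str
    model_version: str
    schema_version: str
    input_tokens: int
    output_tokens: int
    retry_count: int
    retry_trigger: Optional[str]   # why the retry happened, if any
    validation_status: str
    review_outcome: Optional[str]  # filled in later, once review has happened


def to_log_line(trace: TokenTrace) -> str:
    """Serialize one trace as a JSON line for whatever log pipeline is in use."""
    return json.dumps(asdict(trace), sort_keys=True)


# Hypothetical example record.
trace = TokenTrace(
    route="extraction", model_version="model-v1", schema_version="invoice-3",
    input_tokens=1840, output_tokens=212, retry_count=1,
    retry_trigger="schema_validation_failed", validation_status="passed",
    review_outcome=None,
)
print(to_log_line(trace))
```

Keeping route, retry, validation, and review fields on the same record is what lets cost be joined to outcomes later without reconstructing them from separate systems.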
Where this shows up
Token economics matters most when the workflow runs repeatedly, has review costs, or must be evaluated before changes.
PolicyTrace separates parsing, classification, specialist extraction, arbitration, provenance, and review, which makes token use easier to tie to workflow steps and trusted outputs.
A contract workflow would need route budgets for clause retrieval, amendment context, obligation extraction, risk reasoning, and reviewer summaries.
An invoice workflow would need cheap high-volume extraction with stronger paths only for exceptions, PO mismatches, tax ambiguity, and review.
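For the invoice case, the split between the cheap path and the stronger path can be a few deterministic checks that run before any model call. This is only a sketch: the field names and the two exception rules are hypothetical, and a real workflow would define its own.

```python
from typing import Dict, List, Tuple


def choose_path(invoice: Dict) -> Tuple[str, List[str]]:
    """Return the path to use and the reasons for any escalation."""
    reasons = []
    po_total = invoice.get("po_total")
    if po_total is not None and po_total != invoice.get("invoice_total"):
        reasons.append("po_mismatch")
    if invoice.get("tax_rate") is None:
        reasons.append("tax_ambiguity")
    # Only invoices with a recorded reason pay for the stronger path.
    return ("escalated" if reasons else "cheap", reasons)


print(choose_path({"invoice_total": 100, "po_total": 100, "tax_rate": 0.2}))
print(choose_path({"invoice_total": 120, "po_total": 100, "tax_rate": None}))
```

Returning the reasons alongside the path keeps escalation observable, so the share of invoices taking the expensive route can be tracked per trigger.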
The architecture implication
The token budget is a forcing function that makes the architecture visible.
A workflow decomposed into cheap classification, targeted extraction, evidence-bounded reasoning, structured output, and targeted evaluation is not just cheaper. It is easier to change, test, trace, and improve.
After routing and token budgets, the next tempting optimization is caching. That is useful, but only if freshness, evidence, model versions, and policy changes are respected.