Why Chunking Breaks Business Documents
Chunking is not just a retrieval setting. In business documents, the wrong chunks can separate values from labels, tables from headers, clauses from definitions, and evidence from the answer.
A supplier invoice might put the invoice total on page 1, the tax breakdown in a table on page 2, payment terms in small text, and purchase order references in headers or footers. A naive chunking step can capture the words while losing the relationships that make those words useful.
In business workflows, chunking decides which facts travel together, which labels survive, and whether an answer can be traced back to its evidence.
Chunking is a system design decision
The common framing is simple: make chunks small enough to embed and retrieve, with enough overlap to avoid losing context. That is a reasonable starting point for articles, support docs, and plain prose. It is not enough for contracts, invoices, insurance packs, policies, statements, schedules, and other business documents.
If the chunk boundary separates a value from its label, a clause from its definition, or a table row from its header, the retrieval system may return a fragment that looks relevant but is not reviewable.
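To make the failure concrete, here is a minimal sketch of a fixed-size character splitter run over a three-line invoice summary. The invoice text, the `split_fixed` helper, and the window sizes are all invented for illustration. Real splitters usually count tokens rather than characters, but the boundary problem is the same.

```python
def split_fixed(text: str, size: int, overlap: int = 0) -> list[str]:
    """Naive splitter: fixed-size character windows with optional overlap."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def keeps_label_with_value(chunks: list[str]) -> bool:
    """True if some chunk holds both the label and the amount it describes."""
    return any("Total payable" in c and "1,250" in c for c in chunks)


# Hypothetical invoice text, invented for illustration.
invoice_text = (
    "Subtotal: GBP 1,000\n"
    "VAT (25%): GBP 250\n"
    "Total payable: GBP 1,250\n"
)

plain = split_fixed(invoice_text, size=54)
overlapped = split_fixed(invoice_text, size=54, overlap=5)

print(keeps_label_with_value(plain))
print(keeps_label_with_value(overlapped))
```

With these inputs, the chunk that contains `1,250` does not contain `Total payable`, with or without the five-character overlap. A larger overlap would rescue this particular case, but only by accident: the splitter still has no idea that the label and the amount belong together.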
The naive pipeline hides the problem
Chunking often sits between parsing and retrieval, so it feels like plumbing. But it is where document structure can quietly disappear.
- Naive chunking: text-first
- Structure-aware chunking: workflow-first
- Reviewable chunking: trust path
What chunking breaks
Most chunking failures do not look like syntax errors. They look like confident answers built from partial context.
- A chunk contains `GBP 1,250` but not whether it is premium, excess, tax, settlement, or total payable.
- A row may be retrieved without the column names or section title that explains what each cell means.
- A contract clause can depend on a defined term, schedule, appendix, or amendment that lives far away in the document.
- An exclusion, endorsement, footnote, or caveat can be split away from the section it modifies.
- The same value type appears in several documents, but the chunk does not carry source authority or conflict context.
- The answer may be correct, but the system cannot show the page, text span, row, or source document that supports it.
A token splitter hits a table halfway through. Chunk A gets the heading and rows 1-3. Chunk B gets rows 4-8, but no column headers. The model still sees numbers, dates, and labels, but it no longer knows that column 3 means `Exclusion limit` and column 4 means `Deductible`. That is a loss of vertical context, not a language problem.
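One way to keep that vertical context is to split tables on row boundaries and repeat the headers in every chunk. The sketch below is an assumption about how that might look, not a reference to any particular library: `TableChunk`, `chunk_table`, and the sample policy table are hypothetical names and data.

```python
from dataclasses import dataclass


@dataclass
class TableChunk:
    table_id: str
    section_title: str
    page_number: int
    headers: list[str]
    rows: list[list[str]]

    def to_text(self) -> str:
        """Render each cell as 'header: value' so it keeps its column meaning."""
        lines = [f"{self.section_title} (table {self.table_id}, page {self.page_number})"]
        for row in self.rows:
            lines.append("; ".join(f"{h}: {v}" for h, v in zip(self.headers, row)))
        return "\n".join(lines)


def chunk_table(table_id, section_title, page_number, headers, rows, rows_per_chunk=3):
    """Split on row boundaries and repeat the headers in every chunk."""
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")
    return [
        TableChunk(table_id, section_title, page_number, headers, rows[i:i + rows_per_chunk])
        for i in range(0, len(rows), rows_per_chunk)
    ]


# Hypothetical policy table, invented for illustration.
headers = ["Cover", "Start date", "Exclusion limit", "Deductible"]
rows = [
    ["Buildings", "2024-01-01", "GBP 5,000", "GBP 250"],
    ["Contents", "2024-01-01", "GBP 2,000", "GBP 100"],
    ["Legal", "2024-02-01", "GBP 1,000", "GBP 50"],
    ["Travel", "2024-03-01", "GBP 500", "GBP 75"],
]

table_chunks = chunk_table("t1", "Schedule of cover", 2, headers, rows, rows_per_chunk=3)
print(table_chunks[1].to_text())
```

The second chunk holds only the last row, yet every cell still carries its column name, section title, and page. The cost is repeated header text in every chunk, which is usually a small price compared with an unlabeled number.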
A chunk should carry its contract
For business documents, a chunk needs more than content. It needs enough structure and metadata for retrieval, extraction, validation, and review to agree on what the chunk means.
- Content: the smallest coherent unit of source content.
- Structure: section, heading, table, row, column, and parent block.
- Source: document, page, coordinates, parser confidence, OCR state.
- Relations: previous, next, parent, child, appendix, definition, authority.
- Evidence: what a reviewer can inspect when the field is used.
A minimal chunk contract
The implementation does not have to look exactly like this, but the shape matters. Passing raw strings through the pipeline gives retrieval very little to work with. Passing reviewable chunk objects gives extraction, validation, evidence, and review a shared boundary.
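```python
from pydantic import BaseModel, Field


class ReviewableChunk(BaseModel):
    """A chunk that carries its content, structure, source, and evidence."""

    text: str  # the smallest coherent unit of source content
    section_path: list[str]  # heading hierarchy, outermost first
    source_document_id: str
    page_number: int
    # Four page coordinates, when the parser provides them.
    bounding_box: tuple[float, float, float, float] | None = None
    ocr_confidence: float | None = None  # None when no OCR score is available
    relations: list[str] = Field(default_factory=list)  # links to related chunks
    evidence_id: str | None = None  # handle a reviewer can follow to the source
```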
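As a rough illustration of how fields like `section_path`, `source_document_id`, and `page_number` might get populated, the sketch below walks a flat list of parsed elements and tags each text block with its current heading path. The `Element` shape is a hypothetical parser output, and `Chunk` is a plain dataclass stand-in so the example runs without third-party packages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Hypothetical parser output: a heading or a block of body text."""
    kind: str         # "heading" or "text"
    text: str
    page_number: int
    level: int = 0    # heading depth (1 = top level); unused for body text


@dataclass
class Chunk:
    text: str
    section_path: list[str]
    source_document_id: str
    page_number: int


def chunk_by_section(elements, source_document_id):
    """Emit one chunk per text block, tagged with its current heading path."""
    path: list[str] = []
    chunks: list[Chunk] = []
    for el in elements:
        if el.kind == "heading":
            # Drop headings at the same or deeper level, then descend.
            del path[el.level - 1:]
            path.append(el.text)
        else:
            chunks.append(Chunk(el.text, list(path), source_document_id, el.page_number))
    return chunks


# Hypothetical parsed policy document, invented for illustration.
elements = [
    Element("heading", "Schedule", 1, level=1),
    Element("heading", "Excess", 1, level=2),
    Element("text", "The excess is GBP 1,250 per claim.", 1),
    Element("heading", "Exclusions", 2, level=2),
    Element("text", "Flood damage is excluded.", 2),
]

section_chunks = chunk_by_section(elements, source_document_id="policy-001")
for chunk in section_chunks:
    print(" > ".join(chunk.section_path), "|", chunk.text)
```

The amount `GBP 1,250` now travels with the path `Schedule > Excess`, so downstream steps can tell it is an excess rather than a premium or a total.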
Bad chunks versus reviewable chunks
Chunk quality should be judged by workflow usefulness, not only by token count.
| Document reality | Bad chunk | Reviewable chunk |
|---|---|---|
| Invoice line item table | One chunk has the row values; another chunk has the column headers. | Each row keeps headers, section title, page number, and table identity. |
| Policy schedule and certificate disagree | Chunks retrieve two values without source authority or conflict state. | Chunks carry source document, field type, and enough metadata for arbitration. |
| Contract clause uses defined terms | The clause appears alone, detached from definitions and amendments. | The clause links to definitions, schedules, amendment references, and section hierarchy. |
| Scanned page with OCR uncertainty | The OCR text is embedded as if it were clean source text. | The chunk carries page quality, OCR confidence, and a path back to the page image. |
| Reviewer checks a field | The system shows a text snippet but not the exact source context. | The reviewer can inspect the document, page, section, and evidence used. |
Shortcuts to avoid:
- Choosing chunk size only by token limit.
- Using the same chunking strategy for policies, contracts, invoices, emails, and tables.
- Splitting tables before preserving row and column context.
- Assuming overlap fixes missing structure.
- Embedding OCR text without page quality or confidence signals.
- Evaluating only on questions that retrieve one clean paragraph.
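Several of these problems can be caught mechanically before chunks are embedded. The check below is a minimal sketch under assumed field names (`kind`, `headers`, `from_ocr`, and so on are hypothetical); a real pipeline would adapt the rules to its own chunk schema.

```python
def reviewability_gaps(chunk: dict) -> list[str]:
    """List the reasons a chunk would be hard to review; empty means none found."""
    gaps = []
    if not chunk.get("source_document_id") or chunk.get("page_number") is None:
        gaps.append("missing source document or page")
    if not chunk.get("section_path"):
        gaps.append("missing section path")
    if chunk.get("kind") == "table_rows" and not chunk.get("headers"):
        gaps.append("table rows without column headers")
    if chunk.get("from_ocr") and chunk.get("ocr_confidence") is None:
        gaps.append("OCR text without a confidence score")
    return gaps


# Hypothetical chunks, invented for illustration.
bad_chunk = {
    "kind": "table_rows",
    "text": "Contents | 2024-01-01 | GBP 2,000 | GBP 100",
    "from_ocr": True,
}
good_chunk = {
    "kind": "table_rows",
    "text": "Cover: Contents; Deductible: GBP 100",
    "headers": ["Cover", "Deductible"],
    "section_path": ["Schedule of cover"],
    "source_document_id": "policy-001",
    "page_number": 2,
    "from_ocr": True,
    "ocr_confidence": 0.97,
}

print(reviewability_gaps(bad_chunk))
print(reviewability_gaps(good_chunk))
```

A check like this does not prove a chunk is good. It only makes the most common structural losses visible early, which is cheaper than discovering them in a reviewer's queue.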
Implementation options to test
Libraries can help with splitting, but they will not decide your business boundaries for you. Treat these as implementation options after you have defined what a useful chunk must preserve.
| Need | Implementation options | What to evaluate |
|---|---|---|
| Baseline text splitting | LangChain recursive text splitter or Haystack DocumentSplitter. | Good for a baseline, but test whether labels, table headers, and evidence survive. |
| Node-based document pipelines | LlamaIndex node parsers and sentence splitters. | Whether node metadata carries section, source, and retrieval context through the pipeline. |
| Element-aware chunking | Unstructured chunking strategies after partitioning. | Whether chunking respects titles, pages, elements, tables, and document hierarchy. |
| Business-rule chunking | Custom code over parsed document elements, tables, and field schemas. | Whether chunks preserve exactly the relationships the workflow depends on. |
| Chunk evaluation | Golden examples that test retrieval, extraction, evidence, and review outcomes. | Whether hard cases fail visibly: split tables, repeated fields, OCR uncertainty, and cross-reference questions. |
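The last row can be sketched as a small golden-example harness. Everything here is an assumption made for illustration: `GoldenCase`, `evaluate`, and the toy keyword retriever stand in for whatever retrieval stack is under test. A case passes only if a single retrieved chunk contains every fact that must travel together.

```python
from dataclasses import dataclass


@dataclass
class GoldenCase:
    name: str
    question: str
    must_appear_together: list[str]  # facts one retrieved chunk must contain


def keyword_retriever(chunks, k=2):
    """Toy stand-in for a real retriever: rank chunks by shared words."""
    def retrieve(question):
        words = {w.strip("?.,").lower() for w in question.split()}
        ranked = sorted(chunks, key=lambda c: -sum(w in c.lower() for w in words))
        return ranked[:k]
    return retrieve


def evaluate(cases, retrieve):
    """Map each case name to True if one retrieved chunk holds all its facts."""
    results = {}
    for case in cases:
        retrieved = retrieve(case.question)
        results[case.name] = any(
            all(fact in chunk for fact in case.must_appear_together)
            for chunk in retrieved
        )
    return results


# Hypothetical chunks from the same table, split two different ways.
naive_chunks = [
    "Cover | Start date | Exclusion limit | Deductible\n"
    "Buildings | 2024-01-01 | GBP 5,000 | GBP 250",
    "Contents | 2024-01-01 | GBP 2,000 | GBP 100",
]
row_chunks = [
    "Cover: Buildings; Start date: 2024-01-01; Exclusion limit: GBP 5,000; Deductible: GBP 250",
    "Cover: Contents; Start date: 2024-01-01; Exclusion limit: GBP 2,000; Deductible: GBP 100",
]

cases = [
    GoldenCase(
        name="split table keeps headers",
        question="What is the deductible for Contents cover?",
        must_appear_together=["Deductible", "Contents", "GBP 100"],
    )
]

naive_results = evaluate(cases, keyword_retriever(naive_chunks))
row_results = evaluate(cases, keyword_retriever(row_chunks))
print(naive_results)
print(row_results)
```

The naive split fails the case because the row and its headers sit in different chunks, while the header-preserving split passes. Substring matching is a crude check, but it is enough to make a split-table failure show up as a failing case instead of a confident wrong answer.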
Where this shows up
Chunking is not just a RAG concern. It shapes every document workflow that has to extract facts, explain sources, and let a person review the answer.
PolicyTrace depends on document structure, field citations, provenance matching, conflict handling, and reviewable source context.
A contract workflow would need chunks that preserve clauses, definitions, obligations, amendments, schedules, and evidence.
An invoice workflow would need chunks that keep totals, tax, line items, supplier identity, PO references, and payment terms connected.
The practical takeaway
Chunking should be designed around the document's business meaning, not only around embedding size. A good chunk is useful for retrieval, extraction, validation, evidence, and review.
The safer path is to preserve document hierarchy, attach source metadata, keep related facts together, and evaluate chunks against the questions real users will ask.
This post follows the PDF extraction article. Together, they explain why document AI needs structure before retrieval, extraction, evaluation, and review can work reliably.