Page 1: Account summary, two columns. Page 15: Same account, three columns, different header names. Page 47: A scan with a coffee stain. Page 89: The totals page, which references transactions you extracted 70 pages ago.
This is a single "document."
Your extraction pipeline works beautifully on it. Page by page, every transaction lands in the right column. Amounts parse cleanly. Dates look correct. You ship it.
Then accounting calls. The credits and debits don't add up. They're off by €3,200. Not because any single page was wrong, but because page 12 uses commas for decimals and page 68 uses periods. Your parser handled both correctly, in isolation. Together, they produced two different interpretations of "1.234" and nobody noticed until the balance equation broke.
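That "two interpretations of '1.234'" problem is mechanical to reproduce. A minimal sketch (the `parse_amount` helper and its explicit `decimal_sep` argument are illustrative, not a production parser):

```python
from decimal import Decimal

def parse_amount(raw: str, decimal_sep: str) -> Decimal:
    """Parse a localized amount string given an explicit decimal separator.
    (Illustrative helper: assumes the other separator is used for grouping.)"""
    group_sep = "," if decimal_sep == "." else "."
    normalized = raw.replace(group_sep, "").replace(decimal_sep, ".")
    return Decimal(normalized)

# The same four characters, two interpretations:
parse_amount("1.234", decimal_sep=",")  # Decimal('1234')  - period groups thousands
parse_amount("1.234", decimal_sep=".")  # Decimal('1.234') - period is the decimal point
```

Both calls are "correct" for their convention; only a document-level check like the balance equation can tell you which one the statement actually meant.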
Individual page extraction can be perfect while the document as a whole is garbage.
One document, fifteen personalities
Long documents aren't monolithic. They're stitched-together sections with different layouts, different scan qualities, and different ideas about how numbers should look.
A real 120-page bank statement we processed had:
Layout shifts:
- Pages 1-10: Portrait, two-column transaction table
- Pages 11-50: Landscape, three columns with a running balance
- Pages 51-80: Back to portrait, different header names entirely
- Pages 81-120: Summary pages with aggregated totals
Quality variation:
- Pages 1-40: Native PDF, clean text layer
- Pages 41-60: Scanned at some point, needs OCR
- Pages 61-80: Scanned with handwritten annotations in the margins
- Pages 81-120: Native PDF again
Semantic surprises:
- Three separate accounts bundled into one statement
- Sub-statements nested inside the main statement
- Totals on page 118 that reference a transaction on page 3
Any extraction system that treats this as "one document, one pass" is going to have a bad time. The layout on page 11 invalidates assumptions from page 1. The scan quality on page 45 requires a completely different extraction strategy than the clean PDF on page 5.
The question isn't "did we extract each page correctly?" It's "did we extract the whole thing consistently?"

What happens when you don't split
The naive approach is to throw the whole document at one extraction pass and hope for the best. Here's what actually happens:
Context windows fill up. Even large models choke on 120 pages of transaction data. They start dropping rows around page 40, silently. No error, no warning. Just missing transactions that only surface when someone runs reconciliation.
Layout confusion compounds. The model learns the two-column layout from pages 1-10, then encounters three columns on page 11. Sometimes it adapts. Sometimes it merges two columns and produces phantom transactions with concatenated descriptions. On page 51, when the layout changes again, all bets are off.
One bad page poisons everything. That coffee-stained scan on page 47? In a single-pass extraction, a misread amount cascades. The running balance shifts, the model "corrects" subsequent transactions to match, and now you have plausible-looking data that's wrong in ways you can't easily detect.
The fix is segmentation: split the document into coherent chunks, extract each one independently, then reassemble with cross-segment validation.
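The split itself is simple once layout boundaries are known. A minimal sketch, assuming boundary pages come from some upstream layout detector (the `segment_pages` name and the one-page overlap size are illustrative):

```python
def segment_pages(boundaries, total_pages, overlap=1):
    """Split pages 1..total_pages into runs at layout boundaries,
    sharing `overlap` boundary pages between adjacent runs.
    `boundaries` lists the first page of each new layout."""
    starts = [1] + sorted(b for b in boundaries if 1 < b <= total_pages)
    segments = []
    for i, start in enumerate(starts):
        end = (starts[i + 1] - 1) if i + 1 < len(starts) else total_pages
        # Extend each run past its boundary so page-break transactions
        # are seen by both neighbours; the aggregator deduplicates later.
        segments.append((max(1, start - overlap), min(total_pages, end + overlap)))
    return segments

segment_pages([11, 51, 81], total_pages=120)
# [(1, 11), (10, 51), (50, 81), (80, 120)]
```

Note that every boundary page appears in two segments; that redundancy is deliberate, and it is what makes cross-segment deduplication possible later.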
But layout isn't the only thing that varies across 120 pages. The physical quality of the pages fights back too.
The scanner always wins
Layout changes are at least logical. Someone designed a three-column format for pages 11-50 and a two-column format for the rest. You can reason about it.
Scan quality is chaos.
A 120-page bank statement arrives. Pages 1-40 are native PDF: clean text layer, perfect coordinates, extractable down to the glyph. Then page 41 hits and the text layer vanishes. Someone printed the statement, scanned it back in, and merged the result. You're now doing OCR on pixels instead of reading text from a structured document.
That's the easy case. Here's what actually comes through the door:
The phone photo. Someone held their phone over the statement instead of using a scanner. Camera flash washes out the top-right corner, exactly where the closing balance lives. The image is sharp in the center and blurry at the edges. Overexposed highlights create white blobs that swallow digits. We detect this by measuring overexposed pixel ratios and specular highlight blobs. If more than 15% of a page is blown out, you're looking at a photo, not a scan.
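The overexposure check is cheap to sketch with NumPy (the 250 grayscale cutoff and function name are assumptions; the 15% threshold is from the text, and production code would also look at specular highlight blob shapes):

```python
import numpy as np

OVEREXPOSED = 250       # grayscale value treated as blown out (assumption)
PHOTO_THRESHOLD = 0.15  # >15% blown-out pixels suggests a photo, not a scan

def looks_like_phone_photo(gray_page: np.ndarray) -> bool:
    """Flag pages whose overexposed-pixel ratio exceeds the threshold.
    `gray_page` is a 2-D uint8 grayscale rendering of the page."""
    ratio = float(np.mean(gray_page >= OVEREXPOSED))
    return ratio > PHOTO_THRESHOLD
```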
The 75 DPI fax relic. A statement that passed through a fax machine somewhere in its life. The resolution is so low that an "8" and a "6" are indistinguishable. A "3" could be an "8." OCR confidence drops below 50%, and now you're guessing at amounts. At 75 DPI, edge sharpness collapses. The Sobel gradient magnitude that reads 15+ on a clean scan drops below 2.5. Individual characters stop having edges at all.
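Edge sharpness is just as measurable. A minimal sketch of a mean Sobel gradient score, implemented with shifted slices to stay NumPy-only (the helper name is an assumption; the 15+ and 2.5 reference values are from the text):

```python
import numpy as np

def mean_sobel_magnitude(gray: np.ndarray) -> float:
    """Mean Sobel gradient magnitude over the page interior: a rough
    sharpness score. Clean scans land well above 15; 75 DPI fax relics
    collapse below 2.5 (values from our data, per the text)."""
    g = gray.astype(np.float64)
    # 3x3 Sobel kernels applied via shifted slices (no SciPy dependency)
    gx = (g[:-2, 2:] + 2 * g[1:-1, 2:] + g[2:, 2:]
          - g[:-2, :-2] - 2 * g[1:-1, :-2] - g[2:, :-2])
    gy = (g[2:, :-2] + 2 * g[2:, 1:-1] + g[2:, 2:]
          - g[:-2, :-2] - 2 * g[:-2, 1:-1] - g[:-2, 2:])
    return float(np.mean(np.hypot(gx, gy)))
```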
The background bleed. A worn roller in the scanner's document feeder leaves a gray stripe down every page. Or worse: ink from the previous page transferred during feeding, and now page 47 has ghost text from page 46 showing through. OCR doesn't know which text layer is the real one. It extracts both, and suddenly you have phantom transactions that don't exist.
The rotated section. Pages 51-60 were scanned in landscape but the PDF metadata says portrait. The text is sideways. OCR can sometimes handle 90-degree rotation, but 2-3 degrees of skew from a misaligned page in the feeder? That turns straight table lines into slightly curved ones, and column alignment, the thing your extraction logic depends on, goes soft.
The quality cliff within a single document:
This is the real problem. It's not that some documents are bad scans. It's that page 40 is pristine and page 41 is terrible, in the same file. Your pipeline needs to handle both extremes and everything in between, and it needs to know, per page, how much to trust what it extracted.
This is where visual ML models earn their keep. Traditional OCR stares at individual characters and guesses. A vision model sees the whole page: the table structure, the column alignment, the context around that coffee-stained digit. It can read "€12,345.67" even when the decimal point is obscured, because it understands what a bank statement looks like, not just what individual glyphs decode to.
The shift from text-level OCR to page-level visual understanding is what makes degraded scans survivable. A fax relic that defeats character recognition still has visible structure: rows, columns, amounts in predictable positions. A vision model that's seen thousands of bank statements can extract from the pattern even when the pixels are rough.

The balance equation minefield
For bank statements, there's one unforgiving equation: start_balance + credits - debits = end_balance.
Four terms. Infinite ways to break it.
Cross-page references. The opening balance is on page 1. The closing balance is on page 120. The equation spans every segment, every run, every layout change in between. If any segment extracts incorrectly, the equation breaks. But which one? With 7 segments and 4,000 transactions, finding the bad data is its own problem.
Page-break transactions. A transaction description spans pages 47-48:
- Page 47: "TRANSFER TO ACME CORP"
- Page 48: "REF: 12345 - INVOICE PAYMENT"
One transaction or two? Count it twice and the balance breaks by exactly one transaction amount. The kind of error that looks plausible until someone audits. The solution is overlap: when we split the document into segments, boundary pages are shared between adjacent runs. Both runs see the transaction, but the aggregator deduplicates by page coordinates during reassembly. Only one copy survives.
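Deduplication by coordinates fits in a few lines (field names like `page` and `y` are assumptions about what the extraction step attaches to each transaction):

```python
def deduplicate(transactions):
    """Collapse duplicates produced by overlapping segment boundaries.
    Two runs that saw the same boundary page emit the same source
    coordinates, so only the first copy survives."""
    seen = set()
    unique = []
    for tx in transactions:
        key = (tx["page"], tx["y"], tx["amount"])
        if key not in seen:
            seen.add(key)
            unique.append(tx)
    return unique
```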
Locale roulette. Same bank, same statement:
- Segment 1: €1.234,56 (comma decimal)
- Segment 2: €1,234.56 (period decimal)
Parse both "correctly" according to their local convention and you get two different numbers. The balance equation doesn't care about conventions. It just sees that the sum is wrong.
Sign convention chaos. Bank A: debits are negative. Bank B: all amounts positive, with a separate D/C column. Bank C: debits in parentheses, (1,234.56). The equation assumes consistent signs. If extraction doesn't normalize before summing, credits cancel debits and the statement looks empty.
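Normalizing the three conventions onto one rule, debits negative, might look like this (a sketch; real statements need per-bank configuration, and the function name is illustrative):

```python
from decimal import Decimal

def normalize_sign(amount, raw, dc_flag=""):
    """Map a bank's sign convention onto one rule: debits negative.
    Handles parenthesized debits like '(1,234.56)' and separate D/C
    columns; explicit signs pass through unchanged."""
    text = raw.strip()
    if text.startswith("(") and text.endswith(")"):
        return -abs(amount)                      # Bank C: parentheses mark debits
    if dc_flag:
        return -abs(amount) if dc_flag.upper() == "D" else abs(amount)  # Bank B
    return amount                                # Bank A: sign already explicit
```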
The €0.02 tolerance trap. A 120-page statement might have thousands of transactions, each parsed from rendered text and carrying the risk of source-level rounding. Requiring exact match rejects valid documents. Too much tolerance hides real errors. €0.02 is what we landed on after running production data. Tight enough to catch mistakes, loose enough to survive rounding.
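With signs normalized, the check itself is a one-liner on exact Decimal arithmetic (the €0.02 value is from the text; the function name is illustrative):

```python
from decimal import Decimal

TOLERANCE = Decimal("0.02")  # tight enough to catch mistakes, loose enough for rounding

def balance_holds(start, credits, debits, end, tol=TOLERANCE):
    """Check start + credits - debits == end within a rounding tolerance.
    Decimal keeps the arithmetic exact; the tolerance only absorbs
    source-level rounding, never extraction errors."""
    return abs((start + credits - debits) - end) <= tol
```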
Strict, then smart, then desperate
When we were building Holofin's validation pipeline, we tried running all fixes at once. It masked data quality issues. Documents that should have been flagged sailed through. So we rebuilt it as three stages. The order is the design.
Stage 1: Strict. Every transaction must have a value date. Every amount must parse. The balance equation must hold within tolerance. If strict passes, the data is high quality and we're done. Most clean documents stop here.
If strict fails, we note what failed and move on.
Stage 2: Normalization. The data extracted cleanly, but conventions clash across segments. Mixed decimal separators: commas in segment 1, periods in segment 3. Ambiguous dates: 01/02/2024 could be January 2nd or February 1st depending on the bank's locale. Sign conventions that flip between segments. Normalization unifies these before re-running the balance check.
Stage 3: Human review. When automation can't resolve it, the document gets flagged for an operator. Not "something is wrong, good luck," but "segment 4, pages 41-60, the credits are €3.20 higher than expected. Start there."
Why this order matters: strict first preserves data quality. We only normalize when strict fails. We only involve a human when normalization isn't enough.
Running everything at once would mask problems. A document that needed human review should be treated differently from one that passed strict. The order is the quality signal.
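The stage ordering can be made concrete. A minimal sketch in which `strict_ok` stands in for the full strict stage, `normalize` is a caller-supplied convention fixer, and amounts are signed with debits negative (all names and structures here are assumptions):

```python
from decimal import Decimal

TOLERANCE = Decimal("0.02")

def strict_ok(doc):
    """Stage 1: every transaction has a date and a parsed amount,
    and the balance equation holds within tolerance."""
    if any(tx["amount"] is None or tx["date"] is None for tx in doc["transactions"]):
        return False
    total = sum((tx["amount"] for tx in doc["transactions"]), Decimal("0"))
    return abs(doc["start"] + total - doc["end"]) <= TOLERANCE

def validate(doc, normalize):
    """Run the stages in order; the stage that resolves the document
    is recorded as its quality signal."""
    if strict_ok(doc):
        return "strict"          # clean documents stop here
    doc = normalize(doc)         # unify separators, dates, signs
    if strict_ok(doc):
        return "normalized"      # conventions clashed, data was fine
    return "human_review"        # automation exhausted; flag for an operator
```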
When most of the document is fine
You've segmented the 120-page monster. Each segment extracted independently. Quality scores attached per page. Now what?
The segments need to become one document again. Opening balance from the first segment, closing balance from the last, every credit and debit summed across all of them. Then the balance equation runs one final time, not per segment, but across the whole thing.
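Reassembly in sketch form, assuming ordered, already-deduplicated segments (field names are illustrative):

```python
from decimal import Decimal

def reassemble(segments):
    """Stitch independently extracted segments back into one document:
    opening balance from the first, closing from the last, credits and
    debits summed across all, then one document-level balance check."""
    start = segments[0]["start_balance"]
    end = segments[-1]["end_balance"]
    credits = sum((tx for s in segments for tx in s["credits"]), Decimal("0"))
    debits = sum((tx for s in segments for tx in s["debits"]), Decimal("0"))
    delta = (start + credits - debits) - end
    return {"start": start, "end": end, "delta": delta,
            "ok": abs(delta) <= Decimal("0.02")}
```

The `delta` field matters as much as the pass/fail flag: when the check fails, its sign and size are the first clues to which segment stumbled.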
When it passes, you're done. When it fails, the interesting question is where.
This is where most pipelines make the wrong call. The balance is off by €3.20, so the whole document gets rejected. The operator re-uploads, re-extracts, re-validates, all 120 pages, because one segment stumbled.
That's the wrong answer. In production, most documents are mostly correct.
Say segment 4 out of 7 has the discrepancy. The other six segments are clean, validated, reconciled, exportable. We don't throw them away. We flag segment 4, highlight the mismatch, and let the operator resolve the conflict with the source PDF side by side. The good work stays good. The bad segment gets human attention.


The operator sees exactly what needs fixing. Not "your document failed," but "segment 4, pages 41-60, the credits are €3.20 higher than expected, and page 47 had low OCR confidence (34%). Start there."
That's the difference between a pipeline that processes documents and one that respects the operator's time.
The principles that survive production
If you're building something similar, these are the lessons that stuck:
- Segment before you extract. Treat layout changes as boundaries, not noise.
- Measure before you trust. Per-page quality scores tell you where extraction is fragile.
- Validate across segments, not just within. A page can be correct while the document is wrong.
- Degrade gracefully, in order. Strict validation first. Normalization second. Human review last. The order encodes quality.
- Log everything the user didn't ask for. System corrections are invisible in the UI and visible in the audit trail.
Bank statements are the stress test: thousands of transactions, multiple layouts, scan quality that shifts mid-document, and a balance equation that demands perfection across all of it. If your pipeline survives that, single-page invoices are a rounding error.
Extraction is solved on page 1. Consistency is solved on page 120. And the space between them, the layout shifts, the coffee stains, the fax relics, the sign conventions, is where naive pipelines go to die and careful engineering earns its keep.
Build for the 120-page monster from day one. The single-pagers will take care of themselves.