Your LLM Isn't a Document Pipeline

Why models make great assistants inside document pipelines, but terrible pipelines themselves.

Greg T · Engineering · 7 min read · Sep 21, 2025

There's a moment in every AI project where the demo looks so good that your brain quietly starts deleting code. You watch a model "read" a bank statement and think: this is it. We can skip OCR. We can skip layout parsing. Maybe we can skip half the pipeline. In the movie version, someone presses Enter and JSON waterfalls out of the cloud.

In reality, the waterfall is a drip. And somebody still has to bring a mop.

We're writing this from the scar tissue of trying to make LLMs and VLMs do end-to-end extraction on financial PDFs. Not a carefully curated three-pager with perfect kerning, but the unglamorous PDFs that show up in production: 120 pages, multiple layouts, a few scans in the middle, and a coffee ring giving you a helpful watermark around page 42.

If you want the ending up front: models make great assistants inside a document pipeline. They make terrible pipelines.


The first hit

My first success was almost cinematic. I showed a vision-language model a clean page. "Give me date, merchant, amount, balance." It obliged, politely, in JSON. The table boundaries looked right. The decimal marks were correct. I wrote a Slack message to the team that started with "What if…"

This is the part where the camera cuts to a montage of me trying "What if…" on a real statement.

Same bank, different template. Totals renamed from "This period" to "Subtotal." Debits printed as positives. A page break that slices a row description in half. A header slogan that the model decides is the last line of a transaction. A scan with light gray gridlines that vanish just enough to confuse row detection. Missed rows, altered content, repetitions.

Helpful is the problem. Helpful is not auditable.

Here’s a concrete example of where VLMs stumble even with Mistral AI’s “world’s best OCR model” claim.

Ground‑truth table from the original document

Mistral OCR output on a bank‑statement table

Compare the two: the Credit column disappears entirely and several values land in the wrong columns, a classic table-structure failure.

In contrast, Holofin's models are trained specifically for financial table structure. On these pages, which sit at the lower end of the complexity scale, they recover the full schema and the correct cell assignments with perfect accuracy. If you want to replace manual data entry, getting structure right at the OCR stage is non-negotiable.


The quiet ways a model can be wrong

What unnerves us isn't the obvious mistakes. Those are easy to catch. It's the plausible ones.

You can look at a model's output and feel everything is fine. The numbers are formatted. The fields are present. The JSON validates. Only later do you notice that an apparently reasonable "Total" is a year-to-date figure, or that an opening balance was read as a transaction, or that a wrapped line merged two merchants into a chimera that never existed. Nobody threw an error; the system soldiered on.

We used to think this was an accuracy problem. It's not (or not only). It's a calibration problem. Accuracy is "How close are we?" Calibration is "How much should we trust this?" Models look accurate in demos. Pipelines need calibration in production.

As we saw, there are blog posts and public tests that reinforce this. Some VLMs marketed as "OCR-capable" can indeed read tidy tables on tidy pages, but stumble on the exact kinds of messiness we just described. Scan quality, layout drift, the sign conventions on amounts: the kinds of variations that are routine in the real world turn confident extractions into confident fiction. You don't need malice to hallucinate; you just need page 57.


The cost nobody quotes in the demo

There’s also the bill.

Take a conservative example: a 120-page statement. Even if you chunk it into two pages per request, you’re sending ~60 prompts. You’ll add instructions, maybe a few examples to stabilize outputs. If you’re using a VLM, every page is an image embedding on top of tokens. Multiply by 60. Multiply by retries. Multiply by a second pass when you realize you need verification. Latency climbs from “snappy” to “make some tea.” Cost climbs from cents to euros.

Order of magnitude, not a promise: processing a 100+ page file with a single large VLM can land in the “several euros per document” neighborhood, and take minutes of wall time.
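
A back-of-envelope version of that arithmetic, in Python. Every price and token count below is a placeholder we made up for illustration, not any provider's actual rate:

```python
# Rough cost model for one 120-page statement through a single VLM.
# All prices and token counts are placeholders, not real provider rates.

PAGES = 120
PAGES_PER_REQUEST = 2
PROMPT_TOKENS = 1_500          # instructions + a few stabilizing examples
IMAGE_TOKENS_PER_PAGE = 1_000  # vision-encoding overhead, rough guess
OUTPUT_TOKENS = 800            # structured JSON back
RETRY_RATE = 0.15              # fraction of requests that need a retry
VERIFICATION_PASSES = 1        # one extra full pass for checking

PRICE_PER_1K_INPUT = 0.005     # EUR per 1k input tokens, placeholder
PRICE_PER_1K_OUTPUT = 0.015    # EUR per 1k output tokens, placeholder

requests = (PAGES / PAGES_PER_REQUEST) * (1 + RETRY_RATE) * (1 + VERIFICATION_PASSES)
input_tokens = requests * (PROMPT_TOKENS + PAGES_PER_REQUEST * IMAGE_TOKENS_PER_PAGE)
output_tokens = requests * OUTPUT_TOKENS

cost = (input_tokens / 1_000) * PRICE_PER_1K_INPUT + (output_tokens / 1_000) * PRICE_PER_1K_OUTPUT
print(f"~{requests:.0f} requests, ~{cost:.2f} EUR per document")
# With these made-up numbers: ~138 requests, ~4.07 EUR per document.
```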

You can, of course, tune this down, but notice what that implies. The very tricks that make it affordable (smaller models, fewer calls, tighter prompts) usually come from doing less with the model and more with code. Which brings us to the boring, unskippable part.

And the bill quietly balloons when you add "k-LLM" or agentic cleverness. You route the weird pages to a second model, sample twice to be safer, add a verification pass, maybe caption first and extract later: each safety net is another round-trip and another meter running. What looked like one call becomes a small committee, and the cost climbs with it.


The pipeline you end up building anyway

At some point we stopped asking "Can a model read this?" and started asking "If the model reads this, can we prove it?"

Proof is unsexy. It means you keep coordinates for every token. It means you can show a regulator the pixel box on page 83 that became "€-1,237.45." It means you reconcile opening and closing balances across pages, and you freak out if the sum of deltas doesn't match. It means you notice when two different totals live on the same page and they don't agree about being totals.
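
Here is a minimal sketch of that reconciliation step, assuming each extracted value carries its page and pixel box. The `ExtractedAmount` shape and field names are ours for illustration:

```python
from dataclasses import dataclass

@dataclass
class ExtractedAmount:
    value: float                              # signed amount in EUR
    page: int                                 # page the value was read from
    bbox: tuple[float, float, float, float]   # pixel box, kept for audit

def reconcile(opening: ExtractedAmount,
              closing: ExtractedAmount,
              transactions: list[ExtractedAmount],
              tolerance: float = 0.005) -> list[str]:
    """Check opening balance + sum of transactions against the closing balance.

    Returns human-readable issues; an empty list means the statement reconciles.
    Every complaint points back at a page, so a reviewer looks at pixels, not prompts.
    """
    delta = sum(t.value for t in transactions)
    expected = opening.value + delta
    issues = []
    if abs(expected - closing.value) > tolerance:
        issues.append(
            f"Balance mismatch: opening {opening.value:.2f} (p.{opening.page}) "
            f"+ transaction delta {delta:.2f} = {expected:.2f}, "
            f"but closing balance reads {closing.value:.2f} (p.{closing.page})"
        )
    return issues
```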

You can do all of that with models. But once you've written enough of the scaffolding to be safe (the OCR redundancy, the geometry, the strict parsers, the reconciliation) a funny thing happens: the model is no longer the hero. It's the specialist you page in for disputes and edge cases.

This is a good thing.


Where the model actually shines

We still like these models. We just like them at the right scope.

If two deterministic passes disagree about the 'Debit/Credit' sign convention (for example, amounts printed as positives with a separate D/C column), ask the model to arbitrate and explain its choice. If a header quietly changed from 'Frais' to 'Frais de tenue de compte' or 'Commissions d’intervention', ask the model to map it to your canonical schema. If there’s a footnote such as 'dont TVA 20 %' or a 'CB différé' section that shifts when amounts are applied, ask the model to flag it before you misclassify a batch of transactions.
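
For the header-drift case, a minimal sketch of what "specialist on call" can look like: deterministic lookups first, the model only for headers we have never seen, and a closed question rather than free-form extraction. The alias table is illustrative, and `ask_model` stands in for whatever client you use:

```python
# Known header aliases are resolved in code; the model is only consulted
# for headers we have never seen, and its answer is cached.

CANONICAL_FIELDS = ["date", "description", "debit", "credit", "balance", "fees"]

KNOWN_ALIASES = {
    "frais": "fees",
    "frais de tenue de compte": "fees",
    "commissions d'intervention": "fees",
    "solde": "balance",
}

_alias_cache: dict[str, str | None] = {}

def map_header(header: str, ask_model) -> str | None:
    """Map a drifting column header to the canonical schema, or None if unknown."""
    key = header.strip().lower()
    if key in KNOWN_ALIASES:
        return KNOWN_ALIASES[key]
    if key in _alias_cache:
        return _alias_cache[key]
    # Closed question: the model picks from a fixed list or says "unknown".
    answer = ask_model(
        f"Map the bank-statement column header '{header}' to exactly one of "
        f"{CANONICAL_FIELDS}, or answer 'unknown'."
    ).strip().lower()
    mapped = answer if answer in CANONICAL_FIELDS else None
    _alias_cache[key] = mapped  # never pay for the same question twice
    return mapped
```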

In other words: treat the model as a judge or translator. Don't ask it to be the scanner, the parser, and the accountant.


But aren't models getting better?

Absolutely. We would love to be wrong in six months. Larger contexts reduce chunking pain. Vision backbones keep improving. Tool-use and multi-pass help a ton.

Even then, the parts we care about for financial docs don't go away: provenance, reconciliation, drift detection, and the ability to explain any number to someone who doesn't care about your prompts. Better models make the judgment calls cleaner. They don't absolve us from offering receipts.

If your use case is "extract totals from glossy one-pagers," maybe the black box is fine. If your use case is "ingest 120 pages of mixed-quality statements from 17 banks, meet an SLA, and pass an audit," you'll still want the guardrails. You'll also want the bill to be predictable.


The invisible checklist we now carry around

We don't print a checklist anymore; we just feel for it like keys in a pocket. Can we process a 120-page statement without throwing timeouts? Do we retain page and coordinate provenance for every extracted value? If we sum the transactions, do we get the balance delta, or do we have a hole somewhere? What happens when a model stalls on page 87: do we keep going or do we hide the glitch in a neat JSON envelope? Could we run this with models off and still get something usable? Will we notice when a bank quietly ships "Template v7" and moves a column, or will the system politely accept the new reality and file nonsense under the right field names?

These aren’t deep learning questions. They’re software questions. Which is comforting, honestly.


If you must pick a fight, pick a small one

The thing that consistently shrinks cost and increases trust is scope discipline. Ask the model to choose between candidates, not to invent structure. Ask for an index, not a paragraph. Batch ambiguous cells together so that 20 decisions cost one round-trip. Cache everything you can: if you've seen a template, learn its quirks once and reuse that knowledge. And when the model loses, make sure it loses in a way a human can fix in one click.
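
A sketch of that batching idea, assuming each ambiguous cell carries a short context string and a list of candidate values. The prompt wording and helper names are ours, not a prescription:

```python
import json

def build_batch_prompt(ambiguous_cells: list[dict]) -> str:
    """Pack N ambiguous cells into one request and ask for indices, not prose."""
    lines = [
        "For each cell, pick the index of the candidate that best fits its context.",
        'Answer with JSON only, e.g. {"17": 0, "18": 2}. No explanations.',
        "",
    ]
    for cell in ambiguous_cells:
        lines.append(f"cell {cell['cell_id']} (context: {cell['context']}):")
        for i, candidate in enumerate(cell["candidates"]):
            lines.append(f"  [{i}] {candidate}")
    return "\n".join(lines)

def apply_choices(raw_answer: str, ambiguous_cells: list[dict]) -> dict:
    """Strict parse: anything outside the expected shape is flagged, not guessed."""
    choices = json.loads(raw_answer)
    resolved = {}
    for cell in ambiguous_cells:
        idx = choices.get(str(cell["cell_id"]))
        if isinstance(idx, int) and 0 <= idx < len(cell["candidates"]):
            resolved[cell["cell_id"]] = cell["candidates"][idx]
        else:
            resolved[cell["cell_id"]] = None  # queued for a one-click human fix
    return resolved
```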

Do this and you'll get the magic where it matters: in the weird corners where rules get brittle. Skip this and you'll get a very expensive way to be confidently wrong.

LLMs and VLMs remain the most delightful part of the stack, as long as they're not the whole stack. The job isn't to eliminate the boring bits. The job is to build boring things so the magic has something sturdy to stand on.

At Holofin, we lean into that discipline: build the reliable scaffolding first (provenance, reconciliation, drift checks), then invite models to arbitrate where rules blur. It keeps failures local and loud, makes costs predictable, and lets the magic do its part without carrying the whole load.


December 2025 update: Mistral OCR 3

We revisited this topic in December 2025 when Mistral released mistral-ocr-3. Despite the new version and continued claims of improved document understanding, we observed the same spatial issues on tables: columns merging, values landing in wrong cells, and the Credit column disappearing on the same bank statement samples. The fundamental challenge of table structure recovery remains unsolved by general-purpose VLMs.
