At holofin, bank-statement extraction is one of our core jobs, and we run it in production. Lenders, accountants and finance teams hand us statements from hundreds of different banks and expect every transaction back, exactly, with nothing invented and nothing dropped.
Extraction sits at the very front of that pipeline, so its mistakes never stay put. One missing or fabricated row doesn't just shave a point off an accuracy score. It becomes a balance that won't reconcile, an affordability decision built on a number that was never on the page, a ledger no one downstream can trust. A bank statement is boolean: it is either entirely correct, or it is a liability.
So we wanted to know how reliably today's best models actually do this, not on a hand-picked demo but on real statements, graded the way a finance team grades them, where the only thing that counts is whether the whole statement holds. We built a benchmark to find out.
The dataset47 real statements, one per bank
Every statement is real, then anonymized so layout, tables and totals survive but the names and numbers are synthetic: French majors, German banks, neobanks and EMIs, each with its own idea of what a transaction table should look like. The gold labels were hand-verified against the source PDFs.
Every statement is real, then anonymized so layout, tables and totals survive but names and numbers are synthetic. Click any page to zoom; switch to By bank to filter.





























































































Per-row accuracy is a vanity metric
The number that matters to a customer is not "what fraction of rows are right" but "is this statement right." Those are not the same metric. A statement is correct only if every row is, so one missed or invented row fails the whole document.
- Per-statement, not per-row. holofin extracts 98% of statements with zero errors; the best frontier model manages 80%. Across 44 documents holofin produced one errored row; the frontier models produced 70–115 each.
- The gap is fabrication, not reading. Every system reads the page well (recall 0.88–1.00). holofin fabricates one row in 44 statements (0.1%); frontier models invent 8–10% of every row they return.
- A bigger window is not the fix. Feeding more pages per call is a wash; per-page is reliable because it bounds fabrication.
What we found
Four reads of the same benchmark. The first places every system on completeness (did it find the rows?) against accuracy (are the rows it returned real?). The rest follow the arithmetic from there.
Every system finds the rows (completeness, x). They differ on how many of the rows they return actually exist (accuracy, y). holofin sits in the top-right corner; frontier models drop down the accuracy axis as they fabricate. Frontier shown per-page.
A statement is correct only if every row is. Share of statements extracted with zero errors (no dropped rows, no fabricated rows) against the hand-verified gold. The sub-label is total errored rows across all 44 documents: holofin made one; the frontier models made dozens.
Share of returned transactions that do not exist on the page. A fabricated row reconciles to a wrong balance and looks plausible: the silent failure. Frontier shown at their best (per-page) setting.
holofin runs one page at a time and tops every axis. For the frontier models, feeding more pages per call is a wash: recall slips a little, precision ticks up a little, two-page is often the sweet spot. The gap that matters is the one to the green bar.
No aggregates to hide behind. This is the raw count of errored rows (dropped +
fabricated, vs gold) on every statement, per model, at the per-page setting. Read holofin's
column top to bottom: it is empty. · = clean; numbers = errors on that document.
The quiet destruction of the invented row
It isn't a failure to read the ink on the page. If a transaction is visibly printed, every model finds it. The problem is what they find when the transaction isn't there. There is a massive operational difference between a dropped row and a fabricated one. A dropped row is annoying: the balance fails to reconcile and an operator spots the gap. A fabricated row is a silent killer. The model scrapes a running balance, a subtotal or a stray date and formats it as a valid transaction. It looks perfectly plausible doing it. It just slowly, invisibly poisons the arithmetic.
The gold is human, not a model
We did not let a model grade other models. The ground truth was built by hand: on every document where the systems disagreed, a person opened the source PDF and checked the transactions line by line. The benchmark scores against what is actually printed on the page, verified by a human, not against another model's opinion of it.
MethodologyHow the benchmark is wired
Frontier candidates receive page images with a generic extraction prompt at three context sizes. holofin is the real production pipeline (classify → OCR → per-page extract), driven over HTTP. Every metric is doc-macro: computed per document, then averaged.
The obvious production check is whether a statement's math ties out: opening balance + Σ transactions = closing balance. We measured it, and it is necessary but not sufficient as a truth metric. GPT-5.5's statements reconcile 42/45 of the time, yet it still fabricates ~8% of rows against the actual page; a fabricated row offset by another error still ties out, and a model that omits balances entirely (Gemini left them blank on 12 documents) can't be checked at all. A statement can pass the math and still be wrong. So we score every transaction against gold that was hand-verified against the source PDF.
You don't need a larger window. You need a harness.
You don't solve extraction by passing an entire PDF to an endpoint and asking a model to be careful. At holofin that's the job description. We build the cage the intelligence runs inside:
- Structure before semantics. Deterministic OCR and geometry build the page context first. Prompts capture meaning well and visual structure poorly.
- Bound the problem. We process strictly per-page, never asking a model to hold an entire ledger in working memory.
- Constraints > vibes. Strict accounting rules decide what counts as a transaction before a result is ever finalized.
Once you've written enough scaffolding to be safe (the OCR redundancy, the bounding geometry, the strict parsers, the reconciliations), the model is no longer the hero. It's the specialist you page in for disputes and edge cases. The job isn't to eliminate the boring bits; it's to build boring things so the magic has something sturdy to stand on.
Related Articles

Your Table Extractor Passed. The Numbers Didn't.
An auditor opens your extraction output for a balance sheet. The model reports 99.2% cell accuracy. Impressive. Then she totals the asset column by hand, the way auditors do, and it comes to a number that is off by one row. Assets no longer equal liabilities plus equity. The statement does not close.

Document Fraud Detection: What a PDF Can't Hide
We used to think document fraud was a visual problem. Wrong fonts. Misaligned columns. A logo that felt slightly off. We built checks around what humans see, because what humans see is all we had.

When Documents Fight Back
Page 1: Account summary, two columns. Page 15: Same account, three columns, different header names. Page 47: A scan with a coffee stain. Page 89: The totals page, which references transactions you extracted 70 pages ago.