The Bank Statement Extraction Benchmark

A real-world benchmark for transaction extraction, and why a model that looks 90% accurate almost never returns a fully correct statement.

Holofin Engineering · Engineering · 18 min read · Jun 27, 2026
BENCHMARK
  • 98% of holofin statements with zero errors
  • 1 holofin errored row in 44 docs
  • 70–115 errored rows per frontier model
  • 47 banks · gold hand-verified

At holofin, bank-statement extraction is one of our core jobs, and we run it in production. Lenders, accountants and finance teams hand us statements from hundreds of different banks and expect every transaction back, exactly, with nothing invented and nothing dropped.

Extraction sits at the very front of that pipeline, so its mistakes never stay put. One missing or fabricated row doesn't just shave a point off an accuracy score. It becomes a balance that won't reconcile, an affordability decision built on a number that was never on the page, a ledger no one downstream can trust. A bank statement is boolean: it is either entirely correct, or it is a liability.

So we wanted to know how reliably today's best models actually do this, not on a hand-picked demo but on real statements, graded the way a finance team grades them, where the only thing that counts is whether the whole statement holds. We built a benchmark to find out.

The dataset

47 real statements, one per bank

Every statement is real, then anonymized so layout, tables and totals survive but the names and numbers are synthetic: French majors, German banks, neobanks and EMIs, each with its own idea of what a transaction table should look like. The gold labels were hand-verified against the source PDFs.

The benchmark corpus · 47 banks, 93 pages

[Figure: page thumbnails of the 47 anonymized statements, 93 pages in total.]

Banks, with page counts: bami banque michel inchauspé (4), banque dupuy de parseval (1), banque transatlantique (2), berliner sparkasse (1), berliner volksbank (1), bnp paribas (1), boursobank (1), bred banque populaire (2), bunq (2), bwebank (2), caisse d'epargne (2), commerzbank (2), credit agricole brie picardie (1), credit cooperatif (2), credit industriel et commercial (2), crédit mutuel (1), deutsche bank (2), deutsche skatbank (2), dkb deutsche kreditbank ag (3), fiducial banque (1), finom (1), grenke bank ag (3), hsbc (1), hypovereinsbank (2), ibanfirst (3), kontist (2), la banque postale (3), lcl banque et assurance (1), manager one (2), mein elba (3), memo bank (1), monabanq (2), oberbank ag (1), paypal (4), postbank (1), qonto (1), raiffeisenbank südstormarn mölln eg (8), revolut business (1), sg credit du nord (2), sg societe generale (1), shine (1), sparda bank (3), sumup (4), targox bank (4), unicredit (1), viva wallet (1), wise (1).

The takeaway

Per-row accuracy is a vanity metric

The number that matters to a customer is not "what fraction of rows are right" but "is this statement right." Those are not the same metric. A statement is correct only if every row is, so one missed or invented row fails the whole document.

  • Per-statement, not per-row. holofin extracts 98% of statements with zero errors; the best frontier model manages 80%. Across 44 documents holofin produced one errored row; the frontier models produced 70–115 each.
  • The gap is fabrication, not reading. Every system reads the page well (recall 0.88–1.00). holofin fabricates one row in 44 statements (0.1%); frontier models invent 8–10% of the rows they return.
  • A bigger window is not the fix. Feeding more pages per call is a wash; per-page is reliable because it bounds fabrication.
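To see why the two metrics diverge, a back-of-the-envelope sketch helps. It assumes, purely for illustration, that every row is right independently and with the same probability. The benchmark makes no such claim (in the results, errors cluster on a handful of statements), and the function name and row counts are hypothetical.

```python
def zero_error_probability(row_accuracy: float, rows: int) -> float:
    """Chance that all `rows` rows are right, ASSUMING independent, equally likely errors."""
    if not 0.0 <= row_accuracy <= 1.0:
        raise ValueError("row_accuracy must be between 0 and 1")
    if rows < 0:
        raise ValueError("rows must be non-negative")
    return row_accuracy ** rows


if __name__ == "__main__":
    # Hypothetical statement lengths, all at 90% per-row accuracy.
    for rows in (1, 10, 30):
        share = zero_error_probability(0.90, rows)
        print(f"{rows:>2} rows: {share:.1%} chance of a clean statement")
```

Under that assumption a 10-row statement comes back clean roughly a third of the time and a 30-row one only a few percent of the time, which is why a healthy-looking per-row number says little about whole statements.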
Results

What we found

Four reads of the same benchmark. The first places every system on completeness (did it find the rows?) against accuracy (are the rows it returned real?). The rest follow the arithmetic from there.

FIG.01
Reads everything, invents a tenth of it

Every system finds the rows (completeness, x). They differ on how many of the rows they return actually exist (accuracy, y). holofin sits in the top-right corner; frontier models drop down the accuracy axis as they fabricate. Frontier shown per-page.

System            Completeness (recall)   Accuracy (precision)
holofin           1.000                   0.999
GPT-5.5           0.939                   0.917
Claude Opus 4.8   0.929                   0.908
Gemini 3.1 Pro    0.931                   0.900
FIG.02
Reading 90% of rows is not getting 90% of statements right

A statement is correct only if every row is. Share of statements extracted with zero errors (no dropped rows, no fabricated rows) against the hand-verified gold. The sub-label is total errored rows across all 44 documents: holofin made one; the frontier models made dozens.

System            Statements with zero errors   Errored rows / 44 docs
holofin           98%                           1
Gemini 3.1 Pro    80%                           115
GPT-5.5           77%                           84
Claude Opus 4.8   75%                           70
FIG.03
The silent error is the invented row

Share of returned transactions that do not exist on the page. A fabricated row reconciles to a wrong balance and looks plausible: the silent failure. Frontier shown at their best (per-page) setting.

System            Setting                 Fabricated-row rate
holofin           production · per-page   0.1%
GPT-5.5           per-page                8.3%
Claude Opus 4.8   per-page                9.2%
Gemini 3.1 Pro    per-page                10.0%
FIG.04
A bigger window is not the fix

holofin runs one page at a time and tops every axis. For the frontier models, feeding more pages per call is a wash: recall slips a little, precision ticks up a little, two-page is often the sweet spot. The gap that matters is the one to holofin's bar.

Score by context size (0.00–1.00, higher is better)

System                    per-page   two-page   whole-doc
holofin (per-page only)   1.000      -          -
GPT-5.5                   0.939      0.942      0.932
Gemini 3.1 Pro            0.931      0.953      0.932
Claude Opus 4.8           0.929      0.948      0.940
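The three context sizes amount to three ways of grouping a statement's pages into calls. A minimal sketch, assuming consecutive, non-overlapping windows (the article does not say whether two-page windows overlap); `page_windows` and its arguments are hypothetical names, not part of any holofin API.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def page_windows(pages: Sequence[T], size: Optional[int]) -> List[List[T]]:
    """Group pages into consecutive, non-overlapping windows.

    size=1 -> per-page, size=2 -> two-page, size=None -> whole-doc.
    """
    if size is None:
        # Whole-doc: a single window holding every page (none for an empty document).
        return [list(pages)] if pages else []
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(pages[i:i + size]) for i in range(0, len(pages), size)]


if __name__ == "__main__":
    pages = ["p1", "p2", "p3"]
    print(page_windows(pages, 1))     # per-page
    print(page_windows(pages, 2))     # two-page
    print(page_windows(pages, None))  # whole-doc
```

Per-page windows keep each call small, so any row invented in one call cannot spread beyond that page.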
FIG.05
Every document, every error

No aggregates to hide behind. This is the raw count of errored rows (dropped + fabricated, vs gold) on every statement, per model, at the per-page setting. Read holofin's column top to bottom: it is empty except for a single errored row on targox bank. · = clean; numbers = errors on that document.

bank                                 rows  holofin  GPT-5.5  Gemini  Opus 4.8
bami banque michel inchauspé         47    ·        17       31      17
banque dupuy de parseval             2     ·        1        ·       1
banque transatlantique               23    ·        ·        ·       ·
berliner sparkasse                   1     ·        ·        ·       ·
berliner volksbank                   3     ·        ·        ·       ·
bnp paribas                          1     ·        ·        ·       ·
boursobank                           4     ·        ·        9       ·
bred banque populaire                2     ·        ·        ·       ·
bunq                                 36    ·        ·        ·       ·
bwebank                              7     ·        4        3       3
caisse d'epargne                     1     ·        ·        ·       ·
commerzbank                          7     ·        ·        ·       ·
credit agricole brie picardie        7     ·        ·        ·       ·
credit industriel et commercial      13    ·        35       29      29
crédit mutuel                        11    ·        ·        ·       ·
deutsche bank                        1     ·        ·        ·       ·
dkb deutsche kreditbank ag           9     ·        ·        ·       ·
fiducial banque                      6     ·        ·        ·       ·
finom                                1     ·        ·        ·       ·
grenke bank ag                       4     ·        ·        ·       ·
hsbc                                 3     ·        ·        ·       ·
hypovereinsbank                      2     ·        ·        ·       ·
ibanfirst                            25    ·        ·        ·       ·
kontist                              2     ·        ·        ·       ·
lcl banque et assurance              1     ·        ·        ·       1
manager one                          4     ·        ·        ·       ·
mein elba                            33    ·        1        1       1
memo bank                            4     ·        ·        ·       4
monabanq                             34    ·        ·        ·       ·
oberbank ag                          1     ·        ·        ·       ·
paypal                               2     ·        4        6       4
postbank                             1     ·        ·        ·       ·
qonto                                8     ·        16       ·       ·
raiffeisenbank südstormarn mölln eg  63    ·        3        32      3
revolut business                     1     ·        ·        ·       ·
sg credit du nord                    4     ·        ·        ·       ·
sg societe generale                  3     ·        ·        ·       ·
shine                                13    ·        ·        ·       ·
sparda bank                          23    ·        ·        ·       ·
sumup                                39    ·        ·        ·       ·
targox bank                          24    1        2        3       5
unicredit                            1     ·        ·        ·       ·
viva wallet                          1     ·        ·        ·       ·
wise                                 2     ·        1        1       2
TOTAL ERRORED ROWS                         1        84       115     70
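The headline figures can be recomputed from the per-document counts above. The sketch below transcribes only the non-zero cells by hand. Where the source ran digits together, they are split so that each column sums to the published total, which is an assumption of this transcription. Bank names are shortened and all identifiers are hypothetical.

```python
# Non-zero cells of FIG.05: bank -> errored rows for (holofin, GPT-5.5, Gemini, Opus 4.8).
# Every statement not listed here was clean for all four systems.
TOTAL_DOCS = 44
SYSTEMS = ("holofin", "GPT-5.5", "Gemini 3.1 Pro", "Claude Opus 4.8")
ERRORED_ROWS = {
    "bami": (0, 17, 31, 17),
    "banque dupuy de parseval": (0, 1, 0, 1),
    "boursobank": (0, 0, 9, 0),
    "bwebank": (0, 4, 3, 3),
    "credit industriel et commercial": (0, 35, 29, 29),
    "lcl banque et assurance": (0, 0, 0, 1),
    "mein elba": (0, 1, 1, 1),
    "memo bank": (0, 0, 0, 4),
    "paypal": (0, 4, 6, 4),
    "qonto": (0, 16, 0, 0),
    "raiffeisenbank": (0, 3, 32, 3),
    "targox bank": (1, 2, 3, 5),
    "wise": (0, 1, 1, 2),
}


def summarize(errors, total_docs):
    """Return {system: (total errored rows, share of statements with zero errors)}."""
    summary = {}
    for col, system in enumerate(SYSTEMS):
        counts = [row[col] for row in errors.values()]
        docs_with_errors = sum(1 for c in counts if c > 0)
        summary[system] = (sum(counts), (total_docs - docs_with_errors) / total_docs)
    return summary


if __name__ == "__main__":
    for system, (rows, clean_share) in summarize(ERRORED_ROWS, TOTAL_DOCS).items():
        print(f"{system}: {rows} errored rows, {clean_share:.0%} of statements clean")
```

Counting documents rather than rows is what turns a few badly broken statements into a 20-point gap.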
Where models break down

The quiet destruction of the invented row

The failure isn't in reading the ink on the page. If a transaction is visibly printed, every model finds it. The problem is what they find when the transaction isn't there. There is a massive operational difference between a dropped row and a fabricated one. A dropped row is annoying: the balance fails to reconcile and an operator spots the gap. A fabricated row is a silent killer. The model scrapes a running balance, a subtotal or a stray date and formats it as a valid transaction. It looks perfectly plausible doing it. It just slowly, invisibly poisons the arithmetic.

The gold is human, not a model

We did not let a model grade other models. The ground truth was built by hand: on every document where the systems disagreed, a person opened the source PDF and checked the transactions line by line. The benchmark scores against what is actually printed on the page, verified by a human, not against another model's opinion of it.

Methodology

How the benchmark is wired

Frontier candidates receive page images with a generic extraction prompt at three context sizes. holofin is the real production pipeline (classify → OCR → per-page extract), driven over HTTP. Every metric is doc-macro: computed per document, then averaged.
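Doc-macro averaging weights every statement equally, however many rows it holds. A small sketch of the difference from pooling all rows (micro averaging), with made-up counts and hypothetical function names:

```python
from statistics import mean
from typing import Sequence, Tuple

# Each document is (matched_rows, gold_rows); gold_rows is assumed to be > 0.
Doc = Tuple[int, int]


def macro_recall(docs: Sequence[Doc]) -> float:
    """Doc-macro: compute recall per document, then average the documents."""
    return mean(matched / gold for matched, gold in docs)


def micro_recall(docs: Sequence[Doc]) -> float:
    """Micro: pool every row from every document, then compute one recall."""
    return sum(m for m, _ in docs) / sum(g for _, g in docs)


if __name__ == "__main__":
    # Illustrative numbers only: a 1-row statement read perfectly,
    # and a 50-row statement with 10 rows missed.
    docs = [(1, 1), (40, 50)]
    print(f"macro: {macro_recall(docs):.3f}")
    print(f"micro: {micro_recall(docs):.3f}")
```

With macro averaging, a one-row statement counts as much as a fifty-row one, which matches the view that each statement is a unit that is either right or wrong.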

  • 47 bank PDFs: one per distinct bank
  • Anonymize: pdf-holomask · tables & totals preserved
  • Render windows: per-page · two-page · whole-doc
  • Extract: 3 frontier models + holofin pipeline
  • Score: vs hand-verified gold
  • Gold = human-verified: checked line-by-line against every source PDF
  • Match rule: exact (transaction_date, signed amount) at cent precision
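The match rule can be sketched as a multiset comparison on (date, signed cents). This is an illustration under stated assumptions, not the benchmark's actual scorer: treating repeated identical rows as a multiset, and every name below, are assumptions.

```python
from collections import Counter
from datetime import date
from decimal import Decimal
from typing import Iterable, NamedTuple, Tuple

Row = Tuple[date, Decimal]  # (transaction_date, signed amount)


class Score(NamedTuple):
    matched: int
    dropped: int      # in gold, missing from the prediction
    fabricated: int   # in the prediction, absent from gold
    recall: float
    precision: float


def _key(row: Row) -> Tuple[date, int]:
    """Normalize to (date, signed cents) so 12.5 and 12.50 compare equal."""
    day, amount = row
    return day, int((Decimal(amount) * 100).to_integral_value())


def score(gold: Iterable[Row], predicted: Iterable[Row]) -> Score:
    gold_counts = Counter(_key(r) for r in gold)
    pred_counts = Counter(_key(r) for r in predicted)
    # Counter intersection keeps the minimum count per key: the matched rows.
    matched = sum((gold_counts & pred_counts).values())
    n_gold = sum(gold_counts.values())
    n_pred = sum(pred_counts.values())
    return Score(
        matched=matched,
        dropped=n_gold - matched,
        fabricated=n_pred - matched,
        recall=matched / n_gold if n_gold else 1.0,
        precision=matched / n_pred if n_pred else 1.0,
    )


if __name__ == "__main__":
    gold = [(date(2026, 1, 5), Decimal("-12.50")),
            (date(2026, 1, 5), Decimal("-12.50")),
            (date(2026, 1, 7), Decimal("100.00"))]
    predicted = [(date(2026, 1, 5), Decimal("-12.5")),
                 (date(2026, 1, 7), Decimal("100")),
                 (date(2026, 1, 9), Decimal("87.50"))]
    print(score(gold, predicted))
```

In this toy case one of the two identical debits is dropped and one row is fabricated, so the document carries two errored rows and fails the zero-error test even though most rows match.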
Why not just score by balance reconciliation?

The obvious production check is whether a statement's math ties out: opening balance + Σ transactions = closing balance. We measured it, and it is necessary but not sufficient as a truth metric. GPT-5.5's statements reconcile 42/45 of the time, yet it still fabricates ~8% of rows against the actual page; a fabricated row offset by another error still ties out, and a model that omits balances entirely (Gemini left them blank on 12 documents) can't be checked at all. A statement can pass the math and still be wrong. So we score every transaction against gold that was hand-verified against the source PDF.
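A toy case makes the gap concrete. The numbers and the `reconciles` helper are invented for illustration: the prediction drops two real debits and fabricates one row equal to their sum, and the math still ties out.

```python
from decimal import Decimal
from typing import Optional, Sequence


def reconciles(opening: Optional[Decimal],
               amounts: Sequence[Decimal],
               closing: Optional[Decimal]) -> Optional[bool]:
    """True or False when the math can be checked, None when a balance is missing."""
    if opening is None or closing is None:
        return None
    return opening + sum(amounts, Decimal("0")) == closing


if __name__ == "__main__":
    opening, closing = Decimal("1000.00"), Decimal("1150.00")
    gold = [Decimal("-40.00"), Decimal("-60.00"), Decimal("250.00")]
    # Two debits dropped, one invented -100.00 row (say, a subtotal read as a transaction).
    predicted = [Decimal("-100.00"), Decimal("250.00")]

    print(reconciles(opening, gold, closing))       # the true rows tie out
    print(reconciles(opening, predicted, closing))  # so do the wrong ones
    print(reconciles(None, predicted, closing))     # blank balance: cannot be checked
```

The second call passes the balance check while containing two dropped rows and one fabricated row, which is exactly the case row-level scoring against gold catches.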

Production performance

You don't need a larger window. You need a harness.

You don't solve extraction by passing an entire PDF to an endpoint and asking a model to be careful. At holofin, solving it is the job description. We build the cage the intelligence runs inside:

  • Structure before semantics. Deterministic OCR and geometry build the page context first. Prompts capture meaning well and visual structure poorly.
  • Bound the problem. We process strictly per-page, never asking a model to hold an entire ledger in working memory.
  • Constraints > vibes. Strict accounting rules decide what counts as a transaction before a result is ever finalized.
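As one illustration of what a constraint of this kind could look like, the sketch below accepts a candidate row only if it chains with the printed running balance. It is a hypothetical rule written for this example, not holofin's actual rule set, and it assumes the statement prints a balance next to every row.

```python
from decimal import Decimal
from typing import List, NamedTuple, Tuple


class Candidate(NamedTuple):
    label: str
    amount: Decimal         # signed amount as extracted
    balance_after: Decimal  # running balance printed next to the row


def gate_by_running_balance(
    opening: Decimal, candidates: List[Candidate]
) -> Tuple[List[Candidate], List[Candidate]]:
    """Split candidates into (accepted, rejected) by chaining the running balance."""
    accepted: List[Candidate] = []
    rejected: List[Candidate] = []
    balance = opening
    for row in candidates:
        if balance + row.amount == row.balance_after:
            accepted.append(row)
            balance = row.balance_after
        else:
            # The row does not follow from the previous balance: hold it back.
            rejected.append(row)
    return accepted, rejected


if __name__ == "__main__":
    rows = [
        Candidate("CARD", Decimal("-20.00"), Decimal("480.00")),
        # A running balance misread as a transaction amount.
        Candidate("SUBTOTAL", Decimal("480.00"), Decimal("480.00")),
        Candidate("SALARY", Decimal("1000.00"), Decimal("1480.00")),
    ]
    accepted, rejected = gate_by_running_balance(Decimal("500.00"), rows)
    print([r.label for r in accepted])
    print([r.label for r in rejected])
```

A rule like this is deterministic and cheap, so a plausible-looking invented row has to survive arithmetic, not just look right.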

Once you've written enough scaffolding to be safe (the OCR redundancy, the bounding geometry, the strict parsers, the reconciliations), the model is no longer the hero. It's the specialist you page in for disputes and edge cases. The job isn't to eliminate the boring bits; it's to build boring things so the magic has something sturdy to stand on.
