Bank Statement Extraction
From PDF to Verified, Structured Data
OCR reads the text. But a bank statement isn't text, it's a table. OCR gives you "1.250,00" but not whether it's a debit, a credit, or a running balance. It gives you "VIREMENT RECU / ÜBERWEISUNG" but not which row it belongs to. Get one assignment wrong and every balance after it is off. Holofin reconstructs the table structure, assigns every value to its row and column, and proves the result by reconciling the balance.
Why Generic OCR
Keeps Getting It Wrong
A bank statement looks like a simple table. It is not. Every issuer formats things differently, and the PDF format itself is working against you. Here's what actually breaks.
Every bank does it differently
There's no standard for bank statement layout. BNP Paribas puts dates on the left and uses separate Debit/Credit columns. Deutsche Bank uses a single Amount column with D/C indicators. Revolut doesn't even include running balances. A template trained on one bank produces garbage on another.
Is "1.250" a thousand or 1.25?
French banks write "1 250,00 €". German ones write "1.250,00 EUR". British ones write "£1,250.00".
The same dot means "thousands" in Frankfurt and "decimals" in London. The same comma means the opposite. A space is a thousands separator in Paris and nothing in New York.
Misread one separator and a €1,250 rent payment becomes €1.25. A simple sum of the extracted amounts won't catch it: the numbers still add up, just to the wrong total.
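To make the separator problem concrete, here is a minimal Python sketch. It is not Holofin's parser: the function name `parse_amount` and the rule it applies (the last separator is the decimal mark unless it is the only separator and is followed by exactly three digits) are assumptions made for this illustration. Sign markers are ignored here.

```python
import re
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Parse an unsigned printed amount whose separators are not known in advance."""
    # Keep digits, dots, commas and whitespace; drop currency symbols and codes.
    cleaned = re.sub(r"[^\d.,\s]", "", text)
    # Whitespace (including no-break spaces) can only be a grouping separator.
    cleaned = re.sub(r"\s", "", cleaned)
    last_dot, last_comma = cleaned.rfind("."), cleaned.rfind(",")
    if last_dot == -1 and last_comma == -1:
        return Decimal(cleaned)
    # The separator that appears last is the candidate decimal mark.
    sep_index = max(last_dot, last_comma)
    decimals = cleaned[sep_index + 1:]
    both_kinds = last_dot != -1 and last_comma != -1
    if both_kinds or len(decimals) != 3:
        integer_part = re.sub(r"[.,]", "", cleaned[:sep_index])
        return Decimal(f"{integer_part}.{decimals}")
    # Only one kind of separator, followed by three digits: "1.250" could be
    # one thousand two hundred fifty or one and a quarter. Refuse to guess.
    raise ValueError(f"ambiguous amount: {text!r}")
```

The sketch raises on "1.250" rather than picking a reading, because a single cell does not carry enough information. Resolving it needs context from the rest of the document, such as other amounts that use both separators.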
Which column is the debit?
One column or two? Negative numbers or a "D/C" indicator? A minus on the left, on the right, or parentheses? German banks use "S" and "H". Some just leave the other column blank. The table looks obvious to a human. It's a nightmare to parse programmatically.
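The conventions listed above can be sketched as two small Python helpers. This is an illustration under stated assumptions, not Holofin's logic: the names `detect_sign` and `sign_from_columns` are hypothetical, and the sketch assumes the cell text and any indicator letter have already been isolated.

```python
from typing import Optional

# Indicator letters: "S" (Soll) and "D" mark debits, "H" (Haben) and "C" mark credits.
DEBIT_MARKS = {"S", "D"}
CREDIT_MARKS = {"H", "C"}
MINUS_SIGNS = ("-", "\u2212")  # ASCII hyphen-minus and the Unicode minus sign


def detect_sign(raw: str, indicator: Optional[str] = None) -> int:
    """Single-column layout: return -1 for a debit, +1 for a credit."""
    if indicator:
        mark = indicator.strip().upper()
        if mark in DEBIT_MARKS:
            return -1
        if mark in CREDIT_MARKS:
            return 1
        raise ValueError(f"unknown indicator: {indicator!r}")
    text = raw.strip()
    # Parentheses are accounting notation for a negative value.
    if text.startswith("(") and text.endswith(")"):
        return -1
    # The minus can be printed on the left or on the right of the number.
    if text.startswith(MINUS_SIGNS) or text.endswith(MINUS_SIGNS):
        return -1
    return 1


def sign_from_columns(debit_cell: str, credit_cell: str) -> int:
    """Two-column layout: exactly one of the two cells should be filled."""
    has_debit = bool(debit_cell.strip())
    has_credit = bool(credit_cell.strip())
    if has_debit == has_credit:
        raise ValueError("expected exactly one of debit/credit to be filled")
    return -1 if has_debit else 1
```

Each rule is trivial on its own. The hard part, which this sketch skips, is deciding which convention a given statement uses before any of the rules can be applied.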
Tables that break across pages
200 transactions don't fit on one page. The table continues on page 2, sometimes with headers repeated, sometimes not. A transaction might start on one page and finish on the next. You need to stitch the table back together before you can extract anything.
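A simplified view of the stitching step, as a Python sketch. It assumes rows have already been read page by page into a hypothetical `Row` type, that repeated headers match a known set of labels (`HEADER_CELLS` is made up for the example), and that a row without a date continues the previous transaction. Real statements are messier than this.

```python
from dataclasses import dataclass, replace

HEADER_CELLS = ("Date", "Description", "Amount")  # assumed header labels


@dataclass
class Row:
    page: int
    date: str          # empty when the row continues the previous transaction
    description: str
    amount: str


def stitch_pages(pages: list[list[Row]]) -> list[Row]:
    """Merge per-page rows into one table, in reading order."""
    stitched: list[Row] = []
    for page_rows in pages:
        for row in page_rows:
            # Drop headers that are repeated at the top of later pages.
            if (row.date, row.description, row.amount) == HEADER_CELLS:
                continue
            # A dateless row is the tail of the previous transaction,
            # even when that transaction started on the previous page.
            if not row.date.strip() and stitched:
                previous = stitched[-1]
                previous.description = f"{previous.description} {row.description}".strip()
                previous.amount = previous.amount or row.amount
                continue
            stitched.append(replace(row))  # copy, so the input stays untouched
    return stitched
```

The merged row keeps the page number where the transaction started, so a reviewer still lands on the right page.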
Multiple accounts in one PDF
Your client sends a single 47-page PDF. It contains three accounts (current, savings, credit card) across four quarters. That's 12 separate statements inside one file. Treat it as one continuous table and you get nonsense.

Not everything that looks like a transaction is one
Banks pad statements with auxiliary tables that look exactly like transactions: card payment breakdowns listing every contactless tap, SEPA transfer summaries repeating each direct debit, fee schedules, interest calculations. Extract them and you double-count. Skip the wrong one and your balance is off.
The real transactions live in the main table. Everything else is noise dressed up as data.
How It Works
Every bank statement goes through four stages. No templates, no issuer-specific configuration. The same pipeline handles BNP Paribas and Chase.
Classification
Our classifier identifies 100+ bank issuers using both content and visual clues: header positions, column structures, logos, text patterns. No templates to configure per bank.
Segmentation
Multi-account PDFs get split before extraction. We detect account boundaries by IBAN, account number, and period markers. That 47-page PDF becomes 12 segments, processed in parallel.
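The boundary detection can be pictured with a short Python sketch. It assumes each page has already been scanned for an account identifier and a period marker; the `PageInfo` and `Segment` types and the rule that unmarked pages inherit from the previous page are assumptions for illustration, not a description of Holofin's segmenter.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PageInfo:
    number: int
    account_id: Optional[str]  # IBAN or account number found on the page, if any
    period: Optional[str]      # period marker found on the page, if any


@dataclass
class Segment:
    account_id: str
    period: str
    pages: list[int] = field(default_factory=list)


def segment_pages(pages: list[PageInfo]) -> list[Segment]:
    """Split a combined PDF into one segment per (account, period) run."""
    segments: list[Segment] = []
    for page in pages:
        current = segments[-1] if segments else None
        # Pages without markers belong to whatever statement is in progress.
        account_id = page.account_id or (current.account_id if current else None)
        period = page.period or (current.period if current else None)
        if account_id is None or period is None:
            raise ValueError(f"page {page.number} precedes any account/period marker")
        # A change in either marker starts a new segment.
        if current is None or (account_id, period) != (current.account_id, current.period):
            current = Segment(account_id, period)
            segments.append(current)
        current.pages.append(page.number)
    return segments
```

Once the segments exist they are independent of each other, which is what makes parallel processing possible.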
Extraction
A visual model reads the page layout and extracts accurate transaction data: date, description, debit, credit, running balance, and account metadata. No template rules. The model understands the table structure.
Every extraction produces a JSON like this:
{
  "bank_name": "Qonto",
  "currency": "EUR",
  "account_type": "current",
  "usage_type": "business",
  "client_names": ["Starflight Dynamics GmbH"],
  "account_number": "DE15100101232339317943",
  "start_balance": 3071.69,
  "end_balance": 3030.39,
  "start_date": "2025-05-01",
  "end_date": "2025-05-31",
  "validation_status": "OK",
  "transactions": [
    {
      "transaction_date": "2025-05-02",
      "value_date": "2025-05-02",
      "amount": -963.9,
      "description": "Schmittlein Kloster Arbeitsrecht Partnerschaft",
      "credit": null,
      "debit": 963.9,
      "page": 1,
      "row": 1
    }
  ]
}
Validation
This is where most tools stop, and where we start. Every extracted segment gets checked:
- Balance reconciliation: opening balance + total credits − total debits = closing balance, within €2 tolerance. If the equation doesn't balance, the extraction is flagged.
- Running balance continuity: each transaction's running balance must equal the previous balance plus/minus the transaction amount. Breaks indicate missing or mis-extracted rows.
- Date ordering: transaction dates must be in chronological sequence within the statement period. Out-of-order dates suggest row assignment errors.
- Duplicate detection: identical transactions (same date, description, amount) are flagged for review rather than silently included.
Balance reconciliation equation:
opening balance + total credits − total debits = closing balance
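Three of these checks can be sketched in Python against the JSON fields shown in the sample output. This is a minimal illustration, not Holofin's validator: the function names are hypothetical, the tolerance is left as a parameter, and running balance continuity is omitted because the sample JSON does not show a per-transaction balance field.

```python
from collections import Counter
from decimal import Decimal


def _total(statement: dict, column: str) -> Decimal:
    """Sum one of the "credit"/"debit" columns, treating null as zero."""
    return sum(
        (Decimal(str(t[column] or 0)) for t in statement["transactions"]),
        Decimal(0),
    )


def reconcile(statement: dict, tolerance: Decimal) -> bool:
    """opening balance + total credits - total debits = closing balance."""
    opening = Decimal(str(statement["start_balance"]))
    closing = Decimal(str(statement["end_balance"]))
    expected = opening + _total(statement, "credit") - _total(statement, "debit")
    return abs(expected - closing) <= tolerance


def dates_in_order(statement: dict) -> bool:
    """Dates must be chronological and fall inside the statement period."""
    dates = [t["transaction_date"] for t in statement["transactions"]]
    # ISO 8601 date strings compare in chronological order.
    in_period = all(statement["start_date"] <= d <= statement["end_date"] for d in dates)
    return in_period and dates == sorted(dates)


def find_duplicates(statement: dict) -> list[tuple]:
    """Return (date, description, amount) keys that occur more than once."""
    keys = Counter(
        (t["transaction_date"], t["description"], t["amount"])
        for t in statement["transactions"]
    )
    return [key for key, count in keys.items() if count > 1]
```

Amounts go through `Decimal(str(...))` so that float artifacts from JSON parsing do not create false mismatches at the cent level.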
Show Your Work
Every extracted value carries coordinates that point back to its exact position on the source page. Not just "this came from page 3" but the pixel-level bounding box around the original text. You can verify any number by clicking on it.
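One way to picture a value that carries its own coordinates is the Python sketch below. The type names, field names, and the DPI-based rescaling are all assumptions for illustration; the actual response format for coordinates is not shown on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    # Pixel coordinates on the rendered page, origin at the top-left corner.
    x0: float
    y0: float
    x1: float
    y1: float

    def scaled(self, factor: float) -> "BoundingBox":
        """Return the same region at a different rendering scale."""
        return BoundingBox(
            self.x0 * factor, self.y0 * factor, self.x1 * factor, self.y1 * factor
        )


@dataclass(frozen=True)
class SourcedValue:
    value: str        # the extracted text, e.g. "963.90"
    page: int         # 1-based page number in the source PDF
    box: BoundingBox  # where the text sits on that page


def highlight_rect(item: SourcedValue, source_dpi: int, viewer_dpi: int) -> BoundingBox:
    """Rescale the box from the resolution it was measured at to the viewer's."""
    return item.box.scaled(viewer_dpi / source_dpi)
```

A review interface would draw the returned rectangle over the page image, which is all "click the value, see the source" needs.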
Auditors love this
When an auditor asks "where did this number come from?", you show them. The exact location on the source PDF, highlighted. No "the system said so."
Fix errors in seconds
Your reviewer spots a wrong amount. They click the value. The source region highlights on the original document. Compare, correct, move on.
Full data lineage
Trace any number from the credit decision back to the original bank statement, page, and row. The full chain is documented. Regulators don't have to take your word for it.
Scale and Coverage
We process 100K+ documents a month for lending teams across Europe. Here's what the infrastructure looks like.
Infrastructure
~40 seconds per statement
Upload to validated JSON. Multi-segment documents process in parallel, so a 12-segment PDF doesn't take 12x longer.
REST API + webhooks
Upload via API, get a webhook when it's done. Batch upload supported.
European infrastructure, GDPR-compliant
99.9% uptime SLA. Configurable retention. Data never leaves the EU.
Banks we cover
French banks
BNP Paribas, Société Générale, Crédit Agricole, Crédit Mutuel, La Banque Postale, Boursorama, CIC, LCL, Caisse d'Épargne
German banks
Deutsche Bank, Commerzbank, Sparkasse, Volksbank, N26, DKB, ING-DiBa, HypoVereinsbank
Pan-European & international
ING, HSBC, Revolut, Wise, Barclays, Lloyds, NatWest, UniCredit, Rabobank, ABN AMRO, Santander
UK & US banks
Chase, Bank of America, Wells Fargo, Citi, HSBC UK, Barclays UK, Monzo, Starling
Don't see your bank? It probably works anyway.
We don't use templates. The extraction engine reads layout from the document itself. New issuers work without setup.
FAQ
The questions we get most from lending and accounting teams.
Which bank statements can Holofin process?
Holofin processes native PDF bank statements from any issuer worldwide, including all major European, UK, and US banks. It handles both digitally generated and scanned statements. No templates or issuer-specific configuration needed. The system learns layout from the document itself. We actively cover 100+ issuers with validated extraction accuracy, and new issuers typically work without any configuration.
How does Holofin handle multi-account PDFs?
Holofin's segmentation engine detects account boundaries (IBAN, account number, period markers) and splits combined PDFs into individual statement segments before extraction. A 47-page PDF with 3 accounts across 4 quarters becomes 12 individual, independently validated segments. Each segment is extracted and balance-reconciled separately, then aggregated into a unified JSON response.
How accurate is the extraction?
Field-level accuracy exceeds 97% on native PDF bank statements across tested issuers. But raw accuracy isn't the full story. Every extraction includes automatic balance reconciliation (opening + credits − debits = closing), providing mathematical validation that catches extraction errors a simple accuracy metric would miss. When reconciliation fails, the extraction is flagged for human review rather than silently passed through.
Can Holofin process scanned bank statements?
Yes. Scanned bank statements are processed through OCR with font decoding and layout recognition. Accuracy depends on scan quality (300 DPI or higher recommended). The balance reconciliation step catches most OCR errors that affect financial totals. For degraded scans, the system flags low-confidence values so reviewers focus on the fields that need attention, not the entire document.
Does Holofin offer an API?
Yes. Holofin provides a REST API for programmatic document submission and result retrieval. Upload a PDF, receive a webhook when extraction completes, fetch the structured JSON result. Batch processing is supported: submit hundreds of documents in a single API call and collect results as they complete. Authentication uses API keys with organization-level scoping.
How does balance reconciliation work?
After extraction, Holofin verifies the accounting equation: opening balance + total credits − total debits = closing balance, within a tolerance of €0.01 in the statement currency. Running balance continuity is also checked: each transaction's running balance must equal the previous balance plus or minus the transaction amount. Date ordering and duplicate detection round out the validation suite. When any check fails, the extraction is flagged with specific error details rather than a generic failure.
How are different number formats handled?
Holofin handles all major number formats automatically: European comma decimals (1.234,56), US/UK period decimals (1,234.56), space-separated thousands (1 234.56), parenthesized negatives, and D/C indicators. Format detection is per-document, not per-issuer. The system reads the actual format used in the statement and parses accordingly. No configuration or locale settings required.
Is Holofin GDPR-compliant?
Yes. Holofin processes all data on European infrastructure. Document retention is configurable per organization. Data is encrypted at rest and in transit. No document content is used for model training. Holofin can execute data deletion requests in compliance with GDPR Article 17 (right to erasure). A Data Processing Agreement (DPA) is available for enterprise customers.
Data You Can
Bank On.
Send us the bank statements that broke your last tool. The 47-page multi-account PDFs. The degraded scans. The obscure German Sparkasse format. We'll show you what comes out the other side.