Blog

Insights and updates on document processing, AI, and financial technology

The Bank Statement Extraction Benchmark

A real-world benchmark for bank-statement transaction extraction. Holofin clears 98% of statements with zero errors. Frontier LLMs read almost every row, but on a minority of layouts they unpredictably hand back rows that aren't on the page, so a model that's 90% accurate per row returns a fully correct statement only ~75–80% of the time.
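To make the gap between per-row and per-statement accuracy concrete, here is a minimal sketch. The data is invented for illustration and is not taken from the benchmark: ten hypothetical statements of 20 rows each, with all the errors concentrated in two of them. The function names `row_accuracy` and `statement_exact_rate` are our own labels, not part of any Holofin API.

```python
def row_accuracy(statements):
    """Share of individual rows that were extracted correctly."""
    total = sum(len(rows) for rows in statements)
    correct = sum(sum(rows) for rows in statements)
    return correct / total


def statement_exact_rate(statements):
    """Share of statements in which every single row is correct."""
    return sum(all(rows) for rows in statements) / len(statements)


# Hypothetical data: True means the row was extracted correctly.
clean = [[True] * 20 for _ in range(8)]                 # 8 statements with no errors
bad = [[True] * 10 + [False] * 10 for _ in range(2)]    # 2 layouts where half the rows are wrong
statements = clean + bad

print(row_accuracy(statements))          # 0.9
print(statement_exact_rate(statements))  # 0.8
```

Under these assumptions, the same output scores 90% per row and 80% per statement, which is why the two metrics should be reported separately.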

Holofin Engineering

Jun 27, 2026

Your Table Extractor Passed. The Numbers Didn't.

An auditor opens your extraction output for a balance sheet. The model reports 99.2% cell accuracy. Impressive. Then she totals the asset column by hand, the way auditors do, and it comes to a number that is off by one row. Assets no longer equal liabilities plus equity. The statement does not close.
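The auditor's check can be sketched in a few lines. This is an illustrative example with made-up figures, assuming amounts are stored as whole currency units; `closes` is a hypothetical helper name, not a function from any particular library.

```python
def closes(assets, liabilities, equity):
    """Accounting identity: total assets must equal liabilities plus equity."""
    return sum(assets) == sum(liabilities) + sum(equity)


# Hypothetical balance sheet, amounts in whole currency units.
source_assets = [120_000, 45_000, 30_500, 4_500]   # totals 200,000
liabilities = [80_000, 20_000]                     # totals 100,000
equity = [100_000]

# Simulate an extraction that silently drops the last asset row.
extracted_assets = source_assets[:-1]

print(closes(source_assets, liabilities, equity))     # True
print(closes(extracted_assets, liabilities, equity))  # False
print(sum(source_assets) - sum(extracted_assets))     # 4500
```

Every cell that was extracted is correct here, yet the statement fails to close, which is the kind of error a cell-accuracy score alone does not surface.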

Greg T

Jun 21, 2026

Document Fraud Detection: What a PDF Can't Hide

We used to think document fraud was a visual problem. Wrong fonts. Misaligned columns. A logo that felt slightly off. We built checks around what humans see, because what humans see is all we had.

Greg T

Mar 23, 2026

When Documents Fight Back

Page 1: Account summary, two columns. Page 15: Same account, three columns, different header names. Page 47: A scan with a coffee stain. Page 89: The totals page, which references transactions you extracted 70 pages ago.

Greg T

Feb 24, 2026

The Invisible Audit Trail

An auditor opens your export file, finds a closing balance of €47,500, and pulls up the source PDF. Page 3, bottom-right corner: €47,000. Different number. "Where does the difference come from? Who changed it?"

Greg T

Feb 07, 2026

HoloRecall: Show, Don't Tell

There's a moment in every classification project where you watch the model confidently get something wrong. Not a hard case. Not an ambiguous edge. Something a human would solve in half a second without thinking.

Greg T

Jan 21, 2026

Your LLM Isn't a Document Pipeline

There's a moment in every AI project where the demo looks so good that your brain quietly starts deleting code. You watch a model "read" a bank statement and think: this is it. We can skip OCR. We can skip layout parsing. Maybe we can skip half the pipeline. In the movie version, someone presses Enter and JSON waterfalls out of the cloud.

Greg T

Sep 21, 2025

PDFs Are For People, Not For Data

We love PDFs. They look the same on every device, they print beautifully at any size, and they're the closest thing we have to digital paper. But every time someone on our team says "let's just extract the data from the PDF," we feel an ancient PostScript daemon wake up and whisper: "I was born to paint pixels, not to structure your rows."

Greg T

Sep 20, 2025