Your Table Extractor Passed. The Numbers Didn't.

Or: why "99% cell accuracy" is the most dangerous number on a financial table.

Greg T · Engineering · 11 min read · Jun 21, 2026

An auditor opens your extraction output for a balance sheet. The model reports 99.2% cell accuracy. Impressive. Then she totals the asset column by hand, the way auditors do, and her sum differs from the extracted total by the value of a single row. Assets no longer equal liabilities plus equity. The statement does not close.

The 0.8% the model got wrong was not a typo in a footnote. It was the total.

This is the quiet scandal of table extraction in finance. We know it firsthand, because we built our early pipelines to chase exactly these scores: benchmarks that grade a table as a grid of strings, while the only question a financial reader actually cares about goes unasked. Do the numbers still hold? The metrics that win leaderboards are, mathematically, blind to the errors that end careers.


What the benchmarks actually measure

If you have shopped for a table-extraction model, you have seen the scores. TEDS. Cell-match accuracy. Grid similarity. They all reduce to the same idea: line up the predicted table against the ground-truth table, walk the cells, and count how many match.

TEDS (Tree-Edit-Distance Similarity) is the popular one. It turns each table into a tree of rows and cells and measures how many edits it takes to turn one tree into the other. Fewer edits, higher score. It is a genuinely clever metric, and it was built to answer a genuinely useful question: is this roughly the right table, with roughly the right shape and text?

That question is fine for a literature review or a Wikipedia scrape. It is the wrong question for a cash flow statement.

Because here is what every one of these metrics treats a number as: a string. The cell 1,234.56 is, to the scorer, eight glyphs in a box. It has no idea that this box is supposed to be the sum of the four boxes above it. It has no idea that the value in the row labeled "Total assets" is load-bearing in a way the value in "Misc. accruals" is not. Every cell is worth the same fraction of the score, and the score is an average.

In finance, the errors are never evenly distributed. And the average hides exactly the place you need to look.
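The equal weighting described above can be made concrete with a toy scorer. This is a minimal sketch of exact-match cell accuracy, not an implementation of TEDS or any published benchmark; the 125-cell grid, the `corrupt` helper, and the row positions standing in for "Misc. accruals" and "Total assets" are all hypothetical.

```python
def cell_match_accuracy(predicted, reference):
    """Fraction of cells whose strings match exactly, every cell weighted equally."""
    pairs = [
        (p, r)
        for pred_row, ref_row in zip(predicted, reference)
        for p, r in zip(pred_row, ref_row)
    ]
    return sum(p == r for p, r in pairs) / len(pairs)


def corrupt(grid, row, col):
    """Copy the grid and replace a single cell with a wrong string."""
    copy = [list(r) for r in grid]
    copy[row][col] = "WRONG"
    return copy


# Hypothetical 25-row, 5-column reference grid: 125 placeholder cells.
reference = [[f"r{r}c{c}" for c in range(5)] for r in range(25)]

wrong_accrual = corrupt(reference, 3, 1)   # stand-in for a minor line item
wrong_total = corrupt(reference, 24, 1)    # stand-in for the total row

score_accrual = cell_match_accuracy(wrong_accrual, reference)
score_total = cell_match_accuracy(wrong_total, reference)
print(score_accrual, score_total)  # 0.992 0.992
```

One wrong cell out of 125 gives the same 99.2% whether it is a footnote figure or the total. The scorer has no input that could tell the two apart.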


Four ways a 99% score still hands you wrong numbers

Here are the failure modes a string-similarity metric cannot see. Each of these scores beautifully and ruins your day.

1. The column that slipped

A multi-column financial table has "2023" and "2022" side by side. The extractor reads the geometry slightly wrong and shifts every value in the prior-year column one cell down. Now every 2022 figure is attributed to the wrong line item.

To a cell-match metric, almost nothing happened. The same strings are present, in nearly the same cells. TEDS barely flinches. To a reader, every single prior-year number is wrong, and the variance analysis built on top of it is fiction.

The source statement

The kind of statement an extractor is actually handed: a header cell merged across both year columns, a notes column, full-width section rows, indented sub-items and dotted leaders. Obvious to a human, a minefield for a parser deciding which number belongs to which row and year. Every figure here is correct.

Scanned consolidated statement of profit or loss with a merged Year ended 31 December header spanning the 2023 and 2022 columns, a notes column, and grouped sections
fig.1a · synthetic scanned P&L with merged header + section rows (ground truth)

The column that slipped

The 2023 column extracted fine. Every value in the 2022 column landed one row too low, so each prior-year figure now sits on the wrong line. The label tells you whose number actually landed in that cell.

↓  the entire 2022 column slipped down one row
Line item        | 2022 (in the document) | 2022 (as extracted)
Revenue          | 1,610                  | value dropped
Cost of sales    | 870                    | 1,610 [Revenue’s]
Gross profit     | 740                    | 870 [Cost of sales’]
Opex             | 505                    | 740 [Gross profit’s]
Operating income | 235                    | 505 [Opex’s]

✓ TEDS similarity: 98.6% · ✗ Prior-year column: 100% wrong
Same strings, nearly the same cells, so the metric scores it near-perfect. To a reader, every 2022 number now sits on the wrong line item, and the variance analysis built on top of it is fiction.
fig.1b · ground truth vs slipped extraction
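The figures in fig.1b are enough to show the gap in code. The sketch below uses a position-blind overlap score as a rough stand-in for a string-similarity metric (it is not TEDS), compares it with row-aligned accuracy, and then runs two P&L identities on the column. The function names are illustrative, and the identities assume this simplified five-line statement.

```python
from collections import Counter

LINE_ITEMS = ["Revenue", "Cost of sales", "Gross profit", "Opex", "Operating income"]
truth_2022 = [1610, 870, 740, 505, 235]
extracted_2022 = [None, 1610, 870, 740, 505]  # every value one row too low


def bag_overlap(pred, ref):
    """Share of reference values found anywhere in the prediction (position ignored)."""
    common = Counter(v for v in pred if v is not None) & Counter(ref)
    return sum(common.values()) / len(ref)


def row_accuracy(pred, ref):
    """Share of rows whose value sits on the correct line item."""
    return sum(p == r for p, r in zip(pred, ref)) / len(ref)


def pnl_identities_hold(column):
    """Check gross profit and operating income against the rows that feed them."""
    v = dict(zip(LINE_ITEMS, column))
    if any(x is None for x in v.values()):
        return False
    return (
        v["Revenue"] - v["Cost of sales"] == v["Gross profit"]
        and v["Gross profit"] - v["Opex"] == v["Operating income"]
    )


print(bag_overlap(extracted_2022, truth_2022))   # 0.8
print(row_accuracy(extracted_2022, truth_2022))  # 0.0
print(pnl_identities_hold(truth_2022))           # True
print(pnl_identities_hold(extracted_2022))       # False
```

The overlap score stays high because the strings survived. The identities fail without any reference table, because 870 minus 740 is not 505.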

2. The decimal that moved

A German statement writes 1.234,56. The model, trained on a comma-as-thousands world, helpfully "corrects" it to 1,234.56, or worse, strips the separators and returns 123456.

The digits are all there. String similarity is delighted. With the separators stripped, the value is off by a factor of one hundred. Switch to a French statement, where the thousands separator is a space, and the same model splits one number into three.

Locale is not formatting. Locale is arithmetic. A comma in the wrong place is not a style choice, it is a different number.
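One way to make "locale is arithmetic" operational is to refuse to parse a number without being told its locale. The sketch below is a minimal illustration, assuming a small hand-written separator table; `SEPARATORS` and `parse_amount` are hypothetical names, and a production system would more likely take these conventions from document metadata or a locale library.

```python
from decimal import Decimal

# Assumed separator conventions, for illustration only.
SEPARATORS = {
    "en_US": {"group": ",", "decimal": "."},
    "de_DE": {"group": ".", "decimal": ","},
    "fr_FR": {"group": " ", "decimal": ","},
}


def parse_amount(text, locale):
    """Parse a printed amount using the separators of an explicitly stated locale."""
    seps = SEPARATORS[locale]
    # Treat no-break and narrow no-break spaces as ordinary spaces.
    cleaned = text.strip().replace("\u00a0", " ").replace("\u202f", " ")
    cleaned = cleaned.replace(seps["group"], "").replace(seps["decimal"], ".")
    return Decimal(cleaned)


print(parse_amount("1.234,56", "de_DE"))  # 1234.56
print(parse_amount("1 234,56", "fr_FR"))  # 1234.56
print(parse_amount("1,234.56", "en_US"))  # 1234.56

# The same glyphs under two locales are two different numbers.
print(parse_amount("1.234", "de_DE"))     # 1234
print(parse_amount("1.234", "en_US"))     # 1.234
```

The last two lines are the point: the string is identical, and only the declared locale decides whether it means one thousand two hundred thirty-four or a little over one.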

3. The header that wandered off

The numbers extract perfectly. Every digit correct. But the column header "Restated" got merged into the one next to it, so now you cannot tell which figures are the restatement and which are the original. The cells match the ground truth. The meaning does not survive.

A table where the numbers are right and the headers are wrong is not 95% correct. It is a pile of correct digits with no idea what they count. In finance, a number without its label is not data. It is noise that happens to be numeric.
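A merged header often leaves a structural trace that can be checked before anyone reads a number. The following is a rough sketch, assuming the table arrives as a flat list of header strings plus rows of cells; `header_problems` and the sample values are hypothetical, and the check catches only the symptoms it names.

```python
def header_problems(headers, rows):
    """Return a list of human-readable problems with a table's column headers."""
    problems = []
    width = max(len(row) for row in rows)
    if len(headers) != width:
        problems.append(f"{len(headers)} headers for {width} data columns")
    blanks = [i for i, h in enumerate(headers) if not h.strip()]
    if blanks:
        problems.append(f"blank header at column(s) {blanks}")
    duplicates = sorted({h for h in headers if h.strip() and headers.count(h) > 1})
    if duplicates:
        problems.append(f"duplicate header(s) {duplicates}")
    return problems


# Illustrative rows; the restated figures are made up for the example.
rows = [["Revenue", "1,610", "1,580"], ["Cost of sales", "870", "860"]]
good = ["Line item", "2022", "2022 Restated"]
merged = ["Line item", "2022 2022 Restated"]  # two header cells fused into one

print(header_problems(good, rows))    # []
print(header_problems(merged, rows))  # ['2 headers for 3 data columns']
```

A check like this will not notice a header that wandered onto the wrong column while the counts still match, which is why header lineage needs to be carried with each value rather than reconstructed afterwards.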

4. The negative that turned positive

Accountants write negatives as (1,200). Plenty of extractors read the parentheses as decoration and return 1,200. A provision becomes an asset. An outflow becomes an inflow.

One character. The sign of the cash flow. The metric counts the digits as a match and moves on.
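Sign handling is small enough to sketch in a few lines. This is a minimal example, assuming en_US separators and three common ways of printing a negative (accounting parentheses, a leading minus, a trailing minus); `parse_signed` is a hypothetical helper, not a complete number parser.

```python
from decimal import Decimal


def parse_signed(text):
    """Parse an en_US-formatted amount, treating accounting parentheses as a minus sign."""
    s = text.strip()
    negative = False
    if s.startswith("(") and s.endswith(")"):
        negative, s = True, s[1:-1].strip()
    elif s.startswith(("-", "\u2212")):   # ASCII hyphen-minus or Unicode minus sign
        negative, s = True, s[1:].strip()
    elif s.endswith("-"):                  # trailing minus, printed by some systems
        negative, s = True, s[:-1].strip()
    value = Decimal(s.replace(",", ""))
    return -value if negative else value


print(parse_signed("(1,200)"))  # -1200
print(parse_signed("1,200"))    # 1200
```

An extractor that returns the string "1,200" for "(1,200)" keeps most of the characters and loses the only one that decides whether cash came in or went out.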


Why the metric literally cannot see the problem

Notice the pattern. In every case above, the table is structurally fine and arithmetically broken. The rows are there, the columns are there, the strings are close. Structure passed. Arithmetic failed. And the score cannot tell the difference, because it was never measuring arithmetic in the first place.

The document the benchmark scores

A revenue note straight from a report: merged year header, a notes column, numbered lines. A model will read every digit perfectly and still hand you a total that does not add up.

Scanned revenue note whose reported current-year total of 1,824 does not equal the sum of its line items (1,284)
Straight off the page, current year: 612 + 288 + 154 + 230 = 1,284. The reported total prints 1,824. A cell-similarity score never checks the sum, so it never notices.
fig.2a · synthetic scanned revenue note (merged header)

Structure passed. Arithmetic failed.

Every cell matches the reference, so each earns its green check. But add the four line items up and they come to 1,284 — not the 1,824 printed on the total line. A string metric has no way to see that gap.

Line itemAmount (€000)
Product revenue612
Services revenue288
Licensing154
Other income230
Sum of the four lines1,284
Total revenue (as printed)1,824Δ 540
The four cells are individually correct. They simply do not add up to the printed total. Cell-match accuracy never checks a sum, so it scores this near-perfect.
✓ Cell-match accuracy: 99.2%✓ String similarity: PASS✗ Reconciliation: FAIL
fig.2b · what a cell-match metric sees

This is not a tuning problem. You cannot fix it by weighting some cells more heavily, because the metric has no representation of the relationships that matter. It does not know that a column should sum to its total. It does not know that assets equal liabilities plus equity. It does not know that this quarter's closing balance is next quarter's opening balance. To a string-similarity score, a balance sheet and a takeout menu are the same kind of object: a grid of text.

The information that makes a financial table financial, the constraints between the numbers, is exactly the information these benchmarks throw away before scoring even begins.

So you get models that top the leaderboard and still cannot be trusted with a P&L. The leaderboard was never testing the thing you need.


The metric finance actually needs

Here is the reframe. Stop asking "how similar is this table to the reference table." Start asking "does this table close?"

A financial statement is not a fuzzy artifact you score with a percentage. It is a system of equations with known invariants:

  • Line items sum to their subtotals. Subtotals sum to their totals.
  • On a balance sheet, assets equal liabilities plus equity.
  • In double-entry data, debits equal credits.
  • A closing balance in one period equals the opening balance in the next, absent an explicit restatement.
  • A percentage column, recomputed from its source columns, reproduces itself.

These are not heuristics. They are the definition of a correct statement. And they give you something a similarity score never can: a binary, document-internal check that needs no ground truth at all. The table either reconciles or it does not. When it does not, you do not need an annotator to tell you something is wrong. The arithmetic tells you.
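Two of the invariants listed above, roll-forward and percentage recomputation, fit in a few lines each. The sketch below is illustrative only: the function names, the tolerances, and the printed margin of 46.0% are assumptions, while 740 and 1,610 are the gross profit and revenue figures from fig.1b.

```python
from decimal import Decimal


def rolls_forward(closing_prev, opening_next, tolerance=Decimal("0")):
    """A closing balance should equal the next period's opening balance."""
    return abs(closing_prev - opening_next) <= tolerance


def percentage_reproduces(part, whole, printed_pct, tolerance=Decimal("0.05")):
    """Recompute a percentage from its source columns and compare with the printed one."""
    recomputed = part / whole * 100
    return abs(recomputed - printed_pct) <= tolerance


# Roll-forward: identical balances pass, a transposed digit pair fails.
print(rolls_forward(Decimal("235"), Decimal("235")))  # True
print(rolls_forward(Decimal("235"), Decimal("253")))  # False

# Gross margin recomputed from 740 / 1,610 is about 45.96%.
print(percentage_reproduces(Decimal("740"), Decimal("1610"), Decimal("46.0")))  # True
print(percentage_reproduces(Decimal("740"), Decimal("1610"), Decimal("64.0")))  # False
```

The percentage tolerance of 0.05 assumes the statement prints one decimal place. A statement that prints whole percentages would need a wider allowance.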

In pseudocode, the test that actually matters looks nothing like an edit distance:

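# Not: how close are these two grids of strings?
# But: does the extracted table obey its own arithmetic?
# `table` is a hypothetical object exposing totals(), rows_feeding(),
# is_balance_sheet(), and assets / liabilities / equity as Decimals.

from decimal import Decimal

TOLERANCE = Decimal("0.01")  # assumed rounding allowance; tune per statement

def reconciles(table, tolerance=TOLERANCE):
    """Return (True, None) if the table closes, else (False, reason)."""
    for total_row in table.totals():
        # Sum the component values, not the row objects themselves.
        components = table.rows_feeding(total_row)
        expected = sum((row.value for row in components), Decimal("0"))
        if abs(expected - total_row.value) > tolerance:
            return False, f"{total_row.label}: expected {expected}, got {total_row.value}"

    if table.is_balance_sheet():
        if abs(table.assets - (table.liabilities + table.equity)) > tolerance:
            return False, "balance sheet does not balance"

    return True, None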

The beautiful part: this check runs on a single document, with no reference annotation, in production, on the real statement your customer just uploaded. A cell-similarity benchmark can only tell you how a model did on someone else's test set last month. Reconciliation tells you whether this number, the one about to flow into a credit decision, can be trusted right now.
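To see the check run end to end, here it is applied to the revenue note from fig.2. This is a self-contained sketch: the `Row` type, the `check_total` helper, and the 0.5 tolerance (half a unit, since the note is printed in whole thousands) are assumptions made for the example.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Row:
    label: str
    value: Decimal


def check_total(components, total, tolerance=Decimal("0.5")):
    """Compare the sum of component rows with the printed total row."""
    expected = sum((row.value for row in components), Decimal("0"))
    gap = total.value - expected
    return abs(gap) <= tolerance, expected, gap


lines = [
    Row("Product revenue", Decimal("612")),
    Row("Services revenue", Decimal("288")),
    Row("Licensing", Decimal("154")),
    Row("Other income", Decimal("230")),
]
printed_total = Row("Total revenue", Decimal("1824"))

ok, expected, gap = check_total(lines, printed_total)
print(ok, expected, gap)  # False 1284 540
```

No reference annotation appears anywhere in the snippet. The only inputs are the numbers on the page, and the gap of 540 is the finding.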

Two questions, two metrics

The industry built table metrics to answer one question. Finance was always asking a different one.

What the benchmarks ask

“Is this roughly the right table?”

Metrics
TEDS · cell-match · GriTS
What a cell is, to the score
  • A string of glyphs in a box
  • Worth the same as every other cell
  • Averaged into one number
Blind to
  • Sums and subtotals
  • Sign and parenthesis-negatives
  • Decimal and locale shifts
  • Orphaned headers
Verdict: a fuzzy percentage
What finance asks

“Does this table close?”

Checks
  • Σ line items = subtotal = total
  • Assets = liabilities + equity
  • Debits = credits
  • Closing balance = next opening*
Needs
  • No ground-truth annotation
  • Runs on one live document
Verdict: it balances, or it does not
fig.3 · *absent an explicit restatement

A table that does not balance is not 99% right. It is 0% trustworthy. There is no partial credit on a statement that does not close.


How we think about this at Holofin

Turning chaotic financial documents into numbers you can stake a decision on is our actual job. The metric problem above is not academic to us. It is the difference between a model demo and a system an auditor will sign off on.

A few principles fall out of taking it seriously:

  • Reconcile, don't resemble. We do not consider a table extracted until it obeys its own arithmetic. Totals must sum. Balance sheets must balance. Periods must roll forward. Resemblance to a reference is a development convenience. Reconciliation is the actual contract.
  • A number is not a string. Every value carries its type, its sign, its locale, and its currency, parsed deliberately, not inferred from how it happened to be punctuated. In a German-locale statement, (1.234,56) is negative 1,234.56, and we treat it that way from the first pass.
  • Constraints over vibes. When the arithmetic does not close, that is not a rounding nuisance to suppress. It is a signal. We surface it, try alternative extraction strategies, and escalate to review rather than ship a confident wrong answer.
  • Provenance per number. Every extracted value carries its page, its bounding box, and its header lineage, so a reviewer can click any number back to the exact pixels it came from. A figure you cannot trace is a figure you cannot defend.
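As a rough picture of what the principles above imply for a data model, here is one possible shape. It is a sketch under stated assumptions, not Holofin's actual implementation: `ExtractedValue`, its fields, `first_reconciling`, and the sample provenance values are all hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Tuple


@dataclass(frozen=True)
class ExtractedValue:
    value: Decimal                            # parsed number, sign already applied
    raw_text: str                             # the string exactly as printed
    locale: str
    currency: str
    page: int
    bbox: Tuple[float, float, float, float]   # x0, y0, x1, y1 on the page
    header_path: Tuple[str, ...]              # header lineage, outermost first

    def describe(self):
        lineage = " > ".join(self.header_path)
        return f"{lineage}: {self.value} {self.currency} (page {self.page})"


def first_reconciling(strategies, document, reconciles):
    """Try extraction strategies in order; return the first result that reconciles."""
    for extract in strategies:
        table = extract(document)
        if reconciles(table):
            return table
    return None  # nothing closed, so the caller escalates to human review


provision = ExtractedValue(
    value=Decimal("-1234.56"),
    raw_text="(1.234,56)",
    locale="de_DE",
    currency="EUR",
    page=3,
    bbox=(72.0, 410.5, 130.0, 422.0),
    header_path=("Year ended 31 December", "2023", "Provisions"),
)
print(provision.describe())
# Year ended 31 December > 2023 > Provisions: -1234.56 EUR (page 3)
```

Returning `None` instead of the least-bad table is the design choice that matters here: a result that does not reconcile is handed to a reviewer, never shipped.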

This is also why we are careful about the numbers we publish. We see 97%+ zero-shot precision on common financial documents, and we build the tooling to catch the last few percent rather than pretend it does not exist, because in finance the last few percent is precisely where the total lives.


Closing

The industry built table-extraction metrics to answer "is this roughly the right table." Finance was always asking a different question: "do these numbers hold?"

A model can pass the first test and fail the second on the one row that matters, and the score will never tell you. So stop grading financial tables on how they look. Grade them on whether they balance.

If your extractor has ever passed a benchmark and still handed you a statement that did not close, you already know which number was wrong. It was the total. It is always the total.
