We love PDFs. They look the same on every device, they print beautifully at any size, and they’re the closest thing we have to digital paper. But every time someone on our team says "let’s just extract the data from the PDF," we feel an ancient PostScript daemon wake up and whisper: “I was born to paint pixels, not to structure your rows.”
In this article, we walk through why PDFs are great for presentation and terrible for data. We’ll peek inside a PDF’s guts, build a tiny example, and then try (and fail, and try again) to extract a fictional bank statement. By the end, we hope you’ll stop expecting PDFs to behave like CSVs wearing a trench coat.
The wrong ancestry
PDF didn’t grow up wanting to be an API. It grew up wanting to be a printer driver that never had to meet your printer. The PDF model is essentially: a sequence of drawing instructions.
That means “draw text here, in this font, at this size, then shift a bit, then draw more text.” It does not mean “this is a table with 5 columns and a header.” There’s no native concept of rows, columns, or even words. There are only glyphs placed at coordinates.
If you keep that in mind, everything that follows makes… well, not sense, but at least feels less malicious.
A 30‑second tour of a PDF
Inside a PDF you’ll find:
Objects: numbered blobs (dictionaries, arrays, streams) that define pages, fonts, images, etc.
Content streams: compressed byte streams containing the drawing commands for each page.
Text operators: things like BT (begin text), Tj/TJ (show text), Td/Tm (move the text matrix), etc.
Cross-reference table (xref): the phone book that tells a reader where each object lives in the file.
You can open a simple PDF in a text editor and squint at it. Parts will be readable; parts look like static because they’re compressed streams. That’s normal.
A toy content stream
Here's a drastically simplified (and sanitized) snippet like you might see after decompressing a page content stream:

BT % Begin text object
/F1 12 Tf % Select font F1 at 12 points
72 720 Td % Move to position (72, 720)
(2024-05-01) Tj % Draw simple text "2024-05-01"
0 -20 Td % Move down 20 units
(VEGA PARKING - REF: 827492) Tj % Draw simple text "VEGA PARKING - REF: 827492"
0 -20 Td % Move down 20 units
[(1) -50 (2) -50 (3) -50 (4)] TJ % Draw "1234" with spacing
0 -20 Td % Move down 20 units
[(Amount:) -100 ($) -20 (1) -30 (2) -30 (3) -30 (.) -20 (4) -30 (5)] TJ
% Complex TJ: "Amount: $123.45" with kerning
0 -20 Td % Move down 20 units
(Regular text here) Tj % Simple text again
ET % End text object
Two important things just happened:
- Tj and TJ draw glyphs, not "characters in a word." Sometimes the font encodes glyphs in weird ways (custom encodings, subsets). You can't assume bytes map to Unicode.
- TJ takes an array of strings and numbers. The numbers tweak spacing (kerning). People use it to make numbers look pretty. Your extractor now has to compute the visual positions to reconstruct where "1 2 3 4" actually land.
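To make that concrete, here's a sketch of turning a TJ array into glyph positions. The `tj_positions` helper and the fixed `glyph_width` are illustrative assumptions; real extractors read per-glyph widths from the embedded font and track the full text matrix:

```python
def tj_positions(tj_array, font_size=12.0, glyph_width=0.5, x0=0.0):
    """Return (char, x) pairs for a TJ array like ["1", -50, "2", ...].

    TJ numbers are thousandths of text-space units, *subtracted* from
    the current advance (so negative numbers spread glyphs apart).
    Assumes a fixed glyph width and ignores Tc/Tw/Tz for simplicity.
    """
    x, placed = x0, []
    for item in tj_array:
        if isinstance(item, str):
            for ch in item:
                placed.append((ch, x))
                x += glyph_width * font_size   # advance by the glyph's width
        else:
            x -= item / 1000.0 * font_size     # kerning tweak from the array
    return placed

chars = tj_positions(["1", -50, "2", -50, "3", -50, "4"])
text = "".join(c for c, _ in chars)   # the digits, in visual order
```

Note that "1234" only falls out because we replayed the geometry; the byte stream itself never contained a four-character string.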
Why your extractor gets gaslit
Let’s run through three common gotchas that turn “easy PDF” into “weekend ruined.”
1) Text is not text
If a file embeds a subsetted font (e.g. /F1 pointing at ABCDEE+Inter), the visible letter A might actually be glyph ID 37, which this font happens to map to A; another PDF's glyph 37 might map to Ω. If the font omits a /ToUnicode map (or lies in it), raw extraction yields garbage. OCR won't save you if the text is already vector glyphs.
Symptom: You see Tj strings like \x12\x7F\x03 and your tool happily returns gibberish (or nothing at all).
Fix-ish: Heuristics + font decoding + praying the /ToUnicode map exists. Otherwise you’re in “computer vision” land.
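In sketch form, that "fix-ish" is a lookup table with a loud failure mode. Everything here is hypothetical: a real /ToUnicode CMap is parsed from bfchar/bfrange entries in a CMap stream, not handed to you as a dict:

```python
def decode_string(codes, to_unicode=None):
    """Map raw glyph codes to text; flag undecodable codes.

    `to_unicode` stands in for a parsed /ToUnicode CMap. When it's
    missing, we emit U+FFFD instead of guessing -- we genuinely
    don't know what the glyph means.
    """
    out = []
    for code in codes:
        if to_unicode and code in to_unicode:
            out.append(to_unicode[code])
        else:
            out.append("\ufffd")  # replacement character
    return "".join(out)

cmap = {0x12: "A", 0x7F: "m", 0x03: "t"}        # hypothetical subset mapping
good = decode_string([0x12, 0x7F, 0x03], cmap)   # decodes cleanly
bad = decode_string([0x12, 0x7F, 0x03], None)    # no map: all replacement chars
```

The important design choice is failing visibly: silently passing raw bytes through is how \x11 ends up in a customer's ledger.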
2) Words are an illusion
There are no spaces unless a glyph is literally a space. Many PDFs draw "CHAMPAGNE" as nine independent glyphs with arbitrary gaps. Whether something is a word or two columns coincidentally near each other is your job to infer by clustering coordinates.
Symptom: Your “split on spaces” returns ['CH', 'AMP', 'AGNE'] or worse, merges two columns into one monster word.
Fix-ish: Reconstruct reading order using the text matrix and tolerance thresholds for X/Y gaps. Expect to tune per‑document.
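A minimal version of that clustering, assuming you already have positioned glyphs as (x, y, width, text) tuples. This is a simplified model; `cluster_lines` and `split_words` are illustrative names, and the tolerances are exactly the per-document knobs mentioned above:

```python
def cluster_lines(glyphs, y_tol=2.0):
    """Group (x, y, w, text) glyphs into lines: same line if |dy| <= y_tol."""
    lines = []
    for g in sorted(glyphs, key=lambda g: -g[1]):  # top of page first
        if lines and abs(lines[-1][-1][1] - g[1]) <= y_tol:
            lines[-1].append(g)
        else:
            lines.append([g])
    return lines

def split_words(line, x_gap=3.0):
    """Split a line into words where the gap after a glyph exceeds x_gap."""
    line = sorted(line, key=lambda g: g[0])        # left to right
    words, current = [], [line[0]]
    for prev, g in zip(line, line[1:]):
        if g[0] - (prev[0] + prev[2]) > x_gap:     # gap past prev's right edge
            words.append("".join(g[3] for g in current))
            current = [g]
        else:
            current.append(g)
    words.append("".join(g[3] for g in current))
    return words

glyphs = [(10, 700, 5, "H"), (16, 700.5, 5, "i"), (40, 700, 5, "P"),
          (46, 700, 5, "D"), (52, 700, 5, "F"),
          (10, 680, 5, "o"), (16, 680, 5, "k")]
lines = cluster_lines(glyphs)
words = [split_words(line) for line in lines]
```

Note the 0.5-point y jitter on the "i": real PDFs are full of that, which is why the comparison is a tolerance, not equality.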
3) Layouts are bespoke snowflakes
Bank A’s “Amount” may be right‑aligned at x=480; Bank B uses a table; Bank C renders each digit individually with TJ to align decimals. The only consistent thing about “Description / Date / Amount” is that it is inconsistent.
Symptom: Your rule‑based parser works on 9 PDFs then silently drops decimals on the 10th.
Fix-ish: Separate rendering from semantics. Build a visual model (boxes, lines, text runs) and then a learned mapping from visuals → fields. Or maintain per‑issuer templates if your long‑term plan includes sadness.
A tiny experiment: create, then extract
First, we'll generate a toy bank statement. Any PDF library works; the key point is that we render what humans see, not "structured data."
# Generate a one-page statement as humans see it
open_page(width=letter, height=letter)
draw_text("Sample Bank Statement", at=(x_left, y_top))
draw_text("Date", at=(x_date, y_head))
draw_text("Description", at=(x_desc, y_head))
draw_text_right("Amount", right_edge=x_amount_right, y=y_head)
for (date, desc, amount) in transactions:
    draw_text(date, at=(x_date, y))
    draw_text(desc, at=(x_desc, y))
    # critical: render amount with digit-by-digit kerning (TJ)
    draw_digits_with_spacing(amount, right_edge=x_amount_right, y=y)
    y = y - line_height
save_pdf("sample.pdf")
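If you'd rather see the format itself than lean on a library, here's a stdlib-only sketch that hand-writes a minimal, uncompressed one-page PDF. It cuts every corner (no escaping of parentheses in text, no font metrics, no compression), but it should open in most viewers and is handy for testing extractors:

```python
def make_minimal_pdf(lines, path=None):
    """Write a tiny one-page PDF that draws `lines` of ASCII text with Tj."""
    # Content stream: plain Tj operators, one line every 20 units
    content = "BT /F1 12 Tf 72 720 Td " + " ".join(
        f"({t}) Tj 0 -20 Td" for t in lines) + " ET"
    objects = [
        "<< /Type /Catalog /Pages 2 0 R >>",
        "<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        "/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>",
        "<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        f"<< /Length {len(content)} >>\nstream\n{content}\nendstream",
    ]
    parts, offsets = ["%PDF-1.4"], []
    pos = len(parts[0]) + 1                       # +1 for the joining newline
    for i, body in enumerate(objects, start=1):
        offsets.append(pos)                        # byte offset for the xref
        obj = f"{i} 0 obj\n{body}\nendobj"
        parts.append(obj)
        pos += len(obj) + 1
    # The xref table: 20-byte entries pointing at each object
    xref = ["xref", f"0 {len(objects) + 1}", "0000000000 65535 f "]
    xref += [f"{off:010d} 00000 n " for off in offsets]
    parts.append("\n".join(xref))
    parts.append(f"trailer\n<< /Size {len(objects) + 1} /Root 1 0 R >>"
                 f"\nstartxref\n{pos}\n%%EOF")
    data = "\n".join(parts).encode("ascii")
    if path:
        with open(path, "wb") as f:
            f.write(data)
    return data
```

Because the content stream is uncompressed, you can open the result in a text editor and see every Tj, which makes the extraction experiments below much less mysterious.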
Now let's try to extract. We'll demo three strategies in pseudocode first (see Appendix for runnable scripts): "strings hammer," "naive text walk," and "layout-aware."
# Method 1: YOLO (string hunting)
bytes = read_file("sample.pdf")
visible = grep_text(bytes, patterns=["%PDF", "xref", "/Type", "/Page"])
print(visible) # may find container/meta tokens; page streams are compressed
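Here's a runnable version of the string hunt, to show both why it's tempting and why it fails: on an uncompressed toy stream the regex finds everything; on Flate-compressed real files it finds nothing. (`yolo_extract` is our name, not a library API.)

```python
import re

def yolo_extract(pdf_bytes):
    """Grab literal strings in front of Tj operators from raw bytes.

    Only works when content streams happen to be uncompressed; on a
    typical Flate-compressed PDF this returns an empty list.
    """
    return [m.group(1).decode("latin-1")
            for m in re.finditer(rb"\(([^()\\]*)\)\s*Tj", pdf_bytes)]

sample = b"BT /F1 12 Tf (Hello) Tj 0 -20 Td (World) Tj ET"
found = yolo_extract(sample)        # finds both strings on the toy stream
nothing = yolo_extract(b"\x78\x9c deflate-looking opaque bytes")
```

It also silently ignores TJ arrays and escaped parentheses, which is a preview of how every "quick" PDF hack dies.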
# Method 2: Naive text walk
lines = []
for page in pdf_pages("sample.pdf"):
    buf = ""
    prev_y = None
    for op in page.text_ops():  # yields chars in operator order
        if op.type == "char":
            if prev_y is not None and (prev_y - op.y) > y_tol:
                lines.append(buf.strip()); buf = ""
            buf += op.text
            prev_y = op.y
        elif op.type in {"space", "newline"}:
            buf += " "
    if buf: lines.append(buf.strip())
print(lines)  # words mashed, columns merged, garbled text if ToUnicode is missing
# Method 3: Layout-aware graph
runs = []
for page in pdf_pages("sample.pdf"):
    glyphs = [(g.x, g.y, g.w, g.text) for g in page.glyphs()]
    lines = cluster_by_y(glyphs, tol=y_tol)
    words = [cluster_by_x(line, gap=x_tol) for line in lines]
    right_guide = infer_right_edge([w for line in words for w in line
                                    if looks_numeric(w)])
    for line in words:
        amount = snap_right_aligned(line, right_guide, tol=snap_tol)
        date = first_token_like("YYYY-MM-DD", line)
        desc = tokens_between(date, amount, line)
        if date and amount:
            runs.append({
                "date": date.text,
                "description": join(desc),
                "amount": parse_locale_number(amount.text),
            })
print({"transactions": runs})
{"transactions": [
{"date": "2024-01-03", "description": "VEGA PARKING - REF: 827492", "amount": -3.45},
{"date": "2024-01-05", "description": "LUMINA PAYMENTS - Transfer ID: 14x8Nqm7", "amount": 796.60},
{"date": "2024-01-08", "description": "STELLAR TRANSPORT - Invoice F887", "amount": -63.36}
]}
Spoiler: Method 1 extracts almost nothing. Method 2 returns garbled text for that missing /ToUnicode line and merges columns. Method 3 is… finally usable, until you try a second bank’s template.
The bank‑statement boss fight
Let’s pretend a customer uploads three different bank statements for KYC. They look the same to a person:
- Date, Description, Amount.
- Negative amounts have a minus sign; decimals use a comma because… Europe.
Under the hood, they’re completely different monsters:
- Issuer A uses ToUnicode and plain Tj. Easy mode.
- Issuer B draws digits with TJ like [ (1) -30 (2) -30 (3) -30 (,) -20 (4) ] so decimals align. Right-aligned at x≈480.
- Issuer C embedded a Type 3 font with custom glyphs. No ToUnicode. The comma is actually glyph id 17, which your library decodes as \x11.
A pipeline must:
- Decode fonts when possible; fall back to OCR only where necessary.
- Reconstruct reading order by geometry (not text order).
- Normalize locales (, vs .; non-breaking spaces; parentheses for negatives).
- Detect and fix "digit salad" from TJ by computing actual glyph positions.
Here’s the kind of post‑processing you end up writing (pseudocode):
# (placeholder) postprocess.py
for run in text_runs:
    if looks_like_right_aligned_amount(run):
        amount = stitch_digits_by_x(run.glyphs, tolerance=1.5)
        amount = normalize_decimal_separators(amount)
        yield {"amount": parse_money(amount), "x": run.right_x}
And no, it’s not one function. It’s a garden of heuristics, each adopted after a bug report.
“But mine works on 95% of files!”
Same. The median PDF is fine. The tail is where compliance teams live. The long tail includes:
- Scans (bitmaps) interleaved with vector text.
- Linearized PDFs where objects get shuffled for streaming.
- Weird xrefs and compressed object streams.
- Tagged PDFs that pretend to have structure (sometimes helpful! sometimes lies!).
You don’t notice these until a critical client uses exactly that export from exactly that core banking system.
Closing thought: how we do this at Holofin
PDFs are wonderful at what they were designed to do: reliably show humans the same page. If you need data, ask for data: CSV, JSON, XLSX. And when reality says "the regulator wants PDFs," you need a pipeline that treats PDFs like the tiny graphics programs they are.
At Holofin, that’s literally our job description: turn chaotic, real‑world PDFs into structured, validated, production‑ready data, from clean exports to coffee‑stained 1990s scans.
Our principles
- Structure before semantics. We rebuild geometry first (glyphs → words → lines → blocks → tables) and only then assign meaning. This avoids “right numbers, wrong header” bugs.
- Anchors everywhere. Every value carries page number, bounding box, and header stack lineage so you can click back to source in review/debug.
- Deterministic outputs. We deliver consistent, auditable values derived from the document content; units and currencies are preserved as provided by the source.
- Constraints > vibes. Totals must sum to subtotals; balance sheets must balance; dates must be plausible. When rules fail, we try alternative strategies and heuristics.
How this looks on a bank statement
Remember our three issuers (plain Tj, kerning with TJ, subset fonts with no ToUnicode)? Holofin handles them by:
- Resilient text extraction: We combine native text decoding with OCR fallback where appropriate to ensure consistent results across fonts, encodings, and scans.
- Geometry‑first layout reconstruction: We rebuild reading order, lines, and columns from on‑page geometry so formatting differences don’t break parsing.
- Domain‑aware interpretation: We assign semantics with financial validations (e.g., balances and subtotals must reconcile) to prevent plausible‑but‑wrong values.
- Auditable, reviewable output: We return structured data with provenance to support human review and traceability.
Why this matters
In finance, a one‑percent extraction error doesn't feel like a typo; it feels like a valuation shift. We engineer for accuracy and consistency through layered controls, reconciliations, and comprehensive audit trails, delivering reliability that downstream systems and reviewers can trust.
What you get out of the box
- 97%+ zero‑shot precision on common financial documents (and tools to review the last few percent).
- Multi‑document processing (e.g., a year of statements) in a single API call with consolidated, normalized output.
- Debug mode with source ground truth & audit trails for every value.
- Enterprise‑ready REST API and web UI, built for scale and security, with configurable data retention (GDPR‑friendly by default).
If this resonates with the scars on your own PDF parser, let’s talk.