Document Fraud Detection: What a PDF Can't Hide

Your eyes say the document is fine. The file's own structure says otherwise.

Greg T · Engineering · 10 min read · Mar 23, 2026

We used to think document fraud was a visual problem. Wrong fonts. Misaligned columns. A logo that felt slightly off. We built checks around what humans see, because what humans see is all we had.

Then a bank statement came through our pipeline. Clean layout. Correct balances. Every visual check passed. The extraction ran perfectly. But something about the file felt heavy. Too many objects for a six-page statement, like a suitcase that weighs more than its contents should allow. We opened it in a hex editor and found three cross-reference sections, two fonts that only appeared on page 4, and a /TouchUp_TextEdit MP operator: Adobe Acrobat's own breadcrumb, left behind every time someone uses "Edit Text & Images."

The statement was a fake. The numbers were fiction. And our eyes never stood a chance.

The fraud wasn't in what we could see. It was in how the file was built.



Artisanal forgery is dead

Document fraud used to require skill. A forger needed design tools, font knowledge, patience, and a reasonable understanding of what a bank statement should look like.

That was before template farms.

Today, there are over 160 websites selling pre-built document templates: bank statements, payslips, tax returns, utility bills. Average price: $28. Some offer subscription plans. The buyer fills in their own numbers, exports a PDF, and submits it for a loan, a lease, or an account opening. Industry reports analyzing hundreds of millions of documents paint a consistent picture: roughly 1 in 3 shows structural integrity issues, and serial fraud (the same template reused across multiple applications) has increased severalfold year over year. One cluster alone contained over 23,000 coordinated documents from a single campaign.

This isn't craft anymore. It's a supply chain.


The pixel-perfect lie

A trained analyst can spot obvious fakes. But the gap between "looks wrong" and "looks right" has collapsed. Modern editing tools produce results that are visually indistinguishable from the real thing.

Here's what we had to learn the hard way: visual quality doesn't imply structural integrity.

A PDF is not a picture. It's a program. If you've read our piece on PDF internals, you know that every page is a sequence of drawing instructions: glyphs placed at coordinates, wrapped in objects, linked by cross-reference tables, annotated with metadata, compressed into streams. All of this structure exists below the visual surface.

When someone edits a PDF, they change what you see. But they also change the structure. New fonts get embedded. Object counts shift. Content streams get rewritten. Metadata timestamps update (or get stripped). The file's internal coding style (how its trailer is organized, what keys appear in its cross-reference table, whether it uses LF or CRLF line endings) may no longer match what the metadata claims.

A PDF can lie about what it shows. It can't easily lie about how it was built.


Listening to the file

We had to stop looking at the page and start looking inside the file. Here's the trail of breadcrumbs we learned to follow, layer by layer.

The easy-to-fake stuff

Every PDF carries creation and modification dates, a producer application, and often an author field. Does the producer match what you'd expect from this bank? Is there a suspicious gap between creation and modification? Were the metadata fields stripped entirely?

But metadata is the weakest signal. Any competent editor can spoof it. Some legitimate banks ship PDFs with minimal metadata. And merely downloading a PDF updates the modification date in some viewers. Metadata anomalies are a starting point, not a conclusion.
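Even the weak signal is worth collecting. A minimal sketch of pulling the document-information fields straight out of the raw bytes (the regex and the handling here are illustrative; real PDFs can also store these fields in compressed object streams, where a naive byte scan will miss them, so a production version would use a proper PDF parser):

```python
import re

def extract_info_fields(raw: bytes) -> dict:
    """Scan raw PDF bytes for common document-information keys.

    Naive byte scan: misses values stored in compressed object streams
    and only handles literal-string values like /Producer (LibreOffice).
    """
    fields = {}
    for key in (b"Producer", b"Creator", b"CreationDate", b"ModDate", b"Author"):
        m = re.search(rb"/" + key + rb"\s*\(([^)]*)\)", raw)
        if m:
            fields[key.decode()] = m.group(1).decode("latin-1")
    return fields

sample = b"<< /Producer (LibreOffice 7.4) /CreationDate (D:20240101120000Z) >>"
print(extract_info_fields(sample))
```

Whatever this returns is only a starting point: compare the producer against expectations, look at creation/modification gaps, and note stripped fields, but never conclude from metadata alone.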

The fingerprint forgers can't wipe

This is where it gets interesting.

In 2021, researchers Adhatarao and Lauradoux published a paper showing that the coding style of a PDF (the specific combination of keys in its trailer, the format of its cross-reference table, header magic bytes, and line endings) acts as a fingerprint for the software that created it.

LibreOffice always includes a /DocChecksum key. Microsoft Word uses both /Prev and /XRefStm in its trailer. PDFLaTeX writes a lowercase /info key where everyone else capitalizes it. Chrome's Skia engine omits /ID from the trailer and uses LF line endings.

These patterns survive metadata stripping. You can delete the "Producer: LibreOffice" string from the metadata, but you can't easily remove the /DocChecksum from the trailer without re-encoding the entire file. The structural fingerprint reveals the actual producer even when the metadata lies.
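A stripped-down sketch of the idea, using the two positive trailer traits named above (the checks are naive substring matches and the table is illustrative, not our production fingerprint set; absence-based traits like Skia's missing /ID need more careful handling):

```python
FINGERPRINTS = {
    # Producer-specific trailer traits; naive substring checks, illustrative only.
    "LibreOffice": lambda t: b"/DocChecksum" in t,
    "Microsoft Word": lambda t: b"/Prev" in t and b"/XRefStm" in t,
}

def structural_producers(raw: bytes) -> list:
    """Return producers whose trailer coding style matches the file."""
    idx = raw.rfind(b"trailer")          # inspect the last trailer section
    trailer = raw[idx:] if idx != -1 else raw
    return [name for name, test in FINGERPRINTS.items() if test(trailer)]
```

Cross-check the result against the metadata's claimed producer: agreement means little, but disagreement is the mismatch signal described below.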

When we detect a mismatch, say metadata claims "BankingCorePlatform 4.2" but the structural fingerprint says LibreOffice, that's a signal. Not proof. But a signal worth corroborating.

Adobe's tattletale operator

PDF editors leave breadcrumbs in the content streams themselves.

Adobe Acrobat inserts a /TouchUp_TextEdit MP operator every time someone uses the text editing tool. It's a marked-point operator, part of the PDF spec for tagging content, repurposed by Adobe to track its own edits. Each edited region gets one. Edit five amounts on a page, get five markers. (Adobe didn't build this to catch fraudsters. They built it for their own content management. We just happen to find it useful.)


Iceni Infix, a professional PDF editor, uses a different mechanism: /IceniObject <<...>> DP operators wrapping modified text blocks. The dictionary contains metadata about the edit.

These are not hidden in obscure locations. They're inside the content stream, right next to the drawing instructions. Most PDF viewers ignore them. We read them as directly as we read font commands.
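Scanning for these markers is a few lines once the stream is in hand. A sketch (the marker list is the two examples above, not exhaustive; note that content streams are usually Flate-compressed, so in practice you decompress with zlib before scanning):

```python
import re

# Editor breadcrumbs described above; illustrative, not an exhaustive list.
EDITOR_MARKERS = {
    "Adobe Acrobat text edit": rb"/TouchUp_TextEdit\s+MP",
    "Iceni Infix edit": rb"/IceniObject\b",
}

def find_editor_markers(stream: bytes) -> dict:
    """Count editor-inserted markers in a decompressed content stream."""
    return {
        name: len(re.findall(pattern, stream))
        for name, pattern in EDITOR_MARKERS.items()
        if re.search(pattern, stream)
    }
```

Edit five amounts, find five Acrobat markers: the counter doubles as a rough measure of how much of the page was touched.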

When fonts tell on you

Fonts are surprisingly talkative. A PDF generated by a single application, in a single pass, will have consistent font characteristics: same embedding strategy, same subset naming convention, compatible creation timestamps in the font's internal tables.

A PDF that's been edited tells a different story.

A font that appears on only one page, while every other page uses a different set, suggests that page was modified or assembled separately. A font subset containing 3 glyphs but weighing 15 KB smells wrong. So does a "subset" with 500+ glyphs (essentially the full font) in a document where everything else is properly subsetted.

Then there are the timestamps. The head table inside a TrueType font contains a creation date. When that date is years apart from the PDF's creation date, the font was likely embedded from a different source. And the OS/2 table includes a vendor ID. A document with fonts from three different vendors is unusual if the claimed producer is a banking application that ships its own font set.

The best part? Font editors leave their name in the font's name table. Finding "FontForge" or "AFDKO" markers inside a font that's supposed to come from a bank's core system is... educational.
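The timestamp check reduces to simple date arithmetic once the font tables are parsed. A sketch assuming the head.created value is already available (for instance, fontTools exposes it as `font['head'].created`); the key fact is that TrueType stores it as seconds since the 1904 epoch:

```python
from datetime import datetime, timedelta, timezone

# TrueType 'head' dates are LONGDATETIME: seconds since 1904-01-01 00:00 UTC.
TTF_EPOCH = datetime(1904, 1, 1, tzinfo=timezone.utc)

def font_created(head_created_secs: int) -> datetime:
    """Convert a parsed head.created value to an absolute datetime."""
    return TTF_EPOCH + timedelta(seconds=head_created_secs)

def timestamp_gap_years(head_created_secs: int, pdf_created: datetime) -> float:
    """Years between the font's internal creation date and the PDF's."""
    return abs(pdf_created - font_created(head_created_secs)).days / 365.25
```

A multi-year gap doesn't prove tampering on its own (banks do ship old fonts), but it corroborates the other font signals: page isolation, odd subsetting, mixed vendor IDs, and editor names in the name table.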

The edit history PDFs can't delete

PDFs support incremental saves. Instead of rewriting the entire file, an editor appends new objects and a new cross-reference table at the end. The original content remains intact earlier in the file.


This means a PDF can contain its own edit history. The original page objects, the modified page objects, and the trail connecting them. We can count revisions (more than one is unusual for a bank-generated statement), identify which objects changed, detect content modified after a digital signature was applied, and spot files that were re-saved by a different tool without changing content (a common obfuscation technique).

Three or more cross-reference sections in a bank statement is a critical signal. Banks generate statements in a single pass. They don't go back and edit them.
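Counting saves is the cheapest of these checks, because every full or incremental save appends its own startxref pointer. A heuristic sketch (a byte count, not a real revision parser; linearized PDFs legitimately carry two startxref keywords, so the threshold needs context):

```python
def count_saves(raw: bytes) -> int:
    """Approximate the number of saves by counting 'startxref' pointers.

    Each full or incremental save appends one, so a single-pass file
    normally has exactly one. Heuristic: linearized PDFs carry two.
    """
    return raw.count(b"startxref")

def flag_incremental_saves(raw: bytes) -> bool:
    """Flag files with more saves than a single-pass generator produces."""
    return count_saves(raw) > 1
```

A real implementation would walk the cross-reference chain via /Prev to recover which objects each revision touched, not just how many revisions exist.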


One anomaly is a coincidence

Here's the part that most fraud detection articles skip: individual signals are unreliable.

A metadata gap? The bank's server might have a clock offset. Font page isolation? Could be a legitimate layout change between sections. A high object count? Some PDF generators are verbose. Every signal we've described has an innocent explanation.

The key isn't any single signal. It's the convergence.

We organize forensic evidence into six domains: content, typography, metadata, structure, media, and security. Each domain captures a different dimension of the document's integrity. A finding in one domain is a note. Findings in two domains are a concern. Findings in three or more domains are a pattern that's hard to explain away.

A document with stripped metadata and nothing else? Plenty of legitimate documents have minimal metadata. Low score.

That same document with stripped metadata, plus fonts that don't match the claimed producer, plus a content stream containing editor markers, plus two incremental saves? Now you have evidence from four domains. Each finding individually has an innocent explanation. Together, the probability that all four are coincidental drops fast.

One lie is an anomaly. Four lies are a pattern.

The scoring reflects this. A single-domain finding gets no amplification. Two corroborating domains: 1.25x. Three or more: 1.5x. A sophisticated fake that leaves traces across multiple forensic layers gets flagged much more aggressively than a document that merely has unusual metadata.
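The shape of that scoring can be sketched in a few lines. The multipliers are the ones stated above; the severity weights are illustrative stand-ins, not our production values:

```python
SEVERITY = {"low": 1, "medium": 3, "high": 5}   # illustrative weights
MULTIPLIER = {1: 1.0, 2: 1.25}                  # 3+ domains amplify by 1.5

def convergence_score(findings) -> float:
    """Score (domain, severity) findings, amplifying cross-domain corroboration."""
    if not findings:
        return 0.0
    base = sum(SEVERITY[sev] for _dom, sev in findings)
    domains = {dom for dom, _sev in findings}
    return base * MULTIPLIER.get(len(domains), 1.5)
```

The point of the structure: a pile of signals inside one domain gets no amplification, while the same weight spread across three domains is boosted, because independent layers agreeing is what's hard to fake.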


Flipping the question

Forensic signals detect anomalies. But anomaly detection has a symmetry problem: a document from an unusual but legitimate source looks just as "anomalous" as a tampered one.

Templates flip the question. Instead of asking "what's wrong with this document?" you ask "does this document match a known-good example?"

For high-volume document types (bank statements from major institutions, utility bills from large providers) we build template baselines from verified samples. A template captures structural fingerprints (expected fonts, metadata patterns, layout characteristics) and visual identity. We teach the system what a real Société Générale statement looks like: not just the logo, but the layout, the header region, the structural patterns. So when a new document arrives, we can say "this is consistent with what we've seen before" or "this doesn't match anything we trust."

A strong template match is a trust signal: positive evidence that the document's visual structure matches verified examples. When combined with clean forensic signals, it produces a "trusted" assessment. When forensic signals fire despite a template match, that's especially interesting: it suggests someone built a document to look like a known template, but the structural internals tell a different story.
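That decision matrix is small enough to write down. A sketch with illustrative labels (the real assessment is graded, not a four-way switch):

```python
def assess(template_match: bool, forensic_findings: int) -> str:
    """Combine template evidence with forensic signals; labels illustrative."""
    if template_match and forensic_findings == 0:
        return "trusted"        # positive evidence plus a clean scan
    if template_match:
        return "suspicious"     # looks like a known template; internals disagree
    if forensic_findings > 0:
        return "high-risk"
    return "unknown"            # clean but unverified: absence of evidence only
```

Note the asymmetry: "trusted" requires positive template evidence, while a clean scan of an unknown layout only earns "unknown".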


What we can't catch (yet)

We're not going to pretend this catches everything. It doesn't.

Image-only submissions defeat structural analysis. If someone photographs a screen showing a fake statement, the result is a JPEG in a PDF wrapper. There's no content stream to analyze, no fonts to inspect, no revision history. The analysis falls back to image forensics (spectral analysis, noise patterns, DCT block artifacts), which is a different and weaker game.

Format-hopping is deliberate evasion. Roughly 1 in 4 high-risk submissions uses a different file format than the source document. Someone generates a PDF, screenshots it, submits the screenshot as a JPEG, then wraps it back into a PDF. Each conversion strips forensic evidence. It's the document equivalent of laundering a serial number.

Perfect template reproduction is possible. If a fraudster obtains the exact software and configuration used by a bank, they can produce PDFs with matching structural fingerprints. No mismatch to detect. The document looks legitimate because it was produced by legitimate tools. At that point, the fraud is in the content, not the container.

This is why fraud detection is a layered problem. Structural forensics catches the class of fraud where the container contradicts its visual claims. Content validation (do the numbers add up? does the balance equation hold?) catches another. Network analysis (have we seen this exact template across different applicants?) catches a third.

No single layer is sufficient. The question is always: how many layers would a fraudster need to defeat simultaneously?


The lessons that survived production

When we started building this at Holofin, we thought we could just count anomalies. Flag anything with more than five signals. Ship it.

We quickly realized a raw signal count is useless. Twenty low-severity signals in one domain (say, a verbose PDF generator that triggers a dozen structural checks) aren't as meaningful as three medium-severity signals across three different domains. The signal count was noise. The signal convergence was the insight.

So we rebuilt around a few principles:

  • Signals are cheap, findings are expensive. Running dozens of checks is fast. Interpreting them correctly is the hard part. Raw counts are misleading. What matters is whether signals corroborate across domains.

  • Trust requires evidence, not just absence of risk. A clean scan doesn't earn "trusted" status. That requires positive template evidence, a verified match against a known-good baseline. Absence of fraud signals might mean the document is clean. It might also mean it's a format we haven't learned to analyze yet. We'd rather say "we don't know" than "looks fine."

  • No coin flips. Every signal is computed from the file's binary structure. Same input, same output, every time. No model confidence, no temperature setting, no variation between runs. When a forensic signal fires, it points to a specific structural fact (an object, a font table entry, a content stream operator) that you can open in a hex editor and verify yourself. Black-box risk scores are useless in compliance.

  • Explain everything. An analyst seeing "high risk" should be able to trace the assessment back to specific findings, specific signals, specific bytes in the file. If we can't explain the score, the score is worthless.


The uncomfortable truth about document fraud is that it's asymmetric. Forging a visually convincing PDF takes $28 and twenty minutes. Detecting that forgery requires examining the file at a level most humans never see: font binaries, content stream operators, cross-reference table structures, revision chains.

But the asymmetry cuts both ways. A forger can make a PDF look like anything. Making it be built like the real thing is a much harder problem. The structural fingerprint of a genuine bank statement (its fonts, its producer coding style, its single-pass generation, its consistent metadata) is the accumulated result of a specific software stack processing real data.

Replicating that is possible. Replicating it at scale, across dozens of document types, while also getting the content right, the balances correct, and the dates plausible?

That's no longer a $28 problem.
