Document Tampering Detection: How AI Identifies Forged and Manipulated Documents

Document fraud is one of the oldest forms of deception. What’s changed is the tooling. AI has made it dramatically easier to produce convincing forged documents, and correspondingly more important to deploy AI-powered detection.

A fraudulent police report, medical record, repair estimate, or identity document can look indistinguishable from a genuine one when produced with modern tools. Large language models generate text with appropriate terminology and formatting. Image generation tools produce logos, signatures, stamps, and letterheads. Layout tools combine them into documents that pass visual inspection.

Detection must match this capability.

Types of Document Tampering

1. Fully AI-Generated Documents

What it is: The entire document is created from scratch — content, formatting, logos, signatures — using AI tools. No genuine original exists.

Example: A fraudulent police report for a fictitious accident, generated with an LLM for the text, AI-generated logos for the police department header, and a generated signature for the “filing officer.”

Current capability: LLMs (GPT-4, Claude, Gemini) produce text that uses appropriate legal, medical, or technical terminology. Image generation tools produce logos and visual elements. Template-based tools combine them into properly formatted PDFs. The output is sufficient to pass visual review by someone unfamiliar with the specific issuing authority’s document formatting.

2. Altered Genuine Documents

What it is: A genuine document is modified — changing dates, amounts, names, or other details while retaining the authentic structure, formatting, and most of the original content.

Example: A genuine repair estimate for A$2,000 altered to read A$12,000 by modifying specific text fields in a PDF. The document layout, repair shop logo, and formatting are authentic; only the figures have been changed.

Current capability: PDF editing tools (including free ones) allow text modification in many PDFs. For scanned documents, image editing tools can alter text within the scan. The original document provides authentic context that makes the alteration harder to detect visually.

3. Composite Documents

What it is: Elements from multiple genuine documents are combined into a new document that never existed as a whole.

Example: A medical report with a genuine hospital letterhead (from one document), genuine formatting (from another), and fabricated clinical content. The borrowed elements are authentic; the clinical content and the document as a whole are fraudulent.

4. Document Element Fraud

What it is: Specific elements within a document are fabricated or replaced while the document structure remains genuine.

Example: A genuine invoice from a real business, with the amount or line items altered. Or a genuine medical report with the diagnosis or treatment details changed.

Detection Methods

Text Analysis (Industry and Research Context)

Current product scope: deetech does not currently perform AI-generated-text classification. Its supported document pipeline uses text layers or confidence-gated OCR only for deterministic fabrication checks, alongside structural, metadata, and supported visual analysis. The AI-authorship techniques below describe broader industry and research methods, not a current product feature.

AI-generated text detection. LLM-generated text has statistical properties that differ from human-written text:

Token probability distributions. LLM output tends to favor high-probability tokens, producing text that is statistically “smoother” than human writing. Detection models score this smoothness.
Perplexity analysis. LLM output typically has lower perplexity (more predictable) than human-written text of the same complexity.
Stylistic consistency. LLM-generated text maintains more consistent style throughout a document than human writing, which naturally varies in formality, sentence structure, and vocabulary.
Factual consistency. LLMs may generate internally consistent text that contains factual errors invisible to someone unfamiliar with the subject matter — incorrect procedure codes, non-existent policy numbers, or formatting that doesn’t match the claimed issuer’s actual standards.
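As an illustration of the perplexity idea above (a research technique, not a deetech feature), the following minimal sketch computes perplexity from per-token log-probabilities. The log-probability values are hypothetical stand-ins for what a scoring language model might report, and any decision threshold would need calibration on real data.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean(log p)), using natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical values: in practice these come from a scoring language model.
predictable = [math.log(0.5)] * 4    # every token had probability 0.5
surprising = [math.log(0.125)] * 4   # every token had probability 0.125

print(perplexity(predictable))  # approximately 2
print(perplexity(surprising))   # approximately 8
```

Lower perplexity means the scoring model found the text more predictable. On its own this is weak evidence, since formulaic human writing such as standard legal wording also scores low.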

Text modification detection. When genuine text has been altered:

Font inconsistencies. Modified text may use slightly different font rendering, character spacing, or line spacing than the surrounding original text.
Copy-paste artifacts. Text pasted from different sources may carry formatting artifacts — different character encoding, invisible formatting marks, or spacing variations.
Linguistic analysis. Altered sections may show different writing style, vocabulary, or technical accuracy than the original text.
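One simple way to surface font inconsistencies is to look for text spans set in a font that covers only a small share of the document's characters. The sketch below assumes text spans have already been extracted into dictionaries with hypothetical `text`, `font`, and `size` keys; the 10% cutoff is an arbitrary illustration.

```python
from collections import Counter


def minority_font_spans(spans, max_share=0.1):
    """Return spans whose (font, size) pair covers at most max_share of all characters."""
    chars_per_style = Counter()
    for span in spans:
        chars_per_style[(span["font"], span["size"])] += len(span["text"])
    total = sum(chars_per_style.values())
    if total == 0:
        return []
    return [
        span for span in spans
        if chars_per_style[(span["font"], span["size"])] / total <= max_share
    ]


# Hypothetical spans from one line of a repair estimate.
spans = [
    {"text": "Total repair cost: A$", "font": "Helvetica", "size": 10},
    {"text": "12,000", "font": "Arial", "size": 10},
    {"text": ".00 including parts and labour for the vehicle described above",
     "font": "Helvetica", "size": 10},
]

for span in minority_font_spans(spans):
    print(span["font"], repr(span["text"]))
```

Headings, bold text, and footers legitimately use rare fonts, so a flagged span is a prompt for closer inspection rather than proof of alteration.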

Visual Element Analysis

Logo and signature verification. AI-generated logos and signatures have detectable characteristics:

Resolution and rendering. Generated logos may have different resolution, compression, or rendering quality than authentic logos extracted from genuine documents.
Consistency with known templates. If the detection system has access to authentic document templates from the claimed issuer, generated visual elements can be compared against the known standard.
Generation artifacts. Logos and signatures produced by image generation tools can carry the same kinds of frequency-domain artifacts found in other AI-generated images.

Stamp and seal analysis. Official stamps, notary seals, and certification marks have physical characteristics (uneven ink distribution, pressure variation) that AI generation often fails to reproduce accurately.

Document Structure Analysis

Layout consistency. Every issuing authority (hospital, police department, repair shop, insurance company) has characteristic document layouts — margins, header placement, font choices, field ordering. Documents that claim to be from a specific issuer but deviate from that issuer’s known layout are suspicious.

Template matching. If the detection system maintains a database of authentic document templates, submitted documents can be compared against the expected template for the claimed issuer. Deviations — wrong field order, different logo placement, non-standard formatting — indicate potential forgery.
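A minimal sketch of the field-order part of template matching follows. It assumes field labels have already been extracted from the submitted document and that a known-good field list exists for the claimed issuer; the field names and the `compare_to_template` helper are hypothetical.

```python
def compare_to_template(expected_fields, observed_fields):
    """Report missing fields, unexpected fields, and whether shared fields keep their order."""
    expected_set = set(expected_fields)
    observed_set = set(observed_fields)
    shared_in_expected_order = [f for f in expected_fields if f in observed_set]
    shared_in_observed_order = [f for f in observed_fields if f in expected_set]
    return {
        "missing": [f for f in expected_fields if f not in observed_set],
        "unexpected": [f for f in observed_fields if f not in expected_set],
        "order_matches": shared_in_expected_order == shared_in_observed_order,
    }


# Hypothetical template for a police report and a submitted document's fields.
template = ["report_number", "date", "officer", "location", "narrative"]
submitted = ["report_number", "officer", "date", "narrative", "badge_color"]

print(compare_to_template(template, submitted))
```

Issuers revise their templates over time, so a real comparison would likely need several accepted versions per issuer.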

Cross-field consistency. Within a document, fields should be consistent — dates should be in the correct format for the jurisdiction, reference numbers should match the issuer’s numbering scheme, and calculated fields (totals, sub-totals) should be arithmetically correct.
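The arithmetic part of cross-field consistency is deterministic and easy to sketch. The example below recomputes a subtotal and total from line items and reports mismatches; the amounts, the 10% tax rate, and the `check_totals` helper are illustrative assumptions, not values from any real document.

```python
from decimal import Decimal

CENT = Decimal("0.01")


def check_totals(line_items, stated_subtotal, tax_rate, stated_total):
    """Recompute subtotal and total from the line items and list any mismatches."""
    subtotal = sum((Decimal(amount) for amount in line_items), Decimal("0"))
    expected_total = (subtotal * (1 + Decimal(tax_rate))).quantize(CENT)
    findings = []
    if subtotal != Decimal(stated_subtotal):
        findings.append(f"subtotal: stated {stated_subtotal}, recomputed {subtotal}")
    if expected_total != Decimal(stated_total):
        findings.append(f"total: stated {stated_total}, recomputed {expected_total}")
    return findings


items = ["1200.00", "800.00"]  # hypothetical line items on a repair estimate

print(check_totals(items, "2000.00", "0.10", "2200.00"))   # consistent document
print(check_totals(items, "2000.00", "0.10", "12200.00"))  # total altered after the fact
```

Amounts are kept as `Decimal` values built from strings so that currency arithmetic is not affected by binary floating-point rounding.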

Metadata and File Analysis

PDF metadata. PDF files contain metadata about creation and modification:

Creation date and modification date
Authoring software (Acrobat, Word, Chrome print-to-PDF, AI tools)
Producer and creator fields
Page count and document structure

A document claiming to be a printed police report but created in Google Docs yesterday is suspicious. A medical record with a modification date after the claimed filing date warrants scrutiny.
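The date checks described above can be sketched as follows. The example assumes the PDF's `CreationDate` and `ModDate` values have already been read into a plain dictionary by some PDF library, and that they use the full `D:YYYYMMDDHHmmSS` form; the timezone suffix is ignored for simplicity, and the `date_findings` helper is hypothetical.

```python
from datetime import date, datetime


def parse_pdf_date(value):
    """Parse the leading D:YYYYMMDDHHmmSS part of a PDF date string (timezone ignored)."""
    digits = value[2:16] if value.startswith("D:") else value[:14]
    return datetime.strptime(digits, "%Y%m%d%H%M%S")


def date_findings(info, claimed_filing_date):
    """Flag creation or modification dates that fall after the claimed filing date."""
    findings = []
    for key in ("CreationDate", "ModDate"):
        stamp = parse_pdf_date(info[key])
        if stamp.date() > claimed_filing_date:
            findings.append(f"{key} {stamp.date()} is after claimed filing date {claimed_filing_date}")
    return findings


# Hypothetical metadata for a record said to have been filed on 2 March 2024.
info = {"CreationDate": "D:20240301093000+10'00'", "ModDate": "D:20240412154500+10'00'"}
print(date_findings(info, date(2024, 3, 2)))
```

Metadata is easy to edit or strip, and re-saving or re-scanning a genuine document also changes these dates, so a mismatch is a reason to look closer rather than a verdict.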

File structure analysis. PDF internal structure reveals its history:

Incremental updates. Modifications to a PDF leave traces in the file structure, even when the visual output looks seamless.
Font embedding. Genuine documents embed fonts consistently; modified documents may embed additional fonts for altered text.
Object structure. The internal object hierarchy of a PDF can reveal whether content was added, removed, or modified after initial creation.
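One rough signal for incremental updates is the number of `%%EOF` markers in the file, because each incremental save appends a new trailer that ends with one. The sketch below counts them in raw bytes; the byte strings are simplified stand-ins, not valid PDFs, and the helper names are hypothetical.

```python
def count_eof_markers(pdf_bytes: bytes) -> int:
    """Count end-of-file markers; each incremental save normally appends one."""
    return pdf_bytes.count(b"%%EOF")


def looks_incrementally_updated(pdf_bytes: bytes) -> bool:
    """True when the file contains more than one end-of-file marker."""
    return count_eof_markers(pdf_bytes) > 1


# Simplified stand-ins for file contents (not valid PDFs).
original = b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\ntrailer\n<<>>\nstartxref\n9\n%%EOF\n"
updated = original + b"2 0 obj\n<<>>\nendobj\ntrailer\n<<>>\nstartxref\n60\n%%EOF\n"

print(looks_incrementally_updated(original))
print(looks_incrementally_updated(updated))
```

Legitimate workflows such as digital signing and form filling also append incremental updates, and linearized files can contain more than one marker, so this count only indicates that the revision history deserves a closer look.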

Scan analysis. For scanned documents (images of physical documents):

Consistent scan quality. All pages should show similar scanning artifacts if scanned on the same device.
Paper texture. Scans of genuine originals show paper texture and print characteristics that can differ from those of digitally created documents that were printed and then scanned.
Modification over scan. If a scanned document has been digitally modified (text changed after scanning), the modification typically lacks the scan artifacts present in the unmodified areas.
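The "modification over scan" idea can be illustrated with a very simple noise check: split a grayscale page into blocks and flag blocks whose pixel variance is far below the typical block. This is a toy sketch under stated assumptions (a small synthetic image held as a list of rows, an arbitrary block size and ratio), not a production forensic method.

```python
from statistics import median, pvariance


def low_noise_blocks(gray, block=4, ratio=0.1):
    """Return (top, left) of blocks whose variance is below ratio * median block variance."""
    height, width = len(gray), len(gray[0])
    variances = {}
    for top in range(0, height - block + 1, block):
        for left in range(0, width - block + 1, block):
            pixels = [gray[y][x]
                      for y in range(top, top + block)
                      for x in range(left, left + block)]
            variances[(top, left)] = pvariance(pixels)
    typical = median(variances.values())
    return [pos for pos, var in variances.items() if var < ratio * typical]


# Synthetic 8x8 "scan": light paper with deterministic pseudo-noise.
page = [[200 + ((x * 7 + y * 13) % 5) for x in range(8)] for y in range(8)]
# Simulate a digital edit: a perfectly flat patch pasted over the bottom-right block.
for y in range(4, 8):
    for x in range(4, 8):
        page[y][x] = 200

print(low_noise_blocks(page))
```

Real scans need a more careful noise model, since compression and image enhancement in the scanner can also flatten regions of a genuine page.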

Application in Insurance

Insurance claims processing involves extensive documentation:

Document Type	Fraud Risk	Detection Priority
Police reports	Fabrication to support staged accidents	High
Medical records	Fabrication or alteration to inflate injury severity	High
Repair estimates	Amount inflation, fabricated line items	High
Invoices	Fabricated invoices for non-existent work	Medium
Correspondence	Fabricated letters from solicitors or authorities	Medium
Identity documents	Synthetic or altered identity	High
Photographic evidence	See our image detection content	High

For a detailed comparison of AI methods versus traditional forensics for insurance documents, see our document forgery detection article.

Building Document Detection Capability

Integration Points

Document detection should run at the point where documents enter your system:

Claim submission (web portal, mobile app, email)
Post-submission document upload
Third-party document receipt (from solicitors, medical providers, repair shops)

What to Report

Effective document detection provides:

Overall authenticity score — probability that the document is genuine
Specific findings — which elements triggered alerts (text, visual elements, metadata, structure)
Confidence levels — per finding
Comparison data — if template matching was used, how the document compares to the expected template
Audit trail — when the analysis was performed, which models were used, and the full chain of evidence
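The report contents listed above could be carried in a structure along these lines. Every class name, field name, and value here is a hypothetical illustration of the list, not a deetech API or output format.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Finding:
    element: str       # "text", "visual", "metadata", or "structure"
    description: str
    confidence: float  # per-finding confidence, 0.0 to 1.0


@dataclass
class DetectionReport:
    authenticity_score: float  # probability that the document is genuine
    findings: list = field(default_factory=list)
    template_comparison: Optional[dict] = None  # present only if template matching ran
    models_used: list = field(default_factory=list)
    analyzed_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


report = DetectionReport(
    authenticity_score=0.18,
    findings=[Finding("metadata", "ModDate is after claimed filing date", 0.9)],
    models_used=["metadata-rules-v1"],  # hypothetical identifier
)

# asdict() converts nested dataclasses too, so the report serializes directly.
print(json.dumps(asdict(report), indent=2))
```

Keeping findings separate from the overall score lets a claims handler see which element triggered the alert instead of acting on a single number.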

Accuracy Considerations

Document detection faces the same lab-to-production gap as image detection:

Documents from diverse issuers (thousands of hospitals, police departments, repair shops) vary enormously in format
Scan quality varies from high-resolution to barely legible
Legitimate documents may have unusual formatting (small businesses, foreign documents, handwritten elements)
False positives on legitimate documents from unfamiliar issuers must be minimized

deetech™ analyses supported documents submitted with insurance claims using deterministic structural and fabrication checks, confidence-gated OCR where supported, metadata signals, and supported visual findings. It does not currently classify document text as AI-generated.