QA Engineering / Test Automation

Automated PDF Compliance Test Harness

Multi-layer validation framework that programmatically tests 300 PDF documents against structural and content compliance rules — achieving 100% true positive and true negative rates across a controlled ground-truth corpus — with per-file failure diagnosis and a colour-coded Excel compliance report.

Type
Test Automation / QA Engineering
Domain
Document Compliance
Methods
PDF parsing, regex validation, structural analysis, ground-truth testing
Test Corpus
300 documents — 210 known-good, 90 known-defective

The Challenge

Automated document generation pipelines produce outputs at scale — but without systematic validation, defects accumulate silently. A PDF generator producing thousands of letters, certificates, or regulated notices may drift from specification without triggering any visible error: spacing breaks, required fields go missing, formatting rules are violated.

The question is not whether the documents were generated, but whether they are compliant. Answering that reliably requires a test harness that defines compliance precisely, executes deterministically, and distinguishes true defects from false alarms — the same standard applied to any quality-critical system.

This project built exactly that: a two-layer automated framework capable of parsing generated PDFs, asserting compliance against a defined specification, and reporting results at both aggregate and per-file granularity.

Why this matters for regulated environments: In financial services, insurance, and legal contexts, document compliance is not a cosmetic concern — it is an audit and regulatory obligation. Manual spot-checking at volume is statistically insufficient. Automated harnesses with quantified false-positive and false-negative rates are the only defensible approach.

Approach

01
Ground-Truth Corpus Design
Constructed a controlled test corpus of 300 synthetic PDF letters: 210 known-good (conforming to all specification rules) and 90 known-defective (containing deliberate, categorised violations). This separation is essential — without a ground truth, test accuracy cannot be measured, only assumed.
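The labelling step can be kept as simple as a directory convention. A minimal sketch, assuming a hypothetical layout with `good/` and `defective/` subfolders (the real corpus may encode its labels differently):

```python
from pathlib import Path

def load_ground_truth(corpus_dir: str) -> dict:
    """Map each PDF path to its expected verdict (True = compliant).

    Assumes a hypothetical 'good/' vs 'defective/' folder layout;
    this is an illustration, not the project's actual corpus format.
    """
    root = Path(corpus_dir)
    labels = {str(p): True for p in sorted((root / "good").glob("*.pdf"))}
    labels.update({str(p): False for p in sorted((root / "defective").glob("*.pdf"))})
    return labels
```

Keeping the label source separate from the validator is what makes accuracy measurable: the harness never sees the labels, so a PASS/FAIL verdict can be scored against them afterwards.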
02
Layer 1 — Structural Validation (Y-Gap)
Implemented paragraph spacing analysis using PyMuPDF (fitz), extracting the vertical positions of text blocks and asserting that inter-paragraph gaps conform to the defined layout specification. Each document is evaluated against a configurable y-gap threshold, with the exact failure reason logged when the rule is violated.
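The y-gap rule reduces to a pure check over the vertical extents of consecutive text blocks, plus a thin PyMuPDF extraction wrapper. A minimal sketch: the threshold values and function names are illustrative, while the extraction uses PyMuPDF's standard `get_text("blocks")` call:

```python
def check_y_gaps(blocks, min_gap, max_gap):
    """Validate inter-paragraph spacing.

    `blocks` is a list of (y0, y1) vertical extents for text blocks,
    sorted top-to-bottom. Returns (passed, reasons). The gap bounds
    are illustrative stand-ins for the configurable spec thresholds.
    """
    reasons = []
    for i in range(1, len(blocks)):
        # gap = top of this block minus bottom of the previous block
        gap = blocks[i][0] - blocks[i - 1][1]
        if not (min_gap <= gap <= max_gap):
            reasons.append(f"block {i}: y-gap {gap:.1f}pt outside [{min_gap}, {max_gap}]")
    return (not reasons, reasons)

def extract_block_extents(pdf_path):
    """Pull (y0, y1) extents of text blocks from page 1 via PyMuPDF."""
    import fitz  # PyMuPDF; lazy import keeps the pure check testable without it
    with fitz.open(pdf_path) as doc:
        # blocks are (x0, y0, x1, y1, text, block_no, block_type); type 0 = text
        blocks = doc[0].get_text("blocks")
        return sorted((b[1], b[3]) for b in blocks if b[6] == 0)
```

Separating the geometric rule from the extraction makes the rule unit-testable with synthetic coordinates, independent of any PDF fixture.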
03
Layer 2 — Content Integrity Validation
Applied a sequential regex-based assertion engine that validates required content fields in strict top-to-bottom order: salutation format, date, property address, order reference, phone, website, bold formatting markers, and two weather-line phrases. Presence, format, and ordering are all checked — partial compliance is not accepted.
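The sequential assertion engine can be sketched as an ordered rule list in which each match must begin at or after the end of the previous match. The patterns below are deliberately simplified stand-ins for the real specification rules:

```python
import re

# Ordered (field, pattern) rules -- illustrative, not the production spec.
RULES = [
    ("salutation", r"Dear\s+\w+"),
    ("date", r"\d{2}/\d{2}/\d{4}"),
    ("order_ref", r"Order\s+#\d{6}"),
]

def validate_content(text: str, rules=RULES):
    """Assert each required field is present, well-formed, and in order.

    The search for each rule starts where the previous match ended, so
    a field that appears out of sequence fails even if present somewhere.
    """
    pos, reasons = 0, []
    for name, pattern in rules:
        m = re.compile(pattern).search(text, pos)
        if m is None:
            reasons.append(f"{name}: missing or out of order after offset {pos}")
        else:
            pos = m.end()
    return (not reasons, reasons)
```

Anchoring each search at the previous match's end is what enforces the strict top-to-bottom ordering: presence alone is never enough to pass.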
04
Reporting and Diagnosis
Generated a per-file validation report with individual PASS/FAIL verdicts, specific failure reasons, and a cross-tabulation of structural vs content failures. Final output: a colour-coded Excel workbook (green = PASS, red = FAIL) combining both validation layers, plus an executive summary with aggregate metrics and a breakdown of the most frequent failure modes.
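Merging the two layers and emitting the colour-coded workbook might look like the sketch below. The merge logic and function names are illustrative; the styling uses openpyxl's standard `PatternFill`:

```python
def combine_verdicts(structural, content):
    """Merge both layers into per-file rows: PASS only if both layers pass.

    Each input maps a file path to (passed, reasons). Missing results are
    treated as failures so no file silently escapes validation.
    """
    rows = []
    for path in sorted(set(structural) | set(content)):
        s_ok, s_reasons = structural.get(path, (False, ["no structural result"]))
        c_ok, c_reasons = content.get(path, (False, ["no content result"]))
        verdict = "PASS" if (s_ok and c_ok) else "FAIL"
        rows.append((path, verdict, "; ".join(s_reasons + c_reasons)))
    return rows

def write_report(rows, xlsx_path):
    """Write a colour-coded workbook: green fill for PASS, red for FAIL."""
    from openpyxl import Workbook  # lazy import; fills use standard openpyxl colours
    from openpyxl.styles import PatternFill
    fills = {"PASS": PatternFill("solid", fgColor="C6EFCE"),
             "FAIL": PatternFill("solid", fgColor="FFC7CE")}
    wb = Workbook()
    ws = wb.active
    ws.append(["file", "verdict", "reasons"])
    for path, verdict, reasons in rows:
        ws.append([path, verdict, reasons])
        ws.cell(row=ws.max_row, column=2).fill = fills[verdict]
    wb.save(xlsx_path)
```

Treating a missing layer result as a failure is a deliberate fail-closed choice: in a compliance context, "not checked" must never read as "compliant".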
TEST HARNESS — FINAL ACCURACY REPORT
300 DOCS
GROUND TRUTH      Known-good: 210    Known-defective: 90    Total: 300
TRUE POSITIVES    210 / 210   (good → PASS)
TRUE NEGATIVES     90 / 90    (bad → FAIL)
FALSE POSITIVES     0         (bad → PASS)
FALSE NEGATIVES     0         (good → FAIL)
OVERALL           100% TRUE POSITIVE RATE   100% TRUE NEGATIVE RATE
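Scoring the harness against the ground truth is a direct confusion-matrix count, using the report's convention that "positive" means compliant (good → PASS). A minimal sketch with illustrative names:

```python
def score_harness(labels, verdicts):
    """Count TP/TN/FP/FN given ground truth and harness verdicts.

    labels: path -> True if the document is known-good.
    verdicts: path -> "PASS" or "FAIL".
    Convention (matching the report): TP = good->PASS, TN = bad->FAIL,
    FP = bad->PASS, FN = good->FAIL.
    """
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for path, good in labels.items():
        passed = verdicts[path] == "PASS"
        if good and passed:
            counts["TP"] += 1
        elif not good and not passed:
            counts["TN"] += 1
        elif not good and passed:
            counts["FP"] += 1
        else:
            counts["FN"] += 1
    return counts
```

The key property is that the scorer only ever compares labels to verdicts; it shares no code with the validators, so it cannot inherit their blind spots.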

Results

100%
True positive rate — all 210 compliant documents correctly passed
100%
True negative rate — all 90 defective documents correctly failed
0
False positives and false negatives — zero misclassifications across 300 documents

The harness achieved perfect classification across all 300 documents — 210 true positives and 90 true negatives, with zero false positives and zero false negatives. The two validation layers were complementary: structural validation (y-gap) flagged 58 failures, content integrity flagged 90, with the combined verdict correctly identifying all defective documents.

Failure diagnosis identified the most common root causes, enabling the document generator to be targeted for fixes rather than requiring broad rework. Weather-line content was the highest-frequency failure mode (74 instances), followed by salutation format errors (58) — findings that directly informed prioritised corrections upstream.

Failure Mode                                   Count
Weather line 1 — missing required phrase          74
Weather line 2 — missing required phrase          74
Salutation — incorrect format or missing          58
Order number — absent or malformed                36
Date — absent or malformed                        36
Property line — absent or malformed               36
Phone — absent or incorrect                       20
Website — absent or incorrect                     20
Bold formatting — expected markers missing        16

Relevance to Production Contexts

This harness demonstrates capabilities that translate directly to production quality engineering: structured test design with explicit ground truth, multi-layer validation with independent failure modes, per-document failure attribution rather than aggregate-only reporting, and machine-readable output (Excel) for downstream audit processes.

The same architecture — parse, assert, report — scales to any document type where compliance rules can be formalised: regulated financial notices, insurance certificates, legal contracts, or automated correspondence systems. The pattern is domain-agnostic; only the assertion rules change.

For organisations running AI-assisted document generation pipelines, this kind of post-generation compliance testing is the QA layer that makes the pipeline auditable — not just functional.

Technology Stack

Python · PyMuPDF (fitz) · pandas · openpyxl · regex · logging · pathlib