Pharma PDF → Word Extraction & Audit Traceability Platform
End-to-end Python system that extracts structured data from regulatory PDFs (CSRs, CoAs, stability reports) and populates Word templates with inline source citations. Every value is traceable, auditable, and 21 CFR Part 11 aligned.
Overview
The platform extracts structured data from regulatory PDFs (CSRs, AE summaries, CoAs, stability reports, WHO monographs) and populates Word templates with inline source citations. Every extracted value is tagged with its source filename, page number, and section header, so the entire data journey is reproducible, auditable, and suited to GMP environments.
System Architecture — End-to-End Pipeline
Each stage produces verifiable artefacts (provenance records, audit JSONL, populated .docx) so the entire data journey is reproducible and inspectable.
flowchart LR
A["[1] Ingest\nSHA-256 hash\nSession ID\nPage type detect"] --> B["[2] Multi-Engine Extract\n10 engines in parallel\npdfplumber · PyMuPDF\nCamelot · Tesseract · Claude AI"]
B --> C["[3] Provenance Record\nkey · raw_value · page\nsection · bbox · confidence\nengine · source_doc_id"]
C --> D["[4] Field Mapping\n90+ regex rules\nClaude AI Auto-Map\nplaceholder match"]
D --> E["[5] Word Population\npython-docx\nInline citations\nformat preserved"]
E --> F["[6] Audit & Package\nSHA-256 hash-chain\nJSONL audit log\nZIP artefacts"]
style B fill:#0E7490,color:#fff,stroke:#0E7490
style C fill:#065f46,color:#fff,stroke:#065f46
style F fill:#7c3aed,color:#fff,stroke:#7c3aed
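As a rough illustration of the ingest stage, the sketch below hashes a PDF with SHA-256 and assigns a session ID. The names `IngestResult`, `sha256_of_file`, and `ingest` are hypothetical and not taken from the platform's code. Using the digest as `source_doc_id` follows the provenance schema in this document. Page-type detection is left out.

```python
import hashlib
import uuid
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class IngestResult:
    """Hypothetical container for what the ingest stage records."""
    session_id: str      # assumed: one UUID per ingest session
    source_doc_id: str   # SHA-256 hex digest of the PDF bytes
    filename: str
    size_bytes: int


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large PDFs are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ingest(path: Path, session_id: Optional[str] = None) -> IngestResult:
    """Record identity and basic file metadata for one source PDF."""
    return IngestResult(
        session_id=session_id or str(uuid.uuid4()),
        source_doc_id=sha256_of_file(path),
        filename=path.name,
        size_bytes=path.stat().st_size,
    )
```

Because the identifier is derived from the file bytes, re-ingesting an unchanged PDF yields the same `source_doc_id`, which is what makes later citations checkable against the original file.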
Stage Details
| Stage | What It Does |
|---|---|
| [1] Ingest | SHA-256 hashes each PDF. Detects native-text vs. scanned pages via pdfplumber. Assigns session ID and source_doc_id. Records file metadata. |
| [2] Multi-Engine Extract | Runs up to 10 engines in parallel: pdfplumber (KV + tables), PyMuPDF (metadata), Camelot/Tabula (bordered tables), Tesseract OCR (scanned), AWS Textract, Azure Form Recognizer, AcroForm parser, Claude 3.5 Sonnet, GPT-4o. |
| [3] Provenance Record | Every extracted value becomes a structured record with key, raw_value, page_number, section_header, bounding_box, confidence_score, extraction_engine, and source_doc_id. |
| [4] Field Mapping | 90+ regex rules plus Claude AI Auto-Map match provenance records to placeholder fields in Word templates. Every match and miss is logged. |
| [5] Word Population | python-docx replaces each placeholder with "value [Src: filename.pdf, p.N, Sec: Header]". Preserves template formatting, fonts, and table structure. |
| [6] Audit & Package | SHA-256 hash-chained JSONL audit log (one entry per mapped field). 7-section human-readable audit report PDF. ZIP package of all artefacts. |
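The hash-chained JSONL audit log in stage [6] can be sketched as follows. This is a minimal illustration, not the platform's implementation. The entry layout (`prev_hash`, `payload`, `entry_hash`), the all-zero genesis hash, and the function names are assumptions made for the example.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed prev_hash for the first entry


def _entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous hash together with a canonical JSON form of the payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + canonical).encode("utf-8")).hexdigest()


def append_entry(log_lines: list, payload: dict) -> str:
    """Append one JSONL line whose hash covers the line before it."""
    prev_hash = json.loads(log_lines[-1])["entry_hash"] if log_lines else GENESIS_HASH
    entry = {
        "prev_hash": prev_hash,
        "payload": payload,
        "entry_hash": _entry_hash(prev_hash, payload),
    }
    line = json.dumps(entry, sort_keys=True)
    log_lines.append(line)
    return line


def verify_chain(log_lines: list) -> bool:
    """Recompute every hash in order; any edited or reordered line breaks the chain."""
    prev_hash = GENESIS_HASH
    for line in log_lines:
        entry = json.loads(line)
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["entry_hash"] != _entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["entry_hash"]
    return True
```

Each line commits to the one before it, so changing a logged value after the fact requires recomputing every later hash. A chain like this is tamper-evident rather than tamper-proof: it detects edits only if the latest hash is also stored somewhere the editor cannot reach.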
Provenance Record Schema
Every single extracted value is a structured, traceable record:
ProvenanceRecord {
field_id : UUID # unique per extraction
key : str # field name (e.g. 'Total Impurities')
raw_value : str # extracted text exactly as in PDF
normalized_value : str | None # cleaned / unit-converted value
page_number : int # 1-based PDF page
section_header : str # nearest heading above the value
bounding_box : {x0,y0,x1,y1} # PDF-point coords for highlighting
confidence_score : float # 0.0 – 1.0
extraction_engine: str # pdfplumber | camelot | claude | ...
source_doc_id : str # SHA-256 of source PDF
record_type : str # kv | table | acroform | llm | ocr
}
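One way to express this schema in Python is a frozen dataclass. The sketch below mirrors the field names and comments above. The `BoundingBox` helper, the validation rules in `__post_init__`, and the default UUID factory are assumptions added for illustration.

```python
import uuid
from dataclasses import asdict, dataclass, field
from typing import Optional

# Record types listed in the schema above.
RECORD_TYPES = {"kv", "table", "acroform", "llm", "ocr"}


@dataclass(frozen=True)
class BoundingBox:
    """PDF-point coordinates of the extracted value (hypothetical helper type)."""
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass(frozen=True)
class ProvenanceRecord:
    key: str
    raw_value: str
    page_number: int            # 1-based PDF page
    section_header: str
    bounding_box: BoundingBox
    confidence_score: float     # 0.0 to 1.0
    extraction_engine: str
    source_doc_id: str          # SHA-256 of the source PDF
    record_type: str
    normalized_value: Optional[str] = None
    field_id: uuid.UUID = field(default_factory=uuid.uuid4)

    def __post_init__(self) -> None:
        # Assumed validation: reject values the schema comments rule out.
        if self.page_number < 1:
            raise ValueError("page_number is 1-based")
        if not 0.0 <= self.confidence_score <= 1.0:
            raise ValueError("confidence_score must be within [0.0, 1.0]")
        if self.record_type not in RECORD_TYPES:
            raise ValueError(f"unknown record_type: {self.record_type!r}")
```

Freezing the dataclass means a record cannot be modified after extraction, and `asdict(record)` gives a plain dictionary suitable for serialising to an audit log.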
Extraction Quality — Citation Examples
Certificate of Analysis
| Field | Extracted Value | Word Template Output |
|---|---|---|
| Assay (HPLC) | 98.7 % | 98.7 % [Src: CoA_Batch2024.pdf, p.3, Sec: 2.1 Assay Results] |
| Total Impurities | 0.42 % | 0.42 % [Src: CoA_Batch2024.pdf, p.3, Sec: 2.2 Impurity Profile] |
| Water Content (KF) | 0.18 % | 0.18 % [Src: CoA_Batch2024.pdf, p.4, Sec: 2.3 Physical Tests] |
| Microbial Limit | < 100 CFU/g | < 100 CFU/g [Src: CoA_Batch2024.pdf, p.4, Sec: 2.4 Microbiological] |
| Batch Number | AXV-2024-0042 | AXV-2024-0042 [Src: CoA_Batch2024.pdf, p.1, Sec: 1.0 Identification] |
Stability Report
| Field | Extracted Value | Word Template Output |
|---|---|---|
| Storage Condition | 25°C / 60% RH | 25°C / 60% RH [Src: Stability_AXV101.pdf, p.2, Sec: 3.1 ICH Conditions] |
| T=12 months Assay | 97.9 % | 97.9 % [Src: Stability_AXV101.pdf, p.5, Sec: 4.2 Assay Data] |
| Degradation Product A | 0.09 % | 0.09 % [Src: Stability_AXV101.pdf, p.5, Sec: 4.3 Degradants] |
| Retest Date | 2026-08 | 2026-08 [Src: Stability_AXV101.pdf, p.2, Sec: 2.1 Shelf Life] |
| Conclusion | Meets ICH Q1A criteria | Meets ICH Q1A criteria [Src: Stability_AXV101.pdf, p.9, Sec: 6.0 Conclusion] |
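The citation strings in the tables above follow a regular pattern, which a small formatter can reproduce. The function name `format_citation` is hypothetical. The section part is treated as optional here because some populated examples in this document cite only filename and page.

```python
from typing import Optional


def format_citation(
    value: str,
    filename: str,
    page_number: int,
    section_header: Optional[str] = None,
) -> str:
    """Build 'value [Src: file, p.N, Sec: header]' in the style of the tables above."""
    if page_number < 1:
        raise ValueError("page_number is 1-based")
    parts = ["Src: " + filename, "p." + str(page_number)]
    if section_header:
        parts.append("Sec: " + section_header)
    return value + " [" + ", ".join(parts) + "]"


print(format_citation("98.7 %", "CoA_Batch2024.pdf", 3, "2.1 Assay Results"))
# prints: 98.7 % [Src: CoA_Batch2024.pdf, p.3, Sec: 2.1 Assay Results]
```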
Before & After — Word Template Population
Before (template):
Batch Number:
Assay Result:
Total Impurities:
Retest Date:
Approved By:
After (populated):
Batch Number: AXV-2024-0042 [Src: CoA_Batch2024.pdf, p.1]
Assay Result: 98.7 % [Src: CoA_Batch2024.pdf, p.3, Sec: 2.1]
Total Impurities: 0.42 % [Src: CoA_Batch2024.pdf, p.3, Sec: 2.2]
Retest Date: 2026-08 [Src: Stability_AXV101.pdf, p.2]
Approved By: Dr. J. Smith [Src: CoA_Batch2024.pdf, p.1]
Technology Stack
Extraction Layer
| Component | Role |
|---|---|
| pdfplumber | Native text, table KV, two-column layout detection |
| PyMuPDF | PDF metadata, fast page rendering, info-dict |
| Camelot / Tabula | Bordered & stream table extraction (Java-backed) |
| Tesseract OCR | Local OCR for scanned / image-only pages |
| AWS Textract | Cloud OCR with table/form structure (optional) |
| Azure Form Recognizer | Pre-built pharma/invoice models (optional) |
| AcroForm parser | Interactive PDF form fields, checkboxes, dropdowns |
LLM Layer
| Component | Role |
|---|---|
| Claude 3.5 Sonnet | Schema-driven field extraction, Q&A, Auto-Map |
| GPT-4o | Alternative LLM for extraction comparison |
Document Generation
| Component | Role |
|---|---|
| python-docx | Template population, inline citation insertion |
| fpdf2 | Audit report PDF generation |
API / Compliance
| Component | Role |
|---|---|
| FastAPI + Uvicorn | REST API, auth, rate-limiting, CORS |
| SHA-256 hash chain | Tamper-evident JSONL audit log |
| 21 CFR Part 11 | Electronic record / e-signature alignment |
| React + TypeScript | Dashboard: upload, extract, review, map, Q&A |
What This Delivers
Python extraction module
Configurable multi-engine PDF parser tuned to CoAs, analytical reports, stability data, and NDA sections. Returns structured provenance records with page + section citations.
Word template populator
python-docx engine maps extracted records to template placeholder fields and writes inline citations [Src: filename, p.N, Sec: X] for every value. Preserves formatting.
Source-reference system
Every output value carries source filename, 1-based page number, section header (nearest heading), extraction engine, and bounding-box coordinates for PDF highlight/annotation.
Audit trail
SHA-256 hash-chained JSONL log + 7-section human-readable audit report PDF. Suitable for GMP environments and FDA submission packages.
REST API (optional)
FastAPI service exposing upload, extract, review, and export endpoints. Includes a React dashboard for human review and confidence-based QA.
Tests + documentation
pytest suite, docstrings, and a README covering setup, configuration, and how to add new document types or template placeholders.
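To make the template populator deliverable more concrete, here is a string-level sketch of placeholder substitution that also reports which placeholders were filled and which were missed. The placeholder syntax is not shown in this document, so the double-brace form `{{Field Name}}` is purely an assumption, as are the names `PLACEHOLDER` and `populate_text`.

```python
import re

# Assumed placeholder syntax: {{Field Name}}. The real template syntax may differ.
PLACEHOLDER = re.compile(r"\{\{\s*(.+?)\s*\}\}")


def populate_text(text: str, cited_values: dict):
    """Replace known placeholders with cited values.

    Returns (new_text, filled_names, missing_names) so that every match
    and every miss can be logged by the caller.
    """
    filled = []
    missing = []

    def _substitute(match):
        name = match.group(1)
        if name in cited_values:
            filled.append(name)
            return cited_values[name]
        missing.append(name)
        return match.group(0)  # leave unknown placeholders untouched

    return PLACEHOLDER.sub(_substitute, text), filled, missing
```

In a python-docx setting, a function like this would be applied to paragraph or table-cell text. Word often splits visible text across several runs, so a placeholder may not sit inside a single run, and preserving formatting then needs extra care that this sketch does not attempt.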
Why This Approach
- Already built — not a prototype. A deployed FastAPI + React platform with 10 extraction engines, tested on real regulatory documents.
- Pharma domain knowledge — understands ICH E3, 21 CFR Part 11, CoA structure, analytical report sections, and GMP audit requirements.
- Full citation chain — every value carries filename, page number, and section header out of the box.
- Handles edge cases — scanned PDFs (OCR), AcroForms, two-column layouts, bordered tables, and hybrid documents.
- LLM-augmented — Claude and GPT-4o fill gaps where regex/heuristic extraction fails, with confidence scoring on every result.
- Clean, tested Python — type-annotated, documented, and structured for handoff.
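The combination of several engines with confidence scoring can be pictured with the sketch below. It runs engine callables concurrently and keeps the highest-confidence candidate per field. This selection policy is only one possibility and is not confirmed as the platform's behaviour. `Candidate`, `run_engines`, and `best_per_key` are hypothetical names, and real engines would wrap libraries such as pdfplumber or an LLM client.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import NamedTuple


class Candidate(NamedTuple):
    key: str
    raw_value: str
    confidence_score: float
    extraction_engine: str


def run_engines(pdf_bytes: bytes, engines: dict) -> list:
    """Run each engine callable (bytes -> list of Candidate) in a thread pool."""
    candidates = []
    with ThreadPoolExecutor(max_workers=max(1, len(engines))) as pool:
        futures = {name: pool.submit(fn, pdf_bytes) for name, fn in engines.items()}
        for name, future in futures.items():
            try:
                candidates.extend(future.result())
            except Exception:
                # One failing engine should not abort the run. A real system
                # would record the failure in the audit log instead of skipping it.
                continue
    return candidates


def best_per_key(candidates: list, min_confidence: float = 0.0) -> dict:
    """Keep the highest-confidence candidate for each field key."""
    best = {}
    for cand in candidates:
        if cand.confidence_score < min_confidence:
            continue
        current = best.get(cand.key)
        if current is None or cand.confidence_score > current.confidence_score:
            best[cand.key] = cand
    return best
```

With this policy, an LLM result replaces a heuristic one only when it scores higher, and fields that no heuristic engine found are filled by whichever engine did find them.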
Get In Touch
Interested in this system for your regulatory document workflow, or have a similar problem?