Pharma PDF → Word Extraction & Audit Traceability Platform

End-to-end Python system that extracts structured data from regulatory PDFs (CSRs, CoAs, stability reports) and populates Word templates with inline source citations. Every value is traceable, auditable, and 21 CFR Part 11 aligned.

Regulatory AI · Document Intelligence · 21 CFR Part 11 · Available Now

Overview

An end-to-end Python platform that extracts structured data from regulatory PDFs — CSRs, AE summaries, CoAs, stability reports, WHO monographs — and populates Word templates with inline source citations. Every extracted value is tagged with its source filename, page number, and section header. The entire data journey is reproducible, auditable, and GMP-ready.

10+ Extraction Engines
530+ Fields from One PDF
100% Source Citations
21 CFR Part 11 Aligned

System Architecture — End-to-End Pipeline

Each stage produces verifiable artefacts (provenance records, audit JSONL, populated .docx) so the entire data journey is reproducible and inspectable.

flowchart LR
    A["[1] Ingest\nSHA-256 hash\nSession ID\nPage type detect"] --> B["[2] Multi-Engine Extract\n10 engines in parallel\npdfplumber · PyMuPDF\nCamelot · Tesseract · Claude AI"]
    B --> C["[3] Provenance Record\nkey · raw_value · page\nsection · bbox · confidence\nengine · source_doc_id"]
    C --> D["[4] Field Mapping\n90+ regex rules\nClaude AI Auto-Map\nplaceholder match"]
    D --> E["[5] Word Population\npython-docx\nInline citations\nformat preserved"]
    E --> F["[6] Audit & Package\nSHA-256 hash-chain\nJSONL audit log\nZIP artefacts"]
    style B fill:#0E7490,color:#fff,stroke:#0E7490
    style C fill:#065f46,color:#fff,stroke:#065f46
    style F fill:#7c3aed,color:#fff,stroke:#7c3aed

Stage Details

| Stage | What It Does |
| --- | --- |
| [1] Ingest | SHA-256 hashes each PDF. Detects native-text vs. scanned pages via pdfplumber. Assigns session ID and source_doc_id. Records file metadata. |
| [2] Multi-Engine Extract | Runs up to 10 engines in parallel: pdfplumber (KV + tables), PyMuPDF (metadata), Camelot/Tabula (bordered tables), Tesseract OCR (scanned), AWS Textract, Azure Form Recognizer, AcroForm parser, Claude 3.5 Sonnet, GPT-4o. |
| [3] Provenance Record | Every extracted value becomes a structured record with key, raw_value, page_number, section_header, bounding_box, confidence_score, extraction_engine, and source_doc_id. |
| [4] Field Mapping | 90+ regex rules plus Claude AI Auto-Map match provenance records to placeholder fields in Word templates. Every match and miss is logged. |
| [5] Word Population | python-docx replaces each placeholder with "value [Src: filename.pdf, p.N, Section: Header]". Preserves template formatting, fonts, and table structure. |
| [6] Audit & Package | SHA-256 hash-chained JSONL audit log (one entry per mapped field). 7-section human-readable audit report PDF. ZIP package of all artefacts. |
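
Stage [4] can be illustrated with a minimal sketch of regex-based field mapping that records both matches and misses. The rule patterns, the template field names (BATCH_NUMBER and so on), and the function names are hypothetical and only show the shape of the idea; the platform's actual 90+ rules are not reproduced here.

```python
from __future__ import annotations

import re
from typing import Optional

# Hypothetical rules: a regex over the extracted key -> a template field name.
MAPPING_RULES = [
    (re.compile(r"^batch\s*(no\.?|number)$", re.IGNORECASE), "BATCH_NUMBER"),
    (re.compile(r"^assay\b", re.IGNORECASE), "ASSAY_RESULT"),
    (re.compile(r"^total\s+impurit(y|ies)$", re.IGNORECASE), "TOTAL_IMPURITIES"),
]


def map_key(key: str) -> Optional[str]:
    """Return the template field for an extracted key, or None if no rule matches."""
    cleaned = key.strip()
    for pattern, template_field in MAPPING_RULES:
        if pattern.search(cleaned):
            return template_field
    return None


def map_records(keys: list[str]) -> tuple[dict[str, str], list[str]]:
    """Map every key and keep the misses, so both outcomes can be logged."""
    matched: dict[str, str] = {}
    missed: list[str] = []
    for key in keys:
        template_field = map_key(key)
        if template_field is None:
            missed.append(key)
        else:
            matched[key] = template_field
    return matched, missed
```

Returning the misses alongside the matches is what makes "every match and miss is logged" possible: unmatched keys can be written to the audit log or passed to an LLM-based mapper instead of being silently dropped.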

Provenance Record Schema

Every single extracted value is a structured, traceable record:

ProvenanceRecord {
  field_id         : UUID    # unique per extraction
  key              : str     # field name (e.g. 'Total Impurities')
  raw_value        : str     # extracted text exactly as in PDF
  normalized_value : str | None  # cleaned / unit-converted value
  page_number      : int     # 1-based PDF page
  section_header   : str     # nearest heading above the value
  bounding_box     : {x0,y0,x1,y1}  # PDF-point coords for highlighting
  confidence_score : float   # 0.0 – 1.0
  extraction_engine: str     # pdfplumber | camelot | claude | ...
  source_doc_id    : str     # SHA-256 of source PDF
  record_type      : str     # kv | table | acroform | llm | ocr
}
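
One way to express the schema above in Python is a frozen dataclass. This is a sketch rather than the platform's own class: the field names follow the schema, while the BoundingBox helper and the validation in `__post_init__` are assumptions added for illustration.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class BoundingBox:
    """PDF-point coordinates of the extracted value (hypothetical helper type)."""
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass(frozen=True)
class ProvenanceRecord:
    key: str                      # field name, e.g. 'Total Impurities'
    raw_value: str                # text exactly as it appears in the PDF
    page_number: int              # 1-based PDF page
    section_header: str           # nearest heading above the value
    bounding_box: BoundingBox
    confidence_score: float       # 0.0 to 1.0
    extraction_engine: str        # pdfplumber | camelot | claude | ...
    source_doc_id: str            # SHA-256 of the source PDF
    record_type: str              # kv | table | acroform | llm | ocr
    normalized_value: Optional[str] = None
    field_id: uuid.UUID = field(default_factory=uuid.uuid4)  # unique per extraction

    def __post_init__(self) -> None:
        # Assumed validation: reject values the schema comments rule out.
        if self.page_number < 1:
            raise ValueError("page_number is 1-based")
        if not 0.0 <= self.confidence_score <= 1.0:
            raise ValueError("confidence_score must be between 0.0 and 1.0")
```

Freezing the dataclass means a record cannot be modified after extraction, which suits an audit setting where downstream stages should only read provenance data.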

Extraction Quality — Citation Examples

Certificate of Analysis

| Field | Extracted Value | Word Template Output |
| --- | --- | --- |
| Assay (HPLC) | 98.7 % | 98.7 % [Src: CoA_Batch2024.pdf, p.3, Sec: 2.1 Assay Results] |
| Total Impurities | 0.42 % | 0.42 % [Src: CoA_Batch2024.pdf, p.3, Sec: 2.2 Impurity Profile] |
| Water Content (KF) | 0.18 % | 0.18 % [Src: CoA_Batch2024.pdf, p.4, Sec: 2.3 Physical Tests] |
| Microbial Limit | < 100 CFU/g | < 100 CFU/g [Src: CoA_Batch2024.pdf, p.4, Sec: 2.4 Microbiological] |
| Batch Number | AXV-2024-0042 | AXV-2024-0042 [Src: CoA_Batch2024.pdf, p.1, Sec: 1.0 Identification] |

Stability Report

| Field | Extracted Value | Word Template Output |
| --- | --- | --- |
| Storage Condition | 25°C / 60% RH | 25°C / 60% RH [Src: Stability_AXV101.pdf, p.2, Sec: 3.1 ICH Conditions] |
| T=12 months Assay | 97.9 % | 97.9 % [Src: Stability_AXV101.pdf, p.5, Sec: 4.2 Assay Data] |
| Degradation Product A | 0.09 % | 0.09 % [Src: Stability_AXV101.pdf, p.5, Sec: 4.3 Degradants] |
| Retest Date | 2026-08 | 2026-08 [Src: Stability_AXV101.pdf, p.2, Sec: 2.1 Shelf Life] |
| Conclusion | Meets ICH Q1A criteria | Meets ICH Q1A criteria [Src: Stability_AXV101.pdf, p.9, Sec: 6.0 Conclusion] |
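
The citation strings in the tables above follow a simple pattern that can be sketched as a small formatter. The function name and signature are hypothetical; the output format is copied from the examples, with the section part left out when no heading is available.

```python
from typing import Optional


def format_citation(value: str, filename: str, page: int,
                    section: Optional[str] = None) -> str:
    """Append an inline source citation to an extracted value."""
    parts = [f"Src: {filename}", f"p.{page}"]
    if section:
        parts.append(f"Sec: {section}")
    return f"{value} [{', '.join(parts)}]"


print(format_citation("98.7 %", "CoA_Batch2024.pdf", 3, "2.1 Assay Results"))
# 98.7 % [Src: CoA_Batch2024.pdf, p.3, Sec: 2.1 Assay Results]
```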

Before & After — Word Template Population

BEFORE — Word template placeholders
Batch Number:      
Assay Result:      
Total Impurities:  
Retest Date:       
Approved By:       
AFTER — Populated with inline citations
Batch Number:     AXV-2024-0042 [Src: CoA_Batch2024.pdf, p.1]
Assay Result:     98.7 % [Src: CoA_Batch2024.pdf, p.3, Sec: 2.1]
Total Impurities: 0.42 % [Src: CoA_Batch2024.pdf, p.3, Sec: 2.2]
Retest Date:      2026-08 [Src: Stability_AXV101.pdf, p.2]
Approved By:      Dr. J. Smith [Src: CoA_Batch2024.pdf, p.1]

Technology Stack

| Layer | Component | Role |
| --- | --- | --- |
| Extraction Layer | pdfplumber | Native text, table KV, two-column layout detection |
| Extraction Layer | PyMuPDF | PDF metadata, fast page rendering, info-dict |
| Extraction Layer | Camelot / Tabula | Bordered & stream table extraction (Tabula is Java-backed) |
| Extraction Layer | Tesseract OCR | Local OCR for scanned / image-only pages |
| Extraction Layer | AWS Textract | Cloud OCR with table/form structure (optional) |
| Extraction Layer | Azure Form Recognizer | Pre-built pharma/invoice models (optional) |
| Extraction Layer | AcroForm parser | Interactive PDF form fields, checkboxes, dropdowns |
| LLM Layer | Claude 3.5 Sonnet | Schema-driven field extraction, Q&A, Auto-Map |
| LLM Layer | GPT-4o | Alternative LLM for extraction comparison |
| Document Generation | python-docx | Template population, inline citation insertion |
| Document Generation | fpdf2 | Audit report PDF generation |
| API / Compliance | FastAPI + Uvicorn | REST API, auth, rate-limiting, CORS |
| API / Compliance | SHA-256 hash chain | Tamper-evident JSONL audit log |
| API / Compliance | 21 CFR Part 11 | Electronic record / e-signature alignment |
| API / Compliance | React + TypeScript | Dashboard: upload, extract, review, map, Q&A |

What This Delivers

Python extraction module

Configurable multi-engine PDF parser tuned to CoAs, analytical reports, stability data, and NDA sections. Returns structured provenance records with page + section citations.

Word template populator

python-docx engine maps extracted records to fields and writes inline citations [Src: filename, p.N, Section: X] for every value. Preserves formatting.
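
The substitution step itself can be sketched on plain strings, independent of python-docx. The `{{FIELD_NAME}}` placeholder syntax and the function name are assumptions made for this example; the template's real placeholder syntax is not shown on this page. In a python-docx pipeline, a function like this would presumably be applied to the text of each run or paragraph.

```python
from __future__ import annotations

import re

# Assumed placeholder syntax: {{FIELD_NAME}} (hypothetical, for illustration only).
PLACEHOLDER = re.compile(r"\{\{\s*([A-Z0-9_]+)\s*\}\}")


def fill_placeholders(text: str, values: dict[str, str]) -> tuple[str, list[str]]:
    """Replace known placeholders and report the ones that had no value."""
    missing: list[str] = []

    def _substitute(match: re.Match) -> str:
        name = match.group(1)
        if name in values:
            return values[name]
        missing.append(name)
        return match.group(0)  # leave unknown placeholders untouched

    return PLACEHOLDER.sub(_substitute, text), missing
```

Leaving unmatched placeholders in place, rather than blanking them, keeps gaps visible to a human reviewer and gives the audit log a list of fields that were never populated.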

Source-reference system

Every output value carries source filename, 1-based page number, section header (nearest heading), extraction engine, and bounding-box coordinates for PDF highlight/annotation.

Audit trail

SHA-256 hash-chained JSONL log + 7-section human-readable audit report PDF. Suitable for GMP environments and FDA submission packages.
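
A hash-chained JSONL log can be sketched in a few lines: each entry stores the hash of the previous entry, so editing any earlier line breaks verification of everything after it. The entry layout (prev_hash, payload, entry_hash), the all-zero genesis value, and the function names are assumptions for this example, not the platform's actual log format.

```python
from __future__ import annotations

import hashlib
import json

GENESIS = "0" * 64  # assumed value of prev_hash for the first entry


def _digest(prev_hash: str, payload: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so the hash is reproducible.
    body = json.dumps({"prev_hash": prev_hash, "payload": payload},
                      sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_entry(log_lines: list[str], payload: dict) -> str:
    """Append one JSONL line chained to the previous entry; return its hash."""
    prev_hash = json.loads(log_lines[-1])["entry_hash"] if log_lines else GENESIS
    entry_hash = _digest(prev_hash, payload)
    log_lines.append(json.dumps(
        {"prev_hash": prev_hash, "payload": payload, "entry_hash": entry_hash},
        sort_keys=True))
    return entry_hash


def verify_chain(log_lines: list[str]) -> bool:
    """Recompute every hash and check each entry links to the one before it."""
    prev_hash = GENESIS
    for line in log_lines:
        entry = json.loads(line)
        if entry["prev_hash"] != prev_hash:
            return False
        if _digest(entry["prev_hash"], entry["payload"]) != entry["entry_hash"]:
            return False
        prev_hash = entry["entry_hash"]
    return True
```

The list of strings stands in for the lines of a .jsonl file. A chain like this is tamper-evident, not tamper-proof: someone able to rewrite the whole file could rebuild it, so the latest hash would normally also be stored or signed somewhere outside the log.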

REST API (optional)

FastAPI service exposing upload, extract, review, and export endpoints. Includes a React dashboard for human review and confidence-based QA.

Tests + documentation

pytest suite, docstrings, and a README covering setup, configuration, and how to add new document types or template placeholders.


Why This Approach

  • Already built — not a prototype. A deployed FastAPI + React platform with 10 extraction engines, tested on real regulatory documents.
  • Pharma domain knowledge — understands ICH E3, 21 CFR Part 11, CoA structure, analytical report sections, and GMP audit requirements.
  • Full citation chain — every value carries filename, page number, and section header out of the box.
  • Handles edge cases — scanned PDFs (OCR), AcroForms, two-column layouts, bordered tables, and hybrid documents.
  • LLM-augmented — Claude and GPT-4o fill gaps where regex/heuristic extraction fails, with confidence scoring on every result.
  • Clean, tested Python — type-annotated, documented, and structured for handoff.
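
As an illustration of how per-result confidence scores might be used when several engines return the same field, the sketch below keeps the highest-confidence candidate per key and flags low-confidence winners for human review. The selection rule, the 0.8 threshold, and the function name are assumptions for this example; the platform's actual merge logic is not described on this page.

```python
from __future__ import annotations


def select_best(candidates: list[dict],
                review_threshold: float = 0.8) -> tuple[dict, dict]:
    """Pick the highest-confidence candidate per key, split by review threshold."""
    best: dict[str, dict] = {}
    for cand in candidates:
        current = best.get(cand["key"])
        if current is None or cand["confidence_score"] > current["confidence_score"]:
            best[cand["key"]] = cand

    accepted = {k: v for k, v in best.items()
                if v["confidence_score"] >= review_threshold}
    needs_review = {k: v for k, v in best.items()
                    if v["confidence_score"] < review_threshold}
    return accepted, needs_review
```

Each candidate is a plain dict using the schema's field names (key, raw_value, confidence_score, extraction_engine). Routing low-confidence values to a review queue rather than discarding them fits the human-review step in the dashboard.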

Get In Touch

Interested in this system for your regulatory document workflow, or have a similar problem?

Email me · LinkedIn