Pharma PDF → Word Extraction & Audit Traceability Platform

Regulatory AI · Document Intelligence · 21 CFR Part 11 · Available Now

Overview

An end-to-end Python platform that extracts structured data from regulatory PDFs — CSRs, AE summaries, CoAs, stability reports, WHO monographs — and populates Word templates with inline source citations. Every extracted value is tagged with its source filename, page number, and section header. The entire data journey is reproducible, auditable, and GMP-ready.

10+ Extraction Engines

530+ Fields from One PDF

100% Source Citations

21 CFR Part 11 Aligned

System Architecture — End-to-End Pipeline

Each stage produces verifiable artefacts (provenance records, audit JSONL, populated .docx) so the entire data journey is reproducible and inspectable.

flowchart LR
    A["[1] Ingest\nSHA-256 hash\nSession ID\nPage type detect"] --> B["[2] Multi-Engine Extract\n10 engines in parallel\npdfplumber · PyMuPDF\nCamelot · Tesseract · Claude AI"]
    B --> C["[3] Provenance Record\nkey · raw_value · page\nsection · bbox · confidence\nengine · source_doc_id"]
    C --> D["[4] Field Mapping\n90+ regex rules\nClaude AI Auto-Map\nPlaceholder match"]
    D --> E["[5] Word Population\npython-docx\nInline citations\nformat preserved"]
    E --> F["[6] Audit & Package\nSHA-256 hash-chain\nJSONL audit log\nZIP artefacts"]
    style B fill:#0E7490,color:#fff,stroke:#0E7490
    style C fill:#065f46,color:#fff,stroke:#065f46
    style F fill:#7c3aed,color:#fff,stroke:#7c3aed

Stage Details

Stage	What It Does
[1] Ingest	SHA-256 hashes each PDF. Detects native-text vs. scanned pages via pdfplumber. Assigns session ID and `source_doc_id`. Records file metadata.
[2] Multi-Engine Extract	Runs up to 10 engines in parallel: pdfplumber (KV + tables), PyMuPDF (metadata), Camelot/Tabula (bordered tables), Tesseract OCR (scanned), AWS Textract, Azure Form Recognizer, AcroForm parser, Claude 3.5 Sonnet, GPT-4o.
[3] Provenance Record	Every extracted value becomes a structured record with `key`, `raw_value`, `page_number`, `section_header`, `bounding_box`, `confidence_score`, `extraction_engine`, and `source_doc_id`.
[4] Field Mapping	90+ regex rules plus Claude AI Auto-Map match provenance records to placeholder fields in Word templates. Every match and miss is logged.
[5] Word Population	python-docx replaces each placeholder with `"value [Src: filename.pdf, p.N, Sec: Header]"`. Preserves template formatting, fonts, and table structure.
[6] Audit & Package	SHA-256 hash-chained JSONL audit log (one entry per mapped field). 7-section human-readable audit report PDF. ZIP package of all artefacts.
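
As a rough illustration of stage [1], the sketch below hashes a PDF with SHA-256 and builds an ingest record. It assumes, following the schema in this document, that `source_doc_id` is the SHA-256 of the source file. The function names and the record layout are hypothetical, not the platform's actual code.

```python
import hashlib
import uuid
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large PDFs are never fully loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ingest(path: Path) -> dict:
    """Build a hypothetical ingest record for one PDF (layout is an assumption)."""
    return {
        "session_id": str(uuid.uuid4()),          # one random ID per ingest session
        "source_doc_id": sha256_of_file(path),    # assumed: SHA-256 of the source PDF
        "filename": path.name,
        "size_bytes": path.stat().st_size,
    }
```

Because the identifier is derived from the file bytes, re-ingesting an unchanged PDF yields the same `source_doc_id`, which is what makes later citations checkable against the original file.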

Provenance Record Schema

Every single extracted value is a structured, traceable record:

ProvenanceRecord {
  field_id         : UUID    # unique per extraction
  key              : str     # field name (e.g. 'Total Impurities')
  raw_value        : str     # extracted text exactly as in PDF
  normalized_value : str | None  # cleaned / unit-converted value
  page_number      : int     # 1-based PDF page
  section_header   : str     # nearest heading above the value
  bounding_box     : {x0,y0,x1,y1}  # PDF-point coords for highlighting
  confidence_score : float   # 0.0 – 1.0
  extraction_engine: str     # pdfplumber | camelot | claude | ...
  source_doc_id    : str     # SHA-256 of source PDF
  record_type      : str     # kv | table | acroform | llm | ocr
}
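
One possible way to express this schema in Python is a frozen dataclass. The field names come from the schema above; the `BoundingBox` helper class, the validation rules, and the default values are assumptions added for illustration.

```python
import uuid
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass(frozen=True)
class BoundingBox:
    """PDF-point coordinates of the extracted value (hypothetical helper type)."""
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass(frozen=True)
class ProvenanceRecord:
    key: str
    raw_value: str
    page_number: int
    section_header: str
    bounding_box: BoundingBox
    confidence_score: float
    extraction_engine: str
    source_doc_id: str
    record_type: str
    normalized_value: Optional[str] = None
    field_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def __post_init__(self) -> None:
        # Reject values that contradict the schema comments.
        if self.page_number < 1:
            raise ValueError("page_number is 1-based")
        if not 0.0 <= self.confidence_score <= 1.0:
            raise ValueError("confidence_score must be between 0.0 and 1.0")
```

Calling `asdict(record)` turns a record, including its nested bounding box, into a plain dictionary that can be written as one JSON line.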

Extraction Quality — Citation Examples

Certificate of Analysis

Field	Extracted Value	Word Template Output
Assay (HPLC)	98.7 %	`98.7 % [Src: CoA_Batch2024.pdf, p.3, Sec: 2.1 Assay Results]`
Total Impurities	0.42 %	`0.42 % [Src: CoA_Batch2024.pdf, p.3, Sec: 2.2 Impurity Profile]`
Water Content (KF)	0.18 %	`0.18 % [Src: CoA_Batch2024.pdf, p.4, Sec: 2.3 Physical Tests]`
Microbial Limit	< 100 CFU/g	`< 100 CFU/g [Src: CoA_Batch2024.pdf, p.4, Sec: 2.4 Microbiological]`
Batch Number	AXV-2024-0042	`AXV-2024-0042 [Src: CoA_Batch2024.pdf, p.1, Sec: 1.0 Identification]`

Stability Report

Field	Extracted Value	Word Template Output
Storage Condition	25°C / 60% RH	`25°C / 60% RH [Src: Stability_AXV101.pdf, p.2, Sec: 3.1 ICH Conditions]`
T=12 months Assay	97.9 %	`97.9 % [Src: Stability_AXV101.pdf, p.5, Sec: 4.2 Assay Data]`
Degradation Product A	0.09 %	`0.09 % [Src: Stability_AXV101.pdf, p.5, Sec: 4.3 Degradants]`
Retest Date	2026-08	`2026-08 [Src: Stability_AXV101.pdf, p.2, Sec: 2.1 Shelf Life]`
Conclusion	Meets ICH Q1A criteria	`Meets ICH Q1A criteria [Src: Stability_AXV101.pdf, p.9, Sec: 6.0 Conclusion]`
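
The citation strings in these tables follow a regular pattern, which a small formatter can reproduce. The function name and signature below are hypothetical; only the output format is taken from the examples.

```python
from typing import Optional


def format_citation(value: str, filename: str, page: int,
                    section: Optional[str] = None) -> str:
    """Append an inline source citation to an extracted value."""
    parts = [filename, f"p.{page}"]
    if section:
        # The section header is optional; some examples cite only file and page.
        parts.append(f"Sec: {section}")
    return f"{value} [Src: {', '.join(parts)}]"


print(format_citation("98.7 %", "CoA_Batch2024.pdf", 3, "2.1 Assay Results"))
# 98.7 % [Src: CoA_Batch2024.pdf, p.3, Sec: 2.1 Assay Results]
```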

Before & After — Word Template Population

BEFORE — Word template placeholders

Batch Number:      
Assay Result:      
Total Impurities:  
Retest Date:       
Approved By:

→

AFTER — Populated with inline citations

Batch Number:     AXV-2024-0042 [Src: CoA_Batch2024.pdf, p.1]
Assay Result:     98.7 % [Src: CoA_Batch2024.pdf, p.3, Sec: 2.1]
Total Impurities: 0.42 % [Src: CoA_Batch2024.pdf, p.3, Sec: 2.2]
Retest Date:      2026-08 [Src: Stability_AXV101.pdf, p.2]
Approved By:      Dr. J. Smith [Src: CoA_Batch2024.pdf, p.1]
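
A minimal sketch of the substitution step follows. The template's real placeholder syntax is not shown on this page, so the `{{KEY}}` form used here is purely an assumption, as are the function and variable names. The sketch works on plain strings so it can run without a Word file.

```python
import re

# Assumed placeholder syntax: {{KEY}}. The real template syntax may differ.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z0-9_ ]+?)\s*\}\}")


def populate(text: str, cited_values: dict) -> tuple:
    """Replace known placeholders and report the ones that had no value."""
    missing = []

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key in cited_values:
            return cited_values[key]
        missing.append(key)
        return match.group(0)  # keep unresolved placeholders visible for review

    return PLACEHOLDER.sub(substitute, text), missing
```

With python-docx, the same function could be applied to the text of each run in `document.paragraphs`. Word sometimes splits a single placeholder across several runs, so a real populator would need to handle that case to preserve formatting.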

Technology Stack

Extraction Layer

pdfplumber	Native text, table KV, two-column layout detection
PyMuPDF	PDF metadata, fast page rendering, info-dict
Camelot / Tabula	Bordered (lattice) & stream table extraction (Tabula is Java-backed)
Tesseract OCR	Local OCR for scanned / image-only pages
AWS Textract	Cloud OCR with table/form structure (optional)
Azure Form Recognizer	Pre-built pharma/invoice models (optional)
AcroForm parser	Interactive PDF form fields, checkboxes, dropdowns

LLM Layer

Claude 3.5 Sonnet	Schema-driven field extraction, Q&A, Auto-Map
GPT-4o	Alternative LLM for extraction comparison

Document Generation

python-docx	Template population, inline citation insertion
fpdf2	Audit report PDF generation

API / Compliance

FastAPI + Uvicorn	REST API, auth, rate-limiting, CORS
SHA-256 hash chain	Tamper-evident JSONL audit log
21 CFR Part 11	Electronic record / e-signature alignment
React + TypeScript	Dashboard: upload, extract, review, map, Q&A

What This Delivers

Python extraction module

Configurable multi-engine PDF parser tuned to CoAs, analytical reports, stability data, and NDA sections. Returns structured provenance records with page + section citations.
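
To illustrate what running several engines side by side might look like, here is a sketch using a thread pool. Each engine is assumed to be a callable that takes a PDF path and returns a list of dictionaries. That interface, and the choice to collect errors per engine rather than abort, are assumptions for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical engine interface: pdf_path -> list of extracted records.
Engine = Callable[[str], list]


def run_engines(pdf_path: str, engines: dict) -> tuple:
    """Run every engine concurrently; tag each record with the engine that produced it."""
    records = []
    errors = {}
    with ThreadPoolExecutor(max_workers=len(engines) or 1) as pool:
        futures = {name: pool.submit(fn, pdf_path) for name, fn in engines.items()}
        for name, future in futures.items():
            try:
                for rec in future.result():
                    records.append({**rec, "extraction_engine": name})
            except Exception as exc:
                # One failing engine should not abort the whole extraction run.
                errors[name] = repr(exc)
    return records, errors
```

Threads suit this workload when engines spend most of their time in I/O or native libraries. CPU-bound pure-Python engines would likely need processes instead.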

Word template populator

python-docx engine maps extracted records to template placeholder fields and writes an inline citation `[Src: filename, p.N, Sec: X]` for every value. Preserves formatting.
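
The rule-based part of the mapping that feeds the populator could look roughly like the sketch below. The three rules, the template field names, and the tie-break by highest confidence are all hypothetical; the platform's own rule set is not reproduced here.

```python
import re

# Hypothetical rules: template field name -> pattern matched against a record's key.
RULES = {
    "BATCH_NUMBER": re.compile(r"^(batch|lot)\s*(no\.?|number)$", re.IGNORECASE),
    "ASSAY_RESULT": re.compile(r"^assay\b", re.IGNORECASE),
    "TOTAL_IMPURITIES": re.compile(r"^total\s+impurities$", re.IGNORECASE),
}


def map_fields(records: list, rules: dict = RULES) -> tuple:
    """Map records to template fields and log every match and miss."""
    mapped = {}
    log = []
    for field_name, pattern in rules.items():
        candidates = [r for r in records if pattern.search(r["key"])]
        if candidates:
            # Assumed tie-break: prefer the candidate with the highest confidence.
            best = max(candidates, key=lambda r: r["confidence_score"])
            mapped[field_name] = best
            log.append({"field": field_name, "status": "match", "key": best["key"]})
        else:
            log.append({"field": field_name, "status": "miss"})
    return mapped, log
```

Fields that end up as a miss are the natural candidates to hand to an LLM-based mapper or to a human reviewer.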

Source-reference system

Every output value carries source filename, 1-based page number, section header (nearest heading), extraction engine, and bounding-box coordinates for PDF highlight/annotation.

Audit trail

SHA-256 hash-chained JSONL log + 7-section human-readable audit report PDF. Suitable for GMP environments and FDA submission packages.
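
The idea behind a hash-chained log can be sketched in a few lines: each entry stores the hash of the previous entry, so editing any earlier line breaks every hash after it. The entry layout (`prev_hash`, `payload`, `hash`) and the all-zero genesis value are assumptions for this example, not the platform's actual log format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first entry


def _entry_hash(prev_hash: str, payload: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) keeps the hash reproducible.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log_lines: list, payload: dict) -> None:
    """Append one JSONL entry linked to the hash of the previous entry."""
    prev_hash = json.loads(log_lines[-1])["hash"] if log_lines else GENESIS
    entry = {
        "prev_hash": prev_hash,
        "payload": payload,
        "hash": _entry_hash(prev_hash, payload),
    }
    log_lines.append(json.dumps(entry, sort_keys=True))


def verify_chain(log_lines: list) -> bool:
    """Recompute every hash in order; any edited or reordered line fails the check."""
    prev_hash = GENESIS
    for line in log_lines:
        entry = json.loads(line)
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True
```

A chain like this detects tampering but does not prevent it: someone who can rewrite the whole file can also recompute every hash, so the final hash is usually stored or signed somewhere outside the log.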

REST API (optional)

FastAPI service exposing upload, extract, review, and export endpoints. Includes a React dashboard for human review and confidence-based QA.

Tests + documentation

pytest suite, docstrings, and a README covering setup, configuration, and how to add new document types or template placeholders.

Why This Approach

Already built — not a prototype. A deployed FastAPI + React platform with 10 extraction engines, tested on real regulatory documents.
Pharma domain knowledge — understands ICH E3, 21 CFR Part 11, CoA structure, analytical report sections, and GMP audit requirements.
Full citation chain — every value carries filename, page number, and section header out of the box.
Handles edge cases — scanned PDFs (OCR), AcroForms, two-column layouts, bordered tables, and hybrid documents.
LLM-augmented — Claude and GPT-4o fill gaps where regex/heuristic extraction fails, with confidence scoring on every result.
Clean, tested Python — type-annotated, documented, and structured for handoff.

Get In Touch

Interested in this system for your regulatory document workflow, or have a similar problem?

Email me · LinkedIn