100% Local PII Detection for GDPR/HIPAA Compliance
Two-layer architecture combining regex patterns with Swiss LLM for contextual PII detection.
The Problem
Companies want to use their data for AI — fine-tuning LLMs, building RAG systems, sharing datasets with partners. But the data contains personal information: names, emails, medical records, financial data.
Sending it to a cloud API for cleaning can itself be a GDPR/HIPAA violation.
Existing solutions are either:
- Cloud-based — privacy risk, data leaves your infrastructure
- Regex-only — miss contextual PII like "the patient" referring to a named person earlier in the document
The Solution
100% local PII detection and removal with a two-layer architecture.
Regex Layer (fast): pattern matching for unambiguous PII such as emails, phone numbers, SSNs, credit cards, IBANs, and Swiss AHV numbers. Processes ~100 documents/minute.
LLM Layer (accurate): the Swiss-made Apertus-8B model verifies regex candidates and finds contextual PII that patterns miss. Processes ~5 documents/minute.
Data never leaves the machine. No API calls. No cloud. Full GDPR/HIPAA compliance by design.
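To make the regex layer concrete, here is a minimal sketch of how candidate detection could look. The patterns, the `Detection` type, the `find_candidates` function, and the base confidence values are all illustrative assumptions; the project's actual patterns are not shown in this document and are likely more thorough.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns and base confidences, for illustration only.
PATTERNS = {
    "EMAIL": (re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"), 0.95),
    "US_SSN": (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), 0.90),
    "SWISS_AHV": (re.compile(r"\b756\.\d{4}\.\d{4}\.\d{2}\b"), 0.95),
    "IBAN": (
        re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b"),
        0.85,
    ),
}


@dataclass(frozen=True)
class Detection:
    label: str         # PII category, e.g. "EMAIL"
    start: int         # start offset in the source text
    end: int           # end offset (exclusive)
    text: str          # the matched string
    confidence: float  # base confidence assigned by the pattern


def find_candidates(text: str) -> list[Detection]:
    """Run every pattern over the text and return matches ordered by position."""
    found = []
    for label, (pattern, confidence) in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append(
                Detection(label, match.start(), match.end(), match.group(), confidence)
            )
    return sorted(found, key=lambda d: d.start)
```

In a two-layer setup, candidates like these would then be handed to the LLM layer for verification. Note that this sketch cannot see contextual references such as "the patient", which is exactly the gap the second layer is meant to close.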
What Gets Detected
| Category | Examples |
|---|---|
| Names | John Smith, Dr. Miller, Patient: Jane Doe |
| Contact | Emails, phones (US, EU, Swiss, international) |
| Identifiers | SSN, passport, driver license, Swiss AHV |
| Medical (PHI) | MRN, health plan IDs, insurance numbers |
| Financial | Credit cards, IBAN, account numbers |
| Dates | DOB, appointment dates with context |
| Addresses | Street addresses, postal codes (US, UK, DE, CH) |
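Some of the identifiers above carry a check digit, which offers a cheap way to reject false positives before any model is involved. The sketch below is an assumption about how such validation could be done, not a description of the project's code: it applies the Luhn algorithm to card numbers and the EAN-13 check digit to 13-digit Swiss AHV numbers. The function names are hypothetical.

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum, as used by payment card numbers."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 12:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def ahv_valid(number: str) -> bool:
    """EAN-13 check digit for a 13-digit Swiss AHV number (prefix 756)."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) != 13 or digits[:3] != [7, 5, 6]:
        return False
    # Weights alternate 1, 3, 1, 3, ... over the first twelve digits.
    weighted = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - weighted % 10) % 10 == digits[12]
```

A passing checksum could raise a candidate's confidence and a failing one could lower it or discard the match. How the result is weighted is a design choice this document does not specify.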
Technical Highlights
Dual mode operation: regex_only for fast pre-filtering, full for accurate final cleaning.
Confidence scoring: each detection carries a confidence score between 0.0 and 1.0. The regex layer passes its candidates to the LLM for verification.
Overlap handling: when multiple patterns match the same span, only the highest-confidence detection is kept.
Reversible mapping: mapping.json stores each placeholder with its original value, e.g. [NAME_1] → "John Smith", for the audit trail and potential reversal.
Batch processing: processes whole folders with a progress bar and generates report.json with statistics.
Multilingual: 1,800+ languages supported via the Apertus-8B model.
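Overlap handling and reversible mapping can be sketched together. The code below is an illustration under stated assumptions, not the project's implementation: `Detection`, `resolve_overlaps`, and `redact` are hypothetical names, overlaps are resolved greedily by confidence, and repeated values are assumed to reuse one placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    label: str
    start: int
    end: int  # exclusive
    confidence: float


def resolve_overlaps(detections):
    """Keep the highest-confidence detection wherever spans overlap."""
    kept = []
    # Highest confidence first; ties broken by position for determinism.
    for det in sorted(detections, key=lambda d: (-d.confidence, d.start)):
        if all(det.end <= k.start or det.start >= k.end for k in kept):
            kept.append(det)
    return sorted(kept, key=lambda d: d.start)


def redact(text, detections):
    """Replace spans with placeholders; return (cleaned_text, mapping)."""
    mapping = {}   # placeholder -> original value
    seen = {}      # (label, original) -> placeholder
    counters = {}  # label -> number of placeholders issued so far
    parts = []
    cursor = 0
    for det in resolve_overlaps(detections):
        original = text[det.start:det.end]
        key = (det.label, original)
        if key not in seen:
            counters[det.label] = counters.get(det.label, 0) + 1
            seen[key] = f"[{det.label}_{counters[det.label]}]"
            mapping[seen[key]] = original
        parts.append(text[cursor:det.start])
        parts.append(seen[key])
        cursor = det.end
    parts.append(text[cursor:])
    return "".join(parts), mapping
```

Reusing one placeholder per distinct value keeps references consistent across a document, so a reader of the cleaned text can still tell that two mentions refer to the same person.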
Why Apertus-8B
Swiss-made — aligns with our Swiss company focus on privacy and compliance.
Fully open — weights, training data, methodology all published. No black box.
Compliant — trained respecting opt-out consent, not scraped without permission.
65K context — handles long documents without chunking, understands full document context.
Usage
Fast mode requires no model download:
python -m src.cli --input ./documents --output ./clean --mode regex_only
Full mode downloads a ~16 GB model on first run:
python -m src.cli --input ./documents --output ./clean --mode full
Output includes cleaned documents, mapping file for audit trail, and statistics report.
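For audits or authorized reversal, the mapping file can be applied back to a cleaned document. The sketch below assumes mapping.json is a flat JSON object of placeholder to original value, such as `{"[NAME_1]": "John Smith"}`. The real file layout is not documented here, and `load_mapping` and `restore` are hypothetical helpers.

```python
import json
import re
from pathlib import Path

# Matches placeholders such as [NAME_1] or [SWISS_AHV_12].
PLACEHOLDER = re.compile(r"\[[A-Z_]+_\d+\]")


def load_mapping(path) -> dict:
    """Read a placeholder -> original mapping (assumed flat JSON object)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def restore(cleaned: str, mapping: dict) -> str:
    """Substitute originals back in; unknown placeholders are left untouched."""
    return PLACEHOLDER.sub(lambda m: mapping.get(m.group(), m.group()), cleaned)
```

Because the mapping holds the original PII, it should be stored and access-controlled at least as strictly as the source documents.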
Results
- Two-layer precision — regex catches obvious PII, LLM catches contextual references
- Full audit trail — know exactly what was redacted and where
- Zero data exposure — everything runs locally, no API calls
- Compliance-ready — output includes documentation for GDPR/HIPAA audits
Demo available on HuggingFace Spaces. Enterprise deployment available as part of our Privacy Architecture consulting.
