PII, GDPR, HIPAA, Privacy, LLM, Compliance

100% Local PII Detection for GDPR/HIPAA Compliance

Two-layer architecture combining regex patterns with a Swiss LLM for contextual PII detection.

Privacy & Compliance | 2 months | Open Source

Key Results

Zero data exposure

Dual-mode processing

Full audit trail

1,800+ languages supported

Services Used: Privacy Architecture, GDPR/HIPAA Compliance

The Problem

Companies want to use their data for AI — fine-tuning LLMs, building RAG systems, sharing datasets with partners. But the data contains personal information: names, emails, medical records, financial data.

Sending it to a cloud API for cleaning can itself be a GDPR or HIPAA violation, because the personal data leaves your infrastructure before it has been cleaned.

Existing solutions are either:

Cloud-based — privacy risk, data leaves your infrastructure
Regex-only — miss contextual PII like "the patient" referring to a named person earlier in the document

The Solution

100% local PII detection and removal with a two-layer architecture.

Regex Layer (fast) — pattern matching for unambiguous PII: emails, phone numbers, SSNs, credit cards, IBANs, Swiss AHV numbers. Processes ~100 documents/minute.

LLM Layer (accurate) — Swiss-made Apertus-8B model verifies regex candidates and finds contextual PII that patterns miss. Processes ~5 documents/minute.

Data never leaves the machine. No API calls. No cloud. Full GDPR/HIPAA compliance by design.
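
As a rough illustration of the regex layer described above, here is a minimal sketch. The names `Detection`, `PATTERNS`, and `regex_detect` are hypothetical, and the three patterns and their confidence values are simplified assumptions for demonstration, not the project's actual rules.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    category: str
    start: int        # inclusive character offset
    end: int          # exclusive character offset
    text: str
    confidence: float


# Illustrative patterns only (assumption): real-world rules need more
# variants, checksum validation, and locale-specific formats.
PATTERNS = {
    "EMAIL": (re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"), 0.95),
    "US_SSN": (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), 0.90),
    "SWISS_AHV": (re.compile(r"\b756\.\d{4}\.\d{4}\.\d{2}\b"), 0.95),
}


def regex_detect(text):
    """Return every pattern match in the text, ordered by position."""
    detections = []
    for category, (pattern, confidence) in PATTERNS.items():
        for match in pattern.finditer(text):
            detections.append(
                Detection(category, match.start(), match.end(), match.group(), confidence)
            )
    return sorted(detections, key=lambda d: d.start)


sample = "Contact jane.doe@example.com, AHV 756.1234.5678.97, SSN 123-45-6789."
for d in regex_detect(sample):
    print(d.category, d.text, d.confidence)
```

Because this layer is plain pattern matching, it needs no model and can run as a fast pre-filter before any LLM pass.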

What Gets Detected

Category	Examples
Names	John Smith, Dr. Miller, Patient: Jane Doe
Contact	Emails, phones (US, EU, Swiss, international)
Identifiers	SSN, passport, driver license, Swiss AHV
Medical (PHI)	MRN, health plan IDs, insurance numbers
Financial	Credit cards, IBAN, account numbers
Dates	DOB, appointment dates with context
Addresses	Street addresses, postal codes (US, UK, DE, CH)

Technical Highlights

Dual mode operation — regex_only for pre-filtering (fast), full for final cleaning (accurate).

Confidence scoring — each detection has 0.0-1.0 confidence. Regex passes candidates to LLM for verification.
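
One way to picture the hand-off from regex to LLM is the sketch below. It assumes that in `full` mode every regex candidate is rescored by a verifier, and that detections below a cut-off are dropped. `Candidate`, `score_candidates`, the `verifier` callable, and the 0.5 threshold are all hypothetical; the verifier stands in for the local model and is replaced here by a stub.

```python
from typing import Callable, NamedTuple, Optional


class Candidate(NamedTuple):
    category: str
    text: str
    confidence: float  # 0.0 to 1.0


def score_candidates(
    candidates,
    mode: str,
    verifier: Optional[Callable[[Candidate], float]] = None,
    min_confidence: float = 0.5,  # assumed cut-off, not from the project
):
    """Keep candidates whose final confidence reaches min_confidence.

    regex_only: the regex confidence is used unchanged.
    full: the verifier (a stand-in for the local LLM) rescores each candidate.
    """
    if mode not in ("regex_only", "full"):
        raise ValueError(f"unknown mode: {mode}")
    if mode == "full" and verifier is None:
        raise ValueError("full mode needs a verifier")

    kept = []
    for candidate in candidates:
        confidence = candidate.confidence
        if mode == "full":
            confidence = verifier(candidate)
        if confidence >= min_confidence:
            kept.append(candidate._replace(confidence=confidence))
    return kept


def stub_verifier(candidate: Candidate) -> float:
    # Stub: pretends the model decided the SSN-shaped string is an order number.
    return 0.2 if candidate.category == "US_SSN" else 0.99


candidates = [
    Candidate("EMAIL", "jane@example.com", 0.95),
    Candidate("US_SSN", "123-45-6789", 0.60),
]
print(score_candidates(candidates, "regex_only"))
print(score_candidates(candidates, "full", verifier=stub_verifier))
```

Keeping the verifier behind a plain callable means the regex-only path never has to load the model.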

Overlap handling — when multiple patterns match, keeps highest confidence detection.
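
The overlap rule can be sketched as a greedy pass: visit detections from highest to lowest confidence and keep one only if it does not intersect anything already kept. `Span` and `resolve_overlaps` are hypothetical names, and the tie-break (prefer the longer match) is an assumption.

```python
from typing import NamedTuple


class Span(NamedTuple):
    category: str
    start: int         # inclusive
    end: int           # exclusive
    confidence: float


def resolve_overlaps(spans):
    """Keep the highest-confidence span from each group of overlapping spans."""
    kept = []
    # Highest confidence first; on ties prefer the longer span (assumption).
    ordered = sorted(spans, key=lambda s: (-s.confidence, -(s.end - s.start), s.start))
    for span in ordered:
        if all(span.end <= k.start or span.start >= k.end for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)


spans = [
    Span("PHONE", 10, 22, 0.60),        # matches a prefix of the card number
    Span("CREDIT_CARD", 10, 29, 0.90),
    Span("EMAIL", 40, 55, 0.95),
]
print(resolve_overlaps(spans))
```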

Reversible mapping — mapping.json stores [NAME_1] → "John Smith" for audit trail and potential reversal.
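
A minimal sketch of how such a mapping might be built and reversed follows. `pseudonymize`, `restore`, and `save_mapping` are hypothetical helpers. The sketch assumes detections arrive as non-overlapping `(category, start, end)` tuples and that a repeated value reuses its placeholder.

```python
import json


def pseudonymize(text, detections):
    """Replace each detected span with a numbered placeholder.

    detections: non-overlapping (category, start, end) tuples.
    Returns the cleaned text and a placeholder -> original mapping.
    """
    mapping = {}    # "[NAME_1]" -> "John Smith"
    assigned = {}   # (category, original) -> placeholder, so repeats share one
    counters = {}
    parts = []
    cursor = 0
    for category, start, end in sorted(detections, key=lambda d: d[1]):
        original = text[start:end]
        key = (category, original)
        if key not in assigned:
            counters[category] = counters.get(category, 0) + 1
            assigned[key] = f"[{category}_{counters[category]}]"
            mapping[assigned[key]] = original
        parts.append(text[cursor:start])
        parts.append(assigned[key])
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts), mapping


def restore(text, mapping):
    """Reverse the redaction using the stored mapping."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


def save_mapping(mapping, path):
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(mapping, handle, ensure_ascii=False, indent=2)


text = "John Smith met Jane Doe. John Smith left."
clean, mapping = pseudonymize(text, [("NAME", 0, 10), ("NAME", 15, 23), ("NAME", 25, 35)])
print(clean)
print(mapping)
```

Since the mapping file contains the original values, it is itself personal data and needs the same protection as the source documents.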

Batch processing — process folders with progress bar, generates report.json with statistics.
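
The batch step could look roughly like the sketch below. `process_folder`, the `redact` callable, the `*.txt` filter, and the field names inside `report.json` are assumptions chosen for illustration. The progress bar is left out to keep the example dependency-free.

```python
import json
from collections import Counter
from pathlib import Path


def process_folder(input_dir, output_dir, redact):
    """Clean every .txt file in input_dir and write a statistics report.

    redact(text) must return (clean_text, list_of_detected_categories).
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    by_category = Counter()
    files = sorted(input_dir.glob("*.txt"))
    for path in files:
        clean_text, categories = redact(path.read_text(encoding="utf-8"))
        (output_dir / path.name).write_text(clean_text, encoding="utf-8")
        by_category.update(categories)

    # Field names are hypothetical; the real report.json layout is not documented here.
    report = {
        "documents_processed": len(files),
        "detections_total": sum(by_category.values()),
        "detections_by_category": dict(by_category),
    }
    (output_dir / "report.json").write_text(json.dumps(report, indent=2), encoding="utf-8")
    return report
```

Passing `redact` in as a function lets the same loop serve both the regex_only and full modes.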

Multilingual — 1,800+ languages supported via Apertus-8B model.

Why Apertus-8B

Swiss-made: fits our focus, as a Swiss company, on privacy and compliance.

Fully open: weights, training data, and methodology are all published. No black box.

Compliant: the training data respects opt-out requests from content owners rather than being scraped without permission.

65K context: fits long documents (up to roughly 65K tokens) in a single pass without chunking, so the model sees the full document context.

Usage

Fast mode requires no model download:

python -m src.cli --input ./documents --output ./clean --mode regex_only

Full mode downloads the ~16 GB model on first run:

python -m src.cli --input ./documents --output ./clean --mode full

Output includes the cleaned documents, a mapping file (mapping.json) for the audit trail, and a statistics report (report.json).

Results

Two-layer precision — regex catches obvious PII, LLM catches contextual references
Full audit trail — know exactly what was redacted and where
Zero data exposure — everything runs locally, no API calls
Compliance-ready — output includes documentation for GDPR/HIPAA audits

Demo available on HuggingFace Spaces. Enterprise deployment available as part of our Privacy Architecture consulting.