PII, GDPR, HIPAA, Privacy, LLM, Compliance

100% Local PII Detection for GDPR/HIPAA Compliance

Two-layer architecture combining regex patterns with a Swiss LLM for contextual PII detection.

Privacy & Compliance | 2 months | Open Source

Key Results

Zero data exposure
Dual-mode processing
Full audit trail
1,800+ languages supported

The Problem

Companies want to use their data for AI — fine-tuning LLMs, building RAG systems, sharing datasets with partners. But the data contains personal information: names, emails, medical records, financial data.

Sending it to a cloud API for cleaning can itself be a GDPR/HIPAA violation.

Existing solutions are either:

  • Cloud-based — privacy risk, data leaves your infrastructure
  • Regex-only — miss contextual PII like "the patient" referring to a named person earlier in the document

The Solution

100% local PII detection and removal with a two-layer architecture.

Regex Layer (fast) — pattern matching for unambiguous PII: emails, phone numbers, SSNs, credit cards, IBANs, Swiss AHV numbers. Processes ~100 documents/minute.
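As a rough illustration of what the regex layer does, the sketch below scans text with a few compiled patterns and returns labelled candidate spans. The pattern set, the `PATTERNS` dictionary, and the `find_candidates` function are hypothetical and simplified; they are not the project's actual patterns, which would need to be broader and stricter.

```python
import re

# Hypothetical, simplified patterns for illustration only.
PATTERNS = {
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # Swiss AHV numbers have 13 digits, start with 756, and are written 756.XXXX.XXXX.XX
    "SWISS_AHV": re.compile(r"\b756\.\d{4}\.\d{4}\.\d{2}\b"),
}


def find_candidates(text):
    """Return (label, start, end, matched_text) tuples sorted by position."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.start(), match.end(), match.group()))
    return sorted(hits, key=lambda hit: hit[1])


sample = "Contact jane.doe@example.com, SSN 123-45-6789, AHV 756.1234.5678.97."
hits = find_candidates(sample)
for hit in hits:
    print(hit)
```

Pattern matching of this kind is cheap because it needs no model, which is consistent with the throughput gap between the two layers.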

LLM Layer (accurate) — Swiss-made Apertus-8B model verifies regex candidates and finds contextual PII that patterns miss. Processes ~5 documents/minute.

Data never leaves the machine. No API calls. No cloud. Full GDPR/HIPAA compliance by design.


What Gets Detected

Category        Examples
Names           John Smith, Dr. Miller, Patient: Jane Doe
Contact         Emails, phones (US, EU, Swiss, international)
Identifiers     SSN, passport, driver license, Swiss AHV
Medical (PHI)   MRN, health plan IDs, insurance numbers
Financial       Credit cards, IBAN, account numbers
Dates           DOB, appointment dates with context
Addresses       Street addresses, postal codes (US, UK, DE, CH)

Technical Highlights

Dual mode operation: regex_only for pre-filtering (fast), full for final cleaning (accurate).

Confidence scoring — each detection has 0.0-1.0 confidence. Regex passes candidates to LLM for verification.
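One way the hand-off from regex to LLM could look is sketched below. Everything here is an assumption made for illustration: the `Detection` dataclass, the `verify_candidates` function, the two thresholds, and above all `stub_verifier`, which stands in for the real model call and only checks capitalisation.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Detection:
    label: str
    start: int
    end: int
    confidence: float  # 0.0 to 1.0


def verify_candidates(text, candidates, verifier, accept_at=0.9, keep_at=0.5):
    """Keep high-confidence candidates as-is; re-score the rest with the verifier."""
    kept = []
    for det in candidates:
        if det.confidence >= accept_at:
            kept.append(det)
            continue
        score = verifier(text, det)  # in full mode this would be the LLM
        if score >= keep_at:
            kept.append(replace(det, confidence=score))
    return kept


def stub_verifier(text, det):
    # Placeholder heuristic, NOT an LLM: title-cased spans count as likely names.
    return 0.8 if text[det.start:det.end].istitle() else 0.1


text = "Dr. Miller met the board in spring."
candidates = [Detection("NAME", 4, 10, 0.6), Detection("NAME", 28, 34, 0.4)]
result = verify_candidates(text, candidates, stub_verifier)
print(result)
```

Passing the verifier in as a callable keeps the routing logic testable without downloading a model.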

Overlap handling — when multiple patterns match, keeps highest confidence detection.
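A minimal sketch of this rule, assuming detections are plain dictionaries with `start`, `end`, and `confidence` keys (a hypothetical shape, not the project's data model): take detections in descending confidence order and drop any that overlap a span already kept.

```python
def resolve_overlaps(detections):
    """Greedy selection: the highest-confidence detection wins each overlap."""
    kept = []
    for det in sorted(detections, key=lambda d: d["confidence"], reverse=True):
        # Two half-open spans [start, end) are disjoint if one ends before the other starts.
        if all(det["end"] <= k["start"] or det["start"] >= k["end"] for k in kept):
            kept.append(det)
    return sorted(kept, key=lambda d: d["start"])


detections = [
    {"label": "PHONE", "start": 10, "end": 22, "confidence": 0.70},
    {"label": "IBAN", "start": 5, "end": 26, "confidence": 0.95},
    {"label": "EMAIL", "start": 40, "end": 60, "confidence": 0.99},
]
resolved = resolve_overlaps(detections)
print([d["label"] for d in resolved])
```

Here the phone match sits inside the IBAN span, so only the IBAN and the unrelated email survive.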

Reversible mapping: mapping.json stores [NAME_1] → "John Smith" for audit trail and potential reversal.
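The placeholder scheme can be sketched as follows. The `redact` and `restore` functions and the flat placeholder-to-value layout are assumptions for illustration; the actual structure of mapping.json may differ. The sketch also assumes the detections passed in do not overlap.

```python
import json


def redact(text, detections):
    """Replace each (label, start, end) span with a numbered placeholder."""
    mapping = {}   # placeholder -> original value
    seen = {}      # (label, value) -> placeholder, so repeats share one placeholder
    counters = {}  # label -> last number used
    parts = []
    cursor = 0
    for label, start, end in sorted(detections, key=lambda d: d[1]):
        value = text[start:end]
        key = (label, value)
        if key not in seen:
            counters[label] = counters.get(label, 0) + 1
            seen[key] = f"[{label}_{counters[label]}]"
            mapping[seen[key]] = value
        parts.append(text[cursor:start])
        parts.append(seen[key])
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts), mapping


def restore(redacted, mapping):
    """Reverse the redaction by substituting the original values back in."""
    for placeholder, value in mapping.items():
        redacted = redacted.replace(placeholder, value)
    return redacted


text = "John Smith called. John Smith left a number."
redacted, mapping = redact(text, [("NAME", 0, 10), ("NAME", 19, 29)])
print(redacted)
print(json.dumps(mapping, indent=2))
```

Reusing one placeholder per distinct value keeps references consistent across a document. Because the mapping holds the original PII, it needs the same protection as the source documents.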

Batch processing — process folders with progress bar, generates report.json with statistics.
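A folder run that ends in a statistics file might look like the sketch below. The `process_folder` function, the single email pattern, the restriction to `.txt` files, and the report fields are all hypothetical; the real report.json schema is not documented here, and the progress bar is omitted to keep the example dependency-free.

```python
import json
import re
import tempfile
from collections import Counter
from pathlib import Path

# Single illustrative pattern; the real tool covers many more PII types.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def process_folder(input_dir, output_dir):
    """Clean every .txt file in input_dir and write report.json to output_dir."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    counts = Counter()
    files = sorted(input_dir.glob("*.txt"))
    for path in files:
        text = path.read_text(encoding="utf-8")
        cleaned, replaced = EMAIL.subn("[EMAIL]", text)
        counts["EMAIL"] += replaced
        (output_dir / path.name).write_text(cleaned, encoding="utf-8")
    report = {"documents": len(files), "detections": dict(counts)}
    (output_dir / "report.json").write_text(json.dumps(report, indent=2), encoding="utf-8")
    return report


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "documents"
    src.mkdir()
    (src / "a.txt").write_text("Mail a@example.com and b@example.org", encoding="utf-8")
    (src / "b.txt").write_text("No PII here", encoding="utf-8")
    report = process_folder(src, Path(tmp) / "clean")
    cleaned_a = (Path(tmp) / "clean" / "a.txt").read_text(encoding="utf-8")
    print(report)
```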

Multilingual — 1,800+ languages supported via Apertus-8B model.


Why Apertus-8B

Swiss-made — aligns with our Swiss company focus on privacy and compliance.

Fully open — weights, training data, methodology all published. No black box.

Compliant — trained respecting opt-out consent, not scraped without permission.

65K context — handles long documents without chunking, understands full document context.


Usage

Fast mode requires no model download:

python -m src.cli --input ./documents --output ./clean --mode regex_only

Full mode downloads ~16GB model on first run:

python -m src.cli --input ./documents --output ./clean --mode full

Output includes cleaned documents, mapping file for audit trail, and statistics report.


Results

  • Two-layer precision — regex catches obvious PII, LLM catches contextual references
  • Full audit trail — know exactly what was redacted and where
  • Zero data exposure — everything runs locally, no API calls
  • Compliance-ready — output includes documentation for GDPR/HIPAA audits

Demo available on HuggingFace Spaces. Enterprise deployment available as part of our Privacy Architecture consulting.
