
AI Data Preparation
Clean Data, Compliant Models
Your AI is only as good as your data. We prepare training datasets that are clean, compliant, and actually useful. PII/PHI removal, quality assessment, bias detection — before your model learns the wrong things.
Why It Matters
GDPR fines up to €20M or 4% of global annual turnover
For PII in training data
HIPAA violations
PHI in AI systems
Model bias lawsuits
From biased training data
Garbage in, garbage out
Poor data = poor AI
Try It Yourself
Our PII Scrubber runs entirely on your infrastructure. Try the demo below — paste text and see PII detection in action. This lightweight demo uses regex-only mode; the full tool adds context-aware verification using a locally deployed Swiss multilingual model trained on compliant data.
Your data never leaves your browser in the demo, and never leaves your network with the full tool.
Full version includes: batch processing for large datasets, domain-specific detection tuning, and controlled pseudonymization with optional restoration outside the training environment.
Try Demo on Hugging Face
Quick Answers
What is PII/PHI in training data?
Personally Identifiable Information (names, emails, phone numbers, addresses, SSNs) and Protected Health Information (medical records, diagnoses, treatment data). If your model was trained on it, you have a compliance problem.
Can't I just use regex?
Regex catches obvious patterns. It misses context-dependent PII ('my doctor', 'the patient in room 5'), misspellings, and encoded data. Context-aware detection combined with rules-based methods catches much of what regex alone misses, and every detection stays auditable.
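To make the limitation concrete, here is a minimal sketch of regex-only detection. The pattern set, the labels, and the sample sentence are illustrative assumptions, not the PII Scrubber's actual rules.

```python
import re

# Illustrative patterns only (assumptions); real formats vary by locale.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "INTL_PHONE": re.compile(r"\+\d[\d ]{7,14}\d"),
}

def find_pii(text):
    """Return (label, matched_text, start, end) tuples sorted by position."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group(), match.start(), match.end()))
    return sorted(hits, key=lambda hit: hit[2])

sample = "Mail jane.doe@example.com about the patient in room 5, SSN 123-45-6789."
for label, value, start, end in find_pii(sample):
    print(label, value, start, end)
```

The email and the SSN are found because they have a fixed shape. 'The patient in room 5' has no fixed shape, so no pattern fires on it; that gap is where contextual analysis and human review come in.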
What about synthetic data?
We can help generate synthetic replacements that preserve statistical properties without real PII. Good for fine-tuning when you need realistic but compliant data.
Do you handle multiple languages?
Yes. PII patterns vary by language and locale — German addresses, French phone formats, Cyrillic names. We configure detection for your specific data sources.
What formats can you process?
Text, JSON, CSV, PDF, DOCX, database exports. If it contains text, we can process it.
How do I know nothing was missed?
We provide detailed reports: what was found, what was redacted, confidence scores, and flagged items for human review. Full audit trail for compliance.
Is there a self-hosted option?
Yes. Our PII Scrubber runs entirely on your infrastructure — no data leaves your network. Detection combines rules-based methods with a locally deployed Swiss multilingual model trained exclusively on compliant data. For production, we configure domain-specific detection, batch processing, and controlled pseudonymization with strict separation from training systems.
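Controlled pseudonymization with optional restoration can be pictured roughly as follows. This is a simplified sketch under stated assumptions: it handles emails only, the placeholder format `<EMAIL_n>` is invented for illustration, and the `mapping` dictionary stands in for a lookup table that would be stored outside the training environment.

```python
import re

# Assumption: a simple email pattern stands in for the full detector.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def pseudonymize(text, mapping):
    """Replace each email with a stable placeholder and record it in mapping."""
    def _swap(match):
        value = match.group()
        if value not in mapping:
            # Hypothetical placeholder format; same value always gets same token.
            mapping[value] = f"<EMAIL_{len(mapping) + 1}>"
        return mapping[value]
    return EMAIL.sub(_swap, text)

def restore(text, mapping):
    """Reverse the substitution; only possible for whoever holds the mapping."""
    for value, token in mapping.items():
        text = text.replace(token, value)
    return text

mapping = {}  # kept separate from the dataset that goes to training
masked = pseudonymize("Contact a@x.org or b@y.org; a@x.org replied.", mapping)
print(masked)
print(restore(masked, mapping))
```

Because the same value always maps to the same placeholder, references stay consistent across the dataset, while the training side never sees the mapping.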
What We Do
PII/PHI Detection & Removal
Find and redact personal data from training datasets, documents, and data exports: names, addresses, phone numbers, emails, SSNs, medical record numbers, and diagnoses. We use a hybrid detection approach combining rules, contextual analysis, and human review.
Output: Clean dataset + audit report
Training Data Quality Assessment
Before you fine-tune, know what you're feeding your model. We run duplicate detection, inconsistency analysis, label quality checks, and coverage gap analysis.
Output: Quality report + recommendations
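As a rough illustration of two of these checks, the sketch below finds duplicate examples and texts that carry conflicting labels. The `(text, label)` record shape and the whitespace and case normalization are assumptions chosen for the example; a real assessment would also look at near-duplicates and coverage.

```python
from collections import defaultdict

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def assess(records):
    """records: list of (text, label). Returns (duplicates, conflicts)."""
    labels_by_text = defaultdict(list)
    for text, label in records:
        labels_by_text[normalize(text)].append(label)
    # Texts that occur more than once, with their occurrence count.
    duplicates = {t: len(ls) for t, ls in labels_by_text.items() if len(ls) > 1}
    # Texts that occur with more than one distinct label.
    conflicts = {t: sorted(set(ls)) for t, ls in labels_by_text.items()
                 if len(set(ls)) > 1}
    return duplicates, conflicts

records = [
    ("Refund please", "billing"),
    ("refund  please", "billing"),
    ("Reset my password", "account"),
    ("reset my password", "billing"),
]
duplicates, conflicts = assess(records)
print(duplicates)
print(conflicts)
```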
Bias Detection
Identify potential bias in your training data before it becomes bias in your model. Demographic representation, sentiment skew, label distribution analysis.
Output: Bias report + mitigation strategies
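One simple form of label distribution analysis is comparing how often a given label appears in each demographic group. The sketch below assumes rows of `(group, label)` pairs and uses made-up group names and labels; a large gap is a prompt for closer inspection, not proof of bias.

```python
from collections import Counter

def positive_rate_by_group(rows, positive_label):
    """rows: iterable of (group, label). Returns share of positive_label per group."""
    totals = Counter()
    positives = Counter()
    for group, label in rows:
        totals[group] += 1
        if label == positive_label:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}

def max_rate_gap(rates):
    """Largest difference in positive rate between any two groups."""
    return max(rates.values()) - min(rates.values())

# Hypothetical data: group A is approved far more often than group B.
rows = ([("A", "approve")] * 3 + [("A", "deny")]
        + [("B", "approve")] + [("B", "deny")] * 3)
rates = positive_rate_by_group(rows, "approve")
print(rates, max_rate_gap(rates))
```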
Synthetic Data Generation
Need realistic data without real PII? We generate synthetic replacements that preserve statistical properties and semantic meaning.
Output: Synthetic dataset + validation report
Automated Data Pipeline
For ongoing data collection, we build automated pipelines that clean and validate data before it enters your training set.
Output: Deployed pipeline + documentation
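The core idea of such a pipeline is a gate: every record is cleaned, then validated, and only records that pass are appended to the training set. The sketch below is a deliberately small illustration; the SSN-only scrubber, the `[SSN]` replacement token, and the validation rules are assumptions, not the deployed pipeline.

```python
import re

# Assumption: one US SSN pattern stands in for the full detection stack.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def scrub(text):
    """Replace detected identifiers with a fixed token."""
    return SSN.sub("[SSN]", text)

def validate(text):
    """Accept only non-empty records with no remaining SSN-shaped strings."""
    return bool(text.strip()) and SSN.search(text) is None

def ingest(incoming, training_set):
    """Clean and validate each record; return the records that were rejected."""
    rejected = []
    for raw in incoming:
        cleaned = scrub(raw)
        if validate(cleaned):
            training_set.append(cleaned)
        else:
            rejected.append(raw)
    return rejected

training_set = []
rejected = ingest(["SSN 123-45-6789 on file", "   ", "No identifiers here"],
                  training_set)
print(training_set, rejected)
```

Keeping the rejected records separate, rather than silently dropping them, leaves something to audit later.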
What We Detect
Personal Identifiers
- Names
- Email addresses
- Phone numbers
- Physical addresses
- Social Security Numbers
- Passport/ID numbers
- Dates of birth
- Financial account numbers
Protected Health Information (PHI – Regulated)
- Medical record numbers
- Health plan IDs
- Patient names in clinical context
- Diagnoses and conditions
- Treatment information
- Provider names
- Facility identifiers
- Clinical notes and unstructured medical text
Context-Dependent PII
- Indirect identifiers ('my boss', 'the CEO')
- Location context ('the clinic on Main St')
- Temporal identifiers ('last Tuesday's appointment')
- Relationship references
- Workplace/school identifiers
Process
Fast turnaround for one-time cleanups. Ongoing engagement for continuous pipelines.
Data Assessment
- Sample analysis of your data sources
- Identify PII types and patterns specific to your domain
- Define redaction/replacement strategy
- Estimate scope and timeline
Processing
- Configure detection for your specific data
- Run hybrid detection (rules + contextual analysis) with mandatory human review of low-confidence and PHI-related cases
- Apply redaction/replacement
- Validate output quality
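The review rule in the Processing steps above can be sketched as a simple routing function: anything PHI-related or below a confidence threshold goes to a human, and the rest is redacted automatically. The `Detection` type, the label names, and the 0.85 threshold are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    text: str
    label: str
    confidence: float

# Hypothetical label set and threshold; tuned per project in practice.
PHI_LABELS = {"DIAGNOSIS", "MRN", "TREATMENT"}
REVIEW_THRESHOLD = 0.85

def route(detections):
    """Split detections into (auto_redact, human_review) lists."""
    auto_redact, human_review = [], []
    for detection in detections:
        if detection.label in PHI_LABELS or detection.confidence < REVIEW_THRESHOLD:
            human_review.append(detection)
        else:
            auto_redact.append(detection)
    return auto_redact, human_review

detections = [
    Detection("jane.doe@example.com", "EMAIL", 0.99),
    Detection("my boss", "INDIRECT_ID", 0.60),
    Detection("type 2 diabetes", "DIAGNOSIS", 0.97),
]
auto_redact, human_review = route(detections)
print([d.label for d in auto_redact], [d.label for d in human_review])
```

Note that the diagnosis goes to review despite its high confidence, because PHI is always reviewed under this rule.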
Delivery
- Clean dataset in your preferred format
- Detailed audit report
- Compliance documentation for GDPR/HIPAA
- Recommendations for ongoing data hygiene

Typical Use Cases
Fine-tuning Preparation
Clean your training data before fine-tuning. Remove PII that leaked into scraped content, customer interactions, or internal documents.
RAG Knowledge Base
Building a RAG system from internal docs? Make sure your knowledge base doesn't expose employee data, customer PII, or confidential information.
Healthcare AI
Train models on clinical data without HIPAA violations. PHI detection specifically tuned for medical terminology and workflows.
Financial Services
Process transaction data, customer communications, and financial documents while maintaining PCI DSS and privacy compliance.
Data Sharing
Need to share datasets with partners or researchers? Anonymize while preserving analytical value.
What You Get
Clean Dataset
Your data with PII removed, replaced, or synthesized — in the format you need.
Audit Report
What was found, where, what was done. Full traceability for compliance and legal.
Compliance Documentation
Evidence package for GDPR Article 30, HIPAA requirements, and internal or external audits.
Quality Metrics
Detection confidence scores, false positive/negative rates, coverage statistics.
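False positive and false negative figures only mean something relative to a hand-labeled reference sample. A minimal sketch of that comparison, assuming detections and reference annotations are both expressed as `(start, end, label)` spans and that only exact matches count:

```python
def detection_metrics(predicted, gold):
    """predicted, gold: sets of (start, end, label) spans; exact match only."""
    true_positives = len(predicted & gold)
    false_positives = len(predicted - gold)   # flagged, but not real PII
    false_negatives = len(gold - predicted)   # real PII that was missed
    flagged = true_positives + false_positives
    actual = true_positives + false_negatives
    return {
        "precision": true_positives / flagged if flagged else 0.0,
        "recall": true_positives / actual if actual else 0.0,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
    }

# Hypothetical spans from a reviewed sample document.
gold = {(0, 8, "NAME"), (20, 31, "SSN"), (40, 60, "EMAIL")}
predicted = {(0, 8, "NAME"), (20, 31, "SSN"), (70, 75, "PHONE")}
print(detection_metrics(predicted, gold))
```

For compliance work, recall (how much real PII was caught) usually matters more than precision, since a missed identifier is costlier than an over-redaction.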
Investment
Pricing depends on data volume, complexity, and PII density. One-time projects and ongoing pipelines are priced differently.
Data Assessment
CHF 3,000
Sample analysis, PII inventory, scope estimation, strategy recommendation
Batch Processing
From CHF 10,000
One-time cleanup of existing dataset. Price based on volume and complexity.
Automated Pipeline
From CHF 25,000
Deployed pipeline for ongoing data cleaning. Includes setup, testing, documentation.
Enterprise Agreement
Custom
High-volume, multi-source, ongoing engagement with SLA and support.
What We Need From You
- Sample data (can be an anonymized subset for the initial assessment)
- Data format and schema documentation
- Target use case (fine-tuning, RAG, sharing, etc.)
- Regulatory requirements (GDPR, HIPAA, industry-specific)
- Preferred redaction strategy (remove, mask, synthesize)
- Data sensitivity classification (PII, PHI, mixed, or regulated domain-specific data)
Building AI on Sensitive Data?
Let's make sure your training data doesn't become a compliance nightmare. Start with an assessment — we'll tell you exactly what's in there and what needs to happen.
Request Data Assessment