Kenaz

AI Data Preparation

Clean Data, Compliant Models

Your AI is only as good as your data. We prepare training datasets that are clean, compliant, and actually useful. PII/PHI removal, quality assessment, bias detection — before your model learns the wrong things.

Why It Matters

GDPR fines up to €20M or 4% of global annual turnover

For PII in training data

HIPAA violations

PHI in AI systems

Model bias lawsuits

From biased training data

Garbage in, garbage out

Poor data = poor AI

Try It Yourself

Our PII Scrubber runs entirely on your infrastructure. Try the demo below — paste text and see PII detection in action. This lightweight demo uses regex-only mode; the full tool adds context-aware verification using a locally deployed Swiss multilingual model trained on compliant data.

Your data never leaves your browser in the demo, and never leaves your network with the full tool.

Full version includes: batch processing for large datasets, domain-specific detection tuning, and controlled pseudonymization with optional restoration outside the training environment.

Try Demo on Hugging Face

Quick Answers

What is PII/PHI in training data?

Personally Identifiable Information (names, emails, phone numbers, addresses, SSNs) and Protected Health Information (medical records, diagnoses, treatment data). If your model was trained on either, you likely have a compliance problem.

Can't I just use regex?

Regex catches obvious patterns. It misses context-dependent PII ('my doctor', 'the patient in room 5'), misspellings, and encoded data. Context-aware detection combined with rules-based methods catches what regex misses, with full auditability.
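To make the limitation concrete, here is a minimal regex-only sketch in Python. The two patterns and the function names `find_pii` and `redact` are assumptions chosen for illustration; they are not the PII Scrubber's actual rules.

```python
import re

# Illustrative patterns only (assumed for this example); real formats vary by locale.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_pii(text):
    """Return (label, start, end, matched_text) tuples sorted by position."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.start(), match.end(), match.group()))
    return sorted(hits, key=lambda hit: hit[1])


def redact(text):
    """Replace every regex match with a [LABEL] placeholder."""
    pieces, cursor = [], 0
    for label, start, end, _ in find_pii(text):
        if start < cursor:
            continue  # skip a match that overlaps one already redacted
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


structured = "Mail jane.doe@example.com, SSN 123-45-6789."
contextual = "The patient in room 5 saw my doctor last Tuesday."

print(redact(structured))  # the email and SSN are replaced
print(redact(contextual))  # unchanged: no pattern describes this kind of identifier
```

The second sentence passes through untouched, which is the gap that context-aware detection is meant to close.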

What about synthetic data?

We can help generate synthetic replacements that preserve statistical properties without real PII. Good for fine-tuning when you need realistic but compliant data.

Do you handle multiple languages?

Yes. PII patterns vary by language and locale — German addresses, French phone formats, Cyrillic names. We configure detection for your specific data sources.

What formats can you process?

Text, JSON, CSV, PDF, DOCX, database exports. If it contains text, we can process it.

How do I know nothing was missed?

We provide detailed reports: what was found, what was redacted, confidence scores, and flagged items for human review. Full audit trail for compliance.

Is there a self-hosted option?

Yes. Our PII Scrubber runs entirely on your infrastructure — no data leaves your network. Detection combines rules-based methods with a locally deployed Swiss multilingual model trained exclusively on compliant data. For production, we configure domain-specific detection, batch processing, and controlled pseudonymization with strict separation from training systems.
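As a rough sketch of what controlled pseudonymization with optional restoration can look like, the Python example below maps each value to a stable keyed token and keeps the reverse mapping in a separate store. The class name `Pseudonymizer`, the token format, and the in-memory `vault` are all assumptions for illustration; a real deployment would keep the key and the vault outside the training environment.

```python
import hashlib
import hmac


class Pseudonymizer:
    """Maps values to stable tokens; the reverse map is held separately."""

    def __init__(self, secret_key: bytes):
        self._key = secret_key
        # token -> original value. Hypothetical in-memory store; in practice this
        # would live outside the training environment.
        self.vault = {}

    def tokenize(self, label: str, value: str) -> str:
        # A keyed hash gives the same token for the same value,
        # so references stay consistent across a dataset.
        digest = hmac.new(self._key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:10]
        token = f"<{label}_{digest}>"
        self.vault[token] = value
        return token

    def restore(self, text: str) -> str:
        # Only possible for whoever holds the vault.
        for token, value in self.vault.items():
            text = text.replace(token, value)
        return text


p = Pseudonymizer(b"demo-key-not-for-production")
token = p.tokenize("NAME", "Anna Meier")
training_text = f"{token} called about her invoice."
print(training_text)             # what the training set would contain
print(p.restore(training_text))  # restoration, done outside training
```

Because tokens are deterministic per key, the same person maps to the same token everywhere, which preserves co-reference without exposing the name.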

What We Do

PII/PHI Detection & Removal

Find and redact personal data from training datasets, documents, and data exports. Names, addresses, phone numbers, emails, SSNs, medical record numbers, and diagnoses are detected with a hybrid approach that combines rules, contextual analysis, and human review.

Output: Clean dataset + audit report

Training Data Quality Assessment

Before you fine-tune, know what you're feeding your model. Duplicate detection, inconsistency analysis, label quality checks, and coverage gap analysis.

Output: Quality report + recommendations
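Two of these checks are simple enough to sketch. The Python example below groups records whose text is identical after whitespace and case normalization, then flags groups where the same text carries different labels. The function names and the `(text, label)` record shape are assumptions for illustration, and exact-match hashing is only the simplest form of duplicate detection.

```python
import hashlib
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def find_duplicates_and_conflicts(records):
    """records: list of (text, label) pairs.

    Returns (duplicate_groups, label_conflicts), each a list of index lists.
    """
    groups = defaultdict(list)
    for idx, (text, _label) in enumerate(records):
        key = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        groups[key].append(idx)

    duplicates = [ids for ids in groups.values() if len(ids) > 1]
    # A conflict is a duplicate group whose members disagree on the label.
    conflicts = [ids for ids in duplicates if len({records[i][1] for i in ids}) > 1]
    return duplicates, conflicts


records = [
    ("Great product", "pos"),
    ("great   product", "neg"),
    ("Bad", "neg"),
    ("bad", "neg"),
]
duplicates, conflicts = find_duplicates_and_conflicts(records)
print(duplicates)  # groups of record indices with the same normalized text
print(conflicts)   # the subset of groups with inconsistent labels
```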

Bias Detection

Identify potential bias in your training data before it becomes bias in your model. Demographic representation, sentiment skew, label distribution analysis.

Output: Bias report + mitigation strategies
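Label distribution analysis can start with something as plain as comparing outcome rates across groups. The Python sketch below assumes rows of `(group, label)` pairs and hypothetical function names; it reports a single gap figure, which is a starting signal for review and not a complete fairness assessment.

```python
from collections import Counter


def positive_rate_by_group(rows, positive_label):
    """rows: iterable of (group, label). Returns {group: share of positive labels}."""
    totals = Counter()
    positives = Counter()
    for group, label in rows:
        totals[group] += 1
        if label == positive_label:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}


def max_rate_gap(rates):
    """Largest difference in positive rate between any two groups."""
    return max(rates.values()) - min(rates.values())


rows = [
    ("A", "approve"), ("A", "approve"), ("A", "deny"), ("A", "approve"),
    ("B", "approve"), ("B", "deny"), ("B", "deny"), ("B", "deny"),
]
rates = positive_rate_by_group(rows, positive_label="approve")
print(rates)
print(max_rate_gap(rates))
```

A large gap does not prove bias on its own, but it tells you which slices of the data deserve a closer look before training.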

Synthetic Data Generation

Need realistic data without real PII? We generate synthetic replacements that preserve statistical properties and semantic meaning.

Output: Synthetic dataset + validation report

Automated Data Pipeline

For ongoing data collection, we build automated pipelines that clean and validate data before it enters your training set.

Output: Deployed pipeline + documentation

What We Detect

Personal Identifiers

  • Names
  • Email addresses
  • Phone numbers
  • Physical addresses
  • Social Security Numbers
  • Passport/ID numbers
  • Dates of birth
  • Financial account numbers

Protected Health Information (PHI – Regulated)

  • Medical record numbers
  • Health plan IDs
  • Patient names in clinical context
  • Diagnoses and conditions
  • Treatment information
  • Provider names
  • Facility identifiers
  • Clinical notes and unstructured medical text

Context-Dependent PII

  • Indirect identifiers ('my boss', 'the CEO')
  • Location context ('the clinic on Main St')
  • Temporal identifiers ('last Tuesday's appointment')
  • Relationship references
  • Workplace/school identifiers

Process

Fast turnaround for one-time cleanups. Ongoing engagement for continuous pipelines.

1 week

Data Assessment

  • Sample analysis of your data sources
  • Identify PII types and patterns specific to your domain
  • Define redaction/replacement strategy
  • Estimate scope and timeline
1-4 weeks

Processing

  • Configure detection for your specific data
  • Run hybrid detection (rules + contextual analysis) with mandatory human review of low-confidence and PHI-related cases
  • Apply redaction/replacement
  • Validate output quality
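The review rule in the processing step can be sketched as a simple routing function: anything PHI-related or below a confidence cut-off goes to a human, the rest is redacted automatically. The label names in `PHI_LABELS`, the `0.85` threshold, and the `Detection` type are assumptions for illustration and would be set per project.

```python
from dataclasses import dataclass

# Hypothetical label names and threshold, chosen only for this example.
PHI_LABELS = {"DIAGNOSIS", "MRN", "TREATMENT"}
REVIEW_THRESHOLD = 0.85


@dataclass
class Detection:
    label: str
    text: str
    confidence: float


def route(detections, threshold=REVIEW_THRESHOLD):
    """Split detections into (auto_redact, human_review) lists."""
    auto_redact, human_review = [], []
    for det in detections:
        # PHI always gets a human look, regardless of confidence.
        if det.label in PHI_LABELS or det.confidence < threshold:
            human_review.append(det)
        else:
            auto_redact.append(det)
    return auto_redact, human_review


detections = [
    Detection("EMAIL", "jane.doe@example.com", 0.99),
    Detection("NAME", "Jordan", 0.62),
    Detection("DIAGNOSIS", "type 2 diabetes", 0.97),
]
auto_redact, human_review = route(detections)
print([d.label for d in auto_redact])
print([d.label for d in human_review])
```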
Included

Delivery

  • Clean dataset in your preferred format
  • Detailed audit report
  • Compliance documentation for GDPR/HIPAA
  • Recommendations for ongoing data hygiene

Typical Use Cases

Fine-tuning Preparation

Clean your training data before fine-tuning. Remove PII that leaked into scraped content, customer interactions, or internal documents.

RAG Knowledge Base

Building a RAG system from internal docs? Make sure your knowledge base doesn't expose employee data, customer PII, or confidential information.

Healthcare AI

Train models on clinical data without HIPAA violations. PHI detection specifically tuned for medical terminology and workflows.

Financial Services

Process transaction data, customer communications, and financial documents while maintaining PCI DSS and privacy compliance.

Data Sharing

Need to share datasets with partners or researchers? Anonymize while preserving analytical value.

What You Get

Clean Dataset

Your data with PII removed, replaced, or synthesized — in the format you need.

Audit Report

What was found, where, what was done. Full traceability for compliance and legal.

Compliance Documentation

Evidence package for GDPR Article 30 (records of processing activities), HIPAA requirements, and internal or external audits.

Quality Metrics

Detection confidence scores, false positive/negative rates, coverage statistics.
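For readers who want to see how such rates are typically derived, here is a minimal Python sketch that compares detected spans against a hand-labeled sample. The span format `(start, end, label)` and the function name `detection_metrics` are assumptions for illustration, and exact-span matching is the strictest of several possible scoring rules.

```python
def detection_metrics(predicted, gold):
    """Compare detected spans with a hand-labeled reference.

    predicted, gold: sets of (start, end, label) tuples.
    """
    tp = len(predicted & gold)   # found and correct
    fp = len(predicted - gold)   # flagged, but not PII in the reference
    fn = len(gold - predicted)   # PII in the reference that was missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "true_positives": tp,
        "false_positives": fp,
        "false_negatives": fn,
        "precision": precision,
        "recall": recall,
        "false_negative_rate": fn / (tp + fn) if (tp + fn) else 0.0,
    }


gold = {(0, 4, "NAME"), (10, 20, "EMAIL"), (30, 41, "SSN"), (50, 60, "PHONE")}
predicted = {(0, 4, "NAME"), (10, 20, "EMAIL"), (70, 75, "NAME")}
print(detection_metrics(predicted, gold))
```

For PII work the false negative rate usually matters most, since a missed identifier ends up in the training set.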

Investment

Pricing depends on data volume, complexity, and PII density. One-time projects and ongoing pipelines are priced differently.

Data Assessment

CHF 3,000

Sample analysis, PII inventory, scope estimation, strategy recommendation

Batch Processing

From CHF 10,000

One-time cleanup of an existing dataset. Price based on volume and complexity.

Automated Pipeline

From CHF 25,000

Deployed pipeline for ongoing data cleaning. Includes setup, testing, documentation.

Enterprise Agreement

Custom

High-volume, multi-source, ongoing engagement with SLA and support.

What We Need From You

  • Sample data (can be an anonymized subset for initial assessment)
  • Data format and schema documentation
  • Target use case (fine-tuning, RAG, sharing, etc.)
  • Regulatory requirements (GDPR, HIPAA, industry-specific)
  • Preferred redaction strategy (remove, mask, synthesize)
  • Data sensitivity classification (PII, PHI, mixed, or regulated domain-specific data)
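To show how the three redaction strategies in the list above differ in practice, here is a small Python sketch that applies each one to the same detected spans. The `SURROGATES` table, the function name `apply_strategy`, and the span format are assumptions for illustration; real synthetic replacement would draw from a much richer generator.

```python
# Hypothetical fixed surrogates; stands in for a real synthetic-data generator.
SURROGATES = {"NAME": "Alex Example", "EMAIL": "user@example.com"}


def apply_strategy(text, spans, strategy):
    """spans: non-overlapping (start, end, label) tuples.

    strategy: "remove", "mask", or "synthesize".
    """
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(spans, reverse=True):
        if strategy == "remove":
            replacement = ""
        elif strategy == "mask":
            replacement = f"[{label}]"
        elif strategy == "synthesize":
            replacement = SURROGATES.get(label, f"[{label}]")
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        out = out[:start] + replacement + out[end:]
    return out


text = "Contact Anna Meier at anna@firma.ch today"
name_at = text.index("Anna Meier")
mail_at = text.index("anna@firma.ch")
spans = [
    (name_at, name_at + len("Anna Meier"), "NAME"),
    (mail_at, mail_at + len("anna@firma.ch"), "EMAIL"),
]
for strategy in ("remove", "mask", "synthesize"):
    print(strategy, "->", apply_strategy(text, spans, strategy))
```

Removal is the safest but damages sentence structure, masking keeps the entity type visible, and synthetic replacement keeps the text natural for fine-tuning.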

Building AI on Sensitive Data?

Let's make sure your training data doesn't become a compliance nightmare. Start with an assessment — we'll tell you exactly what's in there and what needs to happen.

Request Data Assessment