
Anonymise sensitive data for AI projects

compliance · intermediate · proven

The problem

You want to use AI with data that contains personal or sensitive information. Maybe it's service user records for analysis, or case notes you want to summarise. You can't just upload identifiable data to external AI services, but you also can't do the AI work without the data. You need to balance privacy protection with analytical utility.

The solution

Apply appropriate anonymisation techniques based on your data and use case. This ranges from simple removal of identifying fields, through pseudonymisation (replacing names with codes), to more sophisticated techniques like generalisation and suppression. The right approach depends on what you're doing and who might see the results.

What you get

A dataset that's safe to use with AI tools while remaining useful for your analysis. You'll have documented what anonymisation was applied, what risks remain, and any limitations on how the data can be used. This protects your organisation and the people in your data.

Before you start

  • Data you need to anonymise
  • Understanding of what fields are sensitive
  • Knowledge of how the anonymised data will be used
  • Ideally: input from your DPO or data protection lead

When to use this

  • Before sending data to external AI services
  • When sharing data with external analysts
  • Creating datasets for testing or development
  • Before storing data long-term for research

When not to use this

  • The AI runs entirely on-premises with proper access controls
  • You have explicit consent for identified data use
  • The data isn't personal data (already anonymous or not about people)

Steps

  1. Identify direct identifiers

    List fields that directly identify individuals: names, email addresses, phone numbers, national insurance numbers, addresses, photos. These must always be removed or replaced. There's no safe way to include these in external AI processing.
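A quick way to surface likely direct identifiers is to scan column values for telltale patterns. This is a minimal sketch assuming a pandas DataFrame; the `flag_direct_identifiers` helper and its regexes are illustrative, not exhaustive (extend them for NI numbers, addresses and so on):

```python
import re
import pandas as pd

# Illustrative patterns only -- extend for your own data
PATTERNS = {
    'email': re.compile(r'[\w.+-]+@[\w-]+\.[\w.]+'),
    'phone': re.compile(r'\b(?:\+44\s?|0)\d[\d ]{8,12}\b'),
    'nhs_number': re.compile(r'\b\d{3}[ -]?\d{3}[ -]?\d{4}\b'),
}

def flag_direct_identifiers(df, sample=100):
    """Return {column: [pattern names]} for text columns whose sampled
    values look like direct identifiers."""
    flags = {}
    for col in df.select_dtypes(include='object').columns:
        values = df[col].dropna().astype(str).head(sample)
        hits = [name for name, pat in PATTERNS.items()
                if values.str.contains(pat).any()]
        if hits:
            flags[col] = hits
    return flags

df = pd.DataFrame({'contact': ['jo@example.org'], 'notes': ['called at 9am']})
print(flag_direct_identifiers(df))  # flags 'contact' as email-like
```

This only checks a sample of values, so treat a clean result as a starting point for manual review, not proof the column is safe.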

  2. Identify indirect identifiers

    List fields that could identify someone in combination: date of birth, postcode, job title, rare conditions, unusual circumstances. "35-year-old in SW1A 1AA with rare condition X" might identify one person. These need careful handling.

  3. Choose your approach based on use case

    For aggregate analysis: remove all identifiers and generalise indirect identifiers. For individual-level analysis (like case note summarisation): pseudonymise by replacing names with codes, generalise dates and locations. For maximum protection: consider synthetic data or on-premise processing.

  4. Remove direct identifiers

    Delete or replace: names → "Person A" or "[NAME]", emails → removed, phone numbers → removed, addresses → postcode prefix only (SW1A), NHS numbers → removed. Keep a secure mapping only if you need to re-identify later, and store it separately from the anonymised dataset. Note: pseudonymised data (where identifiers are replaced with codes) is still personal data under UK GDPR, so treat it as sensitive.
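The secure mapping can be produced at the same time as the pseudonyms. A sketch using the same salted SHA-256 approach as the example script later in this recipe; `pseudonymise_column` and `mapping_path` are made-up names for illustration:

```python
import hashlib
import pandas as pd

def pseudonymise_column(df, col, salt, mapping_path=None):
    """Replace a column with salted-hash codes. If mapping_path is given,
    save a re-identification mapping there -- it must live in a SEPARATE,
    access-controlled location, never next to the anonymised output."""
    codes = df[col].map(
        lambda x: f"ID-{hashlib.sha256((salt + str(x)).encode()).hexdigest()[:8]}"
    )
    if mapping_path:
        (pd.DataFrame({'original': df[col], 'code': codes})
            .drop_duplicates()
            .to_csv(mapping_path, index=False))
    out = df.copy()
    out[col] = codes
    return out

df = pd.DataFrame({'name': ['Ada', 'Ada', 'Grace']})
anon = pseudonymise_column(df, 'name', salt='org-secret')
# The same input always yields the same code, so joins across tables still work.
```

Because the hash is deterministic, anyone holding the salt can regenerate the codes; that is why the salt itself needs the same protection as the mapping file.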

  5. Generalise indirect identifiers

    Reduce precision: exact DOB → age band (25-34), exact postcode → outward code (SW1A), job title → category (manager), rare conditions → broader category. The goal is k-anonymity: each person should be indistinguishable from at least k-1 others in the dataset. Excel users: You can achieve similar results using Flash Fill for pattern-based transformations, or formulas like LEFT() for postcodes and ROUND() for age banding.
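If you work in Python rather than Excel, the k-anonymity goal above can be checked directly: group by your quasi-identifier columns and look at the smallest group. `k_anonymity` is a hypothetical helper name, not a library function:

```python
import pandas as pd

def k_anonymity(df, quasi_identifiers):
    """Smallest group size when grouping by the quasi-identifier
    columns. A result of 1 means at least one person is unique."""
    return int(df.groupby(quasi_identifiers, observed=True).size().min())

df = pd.DataFrame({
    'age_band': ['25-34', '25-34', '35-44'],
    'postcode': ['SW1A', 'SW1A', 'SW1A'],
})
print(k_anonymity(df, ['age_band', 'postcode']))  # prints 1: the 35-44 record is unique
```

Run this after each round of generalisation; if the result stays below your target k, generalise further or suppress the offending records.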

  6. Handle free text carefully

    Free text fields (notes, descriptions) often contain embedded identifiers: "John's mother called..." Use find-and-replace for known names. Consider using an LLM locally to identify and redact names, locations, and identifying details from text before external processing. IMPORTANT: Automated redaction needs manual spot-checking - regex may over-redact common words that are also names (e.g., "Will", "May", "Joy").
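One way to do the manual spot-check at scale is to flag capitalised words that survived redaction. A rough sketch; `SAFE_WORDS` must be tuned to your own data, and the output is a review list for a human, not an automatic verdict:

```python
import re
import pandas as pd

SAFE_WORDS = {'The', 'A', 'An', 'NHS'}  # extend with terms common in your notes

def suspicious_tokens(texts):
    """Capitalised tokens that survived redaction -- candidates for
    manual review. Expect false positives (sentence-initial words)."""
    found = set()
    for text in texts.dropna():
        for token in re.findall(r'\b[A-Z][a-z]+\b', str(text)):
            if token not in SAFE_WORDS:
                found.add(token)
    return sorted(found)

notes = pd.Series(["[NAME] visited. Spoke to Priya about the referral."])
print(suspicious_tokens(notes))  # ['Priya', 'Spoke']: a missed name and a false positive
```

The false positives are the point: it is far cheaper for a reviewer to dismiss "Spoke" than to let "Priya" leak into an external AI service.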

  7. Test for re-identification risk

    Review your anonymised dataset: could you identify anyone? Check rare combinations. A "60-year-old trustee who joined in 1985" might be unique even without names. If individuals are still identifiable, apply more generalisation or remove those records.
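Where more generalisation is not possible, the fallback is suppression: drop any record whose quasi-identifier combination is shared by fewer than k people. A sketch with a hypothetical `suppress_small_groups` helper:

```python
import pandas as pd

def suppress_small_groups(df, quasi_identifiers, k=3):
    """Drop records whose quasi-identifier combination is shared by
    fewer than k people in the dataset."""
    group_size = (df.groupby(quasi_identifiers, observed=True)[quasi_identifiers[0]]
                    .transform('size'))
    return df[group_size >= k]

df = pd.DataFrame({
    'age_band': ['25-34'] * 3 + ['55-64'],
    'postcode': ['SW1A'] * 3 + ['EC1A'],
})
safe = suppress_small_groups(df, ['age_band', 'postcode'], k=3)
# The lone 55-64 / EC1A record is dropped; three records remain.
```

Record how many rows were suppressed: heavy suppression can bias your analysis, which is itself a limitation worth documenting in step 8.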

  8. Document and govern

    Record: what data was anonymised, what techniques were used, what risks remain, who approved it, and any restrictions on use. This creates accountability and helps with future similar requests. Store documentation with the dataset.
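The record can be as simple as a JSON file stored alongside the dataset. The field names below are a suggestion, not a standard; adapt them to your governance process:

```python
import json
from datetime import date

# Hypothetical record structure -- adjust fields to match your DPO's requirements.
record = {
    'dataset': 'sensitive_data.csv',
    'date_anonymised': date.today().isoformat(),
    'techniques': {
        'removed': ['email', 'phone', 'nhs_number'],
        'pseudonymised': ['name', 'case_id'],
        'generalised': {'dob': 'age_band', 'postcode': 'outward_code'},
        'text_redacted': ['notes', 'description'],
    },
    'residual_risks': 'Rare quasi-identifier combinations possible; checked at k=3.',
    'approved_by': 'Data protection lead',
    'use_restrictions': 'External AI summarisation only; no re-identification attempts.',
}

with open('anonymisation_record.json', 'w') as f:
    json.dump(record, f, indent=2)
```

A structured record like this also makes the next, similar request faster: you can diff the new configuration against an approved one instead of starting from scratch.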

Example code

Basic anonymisation script

Python code to anonymise common personal data fields.

import pandas as pd
import hashlib
import re

def anonymise_dataset(df, config):
    """
    Anonymise a dataset according to configuration.

    config = {
        'remove': ['email', 'phone'],  # Delete these columns
        'pseudonymise': ['name', 'case_id'],  # Replace with hashed codes
        'generalise': {
            'dob': 'age_band',  # Convert DOB to age band
            'postcode': 'outward_code',  # Keep only first part
        },
        'redact_text': ['notes'],  # Redact names from text fields
        'known_names': ['John Smith', 'Jane Doe']  # Names to redact
    }
    """
    df = df.copy()

    # Remove direct identifiers
    for col in config.get('remove', []):
        if col in df.columns:
            df = df.drop(columns=[col])

    # Pseudonymise with hashed codes
    # IMPORTANT: Add a secret salt for better security against rainbow table attacks
    # Store your salt securely and consistently - if you lose it, you can't reproduce the same IDs
    salt = config.get('salt', 'your-secret-salt-here')
    for col in config.get('pseudonymise', []):
        if col in df.columns:
            # Create consistent pseudonym using salted hash
            df[col] = df[col].apply(
                lambda x: f"ID-{hashlib.sha256((salt + str(x)).encode()).hexdigest()[:8]}"
                if pd.notna(x) else None
            )

    # Generalise dates to age bands
    if 'dob' in config.get('generalise', {}):
        if config['generalise']['dob'] == 'age_band' and 'dob' in df.columns:
            df['dob'] = pd.to_datetime(df['dob'], errors='coerce')
            ages = (pd.Timestamp.now() - df['dob']).dt.days / 365.25
            bins = [0, 18, 25, 35, 45, 55, 65, 75, 100]
            labels = ['<18', '18-24', '25-34', '35-44', '45-54', '55-64', '65-74', '75+']
            df['age_band'] = pd.cut(ages, bins=bins, labels=labels)
            df = df.drop(columns=['dob'])

    # Generalise postcodes to outward code
    if 'postcode' in config.get('generalise', {}):
        if config['generalise']['postcode'] == 'outward_code' and 'postcode' in df.columns:
            df['postcode'] = df['postcode'].str.split(' ').str[0]

    # Redact names from text fields
    for col in config.get('redact_text', []):
        if col in df.columns:
            for name in config.get('known_names', []):
                # Replace full name and first name
                df[col] = df[col].str.replace(name, '[REDACTED]', case=False, regex=False)
                # Escape in case the name contains regex metacharacters
                first_name = re.escape(name.split()[0])
                df[col] = df[col].str.replace(
                    rf'\b{first_name}\b', '[NAME]', case=False, regex=True
                )

    return df

# Example usage
config = {
    'salt': 'my-organisation-secret-2024',  # Keep this secret and consistent
    'remove': ['email', 'phone', 'nhs_number'],
    'pseudonymise': ['name', 'case_id'],
    'generalise': {
        'dob': 'age_band',
        'postcode': 'outward_code',
    },
    'redact_text': ['notes', 'description'],
    'known_names': ['John Smith', 'Jane Doe', 'Bob Wilson']  # From your data
}

df = pd.read_csv('sensitive_data.csv')
df_anon = anonymise_dataset(df, config)
df_anon.to_csv('anonymised_data.csv', index=False)

print(f"Original columns: {list(df.columns)}")
print(f"Anonymised columns: {list(df_anon.columns)}")
print("Sample of anonymised data:")
print(df_anon.head())

Tools

Python with pandas · library · free · open source
Excel · platform · freemium

At a glance

Time to implement: hours
Setup cost: free
Ongoing cost: free
Cost trend: stable
Organisation size: small, medium, large
Target audience: data analysts, IT/technical staff, operations managers

Anonymisation tools are free. The process takes staff time.

Written by Edd Baldry

Last updated: 2026-01-08

Photo by CDC on Unsplash