
Anonymise sensitive data for AI projects

compliance · intermediate · proven

The problem

You want to use AI with data that contains personal or sensitive information. Maybe it's service user records for analysis, or case notes you want to summarise. You can't just upload identifiable data to external AI services, but you also can't do the AI work without the data. You need to balance privacy protection with analytical utility.

The solution

Apply appropriate anonymisation techniques based on your data and use case. This ranges from simple removal of identifying fields, through pseudonymisation (replacing names with codes), to more sophisticated techniques like generalisation and suppression. The right approach depends on what you're doing and who might see the results.

What you get

A dataset that's safe to use with AI tools while remaining useful for your analysis. You'll have documented what anonymisation was applied, what risks remain, and any limitations on how the data can be used. This protects your organisation and the people in your data.

Before you start

  • Data you need to anonymise
  • Understanding of what fields are sensitive
  • Knowledge of how the anonymised data will be used
  • Ideally: input from your DPO or data protection lead

When to use this

  • Before sending data to external AI services
  • When sharing data with external analysts
  • Creating datasets for testing or development
  • Before storing data long-term for research

When not to use this

  • The AI runs entirely on-premises with proper access controls
  • You have explicit consent for identified data use
  • The data isn't personal data (already anonymous or not about people)

Steps

  1. Identify direct identifiers

    List fields that directly identify individuals: names, email addresses, phone numbers, national insurance numbers, addresses, photos. These must always be removed or replaced. There's no safe way to include these in external AI processing.
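A quick way to surface likely direct identifiers is to scan column values for telltale patterns. This is a minimal sketch assuming a pandas DataFrame; the `flag_direct_identifiers` helper and its regexes are illustrative, not exhaustive (extend them for NI numbers, addresses and so on):

```python
import re
import pandas as pd

# Illustrative patterns only -- extend for your own data
PATTERNS = {
    'email': re.compile(r'[\w.+-]+@[\w-]+\.[\w.]+'),
    'phone': re.compile(r'\b(?:\+44\s?|0)\d[\d ]{8,12}\b'),
    'nhs_number': re.compile(r'\b\d{3}[ -]?\d{3}[ -]?\d{4}\b'),
}

def flag_direct_identifiers(df, sample=100):
    """Return {column: [pattern names]} for text columns whose sampled
    values look like direct identifiers."""
    flags = {}
    for col in df.select_dtypes(include='object').columns:
        values = df[col].dropna().astype(str).head(sample)
        hits = [name for name, pat in PATTERNS.items()
                if values.str.contains(pat).any()]
        if hits:
            flags[col] = hits
    return flags

df = pd.DataFrame({'contact': ['jo@example.org'], 'notes': ['called at 9am']})
print(flag_direct_identifiers(df))  # flags 'contact' as email-like
```

This only checks a sample of values, so treat a clean result as a starting point for manual review, not proof the column is safe.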

  2. Identify indirect identifiers

    List fields that could identify someone in combination: date of birth, postcode, job title, rare conditions, unusual circumstances. "35-year-old in SW1A 1AA with rare condition X" might identify one person. These need careful handling.

  3. Choose your approach based on use case

    For aggregate analysis: remove all identifiers and generalise indirect identifiers. For individual-level analysis (like case note summarisation): pseudonymise by replacing names with codes, generalise dates and locations. For maximum protection: consider synthetic data or on-premise processing.

  4. Remove direct identifiers

    Delete or replace: names → "Person A" or "[NAME]", emails → removed, phone numbers → removed, addresses → postcode prefix only (SW1A), NHS numbers → removed. Keep a secure mapping only if you need to re-identify later, and store it separately from the anonymised dataset. Note: pseudonymised data (where identifiers are replaced with codes) is still personal data under UK GDPR, so treat it as sensitive.
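The secure mapping can be produced at the same time as the pseudonyms. A sketch using the same salted SHA-256 approach as the example script later in this recipe; `pseudonymise_column` and `mapping_path` are made-up names for illustration:

```python
import hashlib
import pandas as pd

def pseudonymise_column(df, col, salt, mapping_path=None):
    """Replace a column with salted-hash codes. If mapping_path is given,
    save a re-identification mapping there -- it must live in a SEPARATE,
    access-controlled location, never next to the anonymised output."""
    codes = df[col].map(
        lambda x: f"ID-{hashlib.sha256((salt + str(x)).encode()).hexdigest()[:8]}"
    )
    if mapping_path:
        (pd.DataFrame({'original': df[col], 'code': codes})
            .drop_duplicates()
            .to_csv(mapping_path, index=False))
    out = df.copy()
    out[col] = codes
    return out

df = pd.DataFrame({'name': ['Ada', 'Ada', 'Grace']})
anon = pseudonymise_column(df, 'name', salt='org-secret')
# The same input always yields the same code, so joins across tables still work.
```

Because the hash is deterministic, anyone holding the salt can regenerate the codes; that is why the salt itself needs the same protection as the mapping file.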

  5. Generalise indirect identifiers

    Reduce precision: exact DOB → age band (25-34), exact postcode → outward code (SW1A), job title → category (manager), rare conditions → broader category. The goal is k-anonymity: each person should be indistinguishable from at least k-1 others in the dataset. Excel users: You can achieve similar results using Flash Fill for pattern-based transformations, or formulas like LEFT() for postcodes and ROUND() for age banding.
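If you work in Python rather than Excel, the k-anonymity goal above can be checked directly: group by your quasi-identifier columns and look at the smallest group. `k_anonymity` is a hypothetical helper name, not a library function:

```python
import pandas as pd

def k_anonymity(df, quasi_identifiers):
    """Smallest group size when grouping by the quasi-identifier
    columns. A result of 1 means at least one person is unique."""
    return int(df.groupby(quasi_identifiers, observed=True).size().min())

df = pd.DataFrame({
    'age_band': ['25-34', '25-34', '35-44'],
    'postcode': ['SW1A', 'SW1A', 'SW1A'],
})
print(k_anonymity(df, ['age_band', 'postcode']))  # prints 1: the 35-44 record is unique
```

Run this after each round of generalisation; if the result stays below your target k, generalise further or suppress the offending records.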

  6. Handle free text carefully

    Free text fields (notes, descriptions) often contain embedded identifiers: "John's mother called..." Use find-and-replace for known names. Consider using an LLM locally to identify and redact names, locations, and identifying details from text before external processing. IMPORTANT: Automated redaction needs manual spot-checking - regex may over-redact common words that are also names (e.g., "Will", "May", "Joy").
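One way to do the manual spot-check at scale is to flag capitalised words that survived redaction. A rough sketch; `SAFE_WORDS` must be tuned to your own data, and the output is a review list for a human, not an automatic verdict:

```python
import re
import pandas as pd

SAFE_WORDS = {'The', 'A', 'An', 'NHS'}  # extend with terms common in your notes

def suspicious_tokens(texts):
    """Capitalised tokens that survived redaction -- candidates for
    manual review. Expect false positives (sentence-initial words)."""
    found = set()
    for text in texts.dropna():
        for token in re.findall(r'\b[A-Z][a-z]+\b', str(text)):
            if token not in SAFE_WORDS:
                found.add(token)
    return sorted(found)

notes = pd.Series(["[NAME] visited. Spoke to Priya about the referral."])
print(suspicious_tokens(notes))  # ['Priya', 'Spoke']: a missed name and a false positive
```

The false positives are the point: it is far cheaper for a reviewer to dismiss "Spoke" than to let "Priya" leak into an external AI service.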

  7. Test for re-identification risk

    Review your anonymised dataset: could you identify anyone? Check rare combinations. A "60-year-old trustee who joined in 1985" might be unique even without names. If individuals are still identifiable, apply more generalisation or remove those records.
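Where more generalisation is not possible, the fallback is suppression: drop any record whose quasi-identifier combination is shared by fewer than k people. A sketch with a hypothetical `suppress_small_groups` helper:

```python
import pandas as pd

def suppress_small_groups(df, quasi_identifiers, k=3):
    """Drop records whose quasi-identifier combination is shared by
    fewer than k people in the dataset."""
    group_size = (df.groupby(quasi_identifiers, observed=True)[quasi_identifiers[0]]
                    .transform('size'))
    return df[group_size >= k]

df = pd.DataFrame({
    'age_band': ['25-34'] * 3 + ['55-64'],
    'postcode': ['SW1A'] * 3 + ['EC1A'],
})
safe = suppress_small_groups(df, ['age_band', 'postcode'], k=3)
# The lone 55-64 / EC1A record is dropped; three records remain.
```

Record how many rows were suppressed: heavy suppression can bias your analysis, which is itself a limitation worth documenting in step 8.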

  8. Document and govern

    Record: what data was anonymised, what techniques were used, what risks remain, who approved it, and any restrictions on use. This creates accountability and helps with future similar requests. Store documentation with the dataset.
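The record can be as simple as a JSON file stored alongside the dataset. The field names below are a suggestion, not a standard; adapt them to your governance process:

```python
import json
from datetime import date

# Hypothetical record structure -- adjust fields to match your DPO's requirements.
record = {
    'dataset': 'sensitive_data.csv',
    'date_anonymised': date.today().isoformat(),
    'techniques': {
        'removed': ['email', 'phone', 'nhs_number'],
        'pseudonymised': ['name', 'case_id'],
        'generalised': {'dob': 'age_band', 'postcode': 'outward_code'},
        'text_redacted': ['notes', 'description'],
    },
    'residual_risks': 'Rare quasi-identifier combinations possible; checked at k=3.',
    'approved_by': 'Data protection lead',
    'use_restrictions': 'External AI summarisation only; no re-identification attempts.',
}

with open('anonymisation_record.json', 'w') as f:
    json.dump(record, f, indent=2)
```

A structured record like this also makes the next, similar request faster: you can diff the new configuration against an approved one instead of starting from scratch.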

Example code

Basic anonymisation script

Python code to anonymise common personal data fields.

import pandas as pd
import hashlib
import re

def anonymise_dataset(df, config):
    """
    Anonymise a dataset according to configuration.

    config = {
        'remove': ['email', 'phone'],  # Delete these columns
        'pseudonymise': ['name', 'case_id'],  # Replace with hashed codes
        'generalise': {
            'dob': 'age_band',  # Convert DOB to age band
            'postcode': 'outward_code',  # Keep only first part
        },
        'redact_text': ['notes'],  # Redact names from text fields
        'known_names': ['John Smith', 'Jane Doe']  # Names to redact
    }
    """
    df = df.copy()

    # Remove direct identifiers
    for col in config.get('remove', []):
        if col in df.columns:
            df = df.drop(columns=[col])

    # Pseudonymise with hashed codes
    # IMPORTANT: Add a secret salt for better security against rainbow table attacks
    # Store your salt securely and consistently - if you lose it, you can't reproduce the same IDs
    salt = config.get('salt', 'your-secret-salt-here')
    for col in config.get('pseudonymise', []):
        if col in df.columns:
            # Create consistent pseudonym using salted hash
            df[col] = df[col].apply(
                lambda x: f"ID-{hashlib.sha256((salt + str(x)).encode()).hexdigest()[:8]}"
                if pd.notna(x) else None
            )

    # Generalise dates to age bands
    if 'dob' in config.get('generalise', {}):
        if config['generalise']['dob'] == 'age_band' and 'dob' in df.columns:
            df['dob'] = pd.to_datetime(df['dob'], errors='coerce')
            ages = (pd.Timestamp.now() - df['dob']).dt.days / 365.25
            bins = [0, 18, 25, 35, 45, 55, 65, 75, 100]
            labels = ['<18', '18-24', '25-34', '35-44', '45-54', '55-64', '65-74', '75+']
            df['age_band'] = pd.cut(ages, bins=bins, labels=labels)
            df = df.drop(columns=['dob'])

    # Generalise postcodes to outward code
    if 'postcode' in config.get('generalise', {}):
        if config['generalise']['postcode'] == 'outward_code' and 'postcode' in df.columns:
            df['postcode'] = df['postcode'].str.split(' ').str[0]

    # Redact names from text fields
    for col in config.get('redact_text', []):
        if col in df.columns:
            for name in config.get('known_names', []):
                # Replace full name and first name
                df[col] = df[col].str.replace(name, '[REDACTED]', case=False, regex=False)
                # Escape in case the name contains regex metacharacters
                first_name = re.escape(name.split()[0])
                df[col] = df[col].str.replace(
                    rf'\b{first_name}\b', '[NAME]', case=False, regex=True
                )

    return df

# Example usage
config = {
    'salt': 'my-organisation-secret-2024',  # Keep this secret and consistent
    'remove': ['email', 'phone', 'nhs_number'],
    'pseudonymise': ['name', 'case_id'],
    'generalise': {
        'dob': 'age_band',
        'postcode': 'outward_code',
    },
    'redact_text': ['notes', 'description'],
    'known_names': ['John Smith', 'Jane Doe', 'Bob Wilson']  # From your data
}

df = pd.read_csv('sensitive_data.csv')
df_anon = anonymise_dataset(df, config)
df_anon.to_csv('anonymised_data.csv', index=False)

print(f"Original columns: {list(df.columns)}")
print(f"Anonymised columns: {list(df_anon.columns)}")
print("Sample of anonymised data:")
print(df_anon.head())

Tools

Python with pandas · library · free · open source
Excel · platform · freemium

At a glance

Time to implement: hours
Setup cost: free
Ongoing cost: free
Cost trend: stable
Organisation size: small, medium, large
Target audience: data analysts, IT/technical staff, operations managers

Anonymisation tools are free. The process takes staff time.

Written by Edd Baldry

Last updated: 2026-01-08

Photo by CDC on Unsplash