Generate synthetic test data for AI experiments
The problem
You want to experiment with AI techniques but you can't use real beneficiary or donor data. Maybe you're testing a new approach, building a demo for trustees, or learning how to use a tool. You need realistic data that looks like your actual records but doesn't contain any real people. Creating test data manually is tedious and rarely captures the messiness of real charity data.
The solution
Use AI to generate synthetic datasets that mirror the structure, patterns, and quirks of your real data without containing any actual personal information. You describe what your data looks like - the fields, the distributions, the common issues - and the AI generates hundreds or thousands of realistic fake records. This gives you safe material to experiment with, demonstrate to stakeholders, and use in training.
What you get
A synthetic dataset in CSV or Excel format that resembles your real data in structure and statistical properties. The data will include realistic names (from name generators), plausible postcodes, believable dates, and the kind of inconsistencies and gaps you'd find in real charity data. Safe to share, upload to external tools, and use in public demonstrations.
Before you start
- Knowledge of what fields your real data contains
- Understanding of typical values and patterns in your data
- Ideally: a blank template or schema of your data structure
- IMPORTANT: Never paste actual sensitive records into the AI as examples. Only describe your data structure using column headers or dummy descriptions - never real beneficiary or donor data, even as a template.
When to use this
- Testing a new AI technique before using real data
- Building demos or training materials for trustees or staff
- Learning how to use a new tool without privacy concerns
- Creating test datasets for developers building your systems
- Running workshops where participants need practice data
When not to use this
- You need data that exactly replicates real statistical distributions (use proper synthetic data tools instead)
- You have anonymised real data that would serve the purpose
- The AI model needs to learn patterns that synthetic data can't capture
Steps
- 1
Document your data structure
List all the fields in your real dataset: column names, data types, and what each contains. For example: "first_name (text), donation_amount (number, £5-£500), donation_date (date, 2020-2024), postcode (UK postcode), contact_preference (email/post/phone)".
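One way to keep this field list reusable is to hold it as a small Python dictionary and render it into the bullet list a prompt needs. A minimal sketch (the field names and descriptions below are illustrative, not from any real dataset):

```python
# Illustrative schema: describe each field once, reuse it in prompts or scripts.
schema = {
    "first_name": "text, common UK first names",
    "donation_amount": "number, £5-£500, clusters at £10/£25/£50",
    "donation_date": "date, 2020-2024",
    "postcode": "UK postcode, some lowercase or missing-space entries",
    "contact_preference": "one of email/post/phone",
}

def schema_to_prompt(schema):
    """Turn the schema dict into the bullet list used in a prompt."""
    return "\n".join(f"- {field}: {desc}" for field, desc in schema.items())

print(schema_to_prompt(schema))
```

Keeping the schema in one place means the prompt and any generation script stay in sync when a field changes.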
- 2
Describe the patterns and quirks
Real charity data has patterns. Donations cluster around £10, £25, £50. Some postcodes appear more often. 15% of records have missing phone numbers. Describe these characteristics so the synthetic data feels realistic, not artificially clean.
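The clustering described above can be sketched in code to check it feels right before you put it in a prompt. A minimal illustration; the 70% share and the fallback range are assumptions for the example, not figures from real data:

```python
import random

random.seed(42)  # reproducible for the example

# Sketch: most donations land on the common gift levels, with a messy tail.
common_amounts = [10, 25, 50]

def donation_amount():
    if random.random() < 0.7:  # assumed ~70% at the common gift levels
        return random.choice(common_amounts)
    return round(random.uniform(5, 500), 2)  # the uneven remainder

sample = [donation_amount() for _ in range(1000)]
clustered = sum(1 for a in sample if a in common_amounts)
print(f"{clustered / len(sample):.0%} of amounts at £10/£25/£50")
```

Generating a quick sample like this lets you eyeball whether the distribution matches what you see in your real records before committing to it.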
- 3
Specify the volume and format
Decide how many records you need (100 for a quick test, 1000+ for realistic analysis) and what format (CSV is usually best). If you need multiple related tables (donors and donations), specify the relationships.
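If you do need related tables, the key relationship to specify is that every child row references a parent that exists. A minimal sketch with two tables linked by donor_id (the file names and the 1-4 gifts per donor are illustrative assumptions):

```python
import csv
import random

random.seed(0)  # reproducible for the example

# Parent table: donors. Child table: donations, each row keyed to a donor_id.
donors = [{"donor_id": f"D{i:03d}", "name": f"Donor {i}"} for i in range(1, 11)]
donations = []
for donor in donors:
    for _ in range(random.randint(1, 4)):  # assumed 1-4 gifts per donor
        donations.append({
            "donor_id": donor["donor_id"],  # foreign key back to donors
            "amount": random.choice([10, 25, 50]),
        })

# Write each table to its own clearly-labelled CSV.
for path, rows in [("donors_SYNTHETIC.csv", donors),
                   ("donations_SYNTHETIC.csv", donations)]:
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
```

When asking an AI to generate related tables, state this constraint explicitly ("every donor_id in donations must appear in donors"), or the tables may not join.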
- 4
Generate using Claude or ChatGPT
Paste your specification into Claude or ChatGPT and ask it to generate the data. For small datasets (<100 rows), it can output directly. For larger datasets, ask it to write Python code using the Faker library that generates the data to your specification.
- 5
Review and iterate
Check a sample of the generated data. Does it look realistic? Are there obvious patterns that would never appear in real data? Refine your prompt and regenerate if needed. Real data is messy; synthetic data should be too.
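Part of this review can be automated: compare the rates you asked for against what the AI actually produced. A minimal sketch of such a sanity check; the three sample rows are illustrative stand-ins for your generated file:

```python
# Illustrative generated rows; in practice, load these from the CSV.
rows = [
    {"email": "a@example.org", "total_donated": "25.00"},
    {"email": "", "total_donated": "50.00"},
    {"email": "b@example.org", "total_donated": "-10.00"},
]

def missing_rate(rows, field):
    """Fraction of rows where the field is empty."""
    return sum(1 for r in rows if not r[field]) / len(rows)

def negatives(rows, field):
    """Rows with a negative numeric value (deliberate errors, or too many?)."""
    return [r for r in rows if float(r[field]) < 0]

print(f"missing emails: {missing_rate(rows, 'email'):.0%}")
print(f"negative amounts: {len(negatives(rows, 'total_donated'))}")
```

If you asked for 10% missing emails and the check reports 40%, that is a prompt to refine and regenerate.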
- 6
Document the synthetic nature
Clearly label the file as synthetic data (e.g., 'donor_data_SYNTHETIC.csv'). This prevents confusion later if someone mistakes it for real records. Include a README noting how it was generated and what it should be used for.
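The README can be written by the same script that generates the data, so the label is never forgotten. A minimal sketch (the wording and file names are illustrative):

```python
from datetime import date

# Illustrative README saved alongside the synthetic file.
readme = f"""donor_data_SYNTHETIC.csv
========================
SYNTHETIC DATA - contains no real people.
Generated: {date.today().isoformat()}
Method: Python script using the Faker library (en_GB locale).
Intended use: demos, training, and tool testing only.
"""

with open("README_SYNTHETIC.txt", "w") as f:
    f.write(readme)
```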
Example code
Example prompt for generating donor data
A prompt you can adapt for Claude or ChatGPT.
Generate 50 rows of synthetic charity donor data in CSV format.
Fields needed:
- donor_id: Sequential starting from D001
- first_name: Common UK first names
- last_name: Common UK surnames
- email: Realistic email addresses (mix of gmail, outlook, yahoo, work domains)
- postcode: Valid UK postcodes, weighted towards London (SW, SE, E, N) and Manchester (M)
- first_donation_date: Between 2018-01-01 and 2024-01-01
- total_donated: Amount in GBP, realistic distribution (many at £10-50, some at £100-500, rare £1000+)
- donation_count: Number between 1 and 20, correlated with total_donated
- contact_preference: One of "email", "post", "phone", "no contact" - 60% email, 25% post, 10% phone, 5% no contact
- last_contact_date: After first_donation_date, some should be blank (15%)
Make it realistic:
- Include some missing email addresses (10%)
- Some postcodes should have formatting inconsistencies (lowercase, missing space)
- A few donation amounts should be round numbers (£100, £250)
- Include one or two obvious data entry errors (e.g., unrealistic dates, negative amounts)
Python script for larger datasets
Use this for generating 500+ records with the Faker library.
import csv
import random
from faker import Faker

fake = Faker('en_GB')  # UK locale for realistic UK names and addresses

def generate_charity_donors(num_records=500):
    """Generate synthetic charity donor data."""
    donors = []
    # Weighted postcodes (London and Manchester more common)
    postcode_prefixes = ['SW', 'SE', 'E', 'N', 'W', 'M', 'B', 'L', 'G', 'EH', 'CF', 'BS']
    postcode_weights = [15, 12, 10, 10, 8, 12, 8, 6, 5, 5, 4, 5]
    contact_prefs = ['email', 'post', 'phone', 'no contact']
    contact_weights = [60, 25, 10, 5]
    for i in range(num_records):
        donor_id = f"D{str(i + 1).zfill(4)}"
        first_name = fake.first_name()
        last_name = fake.last_name()
        # 10% missing emails
        if random.random() < 0.1:
            email = ""
        else:
            domains = ['gmail.com', 'outlook.com', 'yahoo.co.uk', 'hotmail.com',
                       fake.company().lower().replace(' ', '') + '.co.uk']
            email = f"{first_name.lower()}.{last_name.lower()}@{random.choice(domains)}"
        # Generate postcode with occasional formatting issues
        prefix = random.choices(postcode_prefixes, weights=postcode_weights)[0]
        postcode = (f"{prefix}{random.randint(1, 20)} {random.randint(1, 9)}"
                    f"{fake.random_uppercase_letter()}{fake.random_uppercase_letter()}")
        if random.random() < 0.05:  # 5% formatting inconsistencies
            postcode = postcode.lower() if random.random() < 0.5 else postcode.replace(' ', '')
        # Donation patterns - log-normal distribution feels realistic
        total_donated = round(random.lognormvariate(4, 1), 2)  # Most £20-100, some higher
        if random.random() < 0.15:  # 15% round numbers
            total_donated = round(total_donated / 50) * 50
        total_donated = max(5, min(10000, total_donated))  # Clamp to realistic range
        donation_count = max(1, int(total_donated / random.uniform(20, 80)))
        first_donation = fake.date_between(start_date='-6y', end_date='-6m')
        # 15% missing last contact
        if random.random() < 0.15:
            last_contact = ""
        else:
            last_contact = fake.date_between(start_date=first_donation, end_date='today')
        contact_pref = random.choices(contact_prefs, weights=contact_weights)[0]
        donors.append({
            'donor_id': donor_id,
            'first_name': first_name,
            'last_name': last_name,
            'email': email,
            'postcode': postcode,
            'first_donation_date': str(first_donation),
            'total_donated': total_donated,
            'donation_count': donation_count,
            'contact_preference': contact_pref,
            'last_contact_date': str(last_contact) if last_contact else "",
        })
        # Add occasional data errors (1%)
        if random.random() < 0.01:
            donors[-1]['total_donated'] = -abs(donors[-1]['total_donated'])  # Negative amount error
    return donors

# Generate and save
donors = generate_charity_donors(500)
with open('donor_data_SYNTHETIC.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=donors[0].keys())
    writer.writeheader()
    writer.writerows(donors)

print(f"Generated {len(donors)} synthetic donor records")
print("Sample record:", donors[0])
Tools
Resources
At a glance
- Time to implement: hours
- Setup cost: free
- Ongoing cost: free
- Cost trend: stable
- Organisation size: micro, small, medium, large
- Target audience: data-analyst, it-technical, operations-manager
Free tiers of Claude or ChatGPT work fine for generating test data.