Generate synthetic test data for AI experiments
The problem
You want to experiment with AI techniques but you can't use real beneficiary or donor data. Maybe you're testing a new approach, building a demo for trustees, or learning how to use a tool. You need realistic data that looks like your actual records but doesn't contain any real people. Creating test data manually is tedious and rarely captures the messiness of real charity data.
The solution
Use AI to generate synthetic datasets that mirror the structure, patterns, and quirks of your real data without containing any actual personal information. You describe what your data looks like - the fields, the distributions, the common issues - and the AI generates hundreds or thousands of realistic fake records. This gives you safe material to experiment with, demonstrate to stakeholders, and use in training.
What you get
A synthetic dataset in CSV or Excel format that resembles your real data in structure and statistical properties. The data will include realistic names (from name generators), plausible postcodes, believable dates, and the kind of inconsistencies and gaps you'd find in real charity data. Safe to share, upload to external tools, and use in public demonstrations.
Before you start
- Knowledge of what fields your real data contains
- Understanding of typical values and patterns in your data
- Ideally: a blank template or schema of your data structure
- IMPORTANT: Never paste actual sensitive records into the AI as examples. Only describe your data structure using column headers or dummy descriptions - never real beneficiary or donor data, even as a template.
When to use this
- Testing a new AI technique before using real data
- Building demos or training materials for trustees or staff
- Learning how to use a new tool without privacy concerns
- Creating test datasets for developers building your systems
- Running workshops where participants need practice data
When not to use this
- You need data that exactly replicates real statistical distributions (use proper synthetic data tools instead)
- You have anonymised real data that would serve the purpose
- The AI model needs to learn patterns that synthetic data can't capture
Steps
- 1
Document your data structure
List all the fields in your real dataset: column names, data types, and what each contains. For example: "first_name (text), donation_amount (number, £5-£500), donation_date (date, 2020-2024), postcode (UK postcode), contact_preference (email/post/phone)".
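One way to keep this field list reusable is to hold it as a small Python dictionary and render it into the bullet list a prompt needs. A minimal sketch (the field names and descriptions below are illustrative, not from any real dataset):

```python
# Illustrative schema: describe each field once, reuse it in prompts or scripts.
schema = {
    "first_name": "text, common UK first names",
    "donation_amount": "number, £5-£500, clusters at £10/£25/£50",
    "donation_date": "date, 2020-2024",
    "postcode": "UK postcode, some lowercase or missing-space entries",
    "contact_preference": "one of email/post/phone",
}

def schema_to_prompt(schema):
    """Turn the schema dict into the bullet list used in a prompt."""
    return "\n".join(f"- {field}: {desc}" for field, desc in schema.items())

print(schema_to_prompt(schema))
```

Keeping the schema in one place means the prompt and any generation script stay in sync when a field changes.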
- 2
Describe the patterns and quirks
Real charity data has patterns. Donations cluster around £10, £25, £50. Some postcodes appear more often. 15% of records have missing phone numbers. Describe these characteristics so the synthetic data feels realistic, not artificially clean.
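The clustering described above can be sketched in code to check it feels right before you put it in a prompt. A minimal illustration; the 70% share and the fallback range are assumptions for the example, not figures from real data:

```python
import random

random.seed(42)  # reproducible for the example

# Sketch: most donations land on the common gift levels, with a messy tail.
common_amounts = [10, 25, 50]

def donation_amount():
    if random.random() < 0.7:  # assumed ~70% at the common gift levels
        return random.choice(common_amounts)
    return round(random.uniform(5, 500), 2)  # the uneven remainder

sample = [donation_amount() for _ in range(1000)]
clustered = sum(1 for a in sample if a in common_amounts)
print(f"{clustered / len(sample):.0%} of amounts at £10/£25/£50")
```

Generating a quick sample like this lets you eyeball whether the distribution matches what you see in your real records before committing to it.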
- 3
Specify the volume and format
Decide how many records you need (100 for a quick test, 1000+ for realistic analysis) and what format (CSV is usually best). If you need multiple related tables (donors and donations), specify the relationships.
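If you do need related tables, the key relationship to specify is that every child row references a parent that exists. A minimal sketch with two tables linked by donor_id (the file names and the 1-4 gifts per donor are illustrative assumptions):

```python
import csv
import random

random.seed(0)  # reproducible for the example

# Parent table: donors. Child table: donations, each row keyed to a donor_id.
donors = [{"donor_id": f"D{i:03d}", "name": f"Donor {i}"} for i in range(1, 11)]
donations = []
for donor in donors:
    for _ in range(random.randint(1, 4)):  # assumed 1-4 gifts per donor
        donations.append({
            "donor_id": donor["donor_id"],  # foreign key back to donors
            "amount": random.choice([10, 25, 50]),
        })

# Write each table to its own clearly-labelled CSV.
for path, rows in [("donors_SYNTHETIC.csv", donors),
                   ("donations_SYNTHETIC.csv", donations)]:
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
```

When asking an AI to generate related tables, state this constraint explicitly ("every donor_id in donations must appear in donors"), or the tables may not join.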
- 4
Generate using Claude or ChatGPT
Paste your specification into Claude or ChatGPT and ask it to generate the data. For small datasets (<100 rows), it can output directly. For larger datasets, ask it to write Python code using the Faker library that generates the data to your specification.
- 5
Review and iterate
Check a sample of the generated data. Does it look realistic? Are there obvious patterns that would never appear in real data? Refine your prompt and regenerate if needed. Real data is messy; synthetic data should be too.
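Part of this review can be automated: compare the rates you asked for against what the AI actually produced. A minimal sketch of such a sanity check; the three sample rows are illustrative stand-ins for your generated file:

```python
# Illustrative generated rows; in practice, load these from the CSV.
rows = [
    {"email": "a@example.org", "total_donated": "25.00"},
    {"email": "", "total_donated": "50.00"},
    {"email": "b@example.org", "total_donated": "-10.00"},
]

def missing_rate(rows, field):
    """Fraction of rows where the field is empty."""
    return sum(1 for r in rows if not r[field]) / len(rows)

def negatives(rows, field):
    """Rows with a negative numeric value (deliberate errors, or too many?)."""
    return [r for r in rows if float(r[field]) < 0]

print(f"missing emails: {missing_rate(rows, 'email'):.0%}")
print(f"negative amounts: {len(negatives(rows, 'total_donated'))}")
```

If you asked for 10% missing emails and the check reports 40%, that is a prompt to refine and regenerate.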
- 6
Document the synthetic nature
Clearly label the file as synthetic data (e.g., 'donor_data_SYNTHETIC.csv'). This prevents confusion later if someone mistakes it for real records. Include a README noting how it was generated and what it should be used for.
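The README can be written by the same script that generates the data, so the label is never forgotten. A minimal sketch (the wording and file names are illustrative):

```python
from datetime import date

# Illustrative README saved alongside the synthetic file.
readme = f"""donor_data_SYNTHETIC.csv
========================
SYNTHETIC DATA - contains no real people.
Generated: {date.today().isoformat()}
Method: Python script using the Faker library (en_GB locale).
Intended use: demos, training, and tool testing only.
"""

with open("README_SYNTHETIC.txt", "w") as f:
    f.write(readme)
```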
Example code
Example prompt for generating donor data
A prompt you can adapt for Claude or ChatGPT.
Generate 50 rows of synthetic charity donor data in CSV format.
Fields needed:
- donor_id: Sequential starting from D001
- first_name: Common UK first names
- last_name: Common UK surnames
- email: Realistic email addresses (mix of gmail, outlook, yahoo, work domains)
- postcode: Valid UK postcodes, weighted towards London (SW, SE, E, N) and Manchester (M)
- first_donation_date: Between 2018-01-01 and 2024-01-01
- total_donated: Amount in GBP, realistic distribution (many at £10-50, some at £100-500, rare £1000+)
- donation_count: Number between 1 and 20, correlated with total_donated
- contact_preference: One of "email", "post", "phone", "no contact" - 60% email, 25% post, 10% phone, 5% no contact
- last_contact_date: After first_donation_date, some should be blank (15%)
Make it realistic:
- Include some missing email addresses (10%)
- Some postcodes should have formatting inconsistencies (lowercase, missing space)
- A few donation amounts should be round numbers (£100, £250)
- Include one or two obvious data entry errors (e.g., unrealistic dates, negative amounts)
Python script for larger datasets
Use this for generating 500+ records with the Faker library.
import csv
import random
from faker import Faker

fake = Faker('en_GB')  # UK locale for realistic UK names and addresses

def generate_charity_donors(num_records=500):
    """Generate synthetic charity donor data."""
    donors = []
    # Weighted postcodes (London and Manchester more common)
    postcode_prefixes = ['SW', 'SE', 'E', 'N', 'W', 'M', 'B', 'L', 'G', 'EH', 'CF', 'BS']
    postcode_weights = [15, 12, 10, 10, 8, 12, 8, 6, 5, 5, 4, 5]
    contact_prefs = ['email', 'post', 'phone', 'no contact']
    contact_weights = [60, 25, 10, 5]
    for i in range(num_records):
        donor_id = f"D{str(i + 1).zfill(4)}"
        first_name = fake.first_name()
        last_name = fake.last_name()
        # 10% missing emails
        if random.random() < 0.1:
            email = ""
        else:
            domains = ['gmail.com', 'outlook.com', 'yahoo.co.uk', 'hotmail.com',
                       fake.company().lower().replace(' ', '') + '.co.uk']
            email = f"{first_name.lower()}.{last_name.lower()}@{random.choice(domains)}"
        # Generate postcode with occasional formatting issues
        prefix = random.choices(postcode_prefixes, weights=postcode_weights)[0]
        postcode = (f"{prefix}{random.randint(1, 20)} {random.randint(1, 9)}"
                    f"{fake.random_uppercase_letter()}{fake.random_uppercase_letter()}")
        if random.random() < 0.05:  # 5% formatting inconsistencies
            postcode = postcode.lower() if random.random() < 0.5 else postcode.replace(' ', '')
        # Donation patterns - log-normal distribution feels realistic
        total_donated = round(random.lognormvariate(4, 1), 2)  # Most £20-100, some higher
        if random.random() < 0.15:  # 15% round numbers
            total_donated = round(total_donated / 50) * 50
        total_donated = max(5, min(10000, total_donated))  # Clamp to realistic range
        donation_count = max(1, int(total_donated / random.uniform(20, 80)))
        first_donation = fake.date_between(start_date='-6y', end_date='-6m')
        # 15% missing last contact
        if random.random() < 0.15:
            last_contact = ""
        else:
            last_contact = fake.date_between(start_date=first_donation, end_date='today')
        contact_pref = random.choices(contact_prefs, weights=contact_weights)[0]
        donors.append({
            'donor_id': donor_id,
            'first_name': first_name,
            'last_name': last_name,
            'email': email,
            'postcode': postcode,
            'first_donation_date': str(first_donation),
            'total_donated': total_donated,
            'donation_count': donation_count,
            'contact_preference': contact_pref,
            'last_contact_date': str(last_contact) if last_contact else "",
        })
        # Add occasional data errors (1%)
        if random.random() < 0.01:
            donors[-1]['total_donated'] = -abs(donors[-1]['total_donated'])  # Negative amount error
    return donors

# Generate and save
donors = generate_charity_donors(500)
with open('donor_data_SYNTHETIC.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=donors[0].keys())
    writer.writeheader()
    writer.writerows(donors)

print(f"Generated {len(donors)} synthetic donor records")
print("Sample record:", donors[0])
Tools
Resources
At a glance
- Time to implement: hours
- Setup cost: free
- Ongoing cost: free
- Cost trend: stable
- Organisation size: micro, small, medium, large
- Target audience: data-analyst, it-technical, operations-manager
Free tiers of Claude or ChatGPT work fine for generating test data.