Clean and standardise contact data
The problem
Your contact database is a mess: postcodes in address fields, 'N/A' in email columns, phone numbers with inconsistent formatting (+44, 0, spaces, dashes), 'Mr' and 'mrs' and 'MS' mixing case, dates as text ('Jan 2024', '01/01/24', '2024-01-01'). Messy data means: failed mailshots (wrong address format), broken email campaigns (invalid addresses), impossible analysis (can't filter by postcode if they're in wrong fields). Cleaning manually takes weeks.
The solution
Use Python data cleaning libraries (pandas) to systematically fix common issues. The code identifies patterns: postcodes in wrong fields (matches postcode format, moves to correct column), standardises phone numbers (removes spaces, adds country code), fixes case inconsistencies (Title Case for names), validates emails (removes obvious invalids), standardises dates. Run once, fix thousands of records. You review changes, don't blindly apply.
What you get
A cleaned database with: standardised formats (all phone numbers as +44XXXXXXXXXX, all postcodes in postcode field), consistent case (names in Title Case, emails lowercase), validated data (invalid emails flagged), standardised dates (all YYYY-MM-DD). Before: 40% data quality issues. After: 95%+ clean. Plus a report showing what was fixed so you can review changes.
Before you start
- Contact data exported as CSV
- Understanding of what your data should look like when clean
- Backup of original data (never work on your only copy)
- Lawful basis to process this contact data under GDPR
- A Google account for Colab
- Basic willingness to run provided code (very beginner-friendly)
When to use this
- Your database has obvious quality issues (wrong fields, inconsistent formatting)
- You're about to do a mailshot and know addresses are messy
- You can't analyse data because formatting is inconsistent
- You've merged databases and formats clash
When not to use this
- Your data is already clean - no need
- You have under 100 contacts - clean manually
- Your data issues are semantic not formatting (wrong person in record, outdated info) - code can't fix that
- You don't have lawful basis under GDPR to process this contact data for data quality purposes - check your privacy notice and consent
- You don't understand what clean data looks like for your needs - define standards first
Steps
- 1
Export and backup your data
Export contacts to CSV. Make a backup copy before cleaning - keep original safe. Never work on your only copy of data. If cleaning goes wrong, you can restart from backup. Export should include: names, addresses, postcodes, phone, email, any custom fields.
- 2
Identify common data problems
Look at your data: what's messy? Postcodes in address field? Phone numbers with inconsistent formatting (some +44, some 0)? Case inconsistencies (mr, MR, Mr)? Invalid emails (no @, obvious typos)? Dates in multiple formats? List 5-10 issues you see repeatedly. These are what you'll fix systematically.
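A quick profiling pass in pandas can surface these issues in seconds. The sketch below uses a tiny inline sample frame so it runs standalone; for real data, replace it with `pd.read_csv('contacts.csv')` and your own column names:

```python
import pandas as pd

# Inline sample with typical problems - swap in your CSV export
contacts = pd.DataFrame({
    'name': ['mr john smith', 'JANE DOE', 'Sam Jones'],
    'phone': ['+44 7700 900123', '07700 900456', '0770-090-0789'],
    'email': ['JOHN@EXAMPLE.COM', 'n/a', 'jane@example.org'],
    'postcode': [None, 'sw1a 1aa', 'SW1A 1AA'],
})

# Missing values per column
print(contacts.isna().sum())

# How many distinct phone formats are in play? (mask digits, count shapes)
formats = contacts['phone'].str.replace(r'\d', '#', regex=True)
print(formats.value_counts())

# Obvious email placeholders
placeholders = contacts['email'].str.lower().isin(['n/a', 'none', 'na'])
print(f"Placeholder emails: {placeholders.sum()}")
```

The digit-masking trick turns every phone number into a shape like `+## #### ######`, so `value_counts()` tells you exactly which formats you need cleaning rules for.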
- 3
Define your standards
Decide: what does clean look like? Phone numbers: +44XXXXXXXXXX format. Postcodes: uppercase, space in right place (SW1A 1AA). Names: Title Case. Emails: lowercase. Dates: YYYY-MM-DD. Having standards means you can check if cleaning worked. Document these - they become your data quality policy.
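One way to make the standards checkable is to express each one as a regular expression. This is a sketch; the `STANDARDS` dict and `meets_standard` helper are illustrative names, and the patterns should be adapted to your own policy:

```python
import re

# Hypothetical data quality policy expressed as regex checks
STANDARDS = {
    'phone': r'^\+44\d{10}$',                          # +44XXXXXXXXXX
    'postcode': r'^[A-Z]{1,2}\d[A-Z\d]? \d[A-Z]{2}$',  # e.g. SW1A 1AA
    'email': r'^[^@\s]+@[^@\s]+\.[^@\s]+$',            # basic shape only
    'date': r'^\d{4}-\d{2}-\d{2}$',                    # YYYY-MM-DD
}

def meets_standard(field, value):
    """True if a value matches the documented standard for its field."""
    return bool(re.match(STANDARDS[field], str(value)))

print(meets_standard('phone', '+447700900123'))  # expected True
print(meets_standard('postcode', 'sw1a1aa'))     # expected False
```

Documenting the standards as code means the same patterns drive both cleaning and the validation step later.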
- 4
Run the cleaning code
Use the example code (adapt to your column names). It runs through systematically: moves postcodes to correct field (pattern matching), standardises phone numbers (removes spaces/dashes, adds +44), fixes case (names to Title Case), validates emails (basic format check), standardises dates. Review the changes report before saving.
- 5
Review changes carefully
Critical: don't blindly accept all changes. The code generates a report showing: what was changed, how many records affected, examples. Check: are phone number changes correct? Are postcodes really postcodes (not house numbers)? Did case fixes work (not turned 'MacGregor' to 'Macgregor')? Fix any issues in the code.
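A per-field diff of the original export against the cleaned output makes this review much easier than scrolling two CSVs side by side. A sketch, using small inline frames as stand-ins for `contacts.csv` and `contacts_cleaned.csv`:

```python
import pandas as pd

# Stand-ins for the original export and the cleaned output
before = pd.DataFrame({'name': ['mr smith', 'MacGregor'],
                       'phone': ['07700 900456', '+447700900123']})
after = pd.DataFrame({'name': ['Mr Smith', 'Macgregor'],
                      'phone': ['+447700900456', '+447700900123']})

# Per-field diff so every individual edit can be eyeballed
changed = before.ne(after)
rows = []
for col in before.columns:
    for idx in before.index[changed[col]]:
        rows.append({'row': idx, 'field': col,
                     'before': before.at[idx, col], 'after': after.at[idx, col]})
report = pd.DataFrame(rows)
print(report.to_string(index=False))
```

In this sample the diff immediately surfaces the `'MacGregor'` to `'Macgregor'` regression from blanket Title Casing, exactly the kind of change you want to catch before importing.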
- 6
Handle exceptions manually
Some issues can't be automated: clearly wrong data (email: 'none'), missing critical info, ambiguous cases (is '07' a phone number or something else?). Export the exceptions, fix manually or contact people to update. Automation does bulk work, humans handle edge cases.
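Filtering the unresolved records into their own file keeps the manual workload contained. A sketch with illustrative values (the 13-character check assumes the `+44XXXXXXXXXX` standard; adapt the conditions and filename to your data):

```python
import pandas as pd

# Cleaned data where some records still fail checks (illustrative values)
cleaned = pd.DataFrame({
    'name': ['Jane Doe', 'John Smith', 'Sam Jones'],
    'email': ['jane@example.org', None, 'sam@example.org'],
    'phone': ['+447700900123', '+447700900456', '+4407'],  # '+4407' too short
})

# Flag rows the automated pass couldn't resolve: no email, or malformed phone
needs_review = cleaned[cleaned['email'].isna() | (cleaned['phone'].str.len() != 13)]
needs_review.to_csv('contacts_exceptions.csv', index=False)
print(f"{len(needs_review)} records need manual follow-up")
```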
- 7
Validate cleaned data
Before importing back to your CRM: spot check 20-30 random records. Do they look right? Run basic checks: all postcodes in postcode field? All emails have @? Phone numbers consistent format? If validation passes, you're safe to import. If not, fix issues and revalidate.
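The basic checks can themselves be scripted so validation is repeatable each quarter. A sketch using an inline stand-in for the cleaned file (replace with `pd.read_csv('contacts_cleaned.csv')`; the patterns assume the standards from step 3):

```python
import pandas as pd

# Hypothetical cleaned output - replace with your real cleaned file
cleaned = pd.DataFrame({
    'postcode': ['SW1A 1AA', 'EC1A 1BB'],
    'email': ['a@example.org', 'b@example.org'],
    'phone': ['+447700900123', '+447700900456'],
})

# Spot check a random sample by eye
print(cleaned.sample(n=2, random_state=1).to_string())

# Automated checks against the agreed standards (skip blanks)
all_emails_ok = cleaned['email'].dropna().str.contains('@').all()
all_phones_ok = cleaned['phone'].dropna().str.match(r'^\+44\d{10}$').all()
all_postcodes_ok = cleaned['postcode'].dropna().str.match(
    r'^[A-Z]{1,2}\d[A-Z\d]? \d[A-Z]{2}$').all()

if all_emails_ok and all_phones_ok and all_postcodes_ok:
    print("Validation passed - safe to import")
else:
    print("Validation failed - fix issues and revalidate")
```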
- 8
Import and document
Import cleaned data back to your CRM (or save as new clean database). Document: what standards you applied, what issues you fixed, when you cleaned. This helps next time and shows data governance. Share standards with team so new data is added cleanly.
- 9
Set up regular cleaning (optional)
Data gets messy again as people add records. Run cleaning quarterly: export, run code, review, import. Build it into your data maintenance routine. Better yet: train staff on data entry standards so less cleaning is needed. Prevention beats cleaning.
Example code
Clean and standardise contact data
This systematically fixes common data quality issues. Adapt column names and cleaning rules to your needs.
import pandas as pd
import re
from datetime import datetime
# Load data
contacts = pd.read_csv('contacts.csv')
print(f"Loaded {len(contacts)} contacts")
print(f"Columns: {', '.join(contacts.columns)}")
# Create a copy for cleaning
cleaned = contacts.copy()
changes_log = []
def log_change(category, description, count):
    """Track what changes were made"""
    changes_log.append({
        'category': category,
        'description': description,
        'records_affected': count
    })
# 1. Fix postcodes in wrong fields
# UK postcode pattern
postcode_pattern = r'[A-Z]{1,2}[0-9][A-Z0-9]? ?[0-9][A-Z]{2}'
# Check if address field contains postcodes
if 'address' in cleaned.columns and 'postcode' in cleaned.columns:
    # Find addresses that contain postcodes
    for idx, row in cleaned.iterrows():
        address = str(row['address'])
        # Search for postcode in address
        match = re.search(postcode_pattern, address, re.IGNORECASE)
        if match and pd.isna(row['postcode']):
            # Extract postcode from address
            postcode = match.group()
            # Remove from address, add to postcode field
            cleaned.at[idx, 'postcode'] = postcode.upper()
            cleaned.at[idx, 'address'] = address.replace(match.group(), '').strip()
    fixed_postcodes = cleaned['postcode'].notna().sum() - contacts['postcode'].notna().sum()
    if fixed_postcodes > 0:
        log_change('Postcodes', f'Moved {fixed_postcodes} postcodes from address to postcode field', fixed_postcodes)
# 2. Standardise phone numbers
if 'phone' in cleaned.columns:
    def clean_phone(phone):
        if pd.isna(phone):
            return phone
        phone = str(phone)
        # Remove common separators
        phone = phone.replace(' ', '').replace('-', '').replace('(', '').replace(')', '')
        # Handle UK numbers
        if phone.startswith('0'):
            phone = '+44' + phone[1:]  # Replace leading 0 with +44
        elif not phone.startswith('+'):
            phone = '+44' + phone  # Add +44 if missing
        return phone

    original_phones = cleaned['phone'].copy()
    cleaned['phone'] = cleaned['phone'].apply(clean_phone)
    # fillna so blank-to-blank rows aren't counted as changes
    changed_phones = (cleaned['phone'].fillna('') != original_phones.fillna('')).sum()
    if changed_phones > 0:
        log_change('Phone', f'Standardised {changed_phones} phone numbers to +44 format', changed_phones)
# 3. Fix name case inconsistencies
if 'name' in cleaned.columns:
    original_names = cleaned['name'].copy()
    # Title Case for names (caveat: turns 'MacGregor' into 'Macgregor' - review these)
    cleaned['name'] = cleaned['name'].str.title()
    changed_names = (cleaned['name'].fillna('') != original_names.fillna('')).sum()
    if changed_names > 0:
        log_change('Names', f'Standardised {changed_names} names to Title Case', changed_names)
# 4. Clean and validate emails
if 'email' in cleaned.columns:
    def clean_email(email):
        if pd.isna(email):
            return email
        email = str(email).lower().strip()
        # Remove obvious placeholders first (none of these contain '@')
        if email in ['none', 'n/a', 'na', 'null', 'no email']:
            return None
        # Basic validation: must have @ and a dot in the domain part
        if '@' not in email or '.' not in email.split('@')[1]:
            return None  # Invalid
        return email

    original_emails = cleaned['email'].copy()
    cleaned['email'] = cleaned['email'].apply(clean_email)
    # fillna so blank-to-blank rows aren't counted as changes
    changed_emails = (cleaned['email'].fillna('') != original_emails.fillna('')).sum()
    if changed_emails > 0:
        log_change('Emails', f'Cleaned {changed_emails} emails (lowercase, removed invalids)', changed_emails)
# 5. Standardise dates
date_columns = [col for col in cleaned.columns if 'date' in col.lower()]

def parse_date(date_str):
    if pd.isna(date_str):
        return date_str
    # Try multiple date formats (%d/%m/%y covers two-digit years like '01/01/24')
    formats = ['%Y-%m-%d', '%d/%m/%Y', '%d/%m/%y', '%d-%m-%Y', '%Y/%m/%d', '%b %Y', '%B %Y']
    for fmt in formats:
        try:
            parsed = datetime.strptime(str(date_str).strip(), fmt)
            return parsed.strftime('%Y-%m-%d')  # Standardise to YYYY-MM-DD
        except ValueError:
            continue
    return date_str  # Couldn't parse

for col in date_columns:
    original_dates = cleaned[col].copy()
    cleaned[col] = cleaned[col].apply(parse_date)
    changed_dates = (cleaned[col].fillna('') != original_dates.fillna('')).sum()
    if changed_dates > 0:
        log_change('Dates', f'Standardised {changed_dates} dates in {col} to YYYY-MM-DD', changed_dates)
# 6. Remove leading/trailing spaces from all text fields
text_columns = cleaned.select_dtypes(include=['object']).columns
for col in text_columns:
    # Use map so non-string values (e.g. stray numbers in a text column) survive intact
    cleaned[col] = cleaned[col].map(lambda v: v.strip() if isinstance(v, str) else v)
# Generate cleaning report
print("\n" + "="*60)
print("DATA CLEANING REPORT")
print("="*60)
if len(changes_log) == 0:
    print("\nNo changes needed - data is already clean!")
else:
    print(f"\nTotal changes: {len(changes_log)} categories")
    print("\nChanges made:")
    for change in changes_log:
        print(f"\n{change['category']}:")
        print(f"  {change['description']}")
    total_affected = sum(c['records_affected'] for c in changes_log)
    print(f"\nTotal records affected: {total_affected} of {len(cleaned)}")
# Show examples of changes
print("\n" + "="*60)
print("Sample of cleaned data (first 5 rows):")
print("="*60)
print(cleaned.head().to_string())
# Save cleaned data
cleaned.to_csv('contacts_cleaned.csv', index=False)
# Save changes log
changes_df = pd.DataFrame(changes_log)
if len(changes_df) > 0:
    changes_df.to_csv('cleaning_report.csv', index=False)
print("\n" + "="*60)
print("Files saved:")
print(" contacts_cleaned.csv - cleaned data")
if len(changes_log) > 0:
    print("  cleaning_report.csv - what was changed")
print("\nNext steps:")
print("1. Review cleaned data carefully")
print("2. Spot check 20-30 random records")
print("3. Handle any exceptions manually")
print("4. Import back to CRM if validation passes")
print("5. Document standards and share with team")
At a glance
- Time to implement
- hours
- Setup cost
- free
- Ongoing cost
- free
- Cost trend
- stable
- Organisation size
- small, medium, large
- Target audience
- operations-manager, data-analyst, fundraising
All tools are free. All processing runs locally (Google Colab) - no data sent elsewhere. One-time cleanup of 5,000 contacts: 1-2 hours vs days/weeks manually. Ongoing: run quarterly to catch new messy data. Clean data prevents wasted costs (failed mailshots, bounced emails) and enables better segmentation.