Clean and standardise contact data

operations · beginner · proven

The problem

Your contact database is a mess: postcodes in address fields, 'N/A' in email columns, phone numbers with inconsistent formatting (+44, 0, spaces, dashes), 'Mr' and 'mrs' and 'MS' mixing case, dates as text ('Jan 2024', '01/01/24', '2024-01-01'). Messy data means: failed mailshots (wrong address format), broken email campaigns (invalid addresses), impossible analysis (can't filter by postcode if they're in wrong fields). Cleaning manually takes weeks.

The solution

Use Python data cleaning libraries (pandas) to systematically fix common issues. The code identifies patterns: postcodes in wrong fields (matches postcode format, moves to correct column), standardises phone numbers (removes spaces, adds country code), fixes case inconsistencies (Title Case for names), validates emails (removes obvious invalids), standardises dates. Run once, fix thousands of records. You review changes, don't blindly apply.

What you get

A cleaned database with: standardised formats (all phone numbers as +44XXXXXXXXXX, all postcodes in postcode field), consistent case (names in Title Case, emails lowercase), validated data (invalid emails flagged), standardised dates (all YYYY-MM-DD). Before: 40% data quality issues. After: 95%+ clean. Plus a report showing what was fixed so you can review changes.

Before you start

  • Contact data exported as CSV
  • Understanding of what your data should look like when clean
  • Backup of original data (never work on your only copy)
  • Lawful basis to process this contact data under GDPR
  • A Google account for Colab
  • Basic willingness to run provided code (very beginner-friendly)

When to use this

  • Your database has obvious quality issues (wrong fields, inconsistent formatting)
  • You're about to do a mailshot and know addresses are messy
  • You can't analyse data because formatting is inconsistent
  • You've merged databases and formats clash

When not to use this

  • Your data is already clean - no need
  • You have under 100 contacts - clean manually
  • Your data issues are semantic not formatting (wrong person in record, outdated info) - code can't fix that
  • You don't have lawful basis under GDPR to process this contact data for data quality purposes - check your privacy notice and consent
  • You don't understand what clean data looks like for your needs - define standards first

Steps

  1. Export and backup your data

    Export contacts to CSV. Make a backup copy before cleaning - keep original safe. Never work on your only copy of data. If cleaning goes wrong, you can restart from backup. Export should include: names, addresses, postcodes, phone, email, any custom fields.
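
A minimal sketch of the backup step. The filenames here are made up for illustration (the snippet writes a tiny stand-in export so it runs as-is); in practice `contacts.csv` is your real CRM export.

```python
import csv
import shutil
from datetime import date

# Stand-in export so this sketch is self-contained -
# substitute your real CRM export file.
with open('contacts.csv', 'w', newline='') as f:
    csv.writer(f).writerows([['name', 'email'],
                             ['Jo Bloggs', 'jo@example.org']])

# Date-stamped backup filename, so repeated cleanups never overwrite each other
backup = f'contacts_backup_{date.today().isoformat()}.csv'
shutil.copy2('contacts.csv', backup)  # copy2 preserves file timestamps
print(f'Backup written to {backup}')
```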

  2. Identify common data problems

    Look at your data: what's messy? Postcodes in address field? Phone numbers with inconsistent formatting (some +44, some 0)? Case inconsistencies (mr, MR, Mr)? Invalid emails (no @, obvious typos)? Dates in multiple formats? List 5-10 issues you see repeatedly. These are what you'll fix systematically.
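
A quick way to do this look is a few pandas one-liners. The columns and sample values below are invented for illustration; run the same checks against your own export.

```python
import pandas as pd

# Tiny sample standing in for your export (columns assumed: name, email, phone)
contacts = pd.DataFrame({
    'name':  ['mr jones', 'SARAH SMITH', None],
    'email': ['JO@EXample.org', 'n/a', 'bad-email'],
    'phone': ['+44 7700 900123', '07700 900456', None],
})

# Missing values per column
print(contacts.isna().sum())

# How many emails lack an '@' - a quick signal of placeholders and typos
print((~contacts['email'].fillna('').str.contains('@')).sum())

# Distinct phone prefixes hint at how many formats are in play
print(contacts['phone'].dropna().str[:3].value_counts())
```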

  3. Define your standards

    Decide: what does clean look like? Phone numbers: +44XXXXXXXXXX format. Postcodes: uppercase, space in right place (SW1A 1AA). Names: Title Case. Emails: lowercase. Dates: YYYY-MM-DD. Having standards means you can check if cleaning worked. Document these - they become your data quality policy.
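
One way to make the standards checkable is to encode each one as a pattern. This is a sketch: the `STANDARDS` dict and `meets_standard` helper are illustrative names, and the regexes are simplified versions of the formats above - tighten them to your own policy.

```python
import re

# Each agreed standard, expressed as a regex you can test records against
STANDARDS = {
    'phone':    r'^\+44\d{10}$',                       # +44XXXXXXXXXX
    'postcode': r'^[A-Z]{1,2}\d[A-Z\d]? \d[A-Z]{2}$',  # e.g. SW1A 1AA
    'email':    r'^[^@\s]+@[^@\s]+\.[^@\s]+$',         # basic shape only
    'date':     r'^\d{4}-\d{2}-\d{2}$',                # YYYY-MM-DD
}

def meets_standard(field, value):
    """True if value matches the agreed format for this field."""
    return bool(re.match(STANDARDS[field], str(value)))

print(meets_standard('phone', '+447700900123'))  # clean
print(meets_standard('postcode', 'sw1a1aa'))     # wrong case, no space
```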

  4. Run the cleaning code

    Use the example code (adapt to your column names). It runs through systematically: moves postcodes to correct field (pattern matching), standardises phone numbers (removes spaces/dashes, adds +44), fixes case (names to Title Case), validates emails (basic format check), standardises dates. Review the changes report before saving.

  5. Review changes carefully

    Critical: don't blindly accept all changes. The code generates a report showing: what was changed, how many records affected, examples. Check: are phone number changes correct? Are postcodes really postcodes (not house numbers)? Did case fixes work (not turned 'MacGregor' to 'Macgregor')? Fix any issues in the code.
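
A useful review trick is to line up old and new values side by side and look only at rows that actually changed. The two frames here are tiny stand-ins; in practice they are the `contacts` and `cleaned` frames from the example code.

```python
import pandas as pd

# Hypothetical before/after values for one column
before = pd.DataFrame({'phone': ['07700 900123', '+447700900456']})
after  = pd.DataFrame({'phone': ['+447700900123', '+447700900456']})

# Keep only rows where the value actually changed, old next to new
changed = before['phone'] != after['phone']
diff = pd.DataFrame({'before': before.loc[changed, 'phone'],
                     'after':  after.loc[changed, 'phone']})
print(diff)
```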

  6. Handle exceptions manually

    Some issues can't be automated: clearly wrong data (email: 'none'), missing critical info, ambiguous cases (is '07' a phone number or something else?). Export the exceptions, fix manually or contact people to update. Automation does bulk work, humans handle edge cases.
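
Exporting the exceptions can be a one-liner. A sketch, assuming the cleaner has set unusable emails to `None` (as the example code does) and that an `exceptions.csv` filename suits you:

```python
import pandas as pd

# Stand-in for the cleaned frame: None means the cleaner flagged the email invalid
cleaned = pd.DataFrame({
    'name':  ['Jo Bloggs', 'Sam Patel'],
    'email': [None, 'sam@example.org'],
})

# Rows a human needs to look at
exceptions = cleaned[cleaned['email'].isna()]
exceptions.to_csv('exceptions.csv', index=False)
print(f'{len(exceptions)} records need manual attention')
```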

  7. Validate cleaned data

    Before importing back to your CRM: spot check 20-30 random records. Do they look right? Run basic checks: all postcodes in postcode field? All emails have @? Phone numbers consistent format? If validation passes, you're safe to import. If not, fix issues and revalidate.
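
The basic checks can be scripted too, so validation is repeatable each quarter. A sketch on a tiny stand-in frame - the check names and columns are illustrative:

```python
import pandas as pd

# Stand-in for your cleaned export
cleaned = pd.DataFrame({
    'email':    ['jo@example.org', 'sam@example.org'],
    'phone':    ['+447700900123', '+447700900456'],
    'postcode': ['SW1A 1AA', 'M1 1AE'],
})

# Each check must be True across every non-blank value
checks = {
    'all emails have @':       cleaned['email'].dropna().str.contains('@').all(),
    'phones start with +44':   cleaned['phone'].dropna().str.startswith('+44').all(),
    'postcodes are uppercase': (cleaned['postcode'].dropna()
                                == cleaned['postcode'].dropna().str.upper()).all(),
}

for name, passed in checks.items():
    print(f'{"PASS" if passed else "FAIL"}: {name}')
```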

  8. Import and document

    Import cleaned data back to your CRM (or save as new clean database). Document: what standards you applied, what issues you fixed, when you cleaned. This helps next time and shows data governance. Share standards with team so new data is added cleanly.

  9. Set up regular cleaning (optional)

    Data gets messy again as people add records. Run cleaning quarterly: export, run code, review, import. Build it into your data maintenance routine. Better yet: train staff on data entry standards so less cleaning is needed. Prevention beats cleaning.

Example code

Clean and standardise contact data

This systematically fixes common data quality issues. Adapt column names and cleaning rules to your needs.

import pandas as pd
import re
from datetime import datetime

# Load data
contacts = pd.read_csv('contacts.csv')
print(f"Loaded {len(contacts)} contacts")
print(f"Columns: {', '.join(contacts.columns)}")

# Create a copy for cleaning
cleaned = contacts.copy()
changes_log = []

def log_change(category, description, count):
    """Track what changes were made"""
    changes_log.append({
        'category': category,
        'description': description,
        'records_affected': count
    })

# 1. Fix postcodes in wrong fields
# UK postcode pattern
postcode_pattern = r'\b[A-Z]{1,2}[0-9][A-Z0-9]? ?[0-9][A-Z]{2}\b'  # word boundaries avoid matching inside longer tokens

# Check if address field contains postcodes
if 'address' in cleaned.columns and 'postcode' in cleaned.columns:
    # Find addresses that contain postcodes
    for idx, row in cleaned.iterrows():
        address = str(row['address'])
        # Search for postcode in address
        match = re.search(postcode_pattern, address, re.IGNORECASE)
        if match and pd.isna(row['postcode']):
            # Extract postcode from address
            postcode = match.group()
            # Remove from address, add to postcode field
            cleaned.at[idx, 'postcode'] = postcode.upper()
            cleaned.at[idx, 'address'] = address.replace(match.group(), '').strip()

    fixed_postcodes = cleaned['postcode'].notna().sum() - contacts['postcode'].notna().sum()
    if fixed_postcodes > 0:
        log_change('Postcodes', f'Moved {fixed_postcodes} postcodes from address to postcode field', fixed_postcodes)

# 2. Standardise phone numbers
if 'phone' in cleaned.columns:
    def clean_phone(phone):
        if pd.isna(phone):
            return phone

        phone = str(phone)
        # Remove common separators
        phone = phone.replace(' ', '').replace('-', '').replace('(', '').replace(')', '')

        # Handle UK numbers
        if phone.startswith('0'):
            phone = '+44' + phone[1:]  # Replace leading 0 with +44
        elif not phone.startswith('+'):
            phone = '+44' + phone  # Add +44 if missing

        return phone

    original_phones = cleaned['phone'].copy()
    cleaned['phone'] = cleaned['phone'].apply(clean_phone)
    changed_phones = (cleaned['phone'].fillna('') != original_phones.fillna('')).sum()  # fillna so NaN != NaN doesn't inflate the count
    if changed_phones > 0:
        log_change('Phone', f'Standardised {changed_phones} phone numbers to +44 format', changed_phones)

# 3. Fix name case inconsistencies
if 'name' in cleaned.columns:
    original_names = cleaned['name'].copy()
    # Title Case for names
    cleaned['name'] = cleaned['name'].str.title()
    changed_names = (cleaned['name'].fillna('') != original_names.fillna('')).sum()
    if changed_names > 0:
        log_change('Names', f'Standardised {changed_names} names to Title Case', changed_names)

# 4. Clean and validate emails
if 'email' in cleaned.columns:
    def clean_email(email):
        if pd.isna(email):
            return email

        email = str(email).lower().strip()

        # Basic validation: must have @ and domain
        if '@' not in email or '.' not in email.split('@')[1]:
            return None  # Invalid

        # Remove obvious placeholders
        if email in ['none', 'n/a', 'na', 'null', 'no email']:
            return None

        return email

    original_emails = cleaned['email'].copy()
    cleaned['email'] = cleaned['email'].apply(clean_email)
    changed_emails = (cleaned['email'].fillna('') != original_emails.fillna('')).sum()
    if changed_emails > 0:
        log_change('Emails', f'Cleaned {changed_emails} emails (lowercase, removed invalids)', changed_emails)

# 5. Standardise dates
date_columns = [col for col in cleaned.columns if 'date' in col.lower()]
for col in date_columns:
    def parse_date(date_str):
        if pd.isna(date_str):
            return date_str

        # Try multiple date formats
        formats = ['%Y-%m-%d', '%d/%m/%Y', '%d/%m/%y', '%d-%m-%Y', '%Y/%m/%d', '%b %Y', '%B %Y']

        for fmt in formats:
            try:
                parsed = datetime.strptime(str(date_str).strip(), fmt)
                return parsed.strftime('%Y-%m-%d')  # Standardise to YYYY-MM-DD
            except ValueError:
                continue

        return date_str  # Couldn't parse

    original_dates = cleaned[col].copy()
    cleaned[col] = cleaned[col].apply(parse_date)
    changed_dates = (cleaned[col].fillna('') != original_dates.fillna('')).sum()
    if changed_dates > 0:
        log_change('Dates', f'Standardised {changed_dates} dates in {col} to YYYY-MM-DD', changed_dates)

# 6. Remove leading/trailing spaces from all text fields
text_columns = cleaned.select_dtypes(include=['object']).columns
for col in text_columns:
    # Strip only real strings - .str.strip() would turn non-string objects into NaN
    cleaned[col] = cleaned[col].apply(lambda x: x.strip() if isinstance(x, str) else x)

# Generate cleaning report
print("\n" + "="*60)
print("DATA CLEANING REPORT")
print("="*60)

if len(changes_log) == 0:
    print("\nNo changes needed - data is already clean!")
else:
    print(f"\nTotal changes: {len(changes_log)} categories")
    print("\nChanges made:")
    for change in changes_log:
        print(f"\n{change['category']}:")
        print(f"  {change['description']}")

    total_affected = sum(c['records_affected'] for c in changes_log)
    print(f"\nTotal records affected: {total_affected} of {len(cleaned)}")

# Show examples of changes
print("\n" + "="*60)
print("Sample of cleaned data (first 5 rows):")
print("="*60)
print(cleaned.head().to_string())

# Save cleaned data
cleaned.to_csv('contacts_cleaned.csv', index=False)

# Save changes log
changes_df = pd.DataFrame(changes_log)
if len(changes_df) > 0:
    changes_df.to_csv('cleaning_report.csv', index=False)

print("\n" + "="*60)
print("Files saved:")
print("  contacts_cleaned.csv - cleaned data")
if len(changes_log) > 0:
    print("  cleaning_report.csv - what was changed")
print("\nNext steps:")
print("1. Review cleaned data carefully")
print("2. Spot check 20-30 random records")
print("3. Handle any exceptions manually")
print("4. Import back to CRM if validation passes")
print("5. Document standards and share with team")

Tools

Google Colab · platform · freemium
pandas · library · free · open source

At a glance

  • Time to implement: hours
  • Setup cost: free
  • Ongoing cost: free
  • Cost trend: stable
  • Organisation size: small, medium, large
  • Target audience: operations-manager, data-analyst, fundraising

All tools are free. Note that Google Colab runs your code on Google's servers, so the CSV you upload leaves your machine - check this fits your data protection obligations, or run the same pandas code on your own computer instead. One-time cleanup of 5,000 contacts: 1-2 hours vs days/weeks manually. Ongoing: run quarterly to catch new messy data. Clean data prevents wasted costs (failed mailshots, bounced emails) and enables better segmentation.
