Clean and standardise contact data
The problem
Your contact database is a mess: postcodes in address fields, 'N/A' in email columns, phone numbers with inconsistent formatting (+44, 0, spaces, dashes), 'Mr' and 'mrs' and 'MS' mixing case, dates as text ('Jan 2024', '01/01/24', '2024-01-01'). Messy data means: failed mailshots (wrong address format), broken email campaigns (invalid addresses), impossible analysis (can't filter by postcode if they're in wrong fields). Cleaning manually takes weeks.
The solution
Use Python data cleaning libraries (pandas) to systematically fix common issues. The code identifies patterns: postcodes in wrong fields (matches postcode format, moves to correct column), standardises phone numbers (removes spaces, adds country code), fixes case inconsistencies (Title Case for names), validates emails (removes obvious invalids), standardises dates. Run once, fix thousands of records. You review changes, don't blindly apply.
What you get
A cleaned database with: standardised formats (all phone numbers as +44XXXXXXXXXX, all postcodes in postcode field), consistent case (names in Title Case, emails lowercase), validated data (invalid emails flagged), standardised dates (all YYYY-MM-DD). Before: 40% data quality issues. After: 95%+ clean. Plus a report showing what was fixed so you can review changes.
Before you start
- Contact data exported as CSV
- Understanding of what your data should look like when clean
- Backup of original data (never work on your only copy)
- Lawful basis to process this contact data under GDPR
- A Google account for Colab
- Basic willingness to run provided code (very beginner-friendly)
When to use this
- Your database has obvious quality issues (wrong fields, inconsistent formatting)
- You're about to do a mailshot and know addresses are messy
- You can't analyse data because formatting is inconsistent
- You've merged databases and formats clash
When not to use this
- Your data is already clean - no need
- You have under 100 contacts - clean manually
- Your data issues are semantic not formatting (wrong person in record, outdated info) - code can't fix that
- You don't have lawful basis under GDPR to process this contact data for data quality purposes - check your privacy notice and consent
- You don't understand what clean data looks like for your needs - define standards first
Steps
- 1
Export and backup your data
Export contacts to CSV. Make a backup copy before cleaning - keep original safe. Never work on your only copy of data. If cleaning goes wrong, you can restart from backup. Export should include: names, addresses, postcodes, phone, email, any custom fields.
- 2
Identify common data problems
Look at your data: what's messy? Postcodes in address field? Phone numbers with inconsistent formatting (some +44, some 0)? Case inconsistencies (mr, MR, Mr)? Invalid emails (no @, obvious typos)? Dates in multiple formats? List 5-10 issues you see repeatedly. These are what you'll fix systematically.
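A quick profiling pass in pandas can surface these issues in seconds. The sketch below uses a tiny inline sample frame so it runs standalone; for real data, replace it with `pd.read_csv('contacts.csv')` and your own column names:

```python
import pandas as pd

# Inline sample with typical problems - swap in your CSV export
contacts = pd.DataFrame({
    'name': ['mr john smith', 'JANE DOE', 'Sam Jones'],
    'phone': ['+44 7700 900123', '07700 900456', '0770-090-0789'],
    'email': ['JOHN@EXAMPLE.COM', 'n/a', 'jane@example.org'],
    'postcode': [None, 'sw1a 1aa', 'SW1A 1AA'],
})

# Missing values per column
print(contacts.isna().sum())

# How many distinct phone formats are in play? (mask digits, count shapes)
formats = contacts['phone'].str.replace(r'\d', '#', regex=True)
print(formats.value_counts())

# Obvious email placeholders
placeholders = contacts['email'].str.lower().isin(['n/a', 'none', 'na'])
print(f"Placeholder emails: {placeholders.sum()}")
```

The digit-masking trick turns every phone number into a shape like `+## #### ######`, so `value_counts()` tells you exactly which formats you need cleaning rules for.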
- 3
Define your standards
Decide: what does clean look like? Phone numbers: +44XXXXXXXXXX format. Postcodes: uppercase, space in right place (SW1A 1AA). Names: Title Case. Emails: lowercase. Dates: YYYY-MM-DD. Having standards means you can check if cleaning worked. Document these - they become your data quality policy.
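One way to make the standards checkable is to express each one as a regular expression. This is a sketch; the `STANDARDS` dict and `meets_standard` helper are illustrative names, and the patterns should be adapted to your own policy:

```python
import re

# Hypothetical data quality policy expressed as regex checks
STANDARDS = {
    'phone': r'^\+44\d{10}$',                          # +44XXXXXXXXXX
    'postcode': r'^[A-Z]{1,2}\d[A-Z\d]? \d[A-Z]{2}$',  # e.g. SW1A 1AA
    'email': r'^[^@\s]+@[^@\s]+\.[^@\s]+$',            # basic shape only
    'date': r'^\d{4}-\d{2}-\d{2}$',                    # YYYY-MM-DD
}

def meets_standard(field, value):
    """True if a value matches the documented standard for its field."""
    return bool(re.match(STANDARDS[field], str(value)))

print(meets_standard('phone', '+447700900123'))  # expected True
print(meets_standard('postcode', 'sw1a1aa'))     # expected False
```

Documenting the standards as code means the same patterns drive both cleaning and the validation step later.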
- 4
Run the cleaning code
Use the example code (adapt to your column names). It runs through systematically: moves postcodes to correct field (pattern matching), standardises phone numbers (removes spaces/dashes, adds +44), fixes case (names to Title Case), validates emails (basic format check), standardises dates. Review the changes report before saving.
- 5
Review changes carefully
Critical: don't blindly accept all changes. The code generates a report showing: what was changed, how many records affected, examples. Check: are phone number changes correct? Are postcodes really postcodes (not house numbers)? Did case fixes work (not turned 'MacGregor' to 'Macgregor')? Fix any issues in the code.
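A per-field diff of the original export against the cleaned output makes this review much easier than scrolling two CSVs side by side. A sketch, using small inline frames as stand-ins for `contacts.csv` and `contacts_cleaned.csv`:

```python
import pandas as pd

# Stand-ins for the original export and the cleaned output
before = pd.DataFrame({'name': ['mr smith', 'MacGregor'],
                       'phone': ['07700 900456', '+447700900123']})
after = pd.DataFrame({'name': ['Mr Smith', 'Macgregor'],
                      'phone': ['+447700900456', '+447700900123']})

# Per-field diff so every individual edit can be eyeballed
changed = before.ne(after)
rows = []
for col in before.columns:
    for idx in before.index[changed[col]]:
        rows.append({'row': idx, 'field': col,
                     'before': before.at[idx, col], 'after': after.at[idx, col]})
report = pd.DataFrame(rows)
print(report.to_string(index=False))
```

In this sample the diff immediately surfaces the `'MacGregor'` to `'Macgregor'` regression from blanket Title Casing, exactly the kind of change you want to catch before importing.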
- 6
Handle exceptions manually
Some issues can't be automated: clearly wrong data (email: 'none'), missing critical info, ambiguous cases (is '07' a phone number or something else?). Export the exceptions, fix manually or contact people to update. Automation does bulk work, humans handle edge cases.
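Filtering the unresolved records into their own file keeps the manual workload contained. A sketch with illustrative values (the 13-character check assumes the `+44XXXXXXXXXX` standard; adapt the conditions and filename to your data):

```python
import pandas as pd

# Cleaned data where some records still fail checks (illustrative values)
cleaned = pd.DataFrame({
    'name': ['Jane Doe', 'John Smith', 'Sam Jones'],
    'email': ['jane@example.org', None, 'sam@example.org'],
    'phone': ['+447700900123', '+447700900456', '+4407'],  # '+4407' too short
})

# Flag rows the automated pass couldn't resolve: no email, or malformed phone
needs_review = cleaned[cleaned['email'].isna() | (cleaned['phone'].str.len() != 13)]
needs_review.to_csv('contacts_exceptions.csv', index=False)
print(f"{len(needs_review)} records need manual follow-up")
```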
- 7
Validate cleaned data
Before importing back to your CRM: spot check 20-30 random records. Do they look right? Run basic checks: all postcodes in postcode field? All emails have @? Phone numbers consistent format? If validation passes, you're safe to import. If not, fix issues and revalidate.
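The basic checks can themselves be scripted so validation is repeatable each quarter. A sketch using an inline stand-in for the cleaned file (replace with `pd.read_csv('contacts_cleaned.csv')`; the patterns assume the standards from step 3):

```python
import pandas as pd

# Hypothetical cleaned output - replace with your real cleaned file
cleaned = pd.DataFrame({
    'postcode': ['SW1A 1AA', 'EC1A 1BB'],
    'email': ['a@example.org', 'b@example.org'],
    'phone': ['+447700900123', '+447700900456'],
})

# Spot check a random sample by eye
print(cleaned.sample(n=2, random_state=1).to_string())

# Automated checks against the agreed standards (skip blanks)
all_emails_ok = cleaned['email'].dropna().str.contains('@').all()
all_phones_ok = cleaned['phone'].dropna().str.match(r'^\+44\d{10}$').all()
all_postcodes_ok = cleaned['postcode'].dropna().str.match(
    r'^[A-Z]{1,2}\d[A-Z\d]? \d[A-Z]{2}$').all()

if all_emails_ok and all_phones_ok and all_postcodes_ok:
    print("Validation passed - safe to import")
else:
    print("Validation failed - fix issues and revalidate")
```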
- 8
Import and document
Import cleaned data back to your CRM (or save as new clean database). Document: what standards you applied, what issues you fixed, when you cleaned. This helps next time and shows data governance. Share standards with team so new data is added cleanly.
- 9
Set up regular cleaning (optional)
Data gets messy again as people add records. Run cleaning quarterly: export, run code, review, import. Build it into your data maintenance routine. Better yet: train staff on data entry standards so less cleaning is needed. Prevention beats cleaning.
Example code
Clean and standardise contact data
This systematically fixes common data quality issues. Adapt column names and cleaning rules to your needs.
import pandas as pd
import re
from datetime import datetime
# Load data
contacts = pd.read_csv('contacts.csv')
print(f"Loaded {len(contacts)} contacts")
print(f"Columns: {', '.join(contacts.columns)}")
# Create a copy for cleaning
cleaned = contacts.copy()
changes_log = []
def log_change(category, description, count):
    """Track what changes were made"""
    changes_log.append({
        'category': category,
        'description': description,
        'records_affected': count
    })
# 1. Fix postcodes in wrong fields
# UK postcode pattern
postcode_pattern = r'[A-Z]{1,2}[0-9][A-Z0-9]? ?[0-9][A-Z]{2}'
# Check if address field contains postcodes
if 'address' in cleaned.columns and 'postcode' in cleaned.columns:
    # Find addresses that contain postcodes
    for idx, row in cleaned.iterrows():
        address = str(row['address'])
        # Search for postcode in address
        match = re.search(postcode_pattern, address, re.IGNORECASE)
        if match and pd.isna(row['postcode']):
            # Extract postcode from address
            postcode = match.group()
            # Remove from address, add to postcode field
            cleaned.at[idx, 'postcode'] = postcode.upper()
            cleaned.at[idx, 'address'] = address.replace(match.group(), '').strip()
    fixed_postcodes = cleaned['postcode'].notna().sum() - contacts['postcode'].notna().sum()
    if fixed_postcodes > 0:
        log_change('Postcodes', f'Moved {fixed_postcodes} postcodes from address to postcode field', fixed_postcodes)
# 2. Standardise phone numbers
if 'phone' in cleaned.columns:
    def clean_phone(phone):
        if pd.isna(phone):
            return phone
        phone = str(phone)
        # Remove common separators
        phone = phone.replace(' ', '').replace('-', '').replace('(', '').replace(')', '')
        # Handle UK numbers
        if phone.startswith('0'):
            phone = '+44' + phone[1:]  # Replace leading 0 with +44
        elif not phone.startswith('+'):
            phone = '+44' + phone  # Add +44 if missing
        return phone

    original_phones = cleaned['phone'].copy()
    cleaned['phone'] = cleaned['phone'].apply(clean_phone)
    # fillna so blank-to-blank rows aren't counted as changes
    changed_phones = (cleaned['phone'].fillna('') != original_phones.fillna('')).sum()
    if changed_phones > 0:
        log_change('Phone', f'Standardised {changed_phones} phone numbers to +44 format', changed_phones)
# 3. Fix name case inconsistencies
if 'name' in cleaned.columns:
    original_names = cleaned['name'].copy()
    # Title Case for names (caveat: turns 'MacGregor' into 'Macgregor' - review these)
    cleaned['name'] = cleaned['name'].str.title()
    changed_names = (cleaned['name'].fillna('') != original_names.fillna('')).sum()
    if changed_names > 0:
        log_change('Names', f'Standardised {changed_names} names to Title Case', changed_names)
# 4. Clean and validate emails
if 'email' in cleaned.columns:
    def clean_email(email):
        if pd.isna(email):
            return email
        email = str(email).lower().strip()
        # Remove obvious placeholders first (none of these contain '@')
        if email in ['none', 'n/a', 'na', 'null', 'no email']:
            return None
        # Basic validation: must have @ and a dot in the domain part
        if '@' not in email or '.' not in email.split('@')[1]:
            return None  # Invalid
        return email

    original_emails = cleaned['email'].copy()
    cleaned['email'] = cleaned['email'].apply(clean_email)
    # fillna so blank-to-blank rows aren't counted as changes
    changed_emails = (cleaned['email'].fillna('') != original_emails.fillna('')).sum()
    if changed_emails > 0:
        log_change('Emails', f'Cleaned {changed_emails} emails (lowercase, removed invalids)', changed_emails)
# 5. Standardise dates
date_columns = [col for col in cleaned.columns if 'date' in col.lower()]

def parse_date(date_str):
    if pd.isna(date_str):
        return date_str
    # Try multiple date formats (%d/%m/%y covers two-digit years like '01/01/24')
    formats = ['%Y-%m-%d', '%d/%m/%Y', '%d/%m/%y', '%d-%m-%Y', '%Y/%m/%d', '%b %Y', '%B %Y']
    for fmt in formats:
        try:
            parsed = datetime.strptime(str(date_str).strip(), fmt)
            return parsed.strftime('%Y-%m-%d')  # Standardise to YYYY-MM-DD
        except ValueError:
            continue
    return date_str  # Couldn't parse

for col in date_columns:
    original_dates = cleaned[col].copy()
    cleaned[col] = cleaned[col].apply(parse_date)
    changed_dates = (cleaned[col].fillna('') != original_dates.fillna('')).sum()
    if changed_dates > 0:
        log_change('Dates', f'Standardised {changed_dates} dates in {col} to YYYY-MM-DD', changed_dates)
# 6. Remove leading/trailing spaces from all text fields
text_columns = cleaned.select_dtypes(include=['object']).columns
for col in text_columns:
    # Use map so non-string values (e.g. stray numbers in a text column) survive intact
    cleaned[col] = cleaned[col].map(lambda v: v.strip() if isinstance(v, str) else v)
# Generate cleaning report
print("\n" + "="*60)
print("DATA CLEANING REPORT")
print("="*60)
if len(changes_log) == 0:
    print("\nNo changes needed - data is already clean!")
else:
    print(f"\nTotal changes: {len(changes_log)} categories")
    print("\nChanges made:")
    for change in changes_log:
        print(f"\n{change['category']}:")
        print(f"  {change['description']}")
    total_affected = sum(c['records_affected'] for c in changes_log)
    print(f"\nTotal records affected: {total_affected} of {len(cleaned)}")
# Show examples of changes
print("\n" + "="*60)
print("Sample of cleaned data (first 5 rows):")
print("="*60)
print(cleaned.head().to_string())
# Save cleaned data
cleaned.to_csv('contacts_cleaned.csv', index=False)
# Save changes log
changes_df = pd.DataFrame(changes_log)
if len(changes_df) > 0:
    changes_df.to_csv('cleaning_report.csv', index=False)
print("\n" + "="*60)
print("Files saved:")
print(" contacts_cleaned.csv - cleaned data")
if len(changes_log) > 0:
    print("  cleaning_report.csv - what was changed")
print("\nNext steps:")
print("1. Review cleaned data carefully")
print("2. Spot check 20-30 random records")
print("3. Handle any exceptions manually")
print("4. Import back to CRM if validation passes")
print("5. Document standards and share with team")
At a glance
- Time to implement
- hours
- Setup cost
- free
- Ongoing cost
- free
- Cost trend
- stable
- Organisation size
- small, medium, large
- Target audience
- operations-manager, data-analyst, fundraising
All tools are free. All processing runs locally (Google Colab) - no data sent elsewhere. One-time cleanup of 5,000 contacts: 1-2 hours vs days/weeks manually. Ongoing: run quarterly to catch new messy data. Clean data prevents wasted costs (failed mailshots, bounced emails) and enables better segmentation.