Detect duplicate records in your database
The problem
Your donor database has John Smith on Oak Street and J. Smith on Oak St - same person or different? Sarah Johnson and Sara Johnstone with similar postcodes? Your CRM is full of potential duplicates, but you can't spot them all manually. Duplicates mean double-counted beneficiaries, duplicate mailings, wasted resources and inaccurate reporting.
The solution
Use fuzzy matching algorithms to find probable duplicates even when names are misspelt, addresses abbreviated, or data formatted differently. The algorithm scores similarity between records (name, address, postcode, email, phone) and flags likely matches. 'John Smith, 10 Oak Street' and 'J. Smith, 10 Oak St' might score around 90% similar - probably duplicates. You review high-scoring pairs rather than checking thousands of records manually.
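A quick way to see the idea in action, using the fuzzywuzzy library that the example code below relies on (the exact score depends on the library version):

from fuzzywuzzy import fuzz

# Compare the two example records as single strings
record_a = "John Smith, 10 Oak Street"
record_b = "J. Smith, 10 Oak St"
print(fuzz.ratio(record_a.lower(), record_b.lower()))  # a high score (80+) suggests a probable duplicate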
What you get
A list of probable duplicate pairs scored by similarity: '95% match: Sarah Johnson and Sara Johnstone, both at the same postcode', '88% match: different spelling but same phone number'. For each pair you see what matches (same address), what differs (spelling), and an overall confidence score. You review each pair and merge or keep the records as appropriate.
Before you start
- Your database exported as CSV with: name, address, postcode, email, phone (or similar)
- At least 100 records or so (otherwise it's quicker to check manually)
- Lawful basis under GDPR to process this personal data (names, addresses, emails)
- A Google account for Colab
- Basic Python skills or willingness to adapt example code
When to use this
- You suspect your database has many duplicates but can't spot them manually
- Names are spelt differently or abbreviated inconsistently
- Addresses are formatted differently (Street vs St, Road vs Rd)
- You're merging databases from multiple sources
When not to use this
- Your database is tiny (<100 records) - check manually
- You need 100% certainty - fuzzy matching gives probabilities, not guarantees
- Your data is so messy that matches are impossible (missing fields, random text)
- You don't have time to review flagged matches - automation finds candidates, humans decide
Steps
1. Export your data
Get your records as CSV with key fields: full name, address, postcode, email, phone number. More fields give better matching. Include an ID column so you can identify records later. Don't clean the data much - the algorithm handles messy data.
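A quick sanity check before matching, assuming your export is called contacts.csv and uses the column names above (adapt both to fit your data):

import pandas as pd

# Load the export and check the fields you plan to match on are present
df = pd.read_csv('contacts.csv')
expected = ['name', 'address', 'postcode', 'email', 'phone']
missing = [col for col in expected if col not in df.columns]
print(f"{len(df)} records loaded; missing columns: {missing or 'none'}")

# If your CRM export has no ID column, add one so merges can be traced back later
if 'id' not in df.columns:
    df['id'] = range(1, len(df) + 1)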
2. Understand fuzzy matching
Fuzzy matching scores similarity between text strings: 'John Smith' vs 'J. Smith' might score around 85%, and 'Oak Street' vs 'Oak St' scores high too. The algorithm handles typos, abbreviations and different formatting. You set a threshold: pairs scoring above it (80% is a common starting point) get flagged for review.
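You can get a feel for the scores with a few one-line comparisons (fuzzywuzzy is the library used in the example code below; the numbers in the comments are approximate and vary slightly between versions):

from fuzzywuzzy import fuzz

# Run these to see the actual scores on your setup
print(fuzz.ratio("john smith", "j. smith"))    # abbreviated first name still scores well
print(fuzz.ratio("oak street", "oak st"))      # abbreviation keeps most of the string in common
print(fuzz.ratio("john smith", "mary jones"))  # unrelated names score low

threshold = 80  # pairs scoring above this get flagged for review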
3. Run duplicate detection
Use the example code to compare all records pairwise. The algorithm scores similarity across name, address, postcode and email, then combines the scores (weighted: name matters more than address format) into an overall match score. High scores mean probable duplicates.
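The weighting looks like this in miniature (the 40/30/20/10 split mirrors the example code below; treat it as a starting point, not a rule):

from fuzzywuzzy import fuzz

def weighted_score(name_score, address_score, postcode_score, email_score):
    # Name similarity carries the most weight; email acts as a tie-breaker
    return (name_score * 0.40 + address_score * 0.30
            + postcode_score * 0.20 + email_score * 0.10)

print(weighted_score(
    fuzz.ratio("john smith", "j. smith"),
    fuzz.ratio("10 oak street", "10 oak st"),
    100,  # identical postcodes
    0,    # one record has no email
))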
4. Review high-confidence matches
Sort results by match score. Start with 95%+ matches - these are almost certainly duplicates. Check a few: are they really the same person/organisation? If yes, the algorithm is working. If no, you might need to adjust matching rules.
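If you saved the results as potential_duplicates.csv (the file the example code below writes), reviewing the top of the list is a few lines of pandas:

import pandas as pd

pairs = pd.read_csv('potential_duplicates.csv')  # produced by the detection script below
near_certain = pairs[pairs['match_score'] >= 95].sort_values('match_score', ascending=False)
print(near_certain[['name1', 'name2', 'match_score']].head(20))  # spot-check these first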
5. Set your threshold
Decide your cutoff: above 90% you're confident they're duplicates, 80-90% needs checking, below 80% probably different records. This depends on how messy your data is. Conservative threshold (90%) means fewer false positives but might miss some duplicates.
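One way to pick a cutoff is to see how many pairs each candidate threshold would flag (again assuming the potential_duplicates.csv output of the script below):

import pandas as pd

pairs = pd.read_csv('potential_duplicates.csv')
for cutoff in (80, 85, 90, 95):
    flagged = (pairs['match_score'] >= cutoff).sum()
    print(f"Threshold {cutoff}%: {flagged} pairs flagged for review")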
6. Review borderline cases
For matches scoring 80-90%, check each one manually. Same postcode and similar name but different phone? Could be family members at the same address. Different postcode but same phone? Probably someone who has moved house. Use your judgement - the algorithm flags candidates, you decide.
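To pull out just the borderline band for a colleague to check (same assumed output file as above):

import pandas as pd

pairs = pd.read_csv('potential_duplicates.csv')
borderline = pairs[(pairs['match_score'] >= 80) & (pairs['match_score'] < 90)]
borderline.to_csv('borderline_for_review.csv', index=False)
print(f"{len(borderline)} borderline pairs saved for manual review")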
7. Merge confirmed duplicates
Once you've identified real duplicates, merge them in your CRM. Keep the most complete record, combine donation history, update to most recent address. Document what you merged in case you need to undo later.
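Most CRMs have a built-in merge tool, which is the safer option. If you're working directly on a CSV export instead, a rough sketch of "keep one record, fill its gaps from the other" might look like this (merge_pair and the ID values are illustrative, not part of any CRM API):

import pandas as pd

def merge_pair(df, keep_id, drop_id):
    """Fill gaps in the kept record from the dropped one, then remove the duplicate."""
    keep_mask = df['id'] == keep_id
    drop_row = df[df['id'] == drop_id].iloc[0]
    for col in df.columns:
        if df.loc[keep_mask, col].isna().all() and pd.notna(drop_row[col]):
            df.loc[keep_mask, col] = drop_row[col]
    return df[df['id'] != drop_id]

df = pd.read_csv('contacts.csv')
merged = merge_pair(df, keep_id=101, drop_id=202)  # illustrative IDs - use the pair you confirmed
merged.to_csv('contacts_merged.csv', index=False)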
8. Re-run periodically (optional)
Run duplicate detection quarterly or after importing new data. Duplicates creep in as people are added manually, data is imported from events, or forms capture names in different formats. Regular checking keeps your database clean.
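If you keep a simple log of pairs you've already looked at, quarterly re-runs only need to surface new candidates. A sketch, assuming a reviewed_pairs.csv that you maintain yourself with the same id1/id2 columns as the results file:

import pandas as pd

new_pairs = pd.read_csv('potential_duplicates.csv')  # latest detection run
reviewed = pd.read_csv('reviewed_pairs.csv')          # pairs you've already checked (your own log)
already_seen = set(zip(reviewed['id1'], reviewed['id2']))
mask = new_pairs.apply(lambda r: (r['id1'], r['id2']) in already_seen, axis=1)
unseen = new_pairs[~mask]
print(f"{len(unseen)} new candidate pairs since the last review")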
Example code
Detect duplicates using fuzzy string matching
This finds probable duplicate records by comparing name, address, postcode and email. Adapt the fields to match your data structure.
import pandas as pd
from fuzzywuzzy import fuzz   # pip install fuzzywuzzy (python-Levenshtein optional, for speed)
from itertools import combinations

# Load your data
df = pd.read_csv('contacts.csv')
print(f"Loaded {len(df)} records")
print(f"Columns: {', '.join(df.columns)}")

def field_score(value1, value2):
    """Fuzzy-compare two field values. Missing values count as no evidence (score 0)."""
    s1 = '' if pd.isna(value1) else str(value1).lower().strip()
    s2 = '' if pd.isna(value2) else str(value2).lower().strip()
    if not s1 or not s2:
        return 0
    return fuzz.ratio(s1, s2)

# Function to calculate similarity between two records
def calculate_match_score(row1, row2):
    """
    Calculate similarity score between two records
    Returns score 0-100 (higher = more similar)
    """
    scores = {}

    # Name similarity (most important - weight: 40%)
    scores['name'] = field_score(row1.get('name'), row2.get('name'))

    # Address similarity (weight: 30%)
    scores['address'] = field_score(row1.get('address'), row2.get('address'))

    # Postcode similarity (weight: 20%)
    # Strip spaces so 'SW1A 1AA' and 'SW1A1AA' compare as identical; typos still score high
    postcode1 = '' if pd.isna(row1.get('postcode')) else str(row1.get('postcode')).replace(' ', '')
    postcode2 = '' if pd.isna(row2.get('postcode')) else str(row2.get('postcode')).replace(' ', '')
    scores['postcode'] = field_score(postcode1, postcode2)

    # Email similarity (weight: 10%)
    # If both emails exist and match exactly, strong signal
    email1 = '' if pd.isna(row1.get('email')) else str(row1.get('email')).lower().strip()
    email2 = '' if pd.isna(row2.get('email')) else str(row2.get('email')).lower().strip()
    scores['email'] = 100 if (email1 and email2 and email1 == email2) else 0

    # Weighted overall score
    overall_score = (
        scores['name'] * 0.40 +
        scores['address'] * 0.30 +
        scores['postcode'] * 0.20 +
        scores['email'] * 0.10
    )
    return overall_score, scores

# Find potential duplicates
duplicates = []
threshold = 80  # Minimum similarity score to flag (adjust this)

print("\nSearching for duplicates (this may take a while for large datasets)...")

# Compare all pairs of records
for (idx1, row1), (idx2, row2) in combinations(df.iterrows(), 2):
    overall_score, component_scores = calculate_match_score(row1, row2)
    if overall_score >= threshold:
        duplicates.append({
            'id1': row1.get('id', idx1),
            'name1': row1.get('name'),
            'address1': row1.get('address'),
            'id2': row2.get('id', idx2),
            'name2': row2.get('name'),
            'address2': row2.get('address'),
            'match_score': round(overall_score, 1),
            'name_score': round(component_scores['name'], 1),
            'address_score': round(component_scores['address'], 1),
            'postcode_score': round(component_scores['postcode'], 1),
            'email_score': round(component_scores['email'], 1)
        })

duplicates_df = pd.DataFrame(duplicates)

if duplicates_df.empty:
    print("\nNo potential duplicates found above the threshold")
else:
    # Sort by match score (highest first)
    duplicates_df = duplicates_df.sort_values('match_score', ascending=False)
    print(f"\nFound {len(duplicates_df)} potential duplicate pairs")

    # Show results by confidence level
    high_confidence = len(duplicates_df[duplicates_df['match_score'] >= 90])
    medium_confidence = len(duplicates_df[(duplicates_df['match_score'] >= 80) & (duplicates_df['match_score'] < 90)])
    print("\nBreakdown:")
    print(f"  High confidence (90%+): {high_confidence} pairs - almost certainly duplicates")
    print(f"  Medium confidence (80-90%): {medium_confidence} pairs - need manual review")

    # Show top 10 matches
    print("\nTop 10 most likely duplicates:")
    print(duplicates_df.head(10)[['name1', 'name2', 'match_score', 'name_score', 'address_score']].to_string(index=False))

    # Save results
    duplicates_df.to_csv('potential_duplicates.csv', index=False)
    print("\nFull results saved to potential_duplicates.csv")

print("\nNext steps:")
print("1. Review high confidence matches (90%+) - likely real duplicates")
print("2. Manually check medium confidence (80-90%)")
print("3. Merge confirmed duplicates in your CRM")
print("4. Adjust threshold if too many/few matches")
At a glance
- Time to implement: days
- Setup cost: free
- Ongoing cost: free
- Cost trend: stable
- Organisation size: medium, large
- Target audience: data-analyst, operations-manager, fundraising
All tools are free. No data goes to third-party matching services, though if you run the code in Google Colab your CSV is uploaded to Google's servers - run it locally instead if that doesn't fit your data policies. GDPR compliant as long as you have a lawful basis to process the data. Initial setup takes a few hours; once built, checking for duplicates takes minutes. Much faster than manual review, and it catches matches you'd miss.