Detect duplicate records in your database
The problem
Your donor database has John Smith on Oak Street and J. Smith on Oak St - same person or different? Sarah Johnson and Sara Johnstone with similar postcodes? Your CRM is full of potential duplicates, but you can't spot them all manually. Duplicates mean double-counted beneficiaries, duplicate mailings, wasted resources and inaccurate reporting.
The solution
Use fuzzy matching algorithms to find probable duplicates even when names are misspelt, addresses abbreviated, or data formatted differently. The algorithm scores similarity between records (name, address, postcode, email, phone) and flags likely matches. 'John Smith, 10 Oak Street' and 'J. Smith, 10 Oak St' might score around 90% similar - probably duplicates. You review high-scoring pairs rather than checking thousands of records manually.
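A quick way to see the idea in action, using the fuzzywuzzy library that the example code below relies on (the exact score depends on the library version):

from fuzzywuzzy import fuzz

# Compare the two example records as single strings
record_a = "John Smith, 10 Oak Street"
record_b = "J. Smith, 10 Oak St"
print(fuzz.ratio(record_a.lower(), record_b.lower()))  # a high score (80+) suggests a probable duplicate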
What you get
A list of probable duplicate pairs scored by similarity: '95% match: Sarah Johnson and Sara Johnstone, both at the same postcode', '88% match: different spelling but same phone number'. For each pair you see what matches (same address), what differs (spelling), and an overall confidence score. You review each pair and merge or keep the records as appropriate.
Before you start
- Your database exported as CSV with: name, address, postcode, email, phone (or similar)
- At least 100 records or so (otherwise it's quicker to check manually)
- Lawful basis under GDPR to process this personal data (names, addresses, emails)
- A Google account for Colab
- Basic Python skills or willingness to adapt example code
When to use this
- You suspect your database has many duplicates but can't spot them manually
- Names are spelt differently or abbreviated inconsistently
- Addresses are formatted differently (Street vs St, Road vs Rd)
- You're merging databases from multiple sources
When not to use this
- Your database is tiny (<100 records) - check manually
- You need 100% certainty - fuzzy matching gives probabilities, not guarantees
- Your data is so messy that matches are impossible (missing fields, random text)
- You don't have time to review flagged matches - automation finds candidates, humans decide
Steps
1. Export your data
Get your records as CSV with key fields: full name, address, postcode, email, phone number. More fields give better matching. Include an ID column so you can identify records later. Don't clean the data much - the algorithm handles messy data.
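A quick sanity check before matching, assuming your export is called contacts.csv and uses the column names above (adapt both to fit your data):

import pandas as pd

# Load the export and check the fields you plan to match on are present
df = pd.read_csv('contacts.csv')
expected = ['name', 'address', 'postcode', 'email', 'phone']
missing = [col for col in expected if col not in df.columns]
print(f"{len(df)} records loaded; missing columns: {missing or 'none'}")

# If your CRM export has no ID column, add one so merges can be traced back later
if 'id' not in df.columns:
    df['id'] = range(1, len(df) + 1)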
2. Understand fuzzy matching
Fuzzy matching scores similarity between text strings: 'John Smith' vs 'J. Smith' might score around 85%, and 'Oak Street' vs 'Oak St' scores high too. The algorithm handles typos, abbreviations and different formatting. You set a threshold: pairs scoring above it (80% is a common starting point) get flagged for review.
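You can get a feel for the scores with a few one-line comparisons (fuzzywuzzy is the library used in the example code below; the numbers in the comments are approximate and vary slightly between versions):

from fuzzywuzzy import fuzz

# Run these to see the actual scores on your setup
print(fuzz.ratio("john smith", "j. smith"))    # abbreviated first name still scores well
print(fuzz.ratio("oak street", "oak st"))      # abbreviation keeps most of the string in common
print(fuzz.ratio("john smith", "mary jones"))  # unrelated names score low

threshold = 80  # pairs scoring above this get flagged for review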
3. Run duplicate detection
Use the example code to compare all records pairwise. The algorithm scores similarity across name, address, postcode and email, then combines the scores (weighted: name matters more than address format) into an overall match score. High scores mean probable duplicates.
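The weighting looks like this in miniature (the 40/30/20/10 split mirrors the example code below; treat it as a starting point, not a rule):

from fuzzywuzzy import fuzz

def weighted_score(name_score, address_score, postcode_score, email_score):
    # Name similarity carries the most weight; email acts as a tie-breaker
    return (name_score * 0.40 + address_score * 0.30
            + postcode_score * 0.20 + email_score * 0.10)

print(weighted_score(
    fuzz.ratio("john smith", "j. smith"),
    fuzz.ratio("10 oak street", "10 oak st"),
    100,  # identical postcodes
    0,    # one record has no email
))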
4. Review high-confidence matches
Sort results by match score. Start with 95%+ matches - these are almost certainly duplicates. Check a few: are they really the same person/organisation? If yes, the algorithm is working. If no, you might need to adjust matching rules.
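If you saved the results as potential_duplicates.csv (the file the example code below writes), reviewing the top of the list is a few lines of pandas:

import pandas as pd

pairs = pd.read_csv('potential_duplicates.csv')  # produced by the detection script below
near_certain = pairs[pairs['match_score'] >= 95].sort_values('match_score', ascending=False)
print(near_certain[['name1', 'name2', 'match_score']].head(20))  # spot-check these first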
5. Set your threshold
Decide your cutoff: above 90% you're confident they're duplicates, 80-90% needs checking, below 80% probably different records. This depends on how messy your data is. Conservative threshold (90%) means fewer false positives but might miss some duplicates.
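One way to pick a cutoff is to see how many pairs each candidate threshold would flag (again assuming the potential_duplicates.csv output of the script below):

import pandas as pd

pairs = pd.read_csv('potential_duplicates.csv')
for cutoff in (80, 85, 90, 95):
    flagged = (pairs['match_score'] >= cutoff).sum()
    print(f"Threshold {cutoff}%: {flagged} pairs flagged for review")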
6. Review borderline cases
For matches scoring 80-90%, check each one manually. Same postcode and similar name but different phone? Could be family members at the same address. Different postcode but same phone? Probably someone who has moved house. Use your judgement - the algorithm flags candidates, you decide.
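To pull out just the borderline band for a colleague to check (same assumed output file as above):

import pandas as pd

pairs = pd.read_csv('potential_duplicates.csv')
borderline = pairs[(pairs['match_score'] >= 80) & (pairs['match_score'] < 90)]
borderline.to_csv('borderline_for_review.csv', index=False)
print(f"{len(borderline)} borderline pairs saved for manual review")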
7. Merge confirmed duplicates
Once you've identified real duplicates, merge them in your CRM. Keep the most complete record, combine donation history, update to most recent address. Document what you merged in case you need to undo later.
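Most CRMs have a built-in merge tool, which is the safer option. If you're working directly on a CSV export instead, a rough sketch of "keep one record, fill its gaps from the other" might look like this (merge_pair and the ID values are illustrative, not part of any CRM API):

import pandas as pd

def merge_pair(df, keep_id, drop_id):
    """Fill gaps in the kept record from the dropped one, then remove the duplicate."""
    keep_mask = df['id'] == keep_id
    drop_row = df[df['id'] == drop_id].iloc[0]
    for col in df.columns:
        if df.loc[keep_mask, col].isna().all() and pd.notna(drop_row[col]):
            df.loc[keep_mask, col] = drop_row[col]
    return df[df['id'] != drop_id]

df = pd.read_csv('contacts.csv')
merged = merge_pair(df, keep_id=101, drop_id=202)  # illustrative IDs - use the pair you confirmed
merged.to_csv('contacts_merged.csv', index=False)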
8. Re-run periodically (optional)
Run duplicate detection quarterly or after importing new data. Duplicates creep in as people are added manually, data is imported from events, or forms capture names in different formats. Regular checking keeps your database clean.
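If you keep a simple log of pairs you've already looked at, quarterly re-runs only need to surface new candidates. A sketch, assuming a reviewed_pairs.csv that you maintain yourself with the same id1/id2 columns as the results file:

import pandas as pd

new_pairs = pd.read_csv('potential_duplicates.csv')  # latest detection run
reviewed = pd.read_csv('reviewed_pairs.csv')          # pairs you've already checked (your own log)
already_seen = set(zip(reviewed['id1'], reviewed['id2']))
mask = new_pairs.apply(lambda r: (r['id1'], r['id2']) in already_seen, axis=1)
unseen = new_pairs[~mask]
print(f"{len(unseen)} new candidate pairs since the last review")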
Example code
Detect duplicates using fuzzy string matching
This finds probable duplicate records by comparing name, address, postcode and email. Adapt the fields to match your data structure.
import pandas as pd
from fuzzywuzzy import fuzz   # pip install fuzzywuzzy (python-Levenshtein optional, for speed)
from itertools import combinations

# Load your data
df = pd.read_csv('contacts.csv')
print(f"Loaded {len(df)} records")
print(f"Columns: {', '.join(df.columns)}")

def field_score(value1, value2):
    """Fuzzy-compare two field values. Missing values count as no evidence (score 0)."""
    s1 = '' if pd.isna(value1) else str(value1).lower().strip()
    s2 = '' if pd.isna(value2) else str(value2).lower().strip()
    if not s1 or not s2:
        return 0
    return fuzz.ratio(s1, s2)

# Function to calculate similarity between two records
def calculate_match_score(row1, row2):
    """
    Calculate similarity score between two records
    Returns score 0-100 (higher = more similar)
    """
    scores = {}

    # Name similarity (most important - weight: 40%)
    scores['name'] = field_score(row1.get('name'), row2.get('name'))

    # Address similarity (weight: 30%)
    scores['address'] = field_score(row1.get('address'), row2.get('address'))

    # Postcode similarity (weight: 20%)
    # Strip spaces so 'SW1A 1AA' and 'SW1A1AA' compare as identical; typos still score high
    postcode1 = '' if pd.isna(row1.get('postcode')) else str(row1.get('postcode')).replace(' ', '')
    postcode2 = '' if pd.isna(row2.get('postcode')) else str(row2.get('postcode')).replace(' ', '')
    scores['postcode'] = field_score(postcode1, postcode2)

    # Email similarity (weight: 10%)
    # If both emails exist and match exactly, strong signal
    email1 = '' if pd.isna(row1.get('email')) else str(row1.get('email')).lower().strip()
    email2 = '' if pd.isna(row2.get('email')) else str(row2.get('email')).lower().strip()
    scores['email'] = 100 if (email1 and email2 and email1 == email2) else 0

    # Weighted overall score
    overall_score = (
        scores['name'] * 0.40 +
        scores['address'] * 0.30 +
        scores['postcode'] * 0.20 +
        scores['email'] * 0.10
    )
    return overall_score, scores

# Find potential duplicates
duplicates = []
threshold = 80  # Minimum similarity score to flag (adjust this)

print("\nSearching for duplicates (this may take a while for large datasets)...")

# Compare all pairs of records
for (idx1, row1), (idx2, row2) in combinations(df.iterrows(), 2):
    overall_score, component_scores = calculate_match_score(row1, row2)
    if overall_score >= threshold:
        duplicates.append({
            'id1': row1.get('id', idx1),
            'name1': row1.get('name'),
            'address1': row1.get('address'),
            'id2': row2.get('id', idx2),
            'name2': row2.get('name'),
            'address2': row2.get('address'),
            'match_score': round(overall_score, 1),
            'name_score': round(component_scores['name'], 1),
            'address_score': round(component_scores['address'], 1),
            'postcode_score': round(component_scores['postcode'], 1),
            'email_score': round(component_scores['email'], 1)
        })

duplicates_df = pd.DataFrame(duplicates)

if duplicates_df.empty:
    print("\nNo potential duplicates found above the threshold")
else:
    # Sort by match score (highest first)
    duplicates_df = duplicates_df.sort_values('match_score', ascending=False)
    print(f"\nFound {len(duplicates_df)} potential duplicate pairs")

    # Show results by confidence level
    high_confidence = len(duplicates_df[duplicates_df['match_score'] >= 90])
    medium_confidence = len(duplicates_df[(duplicates_df['match_score'] >= 80) & (duplicates_df['match_score'] < 90)])
    print("\nBreakdown:")
    print(f"  High confidence (90%+): {high_confidence} pairs - almost certainly duplicates")
    print(f"  Medium confidence (80-90%): {medium_confidence} pairs - need manual review")

    # Show top 10 matches
    print("\nTop 10 most likely duplicates:")
    print(duplicates_df.head(10)[['name1', 'name2', 'match_score', 'name_score', 'address_score']].to_string(index=False))

    # Save results
    duplicates_df.to_csv('potential_duplicates.csv', index=False)
    print("\nFull results saved to potential_duplicates.csv")

print("\nNext steps:")
print("1. Review high confidence matches (90%+) - likely real duplicates")
print("2. Manually check medium confidence (80-90%)")
print("3. Merge confirmed duplicates in your CRM")
print("4. Adjust threshold if too many/few matches")
At a glance
- Time to implement: days
- Setup cost: free
- Ongoing cost: free
- Cost trend: stable
- Organisation size: medium, large
- Target audience: data-analyst, operations-manager, fundraising
All tools are free. No data goes to third-party matching services, though if you run the code in Google Colab your CSV is uploaded to Google's servers - run it locally instead if that doesn't fit your data policies. GDPR compliant as long as you have a lawful basis to process the data. Initial setup takes a few hours; once built, checking for duplicates takes minutes. Much faster than manual review, and it catches matches you'd miss.