Detect duplicate donations
The problem
Same donor gives £50 online and sends a £50 cheque; both get recorded, so you've counted £100 when you only received £50. Someone donates twice in one campaign and you miss it. Payment processing errors create duplicates. Import mishaps add donations twice. Your fundraising reports are overstated, and you don't realise until the bank reconciliation doesn't match. By then you've made commitments based on inflated figures.
The solution
Use fuzzy matching to find probable duplicate donations even when details aren't identical. The algorithm compares: donor name (even with spelling variations), amount (exactly same or suspiciously similar), date (same day or within window), payment method, campaign. 'John Smith £50 online 15-Oct' and 'J. Smith £50 cheque 18-Oct' score 92% match - probably a duplicate. You review high-scoring pairs and merge/delete as appropriate.
What you get
List of probable duplicate donations scored by similarity: '95% match: Sarah Johnson £100 card payment 12-Oct and S. Johnson £100 bank transfer 12-Oct - likely duplicate', '88% match: Same donor, £50 and £51, 2 days apart - possibly duplicate or processing error'. For each pair: what matches (donor, amount, timing), what differs, confidence score. You review and clean.
Before you start
- Donation data exported as CSV: donor name, amount, date, payment method, campaign
- At least several hundred donations (otherwise manual review is faster)
- Lawful basis under GDPR to process donor data for financial accuracy
- A Google account for Colab
- Basic Python skills or willingness to adapt example code
When to use this
- You suspect duplicate donations are inflating your income figures
- Donors give through multiple channels (online, cheque, standing order) and might be recorded twice
- You've imported donation data from multiple sources and fear duplicates
- Bank reconciliation doesn't match donation records - duplicates might be why
When not to use this
- You have very few donations (under 100) - check manually
- Your CRM already has robust duplicate detection
- You need 100% certainty - fuzzy matching gives probabilities, humans must verify
- You don't have lawful basis under GDPR to process donor data for financial accuracy purposes - ensure your privacy notice covers data quality processing
- You don't have time to review flagged matches - automation finds candidates, you decide
Steps
- 1
Export donation data
Get donations as CSV with: donor name, amount, date, payment method (card/cheque/bank transfer), campaign/appeal code, transaction ID if you have it. Include donations from last 6-12 months (duplicates usually happen close together). More fields give better matching.
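A quick sanity check after export catches missing columns before you run the matcher. A minimal sketch, assuming a file shaped like the CSV described in this step (your CRM's column names may differ; the sample rows are hypothetical):

```python
import pandas as pd
from io import StringIO

# Hypothetical export - stands in for your real donations.csv
csv_data = StringIO(
    "donor_name,amount,date,payment_method,campaign\n"
    "John Smith,50.00,2024-10-15,online,winter-appeal\n"
    "J. Smith,50.00,2024-10-18,cheque,winter-appeal\n"
)
donations = pd.read_csv(csv_data, parse_dates=["date"])

# Check the columns the matching code expects are present
required = {"donor_name", "amount", "date"}
missing = required - set(donations.columns)
print(f"Rows: {len(donations)}, missing columns: {missing or 'none'}")
```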
- 2
Understand fuzzy matching
Fuzzy matching scores similarity: 'John Smith' vs 'J. Smith' scores 85%, '£50.00' vs '£50.01' scores 99%, dates 2 days apart might indicate same donation posted twice. The algorithm handles: name variations, small amount discrepancies (£49.99 vs £50), payment method differences (donor paid online, you recorded as cheque).
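You can get a feel for how similarity scoring behaves without installing anything: Python's built-in difflib produces a 0-100 score in the same spirit as fuzzywuzzy's fuzz.ratio (the exact numbers differ between libraries):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score, similar in spirit to fuzz.ratio."""
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)

print(similarity("John Smith", "J. Smith"))    # high - likely the same person
print(similarity("John Smith", "Mary Jones"))  # low - different people
```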
- 3
Run duplicate detection
Use the example code to compare donations pairwise. Algorithm scores similarity across: donor name (most important), amount (must be very similar), date (within reasonable window - 7 days typical), campaign (if recorded). Combines scores into overall match probability. High scores = probable duplicates.
- 4
Review high-confidence matches
Sort by match score. Check 95%+ matches first - almost certainly duplicates: same donor, same amount, same day, different payment methods (online and cheque). Are they really the same gift? If yes, the algorithm works. Tag one for deletion, keep the most accurate record (check which has more detail).
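Filtering the detector's output down to the near-certain pairs is a one-liner. A sketch with hypothetical results (column names follow the example code's output):

```python
import pandas as pd

# Hypothetical output rows from the detection script
matches = pd.DataFrame({
    "donor_name_1": ["Sarah Johnson", "John Smith", "Amy Lee"],
    "donor_name_2": ["S. Johnson", "J. Smith", "A. Leeson"],
    "match_score": [96.5, 92.0, 81.3],
})

# Start with the near-certain duplicates and work down the list
high = matches[matches["match_score"] >= 95].sort_values("match_score", ascending=False)
print(f"{len(high)} pair(s) to review first")
```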
- 5
Investigate medium-confidence matches
For 80-90% matches, check carefully. Same donor, same amount, few days apart - duplicate or two genuine gifts? Check: campaign context (did you ask twice?), payment methods (both online suggests separate gifts, online + cheque suggests duplicate), donor history (do they normally give twice in a week?). Use judgement.
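Donor history is easy to check in pandas: how often does this donor give, and how close together? A sketch on made-up data:

```python
import pandas as pd

# Hypothetical giving history for one donor plus a control row
donations = pd.DataFrame({
    "donor_name": ["John Smith", "John Smith", "John Smith", "Mary Jones"],
    "amount": [50.0, 50.0, 25.0, 100.0],
    "date": pd.to_datetime(["2024-10-15", "2024-10-18", "2024-06-01", "2024-10-12"]),
})

# Does this donor normally give more than once in a short window?
smith = donations[donations["donor_name"] == "John Smith"].sort_values("date")
gaps = smith["date"].diff().dt.days.dropna()
print(f"Gifts: {len(smith)}, shortest gap between gifts: {gaps.min():.0f} days")
```

A 3-day gap between two £50 gifts from a donor who otherwise gives twice a year is a strong duplicate signal.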
- 6
Check edge cases
Some patterns need investigation: same donor, slightly different amounts (£50 and £51 - typo?), same amount from different donors (family members?), exact matches weeks apart (monthly standing order vs one-off gift?). Don't auto-delete - understand the pattern first.
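The standing-order case in particular can be spotted from the gaps between identical gifts. A sketch on made-up data: three identical gifts roughly a month apart are almost certainly recurring giving, not duplicates:

```python
import pandas as pd

# Hypothetical recurring gift: same donor, same amount, monthly
donations = pd.DataFrame({
    "donor_name": ["Mary Jones"] * 3,
    "amount": [20.0, 20.0, 20.0],
    "date": pd.to_datetime(["2024-08-01", "2024-09-01", "2024-10-01"]),
})

# Same donor, same amount, roughly a month apart: likely a standing
# order - don't let the exact matches fool you into deleting them
gaps = donations.sort_values("date")["date"].diff().dt.days.dropna()
looks_like_standing_order = gaps.between(27, 32).all()
print(f"Standing order pattern: {looks_like_standing_order}")
```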
- 7
Delete confirmed duplicates
Once you've verified real duplicates, delete or merge in your CRM. Keep the record with more information (if one has campaign code and one doesn't, keep the detailed one). Document: which donations were merged, why, when. If you need to undo later, you have a trail.
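A plain CSV log is enough for the audit trail. A minimal sketch (the IDs and reason text are hypothetical):

```python
import csv
from datetime import date

# Minimal audit trail: log each merge decision before deleting in the CRM
with open("merge_log.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["kept_id", "deleted_id", "reason", "decided_on"])
    writer.writerow([1042, 1057, "same gift recorded online and as cheque",
                     date.today().isoformat()])

print("Merge decision logged")
```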
- 8
Fix root causes (optional)
Why did duplicates happen? Common causes: importing same file twice, donors using multiple channels and both get recorded, staff manually entering donations already in the system. Fix processes to prevent recurrence: unique transaction IDs, import validation, staff training. Prevention beats detection.
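If your payment provider supplies a unique transaction ID, exact de-duplication at import time is one line of pandas and prevents the most common cause, re-importing the same file. A sketch on a hypothetical import:

```python
import pandas as pd

# A re-imported file: transaction 'TX100' appears twice
incoming = pd.DataFrame({
    "transaction_id": ["TX100", "TX101", "TX100"],
    "donor_name": ["John Smith", "Mary Jones", "John Smith"],
    "amount": [50.0, 100.0, 50.0],
})

# Unique transaction IDs make de-duplication at import time trivial
clean = incoming.drop_duplicates(subset="transaction_id", keep="first")
print(f"Imported {len(clean)} of {len(incoming)} rows")
```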
- 9
Run periodically (optional)
Check quarterly or after data imports. Duplicates creep in as donations arrive through multiple channels, imports happen, data entry errors occur. Regular checking keeps financial data accurate. Track: how many duplicates found? Is it decreasing (processes improving) or staying constant (systemic issue)?
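Tracking the count of duplicates found at each check shows whether your process fixes are working. A trivial sketch with made-up quarterly figures:

```python
# Duplicates found at each quarterly check (hypothetical figures)
history = {"2024-Q1": 14, "2024-Q2": 9, "2024-Q3": 5}

counts = list(history.values())
trend = "improving" if counts[-1] < counts[0] else "not improving"
print(f"Duplicate counts per check: {counts} - {trend}")
```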
Example code
Detect duplicate donations
This finds probable duplicate donations using fuzzy matching. Install with: pip install fuzzywuzzy python-Levenshtein pandas (python-Levenshtein speeds up matching significantly). PERFORMANCE NOTE: This compares all pairs of donations (O(n²) complexity) - for 1,000 donations that's 500,000 comparisons, for 10,000 it's 50 million. For large datasets, filter by date range first (e.g., last 3 months) or use blocking (only compare within same week). If your data has separate First Name/Last Name columns instead of donor_name, concatenate them first: df['donor_name'] = df['first_name'] + ' ' + df['last_name'].
# pip install fuzzywuzzy python-Levenshtein pandas
import pandas as pd
from fuzzywuzzy import fuzz
from itertools import combinations
from datetime import datetime, timedelta
# Load donation data
donations = pd.read_csv('donations.csv')
donations['date'] = pd.to_datetime(donations['date'])
print(f"Loaded {len(donations)} donations")
print(f"Total value: £{donations['amount'].sum():,.2f}")
# Function to calculate similarity between two donations
def calculate_duplicate_score(don1, don2):
    """
    Score similarity between two donations (0-100).
    Higher = more likely to be a duplicate.
    """
    scores = {}

    # Donor name similarity (weight: 40%)
    name1 = str(don1.get('donor_name', '')).lower()
    name2 = str(don2.get('donor_name', '')).lower()
    scores['name'] = fuzz.ratio(name1, name2)

    # Amount similarity (weight: 35%) - must be very close for duplicates
    amount1 = float(don1.get('amount', 0))
    amount2 = float(don2.get('amount', 0))
    if amount1 == amount2:
        scores['amount'] = 100
    else:
        # Allow a tiny difference (£50.00 vs £49.99)
        diff = abs(amount1 - amount2)
        avg = (amount1 + amount2) / 2
        if avg > 0:
            diff_percent = (diff / avg) * 100
            if diff_percent < 1:    # Less than 1% difference
                scores['amount'] = 99
            elif diff_percent < 5:  # Less than 5% difference
                scores['amount'] = 80
            else:
                scores['amount'] = max(0, 100 - diff_percent * 10)
        else:
            scores['amount'] = 0

    # Date proximity (weight: 20%) - duplicates usually land within days
    date1 = don1.get('date')
    date2 = don2.get('date')
    if pd.notna(date1) and pd.notna(date2):
        days_apart = abs((date1 - date2).days)
        if days_apart == 0:
            scores['date'] = 100
        elif days_apart <= 3:
            scores['date'] = 90
        elif days_apart <= 7:
            scores['date'] = 70
        elif days_apart <= 14:
            scores['date'] = 50
        else:
            scores['date'] = max(0, 100 - days_apart * 5)
    else:
        scores['date'] = 0

    # Campaign/appeal match (weight: 5%)
    campaign1 = str(don1.get('campaign', '')).lower()
    campaign2 = str(don2.get('campaign', '')).lower()
    if campaign1 and campaign2 and campaign1 != 'nan' and campaign2 != 'nan':
        scores['campaign'] = 100 if campaign1 == campaign2 else 30
    else:
        scores['campaign'] = 50  # Unknown, neutral

    # Weighted overall score
    overall_score = (
        scores['name'] * 0.40 +
        scores['amount'] * 0.35 +
        scores['date'] * 0.20 +
        scores['campaign'] * 0.05
    )
    return overall_score, scores
# Find potential duplicates
duplicates = []
threshold = 75 # Minimum score to flag (adjust if too many/few matches)
print(f"\nSearching for duplicates (threshold: {threshold})...")
# Compare all pairs
checked = 0
for (idx1, don1), (idx2, don2) in combinations(donations.iterrows(), 2):
    overall_score, component_scores = calculate_duplicate_score(don1, don2)
    if overall_score >= threshold:
        duplicates.append({
            'donation_id_1': don1.get('id', idx1),
            'donor_name_1': don1.get('donor_name'),
            'amount_1': don1.get('amount'),
            'date_1': don1.get('date'),
            'method_1': don1.get('payment_method'),
            'donation_id_2': don2.get('id', idx2),
            'donor_name_2': don2.get('donor_name'),
            'amount_2': don2.get('amount'),
            'date_2': don2.get('date'),
            'method_2': don2.get('payment_method'),
            'match_score': round(overall_score, 1),
            'name_score': round(component_scores['name'], 1),
            'amount_score': round(component_scores['amount'], 1),
            'date_score': round(component_scores['date'], 1)
        })
    checked += 1
    if checked % 10000 == 0:
        print(f"  Checked {checked:,} pairs...")
# Sort by match score
duplicates_df = pd.DataFrame(duplicates)
if len(duplicates_df) > 0:
    duplicates_df = duplicates_df.sort_values('match_score', ascending=False)

print(f"\nFound {len(duplicates_df)} potential duplicate pairs")

if len(duplicates_df) > 0:
    # Estimate the financial impact
    total_duplicate_value = 0
    for _, dup in duplicates_df.iterrows():
        # Assume that for a high-scoring match, one donation is a duplicate
        if dup['match_score'] >= 90:
            total_duplicate_value += min(dup['amount_1'], dup['amount_2'])
    print(f"Estimated duplicate value (90%+ matches): £{total_duplicate_value:,.2f}")

    # Breakdown by confidence
    high_conf = len(duplicates_df[duplicates_df['match_score'] >= 90])
    med_conf = len(duplicates_df[(duplicates_df['match_score'] >= 80) & (duplicates_df['match_score'] < 90)])
    low_conf = len(duplicates_df[(duplicates_df['match_score'] >= 75) & (duplicates_df['match_score'] < 80)])
    print("\nBreakdown:")
    print(f"  High confidence (90%+): {high_conf} pairs - almost certainly duplicates")
    print(f"  Medium confidence (80-90%): {med_conf} pairs - need manual review")
    print(f"  Low confidence (75-80%): {low_conf} pairs - check carefully")

    # Show the top 10
    print("\nTop 10 most likely duplicates:")
    print("=" * 80)
    for idx, dup in duplicates_df.head(10).iterrows():
        print(f"\nMatch Score: {dup['match_score']}%")
        print(f"  Donation 1: {dup['donor_name_1']}, £{dup['amount_1']:.2f}, {dup['date_1'].strftime('%d-%b-%Y')}, {dup['method_1']}")
        print(f"  Donation 2: {dup['donor_name_2']}, £{dup['amount_2']:.2f}, {dup['date_2'].strftime('%d-%b-%Y')}, {dup['method_2']}")
        print(f"  Scores - Name: {dup['name_score']}, Amount: {dup['amount_score']}, Date: {dup['date_score']}")

    # Save results
    duplicates_df.to_csv('potential_duplicate_donations.csv', index=False)
    print("\nFull results saved to potential_duplicate_donations.csv")
    print("\nNext steps:")
    print("1. Review high confidence matches (90%+) - likely real duplicates")
    print("2. Investigate medium confidence (80-90%) - check donor history")
    print("3. Delete confirmed duplicates in CRM (keep most detailed record)")
    print("4. Document which donations merged and why")
    print("5. Fix root causes to prevent future duplicates")
else:
    print("\nNo potential duplicates found - your data looks clean!")
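If your dataset is large, the blocking approach mentioned in the performance note cuts the pairwise comparisons dramatically. A sketch that only pairs donations from the same or adjacent ISO week (note: week numbers wrap at year end, so pairs straddling New Year would need extra handling); each pair would then be scored with calculate_duplicate_score from the example code:

```python
import pandas as pd
from itertools import combinations

# Hypothetical data: two close-together pairs, one far-apart pair
donations = pd.DataFrame({
    "donor_name": ["John Smith", "J. Smith", "Mary Jones", "M. Jones"],
    "amount": [50.0, 50.0, 100.0, 100.0],
    "date": pd.to_datetime(["2024-10-15", "2024-10-18", "2024-03-01", "2024-10-12"]),
})

# Blocking: only compare donations within the same ISO week (plus the
# next, so pairs straddling a week boundary aren't missed)
donations["week"] = donations["date"].dt.isocalendar().week

pairs = []
for week, block in donations.groupby("week"):
    nearby = donations[donations["week"].isin([week, week + 1])]
    for (i1, d1), (i2, d2) in combinations(nearby.iterrows(), 2):
        if i1 < i2:
            pairs.append((i1, i2))

pairs = sorted(set(pairs))
print(f"Pairs to score: {len(pairs)} "
      f"(full pairwise would be {len(donations) * (len(donations) - 1) // 2})")
```

The March and October Mary Jones rows are never compared, which is the point: donations months apart are almost never duplicates, so skipping them loses little and saves most of the work.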
At a glance
- Time to implement
- days
- Setup cost
- free
- Ongoing cost
- free
- Cost trend
- stable
- Organisation size
- medium, large
- Target audience
- fundraising, operations-manager, data-analyst
All tools are free. All processing runs locally - no data sent to external services. GDPR compliant for donor data. Initial setup takes a few hours. Once built, checking for duplicates takes minutes. Prevents financial misreporting: £1,000 in false duplicates overstates income, leads to bad decisions.