
Detect duplicate donations

fundraising · intermediate · proven

The problem

The same donor gives £50 online and sends a £50 cheque. Both get recorded, so you've counted £100 when you only received £50. Someone donates twice in one campaign and you miss it. Payment processing errors create duplicates. Import mishaps add donations twice. Your fundraising reports are overstated, and you don't realise until the bank reconciliation doesn't match. By then you've made commitments based on inflated figures.

The solution

Use fuzzy matching to find probable duplicate donations even when details aren't identical. The algorithm compares: donor name (even with spelling variations), amount (exactly same or suspiciously similar), date (same day or within window), payment method, campaign. 'John Smith £50 online 15-Oct' and 'J. Smith £50 cheque 18-Oct' score 92% match - probably a duplicate. You review high-scoring pairs and merge/delete as appropriate.

What you get

List of probable duplicate donations scored by similarity: '95% match: Sarah Johnson £100 card payment 12-Oct and S. Johnson £100 bank transfer 12-Oct - likely duplicate', '88% match: Same donor, £50 and £51, 2 days apart - possibly duplicate or processing error'. For each pair: what matches (donor, amount, timing), what differs, confidence score. You review and clean.

Before you start

  • Donation data exported as CSV: donor name, amount, date, payment method, campaign
  • At least several hundred donations (otherwise manual review is faster)
  • Lawful basis under GDPR to process donor data for financial accuracy
  • A Google account for Colab
  • Basic Python skills or willingness to adapt example code

When to use this

  • You suspect duplicate donations are inflating your income figures
  • Donors give through multiple channels (online, cheque, standing order) and might be recorded twice
  • You've imported donation data from multiple sources and fear duplicates
  • Bank reconciliation doesn't match donation records - duplicates might be why

When not to use this

  • You have very few donations (under 100) - check manually
  • Your CRM already has robust duplicate detection
  • You need 100% certainty - fuzzy matching gives probabilities, humans must verify
  • You don't have lawful basis under GDPR to process donor data for financial accuracy purposes - ensure your privacy notice covers data quality processing
  • You don't have time to review flagged matches - automation finds candidates, you decide

Steps

  1. Export donation data

    Get donations as CSV with: donor name, amount, date, payment method (card/cheque/bank transfer), campaign/appeal code, transaction ID if you have it. Include donations from last 6-12 months (duplicates usually happen close together). More fields give better matching.
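    Before matching, it's worth checking the export has the columns the script expects. The sketch below uses an in-memory CSV for illustration; the column names (donor_name, amount, date, payment_method, campaign) are the ones assumed by the example code further down - adapt them to whatever your CRM actually exports.

```python
# Minimal sketch: validate an exported CSV before matching.
# Column names are assumptions -- match them to your CRM's export.
import io
import pandas as pd

# In practice: donations = pd.read_csv('donations.csv')
sample_csv = io.StringIO(
    "donor_name,amount,date,payment_method,campaign\n"
    "John Smith,50.00,2024-10-15,online,winter-appeal\n"
    "J. Smith,50.00,2024-10-18,cheque,winter-appeal\n"
)
donations = pd.read_csv(sample_csv)
donations['date'] = pd.to_datetime(donations['date'])

# Fail fast if the export is missing a column the matcher needs
required = {'donor_name', 'amount', 'date', 'payment_method'}
missing = required - set(donations.columns)
if missing:
    raise ValueError(f"Export is missing columns: {missing}")

# If your CRM exports First Name / Last Name separately, combine them:
# donations['donor_name'] = donations['first_name'] + ' ' + donations['last_name']

print(f"Loaded {len(donations)} donations, £{donations['amount'].sum():,.2f} total")
```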

  2. Understand fuzzy matching

    Fuzzy matching scores similarity: 'John Smith' vs 'J. Smith' scores 85%, '£50.00' vs '£50.01' scores 99%, dates 2 days apart might indicate same donation posted twice. The algorithm handles: name variations, small amount discrepancies (£49.99 vs £50), payment method differences (donor paid online, you recorded as cheque).
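    To see the idea without installing anything, the standard library's difflib.SequenceMatcher computes a similar ratio to the fuzz.ratio used in the full example below (the exact scores differ slightly, but the principle is the same: near-identical strings score close to 1.0).

```python
# Illustrative only: difflib is a stdlib stand-in for fuzz.ratio
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    # Lowercase first so 'John' vs 'john' doesn't lose points
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

close = name_similarity("John Smith", "J. Smith")
far = name_similarity("John Smith", "Priya Patel")
print(f"John Smith vs J. Smith:    {close:.2f}")
print(f"John Smith vs Priya Patel: {far:.2f}")
```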

  3. Run duplicate detection

    Use the example code to compare donations pairwise. Algorithm scores similarity across: donor name (most important), amount (must be very similar), date (within reasonable window - 7 days typical), campaign (if recorded). Combines scores into overall match probability. High scores = probable duplicates.
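    As a worked example of the weighted combination, here is how the component scores roll up into one number (the weights match the full script in the Example code section):

```python
# Weighted combination of component similarity scores (0-100 each)
weights = {'name': 0.40, 'amount': 0.35, 'date': 0.20, 'campaign': 0.05}

# Hypothetical pair: similar names, identical amount, 3 days apart,
# campaign unknown (neutral 50)
component_scores = {'name': 85, 'amount': 100, 'date': 90, 'campaign': 50}

overall = sum(component_scores[k] * w for k, w in weights.items())
print(f"Overall match score: {overall:.1f}")  # 89.5 -- worth reviewing
```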

  4. Review high-confidence matches

    Sort by match score. Check 95%+ matches first - almost certainly duplicates: same donor, same amount, same day, different payment methods (online and cheque). Are they really the same gift? If yes, the algorithm works. Tag one for deletion, keep the most accurate record (check which has more detail).

  5. Investigate medium-confidence matches

    For 80-90% matches, check carefully. Same donor, same amount, few days apart - duplicate or two genuine gifts? Check: campaign context (did you ask twice?), payment methods (both online suggests separate gifts, online + cheque suggests duplicate), donor history (do they normally give twice in a week?). Use judgement.

  6. Check edge cases

    Some patterns need investigation: same donor, slightly different amounts (£50 and £51 - typo?), same amount from different donors (family members?), exact matches weeks apart (monthly standing order vs one-off gift?). Don't auto-delete - understand the pattern first.
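    One edge case is easy to test for mechanically: exact-amount matches roughly a month apart are often a standing order, not a duplicate. A small sketch, where the 28-32 day window is an assumption you should tune to your payment schedule:

```python
from datetime import date

def looks_like_standing_order(amount1, date1, amount2, date2,
                              window=(28, 32)):
    """Same amount, roughly a month apart: probably recurring, not duplicate."""
    days_apart = abs((date2 - date1).days)
    return amount1 == amount2 and window[0] <= days_apart <= window[1]

# Month apart, same amount -> likely a standing order instalment
recurring = looks_like_standing_order(50.0, date(2024, 9, 15),
                                      50.0, date(2024, 10, 15))
# Three days apart, same amount -> more likely a duplicate
same_week = looks_like_standing_order(50.0, date(2024, 10, 15),
                                      50.0, date(2024, 10, 18))
print(f"Month apart, same amount: {recurring}")
print(f"Days apart, same amount:  {same_week}")
```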

  7. Delete confirmed duplicates

    Once you've verified real duplicates, delete or merge in your CRM. Keep the record with more information (if one has campaign code and one doesn't, keep the detailed one). Document: which donations were merged, why, when. If you need to undo later, you have a trail.
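    A minimal audit trail can be as simple as a CSV alongside your CRM records. The field names and donation IDs below are hypothetical suggestions, not a standard:

```python
import csv
import io
from datetime import date

# Write to an in-memory buffer here; in practice use open('merge_log.csv', 'a')
log = io.StringIO()
writer = csv.DictWriter(log, fieldnames=[
    'kept_id', 'deleted_id', 'reason', 'reviewed_by', 'date'])
writer.writeheader()
writer.writerow({
    'kept_id': 'DON-1042',      # hypothetical IDs for illustration
    'deleted_id': 'DON-1107',
    'reason': 'Same donor/amount/day; online + cheque double entry',
    'reviewed_by': 'finance-team',
    'date': date(2024, 10, 20).isoformat(),
})
print(log.getvalue())
```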

  8. Fix root causes (optional)

    Why did duplicates happen? Common causes: importing same file twice, donors using multiple channels and both get recorded, staff manually entering donations already in the system. Fix processes to prevent recurrence: unique transaction IDs, import validation, staff training. Prevention beats detection.

  9. Run periodically (optional)

    Check quarterly or after data imports. Duplicates creep in as donations arrive through multiple channels, imports happen, data entry errors occur. Regular checking keeps financial data accurate. Track: how many duplicates found? Is it decreasing (processes improving) or staying constant (systemic issue)?

Example code

Detect duplicate donations

This finds probable duplicate donations using fuzzy matching. Install with: pip install fuzzywuzzy python-Levenshtein pandas (python-Levenshtein speeds up matching significantly).

Performance note: this compares all pairs of donations (O(n²) complexity) - for 1,000 donations that's roughly 500,000 comparisons; for 10,000 it's roughly 50 million. For large datasets, filter by date range first (e.g. last 3 months) or use blocking (only compare within the same week).

If your data has separate First Name/Last Name columns instead of donor_name, concatenate them first: df['donor_name'] = df['first_name'] + ' ' + df['last_name'].

# pip install fuzzywuzzy python-Levenshtein pandas
import pandas as pd
from fuzzywuzzy import fuzz
from itertools import combinations
from datetime import datetime, timedelta

# Load donation data
donations = pd.read_csv('donations.csv')
donations['date'] = pd.to_datetime(donations['date'])

print(f"Loaded {len(donations)} donations")
print(f"Total value: £{donations['amount'].sum():,.2f}")

# Function to calculate similarity between two donations
def calculate_duplicate_score(don1, don2):
    """
    Score similarity between two donations (0-100)
    Higher = more likely to be duplicate
    """
    scores = {}

    # Donor name similarity (weight: 40%)
    name1 = str(don1.get('donor_name', '')).lower()
    name2 = str(don2.get('donor_name', '')).lower()
    scores['name'] = fuzz.ratio(name1, name2)

    # Amount similarity (weight: 35%)
    # Must be very similar for duplicates
    amount1 = float(don1.get('amount', 0))
    amount2 = float(don2.get('amount', 0))

    if amount1 == amount2:
        scores['amount'] = 100
    else:
        # Allow tiny difference (£50.00 vs £49.99)
        diff = abs(amount1 - amount2)
        avg = (amount1 + amount2) / 2
        if avg > 0:
            diff_percent = (diff / avg) * 100
            if diff_percent < 1:  # Less than 1% difference
                scores['amount'] = 99
            elif diff_percent < 5:  # Less than 5% difference
                scores['amount'] = 80
            else:
                scores['amount'] = max(0, 100 - diff_percent * 10)
        else:
            scores['amount'] = 0

    # Date proximity (weight: 20%)
    # Duplicates usually within days
    date1 = don1.get('date')
    date2 = don2.get('date')
    if pd.notna(date1) and pd.notna(date2):
        days_apart = abs((date1 - date2).days)
        if days_apart == 0:
            scores['date'] = 100
        elif days_apart <= 3:
            scores['date'] = 90
        elif days_apart <= 7:
            scores['date'] = 70
        elif days_apart <= 14:
            scores['date'] = 50
        else:
            scores['date'] = max(0, 100 - days_apart * 5)
    else:
        scores['date'] = 0

    # Campaign/appeal match (weight: 5%)
    campaign1 = str(don1.get('campaign', '')).lower()
    campaign2 = str(don2.get('campaign', '')).lower()
    if campaign1 not in ('', 'nan') and campaign2 not in ('', 'nan'):
        scores['campaign'] = 100 if campaign1 == campaign2 else 30
    else:
        scores['campaign'] = 50  # Unknown, neutral

    # Calculate weighted overall score
    overall_score = (
        scores['name'] * 0.40 +
        scores['amount'] * 0.35 +
        scores['date'] * 0.20 +
        scores['campaign'] * 0.05
    )

    return overall_score, scores

# Find potential duplicates
duplicates = []
threshold = 75  # Minimum score to flag (adjust if too many/few matches)

print(f"\nSearching for duplicates (threshold: {threshold})...")

# Compare all pairs
checked = 0
for (idx1, don1), (idx2, don2) in combinations(donations.iterrows(), 2):
    overall_score, component_scores = calculate_duplicate_score(don1, don2)

    if overall_score >= threshold:
        duplicates.append({
            'donation_id_1': don1.get('id', idx1),
            'donor_name_1': don1.get('donor_name'),
            'amount_1': don1.get('amount'),
            'date_1': don1.get('date'),
            'method_1': don1.get('payment_method'),
            'donation_id_2': don2.get('id', idx2),
            'donor_name_2': don2.get('donor_name'),
            'amount_2': don2.get('amount'),
            'date_2': don2.get('date'),
            'method_2': don2.get('payment_method'),
            'match_score': round(overall_score, 1),
            'name_score': round(component_scores['name'], 1),
            'amount_score': round(component_scores['amount'], 1),
            'date_score': round(component_scores['date'], 1)
        })

    checked += 1
    if checked % 10000 == 0:
        print(f"  Checked {checked:,} pairs...")

# Sort by match score
duplicates_df = pd.DataFrame(duplicates)
if len(duplicates_df) > 0:
    duplicates_df = duplicates_df.sort_values('match_score', ascending=False)

print(f"\nFound {len(duplicates_df)} potential duplicate pairs")

if len(duplicates_df) > 0:
    # Calculate financial impact
    total_duplicate_value = 0
    for _, dup in duplicates_df.iterrows():
        # Assume if high match, one is duplicate
        if dup['match_score'] >= 90:
            total_duplicate_value += min(dup['amount_1'], dup['amount_2'])

    print(f"Estimated duplicate value (90%+ matches): £{total_duplicate_value:,.2f}")

    # Breakdown by confidence
    high_conf = len(duplicates_df[duplicates_df['match_score'] >= 90])
    med_conf = len(duplicates_df[(duplicates_df['match_score'] >= 80) & (duplicates_df['match_score'] < 90)])
    low_conf = len(duplicates_df[(duplicates_df['match_score'] >= 75) & (duplicates_df['match_score'] < 80)])

    print(f"\nBreakdown:")
    print(f"  High confidence (90%+): {high_conf} pairs - almost certainly duplicates")
    print(f"  Medium confidence (80-90%): {med_conf} pairs - need manual review")
    print(f"  Low confidence (75-80%): {low_conf} pairs - check carefully")

    # Show top 10
    print(f"\nTop 10 most likely duplicates:")
    print("="*80)
    for idx, dup in duplicates_df.head(10).iterrows():
        print(f"\nMatch Score: {dup['match_score']}%")
        print(f"  Donation 1: {dup['donor_name_1']}, £{dup['amount_1']:.2f}, {dup['date_1'].strftime('%d-%b-%Y')}, {dup['method_1']}")
        print(f"  Donation 2: {dup['donor_name_2']}, £{dup['amount_2']:.2f}, {dup['date_2'].strftime('%d-%b-%Y')}, {dup['method_2']}")
        print(f"  Scores - Name: {dup['name_score']}, Amount: {dup['amount_score']}, Date: {dup['date_score']}")

    # Save results
    duplicates_df.to_csv('potential_duplicate_donations.csv', index=False)
    print(f"\nFull results saved to potential_duplicate_donations.csv")

    print(f"\nNext steps:")
    print(f"1. Review high confidence matches (90%+) - likely real duplicates")
    print(f"2. Investigate medium confidence (80-90%) - check donor history")
    print(f"3. Delete confirmed duplicates in CRM (keep most detailed record)")
    print(f"4. Document which donations merged and why")
    print(f"5. Fix root causes to prevent future duplicates")

else:
    print("\nNo potential duplicates found - your data looks clean!")
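For larger datasets, the blocking optimisation mentioned in the performance note can be sketched as follows: bucket donations by ISO year-week and only compare pairs within the same bucket. This is one possible blocking key, not the only one; a pair straddling a Sunday/Monday boundary would be missed, so widen the window (or also compare adjacent weeks) if that matters for your data.

```python
from itertools import combinations
import pandas as pd

# Toy data with the same column names as the main example
donations = pd.DataFrame({
    'donor_name': ['John Smith', 'J. Smith', 'Priya Patel', 'P. Patel'],
    'amount': [50.0, 50.0, 25.0, 25.0],
    'date': pd.to_datetime(['2024-10-15', '2024-10-18',
                            '2024-11-04', '2024-11-05']),
})

# Block on ISO year + week so week numbers don't collide across years
iso = donations['date'].dt.isocalendar()
donations['block'] = iso['year'].astype(str) + '-' + iso['week'].astype(str)

pairs_all = sum(1 for _ in combinations(donations.index, 2))
pairs_blocked = sum(
    len(list(combinations(group.index, 2)))
    for _, group in donations.groupby('block')
)
print(f"All pairs: {pairs_all}, within-week pairs: {pairs_blocked}")
```

With 10,000 donations spread over a year, the saving is far more dramatic than this toy example suggests: ~50 million pairs collapse to a few thousand per week.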

Tools

Google Colab - platform · freemium
Python (fuzzywuzzy) - library · free · open source
pandas - library · free · open source

At a glance

  • Time to implement: days
  • Setup cost: free
  • Ongoing cost: free
  • Cost trend: stable
  • Organisation size: medium, large
  • Target audience: fundraising, operations-manager, data-analyst

All tools are free. The script sends no data to external APIs, but note that running it in Google Colab means processing donor data on Google's servers - run it locally instead if your privacy notice doesn't cover that. Initial setup takes a few hours; once built, checking for duplicates takes minutes. The payoff is accurate financial reporting: £1,000 in undetected duplicates overstates income and leads to commitments based on money you don't have.