Check your data for problems automatically
The problem
Your database or spreadsheet has grown over the years and you suspect it has problems: duplicate records, invalid postcodes, dates that don't make sense, missing fields that should be filled in. You need to know what's wrong before it skews reports or disrupts service delivery.
The solution
For a quick check, paste a sample of your data (with personal details removed or anonymised) into Claude or ChatGPT and ask it to spot problems. For a thorough check, use a Python script in Google Colab that scans for duplicates, invalid formats, outliers, and missing fields. Either way, you get a plain English summary of what's wrong and what to fix first.
What you get
A report listing the problems found, grouped by severity and type. Each issue includes examples and the number of affected records. The summary tells you what's most urgent to fix and suggests likely causes. You can export the problem records to clean up.
Before you start
- Your data exported as a CSV or Excel file
- A rough idea of what the data should look like (valid formats, required fields)
- A Google account for Colab
- For the LLM summary: a ChatGPT or Claude account
When to use this
- You're about to do a major report and want to check the data first
- You've inherited a database and don't know what state it's in
- You're merging data from multiple sources and expect inconsistencies
- You want to set up regular data quality monitoring
When not to use this
- Your dataset is tiny and you can just eyeball it
- You already have data validation built into your system at the point of entry
- The data is temporary and will be thrown away anyway
- You want to use the LLM quick-check with sensitive beneficiary data and don't have an enterprise AI agreement in place
Steps
1. Quick option: paste a sample into Claude or ChatGPT
For a fast check without any coding, export 50-100 rows as CSV and paste them into Claude or ChatGPT. IMPORTANT: Remove or anonymise personal data first (names, addresses, dates of birth, contact details) before pasting into any AI tool. If you need to check real data, use an enterprise version with a data processing agreement, or run the Python checks locally instead. Ask: "Check this data for problems: duplicates, missing values, invalid formats, anything that looks wrong." This catches obvious issues in minutes.
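If you're unsure how to anonymise quickly, a few lines of pandas can strip the direct identifiers before you copy a sample. This is only a sketch: the column names here are illustrative and won't match your export exactly.

```python
import pandas as pd

# Illustrative data standing in for your real export
df = pd.DataFrame({
    'name': ['Ann Smith', 'Bob Jones'],
    'postcode': ['SW1A 1AA', 'M1 1AE'],
    'date_of_birth': ['1980-03-01', '1975-11-20'],
    'service_type': ['advice', 'housing'],
})

# Drop direct identifiers and coarsen quasi-identifiers before sharing
sample = df.drop(columns=['name']).copy()
sample['postcode'] = sample['postcode'].str.split().str[0]  # outward code only
sample['date_of_birth'] = pd.to_datetime(sample['date_of_birth']).dt.year  # year only

print(sample.to_csv(index=False))
```

The structure of the data (column names and formats) is usually enough for an AI tool to spot problems; the exact values rarely matter.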
2. Export your full data
Get your data into a CSV or Excel file. If it's very large, you might start with a sample to test your checks before running on the full dataset. Note down what each column should contain (dates, postcodes, numbers, required fields, etc.).
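A quick way to see what each column actually contains, and to carve off a sample for testing your checks, might look like this (the small inline DataFrame stands in for your real file):

```python
import pandas as pd

# Stand-in for pd.read_csv('your_data.csv') on the real export
df = pd.DataFrame({
    'postcode': ['SW1A 1AA', None, 'M1 1AE'],
    'referral_date': ['2024-01-05', '2024-02-10', 'not a date'],
})

# What type has pandas inferred for each column, and how much is missing?
print(df.dtypes)
print(df.isna().sum())

# On a large file, test your checks on a random sample first
sample = df.sample(n=2, random_state=0)
```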
3. Run basic checks in Colab
Upload to Colab and use the example code to check for: missing values in required fields, duplicate records, values outside expected ranges, invalid formats (like postcodes or email addresses), and dates that don't make sense (in the future, before birth dates, etc.).
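The example code at the end of this page covers postcodes and dates; other format checks follow the same pattern. Here's a sketch for email addresses, using a deliberately loose regex (anything resembling something@domain.tld) rather than a fully RFC-compliant one:

```python
import pandas as pd

# Illustrative data; swap in your own column
df = pd.DataFrame({'email': ['a@example.org', 'not-an-email', None]})

# Loose pattern: non-empty local part, '@', domain with at least one dot
email_pattern = r'^[^@\s]+@[^@\s]+\.[^@\s]+$'
invalid_email = ~df['email'].fillna('').str.match(email_pattern)
print(f"Invalid or missing emails: {invalid_email.sum()}")
```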
4. Look for outliers
Check for values that are technically valid but suspiciously unusual. Ages over 100, postcodes you don't normally cover, dates from decades ago. These aren't necessarily wrong but are worth flagging for review.
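One way to flag these, with thresholds and postcode areas that are purely illustrative and should be adapted to your organisation:

```python
import pandas as pd

df = pd.DataFrame({
    'age': [34, 52, 107, 28],
    'postcode': ['M1 1AE', 'M2 3AB', 'M1 4BT', 'ZE1 0AA'],
})

# Valid but suspicious: ages over 100
over_100 = df['age'] > 100

# Postcodes outside the areas you normally cover (adapt this list)
usual_areas = ['M1', 'M2']
unusual_area = ~df['postcode'].str.split().str[0].isin(usual_areas)

print(f"Ages over 100: {over_100.sum()}")
print(f"Postcodes outside usual areas: {unusual_area.sum()}")
```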
5. Check logical consistency
Look for combinations that don't make sense: end dates before start dates, ages that don't match birth dates, services that don't match referral sources. These often indicate data entry errors or system problems.
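An end-before-start check might look like this (column names are assumed; adapt them to your schema):

```python
import pandas as pd

df = pd.DataFrame({
    'start_date': ['2024-01-10', '2024-03-01'],
    'end_date': ['2024-02-01', '2024-02-15'],  # second row ends before it starts
})

# Coerce to dates first; anything unparseable becomes NaT and is skipped
for col in ['start_date', 'end_date']:
    df[col] = pd.to_datetime(df[col], errors='coerce')

impossible = df['end_date'] < df['start_date']
print(f"Records with end date before start date: {impossible.sum()}")
```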
6. Get a plain English summary
Paste the findings (not the raw data, just the problem summary from the script) into ChatGPT or Claude and ask: "Here are the data quality issues I found. Please summarise the main problems, prioritise them by severity, and suggest likely causes." The script output contains counts and patterns, not personal data, so this is safe to share with AI tools.
7. Export problem records for cleaning
For each type of problem, export the affected records so you can fix them. Some issues are quick to fix in bulk (standardising formats), others need individual review (possible duplicates). Start with the most impactful issues.
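Exporting each category of problem to its own CSV keeps the cleanup manageable. A sketch, with illustrative data and file names of your choosing:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Ann Smith', 'Ann Smith', 'Bob Jones'],
    'date_of_birth': ['1980-03-01', '1980-03-01', None],
})

# One file per problem type, so each can be reviewed separately
dupes = df[df.duplicated(subset=['name', 'date_of_birth'], keep=False)]
missing_dob = df[df['date_of_birth'].isna()]

dupes.to_csv('possible_duplicates.csv', index=False)
missing_dob.to_csv('missing_date_of_birth.csv', index=False)
print(f"{len(dupes)} possible duplicates, {len(missing_dob)} missing dates of birth")
```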
Example code
Basic data quality checks
This runs common checks on your data and summarises what's wrong. Adapt the column names and validation rules to match your data.
import pandas as pd

# Load your data (CSV or Excel)
# For CSV files:
df = pd.read_csv('your_data.csv')
# For Excel files, use instead:
# df = pd.read_excel('your_data.xlsx')

problems = []

# Check for missing required fields
required_columns = ['name', 'date_of_birth', 'postcode', 'service_type']
for col in required_columns:
    if col in df.columns:
        missing = df[col].isna().sum()
        if missing > 0:
            problems.append({
                'type': 'Missing required field',
                'column': col,
                'count': int(missing),
                'percent': f"{100 * missing / len(df):.1f}%"
            })

# Check for duplicates (keep=False flags every copy, not just the repeats)
duplicates = df.duplicated(subset=['name', 'date_of_birth'], keep=False)
if duplicates.sum() > 0:
    problems.append({
        'type': 'Possible duplicates',
        'column': 'name + date_of_birth',
        'count': int(duplicates.sum()),
        'examples': df[duplicates][['name', 'date_of_birth']].head(5).to_dict('records')
    })

# Check postcode format (UK)
if 'postcode' in df.columns:
    uk_postcode_pattern = r'^[A-Z]{1,2}[0-9][0-9A-Z]?\s?[0-9][A-Z]{2}$'
    invalid_postcodes = ~df['postcode'].fillna('').str.upper().str.match(uk_postcode_pattern)
    invalid_count = invalid_postcodes.sum()
    if invalid_count > 0:
        problems.append({
            'type': 'Invalid postcode format',
            'column': 'postcode',
            'count': int(invalid_count),
            'examples': df[invalid_postcodes]['postcode'].head(10).tolist()
        })

# Check for future dates (unparseable dates become NaT and are skipped)
if 'date_of_birth' in df.columns:
    df['date_of_birth'] = pd.to_datetime(df['date_of_birth'], errors='coerce')
    future_dates = df['date_of_birth'] > pd.Timestamp.now()
    if future_dates.sum() > 0:
        problems.append({
            'type': 'Date in the future',
            'column': 'date_of_birth',
            'count': int(future_dates.sum())
        })

# Check for unlikely ages
if 'date_of_birth' in df.columns:
    ages = (pd.Timestamp.now() - df['date_of_birth']).dt.days / 365.25
    unlikely = (ages > 100) | (ages < 0)
    if unlikely.sum() > 0:
        problems.append({
            'type': 'Unlikely age',
            'column': 'date_of_birth',
            'count': int(unlikely.sum()),
            'examples': df[unlikely]['date_of_birth'].head(5).tolist()
        })

# Print summary
print(f"Data quality check complete. Found {len(problems)} types of issues:\n")
for p in problems:
    percent = f" ({p['percent']})" if 'percent' in p else ""
    print(f"- {p['type']} in '{p.get('column', 'N/A')}': {p['count']} records{percent}")
    if 'examples' in p:
        print(f"  Examples: {p['examples'][:3]}")
    print()

# Export problems for LLM summary, stripping the example values
# (they may contain personal data) so only counts and patterns are shared
import json
safe_summary = [{k: v for k, v in p.items() if k != 'examples'} for p in problems]
print("\n--- Copy this to ChatGPT/Claude for a plain English summary ---\n")
print(json.dumps(safe_summary, indent=2, default=str))
At a glance
- Time to implement: hours
- Setup cost: free
- Ongoing cost: free
- Cost trend: stable
- Organisation size: small, medium, large
- Target audience: data-analyst, operations-manager, it-technical
All the tools are free. Once you've set up the checks, you can re-run them whenever you want.