
Check your data for problems automatically

data-analysis · intermediate · proven

The problem

Your database or spreadsheet has grown over the years and you suspect there are problems: duplicate records, invalid postcodes, dates that don't make sense, missing fields that should be filled in. You need to know what's wrong before it causes problems in reports or service delivery.

The solution

For a quick check, paste a sample of your data (with personal details removed or anonymised) into Claude or ChatGPT and ask it to spot problems. For a thorough check, use a Python script in Google Colab that scans for duplicates, invalid formats, outliers, and missing fields. Either way, you get a plain English summary of what's wrong and what to fix first.

What you get

A report listing the problems found, grouped by severity and type. Each issue includes examples and the number of affected records. The summary tells you what's most urgent to fix and suggests likely causes. You can export the problem records to clean up.

Before you start

  • Your data exported as a CSV or Excel file
  • A rough idea of what the data should look like (valid formats, required fields)
  • A Google account for Colab
  • For the LLM summary: a ChatGPT or Claude account

When to use this

  • You're about to do a major report and want to check the data first
  • You've inherited a database and don't know what state it's in
  • You're merging data from multiple sources and expect inconsistencies
  • You want to set up regular data quality monitoring

When not to use this

  • Your dataset is tiny and you can just eyeball it
  • You already have data validation built into your system at the point of entry
  • The data is temporary and will be thrown away anyway
  • You want to use the LLM quick-check with sensitive beneficiary data and don't have an enterprise AI agreement in place

Steps

  1. Quick option: paste a sample into Claude or ChatGPT

    For a fast check without any coding, export 50-100 rows as CSV and paste them into Claude or ChatGPT. IMPORTANT: Remove or anonymise personal data first (names, addresses, dates of birth, contact details) before pasting into any AI tool. If you need to check real data, use an enterprise version with a data processing agreement, or run the Python checks locally instead. Ask: "Check this data for problems: duplicates, missing values, invalid formats, anything that looks wrong." This catches obvious issues in minutes.
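    One way to prepare a safe sample is to drop the identifying columns before exporting, sketched here with pandas. The column names and the inline example data are illustrative; match the PII list to the fields your data actually has.

    ```python
    import pandas as pd

    # Example data standing in for your export; in practice load it with
    # df = pd.read_csv('your_data.csv')
    df = pd.DataFrame({
        'name': ['A Person', 'B Person', 'C Person'],
        'postcode': ['SW1A 1AA', 'M1 1AE', 'not a postcode'],
        'date_of_birth': ['1980-01-01', '1995-06-15', '2030-01-01'],
        'service_type': ['advice', 'advice', None],
    })

    # Drop identifying fields before sharing; extend this list to match your data
    pii_columns = ['name', 'address', 'date_of_birth', 'email', 'phone']
    safe = df.drop(columns=[c for c in pii_columns if c in df.columns])

    # Take up to 100 rows and save them ready to paste into the chat
    sample = safe.sample(n=min(100, len(safe)), random_state=1)
    sample.to_csv('sample_for_llm.csv', index=False)
    print(sample.columns.tolist())
    ```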

  2. Export your full data

    Get your data into a CSV or Excel file. If it's very large, you might start with a sample to test your checks before running on the full dataset. Note down what each column should contain (dates, postcodes, numbers, required fields, etc.).
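    It helps to write the "what each column should contain" notes down as a small spec you can keep next to the checks. A minimal sketch (the column names and rules are examples, not a fixed schema):

    ```python
    import pandas as pd

    # For a very large export, test your checks on a slice first, e.g.
    # df = pd.read_csv('your_data.csv', nrows=5000)

    # Write down what each column should contain; this becomes your validation
    # spec and doubles as documentation for whoever inherits the data next
    expected = {
        'name': 'text, required',
        'date_of_birth': 'date, required, not in the future',
        'postcode': 'UK postcode format',
        'service_type': 'one of a fixed list of services',
        'end_date': 'date, on or after start_date',
    }
    for column, rule in expected.items():
        print(f"{column}: {rule}")
    ```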

  3. Run basic checks in Colab

    Upload to Colab and use the example code to check for: missing values in required fields, duplicate records, values outside expected ranges, invalid formats (like postcodes or email addresses), and dates that don't make sense (in the future, before birth dates, etc.).

  4. Look for outliers

    Check for values that are technically valid but suspiciously unusual. Ages over 100, postcodes you don't normally cover, dates from decades ago. These aren't necessarily wrong but are worth flagging for review.
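    A sketch of both checks with pandas. The example data and the outward postcode prefixes are placeholders; use the areas your service actually covers.

    ```python
    import pandas as pd

    # Example data; in practice reuse the DataFrame you loaded earlier
    df = pd.DataFrame({
        'date_of_birth': pd.to_datetime(['1920-05-01', '1985-03-10', '1890-01-01']),
        'postcode': ['SW1A 1AA', 'M1 1AE', 'ZZ99 9ZZ'],
    })

    # Ages over 100 are technically valid but worth a second look
    ages = (pd.Timestamp.now() - df['date_of_birth']).dt.days / 365.25
    unusual_age = df[ages > 100]

    # Postcodes outside the areas you normally cover
    # (outward prefixes here are examples; replace with your service area)
    usual_areas = ('SW', 'M')
    unusual_area = df[~df['postcode'].str.startswith(usual_areas)]

    print(len(unusual_age), 'unusual ages,', len(unusual_area), 'out-of-area postcodes')
    ```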

  5. Check logical consistency

    Look for combinations that don't make sense: end dates before start dates, ages that don't match birth dates, services that don't match referral sources. These often indicate data entry errors or system problems.
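    Cross-field checks like these are short to write once you have the data loaded. A sketch assuming hypothetical `start_date`, `end_date`, `recorded_age`, and `date_of_birth` columns:

    ```python
    import pandas as pd

    # Example data; in practice reuse the DataFrame you loaded earlier
    df = pd.DataFrame({
        'start_date': pd.to_datetime(['2023-01-10', '2023-02-01', '2023-03-05']),
        'end_date': pd.to_datetime(['2023-01-20', '2023-01-15', '2023-04-01']),
        'recorded_age': [40, 39, 35],
        'date_of_birth': pd.to_datetime(['1983-06-01', '2000-01-01', '1988-02-14']),
    })

    # End date before start date usually means a data entry or import error
    reversed_dates = df[df['end_date'] < df['start_date']]

    # A recorded age that disagrees badly with the birth date is another red flag
    calculated_age = (pd.Timestamp.now() - df['date_of_birth']).dt.days // 365
    age_mismatch = df[(df['recorded_age'] - calculated_age).abs() > 2]

    print(len(reversed_dates), 'reversed date ranges,', len(age_mismatch), 'age mismatches')
    ```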

  6. Get a plain English summary

    Paste the findings (not the raw data, just the problem summary from the script) into ChatGPT or Claude and ask: "Here are the data quality issues I found. Please summarise the main problems, prioritise them by severity, and suggest likely causes." The script output contains counts and patterns, not personal data, so this is safe to share with AI tools.

  7. Export problem records for cleaning

    For each type of problem, export the affected records so you can fix them. Some issues are quick to fix in bulk (standardising formats), others need individual review (possible duplicates). Start with the most impactful issues.
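    One way to do the export, writing one CSV per problem type so each batch can be cleaned separately. The columns and file names are illustrative:

    ```python
    import pandas as pd

    # Example data with some deliberate gaps; in practice reuse your DataFrame
    df = pd.DataFrame({
        'name': ['A Person', None, 'C Person'],
        'service_type': ['advice', 'advice', None],
    })

    # Export one file per problem type so each can be cleaned up separately
    written = []
    for col in ['name', 'service_type']:
        bad = df[df[col].isna()]
        if not bad.empty:
            filename = f'missing_{col}.csv'
            bad.to_csv(filename, index=False)
            written.append(filename)

    print('Exported:', written)
    ```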

Example code

Basic data quality checks

This runs common checks on your data and summarises what's wrong. Adapt the column names and validation rules to match your data.

import pandas as pd
import re

# Load your data (CSV or Excel)
# For CSV files:
df = pd.read_csv('your_data.csv')
# For Excel files, use instead:
# df = pd.read_excel('your_data.xlsx')

problems = []

# Check for missing required fields
required_columns = ['name', 'date_of_birth', 'postcode', 'service_type']
for col in required_columns:
    if col in df.columns:
        missing = df[col].isna().sum()
        if missing > 0:
            problems.append({
                'type': 'Missing required field',
                'column': col,
                'count': missing,
                'percent': f"{100*missing/len(df):.1f}%"
            })

# Check for duplicates (adjust the columns if yours are named differently)
if {'name', 'date_of_birth'}.issubset(df.columns):
    duplicates = df.duplicated(subset=['name', 'date_of_birth'], keep=False)
    if duplicates.sum() > 0:
        problems.append({
            'type': 'Possible duplicates',
            'column': 'name + date_of_birth',
            'count': int(duplicates.sum()),
            'examples': df[duplicates][['name', 'date_of_birth']].head(5).to_dict('records')
        })

# Check postcode format (UK)
if 'postcode' in df.columns:
    uk_postcode_pattern = r'^[A-Z]{1,2}[0-9][0-9A-Z]?\s?[0-9][A-Z]{2}$'
    invalid_postcodes = ~df['postcode'].fillna('').str.upper().str.match(uk_postcode_pattern)
    invalid_count = invalid_postcodes.sum()
    if invalid_count > 0:
        problems.append({
            'type': 'Invalid postcode format',
            'column': 'postcode',
            'count': invalid_count,
            'examples': df[invalid_postcodes]['postcode'].head(10).tolist()
        })

# Check for future dates
if 'date_of_birth' in df.columns:
    df['date_of_birth'] = pd.to_datetime(df['date_of_birth'], errors='coerce')
    future_dates = df['date_of_birth'] > pd.Timestamp.now()
    if future_dates.sum() > 0:
        problems.append({
            'type': 'Date in the future',
            'column': 'date_of_birth',
            'count': future_dates.sum()
        })

# Check for unlikely ages
if 'date_of_birth' in df.columns:
    ages = (pd.Timestamp.now() - df['date_of_birth']).dt.days / 365.25
    unlikely = (ages > 100) | (ages < 0)
    if unlikely.sum() > 0:
        problems.append({
            'type': 'Unlikely age',
            'column': 'date_of_birth',
            'count': unlikely.sum(),
            'examples': df[unlikely]['date_of_birth'].head(5).tolist()
        })

# Print summary
print(f"Data quality check complete. Found {len(problems)} types of issues:\n")
for p in problems:
    print(f"- {p['type']} in '{p.get('column', 'N/A')}': {p['count']} records ({p.get('percent', '')})")
    if 'examples' in p:
        print(f"  Examples: {p['examples'][:3]}")
    print()

# Export problems for LLM summary
import json
print("\n--- Copy this to ChatGPT/Claude for a plain English summary ---\n")
print(json.dumps(problems, indent=2, default=str))

Tools

Google Colab · platform · freemium
pandas · library · free, open source
ChatGPT or Claude · service · freemium

At a glance

Time to implement: hours
Setup cost: free
Ongoing cost: free
Cost trend: stable
Organisation size: small, medium, large
Target audience: data analysts, operations managers, IT/technical staff

All the tools are free. Once you've set up the checks, you can re-run them whenever you want.
