
Prepare your data for different AI techniques

data-analysis · intermediate · proven

The problem

You have data and want to use AI, but different AI approaches need different preparation. LLMs work with text and need context. Machine learning classification needs labelled examples in specific formats. Statistical analysis needs clean numerical data. Preparing data the wrong way wastes time and produces poor results.

The solution

Match your data preparation to your AI technique. This recipe covers the three main approaches charities use: LLMs (for text tasks), classification models (for categorisation), and statistical analysis (for patterns and predictions). Each has different requirements for format, volume, and structure.

What you get

Data prepared correctly for your chosen AI technique. For LLMs: formatted prompts with context. For classification: labelled training data in the right structure. For statistics: clean numerical data ready for analysis. This preparation is often the difference between AI that works and AI that fails.

Before you start

  • Data you want to use for AI
  • A clear idea of what technique you'll use
  • Basic spreadsheet or coding skills
  • IMPORTANT: Service data often contains PII (beneficiary names, case notes, safeguarding details). Anonymise or pseudonymise before sending to cloud-based AI services. Check your data processing agreements cover this use case

When to use this

  • Before feeding data to any AI system
  • When switching between AI approaches
  • When an AI system is giving poor results (often a data prep issue)
  • When adapting a recipe that uses a different data format

When not to use this

  • Using pre-built tools that handle their own data prep
  • The AI tool specifies exact format requirements (follow those)

Steps

  1. Identify your AI technique

    Determine which approach you're using: LLM text processing (summarisation, extraction, generation), ML classification (categorising items into predefined groups), or statistical analysis (finding patterns, predictions, correlations). Each needs different prep.

  2. For LLMs: Prepare text with context

    LLMs need readable text with enough context to understand the task. Structure your data as clear prompts: include relevant background, specify the task, provide examples of desired output. Batch similar items together. Don't strip out context that helps understanding. Clean up obvious errors but don't over-sanitise.

  3. For LLMs: Format for consistency

    If processing many items, use consistent formatting. Put each document in the same structure. Use clear delimiters between sections. Include metadata (date, source, type) that might be relevant. Keep items self-contained - the LLM sees each in isolation.
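
A minimal sketch of consistent formatting with clear delimiters, using hypothetical enquiry data and a `===` delimiter chosen for illustration:

```python
import pandas as pd

# Hypothetical enquiry records for illustration
df = pd.DataFrame({
    'date': ['2024-01-05', '2024-01-09'],
    'source': ['web form', 'phone'],
    'notes': ['Needs help with housing application.',
              'Asking about food bank opening hours.'],
})

DELIMITER = '==='  # clear separator so sections never blur together

def format_item(row):
    """Wrap one record in a consistent, self-contained structure."""
    return (
        f"{DELIMITER}\n"
        f"Date: {row['date']}\n"
        f"Source: {row['source']}\n"
        f"Notes:\n{row['notes']}\n"
        f"{DELIMITER}"
    )

items = df.apply(format_item, axis=1).tolist()
print(items[0])
```

Because every item shares the same structure and metadata fields, the LLM can process each one in isolation without losing context.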

  4. For classification: Create labelled examples

    ML classification needs examples of each category. Export your data as CSV with columns for the features (inputs) and the label (the category you want to predict). You need at least 50 examples per category, ideally 200+. Balance matters: if 90% of examples are one category, the model will just predict that. For charity-specific classifications like safeguarding flags, accuracy is critical - a missed flag could have serious consequences.
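
A quick balance check like this catches the 90%-one-category problem before training. The data and the 50-example threshold below are illustrative:

```python
import pandas as pd

# Hypothetical labelled data: 'urgency' is the category to predict
labels = pd.Series(['routine'] * 90 + ['urgent'] * 10, name='urgency')

counts = labels.value_counts()
print(counts.to_dict())

# Flag categories with too few examples, and measure imbalance
MIN_PER_CLASS = 50
too_few = counts[counts < MIN_PER_CLASS].index.tolist()
majority_share = counts.max() / counts.sum()

print(f"Classes below {MIN_PER_CLASS} examples: {too_few}")
print(f"Majority class share: {majority_share:.0%}")
```

If the majority share is very high, collect more minority examples or use techniques like class weighting rather than training on the skewed data as-is.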

  5. For classification: Clean and encode

    Handle missing values (fill with median/mode or remove rows). Convert categories to numbers if required (one-hot encoding for nominal categories). Normalise numerical values to similar scales. Remove columns that leak the answer (like a "rejected" flag when predicting rejection). Note: if using `pd.get_dummies` for encoding, ensure consistent categories between training and future data to avoid errors.
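
One way to keep `pd.get_dummies` consistent between training and future data is to fix the category list up front with `pd.Categorical`. The source values below are hypothetical:

```python
import pandas as pd

# Fix the full category list up front so training and future data
# always produce the same one-hot columns, in the same order
SOURCES = ['web', 'phone', 'referral']

train = pd.DataFrame({'referral_source': ['web', 'phone']})
future = pd.DataFrame({'referral_source': ['referral']})  # absent from train

def encode(frame):
    frame = frame.copy()
    frame['referral_source'] = pd.Categorical(frame['referral_source'],
                                              categories=SOURCES)
    return pd.get_dummies(frame, columns=['referral_source'])

X_train = encode(train)
X_future = encode(future)

# Same columns either way, even though the raw values differ
print(list(X_train.columns))
```

Without the fixed categories, `get_dummies` would create different columns for each dataset and the model would fail (or silently misread) the new data.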

  6. For statistics: Ensure numerical data

    Statistical analysis needs numbers. Convert dates to numerical values (days since epoch, month number). Convert categories to dummy variables or numerical codes. Handle missing values explicitly (don't let them become zeros silently). Check for outliers that might skew results.
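
A sketch of converting dates to numbers while keeping missing values explicit. The reference date and sample values are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'enquiry_date': ['2024-01-01', '2024-03-15', None]})

# Parse dates; unparseable values become NaT rather than silently wrong
df['enquiry_date'] = pd.to_datetime(df['enquiry_date'], errors='coerce')

# Numerical versions: days since a fixed reference, plus the month number
reference = pd.Timestamp('2024-01-01')
df['days_since_ref'] = (df['enquiry_date'] - reference).dt.days
df['month'] = df['enquiry_date'].dt.month

# Make the missing value explicit instead of letting it become zero
print(f"Missing dates: {df['days_since_ref'].isna().sum()}")
df = df.dropna(subset=['days_since_ref'])
```

The `errors='coerce'` plus explicit `dropna` pattern means a bad date shows up as a counted, deliberate exclusion, never a silent zero.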

  7. For statistics: Check distributions

    Look at your numerical columns: are they normally distributed, skewed, or bimodal? Many statistical techniques assume normal distributions. Consider transformations (log, square root) for highly skewed data. Identify and decide how to handle outliers.
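
A minimal skew check and log transform on synthetic right-skewed data (simulated response times, purely for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed variable, e.g. response times in days
response_days = pd.Series(rng.exponential(scale=5.0, size=1000))

print(f"Skew before: {response_days.skew():.2f}")

# log1p handles zeros safely and pulls in the long right tail
transformed = np.log1p(response_days)
print(f"Skew after:  {transformed.skew():.2f}")
```

Skew near zero suggests the transformed variable is closer to the normality many statistical tests assume; strongly positive skew before the transform is the signal to consider one.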

  8. Validate before using

    Before running your full analysis, test with a small sample. For LLMs: try 5-10 examples and check outputs. For classification: does the model train without errors? For statistics: do summary statistics look sensible? Early validation catches prep problems before you waste hours. IMPORTANT: Always have a human verify AI-categorised data before using it for decision-making, especially for sensitive classifications.
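
Small-sample validation can be a few lines of sanity checks before the full run. The column names and expected range below are assumptions for illustration:

```python
import pandas as pd

# Hypothetical prepared dataset
df = pd.DataFrame({'notes_length': [120, 45, 300, 80, 15],
                   'urgency': [0, 1, 2, 0, 1]})

# Take a small sample before committing to a full run
sample = df.sample(n=3, random_state=42)

# Sanity checks that catch common prep mistakes early
assert sample.notna().all().all(), "Unexpected missing values"
assert sample['urgency'].between(0, 3).all(), "Urgency outside expected range"
print(sample.describe())
```

If any assertion fails, fix the preparation step before processing the full dataset; the same applies to eyeballing a handful of LLM outputs.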

Example code

Prepare data for different techniques

Example code showing preparation for LLM, classification, and statistics.

import pandas as pd
import numpy as np

# Load your data
df = pd.read_csv('enquiries.csv')

# ============================================
# PREPARATION FOR LLM TEXT PROCESSING
# ============================================
# Goal: Summarise enquiry notes

def prepare_for_llm(row):
    """Format a row as a prompt for an LLM."""
    return f"""Enquiry from {row['date']}
Source: {row['referral_source']}
Service requested: {row['service_type']}

Notes:
{row['notes']}

---
Please summarise this enquiry in 2-3 sentences, highlighting the key need."""

# Create prompts for each row
llm_prompts = df.apply(prepare_for_llm, axis=1).tolist()

# ============================================
# PREPARATION FOR CLASSIFICATION
# ============================================
# Goal: Predict urgency from enquiry features

# Select features and target
features = ['referral_source', 'service_type', 'notes_length', 'has_safeguarding_mention']
target = 'urgency'

# Create notes_length feature (treat missing notes as empty so length is 0)
df['notes_length'] = df['notes'].fillna('').str.len()

# Create binary feature for safeguarding mentions
# na=False so rows with missing notes count as no mention instead of raising an error
df['has_safeguarding_mention'] = df['notes'].str.lower().str.contains('safeguard|concern|risk', na=False).astype(int)

# One-hot encode categorical features
df_encoded = pd.get_dummies(df[features + [target]], columns=['referral_source', 'service_type'])

# Encode target (urgency) as numbers
urgency_map = {'routine': 0, 'soon': 1, 'urgent': 2, 'emergency': 3}
df_encoded['urgency_numeric'] = df_encoded[target].map(urgency_map)

# Split features and target
X = df_encoded.drop([target, 'urgency_numeric'], axis=1)
y = df_encoded['urgency_numeric']

print(f"Classification data: {len(X)} samples, {len(X.columns)} features")
print(f"Class distribution: {y.value_counts().to_dict()}")

# ============================================
# PREPARATION FOR STATISTICAL ANALYSIS
# ============================================
# Goal: Analyse patterns in response times

# Convert dates to numerical values
df['enquiry_date'] = pd.to_datetime(df['date'])
df['response_date'] = pd.to_datetime(df['first_response_date'])
df['response_days'] = (df['response_date'] - df['enquiry_date']).dt.days

# Handle missing/invalid values
df['response_days'] = df['response_days'].clip(lower=0)  # No negative response times
df = df.dropna(subset=['response_days'])  # Remove rows with no response

# Check for outliers
q99 = df['response_days'].quantile(0.99)
print(f"99th percentile response time: {q99} days")
df_clean = df[df['response_days'] <= q99]  # Remove extreme outliers

# Summary statistics
print(f"Mean response: {df_clean['response_days'].mean():.1f} days")
print(f"Median response: {df_clean['response_days'].median():.1f} days")
print(f"Std dev: {df_clean['response_days'].std():.1f} days")

# Group by category for comparison
print(df_clean.groupby('service_type')['response_days'].agg(['mean', 'median', 'count']))

Tools

Google Colab or Excel · platform · free
Claude or ChatGPT · service · freemium

At a glance

Time to implement: hours
Setup cost: free
Ongoing cost: free
Cost trend: stable
Organisation size: small, medium, large
Target audience: data-analyst, it-technical

Preparation is free; the AI technique may have costs.