Structure your data collection for future AI use

data-analysis · beginner · proven

The problem

You're setting up a new system, form, or process and want to make sure the data you collect will be usable for AI in the future. It's much cheaper to collect data well from the start than to clean it up later. But you're not sure what 'AI-ready' data collection looks like or what mistakes to avoid.

The solution

Follow a set of principles when designing data collection: use structured fields over free text where possible, enforce consistent formats, capture timestamps, record categories that might be useful for analysis, and avoid collecting data you can't actually use. This doesn't mean complex systems - even a well-designed spreadsheet beats a badly designed database.

What you get

A data collection design (form, spreadsheet, or system configuration) that will be usable for AI and analysis later. You'll avoid common mistakes that make data unusable: inconsistent categories, ambiguous fields, missing timestamps, and unstructured dumps that require manual interpretation.

Before you start

  • A data collection process you're designing or redesigning
  • Understanding of what you might want to analyse later
  • Authority to influence how data is collected

When to use this

  • Setting up a new CRM, database, or system
  • Designing intake forms or assessment tools
  • Reviewing existing data collection for improvements
  • After discovering your data is unusable for analysis

When not to use this

  • You have no control over data collection
  • The data is temporary and will never be analysed
  • You're collecting data for a one-off project

Steps

  1. Use structured fields over free text

    Wherever possible, use dropdowns, checkboxes, or radio buttons instead of free text. "Referral source" as a dropdown with 10 options is far more useful than a free text field where people type "google", "Google", "internet", "web search", "they googled us": all five mean the same thing, but each would be counted as a separate source. Free text is fine for notes, but not for data you want to analyse.
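A minimal sketch of why this matters, using the five answers quoted above. The dropdown label "Web search" is a hypothetical option name, not one prescribed by this recipe.

```python
from collections import Counter

# The five free-text answers quoted in this step.
free_text_answers = ["google", "Google", "internet", "web search", "they googled us"]

# The same five referrals recorded through a dropdown.
# "Web search" is a hypothetical option label.
dropdown_answers = ["Web search"] * 5

free_text_counts = Counter(free_text_answers)
dropdown_counts = Counter(dropdown_answers)

# Free text splits one real-world source into five separate values.
print(len(free_text_counts), "distinct values from free text")
# The dropdown gives one value that can be counted directly.
print(len(dropdown_counts), "distinct value from the dropdown")
```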

  2. Standardise categories upfront

    Define your categories before collecting. What are the valid service types? Referral sources? Outcome categories? Write these down and enforce them. Add "Other" with a free text field for exceptions, but make people choose a category first. Review "Other" responses monthly to see if you need new categories.
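If your form or import script lets you run code, the rule above could be sketched roughly like this. The category names are taken from the example form later in this recipe; the function name `validate_referral_source` and its parameters are illustrative assumptions, not part of any particular tool.

```python
from enum import Enum
from typing import Optional


class ReferralSource(Enum):
    # Categories borrowed from the example referral form in this recipe.
    GP = "GP"
    SELF = "Self"
    SOCIAL_SERVICES = "Social Services"
    SCHOOL = "School"
    OTHER = "Other"


def validate_referral_source(source: str, other_detail: Optional[str] = None) -> ReferralSource:
    """Accept only listed categories; require detail when 'Other' is chosen."""
    try:
        category = ReferralSource(source)
    except ValueError:
        raise ValueError(f"{source!r} is not a valid referral source") from None
    if category is ReferralSource.OTHER and not (other_detail or "").strip():
        raise ValueError("Please specify the source when choosing 'Other'")
    return category
```

Keeping the "Other" detail in its own field is what makes the monthly review practical: you can list those answers and decide whether any deserve a category of their own.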

  3. Capture timestamps automatically

    Record when things happen: created date, modified date, status change dates. These are essential for trend analysis and often impossible to reconstruct later. Most systems can do this automatically. For spreadsheets, add date columns and train people to fill them.
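For a system you build or script yourself, one way to capture these dates without relying on people is sketched below. The `Referral` class and its field names are hypothetical, and the choice of UTC is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


def utc_now() -> datetime:
    """Current time with an explicit UTC timezone attached."""
    return datetime.now(timezone.utc)


@dataclass
class Referral:
    client_id: str
    status: str = "new"
    # Filled in by the system when the record is created, never typed by hand.
    created_at: datetime = field(default_factory=utc_now)
    modified_at: datetime = field(default_factory=utc_now)
    status_changed_at: datetime = field(default_factory=utc_now)

    def set_status(self, new_status: str) -> None:
        """Change the status and record when the change happened."""
        now = utc_now()
        self.status = new_status
        self.status_changed_at = now
        self.modified_at = now
```

If you need the full history of status changes rather than only the latest one, you would append each change to a separate log instead of overwriting `status_changed_at`.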

  4. Separate different types of information

    Don't combine multiple things in one field. "John Smith - referred by GP - urgent" in a single notes field is impossible to analyse. Use separate fields: name, referral source, priority. This applies to addresses too: separate street, city, postcode rather than one address blob.
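As an illustration, here is the single-field example above split into separate fields. The class name, field names, and all sample values (including the addresses) are made up for this sketch.

```python
from dataclasses import dataclass


@dataclass
class ReferralRecord:
    first_name: str
    last_name: str
    referral_source: str
    priority: str
    street: str
    city: str
    postcode: str


# Made-up sample rows; the first mirrors "John Smith - referred by GP - urgent".
records = [
    ReferralRecord("John", "Smith", "GP", "Urgent", "1 Example Street", "Leeds", "LS1 1AA"),
    ReferralRecord("Jane", "Doe", "School", "Routine", "2 Sample Road", "Leeds", "LS2 2BB"),
    ReferralRecord("Sam", "Lee", "GP", "Routine", "3 Test Lane", "York", "YO1 3CC"),
]

# With separate fields, questions become one-line filters.
urgent_from_gp = [r for r in records if r.referral_source == "GP" and r.priority == "Urgent"]
in_leeds = [r for r in records if r.city == "Leeds"]

print(len(urgent_from_gp), "urgent GP referral(s)")
print(len(in_leeds), "referral(s) in Leeds")
```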

  5. Use consistent formats

    Enforce formats for dates (pick one, such as DD/MM/YYYY or YYYY-MM-DD, and use it everywhere), postcodes (uppercase, with space), and phone numbers (one consistent format). Validation at entry is better than cleaning later. For spreadsheets, use data validation. For forms, use input masks.
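Where you control the code behind a form or import, entry-time validation might look roughly like this. The function names are hypothetical, and the postcode pattern is a deliberately simplified assumption about the general shape of UK postcodes, not the full official rule set.

```python
import re
from datetime import date, datetime

# Simplified UK postcode shape (an assumption for illustration only).
POSTCODE_PATTERN = re.compile(r"^[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][A-Z]{2}$")


def parse_date(text: str) -> date:
    """Accept only day/month/four-digit-year and return a real date object."""
    return datetime.strptime(text.strip(), "%d/%m/%Y").date()


def normalise_postcode(text: str) -> str:
    """Uppercase the postcode and put one space before the last three characters."""
    compact = re.sub(r"\s+", "", text).upper()
    candidate = f"{compact[:-3]} {compact[-3:]}"
    if not POSTCODE_PATTERN.match(candidate):
        raise ValueError(f"{text!r} does not look like a UK postcode")
    return candidate


print(parse_date("05/01/2024").isoformat())
print(normalise_postcode("sw1a 1aa"))
```

Both functions raise `ValueError` on input they cannot interpret, so a bad value is rejected at entry instead of being stored and cleaned later.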

  6. Record what you might want to group by

    Think about what questions you'll ask later: outcomes by service type, retention by referral source, costs by programme. Make sure you're capturing the fields you'd group by. If you'll want to compare outcomes by location, make sure location is captured consistently.

  7. Don't collect what you won't use

    Every field creates maintenance burden and privacy risk. If you won't analyse it or report on it, question whether you need it. "Nice to have" fields often stay empty and add noise. Be ruthless: collect what you need, not what you might possibly want someday.

    Under UK GDPR, collecting data for "future AI use" may require: (1) updating your privacy notice to reflect this purpose, (2) conducting a Data Protection Impact Assessment (DPIA) if you plan to use AI for profiling or automated decision-making, and (3) ensuring a lawful basis beyond consent if you want to use historical data. The ICO has guidance on AI and data protection.

  8. Document your schema

    Write down what each field means, what values are valid, and why it exists. This takes 30 minutes and saves hours of confusion later. Include: field name, description, data type, valid values, who fills it in, and when. Store this alongside your data.
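One possible way to keep this documentation next to the data is as a small machine-readable file. The sketch below uses the six items listed in this step as keys; the key names, the helper `missing_documentation`, and the "filled_by" and "filled_when" values are assumptions, while the valid values come from the example form in this recipe.

```python
import json

REQUIRED_KEYS = {"field", "description", "type", "valid_values", "filled_by", "filled_when"}

# Two fields from the example referral form; who/when values are assumed.
SCHEMA = [
    {
        "field": "referral_source",
        "description": "How the client came to us",
        "type": "dropdown",
        "valid_values": ["GP", "Self", "Social Services", "School", "Other"],
        "filled_by": "Staff member taking the referral",
        "filled_when": "At intake",
    },
    {
        "field": "urgency",
        "description": "How quickly the client needs a response",
        "type": "dropdown",
        "valid_values": ["Routine", "Soon", "Urgent", "Emergency"],
        "filled_by": "Staff member taking the referral",
        "filled_when": "At intake",
    },
]


def missing_documentation(schema):
    """Return {field name: [missing keys]} for entries with gaps."""
    problems = {}
    for entry in schema:
        missing = sorted(REQUIRED_KEYS - set(entry))
        if missing:
            problems[entry.get("field", "<unnamed>")] = missing
    return problems


# Print the schema as JSON so it can be saved alongside the data.
print(json.dumps(SCHEMA, indent=2))
```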

Example code

Example: AI-ready referral form design

Comparison of bad vs good data collection design.

# Referral Form Design Comparison

## Bad design (hard to analyse)

| Field | Type | Problem |
|-------|------|---------|
| Client info | Free text | Name, DOB, address all mixed together |
| Date | Free text | People write "Jan 5th", "5/1/24", "05-01-2024" |
| Service needed | Free text | 200 different ways to say the same 8 services |
| Notes | Free text | Everything dumped here, no structure |
| Urgency | Not captured | Critical info missing |

## Good design (AI-ready)

| Field | Type | Validation |
|-------|------|------------|
| First name | Text | Required |
| Last name | Text | Required |
| Date of birth | Date | DD/MM/YYYY format enforced |
| Postcode | Text | UK postcode format validation |
| Referral date | Date | Auto-filled with today, editable |
| Referral source | Dropdown | GP / Self / Social Services / School / Other |
| If Other, specify | Text | Only shows if Other selected |
| Primary service | Dropdown | List of 8 services |
| Secondary service | Dropdown | Same list, optional |
| Urgency | Dropdown | Routine / Soon / Urgent / Emergency |
| Additional notes | Text | Free text for context |
| Created timestamp | Auto | System-generated |

## Why this matters for AI
- Can count referrals by source instantly
- Can analyse service demand patterns
- Can predict urgency from referral characteristics
- Can identify which sources lead to which services
- Dates enable trend analysis over time
- Structured data simplifies funder reporting (e.g., "65% of referrals came from GPs")
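
To make the first and last points concrete, here is a small sketch of the kind of counting that structured fields allow. The rows and service names are made-up sample data, so the percentage printed applies only to this sample.

```python
from collections import Counter

# Hypothetical rows exported from the "good design" form (made-up values).
referrals = [
    {"referral_source": "GP", "primary_service": "Counselling", "urgency": "Routine"},
    {"referral_source": "GP", "primary_service": "Housing advice", "urgency": "Urgent"},
    {"referral_source": "School", "primary_service": "Counselling", "urgency": "Soon"},
    {"referral_source": "Self", "primary_service": "Counselling", "urgency": "Routine"},
]

# Count referrals by source.
by_source = Counter(row["referral_source"] for row in referrals)

# Share of referrals from GPs, as a funder report might state it.
gp_share = 100 * by_source["GP"] / len(referrals)

# Which sources lead to which services.
source_to_service = Counter(
    (row["referral_source"], row["primary_service"]) for row in referrals
)

print(by_source.most_common())
print(f"{gp_share:.0f}% of referrals came from GPs")  # 50% for this made-up sample
```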

Tools

Any form builder or database · platform · free

At a glance

Time to implement: hours
Setup cost: free
Ongoing cost: free
Cost trend: stable
Organisation size: micro, small, medium, large
Target audience: operations-manager, data-analyst, it-technical

Good design costs nothing extra but saves huge amounts later.