Extract key facts from case notes

service-delivery · intermediate · emerging

The problem

You've got years of case notes in free-text format: 'Spoke to client about housing situation. Referred to partner org for debt advice. Client mentioned feeling stressed.' There's valuable information buried in there, but you can't query it. You can't answer questions like 'how many people with housing issues also have debt problems?' or 'which risk factors appear most often before a crisis?'

The solution

Use an LLM to read your case notes and extract structured facts: dates, interventions provided, services mentioned, risk indicators flagged, needs identified, outcomes noted. What was narrative becomes a database: you can query it for patterns, track intervention pathways, and spot early warning signs across your caseload.

What you get

A structured database with rows for each case note and columns for: client ID, date, interventions mentioned, needs identified, risk factors present, outcomes recorded, services accessed. You can now query 'show me all cases with housing + mental health needs' or 'what interventions precede positive outcomes?'
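
For illustration, the note quoted in 'The problem' might become a row like this. This is a hypothetical record: the ID, date, and category labels are made up, and in practice the values would come from your own taxonomy.

# A hypothetical example of what one extracted row could look like -
# values are illustrative only, not real client data
example_row = {
    'client_id': 'C-0417',              # anonymised ID, not a real client
    'date': '2024-03-18',
    'interventions': 'referral to partner organisation',
    'needs': 'housing, debt',
    'risk_factors': 'stress mentioned',
    'outcomes': '',
    'services_accessed': 'debt advice',
}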

Before you start

  • Case notes exported as text (CSV with one note per row)
  • Defined categories for what you want to extract (your services, risk factors, needs taxonomy)
  • An OpenAI or Anthropic API key for batch processing
  • Data protection approval to use AI tools with case data

When to use this

  • You've got hundreds or thousands of unstructured case notes
  • You need to answer questions that require querying across all cases
  • You want to identify patterns in service pathways or risk factors
  • Manual coding of notes would take longer than you have

When not to use this

  • Your case notes are already structured data in your CRM
  • You have very few cases; it might be quicker to code them manually
  • The notes are too brief or inconsistent for extraction to work
  • Data protection policy doesn't permit AI processing of case notes

Steps

  1. Check data protection permissions

    Before extracting anything, check with your data protection lead whether you can process case notes through AI tools. You may need to anonymise heavily first (removing names, specific locations, etc.). Some organisations will say no - respect that boundary.

  2. Define your extraction categories

    List what you want to pull out: service types you offer, common needs categories, risk indicators you track, intervention types, outcome measures. Be specific. 'Mental health mentioned' is better than 'wellbeing'. Create a controlled vocabulary so extraction is consistent.

  3. Anonymise your notes

    Remove or replace client names, addresses, and other identifying details. You want the factual content (what was discussed, what services were offered), not the personal identifiers; the extraction works just as well on anonymised text. A rough redaction sketch appears after these steps.

  4. Test extraction on a sample

    Take 20-30 notes and run them through a draft extraction prompt manually in Claude or ChatGPT. Does it identify the right categories? Does it miss important details? Does it hallucinate things that aren't there? Refine your prompt based on what you learn.

  5. Build your extraction prompt

    Create a prompt that includes your categories and asks for structured JSON output. Tell it to only extract facts present in the note, to mark confidence for uncertain extractions, and to flag when something is mentioned but the details are vague. Include 2-3 example notes with their correct extractions (a sketch of adding worked examples appears after these steps).

  6. Process in batches

    Use the API and the example code below to process all your notes. The code loops through each note, extracts the structured data, and builds a CSV. This might take a few hours for thousands of notes; monitor a sample as it runs to check quality stays consistent.

  7. Validate the results

    Spot-check extractions against the original notes. Do the extracted facts match what's written? Are categories applied consistently? Check a random sample of at least 100 notes (a sampling sketch follows the example code). If accuracy is below 85%, revise your prompt and re-run.

  8. Start querying your new database (optional)

    Now you can ask questions like: 'How many cases mentioned housing + debt together?', 'What's the typical pathway for mental health referrals?', 'Which risk factors most commonly precede crisis?'. Use spreadsheet filters or simple queries to find patterns (see the querying sketch after the example code).
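
A minimal redaction sketch for step 3, assuming your export has a note_text column and you already hold a structured list of client names (the names below are placeholders). Pattern-matching like this only catches what the patterns cover, so have your data protection lead review a sample of the output before anything leaves your systems.

import re
import pandas as pd

notes_df = pd.read_csv('case_notes.csv')  # assumes a note_text column

# Names you already hold as structured data, e.g. from your CRM export
known_names = ['Jane Smith', 'John Davies']  # placeholders

def anonymise(text):
    # Redact known client names (case-insensitive)
    for name in known_names:
        text = re.sub(re.escape(name), '[NAME]', text, flags=re.IGNORECASE)
    # Rough patterns for UK postcodes and phone numbers
    text = re.sub(r'\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b', '[POSTCODE]', text, flags=re.IGNORECASE)
    text = re.sub(r'\b0\d(?:[\s-]?\d){8,9}\b', '[PHONE]', text)
    return text

notes_df['note_text'] = notes_df['note_text'].astype(str).apply(anonymise)
notes_df.to_csv('case_notes_anonymised.csv', index=False)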
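
And for step 5, one way to include worked examples is to format a couple of hand-checked note/extraction pairs and paste them into the prompt just above the case note. A sketch with a placeholder pair; replace it with real examples from your test sample.

import json

# Hand-checked pairs from your test sample - the pair below is a placeholder
few_shot_examples = [
    {
        'note': 'Client attended drop-in. Discussed rent arrears and eviction notice. Booked appointment with debt adviser.',
        'extraction': {
            'services_mentioned': ['debt advice'],
            'needs_identified': ['financial hardship'],
            'risk_factors': ['eviction notice'],
            'outcomes_noted': ['referred to specialist'],
            'confidence': 90,
            'notes': 'Eviction notice stated explicitly'
        },
    },
]

def format_examples(examples):
    # Render each pair as "Example note / Correct extraction" text
    parts = []
    for ex in examples:
        parts.append(
            f"Example note:\n{ex['note']}\n\n"
            f"Correct extraction:\n{json.dumps(ex['extraction'], indent=2)}"
        )
    return '\n\n'.join(parts)

# In extract_from_note() in the example code, insert
# format_examples(few_shot_examples) into the prompt just before the
# 'Case note:' section.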

Example code

Extract structured facts from case notes

This processes case notes to extract structured information. Adapt the categories to match what you track.

from openai import OpenAI
import pandas as pd
import json
import time

client = OpenAI()

# Your categories to extract - adapt these
categories = {
    "services": ["housing support", "debt advice", "mental health", "food parcels", "benefits advice"],
    "needs": ["emergency accommodation", "financial hardship", "mental health crisis", "addiction support"],
    "risk_factors": ["eviction notice", "domestic abuse", "suicide ideation", "child protection concerns"],
    "outcomes": ["crisis averted", "referred to specialist", "ongoing support", "case closed"]
}

def extract_from_note(note_text):
    prompt = f"""Extract structured information from this case note.

Categories to look for:
{json.dumps(categories, indent=2)}

Return JSON with:
- services_mentioned: list of services from the categories (only if explicitly mentioned)
- needs_identified: list of needs from the categories
- risk_factors: list of risk factors present
- outcomes_noted: list of outcomes mentioned
- confidence: overall confidence in extraction (0-100)
- notes: any important context or uncertainty

Only extract information explicitly present in the note. Don't infer or guess.

Case note:
{note_text}

Return only valid JSON."""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )

    return json.loads(response.choices[0].message.content)

# Load case notes
notes_df = pd.read_csv('case_notes.csv')
print(f"Processing {len(notes_df)} case notes...")

results = []
for idx, row in notes_df.iterrows():
    if idx % 50 == 0:
        print(f"Progress: {idx}/{len(notes_df)}")

    try:
        extraction = extract_from_note(row['note_text'])

        # Flatten for CSV
        results.append({
            'note_id': row.get('note_id', idx),
            'client_id': row.get('client_id'),
            'date': row.get('date'),
            'services': ', '.join(extraction.get('services_mentioned', [])),
            'needs': ', '.join(extraction.get('needs_identified', [])),
            'risks': ', '.join(extraction.get('risk_factors', [])),
            'outcomes': ', '.join(extraction.get('outcomes_noted', [])),
            'confidence': extraction.get('confidence'),
            'extraction_notes': extraction.get('notes')
        })

        time.sleep(0.2)  # Rate limiting

    except Exception as e:
        print(f"Error processing note {idx}: {e}")
        results.append({
            'note_id': row.get('note_id', idx),
            'error': str(e)
        })

# Save results
output_df = pd.DataFrame(results)
output_df.to_csv('extracted_facts.csv', index=False)

print(f"\nExtraction complete. Processed {len(results)} notes")
print(f"Average confidence: {output_df['confidence'].mean():.1f}%")
print(f"\nLow confidence notes (< 70%) for review: {len(output_df[output_df['confidence'] < 70])}")

# Summary statistics
print("\nMost common services mentioned:")
all_services = [s for services in output_df['services'].dropna() for s in services.split(', ') if s]
print(pd.Series(all_services).value_counts().head(10))
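
One way to run the spot-check in step 7 is to pull a random sample of extractions alongside the original note text, mark each one by hand, and compute accuracy from the marks. A sketch, assuming both files share a note_id column:

import pandas as pd

notes_df = pd.read_csv('case_notes.csv')
extracted_df = pd.read_csv('extracted_facts.csv')

# Put each extraction next to the note it came from
review_df = extracted_df.merge(notes_df[['note_id', 'note_text']], on='note_id', how='left')

# Random sample of 100 (or everything, if you have fewer notes)
sample = review_df.sample(n=min(100, len(review_df)), random_state=42)
sample['reviewer_verdict'] = ''  # fill in 'correct' or 'incorrect' by hand
sample.to_csv('validation_sample.csv', index=False)

# After the manual review, reload the marked-up file and compute accuracy
reviewed = pd.read_csv('validation_sample.csv')
accuracy = (reviewed['reviewer_verdict'] == 'correct').mean() * 100
print(f"Accuracy on reviewed sample: {accuracy:.0f}% (revise the prompt and re-run if below 85%)")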
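
For the queries in step 8, simple pandas filters over the output file get you a long way. The search terms below are illustrative; swap in the wording from your own controlled vocabulary.

import pandas as pd

df = pd.read_csv('extracted_facts.csv')

# Cases where housing and debt appear together across services and needs
combined = df[['services', 'needs']].fillna('').agg(', '.join, axis=1)
housing_and_debt = df[
    combined.str.contains('housing', case=False)
    & combined.str.contains('debt', case=False)
]
print(f"Cases mentioning housing and debt together: {len(housing_and_debt)}")

# Most common risk factors across the caseload
risk_counts = (
    df['risks'].dropna()
    .str.split(', ')
    .explode()
    .loc[lambda s: s != '']
    .value_counts()
)
print(risk_counts.head(10))

# Outcomes recorded where a particular risk factor was flagged
eviction_cases = df[df['risks'].str.contains('eviction', case=False, na=False)]
print(eviction_cases['outcomes'].value_counts())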

Tools

  • OpenAI API · service · paid
  • Google Colab · platform · freemium
  • Claude · service · freemium

At a glance

  • Time to implement: weeks
  • Setup cost: low
  • Ongoing cost: low
  • Cost trend: decreasing
  • Organisation size: medium, large
  • Target audience: data-analyst, operations-manager, program-delivery

API costs are ~£0.001-0.01 per note depending on length. For 10,000 notes that's £10-100. Main cost is setup and validation time.
