
Process documents in bulk with LLM APIs

operations · intermediate · proven

The problem

You've got 100 grant applications to review, 50 case notes to summarise, or 200 beneficiary feedback forms to extract themes from. Each is a separate document (PDF, Word doc, text file). Reading and analysing them manually would take weeks. You need to apply consistent AI analysis across all documents, but copying each into Claude.ai is still too slow.

The solution

Write a script that loops through a folder of documents, extracts the text, sends it to an LLM API with a consistent prompt (e.g., 'score this grant application' or 'summarise this case note'), and saves the structured results to a spreadsheet. This is the batch processing pattern for documents instead of CSV rows.

What you get

A CSV file with one row per document containing AI-generated analysis. For grant applications: score, strengths, weaknesses, recommendation. For case notes: summary, key issues, follow-up actions. For feedback forms: themes, sentiment, key quotes. A typical run processes 50-200 documents in 20-60 minutes.

Before you start

  • Folder of documents to process (PDF, DOCX, or TXT files)
  • API key from OpenAI or Anthropic (budget £10-50 depending on volume)
  • Clear rubric: what should the AI extract or assess from each document?
  • Python environment (Google Colab works, or local installation)
  • Documents are in English or another language supported by the LLM

When to use this

  • You have 20+ documents that need the same analysis applied
  • Documents are in standard formats (PDF, Word, plain text)
  • The analysis is well-defined (score against rubric, extract specific information, summarise)
  • Manual reading would take days or weeks
  • You can tolerate 85-95% accuracy with human review of edge cases

When not to use this

  • Fewer than 20 documents (manual processing is faster)
  • Documents are scanned images without OCR (need to extract text first)
  • Each document needs highly customised assessment (not a standard template)
  • Documents contain highly sensitive information you can't send to external APIs
  • You need 100% accuracy (always budget for human review)
  • Documents are very long (>100 pages), which may exceed API context limits

Steps

  1. Create your assessment rubric or extraction template

    Define exactly what you want from each document. For grant applications: 'Score 1-10 on: alignment with our priorities, feasibility, budget reasonableness. List 3 strengths and 3 weaknesses. Recommend: fund, maybe, reject.' For case notes: 'Summarise in 2-3 sentences, list key issues flagged, suggest follow-up actions.' Test this rubric manually on 3 documents to ensure it works.

  2. Prepare your documents in one folder

    Put all documents to process in a single folder. Use clear filenames (e.g., 'grant-app-001.pdf', 'case-note-2024-01-15.docx'). If documents are scanned PDFs without text, you'll need to OCR them first (separate step). Organise subfolders if you have different document types needing different prompts.
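    Before going further, it can help to inventory the folder by file type so you know what the pipeline will encounter - a minimal sketch (the folder path is a placeholder):

    import os
    from collections import Counter

    DOCS_FOLDER = './grant-applications'  # placeholder path

    # Count files by extension to see what the pipeline will need to handle
    counts = Counter(os.path.splitext(f)[1].lower() for f in os.listdir(DOCS_FOLDER))
    for ext, n in sorted(counts.items()):
        print(f"{ext or '(no extension)'}: {n} file(s)")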

  3. Set up code with text extraction

    Use the example code below. It handles PDF, DOCX, and TXT files. For PDFs: uses PyPDF2 to extract text. For DOCX: uses python-docx. For TXT: reads directly. Test text extraction on 3 documents first - check the extracted text looks right (no garbled characters, structure preserved). Some complex PDFs extract poorly.
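    Once the example code's DOCS_FOLDER and extract_text_from_pdf are defined, a spot-check like this sketch catches garbled extractions before you spend API credits:

    # Spot-check extraction quality on the first 3 PDFs before calling the API
    sample = [f for f in os.listdir(DOCS_FOLDER) if f.lower().endswith('.pdf')][:3]
    for pdf_file in sample:
        text = extract_text_from_pdf(os.path.join(DOCS_FOLDER, pdf_file))
        print(f"--- {pdf_file} ({len(text or '')} characters) ---")
        print((text or 'EXTRACTION FAILED')[:500])  # eyeball for garbled text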

  4. Test your prompt on 5 documents

    Modify the code to process only 5 documents (the example code already includes this limit). Run it. Check outputs carefully: does the AI follow your rubric? Are scores reasonable? Is extracted information accurate? Common issues: the AI misses key details in long documents, scores are too generous, or formatting is inconsistent. Refine the prompt and re-test.
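    A quick validation pass over the test outputs makes problems easier to spot - a sketch, assuming the results list and JSON keys from the grant-scoring example below:

    # Check scores are in range and recommendations come from the allowed set
    ALLOWED = {'FUND', 'MAYBE', 'REJECT'}
    for row in results:
        for key in ('alignment_score', 'feasibility_score', 'budget_score'):
            score = row.get(key)
            if not (isinstance(score, (int, float)) and 1 <= score <= 10):
                print(f"{row['filename']}: suspicious {key} = {score}")
        if row.get('recommendation') not in ALLOWED:
            print(f"{row['filename']}: unexpected recommendation = {row.get('recommendation')}")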

  5. Run the full batch

    Remove the document limit and run on all documents. For 100 documents this typically takes 30-60 minutes depending on length and API speed. The code shows progress and handles rate limits. When complete, you'll have a CSV with one row per document containing the AI analysis.
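    The fixed one-second pause in the example code is usually enough, but if a large batch still hits rate-limit errors, a retry wrapper with exponential backoff is more robust - a sketch (call_api is a placeholder for either API call):

    import time

    def with_retries(call_api, max_attempts=4):
        """Retry a flaky API call with exponential backoff (1s, 2s, 4s...)."""
        for attempt in range(max_attempts):
            try:
                return call_api()
            except Exception as e:
                if attempt == max_attempts - 1:
                    raise
                wait = 2 ** attempt
                print(f"Attempt {attempt + 1} failed ({e}); retrying in {wait}s")
                time.sleep(wait)

    # Usage: assessment = with_retries(lambda: assess_document(client, text, pdf_file))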

  6. Review outputs and flag anomalies

    Open the results CSV. Sort by AI scores or key fields. Spot-check 15-20 documents against the original: is the analysis accurate? Flag any documents where the AI clearly failed (extracted nonsense, missed key information, contradicted itself). Review these manually.
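    pandas makes the review quicker - a sketch, assuming the grant-scores CSV produced by the example code:

    import pandas as pd

    df = pd.read_csv('grant_scores.csv')

    # Documents that errored out - review these manually
    print(df[df['error'].notna()][['filename', 'error']])

    # Extreme scores are where the AI is most likely to have gone wrong
    ranked = df.sort_values('alignment_score')
    print(ranked.head(5)[['filename', 'alignment_score', 'recommendation']])
    print(ranked.tail(5)[['filename', 'alignment_score', 'recommendation']])

    # Random sample of filenames to spot-check against the originals
    print(df.sample(min(15, len(df)))['filename'].tolist())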

  7. Handle failed extractions (optional)

    Some documents might have failed text extraction (complex PDFs, protected documents, corrupt files). The code logs these. For failed documents: try opening in Word and saving as plain text, or use OCR if they're scanned images. Re-run just the failed documents.
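    To re-run only the failures, filter the results CSV for rows with errors and point the main loop at those filenames - a sketch, assuming the output columns from the grant-scoring example:

    import pandas as pd

    df = pd.read_csv('grant_scores.csv')

    # Filenames that failed extraction or assessment on the first pass
    failed = df.loc[df['error'].notna(), 'filename'].tolist()
    print(f"{len(failed)} documents to retry: {failed}")

    # In the main script, use this list instead of os.listdir(DOCS_FOLDER)
    pdf_files = failed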

Example code

Process grant applications in bulk

Batch process grant applications against a scoring rubric. Install: pip install openai PyPDF2 pandas tqdm

import os
import json
import time

import pandas as pd
from openai import OpenAI
from PyPDF2 import PdfReader
from tqdm import tqdm

# Configuration
API_KEY = 'your-openai-api-key'  # better: load from an environment variable
DOCS_FOLDER = './grant-applications'  # Folder with PDFs
OUTPUT_CSV = 'grant_scores.csv'

# Your assessment rubric
ASSESSMENT_PROMPT = """You are reviewing a grant application. Assess it using this rubric:

1. Alignment with our priorities (1-10 score): How well does it match our funding criteria?
2. Feasibility (1-10 score): Is the project realistic and achievable?
3. Budget quality (1-10 score): Is the budget clear, reasonable, and well-justified?

Also provide:
- 3 key strengths (bullet points)
- 3 key weaknesses (bullet points)
- Overall recommendation: FUND, MAYBE, or REJECT
- Brief rationale (2-3 sentences)

Return as JSON:
{
  "alignment_score": 7,
  "feasibility_score": 8,
  "budget_score": 6,
  "strengths": ["...", "...", "..."],
  "weaknesses": ["...", "...", "..."],
  "recommendation": "MAYBE",
  "rationale": "..."
}"""

def extract_text_from_pdf(pdf_path):
    """Extract text from a PDF file; returns None if extraction fails."""
    try:
        reader = PdfReader(pdf_path)
        text = ""
        for page in reader.pages:
            # extract_text() can return None for image-only pages
            text += page.extract_text() or ""
        return text.strip() or None
    except Exception:
        return None

def assess_document(client, text, filename):
    """Send document text to OpenAI for assessment; returns a parsed JSON dict."""
    try:
        response = client.chat.completions.create(
            model='gpt-4o-mini',
            messages=[
                {'role': 'system', 'content': ASSESSMENT_PROMPT},
                {'role': 'user', 'content': f"Document: {filename}\n\n{text}"}
            ],
            temperature=0.3,
            max_tokens=1000,
            # Ask for a JSON object so json.loads() below is reliable
            response_format={'type': 'json_object'}
        )
        return json.loads(response.choices[0].message.content)

    except Exception as e:
        return {'error': str(e)}

# Initialize the OpenAI client
client = OpenAI(api_key=API_KEY)
results = []

# Get all PDF files
pdf_files = sorted(f for f in os.listdir(DOCS_FOLDER) if f.lower().endswith('.pdf'))

# For testing: process first 5 only
# Remove this line when ready for full batch
pdf_files = pdf_files[:5]

print(f"Processing {len(pdf_files)} documents...")

# Process each document
for pdf_file in tqdm(pdf_files):
    pdf_path = os.path.join(DOCS_FOLDER, pdf_file)

    # Extract text
    print(f"\nExtracting text from {pdf_file}...")
    text = extract_text_from_pdf(pdf_path)

    if not text:
        print(f"  Failed to extract text from {pdf_file}")
        results.append({
            'filename': pdf_file,
            'error': 'Text extraction failed'
        })
        continue

    # Assess document
    print(f"  Assessing {pdf_file}...")
    assessment = assess_document(client, text, pdf_file)

    # Store results
    results.append({
        'filename': pdf_file,
        'alignment_score': assessment.get('alignment_score'),
        'feasibility_score': assessment.get('feasibility_score'),
        'budget_score': assessment.get('budget_score'),
        'strengths': ' | '.join(assessment.get('strengths', [])),
        'weaknesses': ' | '.join(assessment.get('weaknesses', [])),
        'recommendation': assessment.get('recommendation'),
        'rationale': assessment.get('rationale'),
        'error': assessment.get('error')
    })

    # Rate limiting
    time.sleep(1)

# Save results
df = pd.DataFrame(results)
df.to_csv(OUTPUT_CSV, index=False)

print(f"\nDone! Results saved to {OUTPUT_CSV}")
print(f"Successfully processed: {df['error'].isna().sum()} documents")
print(f"Errors: {df['error'].notna().sum()} documents")

# Show summary statistics
if df['error'].isna().sum() > 0:
    print(f"\nAverage scores:")
    print(f"  Alignment: {df['alignment_score'].mean():.1f}/10")
    print(f"  Feasibility: {df['feasibility_score'].mean():.1f}/10")
    print(f"  Budget: {df['budget_score'].mean():.1f}/10")
    print(f"\nRecommendations:")
    print(df['recommendation'].value_counts())

Summarise case notes in bulk using Claude

Batch summarise case notes with a structured template. Install: pip install anthropic python-docx pandas tqdm

import os
import json
import time

import pandas as pd
import anthropic
from docx import Document
from tqdm import tqdm

# Configuration
API_KEY = 'your-anthropic-api-key'  # better: load from an environment variable
DOCS_FOLDER = './case-notes'  # Folder with DOCX files
OUTPUT_CSV = 'case_summaries.csv'

SUMMARY_PROMPT = """Summarise this case note following this template:

1. Brief summary (2-3 sentences): What happened in this interaction?
2. Key issues identified (bullet list): What concerns or needs were flagged?
3. Actions taken (bullet list): What was done or arranged?
4. Follow-up required (bullet list): What needs to happen next?
5. Urgency (LOW, MEDIUM, HIGH): How urgent is follow-up?

Return only a JSON object with these keys:
{"summary": "...", "key_issues": ["..."], "actions_taken": ["..."], "follow_up": ["..."], "urgency": "LOW"}"""

def extract_text_from_docx(docx_path):
    """Extract text from Word document"""
    try:
        doc = Document(docx_path)
        return '\n'.join([para.text for para in doc.paragraphs])
    except Exception as e:
        return None

def summarise_case_note(text, filename, client):
    """Send case note to Claude for summarisation; returns a parsed JSON dict."""
    try:
        message = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1000,
            temperature=0.3,
            messages=[
                {
                    "role": "user",
                    "content": f"{SUMMARY_PROMPT}\n\nCase note: {filename}\n\n{text}"
                }
            ]
        )

        # Claude sometimes wraps JSON in markdown fences; strip them before parsing
        raw = message.content[0].text.strip()
        if raw.startswith('```'):
            raw = raw.strip('`').removeprefix('json').strip()
        return json.loads(raw)

    except Exception as e:
        return {'error': str(e)}

# Initialize
client = anthropic.Anthropic(api_key=API_KEY)
results = []

# Get all Word docs (skip Word's ~$ lock files, which aren't real documents)
docx_files = sorted(f for f in os.listdir(DOCS_FOLDER)
                    if f.endswith('.docx') and not f.startswith('~$'))

# Test mode: first 5 only. Remove this line when ready for the full batch.
docx_files = docx_files[:5]

print(f"Processing {len(docx_files)} case notes...")

for docx_file in tqdm(docx_files):
    docx_path = os.path.join(DOCS_FOLDER, docx_file)

    # Extract text
    text = extract_text_from_docx(docx_path)

    if not text:
        results.append({'filename': docx_file, 'error': 'Text extraction failed'})
        continue

    # Summarise
    summary = summarise_case_note(text, docx_file, client)

    results.append({
        'filename': docx_file,
        'summary': summary.get('summary'),
        'key_issues': ' | '.join(summary.get('key_issues', [])),
        'actions_taken': ' | '.join(summary.get('actions_taken', [])),
        'follow_up': ' | '.join(summary.get('follow_up', [])),
        'urgency': summary.get('urgency'),
        'error': summary.get('error')
    })

    time.sleep(1)

# Save
df = pd.DataFrame(results)
df.to_csv(OUTPUT_CSV, index=False)
print(f"\nSaved {len(df)} summaries to {OUTPUT_CSV}")

Tools

OpenAI API · service · paid
Anthropic Claude API · service · paid
Python · platform · free · open source
PyPDF2 or pypdf · library · free · open source


At a glance

  • Time to implement: hours
  • Setup cost: low
  • Ongoing cost: low
  • Cost trend: decreasing
  • Organisation size: small, medium, large
  • Target audience: operations-manager, program-delivery, fundraising, data-analyst

Cost depends on document length and model used. Typical grant application (3-5 pages): £0.05-0.10 with GPT-4o-mini, £0.20-0.40 with GPT-4. Claude Opus handles longer documents better but costs more. For 100 documents averaging 4 pages: £5-40 depending on model. Always test with 5 documents first to estimate costs.
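As a sanity check before committing to a full batch, you can estimate token volume and cost with a back-of-envelope calculation - a sketch in which the words-per-page, tokens-per-word, and per-token price figures are all assumptions to replace with your documents and your provider's current pricing:

def estimate_cost(num_docs, avg_pages, price_per_1m_input_tokens,
                  words_per_page=500, tokens_per_word=1.3):
    """Back-of-envelope input cost for a batch; output tokens add a little more."""
    input_tokens = num_docs * avg_pages * words_per_page * tokens_per_word
    return input_tokens * price_per_1m_input_tokens / 1_000_000

# e.g. 100 four-page documents at a hypothetical £0.50 per million input tokens
print(f"~£{estimate_cost(100, 4, 0.50):.2f} for input tokens alone")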


Written by AI Recipes for Charities

Last updated: 2024-12-23