Build a searchable knowledge base from your documents

compliance · intermediate · proven

The problem

You've got policies, procedures, training materials, and FAQs scattered across documents. When staff have questions ('What's our lone working policy?', 'How do I process a safeguarding referral?'), they either ask a colleague or spend ages searching through PDFs. The information exists, but it's not findable.

The solution

Use NotebookLM or a RAG (Retrieval Augmented Generation) system to create a searchable knowledge base. Upload all your documents, and staff can ask questions in plain English. The AI finds the relevant sections, synthesises an answer, and cites which document it came from. No need to remember which PDF has what - just ask.

What you get

A conversational interface where staff ask questions and get answers pulled from your documents with citations. 'What's our data retention policy?' returns the relevant policy section plus which document and page it came from. Staff get instant answers, and you reduce repetitive questions to managers.

Before you start

  • Your policies, procedures, and reference materials in digital format (PDF, Word, or text)
  • For NotebookLM: a Google account (easiest option)
  • For custom RAG: Python skills and an OpenAI API key

When to use this

  • Staff frequently ask the same questions that are answered in documentation
  • Your knowledge is scattered across many documents
  • Onboarding new staff takes ages because there's so much to learn
  • You want to reduce the load on managers answering procedural questions

When not to use this

  • Your procedures change daily - the knowledge base would always be out of date
  • You've only got a handful of simple policies - might be quicker to just bookmark them
  • Your documents are so poorly organised that even AI can't make sense of them
  • The questions people ask require human judgement, not procedure lookup

Steps

  1. Gather your knowledge documents

    Collect all the documents staff need to reference: policies, procedures, training guides, FAQs, org charts, contact lists. Get them into digital format. For NotebookLM you can use PDFs, Word docs, or text. Aim for clarity over volume - 20 good documents beat 100 messy ones. If you plan to build the custom RAG option later, see the PDF text-extraction sketch after these steps.

  2. Quick option: Use NotebookLM

    Go to NotebookLM, create a new notebook, and upload your documents as 'sources'. That's it. You can now ask questions and it'll pull answers from across all documents, citing where the information came from. Share the notebook with your team. This is the fastest path to a working knowledge base.

  3. Test with real staff questions

    Ask the questions staff actually ask: 'How do I book annual leave?', 'What do I do if someone makes a safeguarding disclosure?', 'What's our social media policy?'. Check that answers are accurate and cite the right documents. If it struggles, your documents might need better structure.

  4. Refine document organisation if needed

    If the AI gives wrong answers or can't find information, check your documents. Are policies clearly titled? Are sections well-organised? Sometimes adding a table of contents or clearer headings helps the AI find the right information.

  5. Build a custom RAG for integration (optional)

    If you need this integrated into your own systems (intranet, Teams bot, etc.), you'll need a custom RAG implementation using the OpenAI API and a vector database. The example code shows the basics. This is more work but gives you more control.

  6. Keep documents updated

    Set a reminder to update the knowledge base when policies change. Delete old versions, upload new ones. The AI is only as good as the documents you give it. If policies are 3 years out of date, the answers will be wrong.

  7. Train staff how to ask good questions (optional)

    Show staff how to use it: be specific ('What's the process for closing a safeguarding case?' not just 'safeguarding'), check the citations, and escalate to a human when the answer isn't clear. This is a tool to reduce simple lookups, not replace human judgement.
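
If you take the custom RAG route, step 1 usually means pulling plain text out of your PDFs before anything can be indexed. Here is a minimal sketch using pypdf (already in the pip install line below); the filename is just an example, and scanned image-only PDFs will need OCR instead.

# Requires: pip install pypdf
from pypdf import PdfReader

# Example filename - substitute one of your own policy documents
reader = PdfReader("lone-working-policy.pdf")

# Join the text from every page; extract_text() can return None for image-only pages
text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Spot-check the extraction before indexing it
print(text[:500])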

Example code

Basic RAG implementation with OpenAI

This is a minimal RAG system if you want to build your own. For most charities, NotebookLM is easier.

# This requires: pip install openai chromadb pypdf

import os

from openai import OpenAI
import chromadb
from chromadb.utils import embedding_functions

# The OpenAI client reads OPENAI_API_KEY from your environment
client = OpenAI()

# Set up the vector database with OpenAI embeddings
chroma_client = chromadb.Client()
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ.get("OPENAI_API_KEY"),
    model_name="text-embedding-3-small"
)

collection = chroma_client.create_collection(
    name="policies",
    embedding_function=openai_ef
)

# Add documents (simplified - in practice you'd chunk long docs)
documents = [
    "Our data retention policy states that service records must be kept for 7 years...",
    "Lone working policy: Staff must not conduct home visits alone if risk assessment indicates...",
    "Safeguarding referral process: 1) Ensure immediate safety. 2) Record disclosure verbatim..."
]

# Index documents
collection.add(
    documents=documents,
    ids=[f"doc_{i}" for i in range(len(documents))]
)

def ask_question(question):
    # 1. Find relevant documents
    results = collection.query(
        query_texts=[question],
        n_results=3
    )

    relevant_docs = results['documents'][0]

    # 2. Generate answer using relevant context
    context = "\n\n".join(relevant_docs)

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant that answers questions based on the provided policy documents. Always cite which policy section you're referencing."
            },
            {
                "role": "user",
                "content": f"""Answer this question based on our policies:

{question}

Relevant policy sections:
{context}

Provide an answer and cite which section you're using."""
            }
        ]
    )

    return response.choices[0].message.content

# Example usage
answer = ask_question("What's our lone working policy?")
print(answer)

# In practice you'd add:
# - Document chunking for long PDFs
# - Metadata (document name, section, date)
# - User interface (web app, Teams bot, etc.)
# - Access controls
# - Update mechanisms
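
As a follow-on from the comments above, here is one way chunking, metadata, and updates could look. This is a sketch, not a drop-in module: it reuses the `collection` and the pypdf install from the example above, and the `chunk_text` and `reindex_document` helpers (and their chunk sizes and filenames) are made up for illustration.

from pypdf import PdfReader

def chunk_text(text, chunk_size=1000, overlap=100):
    # Naive fixed-size chunking with a little overlap so sentences that
    # straddle a boundary still appear whole in at least one chunk
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def reindex_document(doc_name, pdf_path):
    # Remove any previously indexed chunks for this document, then add the
    # current version - this is the 'update mechanism' from the list above
    collection.delete(where={"source": doc_name})

    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    chunks = chunk_text(text)
    collection.add(
        documents=chunks,
        metadatas=[{"source": doc_name, "chunk": i} for i in range(len(chunks))],
        ids=[f"{doc_name}_{i}" for i in range(len(chunks))]
    )

# Example usage when a policy is revised
reindex_document("lone-working-policy", "policies/lone-working-policy-v3.pdf")

With metadata attached, the query inside ask_question can also read results['metadatas'], so answers can cite the document name rather than just quoting the text.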

Tools

NotebookLM · service · free
OpenAI API · service · paid

Resources

At a glance

Time to implement: hours
Setup cost: free
Ongoing cost: free
Cost trend: stable
Organisation size: small, medium, large
Target audience: operations-manager, it-technical, ceo-trustees

NotebookLM is free and handles most use cases. Custom RAG costs ~£0.001 per query plus setup time. Main cost is organising your documents.
