# RAG in Python

## What is RAG?
RAG (Retrieval-Augmented Generation) means: fetch relevant data first, then ask the model to answer using that data.
Sounds simple, but it fixes the biggest LLM problem: hallucination on your domain-specific data.
```mermaid
sequenceDiagram
    participant User
    participant App
    participant Retriever
    participant LLM
    User->>App: "What's our refund policy?"
    App->>Retriever: search("refund policy")
    Retriever->>App: relevant chunks
    App->>LLM: question + chunks
    LLM->>App: grounded answer
    App->>User: final answer
```
## Level 1: Naive RAG (No LangChain)

Like anything else, the best way to start is to start. Keep it dumb first.
```python
from openai import OpenAI

client = OpenAI()

# Fake retrieval for demo purposes
knowledge_base = {
    "refund": "Refunds are allowed within 30 days with receipt.",
    "shipping": "Standard shipping is 3-5 business days.",
}

def retrieve(query: str) -> str:
    q = query.lower()
    if "refund" in q:
        return knowledge_base["refund"]
    if "shipping" in q:
        return knowledge_base["shipping"]
    return "No relevant policy found."

question = "What's your refund policy?"
context = retrieve(question)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer only from provided context. If unknown, say you don't know."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)
```
This is not fancy. But it teaches the core pattern in 5 minutes.
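One obvious failure mode of the keyword lookup above: a query that says "money back" instead of "refund" matches nothing. Before jumping to embeddings, you can get a feel for *ranked* retrieval with a crude word-overlap scorer. This is a hypothetical sketch (the function name and scoring are made up for illustration), not a library API:

```python
def retrieve_ranked(query: str, knowledge_base: dict[str, str], k: int = 1) -> list[str]:
    """Rank knowledge-base entries by how many words they share with the query."""
    query_words = set(query.lower().split())
    scored = []
    for key, text in knowledge_base.items():
        doc_words = set(text.lower().split())
        score = len(query_words & doc_words)  # shared words as crude relevance
        scored.append((score, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop zero-score entries so irrelevant queries return nothing
    return [text for score, text in scored[:k] if score > 0]

knowledge_base = {
    "refund": "Refunds are allowed within 30 days with receipt.",
    "shipping": "Standard shipping is 3-5 business days.",
}
print(retrieve_ranked("How long do refunds take?", knowledge_base))
```

Still dumb, but it introduces the two ideas embeddings generalize: scoring every document against the query, and returning the top K.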
## Level 2: Real Retriever with LangChain

Now replace the fake retrieval with embeddings + vector search.

```shell
pip install langchain langchain-openai langchain-community faiss-cpu
```
```python
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

# 1) Build documents
docs = [
    Document(page_content="Refunds are allowed within 30 days with receipt.", metadata={"source": "policy.md"}),
    Document(page_content="Premium users get free shipping.", metadata={"source": "shipping.md"}),
]

# 2) Index
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(docs, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# 3) Retrieve
question = "Can I get a refund after 2 weeks?"
chunks = retriever.invoke(question)
context = "\n\n".join([d.page_content for d in chunks])

# 4) Generate
llm = ChatOpenAI(model="gpt-4o")
answer = llm.invoke([
    ("system", "Answer only from context. If missing, say you don't know."),
    ("human", f"Context:\n{context}\n\nQuestion: {question}"),
])
print(answer.content)
```
## Level 3: RetrievalQA Chain

Want less boilerplate? Use a built-in chain.
```python
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,  # reuses the FAISS retriever built above
    chain_type="stuff",  # simplest strategy: stuff chunks into one prompt
)

result = qa_chain.invoke({"query": "What's the refund policy?"})
print(result["result"])
```
## Level 4: Practical Tuning

This is where answer quality moves the most.
- Chunk size: too big = noisy context, too small = missing context
- Top K: start with 3-5
- Prompt rule: explicitly say "If unknown, say I don't know"
- Metadata: store source paths so you can cite docs
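To see the chunk-size trade-off concretely, here is a toy fixed-size chunker with overlap. It's a minimal sketch for intuition only; in practice you'd use LangChain's text splitters, which also respect sentence and paragraph boundaries:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows. Overlap means a sentence
    cut at one chunk boundary still appears whole in the neighboring chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # advance by the non-overlapping part
    return chunks

doc = "Refunds are allowed within 30 days with receipt. " * 10  # 490 chars
chunks = chunk_text(doc, chunk_size=120, overlap=30)
print(len(chunks), len(chunks[0]))
```

Shrink `chunk_size` and watch the chunk count grow: more, smaller chunks retrieve more precisely but each carries less context.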
```python
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 4},
)
```
## Level 5: Hybrid and Multi-Source Retrieval
Need better recall? Use multiple retrievers (docs + SQL + API) and combine results.
That pattern maps directly to agent routing: route the question, retrieve from the right source, then synthesize.
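One common way to merge ranked lists from different retrievers is reciprocal rank fusion (RRF): each source votes for a document with weight 1/(c + rank), so documents near the top of several lists win. A minimal sketch, where the retriever outputs are hypothetical stand-ins:

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], c: int = 60) -> list[str]:
    """Merge ranked result lists, rewarding docs that rank high in multiple lists."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            # rank + 1 makes the top hit rank 1; c dampens the gap between ranks
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (c + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Pretend outputs from a vector store, a SQL keyword search, and an API
vector_hits = ["policy.md", "shipping.md", "faq.md"]
sql_hits = ["orders.sql", "policy.md"]
api_hits = ["faq.md", "policy.md"]
print(reciprocal_rank_fusion([vector_hits, sql_hits, api_hits]))
```

`policy.md` wins because every source ranks it, which is exactly the behavior you want before handing the fused list to the LLM for synthesis.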
## What to Remember
- RAG is retrieval + prompt injection (nothing magical)
- Start naive first so you understand failure modes
- LangChain helps with plumbing, not with your data quality
- Bad chunks in = bad answers out
- Always keep an "I don't know" path to avoid confident nonsense