# RAG in Python

## What is RAG?
RAG (Retrieval-Augmented Generation) means: fetch relevant data first, then ask the model to answer using that data.
Sounds simple, but it fixes the biggest LLM problem: hallucination on your domain-specific data.
```mermaid
sequenceDiagram
    participant User
    participant App
    participant Retriever
    participant LLM
    User->>App: "What's our refund policy?"
    App->>Retriever: search("refund policy")
    Retriever->>App: relevant chunks
    App->>LLM: question + chunks
    LLM->>App: grounded answer
    App->>User: final answer
```
## Level 1: Naive RAG (No LangChain)

Like anything else, the best way to start is to start. Keep it dumb first.
```python
from openai import OpenAI

client = OpenAI()

# Fake retrieval for demo purposes
knowledge_base = {
    "refund": "Refunds are allowed within 30 days with receipt.",
    "shipping": "Standard shipping is 3-5 business days.",
}

def retrieve(query: str) -> str:
    q = query.lower()
    if "refund" in q:
        return knowledge_base["refund"]
    if "shipping" in q:
        return knowledge_base["shipping"]
    return "No relevant policy found."

question = "What's your refund policy?"
context = retrieve(question)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer only from provided context. If unknown, say you don't know."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)
```
This is not fancy. But it teaches the core pattern in 5 minutes.
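One obvious failure mode of the keyword lookup above: a query that says "money back" instead of "refund" matches nothing. Before jumping to embeddings, you can get a feel for *ranked* retrieval with a crude word-overlap scorer. This is a hypothetical sketch (the function name and scoring are made up for illustration), not a library API:

```python
def retrieve_ranked(query: str, knowledge_base: dict[str, str], k: int = 1) -> list[str]:
    """Rank knowledge-base entries by how many words they share with the query."""
    query_words = set(query.lower().split())
    scored = []
    for key, text in knowledge_base.items():
        doc_words = set(text.lower().split())
        score = len(query_words & doc_words)  # shared words as crude relevance
        scored.append((score, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop zero-score entries so irrelevant queries return nothing
    return [text for score, text in scored[:k] if score > 0]

knowledge_base = {
    "refund": "Refunds are allowed within 30 days with receipt.",
    "shipping": "Standard shipping is 3-5 business days.",
}
print(retrieve_ranked("How long do refunds take?", knowledge_base))
```

Still dumb, but it introduces the two ideas embeddings generalize: scoring every document against the query, and returning the top K.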
## Level 2: Real Retriever with LangChain

Now replace the fake retrieval with embeddings + vector search.

```shell
pip install langchain langchain-openai langchain-community faiss-cpu
```
```python
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

# 1) Build documents
docs = [
    Document(page_content="Refunds are allowed within 30 days with receipt.", metadata={"source": "policy.md"}),
    Document(page_content="Premium users get free shipping.", metadata={"source": "shipping.md"}),
]

# 2) Index
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(docs, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# 3) Retrieve
question = "Can I get a refund after 2 weeks?"
chunks = retriever.invoke(question)
context = "\n\n".join([d.page_content for d in chunks])

# 4) Generate
llm = ChatOpenAI(model="gpt-4o")
answer = llm.invoke([
    ("system", "Answer only from context. If missing, say you don't know."),
    ("human", f"Context:\n{context}\n\nQuestion: {question}"),
])
print(answer.content)
```
## Level 3: RetrievalQA Chain

Want less boilerplate? Use a built-in chain.
```python
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,  # reuses the FAISS retriever built above
    chain_type="stuff",  # simplest strategy: stuff chunks into one prompt
)

result = qa_chain.invoke({"query": "What's the refund policy?"})
print(result["result"])
```
## Level 4: Practical Tuning

This is where answer quality moves the most.
- Chunk size: too big = noisy context, too small = missing context
- Top K: start with 3-5
- Prompt rule: explicitly say "If unknown, say I don't know"
- Metadata: store source paths so you can cite docs
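To see the chunk-size trade-off concretely, here is a toy fixed-size chunker with overlap. It's a minimal sketch for intuition only; in practice you'd use LangChain's text splitters, which also respect sentence and paragraph boundaries:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows. Overlap means a sentence
    cut at one chunk boundary still appears whole in the neighboring chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # advance by the non-overlapping part
    return chunks

doc = "Refunds are allowed within 30 days with receipt. " * 10  # 490 chars
chunks = chunk_text(doc, chunk_size=120, overlap=30)
print(len(chunks), len(chunks[0]))
```

Shrink `chunk_size` and watch the chunk count grow: more, smaller chunks retrieve more precisely but each carries less context.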
```python
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 4},
)
```
## Level 5: Hybrid and Multi-Source Retrieval
Need better recall? Use multiple retrievers (docs + SQL + API) and combine results.
That pattern maps directly to agent routing: route the question, retrieve from the right source, then synthesize.
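One common way to merge ranked lists from different retrievers is reciprocal rank fusion (RRF): each source votes for a document with weight 1/(c + rank), so documents near the top of several lists win. A minimal sketch, where the retriever outputs are hypothetical stand-ins:

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], c: int = 60) -> list[str]:
    """Merge ranked result lists, rewarding docs that rank high in multiple lists."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            # rank + 1 makes the top hit rank 1; c dampens the gap between ranks
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (c + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Pretend outputs from a vector store, a SQL keyword search, and an API
vector_hits = ["policy.md", "shipping.md", "faq.md"]
sql_hits = ["orders.sql", "policy.md"]
api_hits = ["faq.md", "policy.md"]
print(reciprocal_rank_fusion([vector_hits, sql_hits, api_hits]))
```

`policy.md` wins because every source ranks it, which is exactly the behavior you want before handing the fused list to the LLM for synthesis.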
## What to Remember
- RAG is retrieval + prompt injection (nothing magical)
- Start naive first so you understand failure modes
- LangChain helps with plumbing, not with your data quality
- Bad chunks in = bad answers out
- Always keep an "I don't know" path to avoid confident nonsense