RAG: Retrieval Augmented Generation
What is RAG and why do you need it?
LLMs know a lot, but they don't know your data. RAG solves this by retrieving relevant context from your own data sources and stuffing it into the prompt before the model generates a response.
Sounds simple? It is, at its core. But the devil's in the details.
sequenceDiagram
participant User
participant App
participant VectorStore
participant LLM
User->>App: "What's our refund policy?"
App->>VectorStore: similarity search("refund policy")
VectorStore->>App: [matching document chunks]
App->>LLM: prompt + retrieved context
LLM->>App: "Our refund policy states..."
App->>User: grounded response
The model isn't guessing: it's reading your docs and answering based on them.
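The whole retrieve-then-prompt loop is small enough to sketch in plain Java. This is a self-contained illustration, not Spring AI code: the keyword-overlap `score` method is a hypothetical stand-in for real vector similarity, and `buildPrompt` shows the "stuff context into the prompt" step.

```java
import java.util.*;
import java.util.stream.*;

public class NaiveRag {
    // Stand-in for vector similarity: fraction of query words found in the chunk.
    // A real system would compare embedding vectors instead.
    static double score(String query, String chunk) {
        Set<String> words = new HashSet<>(Arrays.asList(query.toLowerCase().split("\\W+")));
        String lower = chunk.toLowerCase();
        return words.stream().filter(lower::contains).count() / (double) words.size();
    }

    // Retrieve the topK most relevant chunks, then build the grounded prompt.
    public static String buildPrompt(String question, List<String> chunks, int topK) {
        String context = chunks.stream()
            .sorted(Comparator.comparingDouble((String c) -> score(question, c)).reversed())
            .limit(topK)
            .collect(Collectors.joining("\n"));
        return "Context:\n" + context + "\n\nQuestion: " + question;
    }
}
```

Everything that follows is Spring AI doing this loop for you, with better similarity search and more knobs.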
Level 1: Naive RAG with QuestionAnswerAdvisor
The fastest way to get RAG working. Assuming you've already loaded data into a VectorStore:
ChatResponse response = ChatClient.builder(chatModel)
.build()
.prompt()
.advisors(QuestionAnswerAdvisor.builder(vectorStore).build())
.user("What's our refund policy?")
.call()
.chatResponse();
That's it. The QuestionAnswerAdvisor queries the vector store for documents similar to the user's question, appends them to the prompt, and the model responds with grounded information.
Tuning the Search
You can control similarity threshold and how many documents to retrieve:
var qaAdvisor = QuestionAnswerAdvisor.builder(vectorStore)
.searchRequest(SearchRequest.builder()
.similarityThreshold(0.8)
.topK(6)
.build())
.build();
- similarityThreshold: only return documents above this relevance score (0.0 to 1.0)
- topK: max number of documents to retrieve
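The interaction between the two knobs is easy to get wrong: the threshold filters first, then topK caps whatever survives. A self-contained sketch of that selection logic (plain Java; the `Scored` record is illustrative, not a Spring AI type):

```java
import java.util.*;
import java.util.stream.*;

public class SearchTuning {
    public record Scored(String doc, double similarity) {}

    // Keep only docs at or above the threshold, then take the topK best.
    public static List<String> select(List<Scored> hits, double threshold, int topK) {
        return hits.stream()
            .filter(h -> h.similarity() >= threshold)
            .sorted(Comparator.comparingDouble(Scored::similarity).reversed())
            .limit(topK)
            .map(Scored::doc)
            .collect(Collectors.toList());
    }
}
```

So a strict threshold can return fewer than topK documents; topK alone never loosens the threshold.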
Dynamic Filtering
Filter results at runtime based on metadata:
String content = chatClient.prompt()
.user("Tell me about Spring Boot 3")
.advisors(a -> a.param(QuestionAnswerAdvisor.FILTER_EXPRESSION, "type == 'Spring'"))
.call()
.content();
This uses a SQL-like filter expression that's portable across all vector store implementations.
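Under the hood, each vector store translates that expression into its own native filter syntax. A toy evaluator for the single-equality case used above (plain Java sketch; Spring AI's real filter expressions also support operators like !=, >, <, IN, AND, and OR):

```java
import java.util.Map;

public class MetadataFilter {
    // Evaluate a "key == 'value'" expression against a document's metadata map.
    public static boolean matches(Map<String, String> metadata, String expression) {
        String[] parts = expression.split("==");
        String key = parts[0].trim();
        String value = parts[1].trim().replaceAll("^'|'$", "");  // strip surrounding quotes
        return value.equals(metadata.get(key));
    }
}
```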
Level 2: Advanced RAG with RetrievalAugmentationAdvisor
For more control, use the modular RetrievalAugmentationAdvisor. It breaks RAG into composable stages.
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-rag</artifactId>
</dependency>
Basic Setup
Advisor ragAdvisor = RetrievalAugmentationAdvisor.builder()
.documentRetriever(VectorStoreDocumentRetriever.builder()
.similarityThreshold(0.50)
.vectorStore(vectorStore)
.build())
.build();
String answer = chatClient.prompt()
.advisors(ragAdvisor)
.user("What is the return window for electronics?")
.call()
.content();
By default, if no relevant documents are found, it tells the model not to answer (prevents hallucination). You can change that:
Advisor ragAdvisor = RetrievalAugmentationAdvisor.builder()
.documentRetriever(VectorStoreDocumentRetriever.builder()
.similarityThreshold(0.50)
.vectorStore(vectorStore)
.build())
.queryAugmenter(ContextualQueryAugmenter.builder()
.allowEmptyContext(true)
.build())
.build();
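What the augmenter does with an empty result set boils down to a single branch. A plain-Java sketch of the idea (the instruction strings here are illustrative, not Spring AI's exact defaults):

```java
import java.util.List;

public class ContextGuard {
    // With no retrieved context and allowEmptyContext=false, the model is
    // told to refuse rather than guess; with true, the query passes through.
    public static String augment(String query, List<String> context, boolean allowEmptyContext) {
        if (context.isEmpty()) {
            return allowEmptyContext
                ? query  // let the model answer from its own knowledge
                : query + "\n\nNo context is available. Reply that you cannot answer the question.";
        }
        return "Context:\n" + String.join("\n", context) + "\n\nQuery: " + query;
    }
}
```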
Query Transformation
Sometimes user queries are messy. Spring AI can rewrite them before retrieval:
Advisor ragAdvisor = RetrievalAugmentationAdvisor.builder()
.queryTransformers(RewriteQueryTransformer.builder()
.chatClientBuilder(chatClientBuilder.build().mutate())
.build())
.documentRetriever(VectorStoreDocumentRetriever.builder()
.similarityThreshold(0.50)
.vectorStore(vectorStore)
.build())
.build();
Three transformer options:
- RewriteQueryTransformer: rewrites verbose/ambiguous queries for better retrieval
- CompressionQueryTransformer: compresses conversation history + follow-up into a standalone query
- TranslationQueryTransformer: translates queries to match your embedding model's language
Query Expansion
Want to search from multiple angles? Expand one query into several:
MultiQueryExpander queryExpander = MultiQueryExpander.builder()
.chatClientBuilder(chatClientBuilder)
.numberOfQueries(3)
.build();
This generates 3 semantically diverse variations of the original query, retrieves documents for each, and merges the results.
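The merge step matters: the same chunk often comes back for several query variants, so results are de-duplicated, keeping each document's best score. A plain-Java sketch of that merge (the `Hit` record is hypothetical, not a Spring AI type):

```java
import java.util.*;
import java.util.stream.*;

public class MultiQueryMerge {
    public record Hit(String docId, double score) {}

    // Merge per-query result lists: dedup by docId, keep the best score, rank descending.
    public static List<Hit> merge(List<List<Hit>> perQueryResults) {
        Map<String, Double> best = new HashMap<>();
        for (List<Hit> results : perQueryResults)
            for (Hit h : results)
                best.merge(h.docId(), h.score(), Math::max);
        return best.entrySet().stream()
            .map(e -> new Hit(e.getKey(), e.getValue()))
            .sorted(Comparator.comparingDouble(Hit::score).reversed())
            .collect(Collectors.toList());
    }
}
```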
The RAG Pipeline β Modular Architecture
Spring AI implements a Modular RAG architecture. Here's how the pieces fit:
Pre-Retrieval → Retrieval → Post-Retrieval → Generation

- Pre-Retrieval: query rewriting, query expansion, translation
- Retrieval: vector search, document retrieval, document joining
- Post-Retrieval: re-ranking, deduplication, compression
- Generation: context augmentation, LLM call
Each module is pluggable. Mix and match based on your needs.
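The pluggability can be sketched with a few functional interfaces wired in sequence. This is an illustration of the pattern, not Spring AI's actual API; the stage names loosely mirror the pipeline above:

```java
import java.util.List;
import java.util.function.*;

public class RagPipeline {
    // Each stage is an independent, swappable function.
    public static String run(String query,
                             UnaryOperator<String> preRetrieval,          // e.g. query rewrite
                             Function<String, List<String>> retriever,    // e.g. vector search
                             UnaryOperator<List<String>> postRetrieval,   // e.g. re-rank / dedup
                             BiFunction<String, List<String>, String> generator) {
        String transformed = preRetrieval.apply(query);
        List<String> docs = postRetrieval.apply(retriever.apply(transformed));
        return generator.apply(transformed, docs);
    }
}
```

Swapping a stage means passing a different function; the rest of the pipeline is untouched.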
Custom Prompt Templates
Override the default RAG prompt to match your use case:
PromptTemplate customPromptTemplate = PromptTemplate.builder()
.renderer(StTemplateRenderer.builder()
.startDelimiterToken('<').endDelimiterToken('>').build())
.template("""
<query>
Context information is below.
---------------------
<question_answer_context>
---------------------
Given the context information and no prior knowledge, answer the query.
Follow these rules:
1. If the answer is not in the context, just say that you don't know.
2. Avoid statements like "Based on the context..." or "The provided information...".
""")
.build();
QuestionAnswerAdvisor qaAdvisor = QuestionAnswerAdvisor.builder(vectorStore)
.promptTemplate(customPromptTemplate)
.build();
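Why custom delimiters? The default `{placeholder}`-style tokens can collide with JSON and code inside your documents, so the snippet above switches to `< >`. The substitution itself is simple; here is a plain-Java sketch of the idea (not the actual StTemplateRenderer implementation):

```java
import java.util.Map;

public class TemplateSketch {
    // Replace each <name> placeholder with its value from the model map.
    public static String render(String template, Map<String, String> model) {
        String out = template;
        for (var e : model.entrySet())
            out = out.replace("<" + e.getKey() + ">", e.getValue());
        return out;
    }
}
```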
What to Remember
- Start with QuestionAnswerAdvisor: it covers 80% of RAG use cases
- Tune similarityThreshold and topK: too low and you get noise, too high and you miss relevant docs
- Use query transformers when your users write sloppy queries (they will)
- Filter by metadata when you have multi-tenant data or document categories
- Empty context = hallucination risk: keep allowEmptyContext false unless you have a good reason