RAG: Retrieval Augmented Generation
What is RAG and why do you need it?
LLMs know a lot, but they don't know your data. RAG solves this by retrieving relevant context from your own data sources and stuffing it into the prompt before the model generates a response.
Sounds simple? It is, at its core. But the devil's in the details.
sequenceDiagram
participant User
participant App
participant VectorStore
participant LLM
User->>App: "What's our refund policy?"
App->>VectorStore: similarity search("refund policy")
VectorStore->>App: [matching document chunks]
App->>LLM: prompt + retrieved context
LLM->>App: "Our refund policy states..."
App->>User: grounded response
The model isn't guessing: it's reading your docs and answering based on them.
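The whole retrieve-then-prompt loop is small enough to sketch in plain Java. This is a self-contained illustration, not Spring AI code: the keyword-overlap `score` method is a hypothetical stand-in for real vector similarity, and `buildPrompt` shows the "stuff context into the prompt" step.

```java
import java.util.*;
import java.util.stream.*;

public class NaiveRag {
    // Stand-in for vector similarity: fraction of query words found in the chunk.
    // A real system would compare embedding vectors instead.
    static double score(String query, String chunk) {
        Set<String> words = new HashSet<>(Arrays.asList(query.toLowerCase().split("\\W+")));
        String lower = chunk.toLowerCase();
        return words.stream().filter(lower::contains).count() / (double) words.size();
    }

    // Retrieve the topK most relevant chunks, then build the grounded prompt.
    public static String buildPrompt(String question, List<String> chunks, int topK) {
        String context = chunks.stream()
            .sorted(Comparator.comparingDouble((String c) -> score(question, c)).reversed())
            .limit(topK)
            .collect(Collectors.joining("\n"));
        return "Context:\n" + context + "\n\nQuestion: " + question;
    }
}
```

Everything that follows is Spring AI doing this loop for you, with better similarity search and more knobs.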
Level 1: Naive RAG with QuestionAnswerAdvisor
The fastest way to get RAG working. Assuming you've already loaded data into a VectorStore:
ChatResponse response = ChatClient.builder(chatModel)
.build()
.prompt()
.advisors(QuestionAnswerAdvisor.builder(vectorStore).build())
.user("What's our refund policy?")
.call()
.chatResponse();
That's it. The QuestionAnswerAdvisor queries the vector store for documents similar to the user's question, appends them to the prompt, and the model responds with grounded information.
Tuning the Search
You can control similarity threshold and how many documents to retrieve:
var qaAdvisor = QuestionAnswerAdvisor.builder(vectorStore)
.searchRequest(SearchRequest.builder()
.similarityThreshold(0.8)
.topK(6)
.build())
.build();
- similarityThreshold: only return documents above this relevance score (0.0 to 1.0)
- topK: max number of documents to retrieve
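The interaction between the two knobs is easy to get wrong: the threshold filters first, then topK caps whatever survives. A self-contained sketch of that selection logic (plain Java; the `Scored` record is illustrative, not a Spring AI type):

```java
import java.util.*;
import java.util.stream.*;

public class SearchTuning {
    public record Scored(String doc, double similarity) {}

    // Keep only docs at or above the threshold, then take the topK best.
    public static List<String> select(List<Scored> hits, double threshold, int topK) {
        return hits.stream()
            .filter(h -> h.similarity() >= threshold)
            .sorted(Comparator.comparingDouble(Scored::similarity).reversed())
            .limit(topK)
            .map(Scored::doc)
            .collect(Collectors.toList());
    }
}
```

So a strict threshold can return fewer than topK documents; topK alone never loosens the threshold.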
Dynamic Filtering
Filter results at runtime based on metadata:
String content = chatClient.prompt()
.user("Tell me about Spring Boot 3")
.advisors(a -> a.param(QuestionAnswerAdvisor.FILTER_EXPRESSION, "type == 'Spring'"))
.call()
.content();
This uses a SQL-like filter expression that's portable across all vector store implementations.
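Under the hood, each vector store translates that expression into its own native filter syntax. A toy evaluator for the single-equality case used above (plain Java sketch; Spring AI's real filter expressions also support operators like !=, >, <, IN, AND, and OR):

```java
import java.util.Map;

public class MetadataFilter {
    // Evaluate a "key == 'value'" expression against a document's metadata map.
    public static boolean matches(Map<String, String> metadata, String expression) {
        String[] parts = expression.split("==");
        String key = parts[0].trim();
        String value = parts[1].trim().replaceAll("^'|'$", "");  // strip surrounding quotes
        return value.equals(metadata.get(key));
    }
}
```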
Level 2: Advanced RAG with RetrievalAugmentationAdvisor
For more control, use the modular RetrievalAugmentationAdvisor. It breaks RAG into composable stages.
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-rag</artifactId>
</dependency>
Basic Setup
Advisor ragAdvisor = RetrievalAugmentationAdvisor.builder()
.documentRetriever(VectorStoreDocumentRetriever.builder()
.similarityThreshold(0.50)
.vectorStore(vectorStore)
.build())
.build();
String answer = chatClient.prompt()
.advisors(ragAdvisor)
.user("What is the return window for electronics?")
.call()
.content();
By default, if no relevant documents are found, it tells the model not to answer (prevents hallucination). You can change that:
Advisor ragAdvisor = RetrievalAugmentationAdvisor.builder()
.documentRetriever(VectorStoreDocumentRetriever.builder()
.similarityThreshold(0.50)
.vectorStore(vectorStore)
.build())
.queryAugmenter(ContextualQueryAugmenter.builder()
.allowEmptyContext(true)
.build())
.build();
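What the augmenter does with an empty result set boils down to a single branch. A plain-Java sketch of the idea (the instruction strings here are illustrative, not Spring AI's exact defaults):

```java
import java.util.List;

public class ContextGuard {
    // With no retrieved context and allowEmptyContext=false, the model is
    // told to refuse rather than guess; with true, the query passes through.
    public static String augment(String query, List<String> context, boolean allowEmptyContext) {
        if (context.isEmpty()) {
            return allowEmptyContext
                ? query  // let the model answer from its own knowledge
                : query + "\n\nNo context is available. Reply that you cannot answer the question.";
        }
        return "Context:\n" + String.join("\n", context) + "\n\nQuery: " + query;
    }
}
```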
Query Transformation
Sometimes user queries are messy. Spring AI can rewrite them before retrieval:
Advisor ragAdvisor = RetrievalAugmentationAdvisor.builder()
.queryTransformers(RewriteQueryTransformer.builder()
.chatClientBuilder(chatClientBuilder.build().mutate())
.build())
.documentRetriever(VectorStoreDocumentRetriever.builder()
.similarityThreshold(0.50)
.vectorStore(vectorStore)
.build())
.build();
Three transformer options:
- RewriteQueryTransformer: rewrites verbose/ambiguous queries for better retrieval
- CompressionQueryTransformer: compresses conversation history + follow-up into a standalone query
- TranslationQueryTransformer: translates queries to match your embedding model's language
Query Expansion
Want to search from multiple angles? Expand one query into several:
MultiQueryExpander queryExpander = MultiQueryExpander.builder()
.chatClientBuilder(chatClientBuilder)
.numberOfQueries(3)
.build();
This generates 3 semantically diverse variations of the original query, retrieves documents for each, and merges the results.
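The merge step matters: the same chunk often comes back for several query variants, so results are de-duplicated, keeping each document's best score. A plain-Java sketch of that merge (the `Hit` record is hypothetical, not a Spring AI type):

```java
import java.util.*;
import java.util.stream.*;

public class MultiQueryMerge {
    public record Hit(String docId, double score) {}

    // Merge per-query result lists: dedup by docId, keep the best score, rank descending.
    public static List<Hit> merge(List<List<Hit>> perQueryResults) {
        Map<String, Double> best = new HashMap<>();
        for (List<Hit> results : perQueryResults)
            for (Hit h : results)
                best.merge(h.docId(), h.score(), Math::max);
        return best.entrySet().stream()
            .map(e -> new Hit(e.getKey(), e.getValue()))
            .sorted(Comparator.comparingDouble(Hit::score).reversed())
            .collect(Collectors.toList());
    }
}
```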
The RAG Pipeline β Modular Architecture
Spring AI implements a Modular RAG architecture. Here's how the pieces fit:
Pre-Retrieval → Retrieval → Post-Retrieval → Generation

- Pre-Retrieval: query rewriting, query expansion, translation
- Retrieval: vector search, document retrieval, document joining
- Post-Retrieval: re-ranking, deduplication, compression
- Generation: context augmentation, LLM call
Each module is pluggable. Mix and match based on your needs.
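The pluggability can be sketched with a few functional interfaces wired in sequence. This is an illustration of the pattern, not Spring AI's actual API; the stage names loosely mirror the pipeline above:

```java
import java.util.List;
import java.util.function.*;

public class RagPipeline {
    // Each stage is an independent, swappable function.
    public static String run(String query,
                             UnaryOperator<String> preRetrieval,          // e.g. query rewrite
                             Function<String, List<String>> retriever,    // e.g. vector search
                             UnaryOperator<List<String>> postRetrieval,   // e.g. re-rank / dedup
                             BiFunction<String, List<String>, String> generator) {
        String transformed = preRetrieval.apply(query);
        List<String> docs = postRetrieval.apply(retriever.apply(transformed));
        return generator.apply(transformed, docs);
    }
}
```

Swapping a stage means passing a different function; the rest of the pipeline is untouched.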
Custom Prompt Templates
Override the default RAG prompt to match your use case:
PromptTemplate customPromptTemplate = PromptTemplate.builder()
.renderer(StTemplateRenderer.builder()
.startDelimiterToken('<').endDelimiterToken('>').build())
.template("""
<query>
Context information is below.
---------------------
<question_answer_context>
---------------------
Given the context information and no prior knowledge, answer the query.
Follow these rules:
1. If the answer is not in the context, just say that you don't know.
2. Avoid statements like "Based on the context..." or "The provided information...".
""")
.build();
QuestionAnswerAdvisor qaAdvisor = QuestionAnswerAdvisor.builder(vectorStore)
.promptTemplate(customPromptTemplate)
.build();
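Why custom delimiters? The default `{placeholder}`-style tokens can collide with JSON and code inside your documents, so the snippet above switches to `< >`. The substitution itself is simple; here is a plain-Java sketch of the idea (not the actual StTemplateRenderer implementation):

```java
import java.util.Map;

public class TemplateSketch {
    // Replace each <name> placeholder with its value from the model map.
    public static String render(String template, Map<String, String> model) {
        String out = template;
        for (var e : model.entrySet())
            out = out.replace("<" + e.getKey() + ">", e.getValue());
        return out;
    }
}
```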
What to Remember
- Start with QuestionAnswerAdvisor: it covers 80% of RAG use cases
- Tune similarityThreshold and topK: too low and you get noise, too high and you miss relevant docs
- Use query transformers when your users write sloppy queries (they will)
- Filter by metadata when you have multi-tenant data or document categories
- Empty context = hallucination risk: keep allowEmptyContext false unless you have a good reason