Multi-Query RAG Implementation for Improved Retrieval Quality

We design and deploy artificial intelligence systems: from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering and MLOps to make AI work not in the lab, but in real business.
Showing 1 of 1 servicesAll 1566 services
Multi-Query RAG Implementation for Improved Retrieval Quality
Medium
from 1 business day to 3 business days
FAQ
AI Development Areas
AI Solution Development Stages
Latest works
  • image_website-b2b-advance_0.png
    B2B ADVANCE company website development
    1212
  • image_web-applications_feedme_466_0.webp
    Development of a web application for FEEDME
    1161
  • image_websites_belfingroup_462_0.webp
    Website development for BELFINGROUP
    852
  • image_ecommerce_furnoro_435_0.webp
    Development of an online store for the company FURNORO
    1041
  • image_logo-advance_0.png
    B2B Advance company logo design
    561
  • image_crm_enviok_479_0.webp
    Development of a web application for Enviok
    822

Implementing Multi-Query RAG for Improved Retrieval Quality

Multi-Query RAG is a retrieval improvement technique where the original query is automatically paraphrased in several ways, each variant is run through search, and results are combined. This reduces dependency of answer quality on specific query wording and improves retrieval completeness.

Problem Multi-Query Solves

Vector embedding models are sensitive to query formulation. The same question asked differently can yield different top-K results:

  • "How to take vacation?" → finds articles about applications
  • "Procedure for getting annual leave" → finds regulation section
  • "Rules for providing vacation days" → finds HR policy

Multi-Query combines all three and gets more complete context.

Implementation with LangChain

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Qdrant

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

base_retriever = Qdrant.from_documents(
    documents=docs,
    embedding=embeddings,
    collection_name="documents",
    url="http://localhost:6333",
).as_retriever(search_kwargs={"k": 5})

retriever = MultiQueryRetriever.from_llm(
    retriever=base_retriever,
    llm=llm
)

results = retriever.invoke("How to file a vacation request?")

Custom Multi-Query with Prompt Control

from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import BaseOutputParser
import json

class LineListOutputParser(BaseOutputParser):
    def parse(self, text: str) -> list:
        lines = text.strip().split("\n")
        return [line.strip() for line in lines if line.strip()]

multi_query_prompt = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate 3
different versions of the given user question to retrieve relevant documents from a vector database.
Provide these alternative questions separated by newlines.

Original question: {question}

Alternative questions:"""
)

def multi_query_retrieval(query: str, retriever, llm) -> list:
    # Generate queries
    queries = llm.invoke(multi_query_prompt.format(question=query))
    query_list = [queries.content] + [query]
    
    # Retrieve from each
    unique_docs = set()
    all_docs = []
    
    for q in query_list:
        docs = retriever.invoke(q)
        for doc in docs:
            doc_id = doc.metadata.get("id")
            if doc_id not in unique_docs:
                unique_docs.add(doc_id)
                all_docs.append(doc)
    
    return all_docs

Practical Case: HR Documentation Assistant

Task: assistant for HR policies covering vacation, sick leave, business travel, bonuses.

Results:

Metric Standard RAG Multi-Query RAG
Context Recall 0.71 0.89
Answer Completeness 0.68 0.87
User Satisfaction 0.74 0.91
Latency 400ms 1.1s (3× queries in parallel)

Multi-Query improved context recall by 25% while maintaining acceptable latency.

Timeline

  • Implement multi-query generator: 1–2 days
  • Test and optimize prompt: 1–2 days
  • Evaluation and tuning: 2–3 days
  • Total: 1 week