Implementing Self-Query RAG with Metadata

Self-Query RAG is a technique in which an LLM analyzes the user's query and, in addition to the vector search, automatically constructs structured filters over metadata. Instead of searching by semantics alone, the system precisely filters documents by date, type, author, department, and other attributes.

Problem Without Self-Query

Without Self-Query, the query "security policies issued in 2024" is matched against all documents by the semantics of "security", with no filtering by year, so the user gets mixed results from different periods. With Self-Query, the LLM extracts the filter date >= 2024-01-01 AND doc_type = "security_policy" and applies it alongside the vector search.
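The split described above can be sketched without any LLM or vector database: the query is decomposed into a semantic part plus structured attributes, and the filter is applied together with semantic ranking. Everything here is a toy illustration; the "parsed query" is hard-coded and keyword overlap stands in for embedding similarity.

```python
# Toy illustration of Self-Query: structured filtering + semantic ranking.
# No LLM or vector DB involved; the parsed query is hard-coded and
# "semantic" similarity is faked with keyword overlap.

DOCS = [
    {"text": "Password rotation security policy", "doc_type": "security_policy", "year": 2024},
    {"text": "Security awareness training policy", "doc_type": "security_policy", "year": 2021},
    {"text": "Office seating procedure", "doc_type": "procedure", "year": 2024},
]

def self_query(semantic_query: str, filters: dict) -> list[dict]:
    # 1. Structured filtering on metadata (the part the LLM would extract)
    candidates = [
        d for d in DOCS
        if all(d.get(k) == v for k, v in filters.items())
    ]
    # 2. Semantic ranking: naive keyword overlap stands in for
    #    cosine similarity over embeddings.
    words = set(semantic_query.lower().split())
    return sorted(
        candidates,
        key=lambda d: len(words & set(d["text"].lower().split())),
        reverse=True,
    )

# "security policies issued in 2024" -> semantic part + structured part
results = self_query("security policies", {"doc_type": "security_policy", "year": 2024})
```

Only the 2024 security policy survives the metadata filter; the 2021 policy is excluded before ranking ever happens, which is exactly the behavior missing from plain vector search.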

Implementation via LangChain SelfQueryRetriever

from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Qdrant

# Metadata description for LLM
metadata_field_info = [
    AttributeInfo(
        name="doc_type",
        description="Document type: contract, regulation, policy, faq, procedure",
        type="string",
    ),
    AttributeInfo(
        name="department",
        description="Department or subdivision: hr, legal, finance, it, security",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="Document publication year",
        type="integer",
    ),
    AttributeInfo(
        name="status",
        description="Document status: active, archived, draft",
        type="string",
    ),
    AttributeInfo(
        name="author",
        description="Author or responsible party for document",
        type="string",
    ),
]

document_content_description = "Company corporate documentation: regulations, policies, contracts, procedures"

llm = ChatOpenAI(model="gpt-4o", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# `vectorstore` is assumed to be a pre-built Qdrant store whose documents
# carry the metadata fields described above (e.g. via Qdrant.from_documents)
retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectorstore,
    document_contents=document_content_description,
    metadata_field_info=metadata_field_info,
    enable_limit=True,  # Allows the LLM to extract a result limit from the query
    verbose=True,
)

Self-Query Examples

# Example 1: Filter by year and type
result = retriever.invoke(
    "What security policies were active in 2023?"
)
# LLM generates filter: {"doc_type": "policy", "department": "security", "year": 2023, "status": "active"}
# Then executes vector search with this filter

# Example 2: Filter by department
result = retriever.invoke(
    "Show me HR department regulations"
)
# Filter: {"doc_type": "regulation", "department": "hr"}

# Example 3: No filter (pure vector search)
result = retriever.invoke(
    "How to prepare for an audit?"
)
# LLM doesn't extract structured filters — pure semantic search

Custom Self-Query Implementation Without LangChain

from pydantic import BaseModel, Field
from typing import Optional
from openai import OpenAI

class SearchFilter(BaseModel):
    semantic_query: str = Field(description="Pure semantic part for vector search")
    doc_type: Optional[str] = Field(default=None, description="Document type")
    department: Optional[str] = Field(default=None, description="Department")
    year_from: Optional[int] = Field(default=None, description="Year from (inclusive)")
    year_to: Optional[int] = Field(default=None, description="Year to (inclusive)")
    status: Optional[str] = Field(default=None, description="Status: active/archived")

def parse_query_to_filter(user_query: str, client: OpenAI) -> SearchFilter:
    response = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": "Extract structured filters from user query for document search."
        }, {
            "role": "user",
            "content": user_query
        }],
        response_format=SearchFilter,
        temperature=0,
    )
    return response.choices[0].message.parsed

def self_query_search(user_query: str, vectorstore, client: OpenAI, top_k: int = 5) -> list:
    filter_obj = parse_query_to_filter(user_query, client)

    # Translate the extracted attributes into a Qdrant filter
    qdrant_filter = build_qdrant_filter(filter_obj)

    return vectorstore.similarity_search(
        filter_obj.semantic_query,
        k=top_k,
        filter=qdrant_filter,
    )
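The build_qdrant_filter helper is referenced but not shown; one possible sketch is below, producing a filter in Qdrant's JSON shape. In production you would more likely build typed qdrant_client.models.Filter objects, and the exact payload keys depend on how your metadata is stored. The SearchFilter model is repeated here only so the sketch is self-contained.

```python
from typing import Any, Optional
from pydantic import BaseModel

# Mirrors the SearchFilter model defined earlier (repeated for self-containment).
class SearchFilter(BaseModel):
    semantic_query: str
    doc_type: Optional[str] = None
    department: Optional[str] = None
    year_from: Optional[int] = None
    year_to: Optional[int] = None
    status: Optional[str] = None

def build_qdrant_filter(f: SearchFilter) -> Optional[dict[str, Any]]:
    """Translate extracted attributes into Qdrant's JSON filter shape."""
    must: list[dict[str, Any]] = []
    # Exact-match conditions on string attributes
    for key in ("doc_type", "department", "status"):
        value = getattr(f, key)
        if value is not None:
            must.append({"key": key, "match": {"value": value}})
    # Inclusive year range
    year_range: dict[str, int] = {}
    if f.year_from is not None:
        year_range["gte"] = f.year_from
    if f.year_to is not None:
        year_range["lte"] = f.year_to
    if year_range:
        must.append({"key": "year", "range": year_range})
    # No extracted attributes -> no filter, i.e. pure vector search
    return {"must": must} if must else None
```

Returning None when nothing was extracted keeps Example 3 above working: queries without structured attributes degrade to plain semantic search.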

Practical Case: Corporate Knowledge Base with Metadata

Task: Search assistant for 15,000 internal documents with metadata (type, department, year, status, author).

Before Self-Query: 42% of queries returned archived documents instead of current ones.

After Self-Query:

  • Archived documents in results for "current" queries: 42% → 3%
  • Precision@5: 0.68 → 0.89
  • User satisfaction: +31%

Failure cases: the LLM sometimes misinterprets filter parameters on ambiguous queries. Solution: add a confidence threshold and fall back to pure semantic search when confidence is low.
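One way to implement that fallback is to have the extraction model also emit a confidence score for the filters it produced and route low-confidence queries to unfiltered search. The ScoredFilter shape, confidence field, and threshold below are illustrative additions, not part of the SearchFilter model shown earlier.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative: the LLM is asked to also report a 0-1 confidence score
# for the filters it extracted (an extension of the earlier SearchFilter).
@dataclass
class ScoredFilter:
    semantic_query: str
    filters: dict = field(default_factory=dict)
    confidence: float = 0.0

CONFIDENCE_THRESHOLD = 0.7  # tuned on a held-out set of labeled queries

def effective_filters(parsed: ScoredFilter) -> Optional[dict]:
    """Return metadata filters to apply, or None to fall back to
    pure semantic search when the extraction looks unreliable."""
    if parsed.filters and parsed.confidence >= CONFIDENCE_THRESHOLD:
        return parsed.filters
    return None
```

The retriever then passes filter=effective_filters(parsed) to similarity_search, so an ambiguous query degrades gracefully to plain vector search instead of silently excluding relevant documents.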

Timelines

  • Labeling document metadata: 1–3 weeks (depends on data availability)
  • Implementing Self-Query Retriever: 3–5 days
  • Testing and prompt tuning: 3–5 days
  • Total: 2–5 weeks