RAG Development with Elasticsearch (kNN) Vector Database

We design and deploy artificial intelligence systems: from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering and MLOps to make AI work not in the lab, but in real business.


Since version 8.x, Elasticsearch supports native k-nearest-neighbor (kNN) search over dense vectors (the dense_vector field type). For teams already running Elasticsearch as a search engine, this is the most natural path to RAG: no new infrastructure is needed. Native integration of BM25 full-text and vector search makes ES a strong choice for hybrid retrieval.

Creating an Index with dense_vector Field

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Creating index with mapping
index_config = {
    "mappings": {
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "russian",  # Native Russian morphology support
            },
            "source": {"type": "keyword"},
            "doc_type": {"type": "keyword"},
            "page": {"type": "integer"},
            "date": {"type": "date"},
            "embedding": {
                "type": "dense_vector",
                "dims": 1536,
                "index": True,
                "similarity": "cosine",
                # HNSW parameters
                "index_options": {
                    "type": "hnsw",
                    "m": 16,
                    "ef_construction": 100,
                }
            }
        }
    },
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1,
    }
}

# In the 8.x Python client, pass mappings/settings as keyword arguments
# (the body= parameter is deprecated)
es.indices.create(
    index="knowledge_base",
    mappings=index_config["mappings"],
    settings=index_config["settings"],
)

Indexing Documents

from openai import OpenAI
from elasticsearch.helpers import bulk

openai_client = OpenAI()

def generate_actions(chunks: list):
    texts = [c["text"] for c in chunks]
    # Batch embeddings
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    embeddings = [e.embedding for e in response.data]

    for chunk, embedding in zip(chunks, embeddings):
        yield {
            "_index": "knowledge_base",
            "_source": {
                "content": chunk["text"],
                "source": chunk["source"],
                "doc_type": chunk["doc_type"],
                "page": chunk.get("page", 0),
                "embedding": embedding,
            }
        }

# Batch loading
bulk(es, generate_actions(document_chunks))
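As written, generate_actions sends every chunk to the embeddings endpoint in a single request, which breaks on large corpora: the OpenAI embeddings API caps the number of inputs per request (2,048 at the time of writing). A small batching helper (the name batched is ours, not a library function):

```python
def batched(items: list, batch_size: int = 2048):
    """Yield fixed-size slices so each embeddings request stays under the cap."""
    for i in range(0, len(items), batch_size):
        yield items[i : i + batch_size]

# Usage sketch: embed and index one batch per request
# for batch in batched(document_chunks):
#     bulk(es, generate_actions(batch))
```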

Hybrid Search: BM25 + kNN

Elasticsearch supports hybrid search by combining a top-level knn clause with a query clause in a single request; since 8.8, the two result sets can be fused with reciprocal rank fusion (rrf):

def hybrid_search_es(
    query: str,
    doc_type_filter: str | None = None,
    top_k: int = 5,
) -> list:
    query_embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding

    # Filter clause
    filter_clause = []
    if doc_type_filter:
        filter_clause.append({"term": {"doc_type": doc_type_filter}})

    # Hybrid: kNN + BM25 via RRF
    body = {
        "query": {
            "bool": {
                "must": {
                    "match": {
                        "content": {
                            "query": query,
                            "analyzer": "russian"
                        }
                    }
                },
                "filter": filter_clause,
            }
        },
        "knn": {
            "field": "embedding",
            "query_vector": query_embedding,
            "k": top_k * 3,  # Extended set for fusion
            "num_candidates": 100,
            "filter": filter_clause,
        },
        "rank": {
            "rrf": {
                "window_size": 50,
                "rank_constant": 20,
            }
        },
        "size": top_k,
        "_source": ["content", "source", "doc_type"],
    }

    response = es.search(index="knowledge_base", body=body)
    return [
        {
            "text": hit["_source"]["content"],
            "source": hit["_source"]["source"],
            "score": hit["_score"],
        }
        for hit in response["hits"]["hits"]
    ]
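On clusters older than 8.8, where the rank.rrf clause is not available, the same fusion can be done client-side from two separate result lists. A minimal sketch over ranked document ids (the helper name rrf_fuse and the id lists are illustrative, not part of the Elasticsearch API):

```python
def rrf_fuse(
    bm25_ids: list,
    knn_ids: list,
    k: int = 20,
    top_n: int = 5,
) -> list:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d)).

    Documents appearing high in either list accumulate a larger score;
    k dampens the influence of any single ranking (cf. rank_constant above).
    """
    scores: dict = {}
    for ranking in (bm25_ids, knn_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Here the two input lists would come from separate BM25 and kNN requests; documents found by both retrievers naturally rise to the top.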

Advantage: Russian Morphology Out of the Box

Elasticsearch with the russian analyzer supports Russian word stemming via Snowball. This is critical for the BM25 part of hybrid search — a query for "договором" will find documents with "договор", "договоры", "договорам".

# Morphological analysis test (keyword arguments; body= is deprecated)
es.indices.analyze(
    index="knowledge_base",
    analyzer="russian",
    text="договором аренды",
)
# tokens: ["договор", "аренд"] — stemmed forms

Practical Case Study: Migrating Existing Elasticsearch to RAG

Context: Company uses ES 8.x as a search engine for 500K documents. Task: Add RAG on top without changing infrastructure.

Steps:

  1. Add embedding field (dense_vector, dims=1536) to existing mapping
  2. Batch vectorize existing documents (2 days; at roughly 1K tokens per document, 500K docs ≈ 500M tokens × $0.02/1M tokens ≈ $10)
  3. Reindex with new field (6 hours)
  4. Add RRF fusion to search queries
  5. RAG layer on top of ES retrieval
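Step 1 does not require creating a new index: a dense_vector field can be added to an existing mapping with the put mapping API. A sketch of the mapping fragment (the index name "documents" is a placeholder for the existing index):

```python
def embedding_mapping(dims: int = 1536) -> dict:
    """Mapping fragment that adds a dense_vector field to an existing index."""
    return {
        "properties": {
            "embedding": {
                "type": "dense_vector",
                "dims": dims,           # must match the embedding model's output size
                "index": True,
                "similarity": "cosine",
            }
        }
    }

# Against a live cluster (not executed here):
# es.indices.put_mapping(
#     index="documents",
#     properties=embedding_mapping()["properties"],
# )
```

Existing documents keep working unchanged; the new field is then populated by the batch vectorization in step 2.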

Results (vs pure BM25):

  • NDCG@5: 0.64 → 0.81
  • Recall@10: 0.71 → 0.88
  • Latency P95: 85ms → 140ms (hybrid)
  • Faithfulness (RAGAS): 0.76 → 0.91

The transition from pure BM25 to hybrid kNN+BM25 gave a ~27% relative improvement in NDCG@5 without any change to the infrastructure.
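The NDCG@5 figures above follow the standard definition: DCG over the top-k graded relevances, normalized by the DCG of the ideal ordering. A minimal sketch for reproducing the metric on your own relevance judgments:

```python
import math


def dcg(relevances: list) -> float:
    """Discounted cumulative gain: rel_i / log2(i + 1), positions starting at 1."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg_at_k(relevances: list, k: int = 5) -> float:
    """NDCG@k: DCG of the ranking as returned, divided by the ideal DCG."""
    ideal = sorted(relevances, reverse=True)[:k]
    if dcg(ideal) == 0:
        return 0.0
    return dcg(relevances[:k]) / dcg(ideal)
```

Averaging ndcg_at_k over a query set with graded relevance labels yields numbers comparable to the 0.64 → 0.81 figures reported above.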

Timeline

  • Adding vector field + reindexing: 2–5 days
  • Developing hybrid search queries: 3–5 days
  • RAG pipeline and evaluation: 1–2 weeks
  • Total: 2–4 weeks