Building RAG with OpenSearch Vector Database
OpenSearch is a fork of Elasticsearch created by AWS, now evolving as an independent open-source project under the Apache 2.0 license. It supports k-NN search through the k-NN plugin, which offers the HNSW and IVF algorithms backed by the NMSLIB, Faiss, and Lucene engines. If your infrastructure is built on AWS (Amazon OpenSearch Service), it is a natural first choice for RAG.
Creating a k-NN Index
```python
from opensearchpy import OpenSearch
from opensearchpy.helpers import bulk

client = OpenSearch(
    hosts=[{"host": "localhost", "port": 9200}],
    use_ssl=False,
)
```
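For anything beyond a local experiment, the client usually needs TLS and authentication. A minimal sketch; the host and credentials below are placeholders, not values from a real deployment:

```python
# Production-style client settings (sketch; host and credentials are placeholders)
prod_client_kwargs = {
    "hosts": [{"host": "search.example.com", "port": 9200}],
    "http_auth": ("admin", "admin"),  # or SigV4 signing for Amazon OpenSearch Service
    "use_ssl": True,
    "verify_certs": True,
}
# client = OpenSearch(**prod_client_kwargs)
```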
```python
# k-NN index configuration
index_config = {
    "settings": {
        "index.knn": True,
    },
    "mappings": {
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "standard",
            },
            "source": {"type": "keyword"},
            "doc_type": {"type": "keyword"},
            "embedding": {
                "type": "knn_vector",
                "dimension": 1536,
                "method": {
                    "name": "hnsw",
                    # space_type belongs in the method definition;
                    # the index-level index.knn.space_type setting is deprecated
                    "space_type": "cosinesimil",
                    "engine": "nmslib",  # or "faiss" / "lucene"
                    "parameters": {
                        "m": 16,
                        "ef_construction": 128,
                    },
                },
            },
        }
    },
}

client.indices.create(index="knowledge_base", body=index_config)
```
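The `bulk` helper imported earlier is the usual way to load documents into this index. A minimal sketch, assuming embeddings are computed beforehand; field names follow the mapping above, and the `"article"` default for `doc_type` is an assumption:

```python
def build_bulk_actions(docs: list, index: str = "knowledge_base") -> list:
    # Each doc is assumed to carry a precomputed 1536-dim "embedding"
    return [
        {
            "_index": index,
            "_source": {
                "content": doc["content"],
                "source": doc["source"],
                "doc_type": doc.get("doc_type", "article"),  # assumed default
                "embedding": doc["embedding"],
            },
        }
        for doc in docs
    ]

# bulk(client, build_bulk_actions(docs))  # opensearchpy.helpers.bulk from above
```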
Hybrid Search in OpenSearch
```python
def opensearch_hybrid_search(query: str, top_k: int = 5) -> list:
    # get_embedding is an external helper returning a 1536-dim vector,
    # matching the index mapping
    query_embedding = get_embedding(query)
    body = {
        "query": {
            "bool": {
                "should": [
                    # BM25 search
                    {
                        "match": {
                            "content": {
                                "query": query,
                                "boost": 0.3,
                            }
                        }
                    },
                    # k-NN search via script_score
                    {
                        "script_score": {
                            "query": {"match_all": {}},
                            "script": {
                                "source": "knn_score",
                                "lang": "knn",
                                "params": {
                                    "field": "embedding",
                                    "query_value": query_embedding,
                                    "space_type": "cosinesimil",
                                },
                            },
                            "boost": 0.7,
                        }
                    },
                ]
            }
        },
        "size": top_k,
        "_source": ["content", "source", "doc_type"],
    }
    response = client.search(index="knowledge_base", body=body)
    return [hit["_source"] for hit in response["hits"]["hits"]]
```
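Since OpenSearch 2.10 there is also a first-class alternative to manual boosting: the `hybrid` query combined with a search pipeline's `normalization-processor`, which normalizes BM25 and k-NN scores onto one scale before fusing them. A sketch of the two request bodies; the pipeline name `hybrid-rag` is a placeholder, and the weights mirror the 0.3/0.7 split above:

```python
def make_hybrid_pipeline(weights=(0.3, 0.7)) -> dict:
    # PUT /_search/pipeline/hybrid-rag with this body (pipeline name is a placeholder)
    return {
        "phase_results_processors": [
            {
                "normalization-processor": {
                    "normalization": {"technique": "min_max"},
                    "combination": {
                        "technique": "arithmetic_mean",
                        "parameters": {"weights": list(weights)},
                    },
                }
            }
        ]
    }

def make_hybrid_query(query: str, query_embedding: list, top_k: int = 5) -> dict:
    # GET /knowledge_base/_search?search_pipeline=hybrid-rag with this body
    return {
        "query": {
            "hybrid": {
                "queries": [
                    {"match": {"content": {"query": query}}},
                    {"knn": {"embedding": {"vector": query_embedding, "k": top_k}}},
                ]
            }
        },
        "size": top_k,
    }
```

Compared to the `bool`/`script_score` approach, min-max normalization keeps the lexical and vector contributions comparable even when their raw score ranges differ widely.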
Amazon OpenSearch Service: Managed Variant
When deploying on AWS, Amazon OpenSearch Service pairs naturally with Amazon Bedrock for embeddings:
```python
import boto3
import json

bedrock_client = boto3.client("bedrock-runtime", region_name="us-east-1")

def get_embedding_bedrock(text: str) -> list:
    # Titan Text Embeddings V2 supports 256, 512, or 1024 output dimensions
    response = bedrock_client.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text, "dimensions": 1024}),
    )
    return json.loads(response["body"].read())["embedding"]
```
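Note that Titan V2's 1024 dimensions differ from the 1536 used in the index above: the `knn_vector` mapping must match whatever the embedding model returns. A mapping fragment for the Titan case:

```python
# Mapping fragment for Titan Text Embeddings V2 (1024 dims) -- "dimension"
# must equal the length of the vectors the model actually returns
titan_embedding_mapping = {
    "embedding": {
        "type": "knn_vector",
        "dimension": 1024,
        "method": {"name": "hnsw", "space_type": "cosinesimil", "engine": "faiss"},
    }
}
```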
Comparison with Elasticsearch
OpenSearch and Elasticsearch expose nearly identical APIs for k-NN, but there are differences:
| Parameter | OpenSearch | Elasticsearch |
|---|---|---|
| License | Apache 2.0 | SSPL/Elastic License |
| AWS managed | Amazon OpenSearch Service | Elastic Cloud on AWS |
| k-NN engines | NMSLIB, FAISS, Lucene | Lucene HNSW |
| Score fusion | Search pipelines (normalization, RRF) | Native RRF (8.14+) |
| ML Commons | Built-in | No equivalent |
OpenSearch ML Commons allows embedding models to be integrated directly into the cluster:
```python
# Register a pretrained embedding model inside OpenSearch ML Commons.
# Enables semantic search without an external embedding API.
body = {
    "name": "huggingface/sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
    "version": "1.0.1",
    "model_format": "TORCH_SCRIPT",
}
client.transport.perform_request("POST", "/_plugins/_ml/models/_register", body=body)
# Registration returns a task_id; once the task completes, deploy the model:
# POST /_plugins/_ml/models/<model_id>/_deploy
```
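Once the model is deployed, the `neural` query type lets OpenSearch embed the query text itself (documents must have been indexed through a `text_embedding` ingest pipeline using the same model). A sketch; the `model_id` is a placeholder for the id returned by the deploy step:

```python
def make_neural_query(query_text: str, model_id: str, k: int = 5) -> dict:
    # Neural search: OpenSearch embeds query_text with the in-cluster model
    return {
        "query": {
            "neural": {
                "embedding": {
                    "query_text": query_text,
                    "model_id": model_id,  # placeholder: returned on deployment
                    "k": k,
                }
            }
        }
    }
```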
Timelines
- OpenSearch setup + index: 2–3 days
- Ingestion pipeline: 3–7 days
- Hybrid search + RAG pipeline: 1–2 weeks
- Total: 2–4 weeks