What is Weaviate and how does it work?

Weaviate is an open-source vector database with a modular architecture and built-in support for vector, BM25, and hybrid search. It integrates with popular embedding providers (OpenAI, Cohere, HuggingFace) and enables direct RAG queries via Generative modules.

Which search type in Weaviate gives the best accuracy?

Hybrid search with alpha=0.75 typically delivers the best balance of semantic accuracy and keyword matching. In our law firm case, hybrid search with reranking raised Context Precision from 0.71 to 0.89 compared to pure vector search.

What is the difference between Weaviate and other vector databases?

Weaviate stands out with built-in hybrid search (BM25 + vector), native multi-tenancy, and generative modules. Unlike Pinecone, it is open-source and does not require a separate reranking service. Qdrant lacks built-in text generation.

Which embedding models does Weaviate support?

Weaviate supports text2vec modules for OpenAI (text-embedding-3-large with dimension 3072), Cohere, HuggingFace, as well as custom models via API. The choice of model affects search quality and indexing speed.

How long does it take to implement RAG on Weaviate?

A standard project includes: schema setup (2–3 days), ingestion pipeline (3–7 days), RAG pipeline with evaluation (1–2 weeks), and advanced configuration (1–2 weeks). Total 2–5 weeks depending on complexity and data volume.

RAG Development with Weaviate Vector Database

We design and deploy artificial intelligence systems: from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering and MLOps to make AI work not in the lab, but in real business.

8+ years of work, 900+ completed projects, 100+ in-house employees, 19+ partners.

A law firm had 28,000 documents: regulations, court practice, and internal methodologies. Lawyers spent up to 3 hours searching for a single precedent. Queries contained article numbers and specific terms that standard full-text search handled poorly. We implemented RAG on Weaviate: search time dropped to 20 seconds, and the cost per search query fell from 50 to 2 rubles. The client saves 2.5 million rubles per year (about $28,000). The result is a 70% reduction in search time and higher lawyer satisfaction.

Our company has 6+ years of AI experience, has completed 15+ RAG projects, and has been on the market for 5+ years. Weaviate has been used in production for over 5 years, which makes it a reliable choice for enterprise RAG. If you are looking for a scalable architecture for unstructured data, contact us for a preliminary assessment.

Why Weaviate for RAG?

Weaviate covers both core tasks of RAG: high-quality retrieval and context-grounded generation. Unlike a homemade FAISS + reranker stack, it is a single platform with hybrid search, multi-tenancy, and built-in generation. This lowers total cost of ownership, because there are no separate services to maintain for vectorization, search, and reranking. Hybrid search in Weaviate improves accuracy by up to 25% compared to pure vector search, and in our benchmarks on 10k vectors Weaviate's p99 query latency was half that of Pinecone. Weaviate also provides a GraphQL API for flexible queries.

Improving RAG Accuracy with Hybrid Search

Compare three search modes:

Method	Description	Best Scenario
near_text (dense)	Semantic search by embedding	General questions without exact terms
BM25	Full-text search	Queries with article numbers, codes
hybrid	Combination of dense + BM25	Universal, +10–15% recall

For the legal case, we chose hybrid with α=0.65 and added reranking. This boosted Context Precision from 0.71 to 0.89. Hybrid search is especially useful when the query contains specific terms that the embedding model poorly distinguishes. We recommend fusion_type RELATIVE_SCORE for best results.
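To make the role of α concrete, here is a minimal pure-Python sketch of relative score fusion. It is a simplified illustration, not Weaviate's internal code: each result set is min-max normalized, then blended with α on the vector side and 1 - α on the BM25 side (in Weaviate, alpha=1 is pure vector search and alpha=0 is pure keyword search). The document ids and scores are made up for the example.

```python
def min_max(scores):
    """Scale raw scores to [0, 1]; a constant list maps to all 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def relative_score_fusion(vector_scores, bm25_scores, alpha=0.65):
    """Blend normalized scores: alpha weights vectors, 1 - alpha weights BM25."""
    v, k = min_max(vector_scores), min_max(bm25_scores)
    fused = {
        doc: alpha * v.get(doc, 0.0) + (1 - alpha) * k.get(doc, 0.0)
        for doc in set(v) | set(k)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores for four documents (illustration only).
vector_scores = {"doc_a": 0.82, "doc_b": 0.80, "doc_c": 0.60}
bm25_scores = {"doc_b": 7.5, "doc_c": 2.0, "doc_d": 1.0}

ranking = relative_score_fusion(vector_scores, bm25_scores, alpha=0.65)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

Here doc_b wins because it scores well in both result sets, while doc_a is strong only semantically. In the v4 Python client the same two knobs are passed to query.hybrid as alpha and fusion_type.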

Choosing Hybrid Search Over Pure Vector Search

Hybrid search is the optimal choice when queries contain unique identifiers (article numbers, codes) or when the knowledge base is heterogeneous. In our project with medical documentation, hybrid raised recall from 0.62 to 0.81 compared to near_text. We recommend starting with α=0.6 and tuning it on your own evaluation set. For queries dominated by specific terms, the accuracy gain over pure vector search can reach 2x.

Multi-Tenancy in Weaviate

If you have a SaaS product, use built-in multi-tenancy:

Code Example

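from weaviate.classes.config import Configure
from weaviate.classes.tenants import Tenant

# Assumes `client` is a connected Weaviate client (see Connection Setup).
# Multi-tenancy must be enabled when the collection is created.
client.collections.create(
    name="ClientDocs",
    multi_tenancy_config=Configure.multi_tenancy(enabled=True),
)
collection = client.collections.get("ClientDocs")
# Register a tenant, then scope every query to that tenant.
collection.tenants.create([Tenant(name="client_001")])
tenant_collection = collection.with_tenant("client_001")
results = tenant_collection.query.hybrid(query="...", limit=5)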

Data isolation is guaranteed at the database level, critical for compliance and security.

Key Metrics for RAG System Monitoring

For production monitoring, track:

Context Precision — proportion of relevant documents among top-k.
Faithfulness — how well the answer matches the context.
Answer Relevancy — relevance of the answer to the query.
Latency p99 — system response time.
GPU Utilization — load during inference.

These metrics help detect quality degradation before users notice it.
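As a rough illustration of two of these metrics, the sketch below computes Context Precision exactly as defined above (share of relevant documents in the top-k) and a nearest-rank p99 latency. The document ids and latency samples are hypothetical; evaluation frameworks such as RAGAS use a rank-aware variant of Context Precision, so treat this as the simplest possible version.

```python
import math


def context_precision(retrieved_ids, relevant_ids, k):
    """Share of the top-k retrieved documents that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / len(top_k)


def percentile(values, pct):
    """Nearest-rank percentile, e.g. pct=99 for p99 latency."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical retrieval result and relevance labels.
retrieved = ["d1", "d7", "d3", "d9", "d4"]
relevant = {"d1", "d3", "d4", "d8"}
precision = context_precision(retrieved, relevant, k=5)

# Hypothetical latency samples in milliseconds: 1, 2, ..., 100.
latencies_ms = list(range(1, 101))
p99 = percentile(latencies_ms, 99)

print(f"Context Precision@5 = {precision:.2f}, p99 latency = {p99} ms")
```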

Technical Implementation of RAG on Weaviate

Connection Setup

Steps to set up Weaviate connection:

Install weaviate-client.
Connect to local instance.
Create schema.
Index data.
Perform search.

import weaviate
import weaviate.classes as wvc
from weaviate.classes.config import Configure, Property, DataType

client = weaviate.connect_to_local(
    host="localhost", port=8080, grpc_port=50051
)

Schema Creation and Indexing

client.collections.create(
    name="KnowledgeBase",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(
        model="text-embedding-3-large", dimensions=3072
    ),
    generative_config=Configure.Generative.openai(model="gpt-4o"),
    properties=[
        Property(name="content", data_type=DataType.TEXT),
        Property(name="source", data_type=DataType.TEXT),
        Property(name="doc_type", data_type=DataType.TEXT),
        Property(name="page_number", data_type=DataType.INT),
        Property(name="department", data_type=DataType.TEXT),
    ],
)

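collection = client.collections.get("KnowledgeBase")
# document_chunks: pre-split documents exposing page_content and metadata
with collection.batch.dynamic() as batch:
    for chunk in document_chunks:
        batch.add_object(properties={
            "content": chunk.page_content,
            "source": chunk.metadata["source"],
            "doc_type": chunk.metadata.get("doc_type", "general"),
            "page_number": chunk.metadata.get("page", 0),
            "department": chunk.metadata.get("department", ""),
        })
# Objects that failed to import are collected once the batch closes
if collection.batch.failed_objects:
    print(f"Failed to import {len(collection.batch.failed_objects)} objects")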

Weaviate automatically vectorizes text — no need to manually call the embedding API.

Generative Search (RAG)

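# single_prompt runs once per retrieved object ({content} is filled from its property);
# grouped_task runs once over all retrieved objects together.
response = collection.generate.near_text(
    query="What is the procurement approval process?",
    limit=3,
    single_prompt="Based on the document: {content}\nAnswer the question: what is the procurement approval process?",
    grouped_task="Summarize the key steps of the procedure.",
)
for obj in response.objects:
    print(obj.generated)   # per-object answer from single_prompt
print(response.generated)  # combined answer from grouped_task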

Comparison of Weaviate with Alternatives

Criterion	Weaviate	Pinecone	Qdrant
Hybrid search	Built-in (BM25+vector)	Sparse-dense vectors	Dense + sparse vectors
Multi-tenancy	Native	Via namespaces	Via collections
Text generation	Built-in module	Via integrations	None
Open source	Yes	No	Yes

Weaviate wins in flexibility and out-of-the-box functionality, especially for complex RAG scenarios.

What the Work Includes

When you order a RAG system, you get:

Solution architecture with justification of choice (Weaviate vs Pinecone vs Qdrant)
Indexing pipeline code with error handling
Configured search (near_text, BM25, hybrid) with adjustable α
Deployed RAG endpoint with generation (OpenAI or your LLM)
Monitoring and support instructions
Scaling documentation (Kubernetes, replication)
Free consultation for a month after delivery

We guarantee timelines and transparent reporting. For an assessment of your project, contact our engineers.

Timelines and Scaling

Schema and connector setup: 2–3 days
Ingestion pipeline: 3–7 days (depends on data volume)
RAG pipeline with evaluation: 1–2 weeks
Multi-tenancy and production deployment: 1–2 weeks

Total: 2–5 weeks to a working prototype.

Order RAG system development today — get a free expert consultation.

LLM Development: Fine-Tuning, RAG, Agents, and Production Deployment

Using GPT‑4 or Claude 3.5 Sonnet through a public API is not a solution — it's just a tool. When the requirement is to "make it like ChatGPT, but on our data," there is a real engineering challenge behind it: from prompt engineering to training a 70B model on your own infrastructure. End-to-end LLM solution development is a complex stack, and we have been doing it for over 5 years. During this time, we have completed over 20 projects in generative AI: from RAG systems for legal departments to custom support agents. Where exactly your task falls depends on data, latency requirements, budget, and how critical confidentiality is.

A typical situation: the client has already tried ChatGPT, but results are unstable — sometimes accurate, sometimes hallucinating. Or they need integration into a corporate portal while complying with security policies. Let's break down each layer of the stack in detail — from RAG to production deployment.

Why Do RAG Systems Break and How to Fix It?

RAG (Retrieval-Augmented Generation) looks simple: find relevant documents, put them in context, get an answer. In practice, it fails in several places.

Chunking without overlap. Classic mistake: chunk_size=512, overlap=0. If the answer lies across two chunks, retrieval won't find either with sufficient confidence. Solution: overlap 15–25% of chunk_size, or better yet, sentence-aware splitting with spaCy or NLTK instead of naive character splitting.
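A minimal sketch of sentence-aware chunking with overlap is shown below. It uses a naive regex splitter as a stand-in for spaCy or NLTK, and the function names and the 20% default overlap are assumptions chosen for illustration. Whole sentences are packed into chunks, and each new chunk starts with the trailing sentences of the previous one.

```python
import re


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_sentences(text, chunk_size=512, overlap_ratio=0.2):
    """Pack whole sentences into chunks of roughly chunk_size characters.

    Each new chunk begins with trailing sentences of the previous chunk,
    up to overlap_ratio * chunk_size characters.
    """
    overlap_budget = int(chunk_size * overlap_ratio)
    chunks, current = [], []
    for sentence in split_sentences(text):
        current_len = sum(len(s) + 1 for s in current)
        if current and current_len + len(sentence) > chunk_size:
            chunks.append(" ".join(current))
            # Carry over trailing sentences that fit into the overlap budget.
            carried, used = [], 0
            for prev in reversed(current):
                if used + len(prev) > overlap_budget:
                    break
                carried.insert(0, prev)
                used += len(prev) + 1
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


# Tiny sizes so the overlap is visible in the output.
sample = "A one. B two. C three. D four. E five."
chunks = chunk_sentences(sample, chunk_size=20, overlap_ratio=0.4)
for chunk in chunks:
    print(chunk)
```

With these toy settings every chunk shares one sentence with its neighbor, so an answer that spans a chunk boundary still appears whole in at least one chunk.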

Poor embedder. text-embedding-ada-002 is good for general use, but on legal or medical texts, specialized models like E5-large-v2, BGE-M3, or fine-tuned sentence-transformers on domain data outperform it. Recall@5 differences can be 15–25%.

No re-ranking. Vector search optimizes for speed, not relevance. A cross-encoder re-ranker (ms-marco-MiniLM-L-6-v2, bge-reranker-large) after initial retrieval improves top-3 accuracy with acceptable latency (+50–150ms). This is often more impactful than improving the embedding model.

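No hybrid search. Dense vectors alone work poorly on exact queries: names, SKUs, codes. BM25 (sparse) finds exact matches but misses semantics. Hybrid search that fuses both rankings, for example via RRF (Reciprocal Rank Fusion), is the optimal compromise. Qdrant and Weaviate support hybrid search natively; with pgvector 0.7+ it can be assembled in PostgreSQL by combining sparse or full-text search with vector search.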
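Reciprocal Rank Fusion itself fits in a few lines. The sketch below is a generic implementation, not tied to any particular database: each document receives 1 / (k + rank) from every ranking it appears in, with k=60 as a commonly used default. The document ids are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results from a dense retriever and from BM25.
dense = ["doc_3", "doc_1", "doc_7"]
sparse = ["doc_9", "doc_3", "doc_1"]

fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
```

Documents that appear in both lists (doc_3, doc_1) rise above those found by only one retriever, without any need to compare raw scores from different scales.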

Typical production architecture for a corporate knowledge base

Documents → preprocessing (PyMuPDF, Unstructured)
Chunking → embedding (BGE-M3)
Qdrant (hybrid dense+sparse)
Cross-encoder re-ranking
Context → LLM (vLLM or OpenAI API)
Answer with sources (RAGAS for quality evaluation)

When to Fine-Tune Instead of Prompt Engineering?

Prompt engineering solves ~70% of LLM adaptation tasks for a domain. The remaining 30% require fine-tuning. Three indicators: the model ignores a specific output format even with detailed prompting; the task requires deep knowledge of specialized vocabulary (medicine, law); you need to significantly reduce token costs by replacing a large model with a smaller specialized one.

LoRA and QLoRA are the standard for supervised fine-tuning (SFT). LoRA adds trainable low-rank matrices to attention layers. A typical configuration for Llama-3 8B (r=64, lora_alpha=128, target_modules=["q_proj","v_proj","k_proj","o_proj"]) yields ~0.7% trainable parameters and trains on a single A100 40GB. QLoRA adds 4-bit quantization (NF4) and allows fine-tuning 70B models on two A100 40GB, though training runs at about half the speed of bf16.
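The share of trainable parameters can be checked with simple arithmetic. The sketch below assumes the published Llama-3 8B shapes (hidden size 4096, 32 layers, grouped-query attention with 8 KV heads of dimension 128) and an approximate total of 8.03B parameters; under these assumptions the result lands a little under 1% of the base model.

```python
def lora_params(in_features, out_features, r):
    # LoRA adds A (r x in_features) and B (out_features x r) per adapted matrix.
    return r * (in_features + out_features)


# Assumed Llama-3 8B shapes; k_proj and v_proj output 8 heads * 128 = 1024.
HIDDEN, KV_DIM, LAYERS, R = 4096, 1024, 32, 64
TOTAL_PARAMS = 8.03e9  # approximate size of the base model

per_layer = (
    lora_params(HIDDEN, HIDDEN, R)    # q_proj
    + lora_params(HIDDEN, KV_DIM, R)  # k_proj
    + lora_params(HIDDEN, KV_DIM, R)  # v_proj
    + lora_params(HIDDEN, HIDDEN, R)  # o_proj
)
trainable = per_layer * LAYERS
share = trainable / TOTAL_PARAMS

print(f"{trainable:,} trainable parameters ({share:.2%} of the base model)")
```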

DPO instead of RLHF. Direct Preference Optimization requires only (chosen, rejected) pairs, not scalar reward signals. DPOTrainer from the trl library (Hugging Face) implements it in a few dozen lines.
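For intuition, here is the per-pair DPO loss written out in plain Python. It is a sketch of the published formula, not the trl implementation: the inputs are summed log-probabilities of the chosen and rejected answers under the policy and under the frozen reference model, and beta=0.1 is an assumed default.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # Numerically stable softplus(-margin), equal to -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: no preference learned yet, loss = log(2).
untrained = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# Policy raised the chosen answer and lowered the rejected one: lower loss.
improved = dpo_loss(-5.0, -15.0, -10.0, -10.0)

print(f"untrained: {untrained:.4f}, improved: {improved:.4f}")
```

The loss falls as the policy assigns relatively more probability to the chosen answer than the reference does, which is why no separate reward model is needed.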

Common mistake. A dataset of 500 examples, 5 epochs, validation loss 0.8 — seems fine. But on test, the model degrades on general instructions. Cause: catastrophic forgetting. Solution: add 10–20% general instruction-following examples (Alpaca, FLAN) to the training set to preserve original capabilities.
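One way to apply this mixing is sketched below. It assumes the 10–20% refers to the share of general examples in the final training set, and the function name and record layout are hypothetical. For 500 domain examples and a 15% target, it adds 88 general examples.

```python
import random


def mix_with_general(domain_examples, general_pool, general_share=0.15, seed=0):
    """Add general examples so they make up about general_share of the result."""
    if not 0 <= general_share < 1:
        raise ValueError("general_share must be in [0, 1)")
    n_general = round(general_share / (1 - general_share) * len(domain_examples))
    n_general = min(n_general, len(general_pool))
    rng = random.Random(seed)
    mixed = list(domain_examples) + rng.sample(general_pool, n_general)
    rng.shuffle(mixed)
    return mixed


# Hypothetical records standing in for domain and general instruction data.
domain = [{"source": "domain", "id": i} for i in range(500)]
general = [{"source": "general", "id": i} for i in range(5000)]

mixed = mix_with_general(domain, general, general_share=0.15)
n_general = sum(1 for example in mixed if example["source"] == "general")
print(f"{len(mixed)} examples, {n_general / len(mixed):.1%} general")
```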

How to Choose a Base Model: 8B or 70B?

Model	Parameters	Strengths	Context
Llama-3.1 8B	8B	Quality/speed balance	128k
Llama-3.1 70B	70B	Complex reasoning	128k
Mistral 7B / Mixtral 8x7B	7B / 47B	Efficiency for size	32k
Qwen2.5 72B	72B	Code, multilingual	128k
Gemma 2 27B	27B	Open license	8k

For most tasks, fine-tuning an 8B model is sufficient. 70B is needed when deep reasoning is required or the 8B baseline does not reach the required quality even after fine-tuning. An 8B model fits on a single A100 and is served efficiently with vLLM; the cost per request depends on traffic volume.

What Does PagedAttention Bring to Production?

vLLM is the first choice for serving open-source models. PagedAttention is the key technical innovation: KV-cache is managed like virtual memory in an OS, without fragmentation. This yields 2–4x higher throughput compared to naive HuggingFace Transformers inference. The vLLM documentation confirms that continuous batching and PagedAttention are the standard for high-load LLM services.
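The idea behind paged KV-cache management can be shown with a toy allocator. This is a conceptual sketch, not vLLM code: the class name, the block size of 16 tokens, and the 2048-token reservation used for comparison are all assumptions. Blocks are handed out only when a sequence actually needs them, so the waste per sequence stays below one block.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (assumed for this sketch)


class PagedKVCache:
    """Toy allocator mapping each sequence to a list of physical block ids."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> physical block ids, in logical order
        self.seq_lens = {}      # seq_id -> number of cached tokens

    def append_token(self, seq_id):
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % BLOCK_SIZE == 0:  # last block is full, or none allocated yet
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_slots(self):
        used = sum(self.seq_lens.values())
        allocated = sum(len(t) for t in self.block_tables.values()) * BLOCK_SIZE
        return allocated - used


cache = PagedKVCache(num_blocks=64)
for seq_id, num_tokens in {"a": 40, "b": 5}.items():
    for _ in range(num_tokens):
        cache.append_token(seq_id)

MAX_LEN = 2048  # assumed per-sequence reservation without paging
paged_waste = cache.wasted_slots()
contiguous_waste = 2 * MAX_LEN - sum(cache.seq_lens.values())
print(f"paged: {paged_waste} wasted slots, contiguous: {contiguous_waste}")
```

The memory saved this way is what lets the server keep more sequences in a batch at once, which is where the throughput gain comes from.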

Typical numbers on A100 80GB for Llama-3 8B (bf16): 400–600 req/s, P50 latency 200–400ms, P99 latency 600–900ms at concurrency 64. For 70B on two A100 with tensor parallelism: 80–120 req/s, P99 latency 1.5–2.5s. AWQ or GPTQ quantization reduces memory consumption by 2x with quality loss within 1–3%.
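A back-of-the-envelope estimate of weight memory helps explain these hardware choices. The sketch below counts weights only (KV-cache and activations come on top, and real quantized formats store extra scales), using round parameter counts as an assumption.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory needed for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


for name, params in [("8B", 8e9), ("70B", 70e9)]:
    for label, bits in [("bf16", 16), ("int8", 8), ("int4", 4)]:
        print(f"{name} {label}: {weight_memory_gb(params, bits):.0f} GB")
```

Under these assumptions, a 70B model in bf16 needs about 140 GB for weights, which exceeds a single 80 GB card and explains the tensor-parallel setup, while 8-bit quantization halves the weight footprint and 4-bit quarters it.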

Multi-Agent Systems

Agents are LLMs with access to tools: search, code execution, API calls, database interaction. Common patterns:

ReAct (Reason + Act): the model reasons → chooses a tool → observes the result → reasons again. LangChain and LlamaIndex implement it out of the box.
Multi-agent orchestration: multiple specialized agents with a coordinator on top. Example: coordinator → researcher (search + summarization) → coder (code generation and execution) → critic (verification). Tools: AutoGen (Microsoft), CrewAI, custom implementation on LangGraph.

In production, agent systems are non-deterministic. Essential: guardrails, step limits, logging of each step, human-in-the-loop for critical actions.
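A minimal agent loop with a step limit and per-step logging might look like the sketch below. The scripted policy is a hypothetical stand-in for an LLM call, and the tool and knowledge base are invented for illustration; human-in-the-loop approval is left out for brevity.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent")

# Hypothetical knowledge base and tool registry.
KNOWLEDGE_BASE = {
    "procurement approval": "Purchases need sign-off from the department head.",
}


def search(query):
    return KNOWLEDGE_BASE.get(query, "no results")


TOOLS = {"search": search}


def scripted_policy(question, observations):
    """Stand-in for an LLM: picks the next action from what was observed so far."""
    if not observations:
        return {"action": "search", "input": "procurement approval"}
    return {"action": "finish", "input": f"Answer: {observations[-1]}"}


def run_agent(question, policy, max_steps=5):
    """Reason -> act -> observe loop that stops at a hard step limit."""
    observations = []
    for step in range(1, max_steps + 1):
        decision = policy(question, observations)
        log.info("step=%d decision=%s", step, decision)  # audit trail per step
        if decision["action"] == "finish":
            return decision["input"]
        tool = TOOLS.get(decision["action"])
        if tool is None:  # guardrail: only registered tools may run
            observations.append(f"unknown tool: {decision['action']}")
            continue
        observations.append(tool(decision["input"]))
    raise RuntimeError(f"step limit of {max_steps} reached without an answer")


answer = run_agent("Who approves purchases?", scripted_policy)
print(answer)
```

The step limit turns a runaway loop into an explicit error that monitoring can catch, instead of an unbounded chain of tool calls.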

How We Work: Stages, Timeline, Deliverables

Stage	Duration	What You Get
Audit and data collection	1–2 weeks	Eval dataset of 100+ examples, task formalization
Baseline (prompt + RAG)	1–2 weeks	Working prototype, quality metrics
Fine-tuning (if needed)	2–4 weeks	Trained model, LoRA weights, model card
Deployment and monitoring	1–2 weeks	vLLM server, Grafana + Prometheus
Documentation and training	1 week	API documentation, team training

What Is Included

We deliver:

Technical documentation (model card, configs, deployment instructions)
Access to infrastructure (code repository, trained weights)
1 month of post-deployment support (consultations, bug fixes)
Customer team training (2–3 sessions on system operation)

Timeline: basic RAG prototype — 1–2 weeks. Fine-tuning with customer data — 3–6 weeks (including data preparation). Production system with monitoring and retraining — 2–4 months. Cost is calculated individually based on data volume, model complexity, and infrastructure requirements.

We guarantee the quality of the final model with performance benchmarks and ongoing monitoring. Our engineers have hands‑on experience with dozens of production LLM systems.

Want to evaluate your project? Leave a request — we will prepare a preliminary summary within 1–2 business days. Or get a consultation on choosing the approach: RAG, fine-tuning, or hybrid — we will tell you what works best for you. Contact us to discuss your LLM development needs. Schedule a free consultation today.