RAG Development with Pinecone Vector Database
Pinecone is a managed vector database with a REST/gRPC API, automatic scaling, and hybrid search support (sparse + dense). It requires no infrastructure management and scales from prototype to millions of vectors. Pinecone Serverless (generally available since 2024) removes the need to pre-provision capacity: you pay only for the reads, writes, and storage you actually use.
Initialization and Index Creation
```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="...")

# Create a serverless index (guarded so re-running the script is safe)
if not pc.has_index("corporate-knowledge-base"):
    pc.create_index(
        name="corporate-knowledge-base",
        dimension=3072,  # must match the embedding model (text-embedding-3-large)
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )

index = pc.Index("corporate-knowledge-base")
```
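A frequent setup bug is a mismatch between the index dimension and the embedding model's output size, which only surfaces as an upsert error later. A minimal guard is sketched below; the model-to-dimension map covers only the two OpenAI models used in this guide, and the helper name is illustrative:

```python
# Output dimensions of the OpenAI embedding models used in this guide
EMBEDDING_DIMS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
}

def check_dimension(model: str, index_dimension: int) -> bool:
    """Return True if the index dimension matches the model's output size."""
    expected = EMBEDDING_DIMS.get(model)
    if expected is None:
        raise ValueError(f"Unknown embedding model: {model}")
    return expected == index_dimension

print(check_dimension("text-embedding-3-large", 3072))  # True
```

Running this check at startup fails fast instead of producing cryptic dimension errors on the first upsert.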
Document Indexing with Metadata
```python
import hashlib

from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

embeddings_model = OpenAIEmbeddings(model="text-embedding-3-large")

def index_documents(documents: list, batch_size: int = 100):
    splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
    chunks = splitter.split_documents(documents)

    # Batch indexing: embed and upsert in groups to respect API limits
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        texts = [c.page_content for c in batch]
        vectors = embeddings_model.embed_documents(texts)

        # Prepare records for Pinecone
        records = []
        for chunk, vector in zip(batch, vectors):
            doc_id = hashlib.md5(chunk.page_content.encode()).hexdigest()
            records.append({
                "id": doc_id,
                "values": vector,
                "metadata": {
                    "text": chunk.page_content,
                    "source": chunk.metadata.get("source", ""),
                    "page": chunk.metadata.get("page", 0),
                    "doc_type": chunk.metadata.get("doc_type", "general"),
                    "date": chunk.metadata.get("date", ""),
                },
            })

        index.upsert(vectors=records)
        print(f"Indexed batch {i // batch_size + 1}: {len(records)} chunks")
```
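One subtlety with the ID scheme above: hashing only the chunk text means two identical chunks from different documents get the same ID and silently overwrite each other on upsert. If per-document provenance matters, a collision-safer ID can include the source and page (the helper name is illustrative):

```python
import hashlib

def make_chunk_id(source: str, page: int, text: str) -> str:
    """Deterministic chunk ID that includes provenance, so identical
    text from different sources or pages does not collide on upsert."""
    key = f"{source}|{page}|{text}".encode("utf-8")
    return hashlib.md5(key).hexdigest()

# Same text, different sources -> different IDs
a = make_chunk_id("faq.md", 1, "Return policy: 30 days.")
b = make_chunk_id("policy.pdf", 12, "Return policy: 30 days.")
print(a != b)  # True
```

Determinism is still preserved, so re-indexing the same document updates records in place rather than duplicating them.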
Query with Metadata Filtering
```python
def rag_query(
    query: str,
    doc_type_filter: str | None = None,
    top_k: int = 5,
) -> list[dict]:
    # Embed the query with the same model used at indexing time
    query_vector = embeddings_model.embed_query(query)

    # Build an optional metadata filter
    filter_dict = {}
    if doc_type_filter:
        filter_dict["doc_type"] = {"$eq": doc_type_filter}

    # Vector search with metadata filtering
    results = index.query(
        vector=query_vector,
        top_k=top_k,
        include_metadata=True,
        filter=filter_dict if filter_dict else None,
    )

    # Assemble context chunks for the LLM prompt
    context_chunks = []
    for match in results["matches"]:
        context_chunks.append({
            "text": match["metadata"]["text"],
            "source": match["metadata"]["source"],
            "score": match["score"],
        })
    return context_chunks
```
Hybrid Search in Pinecone
Pinecone supports hybrid search (dense + sparse) through sparse-dense vectors; the sparse side is produced client-side with the BM25 encoder from the companion pinecone-text package, not by the database itself. Sparse-dense queries also require an index created with metric="dotproduct":
```python
from pinecone_text.sparse import BM25Encoder

# Fit BM25 term statistics on the document corpus
bm25 = BM25Encoder()
bm25.fit(all_texts)

def hybrid_query(query: str, alpha: float = 0.5, top_k: int = 5) -> list:
    """
    alpha=1.0: dense only
    alpha=0.0: sparse (BM25) only
    alpha=0.5: equal weight to both
    """
    # Dense vector
    dense_vector = embeddings_model.embed_query(query)
    # Sparse vector (BM25)
    sparse_vector = bm25.encode_queries(query)

    # Pinecone has no server-side alpha parameter: the convex combination
    # is applied client-side by scaling both vectors before the query
    dense_scaled = [v * alpha for v in dense_vector]
    sparse_scaled = {
        "indices": sparse_vector["indices"],
        "values": [v * (1 - alpha) for v in sparse_vector["values"]],
    }

    results = index.query(
        vector=dense_scaled,
        sparse_vector=sparse_scaled,
        top_k=top_k,
        include_metadata=True,
    )
    return results["matches"]
```
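Because dot-product scores are linear in each vector, scaling the dense vector by alpha and the sparse values by (1 - alpha) before the query yields exactly the weighted score alpha * dense_score + (1 - alpha) * sparse_score. This mirrors the hybrid_score_norm helper shown in Pinecone's hybrid search docs, and the weighting itself is pure and easy to verify:

```python
def hybrid_score_norm(dense: list, sparse: dict, alpha: float):
    """Convex combination of dense and sparse query vectors:
    dense is scaled by alpha, sparse values by (1 - alpha)."""
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must be in [0, 1]")
    scaled_dense = [v * alpha for v in dense]
    scaled_sparse = {
        "indices": sparse["indices"],
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }
    return scaled_dense, scaled_sparse

d, s = hybrid_score_norm([1.0, 0.5], {"indices": [3, 7], "values": [2.0, 4.0]}, alpha=0.75)
print(d)             # [0.75, 0.375]
print(s["values"])   # [0.5, 1.0]
```

At alpha=1.0 the sparse values collapse to zero (dense-only search) and at alpha=0.0 the dense vector vanishes, matching the docstring in hybrid_query.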
Practical Case: Retail Corporate Knowledge Base
Scale: 45,000 SKUs with descriptions, 3,200 pages of regulations, 800 FAQ entries. Total ~180,000 vectors.
Configuration: Pinecone Serverless (aws/us-east-1), dimension=1536 (text-embedding-3-small for savings), metric=cosine.
Usage pattern: 15,000 queries/day, peak load 200 RPS during sales hours.
Results:
- Retrieval latency P95: 180ms
- Full RAG answer latency P95: 2.1s (including GPT-4o-mini)
- Pinecone cost: ~$80/month (Serverless)
- Context recall (found needed document): 0.87
- Answer accuracy (LLM-judge): 0.83
Optimizations:
- Namespace separation: products, regulations, and FAQ live in separate namespaces, so each query is scoped up front without filter overhead
- Metadata-only lookups: for some queries (e.g., exact SKU lookups) a metadata-filtered lookup is enough and the vector-search step is skipped
- Cache popular queries: Redis cache for the top 500 frequent questions (~30% hit rate)
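The query-cache optimization above can be sketched as follows; an in-process dict stands in for Redis, and the class name, normalization, and eviction policy are illustrative simplifications (a production setup would use Redis with a TTL):

```python
import hashlib

class QueryCache:
    """Tiny exact-match cache for frequent RAG queries."""

    def __init__(self, max_size: int = 500):
        self.max_size = max_size
        self._store: dict[str, list] = {}
        self.hits = 0
        self.misses = 0

    def _key(self, query: str) -> str:
        # Normalize so trivial variations of the same question share a key
        return hashlib.md5(query.strip().lower().encode()).hexdigest()

    def get_or_compute(self, query: str, compute):
        key = self._key(query)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = compute(query)
        if len(self._store) < self.max_size:
            self._store[key] = result
        return result

cache = QueryCache()
answer = cache.get_or_compute("What is the return policy?", lambda q: ["chunk-1"])
again = cache.get_or_compute("what is the return policy? ", lambda q: ["chunk-1"])
print(cache.hits, cache.misses)  # 1 1
```

Exact-match caching only helps when the same question recurs verbatim (after normalization); the ~30% hit rate in the case study reflects how concentrated retail FAQ traffic is on a small set of questions.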
Timeline
- Pinecone setup + ingestion pipeline: 3–5 days
- RAG pipeline with quality evaluation: 1–2 weeks
- Optimization and production: 1–2 weeks
- Total: 2–5 weeks