RAG Development with Pinecone Vector Database

We design and deploy artificial intelligence systems, from prototypes to production-ready solutions. Our team combines expertise in machine learning, data engineering, and MLOps to make AI work in real business settings, not just in the lab.

Pinecone is a managed vector database with REST/gRPC APIs, automatic scaling, and hybrid search support (sparse + dense vectors). It requires no infrastructure management and scales from prototype to millions of vectors. Pinecone Serverless (available since 2024) removes the need to pre-reserve resources: you pay only for the operations you actually perform.

Initialization and Index Creation

import os

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create the serverless index once; skip if it already exists
if "corporate-knowledge-base" not in pc.list_indexes().names():
    pc.create_index(
        name="corporate-knowledge-base",
        dimension=3072,        # must match text-embedding-3-large
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        )
    )

index = pc.Index("corporate-knowledge-base")

Document Indexing with Metadata

from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
import hashlib

embeddings_model = OpenAIEmbeddings(model="text-embedding-3-large")

def index_documents(documents: list, batch_size: int = 100):
    splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
    chunks = splitter.split_documents(documents)

    # Batch indexing
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]

        texts = [c.page_content for c in batch]
        vectors = embeddings_model.embed_documents(texts)

        # Prepare records for Pinecone
        records = []
        for chunk, vector in zip(batch, vectors):
            doc_id = hashlib.md5(chunk.page_content.encode()).hexdigest()
            records.append({
                "id": doc_id,
                "values": vector,
                "metadata": {
                    "text": chunk.page_content,
                    "source": chunk.metadata.get("source", ""),
                    "page": chunk.metadata.get("page", 0),
                    "doc_type": chunk.metadata.get("doc_type", "general"),
                    "date": chunk.metadata.get("date", ""),
                }
            })

        index.upsert(vectors=records)
        print(f"Indexed batch {i//batch_size + 1}: {len(records)} chunks")
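The MD5-based IDs above make indexing idempotent: re-running the pipeline over unchanged documents upserts the same records instead of creating duplicates. The property can be checked without touching Pinecone (the helper below mirrors the ID logic in `index_documents`):

```python
import hashlib

def chunk_id(text: str) -> str:
    # Same scheme as in index_documents: content-derived, deterministic ID
    return hashlib.md5(text.encode()).hexdigest()

a = chunk_id("Refund policy: 30 days.")
b = chunk_id("Refund policy: 30 days.")
c = chunk_id("Refund policy: 14 days.")

assert a == b   # identical content -> identical ID -> idempotent upsert
assert a != c   # changed content -> a new record
print(a)
```

One caveat of content-derived IDs: when a chunk's text changes, the old record is not overwritten but orphaned, so a periodic cleanup (e.g. by `source` metadata) is useful.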

Query with Metadata Filtering

def rag_query(
    query: str,
    doc_type_filter: str | None = None,
    top_k: int = 5
) -> list:

    # Query embedding
    query_vector = embeddings_model.embed_query(query)

    # Build filter
    filter_dict = {}
    if doc_type_filter:
        filter_dict["doc_type"] = {"$eq": doc_type_filter}

    # Search
    results = index.query(
        vector=query_vector,
        top_k=top_k,
        include_metadata=True,
        filter=filter_dict if filter_dict else None
    )

    # Build context
    context_chunks = []
    for match in results["matches"]:
        context_chunks.append({
            "text": match["metadata"]["text"],
            "source": match["metadata"]["source"],
            "score": match["score"]
        })

    return context_chunks
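The returned chunks are then assembled into an LLM prompt. A minimal sketch (the chunk structure mirrors what `rag_query` returns; `build_prompt` and its wording are illustrative, not part of the pipeline above):

```python
def build_prompt(query: str, context_chunks: list) -> str:
    # Join retrieved chunks, tagging each with its source and score
    context = "\n\n".join(
        f"[{c['source']}] (score {c['score']:.2f})\n{c['text']}"
        for c in context_chunks
    )
    return (
        "Answer using only the context below. Cite sources.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

chunks = [
    {"text": "Returns are accepted within 30 days.", "source": "policy.pdf", "score": 0.91},
]
print(build_prompt("What is the return window?", chunks))
```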

Hybrid Search in Pinecone

Pinecone supports hybrid search (dense + sparse) on indexes created with metric="dotproduct"; the BM25 sparse encoder comes from the companion pinecone-text package:

from pinecone_text.sparse import BM25Encoder

# Fit BM25 statistics on the raw chunk texts (all_texts: list[str])
bm25 = BM25Encoder()
bm25.fit(all_texts)

def hybrid_query(query: str, alpha: float = 0.5, top_k: int = 5) -> list:
    """
    alpha=1.0: dense only
    alpha=0.0: sparse (BM25) only
    alpha=0.5: equal weight to both

    Pinecone's query API has no alpha parameter, so the weighting
    is applied client-side by scaling both vectors before the query.
    """
    # Dense vector
    dense_vector = embeddings_model.embed_query(query)

    # Sparse vector (BM25)
    sparse_vector = bm25.encode_queries(query)

    # Client-side convex combination: dense * alpha, sparse * (1 - alpha)
    scaled_dense = [v * alpha for v in dense_vector]
    scaled_sparse = {
        "indices": sparse_vector["indices"],
        "values": [v * (1 - alpha) for v in sparse_vector["values"]],
    }

    results = index.query(
        vector=scaled_dense,
        sparse_vector=scaled_sparse,
        top_k=top_k,
        include_metadata=True,
    )
    return results["matches"]
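The convex-combination weighting can be verified in isolation, with no Pinecone connection (the standalone `hybrid_scale` helper below reproduces the scaling step):

```python
def hybrid_scale(dense: list, sparse: dict, alpha: float):
    """Scale dense by alpha and sparse by (1 - alpha) for hybrid queries."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    scaled_dense = [v * alpha for v in dense]
    scaled_sparse = {
        "indices": sparse["indices"],
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }
    return scaled_dense, scaled_sparse

dense = [0.4, 0.8]
sparse = {"indices": [3, 17], "values": [1.0, 2.0]}

# alpha=1.0 keeps the dense vector intact and zeroes the sparse weights
d, s = hybrid_scale(dense, sparse, 1.0)
print(d, s["values"])  # [0.4, 0.8] [0.0, 0.0]
```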

Practical Case: Retail Corporate Knowledge Base

Scale: 45,000 SKUs with descriptions, 3,200 pages of regulations, 800 FAQ entries. Total ~180,000 vectors.

Configuration: Pinecone Serverless (aws/us-east-1), dimension=1536 (text-embedding-3-small, chosen to reduce embedding cost), metric=cosine.

Usage pattern: 15,000 queries/day, peak load 200 RPS during sales hours.
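A back-of-the-envelope storage estimate for this configuration (raw float32 dense vectors only, excluding metadata and index overhead):

```python
vectors = 180_000        # total vectors in the knowledge base
dimension = 1536         # text-embedding-3-small
bytes_per_float = 4      # float32

raw_bytes = vectors * dimension * bytes_per_float
print(f"{raw_bytes / 1e9:.2f} GB")  # 1.11 GB of raw dense vectors
```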

Results:

  • Retrieval latency P95: 180ms
  • Full RAG answer latency P95: 2.1s (including GPT-4o-mini)
  • Pinecone cost: ~$80/month (Serverless)
  • Context recall (found needed document): 0.87
  • Answer accuracy (LLM-judge): 0.83

Optimizations:

  • Namespace separation: products, regulations, and FAQ live in separate namespaces, so each query searches only the relevant slice with no filter overhead
  • Metadata-only queries: for some queries, metadata filter alone is sufficient without vector search
  • Cache popular queries: Redis cache for top-500 frequent questions (~30% hit rate)

Timeline

  • Pinecone setup + ingestion pipeline: 3–5 days
  • RAG pipeline with quality evaluation: 1–2 weeks
  • Optimization and production: 1–2 weeks
  • Total: 2–5 weeks