RAG Development with Pinecone Vector Database
Pinecone is a managed vector database with a REST/gRPC API, automatic scaling, and hybrid search support (sparse + dense). It requires no infrastructure management and scales from prototype to millions of vectors. Pinecone Serverless (generally available since 2024) removes the need to pre-provision capacity: you pay only for the reads, writes, and storage you actually use.
Initialization and Index Creation
```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="...")

# Create a serverless index (guarded so re-running the script is safe)
if not pc.has_index("corporate-knowledge-base"):
    pc.create_index(
        name="corporate-knowledge-base",
        dimension=3072,  # must match the embedding model (text-embedding-3-large)
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )

index = pc.Index("corporate-knowledge-base")
```
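A frequent setup bug is a mismatch between the index dimension and the embedding model's output size, which only surfaces as an upsert error later. A minimal guard is sketched below; the model-to-dimension map covers only the two OpenAI models used in this guide, and the helper name is illustrative:

```python
# Output dimensions of the OpenAI embedding models used in this guide
EMBEDDING_DIMS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
}

def check_dimension(model: str, index_dimension: int) -> bool:
    """Return True if the index dimension matches the model's output size."""
    expected = EMBEDDING_DIMS.get(model)
    if expected is None:
        raise ValueError(f"Unknown embedding model: {model}")
    return expected == index_dimension

print(check_dimension("text-embedding-3-large", 3072))  # True
```

Running this check at startup fails fast instead of producing cryptic dimension errors on the first upsert.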
Document Indexing with Metadata
```python
import hashlib

from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

embeddings_model = OpenAIEmbeddings(model="text-embedding-3-large")

def index_documents(documents: list, batch_size: int = 100):
    splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
    chunks = splitter.split_documents(documents)

    # Batch indexing: embed and upsert in groups to respect API limits
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        texts = [c.page_content for c in batch]
        vectors = embeddings_model.embed_documents(texts)

        # Prepare records for Pinecone
        records = []
        for chunk, vector in zip(batch, vectors):
            doc_id = hashlib.md5(chunk.page_content.encode()).hexdigest()
            records.append({
                "id": doc_id,
                "values": vector,
                "metadata": {
                    "text": chunk.page_content,
                    "source": chunk.metadata.get("source", ""),
                    "page": chunk.metadata.get("page", 0),
                    "doc_type": chunk.metadata.get("doc_type", "general"),
                    "date": chunk.metadata.get("date", ""),
                },
            })

        index.upsert(vectors=records)
        print(f"Indexed batch {i // batch_size + 1}: {len(records)} chunks")
```
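One subtlety with the ID scheme above: hashing only the chunk text means two identical chunks from different documents get the same ID and silently overwrite each other on upsert. If per-document provenance matters, a collision-safer ID can include the source and page (the helper name is illustrative):

```python
import hashlib

def make_chunk_id(source: str, page: int, text: str) -> str:
    """Deterministic chunk ID that includes provenance, so identical
    text from different sources or pages does not collide on upsert."""
    key = f"{source}|{page}|{text}".encode("utf-8")
    return hashlib.md5(key).hexdigest()

# Same text, different sources -> different IDs
a = make_chunk_id("faq.md", 1, "Return policy: 30 days.")
b = make_chunk_id("policy.pdf", 12, "Return policy: 30 days.")
print(a != b)  # True
```

Determinism is still preserved, so re-indexing the same document updates records in place rather than duplicating them.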
Query with Metadata Filtering
```python
def rag_query(
    query: str,
    doc_type_filter: str | None = None,
    top_k: int = 5,
) -> list[dict]:
    # Embed the query with the same model used at indexing time
    query_vector = embeddings_model.embed_query(query)

    # Build an optional metadata filter
    filter_dict = {}
    if doc_type_filter:
        filter_dict["doc_type"] = {"$eq": doc_type_filter}

    # Vector search with metadata filtering
    results = index.query(
        vector=query_vector,
        top_k=top_k,
        include_metadata=True,
        filter=filter_dict if filter_dict else None,
    )

    # Assemble context chunks for the LLM prompt
    context_chunks = []
    for match in results["matches"]:
        context_chunks.append({
            "text": match["metadata"]["text"],
            "source": match["metadata"]["source"],
            "score": match["score"],
        })
    return context_chunks
```
Hybrid Search in Pinecone
Pinecone supports hybrid search (dense + sparse) through sparse-dense vectors; the sparse side is produced client-side with the BM25 encoder from the companion pinecone-text package, not by the database itself. Sparse-dense queries also require an index created with metric="dotproduct":
```python
from pinecone_text.sparse import BM25Encoder

# Fit BM25 term statistics on the document corpus
bm25 = BM25Encoder()
bm25.fit(all_texts)

def hybrid_query(query: str, alpha: float = 0.5, top_k: int = 5) -> list:
    """
    alpha=1.0: dense only
    alpha=0.0: sparse (BM25) only
    alpha=0.5: equal weight to both
    """
    # Dense vector
    dense_vector = embeddings_model.embed_query(query)
    # Sparse vector (BM25)
    sparse_vector = bm25.encode_queries(query)

    # Pinecone has no server-side alpha parameter: the convex combination
    # is applied client-side by scaling both vectors before the query
    dense_scaled = [v * alpha for v in dense_vector]
    sparse_scaled = {
        "indices": sparse_vector["indices"],
        "values": [v * (1 - alpha) for v in sparse_vector["values"]],
    }

    results = index.query(
        vector=dense_scaled,
        sparse_vector=sparse_scaled,
        top_k=top_k,
        include_metadata=True,
    )
    return results["matches"]
```
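Because dot-product scores are linear in each vector, scaling the dense vector by alpha and the sparse values by (1 - alpha) before the query yields exactly the weighted score alpha * dense_score + (1 - alpha) * sparse_score. This mirrors the hybrid_score_norm helper shown in Pinecone's hybrid search docs, and the weighting itself is pure and easy to verify:

```python
def hybrid_score_norm(dense: list, sparse: dict, alpha: float):
    """Convex combination of dense and sparse query vectors:
    dense is scaled by alpha, sparse values by (1 - alpha)."""
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must be in [0, 1]")
    scaled_dense = [v * alpha for v in dense]
    scaled_sparse = {
        "indices": sparse["indices"],
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }
    return scaled_dense, scaled_sparse

d, s = hybrid_score_norm([1.0, 0.5], {"indices": [3, 7], "values": [2.0, 4.0]}, alpha=0.75)
print(d)             # [0.75, 0.375]
print(s["values"])   # [0.5, 1.0]
```

At alpha=1.0 the sparse values collapse to zero (dense-only search) and at alpha=0.0 the dense vector vanishes, matching the docstring in hybrid_query.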
Practical Case: Retail Corporate Knowledge Base
Scale: 45,000 SKUs with descriptions, 3,200 pages of regulations, 800 FAQ entries. Total ~180,000 vectors.
Configuration: Pinecone Serverless (aws/us-east-1), dimension=1536 (text-embedding-3-small for savings), metric=cosine.
Usage pattern: 15,000 queries/day, peak load 200 RPS during sales hours.
Results:
- Retrieval latency P95: 180ms
- Full RAG answer latency P95: 2.1s (including GPT-4o-mini)
- Pinecone cost: ~$80/month (Serverless)
- Context recall (found needed document): 0.87
- Answer accuracy (LLM-judge): 0.83
Optimizations:
- Namespace separation: products, regulations, and FAQ live in separate namespaces, so each query is scoped up front without filter overhead
- Metadata-only lookups: for some queries (e.g., exact SKU lookups) a metadata-filtered lookup is enough and the vector-search step is skipped
- Cache popular queries: Redis cache for the top 500 frequent questions (~30% hit rate)
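The query-cache optimization above can be sketched as follows; an in-process dict stands in for Redis, and the class name, normalization, and eviction policy are illustrative simplifications (a production setup would use Redis with a TTL):

```python
import hashlib

class QueryCache:
    """Tiny exact-match cache for frequent RAG queries."""

    def __init__(self, max_size: int = 500):
        self.max_size = max_size
        self._store: dict[str, list] = {}
        self.hits = 0
        self.misses = 0

    def _key(self, query: str) -> str:
        # Normalize so trivial variations of the same question share a key
        return hashlib.md5(query.strip().lower().encode()).hexdigest()

    def get_or_compute(self, query: str, compute):
        key = self._key(query)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = compute(query)
        if len(self._store) < self.max_size:
            self._store[key] = result
        return result

cache = QueryCache()
answer = cache.get_or_compute("What is the return policy?", lambda q: ["chunk-1"])
again = cache.get_or_compute("what is the return policy? ", lambda q: ["chunk-1"])
print(cache.hits, cache.misses)  # 1 1
```

Exact-match caching only helps when the same question recurs verbatim (after normalization); the ~30% hit rate in the case study reflects how concentrated retail FAQ traffic is on a small set of questions.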
Timeline
- Pinecone setup + ingestion pipeline: 3–5 days
- RAG pipeline with quality evaluation: 1–2 weeks
- Optimization and production: 1–2 weeks
- Total: 2–5 weeks