Which embedding models work with pgvector?

Any model outputting fixed-dimension vectors works, e.g., OpenAI's text-embedding-3-small (1536 dim) or BERT-based models (768 dim).

Which index is recommended for pgvector?

HNSW is recommended for production due to high speed and moderate memory. IVFFlat is an alternative with lower memory but lower accuracy.

How many vectors can pgvector store efficiently?

pgvector works efficiently up to 1–5 million vectors; with HNSW tuning and sufficient RAM, it can handle 10M+. For >50M vectors, specialized databases are preferable.

Which embedding models work with pgvector?

Any model outputting fixed-dimension vectors works, e.g., OpenAI's text-embedding-3-small (1536 dim) or BERT-based models (768 dim).

Which index is recommended for pgvector?

HNSW is recommended for production due to high speed and moderate memory. IVFFlat is an alternative with lower memory but lower accuracy.

How many vectors can pgvector store efficiently?

pgvector works efficiently up to 1–5 million vectors; with HNSW tuning and sufficient RAM, it can handle 10M+. For >50M vectors, specialized databases are preferable.

RAG with pgvector: Vector Search in PostgreSQL

We design and deploy artificial intelligence systems: from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering and MLOps to make AI work not in the lab, but in real business.

8+Years of workmore info 900+Completed projectsmore info 100+In house employeesmore info 19+Partnersmore info

Services we offer

Showing 1 of 1All 1564 services

RAG with pgvector: Vector Search in PostgreSQL

Simple

from 1 week to 3 months

Frequently Asked Questions

AI Development Areas

Discuss your AI project

Free consultation — we'll show you how AI can solve your challenge

Get a quote

We'll estimate the budget and timeline for your AI project

AI Solution Development Stages

Latest works

B2B ADVANCE company website development
1349
Development of a web application for FEEDME
1247
Website development for BELFINGROUP
949
Development of an online store for the company FURNORO
1183
B2B Advance company logo design
642
Development of a web application for Enviok
921

Show more works

You're working with PostgreSQL, and suddenly you need semantic search over documents. Spinning up a separate vector DB means 3–5 extra days of setup, another service, new APIs, monitoring. pgvector is a PostgreSQL extension that adds the vector type and cosine distance operations right into your familiar database. pgvector brings vector search to PostgreSQL, making RAG pipelines simpler and cost-effective. We are a team with 10+ years in AI/ML and certified PostgreSQL engineers. Over recent years, we've deployed RAG with pgvector for 20+ projects, from startups to enterprise. Using pgvector can save you up to $6000 per year compared to standalone vector databases for 5M vectors. We guarantee stable performance under loads up to 10M vectors. Get a free consultation—we'll assess your project in one day.

How much does pgvector cost?

pgvector vs. standalone vector DB: cost and simplicity

If your data is already in PostgreSQL, adding pgvector requires no new component. Compare with Pinecone:

Parameter	pgvector	Pinecone
Setup time	1–2 days	3–5 days
Extra infrastructure	None	Separate DB required
Latency p99 (1M vectors)	5–15 ms	5–10 ms
Data volume	Up to 10M vectors (with HNSW)	Up to billions
SQL support	Yes	No
Monthly cost for 1M vectors	$0 (included)	$200–$500

The trade-off between recall and latency can be managed by adjusting the HNSW index parameters such as ef_construction and m, which directly influence the k-NN graph quality. pgvector is better for moderate volumes (up to 5M vectors) and when you'd rather not add a new service. For scales >50M vectors or ultra-low latency (p99<2ms), Pinecone may be justified, but for 80% of RAG projects, pgvector is the optimal choice. pgvector confirms the extension supports all necessary operations for semantic search. In terms of cost-effectiveness, pgvector is up to 5x cheaper than Pinecone for datasets under 5M vectors when factoring in infrastructure and operational overhead.

Choosing an embedding model for pgvector

pgvector works with any model that returns a fixed-dimension vector. Most common are text-embedding-3-small from OpenAI (1536 dim), BERT-based models (768 dim), or open-source models from Sentence Transformers. Vector dimension affects performance: a 768-dim vector uses half the memory of a 1536-dim one but may have lower accuracy. For most RAG projects, we recommend text-embedding-3-small: a balance of quality and speed.

Troubleshooting pgvector performance

If search takes more than 20 ms, check:

Is HNSW index used? IVFFlat is slower and less accurate.
Limit candidates with the ef_search parameter (default 40, can be lowered to 20).
Increase work_mem for sorting results.
Check if you're filtering on a non-indexed column—that slows the query.

With proper tuning, pgvector delivers stable 5–15 ms on 1M vectors. For advanced optimization, consider adjusting index hyperparameters like m and ef_construction to improve recall@k.

Setting up a RAG pipeline with pgvector

Step 1: Install pgvector

-- Install extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Table for documents
CREATE TABLE document_chunks (
    id BIGSERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    source VARCHAR(512),
    doc_type VARCHAR(64),
    page_number INTEGER DEFAULT 0,
    metadata JSONB,
    embedding vector(1536),  -- dimension = embedding model
    created_at TIMESTAMP DEFAULT NOW()
);

-- HNSW index for fast search
CREATE INDEX ON document_chunks USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

Step 2: Indexing via Python

import psycopg2
from openai import OpenAI
import json

conn = psycopg2.connect("postgresql://user:pass@localhost:5432/ragdb")
openai_client = OpenAI()

def index_chunk(text: str, source: str, doc_type: str, metadata: dict):
    # Get embedding
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    embedding = response.data[0].embedding

    with conn.cursor() as cur:
        cur.execute("""
            INSERT INTO document_chunks (content, source, doc_type, metadata, embedding)
            VALUES (%s, %s, %s, %s, %s)
        """, (text, source, doc_type, json.dumps(metadata), embedding))
    conn.commit()

Step 3: Vector search with filtering

def search_similar(query: str, doc_type: str = None, limit: int = 5) -> list:
    query_embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=query,
    ).data[0].embedding

    sql = """
        SELECT content, source, doc_type, metadata,
               1 - (embedding <=> %s::vector) AS similarity
        FROM document_chunks
        WHERE ($2::text IS NULL OR doc_type = $2)
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """

    with conn.cursor() as cur:
        cur.execute(sql, (query_embedding, doc_type, query_embedding, limit))
        results = cur.fetchall()

    return [
        {"text": r[0], "source": r[1], "similarity": r[4]}
        for r in results
    ]

pgvector operators:

Operator	Function	Typical Use
`<=>`	Cosine distance	Semantic search (RAG)
`<->`	Euclidean distance	L2 norm search
`<#>`	Negative dot product	For models with normalized vectors

Performance tuning tips for pgvector

For HNSW index, choose m=16–32 and ef_construction=64–200. Higher ef_construction increases accuracy but takes longer to build.
Ensure the index fits in shared_buffers. For 1M vectors of dimension 1536 with HNSW (m=32), you need about 1.5 GB RAM.
Use parallel query: PostgreSQL automatically parallelizes search for large tables.
Monitor cache hit ratio—if below 99%, increase shared_buffers.

What's included in our work

Analysis: assess data volume, choose embedding model, design schema.
pgvector setup: install extension, create indexes (HNSW/IVFFlat), tune PostgreSQL parameters for high load.
Ingestion pipeline: Python scripts for document chunking, embedding generation, and writing to the table.
RAG pipeline: implement search, ranking, and prompt construction for LLM.
Testing: measure latency (p99), accuracy (Recall@k), stress tests.
Documentation: architecture description, operational manual, restoration dump.
Support: 2 weeks of post-deployment support—we help with adjustments for your scenarios.

Estimated timelines

Stage	Duration
pgvector setup + table	1 day
Ingestion pipeline	2–4 days
RAG pipeline	3–5 days
Testing and refinement	2–3 days
Total	1–2 weeks

The cost is calculated individually—depends on data volume and integration complexity. Order RAG implementation with pgvector—we'll help design a solution tailored to your data scale.

LLM Development: Fine-Tuning, RAG, Agents, and Production Deployment

Using GPT‑4 or Claude 3.5 Sonnet through a public API is not a solution — it's just a tool. When the requirement is to "make it like ChatGPT, but on our data," there is a real engineering challenge behind it: from prompt engineering to training a 70B model on your own infrastructure. End-to-end LLM solution development is a complex stack, and we have been doing it for over 5 years. During this time, we have completed over 20 projects in generative AI: from RAG systems for legal departments to custom support agents. Where exactly your task falls depends on data, latency requirements, budget, and how critical confidentiality is.

A typical situation: the client has already tried ChatGPT, but results are unstable — sometimes accurate, sometimes hallucinating. Or they need integration into a corporate portal while complying with security policies. Let's break down each layer of the stack in detail — from RAG to production deployment.

Why Do RAG Systems Break and How to Fix It?

RAG (Retrieval-Augmented Generation) looks simple: find relevant documents, put them in context, get an answer. In practice, it fails in several places.

Chunking without overlap. Classic mistake: chunk_size=512, overlap=0. If the answer lies across two chunks, retrieval won't find either with sufficient confidence. Solution: overlap 15–25% of chunk_size, or better yet, sentence-aware splitting with spaCy or NLTK instead of naive character splitting.

Poor embedder. text-embedding-ada-002 is good for general use, but on legal or medical texts, specialized models like E5-large-v2, BGE-M3, or fine-tuned sentence-transformers on domain data outperform it. Recall@5 differences can be 15–25%.

No re-ranking. Vector search optimizes for speed, not relevance. A cross-encoder re-ranker (ms-marco-MiniLM-L-6-v2, bge-reranker-large) after initial retrieval improves top-3 accuracy with acceptable latency (+50–150ms). This is often more impactful than improving the embedding model.

Hybrid search. Dense vectors alone work poorly on exact queries: names, SKUs, codes. BM25 (sparse) finds exact matches but misses semantics. Hybrid via RRF (Reciprocal Rank Fusion) is the optimal compromise. Qdrant, Weaviate, and pgvector 0.7+ support hybrid search natively.

Typical production architecture for a corporate knowledge base

Documents → preprocessing (PyMuPDF, Unstructured)
Chunking → embedding (BGE-M3)
Qdrant (hybrid dense+sparse)
Cross-encoder re-ranking
Context → LLM (vLLM or OpenAI API)
Answer with sources (RAGAS for quality evaluation)

When to Fine-Tune Instead of Prompt Engineering?

Prompt engineering solves ~70% of LLM adaptation tasks for a domain. The remaining 30% require fine-tuning. Three indicators: the model ignores a specific output format even with detailed prompting; the task requires deep knowledge of specialized vocabulary (medicine, law); you need to significantly reduce token costs by replacing a large model with a smaller specialized one.

LoRA and QLoRA are the standard for SFT. LoRA adds trainable low-rank matrices to attention layers. A typical configuration for Llama-3 8B: r=64, lora_alpha=128, target_modules=["q_proj","v_proj","k_proj","o_proj"] yields ~0.8% trainable parameters, training on one A100 40GB. QLoRA adds 4-bit quantization (NF4) and allows fine-tuning 70B models on two A100 40GB, though speed drops by half compared to bf16.

DPO instead of RLHF. Direct Preference Optimization requires only (chosen, rejected) pairs, not scalar reward signals. DPOTrainer from the trl library (Hugging Face) implements it in a few dozen lines.

Common mistake. A dataset of 500 examples, 5 epochs, validation loss 0.8 — seems fine. But on test, the model degrades on general instructions. Cause: catastrophic forgetting. Solution: add 10–20% general instruction-following examples (Alpaca, FLAN) to the training set to preserve original capabilities.

How to Choose a Base Model: 8B or 70B?

Model	Parameters	Strengths	Context
Llama-3.1 8B	8B	Quality/speed balance	128k
Llama-3.1 70B	70B	Complex reasoning	128k
Mistral 7B / Mixtral 8x7B	7B / 47B	Efficiency for size	32k
Qwen2.5 72B	72B	Code, multilingual	128k
Gemma 2 27B	27B	Open license	8k

For most tasks, fine-tuning an 8B model is sufficient. 70B is needed when deep reasoning is required or the 8B baseline does not reach the required quality even after fine-tuning. Inference cost for Llama-3 8B via vLLM on A100 is efficient; the exact cost depends on volume.

What Does PagedAttention Bring to Production?

vLLM is the first choice for serving open-source models. PagedAttention is the key technical innovation: KV-cache is managed like virtual memory in an OS, without fragmentation. This yields 2–4x higher throughput compared to naive HuggingFace Transformers inference. The vLLM documentation confirms that continuous batching and PagedAttention are the standard for high-load LLM services.

Typical numbers on A100 80GB for Llama-3 8B (bf16): 400–600 req/s, P50 latency 200–400ms, P99 latency 600–900ms at concurrency 64. For 70B on two A100 with tensor parallelism: 80–120 req/s, P99 latency 1.5–2.5s. AWQ or GPTQ quantization reduces memory consumption by 2x with quality loss within 1–3%.

Multi-Agent Systems

Agents are LLMs with access to tools: search, code execution, API calls, database interaction. Common patterns:

ReAct (Reason + Act): the model reasons → chooses a tool → observes the result → reasons again. LangChain and LlamaIndex implement it out of the box.
Multi-agent orchestration: multiple specialized agents with a coordinator on top. Example: coordinator → researcher (search + summarization) → coder (code generation and execution) → critic (verification). Tools: AutoGen (Microsoft), CrewAI, custom implementation on LangGraph.

In production, agent systems are non-deterministic. Essential: guardrails, step limits, logging of each step, human-in-the-loop for critical actions.

How We Work: Stages, Timeline, Deliverables

Stage	Duration	What You Get
Audit and data collection	1–2 weeks	Eval dataset of 100+ examples, task formalization
Baseline (prompt + RAG)	1–2 weeks	Working prototype, quality metrics
Fine-tuning (if needed)	2–4 weeks	Trained model, LoRA weights, model card
Deployment and monitoring	1–2 weeks	vLLM server, Grafana + Prometheus
Documentation and training	1 week	API documentation, team training

What Is Included

We deliver:

Technical documentation (model card, configs, deployment instructions)
Access to infrastructure (code repository, trained weights)
1 month of post-deployment support (consultations, bug fixes)
Customer team training (2–3 sessions on system operation)

Timeline: basic RAG prototype — 1–2 weeks. Fine-tuning with customer data — 3–6 weeks (including data preparation). Production system with monitoring and retraining — 2–4 months. Cost is calculated individually based on data volume, model complexity, and infrastructure requirements.

We guarantee the quality of the final model with performance benchmarks and ongoing monitoring. Our engineers have hands‑on experience with dozens of production LLM systems.

Want to evaluate your project? Leave a request — we will prepare a preliminary summary within 1–2 business days. Or get a consultation on choosing the approach: RAG, fine-tuning, or hybrid — we will tell you what works best for you. Contact us to discuss your LLM development needs. Schedule a free consultation today.