Implementing Vector Search for AI Knowledge Base in a Mobile Application
Vector search finds semantically similar documents, not just keyword matches. A query like "how to restore access" finds the article "password reset" even if the word "restore" never appears in it. This is the foundation of any AI search over a knowledge base.
How It Works at Code Level
Each text fragment becomes a vector: an array of numbers (1536 or 3072 values for OpenAI models, 768 for typical local models). Semantically similar texts produce nearby vectors, so search reduces to finding the nearest vectors to the query vector (Approximate Nearest Neighbor, ANN).
Practically, for a mobile app:
- User enters query
- Client sends query to backend
- Backend creates the query embedding via an API (OpenAI, Cohere) or a local model
- Vector DB returns the top-K nearest chunks
- Results are passed to an LLM or returned directly
The whole pipeline through step 4 takes 50–300 ms, which is acceptable for mobile UX.
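The retrieval core (steps 3–4) can be sketched with a brute-force nearest-neighbor search. A real backend calls an embedding API and a vector index instead of scanning in memory, but the ranking logic is the same; all names and vectors below are illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, documents, k=5):
    """Return the k documents whose embeddings are closest to the query.

    documents: list of (doc_id, content, embedding) tuples.
    A vector DB does the same thing with an ANN index instead of a full scan.
    """
    scored = [
        (doc_id, content, cosine_similarity(query_vec, emb))
        for doc_id, content, emb in documents
    ]
    scored.sort(key=lambda t: t[2], reverse=True)
    return scored[:k]

# Toy 3-dimensional embeddings; real ones have hundreds of dimensions.
docs = [
    ("a1", "How to reset your password", [0.9, 0.1, 0.0]),
    ("a2", "Release notes for v2.3",     [0.0, 0.2, 0.9]),
    ("a3", "Restoring account access",   [0.7, 0.4, 0.2]),
]
results = top_k([0.85, 0.2, 0.05], docs, k=2)
```

The password-related articles win even though they share no exact keywords with each other, which is the behavior the pipeline above relies on.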
Vector Indices: What to Choose
pgvector is a PostgreSQL extension. If you already run PostgreSQL, it means zero additional infrastructure. It supports HNSW and IVFFlat indices.
-- HNSW index for fast ANN search
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Search top-5 nearest
SELECT id, content, 1 - (embedding <=> $1) AS similarity
FROM documents
ORDER BY embedding <=> $1
LIMIT 5;
<=> is the cosine distance operator in pgvector. For normalized vectors, cosine distance produces the same ranking as inner product (<#>), but <=> works without normalization.
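The equivalence claim is easy to verify: after L2-normalizing vectors, ranking by inner product matches ranking by cosine distance, because for unit vectors cosine distance is just 1 minus the inner product. A small check on illustrative data:

```python
import math

def normalize(v):
    """Scale a vector to unit length (L2 norm = 1)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def inner(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    # cosine distance = 1 - cosine similarity
    return 1 - inner(a, b) / (math.sqrt(inner(a, a)) * math.sqrt(inner(b, b)))

query = normalize([0.3, 0.8, 0.5])
docs = {name: normalize(v) for name, v in {
    "d1": [0.2, 0.9, 0.4],
    "d2": [0.9, 0.1, 0.3],
    "d3": [0.4, 0.7, 0.6],
}.items()}

# Rank by inner product (descending) and by cosine distance (ascending).
by_inner = sorted(docs, key=lambda n: inner(query, docs[n]), reverse=True)
by_cosine = sorted(docs, key=lambda n: cosine_distance(query, docs[n]))
# For unit vectors the two orderings come out identical.
```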
Index choice:
- IVFFlat: builds fast, uses less memory, slightly lower recall
- HNSW: best recall and fast search, but more memory and a slower build
For databases up to ~1M documents, pgvector with HNSW handles the load fine. Above 10M, consider dedicated engines such as Pinecone, Weaviate, or Qdrant.
Metadata Filtering
Vector search without filters scans the entire index. If you need to search only documents of a specific company, department, or language, add metadata filtering.
SELECT id, content, 1 - (embedding <=> $1) AS similarity
FROM documents
WHERE
language = 'ru'
AND category = 'installation'
AND updated_at > NOW() - INTERVAL '1 year'
ORDER BY embedding <=> $1
LIMIT 10;
Important: with HNSW/IVFFlat, pgvector applies the WHERE filter AFTER the vector index scan, so a highly selective filter (matching <10% of rows) can return far fewer results than LIMIT asks for. Workarounds: build separate indices per subset (e.g. PostgreSQL partial indexes), use partitioned HNSW, or rely on the iterative index scans added in newer pgvector releases (0.8.0+), which keep fetching candidates until the LIMIT is satisfied.
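The post-filtering pitfall can be reproduced in miniature: take the top K candidates by distance first, then apply a selective filter, and you end up with fewer than K hits even though enough matching rows exist. Toy data, distances precomputed for brevity:

```python
# Each row: (id, category, distance_to_query).
rows = [
    ("r1", "installation", 0.10),
    ("r2", "billing",      0.12),
    ("r3", "billing",      0.15),
    ("r4", "billing",      0.18),
    ("r5", "installation", 0.40),
    ("r6", "installation", 0.45),
]

K = 3

# Post-filter (what an HNSW index scan effectively does):
# take the K nearest rows, THEN apply the WHERE condition.
candidates = sorted(rows, key=lambda r: r[2])[:K]
post_filtered = [r for r in candidates if r[1] == "installation"]

# Pre-filter (what you actually want for selective filters):
# restrict to matching rows first, then take the K nearest.
pre_filtered = sorted(
    (r for r in rows if r[1] == "installation"), key=lambda r: r[2]
)[:K]
```

Here post-filtering returns a single row instead of three, which is exactly the symptom you see in production with selective filters.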
Embeddings: Client vs Server
You can generate the query embedding on the client (a local ML model) or on the server. For a mobile app, server-side is preferable: embedding models weigh 80–500 MB, local inference drains device resources, and the API key doesn't end up baked into the APK.
The exception is a fully offline scenario. Then use Core ML on iOS (model conversion via coremltools) or ONNX Runtime on Android. Example: all-MiniLM-L6-v2 in ONNX weighs ~22 MB and produces 384-dimensional vectors, sufficient for corporate documentation search.
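Models in the all-MiniLM family output one vector per token; the sentence embedding is mean pooling over those vectors followed by L2 normalization, a step you implement yourself around the runtime call. A pure-Python sketch of that post-processing (real pipelines also weight the average by the attention mask to skip padding tokens):

```python
import math

def mean_pool_and_normalize(token_vectors):
    """Average per-token embeddings into one sentence vector, then L2-normalize.

    token_vectors: list of equal-length float lists, one per token.
    In a real app these come from the model's output tensor.
    """
    dim = len(token_vectors[0])
    pooled = [sum(tok[i] for tok in token_vectors) / len(token_vectors)
              for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled))
    return [x / norm for x in pooled]

# Two fake 2-dimensional token vectors; the real model emits 384 dimensions.
emb = mean_pool_and_normalize([[1.0, 0.0], [0.0, 1.0]])
```

Normalizing here lets the backend use the cheaper inner-product comparison discussed above.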
Displaying Search Results on Mobile
Each result contains: a text excerpt, document name/section, similarity score, and update date. On mobile, display:
- The score as a visual relevance indicator (dots or a bar, not a raw number: the number means nothing to the user)
- Breadcrumbs of source: "User Guide → Installation → iOS"
- Highlighted matching words (even with semantic search, words often still overlap)
- "Open full document" button
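Mapping the raw score to a dot indicator is worth getting right: cosine similarities cluster in a narrow band, so a linear 0-to-1 mapping wastes most of the visual scale. A sketch where the band boundaries are assumptions you calibrate on your own corpus and embedding model:

```python
def relevance_dots(similarity, lo=0.6, hi=0.95, max_dots=5):
    """Map a cosine similarity to 1..max_dots filled dots for the UI.

    lo/hi are illustrative defaults: typical score ranges depend on the
    embedding model, so calibrate them against your own query logs.
    """
    clamped = max(lo, min(hi, similarity))
    fraction = (clamped - lo) / (hi - lo)
    return 1 + round(fraction * (max_dots - 1))
```

Clamping means out-of-band scores still render sensibly instead of showing zero or six dots.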
Stages and Timeline
Inventory and normalize the knowledge base → choose an embedding model → configure the vector DB and indices → build the ingestion pipeline → search API with filtering → mobile search UI with results → test quality (precision@K, recall@K) → iterate.
Vector search over a corpus of up to 50 thousand documents with pgvector takes 2–4 weeks. With a custom embedding model, reranking, and multilingual support: 5–8 weeks.
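The quality-testing step boils down to two metrics per test query: precision@K (what fraction of the returned K are relevant) and recall@K (what fraction of all relevant documents made it into the top K). A minimal implementation with illustrative data:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that are in the relevant set."""
    top = retrieved[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant ids that appear in the top-k retrieved."""
    top = set(retrieved[:k])
    return len(top & set(relevant)) / len(relevant)

retrieved = ["a", "b", "c", "d", "e"]   # ranked search output for one query
relevant = {"a", "c", "f"}              # ground-truth labels for that query
p = precision_at_k(retrieved, relevant, 5)   # 2 hits out of 5 returned
r = recall_at_k(retrieved, relevant, 5)      # 2 of 3 relevant docs found
```

Average both metrics over a fixed set of labeled test queries and track them across ingestion and model changes; a drop flags a regression before users notice it.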