Implementing Vector Search for AI Knowledge Base in a Mobile Application
Vector search finds semantically similar documents, not just keyword matches. A query like "how to restore access" finds the article "password reset" even if the word "restore" never appears in it. This is the foundation of any AI search over a knowledge base.
How It Works at Code Level
Each text fragment becomes a vector: an array of numbers (1536 or 3072 values for OpenAI models, 768 for typical local models). Semantically similar texts produce nearby vectors, so search reduces to finding the nearest vectors to the query vector (Approximate Nearest Neighbor, ANN).
Practically, for a mobile app:
- User enters query
- Client sends query to backend
- Backend creates the query embedding via an API (OpenAI, Cohere) or a local model
- Vector DB returns the top-K nearest chunks
- Results are passed to an LLM or returned directly
The whole pipeline through step 4 takes 50–300 ms, which is acceptable for mobile UX.
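The retrieval core (steps 3–4) can be sketched with a brute-force nearest-neighbor search. A real backend calls an embedding API and a vector index instead of scanning in memory, but the ranking logic is the same; all names and vectors below are illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, documents, k=5):
    """Return the k documents whose embeddings are closest to the query.

    documents: list of (doc_id, content, embedding) tuples.
    A vector DB does the same thing with an ANN index instead of a full scan.
    """
    scored = [
        (doc_id, content, cosine_similarity(query_vec, emb))
        for doc_id, content, emb in documents
    ]
    scored.sort(key=lambda t: t[2], reverse=True)
    return scored[:k]

# Toy 3-dimensional embeddings; real ones have hundreds of dimensions.
docs = [
    ("a1", "How to reset your password", [0.9, 0.1, 0.0]),
    ("a2", "Release notes for v2.3",     [0.0, 0.2, 0.9]),
    ("a3", "Restoring account access",   [0.7, 0.4, 0.2]),
]
results = top_k([0.85, 0.2, 0.05], docs, k=2)
```

The password-related articles win even though they share no exact keywords with each other, which is the behavior the pipeline above relies on.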
Vector Indices: What to Choose
pgvector is a PostgreSQL extension. If you already run PostgreSQL, it means zero additional infrastructure. It supports HNSW and IVFFlat indices.
-- HNSW index for fast ANN search
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Search top-5 nearest
SELECT id, content, 1 - (embedding <=> $1) AS similarity
FROM documents
ORDER BY embedding <=> $1
LIMIT 5;
<=> is the cosine distance operator in pgvector. For normalized vectors, cosine distance produces the same ranking as inner product (<#>), but <=> works without normalization.
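The equivalence claim is easy to verify: after L2-normalizing vectors, ranking by inner product matches ranking by cosine distance, because for unit vectors cosine distance is just 1 minus the inner product. A small check on illustrative data:

```python
import math

def normalize(v):
    """Scale a vector to unit length (L2 norm = 1)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def inner(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    # cosine distance = 1 - cosine similarity
    return 1 - inner(a, b) / (math.sqrt(inner(a, a)) * math.sqrt(inner(b, b)))

query = normalize([0.3, 0.8, 0.5])
docs = {name: normalize(v) for name, v in {
    "d1": [0.2, 0.9, 0.4],
    "d2": [0.9, 0.1, 0.3],
    "d3": [0.4, 0.7, 0.6],
}.items()}

# Rank by inner product (descending) and by cosine distance (ascending).
by_inner = sorted(docs, key=lambda n: inner(query, docs[n]), reverse=True)
by_cosine = sorted(docs, key=lambda n: cosine_distance(query, docs[n]))
# For unit vectors the two orderings come out identical.
```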
Index choice:
- IVFFlat: builds fast, uses less memory, slightly lower recall
- HNSW: best recall and fast search, but more memory and a slower build
For databases up to ~1M documents, pgvector with HNSW handles the load fine. Above 10M, consider dedicated engines such as Pinecone, Weaviate, or Qdrant.
Metadata Filtering
Vector search without filters scans the entire index. If you need to search only documents of a specific company, department, or language, add metadata filtering.
SELECT id, content, 1 - (embedding <=> $1) AS similarity
FROM documents
WHERE
language = 'ru'
AND category = 'installation'
AND updated_at > NOW() - INTERVAL '1 year'
ORDER BY embedding <=> $1
LIMIT 10;
Important: with HNSW/IVFFlat, pgvector applies the WHERE filter AFTER the vector index scan, so a highly selective filter (matching <10% of rows) can return far fewer results than LIMIT asks for. Workarounds: build separate indices per subset (e.g. PostgreSQL partial indexes), use partitioned HNSW, or rely on the iterative index scans added in newer pgvector releases (0.8.0+), which keep fetching candidates until the LIMIT is satisfied.
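The post-filtering pitfall can be reproduced in miniature: take the top K candidates by distance first, then apply a selective filter, and you end up with fewer than K hits even though enough matching rows exist. Toy data, distances precomputed for brevity:

```python
# Each row: (id, category, distance_to_query).
rows = [
    ("r1", "installation", 0.10),
    ("r2", "billing",      0.12),
    ("r3", "billing",      0.15),
    ("r4", "billing",      0.18),
    ("r5", "installation", 0.40),
    ("r6", "installation", 0.45),
]

K = 3

# Post-filter (what an HNSW index scan effectively does):
# take the K nearest rows, THEN apply the WHERE condition.
candidates = sorted(rows, key=lambda r: r[2])[:K]
post_filtered = [r for r in candidates if r[1] == "installation"]

# Pre-filter (what you actually want for selective filters):
# restrict to matching rows first, then take the K nearest.
pre_filtered = sorted(
    (r for r in rows if r[1] == "installation"), key=lambda r: r[2]
)[:K]
```

Here post-filtering returns a single row instead of three, which is exactly the symptom you see in production with selective filters.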
Embeddings: Client vs Server
You can generate the query embedding on the client (a local ML model) or on the server. For a mobile app, server-side is preferable: embedding models weigh 80–500 MB, local inference drains device resources, and the API key doesn't end up baked into the APK.
The exception is a fully offline scenario. Then use Core ML on iOS (model conversion via coremltools) or ONNX Runtime on Android. Example: all-MiniLM-L6-v2 in ONNX weighs ~22 MB and produces 384-dimensional vectors, sufficient for corporate documentation search.
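Models in the all-MiniLM family output one vector per token; the sentence embedding is mean pooling over those vectors followed by L2 normalization, a step you implement yourself around the runtime call. A pure-Python sketch of that post-processing (real pipelines also weight the average by the attention mask to skip padding tokens):

```python
import math

def mean_pool_and_normalize(token_vectors):
    """Average per-token embeddings into one sentence vector, then L2-normalize.

    token_vectors: list of equal-length float lists, one per token.
    In a real app these come from the model's output tensor.
    """
    dim = len(token_vectors[0])
    pooled = [sum(tok[i] for tok in token_vectors) / len(token_vectors)
              for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled))
    return [x / norm for x in pooled]

# Two fake 2-dimensional token vectors; the real model emits 384 dimensions.
emb = mean_pool_and_normalize([[1.0, 0.0], [0.0, 1.0]])
```

Normalizing here lets the backend use the cheaper inner-product comparison discussed above.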
Displaying Search Results on Mobile
Each result contains: a text excerpt, document name/section, similarity score, and update date. On mobile, display:
- The score as a visual relevance indicator (dots or a bar, not a raw number: the number means nothing to the user)
- Breadcrumbs of source: "User Guide → Installation → iOS"
- Highlighted matching words (even with semantic search, words often still overlap)
- "Open full document" button
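Mapping the raw score to a dot indicator is worth getting right: cosine similarities cluster in a narrow band, so a linear 0-to-1 mapping wastes most of the visual scale. A sketch where the band boundaries are assumptions you calibrate on your own corpus and embedding model:

```python
def relevance_dots(similarity, lo=0.6, hi=0.95, max_dots=5):
    """Map a cosine similarity to 1..max_dots filled dots for the UI.

    lo/hi are illustrative defaults: typical score ranges depend on the
    embedding model, so calibrate them against your own query logs.
    """
    clamped = max(lo, min(hi, similarity))
    fraction = (clamped - lo) / (hi - lo)
    return 1 + round(fraction * (max_dots - 1))
```

Clamping means out-of-band scores still render sensibly instead of showing zero or six dots.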
Stages and Timeline
Inventory and normalize the knowledge base → choose an embedding model → configure the vector DB and indices → build the ingestion pipeline → search API with filtering → mobile search UI with results → test quality (precision@K, recall@K) → iterate.
Vector search over a corpus of up to 50 thousand documents with pgvector takes 2–4 weeks. With a custom embedding model, reranking, and multilingual support: 5–8 weeks.
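The quality-testing step boils down to two metrics per test query: precision@K (what fraction of the returned K are relevant) and recall@K (what fraction of all relevant documents made it into the top K). A minimal implementation with illustrative data:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that are in the relevant set."""
    top = retrieved[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant ids that appear in the top-k retrieved."""
    top = set(retrieved[:k])
    return len(top & set(relevant)) / len(relevant)

retrieved = ["a", "b", "c", "d", "e"]   # ranked search output for one query
relevant = {"a", "c", "f"}              # ground-truth labels for that query
p = precision_at_k(retrieved, relevant, 5)   # 2 hits out of 5 returned
r = recall_at_k(retrieved, relevant, 5)      # 2 of 3 relevant docs found
```

Average both metrics over a fixed set of labeled test queries and track them across ingestion and model changes; a drop flags a regression before users notice it.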