How does CLIP-based photo search work?

CLIP (Contrastive Language-Image Pre-training) from OpenAI maps images and text descriptions into a shared vector space. For search, we compute a vector for the text query and compare it with vectors of all photos using cosine similarity.

On which devices can on-device AI search run?

The MobileCLIP-S0 model (18 MB) runs on iPhone 14 and newer, with inference taking 3–5 ms per image. On Android, we use ONNX Runtime. On older devices, a simple flat index that brute-forces up to 10k photos is feasible.

How long does gallery indexing take?

Initial indexing of 10,000 photos takes about 40 seconds. We run it in the background via BGProcessingTask with progress saving, so it can resume from the last checkpoint.

Is Russian language supported?

Yes, but CLIP is primarily trained on English. To improve Russian search quality, we translate the query using a dictionary of common words or Google Translate API before embedding.

What is the search accuracy when using an ANN index?

ANN indexes like sqlite-vss deliver about 95% recall relative to exact search, with each search taking up to 15 ms. For scenarios requiring maximum precision, we can fall back to full brute-force cosine similarity via Accelerate.

Implement On-Device Semantic Image Search with CLIP

TRUETECH is engaged in the development, support and maintenance of iOS, Android, PWA mobile applications. We have extensive experience and expertise in publishing mobile applications in popular markets like Google Play, App Store, Amazon, AppGallery and others.

Imagine a user types "dog on the beach" and the app instantly shows all photos of a dog on the beach—no tags, no manual sorting. That's semantic image search powered by CLIP (Contrastive Language-Image Pre-training) from OpenAI. We implement it entirely on-device: images and texts are converted into 512-dimensional vectors, and cosine similarity between them determines relevance. No data leaves the device—user privacy is preserved. Our experience with MobileCLIP-S0 and vector databases allows us to deploy such a solution in 1–4 weeks.

Evaluate the potential for your app. Savings compared to cloud-based search reach 80% because there are no inference costs. Local inference also removes the network round trip: 15–20 ms per search versus 200–500 ms for a cloud call. The payback period for on-device AI typically ranges from 6 to 12 months. Typical project cost: from $5,000 for basic integration to $15,000 for a full solution. Get a consultation to discuss details.

Architecture and Benefits of On-Device AI

The pipeline consists of two independent stages. Indexing runs once for the whole gallery and then incrementally: for each photo, we compute a CLIP image embedding (a 512-dimensional vector) with the model's image encoder (MobileCLIP-S0 in our case), normalize it to unit L2 norm, and store it in a local vector database. Search runs on every user query: we convert the query text into a CLIP text embedding in the same 512-dimensional space, run an ANN search (e.g., HNSW) for the nearest vectors in the database, and return photos sorted by descending cosine similarity.

Comparison	On-device	Cloud
Latency	15–20 ms	200–500 ms
Cost per 10k queries	$0	$100–$500
Privacy	Data on device	Data on server

On-device AI has clear advantages over cloud: no network latency, zero inference cost, and full user privacy. For an app with 10,000 daily active users, cloud inference can exceed $5,000 per month, while on-device inference adds no per-query cost. Overall savings compared to cloud solutions reach up to 80%, which shortens the payback period.
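The arithmetic behind the monthly figure can be sketched as follows. This is a minimal illustration, not a pricing model: the helper name `monthlyCloudCost` and the assumption of one search per user per day are ours, while the price range comes from the comparison table above.

```swift
// Hypothetical helper: monthly cloud bill for a given query volume.
// All inputs are illustrative assumptions, not measured values.
func monthlyCloudCost(dailyActiveUsers: Int,
                      queriesPerUserPerDay: Double,
                      pricePer10kQueries: Double,
                      daysPerMonth: Int = 30) -> Double {
    let monthlyQueries = Double(dailyActiveUsers) * queriesPerUserPerDay * Double(daysPerMonth)
    return monthlyQueries / 10_000 * pricePer10kQueries
}

// 10,000 DAU, one search per user per day (assumption), price range from the table.
let lowEstimate = monthlyCloudCost(dailyActiveUsers: 10_000, queriesPerUserPerDay: 1, pricePer10kQueries: 100)
let highEstimate = monthlyCloudCost(dailyActiveUsers: 10_000, queriesPerUserPerDay: 1, pricePer10kQueries: 500)
print(lowEstimate, highEstimate)
```

Under these assumptions the estimate spans $3,000 to $15,000 per month, so the $5,000 figure falls inside the range; heavier usage per user scales the bill linearly.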

Integrating CLIP via CoreML

Apple hasn't included CLIP in the standard Vision framework, but Apple ML Research released ml-mobileclip—a distilled version specifically for mobile devices. MobileCLIP-S0: 18 MB, 3–5 ms per image inference on iPhone 14. Source: Apple ML Research, MobileCLIP.

MobileCLIP on GitHub (official repository with code).

CLIP is trained with a contrastive objective that aligns image and text embeddings in a shared multimodal space: matching image-text pairs are pulled together and non-matching pairs are pushed apart. This alignment is what enables zero-shot classification and retrieval.
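For reference, the symmetric contrastive loss commonly used for CLIP-style training can be written as below. The notation is ours: u_i and v_j are the image and text embeddings of a batch of N pairs, and τ is a learned temperature.

```latex
s_{ij} = \frac{u_i^{\top} v_j}{\lVert u_i \rVert \, \lVert v_j \rVert}, \qquad
\mathcal{L} = -\frac{1}{2N} \sum_{i=1}^{N} \left[
  \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N} \exp(s_{ij}/\tau)}
  + \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N} \exp(s_{ji}/\tau)}
\right]
```

The first term matches each image against all texts in the batch, the second matches each text against all images. Because s_ij is a cosine similarity, the same quantity can be used directly as the relevance score at search time.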

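import CoreML
import CoreGraphics

// MobileCLIPImageEncoder / MobileCLIPTextEncoder are the classes Xcode generates from the
// converted models; their input and output names depend on how the models were exported.
final class MobileCLIPEmbedder {
    private let imageEncoder: MobileCLIPImageEncoder
    private let textEncoder: MobileCLIPTextEncoder

    init(imageEncoder: MobileCLIPImageEncoder, textEncoder: MobileCLIPTextEncoder) {
        self.imageEncoder = imageEncoder
        self.textEncoder = textEncoder
    }

    func embedImage(_ cgImage: CGImage) throws -> [Float] {
        // resize(_:to:) and multiArray(from:) are app-side helpers (not shown), not CoreML API
        let resized = resize(cgImage, to: CGSize(width: 256, height: 256))
        let input = MobileCLIPImageInput(image: try multiArray(from: resized))
        let output = try imageEncoder.prediction(input: input)
        return l2Normalize(floats(from: output.embedding))
    }

    func embedText(_ query: String) throws -> [Float] {
        let tokens: [Int32] = tokenize(query)  // BPE tokenizer, app-side helper (not shown)
        let array = try MLMultiArray(shape: [1, NSNumber(value: tokens.count)], dataType: .int32)
        for (i, token) in tokens.enumerated() { array[i] = NSNumber(value: token) }
        let output = try textEncoder.prediction(input: MobileCLIPTextInput(tokens: array))
        return l2Normalize(floats(from: output.embedding))
    }

    // Copies an MLMultiArray into a plain Swift array
    private func floats(from array: MLMultiArray) -> [Float] {
        (0..<array.count).map { array[$0].floatValue }
    }
}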

The tokenizer for CLIP is BPE (Byte Pair Encoding). A Swift implementation is available in the ml-mobileclip repository. On Android: ONNX Runtime with MobileCLIP—less straightforward but works.
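The embedder above calls an l2Normalize helper that is not shown. A minimal sketch of what such a helper could look like, in plain Swift with no framework dependencies (the handling of a zero vector is our assumption):

```swift
// Scales a vector to unit L2 norm.
// A zero vector is returned unchanged to avoid dividing by zero.
func l2Normalize(_ v: [Float]) -> [Float] {
    let norm = v.reduce(Float(0)) { $0 + $1 * $1 }.squareRoot()
    guard norm > 0 else { return v }
    return v.map { $0 / norm }
}

let unit = l2Normalize([3, 4])  // norm is 5, so the result is close to [0.6, 0.8]
print(unit)
```

Normalizing both image and text embeddings once, at embedding time, is what lets the search step use a plain dot product as cosine similarity.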

How to Integrate CLIP into an iOS App?

Step-by-step guide for implementing semantic search:

Prepare the model. Download MobileCLIP-S0 from the Apple ML repository. Convert to CoreML using coremltools.
Integrate the encoders. Create classes for processing images and text, as in the example above. Ensure normalization and tokenization match.
Set up indexing. Use BGProcessingTask for background indexing of the gallery. Save embeddings to a vector database (e.g., sqlite-vss).
Implement search. On receiving a query, compute the text embedding and perform ANN search. Return sorted results.

Example of background indexing:

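import Photos

final class GalleryIndexer {
    // vectorDB and computeEmbedding(for:) are app-side members, omitted here for brevity
    private var lastIndexedDate: Date {
        get { UserDefaults.standard.object(forKey: "lastIndexedDate") as? Date ?? .distantPast }
        set { UserDefaults.standard.set(newValue, forKey: "lastIndexedDate") }
    }

    func indexNewPhotos() async {
        let fetchOptions = PHFetchOptions()
        fetchOptions.predicate = NSPredicate(format: "creationDate > %@", lastIndexedDate as NSDate)
        // Oldest first, so the saved date always marks a fully processed prefix of the gallery
        fetchOptions.sortDescriptors = [NSSortDescriptor(key: "creationDate", ascending: true)]
        let newPhotos = PHAsset.fetchAssets(with: .image, options: fetchOptions)

        newPhotos.enumerateObjects { [weak self] asset, _, stop in
            guard let self, !Task.isCancelled else {
                stop.pointee = true  // interrupted: the next run resumes from the checkpoint
                return
            }
            if let embedding = self.computeEmbedding(for: asset) {
                self.vectorDB.insert(assetId: asset.localIdentifier, embedding: embedding)
            }
            // Checkpoint after every photo so an interrupted run does not start over
            if let created = asset.creationDate { self.lastIndexedDate = created }
        }
    }
}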

Search completes in ~20 ms: text embedding (5 ms) + ANN search (15 ms). Results are instant.

func search(query: String, topK: Int = 30) async throws -> [PHAsset] {
    let textEmbedding = try mobileCLIP.embedText(query)
    let results = vectorDB.search(vector: textEmbedding, limit: topK)

    // Fetch the matching assets by identifier; the fetch result is not ordered by relevance
    let assets = PHAsset.fetchAssets(withLocalIdentifiers: results.map { $0.assetId }, options: nil)

    // Restore the relevance order using the scores returned by the vector database
    let idToScore = Dictionary(uniqueKeysWithValues: results.map { ($0.assetId, $0.score) })
    return assets.objects(at: IndexSet(integersIn: 0..<assets.count))
        .sorted { idToScore[$0.localIdentifier, default: 0] > idToScore[$1.localIdentifier, default: 0] }
}

How We Choose a Vector Database on Device

For searching among 50,000 vectors, an ANN index is needed. ANN indexes speed up search with techniques such as quantization, inverted-file (IVF) partitioning, and hierarchical navigable small world (HNSW) graphs; sqlite-vss, for example, builds on Faiss. We compare three options with concrete characteristics:

Technology	Performance	Integration Complexity	Recommended Collection Size (vectors)
SQLite + sqlite-vss	15–20 ms per search	Medium (SQL extension)	10k–50k
FAISS (C++ via JNI/Swift)	5–10 ms per search	High (platform-specific build)	50k–500k
Flat L2 via Accelerate	15 ms per 10k vectors	Low (standard library)	up to 10k

SQLite with the sqlite-vss extension adds virtual tables for vector search. Compact, works in embedded mode:

CREATE VIRTUAL TABLE photo_embeddings USING vss0(embedding(512));
INSERT INTO photo_embeddings(rowid, embedding) VALUES (42, json('[0.1, -0.3, ...]'));
SELECT rowid, distance FROM photo_embeddings WHERE vss_search(embedding, json('[0.2, -0.1, ...]')) LIMIT 20;

Simple flat L2/cosine via Accelerate for galleries up to 10k photos is sufficient without a specialized index:

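import Accelerate

func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "Embeddings must have the same dimension")
    var dotProduct: Float = 0
    vDSP_dotpr(a, 1, b, 1, &dotProduct, vDSP_Length(a.count))
    return dotProduct  // For L2-normalized inputs, the dot product equals cosine similarity
}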

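Brute-forcing 10,000 512-dimensional vectors on iPhone 14 via vDSP_dotpr takes ~15 ms. The cost grows linearly with the number of vectors, so flat search stays practical for galleries of roughly 10k–20k photos; beyond that, an ANN index pays off.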
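The full flat search (score every photo, keep the best K) can be sketched as follows. This is an illustration in plain Swift so that it runs anywhere; the type names IndexedPhoto and SearchHit and the function flatSearch are ours, and in production the inner loop would be replaced by the vDSP call shown above.

```swift
struct IndexedPhoto { let assetId: String; let embedding: [Float] }
struct SearchHit { let assetId: String; let score: Float }

// Brute-force top-K over L2-normalized embeddings.
// For normalized vectors the dot product equals cosine similarity.
func flatSearch(query: [Float], index: [IndexedPhoto], topK: Int) -> [SearchHit] {
    let hits = index.map { photo -> SearchHit in
        var dot: Float = 0
        for (x, y) in zip(query, photo.embedding) { dot += x * y }
        return SearchHit(assetId: photo.assetId, score: dot)
    }
    // Highest similarity first, then cut to the requested size
    return Array(hits.sorted { $0.score > $1.score }.prefix(topK))
}

// Toy 2-dimensional index; real embeddings are 512-dimensional.
let photoIndex = [
    IndexedPhoto(assetId: "beach", embedding: [1, 0]),
    IndexedPhoto(assetId: "forest", embedding: [0, 1]),
    IndexedPhoto(assetId: "dunes", embedding: [0.8, 0.6])
]
let top = flatSearch(query: [1, 0], index: photoIndex, topK: 2)
print(top.map { $0.assetId })
```

Sorting all hits is O(n log n); for 10k vectors that is negligible next to the dot products, so a heap-based top-K is an optimization rather than a requirement at this scale.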

Multilingual Search

CLIP is trained predominantly on English. For a Russian query "собака на пляже", quality is worse than for "dog on beach". Solution: translate the query using a simple dictionary of common words or Google Translate API before embedding. In practice, translating 100–200 frequent queries offline is sufficient.
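The offline part of this approach can be sketched as a simple lookup. The dictionary contents and the function name translateOffline are illustrative assumptions; a real dictionary would hold the 100–200 most frequent queries, and a nil result would trigger the online translator.

```swift
import Foundation

// Tiny illustrative dictionary of frequent queries (hypothetical contents).
let offlineTranslations: [String: String] = [
    "собака на пляже": "dog on the beach",
    "закат": "sunset"
]

// Returns the English form of a known query,
// or nil when the query is not in the dictionary.
func translateOffline(_ query: String) -> String? {
    let key = query.trimmingCharacters(in: .whitespacesAndNewlines).lowercased()
    return offlineTranslations[key]
}

print(translateOffline(" Собака на пляже ") ?? "needs online translation")
print(translateOffline("горы") ?? "needs online translation")
```

Normalizing case and whitespace before the lookup keeps the dictionary small, since users rarely type a query exactly the same way twice.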

What's Included and How Long Does It Take?

Task	Timeline
Basic CLIP search with flat index for galleries up to 10k	1–1.5 weeks
Scalable implementation with ANN index, incremental updates, multilingual support, and visual search by reference photo	3–4 weeks

Cost is calculated individually based on integration complexity and target devices. Get a consultation for an accurate estimate.

Why Trust Us with This Task?

Our team has 6+ years of experience in mobile ML and has delivered 15+ on-device AI projects, establishing a strong track record since 2018. We guarantee compliance with App Store Review Guidelines (sections 4.2 and 5.1) and user data security. We provide full documentation and post-deployment support.

Machine Learning in Mobile Apps: CoreML, TFLite, and On-Device Models

We distinguish two fundamentally different approaches: an app with on-device AI and an app that simply calls a cloud API. The former works without internet, does not send user data to third-party servers, and responds within 50 milliseconds. The latter depends on network latency and pricing plans. Choosing the architecture is a key step that directly affects the cost, privacy, and user experience of machine learning in a mobile app. Our experience shows that in 70% of projects, on-device inference is cheaper in the long run because it eliminates server costs.

How to Choose Between CoreML and TFLite for On-Device Inference?

CoreML is Apple's native framework for running ML models on device. It supports the Neural Engine (available to CoreML starting with A12 Bionic), GPU, and CPU as a fallback. Models are converted to the .mlmodel or .mlpackage format via coremltools from PyTorch or TensorFlow (direct ONNX conversion was supported only in older coremltools releases). Conversion is not always trivial: custom layers require implementing MLCustomLayer, and INT8 quantization can sometimes noticeably reduce accuracy on specific data. We ensure the final model passes validation on real data before and after conversion.

TensorFlow Lite is the cross-platform alternative, used on Android and in Flutter apps. On Android it uses NNAPI (Neural Networks API) for hardware acceleration; NNAPI is more stable from Android 10 onward, and on earlier versions it is better to use the GPU delegate explicitly via GpuDelegate. A typical mistake: the model is trained on data normalized to [0,1], but the app feeds values in [0,255]. Inference runs without any error but produces meaningless results. We include an automatic input data validation module in the SDK.
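A check against the [0,1] versus [0,255] mistake can be sketched as follows. This is a minimal illustration of the idea, not the validation module itself; the function and error names are hypothetical, and it assumes the model expects inputs in [0,1].

```swift
enum InputRangeError: Error {
    case outOfRange(min: Float, max: Float)
}

// Accepts pixels already in [0, 1], rescales [0, 255] input,
// and rejects anything else instead of silently feeding it to the model.
func normalizedPixels(_ pixels: [Float]) throws -> [Float] {
    guard let lo = pixels.min(), let hi = pixels.max() else { return pixels }
    if lo >= 0 && hi <= 1 { return pixels }
    if lo >= 0 && hi <= 255 { return pixels.map { $0 / 255 } }
    throw InputRangeError.outOfRange(min: lo, max: hi)
}

let rescaled = (try? normalizedPixels([0, 127.5, 255])) ?? []
print(rescaled)
```

A range heuristic like this cannot tell a nearly black [0,255] image from a normalized one, so in practice it is safer to also record the expected range alongside the model and fix the preprocessing at its source.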

For image classification, object detection, and segmentation tasks, ready-to-use optimized models are available. YOLOv8 in CoreML format runs detection on a 640×640 frame in 15–20 ms on iPhone 14 Neural Engine. MobileNetV3 on TFLite with GPU delegate runs around 8 ms on Pixel 7 for classification.

Parameter	CoreML	TFLite
Platforms	iOS, macOS, watchOS	Android, iOS, Linux, embedded
Hardware acceleration	Neural Engine, GPU, CPU	NNAPI, GPU (OpenCL/OpenGL), CPU
Quantization support	FP16, INT8 (with coremltools)	FP16, INT8, dynamic range
Custom operations	Via MLCustomLayer (Swift)	Via custom operators (C/C++)
Model bundle size	~3–5 MB (MobileNetV2 quantized)	~2–4 MB

What If You Need Text Generation On-Device?

Running small language models on device has become practical in the last few years. Apple Intelligence combines Apple's own on-device models with Private Cloud Compute for heavier requests, but for third-party developers other paths are available.

llama.cpp with the Metal backend on iOS is a working approach for phi-3-mini (3.8B parameters, 4-bit quantization, ~2.3 GB). Inference runs at 15–25 tokens/second on iPhone 15 Pro. For integration in Swift, use the Swift Package llama.swift or a wrapper around the C interface llama.h. The model weights are not bundled with the app: they are downloaded on first launch and stored in Application Support. Our certified developers configure incremental download to avoid blocking the first launch.

On Android, the counterpart is the MediaPipe LLM Inference API (part of Google AI Edge), which supports Gemma-2B. It runs via the GPU delegate; on the Tensor G3 chip in Pixel 8 Pro it reaches about 20 tokens/second.

Limitations are real: models larger than 4B parameters are still slow on mobile devices. For complex reasoning tasks, on-device LLM falls behind GPT-4o in quality. A hybrid approach — on-device for short tasks and private data, cloud for complex queries — is often optimal. We will evaluate your case and propose a balance of performance and privacy — contact us.
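The hybrid approach comes down to a routing decision per request. A minimal sketch, where the policy itself (private data stays local, short prompts stay local, the 500-character threshold) is an illustrative assumption rather than a recommendation:

```swift
enum InferenceRoute { case onDevice, cloud }

struct LLMRequest {
    let prompt: String
    let containsPrivateData: Bool
}

// Hypothetical policy: private data never leaves the device;
// otherwise short prompts stay local and long ones go to the cloud.
func route(_ request: LLMRequest, maxLocalPromptLength: Int = 500) -> InferenceRoute {
    if request.containsPrivateData { return .onDevice }
    return request.prompt.count <= maxLocalPromptLength ? .onDevice : .cloud
}

let shortTask = LLMRequest(prompt: "Summarize this note", containsPrivateData: false)
let longTask = LLMRequest(prompt: String(repeating: "x", count: 2_000), containsPrivateData: false)
let privateTask = LLMRequest(prompt: String(repeating: "x", count: 2_000), containsPrivateData: true)
print(route(shortTask), route(longTask), route(privateTask))
```

Prompt length is only a rough proxy for task complexity; a real router would likely also consider device capability, battery state, and connectivity.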

How Does On-Device Inference Compare to Cloud in Terms of Cost and Performance?

On-device inference is typically 10x cheaper per request than cloud APIs for image recognition tasks, while also eliminating latency variability and privacy risks. The table below summarizes the trade-offs.

Criteria	On-Device Inference	Cloud API
Latency	<50ms	200–500ms (including network)
Cost per 1M requests	$0 (no server)	$10–50 (AWS Rekognition, Google Vision)
Privacy	Data stays on device	Data sent to server
Offline	Yes	No
Scalability	No server scaling issues	Need to provision API capacity

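An app with 100k MAU running 10 image recognitions per user generates 1M requests per month. On-device inference removes that per-request bill entirely, and the savings scale linearly with usage; for heavier workloads they can reach up to $5,000 monthly compared to a cloud API. Get a free consultation on your ML architecture today.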

Integrating OpenAI API and Other Cloud Models

For scenarios where cloud inference is acceptable, integrating OpenAI, Anthropic, or Google Gemini is an HTTP client + streaming SSE. In Swift, AsyncThrowingStream is convenient for streaming responses. In Kotlin, use Flow.
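On the client, handling a streamed response mostly means parsing Server-Sent Events lines. A minimal sketch, assuming an OpenAI-style stream where each event is a line starting with "data:" and the stream ends with a "[DONE]" sentinel; the function name ssePayloads is ours, and in an app the lines would typically arrive one by one from the network rather than as an array.

```swift
import Foundation

// Extracts the payloads of "data:" lines from an SSE stream
// and stops at the "[DONE]" sentinel.
func ssePayloads(from lines: [String]) -> [String] {
    var payloads: [String] = []
    for line in lines {
        // Skip blank separators, comments (": ...") and other SSE fields
        guard line.hasPrefix("data:") else { continue }
        let payload = line.dropFirst("data:".count).trimmingCharacters(in: .whitespaces)
        if payload == "[DONE]" { break }
        payloads.append(payload)
    }
    return payloads
}

let sample = [
    "data: {\"delta\":\"Hi\"}",
    "",
    ": keep-alive",
    "data: {\"delta\":\" there\"}",
    "data: [DONE]",
    "data: ignored"
]
let chunks = ssePayloads(from: sample)
print(chunks)
```

Each payload is a JSON document that still needs decoding; wrapping this loop in an AsyncThrowingStream gives the UI one decoded chunk at a time.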

Critically: API keys must never be stored in the app bundle. Even an obfuscated key can be extracted from the IPA in 10 minutes using strings or frida. Correct architecture: mobile app → your own backend → OpenAI API. The backend controls rate limiting, logs requests, and protects the key.

What Is Included in the Work (Deliverables)

Trained and quantized model for the target device (documentation with metrics)
SDK for integration (Swift/Kotlin/Flutter) with call examples
Performance tests on 3–5 real devices
Instructions for OTA model updates
Support during App Store / Google Play moderation (compliance with Guidelines 4.2, 5.1)
2 weeks of technical support after release

Typical Project Pipeline

Task analysis: define requirements for latency, privacy, model size, and supported devices.
Model prototyping: build the model in Python and evaluate accuracy on target data.
Conversion and quantization: convert to CoreML/TFLite and validate the result.
Integration into the app: the model is wrapped in a service layer (easy to swap CoreML ↔ TFLite ↔ cloud).
Testing: measure FPS, RAM, and battery usage on real devices.
Deployment: ship via TestFlight / Firebase App Distribution and monitor metrics.
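The service layer mentioned in the integration step can be sketched as a protocol with interchangeable backends. All names here are hypothetical, and the stub backends stand in for real CoreML, TFLite, or cloud wrappers.

```swift
// Common interface so app code does not depend on a specific backend.
protocol ImageClassifier {
    func classify(_ pixels: [Float]) throws -> String
}

// Stubs standing in for real CoreML / TFLite / cloud wrappers.
struct StubOnDeviceClassifier: ImageClassifier {
    func classify(_ pixels: [Float]) throws -> String { "on-device result" }
}

struct StubCloudClassifier: ImageClassifier {
    func classify(_ pixels: [Float]) throws -> String { "cloud result" }
}

// The rest of the app talks only to this service.
final class RecognitionService {
    private var backend: ImageClassifier

    init(backend: ImageClassifier) { self.backend = backend }

    // Swaps the backend at runtime, e.g. after an OTA model update
    func use(_ newBackend: ImageClassifier) { backend = newBackend }

    func recognize(_ pixels: [Float]) throws -> String {
        try backend.classify(pixels)
    }
}

let service = RecognitionService(backend: StubOnDeviceClassifier())
let first = (try? service.recognize([])) ?? ""
service.use(StubCloudClassifier())
let second = (try? service.recognize([])) ?? ""
print(first, second)
```

Keeping the protocol free of framework types (no MLMultiArray or TFLite tensors in the signature) is what makes the swap cheap.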

Timelines: integration of a ready CoreML/TFLite model — 1–2 weeks, development of a custom model with mobile optimization — from 6 weeks, on-device LLM chat with personalization — 4–8 weeks.

Why We Take on Complex Cases?

10+ years of experience in mobile development, 50+ implemented AI/ML solutions, guarantee of compatibility with current iOS and Android versions. All projects undergo code review and load testing. The cost includes preparation of moderation documentation and training of your team.

Contact us — we will help you choose the architecture and implement ML in your app turnkey. Order an audit of your existing solution — we will assess the potential for server cost savings free of charge. In some projects, savings can reach significant amounts per month.