AI-Powered Text-Based Photo Search for Mobile Apps
"Show me photos with a dog on the beach": the user describes a photo in text, and the app finds the matching ones. This is what CLIP (Contrastive Language-Image Pre-training, from OpenAI) makes possible: a model trained to map images and text descriptions into a shared vector space. The cosine similarity between a text vector and an image vector serves as the relevance score.
Architecture: Embeddings + Vector Search
The pipeline has two independent stages:
Indexing (happens once for the entire gallery, then incrementally for new photos):
- For each photo → CLIP image embedding (512-dimensional vector)
- Save the vector to a local vector database
Search (happens on each user query):
- User query → CLIP text embedding (a vector in the same 512-dimensional space)
- Run an ANN search for the nearest vectors in the database
- Return photos in descending order of cosine similarity
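The two stages can be sketched in plain Swift as follows. This is a minimal in-memory illustration, not the production design: `PhotoSearchPipeline`, its closures, and the `dot` helper are hypothetical names, the closures stand in for the CLIP encoders, and the embeddings are assumed to be L2-normalized so that the dot product equals cosine similarity.

```swift
import Foundation

/// Dot product of two vectors of equal length.
func dot(_ a: [Float], _ b: [Float]) -> Float {
    var sum: Float = 0
    for i in 0..<min(a.count, b.count) { sum += a[i] * b[i] }
    return sum
}

/// Hypothetical in-memory stand-in for the two pipeline stages.
/// The closures represent the CLIP image and text encoders and are
/// assumed to return L2-normalized vectors of the same dimension.
struct PhotoSearchPipeline<Image> {
    let embedImage: (Image) -> [Float]
    let embedText: (String) -> [Float]
    private(set) var entries: [(photoID: String, embedding: [Float])] = []

    init(embedImage: @escaping (Image) -> [Float],
         embedText: @escaping (String) -> [Float]) {
        self.embedImage = embedImage
        self.embedText = embedText
    }

    /// Stage 1, indexing: runs once per photo.
    mutating func index(photoID: String, image: Image) {
        entries.append((photoID: photoID, embedding: embedImage(image)))
    }

    /// Stage 2, search: runs once per query and ranks by descending similarity.
    func search(_ query: String, topK: Int) -> [(photoID: String, score: Float)] {
        let queryVector = embedText(query)
        let scored = entries.map { (photoID: $0.photoID, score: dot(queryVector, $0.embedding)) }
        return Array(scored.sorted { $0.score > $1.score }.prefix(topK))
    }
}
```

Because the stages only share the stored vectors, the indexing side can run in the background while the search side stays a cheap lookup.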
CLIP On-Device via CoreML
Apple did not include CLIP in the standard Vision framework, but Apple's machine learning research team released ml-mobileclip, a distilled version for mobile devices. MobileCLIP-S0: 18 MB, 3–5 ms inference per image on iPhone 14.
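import CoreML

// MobileCLIPImageEncoder and MobileCLIPTextEncoder stand for the classes Xcode generates
// from the converted Core ML models; the actual names depend on your model files.
// resize, tokenize, l2Normalize, MLMultiArray(from:) and toFloatArray() are app-side helpers, not Core ML APIs.
final class MobileCLIPEmbedder {
    private let imageEncoder: MobileCLIPImageEncoder
    private let textEncoder: MobileCLIPTextEncoder

    init(configuration: MLModelConfiguration = MLModelConfiguration()) throws {
        imageEncoder = try MobileCLIPImageEncoder(configuration: configuration)
        textEncoder = try MobileCLIPTextEncoder(configuration: configuration)
    }

    func embedImage(_ cgImage: CGImage) throws -> [Float] {
        let resized = resize(cgImage, to: CGSize(width: 256, height: 256))
        let input = MobileCLIPImageInput(image: MLMultiArray(from: resized))
        let output = try imageEncoder.prediction(input: input)
        return l2Normalize(output.embedding.toFloatArray())
    }

    func embedText(_ query: String) throws -> [Float] {
        let tokens = tokenize(query) // BPE tokenizer
        let input = MobileCLIPTextInput(tokens: MLMultiArray(from: tokens))
        let output = try textEncoder.prediction(input: input)
        return l2Normalize(output.embedding.toFloatArray())
    }
}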
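The embedder above calls `l2Normalize` without showing it. One possible implementation is sketched below; the zero-vector handling is an assumption made here, not something the original specifies.

```swift
import Foundation

/// Scales a vector to unit Euclidean length.
/// Returns the input unchanged when its norm is zero, to avoid dividing by zero.
func l2Normalize(_ vector: [Float]) -> [Float] {
    var sumOfSquares: Float = 0
    for x in vector { sumOfSquares += x * x }
    let norm = sumOfSquares.squareRoot()
    guard norm > 0 else { return vector }
    return vector.map { $0 / norm }
}
```

Normalizing once at embedding time means every later comparison can be a plain dot product.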
The CLIP tokenizer uses BPE (Byte Pair Encoding). A Swift implementation is available in the apple/ml-mobileclip repository.
On Android, run MobileCLIP through ONNX Runtime: less convenient, but it works. Create an OrtEnvironment and an OrtSession, and process images in batches of 8.
Vector Database On Device
Searching among 50,000 vectors calls for an ANN (approximate nearest neighbor) index. Options:
SQLite with the sqlite-vss extension: adds virtual tables for vector search. Compact and runs embedded:
CREATE VIRTUAL TABLE photo_embeddings USING vss0(embedding(512));
INSERT INTO photo_embeddings(rowid, embedding) VALUES (42, json('[0.1, -0.3, ...]'));
SELECT rowid, distance FROM photo_embeddings WHERE vss_search(embedding, json('[0.2, -0.1, ...]')) LIMIT 20;
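The `json('[...]')` arguments in the statements above have to be produced from Swift `[Float]` embeddings. A minimal sketch of that conversion follows; `jsonArrayLiteral` is a hypothetical helper name, and rejecting non-finite values is a choice made here because NaN and infinity are not valid JSON numbers.

```swift
import Foundation

/// Renders an embedding as a compact JSON array string, e.g. "[0.1,-0.3]".
/// Returns nil if any component is NaN or infinite.
func jsonArrayLiteral(_ embedding: [Float]) -> String? {
    guard embedding.allSatisfy({ $0.isFinite }) else { return nil }
    return "[" + embedding.map { $0.description }.joined(separator: ",") + "]"
}
```

It is probably safer to bind the resulting string as a statement parameter, for example `json(?)`, than to interpolate it into the SQL text.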
FAISS (C++) via JNI or Swift bridging: faster at scale, harder to integrate.
Simple flat L2/cosine scan via Accelerate: for galleries of up to 10k photos this is sufficient without a specialized index:
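import Accelerate

func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "Vectors must have the same dimension")
    var dotProduct: Float = 0
    vDSP_dotpr(a, 1, b, 1, &dotProduct, vDSP_Length(a.count))
    return dotProduct // For L2-normalized vectors the dot product equals cosine similarity
}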
Scanning 10,000 512-dimensional vectors with vDSP_dotpr takes about 15 ms on an iPhone 14, so a flat scan remains acceptable for galleries of up to about 20k photos.
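A flat index is easiest to scan when all embeddings sit in one contiguous buffer. The sketch below shows that layout and a top-K scan in portable Swift. `FlatIndex` is a hypothetical type, vectors are assumed to be L2-normalized, and on Apple platforms the inner loop could be replaced with vDSP_dotpr. At 4 bytes per Float, 10,000 vectors of 512 dimensions occupy 10,000 × 512 × 4 = 20,480,000 bytes, roughly 20 MB.

```swift
import Foundation

/// Flat (brute-force) index that stores every embedding in one contiguous buffer.
struct FlatIndex {
    let dimension: Int
    private(set) var ids: [String] = []
    private(set) var storage: [Float] = []

    init(dimension: Int) { self.dimension = dimension }

    /// Memory used by the raw vectors, in bytes.
    var memoryBytes: Int { storage.count * MemoryLayout<Float>.size }

    mutating func insert(id: String, embedding: [Float]) {
        precondition(embedding.count == dimension, "Unexpected embedding size")
        ids.append(id)
        storage.append(contentsOf: embedding)
    }

    /// Scores every stored vector against the query and returns the best topK.
    func search(_ query: [Float], topK: Int) -> [(id: String, score: Float)] {
        precondition(query.count == dimension, "Unexpected query size")
        var scored: [(id: String, score: Float)] = []
        scored.reserveCapacity(ids.count)
        for row in 0..<ids.count {
            let base = row * dimension
            var dotProduct: Float = 0
            for i in 0..<dimension { dotProduct += storage[base + i] * query[i] }
            scored.append((id: ids[row], score: dotProduct))
        }
        return Array(scored.sorted { $0.score > $1.score }.prefix(topK))
    }
}
```

Sorting the full score list is simple and adequate at this scale; a bounded heap would avoid the full sort if the gallery grows.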
Background Indexing
The first indexing pass over a 10k-photo gallery takes about 40 seconds at 4 ms per photo (10,000 × 4 ms). Run it in a BGProcessingTask:
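import Photos

// Saves progress after each photo, so an interrupted run resumes from the checkpoint on next launch
final class GalleryIndexer {
    private var lastIndexedDate: Date {
        get { UserDefaults.standard.object(forKey: "lastIndexedDate") as? Date ?? .distantPast }
        set { UserDefaults.standard.set(newValue, forKey: "lastIndexedDate") }
    }

    func indexNewPhotos() async {
        let fetchOptions = PHFetchOptions()
        fetchOptions.predicate = NSPredicate(format: "creationDate > %@", lastIndexedDate as NSDate)
        // Oldest first, so the checkpoint only moves forward
        fetchOptions.sortDescriptors = [NSSortDescriptor(key: "creationDate", ascending: true)]
        let newPhotos = PHAsset.fetchAssets(with: .image, options: fetchOptions)
        newPhotos.enumerateObjects { [weak self] asset, _, stop in
            guard let self else { stop.pointee = true; return }
            // computeEmbedding(for:) and vectorDB are app-defined and not shown here
            if let embedding = self.computeEmbedding(for: asset) {
                self.vectorDB.insert(assetId: asset.localIdentifier, embedding: embedding)
            }
            // Advance the checkpoint to this photo rather than to Date()
            if let created = asset.creationDate { self.lastIndexedDate = created }
        }
    }
}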
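The checkpoint idea can be isolated from the Photos framework and tested on its own. The sketch below is a simplified model under stated assumptions: `PhotoItem`, `indexIncrementally`, `shouldStop`, and `process` are hypothetical names, with `shouldStop` standing in for the system revoking background time and `process` standing in for computing and storing one embedding.

```swift
import Foundation

/// Hypothetical stand-in for a photo asset.
struct PhotoItem {
    let id: String
    let creationDate: Date
}

/// Processes items newer than `checkpoint`, oldest first, and returns the new checkpoint.
/// The checkpoint advances only past items that were actually processed,
/// so a run that stops early resumes where it left off.
func indexIncrementally(items: [PhotoItem],
                        checkpoint: Date,
                        shouldStop: () -> Bool,
                        process: (PhotoItem) -> Void) -> Date {
    var newCheckpoint = checkpoint
    let pending = items
        .filter { $0.creationDate > checkpoint }
        .sorted { $0.creationDate < $1.creationDate }
    for item in pending {
        if shouldStop() { break }
        process(item)
        newCheckpoint = item.creationDate
    }
    return newCheckpoint
}
```

One limitation of a date-based checkpoint is that photos sharing the exact same creation date could be skipped after an interruption; tracking processed identifiers would close that gap.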
Search: Processing Query
func search(query: String, topK: Int = 30) async throws -> [PHAsset] {
    let textEmbedding = try mobileCLIP.embedText(query)
    let results = vectorDB.search(vector: textEmbedding, limit: topK)

    // Fetch by identifier; the fetch result is not guaranteed to keep the identifiers' order
    let assets = PHAsset.fetchAssets(withLocalIdentifiers: results.map { $0.assetId }, options: nil)

    // Restore relevance order from vectorDB (assumes a higher score means more relevant;
    // with a distance-based index, sort ascending instead)
    let idToScore = Dictionary(results.map { ($0.assetId, $0.score) }, uniquingKeysWith: { max($0, $1) })
    return assets.objects(at: IndexSet(0..<assets.count))
        .sorted { idToScore[$0.localIdentifier, default: 0] > idToScore[$1.localIdentifier, default: 0] }
}
Search latency: text embedding (~5 ms) + vector search (~15 ms for a flat scan over 10,000 vectors) = ~20 ms. Results feel instant to the user.
Multilingual Search
CLIP was trained mainly on English text, so a Russian query such as "собака на пляже" (dog on the beach) gives worse results than its English equivalent. Solution: translate the query into English before computing the embedding, either with a simple dictionary of frequent words or with the Google Translate API. In practice, translating the 100–200 most frequent queries locally is enough, with no API needed.
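The dictionary approach can be sketched as a lookup on a normalized query string. The table entries below are illustrative only, and `frequentQueries` and `translateFrequentQuery` are hypothetical names; a real table would come from the app's own query statistics.

```swift
import Foundation

/// Hypothetical table of frequent queries mapped to English. Entries are examples only.
let frequentQueries: [String: String] = [
    "собака на пляже": "dog on the beach",
    "закат": "sunset",
]

/// Returns the English equivalent of a known query, or nil so the caller can
/// fall back to a translation API or embed the query unchanged.
func translateFrequentQuery(_ query: String,
                            table: [String: String] = frequentQueries) -> String? {
    let key = query
        .trimmingCharacters(in: .whitespacesAndNewlines)
        .lowercased()
    return table[key]
}
```

A caller might use `translateFrequentQuery(query) ?? query` before computing the text embedding, so unknown queries still go through unchanged.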
Timelines
Basic CLIP search with a flat index for galleries of up to 10k photos: 1–1.5 weeks. A scalable implementation with an ANN index, incremental updates, multilingual queries, and visual search by reference photo: 3–4 weeks. Cost is calculated individually.







