AI-Powered Face Grouping for Photo Gallery in Mobile Apps
Face grouping means recognizing that the same person appears across different photos, without identifying who they are. Technically: face detection → embedding extraction (a numerical vector of 128–512 dimensions) → clustering the vectors by similarity. Apple Photos runs this kind of pipeline on-device.
On-Device vs Server: A Deliberate Choice
For this task, on-device processing is not just convenient: it is the simplest way to satisfy privacy regulation. Face embeddings used to recognize a person count as biometric data, which GDPR treats as a special category and which various national laws restrict, so sending them to a server generally requires explicit consent. We recommend building an on-device pipeline with zero server transmission.
Pipeline: Detection → Embedding → Clustering
Face Detection
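// iOS: VNDetectFaceRectanglesRequest
let request = VNDetectFaceRectanglesRequest { req, _ in
    guard let faces = req.results as? [VNFaceObservation], !faces.isEmpty else { return }
    for face in faces {
        // Vision rects are normalized with a bottom-left origin; CGImage cropping uses top-left
        var faceRect = VNImageRectForNormalizedRect(face.boundingBox, width, height)
        faceRect.origin.y = CGFloat(height) - faceRect.maxY
        // Skip faces whose rect cannot be cropped instead of force-unwrapping
        guard let faceImage = originalImage.cropping(to: faceRect) else { continue }
        self.extractEmbedding(from: faceImage)
    }
}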
On Android: ML Kit FaceDetector or MediaPipe FaceDetector. ML Kit is simpler to integrate; MediaPipe gives more control.
Embedding Extraction
Apple doesn't provide a public face recognition API (Vision covers detection only). Use MobileFaceNet, a compact face-embedding model (roughly 1–4 MB depending on precision and quantization) that runs on-device via Core ML:
// Extract 128-dimensional embedding
func extractEmbedding(from faceImage: CGImage) -> [Float]? {
    // resize(_:to:) and MLMultiArray(from:) are assumed to be app-defined helpers, not Core ML API
    guard let input = try? MobileFaceNetInput(face_image: MLMultiArray(from: resize(faceImage, to: CGSize(width: 112, height: 112)))) else { return nil }
    guard let output = try? facenetModel.prediction(input: input) else { return nil }
    let embedding = (0..<128).map { output.embedding[$0].floatValue }
    // Normalize vector (L2 norm)
    return l2Normalize(embedding)
}

// Scale a vector to unit length; a zero vector is returned unchanged
func l2Normalize(_ v: [Float]) -> [Float] {
    let norm = sqrt(v.reduce(0) { $0 + $1 * $1 })
    return norm > 0 ? v.map { $0 / norm } : v
}
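After L2 normalization, the cosine distance between embeddings of the same person is typically below 0.3, and between different people above 0.6. A threshold of 0.4–0.5 works well in practice, but the exact values depend on the model, so validate them on your own photos.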
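A minimal sketch of the distance check described above, assuming both vectors are already L2-normalized (so cosine distance reduces to 1 minus the dot product). The names `cosineDistance` and `isSamePerson` and the 0.45 default are illustrative assumptions, not part of any framework:

```swift
// Cosine distance for L2-normalized vectors: 1 - dot product.
// 0 means identical direction, 1 means orthogonal.
func cosineDistance(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "embeddings must have the same dimension")
    var dot: Float = 0
    for i in 0..<a.count {
        dot += a[i] * b[i]
    }
    return 1 - dot
}

// Hypothetical helper: 0.45 is an assumed starting threshold, to be tuned per model.
func isSamePerson(_ a: [Float], _ b: [Float], threshold: Float = 0.45) -> Bool {
    return cosineDistance(a, b) < threshold
}
```

A pairwise check like this is useful for spot-testing thresholds on a handful of labelled photos before running full clustering.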
Clustering
When the number of clusters is not known in advance, use DBSCAN. Swift has no built-in implementation, so write it yourself; Accelerate/BLAS can speed up the distance computations:
// Simplified DBSCAN for face clustering
func dbscan(embeddings: [[Float]], eps: Float = 0.45, minPoints: Int = 2) -> [Int] {
    // Keep "unvisited" and "noise" distinct so no point is expanded twice
    let noise = -1, unvisited = -2
    var labels = Array(repeating: unvisited, count: embeddings.count)
    var clusterId = 0
    for i in 0..<embeddings.count {
        guard labels[i] == unvisited else { continue }
        let neighbours = rangeQuery(embeddings: embeddings, idx: i, eps: eps)
        if neighbours.count < minPoints { labels[i] = noise; continue } // noise point
        labels[i] = clusterId
        var seeds = neighbours
        while let q = seeds.popLast() {
            if labels[q] == noise { labels[q] = clusterId } // border point: join, do not expand
            guard labels[q] == unvisited else { continue }  // already assigned
            labels[q] = clusterId
            let qNeighbours = rangeQuery(embeddings: embeddings, idx: q, eps: eps)
            if qNeighbours.count >= minPoints { seeds.append(contentsOf: qNeighbours) }
        }
        clusterId += 1
    }
    return labels // cluster index per embedding, -1 = noise
}
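Here rangeQuery returns the indices of all embeddings within eps of the given one. For L2-normalized vectors, cosine distance is 1 minus the dot product, which Accelerate's vDSP_dotpr computes quickly, even for 10k vectors.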
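The DBSCAN code above calls rangeQuery without defining it. One possible brute-force version is sketched below, assuming L2-normalized inputs. `dotProduct` is a hypothetical helper name; it uses vDSP_dotpr where Accelerate is available and falls back to a plain loop elsewhere:

```swift
#if canImport(Accelerate)
import Accelerate
#endif

// Dot product of two equal-length vectors.
func dotProduct(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "vectors must have the same dimension")
    var result: Float = 0
    #if canImport(Accelerate)
    // Stride 1 on both inputs, a.count elements
    vDSP_dotpr(a, 1, b, 1, &result, vDSP_Length(a.count))
    #else
    for i in 0..<a.count {
        result += a[i] * b[i]
    }
    #endif
    return result
}

// Brute-force range query: indices of all embeddings whose cosine distance
// to embeddings[idx] is at most eps. The result includes idx itself.
func rangeQuery(embeddings: [[Float]], idx: Int, eps: Float) -> [Int] {
    let query = embeddings[idx]
    return embeddings.indices.filter { 1 - dotProduct(query, embeddings[$0]) <= eps }
}
```

Because the query point counts as its own neighbour, minPoints = 2 means a face needs at least one other face within eps to start a cluster.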
Performance on Large Galleries
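A real gallery holds 5,000–50,000 photos, and about 30–40% of them contain faces. Take 10,000 photos with faces and an average of 2 faces each: that is 20,000 embeddings.
Brute-force DBSCAN is O(n²); on 20k 128-dimensional vectors expect on the order of 10–30 seconds on an iPhone 14. To speed it up, run the neighbour search through an ANN (Approximate Nearest Neighbour) index such as FAISS, which brings the cost down to roughly O(n log n). FAISS is a C++ library, so calling it from Swift needs a bridging wrapper.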
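The workload behind that estimate can be checked with quick arithmetic. The sketch below assumes one full scan of all embeddings per range query (the brute-force case) and Float32 storage; the variable names are illustrative:

```swift
// Back-of-the-envelope cost of brute-force DBSCAN on the example gallery.
let photosWithFaces = 10_000
let facesPerPhoto = 2
let dimensions = 128

let embeddingCount = photosWithFaces * facesPerPhoto        // 20,000 embeddings
let distanceEvaluations = embeddingCount * embeddingCount   // one full scan per point
let multiplyAdds = distanceEvaluations * dimensions         // work inside the dot products
let embeddingBytes = embeddingCount * dimensions * MemoryLayout<Float>.size

print("distance evaluations: \(distanceEvaluations)")
print("multiply-adds: \(multiplyAdds)")
print("embedding storage: \(embeddingBytes) bytes")
```

That is 400 million distance evaluations and about 51 billion multiply-adds, while the embeddings themselves take only around 10 MB, so compute rather than memory is the bottleneck.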
Run the processing in the background with the BackgroundTasks framework (iOS 13+, BGProcessingTask): a processing task can run for several minutes, and can be restricted to times when the device is charging.
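// Register during app launch, before application(_:didFinishLaunchingWithOptions:) returns
BGTaskScheduler.shared.register(forTaskWithIdentifier: "com.app.faceGrouping", using: nil) { task in
    let bgTask = task as! BGProcessingTask
    // Install the expiration handler before starting work so it is in place if time runs out
    bgTask.expirationHandler = { /* save progress */ }
    self.runFaceGrouping(completion: { bgTask.setTaskCompleted(success: true) })
}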
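Registering a launch handler does not by itself make the task run: a request also has to be submitted, and the identifier has to be listed under BGTaskSchedulerPermittedIdentifiers in Info.plist. A minimal sketch of the submission, assuming an iOS 13+ deployment target; `scheduleFaceGrouping` and `faceGroupingTaskIdentifier` are hypothetical names:

```swift
#if os(iOS)
import BackgroundTasks
#endif

// Must match the identifier used when registering the launch handler.
let faceGroupingTaskIdentifier = "com.app.faceGrouping"

// Asks the system to run the face-grouping task at a later time.
// Returns false if the request was rejected or BGTaskScheduler is unavailable.
@discardableResult
func scheduleFaceGrouping() -> Bool {
    #if os(iOS)
    let request = BGProcessingTaskRequest(identifier: faceGroupingTaskIdentifier)
    request.requiresExternalPower = true         // only run while the device is charging
    request.requiresNetworkConnectivity = false  // the pipeline is fully on-device
    do {
        try BGTaskScheduler.shared.submit(request)
        return true
    } catch {
        return false
    }
    #else
    return false
    #endif
}
```

A typical place to call this is when the app moves to the background; the system then decides when the task actually runs.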
Storing Results
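Embeddings are biometric data, so they cannot be stored in iCloud/CloudKit without explicit consent. Keep them local, in a Core Data store protected by the Data Protection API (set NSPersistentStoreFileProtectionKey to FileProtectionType.complete). Note that .complete makes the store unreadable while the device is locked, which is often when background processing runs; .completeUntilFirstUserAuthentication is the usual compromise if the background task must read or write embeddings. To map faces to photos, store PHAsset.localIdentifier rather than a copy of the original photo.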
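One way to shape a stored record is sketched below. `FaceRecord`, `packEmbedding`, and `unpackEmbedding` are hypothetical names, and the layout (raw Float32 bytes in a binary field) is an assumption rather than a required schema:

```swift
import Foundation

// Hypothetical record for one detected face.
struct FaceRecord: Codable, Equatable {
    let assetLocalIdentifier: String  // PHAsset.localIdentifier of the source photo
    let clusterId: Int                // DBSCAN label, -1 = noise
    let embedding: Data               // raw Float32 bytes of the L2-normalized vector
}

// Pack an embedding into a compact binary blob (4 bytes per component).
func packEmbedding(_ v: [Float]) -> Data {
    return v.withUnsafeBufferPointer { Data(buffer: $0) }
}

// Restore the embedding; copying avoids any alignment assumptions about Data.
func unpackEmbedding(_ data: Data) -> [Float] {
    var out = [Float](repeating: 0, count: data.count / MemoryLayout<Float>.size)
    _ = out.withUnsafeMutableBufferPointer { data.copyBytes(to: $0) }
    return out
}
```

In a Core Data model the embedding would map naturally to a Binary Data attribute; a 128-dimensional vector occupies 512 bytes this way.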
Timelines
On-device pipeline with detection, embeddings, and clustering for medium galleries — 2–3 weeks. Scalable implementation with FAISS, background processing, incremental updates, and UI — 4–5 weeks. Cost calculated individually.