Implementing AI Face-Document Matching (Face Match) Verification in a Mobile App
A KYC flow without Face Match verifies a document that has no proven connection to the person in front of the camera. Face Match closes this gap: it compares the user's selfie with the document photo and returns a confidence score. Technically the task is solved, but the devil is in the details: the quality of the passport photo, lighting during the selfie, the age of the document, and the aging factor.
How Face Embedding Comparison Works
Classic pipeline:
- Face detection in both images — VNDetectFaceRectanglesRequest (Vision) on iOS, FaceDetector (ML Kit) on Android.
- Alignment — normalize eye and nose coordinates to a canonical face position. Without alignment, match scores drop 15–20%.
- Embedding — a CNN model (ArcFace, FaceNet) transforms a 112×112 px face crop into a 512-dimensional vector.
- Cosine similarity between the two vectors — a value from 0.0 to 1.0, where ≥0.65 usually counts as a match (the threshold depends on the model).
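The alignment step can be sketched as estimating a 2D similarity transform from the two detected eye centers. A minimal example, assuming the standard insightface 112×112 ArcFace template coordinates (the `eyeAlignment` helper itself is illustrative):

```swift
import Foundation

/// Parameters of a 2D similarity transform (uniform scale + rotation)
/// that maps detected eye centers onto the canonical ArcFace 112×112 template.
struct AlignmentTransform {
    let scale: Double
    let rotation: Double  // radians
}

func eyeAlignment(leftEye: (x: Double, y: Double),
                  rightEye: (x: Double, y: Double)) -> AlignmentTransform {
    // Canonical eye positions in the 112×112 ArcFace template (insightface values).
    let templateLeft = (x: 38.2946, y: 51.6963)
    let templateRight = (x: 73.5318, y: 51.5014)

    // Vector between the detected eyes and between the template eyes.
    let dx = rightEye.x - leftEye.x
    let dy = rightEye.y - leftEye.y
    let tdx = templateRight.x - templateLeft.x
    let tdy = templateRight.y - templateLeft.y

    // Uniform scale: template inter-eye distance over the detected one.
    let scale = (tdx * tdx + tdy * tdy).squareRoot() / (dx * dx + dy * dy).squareRoot()
    // Rotation that brings the detected eye line to the template's orientation.
    let rotation = atan2(dy, dx) - atan2(tdy, tdx)
    return AlignmentTransform(scale: scale, rotation: rotation)
}
```

A full implementation would also solve for translation and warp the crop (e.g., via Core Graphics or vImage), but scale and rotation are the part that costs you the 15–20% when skipped.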
Important: the threshold is not universal. Different demographic groups show different baseline similarity. A good model is trained on a balanced dataset (MS-Celeb-1M, VGGFace2 plus augmentation) and validated on LFW / AgeDB with a demographic breakdown. A model without such validation is a potential discrimination risk and a source of false rejections (FRR) for elderly users.
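A demographic audit can start with something as simple as computing FRR per group over genuine (same-person) pairs from the validation set. A sketch, with illustrative group labels:

```swift
import Foundation

/// False rejection rate (FRR) per demographic group: the share of genuine
/// (same-person) pairs whose similarity score falls below the threshold.
/// Group labels follow whatever annotation scheme your validation set uses.
func frrByGroup(genuineScores: [String: [Float]], threshold: Float) -> [String: Float] {
    genuineScores.mapValues { scores in
        guard !scores.isEmpty else { return 0 }
        let rejected = scores.filter { $0 < threshold }.count
        return Float(rejected) / Float(scores.count)
    }
}
```

A gap of more than a few percentage points between groups at the production threshold is a signal to retrain or recalibrate before launch.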
On-Device Embedding on iOS
ArcFace R50 converted to Core ML (via coremltools) weighs ~85 MB. For mobile production, MobileFaceNet is the better fit: 1.1 MB, 99.2% accuracy on LFW vs 99.6% for ArcFace R50. The 0.4% difference rarely matters in practice; the bundle-size gain does.
import CoreML
import Foundation

// Load MobileFaceNet and run inference on a pre-aligned 112×112 face crop.
// Run this once for the selfie and once for the document photo; the output
// MLMultiArray is the 512-dimensional embedding.
let faceModel = try MobileFaceNet(configuration: MLModelConfiguration())
guard let embedding = try? faceModel.prediction(face_input: alignedFaceBuffer) else { return }

// Cosine similarity between two 512-dimensional embeddings.
func cosineSimilarity(_ a: MLMultiArray, _ b: MLMultiArray) -> Float {
    var dot: Float = 0
    var normA: Float = 0
    var normB: Float = 0
    for i in 0..<512 {
        let ai = a[i].floatValue
        let bi = b[i].floatValue
        dot += ai * bi
        normA += ai * ai
        normB += bi * bi
    }
    return dot / (sqrt(normA) * sqrt(normB))
}

// selfieEmbedding and documentEmbedding are the model outputs for the
// selfie and the document photo respectively.
let score = cosineSimilarity(selfieEmbedding, documentEmbedding)
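A minimal decision step on top of this score might look as follows; the threshold and margin values are illustrative and should come from calibrating the chosen model on a validation set:

```swift
// Match decision based on cosine similarity. The threshold is model-specific
// and should come from calibration (FAR/FRR trade-off), not a hardcoded constant.
enum MatchResult { case match, noMatch, inconclusive }

func decide(score: Float, threshold: Float = 0.65, margin: Float = 0.05) -> MatchResult {
    if score >= threshold + margin { return .match }
    if score < threshold - margin { return .noMatch }
    // Scores near the threshold are better routed to manual review
    // than force-classified either way.
    return .inconclusive
}
```

The inconclusive band is a design choice: in KYC flows it is usually cheaper to escalate a borderline pair to a human reviewer than to eat either the fraud risk or the false rejection.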
On the Apple Neural Engine (A14 and later), MobileFaceNet inference takes ~25 ms; on an iPhone SE 2nd gen, ~180 ms. If the target audience is largely on budget devices, server-side inference is more cost-effective.
Edge Case: Low-Quality Document Photos
A passport photo is heavily compressed, and often printed and re-photographed. Typical problems:
- Overexposure when shooting passport page (glare on glossy film).
- Moiré patterns from print raster during scanning.
- Aging factor: passport issued 9 years ago, person has aged.
Preprocessing pipeline for the document photo: gamma correction, denoising (Core Image's CINoiseReduction), glare reduction via CIHighlightShadowAdjust. Then detection and alignment proceed as usual.
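To make the gamma-correction step concrete, here it is in isolation over a raw grayscale buffer (a sketch; in the real pipeline the Core Image filters above do the heavy lifting):

```swift
import Foundation

/// Apply gamma correction to 8-bit grayscale pixels.
/// gamma < 1 brightens shadows (underexposed document pages);
/// gamma > 1 darkens highlights (tames glare on glossy lamination).
func gammaCorrect(_ pixels: [UInt8], gamma: Double) -> [UInt8] {
    // Precompute a 256-entry lookup table: out = 255 * (in / 255)^gamma.
    let lut: [UInt8] = (0...255).map { v in
        UInt8((pow(Double(v) / 255.0, gamma) * 255.0).rounded())
    }
    return pixels.map { lut[Int($0)] }
}
```

The lookup table keeps the cost linear in the number of pixels, which matters when the correction runs on every camera frame of the document-capture screen.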
The aging factor can be partially compensated via an age-invariant model or explicit normalization: if the birth date on the document is more than 40 years ago, lower the similarity threshold by 0.03–0.05.
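The explicit-normalization heuristic can be sketched like this (the `adjustedThreshold` helper and its linear scaling are illustrative; tune both against your own validation data):

```swift
import Foundation

/// Lower the similarity threshold when the aging factor is likely in play.
/// The 40-year cutoff and the 0.03–0.05 range follow the heuristic above;
/// the linear ramp between them is an assumption, not a calibrated curve.
func adjustedThreshold(base: Float, yearsSinceBirthDate: Int) -> Float {
    guard yearsSinceBirthDate > 40 else { return base }
    // Scale the reduction from 0.03 up to 0.05 as age increases past 40.
    let extraYears = min(yearsSinceBirthDate - 40, 20)
    let reduction = 0.03 + 0.02 * Float(extraYears) / 20.0
    return base - reduction
}
```

Any such adjustment widens the door for impostors in that demographic, so the lowered threshold should be re-audited against FAR on the same group.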
Server-Side Verification for High-Assurance Scenarios
On-device matching suits internal services. For financial products (banking, crypto onboarding), server-side verification with an audit trail is required: log embeddings (not photos!), timestamp, device fingerprint, and similarity score. Sending photos to the server is undesirable; send embeddings only. That is both a privacy and a bandwidth win.
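The audit-trail record might look like this; the field names are illustrative, not a fixed schema:

```swift
import Foundation

// One verification event for the audit trail: embeddings and metadata,
// never the raw photos.
struct FaceMatchAuditRecord: Codable {
    let selfieEmbedding: [Float]      // 512-d vector
    let documentEmbedding: [Float]    // 512-d vector
    let similarityScore: Float
    let timestamp: Date
    let deviceFingerprint: String
    let modelVersion: String          // which embedding model produced the vectors
}

func encodeAuditRecord(_ record: FaceMatchAuditRecord) throws -> Data {
    let encoder = JSONEncoder()
    encoder.dateEncodingStrategy = .iso8601
    return try encoder.encode(record)
}
```

Logging the model version alongside the vectors matters: embeddings from different models are not comparable, so a model upgrade invalidates stored vectors for future re-checks.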
Server stack: Python + insightface (ArcFace R100) + FAISS for batch search + PostgreSQL with pgvector for embedding storage. Latency: ~150–300 ms on a T4 GPU.
Protection from Attacks
Face Match without liveness is attacked with a printed photo; without anti-spoofing, with a mask. Integration with Liveness Detection is mandatory in any production scenario. Face Match itself is the final step after a liveness pass, not a standalone module.
Implementation Stages
Choose model (on-device/server) → integrate detection + alignment → embedding + similarity → threshold tuning → test edge cases (glasses, beard, poor lighting, old photos) → integration with liveness and IDV flow → audit accuracy by demographics → release.
Timeline: integrating a ready-made Core ML/TFLite model takes 3–5 weeks; with server inference, an audit trail, and model fine-tuning, 8–14 weeks. Cost is calculated individually.