Implementing Face Tracking and Recognition in AR Applications
Face tracking in mobile applications is mature technology with clearly defined capabilities and limitations. ARKit with the TrueDepth camera produces a face depth map with millimeter precision. ARCore Augmented Faces and MediaPipe run on the RGB camera and track fast motion slightly worse, but work on any device. The choice depends on the task — and it's easy to pick the wrong stack.
ARKit Face Tracking: What Exactly You Get
ARFaceTrackingConfiguration requires iPhone X or newer (TrueDepth front camera). Returns ARFaceAnchor:
geometry — ARFaceGeometry with 1220 vertices and 2304 triangles. The face mesh is in real-world scale (meters) and updates ~30 times per second. Each vertex has a fixed index, so you can address the nose tip specifically (vertex ~9), the mouth corners (~37, ~45), the pupils.
blendShapes — a dictionary of 52 blend shape coefficients: browDownLeft, eyeBlinkLeft, jawOpen, mouthSmileLeft, etc. Each is a Float from 0 to 1. This is the foundation for face-driven animation (morph targets, 3D avatars) and expression recognition.
leftEyeTransform, rightEyeTransform — position and orientation of each eye, for eye tracking and gaze direction.
func session(_ session: ARSession, didUpdate anchors: [ARAnchor]) {
    guard let faceAnchor = anchors.first as? ARFaceAnchor else { return }
    let blinkLeft = faceAnchor.blendShapes[.eyeBlinkLeft]?.floatValue ?? 0
    let jawOpen = faceAnchor.blendShapes[.jawOpen]?.floatValue ?? 0
    // Control UI with blink/mouth
    if blinkLeft > 0.7 { triggerAction() }
}
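The eye transforms can drive a simple gaze estimate. A sketch, assuming ARKit's convention that each eye transform is relative to the face anchor with -Z as the forward direction (the helper name is hypothetical; ARFaceAnchor's built-in lookAtPoint is the simpler alternative):

```swift
import ARKit
import simd

// Estimate the averaged gaze direction of both eyes, in world space.
// Hypothetical helper; the -Z forward convention follows ARKit.
func gazeDirection(for faceAnchor: ARFaceAnchor) -> simd_float3 {
    // Eye transforms are expressed relative to the face anchor.
    let leftEyeWorld  = faceAnchor.transform * faceAnchor.leftEyeTransform
    let rightEyeWorld = faceAnchor.transform * faceAnchor.rightEyeTransform
    // The third column is the eye's local Z axis; negate it for "forward".
    let leftForward  = -simd_float3(leftEyeWorld.columns.2.x,
                                    leftEyeWorld.columns.2.y,
                                    leftEyeWorld.columns.2.z)
    let rightForward = -simd_float3(rightEyeWorld.columns.2.x,
                                    rightEyeWorld.columns.2.y,
                                    rightEyeWorld.columns.2.z)
    // Average the two eyes and normalize.
    return simd_normalize(leftForward + rightForward)
}
```
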
ARCore AugmentedFace and MediaPipe
ARCore Augmented Faces (Android only, not supported on iOS): a 468-point face mesh from an ML model on the RGB camera. AugmentedFace.RegionType — NOSE_TIP, FOREHEAD_RIGHT, FOREHEAD_LEFT — gives key points. Fewer points than ARKit and no depth map, but it works on the vast majority of modern Android devices without a special sensor.
MediaPipe Face Landmarker Task — the cross-platform option (iOS, Android, Web). 478 points (the 468-point mesh plus 10 iris landmarks). Works via VisionImage / MPImage. Open source, free. For tasks without strict realtime requirements (photo analysis, static filters) it's an excellent choice. For a 30fps live camera it needs a device with a Neural Engine (iPhone) or a modern Android ML accelerator.
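On iOS the flow looks roughly like this. A sketch following the MediaPipe Tasks Vision Swift API; the model file name and option names are assumptions — verify against the current MediaPipe docs for your version:

```swift
import MediaPipeTasksVision
import UIKit

// One-shot landmark detection on a still image.
// "face_landmarker.task" is an assumed bundled model file name.
func detectLandmarks(in image: UIImage) throws -> FaceLandmarkerResult {
    let options = FaceLandmarkerOptions()
    options.baseOptions.modelAssetPath = Bundle.main.path(
        forResource: "face_landmarker", ofType: "task")!
    options.runningMode = .image   // use .liveStream for a camera feed
    let landmarker = try FaceLandmarker(options: options)
    return try landmarker.detect(image: try MPImage(uiImage: image))
}
```
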
Classification and Expression Recognition
Basic tasks on blendShapes without ML:
- Smile: mouthSmileLeft + mouthSmileRight > 0.5
- Wink: eyeBlinkLeft > 0.85 with eyeBlinkRight < 0.3
- Surprise: eyeWideLeft + eyeWideRight > 1.2 with browInnerUp > 0.5
- Mouth open: jawOpen > 0.4
This works for simple triggers — game mechanics, hands-free interface control. For emotion recognition (joy, sadness, anger) you need ML classification on top of blendShapes. Create ML can train a classifier (e.g., a tabular MLClassifier) on recorded blendShape feature vectors.
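The threshold triggers above can be wrapped in a plain function with no ML dependency. A minimal sketch — the type and function names are illustrative, the cutoffs are the ones listed:

```swift
import Foundation

enum Expression: String {
    case smile, wink, surprise, mouthOpen, neutral
}

// Classify an expression from blendshape coefficients (each 0...1).
// Keys mirror ARKit's blendShapes names; missing keys default to 0.
func classify(_ shapes: [String: Float]) -> Expression {
    func v(_ key: String) -> Float { shapes[key] ?? 0 }
    if v("mouthSmileLeft") + v("mouthSmileRight") > 0.5 { return .smile }
    if v("eyeBlinkLeft") > 0.85 && v("eyeBlinkRight") < 0.3 { return .wink }
    if v("eyeWideLeft") + v("eyeWideRight") > 1.2
        && v("browInnerUp") > 0.5 { return .surprise }
    if v("jawOpen") > 0.4 { return .mouthOpen }
    return .neutral
}
```

Per-frame noise makes raw thresholds flicker; in practice you'd require the condition to hold for a few consecutive frames before firing a trigger.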
Recognizing a Specific Person
Face recognition (identity verification) is a fundamentally different task, not covered by face tracking. For identification: Vision framework VNDetectFaceRectanglesRequest to locate and crop the face → face embedding via a CoreML model (ArcFace, FaceNet, or similar). Compare embedding vectors against a database.
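The embedding comparison itself is typically cosine similarity against an enrollment database. A minimal sketch — the function names and the 0.5 acceptance threshold are assumptions; tune the threshold per model:

```swift
import Foundation

// Cosine similarity between two embeddings of equal length.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count)
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in a.indices {
        dot   += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (sqrt(normA) * sqrt(normB))
}

// Return the best-matching identity above the threshold, if any.
func identify(_ probe: [Float], in database: [String: [Float]],
              threshold: Float = 0.5) -> String? {
    let best = database
        .map { ($0.key, cosineSimilarity(probe, $0.value)) }
        .max { $0.1 < $1.1 }
    guard let (name, score) = best, score >= threshold else { return nil }
    return name
}
```
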
For verifying the device owner there's LocalAuthentication with LAContext.evaluatePolicy(.deviceOwnerAuthenticationWithBiometrics) for Face ID. This isn't an SDK for your own matching logic — it's system biometry. Using system Face ID for user verification in an app is simpler and more secure than building your own.
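The system path is short. A sketch with a hypothetical wrapper function and prompt string:

```swift
import LocalAuthentication

// Verify the device owner with system biometrics (Face ID / Touch ID).
func authenticateUser(completion: @escaping (Bool) -> Void) {
    let context = LAContext()
    var error: NSError?
    guard context.canEvaluatePolicy(.deviceOwnerAuthenticationWithBiometrics,
                                    error: &error) else {
        completion(false)   // biometrics unavailable or not enrolled
        return
    }
    context.evaluatePolicy(.deviceOwnerAuthenticationWithBiometrics,
                           localizedReason: "Confirm it's you") { success, _ in
        DispatchQueue.main.async { completion(success) }
    }
}
```

Face ID also requires the NSFaceIDUsageDescription key in Info.plist.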
Latency and Performance
ARKit face tracking + 3D mask + environment occlusion on iPhone 12 — ~8–12% CPU, ~30% GPU at 60fps UI. On iPhone XR (A12) consumption is higher, sometimes with thermal throttling during long sessions. Monitor via os_signpost + Instruments → Metal System Trace.
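To make per-frame cost visible in Instruments, wrap the frame-processing work in signpost intervals. A sketch; the subsystem string and interval name are placeholders:

```swift
import os.signpost

// Mark frame-processing intervals so they appear in Instruments.
let faceLog = OSLog(subsystem: "com.example.faceAR",   // placeholder ID
                    category: .pointsOfInterest)

func processFrame(_ work: () -> Void) {
    let id = OSSignpostID(log: faceLog)
    os_signpost(.begin, log: faceLog, name: "FaceFrame", signpostID: id)
    work()
    os_signpost(.end, log: faceLog, name: "FaceFrame", signpostID: id)
}
```
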
Two cameras simultaneously (front + rear) aren't available through ARFaceTrackingConfiguration alone, but on A12+ devices ARWorldTrackingConfiguration.userFaceTrackingEnabled (iOS 13+) tracks the user's face while the rear camera runs world tracking. For selfie AR, check ARFaceTrackingConfiguration.supportedVideoFormats to verify available resolutions.
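Picking a format from that list is a one-liner. A sketch (the helper name is illustrative; selection criteria — resolution vs. frame rate — depend on your use case):

```swift
import ARKit

// Choose the highest-resolution face-tracking video format
// the current device supports.
func bestFaceTrackingFormat() -> ARConfiguration.VideoFormat? {
    ARFaceTrackingConfiguration.supportedVideoFormats
        .max { $0.imageResolution.width < $1.imageResolution.width }
}
```
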
Timeline
Face tracking with basic blendShape triggers (game mechanics, hands-free control) — 1–2 weeks. Face tracking + 3D mask/accessories + video recording — 2–4 weeks. Expression recognition via ML classifier — plus 2–3 weeks. Cost calculated individually.