Setting up hand gesture recognition for AR game controls
Gesture control in AR means the game responds to the shape of your hand, not to a button press. An open palm calls up the menu. A pinch selects an object. A clenched fist triggers an attack. It sounds like magic, but it is implemented as math over joint positions.
What the platform provides and what you build yourself
Meta Quest (via the Meta Hand Tracking SDK or the OpenXR Hand Interaction extension) provides 26 joint positions for the fingers and wrist in world coordinates, updated at 30–60 Hz. This raw data is just bone positions.
ARKit on iOS (via AR Foundation) combined with the Vision framework provides similar hand landmarks through ARHandTrackingConfiguration, available from iOS 18. On Android, ARCore has no first-party hand-tracking API – use third-party solutions (MediaPipe via ML Kit) or Google ARCore plus a custom ML model.
Out of the box you get data, not gestures. Gesture recognition is your task to build.
How a gesture recognizer is built
Each gesture is a set of conditions over joint positions:
Pinch: the distance between ThumbTip and IndexTip is below a threshold (usually 2–3 cm in world units), plus an additional check that MiddleTip, RingTip, and PinkyTip are far from the thumb (the hand is open). Without the second condition, a fist falsely triggers as a pinch.
Open Palm: all fingertip joints are at a significant distance from the Palm joint. Check Vector3.Distance(fingertip, palm) > openThreshold for all five fingers. Additionally, the palm normal (the vector from the Palm to the Middle Metacarpal) should face roughly toward the camera, otherwise an open palm behind your back triggers too.
Point (pointing gesture): IndexTip is extended (large distance from IndexMetacarpal), the other fingers are bent (small tip-to-metacarpal distance), plus a check on the angle between the IndexProximal → IndexTip vector and the palm vector.
This isn't magic – just distance checks and dot products in Update().
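A minimal sketch of the checks above, assuming joint positions already come from your hand-tracking source; the class name, method signatures, and threshold values are illustrative, not from any SDK:

```csharp
using UnityEngine;

// Hypothetical helper: pure distance/dot-product predicates over joint positions.
public static class GesturePredicates
{
    const float PinchThreshold = 0.025f; // ~2.5 cm in world units
    const float OpenThreshold  = 0.08f;  // fingertip-to-palm distance for "extended"

    // Pinch: thumb and index tips close, other fingertips away from the thumb.
    public static bool IsPinch(Vector3 thumbTip, Vector3 indexTip,
                               Vector3 middleTip, Vector3 ringTip, Vector3 pinkyTip)
    {
        bool tipsTouch = Vector3.Distance(thumbTip, indexTip) < PinchThreshold;
        bool othersOpen = Vector3.Distance(middleTip, thumbTip) > OpenThreshold
                       && Vector3.Distance(ringTip,   thumbTip) > OpenThreshold
                       && Vector3.Distance(pinkyTip,  thumbTip) > OpenThreshold;
        // Without othersOpen, a clenched fist falsely reads as a pinch.
        return tipsTouch && othersOpen;
    }

    // Open palm: all five fingertips far from the palm joint,
    // and the palm roughly facing the camera.
    public static bool IsOpenPalm(Vector3 palm, Vector3[] fingertips,
                                  Vector3 palmNormal, Vector3 toCamera)
    {
        foreach (var tip in fingertips)
            if (Vector3.Distance(tip, palm) < OpenThreshold) return false;
        return Vector3.Dot(palmNormal.normalized, toCamera.normalized) > 0.5f;
    }
}
```

Tune the thresholds against real tracking data: hand sizes vary, and a fixed 2.5 cm that works for an adult hand may be too tight for a child's.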
Debouncing and preventing false positives
Gestures need debouncing. The hand naturally trembles – ThumbTip and IndexTip can accidentally approach and separate within a single frame. Without debouncing, a pinch can trigger five times per second during a single intended gesture.
The standard technique is a state machine with a time threshold: a gesture is considered active only if its condition has held continuously for at least N frames (or T seconds). Typical values: 3–5 frames for quick gestures, 10–15 frames (≈ 0.2 s at 60 Hz) for static poses like Open Palm.
Additionally, apply a confidence filter. The Meta Hand Tracking SDK exposes per-hand tracking confidence via OVRHand – at low confidence (hand partially out of view, poor lighting), gestures are not processed. This is critical for AR on a smartphone, where conditions are unpredictable.
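A sketch of such a debounce state machine, here time-based rather than frame-based; the class and the trackingConfident flag (which the caller would derive from the SDK's confidence value) are illustrative:

```csharp
// Hypothetical debounce: the gesture becomes active only after its raw
// condition has held continuously for holdTime seconds; any break, or a
// drop in tracking confidence, resets the timer.
public class DebouncedGesture
{
    readonly float holdTime; // e.g. ~0.05f for quick gestures, 0.2f for static poses
    float heldFor;
    public bool IsActive { get; private set; }

    public DebouncedGesture(float holdTime) { this.holdTime = holdTime; }

    // Call once per frame with the raw predicate result and deltaTime.
    public void Update(bool rawCondition, bool trackingConfident, float deltaTime)
    {
        if (!rawCondition || !trackingConfident)
        {
            heldFor = 0f;
            IsActive = false;
            return;
        }
        heldFor += deltaTime;
        if (heldFor >= holdTime) IsActive = true;
    }
}
```

Seconds rather than frames keeps the behavior stable when the tracking rate fluctuates between 30 and 60 Hz.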
Integration into AR Foundation
In AR Foundation (Unity), hand tracking is wired up via XRHandSubsystem (package com.unity.xr.hands). XRHandJoint for each joint provides TryGetPose() – position and rotation in space. The gesture recognizer subscribes to the XRHandSubsystem.updatedHands event and processes the data in a callback.
It's important not to do heavy computation in this callback – it may not be invoked on the main thread. Either buffer the data and process it in Update, or use the Job System with IJobParallelFor for multi-hand recognition.
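A sketch of the buffer-then-process pattern, assuming the com.unity.xr.hands package; the component name is made up, and the subsystem lookup is shortened (a real project would locate it through its XR loader setup):

```csharp
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.XR.Hands;

// Hypothetical driver: copy joint poses in the subsystem callback,
// run gesture math on the main thread in Update.
public class HandGestureDriver : MonoBehaviour
{
    XRHandSubsystem hands;
    Pose thumbTip, indexTip;
    bool posesValid;

    void OnEnable()
    {
        var list = new List<XRHandSubsystem>();
        SubsystemManager.GetSubsystems(list);
        if (list.Count > 0)
        {
            hands = list[0];
            hands.updatedHands += OnHandsUpdated;
        }
    }

    void OnDisable()
    {
        if (hands != null) hands.updatedHands -= OnHandsUpdated;
    }

    void OnHandsUpdated(XRHandSubsystem subsystem,
                        XRHandSubsystem.UpdateSuccessFlags flags,
                        XRHandSubsystem.UpdateType updateType)
    {
        // Only buffer here; keep heavy computation out of the callback.
        var hand = subsystem.rightHand;
        posesValid = hand.isTracked
            && hand.GetJoint(XRHandJointID.ThumbTip).TryGetPose(out thumbTip)
            && hand.GetJoint(XRHandJointID.IndexTip).TryGetPose(out indexTip);
    }

    void Update()
    {
        if (!posesValid) return;
        bool pinching = Vector3.Distance(thumbTip.position, indexTip.position) < 0.025f;
        // Feed `pinching` into the debounce state machine / gesture events here.
    }
}
```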
For MediaPipe on Android, a separate integration is needed via a native plugin or ready-made wrappers (mediapipe-unity-plugin); the data arrives through a callback with the ML results.
Timeline: a basic set of 3–5 gestures on the Meta SDK takes 3–5 business days; a full system with debouncing, a confidence filter, and 10+ gestures for AR Foundation takes 1–2 weeks. Cost is determined individually.