Hand and Gesture Tracking Setup in AR Games
When a user sees their real hands over AR content and can interact with virtual objects without controllers — that's a different level of immersion. But under the hood this is one of the most technically non-trivial tasks in AR development: tracking 21 points on each hand in real-time, recognizing gestures from a continuous stream of poses, and doing all this with changing lighting, partial hand occlusion, and device movement.
Hardware and software foundation
For mobile AR (iOS/Android), the main path is AR Foundation together with Unity's XR Hands package (XRHandSubsystem), introduced alongside AR Foundation 5.x. An important caveat: neither platform ships hand tracking in its core AR SDK. ARKit has no hand tracking API on iPhone — Apple's 21-point hand pose detection lives in the Vision framework (VNDetectHumanHandPoseRequest, iOS 14+) and has to be bridged via a native plugin. ARCore likewise has no built-in hand tracking. On Android you need either a device with Qualcomm Snapdragon Spaces support, or a third-party solution — MediaPipe Hands (via plugin or native integration) or the XR Hands package with a custom provider.
For headsets without controllers: Meta Quest 2/3 — through OVR Hand components from the Meta XR SDK, or through the OpenXR Hand Tracking extension (XR_EXT_hand_tracking). The second path is preferable for cross-platform projects: a single codebase runs on Quest, Pico, and HoloLens 2 without per-SDK conditional compilation.
HoloLens 2 — through Microsoft Mixed Reality Toolkit (MRTK3), which provides an abstraction over WMR hand tracking and already includes ready-made components for pinch, grab, and poke interactions.
Why gesture recognition is harder than it seems
Gesture recognition is not simply "check if the palm is closed". It's classification of temporal pose patterns accounting for transition states and tracking noise.
Problem one: joint jitter. Even a stationary hand shows joint position fluctuations of ±2–5 mm due to sensor noise. If you check the "pinch" gesture against raw thumb and index fingertip positions, the trigger will fire randomly. Solution: a low-pass filter on each joint's position (a Kalman filter, or a simple exponential moving average with alpha ≈ 0.3–0.5) before classification.
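The exponential moving average mentioned above is a one-liner per axis. Here is a minimal, hedged sketch in Python (in a real Unity project this logic would live in C#, applied to each tracked joint's position every frame; the class name is hypothetical):

```python
class JointSmoother:
    """Exponential moving average filter for one joint's 3D position."""

    def __init__(self, alpha=0.4):
        # alpha in [0, 1]: higher = more responsive, lower = smoother.
        # The 0.3-0.5 range from the text trades roughly a frame of
        # lag for visibly stable output.
        self.alpha = alpha
        self.value = None  # last filtered (x, y, z) position

    def update(self, raw):
        """Blend the new raw sample into the filtered position."""
        if self.value is None:
            self.value = raw  # first sample: no history to blend with
        else:
            self.value = tuple(
                self.alpha * r + (1 - self.alpha) * v
                for r, v in zip(raw, self.value)
            )
        return self.value
```

One smoother instance per joint is enough; gesture classification then reads `value` instead of the raw pose.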
Problem two: threshold hysteresis. A single threshold for both entering and exiting a gesture state is the path to state flickering. The correct approach: one threshold for activating a gesture (e.g., pinch_distance < 15 mm) and a separate, wider threshold for deactivating it (pinch_distance > 25 mm). This is a standard technique, but an often-forgotten one.
Problem three: temporal consistency. A gesture must be held for a minimum of N frames (usually 3–5 at 30 fps) before being recognized. This filters out accidental matches during pose transitions.
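Hysteresis and the N-frame hold combine naturally into one small state machine. A sketch in Python using the thresholds from the text (15 mm enter, 25 mm exit; the class and parameter names are illustrative, not from any SDK):

```python
class PinchDetector:
    """Pinch state machine with hysteresis and temporal debounce."""

    def __init__(self, enter_mm=15.0, exit_mm=25.0, min_frames=4):
        self.enter_mm = enter_mm    # activation threshold (tight)
        self.exit_mm = exit_mm      # deactivation threshold (wide)
        self.min_frames = min_frames
        self.active = False
        self._streak = 0  # consecutive frames the transition condition held

    def update(self, distance_mm):
        """Feed the (filtered) thumb-index distance once per frame."""
        if not self.active:
            # Count frames below the tighter activation threshold.
            self._streak = self._streak + 1 if distance_mm < self.enter_mm else 0
            if self._streak >= self.min_frames:
                self.active, self._streak = True, 0
        else:
            # Exit uses the wider threshold, so jitter around 15 mm
            # cannot flicker the state on and off.
            self._streak = self._streak + 1 if distance_mm > self.exit_mm else 0
            if self._streak >= self.min_frames:
                self.active, self._streak = False, 0
        return self.active
```

Distances between 15 and 25 mm keep whatever state is current, which is exactly the flicker-free behavior hysteresis is for.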
XR Interaction Toolkit with the XR Hands package provides hand-shape and gesture-detection components that implement part of this logic. But for non-standard gestures (e.g., "draw a circle in the air" or "double pinch") you need custom logic: a state machine or a sequence recognizer with temporal windows.
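As an example of such a sequence recognizer, here is a hedged sketch of a "double pinch" detector: it consumes the boolean output of a debounced pinch detector plus a timestamp, and fires when a second pinch begins within a time window of the first. The class name and the 0.6 s window are assumptions for illustration:

```python
class DoublePinchRecognizer:
    """Fires once when two pinch activations occur within max_gap_s."""

    def __init__(self, max_gap_s=0.6):
        self.max_gap_s = max_gap_s
        self._prev_active = False   # pinch state on the previous frame
        self._last_rise_t = None    # timestamp of the previous pinch onset

    def update(self, pinch_active, t):
        """Call once per frame with the debounced pinch state and time."""
        fired = False
        rising = pinch_active and not self._prev_active  # pinch just started
        if rising:
            if self._last_rise_t is not None and (t - self._last_rise_t) <= self.max_gap_s:
                fired = True
                self._last_rise_t = None  # consume the pair
            else:
                self._last_rise_t = t     # first pinch of a potential pair
        self._prev_active = pinch_active
        return fired
```

The same edge-plus-window pattern generalizes to longer sequences (pinch, hold, release, ...) by replacing the single stored timestamp with a small state machine.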
Integrating hand tracking with game objects
Once tracking is set up, you need to bind hands to game logic. Typical AR interactions: near interaction (touch/press object with hand) and far interaction (ray from palm or finger).
For near interaction the key piece is a properly configured Poke Interactor from MRTK, or a custom collider on the fingertip with isTrigger = true. The main problem: during fast motion a fingertip can travel 5–10 cm in a single frame, tunneling through thin objects without ever firing the trigger. Solution: a sweep test (e.g., a SphereCast along the fingertip's movement vector) instead of a plain collider check.
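In Unity this is what Physics.SphereCast from the previous to the current fingertip position does. To show the geometry behind it, here is a minimal Python sketch of a swept-sphere test against a thin planar surface (a UI panel, say); all function names are hypothetical, and a real engine query would also clip against the panel's bounds:

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _normalize(v):
    m = math.sqrt(_dot(v, v))
    return (v[0] / m, v[1] / m, v[2] / m)

def sweep_hits_plane(prev, curr, radius, plane_point, plane_normal):
    """True if a sphere of `radius` moving from prev to curr touches
    or crosses the plane -- including the tunnelling case where the
    two frame positions lie on opposite sides."""
    n = _normalize(plane_normal)
    d0 = _dot(_sub(prev, plane_point), n)   # signed distance at frame start
    d1 = _dot(_sub(curr, plane_point), n)   # signed distance at frame end
    if d0 * d1 <= 0:
        return True                          # crossed between frames
    return min(abs(d0), abs(d1)) <= radius   # grazing contact
```

A simple overlap check would only evaluate the endpoint and miss the sign-change case entirely — which is precisely the fast-fingertip bug described above.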
For AR games on mobile devices it's important to account for the hand periodically leaving the camera's field of view. Implement graceful degradation: when tracking is lost, terminate the current interaction cleanly rather than leaving the object "hanging" in the air at its last known pose.
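One workable pattern is a short grace period: brief dropouts (a frame or two of lost tracking) are bridged silently, while longer ones cancel the interaction. A hedged Python sketch — the class, callbacks, and the 0.15 s grace value are illustrative, not from any SDK:

```python
class GrabSession:
    """Tracks one hand-object interaction and cancels it cleanly
    when hand tracking drops out for longer than a grace period."""

    GRACE_S = 0.15  # brief dropouts are bridged; longer ones cancel

    def __init__(self, on_release, on_cancel):
        self.on_release = on_release  # normal release (user let go)
        self.on_cancel = on_cancel    # abort (tracking lost): e.g. drop with physics
        self.held = None
        self._lost_since = None

    def update(self, tracked, now, grabbing, target=None):
        """Call once per frame with tracking state and gesture state."""
        if tracked:
            self._lost_since = None
            if grabbing and self.held is None and target is not None:
                self.held = target                # start holding
            elif not grabbing and self.held is not None:
                self.on_release(self.held)        # clean, intended release
                self.held = None
        elif self.held is not None:
            # Hand left the camera frustum while holding something.
            if self._lost_since is None:
                self._lost_since = now
            elif now - self._lost_since > self.GRACE_S:
                self.on_cancel(self.held)         # don't leave it hanging
                self.held = None
```

The release/cancel split matters: game logic often treats an intended release (place the object) differently from an abort (return it to its origin or let physics take over).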
Setup stages
We start by choosing the SDK for the target platforms and devices. This choice determines 80% of the subsequent architecture.
Then basic hand tracking setup: importing joint data, skeleton visualization for debugging, and checking tracking quality on target devices under different lighting conditions.
Next: implementing gestures (starting with pinch as the basic interaction trigger), integrating with interactable objects, and fine-tuning filters and thresholds.
Final stage — testing with real users. No developer can predict all the ways people hold their hands.
| Task | Estimated timeline |
|---|---|
| Basic hand tracking (pinch + grab) | 1–2 weeks |
| Custom gesture set (5–10 gestures) | 2–4 weeks |
| Complete interaction system without controllers | 4–8 weeks |
Cost is calculated after analyzing platforms and required interactions.