Development of AI Virtual Assistant with Realistic Facial Expressions and Gestures
The difference between a virtual assistant with realistic facial expressions and one without is the difference between a tool and an experience. Facial expressions create emotional resonance, reduce the cognitive load of perception, and increase trust in the information delivered. We build the complete stack: from dialogue to rendering.
Technical Stack for Facial Expressions
FACS-based Facial Animation: the Facial Action Coding System (FACS, Ekman) is the standard for describing expressions via Action Units (AUs). The system generates AU vectors in real time based on:
- Emotional tone of LLM response (sentiment analysis → emotion mapping)
- Accents in TTS audio (prosody analysis → brow raise, lip corners)
- Dialogue context (question → nod, uncertainty → frown)
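The emotion-to-AU mapping above can be sketched as a simple lookup scaled by intensity. This is a minimal illustration: the AU table and blend logic are assumptions, not part of FACS itself or any standard library.

```python
# Minimal sketch: map a detected emotion to FACS Action Unit activations.
# The EMOTION_TO_AU table below is an illustrative assumption; a real
# system would tune these weights against reference expressions.
# AU1 = inner brow raiser, AU2 = outer brow raiser, AU4 = brow lowerer,
# AU6 = cheek raiser, AU12 = lip corner puller, AU15 = lip corner depressor.
EMOTION_TO_AU = {
    "happy":     {"AU6": 0.8, "AU12": 1.0},
    "curious":   {"AU1": 0.6, "AU2": 0.5},
    "uncertain": {"AU4": 0.7, "AU15": 0.4},
}

def au_vector(emotion: str, intensity: float) -> dict[str, float]:
    """Scale the base AU activations for an emotion by its intensity."""
    base = EMOTION_TO_AU.get(emotion, {})
    return {au: round(w * intensity, 3) for au, w in base.items()}

print(au_vector("curious", 0.7))  # {'AU1': 0.42, 'AU2': 0.35}
```

In practice this vector would be blended with the prosody- and context-driven signals before being sent to the rig each frame.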
NVIDIA Audio2Face: neural lip sync that maps audio to facial-muscle animation (jaw, lips, cheeks). Latency: <33 ms when running locally. Integrates with MetaHuman via Live Link.
Gesture Generation: Gesticulator (ICMI 2020) / DiffuseStyleGesture — neural, speech-synchronized hand-gesture generation. Speech2Gesture — a data-driven approach trained on motion-capture corpora.
Eye Behavior: a procedural system: saccades (rapid eye movements), smooth pursuit, blink rate (adapts to context — internal thinking → slower blinking), vergence (focusing on the speaker via webcam face tracking).
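A procedural eye system like this usually samples blink intervals and small gaze offsets from simple distributions. The sketch below is illustrative; the rate constants are plausible placeholders, not published values.

```python
# Illustrative sketch of procedural blink timing and micro-saccades.
# Mean blink intervals (3.5 s / 6.0 s) are assumed placeholders.
import random

def next_blink_interval(thinking: bool, rng: random.Random) -> float:
    """Sample seconds until the next blink; 'thinking' slows the rate."""
    mean = 6.0 if thinking else 3.5  # assumed mean inter-blink intervals
    return rng.expovariate(1.0 / mean)

def saccade_offset(rng: random.Random, max_deg: float = 2.0) -> tuple[float, float]:
    """Small random gaze offset (degrees) simulating a micro-saccade."""
    return (rng.uniform(-max_deg, max_deg), rng.uniform(-max_deg, max_deg))

rng = random.Random(42)
print(f"next blink in {next_blink_interval(thinking=True, rng=rng):.2f} s")
print(f"saccade offset: {saccade_offset(rng)}")
```

Vergence would sit on top of this, overriding the gaze target whenever the webcam face tracker reports a speaker position.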
Dialogue System
LLM (GPT-4o or self-hosted Llama 3 70B) + RAG for domain knowledge. Emotion-aware system prompt: besides the answer, the model emits a JSON emotion tag {emotion: "curious", intensity: 0.7} → emotion controller → facial expressions.
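Parsing that emotion tag can be sketched as below. The {"emotion": ..., "intensity": ...} schema follows the text; the neutral fallback for malformed model output is an assumption.

```python
# Sketch: extract the emotion tag the LLM is prompted to emit alongside
# its answer. Falling back to a neutral default on malformed output is
# an assumed design choice, not part of the source pipeline.
import json

def parse_emotion_tag(raw: str) -> tuple[str, float]:
    """Return (emotion, intensity), clamped to [0, 1]; neutral on bad output."""
    try:
        tag = json.loads(raw)
        emotion = str(tag["emotion"])
        intensity = min(max(float(tag["intensity"]), 0.0), 1.0)
        return emotion, intensity
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return "neutral", 0.3  # assumed safe default

print(parse_emotion_tag('{"emotion": "curious", "intensity": 0.7}'))
# ('curious', 0.7)
```

The returned pair is what the emotion controller consumes to drive the facial-expression layer.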
Streaming TTS (ElevenLabs WebSocket API) for sub-second time to first audio. The avatar starts moving before the full response has finished generating.
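The key pattern is that playback and lip sync start on the first audio chunk, not after the full utterance. A generic sketch, with `tts_stream` and the playback hand-off as hypothetical stand-ins for the real streaming client and audio sink:

```python
# Generic streaming-TTS consumer sketch: lip sync kicks off on the
# first chunk. `tts_stream` simulates a network client; a real system
# would wrap the provider's WebSocket API here.
import asyncio
import time

async def tts_stream():
    """Stand-in for a streaming TTS client yielding audio chunks."""
    for _ in range(3):
        await asyncio.sleep(0.05)   # simulated network latency per chunk
        yield b"\x00" * 320          # fake 10 ms PCM chunk

async def speak(on_first_chunk) -> float:
    """Feed chunks to playback; return time-to-first-audio in seconds."""
    start = time.monotonic()
    first = None
    async for chunk in tts_stream():
        if first is None:
            first = time.monotonic() - start
            on_first_chunk(chunk)    # e.g. start Audio2Face / playback
        # subsequent chunks stream straight to the audio device
    return first

latency = asyncio.run(speak(lambda chunk: None))
print(f"first audio after {latency * 1000:.0f} ms")
```

This overlap is what lets the avatar begin speaking while the LLM is still generating the tail of its response.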
Rendering
Web: Three.js + morph targets for browser delivery without plugins (30 fps on mid-range hardware)
Desktop/Kiosk: Unreal Engine 5 Pixel Streaming (60 fps photorealism, requires GPU server)
Mobile: Unity + ARKit/ARCore (25–30 fps on iPhone 12+)
Development Pipeline
Weeks 1–4: Avatar design and 3D modeling. FACS system setup.
Weeks 5–9: Dialogue pipeline (STT → LLM → TTS with emotion tagging). Audio2Face integration.
Weeks 10–14: Gesture system. Eye behavior. Component integration.
Weeks 15–18: Latency optimization. Load testing. UX testing with users.
Latency Target Metrics
| Component | Target Latency |
|---|---|
| STT (Whisper large) | 200–400 ms |
| LLM (streaming first token) | 100–300 ms |
| TTS (first audio chunk) | 100–200 ms |
| Lip sync start | <33 ms from audio |
| Total Perceived Response | 500–1000 ms |
User research shows that delays of up to 1200 ms are perceived as normal in a conversational context.
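As a sanity check on the table, the serial components (STT → LLM first token → TTS first chunk) can be summed; lip sync overlaps the audio, so it is not added serially.

```python
# Sum the serial latency budgets from the table above. Lip sync (<33 ms)
# runs concurrently with audio playback, so it is excluded here.
BUDGET_MS = {
    "stt": (200, 400),
    "llm_first_token": (100, 300),
    "tts_first_chunk": (100, 200),
}

low = sum(lo for lo, _ in BUDGET_MS.values())
high = sum(hi for _, hi in BUDGET_MS.values())
print(f"serial pipeline: {low}-{high} ms")  # serial pipeline: 400-900 ms
```

That leaves the pipeline inside the 500–1000 ms total target, with headroom for transport and scheduling overhead.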