AI System for Digital Avatar Emotional Reactions
An avatar that "hears" and "feels" delivers a qualitatively different experience from one that simply speaks. Emotional reactions increase engagement, trust, and the perceived intelligence of the system. We build a complete emotion pipeline: from detecting the user's emotions to having the avatar express them.
System Architecture
Emotion Input Pipeline:
Voice channel: SpeechBrain or audeering/wav2vec2 models for emotion recognition from audio. A 4-class scheme (neutral, positive, negative, tense) reaches ~82% accuracy on IEMOCAP; an 8-class scheme (fear, anger, joy, sadness, surprise, disgust, contempt, neutral) reaches ~72%.
Video channel: DeepFace, FER+, or ABAW-style models for facial expression recognition over WebRTC. MediaPipe FaceMesh provides 478 keypoints that feed a classifier.
Text channel: BERT-based sentiment analysis (CardiffNLP) for message tone. Context-aware: "this task is difficult" is not negative when the context is technical.
Emotion Fusion: Bayesian fusion of the three channels, prioritized video > audio > text (when each is available). Temporal smoothing (exponential moving average over a 2–3 second window) prevents jittery emotion switches.
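The fusion and smoothing steps above can be sketched as follows. This is a minimal illustration, not the production engine: the channel weights, the 4-class label set, and the EMA coefficient are all illustrative assumptions.

```python
import numpy as np

# Assumed labels and channel weights reflecting the video > audio > text
# priority from the text; actual values would be calibrated per deployment.
LABELS = ["neutral", "positive", "negative", "tense"]
CHANNEL_WEIGHTS = {"video": 0.5, "audio": 0.3, "text": 0.2}

def fuse_channels(channel_probs):
    """Weighted log-linear (naive-Bayes-style) fusion of per-channel
    probability vectors over the emotion classes. Unavailable channels
    are simply omitted from channel_probs."""
    log_post = np.zeros(len(LABELS))
    total_w = 0.0
    for name, probs in channel_probs.items():
        w = CHANNEL_WEIGHTS[name]
        log_post += w * np.log(np.asarray(probs, dtype=float) + 1e-9)
        total_w += w
    log_post /= total_w  # renormalize when a channel is missing
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

class EmaSmoother:
    """Exponential moving average over fused probabilities. At roughly
    10 updates/sec, alpha ~ 2/(N+1) approximates a 2-3 s window."""
    def __init__(self, alpha=0.08):
        self.alpha = alpha
        self.state = None

    def update(self, probs):
        probs = np.asarray(probs, dtype=float)
        if self.state is None:
            self.state = probs
        else:
            self.state = self.alpha * probs + (1 - self.alpha) * self.state
        return LABELS[int(np.argmax(self.state))], self.state
```

The log-linear form means a missing channel degrades gracefully (the remaining weights are renormalized), and the EMA keeps a single noisy frame from flipping the avatar's expressed state.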
Emotion Output — Avatar:
Face: FACS-based blend shapes via an emotion-to-AU mapping. "Joy" maps to AU6 (cheek raiser) + AU12 (lip corner puller) + AU25 (lips part), with intensity scaling.
Voice: ElevenLabs voice settings (stability, similarity) tune TTS expressiveness in real time.
Gestures: a gesture clip library triggered by the emotion state. Positive states trigger open gestures; tense states reduce gesticulation.
Gaze: more eye contact during positive exchanges, gaze aversion during conflictual content.
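The emotion-to-AU mapping with intensity scaling can be sketched as a lookup plus a scale factor. Only the joy entry follows the AUs named in the text; the other entries and all base intensities are hypothetical placeholders.

```python
# Hypothetical emotion-to-FACS table. The "positive" row follows the
# joy example from the text (AU6 + AU12 + AU25); other rows and all
# base activations are illustrative, not calibrated values.
EMOTION_TO_AUS = {
    "positive": {"AU6": 1.0, "AU12": 1.0, "AU25": 0.6},
    "negative": {"AU1": 0.8, "AU4": 1.0, "AU15": 0.9},
    "tense":    {"AU4": 0.7, "AU5": 0.5, "AU23": 0.8},
    "neutral":  {},
}

def blend_shape_weights(emotion, intensity):
    """Scale base AU activations by the detected emotion intensity
    (clamped to 0..1) to produce blend-shape weights for the rig."""
    intensity = max(0.0, min(1.0, intensity))
    return {au: base * intensity
            for au, base in EMOTION_TO_AUS[emotion].items()}
```

For example, a half-intensity positive state yields AU6 and AU12 at 0.5 and AU25 at 0.3, so the same mapping covers both a slight smile and a broad one.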
Development Pipeline
Weeks 1–3: emotion detection setup (channel selection driven by requirements); testing on examples representative of the target audience.
Weeks 4–7: emotion fusion engine development; emotion-to-FACS-AU mapping; smooth transition implementation.
Weeks 8–11: integration with the existing avatar and TTS; testing that transitions feel natural.
Weeks 12–14: user study with real users; calibrating intensities and eliminating uncanny valley effects.
Evaluation
| Metric | Value |
|---|---|
| Emotion Detection Accuracy (4 class) | ~82% |
| Perceived Naturalness (5-point scale) | >3.8/5 |
| User Engagement (vs. non-emotional avatar) | +28–35% |
| Uncanny Valley Incidents | <5% of interactions |
Edge Cases
Sarcasm, cultural differences in emotion expression, and mixed emotions all reduce accuracy. For sensitive professional applications (psychotherapy, HR) we recommend a human-in-the-loop setup: the system flags uncertain emotional states for operator attention.
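One simple way to decide when to flag a state for the operator is to check the fused distribution for high entropy or a small margin between the top two classes. The thresholds below are illustrative assumptions and would be calibrated per deployment.

```python
import math

def should_flag_for_operator(probs, entropy_threshold=1.2, margin_threshold=0.15):
    """Flag an emotional state as uncertain when the fused probability
    distribution is high-entropy or its top two classes are too close.
    Both thresholds are hypothetical and need per-deployment tuning."""
    entropy = -sum(p * math.log(p + 1e-12) for p in probs)
    ranked = sorted(probs, reverse=True)
    margin = ranked[0] - ranked[1]
    return entropy > entropy_threshold or margin < margin_threshold
```

A confident reading like [0.85, 0.05, 0.05, 0.05] passes through, while an ambiguous one like [0.3, 0.3, 0.25, 0.15] is routed to the operator, which is exactly the behavior sarcasm and mixed emotions tend to produce.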







