AI Characters for VR/AR
Static NPCs in VR/AR applications are a bottleneck for any immersive experience: the user presses a trigger, the character plays one of five pre-recorded phrases, and the dialogue ends. AI characters conduct real conversations: they understand scene context, remember previous interactions, adapt their behavior to the user, and drive animations in real time.
AI Character Architecture
```
[STT]                  User voice → text (Whisper)
        ↓
[Context Manager]      History + scene state + character personality
        ↓
[LLM]                  GPT-4o / Claude 3.5 → response text + action commands
        ↓
[TTS]                  ElevenLabs → audio stream
        ↓
[Animation Controller] Unity/Unreal → lip sync + gestures + emotions
```
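The flow above can be chained as a single async function. A minimal sketch with stub stages standing in for Whisper, the LLM, and ElevenLabs (all function names and return values here are illustrative, not a real API):

```python
import asyncio

# Stub stages: each coroutine stands in for the real service call
# (Whisper STT, GPT-4o, ElevenLabs TTS) from the diagram above.
async def stt(audio: bytes) -> str:
    return "hello there"                      # Whisper would transcribe here

async def llm(text: str, context: dict) -> dict:
    # The real call returns speech text plus action commands.
    return {"speech": f"You said: {text}", "emotion": "neutral", "animation": "nod"}

async def tts(speech: str) -> bytes:
    return speech.encode()                    # ElevenLabs would synthesize here

async def run_pipeline(audio: bytes, context: dict) -> dict:
    text = await stt(audio)
    action = await llm(text, context)
    action["audio"] = await tts(action["speech"])
    return action                             # handed to the Animation Controller

action = asyncio.run(run_pipeline(b"\x00", {"scene": "tavern"}))
```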
```python
import asyncio
import json
from dataclasses import dataclass, field

from openai import AsyncOpenAI


@dataclass
class CharacterState:
    character_id: str
    name: str
    personality: str        # system prompt with character traits
    scene_context: dict     # current VR scene state
    history: list = field(default_factory=list)
    emotional_state: str = "neutral"
    relationship_score: float = 0.5   # 0 = hostile, 1 = friendly


class VRCharacterEngine:
    ACTION_SCHEMA = {
        "type": "json_schema",
        "json_schema": {
            "name": "character_response",
            "schema": {
                "type": "object",
                "properties": {
                    "speech": {"type": "string"},
                    "emotion": {
                        "type": "string",
                        "enum": ["neutral", "happy", "angry", "scared",
                                 "surprised", "sad", "suspicious"],
                    },
                    "animation": {
                        "type": "string",
                        "enum": ["idle", "walk_towards", "walk_away",
                                 "point", "nod", "shake_head",
                                 "hand_gesture", "look_around"],
                    },
                    "scene_action": {
                        "type": "string",
                        "description": "Scene action: open_door, pick_up_item, etc.",
                    },
                    "relationship_delta": {
                        "type": "number",
                        "description": "Change in relationship_score [-0.2, 0.2]",
                    },
                },
                "required": ["speech", "emotion", "animation"],
            },
        },
    }

    def __init__(self):
        self.client = AsyncOpenAI()

    def _build_system_prompt(self, state: CharacterState) -> str:
        # Minimal example prompt; adapt to your character design
        return (
            f"You are {state.name}. {state.personality}\n"
            f"Current emotion: {state.emotional_state}. "
            f"Relationship with the user: {state.relationship_score:.2f} "
            f"(0 = hostile, 1 = friendly).\n"
            f"Scene state: {json.dumps(state.scene_context)}"
        )

    async def process_interaction(self, user_input: str, state: CharacterState) -> dict:
        messages = [
            {"role": "system", "content": self._build_system_prompt(state)},
            *state.history[-10:],  # last 5 exchanges
            {"role": "user", "content": user_input},
        ]
        response = await self.client.chat.completions.create(
            model="gpt-4o-mini",  # mini is sufficient; latency is critical
            messages=messages,
            response_format=self.ACTION_SCHEMA,
            max_tokens=300,
            temperature=0.7,
        )
        action = json.loads(response.choices[0].message.content)

        # Update character state
        state.emotional_state = action["emotion"]
        state.relationship_score = max(0.0, min(1.0,
            state.relationship_score + action.get("relationship_delta", 0)
        ))
        state.history.append({"role": "user", "content": user_input})
        state.history.append({"role": "assistant", "content": action["speech"]})
        return action
```
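The optional `scene_action` field still has to be routed into the VR scene. A minimal dispatch sketch, assuming scene state lives in a dict; the `SCENE_HANDLERS` registry and handler bodies are illustrative (the action names `open_door`/`pick_up_item` follow the schema's examples), not part of the engine above:

```python
# Registry of scene-action handlers; each mutates the scene state
# and returns True on success.
SCENE_HANDLERS = {
    "open_door": lambda scene: scene.update(door_open=True) or True,
    "pick_up_item": lambda scene: scene.update(item_held=True) or True,
}

def dispatch_scene_action(action: dict, scene_context: dict) -> bool:
    """Route the LLM's scene_action (if any) to the matching handler."""
    handler = SCENE_HANDLERS.get(action.get("scene_action"))
    if handler is None:
        return False          # unknown or absent action: ignore safely
    return handler(scene_context)

scene = {"door_open": False}
ok = dispatch_scene_action({"speech": "Come in.", "scene_action": "open_door"}, scene)
```

Ignoring unknown actions (rather than raising) keeps the character robust when the model invents an action name outside the expected set.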
Lip Sync and Animation Synchronization
```csharp
// Unity: lip sync synchronization with an ElevenLabs audio stream
using UnityEngine;

public class AICharacterAnimator : MonoBehaviour
{
    private OVRLipSyncContext lipSyncContext;
    private Animator animator;
    private AudioSource audioSource;

    private void Awake()
    {
        lipSyncContext = GetComponent<OVRLipSyncContext>();
        animator = GetComponent<Animator>();
        audioSource = GetComponent<AudioSource>();
    }

    public async void PlayCharacterResponse(string speechText, string emotion, string animation)
    {
        // 1. Request audio from the TTS service
        byte[] audioData = await TTSService.Synthesize(speechText, voiceId: "character_voice");

        // 2. Set the emotion via blend shapes
        SetEmotionBlendShape(emotion);

        // 3. Start the body animation
        animator.SetTrigger(animation);

        // 4. Play the audio; OVRLipSyncContext reads the AudioSource
        //    output and drives the visemes, keeping lips in sync automatically
        AudioClip clip = AudioService.BytesToClip(audioData);
        audioSource.clip = clip;
        audioSource.Play();
    }

    private void SetEmotionBlendShape(string emotion)
    {
        var face = GetComponent<SkinnedMeshRenderer>();

        // Reset all emotion blend shapes
        for (int i = 0; i < face.sharedMesh.blendShapeCount; i++)
            face.SetBlendShapeWeight(i, 0);

        // Activate the requested emotion
        int shapeIndex = face.sharedMesh.GetBlendShapeIndex($"emotion_{emotion}");
        if (shapeIndex >= 0)
            face.SetBlendShapeWeight(shapeIndex, 100f);
    }
}
```
Latency: The Main VR Character Challenge
In VR, a gap of more than 800 ms between the user's speech and the character's response breaks immersion. Pipeline optimization:
| Step | Without optimization | With optimization |
|---|---|---|
| STT (Whisper large) | 800–1200 ms | 200–400 ms (Whisper medium + streaming) |
| LLM (GPT-4o) | 1000–2000 ms | 400–700 ms (GPT-4o-mini + short context) |
| TTS (ElevenLabs) | 600–1000 ms | 200–400 ms (streaming TTS) |
| Total | 2400–4200 ms | 800–1500 ms |
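The optimized column can be sanity-checked by summing the per-stage ranges (numbers copied from the table above):

```python
# (min_ms, max_ms) per stage after optimization, from the table
optimized = {"stt": (200, 400), "llm": (400, 700), "tts": (200, 400)}

total_min = sum(lo for lo, _ in optimized.values())   # best case, stages sequential
total_max = sum(hi for _, hi in optimized.values())   # worst case
```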
Solution: start TTS in parallel as soon as the first tokens arrive from the LLM (streaming), and begin audio playback before synthesis is complete.
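This parallelization can be sketched as sentence-buffered streaming: LLM tokens accumulate in a buffer that is flushed to TTS at each sentence boundary, so the first sentence can start playing while later tokens are still arriving. The token source and `synthesize` function below are stubs standing in for the streaming chat-completions iterator and streaming TTS:

```python
import asyncio
import re

SENTENCE_END = re.compile(r"[.!?]\s*$")   # crude sentence-boundary check

async def fake_llm_tokens():
    # Stub: the real source is the streaming LLM response iterator.
    for tok in ["Hello", " there", ".", " Come", " closer", "."]:
        yield tok

async def synthesize(sentence: str) -> bytes:
    return sentence.encode()              # stub for streaming TTS

async def stream_speech(token_source):
    """Flush buffered tokens to TTS at each sentence boundary."""
    chunks, buffer = [], ""
    async for tok in token_source:
        buffer += tok
        if SENTENCE_END.search(buffer):
            chunks.append(await synthesize(buffer.strip()))
            buffer = ""
    if buffer.strip():                    # flush any trailing fragment
        chunks.append(await synthesize(buffer.strip()))
    return chunks

chunks = asyncio.run(stream_speech(fake_llm_tokens()))
```

In production each chunk would be queued to the AudioSource as soon as it is ready, rather than collected into a list.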
Case study: VR sales training simulator. 4 characters with different personalities (aggressive client, loyal client, skeptic, neutral). Average latency after optimization: 920 ms. Realism assessment (survey of 50 users): 4.1/5 vs 2.3/5 for scripted NPCs.
Timeframe: one AI character with basic animations: 3–5 weeks; complete training simulator with multiple characters and analytics: 2–3 months.