Development of Facial Emotion Recognition Systems
Emotion recognition from facial expressions is the task of classifying a face image into basic emotional states. Ekman's classic model identifies six universal emotions: happiness, sadness, anger, fear, surprise, and disgust; most datasets add a neutral class, giving seven classes in total. Applications include engagement analysis in online learning, customer satisfaction monitoring in call centers, UX research, and driver state monitoring.
Model Architecture
Pipeline: face detection → alignment → emotion classification.
```python
import cv2
import numpy as np
import torch
import torch.nn as nn
import timm
from torchvision import transforms
from insightface.app import FaceAnalysis


def get_inference_transform():
    # Placeholder preprocessing: must match whatever was used at training time
    # (here: [0, 1] scaling and symmetric normalization of the 48x48 crop)
    return transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
    ])


class EmotionRecognizer:
    def __init__(self, model_path: str):
        # Face detection and alignment
        self.detector = FaceAnalysis(allowed_modules=['detection'])
        self.detector.prepare(ctx_id=0, det_size=(640, 640))

        # Emotion classifier: EfficientNet-B0 backbone with a 7-way head
        backbone = timm.create_model('efficientnet_b0', pretrained=False)
        backbone.classifier = nn.Sequential(
            nn.Dropout(0.3),
            nn.Linear(backbone.num_features, 7),
        )
        backbone.load_state_dict(torch.load(model_path, map_location='cpu'))
        backbone.eval()
        self.model = backbone

        self.emotions = ['angry', 'disgust', 'fear', 'happy',
                         'neutral', 'sad', 'surprise']
        self.transform = get_inference_transform()

    @torch.no_grad()
    def predict(self, image: np.ndarray) -> list[dict]:
        faces = self.detector.get(image)
        results = []
        for face in faces:
            # Clamp the bbox to image bounds: detectors can return
            # coordinates slightly outside the frame
            h, w = image.shape[:2]
            x1, y1, x2, y2 = face.bbox.astype(int)
            x1, y1 = max(0, x1), max(0, y1)
            x2, y2 = min(w, x2), min(h, y2)

            face_crop = cv2.resize(image[y1:y2, x1:x2], (48, 48))
            tensor = self.transform(face_crop).unsqueeze(0)

            logits = self.model(tensor)
            probs = torch.softmax(logits, dim=1).squeeze(0)
            emotion_scores = {
                self.emotions[i]: float(probs[i]) for i in range(7)
            }
            dominant = max(emotion_scores, key=emotion_scores.get)
            results.append({
                'bbox': [x1, y1, x2, y2],
                'emotion': dominant,
                'confidence': emotion_scores[dominant],
                'all_scores': emotion_scores,
            })
        return results
```
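The post-processing step of `predict` (softmax over logits, then picking the dominant class) can be exercised in isolation. A minimal sketch with plain NumPy; the logits below are made up for illustration, not real model output:

```python
import numpy as np

EMOTIONS = ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']

def scores_from_logits(logits: np.ndarray) -> dict:
    """Softmax over raw logits, returned as an emotion -> probability dict."""
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs = exp / exp.sum()
    return dict(zip(EMOTIONS, probs.tolist()))

# Made-up logits for a strongly "happy" face
scores = scores_from_logits(np.array([0.1, -1.2, 0.3, 4.0, 1.5, 0.2, 0.8]))
dominant = max(scores, key=scores.get)
```

Because the softmax output sums to 1, `confidence` here is directly comparable across frames and faces.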
Datasets and Model Quality
| Dataset | Size | Conditions | Classes |
|---|---|---|---|
| FER-2013 | 35k photos | Wild | 7 |
| AffectNet | 1M photos | Wild | 8 (+ contempt) |
| RAF-DB | 30k photos | Real-world | 7 + compound |
| CK+ | 593 videos | Laboratory | 7 |
| SFEW | 1766 frames | Movies | 7 |
Accuracy on FER-2013:
- EfficientNet-B0 fine-tuned: 73.1%
- Vision Transformer (ViT-B/16): 74.8%
- EfficientFace: 73.3%
Main difficulty: labels in public datasets are subjective; human annotators disagree on roughly 30–40% of samples. Because of this label noise, around 75% accuracy is generally considered the practical ceiling on FER-2013.
Temporal Analytics on Video
Frame-by-frame classification is unstable — emotion "flickers" between frames. Solutions:
- Temporal smoothing: moving average over 10–30 frames
- RNN/LSTM on top of frame-level classifier: accounts for temporal dynamics
- Interval aggregation: average emotion per N-second interval for analytics
```python
from collections import deque


class TemporalEmotionTracker:
    """Smooths per-frame emotion scores with a moving average."""

    def __init__(self, window_size: int = 30):
        self.window = deque(maxlen=window_size)

    def update(self, emotion_scores: dict) -> dict:
        self.window.append(emotion_scores)
        # Average each emotion's score over the window
        return {
            emotion: sum(frame[emotion] for frame in self.window) / len(self.window)
            for emotion in emotion_scores
        }
```
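Interval aggregation can be sketched in the same spirit. The example below (pure Python; function and field names are illustrative, not part of the pipeline above) buckets per-frame scores into fixed-length intervals and reports the dominant emotion per interval, assuming a constant frame rate:

```python
def aggregate_intervals(frame_scores: list[dict], fps: int = 30,
                        interval_sec: int = 5) -> list[dict]:
    """Average per-frame emotion scores over fixed-length intervals.

    frame_scores: one dict of emotion -> probability per video frame.
    Returns one summary per interval: its start time, averaged scores,
    and the dominant emotion.
    """
    frames_per_interval = fps * interval_sec
    intervals = []
    for start in range(0, len(frame_scores), frames_per_interval):
        chunk = frame_scores[start:start + frames_per_interval]
        averaged = {
            emotion: sum(f[emotion] for f in chunk) / len(chunk)
            for emotion in chunk[0]
        }
        intervals.append({
            'start_sec': start // fps,
            'dominant': max(averaged, key=averaged.get),
            'scores': averaged,
        })
    return intervals

# Toy data: 10 seconds at 30 fps, first half mostly happy, second half neutral
happy = {'happy': 0.8, 'neutral': 0.2}
neutral = {'happy': 0.3, 'neutral': 0.7}
timeline = [happy] * 150 + [neutral] * 150
summary = aggregate_intervals(timeline, fps=30, interval_sec=5)
```

This per-interval view is usually what analytics dashboards consume, rather than the noisy per-frame stream.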
Limitations and Ethical Aspects
It is important to understand the technology's limitations:
- Cultural differences in emotion expression (facial cues vary across cultures)
- Neutral face ≠ neutral emotional state
- Acted expressions differ from genuine ones
The technology should not be used for covert employee monitoring: production deployments require informed, documented consent from the people being analyzed.
Estimated development timelines:

| Task | Timeline |
|---|---|
| SDK for mobile/web application | 2–3 weeks |
| Video engagement analytics | 3–5 weeks |
| Custom model on corporate dataset | 5–8 weeks |