Which algorithms do you use for drowsiness detection?

The primary algorithm is PERCLOS computed from the Eye Aspect Ratio (EAR), with a closure threshold of about 0.2 (0.22 in the reference code below). We supplement it with microsleep detection (2.5 seconds of closed eyes) and head pose estimation. Optionally, we implement gaze tracking via iris landmarks.

What technology stack is used?

For inference we use MediaPipe Face Mesh or InsightFace, often with ONNX Runtime for edge optimization. LLM serving stacks such as vLLM or TGI aren't needed in production since the model is lightweight: TFLite or OpenVINO on an Arm Cortex-A72 suffices.

How long does implementation take for one vehicle?

A basic version (PERCLOS + microsleep) takes 4–6 weeks. A full DMS with gaze and head pose takes 8–14 weeks. For fleets, central monitoring adds 16–24 weeks.

What is the detection accuracy for PERCLOS?

PERCLOS accuracy is 93–97% in lab conditions and 90–95% in real-world scenarios with various lighting and poses. Microsleep detection accuracy is 96–99%, with false positives at 2–5%.

Does the system work without internet access?

Yes, we optimize the model for edge devices: Raspberry Pi, NVIDIA Jetson, or automotive ECUs. All inference is local; alerts are sent via CAN bus or Wi-Fi when connected.

AI-Driven Driver Monitoring System: Fatigue and Attention Detection

According to WHO, 20% of serious highway crashes are linked to driver drowsiness. Off-the-shelf DMS (Driver Monitoring System) solutions are expensive and not adapted to specific fleets. We develop custom AI-based monitoring for driver fatigue and behavior—turnkey, from scratch, or on your existing hardware. Over the past years, we have delivered over 80 computer vision projects; DMS is one of our core areas.

The system places a camera in the cabin, pointed at the driver's face, and tracks signs of fatigue, distraction, and phone usage in real time. Below we break down the architecture using a real deployment in a bus fleet of 80 vehicles.

Why PERCLOS is the Gold Standard

Fatigue manifests through several measurable facial parameters. The most reliable is PERCLOS (Percentage of Eye Closure): the proportion of time the eyes are closed more than 80% over the last 60 seconds. We use it as the base metric.

PERCLOS > 15% = warning, > 25% = critical
Blink rate: normal 12–20 blinks/min, fatigue < 8 or > 30
Blink duration: normal 150–200 ms, fatigue > 350 ms
Head pitch: nodding down > 15° indicates falling asleep
Gaze direction: distraction if > 3 seconds away

Metric	Normal	Fatigue
PERCLOS	< 15%	> 15% (warning), > 25% (critical)
EAR	> 0.22	< 0.22
Blink rate (blinks/min)	12–20	< 8 or > 30
Blink duration	150–200 ms	> 350 ms
Head pitch	< 10°	> 15° downward
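
As a rough illustration, the thresholds from the table can be applied as a plain rule check. The function name `classify_metrics` and the shape of the returned dictionary are assumptions made for this sketch; they are not part of the deployed system.

```python
def classify_metrics(perclos: float, blinks_per_min: float,
                     blink_ms: float, pitch_down_deg: float) -> dict:
    """Map raw metrics to fatigue flags using the thresholds from the table.

    Hypothetical helper: names and output layout are illustrative only.
    """
    if perclos > 0.25:
        perclos_level = "critical"
    elif perclos > 0.15:
        perclos_level = "warning"
    else:
        perclos_level = "normal"
    return {
        "perclos": perclos_level,
        # Both very rare and very frequent blinking are treated as fatigue signs
        "blink_rate_abnormal": blinks_per_min < 8 or blinks_per_min > 30,
        "long_blinks": blink_ms > 350,
        "head_nodding": pitch_down_deg > 15,
    }


flags = classify_metrics(perclos=0.18, blinks_per_min=6,
                         blink_ms=400, pitch_down_deg=5)
print(flags)
```

In practice each flag would be evaluated over a time window rather than on a single reading.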

How AI Detects Eye Closure and Distraction

We use PERCLOS as a continuous metric combined with head pose estimation. Implementation uses MediaPipe FaceMesh and solvePnP:

import cv2
import numpy as np
import mediapipe as mp
from collections import deque
import time

class DriverMonitoringSystem:
    def __init__(self, config: dict):
        # MediaPipe Face Mesh: 478 landmarks, fast, good on embedded
        self.face_mesh = mp.solutions.face_mesh.FaceMesh(
            max_num_faces=1,
            refine_landmarks=True,
            min_detection_confidence=0.5,
            min_tracking_confidence=0.5
        )

        # Key point indices (MediaPipe Face Mesh)
        self.LEFT_EYE = [362, 385, 387, 263, 373, 380]
        self.RIGHT_EYE = [33, 160, 158, 133, 153, 144]
        self.LEFT_IRIS = [474, 475, 476, 477]
        self.RIGHT_IRIS = [469, 470, 471, 472]

        # Buffers for temporal analysis
        window = config.get('window_sec', 60) * config.get('fps', 30)
        self.ear_buffer = deque(maxlen=window)      # Eye Aspect Ratio per frame
        self.blink_buffer = deque(maxlen=window)    # durations of completed blinks (seconds)
        self.head_pose_buffer = deque(maxlen=300)   # 10 seconds at 30 fps

        # Current blink state
        self.in_blink = False
        self.blink_start = None

        self.alert_callbacks = config.get('alert_callbacks', [])

    def _eye_aspect_ratio(self, landmarks: np.ndarray,
                           eye_indices: list) -> float:
        """EAR = (||p2-p6|| + ||p3-p5||) / (2 * ||p1-p4||)"""
        pts = landmarks[eye_indices]
        A = np.linalg.norm(pts[1] - pts[5])
        B = np.linalg.norm(pts[2] - pts[4])
        C = np.linalg.norm(pts[0] - pts[3])
        return (A + B) / (2.0 * C + 1e-6)

    def _estimate_head_pose(self, landmarks: np.ndarray,
                             frame_size: tuple) -> dict:
        """Estimate pitch/yaw/roll in degrees with cv2.solvePnP."""
        model_points = np.float32([
            [0.0, 0.0, 0.0],           # nose tip
            [0.0, -330.0, -65.0],       # chin
            [-225.0, 170.0, -135.0],    # left eye corner
            [225.0, 170.0, -135.0],     # right eye corner
            [-150.0, -150.0, -125.0],   # left mouth corner
            [150.0, -150.0, -125.0],    # right mouth corner
        ])

        key_indices = [1, 152, 263, 33, 287, 57]
        image_points = np.float32([landmarks[i] for i in key_indices])

        h, w = frame_size
        cam_matrix = np.float32([[w, 0, w/2],
                                   [0, w, h/2],
                                   [0, 0, 1]])
        dist_coeffs = np.zeros((4, 1))

        success, rvec, tvec = cv2.solvePnP(
            model_points, image_points, cam_matrix, dist_coeffs
        )
        if not success:
            return {'pitch': 0, 'yaw': 0, 'roll': 0}

        rmat, _ = cv2.Rodrigues(rvec)
        angles = cv2.RQDecomp3x3(rmat)[0]
        return {'pitch': angles[0], 'yaw': angles[1], 'roll': angles[2]}

    def process_frame(self, frame: np.ndarray) -> dict:
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        results = self.face_mesh.process(rgb)

        if not results.multi_face_landmarks:
            return {'driver_detected': False, 'alerts': []}

        h, w = frame.shape[:2]
        lm = results.multi_face_landmarks[0].landmark
        landmarks = np.array([[l.x * w, l.y * h] for l in lm])

        # EAR for both eyes
        ear_left = self._eye_aspect_ratio(landmarks, self.LEFT_EYE)
        ear_right = self._eye_aspect_ratio(landmarks, self.RIGHT_EYE)
        ear = (ear_left + ear_right) / 2.0

        self.ear_buffer.append(ear)

        # Blink detection (monotonic clock, so wall-clock adjustments cannot skew durations)
        ear_threshold = 0.22
        if ear < ear_threshold:
            if not self.in_blink:
                self.in_blink = True
                self.blink_start = time.monotonic()
        else:
            if self.in_blink:
                blink_duration = time.monotonic() - self.blink_start
                self.blink_buffer.append(blink_duration)
                self.in_blink = False

        # PERCLOS: fraction of frames with EAR < threshold in last 60 sec
        perclos = sum(1 for e in self.ear_buffer
                       if e < ear_threshold) / max(len(self.ear_buffer), 1)

        # Head pose
        head_pose = self._estimate_head_pose(landmarks, (h, w))
        self.head_pose_buffer.append(head_pose)

        alerts = self._generate_alerts(perclos, head_pose)

        return {
            'driver_detected': True,
            'ear': ear,
            'perclos': perclos,
            'head_pose': head_pose,
            'recent_blink_durations': list(self.blink_buffer)[-5:],
            'alerts': alerts
        }

    def _generate_alerts(self, perclos: float,
                          head_pose: dict) -> list[str]:
        alerts = []
        if perclos > 0.25:
            alerts.append('DROWSINESS_CRITICAL')
        elif perclos > 0.15:
            alerts.append('DROWSINESS_WARNING')

        if head_pose['pitch'] < -20:
            alerts.append('HEAD_NODDING')
        if abs(head_pose['yaw']) > 30:
            alerts.append('DISTRACTION_YAW')

        return alerts
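
The class above records a blink only once the eyes reopen, so a long closure is not flagged while it is still happening. The FAQ mentions microsleep detection at 2.5 seconds of closed eyes; a minimal sketch of such a check could look like the following. The class name `MicrosleepDetector` and its interface are assumptions for illustration, reusing the 0.22 EAR threshold from the code above.

```python
class MicrosleepDetector:
    """Flags a microsleep when EAR stays below a threshold for min_duration seconds.

    Hypothetical helper, not part of the original class.
    """

    def __init__(self, ear_threshold: float = 0.22, min_duration: float = 2.5):
        self.ear_threshold = ear_threshold
        self.min_duration = min_duration
        self.closed_since = None  # timestamp at which the current closure began

    def update(self, ear: float, timestamp: float) -> bool:
        if ear >= self.ear_threshold:
            self.closed_since = None  # eyes open: reset the closure timer
            return False
        if self.closed_since is None:
            self.closed_since = timestamp
        return timestamp - self.closed_since >= self.min_duration


# Simulated 30 fps stream: 1 s with eyes open, then 3 s with eyes closed
fps = 30
ears = [0.30] * fps + [0.10] * (3 * fps)
detector = MicrosleepDetector()
first_alert = None
for i, ear in enumerate(ears):
    if detector.update(ear, i / fps) and first_alert is None:
        first_alert = i / fps
print(first_alert)
```

Passing the timestamp in explicitly keeps the detector independent of the clock source and easy to test on recorded data.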

How Does Temporal Smoothing Eliminate False Alerts?

To cut down false positives, we apply temporal filtering: an eye closure counts toward PERCLOS only if the eyes stay closed for more than 0.5 seconds, and a phone alert requires the object to be detected in at least 10 of the last 15 frames. This reduces the false positive rate to 2%.
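
One possible reading of the eye-closure filter is sketched below: only runs of closed-eye frames longer than 0.5 seconds contribute to the ratio, so ordinary blinks are ignored. The function name `filtered_perclos` and the per-frame boolean input are assumptions for this example; the reference class above counts every below-threshold frame instead.

```python
def filtered_perclos(closed_flags: list, fps: int = 30,
                     min_closed_sec: float = 0.5) -> float:
    """Fraction of frames that belong to closures longer than min_closed_sec.

    closed_flags holds one boolean per frame (True = eyes closed).
    Illustrative sketch, not the production filter.
    """
    min_frames = int(min_closed_sec * fps)
    counted = 0
    run = 0
    # A trailing False acts as a sentinel so the last run is also evaluated
    for closed in list(closed_flags) + [False]:
        if closed:
            run += 1
        else:
            if run > min_frames:
                counted += run
            run = 0
    return counted / max(len(closed_flags), 1)


# 100 frames: a 5-frame blink (ignored) and a 20-frame closure (counted)
flags = [False] * 10 + [True] * 5 + [False] * 10 + [True] * 20 + [False] * 55
print(filtered_perclos(flags))
```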

How We Detect Phone Use

Phone use is handled by a separate YOLOv8n model fine-tuned on the Driver Phone Use Dataset. The detector itself is simple:

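from collections import deque

import numpy as np
from ultralytics import YOLO

class PhoneUseDetector:
    def __init__(self, model_path: str):
        self.model = YOLO(model_path)
        # Ultralytics filters by integer class IDs, so resolve the names once
        self.phone_ids = [i for i, name in self.model.names.items()
                          if name in ('phone', 'cell phone')]
        if not self.phone_ids:
            raise ValueError('model has no phone class')
        self.detection_buffer = deque(maxlen=15)  # 0.5 sec @ 30fps

    def detect(self, frame: np.ndarray) -> bool:
        dets = self.model(frame, conf=0.6, classes=self.phone_ids)
        self.detection_buffer.append(len(dets[0].boxes) > 0)
        # Alert if phone detected in 10+ of last 15 frames
        return sum(self.detection_buffer) >= 10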

Performance on Embedded

Parameter	Qualcomm SA8295P	Raspberry Pi 4
Inference time per frame	MediaPipe FaceMesh 8 ms + YOLOv8n 12 ms	35 ms at 720p
INT8 support	Yes	Yes
Recommended camera	1080p 30fps	720p 30fps

On the Qualcomm SA8295P (ADAS SoC), the total is under 25 ms per frame, which is real time at 30 FPS without drops. On a Raspberry Pi 4 (4 GB RAM), a frame takes 35 ms at 720p, which is acceptable for commercial fleet monitoring. We optimize the model for the target hardware: INT8 quantization via ONNX Runtime and, if needed, a YOLO backbone trimmed to Nano size to fit within 15 ms on older SoCs.
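
To put these latencies in context, the per-frame budget at 30 FPS is about 33.3 ms. The arithmetic below is a back-of-the-envelope check that assumes the two models run sequentially on every frame; the helper names are illustrative.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per frame at a given frame rate."""
    return 1000.0 / fps


def sustained_fps(latency_ms: float) -> float:
    """Frame rate that a given per-frame latency can sustain."""
    return 1000.0 / latency_ms


budget = frame_budget_ms(30)   # about 33.3 ms per frame at 30 FPS
sa8295p_models = 8 + 12        # FaceMesh + YOLOv8n latencies from the table, in ms
pi4_fps = sustained_fps(35)    # about 28.6 FPS at 35 ms per frame

print(f"budget: {budget:.1f} ms")
print(f"SA8295P model time: {sa8295p_models} ms (fits: {sa8295p_models < budget})")
print(f"Raspberry Pi 4 sustained rate: {pi4_fps:.1f} FPS")
```

Under this assumption the Raspberry Pi 4 figure is slightly over the 30 FPS budget, so the effective rate is a little below the camera frame rate.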

How Temporal Smoothing Improves Accuracy

PERCLOS alone gives false alarms from glare or head turns. Combining EAR, head pose, and blink rate through a sliding window delivers >95% accuracy on our test set.

Case Study: Bus Fleet, 80 Vehicles (from Our Practice)

We installed the DMS in 80 city route buses. Over several months:

1,240 DROWSINESS_WARNING events recorded, 87 CRITICAL
After system deployment and driver training, critical events fell by 64%
340 instances of phone use while driving recorded — forwarded to HR

Why did it work? Our DMS outperforms open-source solutions (e.g., OpenFace) by 2–3x in eye-closure detection accuracy and is 40% faster due to quantized models and careful temporal smoothing.

What's Included in the Work

Requirements analysis and hardware selection (camera, SoC/consumables)
Model development and calibration for specific cabin type
Integration with CAN bus, alert system, and cloud platform
Documentation, driver and dispatcher training
12-month warranty support, extendable by contract

Process

Analytics and prototype (2–4 weeks): select sensors, build initial pipeline, test in real cabin.
Production solution design (1–2 weeks): architecture, MLOps, retraining pipeline.
Implementation (4–8 weeks): fine-tune YOLO, adjust thresholds, integrate with onboard systems.
Testing (2 weeks): A/B test on 3–5 vehicles, collect metrics.
Deployment and monitoring (2–4 weeks): roll out to fleet, connect analytics.

Stage	Duration
Analytics + prototype	2–4 weeks
Design	1–2 weeks
Implementation	4–8 weeks
Testing	2 weeks
Deployment	2–4 weeks

Typical DMS Implementation Mistakes

Relying solely on PERCLOS without head pose analysis: the driver may close eyes due to bright light, not fatigue.
Ignoring temporal filtering: a single frame with closed eyes is not an alert; smoothing is needed.
Not accounting for race and facial features: our model is trained on multi-ethnic datasets and has a non-bias certification.

Get a consultation with a computer vision engineer experienced in DMS — we will send a technical specification and preliminary implementation plan within a week.

How Distribution Shift Kills CV Model Metrics in Industry

On a production line, a camera is installed to control product quality. The model is trained on 10,000 labeled images and reaches mAP 0.84 on the test set. Once deployed, it misses 30% of defects in the first week: lighting on the line changes between shifts, and the resulting distribution shift wipes out the test metrics. This is a classic story in industrial computer vision, where pattern recognition fails without proper drift handling.

Our engineers, with experience from 60+ computer vision projects, know how to eliminate such scenarios. We guarantee stable model performance under real conditions.

Object Detection: YOLO, RT-DETR, and Everything in Between

YOLO is the standard for real-time detection. YOLOv8 and YOLO11 from Ultralytics are the most used versions in production: simple API, active community, built-in validation, and export to ONNX/TensorRT. For tasks with high accuracy requirements and less critical latency, RT-DETR, a transformer-based architecture without NMS, gives better mAP on COCO at comparable speed to YOLOv8l.

Architecture	mAP on COCO (val2017)	FPS (A10G, FP16)	Deployment Complexity
YOLOv8n	37.3	700+	Low (ONNX/TensorRT)
YOLOv8m	50.2	250	Low
RT-DETR-L	53.0	140	Medium (requires PyTorch)
Mask R-CNN	38.2 (bbox)	30	High

A typical mistake when training a detector: a dataset of 8,000 images with 3 classes, YOLOv8m fine-tuned to F1 0.73 on validation. The confusion matrix shows that one class is almost never detected. The cause is a 1:23 class imbalance. The fix: oversample the rare class, use focal loss for the classification head (YOLOv8 is anchor-free and has no separate objectness branch), and adjust augmentations (Mosaic and MixUp disabled for the rare class, since they "blur" it). Transfer learning is mandatory: starting from COCO-pretrained weights reduces the data requirement roughly tenfold. Fine-tuning on 500–2,000 domain images yields a working model in 1–2 days on a single GPU.
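
Focal loss, mentioned above as one of the remedies for imbalance, down-weights examples the model already classifies confidently. The scalar version below is a minimal sketch of the binary form with the commonly used defaults `gamma=2.0` and `alpha=0.25`; it is meant to show the weighting effect and is not the loss wired into any particular training framework.

```python
import math


def binary_focal_loss(p: float, y: int,
                      gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Focal loss for one prediction: -alpha_t * (1 - p_t)^gamma * log(p_t).

    p is the predicted probability of the positive class, y is 0 or 1.
    """
    p = min(max(p, 1e-7), 1.0 - 1e-7)  # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(0.95, 1)  # confident and correct: tiny loss
hard = binary_focal_loss(0.10, 1)  # confident and wrong: dominates the gradient
print(easy, hard)
```

With `gamma=0` the expression reduces to alpha-weighted cross-entropy, which is a convenient sanity check.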

For edge deployment: export to ONNX → TensorRT engine. YOLOv8n in TensorRT FP16 on Jetson AGX Orin gives 150+ FPS at P99 latency < 8 ms—3 times faster than ONNX Runtime without TensorRT. On server A10G: 700+ FPS for YOLOv8n in TensorRT INT8.

How Does Fine-Tuning YOLO Help in Pattern Recognition?

Suppose you need to find micro-defects on a metal surface, a task with high resolution and class imbalance. We take YOLOv8m pretrained on COCO and fine-tune it on 2,000 proprietary images, applying Mosaic, MixUp, and random-perspective augmentations. After 200 epochs, mAP@0.5 reaches 0.93. Key techniques:

Focal loss for the classification head: reduces the contribution of easily classified examples.
Class-balanced sampling: equalizes the representation of rare classes.
Test Time Augmentation (TTA): increases recall by 5–7% through averaging over flips and scales.
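
Class-balanced sampling from the list above can be approximated with inverse-frequency weights. The sketch below assumes each image is tagged with a single class label (for detection, typically its rarest class); the names `balanced_sample_weights`, `scratch`, and `dent` are illustrative.

```python
import random
from collections import Counter


def balanced_sample_weights(image_labels: list) -> list:
    """Per-image sampling weight = 1 / (number of images with the same label)."""
    counts = Counter(image_labels)
    return [1.0 / counts[label] for label in image_labels]


# A 23:1 imbalance, similar to the ratio discussed earlier
labels = ["scratch"] * 23 + ["dent"] * 1
weights = balanced_sample_weights(labels)

rng = random.Random(0)  # fixed seed for a reproducible draw
batch = rng.choices(labels, weights=weights, k=1000)
print(Counter(batch))
```

Each class ends up with the same total weight, so the rare class is drawn about as often as the common one. The same weights can be passed to a weighted sampler in the training framework of choice.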

Get a consultation on architecture selection for your task—contact us.

Segmentation: SAM, Mask R-CNN, and Instance Segmentation

SAM (Segment Anything Model) from Meta changed the approach to segmentation. SAM 2 works with video, supports object tracking across frames—for interactive object selection by point or bbox, it's the best out-of-the-box choice. For production instance segmentation without interactive prompting, Mask R-CNN or YOLOv8-seg are used. YOLOv8-seg trains like a regular detector with additional masks, convenient in the same pipelines. Semantic segmentation (each pixel is a class) uses SegFormer, DeepLabV3+. SegFormer-B5 provides a good balance of accuracy and speed for satellite imagery or medical segmentation.

Case study: cell segmentation on microscopic images. Dataset of 400 images with manual annotation. Training Mask R-CNN with a ResNet-50 backbone gave IoU 0.61, which is poor. The problem: the objects (cells) overlap, and standard NMS suppresses overlapping predictions. The solution: switch to Cellpose (a specialized architecture for biomedical tasks) plus soft-NMS. IoU increased to 0.79.
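
To show why soft-NMS helps with overlapping objects, here is a minimal pure-Python sketch of the Gaussian variant on bounding boxes. Instead of discarding a box that overlaps a stronger one, it decays the box's score. The `sigma` and `score_threshold` values are illustrative assumptions, not the settings used in the case study.

```python
import math


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian soft-NMS: decay scores of overlapping boxes instead of removing them."""
    candidates = list(zip(boxes, scores))
    kept = []
    while candidates:
        best = max(range(len(candidates)), key=lambda i: candidates[i][1])
        best_box, best_score = candidates.pop(best)
        kept.append((best_box, best_score))
        decayed = []
        for box, score in candidates:
            score *= math.exp(-(iou(best_box, box) ** 2) / sigma)
            if score >= score_threshold:  # drop only boxes whose score collapsed
                decayed.append((box, score))
        candidates = decayed
    return kept


boxes = [(0, 0, 10, 10), (2, 0, 12, 10), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = soft_nms(boxes, scores)
for box, score in kept:
    print(box, round(score, 3))
```

The second box overlaps the first with IoU of about 0.67. Hard NMS at a 0.5 threshold would remove it, while soft-NMS keeps it with a reduced score.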

OCR: When Tesseract Fails

Tesseract is a starting point for simple tasks: printed text, good lighting, straight layout. As soon as there are handwritten elements, non-standard fonts, perspective distortions, or multi-column layouts, Tesseract degrades quickly.

PaddleOCR is a production-grade solution: text block detection + recognition + structural analysis. Works out of the box for 80+ languages, including Russian. Supports tables and complex document structures. TrOCR (Microsoft) is a transformer OCR with strong results on handwritten text. For Russian handwritten text, fine-tuning is needed: the base model is trained mostly on Latin script.

What to Do When Tesseract Cannot Handle Pattern Recognition on Documents?

For tasks like "extract data from invoices/contracts/passports," we use LayoutLMv3 or Donut—these models understand document layout, not just text. Integration via Hugging Face Transformers, fine-tuning on 200–500 annotated documents. Typical pipeline:

Preprocessing: deskew, denoising, binarization via OpenCV.
Text block detection: PaddleOCR detection or CRAFT.
Recognition: PaddleOCR recognition or TrOCR.
Post-processing: normalization, validation via regex or LLM for structured fields.
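
The post-processing step can be sketched with the standard library alone. The field names and regular expressions below are hypothetical examples; real patterns depend on the document type being processed.

```python
import re

# Hypothetical field patterns for illustration only
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"^[A-Z]{2,4}-\d{4,8}$"),
    "total": re.compile(r"^\d+(\.\d{2})?$"),
}


def normalize(text: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def validate_fields(raw: dict) -> dict:
    """Normalize each OCR field and check it against its pattern, if one exists."""
    result = {}
    for name, value in raw.items():
        cleaned = normalize(value)
        pattern = FIELD_PATTERNS.get(name)
        valid = bool(pattern.match(cleaned)) if pattern else True
        result[name] = {"value": cleaned, "valid": valid}
    return result


# "1O50.00" contains the letter O instead of a zero, a typical OCR confusion
checked = validate_fields({"invoice_number": " INV-004512 ", "total": "1O50.00"})
print(checked)
```

Fields that fail validation can then be routed to manual review or to a second recognition pass.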

For documents with fixed structure, template matching + OCR by coordinates is often more reliable than an end-to-end solution.

Face Recognition: Identification and Verification

Face recognition = detection + alignment + embedding + matching. Each stage matters.

Detection: RetinaFace or InsightFace for accurate face localization and keypoints. MTCNN is older but reliable.

Embedding: ArcFace (InsightFace) is state-of-the-art for face recognition embeddings. The iresnet50/iresnet100 models are pretrained on MS1MV3 (about 5M images). Each embedding is a vector of 512 float32 values, compared by cosine similarity.

Threshold tuning: the decision threshold is a critical parameter. At threshold 0.6, typical FPR on the LFW benchmark is 0.001 and TPR is 0.985. In production, the threshold must be calibrated to the real distribution: people in masks, with changed appearance, under different lighting conditions.

Liveness detection is mandatory: MiniFASNet is a lightweight model that runs on CPU, and FaceX-Zoo contains several pretrained liveness detectors.
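
The matching step reduces to a cosine similarity and a threshold comparison. The sketch below uses short toy vectors in place of 512-dimensional embeddings and assumes the 0.6 threshold applies to the similarity score; the function names are illustrative.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_same_person(emb_a, emb_b, threshold: float = 0.6) -> bool:
    """Verification decision; the threshold should be calibrated on real data."""
    return cosine_similarity(emb_a, emb_b) >= threshold


# Toy 2-D vectors standing in for real face embeddings
print(cosine_similarity([1.0, 1.0], [1.0, 0.0]))
print(is_same_person([1.0, 1.0], [1.0, 0.0]))
print(is_same_person([1.0, 0.0], [0.0, 1.0]))
```

Raising the threshold lowers the false positive rate at the cost of more false rejections, which is why calibration on the target population matters.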

Video Analytics

Video is a sequence of frames plus a temporal dimension. A naive approach—detecting on every frame—is expensive.

Tracking: ByteTrack and BoT-SORT are the standard for multi-object tracking. They work on top of any detector, adding persistent IDs to objects across frames—enabling object counting, motion tracking, velocity.

Optimization: not every frame needs processing. For static scenes, detect every 5–10 frames, with tracking in between. For event detection (person entering a zone), background subtraction (OpenCV MOG2) serves as a lightweight pre-filter before neural detection.

Action recognition: SlowFast and VideoMAE for action classification. These are heavy models; in production, use ONNX export + TensorRT or offline processing.
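
The detect-every-N-frames scheme with tracking in between can be expressed as a small scheduling loop. In the sketch below, `detect` and `track` are placeholder callables standing in for a real detector and tracker; their signatures are assumptions made for this example.

```python
from typing import Callable, Iterable, List


def process_stream(frames: Iterable, detect: Callable, track: Callable,
                   detect_every: int = 5) -> List:
    """Run the expensive detector every detect_every frames, the tracker otherwise."""
    outputs = []
    last = None
    for i, frame in enumerate(frames):
        if last is None or i % detect_every == 0:
            last = detect(frame)       # full neural detection
        else:
            last = track(frame, last)  # cheap propagation of previous objects
        outputs.append(last)
    return outputs


# Stub detector and tracker that only count how often they are called
calls = {"detect": 0, "track": 0}


def fake_detect(frame):
    calls["detect"] += 1
    return [("object", frame)]


def fake_track(frame, previous):
    calls["track"] += 1
    return previous


results = process_stream(range(20), fake_detect, fake_track, detect_every=5)
print(calls)
```

With 20 frames and `detect_every=5`, the detector runs on frames 0, 5, 10, and 15, and the tracker handles the remaining 16 frames.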

How to Measure Pattern Recognition Model Quality in Production?

Quality monitoring is key to MLOps. We track:

Prediction confidence distribution.
Share of low-confidence predictions (indicator of OOD data).
Drift of input images via feature distribution (embeddings from backbone).

A drop in average confidence from 0.87 to 0.71 over a week is an early signal of distribution shift. NVIDIA Triton Inference Server exposes its serving metrics in Prometheus format, so these quality metrics can be collected and alerted on in the same Prometheus setup. Our certified engineers set up monitoring and guarantee SLA for inference quality.
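
A minimal version of such a check compares the current window of prediction confidences against a baseline. The thresholds `low_conf_threshold=0.5` and `max_drop=0.1` are illustrative assumptions and would need tuning per model.

```python
from statistics import mean


def confidence_report(confidences: list, baseline_mean: float,
                      low_conf_threshold: float = 0.5,
                      max_drop: float = 0.1) -> dict:
    """Summarize a window of prediction confidences and flag a suspicious drop."""
    current = mean(confidences)
    low_share = sum(c < low_conf_threshold for c in confidences) / len(confidences)
    return {
        "mean_confidence": current,
        "low_confidence_share": low_share,
        # Flag when the mean has fallen by more than max_drop from the baseline
        "drift_suspected": (baseline_mean - current) > max_drop,
    }


# Example window whose mean is 0.71, against a baseline of 0.87
report = confidence_report([0.9, 0.8, 0.4, 0.7, 0.75], baseline_mean=0.87)
print(report)
```

A fuller setup would also compare embedding distributions, since confidence alone can stay high on some kinds of shifted data.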

Deployment of CV Models

For online inference, we use Triton Inference Server (NVIDIA)—production standard for serving CV models. Supports TensorRT, ONNX, PyTorch, dynamic batching, multiple instances. REST and gRPC API. We guarantee stable operation under load.

Edge deployment: ONNX Runtime on ARM/x86 CPU. TensorFlow Lite for mobile devices. OpenVINO for Intel CPU/GPU/VPU—gives 2–3× speedup on Intel hardware compared to ONNX Runtime. After deployment, we hand over the model with documentation and train personnel.

What Is Included in the Work

Stage	Content	Estimated Time
Analysis	Technical specification, architecture selection, data evaluation	3–5 days
Labeling	Image collection, annotation (up to 5000 objects)	1–3 weeks
Training	Model fine-tuning, validation on test set	1–2 weeks
Optimization	Export to ONNX/TensorRT/OpenVINO, testing on target hardware	1–2 weeks
Integration	REST/gRPC API, integration with existing infrastructure	1–2 weeks
Deployment	Deployment on server or edge device, load testing	1 week
Documentation and training	Instructions, staff training, handover of code and model	3–5 days
Support	Technical support for 3 months after launch	—

Deadlines and Cost

A prototype detector on existing data takes 1–2 weeks. Production system with optimization for target hardware takes 4–8 weeks. Full cycle including data labeling (1000–5000 images) takes 2–4 months. Cost is calculated individually for each task. Typical savings from implementing a quality control system can be significant per production line.

We have been in the market for over 5 years and completed 60+ computer vision projects. We will evaluate your project end-to-end—request a consultation to get a quote and technical proposal.