Which PPE items can the system detect?

The system is trained on hard hats, safety vests, safety glasses, and gloves. Additional classes can be added with 2–3 weeks of retraining.

What is the detection accuracy?

Accuracy for hard hats and vests exceeds 95% under good lighting; for small objects (glasses) it's over 85%. We use model ensembles and post-processing to reduce false positives.

Is it hard to integrate with existing cameras?

No. The system accepts RTSP streams, supports ONVIF and HLS. Integration with any IP camera takes 1–2 days.

What hardware is required?

A server with GPU (NVIDIA T4/A4000) or a Jetson edge device. One server is enough for 4 cameras. We provide ready configurations.

Is staff training included?

Yes, we train operators on using the dashboard and configuring alerts. We also support a RACI matrix for roles.

Which PPE items can the system detect?

The system is trained on hard hats, safety vests, safety glasses, and gloves. Additional classes can be added with 2–3 weeks of retraining.

What is the detection accuracy?

Accuracy for hard hats and vests exceeds 95% under good lighting; for small objects (glasses) it's over 85%. We use model ensembles and post-processing to reduce false positives.

Is it hard to integrate with existing cameras?

No. The system accepts RTSP streams, supports ONVIF and HLS. Integration with any IP camera takes 1–2 days.

What hardware is required?

A server with GPU (NVIDIA T4/A4000) or a Jetson edge device. One server is enough for 4 cameras. We provide ready configurations.

Is staff training included?

Yes, we train operators on using the dashboard and configuring alerts. We also support a RACI matrix for roles.

AI PPE Detection System for Construction Safety

We design and deploy artificial intelligence systems: from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering and MLOps to make AI work not in the lab, but in real business.

8+Years of workmore info 900+Completed projectsmore info 100+In house employeesmore info 19+Partnersmore info

Services we offer

Showing 1 of 1All 1564 services

AI PPE Detection System for Construction Safety

Medium

~1-2 weeks

Frequently Asked Questions

AI Development Areas

Discuss your AI project

Free consultation — we'll show you how AI can solve your challenge

Get a quote

We'll estimate the budget and timeline for your AI project

AI Solution Development Stages

Latest works

B2B ADVANCE company website development
1358
Development of a web application for FEEDME
1251
Website development for BELFINGROUP
957
Development of an online store for the company FURNORO
1188
B2B Advance company logo design
646
Development of a web application for Enviok
929

Show more works

How Missing PPE Detection Improves Construction Safety?

On the 15th floor of a construction site, a worker removed his hard hat for 30 seconds — a standard YOLOv8 model without fine-tuning missed the violation. A typical scenario: an off-the-shelf system not adapted to the specific site conditions. We develop a solution that detects missing hard hats, vests, glasses, and gloves in real time, accounting for dust, rain, and side angles. Our fine-tuned model detects violations 50% more accurately than typical solutions, outperforming generic PPE detectors by 2x in recall under challenging conditions. The project includes dataset collection, model fine-tuning, integration with video surveillance, and a web dashboard with automatic violation reports. Over the course of our work, we have deployed more than 30 Computer Vision solutions on construction sites, with 5+ years of experience in AI safety systems and a team of 20+ engineers. The average savings from reduced fines and incidents can be up to $50,000 per year per site, with system cost starting from $2,000 per camera.

What Technical Problems Do We Solve?

Object occlusion. A worker is half-visible behind rebar — his head is not in the frame. We do not penalize if the head is not detected. We use the BoT-SORT tracker to track each worker and determine if the hard hat was visible earlier. BoT-SORT combines ByteTrack and SORT with appearance ReID, ensuring stable tracking under heavy occlusions.

Small objects at a distance. A hard hat is 8×8 pixels at 30 meters. YOLOv8l gives recall ~70%. Solution: increase input frame resolution to 1280×1280 or use a PTZ camera with optical zoom. YOLOv8x with increased resolution improves recall to 85%.

Similar objects. Construction debris, fabric on scaffolding, glare on wet hard hats. Hard negative mining: collect 200–300 false positives and add them to the training set. This reduces FPR from 15% to 3%.

Model Comparison: Fine-Tuned vs. Off-the-Shelf

Public datasets (Safety Helmet Detection Dataset, PPE-Detection Dataset) contain clean images with good lighting. In real conditions — dust, glare, side angles, night shifts. Without fine-tuning on your data, mAP drops from 0.85 to 0.4 — that is, twice as bad. Our fine-tuned model detects violations 50% more accurately than typical solutions. We guarantee detection accuracy of at least 95% on your data after adaptation.

How to Set Up Detection on a Site: Step-by-Step

Install 2–4 cameras in key zones (entrance, work areas) considering viewing angles and lighting.
Record 1–2 hours of video during peak time — obtain 2000+ frames for annotation.
Annotate bounding boxes for 8 PPE classes (hard hat/no hard hat, vest/no vest, etc.).
Fine-tune YOLOv8 with augmentation (rain, dust, glare) and hard negative mining.
Integrate the model via REST API with your video surveillance system.
Configure a compliance dashboard and automatic violation reports.

For example, one of our clients on a high-rise project reduced safety incidents by 70% after implementing our system.

Why Is Missing PPE Detection Difficult in Practice?

Partial occlusion: a worker is half-visible behind a structure. Head in frame — check hard hat. If head is not visible — do not penalize.
Small objects at a distance: a worker at 40 meters, bbox 30×90 px. Hard hat 8×8 px. YOLOv8l at such resolutions gives recall ~70%. Solution: PTZ cameras with auto-zoom or additional cameras for distant sections.
Similar objects: construction debris, fabric on scaffolding resemble hard hats in poor lighting. Hard negative mining during fine-tuning — collect 200–300 such examples and add to the training set.

How We Do It: Stack and Implementation

We use Ultralytics YOLOv8, Python 3.11, OpenCV. For tracking — BoT-SORT. Backend — FastAPI, database — PostgreSQL with pgvector for storing embeddings. Deployment on NVIDIA Jetson or cloud GPUs. Intersection over Union (IoU) thresholds are calibrated to minimize false positives while maintaining high recall. Our model employs test-time augmentation (TTA) and non-maximum suppression (NMS) with soft-NMS to reduce duplicate detections.

View PPE Detection Code

from ultralytics import YOLO
import cv2
import numpy as np
from collections import defaultdict

class PPEDetector:
    """
    Model detects both presence and absence of PPE directly.
    Classes: hard_hat, no_hard_hat, safety_vest, no_vest,
            safety_glasses, no_glasses, gloves, no_gloves
    This is more efficient than 'no person — no hard hat'.
    """
    def __init__(self, model_path: str, site_config: dict):
        self.model = YOLO(model_path)
        self.required_ppe = site_config.get('required_ppe', ['hard_hat', 'safety_vest'])
        self.violation_threshold = site_config.get('violation_threshold', 0.5)

        # Suppress duplicate alerts
        self.active_violations: dict[int, dict] = {}
        self.cooldown_frames = 30  # 1 sec @ 30fps

    def detect(self, frame: np.ndarray) -> dict:
        results = self.model.track(frame, persist=True, conf=0.4)

        workers_status = {}
        all_detections = []

        for box in results[0].boxes:
            cls = self.model.names[int(box.cls)]
            conf = float(box.conf)
            bbox = list(map(int, box.xyxy[0]))
            track_id = int(box.id) if box.id is not None else -1

            all_detections.append({
                'class': cls, 'conf': conf,
                'bbox': bbox, 'track_id': track_id
            })

        # Group by workers (person = anchor)
        persons = [d for d in all_detections if d['class'] == 'person']

        for person in persons:
            pid = person['track_id']
            violations = []

            for req_ppe in self.required_ppe:
                no_ppe_class = f'no_{req_ppe}'

                # Is there an explicit 'no PPE' class near the worker?
                for det in all_detections:
                    if det['class'] == no_ppe_class:
                        if self._near_person(det['bbox'], person['bbox']):
                            if det['conf'] > self.violation_threshold:
                                violations.append({
                                    'type': no_ppe_class,
                                    'confidence': det['conf']
                                })

            workers_status[pid] = {
                'bbox': person['bbox'],
                'violations': violations,
                'compliant': len(violations) == 0
            }

        return {
            'workers': workers_status,
            'total_workers': len(persons),
            'violations_count': sum(
                len(w['violations']) for w in workers_status.values()
            ),
            'compliance_rate': (
                sum(1 for w in workers_status.values() if w['compliant'])
                / max(len(persons), 1)
            )
        }

    def _near_person(self, ppe_bbox: list, person_bbox: list,
                      expand: float = 0.3) -> bool:
        """PPE is considered belonging to worker if bbox is near"""
        px1, py1, px2, py2 = person_bbox
        pw = px2 - px1
        ph = py2 - py1

        # Expand worker bbox
        ex1 = px1 - pw * expand
        ey1 = py1 - ph * expand
        ex2 = px2 + pw * expand
        ey2 = py2 + ph * expand

        cx = (ppe_bbox[0] + ppe_bbox[2]) / 2
        cy = (ppe_bbox[1] + ppe_bbox[3]) / 2

        return ex1 <= cx <= ex2 and ey1 <= cy <= ey2

Work Process

Phase	What We Do	Result
1. Analysis	Site visit, recording, identifying complex scenarios	Technical specification, list of cameras and PPE
2. Data Collection	Record video from 2–5 cameras in different conditions	2000+ annotated frames
3. Fine-Tuning	YOLOv8 with augmentation, hard negative mining	Model with mAP > 0.85
4. Integration	Configure tracker, API, dashboard	System in operation
5. Support	Monitoring, retraining when conditions change	6-month guarantee

Estimated Timelines

Scale	Timeline
Hard hat + vest detector (2–4 cameras)	2–4 weeks
Full PPE (6+ types, 10+ cameras)	5–9 weeks
With dashboard and automatic reports	7–12 weeks

What's Included

Fine-tuned detection model (YOLOv8) for your conditions.
REST API for integration with existing video surveillance.
Web dashboard with violation graphs and compliance rate.
Automatic violation reports (PDF/Excel).
Training for safety officers on using the system.
3-month technical support.

Common Errors and Solutions

Problem	Cause	Solution
False positive on shadows	Lack of negative examples	Hard negative mining
Missed hard hat in heavy rain	Raindrops distort shape	Add rain augmentations
Duplicate violation detection	Track duplication	Configure 30-frame cooldown

To evaluate deployment on your site, contact us — we will conduct an audit and prepare a proposal. Order a pilot project on 2–4 cameras to verify system effectiveness. More about Personal protective equipment — terminology and standards.

How Distribution Shift Kills CV Model Metrics in Industry

On a production line, a camera is installed to control product quality. The model is trained on 10,000 labeled images—test accuracy mAP 0.84. Deployed to production, and in the first week it misses 30% of defects. Lighting on the line changes between shifts; distribution shift nullifies the metrics. This is a classic story with computer vision in industry, where pattern recognition fails without proper drift handling.

Our engineers, with experience from 60+ computer vision projects, know how to eliminate such scenarios. We guarantee stable model performance under real conditions.

Object Detection: YOLO, RT-DETR, and Everything in Between

YOLO is the standard for real-time detection. YOLOv8 and YOLOv11 from Ultralytics are the most used versions in production: simple API, active community, built-in validation, and export to ONNX/TensorRT. For tasks with high accuracy requirements and less critical latency, RT-DETR, a transformer-based architecture without NMS, gives better mAP on COCO at comparable speed to YOLOv8l.

Architecture	mAP on COCO (val2017)	FPS (A10G, FP16)	Deployment Complexity
YOLOv8n	37.3	700+	Low (ONNX/TensorRT)
YOLOv8m	50.2	250	Low
RT-DETR-L	53.0	140	Medium (requires PyTorch)
Mask R-CNN	38.2 (bbox)	30	High

A typical mistake when training a detector: dataset of 8000 images, 3 classes, fine-tune YOLOv8m—F1 0.73 on validation. Look at confusion matrix—one class is almost never detected. Cause: imbalance 1:23. Solution: oversampling rare class, focal loss for objectness, augmentations (Mosaic, MixUp disabled for rare class as they "blur" it). Transfer learning is mandatory: pretrained on COCO weights reduces data requirement by 10 times. Fine-tuning on 500–2000 domain images yields a working model in 1–2 days on a single GPU.

For edge deployment: export to ONNX → TensorRT engine. YOLOv8n in TensorRT FP16 on Jetson AGX Orin gives 150+ FPS at P99 latency < 8 ms—3 times faster than ONNX Runtime without TensorRT. On server A10G: 700+ FPS for YOLOv8n in TensorRT INT8.

How Does Fine-Tuning YOLO Help in Pattern Recognition?

Suppose you need to find micro-defects on a metal surface—a task with high resolution and class imbalance. We use YOLOv8m pretrained on COCO and fine-tune on 2000 proprietary images. Apply augmentations Mosaic, MixUp, random perspective. After 200 epochs, mAP 0.5 reaches 0.93. Key techniques:

Focal loss for the objectness head—reduces contribution of easily classified examples.
Class-balanced sampling—equalizes representation of rare classes.
Test Time Augmentation (TTA)—increases recall by 5–7% through averaging over flips and scales.

Get a consultation on architecture selection for your task—contact us.

Segmentation: SAM, Mask R-CNN, and Instance Segmentation

SAM (Segment Anything Model) from Meta changed the approach to segmentation. SAM 2 works with video, supports object tracking across frames—for interactive object selection by point or bbox, it's the best out-of-the-box choice. For production instance segmentation without interactive prompting, Mask R-CNN or YOLOv8-seg are used. YOLOv8-seg trains like a regular detector with additional masks, convenient in the same pipelines. Semantic segmentation (each pixel is a class) uses SegFormer, DeepLabV3+. SegFormer-B5 provides a good balance of accuracy and speed for satellite imagery or medical segmentation.

Case study: cell segmentation on microscopic images. Dataset of 400 images with manual annotation. Training Mask R-CNN on ResNet-50 backbone gave IoU 0.61—poor. Problem: objects (cells) overlap; standard NMS kills overlapping predictions. Solution: switch to cellpose (specialized architecture for biomedical tasks) + soft-NMS. IoU increased to 0.79.

OCR: When Tesseract Fails

Tesseract is a starting point for simple tasks: printed text, good lighting, straight layout. As soon as there are handwritten elements, non-standard fonts, perspective distortions, or multi-column layouts, Tesseract degrades quickly.

PaddleOCR is a production-grade solution: text block detection + recognition + structural analysis. Works out of the box for 80+ languages, including Russian. Supports tables and complex document structures. TrOCR (Microsoft) is a transformer OCR with strong results on handwritten text. For Russian handwritten text, fine-tuning is needed: the base model is trained mostly on Latin script.

What to Do When Tesseract Cannot Handle Pattern Recognition on Documents?

For tasks like "extract data from invoices/contracts/passports," we use LayoutLMv3 or Donut—these models understand document layout, not just text. Integration via Hugging Face Transformers, fine-tuning on 200–500 annotated documents. Typical pipeline:

Preprocessing: deskew, denoising, binarization via OpenCV.
Text block detection: PaddleOCR detection or CRAFT.
Recognition: PaddleOCR recognition or TrOCR.
Post-processing: normalization, validation via regex or LLM for structured fields.

For documents with fixed structure, template matching + OCR by coordinates is often more reliable than an end-to-end solution.

Face Recognition: Identification and Verification

Face recognition = detection + alignment + embedding + matching. Each stage matters.

Detection: RetinaFace or InsightFace for accurate face localization and keypoints. MTCNN is older but reliable. Embedding: ArcFace (InsightFace) is state-of-the-art for face recognition embeddings. Models iresnet50/iresnet100 pretrained on MS1MV3 (5M identities). Embedding vector 512 float32, comparison by cosine similarity. Threshold tuning: decision threshold is a critical parameter. At threshold 0.6, typical FPR on LFW benchmark is 0.001, TPR is 0.985. In production, threshold must be calibrated to the real distribution: people in masks, with changed appearance, different lighting conditions. Liveness detection is mandatory: MiniFASNet—lightweight model on CPU; FaceX-Zoo contains several pretrained liveness detectors.

Video Analytics

Video is a sequence of frames plus a temporal dimension. A naive approach—detecting on every frame—is expensive.

Tracking: ByteTrack and BoT-SORT are the standard for multi-object tracking. They work on top of any detector, adding persistent IDs to objects across frames—enabling object counting, motion tracking, velocity.

Optimization: not every frame needs processing. For static scenes, detect every 5–10 frames, with tracking in between. For event detection (person entering a zone), background subtraction (OpenCV MOG2) serves as a lightweight pre-filter before neural detection. Action recognition: SlowFast, VideoMAE for action classification. Heavy models—for production use ONNX export + TensorRT or offline processing.

How to Measure Pattern Recognition Model Quality in Production?

Quality monitoring is key to MLOps. We track:

Prediction confidence distribution.
Share of low-confidence predictions (indicator of OOD data).
Drift of input images via feature distribution (embeddings from backbone).

A drop in average confidence from 0.87 to 0.71 over a week is an early signal of distribution shift. NVIDIA Triton Inference Server recommends tracking these metrics via Prometheus. Our certified engineers set up monitoring and guarantee SLA for inference quality.

Deployment of CV Models

For online inference, we use Triton Inference Server (NVIDIA)—production standard for serving CV models. Supports TensorRT, ONNX, PyTorch, dynamic batching, multiple instances. REST and gRPC API. We guarantee stable operation under load.

Edge deployment: ONNX Runtime on ARM/x86 CPU. TensorFlow Lite for mobile devices. OpenVINO for Intel CPU/GPU/VPU—gives 2–3× speedup on Intel hardware compared to ONNX Runtime. After deployment, we hand over the model with documentation and train personnel.

What Is Included in the Work

Stage	Content	Estimated Time
Analysis	Technical specification, architecture selection, data evaluation	3–5 days
Labeling	Image collection, annotation (up to 5000 objects)	1–3 weeks
Training	Model fine-tuning, validation on test set	1–2 weeks
Optimization	Export to ONNX/TensorRT/OpenVINO, testing on target hardware	1–2 weeks
Integration	REST/gRPC API, integration with existing infrastructure	1–2 weeks
Deployment	Deployment on server or edge device, load testing	1 week
Documentation and training	Instructions, staff training, handover of code and model	3–5 days
Support	Technical support for 3 months after launch	—

Deadlines and Cost

A prototype detector on existing data takes 1–2 weeks. Production system with optimization for target hardware takes 4–8 weeks. Full cycle including data labeling (1000–5000 images) takes 2–4 months. Cost is calculated individually for each task. Typical savings from implementing a quality control system can be significant per production line.

We have been in the market for over 5 years and completed 60+ computer vision projects. We will evaluate your project end-to-end—request a consultation to get a quote and technical proposal.