Object Detection System Development
Object detection is the task of simultaneously localizing (bounding box) and classifying objects in an image. In a single forward pass, one model produces box coordinates, an object class, and a confidence score for each detection. Applications include product counting on shelves, defect detection on conveyor lines, vehicle recognition, and human detection in video.
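A detector's per-image output can be modeled as a list of (box, class, score) records. A minimal sketch of that structure (the `Detection` class and the sample values are illustrative, not from any specific library):

```python
from dataclasses import dataclass

@dataclass
class Detection:
    box: tuple       # (x1, y1, x2, y2) in pixel coordinates
    class_name: str  # predicted category
    score: float     # confidence in [0, 1]

# One forward pass yields a list of such detections per image
preds = [
    Detection((120, 40, 310, 220), 'person', 0.91),
    Detection((400, 180, 520, 260), 'car', 0.78),
]
```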
Detector Selection
YOLOv8/YOLO11 — optimal choice for most tasks. Ultralytics implementation with good documentation, active support, and built-in export to TensorRT/ONNX.
RT-DETR (Real-Time Detection Transformer) — transformer-based detector, better quality at comparable speed to YOLOv8.
Grounding DINO — open-vocabulary detection: finds objects by text description without retraining. Useful for prototyping and rare category tasks.
| Model | [email protected] COCO | FPS (T4) | Parameters |
|---|---|---|---|
| YOLOv8n | 52.9 | 320 | 3.2M |
| YOLOv8l | 64.9 | 87 | 43.7M |
| YOLO11m | 64.0 | 183 | 20.1M |
| RT-DETR-L | 65.6 | 74 | 32M |
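Model choice usually reduces to: meet the latency budget first, then maximize accuracy. A hypothetical helper that encodes the table above (T4 benchmark numbers) and picks the highest-mAP model satisfying an FPS requirement:

```python
# Benchmark numbers copied from the table above: ([email protected], FPS on T4, params in M)
BENCHMARKS = {
    'yolov8n':  (52.9, 320, 3.2),
    'yolo11m':  (64.0, 183, 20.1),
    'yolov8l':  (64.9, 87, 43.7),
    'rtdetr-l': (65.6, 74, 32.0),
}

def pick_model(min_fps):
    """Return the highest-mAP model that still meets the FPS requirement."""
    viable = [(name, stats) for name, stats in BENCHMARKS.items()
              if stats[1] >= min_fps]
    if not viable:
        raise ValueError(f'no model reaches {min_fps} FPS on a T4')
    return max(viable, key=lambda item: item[1][0])[0]
```

For example, a 100 FPS budget rules out YOLOv8l and RT-DETR-L, leaving YOLO11m as the most accurate viable option.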
Fine-tuning for Custom Classes
```python
from ultralytics import YOLO

# Load pretrained model
model = YOLO('yolov8l.pt')

# Train on custom dataset
results = model.train(
    data='dataset.yaml',    # dataset config path
    epochs=100,
    imgsz=640,
    batch=16,
    optimizer='AdamW',
    lr0=0.001,
    lrf=0.01,               # final LR = lr0 * lrf
    weight_decay=0.0005,
    degrees=10.0,           # rotation augmentation
    mosaic=1.0,             # mosaic augmentation
    device=0,
)
```
dataset.yaml structure:

```yaml
path: /data/myproject
train: images/train
val: images/val
test: images/test
nc: 5  # number of classes
names: ['cat', 'dog', 'car', 'person', 'bicycle']
```
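A common failure mode is `nc` drifting out of sync with `names` as classes are added. A minimal sanity check (the `config` dict mirrors the dataset.yaml above; the `validate` helper is illustrative):

```python
# Same fields as dataset.yaml, represented as a dict for checking
config = {
    'path': '/data/myproject',
    'train': 'images/train',
    'val': 'images/val',
    'test': 'images/test',
    'nc': 5,
    'names': ['cat', 'dog', 'car', 'person', 'bicycle'],
}

def validate(cfg):
    """Ensure the declared class count matches the names list."""
    assert cfg['nc'] == len(cfg['names']), \
        f"nc={cfg['nc']} but {len(cfg['names'])} names listed"
    return True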
Detection-Specific Augmentation
Detection requires specific augmentation — transformations must correctly apply to bounding boxes:
- Mosaic — combining 4 images into one, increases context diversity
- MixUp — blending two images with weights
- Copy-Paste — cutting objects and pasting in new context
- Random crop preserving objects in frame
- Albumentations: HorizontalFlip, RandomBrightnessContrast, GaussNoise
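The key point for detection augmentation is that box coordinates must be transformed along with pixels. A sketch of mosaic on toy data (images as 2D lists, boxes as `(x1, y1, x2, y2, class)` tuples; real pipelines operate on arrays):

```python
def mosaic(images, boxes_per_image):
    """Combine 4 equally sized images into a 2x2 mosaic,
    shifting each image's boxes into mosaic coordinates."""
    h, w = len(images[0]), len(images[0][0])
    offsets = [(0, 0), (0, w), (h, 0), (h, w)]  # (dy, dx) of each tile
    canvas = [[0] * (2 * w) for _ in range(2 * h)]
    out_boxes = []
    for img, boxes, (dy, dx) in zip(images, boxes_per_image, offsets):
        for y in range(h):
            for x in range(w):
                canvas[dy + y][dx + x] = img[y][x]
        # Boxes move by the same offset as their source image
        for (x1, y1, x2, y2, cls) in boxes:
            out_boxes.append((x1 + dx, y1 + dy, x2 + dx, y2 + dy, cls))
    return canvas, out_boxes
```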
Detection Metrics
- [email protected] — mean Average Precision at IoU threshold 0.5
- [email protected]:0.95 — stricter: average mAP at IoU from 0.5 to 0.95 with 0.05 step
- Precision / Recall at specific confidence threshold
- FPS / latency — for real-time systems
Confidence threshold selection: plot the precision-recall curve on a validation set and choose the threshold that gives an acceptable precision/recall trade-off for the specific application.
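Threshold selection can be sketched as follows: given scored detections matched against ground truth, compute precision/recall at each candidate threshold and take the lowest (best-recall) threshold that still meets a precision floor. The helpers and data format here are illustrative:

```python
def pr_at_threshold(detections, n_gt, thr):
    """Precision/recall for detections [(score, is_true_positive), ...]
    against n_gt ground-truth objects, at confidence threshold thr."""
    kept = [tp for score, tp in detections if score >= thr]
    tp = sum(kept)
    precision = tp / len(kept) if kept else 1.0
    recall = tp / n_gt
    return precision, recall

def pick_threshold(detections, n_gt, min_precision):
    """Lowest threshold (hence best recall) meeting the precision floor."""
    for thr in sorted({score for score, _ in detections}):
        precision, _ = pr_at_threshold(detections, n_gt, thr)
        if precision >= min_precision:
            return thr
    return None  # no threshold achieves the required precision
```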
NMS and Post-processing
Non-Maximum Suppression (NMS) removes duplicate detections of the same object. Key parameters: IoU threshold (typically 0.45–0.7) and confidence threshold (0.25–0.5). For densely packed objects, consider Soft-NMS or class-agnostic NMS.
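Greedy NMS itself is short: repeatedly keep the highest-scoring box and drop any remaining box that overlaps it above the IoU threshold. A self-contained sketch (frameworks ship optimized versions, e.g. batched per-class variants):

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, iou_thr=0.45):
    """Greedy NMS: keep highest-scoring boxes, suppress overlaps.
    Returns indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thr]
    return keep
```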
Deployment
TensorRT engine for NVIDIA GPU: export from Ultralytics with one command. ONNX for CPU deployment. For Raspberry Pi / Jetson: YOLO11n in TFLite / ONNX.
| Task | Timeline |
|---|---|
| Detection of 1–5 classes, sufficient data | 1–3 weeks |
| Detection of 20+ classes, data collection | 4–7 weeks |
| Detection in challenging conditions (night, fog) | 6–10 weeks |