Computer Vision: Detection, Segmentation, OCR, and Video Analytics
A camera on the production line monitors quality. A model trained on 10,000 annotated images achieves mAP 0.84. It ships to production, and within the first week 30% of defects pass through undetected. It turns out the lighting changes from shift to shift, and the resulting distribution shift kills the metrics. A classic story of computer vision in industry.
Object Detection: YOLO, RT-DETR, and Everything in Between
YOLO is the standard for real-time detection. YOLOv8 and YOLOv11 from Ultralytics are the most widely used in production: a simple API, an active community, good documentation, built-in validation, and ONNX/TensorRT export.
For high-accuracy tasks where latency is less critical, there is RT-DETR (Real-Time DEtection TRansformer): a transformer-based architecture that needs no NMS and delivers better mAP on COCO at speeds comparable to YOLOv8l.
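Both NMS and the mAP metric rest on one primitive: intersection-over-union between two boxes. A minimal sketch in plain Python, assuming corner-format `(x1, y1, x2, y2)` boxes:

```python
def iou(box_a, box_b):
    """Intersection-over-Union for two boxes in (x1, y1, x2, y2) format."""
    # Intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Classic NMS keeps the highest-scoring box and discards any box whose IoU with it exceeds a threshold; RT-DETR's appeal is that it skips this step entirely.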
A common detector-training mistake: an 8,000-image dataset, 3 classes, a YOLOv8m fine-tune, F1 0.73 on validation. The confusion matrix shows one class is barely detected. The reason: a 1:23 imbalance in favor of the other two classes. The fix: oversample the rare class, use focal loss instead of BCE loss for objectness, and disable the heavy augmentations (Mosaic, MixUp) for the rare class.
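The oversampling part of that fix can be sketched as inverse-frequency sampling weights. The function and its input format below are illustrative (not an Ultralytics API); the output is the kind of weight vector a `WeightedRandomSampler`-style loader would consume:

```python
from collections import Counter

def sampling_weights(image_labels):
    """Per-image sampling weights that upweight images containing rare classes.

    image_labels: list of lists, the class ids present in each image.
    Weight is proportional to 1 / frequency of the rarest class in the image.
    """
    freq = Counter(c for labels in image_labels for c in labels)
    total = sum(freq.values())
    raw = [total / min(freq[c] for c in labels) for labels in image_labels]
    # Normalize so the weights form a sampling distribution
    s = sum(raw)
    return [w / s for w in raw]
```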
Transfer learning and fine-tuning. COCO or ImageNet pretrained weights are the mandatory starting point; full training from scratch requires millions of examples. Fine-tuning on 500-2,000 domain images with proper augmentation yields a working model in 1-2 days on a single GPU.
Export and optimization. For edge deployment, the usual path is export to ONNX, then build a TensorRT engine. YOLOv8n in TensorRT FP16 on a Jetson AGX Orin reaches 150+ FPS with P99 latency under 8 ms. On a server-class A10G: 700+ FPS for YOLOv8n in TensorRT INT8.
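Tail-latency claims like "P99 under 8 ms" should come from recorded per-frame timings, not from an average. A minimal nearest-rank percentile sketch:

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples <= it."""
    ranked = sorted(values)
    k = math.ceil(p / 100 * len(ranked)) - 1
    return ranked[max(0, k)]

# Usage: p99 = percentile(frame_latencies_ms, 99)
```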
Segmentation: SAM, Mask R-CNN and Instance Segmentation
SAM (Segment Anything Model) from Meta changed the approach to segmentation. SAM 2 works with video and supports tracking an object across frames. For "segment an object given a prompt (a point or a bounding box)" tasks, SAM is the best out-of-the-box option.
For production instance segmentation without an interactive prompt, use Mask R-CNN or YOLOv8-seg. YOLOv8-seg trains like a normal detector with masks added and fits into the same pipelines.
For semantic segmentation (a class for every pixel): SegFormer or DeepLabV3+. SegFormer-B5 is a good balance of accuracy and speed for satellite imagery or medical segmentation.
Case: cell segmentation on microscopy images, a 400-image dataset with manual annotation. Mask R-CNN on a ResNet-50 backbone gave IoU 0.61, which is poor. The problem: cells overlap, and standard NMS suppresses overlapping predictions. The fix: switching to Cellpose (specialized for biomedical imagery) plus soft-NMS raised IoU to 0.79.
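A minimal sketch of Gaussian soft-NMS, the variant used in that fix: instead of discarding an overlapping box outright, its score is decayed in proportion to the overlap. Plain Python with corner-format boxes; `sigma` and `score_thresh` values are illustrative defaults:

```python
import math

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian soft-NMS: decay overlapping scores by exp(-IoU^2 / sigma)."""
    scores = list(scores)
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        if scores[best] < score_thresh:
            break  # everything left has been decayed into noise
        keep.append(best)
        remaining.remove(best)
        for i in remaining:
            scores[i] *= math.exp(-iou(boxes[best], boxes[i]) ** 2 / sigma)
    return keep, scores
```

Two heavily overlapping cells both survive with reduced confidence, which is exactly what hard NMS gets wrong on this kind of data.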
OCR: When Tesseract Falls Short
Tesseract is the starting point for simple cases: printed text, good lighting, upright orientation. With handwritten elements, non-standard fonts, perspective distortion, or multi-column layouts, Tesseract degrades quickly.
PaddleOCR is a production-grade solution: text-block detection, recognition, and structural analysis. It works out of the box for 80+ languages, including Russian, and handles tables and documents with complex layouts.
TrOCR (Microsoft) is a transformer-based OCR with strong results on handwritten text. For Russian handwriting it needs fine-tuning: the base model was trained mainly on Latin script.
Document understanding. For "extract data from an invoice / contract / passport" tasks: LayoutLMv3 or Donut. These models understand document layout, not just the text. Integration goes through Hugging Face Transformers, with fine-tuning on 200-500 annotated documents.
A typical production OCR pipeline:
- Preprocessing: deskew, denoising, binarization via OpenCV
- Text block detection: PaddleOCR detection or CRAFT
- Recognition: PaddleOCR recognition or TrOCR
- Post-processing: normalization, validation via regex or LLM for structured fields
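The post-processing step can be sketched as normalization of common OCR character confusions followed by regex validation. The field names, patterns, and substitution table here are illustrative, not a fixed standard:

```python
import re

# Hypothetical structured fields extracted by the recognition stage
FIELD_PATTERNS = {
    "date":   re.compile(r"\d{2}\.\d{2}\.\d{4}"),
    "amount": re.compile(r"\d+(\.\d{2})?"),
}

# Frequent OCR confusions in numeric fields: O/o -> 0, l/I -> 1, S -> 5
OCR_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"})

def normalize_and_validate(field, raw):
    """Normalize OCR confusions, then validate against the field's pattern."""
    value = raw.strip().translate(OCR_FIXES)
    ok = FIELD_PATTERNS[field].fullmatch(value) is not None
    return value, ok
```

Fields that fail validation are the natural candidates for the LLM-based fallback mentioned above, or for manual review.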
For documents with a fixed structure (standard forms), template matching plus OCR at known coordinates is often more reliable and faster than an end-to-end model.
Face Recognition: Identification and Verification
Face recognition = detection + alignment + embedding + matching. Each stage matters.
Detection. RetinaFace or InsightFace for precise face localization and facial keypoints. MTCNN is older but reliable.
Embedding. ArcFace (InsightFace) is the state of the art for face recognition embeddings, with iresnet50/iresnet100 models pretrained on MS1MV3 (roughly 5M images). The output is a 512-dimensional float32 vector compared by cosine similarity.
Threshold tuning. The decision threshold is a critical parameter. At a cosine threshold of 0.6, typical FPR on LFW is 0.001 with TPR 0.985. In production, calibrate the threshold on the real distribution: masked faces, changed appearance, varied lighting.
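Verification itself then reduces to cosine similarity against the calibrated threshold. A minimal sketch over plain Python lists; in practice the vectors would be the 512-dimensional ArcFace embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def verify(probe, reference, threshold=0.6):
    """Same identity iff similarity clears the (calibrated) threshold."""
    return cosine_similarity(probe, reference) >= threshold
```

Identification against a gallery is the same comparison repeated: take the reference with the highest similarity, and reject if even the best match falls below the threshold.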
Liveness detection. Serious production systems need anti-spoofing: protection against printed photos, replayed video, and 3D masks. MiniFASNet is a lightweight model that runs on CPU; FaceX-Zoo ships several pretrained liveness detectors.
Video Analytics
Video is a sequence of frames plus a temporal dimension. The naive approach of running detection on every frame works, but it is expensive.
Tracking. ByteTrack and BoT-SORT are the standard for multi-object tracking. They sit on top of any detector and add persistent IDs across frames, which enables counting, movement trajectories, and velocity estimation.
Optimization. Don't process every frame. For static scenes: run detection every 5-10 frames and let the tracker fill in between. For event detection (a person entering a zone): background subtraction (OpenCV MOG2) as a lightweight pre-filter ahead of the neural detector.
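The detect-every-N-frames logic with a motion pre-filter can be sketched as a small scheduler. The class and its parameter values are illustrative; `motion_score` is assumed to come from something like the foreground-pixel fraction of an MOG2 mask:

```python
class FrameScheduler:
    """Per frame, decide between the heavy detector and the cheap tracker.

    Runs the detector on every Nth frame, or earlier when the
    background-subtraction pre-filter reports enough motion.
    """
    def __init__(self, detect_every=8, motion_thresh=0.02):
        self.detect_every = detect_every
        self.motion_thresh = motion_thresh
        self.frame_idx = -1

    def step(self, motion_score=0.0):
        self.frame_idx += 1
        if self.frame_idx % self.detect_every == 0 or motion_score > self.motion_thresh:
            return "detect"
        return "track"
```

With `detect_every=8` at 30 FPS, the detector runs under 4 times per second on a quiet scene, while a motion spike still triggers it immediately.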
Action recognition. SlowFast and VideoMAE for video action classification. These are heavy models that demand significant compute; for production, export to ONNX + TensorRT or process offline.
Deploying CV Models
Online inference. Triton Inference Server (NVIDIA) is the production standard for serving CV models. It supports TensorRT, ONNX, and PyTorch backends, dynamic batching, and multiple model instances, exposed over REST and gRPC APIs.
Edge deployment. ONNX Runtime on ARM/x86 CPUs, TensorFlow Lite for mobile, OpenVINO for Intel CPU/GPU/VPU; the latter often gives a 2-3x speedup on Intel hardware versus ONNX Runtime.
Quality monitoring. For CV in production, monitor the prediction confidence distribution, the share of low-confidence predictions (an indicator of out-of-distribution data), and input image drift via feature distributions (embeddings from the backbone). An average confidence dropping from 0.87 to 0.71 over a week is an early signal of distribution shift.
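A rolling-window confidence monitor along those lines, as a sketch; the window size and alert margin are assumptions to tune per deployment:

```python
from collections import deque

class ConfidenceDriftMonitor:
    """Track the rolling mean of prediction confidence and alert when it
    drops more than `drop` below the reference mean captured at deploy time."""
    def __init__(self, reference_mean, window=1000, drop=0.10):
        self.reference = reference_mean
        self.window = deque(maxlen=window)
        self.drop = drop

    def observe(self, confidence):
        """Record one prediction's confidence; return True on a drift alert."""
        self.window.append(confidence)
        current = sum(self.window) / len(self.window)
        return (self.reference - current) > self.drop
```

The same pattern applies to embedding statistics: keep a reference distribution from validation data and compare rolling production statistics against it.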
Timelines and Stages
A detector prototype on existing data takes 1-2 weeks. A production system optimized for the target hardware: 4-8 weeks. A full cycle including data annotation (1,000-5,000 images): 2-4 months. Cost depends on dataset volume, target platform, and accuracy/latency requirements.