Developing a Real-Time Video Object Detection System
Real-time object detection on video is a task with strict latency requirements: surveillance systems typically need 25+ FPS, while robotics needs 30+ FPS with end-to-end latency under 33 ms. Performance depends on three factors: model architecture, hardware accelerator, and inference pipeline efficiency.
System Architecture
Camera → Frame Capture → Preprocessing → Inference → Postprocessing → Output
               ↓               ↓              ↓
        Frame Skipping  Resize/Normalize  TensorRT/ONNX Runtime
                                          GPU batching
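The capture and inference stages above are best decoupled so that a slow model drops frames instead of accumulating latency. A minimal sketch with a bounded queue; `capture_frame` and `infer` are hypothetical stand-ins for the real capture and model calls:

```python
import queue
import threading

def run_pipeline(capture_frame, infer, num_frames, maxsize=4):
    # Bounded queue between capture and inference: when the model
    # falls behind, new frames are dropped instead of queuing up
    frames = queue.Queue(maxsize=maxsize)
    results = []

    def producer():
        for i in range(num_frames):
            try:
                frames.put_nowait(capture_frame(i))  # drop frame if queue is full
            except queue.Full:
                pass
        frames.put(None)  # sentinel: end of stream

    t = threading.Thread(target=producer)
    t.start()
    while True:
        frame = frames.get()
        if frame is None:
            break
        results.append(infer(frame))
    t.join()
    return results

# Stub capture/inference; maxsize large enough that nothing is dropped here
out = run_pipeline(lambda i: i, lambda f: f * 2, num_frames=5, maxsize=8)
```

With a real model, `maxsize` stays small (2–4 frames) so the queue holds only fresh frames and end-to-end latency stays bounded.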
For RTSP/IP cameras, use GStreamer or FFmpeg for stream capture with hardware decoding (NVDEC on NVIDIA):
import cv2

# Hardware-accelerated RTSP capture: with the CAP_GSTREAMER backend,
# VideoCapture takes a full GStreamer pipeline string, not a URL
pipeline = (
    'rtspsrc location=rtsp://camera_ip/stream latency=0 ! '
    'rtph264depay ! h264parse ! nvh264dec ! '  # NVDEC decode
    'videoconvert ! video/x-raw,format=BGR ! appsink'
)
cap = cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)
Model Optimization for Real-Time
TensorRT optimization provides 2–5x speedup vs PyTorch:
from ultralytics import YOLO

model = YOLO('yolov8n.pt')

# Export to a TensorRT FP16 engine
model.export(
    format='engine',
    half=True,      # FP16 precision
    batch=1,        # or batch=4 for multi-camera batching
    device=0,
    workspace=4,    # GB of GPU memory for engine optimization
)
YOLOv8n with TensorRT FP16 on T4: 280+ FPS at 640×640 resolution.
Frame skipping: run the detector on only a subset of frames. At 30 FPS, detect on every 3rd frame (10 detections/sec) and use a lightweight tracker for the intermediate frames; perceived quality is preserved.
Dynamic batching: group frames from multiple cameras into a single batch for one GPU pass:
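The skip-and-track loop can be sketched as follows; `detect` and `track` are hypothetical stand-ins for the real model and tracker:

```python
def process_stream(frames, detect, track, interval=3):
    """Run full detection every `interval`-th frame; propagate boxes
    with a lightweight tracker in between."""
    outputs = []
    last_boxes = []
    for i, frame in enumerate(frames):
        if i % interval == 0:
            last_boxes = detect(frame)             # expensive: full model
        else:
            last_boxes = track(frame, last_boxes)  # cheap: e.g. IoU/KLT tracking
        outputs.append(last_boxes)
    return outputs

# Stub example: "detection" tags the frame index, "tracking" carries it over
out = process_stream(range(6), detect=lambda f: [f], track=lambda f, b: b)
# frames 0-2 reuse the detection from frame 0, frames 3-5 from frame 3
```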
import numpy as np

class MultiCameraInference:
    def __init__(self, model_path, num_cameras=8):
        self.model = load_trt_model(model_path)  # TensorRT engine wrapper
        self.batch_size = num_cameras

    def process_batch(self, frames: list[np.ndarray]) -> list[list]:
        # Preprocess all camera frames into one batch
        batch = preprocess_batch(frames)  # [N, 3, H, W]
        # Single GPU inference pass for all cameras
        results = self.model.infer(batch)
        return postprocess_batch(results)
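The `preprocess_batch` helper referenced above is left undefined; a minimal numpy-only sketch, with nearest-neighbor resize standing in for `cv2.resize` or GPU preprocessing:

```python
import numpy as np

def preprocess_batch(frames, size=640):
    """Resize each BGR frame, normalize to [0, 1], and stack into an
    NCHW float32 batch (letterboxing omitted for brevity)."""
    batch = []
    for frame in frames:
        h, w = frame.shape[:2]
        rows = np.arange(size) * h // size   # nearest-neighbor row indices
        cols = np.arange(size) * w // size
        img = frame[rows][:, cols]           # resize to (size, size)
        img = img[:, :, ::-1].astype(np.float32) / 255.0  # BGR -> RGB, scale
        batch.append(img.transpose(2, 0, 1))  # HWC -> CHW
    return np.stack(batch)                    # [N, 3, size, size]

# Two dummy 1080p frames -> a [2, 3, 640, 640] batch
frames = [np.zeros((1080, 1920, 3), dtype=np.uint8)] * 2
batch = preprocess_batch(frames)
print(batch.shape)  # (2, 3, 640, 640)
```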
Multi-Camera Systems
For monitoring with 8–32 cameras: one A100/H100 GPU handles up to 32 streams of 1080p@30fps with YOLOv8n. Architecture: shared inference server (Triton) + separate capture processes for each camera.
Throughput:
- NVIDIA T4 (16GB): 8–12 cameras 1080p with YOLOv8m
- NVIDIA A100: 24–32 cameras 1080p with YOLOv8l
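These camera counts can be sanity-checked with back-of-the-envelope arithmetic, assuming the ~280 FPS YOLOv8n figure quoted above and detection on every 3rd frame:

```python
def max_cameras(model_fps, cam_fps=30, detect_interval=3):
    """How many streams one GPU can serve, given the model's batched
    throughput and detection on every detect_interval-th frame."""
    detections_per_cam = cam_fps / detect_interval  # e.g. 10 detections/s per camera
    return int(model_fps // detections_per_cam)

# YOLOv8n at ~280 FPS (TensorRT FP16 on T4):
print(max_cameras(280))  # 28 streams at 10 detections/s each
```

Real capacity is lower once decode, preprocessing, and memory bandwidth are counted, which is consistent with the more conservative figures in the list above.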
Latency Optimization
Pipeline latency = capture + decode + preprocess + inference + postprocess + display
| Stage | Typical Time | Optimized |
|---|---|---|
| Frame capture | 5 ms | 2 ms (NVDEC) |
| Preprocessing | 8 ms | 1 ms (GPU preproc) |
| YOLOv8n inference | 12 ms | 4 ms (TRT FP16) |
| Postprocessing + NMS | 5 ms | 2 ms |
| Total | 30 ms | 9 ms |
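To find where a concrete pipeline actually spends its time, each stage can be wrapped in a small timing context manager (a stdlib sketch; the stage names and `time.sleep` stubs are illustrative):

```python
import time
from contextlib import contextmanager

timings = {}  # accumulated milliseconds per stage

@contextmanager
def stage(name):
    """Accumulate wall-clock time per pipeline stage."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + (time.perf_counter() - t0) * 1000

# Usage inside the per-frame loop (sleeps stand in for the real stages)
with stage('preprocess'):
    time.sleep(0.001)
with stage('inference'):
    time.sleep(0.002)
total_ms = sum(timings.values())
```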
Deployment and Monitoring
Docker container with CUDA 12.x + TensorRT. Metrics: FPS per camera, inference latency, GPU utilization, detection count per class per minute. Alerting via Prometheus + Grafana.
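The per-camera FPS metric can be computed with a small sliding-window meter before being exported as a Prometheus gauge. A stdlib sketch; `FPSMeter` is a hypothetical name:

```python
import collections
import time

class FPSMeter:
    """Rolling per-camera FPS over a sliding window of frame timestamps."""
    def __init__(self, window=100):
        self.stamps = collections.defaultdict(
            lambda: collections.deque(maxlen=window))

    def tick(self, camera_id, now=None):
        # Record one delivered frame for this camera
        self.stamps[camera_id].append(time.monotonic() if now is None else now)

    def fps(self, camera_id):
        ts = self.stamps[camera_id]
        if len(ts) < 2:
            return 0.0
        return (len(ts) - 1) / (ts[-1] - ts[0])

# Simulated camera delivering a frame every 1/30 s
meter = FPSMeter()
for i in range(31):
    meter.tick('cam0', now=i / 30)
print(round(meter.fps('cam0')))  # 30
```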
Typical development timelines:

| System Scale | Timeline |
|---|---|
| 1–4 cameras, basic detection | 2–3 weeks |
| 8–32 cameras, custom classes | 4–7 weeks |
| 50+ cameras, distributed architecture | 8–14 weeks |