Developing an In-Frame Object Counting System
Counting objects in images or video is more nuanced than it looks. A plain "detect and count boxes" pipeline works only when there are few objects and each one is clearly visible. In dense scenes (crowds, crops in fields, cells under a microscope, cars in a parking lot), detector accuracy drops because objects overlap and occlude one another. Such cases call for specialized approaches: density maps and crowd counting models.
Approach 1: Detection + Counting
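For sparse scenes (fewer than 50 objects in frame, little overlap between them), use YOLOv8/YOLO11 and count the boxes:

```python
from ultralytics import YOLO

model = YOLO('yolov8m.pt')


def count_objects(image_path: str, target_class: str) -> int:
    """Count detections of a single class in one image."""
    # model.names maps class id -> class name
    matches = [k for k, v in model.names.items() if v == target_class]
    if not matches:
        raise ValueError(f"Unknown class: {target_class!r}")
    target_id = matches[0]

    results = model(image_path, conf=0.4, iou=0.5)
    count = 0
    for result in results:
        # boxes.cls holds the predicted class id of every detected box
        count += int((result.boxes.cls == target_id).sum().item())
    return count
```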
Approach 2: Density Map for Dense Clusters
For scenes with hundreds or thousands of objects in frame: counting people in crowds, grain in fields, cells under a microscope.
A density map is an image in which each pixel holds the local "density" of objects in its neighborhood. The integral of the density map (in practice, the sum over all pixels) equals the object count.
```python
import torch
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights


class CSRNet(nn.Module):
    """Congested Scene Recognition Network (CSRNet) for crowd counting."""

    def __init__(self):
        super().__init__()
        # Frontend: VGG16 convolutional layers up to conv4_3 + ReLU (no FC layers).
        # `weights=` replaces the deprecated `pretrained=True` (torchvision >= 0.13).
        vgg = vgg16(weights=VGG16_Weights.IMAGENET1K_V1)
        self.frontend = nn.Sequential(*list(vgg.features.children())[:23])
        # Backend: dilated convolutions enlarge the receptive field
        # without reducing the resolution any further
        self.backend = nn.Sequential(
            nn.Conv2d(512, 512, 3, padding=2, dilation=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 256, 3, padding=2, dilation=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 128, 3, padding=2, dilation=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, 3, padding=2, dilation=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1)
        )

    def forward(self, x):
        x = self.frontend(x)
        # Density map at roughly 1/8 of the input resolution: (N, 1, H/8, W/8)
        density_map = self.backend(x)
        # Sum per image, so a batch yields one count per sample
        count = density_map.sum(dim=(1, 2, 3))
        return density_map, count
```
Ground truth for training comes from dot annotations: one point per object. The target density map is generated from these points by placing a Gaussian kernel at each one.
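The conversion from dot annotations to a density map can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `make_density_map`, the fixed `sigma=4.0`, and the sample points are all assumptions made for the example, and every point is assumed to lie inside the image. Each Gaussian is normalized to sum to 1, so the whole map sums to the number of annotated objects.

```python
import numpy as np


def make_density_map(points, height, width, sigma=4.0):
    """Build a density map from (x, y) dot annotations.

    Hypothetical helper: each point contributes a 2D Gaussian that is
    normalized to sum to 1 over the image, so density.sum() == len(points).
    Assumes every point lies inside the image.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    density = np.zeros((height, width), dtype=np.float64)
    for px, py in points:
        kernel = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2.0 * sigma ** 2))
        # Renormalize so that mass cut off at the image border is not lost
        density += kernel / kernel.sum()
    return density


# Hypothetical annotations: (x, y) pixel coordinates of three objects
points = [(20, 30), (22, 31), (100, 60)]
density = make_density_map(points, height=128, width=160)
print(round(float(density.sum()), 6))  # 3.0
```

When such a map is used as a target for a network whose output is smaller than the input (the CSRNet frontend above reduces resolution by about 8x), the target would have to be resized to match, with its values rescaled so that the sum is preserved.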
Approach 3: Counting via Line (Line Crossing)
For counting vehicles on a road or people passing through a doorway in video: object tracking plus a virtual line.
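```python
class LineCrossingCounter:
    def __init__(self, line_start, line_end):
        self.line = (line_start, line_end)
        self.counted_ids = set()   # track ids that were already counted
        self.count = 0
        self.prev_positions = {}   # track id -> last seen center (x, y)

    @staticmethod
    def _side(a, b, p):
        # Sign of the 2D cross product: which side of line a->b the point p is on
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

    def _crosses_line(self, prev_pos, cur_pos):
        # One possible test: the two segments intersect if the endpoints of
        # each one lie on opposite sides of the other
        a, b = self.line
        return (self._side(a, b, prev_pos) * self._side(a, b, cur_pos) < 0
                and self._side(prev_pos, cur_pos, a) * self._side(prev_pos, cur_pos, b) < 0)

    def update(self, track_id, center_x, center_y):
        cur_pos = (center_x, center_y)
        prev_pos = self.prev_positions.get(track_id)
        if (prev_pos is not None and track_id not in self.counted_ids
                and self._crosses_line(prev_pos, cur_pos)):
            self.count += 1
            self.counted_ids.add(track_id)
        self.prev_positions[track_id] = cur_pos
```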
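A crossing test of the kind `_crosses_line` relies on can be built from the sign of a 2D cross product. The standalone sketch below is illustrative only: the function names `side` and `crossing_direction`, the line coordinates, and the sample track are assumptions. It also reports the direction of the crossing, which goes one step beyond the class above and may be useful for separate "in" and "out" counts at a doorway.

```python
def side(a, b, p):
    """Cross product sign: > 0 if p is on one side of line a->b, < 0 on the other."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def crossing_direction(line_start, line_end, prev_pos, cur_pos):
    """Return +1 or -1 if the move prev_pos -> cur_pos crosses the segment, else 0."""
    s_prev = side(line_start, line_end, prev_pos)
    s_cur = side(line_start, line_end, cur_pos)
    if s_prev * s_cur >= 0:
        return 0  # both positions on the same side (or touching the line)
    # The movement must also pass between the endpoints of the virtual line
    if side(prev_pos, cur_pos, line_start) * side(prev_pos, cur_pos, line_end) >= 0:
        return 0
    return 1 if s_cur > 0 else -1


# Hypothetical data: a horizontal virtual line at y = 100 and the centers
# of one tracked object over four consecutive frames
line = ((0, 100), (200, 100))
track = [(50, 80), (52, 95), (54, 105), (56, 120)]
directions = [crossing_direction(*line, p, q) for p, q in zip(track, track[1:])]
print(directions)  # [0, 1, 0]
```

Which sign corresponds to "in" or "out" depends on how the line endpoints are ordered, so it has to be fixed by convention for each camera.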
Applications and Metrics
| Application | Approach | Metric |
|---|---|---|
| Counting vehicles on road | Tracking + line | Accuracy, false count rate |
| Counting people in crowds | Density map (CSRNet) | MAE, RMSE |
| Counting cells under microscope | Density map | MAE |
| Counting fruits on plantation | YOLO + counting | mAP, MAE |
| Product inventory on shelf | YOLO + counting | Accuracy |
Reported CSRNet results on the ShanghaiTech dataset:
- Part A (dense crowds): MAE 68.2, RMSE 115.0
- Part B (sparse): MAE 10.6, RMSE 16.0
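For counting, MAE and RMSE are computed over per-image counts rather than per-pixel values. The sketch below shows the arithmetic; the function name `count_errors` and the sample counts are hypothetical and serve only as an illustration.

```python
import math


def count_errors(predicted, actual):
    """Return (MAE, RMSE) over per-image predicted and ground-truth counts."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [p - a for p, a in zip(predicted, actual)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse


# Hypothetical predicted and true counts for four test images
mae, rmse = count_errors([102, 48, 310, 15], [100, 50, 300, 15])
print(mae, round(rmse, 3))  # 3.5 5.196
```

RMSE penalizes large misses on individual images more heavily than MAE, which is why the two are usually reported together.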
| Task | Timeline |
|---|---|
| Detection-based counting, off-the-shelf model | 1–2 weeks |
| Density map, custom domain | 3–5 weeks |
| Complex system (video + analytics) | 4–7 weeks |







