AI System for Planogram Control in Retail
A planogram is a product placement plan on a shelf. Traditional planogram compliance control is done manually by a merchandiser once a week. Automation via CV cameras or photos from sales representatives reduces reaction time from days to minutes.
Task: From Shelf Photo to Violation Report
The pipeline consists of three steps: detection of products on the shelf → SKU identification of each detected product → comparison of the detected layout with the reference planogram.
from ultralytics import YOLO
import numpy as np
from PIL import Image
import torch
import torch.nn.functional as F
class PlanogramComplianceChecker:
    """Check a shelf photo against a reference planogram.

    Pipeline:
        Step 1: fine-tuned YOLO detects all products on the shelf (bbox + conf).
        Step 2: CLIP identifies the specific SKU of each crop via cosine
                similarity against a precomputed SKU embedding index.
        Step 3: detected (row, col) grid positions are compared with the
                reference planogram to produce a violation report.
    """
    def __init__(
        self,
        detector_path: str,       # fine-tuned YOLO weights for shelf products
        sku_embeddings_path: str, # .npz with 'embeddings' and 'sku_ids'
        planogram: dict           # {"row_col": sku_id} expected layout
    ):
        self.detector = YOLO(detector_path)
        sku_data = np.load(sku_embeddings_path)
        # (N_SKU, embedding_dim); assumed L2-normalized by the indexing script
        self.sku_embeddings = torch.from_numpy(
            sku_data['embeddings']
        ).float()
        self.sku_ids = sku_data['sku_ids'].tolist()
        self.planogram = planogram
        # Fall back to CPU when no GPU is available instead of crashing
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        # CLIP for SKU identification
        from transformers import CLIPProcessor, CLIPModel
        self.clip_model = CLIPModel.from_pretrained(
            'openai/clip-vit-large-patch14'
        ).eval().to(self.device)
        self.clip_processor = CLIPProcessor.from_pretrained(
            'openai/clip-vit-large-patch14'
        )
    def analyze_shelf(
        self,
        shelf_image: Image.Image,
        confidence_threshold: float = 0.5
    ) -> dict:
        """Run the full detect → identify → compare pipeline on one photo.

        Returns the compliance report dict produced by _check_compliance.
        """
        img_array = np.array(shelf_image)
        # Step 1: detection
        detections = self.detector.predict(
            img_array, conf=confidence_threshold, verbose=False
        )[0]
        shelf_products = []
        for box in detections.boxes:
            x1, y1, x2, y2 = map(int, box.xyxy[0])
            crop = shelf_image.crop((x1, y1, x2, y2))
            # Step 2: SKU identification via CLIP
            sku_id, similarity = self._identify_sku(crop)
            shelf_products.append({
                'bbox': [x1, y1, x2, y2],
                'sku_id': sku_id,
                'confidence': float(box.conf),
                'sku_similarity': float(similarity),
                'position': self._get_shelf_position(
                    [x1, y1, x2, y2], img_array.shape
                )
            })
        # Step 3: comparison with planogram
        compliance = self._check_compliance(shelf_products)
        return compliance
    @torch.no_grad()
    def _identify_sku(
        self, crop: Image.Image
    ) -> tuple[str, float]:
        """Return (sku_id, cosine_similarity) of the closest indexed SKU."""
        inputs = self.clip_processor(
            images=crop, return_tensors='pt'
        ).to(self.device)
        features = self.clip_model.get_image_features(**inputs)
        features = F.normalize(features, dim=-1).cpu()
        # (1, N_SKU) cosine similarities. squeeze(0) — not squeeze() — so a
        # single-SKU index does not collapse to a 0-dim tensor, which would
        # make similarities[best_idx] raise IndexError.
        similarities = (features @ self.sku_embeddings.T).squeeze(0)
        best_idx = similarities.argmax().item()
        return self.sku_ids[best_idx], float(similarities[best_idx])
    def _get_shelf_position(
        self, bbox: list, img_shape: tuple
    ) -> dict:
        """Map a bbox center to a coarse 10x5 grid cell (col 0-9, row 0-4)."""
        h, w = img_shape[:2]
        cx = (bbox[0] + bbox[2]) / 2
        cy = (bbox[1] + bbox[3]) / 2
        return {
            'col': int(cx / w * 10),  # 0-9 — ten columns
            'row': int(cy / h * 5)    # 0-4 — five rows
        }
    def _check_compliance(self, shelf_products: list) -> dict:
        """Compare detected products with the planogram; list violations.

        Violation types:
            out_of_stock  — planogram position has no detected product
            wrong_product — detected SKU differs from expected SKU
        """
        violations = []
        # NOTE(review): if two detections land in the same grid cell, the
        # later one silently wins here — consider keeping the one with the
        # highest sku_similarity instead; verify against real shelf data.
        actual_positions = {
            f"{p['position']['row']}_{p['position']['col']}": p['sku_id']
            for p in shelf_products
        }
        for position_key, expected_sku in self.planogram.items():
            actual_sku = actual_positions.get(position_key)
            if actual_sku is None:
                violations.append({
                    'type': 'out_of_stock',
                    'position': position_key,
                    'expected_sku': expected_sku
                })
            elif actual_sku != expected_sku:
                violations.append({
                    'type': 'wrong_product',
                    'position': position_key,
                    'expected_sku': expected_sku,
                    'actual_sku': actual_sku
                })
        # At most one violation per planogram position, so score is in [0, 1];
        # max(..., 1) guards against an empty planogram.
        compliance_score = 1.0 - len(violations) / max(len(self.planogram), 1)
        return {
            'compliance_score': round(compliance_score, 3),
            'violations': violations,
            'total_positions': len(self.planogram),
            'violations_count': len(violations),
            'detected_products': len(shelf_products)
        }
SKU Indexing via CLIP Embeddings
from pathlib import Path

import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
def build_sku_index(
    product_images_dir: str,  # directory layout: {sku_id}/{image1.jpg, ...}
    output_path: str,
    model_name: str = 'openai/clip-vit-large-patch14',
    images_per_sku: int = 5   # average embedding over multiple photos
) -> None:
    """Build a CLIP embedding index of all SKUs.

    Each SKU directory contributes up to `images_per_sku` photos whose
    L2-normalized CLIP embeddings are averaged and re-normalized, giving a
    more stable reference vector per SKU.

    Saves an .npz at `output_path` with:
        embeddings: (N_SKU, embedding_dim) float array
        sku_ids:    (N_SKU,) array of SKU directory names
    """
    # Fall back to CPU when no GPU is available instead of crashing
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model = CLIPModel.from_pretrained(model_name).eval().to(device)
    processor = CLIPProcessor.from_pretrained(model_name)
    sku_embeddings = []
    sku_ids = []
    for sku_dir in sorted(Path(product_images_dir).iterdir()):
        if not sku_dir.is_dir():
            continue
        sku_id = sku_dir.name
        # pathlib.glob has NO brace expansion: '*.{jpg,jpeg,png}' matches
        # nothing, which silently skipped every SKU. Glob each extension
        # separately; sort for a deterministic selection of images.
        image_files = sorted(
            path
            for pattern in ('*.jpg', '*.jpeg', '*.png')
            for path in sku_dir.glob(pattern)
        )[:images_per_sku]
        if not image_files:
            continue
        batch_embeddings = []
        for img_path in image_files:
            image = Image.open(img_path).convert('RGB')
            inputs = processor(images=image, return_tensors='pt').to(device)
            with torch.no_grad():
                emb = model.get_image_features(**inputs)
            emb = F.normalize(emb, dim=-1).cpu().numpy()
            batch_embeddings.append(emb)
        # Average per-image embeddings, then re-normalize to unit length so
        # downstream cosine similarity stays a plain dot product.
        mean_emb = np.mean(batch_embeddings, axis=0)
        mean_emb = mean_emb / np.linalg.norm(mean_emb)
        sku_embeddings.append(mean_emb.squeeze())
        sku_ids.append(sku_id)
    np.savez(
        output_path,
        embeddings=np.array(sku_embeddings),
        sku_ids=np.array(sku_ids)
    )
    print(f'Indexed {len(sku_ids)} SKUs')
Timeline
| Task | Timeline |
|---|---|
| Product detector on shelf (fine-tuning YOLO) | 3–5 weeks |
| Full system (detection + identification + planogram) | 7–12 weeks |
| Integration with ERP / mobile app for merchandisers | 10–16 weeks |







