How many cameras are needed for a cashierless store?

For a micromarket of 20–40 m², 6–10 cameras are enough for full coverage. A convenience store of 60–120 m² requires 15–25 cameras. The exact count depends on the layout and the number of shelves.

How does the system distinguish two customers taking the same product?

We use temporal analysis and proximity to the shelf. Each customer gets an anonymous ID, and their track is merged across multiple cameras. The BoT-SORT and StrongSORT trackers help resolve conflicts.

What happens if a product is not recognized?

If the SKU is not identified (new packaging, damage), the system assigns the product to a category and uses the average price. Upon exit, the customer receives a receipt with the option to dispute the item.

What product recognition methods are used?

We use EfficientNet-B5 for SKU-level recognition with top-1 accuracy of 0.91 on 500 SKUs. For new products, a nightly fine-tuning pipeline runs on photos from 4–8 angles.

What accuracy guarantees do you provide?

We guarantee inventory accuracy above 99% after calibration. For each project, we run A/B testing and provide precision/recall metrics. If deviations occur, we adjust models within 24 hours.

Building an AI System for a Cashierless Store

We design and deploy artificial intelligence systems: from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering and MLOps to make AI work not in the lab, but in real business.

Building a Cashierless Store

A typical retailer deciding to remove cash registers faces three main problems: inaccurate customer tracking during occlusions, the high cost of weight sensors on shelves, and the need for real-time video processing on edge devices. A pilot with 20 cameras in an 80 m² store showed that without proper architecture, the mischarge rate exceeds 10%. We have been tackling this challenge for several years: our team has 5+ years of experience in computer vision and 10+ completed cashierless store projects.

Camera hardware costs range from $500 to $2,000 per unit, depending on resolution and features. For a typical convenience store, implementation cost starts at $50,000, and annual savings from reduced staffing and inventory losses average $25,000. In our implementations, operational cost savings typically range from 20% to 40%. Contact us for an assessment of your floor space and product range. Turnkey implementation starts at 8–12 weeks for a micromarket.

Implementation Steps

Assessment: Evaluate store layout, product range, and customer flow.
Design: Plan camera placement for full coverage without blind spots.
Installation: Set up edge devices with INT8 optimized models.
Calibration: Fine-tune tracking and recognition for your SKUs.
Integration: Connect with POS system via REST API.
Pilot: Run a controlled launch to validate accuracy.
Launch: Go live with 24/7 support.

Our expertise spans computer vision for retail, AI implementation in retail, and CV solution development. We build cashierless systems with automated payment, using edge AI for real-time processing.

The modern stack includes edge processing, recognition without QR codes, and weight sensors. An autonomous store is built on three CV systems: customer identification, movement tracking, and product interaction detection. For edge computing, we use INT8 quantization, reducing latency to 50 ms per camera.
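To illustrate what INT8 quantization does to a model's weights, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is not the calibration procedure a production toolchain performs; the function names `quantize_int8` and `dequantize_int8` and the sample weights are illustrative assumptions.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate float values from the INT8 codes."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only to make the rounding visible.
weights = [0.50, -1.27, 0.034, 1.00]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

# The rounding error per value is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored in one byte instead of four, and the worst-case rounding error stays within `scale / 2`. That trade of a small, bounded error for less memory traffic is where the latency gain on edge devices comes from.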

Customer Tracking for Cashierless Stores

Tracking is based on multi-camera detection association using state-of-the-art multi-object tracking (MOT) algorithms. Each camera sends bounding boxes with IDs, and a central server merges them into a single track. We use BoT-SORT for temporal matching and StrongSORT for conflict resolution when paths cross. Both are tracking-by-detection methods that combine Kalman-filter motion prediction with appearance (re-identification) embeddings to keep identities consistent over time. In practice, this gives 97% tracking accuracy at a density of up to 5 people per 10 m². For acceleration, we use TensorRT with INT8: p99 latency per camera stays under 30 ms. We also employ a feature pyramid network (FPN) with an EfficientNet backbone for robust detection.
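The merge step on the central server can be sketched as follows. This is a simplified greedy association by floor-plane distance, not the BoT-SORT or StrongSORT algorithm itself. It assumes every camera is calibrated so that detections arrive as floor coordinates in metres; the `Detection` class, `merge_detections` function, and 0.5 m radius are hypothetical.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Detection:
    camera_id: str
    local_id: int   # track ID assigned by that camera
    x: float        # floor-plane position in metres (assumes calibrated cameras)
    y: float


def merge_detections(detections, max_distance=0.5):
    """Group detections from different cameras that fall within max_distance.

    Returns a mapping (camera_id, local_id) -> global ID.
    """
    clusters = []
    for det in detections:
        best, best_dist = None, max_distance
        for cluster in clusters:
            # One camera cannot report the same person twice in one frame.
            if any(d.camera_id == det.camera_id for d in cluster):
                continue
            cx = sum(d.x for d in cluster) / len(cluster)
            cy = sum(d.y for d in cluster) / len(cluster)
            dist = hypot(det.x - cx, det.y - cy)
            if dist <= best_dist:
                best, best_dist = cluster, dist
        if best is None:
            clusters.append([det])
        else:
            best.append(det)
    return {
        (d.camera_id, d.local_id): global_id
        for global_id, cluster in enumerate(clusters)
        for d in cluster
    }


frame = [
    Detection("cam-A", 1, 1.0, 1.0),
    Detection("cam-B", 7, 1.2, 1.1),
    Detection("cam-A", 2, 3.0, 3.0),
    Detection("cam-B", 8, 3.1, 2.9),
]
global_ids = merge_detections(frame)
```

A production tracker would add motion prediction and appearance embeddings on top of this distance test, which is what keeps identities stable when two customers stand closer together than the merge radius.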

Interaction Detection Challenges

This is the core technical challenge. Two approaches:

CV-only: cameras above shelves detect the hand and the product and classify the action. The difficulties are occlusion by the body, similar-looking products, and partial visibility. For CV-only, we use a specialized hand-object interaction pipeline: YOLOv8 for hand and product detection, followed by SlowFast (Feichtenhofer et al.) for action classification (grab/put back) over 16 frames. The source code of the model is available in the PyTorchVideo repository.

CV + shelf sensors: cameras plus IoT sensors on shelves (weight or capacitive). CV determines who took the item; the sensor determines what and how much. Reliability is about 15% higher, but installation is 20–30% more expensive. In accuracy terms, CV + sensors is about 1.04 times better than CV-only (99% versus 95%) at about 1.25 times the cost. For stores with complex layouts, the extra investment is justified. The choice depends on budget and acceptable error rate; for stores with up to 500 SKUs, CV-only is often sufficient. We also employ knowledge distillation to compress models for edge deployment and use transformer-based trackers for robust association in dense scenes.

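import torch
from pytorchvideo.models import create_slowfast

ALPHA = 8  # frame-rate ratio between the fast and slow pathways

# create_slowfast has no alpha/beta arguments: alpha is the temporal stride of
# the fusion conv, and 1/beta is the channel reduction ratio of the fast pathway.
model = create_slowfast(
    input_channels=(3, 3),
    model_num_class=3,       # grab / put_back / no_action
    slowfast_fusion_conv_stride=(ALPHA, 1, 1),
    slowfast_channel_reduction_ratio=(8,),
    # pooling sized for 16-frame clips at 224x224: 2 slow frames, 16 fast frames
    head_pool_kernel_sizes=((2, 7, 7), (16, 7, 7)),
).eval()

# Stand-in clip with shape (batch, channels, time, height, width)
frames = torch.rand(1, 3, 16, 224, 224)
slow_frames = frames[:, :, ::ALPHA]  # subsample along the time axis
fast_frames = frames
with torch.no_grad():
    logits = model([slow_frames, fast_frames])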
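The division of labour in the CV + shelf sensors approach (CV decides who, the sensor decides what and how much) can be sketched as a small event join. This is an illustration under stated assumptions: each shelf holds one SKU with a known unit weight, and the event classes, `attribute_pick` function, and 1.5-second window are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ShelfEvent:
    shelf_id: str
    timestamp: float        # seconds
    weight_delta_g: float   # negative when items are removed


@dataclass
class HandEvent:
    customer_id: str        # anonymous tracking ID from the CV pipeline
    shelf_id: str
    timestamp: float


# Hypothetical unit weights; assumes a single SKU per shelf.
UNIT_WEIGHT_G = {"shelf-3": 330.0}


def attribute_pick(shelf_event, hand_events, max_gap_s=1.5):
    """Return (customer_id, quantity) for a weight change, or None if no hand was near.

    Positive quantity means items taken, negative means items put back.
    """
    candidates = [
        h for h in hand_events
        if h.shelf_id == shelf_event.shelf_id
        and abs(h.timestamp - shelf_event.timestamp) <= max_gap_s
    ]
    if not candidates:
        return None
    nearest = min(candidates, key=lambda h: abs(h.timestamp - shelf_event.timestamp))
    unit = UNIT_WEIGHT_G[shelf_event.shelf_id]
    quantity = round(-shelf_event.weight_delta_g / unit)
    return nearest.customer_id, quantity


hands = [
    HandEvent("anon-1", "shelf-3", 9.6),
    HandEvent("anon-2", "shelf-3", 11.2),
]
result = attribute_pick(ShelfEvent("shelf-3", 10.0, -655.0), hands)
unattended = attribute_pick(ShelfEvent("shelf-3", 60.0, -330.0), hands)
```

Rounding the weight change to whole units absorbs sensor noise (655 g reads as two 330 g items), while the time window keeps an unattended weight change from being billed to anyone.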

Product Recognition in Cashierless Systems

Two levels: SKU-level recognition (fine-tuned EfficientNet-B5, top-1 accuracy 0.91 on 500 SKUs) and product category (for unrecognized items, charged at the average category price). The problem of a constantly changing assortment is solved by a nightly fine-tuning pipeline: photos from 4–8 angles → augmentation → classifier training. For rapid adaptation, we use LoRA. We use Stochastic Gradient Descent with cosine annealing and mixed precision training (AMP) for fast convergence. CV-only accuracy is 95% versus 99% for CV + sensors: a gain of 4 percentage points, which cuts the recognition error rate from 5% to 1%.

Approach	Accuracy	Equipment Cost	Implementation Complexity
CV-only	92–95%	Low	Medium
CV + shelf sensors	99%+	High	High
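The two-level scheme described above (SKU-level recognition with a category fallback charged at the category average) can be sketched as follows. The catalog contents, SKU names, prices, and the `disputable` flag are hypothetical and only illustrate the fallback logic.

```python
from statistics import mean

# Hypothetical catalog: SKU -> (category, price)
CATALOG = {
    "cola-330": ("beverages", 1.00),
    "juice-1l": ("beverages", 2.00),
    "chips-90": ("snacks", 1.80),
}


def price_item(sku, category, catalog=CATALOG):
    """Return a receipt line; fall back to the category average for unknown SKUs."""
    if sku in catalog:
        return {"sku": sku, "price": catalog[sku][1], "disputable": False}
    category_prices = [price for cat, price in catalog.values() if cat == category]
    if not category_prices:
        raise ValueError(f"no priced items in category {category!r}")
    return {
        "sku": None,
        "category": category,
        "price": round(mean(category_prices), 2),
        "disputable": True,  # flagged so the customer can dispute it on the receipt
    }


known = price_item("chips-90", "snacks")
fallback = price_item(None, "beverages")
```

Marking fallback lines explicitly keeps them easy to surface on the receipt and easy to count afterwards, which gives a direct measure of how often SKU-level recognition fails.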

Which Approach Is Better: CV-Only or CV+Sensors?

For stores with a low SKU count and simple layouts, CV-only is sufficient and 2.5 times faster to deploy. For complex layouts, CV + sensors is about 15% more reliable but costs 20–30% more. Our team helps you choose based on your specific needs.

How to Handle Occlusions?

We use multi-camera fusion: if one camera loses a track, the others cover it. This is how the system keeps 97% tracking accuracy through occlusions, and it performs about 1.5 times better than single-camera approaches.

Project Deliverables

Our turnkey solution includes the following deliverables:

Architecture & Documentation: Full design documents for CV modules (detection, tracking, recognition).
API Integration: REST API connection with your POS system.
Model Fine-tuning: MLOps pipeline (W&B, Kubeflow) for continuous retraining.
Hardware Installation: On-site installation and camera calibration.
Staff Training: Up to 2 days of on-site training for store employees.
Access & Support: Dashboards, API keys, and technical documentation; 24/7 support with a 4-hour SLA.
Ongoing Support: Model adjustments within 24 hours for any accuracy dips.

Technical Architecture

The system uses a microservices architecture with separate containers for detection, tracking, and recognition. All components communicate via gRPC. Real-time video streams are ingested via RTSP and processed on NVIDIA Jetson edge devices. The central server runs Kubernetes for orchestration.

Implementation Timeline

Store Size	Cameras	Implementation Time
Micromarket 20–40 m²	6–10	8–12 weeks
Convenience store 60–120 m²	15–25	14–22 weeks
Supermarket 300+ m²	50–100+	6–12 months

Cost is calculated based on equipment, store zones, and assortment. We guarantee inventory accuracy above 99% after calibration. Average payback period is 18–24 months due to reduced staffing and inventory losses. Our system achieves 99.5% uptime SLA.
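The payback figure can be sanity-checked against the numbers quoted earlier for a typical convenience store (implementation from $50,000, average annual savings of $25,000). The sketch below is simple arithmetic; the function name is illustrative, and real projects would also account for support and hardware replacement costs.

```python
def payback_months(implementation_cost, annual_savings):
    """Months until cumulative savings cover the implementation cost."""
    if annual_savings <= 0:
        raise ValueError("annual_savings must be positive")
    return 12 * implementation_cost / annual_savings


months = payback_months(50_000, 25_000)
```

With the starting cost and the average savings this gives 24 months, the upper end of the stated 18–24 month range. Higher savings shorten the period proportionally.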

Typical mistakes at the start: insufficient camera coverage (blind spots lead to tracking errors), ignoring assortment update frequency (without an automated fine-tuning pipeline, accuracy drops), and choosing CV-only for complex layouts — here it's better to combine with sensors.

Our Expertise

5+ years of CV experience, 10+ implemented cashierless projects. We use proven models (SlowFast for actions, EfficientNet for products) and MLOps stacks (W&B, Kubeflow). We ensure stable operation at peak load (p99 latency < 100 ms). Get a consultation — we assess your project in 1 day. Order a pilot launch in 2 weeks.

How Distribution Shift Kills CV Model Metrics in Industry

On a production line, a camera is installed to control product quality. The model is trained on 10,000 labeled images and reaches a test mAP of 0.84. Once deployed to production, it misses 30% of defects in the first week. Lighting on the line changes between shifts, and the resulting distribution shift nullifies the test metrics. This is a classic story in industrial computer vision, where pattern recognition fails without proper drift handling.

Our engineers, with experience from 60+ computer vision projects, know how to eliminate such scenarios. We guarantee stable model performance under real conditions.

Object Detection: YOLO, RT-DETR, and Everything in Between

YOLO is the standard for real-time detection. YOLOv8 and YOLO11 from Ultralytics are the most used versions in production: simple API, active community, built-in validation, and export to ONNX/TensorRT. For tasks with high accuracy requirements and less critical latency, RT-DETR, a transformer-based architecture without NMS, gives better mAP on COCO at a speed comparable to YOLOv8l.

Architecture	mAP on COCO (val2017)	FPS (A10G, FP16)	Deployment Complexity
YOLOv8n	37.3	700+	Low (ONNX/TensorRT)
YOLOv8m	50.2	250	Low
RT-DETR-L	53.0	140	Medium (requires PyTorch)
Mask R-CNN	38.2 (bbox)	30	High

A typical mistake when training a detector: a dataset of 8,000 images with 3 classes, YOLOv8m fine-tuned to F1 0.73 on validation. The confusion matrix shows that one class is almost never detected. The cause is a 1:23 class imbalance. The solution: oversample the rare class, apply focal loss to the classification branch (YOLOv8 is anchor-free and has no separate objectness output), and adjust augmentations (Mosaic and MixUp disabled for the rare class, as they "blur" it). Transfer learning is mandatory: starting from COCO-pretrained weights reduces the data requirement roughly tenfold. Fine-tuning on 500–2000 domain images yields a working model in 1–2 days on a single GPU.
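Oversampling the rare class can be sketched with inverse-frequency sampling weights. The class names and counts below are hypothetical, chosen to reproduce the 1:23 imbalance with two classes. In a real training loop the same weights would typically be passed to the framework's weighted sampler rather than to `random.choices`.

```python
import random
from collections import Counter


def balanced_sampling_weights(labels):
    """Per-sample weights inversely proportional to class frequency."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Hypothetical dataset with a 23:1 imbalance.
labels = ["scratch"] * 230 + ["dent"] * 10
weights = balanced_sampling_weights(labels)

# Draw a resampled epoch; each class now has the same total probability mass.
rng = random.Random(0)
resampled = rng.choices(labels, weights=weights, k=10_000)
share_rare = resampled.count("dent") / len(resampled)
```

After resampling, the rare class makes up about half of the draws instead of about 4%. The same ten images are repeated many times, so this works best together with augmentation.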

For edge deployment: export to ONNX → TensorRT engine. YOLOv8n in TensorRT FP16 on Jetson AGX Orin gives 150+ FPS at P99 latency < 8 ms—3 times faster than ONNX Runtime without TensorRT. On server A10G: 700+ FPS for YOLOv8n in TensorRT INT8.

How Does Fine-Tuning YOLO Help in Pattern Recognition?

Suppose you need to find micro-defects on a metal surface, a task with high resolution and class imbalance. We take YOLOv8m pretrained on COCO and fine-tune it on 2,000 proprietary images, with Mosaic, MixUp, and random perspective augmentations. After 200 epochs, mAP@0.5 reaches 0.93. Key techniques:

Focal loss on the classification head: reduces the contribution of easily classified examples.
Class-balanced sampling: equalizes the representation of rare classes.
Test Time Augmentation (TTA): increases recall by 5–7% through averaging over flips and scales.
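The effect of focal loss on easy versus hard examples can be seen in a scalar sketch of the binary form, FL(p_t) = -α_t (1 - p_t)^γ log(p_t). This is a plain-Python illustration with the commonly used defaults γ = 2 and α = 0.25, not the loss implementation of any particular detector.

```python
from math import log


def cross_entropy(p, target, eps=1e-7):
    """Binary cross-entropy for one prediction; p is the predicted positive probability."""
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if target == 1 else 1.0 - p
    return -log(p_t)


def focal_loss(p, target, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss: cross-entropy scaled down for well-classified examples."""
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * log(p_t)


easy = focal_loss(0.95, 1)   # confident and correct
hard = focal_loss(0.10, 1)   # confident and wrong
ce_ratio = cross_entropy(0.95, 1) / cross_entropy(0.10, 1)
focal_ratio = easy / hard
```

Under cross-entropy the easy example still contributes about 2% of the hard example's loss. Under focal loss that share drops by more than two orders of magnitude, so the many easy background examples stop dominating the gradient.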

Get a consultation on architecture selection for your task—contact us.

Segmentation: SAM, Mask R-CNN, and Instance Segmentation

SAM (Segment Anything Model) from Meta changed the approach to segmentation. SAM 2 works with video and supports object tracking across frames; for interactive object selection by point or bbox, it is the best out-of-the-box choice. For production instance segmentation without interactive prompting, Mask R-CNN or YOLOv8-seg are used. YOLOv8-seg trains like a regular detector with additional masks, which is convenient in the same pipelines. Semantic segmentation (each pixel is a class) uses SegFormer or DeepLabV3+. SegFormer-B5 provides a good balance of accuracy and speed for satellite imagery or medical segmentation.

Case study: cell segmentation on microscopic images. The dataset had 400 images with manual annotation. Training Mask R-CNN with a ResNet-50 backbone gave an IoU of 0.61, which is poor. The problem: cells overlap, and standard NMS suppresses overlapping predictions. The solution was to switch to Cellpose (a specialized architecture for biomedical tasks) plus soft-NMS. IoU increased to 0.79.
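To show why soft-NMS helps with overlapping objects, here is a minimal sketch of the Gaussian variant on axis-aligned boxes. Instead of deleting a box that overlaps a higher-scoring one, it multiplies the score by exp(-IoU² / σ). The boxes, scores, and thresholds are illustrative assumptions, not values from the case study.

```python
from math import exp


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.05):
    """Gaussian soft-NMS: decay the scores of overlapping boxes instead of removing them."""
    remaining = list(zip(boxes, scores))
    kept = []
    while remaining:
        best_idx = max(range(len(remaining)), key=lambda i: remaining[i][1])
        best_box, best_score = remaining.pop(best_idx)
        kept.append((best_box, best_score))
        decayed = []
        for box, score in remaining:
            score *= exp(-(iou(best_box, box) ** 2) / sigma)
            if score >= score_threshold:
                decayed.append((box, score))
        remaining = decayed
    return kept


boxes = [(0, 0, 10, 10), (2, 0, 12, 10), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = soft_nms(boxes, scores)
```

The second box overlaps the first with IoU of about 0.67, so hard NMS at a 0.5 threshold would discard it. Here it survives with a reduced score, which is the behaviour needed when two neighbouring cells really do overlap.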

OCR: When Tesseract Fails

Tesseract is a starting point for simple tasks: printed text, good lighting, straight layout. As soon as there are handwritten elements, non-standard fonts, perspective distortions, or multi-column layouts, Tesseract degrades quickly.

PaddleOCR is a production-grade solution: text block detection + recognition + structural analysis. Works out of the box for 80+ languages, including Russian. Supports tables and complex document structures. TrOCR (Microsoft) is a transformer OCR with strong results on handwritten text. For Russian handwritten text, fine-tuning is needed: the base model is trained mostly on Latin script.

What to Do When Tesseract Cannot Handle Pattern Recognition on Documents?

For tasks like "extract data from invoices/contracts/passports," we use LayoutLMv3 or Donut—these models understand document layout, not just text. Integration via Hugging Face Transformers, fine-tuning on 200–500 annotated documents. Typical pipeline:

Preprocessing: deskew, denoising, binarization via OpenCV.
Text block detection: PaddleOCR detection or CRAFT.
Recognition: PaddleOCR recognition or TrOCR.
Post-processing: normalization, validation via regex or LLM for structured fields.
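The post-processing step of this pipeline can be sketched with regular expressions from the standard library. The field names, patterns, and sample text are hypothetical; real documents usually need per-template patterns. Fields that fail validation are left as `None` so they can be routed to manual review or an LLM pass.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r"\b(\d{2})[./-](\d{2})[./-](\d{4})\b")
AMOUNT_RE = re.compile(r"(?i)total[^\d]{0,10}(\d{1,3}(?:[ ,]\d{3})*(?:[.,]\d{2}))")


def extract_fields(ocr_text):
    """Pull a date and a total from raw OCR text, validating each field."""
    fields = {"date": None, "total": None}

    match = DATE_RE.search(ocr_text)
    if match:
        day, month, year = map(int, match.groups())
        try:
            fields["date"] = datetime(year, month, day).date().isoformat()
        except ValueError:
            pass  # impossible date such as 31.02: likely an OCR misread

    match = AMOUNT_RE.search(ocr_text)
    if match:
        raw = match.group(1).replace(" ", "")
        # The separator before the last two digits is the decimal point;
        # any other separators are thousands separators.
        integer, decimal = re.split(r"[.,](?=\d{2}$)", raw)
        integer = integer.replace(",", "").replace(".", "")
        fields["total"] = float(f"{integer}.{decimal}")
    return fields


good = extract_fields("Invoice No 4471\nDate: 14.03.2024\nTOTAL: 1 250,50 EUR")
bad_date = extract_fields("Date: 31.02.2024\nTotal 12,50")
```

Normalizing to ISO dates and plain floats at this stage means downstream systems never see locale-specific formats.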

For documents with fixed structure, template matching + OCR by coordinates is often more reliable than an end-to-end solution.

Face Recognition: Identification and Verification

Face recognition = detection + alignment + embedding + matching. Each stage matters.

Detection: RetinaFace or the InsightFace detectors for accurate face localization and keypoints. MTCNN is older but reliable.

Embedding: ArcFace (InsightFace) is state-of-the-art for face recognition embeddings. We use iresnet50/iresnet100 models pretrained on MS1MV3 (about 5M images of roughly 93K identities). The embedding is a 512-dimensional float32 vector, compared by cosine similarity.

Threshold tuning: the decision threshold is a critical parameter. At a threshold of 0.6, typical FPR on the LFW benchmark is 0.001 and TPR is 0.985. In production, the threshold must be calibrated to the real distribution: people in masks, with changed appearance, under different lighting conditions.

Liveness detection is mandatory: MiniFASNet is a lightweight model that runs on CPU; FaceX-Zoo contains several pretrained liveness detectors.
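The matching stage reduces to a cosine similarity and a threshold. The sketch below uses 4-dimensional toy vectors in place of 512-dimensional embeddings and assumes the 0.6 threshold applies to similarity (higher means more alike); the function names are illustrative.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_same_person(emb_a, emb_b, threshold=0.6):
    """Verification decision: accept when similarity reaches the threshold."""
    return cosine_similarity(emb_a, emb_b) >= threshold


# Toy embeddings standing in for real 512-dimensional vectors.
enrolled = [1.0, 0.0, 0.0, 0.0]
same_face = [0.9, 0.1, 0.0, 0.0]
other_face = [0.0, 1.0, 0.0, 0.0]

match = is_same_person(enrolled, same_face)
mismatch = is_same_person(enrolled, other_face)
```

Raising the threshold lowers the false positive rate at the expense of the true positive rate, which is why it has to be calibrated on data from the deployment site rather than taken from a benchmark.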

Video Analytics

Video is a sequence of frames plus a temporal dimension. A naive approach—detecting on every frame—is expensive.

Tracking: ByteTrack and BoT-SORT are the standard for multi-object tracking. They work on top of any detector, adding persistent IDs to objects across frames—enabling object counting, motion tracking, velocity.

Optimization: not every frame needs processing. For static scenes, detect every 5–10 frames, with tracking in between. For event detection (a person entering a zone), background subtraction (OpenCV MOG2) serves as a lightweight pre-filter before neural detection.

Action recognition: SlowFast and VideoMAE for action classification. These are heavy models; in production, use ONNX export with TensorRT or offline processing.
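The "detect every N frames, track in between" schedule can be sketched as a small loop. The detector and tracker are passed in as callables, and the stand-ins used here are hypothetical stubs; in a real pipeline they would wrap a neural detector and a tracker update.

```python
def process_stream(frames, detect, track, detect_every=5):
    """Run the expensive detector on every N-th frame and a cheap tracker update otherwise."""
    results = []
    boxes = []
    for index, frame in enumerate(frames):
        if index % detect_every == 0:
            boxes = detect(frame)          # full detection refreshes the boxes
        else:
            boxes = track(frame, boxes)    # propagate the previous boxes
        results.append(boxes)
    return results


# Stub callables that only count how often they are invoked.
calls = {"detect": 0, "track": 0}


def fake_detect(frame):
    calls["detect"] += 1
    return [(frame, frame, frame + 10, frame + 10)]


def fake_track(frame, boxes):
    calls["track"] += 1
    return boxes


outputs = process_stream(range(12), fake_detect, fake_track, detect_every=5)
```

For 12 frames with `detect_every=5`, the detector runs 3 times and the tracker 9 times, so detector cost falls roughly in proportion to N. The price is that new objects are only picked up at the next detection frame.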

How to Measure Pattern Recognition Model Quality in Production?

Quality monitoring is key to MLOps. We track:

Prediction confidence distribution.
Share of low-confidence predictions (indicator of OOD data).
Drift of input images via feature distribution (embeddings from backbone).

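A drop in average confidence from 0.87 to 0.71 over a week is an early signal of distribution shift. NVIDIA Triton Inference Server exposes its serving metrics (request counts, latency, GPU utilization) in Prometheus format, so model-level metrics like these can be exported to the same Prometheus instance and tracked side by side. Our certified engineers set up monitoring and guarantee an SLA for inference quality.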
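The first two signals in the list above can be computed from logged confidences alone. The sketch below compares a baseline week with the current week; the sample values are made up to reproduce the 0.87 to 0.71 drop, and the alert thresholds are assumptions to be tuned per model.

```python
from statistics import mean


def confidence_report(confidences, low_threshold=0.5):
    """Summarize a batch of prediction confidences."""
    return {
        "mean_confidence": mean(confidences),
        "low_confidence_share": sum(c < low_threshold for c in confidences) / len(confidences),
    }


def drift_alert(baseline, current, max_mean_drop=0.10, max_low_share_increase=0.10):
    """Flag drift when mean confidence falls or the low-confidence share grows too much."""
    mean_drop = baseline["mean_confidence"] - current["mean_confidence"]
    share_increase = current["low_confidence_share"] - baseline["low_confidence_share"]
    return mean_drop > max_mean_drop or share_increase > max_low_share_increase


# Made-up samples; a real monitor would aggregate thousands of predictions.
week_1 = confidence_report([0.90, 0.85, 0.88, 0.85])
week_2 = confidence_report([0.80, 0.45, 0.90, 0.69])
alert = drift_alert(week_1, week_2)
```

Confidence-based signals need no labels, so they can run continuously. They are a trigger for collecting and labeling fresh samples, not a replacement for measuring precision and recall on them.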

Deployment of CV Models

For online inference, we use Triton Inference Server (NVIDIA)—production standard for serving CV models. Supports TensorRT, ONNX, PyTorch, dynamic batching, multiple instances. REST and gRPC API. We guarantee stable operation under load.

Edge deployment: ONNX Runtime on ARM/x86 CPU. TensorFlow Lite for mobile devices. OpenVINO for Intel CPU/GPU/VPU—gives 2–3× speedup on Intel hardware compared to ONNX Runtime. After deployment, we hand over the model with documentation and train personnel.

What Is Included in the Work

Stage	Content	Estimated Time
Analysis	Technical specification, architecture selection, data evaluation	3–5 days
Labeling	Image collection, annotation (up to 5000 objects)	1–3 weeks
Training	Model fine-tuning, validation on test set	1–2 weeks
Optimization	Export to ONNX/TensorRT/OpenVINO, testing on target hardware	1–2 weeks
Integration	REST/gRPC API, integration with existing infrastructure	1–2 weeks
Deployment	Deployment on server or edge device, load testing	1 week
Documentation and training	Instructions, staff training, handover of code and model	3–5 days
Support	Technical support for 3 months after launch	—

Deadlines and Cost

A prototype detector on existing data takes 1–2 weeks. Production system with optimization for target hardware takes 4–8 weeks. Full cycle including data labeling (1000–5000 images) takes 2–4 months. Cost is calculated individually for each task. Typical savings from implementing a quality control system can be significant per production line.

We have been in the market for over 5 years and completed 60+ computer vision projects. We will evaluate your project end-to-end—request a consultation to get a quote and technical proposal.