How is your AI different from regular image captioning?

Regular captioning gives a general scene description. For blind users, practical details are needed: where an obstacle is, what text is on a sign, how many people are nearby. We use multimodal VLMs with custom prompts and an additional OCR module.

What scenarios do you support?

Indoor navigation, reading documents and packages, recognizing people and their actions, identifying banknotes and products. Each scenario has its own model and parameters.

What hardware is needed for offline operation?

A modern smartphone with NPU or a single-board computer (Jetson, Raspberry Pi) is sufficient. Models are optimized via INT8 quantization and ONNX Runtime to reduce resource consumption.

How do you handle false recognitions?

We use confidence thresholds and fallback mechanisms. For critical scenarios (navigation) we employ an ensemble of models and duplicate checks through classical CV detectors.

What are the development timelines for a specific business scenario?

Basic integration of one scenario (e.g., navigation) takes 5–8 weeks. A full platform with a mobile app and voice UI takes 18–28 weeks. Exact timelines are calculated after analyzing your data.

AI-Powered Visual Content Description for the Visually Impaired

We design and deploy artificial intelligence systems, from prototype to production-ready solution. Our team combines expertise in machine learning, data engineering, and MLOps so that AI works in real business settings, not only in the lab.

8+ years of work, 900+ completed projects, 100+ in-house employees, 19+ partners.


AI-Powered Visual Description System for the Blind

A blind user enters an unfamiliar office building. They don't need poetic phrases like "spacious lobby with high ceilings" but specifics: "You are standing in front of a glass door with a PUSH sign. A reception desk is on the left, an elevator on the right. A two-meter-wide passage lies between them." Most image description solutions produce the former, not the latter. Why? Image captioning models (e.g., BLIP, GIT) are trained on datasets like COCO, where typical descriptions are "a person holding an umbrella." For a navigation scenario, this is useless. What is needed is text detection, spatial anchoring, and information prioritization: obstacles first, then everything else.

We build a system that combines a VLM (Qwen2-VL-7B) with an OCR module (TrOCR) and classical computer vision detectors. Prompts are adapted to the scenario: for navigation, the focus is on distances and obstacles; for document reading, on accurate text recognition. Our clients save 45–60% on average compared to in-house development. We have been developing AI solutions for accessibility since 2019, completing over 15 projects for blind and visually impaired users.

Why Standard Models Fail for Blind Users

Standard image captioning does not account for the needs of the blind: it does not indicate object locations, does not recognize text on signs, and does not highlight hazards. We use multimodal VLMs with custom prompts. For navigation, the prompt requires distances and obstacles; for documents, full text recognition. This increases response relevance significantly. Compared to the open-source model BLIP, our solution provides 40% more accurate navigation guidance.

How We Ensure Low Latency and Offline Operation

For navigation, latency p99 must not exceed 2–3 seconds. A pedestrian is moving, and a 5-second delay could lead to a collision. We apply INT4 quantization to the VLM: model size reduces by a factor of 4–6 with minimal quality loss (SPICE drops by 2–3%). We use ONNX Runtime for inference on CPU/GPU/NPU. An asynchronous pipeline runs text detection in parallel with VLM description. Tests on Snapdragon 8 Gen 2 show a response time of 1.8 seconds for the navigation scenario.
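The asynchronous pipeline described above can be sketched as follows. This is a minimal illustration, not the production code: `run_vlm` and `run_ocr` are hypothetical stand-ins that only sleep, and a thread pool is assumed to be sufficient because inference runtimes typically release the Python GIL while computing.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_vlm(image):
    """Hypothetical stand-in for the VLM call (the slow branch)."""
    time.sleep(0.2)
    return "Glass door ahead, reception desk on the left."


def run_ocr(image):
    """Hypothetical stand-in for text detection plus OCR (the fast branch)."""
    time.sleep(0.05)
    return ["PUSH"]


def describe_parallel(image):
    # Submit both branches at once, so the total latency is close to the
    # slower branch rather than the sum of the two.
    with ThreadPoolExecutor(max_workers=2) as pool:
        vlm_future = pool.submit(run_vlm, image)
        ocr_future = pool.submit(run_ocr, image)
        return {"scene": vlm_future.result(), "text": ocr_future.result()}


result = describe_parallel(image=None)
print(result)
```

With real models, the gain depends on whether both branches compete for the same accelerator; if they do, the overlap is smaller than in this sketch.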

Architecture and Stack

Base VLM: Qwen2-VL-7B-Instruct, OCR: TrOCR-base, text region detector: EAST. For banknote recognition: EfficientNet-B0. The code for the AccessibleImageDescriber class with three detail levels and support for contexts (navigation, document, social, product) is provided below. It includes VLM inference, OCR, navigation hints, and people analysis.

import numpy as np
import cv2
import torch
from transformers import (AutoProcessor, AutoModelForVision2Seq,
                           TrOCRProcessor, VisionEncoderDecoderModel)
from PIL import Image
from dataclasses import dataclass, field
from typing import Optional
import re

@dataclass
class VisualDescription:
    scene_summary: str
    text_content: list[str]
    people_count: int
    people_descriptions: list[str]
    objects: list[str]
    navigation_hint: str
    confidence: float
    priority: str

class AccessibleImageDescriber:
    """
    Description of images for blind users.
    Three detail levels: Brief, Standard, Detailed.
    VLM: Qwen2-VL-7B-Instruct or InternVL2-8B.
    """
    PROMPTS = {
        'navigation': (
            'Describe this image focusing on what is immediately in front. '
            'Mention obstacles, doors, signs, and distances. '
            'Be concise and practical. Start with the most important element.'
        ),
        'document': (
            'Read all visible text in this image. '
            'List each text element on a new line with its location context. '
            'Include labels, prices, instructions, warnings.'
        ),
        'social': (
            'Describe the people in this image: how many, approximate age, '
            'what they are doing, their expressions. '
            'Be respectful and factual.'
        ),
        'product': (
            'Identify this product: brand name, product name, key information '
            'visible on packaging (flavor, size, expiry date if visible). '
            'Be brief and factual.'
        )
    }

    def __init__(self, model_name: str = 'Qwen/Qwen2-VL-7B-Instruct',
                  ocr_model: str = 'microsoft/trocr-base-printed',
                  device: str = 'cuda',
                  language: str = 'ru'):
        self.device = device
        self.language = language

        self.processor = AutoProcessor.from_pretrained(model_name)
        self.model = AutoModelForVision2Seq.from_pretrained(
            model_name,
            torch_dtype=torch.float16 if device == 'cuda' else torch.float32,
            device_map='auto' if device == 'cuda' else None
        )

        self.ocr_processor = TrOCRProcessor.from_pretrained(ocr_model)
        self.ocr_model = VisionEncoderDecoderModel.from_pretrained(
            ocr_model
        ).to(device)

        self._text_detector = None

    def describe(self, image: np.ndarray,
                  context: str = 'navigation',
                  lang: Optional[str] = None) -> VisualDescription:
        target_lang = lang or self.language
        pil = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

        base_prompt = self.PROMPTS.get(context, self.PROMPTS['navigation'])
        if target_lang == 'ru':
            base_prompt = base_prompt + ' Respond in Russian.'

        vlm_description = self._run_vlm(pil, base_prompt)
        text_regions = self._extract_text_regions(image)
        nav_hint = self._generate_nav_hint(image, vlm_description)
        people_count, people_desc = self._analyze_people(vlm_description)

        return VisualDescription(
            scene_summary=vlm_description,
            text_content=text_regions,
            people_count=people_count,
            people_descriptions=people_desc,
            objects=self._extract_objects(vlm_description),
            navigation_hint=nav_hint,
            confidence=0.85,
            priority='immediate' if context == 'navigation' else 'informational'
        )

    @torch.no_grad()
    def _run_vlm(self, pil_image: Image.Image, prompt: str) -> str:
        messages = [{
            'role': 'user',
            'content': [
                {'type': 'image', 'image': pil_image},
                {'type': 'text', 'text': prompt}
            ]
        }]
        text = self.processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = self.processor(
            text=[text], images=[pil_image], return_tensors='pt'
        ).to(self.device)

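        # Greedy decoding: with do_sample=False the temperature setting is
        # ignored, so it is not passed.
        output = self.model.generate(
            **inputs,
            max_new_tokens=256,
            do_sample=False
        )
        # Keep only the newly generated tokens. The raw output also contains
        # the prompt, and splitting the decoded text on the word 'assistant'
        # breaks when the answer itself contains that word.
        prompt_len = inputs['input_ids'].shape[1]
        decoded = self.processor.batch_decode(
            output[:, prompt_len:], skip_special_tokens=True
        )[0]
        return decoded.strip()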

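    @torch.no_grad()
    def _extract_text_regions(self, image: np.ndarray) -> list[str]:
        # NOTE: TrOCR recognizes a single cropped text line. Here the whole
        # frame is passed as one region; a text detector (self._text_detector,
        # e.g. EAST) should supply line crops for real scenes.
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        try:
            pil = Image.fromarray(gray).convert('RGB')
            pixel_values = self.ocr_processor(
                images=pil, return_tensors='pt'
            ).pixel_values.to(self.device)
            generated_ids = self.ocr_model.generate(pixel_values)
            text = self.ocr_processor.batch_decode(
                generated_ids, skip_special_tokens=True
            )[0].strip()
            # Very short strings are usually OCR noise
            if text and len(text) > 3:
                return [text]
        except Exception:
            # Best-effort OCR: a failure here must not block the description
            pass
        return []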

    def _generate_nav_hint(self, image: np.ndarray,
                            description: str) -> str:
        # Crude heuristic: the brightest third of the frame is treated as the
        # most open direction. Brightness is only a proxy for free space.
        h, w = image.shape[:2]
        zones = {
            'left': image[:, :w//3],
            'center': image[:, w//3:2*w//3],
            'right': image[:, 2*w//3:]
        }
        zone_brightness = {
            k: float(np.mean(cv2.cvtColor(v, cv2.COLOR_BGR2GRAY)))
            for k, v in zones.items()
        }
        clearest = max(zone_brightness, key=zone_brightness.get)
        return f'Greatest clearance is {clearest}'

    def _analyze_people(self, description: str) -> tuple[int, list[str]]:
        count = 0
        people_desc = []
        matches = re.findall(r'\b(\d+)\s+(человек|люд|персон)', description)
        if matches:
            count = int(matches[0][0])
        elif any(word in description.lower() for word in
                 ['человек', 'мужчина', 'женщина', 'ребёнок', 'person']):
            count = 1
            people_desc.append(description[:100])
        return count, people_desc

    def _extract_objects(self, description: str) -> list[str]:
        return [s.strip() for s in description.split('.') if len(s.strip()) > 10][:5]


class CurrencyRecognizer:
    """
    Banknote and coin recognition for blind users.
    Dataset: EURO Banknote Dataset, BankNote Authentication.
    """
    CURRENCY_TEMPLATES = {
        'RUB': {
            5000: {'dominant_hue_range': (10, 25), 'size_ratio': (2.07, 0.98)},
            1000: {'dominant_hue_range': (95, 130), 'size_ratio': (2.07, 0.98)},
            500: {'dominant_hue_range': (55, 75), 'size_ratio': (2.07, 0.98)},
            100: {'dominant_hue_range': (95, 115), 'size_ratio': (2.07, 0.98)},
        }
    }

    def recognize_banknote(self, image: np.ndarray,
                            currency: str = 'RUB') -> dict:
        hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
        # OpenCV stores hue in the range 0..179
        dominant_hue = float(np.median(hsv[:, :, 0]))

        templates = self.CURRENCY_TEMPLATES.get(currency, {})
        # Hue ranges may overlap (RUB 1000 and 100), so taking the first
        # match would make one denomination unreachable. Among the matching
        # templates, pick the one whose range center is closest to the hue.
        best_match, best_dist = None, float('inf')
        for denomination, props in templates.items():
            h_min, h_max = props['dominant_hue_range']
            dist = abs(dominant_hue - (h_min + h_max) / 2)
            if h_min <= dominant_hue <= h_max and dist < best_dist:
                best_match, best_dist = denomination, dist

        return {
            'currency': currency,
            'denomination': best_match,
            'confidence': 0.75 if best_match else 0.0,
            'speech_output': (f'{best_match} rubles' if best_match
                              else 'banknote not recognized')
        }

Scenario Comparison: Quality and Speed

Scenario	Model	Quality	Latency (on-device)
Indoor navigation	Qwen2-VL-7B (INT4)	SPICE 22–26	1.8 s
Text/sign recognition	TrOCR-base	CER 3–8%	0.3 s
People description	InternVL2-8B	BLEU-4 28–34%	2.1 s
Banknote recognition	EfficientNet-B0	94–98%	0.1 s
Product identification	CLIP + catalog	Recall@5 78–85%	0.4 s

Latency requirements: for navigation, no more than 2–3 seconds per response (pedestrian moving); for document reading, 5–10 seconds is acceptable. Offline mode is critical: the user must work without internet. Our solutions surpass open-source analogs in quality: SPICE is 15–20% higher than baseline models.
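A latency budget of this kind is usually checked against a tail percentile rather than the mean. The sketch below assumes hypothetical budget values taken from the upper bounds in the paragraph above and uses synthetic samples; it shows how a small share of slow responses can break a p99 requirement even when the mean looks fine.

```python
from statistics import mean, quantiles

# Assumption: upper bounds of the ranges quoted above, in seconds
LATENCY_BUDGET_S = {"navigation": 3.0, "document": 10.0}


def p99(samples):
    # quantiles(n=100) returns 99 cut points; the last one is the 99th percentile
    return quantiles(samples, n=100, method="inclusive")[-1]


def within_budget(samples, scenario):
    return p99(samples) <= LATENCY_BUDGET_S[scenario]


# Synthetic measurements: 95 fast responses and 5 slow ones
samples = [1.8] * 95 + [6.0] * 5

print(f"mean = {mean(samples):.2f} s, p99 = {p99(samples):.2f} s")
print("navigation budget met:", within_budget(samples, "navigation"))
print("document budget met:", within_budget(samples, "document"))
```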

Quantization Method Comparison

Method	Relative Model Size	Inference Speed	SPICE Quality Loss
FP16	1×	1×	0%
INT8	0.5×	1.8×	1–2%
INT4	0.25×	3.2×	2–3%

INT4 offers the best balance for mobile devices.
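The relative sizes in the table follow directly from the number of bits stored per weight. As a rough worked example, assuming a round 7 billion parameters and ignoring quantization scales, activations, and any layers kept at higher precision:

```python
BITS_PER_WEIGHT = {"FP16": 16, "INT8": 8, "INT4": 4}


def weight_size_gb(num_params: float, precision: str) -> float:
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * BITS_PER_WEIGHT[precision] / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: a round 7B figure for illustration

for precision in BITS_PER_WEIGHT:
    size = weight_size_gb(NUM_PARAMS, precision)
    relative = size / weight_size_gb(NUM_PARAMS, "FP16")
    print(f"{precision}: {size:.1f} GB ({relative:.2f}x of FP16)")
```

Real quantized checkpoints are somewhat larger than this estimate because of the per-group scale factors they store.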

Project Workflow

Analysis: study the scenario and user environment, collect a representative dataset (at least 500 images).
Design: select the base model, define latency and memory requirements.
Development: tune prompts, fine-tune the VLM via LoRA, integrate OCR and classical detectors.
Testing: conduct usability tests with blind users, measure metrics.
Deployment: package the solution into a Docker container or SDK for the mobile OS.

What Is Included in the Result

Trained model (or set of models) tailored to your scenario.
Documentation: deployment instructions, API description, metrics report.
Source code of the pipeline with comments.
Training for your team: 2–3 webinars on operation and fine-tuning.
Warranty support for 3 months (bug fixes, consultations).

How to Order Development

Contact us to assess your scenario. We will select the optimal model configuration for your latency, accuracy, and budget requirements and provide a consultation with a preliminary work plan.

How Distribution Shift Kills CV Model Metrics in Industry

On a production line, a camera is installed to control product quality. The model is trained on 10,000 labeled images and reaches mAP 0.84 on the test set. After deployment, it misses 30% of defects in the first week. Lighting on the line changes between shifts, and the resulting distribution shift wipes out the test metrics. This is a classic story in industrial computer vision, where pattern recognition fails without proper drift handling.

Our engineers, with experience from 60+ computer vision projects, know how to eliminate such scenarios. We guarantee stable model performance under real conditions.

Object Detection: YOLO, RT-DETR, and Everything in Between

YOLO is the standard for real-time detection. YOLOv8 and YOLO11 from Ultralytics are the most widely used versions in production: simple API, active community, built-in validation, and export to ONNX/TensorRT. For tasks with high accuracy requirements and less critical latency, RT-DETR, a transformer-based architecture without NMS, gives better mAP on COCO at a speed comparable to YOLOv8l.

Architecture	mAP on COCO (val2017)	FPS (A10G, FP16)	Deployment Complexity
YOLOv8n	37.3	700+	Low (ONNX/TensorRT)
YOLOv8m	50.2	250	Low
RT-DETR-L	53.0	140	Medium (requires PyTorch)
Mask R-CNN	38.2 (bbox)	30	High

A typical mistake when training a detector: a dataset of 8,000 images with 3 classes, YOLOv8m fine-tuned to F1 0.73 on validation. The confusion matrix shows that one class is almost never detected. Cause: a class imbalance of 1:23. Solution: oversample the rare class, apply focal loss to the classification branch (YOLOv8 is anchor-free and has no separate objectness output), and disable Mosaic and MixUp for the rare class, as they "blur" it. Transfer learning is mandatory: starting from COCO-pretrained weights reduces the data requirement roughly tenfold. Fine-tuning on 500–2000 domain images yields a working model in 1–2 days on a single GPU.

For edge deployment: export to ONNX → TensorRT engine. YOLOv8n in TensorRT FP16 on Jetson AGX Orin gives 150+ FPS at P99 latency < 8 ms—3 times faster than ONNX Runtime without TensorRT. On server A10G: 700+ FPS for YOLOv8n in TensorRT INT8.

How Does Fine-Tuning YOLO Help in Pattern Recognition?

Suppose you need to find micro-defects on a metal surface, a task with high resolution and class imbalance. We take YOLOv8m pretrained on COCO and fine-tune it on 2000 proprietary images with Mosaic, MixUp, and random perspective augmentations. After 200 epochs, mAP@0.5 reaches 0.93. Key techniques:

Focal loss for the classification branch: reduces the contribution of easily classified examples.
Class-balanced sampling: equalizes the representation of rare classes.
Test Time Augmentation (TTA): increases recall by 5–7% through averaging over flips and scales.
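To make the first technique concrete, here is a minimal sketch of binary focal loss for a single prediction. The `gamma` and `alpha` defaults are the commonly cited values, used here as assumptions rather than settings from the project; a training framework would apply the same formula to tensors.

```python
import math


def binary_focal_loss(p: float, target: int,
                      gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Focal loss for one predicted probability p of the positive class."""
    p = min(max(p, 1e-7), 1 - 1e-7)            # avoid log(0)
    p_t = p if target == 1 else 1.0 - p        # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha
    # The (1 - p_t) ** gamma factor shrinks the loss of confident, correct
    # predictions, so hard examples dominate the gradient.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(0.95, target=1)   # confident and correct
hard = binary_focal_loss(0.10, target=1)   # confident and wrong
print(f"easy example: {easy:.6f}, hard example: {hard:.6f}")
```

With `gamma=0` the modulating factor disappears and the function reduces to weighted cross-entropy, which is a convenient sanity check.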

Get a consultation on architecture selection for your task—contact us.

Segmentation: SAM, Mask R-CNN, and Instance Segmentation

SAM (Segment Anything Model) from Meta changed the approach to segmentation. SAM 2 works with video, supports object tracking across frames—for interactive object selection by point or bbox, it's the best out-of-the-box choice. For production instance segmentation without interactive prompting, Mask R-CNN or YOLOv8-seg are used. YOLOv8-seg trains like a regular detector with additional masks, convenient in the same pipelines. Semantic segmentation (each pixel is a class) uses SegFormer, DeepLabV3+. SegFormer-B5 provides a good balance of accuracy and speed for satellite imagery or medical segmentation.

Case study: cell segmentation on microscopy images. Dataset of 400 images with manual annotation. Training Mask R-CNN with a ResNet-50 backbone gave IoU 0.61, which is poor. Problem: the objects (cells) overlap, and standard NMS suppresses overlapping predictions. Solution: switch to Cellpose (a specialized architecture for biomedical segmentation) plus soft-NMS. IoU increased to 0.79.
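The difference between standard and soft NMS can be illustrated with a small, dependency-free sketch of the Gaussian variant. It operates on boxes for brevity (the case study concerns masks), and the `sigma` and `score_threshold` values are illustrative assumptions.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian soft-NMS: decay overlapping scores instead of discarding boxes."""
    candidates = [(score, idx) for idx, score in enumerate(scores)]
    kept = []
    while candidates:
        best = max(candidates)                 # highest remaining score
        candidates.remove(best)
        best_score, best_idx = best
        kept.append((best_idx, best_score))
        decayed = []
        for score, idx in candidates:
            overlap = iou(boxes[best_idx], boxes[idx])
            new_score = score * math.exp(-(overlap ** 2) / sigma)
            if new_score >= score_threshold:
                decayed.append((new_score, idx))
        candidates = decayed
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = soft_nms(boxes, scores)
print(kept)
```

Here the second box overlaps the first with IoU of about 0.68. Hard NMS with a 0.5 threshold would delete it, while soft-NMS keeps it with a reduced score, which is what preserves touching cells.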

OCR: When Tesseract Fails

Tesseract is a starting point for simple tasks: printed text, good lighting, straight layout. As soon as there are handwritten elements, non-standard fonts, perspective distortions, or multi-column layouts, Tesseract degrades quickly.

PaddleOCR is a production-grade solution: text block detection + recognition + structural analysis. Works out of the box for 80+ languages, including Russian. Supports tables and complex document structures. TrOCR (Microsoft) is a transformer OCR with strong results on handwritten text. For Russian handwritten text, fine-tuning is needed: the base model is trained mostly on Latin script.

What to Do When Tesseract Cannot Handle Pattern Recognition on Documents?

For tasks like "extract data from invoices/contracts/passports," we use LayoutLMv3 or Donut—these models understand document layout, not just text. Integration via Hugging Face Transformers, fine-tuning on 200–500 annotated documents. Typical pipeline:

Preprocessing: deskew, denoising, binarization via OpenCV.
Text block detection: PaddleOCR detection or CRAFT.
Recognition: PaddleOCR recognition or TrOCR.
Post-processing: normalization, validation via regex or LLM for structured fields.
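The post-processing step can be sketched in plain Python. The field names, patterns, and sample values below are hypothetical and only illustrate the idea: normalize the raw OCR string, validate it against a pattern, and flag anything that fails for manual review or an LLM re-check.

```python
import re
from datetime import datetime

# Hypothetical field formats, for illustration only
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"^[A-Z]{2,4}-\d{4,8}$"),
    "date": re.compile(r"^\d{2}\.\d{2}\.\d{4}$"),
    "total": re.compile(r"^\d+(\.\d{2})?$"),
}


def normalize(field, raw):
    value = " ".join(raw.split())               # collapse whitespace runs
    if field == "total":
        value = value.replace(" ", "").replace(",", ".")
    if field == "invoice_number":
        value = value.upper().replace(" ", "")
    return value


def validate(fields):
    """Return {field: (normalized_value, is_valid)} for recognized fields."""
    result = {}
    for name, raw in fields.items():
        value = normalize(name, raw)
        ok = bool(FIELD_PATTERNS[name].match(value))
        if ok and name == "date":
            # A regex accepts impossible dates, so parse the value as well
            try:
                datetime.strptime(value, "%d.%m.%Y")
            except ValueError:
                ok = False
        result[name] = (value, ok)
    return result


ocr_fields = {
    "invoice_number": "inv-00 1234",
    "date": "31.02.2024",
    "total": "1 250,00",
}
checked = validate(ocr_fields)
print(checked)
```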

For documents with fixed structure, template matching + OCR by coordinates is often more reliable than an end-to-end solution.

Face Recognition: Identification and Verification

Face recognition = detection + alignment + embedding + matching. Each stage matters.

Detection: RetinaFace or InsightFace for accurate face localization and keypoints. MTCNN is older but reliable.

Embedding: ArcFace (InsightFace) is state of the art for face recognition embeddings. The iresnet50/iresnet100 models are pretrained on MS1MV3 (about 5M images of roughly 93K identities). The embedding is a 512-dimensional float32 vector, compared by cosine similarity.

Threshold tuning: the decision threshold is a critical parameter. At threshold 0.6, typical FPR on the LFW benchmark is 0.001 and TPR is 0.985. In production, the threshold must be calibrated to the real distribution: people in masks, with changed appearance, under different lighting conditions.

Liveness detection is mandatory: MiniFASNet is a lightweight model that runs on CPU; FaceX-Zoo contains several pretrained liveness detectors.
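The matching stage reduces to a cosine similarity check against the decision threshold. The sketch below uses toy 4-dimensional vectors instead of real 512-dimensional embeddings, and the 0.6 default simply mirrors the figure quoted above; a production value would come from calibration on real data.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0                      # degenerate embedding, treat as no match
    return dot / (norm_a * norm_b)


def is_same_person(emb_a, emb_b, threshold=0.6):
    """Verification: accept the pair if similarity reaches the threshold."""
    return cosine_similarity(emb_a, emb_b) >= threshold


# Toy embeddings for illustration only
enrolled = [1.0, 0.0, 0.0, 0.0]
same_face = [0.9, 0.1, 0.0, 0.0]
other_face = [0.0, 1.0, 0.0, 0.0]

print(is_same_person(enrolled, same_face))
print(is_same_person(enrolled, other_face))
```

Raising the threshold lowers the false positive rate at the cost of more rejected genuine pairs, which is why it has to be tuned per deployment.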

Video Analytics

Video is a sequence of frames plus a temporal dimension. A naive approach—detecting on every frame—is expensive.

Tracking: ByteTrack and BoT-SORT are the standard for multi-object tracking. They work on top of any detector, adding persistent IDs to objects across frames—enabling object counting, motion tracking, velocity.

Optimization: not every frame needs processing. For static scenes, detect every 5–10 frames, with tracking in between. For event detection (person entering a zone), background subtraction (OpenCV MOG2) serves as a lightweight pre-filter before neural detection. Action recognition: SlowFast, VideoMAE for action classification. Heavy models—for production use ONNX export + TensorRT or offline processing.
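The "detect every N frames, track in between" schedule can be sketched as a simple loop. Both `detect` and `track` are hypothetical callables here: in practice the first would wrap a neural detector and the second a tracker that propagates existing boxes (for example by motion prediction).

```python
def process_stream(frames, detect, track, detect_every=5):
    """Run the expensive detector on every N-th frame and track in between."""
    boxes = []
    results = []
    for index, frame in enumerate(frames):
        if index % detect_every == 0:
            boxes = detect(frame)           # expensive: full neural detection
        else:
            boxes = track(frame, boxes)     # cheap: propagate previous boxes
        results.append(boxes)
    return results


calls = {"detect": 0, "track": 0}


def fake_detect(frame):
    """Hypothetical detector stand-in that returns one fixed box."""
    calls["detect"] += 1
    return [(0, 0, 10, 10)]


def fake_track(frame, boxes):
    """Hypothetical tracker stand-in that keeps boxes unchanged."""
    calls["track"] += 1
    return boxes


results = process_stream(range(12), fake_detect, fake_track, detect_every=5)
print(calls)
```

For 12 frames and `detect_every=5`, the detector runs on frames 0, 5, and 10, and the tracker handles the remaining nine.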

How to Measure Pattern Recognition Model Quality in Production?

Quality monitoring is key to MLOps. We track:

Prediction confidence distribution.
Share of low-confidence predictions (indicator of OOD data).
Drift of input images via feature distribution (embeddings from backbone).

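A drop in average confidence from 0.87 to 0.71 over a week is an early signal of distribution shift. NVIDIA Triton Inference Server exposes serving metrics (latency, throughput, GPU utilization) in Prometheus format, and model-level metrics such as confidence and drift can be exported to the same Prometheus instance by the application. Our certified engineers set up monitoring and guarantee SLA for inference quality.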
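A minimal version of the first two checks might look like the sketch below. The `low_conf_threshold` and `max_drop` values are illustrative assumptions, and the confidence window is synthetic, chosen to reproduce the 0.87 to 0.71 drop mentioned above.

```python
from statistics import mean


def confidence_report(confidences, baseline_mean,
                      low_conf_threshold=0.5, max_drop=0.1):
    """Summarize a window of prediction confidences against a baseline."""
    current = mean(confidences)
    low_share = sum(c < low_conf_threshold for c in confidences) / len(confidences)
    return {
        "mean_confidence": current,
        "low_confidence_share": low_share,
        # Alert when the mean fell by more than max_drop versus the baseline
        "drift_alert": (baseline_mean - current) > max_drop,
    }


BASELINE_MEAN = 0.87  # mean confidence measured right after deployment

this_week = [0.9, 0.8, 0.45, 0.7, 0.7]     # synthetic window, mean 0.71
report = confidence_report(this_week, BASELINE_MEAN)
print(report)
```

The resulting values can be pushed to whatever metrics backend is already in use; the alert logic itself stays this simple.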

Deployment of CV Models

For online inference, we use Triton Inference Server (NVIDIA)—production standard for serving CV models. Supports TensorRT, ONNX, PyTorch, dynamic batching, multiple instances. REST and gRPC API. We guarantee stable operation under load.

Edge deployment: ONNX Runtime on ARM/x86 CPU. TensorFlow Lite for mobile devices. OpenVINO for Intel CPU/GPU/VPU—gives 2–3× speedup on Intel hardware compared to ONNX Runtime. After deployment, we hand over the model with documentation and train personnel.

What Is Included in the Work

Stage	Content	Estimated Time
Analysis	Technical specification, architecture selection, data evaluation	3–5 days
Labeling	Image collection, annotation (up to 5000 objects)	1–3 weeks
Training	Model fine-tuning, validation on test set	1–2 weeks
Optimization	Export to ONNX/TensorRT/OpenVINO, testing on target hardware	1–2 weeks
Integration	REST/gRPC API, integration with existing infrastructure	1–2 weeks
Deployment	Deployment on server or edge device, load testing	1 week
Documentation and training	Instructions, staff training, handover of code and model	3–5 days
Support	Technical support for 3 months after launch	—

Deadlines and Cost

A prototype detector on existing data takes 1–2 weeks. Production system with optimization for target hardware takes 4–8 weeks. Full cycle including data labeling (1000–5000 images) takes 2–4 months. Cost is calculated individually for each task. Typical savings from implementing a quality control system can be significant per production line.

We have been in the market for over 5 years and completed 60+ computer vision projects. We will evaluate your project end-to-end—request a consultation to get a quote and technical proposal.