What invoice formats are supported?

We support TORG-12, TTN, CMR, work completion certificates, UPD. The system can be trained on any tabular documents with a header and table part.

How are handwritten fields handled?

We use a handwriting detector based on heuristics (stroke width variance) and a separate TrOCR model for recognition. Printed fields are processed by PaddleOCR.

How long does implementation take?

Basic field extraction from standard invoices takes 2–3 weeks. Full cycle with fine-tuning on your documents, handwriting support, and 1C integration takes 8–14 weeks.

What recognition accuracy can I expect?

On standard typed documents, expect >99% F1. On handwritten fields, expect up to 95% after fine-tuning on your samples. We apply confidence thresholds to minimize errors.
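One common way to apply confidence thresholds is to send any field whose score falls below a cut-off to manual review. The sketch below assumes each extracted field carries a confidence in [0, 1]; the `ExtractedField` type, the `route_fields` helper, and the 0.90 cut-off are illustrative assumptions, not the production configuration.

```python
from dataclasses import dataclass

@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # model score in [0, 1]

def route_fields(fields, threshold=0.90):
    """Split fields into auto-accepted and manual-review lists."""
    accepted, review = [], []
    for field in fields:
        if field.confidence >= threshold:
            accepted.append(field)
        else:
            review.append(field)
    return accepted, review

# Hypothetical extraction result for one document
fields = [
    ExtractedField("DOC_NUMBER", "1524", 0.99),
    ExtractedField("ITEM_QTY", "12", 0.71),  # e.g. a handwritten digit
]
accepted, review = route_fields(fields)
```

In practice the threshold would be tuned per field type on a held-out set, trading review workload against error rate.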

Does the solution integrate with 1C?

Yes, we provide a REST API or a direct module for 1C. Data is transferred in a structured format: item name, quantity, amount, INN. It can be loaded directly into 1C documents.

AI Invoice Data Extraction: LayoutLM, TrOCR, Validation

We design and deploy artificial intelligence systems: from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering and MLOps to make AI work not in the lab, but in real business.

AI Data Extraction from Invoices and Acts: LayoutLM, TrOCR, Validation

We've encountered situations where accounting spends up to 40% of time manually entering invoices, and during INN verification we find typos and discrepancies. A single typo in the INN can lead to tax issues and financial losses. Waybills (TTN, CMR), goods invoices (TORG-12), and work completion certificates—all have rigid structures but large variability in filling: handwritten fields, seals over text, low-quality scans, mixed filling (part typed, part handwritten). Automating this process with AI reduces employee workload and eliminates accounting errors. We developed a solution based on LayoutLMv3, TrOCR, and custom validators that extracts data from any invoice.

How We Solve the Problem of Data Extraction from Waybills

We combine LayoutLMv3 for document layout understanding, separate OCR engines for printed and handwritten text, and validation of document details (INN, dates, totals). Before images reach the models, they are preprocessed: binarization, deskewing, noise removal. This improves recognition accuracy on low-quality scans by 5–7%. For robustness to scanning defects, we apply augmentation during training: random rotation, blur, contrast changes. Published LayoutLMv3 results report F1 above 0.95 on the CORD receipt benchmark, which supports its suitability for our task. In our pipeline, this yields >99% accuracy on standard forms and up to 95% on handwritten samples after fine-tuning.

NER Task with LayoutLM for Waybills

A waybill is a tabular document: header (party details), product table, signatures. LayoutLMv3 handles this via token classification, considering text coordinates.
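LayoutLM-family models expect every word box scaled to an integer range of 0 to 1000, independent of the scan resolution. A minimal sketch of that scaling, assuming pixel boxes in `(x0, y0, x1, y1)` order; the helper name `normalize_bbox` and the page size in the example are illustrative.

```python
def normalize_bbox(box, page_width, page_height):
    """Scale a pixel box (x0, y0, x1, y1) to the 0..1000 integer range."""
    x0, y0, x1, y1 = box

    def scale(value, size):
        # Clamp so that boxes touching the page edge stay within 0..1000
        return max(0, min(1000, int(1000 * value / size)))

    return [
        scale(x0, page_width),
        scale(y0, page_height),
        scale(x1, page_width),
        scale(y1, page_height),
    ]

# A word box on a 1240 x 1754 px scan (A4 at roughly 150 dpi)
example = normalize_bbox((124, 310, 496, 342), page_width=1240, page_height=1754)
# example == [100, 176, 400, 194]
```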

from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification
from datasets import Dataset
import torch

# Full set of labels for TORG-12 / TTN
WAYBILL_LABELS = [
    'O',
    'B-DOC_NUMBER', 'I-DOC_NUMBER',
    'B-DOC_DATE',   'I-DOC_DATE',
    'B-SENDER_NAME',    'I-SENDER_NAME',
    'B-SENDER_INN',     'I-SENDER_INN',
    'B-SENDER_ADDRESS', 'I-SENDER_ADDRESS',
    'B-RECEIVER_NAME',    'I-RECEIVER_NAME',
    'B-RECEIVER_INN',     'I-RECEIVER_INN',
    'B-RECEIVER_ADDRESS', 'I-RECEIVER_ADDRESS',
    'B-CARRIER_NAME',   'I-CARRIER_NAME',
    'B-VEHICLE_REG',    'I-VEHICLE_REG',     # vehicle plate number
    'B-ITEM_NAME',      'I-ITEM_NAME',
    'B-ITEM_QTY',       'I-ITEM_QTY',
    'B-ITEM_UNIT',      'I-ITEM_UNIT',
    'B-ITEM_PRICE',     'I-ITEM_PRICE',
    'B-ITEM_TOTAL',     'I-ITEM_TOTAL',
    'B-TOTAL_QTY',      'I-TOTAL_QTY',
    'B-TOTAL_AMOUNT',   'I-TOTAL_AMOUNT',
    'B-DRIVER_NAME',    'I-DRIVER_NAME',
]

def prepare_waybill_dataset(
    image_paths: list,
    annotations: list,    # list of dict with keys: words, boxes, labels
    processor: LayoutLMv3Processor
) -> Dataset:
    """
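    Prepare dataset for fine-tuning.
    annotations[i]['boxes']: normalized bbox [0..1000] for LayoutLM.
    The processor must be created with apply_ocr=False, since words and
    boxes are supplied here instead of coming from its built-in OCR.
    """
    from PIL import Image as PILImage  # imported once, not on every iteration

    label2id = {l: i for i, l in enumerate(WAYBILL_LABELS)}

    features_list = []
    for img_path, ann in zip(image_paths, annotations):
        image = PILImage.open(img_path).convert('RGB')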

        encoding = processor(
            image,
            text=ann['words'],
            boxes=ann['boxes'],
            word_labels=[label2id[l] for l in ann['labels']],
            truncation=True,
            padding='max_length',
            max_length=512,
            return_tensors='pt'
        )
        features_list.append({
            k: v.squeeze().tolist() for k, v in encoding.items()
        })

    return Dataset.from_list(features_list)
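At inference time, the per-word BIO tags predicted by the token classifier have to be merged back into field values. The following is a minimal sketch of that grouping step, assuming predictions are already aligned to whole words; `decode_bio` is a hypothetical helper, and the sample words are made up for illustration.

```python
def decode_bio(words, labels):
    """Group words into (field, text) pairs based on their BIO tags."""
    entities = []
    current_field, current_words = None, []
    for word, label in zip(words, labels):
        prefix, _, field = label.partition('-')
        # Continuation of the entity that is currently open
        if prefix == 'I' and field == current_field:
            current_words.append(word)
            continue
        # Anything else closes the open entity, if there is one
        if current_field is not None:
            entities.append((current_field, ' '.join(current_words)))
        if prefix in ('B', 'I'):   # a stray I- tag is treated as a start
            current_field, current_words = field, [word]
        else:                      # 'O' tag
            current_field, current_words = None, []
    if current_field is not None:
        entities.append((current_field, ' '.join(current_words)))
    return entities

words = ['Invoice', 'No.', '1524', 'LLC', 'Romashka', '7707083893']
labels = ['O', 'O', 'B-DOC_NUMBER', 'B-SENDER_NAME', 'I-SENDER_NAME', 'B-SENDER_INN']
fields = decode_bio(words, labels)
```

Table rows need one more step that this sketch leaves out: item fields have to be grouped by row, for example using the vertical position of their boxes.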

Handwriting Field Processing: Why TrOCR beats PaddleOCR?

Invoices often contain handwritten dates, quantities, and signatures. PaddleOCR, which we use for printed text, makes errors on handwritten fields: accuracy drops to 60%. We use a handwriting detector plus TrOCR for handwriting recognition, which gives 20% higher F1 on real data. Fine-tuning TrOCR on corporate handwriting requires at least 300 handwritten records per operator.

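from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import cv2            # used by the handwriting heuristic below
import numpy as np    # used for image-to-array conversion
import torch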

class HandwritingOCR:
    def __init__(self):
        self.processor = TrOCRProcessor.from_pretrained(
            'microsoft/trocr-base-handwritten'
        )
        self.model = VisionEncoderDecoderModel.from_pretrained(
            'microsoft/trocr-base-handwritten'
        ).eval().cuda()

    @torch.no_grad()
    def recognize(self, image: Image.Image) -> str:
        pixel_values = self.processor(
            image, return_tensors='pt'
        ).pixel_values.to('cuda')

        generated_ids = self.model.generate(
            pixel_values,
            max_new_tokens=64,
            num_beams=4,
            early_stopping=True
        )
        return self.processor.batch_decode(
            generated_ids, skip_special_tokens=True
        )[0]

class HybridWaybillOCR:
    """
    Determine text type (printed / handwritten) → choose OCR.
    Handwriting features: large character height variance, no serif patterns.
    """
    def __init__(self):
        self.handwriting_ocr = HandwritingOCR()
        # PaddleOCR for printed
        from paddleocr import PaddleOCR
        self.printed_ocr = PaddleOCR(use_angle_cls=True, lang='ru')

    def is_handwritten(self, text_region: Image.Image) -> bool:
        """Simple heuristic: spread of per-column ink density,
        used as a rough proxy for stroke width variance."""
        import numpy as np
        img_array = np.array(text_region.convert('L'))
        # Otsu binarization: ink becomes 0, background becomes 255
        _, binary = cv2.threshold(img_array, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        # Share of ink pixels in each column; uneven columns suggest handwriting
        col_density = (binary == 0).mean(axis=0)
        return float(col_density.std()) > 0.15   # empirical threshold

    def recognize_region(self, image: Image.Image) -> str:
        if self.is_handwritten(image):
            return self.handwriting_ocr.recognize(image)
        else:
            result = self.printed_ocr.ocr(np.array(image))
            return ' '.join([line[1][0] for line in result[0] or []])

Validating Document Details: INN, Sums, Dates

Extracted data undergoes validation: INN checksum, date format, arithmetic totals. We automatically check that item subtotals match the total, and quantities sum up. Discrepancies are logged. This reduces accounting errors by 90%.

import re

def validate_russian_inn(inn: str) -> bool:
    """Check INN checksum (Russian Federation)"""
    if not re.match(r'^\d{10}$|^\d{12}$', inn):
        return False
    digits = [int(d) for d in inn]
    if len(inn) == 10:
        check = sum(d * w for d, w in zip(digits[:9], [2,4,10,3,5,9,4,6,8])) % 11 % 10
        return digits[9] == check
    else:
        c1 = sum(d * w for d, w in zip(digits[:11], [7,2,4,10,3,5,9,4,6,8,0])) % 11 % 10
        c2 = sum(d * w for d, w in zip(digits[:11], [3,7,2,4,10,3,5,9,4,6,8])) % 11 % 10
        return digits[10] == c1 and digits[11] == c2
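The arithmetic and date checks mentioned above can be sketched in the same spirit. This is an illustrative example, not the production validator: the item dictionary keys, the 0.01 tolerance, and the `DD.MM.YYYY` date format are assumptions.

```python
from datetime import datetime
from decimal import Decimal

def validate_totals(items, declared_total, tolerance=Decimal('0.01')):
    """Check that qty * price matches each line total and that
    the line totals add up to the declared document total."""
    problems = []
    running = Decimal('0')
    for number, item in enumerate(items, start=1):
        expected = (item['qty'] * item['price']).quantize(Decimal('0.01'))
        if abs(expected - item['total']) > tolerance:
            problems.append(f"line {number}: expected {expected}, got {item['total']}")
        running += item['total']
    if abs(running - declared_total) > tolerance:
        problems.append(f"document total: expected {running}, got {declared_total}")
    return problems

def parse_doc_date(text):
    """Parse a DD.MM.YYYY date; return None if it is malformed or impossible."""
    try:
        return datetime.strptime(text, '%d.%m.%Y').date()
    except ValueError:
        return None

items = [
    {'qty': Decimal('3'), 'price': Decimal('19.90'), 'total': Decimal('59.70')},
    {'qty': Decimal('2'), 'price': Decimal('100.00'), 'total': Decimal('200.00')},
]
ok = validate_totals(items, Decimal('259.70'))        # no problems found
mismatch = validate_totals(items, Decimal('295.70'))  # transposed digits in the total
```

`Decimal` is used instead of `float` so that monetary comparisons are not affected by binary rounding.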

In a recent deployment for a logistics company processing 500 waybills daily, our fine-tuned LayoutLMv3 achieved F1 >0.991 on printed fields and reduced manual data entry errors by 90%. Handwriting recognition after fine-tuning on 400 samples per operator reached 94% accuracy.

What's Included in the Work?

We provide:

trained LayoutLMv3 model on your templates
REST API or Python package for inference
integration module for 1C (HTTP exchange)
logging and monitoring dashboard (Prometheus + Grafana, optional)
documentation for fine-tuning and deployment
staff training and support during implementation

Comparison: Our AI Solution vs. Rule-Based Parsing

Characteristic	Rule-based	Our AI Solution
Accuracy on standard forms	85–90%	>99%
Tolerance to low-quality scans	Low	High (data augmentation)
Handwritten text processing	No	Yes (TrOCR)
Time to adapt to new template	days	hours (few-shot)
Maintenance cost	High (patches per template)	Low (one model)

Typical Implementation Mistakes and How to Avoid Them

Insufficient annotation for handwritten fields. With fewer than 200 samples per operator's handwriting, TrOCR gives <80% accuracy; we recommend at least 300.
Ignoring INN validation. One digit error leads to tax issues—we always verify the checksum.
Mixing printed and handwritten OCR. Without a handwriting detector, the model will be confused—our heuristic with threshold 0.15 works reliably.
Lack of production monitoring. We recommend tracking metrics (F1, latency p99) and setting alerts for quality drops.

Why Choose Us?

We have 5+ years of experience in Computer Vision and NLP, dozens of OCR deployments in document workflows. We guarantee model accuracy and provide full documentation. If you have non-standard invoices, we evaluate your project in 1 day. Contact us to discuss your task—we'll find the optimal solution for your budget.

Estimated Timelines

Stage	Time
Extraction of TORG-12 / TTN fields (standard formats)	2–3 weeks
Fine-tuning LayoutLMv3 on corporate invoices	5–7 weeks
Full system with handwriting + validation + 1C integration	8–14 weeks

Order a pilot implementation: we'll deploy the solution on your 100 documents and show metrics. Get a free engineer consultation.

How Distribution Shift Kills CV Model Metrics in Industry

On a production line, a camera is installed to control product quality. The model is trained on 10,000 labeled images—test accuracy mAP 0.84. Deployed to production, and in the first week it misses 30% of defects. Lighting on the line changes between shifts; distribution shift nullifies the metrics. This is a classic story with computer vision in industry, where pattern recognition fails without proper drift handling.

Our engineers, with experience from 60+ computer vision projects, know how to eliminate such scenarios. We guarantee stable model performance under real conditions.

Object Detection: YOLO, RT-DETR, and Everything in Between

YOLO is the standard for real-time detection. YOLOv8 and YOLO11 from Ultralytics are the most used versions in production: simple API, active community, built-in validation, and export to ONNX/TensorRT. For tasks where accuracy matters more than latency, RT-DETR, a transformer-based architecture without NMS, gives better mAP on COCO at a speed comparable to YOLOv8l.

Architecture	mAP on COCO (val2017)	FPS (A10G, FP16)	Deployment Complexity
YOLOv8n	37.3	700+	Low (ONNX/TensorRT)
YOLOv8m	50.2	250	Low
RT-DETR-L	53.0	140	Medium (requires PyTorch)
Mask R-CNN	38.2 (bbox)	30	High

A typical mistake when training a detector: a dataset of 8000 images with 3 classes, a fine-tuned YOLOv8m, F1 0.73 on validation. The confusion matrix shows that one class is almost never detected. Cause: a class imbalance of 1:23. Solution: oversample the rare class, apply focal loss to the classification branch (YOLOv8 is anchor-free and has no separate objectness output), and adjust augmentations (Mosaic and MixUp disabled for the rare class, as they "blur" it). Transfer learning is mandatory: starting from COCO-pretrained weights reduces the data requirement roughly tenfold. Fine-tuning on 500–2000 domain images yields a working model in 1–2 days on a single GPU.
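Oversampling the rare class can be done by drawing training images with weights inversely proportional to class frequency. A minimal sketch, assuming one set of class names per image and at least one labeled image; `oversampling_weights` is a hypothetical helper, and the class names are made up.

```python
import random
from collections import Counter

def oversampling_weights(image_labels):
    """Return one sampling weight per image: the inverse frequency
    of the rarest class that the image contains."""
    class_counts = Counter(c for labels in image_labels for c in set(labels))
    # Images without objects get the weight of the most common class
    floor = 1.0 / max(class_counts.values())
    return [
        max((1.0 / class_counts[c] for c in labels), default=floor)
        for labels in image_labels
    ]

# Imbalance of 1:23, as in the example above
image_labels = [{'scratch'}] * 23 + [{'crack'}]
weights = oversampling_weights(image_labels)

# Draw a batch of image indices; the single 'crack' image is picked
# about as often as all 23 'scratch' images combined
batch = random.choices(range(len(image_labels)), weights=weights, k=8)
```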

For edge deployment: export to ONNX → TensorRT engine. YOLOv8n in TensorRT FP16 on Jetson AGX Orin gives 150+ FPS at P99 latency < 8 ms—3 times faster than ONNX Runtime without TensorRT. On server A10G: 700+ FPS for YOLOv8n in TensorRT INT8.

How Does Fine-Tuning YOLO Help in Pattern Recognition?

Suppose you need to find micro-defects on a metal surface, a task with high resolution and class imbalance. We take YOLOv8m pretrained on COCO and fine-tune it on 2000 proprietary images, applying Mosaic, MixUp, and random perspective augmentations. After 200 epochs, mAP@0.5 reaches 0.93. Key techniques:

Focal loss on the classification branch: reduces the contribution of easily classified examples.
Class-balanced sampling: equalizes representation of rare classes.
Test Time Augmentation (TTA): increases recall by 5–7% through averaging over flips and scales.
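To make the focal loss item concrete, here is the standard formula for a single binary prediction, FL(p_t) = -α_t (1 - p_t)^γ log(p_t), written in plain Python. The defaults γ = 2 and α = 0.25 are the commonly cited ones and are assumptions here, not the values used in the project.

```python
import math

def binary_focal_loss(p, target, gamma=2.0, alpha=0.25):
    """Focal loss for one prediction.
    p: predicted probability of the positive class, target: 0 or 1."""
    p_t = p if target == 1 else 1.0 - p            # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = binary_focal_loss(0.95, 1)   # confident and correct: almost no loss
hard = binary_focal_loss(0.30, 1)   # uncertain positive: dominates the gradient
```

With `gamma=0` and `alpha=1` the expression reduces to ordinary cross-entropy for positives, which is a quick sanity check for any implementation.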

Get a consultation on architecture selection for your task—contact us.

Segmentation: SAM, Mask R-CNN, and Instance Segmentation

SAM (Segment Anything Model) from Meta changed the approach to segmentation. SAM 2 works with video, supports object tracking across frames—for interactive object selection by point or bbox, it's the best out-of-the-box choice. For production instance segmentation without interactive prompting, Mask R-CNN or YOLOv8-seg are used. YOLOv8-seg trains like a regular detector with additional masks, convenient in the same pipelines. Semantic segmentation (each pixel is a class) uses SegFormer, DeepLabV3+. SegFormer-B5 provides a good balance of accuracy and speed for satellite imagery or medical segmentation.

Case study: cell segmentation on microscopic images. Dataset of 400 images with manual annotation. Training Mask R-CNN with a ResNet-50 backbone gave IoU 0.61, which is poor. Problem: the objects (cells) overlap, and standard NMS suppresses overlapping predictions. Solution: switch to Cellpose (a specialized architecture for biomedical segmentation) plus soft-NMS. IoU increased to 0.79.
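Soft-NMS helps with overlapping objects because it lowers the scores of overlapping predictions instead of deleting them. Below is a minimal sketch of the Gaussian variant on axis-aligned boxes; `sigma=0.5` and the score cut-off are illustrative defaults, and the boxes are toy data.

```python
import math

def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian soft-NMS. Returns (box_index, decayed_score) in selection order."""
    remaining = [(score, idx) for idx, score in enumerate(scores)]
    kept = []
    while remaining:
        best_score, best_idx = max(remaining)
        remaining.remove((best_score, best_idx))
        if best_score < score_threshold:
            break  # every other candidate scores even lower
        kept.append((best_idx, best_score))
        # Decay the remaining scores by their overlap with the selected box
        remaining = [
            (score * math.exp(-(iou(boxes[best_idx], boxes[idx]) ** 2) / sigma), idx)
            for score, idx in remaining
        ]
    return kept

boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = soft_nms(boxes, scores)
```

In this toy case the second box overlaps the first with IoU of about 0.82. Hard NMS at a 0.5 threshold would discard it, while soft-NMS keeps it with a reduced score.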

OCR: When Tesseract Fails

Tesseract is a starting point for simple tasks: printed text, good lighting, straight layout. As soon as there are handwritten elements, non-standard fonts, perspective distortions, or multi-column layouts, Tesseract degrades quickly.

PaddleOCR is a production-grade solution: text block detection + recognition + structural analysis. Works out of the box for 80+ languages, including Russian. Supports tables and complex document structures. TrOCR (Microsoft) is a transformer OCR with strong results on handwritten text. For Russian handwritten text, fine-tuning is needed: the base model is trained mostly on Latin script.

What to Do When Tesseract Cannot Handle Pattern Recognition on Documents?

For tasks like "extract data from invoices/contracts/passports," we use LayoutLMv3 or Donut—these models understand document layout, not just text. Integration via Hugging Face Transformers, fine-tuning on 200–500 annotated documents. Typical pipeline:

Preprocessing: deskew, denoising, binarization via OpenCV.
Text block detection: PaddleOCR detection or CRAFT.
Recognition: PaddleOCR recognition or TrOCR.
Post-processing: normalization, validation via regex or LLM for structured fields.
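The post-processing step often starts with plain normalization before any validation. A minimal sketch for monetary amounts, assuming OCR output with spaces as thousands separators and a comma or dot as the decimal mark; `normalize_amount` is a hypothetical helper.

```python
import re
from decimal import Decimal, InvalidOperation

def normalize_amount(raw):
    """Convert an OCR'd amount such as '1 234,56 RUB' to Decimal('1234.56').
    Returns None when no number can be recovered."""
    # Keep digits, separators and minus; drop spaces and currency text
    cleaned = re.sub(r'[^\d,.\-]', '', raw).strip('.,')
    cleaned = cleaned.replace(',', '.')
    # If several dots remain, only the last one is the decimal separator
    if cleaned.count('.') > 1:
        head, _, tail = cleaned.rpartition('.')
        cleaned = head.replace('.', '') + '.' + tail
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None

amount = normalize_amount('1 234,56 RUB')
```

A value that cannot be parsed is returned as `None` rather than guessed, so that it can be routed to manual review.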

For documents with fixed structure, template matching + OCR by coordinates is often more reliable than an end-to-end solution.

Face Recognition: Identification and Verification

Face recognition = detection + alignment + embedding + matching. Each stage matters.

Detection: RetinaFace or InsightFace for accurate face localization and keypoints. MTCNN is older but reliable.

Embedding: ArcFace (InsightFace) is state-of-the-art for face recognition embeddings. The iresnet50/iresnet100 models are pretrained on MS1MV3 (about 5M images of roughly 93K identities). The embedding is a 512-dimensional float32 vector, compared by cosine similarity.

Threshold tuning: the decision threshold is a critical parameter. At threshold 0.6, typical FPR on the LFW benchmark is 0.001 and TPR is 0.985. In production, the threshold must be calibrated to the real distribution: people in masks, with changed appearance, under different lighting conditions.

Liveness detection is mandatory: MiniFASNet is a lightweight model that runs on CPU; FaceX-Zoo contains several pretrained liveness detectors.
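The matching stage reduces to comparing two embeddings by cosine similarity against a calibrated threshold. A minimal sketch in plain Python; real embeddings are 512-dimensional, the 3-dimensional vectors here are toy values, and `is_same_person` is a hypothetical helper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_same_person(emb_a, emb_b, threshold=0.6):
    """Verification decision; the threshold should be calibrated on real data."""
    return cosine_similarity(emb_a, emb_b) >= threshold

enrolled = [1.0, 0.0, 0.0]
probe_match = [0.8, 0.6, 0.0]     # similarity of about 0.8
probe_other = [0.0, 1.0, 0.0]     # orthogonal, similarity 0.0
```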

Video Analytics

Video is a sequence of frames plus a temporal dimension. A naive approach—detecting on every frame—is expensive.

Tracking: ByteTrack and BoT-SORT are the standard for multi-object tracking. They work on top of any detector, adding persistent IDs to objects across frames—enabling object counting, motion tracking, velocity.

Optimization: not every frame needs processing. For static scenes, run detection every 5–10 frames and rely on tracking in between. For event detection (a person entering a zone), background subtraction (OpenCV MOG2) serves as a lightweight pre-filter before neural detection.

Action recognition: SlowFast and VideoMAE for action classification. These are heavy models: in production, use ONNX export + TensorRT or offline processing.
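The frame-skipping scheme for static scenes can be sketched as a simple loop. The `detect` and `track` callables are placeholders for whatever detector and box-propagation step is in use; they are assumptions for illustration and do not mirror the API of ByteTrack or any other library.

```python
def process_stream(frames, detect, track, detect_every=5):
    """Run the detector on every N-th frame and a cheaper
    tracking update on the frames in between."""
    results = []
    last_boxes = []
    for index, frame in enumerate(frames):
        if index % detect_every == 0:
            last_boxes = detect(frame)             # expensive neural detection
        else:
            last_boxes = track(frame, last_boxes)  # cheap propagation of boxes
        results.append(last_boxes)
    return results

# Stub callables that only count how often each stage runs
calls = {'detect': 0, 'track': 0}

def fake_detect(frame):
    calls['detect'] += 1
    return [(0, 0, 10, 10)]

def fake_track(frame, boxes):
    calls['track'] += 1
    return boxes

results = process_stream(range(12), fake_detect, fake_track, detect_every=5)
# 12 frames: detection on frames 0, 5, 10; tracking on the other 9
```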

How to Measure Pattern Recognition Model Quality in Production?

Quality monitoring is key to MLOps. We track:

Prediction confidence distribution.
Share of low-confidence predictions (indicator of OOD data).
Drift of input images via feature distribution (embeddings from backbone).

A drop in average confidence from 0.87 to 0.71 over a week is an early signal of distribution shift. NVIDIA Triton Inference Server exposes its serving metrics in Prometheus format, so these quality indicators can be exported to the same Prometheus setup. Our certified engineers set up monitoring and guarantee SLA for inference quality.
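The first two indicators are cheap to compute from logged predictions. A minimal sketch, assuming confidences are collected per time window; the function names, the 0.5 low-confidence cut-off, and the 0.10 allowed drop are illustrative assumptions.

```python
from statistics import mean

def confidence_report(confidences, low_threshold=0.5):
    """Summarize one window of prediction confidences."""
    low = sum(1 for c in confidences if c < low_threshold)
    return {
        'mean': mean(confidences),
        'low_share': low / len(confidences),
    }

def drift_alert(baseline_mean, current_mean, max_drop=0.10):
    """Flag a window whose mean confidence fell too far below the baseline."""
    return (baseline_mean - current_mean) > max_drop

report = confidence_report([0.9, 0.8, 0.4, 0.3])
alert = drift_alert(baseline_mean=0.87, current_mean=0.71)   # the drop described above
```

Embedding drift needs a distribution distance on backbone features and is not covered by this sketch.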

Deployment of CV Models

For online inference, we use Triton Inference Server (NVIDIA)—production standard for serving CV models. Supports TensorRT, ONNX, PyTorch, dynamic batching, multiple instances. REST and gRPC API. We guarantee stable operation under load.

Edge deployment: ONNX Runtime on ARM/x86 CPU. TensorFlow Lite for mobile devices. OpenVINO for Intel CPU/GPU/VPU—gives 2–3× speedup on Intel hardware compared to ONNX Runtime. After deployment, we hand over the model with documentation and train personnel.

What Is Included in the Work

Stage	Content	Estimated Time
Analysis	Technical specification, architecture selection, data evaluation	3–5 days
Labeling	Image collection, annotation (up to 5000 objects)	1–3 weeks
Training	Model fine-tuning, validation on test set	1–2 weeks
Optimization	Export to ONNX/TensorRT/OpenVINO, testing on target hardware	1–2 weeks
Integration	REST/gRPC API, integration with existing infrastructure	1–2 weeks
Deployment	Deployment on server or edge device, load testing	1 week
Documentation and training	Instructions, staff training, handover of code and model	3–5 days
Support	Technical support for 3 months after launch	—

Deadlines and Cost

A prototype detector on existing data takes 1–2 weeks. Production system with optimization for target hardware takes 4–8 weeks. Full cycle including data labeling (1000–5000 images) takes 2–4 months. Cost is calculated individually for each task. Typical savings from implementing a quality control system can be significant per production line.

We have been in the market for over 5 years and completed 60+ computer vision projects. We will evaluate your project end-to-end—request a consultation to get a quote and technical proposal.