What document formats are supported?

We work with PDF (including scanned), images (JPEG, PNG, TIFF) and mixed documents. For scans we use Table Transformer, for digital PDFs — camelot or pdfplumber. Output is structured data.

What is the recognition accuracy?

Accuracy depends on source document quality. On standard scans Table Transformer gives >95% table detection accuracy, camelot on PDFs with lines >98%. After post-processing and verification we achieve 99% correct data. For complex multilingual documents, we have seen accuracy up to 99.5% using cloud OCR.

Can the solution be integrated with an existing system?

Yes. We implement integration via REST API or direct export to databases (PostgreSQL, MySQL, MSSQL). We support loading data into 1C, Bitrix24 and other enterprise systems. Typical integration takes 1–2 days.

How are tables with complex structure (merged cells, multi-line headers) handled?

We apply post-processing: row merging algorithms, empty cell removal, header normalization. For complex cases we fine-tune the Table Transformer model on your data — this boosts accuracy up to 99%. Over 90% of our projects involve some form of custom post-processing.

How long does implementation take?

Basic solution for PDFs (camelot/pdfplumber) — from 1 week. Pipeline with scans and Table Transformer — 2–3 weeks. Model fine-tuning and complex post-processing — 3–5 weeks. Timelines are refined after audit. We have delivered over 40 projects, with average completion time of 3 weeks.

Table Recognition from Images and PDF: Turnkey Pipeline

We design and deploy artificial intelligence systems: from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering and MLOps to make AI work not in the lab, but in real business.

Typical situation: a client sends 500 pages of scanned reports — tables, charts, chaotic text. Regular OCR returns a blob of symbols: columns mixed, rows broken, structure lost. The task of table recognition is to locate tables and restore their grid. We solve this with a pipeline based on Table Transformer, camelot, and post-processing. The result is clean DataFrames ready for loading into a database or Excel.

Table detection is the first step in our pipeline. We are a team with 10+ years of experience in Computer Vision and NLP. We have delivered 40+ projects on table extraction from PDF and images for banks, logistics, and retail. We guarantee >95% accuracy on standard documents and full support after deployment. Our automated table pipeline processes over 1 million pages annually, saving clients an average of $30,000 per year.

How Table Transformer tackles table recognition

State of the art for this task: Table Transformer [1] from Microsoft, a DETR-based model trained on PubTables-1M (947k tables from scientific publications). The detector finds tables, and the structure recogniser restores rows and columns. For each cell bounding box we run OCR (Tesseract or EasyOCR) to extract the text. In comparison, Table Transformer is markedly more accurate than camelot on scans (93% vs 70%), but it needs a GPU for practical throughput.

from transformers import TableTransformerForObjectDetection, DetrImageProcessor
from PIL import Image
import torch

class TableExtractor:
    def __init__(self):
        # Table detector
        self.det_processor = DetrImageProcessor.from_pretrained(
            'microsoft/table-transformer-detection'
        )
        self.det_model = TableTransformerForObjectDetection.from_pretrained(
            'microsoft/table-transformer-detection'
        )

        # Structure recogniser
        self.str_processor = DetrImageProcessor.from_pretrained(
            'microsoft/table-transformer-structure-recognition'
        )
        self.str_model = TableTransformerForObjectDetection.from_pretrained(
            'microsoft/table-transformer-structure-recognition'
        )

    def extract_tables(self, image_path: str) -> list[dict]:
        image = Image.open(image_path).convert('RGB')

        # 1. Detect tables
        table_boxes = self._detect_tables(image)

        tables = []
        for box in table_boxes:
            # 2. Crop each table
            table_crop = image.crop(box)

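            # 3. Recognise structure (rows/columns)
            # Note: _recognize_structure, _extract_cell_texts and
            # _cells_to_dataframe are helper methods not shown in this excerpt.
            structure = self._recognize_structure(table_crop)

            # 4. Extract cell text via OCR
            cells = self._extract_cell_texts(table_crop, structure)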

            tables.append({
                'bbox': box,
                'structure': structure,
                'cells': cells,
                'dataframe': self._cells_to_dataframe(cells)
            })

        return tables

    def _detect_tables(self, image: Image.Image) -> list[tuple]:
        inputs = self.det_processor(images=image, return_tensors='pt')
        with torch.no_grad():
            outputs = self.det_model(**inputs)

        target_sizes = torch.tensor([image.size[::-1]])
        results = self.det_processor.post_process_object_detection(
            outputs, threshold=0.7, target_sizes=target_sizes
        )[0]

        boxes = []
        for label, box in zip(results['labels'], results['boxes']):
            # Class 0 is 'table' in the detection checkpoint
            if label.item() == 0:
                x1, y1, x2, y2 = box.tolist()
                boxes.append((int(x1), int(y1), int(x2), int(y2)))

        return boxes
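
The structure recogniser returns row and column boxes, and cells are their intersections. The sketch below shows one way to build that grid, assuming rows and columns arrive as (x1, y1, x2, y2) tuples in crop coordinates. The name `build_cell_grid` is hypothetical and is not one of the helper methods of the class above.

```python
Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)


def build_cell_grid(rows: list[Box], cols: list[Box]) -> list[list[Box]]:
    """Intersect row and column boxes into a grid of cell boxes.

    Rows are ordered top to bottom and columns left to right, so
    grid[r][c] is the cell in row r and column c.
    """
    rows_sorted = sorted(rows, key=lambda b: b[1])  # sort by top edge
    cols_sorted = sorted(cols, key=lambda b: b[0])  # sort by left edge
    grid = []
    for row in rows_sorted:
        # A cell takes its x-range from the column and its y-range from the row
        grid.append([(col[0], row[1], col[2], row[3]) for col in cols_sorted])
    return grid


# Two rows and two columns, deliberately given out of order
rows = [(0, 40, 200, 80), (0, 0, 200, 40)]
cols = [(100, 0, 200, 80), (0, 0, 100, 80)]
grid = build_cell_grid(rows, cols)
```

Each cell box can then be cropped and passed to OCR. Merged cells need extra handling, which this sketch does not attempt.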

camelot or pdfplumber: which to choose for your task?

For digital PDFs (not scans), camelot is usually the first choice. Its lattice mode handles tables with ruling lines, and its stream mode handles tables laid out as aligned text. pdfplumber offers more flexibility for mixed documents but requires manual tuning. We select the tool by document type: camelot lattice for reports with ruled tables, pdfplumber with custom settings for complex layouts.

import camelot

def extract_tables_from_pdf(pdf_path: str,
                              pages: str = 'all') -> list[dict]:
    # Lattice: for tables with explicit lines
    tables = camelot.read_pdf(
        pdf_path, pages=pages, flavor='lattice'
    )

    # Stream: fallback for tables without lines (aligned text),
    # used only when lattice finds nothing
    if len(tables) == 0:
        tables = camelot.read_pdf(
            pdf_path, pages=pages, flavor='stream',
            edge_tol=50
        )

    # Keep only tables that camelot itself rates above 80% accuracy
    results = []
    for table in tables:
        if table.accuracy > 80:
            results.append({
                'page': table.page,
                'accuracy': table.accuracy,
                'dataframe': table.df,
                'csv': table.df.to_csv(index=False)
            })

    return results

pdfplumber for mixed documents

import pdfplumber
import pandas as pd

def extract_tables_pdfplumber(pdf_path: str) -> list[pd.DataFrame]:
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_tables = page.extract_tables(
                table_settings={
                    'vertical_strategy': 'lines',
                    'horizontal_strategy': 'lines',
                    'snap_tolerance': 3
                }
            )
            for raw_table in page_tables:
                if len(raw_table) < 2:
                    continue  # skip tables with no data rows
                # First row as header
                df = pd.DataFrame(raw_table[1:], columns=raw_table[0])
                tables.append(df)

    return tables

Post-processing: table data cleaning

After extraction, cleaning is often needed:

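import pandas as pd

def clean_table(df: pd.DataFrame) -> pd.DataFrame:
    # Remove empty rows and columns
    df = df.dropna(how='all').dropna(axis=1, how='all')

    # Collapse line breaks and repeated spaces in header names
    df.columns = [' '.join(str(c).split()) for c in df.columns]

    # Convert text columns that hold only numbers
    # (decimal comma, spaces as thousands separators)
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            continue
        text = df[col].astype(str).str.replace(' ', '').str.replace(',', '.')
        converted = pd.to_numeric(text, errors='coerce')
        has_value = df[col].notna() & (text != '')
        # errors='ignore' is deprecated in recent pandas, so convert
        # only when every non-empty cell parses as a number
        if has_value.any() and converted[has_value].notna().all():
            df[col] = converted

    return df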

Approach	Application	Quality
Table Transformer	Scans, images	Good
camelot (lattice)	PDF with lines	Excellent
camelot (stream)	PDF without lines	Fair
pdfplumber	Mixed PDF	Good
AWS Textract	Cloud, scale	Good

Project workflow

Document analysis: study the source document structure, table types, metadata.
Tool selection: choose the optimal combination (Table Transformer, camelot, pdfplumber) for your case.
Pipeline development: write detection, recognition, and post-processing scripts.
Testing: run on 100+ pages, check accuracy, fine-tune model if needed.
Integration: configure export to CSV, Excel, database, or REST API.
Documentation and training: hand over code, description, train your team.

What's included (deliverables)

Source data audit: analysis of table types, complexity assessment.
Pipeline development: model inference + post-processing.
Integration: API / DB upload / integration with 1C or Bitrix24.
Testing and verification: accuracy report on your sample.
Documentation: architecture description and operation manual.
Training: 2-3 hour online session for your engineers.
Support: 1 month after project delivery.

Timelines and cost

Task	Timeline	Approximate cost
Extraction from PDF (camelot/pdfplumber)	1 week	$1,500–$3,000
Scans + Table Transformer	2–3 weeks	$3,000–$7,000
Complex tables, post-processing	3–5 weeks	$5,000–$12,000

Cost is calculated individually — depends on document volume, complexity, and need for fine-tuning. We estimate your project for free within 1 business day. To get started, just a few sample documents are enough. Typical savings from automated table parsing is 80% on manual data entry costs. Companies typically save $20,000–$100,000 annually after implementation. For a logistics client, we extracted tables from 10,000 PDF invoices daily, reducing manual effort by 90%.

OCR for table cells: Tesseract vs EasyOCR vs cloud APIs

After the table structure is recognised, text from each cell must be extracted. The choice of OCR engine critically affects final accuracy.

Tesseract is a mature open-source engine. It works well with printed text on white background. Requires pre-processing: noise removal, Otsu binarization. Supports 100+ languages via language packs. Speed: ~0.1 sec per cell on CPU.

EasyOCR is a modern neural network alternative. It handles complex fonts and scanning artifacts better. On GPU it is 3–4 times faster than Tesseract with comparable quality. Supports Russian without additional setup.

AWS Textract / Google Vision API are cloud solutions. Best accuracy on complex documents (85–97%), automatic table structure recognition. Suitable for batch processing without GPU. Cost depends on page volume.

Our approach: for scans with good resolution (300+ DPI) we use EasyOCR, for complex multilingual documents — AWS Textract, for on-premise without internet — Tesseract with preprocessing.
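
The selection rule above can be written down as a small function. This is a sketch only: the function name, the order in which the conditions are checked, and the fallback for low-resolution scans are assumptions, since the text does not state them.

```python
def choose_ocr_engine(dpi: int, multilingual: bool, offline: bool) -> str:
    """Pick an OCR engine following the rule of thumb in the text.

    The priority order (offline first, then multilingual, then DPI)
    is an assumption made for this sketch.
    """
    if offline:
        # On-premise without internet: Tesseract with preprocessing
        return 'tesseract'
    if multilingual:
        # Complex multilingual documents: cloud OCR
        return 'aws_textract'
    if dpi >= 300:
        # Good-resolution scans
        return 'easyocr'
    # Assumption: the text does not say what to use below 300 DPI
    return 'tesseract'


engine = choose_ocr_engine(dpi=300, multilingual=False, offline=False)
```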

Our table parsing pipeline handles complex layouts and integrates with your existing workflow; document data extraction is the core of our service.

Ready to take on your task — we will discuss details and timelines. Get a consultation on choosing the optimal solution.

[1] Microsoft Table Transformer on Hugging Face

How Distribution Shift Kills CV Model Metrics in Industry

On a production line, a camera is installed to control product quality. The model is trained on 10,000 labeled images—test accuracy mAP 0.84. Deployed to production, and in the first week it misses 30% of defects. Lighting on the line changes between shifts; distribution shift nullifies the metrics. This is a classic story with computer vision in industry, where pattern recognition fails without proper drift handling.

Our engineers, with experience from 60+ computer vision projects, know how to eliminate such scenarios. We guarantee stable model performance under real conditions.

Object Detection: YOLO, RT-DETR, and Everything in Between

YOLO is the standard for real-time detection. YOLOv8 and YOLO11 from Ultralytics are the most used versions in production: simple API, active community, built-in validation, and export to ONNX/TensorRT. For tasks where accuracy matters more than latency, RT-DETR, a transformer-based architecture that needs no NMS, gives better mAP on COCO at a speed comparable to YOLOv8l.

Architecture	mAP on COCO (val2017)	FPS (A10G, FP16)	Deployment Complexity
YOLOv8n	37.3	700+	Low (ONNX/TensorRT)
YOLOv8m	50.2	250	Low
RT-DETR-L	53.0	140	Medium (requires PyTorch)
Mask R-CNN	38.2 (bbox)	30	High

A typical mistake when training a detector: a dataset of 8,000 images with 3 classes, YOLOv8m fine-tuned to F1 0.73 on validation. The confusion matrix shows that one class is almost never detected. Cause: a 1:23 class imbalance. Solution: oversample the rare class, use focal loss in the classification branch (YOLOv8 is anchor-free and has no separate objectness output), and disable Mosaic and MixUp for the rare class because they "blur" it. Transfer learning is mandatory: starting from COCO-pretrained weights cuts the data requirement by roughly 10 times. Fine-tuning on 500–2000 domain images yields a working model in 1–2 days on a single GPU.
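
Oversampling the rare class can be sketched with inverse-frequency sampling weights. The example assumes one class label per training sample for simplicity (detection images usually carry several), and the class names are made up for illustration.

```python
import random
from collections import Counter


def class_balanced_weights(labels: list[str]) -> list[float]:
    """Weight each sample by 1 / (number of samples in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Hypothetical dataset with a 1:23 imbalance
labels = ['crack'] * 1 + ['scratch'] * 23
weights = class_balanced_weights(labels)

# Draw a training epoch with replacement; both classes are now
# expected to appear about equally often
rng = random.Random(0)
epoch = rng.choices(labels, weights=weights, k=1000)
epoch_counts = Counter(epoch)
```

With these weights, the total sampling mass of each class is the same, so the rare class is no longer drowned out. The cost is that the few rare samples repeat often, which is why augmentation choices for that class matter.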

For edge deployment: export to ONNX → TensorRT engine. YOLOv8n in TensorRT FP16 on Jetson AGX Orin gives 150+ FPS at P99 latency < 8 ms—3 times faster than ONNX Runtime without TensorRT. On server A10G: 700+ FPS for YOLOv8n in TensorRT INT8.

How Does Fine-Tuning YOLO Help in Pattern Recognition?

Suppose you need to find micro-defects on a metal surface, a task with high resolution and class imbalance. We take YOLOv8m pretrained on COCO and fine-tune it on 2,000 proprietary images, applying Mosaic, MixUp, and random-perspective augmentations. After 200 epochs, mAP@0.5 reaches 0.93. Key techniques:

Focal loss for the classification head (YOLOv8 has no separate objectness head): reduces the contribution of easily classified examples.
Class-balanced sampling: equalizes the representation of rare classes.
Test Time Augmentation (TTA): increases recall by 5–7% through averaging over flips and scales.
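
To make the first technique concrete, here is a minimal scalar sketch of binary focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). The defaults gamma=2.0 and alpha=0.25 are the commonly cited values and are an assumption here, not settings taken from the project described above.

```python
import math


def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Binary focal loss for one prediction.

    p is the predicted probability of the positive class, y is 0 or 1.
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class weighting term
    p_t = min(max(p_t, 1e-12), 1.0)             # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, 1)   # confident and correct: heavily down-weighted
hard = focal_loss(0.1, 1)   # confident and wrong: keeps most of its loss
```

With gamma=2, an example predicted at 0.9 contributes only 1% of its alpha-weighted cross-entropy, so training focuses on the hard, often rare, examples.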

Get a consultation on architecture selection for your task—contact us.

Segmentation: SAM, Mask R-CNN, and Instance Segmentation

SAM (Segment Anything Model) from Meta changed the approach to segmentation. SAM 2 works with video, supports object tracking across frames—for interactive object selection by point or bbox, it's the best out-of-the-box choice. For production instance segmentation without interactive prompting, Mask R-CNN or YOLOv8-seg are used. YOLOv8-seg trains like a regular detector with additional masks, convenient in the same pipelines. Semantic segmentation (each pixel is a class) uses SegFormer, DeepLabV3+. SegFormer-B5 provides a good balance of accuracy and speed for satellite imagery or medical segmentation.

Case study: cell segmentation on microscopic images. Dataset of 400 images with manual annotation. Training Mask R-CNN on ResNet-50 backbone gave IoU 0.61—poor. Problem: objects (cells) overlap; standard NMS kills overlapping predictions. Solution: switch to cellpose (specialized architecture for biomedical tasks) + soft-NMS. IoU increased to 0.79.
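
The difference from standard NMS is easiest to see in code. Below is a generic sketch of Gaussian soft-NMS on axis-aligned boxes; it is not the cellpose pipeline from the case study, and the sigma and score threshold values are illustrative assumptions.

```python
import math

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes: list[Box], scores: list[float],
             sigma: float = 0.5, score_thresh: float = 0.001) -> list[tuple[Box, float]]:
    """Gaussian soft-NMS: decay overlapping scores instead of deleting boxes."""
    remaining = list(zip(boxes, scores))
    kept = []
    while remaining:
        # Take the highest-scoring remaining box
        best_idx = max(range(len(remaining)), key=lambda i: remaining[i][1])
        best_box, best_score = remaining.pop(best_idx)
        kept.append((best_box, best_score))
        # Decay the others by their overlap with it
        decayed = []
        for box, score in remaining:
            new_score = score * math.exp(-(iou(best_box, box) ** 2) / sigma)
            if new_score >= score_thresh:
                decayed.append((box, new_score))
        remaining = decayed
    return kept


boxes = [(0, 0, 10, 10), (2, 0, 12, 10), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = soft_nms(boxes, scores)
```

In this example the second box overlaps the first with IoU 2/3. Hard NMS at a typical 0.5 threshold would delete it; soft-NMS keeps it with a reduced score, which is what preserves overlapping cells.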

OCR: When Tesseract Fails

Tesseract is a starting point for simple tasks: printed text, good lighting, straight layout. As soon as there are handwritten elements, non-standard fonts, perspective distortions, or multi-column layouts, Tesseract degrades quickly.

PaddleOCR is a production-grade solution: text block detection + recognition + structural analysis. Works out of the box for 80+ languages, including Russian. Supports tables and complex document structures. TrOCR (Microsoft) is a transformer OCR with strong results on handwritten text. For Russian handwritten text, fine-tuning is needed: the base model is trained mostly on Latin script.

What to Do When Tesseract Cannot Handle Pattern Recognition on Documents?

For tasks like "extract data from invoices/contracts/passports," we use LayoutLMv3 or Donut—these models understand document layout, not just text. Integration via Hugging Face Transformers, fine-tuning on 200–500 annotated documents. Typical pipeline:

Preprocessing: deskew, denoising, binarization via OpenCV.
Text block detection: PaddleOCR detection or CRAFT.
Recognition: PaddleOCR recognition or TrOCR.
Post-processing: normalization, validation via regex or LLM for structured fields.

For documents with fixed structure, template matching + OCR by coordinates is often more reliable than an end-to-end solution.
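
The post-processing step (normalization plus regex validation) can be sketched as follows. The field names and patterns are hypothetical and would be replaced by the formats of the actual documents.

```python
import re

# Hypothetical field formats, for illustration only
FIELD_PATTERNS = {
    'invoice_number': re.compile(r'[A-Z]{2,4}-\d{4,8}'),
    'date': re.compile(r'\d{2}\.\d{2}\.\d{4}'),
    'total': re.compile(r'\d+(\.\d{2})?'),
}


def normalize(name: str, value: str) -> str:
    """Strip whitespace; for amounts, drop spaces and use a decimal point."""
    value = value.strip()
    if name == 'total':
        value = value.replace(' ', '').replace(',', '.')
    return value


def validate_fields(fields: dict[str, str]) -> dict[str, tuple[str, bool]]:
    """Return the normalized value and a validity flag for each OCR field."""
    result = {}
    for name, raw in fields.items():
        value = normalize(name, raw)
        pattern = FIELD_PATTERNS.get(name)
        is_valid = bool(pattern and pattern.fullmatch(value))
        result[name] = (value, is_valid)
    return result


ocr_output = {
    'invoice_number': ' INV-00123 ',
    'date': '2024-12-31',      # wrong format for the pattern above
    'total': '1 250,00',
}
checked = validate_fields(ocr_output)
```

Fields that fail validation can be routed to manual review or to a second recognition pass rather than written to the target system.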

Face Recognition: Identification and Verification

Face recognition = detection + alignment + embedding + matching. Each stage matters.

Detection: RetinaFace or InsightFace for accurate face localization and keypoints. MTCNN is older but reliable.

Embedding: ArcFace (InsightFace) is state-of-the-art for face recognition embeddings. The iresnet50/iresnet100 models are pretrained on MS1MV3 (about 5M images of roughly 93K identities). The embedding is a 512-dimensional float32 vector, compared by cosine similarity.

Threshold tuning: the decision threshold is a critical parameter. At threshold 0.6, typical FPR on the LFW benchmark is 0.001 and TPR is 0.985. In production, the threshold must be calibrated to the real distribution: people in masks, with changed appearance, under different lighting conditions.

Liveness detection is mandatory: MiniFASNet is a lightweight model that runs on CPU, and FaceX-Zoo contains several pretrained liveness detectors.
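
The matching stage reduces to a cosine similarity and a threshold. A minimal sketch follows, using short toy vectors in place of 512-dimensional embeddings; the function names are hypothetical, and the default threshold of 0.6 is the value quoted above, which would need recalibration on real data.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError('embeddings must have the same length')
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_same_person(emb_a: list[float], emb_b: list[float],
                   threshold: float = 0.6) -> bool:
    """Verification decision: accept when similarity reaches the threshold."""
    return cosine_similarity(emb_a, emb_b) >= threshold


# Toy 2-D vectors standing in for real embeddings
match = is_same_person([1.0, 1.0], [1.0, 0.0])      # similarity about 0.707
mismatch = is_same_person([1.0, 0.0], [0.0, 1.0])   # similarity 0.0
```

Raising the threshold lowers the false positive rate at the cost of more rejected genuine matches, which is why it is tuned on data from the target environment.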

Video Analytics

Video is a sequence of frames plus a temporal dimension. A naive approach—detecting on every frame—is expensive.

Tracking: ByteTrack and BoT-SORT are the standard for multi-object tracking. They work on top of any detector, adding persistent IDs to objects across frames—enabling object counting, motion tracking, velocity.

Optimization: not every frame needs processing. For static scenes, detect every 5–10 frames, with tracking in between. For event detection (a person entering a zone), background subtraction (OpenCV MOG2) serves as a lightweight pre-filter before neural detection.

Action recognition: SlowFast and VideoMAE for action classification. These are heavy models; for production, use ONNX export + TensorRT or offline processing.
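
The "detect every N frames, track in between" schedule can be sketched independently of any particular detector or tracker. Here `detect` and `track` are hypothetical callbacks standing in for a real detector and a tracker such as ByteTrack, and the stubs exist only to make the example runnable.

```python
from typing import Any, Callable, Iterable


def process_stream(frames: Iterable[Any],
                   detect: Callable[[Any], list],
                   track: Callable[[Any, list], list],
                   detect_every: int = 5) -> list[list]:
    """Run the detector on every N-th frame and the tracker on the rest."""
    if detect_every < 1:
        raise ValueError('detect_every must be at least 1')
    results = []
    tracks: list = []
    for index, frame in enumerate(frames):
        if index % detect_every == 0:
            tracks = detect(frame)          # expensive: full neural detection
        else:
            tracks = track(frame, tracks)   # cheap: propagate existing objects
        results.append(tracks)
    return results


# Stub callbacks that only count how often they are called
calls = {'detect': 0, 'track': 0}


def fake_detect(frame):
    calls['detect'] += 1
    return [{'id': 1, 'frame': frame}]


def fake_track(frame, tracks):
    calls['track'] += 1
    return [{**t, 'frame': frame} for t in tracks]


results = process_stream(range(12), fake_detect, fake_track, detect_every=5)
```

For 12 frames with detect_every=5, the detector runs on frames 0, 5 and 10 and the tracker handles the other nine.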

How to Measure Pattern Recognition Model Quality in Production?

Quality monitoring is key to MLOps. We track:

Prediction confidence distribution.
Share of low-confidence predictions (indicator of OOD data).
Drift of input images via feature distribution (embeddings from backbone).

A drop in average confidence from 0.87 to 0.71 over a week is an early signal of distribution shift. NVIDIA Triton Inference Server exposes its serving metrics in Prometheus format, and model-level signals like these can be exported to the same Prometheus stack. Our certified engineers set up monitoring and guarantee SLA for inference quality.
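
The first two signals in the list above are simple to compute from logged confidences. A minimal sketch: the function names, the 0.5 low-confidence cutoff, and the 0.1 alert threshold are illustrative assumptions to be tuned per model.

```python
from statistics import mean


def confidence_report(confidences: list[float],
                      low_conf_threshold: float = 0.5) -> dict[str, float]:
    """Mean confidence and share of low-confidence predictions for one window."""
    if not confidences:
        raise ValueError('no predictions in this window')
    low = sum(1 for c in confidences if c < low_conf_threshold)
    return {
        'mean_confidence': mean(confidences),
        'low_conf_share': low / len(confidences),
    }


def drift_alert(baseline_mean: float, current_mean: float,
                max_drop: float = 0.1) -> bool:
    """Flag a window whose mean confidence fell too far below the baseline."""
    return (baseline_mean - current_mean) > max_drop


report = confidence_report([0.9, 0.8, 0.4, 0.3])
alert = drift_alert(baseline_mean=0.87, current_mean=0.71)
```

The drop from 0.87 to 0.71 mentioned above exceeds the 0.1 limit and raises the alert. Confidence alone can miss shifts where the model stays confidently wrong, so embedding drift remains a useful third signal.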

Deployment of CV Models

For online inference, we use Triton Inference Server (NVIDIA)—production standard for serving CV models. Supports TensorRT, ONNX, PyTorch, dynamic batching, multiple instances. REST and gRPC API. We guarantee stable operation under load.

Edge deployment: ONNX Runtime on ARM/x86 CPU. TensorFlow Lite for mobile devices. OpenVINO for Intel CPU/GPU/VPU—gives 2–3× speedup on Intel hardware compared to ONNX Runtime. After deployment, we hand over the model with documentation and train personnel.

What Is Included in the Work

Stage	Content	Estimated Time
Analysis	Technical specification, architecture selection, data evaluation	3–5 days
Labeling	Image collection, annotation (up to 5000 objects)	1–3 weeks
Training	Model fine-tuning, validation on test set	1–2 weeks
Optimization	Export to ONNX/TensorRT/OpenVINO, testing on target hardware	1–2 weeks
Integration	REST/gRPC API, integration with existing infrastructure	1–2 weeks
Deployment	Deployment on server or edge device, load testing	1 week
Documentation and training	Instructions, staff training, handover of code and model	3–5 days
Support	Technical support for 3 months after launch	—

Deadlines and Cost

A prototype detector on existing data takes 1–2 weeks. Production system with optimization for target hardware takes 4–8 weeks. Full cycle including data labeling (1000–5000 images) takes 2–4 months. Cost is calculated individually for each task. Typical savings from implementing a quality control system can be significant per production line.

We have been in the market for over 5 years and completed 60+ computer vision projects. We will evaluate your project end-to-end—request a consultation to get a quote and technical proposal.