Which image formats does Azure Computer Vision OCR support?

The Read API accepts JPEG, PNG, BMP, TIFF, and PDF, including multi-page PDF files. Converting PDF pages to images first is only needed if you preprocess the pages yourself (for example, deskewing or denoising).

How to handle large volumes of documents?

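Use the asynchronous Read API and respect the per-resource rate limit (10 requests per minute in the setup described here; the actual quota depends on the pricing tier). For batch processing, send requests in parallel and retry with backoff when the service responds with throttling errors.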

Can OCR be deployed on a local server?

Yes, Microsoft provides a Docker container for Read API. Data stays within your infrastructure, crucial for banks and government agencies. The container supports the same features as the cloud version.

How to extract data from invoices?

Use Document Intelligence with the prebuilt 'prebuilt-invoice' model. It automatically recognizes fields: vendor, date, total, line items. Accuracy on structured documents reaches 99%.

How long does integration take?

Basic Read API integration takes 3-5 days. With Document Intelligence and custom models – from 2 weeks. On-premise container with PDF processing – 1-2 weeks. Timelines depend on complexity and volumes.

Extract Document Data with Azure Computer Vision OCR

We design and deploy artificial intelligence systems: from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering and MLOps to make AI work not in the lab, but in real business.

When standard OCR solutions fail on unusual fonts or poor lighting, especially when extracting data from hundreds of invoices or passports, Azure Computer Vision from Microsoft is a practical alternative. Our certified engineers configure the Read API and Document Intelligence for your specific tasks, delivering accuracy up to 99%. We've implemented over 30 document automation projects across industries. Automation reduces manual processing costs by up to 80%, saving approximately $15,000 per month on 10,000 documents. Typical integration costs range from $3,000 to $10,000, depending on complexity.

Why Is Read API the Primary OCR Service in Azure?

Azure Computer Vision offers two OCR services: Read API (optimized for dense documents, recommended by Microsoft) and the legacy OCR API (only for simple images). Read API 4.0 works both in the cloud and as an on-premise container. We use the text recognition API because it handles handwritten text, tables, and multi-page PDFs. According to official Microsoft documentation, Read API's accuracy on structured documents reaches 99%. Compared to legacy OCR, the Azure OCR service provides 2x higher accuracy on handwritten content.

Integrating the OCR API in Python: Step-by-Step Guide

Create a Computer Vision resource in Azure portal (key and endpoint).
Install the library azure-cognitiveservices-vision-computervision via pip.
Write an async call — the code below shows class AzureOCR for extracting text from an image.
Process the result — parse bounding boxes for tables, filter by confidence.
Add retry logic for timeouts (exponential backoff).

from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from azure.cognitiveservices.vision.computervision.models import OperationStatusCodes
from msrest.authentication import CognitiveServicesCredentials
import time

class AzureOCR:
    def __init__(self, endpoint: str, api_key: str):
        self.client = ComputerVisionClient(
            endpoint,
            CognitiveServicesCredentials(api_key)
        )
    
    def extract_text_from_url(self, image_url: str) -> str:
        """Read API: async processing via URL"""
        # read() submits an image by URL; read_in_stream() is for local files
        read_response = self.client.read(image_url, raw=True)
        
        # Get operation ID from header
        operation_location = read_response.headers['Operation-Location']
        operation_id = operation_location.split('/')[-1]
        
        # Wait for result
        while True:
            read_result = self.client.get_read_result(operation_id)
            if read_result.status not in [
                OperationStatusCodes.running,
                OperationStatusCodes.not_started
            ]:
                break
            time.sleep(0.5)
        
        # Extract text
        text_lines = []
        if read_result.status == OperationStatusCodes.succeeded:
            for page in read_result.analyze_result.read_results:
                for line in page.lines:
                    text_lines.append(line.text)
        
        return '\n'.join(text_lines)
    
    def extract_with_positions(self, image_path: str) -> list[dict]:
        """Extraction with bounding box coordinates"""
        with open(image_path, 'rb') as f:
            read_response = self.client.read_in_stream(f, raw=True)
        
        operation_id = read_response.headers['Operation-Location'].split('/')[-1]
        
        while True:
            result = self.client.get_read_result(operation_id)
            if result.status not in [OperationStatusCodes.running,
                                       OperationStatusCodes.not_started]:
                break
            time.sleep(0.3)
        
        words = []
        if result.status == OperationStatusCodes.succeeded:
            for page in result.analyze_result.read_results:
                for line in page.lines:
                    for word in line.words:
                        words.append({
                            'text': word.text,
                            'confidence': word.confidence,
                            'bbox': word.bounding_box
                        })
        return words
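
Step 4 above (parse bounding boxes for tables, filter by confidence) can be sketched in plain Python. This is a minimal illustration, not part of the Azure SDK: the helper name group_words_into_rows, the 0.8 confidence cutoff, and the 10-pixel row tolerance are assumptions to tune on your own documents. It expects the word dictionaries returned by extract_with_positions, where bbox is a flat list of eight coordinates (four corners, starting at the top-left).

```python
def group_words_into_rows(words, min_confidence=0.8, row_tolerance=10.0):
    """Drop low-confidence words and group the rest into table-like rows.

    Hypothetical helper: thresholds are illustrative defaults, not SDK values.
    """
    kept = [w for w in words if w['confidence'] >= min_confidence]

    def center_y(word):
        ys = word['bbox'][1::2]  # y coordinates of the four corners
        return sum(ys) / len(ys)

    def left_x(word):
        return min(word['bbox'][0::2])  # leftmost x coordinate

    rows = []
    for word in sorted(kept, key=center_y):
        # A word joins the current row if its vertical center is close enough
        if rows and abs(center_y(word) - rows[-1]['y']) <= row_tolerance:
            rows[-1]['words'].append(word)
        else:
            rows.append({'y': center_y(word), 'words': [word]})

    # Within each row, order the cells from left to right
    return [
        [w['text'] for w in sorted(row['words'], key=left_x)]
        for row in rows
    ]
```

Each returned row is a list of cell texts in reading order. Rotated or skewed scans would need deskewing first, since the grouping relies on rows being roughly horizontal.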

When to Use Document Intelligence Instead of Read API?

For arbitrary text on images, use the Read API. If you need to extract structured fields from invoices, contracts, or IDs, Document Intelligence (formerly Form Recognizer) is better. It offers prebuilt models and allows custom ones. Document Intelligence is up to 3x more accurate on structured documents compared to general OCR. Example invoice analysis:

from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

class AzureDocumentIntelligence:
    def __init__(self, endpoint: str, api_key: str):
        self.client = DocumentAnalysisClient(
            endpoint=endpoint,
            credential=AzureKeyCredential(api_key)
        )
    
    def analyze_invoice(self, image_path: str) -> dict:
        """Specialized invoice analysis"""
        with open(image_path, 'rb') as f:
            poller = self.client.begin_analyze_document(
                'prebuilt-invoice', f
            )
        
        result = poller.result()
        invoices = []

        def value_of(field):
            # DocumentField exposes .value as an attribute; absent fields are None
            return field.value if field is not None else None

        for invoice in result.documents:
            fields = invoice.fields
            invoice_date = value_of(fields.get('InvoiceDate'))
            invoices.append({
                'vendor_name': value_of(fields.get('VendorName')),
                'invoice_date': str(invoice_date) if invoice_date else None,
                'total_amount': value_of(fields.get('AmountDue')),
                'invoice_id': value_of(fields.get('InvoiceId')),
                'line_items': [
                    {
                        # Each item is a dictionary-valued DocumentField
                        'description': value_of((item.value or {}).get('Description')),
                        'amount': value_of((item.value or {}).get('Amount'))
                    }
                    for item in (value_of(fields.get('Items')) or [])
                ]
            })
        
        return invoices[0] if invoices else {}

How to Deploy OCR On-Premise?

For data requiring local processing, use the Read API Container. Data stays within your infrastructure with minimal latency. This container is essential in banking and government sectors. Launch is simple:

docker run --rm -it -p 5000:5000 \
  -e Eula=accept \
  -e ApiKey=YOUR_KEY \
  -e Billing=YOUR_ENDPOINT \
  mcr.microsoft.com/azure-cognitive-services/vision/read:3.2

Case: Processing 10,000 Invoices per Day

For a large retailer, we deployed a hybrid solution: cloud Read API for quick requests and on-premise container for sensitive data. We set up parallel queues with Azure Service Bus, enabling processing of up to 10,000 invoices daily with p99 latency < 2s. Field recognition accuracy reached 98.5%.

Azure Computer Vision OCR Implementation Process

Audit — analyze current document processing workflows, document types, volumes.
Design — select service (Read API / Document Intelligence), architecture (cloud / container / hybrid).
Integration — develop a Python library for API calls with error handling, retries, monitoring.
Testing — verify accuracy on your samples, stress test under load.
Deploy — deploy to production, set up CI/CD, monitor latency and accuracy.
Support — train your team, provide documentation, post-launch support.

What's Included in the Work (Deliverables)

Documentation — architecture description, operation manual, API description.
Source code — Python module for Azure Computer Vision integration, including error handling and retries.
Team training — workshop on using the developed solution.
Support — warranty maintenance for one month after launch.

Common OCR Integration Mistakes to Avoid

Wrong API selection: using legacy OCR instead of Read API. Always use the modern text recognition service.
Ignoring limits: Read API is restricted to 10 requests per minute per resource. Because the limit applies to the resource rather than to a key, distribute requests across multiple resources or introduce a queue.
Missing error handling: timeouts, service unavailability. Add exponential backoff and retry logic.
Forgetting bounding boxes: for table text extraction, coordinates are mandatory. Always use extract_with_positions when working with tables.
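
The retry advice above can be sketched as a small wrapper with exponential backoff and jitter. This is an illustration under assumptions, not the SDK's own retry policy: ThrottledError is a hypothetical stand-in for whatever exception your client raises on HTTP 429 or a timeout, and the delay values are placeholders.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling (HTTP 429) error from the OCR call."""


def call_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      retry_on=(ThrottledError, TimeoutError), sleep=time.sleep):
    """Call func(); on a retryable error wait 1s, 2s, 4s, ... and try again."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter keeps parallel workers from retrying in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))
```

Usage would look like call_with_backoff(lambda: ocr.extract_with_positions('invoice.jpg')), with retry_on set to the exception types your client actually raises.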

Feature	Read API	Document Intelligence
OCR for arbitrary text	Yes	Yes
Table structure	No	Yes
Specialized models (invoice, ID)	No	Yes
Custom models	No	Yes
Price per 1000 pages	$1.50	$10–50

Task	Timeline
Basic Read API integration	3–5 days
Document Intelligence with field extraction	1–2 weeks
On-premise container + PDF processing	1–2 weeks

Checklist for Successful Integration

Determine document types and required fields.
Choose the appropriate service tier (S0/S1) based on volumes.
Implement async calls with error handling.
Set up monitoring of metrics (latency, accuracy, error rate).
Conduct A/B testing on real data.

Our team, with over 30 successful projects and 5+ years of experience, delivers reliable OCR solutions. Our Azure Computer Vision OCR integration ensures high accuracy and efficiency. Get a consultation from an Azure Computer Vision engineer. Contact us to assess your project — we'll help automate document processing with up to 99% accuracy.

How Distribution Shift Kills CV Model Metrics in Industry

A camera is installed on a production line to control product quality. The model is trained on 10,000 labeled images and reaches mAP 0.84 on the test set. After deployment, it misses 30% of defects in the first week. Lighting on the line changes between shifts, and the resulting distribution shift nullifies the test metrics. This is a classic story with computer vision in industry, where pattern recognition fails without proper drift handling.

Our engineers, with experience from 60+ computer vision projects, know how to eliminate such scenarios. We guarantee stable model performance under real conditions.

Object Detection: YOLO, RT-DETR, and Everything in Between

YOLO is the standard for real-time detection. YOLOv8 and YOLO11 from Ultralytics are the most used versions in production: simple API, active community, built-in validation, and export to ONNX/TensorRT. For tasks with high accuracy requirements and less critical latency, RT-DETR, a transformer-based architecture without NMS, gives better mAP on COCO at comparable speed to YOLOv8l.

Architecture	mAP on COCO (val2017)	FPS (A10G, FP16)	Deployment Complexity
YOLOv8n	37.3	700+	Low (ONNX/TensorRT)
YOLOv8m	50.2	250	Low
RT-DETR-L	53.0	140	Medium (requires PyTorch)
Mask R-CNN	38.2 (bbox)	30	High

A typical mistake when training a detector: a dataset of 8000 images with 3 classes, YOLOv8m fine-tuned to F1 0.73 on validation. The confusion matrix shows that one class is almost never detected. Cause: class imbalance of 1:23. Solution: oversample the rare class, apply focal loss to the classification branch (YOLOv8 is anchor-free and has no separate objectness output), and disable the Mosaic and MixUp augmentations for the rare class, since they "blur" it. Transfer learning is mandatory: COCO-pretrained weights reduce the data requirement roughly tenfold. Fine-tuning on 500–2000 domain images yields a working model in 1–2 days on a single GPU.
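
To show why focal loss helps with imbalance, here is a minimal sketch of the binary form, FL = -alpha_t * (1 - p_t)^gamma * log(p_t), in plain Python. The defaults gamma=2.0 and alpha=0.25 are the commonly used values and are assumptions here. Wiring this into a specific detector's training loop is out of scope for the sketch.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class, y: true label (0 or 1).
    """
    p = min(max(p, 1e-7), 1 - 1e-7)           # avoid log(0)
    p_t = p if y == 1 else 1 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1 - alpha  # class weighting term
    # (1 - p_t)^gamma shrinks the loss of examples that are already easy
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, 1)  # confident and correct: almost no loss
hard = focal_loss(0.1, 1)  # confident and wrong: dominates the gradient
```

With gamma set to 0 the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy, which is a quick sanity check for any implementation.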

For edge deployment: export to ONNX → TensorRT engine. YOLOv8n in TensorRT FP16 on Jetson AGX Orin gives 150+ FPS at P99 latency < 8 ms—3 times faster than ONNX Runtime without TensorRT. On server A10G: 700+ FPS for YOLOv8n in TensorRT INT8.

How Does Fine-Tuning YOLO Help in Pattern Recognition?

Suppose you need to find micro-defects on a metal surface, a task with high resolution and class imbalance. We take YOLOv8m pretrained on COCO and fine-tune it on 2000 proprietary images with Mosaic, MixUp, and random perspective augmentations. After 200 epochs, mAP@0.5 reaches 0.93. Key techniques:

Focal loss for the classification head: reduces the contribution of easily classified examples.
Class-balanced sampling: equalizes the representation of rare classes.
Test Time Augmentation (TTA): increases recall by 5–7% through averaging over flips and scales.
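
Class-balanced sampling from the list above can be sketched with inverse-frequency weights. This is a simplified illustration that assumes one dominant class label per image; the helper names are hypothetical. In a PyTorch pipeline the same weights could be passed to torch.utils.data.WeightedRandomSampler.

```python
import random
from collections import Counter


def balanced_sample_weights(labels):
    """Weight each sample by 1 / (size of its class), so every class
    receives the same total sampling probability."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


def sample_balanced(items, labels, k, seed=None):
    """Draw k items with replacement using the class-balanced weights."""
    rng = random.Random(seed)
    return rng.choices(items, weights=balanced_sample_weights(labels), k=k)


# Illustrative data with a 23:1 imbalance between 'ok' and 'defect'
labels = ['ok'] * 23 + ['defect']
items = list(range(len(labels)))
batch = sample_balanced(items, labels, k=2000, seed=0)
defect_share = sum(1 for i in batch if labels[i] == 'defect') / len(batch)
```

After reweighting, the rare class makes up roughly half of the sampled items instead of about 4%. Because rare images repeat often, this is usually combined with augmentation to limit overfitting.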

Get a consultation on architecture selection for your task—contact us.

Segmentation: SAM, Mask R-CNN, and Instance Segmentation

SAM (Segment Anything Model) from Meta changed the approach to segmentation. SAM 2 works with video, supports object tracking across frames—for interactive object selection by point or bbox, it's the best out-of-the-box choice. For production instance segmentation without interactive prompting, Mask R-CNN or YOLOv8-seg are used. YOLOv8-seg trains like a regular detector with additional masks, convenient in the same pipelines. Semantic segmentation (each pixel is a class) uses SegFormer, DeepLabV3+. SegFormer-B5 provides a good balance of accuracy and speed for satellite imagery or medical segmentation.

Case study: cell segmentation on microscopic images. Dataset of 400 images with manual annotation. Training Mask R-CNN with a ResNet-50 backbone gave IoU 0.61, which is poor. Problem: the cells overlap, and standard NMS suppresses overlapping predictions. Solution: switch to Cellpose (a specialized architecture for biomedical tasks) + soft-NMS. IoU increased to 0.79.
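
The difference between standard NMS and soft-NMS is easiest to see in code. Below is a minimal sketch of the Gaussian variant on axis-aligned boxes (x1, y1, x2, y2). It is an illustration rather than the pipeline used in the case study, which worked with masks; sigma=0.5 and the score cutoff are assumed defaults.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian soft-NMS: overlapping boxes are down-weighted, not removed."""
    candidates = list(zip(boxes, scores))
    kept = []
    while candidates:
        # Take the highest-scoring remaining box
        best = max(range(len(candidates)), key=lambda i: candidates[i][1])
        best_box, best_score = candidates.pop(best)
        kept.append((best_box, best_score))
        rescored = []
        for box, score in candidates:
            # Decay grows with overlap; disjoint boxes keep their score
            score *= math.exp(-(iou(best_box, box) ** 2) / sigma)
            if score >= score_threshold:
                rescored.append((box, score))
        candidates = rescored
    return kept
```

Hard NMS with a typical IoU threshold of 0.5 would delete a neighbour that overlaps by 0.8 entirely. Here it survives with a reduced score, which is what keeps touching cells from being merged into one detection.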

OCR: When Tesseract Fails

Tesseract is a starting point for simple tasks: printed text, good lighting, straight layout. As soon as there are handwritten elements, non-standard fonts, perspective distortions, or multi-column layouts, Tesseract degrades quickly.

PaddleOCR is a production-grade solution: text block detection + recognition + structural analysis. Works out of the box for 80+ languages, including Russian. Supports tables and complex document structures. TrOCR (Microsoft) is a transformer OCR with strong results on handwritten text. For Russian handwritten text, fine-tuning is needed: the base model is trained mostly on Latin script.

What to Do When Tesseract Cannot Handle Pattern Recognition on Documents?

For tasks like "extract data from invoices/contracts/passports," we use LayoutLMv3 or Donut—these models understand document layout, not just text. Integration via Hugging Face Transformers, fine-tuning on 200–500 annotated documents. Typical pipeline:

Preprocessing: deskew, denoising, binarization via OpenCV.
Text block detection: PaddleOCR detection or CRAFT.
Recognition: PaddleOCR recognition or TrOCR.
Post-processing: normalization, validation via regex or LLM for structured fields.
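
The post-processing step can be sketched with the standard library alone. The field names, the accepted date formats, and the invoice-number pattern below are hypothetical; real documents need patterns derived from your own samples. The amount parser assumes the last comma or dot is the decimal separator, which misreads values like "1,234" that carry only a thousands separator.

```python
import re
from datetime import datetime

DATE_FORMATS = ('%d.%m.%Y', '%Y-%m-%d', '%d/%m/%Y')  # assumed input formats
INVOICE_ID = re.compile(r'^[A-Z]{2,4}-\d{4,10}$')    # hypothetical numbering scheme


def normalize_date(raw):
    """Return an ISO date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_amount(raw):
    """Parse '1 234,50 EUR' or '$1,234.50' into a float, or return None."""
    cleaned = re.sub(r'[^\d,.]', '', raw)
    if not re.search(r'\d', cleaned):
        return None
    last = max(cleaned.rfind(','), cleaned.rfind('.'))
    if last == -1:
        return float(cleaned)
    integer = re.sub(r'[,.]', '', cleaned[:last]) or '0'
    fraction = cleaned[last + 1:] or '0'
    return float(f'{integer}.{fraction}')


def validate_invoice(fields):
    """Normalize OCR strings and list the fields that failed validation."""
    clean = {
        'invoice_id': fields.get('invoice_id', '').strip().upper(),
        'date': normalize_date(fields.get('date', '')),
        'total': normalize_amount(fields.get('total', '')),
    }
    errors = [name for name, value in clean.items() if value in (None, '')]
    if clean['invoice_id'] and not INVOICE_ID.match(clean['invoice_id']):
        errors.append('invoice_id')
    return clean, errors
```

Documents with a non-empty error list can be routed to manual review or to an LLM-based check instead of being written to the target system.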

For documents with fixed structure, template matching + OCR by coordinates is often more reliable than an end-to-end solution.

Face Recognition: Identification and Verification

Face recognition = detection + alignment + embedding + matching. Each stage matters.

Detection: RetinaFace or InsightFace for accurate face localization and keypoints. MTCNN is older but reliable.

Embedding: ArcFace (InsightFace) is state-of-the-art for face recognition embeddings. Models iresnet50/iresnet100 are pretrained on MS1MV3 (about 5M images). The embedding is a 512-dimensional float32 vector, compared by cosine similarity.

Threshold tuning: the decision threshold is a critical parameter. At threshold 0.6, typical FPR on the LFW benchmark is 0.001 and TPR is 0.985. In production, the threshold must be calibrated to the real distribution: people in masks, with changed appearance, in different lighting conditions.

Liveness detection is mandatory: MiniFASNet is a lightweight model that runs on CPU; FaceX-Zoo contains several pretrained liveness detectors.
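
The matching stage reduces to a cosine similarity and a threshold check. The sketch below uses plain Python lists in place of 512-dimensional embeddings. The function names are illustrative, and the 0.6 default simply mirrors the threshold quoted above; it is not a recommended production value.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_same_person(emb_a, emb_b, threshold=0.6):
    """Verification decision: accept the pair if similarity reaches the threshold."""
    return cosine_similarity(emb_a, emb_b) >= threshold


def false_positive_rate(impostor_scores, threshold):
    """Share of different-person pairs that would be wrongly accepted."""
    accepted = sum(1 for s in impostor_scores if s >= threshold)
    return accepted / len(impostor_scores)
```

Calibration then means collecting similarity scores for known different-person pairs from your own cameras and raising the threshold until false_positive_rate falls to the level the application can tolerate.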

Video Analytics

Video is a sequence of frames plus a temporal dimension. A naive approach—detecting on every frame—is expensive.

Tracking: ByteTrack and BoT-SORT are the standard for multi-object tracking. They work on top of any detector, adding persistent IDs to objects across frames—enabling object counting, motion tracking, velocity.

Optimization: not every frame needs processing. For static scenes, detect every 5–10 frames, with tracking in between. For event detection (person entering a zone), background subtraction (OpenCV MOG2) serves as a lightweight pre-filter before neural detection.

Action recognition: SlowFast, VideoMAE for action classification. These are heavy models; for production, use ONNX export + TensorRT or offline processing.
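
The frame-skipping idea above (detect every N frames, track in between) can be sketched as a loop over two callables. Both detect and track are hypothetical placeholders: detect stands for the neural detector, and track for a cheap propagation step such as a tracker's motion prediction. Neither is tied to a specific library here.

```python
def process_stream(frames, detect, track, detect_every=5):
    """Run the expensive detector on every N-th frame and the cheap
    tracker on the frames in between.

    detect(frame) -> list of objects
    track(frame, previous_objects) -> updated list of objects
    """
    results = []
    objects = []
    for index, frame in enumerate(frames):
        if index % detect_every == 0:
            objects = detect(frame)          # full neural detection
        else:
            objects = track(frame, objects)  # lightweight update only
        results.append(objects)
    return results
```

With detect_every=5 the detector runs on 20% of frames. The right interval depends on how fast objects move relative to the frame rate, so it is worth validating against per-frame detection on a sample clip.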

How to Measure Pattern Recognition Model Quality in Production?

Quality monitoring is key to MLOps. We track:

Prediction confidence distribution.
Share of low-confidence predictions (indicator of OOD data).
Drift of input images via feature distribution (embeddings from backbone).

A drop in average confidence from 0.87 to 0.71 over a week is an early signal of distribution shift. NVIDIA Triton Inference Server exposes latency and throughput metrics in Prometheus format; confidence and drift metrics have to be exported by the application alongside them. Our certified engineers set up monitoring and guarantee SLA for inference quality.
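
A rolling confidence monitor for the first two signals above could look like the following sketch. The class name, window size, and alert thresholds are assumptions to be tuned per model. Exporting the values to a monitoring system is left out.

```python
from collections import deque


class ConfidenceMonitor:
    """Tracks recent prediction confidences and flags a likely distribution shift."""

    def __init__(self, baseline_mean, window=1000, max_drop=0.10,
                 low_conf=0.5, max_low_share=0.2):
        self.baseline_mean = baseline_mean  # mean confidence measured at release
        self.max_drop = max_drop            # tolerated drop of the rolling mean
        self.low_conf = low_conf            # below this a prediction counts as "low"
        self.max_low_share = max_low_share  # tolerated share of low predictions
        self.scores = deque(maxlen=window)

    def update(self, confidence):
        self.scores.append(confidence)

    def mean(self):
        return sum(self.scores) / len(self.scores)

    def low_share(self):
        return sum(1 for s in self.scores if s < self.low_conf) / len(self.scores)

    def drifted(self):
        # Wait for a full window so a few early predictions cannot trigger alerts
        if len(self.scores) < self.scores.maxlen:
            return False
        return (self.baseline_mean - self.mean() > self.max_drop
                or self.low_share() > self.max_low_share)
```

With a baseline of 0.87 and max_drop of 0.10, a rolling mean of 0.71 triggers the alert. Confidence alone can stay high on out-of-distribution inputs, which is why embedding-based drift checks are tracked as a separate signal.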

Deployment of CV Models

For online inference, we use Triton Inference Server (NVIDIA)—production standard for serving CV models. Supports TensorRT, ONNX, PyTorch, dynamic batching, multiple instances. REST and gRPC API. We guarantee stable operation under load.

Edge deployment: ONNX Runtime on ARM/x86 CPU. TensorFlow Lite for mobile devices. OpenVINO for Intel CPU/GPU/VPU—gives 2–3× speedup on Intel hardware compared to ONNX Runtime. After deployment, we hand over the model with documentation and train personnel.

What Is Included in the Work

Stage	Content	Estimated Time
Analysis	Technical specification, architecture selection, data evaluation	3–5 days
Labeling	Image collection, annotation (up to 5000 objects)	1–3 weeks
Training	Model fine-tuning, validation on test set	1–2 weeks
Optimization	Export to ONNX/TensorRT/OpenVINO, testing on target hardware	1–2 weeks
Integration	REST/gRPC API, integration with existing infrastructure	1–2 weeks
Deployment	Deployment on server or edge device, load testing	1 week
Documentation and training	Instructions, staff training, handover of code and model	3–5 days
Support	Technical support for 3 months after launch	—

Deadlines and Cost

A prototype detector on existing data takes 1–2 weeks. Production system with optimization for target hardware takes 4–8 weeks. Full cycle including data labeling (1000–5000 images) takes 2–4 months. Cost is calculated individually for each task. Typical savings from implementing a quality control system can be significant per production line.

We have been in the market for over 5 years and completed 60+ computer vision projects. We will evaluate your project end-to-end—request a consultation to get a quote and technical proposal.