AI Integration into Document Management Systems
Electronic document management systems — EDM, ECM, EDMS — store documents but don't understand their content. An employee receives a scanned contract → manually enters requisites → selects document type → assigns an approval route. AI integration automates this entire process: the document enters the system, AI reads it, extracts requisites, classifies it, creates a card, and launches the required route.
AI Layer Architecture for EDM Systems
[Incoming Document]
PDF/scan/DOCX/email
↓
[Document Preprocessor]
OCR (Tesseract/Google Cloud Vision) → normalized text
↓
[AI Processing Pipeline]
├── Classification: document type
├── NER: counterparty, dates, amounts, requisites
├── Summary: brief content summary
└── Routing: approval route determination
↓
[EDM API]
Card creation + workflow launch
Document Classification
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
import torch
class DocumentClassifier:
    """Classifies an incoming document's text into one of the known EDM types."""

    DOCUMENT_TYPES = [
        "contract", "invoice", "delivery_note", "act",
        "order", "memo", "commercial_proposal",
        "power_of_attorney", "charter", "protocol", "incoming_letter"
    ]

    def __init__(self, model_path: str = "cointegrated/rubert-tiny2"):
        # For production — fine-tuned BERT on company document corpus
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_path,
            num_labels=len(self.DOCUMENT_TYPES)
        )
        self.model.eval()

    def classify(self, text: str) -> dict:
        """Return the predicted type with its confidence plus runner-up candidates.

        Only the document head is encoded (char pre-cut, then tokenizer
        truncation to 512 tokens) — the header carries the discriminative wording.
        """
        encoded = self.tokenizer(
            text[:2000],
            return_tensors="pt",
            truncation=True,
            max_length=512,
            padding=True,
        )
        with torch.no_grad():
            scores = torch.softmax(self.model(**encoded).logits, dim=-1)[0]

        best = scores.argmax().item()
        runners_up = []
        for idx in scores.topk(3).indices.tolist():
            if idx == best:
                continue
            runners_up.append(
                {"type": self.DOCUMENT_TYPES[idx], "score": float(scores[idx])}
            )
        return {
            "type": self.DOCUMENT_TYPES[best],
            "confidence": float(scores[best]),
            "alternatives": runners_up,
        }
Requisite Extraction via NER + LLM
For structured documents (invoices, contracts), a combination works better: NER for fast standard field extraction + LLM for complex cases:
from langchain_openai import ChatOpenAI
import re
from datetime import datetime
class DocumentExtractor:
    """Extracts document requisites: a fast regex pass plus an LLM for the rest.

    Regex handles the deterministic fields (INN, total amount) directly from
    the source text; the LLM fills in names, dates and free-form fields.
    """

    EXTRACTION_PROMPT = """Extract document requisites.
Document text:
{text}
Document type: {doc_type}
Extract (return null if not found):
- contractor_name: counterparty name
- contractor_inn: counterparty INN
- contract_number: contract/invoice number
- contract_date: document date (ISO 8601)
- total_amount: amount (number)
- currency: currency (RUB/USD/EUR)
- payment_deadline: payment deadline (if present)
- subject: contract subject (1-2 sentences)
- signatory: signatory from counterparty side
Return JSON."""

    def __init__(self):
        self.llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

    def extract_requisites(self, text: str, doc_type: str) -> dict:
        """Return a merged dict of requisites for `text` of the given `doc_type`."""
        # First quick regex-based extraction
        fast_extract = self._regex_extract(text)
        # LLM for missing fields and validation
        llm_result = self.llm.invoke(
            self.EXTRACTION_PROMPT.format(
                text=text[:3000],
                doc_type=doc_type
            )
        )
        llm_data = self._parse_llm_json(llm_result.content)
        # Merge: regex values win — they are exact matches from the source text
        return {**llm_data, **fast_extract}

    @staticmethod
    def _parse_llm_json(content: str) -> dict:
        """Parse the LLM reply, tolerating the ```json fences models often add.

        FIX: the previous json.loads(llm_result.content) crashed the whole
        pipeline whenever the model wrapped its answer in a markdown code block.
        On unparseable output we degrade to regex-only fields instead of raising.
        """
        import json
        payload = content.strip()
        if payload.startswith("```"):
            payload = payload.strip("`").strip()
            if payload.lower().startswith("json"):
                payload = payload[4:]
        try:
            return json.loads(payload)
        except json.JSONDecodeError:
            return {}  # degrade gracefully; regex fields still get through

    @staticmethod
    def _normalize_amount(raw: str) -> float:
        """Convert '1 000 000,50' or '1,234.56' style strings to a float.

        FIX: blanket replace(',', '.') turned '1,234.56' into '1.234.56'
        and raised ValueError. Decide the comma's role first.
        """
        s = re.sub(r'\s', '', raw)
        if ',' in s and '.' in s:
            s = s.replace(',', '')   # comma is a thousands separator
        else:
            s = s.replace(',', '.')  # comma is a decimal separator
        return float(s)

    def _regex_extract(self, text: str) -> dict:
        """Deterministic extraction of INN and total amount from raw text."""
        result = {}
        # INN: Russian tax id — 10 (company) or 12 (individual person) digits
        inn_match = re.search(r'\bINN[:\s]*(\d{10,12})\b', text)
        if inn_match:
            result["contractor_inn"] = inn_match.group(1)
        # Amounts followed by a currency marker
        amount_match = re.search(
            r'(\d[\d\s,]*\.?\d*)\s*(rub|rubles|RUB|USD|EUR)',
            text, re.IGNORECASE
        )
        if amount_match:
            try:
                result["total_amount"] = self._normalize_amount(amount_match.group(1))
            except ValueError:
                pass  # malformed number — leave the field to the LLM pass
        return result
Integration with Popular EDM Systems
class SEDIntegration:
    """Integration with 1C:Document Management, Directum, DocsVision."""

    REQUEST_TIMEOUT = 30  # seconds; EDM APIs can be slow on large file uploads

    def push_to_directum(self, extracted: dict, original_file: bytes) -> dict:
        """Create a Directum document card, fill its properties, start a workflow.

        Expects `self.directum_url` and `self.token` to be configured.
        Returns {"doc_id": ..., "route": ...}.
        Raises requests.HTTPError on any failed API call (previously a failed
        upload surfaced as an opaque KeyError on the missing "id" field).
        """
        import requests
        headers = {"Authorization": f"Bearer {self.token}"}

        # 1. Upload the original file
        upload_response = requests.post(
            f"{self.directum_url}/api/v1/documents",
            headers=headers,
            files={"file": original_file},
            timeout=self.REQUEST_TIMEOUT,
        )
        upload_response.raise_for_status()
        doc_id = upload_response.json()["id"]

        # 2. Fill in card properties
        card_response = requests.patch(
            f"{self.directum_url}/api/v1/documents/{doc_id}/properties",
            headers=headers,
            json={
                "DocumentType": extracted["type"],
                "Counterparty": extracted.get("contractor_name"),
                "INN": extracted.get("contractor_inn"),
                "Amount": extracted.get("total_amount"),
                "DocumentDate": extracted.get("contract_date"),
                "Subject": extracted.get("subject"),
            },
            timeout=self.REQUEST_TIMEOUT,
        )
        card_response.raise_for_status()

        # 3. Launch approval route
        route = self._determine_route(extracted)
        workflow_response = requests.post(
            f"{self.directum_url}/api/v1/documents/{doc_id}/workflow/{route}",
            headers=headers,
            timeout=self.REQUEST_TIMEOUT,
        )
        workflow_response.raise_for_status()
        return {"doc_id": doc_id, "route": route}

    def _determine_route(self, extracted: dict) -> str:
        """Pick an approval route from document type and amount.

        FIX: `total_amount` may be present but None (the extractor returns
        null when no amount is found); `.get("total_amount", 0)` then raised
        TypeError on `None > 1_000_000`. Treat missing and None as zero.
        """
        amount = extracted.get("total_amount") or 0
        doc_type = extracted.get("type", "")
        if doc_type == "contract":
            if amount > 1_000_000:
                return "contract_large"      # director + lawyer + finance
            if amount > 100_000:
                return "contract_medium"     # manager + lawyer
            return "contract_standard"       # manager only
        if doc_type == "invoice":
            return "invoice_approval"
        return "standard"
Case study: manufacturing company, 500 incoming documents per month. Before implementation: 2 operators spent 40% of their work time on manual requisite entry. After: automatic requisite extraction accuracy of 94% (verified on 1000 documents), 89% of documents processed without operator involvement; operators handle only exceptions (confidence < 0.8) and disputed routing decisions. Processing time per incoming document: 8 minutes → 45 seconds.
Timeline
- Classifier + requisite extractor: 3–4 weeks
- Integration with specific EDM system: 2–3 weeks
- Fine-tuning models on client documents: 1–2 additional weeks







