Developing an AI e-Discovery Legal System
e-Discovery (electronic disclosure) is the process of identifying, collecting, and analyzing electronic documents for litigation or investigations. An AI system processes terabytes of data and surfaces the relevant documents.
e-Discovery Stages (EDRM Framework)
Identification: determine data sources (email servers, file systems, messengers, cloud storage).
Preservation: legal hold, i.e. retaining data unchanged once notice of litigation is received.
Collection: data gathering from sources with chain of custody compliance.
Processing: conversion to single format, deduplication, filtering by date/custodian.
Review: AI-assisted review — document prioritization by relevance.
Production: document delivery to opposing party in required format.
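The chain-of-custody requirement in the Collection stage can be sketched as a tamper-evident event log. This is an illustrative structure, not a standard schema; `CustodyEvent` and the helper names are assumptions:

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class CustodyEvent:
    """One handling event for a piece of collected evidence."""
    document_sha256: str   # hash of the collected file, fixed at collection time
    custodian: str         # who holds the evidence after this event
    action: str            # e.g. "collected", "transferred", "processed"
    timestamp: str
    prev_event_hash: str   # links events into a tamper-evident chain

    def event_hash(self) -> str:
        payload = "|".join([self.document_sha256, self.custodian,
                            self.action, self.timestamp, self.prev_event_hash])
        return hashlib.sha256(payload.encode()).hexdigest()

def append_event(chain: list[CustodyEvent], document_sha256: str,
                 custodian: str, action: str) -> CustodyEvent:
    prev = chain[-1].event_hash() if chain else "GENESIS"
    event = CustodyEvent(document_sha256, custodian, action,
                         datetime.now(timezone.utc).isoformat(), prev)
    chain.append(event)
    return event

def verify_chain(chain: list[CustodyEvent]) -> bool:
    """Recompute the hash links; editing any past event breaks the chain."""
    prev = "GENESIS"
    for event in chain:
        if event.prev_event_hash != prev:
            return False
        prev = event.event_hash()
    return True
```

Hash-chaining the events means any after-the-fact edit to an earlier record invalidates every later link, which is the property chain-of-custody documentation needs.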
Technology-Assisted Review (TAR)
TAR (predictive coding) is the central AI task in e-Discovery: the system trains on a small seed set labeled by attorneys and predicts relevance for the remaining corpus:
from datetime import date
from pydantic import BaseModel

class DocumentRelevance(BaseModel):
    document_id: str
    relevance_score: float     # 0-1
    is_privileged: bool        # attorney-client privilege
    is_responsive: bool        # answers the disclosure request
    key_topics: list[str]
    custodians: list[str]      # participants in the correspondence
    date: date | None

def predict_relevance(
    document: str,
    seed_set: list[tuple[str, bool]],  # (doc, is_relevant) pairs for training
) -> DocumentRelevance:
    # Active learning: select the most informative documents
    # for the next round of attorney annotation.
    ...
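A minimal sketch of that active-learning loop, assuming scikit-learn with TF-IDF features and uncertainty sampling (real TAR platforms use richer models and review protocols); `rank_for_review` is a hypothetical helper name:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def rank_for_review(
    seed_set: list[tuple[str, bool]],   # attorney-labeled (doc, is_relevant)
    unreviewed: list[str],              # remaining corpus
    batch_size: int = 2,
) -> list[int]:
    """Return indices of unreviewed docs whose predicted relevance is most
    uncertain (probability closest to 0.5): the next annotation batch."""
    texts = [doc for doc, _ in seed_set]
    labels = [int(lbl) for _, lbl in seed_set]
    vec = TfidfVectorizer().fit(texts + unreviewed)
    model = LogisticRegression().fit(vec.transform(texts), labels)
    probs = model.predict_proba(vec.transform(unreviewed))[:, 1]
    uncertainty = np.abs(probs - 0.5)   # 0 = maximally uncertain
    return [int(i) for i in np.argsort(uncertainty)[:batch_size]]
```

Each round, attorneys label the returned batch, the labels join the seed set, and the model is retrained until the recall target is met.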
Privileged Document Detection
Attorney-client privileged documents are exempt from disclosure. The AI flags:
- Communications with external counsel (by email domain)
- Requests for legal advice
- Documents marked Confidential/Privileged
- Attorney work product
A false negative is critical here: producing a privileged document to the opposing party is a serious violation and can waive the privilege.
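A rule-based first pass along these lines can feed the privilege review queue. The domain list and markers below are made-up placeholders, and the rules are deliberately over-inclusive, since the costly error is the false negative:

```python
import re

# Placeholder rules; real matters use counsel-specific domain lists.
COUNSEL_DOMAINS = {"lawfirm-example.com"}
PRIVILEGE_MARKERS = re.compile(
    r"\b(privileged|attorney[- ]client|work product|legal advice)\b", re.I)

def flag_potentially_privileged(sender: str, recipients: list[str],
                                body: str) -> bool:
    """Over-inclusive screen: anything flagged goes to attorney review."""
    participants = [sender] + recipients
    # External counsel on the thread is a strong privilege signal.
    if any(p.rsplit("@", 1)[-1].lower() in COUNSEL_DOMAINS for p in participants):
        return True
    # Otherwise fall back to textual privilege markers.
    return bool(PRIVILEGE_MARKERS.search(body))
```

Flagged documents are then routed to manual privilege review rather than excluded automatically.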
Data and Formats
Typical sources: Outlook/Exchange (PST), Gmail (mbox), Slack/Teams (JSON API), SharePoint (CSOM), file servers. Conversion to a single format: Relativity RSMF or a custom pipeline via Apache Tika.
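The deduplication step in Processing usually starts with exact content hashing (near-duplicate detection and email threading are layered on top). A minimal sketch:

```python
import hashlib

def dedupe(docs: list[tuple[str, bytes]]) -> list[tuple[str, bytes]]:
    """Exact deduplication by content hash: keep the first copy of each
    distinct byte stream (e.g. an attachment copied into many mailboxes)."""
    seen: set[str] = set()
    unique: list[tuple[str, bytes]] = []
    for doc_id, content in docs:
        digest = hashlib.sha256(content).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((doc_id, content))
    return unique
```

Dropping exact duplicates early is cheap and routinely shrinks enterprise collections substantially before the expensive review stage.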
Scale: enterprise e-Discovery routinely involves millions of documents. An approximate nearest-neighbor (ANN) index such as FAISS searches millions of vectors in under 100 ms.
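For illustration, a brute-force cosine-similarity search using the same normalize-then-inner-product convention as FAISS's `IndexFlatIP`; at corpus scale the matrix multiply below is replaced by a FAISS ANN index (e.g. IVF or HNSW variants):

```python
import numpy as np

def build_index(embeddings: np.ndarray) -> np.ndarray:
    """L2-normalize rows so inner product equals cosine similarity."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.clip(norms, 1e-12, None)

def search(index: np.ndarray, query: np.ndarray, k: int = 5) -> list[int]:
    """Return indices of the k most similar documents to the query vector."""
    q = query / max(np.linalg.norm(query), 1e-12)
    scores = index @ q                  # cosine similarity against all docs
    return [int(i) for i in np.argsort(-scores)[:k]]
```

The brute-force version is exact but O(n) per query; the ANN index trades a small recall loss for sublinear search time.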