How do you determine the optimal number of classes in the taxonomy?

The optimal number depends on business logic and data volume. We recommend a two-level hierarchy: the first level with 5–15 categories for high recall, the second for precise routing. Class boundaries should be clear to minimize conflicts.

Which classification method is better: BERT or an LLM?

For large labeled datasets (500+ examples per class), BERT fine-tuning achieves 90–95% accuracy and is the most accurate option. Zero-shot classification with an LLM (GPT-4o-mini) suits new categories without training data, but it is more expensive and slower. We often use a hybrid: BERT for the main classes, an LLM for rare ones.

What is data drift, and how do you track it?

Data drift is the change in topic distribution over time (due to promotions, seasons, incidents). We track it using a chi-square test: compare the current distribution with historical data and send an alert on significant deviation. Then we reassess the model and retrain if needed.

How do you handle inquiries containing multiple topics?

We use multilabel classification with sigmoid activation and a threshold of 0.5. Alternatively, we split the text into sentences and classify each separately. We distinguish primary and secondary topics for routing prioritization.

How much labeled data is needed for training?

For BERT fine-tuning, we recommend at least 500 examples per class, though with pretrained models (rubert) you can start with 300. Labeling quality is critical — we always perform quality control, manually checking 20% of the data.

Automated Customer Inquiry Classification with BERT

We design and deploy artificial intelligence systems: from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering and MLOps to make AI work not in the lab, but in real business.

You open your email on Monday morning and find 400 inquiries, each requiring manual sorting. Operators spend an average of 5 minutes per analysis, and routing errors delay responses by hours. Manual classification doesn't scale: as the business grows, the number of inquiries doubles, but the support team doesn't. We implement an ML classifier that determines the topic and routes the inquiry to the right specialist in seconds.

A typical scenario: a company with 50,000 inquiries per month spends 4,000 hours per year on manual sorting. Automation saves 3,200 of those hours (80%) and $50,000 annually, or about $15.63 per hour saved. Our approach includes taxonomy development, model training, and drift monitoring: a turnkey solution within 3–10 business days.

Automated Inquiry Classification: Taxonomy and Model

The first and most common mistake is an improper class hierarchy. With too few categories (e.g., 3), all non-standard requests fall into "Other." With too many (500+), the model cannot learn and accuracy drops below 70%. Fuzzy class boundaries confuse both the model and operators.

A two-level hierarchy has proven effective: the first level has 5–15 broad categories (technical issues, financial matters, contracts), and the second level has subcategories for precise routing. For example:

Technical Issues
    ├── Connection Problem
    ├── Slow Speed
    └── Account Errors
Financial Matters
    ├── Payments and Tariffs
    ├── Refunds
    └── Debt

According to Yandex research, a two-level taxonomy reduces classification error by 30% compared to a flat structure. [Yandex Research, 2023] We always start a project by auditing current inquiries and agreeing on the taxonomy with the business customer.
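
A minimal sketch of how the two-level hierarchy from the example above could be represented for routing. The `route` helper, the queue naming, and the fallback to the first-level queue are our illustrative assumptions here, not a description of a production router.

```python
from typing import Optional

# Two-level taxonomy from the example above: first level -> subcategories.
TAXONOMY = {
    "Technical Issues": ["Connection Problem", "Slow Speed", "Account Errors"],
    "Financial Matters": ["Payments and Tariffs", "Refunds", "Debt"],
}


def route(category: str, subcategory: Optional[str] = None) -> str:
    """Return a queue name for a predicted (category, subcategory) pair.

    Unknown categories go to "Other"; an unknown or missing subcategory
    falls back to the first-level queue (an assumed policy).
    """
    if category not in TAXONOMY:
        return "Other"
    if subcategory in TAXONOMY[category]:
        return f"{category} / {subcategory}"
    return category


print(route("Technical Issues", "Slow Speed"))
print(route("Financial Matters", "Unknown"))
print(route("Legal"))
```

Falling back to the broad category keeps an inquiry with an uncertain subcategory in front of the right team instead of sending it to "Other".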

Why BERT Fine-Tuning Reaches 90–95% Accuracy While TF-IDF Reaches 82–88%

Method	Accuracy	Data Requirements	Inference Speed	Implementation Cost
TF-IDF + Logistic Regression	82–88%	200 examples/class	<1 ms	Low
BERT fine-tuning (rubert)	90–95%	500+ examples/class	5–10 ms	Medium
LLM zero-shot (GPT-4o-mini)	85–92%	0 examples	200–500 ms	High

TF-IDF suits quick prototypes: it trains in minutes and is interpretable. BERT fine-tuning is our primary method: given quality labeling, it adds roughly 8–12 percentage points of accuracy over TF-IDF, but it requires more data. We use an LLM for new categories where historical data is absent: no fine-tuning, just a prompt with class descriptions.
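
To illustrate the "prompt with class descriptions" idea, here is a rough sketch that only builds the prompt and validates the answer. The class descriptions, the prompt wording, and the helper names are hypothetical, and the actual LLM call is deliberately left out.

```python
from typing import Dict

# Hypothetical class descriptions; in a real project they come from the taxonomy.
CLASS_DESCRIPTIONS = {
    "Connection Problem": "The customer cannot connect or the service is down.",
    "Refunds": "The customer asks for money back.",
    "Tariff Change": "The customer wants to switch to another tariff.",
}


def build_prompt(text: str, classes: Dict[str, str]) -> str:
    """Compose a zero-shot prompt listing each class with its description."""
    lines = [f"- {name}: {desc}" for name, desc in classes.items()]
    return (
        "Classify the customer inquiry into exactly one of these classes.\n"
        + "\n".join(lines)
        + "\nAnswer with the class name only.\n\n"
        + f"Inquiry: {text}\nClass:"
    )


def parse_label(answer: str, classes: Dict[str, str]) -> str:
    """Map the raw model answer to a known class, else fall back to 'Other'."""
    cleaned = answer.strip().strip(".").lower()
    for name in classes:
        if name.lower() == cleaned:
            return name
    return "Other"


prompt = build_prompt("I want my money back", CLASS_DESCRIPTIONS)
print(prompt)
print(parse_label(" refunds. ", CLASS_DESCRIPTIONS))
```

Validating the answer against the known class list matters because an LLM can return a label that is not in the taxonomy.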

How to Handle Inquiries with Multiple Topics?

"My connection is not working, and I want to change my tariff" — two classes simultaneously. We apply three strategies:

Multilabel classification: sigmoid + threshold 0.5 — the model outputs all applicable labels.
Sentence splitting: each sentence is classified separately, results aggregated.
Primary + Secondary: we select the main topic (e.g., "connection problem") and a secondary one ("tariff change").

In one of our projects, 30% of inquiries contained multiple topics. Using multilabel classification with the threshold lowered to 0.4, we improved routing accuracy by 22%.
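
The multilabel and primary/secondary strategies above can be sketched in a few lines. The logits below are made-up numbers for the example inquiry, and `predict_topics` is a hypothetical helper; a real model would produce the logits.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw logit to an independent per-label probability."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_topics(logits, threshold=0.5):
    """Return (primary, secondary) labels.

    Every label whose sigmoid probability reaches the threshold is kept;
    the most probable one is primary, the rest are secondary.
    """
    probs = {label: sigmoid(z) for label, z in logits.items()}
    selected = sorted(
        (label for label, p in probs.items() if p >= threshold),
        key=lambda label: probs[label],
        reverse=True,
    )
    primary = selected[0] if selected else None
    return primary, selected[1:]


# Illustrative logits for "My connection is not working, and I want to change my tariff".
logits = {"Connection Problem": 2.1, "Tariff Change": -0.2, "Refunds": -3.0}

print(predict_topics(logits, threshold=0.5))
print(predict_topics(logits, threshold=0.4))
```

With these numbers the tariff topic has a probability of about 0.45, so it is missed at 0.5 and picked up as a secondary topic at 0.4. This is why the threshold is worth tuning on validation data.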

What to Do About Data Drift?

Topic distribution changes: promotions increase the share of financial inquiries, seasonal incidents shift technical ones. For example, after a large promotion launch, the share of financial inquiries grew from 20% to 45% in a week — our monitoring detected the drift and automatically triggered model retraining.

We configure monitoring with a chi-square test: compare the rolling distribution over a week with historical data. On significant deviation (p < 0.05), an alert is sent, and we reassess the model — adding new classes or fine-tuning the existing one.
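
A simplified sketch of the chi-square check, written with the standard library only. The historical shares and weekly counts are invented for illustration, and the hard-coded critical values at alpha = 0.05 stand in for a p-value computation; in practice a routine such as `scipy.stats.chisquare` would return the p-value directly.

```python
# Critical values of the chi-square distribution at alpha = 0.05,
# keyed by degrees of freedom (number of topics minus one).
CRITICAL_005 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def chi_square_statistic(observed_counts, expected_shares):
    """Pearson chi-square statistic of observed counts vs historical shares."""
    total = sum(observed_counts.values())
    stat = 0.0
    for topic, share in expected_shares.items():
        expected = share * total
        observed = observed_counts.get(topic, 0)
        stat += (observed - expected) ** 2 / expected
    return stat


def drift_detected(observed_counts, expected_shares):
    """True when the deviation is significant at p < 0.05."""
    df = len(expected_shares) - 1
    return chi_square_statistic(observed_counts, expected_shares) > CRITICAL_005[df]


# Hypothetical historical distribution and two weeks of 1,000 inquiries each.
historical = {"Technical Issues": 0.35, "Financial Matters": 0.20,
              "Contracts": 0.40, "Other": 0.05}
normal_week = {"Technical Issues": 340, "Financial Matters": 210,
               "Contracts": 400, "Other": 50}
promo_week = {"Technical Issues": 250, "Financial Matters": 450,
              "Contracts": 270, "Other": 30}

print(drift_detected(normal_week, historical))
print(drift_detected(promo_week, historical))
```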

Metric	Normal Range	Alert Threshold
Share of "Technical Issues"	30–35%	>40% or <25%
Share of "Other"	<5%	>10%
Model Accuracy	>90%	<85%
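
The thresholds in the table translate into a simple rule check. This is a sketch: the function name and message texts are hypothetical, and shares and accuracy are assumed to be passed as fractions.

```python
def check_alerts(technical_share, other_share, accuracy):
    """Return alert messages for metrics outside the thresholds in the table."""
    alerts = []
    if technical_share > 0.40 or technical_share < 0.25:
        alerts.append("Share of 'Technical Issues' outside 25-40%")
    if other_share > 0.10:
        alerts.append("Share of 'Other' above 10%")
    if accuracy < 0.85:
        alerts.append("Model accuracy below 85%")
    return alerts


print(check_alerts(technical_share=0.33, other_share=0.04, accuracy=0.92))
print(check_alerts(technical_share=0.45, other_share=0.12, accuracy=0.80))
```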

What's Included

Our turnkey solution includes:

Taxonomy Development – audit 500+ inquiries, build class hierarchy.
Data Collection & Quality Control – label 200–500 examples per class, manually verify 20%.
Model Development – baseline (TF-IDF), fine-tuning (BERT), optionally LLM for rare topics.
Testing – evaluate accuracy, precision, recall, p99 latency on a held-out set.
Deployment – REST API on FastAPI, Docker containerization, Prometheus monitoring.
Documentation & Training – describe taxonomy, routing, operator instructions.
Warranty – model support for 6 months, updates on drift.

Work Process

Taxonomy Analysis: audit 500+ inquiries, build class hierarchy.
Data Collection & Quality Control: label 200–500 examples per class, manually verify 20%.
Model Development: baseline (TF-IDF), fine-tuning (BERT), optionally LLM for rare topics.
Testing: evaluate accuracy, precision, recall, p99 latency on a held-out set.
Deployment: REST API on FastAPI, Docker containerization, Prometheus monitoring.
Documentation: describe taxonomy, routing, operator instructions.
Warranty: model support for 6 months, updates on drift.

Timelines and Pricing

Timelines: from 3 to 10 business days depending on taxonomy complexity and data volume. Pricing starts from $5,000 for a basic solution. Contact us for a project assessment. Order classifier implementation and get a free engineer consultation.

Our team has over 5 years of experience in NLP and over 50 successful projects in automating inquiry processing. We guarantee quality at every stage — from labeling to monitoring.

This classifier cuts processing time by 80% — saving thousands of operator hours per year. Contact us to discuss your project.

NLP Development: Text Classification, NER, Embeddings, and Information Extraction

A task we often receive: process 50,000 support tickets that are currently handled manually. The dataset has 3,000 labeled examples in 12 categories and is imbalanced: one category makes up 40% of the sample, and three make up 1-2% each. Baseline accuracy is 78%. That sounds decent until you look at recall for the rare classes: 0.31, 0.44, 0.28. These classes, complaints and churn threats, are the most important to the business.

This is a typical NLP development project. The problem is not the algorithm but that accuracy is the wrong metric. Our experience across 30+ projects shows: we start by analyzing business metrics and only then choose the model.

Why is accuracy the wrong metric for rare classes?

Accuracy ignores imbalance. If the "churn" class appears in 2% of cases, the model can predict "all good" every time and get 98% accuracy, but the business loses clients. Solution: macro F1 (the unweighted average of per-class F1) or weighted F1. For NER, use strict entity-level F1 (exact matches only). Once the correct metric is chosen, model quality becomes measurable and predictable.
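
The 98% example can be checked directly. The sketch below uses a synthetic set of 100 labels (98 "ok", 2 "churn") and a model that always predicts "ok"; the helper names are ours.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 for one class; defined as 0.0 when there are no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)


y_true = ["ok"] * 98 + ["churn"] * 2
y_pred = ["ok"] * 100  # the model never predicts churn

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.2f}")                  # accuracy = 0.98
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")  # macro F1 = 0.495
```

The same predictions that look excellent by accuracy score below 0.5 on macro F1, because the churn class contributes an F1 of zero.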

Text Classification: From BERT to Distillation

BERT-like models are the standard for classification. For Russian, we use ruBERT-base or ruBERT-large from DeepPavlov. multilingual-e5-large handles multiple languages in one pipeline, and XLM-RoBERTa-large is a strong multilingual backbone.

Fine-tuning for classification: add a classification head on top of the [CLS] token and train for 3-5 epochs with lr=2e-5 and weight decay=0.01. For imbalance, use weighted CrossEntropyLoss or focal loss with gamma=2.0. Contact us and we will show a code snippet.

Imbalance case study. The dataset has 3,000 examples with a 1:20 imbalance. Solution: class_weight via sklearn plus CrossEntropyLoss, with additional augmentation of rare classes via backtranslation (ru→en→ru through MarianMT). Recall for rare classes rose from 0.31 to 0.67 with a slight drop in accuracy (76%→74%). The full end-to-end NLP development took 3 weeks.
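
To make the two imbalance remedies concrete, here is a standard-library sketch. `balanced_class_weights` reproduces the `n_samples / (n_classes * count)` heuristic that sklearn uses for `class_weight="balanced"`, and `focal_loss` is the per-example focal loss. The function names and the toy 1:20 label list are illustrative; in training code the weights would typically be passed to `torch.nn.CrossEntropyLoss(weight=...)`.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def focal_loss(p_true, gamma=2.0, weight=1.0):
    """Focal loss for one example, given the predicted probability of its true class.

    With gamma=0 this reduces to (weighted) cross-entropy; larger gamma
    down-weights examples the model already classifies confidently.
    """
    return -weight * (1.0 - p_true) ** gamma * math.log(p_true)


labels = ["common"] * 20 + ["rare"] * 1  # toy 1:20 imbalance
weights = balanced_class_weights(labels)
print(weights)

easy, hard = 0.9, 0.1  # predicted probability of the true class
print(focal_loss(easy, gamma=0.0), focal_loss(easy, gamma=2.0))
print(focal_loss(hard, gamma=0.0), focal_loss(hard, gamma=2.0))
```

Class weights rebalance the loss by class frequency, while focal loss rebalances by example difficulty; the two can be combined through the `weight` argument.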

Distillation for production. BERT-large gives F1 0.89, but CPU inference takes 180 ms. Distillation into DistilBERT reduces latency to 25 ms with F1 0.84; ruBERT-tiny2 goes further, to 12 ms with F1 0.81. Export to ONNX Runtime provides an additional 1.5-2x speedup. DistilBERT achieves about 7x lower latency than BERT-large with a 0.05 drop in macro F1, a typical production trade-off.

Model	F1 macro	Latency (CPU)	Size
BERT-large	0.89	180 ms	1.3 GB
DistilBERT	0.84	25 ms	250 MB
ruBERT-tiny2	0.81	12 ms	120 MB
DistilBERT + ONNX	0.84	14 ms	150 MB
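
One way to read the table is as a selection problem: pick the most accurate model that fits a latency budget. The sketch below uses the figures from the table; the `pick_model` helper and the tie-breaking rule (prefer lower latency) are our assumptions.

```python
# Figures copied from the table above.
MODELS = [
    {"name": "BERT-large", "f1": 0.89, "latency_ms": 180},
    {"name": "DistilBERT", "f1": 0.84, "latency_ms": 25},
    {"name": "ruBERT-tiny2", "f1": 0.81, "latency_ms": 12},
    {"name": "DistilBERT + ONNX", "f1": 0.84, "latency_ms": 14},
]


def pick_model(budget_ms):
    """Return the name of the highest-F1 model within the CPU latency budget.

    Ties on F1 are broken in favor of lower latency; None if nothing fits.
    """
    candidates = [m for m in MODELS if m["latency_ms"] <= budget_ms]
    if not candidates:
        return None
    best = max(candidates, key=lambda m: (m["f1"], -m["latency_ms"]))
    return best["name"]


for budget in (10, 20, 30, 200):
    print(budget, "ms ->", pick_model(budget))
```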

How do you choose between BERT and an LLM for your task?

For most classification and extraction tasks, BERT-sized models offer the best trade-off between cost and performance. Shift to LLMs only when the task demands generation, complex reasoning, or zero-shot generalization.

NER: Named Entity Recognition

NER is the extraction of persons, organizations, locations, dates, amounts, and document numbers. For general categories (PER, ORG, LOC), pre-trained models work well. For specialized ones (medical terms, legal concepts), fine-tuning is needed.

Data annotation. This is the main cost of an NER project. A quality model needs 500-2,000 labeled sentences per entity type. Tools: Label Studio (open source) or Prodigy (by the spaCy creators). The IOB2 format is standard.

Architecture. Token classification on top of BERT: each token gets a label (B-PER, I-PER, O). spaCy 3.x with a transformer pipeline is a convenient production choice.
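
Since the model outputs one IOB2 tag per token, the tags have to be grouped back into entity spans. A minimal decoder might look like this; the sentence and tags are a hand-made example, and dropping an `I-` tag that does not continue an open span is one possible policy.

```python
def decode_iob2(tokens, tags):
    """Group IOB2-tagged tokens into (entity_type, text) spans."""
    entities = []
    current_type, current_tokens = None, []

    def flush():
        # Close the currently open span, if any.
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span.
            flush()
            current_type, current_tokens = None, []
    flush()
    return entities


tokens = ["Ivan", "Petrov", "works", "at", "Yandex", "in", "Moscow"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC"]
print(decode_iob2(tokens, tags))
# [('PER', 'Ivan Petrov'), ('ORG', 'Yandex'), ('LOC', 'Moscow')]
```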

Nested entities. Standard IOB models cannot handle nested entities (an organization inside an address). For such tasks, use span-based NER: SpanBERT or SpERT. It is more complex but correct.

Post-processing is mandatory. The model predicts token labels, but downstream systems need normalized entities: dates via dateparser, amounts via regex plus validation, names deduplicated via rapidfuzz. This is included in our standard delivery.
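
A rough illustration of two of these post-processing steps using only the standard library. Note the substitutions: `difflib.SequenceMatcher` stands in for rapidfuzz, the amount pattern is a simplified assumption about number formats, and the 0.85 similarity threshold is arbitrary.

```python
import re
from difflib import SequenceMatcher

# Either digit groups with space/comma thousands separators, or a plain number.
AMOUNT_RE = re.compile(r"\d{1,3}(?:[ ,]\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")


def parse_amount(text):
    """Extract the first amount in the text as a float, or None."""
    match = AMOUNT_RE.search(text)
    if not match:
        return None
    return float(match.group().replace(" ", "").replace(",", ""))


def dedupe_names(names, threshold=0.85):
    """Keep the first spelling of each name; drop near-duplicates."""
    unique = []
    for name in names:
        is_duplicate = any(
            SequenceMatcher(None, name.lower(), kept.lower()).ratio() >= threshold
            for kept in unique
        )
        if not is_duplicate:
            unique.append(name)
    return unique


print(parse_amount("Total: 1,250.50 USD"))
print(parse_amount("Debt of 12500 rub"))
print(dedupe_names(["Ivan Petrov", "ivan petrov", "Ivan Petrow", "Anna Sidorova"]))
```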

Sentiment Analysis and Opinion Mining

Binary positive/negative classification works out of the box with BERT. The complexity lies in aspect-based sentiment analysis (ABSA): "the restaurant has good food but terrible service." ABSA requires aspect extraction (NER) plus sentiment per aspect. Joint BERT-for-ABSA models exist, but quality on Russian data is lower due to dataset scarcity. The main resources are RuSentiment and SentiRuEval.

For production with simple positive/negative/neutral labels, distilled models are enough. With three classes, a balanced dataset, and 2,000+ examples, expect F1 macro 0.82-0.87 in 1-2 days.

Text Summarization

Extractive summarization (selecting sentences): TextRank or BM25, with no training required. It is fast, produces no hallucinations, and works well for long documents.

Abstractive summarization (generating new text): seq2seq models such as mT5, mBART, FRED-T5, and ruT5-large. In production, an LLM API (GPT-4, Claude) is often the best cost/quality/speed trade-off.

Embeddings: Vector Representations of Text

Embeddings are the foundation of semantic search, deduplication, clustering, RAG. Quality critically affects downstream tasks.

Models. E5-large-v2, BGE-M3, multilingual-e5-large — strong multilingual embedders. sentence-transformers/paraphrase-multilingual-mpnet-base-v2 — fast option. For Russian: ru-en-RoSBERTa (Skoltech) performs well on semantic textual similarity.

Embedding quality evaluation uses the MTEB benchmark as standard. But top results on MTEB don't guarantee success on a domain dataset — we build domain-specific eval.

Fine-tuning embeddings. If standard models don't give the required Recall@k — contrastive learning on domain pairs with MultipleNegativesRankingLoss. How to perform this for domain data:

Collect 500–2,000 semantically similar pairs from your domain.
Apply MultipleNegativesRankingLoss with a batch size of 32–64.
Train for 1–3 epochs using AdamW (lr=2e-5).
Evaluate Recall@k on a held-out domain test set.

This approach yields a 5–15% improvement in Recall@k in practice.
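
The steps above can be illustrated without a training framework. The sketch below computes an in-batch-negatives ranking loss over toy 2-dimensional vectors, which is the idea behind MultipleNegativesRankingLoss, together with Recall@k. It is not the sentence-transformers implementation: the function names are ours and the scale factor of 20 is an assumption.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where anchor i must rank positive i above
    all other positives in the batch (they act as negatives)."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine(anchor, p) for p in positives]
        top = max(scores)
        log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_sum_exp - scores[i]
    return total / len(anchors)


def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of relevant documents that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_ranking_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_ranking_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
print(recall_at_k(["d3", "d1", "d7"], ["d1", "d2"], k=2))
```

The loss is near zero when each anchor is closest to its own positive and large when the pairs are swapped. Larger batches supply more negatives per anchor, which is why the batch size matters for this loss.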

Dimensionality and storage. E5-large produces 1024-dimensional float32 vectors, about 4 KB per vector. For 10M documents that is about 40 GB. int8 quantization reduces this to about 10 GB. FAISS IVF_PQ is more compact but lossy. This is included in our deployment recommendations.
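
The storage figures follow from simple arithmetic, shown here for raw vectors only (index overhead is ignored, and 1 GB is taken as 10^9 bytes).

```python
def index_size_gb(n_vectors, dim, bytes_per_value):
    """Raw vector storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_vectors * dim * bytes_per_value / 1e9


bytes_per_vector = 1024 * 4                        # float32 = 4 bytes per value
float32_gb = index_size_gb(10_000_000, 1024, 4)    # 10M documents
int8_gb = index_size_gb(10_000_000, 1024, 1)       # int8 = 1 byte per value

print(bytes_per_vector)        # 4096
print(round(float32_gb, 2))    # 40.96
print(round(int8_gb, 2))       # 10.24
```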

Information Extraction

Structured extraction is a frequent task. Examples: key contract terms, technical characteristics, dates and amounts from invoices.

Regex + rule-based. For INN, OGRN, amounts, and dates, this is more reliable than neural networks. No data required.
NER + post-processing. For variable formats.
LLM with structured output. GPT‑4 / Claude with a JSON schema, for complex documents. The per-document cost is small, but at 10k+ documents/day we calculate the economics first.

Our default is a hybrid: regex/NER for typical fields plus an LLM for edge cases. This choice is backed by years of production experience and more than 30 projects.
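
A simplified sketch of the hybrid idea: rules first, and a fallback only for fields the rules could not fill. The patterns check length only (10 or 12 digits for INN, 13 for OGRN) and skip checksum validation; `llm_fallback` is a hypothetical callable standing in for an LLM call with structured output, and the numbers in the sample text are dummy values.

```python
import re

# Format-only patterns: no checksum validation in this sketch.
PATTERNS = {
    "inn": re.compile(r"\b(?:\d{12}|\d{10})\b"),
    "ogrn": re.compile(r"\b\d{13}\b"),
}


def extract_fields(text, llm_fallback=None):
    """Fill fields with regex; ask the fallback only for missing ones."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group() if match else None

    missing = [name for name, value in fields.items() if value is None]
    if missing and llm_fallback is not None:
        fields.update(llm_fallback(text, missing))
    return fields


def fake_llm(text, missing):
    """Stub for the edge-case path; a real system would call an LLM here."""
    return {name: "NEEDS_REVIEW" for name in missing}


print(extract_fields("Supplier INN 1234567890, OGRN 1234567890123."))
print(extract_fields("Supplier INN 1234567890, registration number unclear.", fake_llm))
```

Because the fallback runs only when a field is missing, its cost scales with the share of edge cases rather than with total document volume.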

Work Stages

Stage	Duration	What's included
Data and metric analysis	3-5 days	Class distribution, text lengths, baseline
Baseline (TF‑IDF + LogReg)	1 day	Quick estimate of gap with deep models
Training and validation	1-2 weeks	k‑fold, early stopping, error analysis
Deployment (ONNX + FastAPI)	1-2 weeks	REST API, batching, monitoring
Documentation and training	2-3 days	Model card, API docs, team training

Prototype on existing data — 1-3 weeks. Production system with CI/CD — 1.5–2.5 months. Cost is calculated individually — get a consultation for a project estimate.

What's Included

Model and pipeline architecture documentation
Access to the model via REST API (FastAPI + ONNX)
Client team training (2-hour webinar + Q&A)
Accuracy guarantee on the agreed test set
Months of post-delivery support (bug fixes, adaptation to new data)

Our Experience

Years of NLP projects from classification to RAG systems. The team includes ML engineers experienced with Hugging Face, spaCy, LangChain, MLOps. We use vLLM, Kubeflow, Weights & Biases — a production stack, not toys. Contact us to evaluate your NLP project within two days — request a free consultation on your text processing pipeline.