Text Classification Implementation
Text classification is the task of assigning one or more labels to a text: email topic, article category, request type, document language. Beneath the apparent simplicity lie many technical decisions that fundamentally affect quality.
Problem Statement and Approach Selection
Before choosing architecture, define task parameters:
- Number of classes: 2–5 (binary/simple multiclass) vs 20–100+ (hierarchical)
- Annotation volume: do you have 500+ examples per class?
- Language: English, Russian, multilingual
- Latency requirements: real-time (<100ms) vs batch
- Interpretability: must you be able to explain each decision?
These parameters determine the stack. A common mistake is jumping straight to BERT when logistic regression solves the task in 50 ms.
Hierarchy of Approaches
Level 1 — Classic ML (TF-IDF / BOW + Logistic Regression / SVM / LightGBM):
- When sufficient: clear topics, lots of annotation, need interpretability, latency < 10ms
- scikit-learn Pipeline: TfidfVectorizer → LogisticRegression
- Accuracy on typical tasks: 85–92%
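The Level 1 stack fits in a few lines of scikit-learn. A minimal sketch (the texts, labels, and class names below are toy placeholders, not from a real dataset):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data for illustration; a real task needs 500+ examples per class
texts = ["refund my payment", "card was charged twice",
         "how do I reset my password", "cannot log into account"]
labels = ["billing", "billing", "account", "account"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),
    ("lr", LogisticRegression(max_iter=1000)),
])
clf.fit(texts, labels)
print(clf.predict(["I was charged twice"]))
```

The Pipeline wrapper matters in production: the vectorizer vocabulary and the classifier are serialized and versioned together, so train-time and inference-time featurization cannot drift apart.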
Level 2 — FastText:
- When: need quick training on large volume, multilingual task
- Training 100K examples: < 30 seconds
- Inference: ~1ms per text
- Quality close to BERT for pure topic classifiers
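fastText trains from a plain-text file with one example per line, the label prefixed by `__label__`. A sketch of preparing that format (the helper name and example data are illustrative; the commented training call assumes the `fasttext` package):

```python
# fastText expects one example per line: "__label__<class> <text>"
def to_fasttext_line(label: str, text: str) -> str:
    # lowercase and strip newlines so each example stays on one line
    return f"__label__{label} {text.strip().lower()}"

examples = [("billing", "Card was charged twice"),
            ("account", "Cannot log into my account")]

with open("train.txt", "w", encoding="utf-8") as f:
    for label, text in examples:
        f.write(to_fasttext_line(label, text) + "\n")

# Training would then be (requires the fasttext package):
# import fasttext
# model = fasttext.train_supervised(input="train.txt", epoch=25, wordNgrams=2)
```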
Level 3 — Transformer Fine-tuning:
- BERT / RoBERTa / DeBERTa for English
- ruBERT / ruRoBERTa for Russian
- When: complex classes, little annotation (few-shot), need high precision
- Training: 2–10 epochs on GPU, 15–60 minutes for typical dataset
Level 4 — LLM with Prompting:
- Zero-shot / few-shot via GPT-4o-mini or Claude
- When: no annotation, need quick start, or classes are descriptive
- Drawbacks: latency 500ms–2s, cost at scale
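For Level 4 the engineering work is mostly prompt construction and answer normalization. A hedged sketch (class list, function names, and the commented client call are illustrative, not a specific provider's API):

```python
# Zero-shot classification prompt: list the allowed classes explicitly
# and constrain the model to answer with exactly one of them.
CLASSES = ["billing", "account", "shipping", "other"]

def build_prompt(text: str) -> str:
    return (
        "Classify the customer request into exactly one category.\n"
        f"Categories: {', '.join(CLASSES)}\n"
        f"Request: {text}\n"
        "Answer with the category name only."
    )

def parse_answer(raw: str) -> str:
    # Models occasionally add casing or punctuation; normalize before matching
    cleaned = raw.strip().lower().rstrip(".")
    return cleaned if cleaned in CLASSES else "other"

# The prompt would be sent via your LLM client of choice, e.g.:
# reply = client.chat.completions.create(model="gpt-4o-mini", messages=[...])
```

The fallback to "other" in `parse_answer` is deliberate: without it, a single off-format reply breaks downstream aggregation.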
BERT Fine-tuning with PyTorch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer

model_name = "DeepPavlov/rubert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(label2id),  # label2id: dict mapping class name -> index
)

def tokenize(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
        padding="max_length",
    )

training_args = TrainingArguments(
    output_dir="./classifier",
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",        # must match evaluation_strategy
    load_best_model_at_end=True,  # requires save and eval strategies to match
)
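The arguments above would typically be paired with a metrics callback so the Trainer selects the best checkpoint on something more informative than loss. A sketch (the dataset names in the commented wiring are placeholders):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair supplied by the HF Trainer
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1_macro": f1_score(labels, preds, average="macro"),
    }

# Wiring into the Trainer (train_ds / eval_ds are tokenized datasets):
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_ds, eval_dataset=eval_ds,
#                   compute_metrics=compute_metrics)
# trainer.train()
```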
Handling Imbalanced Classes
Real data is almost always imbalanced. Strategies:
- Class weights: compute_class_weight('balanced', ...) from scikit-learn, passed into the loss function
- Oversampling: SMOTE on embeddings or text augmentation (paraphrasing)
- Undersampling: only if majority class is truly excessive
- Focal Loss: for extreme imbalance (1:100+)
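The class-weights strategy can be sketched as follows; the label array is a toy example, and the commented PyTorch line shows where the weights would plug into the loss:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced toy labels: class 0 is 4x more frequent than class 1
y = np.array([0] * 80 + [1] * 20)
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(weights)  # rarer classes receive proportionally larger weights

# In PyTorch the weights would then be passed into the loss, e.g.:
# loss_fn = torch.nn.CrossEntropyLoss(
#     weight=torch.tensor(weights, dtype=torch.float32))
```

The "balanced" heuristic sets each weight to n_samples / (n_classes * class_count), so here class 0 gets 0.625 and class 1 gets 2.5.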
Monitor per-class F1, not just accuracy: 95% accuracy when the rare class makes up 5% of the data means nothing, since a constant majority-class predictor achieves it.
Multiclass vs Multilabel Classification
For multilabel (text can have multiple labels simultaneously):
- Replace softmax with sigmoid in the final layer
- Use BCEWithLogitsLoss instead of CrossEntropyLoss
- Tune the classification threshold separately per class (maximize F1)
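Per-class threshold tuning can be sketched with a simple grid search over validation probabilities (the function name and grid are illustrative assumptions):

```python
import numpy as np

def tune_thresholds(probs, y_true, grid=np.linspace(0.05, 0.95, 19)):
    """Pick, per class, the threshold that maximizes that class's F1.

    probs: (n_samples, n_classes) sigmoid outputs; y_true: 0/1 ints."""
    n_classes = probs.shape[1]
    thresholds = np.zeros(n_classes)
    for c in range(n_classes):
        best_f1, best_t = -1.0, 0.5
        for t in grid:
            pred = (probs[:, c] >= t).astype(int)
            tp = np.sum(pred & y_true[:, c])
            fp = np.sum(pred & (1 - y_true[:, c]))
            fn = np.sum((1 - pred) & y_true[:, c])
            f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
            if f1 > best_f1:
                best_f1, best_t = f1, t
        thresholds[c] = best_t
    return thresholds
```

The tuned thresholds are then frozen alongside the model artifact; recomputing them on each deploy from a fresh validation slice guards against silent drift.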
Classifier Deployment
Inference Optimization:
- ONNX export: 2–4x CPU inference speedup
- Quantization (INT8): 4x memory reduction, accuracy degradation < 1%
- TorchScript: for production PyTorch serving
Serving:
# ONNX Runtime export via Hugging Face Optimum
pip install optimum[onnxruntime]

# Export
from optimum.onnxruntime import ORTModelForSequenceClassification
model = ORTModelForSequenceClassification.from_pretrained("./classifier", export=True)
ONNX+INT8 latency on CPU: 20–50ms for 512-token text.
Metrics and Monitoring
- F1 Macro — main metric for imbalanced tasks
- Confusion matrix — mandatory in initial assessment
- Calibration curve — if you need reliable probabilities
In production: monitor distribution shift via KL divergence between the current predicted-class distribution and a historical baseline. When the divergence leaves its historical corridor, retrain the model.
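The drift check above can be sketched in a few lines; the baseline and current distributions and the 0.1 threshold are illustrative values, not recommendations:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) between two predicted-class distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

baseline = [0.70, 0.20, 0.10]   # class distribution at deployment time
current  = [0.40, 0.35, 0.25]   # distribution over the last monitoring window

drift = kl_divergence(current, baseline)
# The alert threshold is task-specific; 0.1 here is purely illustrative
if drift > 0.1:
    print(f"Drift detected: KL = {drift:.3f}, consider retraining")
```

Note that drift in the predicted-class distribution is only a proxy: it catches input shift cheaply, but a slice of freshly annotated production traffic is still needed to confirm an actual quality drop.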
Implementation Timeline
- Baseline (TF-IDF + ML): 3–5 days (including annotation)
- BERT fine-tuning: 1–2 weeks
- Production with monitoring: 3–5 weeks