AI-Powered Message Toxicity Detection for Mobile Apps
Toxicity and spam are different detection tasks. Spam is caught through repetition patterns and sender behavior; a toxic message is unique, written by a real human, and often grammatically correct, which makes it much harder to detect.
Main Technical Problem
General-purpose toxicity models such as unitary/toxic-bert work well on English Reddit-style data. In a Russian-language app, they produce false positives on words with culture-specific connotations and miss profanity masked by character substitution, a standard circumvention tactic among CIS audiences. The same applies to Ukrainian and Belarusian.
Another trap is a synchronous model call before the message is sent: the user presses "send", waits 800 ms, and the UX is broken. Detection should either run as asynchronous post-processing or be fast enough that the delay goes unnoticed.
Architecture That Actually Works
Multi-Level Classification
Level 1, on-device and fast: regex plus a dictionary of roughly 2,000 obvious toxic patterns, including leetspeak variants. It runs in under 5 ms with no network round trip and catches 60–65% of toxic messages with minimal false positives.
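A minimal sketch of the Level 1 matching logic, shown in Python for readability (on Android the same logic would live in Kotlin). The substitution table and the pattern list are illustrative placeholders, not the production dictionary:

```python
import re

# Illustrative character-substitution map: undo common leetspeak masking
# before dictionary lookup (the real table is far larger).
SUBSTITUTIONS = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "@": "a", "$": "s", "!": "i",
})

# Placeholder patterns; the production list holds ~2,000 entries.
TOXIC_PATTERNS = [
    re.compile(r"\bидиот\w*\b", re.IGNORECASE),
    re.compile(r"\btoxic_word_here\w*\b", re.IGNORECASE),
]

def prefilter_is_toxic(text: str) -> bool:
    # Normalize: lowercase, undo substitutions, collapse repeated characters
    # so stretched spellings ("дурррак") still match dictionary entries.
    normalized = text.lower().translate(SUBSTITUTIONS)
    normalized = re.sub(r"(.)\1{2,}", r"\1", normalized)
    return any(p.search(normalized) for p in TOXIC_PATTERNS)
```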
Level 2, server-side ML: a model fine-tuned on a Russian dataset (RuToxic or similar from Hugging Face). It is called asynchronously after the message is displayed; if it triggers, the message is hidden and replaced with a placeholder.
```kotlin
// Android: optimistic sending + async toxicity check
fun sendMessage(text: String) {
    val tempMessage = Message(text = text, status = MessageStatus.PENDING_REVIEW)
    chatAdapter.addMessage(tempMessage) // show immediately
    viewModelScope.launch {
        val result = toxicityRepository.classify(text)
        if (result.isToxic && result.confidence > 0.78f) {
            chatAdapter.updateMessageStatus(tempMessage.id, MessageStatus.HIDDEN)
            showToxicityNotice()
        } else {
            chatAdapter.updateMessageStatus(tempMessage.id, MessageStatus.VISIBLE)
        }
    }
    messageApi.send(tempMessage) // fire-and-forget delivery; classification runs in parallel
}
```
This approach — "optimistic UI" + post-facto check — solves delay problem. User sees message instantly, check runs in parallel.
Multilingual Support via xlm-roberta-base
For apps with a multi-country audience, use xlm-roberta-base fine-tuned on a mixed-language dataset. The model is exported to ONNX and served behind a FastAPI endpoint. Important: under high traffic, inference should run in batches; export the model with a dynamic batch axis so onnxruntime can process a whole batch in one call, which yields roughly 4x the throughput of sequential processing.
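A minimal sketch of such an endpoint. The model file toxicity-xlmr.onnx, the input names, the multi-label head, and the label order are all assumptions to adapt to your export:

```python
import numpy as np
import onnxruntime as ort
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
# Hypothetical path to the fine-tuned model, exported with a dynamic batch axis.
session = ort.InferenceSession("toxicity-xlmr.onnx")

LABELS = ["hate_speech", "insult", "threat", "obscenity"]  # assumed head order

class ClassifyRequest(BaseModel):
    texts: list[str]  # a batch of messages in a single call

@app.post("/classify")
def classify(req: ClassifyRequest):
    enc = tokenizer(
        req.texts, padding=True, truncation=True,
        max_length=128, return_tensors="np",
    )
    # One session.run over the whole batch instead of N sequential calls.
    logits = session.run(
        None,
        {
            "input_ids": enc["input_ids"].astype(np.int64),
            "attention_mask": enc["attention_mask"].astype(np.int64),
        },
    )[0]
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid for multi-label scores
    return [dict(zip(LABELS, row.tolist())) for row in probs]
```

A gateway in front of this endpoint can micro-batch concurrent requests; even a single client call with an array of texts already amortizes tokenizer and session overhead.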
Granular Categories Instead of Binary Label
Instead of simple "toxic/not" model returns score vector:
| Category | Auto-Block Threshold | Human Review Threshold |
|---|---|---|
| hate_speech | 0.85 | 0.60 |
| insult | 0.90 | 0.70 |
| threat | 0.80 | 0.55 |
| obscenity | 0.88 | 0.65 |
This lets you tune the moderation policy per app type: stricter thresholds for a kids' app, looser for an adult forum.
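A sketch of how those scores can drive routing decisions. The decision enum is illustrative; the numbers come from the table above:

```python
from enum import Enum

class Action(Enum):
    AUTO_BLOCK = "auto_block"
    HUMAN_REVIEW = "human_review"
    ALLOW = "allow"

# (auto-block, human-review) thresholds per category, from the table above.
THRESHOLDS = {
    "hate_speech": (0.85, 0.60),
    "insult":      (0.90, 0.70),
    "threat":      (0.80, 0.55),
    "obscenity":   (0.88, 0.65),
}

def decide(scores: dict[str, float]) -> Action:
    # The harshest verdict across categories wins.
    verdict = Action.ALLOW
    for category, score in scores.items():
        block_at, review_at = THRESHOLDS[category]
        if score >= block_at:
            return Action.AUTO_BLOCK
        if score >= review_at:
            verdict = Action.HUMAN_REVIEW
    return verdict
```

For a kids' app the whole table shifts down; keeping thresholds in config rather than code makes that a product decision instead of a release.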
iOS: Core ML Pre-Filter
On iOS, implement the pre-filter via Core ML with a text classifier model converted via coremltools:
```swift
import NaturalLanguage

// NLModel(mlModel:) is a throwing initializer, so `try` is required.
let classifier = try NLModel(mlModel: toxicityModel.model)
let prediction = classifier.predictedLabel(for: text) ?? "safe"
// Returns [String: Double] with per-label confidence (iOS 14+).
let hypotheses = classifier.predictedLabelHypotheses(for: text, maximumCount: 2)
if prediction == "toxic", let score = hypotheses["toxic"], score > 0.9 {
    return .block
}
```
The NaturalLanguage framework with a custom NLModel is the cleanest path on iOS: it requires no third-party dependencies.
Process
1. Dataset collection: export historical user reports and label them via Label Studio or Toloka.
2. Fine-tune the base model on the domain-specific data (a sketch follows this list).
3. Deploy the inference API and integrate it into the mobile clients.
4. Tune thresholds based on the precision/recall tradeoff your product requires.
5. Monitoring: track the share of auto-blocked messages and the false-positive rate surfaced by user complaints.
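For step 2, a condensed fine-tuning sketch with the Hugging Face Trainer, assuming the labeled exports land in train.csv and val.csv with "text" and "label" columns; the file names and hyperparameters are placeholders:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2  # binary toxic / non-toxic head
)

# Hypothetical labeled exports: columns "text" and "label" (0 or 1).
dataset = load_dataset(
    "csv", data_files={"train": "train.csv", "validation": "val.csv"}
)

def tokenize(batch):
    # Fixed-length padding keeps the default data collator happy.
    return tokenizer(
        batch["text"], truncation=True, padding="max_length", max_length=128
    )

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="toxicity-xlmr",
        num_train_epochs=3,
        per_device_train_batch_size=32,
    ),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
```

After training, export to ONNX for the serving path described above and re-check the thresholds: a fine-tuned model shifts the score distribution, so the table values need recalibration.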
Timeline Guidance
Basic integration of a ready-made multilingual model takes 4–6 days. Fine-tuning on your own dataset plus deployment adds another 2–3 weeks. A full system with categorization, a human review queue, and a feedback loop takes 4–6 weeks.