Machine Translation Implementation


Machine translation has evolved from statistical models (Moses) through neural seq2seq models with attention to modern transformers. Today, high-quality pretrained models exist for most language pairs, so the task reduces to choosing the right model and integrating it well.

Translation Model Selection

Ready-made APIs (best quality, simplest integration):

  • Google Cloud Translation API: 500K characters/month free, 100+ languages, $20 per 1M characters
  • DeepL API: often beats Google on European language pairs, $5.99/month for 500K characters
  • OpenAI GPT-4o: for context-sensitive translation (marketing copy, literary text)
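Given the prices above, a quick back-of-the-envelope cost model helps compare providers at your expected volume. The sketch below encodes only the Google figures listed here (free tier, then $20 per 1M characters); verify against current price lists before budgeting:

```python
def google_monthly_cost(chars_per_month: int,
                        free_chars: int = 500_000,
                        usd_per_million: float = 20.0) -> float:
    """Estimate monthly Google Cloud Translation cost:
    free tier first, then a flat per-million-character rate."""
    billable = max(0, chars_per_month - free_chars)
    return billable / 1_000_000 * usd_per_million
```

For example, `google_monthly_cost(3_000_000)` bills 2.5M characters, i.e. $50/month; the same shape of function works for any provider with a free tier plus linear pricing.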

Open-source models (privacy, on-premise deployment, no API costs):

  • MarianMT (Helsinki-NLP): compact models covering 1000+ language pairs, available on Hugging Face
  • NLLB-200 (Meta): 200 languages including low-resource ones, quality close to Google for many pairs
  • SeamlessM4T (Meta): multimodal (text and speech), 100+ languages
  • Opus-MT: a large collection of pretrained MarianMT models
A minimal MarianMT setup with Hugging Face transformers:

from transformers import MarianMTModel, MarianTokenizer

# Russian-to-English model from the Helsinki-NLP Opus-MT collection
model_name = "Helsinki-NLP/opus-mt-ru-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate(texts: list[str]) -> list[str]:
    # Tokenize the batch with padding/truncation to the model's input limit
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    translated = model.generate(**inputs)
    return tokenizer.batch_decode(translated, skip_special_tokens=True)

Specialized Translation

Off-the-shelf models struggle with domain terminology. Mitigation strategies:

Terminology dictionaries: post-process the output by substituting approved terms. The sacremoses library handles detokenization, followed by regex-based replacement.
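A minimal sketch of the dictionary step, mapping terms the model tends to emit to approved domain equivalents (the glossary entries here are hypothetical examples):

```python
import re

# Hypothetical glossary: model output term -> approved domain term
GLOSSARY = {
    "heart attack": "myocardial infarction",
    "high blood pressure": "hypertension",
}

def enforce_glossary(text: str, glossary: dict[str, str]) -> str:
    """Replace whole-word occurrences of glossary keys with approved terms."""
    for raw, approved in glossary.items():
        # \b word boundaries prevent matches inside longer words
        text = re.sub(rf"\b{re.escape(raw)}\b", approved, text,
                      flags=re.IGNORECASE)
    return text
```

Run it on each translated sentence before reassembly; case-insensitive matching catches sentence-initial variants, though inflected forms need language-aware handling.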

Fine-tuning on domain data: 10K–100K parallel sentences from your field. MarianMT fine-tunes on a single GPU in a few hours, and quality typically improves by 3–8 BLEU points on specialized texts.

Prompt engineering for LLMs: GPT-4o with an instruction such as "translate medical texts, preserve Latin terms", with no fine-tuning required.
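As a sketch of the prompt-engineering approach, the translation instruction goes into the system message of a Chat Completions request. The prompt wording and parameters below are illustrative assumptions, not a tested recipe:

```python
def build_translation_request(text: str, model: str = "gpt-4o") -> dict:
    """Assemble a chat-completions payload for domain-aware translation."""
    system_prompt = (
        "You are a medical translator. Translate the user's text from Russian "
        "to English. Preserve Latin terms and drug names exactly as written."
    )
    return {
        "model": model,
        "temperature": 0,  # deterministic output suits translation
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text},
        ],
    }
```

The payload would then be passed to the OpenAI client, e.g. `client.chat.completions.create(**payload)`.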

Quality Post-Processing

Automatic translation evaluation:

  • BLEU: the standard metric, but correlates with human judgment only on large test sets
  • COMET: a neural metric that correlates better with human ratings (e.g. the Unbabel/wmt22-comet-da model)
  • chrF: works well for morphologically rich languages such as Russian

In production, run an A/B test of two models on real users and compare engagement, time on page, and explicit quality ratings.
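For the A/B split, a deterministic hash-based bucketing keeps each user on the same model variant across sessions (variant names below are placeholders):

```python
import hashlib

def assign_variant(user_id: str,
                   variants: tuple[str, ...] = ("model_a", "model_b")) -> str:
    """Stable bucketing: the same user always lands in the same variant."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]
```

Because assignment depends only on the user ID, no state needs to be stored, and per-variant metrics can be joined later by recomputing the bucket.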

Long Text Processing

MarianMT is limited to 512 tokens per input. For long documents:

  • Split into sentences with nltk.sent_tokenize or spaCy
  • Translate sentence by sentence (batched for throughput)
  • Reassemble the output, preserving the original formatting
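The split-translate-reassemble loop can be sketched with the splitter and translator injected as functions (in practice, nltk.sent_tokenize and a batch translator such as the MarianMT function above):

```python
from typing import Callable

def translate_document(paragraphs: list[str],
                       split_sents: Callable[[str], list[str]],
                       translate_batch: Callable[[list[str]], list[str]]) -> str:
    """Translate each paragraph sentence by sentence, keeping paragraph breaks."""
    translated = []
    for para in paragraphs:
        sentences = split_sents(para)
        translated.append(" ".join(translate_batch(sentences)))
    return "\n\n".join(translated)
```

Injecting the two functions also makes the pipeline easy to unit-test with stand-ins before wiring in a real model.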

For GPT-4o: chunk by paragraphs with overlap (prepend the last sentence of the previous chunk), which preserves context for coherent transitions.
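A sketch of that overlap chunking, using a naive regex sentence split for illustration:

```python
import re

def chunks_with_context(paragraphs: list[str]) -> list[str]:
    """Prefix each chunk with the final sentence of the previous paragraph."""
    chunks = []
    prev_tail = ""
    for para in paragraphs:
        chunks.append(f"{prev_tail} {para}".strip() if prev_tail else para)
        # naive split on sentence-ending punctuation followed by whitespace
        prev_tail = re.split(r"(?<=[.!?])\s+", para)[-1]
    return chunks
```

After translation, the overlap sentence is dropped from each chunk's output before reassembly so it is not duplicated.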

Performance

| Model             | Speed (CPU)        | Speed (GPU)         | Quality (ru→en) |
|-------------------|--------------------|---------------------|-----------------|
| MarianMT          | 50–100 words/sec   | 500–1000 words/sec  | BLEU ~35        |
| NLLB-200          | 20–50 words/sec    | 200–500 words/sec   | BLEU ~38        |
| GPT-4o-mini (API) | ~500 words/sec     | —                   | BLEU ~42        |
| DeepL (API)       | ~2000 words/sec    | —                   | BLEU ~44        |

For on-premise deployment with a GPU budget, NLLB-200 on an A10G delivers strong quality with full control over your data.