How does the automated journalism system work?

The system receives structured data (e.g., financial reports, sports statistics) and generates coherent news text based on narrative templates. Key stages: data analysis, angle selection, template-based generation, fact verification, and post-processing.

What types of data can be used for news generation?

Any structured data works: quarterly company reports (EDGAR, Moscow Exchange), match results, weather data, registries (Rosreestr, traffic police). The key is having clear rules for extracting key facts.

How is factual accuracy ensured in generated texts?

Every numerical claim is checked by an automatic fact-checker: the value in the text must match the source data within 1% tolerance. If there's a mismatch, the system corrects the error or flags the article.

How long does implementation take?

Timeline depends on template complexity and data sources. A basic pipeline for one data type (e.g., financial reports) can be set up in 2–3 weeks. Full implementation with 5+ templates takes 1–2 months.

What advantages does AI journalism offer over manual writing?

Speed: 500 articles per hour on a single GPU A100. Consistent quality: uniform style, zero number errors. Scalability: easily processes thousands of reports daily. Editors can focus on creative work—trend analysis and interviews.

AI-Powered Data-to-Text News Generation

We design and deploy artificial intelligence systems: from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering and MLOps to make AI work not in the lab, but in real business.

A major publishing house asked us to automate its news production: quarterly reports from 200+ Moscow Exchange issuers needed processing. Manual writing took 2–3 days per company, which adds up to more than 400 person-days of work. Copying numbers by hand inevitably introduced errors, and style consistency suffered. Our solution is a data-to-text pipeline based on LLMs, with narrative templates and RAG context for up-to-date information. The system now generates 200 articles in 4 hours with fact verification, so editors only need to check the headlines.

Performance: one GPU A100 handles 500 articles per hour—50x faster than a team of 10 journalists. Number accuracy is 100% after automatic verification. Generation cost is an order of magnitude lower than manual labor, and editors can focus on analysis and interviews.

Problems We Solve

First, time: people spend hours transcribing numbers from tables into text, and copying errors are inevitable. Second, scalability: with 500 reports to cover, hiring 20 journalists is not feasible. Third, uniformity: manually written texts on the same topic vary in style from author to author, while template-driven generation keeps them consistent.

Financial reporting: quarterly results from companies—data from EDGAR/Moscow Exchange → text with key metrics, trends, and comparison to forecasts. One template covers thousands of companies.

Sports statistics: match results, game stats—standard narrative with variation for key moments.

Registry summaries: Rosreestr transaction data, traffic accident data, bankruptcy registries—automatic summaries with anomalies.

Weather reports and warnings: weather forecasts converted to readable text with emphasis on hazardous conditions.

Why Narrative Templates Are More Effective Than Pure LLM

A pure LLM can hallucinate numbers or miss important facts. A template rigidly defines the structure: which metrics to compare, which "angle" to take when revenue declines. The LLM (we use GPT-4/4o, LLaMA 3) is only applied for phrasing variation at the final stage—this reduces hallucination risk by 10x.

Example template for financial reporting:

# NarrativeTemplate, FactRule and AngleRule belong to the template framework (not shown here).
class EarningsReportTemplate(NarrativeTemplate):
    # Which metrics to extract and what each one is compared against
    fact_rules = [
        FactRule("revenue", comparisons=["yoy", "qoq", "consensus"]),
        FactRule("net_income", comparisons=["yoy", "consensus"]),
        FactRule("eps", comparisons=["consensus", "guidance"]),
        FactRule("guidance_next_quarter", type="forward_looking"),
    ]

    # Which deviations change the tone ("angle") of the article
    angle_rules = [
        AngleRule(condition="revenue_beat > 5%", angle="strong_beat"),
        AngleRule(condition="revenue_miss > 5%", angle="disappointment"),
        AngleRule(condition="guidance_raised", angle="optimism"),
        AngleRule(condition="guidance_lowered", angle="caution"),
    ]

How to Set Up a Template for a New Data Type

Analyze the source structure: what fields exist and how they relate.
Define FactRules—which metrics to extract and what to compare them against (YoY, consensus).
Set AngleRules—under which deviations the tone of the news should change.
Write a narrative template in YAML: fixed text blocks with variables.
Test on 10–20 records, verify factual accuracy and readability.
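
To make the idea of "fixed text blocks with variables" concrete, here is a minimal sketch that uses Python's standard `string.Template` as a stand-in for one block of a narrative template. The block wording, the field names (`company`, `revenue_mln`, `revenue_yoy_pct`) and the `render_block` helper are assumptions for illustration, not the production template format.

```python
from string import Template

# Hypothetical fixed text block; $-prefixed names are the variables to fill in.
BLOCK = Template(
    "$company reported quarterly revenue of $revenue million, "
    "$direction $change% year over year."
)


def render_block(facts: dict) -> str:
    """Fill the block from extracted facts; a missing variable raises KeyError."""
    change = facts["revenue_yoy_pct"]
    return BLOCK.substitute(
        company=facts["company"],
        revenue=f"{facts['revenue_mln']:,.1f}",
        direction="up" if change >= 0 else "down",
        change=f"{abs(change):.1f}",
    )


print(render_block(
    {"company": "Example Corp", "revenue_mln": 1250.0, "revenue_yoy_pct": -3.2}
))
```

Using `substitute` rather than `safe_substitute` is deliberate in this sketch: a template that references a fact the data does not contain fails loudly instead of producing an article with a gap.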

Example template for a sports match

template:
  fact_rules:
    - entity: match
      metrics: [score, possession, shots_on_target]
    - entity: player
      metrics: [goals, assists, passes_accuracy]
  angle_rules:
    - condition: "score_diff > 2"
      angle: "rout"
    - condition: "score_diff == 0"
      angle: "draw"
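
One possible way to evaluate condition strings like the ones above is sketched below. It assumes conditions are either `name operator number` (with an optional `%` sign) or a bare flag name, and that the first matching rule wins; `condition_holds`, `pick_angle` and the `"neutral"` fallback are hypothetical names, not part of the pipeline shown in this article.

```python
import operator
import re

# Comparison operators a condition string may use.
_OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq,
        ">": operator.gt, "<": operator.lt}
_COND = re.compile(r"^\s*(\w+)\s*(>=|<=|==|>|<)\s*(-?\d+(?:\.\d+)?)\s*%?\s*$")


def condition_holds(condition: str, facts: dict) -> bool:
    """Evaluate 'name op number' or a bare flag name against extracted facts."""
    match = _COND.match(condition)
    if match is None:
        # Bare flag such as "guidance_raised": true when the fact is truthy.
        return bool(facts.get(condition.strip(), False))
    name, op, threshold = match.groups()
    if name not in facts:
        return False
    return _OPS[op](facts[name], float(threshold))


def pick_angle(rules: list[dict], facts: dict, default: str = "neutral") -> str:
    """Return the angle of the first rule whose condition holds."""
    for rule in rules:
        if condition_holds(rule["condition"], facts):
            return rule["angle"]
    return default


sports_rules = [
    {"condition": "score_diff > 2", "angle": "rout"},
    {"condition": "score_diff == 0", "angle": "draw"},
]
for diff in (3, 0, 1):
    print(diff, pick_angle(sports_rules, {"score_diff": diff}))
```

Parsing a small, fixed grammar instead of calling `eval` keeps template files from executing arbitrary code, which matters if editors are allowed to extend templates themselves.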

Architecture of the AI Pipeline for Automated Journalism

The pipeline consists of four sequential modules: data analyzer, angle determiner, text generator, and post-processor. Each module follows the single-responsibility principle, simplifying debugging and component replacement.

from datetime import datetime, timezone


class DataToTextPipeline:
    """Turns one structured record into an article using a narrative template."""

    def __init__(self, template: NarrativeTemplate):
        self.template = template
        self.data_analyzer = DataAnalyzer()
        self.text_generator = TextGenerator()

    def generate(self, data: dict) -> GeneratedArticle:
        # 1. Data analysis: identify key facts
        key_facts = self.data_analyzer.extract_key_facts(data, self.template.fact_rules)

        # 2. Determine the "angle" of the article
        angle = self.data_analyzer.determine_angle(key_facts, self.template.angle_rules)

        # 3. Generate text using the narrative template
        text = self.text_generator.generate(
            facts=key_facts,
            angle=angle,
            template=self.template,
            style_guide=self.template.style_guide
        )

        # 4. Post-processing: fact-checking, number formatting
        text = self.postprocess(text, data)

        return GeneratedArticle(
            # generate_headline is not shown in this excerpt
            headline=self.generate_headline(key_facts, angle),
            body=text,
            data_sources=data.get("sources", []),
            # Timezone-aware timestamp; datetime.utcnow() is deprecated since Python 3.12
            generated_at=datetime.now(timezone.utc),
            template_version=self.template.version
        )

    def postprocess(self, text: str, data: dict) -> str:
        # Verification: every number in the text must match the source data
        return FactChecker(data).verify_and_fix(text)

How Number Accuracy Is Guaranteed

Every numerical claim in the text must be traceable to the source data. Automatic verification:

def verify_facts(article_text: str, source_data: dict) -> VerificationResult:
    # Extract all numerical claims from the text
    claims = extract_numerical_claims(article_text)

    errors = []
    for claim in claims:
        # Find the corresponding value in the source data
        source_value = find_in_data(source_data, claim.entity, claim.metric)
        if source_value is None:
            errors.append(VerificationError(type="unverifiable", claim=claim))
        elif not is_close(claim.value, source_value, tolerance=0.01):
            errors.append(VerificationError(
                type="mismatch",
                claim=claim,
                expected=source_value
            ))

    return VerificationResult(is_valid=len(errors) == 0, errors=errors)

The system will not publish an article until all numbers pass verification. The Associated Press uses a similar approach—they label automated content and link to source data.
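
The `is_close` helper called in `verify_facts` is not shown in the original. One plausible reading of the 1% tolerance, as a sketch: the deviation is measured relative to the source value, and a source value of zero requires an exact match. Both choices are assumptions.

```python
def is_close(claimed: float, source: float, tolerance: float = 0.01) -> bool:
    """True when `claimed` deviates from `source` by at most `tolerance` (relative)."""
    if source == 0:
        # A relative tolerance is undefined around zero, so require an exact match.
        return claimed == 0
    return abs(claimed - source) <= tolerance * abs(source)


print(is_close(12.4, 12.43))    # rounded figure, inside the 1% band
print(is_close(124.3, 12.43))   # misplaced decimal point, outside the band
```

The tolerance exists so that legitimately rounded figures ("12.4 billion" for 12.43 billion) pass, while transposed digits or a shifted decimal point do not.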

Performance and Experience

Parameter	AI System	Human Journalist
Speed (1 article)	10 seconds	1–3 hours (with fact-checking)
Number accuracy	100% after verification	95-98% (copy errors)
Scalability	500 articles/hour on GPU	max 10 articles/day per person
Cost per 1000 articles	Tens of times cheaper than manual	Salary of 3+ editors

One instance of the system on a GPU A100 produces ~500 articles per hour at an average length of 300 words. For a news agency, this means full coverage of all Moscow Exchange companies' financial reports on the day results are published. Our experience: 10+ years in NLP, real-time verification, integration with Wikipedia Automated Journalism.

What’s Included in the Deliverable

Pipeline documentation: data flow diagrams, template descriptions.
Ready-to-use templates for 5 story types (finance, sports, weather, registries, elections).
Integration with the data source API (REST or direct database access).
Showcase of generated articles and audit log.
Editor training: how to extend templates and use LLM for variation.
Accuracy guarantee: every article passes automatic fact-checking.

How to Get Started

Order a pilot: choose one data type (e.g., quarterly reports), and we will build the pipeline in 2 weeks and generate 100 articles so that you can evaluate accuracy and speed. Contact us for a free consultation on how the system fits into your editorial workflow.

NLP Development: Text Classification, NER, Embeddings, and Information Extraction

We often receive a task like this: process 50,000 support tickets that are currently handled manually. The dataset has 3,000 labeled examples in 12 categories and is imbalanced: one category makes up 40% of the sample, while three account for 1-2% each. Baseline accuracy is 78%, which sounds decent until you look at recall for the rare classes: 0.31, 0.44, 0.28. These classes, complaints and churn threats, matter most to the business.

This is a typical NLP development project. The problem is not the algorithm but that accuracy is the wrong metric. Our experience across 30+ projects shows: we start by analyzing business metrics and only then choose the model.

Why is accuracy not the right metric for rare classes?

Accuracy ignores imbalance. If the "churn" class appears in 2% of cases, the model can predict "all good" and get 98% accuracy — but the business loses clients. Solution: F1 macro (averaged over all classes) or weighted F1. For NER — strict entity F1 (exact matches only). We guarantee: after choosing the correct metric, model quality becomes measurable and predictable.
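
The 2% churn example can be checked with a few lines of plain Python. This is a minimal sketch with toy labels (`"ok"`, `"churn"`) and hypothetical helper names; in a real project a metrics library would do the same computation.

```python
def f1_for_label(y_true: list[str], y_pred: list[str], label: str) -> float:
    """F1 of one class, treating every other class as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lb) for lb in labels) / len(labels)


def accuracy(y_true: list[str], y_pred: list[str]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# A model that always answers "ok" on data with 2% churn.
y_true = ["ok"] * 98 + ["churn"] * 2
y_pred = ["ok"] * 100
print(f"accuracy={accuracy(y_true, y_pred):.2f} macro_f1={macro_f1(y_true, y_pred):.2f}")
```

Accuracy stays at 0.98, while macro F1 falls to roughly 0.49 because the churn class contributes an F1 of zero.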

Text Classification: From BERT to Distillation

BERT-like models are the standard for classification: ruBERT-base or ruBERT-large from DeepPavlov for Russian, multilingual-e5-large when several languages share one pipeline, and XLM-RoBERTa-large as a strong multilingual backbone.

Fine-tuning for classification: add a classification head on top of the [CLS] token, train for 3-5 epochs with lr=2e-5, weight decay=0.01. For imbalance — weighted CrossEntropyLoss or focal loss with gamma=2.0. Contact us — we will show a code snippet.
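
As a rough illustration of the focal loss mentioned above, here is a pure-Python sketch for a single example. The function name and the `alpha` scaling parameter are assumptions for illustration; a real training loop would compute the same quantity on tensors.

```python
import math


def focal_loss(probs: list[float], target: int, gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss for one example: -alpha * (1 - p_t)^gamma * log(p_t)."""
    p_t = probs[target]  # predicted probability of the true class
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


easy = [0.9, 0.05, 0.05]   # confident, correct prediction for class 0
hard = [0.1, 0.6, 0.3]     # the true class 0 gets little probability

for name, probs in (("easy", easy), ("hard", hard)):
    ce = focal_loss(probs, target=0, gamma=0.0)   # gamma=0 is plain cross-entropy
    fl = focal_loss(probs, target=0, gamma=2.0)
    print(f"{name}: cross-entropy={ce:.4f} focal={fl:.4f}")
```

With gamma=2.0 the easy example is scaled down by about a factor of 100, while the hard one keeps most of its loss, so training effort shifts toward the examples the model still gets wrong, which are often the rare classes.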

Imbalance case study. Dataset — 3,000 examples, imbalance 1:20. Solution: class_weight via sklearn + CrossEntropyLoss. Additionally — augmentation of rare classes via backtranslation (ru→en→ru through MarianMT). Recall for rare classes rose from 0.31 to 0.67 with a slight drop in accuracy (76%→74%). Full NLP development end-to-end took 3 weeks.
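
The class weights can be derived from label counts alone. The sketch below uses the formula behind scikit-learn's "balanced" heuristic, n_samples / (n_classes * count), written in plain Python; the two-class toy data only mimics the 1:20 ratio and is not the case-study dataset.

```python
from collections import Counter


def balanced_class_weights(labels: list[str]) -> dict[str, float]:
    """weight[c] = n_samples / (n_classes * count[c])."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Toy 1:20 imbalance: 100 rare examples against 2,000 common ones.
labels = ["common"] * 2000 + ["rare"] * 100
weights = balanced_class_weights(labels)
print(weights)
```

The resulting weights are then passed to the loss function so that a mistake on a rare example costs proportionally more than a mistake on a common one.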

Distillation for production. BERT-large gives F1 0.89, but CPU inference takes 180 ms. Distilling into DistilBERT cuts latency to 25 ms at F1 0.84, and ruBERT-tiny2 goes down to 12 ms at F1 0.81. Export to ONNX Runtime provides an additional 1.5-2x speedup. DistilBERT thus runs about 7x faster than BERT-large at the cost of 0.05 macro F1 (0.89 to 0.84), a typical production trade-off.

Model	F1 macro	Latency (CPU)	Size
BERT-large	0.89	180 ms	1.3 GB
DistilBERT	0.84	25 ms	250 MB
ruBERT-tiny2	0.81	12 ms	120 MB
DistilBERT + ONNX	0.84	14 ms	150 MB

How to choose between BERT and LLM for your task?

For most classification and extraction tasks, BERT-sized models offer the best trade-off between cost and performance. Shift to LLMs only when the task demands generation, complex reasoning, or zero-shot generalization.

NER: Named Entity Recognition

NER — extracting persons, organizations, locations, dates, amounts, document numbers. For general categories (PER, ORG, LOC), pre-trained models work well. For specialized ones (medical terms, legal concepts) — fine-tuning is needed.

Data annotation. The main cost of an NER project. For a quality model — 500-2,000 labeled sentences per entity type. Tools: Label Studio (open source) or Prodigy (by spaCy creators). IOB2 format — standard.

Architecture. Token classification on top of BERT: each token gets a label (B-PER, I-PER, O). spaCy 3.x with transformer pipeline — a convenient production choice.
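
To show what IOB2 labels look like in practice, here is a small sketch that converts token-level entity spans to tags and back. The span representation `(start_token, end_token_exclusive, label)` and both function names are assumptions for illustration; annotation tools export their own formats.

```python
Span = tuple[int, int, str]  # (start_token, end_token_exclusive, label)


def spans_to_iob2(num_tokens: int, spans: list[Span]) -> list[str]:
    """Tag the first token of each entity B-<label>, the rest I-<label>, others O."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def iob2_to_spans(tags: list[str]) -> list[Span]:
    """Decode tags back into spans; stray I- tags without a matching B- are ignored."""
    spans: list[Span] = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


tokens = ["Anna", "Petrova", "joined", "Acme", "in", "Moscow"]
gold = [(0, 2, "PER"), (3, 4, "ORG"), (5, 6, "LOC")]
tags = spans_to_iob2(len(tokens), gold)
print(list(zip(tokens, tags)))
```

Decoding predictions back into spans is also what a strict entity-level F1 needs: a predicted entity counts as correct only when its start, end and label all match a gold span.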

Nested entities. Standard IOB models cannot handle nested entities (organization inside an address). For such tasks — span-based NER: SpanBERT or SpERT. More complex but correct.

Post-processing is mandatory. The model predicts tokens — normalized entities are needed. Date — dateparser. Amounts — regex + validation. Names — deduplication via rapidfuzz. Included in our standard delivery.
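
As an example of the "regex + validation" step for amounts, the sketch below normalizes a predicted amount string to a `Decimal`. It assumes Russian-style formatting (spaces or non-breaking spaces as thousands separators, comma or dot as the decimal mark); the pattern and the `normalize_amount` name are illustrative, and other locales would need a different pattern.

```python
import re
from decimal import Decimal
from typing import Optional

# Either grouped digits ("1 250 000,50") or a plain number ("980.5").
_AMOUNT = re.compile(
    r"\d{1,3}(?:[ \u00a0]\d{3})+(?:[.,]\d{1,2})?|\d+(?:[.,]\d{1,2})?"
)


def normalize_amount(raw: str) -> Optional[Decimal]:
    """Return the first amount found in `raw` as a Decimal, or None."""
    match = _AMOUNT.search(raw)
    if match is None:
        return None
    digits = match.group(0).replace(" ", "").replace("\u00a0", "").replace(",", ".")
    return Decimal(digits)


for raw in ("1 250 000,50 RUB", "Total: 980.5", "amount not stated"):
    print(repr(raw), "->", normalize_amount(raw))
```

Returning `None` instead of guessing lets the caller route unparseable values to manual review.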

Sentiment Analysis and Opinion Mining

Binary classification positive/negative works out of the box with BERT. Complexity — aspect-based sentiment analysis (ABSA): "the restaurant has good food but terrible service." For ABSA: aspect extraction (NER) + sentiment per aspect. Joint models BERT-for-ABSA — quality on Russian data is lower due to dataset scarcity. RuSentiment, SentiRuEval — main resources.

For production with simple positive/negative/neutral: distil models are enough. Three classes, balanced dataset, 2,000+ examples — F1 macro 0.82-0.87 in 1-2 days.

Text Summarization

Extractive summarization (select sentences) — TextRank or BM25 without training. Fast, no hallucinations. Good for long documents.

Abstractive (generates new text) — seq2seq: mT5, mBART, FRED-T5, ruT5-large. For production via LLM API (GPT-4, Claude) — often the best cost/quality/speed trade-off.

Embeddings: Vector Representations of Text

Embeddings are the foundation of semantic search, deduplication, clustering, RAG. Quality critically affects downstream tasks.

Models. BGE-M3 and multilingual-e5-large are strong multilingual embedders; E5-large-v2 is the English-only counterpart. sentence-transformers/paraphrase-multilingual-mpnet-base-v2 is a fast option. For Russian, ru-en-RoSBERTa (ai-forever) performs well on semantic textual similarity.

Embedding quality evaluation uses the MTEB benchmark as standard. But top results on MTEB don't guarantee success on a domain dataset — we build domain-specific eval.

Fine-tuning embeddings. If standard models don't give the required Recall@k — contrastive learning on domain pairs with MultipleNegativesRankingLoss. How to perform this for domain data:

Collect 500–2,000 semantically similar pairs from your domain.
Apply MultipleNegativesRankingLoss with a batch size of 32–64.
Train for 1–3 epochs using AdamW (lr=2e-5).
Evaluate Recall@k on a held-out domain test set.

This approach yields a 5–15% improvement in Recall@k in practice.
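
The evaluation step can be sketched in plain Python. The definition used here is an assumption, since several variants of Recall@k exist: for each query, the share of its relevant documents that appear in the top k results, averaged over queries. The data structures and IDs are toy examples.

```python
def recall_at_k(ranked: dict[str, list[str]], relevant: dict[str, set[str]], k: int) -> float:
    """Mean over queries of (relevant docs found in the top k) / (relevant docs).

    Assumes every query in `relevant` has at least one relevant document.
    """
    scores = []
    for query_id, relevant_docs in relevant.items():
        top_k = set(ranked.get(query_id, [])[:k])
        scores.append(len(top_k & relevant_docs) / len(relevant_docs))
    return sum(scores) / len(scores)


# Search results per query (best first) and the held-out relevance labels.
ranked = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d9", "d4"]}
relevant = {"q1": {"d1"}, "q2": {"d4", "d5"}}

for k in (2, 3):
    print(f"Recall@{k} = {recall_at_k(ranked, relevant, k):.2f}")
```

Running the same function on the base model and on the fine-tuned one, with an identical held-out set, gives a like-for-like comparison.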

Dimensionality and storage. E5-large: 1024 dim, float32 — 4KB per vector. For 10M documents — 40GB. Quantization int8 reduces to 10GB. FAISS IVF_PQ — more compact but with losses. Included in our deployment recommendations.
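
The storage figures above follow from simple arithmetic, reproduced here as a short sketch. It counts raw vector bytes only, in decimal gigabytes, and ignores index overhead; the helper name is an assumption.

```python
def index_size_gb(num_vectors: int, dim: int, bytes_per_value: int) -> float:
    """Raw vector storage in decimal gigabytes, ignoring index overhead."""
    return num_vectors * dim * bytes_per_value / 1e9


float32_gb = index_size_gb(10_000_000, 1024, 4)  # 4 bytes per float32 value
int8_gb = index_size_gb(10_000_000, 1024, 1)     # 1 byte per int8 value
print(f"float32: {float32_gb:.2f} GB, int8: {int8_gb:.2f} GB")
# float32: 40.96 GB, int8: 10.24 GB
```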

Information Extraction

Structured extraction is a frequent task. Examples: key contract terms, technical characteristics, dates and amounts from invoices.

Regex + rule-based. For INN, OGRN, amounts, dates — more reliable than neural networks. No data required.
NER + post-processing. For variable formats.
LLM with structured output. GPT‑4 / Claude with JSON schema — for complex documents. Cost: minimal per document. For 10k+ documents/day — we calculate the economics.
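
For the rule-based option, a regex becomes much more reliable when it is paired with the identifier's control digit. The sketch below covers the 10-digit INN of legal entities and the 13-digit OGRN; 12-digit personal INNs and 15-digit OGRNIP numbers are left out for brevity. Function names and the sample text are illustrative.

```python
import re

_INN10 = re.compile(r"\b\d{10}\b")
_OGRN = re.compile(r"\b\d{13}\b")
_INN10_WEIGHTS = (2, 4, 10, 3, 5, 9, 4, 6, 8)


def is_valid_inn10(value: str) -> bool:
    """Check the control digit of a 10-digit (legal entity) INN."""
    if not re.fullmatch(r"\d{10}", value):
        return False
    # Weighted sum of the first nine digits, mod 11, then mod 10.
    checksum = sum(w * int(d) for w, d in zip(_INN10_WEIGHTS, value)) % 11 % 10
    return checksum == int(value[9])


def is_valid_ogrn(value: str) -> bool:
    """Check the control digit of a 13-digit OGRN."""
    if not re.fullmatch(r"\d{13}", value):
        return False
    # The first twelve digits as a number, mod 11, then mod 10.
    return int(value[:12]) % 11 % 10 == int(value[12])


def extract_ids(text: str) -> dict[str, list[str]]:
    """Find candidate numbers by regex and keep only those whose check digit fits."""
    return {
        "inn": [m for m in _INN10.findall(text) if is_valid_inn10(m)],
        "ogrn": [m for m in _OGRN.findall(text) if is_valid_ogrn(m)],
    }


sample = "INN 7707083893, OGRN 1027700132195, order ref 1234567890"
print(extract_ids(sample))
```

The order reference in the sample has ten digits and matches the regex, but it fails the control-digit check and is dropped, which is the kind of false positive a bare pattern would let through.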

We deliver a hybrid: regex/NER for typical fields plus an LLM for edge cases. This approach is backed by years of production experience and more than 30 projects.

Work Stages

Stage	Duration	What's included
Data and metric analysis	3-5 days	Class distribution, text lengths, baseline
Baseline (TF‑IDF + LogReg)	1 day	Quick estimate of gap with deep models
Training and validation	1-2 weeks	k‑fold, early stopping, error analysis
Deployment (ONNX + FastAPI)	1-2 weeks	REST API, batching, monitoring
Documentation and training	2-3 days	Model card, API docs, team training

Prototype on existing data — 1-3 weeks. Production system with CI/CD — 1.5–2.5 months. Cost is calculated individually — get a consultation for a project estimate.

What's Included

Model and pipeline architecture documentation
Access to the model via REST API (FastAPI + ONNX)
Client team training (2-hour webinar + Q&A)
Accuracy guarantee on the agreed test set
Months of post-delivery support (bug fixes, adaptation to new data)

Our Experience

Years of NLP projects from classification to RAG systems. The team includes ML engineers experienced with Hugging Face, spaCy, LangChain, MLOps. We use vLLM, Kubeflow, Weights & Biases — a production stack, not toys. Contact us to evaluate your NLP project within two days — request a free consultation on your text processing pipeline.