Development of AI System for News Digest Generation
A personalized news digest drawn from hundreds of sources is a task no human can handle manually in reasonable time. An AI system monitors the sources, clusters publications by topic, removes duplicates, and generates a coherent digest for a specific user or audience segment.
Collection and Processing Pipeline
from datetime import datetime, timedelta

class NewsDigestPipeline:
    def __init__(self, sources: list[NewsSource]):
        self.crawler = NewsCrawler(sources)
        self.deduplicator = SemanticDeduplicator(threshold=0.85)
        self.clusterer = NewsClusterer()
        self.summarizer = NewsSummarizer()
        self.ranker = PersonalizedRanker()

    async def generate_digest(
        self,
        user_profile: UserProfile,
        period_hours: int = 24,
    ) -> Digest:
        # 1. Collect news for the period
        articles = await self.crawler.fetch_since(
            datetime.utcnow() - timedelta(hours=period_hours)
        )
        # 2. Remove duplicates (one story from 20 sources → 1 entry)
        unique_articles = self.deduplicator.deduplicate(articles)
        # 3. Cluster by events
        clusters = self.clusterer.cluster(unique_articles)
        # 4. Personalized cluster ranking
        ranked_clusters = self.ranker.rank(clusters, user_profile)
        # 5. Generate a summary per cluster (multi-document summarization)
        summaries = [
            self.summarizer.summarize_cluster(cluster)
            for cluster in ranked_clusters[:user_profile.digest_size]
        ]
        return Digest(items=summaries, generated_at=datetime.utcnow())
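Step 3 depends on NewsClusterer, which is not shown in the pipeline. A minimal sketch of event clustering under stated assumptions: articles are already embedded as unit-normalized vectors, similar pairs are linked, and connected components (via union-find) become event clusters. The 0.6 threshold and function name are illustrative, not the production algorithm:

```python
import numpy as np

def cluster_by_event(vectors: np.ndarray, threshold: float = 0.6) -> list[list[int]]:
    """Link articles whose embeddings exceed the similarity threshold,
    then take connected components as event clusters (union-find).
    Vectors are assumed unit-normalized, so a dot product is cosine similarity."""
    n = len(vectors)
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    sims = vectors @ vectors.T
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] >= threshold:
                parent[find(i)] = find(j)  # merge the two components

    groups: dict[int, list[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

Transitive linking suits event clustering better than a fixed-centroid scheme: two follow-up articles may each resemble the original report more than each other, yet all three belong to one story.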
News Deduplication
One event is typically covered by dozens of publications, so near-duplicate detection is essential:
import numpy as np
from sentence_transformers import SentenceTransformer

class SemanticDeduplicator:
    def __init__(self, threshold: float = 0.85):
        self.encoder = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
        self.threshold = threshold

    def deduplicate(self, articles: list[Article]) -> list[Article]:
        # Encode headline + lead; normalized vectors make dot product = cosine
        texts = [f"{a.title}. {a.lead}" for a in articles]
        embeddings = self.encoder.encode(
            texts, batch_size=256, normalize_embeddings=True
        )

        # Greedy grouping by cosine similarity against the first member of
        # each group. (At very large volumes, a MinHash LSH prefilter over
        # token shingles can cut down the comparisons before this step.)
        groups: list[list[Article]] = []
        centroids: list[np.ndarray] = []
        for article, emb in zip(articles, embeddings):
            for i, centroid in enumerate(centroids):
                if float(emb @ centroid) >= self.threshold:
                    groups[i].append(article)
                    break
            else:
                groups.append([article])
                centroids.append(emb)

        # From each group keep the primary source (earliest publication time)
        result = []
        for group in groups:
            primary = min(group, key=lambda a: a.published_at)
            primary.alternative_sources = [a.url for a in group if a is not primary]
            result.append(primary)
        return result
Multi-document Summarization for Cluster
Task: from 5-20 articles about one event, produce a brief summary without losing key details. A map-reduce strategy:
def summarize_cluster(articles: list[Article]) -> ClusterSummary:
    # Rank articles by source authority and completeness
    ranked = rank_articles_by_quality(articles)

    if len(articles) <= 3:
        # Small cluster: direct summarization over the concatenated texts
        combined = "\n\n".join(a.full_text for a in ranked[:3])
        summary = llm.generate(
            f"Briefly outline the key facts:\n{combined}", max_tokens=200
        )
    else:
        # Large cluster: map-reduce (summarize each article, then merge)
        individual_summaries = [
            llm.generate(
                f"Extract the key facts (2-3 sentences):\n{a.full_text}",
                max_tokens=100,
            )
            for a in ranked[:10]
        ]
        # Combine the unique facts into one coherent paragraph
        summary = llm.generate(
            "Create a coherent paragraph from these facts (no repeats):\n"
            + "\n".join(individual_summaries),
            max_tokens=200,
        )

    return ClusterSummary(
        headline=ranked[0].title,
        summary=summary,
        key_sources=[a.url for a in ranked[:3]],
        article_count=len(articles),
        topic_tags=extract_tags(articles),
    )
Personalization
Three levels of personalization:
Topic Interests: explicit (user-selected categories) plus implicit (clicks, read time). Collaborative filtering bootstraps new users with no history.
Content Depth: some users prefer a brief paragraph, others a detailed analysis. Inferred from reading behavior.
Delivery Format: email digest, Telegram bot, in-app push notifications, or an RSS feed. Frequency (morning, evening, weekly) is the user's choice.
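The first two levels can be sketched as a scoring function that PersonalizedRanker might apply per cluster. Everything below is an illustrative assumption: the UserProfile fields, the 0.5 weight on implicit signals, and the averaging are placeholders the text does not specify:

```python
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    # Hypothetical fields; the real profile schema is not shown in the text
    explicit_topics: set[str] = field(default_factory=set)           # chosen categories
    implicit_scores: dict[str, float] = field(default_factory=dict)  # from clicks / read time
    digest_size: int = 10

def interest_score(profile: UserProfile, topic_tags: list[str]) -> float:
    """Blend explicit picks (strong signal) with behavioral scores (weaker),
    averaged over the cluster's tags so tag count does not inflate the score."""
    score = 0.0
    for tag in topic_tags:
        if tag in profile.explicit_topics:
            score += 1.0
        score += 0.5 * profile.implicit_scores.get(tag, 0.0)
    return score / max(len(topic_tags), 1)
```

A production ranker would also fold in recency and source quality; this isolates only the interest-matching piece.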
Digest Quality Metrics
- Article CTR: the percentage of digest items the user opens (target: 15%+)
- Read-through rate: share of opened articles read to the end (target: 60%+)
- Diversity score: topic variety within the digest, so not every item covers one story
- Freshness: average lag from event to digest (target: under 4 hours for important news)
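Of these, the diversity score is the least standard. One common way to formalize "not all articles on one topic" is normalized Shannon entropy over the digest's topic tags; this sketch assumes one tag per digest item, which is a simplification:

```python
import math
from collections import Counter

def diversity_score(topic_tags: list[str]) -> float:
    """Normalized Shannon entropy of the topic distribution:
    0.0 = every item on one topic, 1.0 = items spread evenly across topics."""
    counts = Counter(topic_tags)
    if len(counts) <= 1:
        return 0.0
    total = len(topic_tags)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))  # divide by the maximum possible entropy
```

Normalizing by the maximum entropy makes digests of different sizes comparable on the same 0-1 scale.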