AI News Digest Generation System

We design and deploy artificial intelligence systems, from prototype to production-ready solution. Our team combines expertise in machine learning, data engineering, and MLOps to make AI work not just in the lab, but in real business operations.

Development of an AI System for News Digest Generation

A personalized news digest drawn from hundreds of sources is a task no human can handle manually in reasonable time. The AI system monitors sources, clusters publications by topic, removes duplicates, and generates a coherent digest for a specific user or audience segment.

Collection and Processing Pipeline

from datetime import datetime, timedelta

class NewsDigestPipeline:
    def __init__(self, sources: list[NewsSource]):
        self.crawler = NewsCrawler(sources)
        self.deduplicator = SemanticDeduplicator(threshold=0.85)
        self.clusterer = NewsClusterer()
        self.summarizer = NewsSummarizer()
        self.ranker = PersonalizedRanker()

    async def generate_digest(
        self,
        user_profile: UserProfile,
        period_hours: int = 24
    ) -> Digest:
        # 1. Collect news for period
        articles = await self.crawler.fetch_since(
            datetime.utcnow() - timedelta(hours=period_hours)
        )

        # 2. Remove duplicates (one story from 20 sources → 1 entry)
        unique_articles = self.deduplicator.deduplicate(articles)

        # 3. Cluster by events
        clusters = self.clusterer.cluster(unique_articles)

        # 4. Personalized cluster ranking
        ranked_clusters = self.ranker.rank(clusters, user_profile)

        # 5. Generate summary per cluster (multi-document summarization)
        summaries = [
            self.summarizer.summarize_cluster(cluster)
            for cluster in ranked_clusters[:user_profile.digest_size]
        ]

        return Digest(items=summaries, generated_at=datetime.utcnow())

News Deduplication

A single event is covered by dozens of publications. Near-duplicate detection handles this:

from sentence_transformers import SentenceTransformer

class SemanticDeduplicator:
    def __init__(self, threshold: float = 0.85):
        self.encoder = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
        self.threshold = threshold

    def deduplicate(self, articles: list[Article]) -> list[Article]:
        # Encode headline + lead for each article
        texts = [f"{a.title}. {a.lead}" for a in articles]
        embeddings = self.encoder.encode(
            texts, batch_size=256, normalize_embeddings=True
        )

        # Group near-duplicates: with normalized embeddings the dot
        # product is the cosine similarity (for large volumes, an ANN
        # index such as FAISS avoids this O(n^2) scan)
        groups: list[list[Article]] = []
        reps: list[int] = []  # embedding index of each group's first member
        for i, article in enumerate(articles):
            for k, rep in enumerate(reps):
                if float(embeddings[i] @ embeddings[rep]) >= self.threshold:
                    groups[k].append(article)
                    break
            else:
                groups.append([article])
                reps.append(i)

        # From each group keep the primary source (earliest publication time)
        result = []
        for group in groups:
            primary = min(group, key=lambda a: a.published_at)
            primary.alternative_sources = [a.url for a in group if a is not primary]
            result.append(primary)

        return result
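
The NewsClusterer referenced in the pipeline is not shown above. A minimal sketch of event clustering over precomputed embeddings might use greedy centroid assignment; the function name and the threshold value here are illustrative assumptions, not the production implementation:

```python
import numpy as np

def cluster_events(embeddings: np.ndarray, threshold: float = 0.6) -> list[list[int]]:
    """Greedy single-pass clustering over embeddings.

    Each item joins the first cluster whose normalized centroid is within
    the cosine-similarity threshold, otherwise it starts a new cluster.
    Returns clusters as lists of row indices into `embeddings`.
    """
    clusters: list[list[int]] = []
    sums: list[np.ndarray] = []  # running vector sum per cluster
    for i, vec in enumerate(embeddings):
        vec = vec / np.linalg.norm(vec)
        for k, total in enumerate(sums):
            centroid = total / np.linalg.norm(total)
            if float(vec @ centroid) >= threshold:
                clusters[k].append(i)
                sums[k] = total + vec
                break
        else:
            clusters.append([i])
            sums.append(vec.copy())
    return clusters
```

A production clusterer would also need to merge clusters across digest periods and cap cluster growth, but the single-pass variant keeps latency predictable.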

Multi-document Summarization for Cluster

Task: condense 5-20 articles about a single event into a brief summary without losing key details. A map-reduce strategy:

def summarize_cluster(articles: list[Article]) -> ClusterSummary:
    # Rank articles by source authority and completeness
    ranked = rank_articles_by_quality(articles)

    if len(articles) <= 3:
        # Small cluster — direct summarization
        combined = "\n\n".join(a.full_text for a in ranked[:3])
        summary = llm.generate(f"Briefly outline key facts:\n{combined}", max_tokens=200)
    else:
        # Large cluster — map-reduce
        individual_summaries = [
            llm.generate(f"Extract key facts (2-3 sentences):\n{a.full_text}", max_tokens=100)
            for a in ranked[:10]
        ]
        # Combine unique facts
        summary = llm.generate(
            "Create a coherent paragraph from these facts (no repeats):\n"
            + "\n".join(individual_summaries),
            max_tokens=200
        )

    return ClusterSummary(
        headline=ranked[0].title,
        summary=summary,
        key_sources=[a.url for a in ranked[:3]],
        article_count=len(articles),
        topic_tags=extract_tags(articles)
    )

Personalization

Three levels of personalization:

Topic Interests: explicit (categories the user selected) plus implicit (clicks, read time). Collaborative filtering covers cold start for new users.

Content Depth: some readers prefer a brief paragraph, others want a detailed analysis; the preference is inferred from reading behavior.

Delivery Format: email digest, Telegram bot, in-app push notifications, or RSS feed. Frequency (morning, evening, or weekly) is the user's choice.
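
The PersonalizedRanker from the pipeline can be sketched at the topic-interest level alone. The cluster fields, the `UserProfile` shape, and the tie-breaking rule below are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    # interest weight per topic, e.g. learned from clicks and read time
    topic_weights: dict[str, float] = field(default_factory=dict)

def rank_clusters(clusters: list[dict], profile: UserProfile) -> list[dict]:
    """Order event clusters by summed user interest over their topic tags.

    Cluster size (number of sources covering the event) breaks ties,
    acting as a rough proxy for story importance.
    """
    def score(cluster: dict) -> tuple[float, int]:
        interest = sum(profile.topic_weights.get(t, 0.0) for t in cluster["tags"])
        return (interest, cluster["article_count"])
    return sorted(clusters, key=score, reverse=True)
```

A fuller ranker would blend in recency and the diversity constraint from the metrics below, but a weighted-sum score is a reasonable baseline.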

Digest Quality Metrics

  • Article CTR: share of digest items the user opens; target 15%+
  • Read-through rate: share of opened items read to completion; target 60%+
  • Diversity score: topic variety across the digest, so a single topic does not dominate
  • Freshness: average time from event to digest delivery; target under 4 hours for important news
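
The diversity score can be made concrete as normalized topic entropy; this particular formulation is an assumption, one of several reasonable choices:

```python
from collections import Counter
import math

def diversity_score(topic_tags: list[str]) -> float:
    """Normalized Shannon entropy of the digest's topic distribution:
    0.0 when every item shares one topic, 1.0 for a perfectly even spread.
    """
    counts = Counter(topic_tags)
    if len(counts) <= 1:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    # divide by the maximum possible entropy for this many distinct topics
    return entropy / math.log(len(counts))
```

Tracking this per delivered digest makes it easy to alert when personalization starts collapsing a user's feed onto a single topic.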