Development of AI System for News Digest Generation
A personalized news digest drawn from hundreds of sources is a task no human can handle manually in reasonable time. An AI system monitors the sources, clusters publications by topic, removes duplicates, and generates a coherent digest for a specific user or audience segment.
Collection and Processing Pipeline
from datetime import datetime, timedelta

class NewsDigestPipeline:
    def __init__(self, sources: list[NewsSource]):
        self.crawler = NewsCrawler(sources)
        self.deduplicator = SemanticDeduplicator(threshold=0.85)
        self.clusterer = NewsClusterer()
        self.summarizer = NewsSummarizer()
        self.ranker = PersonalizedRanker()

    async def generate_digest(
        self,
        user_profile: UserProfile,
        period_hours: int = 24,
    ) -> Digest:
        # 1. Collect news for the period
        articles = await self.crawler.fetch_since(
            datetime.utcnow() - timedelta(hours=period_hours)
        )
        # 2. Remove duplicates (one story from 20 sources → 1 entry)
        unique_articles = self.deduplicator.deduplicate(articles)
        # 3. Cluster by events
        clusters = self.clusterer.cluster(unique_articles)
        # 4. Personalized cluster ranking
        ranked_clusters = self.ranker.rank(clusters, user_profile)
        # 5. Generate a summary per cluster (multi-document summarization)
        summaries = [
            self.summarizer.summarize_cluster(cluster)
            for cluster in ranked_clusters[:user_profile.digest_size]
        ]
        return Digest(items=summaries, generated_at=datetime.utcnow())
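Step 3 depends on NewsClusterer, which is not shown in the pipeline. A minimal sketch of event clustering under stated assumptions: articles are already embedded as unit-normalized vectors, similar pairs are linked, and connected components (via union-find) become event clusters. The 0.6 threshold and function name are illustrative, not the production algorithm:

```python
import numpy as np

def cluster_by_event(vectors: np.ndarray, threshold: float = 0.6) -> list[list[int]]:
    """Link articles whose embeddings exceed the similarity threshold,
    then take connected components as event clusters (union-find).
    Vectors are assumed unit-normalized, so a dot product is cosine similarity."""
    n = len(vectors)
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    sims = vectors @ vectors.T
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] >= threshold:
                parent[find(i)] = find(j)  # merge the two components

    groups: dict[int, list[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

Transitive linking suits event clustering better than a fixed-centroid scheme: two follow-up articles may each resemble the original report more than each other, yet all three belong to one story.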
News Deduplication
One event is typically covered by dozens of publications, so near-duplicate detection is essential:
import numpy as np
from sentence_transformers import SentenceTransformer

class SemanticDeduplicator:
    def __init__(self, threshold: float = 0.85):
        self.encoder = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
        self.threshold = threshold

    def deduplicate(self, articles: list[Article]) -> list[Article]:
        # Encode headline + lead; normalized vectors make dot product = cosine
        texts = [f"{a.title}. {a.lead}" for a in articles]
        embeddings = self.encoder.encode(
            texts, batch_size=256, normalize_embeddings=True
        )

        # Greedy grouping by cosine similarity against the first member of
        # each group. (At very large volumes, a MinHash LSH prefilter over
        # token shingles can cut down the comparisons before this step.)
        groups: list[list[Article]] = []
        centroids: list[np.ndarray] = []
        for article, emb in zip(articles, embeddings):
            for i, centroid in enumerate(centroids):
                if float(emb @ centroid) >= self.threshold:
                    groups[i].append(article)
                    break
            else:
                groups.append([article])
                centroids.append(emb)

        # From each group keep the primary source (earliest publication time)
        result = []
        for group in groups:
            primary = min(group, key=lambda a: a.published_at)
            primary.alternative_sources = [a.url for a in group if a is not primary]
            result.append(primary)
        return result
Multi-document Summarization for Cluster
Task: from 5-20 articles about one event, produce a brief summary without losing key details. A map-reduce strategy:
def summarize_cluster(articles: list[Article]) -> ClusterSummary:
    # Rank articles by source authority and completeness
    ranked = rank_articles_by_quality(articles)

    if len(articles) <= 3:
        # Small cluster: direct summarization over the concatenated texts
        combined = "\n\n".join(a.full_text for a in ranked[:3])
        summary = llm.generate(
            f"Briefly outline the key facts:\n{combined}", max_tokens=200
        )
    else:
        # Large cluster: map-reduce (summarize each article, then merge)
        individual_summaries = [
            llm.generate(
                f"Extract the key facts (2-3 sentences):\n{a.full_text}",
                max_tokens=100,
            )
            for a in ranked[:10]
        ]
        # Combine the unique facts into one coherent paragraph
        summary = llm.generate(
            "Create a coherent paragraph from these facts (no repeats):\n"
            + "\n".join(individual_summaries),
            max_tokens=200,
        )

    return ClusterSummary(
        headline=ranked[0].title,
        summary=summary,
        key_sources=[a.url for a in ranked[:3]],
        article_count=len(articles),
        topic_tags=extract_tags(articles),
    )
Personalization
Three levels of personalization:
Topic Interests: explicit (user-selected categories) plus implicit (clicks, read time). Collaborative filtering bootstraps new users with no history.
Content Depth: some users prefer a brief paragraph, others a detailed analysis. Inferred from reading behavior.
Delivery Format: email digest, Telegram bot, in-app push notifications, or an RSS feed. Frequency (morning, evening, weekly) is the user's choice.
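The first two levels can be sketched as a scoring function that PersonalizedRanker might apply per cluster. Everything below is an illustrative assumption: the UserProfile fields, the 0.5 weight on implicit signals, and the averaging are placeholders the text does not specify:

```python
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    # Hypothetical fields; the real profile schema is not shown in the text
    explicit_topics: set[str] = field(default_factory=set)           # chosen categories
    implicit_scores: dict[str, float] = field(default_factory=dict)  # from clicks / read time
    digest_size: int = 10

def interest_score(profile: UserProfile, topic_tags: list[str]) -> float:
    """Blend explicit picks (strong signal) with behavioral scores (weaker),
    averaged over the cluster's tags so tag count does not inflate the score."""
    score = 0.0
    for tag in topic_tags:
        if tag in profile.explicit_topics:
            score += 1.0
        score += 0.5 * profile.implicit_scores.get(tag, 0.0)
    return score / max(len(topic_tags), 1)
```

A production ranker would also fold in recency and source quality; this isolates only the interest-matching piece.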
Digest Quality Metrics
- Article CTR: the percentage of digest items the user opens (target: 15%+)
- Read-through rate: share of opened articles read to the end (target: 60%+)
- Diversity score: topic variety within the digest, so not every item covers one story
- Freshness: average lag from event to digest (target: under 4 hours for important news)
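Of these, the diversity score is the least standard. One common way to formalize "not all articles on one topic" is normalized Shannon entropy over the digest's topic tags; this sketch assumes one tag per digest item, which is a simplification:

```python
import math
from collections import Counter

def diversity_score(topic_tags: list[str]) -> float:
    """Normalized Shannon entropy of the topic distribution:
    0.0 = every item on one topic, 1.0 = items spread evenly across topics."""
    counts = Counter(topic_tags)
    if len(counts) <= 1:
        return 0.0
    total = len(topic_tags)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))  # divide by the maximum possible entropy
```

Normalizing by the maximum entropy makes digests of different sizes comparable on the same 0-1 scale.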