Text Summarization Implementation (Extractive/Abstractive)
Summarization comes in two fundamentally different types: extractive (selecting sentences from the original) and abstractive (generating new text). The choice depends on quality requirements, tolerance for hallucinations, and computational resources.
Extractive Summarization
Extractive summarization selects the most important sentences from the source text without rewording them. Advantage: no hallucinations, since everything is taken verbatim from the original. Disadvantage: the result may read as disconnected, with context lost between the selected sentences.
Methods:
- TextRank: build a sentence graph, rank sentences with PageRank. Implementations: sumy (Python), pytextrank (spaCy).
- Sentence embeddings + clustering: cluster sentences semantically and select the sentence closest to each centroid (see the sketch after the code below).
- BERTSum: BERT for sentence importance scoring.
```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

# Parse raw text with a language-aware sentence tokenizer
parser = PlaintextParser.from_string(text, Tokenizer("russian"))

# Rank sentences with TextRank and keep the 5 highest-scoring ones
summarizer = TextRankSummarizer()
summary = summarizer(parser.document, sentences_count=5)
print(" ".join(str(sentence) for sentence in summary))
```
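The embeddings-plus-clustering method from the list above can be sketched with sentence-transformers and scikit-learn. The encoder model and cluster count here are assumptions, not part of the original recipe:

```python
# Extractive summary via embedding clustering: a minimal sketch.
# The encoder model is an assumption; any multilingual sentence encoder works.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def cluster_summary(sentences: list[str], n_sentences: int = 5) -> list[str]:
    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    embeddings = encoder.encode(sentences)
    # One cluster per desired summary sentence
    kmeans = KMeans(n_clusters=n_sentences, n_init=10).fit(embeddings)
    picked = set()
    for center in kmeans.cluster_centers_:
        # Take the sentence closest to each cluster centroid
        picked.add(int(np.argmin(np.linalg.norm(embeddings - center, axis=1))))
    # Restore original document order for readability
    return [sentences[i] for i in sorted(picked)]
```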
Abstractive Summarization
Abstractive summarization generates new text that may contain no literal fragments of the original. The result is more readable, but carries a risk of hallucinations.
Russian Models:
- IlyaGusev/rut5-base-absum: T5 fine-tuned on Russian news (example below)
- IlyaGusev/bart-base-ru-giga: BART for Russian
- GPT-4o / Claude via prompt: best quality, but costlier and slower
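A minimal sketch of running the first model with Hugging Face transformers; the generation parameters are assumptions to tune, and the model id is taken from the list above (verify it resolves on the Hub):

```python
# Abstractive summarization with the T5 checkpoint named above
# (generation parameters are assumptions; tune for your texts).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "IlyaGusev/rut5-base-absum"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer(text, max_length=1024, truncation=True, return_tensors="pt")
output_ids = model.generate(
    **inputs, max_new_tokens=150, num_beams=4, no_repeat_ngram_size=3
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```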
Prompt for GPT:
Briefly summarize the following text in 3–5 sentences.
Do not add information that is not in the text.
Preserve key facts: numbers, names, dates.
Text: {text}
Long Document Summarization
Documents longer than the model's context window require a chunking strategy:
- Map-Reduce: summarize each chunk independently → summarize the summaries (see the sketch below)
- Refine: incrementally update a running summary as chunks are read
- Hierarchical: summarize sections → summarize the section summaries
For legal and technical documents, the hierarchical approach is preferable: it preserves the document's structure.
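A minimal Map-Reduce sketch over the OpenAI API; the chunk size, model choice, and character-based splitting are assumptions (a production version would split on sentence or section boundaries):

```python
# Map-Reduce summarization sketch. Chunking by character count is a
# simplification; splitting on sentence/section boundaries works better.
from openai import OpenAI

client = OpenAI()

def summarize(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "Briefly summarize the following text in 3-5 sentences. "
                       "Do not add information that is not in the text. "
                       f"Preserve key facts: numbers, names, dates.\n\nText: {text}",
        }],
    )
    return response.choices[0].message.content

def map_reduce_summary(document: str, chunk_size: int = 8000) -> str:
    # Map: summarize each chunk independently
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    partials = [summarize(chunk) for chunk in chunks]
    # Reduce: summarize the concatenated partial summaries
    return summarize("\n\n".join(partials))
```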
Quality Assessment
ROUGE (Recall-Oriented Understudy for Gisting Evaluation): the standard metric; measures n-gram overlap with a reference summary. Common variants: ROUGE-1, ROUGE-2, ROUGE-L.
BERTScore: semantic similarity via BERT embeddings; correlates better with human judgments than ROUGE.
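Both metrics can be computed with the rouge-score and bert-score packages; note that rouge-score's optional stemmer is English-oriented, so for Russian treat its numbers as a rough signal:

```python
# Compute ROUGE and BERTScore for one candidate/reference pair.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
rouge = scorer.score(reference_summary, candidate_summary)
print(rouge["rougeL"].fmeasure)

# BERTScore selects a language-appropriate model via lang=
P, R, F1 = bert_score([candidate_summary], [reference_summary], lang="ru")
print(F1.mean().item())
```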
For production, user ratings (thumbs up/down) matter more than automatic metrics: ROUGE handles abstractive summaries poorly, since a good summary can share few n-grams with the reference.
Approach Selection
| Scenario | Recommendation |
|---|---|
| News texts, speed important | TextRank or rut5-base-absum |
| Legal/medical documents | Extractive (no hallucinations) |
| Business reports, quality important | GPT-4o with Map-Reduce |
| High load (>100 req/s) | Distilled T5 + ONNX |
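For the high-load row, a hedged sketch of exporting a seq2seq model to ONNX Runtime via Hugging Face Optimum; the model id is reused from the list above, and the distillation step itself is not shown:

```python
# Export a seq2seq checkpoint to ONNX Runtime with Hugging Face Optimum
# (requires `pip install optimum[onnxruntime]`; distillation not shown).
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer

model_name = "IlyaGusev/rut5-base-absum"  # id from the model list above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = ORTModelForSeq2SeqLM.from_pretrained(model_name, export=True)

# The ORT model keeps the regular generate() interface
inputs = tokenizer(text, truncation=True, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=150)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```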