Keyword/Keyphrase Extraction Implementation
Keywords are short n-grams that reflect a document's main topics. Applications are broad: indexing, search, content tagging, automatic annotation.
Extraction Methods
Statistical methods (fast, no training required):
- YAKE (Yet Another Keyword Extractor): accounts for word position, collocations, and frequency. Works without a reference corpus, ~5 ms/document
- RAKE (Rapid Automatic Keyword Extraction): splits text into candidate phrases at stopwords, scores them via word co-occurrence
- TF-IDF: takes the top words by TF-IDF weight; effective when a corpus is available to compute IDF
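As an illustration of the statistical approach, here is a minimal, dependency-free sketch of RAKE-style scoring (the tiny stopword list and the degree/frequency scoring are simplified for demonstration, not the reference implementation):

```python
import re
from collections import defaultdict

# Illustrative stopword list; real RAKE uses a much larger one.
STOPWORDS = {"the", "a", "an", "of", "and", "or", "in", "on", "is", "are",
             "for", "to", "with", "by", "that", "this", "it", "as", "at", "be"}

def rake(text, top_n=5):
    # Split text into candidate phrases at stopwords and punctuation.
    words = re.findall(r"[a-zA-Z]+", text.lower())
    phrases, current = [], []
    for w in words:
        if w in STOPWORDS:
            if current:
                phrases.append(current)
                current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)

    # Word score = degree(w) / freq(w), where degree counts co-occurrence
    # within candidate phrases (including the word itself).
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}

    # Phrase score = sum of its member word scores.
    scored = [(" ".join(p), sum(word_score[w] for w in p)) for p in phrases]
    scored.sort(key=lambda x: -x[1])
    return scored[:top_n]
```

Longer phrases whose words co-occur often score highest, which is why RAKE favors multi-word keyphrases.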
Graph-based methods:
- TextRank (a PageRank analog for words): builds a word co-occurrence graph and ranks nodes by centrality. Implementations: gensim, pytextrank
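The co-occurrence-graph idea behind TextRank can be sketched with plain Python (tokens are assumed to be pre-filtered, e.g. nouns and adjectives after stopword removal; window size and damping follow the common TextRank defaults, but this is an illustrative sketch, not the gensim or pytextrank implementation):

```python
from collections import defaultdict

def textrank(tokens, window=2, damping=0.85, iters=30, top_n=5):
    # Build an undirected co-occurrence graph over a sliding window.
    graph = defaultdict(set)
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[i] != tokens[j]:
                graph[tokens[i]].add(tokens[j])
                graph[tokens[j]].add(tokens[i])

    # PageRank iteration: a word's score is fed by its neighbors,
    # each distributing its score over its own degree.
    score = {w: 1.0 for w in graph}
    for _ in range(iters):
        score = {
            w: (1 - damping) + damping * sum(score[n] / len(graph[n])
                                             for n in graph[w])
            for w in graph
        }
    return sorted(score, key=score.get, reverse=True)[:top_n]
```

Words that co-occur with many distinct, well-connected words accumulate the highest rank, which is the core intuition carried over from PageRank.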
Semantic methods (best quality):
- KeyBERT: compares document and candidate-phrase embeddings via cosine similarity
from keybert import KeyBERT
kw_model = KeyBERT(model="cointegrated/rubert-tiny2")  # compact Russian BERT
# Candidate phrases of 1-3 words; return the 10 closest to the document embedding
keywords = kw_model.extract_keywords(text, keyphrase_ngram_range=(1, 3), top_n=10)
For the Russian Language
Statistical methods perform poorly on Russian without lemmatization. A correct pipeline: lemmatization (pymorphy3) → YAKE/KeyBERT. KeyBERT with rubert-tiny2 yields good quality at roughly 50 ms/document.
Production Application
Typical task: tagging 10K articles daily. A practical stack: YAKE for speed, plus KeyBERT for high-priority documents. Results are normalized (lemmatization, lowercasing, deduplication) and saved to a search index (an Elasticsearch keywords field).
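The normalization step before indexing can be sketched as follows (lemmatization is omitted here to keep the sketch dependency-free; in the full pipeline it would run, e.g. via pymorphy3, before this function):

```python
def normalize_keywords(raw_keywords):
    # Lowercase, collapse whitespace, and deduplicate
    # while preserving the original ranking order.
    seen, out = set(), []
    for kw in raw_keywords:
        norm = " ".join(kw.lower().split())
        if norm and norm not in seen:
            seen.add(norm)
            out.append(norm)
    return out
```

Preserving rank order matters because extractors return keywords sorted by relevance, and the search index should receive them in that order.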