Prodigy Integration for Data Labeling
Prodigy is a commercial annotation tool from Explosion, the creators of spaCy. It focuses on NLP tasks: NER, text classification, semantic similarity. Active learning is built in: the model learns as you annotate and directs the annotator to the most informative examples.
Prodigy Advantages
- Active Learning: no need to label everything. Recipes such as ner.teach prioritize examples where the model is least confident, so each labeled example contributes more to the model
- Built-in recipes: ready workflows for NER, classification, comparison
- spaCy integration: annotation, training, model update and selection of new examples form one continuous loop
- Human-in-the-loop: model proposes annotations, human corrects
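The "least confident first" idea can be illustrated with a short sketch. It assumes a binary score between 0 and 1 for each example; the function name rank_by_uncertainty and the sample scores are made up for illustration. Prodigy's own sorters work on streams and are more involved, so this shows only the principle:

```python
from typing import List, Tuple


def rank_by_uncertainty(scored: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Order (text, score) pairs so that scores closest to 0.5 come first.

    A score near 0.5 means the model is unsure, so a human answer
    on that example is the most informative.
    """
    return sorted(scored, key=lambda pair: abs(pair[1] - 0.5))


# Hypothetical model scores, for illustration only
scored_examples = [
    ("Gazprom signed an agreement.", 0.97),
    ("Apple fell from the tree.", 0.52),
    ("Yandex opened an office.", 0.10),
]

queue = rank_by_uncertainty(scored_examples)
for text, score in queue:
    print(f"{score:.2f}  {text}")
```

The example the model is least sure about is shown to the annotator first; the confident predictions move to the end of the queue.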
Installation and Setup
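# Prodigy is installed from its own download server; replace XXXX-XXXX-XXXX-XXXX with your license key
pip install prodigy -f https://XXXX-XXXX-XXXX-XXXX@download.prodi.gy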
prodigy ner.manual my_ner_dataset blank:ru texts.jsonl --label PER,ORG,LOC
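Or with active learning, once a model already predicts the target labels (ru_core_news_lg covers PER, ORG and LOC; custom labels such as PRODUCT or FEATURE first need a model trained on manual annotations):
prodigy ner.teach my_ner_dataset ru_core_news_lg texts.jsonl --label PER,ORG,LOC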
Data Formats
Input data is JSONL, each line is one example:
{"text": "Gazprom signed an agreement with Deutsche Bank in Berlin."}
{"text": "Ivan Petrov, CEO of Yandex, spoke at the conference."}
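A file in this format can be produced with a few lines of Python. This is a minimal sketch: the helper name write_jsonl is an assumption, and the texts are the two examples above.

```python
import json
from pathlib import Path
from typing import Iterable


def write_jsonl(path: Path, texts: Iterable[str]) -> None:
    """Write one {"text": ...} JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for text in texts:
            # ensure_ascii=False keeps Cyrillic and other non-ASCII text readable
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")


texts = [
    "Gazprom signed an agreement with Deutsche Bank in Berlin.",
    "Ivan Petrov, CEO of Yandex, spoke at the conference.",
]
write_jsonl(Path("texts.jsonl"), texts)
```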
Export labeled data for spaCy training:
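# Prodigy v1.11+: writes train.spacy, dev.spacy and config.cfg to ./corpus
prodigy data-to-spacy ./corpus --ner my_ner_dataset --eval-split 0.2 --lang ru
python -m spacy train ./corpus/config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy --output ./model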
Workflows for Different Tasks
Text Classification:
prodigy textcat.manual news_cats texts.jsonl \
  --label POSITIVE,NEGATIVE,NEUTRAL
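Semantic Similarity (data for sentence-transformers training):
Prodigy ships no dedicated sentence-pair recipe, and pos.teach is for part-of-speech tagging. One option is the generic mark recipe with accept/reject answers, assuming each line of sentence_pairs.jsonl carries both sentences in its "text" field:
prodigy mark similarity_dataset sentence_pairs.jsonl --view-id text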
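A minimal sketch of how sentence_pairs.jsonl might be prepared, assuming each pair is shown to the annotator as a single "text" field. The keys sentence_a and sentence_b are hypothetical names used here to keep the raw pair alongside the displayed text, and the sample pairs are invented:

```python
import json
from typing import Dict, List, Tuple


def make_pair_tasks(pairs: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Turn sentence pairs into annotation tasks with one display text each."""
    tasks = []
    for a, b in pairs:
        tasks.append({
            "text": f"A: {a}\nB: {b}",  # what the annotator sees
            "sentence_a": a,            # hypothetical extra keys that keep
            "sentence_b": b,            # the raw pair for later training
        })
    return tasks


# Invented sample pairs, for illustration only
pairs = [
    ("The bank raised interest rates.", "Interest rates were increased by the bank."),
    ("The bank raised interest rates.", "We walked along the river bank."),
]
tasks = make_pair_tasks(pairs)
jsonl = "\n".join(json.dumps(task, ensure_ascii=False) for task in tasks)
print(jsonl)
```

Accepted tasks can then serve as positive pairs and rejected ones as negatives when building a training set.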
Entity Relationship Labeling:
prodigy rel.manual rel_dataset blank:ru texts.jsonl \
--label WORKS_AT,LOCATED_IN
Production Pipeline Integration
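# Export annotations from Prodigy
from datasets import Dataset
from prodigy.components.db import connect

db = connect()  # uses the database settings from prodigy.json
# Prodigy v1.12+ names this method get_dataset_examples; older versions use db.get_dataset
examples = db.get_dataset_examples("my_ner_dataset")

# Convert accepted examples to a Hugging Face dataset.
# convert_spans_to_bio is a user-defined helper (not part of Prodigy) that maps spans to per-token BIO tags.
hf_dataset = Dataset.from_list([
    {"tokens": [t["text"] for t in ex["tokens"]], "labels": convert_spans_to_bio(ex)}
    for ex in examples
    if ex["answer"] == "accept"
])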
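The convert_spans_to_bio helper used above is not part of Prodigy and has to be written by hand. One possible sketch, assuming each example carries "tokens" and "spans" in the form produced by ner.manual, where every span has token_start and token_end indices with token_end inclusive:

```python
from typing import Any, Dict, List


def convert_spans_to_bio(example: Dict[str, Any]) -> List[str]:
    """Map annotated spans to one BIO tag per token."""
    labels = ["O"] * len(example["tokens"])
    for span in example.get("spans", []):
        start, end = span["token_start"], span["token_end"]  # end is inclusive
        labels[start] = "B-" + span["label"]
        for i in range(start + 1, end + 1):
            labels[i] = "I-" + span["label"]
    return labels


# Hand-written example in the assumed task format
example = {
    "text": "Ivan Petrov works at Yandex",
    "tokens": [
        {"text": "Ivan", "start": 0, "end": 4, "id": 0},
        {"text": "Petrov", "start": 5, "end": 11, "id": 1},
        {"text": "works", "start": 12, "end": 17, "id": 2},
        {"text": "at", "start": 18, "end": 20, "id": 3},
        {"text": "Yandex", "start": 21, "end": 27, "id": 4},
    ],
    "spans": [
        {"start": 0, "end": 11, "token_start": 0, "token_end": 1, "label": "PER"},
        {"start": 21, "end": 27, "token_start": 4, "token_end": 4, "label": "ORG"},
    ],
    "answer": "accept",
}

bio_tags = convert_spans_to_bio(example)
print(bio_tags)
```

For training a token classifier the string tags still have to be mapped to integer ids; that step depends on the label set and is left out here.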
Cost and Alternatives
Prodigy: $490 (one-time license for personal use), $790 for teams. Open-source alternatives: Label Studio (more formats, complex UI), Doccano (simpler, basic tasks only), Argilla (data quality + labeling).
For NER tasks with active learning, Prodigy remains a strong choice despite the cost: model suggestions and uncertainty-based example selection can cut annotator time by a factor of 2-3 compared to fully manual labeling.







