How long does it take to develop a custom data labeling platform?

A basic version based on Label Studio with custom backend can be deployed in 2 weeks. A full-featured platform with pre-annotation, Active Learning, and IAA quality control takes 3 to 6 weeks depending on the complexity of the labeling types.

What tech stack is used for the platform?

Backend: Python (FastAPI, Celery). Frontend: customized Label Studio or a React UI. Pre-annotation models: HuggingFace Transformers, GLiNER, zero-shot NLI. Vector storage: ChromaDB. Task queues: RabbitMQ. Storage: PostgreSQL.

How do you ensure labeling quality?

We use IAA (Inter-Annotator Agreement): Cohen's Kappa for classification, F1 agreement for NER. Gold standard — 10% of tasks are checked by a senior annotator. Automatic review pipeline triggers when IAA is low. Disputed cases are resolved via LLM arbitration.

What is Active Learning and why is it needed?

Active Learning is a loop where the model selects the most informative (hard) examples for manual labeling. This reduces the amount of labeled data by 3–5 times without sacrificing final model quality. We commonly use uncertainty (entropy) or core-set diversity strategies.

Which export formats are supported?

We support JSONL (for text models), COCO (segmentation), YOLO (bounding boxes), CSV. Direct integration with Hugging Face Datasets and PyTorch DataLoader. Export to MLflow for dataset versioning is also available.

Custom Data Labeling Platform with Active Learning

We design and deploy artificial intelligence systems: from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering and MLOps to make AI work not in the lab, but in real business.

Custom Data Labeling Platform: When Off-the-Shelf Solutions Don't Fit

Data labeling is the bottleneck of any ML project. On one project, a platform for classifying medical reports, the client was manually labeling 500 documents per week with three experts. Our team of experienced engineers has delivered over a dozen custom labeling platforms with guaranteed quality. Our custom platform is 6 times more efficient than manual labeling, saving $12,000 per month on labor costs.

Off-the-shelf solutions like Label Studio or Supervisely often don't cover specific needs: integration with your own model, non-standard labeling types (hierarchical classification with 10,000+ classes), quality control via IAA, pre-annotation with weak models, or closed-loop Active Learning. Over 10+ projects, we've encountered everything from queues breaking at 50K tasks to real-time annotation sync issues.

How Active Learning Reduces Labeling Costs

In a typical NLP or Computer Vision project, data labeling consumes 60–80% of the time. A manual process without pipelines leads to three main problems: duplicate tasks (one document sent to two annotators without aggregation), annotator idle time due to manual assignment, and systematic omission of hard examples — the model trains on easy cases and fails on production data. Our platform solves this with a unified API: ingest → pre-annotation → queue → quality control → export → adaptive sampling. Throughput increases 3–5x with the same number of people. At an annotator's hourly rate of ~$15, this saves $4,000–$8,000 per month for a 5-person team.

Why a Custom Platform Beats Off-the-Shelf Solutions

Quality control without manual re-checks. A typical scenario: two annotators label the same text but disagree in 30% of cases. Without IAA, you don't know who is correct. We implement Cohen's Kappa (classification) and F1 agreement (NER), automatically sending disputed tasks for review. The quality threshold is configurable per project — typically 0.8–0.85.

Pre-annotation cuts labor costs by 40–70%. We use weak models: zero-shot NLI from Facebook (bart-large-mnli) for classification or GLiNER for NER. If the confidence of the prediction is above 0.85, the task is automatically accepted; the annotator only confirms. Our tests on a 10K document dataset showed that 60% of tasks pass auto-validation with 97% accuracy.

Active Learning: the model chooses what to label. The basic strategy is uncertainty sampling: select the examples with the highest prediction entropy. This yields a 5–10% improvement in model quality compared to random sampling. For production, we use a hybrid: 70% uncertainty + 30% diversity (core-set) to avoid getting stuck on similar examples.

Platform Architecture

[Raw Data Sources]
↓
[Ingestion & Preprocessing]   ← format conversion, deduplication
↓
[Pre-annotation (weak models)] ← saves 40-70%
↓
[Task Queue Management]        ← distribution
↓
[Annotation Interface]         ← Label Studio / custom UI
↓
[Quality Control]              ← IAA, gold standard
↓
[Export & Model Training]      ← JSONL, COCO, YOLO
↓
[Active Learning Loop]         ← complex examples

Key Platform Modules

Task and Annotator Management

from anthropic import Anthropic
from enum import Enum
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional
import uuid
import numpy as np

class TaskStatus(Enum):
    PENDING = "pending"
    PRE_ANNOTATED = "pre_annotated"
    IN_REVIEW = "in_review"
    COMPLETED = "completed"
    DISPUTED = "disputed"

@dataclass
class AnnotationTask:
    task_id: str
    data: dict          # raw data (text, image_url, etc.)
    task_type: str      # classification, ner, segmentation
    annotations: list = field(default_factory=list)
    pre_annotation: Optional[dict] = None   # weak-model output, if any
    status: TaskStatus = TaskStatus.PENDING
    assigned_to: list = field(default_factory=list)  # annotator IDs
    created_at: datetime = field(default_factory=datetime.now)
    difficulty_score: float = 0.5   # 0 = easy, 1 = hard; 0.5 until estimated

class AnnotationPlatform:
    def __init__(self, db_connection):
        self.db = db_connection
        self.llm = Anthropic()
        self.quality_threshold = 0.8  # Minimum IAA
        self.annotators_per_task = 2

    def ingest_data(self, raw_data: list[dict], task_type: str) -> list[AnnotationTask]:
        """Ingest data and create tasks"""
        tasks = []
        for item in raw_data:
            task = AnnotationTask(
                task_id=str(uuid.uuid4()),
                data=item,
                task_type=task_type
            )
            tasks.append(task)

        # Pre-estimate difficulty
        tasks = self._estimate_difficulty(tasks)

        # Prioritize: easy tasks first for quick start
        tasks.sort(key=lambda t: t.difficulty_score)

        return tasks

    def _estimate_difficulty(self, tasks: list[AnnotationTask]) -> list[AnnotationTask]:
        """LLM-based difficulty estimation for prioritization"""
        # Batch evaluation via LLM
        sample_texts = [t.data.get('text', '')[:200] for t in tasks[:20]]
        if not any(sample_texts):
            return tasks

        text_list = "\n".join([f"{i+1}. {t}" for i, t in enumerate(sample_texts)])

        response = self.llm.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=300,
            messages=[{
                "role": "user",
                "content": f"""Rate the annotation difficulty of these texts (0-1, where 1 is hardest).
Consider: ambiguity, domain specificity, length complexity.

Texts:
{text_list}

Return only comma-separated scores, e.g.: 0.3, 0.7, 0.5..."""
            }]
        )

        try:
            scores = [float(s.strip()) for s in response.content[0].text.split(',')]
            for task, score in zip(tasks, scores):
                task.difficulty_score = min(max(score, 0.0), 1.0)  # clamp to [0, 1]
        except ValueError:
            pass  # unparseable reply: keep the default difficulty of 0.5

        return tasks

Quality Control via IAA

    # Continuation of AnnotationPlatform: the methods below belong to the class defined above
    def compute_iaa(self, annotations: list[dict], task_type: str) -> float:
        """
        Inter-Annotator Agreement:
        - Classification: Cohen's Kappa
        - NER: F1 agreement
        - Segmentation: IoU agreement
        """
        if len(annotations) < 2:
            return 1.0

        if task_type == 'classification':
            return self._cohen_kappa(annotations)
        elif task_type == 'ner':
            return self._ner_agreement(annotations)
        else:
            # Segmentation and other types: _pairwise_agreement (e.g. mean IoU) is not shown here
            return self._pairwise_agreement(annotations)

    def _cohen_kappa(self, annotations: list[dict]) -> float:
        """Cohen's Kappa for classification (mean pairwise kappa for >2 annotators)"""
        from itertools import combinations
        from sklearn.metrics import cohen_kappa_score

        label_sets = [[a['label'] for a in ann['items']] for ann in annotations]
        if len({len(labels) for labels in label_sets}) != 1:
            return 0.0  # annotators labeled different numbers of items

        scores = []
        for labels_a, labels_b in combinations(label_sets, 2):
            try:
                scores.append(cohen_kappa_score(labels_a, labels_b))
            except Exception:
                return 0.0

        # For two annotators this is plain Cohen's kappa; for more, the pairwise mean
        return float(np.mean(scores))

    def _ner_agreement(self, annotations: list[dict]) -> float:
        """F1 agreement for named entities"""
        if len(annotations) < 2:
            return 1.0

        spans_a = set(
            (e['start'], e['end'], e['label'])
            for e in annotations[0].get('entities', [])
        )
        spans_b = set(
            (e['start'], e['end'], e['label'])
            for e in annotations[1].get('entities', [])
        )

        if not spans_a and not spans_b:
            return 1.0

        intersection = spans_a & spans_b
        if not intersection:
            return 0.0

        precision = len(intersection) / len(spans_b)
        recall = len(intersection) / len(spans_a)
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
        return f1

    def review_disputed_task(self, task: AnnotationTask,
                              annotations: list[dict]) -> dict:
        """Resolve disputed cases via LLM"""
        response = self.llm.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=400,
            messages=[{
                "role": "user",
                "content": f"""You are a senior annotation expert. Resolve this labeling dispute.

Task type: {task.task_type}
Text: {task.data.get('text', '')[:500]}

Annotator A: {annotations[0]}
Annotator B: {annotations[1]}

Provide:
1. Correct annotation
2. Brief reasoning (1-2 sentences)
3. Guideline clarification needed (if any)"""
            }]
        )
        return {
            'resolution': response.content[0].text,
            'resolved_by': 'llm_arbitration',
            'task_id': task.task_id
        }

Automatic Pre-annotation

class PreAnnotationEngine:
    """Pre-annotation to reduce annotator workload"""

    def __init__(self, task_type: str):
        self.task_type = task_type
        self.weak_model = None
        self.confidence_threshold = 0.85  # Only high-confidence accepted without review

    def pre_annotate_classification(self, texts: list[str],
                                     labels: list[str]) -> list[dict]:
        """Zero-shot classification via NLI"""
        from transformers import pipeline

        if self.weak_model is None:
            self.weak_model = pipeline(
                "zero-shot-classification",
                model="facebook/bart-large-mnli",
                device=0  # first GPU; use device=-1 to run on CPU
            )

        results = []
        batch_size = 32

        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
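            preds = self.weak_model(batch, candidate_labels=labels, batch_size=batch_size)
            if isinstance(preds, dict):  # older transformers releases may return a bare dict for one input
                preds = [preds]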

            for pred in preds:
                top_label = pred['labels'][0]
                confidence = pred['scores'][0]
                results.append({
                    'label': top_label,
                    'confidence': confidence,
                    'auto_accepted': confidence >= self.confidence_threshold
                })

        return results

    def pre_annotate_ner(self, texts: list[str]) -> list[dict]:
        """NER via GLiNER (general NER)"""
        from gliner import GLiNER

        if self.weak_model is None:
            self.weak_model = GLiNER.from_pretrained("urchade/gliner_multi-v2.1")

        entity_types = ["person", "organization", "location", "date", "product"]
        results = []

        for text in texts:
            entities = self.weak_model.predict_entities(text, entity_types)
            results.append({
                'entities': [
                    {'start': e['start'], 'end': e['end'],
                     'label': e['label'], 'confidence': e['score']}
                    for e in entities
                ],
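                # An empty prediction is not evidence of a correct empty annotation: send it to review
                'auto_accepted': bool(entities) and all(
                    e['score'] >= self.confidence_threshold for e in entities
                )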
            })

        return results

Active Learning Loop

class ActiveLearningLoop:
    """Smart selection of next tasks for labeling"""

    def select_informative_samples(self, unlabeled_pool: list[dict],
                                    current_model,
                                    strategy: str = 'uncertainty',
                                    budget: int = 100) -> list[int]:
        """
        Strategies:
        - uncertainty: least confident predictions
        - diversity: most diverse in feature space
        - hybrid: combination of both
        """
        texts = [item.get('text', '') for item in unlabeled_pool]

        if strategy == 'uncertainty':
            probs = current_model.predict_proba(texts)
            # Highest entropy = highest uncertainty
            entropy = -np.sum(probs * np.log(probs + 1e-10), axis=1)
            return np.argsort(entropy)[-budget:].tolist()

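        elif strategy == 'diversity':
            # Core-set (greedy k-center): repeatedly pick the point farthest from the selection
            embeddings = np.asarray(current_model.encode(texts))  # if encoder available
            budget = min(budget, len(texts))
            selected = [int(np.random.randint(len(texts)))]
            # Distance from every point to its nearest already-selected point
            dists = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)

            for _ in range(budget - 1):
                idx = int(np.argmax(dists))
                selected.append(idx)
                dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[idx], axis=1))

            return selected

        # Fallback for unknown strategies (including 'hybrid', not implemented here): first N items
        return list(range(min(budget, len(unlabeled_pool))))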
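
The hybrid strategy mentioned earlier (70% uncertainty + 30% diversity) could be sketched roughly as follows. This is a minimal, dependency-free illustration and not the platform's actual implementation: the function name `select_hybrid`, its parameters, and the choice of Euclidean distance are assumptions made for the example.

```python
import math
import random


def entropy(probs):
    """Shannon entropy of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_hybrid(probs, embeddings, budget, uncertainty_share=0.7, seed=0):
    """Return indices: top-entropy picks first, then greedy farthest-point picks.

    Hypothetical helper: probs[i] is the class distribution for item i,
    embeddings[i] is its feature vector.
    """
    n = len(probs)
    budget = min(budget, n)
    if budget <= 0:
        return []
    n_uncertain = round(budget * uncertainty_share)

    # 1) Uncertainty part: items with the highest prediction entropy
    by_entropy = sorted(range(n), key=lambda i: entropy(probs[i]), reverse=True)
    selected = by_entropy[:n_uncertain]

    # 2) Diversity part: greedy k-center over the remaining pool, measuring
    #    the distance to everything already selected
    remaining = set(range(n)) - set(selected)
    if not selected:
        first = random.Random(seed).choice(sorted(remaining))
        selected.append(first)
        remaining.discard(first)
    while len(selected) < budget and remaining:
        farthest = max(
            sorted(remaining),
            key=lambda i: min(euclidean(embeddings[i], embeddings[j]) for j in selected),
        )
        selected.append(farthest)
        remaining.discard(farthest)
    return selected


probs = [[0.5, 0.5], [0.9, 0.1], [0.99, 0.01], [0.6, 0.4], [0.95, 0.05]]
embeddings = [[0, 0], [1, 0], [10, 10], [0, 1], [5, 5]]
picked = select_hybrid(probs, embeddings, budget=3)
```

Because the diversity picks are measured against the uncertain picks as well, the second stage tends to move away from regions the first stage already covered.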

Labeling Strategy Comparison

Strategy	Cost per 1K documents	IAA (classification)	Time to complete	Model quality gain
Manual labeling	$750 (50 hours × $15)	0.82	2 days	Baseline
Pre-annotation + review	$300 (20 hours × $15)	0.88	1 day	+3%
Active Learning	$150 (10 hours × $15)	0.91	0.5 day	+5–10%

Platform Comparison: Off-the-Shelf vs Custom

Feature	Label Studio (off-the-shelf)	Custom Platform
Model integration	Via hooks, limited	Full integration with your ML pipeline
Labeling types	Limited set	Any (hierarchical, 3D, video)
Active Learning	Not built-in	Built-in loop with uncertainty/diversity
Quality control	Basic IAA	Cohen's Kappa, F1, LLM arbitration
Throughput	Up to 10K tasks/day	50K+ tasks/day with optimization

Example cost savings for a team of 5 annotators

Without platform: 5 annotators × 40 hours/week × $15/hour = $3,000/week. Per month — $12,000. With Active Learning: labeling volume reduced 3–5 times, labor costs drop to $150–$300 per 1K documents. Net savings: $4,000–$8,000/month.

How to Implement a Labeling Platform

Data and labeling type audit — identify the required annotation types, their complexity, and expected error rates.
Stack and architecture selection — decide which components to customize (Label Studio or from scratch), which pre-annotation models to use.
Backend development — FastAPI + Celery + RabbitMQ for queues, PostgreSQL for storage.
Pre-annotation and Active Learning integration — connect weak models and uncertainty strategy.
Quality control setup — IAA thresholds, gold standard, LLM arbitration.
Testing with real data — load testing of queues, consistency checks.
Deployment and team training — deploy on your infrastructure, hand over documentation.

What's Included in Development and Timeline

Orchestration API — data ingestion, queue, prioritization, distribution.
Annotator interface — customized Label Studio or React UI.
Pre-annotation module — weak models with confidence thresholds.
Quality control — IAA, gold standard, review pipeline.
Export — JSONL, COCO, YOLO, integration with HuggingFace Datasets.
Active Learning — uncertainty and diversity calculator.
Documentation and team training.

Basic platform based on Label Studio — from 2 weeks. Full-featured with pre-annotation and Active Learning — from 3 to 8 weeks depending on complexity. Pricing is determined individually after auditing your data and requirements.

Discuss your project with our engineers — we'll assess your data and propose an architecture. Contact us to get a consultation.

Data Engineering for ML: Pipelines, Labeling, and Data Quality

“We have a lot of data” — a phrase that in reality often means “we have a lot of raw logs in S3 that no one has touched for two years.” Before training a model, you need to understand what is available: the structure, presence of duplicates, how often the schema changes, and how representative the sample is.

Data Engineering for ML is not just ETL. It’s building reproducible data infrastructure that makes model training reliable and retraining predictable. From our team’s experience (8 years in data engineering, over 30 ML projects), about half of the problems in production stem from dataset integrity rather than model architecture.

How Are ETL Pipelines for ML Different from BI?

ETL for analytics and ETL for ML are different tasks. Analytics needs aggregates; ML needs individual records with history. Analytics doesn’t require a train/val/test split; ML does. In analytics, skew hinders interpretation; in ML, it directly affects model quality.

Tools. Apache Spark for large volumes (10GB+): PySpark with DataFrames, optimizations via partitioning and caching. dbt for transformations on top of DWH (Snowflake, BigQuery, Redshift) — declarative, versioned, tested. Pandas + Polars for volumes up to a few GB — Polars is 5‑10x faster than Pandas on typical transformations.

Temporal splits. For ML it’s important that the split is by time, not random. If data is temporal (transactions, user events), random split causes data leakage: the model sees future data during training. Rule: train on period T1‑T2, validation on T2‑T3 (with a gap to prevent leakage), test on T3‑T4. An incorrect split can cost 10–15% of model quality on validation.
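
The T1-T2 / T2-T3 / T3-T4 rule above can be illustrated with a small sketch. It assumes each record carries a timestamp under a `ts` key; the function name `temporal_split`, the gap length, and the sample dates are illustrative choices, not part of any specific pipeline.

```python
from datetime import datetime, timedelta


def temporal_split(records, train_end, val_end, gap=timedelta(days=1), ts_key="ts"):
    """Split time-stamped records into train/val/test with a gap after each boundary."""
    train, val, test = [], [], []
    for rec in records:
        ts = rec[ts_key]
        if ts < train_end:
            train.append(rec)
        elif train_end + gap <= ts < val_end:
            val.append(rec)
        elif ts >= val_end + gap:
            test.append(rec)
        # Records that fall inside a gap are dropped on purpose
    return train, val, test


# Thirty daily events starting on 2024-01-01
events = [{"ts": datetime(2024, 1, 1) + timedelta(days=d), "value": d} for d in range(30)]
train, val, test = temporal_split(
    events,
    train_end=datetime(2024, 1, 15),
    val_end=datetime(2024, 1, 22),
    gap=timedelta(days=2),
)
```

Every training record is strictly older than every validation record, and the gap keeps features computed over trailing windows from straddling a boundary.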

Incremental pipelines. The model is retrained weekly on new data. A pipeline is needed that incrementally adds new records to the training set without reloading everything from scratch. Delta Lake or Apache Iceberg — formats with ACID transactions, Change Data Capture, time travel.
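
The idea of incremental appends can be shown without any table format at all. The sketch below is a plain-Python illustration of a watermark-based append, not Delta Lake or Iceberg code; `append_incremental`, the `updated_at` and `id` keys, and the integer timestamps are assumptions made for the example.

```python
def append_incremental(training_set, source_rows, watermark, ts_key="updated_at", id_key="id"):
    """Append only rows newer than the watermark; return the new watermark."""
    seen_ids = {row[id_key] for row in training_set}
    new_rows = [
        row for row in source_rows
        if row[ts_key] > watermark and row[id_key] not in seen_ids
    ]
    training_set.extend(new_rows)
    # The watermark only moves forward, so re-running the job adds nothing
    return max([watermark] + [row[ts_key] for row in new_rows])


training_set = [{"id": 1, "updated_at": 10}]
source = [
    {"id": 1, "updated_at": 10},
    {"id": 2, "updated_at": 11},
    {"id": 3, "updated_at": 12},
]
watermark = append_incremental(training_set, source, watermark=10)
watermark_after_rerun = append_incremental(training_set, source, watermark=watermark)
```

Formats such as Delta Lake or Iceberg add what this sketch lacks: atomic commits, so a failed run does not leave a half-written training set, and snapshots for reproducing an earlier version.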

What Causes Training‑Serving Skew and How to Avoid It?

Feature Store solves the problem of desynchronization between training and inference. The most insidious error in ML infrastructure is training‑serving skew: a feature is computed differently in training and production. The model learns on correct data, but inference gets different values.

Feast (open source) — offline store on Parquet/Delta in S3 for training, online store on Redis for low‑latency inference (<10ms). Feature definitions as Python code:

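from datetime import timedelta

from feast import Entity, FeatureView, Field
from feast.types import Float32, Int64

# Assumes user_features_source (e.g. a FileSource over Parquet) is defined elsewhere
user = Entity(name="user", join_keys=["user_id"])

user_features = FeatureView(
    name="user_features",
    entities=[user],  # recent Feast releases expect Entity objects here
    schema=[
        Field(name="purchase_count_7d", dtype=Int64),
        Field(name="avg_session_duration", dtype=Float32),
    ],
    ttl=timedelta(days=7),
    source=user_features_source,
)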

One definition, used everywhere. No discrepancies. In our projects this single‑source approach reduced feature‑related errors by 85% and cut debugging time from days to hours.

Streaming features. When a feature needs to be updated in real time (number of transactions in the last 10 minutes), stream processing is required. Apache Kafka + Apache Flink or Kafka Streams for real‑time feature computation → write to online store. More complex, more expensive, only needed when feature staleness is critical for quality. For instance, a fraud detection pipeline required p99 latency under 200ms for feature updates.
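
To make the "transactions in the last 10 minutes" feature concrete, here is a minimal in-process sketch of a sliding-window counter. In production this logic would live in Flink or Kafka Streams; the class name `SlidingWindowCounter` and the assumption that timestamps arrive in order (as seconds) are illustrative.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Count events per key within the last `window_seconds`."""

    def __init__(self, window_seconds=600):
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # key -> timestamps in arrival order

    def add(self, key, timestamp):
        """Record one event; timestamps are assumed to be non-decreasing per key."""
        self._events[key].append(timestamp)

    def count(self, key, now):
        """Number of events for `key` in the window (now - window_seconds, now]."""
        window = self._events[key]
        # Evict timestamps that have fallen out of the window
        while window and window[0] <= now - self.window_seconds:
            window.popleft()
        return len(window)


counter = SlidingWindowCounter(window_seconds=600)
for ts in (0, 300, 590):
    counter.add("user_1", ts)
count_at_600 = counter.count("user_1", now=600)
count_at_900 = counter.count("user_1", now=900)
```

The value returned by `count` is what would be written to the online store after each event.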

Data Labeling: How Not to Waste Budget

Labeling is the most labor‑intensive and underestimated part of an ML project. Poorly labeled data cannot be fixed by any architecture.

Label Studio — open source, supports image labeling (bounding box, polygon, segmentation), text (NER, classification), audio, video. Deploys in 10 minutes via Docker. For small teams — first choice.

Labeling quality assessment. Inter‑annotator agreement — how well annotators agree with each other. Cohen’s Kappa > 0.8 — good, 0.6‑0.8 — acceptable, < 0.6 — task ambiguous or instructions poor. Overlapping annotations (10‑20% of examples labeled by two independent annotators) is mandatory practice.
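
For reference, Cohen's Kappa compares observed agreement with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). A minimal pure-Python sketch is shown below; the `interpret` helper simply encodes the bands quoted above and is a hypothetical convenience function.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def interpret(kappa):
    """Map a kappa value to the bands used in the text above."""
    if kappa > 0.8:
        return "good"
    if kappa >= 0.6:
        return "acceptable"
    return "ambiguous task or poor instructions"


annotator_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
annotator_b = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "neg", "pos"]
kappa = cohen_kappa(annotator_a, annotator_b)
```

In this example the annotators agree on 8 of 10 items, but half of that agreement is expected by chance, so kappa lands near 0.6 rather than 0.8.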

Active learning prevents budget waste. Don’t label random examples; select those where the model is most uncertain (low confidence, high entropy). This allows achieving the same quality with 50-70% of the labeling volume. modAL, Prodigy, and Label Studio support active learning workflows. In one NLP project, we reduced the labeling budget by 2.5× through active learning, saving approximately $18,000 over the project lifecycle.

Synthetic data. When real data is scarce or expensive to obtain. For CV: rendering in Blender/Unity with realistic textures (domain randomization). For NLP: paraphrase via LLM, backtranslation. Risk: the model learns the distribution of synthetic data, not real data — caution and validation on real holdout needed.

Data Quality: Validation and Monitoring

Great Expectations is a de facto standard for data validation in ML pipelines. Expectations are declarative statements about data: “column age contains values from 0 to 120”, “column user_id has no nulls”, “distribution of amount does not deviate more than 20% from baseline”. They run in the pipeline and block progression on failure. Expectation suites can also serve as data contracts between teams.

Pandera — Pythonic alternative for pandas/polars DataFrames. Schema‑based validation with type hints:

import pandera as pa

schema = pa.DataFrameSchema({
    "user_id": pa.Column(int, nullable=False),
    "score": pa.Column(float, pa.Check.between(0, 1)),
    "label": pa.Column(str, pa.Check.isin(["positive", "negative", "neutral"])),
})

Data freshness. The model expects data from the last N days. ETL fails, data is not updated — the model uses stale features. Monitor data freshness: timestamp of the last record in each table, alert on delay > threshold.
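
A freshness check can be as small as comparing the newest timestamp per table against a threshold. The sketch below assumes the last-record timestamps have already been queried into a dictionary; `stale_tables`, the table names, and the 2-hour threshold are illustrative.

```python
from datetime import datetime, timedelta, timezone


def stale_tables(last_record_ts, max_delay, now=None):
    """Return {table: delay} for tables whose newest record is older than max_delay."""
    now = now or datetime.now(timezone.utc)
    return {
        table: now - ts
        for table, ts in last_record_ts.items()
        if now - ts > max_delay
    }


now = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "transactions": now - timedelta(minutes=30),
    "user_events": now - timedelta(hours=5),
}
late = stale_tables(last_seen, max_delay=timedelta(hours=2), now=now)
```

Any non-empty result would trigger an alert, and ideally also block the retraining job from consuming the stale table.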

Deduplication. Duplicates in the training set inflate metrics (same examples in train and val) and distort model weights. MinHash LSH for approximate deduplication of large datasets. For exact — hash by normalized content.
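
Exact deduplication by normalized content might look like the sketch below. The normalization rules (Unicode NFKC, lowercasing, whitespace collapsing) are assumptions chosen for the example; the right rules depend on the data.

```python
import hashlib
import re
import unicodedata


def normalize(text):
    """Apply Unicode NFKC, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def content_hash(text):
    """SHA-256 digest of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def deduplicate(texts):
    """Keep the first occurrence of each normalized text, preserving order."""
    seen, unique = set(), []
    for text in texts:
        digest = content_hash(text)
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


docs = ["Hello  World", "hello world ", "Other"]
unique_docs = deduplicate(docs)
```

Running this before the train/val split is what prevents the same example from landing on both sides.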

Validation Tools Comparison

Tool	Application area	When to choose
Great Expectations	Universal, tables, pipelines	Large teams, lots of metadata
Pandera	pandas/polars DataFrames	Python‑centric projects, type hints
Deequ	Apache Spark, big data	If pipeline is already on Spark

What Does a Data Engineering Project for ML Include?

We provide the full cycle:

Audit of existing data and pipelines (1 week).
Architecture design: selection of tools, formats, labeling methods.
Implementation of ETL/ELT pipeline with validation and monitoring.
Documentation of code and processes (model card, data card).
Training your team on pipeline operation.
Post‑deployment support for 3 months.
Access to code repository and all pipeline definitions.

How We Build a Pipeline: Step by Step

Audit existing data. Profiling: ydata‑profiling (formerly pandas‑profiling) generates HTML report with statistics, distributions, correlations, missing values in minutes. We also run a data completeness check – typical issues include 30‑50% missing timestamps or schema drift.
Pipeline design. Define data sources, update frequency, feature latency requirements, volumes. Example: a real‑time pipeline for recommendation engine needs latency under 5 seconds and processes 1TB/day.
Implementation and testing. Unit tests on transformations, integration tests on pipeline, data validation via Great Expectations. We target 95% test coverage for transformation logic.
Deployment and monitoring. Alerts on freshness, quality checks, anomalies in data volumes. Typical alert threshold: no new data for 2 hours.

Storage and Formats

Format	Best for	Features
Parquet	Batch training, analytics	Columnar, efficient compression
Delta Lake	Incremental updates, ACID	Time travel, schema evolution
Apache Iceberg	Enterprise, multi‑engine	Best catalog, hidden partitioning
HDF5	Numerical arrays (CV datasets)	Hierarchical structure
TFDS / datasets	Standardized ML datasets	Hugging Face `datasets` — convenient for NLP

For most ML projects at start: Parquet in S3 + DVC for versioning. Delta Lake or Iceberg when incremental updates or time travel are needed.

Why Trust Us

We have been working in data engineering and ML for over 8 years. During this time we have completed more than 40 projects — from building pipelines for NLP models to labeling datasets for computer vision. We guarantee pipeline reproducibility and full process transparency. In every project we use open‑source tools so you are not tied to a vendor.

Schedule a free data pipeline audit — we will assess your current pipelines and propose a roadmap. Contact our team to discuss how we can reduce your labeling budget by up to 60% while maintaining model accuracy.