Data Cleaning and Preprocessing for LLM Fine-tuning
Data cleaning for LLM fine-tuning has its own specifics: not only must you remove technical junk (HTML tags, duplicates), but also filter toxic content, fix encoding issues, and ensure examples actually match the target task.
Cleaning Pipeline
import re
import unicodedata
from dataclasses import dataclass
@dataclass
class CleaningResult:
original: str
cleaned: str
removed: bool
removal_reason: str = None
class TextCleaner:
    """Normalizes raw text and flags examples too short to keep."""

    # Patterns compiled once at class-creation time: clean() is meant to run
    # over large datasets, so per-call re-compilation is wasted work.
    _TAG_RE = re.compile(r'<[^>]+>')
    _URL_RE = re.compile(r'https?://[^\s]+')
    _WS_RE = re.compile(r'\s+')
    _REPEAT_RE = re.compile(r'(.)\1{4,}')

    def clean(self, text: str) -> CleaningResult:
        """Clean one example and decide whether to keep it.

        Steps (order matters — tags are stripped before whitespace is
        collapsed): NFKC unicode normalization, HTML/XML tag removal, URL
        replacement with the "[URL]" placeholder, whitespace collapsing,
        and squashing runs of 5+ identical characters down to two.

        Returns a CleaningResult; ``removed`` is True with reason
        "too_short" when fewer than 3 whitespace-separated tokens survive.
        """
        # 1. Unicode normalization (NFKC folds compatibility forms).
        cleaned = unicodedata.normalize('NFKC', text)
        # 2. Remove HTML/XML tags, leaving a space so words don't fuse.
        cleaned = self._TAG_RE.sub(' ', cleaned)
        # 3. Replace URLs with a placeholder token.
        cleaned = self._URL_RE.sub('[URL]', cleaned)
        # 4. Collapse all whitespace runs into single spaces.
        cleaned = self._WS_RE.sub(' ', cleaned).strip()
        # 5. Squash repeating characters (aaaaaaa -> aa).
        cleaned = self._REPEAT_RE.sub(r'\1\1', cleaned)
        # Drop degenerate examples with fewer than 3 tokens.
        if len(cleaned.split()) < 3:
            return CleaningResult(text, cleaned, True, "too_short")
        return CleaningResult(text, cleaned, False)
class DataFilter:
    """Filters examples for toxicity and personally identifiable information."""

    # PII heuristics, compiled once. These are deliberately loose matchers,
    # not validators (no Luhn check for cards, no full E.164 for phones).
    _PII_PATTERNS = [
        re.compile(r'\b\d{3}-\d{2}-\d{4}\b'),  # SSN
        # Bug fix: the original TLD class was [A-Z|a-z], which also matched
        # a literal '|' character; corrected to [A-Za-z].
        re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'),  # Email
        re.compile(r'\b(?:\+7|8)?[\s-]?\(?\d{3}\)?[\s-]?\d{3}[\s-]?\d{2}[\s-]?\d{2}\b'),  # RU phone
        re.compile(r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b'),  # Credit card
    ]

    def __init__(self):
        # Imported lazily so the regex-only PII checks stay usable even when
        # the heavy third-party detoxify dependency is not installed.
        from detoxify import Detoxify
        self.toxicity_model = Detoxify('multilingual')

    def is_toxic(self, text: str, threshold: float = 0.7) -> bool:
        """Return True when the detoxify toxicity score exceeds ``threshold``."""
        result = self.toxicity_model.predict(text)
        return result['toxicity'] > threshold

    def has_pii(self, text: str) -> bool:
        """Simple heuristic for PII detection (SSN, email, RU phone, card)."""
        return any(p.search(text) for p in self._PII_PATTERNS)
Output Field Cleaning
class OutputCleaner:
    """Strips boilerplate intros and trailing meta-comments from model outputs."""

    # Canned assistant openers that should not survive into training data.
    _UNWANTED_PREFIXES = (
        "As an AI language model",
        "As a helpful assistant",
        "I don't have access to real-time",
        "I cannot browse the internet",
        "Certainly! Here",
        "Of course! I'd be happy to",
    )
    # Markers of meta-commentary; everything from the marker onward is cut.
    _META_MARKERS = (
        "Note: This is a fictional",
        "[This response was",
        "Disclaimer:",
    )

    def clean_output(self, output: str, task_type: str) -> tuple[str, bool]:
        """Return (cleaned_text, should_remove) for a single model output."""
        text = output.strip()

        # Peel off canned intros (case-insensitive prefix match), then drop
        # any leftover leading punctuation.
        for prefix in self._UNWANTED_PREFIXES:
            if text.lower().startswith(prefix.lower()):
                text = text[len(prefix):].lstrip('.,! ')

        # Truncate at the first occurrence of each meta-comment marker.
        for marker in self._META_MARKERS:
            if marker in text:
                text = text[:text.find(marker)].strip()

        # Outputs shorter than 5 words are flagged for removal by the caller.
        return (text, True) if len(text.split()) < 5 else (text, False)
Duplicate Detection at Multiple Levels
from datasketch import MinHash, MinHashLSH
def find_near_duplicates(texts: list[str],
                         threshold: float = 0.8) -> list[tuple]:
    """MinHash LSH for efficient near-duplicate search O(n log n)"""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    signatures = {}

    # Index phase: one 128-permutation MinHash per document, built over
    # its lowercased word tokens.
    for idx, doc in enumerate(texts):
        sig = MinHash(num_perm=128)
        for token in doc.lower().split():
            sig.update(token.encode('utf8'))
        key = f"doc_{idx}"
        lsh.insert(key, sig)
        signatures[key] = sig

    # Query phase: any hit other than the document itself is a
    # near-duplicate candidate above the Jaccard threshold.
    matches = []
    for idx in range(len(texts)):
        key = f"doc_{idx}"
        hits = lsh.query(signatures[key])
        hits.remove(key)
        if hits:
            matches.append((idx, [int(h.split('_')[1]) for h in hits]))
    return matches
Statistics After Cleaning
After cleaning the dataset, verify:
- Percentage of examples removed, broken down by reason (too_short, toxic, pii, duplicate)
- Output length distribution (histogram)
- Vocabulary diversity (type-token ratio)
- Target domain coverage (how well examples cover tasks)
Typical result: from 50,000 raw examples, 35,000-42,000 high-quality examples remain after cleaning. A 15-30% reduction is normal, and final model quality improves as a result.







