What types of plagiarism do you detect?

We detect exact copying, cosmetic modification (synonym replacement), paraphrasing, and cross-lingual plagiarism. Each type uses a specific method: fingerprinting, n-grams with Jaccard similarity, semantic comparison on BERT embeddings, and cross-lingual embeddings.

How do you handle a corpus of 1 million documents?

We use ANN indexing via FAISS or Qdrant. The index is built in O(N log N), and each query search takes milliseconds. Exact pairwise comparisons do not scale; ANN finds nearest candidates, then we apply exact algorithms.

What percentage of borrowing is considered plagiarism?

Thresholds depend on context: 15–20% for academic works, 30–40% for business content. We tune the threshold to your requirements and add visualization with highlights and source links.

Do you integrate with existing services?

Yes, we support integration with Antiplagiat.ru and iThenticate. If you need a custom system with a private corpus or specific security requirements, we build it from scratch.

What deliverables do you provide?

A report in PDF or JSON format: plagiarism percentage, list of matches with fragments, source links, and confidence metrics. REST API integration is also available.

Plagiarism Detection with Semantic Search and ANN Indexing

We design and deploy artificial intelligence systems: from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering and MLOps to make AI work not in the lab, but in real business.

Imagine you have a corpus of 500,000 scientific articles and need to check a new paper for plagiarism. Finding exact copies takes seconds, but what if the text is paraphrased? Standard algorithms yield up to 40% false negatives. We solve this problem using semantic search and ANN indexing. Our experience spans over seven years in NLP and Computer Vision; we have implemented systems for three universities and two publishing houses. The plagiarism detection system combines fingerprinting and semantic search using embeddings.

Why Exact Matching Isn't Enough

Verbatim copying accounts for only 30% of cases. The rest is paraphrasing, translation from another language, or structural rearrangement. Without semantic analysis, such borrowings go undetected. We combine several approaches:

Plagiarism Type	Detection Method	Accuracy
Verbatim copying	Fingerprinting (Rabin-Karp)	99.9%
Cosmetic modification	N-gram + Jaccard similarity	95%
Paraphrasing	Semantic similarity (Sentence-BERT)	92%
Cross-lingual	Cross-lingual embeddings (LASER)	88%

How We Scale Checking to 1M+ Documents

For large corpora, exact pairwise search is infeasible. We use an ANN index (FAISS or Qdrant): with a graph index such as HNSW, building takes roughly O(N log N) and each query roughly O(log N). After the index returns candidates, we verify them with exact algorithms. This reduces query latency from hours to milliseconds.

Example FAISS configuration:

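import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
docs = [...]  # list of document texts (placeholder)
# Unit-length vectors: L2 ranking on them matches cosine-similarity ranking.
embeddings = model.encode(docs, normalize_embeddings=True).astype(np.float32)
# HNSW graph index (approximate search); IndexFlatIP would be exact brute force.
# 32 neighbors per node is a starting point, not a tuned value.
index = faiss.IndexHNSWFlat(embeddings.shape[1], 32)
index.add(embeddings)
# Search: query_emb is a float32 array of shape (n_queries, dim), normalized the same way.
# distances, indices = index.search(query_emb, 10)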

How Fine-Tuning Improves Domain Accuracy

Standard Sentence-BERT models (e.g., all-MiniLM-L6-v2) are trained on general data. For a corpus of scientific articles or legal documents, fine-tuning can boost semantic comparison accuracy by 3–5%. We use LoRA (Low-Rank Adaptation): only 2% of model parameters are updated, which reduces overfitting risk and speeds up fine-tuning. Example: on a corpus of 50,000 documents, fine-tuning takes two hours on a single V100 GPU. After fine-tuning, recall@10 for paraphrased plagiarism increases from 88% to 94%.

Approach	Indexing Time (1M docs)	Accuracy (Rec@10)
Without fine-tuning	15 min	88%
Fine-tuning LoRA	15 min + 2 hours	94%
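
The Rec@10 column can be read as the share of queries whose true source appears among the top 10 retrieved candidates. The sketch below assumes that definition, which the text does not spell out; the function name and the toy document IDs are illustrative only.

```python
from typing import Sequence, Set


def recall_at_k(
    retrieved: Sequence[Sequence[str]],
    relevant: Sequence[Set[str]],
    k: int = 10,
) -> float:
    """Share of queries with at least one true source in the top-k results."""
    if not retrieved:
        return 0.0
    hits = sum(
        1
        for ranked, truth in zip(retrieved, relevant)
        if truth & set(ranked[:k])  # non-empty overlap counts as a hit
    )
    return hits / len(retrieved)


# Toy data: ranked candidate IDs per query and the known true sources.
ranked_ids = [["a", "b"], ["c", "d"], ["e", "f"]]
true_sources = [{"b"}, {"x"}, {"e"}]
score = recall_at_k(ranked_ids, true_sources, k=2)
```

Measuring this on a held-out set of known paraphrase pairs before and after fine-tuning is what makes a comparison like the one in the table meaningful.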

To find relevant sources in an open corpus, we add a RAG-style retrieval pipeline: embeddings of all documents are indexed, each query is converted to a vector to retrieve the nearest candidates, and exact semantic matching is then applied to those candidates.

Technical Stack and Integration

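Fingerprinting is the fastest option for exact matches. For near-exact copies, word shingles compared with Jaccard similarity work well:

def get_shingles(text: str, k: int = 5) -> set:
    # Word-level k-grams; texts shorter than k words yield an empty set.
    words = text.lower().split()
    return {tuple(words[i:i+k]) for i in range(len(words)-k+1)}

def jaccard_similarity(s1: set, s2: set) -> float:
    # |intersection| / |union|; two empty sets count as no overlap.
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0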
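
The shingle code above compares word sets directly. Fingerprinting in the Rabin-Karp sense instead hashes every k-character window with a rolling hash, so each new window costs O(1). The following is a minimal sketch of that idea, not the production implementation; the function names, window size, base, and modulus are all assumptions chosen for illustration.

```python
from typing import Set


def rolling_fingerprints(
    text: str, k: int = 5, base: int = 257, mod: int = (1 << 61) - 1
) -> Set[int]:
    """Hash every k-byte window of the lowercased text with a rolling hash."""
    data = text.lower().encode("utf-8")
    if len(data) < k:
        return set()
    high = pow(base, k - 1, mod)  # weight of the byte that leaves the window
    h = 0
    for b in data[:k]:
        h = (h * base + b) % mod
    prints = {h}
    for i in range(k, len(data)):
        # Drop the outgoing byte, shift, and append the incoming byte.
        h = ((h - data[i - k] * high) * base + data[i]) % mod
        prints.add(h)
    return prints


def containment(suspect: str, source: str, k: int = 5) -> float:
    """Share of the suspect's fingerprints that also occur in the source."""
    a = rolling_fingerprints(suspect, k)
    b = rolling_fingerprints(source, k)
    return len(a & b) / len(a) if a else 0.0
```

Containment is used here rather than Jaccard because a short copied passage inside a long source would otherwise score low. Real systems usually also subsample fingerprints to keep the index small.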

Semantic comparison (for paraphrasing):

Sentence segmentation
Sentence-BERT embeddings for each sentence
Cosine similarity matrix between all sentence pairs
Detect pairs with similarity > 0.85
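
The last two steps can be sketched in plain Python. Here short toy vectors stand in for Sentence-BERT sentence embeddings, which are an assumption of this example; only the 0.85 threshold comes from the list above, and the function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def suspicious_pairs(
    suspect_vecs: Sequence[Sequence[float]],
    source_vecs: Sequence[Sequence[float]],
    threshold: float = 0.85,
) -> List[Tuple[int, int, float]]:
    """Return (suspect_idx, source_idx, score) for pairs above the threshold."""
    pairs = []
    for i, u in enumerate(suspect_vecs):
        for j, v in enumerate(source_vecs):
            score = cosine(u, v)
            if score > threshold:
                pairs.append((i, j, score))
    return pairs


# Toy 2-d "embeddings": only the first suspect sentence is close to a source.
suspect = [[1.0, 0.0], [0.0, 1.0]]
source = [[0.9, 0.1], [1.0, 1.0]]
flagged = suspicious_pairs(suspect, source)
```

With real embeddings the double loop would normally be replaced by one matrix multiplication over normalized vectors; the loop form is kept here for readability.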

Integration with external services: For academic works, we connect the Antiplagiat.ru API (Russian standard for universities) and iThenticate. If privacy or a custom corpus is needed, we build a bespoke system.

According to the Sentence-BERT paper, semantic comparison on embeddings provides high accuracy with minimal computational cost.

Development Process

Analysis: gather requirements, assess corpus, choose thresholds.
Design: pipeline architecture (indexing, search, reporting).
Implementation: develop fingerprinting and semantic comparison modules, set up ANN index, fine-tune model.
Testing: run on test corpus, measure precision/recall, optimize p99 latency.
Deployment: deploy on your infrastructure or cloud (SageMaker, Vertex AI), integrate via REST API.

What's Included in the Result

Ready plagiarism detection pipeline (fingerprinting + semantic comparison)
ANN index (FAISS or Qdrant) for fast search
Sentence-BERT model fine-tuned on your corpus (optional)
REST API with endpoints /check, /upload, /report
Visualization of matches with highlights and source links
Documentation and team training (2–3 days)
1-year support guarantee

Comparison with Alternatives

Sentence-BERT is 3x faster than extracting embeddings with BERT-base, with less than a 2% quality drop. ANN indexing (HNSW) is about 100x faster than exact search for corpora of more than 10K documents. Additionally, we use few-shot prompts to analyze complex paraphrasing cases, which reduces the model hallucination rate.

Performance comparison example:

Method	Time for 10K queries	Accuracy (F1)
Exact search	12 hours	95%
ANN (HNSW)	7 minutes	93%

Typical Implementation Mistakes

Keeping stop words in shingles (adds noise)
Missing preprocessing: lemmatization, lowercasing
Choosing a poor k for n-grams (too small produces coincidental matches; too large misses lightly edited passages)
Ignoring multilinguality (if corpus is multilingual)
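
The first two mistakes are cheap to avoid with a normalization step before shingling. The sketch below is illustrative: the stop-word list is a tiny hypothetical sample, and lemmatization is left out because it needs a language-specific library.

```python
import re
from typing import List, Set, Tuple

# Tiny illustrative stop-word list; a real one is language-specific and longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def normalize(text: str) -> List[str]:
    """Lowercase, strip punctuation, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def shingles(text: str, k: int = 3) -> Set[Tuple[str, ...]]:
    """Word k-grams over the normalized token stream."""
    tokens = normalize(text)
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}
```

After normalization, "The model is trained" and "A model, trained" produce the same 2-word shingle, so a cosmetic edit no longer hides the overlap.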

If you want to evaluate your case, contact us — we'll prepare a demo version for your corpus. Order a pilot project: we'll test the system on 1,000 documents in 5 business days for $2,500. Get integration consultation right now — we'll help set everything up for your tasks. This investment saves up to $10,000 annually by reducing false positives and manual review time.

NLP Development: Text Classification, NER, Embeddings, and Information Extraction

We often receive a task: process 50,000 support tickets — currently all manual. Dataset — 3,000 labeled examples, 12 categories, imbalance: one category occupies 40% of the sample, three at 1-2% each. Baseline accuracy — 78%. Sounds decent until you look at recall for rare classes: 0.31, 0.44, 0.28. These classes — complaints and churn threats — are most important to the business.

This is a typical NLP development project. The problem is not the algorithm but that accuracy is the wrong metric. Our experience across 30+ projects shows: we start by analyzing business metrics and only then choose the model.

Why is accuracy not the right metric for rare classes?

Accuracy ignores imbalance. If the "churn" class appears in 2% of cases, the model can predict "all good" and get 98% accuracy — but the business loses clients. Solution: F1 macro (averaged over all classes) or weighted F1. For NER — strict entity F1 (exact matches only). We guarantee: after choosing the correct metric, model quality becomes measurable and predictable.
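
The 2% churn example can be worked through directly. This is a small pure-Python sketch with illustrative function names; in practice a metrics library would be used instead.

```python
from typing import List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true: List[str], y_pred: List[str], label: str) -> float:
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1, so rare classes count equally."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)


# 2% churn; the model always predicts "ok".
y_true = ["ok"] * 98 + ["churn"] * 2
y_pred = ["ok"] * 100
acc = accuracy(y_true, y_pred)   # 0.98
mf1 = macro_f1(y_true, y_pred)   # about 0.49: F1 is 0 for "churn"
```

The same predictions score 98% accuracy and roughly 0.49 macro F1, which is why the metric has to be chosen before the model.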

Text Classification: From BERT to Distillation

BERT-like models are the standard for classification. ruBERT-base or ruBERT-large from DeepPavlov for Russian. multilingual-e5-large — for multiple languages in one pipeline. XLM-RoBERTa-large — a strong multilingual backbone.

Fine-tuning for classification: add a classification head on top of the [CLS] token, train for 3-5 epochs with lr=2e-5, weight decay=0.01. For imbalance — weighted CrossEntropyLoss or focal loss with gamma=2.0. Contact us — we will show a code snippet.
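
To show what focal loss with gamma=2.0 does, here is a minimal single-example sketch in plain Python. It is not the training code: in a real pipeline the same formula would be applied to batched logits in the deep learning framework. The function name and the optional class weight argument are assumptions.

```python
import math
from typing import Sequence


def focal_loss(
    probs: Sequence[float], target: int, gamma: float = 2.0, weight: float = 1.0
) -> float:
    """Focal loss for one example: -w * (1 - p_t)^gamma * log(p_t)."""
    p_t = max(probs[target], 1e-12)  # guard against log(0)
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss([0.9, 0.1], target=0)      # confident and correct
hard = focal_loss([0.1, 0.9], target=0)      # confident and wrong
plain_ce = focal_loss([0.9, 0.1], target=0, gamma=0.0)  # reduces to cross-entropy
```

The (1 - p_t)^gamma factor shrinks the loss on examples the model already gets right, so gradient updates concentrate on hard and typically rare-class examples.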

Imbalance case study. Dataset — 3,000 examples, imbalance 1:20. Solution: class_weight via sklearn + CrossEntropyLoss. Additionally — augmentation of rare classes via backtranslation (ru→en→ru through MarianMT). Recall for rare classes rose from 0.31 to 0.67 with a slight drop in accuracy (76%→74%). Full NLP development end-to-end took 3 weeks.
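
The class weights in this case study follow the "balanced" heuristic, n_samples / (n_classes * class_count). A minimal sketch of that formula in plain Python follows; the function name and toy labels are illustrative, and the resulting weights would be passed to the loss function.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}


# Toy 1:20 imbalance, as in the case study.
labels = ["other"] * 20 + ["churn"] * 1
weights = balanced_class_weights(labels)
```

With a 1:20 imbalance the rare class gets a weight 20 times larger, so one misclassified churn ticket costs as much as twenty misclassified common ones.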

Distillation for production. BERT-large gives F1 0.89, but CPU inference takes 180 ms. Distillation into DistilBERT reduces latency to 25 ms with F1 0.84; ruBERT-tiny2 goes further, to 12 ms with F1 0.81. Export to ONNX Runtime provides an additional 1.5-2x speedup. DistilBERT achieves about 7x lower latency than BERT-large with only a 0.05 drop in macro F1, a typical production trade-off.

Model	F1 macro	Latency (CPU)	Size
BERT-large	0.89	180 ms	1.3 GB
DistilBERT	0.84	25 ms	250 MB
ruBERT-tiny2	0.81	12 ms	120 MB
DistilBERT + ONNX	0.84	14 ms	150 MB

How to choose between BERT and LLM for your task?

For most classification and extraction tasks, BERT-sized models offer the best trade-off between cost and performance. Shift to LLMs only when the task demands generation, complex reasoning, or zero-shot generalization.

NER: Named Entity Recognition

NER — extracting persons, organizations, locations, dates, amounts, document numbers. For general categories (PER, ORG, LOC), pre-trained models work well. For specialized ones (medical terms, legal concepts) — fine-tuning is needed.

Data annotation. The main cost of an NER project. For a quality model — 500-2,000 labeled sentences per entity type. Tools: Label Studio (open source) or Prodigy (by spaCy creators). IOB2 format — standard.

Architecture. Token classification on top of BERT: each token gets a label (B-PER, I-PER, O). spaCy 3.x with transformer pipeline — a convenient production choice.
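
Token-level labels still have to be merged into entities. Below is a minimal sketch of IOB2 decoding, assuming whitespace-joined tokens; the function name is illustrative, and stray I- tags without a preceding B- tag are simply ignored here.

```python
from typing import List, Sequence, Tuple


def iob2_to_spans(tokens: Sequence[str], tags: Sequence[str]) -> List[Tuple[str, str]]:
    """Merge B-/I- tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type = None
    start = 0
    # A trailing "O" sentinel flushes an entity that ends the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == current_type
        if current_type is not None and not continues:
            spans.append((current_type, " ".join(tokens[start:i])))
            current_type = None
        if tag.startswith("B-"):
            current_type, start = tag[2:], i
    return spans


tokens = ["Anna", "Ivanova", "works", "at", "Acme", "Corp", "in", "Minsk"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "O", "B-LOC"]
entities = iob2_to_spans(tokens, tags)
```

Strict entity F1 is computed over spans like these, so a prediction counts only if both the boundaries and the type match exactly.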

Nested entities. Standard IOB models cannot handle nested entities (organization inside an address). For such tasks — span-based NER: SpanBERT or SpERT. More complex but correct.

Post-processing is mandatory. The model predicts tokens — normalized entities are needed. Date — dateparser. Amounts — regex + validation. Names — deduplication via rapidfuzz. Included in our standard delivery.
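
As an illustration of the "regex + validation" step for amounts, here is a minimal sketch. The pattern, the function name, and the formatting assumptions are hypothetical: thousands are separated by spaces or commas, and a decimal part has exactly two digits after a dot or comma. Real documents usually need locale-specific rules.

```python
import re
from decimal import Decimal
from typing import Optional

# Group 1: integer part with optional thousands separators.
# Group 2: optional two-digit decimal part.
AMOUNT_RE = re.compile(r"(\d{1,3}(?:[ ,]\d{3})+|\d+)(?:[.,](\d{2}))?(?!\d)")


def parse_amount(raw: str) -> Optional[Decimal]:
    """Extract the first amount from text and normalize it to a Decimal."""
    match = AMOUNT_RE.search(raw)
    if match is None:
        return None
    whole = re.sub(r"[ ,]", "", match.group(1))
    cents = match.group(2) or "00"
    amount = Decimal(f"{whole}.{cents}")
    # Validation: a zero amount is treated as an extraction error.
    return amount if amount > 0 else None
```

Decimal is used instead of float so that amounts survive round trips without binary rounding errors.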

Sentiment Analysis and Opinion Mining

Binary positive/negative classification works out of the box with BERT. The hard part is aspect-based sentiment analysis (ABSA): "the restaurant has good food but terrible service." ABSA combines aspect extraction (NER) with sentiment per aspect. Joint BERT-for-ABSA models exist, but their quality on Russian data is lower because datasets are scarce; RuSentiment and SentiRuEval are the main resources.

For production with simple positive/negative/neutral: distil models are enough. Three classes, balanced dataset, 2,000+ examples — F1 macro 0.82-0.87 in 1-2 days.

Text Summarization

Extractive summarization (select sentences) — TextRank or BM25 without training. Fast, no hallucinations. Good for long documents.

Abstractive (generates new text): seq2seq models such as mT5, mBART, FRED-T5, ruT5-large. For production, an LLM API (GPT-4, Claude) is often the best cost/quality/speed trade-off.

Embeddings: Vector Representations of Text

Embeddings are the foundation of semantic search, deduplication, clustering, RAG. Quality critically affects downstream tasks.

Models. BGE-M3 and multilingual-e5-large are strong multilingual embedders; E5-large-v2 is a strong English-only option. sentence-transformers/paraphrase-multilingual-mpnet-base-v2 is a fast option. For Russian: ru-en-RoSBERTa (ai-forever) performs well on semantic textual similarity.

Embedding quality evaluation uses the MTEB benchmark as standard. But top results on MTEB don't guarantee success on a domain dataset — we build domain-specific eval.

Fine-tuning embeddings. If standard models don't give the required Recall@k — contrastive learning on domain pairs with MultipleNegativesRankingLoss. How to perform this for domain data:

Collect 500–2,000 semantically similar pairs from your domain.
Apply MultipleNegativesRankingLoss with a batch size of 32–64.
Train for 1–3 epochs using AdamW (lr=2e-5).
Evaluate Recall@k on a held-out domain test set.

This approach yields a 5–15% improvement in Recall@k in practice.
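
To make the objective concrete, here is a plain-Python sketch of what MultipleNegativesRankingLoss computes on one batch: each anchor should score its own positive higher than every other positive in the batch. This is an illustration of the idea, not the library implementation; the toy vectors are made up, and scale=20.0 is assumed as a typical setting.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mnr_loss(
    anchors: Sequence[Sequence[float]],
    positives: Sequence[Sequence[float]],
    scale: float = 20.0,
) -> float:
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        top = max(logits)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


aligned = mnr_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = mnr_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Because every other positive in the batch acts as a negative, larger batches give more negatives per step, which is why the batch size is part of the recipe.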

Dimensionality and storage. E5-large: 1024 dim, float32 — 4KB per vector. For 10M documents — 40GB. Quantization int8 reduces to 10GB. FAISS IVF_PQ — more compact but with losses. Included in our deployment recommendations.
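
The storage figures follow from simple arithmetic, sketched below. The helper name is illustrative, and the estimate covers raw vectors only, without index overhead.

```python
def index_size_gb(n_vectors: int, dim: int, bytes_per_value: int) -> float:
    """Raw vector storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_vectors * dim * bytes_per_value / 1e9


bytes_per_vector = 1024 * 4                           # 4096 bytes, about 4 KB
float32_gb = index_size_gb(10_000_000, 1024, 4)       # about 41 GB
int8_gb = index_size_gb(10_000_000, 1024, 1)          # about 10 GB
```

The 4x saving from int8 comes purely from storing one byte per dimension instead of four; product quantization compresses further by replacing groups of dimensions with short codes.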

Information Extraction

Structured extraction is a frequent task. Examples: key contract terms, technical characteristics, dates and amounts from invoices.

Regex + rule-based. For INN, OGRN, amounts, dates — more reliable than neural networks. No data required.
NER + post-processing. For variable formats.
LLM with structured output. GPT‑4 / Claude with JSON schema — for complex documents. Cost: minimal per document. For 10k+ documents/day — we calculate the economics.
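
For the rule-based path, a checksum makes regex matches much more reliable than pattern matching alone. The sketch below validates 10-digit INNs with the commonly documented control-digit rule; the function names are illustrative, and 12-digit INNs for individuals use a different rule that is not covered here.

```python
import re
from typing import List

# Weights for the first nine digits of a 10-digit INN.
INN10_WEIGHTS = (2, 4, 10, 3, 5, 9, 4, 6, 8)


def is_valid_inn10(value: str) -> bool:
    """Check the control digit of a 10-digit INN."""
    if not re.fullmatch(r"\d{10}", value):
        return False
    digits = [int(c) for c in value]
    control = sum(w * d for w, d in zip(INN10_WEIGHTS, digits)) % 11 % 10
    return control == digits[9]


def extract_inns(text: str) -> List[str]:
    """Find standalone 10-digit numbers and keep those with a valid checksum."""
    candidates = re.findall(r"(?<!\d)\d{10}(?!\d)", text)
    return [c for c in candidates if is_valid_inn10(c)]
```

The checksum filters out most other 10-digit numbers (account fragments, phone numbers) that the regex alone would accept.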

We guarantee a hybrid: regex/NER for typical fields + LLM for edge cases. Our guarantee is backed by years of production experience and more than 30 projects.

Work Stages

Stage	Duration	What's included
Data and metric analysis	3-5 days	Class distribution, text lengths, baseline
Baseline (TF‑IDF + LogReg)	1 day	Quick estimate of gap with deep models
Training and validation	1-2 weeks	k‑fold, early stopping, error analysis
Deployment (ONNX + FastAPI)	1-2 weeks	REST API, batching, monitoring
Documentation and training	2-3 days	Model card, API docs, team training

Prototype on existing data — 1-3 weeks. Production system with CI/CD — 1.5–2.5 months. Cost is calculated individually — get a consultation for a project estimate.

What's Included

Model and pipeline architecture documentation
Access to the model via REST API (FastAPI + ONNX)
Client team training (2-hour webinar + Q&A)
Accuracy guarantee on the agreed test set
Months of post-delivery support (bug fixes, adaptation to new data)

Our Experience

Years of NLP projects from classification to RAG systems. The team includes ML engineers experienced with Hugging Face, spaCy, LangChain, MLOps. We use vLLM, Kubeflow, Weights & Biases — a production stack, not toys. Contact us to evaluate your NLP project within two days — request a free consultation on your text processing pipeline.