Plagiarism Detection Implementation
Plagiarism detection is the search for borrowings from a known text corpus. The technical complexity depends on the corpus scale and the type of plagiarism: verbatim copying is easy to detect, while paraphrased text requires semantic comparison.
Plagiarism Types and Detection Methods
| Type | Description | Method |
|---|---|---|
| Verbatim copying | Exact fragment match | Fingerprinting (Rabin-Karp) |
| Cosmetic modification | Synonym replacement, word reordering | N-gram + Jaccard similarity |
| Paraphrasing | Meaning preserved, different words | Semantic similarity (BERT) |
| Cross-lingual | Translation from another language | Cross-lingual embeddings |
Technical Stack
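Fingerprinting is the fastest option for exact matches. A simple variant compares sets of word shingles:

    def get_shingles(text: str, k: int = 5) -> set:
        # Word-level k-shingles; a text shorter than k words yields an empty set.
        words = text.lower().split()
        return {tuple(words[i:i+k]) for i in range(len(words) - k + 1)}

    def jaccard_similarity(s1: set, s2: set) -> float:
        # Return 0.0 instead of dividing by zero when both sets are empty.
        union = s1 | s2
        return len(s1 & s2) / len(union) if union else 0.0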
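The table above names Rabin-Karp as the fingerprinting method. A minimal sketch of that idea follows, assuming character windows of 20 bytes and a polynomial rolling hash. The function names `rolling_fingerprints` and `containment`, the window size, and the hash constants are illustrative choices, not part of any particular library.

```python
def rolling_fingerprints(text: str, window: int = 20,
                         base: int = 257, mod: int = (1 << 61) - 1) -> set:
    """Return the Rabin-Karp hash of every `window`-byte substring of the text."""
    # Normalize case and whitespace so trivial formatting changes do not matter.
    data = " ".join(text.lower().split()).encode("utf-8")
    if len(data) < window:
        return set()
    high = pow(base, window - 1, mod)  # weight of the byte that leaves the window
    h = 0
    for b in data[:window]:
        h = (h * base + b) % mod
    hashes = {h}
    for i in range(window, len(data)):
        h = (h - data[i - window] * high) % mod  # drop the oldest byte
        h = (h * base + data[i]) % mod           # append the newest byte
        hashes.add(h)
    return hashes


def containment(suspect: str, source: str, window: int = 20) -> float:
    """Share of the suspect's fingerprints that also occur in the source."""
    suspect_hashes = rolling_fingerprints(suspect, window)
    if not suspect_hashes:
        return 0.0
    shared = suspect_hashes & rolling_fingerprints(source, window)
    return len(shared) / len(suspect_hashes)


source_text = "The quick brown fox jumps over the lazy dog near the river bank"
copied = "quick brown fox jumps over the lazy dog"
unrelated = "completely unrelated sentence about quantum chemistry"
copied_score = containment(copied, source_text)
unrelated_score = containment(unrelated, source_text)
```

Containment (rather than Jaccard) is used here because a short copied fragment inside a long source would otherwise score low. A production system would typically store only a sample of the hashes per document to keep the index small.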
Semantic comparison (for paraphrasing):
- Sentence segmentation
- Sentence-BERT embeddings per sentence
- Cosine similarity matrix for all pairs
- Identify pairs with similarity > 0.85
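The last two steps of this pipeline can be sketched as follows. The sketch assumes the sentence embeddings have already been computed by a Sentence-BERT model (for example through the sentence-transformers library); the toy 3-dimensional vectors below only stand in for real embeddings, and `similar_pairs` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_pairs(suspect_vecs, source_vecs, threshold=0.85):
    """Return (i, j, score) for every sentence pair scoring above the threshold."""
    pairs = []
    for i, u in enumerate(suspect_vecs):
        for j, v in enumerate(source_vecs):
            score = cosine(u, v)
            if score > threshold:
                pairs.append((i, j, score))
    return pairs


# Toy vectors: suspect sentence 0 points almost the same way as source sentence 0.
suspect = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
source = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
matches = similar_pairs(suspect, source)
```

The nested loop compares every suspect sentence with every source sentence, which is fine for one document against another but not for a large corpus.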
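Scaling: for a corpus of more than 1M documents, use approximate nearest neighbor (ANN) search via FAISS or Qdrant. Exact search compares the query against every stored vector, which is O(N) per query and does not scale. ANN indexes retrieve the nearest candidates in sublinear time (roughly O(log N) for graph-based indexes such as HNSW), at the cost of occasionally missing a true neighbor.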
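To show why an approximate index avoids scanning the whole corpus, here is a toy sketch based on random-hyperplane hashing. It is a stand-in for illustration only: it is not how FAISS or Qdrant are implemented or called, and the class name `HyperplaneLSH` and its parameters are assumptions made for this example.

```python
import random
from collections import defaultdict


class HyperplaneLSH:
    """Toy approximate index: vectors with the same sign pattern share a bucket."""

    def __init__(self, dim, n_planes=8, seed=0):
        rng = random.Random(seed)
        # Each random hyperplane contributes one bit of the bucket key.
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)]
                       for _ in range(n_planes)]
        self.buckets = defaultdict(list)

    def _key(self, vec):
        # Bit is True when the vector lies on the non-negative side of the plane.
        return tuple(sum(p * x for p, x in zip(plane, vec)) >= 0
                     for plane in self.planes)

    def add(self, doc_id, vec):
        self.buckets[self._key(vec)].append(doc_id)

    def candidates(self, vec):
        # Only one bucket is inspected, not the whole corpus.
        return list(self.buckets.get(self._key(vec), []))


index = HyperplaneLSH(dim=3)
index.add("doc-1", [0.2, 0.7, 0.1])
same_direction = index.candidates([0.4, 1.4, 0.2])     # doc-1 scaled by 2
opposite = index.candidates([-0.2, -0.7, -0.1])        # doc-1 negated
```

Candidates returned by the index would then be re-scored with exact cosine similarity. Vectors that are close but fall on different sides of a hyperplane are missed, which is the recall trade-off that every approximate index makes in some form.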
Integration with External Services
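For academic work, existing services cover most needs: the Antiplagiat.ru API (the Russian standard for universities) and iThenticate (international). A custom system is needed when privacy requirements rule out sending texts to a third party or when matching must run against your own corpus.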
Reporting
The result is a percentage of borrowed text plus a match visualization: matched fragments are highlighted in the text, each with a link to its source. Typical flagging thresholds are 15–20% for academic work and 30–40% for business content.
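One way the borrowing percentage might be computed is sketched below, assuming the detector outputs matched character spans as `(start, end)` pairs with an exclusive end. The function names and the default threshold of 20% (the upper end of the academic range above) are assumptions for illustration.

```python
def borrowed_percentage(text_length, spans):
    """Percentage of characters covered by matched spans, counting overlaps once."""
    if text_length <= 0:
        return 0.0
    covered = 0
    current_end = 0
    for start, end in sorted(spans):
        start = max(start, current_end)  # skip the part that was already counted
        end = min(end, text_length)      # clip spans that run past the text
        if end > start:
            covered += end - start
            current_end = end
    return 100.0 * covered / text_length


def is_flagged(percentage, threshold=20.0):
    """True when the borrowed share reaches the flagging threshold."""
    return percentage >= threshold


# Two overlapping matches and one separate match in a 1000-character text.
spans = [(0, 50), (40, 100), (300, 350)]
share = borrowed_percentage(1000, spans)
flagged = is_flagged(share)
```

Merging overlaps matters because the same passage is often matched against several sources, and counting it more than once would inflate the reported percentage.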







