What is a Plagiarism Checker?
A plagiarism checker is a tool that detects whether text is copied or closely derived from existing sources. It does this by comparing submitted content against databases of web pages, academic papers, and other documents.
How Plagiarism Checkers Work
Submitted text
↓
1. Fingerprinting: Break into n-grams (overlapping word sequences)
↓
2. Search: Compare against large database (web, academic papers, student submissions)
↓
3. Match detection: Find similar or identical passages
↓
4. Similarity report: Show % match and highlight sources
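Steps 1 and 2 above can be sketched with a small inverted index. This is only an illustration under stated assumptions: the function names (`fingerprints`, `build_index`, `candidate_sources`), the 3-word window, the truncated SHA-1 hashes, and the two-document corpus are all hypothetical choices, not how any particular product works.

```python
import hashlib
from collections import defaultdict


def fingerprints(text: str, n: int = 3) -> set[str]:
    """Hash each overlapping n-word sequence into a short fingerprint."""
    words = text.lower().split()
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return {hashlib.sha1(g.encode("utf-8")).hexdigest()[:12] for g in grams}


def build_index(corpus: dict[str, str], n: int = 3) -> dict[str, set[str]]:
    """Map each fingerprint to the ids of the documents that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in corpus.items():
        for fp in fingerprints(text, n):
            index[fp].add(doc_id)
    return index


def candidate_sources(submitted: str, index: dict[str, set[str]], n: int = 3) -> dict[str, float]:
    """For each document, the share of the submission's fingerprints it contains."""
    fps = fingerprints(submitted, n)
    hits: dict[str, int] = defaultdict(int)
    for fp in fps:
        for doc_id in index.get(fp, ()):
            hits[doc_id] += 1
    return {doc_id: count / len(fps) for doc_id, count in hits.items()}


# Hypothetical two-document corpus
corpus = {
    "doc_a": "the quick brown fox jumps over the lazy dog",
    "doc_b": "plagiarism checkers compare text against large databases",
}
index = build_index(corpus)
scores = candidate_sources("a quick brown fox jumps over a sleeping dog", index)
print(scores)  # only doc_a appears: it shares 3 of the submission's 7 trigrams
```

Looking fingerprints up in an index avoids comparing the submission against every stored document in full, which is the reason a lookup structure of this kind suits a large database.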
Technical Implementation (Basic)
```python
from difflib import SequenceMatcher


def basic_plagiarism_check(submitted: str, database: list[str]) -> list[dict]:
    """Compare a submission against every document and report close matches."""
    results = []
    for doc in database:
        # Character-level similarity ratio between 0.0 and 1.0
        similarity = SequenceMatcher(None, submitted.lower(), doc.lower()).ratio()
        if similarity > 0.3:  # 30% threshold
            results.append({
                "similarity": similarity,
                "matched_text": doc[:100],  # preview of the matched document
                "match_percent": f"{similarity:.1%}",
            })
    # Most similar documents first
    return sorted(results, key=lambda x: x["similarity"], reverse=True)


# N-gram fingerprinting (more robust)
def get_ngrams(text: str, n: int = 5) -> set:
    """Return the set of overlapping n-word sequences in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def ngram_similarity(text1: str, text2: str, n: int = 5) -> float:
    """Jaccard similarity: shared n-grams divided by all distinct n-grams."""
    ng1 = get_ngrams(text1, n)
    ng2 = get_ngrams(text2, n)
    intersection = ng1 & ng2
    union = ng1 | ng2
    return len(intersection) / len(union) if union else 0.0


# These short sentences share no 5-word sequence, so use bigrams (n=2) here
similarity = ngram_similarity("The cat sat on the mat", "A cat was sitting on the mat", n=2)
print(f"Similarity: {similarity:.1%}")  # Similarity: 22.2%
```
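The match detection and reporting steps (3 and 4 in the workflow) can be illustrated with `difflib.SequenceMatcher.get_matching_blocks`, which returns the aligned runs two sequences have in common. The sketch below is a minimal one, assuming word-level comparison and a hypothetical helper `matching_passages` with an arbitrary 4-word minimum; real checkers use more elaborate alignment.

```python
from difflib import SequenceMatcher


def matching_passages(submitted: str, source: str, min_words: int = 4) -> list[str]:
    """Return word runs of at least min_words that appear in both texts."""
    sub_words = submitted.lower().split()
    src_words = source.lower().split()
    # Compare lists of words rather than characters
    matcher = SequenceMatcher(None, sub_words, src_words, autojunk=False)
    passages = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_words:
            passages.append(" ".join(sub_words[block.a:block.a + block.size]))
    return passages


submitted = "In my view the quick brown fox jumps over the lazy dog every day"
source = "As everyone knows, the quick brown fox jumps over the lazy dog"

passages = matching_passages(submitted, source)
# Share of the submission's words that fall inside a matched passage
coverage = sum(len(p.split()) for p in passages) / len(submitted.split())

print(passages)                    # ['the quick brown fox jumps over the lazy dog']
print(f"Matched: {coverage:.1%}")  # Matched: 64.3%
```

The returned passages are what a report would highlight, and the coverage figure corresponds to the "% match" shown alongside them.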
Popular Tools
| Tool | Use Case | Database |
|---|---|---|
| Turnitin | Academic/education | Student papers, web, journals |
| Grammarly | Writing assistance | Web content |
| Copyscape | Web content | Web pages |
| iThenticate | Research/publishing | Academic journals |
| Unicheck | Education | Web + student submissions |
| PlagScan | Enterprise | Web + academic |
Types of Plagiarism Detected
| Type | Description |
|---|---|
| Verbatim | Exact word-for-word copy |
| Paraphrasing | Same ideas, different words |
| Mosaic | Mixing quoted and paraphrased content |
| Self-plagiarism | Reusing one's own previous work without disclosure or citation |
| AI-generated | Content from AI tools (newer detectors) |
Modern Plagiarism Checkers vs AI Detectors
| | Plagiarism Checker | AI Detector |
|---|---|---|
| Detects | Copying from sources | AI-generated text |
| Compares against | Document databases | Statistical patterns |
| Accuracy | High for exact matches | Variable (~80-90%) |
| False positives | Low | Higher (10-20%) |
| Turnitin | ✅ Classic function | ✅ Added in 2023 |
Integration in CI/CD (Code Plagiarism)
```python
# Detecting code plagiarism (for programming assignments)
import ast
from difflib import SequenceMatcher


def normalize_code(code: str) -> str:
    '''Normalize Python code by replacing variable and parameter names'''
    try:
        tree = ast.parse(code)
        # Replace variable names with generic placeholders
        for node in ast.walk(tree):
            if isinstance(node, ast.Name):
                node.id = "VAR"
            elif isinstance(node, ast.arg):
                # Parameters are ast.arg nodes, not ast.Name
                node.arg = "VAR"
        return ast.dump(tree)
    except SyntaxError:
        # Fall back to the raw text when the code does not parse
        return code


def code_similarity(code1: str, code2: str) -> float:
    '''Similarity ratio (0.0 to 1.0) between the normalized forms'''
    norm1 = normalize_code(code1)
    norm2 = normalize_code(code2)
    return SequenceMatcher(None, norm1, norm2).ratio()
```
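For the CI/CD angle, one possible shape is a batch check that compares every pair of submissions and reports the pairs above a threshold. The sketch below is self-contained and purely illustrative: `structure_of` and `flag_similar_pairs`, the 0.9 threshold, and the three sample files are assumptions, and a real pipeline would read the files from the repository instead of an in-memory dict.

```python
import ast
from difflib import SequenceMatcher
from itertools import combinations


def structure_of(code: str) -> str:
    """Dump the AST with identifier names replaced, falling back to raw text."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return code
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            node.id = "VAR"
        elif isinstance(node, ast.arg):
            node.arg = "VAR"
    return ast.dump(tree)


def flag_similar_pairs(
    submissions: dict[str, str], threshold: float = 0.9
) -> list[tuple[str, str, float]]:
    """Compare every pair of submissions and return those at or above threshold."""
    normalized = {name: structure_of(code) for name, code in submissions.items()}
    flagged = []
    for (name1, norm1), (name2, norm2) in combinations(normalized.items(), 2):
        score = SequenceMatcher(None, norm1, norm2, autojunk=False).ratio()
        if score >= threshold:
            flagged.append((name1, name2, score))
    return flagged


# Hypothetical submissions: bob.py is alice.py with every variable renamed
submissions = {
    "alice.py": "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n",
    "bob.py": "def total(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n",
    "carol.py": "import math\n\nprint(math.sqrt(16))\n",
}

flagged = flag_similar_pairs(submissions)
for name1, name2, score in flagged:
    print(f"{name1} <-> {name2}: {score:.0%}")  # alice.py <-> bob.py: 100%
```

In a CI job, a non-empty `flagged` list could be turned into a non-zero exit code so the pipeline surfaces the pair for human review. Pairwise comparison grows quadratically with the number of submissions, so this approach is only practical for small batches.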