Concept #132 | Easy | extended-ai-concepts

What is a plagiarism checker?

#gen-ai

Answer

A plagiarism checker is a tool that detects whether text is copied or closely derived from existing sources. It does this by comparing submitted content against databases of web pages, academic papers, and other documents.

How Plagiarism Checkers Work

```text
Submitted text
1. Fingerprinting: Break into n-grams (overlapping word sequences)
2. Search: Compare against large database (web, academic papers, student submissions)
3. Match detection: Find similar or identical passages
4. Similarity report: Show % match and highlight sources
```
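Steps 1 and 2 above can be sketched with hashed n-gram fingerprints and an inverted index, so a submission is only compared against documents that share at least one fingerprint. This is a minimal illustration, not how any particular product works: the helper names (`fingerprints`, `build_index`, `find_candidates`), the 3-word window, and the truncated SHA-1 hash are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict

def fingerprints(text: str, n: int = 3) -> set[str]:
    """Hash every overlapping n-word window into a short fingerprint."""
    words = text.lower().split()
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return {hashlib.sha1(g.encode("utf-8")).hexdigest()[:12] for g in grams}

def build_index(corpus: dict[str, str], n: int = 3) -> dict[str, set[str]]:
    """Map each fingerprint to the ids of the documents that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in corpus.items():
        for fp in fingerprints(text, n):
            index[fp].add(doc_id)
    return index

def find_candidates(submitted: str, index: dict[str, set[str]], n: int = 3) -> dict[str, float]:
    """Per source document, the share of the submission's fingerprints it contains."""
    fps = fingerprints(submitted, n)
    if not fps:
        return {}
    hits: dict[str, int] = defaultdict(int)
    for fp in fps:
        for doc_id in index.get(fp, ()):
            hits[doc_id] += 1
    return {doc_id: count / len(fps) for doc_id, count in hits.items()}

# Hypothetical two-document corpus for illustration.
corpus = {
    "doc_a": "the quick brown fox jumps over the lazy dog",
    "doc_b": "a completely unrelated sentence about databases",
}
index = build_index(corpus)
report = find_candidates("the quick brown fox sleeps", index)
print(report)  # doc_a shares 2 of the 3 submitted fingerprints
```

The index is built once per corpus; each new submission then costs one lookup per fingerprint rather than one full comparison per stored document.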

Technical Implementation (Basic)

```python
from difflib import SequenceMatcher

def basic_plagiarism_check(submitted: str, database: list[str]) -> list[dict]:
    """Compare the submission with every document and report likely matches."""
    results = []

    for doc in database:
        # Character-level similarity ratio between 0.0 and 1.0
        similarity = SequenceMatcher(None, submitted.lower(), doc.lower()).ratio()

        if similarity > 0.3:  # 30% threshold
            results.append({
                "similarity": similarity,
                "matched_text": doc[:100],
                "match_percent": f"{similarity:.1%}"
            })

    # Most similar documents first
    return sorted(results, key=lambda x: x["similarity"], reverse=True)

# N-gram fingerprinting (more robust)
def get_ngrams(text: str, n: int = 5) -> set:
    """Return the set of overlapping n-word sequences in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i+n]) for i in range(len(words) - n + 1)}

def ngram_similarity(text1: str, text2: str, n: int = 5) -> float:
    """Jaccard similarity: shared n-grams divided by all distinct n-grams."""
    ng1 = get_ngrams(text1, n)
    ng2 = get_ngrams(text2, n)
    intersection = ng1 & ng2
    union = ng1 | ng2
    return len(intersection) / len(union) if union else 0.0

# Short sentences need a small n: with the default n=5 these two share no 5-grams and score 0%
similarity = ngram_similarity("The cat sat on the mat", "A cat was sitting on the mat", n=2)
print(f"Similarity: {similarity:.1%}")  # Similarity: 22.2%
```
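The functions above return a score but not the matching passages that a similarity report highlights. One way to recover them, sketched here under assumptions, is to run `SequenceMatcher` over word lists and keep the longer matching blocks. The helper name `matching_passages` and the 4-word minimum are hypothetical choices for this example.

```python
from difflib import SequenceMatcher

def matching_passages(submitted: str, source: str, min_words: int = 4) -> list[str]:
    """Return runs of at least `min_words` consecutive words shared by both texts."""
    sub_words = submitted.lower().split()
    src_words = source.lower().split()
    # autojunk=False stops difflib from discarding very frequent words in long texts
    matcher = SequenceMatcher(None, sub_words, src_words, autojunk=False)
    passages = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_words:
            passages.append(" ".join(sub_words[block.a:block.a + block.size]))
    return passages

submitted = "In my view the mitochondria is the powerhouse of the cell and it matters"
source = "Biology textbooks say the mitochondria is the powerhouse of the cell"
passages = matching_passages(submitted, source)
print(passages)  # ['the mitochondria is the powerhouse of the cell']
```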

Popular Tools

| Tool | Use Case | Database |
|---|---|---|
| Turnitin | Academic/education | Student papers, web, journals |
| Grammarly | Writing assistance | Web content |
| Copyscape | Web content | Web pages |
| iThenticate | Research/publishing | Academic journals |
| Unicheck | Education | Web + student submissions |
| PlagScan | Enterprise | Web + academic |

Types of Plagiarism Detected

| Type | Description |
|---|---|
| Verbatim | Exact word-for-word copy |
| Paraphrasing | Same ideas, different words |
| Mosaic | Mixing quoted and paraphrased content |
| Self-plagiarism | Reusing own previous work without attribution |
| AI-generated | Content from AI tools (newer detectors) |
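Verbatim copying and paraphrasing behave very differently under n-gram matching. The sketch below compares a reordered paraphrase using 4-gram Jaccard similarity and a bag-of-words cosine similarity, which ignores word order. The sentences and helper names are made up for illustration. Cosine over word counts is only a lexical proxy: a paraphrase that swaps in synonyms would still need semantic methods that this example does not implement.

```python
import math
from collections import Counter

def word_ngrams(text: str, n: int) -> set[str]:
    """Set of overlapping n-word sequences."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set[str], b: set[str]) -> float:
    """Shared items divided by all distinct items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cosine_bow(text1: str, text2: str) -> float:
    """Cosine similarity between word-count vectors (word order is ignored)."""
    c1, c2 = Counter(text1.lower().split()), Counter(text2.lower().split())
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

original = "the committee rejected the proposal because of budget concerns"
verbatim = "the committee rejected the proposal because of budget concerns"
paraphrase = "because of budget concerns the proposal was rejected by the committee"

verbatim_score = jaccard(word_ngrams(original, 4), word_ngrams(verbatim, 4))
paraphrase_ngram = jaccard(word_ngrams(original, 4), word_ngrams(paraphrase, 4))
paraphrase_cosine = cosine_bow(original, paraphrase)

print(f"Verbatim, 4-gram Jaccard:   {verbatim_score:.0%}")
print(f"Paraphrase, 4-gram Jaccard: {paraphrase_ngram:.0%}")
print(f"Paraphrase, word cosine:    {paraphrase_cosine:.0%}")
```

The exact copy scores 100%, while the reordered paraphrase keeps only one shared 4-gram and scores under 10% despite using almost the same vocabulary.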

Modern Plagiarism Checkers vs AI Detectors

| | Plagiarism Checker | AI Detector |
|---|---|---|
| Detects | Copying from sources | AI-generated text |
| Compares against | Document databases | Statistical patterns |
| Accuracy | High for exact matches | Variable (~80-90%) |
| False positives | Low | Higher (10-20%) |
| Turnitin | ✅ Classic function | ✅ Added in 2023 |
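To see what a false-positive rate in that range means in practice, here is a small worked calculation. The class size, the split between human-written and AI-generated work, and the 85% detection rate are hypothetical values chosen for illustration. Only the 10% false-positive rate is taken from the table.

```python
def expected_false_flags(honest_submissions: int, false_positive_rate: float) -> float:
    """Expected number of human-written submissions wrongly flagged as AI-generated."""
    return honest_submissions * false_positive_rate

def precision(true_positives: float, false_positives: float) -> float:
    """Share of flagged submissions that really were AI-generated."""
    flagged = true_positives + false_positives
    return true_positives / flagged if flagged else 0.0

# Hypothetical class: 180 human-written and 20 AI-generated submissions.
human, ai = 180, 20
fpr = 0.10  # lower end of the 10-20% range in the table
tpr = 0.85  # assumed detection rate for illustration, not a measured value

false_flags = expected_false_flags(human, fpr)  # about 18 students wrongly flagged
true_flags = ai * tpr                           # about 17 AI submissions caught
flag_precision = precision(true_flags, false_flags)

print(f"Wrongly flagged: {false_flags:.0f}, correctly flagged: {true_flags:.0f}")
print(f"Share of flags that are correct: {flag_precision:.0%}")
```

Under these assumptions roughly half of all flags point at honest work, which is why an AI-detector flag is better treated as a prompt for review than as proof.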

Integration in CI/CD (Code Plagiarism)

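```python
# Detecting code plagiarism (for programming assignments)
import ast
from difflib import SequenceMatcher

def normalize_code(code: str) -> str:
    """Normalize Python code by replacing variable and argument names with placeholders."""
    try:
        tree = ast.parse(code)
        for node in ast.walk(tree):
            if isinstance(node, ast.Name):
                # Variable reads and writes, plus names used in calls
                node.id = "VAR"
            elif isinstance(node, ast.arg):
                # Function parameters, so renamed arguments also match
                node.arg = "VAR"
        return ast.dump(tree)
    except SyntaxError:
        # Fall back to the raw source if it does not parse
        return code

def code_similarity(code1: str, code2: str) -> float:
    """Similarity of two snippets after name normalization, between 0.0 and 1.0."""
    norm1 = normalize_code(code1)
    norm2 = normalize_code(code2)
    return SequenceMatcher(None, norm1, norm2).ratio()
```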
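A CI job could apply a check like this to every pair of submissions and fail the build when a pair is too similar. The sketch below is one possible shape for that gate, not a standard tool: `flag_similar_pairs`, `main`, the 0.8 threshold, and the file names are all hypothetical. To stay self-contained it compares raw source text; in practice `code_similarity` from above could be swapped in for the scoring line.

```python
from difflib import SequenceMatcher
from itertools import combinations

def flag_similar_pairs(
    submissions: dict[str, str], threshold: float = 0.8
) -> list[tuple[str, str, float]]:
    """Compare every pair of submissions and return those at or above the threshold."""
    flagged = []
    for (name_a, code_a), (name_b, code_b) in combinations(sorted(submissions.items()), 2):
        score = SequenceMatcher(None, code_a, code_b).ratio()
        if score >= threshold:
            flagged.append((name_a, name_b, score))
    return flagged

def main(submissions: dict[str, str]) -> int:
    """Print flagged pairs and return an exit code: 1 if any pair is flagged, else 0."""
    flagged = flag_similar_pairs(submissions)
    for name_a, name_b, score in flagged:
        print(f"{name_a} vs {name_b}: {score:.0%} similar")
    return 1 if flagged else 0

# Hypothetical in-memory submissions; a real job would read files from the repository.
demo = {
    "alice.py": "def add(a, b):\n    return a + b\n",
    "bob.py": "def add(a, b):\n    return a + b\n",
    "carol.py": "import os\nprint(sorted(os.listdir('.')))\n",
}
exit_code = main(demo)
# A CI step would pass exit_code to sys.exit() so that a flagged pair fails the job.
```

Pairwise comparison grows quadratically with the number of submissions, so larger cohorts would likely need a fingerprint index to narrow the candidates first.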