Software-assisted plagiarism detection
Computer-assisted plagiarism detection is an information retrieval (IR) task supported by specialized IR systems, referred to as plagiarism detection systems (PDS) or document similarity detection systems. A 2019 systematic literature review presents an overview of state-of-the-art plagiarism detection methods.

In text documents
Systems for text similarity detection implement one of two generic detection approaches: external or intrinsic. External detection systems compare a suspicious document with a reference collection, a set of documents assumed to be genuine. Based on a chosen document model and predefined similarity criteria, the detection task is to retrieve all documents containing text that is similar to text in the suspicious document to a degree above a chosen threshold. Intrinsic PDSes analyze only the text to be evaluated, without performing comparisons to external documents. This approach aims to recognize changes in an author's unique writing style as an indicator of potential plagiarism. PDSes are not capable of reliably identifying plagiarism without human judgment: similarities and writing style features are computed with the help of predefined document models and might represent false positives.
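A minimal sketch of the external detection task just described, assuming a pluggable similarity function and an arbitrary threshold (both hypothetical; concrete document models and similarity measures are discussed under Approaches below):

    def external_detection(suspicious, reference_collection, similarity, threshold=0.8):
        """Return all reference documents containing text similar to the
        suspicious document to a degree above the chosen threshold."""
        return [doc for doc in reference_collection
                if similarity(suspicious, doc) >= threshold]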
Effectiveness of those tools in higher education settings

A study was conducted to test the effectiveness of similarity detection software in a higher education setting. One part of the study assigned one group of students to write a paper. These students were first educated about plagiarism and informed that their work would be run through a content similarity detection system. A second group of students was assigned to write a paper without any information about plagiarism. The researchers expected to find lower rates of plagiarism in the first group, but found roughly the same rates in both groups.
Approaches

The detection approaches currently in use for computer-assisted content similarity detection can be classified by the type of similarity assessment they undertake: global or local. Global similarity assessment approaches use characteristics taken from larger parts of the text, or from the document as a whole, to compute similarity, while local methods only examine pre-selected text segments as input.

Fingerprinting
Fingerprinting is currently the most widely applied approach to content similarity detection. This method forms representative digests of documents by selecting sets of multiple substrings (n-grams) from them. The sets represent the fingerprints of the documents, and their elements are called minutiae. A suspicious document is checked for plagiarism by computing its fingerprint and querying its minutiae against a precomputed index of the fingerprints of all documents in the reference collection.
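A minimal sketch of the idea, assuming word-level n-grams and a plain hash set as the fingerprint (real systems typically index only a selected subset of the hashes, for example via winnowing):

    import hashlib

    def fingerprint(text, n=5):
        """Fingerprint a text as the set of hashed word n-grams (minutiae)."""
        words = text.lower().split()
        ngrams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
        return {hashlib.md5(g.encode()).hexdigest() for g in ngrams}

    def containment(suspicious, reference, n=5):
        """Fraction of the suspicious document's minutiae found in the reference."""
        fs, fr = fingerprint(suspicious, n), fingerprint(reference, n)
        return len(fs & fr) / len(fs) if fs else 0.0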
String matching

String matching is a prevalent approach used in computer science. When applied to the problem of plagiarism detection, documents are compared for verbatim text overlaps. Numerous methods have been proposed to tackle this task, some of which have been adapted to external plagiarism detection. Checking a suspicious document in this setting requires the computation and storage of efficiently comparable representations of all documents in the reference collection, so that they can be compared pairwise. Generally, suffix document models, such as suffix trees or suffix vectors, have been used for this task. Nonetheless, substring matching remains computationally expensive, which makes it a non-viable solution for checking large collections of documents.
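Production systems rely on suffix-based indexes; as a toy illustration of verbatim-overlap detection, the longest text segment shared by two documents can be found with Python's standard difflib (far slower than a suffix tree, but enough to show the principle):

    from difflib import SequenceMatcher

    def longest_verbatim_overlap(doc_a, doc_b):
        """Return the longest text segment occurring verbatim in both documents."""
        matcher = SequenceMatcher(None, doc_a, doc_b, autojunk=False)
        match = matcher.find_longest_match(0, len(doc_a), 0, len(doc_b))
        return doc_a[match.a:match.a + match.size]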
Bag of words

Bag of words analysis represents the adaptation of vector space retrieval, a traditional IR concept, to the domain of content similarity detection. Documents are represented as one or multiple vectors, e.g. for different document parts, which are used for pairwise similarity computations. Similarity computation may then rely on the traditional cosine similarity measure, or on more sophisticated similarity measures.
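A minimal sketch, assuming raw term-frequency vectors and plain cosine similarity (real systems commonly add tf-idf weighting, stop-word removal, and stemming):

    import math
    from collections import Counter

    def cosine_similarity(doc_a, doc_b):
        """Cosine similarity of the term-frequency vectors of two texts."""
        va, vb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
        dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
        norm = (math.sqrt(sum(c * c for c in va.values()))
                * math.sqrt(sum(c * c for c in vb.values())))
        return dot / norm if norm else 0.0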
Citation analysis

Citation-based plagiarism detection (CbPD) relies on citation analysis, and is the only approach to plagiarism detection that does not rely on textual similarity. CbPD examines the citation and reference information in texts to identify similar patterns in the citation sequences.
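A toy sketch of the idea, assuming each document has been reduced to an ordered sequence of cited-source identifiers, with shared citation order measured by the longest common subsequence (one of several conceivable pattern measures; the identifiers below are made up):

    def shared_citation_order(cites_a, cites_b):
        """Length of the longest common subsequence of two citation sequences."""
        m, n = len(cites_a), len(cites_b)
        table = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m):
            for j in range(n):
                table[i + 1][j + 1] = (table[i][j] + 1 if cites_a[i] == cites_b[j]
                                       else max(table[i][j + 1], table[i + 1][j]))
        return table[m][n]

    # Two papers citing three sources in the same relative order:
    print(shared_citation_order(["Smith04", "Lee99", "Chen11"],
                                ["Smith04", "Kim02", "Lee99", "Chen11"]))  # 3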
Stylometry

Stylometry subsumes statistical methods for quantifying an author's unique writing style and is mainly used for authorship attribution or intrinsic plagiarism detection. Detecting plagiarism by authorship attribution requires checking whether the writing style of the suspicious document, supposedly written by a certain author, matches that of a corpus of documents written by the same author. Intrinsic plagiarism detection, on the other hand, uncovers plagiarism based on internal evidence in the suspicious document, without comparing it with other documents. This is performed by constructing and comparing stylometric models for different text segments of the suspicious document; passages that are stylistically different from the others are marked as potentially plagiarized or infringed. Although they are simple to extract, character n-grams have proven to be among the best stylometric features for intrinsic plagiarism detection.
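A minimal sketch of intrinsic detection with character n-gram profiles, assuming cosine distance between each segment's profile and the profile of the document as a whole (the segmentation, the value of n, and the outlier threshold are all hypothetical choices):

    import math
    from collections import Counter

    def char_ngram_profile(text, n=3):
        """Frequency profile of overlapping character n-grams."""
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    def cosine_distance(p, q):
        dot = sum(p[g] * q[g] for g in p.keys() & q.keys())
        norm = (math.sqrt(sum(v * v for v in p.values()))
                * math.sqrt(sum(v * v for v in q.values())))
        return 1.0 - (dot / norm if norm else 0.0)

    def stylistic_outliers(segments, threshold=0.5):
        """Flag segments whose style deviates from the document as a whole."""
        doc_profile = char_ngram_profile("".join(segments))
        return [s for s in segments
                if cosine_distance(char_ngram_profile(s), doc_profile) > threshold]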
Neural networks

More recent approaches assess content similarity using neural networks. Rather than matching surface text, these methods compare learned semantic representations (embeddings) of documents or passages, which allows paraphrased content to be detected, at the cost of greater computational effort.
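A minimal sketch of the comparison step, assuming document embeddings have already been produced by some neural encoder (the encoder itself is out of scope here; any model mapping texts to fixed-length vectors would do):

    import numpy as np

    def rank_by_semantic_similarity(suspicious_vec, reference_matrix):
        """Rank reference documents by cosine similarity of their embeddings
        to the embedding of the suspicious document."""
        norms = (np.linalg.norm(reference_matrix, axis=1)
                 * np.linalg.norm(suspicious_vec))
        scores = reference_matrix @ suspicious_vec / norms
        return np.argsort(scores)[::-1], scores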
Performance

Comparative evaluations of content similarity detection systems indicate that their performance depends on the type of plagiarism present. Except for citation pattern analysis, all detection approaches rely on textual similarity. It is therefore symptomatic that detection accuracy decreases the more a plagiarism case is obfuscated.
Software

The design of content similarity detection software for use with text documents is characterized by a number of factors. Most large-scale plagiarism detection systems use large internal databases (in addition to other resources) that grow with each additional document submitted for analysis. However, this feature is considered by some to be a violation of student copyright.
In source code

Plagiarism in computer source code is also frequent, and requires different tools than those used for text comparisons in documents. Significant research has been dedicated to academic source-code plagiarism.

A distinctive aspect of source-code plagiarism is that there are no essay mills such as can be found in traditional plagiarism. Since most programming assignments expect students to write programs with very specific requirements, it is very difficult to find existing programs that already meet them, and since integrating external code is often harder than writing it from scratch, most plagiarizing students copy from their peers.

According to Roy and Cordy, source-code similarity detection algorithms can be classified as based on either
* Strings – look for exact textual matches of segments, for instance five-word runs. Fast, but can be confused by renaming identifiers.
* Tokens – as with strings, but using a lexer to convert the program into tokens first. This discards whitespace, comments, and identifier names, making the system more robust against simple text replacements. Most academic plagiarism detection systems work at this level, using different algorithms to measure the similarity between token sequences (a minimal token-level comparison is sketched after this list).
* Parse Trees – build and compare parse trees. This allows higher-level similarities to be detected. For instance, tree comparison can normalize conditional statements and detect equivalent constructs as similar to each other.
* Program Dependency Graphs (PDGs) – a PDG captures the actual flow of control in a program, and allows much higher-level equivalences to be located, at a greater expense in complexity and calculation time.
* Metrics – metrics capture 'scores' of code segments according to certain criteria; for instance, "the number of loops and conditionals" or "the number of different variables used". Metrics are simple to calculate and can be compared quickly, but can also lead to false positives: two fragments with the same scores on a set of metrics may do entirely different things.
* Hybrid approaches – for instance, parse trees + suffix trees can combine the detection capability of parse trees with the speed afforded by suffix trees, a type of string-matching data structure.
The previous classification was developed for code refactoring, and not for academic plagiarism detection.
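A toy sketch of token-level comparison, assuming a crude lexer that discards comments and normalizes identifiers and numbers, with Jaccard similarity over token n-grams (academic systems use far more elaborate tokenization and matching):

    import re

    KEYWORDS = {"if", "else", "for", "while", "def", "return"}

    def tokenize(source):
        """Crude lexer: drop comments, keep keywords and punctuation,
        and map identifiers and numbers to placeholder tokens."""
        source = re.sub(r"#.*", "", source)  # strip Python-style comments
        tokens = []
        for tok in re.findall(r"[A-Za-z_]\w*|\d+|\S", source):
            if tok in KEYWORDS or not (tok[0].isalnum() or tok[0] == "_"):
                tokens.append(tok)      # keyword or punctuation: keep as-is
            elif tok[0].isdigit():
                tokens.append("NUM")    # numeric literal
            else:
                tokens.append("ID")     # identifier: its name is irrelevant
        return tokens

    def token_similarity(src_a, src_b, n=4):
        """Jaccard similarity over n-grams of the normalized token sequences."""
        grams = lambda toks: {tuple(toks[i:i + n])
                              for i in range(len(toks) - n + 1)}
        ga, gb = grams(tokenize(src_a)), grams(tokenize(src_b))
        return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

Because identifiers are normalized to a placeholder, wholesale renaming of variables leaves the token sequence, and hence the similarity score, unchanged.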
Complications with the use of text-matching software for plagiarism detection
Various complications have been documented with the use of text-matching software (TMS) when used for plagiarism detection. One of the more prevalent concerns centers on the issue of intellectual property rights. The basic argument is that materials must be added to a database in order for the TMS to effectively determine a match, but adding users' materials to such a database may infringe on their intellectual property rights. The issue has been raised in a number of court cases.

An additional complication is that the software finds only precise matches to other text. It does not pick up poorly paraphrased work, for example, or the practice of plagiarizing by use of sufficient word substitutions to elude detection software, which is known as rogeting. It also cannot evaluate whether a paraphrase genuinely reflects an original understanding or is an attempt to bypass detection.

Another complication is the tendency of TMS to flag much more content than necessary, including legitimate citations and paraphrasing, making it difficult to find real cases of plagiarism. This issue arises because TMS algorithms mainly look at surface-level text similarities without considering the context of the writing. Educators have raised concerns that reliance on TMS may shift focus away from teaching proper citation and writing skills, and may create an oversimplified view of plagiarism that disregards the nuances of student writing. As a result, scholars argue that these false positives can cause fear in students and discourage them from using their authentic voice.

See also
* Plagiarism detectors
* – used to estimate similarity between token sequences in several systems