Stylometry is the application of
the study of linguistic style, usually to written language.
[Argamon, Shlomo, Kevin Burns, and Shlomo Dubnov, eds. The structure of style: algorithmic approaches to understanding manner and meaning. Springer Science & Business Media, 2010.] It has also been applied successfully to music, paintings, and chess.
Stylometry is often used to attribute
authorship to
anonymous
or disputed documents. It has legal as well as academic and literary applications, ranging from the question of the
authorship of Shakespeare's works to
forensic linguistics, and has methodological similarities with the analysis of text
readability.
Stylometry may be used to unmask
pseudonymous or anonymous authors, or to reveal some information about the author short of a full identification. Authors may use adversarial stylometry to resist this identification by eliminating their own stylistic characteristics without changing the meaningful content of their communications. It can defeat analyses that do not account for its possibility, but the ultimate effectiveness of stylometry in an adversarial environment is uncertain: stylometric identification may not be reliable, but nor can non-identification be guaranteed; adversarial stylometry's practice itself may be detectable.
History
Stylometry grew out of earlier techniques of analyzing texts for evidence of authenticity, author identity, and other questions.
The modern practice of the discipline received publicity from the study of authorship problems in English Renaissance drama. Researchers and readers observed that some playwrights of the era had distinctive patterns of language preferences, and attempted to use those patterns to identify authors of uncertain or collaborative works. Early efforts were not always successful: in 1901, one researcher attempted to use
John Fletcher's preference for "'em", the contracted form of "them", as a marker to distinguish between Fletcher and
Philip Massinger in their collaborations, but he mistakenly employed an edition of Massinger's works in which the editor had expanded all instances of "'em" to "them".
The basics of stylometry were established by Polish philosopher
Wincenty Lutosławski in ''Principes de stylométrie'' (1890). Lutosławski used this method to develop a chronology of
Plato's Dialogues.
The development of computers and their capacities for analyzing large quantities of data enhanced this type of effort by orders of magnitude. The great capacity of computers for data analysis, however, did not guarantee good quality output. During the early 1960s, Rev. A. Q. Morton produced a computer analysis of the fourteen Epistles of the New Testament attributed to St. Paul, which indicated that six different authors had written that body of work. A check of his method, applied to the works of
James Joyce, gave the result that ''Ulysses'', Joyce's multi-perspective, multi-style novel, was composed by five separate individuals, none of whom apparently had any part in the crafting of Joyce's first novel, ''A Portrait of the Artist as a Young Man''.
In time, however, and with practice, researchers and scholars have refined their methods to yield better results. One notable early success was the resolution of the disputed authorship of twelve of ''The Federalist Papers'' by Frederick Mosteller and David Wallace.
While there are still questions concerning initial assumptions and methods (and, perhaps, always will be), few now dispute the basic premise that linguistic analysis of written texts can produce valuable information and insight. (Indeed, this was apparent even before the advent of computers: the successful application of a textual/linguistic analysis to the Fletcher canon by
Cyrus Hoy and others yielded clear results during the late 1950s and early 1960s.)
Applications
Applications of stylometry include literary studies, historical studies, social studies, information retrieval, and many forensic cases and studies.
Recently, long-standing debates about anonymous medieval Icelandic sagas have been advanced through its use. It can also be applied to computer code and to intrinsic plagiarism detection, which detects plagiarism from changes in writing style within a single document. Stylometry can also be used to predict whether someone is a native or non-native English speaker from their typing speed.
Stylometry as a method is vulnerable to the distortion of text during revision. An author may also adopt different styles over the course of a career, as demonstrated by Plato, who chose different stylistic policies such as those adopted for the early and middle dialogues addressing the Socratic problem.
Features
Textual features of interest for authorship attribution are, on the one hand, counts of idiosyncratic expressions or constructions (e.g. how the author uses punctuation or how often the author uses agentless passive constructions) and, on the other hand, features similar to those used for readability analysis, such as measures of lexical variation and syntactic variation.
Since authors often have preferences for certain topics, research experiments in authorship attribution mostly remove content words such as nouns, adjectives, and verbs from the feature set, only retaining structural elements of the text to avoid
overfitting
their models to topic rather than author characteristics.
Stylistic features are often computed as averages over a text or over the entire collected works of an author, yielding measures such as average word length or average sentence length. This enables a model to identify authors who have a clear preference for wordy or terse sentences but hides variation: an author with a mix of long and short sentences will have the same average as an author with consistent mid-length sentences. To capture such variation, some experiments use sequences or patterns over observations rather than average observed frequencies, noting e.g. that an author shows a preference for a certain stress or emphasis pattern,
or that an author tends to follow a sequence of long sentences with a short one.
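The point about averages hiding variation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sentence splitter is deliberately naive, and the two length profiles, the function names, and the threshold of 6 words are all hypothetical.

```python
import re
from statistics import mean, pstdev

def sentence_lengths(text):
    """Return the number of words in each sentence (naive split on . ! ?)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def long_then_short_rate(lengths, threshold):
    """Fraction of adjacent sentence pairs where a long sentence precedes a short one."""
    pairs = list(zip(lengths, lengths[1:]))
    if not pairs:
        return 0.0
    hits = sum(1 for a, b in pairs if a > threshold and b <= threshold)
    return hits / len(pairs)

# Two hypothetical length profiles with the same mean sentence length.
mixed = [2, 10, 2, 10, 2, 10]   # alternating short and long sentences
steady = [6, 6, 6, 6, 6, 6]     # consistently mid-length sentences

print(mean(mixed), mean(steady))        # the averages coincide
print(pstdev(mixed), pstdev(steady))    # the spread does not
print(long_then_short_rate(mixed, 6), long_then_short_rate(steady, 6))
```

The mean alone cannot separate the two profiles, while the standard deviation and the sequence-based rate can, which is the motivation for pattern features over plain averages.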
One of the first approaches to authorship identification, by Mendenhall, can be said to aggregate its observations without averaging them.
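As a rough illustration of aggregating without averaging, the sketch below keeps the whole distribution of word lengths rather than a single mean. It assumes that Mendenhall's approach is fairly represented as a word-length frequency curve; the function name and the letters-only tokenisation are hypothetical choices made for this example.

```python
import re
from collections import Counter

def word_length_spectrum(text):
    """Map each word length to the share of words having that length."""
    words = re.findall(r"[A-Za-z]+", text)   # letters only; an assumption
    counts = Counter(len(w) for w in words)
    total = sum(counts.values())
    return {length: counts[length] / total for length in sorted(counts)}

spectrum = word_length_spectrum("a bb bb ccc")
for length, share in spectrum.items():
    print(length, share)
```

Two authors with the same average word length can still produce visibly different spectra, because the full curve is retained.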
More recent authorship attribution models use
vector space models to automatically capture what is specific to an author's style, but they also rely on judicious feature engineering for the same reasons as more traditional models.
Adversarial stylometry
Adversarial stylometry is the practice of altering writing style to reduce the potential for stylometry to discover the author's identity or their characteristics. This task is also known as authorship obfuscation or authorship anonymisation. Stylometry poses a significant privacy challenge in its ability to unmask anonymous authors or to link pseudonyms to an author's other identities, which, for example, creates difficulties for whistleblowers, activists, and hoaxers and fraudsters. The privacy risk is expected to grow as machine learning techniques and text corpora develop.
All adversarial stylometry shares the core idea of faithfully
paraphrasing the source text so that the meaning is unchanged but the stylistic signals are obscured. Such a faithful paraphrase is an
adversarial example for a stylometric classifier. Several broad approaches to this exist, with some overlap: ''imitation'', replacing the author's own style with another's; ''translation'', applying machine translation with the hope that this eliminates characteristic style in the source text; and ''obfuscation'', deliberately modifying a text's style to make it not resemble the author's own.
Manually obscuring style is possible, but laborious; in some circumstances, it is preferable or necessary. Automated tooling, either semi- or fully-automatic, could assist an author. How best to perform the task and the design of such tools is an open research question. While some approaches have been shown to be able to defeat particular stylometric analyses, particularly those that do not account for the potential of adversariality, establishing safety in the face of unknown analyses is an issue. Ensuring the faithfulness of the paraphrase is a critical challenge for automated tools.
It is uncertain if the practice of adversarial stylometry is detectable in itself. Some studies have found that particular methods produced signals in the output text, but a stylometrist who is uncertain of what methods may have been used may not be able to reliably detect them.
Current research
Modern stylometry uses computers for statistical analysis, artificial intelligence, and access to the growing corpus of texts available via the Internet.
[Argamon, Shlomo, Jussi Karlgren, and James G. Shanahan. Stylistic analysis of text for information access. Papers from the workshop held in conjunction with the 28th Annual International ACM Conference on Research and Development in Information Retrieval, August 13–19, 2005, Salvador, Bahia, Brazil. Swedish Institute of Computer Science, 2005.] Software systems such as Signature (freeware produced by Peter Millican of Oxford University), JGAAP (the Java Graphical Authorship Attribution Program, freeware produced by Dr Patrick Juola of Duquesne University), stylo (an open-source R package for a variety of stylometric analyses, including authorship attribution, developed by Maciej Eder, Jan Rybicki and Mike Kestemont) and Stylene for Dutch (online freeware by Prof Walter Daelemans of University of Antwerp and Dr Véronique Hoste of University of Ghent) make its use increasingly practicable, even for the non-expert.
Academic venues and events
Stylometric methods are used for several academic topics, as an application of linguistics, lexicography, or literary study,
in conjunction with natural language processing and machine learning, and applied to plagiarism detection, authorship analysis, or information retrieval.
Forensic linguistics
The International Association of Forensic Linguists (IAFL) organises the Biennial Conference of the International Association of Forensic Linguists (13th edition in 2016 in Porto) and publishes ''The International Journal of Speech, Language and the Law'' with forensic stylistics as one of its central topics.
AAAI
The Association for the Advancement of Artificial Intelligence (AAAI) has hosted several events on subjective and stylistic analysis of text.
PAN
PAN workshops (originally plagiarism analysis, authorship identification, and near-duplicate detection; later, more generally, a workshop on uncovering plagiarism, authorship, and social software misuse) have been organised since 2007, mainly in conjunction with information access conferences such as ACM SIGIR, FIRE, and CLEF. PAN formulates shared challenge tasks for plagiarism detection, authorship identification, author gender identification, author profiling, vandalism detection, [Potthast, Martin, Benno Stein, and Teresa Holfeld. "Overview of the 1st International Competition on Wikipedia Vandalism Detection." In CLEF (Notebook Papers/LABs/Workshops). 2010.] and other related text analysis tasks, many of which hinge on stylometry.
Case studies of interest
* In 1439, Lorenzo Valla showed that the Donation of Constantine was a forgery, an argument based partly on a comparison of the Latin with that used in authentic 4th-century documents.
* In 1952, the Swedish priest Dick Helander was elected bishop of Strängnäs. The campaign was competitive, and Helander was accused of writing a series of a hundred or so anonymous libelous letters about other candidates to the electorate of the bishopric of Strängnäs. Helander was first convicted of writing the letters and lost his position as bishop, but was later partially exonerated. The letters were studied using a number of stylometric measures (and also typewriter characteristics), and the various court cases and further examinations, many contracted by Helander himself during the years until his death in 1978, discussed stylometric method and its value as evidence in some detail. [Text processing text analysis and generation – text typology and attribution. Proceedings of Nobel symposium 51. Edited by Sture Allén. Stockholm: Almqvist & Wiksell international 1982. Data linguistica, 16. Nobel symposium, 51.]
* In 1975, after Ronald Reagan had served as governor of California, he began giving weekly radio commentaries syndicated to hundreds of stations. After his personal notes were made public on his 90th birthday in 2001, a study used stylostatistical methods to determine which of those talks were written by him and which were written by various aides.
* In 1996, the stylometric analysis of the controversial, pseudonymously authored book ''Primary Colors'', performed by Vassar College professor Donald Foster, brought the topic to the attention of a wider audience after correctly identifying the author as Joe Klein. (This case was resolved only after a handwriting analysis confirmed the authorship.)
* In 1996, stylometric methods were used to compare the ''Unabomber Manifesto'' with letters written by one of the suspects, Theodore Kaczynski, which resulted in Kaczynski's apprehension and later conviction.
* In April 2015, researchers using stylometry techniques identified a play, ''Double Falsehood'', as being the work of William Shakespeare. Researchers analyzed 54 plays by Shakespeare and John Fletcher, and compared average sentence length, studied the use of unusual words and quantified the complexity and psychological valence of their language.
* In 2016, MacDonald P. Jackson, Emeritus Professor of English at the University of Auckland, New Zealand and a Fellow of the Royal Society of New Zealand, who had spent his entire academic career analyzing authorship attribution, wrote a book titled ''Who Wrote "The Night Before Christmas"?: Analyzing the Clement Clarke Moore Vs. Henry Livingston Question'', in which he evaluates the opposing arguments and, for the first time, uses the author-attribution techniques of modern computational stylistics to examine the long-standing controversy. Jackson employs a range of tests and introduces a new one, statistical analysis of phonemes; he concludes that Livingston is the true author of the classic work.
* In 2017, Simon Fuller and James O'Sullivan published a study claiming that bestselling author James Patterson does not do any writing in his apparently co-authored novels. According to O'Sullivan, Patterson's collaboration with former U.S. president Bill Clinton, ''The President is Missing'', is an exception to this rule.
* In 2017, a group of linguists, computer scientists, and scholars analysed the authorship of Elena Ferrante's novels. Based on a corpus created at the University of Padua containing 150 novels written by 40 authors, they analyzed Ferrante's style based on seven of her novels. They were able to compare her writing style with that of 39 other novelists using, for example, stylo. The conclusion was the same for all of them: Domenico Starnone is the author behind the name Elena Ferrante.
* In 2018, Mark Glickman, a senior lecturer in statistics at Harvard University, worked with Ryan Song, a former statistics student at Harvard, and Jason Brown, a professor at Dalhousie University in Nova Scotia, applying stylometry to find that, most likely, The Beatles' song "In My Life" was composed by John Lennon, but with a 50% chance that Paul McCartney wrote the middle eight.
* In 2019, the ETSO project (Stylometry applied to the Spanish Golden Age Theater), co-directed by Germán Vega García-Luengos (University of Valladolid), managed to gather 3000 plays of the Spanish Golden Age. After applying stylometric analysis, the attribution of ''Mujeres y criados'' to Lope de Vega was ratified, and an authorship problem was detected in ''La monja alférez'', a play attributed to Pérez de Montalbán which, thanks to these analyses and to historical and philological research, was eventually attributed to Juan Ruiz de Alarcón. In 2023, the same project identified Lope de Vega as the author of ''La francesa Laura'' (The Frenchwoman Laura), even though the manuscript was written years after his death. The comedy was classified as a late work of Lope de Vega and dated from 1628 to 1630, as its flattering treatment of France could be attributed to the momentary good relationship between Spain and France during the Thirty Years' War, having England as a common enemy. In this analysis, the 500 most frequent words of the text under investigation are compared with the 500 most frequent words of the rest of the works. In the case of ''La francesa Laura'', the analysis found that the 100 works closest to it were almost all by Lope de Vega. Machine learning methods, such as support vector machine analysis, were also applied with a large range of parameters. The traditional philological analysis of the authorship of the works has confirmed the investigations of stylometry and artificial intelligence.
* In 2020, Rachel McCarthy and James O'Sullivan argued that Emily Brontë is the true author of ''Wuthering Heights'', ending speculation by some critics that the novel might have been written by one of her siblings, specifically either Branwell or Charlotte.
* In 2020, Hartmut Ilsemann used Rolling Delta and Rolling Classify from the R Stylo program suite to show that the Marlowe corpus is stylistically inhomogeneous, and that the author of the two ''Tamburlaines'' was hardly present in the remaining official corpus of Marlowe.
* In 2022, the Italian scholars Simone Rebora and Massimo Salgaro showed, using John F. Burrows' "Delta distance" method, that Felix Salten is the most probable author of the anonymous novel ''Josefine Mutzenbacher'' from 1906, the final pages excluded.
* In 2023, the Swedish journalist Lapo Lappin claimed that two crime novels by the Swedish author Camilla Läckberg may be the work of a ghost writer, presumably her editor Pascal Engman. This claim was first denied by the author and her spokesperson, but later Läckberg admitted that she and Pascal Engman work very closely together and that he edits her texts.
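Several of the case studies above rely on Burrows' Delta or on comparisons of the most frequent words. As a minimal sketch, Delta is commonly described as the mean absolute difference between the z-scored relative frequencies of the most frequent words in two texts. The code below follows that description; the function names, the tokeniser, the use of the population standard deviation, and the toy corpus are assumptions made for illustration, not the procedure of any study cited here.

```python
import re
from collections import Counter
from statistics import mean, pstdev

def tokens_of(text):
    return re.findall(r"[a-z']+", text.lower())

def most_frequent_words(texts, n):
    """The n most frequent words over all texts taken together."""
    counts = Counter()
    for text in texts:
        counts.update(tokens_of(text))
    return [word for word, _ in counts.most_common(n)]

def relative_freqs(text, vocab):
    tokens = tokens_of(text)
    counts = Counter(tokens)
    total = len(tokens) or 1          # avoid division by zero on empty text
    return [counts[w] / total for w in vocab]

def burrows_delta(corpus, a, b, n_words=500):
    """Delta between corpus[a] and corpus[b]; corpus maps label -> text."""
    vocab = most_frequent_words(corpus.values(), n_words)
    freqs = {label: relative_freqs(text, vocab) for label, text in corpus.items()}
    total, used = 0.0, 0
    for i in range(len(vocab)):
        column = [f[i] for f in freqs.values()]
        sd = pstdev(column)
        if sd == 0:
            continue                  # same rate in every text: no signal
        mu = mean(column)
        z_a = (freqs[a][i] - mu) / sd
        z_b = (freqs[b][i] - mu) / sd
        total += abs(z_a - z_b)
        used += 1
    return total / used if used else 0.0

toy = {"x": "the cat the dog", "y": "the cat the dog", "z": "a cat a dog a"}
print(burrows_delta(toy, "x", "y", n_words=4))
print(burrows_delta(toy, "x", "z", n_words=4))
```

A smaller Delta means the two texts use the common words at more similar rates. Attribution studies typically compare a disputed text against many candidate authors and look for the nearest ones.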
Data and methods
Since stylometry has both descriptive use cases, which characterise the content of a collection, and identificatory use cases, e.g. identifying authors or categories of texts, the methods used to analyse the data and features above range from those built to classify items into sets to those built to distribute items in a space of feature variation. Most methods are statistical in nature, such as cluster analysis and discriminant analysis, are typically based on philological data and features, and are fruitful application domains for modern machine learning methods.
Whereas in the past, stylometry emphasized the rarest or most striking elements of a text, contemporary techniques can isolate identifying patterns even in common parts of speech. Most systems are based on lexical statistics, i.e. using the frequencies of words and terms in the text to characterise the text (or its author). In this context, unlike for information retrieval, the observed occurrence patterns of the most common words are more interesting than the topical terms which are less frequent. [Biber, Douglas. Variation across speech and writing. Cambridge University Press, 1991.]
The primary stylometric method is the writer invariant: a property held in common by all texts, or at least all texts long enough to admit of analysis yielding statistically significant results, written by a given author. An example of a writer invariant is the frequency of function words used by the writer.
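A function-word profile of this kind can be sketched as follows. The ten-word list, the tokeniser, and the choice of Manhattan distance for comparing two profiles are assumptions made for this example; practical studies use far longer word lists and a variety of distance measures.

```python
import re
from collections import Counter

# Small illustrative list; an assumption, not a standard inventory.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "but", "with"]

def function_word_profile(text, vocabulary=FUNCTION_WORDS):
    """Relative frequency of each function word among all tokens of the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return [0.0] * len(vocabulary)
    counts = Counter(tokens)
    return [counts[word] / total for word in vocabulary]

def manhattan_distance(p, q):
    """Sum of absolute differences between two equally long profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))

profile = function_word_profile("The cat and the dog")
print(profile)
```

If the frequencies really are invariant for a writer, profiles of two long texts by the same author should lie closer together than profiles of texts by different authors.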
In one such method, the text is analyzed to find the 50 most common words. The text is then divided into 5,000-word chunks, and each chunk is analyzed to find the frequency of those 50 words in that chunk. This generates a 50-number identifier for each chunk, which places the chunk at a point in a 50-dimensional space. This 50-dimensional space is flattened into a plane using principal components analysis (PCA). This results in a display of points that correspond to an author's style. If two literary works are placed on the same plane, the resulting pattern may show whether both works were by the same author or by different authors.
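The procedure described above can be sketched in plain Python. This is an illustration under stated assumptions rather than a reference implementation: the tokeniser is simplistic, a trailing partial chunk is dropped, and PCA is approximated by power iteration with deflation instead of a library routine. All function names are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def chunk_vectors(tokens, n_words=50, chunk_size=5000):
    """Return (vocabulary, rows) with one relative-frequency row per full chunk."""
    vocab = [w for w, _ in Counter(tokens).most_common(n_words)]
    rows = []
    # A trailing chunk shorter than chunk_size is ignored.
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        counts = Counter(tokens[start:start + chunk_size])
        rows.append([counts[w] / chunk_size for w in vocab])
    return vocab, rows

def center(rows):
    """Subtract the column means so each dimension has mean zero."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def top_component(rows, iterations=200):
    """Leading principal direction of centered rows, by power iteration."""
    dim = len(rows[0])
    v = [1.0 / (i + 1) for i in range(dim)]    # arbitrary non-zero start
    for _ in range(iterations):
        scores = [dot(row, v) for row in rows]  # X v
        w = [sum(s * row[j] for s, row in zip(scores, rows)) for j in range(dim)]
        norm = dot(w, w) ** 0.5                 # length of X^T X v
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def project_2d(rows):
    """Coordinates of each row on the first two principal components."""
    x = center(rows)
    pc1 = top_component(x)
    s1 = [dot(row, pc1) for row in x]
    # Deflation: remove the first component before searching for the second.
    residual = [[a - s * b for a, b in zip(row, pc1)] for row, s in zip(x, s1)]
    pc2 = top_component(residual)
    s2 = [dot(row, pc2) for row in residual]
    return list(zip(s1, s2))
```

To compare two works, their tokens would be concatenated before calling `chunk_vectors`, so that both are measured against the same vocabulary, and the resulting points plotted with a marker per work. With real data a numerical library's PCA would normally be preferred to this hand-rolled version.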
Gaussian statistics
Stylometric data are distributed according to the Zipf–Mandelbrot law. The distribution is extremely spiky and leptokurtic, which is why researchers could not readily use standard statistics to solve problems such as authorship attribution. Nevertheless, Gaussian statistics can be used after applying a data transformation.
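The effect of a transformation on heavy-tailed data can be illustrated with a short Python sketch. The text does not say which transformation is meant, so the logarithm used here is only an assumption chosen for illustration, as are the parameter values `q=2.7` and `s=1.0` and the function names.

```python
import math


def zipf_mandelbrot(n_ranks, q=2.7, s=1.0):
    """Relative frequencies proportional to 1 / (rank + q)**s for ranks 1..n_ranks.
    The values of q and s are illustrative assumptions."""
    weights = [1.0 / (r + q) ** s for r in range(1, n_ranks + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def excess_kurtosis(values):
    """Population excess kurtosis: 0 for a Gaussian, positive when leptokurtic."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / (m2 ** 2) - 3.0


freqs = zipf_mandelbrot(1000)
logged = [math.log(f) for f in freqs]  # assumed transformation: natural log

raw_kurtosis = excess_kurtosis(freqs)
log_kurtosis = excess_kurtosis(logged)
print(f"raw: {raw_kurtosis:.1f}, after log: {log_kurtosis:.1f}")
```

The raw frequencies have a much larger excess kurtosis than the log-transformed ones. The log-transformed values are still not Gaussian, so this sketch only shows the direction of the effect, not a complete method.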
Neural networks
Neural networks, a special case of statistical machine learning methods, have been used to analyze authorship of texts. Texts of undisputed authorship are used to train a neural network by processes such as backpropagation, such that training error is calculated and used to update the process to increase accuracy. Through a process akin to non-linear regression, the network gains the ability to generalize its recognition ability to new texts to which it has not yet been exposed, classifying them to a stated degree of confidence. Such techniques were applied to the long-standing claims of collaboration of Shakespeare with his contemporaries John Fletcher and Christopher Marlowe, and confirmed the opinion, based on more conventional scholarship, that such collaboration had indeed occurred.
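As a rough illustration of training by backpropagation, the following Python sketch fits a network with one hidden layer to a handful of feature vectors. It is not the setup used in the studies mentioned above: the network size, learning rate, feature values, and the idea of using two word rates as inputs are all hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, hidden=3, epochs=2000, lr=0.5, seed=0):
    """Train a one-hidden-layer network by backpropagation and return a
    function mapping a feature vector to a confidence in (0, 1)."""
    rng = random.Random(seed)
    d = len(samples[0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0

    def forward(x):
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(w1, b1)]
        y = sigmoid(sum(w * hi for w, hi in zip(w2, h)) + b2)
        return h, y

    for _ in range(epochs):
        for x, t in zip(samples, labels):
            h, y = forward(x)
            # Output error for a sigmoid output with cross-entropy loss.
            delta_out = y - t
            # Propagate the error back through the hidden layer.
            delta_h = [delta_out * w2[j] * h[j] * (1.0 - h[j])
                       for j in range(hidden)]
            for j in range(hidden):
                w2[j] -= lr * delta_out * h[j]
                b1[j] -= lr * delta_h[j]
                for i in range(d):
                    w1[j][i] -= lr * delta_h[j] * x[i]
            b2 -= lr * delta_out
    return lambda x: forward(x)[1]


# Hypothetical training data: two scaled word rates per text,
# label 1 for author A and 0 for author B.
texts = [[0.2, 0.9], [0.3, 0.8], [0.1, 0.7], [0.8, 0.2], [0.9, 0.1], [0.7, 0.3]]
authors = [0, 0, 0, 1, 1, 1]
confidence_author_a = train(texts, authors)
```

Calling `confidence_author_a([0.85, 0.15])` on an unseen text returns a value between 0 and 1 that can be read as the network's confidence that the text is by author A.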
A 1999 study showed that a neural network program reached 70% accuracy in determining the authorship of poems it had not yet analyzed. This study from Vrije Universiteit examined identification of poems by three Dutch authors using only letter sequences such as "den".
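Letter-sequence features of the kind mentioned above are simple to extract. The sketch below counts overlapping three-letter sequences inside each word; it is an illustrative assumption about how such features might be computed, not a description of the 1999 study's procedure, and the sample phrase is invented.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping letter sequences of length n within each word."""
    counts = Counter()
    for word in text.lower().split():
        letters = "".join(ch for ch in word if ch.isalpha())
        for i in range(len(letters) - n + 1):
            counts[letters[i:i + n]] += 1
    return counts


def relative_frequencies(counts):
    """Turn raw counts into proportions so texts of different length compare."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


profile = char_ngrams("In den beginne, de denker")
```

Here `profile["den"]` is 2, because the sequence occurs once in "den" and once in "denker". The relative frequencies of such sequences can then serve as input features for a classifier.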
Another study used deep belief networks (DBN) to build an authorship verification model applicable to continuous authentication (CA).
One problem with this method of analysis is that the network can become biased based on its training set, possibly selecting authors the network has analyzed more often.
Genetic algorithms
The genetic algorithm is another machine learning technique used for stylometry. This involves a method that starts with a set of rules. An example rule might be, "If ''but'' appears more than 1.7 times in every thousand words, then the text is by author X". The program is presented with text and uses the rules to determine authorship. The rules are tested against a set of known texts and each rule is given a fitness score. The 50 rules with the lowest scores are discarded. The remaining 50 rules are given small changes and 50 new rules are introduced. This is repeated until the evolved rules attribute the texts correctly.
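The rule-evolution loop described above can be sketched as follows. The rule representation (a word and a threshold), the candidate word list, the mutation size, and the toy texts are all hypothetical choices made to keep the example small; a real system would use a richer rule language and real corpora.

```python
import random

WORDS = ["but", "and", "upon", "while"]  # hypothetical candidate words


def rate_per_thousand(tokens, word):
    return 1000.0 * tokens.count(word) / len(tokens)


def random_rule(rng):
    """A rule (word, threshold) reads: 'the text is by author X if word
    occurs more than threshold times per thousand words'."""
    return (rng.choice(WORDS), rng.uniform(0.0, 50.0))


def mutate(rule, rng):
    """Give a rule a small change by nudging its threshold."""
    word, threshold = rule
    return (word, max(0.0, threshold + rng.gauss(0.0, 1.0)))


def fitness(rule, known_texts):
    """Fraction of known texts labelled correctly by the rule.
    known_texts is a list of (tokens, is_author_x) pairs."""
    word, threshold = rule
    hits = sum((rate_per_thousand(tokens, word) > threshold) == is_x
               for tokens, is_x in known_texts)
    return hits / len(known_texts)


def evolve(known_texts, pop_size=100, generations=30, seed=0):
    rng = random.Random(seed)
    population = [random_rule(rng) for _ in range(pop_size)]
    best = population[0]
    for _ in range(generations):
        ranked = sorted(population, key=lambda r: fitness(r, known_texts),
                        reverse=True)
        if fitness(ranked[0], known_texts) > fitness(best, known_texts):
            best = ranked[0]
        if fitness(best, known_texts) == 1.0:
            break  # the rule attributes every known text correctly
        survivors = ranked[:pop_size // 2]  # drop the lowest-scoring half
        population = ([mutate(r, rng) for r in survivors]
                      + [random_rule(rng) for _ in range(pop_size - len(survivors))])
    return best


# Toy known texts: author X uses "but" far more often than the other author.
known = [
    (["but"] * 20 + ["filler"] * 980, True),
    (["but"] * 25 + ["filler"] * 975, True),
    (["but"] * 2 + ["filler"] * 998, False),
    (["but"] * 5 + ["filler"] * 995, False),
]
best_rule = evolve(known)
```

On this toy data the example rule from the text, `("but", 1.7)`, labels every text as author X and so scores only 0.5, while any threshold between 5 and 20 separates the two authors perfectly.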
Rare pairs
One method for identifying style is termed "rare pairs" and relies upon individual habits of collocation. The use of certain words may, for a particular author, be associated idiosyncratically with the use of other, predictable words.
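The text does not specify how pairs are defined or scored, so the following Python sketch rests on assumptions: a pair is taken to be two different words occurring within a small distance of each other, and a pair counts as "rare" if the author uses it repeatedly while a reference corpus does not use it at all. The function names, thresholds, and sample phrases are illustrative only.

```python
from collections import Counter


def pair_counts(tokens, max_gap=2):
    """Count unordered pairs of different words at most max_gap positions apart."""
    counts = Counter()
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + 1 + max_gap]:
            if other != word:
                counts[tuple(sorted((word, other)))] += 1
    return counts


def rare_pairs(author_tokens, reference_tokens, min_author=2,
               max_reference=0, max_gap=2):
    """Pairs the author uses at least min_author times that appear at most
    max_reference times in the reference corpus."""
    author = pair_counts(author_tokens, max_gap)
    reference = pair_counts(reference_tokens, max_gap)
    return {pair for pair, n in author.items()
            if n >= min_author and reference[pair] <= max_reference}


author_text = "sweet sorrow and sweet bitter sorrow".split()
reference_text = "bread and sweet wine".split()
habits = rare_pairs(author_text, reference_text)
```

In this example `habits` contains only the pair `("sorrow", "sweet")`. The pair `("and", "sweet")` also recurs in the author's text but is excluded because the reference text uses it too.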
Authorship attribution in instant messaging
The diffusion of the internet has shifted the attention of authorship attribution towards online texts (web pages, blogs, etc.), electronic messages (e-mails, tweets, posts, etc.), and other types of written information that are far shorter than an average book, much less formal, and more diverse in terms of expressive elements such as colors, layout, fonts, graphics, emoticons, etc. Efforts to take such aspects into account at the level of both structure and syntax have been reported. In addition, content-specific and idiosyncratic cues (e.g., topic models and grammar checking tools) were introduced to unveil deliberate stylistic choices.
Standard stylometric features have been employed to categorize the content of a chat by instant messaging, or the behavior of the participants, but attempts to identify chat participants are still few and at an early stage. Furthermore, the similarity between spoken conversations and chat interactions has been neglected, even though it is a major difference between chat data and any other type of written information.
See also
* Data re-identification
* Digital watermarking
* Moshe Koppel
* Quantitative linguistics
* Steganography
* Writeprint
Notes
References
* Van Droogenbroeck, Frans J. (2016). Handling the Zipf distribution in computerized authorship attribution.
* Van Droogenbroeck, Frans J. (2019). An essential rephrasing of the Zipf-Mandelbrot law to solve authorship attribution applications by Gaussian statistics.
* Zhai, Wanyue; Rusert, Jonathan; Shafiq, Zubair; Srinivasan, Padmini (2022). "A Girl Has A Name, And It's ... Adversarial Authorship Attribution for Deobfuscation". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 7372–7384. doi:10.18653/v1/2022.acl-long.509. arXiv:2203.11849.
Further reading
See also the academic journal ''Literary and Linguistic Computing'', now ''Digital Scholarship in the Humanities'' (published by the University of Oxford) and the ''Language Resources and Evaluation'' journal (previously ''Computers and the Humanities'').
External links
* Association for Computers and the Humanities
* Literary and Linguistic Computing
* Computational Stylistics Group
* ''JGAAP'' Authorship Attribution Program
* Uncovering the Mystery of J.K. Rowling's Latest Novel