Stylometry

Stylometry is the application of the study of linguistic style, usually to written language. It has also been applied successfully to music and to fine-art paintings (Argamon, Shlomo, Kevin Burns, and Shlomo Dubnov, eds. ''The Structure of Style: Algorithmic Approaches to Understanding Manner and Meaning''. Springer Science & Business Media, 2010). Another conceptualization defines it as the linguistic discipline that evaluates an author's style through the application of statistical analysis to a body of their work. Stylometry is often used to attribute authorship to anonymous or disputed documents. It has legal as well as academic and literary applications, ranging from the question of the authorship of Shakespeare's works to forensic linguistics, and it has methodological similarities with the analysis of text readability.


History

Stylometry grew out of earlier techniques of analyzing texts for evidence of authenticity, author identity, and other questions. The modern practice of the discipline received publicity from the study of authorship problems in English Renaissance drama. Researchers and readers observed that some playwrights of the era had distinctive patterns of language preferences, and attempted to use those patterns to identify the authors of uncertain or collaborative works. Early efforts were not always successful: in 1901, one researcher attempted to use John Fletcher's preference for "'em", the contracted form of "them", as a marker to distinguish between Fletcher and Philip Massinger in their collaborations, but he mistakenly employed an edition of Massinger's works in which the editor had expanded all instances of "'em" to "them".

The basics of stylometry were established by the Polish philosopher Wincenty Lutosławski in ''Principes de stylométrie'' (1890). Lutosławski used this method to develop a chronology of Plato's dialogues.

The development of computers and their capacity for analyzing large quantities of data enhanced this type of effort by orders of magnitude. That capacity, however, did not guarantee good-quality output. During the early 1960s, Rev. A. Q. Morton produced a computer analysis of the fourteen Epistles of the New Testament attributed to St. Paul, which indicated that six different authors had written that body of work. A check of his method, applied to the works of James Joyce, gave the result that ''Ulysses'', Joyce's multi-perspective, multi-style novel, was composed by five separate individuals, none of whom apparently had any part in the crafting of Joyce's first novel, ''A Portrait of the Artist as a Young Man''.

In time, however, and with practice, researchers and scholars have refined their methods to yield better results. One notable early success was the resolution of the disputed authorship of twelve of ''The Federalist Papers'' by Frederick Mosteller and David Wallace. While there are still questions concerning initial assumptions and methods (and, perhaps, always will be), few now dispute the basic premise that linguistic analysis of written texts can produce valuable information and insight. (Indeed, this was apparent even before the advent of computers: the successful application of a textual/linguistic analysis to the Fletcher canon by Cyrus Hoy and others yielded clear results during the late 1950s and early 1960s.)


Applications

Applications of stylometry include literary studies, historical studies, social studies, information retrieval, and many forensic cases and studies. It can also be applied to computer code and to intrinsic plagiarism detection, which detects plagiarism from changes in writing style within a document. Stylometry can also be used to predict, from typing speed, whether someone is a native or non-native English speaker. As a method, stylometry is vulnerable to the distortion of text during revision. An author may also adopt different styles in the course of a career, as was demonstrated in the case of Plato, who chose different stylistic policies, such as those adopted for the early and middle dialogues addressing the Socratic problem.


Features

Textual features of interest for authorship attribution fall into two groups. On the one hand, they count occurrences of idiosyncratic expressions or constructions (e.g. how the author uses punctuation, or how often the author uses agentless passive constructions). On the other hand, they resemble the features used for readability analysis, such as measures of lexical variation and syntactic variation. Since authors often have preferences for certain topics, research experiments in authorship attribution mostly remove content words such as nouns, adjectives, and verbs from the feature set, retaining only structural elements of the text to avoid overfitting their models to topic rather than author characteristics.

Stylistic features are often computed as averages over a text or over the entire collected works of an author, yielding measures such as average word length or average sentence length. This enables a model to identify authors who have a clear preference for wordy or terse sentences, but it hides variation: an author with a mix of long and short sentences will have the same average as an author with consistent mid-length sentences. To capture such variation, some experiments use sequences or patterns over observations rather than average observed frequencies, noting e.g. that an author shows a preference for a certain stress or emphasis pattern, or that an author tends to follow a sequence of long sentences with a short one. One of the very first approaches to authorship identification, by Mendenhall, can be said to aggregate its observations without averaging them. More recent authorship attribution models use vector space models to automatically capture what is specific to an author's style, but they also rely on judicious feature engineering for the same reasons as more traditional models.
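The point about averages hiding variation can be illustrated with a small Python sketch. It is not taken from any study mentioned here: the two sample texts are invented, the sentence splitter is deliberately naive, and the function names are hypothetical. The word-length histogram is included only as one plausible way of aggregating observations without averaging them.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def sentence_lengths(text):
    """Return the number of words in each sentence (naive split on . ! ?)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def word_length_spectrum(text):
    """Count how many words have each length, keeping the whole distribution."""
    words = re.findall(r"[A-Za-z']+", text)
    return Counter(len(w) for w in words)


# Two invented toy "authors" whose sentences have the same mean length.
steady = "We walk the dog now. They read a book here. You cook the rice well."
mixed = "Go. Stop. After a long and very tiring day we all finally went back home."

for name, text in (("steady", steady), ("mixed", mixed)):
    lengths = sentence_lengths(text)
    # The mean is identical for both texts; only the spread tells them apart.
    print(name, "mean:", mean(lengths), "spread:", round(pstdev(lengths), 2))
    print(name, "word lengths:", sorted(word_length_spectrum(text).items()))
```

Both texts average five words per sentence, so a model that sees only the mean cannot separate them, whereas the spread and the full word-length distribution differ.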


Current research

Modern stylometry uses computers for statistical analysis, artificial intelligence, and access to the growing corpus of texts available via the Internet (Argamon, Shlomo, Jussi Karlgren, and James G. Shanahan. ''Stylistic Analysis of Text for Information Access''. Papers from the workshop held in conjunction with the 28th Annual International ACM Conference on Research and Development in Information Retrieval, August 13–19, 2005, Salvador, Bahia, Brazil. Swedish Institute of Computer Science, 2005). Software systems make its use increasingly practicable, even for the non-expert. They include Signature (freeware produced by Dr Peter Millican of Oxford University); JGAAP (the Java Graphical Authorship Attribution Program, freeware produced by Dr Patrick Juola of Duquesne University); stylo (an open-source R package for a variety of stylometric analyses, including authorship attribution, developed by Maciej Eder, Jan Rybicki and Mike Kestemont); and Stylene for Dutch (online freeware by Prof Walter Daelemans of the University of Antwerp and Dr Véronique Hoste of the University of Ghent).


Academic venues and events

Stylometric methods are used for several academic topics, as an application of linguistics, lexicography, or literary study, in conjunction with natural language processing and machine learning, and applied to plagiarism detection, authorship analysis, or information retrieval.


Forensic linguistics

The International Association of Forensic Linguists (IAFL) organises the Biennial Conference of the International Association of Forensic Linguists (13th edition in 2016 in Porto) and publishes ''The International Journal of Speech, Language and the Law'', with forensic stylistics as one of its central topics.


AAAI

The Association for the Advancement of Artificial Intelligence (AAAI) has hosted several events on subjective and stylistic analysis of text.


PAN

PAN workshops (originally "plagiarism analysis, authorship identification, and near-duplicate detection", later more generally a workshop on "uncovering plagiarism, authorship, and social software misuse") have been organised since 2007, mainly in conjunction with information access conferences such as ACM SIGIR, FIRE, and CLEF. PAN formulates shared challenge tasks for plagiarism detection, authorship identification, author gender identification, author profiling, vandalism detection (Potthast, Martin, Benno Stein, and Teresa Holfeld. "Overview of the 1st International Competition on Wikipedia Vandalism Detection." In CLEF (Notebook Papers/LABs/Workshops), 2010), and other related text analysis tasks, many of which hinge on stylometry.


Case studies of interest

* In 1439, Lorenzo Valla showed that the Donation of Constantine was a forgery, an argument based partly on a comparison of the Latin with that used in authentic 4th-century documents.
* In 1952, the Swedish priest Dick Helander was elected bishop of Strängnäs. The campaign was competitive, and Helander was accused of writing a series of a hundred-odd anonymous libellous letters about other candidates to the electorate of the bishopric of Strängnäs. Helander was first convicted of writing the letters and lost his position as bishop, but was later partially exonerated. The letters were studied using a number of stylometric measures (and also typewriter characteristics), and the various court cases and further examinations, many contracted by Helander himself during the years until his death in 1978, discussed stylometric method and its value as evidence in some detail (Sture Allén, ed. ''Text Processing: Text Analysis and Generation, Text Typology and Attribution''. Proceedings of Nobel Symposium 51. Data linguistica 16. Stockholm: Almqvist & Wiksell International, 1982. 653 pp.).
* In 1975, after Ronald Reagan had served as governor of California, he began giving weekly radio commentaries syndicated to hundreds of stations. After his personal notes were made public on his 90th birthday in 2001, a study used stylostatistical methods to determine which of those talks were written by him and which were written by various aides.
* In 1996, the stylometric analysis of the controversial, pseudonymously authored book ''Primary Colors'', performed by Vassar College professor Donald Foster, brought the topic to the attention of a wider audience after correctly identifying the author as Joe Klein. (This case was resolved only after a handwriting analysis confirmed the authorship.)
* In 1996, stylometric methods were used to compare the Unabomber manifesto with letters written by one of the suspects, Theodore Kaczynski, which resulted in Kaczynski's apprehension and later conviction.
* In April 2015, researchers using stylometry techniques identified a play, ''Double Falsehood'', as being the work of William Shakespeare. The researchers analyzed 54 plays by Shakespeare and John Fletcher, compared average sentence length, studied the use of unusual words, and quantified the complexity and psychological valence of the language.
* In 2016, MacDonald P. Jackson, Emeritus Professor of English at the University of Auckland, New Zealand, and a Fellow of the Royal Society of New Zealand, who had spent his entire academic career analyzing authorship attribution, wrote a book titled ''Who Wrote "The Night Before Christmas"?: Analyzing the Clement Clarke Moore Vs. Henry Livingston Question'', in which he evaluates the opposing arguments and, for the first time, uses the author-attribution techniques of modern computational stylistics to examine the long-standing controversy. Jackson employs a range of tests and introduces a new one, statistical analysis of phonemes; he concludes that Livingston is the true author of the classic work.
* In 2017, Simon Fuller and James O'Sullivan published a study claiming that bestselling author James Patterson does not do any writing in his apparently co-authored novels. According to O'Sullivan, Patterson's collaboration with former U.S. president Bill Clinton, ''The President Is Missing'', is an exception to this rule.
* In 2017, a group of linguists, computer scientists, and scholars analysed the authorship of Elena Ferrante. Based on a corpus created at the University of Padua containing 150 novels written by 40 authors, they analyzed Ferrante's style based on seven of her novels. They were able to compare her writing style with that of 39 other novelists using, for example, stylo. The conclusion was the same for all of them: Domenico Starnone is the author writing as Elena Ferrante.
* In 2018, Mark Glickman, a senior lecturer in statistics at Harvard University, worked with Ryan Song, a former statistics student at Harvard, and Jason Brown, a professor at Dalhousie University in Nova Scotia, applying stylometry to find that, most likely, the Beatles' song "In My Life" was composed by John Lennon, but with a 50% chance that Paul McCartney wrote the middle eight.
* In 2019, the ETSO project (Stylometry applied to the Spanish Golden Age Theater), directed by Álvaro Cuéllar González and Germán Vega García-Luengos (University of Valladolid), gathered more than 1200 plays of the Spanish Golden Age. After stylometric analysis, the attribution of ''Mujeres y criados'' to Lope de Vega was ratified, and an authorship problem was detected in ''La monja alférez'', a play attributed to Pérez de Montalbán which, thanks to these analyses and to historical and philological research, was eventually attributed to Juan Ruiz de Alarcón.
* In 2020, Rachel McCarthy and James O'Sullivan argued that Emily Brontë is the true author of ''Wuthering Heights'', ending speculation by some critics that the novel might have been written by one of her siblings, specifically either Branwell or Charlotte.
* In 2020, Hartmut Ilsemann used Rolling Delta and Rolling Classify from the R stylo program suite to show that the Marlowe corpus is stylistically inhomogeneous, and that the author of the two ''Tamburlaines'' was hardly present in the remaining official corpus of Marlowe.


Data and methods

Stylometry has both descriptive use cases, in which it characterises the content of a collection, and identificatory use cases, e.g. identifying authors or categories of texts. The methods used to analyse the data and features above accordingly range from those built to classify items into sets to those that distribute items in a space of feature variation. Most methods are statistical in nature, such as cluster analysis and discriminant analysis; they are typically based on philological data and features, and are fruitful application domains for modern machine learning methods.

Whereas in the past stylometry emphasized the rarest or most striking elements of a text, contemporary techniques can isolate identifying patterns even in common parts of speech. Most systems are based on lexical statistics, i.e. they use the frequencies of words and terms in a text to characterise the text (or its author). In this context, unlike in information retrieval, the observed occurrence patterns of the most common words are more interesting than the topical terms, which are less frequent (Biber, Douglas. ''Variation across Speech and Writing''. Cambridge University Press, 1991).

The primary stylometric method is the writer invariant: a property held in common by all texts written by a given author, or at least by all texts long enough to admit of analysis yielding statistically significant results. An example of a writer invariant is the frequency of function words used by the writer.

In one such method, the text is analyzed to find its 50 most common words. The text is then divided into 5,000-word chunks, and each chunk is analyzed to find the frequency of those 50 words in that chunk. This generates a 50-number identifier for each chunk, which places the chunk at a point in a 50-dimensional space. The 50-dimensional space is flattened into a plane using principal components analysis (PCA). This results in a display of points that correspond to an author's style. If two literary works are placed on the same plane, the resulting pattern may show whether both works were by the same author or by different authors.
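The chunk-and-project procedure can be sketched in Python using only the standard library. This is a minimal illustration under stated assumptions, not the implementation of any tool named in this article: all function names are hypothetical, the vocabulary is taken from the pooled texts (the description does not say how it is chosen when two works are compared), relative rather than absolute frequencies are used, and the principal axes are approximated by plain power iteration with deflation instead of a library PCA routine.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def chunk_profiles(tokens, vocab, chunk_size):
    """Relative frequency of each vocabulary word in each full chunk."""
    profiles = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        counts = Counter(tokens[start:start + chunk_size])
        profiles.append([counts[word] / chunk_size for word in vocab])
    return profiles


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def principal_axes(centered, k=2, iterations=300):
    """Approximate the top-k covariance eigenvectors by power iteration."""
    n, d = len(centered), len(centered[0])
    cov = [[sum(row[i] * row[j] for row in centered) / n for j in range(d)]
           for i in range(d)]
    axes = []
    for _axis in range(k):
        v = [1.0 / (i + 1) for i in range(d)]  # arbitrary deterministic start
        for _step in range(iterations):
            w = [dot(cov_row, v) for cov_row in cov]
            for found in axes:  # deflation: strip directions already found
                proj = dot(w, found)
                w = [a - proj * b for a, b in zip(w, found)]
            norm = math.sqrt(dot(w, w))
            if norm < 1e-12:  # no variance left in any new direction
                v = [0.0] * d
                break
            v = [x / norm for x in w]
        axes.append(v)
    return axes


def stylometric_map(texts, n_words=50, chunk_size=5000):
    """Return one (label, x, y) point per chunk; texts maps label -> text.

    Every text is assumed to contain at least one full chunk.
    """
    tokens = {label: tokenize(text) for label, text in texts.items()}
    pooled = Counter(tok for toks in tokens.values() for tok in toks)
    vocab = [word for word, _count in pooled.most_common(n_words)]
    labels, rows = [], []
    for label, toks in tokens.items():
        for profile in chunk_profiles(toks, vocab, chunk_size):
            labels.append(label)
            rows.append(profile)
    means = [sum(col) / len(rows) for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    first, second = principal_axes(centered)
    return [(label, dot(row, first), dot(row, second))
            for label, row in zip(labels, centered)]
```

A call such as `stylometric_map({"work A": text_a, "work B": text_b})` (with `text_a` and `text_b` standing in for the reader's own strings) returns points that can be passed to any plotting tool; chunks from works with similar function-word habits would be expected to land near each other. In practice a tested PCA implementation is preferable to this hand-rolled power iteration, which may converge slowly when two components explain similar amounts of variance.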


Gaussian statistics

Stylometric data are distributed according to the Zipf-Mandelbrot law. The distribution is extremely spiky and leptokurtic, which is why researchers could not apply Gaussian statistics directly to problems such as authorship attribution. Gaussian statistics can nevertheless be used after applying a suitable data transformation.
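As a rough illustration of why a transformation helps, the sketch below generates rank frequencies following the Zipf-Mandelbrot form f(r) ∝ 1 / (r + b)^a and compares the excess kurtosis before and after a logarithmic transformation. The text does not say which transformation is meant, so the logarithm is an assumption chosen for simplicity. The parameter values `a=1.0` and `b=2.7` and the function names are likewise hypothetical.

```python
import numpy as np


def zipf_mandelbrot(n_ranks, a=1.0, b=2.7):
    """Relative frequencies proportional to 1 / (r + b)**a for ranks 1..n_ranks."""
    ranks = np.arange(1, n_ranks + 1)
    weights = 1.0 / (ranks + b) ** a
    return weights / weights.sum()


def excess_kurtosis(x):
    """Sample excess kurtosis: 0 for a Gaussian, large for spiky, heavy-tailed data."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)


freqs = zipf_mandelbrot(1000)
raw = excess_kurtosis(freqs)
logged = excess_kurtosis(np.log(freqs))  # log transform is an assumed example
print(f"excess kurtosis raw: {raw:.1f}, after log transform: {logged:.1f}")
```

In this toy setting the logarithm sharply reduces the excess kurtosis but does not bring it to zero, so it should be read as a demonstration of the idea and not as the specific transformation used in the literature.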


Neural networks

Neural networks, a special case of statistical machine learning methods, have been used to analyze the authorship of texts. Texts of undisputed authorship are used to train a neural network by processes such as backpropagation, in which the training error is calculated and used to update the network's weights so as to increase accuracy. Through a process akin to non-linear regression, the network becomes able to generalize to new texts to which it has not yet been exposed, classifying them with a stated degree of confidence. Such techniques were applied to the long-standing claims of collaboration between Shakespeare and his contemporaries John Fletcher and Christopher Marlowe, and confirmed the opinion, based on more conventional scholarship, that such collaboration had indeed occurred.

A 1999 study from Vrije Universiteit showed that a neural network program reached 70% accuracy in determining the authorship of poems it had not yet analyzed. The study examined the identification of poems by three Dutch authors using only letter sequences such as "den". Another study used deep belief networks (DBN) to build an authorship verification model applicable to continuous authentication (CA). One problem with this method of analysis is that the network can become biased by its training set, possibly favoring authors whose texts it has analyzed more often.
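The training procedure described above might look roughly like the following sketch: a small feedforward network with one hidden layer, trained by full-batch backpropagation on letter-sequence frequencies. None of the cited studies' architectures are given in the text, so the network size, learning rate, the list of letter sequences in `TRIGRAMS`, and the four toy sentences are all hypothetical choices made for illustration.

```python
import numpy as np

TRIGRAMS = ["den", "een", "het", "ing"]  # hypothetical letter sequences


def trigram_features(text):
    """Occurrences of each chosen letter sequence per 10 characters."""
    text = text.lower()
    return np.array([10.0 * text.count(t) / max(len(text), 1) for t in TRIGRAMS])


def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def train(X, y, n_authors, hidden=8, lr=0.2, epochs=1000, seed=0):
    """Train a one-hidden-layer network by full-batch backpropagation."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, (X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, (hidden, n_authors))
    b2 = np.zeros(n_authors)
    Y = np.eye(n_authors)[y]  # one-hot targets
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)  # forward pass
        P = softmax(H @ W2 + b2)
        losses.append(float(-np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))))
        dZ2 = (P - Y) / len(X)  # cross-entropy gradient w.r.t. the logits
        dW2, db2 = H.T @ dZ2, dZ2.sum(axis=0)
        dZ1 = (dZ2 @ W2.T) * (1.0 - H ** 2)  # propagate the error back through tanh
        dW1, db1 = X.T @ dZ1, dZ1.sum(axis=0)
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    return (W1, b1, W2, b2), losses


def predict_proba(params, X):
    """Per-author probabilities, usable as a stated degree of confidence."""
    W1, b1, W2, b2 = params
    return softmax(np.tanh(X @ W1 + b1) @ W2 + b2)


# Toy training set: made-up sentences standing in for texts of undisputed authorship.
texts = ["in den beginne was den hemel", "den boer ging naar den markt",
         "het is een mooie dag", "een kat en een hond"]
labels = np.array([0, 0, 1, 1])
X = np.array([trigram_features(t) for t in texts])
params, losses = train(X, labels, n_authors=2)
print(predict_proba(params, X).round(2))
```

With so few training texts the network can only memorize. The bias problem mentioned above would show up here as soon as one author contributed many more training texts than the other.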


Genetic algorithms

The genetic algorithm is another machine learning technique used for stylometry. The method starts with a set of rules. An example rule might be, "If ''but'' appears more than 1.7 times in every thousand words, then the text is by author X". The program is presented with text and uses the rules to determine authorship. The rules are tested against a set of known texts, and each rule is given a fitness score. The 50 rules with the lowest scores are discarded. The remaining 50 rules are given small changes, and 50 new rules are introduced. This is repeated until the evolved rules attribute the texts correctly.
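The loop described above can be sketched as follows, assuming a population of 100 rules (50 discarded plus 50 kept). The rule encoding, the marker words, the threshold range, the mutation step, the two-author setting in which a rule that does not fire counts as a vote against its author, and the toy word rates in `KNOWN_TEXTS` are all hypothetical choices for the example.

```python
import random

WORDS = ["but", "and", "the", "of"]  # hypothetical marker words
AUTHORS = ["X", "Y"]                 # hypothetical author labels

# Toy "known texts": word rates per thousand words, paired with the true author.
KNOWN_TEXTS = [
    ({"but": 2.1, "and": 31.0, "the": 60.2, "of": 28.0}, "X"),
    ({"but": 2.4, "and": 29.5, "the": 58.9, "of": 30.1}, "X"),
    ({"but": 1.2, "and": 30.2, "the": 61.0, "of": 29.3}, "Y"),
    ({"but": 0.9, "and": 28.8, "the": 59.5, "of": 27.7}, "Y"),
]


def random_rule(rng):
    """A rule is (word, threshold per 1,000 words, author claimed when it fires)."""
    return (rng.choice(WORDS), rng.uniform(0.0, 5.0), rng.choice(AUTHORS))


def mutate(rule, rng, step=0.5):
    """Give a rule a small change by nudging its threshold."""
    word, threshold, author = rule
    return (word, max(0.0, threshold + rng.uniform(-step, step)), author)


def fitness(rule, known_texts):
    """Share of known texts on which the rule's verdict matches the true author."""
    word, threshold, author = rule
    hits = 0
    for rates, true_author in known_texts:
        fires = rates.get(word, 0.0) > threshold
        hits += fires == (true_author == author)
    return hits / len(known_texts)


def evolve(known_texts, size=100, max_generations=200, seed=0):
    """Evolve rules until the best one attributes every known text correctly."""
    rng = random.Random(seed)

    def score(rule):
        return fitness(rule, known_texts)

    population = [random_rule(rng) for _ in range(size)]
    for _ in range(max_generations):
        population.sort(key=score, reverse=True)
        if score(population[0]) == 1.0:
            break
        survivors = population[: size // 2]            # lowest-scoring half is dropped
        changed = [mutate(r, rng) for r in survivors]  # small changes to the rest
        fresh = [random_rule(rng) for _ in range(size - len(changed))]
        population = changed + fresh
    population.sort(key=score, reverse=True)
    return population


rules = evolve(KNOWN_TEXTS)
print(rules[0], fitness(rules[0], KNOWN_TEXTS))
```

The stopping test here looks only at the single best rule. Combining several rules into a joint verdict would be a natural extension, but the description does not say how that is done.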


Rare pairs

One method for identifying style is termed "rare pairs" and relies upon individual habits of collocation. For a particular author, the use of certain words may be associated idiosyncratically with the use of other, predictable words.
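One possible reading of this idea, offered only as a sketch, is to look for word pairs that a disputed text shares with a candidate author's known writing but that are rare in a background corpus. The description above does not define the method in detail, so treating a "pair" as two adjacent words, the `max_background_count` cutoff, and the function names are assumptions.

```python
from collections import Counter
import re


def word_pairs(text):
    """Count adjacent word pairs (a simple stand-in for collocations)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(tokens, tokens[1:]))


def shared_rare_pairs(disputed, candidate, background, max_background_count=0):
    """Pairs found in both texts that occur at most a few times in the background."""
    bg = word_pairs(background)
    shared = word_pairs(disputed).keys() & word_pairs(candidate).keys()
    return sorted(p for p in shared if bg[p] <= max_background_count)
```

A longer list of shared rare pairs for one candidate than for the others would count as evidence in that candidate's favor, subject to the usual caveats about text length and genre.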


Authorship attribution in instant messaging

The diffusion of the internet has shifted the attention of authorship attribution towards online texts (web pages, blogs, etc.), electronic messages (e-mails, tweets, posts, etc.), and other types of written information that are far shorter than an average book, much less formal, and more diverse in terms of expressive elements such as colors, layout, fonts, graphics, emoticons, etc. Efforts to take such aspects into account at the level of both structure and syntax have been reported. In addition, content-specific and idiosyncratic cues (e.g., topic models and grammar checking tools) have been introduced to unveil deliberate stylistic choices. Standard stylometric features have been employed to categorize the content of a chat over instant messaging, or the behavior of the participants, but attempts at identifying chat participants are still few and at an early stage. Furthermore, the similarity between spoken conversations and chat interactions has been neglected, even though it is a major difference between chat data and any other type of written information.


See also

* Linguistics and the Book of Mormon, Stylometry (Wordprint Studies)
* Moshe Koppel
* Quantitative linguistics
* Writeprint


References

* Van Droogenbroeck, Frans J. (2016). Handling the Zipf distribution in computerized authorship attribution.
* Van Droogenbroeck, Frans J. (2019). An essential rephrasing of the Zipf-Mandelbrot law to solve authorship attribution applications by Gaussian statistics.
* Zenkov, Andrei V. (2018). "A Method of Text Attribution Based on the Statistics of Numerals". ''Journal of Quantitative Linguistics''. 25 (3): 256–270. doi:10.1080/09296174.2017.1371915. S2CID 49692378.


Further reading

See also the academic journal ''Literary and Linguistic Computing'', now ''Digital Scholarship in the Humanities'' (published by the University of Oxford), and the journal ''Language Resources and Evaluation'' (previously ''Computers and the Humanities'').


External links


* Association for Computers and the Humanities
* Literary and Linguistic Computing
* Computational Stylistics Group
* ''JGAAP'' Authorship Attribution Program
* Uncovering the Mystery of J.K. Rowling's Latest Novel