Author Profiling

Author profiling is the analysis of a given set of texts in an attempt to uncover various characteristics of the author, based on stylistic and content-based features, or to identify the author. The characteristics most commonly analysed are age and gender, though more recent studies have looked at other characteristics, such as personality traits and occupation. Author profiling is one of the three major fields in automatic authorship identification (AAI), the other two being authorship attribution and authorship identification. The process of AAI emerged at the end of the 19th century. Thomas Corwin Mendenhall, an American autodidact physicist and meteorologist, was the first to apply this process, to the works of Francis Bacon, William Shakespeare, and Christopher Marlowe. Mendenhall sought to uncover quantitative stylistic differences among these three historic figures by inspecting word lengths. Although much progress has been made in the 21st century, author profiling remains an unsolved problem due to its difficulty.
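The word-length comparison attributed to Mendenhall above can be illustrated with a minimal sketch. It is not a reconstruction of his actual procedure: the function name `word_length_profile` and the simple letters-only tokenisation are assumptions made for illustration.

```python
import re
from collections import Counter


def word_length_profile(text):
    """Return the relative frequency of each word length found in `text`.

    Illustrative helper (hypothetical name): words are taken to be runs of
    ASCII letters, which is a simplification.
    """
    words = re.findall(r"[A-Za-z]+", text)
    counts = Counter(len(word) for word in words)
    total = sum(counts.values())
    # An empty text yields an empty profile, so no division by zero occurs.
    return {length: counts[length] / total for length in sorted(counts)}


sample_profile = word_length_profile("to be or not to be")
```

Profiles computed this way for two bodies of text can then be compared length by length; the comparison method itself is left open here.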


Techniques

Through the analysis of texts, various author profiling techniques can be applied to predict information about the author. For example, function words and part-of-speech analysis can be used to determine the author's gender and the truthfulness of a text. The process of author profiling usually involves the following steps (López-Monroy, A. P., Montes-y-Gómez, M., Escalante, H. J., Villaseñor-Pineda, L. & Stamatatos, E. (2015). "Discriminative subprofile-specific representations for author profiling in social media." Knowledge-Based Systems, 89, 134–147):
# Identifying specific features to be extracted from the text
# Building an adopted, standard representation (e.g. bag-of-words model) for the target profile
# Building a classification model using a standard classifier (e.g. support vector machines) for the target profile
Machine learning algorithms for author profiling have become increasingly complex over time. Algorithms used in author profiling include:
* Support vector machines (Lundeqvist, E. & Svensson, M. (2017). "Author profiling: A machine learning approach towards detecting gender, age and native language of users in social media." Department of Information Technology)
* Naive Bayes classifiers
* Deep averaging networks, multi-layer neural networks that take the mean of the word embeddings within a text as input
* Long short-term memory (Bsi, B. & Zrigui, M. (2018). "Deep learning techniques for author profiling in social media content." 31st IBIMA Conference)
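The three steps above can be sketched end to end with a bag-of-words representation and a naive Bayes classifier, one of the algorithms listed. This is a minimal illustration using only the Python standard library, not the method of any cited study; the class name `NaiveBayesProfiler`, the age-group labels and the toy training texts are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def bag_of_words(text):
    """Steps 1-2: extract lowercase word tokens and count them, ignoring order."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


class NaiveBayesProfiler:
    """Step 3: multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # per-class word frequencies
        self.class_counts = Counter(labels)      # documents per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            bow = bag_of_words(text)
            self.word_counts[label].update(bow)
            self.vocab.update(bow)
        return self

    def predict(self, text):
        bow = bag_of_words(text)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word, freq in bow.items():
                if word in self.vocab:  # words never seen in training are skipped
                    score += freq * math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data, far too small for real use.
train_texts = [
    "homework exam school friends lol",
    "school class exam tomorrow lol",
    "mortgage office meeting pension",
    "office commute mortgage meeting taxes",
]
train_labels = ["younger", "younger", "older", "older"]
profiler = NaiveBayesProfiler().fit(train_texts, train_labels)
predicted = profiler.predict("exam at school tomorrow")
```

Naive Bayes is used here only because it is short to write out; a support vector machine or another listed classifier could take the same bag-of-words counts as input.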
In the past, author profiling was limited to physical documents, often in the form of books and newspaper articles. Different combinations of the authors' textual attributes, including lexical and syntactic features, were identified and analysed. Pioneering research in author profiling focused mostly on a single genre until the shift towards author profiling on social media and the Internet (Bilan, I. & Zhekova, D. (2016). "CAPS: A cross-genre author profiling system." CLEF). While attributes such as content words and POS tags are effective for author profile predictions on physical documents, their effectiveness on digital texts varies and depends on the type of online content being analysed. With advances in technology, author profiling on the Internet has become increasingly common. Digital texts, such as social media posts, blog posts and emails, are now being used. This has sparked greater research efforts because of the advantages that analysing digital texts can bring to sectors like marketing and business. Author profiling on digital texts has also enabled predictions of a wider range of author characteristics, such as personality, income and occupation. The most effective attributes for author profiling on digital texts involve a combination of stylistic and content features. Author profiling on digital texts focuses on cross-genre author profiling, whereby one genre is used for training data and another genre is used for testing data, though the two need to be relatively similar for good results. Author profiling on online texts faces some problems, including:
* Wide variation in the lengths of the texts used
* Class imbalance in the data
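One common way to soften the class imbalance mentioned above is to weight each class inversely to its frequency when training a classifier. The sketch below shows that heuristic only; it is not taken from the cited studies, and the function name `balanced_class_weights` and the example labels are assumptions.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes receive weights above 1 and frequent classes below 1,
    so that each class contributes equally in total.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Hypothetical label distribution: three texts of one class, one of another.
weights = balanced_class_weights(["18-24", "18-24", "18-24", "50-64"])
```

The resulting weights can be passed to any learner that accepts per-class or per-sample weights.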


Author profiling and the Internet

The rise of the Internet in the late 20th and early 21st centuries catalysed an increase in author profiling research, since data could be mined from the web, including social media platforms, emails and blogs. Content from the web has been analysed in author profiling tasks to identify the age, gender, geographic origins, nationality and psychometric traits of web users. The information obtained has been used to serve various applications, including marketing and forensics.


Social media

The increased integration of social media into people's daily lives has made it a rich source of textual data for author profiling. This is mainly because users frequently upload and share content for various purposes, including self-expression, socialisation and personal businesses. Social bots are also a frequent feature of social media platforms, especially Twitter, generating content that may be analysed for author profiling (Rangel, F., & Rosso, P. (2019). Overview of the 7th author profiling task at PAN 2019: Bots and gender profiling in Twitter. CLEF). While different platforms contain similar data, they may also contain different features depending on the format and structure of the particular platform. There are still limitations in using social media as a data source for author profiling, because the data obtained may not always be reliable or accurate. Users sometimes provide false information about themselves or withhold information (Rosso, P., Rangel, F., Farías, I. H., Cagnina, L., Zaghouani, W., & Charfi, A. (2018). A survey on author profiling, deception, and irony detection for the Arabic language. Language and Linguistics Compass, 12(4)). As a result, the training of algorithms for author profiling may be impeded by inaccurate data. Another limitation is the irregularity of text in social media. Such irregularity includes deviations from normal linguistic standards, such as spelling errors, unstandardised transliteration (for example, the substitution of letters with numbers), shorthand and user-created abbreviations for phrases, which may pose a challenge to author profiling (Gómez-Adorno, H., Markov, I., Sidorov, G., Posadas-Durán, J.-P., Sanchez-Perez, M. A., & Chanona-Hernandez, L. (2016). "Improving Feature Representation Based on a Neural Network for Author Profiling in Social Media Texts." Computational Intelligence and Neuroscience, pp. 1–13). Researchers have adopted methods to overcome these limitations in training their algorithms for author profiling.
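One simple way to reduce the irregularities described above is to normalise tokens before feature extraction. The following is a rough sketch of that idea, not a method from the cited papers; the digit-to-letter table `DIGIT_TO_LETTER` and the abbreviation list `ABBREVIATIONS` are small hypothetical examples.

```python
import re

# Hypothetical lookup tables; a real system would need far larger ones.
DIGIT_TO_LETTER = str.maketrans(
    {"0": "o", "1": "l", "3": "e", "4": "a", "5": "s", "7": "t"}
)
ABBREVIATIONS = {"brb": "be right back", "idk": "i do not know", "u": "you"}


def normalise(text):
    """Lowercase the text, expand known abbreviations and undo digit-for-letter swaps."""
    normalised = []
    for token in re.findall(r"\w+", text.lower()):
        if token in ABBREVIATIONS:
            normalised.append(ABBREVIATIONS[token])
            continue
        has_letter = any(ch.isalpha() for ch in token)
        has_digit = any(ch.isdigit() for ch in token)
        # Only rewrite digits inside mixed tokens, so real numbers such as
        # years are left untouched.
        if has_letter and has_digit:
            token = token.translate(DIGIT_TO_LETTER)
        normalised.append(token)
    return " ".join(normalised)


cleaned = normalise("IDK, h3ll0 w0rld in 2019")
```

Normalisation of this kind trades information for regularity: the irregular spellings it removes may themselves be useful stylistic features, so whether to apply it is a design choice.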


Facebook

Facebook is useful for author profiling studies as a social networking service. This is because of how a social network may be built, expanded, and used for social action on the site. In such processes, users share personal content that may be used for author profiling studies. Textual data for author profiling is obtained from users' personal posts on Facebook, such as 'status updates' (Hsieh, F.C., Sandroni, R.F., & Paraboni, I. (2018). Author Profiling from Facebook Corpora. LREC). These are collected to produce a corpus in the selected language(s), creating either a bilingual or multilingual database of content words (Fatima, M., Hasan, K., Anwar, S., & Nawab, R. M. A. (2017). "Multilingual author profiling on Facebook." Information Processing & Management, 53(4), 886–904), which may then be used for author profiling. In the context of Facebook, author profiling mainly involves English textual data, but also uses non-English languages, including Roman Urdu, Arabic, Brazilian Portuguese and Spanish. While author profiling studies on Facebook have predominantly addressed gender and age-group identification, there have been attempts to derive attributes to predict religiosity, the IT background of users, and even basic emotions (as defined by Paul Ekman), among others (Rangel, F., & Rosso, P. (2013). Use of Language and Author Profiling: Identification of Gender and Age).


Weibo

Sina Weibo is one of the few Asian social media platforms containing texts in Asian languages to have been analysed for author profiling. The primary content of focus for author profiling on Weibo includes classical Chinese characters, hashtags, emoticons, kaomoji, homogeneous punctuation, Latin sequences (due to the multilingualism of the text) and even poetic formats. Particularly popular Chinese expressions, POS tags and word types are also tracked for author profiling (Zhang, W., Caines, A., Alikaniotis, D., & Buttery, P. (2015). "Predicting author age from Weibo microblog posts." LREC). Author profiling for Weibo content requires algorithms different from those used for other social media platforms, mainly due to the linguistic differences between Mandarin Chinese and Western languages. For example, Chinese emoticons involve Chinese characters describing the gesture or facial expression in brackets, such as 'laughter', 'tears', 'giggle', 'love' and 'heart'. This differs from the use of punctuation symbols for emoticons in Western languages, or the common use of Unicode emojis on other platforms such as Facebook and Instagram. Further, while there are around 161 Western emoticons, there are around 2900 emoticons regularly used in mainland China for web content such as Weibo (Chen, L., Qian, T., Wang, F., You, Z., Peng, Q., & Zhong, M. (2015). "Age Detection for Chinese Users in Weibo." WAIM 2015, LNCS 9098, 83–95). To tackle these differences, author profiling algorithms have been trained on Chinese emoticons and linguistic features. For example, author profiling algorithms have been designed to detect Chinese stylistic expressions expressing formality and sentiment, in place of algorithms detecting English linguistic features such as capital letters. Compared with other more popular, globalised platforms, texts on Weibo are not as commonly used in the task of author profiling. This is likely due to the concentration of Weibo's user base in mainland China, which limits its usage predominantly to Chinese nationals. Studies done for this platform have used bots and machine learning algorithms to identify authors' age and gender. Data is acquired from the Weibo microblog posts of willing participants to be analysed, and used to train algorithms that build concept-based profiles of users to a certain accuracy.
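The bracketed emoticons described above lend themselves to extraction with a regular expression. The sketch below assumes the emoticons appear as one to six Chinese characters inside square brackets, for example [哈哈]; that delimiter, the length limit and the function name `emoticon_counts` are assumptions rather than details given in the cited studies.

```python
import re
from collections import Counter

# Assumed format: 1-6 CJK characters inside square brackets, e.g. [哈哈].
BRACKETED_EMOTICON = re.compile(r"\[([\u4e00-\u9fff]{1,6})\]")


def emoticon_counts(post):
    """Count each bracketed emoticon label occurring in a post."""
    return Counter(BRACKETED_EMOTICON.findall(post))


# Hypothetical post containing two distinct emoticon labels.
counts = emoticon_counts("今天真开心[哈哈][哈哈] 但是[泪]")
```

Counts of this kind can be added to the feature set alongside hashtags, punctuation and word-level features.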


Chat logs

Chat logs have been studied for author profiling as they include much textual discourse, the analysis of which has contributed to applied studies in areas including social trends and forensic science. Sources of chat log data for author profiling include platforms such as Yahoo!, AIM and WhatsApp. Computational systems have been devised to produce concept-based profiles listing the chat topics discussed in a single chat room or by individual users.


Blogs

Author profiling can be used to identify characteristics of blog writers, such as their age, gender and geographical location, based on their different writing styles (Pham, D.D., Tran, G.B., & Pham, S.B. (2009). Author Profiling for Vietnamese Blogs. 2009 International Conference on Asian Language Processing, 190–194). This is especially useful when it comes to anonymous blogs. The choice of content words, style-based features and topic-based features is analysed to discover characteristics of the author. In general, features that frequently occur in blogs include a high proportion of verbs per text and a relatively high use of pronouns. The frequencies of verbs, pronouns and other word classes are used to profile and classify emotions in authors' writing, as well as their gender and age. Classification models that were used for author profiling on physical documents in the past, such as support vector machines, have also been tested on blogs. However, they have been found unsuitable for blogs due to their low performance. The machine learning algorithms that work well for author profiling on blogs include:
* Instance-based learning
* Random decision forests
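The word-class frequencies mentioned above can be approximated without a part-of-speech tagger for closed classes such as pronouns. The sketch below computes the share of tokens that are English personal or possessive pronouns; the word list `PRONOUNS` is a small hypothetical sample, and open classes such as verbs would need a real tagger, which is not shown.

```python
import re

# Hypothetical, incomplete list of English pronouns used for illustration.
PRONOUNS = {
    "i", "me", "my", "mine", "we", "us", "our", "you", "your",
    "he", "him", "his", "she", "her", "they", "them", "their", "it", "its",
}


def pronoun_rate(text):
    """Return the fraction of word tokens in `text` that are listed pronouns."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(word in PRONOUNS for word in words) / len(words)


rate = pronoun_rate("I told her that we like it")
```

A rate of this kind is one numeric feature per text; it would normally be combined with other style and content features before classification.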


Email

Email has been a consistent focus for author profiling due to the rich textual data that can be found in the various sections of a typical email platform. These sections include the sent, inbox, spam, trash and archived folders (Estival, D., Gaustad, T., Pham, S. B., Radford, W., & Hutchinson, B. (2007). Author Profiling for English Emails). Multilingual approaches to author profiling for emails have included English, Spanish and Arabic emails as data sources, among others. Through author profiling, details of email users may be identified, such as their age, gender, geographical origin, level of education, nationality and even psychometric personality traits, including neuroticism, agreeableness, conscientiousness and extraversion and introversion from the Big Five personality traits. In author profiling for email, content is processed for important textual data, while unimportant features such as metadata and other HyperText Markup Language (HTML) redundancies are excluded. The parts of the Multipurpose Internet Mail Extensions (MIME) message that contain the content of the emails are also included in the analysis. The data obtained is often parsed into various sections of content, including author text, signature text, advertisements, quoted text and reply lines. Further analysis of email textual content in author profiling tasks involves the extraction of tone of voice, sentiment, semantics and other linguistic features to be processed.
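The preprocessing described above, keeping the MIME text content and splitting it into sections, can be sketched with Python's standard `email` package. This is a simplified illustration rather than the procedure of the cited study: the function name `split_email_body` is hypothetical, only three of the sections are handled, and the rules used (lines starting with ">" are quoted text, a line consisting of "--" starts the signature) are common conventions assumed here.

```python
from email import message_from_string
from email.policy import default


def split_email_body(raw_message):
    """Split the plain-text body of a raw email into author, quoted and signature text."""
    msg = message_from_string(raw_message, policy=default)
    # Headers (metadata) are ignored; only the text/plain part is kept.
    part = msg.get_body(preferencelist=("plain",))
    body = part.get_content() if part is not None else ""

    sections = {"author": [], "quoted": [], "signature": []}
    current = "author"
    for line in body.splitlines():
        if line.rstrip() == "--":  # conventional signature delimiter
            current = "signature"
            continue
        if current != "signature" and line.startswith(">"):
            sections["quoted"].append(line)
        else:
            sections[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}


# Hypothetical message used only to exercise the function.
raw = (
    "From: a@example.com\n"
    "Subject: hi\n"
    "\n"
    "See you at noon.\n"
    "> when are you free?\n"
    "-- \n"
    "Alex\n"
)
parts = split_email_body(raw)
```

Only the author text would normally be passed on to feature extraction, since quoted text and advertisements were not written by the author being profiled.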


Applications

Author profiling has applications in various fields where there is a need to identify specific characteristics of the author of a text, and is of growing importance in fields like forensics and marketing (Author Profiling 2018 (n.d.)). Depending on its application, the task of author profiling can vary in terms of the characteristics to be identified, the number of authors studied and the number of texts available for analysis. Although its applications have traditionally been limited to written texts, such as literary works, they have extended to online texts with the advancement of computers and the Internet.


Forensic linguistics

In the context of
forensic linguistics Forensic linguistics, legal linguistics, or language and the law is the application of linguistic knowledge, methods, and insights to the forensic context of law, language, crime investigation, trial, and judicial procedure. It is a branch of ap ...
, author profiling is used to identify characteristics of the author of anonymous, pseudonymous or
forged Forging is a manufacturing process involving the shaping of metal using localized compression (physics), compressive forces. The blows are delivered with a hammer (often a power hammer) or a die (manufacturing), die. Forging is often classif ...
text, based on the author's use of the language. Through linguistic analysis, forensic linguists seek to identify the suspect's motivation and ideology, along with other class features, such as the suspect's ethnicity or profession. While this does not always lead to decisive author identification, such information can help
law enforcement Law enforcement is the activity of some members of the government or other social institutions who act in an organized manner to enforce the law by investigating, deterring, rehabilitating, or punishing people who violate the rules and norms gove ...
narrow the pool of suspects. In most cases, author profiling in the context of forensic linguistics involves a single text problem, in which there is either no or few comparison texts available and no external evidence that points to the author.Grant, T. D. (2008).
Approaching questions in forensic authorship analysis
" ''In Gibbons, J. & Turell, M. T. (eds.). Dimensions of Forensic Linguistics.'' John Benjamins.
Examples of texts analysed by forensic linguists include blackmail letters, confessions, testaments, suicide letters and plagiarised writing. With the increasing number of cybercrimes committed on the Internet, this has extended to online texts as well, such as sexually explicit online chat logs between middle-aged men and underage girls. One of the earliest and best-known uses of author profiling was by Roger Shuy, who was asked to examine a ransom note linked to a notorious kidnapping case in 1979. Based on his analysis of the kidnapper's idiolect, Shuy was able to identify crucial elements of the kidnapper's identity from his misspellings and a dialect item: the kidnapper was well-educated and from Akron, Ohio. This eventually led to a successful arrest and a confession by the suspect. However, author profiling methods have been criticised as lacking objectivity, since they rely on a forensic linguist's subjective identification of crucial sociolinguistic markers. These methods, such as those adopted by literary critic Donald Wayne Foster, are said to be speculative and based entirely on one's subjective experience, and therefore cannot be tested empirically.


Bot detection

Author profiling is used to identify social bots, the most common being Twitter bots. Social bots have been deemed a threat given their commercial, political and ideological influence, as in the 2016 United States presidential election, during which they polarised political conversations and spread misinformation and unverified information. In the context of marketing, social bots can artificially inflate the popularity of a product by posting positive reviews, and undermine the reputation of competing products with unfavourable reviews. [Bots and Gender Profiling 2019. (n.d.).] Therefore, bot detection from an author profiling perspective is a task of high importance. [Goubin, Régis & Lefeuvre, Dorian & Alhamzeh, Alaa & Mitrović, Jelena & Egyed-Zsigmond, Előd & Fossi, Leopold. (2019). "Bots and Gender Profiling using a Multi-layer Architecture". Notebook for PAN at CLEF 2019.]
Made to appear as human accounts, bots can mostly be identified by information on their profiles, like their username, profile photo and time of posting. However, the task of identifying bots solely from textual data (i.e. without metadata) is significantly more challenging and requires author profiling techniques. This usually involves a classification task based on semantic and syntactic features. [Daelemans, W. et al. (2019). "Overview of PAN 2019: Bots and Gender Profiling, Celebrity Profiling, Cross-Domain Authorship Attribution and Style Change Detection". In: Crestani, F. et al. (eds.), ''Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2019. Lecture Notes in Computer Science'', vol 11696. Springer, Cham.]
Bot and gender profiling was one of four shared tasks in the 2019 edition of PAN, a series of scientific events and shared tasks on digital text forensics and stylometry. Participating teams achieved strong results, with the best results for bot detection on English and Spanish tweets at 95.95% and 93.33% respectively.
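
As a rough illustration of what text-only bot detection can start from, the sketch below computes a few surface features over the posts of a single account. The `text_features` helper and its feature names are hypothetical, chosen for illustration; they are not taken from the PAN 2019 systems, which combined richer semantic and syntactic features with trained classifiers.

```python
import re

# Simple patterns; real systems would use a proper tokenizer.
URL_RE = re.compile(r"https?://\S+")
TOKEN_RE = re.compile(r"[A-Za-z']+")


def text_features(posts):
    """Compute illustrative text-only features for one account's posts.

    `posts` is a non-empty list of strings. No profile metadata is used.
    """
    if not posts:
        raise ValueError("posts must contain at least one post")

    # Strip URLs before tokenising so links do not count as words.
    tokens = [
        tok.lower()
        for post in posts
        for tok in TOKEN_RE.findall(URL_RE.sub(" ", post))
    ]
    n_tokens = max(len(tokens), 1)  # avoid division by zero
    n_posts = len(posts)

    return {
        # Vocabulary richness: distinct words / total words.
        "type_token_ratio": len(set(tokens)) / n_tokens,
        # Average number of links per post.
        "urls_per_post": sum(len(URL_RE.findall(p)) for p in posts) / n_posts,
        # Share of posts that exactly repeat an earlier post.
        "duplicate_post_ratio": 1 - len(set(posts)) / n_posts,
        # Average post length in words.
        "mean_tokens_per_post": len(tokens) / n_posts,
    }


if __name__ == "__main__":
    repetitive = ["Buy now http://a.b", "Buy now http://a.b"]
    varied = ["I love rainy mornings", "Coffee first, then work"]
    print(text_features(repetitive))
    print(text_features(varied))
```

In this toy example, the repetitive, link-heavy account scores a lower type-token ratio and a higher duplicate-post ratio than the varied one. Feature vectors of this kind would normally be passed to a trained classifier rather than judged by fixed thresholds.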


Marketing

Author profiling is also useful from a marketing viewpoint, as it allows businesses to identify the demographics of people who like or dislike their products based on an analysis of blogs, online product reviews and social media content. This is important since most individuals post their product reviews anonymously. Author profiling techniques help business experts make better-informed strategic decisions based on the demographics of their target group. In addition, businesses can target their marketing campaigns at groups of consumers who match the demographics and profile of current customers.


Author identification and influence tracing

Author profiling techniques are used to study traditional media and literature to identify the writing styles of various authors as well as the topics they write about. Author profiling for literature has also been used to deduce the social networks of authors and their literary influence based on their bibliographic records of co-authorship. In cases of anonymous or pseudepigraphic works, the technique has sometimes been used to attempt to identify the author or authors, or to determine which works were written by the same person. Author profiling studies on literature and traditional media have examined the following [Dzikiene, J. K., Utka, A., & Šarkute, L. (2015). "Authorship Attribution and Author Profiling of Lithuanian Literary Texts", 96–105.]:
* The Bible (see Authorship of the Bible)
* Gospels of the New Testament
* Shakespeare's works
* The Federalist Papers, in the 1960s and 1990s
* Lithuanian literary texts
* ''Primary Colors'', a 1996 novel whose author was for a time anonymous
* ''A Warning'', a 2019 political book whose author was for a time anonymous


Library cataloguing

Another application of author profiling is in devising strategies for cataloguing library resources based on standard attributes. [Nomoto, T. (2009). "Classifying library catalogues by author profiling". In: ''Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval – SIGIR 09''.] In this approach, author profiling techniques may improve the efficiency of library cataloguing, in which library resources are automatically classified based on the authors' bibliographic records. This was a significant issue in the early 21st century, when much of library cataloguing was still done manually. In using author profiling for library cataloguing, researchers have applied machine learning methods, such as support vector machines (SVMs), to automate library processes. With the use of SVMs for author profiling, bibliographic records of authors within existing databases may be identified, tracked, and updated to identify an author based on their topics of literary content and expertise as indicated in their bibliographic records. In this case, author profiling uses the social structures of authors that may be derived from physical copies of published media to catalogue library resources.


In popular culture

Author profiling has been featured in popular culture. The 2017 Discovery Channel mini-series ''Manhunt: Unabomber'' is a fictionalised account of the FBI investigation surrounding the Unabomber. It features a criminal profiler who identifies defining characteristics of the Unabomber's identity based on his analysis of the Unabomber's idiolect in his published manifesto and letters. The show highlighted the importance of author profiling in criminal forensics, as it was critical in the capture of the real Unabomber in 1996. [Davies, D. (2017, August 22). "FBI Profiler Says Linguistic Work Was Pivotal In Capture Of Unabomber".]


See also

;Related subjects
* Computational linguistics
* Forensic linguistics
* Native-language identification
* Social bot
* Stylometry

