
Culturomics is a form of computational lexicology that studies human behavior and cultural trends through the quantitative analysis of digitized texts. Researchers data mine large digital archives to investigate cultural phenomena reflected in language and word usage. The term is an American neologism first described in a 2010 ''Science'' article called ''Quantitative Analysis of Culture Using Millions of Digitized Books'', co-authored by Harvard researchers Jean-Baptiste Michel and Erez Lieberman Aiden. Michel and Aiden helped create the Google Labs project Google Ngram Viewer, which uses n-grams to analyze the Google Books digital library for cultural patterns in language use over time. Because the Google Ngram data set is not an unbiased sample and does not include metadata, there are several pitfalls when using it to study language or the popularity of terms. Medical literature accounts for a large, but shifting, share of the corpus, which does not take into account how often the literature is printed or read.
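As a rough illustration of the n-gram idea, the sketch below counts how often a phrase occurs per publication year, relative to all n-grams of the same length in that year. It is a minimal sketch, not the Ngram Viewer's actual pipeline: the function names, the whitespace tokenisation, and the toy corpus are all assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def yearly_relative_frequency(documents, phrase):
    """Relative frequency of `phrase` per year.

    `documents` is an iterable of (year, text) pairs. The frequency is the
    count of the phrase divided by the count of all n-grams of the same
    length found in texts from that year.
    """
    target = tuple(phrase.lower().split())
    n = len(target)
    matches = Counter()
    totals = Counter()
    for year, text in documents:
        grams = ngrams(text.lower().split(), n)  # naive whitespace tokens
        totals[year] += len(grams)
        matches[year] += sum(1 for gram in grams if gram == target)
    return {year: matches[year] / totals[year]
            for year in sorted(totals) if totals[year] > 0}


# Hypothetical toy corpus: (publication year, text).
toy_corpus = [
    (1900, "the steam engine and the steam ship"),
    (1900, "a horse and a cart"),
    (1950, "the jet engine replaced the steam engine"),
]

frequencies = yearly_relative_frequency(toy_corpus, "steam engine")
```

Dividing by the yearly total matters because the number of digitized texts differs from year to year; raw counts alone would mostly track the size of the corpus.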


Studies

In a study called ''Culturomics 2.0'', Kalev H. Leetaru examined news archives, including print and broadcast media (television and radio transcripts), for words that imparted tone or "mood" as well as geographic data. The research retroactively predicted the 2011 Arab Spring and successfully estimated the approximate final location of Osama bin Laden.

A 2012 paper by Alexander M. Petersen and co-authors found a "dramatic shift in the birth rate and death rates of words": deaths have increased and births have slowed. The authors also identified a universal "tipping point" in the life cycle of new words: at about 30 to 50 years after their origin, they either enter the long-term lexicon or fall into disuse. ("The New Science of the Birth and Death of Words", Christopher Shea, ''Wall Street Journal'', March 16, 2012.)
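To make the "birth", "death" and "tipping point" vocabulary concrete, here is a minimal sketch that summarises one word's life cycle from a yearly frequency series. The criteria used by Petersen and co-authors are not given in this article, so everything here is an assumption for illustration: the function name `word_lifecycle`, the frequency threshold, and the 40-year cut-off (chosen only because it lies inside the 30 to 50 year range mentioned above).

```python
def word_lifecycle(series, threshold=1e-7, tipping_years=40):
    """Summarise a word's life cycle from a {year: relative_frequency} mapping.

    A year counts as "active" when the frequency reaches `threshold`
    (an assumed cut-off). Returns None when the word is never active.
    """
    active = sorted(year for year, freq in series.items() if freq >= threshold)
    if not active:
        return None
    birth, last_seen = active[0], active[-1]
    return {
        "birth": birth,
        "last_seen": last_seen,
        # True when the word is still in use after the assumed tipping window.
        "survived_tipping_point": last_seen - birth >= tipping_years,
    }


# Hypothetical frequency series for two invented words.
lasting = {1900: 0.0, 1910: 2e-7, 1950: 5e-7, 2000: 4e-7}
fading = {1900: 0.0, 1910: 3e-7, 1930: 1e-7, 1950: 0.0, 2000: 0.0}

lasting_summary = word_lifecycle(lasting)
fading_summary = word_lifecycle(fading)
```

In this toy data the first word is still active 90 years after its first appearance, while the second drops out after 20 years and so would count as having fallen into disuse under the assumed cut-off.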
Culturomic approaches have been taken in the analysis of newspaper content in a number of studies by I. Flaounas and co-authors. These studies showed macroscopic trends across different news outlets and countries. In 2012, a study of 2.5 million articles suggested that gender bias in news coverage depends on topic, and showed how the readability of newspaper articles is related to topic. A separate study by the same researchers, covering 1.3 million articles from 27 countries, showed macroscopic patterns in the choice of stories to cover. In particular, countries made similar choices when they were related by economic, geographical and cultural links. The cultural links were revealed by the similarity in voting for the Eurovision Song Contest. This study was performed on a vast scale, using statistical machine translation, text categorisation and information extraction techniques.

The possibility of detecting mood shifts in a vast population by analysing Twitter content was demonstrated in a study by T. Lansdall-Welfare and co-authors. The study considered 84 million tweets generated by more than 9.8 million users from the United Kingdom over a period of 31 months, showing how public sentiment in the UK changed with the announcement of spending cuts.

In a 2013 study by S. Sudhahar and co-authors, the automatic parsing of textual corpora enabled the extraction of actors and their relational networks on a vast scale, turning textual data into network data. The resulting networks, which can contain thousands of nodes, are then analysed using tools from network theory to identify the key actors, the key communities or parties, and general properties such as the robustness or structural stability of the overall network, or the centrality of certain nodes.

In a 2014 study by T. Lansdall-Welfare and co-authors, 5 million news articles were collected over 5 years and then analyzed to suggest a significant shift in sentiment relative to coverage of nuclear power, corresponding with the Fukushima disaster. The study also extracted concepts that were associated with nuclear power before and after the disaster, explaining the change in sentiment with a change in narrative framing.

In 2015, a study revealed the bias of the Google Books data set, which "suffers from a number of limitations which make it an obscure mask of cultural popularity," and called into question the significance of many of the earlier results.

Culturomic approaches can also contribute towards conservation science through a better understanding of human-nature relationships, with the first research published by McCallum and Bury in 2013. This study revealed a precipitous decline in public interest in environmental issues.
In 2016, a publication by Richard Ladle and colleagues highlighted five key areas where culturomics can be used to advance the practice and science of conservation: recognizing conservation-oriented constituencies and demonstrating public interest in nature; identifying conservation emblems; providing new metrics and tools for near-real-time environmental monitoring and to support conservation decision making; assessing the cultural impact of conservation interventions; and framing conservation issues and promoting public understanding.

In 2017, a study correlated Google search activity for joint pain with temperature. While the study observed higher search activity for hip and knee pain (but not arthritis) during higher temperatures, it does not (and cannot) control for other relevant factors such as activity. Mass media misinterpreted this as "myth busted: rain does not increase joint pain", while the authors speculate that the observed correlation is due to "changes in physical activity levels".
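A correlation of this kind can be sketched as a plain Pearson coefficient between two series. The code below is only an illustration under stated assumptions: the function name `pearson_r` and the temperature and search-index numbers are invented, and the 2017 study's actual statistical method is not described in this article.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


# Invented weekly values: mean temperature (deg C) and a search-interest index.
temperature = [5, 10, 15, 20, 25]
knee_pain_searches = [40, 44, 52, 55, 61]

r = pearson_r(temperature, knee_pain_searches)
```

A high coefficient here says only that the two series move together. It carries no information about a third factor, such as physical activity, that may drive both, which is the limitation noted above.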


Criticism

Linguists and lexicographers have expressed skepticism regarding the methods and results of some of these studies, including one by Petersen et al. ("When physicists do linguistics", Ben Zimmer, ''Boston Globe'', February 10, 2013.)

Others have demonstrated bias in the Ngram data set. Their results "call into question the vast majority of existing claims drawn from the Google Books corpus": "Instead of speaking about general linguistic or cultural change, it seems to be preferable to explicitly restrict the results to linguistic or cultural change ‘as it is represented in the Google Ngram data’" because it is unclear what caused the observed change in the sample.

Ficetola critiqued the use of Google Trends, suggesting that interest was actually increasing. In their rebuttal, McCallum and Bury argued that, as far as public policy is concerned, proportional data are important and absolute numbers irrelevant. They explained that policy is driven by the opinion of the largest portion of the population, not by absolute numbers, with decisions made according to majority influence, not simply the number of votes.
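The disagreement between absolute and proportional measures can be shown with a small worked example. The numbers and the helper name `share_of_total` are invented for illustration and are not taken from either side's data.

```python
def share_of_total(topic_counts, total_counts):
    """Return each year's topic count as a fraction of that year's total."""
    return {year: topic_counts[year] / total_counts[year]
            for year in sorted(topic_counts)}


# Invented yearly search counts: one topic versus all searches.
topic_searches = {2004: 1_000, 2008: 1_500, 2012: 1_800}
all_searches = {2004: 100_000, 2008: 300_000, 2012: 600_000}

shares = share_of_total(topic_searches, all_searches)
```

In this toy data the absolute number of topic searches rises every year, yet its share of all searches falls from 1% to 0.3%, so both "interest is increasing" and "interest is declining" can be read from the same counts depending on the measure chosen.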


See also

* Corpus linguistics
* -omics
* Public health informatics
* Public health surveillance
* Text corpus


Further reading

* Lansdall-Welfare, Thomas; Sudhahar, Saatviga; Thompson, James; Lewis, Justin; Cristianini, Nello (2017). "Content Analysis of 150 Years of British Periodicals". ''Proceedings of the National Academy of Sciences of the United States of America''. 114 (4): E457–E465. doi:10.1073/pnas.1606380114. PMID 28069962. PMC 5278459. Bibcode:2017PNAS..114E.457L.


External links


* Culturomics.org, website by The Cultural Observatory at Harvard, directed by Erez Lieberman Aiden and Jean-Baptiste Michel