Citation analysis is the examination of the frequency, patterns, and graphs of citations in documents. It uses the directed graph of citations (links from one document to another) to reveal properties of the documents. A typical aim is to identify the most important documents in a collection. A classic example is that of the citations between academic articles and books. For another example, judges support their judgements by referring back to judgements made in earlier cases (see citation analysis in a legal context). A further example is provided by patents, which contain prior art: citations of earlier patents relevant to the current claim. The digitization of patent data and increasing computing power have led to a community of practice that uses these citation data to measure innovation attributes, trace knowledge flows, and map innovation networks.
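As a minimal sketch of how a directed citation graph can point to the most important documents in a collection, the following counts how often each document is cited (its in-degree) and ranks documents by that count. The document names and the `citations` mapping are hypothetical toy data, and raw citation counts are only one of many possible importance measures.

```python
# Hypothetical toy data: each key cites every document in its list.
citations = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": [],
    "D": ["C", "B"],
}

def citation_counts(cites):
    """Return how many distinct documents cite each document (in-degree)."""
    counts = {doc: 0 for doc in cites}
    for citing, cited_docs in cites.items():
        for cited in set(cited_docs):  # count each citing document once
            counts[cited] = counts.get(cited, 0) + 1
    return counts

def rank_by_citations(cites):
    """Sort documents from most cited to least cited, ties broken by name."""
    counts = citation_counts(cites)
    return sorted(counts, key=lambda doc: (-counts[doc], doc))

counts = citation_counts(citations)
ranking = rank_by_citations(citations)
print(counts)
print(ranking)
```

In this toy collection, document C is cited by three others and therefore ranks first.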
Documents can be associated with many other features in addition to citations, such as authors, publishers, journals, and their actual texts. The general analysis of collections of documents is known as bibliometrics, and citation analysis is a key part of that field. For example, bibliographic coupling and co-citation are association measures based on citation analysis (shared citations or shared references). The citations in a collection of documents can also be represented in forms such as a citation graph, as pointed out by Derek J. de Solla Price in his 1965 article "Networks of Scientific Papers". This means that citation analysis draws on aspects of social network analysis and network science.
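The two association measures named above can be illustrated with a short sketch. It assumes the usual definitions: two documents are bibliographically coupled when they share references, and two documents are co-cited when a third document cites both. The `references` mapping is hypothetical toy data.

```python
from itertools import combinations

# Hypothetical toy data: citing document -> set of references it cites.
references = {
    "P1": {"R1", "R2", "R3"},
    "P2": {"R2", "R3"},
    "P3": {"R3", "R4"},
}

def bibliographic_coupling(refs):
    """Count shared references for every pair of citing documents."""
    strength = {}
    for a, b in combinations(sorted(refs), 2):
        shared = len(refs[a] & refs[b])
        if shared:
            strength[(a, b)] = shared
    return strength

def co_citation(refs):
    """Count, for every pair of cited documents, how many documents cite both."""
    strength = {}
    for cited in refs.values():
        for a, b in combinations(sorted(cited), 2):
            strength[(a, b)] = strength.get((a, b), 0) + 1
    return strength

coupling = bibliographic_coupling(references)
cocited = co_citation(references)
print(coupling)
print(cocited)
```

Bibliographic coupling is fixed once the citing documents are published, whereas co-citation counts can grow as later documents cite the same pair.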
An early example of automated citation indexing was CiteSeer, which was used for citations between academic papers, while Web of Science is an example of a modern system that covers more than just academic books and articles, reflecting a wider range of information sources. Today, automated citation indexing has changed the nature of citation analysis research, allowing millions of citations to be analyzed for large-scale patterns and knowledge discovery. Citation analysis tools can be used to compute various impact measures for scholars based on data from citation indices. These have various applications, from the identification of expert referees to review papers and grant proposals, to providing transparent data in support of academic merit review, tenure, and promotion decisions. This competition for limited resources may lead to ethically questionable behavior to increase citations.
A great deal of criticism has been made of the practice of naively using citation analyses to compare the impact of different scholarly articles without taking into account other factors which may affect citation patterns. Among these criticisms, a recurrent one focuses on "field-dependent factors", which refers to the fact that citation practices vary from one area of science to another, and even between fields of research within a discipline.
Overview
While citation indexes were originally designed for information retrieval, they are increasingly used for bibliometrics and other studies involving research evaluation. Citation data is also the basis of the popular journal impact factor.
There is a large body of literature on citation analysis, sometimes called scientometrics, a term invented by Vasily Nalimov, or more specifically bibliometrics. The field blossomed with the advent of the Science Citation Index, which now covers source literature from 1900 on. The leading journals of the field are ''Scientometrics'', ''Informetrics'', and the ''Journal of the Association for Information Science and Technology''. ASIST also hosts an electronic mailing list called SIGMETRICS. This method is undergoing a resurgence based on the wide dissemination of the Web of Science and Scopus subscription databases in many universities, and the universally available free citation tools such as CiteBase, CiteSeerX, Google Scholar, and the former Windows Live Academic (later available with extra features as Microsoft Academic). Methods of citation analysis research include qualitative, quantitative and computational approaches. The main foci of such scientometric studies have included productivity comparisons, institutional research rankings, journal rankings, establishing faculty productivity and tenure standards, assessing the influence of top scholarly articles, tracing the development trajectory of a science or technology field, and developing profiles of top authors and institutions in terms of research performance.
Legal citation analysis is a citation analysis technique for analyzing legal documents to facilitate the understanding of inter-related regulatory compliance documents by exploring the citations that connect provisions to other provisions within the same document or between different documents. Legal citation analysis uses a citation graph extracted from a regulatory document, which could supplement e-discovery, a process that leverages technological innovations in big data analytics.
History
In a 1965 paper, Derek J. de Solla Price described the inherent linking characteristic of the Science Citation Index (SCI) as "Networks of Scientific Papers".
The links between citing and cited papers became dynamic when the SCI began to be published online. The Social Sciences Citation Index became one of the first databases to be mounted on the Dialog system in 1972. With the advent of the CD-ROM edition, linking became even easier and enabled the use of bibliographic coupling for finding related records. In 1973, Henry Small published his classic work on co-citation analysis, which became a self-organizing classification system that led to document clustering experiments and eventually an "Atlas of Science", later called "Research Reviews".
The inherent topological and graphical nature of the worldwide citation network, which is an inherent property of the scientific literature, was described by Ralph Garner (Drexel University) in 1965.
The use of citation counts to rank journals was a technique used in the early part of the nineteenth century, but the systematic ongoing measurement of these counts for scientific journals was initiated by Eugene Garfield at the Institute for Scientific Information, who also pioneered the use of these counts to rank authors and papers. In a landmark paper of 1965, he and Irving Sher showed the correlation between citation frequency and eminence by demonstrating that Nobel Prize winners published five times the average number of papers, while their work was cited 30 to 50 times the average. In a long series of essays on the Nobel and other prizes, Garfield reported this phenomenon. The usual summary measure is known as the impact factor: the number of citations to a journal for the previous two years, divided by the number of articles published in those years. It is widely used, for both appropriate and inappropriate purposes; in particular, the use of this measure alone for ranking authors and papers is quite controversial.
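The impact factor as defined above reduces to a single division. The sketch below follows that wording directly; the function name, parameter names, and the example figures are hypothetical, and real journal metrics involve further rules (for example, which items count as citable) that are not modelled here.

```python
def impact_factor(citations_to_prev_two_years, articles_in_prev_two_years):
    """Citations received by items from the previous two years,
    divided by the number of articles published in those two years."""
    if articles_in_prev_two_years <= 0:
        raise ValueError("the journal must have published at least one article")
    return citations_to_prev_two_years / articles_in_prev_two_years

# Hypothetical journal: 120 articles over the previous two years,
# which together received 300 citations.
example = impact_factor(300, 120)
print(example)
```

Because the result is a journal-level average, it says little about any individual paper or author, which is one reason its use for ranking them is contested.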
In an early study in 1964 of the use of citation analysis in writing the history of DNA, Garfield and Sher demonstrated the potential for generating historiographs, topological maps of the most important steps in the history of scientific topics. This work was later automated by E. Garfield, A. I. Pudovkin of the Institute of Marine Biology, Russian Academy of Sciences, and V. S. Istomin of the Center for Teaching, Learning, and Technology, Washington State University, and led to the creation of the HistCite software around 2002.
Automatic citation indexing was introduced in 1998 by Lee Giles, Steve Lawrence and Kurt Bollacker and enabled automatic algorithmic extraction and grouping of citations for any digital academic and scientific document. Where previous citation extraction was a manual process, citation measures could now scale up and be computed for any scholarly and scientific field and document venue, not just those selected by organizations such as ISI. This led to the creation of new systems for public and automated citation indexing, the first being CiteSeer (now CiteSeerX), soon followed by Cora, which focused primarily on the field of computer science and information science. These were later followed by large-scale academic domain citation systems such as Google Scholar and Microsoft Academic. Such autonomous citation indexing is not yet perfect in citation extraction or citation clustering, with an error rate estimated by some at 10%, though a careful statistical sampling has yet to be done. This has resulted in place names such as Ann Arbor, Milton Keynes, and Walton Hall being credited as authors with extensive academic output.
SCI claims to create automatic citation indexing through purely programmatic methods. Even the older records have a similar magnitude of error.
Citation impact
Citation analysis for legal documents
Citation analysis for legal documents is an approach to facilitate the understanding and analysis of inter-related regulatory compliance documents by exploration of the citations that connect provisions to other provisions within the same document or between different documents. Citation analysis uses a citation graph extracted from a regulatory document, which could supplement e-discovery, a process that leverages technological innovations in big data analytics.
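As a rough illustration of extracting a citation graph from a regulatory document, the sketch below scans each provision for cross-references and records which provisions cite which. Everything here is an assumption made for the example: the provision texts are invented, and the "Section N" pattern stands in for whatever citation format a real regulation uses, which is usually far more varied.

```python
import re

# Hypothetical regulatory text: provision id -> provision text.
provisions = {
    "1": "Definitions used in this document.",
    "2": "Records must be kept as required by Section 1 and Section 3.",
    "3": "Retention periods are subject to Section 1.",
}

# Assumed citation format; real legal citations need a richer pattern.
SECTION_REF = re.compile(r"\bSection\s+(\d+)\b")

def provision_citation_graph(texts):
    """Map each provision to the sorted list of provisions it cites."""
    graph = {}
    for provision_id, text in texts.items():
        cited = set(SECTION_REF.findall(text))
        cited.discard(provision_id)  # ignore a provision citing itself
        graph[provision_id] = sorted(cited)
    return graph

def cited_by(graph):
    """Invert the graph: provision -> sorted list of provisions citing it."""
    incoming = {provision_id: [] for provision_id in graph}
    for source, targets in graph.items():
        for target in targets:
            incoming.setdefault(target, []).append(source)
    return {key: sorted(value) for key, value in incoming.items()}

graph = provision_citation_graph(provisions)
incoming = cited_by(graph)
print(graph)
print(incoming)
```

The inverted view shows which provisions many others depend on, here the definitions in provision 1.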
Citation analysis for plagiarism detection
Citation analysis for natural language processing
Natural language processing (NLP), a field at the intersection of artificial intelligence and linguistics, is poised to substantially impact society through various innovations such as large language models. The impact on and of NLP has been extensively studied through citations. Researchers have analyzed various factors such as the influence between different fields, industry impact, temporal citation patterns, plagiarism, geographic location, and gender. Many studies show the field is becoming more insular, with a narrowing focus, reduced interdisciplinarity, and concentration of funding among a few industry actors.
Controversies
*''E-publishing'': due to the unprecedented growth of electronic resource (e-resource) availability, one of the questions currently being explored is "how often are e-resources being cited in my field?" For instance, there are claims that online access to computer science literature leads to higher citation rates; however, humanities articles may suffer if not in print.
*''Self-citations'': authors have been criticized for gaming the system by citing themselves excessively to accumulate citations. For instance, it has been found that men tend to cite themselves more often than women.
*''Citation pollution'': the infiltration of retracted or fake research into legitimate research through citation, which undermines the validity of the citing work. Owing to various factors, including the publication race and the concerning rise in unscrupulous business practices related to so-called predatory or deceptive publishers, research quality in general is facing different types of threats.
*''Citation justice'' and ''citation bias'': Because having others cite a publication helps the original author's career prospects, and because the key works in some fields were published by men, by older scholars, and by white people, there have been calls to promote social justice by deliberately citing publications by people from marginalized backgrounds, or by checking citations for bias before publication.
See also
*Journalology
*Main path analysis
*PageRank
*San Francisco Declaration on Research Assessment