
Literature-based discovery (LBD), also called literature-related discovery (LRD), is a form of knowledge extraction and automated hypothesis generation that uses papers and other academic publications (the "literature") to find new relationships between existing knowledge (the "discovery"). Literature-based discovery aims to discover new knowledge by connecting information that has been explicitly stated in the literature to deduce connections that have not been explicitly stated. LBD can help researchers quickly discover and explore hypotheses, gain information on relevant advances inside and outside their niches, and increase interdisciplinary information sharing. The most basic and widespread type of LBD is called the ABC paradigm because it centers on three concepts called A, B and C. It states that if there is a connection between A and B and one between B and C, then there may be one between A and C which, if not explicitly stated, is yet to be explored.
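The ABC paradigm can be illustrated with a minimal Python sketch. It assumes that known relationships are available as undirected concept pairs; the function name `abc_candidates` and the placeholder concepts are hypothetical and chosen only for illustration, not taken from any published LBD system.

```python
from collections import defaultdict


def abc_candidates(known_pairs, a):
    """Return {C: set of linking Bs} for every C reachable from `a`
    through some B but not directly connected to `a`."""
    neighbours = defaultdict(set)
    for x, y in known_pairs:
        # Relationships are treated as undirected in this sketch.
        neighbours[x].add(y)
        neighbours[y].add(x)

    candidates = defaultdict(set)
    for b in neighbours[a]:
        for c in neighbours[b]:
            # Keep only A-C links that are not already stated.
            if c != a and c not in neighbours[a]:
                candidates[c].add(b)
    return dict(candidates)


# Placeholder data: A is linked to B1, B2 and C2; C1 is only reachable via B1/B2.
pairs = [
    ("A", "B1"), ("A", "B2"),
    ("B1", "C1"), ("B2", "C1"),
    ("B2", "C2"), ("A", "C2"),
]
result = abc_candidates(pairs, "A")
```

Here `C2` is excluded because its connection to `A` is already explicit, while `C1` is returned together with the intermediate concepts that support it. Real systems additionally rank or filter such candidates, since most implied A-C links are uninteresting.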


History

The LBD technique was pioneered by Don R. Swanson in the 1980s. He hypothesized that the combination of two separately published results, one indicating an A-B relationship and the other a B-C relationship, is evidence of an A-C relationship that is unknown or unexplored. He used this to propose fish oil as a treatment for Raynaud syndrome due to their shared relationship with blood viscosity. This hypothesis was later shown to have merit in a prospective study, and he went on to propose other discoveries using similar methods.


Swanson linking

''Swanson linking'' is a term proposed in 2003 that refers to connecting two pieces of knowledge previously thought to be unrelated. For example, it may be known that illness A is caused by chemical B, and that drug C reduces the amount of chemical B in the body. However, because the respective articles were published separately from one another (called "disjoint data"), the relationship between illness A and drug C may be unknown. ''Swanson linking'' aims to find these relationships and report them. Although the ABC paradigm is widely used, critics have argued that much of science is not captured by simple assertions and is instead built from analogies and images at a higher level of abstraction.


Systems

LBD generally comes in two flavours: open and closed discovery. In open discovery, only A is given. The approach finds Bs and uses them to return possibly interesting Cs to the user, thus ''generating hypotheses'' from A. In closed discovery, both A and C are given, and the approach seeks the Bs that can link the two, thus ''testing a hypothesis'' about A and C.

A number of systems to perform literature-based discovery have been developed over the years, extending the original idea of Don Swanson, and the evaluation of the quality of such systems is an active area of research. Some systems include web versions for increased user-friendliness. An approach common to many systems is the use of MeSH terms to represent scientific articles. This is used by the systems Manjal, BITOLA and LitLinker. One well-known system within the field is called ''Arrowsmith'' and is tailored to find connections between two disjoint sets of articles, an approach labeled "two-node" search. Another well-known system, LION LBD, uses PubTator to annotate PubMed scientific articles with concepts such as chemicals, genes/proteins, mutations, diseases and species, together with sentence-level annotation of cancer hallmarks that describe fundamental cancer processes and behaviour. It uses co-occurrence metrics to rank relations between concepts and performs both open and closed discovery.

While many LBD systems are based on traditional statistical methods, other systems leverage more sophisticated machine learning methods, such as neural networks. Some LBD systems represent the connections between concepts as a knowledge graph, and thus employ techniques of graph theory. The graph-based representation is also the foundation for LBD systems that employ graph databases like Neo4j, enabling discovery via graph query languages such as Cypher. Graph-based LBD systems represent the relations between concepts using different relation types, such as those in the UMLS Semantic Network. Some approaches go further and try to apply contextualized relations, an approach also used by the Gene Ontology for its Causal Activity Modeling (GO-CAM).
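The difference between open and closed discovery can be sketched over a toy corpus in which each article is represented as a set of index terms, similar in spirit to a MeSH-based representation. This is a minimal sketch under stated assumptions: the function names, the scoring rule (taking the weaker of the two co-occurrence counts along each A-B-C path) and the example articles are all hypothetical, and they are not the ranking used by Arrowsmith, LION LBD or any other system named above.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(docs):
    """Count in how many documents each unordered pair of terms co-occurs."""
    counts = Counter()
    for terms in docs:
        for x, y in combinations(sorted(set(terms)), 2):
            counts[(x, y)] += 1
    return counts


def linked(counts, term):
    """Return {other_term: co-occurrence count} for one term."""
    out = {}
    for (x, y), n in counts.items():
        if x == term:
            out[y] = n
        elif y == term:
            out[x] = n
    return out


def open_discovery(counts, a):
    """Given only A, rank candidate Cs that are reachable through some B
    but never co-occur with A directly."""
    direct = linked(counts, a)
    scores = Counter()
    for b, n_ab in direct.items():
        for c, n_bc in linked(counts, b).items():
            if c != a and c not in direct:
                # Assumed score: the weaker of the two links on the path.
                scores[c] += min(n_ab, n_bc)
    return scores.most_common()


def closed_discovery(counts, a, c):
    """Given A and C, return the Bs linking them, strongest first."""
    from_a = linked(counts, a)
    from_c = linked(counts, c)
    shared = set(from_a) & set(from_c)
    return sorted(shared, key=lambda b: (-min(from_a[b], from_c[b]), b))


# Invented toy articles (not real literature counts).
docs = [
    {"Raynaud syndrome", "blood viscosity"},
    {"Raynaud syndrome", "blood viscosity", "vasoconstriction"},
    {"fish oil", "blood viscosity"},
    {"fish oil", "platelet aggregation"},
    {"Raynaud syndrome", "platelet aggregation"},
]
counts = cooccurrence(docs)
hypotheses = open_discovery(counts, "Raynaud syndrome")
links = closed_discovery(counts, "Raynaud syndrome", "fish oil")
```

In this toy corpus, open discovery starting from "Raynaud syndrome" proposes "fish oil" as the only unstated connection, and closed discovery between the two returns the intermediate terms that support it.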


Use of databases

Besides extracting information from the body of scientific articles, LBD systems often employ structured knowledge from biocurated biological resources, such as the Online Mendelian Inheritance in Man (OMIM).


List of systems

These are the published LBD systems, ordered by date of publication:
* 1986 - Arrowsmith
* 2000 - BITOLA V1
* 2001 - DAD
* 2003 - LitLinker
* 2004 - ACS
* 2004 - Manjal
* 2004 - IRIDESCENT
* 2005 - BITOLA V2
* 2006 - LitLinker V2
* 2007 - Arrowsmith V2
* 2008 - Anni 2.0
* 2008 - CoPub Discovery
* 2009 - RajoLink
* 2010 - Sem-BT
* 2015 - Obvio
* 2016 - Spark
* 2017 - Mine the gap
* 2019 - LION LBD


Semantic typing

A common task in literature-based discovery is assigning words or concepts to different semantic types. A concept might be classified under one type or multiple types. For example, in the Unified Medical Language System (UMLS) the term ''migraine'' is classified under the type ''disease or syndrome'', while the term ''magnesium'' is under two types: ''biologically active substance'' and ''element, ion, or isotope''. The ''typing'' of concepts focuses the discovery of connections on particular classes of concepts, e.g. ''diseases''-''genes'' or ''diseases''-''drugs''.
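How semantic typing narrows the search can be shown with a small Python sketch. The lookup table below is a hand-written stand-in for a real UMLS lookup and covers only the two concepts mentioned above; the names `SEMANTIC_TYPES`, `types_of` and `filter_pairs` are hypothetical.

```python
# Hand-written stand-in for a semantic type lookup (illustrative only).
SEMANTIC_TYPES = {
    "migraine": {"disease or syndrome"},
    "magnesium": {"biologically active substance", "element, ion, or isotope"},
}


def types_of(concept):
    """Return the set of semantic types of a concept (empty if unknown)."""
    return SEMANTIC_TYPES.get(concept, frozenset())


def filter_pairs(pairs, left_types, right_types):
    """Keep only pairs whose left and right concepts each carry
    at least one of the requested semantic types."""
    kept = []
    for left, right in pairs:
        if types_of(left) & left_types and types_of(right) & right_types:
            kept.append((left, right))
    return kept


candidate_pairs = [
    ("migraine", "magnesium"),
    ("magnesium", "migraine"),
    ("migraine", "unknown term"),
]
disease_substance = filter_pairs(
    candidate_pairs,
    left_types={"disease or syndrome"},
    right_types={"biologically active substance"},
)
```

Because a concept may carry several types, the filter tests for a non-empty intersection rather than equality, so ''magnesium'' matches through either of its two types.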


System evaluation

The evaluation of literature-based discoveries is challenging and includes both experimental and ''in silico'' methods. Methods try to quantify the knowledge generated by a system, which should be provided in an amount and richness that is useful for scientists. Evaluation is difficult in LBD for several reasons: disagreement about the role of LBD systems in research and thus about what makes a system successful; difficulty in determining how useful, interesting or actionable a discovery is; and difficulty in objectively defining a ‘discovery’, which hinders the creation of a standard evaluation set that quantifies when a discovery has been replicated or found.

A popular method used in LBD is to ''replicate previous discoveries''. These are usually LBD-based discoveries, as they are relatively easy to quantify compared to other discoveries. There are only a handful of such discoveries, and approaches tuned to perform well on them might not generalise. In this type of evaluation, the literature published before the discovery to be replicated is used to generate a ranked list of discovery candidates as target or linking terms. Success is measured by reporting the rank of the term(s) of interest; the closer to the top of the list, the better the approach.

''Literature- or time-slicing'' involves splitting the existing literature at a point in time. The LBD system is then exposed to the literature before the split and is evaluated by how many of the discoveries in the later period it can find. LBD systems have used term co-occurrences, relationships from external biomedical resources (e.g. SemMedDB) and semantic relationships to generate the gold standards. A high-precision approach is to have experts generate the gold standard, but this is time-consuming, expensive and tends to produce low recall. The advantage of time-slicing over the replication of previous discoveries is the evaluation on a large number of test instances. This raises the need for evaluation metrics that can quantify performance on large, ranked lists. LBD works have used metrics popular in information retrieval, including precision, recall, area under the curve (AUC), precision at ''k'', mean average precision (MAP) and others.

The approach of ''proposing new discoveries or treatments'' goes beyond replicating past discoveries or predicting time-sliced instances of a particular relationship and shows that a system is capable of being used in realistic situations. This is usually accompanied by peer-reviewed publication in the domain or vetting by a domain expert.
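A time-sliced evaluation with ranked-list metrics can be sketched as follows. The sketch assumes that the literature has already been reduced to dated concept pairs and that the gold standard consists of pairs first appearing after the cutoff; the function names and the toy data are hypothetical. Precision at ''k'' and average precision follow their usual information retrieval definitions, and MAP is the mean of average precision over several queries.

```python
def time_slice(dated_pairs, cutoff_year):
    """Split dated concept pairs at a cutoff year.

    Returns (pairs seen before the cutoff, gold-standard pairs that
    appear only from the cutoff onwards)."""
    before = {frozenset(p) for p, year in dated_pairs if year < cutoff_year}
    after = {frozenset(p) for p, year in dated_pairs if year >= cutoff_year}
    return before, after - before


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked candidates that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item occurs."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Toy dated pairs: the a-c link is first stated after the 1985 cutoff.
dated_pairs = [
    (("a", "b"), 1980),
    (("b", "c"), 1983),
    (("a", "c"), 1990),
    (("a", "b"), 1992),
]
train, gold = time_slice(dated_pairs, 1985)

# Toy ranked output of some LBD system and the candidates judged correct.
ranked = ["c1", "c2", "c3", "c4"]
relevant = {"c1", "c3"}
p_at_2 = precision_at_k(ranked, relevant, 2)
ap = average_precision(ranked, relevant)
```

The system would be run on `train` only, and its ranked candidates compared against `gold`. In the toy ranking, one of the top two candidates is relevant, so precision at 2 is 0.5, and the average precision is (1/1 + 2/3) / 2.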


Text mining

The automation of literature-based discovery relies heavily on text mining. The language in scientific articles often includes ambiguities, and an important step for coherent parsing of the literature is extracting the sense of each term in the context in which it is used, a task called word-sense disambiguation (WSD). For example, gene symbols such as CT (''PCYT1A'') and MR (''NR3C2'') can be confused with the acronyms for computed tomography and magnetic resonance, requiring sophisticated disambiguation systems. Terms are often reconciled to ontologies or other sources of unique identifiers, such as the Unified Medical Language System (UMLS). This process of mapping multiple different utterances to a single name or identifier is called normalization.
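Disambiguation and normalization can be illustrated together with a deliberately simple Python sketch. It assumes a hand-written sense inventory with context cue words; the identifiers (such as `GENE:PCYT1A`) are made-up labels for this example and are not real UMLS or ontology identifiers, and counting cue-word overlap is a stand-in for the far more sophisticated methods used in practice.

```python
import re

# Hypothetical sense inventory: surface form -> [(identifier, cue words)].
SENSES = {
    "ct": [
        ("GENE:PCYT1A", {"gene", "expression", "protein", "mutation"}),
        ("PROC:computed_tomography", {"scan", "imaging", "radiology", "patient"}),
    ],
    "mr": [
        ("GENE:NR3C2", {"gene", "expression", "receptor", "mutation"}),
        ("PROC:magnetic_resonance", {"scan", "imaging", "radiology", "patient"}),
    ],
}


def normalize(mention, context):
    """Map an ambiguous mention to one identifier by counting cue words
    in its context; return None if the mention is unknown or no cue matches."""
    candidates = SENSES.get(mention.lower())
    if not candidates:
        return None
    words = set(re.findall(r"[a-z0-9]+", context.lower()))
    best_id, best_overlap = None, 0
    for concept_id, cues in candidates:
        overlap = len(cues & words)
        if overlap > best_overlap:
            best_id, best_overlap = concept_id, overlap
    return best_id


imaging_sense = normalize("CT", "The patient underwent a CT scan.")
gene_sense = normalize("CT", "CT gene expression was reduced.")
unresolved = normalize("CT", "No helpful words here.")
```

Returning `None` when no cue matches keeps the sketch from guessing; a production system would instead fall back to a prior over senses or to a trained classifier.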


Usage


Life sciences

LBD has already been used in different ways to identify new connections between biomedical entities and new candidate genes and treatments for illnesses.


Drug discovery

LBD has seen use in drug development and repurposing as well as in predicting adverse drug reactions. The method of literature-based discovery has been used to search for treatments for a number of human diseases, including:
* diabetic retinopathy
* dilated cardiomyopathy
* Parkinson's disease
* prostate cancer
* gastric cancer
* multiple sclerosis


Gene and protein function discovery

The approach has also been used to propose relations of genes with particular diseases, such as breast cancer. In the context of systems vaccinology, it was used to identify proteins that are related to interferon gamma and that play a role in the response to vaccines. It has also been used to propose mechanisms for currently used drugs.


Biomarker discovery

LBD has been explored as a tool to identify diagnostic and prognostic biomarkers for diseases, e.g. for the risk of type 2 diabetes.


Other uses

Besides providing scientific hypotheses about the world, LBD has also been used to improve data analysis, via the automatic identification of possible confounding factors using the medical literature. It has also been used to better understand disease etiology and the relations between different diseases, for example by looking for the genes connecting myocardial infarction and depression, and for connections between psychiatric and somatic diseases.


Beyond life sciences

LBD has mostly been deployed in the biomedical domain, but it has also been used outside of it: it has been applied to research into developing water purification systems, accelerating the development of developing countries and identifying promising research collaborations.


See also

* Arrowsmith System
* Implicature
* Latent semantic indexing
* Metaphor
* Text mining
* Biocuration
* BioCreative


Additional reading

* Wilson, Patrick (1977). ''Public Knowledge, Private Ignorance: Toward a Library and Information Policy''. Greenwood Publishing Group. p. 156.

