A controlled vocabulary provides a way to organize knowledge for subsequent retrieval. Controlled vocabularies are used in subject indexing schemes, subject headings, thesauri, taxonomies and other knowledge organization systems. Controlled vocabulary schemes mandate the use of predefined, preferred terms that have been preselected by the designers of the schemes, in contrast to natural language vocabularies, which have no such restriction.
In library and information science
In library and information science, controlled vocabulary is a carefully selected list of words and phrases, which are used to tag units of information (document or work) so that they may be more easily retrieved by a search. Controlled vocabularies solve the problems of homographs, synonyms and polysemes by a bijection between concepts and preferred terms. In short, controlled vocabularies ensure consistency and reduce the unwanted ambiguity inherent in normal human languages, where the same concept can be given different names.
For example, in the Library of Congress Subject Headings (a subject heading system that uses a controlled vocabulary), preferred terms (subject headings in this case) have to be chosen to handle choices between variant spellings of the same word (American versus British), choices between scientific and popular terms (''cockroach'' versus ''Periplaneta americana''), and choices between synonyms (''automobile'' versus ''car''), among other difficult issues.
Choices of preferred terms are based on the principles of ''user warrant'' (what terms users are likely to use), ''literary warrant'' (what terms are generally used in the literature and documents), and ''structural warrant'' (terms chosen by considering the structure and scope of the controlled vocabulary).
Controlled vocabularies also typically handle the problem of homographs with qualifiers. For example, the term ''pool'' has to be qualified to refer to either ''swimming pool'' or the game ''pool'' to ensure that each preferred term or heading refers to only one concept.
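The mapping from variant terms to preferred terms, and the use of qualifiers for homographs, can be illustrated with a minimal Python sketch. The table contents, the direction of each preference (for example ''automobile'' over ''car'') and the qualifier format `pool (game)` are assumptions chosen for illustration; they are not taken from any actual subject heading list.

```python
# Hypothetical entry vocabulary: every variant points to exactly one
# preferred term, so each concept ends up with a single label.
PREFERRED = {
    "car": "automobile",
    "automobile": "automobile",
    "colour": "color",
    "color": "color",
    "periplaneta americana": "cockroach",
    "cockroach": "cockroach",
}

# Homographs are split into qualified headings so that each heading
# names only one concept (the qualifier format is an assumption).
QUALIFIED = {
    "pool": ["pool (game)", "swimming pool"],
}


def resolve(term):
    """Return the candidate preferred terms for a user-supplied term."""
    key = term.strip().lower()
    if key in QUALIFIED:
        # Ambiguous entry: the searcher has to pick one qualified heading.
        return QUALIFIED[key]
    if key in PREFERRED:
        return [PREFERRED[key]]
    return []  # the term is not part of the vocabulary


print(resolve("Car"))   # ['automobile']
print(resolve("pool"))  # ['pool (game)', 'swimming pool']
```

A lookup that returns more than one candidate signals a homograph; a real system would typically ask the user to choose between the qualified headings.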
Types used in libraries
There are two main kinds of controlled vocabulary tools used in libraries: subject headings and thesauri. While the differences between the two are diminishing, there are still some minor differences:
* Historically, subject headings were designed to describe books in library catalogs by catalogers while thesauri were used by indexers to apply index terms to documents and articles.
* Subject headings tend to be broader in scope describing whole books, while thesauri tend to be more specialized covering very specific disciplines.
* Because of the card catalog system, subject headings tend to have terms that are in indirect order (though with the rise of automated systems this is being removed), while thesaurus terms are always in direct order.
* Subject headings tend to use more pre-coordination of terms, such that the designer of the controlled vocabulary will combine various concepts together to form one preferred subject heading (e.g., children and terrorism), while thesauri tend to use singular direct terms. Thesauri list not only equivalent terms but also narrower, broader and related terms among various preferred and non-preferred (but potentially synonymous) terms, while historically most subject headings did not. For example, the Library of Congress Subject Headings did not have much syndetic structure until 1943, and it was not until 1985 that it began to adopt the thesaurus-type terms "Broader term" and "Narrower term".
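The syndetic structure described in the list above (equivalent, broader, narrower and related terms) can be sketched as a small data structure. This is a hypothetical illustration in Python: the class name `ThesaurusEntry`, its field names and the sample terms are assumptions and do not reproduce any published thesaurus.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    preferred: str
    use_for: list = field(default_factory=list)   # non-preferred equivalents
    broader: list = field(default_factory=list)   # broader terms
    narrower: list = field(default_factory=list)  # narrower terms
    related: list = field(default_factory=list)   # related terms


def build_index(entries):
    """Map every preferred and non-preferred term to its entry."""
    index = {}
    for entry in entries:
        index[entry.preferred] = entry
        for variant in entry.use_for:
            index[variant] = entry
    return index


def expand_down(index, term):
    """Return the preferred term plus all narrower terms, depth first.

    Assumes the narrower-term links contain no cycles.
    """
    entry = index[term]
    result = [entry.preferred]
    for child in entry.narrower:
        result.extend(expand_down(index, child))
    return result


entries = [
    ThesaurusEntry("team sports", narrower=["football"]),
    ThesaurusEntry("football", broader=["team sports"],
                   narrower=["association football", "rugby football"]),
    ThesaurusEntry("association football", use_for=["soccer"],
                   broader=["football"]),
    ThesaurusEntry("rugby football", broader=["football"]),
]
index = build_index(entries)

print(index["soccer"].preferred)       # association football
print(expand_down(index, "football"))
```

Looking up the non-preferred term "soccer" leads to the preferred entry, and expanding "football" downwards collects the narrower terms that a searcher may also want to include.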
The terms are chosen and organized by trained professionals (including librarians and information scientists) who possess expertise in the subject area. Controlled vocabulary terms can accurately describe what a given document is actually about, even if the terms themselves do not occur within the document's text. Well-known subject heading systems include the Library of Congress system, Medical Subject Headings (MeSH) created by the United States National Library of Medicine, and Sears. Well-known thesauri include the Art and Architecture Thesaurus and the ERIC Thesaurus.
When selecting terms for a controlled vocabulary, the designer has to consider the specificity of the term chosen, whether to use direct entry, inter consistency and stability of the language.
Lastly, the amount of pre-coordination (in which case the degree of enumeration versus synthesis becomes an issue) and post-coordination in the system is another important issue. Controlled vocabulary elements (terms or phrases) employed as tags to aid in the content identification process of documents or other information system entities (e.g. DBMS, Web Services) qualify as metadata.
Indexing languages
There are three main types of indexing languages.
* Controlled indexing language – only approved terms can be used by the indexer to describe the document
* Natural language indexing language – any term from the document in question can be used to describe the document
* Free indexing language – any term (not only from the document) can be used to describe the document
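The difference between the three indexing languages listed above can be expressed as a rule for which index terms an indexer may assign. The following Python sketch is only an illustration: the function name `permitted`, the mode labels and the sample data are assumptions, and the substring test is a deliberate simplification of "any term from the document".

```python
def permitted(term, mode, document_text, approved_terms):
    """Decide whether an index term may be assigned under a given mode.

    mode is one of "controlled", "natural" or "free" (hypothetical labels).
    """
    term = term.lower()
    if mode == "controlled":
        # Only approved terms may be used.
        return term in approved_terms
    if mode == "natural":
        # Only terms that occur in the document itself may be used.
        return term in document_text.lower()
    if mode == "free":
        # Any term may be used, whether or not it occurs in the document.
        return True
    raise ValueError(f"unknown indexing mode: {mode}")


text = "The match report covers rugby league and rugby union fixtures."
approved = {"rugby football"}

print(permitted("rugby football", "controlled", text, approved))  # True
print(permitted("rugby league", "controlled", text, approved))    # False
print(permitted("rugby league", "natural", text, approved))       # True
print(permitted("team sports", "free", text, approved))           # True
```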
When indexing a document, the indexer also has to choose the level of indexing exhaustivity, the level of detail in which the document is described. For example, using low indexing exhaustivity, minor aspects of the work will not be described with index terms. In general the higher the indexing exhaustivity, the more terms indexed for each document.
In recent years, free text search as a means of access to documents has become popular. This involves using natural language indexing with indexing exhaustivity set to maximum (every word in the text is ''indexed''). These methods have been compared in some studies, such as the 2007 article, "A Comparative Evaluation of Full-text, Concept-based, and Context-sensitive Search".
Advantages
Controlled vocabularies are often claimed to improve the accuracy of free text searching, for example by reducing irrelevant items in the retrieval list. These irrelevant items (false positives) are often caused by the inherent ambiguity of natural language. Take the English word ''football'' for example. ''Football'' is the name given to a number of different team sports. Worldwide the most popular of these team sports is association football, which also happens to be called ''soccer'' in several countries. The word ''football'' is also applied to rugby football (rugby union and rugby league), American football, Australian rules football, Gaelic football, and Canadian football. A search for ''football'' therefore will retrieve documents that are about several completely different sports. Controlled vocabulary solves this problem by tagging the documents in such a way that the ambiguities are eliminated.
Compared to free text searching, the use of a controlled vocabulary can dramatically increase the performance of an information retrieval system, if performance is measured by precision (the percentage of documents in the retrieval list that are actually relevant to the search topic).
In some cases controlled vocabulary can enhance recall as well, because unlike natural language schemes, once the correct preferred term is searched, there is no need to search for other terms that might be synonyms of that term.
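Precision and recall can be made concrete with a small calculation. The Python sketch below uses invented document identifiers and retrieval results for a searcher interested in association football; the numbers are illustrative assumptions, not measurements from any real system.

```python
def precision(retrieved, relevant):
    """Share of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Share of relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


# Hypothetical collection: d1-d3 are about association football,
# d4 is about rugby football and d5 about American football.
relevant = {"d1", "d2", "d3"}

# A free text search for "football" matches d1, d2, d4 and d5;
# d3 only uses the word "soccer" and is missed.
free_text_hits = {"d1", "d2", "d4", "d5"}

# A search on the preferred term "association football" retrieves
# exactly the documents tagged with it.
controlled_hits = {"d1", "d2", "d3"}

print(precision(free_text_hits, relevant))   # 0.5
print(recall(free_text_hits, relevant))
print(precision(controlled_hits, relevant))  # 1.0
print(recall(controlled_hits, relevant))     # 1.0
```

In this constructed case the controlled search improves both measures. The free text search loses precision to the homograph ''football'' and loses recall to the synonym ''soccer''.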
Disadvantages
A controlled vocabulary search may lead to unsatisfactory recall, in that it will fail to retrieve some documents that are actually relevant to the search question.
This is particularly problematic when the search question involves terms that are sufficiently tangential to the subject area that the indexer might have tagged the document with a different term from the one the searcher has in mind, even though the searcher would consider the two equivalent. Essentially, this can be avoided only by an experienced user of controlled vocabulary whose understanding of the vocabulary coincides with that of the indexer.
Another possibility is that the article is just not tagged by the indexer because indexing exhaustivity is low. For example, an article might mention football as a secondary focus, and the indexer might decide not to tag it with "football" because it is not important enough compared to the main focus. But it turns out that for the searcher that article is relevant and hence recall fails. A free text search would automatically pick up that article regardless.
On the other hand, free text searches have high exhaustivity (every word is searched), so although they have much lower precision, they have the potential for high recall, as long as the searcher overcomes the problem of synonyms by entering every combination.
Controlled vocabularies may become outdated rapidly in fast developing fields of knowledge, unless the preferred terms are updated regularly. Even in an ideal scenario, a controlled vocabulary is often less specific than the words of the text itself. Indexers trying to choose the appropriate index terms might misinterpret the author, while this precise problem is not a factor in free text searching, as it uses the author's own words.
The use of controlled vocabularies can be costly compared to free text searches because human experts or expensive automated systems are necessary to index each entry. Furthermore, the user has to be familiar with the controlled vocabulary scheme to make best use of the system. But as already mentioned, the control of synonyms and homographs can help increase precision.
Numerous methodologies have been developed to assist in the creation of controlled vocabularies, including faceted classification, which enables a given data record or document to be described in multiple ways.
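As a rough illustration of faceted classification, the Python sketch below describes each record along several independent facets and selects records by any combination of them. The facet names (`sport`, `region`, `form`), the record identifiers and the helper `select` are hypothetical.

```python
# Each record is described along several independent facets, and each
# facet value would be drawn from its own controlled list.
records = {
    "doc1": {"sport": "association football", "region": "Europe", "form": "rulebook"},
    "doc2": {"sport": "association football", "region": "Asia", "form": "history"},
    "doc3": {"sport": "rugby football", "region": "Europe", "form": "history"},
}


def select(records, **facets):
    """Return the ids of records matching every requested facet value."""
    return sorted(
        doc_id
        for doc_id, description in records.items()
        if all(description.get(name) == value for name, value in facets.items())
    )


print(select(records, region="Europe"))                  # ['doc1', 'doc3']
print(select(records, region="Europe", form="history"))  # ['doc3']
```

Because the facets are independent, the same record can be reached by several different routes, which is the sense in which it is "described in multiple ways".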
Word choice in chosen vocabularies is not neutral, and the indexer must carefully consider the ethics of their word choices. For example, traditionally colonialist terms have often been the preferred terms in chosen vocabularies when discussing First Nations issues, which has caused controversy.
Applications
Controlled vocabularies, such as the Library of Congress Subject Headings, are an essential component of bibliography, the study and classification of books. They were initially developed in library and information science. In the 1950s, government agencies began to develop controlled vocabularies for the burgeoning journal literature in specialized fields; an example is the Medical Subject Headings (MeSH) developed by the U.S. National Library of Medicine. Subsequently, for-profit firms (called abstracting and indexing services) emerged to index the fast-growing literature in every field of knowledge. In the 1960s, an online bibliographic database industry developed based on dialup X.25 networking. These services were seldom made available to the public because they were difficult to use; specialist librarians called search intermediaries handled the searching job. In the 1980s, the first full text databases appeared; these databases contain the full text of the indexed articles as well as the bibliographic information. Online bibliographic databases have migrated to the Internet and are now publicly available; however, most are proprietary and can be expensive to use. Students enrolled in colleges and universities may be able to access some of these services without charge; some of these services may be accessible without charge at a public library.
Technical communication
In large organizations, controlled vocabularies may be introduced to improve technical communication. The use of controlled vocabulary ensures that everyone is using the same word to mean the same thing. This consistency of terms is one of the most important concepts in technical writing and knowledge management, where effort is expended to use the same word throughout a document or organization instead of slightly different ones to refer to the same thing.
Semantic web and structured data
Web searching could be dramatically improved by the development of a controlled vocabulary for describing Web pages; the use of such a vocabulary could culminate in a Semantic Web, in which the content of Web pages is described using a machine-readable metadata scheme. One of the first proposals for such a scheme is the Dublin Core Initiative. An example of a controlled vocabulary which is usable for indexing web pages is PSH.
It is unlikely that a single metadata scheme will ever succeed in describing the content of the entire Web. To create a Semantic Web, it may be necessary to draw from two or more metadata systems to describe a Web page's contents. The eXchangeable Faceted Metadata Language (XFML) is designed to enable controlled vocabulary creators to publish and share metadata systems. XFML is designed on faceted classification principles.
Controlled vocabularies of the Semantic Web define the concepts and relationships (terms) used to describe a field of interest or area of concern. For instance, to declare a person in a machine-readable format, a vocabulary is needed that has the formal definition of "Person", such as the Friend of a Friend (FOAF) vocabulary, which has a Person class that defines typical properties of a person including, but not limited to, name, honorific prefix, affiliation, email address, and homepage, or the Person vocabulary of Schema.org. Similarly, a book can be described using the Book vocabulary of Schema.org and general publication terms from the Dublin Core vocabulary, an event with the Event vocabulary of Schema.org, and so on.
To use machine-readable terms from any controlled vocabulary, web designers can choose from a variety of annotation formats, including RDFa, HTML5 Microdata, or JSON-LD in the markup, or RDF serializations (RDF/XML, Turtle, N3, TriG, TriX) in external files.
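As an illustration of one of these formats, the following Python sketch builds a JSON-LD description of a person using terms from the Schema.org Person vocabulary. The helper name `person_jsonld` and all sample values (name, organization, addresses) are hypothetical; only the property names are taken from Schema.org.

```python
import json


def person_jsonld(name, honorific_prefix=None, affiliation=None,
                  email=None, url=None):
    """Serialize a person description as JSON-LD using Schema.org terms."""
    doc = {
        "@context": "https://schema.org",
        "@type": "Person",
        "name": name,
    }
    # Optional properties are included only when a value is supplied.
    optional = {
        "honorificPrefix": honorific_prefix,
        "email": email,
        "url": url,
    }
    for key, value in optional.items():
        if value is not None:
            doc[key] = value
    if affiliation is not None:
        doc["affiliation"] = {"@type": "Organization", "name": affiliation}
    return json.dumps(doc, indent=2)


# Placeholder values for illustration only.
print(person_jsonld(
    "Jane Example",
    honorific_prefix="Dr.",
    affiliation="Example University",
    email="jane@example.org",
    url="https://example.org/jane",
))
```

The resulting text would typically be embedded in a page inside a `<script type="application/ld+json">` element, so that the description travels with the markup.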
External links
Directory of Linked Open Vocabularies (LOV)