Terminology extraction (also known as term extraction, glossary extraction, term recognition, or terminology mining) is a subtask of
information extraction. The goal of terminology extraction is to automatically extract relevant terms from a given
corpus.
In the semantic web era, a growing number of communities and networked enterprises started to access and interoperate through the internet. Modeling these communities and their information needs is important for several web applications, like topic-driven web crawlers, web services, recommender systems, etc. The development of terminology extraction is also essential to the language industry.
One of the first steps to model a knowledge domain is to collect a vocabulary of domain-relevant terms, constituting the linguistic surface manifestation of domain concepts. Several methods to automatically extract technical terms from domain-specific document warehouses have been described in the literature.
Typically, approaches to automatic term extraction make use of linguistic processors (part-of-speech tagging, phrase chunking) to extract terminological candidates, i.e. syntactically plausible terminological noun phrases. Noun phrases include compounds (e.g. "credit card"), adjective noun phrases (e.g. "local tourist information office"), and prepositional noun phrases (e.g. "board of directors"). In English, the first two (compounds and adjective noun phrases) are the most frequent. Terminological entries are then filtered from the candidate list using statistical and machine learning methods. Once filtered, because of their low ambiguity and high specificity, these terms are particularly useful for conceptualizing a knowledge domain or for supporting the creation of a domain ontology or a terminology base. Furthermore, terminology extraction is a useful starting point for semantic similarity, knowledge management, human translation, and machine translation.
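The two-stage pipeline described above (a linguistic filter that proposes noun-phrase candidates, followed by a statistical filter) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy `TAGGED` input is hand-tagged rather than produced by a real part-of-speech tagger, the tag names are hypothetical, and a plain frequency threshold stands in for the statistical and machine learning measures used in practice.

```python
import re
from collections import Counter

# Hypothetical toy input: sentences already tagged with coarse POS labels.
# In a real system a part-of-speech tagger would produce these pairs.
TAGGED = [
    [("the", "DET"), ("credit", "NOUN"), ("card", "NOUN"),
     ("was", "VERB"), ("declined", "VERB")],
    [("she", "PRON"), ("used", "VERB"), ("a", "DET"),
     ("credit", "NOUN"), ("card", "NOUN")],
    [("the", "DET"), ("board", "NOUN"), ("of", "ADP"),
     ("directors", "NOUN"), ("met", "VERB")],
    [("the", "DET"), ("local", "ADJ"), ("tourist", "NOUN"),
     ("information", "NOUN"), ("office", "NOUN"), ("closed", "VERB")],
]

# One letter per tag, so a sentence becomes a string a regex can scan.
TAG_CODE = {"ADJ": "A", "NOUN": "N", "ADP": "P"}

# (adjective | noun)* noun, optionally followed by preposition + the same.
# Covers compounds, adjective noun phrases, and prepositional noun phrases.
PATTERN = re.compile(r"[AN]*N(?:P[AN]*N)?")


def candidates(sentence):
    """Yield multiword noun-phrase candidates from one tagged sentence."""
    codes = "".join(TAG_CODE.get(tag, "x") for _, tag in sentence)
    for match in PATTERN.finditer(codes):
        if match.end() - match.start() >= 2:  # skip single-word matches
            words = sentence[match.start():match.end()]
            yield " ".join(word.lower() for word, _ in words)


def extract_terms(tagged_sentences, min_freq=2):
    """Keep candidates seen at least min_freq times (a crude statistical filter)."""
    counts = Counter(c for s in tagged_sentences for c in candidates(s))
    return [term for term, n in counts.most_common() if n >= min_freq]


print(extract_terms(TAGGED, min_freq=2))
```

The sketch only emits the longest match at each position, so nested sub-terms (for example "information office" inside "local tourist information office") are not proposed separately; a fuller system would typically score those as well.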
Bilingual terminology extraction
The methods for terminology extraction can be applied to parallel corpora. Combined with, for example, co-occurrence statistics, candidates for term translations can be obtained. Bilingual terminology can also be extracted from comparable corpora (corpora containing texts of the same text type and domain that are not translations of one another).
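As one illustration of how co-occurrence statistics over a parallel corpus can propose term translations, the sketch below scores source and target term pairs with the Dice coefficient. The choice of Dice is an assumption (the text does not name a specific measure), and the `ALIGNED` data is a hypothetical toy example in which terms have already been extracted from each side of every aligned sentence pair.

```python
from collections import Counter
from itertools import product

# Hypothetical aligned data: for each sentence pair, the set of terms
# found on the source side and on the target side.
ALIGNED = [
    ({"credit card"}, {"carte de crédit"}),
    ({"credit card", "board of directors"},
     {"carte de crédit", "conseil d'administration"}),
    ({"board of directors"}, {"conseil d'administration"}),
]


def dice_scores(aligned):
    """Dice coefficient for every source/target term pair that co-occurs."""
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_terms, tgt_terms in aligned:
        src_freq.update(src_terms)
        tgt_freq.update(tgt_terms)
        pair_freq.update(product(src_terms, tgt_terms))
    return {
        (s, t): 2 * n / (src_freq[s] + tgt_freq[t])
        for (s, t), n in pair_freq.items()
    }


def best_translation(aligned):
    """Pick the highest-scoring target candidate for each source term."""
    best = {}
    for (s, t), score in dice_scores(aligned).items():
        if s not in best or score > best[s][1]:
            best[s] = (t, score)
    return best


for source, (target, score) in best_translation(ALIGNED).items():
    print(f"{source} -> {target} ({score:.2f})")
```

A pair that always appears together scores 1.0, while a pair that co-occurs only incidentally scores lower, which is what lets the ranking separate likely translations from noise.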
See also
* Computational linguistics
* Glossary
* Natural language processing
* Domain ontology
* Subject indexing
* Taxonomy (general)
* Terminology
* Text mining
* Text simplification