Terminology extraction (also known as term extraction, glossary extraction, term recognition, or terminology mining) is a subtask of
information extraction
Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most of the cases this activity concer ...
. The goal of terminology extraction is to automatically extract relevant terms from a given
corpus
Corpus is Latin for "body". It may refer to:
Linguistics
* Text corpus, in linguistics, a large and structured set of texts
* Speech corpus, in linguistics, a large set of speech audio files
* Corpus linguistics, a branch of linguistics
Music
* ...
.
In the
semantic web era, a growing number of communities and networked enterprises started to access and interoperate through the
internet
The Internet (or internet) is the global system of interconnected computer networks that uses the Internet protocol suite (TCP/IP) to communicate between networks and devices. It is a ''internetworking, network of networks'' that consists ...
. Modeling these communities and their information needs is important for several
web application
A web application (or web app) is application software that is accessed using a web browser. Web applications are delivered on the World Wide Web to users with an active network connection.
History
In earlier computing models like client-serve ...
s, like topic-driven
web crawler
A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (''web spi ...
s,
web services,
recommender systems
A recommender system, or a recommendation system (sometimes replacing 'system' with a synonym such as platform or engine), is a subclass of information filtering system that provide suggestions for items that are most pertinent to a particular ...
, etc. The development of terminology extraction is also essential to the
language industry.
One of the first steps to model a
knowledge domain is to collect a vocabulary of domain-relevant terms, constituting the linguistic surface manifestation of domain
concepts. Several methods to automatically extract technical terms from domain-specific document warehouses have been described in the literature.
Typically, approaches to automatic term extraction make use of linguistic processors (
part of speech tagging
In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definiti ...
,
phrase chunking
Phrase chunking is a phase of natural language processing that separates and segments a sentence into its subconstituents, such as noun, verb, and prepositional phrases, abbreviated as NP, VP, and PP, respectively. Typically, each subconstituent or ...
) to extract terminological candidates, i.e. syntactically plausible terminological
noun phrase
In linguistics, a noun phrase, or nominal (phrase), is a phrase that has a noun or pronoun as its head or performs the same grammatical function as a noun. Noun phrases are very common cross-linguistically, and they may be the most frequently o ...
s. Noun phrases include compounds (e.g. "credit card"), adjective noun phrases (e.g. "local tourist information office"), and prepositional noun phrases (e.g. "board of directors"). In English, the first two (compounds and adjective noun phrases) are the most frequent. Terminological entries are then filtered from the candidate list using statistical and
machine learning
Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence.
Machine ...
methods. Once filtered, because of their low ambiguity and high specificity, these terms are particularly useful for conceptualizing a knowledge domain or for supporting the creation of a
domain ontology or a terminology base. Furthermore, terminology extraction is a very useful starting point for
semantic similarity,
knowledge management
Knowledge management (KM) is the collection of methods relating to creating, sharing, using and managing the knowledge and information of an organization. It refers to a multidisciplinary approach to achieve organisational objectives by making ...
,
human translation and
machine translation
Machine translation, sometimes referred to by the abbreviation MT (not to be confused with computer-aided translation, machine-aided human translation or interactive translation), is a sub-field of computational linguistics that investigates t ...
, etc.
Bilingual terminology extraction
The methods for terminology extraction can be applied to
parallel corpora. Combined with e.g.
co-occurrence In linguistics, co-occurrence or cooccurrence is an above-chance frequency of occurrence of two terms (also known as coincidence or concurrence) from a text corpus alongside each other in a certain order. Co-occurrence in this linguistic sense ...
statistics, candidates for term translations can be obtained. Bilingual terminology can be extracted also from comparable corpora
[
] (corpora containing texts within the same text type, domain but not translations of documents between each other).
See also
*
Computational linguistics
Computational linguistics is an Interdisciplinarity, interdisciplinary field concerned with the computational modelling of natural language, as well as the study of appropriate computational approaches to linguistic questions. In general, comput ...
*
Glossary
A glossary (from grc, γλῶσσα, ''glossa''; language, speech, wording) also known as a vocabulary or clavis, is an alphabetical list of terms in a particular domain of knowledge with the definitions for those terms. Traditionally, a gl ...
*
Natural language processing
Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to proc ...
*
Domain ontology
*
Subject indexing
*
Taxonomy (general)
Taxonomy is the practice and science of categorization or classification.
A taxonomy (or taxonomical classification) is a scheme of classification, especially a hierarchical classification, in which things are organized into groups or types. ...
*
Terminology
Terminology is a group of specialized words and respective meanings in a particular field, and also the study of such terms and their use; the latter meaning is also known as terminology science. A ''term'' is a word, compound word, or multi-wo ...
*
Text mining
*
Text simplification
Text simplification is an operation used in natural language processing to change, enhance, classify, or otherwise process an existing body of human-readable text so its grammar and structure is greatly simplified while the underlying meaning and ...
References
{{Natural Language Processing
Tasks of natural language processing
Extraction Extraction may refer to:
Science and technology
Biology and medicine
* Comedo extraction, a method of acne treatment
* Dental extraction, the surgical removal of a tooth from the mouth
Computing and information science
* Data extraction, the pro ...
Computing terminology