Terminology extraction (also known as term extraction, glossary extraction, term recognition, or terminology mining) is a subtask of

information extraction Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most of the cases this activity concer ...

. The goal of terminology extraction is to automatically extract relevant terms from a given

corpus Corpus is Latin for "body". It may refer to: Linguistics * Text corpus, in linguistics, a large and structured set of texts * Speech corpus, in linguistics, a large set of speech audio files * Corpus linguistics, a branch of linguistics Music * ...

. In the semantic web era, a growing number of communities and networked enterprises started to access and interoperate through the

internet The Internet (or internet) is the global system of interconnected computer networks that uses the Internet protocol suite (TCP/IP) to communicate between networks and devices. It is a ''internetworking, network of networks'' that consists ...

. Modeling these communities and their information needs is important for several

web application A web application (or web app) is application software that is accessed using a web browser. Web applications are delivered on the World Wide Web to users with an active network connection. History In earlier computing models like client-serve ...

s, like topic-driven

web crawler A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (''web spi ...

s, web services,

recommender systems A recommender system, or a recommendation system (sometimes replacing 'system' with a synonym such as platform or engine), is a subclass of information filtering system that provide suggestions for items that are most pertinent to a particular ...

, etc. The development of terminology extraction is also essential to the language industry. One of the first steps to model a knowledge domain is to collect a vocabulary of domain-relevant terms, constituting the linguistic surface manifestation of domain concepts. Several methods to automatically extract technical terms from domain-specific document warehouses have been described in the literature. Typically, approaches to automatic term extraction make use of linguistic processors (

part of speech tagging In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definiti ...

phrase chunking Phrase chunking is a phase of natural language processing that separates and segments a sentence into its subconstituents, such as noun, verb, and prepositional phrases, abbreviated as NP, VP, and PP, respectively. Typically, each subconstituent or ...

) to extract terminological candidates, i.e. syntactically plausible terminological

noun phrase In linguistics, a noun phrase, or nominal (phrase), is a phrase that has a noun or pronoun as its head or performs the same grammatical function as a noun. Noun phrases are very common cross-linguistically, and they may be the most frequently o ...

s. Noun phrases include compounds (e.g. "credit card"), adjective noun phrases (e.g. "local tourist information office"), and prepositional noun phrases (e.g. "board of directors"). In English, the first two (compounds and adjective noun phrases) are the most frequent. Terminological entries are then filtered from the candidate list using statistical and

machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...

methods. Once filtered, because of their low ambiguity and high specificity, these terms are particularly useful for conceptualizing a knowledge domain or for supporting the creation of a domain ontology or a terminology base. Furthermore, terminology extraction is a very useful starting point for semantic similarity,

knowledge management Knowledge management (KM) is the collection of methods relating to creating, sharing, using and managing the knowledge and information of an organization. It refers to a multidisciplinary approach to achieve organisational objectives by making ...

, human translation and

machine translation Machine translation, sometimes referred to by the abbreviation MT (not to be confused with computer-aided translation, machine-aided human translation or interactive translation), is a sub-field of computational linguistics that investigates t ...

, etc.

Bilingual terminology extraction

The methods for terminology extraction can be applied to parallel corpora. Combined with e.g.

co-occurrence In linguistics, co-occurrence or cooccurrence is an above-chance frequency of occurrence of two terms (also known as coincidence or concurrence) from a text corpus alongside each other in a certain order. Co-occurrence in this linguistic sense ...

statistics, candidates for term translations can be obtained. Bilingual terminology can be extracted also from comparable corpora (corpora containing texts within the same text type, domain but not translations of documents between each other).

References

{{Natural Language Processing Tasks of natural language processing

Extraction Extraction may refer to: Science and technology Biology and medicine * Comedo extraction, a method of acne treatment * Dental extraction, the surgical removal of a tooth from the mouth Computing and information science * Data extraction, the pro ...

Computing terminology

Bilingual terminology extraction

See also

References