Linguistic categories include:
* Lexical category, a part of speech such as ''noun'', ''preposition'', etc.
* Syntactic category, a similar concept which can also include phrasal categories
* Grammatical category, a grammatical feature such as ''tense'', ''gender'', etc.

The definition of linguistic categories is a major concern of linguistic theory, and thus the definition and naming of categories vary across different theoretical frameworks and grammatical traditions for different languages. The operationalization of linguistic categories in lexicography, computational linguistics, natural language processing, corpus linguistics, and terminology management typically requires resource-, problem- or application-specific definitions of linguistic categories. In cognitive linguistics, it has been argued that linguistic categories have a prototype structure like that of the categories of common words in a language (John R. Taylor (1995), ''Linguistic Categorization: Prototypes in Linguistic Theory'', 2nd ed., ch. 2, p. 21).

Linguistic category inventories

To facilitate interoperability between lexical resources, linguistic annotations and annotation tools, and to handle linguistic categories systematically across different theoretical frameworks, a number of inventories of linguistic categories have been developed and are in use; examples are given below. The practical objective of such inventories is to perform quantitative evaluation (for language-specific inventories), to train NLP tools, or to facilitate cross-linguistic evaluation, querying or annotation of language data. At a theoretical level, the existence of universal categories in human language has been postulated, e.g., in Universal grammar, but also heavily criticized.

Part-of-Speech tagsets

Schools commonly teach that there are 9 parts of speech in English: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection. However, there are clearly many more categories and sub-categories. For nouns, the plural, possessive, and singular forms can be distinguished. In many languages words are also marked for their case (role as subject, object, etc.), grammatical gender, and so on, while verbs are marked for tense, aspect, and other things. In some tagging systems, different inflections of the same root word will get different parts of speech, resulting in a large number of tags. For example, NN for singular common nouns, NNS for plural common nouns, NP for singular proper nouns (see the POS tags used in the Brown Corpus). Other tagging systems use a smaller number of tags and ignore fine differences or model them as features somewhat independent from part-of-speech (see the Universal POS tags).

In part-of-speech tagging by computer, it is typical to distinguish from 50 to 150 separate parts of speech for English. POS tagging work has been done in a variety of languages, and the set of POS tags used varies greatly with language. Tags usually are designed to include overt morphological distinctions, although this leads to inconsistencies such as case-marking for pronouns but not nouns in English, and much larger cross-language differences. The tag sets for heavily inflected languages such as Greek and Latin can be very large; tagging ''words'' in agglutinative languages such as Inuit languages may be virtually impossible. Work on stochastic methods for tagging Koine Greek (DeRose 1990) has used over 1,000 parts of speech and found that about as many words were ambiguous in that language as in English.

A morphosyntactic descriptor in the case of morphologically rich languages is commonly expressed using very short mnemonics, such as ''ncmsan'' for ''category = noun, type = common, gender = masculine, number = singular, case = accusative, animate = no''. The most popular tag set for POS tagging for American English is probably the Penn tag set, developed in the Penn Treebank project.
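The positional mnemonic described above can be unpacked mechanically. The following is a minimal sketch, assuming a hypothetical six-position layout for nouns and value codes chosen only to reproduce the ''ncmsan'' example; the names `NOUN_POSITIONS`, `NOUN_CODES` and `decode_msd` are illustrative, and real tagsets define their own positions and codes.

```python
# Hypothetical positional layout for noun descriptors; illustrative only,
# not copied from any published tagset specification.
NOUN_POSITIONS = ["category", "type", "gender", "number", "case", "animate"]

# Hypothetical value codes per position, covering little more than the example.
NOUN_CODES = {
    "category": {"n": "noun"},
    "type": {"c": "common", "p": "proper"},
    "gender": {"m": "masculine", "f": "feminine", "n": "neuter"},
    "number": {"s": "singular", "p": "plural"},
    "case": {"n": "nominative", "a": "accusative"},
    "animate": {"n": "no", "y": "yes"},
}


def decode_msd(msd):
    """Expand a positional descriptor into attribute-value pairs."""
    if len(msd) != len(NOUN_POSITIONS):
        raise ValueError(
            f"expected {len(NOUN_POSITIONS)} characters, got {len(msd)}"
        )
    features = {}
    for attribute, code in zip(NOUN_POSITIONS, msd):
        try:
            features[attribute] = NOUN_CODES[attribute][code]
        except KeyError:
            raise ValueError(f"unknown code {code!r} for {attribute}") from None
    return features


print(decode_msd("ncmsan"))
```

Note that the same letter can carry a different value depending on its position (here ''n'' stands for ''noun'' in the first slot and ''no'' in the last), which is why such descriptors are decoded by position rather than by character.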

Multilingual annotation schemes

For Western European languages, cross-linguistically applicable annotation schemes for parts of speech, morphosyntax and syntax have been developed with the EAGLES Guidelines. The "Expert Advisory Group on Language Engineering Standards" (EAGLES) was an initiative of the European Commission that ran within the DG XIII Linguistic Research and Engineering programme from 1994 to 1998, coordinated by Consorzio Pisa Ricerche, Pisa, Italy. The EAGLES guidelines provide guidance for markup to be used with text corpora

, particularly for identifying features relevant in

and

. Numerous companies, research centres, universities and professional bodies across the European Union collaborated to produce the EAGLES Guidelines, which set out recommendations for ''de facto'' standards and rules of best practice for:
* Large-scale language resources (such as text corpora, computational lexicons and speech corpora);
* Means of manipulating such knowledge, via computational linguistic formalisms, markup languages and various software tools;
* Means of assessing and evaluating resources, tools and products.

The EAGLES guidelines have also inspired subsequent work in other regions, e.g., Eastern Europe. A generation later, a similar effort was initiated by the research community under the umbrella of Universal Dependencies. Petrov et al. have proposed a "universal", but highly reductionist, tag set with 12 categories (for example, no subtypes of nouns, verbs, punctuation, etc.; no distinction of "to" as an infinitive marker vs. preposition (hardly a "universal" coincidence), etc.). Subsequently, this was complemented with cross-lingual specifications for dependency syntax (Stanford Dependencies) and morphosyntax (Interset interlingua, partially building on the Multext-East/EAGLES tradition) in the context of Universal Dependencies (UD), an international cooperative project to create treebanks of the world's languages with cross-linguistically applicable ("universal") annotations for parts of speech, dependency syntax, and (optionally) morphosyntactic (morphological) features. Core applications are automated text processing in the field of natural language processing (NLP) and research into natural language syntax and grammar, especially within linguistic typology.

The annotation scheme has its roots in the three related projects mentioned above: the universal tag set of Petrov et al., Stanford Dependencies, and Interset. The UD annotation scheme uses a representation in the form of dependency trees as opposed to phrase structure trees. As of February 2019, there were just over 100 treebanks of more than 70 languages available in the UD inventory. The project's primary aim is to achieve cross-linguistic consistency of annotation. However, language-specific extensions are permitted for morphological features (individual languages or resources can introduce additional features). In a more restricted form, dependency relations can be extended with a secondary label that accompanies the UD label, e.g., ''aux:pass'' for an auxiliary (UD ''aux'') used to mark passive voice.

Universal Dependencies has inspired similar efforts for the areas of inflectional morphology, frame semantics and coreference. For phrase structure syntax, a comparable effort does not seem to exist, but the specifications of the Penn Treebank have been applied to (and extended for) a broad range of languages, e.g., Icelandic, Old English, Middle English, Middle Low German, Early Modern High German, Yiddish, Portuguese, Japanese, Arabic and Chinese.
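The secondary-label convention for dependency relations (a universal label, a colon, then an optional extension, as in ''aux:pass'') lends itself to a simple illustration. The sketch below assumes only that convention; the helper names `split_relation` and `same_universal_relation` are made up for this example and do not belong to any UD tool.

```python
def split_relation(label):
    """Split a dependency label into (universal relation, subtype or None)."""
    universal, _, subtype = label.partition(":")
    return universal, (subtype or None)


def same_universal_relation(label_a, label_b):
    """Compare two labels while ignoring any language-specific subtype."""
    return split_relation(label_a)[0] == split_relation(label_b)[0]


print(split_relation("aux:pass"))                  # ('aux', 'pass')
print(split_relation("aux"))                       # ('aux', None)
print(same_universal_relation("aux:pass", "aux"))  # True
```

Comparing only the first component is one way a cross-linguistic query could stay consistent across treebanks while individual resources keep their finer distinctions.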

Conventions for interlinear glosses

In linguistics, an interlinear gloss is a gloss (series of brief explanations, such as definitions or pronunciations) placed between lines (''inter-'' + ''linear''), such as between a line of original text and its translation into another language. When glossed, each line of the original text acquires one or more lines of transcription known as an ''interlinear text'' or ''interlinear glossed text'' (''IGT''), or ''interlinear'' for short. Such glosses help the reader follow the relationship between the source text and its translation, and the structure of the original language. There is no standard inventory for glosses, but common labels are collected in the Leipzig Glossing Rules (Comrie, B., Haspelmath, M., & Bickel, B. (2008), ''The Leipzig Glossing Rules: Conventions for interlinear morpheme-by-morpheme glosses'', Department of Linguistics of the Max Planck Institute for Evolutionary Anthropology & the Department of Linguistics of the University of Leipzig, retrieved January 28, 2010). Wikipedia also provides a List of glossing abbreviations that draws on this and other sources.
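The word-by-word alignment that an interlinear gloss relies on can be sketched in a few lines. This is a minimal illustration, assuming whitespace-separated words and one gloss per word; the Latin sentence and its glosses are an example constructed for this sketch, and the function names `align_gloss` and `render` are hypothetical.

```python
def align_gloss(source_line, gloss_line):
    """Pair each source word with its gloss; both lines must match in length."""
    words = source_line.split()
    glosses = gloss_line.split()
    if len(words) != len(glosses):
        raise ValueError("source and gloss lines must have the same number of words")
    return list(zip(words, glosses))


def render(source_line, gloss_line, translation):
    """Lay out source, gloss and free translation as three aligned lines."""
    pairs = align_gloss(source_line, gloss_line)
    # Each column is as wide as the longer of the word and its gloss.
    widths = [max(len(word), len(gloss)) for word, gloss in pairs]
    top = "  ".join(w.ljust(n) for (w, _), n in zip(pairs, widths)).rstrip()
    bottom = "  ".join(g.ljust(n) for (_, g), n in zip(pairs, widths)).rstrip()
    return "\n".join([top, bottom, f"'{translation}'"])


print(render(
    "puella rosam amat",
    "girl.NOM.SG rose.ACC.SG love.3SG.PRS",
    "The girl loves the rose.",
))
```

The category labels in the gloss line (NOM, ACC, SG, 3, PRS) follow the kind of abbreviations collected in glossing conventions; the strict one-to-one check reflects the idea that every word in the source line receives exactly one gloss.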

General Ontology for Linguistic Description (GOLD)

GOLD ("General Ontology for Linguistic Description") is an ontology for descriptive linguistics. It gives a formalized account of the most basic categories and relations used in the scientific description of human language, e.g., as a formalization of interlinear glosses. GOLD was first introduced by Farrar and Langendoen (2003). Originally, it was envisioned as a solution to the problem of resolving disparate markup schemes for linguistic data, in particular data from endangered languages. However, GOLD is much more general and can be applied to all languages. In this function, GOLD overlaps with the ISO 12620 Data Category Registry (ISOcat); it is, however, more stringently structured.

GOLD was maintained by the LINGUIST List and others from 2007 to 2010. The RELISH project created a mirror of the 2010 edition of GOLD as a Data Category Selection within ISOcat. As of 2018, GOLD data remained an important terminology hub in the context of the Linguistic Linked Open Data cloud, but as it is no longer actively maintained, its function is increasingly replaced by OLiA (for linguistic annotation, building on GOLD and ISOcat) and lexinfo.net (for dictionary metadata, building on ISOcat).

ISO 12620 (ISO TC37 Data Category Registry, ISOcat)

ISO 12620 is a standard from ISO/TC 37 that defines a ''Data Category Registry'', a registry for registering linguistic terms used in various fields of

and

and defining mappings both between different terms and between different systems in which the same terms are used. An earlier implementation of this standard, ISOcat, provides persistent identifiers and URIs for linguistic categories, including the inventory of the GOLD ontology (see above). The goal of the registry is that new systems can reuse existing terminology, or at least be easily mapped to existing terminology, to aid interoperability. The standard is used by other standards such as the Lexical Markup Framework (ISO 24613:2008), and a number of terminologies have been added to the registry, including the EAGLES guidelines, the National Corpus of Polish, and the TermBase eXchange format from the Localization Industry Standards Association.

However, the 2019 edition, ISO 12620:2019, no longer provides a registry of terms for language technology and is now restricted to terminology resources, hence the revised title "Management of terminology resources – Data category specifications". Accordingly, ISOcat is no longer actively developed. The successor systems CLARIN Concept Registry and DatCatInfo were emerging.

For linguistic categories relevant to lexical resources, the ''lexinfo'' vocabulary represents an established community standard, in particular in connection with the OntoLex vocabulary and machine-readable dictionaries in the context of Linguistic Linked Open Data technologies. Just as the OntoLex vocabulary builds on the Lexical Markup Framework (LMF), lexinfo builds on (the LMF section of) ISOcat (Cimiano, P., Chiarcos, C., McCrae, J. P., & Gracia, J. (2020), ''Linguistic Linked Data'', pp. 137–160, Springer, Cham). Unlike ISOcat, however, lexinfo is actively maintained and, as of May 2020, was being extended in a community effort.

Ontologies of Linguistic Annotation (OLiA)

Similar in spirit to GOLD, the Ontologies of Linguistic Annotation (OLiA) provide a reference inventory of linguistic categories for syntactic, morphological and semantic phenomena relevant for linguistic annotation and linguistic corpora in the form of an ontology. In addition, they provide machine-readable annotation schemes for more than 100 languages, linked with the OLiA reference model. The OLiA ontologies represent a major hub of annotation terminology in the (Linguistic) Linked Open Data cloud, with applications for search, retrieval and machine learning over heterogeneously annotated language resources.

In addition to annotation schemes, the OLiA Reference Model is also linked with the EAGLES Guidelines (Chiarcos, C. (2008), An ontology of linguistic annotations, in ''LDV Forum'', Vol. 23, No. 1, pp. 1-16), GOLD, ISOcat, CLARIN Concept Registry, Universal Dependencies (Christian Chiarcos, Maxim Ionov and Christian Fäth (2020), Annotation interoperability in the post-ISOcat era, LREC 2020), lexinfo, etc., and the ontologies thereby enable interoperability between these vocabularies. OLiA is being developed as a community project on GitHub.
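The idea of linking several annotation schemes to one shared reference inventory can be illustrated with a toy lookup. The sketch below is an assumption-laden simplification: the reference category names (`CommonNoun`, `ProperNoun`), the table `REFERENCE_LINKS` and the helper functions are hypothetical placeholders written for this example, not actual OLiA identifiers or APIs; only the tag names themselves come from the Penn, Brown and universal tag sets.

```python
# Hypothetical links from scheme-specific tags to shared reference categories.
REFERENCE_LINKS = {
    "penn": {"NN": "CommonNoun", "NNS": "CommonNoun", "NNP": "ProperNoun"},
    "brown": {"NN": "CommonNoun", "NNS": "CommonNoun", "NP": "ProperNoun"},
    "ud": {"NOUN": "CommonNoun", "PROPN": "ProperNoun"},
}


def to_reference(scheme, tag):
    """Look up the shared reference category a scheme-specific tag is linked to."""
    return REFERENCE_LINKS[scheme][tag]


def equivalent_tags(scheme, tag, target_scheme):
    """Find tags in another scheme that are linked to the same reference category."""
    concept = to_reference(scheme, tag)
    return sorted(
        candidate
        for candidate, linked in REFERENCE_LINKS[target_scheme].items()
        if linked == concept
    )


print(equivalent_tags("brown", "NP", "penn"))  # ['NNP']
print(equivalent_tags("ud", "NOUN", "penn"))   # ['NN', 'NNS']
```

Because every scheme is linked only to the shared inventory, adding a further scheme needs one new table rather than a pairwise mapping to each existing scheme, which is the practical appeal of a reference model.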

External links

* Leipzig Glossing Rules
* ISOcat
* DatCatInfo Data Category Repository (DCR)