Linguistic categories include:

* Lexical category, a part of speech such as ''noun'', ''preposition'', etc.
* Syntactic category, a similar concept which can also include phrasal categories
* Grammatical category, a grammatical feature such as ''tense'', ''gender'', etc.

The definition of linguistic categories is a major concern of linguistic theory, and thus the definition and naming of categories vary across theoretical frameworks and across the grammatical traditions of different languages. The operationalization of linguistic categories in lexicography, computational linguistics, natural language processing, corpus linguistics, and terminology management typically requires resource-, problem- or application-specific definitions of linguistic categories. In cognitive linguistics, it has been argued that linguistic categories have a prototype structure like that of the categories of common words in a language (John R. Taylor (1995), ''Linguistic Categorization: Prototypes in Linguistic Theory'', 2nd ed., ch. 2, p. 21).

Linguistic category inventories

To facilitate interoperability between lexical resources, linguistic annotations and annotation tools, and to handle linguistic categories systematically across different theoretical frameworks, a number of inventories of linguistic categories have been developed and are in use; examples are given below. The practical objective of such inventories is to perform quantitative evaluation (for language-specific inventories), to train NLP tools, or to facilitate cross-linguistic evaluation, querying or annotation of language data. At a theoretical level, the existence of universal categories in human language has been postulated, e.g., in universal grammar, but this claim has also been heavily criticized.

Part-of-Speech tagsets

Schools commonly teach that there are nine parts of speech in English: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection. However, there are clearly many more categories and sub-categories. For nouns, the plural, possessive, and singular forms can be distinguished. In many languages words are also marked for their case (role as subject, object, etc.), grammatical gender, and so on, while verbs are marked for tense, aspect, and other properties. In some tagging systems, different inflections of the same root word get different parts of speech, resulting in a large number of tags: for example, NN for singular common nouns, NNS for plural common nouns, and NP for singular proper nouns (see the POS tags used in the Brown Corpus). Other tagging systems use a smaller number of tags and either ignore fine distinctions or model them as features somewhat independent of part of speech (see Universal POS tags).
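The contrast between fine-grained tags and a coarse tag plus separate features can be sketched in a few lines of Python. The three fine-grained tags are the ones named above; the coarse labels, the feature names and the function names are assumptions chosen for illustration, not an official conversion table.

```python
# Illustrative mapping only: the fine-grained tags (NN, NNS, NP) come from the
# text above; the coarse labels and feature names are assumptions chosen for
# this sketch, not an official conversion table.
FINE_TO_COARSE = {
    "NN": ("NOUN", {"Number": "Sing"}),
    "NNS": ("NOUN", {"Number": "Plur"}),
    "NP": ("PROPN", {"Number": "Sing"}),
}


def decompose(fine_tag):
    """Split a fine-grained tag into a coarse category and a feature dict."""
    coarse, features = FINE_TO_COARSE[fine_tag]
    return coarse, dict(features)  # copy so callers cannot mutate the table


def coarse_only(fine_tag):
    """Ignore the fine distinctions entirely and keep the coarse category."""
    return decompose(fine_tag)[0]
```

With such a table, a tool that only needs the coarse category can call `coarse_only`, while a tool that needs number information reads it from the feature dictionary instead of from the tag name.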
In part-of-speech tagging by computer, it is typical to distinguish from 50 to 150 separate parts of speech for English. POS tagging work has been done in a variety of languages, and the set of POS tags used varies greatly with language. Tags are usually designed to include overt morphological distinctions, although this leads to inconsistencies such as case-marking for pronouns but not nouns in English, and to much larger cross-language differences. The tag sets for heavily inflected languages such as Greek and Latin can be very large; tagging ''words'' in agglutinative languages such as the Inuit languages may be virtually impossible. Work on stochastic methods for tagging Koine Greek (DeRose 1990) used over 1,000 parts of speech and found that about as many words were ambiguous in that language as in English.

For morphologically rich languages, a morphosyntactic descriptor is commonly expressed using very short mnemonics, such as ''Ncmsan'' for Category = Noun, Type = common, Gender = masculine, Number = singular, Case = accusative, Animate = no. The most popular tag set for POS tagging of American English is probably the Penn tag set, developed in the Penn Treebank project.
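A positional mnemonic such as ''Ncmsan'' can be decoded by reading one character per attribute. The following Python sketch covers only the attribute order and value codes that appear in that single example; real tagsets define many more positions and codes, and the names `decode_msd`, `NOUN_ATTRIBUTES` and `VALUE_CODES` are hypothetical.

```python
# Attribute order and value codes are taken from the single Ncmsan example
# in the text. Real tagsets define many more positions and codes per language;
# anything not listed here is returned undecoded.
NOUN_ATTRIBUTES = ["Type", "Gender", "Number", "Case", "Animate"]
CATEGORY_CODES = {"N": "Noun"}
VALUE_CODES = {
    "Type": {"c": "common"},
    "Gender": {"m": "masculine"},
    "Number": {"s": "singular"},
    "Case": {"a": "accusative"},
    "Animate": {"n": "no"},
}


def decode_msd(msd):
    """Decode a positional morphosyntactic descriptor into a feature dict."""
    if not msd:
        raise ValueError("empty descriptor")
    decoded = {"Category": CATEGORY_CODES.get(msd[0], msd[0])}
    # Each following character fills the attribute at the same position.
    for attribute, code in zip(NOUN_ATTRIBUTES, msd[1:]):
        decoded[attribute] = VALUE_CODES[attribute].get(code, code)
    return decoded
```

Because the meaning of a character depends on its position, the same letter can stand for different values in different slots (here, `n` means "no" only in the Animate position).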

Multilingual annotation schemes

For Western European languages, cross-linguistically applicable annotation schemes for parts of speech, morphosyntax and syntax were developed in the EAGLES Guidelines. The "Expert Advisory Group on Language Engineering Standards" (EAGLES) was an initiative of the European Commission that ran within the DG XIII Linguistic Research and Engineering programme from 1994 to 1998, coordinated by Consorzio Pisa Ricerche, Pisa, Italy. The EAGLES Guidelines provide guidance for markup to be used with text corpora, particularly for identifying features relevant in computational linguistics and lexicography. Numerous companies, research centres, universities and professional bodies across the European Union collaborated to produce the EAGLES Guidelines, which set out recommendations for ''de facto'' standards and rules of best practice for:

* Large-scale language resources (such as text corpora, computational lexicons and speech corpora);
* Means of manipulating such knowledge, via computational linguistic formalisms, markup languages and various software tools;
* Means of assessing and evaluating resources, tools and products.

The EAGLES Guidelines have also inspired subsequent work on other regions, e.g., Eastern Europe.

A generation later, a similar effort was initiated by the research community under the umbrella of Universal Dependencies. Petrov et al. proposed a "universal", but highly reductionist, tag set with 12 categories (for example, no subtypes of nouns, verbs, punctuation, etc.; no distinction of "to" as an infinitive marker vs. preposition (hardly a "universal" coincidence), etc.). Subsequently, this was complemented with cross-lingual specifications for dependency syntax (Stanford Dependencies) and morphosyntax (Interset interlingua, partially building on the Multext-East/EAGLES tradition) in the context of Universal Dependencies (UD), an international cooperative project to create treebanks of the world's languages with cross-linguistically applicable ("universal") annotations for parts of speech, dependency syntax, and (optionally) morphosyntactic (morphological) features. Core applications are automated text processing in the field of natural language processing (NLP) and research into natural language syntax and grammar, especially within linguistic typology.

The UD annotation scheme has its roots in the three projects named above (the universal tag set of Petrov et al., Stanford Dependencies, and Interset). It represents syntax as dependency trees rather than phrase structure trees. As of February 2019, there were just over 100 treebanks of more than 70 languages available in the UD inventory. The project's primary aim is to achieve cross-linguistic consistency of annotation. However, language-specific extensions are permitted for morphological features (individual languages or resources can introduce additional features). In a more restricted form, dependency relations can be extended with a secondary label that accompanies the UD label, e.g., ''aux:pass'' for an auxiliary (UD ''aux'') used to mark passive voice.

Universal Dependencies has inspired similar efforts in the areas of inflectional morphology, frame semantics and coreference. For phrase structure syntax, a comparable effort does not seem to exist, but the specifications of the Penn Treebank have been applied to (and extended for) a broad range of languages, e.g., Icelandic, Old English, Middle English, Middle Low German, Early Modern High German, Yiddish, Portuguese, Japanese, Arabic and Chinese.
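The secondary-label convention for UD dependency relations (as in ''aux:pass'') lends itself to a small illustration. The Python sketch below assumes only that the universal label and the extension are separated by a colon, as in that example; the function names are hypothetical.

```python
def split_relation(label):
    """Split a dependency label into (universal relation, optional subtype)."""
    universal, _, subtype = label.partition(":")
    return universal, subtype or None


def same_universal_relation(label_a, label_b):
    """Compare two labels while ignoring any secondary, extension label."""
    return split_relation(label_a)[0] == split_relation(label_b)[0]
```

Comparing only the first component is one way a cross-linguistic query could treat ''aux:pass'' and plain ''aux'' as the same universal relation while language-specific tools still see the extension.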

Conventions for interlinear glosses

In linguistics, an interlinear gloss is a gloss (a series of brief explanations, such as definitions or pronunciations) placed between lines (''inter-'' + ''linear''), such as between a line of original text and its translation into another language. When glossed, each line of the original text acquires one or more lines of transcription known as an interlinear text or interlinear glossed text (IGT), or interlinear for short. Such glosses help the reader follow the relationship between the source text and its translation, and the structure of the original language. There is no standard inventory for glosses, but common labels are collected in the Leipzig Glossing Rules (Comrie, B., Haspelmath, M., & Bickel, B. (2008), ''The Leipzig Glossing Rules: Conventions for interlinear morpheme-by-morpheme glosses'', Department of Linguistics of the Max Planck Institute for Evolutionary Anthropology & Department of Linguistics of the University of Leipzig; retrieved January 28, 2010). Wikipedia also provides a list of glossing abbreviations that draws on this and other sources.
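The word-by-word and morpheme-by-morpheme correspondence between a text line and its gloss line can be made concrete with a short Python sketch. It assumes whitespace-separated words and hyphen-separated morphemes, lists only a handful of common abbreviations, and does not handle glosses that combine several labels with periods; the function name `align_gloss` is hypothetical.

```python
# A few abbreviations from the Leipzig Glossing Rules; the full list is longer.
ABBREVIATIONS = {
    "SG": "singular",
    "PL": "plural",
    "NOM": "nominative",
    "ACC": "accusative",
    "PST": "past",
}


def align_gloss(text_line, gloss_line):
    """Pair each morpheme of the text with the gloss at the same position."""
    words, glosses = text_line.split(), gloss_line.split()
    if len(words) != len(glosses):
        raise ValueError("text and gloss lines must have the same number of words")
    aligned = []
    for word, gloss in zip(words, glosses):
        morphemes, labels = word.split("-"), gloss.split("-")
        if len(morphemes) != len(labels):
            raise ValueError(f"morpheme count mismatch in {word!r} / {gloss!r}")
        # Expand known category labels; lexical glosses are kept as written.
        aligned.append(
            [(m, ABBREVIATIONS.get(g, g)) for m, g in zip(morphemes, labels)]
        )
    return aligned
```

For instance, the text line `dog-s bark-ed` with the gloss line `dog-PL bark-PST` yields one list of (morpheme, gloss) pairs per word, with `PL` and `PST` expanded to "plural" and "past".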

General Ontology for Linguistic Description (GOLD)

GOLD ("General Ontology for Linguistic Description") is an ontology for descriptive linguistics. It gives a formalized account of the most basic categories and relations used in the scientific description of human language, e.g., as a formalization of interlinear glosses. GOLD was first introduced by Farrar and Langendoen (2003). Originally, it was envisioned as a solution to the problem of resolving disparate markup schemes for linguistic data, in particular data from endangered languages. However, GOLD is much more general and can be applied to all languages. In this function, GOLD overlaps with the ISO 12620 Data Category Registry (ISOcat); it is, however, more stringently structured. GOLD was maintained by the LINGUIST List and others from 2007 to 2010. The RELISH project created a mirror of the 2010 edition of GOLD as a Data Category Selection within ISOcat. As of 2018, GOLD data remained an important terminology hub in the context of the Linguistic Linked Open Data cloud, but because it is no longer actively maintained, its function is increasingly taken over by OLiA (for linguistic annotation, building on GOLD and ISOcat) and lexinfo.net (for dictionary metadata, building on ISOcat).

ISO 12620 (ISO TC37 Data Category Registry, ISOcat)

ISO 12620 is a standard from ISO/TC 37 that defines a ''Data Category Registry'', a registry for registering linguistic terms used in various fields and for defining mappings both between different terms and between the same terms used in different systems. An earlier implementation of this standard, ISOcat, provides persistent identifiers and URIs for linguistic categories, including the inventory of the GOLD ontology (see above). The goal of the registry is that new systems can reuse existing terminology, or at least be easily mapped to existing terminology, to aid interoperability. The standard is used by other standards such as the Lexical Markup Framework (ISO 24613:2008), and a number of terminologies have been added to the registry, including the EAGLES Guidelines, the National Corpus of Polish, and the TermBase eXchange format from the Localization Industry Standards Association.

However, the current edition, ISO 12620:2019, no longer provides a registry of terms for language technology and terminology; it is now restricted to terminology resources, hence the revised title "Management of terminology resources — Data category specifications". Accordingly, ISOcat is no longer actively developed. As of May 2020, its successor systems, the CLARIN Concept Registry and DatCatInfo, were only emerging. For linguistic categories relevant to lexical resources, the ''lexinfo'' vocabulary represents an established community standard, in particular in connection with the OntoLex vocabulary and machine-readable dictionaries in the context of Linguistic Linked Open Data technologies. Just as the OntoLex vocabulary builds on the Lexical Markup Framework (LMF), lexinfo builds on (the LMF section of) ISOcat (Cimiano, P., Chiarcos, C., McCrae, J. P., & Gracia, J. (2020), ''Linguistic Linked Data'', pp. 137-160, Springer, Cham). Unlike ISOcat, however, lexinfo is actively maintained and, as of May 2020, was being extended in a community effort.

Ontologies of Linguistic Annotation (OLiA)

Similar in spirit to GOLD, the Ontologies of Linguistic Annotation (OLiA) provide a reference inventory of linguistic categories for syntactic, morphological and semantic phenomena relevant for linguistic annotation and linguistic corpora in the form of an ontology. In addition, they provide machine-readable annotation schemes for more than 100 languages, linked with the OLiA Reference Model. The OLiA ontologies represent a major hub of annotation terminology in the (Linguistic) Linked Open Data cloud, with applications for search, retrieval and machine learning over heterogeneously annotated language resources. In addition to annotation schemes, the OLiA Reference Model is also linked with the EAGLES Guidelines (Chiarcos, C. (2008), "An ontology of linguistic annotations", in ''LDV Forum'', Vol. 23, No. 1, pp. 1-16), GOLD, ISOcat, the CLARIN Concept Registry, Universal Dependencies (Christian Chiarcos, Maxim Ionov and Christian Fäth (2020), "Annotation interoperability in the post-ISOcat era", LREC 2020), lexinfo, etc. The OLiA ontologies thus enable interoperability between these vocabularies. OLiA is being developed as a community project on GitHub.
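The idea of linking several annotation schemes to one reference model, and then searching across heterogeneously annotated resources, can be illustrated with a toy Python sketch. Every scheme name, tag and concept label below is a placeholder invented for this example; none of them is an actual OLiA identifier, and OLiA itself is expressed as ontologies rather than Python dictionaries.

```python
# All labels below are placeholders for this sketch: two annotation schemes
# ("scheme_a", "scheme_b") use different tags for the same categories, and each
# tag is linked to a shared reference concept. These are not OLiA identifiers.
SCHEME_LINKS = {
    "scheme_a": {"NN": "CommonNoun", "NNS": "CommonNoun", "VB": "Verb"},
    "scheme_b": {"SUBST": "CommonNoun", "V": "Verb"},
}


def to_reference(scheme, tag):
    """Resolve a scheme-specific tag to its reference concept (None if unlinked)."""
    return SCHEME_LINKS.get(scheme, {}).get(tag)


def search(corpora, concept):
    """Find tokens of a reference concept across differently annotated corpora.

    `corpora` maps a scheme name to a list of (token, tag) pairs.
    """
    hits = []
    for scheme, tokens in corpora.items():
        for token, tag in tokens:
            if to_reference(scheme, tag) == concept:
                hits.append((scheme, token))
    return hits
```

The query is phrased once, in terms of the reference concept, and each resource is interpreted through its own links; adding a further scheme only requires a new entry in the link table.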

External links

* Leipzig Glossing Rules
* ISOcat
* DatCatInfo Data Category Repository (DCR)