In corpus linguistics, part-of-speech tagging (POS tagging, PoS tagging, or POST), also called grammatical tagging, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of speech, with a set of descriptive tags. POS-tagging algorithms fall into two distinct groups: rule-based and stochastic. E. Brill's tagger, one of the first and most widely used English POS taggers, employs rule-based algorithms.

Principle

Part-of-speech tagging is harder than just having a list of words and their parts of speech, because some words can represent more than one part of speech at different times, and because some parts of speech are complex. This is not rare: in natural languages (as opposed to many artificial languages), a large percentage of word-forms are ambiguous. For example, even "dogs", which is usually thought of as just a plural noun, can also be a verb:

    The sailor dogs the hatch.

Correct grammatical tagging will reflect that "dogs" is here used as a verb, not as the more common plural noun. Grammatical context is one way to determine this; semantic analysis can also be used to infer that "sailor" and "hatch" implicate "dogs" as 1) in the nautical context and 2) an action applied to the object "hatch" (in this context, "dogs" is a nautical term meaning "fastens (a watertight door) securely").
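The limitation of a plain word list can be illustrated with a minimal sketch. The lexicon below is a hypothetical toy table (its entries and the coarse tag names DET, NOUN, VERB are assumptions for illustration, not drawn from any real tag set); it shows that a lookup alone returns several candidates for "dogs" and "hatch" and cannot choose between them.

```python
# Hypothetical toy lexicon: each word form maps to every tag it may carry.
LEXICON = {
    "the": {"DET"},
    "sailor": {"NOUN"},
    "dogs": {"NOUN", "VERB"},   # plural noun, or the nautical verb
    "hatch": {"NOUN", "VERB"},
}


def candidate_tags(sentence):
    """Return (word, sorted list of possible tags) for each token."""
    return [
        (word, sorted(LEXICON.get(word.lower(), {"UNKNOWN"})))
        for word in sentence.split()
    ]


def ambiguous_words(sentence):
    """Words for which a bare lexicon lookup cannot pick a single tag."""
    return [word for word, tags in candidate_tags(sentence) if len(tags) > 1]


for word, tags in candidate_tags("The sailor dogs the hatch"):
    print(word, tags)
```

Resolving the remaining candidates is the job of the context-sensitive methods described in the History section.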

Tag sets

Schools commonly teach that there are 9 parts of speech in English: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection. However, there are clearly many more categories and sub-categories. For nouns, the plural, possessive, and singular forms can be distinguished. In many languages words are also marked for their "case" (role as subject, object, etc.), grammatical gender, and so on; while verbs are marked for tense, aspect, and other things. In some tagging systems, different inflections of the same root word will get different parts of speech, resulting in a large number of tags. The Brown Corpus tag set, for example, uses NN for singular common nouns, NNS for plural common nouns, and NP for singular proper nouns (see the POS tags used in the Brown Corpus). Other tagging systems use a smaller number of tags and ignore fine differences or model them as features somewhat independent from part-of-speech.
In part-of-speech tagging by computer, it is typical to distinguish from 50 to 150 separate parts of speech for English. Work on stochastic methods for tagging Koine Greek (DeRose 1990) has used over 1,000 parts of speech and found that about as many words were ambiguous in that language as in English. A morphosyntactic descriptor in the case of morphologically rich languages is commonly expressed using very short mnemonics, such as ''Ncmsan'' for Category = Noun, Type = common, Gender = masculine, Number = singular, Case = accusative, Animate = no.

The most popular "tag set" for POS tagging for American English is probably the Penn tag set, developed in the Penn Treebank project. It is largely similar to the earlier Brown Corpus and LOB Corpus tag sets, though much smaller. In Europe, tag sets from the Eagles Guidelines see wide use and include versions for multiple languages.

POS tagging work has been done in a variety of languages, and the set of POS tags used varies greatly with language. Tags usually are designed to include overt morphological distinctions, although this leads to inconsistencies such as case-marking for pronouns but not nouns in English, and much larger cross-language differences. The tag sets for heavily inflected languages such as Greek and Latin can be very large; tagging ''words'' in agglutinative languages such as Inuit languages may be virtually impossible. At the other extreme, Petrov et al. have proposed a "universal" tag set, with 12 categories (for example, no subtypes of nouns, verbs, punctuation, and so on). Whether a very small set of very broad tags or a much larger set of more precise ones is preferable depends on the purpose at hand. Automatic tagging is easier on smaller tag-sets.
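The positional mnemonic ''Ncmsan'' mentioned above can be unpacked mechanically. The following is a minimal sketch, assuming that each character position after the category letter encodes one feature in a fixed order; the tables contain only the codes needed for that one example, and the names `NOUN_POSITIONS`, `CATEGORIES`, and `decode_msd` are hypothetical. Real descriptor schemes define many more categories and values.

```python
# Positional feature tables for a toy noun descriptor. Only the codes that
# occur in the example "Ncmsan" are filled in (illustrative assumption).
NOUN_POSITIONS = [
    ("Type", {"c": "common"}),
    ("Gender", {"m": "masculine"}),
    ("Number", {"s": "singular"}),
    ("Case", {"a": "accusative"}),
    ("Animate", {"n": "no"}),
]
CATEGORIES = {"N": ("Noun", NOUN_POSITIONS)}


def decode_msd(msd):
    """Expand a positional mnemonic into a feature dictionary."""
    if not msd or msd[0] not in CATEGORIES:
        raise ValueError(f"unknown category in {msd!r}")
    name, positions = CATEGORIES[msd[0]]
    features = {"Category": name}
    # Pair each remaining character with the feature defined for its position.
    for code, (feature, values) in zip(msd[1:], positions):
        features[feature] = values.get(code, f"unknown code {code!r}")
    return features


print(decode_msd("Ncmsan"))
```

Keeping the features separate in this way is one route to treating fine distinctions as attributes rather than as distinct parts of speech.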

History

The Brown Corpus

Research on part-of-speech tagging has been closely tied to corpus linguistics. The first major corpus of English for computer analysis was the Brown Corpus developed at Brown University by Henry Kučera and W. Nelson Francis, in the mid-1960s. It consists of about 1,000,000 words of running English prose text, made up of 500 samples from randomly chosen publications. Each sample is 2,000 or more words (ending at the first sentence-end after 2,000 words, so that the corpus contains only complete sentences).

The Brown Corpus was painstakingly "tagged" with part-of-speech markers over many years. A first approximation was done with a program by Greene and Rubin, which consisted of a huge handmade list of what categories could co-occur at all. For example, article then noun can occur, but article then verb (arguably) cannot. The program got about 70% correct. Its results were repeatedly reviewed and corrected by hand, and later users sent in errata so that by the late 70s the tagging was nearly perfect (allowing for some cases on which even human speakers might not agree).

This corpus has been used for innumerable studies of word-frequency and of part-of-speech and inspired the development of similar "tagged" corpora in many other languages. Statistics derived by analyzing it formed the basis for most later part-of-speech tagging systems, such as CLAWS and VOLSUNGA. However, by 2005 it had been superseded by larger corpora such as the 100 million word British National Corpus, even though larger corpora are rarely so thoroughly curated.

For some time, part-of-speech tagging was considered an inseparable part of natural language processing, because there are certain cases where the correct part of speech cannot be decided without understanding the semantics or even the pragmatics of the context. This is extremely expensive, especially because analyzing the higher levels is much harder when multiple part-of-speech possibilities must be considered for each word.

Use of hidden Markov models

In the mid-1980s, researchers in Europe began to use hidden Markov models (HMMs) to disambiguate parts of speech, when working to tag the Lancaster-Oslo-Bergen Corpus of British English. HMMs involve counting cases (such as from the Brown Corpus) and making a table of the probabilities of certain sequences. For example, once you've seen an article such as 'the', perhaps the next word is a noun 40% of the time, an adjective 40%, and a number 20%. Knowing this, a program can decide that "can" in "the can" is far more likely to be a noun than a verb or a modal. The same method can, of course, be used to benefit from knowledge about the following words.

More advanced ("higher-order") HMMs learn the probabilities not only of pairs but triples or even larger sequences. So, for example, if you've just seen a noun followed by a verb, the next item may be very likely a preposition, article, or noun, but much less likely another verb.

When several ambiguous words occur together, the possibilities multiply. However, it is easy to enumerate every combination and to assign a relative probability to each one, by multiplying together the probabilities of each choice in turn. The combination with the highest probability is then chosen. The European group developed CLAWS, a tagging program that did exactly this and achieved accuracy in the 93–95% range.
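The enumerate-and-multiply procedure described above can be sketched in a few lines. This is an illustration only, not the CLAWS implementation: the probability tables are invented toy numbers, the tag names are simplified, and the word-given-tag (`EMISSION`) table is the usual HMM ingredient assumed here alongside the tag-pair (`TRANSITION`) table.

```python
from itertools import product

# Hypothetical toy probabilities, not estimated from any corpus.
# TRANSITION[(previous_tag, tag)]: chance of `tag` following `previous_tag`.
TRANSITION = {
    ("<s>", "DET"): 0.8,
    ("DET", "NOUN"): 0.4, ("DET", "VERB"): 0.01, ("DET", "MODAL"): 0.001,
    ("NOUN", "VERB"): 0.5, ("NOUN", "NOUN"): 0.2,
    ("VERB", "NOUN"): 0.3, ("VERB", "VERB"): 0.05,
    ("MODAL", "VERB"): 0.7, ("MODAL", "NOUN"): 0.05,
}
# EMISSION[(tag, word)]: chance of seeing `word` given `tag`.
EMISSION = {
    ("DET", "the"): 0.6,
    ("NOUN", "can"): 0.01, ("VERB", "can"): 0.005, ("MODAL", "can"): 0.3,
    ("NOUN", "rusts"): 0.001, ("VERB", "rusts"): 0.002,
}
# Tags each word may take, as a lexicon would list them.
CANDIDATES = {
    "the": ["DET"],
    "can": ["NOUN", "VERB", "MODAL"],
    "rusts": ["NOUN", "VERB"],
}


def best_by_enumeration(words, candidates):
    """Score every combination of candidate tags and keep the most probable."""
    best_tags, best_score = None, 0.0
    for tags in product(*(candidates[word] for word in words)):
        score, previous = 1.0, "<s>"
        for word, tag in zip(words, tags):
            score *= TRANSITION.get((previous, tag), 0.0)
            score *= EMISSION.get((tag, word), 0.0)
            previous = tag
        if score > best_score:
            best_tags, best_score = tags, score
    return best_tags, best_score


tags, score = best_by_enumeration(["the", "can", "rusts"], CANDIDATES)
print(tags, score)
```

The number of combinations is the product of the candidate counts for each word, so the cost grows exponentially with the length of an ambiguous run.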

Eugene Charniak points out in ''Statistical techniques for natural language parsing'' (1997) that merely assigning the most common tag to each known word and the tag "proper noun" to all unknowns will approach 90% accuracy because many words are unambiguous, and many others only rarely represent their less-common parts of speech.

CLAWS pioneered the field of HMM-based part of speech tagging but was quite expensive since it enumerated all possibilities. It sometimes had to resort to backup methods when there were simply too many options (the Brown Corpus contains a case with 17 ambiguous words in a row, and there are words such as "still" that can represent as many as 7 distinct parts of speech).

HMMs underlie the functioning of stochastic taggers and are used in various algorithms, one of the most widely used being the bi-directional inference algorithm.

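The most-common-tag baseline that Charniak describes can be sketched as follows. The training sentences are a tiny invented sample using Brown-style tags (AT, NN, NNS, NP, VB, VBZ), and lowercasing the words is a simplifying assumption; the function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Map each known word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0] for word, tag_counts in counts.items()}


def tag_baseline(words, model, unknown_tag="NP"):
    """Tag known words with their most common tag and unknown words as proper nouns."""
    return [(word, model.get(word.lower(), unknown_tag)) for word in words]


# Invented toy training data, far too small to be representative.
TRAIN = [
    [("the", "AT"), ("dogs", "NNS"), ("bark", "VB")],
    [("the", "AT"), ("sailor", "NN"), ("dogs", "VBZ"), ("the", "AT"), ("hatch", "NN")],
    [("dogs", "NNS"), ("sleep", "VB")],
]

model = train_baseline(TRAIN)
print(tag_baseline(["The", "dogs", "saw", "Fido"], model))
```

The baseline ignores context entirely, so it mis-tags "dogs" whenever it is used as a verb; that is precisely the kind of residual error the sequence models in this section are meant to reduce.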
Dynamic programming methods

In 1987, Steven DeRose and Kenneth W. Church independently developed dynamic programming algorithms to solve the same problem in vastly less time. Their methods were similar to the Viterbi algorithm known for some time in other fields. DeRose used a table of pairs, while Church used a table of triples and a method of estimating the values for triples that were rare or nonexistent in the Brown Corpus (an actual measurement of triple probabilities would require a much larger corpus). Both methods achieved an accuracy of over 95%. DeRose's 1990 dissertation at Brown University included analyses of the specific error types, probabilities, and other related data, and replicated his work for Greek, where it proved similarly effective.

These findings were surprisingly disruptive to the field of natural language processing. The accuracy reported was higher than the typical accuracy of very sophisticated algorithms that integrated part of speech choice with many higher levels of linguistic analysis: syntax, morphology, semantics, and so on. CLAWS, DeRose's and Church's methods did fail for some of the known cases where semantics is required, but those proved negligibly rare. This convinced many in the field that part-of-speech tagging could usefully be separated from the other levels of processing; this, in turn, simplified the theory and practice of computerized language analysis and encouraged researchers to find ways to separate other pieces as well. Markov models became the standard method for part-of-speech assignment.
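A textbook Viterbi pass over a table of tag pairs gives a feel for why dynamic programming is so much cheaper than enumerating every combination. The sketch below is a generic bigram Viterbi under invented toy probabilities; it is not a reconstruction of DeRose's or Church's programs, and the tables, tag names, and the `viterbi` function are assumptions for illustration.

```python
def viterbi(words, tags, transition, emission, start="<s>"):
    """Return the most probable tag sequence and its probability (bigram model)."""
    if not words:
        return [], 1.0
    # best[i][tag] = (probability of the best path ending in `tag` at word i,
    #                 the tag chosen for word i - 1 on that path)
    best = [{}]
    for tag in tags:
        p = transition.get((start, tag), 0.0) * emission.get((tag, words[0]), 0.0)
        best[0][tag] = (p, None)
    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            emit = emission.get((tag, words[i]), 0.0)
            best[i][tag] = max(
                ((best[i - 1][prev][0] * transition.get((prev, tag), 0.0) * emit, prev)
                 for prev in tags),
                key=lambda pair: pair[0],
            )
    # Follow the stored back-pointers from the best final tag.
    last = max(tags, key=lambda tag: best[-1][tag][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]


# Hypothetical toy probabilities, not estimated from any corpus.
TAGS = ["DET", "NOUN", "VERB", "MODAL"]
TRANSITION = {
    ("<s>", "DET"): 0.8,
    ("DET", "NOUN"): 0.4, ("DET", "VERB"): 0.01, ("DET", "MODAL"): 0.001,
    ("NOUN", "VERB"): 0.5, ("NOUN", "NOUN"): 0.2,
    ("VERB", "NOUN"): 0.3, ("VERB", "VERB"): 0.05,
    ("MODAL", "VERB"): 0.7, ("MODAL", "NOUN"): 0.05,
}
EMISSION = {
    ("DET", "the"): 0.6,
    ("NOUN", "can"): 0.01, ("VERB", "can"): 0.005, ("MODAL", "can"): 0.3,
    ("NOUN", "rusts"): 0.001, ("VERB", "rusts"): 0.002,
}

print(viterbi(["the", "can", "rusts"], TAGS, TRANSITION, EMISSION))
```

For a sentence of n words and T tags, this pass does on the order of n × T² multiplications, whereas exhaustive enumeration may examine up to T^n sequences.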

Unsupervised taggers

The methods already discussed involve working from a pre-existing corpus to learn tag probabilities. It is, however, also possible to bootstrap using "unsupervised" tagging. Unsupervised tagging techniques use an untagged corpus for their training data and produce the tagset by induction. That is, they observe patterns in word use, and derive part-of-speech categories themselves. For example, statistics readily reveal that "the", "a", and "an" occur in similar contexts, while "eat" occurs in very different ones. With sufficient iteration, similarity classes of words emerge that are remarkably similar to those human linguists would expect; and the differences themselves sometimes suggest valuable new insights.

These two categories, supervised and unsupervised tagging, can each be further subdivided into rule-based, stochastic, and neural approaches.
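The observation that words such as "the" and "a" share contexts can be made concrete with a small sketch. It covers only the first step of induction, comparing words by their immediate left and right neighbours; the four-sentence corpus is invented, and a real system would go on to cluster words over a far larger untagged corpus. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences):
    """Count, for every word, which words appear directly to its left and right."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            vectors[tokens[i]][("L", tokens[i - 1])] += 1
            vectors[tokens[i]][("R", tokens[i + 1])] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


# Invented untagged toy corpus.
CORPUS = [
    "the dog eats a bone",
    "a dog eats the bone",
    "the cat eats a fish",
    "a cat eats the fish",
]

vectors = context_vectors(CORPUS)
print("the ~ a   :", cosine(vectors["the"], vectors["a"]))
print("the ~ eats:", cosine(vectors["the"], vectors["eats"]))
```

In this toy corpus the two articles have near-identical context vectors while "eats" shares none of their contexts, which is the kind of signal a clustering step could use to form word classes.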

Other taggers and methods

Some current major algorithms for part-of-speech tagging include the Brill tagger, Constraint Grammar, and the Baum-Welch algorithm (also known as the forward-backward algorithm). Hidden Markov model and visible Markov model taggers can both be implemented using the Viterbi algorithm. The rule-based Brill tagger is unusual in that it learns a set of rule patterns, and then applies those patterns rather than optimizing a statistical quantity.

Many machine learning methods have also been applied to the problem of POS tagging. Methods such as SVM, maximum entropy classifier, perceptron, and nearest-neighbor have all been tried, and most can achieve accuracy above 95%.

A direct comparison of several methods is reported (with references) at the ACL Wiki. This comparison uses the Penn tag set on some of the Penn Treebank data, so the results are directly comparable. However, many significant taggers are not included (perhaps because of the labor involved in reconfiguring them for this particular dataset). Thus, it should not be assumed that the results reported there are the best that can be achieved with a given approach; nor even the best that ''have'' been achieved with a given approach.

In 2014, a paper reported using the structure regularization method for part-of-speech tagging, achieving 97.36% on a standard benchmark dataset.
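The "apply learned rule patterns" step attributed to the Brill tagger above can be sketched as follows. Only rule application is shown; the learning phase is omitted, the single rule is hand-written for illustration rather than learned, and the one rule template used here (retag when the preceding tag matches) is an assumed simplification of what such taggers support.

```python
from typing import List, NamedTuple, Tuple


class Rule(NamedTuple):
    from_tag: str   # tag to be replaced
    to_tag: str     # replacement tag
    prev_tag: str   # the rule fires only if the preceding word carries this tag


def apply_rules(tagged: List[Tuple[str, str]], rules: List[Rule]) -> List[Tuple[str, str]]:
    """Apply each rule in order, left to right, over an initially tagged sentence."""
    tags = [tag for _, tag in tagged]
    for rule in rules:
        for i in range(1, len(tags)):
            if tags[i] == rule.from_tag and tags[i - 1] == rule.prev_tag:
                tags[i] = rule.to_tag
    return [(word, tag) for (word, _), tag in zip(tagged, tags)]


# Initial tags as a most-common-tag lookup might assign them (Brown-style tags).
initial = [("The", "AT"), ("sailor", "NN"), ("dogs", "NNS"), ("the", "AT"), ("hatch", "NN")]
# Hypothetical rule: a plural noun directly after a singular noun becomes a verb.
rules = [Rule(from_tag="NNS", to_tag="VBZ", prev_tag="NN")]

print(apply_rules(initial, rules))
```

Because the rules are explicit patterns rather than probabilities, the corrected output can be traced back to the specific rule that changed each tag.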

References

Works cited

* Charniak, Eugene. 1997. "Statistical Techniques for Natural Language Parsing". ''AI Magazine'' 18(4):33–44.
* Hans van Halteren, Jakub Zavrel, Walter Daelemans. 2001. "Improving Accuracy in NLP Through Combination of Machine Learning Systems". ''Computational Linguistics'' 27(2):199–229.
* DeRose, Steven J. 1990. "Stochastic Methods for Resolution of Grammatical Category Ambiguity in Inflected and Uninflected Languages." Ph.D. Dissertation. Providence, RI: Brown University Department of Cognitive and Linguistic Sciences. Electronic Edition available a
* D.Q. Nguyen, D.Q. Nguyen, D.D. Pham and S.B. Pham (2016). "A Robust Transformation-Based Learning Approach Using Ripple Down Rules for Part-Of-Speech Tagging." ''AI Communications'', vol. 29, no. 3, pages 409–422.