Sliding window based part-of-speech tagging is a technique for assigning part-of-speech tags to the words of a text.
A high percentage of words in a natural language can, out of context, be assigned more than one part of speech. The percentage of these ambiguous words is typically around 30%, although it depends greatly on the language. Solving this problem is very important in many areas of natural language processing. For example, in machine translation, changing the part of speech of a word can dramatically change its translation.
Sliding window based part-of-speech taggers are programs which assign a single part of speech to a given lexical form of a word by looking at a fixed-size "window" of words around the word to be disambiguated.
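The idea of a fixed-size window can be illustrated with a minimal Python sketch. It only shows how the context around each word might be collected, not how a tag is chosen from it. The function name `windows`, the `left` and `right` sizes, and the `<pad>` boundary marker are assumptions made for illustration and do not come from the description above.

```python
from typing import List, Tuple

# Hypothetical boundary marker used when the window runs past the text.
PAD = "<pad>"

Window = Tuple[Tuple[str, ...], str, Tuple[str, ...]]


def windows(words: List[str], left: int = 1, right: int = 1) -> List[Window]:
    """Return (left context, word, right context) for every position in `words`."""
    padded = [PAD] * left + list(words) + [PAD] * right
    result = []
    for i in range(len(words)):
        centre = i + left  # index of the current word inside `padded`
        result.append((
            tuple(padded[centre - left:centre]),
            padded[centre],
            tuple(padded[centre + 1:centre + 1 + right]),
        ))
    return result


# Each word is paired with one word of context on either side.
example = windows(["the", "can", "rusts"])
```

A tagger of this kind would then pick a tag for the middle word using only the information visible in its window.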
The two main advantages of this approach are:
* It is possible to train the tagger automatically, removing the need to tag a corpus manually.
* The tagger can be implemented as a finite state automaton (Mealy machine).
Formal definition
Let $\Gamma = \{\gamma_1, \gamma_2, \ldots, \gamma_{|\Gamma|}\}$ be the set of grammatical tags of the application, that is, the set of all possible tags which may be assigned to a word, and let $W = \{w_1, w_2, \ldots\}$ be the vocabulary of the application. Let $T : W \to P(\Gamma)$ be a function for morphological analysis which assigns each $w \in W$ its set of possible tags, $T(w) \subseteq \Gamma$, and which can be implemented by a full-form lexicon or a morphological analyser. Let $\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_{|\Sigma|}\}$ be the set of word classes, which in general will be a partition of $W$ with the restriction that, for each $\sigma \in \Sigma$, all of the words $w \in \sigma$ receive the same set of tags; that is, all of the words in each word class $\sigma$ belong to the same ambiguity class.
Normally, $\Sigma$ is constructed so that for high-frequency words each word class contains a single word, while for low-frequency words each word class corresponds to a single ambiguity class. This gives good performance for high-frequency ambiguous words without requiring too many parameters for the tagger.
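The construction of word classes described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not a reference implementation: the function name `build_word_classes`, the `min_count` frequency threshold, the toy lexicon and the tag names are all hypothetical. The lexicon plays the role of the morphological analysis function, mapping each word to its set of possible tags.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, Set, Tuple, Union

WordClass = Tuple[str, Union[str, FrozenSet[str]]]


def build_word_classes(
    lexicon: Dict[str, Set[str]],
    tokens: Iterable[str],
    min_count: int = 2,  # assumed threshold separating frequent from rare words
) -> Dict[str, WordClass]:
    """Map every word in the lexicon to a word class.

    Frequent words get a class of their own; all other words share one
    class per ambiguity class (i.e. per distinct set of possible tags).
    """
    counts = Counter(tokens)
    classes: Dict[str, WordClass] = {}
    for word, tags in lexicon.items():
        if counts[word] >= min_count:
            classes[word] = ("word", word)
        else:
            classes[word] = ("ambiguity", frozenset(tags))
    return classes


# Toy data, invented for illustration only.
toy_lexicon = {
    "the": {"DET"},
    "can": {"NOUN", "VERB", "AUX"},
    "run": {"NOUN", "VERB"},
    "walk": {"NOUN", "VERB"},
}
toy_tokens = ["the", "can", "the", "can", "run", "walk"]
toy_classes = build_word_classes(toy_lexicon, toy_tokens)
```

Because every class is either a single word or a group of words with identical tag sets, all words in a class receive the same set of tags, as the restriction on the partition requires.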
With these definitions it is possible to state the problem in the following way: Given a text
each word