Morphological parsing, in natural language processing, is the process of determining the morphemes from which a given word is constructed. A morphological parser must be able to distinguish between orthographic rules and morphological rules. For example, the word 'foxes' can be decomposed into 'fox' (the stem) and 'es' (a suffix indicating plurality).
The generally accepted approach to morphological parsing is through the use of a finite-state transducer (FST), which takes words as input and outputs their stems and modifiers. The FST is initially created through algorithmic parsing of some word source, such as a dictionary, complete with modifier markups.
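
As a rough illustration of the idea, the following Python sketch builds a toy nondeterministic transducer from a hand-written three-word lexicon and uses it to map surface forms to a stem plus tags. The class name `ToyFST`, the method names, the lexicon, and the tag strings `+N+SG` and `+N+PL` are all assumptions made for this example; a production FST would normally be compiled from a full dictionary with a dedicated toolkit.

```python
from collections import defaultdict

START = 0  # the start state of the transducer


class ToyFST:
    """A tiny nondeterministic finite-state transducer (illustrative only)."""

    def __init__(self):
        # (state, input character) -> list of (next state, output string)
        self.arcs = defaultdict(list)
        # accepting state -> tag string appended when input ends there
        self.final_output = {}
        self._count = 1

    def _new_state(self):
        state = self._count
        self._count += 1
        return state

    def add_noun(self, stem, plural_suffix):
        """Add a noun stem and the suffix that marks its plural."""
        state = START
        for ch in stem:
            # Reuse an existing arc that copies this character, if any.
            for nxt, out in self.arcs[(state, ch)]:
                if out == ch:
                    state = nxt
                    break
            else:
                nxt = self._new_state()
                self.arcs[(state, ch)].append((nxt, ch))
                state = nxt
        self.final_output[state] = "+N+SG"
        # Suffix characters are consumed from the input but emit nothing.
        for ch in plural_suffix:
            nxt = self._new_state()
            self.arcs[(state, ch)].append((nxt, ""))
            state = nxt
        self.final_output[state] = "+N+PL"

    def transduce(self, word):
        """Return every analysis the transducer produces for `word`."""
        results = []
        stack = [(START, 0, "")]
        while stack:
            state, pos, out = stack.pop()
            if pos == len(word):
                if state in self.final_output:
                    results.append(out + self.final_output[state])
                continue
            for nxt, emitted in self.arcs.get((state, word[pos]), []):
                stack.append((nxt, pos + 1, out + emitted))
        return results


fst = ToyFST()
for stem, suffix in [("fox", "es"), ("cat", "s"), ("dog", "s")]:
    fst.add_noun(stem, suffix)

print(fst.transduce("foxes"))  # ['fox+N+PL']
print(fst.transduce("fox"))    # ['fox+N+SG']
print(fst.transduce("foxs"))   # []
```

The stem characters are copied from the input tape to the output tape, while the suffix is consumed and replaced by a tag on acceptance, which is the two-tape behaviour that distinguishes a transducer from a plain recognizer.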
Another approach is an indexed lookup method, which uses a constructed radix tree. This route is not often taken because it breaks down for morphologically complex languages.
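
The lookup idea can be sketched in Python as follows. For brevity this uses a plain (uncompressed) trie rather than a true radix tree, which would additionally merge chains of single-child nodes; the names `TrieNode`, `insert`, and `lookup` and the stored `(stem, suffix)` tuples are assumptions made for this example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.analysis = None  # set only on nodes that end a listed word form


def insert(root, form, analysis):
    """Store `analysis` under the exact surface form `form`."""
    node = root
    for ch in form:
        node = node.children.setdefault(ch, TrieNode())
    node.analysis = analysis


def lookup(root, form):
    """Return the stored analysis for `form`, or None if it was never listed."""
    node = root
    for ch in form:
        node = node.children.get(ch)
        if node is None:
            return None
    return node.analysis


root = TrieNode()
# Every inflected form has to be listed explicitly.
insert(root, "fox", ("fox", ""))
insert(root, "foxes", ("fox", "es"))
insert(root, "cat", ("cat", ""))
insert(root, "cats", ("cat", "s"))

print(lookup(root, "foxes"))  # ('fox', 'es')
print(lookup(root, "boxes"))  # None
```

Because the index only answers for forms that were stored in advance, its size grows with the number of inflected forms per stem, which is one plausible reason the method scales poorly to languages with rich morphology.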
With the advancement of neural networks in natural language processing, it has become less common to use FSTs for morphological analysis, especially for languages for which a lot of training data is available. For such languages, it is possible to build character-level language models without explicit use of a morphological parser (Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov, "Enriching Word Vectors with Subword Information").
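
One way to see how subword units can stand in for an explicit parser is to extract character n-grams from a word, loosely following the idea in the cited paper of wrapping the word in boundary markers before slicing it. The function name `char_ngrams` and the default range of 3 to 6 characters are assumptions of this Python sketch, not a reproduction of any particular implementation.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, with '<' and '>' as boundary markers."""
    marked = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']

# Related word forms share n-grams without any stem/suffix analysis.
shared = set(char_ngrams("fox")) & set(char_ngrams("foxes"))
print(sorted(shared))  # ['<fo', '<fox', 'fox']
```

A model that learns a representation per n-gram can therefore relate 'fox' and 'foxes' through their shared pieces, even though no rule ever identified 'es' as a plural suffix.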
Orthographic
Orthographic rules are general rules used when breaking a word into its stem and modifiers. For example, singular English words ending in -y typically end in -ies when pluralized (more precisely, when a consonant precedes the -y, as in 'city' and 'cities'). Contrast this with morphological rules, which cover the corner cases of these general rules. Both types of rules are used to construct systems that can do morphological parsing.
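
A general spelling rule of this kind can be sketched as a small Python function that undoes the change from -y to -ies. The function name `split_ies_plural`, the vowel set, and the `(stem, suffix)` return shape are assumptions made for this example.

```python
VOWELS = set("aeiou")


def split_ies_plural(word):
    """Undo the y -> ies spelling change.

    Returns (stem, suffix) when the rule applies, otherwise None.
    """
    # The rule is only applied when a consonant precedes the 'ies' ending.
    if len(word) > 3 and word.endswith("ies") and word[-4] not in VOWELS:
        return word[:-3] + "y", "s"
    return None


print(split_ies_plural("cities"))  # ('city', 's')
print(split_ies_plural("flies"))   # ('fly', 's')
print(split_ies_plural("foxes"))   # None
```

Applied blindly, the rule also rewrites 'ties' as 'ty' plus 's', so a practical system needs a way to override it for individual words.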
Morphological
Morphological rules are exceptions to the orthographic rules used when breaking a word into its stem and modifiers. For example, while one normally pluralizes a word in English by adding 's' as a suffix, the word 'fish' does not change when pluralized. Contrast this with orthographic rules, which are general rules. Both types of rules are used to construct systems that can do morphological parsing.
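
The interplay between exceptions and a general rule can be sketched in Python by consulting an exception list before falling back to the rule. The set `UNCHANGED_PLURALS`, the function name `analyze_noun`, and the feature labels are assumptions of this example, and the general rule is reduced to the bare 's' suffix mentioned above.

```python
# Assumption: a small hand-picked list of nouns whose plural equals the singular.
UNCHANGED_PLURALS = {"fish", "sheep"}


def analyze_noun(word):
    """Return a list of (stem, number) analyses for an English noun form."""
    # Exceptions are checked first, so they take priority over the general rule.
    if word in UNCHANGED_PLURALS:
        return [(word, "singular"), (word, "plural")]
    # General rule: a trailing 's' marks the plural.
    if len(word) > 1 and word.endswith("s"):
        return [(word[:-1], "plural")]
    return [(word, "singular")]


print(analyze_noun("fish"))  # [('fish', 'singular'), ('fish', 'plural')]
print(analyze_noun("cats"))  # [('cat', 'plural')]
print(analyze_noun("cat"))   # [('cat', 'singular')]
```

Note that 'fish' receives two analyses: without context, the form is ambiguous between singular and plural, and a parser can only report both.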
Various models of natural morphological processing have been proposed. Some experimental studies suggest that monolingual speakers process words as wholes upon listening to them, while their late bilingual peers break words down into their corresponding morphemes, because their lexical representations are not as specific and because lexical processing in the second language may be less frequent than processing in the mother tongue.
Applications
Applications of morphological processing include machine translation, spell checking, and information retrieval.