In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.
Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.
A computer program or subroutine that stems words may be called a ''stemming program'', ''stemming algorithm'', or ''stemmer''.
Examples
A stemmer for English operating on the stem ''cat'' should identify such strings as ''cats'', ''catlike'', and ''catty''. A stemming algorithm might also reduce the words ''fishing'', ''fished'', and ''fisher'' to the stem ''fish''. The stem need not be a word; for example, the Porter algorithm reduces ''argue'', ''argued'', ''argues'', ''arguing'', and ''argus'' to the stem ''argu''.
History
The first published stemmer was written by Julie Beth Lovins in 1968. This paper was remarkable for its early date and had great influence on later work in this area. Her paper refers to three earlier major attempts at stemming algorithms: one by Professor John W. Tukey of Princeton University, one developed at Harvard University by Michael Lesk under the direction of Professor Gerard Salton, and a third developed by James L. Dolby of R and D Consultants, Los Altos, California.
A later stemmer was written by Martin Porter and was published in the July 1980 issue of the journal ''Program''. This stemmer was very widely used and became the de facto standard algorithm used for English stemming. Dr. Porter received the Tony Kent Strix award in 2000 for his work on stemming and information retrieval.
Many implementations of the Porter stemming algorithm were written and freely distributed; however, many of these implementations contained subtle flaws. As a result, these stemmers did not match their potential. To eliminate this source of error, Martin Porter released an official free software (mostly BSD-licensed) implementation of the algorithm around the year 2000. He extended this work over the next few years by building Snowball, a framework for writing stemming algorithms, and implemented an improved English stemmer together with stemmers for several other languages.
The Paice-Husk Stemmer was developed by Chris D Paice at Lancaster University in the late 1980s. It is an iterative stemmer and features an externally stored set of stemming rules. The standard set of rules provides a 'strong' stemmer and may specify the removal or replacement of an ending. The replacement technique avoids the need for a separate stage in the process to recode or provide partial matching. Paice also developed a direct measurement for comparing stemmers based on counting the over-stemming and under-stemming errors.
Algorithms
There are several types of stemming algorithms, which differ with respect to performance and accuracy and in how certain stemming obstacles are overcome.
A simple stemmer looks up the inflected form in a lookup table. The advantages of this approach are that it is simple, fast, and easily handles exceptions. The disadvantages are that all inflected forms must be explicitly listed in the table: new or unfamiliar words are not handled, even if they are perfectly regular (e.g. cats ~ cat), and the table may be large. For languages with simple morphology, like English, table sizes are modest, but highly inflected languages like Turkish may have hundreds of potential inflected forms for each root.
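The lookup approach described above can be sketched in a few lines of Python. This is only an illustration: the table contents and the name `lookup_stem` are assumptions made for the example, not part of any published stemmer.

```python
# Hypothetical, hand-written table mapping inflected forms to stems.
STEM_TABLE = {
    "cats": "cat",
    "running": "run",
    "runs": "run",
    "ran": "run",  # an irregular form is handled as easily as a regular one
}


def lookup_stem(word, table=STEM_TABLE):
    """Return the stem listed for `word`, or None if the word is not in the table."""
    return table.get(word.lower())


print(lookup_stem("ran"))   # a listed exception
print(lookup_stem("dogs"))  # perfectly regular, but unlisted, so not handled
```

The second call returns `None`, which illustrates the main drawback: a regular form that was never entered into the table cannot be stemmed.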
A lookup approach may use preliminary part-of-speech tagging to avoid overstemming.
The production technique
The lookup table used by a stemmer is generally produced semi-automatically. For example, if the word is "run", then the inverted algorithm might automatically generate the forms "running", "runs", "runned", and "runly". The last two forms are valid constructions, but they are unlikely.
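One way to picture this production step is a generator that applies suffixes to a stem and a filter that keeps only plausible forms. The sketch below is an assumption-laden illustration: the suffix list, the consonant-doubling heuristic, and the idea of filtering against a set of attested words are all hypothetical choices, since the text does not specify how the semi-automatic review is done.

```python
SUFFIXES = ["s", "ing", "ed", "ly"]  # illustrative suffix list
VOWELS = "aeiou"


def doubles_final_consonant(stem):
    """Rough consonant-vowel-consonant test for short stems such as 'run'."""
    return (
        len(stem) >= 3
        and stem[-1] not in VOWELS + "wxy"
        and stem[-2] in VOWELS
        and stem[-3] not in VOWELS
    )


def generate_forms(stem):
    """Generate candidate inflected forms, each mapped back to the stem."""
    forms = {}
    for suffix in SUFFIXES:
        if suffix[0] in VOWELS and doubles_final_consonant(stem):
            form = stem + stem[-1] + suffix  # run + n + ing
        else:
            form = stem + suffix
        forms[form] = stem
    return forms


def build_table(stem, attested):
    """Keep only generated forms found in `attested` (a stand-in for manual review)."""
    return {f: s for f, s in generate_forms(stem).items() if f in attested}


print(generate_forms("run"))
print(build_table("run", {"runs", "running"}))
```

For "run" the generator yields the four forms mentioned above, and the filter drops "runned" and "runly" because they do not appear in the attested set.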
Suffix-stripping algorithms
Suffix stripping algorithms do not rely on a lookup table that consists of inflected forms and root form relations. Instead, a typically smaller list of "rules" is stored which provides a path for the algorithm, given an input word form, to find its root form. Some examples of the rules include:
* if the word ends in 'ed', remove the 'ed'
* if the word ends in 'ing', remove the 'ing'
* if the word ends in 'ly', remove the 'ly'
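The three rules listed above can be written down directly. The following is a minimal sketch, not the Porter algorithm or any other published stemmer; the function name and the guard that refuses to strip a whole word are assumptions added for the example.

```python
RULES = ("ed", "ing", "ly")  # the three example suffixes from the list above


def strip_suffix(word):
    """Remove the first matching suffix, leaving at least one character."""
    for suffix in RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


for w in ("fished", "fishing", "quickly", "ran"):
    print(w, "->", strip_suffix(w))
```

The word "ran" passes through unchanged, which shows why purely suffix-based rules struggle with irregular relations such as 'ran' and 'run'.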
Suffix stripping approaches enjoy the benefit of being much simpler to maintain than brute force algorithms, assuming the maintainer is sufficiently knowledgeable in the challenges of linguistics and morphology and encoding suffix stripping rules. Suffix stripping algorithms are sometimes regarded as crude given the poor performance when dealing with exceptional relations (like 'ran' and 'run'). The solutions produced by suffix stripping algorithms are limited to those lexical categories which have well known suffixes with few exceptions. This, however, is a problem, as not all parts of speech have such a well formulated set of rules. Lemmatisation attempts to improve upon this challenge.
Prefix stripping may also be implemented. Of course, not all languages use prefixing or suffixing.
Additional algorithm criteria
Suffix stripping algorithms may differ in results for a variety of reasons. One such reason is whether the algorithm constrains whether the output word must be a real word in the given language. Some approaches do not require the word to actually exist in the language lexicon (the set of all words in the language). Alternatively, some suffix stripping approaches maintain a database (a large list) of all known morphological word roots that exist as real words. These approaches check the list for the existence of the term prior to making a decision. Typically, if the term does not exist, alternate action is taken. This alternate action may involve several other criteria. The non-existence of an output term may serve to cause the algorithm to try alternate suffix stripping rules.
It can be the case that two or more suffix stripping rules apply to the same input term, which creates an ambiguity as to which rule to apply. The algorithm may assign (by human hand or stochastically) a priority to one rule or another. Or the algorithm may reject one rule application because it results in a non-existent term whereas the other overlapping rule does not. For example, given the English term ''friendlies'', the algorithm may identify the ''ies'' suffix, apply the appropriate rule, and achieve the result ''friendl''. ''Friendl'' is likely not found in the lexicon, and therefore the rule is rejected.
One improvement upon basic suffix stripping is the use of suffix substitution. Similar to a stripping rule, a substitution rule replaces a suffix with an alternate suffix. For example, there could exist a rule that replaces ''ies'' with ''y''. How this affects the algorithm depends on the algorithm's design. To illustrate, the algorithm may identify that both the ''ies'' suffix stripping rule and the suffix substitution rule apply. Since the stripping rule results in a term that does not exist in the lexicon, but the substitution rule does not, the substitution rule is applied instead. In this example, ''friendlies'' becomes ''friendly'' instead of ''friendl''.
Diving further into the details, a common technique is to apply rules in a cyclical fashion (recursively, as computer scientists would say). After applying the suffix substitution rule in this example scenario, a second pass is made to identify matching rules on the term ''friendly'', where the ''ly'' stripping rule is likely identified and accepted. In summary, ''friendlies'' becomes (via substitution) ''friendly'' which becomes (via stripping) ''friend''.
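The ''friendlies'' walk-through can be reproduced with a small sketch that combines stripping rules, a substitution rule, a lexicon check, and repeated passes. The tiny lexicon, the rule order, and the function names are assumptions chosen to mirror the example; a real system would use a far larger lexicon and rule set.

```python
LEXICON = {"friend", "friendly"}  # hypothetical two-word lexicon

# (suffix, replacement) pairs; an empty replacement is plain stripping.
RULES = [("ies", ""), ("ies", "y"), ("ly", "")]


def apply_one_rule(word):
    """Return the first rule result that exists in the lexicon, else None."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in LEXICON:
                return candidate  # rules producing non-words are skipped
    return None


def stem(word):
    """Apply rules in repeated passes until no rule yields a lexicon entry."""
    while True:
        candidate = apply_one_rule(word)
        if candidate is None:
            return word
        word = candidate


print(stem("friendlies"))
```

On the first pass the ''ies'' stripping rule is rejected because its result is not in the lexicon, so the substitution rule produces ''friendly''; the second pass strips ''ly'' to reach ''friend''. The loop terminates because every accepted rule here shortens the word.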
This example also helps illustrate the difference between a rule-based approach and a brute force approach. In a brute force approach, the algorithm would search for ''friendlies'' in the set of hundreds of thousands of inflected word forms and ideally find the corresponding root form ''friend''. In the rule-based approach, the three rules mentioned above would be applied in succession to converge on the same solution. Chances are that the rule-based approach would be slower, as lookup algorithms have direct access to the solution, while a rule-based algorithm must try several options, and combinations of them, and then choose which result seems to be the best.
Lemmatisation algorithms
A more complex approach to the problem of determining a stem of a word is lemmatisation. This process involves first determining the part of speech of a word, and applying different normalization rules for each part of speech. The part of speech is first detected prior to attempting to find the root since for some languages, the stemming rules change depending on a word's part of speech.
This approach is highly conditional upon obtaining the correct lexical category (part of speech). While there is overlap between the normalization rules for certain categories, identifying the wrong category or being unable to produce the right category limits the added benefit of this approach over suffix stripping algorithms. The basic idea is that, if the stemmer is able to grasp more information about the word being stemmed, then it can apply more accurate normalization rules (which unlike suffix stripping rules can also modify the stem).
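As a rough illustration of part-of-speech-dependent normalization, the sketch below keeps a separate rule list per category. The tag names, the rules, and the function `normalize` are hypothetical, and the part of speech is simply passed in by the caller rather than detected by a tagger.

```python
# Hypothetical normalization rules per part of speech: (suffix, replacement).
RULES_BY_POS = {
    "VERB": [("ing", ""), ("ed", ""), ("s", "")],
    "NOUN": [("ies", "y"), ("s", "")],
}


def normalize(word, pos):
    """Apply the first matching rule for the given part of speech."""
    for suffix, replacement in RULES_BY_POS.get(pos, []):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


print(normalize("meeting", "VERB"))
print(normalize("meeting", "NOUN"))
print(normalize("ponies", "NOUN"))
```

The same surface form "meeting" is reduced when treated as a verb but left intact when treated as a noun, which is why a wrong category assignment erodes the benefit of this approach.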
Stochastic algorithms
Stochastic algorithms involve using probability to identify the root form of a word. Stochastic algorithms are trained (they "learn") on a table of root form to inflected form relations to develop a probabilistic model. This model is typically expressed in the form of complex linguistic rules, similar in nature to those in suffix stripping or lemmatisation. Stemming is performed by inputting an inflected form to the trained model and having the model produce the root form according to its internal ruleset, which again is similar to suffix stripping and lemmatisation. The difference is that the decisions involved (which rule is most appropriate, whether to stem the word at all or return it unchanged, and whether to apply two different rules sequentially) are made on the grounds that the output word will have the highest probability of being correct (which is to say, the smallest probability of being incorrect, which is how it is typically measured).
Some lemmatisation algorithms are stochastic in that, given a word which may belong to multiple parts of speech, a probability is assigned to each possible part. This may take into account the surrounding words, called the context, or not. Context-free grammars do not take into account any additional information. In either case, after assigning the probabilities to each possible part of speech, the most likely part of speech is chosen, and from there the appropriate normalization rules are applied to the input word to produce the normalized (root) form.
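The selection step described in the last paragraph can be sketched as picking the most probable part of speech and then applying that category's rules. Everything specific here is an assumption for illustration: the probabilities are made up rather than learned, and the optional `probabilities` argument merely stands in for whatever a context-aware model would supply.

```python
# Made-up probabilities; a real system would estimate these from training data.
POS_PROBABILITIES = {"meeting": {"NOUN": 0.7, "VERB": 0.3}}
RULES_BY_POS = {"VERB": [("ing", "")], "NOUN": [("s", "")]}


def most_likely_pos(word, probabilities=None):
    """Choose the part of speech with the highest probability."""
    candidates = probabilities or POS_PROBABILITIES.get(word, {"NOUN": 1.0})
    return max(candidates, key=candidates.get)


def stochastic_normalize(word, probabilities=None):
    """Normalize `word` using the rules of its most likely part of speech."""
    pos = most_likely_pos(word, probabilities)
    for suffix, replacement in RULES_BY_POS.get(pos, []):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


print(stochastic_normalize("meeting"))
print(stochastic_normalize("meeting", {"NOUN": 0.2, "VERB": 0.8}))
```

With the default numbers "meeting" is treated as a noun and returned unchanged; when the supplied probabilities favour the verb reading, the verb rule reduces it to "meet".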
''n''-gram analysis
Some stemming techniques use the n-gram context of a word to choose the correct stem for a word.
Hybrid approaches
Hybrid approaches use two or more of the approaches described above in unison. A simple example is a suffix tree algorithm which first consults a lookup table using brute force. However, instead of trying to store the entire set of relations between words in a given language, the lookup table is kept small and is only used to store a minute amount of "frequent exceptions" like "ran => run". If the word is not in the exception list, suffix stripping or lemmatisation is applied and the result is output.
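A hybrid of this kind can be sketched as an exception table consulted before a suffix-stripping fallback. The table entry comes from the example above; the suffix list, the minimum stem length of three characters, and the name `hybrid_stem` are assumptions made for the illustration.

```python
EXCEPTIONS = {"ran": "run"}          # small table of frequent exceptions
SUFFIXES = ("ing", "ed", "ly", "s")  # illustrative fallback rules
MIN_STEM_LENGTH = 3                  # assumed guard against over-short stems


def hybrid_stem(word):
    """Check the exception table first, then fall back to suffix stripping."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


for w in ("ran", "fishing", "cats", "bus"):
    print(w, "->", hybrid_stem(w))
```

Keeping the table small preserves the speed of lookup for irregular forms while leaving regular forms to the rules.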
Affix stemmers
In linguistics, the term affix refers to either a prefix or a suffix. In addition to dealing with suffixes, several approaches also attempt to remove common prefixes. For example, given the word ''indefinitely'', such an approach identifies that the leading "in" is a prefix that can be removed. Many of the same approaches mentioned earlier apply, but go by the name affix stripping. Affix stemming has also been studied for several European languages.
Matching algorithms
Such algorithms use a stem database (for example a set of documents that contain stem words). These stems, as mentioned above, are not necessarily valid words themselves (but rather common sub-strings, as the "brows" in "browse" and in "browsing"). In order to stem a word the algorithm tries to match it with stems from the database, applying various constraints, such as on the relative length of the candidate stem within the word (so that, for example, the short prefix "be", which is the stem of such words as "be", "been" and "being", would not be considered as the stem of the word "beside").
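A relative-length constraint of the kind described above might look like the following sketch. The two-entry stem database and the threshold of 0.35 are assumptions picked so that the examples in the text behave as described; they are not values taken from any real system.

```python
STEM_DATABASE = {"brows", "be"}  # hypothetical stem database
MIN_RATIO = 0.35                 # assumed minimum stem-to-word length ratio


def match_stem(word):
    """Return the longest database stem that prefixes `word` and is long enough."""
    candidates = [
        stem
        for stem in STEM_DATABASE
        if word.startswith(stem) and len(stem) / len(word) >= MIN_RATIO
    ]
    return max(candidates, key=len) if candidates else None


for w in ("browsing", "being", "beside"):
    print(w, "->", match_stem(w))
```

Here "be" covers two of the five letters of "being" and is accepted, but only two of the six letters of "beside", so the constraint rejects it.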
Language challenges
While much of the early academic work in this area was focused on the English language (with significant use of the Porter Stemmer algorithm), many other languages have been investigated.
Hebrew and Arabic are still considered difficult research languages for stemming. English stemmers are fairly trivial (with only occasional problems, such as "dries" being the third-person singular present form of the verb "dry", or "axes" being the plural of "axe" as well as "axis"); but stemmers become harder to design as the morphology, orthography, and character encoding of the target language become more complex. For example, an Italian stemmer is more complex than an English one (because of a greater number of verb inflections), a Russian one is more complex (more noun declensions), a Hebrew one is even more complex (due to nonconcatenative morphology, a writing system without vowels, and the requirement of prefix stripping: Hebrew stems can be two, three or four characters, but not more), and so on.
Multilingual stemming
Multilingual stemming applies morphological rules of two or more languages simultaneously instead of rules for only a single language when interpreting a search query. Commercial systems using multilingual stemming exist.
Error metrics
There are two error measurements in stemming algorithms, overstemming and understemming. Overstemming is an error where two separate inflected words are stemmed to the same root, but should not have been (a false positive). Understemming is an error where two separate inflected words should be stemmed to the same root, but are not (a false negative). Stemming algorithms attempt to minimize each type of error, although reducing one type can lead to increasing the other.
For example, the widely used Porter stemmer stems "universal", "university", and "universe" to "univers". This is a case of overstemming: though these three words are etymologically related, their modern meanings are in widely different domains, so treating them as synonyms in a search engine will likely reduce the relevance of the search results.
An example of understemming in the Porter stemmer is "alumnus" → "alumnu", "alumni" → "alumni", "alumna"/"alumnae" → "alumna". This English word keeps Latin morphology, and so these near-synonyms are not conflated.
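The two error types can be counted over word pairs once a reference grouping is available. The sketch below uses the stems quoted above as fixed data rather than running a stemmer, and the grouping of words into concepts is a hypothetical gold standard invented for the example. It is a simplified pair count, not a reproduction of Paice's published measure.

```python
from itertools import combinations

# Stems as quoted in the text for the Porter stemmer.
STEMS = {
    "universal": "univers", "university": "univers", "universe": "univers",
    "alumnus": "alumnu", "alumni": "alumni",
    "alumna": "alumna", "alumnae": "alumna",
}

# Hypothetical gold standard: words sharing an id should be conflated.
GROUPS = {
    "universal": 1, "university": 2, "universe": 3,
    "alumnus": 4, "alumni": 4, "alumna": 4, "alumnae": 4,
}


def count_errors(stems, groups):
    """Return (overstemmed pairs, understemmed pairs)."""
    over = under = 0
    for a, b in combinations(sorted(stems), 2):
        same_stem = stems[a] == stems[b]
        same_group = groups[a] == groups[b]
        if same_stem and not same_group:
            over += 1   # conflated, but should not have been
        elif same_group and not same_stem:
            under += 1  # should have been conflated, but were not
    return over, under


print(count_errors(STEMS, GROUPS))
```

Under these assumptions the three "univers" words contribute three overstemmed pairs, and the four "alumn-" words contribute five understemmed pairs, since only "alumna" and "alumnae" share a stem.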
Applications
Stemming is used as an approximate method for grouping words with a similar basic meaning together. For example, a text mentioning "daffodils" is probably closely related to a text mentioning "daffodil" (without the s). But in some cases, words with the same morphological stem have idiomatic meanings which are not closely related: a user searching for "marketing" will not be satisfied by most documents mentioning "markets" but not "marketing".
Information retrieval
Stemmers can be used as elements in query systems such as Web search engines. The effectiveness of stemming for English query systems was soon found to be rather limited, however, and this led early information retrieval researchers to deem stemming irrelevant in general. An alternative approach, based on searching for n-grams rather than stems, may be used instead. Also, stemmers may provide greater benefits in other languages than English.
Domain analysis
Stemming is used to determine domain vocabularies in domain analysis.
Use in commercial products
Many commercial companies have been using stemming since at least the 1980s and have produced algorithmic and lexical stemmers in many languages.
The Snowball stemmers have been compared with commercial lexical stemmers with varying results.
Google Search adopted word stemming in 2003 (''The Essentials of Google Search'', Web Search Help Center, Google Inc.). Previously a search for "fish" would not have returned "fishing". Other software search algorithms vary in their use of word stemming. Programs that simply search for substrings will obviously find "fish" in "fishing" but when searching for "fishes" will not find occurrences of the word "fish".
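The contrast between substring search and stem-based search can be shown with a toy example. The stemmer here is a deliberately naive stand-in with an assumed suffix list and minimum stem length; it is not how Google Search or any other product implements stemming.

```python
def naive_stem(word):
    """Toy stemmer: strip one of a few suffixes if at least 3 letters remain."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def substring_match(query, text):
    """True if the raw query string occurs anywhere in the text."""
    return query in text


def stem_match(query, text):
    """True if the query and some word of the text share a stem."""
    return naive_stem(query) in {naive_stem(token) for token in text.split()}


print(substring_match("fish", "gone fishing"))        # substring search succeeds
print(substring_match("fishes", "a fish swam past"))  # substring search fails
print(stem_match("fishes", "a fish swam past"))       # stem comparison succeeds
```

Substring search only works when the query is the shorter form; comparing stems on both sides handles either direction.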
Text mining
Stemming is used as a task in pre-processing texts before performing text mining analyses on them.