Lexicostatistics is a method of comparative linguistics that involves comparing the percentage of lexical cognates between languages to determine their relationship. Lexicostatistics is related to the comparative method but does not reconstruct a proto-language. It is to be distinguished from glottochronology, which attempts to use lexicostatistical methods to estimate the length of time since two or more languages diverged from a common earlier proto-language. This is merely one application of lexicostatistics, however; other applications of it may not share the assumption of a constant rate of change for basic lexical items.
The term "lexicostatistics" is misleading in that mathematical equations are used but not statistics. Features of a language other than the lexicon may be used, though this is unusual. Whereas the comparative method uses shared, identified innovations to determine sub-groups, lexicostatistics does not identify these. Lexicostatistics is a distance-based method, whereas the comparative method considers language characters directly. Lexicostatistics is simple and fast relative to the comparative method but has limitations (discussed below). It can be validated by cross-checking the trees produced by both methods.
History
Lexicostatistics was developed by Morris Swadesh in a series of articles in the 1950s, based on earlier ideas. The concept's first known use was by Dumont d'Urville in 1834, who compared various "Oceanic" languages and proposed a method for calculating a coefficient of relationship.
Hymes (1960) and Embleton (1986) both review the history of lexicostatistics.
Method
Create word list
The aim is to generate a list of universally used meanings (hand, mouth, sky, I). Words are then collected for these meaning slots for each language being considered. Swadesh originally reduced a larger set of meanings down to 200. He later found that it was necessary to reduce it further but that he could include some meanings that were not in his original list, giving his later 100-item list. The Swadesh list in Wiktionary gives the total 207 meanings in a number of languages. Alternative lists that apply more rigorous criteria have been generated, e.g. the Dolgopolsky list and the Leipzig–Jakarta list, as well as lists with a more specific scope; for example, Dyen, Kruskal and Black have 200 meanings for 84 Indo-European languages in digital form.
Determine cognacies
A trained and experienced linguist is needed to make cognacy decisions. These decisions may need to be refined as the state of knowledge increases, but lexicostatistics does not rely on all of them being correct. For each pair of words (in different languages) in this list, the cognacy of a form could be positive, negative or indeterminate. Sometimes a language has multiple words for one meaning, e.g. ''small'' and ''little'' for ''not big''.
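One way to record such decisions is sketched below, assuming that the linguist assigns each form a cognate-class label and that a meaning counts as cognate when any pair of forms shares a label. The label scheme, the `judge` function and the rule for synonyms are illustrative assumptions, not part of the method as described; the word lists are a toy example.

```python
from enum import Enum


class Cognacy(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    INDETERMINATE = "indeterminate"


# Toy word lists: meaning -> list of (form, cognate-class label).
# The labels are hypothetical; None would mean "no decision made yet".
ENGLISH = {
    "hand": [("hand", "HAND-1")],
    "mouth": [("mouth", "MOUTH-1")],
    "sky": [("sky", "SKY-1")],
    "small": [("small", "SMALL-1"), ("little", "SMALL-2")],
}
GERMAN = {
    "hand": [("Hand", "HAND-1")],
    "mouth": [("Mund", "MOUTH-1")],
    "sky": [("Himmel", "SKY-2")],
    "small": [("klein", "SMALL-3")],
}


def judge(meaning, list_a, list_b):
    """Return the cognacy judgment for one meaning slot in two word lists."""
    entries_a = list_a.get(meaning, [])
    entries_b = list_b.get(meaning, [])
    if not entries_a or not entries_b:
        return Cognacy.INDETERMINATE  # no word collected on one side
    classes_a = {label for _, label in entries_a}
    classes_b = {label for _, label in entries_b}
    # Assumed synonym rule: any shared label makes the slot cognate.
    if (classes_a - {None}) & (classes_b - {None}):
        return Cognacy.POSITIVE
    if None in classes_a or None in classes_b:
        return Cognacy.INDETERMINATE  # an undecided form could still match
    return Cognacy.NEGATIVE
```

With these lists, `judge("hand", ENGLISH, GERMAN)` is positive, while `judge("small", ENGLISH, GERMAN)` is negative because neither English synonym shares a label with the German form.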
Calculate lexicostatistic percentages
The lexicostatistic percentage for a particular language pair is the proportion of meanings that are cognate, calculated relative to the total number of meanings excluding indeterminate cases. This value is entered into an N × N table of distances, where N is the number of languages being compared. When completed, this table is half-filled in triangular form. The higher the proportion of cognacy, the more closely related the languages are.
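The calculation can be sketched as follows. This is a minimal illustration that assumes each judgment is coded as `True` (cognate), `False` (not cognate) or `None` (indeterminate); the function names and the use of `frozenset` keys for unordered language pairs are choices made for this example.

```python
from itertools import combinations


def cognate_percentage(judgments):
    """Percentage of cognate meanings, ignoring indeterminate (None) slots."""
    decided = [j for j in judgments if j is not None]
    if not decided:
        return None  # nothing to compare
    return 100.0 * sum(decided) / len(decided)


def triangular_table(languages, pair_judgments):
    """Build the half-filled table: one entry per unordered language pair.

    pair_judgments maps frozenset({lang_a, lang_b}) to a list of judgments,
    one per meaning slot.
    """
    table = {}
    for a, b in combinations(languages, 2):
        pair = frozenset((a, b))
        table[pair] = cognate_percentage(pair_judgments[pair])
    return table
```

For N languages the table holds N(N − 1)/2 entries, which is the triangular half of the full N × N table. A pair with judgments `[True, True, False, None]` scores 2 out of 3 decided slots, or about 66.7 per cent.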
Create family tree
Creation of the language tree is based solely on the table found above. Various sub-grouping methods can be used but that adopted by Dyen, Kruskal and Black was:
* all lists are placed in a pool
* the two closest members are removed and form a nucleus which is placed in the pool
* this step is repeated
* under certain conditions a nucleus becomes a group
* this is repeated until the pool only contains one group.
Lexical percentages also have to be calculated for nuclei and groups, not only for individual lists.
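The pooling loop can be approximated with a generic agglomerative procedure. The sketch below is not the Dyen, Kruskal and Black procedure itself: the conditions under which a nucleus becomes a group are not specified here, so every merge is simply kept, and the percentage between two clusters is assumed to be the average over all member pairs (average linkage). The function name and input format are illustrative.

```python
def build_tree(languages, percent):
    """Repeatedly merge the two closest clusters until one remains.

    percent maps frozenset({a, b}) to the cognate percentage of that pair.
    Returns the tree as nested tuples of language names.
    Assumes at least one language and a percentage for every pair.
    """
    # Each pool entry is (subtree, member languages).
    pool = [(lang, (lang,)) for lang in languages]

    def similarity(c1, c2):
        # Assumed rule: average percentage over all cross-cluster pairs.
        values = [percent[frozenset((a, b))] for a in c1[1] for b in c2[1]]
        return sum(values) / len(values)

    while len(pool) > 1:
        # Pick the pair of pool entries with the highest percentage.
        i, j = max(
            ((i, j) for i in range(len(pool)) for j in range(i + 1, len(pool))),
            key=lambda ij: similarity(pool[ij[0]], pool[ij[1]]),
        )
        merged = ((pool[i][0], pool[j][0]), pool[i][1] + pool[j][1])
        # Remove the two members and return the merged cluster to the pool.
        pool = [c for k, c in enumerate(pool) if k not in (i, j)] + [merged]
    return pool[0][0]
```

With hypothetical percentages of 80 for A and B, 70 for C and D, and 20 for every other pair, the result is `(("A", "B"), ("C", "D"))`. Other linkage rules would give different trees when the percentages are less clear-cut.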
Applications
A leading exponent of lexicostatistics has been Isidore Dyen. He used lexicostatistics to classify Austronesian languages as well as Indo-European ones.
A major study of the latter was reported by Dyen, Kruskal and Black (1992).
Studies have also been carried out on Amerindian and African languages.
Pama-Nyungan
The problem of internal branching within the Pama-Nyungan language family has been a long-standing issue for Australianist linguistics, and general consensus held that internal connections between the 25+ different subgroups of Pama-Nyungan were either impossible to reconstruct or that the subgroups were not in fact genetically related at all.
In 2012, Claire Bowern and Quentin Atkinson published the results from their application of computational phylogenetic methods on 19 doculects representing all major subgroups and isolates of Pama-Nyungan. Their model "recovered" many of the branches and divisions that had previously been proposed and accepted by many other Australianists, while also providing some insight into the more problematic branches, such as Paman (which is complicated by the lack of data) and Ngumpin-Yapa (where the genetic picture is obscured by very high rates of borrowing between languages). Their dataset forms the largest of its kind for a hunter-gatherer language family, and the second largest overall after Austronesian (Greenhill et al. 2008). They conclude that Pama-Nyungan languages are in fact not exceptional to lexicostatistical methods, which have successfully been applied to other language families of the world.
Criticisms
Researchers such as Hoijer (1956) have shown that there were difficulties in finding equivalents to the meaning items, and many have found it necessary to modify Swadesh's lists. Gudschinsky (1956) questioned whether it was possible to obtain a universal list.
Factors such as borrowing, tradition and taboo can skew the results, as with other methods. Sometimes lexicostatistics has been used with lexical similarity rather than cognacy to find resemblances. This is then equivalent to mass comparison.
The choice of meaning slots is subjective, as is the choice of synonyms.
Improved methods
Some of the modern computational statistical hypothesis testing methods can be regarded as improvements of lexicostatistics in that they use similar word lists and distance measures.
See also
* Basic English
* Cognate
* Comparative linguistics
* Comparative method
* Global Lexicostatistical Database
* Glottochronology
* Historical linguistics
* Indo-European studies
* Intercontinental Dictionary Series
* Linguistic distance
* Mass lexical comparison
* Proto-language
* Swadesh list
* Word list
References
Further reading
* Dobson, Annette (1969). Lexicostatistical Grouping. Anthropological Linguistics 7, 216-221.
* Dobson, Annette and Black, Paul (1979). Multidimensional Scaling of some Lexicostatistical Data. Mathematical Scientist 1979/4, 55-61.
* McMahon, April and McMahon, Robert (2005). Language Classification by Numbers. Oxford University Press.
* Sankoff, David (1970). "On the Rate of Replacement of Word-Meaning Relationships." ''Language'' 46.564-569.
* Wittmann, Henri (1969). "A lexico-statistic inquiry into the diachrony of Hittite." ''Indogermanische Forschungen'' 74.1-1
* Wittmann, Henri (1973). "The lexicostatistical classification of the French-based Creole languages." ''Lexicostatistics in genetic linguistics: Proceedings of the Yale conference, April 3–4, 1971'', dir. Isidore Dyen, 89-99. La Haye: Mouto
External links
* The Global Lexicostatistical Database, part of the Evolution of Human Languages project
* IE database