Lexical Frequency Analysis

A word list is a list of words in a lexicon, generally sorted by frequency of occurrence (either by graded levels or as a ranked list). A word list is compiled by lexical frequency analysis within a given text corpus, and is used in corpus linguistics to investigate the genealogies and evolution of languages and texts. A word which appears only once in the corpus is called a hapax legomenon.

In pedagogy, word lists are used in curriculum design for vocabulary acquisition. A lexicon sorted by frequency "provides a rational basis for making sure that learners get the best return for their vocabulary learning effort", but is mainly intended for course writers, not directly for learners. Frequency lists are also made for lexicographical purposes, serving as a sort of checklist to ensure that common words are not left out. Some major pitfalls are the corpus content, the corpus register, and the definition of "word".

Although word counting is at least a thousand years old, and enormous analyses were still being done by hand in the mid-20th century, electronic natural language processing of large corpora such as movie subtitles (the SUBTLEX megastudy) has accelerated the research field.

In computational linguistics, a frequency list is a sorted list of words (word types) together with their frequency, where frequency usually means the number of occurrences in a given corpus, from which the rank can be derived as the position in the list.
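The definition of a frequency list can be illustrated with a minimal Python sketch. The tokenizer here is a deliberately naive assumption (lowercased runs of letters and apostrophes), and the function names `frequency_list` and `hapax_legomena` are hypothetical, chosen only for this illustration.

```python
from collections import Counter
import re


def frequency_list(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    # Naive tokenization (an assumption): lowercase runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    # The rank is simply the position in the sorted list, starting at 1.
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]


def hapax_legomena(text):
    """Return the words that occur exactly once in the text."""
    return sorted(word for _, word, count in frequency_list(text) if count == 1)


sample = "the cat saw the dog and the dog saw a bird"
for rank, word, count in frequency_list(sample):
    print(rank, word, count)
print(hapax_legomena(sample))
```

In this toy sample, "the" takes rank 1 with three occurrences, while "a", "and", "bird" and "cat" each occur once. How ties are ordered is a design choice; alphabetical order is used here only to keep the output reproducible.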


Methodology


Factors

Nation noted the considerable help provided by computing capabilities, which make corpus analysis much easier. He cited several key issues that influence the construction of frequency lists:
* corpus representativeness
* word frequency and range
* treatment of word families
* treatment of idioms and fixed expressions
* range of information
* various other criteria
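As a rough illustration of the "word frequency and range" factor, the sketch below counts both measures over a small collection of texts. It assumes that "range" means the number of different texts in which a word occurs; the function name `frequency_and_range` and the sample texts are hypothetical.

```python
from collections import Counter
import re


def frequency_and_range(texts):
    """Map each word to (total occurrences, number of texts containing it)."""
    frequency = Counter()
    text_range = Counter()
    for text in texts:
        # Naive tokenization (an assumption): lowercase runs of ASCII letters.
        tokens = re.findall(r"[a-z]+", text.lower())
        frequency.update(tokens)
        # A set counts each word at most once per text.
        text_range.update(set(tokens))
    return {word: (frequency[word], text_range[word]) for word in frequency}


texts = [
    "the ship left the harbour",
    "the doctor left early",
    "the ship the ship the ship",
]
stats = frequency_and_range(texts)
print(stats["ship"], stats["left"])
```

Here "ship" is more frequent than "left" overall, yet both occur in the same number of texts. A word that is frequent only because one text repeats it may be less useful to a general learner than its raw count suggests.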


Corpora


Traditional written corpus

Most currently available studies are based on written text corpora, which are more readily available and easier to process.


SUBTLEX movement

However, it has been proposed to tap into the large number of subtitles available online in order to analyse large amounts of speech. A long critical evaluation of the traditional textual analysis approach supported a move toward speech analysis and the analysis of film subtitles available online. The initial research was followed by a handful of studies providing valuable frequency counts for various languages. In-depth SUBTLEX studies based on cleaned-up open subtitles have been produced for French, American English, Dutch, Chinese, Spanish, Greek, Vietnamese, Brazilian Portuguese, European Portuguese, Albanian, Polish, Catalan (2019) and Welsh (Van Heuven et al. 2024). SUBTLEX-IT (2015) provides raw data only.


Lexical unit

In any case, the basic "word" unit must be defined. For Latin scripts, words are usually one or several characters separated by spaces or punctuation. Exceptions arise, however: English "can't" contains a punctuation mark, as do some French words, while French "château d'eau" (water tower) contains a space yet designates a concept different from the simple addition of its components.

It may also be preferable to group the words of a word family under its base word. Thus, ''possible, impossible, possibility'' are words of the same word family, represented by the base word ''*possib*''. For statistical purposes, the occurrences of all these words are summed under the base word form *possib*, so that the concept can be ranked as a whole rather than form by form.

Other languages may present specific difficulties. Such is the case of Chinese, which does not use spaces between words, and where a given chain of several characters can be interpreted either as a phrase of single-character words or as one multi-character word.
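The two choices discussed above, how to split text into units and whether to merge word families, can be sketched in Python. The regular expression and the tiny `WORD_FAMILIES` table are assumptions made for this illustration; real studies rely on curated word-family lists.

```python
from collections import Counter
import re

# Hypothetical, hand-made family table mapping word forms to a base word.
WORD_FAMILIES = {
    "possible": "possib",
    "impossible": "possib",
    "possibility": "possib",
}


def tokenize(text):
    """Split on spaces and punctuation, keeping word-internal apostrophes.

    Only the straight apostrophe (') is handled; typographic apostrophes
    would need to be normalized first.
    """
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())


def family_counts(text):
    """Count tokens after mapping each one to its base word, if known."""
    return Counter(WORD_FAMILIES.get(token, token) for token in tokenize(text))


print(tokenize("I can't see the château d'eau."))
print(family_counts("It is possible. It was impossible; a possibility!"))
```

With this tokenizer "can't" survives as one unit, but "château d'eau" is still split at the space, which shows why multi-word expressions need separate treatment. Chinese text would need a dedicated segmentation step before any counting of this kind.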


Statistics

Zipf's law appears to hold for frequency lists drawn from longer texts of any natural language. Frequency lists are a useful tool when building an electronic dictionary, which is a prerequisite for a wide range of applications in computational linguistics.

German linguists define the ''Häufigkeitsklasse'' (frequency class) N of an item in the list using the base 2 logarithm of the ratio between its frequency and the frequency of the most frequent item. The most common item belongs to frequency class 0 (zero) and any item that is approximately half as frequent belongs in class 1. For example, the misspelled word ''outragious'', with a ratio of 76/3789654, belongs in class 16.
:N=\left\lfloor 0.5-\log_2\left(\frac{\text{frequency of this item}}{\text{frequency of the most frequent item}}\right)\right\rfloor
where \lfloor\ldots\rfloor is the floor function.

Frequency lists, together with semantic networks, are used to identify the least common, specialized terms to be replaced by their hypernyms in a process of semantic compression.
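The frequency class definition can be checked numerically with a short Python sketch. The function name `frequency_class` and its parameter names are hypothetical; the two counts are those of the ''outragious'' example.

```python
import math


def frequency_class(item_count, top_count):
    """Häufigkeitsklasse: floor(0.5 - log2(item_count / top_count))."""
    if item_count <= 0 or top_count <= 0:
        raise ValueError("counts must be positive")
    return math.floor(0.5 - math.log2(item_count / top_count))


print(frequency_class(3789654, 3789654))  # the most frequent item itself
print(frequency_class(50, 100))           # an item exactly half as frequent
print(frequency_class(76, 3789654))       # the ''outragious'' ratio
```

The 0.5 offset rounds the logarithm to the nearest integer rather than truncating it, so an item only needs to be approximately half as frequent as the top item to fall in class 1.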


Pedagogy

These lists are not intended to be given directly to students, but rather to serve as a guideline for teachers and textbook authors. Paul Nation's summary of modern language teaching encourages teachers first to "move from high frequency vocabulary and special purposes [thematic] vocabulary to low frequency vocabulary, then to teach learners strategies to sustain autonomous vocabulary expansion".


Effects of word frequency

Word frequency is known to have various effects. Memorization is positively affected by higher word frequency, likely because the learner is exposed to the word more often. Lexical access is also positively influenced by high word frequency, a phenomenon called the word frequency effect. The effect of word frequency is related to the effect of age of acquisition, the age at which the word was learned.


Languages

Below is a review of available resources.


English

Word counting is an ancient field, with known discussion dating back to Hellenistic times. In 1944, Edward Thorndike, Irvin Lorge and colleagues hand-counted 18,000,000 running words to provide the first large-scale English language frequency list, before modern computers made such projects far easier. All 20th-century works suffer from their age. In particular, words relating to technology are missing: "blog", which was first attested in 1999 and ranked #7665 in frequency in the Corpus of Contemporary American English in 2014, does not appear in any of the lists below.
;The Teacher's Word Book of 30,000 Words (Thorndike and Lorge, 1944)
''The Teacher's Word Book'' contains 30,000 lemmas or ~13,000 word families (Goulden, Nation and Read, 1990). A corpus of 18 million written words was analysed by hand. The size of its source corpus increased its usefulness, but its age, and changes in the language since, have reduced its applicability.
;The General Service List (West, 1953)
The General Service List contains 2,000 headwords divided into two sets of 1,000 words. A corpus of 5 million written words was analyzed in the 1940s. The rate of occurrence (%) of the different meanings and parts of speech of each headword is provided. Various criteria other than frequency and range were carefully applied to the corpus. Thus, despite its age, some errors, and its corpus being entirely written text, it is still an excellent database of word frequency, frequency of meanings, and reduction of noise. This list was updated in 2013 by Dr. Charles Browne, Dr. Brent Culligan and Joseph Phillips as the New General Service List.
;The American Heritage Word Frequency Book (Carroll, Davies and Richman, 1971)
A corpus of 5 million running words, from written texts used in United States schools (various grades, various subject areas). Its value lies in its focus on school teaching materials, and in its tagging of words by their frequency in each school grade and in each subject area.
;The Brown (Francis and Kucera, 1982), LOB and related corpora
These now contain 1 million words from a written corpus representing different dialects of English. These sources are used to produce frequency lists.


French

;Traditional datasets
A review of the traditional datasets has been made. An attempt was made in the 1950s–60s with the Français fondamental. It includes the F.F.1 list with 1,500 high-frequency words, completed by a later F.F.2 list with 1,700 mid-frequency words, and the most used syntax rules. It is claimed that 70 grammatical words constitute 50% of communicative sentences, while 3,680 words provide about 95–98% coverage. A list of 3,000 frequent words is available. The French Ministry of Education also provides a ranked list of the 1,500 most frequent word families, compiled by the lexicologist Étienne Brunet. Jean Baudot made a study on the model of the American Brown study, entitled "Fréquences d'utilisation des mots en français écrit contemporain". More recently, the Lexique3 project provides 142,000 French words, with orthography, phonetics, syllabification, part of speech, gender, number of occurrences in the source corpus, frequency rank, associated lexemes, etc., available under the open license CC-by-sa-4.0.
;Subtlex
Lexique3 is an ongoing study from which the SUBTLEX movement cited above originated. A completely new count was made based on online film subtitles.
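Coverage claims such as those quoted for the Français fondamental (a small set of words accounting for a large share of running text) can be computed from any frequency list. The Python sketch below uses invented toy counts, not real French data, and the function name `coverage` is hypothetical.

```python
from collections import Counter


def coverage(counts, top_n):
    """Fraction of all running words covered by the top_n most frequent words."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / total


# Invented toy counts for illustration only.
toy_counts = Counter({"de": 40, "la": 30, "et": 20, "maison": 6, "fenêtre": 4})

for n in (1, 2, 5):
    print(n, coverage(toy_counts, n))
```

In the toy data, the two most frequent words already cover 70% of the running words. Because the counts depend entirely on the source corpus, a coverage figure is only meaningful for text resembling that corpus.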


Spanish

There have been several studies of Spanish word frequency.


Chinese

Chinese corpora have long been studied from the perspective of frequency lists. The historical way to learn Chinese vocabulary is based on character frequency. The American sinologist John DeFrancis mentioned its importance for the learning and teaching of Chinese as a foreign language in ''Why Johnny Can't Read Chinese''. As frequency toolkits, Da and the Taiwanese Ministry of Education provided large databases with frequency ranks for characters and words. The HSK list of 8,848 high and medium frequency words in the People's Republic of China, and the Republic of China (Taiwan)'s TOP list of about 8,600 common traditional Chinese words, are two other lists displaying common Chinese words and characters. Following the SUBTLEX movement, a rich study of Chinese word and character frequencies was recently made.


Other

Wiktionary contains frequency lists in more languages. Lists of the most frequently used words in different languages, based on Wikipedia or combined corpora, are also available.


See also

* Corpus linguistics
* Letter frequency
* Most common words in English
* Long tail
* Google Ngram Viewer – shows changes in word/phrase frequency (and relative frequency) over time

