Below are two estimates of the most common words in Modern Spanish. Each estimate comes from an analysis of a different text corpus. A ''text corpus'' is a large collection of samples of written and/or spoken language that has been carefully prepared for linguistic analysis. To determine which words are the most common, researchers create a database of all the words found in the corpus, and categorise them based on the context in which they are used.

The first table lists the 100 most common word forms from the Corpus de Referencia del Español Actual (CREA), a text corpus compiled by the Real Academia Española (RAE). The RAE is Spain's official institution for documenting, planning, and standardising the Spanish language. A ''word form'' is any of the grammatical variations of a word.

The second table is a list of the 100 most common lemmas found in a text corpus compiled by Mark Davies and other language researchers at Brigham Young University in the United States. A ''lemma'' is the primary form of a word, the one that would appear in a dictionary. The Spanish infinitive ''tener'' ("to have") is a lemma, while ''tiene'' ("has"), which is a conjugation of ''tener'', is a word form.
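The difference between counting word forms and counting lemmas can be illustrated with a minimal Python sketch. The sample sentence and the hand-written `LEMMAS` table are hypothetical and exist only for illustration; neither the RAE nor Davies is known to have used this code, and a real corpus would need a proper tokeniser and lemmatiser.

```python
from collections import Counter

# Hypothetical toy sample; a real corpus holds millions of tokens.
SAMPLE = "ella tiene un libro y yo tengo dos libros pero ellos tienen un libro"

# Hypothetical hand-written lemma table (an assumption for illustration).
# Forms missing from the table are treated as their own lemma.
LEMMAS = {
    "tiene": "tener",
    "tengo": "tener",
    "tienen": "tener",
    "libros": "libro",
}


def word_form_counts(text):
    """Count every distinct word form separately."""
    return Counter(text.lower().split())


def lemma_counts(text, lemmas):
    """Map each word form to its lemma before counting."""
    return Counter(lemmas.get(token, token) for token in text.lower().split())


forms = word_form_counts(SAMPLE)
lems = lemma_counts(SAMPLE, LEMMAS)

print(forms["tiene"], forms["tengo"], forms["tienen"])  # 1 1 1
print(lems["tener"])  # 3
```

In the word-form count each conjugation of ''tener'' is a separate entry with a count of 1, while the lemma count merges them into a single entry with a count of 3.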

Real Academia Española

The list below comes from "1000 formas más frecuentes" ("1000 most frequent forms"), a list published by the Real Academia Española (RAE) from an analysis of more than 160 million word forms found in the Corpus de Referencia del Español Actual (Reference Corpus of Current Spanish), or CREA. CREA is a computerised corpus of texts written in Spanish, and of transcripts of spoken Spanish. It includes books, magazines, and newspapers with a wide variety of content, as well as transcripts of spoken language from radio and television broadcasts and other sources. All the works in the collection are from 1975 to 2004. CREA includes samples from all Spanish-speaking countries. The list of "1000 most frequent word forms" comes from an analysis of CREA version 3.2.

Plurals, verb conjugations, and other inflections are ranked separately. Homonyms, however, are not distinguished from one another. CREA 3.2 was published in June 2008.

Mark Davies

In 2006, Mark Davies, an associate professor of linguistics, published his estimate of the 5000 most common words in Modern Spanish. To make this list, he compiled samples only from 20th-century sources, especially from the years 1970 to 2000. Most of the sources are from the 1990s. Of the 20 million words in the corpus, about one-third (~6,750,000 words) come from transcripts of spoken Spanish: conversations, interviews, lectures, sermons, press conferences, sports broadcasts, and so on. Among the written sources are novels, plays, short stories, letters, essays, newspapers, and the encyclopedia ''Encarta''. The samples, written and spoken, come from Spain and at least 10 Latin American countries. Most of the samples were previously compiled for the Corpus del Español (2001), a 100-million-word corpus that includes works from the 13th century through the 20th.

The 5000 words in Davies' list are lemmas. A ''lemma'' is the form of the word as it would appear in a dictionary. Singular nouns and plurals, for example, are treated as the same word, as are infinitives and verb conjugations. The table below includes the top 100 words from Davies' list of 5000. This list distinguishes between the definite articles ''lo'' and ''la'' and the pronouns ''lo'' and ''la''; all are ranked individually. The adjectives ''ese'' and ''esa'' are ranked together (as are ''este'' and ''esta''), but the pronoun ''eso'' is separate. All conjugations of a verb are ranked together. A highlighted row indicates that the word was found to occur especially frequently in samples of spoken Spanish (Davies 2006, p. 9).
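One way to picture these ranking rules is to count pairs of lemma and part of speech instead of bare spellings. The sketch below is only an illustration under that assumption: the `TAGGED` tokens and their tags are hypothetical, and nothing here is taken from Davies' actual procedure.

```python
from collections import Counter

# Hypothetical pre-tagged tokens: (word form, lemma, part of speech).
# In practice these tags would come from an annotated corpus.
TAGGED = [
    ("la", "la", "article"),
    ("casa", "casa", "noun"),
    ("la", "la", "pronoun"),
    ("veo", "ver", "verb"),
    ("esa", "ese", "adjective"),
    ("ese", "ese", "adjective"),
    ("eso", "eso", "pronoun"),
    ("ve", "ver", "verb"),
]


def entry_counts(tagged):
    """Count (lemma, part of speech) pairs, ignoring the surface form."""
    return Counter((lemma, pos) for _form, lemma, pos in tagged)


counts = entry_counts(TAGGED)

print(counts[("la", "article")], counts[("la", "pronoun")])  # 1 1
print(counts[("ese", "adjective")])  # 2
print(counts[("ver", "verb")])  # 2
```

Keying on the pair keeps the article ''la'' apart from the pronoun ''la'', merges ''ese'' and ''esa'' under one adjective entry, and groups the conjugations ''veo'' and ''ve'' under ''ver''.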

Notes

References

External links

* {{cite web |url=https://crscardellino.github.io/SBWCE/ |title=Spanish Billion Words Corpus and Embeddings |last=Cardellino |first=Cristian |website=crscardellino.github.io |publisher=Cristian Cardellino |date=March 2016}}
