Brown Corpus

The Brown University Standard Corpus of Present-Day American English, better known as simply the Brown Corpus, is an electronic collection of text samples of American English, the first major structured corpus of varied genres. This corpus first set the bar for the scientific study of the frequency and distribution of word categories in everyday language use. Compiled by Henry Kučera and W. Nelson Francis at Brown University, in Rhode Island, it is a general language corpus containing 500 samples of English of about 2,000 words each, compiled from works published in the United States in 1961 and covering a wide range of styles and varieties of prose. It contains 1,014,312 words. Its construction cost the U.S. Office of Education about $23,000 in 1963-64.


History

Its original name was "A Standard Sample of Present-day Edited American English for use with digital computers", as described in a manual in 1964 (Francis, W. N., and H. Kučera, ''Manual of Information to Accompany a Standard Sample of Present-day Edited American English, for Use with Digital Computers'', original ed. 1964, revised 1971, revised and augmented 1979, Providence, R.I.: Department of Linguistics, Brown University).
In 1967, Kučera and Francis published their classic work, ''Computational Analysis of Present-Day American English'', which provided basic statistics on what is known today simply as the ''Brown Corpus''. The Brown Corpus was a carefully compiled selection of current American English, totalling about a million words drawn from a wide variety of sources. Kučera and Francis subjected it to a variety of computational analyses, from which they compiled a rich and variegated opus, combining elements of linguistics, psychology, statistics, and sociology. It has been very widely used in computational linguistics, and was for many years among the most-cited resources in the field.

Shortly after publication of the first lexicostatistical analysis, Boston publisher Houghton-Mifflin approached Kučera to supply a million-word, three-line citation base for its new ''American Heritage Dictionary''. This ground-breaking dictionary, which first appeared in 1969, was the first to be compiled using corpus linguistics for word frequency and other information.

The initial Brown Corpus had only the words themselves, plus a location identifier for each. Over the following several years part-of-speech tags were applied. The Greene and Rubin tagging program (see under part-of-speech tagging) helped considerably in this, but its high error rate meant that extensive manual proofreading was required. The tagged Brown Corpus used a selection of about 80 parts of speech, as well as special indicators for compound forms, contractions, foreign words and a few other phenomena. It formed the model for many later corpora such as the Lancaster-Oslo-Bergen Corpus (British English from 1961) and the Freiburg-Brown Corpus of American English (FROWN) (American English from the early 1990s). Tagging the corpus enabled far more sophisticated statistical analysis, such as the work programmed by Andrew Mackie and documented in books on English grammar.

One interesting result is that even for quite large samples, graphing words in order of decreasing frequency of occurrence shows a hyperbola: the frequency of the ''n''-th most frequent word is roughly proportional to 1/''n''. Thus "the" constitutes nearly 7% of the Brown Corpus, "to" and "of" more than another 3% each, while about half the total vocabulary of about 50,000 words are ''hapax legomena'': words that occur only once in the corpus (Kirsten Malmkjær, ''The Linguistics Encyclopedia'', 2nd ed., Routledge, 2002, p. 87). This simple rank-vs.-frequency relationship was noted for an extraordinary variety of phenomena by George Kingsley Zipf (for example, see his ''The Psychobiology of Language''), and is known as Zipf's law.

Although the Brown Corpus pioneered the field of corpus linguistics, typical corpora today (such as the Corpus of Contemporary American English, the British National Corpus or the International Corpus of English) tend to be much larger, on the order of 100 million words.
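The rank-vs.-frequency relationship and the notion of hapax legomena described above can be illustrated with a short Python sketch. It is only a minimal illustration run on a toy sentence, not the analysis Kučera and Francis performed; the function names `rank_frequency`, `hapax_legomena` and `zipf_expected_share` are hypothetical, and case-folding by `lower()` is an assumption about how tokens are normalised.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count, share) tuples, most frequent word first."""
    counts = Counter(t.lower() for t in tokens)  # assumption: case-insensitive counting
    total = sum(counts.values())
    return [
        (rank, word, count, count / total)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]


def hapax_legomena(tokens):
    """Return the sorted list of words that occur exactly once."""
    counts = Counter(t.lower() for t in tokens)
    return sorted(word for word, count in counts.items() if count == 1)


def zipf_expected_share(rank, top_share):
    """Share predicted for a given rank if frequency is proportional to 1/rank."""
    return top_share / rank


if __name__ == "__main__":
    # Toy input only; a real study would use the corpus tokens instead.
    tokens = "the cat saw the dog and the dog saw a bird".split()
    for rank, word, count, share in rank_frequency(tokens):
        print(rank, word, count, round(share, 3))
    print(hapax_legomena(tokens))
    # With a top-ranked word at about 7%, a pure 1/n law predicts
    # about 3.5% at rank 2, the order of magnitude quoted above.
    print(zipf_expected_share(2, 0.07))
```

On a text as small as the toy sentence the 1/''n'' pattern is not visible; the sketch only shows how the ranking and the hapax count would be obtained.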


Sample distribution

The Corpus consists of 500 samples, distributed across 15 genres in rough proportion to the amount published in 1961 in each of those genres. All works sampled were published in 1961; as far as could be determined they were ''first'' published then, and were written by native speakers of American English. Verse and drama were excluded because they present different problems for linguistic research than standard prose does, but short verse passages quoted in prose samples were kept.

Each sample began at a random sentence boundary in the article or other unit chosen, and continued up to the first sentence boundary after 2,000 words. In a very few cases miscounts led to samples being just under 2,000 words.

The text was mostly sampled from the Brown University Library and the Providence Athenaeum. For the daily press, the list of American newspapers of which the New York Public Library keeps microfilm files was used, with the addition of ''The Providence Journal''. Some periodical materials in the categories ''Skills and Hobbies'' and ''Popular Lore'' were somewhat arbitrarily chosen from "the contents of one of the largest second-hand magazine stores in New York City".

The original data entry was done on upper-case-only keypunch machines; capitals were indicated by a preceding asterisk, and various special items such as formulae also had special codes.

The corpus originally (1961) contained 1,014,312 words sampled from 15 text categories:

* A. PRESS: Reportage (''44 texts'')
** Political
** Sports
** Society
** Spot News
** Financial
** Cultural
* B. PRESS: Editorial (''27 texts'')
** Institutional Daily
** Personal
** Letters to the Editor
* C. PRESS: Reviews (''17 texts'')
** Theatre
** Books
** Music
** Dance
* D. RELIGION (''17 texts'')
** Books
** Periodicals
** Tracts
* E. SKILLS AND HOBBIES (''36 texts'')
** Books
** Periodicals
* F. POPULAR LORE (''48 texts'')
** Books
** Periodicals
* G. BELLES-LETTRES: Biography, Memoirs, etc. (''75 texts'')
** Books
** Periodicals
* H. MISCELLANEOUS: US Government & House Organs (''30 texts'')
** Government Documents
** Foundation Reports
** Industry Reports
** College Catalog
** Industry House Organ
* J. LEARNED (''80 texts'')
** Natural Sciences
** Medicine
** Mathematics
** Social and Behavioral Sciences
** Political Science, Law, Education
** Humanities
** Technology and Engineering
* K. FICTION: General (''29 texts'')
** Novels
** Short Stories
* L. FICTION: Mystery and Detective Fiction (''24 texts'')
** Novels
** Short Stories
* M. FICTION: Science (''6 texts'')
** Novels
** Short Stories
* N. FICTION: Adventure and Western (''29 texts'')
** Novels
** Short Stories
* P. FICTION: Romance and Love Story (''29 texts'')
** Novels
** Short Stories
* R. HUMOR (''9 texts'')
** Novels
** Essays, etc.
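The category counts, the sampling rule and the asterisk convention described in this section can be checked and sketched in a few lines of Python. This is a hedged illustration, not the compilers' actual procedure: `extract_sample` and `decode_keypunch` are hypothetical names, the input is assumed to be already split into sentences and words, and the decoder assumes that an asterisk capitalises only the single letter that follows it (the other special codes are not modelled).

```python
import random
import re

# Number of texts per category, copied from the list above.
CATEGORY_TEXTS = {
    "A": 44, "B": 27, "C": 17, "D": 17, "E": 36,
    "F": 48, "G": 75, "H": 30, "J": 80, "K": 29,
    "L": 24, "M": 6, "N": 29, "P": 29, "R": 9,
}
TOTAL_WORDS = 1_014_312


def extract_sample(sentences, min_words=2000, start=None, rng=None):
    """Take whole sentences from a start boundary until min_words is reached.

    `sentences` is a list of sentences, each a list of words. If `start` is
    None, a random sentence boundary is chosen. If the text runs out first,
    the sample is simply shorter (an assumption of this sketch).
    """
    if start is None:
        rng = rng or random.Random()
        start = rng.randrange(len(sentences))
    sample, word_count = [], 0
    for sentence in sentences[start:]:
        sample.append(sentence)
        word_count += len(sentence)
        if word_count >= min_words:  # stop at the first boundary past the limit
            break
    return sample


def decode_keypunch(text):
    """Lower-case upper-case-only text, capitalising letters marked with '*'."""
    return re.sub(r"\*([a-z])", lambda m: m.group(1).upper(), text.lower())


if __name__ == "__main__":
    print(len(CATEGORY_TEXTS), sum(CATEGORY_TEXTS.values()))  # 15 500
    print(TOTAL_WORDS / sum(CATEGORY_TEXTS.values()))         # 2028.624
    print(decode_keypunch("*THE *BROWN CORPUS"))              # The Brown corpus
```

The per-category counts sum to the 500 samples stated above, and the average sample length of about 2,029 words is consistent with samples running to the first sentence boundary after 2,000 words.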


Part-of-speech tags used


See also

* LOB Corpus, a corpus of British English based on the same parameters as the Brown Corpus
* British National Corpus


References


External links


Brown Corpus Manual