The Oxford English Corpus (OEC) is a text corpus of 21st-century English, used by the makers of the ''Oxford English Dictionary'' and by Oxford University Press's language research programme. It is the largest corpus of its kind, containing nearly 2.1 billion words. It includes language from the UK, the United States, Ireland, Australia, New Zealand, the Caribbean, Canada, India, Singapore, and South Africa. The text is mainly collected from web pages; some printed texts, such as academic journals, have been collected to supplement particular subject areas. The sources are writings of all sorts, from "literary novels and specialist journals to everyday newspapers and magazines and from Hansard to the language of blogs, emails, and social media". This may be contrasted with similar databases that sample only a specific kind of writing.

The corpus is generally available only to researchers at Oxford University Press, but other researchers who can demonstrate a strong need may apply for access.

The digital version of the Oxford English Corpus is formatted in XML and usually analysed with Sketch Engine software. (The Oxford English Corpus, retrieved February 4, 2014.) By April 27, 2006, the dictionary database had 1 billion words.

Each document in the OEC is accompanied by metadata, including:
* title
* author (if known; many websites make this difficult to determine reliably)
* author gender (if known)
* language type (e.g. British English, American English)
* source website
* year (and date, if known)
* date of collection
* domain and subdomain
* document statistics (number of tokens, sentences, etc.)
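
As an illustration of how per-document metadata of this kind might be read from an XML file, the following is a minimal sketch in Python. The OEC's actual schema is not described here, so every element and attribute name in the sample (`document`, `meta`, `language_type`, `s`, and so on), the `DocumentRecord` class, and the `parse_document` function are hypothetical. Token counts use naive whitespace splitting, which is an assumption made for the sketch and not the corpus's own tokenisation.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

# Hypothetical markup: these element and attribute names are invented
# for illustration and are NOT the actual OEC schema.
SAMPLE = """
<document>
  <meta>
    <title>Example post</title>
    <author gender="unknown">A. Writer</author>
    <language_type>British English</language_type>
    <source>https://example.org/blog</source>
    <year>2005</year>
    <collected>2006-01-15</collected>
    <domain subdomain="weblog">blog</domain>
  </meta>
  <text>
    <s>This is a short sentence.</s>
    <s>Here is another one.</s>
  </text>
</document>
"""


@dataclass
class DocumentRecord:
    """One document's metadata plus simple statistics (hypothetical layout)."""
    title: str
    author: Optional[str]          # None when the author is unknown
    author_gender: Optional[str]   # None when not recorded
    language_type: str
    source: str
    year: int
    collected: str
    domain: str
    subdomain: Optional[str]
    n_sentences: int
    n_tokens: int


def parse_document(xml_text: str) -> DocumentRecord:
    """Parse one document in the hypothetical format above."""
    root = ET.fromstring(xml_text.strip())
    meta = root.find("meta")
    author = meta.find("author")   # may be absent
    domain = meta.find("domain")
    sentences = root.findall("./text/s")
    # Naive whitespace tokenisation, used only to keep the sketch short.
    n_tokens = sum(len((s.text or "").split()) for s in sentences)
    return DocumentRecord(
        title=meta.findtext("title", default=""),
        author=author.text if author is not None else None,
        author_gender=author.get("gender") if author is not None else None,
        language_type=meta.findtext("language_type", default=""),
        source=meta.findtext("source", default=""),
        year=int(meta.findtext("year")),
        collected=meta.findtext("collected", default=""),
        domain=domain.text,
        subdomain=domain.get("subdomain"),
        n_sentences=len(sentences),
        n_tokens=n_tokens,
    )


record = parse_document(SAMPLE)
print(record.title, record.language_type, record.n_sentences, record.n_tokens)
```

Keeping the metadata in a typed record like this makes it straightforward to filter documents by language type, year, or domain before any analysis is run.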
