British National Corpus

The British National Corpus (BNC) is a 100-million-word text corpus of samples of written and spoken English from a wide range of sources. The corpus covers British English of the late 20th century from a wide variety of genres, with the intention that it be a representative sample of spoken and written British English of that time. It is used in corpus linguistics for the analysis of corpora.


History

The project to create the BNC involved the collaboration of three publishers (Oxford University Press as the lead collaborator, Longman, and W. & R. Chambers), two universities (the University of Oxford and Lancaster University), and the British Library. The creation of the BNC started in 1991 under the management of the BNC consortium, and the project was finished by 1994. No new samples have been added since 1994, but the BNC underwent slight revisions before the release of the second edition, BNC World (2001), and the third edition, BNC XML Edition (2007) (''What is the BNC?'', retrieved 12 March 2012).
The BNC was the vision of computational linguists whose goal was a corpus of modern (at the time of building the corpus), naturally occurring language in the form of speech and text or writing that could be analyzed by a computer. Hence, it was compiled as a general corpus to pave the way for automatic search and processing in the field of corpus linguistics. One way in which the BNC was to be differentiated from the corpora existing at that time was by opening up the data not just to academic research, but also to commercial and educational uses. The corpus was restricted to British English and was not extended to cover World Englishes. This was partly because a significant portion of the cost of the project was funded by the British government, which was logically interested in supporting documentation of its own linguistic variety. Because of its potentially unprecedented size, the BNC required funds from commercial and academic institutions as well. In turn, BNC data then became available for commercial and academic research.


Description

The BNC is a monolingual corpus, as it records samples of language use in British English only, although words and phrases from other languages may occasionally be present. It is a synchronic corpus, as only language use from the late 20th century is represented; the BNC is not meant to be a historical record of the development of British English over the ages. From the beginning, those involved in gathering the written data sought to make the BNC a balanced corpus, and hence looked for data in various media.


Components and content

90% of the BNC consists of samples of written language use (the ''written corpus''). These samples were extracted from regional and national newspapers, published research journals or periodicals from various academic fields, fiction and non-fiction books, other published material, and unpublished material such as leaflets, brochures, letters, essays written by students of differing academic levels, speeches, scripts, and many other types of texts (''British National Corpus'', retrieved 12 March 2012).
The remaining 10% of the BNC consists of samples of spoken language use. These are presented and recorded in the form of orthographic transcriptions. The ''spoken corpus'' consists of two parts. One part is demographic, containing transcriptions of spontaneous natural conversations produced by volunteers of various age groups and social classes, originating from different regions. These conversations were produced in different situations, ranging from formal business or government meetings to conversations on radio shows and phone-ins. These were to account for both the demographic distribution of spoken language and linguistically significant variation due to context. The other part involves context-governed samples, such as transcriptions of recordings made at specific types of meeting and event. All the original recordings transcribed for inclusion in the BNC have been deposited at the British Library Sound Archive. The majority of the recordings are freely available from the Oxford University Phonetics Laboratory.


Sub-corpora and tagging

Two sub-corpora (subsets of the BNC data) have been released: BNC Baby and BNC Sampler. Both may be ordered online via the BNC webpage. BNC Baby consists of four sets of samples, each containing one million words tagged as they are in the BNC itself. The words in each sample set correspond to a specific genre label. One sample set contains spoken conversation and the other three contain written text: academic writing, fiction, and newspapers respectively. The latest (third) edition has been released and comes in XML format. The BNC Sampler is a two-part sub-corpus, with one part each for written and spoken data; each part contains one million words. The BNC Sampler was originally used in a project to work out how to improve the tagging process for the BNC, which eventually led to the BNC World edition. Throughout the project, the BNC Sampler was improved as expertise and knowledge of tagging increased, arriving at its current form.
The BNC has been tagged for grammatical information (part of speech). The tagging system, named CLAWS, went through successive improvements to yield CLAWS4, the version used for tagging the BNC. CLAWS1 was based on a hidden Markov model and, when employed in automatic tagging, successfully tagged 96% to 97% of each text analyzed. CLAWS1 was upgraded to CLAWS2 by removing the need for manual processing to prepare the texts for automatic tagging. The latest version, CLAWS4, includes improvements such as more powerful word-sense disambiguation (WSD) abilities and the ability to deal with variation in orthography and markup language. Later work on the tagging system looked at increasing the success rates in automatic tagging and reducing the work needed for manual processing, while maintaining effectiveness and efficiency by introducing software to replace some of the manual work. Subsequently, a new program called the "Template Tagger" was introduced for a corrective function. Tags indicating ambiguity were later added. Manual tagging is still necessary, as CLAWS4 is still unable to deal with foreign words.
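To make the idea of hidden-Markov-model tagging concrete, the following is a minimal sketch of Viterbi decoding over a toy model. It is not CLAWS and does not reproduce its tagset or probabilities: the three tags, the transition and emission tables, and the smoothing constant are all invented for illustration.

```python
import math

# Toy model: every tag and probability below is invented for illustration
# and has no connection to the actual CLAWS tagset or statistics.
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {
    "DET": {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.4, "walk": 0.3, "park": 0.3},
    "VERB": {"walk": 0.6, "dog": 0.1, "park": 0.3},
}
SMOOTH = 1e-6  # probability assigned to a word a tag has never emitted


def viterbi(words):
    """Return the most probable tag sequence for a list of lower-case words."""
    if not words:
        return []

    def emit(tag, word):
        return math.log(EMIT[tag].get(word, SMOOTH))

    # Each column maps a tag to (log-probability of the best path ending in
    # that tag, tag chosen at the previous position).
    columns = [{t: (math.log(START[t]) + emit(t, words[0]), None) for t in TAGS}]
    for word in words[1:]:
        previous = columns[-1]
        column = {}
        for tag in TAGS:
            best_prev = max(
                TAGS, key=lambda p: previous[p][0] + math.log(TRANS[p][tag])
            )
            score = previous[best_prev][0] + math.log(TRANS[best_prev][tag])
            column[tag] = (score + emit(tag, word), best_prev)
        columns.append(column)

    # Follow the back-pointers from the best final tag to recover the path.
    tag = max(TAGS, key=lambda t: columns[-1][t][0])
    path = [tag]
    for column in reversed(columns[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


print(viterbi(["the", "dog", "walk"]))
```

In this toy model "walk" can be a noun or a verb; the decoder chooses between them by weighing how likely each tag is to follow the preceding tag, which is the kind of contextual disambiguation a probabilistic tagger relies on.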


TEI and access

The corpus is marked up following the recommendations of the Text Encoding Initiative (TEI) and includes full linguistic annotation and contextual information. A licence for the CLAWS4 part-of-speech tagger may be purchased; alternatively, a tagging service is offered at Lancaster University. The BNC itself may be ordered with either a personal or an institutional licence. The edition available is the BNC XML edition, and it comes with Xaira (XML Aware Indexing and Retrieval Architecture). A corpus manager, BNCweb, has been developed for the BNC XML edition. The interface is designed to be easy to use, and the program offers query features and functions for corpus analysis. Users can retrieve results and data from searches and analyses.
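As an illustration of what working with TEI-style XML annotation can look like, here is a minimal sketch that reads word-level annotation from a single sentence. The element names (`s`, `w`, `c`) and attributes (`c5`, `hw`, `pos`) are assumptions modelled on how the BNC XML edition is commonly described and should be checked against the corpus documentation; the sample sentence is invented.

```python
import xml.etree.ElementTree as ET

# Invented sample sentence; the markup layout is an assumption, not a
# verbatim extract from the corpus.
SAMPLE = (
    '<s n="1">'
    '<w c5="AT0" hw="the" pos="ART">The </w>'
    '<w c5="NN1" hw="corpus" pos="SUBST">corpus </w>'
    '<w c5="VBZ" hw="be" pos="VERB">is </w>'
    '<w c5="AJ0" hw="large" pos="ADJ">large</w>'
    '<c c5="PUN">.</c>'
    "</s>"
)


def read_sentence(xml_text):
    """Return (word form, headword, tag) triples for each <w> element."""
    sentence = ET.fromstring(xml_text)
    tokens = []
    for element in sentence.iter("w"):
        form = (element.text or "").strip()  # drop the trailing space
        tokens.append((form, element.get("hw"), element.get("c5")))
    return tokens


for form, headword, tag in read_sentence(SAMPLE):
    print(form, headword, tag)
```

Punctuation is skipped here because only `w` elements are iterated; a real query tool such as Xaira or BNCweb indexes the whole corpus rather than parsing one sentence at a time.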


Permission issues

The BNC was the first text corpus of its size to be made widely available. This can be attributed to the standard forms of agreement: between rights owners and the Consortium on the one hand, and between corpus users and the Consortium on the other. Intellectual property rights owners were asked to agree to the standard licence, including a willingness to have their materials incorporated in the corpus without any fees. This arrangement may have been facilitated by the originality of the concept and the prominence associated with the project. However, it was a challenge to keep the identity of contributors hidden without diminishing the value of their work. Any distinct allusion to the identity of contributors was largely removed; the alternative of substituting a contributor's identity with a different name was discussed, but not considered feasible. Additionally, contributors had earlier been asked only for permission to include transcribed versions of their speech, not the speech itself. While permission could be sought from initial contributors again, the lack of success in the anonymization process meant that it would be challenging to seek materials from initial contributors. At the same time, two factors compounded the unwillingness of rights owners to donate their materials: full texts were to be excluded, and there was no motivation for them to disseminate information using the corpus, particularly since the corpus operates on a non-commercial basis.


Problems and limitations


Categories

By 2001, the BNC still had no categorisation for written texts beyond that of domain, and no categorisation for spoken texts except by context and by demographic or socio-economic class. For example, a wide variety of imaginative texts (novels, short stories, poems, and drama scripts) were included in the BNC, but such inclusions were deemed useless, as researchers were unable to easily retrieve the subgenres on which they wanted to work (e.g., poetry). Because this metadata was omitted from the file headers and from all BNC documentation, there was no way to know whether an "imaginative" text actually came from a novel, a short story, a drama script or a collection of poems unless the title included words such as "novel" or "poem". The BNC World Edition, introduced in 2002, attempted to deal with this problem. Besides domain, there are now 70 genre categories for both spoken and written data, so researchers can retrieve texts specifically by genre. Even after these additions, however, implementation is still tricky, as assigning a genre or subgenre to a text is not straightforward. The divisions are less clear for spoken data than for written data, as there is more variation in topic and execution. Also, any subgenre can in principle be subdivided further. How far genres are subdivided is pre-determined for the sake of a default, but researchers have the option of making the divisions more general or more specific according to their needs. Categorisation is also a problem because certain texts, while deemed to belong to an interdisciplinary genre such as linguistics, are subsequently categorised into either arts or science categories due to the nature of their content.


Classification and discourse

Some texts were classified under the wrong category, usually because of a misleading title. Users cannot always rely on the titles of the files as indications of their real content: for example, many texts with "lecture" in their title are actually classroom discussions or tutorial seminars involving a very small group of people, or are popular lectures (addressed to a general audience rather than to students at an institution of higher learning). One reason is that genre and subgenre labels can only be assigned for the majority of the texts in a category. There are subgenres within genres, and the content of a single text may not be uniform throughout and may span multiple subgenres. Also, production pressures coupled with insufficient information led to hasty decisions, resulting in inaccurate and inconsistent records.
The proportion of written to spoken material in the BNC is roughly 9:1 (90% to 10%), making spoken material under-represented. This is because the cost of collecting and transcribing one million words of naturally occurring speech is at least 10 times higher than the cost of adding another million words of newspaper text. Some linguists have argued that this represents a deficiency in the corpus, since speech and writing are equally important in a language. The BNC is also not ideal for the study of many features of spoken discourse, since most of its transcripts are orthographic; paralinguistic features are only roughly indicated.


Limitations and misappropriations

Despite being an excellent source of lexical information, the BNC can only really be used to study a limited set of grammatical patterns, particularly those which have distinctive lexical correlates. While it is easy enough to find all the occurrences of "enjoy" and to sort them according to the part-of-speech category of the following word, it requires additional work to find all cases of verbs followed by a gerund, since the SARA index of the BNC does not include part-of-speech categories such as "all verbs" or "all V-ing forms". Some lexical correlates are also too ambiguous to be used in queries: any search for restrictive relative clauses would return irrelevant data, given the number of other uses of wh-pronouns and of "that" in the language (not to mention the impossibility of identifying relative clauses with pronoun deletion, as in "the man I saw"). Particular semantic and pragmatic categories (doubt, cognisance, disagreements, summaries, etc.) are difficult to locate for the same reason. Similarly, while one can compare speech by men and by women, one cannot compare speech ''to'' women and ''to'' men. The nature of the BNC as a large mixed corpus renders it unsuitable for the study of highly specific text-types or genres, as any one of them is likely to be inadequately represented and may not be recognisable from the encoding. For example, there are very few business letters and service encounters in the BNC, and those wishing to explore their specific conventions would do better to compile a small corpus including only texts of those types.
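The "additional work" needed to find verbs followed by a V-ing form, as discussed earlier in this section, can be sketched as a scan over part-of-speech tagged tokens once these have been extracted from the corpus. This is a minimal illustration, not how SARA or any BNC tool works: the function name and sample data are hypothetical, and the tag names (`VVB`, `VVG`, and so on) are assumed to follow the C5 tagset used for the BNC.

```python
# Tags assumed to mark lexical verbs in the C5 tagset; treat this list as an
# assumption to be checked against the tagset documentation.
LEXICAL_VERB_TAGS = {"VVB", "VVD", "VVG", "VVI", "VVN", "VVZ"}
ING_FORM_TAG = "VVG"


def verb_plus_ing(tagged_tokens):
    """Return (verb, ing_form) pairs where a lexical verb directly precedes a V-ing form."""
    pairs = []
    for (word, tag), (next_word, next_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        if tag in LEXICAL_VERB_TAGS and next_tag == ING_FORM_TAG:
            pairs.append((word, next_word))
    return pairs


# Invented example sentence with hand-assigned tags.
TOKENS = [
    ("I", "PNP"), ("enjoy", "VVB"), ("reading", "VVG"), ("and", "CJC"),
    ("she", "PNP"), ("keeps", "VVZ"), ("talking", "VVG"), ("loudly", "AV0"),
]

print(verb_plus_ing(TOKENS))
```

A tag-level match of this kind cannot tell gerunds from present participles, and it misses cases where other words intervene between the verb and the V-ing form, so the results would still need manual filtering.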


Uses


English language education

There are two general ways in which corpus material can be used in language teaching.
Firstly, publishers and researchers can use corpus samples to create language-learning references, syllabuses and other related tools or materials. For example, the BNC was used by a group of Japanese researchers as a tool in creating an English-language learning website for learners of English for specific purposes (ESP). The website enabled learners to download frequently heard and used sentence patterns, and then base their own usage of English on these patterns. The BNC served as the source from which the frequently used expressions were extracted, so users of the website relied on reference samples from the BNC to guide their learning. Creating such materials typically involves the use of very large corpora (comparable in size to the BNC), as well as advanced software and technology. A large amount of money, time, and expertise in computational linguistics is invested in the development of such language-learning material.
Secondly, the analysis of the corpus can be incorporated directly into the language teaching and learning environment. With this method, language learners are given the opportunity to categorize language data from the corpus and subsequently form conclusions about the patterns and features of their target language from their categorizations. This method involves a greater amount of work on the part of the language learner and is referred to as “data-driven learning” by Tim Johns. The corpus data used for data-driven learning is relatively small, and consequently the generalisations made about the target language may be of limited value.
In general, the BNC is useful as a reference source for the purposes of producing and perceiving text. It can be used as a reference source when studying the use of individual words in various contexts, so that learners become familiar with the different ways to use particular words in suitable contexts. Besides language-related information, encyclopedic information is also found in the BNC. Learners perusing data from the BNC are also introduced to British cultural features and stereotypes.


Bilingual dictionaries, tests and evaluation

The BNC was the source of more than 12,000 words and phrases used for the production of a range of bilingual dictionaries in India in 2012, translating 22 local languages into English. This was part of a larger movement to push for improvements in education, the preservation of India's vernacular languages, and the development of translation work. The large size of the BNC provides a large-scale resource on which to test programs. It has been used as a test bed for the Text Encoding Initiative (TEI) guidelines. The BNC has also been used to provide 20 million words to evaluate English subcategorization acquisition systems for the Senseval initiative for computational analysis of meaning.


Research


Collocational Evidence from the British National Corpus

Hoffman & Lehmann (2000) explored the mechanisms behind speakers' ability to manipulate their large inventory of collocations, which are ready for use and can be easily expanded grammatically or syntactically to adapt to the current speech situation. Word combinations occurring with low frequency were extracted from the BNC to offer some insight into this ability.
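As a rough illustration of what extracting low-frequency word combinations can look like, the sketch below counts adjacent word pairs in a tokenized text and keeps those at or below a frequency threshold. It is not the procedure used in the study: the function name `low_frequency_pairs`, the threshold `max_count`, and the sample sentence are all assumptions made for the example.

```python
from collections import Counter


def low_frequency_pairs(tokens, max_count=1):
    """Return adjacent word pairs that occur at most max_count times.

    Hypothetical helper for illustration only; a real corpus study would
    work on tagged, lemmatized text and use a principled threshold.
    """
    # Pair every token with its right-hand neighbour and count the pairs.
    pair_counts = Counter(zip(tokens, tokens[1:]))
    return {pair: n for pair, n in pair_counts.items() if n <= max_count}


# Toy input standing in for corpus text.
sample = "the cat sat on the mat and the cat slept".split()
rare_pairs = low_frequency_pairs(sample, max_count=1)
```

Here the pair `("the", "cat")` occurs twice and is therefore filtered out, while pairs that occur once, such as `("cat", "sat")`, are kept.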


Collocational behaviour of man and woman

Pearce (2008) examined the representation of men and women in this corpus by using Sketch Engine. The corpus query tool was used to explore the grammatical behaviour of the noun lemmas "man" and "woman" (i.e., the nouns "man"/"men" and "woman"/"women").


Non-sentential Utterances: A Corpus Study

Fernandez & Ginzburg (2002) used the BNC to investigate dialogue that included non-sentential utterances.


A corpus-based EAP course for NNS doctoral students

Lee & Swales (2006) designed an experimental course in corpus-informed English for Academic Purposes (EAP) for doctoral students at the English Language Institute (ELI) of the University of Michigan in the US. Participants used three main corpora as the basis of their investigations: Hyland's Research Article Corpus, the Michigan Corpus of Academic Spoken English (MICASE), and academic texts from the BNC.


Future work


Morphological processing

As part of ongoing work on morphological processing, a key area of natural language processing (NLP), data from the BNC was used to test the accuracy, reliability and speed of computational tools developed to analyse and process morphological markers in British English. The tools comprised a program that analysed inflectional morphology in British English (the analyser) and a program that generated morphological markings based on the analyser's output (the generator). Data from the BNC was also used to build an extensive repository of information about British English morphological markers. In particular, approximately 1,100 lemmas were extracted from the BNC and compiled into a checklist, which the morphological generator consulted so that verbs that allow consonant doubling were inflected accurately. Because the BNC was an early, large-scale effort to collect and process such a large amount of data, it became an influential forerunner in the field and a model on which later corpora were based.
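The role of such a checklist can be sketched in a few lines. Spelling alone does not reliably predict consonant doubling ("plan" gives "planned", but "open" gives "opened"), so a generator can look the lemma up in a list before attaching a suffix. The code below is a minimal sketch under stated assumptions: the set `DOUBLING_LEMMAS` holds a handful of illustrative entries rather than the BNC-derived list, the function `inflect` is hypothetical, and other spelling rules (such as dropping a final "e") are deliberately left out.

```python
# Hypothetical stand-in for the checklist: verb lemmas whose final consonant
# doubles before "-ed" or "-ing" in British English spelling.
DOUBLING_LEMMAS = {"stop", "plan", "travel", "refer", "commit"}


def inflect(lemma: str, suffix: str) -> str:
    """Attach "ed" or "ing" to a verb lemma, doubling the final consonant
    only when the lemma appears in the checklist."""
    if suffix not in {"ed", "ing"}:
        raise ValueError(f"unsupported suffix: {suffix!r}")
    if lemma in DOUBLING_LEMMAS:
        # Checklist hit: repeat the last letter before the suffix.
        return lemma + lemma[-1] + suffix
    # Default: plain concatenation, no other spelling rules applied.
    return lemma + suffix


forms = [inflect(verb, "ed") for verb in ("stop", "open", "commit", "visit")]
```

With this lookup, "stop" and "commit" receive a doubled consonant, while "open" and "visit", which end in a similar vowel-consonant pattern, do not.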


BNC2014

In July 2014, Cambridge University Press and the Centre for Corpus Approaches to Social Science (CASS) at Lancaster University announced that a new British National Corpus, the BNC2014, was under compilation. The first stage of the collaborative project between the two institutions was to compile a new spoken corpus of British English from the early to mid 2010s ("Centre for Corpus Approaches to Social Science", retrieved 17 March 2015). The 11.5-million-word Spoken British National Corpus 2014 was released to the public on 25 September 2017. The 100-million-word written component of the BNC2014 has been compiled, and a restricted version was released to the public on 19 November 2021. However, unlike the earlier edition, the corpus texts in the written component of the BNC2014 have not been made freely available. Limited querying functions are currently provided through customized software developed by Lancaster University.


See also

* American National Corpus
* Bank of English
* Brown Corpus
* Corpus of Contemporary American English (COCA)
* International Corpus of English
* Lou Burnard
* Oxford English Corpus
* Spoken English Corpus


External links


* British National Corpus website
* Free BNC interface
* Audio BNC
* Audio BNC index
* BNC with audio records
* BNC word frequencies
* BNCweb (register here)