The Survey of English Usage was the first research centre in Europe to carry out research with corpora. The Survey is based in the Department of English Language and Literature at University College London.

History

The Survey of English Usage was founded as the Survey of Spoken English at Durham University in 1959 by Randolph Quirk, moving with him to University College London (UCL) in 1960. Many well-known linguists have spent time doing research at the Survey, including Bas Aarts, Valerie Adams, John Algeo, Dwight Bolinger, Noël Burton-Roberts, David Crystal, Derek Davy, Jan Firbas, Sidney Greenbaum, Liliane Haegeman, Robert Ilson, Ruth Kempson, Geoffrey Leech, Jan Rusiecki, Jan Svartvik, and Joe Taglicht. The current director is Bas Aarts.

The original Survey Corpus predated modern computing. It was recorded on reel-to-reel tapes, transcribed on paper, filed in filing cabinets, and indexed on paper cards. The material was originally kept in large fireproof metal cabinets occupying two or three rooms in Foster Court at UCL. Transcriptions carried a detailed prosodic and paralinguistic annotation developed by Crystal and Quirk (1964). Sets of paper cards were manually annotated for grammatical structures and filed, so that, for example, all noun phrases could be found in the noun phrase filing cabinet in the Survey. Corpus searches therefore required a visit to the Survey.

This corpus is now known more widely as the London-Lund Corpus (LLC), as it was the responsibility of co-workers in Lund, Sweden, to computerise the corpus. Thirty-four of the spoken texts were published in book form as Svartvik and Quirk (1980), and the corpus was used as the basis for ''A Comprehensive Grammar of the English Language'' (Quirk ''et al.'' 1985).

Quirk, Randolph, Greenbaum, Sidney, Leech, Geoffrey and Svartvik, Jan (1985). ''A Comprehensive Grammar of the English Language''. London: Longman.

Current research

Constructing corpora

In 1988 Sidney Greenbaum proposed a new project, ICE, the International Corpus of English. ICE was to be an international project, carried out at research centres around the world, to compile corpora of English varieties where English was the first or second official language. ICE texts would contain spoken and written English in a balanced sample of one million words per component so that these samples could be compared in a wide variety of ways. The ICE project continues around the world to the present day.

ICE-GB, the British Component of ICE, was compiled at the Survey. ICE-GB was annotated to a very detailed level, including a full grammatical analysis (parse) for every sentence in the corpus. The first release of ICE-GB took place in 1998. ICE-GB was distributed with ICECUP, software for searching and exploring the parsed corpus. Release 2 of ICE-GB is now available on CD.

As well as contrasting varieties of English, many researchers are interested in language development and change over time. A recent project at the Survey undertook the parsing of a large (400,000-word) selection of the spoken part of the LLC in a manner directly comparable with ICE-GB, forming a new, 800,000-word diachronic corpus called the Diachronic Corpus of Present-Day Spoken English (DCPSE). DCPSE is now available on CD from the Survey. These two corpora comprise the largest collection of parsed and corrected, orthographically transcribed spoken English language data in the world, with over one million words of spoken English in this form.

Exploring corpora

Parsed corpora are large databases containing detailed grammatical tree structures. One of the consequences of forming large collections of valuable linguistic data is a pressing need for methods and tools to help researchers and other users make the most of them. So in parallel with the parsing of natural language data, the Survey team have carried out research and development of software tools to help linguists use these corpora. The ICECUP research platform uses an intuitive grammatical query representation called Fuzzy Tree Fragments (FTFs) to search parsed corpora.
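The idea of querying a parsed corpus with a partial tree can be illustrated with a minimal sketch. This is not ICECUP's implementation, and real FTFs are considerably richer than this; the `Node` and `Fragment` classes, the `matches` and `find_all` functions, and the category labels below are all hypothetical names chosen for illustration. The sketch assumes that a query is a small tree whose nodes may leave the label unspecified, and that its child patterns must be found among a node's children in left-to-right order, with gaps allowed.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """A node in a toy parse tree: a category label, an optional word, children."""
    label: str
    word: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


@dataclass
class Fragment:
    """A toy query pattern: a label (None matches any label) and child patterns."""
    label: Optional[str] = None
    children: List["Fragment"] = field(default_factory=list)


def matches(node: Node, frag: Fragment) -> bool:
    """True if the fragment matches at this node."""
    if frag.label is not None and frag.label != node.label:
        return False
    return _match_children(node.children, frag.children)


def _match_children(kids: List[Node], pats: List[Fragment]) -> bool:
    """Child patterns must match distinct children in order; gaps are allowed."""
    if not pats:
        return True
    for i, kid in enumerate(kids):
        if matches(kid, pats[0]) and _match_children(kids[i + 1:], pats[1:]):
            return True
    return False


def find_all(root: Node, frag: Fragment) -> Iterator[Node]:
    """Yield every node in the tree (pre-order) at which the fragment matches."""
    if matches(root, frag):
        yield root
    for child in root.children:
        yield from find_all(child, frag)


def words(node: Node) -> List[str]:
    """Collect the words under a node, left to right."""
    if node.word is not None:
        return [node.word]
    out: List[str] = []
    for child in node.children:
        out.extend(words(child))
    return out


# A hand-built parse of "the cat saw a dog" (illustrative labels only).
tree = Node("S", children=[
    Node("NP", children=[Node("DET", "the"), Node("N", "cat")]),
    Node("VP", children=[
        Node("V", "saw"),
        Node("NP", children=[Node("DET", "a"), Node("N", "dog")]),
    ]),
])

# Query: any noun phrase containing a determiner followed by a noun.
np_query = Fragment("NP", [Fragment("DET"), Fragment("N")])

for hit in find_all(tree, np_query):
    print(" ".join(words(hit)))
```

Leaving a label unspecified is what makes the pattern "fuzzy" in this sketch: `Fragment(children=[Fragment("V"), Fragment("NP")])` matches any node, whatever its category, that has a verb followed by a noun phrase among its children.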

Linguistic research with corpora

As well as distributing corpora and tools to the corpus linguistics research community, the Survey carries out research into the English language. Recent projects include research on the English Noun Phrase, Subordination in Spoken and Written English, and the English Verb Phrase. The Survey also provides support for PhD students who carry out research into English language corpora.

References
