A speech corpus (or spoken corpus) is a database of speech audio files and text transcriptions. In speech technology, speech corpora are used, among other things, to create acoustic models (which can then be used with a speech recognition or speaker identification engine). In linguistics, spoken corpora are used for research in phonetics, conversation analysis, dialectology and other fields.

A corpus is one such database; corpora is the plural of corpus (that is, several such databases).

There are two types of speech corpora:
# Read speech, which includes:
#* Book excerpts
#* Broadcast news
#* Lists of words
#* Sequences of numbers
# Spontaneous speech, which includes:
#* Dialogs – between two or more people (includes meetings; one such corpus is the KEC)
#* Narratives – a person telling a story (one such corpus is the Buckeye Corpus)
#* Map-tasks – one person explains a route on a map to another
#* Appointment-tasks – two people try to find a common meeting time based on individual schedules

A special kind of speech corpus is the non-native speech database, which contains speech with a foreign accent.
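The pairing of audio files with transcriptions, and the split into read and spontaneous speech, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Utterance` record, its field names, the file paths and the sample transcripts are hypothetical and do not reflect the format of any particular corpus.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """One corpus entry: a recording paired with its transcription (hypothetical layout)."""
    audio_path: str   # hypothetical path to the audio file
    transcript: str   # text transcription of the recording
    speech_type: str  # "read" or "spontaneous"
    genre: str        # e.g. "sequence of numbers", "map-task"


def group_by_type(utterances):
    """Group corpus entries by speech type, preserving their original order."""
    groups = defaultdict(list)
    for utterance in utterances:
        groups[utterance.speech_type].append(utterance)
    return dict(groups)


# Invented sample entries, for illustration only.
corpus = [
    Utterance("audio/0001.wav", "one two three four", "read", "sequence of numbers"),
    Utterance("audio/0002.wav", "turn left at the old mill", "spontaneous", "map-task"),
    Utterance("audio/0003.wav", "so then we went home", "spontaneous", "narrative"),
]

groups = group_by_type(corpus)
for speech_type, entries in groups.items():
    print(speech_type, [entry.genre for entry in entries])
```

Real corpora usually carry much richer annotation (speaker metadata, time alignment, phonetic labels); the sketch only shows the basic audio-plus-transcript structure described above.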

References

* Edwards, Jane / Lampert, Martin (eds.) (1992): Talking Data – Transcription and Coding in Discourse Research. Hillsdale: Erlbaum.
* Leech, Geoffrey / Myers, Greg / Thomas, Jenny (eds.) (1995): Spoken English on Computer: Transcription, Markup and Application. Harlow: Longman.

External links

Santa Barbara Corpus of Spoken American English

Buckeye Corpus
The Buckeye Corpus of Conversational Speech
The KEC: the Karl Eberhards Corpus of spontaneously spoken southern German in dialogues (audio and articulatory recordings)

The Spoken Turkish Corpus at METU Ankara (http://std.metu.edu.tr/en/)
Spoken Corpus Klient with the Corp-Oral Corpus at ILTEC Lisbon

VoxForge – open source speech corpora

OLAC: Open Language Archives Community

Simmortel Speech Recognition Corpus for Indian English and Hindi

ELRA: the European Language Resources Association

The PELCRA Conversational Corpus of Polish

The Arabic Speech Corpus

Corpus of Political Speeches: free access to political speeches by American and Chinese politicians, developed by Hong Kong Baptist University Library
Large Multimodal Corpus of Human Speech