A speech corpus (or spoken corpus) is a database of speech audio files and text transcriptions.
In speech technology, speech corpora are used, among other things, to create acoustic models (which can then be used with a speech recognition or speaker identification engine).
In linguistics, spoken corpora are used for research in phonetics, conversation analysis, dialectology and other fields.
A corpus is one such database; corpora is the plural and refers to several such databases.
There are two types of speech corpora:
# Read Speech – which includes:
#* Book excerpts
#* Broadcast news
#* Lists of words
#* Sequences of numbers
# Spontaneous Speech – which includes:
#* Dialogs – between two or more people (includes meetings; one such corpus is the KEC);
#* Narratives – a person telling a story (one such corpus is the Buckeye Corpus);
#* Map-tasks – one person explains a route on a map to another;
#* Appointment-tasks – two people try to find a common meeting time based on individual schedules.
A special kind of speech corpus is the non-native speech database, which contains speech with a foreign accent.
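The description above (audio files paired with text transcriptions, grouped into read and spontaneous speech, with non-native speech as a special case) can be illustrated with a minimal sketch. All names here (`Utterance`, `SpeechType`, the file paths and the sample transcriptions) are hypothetical and chosen for illustration; real corpora use their own formats and metadata schemes.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import List


class SpeechType(Enum):
    """The two broad corpus types described in the article."""
    READ = "read"
    SPONTANEOUS = "spontaneous"


@dataclass
class Utterance:
    """One corpus entry: a recording paired with its transcription."""
    audio_path: str              # hypothetical path to an audio file
    transcription: str           # text transcription of the recording
    speech_type: SpeechType
    native_speaker: bool = True  # False for foreign-accented speech


def non_native_subset(corpus: List[Utterance]) -> List[Utterance]:
    """Return only the entries spoken with a foreign accent."""
    return [u for u in corpus if not u.native_speaker]


def count_by_type(corpus: List[Utterance]) -> Counter:
    """Count how many entries fall under each speech type."""
    return Counter(u.speech_type for u in corpus)


# A tiny illustrative corpus with made-up entries.
corpus = [
    Utterance("audio/0001.wav", "one two three four", SpeechType.READ),
    Utterance("audio/0002.wav", "turn left at the old mill", SpeechType.SPONTANEOUS),
    Utterance("audio/0003.wav", "good evening, here is the news",
              SpeechType.READ, native_speaker=False),
]

print(count_by_type(corpus))
print(len(non_native_subset(corpus)))
```

Keeping the transcription next to the audio path in a single record reflects the pairing that training an acoustic model relies on; a real corpus would typically add speaker identifiers, time alignments and phonetic annotation.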
See also
* Arabic Speech Corpus
* Common Voice
* EXMARaLDA
* Lingua Libre, an online libre tool
* List of children's speech corpora
* Non-native speech database
* Praat
* Spoken English Corpus
* The BABEL Speech Corpus
* TIMIT
* Transcriber
* Transcription (linguistics)
References
* Edwards, Jane / Lampert, Martin (eds.) (1992): Talking Data – Transcription and Coding in Discourse Research. Hillsdale: Erlbaum.
* Leech, Geoffrey / Myers, Greg / Thomas, Jenny (eds.) (1995): Spoken English on Computer: Transcription, Markup and Application. Harlow: Longman.
External links
* Santa Barbara Corpus of Spoken American English
* Buckeye Corpus: The Buckeye Corpus of Conversational Speech
* The KEC: The Karl Eberhards Corpus of spontaneously spoken southern German in dialogues, with audio and articulatory recordings
* The Spoken Turkish Corpus at METU Ankara (http://std.metu.edu.tr/en/)
* Spoken Corpus Klient with the Corp-Oral Corpus at ILTEC Lisbon
* VoxForge – open source speech corpora
* OLAC: Open Language Archives Community
* Simmortel Speech Recognition Corpus for Indian English and Hindi
* ELRA: the European Language Resources Association
* The PELCRA Conversational Corpus of Polish
* The Arabic Speech Corpus
* Corpus of Political Speeches: Free access to political speeches by American and Chinese politicians, developed by Hong Kong Baptist University Library