Pronunciation Assessment

Automatic pronunciation assessment is the use of speech recognition to verify the correctness of pronounced speech, as distinguished from manual assessment by an instructor or proctor. It is also called speech verification, pronunciation evaluation, and pronunciation scoring. Its main application is computer-aided pronunciation teaching (CAPT) when combined with computer-aided instruction for computer-assisted language learning (CALL), speech remediation, or accent reduction. Pronunciation assessment does not determine unknown speech (as in dictation or automatic transcription). Instead, knowing the expected word(s) in advance, it attempts to verify the correctness of the learner's pronunciation and ideally their intelligibility to listeners, sometimes along with often inconsequential prosody such as intonation, pitch, tempo, rhythm, and syllable and word stress. Pronunciation assessment is also used in reading tutoring, for example in products such as Microsoft Teams and those from Amira Learning. Automatic pronunciation assessment can also be used to help diagnose and treat speech disorders such as apraxia.


Intelligibility

The earliest work on pronunciation assessment avoided measuring genuine listener intelligibility, a shortcoming corrected in 2011 at the Toyohashi University of Technology. Intelligibility measurement has since been included in the Versant high-stakes English fluency assessment from Pearson and in mobile apps from 17zuoye Education & Technology, but it was still missing in 2023 from products by Google Search, Microsoft, Educational Testing Service, Speechace, and ELSA. Assessing authentic listener intelligibility is essential for avoiding inaccuracies from accent bias, especially in high-stakes assessments; from words with multiple correct pronunciations; and from phoneme coding errors in machine-readable pronunciation dictionaries. In the Common European Framework of Reference for Languages (CEFR) assessment criteria for "overall phonological control", intelligibility outweighs formally correct pronunciation at all levels. In 2022, researchers found that some newer speech-to-text systems, based on end-to-end reinforcement learning to map audio signals directly into words, produce word and phrase confidence scores closely correlated with genuine listener intelligibility. In 2023, others were able to assess intelligibility using a dynamic time warping-based distance from Wav2Vec2 representations of good speech.
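
The dynamic time warping (DTW) distance mentioned above can be sketched in a few lines. This is a minimal illustration, not the cited researchers' method: it assumes frame-level embeddings (for example from a Wav2Vec2 model) have already been extracted as lists of floats, and the function name `dtw_distance`, the Euclidean frame distance, and the length normalization are all assumptions made for the example.

```python
import math
from typing import Sequence

Frame = Sequence[float]  # one embedding vector per audio frame (assumed precomputed)


def dtw_distance(a: Sequence[Frame], b: Sequence[Frame]) -> float:
    """Length-normalized DTW cost between two sequences of feature frames."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences need at least one frame")
    # acc[i][j] is the cheapest cumulative cost of aligning a[:i] with b[:j].
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            acc[i][j] = step + min(
                acc[i - 1][j],      # a advances, b repeats a frame
                acc[i][j - 1],      # b advances, a repeats a frame
                acc[i - 1][j - 1],  # both advance
            )
    # Dividing by n + m keeps scores comparable across utterance lengths.
    return acc[n][m] / (n + m)


if __name__ == "__main__":
    reference = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
    slower = [frame for frame in reference for _ in range(2)]  # same sounds, half speed
    shifted = [[x + 5.0 for x in frame] for frame in reference]
    print(dtw_distance(reference, slower))
    print(dtw_distance(reference, shifted))
```

Because DTW allows frames to repeat, a learner who produces the same sounds more slowly than the reference is not penalized for tempo, whereas frames that differ from the reference raise the score. A practical system would still need to map such a distance onto an intelligibility scale, which this sketch does not attempt.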


Evaluation

Although there are as yet no industry-standard benchmarks for evaluating pronunciation assessment accuracy, researchers occasionally release evaluation speech corpora for others to use for improving assessment quality. Such evaluation databases often emphasize formally unaccented pronunciation to the exclusion of genuine intelligibility evident from blinded listener transcriptions. Ethical issues in pronunciation assessment are present in both human and automatic methods. Authentic validity, fairness, and bias mitigation in evaluation are all crucial. Diverse speech data should be included in the data used to build automatic pronunciation assessment models. Combining human judgment with automated feedback can improve accuracy and fairness.
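
As an illustration of how blinded listener transcriptions can yield an intelligibility figure, the sketch below scores an utterance by the average fraction of expected words that listeners wrote down. It is a simplified assumption rather than any corpus's published protocol: the function name `listener_intelligibility` is hypothetical, word order is ignored, and spelling variants are not normalized.

```python
import re
from collections import Counter
from typing import Iterable, List


def _words(text: str) -> List[str]:
    """Lowercase a string and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def listener_intelligibility(expected: str, transcripts: Iterable[str]) -> float:
    """Mean fraction of the expected words found in each listener's transcript."""
    target = Counter(_words(expected))
    total = sum(target.values())
    if total == 0:
        raise ValueError("expected text contains no words")
    scores = []
    for transcript in transcripts:
        heard = Counter(_words(transcript))
        # Multiset intersection: a repeated word counts only as often as it was heard.
        matched = sum((target & heard).values())
        scores.append(matched / total)
    if not scores:
        raise ValueError("at least one listener transcript is required")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    prompt = "the cat sat on the mat"
    listeners = ["the cat sat on the mat", "the cat sat on a map"]
    print(round(listener_intelligibility(prompt, listeners), 3))
```

A score computed this way depends only on what listeners understood, so an accented but fully understood utterance scores as highly as an unaccented one.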


Recent developments

Some promising areas for improvement being developed in 2024 include articulatory feature extraction and transfer learning to suppress unnecessary corrections. Other interesting advances under development include "augmented reality" interfaces for mobile devices using optical character recognition to provide pronunciation training on text found in user environments. As of mid-2024, audio multimodal large language models have been used to assess pronunciation.


See also

* Phonetics
* Speech segmentation, often called "forced alignment" (of audio to its expected phonemes) in this context
* Statistical classification



External links

* International Speech Communication Association (ISCA) Special Interest Group on Speech and Language Technologies in Education (SLaTE)