MUSHRA stands for Multiple Stimuli with Hidden Reference and Anchor and is a methodology for conducting a codec listening test to evaluate the perceived quality of the output from lossy audio compression algorithms. It is defined by ITU-R recommendation BS.1534-3. The MUSHRA methodology is recommended for assessing "intermediate audio quality". For very small or subtle audio impairments, Recommendation ITU-R BS.1116-3 (ABC/HR) is recommended instead.

MUSHRA can be used to test audio codecs across a broad spectrum of use cases: music and film consumption; speech for, e.g., podcasts and radio; online streaming (in which trade-offs between quality and efficiency of size and computation are paramount); and modern digital telephony and VoIP applications (which require quasi-real-time, low-bitrate encoding that remains intelligible). Professional, "audiophile", and "prosumer" uses are typically better suited to alternative tests, such as the aforementioned ABC/HR, which assume high-quality, high-resolution audio with minimal detectable differences between the reference material and the codec output.

The main advantage over the mean opinion score (MOS) methodology (which serves a similar purpose) is that MUSHRA requires fewer participants to obtain statistically significant results. This is because all codecs are presented at the same time, to the same participants, so that a paired t-test or a repeated measures analysis of variance can be used for statistical analysis. Furthermore, the 0–100 scale used by MUSHRA makes it possible to express perceptible differences with a high degree of granularity, especially compared to the five-point (1–5) modified Likert scale often used in MOS experiments.

In MUSHRA, the listener is presented with the reference (labeled as such), a certain number of test samples, a hidden version of the reference, and one or more anchors. Anchors are severely impaired encodings that both the experimenters and the participants are supposed to recognise immediately as such; like the reference, they provide a baseline, "anchoring" the low end of the quality scale for participants. The recommendation specifies that a low-range and a mid-range anchor should be included in the test signals. These are typically 3.5 kHz and 7 kHz low-pass filtered versions of the reference, respectively. The purpose of the anchors is to calibrate the scale so that minor artifacts are not unduly penalized. This is particularly important when comparing or pooling results from different labs.
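The low-pass anchors described above can be sketched in a few lines. The following is a minimal illustration, assuming a mono signal held as a list of floats and a generic windowed-sinc FIR filter; the function names (`lowpass_taps`, `fir_filter`, `make_anchor`) and the filter length are hypothetical choices for this sketch and are not the normative filter characteristics given in the recommendation.

```python
import math


def lowpass_taps(cutoff_hz, sample_rate_hz, num_taps=101):
    """Design a windowed-sinc (Hamming) FIR low-pass filter with unity DC gain."""
    fc = cutoff_hz / sample_rate_hz  # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        # Ideal low-pass impulse response (sinc), using its limit at x == 0.
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        hamming = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * hamming)
    total = sum(taps)
    return [t / total for t in taps]


def fir_filter(signal, taps):
    """Direct-form convolution; output has the same length as the input."""
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, tap in enumerate(taps):
            if i - k >= 0:
                acc += tap * signal[i - k]
        out.append(acc)
    return out


def make_anchor(reference, sample_rate_hz, cutoff_hz):
    """Return a low-pass filtered copy of the reference (assumed mono)."""
    return fir_filter(reference, lowpass_taps(cutoff_hz, sample_rate_hz))


def tone(freq_hz, sample_rate_hz, num_samples):
    """Synthetic sine tone, used here only to exercise the filter."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate_hz)
            for i in range(num_samples)]


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


if __name__ == "__main__":
    sr = 48000
    reference = tone(5000, sr, 2000)              # 5 kHz test tone
    low_anchor = make_anchor(reference, sr, 3500)  # low-range anchor
    mid_anchor = make_anchor(reference, sr, 7000)  # mid-range anchor
    # Skip the filter's start-up transient before measuring levels.
    print(rms(low_anchor[200:]), rms(mid_anchor[200:]))
```

In this sketch a 5 kHz tone survives the 7 kHz anchor filter almost unchanged but is strongly attenuated by the 3.5 kHz one, which is why the low-range anchor sounds more severely band-limited than the mid-range anchor.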

Listener behavior

Both MUSHRA and ITU-R BS.1116 tests call for trained expert listeners who know what typical artifacts sound like and where they are likely to occur. Expert listeners have also internalized the rating scale better, which leads to more repeatable results than with untrained listeners. Thus, with trained listeners, fewer listeners are needed to achieve statistically significant results.

It is assumed that preferences are similar for expert and naive listeners, so that the results of expert listeners are also predictive for consumers. In agreement with this assumption, Schinkel-Bielefeld et al. found no differences in rank order between expert and untrained listeners when using test signals containing only timbre artifacts and no spatial artifacts. However, Rumsey et al. showed that for signals containing spatial artifacts, expert listeners weight spatial artifacts slightly more heavily than untrained listeners, who primarily focus on timbre artifacts.

In addition, it has been shown that expert listeners make more use of the option to listen repeatedly to smaller sections of the signals under test, and perform more comparisons between the signals under test and the reference. In contrast to naive listeners, who produce a preference rating, expert listeners therefore produce an audio quality rating: they rate the differences between the signal under test and the uncompressed original, which is the actual goal of a MUSHRA test.
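To illustrate the point that more repeatable ratings reduce the number of listeners needed, the sketch below computes the paired t statistic for two conditions rated by the same listeners. It is a rough illustration only: the function name `paired_t_statistic` is hypothetical, and the scores are made-up values, not data from any study.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for two conditions rated by the same listeners.

    scores_a[i] and scores_b[i] must be the ratings given by listener i.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both conditions need one score per listener")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two listeners are required")
    sd = statistics.stdev(diffs)  # sample standard deviation of differences
    if sd == 0:
        raise ValueError("t is undefined when all differences are equal")
    return statistics.mean(diffs) / (sd / math.sqrt(n))


if __name__ == "__main__":
    # Made-up scores: both panels show the same mean difference of 10 points.
    codec_b = [70, 69, 71, 70, 72, 68]
    consistent_panel = [80, 78, 82, 79, 81, 80]
    noisy_panel = [95, 60, 85, 70, 90, 80]
    print(paired_t_statistic(consistent_panel, codec_b))
    print(paired_t_statistic(noisy_panel, codec_b))
```

With the same mean difference and the same number of listeners, the panel whose per-listener differences vary less yields a much larger t statistic, so it would reach significance with fewer listeners.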

Pre- or post-screening

The MUSHRA guidelines describe two main ways of assessing the reliability of a listener. The easiest and most common is to disqualify, post hoc, all listeners who rate the hidden reference below 90 MUSHRA points for more than 15% of all test items. Ideally, the hidden reference should be rated at 100 points to indicate perceptual equivalence with the original reference audio. While the hidden reference and a high-quality signal can be confused, the specification provides that a rating lower than 90 should only be given when the listener is certain that the rated signal differs from the original reference, so a rating below 90 for the hidden reference is considered a clear and obvious listener error.

The other way to assess a listener's performance is eGauge, a framework based on the analysis of variance (ANOVA). It computes agreement, repeatability, and discriminability, though only the latter two are recommended for pre- or post-screening. Agreement is the ANOVA of a listener's concurrence with the rest of the listeners. Repeatability examines the individual's internal reliability when rating the same test signal again, in comparison to the variance of the other test signals. Discriminability analyses a sort of inter-test reliability by checking that listeners can distinguish between test signals of different conditions.

As eGauge requires listening to every test signal twice, it is less time-efficient in the short term than post-screening listeners on the hidden reference. It does have advantages over the longer term. It removes the small risk of having to repeat an entire experiment in the rare case in which a sample's results lack sufficient statistical power because an excessive failure rate was discovered after the fact. Additionally, the initial inefficiency can be amortised over a series of experiments by removing the need for recruitment phases: a listener who has proven reliable using eGauge can also be considered reliable for future listening tests, provided the nature of the test is not substantially altered (e.g. a reliable listener for stereo tests is not necessarily equally good at perceiving artifacts in 5.1 or 22.2 configurations, or potentially even in mono formats).
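The hidden-reference post-screening rule described above (exclude listeners who rate the hidden reference below 90 points for more than 15% of test items) can be sketched as follows. The function name `screen_listeners` and the input layout (a dictionary mapping a listener ID to that listener's hidden-reference ratings, one per test item) are assumptions made for this illustration.

```python
def screen_listeners(hidden_ref_scores, min_score=90, max_fail_fraction=0.15):
    """Split listeners into (kept, excluded) using the hidden-reference rule.

    hidden_ref_scores maps a listener ID to that listener's hidden-reference
    ratings (0-100), one rating per test item.
    """
    kept, excluded = [], []
    for listener, scores in hidden_ref_scores.items():
        if not scores:
            raise ValueError(f"no hidden-reference ratings for {listener!r}")
        # Count items on which the hidden reference was rated below the limit.
        failures = sum(1 for score in scores if score < min_score)
        if failures / len(scores) > max_fail_fraction:
            excluded.append(listener)
        else:
            kept.append(listener)
    return kept, excluded


if __name__ == "__main__":
    # Made-up ratings for ten test items per listener.
    ratings = {
        "listener_1": [100] * 10,            # never below 90: kept
        "listener_2": [100] * 8 + [85, 70],  # 20% below 90: excluded
        "listener_3": [100] * 9 + [89],      # 10% below 90: kept
    }
    print(screen_listeners(ratings))
```

A rating of exactly 90 is not counted as a failure in this sketch, since the rule concerns ratings below 90.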

Test items

It is important to choose critical test items, that is, items that are difficult to encode and likely to produce artifacts. At the same time, the test items should be ecologically valid: they should be representative of broadcast material and not mere synthetic signals designed to be difficult to encode at the expense of realism. A method for choosing critical material is presented by Ekeroot et al., who propose a ranking-by-elimination procedure. While this is effective at selecting the most critical test items, it does not ensure the inclusion of a variety of test items prone to different artifacts.

Ideally, a MUSHRA test item should maintain similar characteristics for its entire duration (e.g. consistent instrumentation in music, or the same person's voice with similar cadence and tone in spoken audio). It can be difficult for the listener to decide on a unidimensional MUSHRA rating if some parts of an item exhibit different or stronger artifacts than other parts, which becomes more likely when the characteristics of the audio vary widely. Shorter items often lead to less variability, as they exhibit greater stationarity (perceptual consistency and constancy). However, even when trying to choose stationary items, ecologically valid stimuli (i.e. audio that is the same as, or similar to, what is likely to appear in real-world situations such as radio) will very often have sections that are slightly more critical than the rest of the signal. Examples, which depend on the stimulus type, include keywords in speech or major phrases in music. Stationarity is important because listeners who focus on different sections of the signal tend to evaluate it differently. Listeners who are more analytical seem to be better at identifying the most critical regions of a stimulus than those who are less analytical.

Language of test items

ITU-T P.800 tests, based on the mean opinion score methodology, are commonly used to evaluate telephone codecs. This standard specifies that the tested speech items should always be in the native language of the listeners. When MUSHRA is used instead for these purposes, language matching becomes unnecessary. MUSHRA experiments do not aim to test the intelligibility of spoken words, but solely the quality of the audio containing those words and the presence or absence of audible artifacts (e.g. distortion). A MUSHRA study with Mandarin Chinese and German listeners found no significant difference between ratings of foreign and native language test items. Despite the lack of distinction in the end results, listeners did need more time and more comparison opportunities (repetitions) to evaluate the foreign language items accurately. This compensation is impossible in ITU-T P.800 ACR tests, in which items are heard only once and no comparison to the reference audio is possible. In such tests, unlike in MUSHRA tests, foreign language items are rated as being of lower quality, irrespective of actual codec quality, when listeners' proficiency in the target language is low.

External links

webMUSHRA: MUSHRA-compliant experiment software based on the Web Audio API, configurable using YAML

RateIt: A GUI for performing MUSHRA experiments

MUSHRAM: A Matlab interface for MUSHRA listening tests (archived at https://web.archive.org/web/20081019073953/http://www.elec.qmul.ac.uk/digitalmusic/downloads/#mushram)

A Max/MSP interface for MUSHRA listening tests

A Browser Based Audio Evaluation Tool, for running many different tests including MUSHRA - No coding needed

BeaqleJS: HTML5 and JavaScript based framework for listening tests

mushraJS+Server: based on mushraJS, served with mochiweb, an Erlang web server