An ABX test is a method of comparing two choices of sensory stimuli to identify detectable differences between them. A subject is presented with two known samples (sample A, the first reference, and sample B, the second reference) followed by one unknown sample X that is randomly selected from either A or B. The subject is then required to identify X as either A or B. If X cannot be identified reliably with a low p-value in a predetermined number of trials, then the null hypothesis cannot be rejected and it cannot be proven that there is a perceptible difference between A and B.

ABX tests can easily be performed as double-blind trials, eliminating any possible unconscious influence from the researcher or the test supervisor. Because samples A and B are provided just prior to sample X, the difference does not have to be discerned from assumptions based on long-term memory or past experience. Thus, the ABX test answers whether or not, under ideal circumstances, a perceptual difference can be found.

ABX tests are commonly used in evaluations of digital audio data compression methods; sample A is typically an uncompressed sample, and sample B is a compressed version of A. Audible compression artifacts that indicate a shortcoming in the compression algorithm can be identified with subsequent testing. ABX tests can also be used to compare the different degrees of fidelity loss between two different audio formats at a given bitrate. ABX tests can be used to audition input, processing, and output components as well as cabling: virtually any audio product or prototype design.
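The trial procedure described above can be sketched in a few lines of Python. This is only an illustrative outline, not the implementation of any particular ABX tool: the function names `make_abx_trials` and `score_session` are hypothetical, and the playback of the samples is left out entirely.

```python
import random

def make_abx_trials(n_trials, seed=None):
    """Return the hidden identity of X for each trial, chosen independently at random."""
    rng = random.Random(seed)
    return [rng.choice("AB") for _ in range(n_trials)]

def score_session(hidden, responses):
    """Count the responses that match the hidden identity of X."""
    if len(hidden) != len(responses):
        raise ValueError("one response is required per trial")
    return sum(h == r for h, r in zip(hidden, responses))

# Hypothetical session of 16 trials; the seed only makes the example repeatable.
hidden = make_abx_trials(16, seed=1)

# A listener who always answers "A" is correct only when X happened to be A.
always_a = ["A"] * len(hidden)
print(score_session(hidden, always_a), "of", len(hidden), "correct")
```

Keeping the hidden assignments away from both the subject and the operator until scoring is what makes such a procedure double-blind.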

History

ABX testing and its name date back to a 1950 paper by two Bell Labs researchers, W. A. Munson and Mark B. Gardner, titled ''Standardizing Auditory Tests''.

The purpose of the present paper is to describe a test procedure which has shown promise in this direction and to give descriptions of equipment which have been found helpful in minimizing the variability of the test results. The procedure, which we have called the “ABX” test, is a modification of the method of paired comparisons. An observer is presented with a time sequence of three signals for each judgment he is asked to make. During the first time interval he hears signal A, during the second, signal B, and finally signal X. His task is to indicate whether the sound heard during the X interval was more like that during the A interval or more like that during the B interval. For a threshold test, the A interval is quiet, the B interval is signal, and the X interval is either quiet or signal.

The test has since evolved into other variations, such as giving the subject control over the duration and sequence of testing. One such example was the hardware ABX comparator built in 1977 by the ABX company in Troy, Michigan, and documented by one of its founders, David Clark.

Refinements to the A/B test The author's first experience with double-blind audibility testing was as a member of the SMWTMS Audio Club in early 1977. A button was provided which would select at random component A or B. Identifying one of these, the X component was greatly hampered by not having the known A and B available for reference. This was corrected by using three interlocked pushbuttons, A, B, and X. Once an X was selected, it would remain that particular A or B until it was decided to move on to another random selection. However, another problem quickly became obvious. There was always an audible relay transition time delay when switching from A to B. When switching from A to X, however, the time delay would be missing if X was really A and present if X was really B. This extraneous cue was removed by inserting a fixed length dropout time when any change was made. The dropout time was selected to be 50 ms which produces a slight consistent click while allowing subjectively instant comparison.

The ABX company is now defunct, and hardware comparators are no longer available as commercial offerings. Many software tools, such as the foobar2000 ABX plug-in, exist for performing file comparisons, but testing hardware equipment requires building custom implementations.

Hardware tests

ABX test equipment utilizing relays to switch between two different hardware paths can help determine if there are perceptual differences in cables and components. Video, audio and digital transmission paths can be compared. If the switching is microprocessor controlled, double-blind tests are possible. Loudspeaker level and line level audio comparisons could be performed on an ABX test device offered for sale as the ''ABX Comparator'' by QSC Audio Products from 1998 to 2004. Other hardware solutions have been fabricated privately by individuals or organizations for internal testing.

Confidence

If only one ABX trial were performed, random guessing would have a 50% chance of choosing the correct answer, the same as flipping a coin. In order to make a statement having some degree of confidence, many trials must be performed. By increasing the number of trials, the likelihood of statistically asserting a person's ability to distinguish A and B is enhanced for a given confidence level. A 95% confidence level is commonly considered statistically significant.

The company QSC, in the ABX Comparator user manual, recommended a minimum of ten listening trials in each round of tests (QSC ABX Comparator user manual, 1998, p. 10). QSC recommended that no more than 25 trials be performed, as subject fatigue can set in, making the test less sensitive (less likely to reveal one's actual ability to discern the difference between A and B). However, a more sensitive test can be obtained by pooling the results from a number of such tests using separate individuals or tests from the same subject conducted in between rest breaks. For a large number of total trials N, a significant result (one with 95% confidence) can be claimed if the number of correct responses exceeds N/2 + \sqrt{N}. Important decisions are normally based on a higher level of confidence, since an erroneous "significant result" would be claimed in one of 20 such tests simply by chance.
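The arithmetic behind these confidence statements can be sketched with an exact binomial calculation. The following is a minimal illustration, assuming a one-sided test against pure guessing (probability 1/2 per trial); the function names are hypothetical and not taken from any ABX tool. Under guessing, the number of correct answers has mean N/2 and standard deviation \sqrt{N}/2, so a large-N threshold of N/2 + \sqrt{N} lies two standard deviations above the mean.

```python
from math import comb, sqrt

def abx_p_value(correct, trials):
    """One-sided probability of scoring at least `correct` out of `trials` by guessing."""
    tail = sum(comb(trials, k) for k in range(correct, trials + 1))
    return tail / 2 ** trials

def min_correct_for_significance(trials, alpha=0.05):
    """Smallest score whose guessing probability is at most alpha, or None if none exists."""
    for correct in range(trials + 1):
        if abx_p_value(correct, trials) <= alpha:
            return correct
    return None

# Scores needed for the trial counts mentioned in the text (10 to 25 trials).
for n in (10, 16, 25):
    print(n, "trials:", min_correct_for_significance(n), "correct needed")

# Large-N rule of thumb: the score must exceed N/2 + sqrt(N).
n = 100
threshold = n / 2 + sqrt(n)
print("threshold:", threshold, "p-value at 61 correct:", abx_p_value(61, n))
```

Raising the required confidence (a smaller `alpha`) raises the required score, which is the trade-off the paragraph above describes for important decisions.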

Software tests

The foobar2000 and Amarok audio players support software-based ABX testing, the latter using a third-party script. Lacinato ABX is a cross-platform audio testing tool for Linux, Windows, and 64-bit Mac. Lacinato WebABX is a web-based cross-browser audio ABX tool. The open-source aveX was developed mainly for Linux and also provides test monitoring from a remote computer. ABX patcher is an ABX implementation for Max/MSP. More ABX software can be found at the archived PCABX website.

Codec listening tests

A codec listening test is a scientific study designed to compare two or more lossy audio codecs, usually with respect to perceived fidelity or compression efficiency.

Potential flaws

ABX is a type of forced-choice testing. A subject's choices can be made on merit, i.e. the subject honestly tried to identify whether X seemed closer to A or B. But uninterested or tired subjects might choose randomly without even trying. If not caught, this may dilute the results of other subjects who took the test attentively and subject the outcome to Simpson's paradox, resulting in false summary results. Simply looking at the outcome totals of the test (''m'' out of ''n'' answers correct) cannot reveal occurrences of this problem.

This problem becomes more acute if the differences are small. The user may get frustrated and simply aim to finish the test by voting randomly. In this regard, forced-choice tests such as ABX tend to favor negative outcomes when differences are small if proper protocols are not used to guard against this problem. Best practices call for both the inclusion of controls and the screening of subjects:

A major consideration is the inclusion of appropriate control conditions. Typically, control conditions include the presentation of unimpaired audio materials, introduced in ways that are unpredictable to the subjects. It is the differences between judgement of these control stimuli and the potentially impaired ones that allows one to conclude that the grades are actual assessments of the impairments.

3.2.2 Post-screening of subjects Post-screening methods can be roughly separated into at least two classes; one is based on inconsistencies compared with the mean result and another relies on the ability of the subject to make correct identifications. The first class is never justifiable. Whenever a subjective listening test is performed with the test method recommended here, the required information for the second class of post-screening is automatically available. A suggested statistical method for doing this is described in Attachment 1. The methods are primarily used to eliminate subjects who cannot make the appropriate discriminations. The application of a post-screening method may clarify the tendencies in a test result. However, bearing in mind the variability of subjects’ sensitivities to different artefacts, caution should be exercised.
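The dilution effect, and the second class of post-screening, can be illustrated with a small calculation. The panel below is entirely hypothetical (one attentive listener and nine who guess at exactly chance level), and screening on a per-subject p-value is just one simple stand-in for the statistical method the recommendation refers to, not a reproduction of it.

```python
from math import comb

def guess_p_value(correct, trials):
    """One-sided probability of scoring at least `correct` out of `trials` by guessing."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

# Hypothetical panel: subject 0 scores 16/16, the other nine score 8/16 (pure chance).
trials_each = 16
scores = [16] + [8] * 9

# Pooled totals hide the one listener who clearly hears a difference.
pooled_correct = sum(scores)
pooled_trials = trials_each * len(scores)
pooled_p = guess_p_value(pooled_correct, pooled_trials)
print(f"pooled: {pooled_correct}/{pooled_trials}, p = {pooled_p:.3f}")

# Looking at each subject separately reveals who can make the identification.
per_subject_p = [guess_p_value(s, trials_each) for s in scores]
discriminators = [i for i, p in enumerate(per_subject_p) if p <= 0.05]
print("subjects who identified X reliably:", discriminators)
```

In this made-up panel the pooled result is not significant at the 5% level even though one subject never made a mistake, which is why outcome totals alone cannot reveal the problem. Screening many subjects on the same data raises its own multiple-comparison concerns, so caution is warranted, as the quoted text notes.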

Other flaws include lack of subject training and familiarization with the test and content selected:

4.1 Familiarization or training phase Prior to formal grading, subjects must be allowed to become thoroughly familiar with the test facilities, the test environment, the grading process, the grading scales and the methods of their use. Subjects should also become thoroughly familiar with the artefacts under study. For the most sensitive tests they should be exposed to all the material they will be grading later in the formal grading sessions. During familiarization or training, subjects should be preferably together in groups (say, consisting of three subjects), so that they can interact freely and discuss the artefacts they detect with each other.

Other problems might arise from the ABX equipment itself, as outlined by Clark, where the equipment provides a tell, allowing the subject to identify the source. Lack of transparency of the ABX fixture creates similar problems. Since auditory tests and many other sensory tests rely on short-term memory, which only lasts a few seconds, it is critical that the test fixture allows the subject to identify short segments that can be compared quickly. Pops and glitches in switching apparatus likewise must be eliminated, as they may dominate or otherwise interfere with the stimuli being tested in what is stored in the subject's short-term memory.

Alternatives

Algorithmic Audio Compression Evaluation

Since ABX testing requires human beings for evaluation of lossy audio codecs, it is time-consuming and costly. Therefore, cheaper algorithmic approaches have been developed, e.g. PEAQ (Perceptual Evaluation of Audio Quality), a standardized algorithm that produces an objective difference grade (ODG).

MUSHRA

In MUSHRA (Multiple Stimuli with Hidden Reference and Anchor), the subject is presented with the reference (labeled as such), a certain number of test samples, a hidden version of the reference and one or more anchors. A 0–100 rating scale makes it possible to rate very small differences, and the hidden reference still provides discrimination checks.

Discrimination testing

Alternative general methods are used in discrimination testing, such as paired comparison, duo–trio, and triangle testing. Of these, duo–trio and triangle testing are particularly close to ABX testing. Schematically:

Duo–trio: AXY – one known, two unknown (one equals A, other equals B), test is which unknown is the known: X = A (and Y = B), or Y = A (and X = B).

Triangle: XXY – three unknowns (two are A and one is B or one is A and two are B), test which is the odd one out: Y = 1, Y = 2, or Y = 3.

In this context, ABX testing is also known as "duo–trio" in "balanced reference" mode – both knowns are presented as references, rather than one alone.
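One practical difference between these designs is the chance of answering correctly by guessing: one in two for ABX and duo–trio (two possible answers), and one in three for the triangle test (three possible positions for the odd sample). The sketch below, with hypothetical names and an arbitrary choice of 20 trials, shows how that chance level changes the score needed for a one-sided result at the 5% level.

```python
from math import comb

# Probability of a correct answer by pure guessing in each design.
CHANCE = {"ABX": 1 / 2, "duo-trio": 1 / 2, "triangle": 1 / 3}

def tail_probability(correct, trials, chance):
    """Probability of at least `correct` successes in `trials` guesses."""
    return sum(
        comb(trials, k) * chance ** k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )

def min_correct(trials, chance, alpha=0.05):
    """Smallest score whose guessing probability is at most alpha, or None if none exists."""
    for correct in range(trials + 1):
        if tail_probability(correct, trials, chance) <= alpha:
            return correct
    return None

for design, chance in CHANCE.items():
    print(design, "needs", min_correct(20, chance), "of 20 correct")
```

Because guessing succeeds less often in the triangle design, fewer correct answers are needed there to rule out chance for the same number of trials.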
