Measurement
The main idea of measuring subjective video quality is similar to the
Source selection
Typically, a system should be tested with a representative number of different contents and content characteristics. For example, one may select excerpts from contents of different genres, such as action movies, news shows, and cartoons. The length of the source video depends on the purpose of the test, but typically, sequences of no less than 10 seconds are used. The amount of motion and spatial detail should also cover a broad range, so that the test contains sequences of differing complexity. Sources should be of pristine quality: there should be no visible coding artifacts or other properties that would lower the quality of the original sequence.
Settings
The design of the hypothetical reference circuits (HRCs) depends on the system under study. Typically, multiple independent variables are introduced at this stage, and each is varied over a number of levels. For example, to test the quality of a
Viewers
Number of viewers
Viewers are also called "observers" or "subjects". A certain minimum number of viewers should be invited to a study, since a larger number of subjects increases the reliability of the experiment outcome, for example by reducing the standard deviation of averaged ratings. Furthermore, there is a risk of having to exclude subjects for unreliable behavior during rating. The minimum number of subjects required for a subjective video quality study is not strictly defined. According to ITU-T, any number between 4 and 40 is possible, where 4 is the absolute minimum for statistical reasons, and inviting more than 40 subjects has no added value. In general, at least 15 observers should participate in the experiment. They should not be directly involved in picture quality evaluation as part of their work and should not be experienced assessors. Other documents claim that at least 10 subjects are needed to obtain meaningful averaged ratings.
However, most recommendations for the number of subjects have been designed for measuring video quality encountered by a home television or PC user, where the range and diversity of distortions tend to be limited (e.g., to encoding artifacts only). Given the wide range and diversity of impairments that may occur in videos captured with mobile devices and/or transmitted over wireless networks, a larger number of human subjects may generally be required. Brunnström and Barkowsky have provided calculations for estimating the minimum number of subjects necessary, based on existing subjective tests. They claim that in order to ensure statistically significant differences when comparing ratings, a larger number of subjects than usually recommended may be needed.
Viewer selection
Viewers should be non-experts in the sense of not being professionals in the field of video coding or related domains. This requirement is introduced to avoid potential subject bias. Typically, viewers are screened for normal vision or corrected-to-normal vision using Snellen charts.
Test environment
Subjective quality tests can be done in any environment. However, because heterogeneous contexts may influence the results, it is typically advised to perform tests in a neutral environment, such as a dedicated laboratory room. Such a room may be sound-proofed, with walls painted in neutral grey, and lit with properly calibrated light sources. Several recommendations specify these conditions. Controlled environments have been shown to result in lower variability in the obtained scores.
Crowdsourcing
Analysis of results
Opinions of viewers are typically averaged into the mean opinion score (MOS). To this end, the labels of categorical scales may be translated into numbers. For example, the responses "bad" to "excellent" can be mapped to the values 1 to 5, and then averaged. MOS values should always be reported with their statistical
Subject screening
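As an illustration of the MOS computation described under "Analysis of results", the following minimal Python sketch maps category labels to the values 1 to 5 and averages them. The function name `mean_opinion_score` and the sample ratings are hypothetical and chosen only for this example.

```python
from statistics import mean

# Mapping of the five category labels to numeric values, as described in the text.
LABEL_TO_SCORE = {"bad": 1, "poor": 2, "fair": 3, "good": 4, "excellent": 5}


def mean_opinion_score(labels):
    """Translate category labels to numbers and return their average."""
    if not labels:
        raise ValueError("at least one rating is required")
    return mean(LABEL_TO_SCORE[label] for label in labels)


# Hypothetical ratings given by five viewers to a single stimulus.
ratings = ["good", "fair", "good", "excellent", "poor"]
mos = mean_opinion_score(ratings)  # (4 + 3 + 4 + 5 + 2) / 5
print(mos)
```

In practice one MOS would be computed per stimulus, over all viewers who remain after any subject screening.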
Often, additional measures are taken before evaluating the results. Subject screening is a process in which viewers whose ratings are considered invalid or unreliable are rejected from further analysis. Invalid ratings are hard to detect, as subjects may have rated without looking at a video, or cheated during the test. The overall reliability of a subject can be determined by various procedures, some of which are outlined in ITU-R and ITU-T recommendations (such as ITU-T Rec. P.910: Subjective video quality assessment methods for multimedia applications).
Advanced models
While rating stimuli, humans are subject to biases. These may lead to different and inaccurate scoring behavior and consequently result in MOS values that are not representative of the “true quality” of a stimulus. In recent years, advanced models have been proposed that aim at formally describing the rating process and subsequently accounting for the noise in subjective ratings. According to Janowski et al., subjects may have an opinion bias that generally shifts their scores, as well as a scoring imprecision that depends on the subject and on the stimulus to be rated. Li et al. have proposed to differentiate between subject inconsistency and content ambiguity.
Standardized testing methods
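The notion of a subject-specific opinion bias that shifts scores, mentioned in the discussion of advanced models, can be illustrated with a deliberately simplified sketch. It assumes a purely additive bias per subject and estimates it as the subject's average deviation from the per-stimulus mean. This is an assumption made for illustration only and is not the model of Janowski et al. or Li et al.; all names and ratings below are hypothetical.

```python
from statistics import mean

# Hypothetical ratings on the 1-5 scale: ratings[subject][stimulus].
ratings = {
    "s1": {"a": 4, "b": 2, "c": 4},
    "s2": {"a": 5, "b": 3, "c": 5},
    "s3": {"a": 3, "b": 1, "c": 3},
}


def stimulus_means(ratings):
    """Return the plain MOS of every stimulus over all subjects."""
    stimuli = next(iter(ratings.values())).keys()
    return {st: mean(r[st] for r in ratings.values()) for st in stimuli}


def subject_biases(ratings):
    """Estimate each subject's bias as the mean deviation from the MOS."""
    mos = stimulus_means(ratings)
    return {
        subj: mean(r[st] - mos[st] for st in r)
        for subj, r in ratings.items()
    }


biases = subject_biases(ratings)

# Subtract the estimated bias from every rating of the corresponding subject.
debiased = {
    subj: {st: score - biases[subj] for st, score in r.items()}
    for subj, r in ratings.items()
}
print(biases)
```

With this sample data, one subject rates consistently higher and one consistently lower than the group, and removing the estimated shift makes their ratings agree. Real rating data would additionally contain the scoring imprecision that this sketch ignores.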
There are many ways to select proper sequences, system settings, and test methodologies. A few of them have been standardized. They are thoroughly described in several ITU-R and ITU-T recommendations, among them ITU-R BT.500 and ITU-T P.910. While there is an overlap in certain aspects, the BT.500 recommendation has its roots in broadcasting, whereas P.910 focuses on multimedia content.
A standardized testing method usually describes the following aspects:
* how long an experiment session lasts
* where the experiment takes place
* how many times and in which order each PVS (processed video sequence) should be viewed
* whether ratings are taken once per stimulus (e.g. after presentation) or continuously
* whether ratings are absolute, i.e. referring to one stimulus only, or relative (comparing two or more stimuli)
* which scale ratings are taken on
Another recommendation, ITU-T P.913 (Methods for the subjective assessment of video quality, audio quality and audiovisual quality of Internet video and distribution quality television in any environment),
Examples
Below, some examples of standardized testing procedures are explained.
Single-stimulus
* ACR (Absolute Category Rating): each sequence is rated individually on the ACR scale. The labels on the scale are "bad", "poor", "fair", "good", and "excellent", and they are translated to the values 1, 2, 3, 4 and 5 when calculating the MOS.
* ACR-HR (Absolute Category Rating with Hidden Reference): a variation of ACR, in which an original unimpaired source sequence is shown in addition to the impaired sequences, without informing the subjects of its presence (hence, "hidden"). The ratings are calculated as differential scores between the reference and the impaired versions. The differential score is defined as the score of the PVS minus the score given to the hidden reference, plus the number of points on the scale. For example, if a PVS is rated as "poor", and its corresponding hidden reference as "good", then the rating is 2 − 4 + 5 = 3. When these ratings are averaged, the result is not a MOS, but a differential MOS ("DMOS").
* SSCQE (Single Stimulus Continuous Quality Evaluation): a longer sequence is rated continuously over time using a slider device (a variation of a fader), on which subjects rate the current quality. Samples are taken at regular intervals, resulting in a quality curve over time rather than a single quality rating.
Double-stimulus or multiple stimulus
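The ACR-HR differential score defined for the single-stimulus methods (PVS score minus hidden-reference score, plus the number of points on the scale) can be sketched in Python as follows. The function name `differential_score` and the sample votes are hypothetical; the five-point scale is the ACR scale described in the text.

```python
from statistics import mean

LABEL_TO_SCORE = {"bad": 1, "poor": 2, "fair": 3, "good": 4, "excellent": 5}
SCALE_POINTS = len(LABEL_TO_SCORE)  # five-point ACR scale


def differential_score(pvs_label, reference_label):
    """Score of the PVS minus score of the hidden reference, plus scale size."""
    return (
        LABEL_TO_SCORE[pvs_label]
        - LABEL_TO_SCORE[reference_label]
        + SCALE_POINTS
    )


# Example from the text: PVS rated "poor", hidden reference rated "good".
example = differential_score("poor", "good")  # 2 - 4 + 5

# Hypothetical (PVS, hidden reference) votes from four viewers for one PVS.
votes = [
    ("poor", "good"),
    ("fair", "excellent"),
    ("fair", "good"),
    ("good", "good"),
]
dmos = mean(differential_score(pvs, ref) for pvs, ref in votes)
print(example, dmos)
```

A PVS rated the same as its hidden reference receives the top value of the scale, so higher DMOS values indicate smaller perceived degradation.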
* DSCQS (Double Stimulus Continuous Quality Scale): the viewer sees an unimpaired reference and the impaired sequence in a random order. They are allowed to re-view the sequences, and then rate the quality for both on a continuous scale labeled with the ACR categories.
* DSIS (Double Stimulus Impairment Scale) and DCR (Degradation Category Rating): both refer to the same method. The viewer sees an unimpaired reference video, then the same video impaired, and after that they are asked to vote on the second video using a so-called impairment scale (from "impairments are imperceptible" to "impairments are very annoying").
* PC (Pair Comparison): instead of comparing an unimpaired and impaired sequence, different impairment types (HRCs) are compared. All possible combinations of HRCs should be evaluated.
Choice of methodology
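For the Pair Comparison method, the requirement that all possible combinations of HRCs be evaluated determines the size of the test. The following sketch enumerates the unordered pairs with `itertools.combinations`; the HRC labels are hypothetical placeholders.

```python
from itertools import combinations

# Hypothetical HRC labels; a real test would use its own conditions.
hrcs = ["HRC1", "HRC2", "HRC3", "HRC4"]

# Every unordered pair of distinct HRCs: n * (n - 1) / 2 comparisons.
pairs = list(combinations(hrcs, 2))

for first, second in pairs:
    print(first, "vs", second)
print(len(pairs), "comparisons per source sequence")
```

The number of comparisons grows quadratically with the number of HRCs, which is one of the time constraints to weigh when choosing this method. If each pair is to be shown in both presentation orders, `itertools.permutations(hrcs, 2)` yields the ordered pairs instead, doubling the count.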
Which method to choose largely depends on the purpose of the test and possible constraints in time and other resources. Some methods may have fewer context effects (i.e. where the order of stimuli influences the results), which are unwanted test biases (Pinson, Margaret and Wolf, Stephen).
Databases
The results of subjective quality tests, including the stimuli used, are called databases. A number of subjective picture and video quality databases based on such studies have been made publicly available by research institutes. These databases, some of which have become de facto standards, are used by television, cinematic, and video engineers around the world to design and test objective quality models, since the developed models can be trained against the obtained subjective data.
External links