Measurement invariance or measurement equivalence is a statistical property of measurement indicating that the same construct is being measured across some specified groups. For example, measurement invariance can be used to study whether a given measure is interpreted in a conceptually similar manner by respondents representing different genders or cultural backgrounds. Violations of measurement invariance may preclude meaningful interpretation of measurement data. Tests of measurement invariance are increasingly used in fields such as psychology to supplement evaluation of measurement quality rooted in classical test theory. Measurement invariance is often tested in the framework of multiple-group confirmatory factor analysis (CFA). In the context of structural equation models, including CFA, measurement invariance is often termed ''factorial invariance''.


Definition

In the common factor model, measurement invariance may be defined as the following equality:

:f(Y \mid \boldsymbol{\eta}, s) = f(Y \mid \boldsymbol{\eta})

where f(\cdot) is a distribution function, Y is an observed score, \boldsymbol{\eta} is a factor score, and ''s'' denotes group membership (e.g., Caucasian=0, African American=1). Measurement invariance therefore entails that, given a subject's factor score, his or her observed score does not depend on his or her group membership.
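This definition can be illustrated with a small simulation (not from the source; all parameter values below are invented). If two groups share the same loading, intercept, and residual variance, then regressing observed scores on factor scores within each group recovers the same measurement parameters, even when the groups differ in their latent means:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical invariant measurement model: both groups share the same
# loading, intercept, and residual SD, so f(Y | eta, s) = f(Y | eta).
n = 20000
loading, intercept, resid_sd = 1.2, 3.0, 0.5

def simulate_group(latent_mean):
    """Draw factor scores eta and observed scores Y for one group."""
    eta = rng.normal(latent_mean, 1.0, n)
    y = intercept + loading * eta + rng.normal(0.0, resid_sd, n)
    return eta, y

eta0, y0 = simulate_group(0.0)   # group s = 0
eta1, y1 = simulate_group(0.7)   # group s = 1: higher latent mean

# Within-group regressions of Y on eta recover the same (loading,
# intercept) in both groups, even though mean Y differs between groups.
slope0, int0 = np.polyfit(eta0, y0, 1)
slope1, int1 = np.polyfit(eta1, y1, 1)
print(round(slope0, 2), round(int0, 2))  # ~1.2 ~3.0
print(round(slope1, 2), round(int1, 2))  # ~1.2 ~3.0
```

The group difference in observed means reflects only the difference in latent means, not the measurement model, which is what the invariance equality requires.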


Types of invariance

Several different types of measurement invariance can be distinguished in the common factor model for continuous outcomes:
: 1) ''Equal form'': The number of factors and the pattern of factor-indicator relationships are identical across groups.
: 2) ''Equal loadings'': Factor loadings are equal across groups.
: 3) ''Equal intercepts'': When observed scores are regressed on each factor, the intercepts are equal across groups.
: 4) ''Equal residual variances'': The residual variances of the observed scores not accounted for by the factors are equal across groups.
The same typology can be generalized to the discrete-outcomes case:
: 1) ''Equal form'': The number of factors and the pattern of factor-indicator relationships are identical across groups.
: 2) ''Equal loadings'': Factor loadings are equal across groups.
: 3) ''Equal thresholds'': When observed scores are regressed on each factor, the thresholds are equal across groups.
: 4) ''Equal residual variances'': The residual variances of the observed scores not accounted for by the factors are equal across groups.
Each of these conditions corresponds to a multiple-group confirmatory factor model with specific constraints. The tenability of each model can be tested statistically by using a likelihood ratio test or other indices of fit. Meaningful comparisons between groups usually require that all four conditions are met, which is known as ''strict measurement invariance''. However, strict measurement invariance rarely holds in applied contexts. Usually, invariance is tested by sequentially introducing additional constraints, starting from the equal-form condition and proceeding to the equal-residuals condition as long as the fit of the model does not deteriorate along the way.
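The likelihood-ratio step in this sequential procedure can be sketched as follows. The χ2 fit statistics and degrees of freedom below are invented for illustration; in practice they would come from fitting the two nested multiple-group CFA models:

```python
from scipy.stats import chi2

# Hypothetical chi-square fit statistics for two nested multiple-group
# CFA models (values invented for illustration).
chisq_form, df_form = 112.4, 48          # equal form (less restrictive)
chisq_loadings, df_loadings = 121.9, 54  # equal loadings (more restrictive)

# Likelihood-ratio (chi-square difference) test between nested models:
# the difference in chi-square values is referred to a chi-square
# distribution with df equal to the difference in model df.
diff_chisq = chisq_loadings - chisq_form
diff_df = df_loadings - df_form
p_value = chi2.sf(diff_chisq, diff_df)

print(f"diff chi2 = {diff_chisq:.1f} on {diff_df} df, p = {p_value:.3f}")
# A non-significant p-value suggests the equal-loadings constraints do
# not appreciably worsen fit, so testing can proceed to equal intercepts.
```

With these illustrative values the difference is 9.5 on 6 degrees of freedom (p ≈ 0.15), so the equal-loadings constraints would be retained and testing would continue to the next level.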


Tests for invariance

Although further research is necessary on the application of various invariance tests and their respective criteria across diverse testing conditions, two approaches are common among applied researchers.

For each model being compared (e.g., equal form, equal intercepts), a ''χ2'' fit statistic is iteratively estimated by minimizing the difference between the model-implied mean and covariance matrices and the observed mean and covariance matrices. As long as the models under comparison are nested, the difference between the ''χ2'' values of any two CFA models of varying levels of invariance, together with the difference in their degrees of freedom, follows a ''χ2'' distribution (diff ''χ2''). It can therefore be inspected for significance as an indication of whether increasingly restrictive models produce appreciable changes in model-data fit. However, there is some evidence that the diff ''χ2'' test is sensitive to factors unrelated to the invariance-targeted constraints (e.g., sample size).

Consequently, it is recommended that researchers also use the difference between the comparative fit indices (ΔCFI) of the two models specified to investigate measurement invariance. When the difference between the CFIs of two models of varying levels of measurement invariance (e.g., equal forms versus equal loadings) is below −0.01 (that is, the CFI drops by more than 0.01), invariance is likely untenable. As in diff ''χ2'' testing, the CFI values being compared are expected to come from nested models; however, applied researchers rarely seem to take this into consideration when applying the CFI test.
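The ΔCFI heuristic amounts to a simple comparison. The CFI values below are invented for illustration; the −0.01 cutoff is the rule of thumb described above:

```python
# Hypothetical CFI values for two nested invariance models
# (values invented for illustration).
cfi_equal_form = 0.962
cfi_equal_loadings = 0.948

# Delta CFI: change in fit when the loading constraints are added.
delta_cfi = cfi_equal_loadings - cfi_equal_form

# Rule of thumb: a CFI drop of more than 0.01 (delta below -0.01)
# suggests the added invariance constraints are untenable.
loadings_invariance_tenable = delta_cfi >= -0.01

print(round(delta_cfi, 3), loadings_invariance_tenable)  # -0.014 False
```

Here the CFI drops by 0.014, exceeding the 0.01 threshold, so equal loadings would be judged untenable for these illustrative values.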


Levels of equivalence

Equivalence can also be categorized according to three hierarchical levels of measurement equivalence.
# Configural equivalence: The factor structure is the same across groups in a multi-group confirmatory factor analysis.
# Metric equivalence: Factor loadings are similar across groups.
# Scalar equivalence: Values/Means are also equivalent across groups.


Implementation

Tests of measurement invariance are available in the R programming language.


Criticism

The political scientist Christian Welzel and his colleagues have criticized excessive reliance on invariance tests as criteria for the validity of cultural and psychological constructs in cross-cultural statistics. They have demonstrated that the invariance criteria favor constructs with low between-group variance, while constructs with high between-group variance fail these tests. Yet high between-group variance is necessary for a construct to be useful in cross-cultural comparisons. Between-group variance is highest when some group means lie near the extreme ends of closed-ended scales, where intra-group variance is necessarily low. Low intra-group variance yields low correlations and low factor loadings, which scholars routinely interpret as an indication of inconsistency. Welzel and colleagues recommend relying instead on nomological criteria of construct validity, based on whether the construct correlates in expected ways with other measures of between-group differences. They offer several examples of cultural constructs that have high explanatory power and predictive power in cross-cultural comparisons, yet fail the tests for invariance. Proponents of invariance testing counter-argue that reliance on nomological linkage ignores that such external validation itself hinges on the assumption of comparability.{{cite journal |last1=Meuleman |first1=Bart |last2=Żółtak |first2=Tomasz |title=Why Measurement Invariance is Important in Comparative Research. A Response to Welzel et al. (2021) |journal=Sociological Methods & Research |date=2022 |page=00491241221091755 |doi=10.1177/00491241221091755}}


See also

* Differential item functioning


References
