Genomic control (GC) is a statistical method that is used to control for the
confounding
effects of
population stratification in
genetic association
studies. The method was originally outlined by
Bernie Devlin and
Kathryn Roeder
in a 1999 paper. It involves using a set of anonymous
genetic markers to estimate the effect of population structure on the distribution of the
chi-square statistic
. The distribution of the chi-square statistics for a given
allele
that is suspected to be associated with a given
trait can then be compared to the distribution of the same statistics for an allele that is expected not to be related to the trait. The method relies on markers that are not
linked to the marker being tested for a possible association. In theory, it takes advantage of the tendency of population structure to cause
overdispersion
of
test statistics in association analyses. The genomic control method is as
robust as family-based designs, despite being applied to population-based data. It has the potential to lead to a decrease in
statistical power
to detect a true association, and it may also fail to eliminate the biasing effects of population stratification. A more robust form of the genomic control method can be performed by expressing the association being studied as two
Cochran–Armitage trend tests, and then applying the method to each test separately.
The assumption of population homogeneity in association studies, especially case-control studies, can easily be violated and can lead to both
type I and type II errors
. It is therefore important for the models used in the study to account for population structure. The problem in case-control studies is that, if there is a genetic contribution to the disease, individuals in the case population are more likely to be related to one another than individuals in the control population, so the assumption of independent observations is violated. Often this leads to an overestimation of the significance of an association, although the effect depends on how the sample was chosen. If, coincidentally, an allele is more frequent in a subpopulation of the cases, it will show an association with any trait that is more prevalent in the case population. This kind of spurious association increases as the sample grows, so the problem is of special concern in large-scale association studies in which loci have only relatively small effects on the trait. A method that can compensate for these problems in some cases was developed by Devlin and Roeder (1999).
It uses both a
frequentist and a
Bayesian approach (the latter being appropriate when dealing with a large number of
candidate genes).
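The spurious association described above can be illustrated with a small deterministic sketch. All numbers are invented for the illustration, and the helper name `pooled_allele_frequency` is hypothetical: two subpopulations carry an allele at different frequencies, the allele is unrelated to disease within each subpopulation, but cases and controls are drawn from the subpopulations in different proportions.

```python
# Hypothetical illustration: two subpopulations with different allele
# frequencies and different case/control sampling proportions.
# All numbers are invented for the example.

def pooled_allele_frequency(groups):
    """Allele frequency after pooling (n_individuals, allele_freq) groups."""
    total_alleles = sum(2 * n for n, _ in groups)
    carried = sum(2 * n * freq for n, freq in groups)
    return carried / total_alleles

# Subpopulation A: allele frequency 0.2 in both cases and controls.
# Subpopulation B: allele frequency 0.6 in both cases and controls.
# Within each subpopulation the allele is unrelated to disease status.
cases = [(800, 0.2), (200, 0.6)]      # cases drawn mostly from A
controls = [(200, 0.2), (800, 0.6)]   # controls drawn mostly from B

case_freq = pooled_allele_frequency(cases)
control_freq = pooled_allele_frequency(controls)
print(f"pooled frequency in cases:    {case_freq:.2f}")
print(f"pooled frequency in controls: {control_freq:.2f}")
```

In this constructed sample the pooled allele frequency is 0.28 among cases and 0.52 among controls, so a naive test would report an association even though none exists within either subpopulation.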
The frequentist way of correcting for population structure uses markers that are not linked with the trait in question to correct for any inflation of the test statistic caused by population structure. The method was first developed for binary traits but has since been generalized to quantitative ones. For binary traits, where the aim is to find genetic differences between the case and control populations, Devlin and Roeder (1999) use
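Armitage's trend test,

$$Y^2 = \frac{N\left(N(r_1 + 2r_2) - R(n_1 + 2n_2)\right)^2}{R(N-R)\left(N(n_1 + 4n_2) - (n_1 + 2n_2)^2\right)}$$

and the $\chi^2$ test for allelic frequencies,

$$X_A^2 = \frac{2N\left(N(r_1 + 2r_2) - R(n_1 + 2n_2)\right)^2}{R(N-R)\,(n_1 + 2n_2)\left(2N - (n_1 + 2n_2)\right)}$$

where $r_0, r_1, r_2$ are the numbers of cases carrying 0, 1 and 2 copies of the allele, $n_0, n_1, n_2$ are the corresponding counts in cases and controls combined, $R = r_0 + r_1 + r_2$ is the number of cases and $N = n_0 + n_1 + n_2$ is the total sample size.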
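As a sketch, the two statistics can be computed from genotype counts as follows. The notation is an assumption made here: `cases = (r0, r1, r2)` holds the numbers of cases with 0, 1 and 2 copies of the allele, and `totals = (n0, n1, n2)` holds the same counts for cases and controls combined. The formulas are the standard trend test with genotype weights (0, 1, 2) and the Pearson chi-square test on the 2×2 table of allele counts; the function names are hypothetical.

```python
def trend_statistic(cases, totals):
    """Armitage trend statistic with genotype weights (0, 1, 2).

    cases  = (r0, r1, r2): genotype counts among cases
    totals = (n0, n1, n2): genotype counts among cases and controls combined
    """
    r0, r1, r2 = cases
    n0, n1, n2 = totals
    R, N = r0 + r1 + r2, n0 + n1 + n2
    numerator = N * (N * (r1 + 2 * r2) - R * (n1 + 2 * n2)) ** 2
    denominator = R * (N - R) * (N * (n1 + 4 * n2) - (n1 + 2 * n2) ** 2)
    return numerator / denominator


def allelic_statistic(cases, totals):
    """Chi-square statistic for the 2x2 table of allele counts."""
    r0, r1, r2 = cases
    n0, n1, n2 = totals
    R, N = r0 + r1 + r2, n0 + n1 + n2
    a = n1 + 2 * n2  # copies of the allele in the whole sample
    numerator = 2 * N * (N * (r1 + 2 * r2) - R * a) ** 2
    denominator = R * (N - R) * a * (2 * N - a)
    return numerator / denominator


# Invented counts, with the combined sample in exact Hardy-Weinberg
# proportions for allele frequency 0.5: (n0, n1, n2) = (100, 200, 100).
cases = (40, 100, 60)
totals = (100, 200, 100)
print(trend_statistic(cases, totals), allelic_statistic(cases, totals))
```

Because the combined sample in this example is in exact Hardy–Weinberg proportions, the two functions return the same value; with counts that depart from those proportions the two statistics differ.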
If the population is in
Hardy–Weinberg equilibrium the two statistics are approximately equal. Under the
null hypothesis
of no population stratification, the trend test statistic $Y^2$ asymptotically follows a $\chi^2$ distribution with one degree of freedom. The idea is that, under stratification, the statistic is inflated by a factor $\lambda$, so that $Y^2 \sim \lambda \chi^2_1$, where $\lambda$ depends on the effect of stratification. The method rests on the assumption that the inflation factor $\lambda$ is constant across loci, which means that the loci should have roughly equal mutation rates, should not be under different selection in the two populations, and the amount of Hardy–Weinberg disequilibrium measured by Wright's
coefficient of inbreeding
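''F'' should not differ between loci. The last of these is of greatest concern. If the effect of the stratification is similar across loci, $\lambda$ can be estimated from the unlinked markers as

$$\hat{\lambda} = \frac{\operatorname{median}(Y_1^2, Y_2^2, \ldots, Y_L^2)}{0.456}$$

where ''L'' is the number of unlinked markers and $Y_i^2$ is the trend statistic at marker ''i''. The denominator, approximately the median of a $\chi^2_1$ variable, is derived from the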
gamma distribution
as a robust estimator of $\lambda$. Other estimators have been suggested; for example, Reich and Goldstein proposed using the mean of the statistics instead. The median-based estimate is therefore not the only way to estimate $\lambda$, but according to Bacanu et al. it is an appropriate estimate even if some of the unlinked markers are actually in disequilibrium with a disease-causing locus or are themselves associated with the disease. Under the null hypothesis, and when correcting for stratification using ''L'' unlinked markers, the rescaled trend statistic $Y^2/\hat{\lambda}$ (with $\hat{\lambda}$ the estimated inflation factor) is approximately $\chi^2_1$ distributed. With this correction the overall type I error rate should be approximately equal to the nominal level $\alpha$ even when the population is stratified. Devlin and Roeder (1999) mostly considered the situation where $\alpha = 0.05$, corresponding to a 95% confidence level, and not smaller p-values. Marchini et al. (2004) demonstrated by simulation that genomic control can yield anti-conservative p-values when $\alpha$ is very small and the two populations (case and control) are extremely distinct. This was especially a problem when the number of unlinked markers was on the order of 50 to 100, and it can result in false positives at that significance level.
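The frequentist correction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: the function names are hypothetical, the input statistics are invented, the constant 0.456 is taken as the approximate median of a $\chi^2_1$ variable, and the p-value uses the identity that the upper tail of a $\chi^2_1$ variable at $x$ equals $\operatorname{erfc}(\sqrt{x/2})$.

```python
import math
from statistics import median

CHI2_1_MEDIAN = 0.456  # approximate median of a chi-square(1) variable


def inflation_factor(null_statistics):
    """Median-based estimate of lambda from unlinked-marker statistics."""
    return median(null_statistics) / CHI2_1_MEDIAN


def chi2_1_pvalue(x):
    """Upper-tail probability of a chi-square variable with 1 d.f."""
    return math.erfc(math.sqrt(x / 2.0))


def genomic_control(candidate_statistic, null_statistics):
    """Return (lambda_hat, corrected statistic, corrected p-value)."""
    lam = inflation_factor(null_statistics)
    corrected = candidate_statistic / lam
    return lam, corrected, chi2_1_pvalue(corrected)


# Invented trend statistics at L = 7 unlinked markers and one candidate.
null_stats = [0.1, 0.3, 0.5, 0.912, 1.6, 2.4, 5.0]
lam, corrected, p = genomic_control(9.12, null_stats)
print(f"lambda_hat = {lam:.2f}, corrected Y^2 = {corrected:.2f}, p = {p:.4f}")
```

Only seven null markers are used here to keep the example readable; the text above notes that even 50 to 100 markers can be too few when very small significance levels are required.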