Genome-wide complex trait analysis (GCTA), or genome-based restricted maximum likelihood (GREML), is a statistical method for heritability estimation in genetics, which quantifies the total additive contribution of a set of genetic variants to a trait. GCTA is typically applied to common single nucleotide polymorphisms (SNPs) on a genotyping array (or "chip"), and the resulting estimate is thus termed "chip" or "SNP" heritability.

GCTA operates by directly quantifying the chance genetic similarity of unrelated individuals and comparing it to their measured similarity on a trait; if two unrelated individuals are relatively similar genetically and also have similar trait measurements, then the measured genetics are likely to causally influence that trait, and the correlation can to some degree tell how much. This can be illustrated by plotting the squared pairwise trait differences between individuals against their estimated degree of relatedness. GCTA makes a number of modeling assumptions, and whether and when these assumptions are satisfied continues to be debated.

The GCTA framework has also been extended in a number of ways: quantifying the contribution from multiple SNP categories (i.e. functional partitioning); quantifying the contribution of gene-environment interactions; quantifying the contribution of non-additive/non-linear effects of SNPs; and bivariate analyses of multiple phenotypes to quantify their genetic covariance (co-heritability or genetic correlation).

GCTA estimates have implications for the potential for discovery from genome-wide association studies (GWAS) as well as the design and accuracy of polygenic scores. GCTA estimates from common variants are typically substantially lower than other estimates of total or narrow-sense heritability (such as from twin or kinship studies), which has contributed to the debate over the missing heritability problem.
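The comparison of pairwise genetic similarity with squared pairwise trait differences can be sketched in a few lines of Python. This is only an illustration on simulated data: the function names (`grm`, `he_regression`, `simulate`) are hypothetical, the regression used is a Haseman-Elston-style least-squares fit rather than the REML procedure GCTA itself uses, and the simulation assumes independent SNPs with equal-variance effects.

```python
import numpy as np

def grm(genotypes):
    """Genetic relationship matrix from an (n, m) array of 0/1/2 allele counts."""
    genotypes = np.asarray(genotypes, dtype=float)
    freq = genotypes.mean(axis=0) / 2.0            # sample allele frequencies
    keep = (freq > 0) & (freq < 1)                 # drop monomorphic SNPs
    f = freq[keep]
    z = (genotypes[:, keep] - 2 * f) / np.sqrt(2 * f * (1 - f))
    return z @ z.T / z.shape[1]

def he_regression(A, y):
    """Regress squared trait differences on relatedness over all distinct pairs.

    For a standardized trait, E[(y_i - y_j)^2] = 2 - 2 * A_ij * h2,
    so the SNP heritability is estimated as -slope / 2.
    """
    y = (y - y.mean()) / y.std()
    i, j = np.triu_indices(len(y), k=1)
    sq_diff = (y[i] - y[j]) ** 2
    slope, _intercept = np.polyfit(A[i, j], sq_diff, 1)
    return -slope / 2.0

def simulate(n=1000, m=500, h2=0.5, seed=0):
    """Toy data: unrelated individuals, independent SNPs, additive effects."""
    rng = np.random.default_rng(seed)
    freq = rng.uniform(0.1, 0.9, size=m)
    G = rng.binomial(2, freq, size=(n, m)).astype(float)
    z = (G - 2 * freq) / np.sqrt(2 * freq * (1 - freq))
    beta = rng.normal(0.0, np.sqrt(h2 / m), size=m)
    y = z @ beta + rng.normal(0.0, np.sqrt(1 - h2), size=n)
    return G, y
```

Calling `he_regression(grm(G), y)` on the output of `simulate()` gives a rough estimate of the simulated heritability; plotting `sq_diff` against `A[i, j]` would show the downward-sloping cloud that the text describes.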

History

Estimation of variance components such as heritability, shared environment, and maternal effects in biology and animal breeding using standard ANOVA/REML methods typically requires individuals of known relatedness, such as parent/child pairs. Such pedigree data is often unavailable or unreliable, which either prevents the methods from being applied or requires strict laboratory control of all breeding (which threatens the external validity of all estimates). Several authors noted that relatedness could instead be measured directly from genetic markers (and that if individuals were reasonably related, economically few markers would have to be obtained for statistical power), leading Kermit Ritland to propose in 1996 that directly measured pairwise relatedness could be compared to pairwise phenotype measurements (Ritland 1996, "A Marker-based Method for Inferences About Quantitative Inheritance in Natural Populations").

As genome sequencing costs dropped steeply over the 2000s, acquiring enough markers on enough subjects for reliable estimates using very distantly related individuals became possible. An early application of the method to humans came with Visscher et al. 2006/2007, which used SNP markers to estimate the actual relatedness of siblings and estimate heritability from the direct genetics. In humans, unlike the original animal/plant applications, relatedness is usually known with high confidence in the 'wild population', and the benefit of GCTA is connected more to avoiding the assumptions of classic behavioral genetics designs and verifying their results, and to partitioning heritability by SNP class and chromosome.

The first use of GCTA proper in humans was published in 2010, finding that 45% of variance in human height can be explained by the included SNPs (Yang et al. 2010, "Common SNPs explain a large proportion of heritability for human height"). Large GWASes on height have since confirmed the estimate (Wood et al. 2014, "Defining the role of common variation in the genomic and biological architecture of adult human height"). The GCTA algorithm was then described and a software implementation published in 2011. It has since been used to study a wide variety of biological, medical, psychiatric, and psychological traits in humans, and has inspired many variant approaches.

Benefits

Robust heritability

Twin and family studies have long been used to estimate variance explained by particular categories of genetic and environmental causes. Across a wide variety of human traits studied, there is typically minimal shared-environment influence, considerable non-shared environment influence, and a large genetic component (mostly additive), which is on average ~50% and sometimes much higher for traits such as height or intelligence. However, twin and family studies have been criticized for their reliance on a number of assumptions that are difficult or impossible to verify, such as the equal environments assumption (that the environments of monozygotic and dizygotic twins are equally similar), that there is no misclassification of zygosity (mistaking identical for fraternal and vice versa), that twins are representative of the general population, and that there is no assortative mating. Violations of these assumptions can result in both upwards and downwards bias of the parameter estimates. (This debate and criticism have particularly focused on the heritability of IQ.)

The use of SNP or whole-genome data from unrelated participants (with participants who are too related, typically >0.025 or roughly fourth-cousin levels of similarity, being removed, and several principal components included in the regression to control for population stratification) bypasses many heritability criticisms: twins are often entirely uninvolved, there are no questions of equal treatment, relatedness is estimated precisely, and the samples are drawn from a broad variety of subjects. In addition to being more robust to violations of the twin study assumptions, SNP data can be easier to collect since it does not require rare twins, and thus heritability can also be estimated for rare traits (with due correction for ascertainment bias).
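The removal of overly related participants can be illustrated with a small sketch. The function below is hypothetical and uses a simple greedy rule (repeatedly drop the individual involved in the most above-threshold pairs); it is an assumption for illustration and is not claimed to match the pruning rule of any particular software package. The default cutoff of 0.025 is the value mentioned in the text.

```python
import numpy as np

def prune_related(A, cutoff=0.025):
    """Return indices of individuals kept so that no remaining pair has A_ij > cutoff.

    A is an (n, n) symmetric relatedness matrix. Greedy heuristic: at each
    step, remove the individual with the most related partners still present.
    """
    A = np.asarray(A, dtype=float)
    related = A > cutoff
    np.fill_diagonal(related, False)       # ignore self-relatedness
    keep = np.ones(A.shape[0], dtype=bool)
    while True:
        # count, for each individual, related partners among those still kept
        counts = (related & keep[:, None] & keep[None, :]).sum(axis=1)
        worst = int(np.argmax(counts))
        if counts[worst] == 0:
            break
        keep[worst] = False
    return np.flatnonzero(keep)
```

The surviving indices can then be used to subset both the relatedness matrix and the phenotype vector before estimation.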

GWAS power

GCTA estimates can be used to resolve the missing heritability problem and to design GWASes which will yield genome-wide statistically-significant hits. This is done by comparing the GCTA estimate with the results of smaller GWASes. If a GWAS of n=10k using SNP data fails to turn up any hits, but the GCTA indicates a high heritability accounted for by SNPs, then that implies that a large number of variants are involved (polygenicity) and thus that much larger GWASes will be required to accurately estimate each SNP's effect and directly account for a fraction of the GCTA heritability.
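The reasoning about polygenicity can be made concrete with a rough power calculation. The sketch below is a deliberately simplified illustration, not a method described in the source: it assumes a standardized quantitative trait, that the SNP heritability is split equally among a hypothetical number of directly genotyped causal SNPs, and that each single-SNP test is a two-sided z-test at the conventional genome-wide threshold of 5e-8. The function name `gwas_power` and its parameters are made up for this example.

```python
from statistics import NormalDist

def gwas_power(n, h2_snp, n_causal, alpha=5e-8):
    """Approximate power to detect one causal SNP at significance level alpha.

    Assumes every causal SNP explains the same share q2 = h2_snp / n_causal
    of trait variance (an illustrative simplification).
    """
    q2 = h2_snp / n_causal                 # variance explained per SNP
    ncp = n * q2 / (1.0 - q2)              # non-centrality of the 1-df test
    std_normal = NormalDist()
    z_thr = -std_normal.inv_cdf(alpha / 2.0)
    shift = ncp ** 0.5
    # two-sided z-test power
    return std_normal.cdf(shift - z_thr) + std_normal.cdf(-shift - z_thr)

def expected_hits(n, h2_snp, n_causal, alpha=5e-8):
    """Expected number of causal SNPs reaching significance under the same assumptions."""
    return n_causal * gwas_power(n, h2_snp, n_causal, alpha)
```

Under these assumptions, a SNP heritability of 0.5 spread over 100 causal SNPs gives high per-SNP power at n=10k, whereas the same heritability spread over 10,000 SNPs gives almost none, which is the pattern of "high GCTA estimate, no GWAS hits" that the text attributes to polygenicity.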

Disadvantages

# Limited inference: GCTA estimates are inherently limited in that they cannot estimate broad-sense heritability like twin/family studies, as they only estimate the heritability due to SNPs. Hence, while they serve as a critical check on the unbiasedness of the twin/family studies, GCTAs cannot replace them for estimating total genetic contributions to a trait.
# Substantial data requirements: the number of SNPs genotyped per person should be in the thousands and ideally the hundreds of thousands for reasonable estimates of genetic similarity (although this is no longer such an issue for current commercial chips, which default to hundreds of thousands or millions of markers); and the number of persons, for somewhat stable estimates of plausible SNP heritability, should be at least n>1000 and ideally n>10000. In contrast, twin studies can offer precise estimates with a fraction of the sample size.
# Computational inefficiency: the original GCTA implementation scales poorly with increasing data size (\mathcal{O}(\text{SNPs} \cdot n^2)), so even if enough data is available for precise GCTA estimates, the computational burden may be infeasible. GCTA can be meta-analyzed as a standard precision-weighted fixed-effect meta-analysis, so research groups sometimes estimate cohorts or subsets and then pool them meta-analytically (at the cost of additional complexity and some loss of precision). This has motivated the creation of faster implementations and variant algorithms which make different assumptions, such as using moment matching.
# Need for raw data: GCTA requires the genetic similarity of all subjects and thus their raw genetic information; due to privacy concerns, individual patient data is rarely shared. GCTA cannot be run on the summary statistics reported publicly by many GWAS projects, and if pooling multiple GCTA estimates, a meta-analysis must be performed. In contrast, there are alternative techniques which operate on summaries reported by GWASes without requiring the raw data; e.g. "LD score regression" contrasts linkage disequilibrium statistics (available from public datasets like 1000 Genomes) with the public summary effect sizes to infer heritability and estimate genetic correlations/overlaps of multiple traits. The Broad Institute runs LD Hub, which provides a public web interface to >=177 traits with LD score regression. Another method using summary data is HESS.
# Confidence intervals may be incorrect, or outside the 0-1 range of heritability, and highly imprecise due to asymptotics.
# Underestimation of SNP heritability: GCTA implicitly assumes all classes of SNPs, rarer or commoner, newer or older, more or less in linkage disequilibrium, have the same effects on average; in humans, rarer and newer variants tend to have larger and more negative effects, as they represent mutation load being purged by negative selection. As with measurement error, this will bias GCTA estimates towards underestimating heritability.
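The precision-weighted fixed-effect pooling mentioned under computational inefficiency can be sketched as follows. This is the standard inverse-variance formula; the function name and the example inputs are hypothetical, and the sketch assumes the cohort estimates are independent and that their standard errors are reliable (which, as noted in the list above, may not hold for GCTA confidence intervals).

```python
from math import sqrt

def fixed_effect_meta(estimates, std_errors):
    """Inverse-variance (precision) weighted fixed-effect meta-analysis.

    estimates  : per-cohort SNP heritability estimates
    std_errors : their standard errors
    Returns (pooled_estimate, pooled_standard_error).
    """
    weights = [1.0 / se ** 2 for se in std_errors]   # precision of each cohort
    total = sum(weights)
    pooled = sum(w * est for w, est in zip(weights, estimates)) / total
    return pooled, sqrt(1.0 / total)
```

For example, `fixed_effect_meta([0.3, 0.5], [0.05, 0.1])` weights the first (more precise) cohort four times as heavily as the second.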

Interpretation

GCTA provides an unbiased estimate of the total variance in phenotype explained by all variants included in the relatedness matrix (and any variation correlated with those SNPs). This estimate can also be interpreted as the maximum prediction accuracy (R^2) that could be achieved from a linear predictor using all SNPs in the relatedness matrix. The latter interpretation is particularly relevant to the development of polygenic risk scores, as it defines their maximum accuracy.

GCTA estimates are sometimes misinterpreted as estimates of total (or narrow-sense, i.e. additive) heritability, but this is not a guarantee of the method. GCTA estimates are likewise sometimes misinterpreted as "lower bounds" on the narrow-sense heritability, but this is also incorrect: first because GCTA estimates can be biased (including biased upwards) if the model assumptions are violated, and second because, by definition (and when model assumptions are met), GCTA can provide an unbiased estimate of the narrow-sense heritability if all causal variants are included in the relatedness matrix. The interpretation of the GCTA estimate in relation to the narrow-sense heritability thus depends on the variants used to construct the relatedness matrix.

Most frequently, GCTA is run with a single relatedness matrix constructed from common SNPs and will not capture (or not fully capture) the contribution of the following factors:
# Any rare or low-frequency variants that are not directly genotyped/imputed.
# Any non-linear, dominance, or epistatic genetic effects. Note that GCTA can be extended to estimate the contribution of these effects through more complex relatedness matrices.
# The effects of gene-environment interactions. Note that GCTA can be extended to estimate the contribution of GxE interactions when the E is known, by including additional variance components.
# Structural variants, which are typically not genotyped or imputed.
# Measurement error: GCTA does not model any uncertainty or error on the measured trait.

GCTA makes several model assumptions and may produce biased estimates under the following conditions:
# The distribution of causal variants is systematically different from the distribution of variants included in the relatedness matrix (even if all causal variants are included in the relatedness matrix), for example if causal variants are systematically at a higher/lower frequency or in higher/lower correlation than all genotyped variants. This can produce either an upwards or downwards bias depending on the relationship between the causal variants and the variants used. Various extensions to GCTA have been proposed (for example, GREML-LDMS) to account for these distributional shifts.
# Population stratification is not fully accounted for by covariates. GCTA (specifically GREML) accounts for stratification through the inclusion of fixed-effect covariates, typically principal components. If these covariates do not fully capture the stratification, the GCTA estimate will be biased, generally upwards. Accounting for recent population structure is particularly challenging for studies of rare variants.
# Residual genetic or environmental relatedness is present in the data. GCTA assumes a homogeneous population with an independent and identically distributed environmental term. This assumption is violated if related individuals and/or individuals with substantially shared environments are included in the data. In this case, the GCTA estimate will additionally capture the contribution of any genetic variation correlated with the genetic relationship: either direct genetic effects or correlated environment.
# "Indirect" genetic effects are present. When genetic variants present in the relatedness matrix are correlated with variants present in other individuals that influence the participant's environment, those effects will also be captured in the GCTA estimate. For example, if variants inherited by a participant from their mother influenced their phenotype through their maternal environment, then the effect of those variants will be included in the GCTA estimate even though it is "indirect" (i.e. mediated by parental genetics). This may be interpreted as an upward bias, as such "indirect" effects are not strictly causal (altering them in the participant would not lead to a change in phenotype in expectation).
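To make the model behind these interpretations concrete, the following is a minimal sketch of a single-component variance model with fixed-effect covariates, fitted by restricted maximum likelihood. It assumes the usual formulation in which the phenotypic covariance is a weighted sum of one relatedness matrix and an identity (i.i.d. environment) term. The function name `greml_h2` is hypothetical, and the fit uses a plain grid search over the heritability rather than the iterative optimisation that production software relies on, so it should be read as an illustration and not as the GCTA implementation.

```python
import numpy as np

def greml_h2(A, y, X=None, grid=None):
    """Grid-search REML estimate of h2 in the model
         y = X b + g + e,  cov(y) = sigma2 * (h2 * A + (1 - h2) * I).

    A : (n, n) relatedness matrix; y : (n,) phenotype;
    X : (n, p) fixed-effect covariates (defaults to an intercept only).
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    if X is None:
        X = np.ones((n, 1))
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)
    p = X.shape[1]
    s, U = np.linalg.eigh(A)               # A = U diag(s) U^T, computed once
    yt, Xt = U.T @ y, U.T @ X              # rotate so the covariance is diagonal
    best_h2, best_ll = None, -np.inf
    for h2 in grid:
        d = h2 * s + (1.0 - h2)            # eigenvalues of h2*A + (1-h2)*I
        w = 1.0 / d
        XtWX = Xt.T @ (Xt * w[:, None])
        beta = np.linalg.solve(XtWX, Xt.T @ (w * yt))   # GLS fixed effects
        resid = yt - Xt @ beta
        sigma2 = np.sum(w * resid ** 2) / (n - p)       # profiled total variance
        # restricted log-likelihood, up to an additive constant
        ll = -0.5 * ((n - p) * np.log(sigma2)
                     + np.sum(np.log(d))
                     + np.linalg.slogdet(XtWX)[1])
        if ll > best_ll:
            best_h2, best_ll = h2, ll
    return best_h2
```

Principal components would enter this sketch as extra columns of `X`, and the extensions mentioned above (additional SNP categories, GxE terms) correspond to adding further covariance matrices alongside `A`, which this single-component version does not attempt.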

Implementations

The original "GCTA" software package is the most widely used; its primary functionality covers the GREML estimation of SNP heritability, but it includes other functionality.

Other implementations and variant algorithms include:
* FAST-LMM
* FAST-LMM-Select: like GCTA in using ridge regression, but including feature selection to try to exclude irrelevant SNPs which only add noise to the relatedness estimates
* LMM-Lasso
* GEMMA
* EMMAX
* REACTA (formerly ACTA): claims order of magnitude runtime reductions
* BOLT-REML, BOLT-LMM (manual): faster & better scaling ("Contrasting genetic architectures of schizophrenia and other complex diseases using fast variance-components analysis", Loh et al 2015; see also "Contrasting regional architectures of schizophrenia and other complex diseases using fast variance components analysis", Loh et al 2015), with potentially better efficiency in the meta-analysis scenario
* MEGHA
* PLINK >1.9 (December 2013) supports "the use of genetic relationship matrices in mixed model association analysis and other calculations"
* LDAK (Speed et al 2016, "Re-evaluation of SNP heritability in complex human traits"): loosens the GCTA assumption that all SNPs, regardless of genotyping quality or frequency, have the same average expected effect, allowing for potentially finding much more SNP heritability
* GREML-IBD (Evans et al 2017, "Narrow-sense heritability estimation of complex traits using identity-by-descent information."): GCTA for identity by descent, attempting to estimate heritability based on shared genome segments in distant otherwise-unrelated relatives, in order to capture the effect of rarer variants which are not measured by SNP panels or otherwise imputed

References

External links

* GCTA-GREML Power Calculator ("Statistical Power to Detect Genetic (Co)Variance of Complex Traits Using SNP Data in Unrelated Samples", Visscher et al. 2014)
* "Genomics, Big Data, Medicine, and Complex Traits" (Peter Visscher talk)
* "The Genetic Architectures of Psychological Traits", Lee 2014 slides
* "Heritability-based models for prediction of complex traits", David Balding, 2015 (archived 2016-10-08 at https://web.archive.org/web/20161008184835/https://sites.google.com/site/baldingstatisticalgenetics/home)