In multivariate quantitative genetics, a genetic correlation (denoted $r_g$ or $r_a$) is the proportion of variance that two traits share due to genetic causes: the correlation between the genetic influences on one trait and the genetic influences on a different trait [Plomin et al., p. 123], which estimates the degree of pleiotropy or causal overlap. A genetic correlation of 0 implies that the genetic effects on one trait are independent of those on the other, while a correlation of 1 implies that all of the genetic influences on the two traits are identical. The bivariate genetic correlation can be generalized to inferring genetic latent variable factors across more than two traits using factor analysis. Genetic correlation models were introduced into behavioral genetics in the 1970s–1980s.
Genetic correlations have applications in the validation of genome-wide association study (GWAS) results, breeding, prediction of traits, and discovering the etiology of traits and diseases. They can be estimated using individual-level data from twin studies and molecular genetics, or even with GWAS summary statistics.
Genetic correlations have been found to be common in non-human genetics and to be broadly similar to their respective phenotypic correlations. They have also been found extensively across human traits (the 'phenome'). This finding of widespread pleiotropy has implications for artificial selection in agriculture, interpretation of phenotypic correlations, social inequality, attempts to use Mendelian randomization in causal inference, the understanding of the biological origins of complex traits, and the design of GWASes.
A genetic correlation is to be contrasted with the environmental correlation between the environments affecting two traits (e.g. if poor nutrition in a household caused both lower IQ and lower height). A genetic correlation between two traits can contribute to the observed (phenotypic) correlation between them, but a genetic correlation can also have the opposite sign to the observed phenotypic correlation if the environmental correlation is sufficiently strong in the other direction, perhaps due to tradeoffs or specialization.
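The interplay of the two components can be sketched numerically. The decomposition below is the usual additive one, assumed here rather than stated in the text: all non-genetic variance is lumped into a single "environmental" part, and the function name and input values are hypothetical.

```python
import math

def phenotypic_correlation(h2_x, h2_y, r_g, r_e):
    """Expected phenotypic correlation under a simple additive model.

    Assumes each trait's variance splits into a heritable share (h2)
    and an environmental share (1 - h2), with no gene-environment
    correlation or interaction.
    """
    genetic_part = math.sqrt(h2_x * h2_y) * r_g
    environmental_part = math.sqrt((1 - h2_x) * (1 - h2_y)) * r_e
    return genetic_part + environmental_part

# Hypothetical inputs: a positive genetic correlation paired with a
# stronger negative environmental correlation.
r_p = phenotypic_correlation(h2_x=0.3, h2_y=0.3, r_g=0.5, r_e=-0.6)
print(round(r_p, 2))
```

With these made-up inputs the genetic contribution is positive while the overall phenotypic correlation comes out negative, which is the sign reversal described above.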
The observation that genetic correlations usually mirror phenotypic correlations is known as "Cheverud's Conjecture". It has been confirmed in animals and humans, where the two kinds of correlation have also been shown to be of similar size; for example, in the UK Biobank, of 118 continuous human traits, only 29% of their intercorrelations have opposite signs, and a later analysis of 17 high-quality UKBB traits reported a correlation near unity.
Interpretation
Genetic correlations are not the same as heritability: a genetic correlation concerns the overlap between two sets of genetic influences, not their absolute magnitude. Two traits could both be highly heritable but not genetically correlated, or have small heritabilities and be completely correlated (as long as the heritabilities are non-zero).
For example, consider two traits: dark skin and black hair. These two traits may individually have a very high heritability (most of the population-level variation in the trait is due to genetic differences or, in simpler terms, genetics contributes significantly to these two traits); however, they may still have a very low genetic correlation if, for instance, the two traits were controlled by different, non-overlapping, non-linked genetic loci.
A genetic correlation between two traits will tend to produce phenotypic correlations; e.g. the genetic correlation between intelligence and SES, or between education and family SES, implies that intelligence and SES will also correlate phenotypically. The phenotypic correlation will be limited by the degree of genetic correlation and also by the heritability of each trait. The phenotypic correlation expected from shared genetics alone is the ''bivariate heritability'' and can be calculated as the square roots of the heritabilities multiplied by the genetic correlation. (Using a Plomin example, for two traits with heritabilities of 0.60 and 0.23, $r_g = 0.75$, and a phenotypic correlation of ''r'' = 0.45, the bivariate heritability would be $\sqrt{0.60} \times 0.75 \times \sqrt{0.23} = 0.28$, so 0.28/0.45 = 62% of the observed phenotypic correlation is due to correlated genetic effects, which is to say nothing of trait mutability in and of itself.)
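The arithmetic of the Plomin example can be reproduced with a few lines of Python. This is only a sketch: the function name is hypothetical, and the genetic correlation of 0.75 is an assumption, chosen because it reproduces the 0.28 figure quoted in the text.

```python
import math

def bivariate_heritability(h2_x, h2_y, r_g):
    """Square roots of the two heritabilities times the genetic correlation."""
    return math.sqrt(h2_x) * r_g * math.sqrt(h2_y)

# Heritabilities and phenotypic correlation from the example in the text;
# r_g = 0.75 is assumed.
biv_h2 = bivariate_heritability(h2_x=0.60, h2_y=0.23, r_g=0.75)
genetic_share = biv_h2 / 0.45  # share of the phenotypic correlation

print(round(biv_h2, 2), round(genetic_share, 2))
```

The second value is the fraction of the observed phenotypic correlation attributable to correlated genetic effects under these inputs.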
Cause
Genetic correlations can arise due to:
# linkage disequilibrium (two neighboring genes tend to be inherited together, each affecting a different trait)
# biological pleiotropy (a single gene having multiple otherwise unrelated biological effects, or shared regulation of multiple genes)
# mediated pleiotropy (a gene affects trait ''X'' and trait ''X'' affects trait ''Y'')
# biases: population stratification such as ancestry or assortative mating (sometimes called "gametic phase disequilibrium"), spurious stratification such as ascertainment bias/self-selection or Berkson's paradox, or misclassification of diagnoses
Uses
Causes of changes in traits
Genetic correlations are useful because they can be analyzed over time within an individual (e.g. intelligence is stable over a lifetime due to the same genetic influences: intelligence measured in childhood is genetically correlated with intelligence in old age), or across diagnoses, allowing researchers to test whether two traits share the same genetic basis, whether different genes influence a trait in different populations, and to what degree traits can meaningfully cluster due to sharing a biological basis and genetic architecture.
Boosting GWASes
Genetic correlations can be used in GWASes by using polygenic scores or genome-wide hits for one (often more easily measured) trait to increase the prior probability of variants for a second trait. For example, since intelligence and years of education are highly genetically correlated, a GWAS for education will inherently also be a GWAS for intelligence and will be able to predict variance in intelligence as well, and the strongest SNP candidates can be used to increase the statistical power of a smaller GWAS. Alternatively, a combined analysis can be done on the latent trait, where each measured genetically correlated trait helps reduce measurement error and boosts the GWAS's power considerably (e.g. Krapohl et al. 2017 used elastic net and multiple polygenic scores, improving intelligence prediction from 3.6% of variance to 4.8%; Hill et al. 2017b used MTAG to combine 3 ''g''-loaded traits of education, household income, and a cognitive test score to find 107 hits and double the predictive power for intelligence), or one could do a GWAS for multiple traits jointly.
Genetic correlations can also quantify the contribution of correlations <1 across datasets, which might create a false "missing heritability", by estimating the extent to which differing measurement methods, ancestral influences, or environments create only partially overlapping sets of relevant genetic variants.
Breeding
Genetic correlations are also useful in applied contexts such as plant/animal breeding. They allow the substitution of more easily measured but highly genetically correlated characteristics (particularly in the case of sex-linked or binary traits under the liability-threshold model, where differences in the phenotype can rarely be observed but another highly correlated measure, perhaps an endophenotype, is available in all individuals); compensation for environments different from the one in which the breeding was carried out; more accurate predictions of breeding value using the multivariate breeder's equation, as compared to predictions based on the univariate breeder's equation using only per-trait heritability and assuming independence of traits; and avoidance of unexpected consequences, by taking into consideration that artificial selection for/against trait ''X'' will also increase/decrease all traits which positively/negatively correlate with ''X''. The limits to selection set by the inter-correlation of traits, and the possibility for genetic correlations to change over long-term breeding programs, lead to Haldane's dilemma limiting the intensity of selection and thus progress.
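The correlated response to selection can be illustrated with a minimal sketch of the multivariate breeder's equation in its common form, response = G P⁻¹ S. That form is assumed here, since the text does not spell it out. G is the additive genetic covariance matrix, P the phenotypic covariance matrix, and S the vector of selection differentials; all numbers and function names are hypothetical.

```python
def invert_2x2(m):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

def mat_vec(m, v):
    """Multiply a matrix (nested lists) by a vector (list)."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

def predicted_response(G, P, S):
    """Predicted per-generation change in trait means: G P^-1 S."""
    beta = mat_vec(invert_2x2(P), S)  # selection gradient
    return mat_vec(G, beta)

# Hypothetical two-trait example: selection acts directly on trait X only,
# so the differential observed on trait Y (0.4) arises purely from the
# phenotypic covariance between the traits.
G = [[0.5, 0.3], [0.3, 0.4]]
P = [[1.0, 0.4], [0.4, 1.0]]
S = [1.0, 0.4]

response = predicted_response(G, P, S)
print([round(r, 3) for r in response])
```

In this toy setup trait Y is predicted to shift even though it is not under direct selection, because of its genetic covariance with X; a per-trait calculation that ignores the off-diagonal terms would miss this.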
Breeding experiments on genetically correlated traits can measure the extent to which correlated traits are inherently developmentally linked, so that response is constrained, and which can be dissociated. One example studied in this way is the size of eyespots on the butterfly ''Bicyclus anynana''.
Mathematical definition
Given a genetic covariance matrix, the genetic correlation is computed by standardizing it, i.e., by converting the covariance matrix to a correlation matrix. Generally, if $\Sigma$ is a genetic covariance matrix and $D = \sqrt{\operatorname{diag}(\Sigma)}$, then the correlation matrix is $D^{-1} \Sigma D^{-1}$. For a given genetic covariance $\operatorname{cov}_g$ between two traits, one with genetic variance $V_{g1}$ and the other with genetic variance $V_{g2}$, the genetic correlation is computed in the same way as the correlation coefficient: $r_g = \frac{\operatorname{cov}_g}{\sqrt{V_{g1} V_{g2}}}$.
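Written out for two traits, the standardization can be sketched as follows. The symbols are assumed notation: $\Sigma$ for the genetic covariance matrix, $D$ for the diagonal matrix of genetic standard deviations, $V_{g1}$ and $V_{g2}$ for the genetic variances, and $\operatorname{cov}_g$ for the genetic covariance.

```latex
\Sigma =
\begin{pmatrix}
V_{g1} & \operatorname{cov}_g \\
\operatorname{cov}_g & V_{g2}
\end{pmatrix},
\qquad
D =
\begin{pmatrix}
\sqrt{V_{g1}} & 0 \\
0 & \sqrt{V_{g2}}
\end{pmatrix},
\qquad
D^{-1} \Sigma D^{-1} =
\begin{pmatrix}
1 & r_g \\
r_g & 1
\end{pmatrix},
\quad
r_g = \frac{\operatorname{cov}_g}{\sqrt{V_{g1} V_{g2}}}
```

Each off-diagonal entry of $\Sigma$ is divided by the two corresponding genetic standard deviations, so the diagonal becomes 1 and the off-diagonal entries become genetic correlations.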
Computing the genetic correlation
Genetic correlations require a genetically informative sample. They can be estimated in breeding experiments on two traits of known heritability by selecting on one trait and measuring the change in the other trait (which allows the genetic correlation to be inferred); in family/adoption/twin studies (analyzed using SEMs or DeFries–Fulker extremes analysis); by molecular estimation of relatedness such as GCTA; by methods employing polygenic scores like HDL (High-Definition Likelihood), LD score regression, BOLT-REML, CPBayes, or HESS; by comparison of genome-wide SNP hits in GWASes (as a loose lower bound); and from phenotypic correlations of populations with at least some related individuals.
As with estimating SNP heritability, when estimating genetic correlations the better computational scaling and the ability to estimate using only established summary association statistics are a particular advantage for HDL and LD score regression over competing methods. Combined with the increasing availability of GWAS summary statistics or polygenic scores from datasets like the UK Biobank, such summary-level methods have led to an explosion of genetic correlation research since 2015.
The methods are related to Haseman–Elston regression and PCGC regression. Such methods are typically genome-wide, but it is also possible to estimate genetic correlations for specific variants or genome regions.
One way to consider it is using trait X in twin 1 to predict trait Y in twin 2 for monozygotic and dizygotic twins (i.e. using twin 1's IQ to predict twin 2's brain volume); if this cross-correlation is larger for the more genetically similar monozygotic twins than for the dizygotic twins, the difference indicates that the traits are not genetically independent and that there is some common genetics influencing both IQ and brain volume. (Statistical power can be boosted by using siblings as well.)
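The cross-twin cross-trait logic can be turned into a rough estimate using Falconer-style arithmetic. This is a sketch under stated assumptions, not the method of any particular study: it assumes the classical twin design (monozygotic twins share all additive genetic effects, dizygotic twins share half, shared environment is equal across zygosity), and all correlations and function names are hypothetical.

```python
import math

def falconer_h2(r_mz, r_dz):
    """Falconer estimate of heritability: twice the MZ-DZ difference."""
    return 2 * (r_mz - r_dz)

def twin_genetic_correlation(r_mz_x, r_dz_x, r_mz_y, r_dz_y,
                             r_mz_xy, r_dz_xy):
    """Genetic correlation from within-trait and cross-trait twin correlations.

    r_mz_xy / r_dz_xy are cross-twin cross-trait correlations
    (trait X in twin 1 with trait Y in twin 2).
    """
    h2_x = falconer_h2(r_mz_x, r_dz_x)
    h2_y = falconer_h2(r_mz_y, r_dz_y)
    if h2_x <= 0 or h2_y <= 0:
        raise ValueError("both heritability estimates must be positive")
    # Standardized genetic covariance (the bivariate heritability).
    genetic_cov = 2 * (r_mz_xy - r_dz_xy)
    return genetic_cov / math.sqrt(h2_x * h2_y)

# Hypothetical twin correlations.
r_g = twin_genetic_correlation(
    r_mz_x=0.80, r_dz_x=0.50,
    r_mz_y=0.70, r_dz_y=0.45,
    r_mz_xy=0.30, r_dz_xy=0.18,
)
print(round(r_g, 2))
```

In practice such estimates are obtained by fitting structural equation models rather than by differencing correlations, which also yields standard errors.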
Genetic correlations are affected by methodological concerns: underestimation of heritability, such as due to assortative mating, will lead to overestimates of longitudinal genetic correlation, and moderate levels of misdiagnosis can create pseudo-correlations.
As they are affected by the heritabilities of both traits, genetic correlations have low statistical power, especially in the presence of measurement errors biasing heritability downwards, because "estimates of genetic correlations are usually subject to rather large sampling errors and therefore seldom very precise": the standard error of an estimate $r_g$ is $\sigma(r_g) = \frac{1 - r_g^2}{\sqrt{2}} \sqrt{\frac{\sigma(h_x^2)\,\sigma(h_y^2)}{h_x^2\,h_y^2}}$. (Larger genetic correlations and heritabilities will be estimated more precisely.) However, inclusion of genetic correlations in an analysis of a pleiotropic trait can boost power for the same reason that multivariate regressions are more powerful than separate univariate regressions.
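The dependence of precision on the size of the correlation and of the heritabilities can be explored numerically. The sketch below uses a textbook approximation for the standard error, $(1 - r_g^2)/\sqrt{2}$ times the square root of the product of the heritabilities' standard errors divided by the product of the heritabilities. Treat that formula as an assumption of this example; the function name and all input values are hypothetical.

```python
import math

def se_genetic_correlation(r_g, h2_x, h2_y, se_h2_x, se_h2_y):
    """Approximate standard error of a genetic correlation estimate."""
    if h2_x <= 0 or h2_y <= 0:
        raise ValueError("heritabilities must be positive")
    scale = (1 - r_g ** 2) / math.sqrt(2)
    return scale * math.sqrt((se_h2_x * se_h2_y) / (h2_x * h2_y))

# Hypothetical inputs: same heritability standard errors throughout.
se_moderate = se_genetic_correlation(0.5, 0.4, 0.4, 0.1, 0.1)
se_high_rg = se_genetic_correlation(0.9, 0.4, 0.4, 0.1, 0.1)
se_high_h2 = se_genetic_correlation(0.5, 0.8, 0.8, 0.1, 0.1)

print(round(se_moderate, 3), round(se_high_rg, 3), round(se_high_h2, 3))
```

Under this approximation, raising either the genetic correlation or the heritabilities shrinks the standard error, in line with the parenthetical remark above.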
Twin methods have the advantage of being usable without detailed biological data (human genetic correlations were calculated as far back as the 1970s, and animal/plant genetic correlations in the 1930s) and require sample sizes only in the hundreds to be well-powered. Their disadvantages are that they make assumptions which have been criticized, that for rare traits like anorexia nervosa it may be difficult to find enough twins with a diagnosis to make meaningful cross-twin comparisons, and that they can only be estimated with access to the twin data. Molecular genetic methods like GCTA or LD score regression have the advantage of not requiring specific degrees of relatedness, so they can easily study rare traits using case-control designs, which also reduces the number of assumptions they rely on; but those methods could not be run until recently, require large sample sizes in the thousands or hundreds of thousands (to obtain precise SNP heritability estimates; see the standard error formula), and may require individual-level genetic data (in the case of GCTA but not LD score regression).
More concretely, suppose two traits, say height and weight, have a given additive genetic variance-covariance matrix. If standardizing that matrix gives the correlation matrix below, the genetic correlation between the two traits is .55:

        Height  Weight
Height  1       .55
Weight  .55     1
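A numeric sketch of this standardization follows. The variances and covariance used (36 and 117 on the diagonal, 36 off the diagonal) are illustrative assumptions, picked only because they standardize to roughly .55; the function name is hypothetical.

```python
import math

def standardize(cov):
    """Convert a covariance matrix (nested lists) to a correlation matrix."""
    n = len(cov)
    sd = [math.sqrt(cov[i][i]) for i in range(n)]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(n)] for i in range(n)]

# Illustrative additive genetic covariance matrix for (height, weight).
genetic_cov = [[36.0, 36.0],
               [36.0, 117.0]]

genetic_corr = standardize(genetic_cov)
print(round(genetic_corr[0][1], 2))
```

The off-diagonal entry of the result is the genetic correlation; the diagonal entries are 1 by construction.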
In practice, structural equation modeling applications such as Mx or OpenMx (and before that, historically, LISREL) are used to calculate both the genetic covariance matrix and its standardized form. In R, cov2cor() will standardize the matrix.
Typically, published reports will provide genetic variance components that have been standardized as a proportion of total variance (for instance, in an ACE twin study model standardized as a proportion of V-total = A+C+E). In this case, the metric for computing the genetic covariance (the variance within the genetic covariance matrix) is lost because of the standardizing process, so the genetic correlation of two traits cannot readily be estimated from such published models. Multivariate models (such as the Cholesky decomposition) will, however, allow the reader to see shared genetic effects (as opposed to the genetic correlation) by following path rules. It is therefore important to provide the unstandardized path coefficients in publications.
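When unstandardized path coefficients are reported, the genetic correlation can be recovered from them. The sketch below assumes a bivariate Cholesky parameterization of the additive genetic part, with a lower-triangular path matrix whose product with its own transpose is the genetic covariance matrix. The path values and function name are hypothetical.

```python
import math

def genetic_correlation_from_cholesky(a11, a21, a22):
    """Genetic correlation implied by bivariate Cholesky genetic paths.

    a11: path from genetic factor 1 to trait X
    a21: path from genetic factor 1 to trait Y (shared genetic effects)
    a22: path from genetic factor 2 to trait Y (trait-specific effects)
    """
    var_x = a11 ** 2              # genetic variance of trait X
    var_y = a21 ** 2 + a22 ** 2   # genetic variance of trait Y
    cov_xy = a11 * a21            # genetic covariance via the shared factor
    if var_x == 0 or var_y == 0:
        raise ValueError("genetic variances must be non-zero")
    return cov_xy / math.sqrt(var_x * var_y)

# Hypothetical unstandardized path coefficients.
r_g = genetic_correlation_from_cholesky(a11=0.7, a21=0.4, a22=0.5)
print(round(r_g, 2))
```

The same calculation is not possible from variance components reported only as proportions of total variance, since the cross-trait path is then not available on a common metric.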
See also
* Gene-environment correlation
* Heritability of intelligence; g factor (psychometrics)
* Cognitive epidemiology
* Lothian birth-cohort studies
* Mendelian randomization
External links
The G-matrix Online