
Population structure (also called genetic structure and population stratification) is the presence of a systematic difference in allele frequencies between subpopulations. In a randomly mating (or ''panmictic'') population, allele frequencies are expected to be roughly similar between groups. However, mating tends to be non-random to some degree, causing structure to arise. For example, a barrier like a river can separate two groups of the same species and make it difficult for potential mates to cross; if a mutation occurs, over many generations it can spread and become common in one subpopulation while being completely absent in the other. Genetic variants do not necessarily cause observable changes in organisms, but can be correlated with traits by coincidence because of population structure: a variant that is common in a population that has a high rate of disease may erroneously be thought to cause the disease. For this reason, population structure is a common confounding variable in medical genetics studies, and accounting for and controlling its effect is important in genome-wide association studies (GWAS). By tracing the origins of structure, it is also possible to study the genetic ancestry of groups and individuals.


Description

The basic cause of population structure in sexually reproducing species is non-random mating between groups: if all individuals within a population mate randomly, then the allele frequencies should be similar between groups. Population structure commonly arises from physical separation by distance or barriers, like mountains and rivers, followed by genetic drift. Other causes include gene flow from migrations, population bottlenecks and expansions, founder effects, evolutionary pressure, random chance, and (in humans) cultural factors. Even in the absence of these factors, individuals tend to stay close to where they were born, which means that alleles will not be distributed at random with respect to the full range of the species.
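
As a rough illustration of how separation followed by drift can produce structure, the sketch below lets two isolated subpopulations start from the same allele frequency and resamples their allele copies independently each generation. The Wright-Fisher resampling model, the function names, and the parameter values (population size, number of generations, seed) are assumptions chosen for illustration; they do not come from the text above.

```python
import random


def drift(p0, pop_size, generations, rng):
    """Follow one allele's frequency under Wright-Fisher drift.

    Each generation, all 2N allele copies of a diploid population are
    redrawn at random from the previous generation's frequency.
    """
    p = p0
    copies = 2 * pop_size
    for _ in range(generations):
        count = sum(1 for _ in range(copies) if rng.random() < p)
        p = count / copies
    return p


def simulate_split(p0=0.5, pop_size=50, generations=100, seed=1):
    """Drift two isolated subpopulations that start at the same frequency."""
    rng = random.Random(seed)
    p_a = drift(p0, pop_size, generations, rng)
    p_b = drift(p0, pop_size, generations, rng)
    return p_a, p_b


if __name__ == "__main__":
    p_a, p_b = simulate_split()
    print(f"subpopulation A: {p_a:.2f}  subpopulation B: {p_b:.2f}")
```

With no migration between the two groups, nothing pulls the frequencies back together, so they usually end up different, and in small populations the allele is often fixed in one group and lost in the other.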


Measures

Population structure is a complex phenomenon and no single measure captures it entirely. Understanding a population's structure requires a combination of methods and measures. Many statistical methods rely on simple population models in order to infer historical demographic changes, such as the presence of population bottlenecks, admixture events or population divergence times. Often these methods rely on the assumption of panmixia, or homogeneity in an ancestral population. Misspecification of such models, for instance by not taking into account the existence of structure in an ancestral population, can give rise to heavily biased parameter estimates. Simulation studies show that historical population structure can produce genetic signals that are easily misinterpreted as historical changes in population size or as admixture events, even when no such events occurred.


Heterozygosity

One of the results of population structure is a reduction in heterozygosity. When populations split, alleles have a higher chance of reaching fixation within subpopulations, especially if the subpopulations are small or have been isolated for long periods. This reduction in heterozygosity can be thought of as an extension of inbreeding, with individuals in subpopulations being more likely to share a recent common ancestor. The scale is important: an individual with both parents born in the United Kingdom is not inbred relative to that country's population, but is more inbred than the offspring of two humans selected from the entire world. This motivates the derivation of Wright's ''F''-statistics (also called "fixation indices"), which measure inbreeding through observed versus expected heterozygosity. For example, F_{IS} measures the inbreeding coefficient at a single locus for an individual I relative to some subpopulation S:

:F_{IS} = 1 - \frac{H_I}{H_S}

Here, H_I is the fraction of individuals in subpopulation S that are heterozygous. Assuming there are two alleles, A_1 and A_2, that occur at respective frequencies p_S and q_S, it is expected that under random mating the subpopulation S will have a heterozygosity rate of H_S = 2p_S(1-p_S) = 2 p_S q_S. Then:

:F_{IS} = 1 - \frac{H_I}{2 p_S q_S}

Similarly, for the total population T, we can define H_T = 2 p_T q_T, allowing us to compare the expected heterozygosity of subpopulation S with that of T and compute the value F_{ST} as:

:F_{ST} = 1 - \frac{H_S}{H_T} = 1 - \frac{2 p_S q_S}{2 p_T q_T}

If F_{ST} is 0, then the allele frequencies between populations are identical, suggesting no structure. The theoretical maximum value of 1 is attained when an allele reaches total fixation, but most observed maximum values are far lower. F_{ST} is one of the most common measures of population structure and there are several different formulations depending on the number of populations and the alleles of interest. Although it is sometimes used as a genetic distance between populations, it does not always satisfy the triangle inequality and thus is not a metric. It also depends on within-population diversity, which makes interpretation and comparison difficult.
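
The heterozygosity-based definitions can be sketched in a few lines for a single bi-allelic locus. The function names are hypothetical, and averaging H_S over equally weighted subpopulations (with p_T taken as the mean subpopulation frequency) is an assumption made here to handle more than one subpopulation; it is only one of the several formulations mentioned above.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2pq at a bi-allelic locus under random mating."""
    return 2.0 * p * (1.0 - p)


def f_is(h_observed, p_s):
    """F_IS = 1 - H_I / H_S, with H_S = 2 p_S q_S."""
    h_s = expected_heterozygosity(p_s)
    if h_s == 0.0:
        raise ValueError("locus is monomorphic in the subpopulation")
    return 1.0 - h_observed / h_s


def f_st(subpop_freqs):
    """F_ST = 1 - H_S / H_T for equally weighted subpopulations (assumption).

    H_S is the mean expected heterozygosity within subpopulations and
    H_T is the expected heterozygosity at the mean allele frequency.
    """
    p_t = sum(subpop_freqs) / len(subpop_freqs)
    h_t = expected_heterozygosity(p_t)
    if h_t == 0.0:
        raise ValueError("locus is monomorphic in the total population")
    h_s = sum(expected_heterozygosity(p) for p in subpop_freqs) / len(subpop_freqs)
    return 1.0 - h_s / h_t


if __name__ == "__main__":
    print(f"identical frequencies: {f_st([0.5, 0.5]):.2f}")
    print(f"diverged frequencies:  {f_st([0.2, 0.8]):.2f}")
    print(f"opposite fixation:     {f_st([0.0, 1.0]):.2f}")
```

The three calls cover the range discussed in the text: identical frequencies give 0, alternative fixation gives 1, and partially diverged frequencies fall in between.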


Admixture inference

An individual's genotype can be modelled as an admixture between ''K'' discrete clusters of populations. Each cluster is defined by the frequencies of its genotypes, and the contribution of a cluster to an individual's genotypes is measured via an estimator. In 2000, Jonathan K. Pritchard introduced the STRUCTURE algorithm to estimate these proportions via Markov chain Monte Carlo, modelling allele frequencies at each locus with a Dirichlet distribution. Since then, algorithms (such as ADMIXTURE) have been developed using other estimation techniques. Estimated proportions can be visualized using bar plots: each bar represents an individual, and is subdivided to represent the proportion of an individual's genetic ancestry from each of the ''K'' populations. Varying ''K'' can illustrate different scales of population structure; using a small ''K'' for the entire human population will subdivide people roughly by continent, while using a large ''K'' will partition populations into finer subgroups.

Though clustering methods are popular, they are open to misinterpretation: for non-simulated data, there is never a "true" value of ''K'', but rather an approximation considered useful for a given question. The methods are sensitive to sampling strategies, sample size, and close relatives in data sets; there may be no discrete populations at all; and there may be hierarchical structure where subpopulations are nested. Clusters may be admixed themselves, and may not have a useful interpretation as source populations.
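
The sketch below shows only the forward side of such an admixture model: given ancestry proportions for one individual and allele frequencies for each of the ''K'' clusters, it computes the allele frequency the model implies for that individual and draws a text version of one bar of the bar plot. It is not an implementation of STRUCTURE or ADMIXTURE, which estimate these quantities from data; the function names and example numbers are hypothetical.

```python
def individual_allele_freq(proportions, cluster_freqs):
    """Allele frequency implied for one individual at each locus.

    proportions:   K ancestry proportions that sum to 1
    cluster_freqs: K lists, each holding one allele frequency per locus
    Each locus value is the proportion-weighted mean of the cluster
    frequencies.
    """
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("ancestry proportions must sum to 1")
    n_loci = len(cluster_freqs[0])
    return [
        sum(q_k * freqs[locus] for q_k, freqs in zip(proportions, cluster_freqs))
        for locus in range(n_loci)
    ]


def ancestry_bar(proportions, width=20, symbols="#o+"):
    """Render one individual's bar, one symbol per cluster.

    Segment lengths are rounded, so the bar can be a character
    shorter or longer than `width`.
    """
    segments = []
    for k, q_k in enumerate(proportions):
        segments.append(symbols[k % len(symbols)] * round(q_k * width))
    return "".join(segments)


if __name__ == "__main__":
    clusters = [[0.1, 0.9], [0.9, 0.1]]  # hypothetical frequencies, K = 2
    for q in ([1.0, 0.0], [0.75, 0.25], [0.5, 0.5]):
        print(ancestry_bar(q, width=8), individual_allele_freq(q, clusters))
```

Stacking one such bar per individual gives the familiar plot; rerunning an estimation method with a different ''K'' changes how many symbols (colours, in a real plot) each bar is divided into.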


Dimensionality reduction

Genetic data are high dimensional, and dimensionality reduction techniques can capture population structure. Principal component analysis (PCA) was first applied in population genetics in 1978 by Cavalli-Sforza and colleagues and resurged with high-throughput sequencing. Initially PCA was used on allele frequencies at known genetic markers for populations, though later it was found that by coding SNPs as integers (for example, as the number of non-reference alleles) and normalizing the values, PCA could be applied at the level of individuals. One formulation considers N individuals and S bi-allelic SNPs. For each individual i, the value at locus l, g_{i,l}, is the number of non-reference alleles (one of 0, 1, 2). If the allele frequency at l is p_l, then the resulting N \times S matrix of normalized genotypes has entries:

:\frac{g_{i,l} - 2p_l}{\sqrt{2p_l(1-p_l)}}

PCA projects the data onto axes that capture the most variance; given enough data, when each individual is visualized as a point on a plot, discrete clusters can form. Individuals with admixed ancestries will tend to fall between clusters, and when there is homogeneous isolation by distance in the data, the top PC vectors will reflect geographic variation. The eigenvectors generated by PCA can be explicitly written in terms of the mean coalescent times for pairs of individuals, making PCA useful for inference about the population histories of groups in a given sample. PCA cannot, however, distinguish between different processes that lead to the same mean coalescent times.

Multidimensional scaling and discriminant analysis have been used to study differentiation, population assignment, and to analyze genetic distances. Neighborhood graph approaches like t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP) can visualize continental and subcontinental structure in human data. With larger datasets, UMAP better captures multiple scales of population structure; fine-scale patterns can be hidden or split with other methods, and these are of interest when the range of populations is diverse, when there are admixed populations, or when examining relationships between genotypes, phenotypes, and/or geography. Variational autoencoders can generate artificial genotypes with structure representative of the input data, though they do not recreate linkage disequilibrium patterns.
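
The individual-level PCA formulation can be sketched without external libraries: normalize each SNP column by its allele frequency, then find the leading eigenvector of the individual-by-individual matrix of inner products. Power iteration is used here only to keep the example self-contained; it is an assumption of this sketch, and real analyses typically rely on dedicated numerical or genetics software. The function names and the toy genotypes are hypothetical.

```python
import math


def normalize_genotypes(genotypes):
    """Return rows of (g - 2p) / sqrt(2p(1 - p)), one row per individual.

    genotypes: N rows of S counts of non-reference alleles (0, 1 or 2).
    Monomorphic loci (p equal to 0 or 1) are dropped, since the scaling
    is undefined for them.
    """
    n = len(genotypes)
    columns = []
    for locus in range(len(genotypes[0])):
        p = sum(row[locus] for row in genotypes) / (2.0 * n)
        if p <= 0.0 or p >= 1.0:
            continue
        scale = math.sqrt(2.0 * p * (1.0 - p))
        columns.append([(row[locus] - 2.0 * p) / scale for row in genotypes])
    return [list(row) for row in zip(*columns)]


def top_principal_component(x, iterations=200):
    """Leading eigenvector of X X^T by power iteration.

    Its entries are the individuals' positions on the first principal
    component, up to scale and an arbitrary overall sign.
    """
    n = len(x)
    gram = [[sum(a * b for a, b in zip(x[i], x[j])) for j in range(n)]
            for i in range(n)]
    v = [float(i + 1) for i in range(n)]  # arbitrary non-constant start
    for _ in range(iterations):
        w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    return v


if __name__ == "__main__":
    toy = [
        [2, 2, 0, 0, 1], [2, 1, 0, 1, 1], [1, 2, 0, 0, 2],  # group A
        [0, 0, 2, 2, 1], [0, 1, 2, 1, 1], [1, 0, 2, 2, 0],  # group B
    ]
    pc1 = top_principal_component(normalize_genotypes(toy))
    print([round(c, 2) for c in pc1])
```

In the toy data the two groups carry different alleles at most loci, so their members receive first-component coordinates of opposite sign; an admixed individual would be expected to fall between them.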


Demographic inference

Population structure is an important aspect of evolutionary and population genetics. Events like migrations and interactions between groups leave a genetic imprint on populations. Admixed populations will have haplotype chunks from their ancestral groups, which gradually shrink over time because of recombination. By exploiting this fact and matching shared haplotype chunks from individuals within a genetic dataset, researchers may trace and date the origins of population admixture and reconstruct historic events such as the rise and fall of empires, slave trades, colonialism, and population expansions.
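
The link between chunk length and time since admixture can be made concrete with a commonly used back-of-the-envelope approximation. It is a sketch under strong assumptions that are not stated in the text above: a single pulse of admixture T generations ago, a fraction m of ancestry from the source of interest, recombination treated as a Poisson process along the chromosome, and no later gene flow, drift, or selection. The symbols T, m, and \bar{L} are introduced here for illustration.

```latex
% Mean length (in Morgans) of a chunk inherited from the source that
% contributed a fraction m of ancestry, T generations after one pulse:
\bar{L} \approx \frac{1}{(1 - m)\,T}
% Rearranged to date the event from the observed mean chunk length:
\hat{T} \approx \frac{1}{(1 - m)\,\bar{L}}
% Example: m = 0.2 and \bar{L} = 0.05 Morgans (5 cM) give
% \hat{T} \approx 1 / (0.8 \times 0.05) = 25 generations.
```

Longer chunks therefore point to more recent admixture, and shorter chunks to older events. Methods used in practice fit the full distribution of chunk lengths rather than only the mean.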


Role in genetic epidemiology

Population structure can be a problem for association studies, such as case-control studies, where an apparent association between the trait of interest and a locus could be spurious. As an example, in a study population of Europeans and East Asians, an association study of chopstick usage may "discover" a gene in the Asian individuals that leads to chopstick use. However, this is a spurious relationship, as the genetic variant is simply more common in Asians than in Europeans. Also, actual genetic findings may be overlooked if the locus is less prevalent in the population where the case subjects are chosen. For this reason, it was common in the 1990s to use family-based data where the effect of population structure can easily be controlled for using methods such as the transmission disequilibrium test (TDT).
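
The spurious association described above can be reproduced with a small table of counts. In the sketch below, a variant and a trait are completely independent inside each of two populations, but both are more common in the first one; pooling the populations then yields a very large chi-square statistic. The counts, population labels, and function name are hypothetical and chosen only for illustration.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]].

    Rows: carrier / non-carrier of the variant.
    Columns: trait present / trait absent.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / denom


# Population 1: 80% carriers, 90% with the trait, independent of each other.
POP_1 = (720, 80, 180, 20)
# Population 2: 20% carriers, 10% with the trait, independent of each other.
POP_2 = (20, 180, 80, 720)
# Pooled sample: add the two tables cell by cell.
POOLED = tuple(x + y for x, y in zip(POP_1, POP_2))

if __name__ == "__main__":
    print("population 1:", chi_square_2x2(*POP_1))
    print("population 2:", chi_square_2x2(*POP_2))
    print("pooled:      ", chi_square_2x2(*POOLED))
```

Within each population the statistic is zero, yet the pooled value far exceeds 3.84, the usual 5% threshold for one degree of freedom. Analysing each population separately, or adjusting for ancestry, removes the artefact.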

Phenotypes (measurable traits), such as height or risk for heart disease, are the product of some combination of genes and environment. These traits can be predicted using polygenic scores, which seek to isolate and estimate the contribution of genetics to a trait by summing the effects of many individual genetic variants. To construct a score, researchers first enrol participants in an association study to estimate the contribution of each genetic variant. Then, they can use the estimated contributions of each genetic variant to calculate a score for the trait for an individual who was not in the original association study. If structure in the study population is correlated with environmental variation, then the polygenic score is no longer measuring the genetic component alone.

Several methods can at least partially control for this confounding effect. The genomic control method was introduced in 1999 and is a relatively nonparametric method for controlling the inflation of test statistics. It is also possible to use unlinked genetic markers to estimate each individual's ancestry proportions from some ''K'' subpopulations, which are assumed to be unstructured. More recent approaches make use of principal component analysis (PCA), as demonstrated by Alkes Price and colleagues, or derive a genetic relationship matrix (also called a kinship matrix) and include it in a linear mixed model (LMM). PCA and LMMs have become the most common methods to control for confounding from population structure. Though they are likely sufficient for avoiding false positives in association studies, they are still vulnerable to overestimating effect sizes of marginally associated variants and can substantially bias estimates of polygenic scores and trait heritability. If environmental effects are related to a variant that exists in only one specific region (for example, a pollutant is found in only one city), it may not be possible to correct for this population structure effect at all.

For many traits, the role of structure is complex and not fully understood, and incorporating it into genetic studies remains a challenge and is an active area of research.
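
Of the corrections listed above, genomic control is the simplest to sketch. The version below follows the common median-based form: the inflation factor is the median of the observed one-degree-of-freedom chi-square statistics divided by the median expected under no association (about 0.455), and every statistic is divided by that factor. Never correcting upwards when the factor is below 1 is a widespread convention adopted here as an assumption; the function names and example values are hypothetical.

```python
import statistics

# Median of a chi-square distribution with one degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def inflation_factor(chi2_stats):
    """Observed median statistic relative to the null median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control(chi2_stats):
    """Divide every statistic by the inflation factor.

    The factor is floored at 1 so that statistics are never inflated
    (a convention assumed for this sketch).
    """
    lam = max(1.0, inflation_factor(chi2_stats))
    return [stat / lam for stat in chi2_stats]


if __name__ == "__main__":
    observed = [0.2, 0.5, 0.9, 1.4, 2.2, 6.0, 11.5]  # hypothetical statistics
    print(f"inflation factor: {inflation_factor(observed):.2f}")
    print([round(s, 2) for s in genomic_control(observed)])
```

Because the same factor is applied to every variant, this approach cannot account for variants that differ in how strongly they track ancestry, which is one motivation for the PCA and mixed-model approaches described above.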

