The candidate gene approach to conducting genetic association studies focuses on associations between genetic variation within pre-specified genes of interest and phenotypes or disease states. This is in contrast to genome-wide association studies (GWAS), which take a hypothesis-free approach and scan the entire genome for associations between common genetic variants (typically single-nucleotide polymorphisms, or SNPs) and traits of interest. Candidate genes are most often selected for study based on ''a priori'' knowledge of the gene's biological functional impact on the trait or disease in question.
The rationale behind focusing on allelic variation in specific, biologically relevant regions of the genome is that certain alleles within a gene may directly impact the function of the gene in question and lead to variation in the phenotype or disease state being investigated. This approach often uses the case-control study design to try to answer the question, "Is one allele of a candidate gene more frequently seen in subjects with the disease than in subjects without the disease?"
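As a minimal sketch of how the case-control question might be tested, the following computes an odds ratio and a Pearson chi-square statistic (1 degree of freedom) from a 2×2 table of allele counts. The function name and the counts are hypothetical and chosen only for illustration; real studies typically also adjust for covariates and multiple testing.

```python
import math


def allelic_association(case_a, case_b, ctrl_a, ctrl_b):
    """Odds ratio and Pearson chi-square test (1 df) for a 2x2 allele-count table.

    case_a / case_b: counts of allele A / allele B among cases
    ctrl_a / ctrl_b: counts of allele A / allele B among controls
    """
    n = case_a + case_b + ctrl_a + ctrl_b
    # Shortcut formula for a 2x2 table: n * (ad - bc)^2 / product of margins
    numerator = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2
    denominator = ((case_a + case_b) * (ctrl_a + ctrl_b)
                   * (case_a + ctrl_a) * (case_b + ctrl_b))
    chi2 = numerator / denominator
    # Upper tail of a chi-square distribution with 1 df: erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (case_a * ctrl_b) / (case_b * ctrl_a)
    return odds_ratio, chi2, p_value


# Hypothetical counts (assumption): 500 cases and 500 controls,
# each subject contributing two alleles.
odds_ratio, chi2, p_value = allelic_association(240, 760, 180, 820)
print(f"OR={odds_ratio:.2f}, chi2={chi2:.2f}, p={p_value:.4f}")
```

An odds ratio above 1 with a small p-value would suggest that allele A is seen more often in cases than in controls, although a single such result says nothing yet about replication.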
Associations between candidate genes and complex traits have generally not been replicated by subsequent GWASs or highly powered replication attempts. The failure of candidate gene studies to shed light on the specific genes underlying such traits has been ascribed to insufficient statistical power, the low prior probability that scientists can correctly guess a specific allele within a specific gene that is related to a trait, poor methodological practices, and data dredging.
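The interplay of statistical power and prior probability can be made concrete with a short calculation of the positive predictive value, i.e. the probability that a "significant" association is real. The sketch below uses the standard Bayes-rule formula; the function name and all input values are assumptions picked for illustration, not estimates taken from any study.

```python
def positive_predictive_value(prior, power, alpha):
    """Probability that a statistically significant association is a true one.

    prior: probability, before seeing data, that the tested association is real
    power: probability of detecting the association if it is real
    alpha: significance threshold (false-positive rate under the null)
    """
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)


# Illustrative inputs only (assumptions): a 1-in-100 prior for both scenarios.
underpowered = positive_predictive_value(prior=0.01, power=0.20, alpha=0.05)
well_powered = positive_predictive_value(prior=0.01, power=0.80, alpha=5e-8)

print(f"small study, p < 0.05 : PPV = {underpowered:.3f}")
print(f"large study, p < 5e-8 : PPV = {well_powered:.6f}")
```

Under these assumed inputs, most significant findings from the small study would be false positives, while the stricter threshold and higher power of the second scenario make a significant result far more credible.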
Selection
Suitable candidate genes are generally selected based on known biological, physiological, or functional relevance to the disease in question. This approach is limited by its reliance on existing knowledge of the known or theoretical biology of the disease. However, molecular tools are allowing insight into disease mechanisms and pinpointing potential regions of interest in the genome. Genome-wide association studies (GWAS) and quantitative trait locus (QTL) mapping examine common variation across the entire genome, and as such can detect a new region of interest that is in or near a potential candidate gene. Microarray data allow researchers to examine differential gene expression between cases and controls, and can help pinpoint new potential genes of interest.
The great variability between organisms can sometimes make it difficult to distinguish normal variation in single-nucleotide polymorphisms (SNPs) within a candidate gene from disease-associated variation. In analyzing large amounts of data, several other factors can help identify the most probable variant. These factors include the prioritization of SNPs, the relative risk of functional change in genes, and linkage disequilibrium among SNPs.
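Linkage disequilibrium between two SNPs is commonly summarized by the coefficient D and its normalized forms D′ and r². The following is a minimal sketch using the textbook definitions for two biallelic loci; the function name and the example frequencies are assumptions, and in practice haplotype frequencies usually have to be estimated from genotype data first.

```python
def linkage_disequilibrium(p_ab, p_a, p_b):
    """Return (D, |D'|, r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a:  frequency of allele A at the first locus  (0 < p_a < 1)
    p_b:  frequency of allele B at the second locus (0 < p_b < 1)
    """
    # D: deviation of the observed haplotype frequency from independence
    d = p_ab - p_a * p_b
    # Largest magnitude D could take given the allele frequencies
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    # Squared correlation between the alleles at the two loci
    r2 = d ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# Hypothetical frequencies (assumption), for illustration only.
d, d_prime, r2 = linkage_disequilibrium(p_ab=0.40, p_a=0.50, p_b=0.50)
print(f"D={d:.3f}, |D'|={d_prime:.2f}, r^2={r2:.2f}")
```

High r² between a genotyped SNP and a nearby variant means the two carry largely the same association signal, which is why LD matters when trying to single out the most probable causal variant.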
In addition, the availability of genetic information through online databases enables researchers to mine existing data and web-based resources for new candidate gene targets. Many online databases are available to research genes across species.
* Gene is one such database that allows access to information about phenotypes, pathways, and variations of many genes across species.
* When examining functional relationships between genes in pathways, the Gene Ontology Consortium can help map these relationships. The GO Project describes gene products in three species-independent ways: biological processes, cellular components, and molecular functions. Using this information can extend ''a priori'' knowledge of a pathway and thus help to choose the most likely candidate gene involved.
* ToppGene is another useful database that allows users to prioritize candidate genes using functional annotations or network analysis. ToppGene aids researchers in selecting a subset of likely candidate genes from larger sets of candidate genes, typically those discovered through high-throughput genome technologies.
* Lynx is an integrated systems biology platform that allows users to prioritize candidate genes using both functional annotations and pairwise gene association networks. Lynx provides two prioritization tools, Cheetoh and PINTA, to help users select candidate genes from the whole genome based on their relevance to an input gene list. The input can be a list of known genes contributing to a certain disease or phenotype, or a list of differentially expressed genes from next-generation RNA sequencing technology.
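The general idea behind annotation-based prioritization can be illustrated with a toy ranking that scores each candidate by how much its functional annotations overlap with those of known disease genes. This is only a sketch under stated assumptions: it is not the algorithm used by ToppGene, Lynx, or any other tool named above, and the gene names, annotation terms, and function names are placeholders.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(seed_terms, candidate_terms):
    """Rank candidate genes by annotation overlap with the seed genes' terms.

    seed_terms:      set of annotation terms attached to known disease genes
    candidate_terms: dict mapping candidate gene name -> set of its terms
    Returns (gene, score) pairs, best score first; ties broken by gene name.
    """
    scores = {gene: jaccard(terms, seed_terms)
              for gene, terms in candidate_terms.items()}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical annotations (assumption): names and terms are placeholders.
seed_terms = {"lipid transport", "lipoprotein metabolism", "extracellular region"}
candidate_terms = {
    "GENE_A": {"lipid transport", "lipoprotein metabolism"},
    "GENE_B": {"DNA repair", "nucleus"},
    "GENE_C": {"lipid transport", "nucleus", "membrane"},
}

ranking = rank_candidates(seed_terms, candidate_terms)
for gene, score in ranking:
    print(f"{gene}: {score:.2f}")
```

A simple overlap score like this inherits the main limitation discussed in this section: it can only favor genes whose function is already annotated.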
Prior to the candidate-gene approach
Before the candidate-gene approach was fully developed, various other methods were used to identify genes linked to disease states. These methods studied genetic linkage and positional cloning through the use of a genetic screen, and were effective at identifying relative risk genes in Mendelian diseases. However, these methods are not as beneficial when studying complex diseases for several reasons:
# Complex diseases tend to vary in both age of onset and severity. This can be due to variation in penetrance and expressivity. For most human diseases, variable expressivity of the disease phenotype is the norm. This makes it more difficult to choose one specific age group or phenotypic marker for study.
# The origins of complex disease involve many biological pathways, some of which may differ between disease phenotypes.
# Most importantly, complex diseases often exhibit genetic heterogeneity: multiple genes can be found that interact and produce one disease state. Often, each single gene is partially responsible for the phenotype produced and the overall risk for the disorder.
Criticisms
A study of candidate genes seeks to balance the use of data while attempting to minimize the chance of creating false positive or negative results. Because this balance can often be difficult, there are several criticisms of the candidate gene approach that are important to understand before beginning such a study. For instance, the candidate-gene approach has been shown to produce a high rate of false positives, which requires that the findings of single genetic associations be treated with great caution.
One critique is that findings of association within candidate-gene studies have not been easily replicated in follow-up studies. For instance, an investigation of 18 well-studied candidate genes for depression (10 publications or more each) failed to identify any significant association with depression, despite using samples orders of magnitude larger than those from the original publications. In addition to statistical issues (e.g. underpowered studies), population stratification has often been blamed for this inconsistency; therefore caution must also be taken with regard to the criteria that define a certain phenotype, as well as other variations in study design.
Additionally, because these studies incorporate ''a priori'' knowledge, some critics argue that our knowledge is not sufficient to make valid predictions. Therefore, results gained from these 'hypothesis-driven' approaches are dependent on the ability to select plausible candidates from the genome, rather than use a hypothesis-free approach.
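How population stratification can produce a spurious association may be easier to see with numbers. In the sketch below, two hypothetical subpopulations differ in both allele frequency and disease prevalence, yet the allele is unrelated to disease within either one. Pooling them creates an apparent association, while a stratified (Mantel-Haenszel) odds ratio does not. All counts and function names are assumptions constructed for illustration.

```python
def odds_ratio(table):
    """Odds ratio of a 2x2 table given as (case_a, case_b, ctrl_a, ctrl_b)."""
    case_a, case_b, ctrl_a, ctrl_b = table
    return (case_a * ctrl_b) / (case_b * ctrl_a)


def mantel_haenszel_or(strata):
    """Mantel-Haenszel common odds ratio across several 2x2 tables."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


# Hypothetical allele counts (assumption): (case_a, case_b, ctrl_a, ctrl_b).
# Subpopulation 1: allele A is common (80%) and the disease is common.
# Subpopulation 2: allele A is rare (20%) and the disease is rare.
strata = [
    (640, 160, 160, 40),
    (40, 160, 160, 640),
]

# Pooling ignores ancestry and simply adds the two tables together.
pooled = tuple(sum(column) for column in zip(*strata))

for i, table in enumerate(strata, start=1):
    print(f"subpopulation {i}: OR = {odds_ratio(table):.2f}")
pooled_or = odds_ratio(pooled)
adjusted_or = mantel_haenszel_or(strata)
print(f"pooled: OR = {pooled_or:.2f}")
print(f"Mantel-Haenszel adjusted: OR = {adjusted_or:.2f}")
```

Stratified analysis is only one of several ways to account for population structure, and it requires that subpopulation membership be known or inferred.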
Use in research studies
One of the earliest successes using the candidate gene approach was finding a single base mutation in the non-coding region of ''APOC3'' (the apolipoprotein C3 gene) that was associated with higher risks of hypertriglyceridemia and atherosclerosis. In a study by Kim et al., genes linked to the obesity trait in both pigs and humans were discovered using comparative genomics and chromosomal heritability. By using these two methods, the researchers were able to overcome the criticism that candidate gene studies are solely focused on prior knowledge. Comparative genomics was completed by examining both human and pig quantitative trait loci through a method known as genome-wide complex trait analysis (GCTA), which allowed the researchers to then map genetic variance to specific chromosomes. Heritability estimates could then indicate which chromosomal regions accounted for phenotypic variation, pointing to candidate markers and genes within these regions. Other studies may also use computational methods to find candidate genes in a widespread, complementary way, such as one study by Tiffin et al. studying genes linked to type 2 diabetes.
Many studies have similarly used candidate genes as part of a multi-disciplinary approach to examining a trait or phenotype. One example of manipulating candidate genes can be seen in a study completed by Martin E. Feder on heat-shock proteins and their function in ''Drosophila melanogaster''. Feder designed a holistic approach to study ''Hsp70'', a candidate gene that was hypothesized to play a role in how an organism adapted to stress. ''D. melanogaster'' is a highly useful model organism for studying this trait because it supports a diverse range of genetic approaches for studying a candidate gene. The approaches this study took included genetically modifying the candidate gene (using site-specific homologous recombination and the expression of various proteins), as well as examining the natural variation of ''Hsp70''. He concluded that the results of these studies gave a multi-faceted view of ''Hsp70''. The manipulation of candidate genes is also seen in Caspar C. Chater's study of the origin and function of stomata in ''Physcomitrella patens'', a moss. ''PpSMF1'', ''PpSMF2'' and ''PpSCRM1'' were the three candidate genes that were knocked down by homologous recombination to observe any changes in the development of stomata. With the knock-down experiment, Chater observed that ''PpSMF1'' and ''PpSCRM1'' were responsible for stomata development in ''P. patens''. By engineering and modifying these candidate genes, the researchers were able to confirm the ways in which these genes were linked to a change in phenotype. Examining the natural genome structure complemented this work by providing the natural and historical context in which these phenotypes operate.
External links
Gene
Gene Ontology Consortium
Topp Gene
Lynx Platform