Single nucleotide polymorphism annotation (SNP annotation) is the process of predicting the effect or function of an individual SNP using SNP annotation tools. In SNP annotation the biological information is extracted, collected and displayed in a clear form amenable to query. SNP functional annotation is typically performed based on the available information on nucleic acid and protein sequences.
Introduction

Single nucleotide polymorphisms (SNPs) play an important role in genome-wide association studies because they act as primary biomarkers. SNPs are currently the marker of choice due to their large numbers in virtually all populations of individuals. The location of these biomarkers can be tremendously important for predicting functional significance, genetic mapping and population genetics. Each SNP represents a nucleotide change between two individuals at a defined location. SNPs are the most common genetic variant found in all individuals, with one SNP every 100–300 bp in some species. Since there is a massive number of SNPs in the genome, there is a clear need to prioritize SNPs according to their potential effect in order to expedite genotyping and analysis.
Annotating large numbers of SNPs is a difficult and complex process that needs computational methods to handle such a large dataset. Many tools have been developed for SNP annotation in different organisms: some of them are optimized for use with organisms densely sampled for SNPs (such as humans), but there are currently few tools available that are species non-specific or support non-model organism data. The majority of SNP annotation tools provide computationally predicted putative deleterious effects of SNPs. These tools examine whether a SNP resides in functional genomic regions such as exons, splice sites, or transcription regulatory sites, and predict the potential corresponding functional effects that the SNP may have using a variety of machine-learning approaches. However, the tools and systems that prioritize functionally significant SNPs suffer from a few limitations. First, they examine the putative deleterious effects of SNPs with respect to a single biological function, which provides only partial information about the functional significance of SNPs. Second, current systems classify SNPs into either a deleterious or a neutral group.
Many annotation algorithms focus on single nucleotide variants (SNVs), which are considered rarer than SNPs as defined by their minor allele frequency (MAF). As a consequence, the training data for the corresponding prediction methods may differ, and one should be careful to select the appropriate tool for a specific purpose. For the purposes of this article, "SNP" will be used to mean both SNP and SNV, but readers should bear in mind the differences.
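The MAF-based distinction can be illustrated with a minimal sketch. It computes the minor allele frequency of a biallelic site from diploid genotype counts. The function names, the example counts and the 1% cutoff are assumptions chosen for illustration; the article itself does not fix a threshold.

```python
def minor_allele_frequency(n_ref_hom, n_het, n_alt_hom):
    """Minor allele frequency of a biallelic site from diploid genotype counts."""
    total_alleles = 2 * (n_ref_hom + n_het + n_alt_hom)
    if total_alleles == 0:
        raise ValueError("no genotypes supplied")
    # Each heterozygote carries one alternative allele, each alt homozygote two.
    alt_freq = (n_het + 2 * n_alt_hom) / total_alleles
    # The minor allele is whichever of the two alleles is less frequent.
    return min(alt_freq, 1.0 - alt_freq)


def is_common(maf, cutoff=0.01):
    """True if the variant reaches the cutoff (assumed here to be 1%)."""
    return maf >= cutoff


# Hypothetical cohort: 90 ref/ref, 9 ref/alt and 1 alt/alt individuals.
maf = minor_allele_frequency(90, 9, 1)
print(round(maf, 3), is_common(maf))
```

A tool trained mostly on variants below such a cutoff may behave differently on common variants, which is why the choice of tool should match the frequency class being analysed.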
SNP annotation

For SNP annotation, many kinds of genetic and genomic information are used. Based on the different features used by each annotation tool, SNP annotation methods may be split roughly into the following categories:
Gene based annotation
Genomic information from surrounding genomic elements is among the most useful information for interpreting the biological function of an observed variant. Information from a known gene is used as a reference to indicate whether the observed variant resides in or near a gene and whether it has the potential to disrupt the protein sequence and its function. Gene based annotation is based on the fact that non-synonymous mutations can alter the protein sequence and that splice site mutations may disrupt the transcript splicing pattern.
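As a minimal sketch of the gene based idea, the following code classifies a single-base substitution in a coding sequence as synonymous, missense, stop-gained or stop-lost using the standard genetic code. The function name, the consequence labels and the example sequence are illustrative assumptions; real annotation tools also handle strand, transcripts, splice sites and indels, which are ignored here.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def coding_consequence(cds, pos, alt):
    """Classify a substitution at 0-based position `pos` of coding sequence `cds`."""
    cds, alt = cds.upper(), alt.upper()
    start = (pos // 3) * 3                      # first base of the affected codon
    ref_codon = cds[start:start + 3]
    if len(ref_codon) < 3:
        raise ValueError("position falls in an incomplete codon")
    offset = pos % 3
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop_gained"
    if ref_aa == "*":
        return "stop_lost"
    return "missense"


# Hypothetical coding sequence: ATG GAA GTT (Met Glu Val).
print(coding_consequence("ATGGAAGTT", 5, "G"))  # GAA -> GAG
print(coding_consequence("ATGGAAGTT", 4, "T"))  # GAA -> GTA
print(coding_consequence("ATGGAAGTT", 3, "T"))  # GAA -> TAA
```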
Knowledge based annotation
Knowledge based annotation draws on information about gene attributes, protein function and metabolism. In this type of annotation more emphasis is given to genetic variation that disrupts protein functional domains, protein-protein interactions and biological pathways. The non-coding region of the genome contains many important regulatory elements, including promoters, enhancers and insulators, and any change in these regulatory regions can change the functionality of the protein concerned. A mutation in DNA can change the RNA sequence and thereby influence RNA secondary structure, RNA-binding protein recognition and miRNA binding activity.
Functional annotation
This method mainly identifies variant function based on whether the variant loci lie in known functional regions that harbor genomic or epigenomic signals. The functions of non-coding variants are extensive in terms of the affected genomic regions, and they are involved in almost all processes of gene regulation, from the transcriptional to the post-translational level.
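The region lookup described here can be sketched as an interval search. The coordinates and labels below are hypothetical, and the sketch assumes a single chromosome with sorted, non-overlapping regions; production tools typically use interval trees or indexed annotation files instead.

```python
from bisect import bisect_right

# Hypothetical regions: (start, end, label), 1-based inclusive, sorted by start
# and non-overlapping.
REGIONS = [
    (1000, 1499, "promoter"),
    (1500, 1800, "exon"),
    (1801, 2500, "intron"),
    (2501, 2700, "exon"),
    (5000, 5400, "enhancer"),
]
STARTS = [region[0] for region in REGIONS]


def annotate_position(pos):
    """Return the label of the region containing `pos`, or 'intergenic'."""
    # Index of the last region whose start is <= pos (or -1 if there is none).
    i = bisect_right(STARTS, pos) - 1
    if i >= 0:
        start, end, label = REGIONS[i]
        if start <= pos <= end:
            return label
    return "intergenic"


for snp_pos in (999, 1500, 3000, 5400):
    print(snp_pos, annotate_position(snp_pos))
```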
Transcriptional gene regulation
Transcriptional gene regulation depends on many spatial and temporal factors in the nucleus, such as global or local chromatin states, nucleosome positioning, transcription factor (TF) binding and enhancer/promoter activities. Variants that alter the function of any of these biological processes may alter gene regulation and cause phenotypic abnormality. Genetic variants located in distal regulatory regions can affect the binding motifs of TFs, chromatin regulators and other distal transcriptional factors, which disturbs the interaction between an enhancer/silencer and its target gene.
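One common way to approximate the effect of a variant on a TF binding motif is to score the reference and alternative sequences against a position weight matrix and compare the best matches. The sketch below assumes a made-up 4-bp motif, a uniform background and forward-strand scanning only; none of these values come from a real TF.

```python
import math

# Hypothetical position probability matrix for a 4-bp motif (one dict per position).
MOTIF = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform base composition


def motif_score(site):
    """Log-odds score (bits) of a site with the same width as the motif."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(MOTIF, site))


def best_score(seq):
    """Best motif score over all forward-strand windows of `seq`."""
    width = len(MOTIF)
    return max(motif_score(seq[i:i + width]) for i in range(len(seq) - width + 1))


def motif_score_change(seq, pos, alt):
    """Change in best motif score when the base at 0-based `pos` becomes `alt`."""
    alt_seq = seq[:pos] + alt + seq[pos + 1:]
    return best_score(alt_seq) - best_score(seq)


# Hypothetical sequence containing the consensus AGCT; the variant hits its G.
delta = motif_score_change("TTAGCTGG", 3, "T")
print(round(delta, 2))
```

A negative change suggests a weakened predicted binding site and a positive change a strengthened or newly created one; either way it is a prediction that would need experimental support.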
Alternative splicing
Alternative splicing is one of the most important contributors to the functional complexity of the genome. Modified splicing has a significant effect on phenotypes that are relevant to disease or drug metabolism. A change in splicing can be caused by modifying any of the components of the splicing machinery, such as splice sites, splice enhancers or splice silencers. Modification of an alternative splice site can lead to a different protein form with a different function. Humans use an estimated 100,000 different proteins or more, so some genes must be capable of coding for many more than one protein. Alternative splicing occurs more frequently than was previously thought and can be hard to control; genes may produce tens of thousands of different transcripts, necessitating a new gene model for each alternative splice.
RNA processing and post transcriptional regulation
Mutations in the untranslated region (UTR) affect many post-transcriptional regulatory processes. Distinctive structural features are required for many RNA molecules and cis-acting regulatory elements to function effectively during gene regulation. SNVs can alter the secondary structure of RNA molecules and thereby disrupt the proper folding of RNAs, such as tRNA, mRNA and lncRNA folding, as well as miRNA binding recognition regions.
Translation and post translational modifications
Single nucleotide variants can also affect the cis-acting regulatory elements in mRNAs and thereby inhibit or promote translation initiation. A mutation that changes a codon to a synonymous one may affect translation efficiency because of codon usage bias. Translation elongation can also be slowed by mutations along the ramp of ribosomal movement. At the post-translational level, genetic variants can affect proteostasis and amino acid modifications. However, the mechanisms of variant effects in this field are complicated, and only a few tools are available to predict a variant's effect on translation-related modifications.
Protein function
Non-synonymous variants are variants in exons that change the amino acid sequence encoded by the gene, including single base changes and non-frameshift indels. The effect of non-synonymous variants on proteins has been investigated extensively, and many algorithms have been developed to predict the deleteriousness and pathogenicity of single nucleotide variants (SNVs). Classical bioinformatics tools, such as SIFT, PolyPhen and MutationTaster, predict the functional consequence of non-synonymous substitutions. The PopViz webserver provides a gene-centric approach to visualize the mutation damage prediction scores (CADD, SIFT, PolyPhen-2) or the population genetics (minor allele frequency) versus the amino acid positions of all coding variants of a certain human gene. PopViz is also cross-linked with the UniProt database, where protein domain information can be found, so that predicted deleterious variants falling into these protein domains can be identified on the PopViz plot.
Evolutionary conservation and natural selection
Comparative genomics approaches are used to predict function-relevant variants under the assumption that a functional genetic locus should be conserved across different species over an extensive phylogenetic distance. On the other hand, some adaptive traits and population differences are driven by positive selection of advantageous variants, and these genetic mutations are functionally relevant to population-specific phenotypes. Functional prediction of variants' effects in different biological processes is pivotal for pinpointing the molecular mechanisms of diseases and traits and for directing experimental validation.
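The conservation assumption can be sketched with a simple per-column score on a multiple sequence alignment. The score below is one minus the normalized Shannon entropy of each column, so 1.0 means the same base in every species. This is a simplified stand-in for phylogeny-aware conservation scores, and the alignment is made up for illustration.

```python
import math
from collections import Counter


def column_conservation(column):
    """1 - normalized Shannon entropy of an alignment column (1.0 = fully conserved)."""
    bases = [b for b in column.upper() if b in "ACGT"]  # gaps and Ns are ignored
    if not bases:
        return 0.0
    n = len(bases)
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(bases).values())
    return 1.0 - entropy / 2.0  # 2 bits is the maximum entropy for four bases


# Hypothetical alignment of the same locus in four species (one row per species).
ALIGNMENT = ["ACGTAC", "ACGTTC", "ACGAGC", "ACGCCC"]
scores = [column_conservation("".join(col)) for col in zip(*ALIGNMENT)]
print([round(s, 2) for s in scores])
```

Under this assumption a variant at a high-scoring column is a stronger candidate for functional relevance than one at a low-scoring column, although positively selected sites would be missed by such a score.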
List of available SNP annotation tools
A large number of SNP annotation tools are currently available to annotate the vast amounts of available NGS data. Some of them are specific to particular kinds of SNPs, while others are more general. Available SNP annotation tools include SnpEff, Ensembl Variant Effect Predictor (VEP), ANNOVAR, FATHMM, PhD-SNP, PolyPhen-2, SuSPect, F-SNP, AnnTools, SeattleSeq, SNPit, SCAN, Snap, SNPs&GO, LS-SNP, Snat, TREAT, TRAMS, Maviant, MutationTaster, SNPdat, Snpranker, NGS-SNP, SVA, VARIANT, SIFT, LIST-S2 and FAST-SNP. The functions and approaches used in SNP annotation tools are listed below.
Algorithms used in annotation tools
Variant annotation tools use machine learning algorithms to predict variant annotations. Different annotation tools use different algorithms. Common algorithms include:
* Interval/Random forest, e.g. MutPred, SnpEff
* Neural networks, e.g. SNAP
* Support vector machines, e.g. PhD-SNP, SNPs&GO
* Bayesian classification, e.g. PolyPhen-2
Comparison of variant annotation tools
A large number of tools are available for variant annotation. The annotations produced by different tools do not always agree with each other, as the defined rules for data handling differ between applications. A perfect comparison of the available tools is not possible, because not all tools have the same input and output or the same functionality. Below is a table of major annotation tools and their functional area.
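Disagreement between tools can at least be quantified on the variants that both tools report. The sketch below assumes each tool's output has already been reduced to a mapping from a variant identifier to a single consequence label; the identifiers, labels and function name are hypothetical, and harmonizing real tool outputs to this form is the hard part.

```python
def compare_calls(calls_a, calls_b):
    """Return (agreement fraction, discordant variant IDs) over shared variants."""
    shared = sorted(set(calls_a) & set(calls_b))
    discordant = [v for v in shared if calls_a[v] != calls_b[v]]
    if not shared:
        return float("nan"), discordant
    return 1.0 - len(discordant) / len(shared), discordant


# Hypothetical consequence calls from two annotation tools.
tool_a = {"var1": "missense", "var2": "synonymous", "var3": "splice_site", "var4": "intronic"}
tool_b = {"var1": "missense", "var2": "synonymous", "var3": "intronic", "var4": "intronic", "var5": "missense"}

agreement, discordant = compare_calls(tool_a, tool_b)
print(agreement, discordant)
```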
Application
Different annotations capture diverse aspects of variant function. Simultaneous use of multiple, varied functional annotations could improve the power of rare variant association analysis in whole exome and whole genome sequencing studies. Some tools have been developed to enable functionally informed phenotype-genotype association analysis for common and rare variants by incorporating functional annotations in biobank-scale cohorts.
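One simple way annotations can enter a rare variant analysis is as per-variant weights in a burden score, where each individual's rare alleles in a gene are summed after weighting by predicted functional impact. The sketch below shows only that weighting step; the weights and genotypes are invented, and it is not the method of any particular tool mentioned in this article.

```python
def weighted_burden(genotypes, weights):
    """Annotation-weighted burden score for one individual.

    genotypes: alternative allele counts (0, 1 or 2) for each variant in a gene
    weights:   per-variant annotation weights, e.g. scaled deleteriousness in [0, 1]
    """
    if len(genotypes) != len(weights):
        raise ValueError("genotypes and weights must have the same length")
    return sum(g * w for g, w in zip(genotypes, weights))


# Hypothetical gene with three rare variants and their annotation weights.
weights = [0.9, 0.5, 0.1]
print(weighted_burden([0, 1, 2], weights))
print(weighted_burden([1, 0, 0], weights))
```

The resulting per-individual scores would then be tested for association with the phenotype, for example by regression, which is outside the scope of this sketch.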
Conclusions
The next generation of SNP annotation webservers can take advantage of the growing amount of data in core bioinformatics resources and use intelligent agents to fetch data from different sources as needed. From a user’s point of view, it is more efficient to submit a set of SNPs and receive results in a single step, which makes meta-servers the most attractive choice. However, if SNP annotation tools deliver heterogeneous data covering sequence, structure, regulation, pathways, etc., they must also provide frameworks for integrating the data into decision algorithms, together with quantitative confidence measures so that users can assess which data are relevant and which are not.