A nucleic acid sequence is a succession of bases within the

nucleotides Nucleotides are Organic compound, organic molecules composed of a nitrogenous base, a pentose sugar and a phosphate. They serve as monomeric units of the nucleic acid polymers – deoxyribonucleic acid (DNA) and ribonucleic acid (RNA), both o ...

forming

allele An allele is a variant of the sequence of nucleotides at a particular location, or Locus (genetics), locus, on a DNA molecule. Alleles can differ at a single position through Single-nucleotide polymorphism, single nucleotide polymorphisms (SNP), ...

s within a

DNA Deoxyribonucleic acid (; DNA) is a polymer composed of two polynucleotide chains that coil around each other to form a double helix. The polymer carries genetic instructions for the development, functioning, growth and reproduction of al ...

(using GACT) or

RNA Ribonucleic acid (RNA) is a polymeric molecule that is essential for most biological functions, either by performing the function itself (non-coding RNA) or by forming a template for the production of proteins (messenger RNA). RNA and deoxyrib ...

(GACU) molecule. This succession is denoted by a series of a set of five different letters that indicate the order of the nucleotides. By convention, sequences are usually presented from the 5' end to the 3' end. For DNA, with its double helix, there are two possible directions for the notated sequence; of these two, the sense strand is used. Because nucleic acids are normally linear (unbranched)

polymers A polymer () is a substance or material that consists of very large molecules, or macromolecules, that are constituted by many repeating subunits derived from one or more species of monomers. Due to their broad spectrum of properties, b ...

, specifying the sequence is equivalent to defining the

covalent A covalent bond is a chemical bond that involves the sharing of electrons to form electron pairs between atoms. These electron pairs are known as shared pairs or bonding pairs. The stable balance of attractive and repulsive forces between atom ...

structure of the entire molecule. For this reason, the nucleic acid sequence is also termed the

primary structure Protein primary structure is the linear sequence of amino acids in a peptide or protein. By convention, the primary structure of a protein is reported starting from the amino-terminal (N) end to the carboxyl-terminal (C) end. Protein biosynthe ...

. The sequence represents genetic information. Biological

deoxyribonucleic acid Deoxyribonucleic acid (; DNA) is a polymer composed of two polynucleotide chains that coil around each other to form a double helix. The polymer carries genetic instructions for the development, functioning, growth and reproduction of a ...

represents the

information Information is an Abstraction, abstract concept that refers to something which has the power Communication, to inform. At the most fundamental level, it pertains to the Interpretation (philosophy), interpretation (perhaps Interpretation (log ...

which directs the functions of an

organism An organism is any life, living thing that functions as an individual. Such a definition raises more problems than it solves, not least because the concept of an individual is also difficult. Many criteria, few of them widely accepted, have be ...

. Nucleic acids also have a

secondary structure Protein secondary structure is the local spatial conformation of the polypeptide backbone excluding the side chains. The two most common Protein structure#Secondary structure, secondary structural elements are alpha helix, alpha helices and beta ...

and

tertiary structure Protein tertiary structure is the three-dimensional shape of a protein. The tertiary structure will have a single polypeptide chain "backbone" with one or more protein secondary structures, the protein domains. Amino acid side chains and the ...

. Primary structure is sometimes mistakenly referred to as "primary sequence". However there is no parallel concept of secondary or tertiary sequence.

Nucleotides

Nucleic acids consist of a chain of linked units called nucleotides. Each nucleotide consists of three subunits: a

phosphate Phosphates are the naturally occurring form of the element phosphorus. In chemistry, a phosphate is an anion, salt, functional group or ester derived from a phosphoric acid. It most commonly means orthophosphate, a derivative of orthop ...

group and a

sugar Sugar is the generic name for sweet-tasting, soluble carbohydrates, many of which are used in food. Simple sugars, also called monosaccharides, include glucose Glucose is a sugar with the Chemical formula#Molecular formula, molecul ...

(

ribose Ribose is a simple sugar and carbohydrate with molecular formula C5H10O5 and the linear-form composition H−(C=O)−(CHOH)4−H. The naturally occurring form, , is a component of the ribonucleotides from which RNA is built, and so this comp ...

in the case of

deoxyribose Deoxyribose, or more precisely 2-deoxyribose, is a monosaccharide with idealized formula H−(C=O)−(CH2)−(CHOH)3−H. Its name indicates that it is a deoxy sugar, meaning that it is derived from the sugar ribose by loss of a hydroxy group. D ...

) make up the backbone of the nucleic acid strand, and attached to the sugar is one of a set of

nucleobase Nucleotide bases (also nucleobases, nitrogenous bases) are nitrogen-containing biological compounds that form nucleosides, which, in turn, are components of nucleotides, with all of these monomers constituting the basic building blocks of nuc ...

s. The nucleobases are important in

base pair A base pair (bp) is a fundamental unit of double-stranded nucleic acids consisting of two nucleobases bound to each other by hydrogen bonds. They form the building blocks of the DNA double helix and contribute to the folded structure of both DNA ...

ing of strands to form higher-level secondary and tertiary structures such as the famed

double helix In molecular biology, the term double helix refers to the structure formed by base pair, double-stranded molecules of nucleic acids such as DNA. The double Helix, helical structure of a nucleic acid complex arises as a consequence of its Nuclei ...

. The possible letters are ''A'', ''C'', ''G'', and ''T'', representing the four

nucleotide Nucleotides are Organic compound, organic molecules composed of a nitrogenous base, a pentose sugar and a phosphate. They serve as monomeric units of the nucleic acid polymers – deoxyribonucleic acid (DNA) and ribonucleic acid (RNA), both o ...

bases of a DNA strand –

adenine Adenine (, ) (nucleoside#List of nucleosides and corresponding nucleobases, symbol A or Ade) is a purine nucleotide base that is found in DNA, RNA, and Adenosine triphosphate, ATP. Usually a white crystalline subtance. The shape of adenine is ...

cytosine Cytosine () (symbol C or Cyt) is one of the four nucleotide bases found in DNA and RNA, along with adenine, guanine, and thymine ( uracil in RNA). It is a pyrimidine derivative, with a heterocyclic aromatic ring and two substituents attac ...

guanine Guanine () (symbol G or Gua) is one of the four main nucleotide bases found in the nucleic acids DNA and RNA, the others being adenine, cytosine, and thymine ( uracil in RNA). In DNA, guanine is paired with cytosine. The guanine nucleoside ...

thymine Thymine () (symbol T or Thy) is one of the four nucleotide bases in the nucleic acid of DNA that are represented by the letters G–C–A–T. The others are adenine, guanine, and cytosine. Thymine is also known as 5-methyluracil, a pyrimidine ...

–

ly linked to a phosphodiester backbone. In the typical case, the sequences are printed abutting one another without gaps, as in the sequence AAAGTCTGAC, read left to right in the 5' to 3' direction. With regards to transcription, a sequence is on the coding strand if it has the same order as the transcribed RNA. One sequence can be complementary to another sequence, meaning that they have the base on each position in the complementary (i.e., A to T, C to G) and in the reverse order. For example, the complementary sequence to TTAC is GTAA. If one strand of the double-stranded DNA is considered the sense strand, then the other strand, considered the antisense strand, will have the complementary sequence to the sense strand.

Notation

While A, T, C, and G represent a particular nucleotide at a position, there are also letters that represent ambiguity which are used when more than one kind of nucleotide could occur at that position. The rules of the International Union of Pure and Applied Chemistry (

IUPAC The International Union of Pure and Applied Chemistry (IUPAC ) is an international federation of National Adhering Organizations working for the advancement of the chemical sciences, especially by developing nomenclature and terminology. It is ...

) are as follows: For example, W means that either an adenine or a thymine could occur in that position without impairing the sequence's functionality. These symbols are also valid for RNA, except with U (uracil) replacing T (thymine). Apart from adenine (A), cytosine (C), guanine (G), thymine (T) and uracil (U), DNA and RNA also contain bases that have been modified after the nucleic acid chain has been formed. In DNA, the most common modified base is 5-methylcytidine (m5C). In RNA, there are many modified bases, including pseudouridine (Ψ), dihydrouridine (D), inosine (I), ribothymidine (rT) and 7-methylguanosine (m7G).

Hypoxanthine Hypoxanthine is a naturally occurring purine derivative. It is occasionally found as a constituent of nucleic acids, where it is present in the anticodon of tRNA in the form of its nucleoside inosine. It has a tautomer known as 6-hydroxypurine. Hyp ...

and xanthine are two of the many bases created through

mutagen In genetics, a mutagen is a physical or chemical agent that permanently changes genetic material, usually DNA, in an organism and thus increases the frequency of mutations above the natural background level. As many mutations can cause cancer in ...

presence, both of them through deamination (replacement of the amine-group with a carbonyl-group). Hypoxanthine is produced from

, and xanthine is produced from

. Similarly, deamination of

results in

uracil Uracil () (nucleoside#List of nucleosides and corresponding nucleobases, symbol U or Ura) is one of the four nucleotide bases in the nucleic acid RNA. The others are adenine (A), cytosine (C), and guanine (G). In RNA, uracil binds to adenine via ...

. ;Example of comparing and determining the % difference between two nucleotide sequences * AATCCGCTAG * AAACCCTTAG Given the two 10-nucleotide sequences, line them up and compare the differences between them. Calculate the percent difference by taking the number of differences between the DNA bases divided by the total number of nucleotides. In this case there are three differences in the 10 nucleotide sequence. Thus there is a 30% difference.

Biological significance

In biological systems, nucleic acids contain information which is used by a living cell to construct specific

protein Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residue (biochemistry), residues. Proteins perform a vast array of functions within organisms, including Enzyme catalysis, catalysing metab ...

s. The sequence of

s on a nucleic acid strand is translated by cell machinery into a sequence of

amino acid Amino acids are organic compounds that contain both amino and carboxylic acid functional groups. Although over 500 amino acids exist in nature, by far the most important are the 22 α-amino acids incorporated into proteins. Only these 22 a ...

s making up a protein strand. Each group of three bases, called a

codon Genetic code is a set of rules used by living cells to translate information encoded within genetic material (DNA or RNA sequences of nucleotide triplets or codons) into proteins. Translation is accomplished by the ribosome, which links prote ...

, corresponds to a single amino acid, and there is a specific

genetic code Genetic code is a set of rules used by living cell (biology), cells to Translation (biology), translate information encoded within genetic material (DNA or RNA sequences of nucleotide triplets or codons) into proteins. Translation is accomplished ...

by which each possible combination of three bases corresponds to a specific amino acid. The central dogma of molecular biology outlines the mechanism by which proteins are constructed using information contained in nucleic acids.

is transcribed into

mRNA In molecular biology, messenger ribonucleic acid (mRNA) is a single-stranded molecule of RNA that corresponds to the genetic sequence of a gene, and is read by a ribosome in the process of Protein biosynthesis, synthesizing a protein. mRNA is ...

molecules, which travel to the

ribosome Ribosomes () are molecular machine, macromolecular machines, found within all cell (biology), cells, that perform Translation (biology), biological protein synthesis (messenger RNA translation). Ribosomes link amino acids together in the order s ...

where the mRNA is used as a template for the construction of the protein strand. Since nucleic acids can bind to molecules with complementary sequences, there is a distinction between "

sense A sense is a biological system used by an organism for sensation, the process of gathering information about the surroundings through the detection of Stimulus (physiology), stimuli. Although, in some cultures, five human senses were traditio ...

" sequences which code for proteins, and the complementary "antisense" sequence, which is by itself nonfunctional, but can bind to the sense strand.

Sequence determination

DNA sequencing is the process of determining the

sequence of a given

fragment. The sequence of the DNA of a living thing encodes the necessary information for that living thing to survive and reproduce. Therefore, determining the sequence is useful in fundamental research into why and how organisms live, as well as in applied subjects. Because of the importance of DNA to living things, knowledge of a DNA sequence may be useful in practically any biological

research Research is creative and systematic work undertaken to increase the stock of knowledge. It involves the collection, organization, and analysis of evidence to increase understanding of a topic, characterized by a particular attentiveness to ...

. For example, in

medicine Medicine is the science and Praxis (process), practice of caring for patients, managing the Medical diagnosis, diagnosis, prognosis, Preventive medicine, prevention, therapy, treatment, Palliative care, palliation of their injury or disease, ...

it can be used to identify, diagnose and potentially develop treatments for

genetic disease A genetic disorder is a health problem caused by one or more abnormalities in the genome. It can be caused by a mutation in a single gene (monogenic) or multiple genes (polygenic) or by a chromosome abnormality. Although polygenic disorders are ...

s. Similarly, research into

pathogens In biology, a pathogen (, "suffering", "passion" and , "producer of"), in the oldest and broadest sense, is any organism or agent that can produce disease. A pathogen may also be referred to as an infectious agent, or simply a germ. The term ...

may lead to treatments for contagious diseases.

Biotechnology Biotechnology is a multidisciplinary field that involves the integration of natural sciences and Engineering Science, engineering sciences in order to achieve the application of organisms and parts thereof for products and services. Specialists ...

is a burgeoning discipline, with the potential for many useful products and services. RNA is not sequenced directly. Instead, it is copied to a DNA by

reverse transcriptase A reverse transcriptase (RT) is an enzyme used to convert RNA genome to DNA, a process termed reverse transcription. Reverse transcriptases are used by viruses such as HIV and hepatitis B to replicate their genomes, by retrotransposon mobi ...

, and this DNA is then sequenced. Current sequencing methods rely on the discriminatory ability of DNA polymerases, and therefore can only distinguish four bases. An inosine (created from adenosine during RNA editing) is read as a G, and 5-methyl-cytosine (created from cytosine by

DNA methylation DNA methylation is a biological process by which methyl groups are added to the DNA molecule. Methylation can change the activity of a DNA segment without changing the sequence. When located in a gene promoter (genetics), promoter, DNA methylati ...

) is read as a C. With current technology, it is difficult to sequence small amounts of DNA, as the signal is too weak to measure. This is overcome by

polymerase chain reaction The polymerase chain reaction (PCR) is a method widely used to make millions to billions of copies of a specific DNA sample rapidly, allowing scientists to amplify a very small sample of DNA (or a part of it) sufficiently to enable detailed st ...

(PCR) amplification.

Digital representation

Once a nucleic acid sequence has been obtained from an organism, it is stored ''

in silico In biology and other experimental sciences, an ''in silico'' experiment is one performed on a computer or via computer simulation software. The phrase is pseudo-Latin for 'in silicon' (correct ), referring to silicon in computer chips. It was c ...

'' in digital format. Digital genetic sequences may be stored in

sequence database In the field of bioinformatics, a sequence database is a type of biological database that is composed of a large collection of computerized ("Digital data, digital") nucleic acid sequences, protein sequences, or other polymer sequences stored on a ...

s, be analyzed (see ''Sequence analysis'' below), be digitally altered and be used as templates for creating new actual DNA using artificial gene synthesis.

Sequence analysis

Digital genetic sequences may be analyzed using the tools of

bioinformatics Bioinformatics () is an interdisciplinary field of science that develops methods and Bioinformatics software, software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, ...

to attempt to determine its function.

Genetic testing

The DNA in an organism's

genome A genome is all the genetic information of an organism. It consists of nucleotide sequences of DNA (or RNA in RNA viruses). The nuclear genome includes protein-coding genes and non-coding genes, other functional regions of the genome such as ...

can be analyzed to diagnose vulnerabilities to inherited

disease A disease is a particular abnormal condition that adversely affects the structure or function (biology), function of all or part of an organism and is not immediately due to any external injury. Diseases are often known to be medical condi ...

s, and can also be used to determine a child's paternity (genetic father) or a person's

ancestry An ancestor, also known as a forefather, fore-elder, or a forebear, is a parent or ( recursively) the parent of an antecedent (i.e., a grandparent, great-grandparent, great-great-grandparent and so forth). ''Ancestor'' is "any person from ...

. Normally, every person carries two variations of every

gene In biology, the word gene has two meanings. The Mendelian gene is a basic unit of heredity. The molecular gene is a sequence of nucleotides in DNA that is transcribed to produce a functional RNA. There are two types of molecular genes: protei ...

, one inherited from their mother, the other inherited from their father. The

human genome The human genome is a complete set of nucleic acid sequences for humans, encoded as the DNA within each of the 23 distinct chromosomes in the cell nucleus. A small DNA molecule is found within individual Mitochondrial DNA, mitochondria. These ar ...

is believed to contain around 20,000–25,000 genes. In addition to studying

chromosome A chromosome is a package of DNA containing part or all of the genetic material of an organism. In most chromosomes, the very long thin DNA fibers are coated with nucleosome-forming packaging proteins; in eukaryotic cells, the most import ...

s to the level of individual genes, genetic testing in a broader sense includes

biochemical Biochemistry, or biological chemistry, is the study of chemical processes within and relating to living organisms. A sub-discipline of both chemistry and biology, biochemistry may be divided into three fields: structural biology, enzymology, ...

tests for the possible presence of

s, or mutant forms of genes associated with increased risk of developing genetic disorders. Genetic testing identifies changes in chromosomes, genes, or proteins. Usually, testing is used to find changes that are associated with inherited disorders. The results of a genetic test can confirm or rule out a suspected genetic condition or help determine a person's chance of developing or passing on a genetic disorder. Several hundred genetic tests are currently in use, and more are being developed.

Sequence alignment

In bioinformatics, a sequence alignment is a way of arranging the sequences of

, or

to identify regions of similarity that may be due to functional,

structural A structure is an arrangement and organization of interrelated elements in a material object or system, or the object or system so organized. Material structures include man-made objects such as buildings and machines and natural objects such as ...

, or

evolution Evolution is the change in the heritable Phenotypic trait, characteristics of biological populations over successive generations. It occurs when evolutionary processes such as natural selection and genetic drift act on genetic variation, re ...

ary relationships between the sequences. If two sequences in an alignment share a common ancestor, mismatches can be interpreted as

point mutation A point mutation is a genetic mutation where a single nucleotide base is changed, inserted or deleted from a DNA or RNA sequence of an organism's genome. Point mutations have a variety of effects on the downstream protein product—consequences ...

s and gaps as insertion or deletion mutations (

indel Indel (insertion-deletion) is a molecular biology term for an insertion or deletion of bases in the genome of an organism. Indels ≥ 50 bases in length are classified as structural variants. In coding regions of the genome, unless the lengt ...

s) introduced in one or both lineages in the time since they diverged from one another. In sequence alignments of proteins, the degree of similarity between

s occupying a particular position in the sequence can be interpreted as a rough measure of how conserved a particular region or

sequence motif In biology, a sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and usually assumed to be related to biological function of the macromolecule. For example, an ''N''-glycosylation site motif can be defined as ''A ...

is among lineages. The absence of substitutions, or the presence of only very conservative substitutions (that is, the substitution of amino acids whose

side chain In organic chemistry and biochemistry, a side chain is a substituent, chemical group that is attached to a core part of the molecule called the "main chain" or backbone chain, backbone. The side chain is a hydrocarbon branching element of a mo ...

s have similar biochemical properties) in a particular region of the sequence, suggest that this region has structural or functional importance. Although DNA and RNA

bases are more similar to each other than are amino acids, the conservation of base pairs can indicate a similar functional or structural role.

Computational phylogenetics Computational phylogenetics, phylogeny inference, or phylogenetic inference focuses on computational and optimization algorithms, Heuristic (computer science), heuristics, and approaches involved in Phylogenetics, phylogenetic analyses. The goal i ...

makes extensive use of sequence alignments in the construction and interpretation of

phylogenetic tree A phylogenetic tree or phylogeny is a graphical representation which shows the evolutionary history between a set of species or taxa during a specific time.Felsenstein J. (2004). ''Inferring Phylogenies'' Sinauer Associates: Sunderland, MA. In ...

s, which are used to classify the evolutionary relationships between homologous genes represented in the genomes of divergent species. The degree to which sequences in a query set differ is qualitatively related to the sequences' evolutionary distance from one another. Roughly speaking, high sequence identity suggests that the sequences in question have a comparatively young

most recent common ancestor A most recent common ancestor (MRCA), also known as a last common ancestor (LCA), is the most recent individual from which all organisms of a set are inferred to have descended. The most recent common ancestor of a higher taxon is generally assu ...

, while low identity suggests that the divergence is more ancient. This approximation, which reflects the "

molecular clock The molecular clock is a figurative term for a technique that uses the mutation rate of biomolecules to deduce the time in prehistory when two or more life forms diverged. The biomolecular data used for such calculations are usually nucleot ...

" hypothesis that a roughly constant rate of evolutionary change can be used to extrapolate the elapsed time since two genes first diverged (that is, the coalescence time), assumes that the effects of mutation and

selection Selection may refer to: Science * Selection (biology), also called natural selection, selection in evolution ** Sex selection, in genetics ** Mate selection, in mating ** Sexual selection in humans, in human sexuality ** Human mating strat ...

are constant across sequence lineages. Therefore, it does not account for possible differences among organisms or species in the rates of

DNA repair DNA repair is a collection of processes by which a cell (biology), cell identifies and corrects damage to the DNA molecules that encode its genome. A weakened capacity for DNA repair is a risk factor for the development of cancer. DNA is cons ...

or the possible functional conservation of specific regions in a sequence. (In the case of nucleotide sequences, the molecular clock hypothesis in its most basic form also discounts the difference in acceptance rates between

silent mutation Silent mutations, also called synonymous or samesense mutations, are mutations in DNA that do not have an observable effect on the organism's phenotype. The phrase ''silent mutation'' is often used interchangeably with the phrase '' synonymous mut ...

s that do not alter the meaning of a given

and other mutations that result in a different

being incorporated into the protein.) More statistically accurate methods allow the evolutionary rate on each branch of the phylogenetic tree to vary, thus producing better estimates of coalescence times for genes.

Sequence motifs

Frequently the primary structure encodes motifs that are of functional importance. Some examples of sequence motifs are: the C/D and H/ACA boxes of snoRNAs, Sm binding site found in spliceosomal RNAs such as U1, U2, U4, U5, U6, U12 and U3, the Shine-Dalgarno sequence, the

Kozak consensus sequence The Kozak consensus sequence (Kozak consensus or Kozak sequence) is a Nucleic acid sequence, nucleic acid motif that functions as the protein Translation (biology), translation initiation site in most eukaryotic Messenger RNA, mRNA transcripts. Reg ...

and the RNA polymerase III terminator.

Sequence entropy

, a sequence entropy, also known as sequence complexity or information profile, is a numerical sequence providing a quantitative measure of the local complexity of a DNA sequence, independently of the direction of processing. The manipulations of the information profiles enable the analysis of the sequences using alignment-free techniques, such as for example in motif and rearrangements detection.

References

External links

A bibliography on features, patterns, correlations in DNA and protein texts
{{DEFAULTSORT:Nucleic Acid Sequence DNA Molecular biology Nucleic acids RNA