The human genome is a complete set of nucleic acid sequences for humans, encoded as DNA within the 23 chromosome pairs in cell nuclei and in a small DNA molecule found within individual mitochondria. These are usually treated separately as the nuclear genome and the mitochondrial genome. Human genomes include both protein-coding DNA sequences and various types of DNA that does not encode proteins. The latter is a diverse category that includes DNA coding for non-translated RNA, such as that for ribosomal RNA, transfer RNA, ribozymes, small nuclear RNAs, and several types of regulatory RNAs. It also includes promoters and their associated gene-regulatory elements, DNA playing structural and replicatory roles, such as scaffolding regions, telomeres, centromeres, and origins of replication, plus large numbers of transposable elements, inserted viral DNA, non-functional pseudogenes and simple, highly repetitive sequences. Introns make up a large percentage of non-coding DNA. Some of this non-coding DNA is non-functional junk DNA, such as pseudogenes, but there is no firm consensus on the total amount of junk DNA.

Haploid human genomes, which are contained in germ cells (the egg and sperm gamete cells created in the meiosis phase of sexual reproduction before fertilization), consist of 3,054,815,472 DNA base pairs (if the X chromosome is used), while female diploid genomes (found in somatic cells) have twice the DNA content. While there are significant differences among the genomes of human individuals (on the order of 0.1% due to single-nucleotide variants and 0.6% when considering indels), these are considerably smaller than the differences between humans and their closest living relatives, the bonobos and chimpanzees (~1.1% fixed single-nucleotide variants and 4% when including indels). Size in base pairs can vary too; telomere length decreases after every round of DNA replication.

Although the sequence of the human genome has been completely determined by DNA sequencing, it is not yet fully understood. Most, but not all, genes have been identified by a combination of high-throughput experimental and bioinformatics approaches, yet much work still needs to be done to further elucidate the biological functions of their protein and RNA products (in particular, annotation of the complete CHM13v2.0 sequence is still ongoing). Overlapping genes are quite common, in some cases allowing two protein-coding genes, one on each strand, to use the same base pairs (for example, genes DCDC2 and KAAG1). Recent results suggest that most of the vast quantities of noncoding DNA within the genome have associated biochemical activities, including regulation of gene expression, organization of chromosome architecture, and signals controlling epigenetic inheritance. There are also a significant number of retroviruses in human DNA, at least 3 of which have been proven to possess an important function (i.e., HIV-like HERV-K, HERV-W, and HERV-FRD play a role in placenta formation by inducing cell-cell fusion).

In 2003, scientists reported the sequencing of 85% of the entire human genome, but as of 2020 at least 8% was still missing. In 2021, scientists reported sequencing the complete female genome (i.e., without the Y chromosome). This sequence identified 19,969 protein-coding sequences, accounting for approximately 1.5% of the genome, and 63,494 genes in total, most of them being non-coding RNA genes. The genome consists of regulatory DNA sequences, LINEs, SINEs, introns, and sequences for which as yet no function has been determined. The human Y chromosome, consisting of 62,460,029 base pairs from a different cell line and found in all males, was sequenced completely in January 2022.


Sequencing

The first human genome sequences were published in nearly complete draft form in February 2001 by the Human Genome Project and Celera Corporation. Completion of the Human Genome Project's sequencing effort was announced in 2004 with the publication of a draft genome sequence, leaving just 341 gaps in the sequence, representing highly repetitive and other DNA that could not be sequenced with the technology available at the time. The human genome was the first of all vertebrates to be sequenced to such near-completion, and as of 2018, the diploid genomes of over a million individual humans had been determined using next-generation sequencing. These data are used worldwide in biomedical science, anthropology, forensics and other branches of science. Such genomic studies have led to advances in the diagnosis and treatment of diseases, and to new insights in many fields of biology, including human evolution.

By 2018, the total number of genes had been raised to at least 46,831, plus another 2,300 micro-RNA genes. A 2018 population survey found another 300 million bases of human genome that were not in the reference sequence. Prior to the acquisition of the full genome sequence, estimates of the number of human genes ranged from 50,000 to 140,000 (with occasional vagueness about whether these estimates included non-protein-coding genes). As genome sequence quality and the methods for identifying protein-coding genes improved, the count of recognized protein-coding genes dropped to 19,000-20,000.

In June 2016, scientists formally announced HGP-Write, a plan to synthesize the human genome. In 2022 the Telomere-to-Telomere (T2T) consortium reported the complete sequence of a human female genome, filling all the gaps in the X chromosome (2020) and the 22 autosomes (May 2021). The previously unsequenced parts contain immune response genes that help to adapt to and survive infections, as well as genes that are important for predicting drug response. The completed human genome sequence will also provide a better understanding of how a human forms as an individual organism and of how humans vary, both from each other and from other species.


Achieving completeness

Although the 'completion' of the human genome project was announced in 2001, there remained hundreds of gaps, with about 5–10% of the total sequence remaining undetermined. The missing genetic information was mostly in repetitive heterochromatic regions and near the centromeres and telomeres, but also in some gene-encoding euchromatic regions. There remained 160 euchromatic gaps in 2015 when the sequences spanning another 50 formerly unsequenced regions were determined. Only in 2020 was the first truly complete telomere-to-telomere sequence of a human chromosome determined, namely that of the X chromosome. The first complete telomere-to-telomere sequence of a human autosomal chromosome, chromosome 8, followed a year later. The complete human genome without the Y chromosome was published in 2021, and the sequence including the Y chromosome followed in January 2022.


Molecular organization and gene content

The human reference genome does not represent the sequence of any specific individual. The genome is organized into 22 paired chromosomes, termed autosomes, plus the 23rd pair of sex chromosomes: XX in the female and XY in the male. The haploid genome is 3,054,815,472 base pairs when the X chromosome is included, and 2,963,015,935 base pairs when the Y chromosome is substituted for the X chromosome. These chromosomes are all large linear DNA molecules contained within the cell nucleus. The genome also includes the mitochondrial DNA, a comparatively small circular molecule present in multiple copies in each mitochondrion.
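The haploid figures quoted above can be combined into approximate diploid totals. The following is a minimal sketch of that arithmetic only: the function name `diploid_length` is hypothetical, the two constants are the haploid lengths given in the text, and mitochondrial DNA and individual variation are ignored.

```python
# Haploid nuclear genome lengths in base pairs, as quoted in the text.
HAPLOID_WITH_X = 3_054_815_472
HAPLOID_WITH_Y = 2_963_015_935


def diploid_length(sex_chromosomes: str) -> int:
    """Approximate nuclear diploid length for 'XX' or 'XY'.

    Each letter selects one haploid set; mitochondrial DNA is ignored.
    """
    per_haploid = {"X": HAPLOID_WITH_X, "Y": HAPLOID_WITH_Y}
    return sum(per_haploid[c] for c in sex_chromosomes)


# The two haploid sets differ only in the sex chromosome they carry,
# so this is the difference in length between the X and Y chromosomes.
x_minus_y = HAPLOID_WITH_X - HAPLOID_WITH_Y

print(diploid_length("XX"))  # 6109630944
print(diploid_length("XY"))  # 6017831407
print(x_minus_y)             # 91799537
```

Under these assumptions, an XY diploid genome is shorter than an XX one by the X minus Y difference of roughly 92 million base pairs.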


Information content

The haploid human genome (23 chromosomes) is about 3 billion base pairs long and contains around 30,000 genes. Since every base pair can be coded by 2 bits, this is about 750 megabytes of data. An individual somatic (diploid) cell contains twice this amount, that is, about 6 billion base pairs. Males have fewer base pairs than females because the Y chromosome is about 57 million base pairs whereas the X is about 156 million. Since individual genomes vary in sequence by less than 1% from each other, the variations of a given human's genome from a common reference can be losslessly compressed to roughly 4 megabytes.

The entropy rate of the genome differs significantly between coding and non-coding sequences. It is close to the maximum of 2 bits per base pair for the coding sequences (about 45 million base pairs), but less for the non-coding parts. It ranges between 1.5 and 1.9 bits per base pair for the individual chromosomes, except for the Y chromosome, which has an entropy rate below 0.9 bits per base pair. These values were obtained using Lempel-Ziv estimators of entropy rate.
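The 2-bits-per-base-pair figure can be illustrated with a small packing routine. This is a sketch under stated assumptions, not a standard file format: the names `pack`, `unpack` and `packed_size_bytes` and the particular bit assignment for A, C, G and T are hypothetical, and ambiguity codes such as N are not handled.

```python
# Hypothetical 2-bit code for the four DNA bases (any fixed mapping works).
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= BASE_TO_BITS[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    return "".join(
        BITS_TO_BASE[(data[i // 4] >> (2 * (i % 4))) & 0b11]
        for i in range(length)
    )


def packed_size_bytes(n_bases: int) -> int:
    """Bytes needed at 2 bits per base, rounded up to a whole byte."""
    return (2 * n_bases + 7) // 8


packed = pack("GATTACA")
print(len(packed))                       # 2
print(unpack(packed, 7))                 # GATTACA
print(packed_size_bytes(3_000_000_000))  # 750000000
```

For a round 3 billion bases the size comes to 750,000,000 bytes, which matches the 750 megabyte estimate. The sequence length has to be stored separately, because the final byte may be only partly filled.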


Coding vs. noncoding DNA

The content of the human genome is commonly divided into coding and noncoding DNA sequences. Coding DNA is defined as those sequences that can be transcribed into mRNA and translated into proteins during the human life cycle; these sequences occupy only a small fraction of the genome (<2%). Noncoding DNA is made up of all of those sequences (approx. 98% of the genome) that are not used to encode proteins.

Some noncoding DNA contains genes for RNA molecules with important biological functions (noncoding RNA, for example ribosomal RNA and transfer RNA). The exploration of the function and evolutionary origin of noncoding DNA is an important goal of contemporary genome research, including the ENCODE (Encyclopedia of DNA Elements) project, which aims to survey the entire human genome, using a variety of experimental tools whose results are indicative of molecular activity. It is, however, disputed whether molecular activity (transcription of DNA into RNA) alone implies that the RNA produced has a meaningful biological function, since experiments have shown that random nonfunctional DNA will also reproducibly recruit transcription factors, resulting in transcription into nonfunctional RNA.

There is no consensus on what constitutes a "functional" element in the genome since geneticists, evolutionary biologists, and molecular biologists employ different definitions and methods. In evolutionary definitions, "functional" DNA, whether it is coding or non-coding, contributes to the fitness of the organism, and therefore is maintained by negative evolutionary pressure, whereas "non-functional" DNA has no benefit to the organism and therefore is under neutral selective pressure. This type of DNA has been described as junk DNA. In genetic definitions, "functional" DNA is related to how DNA segments manifest by phenotype and "nonfunctional" is related to loss-of-function effects on the organism. In biochemical definitions, "functional" DNA relates to DNA sequences that specify molecular products (e.g. noncoding RNAs) and biochemical activities with mechanistic roles in gene or genome regulation (i.e. DNA sequences that impact cellular level activity such as cell type, condition, and molecular processes).

There is no consensus in the literature on the amount of functional DNA: depending on how "function" is understood, estimates range from up to 90% of the human genome being nonfunctional (junk DNA) to up to 80% of the genome being functional. It is also possible that junk DNA may acquire a function in the future and therefore may play a role in evolution, but this is likely to occur only very rarely. Finally, DNA that is deleterious to the organism and is under negative selective pressure is called garbage DNA. Because non-coding DNA greatly outnumbers coding DNA, the concept of the sequenced genome has become a more focused analytical concept than the classical concept of the DNA-coding gene.


Coding sequences (protein-coding genes)

Protein-coding sequences represent the most widely studied and best understood component of the human genome. These sequences ultimately lead to the production of all human
protein Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residues. Proteins perform a vast array of functions within organisms, including catalysing metabolic reactions, DNA replication, respon ...
s, although several biological processes (e.g. DNA rearrangements and alternative pre-mRNA splicing) can lead to the production of many more unique proteins than the number of protein-coding genes. The complete modular protein-coding capacity of the genome is contained within the
exome The exome is composed of all of the exons within the genome, the sequences which, when transcribed, remain within the mature RNA after introns are removed by RNA splicing. This includes untranslated regions of messenger RNA (mRNA), and coding re ...
, and consists of DNA sequences encoded by
exon An exon is any part of a gene that will form a part of the final mature RNA produced by that gene after introns have been removed by RNA splicing. The term ''exon'' refers to both the DNA sequence within a gene and to the corresponding sequenc ...
s that can be translated into proteins. Because of its biological importance, and the fact that it constitutes less than 2% of the genome, sequencing of the exome was the first major milepost of the Human Genome Project. Number of protein-coding genes. About 20,000 human proteins have been annotated in databases such as
UniProt. Historically, estimates for the number of protein genes have varied widely, ranging up to 2,000,000 in the late 1960s, but several researchers pointed out in the early 1970s that the estimated mutational load from deleterious mutations placed an upper limit of approximately 40,000 for the total number of functional loci (this includes protein-coding and functional non-coding genes). The number of human protein-coding genes is not significantly larger than that of many less complex organisms, such as the roundworm and the fruit fly. The difference in complexity may result from the extensive use of alternative pre-mRNA splicing in humans, which provides the ability to build a very large number of modular proteins through the selective incorporation of exons.

Protein-coding capacity per chromosome. Protein-coding genes are distributed unevenly across the chromosomes, ranging from a few dozen to more than 2000 per chromosome, with an especially high
gene density within chromosomes 1, 11, and 19. Each chromosome contains various gene-rich and gene-poor regions, which may be correlated with chromosome bands and GC-content. The significance of these nonrandom patterns of gene density is not well understood.

Size of protein-coding genes. The size of protein-coding genes within the human genome shows enormous variability. For example, the gene for histone H1a (HIST1H1A) is relatively small and simple, lacking introns and encoding a 781-nucleotide-long mRNA that produces a 215-amino-acid protein from its 648-nucleotide open reading frame. Dystrophin (DMD) was the largest protein-coding gene in the 2001 human reference genome, spanning a total of 2.2 million nucleotides, while a more recent systematic meta-analysis of updated human genome data identified an even larger protein-coding gene, RBFOX1 (RNA binding protein, fox-1 homolog 1), spanning a total of 2.47 million nucleotides. Titin (TTN) has the longest coding sequence (114,414 nucleotides), the largest number of exons (363), and the longest single exon (17,106 nucleotides). As estimated from a curated set of protein-coding genes over the whole genome, the median gene size is 26,288 nucleotides (mean = 66,577), the median exon size is 133 nucleotides (mean = 309), the median number of exons is 8 (mean = 11), and the median encoded protein is 425 amino acids (mean = 553) in length.
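
The histone H1a figures quoted above can be checked with a little arithmetic. The following is a minimal sketch, assuming the 648-nucleotide open reading frame is counted from the start codon through the stop codon inclusive (an assumption, though it is the reading that reproduces the 215-residue protein). The function names are hypothetical and chosen only for illustration.

```python
def protein_length_from_orf(orf_nt: int) -> int:
    """Number of amino acids encoded by an ORF of `orf_nt` nucleotides.

    Assumes the ORF length includes the terminal stop codon, which is
    read by the ribosome but does not add an amino acid.
    """
    if orf_nt < 6 or orf_nt % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3 and at least 6")
    codons = orf_nt // 3      # three nucleotides per codon
    return codons - 1         # drop the stop codon


def untranslated_length(mrna_nt: int, orf_nt: int) -> int:
    """Combined length of the 5' and 3' untranslated regions of an mRNA."""
    if orf_nt > mrna_nt:
        raise ValueError("ORF cannot be longer than the mRNA")
    return mrna_nt - orf_nt


# Histone H1a values taken from the text above.
print(protein_length_from_orf(648))    # 215 amino acids
print(untranslated_length(781, 648))   # 133 untranslated nucleotides
```

Under this assumption, 648 nucleotides correspond to 216 codons, one of which is the stop codon, and the remaining 133 nucleotides of the mRNA are untranslated.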


Noncoding DNA (ncDNA)

Noncoding DNA is defined as all of the DNA sequences within a genome that are not found within protein-coding exons, and so are never represented within the amino acid sequence of expressed proteins. By this definition, more than 98% of the human genome is composed of ncDNA. Numerous classes of noncoding DNA have been identified, including genes for noncoding RNA (e.g. tRNA and rRNA), pseudogenes, introns, untranslated regions of mRNA, regulatory DNA sequences, repetitive DNA sequences, and sequences related to mobile genetic elements. Numerous sequences that are included within genes are also defined as noncoding DNA. These include genes for noncoding RNA (e.g. tRNA, rRNA), and untranslated components of protein-coding genes (e.g. introns, and 5' and 3' untranslated regions of mRNA). Protein-coding sequences (specifically, coding exons) constitute less than 1.5% of the human genome. In addition, about 26% of the human genome is introns. Aside from genes (exons and introns) and known regulatory sequences (8–20%), the human genome contains regions of noncoding DNA. The exact amount of noncoding DNA that plays a role in cell physiology has been hotly debated. An analysis by the ENCODE project indicates that 80% of the entire human genome is either transcribed, binds to regulatory proteins, or is associated with some other biochemical activity. However, it remains controversial whether all of this biochemical activity contributes to cell physiology, or whether a substantial portion of it is the result of transcriptional and biochemical noise, which must be actively filtered out by the organism. Excluding protein-coding sequences, introns, and regulatory regions, much of the remaining non-coding DNA falls into the classes listed above, such as pseudogenes, repetitive DNA sequences, and sequences related to mobile genetic elements.

Many DNA sequences that do not play a role in
gene expression have important biological functions. Comparative genomics studies indicate that about 5% of the genome contains sequences of noncoding DNA that are highly conserved, sometimes on time-scales representing hundreds of millions of years, implying that these noncoding regions are under strong evolutionary pressure and purifying selection. Many of these sequences regulate the structure of chromosomes by limiting the regions of heterochromatin formation and regulating structural features of the chromosomes, such as the telomeres and centromeres. Other noncoding regions serve as origins of DNA replication. Finally, several regions are transcribed into functional noncoding RNAs that regulate the expression of protein-coding genes, mRNA translation and stability (see miRNA), chromatin structure (including histone modifications), DNA methylation, and DNA recombination, and that cross-regulate other noncoding RNAs. It is also likely that many transcribed noncoding regions do not serve any role and that this transcription is the product of non-specific RNA polymerase activity.
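
The percentages quoted in this section can be combined into a rough budget for the rest of the genome. The sketch below is illustrative only: it assumes that coding exons, introns, and known regulatory sequences do not overlap (a simplification, since regulatory elements can lie inside introns) and it treats "less than 1.5%" as exactly 1.5%. The function name is hypothetical.

```python
def remaining_noncoding_pct(exons_pct: float,
                            introns_pct: float,
                            regulatory_pct: float) -> float:
    """Percentage of the genome left after subtracting three categories.

    Assumes the categories are disjoint, which is a simplification.
    """
    remainder = 100.0 - exons_pct - introns_pct - regulatory_pct
    if remainder < 0:
        raise ValueError("categories sum to more than 100%")
    return remainder


# Figures from the text: exons < 1.5%, introns about 26%, regulatory 8-20%.
low = remaining_noncoding_pct(1.5, 26.0, 20.0)   # regulatory at its upper bound
high = remaining_noncoding_pct(1.5, 26.0, 8.0)   # regulatory at its lower bound
print(f"remaining noncoding DNA: roughly {low}% to {high}% of the genome")
```

Under these assumptions, somewhat more than half of the genome falls outside exons, introns, and known regulatory sequences, which is the portion whose functional status is most debated.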


Pseudogenes

Pseudogenes are inactive copies of protein-coding genes, often generated by gene duplication, that have become nonfunctional through the accumulation of inactivating mutations. The number of pseudogenes in the human genome is on the order of 13,000, and in some chromosomes is nearly the same as the number of functional protein-coding genes. Gene duplication is a major mechanism through which new genetic material is generated during molecular evolution. For example, the olfactory receptor gene family is one of the best-documented examples of pseudogenes in the human genome. More than 60 percent of the genes in this family are non-functional pseudogenes in humans. By comparison, only 20 percent of genes in the mouse olfactory receptor gene family are pseudogenes. Research suggests that this is a species-specific characteristic, as the most closely related primates all have proportionally fewer pseudogenes. This genetic discovery helps to explain the less acute sense of smell in humans relative to other mammals.
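
To make "inactivating mutations" concrete, here is a toy sketch that flags a few simple features which would disrupt a coding sequence: a missing start codon, a length that is not a multiple of three (suggesting a frameshift), a premature stop codon, or a missing terminal stop codon. It is not how pseudogenes are annotated in practice. The function name and the example sequences are invented for illustration, and the check assumes the standard genetic code's three stop codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def inactivating_features(cds: str) -> list:
    """Return a list of simple problems found in a DNA coding sequence."""
    seq = cds.upper()
    problems = []
    if not seq.startswith("ATG"):
        problems.append("missing start codon")
    if len(seq) % 3 != 0:
        problems.append("length not a multiple of 3 (possible frameshift)")
    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    # A stop codon anywhere before the last codon truncates the protein.
    for index, codon in enumerate(codons[:-1], start=1):
        if codon in STOP_CODONS:
            problems.append(f"premature stop codon at codon {index}")
            break
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("missing terminal stop codon")
    return problems


intact = "ATGGCTTGGTAA"    # ATG GCT TGG TAA (hypothetical intact sequence)
mutated = "ATGGCTTGATAA"   # TGG -> TGA creates a premature stop
print(inactivating_features(intact))
print(inactivating_features(mutated))
```

A single base change (TGG to TGA) is enough to introduce a premature stop in this toy example, which illustrates how a duplicated gene copy can lose its protein-coding function.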


Genes for noncoding RNA (ncRNA)

Noncoding RNA molecules play many essential roles in cells, especially in the many reactions of protein synthesis and RNA processing