HIKESHI is a protein important in lung and multicellular organismal development that, in humans, is encoded by the ''HIKESHI'' gene.
HIKESHI is found on chromosome 11 in humans and chromosome 7 in mice. Similar sequences (orthologs) are found in most animal and fungal species. The mouse homolog, lethal gene on chromosome 7 Rinchik 6 protein, is encoded by the ''l7Rn6'' gene.
Gene
HIKESHI is a protein-coding gene in Homo sapiens. Alternate names for the gene are FLJ43020, HSPC138, HSPC179, and L7RN6. Located on the long arm of chromosome 11 at q14.2, the entire gene, including introns and exons, spans 42,698 base pairs on the plus strand. The mRNA of HIKESHI Variant 1 includes exons 1, 3, 4, 5, and 7, amounting to 1,183 base pairs, with base pairs 239 to 832 representing the coding region.
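As a quick consistency check on these coordinates, the length of the coding region can be derived from its start and end positions. The sketch below assumes the positions are 1-based and inclusive and that the last codon is a stop codon; the function name `coding_region_summary` is illustrative and not taken from any database tool.

```python
def coding_region_summary(cds_start: int, cds_end: int) -> dict:
    """Summarise a coding region from 1-based, inclusive mRNA coordinates (assumed convention)."""
    length_nt = cds_end - cds_start + 1
    if length_nt % 3 != 0:
        raise ValueError("coding region length is not a multiple of three")
    codons = length_nt // 3
    # Assumption: the final codon is a stop codon and encodes no residue.
    return {"length_nt": length_nt, "codons": codons, "amino_acids": codons - 1}


# Coordinates quoted in the text for HIKESHI Variant 1.
summary = coding_region_summary(239, 832)
print(summary)
```

Under these assumptions the coordinates give 594 nucleotides, or 198 codons, which is consistent with a 197-residue protein followed by a stop codon.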
Alternative Splicing
Variant 1 is the longest and most common protein-coding variant. The three other main variants use an alternate exon sequence that shifts the reading frame, causing premature termination, after which the transcript is expected to undergo decay. The table below shows the different variants and exon usage.
The four variants shown in the table above are the most common isoforms found in human cells. There are a total of 13 alternatively spliced sequences and three unspliced forms that utilize two alternative promoters. The mRNA variants differ in their combination of 8 different exons, alternate overlapping exons, and the retention of introns. Besides alternative splicing, the mRNAs differ by truncation at the 3' end. Variant 1 is one of ten mRNAs that have been shown to code for a protein, while the rest seem bound for nonsense-mediated mRNA decay.
AceView representation of C11orf73 isoforms
Promoter
The promoter region, GXP 47146, was found using the ElDorado tool from Genomatix. The 840 bp sequence is located upstream of the HIKESHI gene at positions 86012753 to 86013592. The promoter is conserved in 12 of 12 orthologs and is associated with 6 relevant transcripts.
Conserved transcription factor binding sites from Genomatix ElDorado tool:
Termination
Termination of the mRNA product is encoded within the cDNA of the gene. The 3' end of an mRNA product generally has three main features: the poly A signal, the poly A tail, and a region of sequence that can form a stem loop structure. The poly A signal is a highly conserved, six-nucleotide sequence. In eukaryotes the sequence is AATAAA, located about 10-30 nucleotides upstream of the poly A site, where it signals polyadenylation of the mRNA product. The poly A site for C11orf73 is GTA.
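The search for a poly A signal described above can be sketched in a few lines. This is a minimal illustration, not the method used to annotate HIKESHI: the toy sequence is invented, the function names are hypothetical, and measuring the 10-30 nucleotide window from the end of the hexamer is an assumption.

```python
def find_polya_signals(cdna: str, signal: str = "AATAAA") -> list[int]:
    """Return 0-based start positions of every occurrence of the signal hexamer."""
    seq = cdna.upper()
    positions = []
    start = seq.find(signal)
    while start != -1:
        positions.append(start)
        # Advance by one so overlapping occurrences are also reported.
        start = seq.find(signal, start + 1)
    return positions


def cleavage_window(signal_start: int, signal_len: int = 6) -> tuple[int, int]:
    """Half-open range 10-30 nt past the signal (assumed to be measured from the hexamer's end)."""
    signal_end = signal_start + signal_len
    return (signal_end + 10, signal_end + 30)


# Toy sequence for illustration only; it is not the HIKESHI 3' end.
toy = "GGCTAATAAAGTTTCCAGTCTGATTTGTAGC"
hits = find_polya_signals(toy)
print(hits, [cleavage_window(h) for h in hits])
```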
Gene expression
HIKESHI was determined to be expressed ubiquitously at a high level, 2.3 times the average. C11orf73 is expressed in a large number of human tissues.
Between the Expression Profiles and the EST Profile on UniGene, only 11 tissues were shown not to express C11orf73, most likely because of small sample sizes for those tissues.
Protein
The human HIKESHI gene encodes a protein called uncharacterized protein C11orf73.
The homologous mouse L7rn6 gene encodes a protein called lethal gene on chromosome 7 Rinchik 6.
1 mfgclvagrl vqtaaqqvae dkfvfdlpdy esinhvvvfm lgtipfpegm ggsvyfsypd
61 sngmpvwqll gfvtngkpsa ifkisglksg egsqhpfgam nivrtpsvaq igisvellds
121 maqqtpvgna avssvdsftq ftqkmldnfy nfassfavsq aqmtpspsem fipanvvlkw
181 yenfqrrlaq nplfwkt
The encoded human protein is 197 amino acids long and weighs 21,628 Daltons. Through analogy to the mouse protein, the hypothetical function of the human HIKESHI protein is the organization and function of the secretory apparatus in lung cells.
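The stated length can be checked directly against the sequence listing above. The sketch below assumes the listing follows the usual flat-file layout of position numbers followed by blocks of ten residues; `parse_origin_block` is an illustrative helper, not part of any library.

```python
import re

# Sequence block copied from the listing above (position numbers included).
SEQUENCE_BLOCK = """
        1 mfgclvagrl vqtaaqqvae dkfvfdlpdy esinhvvvfm lgtipfpegm ggsvyfsypd
       61 sngmpvwqll gfvtngkpsa ifkisglksg egsqhpfgam nivrtpsvaq igisvellds
      121 maqqtpvgna avssvdsftq ftqkmldnfy nfassfavsq aqmtpspsem fipanvvlkw
      181 yenfqrrlaq nplfwkt
"""


def parse_origin_block(text: str) -> str:
    """Drop position numbers and whitespace, returning the sequence in upper case."""
    return re.sub(r"[^A-Za-z]", "", text).upper()


protein = parse_origin_block(SEQUENCE_BLOCK)
print(len(protein))
```

The parsed sequence has 197 residues, matching the length given in the text.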
The protein domain known as DUF775 (Domain of Unknown Function 775) is located within both the human HIKESHI and mouse L7rn6 proteins. The DUF775 domain is 197 amino acids long, the same length as the protein. Other proteins that make up the DUF775 superfamily by definition include all the orthologs of C11orf73.
Hydropathy analysis shows that there are no extensive hydrophobic regions in the protein; hence, it is concluded that HIKESHI is a cytoplasmic protein. The isoelectric point for C11orf73 is 5.108, suggesting it functions optimally in a more acidic environment.
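A hydropathy analysis of this kind is commonly done with a sliding-window average over a per-residue hydrophobicity scale. The text does not say which scale or window was used, so the sketch below assumes the Kyte-Doolittle scale and a 19-residue window, a common choice when looking for membrane-spanning segments; the function name is illustrative.

```python
# Kyte-Doolittle hydropathy values (assumed scale; the source does not name one).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein: str, window: int = 19) -> list[float]:
    """Mean hydropathy of every full window along the sequence."""
    seq = protein.upper()
    if window < 1 or window > len(seq):
        raise ValueError("window must be between 1 and the sequence length")
    scores = [KYTE_DOOLITTLE[aa] for aa in seq]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


# Toy peptide for illustration; pass the full protein sequence in practice.
profile = hydropathy_profile("MFGCLVAGRLVQTAAQQVAEDKF", window=19)
print(len(profile), max(profile))
```

With this scale, sustained window averages well above zero (a threshold near 1.6 is often quoted) would point to a candidate membrane-spanning segment; the absence of such stretches is the kind of evidence the paragraph above relies on.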
SNP
The only SNP, or single-nucleotide polymorphism, for the C11orf73 sequence results in an amino acid change within the protein. The lack of other SNPs is most likely due to the high level of conservation of HIKESHI and the lethal effect that a mutation in the protein has on the organism. The phenotype for the SNP is unknown.
Gene Neighborhood
The surrounding genes of HIKESHI are CCDC81, ME3, and EED. The genetic neighborhood is examined to gain a better understanding of the possible function of the gene from the functions of the surrounding genes.
The CCDC81 gene codes for an uncharacterized protein product and is oriented on the plus strand. CCDC81 stands for coiled-coil domain containing 81 isoform 1.
The ME3 gene stands for mitochondrial malic enzyme 3 precursor. Malic enzyme catalyzes the oxidative decarboxylation of malate to pyruvate using either NAD+ or NADP+ as a cofactor. Mammalian tissues contain 3 distinct isoforms of malic enzyme: a cytosolic NADP(+)-dependent isoform, a mitochondrial NADP(+)-dependent isoform, and a mitochondrial NAD(+)-dependent isoform. This gene encodes a mitochondrial NADP(+)-dependent isoform. Multiple alternatively spliced transcript variants have been found for this gene, but the biological validity of some variants has not been determined.
The EED gene stands for embryonic ectoderm development isoform b and is a member of the Polycomb-group (PcG) family. PcG family members form multimeric protein complexes, which are involved in maintaining the transcriptional repressive state of genes over successive cell generations. This protein interacts with enhancer of zeste 2, the cytoplasmic tail of integrin beta7, human immunodeficiency virus type 1 (HIV-1) MA protein, and histone deacetylase proteins. This protein mediates repression of gene activity through histone deacetylation, and may act as a specific regulator of integrin function. Two transcript variants encoding distinct isoforms have been identified for this gene.
Interactions
The programs STRING and Sigma-Aldrich's Favorite Gene suggested possible protein interactions with C11orf73. ARGUL1, CRHBP, and EED were derived from textmining and HNF4A came from Sigma-Aldrich.
ARGUL1 is an unknown protein with an unknown function.
CRHBP is a corticotropin-releasing hormone binding protein which could possibly play a role in a signal cascade that involves or activates HIKESHI. EED, a neighboring protein of C11orf73, is an embryonic ectoderm development protein and is a member of the Polycomb-group (PcG) family. PcG family members form multimeric protein complexes, which are involved in maintaining the transcriptional repressive state of genes over successive cell generations. HNF4A is a transcription regulator, and it is unknown whether HNF4A regulates C11orf73's expression or simply interacts with it.
Evolutionary History
The evolutionary history of organisms can be determined using the sequences of orthologs as time references to create a phylogenetic tree. The CLUSTALW program (Julie D. Thompson, Desmond G. Higgins and Toby J. Gibson; http://workbench.sdsc.edu/) compares multiple sequences and can also be used to create such a phylogenetic tree based on the orthologs of C11orf73. The tree to the right shows the generated phylogenetic tree with a time line based on time of divergence. The tree made from the HIKESHI orthologs matches the phylogenetic tree reported in the literature, even grouping together similar organisms such as fish, birds, and fungi.
Orthologs
Homologous sequences are orthologous if they were separated by a speciation event: when a species diverges into two separate species, the divergent copies of a single gene in the resulting species are said to be orthologous. Orthologs, or orthologous genes, are genes in different species that are similar to each other because they originated from a common ancestor. Orthologous sequences provide useful information in taxonomic classification and phylogenetic studies of organisms. The pattern of genetic divergence can be used to trace the relatedness of organisms. Two organisms that are very closely related are likely to display very similar DNA sequences between two orthologs. Conversely, an organism that is further removed evolutionarily from another organism is likely to display a greater divergence in the sequence of the orthologs being studied.
Table of Chromosome 11 open reading frame 73 Orthologs
The table shows the 13 sequences (12 orthologs, 1 original sequence) along with protein name, accession numbers, nucleotide identity, protein identity, and E-values. The accession numbers are the identification numbers from the NCBI Protein database. The nucleotide sequence can be accessed from the protein's sequence page from DBSOURCE, which gives the accession number and is a link to the nucleotide's sequence page. The length of both the nucleotide and protein sequence for each ortholog and its respective organism are listed in the table as well. Next to the sequence lengths are the identities of the ortholog to the original HIKESHI gene. The identities and E-values were acquired using the global alignment program, ALIGN, from the SDSC Biology Workbench and BLAST from NCBI.
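The identity values described above are, in essence, the fraction of aligned positions at which two sequences agree. The sketch below assumes the two sequences have already been aligned (for example by a global alignment program) and are padded with `-` for gaps. Alignment tools differ in how they treat gap columns in the denominator, so the choice made here (count every column that is not gap-versus-gap) is an assumption and may not reproduce the table's figures exactly.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of alignment columns with identical residues (gap columns count as mismatches)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Ignore columns where both sequences carry a gap.
    columns = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if not (a == gap and b == gap)
    ]
    if not columns:
        raise ValueError("alignment has no comparable columns")
    matches = sum(1 for a, b in columns if a == b and a != gap)
    return 100.0 * matches / len(columns)


# Toy aligned fragments for illustration only.
print(percent_identity("MFGCLVAG", "MFGALVAG"))
```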
The graph shows the percent identity of each ortholog plotted against the divergence time of the organism, producing a mostly linear curve. The two main joints within the curve suggest times of gene duplication, around 450 million and 1,150 million years ago, respectively. The paralogs from these gene duplications are probably so dissimilar from the highly conserved orthologs of HIKESHI that they were not found using the Blink or BLAST tools.
The value m (the total number of amino acid changes that have occurred in a 100 amino acid segment) is the corrected value of n (the number of amino acid differences from the template sequence per 100 amino acids). It is used to calculate λ (the average number of amino acid changes per site per year, usually reported as λ × 10^9):
m/100 = –ln(1-n/100)
λ = (m/100)/(2*T)
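The two formulas above translate directly into code. In this sketch T is assumed to be the divergence time in years between the two species being compared, with the factor of 2 accounting for changes accumulating along both lineages; the input values are illustrative and are not taken from the ortholog table.

```python
import math


def corrected_changes(n: float) -> float:
    """m from n, using m/100 = -ln(1 - n/100)."""
    if not 0 <= n < 100:
        raise ValueError("n must be in the range [0, 100)")
    return -100.0 * math.log(1.0 - n / 100.0)


def substitution_rate(n: float, divergence_time_years: float) -> float:
    """lambda = (m/100) / (2*T), in changes per site per year."""
    if divergence_time_years <= 0:
        raise ValueError("divergence time must be positive")
    return (corrected_changes(n) / 100.0) / (2.0 * divergence_time_years)


# Illustrative inputs: 10 differences per 100 residues, 100 million years of divergence.
m = corrected_changes(10)
rate = substitution_rate(10, 100e6)
print(f"m = {m:.3f}, lambda x 10^9 = {rate * 1e9:.3f}")
```

The correction matters most for distant orthologs: as n grows, m grows faster than n because multiple changes at the same site are increasingly likely to be hidden in a simple count of differences.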
External links
* UCSC gene info: C11orf73