Shapiro–Senapathy Algorithm

The Shapiro–Senapathy algorithm (S&S) is an algorithm for predicting splice junctions in the genes of animals and plants. It has been used to discover disease-causing splice site mutations and cryptic splice sites.


The algorithm

A splice site is the border between an exon and an intron in a gene. These sites contain a particular sequence motif, which is necessary for recognition and processing by the RNA splicing machinery. The S&S algorithm uses sliding windows of eight nucleotides, corresponding to the length of the splice site sequence motif, to identify these conserved sequences and thus potential splice sites. Using a weighted table of nucleotide frequencies, the S&S algorithm outputs a consensus-based percentage expressing how likely the window is to contain a splice site. The S&S algorithm serves as the basis of other software tools, such as Human Splicing Finder, Splice-site Analyzer Tool, dbass (Ensembl), Alamut, and SROOGLE.
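
The sliding-window scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the published implementation: the frequency table FREQ holds made-up placeholder percentages rather than the real S&S weights, the names window_score and scan are hypothetical, the 70% threshold is arbitrary, and the min-max normalization used to obtain a percentage is an assumption about how the consensus-based score is formed.

```python
# Hypothetical position frequency table (percent) for an 8-nucleotide motif.
# The numbers are illustrative placeholders, NOT the published S&S weights.
FREQ = [
    {"A": 30, "C": 40, "G": 20, "T": 10},
    {"A": 60, "C": 10, "G": 15, "T": 15},
    {"A": 10, "C": 5, "G": 80, "T": 5},
    {"A": 0, "C": 0, "G": 100, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 100},
    {"A": 60, "C": 5, "G": 30, "T": 5},
    {"A": 70, "C": 10, "G": 10, "T": 10},
    {"A": 5, "C": 5, "G": 85, "T": 5},
]
WINDOW = len(FREQ)  # eight nucleotides, as in the text

# Best and worst achievable sums, used to rescale a raw sum to 0..100.
MAX_T = sum(max(col.values()) for col in FREQ)
MIN_T = sum(min(col.values()) for col in FREQ)


def window_score(window):
    """Return a 0..100 consensus score for one window of WINDOW bases."""
    total = sum(col[base] for col, base in zip(FREQ, window))
    return 100.0 * (total - MIN_T) / (MAX_T - MIN_T)


def scan(sequence, threshold=70.0):
    """Slide the window along the sequence and keep windows at or above threshold."""
    seq = sequence.upper()
    hits = []
    for i in range(len(seq) - WINDOW + 1):
        window = seq[i:i + WINDOW]
        if set(window) <= set("ACGT"):  # skip windows with ambiguous bases
            score = window_score(window)
            if score >= threshold:
                hits.append((i, window, score))
    return hits


if __name__ == "__main__":
    for pos, window, score in scan("TTTTCAGGTAAGTTTT"):
        print(pos, window, round(score, 1))
```

With this placeholder table, a window that matches the most frequent base at every position scores 100, and a window that matches the least frequent base everywhere scores 0. Swapping in a real weight table would change the numbers but not the structure of the scan.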


Cancer gene discovery using S&S

By using the S&S algorithm, mutations and genes that cause many different forms of cancer have been discovered. For example, genes causing commonly occurring cancers including breast cancer, ovarian cancer, colorectal cancer, leukemia, head and neck cancers, prostate cancer, retinoblastoma, squamous cell carcinoma, gastrointestinal cancer, melanoma, liver cancer, Lynch syndrome, skin cancer, and neurofibromatosis have been found. In addition, splicing mutations in genes causing less commonly known cancers including gastric cancer, gangliogliomas, Li-Fraumeni syndrome, Loeys–Dietz syndrome, osteochondromas (bone tumors), nevoid basal cell carcinoma syndrome, and pheochromocytomas have been identified. Specific mutations in different splice sites in various genes causing breast cancer (e.g., BRCA1, PALB2), ovarian cancer (e.g., SLC9A3R1, COL7A1, HSD17B7), colon cancer (e.g., APC, MLH1, DPYD), colorectal cancer (e.g., COL3A1, APC, HLA-A), skin cancer (e.g., COL17A1, XPA, POLH), and Fanconi anemia (e.g., FANC, FANA) have been uncovered. The mutations in the donor and acceptor splice sites in different genes causing a variety of cancers that have been identified by S&S are shown in Table 1.


Discovery of genes causing inherited disorders using S&S

Specific mutations in different splice sites in various genes that cause inherited disorders have been uncovered, including, for example, Type 1 diabetes (e.g., PTPN22, TCF1 (HCF-1A)), hypertension (e.g., LDL, LDLR, LPL), Marfan syndrome (e.g., FBN1, TGFBR2, FBN2), cardiac diseases (e.g., COL1A2, MYBPC3, ACTC1), and eye disorders (e.g., EVC, VSX1). A few example mutations in the donor and acceptor splice sites in different genes causing a variety of inherited disorders identified using S&S are shown in Table 2.


Genes causing immune system disorders

More than 100 immune system disorders affect humans, including inflammatory bowel diseases, multiple sclerosis, systemic lupus erythematosus, Bloom syndrome, familial cold autoinflammatory syndrome, and dyskeratosis congenita. The Shapiro–Senapathy algorithm has been used to discover genes and mutations involved in many immune disorders, including ataxia telangiectasia, B-cell defects, epidermolysis bullosa, and X-linked agammaglobulinemia.

Xeroderma pigmentosum, an autosomal recessive disorder, is caused by faulty proteins that arise from a new preferred splice donor site, identified using the S&S algorithm, and that result in defective nucleotide excision repair. Type I Bartter syndrome (BS) is caused by mutations in the gene SLC12A1. The S&S algorithm helped reveal two novel heterozygous mutations, c.724+4A>G in intron 5 and c.2095delG in intron 16, leading to complete skipping of exon 5.

Mutations in the MYH gene, which is responsible for removing oxidatively damaged DNA lesions, make individuals susceptible to cancer. The IVS1+5C variant plays a causative role in the activation of a cryptic splice donor site and in the alternative splicing of intron 1. The S&S algorithm shows that guanine (G) at position IVS+5 is well conserved (at a frequency of 84%) among primates. This supports the finding that the G/C SNP in the conserved splice junction of the MYH gene causes the alternative splicing of intron 1 of the β type transcript.

Splice site scores calculated according to S&S were also used in the study of EBV infection in X-linked lymphoproliferative disease. Familial tumoral calcinosis (FTC), an autosomal recessive disorder characterized by ectopic calcifications and elevated serum phosphate levels, was likewise found to result from aberrant splicing.


Application of S&S in hospitals for clinical practice and research

Applying the S&S technology platform in modern clinical genomics research has advanced the diagnosis and treatment of human diseases. In the modern era of Next Generation Sequencing (NGS) technology, S&S is applied extensively in clinical practice. Clinicians and molecular diagnostic laboratories apply S&S using various computational tools including HSF, SSF, and Alamut. It aids in the discovery of genes and mutations in patients whose disease is stratified, or whose disease remains unknown after clinical investigation. In this context, S&S has been applied to cohorts of patients in different ethnic groups with various cancers and inherited disorders. A few examples are given below.


Cancers


Inherited disorders


S&S - the first algorithm for identifying splice sites, exons and split genes

Dr. Senapathy's original objective in developing a method for identifying splice sites was to find complete genes in raw, uncharacterized genomic sequence that could be used in the human genome project. In the landmark paper with this objective, he described for the first time the basic method for identifying the splice sites within a given sequence, based on the Position Weight Matrix (PWM) of the splicing sequences in different eukaryotic organism groups. He also created the first exon detection method by defining the basic characteristics of an exon as a sequence bounded by an acceptor and a donor splice site with S&S scores above a threshold, and containing a mandatory ORF. An algorithm for finding complete genes based on the identified exons was also described by Dr. Senapathy for the first time. Dr. Senapathy demonstrated that only deleterious mutations in the donor or acceptor splice sites, which would make the protein drastically defective, reduce the splice site score (later known as the Shapiro–Senapathy score), while other, non-deleterious variations do not reduce the score. The S&S method was adapted for researching the cryptic splice sites caused by mutations leading to diseases. This method for detecting deleterious splicing mutations in eukaryotic genes has been used extensively in disease research in humans, animals and plants over the past three decades, as described above. The basic method for splice site identification, and for defining exons and genes, was subsequently used by researchers in finding splice sites, exons and eukaryotic genes in a variety of organisms. These methods also formed the basis of all subsequent tool development for discovering genes in uncharacterized genomic sequences. They were also used in different computational approaches, including machine learning and neural networks, and in alternative splicing research.
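
The exon definition given above (an acceptor and a donor site that both score above a threshold, with an open reading frame between them) can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the function names has_open_frame and candidate_exons, the coordinate convention, the threshold of 70, the minimum length, and the simplification that an ORF means "at least one forward frame without a stop codon". The splice site scores are taken as precomputed inputs.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_open_frame(segment):
    """True if at least one of the three forward frames contains no stop codon."""
    for frame in range(3):
        codons = (segment[i:i + 3] for i in range(frame, len(segment) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            return True
    return False


def candidate_exons(seq, acceptors, donors, threshold=70.0, min_len=6):
    """Pair scored acceptor and donor sites into candidate exons.

    acceptors: list of (index of first exon base, score)
    donors:    list of (index one past the last exon base, score)
    Both conventions are assumptions made for this sketch.
    """
    exons = []
    for a_pos, a_score in acceptors:
        if a_score < threshold:
            continue
        for d_pos, d_score in donors:
            if d_score < threshold or d_pos - a_pos < min_len:
                continue
            if has_open_frame(seq[a_pos:d_pos]):
                exons.append((a_pos, d_pos))
    return exons


if __name__ == "__main__":
    # Toy sequence: intron end, a 12-base exon, intron start.
    seq = "TTTCAG" + "ATGGCCAAAGGG" + "GTAAGT"
    print(candidate_exons(seq, acceptors=[(6, 85.0)], donors=[(18, 90.0)]))
```

A full gene finder would go on to chain compatible exons together; this sketch stops at the exon level.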


Discovering the mechanisms of aberrant splicing in diseases

The Shapiro–Senapathy algorithm has been used to determine the various aberrant splicing mechanisms in genes due to deleterious mutations in the splice sites, which cause numerous diseases. Deleterious splice site mutations impair the normal splicing of the gene transcripts, and thereby make the encoded protein defective. A mutant splice site can become “weak” compared to the original site, so that the mutated splice junction is no longer recognized by the spliceosomal machinery. This can lead to the skipping of the exon in the splicing reaction, resulting in the loss of that exon in the spliced mRNA (exon skipping). Alternatively, a partial or complete intron can be included in the mRNA when a mutation makes a splice site unrecognizable (intron inclusion). Partial exon skipping or intron inclusion can lead to premature termination of the protein translated from the mRNA, yielding a defective protein and, in turn, disease. S&S has thus paved the way to determine the mechanisms by which a deleterious mutation could lead to a defective protein, resulting in different diseases depending on which gene is affected.


Examples of splicing aberrations

An example of a splicing aberration (exon skipping) caused by a mutation in the donor splice site of exon 8 of the MLH1 gene, which led to colorectal cancer, is given below. This example shows that a mutation in a splice site within a gene can have a profound effect on the sequence and structure of the mRNA, and on the sequence, structure and function of the encoded protein, leading to disease.


S&S in cryptic splice sites research and medical applications

The identification of splice sites has to be highly precise, as the consensus splice sequences are very short and gene sequences contain many other sequences similar to the authentic splice sites, which are known as cryptic, non-canonical, or pseudo splice sites. When an authentic splice site is mutated, a cryptic splice site close to the original site can be erroneously used as the authentic site, resulting in an aberrant mRNA. The erroneous mRNA may include a partial sequence from the neighboring intron or lose part of an exon, which may result in a premature stop codon. The result may be a truncated protein that has lost its function completely. The Shapiro–Senapathy algorithm can identify cryptic splice sites in addition to the authentic splice sites. Cryptic sites can often be stronger than the authentic sites, with a higher S&S score. However, lacking an accompanying complementary donor or acceptor site, such a cryptic site is not active or used in a splicing reaction. When a neighboring real site is mutated to become weaker than the cryptic site, the cryptic site may be used instead of the real site, resulting in a cryptic exon and an aberrant transcript. Numerous diseases are caused by cryptic splice site mutations, or by the usage of cryptic splice sites following mutations in authentic splice sites.
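
The comparison described above, where a cryptic site becomes a candidate once the mutated authentic site scores below it, can be sketched as follows. The function name cryptic_candidates, the max_distance cutoff of 100 nucleotides, and the example scores are all hypothetical; the scores are assumed to have been computed beforehand for sites of the same type (all donors or all acceptors).

```python
def cryptic_candidates(authentic_pos, mutant_score, sites, max_distance=100):
    """List nearby sites that outscore the mutated authentic site.

    sites: iterable of (position, score) for candidate sites of the same type.
    max_distance is an illustrative cutoff, not a value from the S&S method.
    Returns matching sites sorted from strongest to weakest.
    """
    nearby = [
        (pos, score)
        for pos, score in sites
        if pos != authentic_pos
        and abs(pos - authentic_pos) <= max_distance
        and score > mutant_score
    ]
    return sorted(nearby, key=lambda site: site[1], reverse=True)


if __name__ == "__main__":
    # Made-up scores: the authentic donor at position 100 drops from 92 to 60.
    sites = [(100, 60.0), (120, 78.0), (150, 91.0), (400, 95.0)]
    print(cryptic_candidates(100, 60.0, sites))
```

A ranking like this only flags sites worth examining. As the text notes, whether a cryptic site is actually used also depends on factors such as the presence of a complementary partner site.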


S&S in animal and plant genomics research

S&S has also been used in RNA splicing research in many animals and plants. mRNA splicing plays a fundamental role in the regulation of gene function. It has been shown that A to G conversions at splice sites can lead to mRNA mis-splicing in Arabidopsis. In the molecular characterization and evolution of the class V β-1,3-glucanase of the carnivorous sundew (Drosera rotundifolia L.), the splicing and exon–intron junction prediction coincided with the GT/AG rule (S&S). Unspliced (LSDH) and spliced (SSDH) transcripts of NAD+ dependent sorbitol dehydrogenase (NADSDH) of strawberry (Fragaria ananassa Duch., cv. Nyoho) were investigated under phytohormonal treatments. Ambra1 is a positive regulator of autophagy, a lysosome-mediated degradative process involved in both physiological and pathological conditions. To date, this function of Ambra1 has been characterized only in mammals and zebrafish. Diminution of rbm24a or rbm24b gene products by morpholino knockdown resulted in significant disruption of somite formation in mouse and zebrafish. The S&S algorithm has been used extensively to study the intron-exon organization of fut8 genes. The intron-exon boundaries of Sf9 fut8 were in agreement with the consensus sequence for the splicing donor and acceptor sites derived using S&S.

