SOAP (Short Oligonucleotide Analysis Package) is a suite of
bioinformatics
software tools from the
BGI Bioinformatics department enabling the assembly, alignment, and analysis of
next-generation DNA sequencing data. It is particularly suited to
short read sequencing data.
All programs in the SOAP package may be used free of charge and are distributed under the
GPL open-source software license.
Functionality
The SOAP suite of tools can be used to perform the following sequence analysis tasks:
Sequence Alignment
''SOAPaligner'' (SOAP2) is specifically designed for fast alignment of short reads and compares favorably with similar alignment tools such as
Bowtie
and MAQ.
Genome Assembly
''SOAPdenovo'' is a short read ''de novo'' assembler utilizing
De Bruijn graph
construction. It is optimized for short reads such as those generated by
Illumina and is capable of assembling large genomes such as the human genome.
''SOAPdenovo'' was used to assemble the genome of the
giant panda.
''SOAPdenovo'' was later upgraded to ''SOAPdenovo2'', which was optimized for large genomes and included the widely used GapCloser module.
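The De Bruijn graph construction mentioned above can be illustrated with a minimal sketch. This is not ''SOAPdenovo'' code: the function name `build_de_bruijn`, the toy reads, and the choice of k = 3 are assumptions made for illustration only, and a real assembler also handles sequencing errors, reverse complements, and graph simplification.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a toy De Bruijn graph from reads (hypothetical helper).

    Nodes are (k-1)-mers. Each k-mer found in a read adds an edge
    from its (k-1)-base prefix to its (k-1)-base suffix.
    """
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k across the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two overlapping toy reads (invented data).
reads = ["ACGTC", "CGTCA"]
graph = build_de_bruijn(reads, k=3)

for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Following the edges from `AC` visits `CG`, `GT`, `TC`, and `CA`, which spells out `ACGTCA`, the sequence both reads overlap to form.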
Transcriptome Assembly
''SOAPdenovo-Trans'' is a ''de novo''
transcriptome
assembler designed specifically for
RNA-Seq
that was created for the
1000 Plant Genomes project.
Indel Discovery
''SOAPindel'' is a tool to find
insertions and deletions from next-generation paired-end sequencing data, providing a list of candidate
indels with quality scores.
SNP Discovery
''SOAPsnp'' is a consensus sequence builder. This tool uses the output from ''SOAPaligner'' to generate a consensus sequence which enables
SNPs
to be called on a newly sequenced individual.
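The idea of building a consensus from aligned reads and then reading off differences from the reference can be sketched as follows. This is a deliberately simple majority vote, not the statistical model used by ''SOAPsnp'', which this article does not describe. The function `call_consensus`, the `(start, sequence)` input format, and the data are all assumptions for illustration.

```python
from collections import Counter


def call_consensus(reference, aligned_reads):
    """Majority-vote consensus over gapless alignments (toy sketch).

    aligned_reads is a list of (start, sequence) pairs, where start is
    the 0-based reference position of the read's first base.
    Returns the consensus string and a list of
    (position, reference_base, consensus_base) mismatches.
    """
    piles = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                piles[pos][base] += 1

    # Use the most frequent base per position; fall back to the
    # reference base where no read covers the position.
    consensus = "".join(
        pile.most_common(1)[0][0] if pile else ref_base
        for ref_base, pile in zip(reference, piles)
    )
    snps = [
        (i, r, c)
        for i, (r, c) in enumerate(zip(reference, consensus))
        if r != c
    ]
    return consensus, snps


# Invented reference and reads; all three reads carry a T at position 4.
reference = "ACGTACGT"
reads = [(0, "ACGTT"), (2, "GTTCG"), (3, "TTCGT")]
consensus, snps = call_consensus(reference, reads)
print(consensus)
print(snps)
```

A production caller would also weigh base qualities, mapping qualities, and diploid genotypes rather than counting bases.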
Structural Variation Discovery
''SOAPsv'' is a tool to find structural variations using whole genome assembly.
Quality control and preprocessing
''SOAPnuke'' is a tool for integrated quality control and preprocessing of datasets from genomic,
small RNA,
Digital Gene Expression, and
metagenomic
experiments.
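One common preprocessing step is discarding reads whose average base quality is low. The sketch below shows that idea only. It does not reproduce ''SOAPnuke'' options or behavior; the function names, the threshold of 20, and the records are hypothetical, and qualities are assumed to use the Phred+33 encoding.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (assumes Phred+33 encoding)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def filter_reads(records, min_mean_quality=20):
    """Keep (name, sequence, quality) records at or above the threshold.

    The threshold is an illustrative default, not a SOAPnuke setting.
    Records with an empty quality string are dropped.
    """
    return [
        (name, seq, qual)
        for name, seq, qual in records
        if qual and mean_phred(qual) >= min_mean_quality
    ]


# Invented records: 'I' encodes Phred 40, '!' encodes Phred 0.
records = [
    ("r1", "ACGT", "IIII"),
    ("r2", "ACGT", "!!!!"),
]
kept = filter_reads(records)
print([name for name, _, _ in kept])
```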
History
SOAP v1
The first release of SOAP consisted only of the
sequence alignment
tool ''SOAPaligner''.
SOAP v2
SOAP v2 significantly improved the performance of the ''SOAPaligner'' tool over SOAP v1. Alignment time was reduced by a factor of 20-30, and memory usage by a factor of 3. Support was added for compressed file formats.
The SOAP suite was then expanded to include new tools: SOAPdenovo 1 and 2, SOAPindel, SOAPsnp, and SOAPsv.
SOAP v3
SOAP v3 extended the alignment tool, making it the first short-read aligner to use GPUs.
As a result of these improvements, ''SOAPaligner'' significantly outperformed the competing aligners
Bowtie
and
BWA in terms of speed.
See also
* genomics
* genome sequencing
* genome assembly
* bioinformatics
External links
* http://soap.genomics.org.cn
* http://soap.genomics.org.cn/soap1
* http://bioinformatics.genomics.org.cn
* http://seqanswers.com/forums/showthread.php?t=43