SOAP (Short Oligonucleotide Analysis Package) is a suite of bioinformatics software tools from the BGI Bioinformatics department enabling the assembly, alignment, and analysis of next-generation DNA sequencing data. It is particularly suited to short read sequencing data.
All programs in the SOAP package may be used free of charge and are distributed under the GPL open-source software license.
Functionality
The SOAP suite of tools can be used to perform the following sequence analysis tasks:
Sequence Alignment
''SOAPaligner'' (SOAP2) is specifically designed for fast alignment of short reads and performs favorably with respect to similar alignment tools such as Bowtie and MAQ.
Genome Assembly
''SOAPdenovo'' is a short read ''de novo'' assembler utilizing De Bruijn graph construction. It is optimized for short reads such as those generated by Illumina and is capable of assembling large genomes such as the human genome. ''SOAPdenovo'' was used to assemble the genome of the giant panda.
It was later upgraded to ''SOAPdenovo2'', which was optimized for large genomes and included the widely used GapCloser module.
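As a rough illustration of De Bruijn graph construction, the Python sketch below splits reads into k-mers, links each (k-1)-mer prefix to its suffix, and walks the unbranched path to recover a contig. It is a toy under stated assumptions, not the SOAPdenovo implementation: the function names `build_de_bruijn` and `walk_unbranched`, the example reads, and the choice of k = 4 are all hypothetical, and the sketch ignores sequencing errors, reverse complements, and repeats that a real assembler has to handle.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix.
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def walk_unbranched(graph, start):
    """Extend a contig from start while each node has one distinct successor."""
    contig = start
    node = start
    seen = {start}
    while True:
        successors = set(graph.get(node, []))
        if len(successors) != 1:
            break  # dead end or branch point
        node = successors.pop()
        if node in seen:
            break  # stop instead of looping around a cycle
        seen.add(node)
        contig += node[-1]
    return contig


# Hypothetical overlapping reads drawn from the sequence ATGGCGTGCAA.
reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unbranched(graph, "ATG")
print(contig)  # ATGGCGTGCAA
```

The walk stops at the first branch or dead end, so in this simplified model a repeat longer than k-1 bases would split the assembly into separate contigs.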
Transcriptome Assembly
''SOAPdenovo-Trans'' is a ''de novo'' transcriptome assembler designed specifically for RNA-Seq that was created for the 1000 Plant Genomes project.
Indel Discovery
''SOAPindel'' is a tool to find insertions and deletions from next-generation paired-end sequencing data, providing a list of candidate indels with quality scores.
SNP Discovery
''SOAPsnp'' is a consensus sequence builder. This tool uses the output from ''SOAPaligner'' to generate a consensus sequence, which enables SNPs to be called on a newly sequenced individual.
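The idea of building a consensus from aligned reads and comparing it with a reference can be sketched in Python as follows. This is a deliberately simplified majority vote, not the statistical model used by SOAPsnp: the function name `call_consensus`, the `min_depth` parameter, the ungapped `(start, read)` alignment format, and the example data are all assumptions made for illustration.

```python
from collections import Counter


def call_consensus(reference, alignments, min_depth=2):
    """Build a majority-vote consensus and list positions that differ.

    alignments is a list of (start, read) pairs: ungapped reads placed
    at 0-based start positions on the reference.
    """
    piles = [Counter() for _ in reference]
    for start, read in alignments:
        for offset, base in enumerate(read):
            pos = start + offset
            if 0 <= pos < len(reference):
                piles[pos][base] += 1

    consensus = []
    snps = []
    for pos, (ref_base, pile) in enumerate(zip(reference, piles)):
        depth = sum(pile.values())
        if depth < min_depth:
            consensus.append("N")  # too little coverage to call a base
            continue
        base, _count = pile.most_common(1)[0]
        consensus.append(base)
        if base != ref_base:
            snps.append((pos, ref_base, base))
    return "".join(consensus), snps


# Hypothetical reference and three reads that all carry T>A at position 3.
reference = "ACGTACGT"
alignments = [(0, "ACGAAC"), (2, "GAACGT"), (1, "CGAACG")]
consensus, snps = call_consensus(reference, alignments)
print(consensus)  # NCGAACGN
print(snps)       # [(3, 'T', 'A')]
```

Positions covered by fewer than `min_depth` reads are reported as `N` rather than guessed, which is why the first and last bases of the example consensus are uncalled.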
Structural Variation Discovery
''SOAPsv'' is a tool to find structural variations using whole genome assembly.
Quality control and preprocessing
''SOAPnuke'' is a tool for integrated quality control and preprocessing of datasets from genomic, small RNA, Digital Gene Expression, and metagenomic experiments.
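To make the notion of read-level quality control concrete, the following Python sketch filters reads by mean base quality and by the fraction of ambiguous `N` bases. It does not reproduce SOAPnuke's options or defaults: the function names `phred_scores` and `passes_filter`, the thresholds, and the sample records are hypothetical, and quality strings are assumed to use the Phred+33 encoding.

```python
def phred_scores(quality_string, offset=33):
    """Decode an ASCII quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def passes_filter(sequence, quality_string,
                  min_mean_quality=20, max_n_fraction=0.1):
    """Keep a read only if it is well formed, high quality, and has few Ns."""
    if not sequence or len(sequence) != len(quality_string):
        return False  # empty or malformed record
    scores = phred_scores(quality_string)
    mean_quality = sum(scores) / len(scores)
    n_fraction = sequence.upper().count("N") / len(sequence)
    return mean_quality >= min_mean_quality and n_fraction <= max_n_fraction


# Hypothetical (name, sequence, quality) records.
records = [
    ("read1", "ACGTACGT", "IIIIIIII"),  # Phred 40 throughout
    ("read2", "ACGTACGT", "########"),  # Phred 2 throughout
    ("read3", "ACNNNCGT", "IIIIIIII"),  # 3 of 8 bases are N
]
kept = [name for name, seq, qual in records if passes_filter(seq, qual)]
print(kept)  # ['read1']
```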
History
SOAP v1
The first release of SOAP consisted only of the sequence alignment tool ''SOAPaligner''.
SOAP v2
SOAP v2 built on SOAP v1 by significantly improving the performance of the ''SOAPaligner'' tool. Alignment time was reduced by a factor of 20-30, while memory usage was reduced by a factor of 3. Support was added for compressed file formats.
The SOAP suite was then expanded to include new tools: SOAPdenovo 1 and 2, SOAPindel, SOAPsnp, and SOAPsv.
SOAP v3
SOAP v3 extended the alignment tool, making it the first short-read alignment tool to utilize GPUs.
As a result of these improvements, ''SOAPaligner'' significantly outperformed the competing aligners Bowtie and BWA in terms of speed.
See also
* genomics
* genome sequencing
* genome assembly
* bioinformatics
External links
* http://soap.genomics.org.cn
* http://soap.genomics.org.cn/soap1
* http://bioinformatics.genomics.org.cn
* http://seqanswers.com/forums/showthread.php?t=43