Types
Assemblies
Genome
The first sequence assemblers began to appear in the late 1980s and early 1990s as variants of simpler sequence alignment programs to piece together vast quantities of fragments generated by automated sequencing instruments called DNA sequencers. As the sequenced organisms grew in size and complexity (from small viruses over plasmids toEST
Expressed sequence tag or EST assembly was an early strategy, dating from the mid-1990s to the mid-2000s, to assemble individual genes rather than whole genomes. The problem differs from genome assembly in several ways. The input sequences for EST assembly are fragments of the transcribedDe-novo vs mapping assembly
In terms of complexity and time requirements, de-novo assemblies are orders of magnitude slower and more memory intensive than mapping assemblies. This is mostly due to the fact that the assembly algorithm needs to compare every read with every other read (an operation that has a naive time complexity of O(n2)). Current de-novo genome assemblers may use different types of graph-based algorithms, such as the: * ''Overlap/Layout/Consensus'' (OLC) approach, which was typical of the Sanger-data assemblers and relies on an overlap graph; * ''de Bruijn Graph'' (DBG) approach, which is most widely applied to the short reads from the Solexa and SOLiD platforms. It relies on K-mer graphs, which performs well with vast quantities of short reads; * ''Greedy graph-based'' approach, which may also use one of the OLC or DBG approaches. With greedy graph-based algorithms, the contigs, series of reads aligned together, grow by greedy extension, always taking on the read that is found by following the highest-scoring overlap. Referring to the comparison drawn to shredded books in the introduction: while for mapping assemblies one would have a very similar book as a template (perhaps with the names of the main characters and a few locations changed), de-novo assemblies present a more daunting challenge in that one would not know beforehand whether this would become a science book, a novel, a catalogue, or even several books. Also, every shred would be compared with every other shred. Handling repeats in de-novo assembly requires the construction of a graph representing neighboring repeats. Such information can be derived from reading a long fragment covering the repeats in full or only its two ends. On the other hand, in a mapping assembly, parts with multiple or no matches are usually left for another assembling technique to look into.Technological advances
The complexity of sequence assembly is driven by two major factors: the number of fragments and their lengths. While more and longer fragments allow better identification of sequence overlaps, they also pose problems as the underlying algorithms show quadratic or even exponential complexity behaviour to both number of fragments and their length. And while shorter sequences are faster to align, they also complicate the layout phase of an assembly as shorter reads are more difficult to use with repeats or near identical repeats. In the earliest days of DNA sequencing, scientists could only gain a few sequences of short length (some dozen bases) after weeks of work in laboratories. Hence, these sequences could be aligned in a few minutes by hand. In 1975, the ''dideoxy termination'' method (AKA ''Sanger sequencing'') was invented and until shortly after 2000, the technology was improved up to a point where fully automated machines could churn out sequences in a highly parallelised mode 24 hours a day. Large genome centers around the world housed complete farms of these sequencing machines, which in turn led to the necessity of assemblers to be optimised for sequences from whole-genome shotgun sequencing projects where the reads * are about 800–900 bases long * contain sequencing artifacts like sequencing and cloning vectors * have error rates between 0.5 and 10% With the Sanger technology, bacterial projects with 20,000 to 200,000 reads could easily be assembled on one computer. Larger projects, like the human genome with approximately 35 million reads, needed large computing farms and distributed computing. By 2004 / 2005, pyrosequencing had been brought to commercial viability by 454 Life Sciences. This new sequencing method generated reads much shorter than those of Sanger sequencing: initially about 100 bases, now 400–500 bases. Its much higher throughput and lower cost (compared to Sanger sequencing) pushed the adoption of this technology by genome centers, which in turn pushed development of sequence assemblers that could efficiently handle the read sets. The sheer amount of data coupled with technology-specific error patterns in the reads delayed development of assemblers; at the beginning in 2004 only the Newbler assembler from 454 was available. Released in mid-2007, the hybrid version of the MIRA assembler by Chevreux et al. was the first freely available assembler that could assemble 454 reads as well as mixtures of 454 reads and Sanger reads. Assembling sequences from different sequencing technologies was subsequently coined ''hybrid assembly''. From 2006, the Illumina (previously Solexa) technology has been available and can generate about 100 million reads per run on a single sequencing machine. Compare this to the 35 million reads of the human genome project which needed several years to be produced on hundreds of sequencing machines. Illumina was initially limited to a length of only 36 bases, making it less suitable for de novo assembly (such as de novo transcriptome assembly), but newer iterations of the technology achieve read lengths above 100 bases from both ends of a 3–400bp clone. Announced at the end of 2007, the SHARCGS assembler by Dohm et al. was the first published assembler that was used for an assembly with Solexa reads. It was quickly followed by a number of others. Later, new technologies like SOLiD from Applied Biosystems, Ion Torrent and SMRT were released and new technologies (e.g. Nanopore sequencing) continue to emerge. Despite the higher error rates of these technologies they are important for assembly because their longer read length helps to address the repeat problem. It is impossible to assemble through a perfect repeat that is longer than the maximum read length; however, as reads become longer the chance of a perfect repeat that large becomes small. This gives longer sequencing reads an advantage in assembling repeats even if they have low accuracy (≈85%).Quality control
Most sequence assemblers have some algorithms built in for quality control, such as Phred. However, such measures do not assess assembly completeness in terms of gene content. Some tools evaluate the quality of an assembly after the fact. For instance, BUSCO (Benchmarking Universal Single-Copy Orthologs) is a measure of gene completeness in a genome, gene set, or transcriptome, using the fact that many genes are present only as single-copy genes in most genomes. The initial BUSCO sets represented 3023 genes forAssembly algorithms
Different organisms have a distinct region of higher complexity within their genome. Hence, the need of different computational approaches is needed. Some of the commonly used algorithms are: ; Graph Assembly: is based on Graph theory in computer science. The de Bruijn Graph is an example of this approach and utilizes k-mers to assemble a contiguous from reads. ; Greedy Graph Assembly: this approach score each added read to the assembly and selects the highest possible score from the overlapping region. :Given a set of sequence fragments, the object is to find a longer sequence that contains all the fragments (see figure under ''Types of Sequence Assembly''): :# Сalculate pairwise alignments of all fragments. :# Choose two fragments with the largest overlap. :# Merge chosen fragments. :# Repeat step 2 and 3 until only one fragment is left. :The result might not be an optimal solution to the problem.Bioinformatics pipeline
In general, there are three steps in assembling sequencing reads into a scaffold: # Pre-assembly: This step is essential to ensure the integrity of downstream analysis such as variant calling or final scaffold sequence. This step consists of two chronological workflows: ## Quality check: Depending on the type of sequencing technology, different errors might arise that would lead to a false base call. For example, sequencing "NAAAAAAAAAAAAN" and "NAAAAAAAAAAAN" which include 12 adenine might be wrongfully called with 11 adenine instead. Sequencing a highly repetitive segment of the target DNA/RNA might result in a call that is one base shorter or one base longer. Read quality is typically measured by Phred which is an encoded score of each nucleotide quality within a read's sequence. Some sequencing technologies such as PacBio do not have a scoring method for their sequenced reads. A common tool used in this step is FastQC. ## Filtering of reads: Reads that failed to pass the quality check should be removed from the FASTQ file to get the best assembly contigs. # Assembly: During this step, reads alignment will be utilized with different criteria to map each read to the possible location. The predicted position of a read is based on either how much of its sequence aligns with other reads or a reference. Different alignment algorithms are used for reads from different sequencing technologies. Some of the commonly used approaches in the assembly are de Bruijn graph and overlapping. Read length, coverage, quality, and the sequencing technique used plays a major role in choosing the best alignment algorithm in the case of Next Generation Sequencing. On the other hand, algorithms aligning 3rd generation sequencing reads requires advance approaches to account for the high error rate associated with them. # Post-assembly: This step is focusing on extracting valuable information from the assembled sequence. Comparative genomics and population analysis are examples of post-assembly analysis.Programs
For a lists of ''de-novo'' assemblers, see De novo sequence assemblers. For a list of mapping aligners, see List of sequence alignment software § Short-read sequence alignment. Some of the common tools used in different assembly steps are listed in the following table:See also
* De novo sequence assemblers * Sequence alignment * De novo transcriptome assembly * Set cover problem * List of sequenced animal genomes * Plant genome assemblyReferences
{{DEFAULTSORT:Sequence Assembly Bioinformatics DNA sequencing methods