White paper on de novo assembly in CLC Assembly Cell 3.0
White Paper Jan 31, 2012
CLC Assembly Cell is a high-performance computing solution for read mapping and de novo assembly of Next Generation Sequencing data. Its command-line interface makes it easy to include the functionality in scripts and other Next Generation Sequencing workflows.
CLC Assembly Cell uses SIMD instructions to parallelize and accelerate the assembly algorithms, making the program the fastest Next Generation Sequencing assembler at present.
This is a white paper on the de novo assembler in CLC Assembly Cell 3.0. Note that the same algorithm is used by CLC Genomics Workbench and CLC Genomics Server, so, except for the performance benchmarks (speed and memory), this white paper will also apply to these products after their next release.
The assembler is designed to combine a mix of data from Illumina, 454, SOLiD and Sanger sequencing, both as single reads and as paired-end reads. For paired-end data, different insert sizes can be combined in the same assembly. Note that in the current version, paired-end SOLiD data can be used in a post-processing step to link contigs together.
This white paper consists of three parts. The first part explains how the assembler works. The second part focuses on large genome assembly of a human data set, where we have compared our assembler with the ABySS assembler, which is also capable of assembling human genome-size data sets [Simpson et al., 2009]. The third part reports the results for a smaller bacterial data set, where the focus is on quality and where we have compared the quality and performance of our own assembly with that of Velvet, one of the popular open-source assemblers [Zerbino and Birney, 2008].