Assemblathon I Offers Lessons in Complex Genome Assembly
By Kevin Davies - from BIO IT World
December 13, 2011 | From familiar but repeat-laden plant species to obscure vertebrates, more and more genomes are being sequenced that require de novo assembly without alignment to a reference sequence. “Every genome has its own story in terms of repeats,” says Ian Korf, associate director of bioinformatics at the University of California Davis Genome Center.
Korf is one of the principal organizers of a genome assembly challenge known as the Assemblathon, a competition to identify best practices in the de novo assembly of complex plant and animal genomes. Results of the first phase of the Assemblathon were recently published in Genome Research.
Korf discussed some of the Assemblathon results and, more broadly, the inherent challenges in genome assembly in a recent webinar hosted by the community forum NGS Leaders. “Sequence analysis starts after genome assembly – you can’t do much beforehand... Every genome is a complex genome -- even the simpler ones are pretty complex. There’s no easy genome,” says Korf.
The Assemblathon grew out of the G10K project, an effort to sequence 10,000 vertebrate genomes. To sequence and assemble that many genomes, it is crucial to know which sequencing and assembly technologies give the best results for the money. “It definitely becomes a cost-benefit ratio looking at 10,000 genomes,” says Korf.
Joseph DeRisi (Berkeley), David Haussler (UCSC), and Illumina helped launch the Assemblathon idea. The original plan called for two targets: a real genome (snake) and a synthetic genome whose known sequence would let participants determine how well they performed. In the event, the snake data weren’t ready, so Assemblathon I, which took place in early 2011, used just the synthetic genome.
To create the synthetic genome, the organizers took a copy of human chromosome 13, and artificially evolved the sequence using Evolver software, which introduced mutations in different regions (coding/non-coding) and at different rates. “The sequences were human-ish, but after 200 million years of evolution, didn’t look that human,” says Korf.
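The idea of evolving a sequence at different rates in different regions can be illustrated with a toy sketch. This is not Evolver's method: it applies only point substitutions, and the function names, region labels, and rates below are hypothetical values chosen for illustration.

```python
import random


def evolve(sequence, regions, rates, rng):
    """Return a mutated copy of `sequence` (toy model, substitutions only).

    regions: list of (start, end, label) half-open intervals.
    rates:   dict mapping label -> per-base substitution probability.
    """
    bases = "ACGT"
    out = list(sequence)
    for start, end, label in regions:
        rate = rates[label]
        for i in range(start, end):
            if rng.random() < rate:
                # Pick a different base so every hit is a real change.
                out[i] = rng.choice([b for b in bases if b != out[i]])
    return "".join(out)


def count_differences(a, b, start, end):
    """Count mismatched positions between a and b in [start, end)."""
    return sum(1 for i in range(start, end) if a[i] != b[i])


rng = random.Random(13)  # fixed seed so the run is repeatable
ancestor = "".join(rng.choice("ACGT") for _ in range(2000))

# Hypothetical annotation: first half "coding", second half "noncoding".
regions = [(0, 1000, "coding"), (1000, 2000, "noncoding")]
# Hypothetical rates: the coding region changes more slowly.
rates = {"coding": 0.02, "noncoding": 0.20}

descendant = evolve(ancestor, regions, rates, rng)
coding_diffs = count_differences(ancestor, descendant, 0, 1000)
noncoding_diffs = count_differences(ancestor, descendant, 1000, 2000)
print("coding changes:", coding_diffs)
print("non-coding changes:", noncoding_diffs)
```

Because the ancestor is kept alongside the descendant, the true sequence is known exactly, which is what makes a synthetic target useful for scoring assemblies.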
The Assemblathon participants, 17 groups in all, were then challenged to assemble the synthetic reads. “Because we knew the answer, we could evaluate each one of the assemblers,” says Korf.
Results Are In
Commenting on the results, Korf said: “A lot did a pretty good job, but it’s more difficult to assemble regions with more mutations, so the coding regions were assembled better than non-coding regions.” (The contest did not test the growing number of commercial assembly packages, from the likes of CLC bio, DNAStar, Gene Codes and others.)
The assemblies were ranked by various criteria, including contig and scaffold paths, structural and copy number errors, and so on. In the final rankings, the top five were:
Broad Institute (ALLPATHS-LG)
BGI (SOAPdenovo)
Wellcome Trust Sanger Institute (SGA)
DOE Joint Genome Institute (Meraculous)
Cold Spring Harbor Lab (Quake, Celera, Bambus2)
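One simple way to turn several criteria into a single ordering is to rank the assemblies on each metric and add up the ranks. The sketch below shows that approach only; it is not necessarily the scoring the organizers used, and the assembly names, metric names, and values are hypothetical.

```python
def rank_sum(scores, higher_is_better):
    """Combine per-metric ranks into one ordering (lowest total is best).

    scores:           {assembly: {metric: value}}
    higher_is_better: {metric: True if larger values are better}
    Ties are not treated specially; tied entries keep their input order.
    """
    totals = {name: 0 for name in scores}
    for metric, higher in higher_is_better.items():
        ordered = sorted(scores, key=lambda n: scores[n][metric], reverse=higher)
        for rank, name in enumerate(ordered, start=1):
            totals[name] += rank
    ranking = sorted(totals, key=lambda n: totals[n])
    return ranking, totals


# Hypothetical assemblies and values, for illustration only.
scores = {
    "assembly_A": {"scaffold_path_length": 900, "structural_errors": 12, "copy_number_errors": 5},
    "assembly_B": {"scaffold_path_length": 700, "structural_errors": 4, "copy_number_errors": 9},
    "assembly_C": {"scaffold_path_length": 300, "structural_errors": 20, "copy_number_errors": 14},
}
higher_is_better = {
    "scaffold_path_length": True,   # longer correct paths are better
    "structural_errors": False,     # fewer errors are better
    "copy_number_errors": False,
}

ranking, totals = rank_sum(scores, higher_is_better)
print(ranking)
print(totals)
```

A rank sum weights every metric equally, so an assembly that is merely good on all criteria can finish ahead of one that is best on a single criterion and poor on the rest.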
Several useful tools emerged, says Korf, but experience in using them makes a big difference. “We found that sometimes two groups will use the same assembler, but the group that knows a bit more about the assembler might do a slightly better job. It’s something of an art at this point.”
Korf says that wisely choosing the many different parameters involved in de novo genome assembly is difficult and “probably shouldn’t be attempted by amateurs.” He advises inexperienced users to “contact one of the major sequencing centers and get them to help you. Doing it on your own is pretty much guaranteed to give you a sub-optimal assembly… Don’t jump into genome assembly thinking it’s just like any other bioinformatics problem you can hack with some Perl scripts.”
It starts as far upstream as DNA library preparation. “You don’t want to choose the assembler as the last thing you do,” says Korf. “It must be in conjunction with the sequencing technology, how are the libraries made, the full equation. You can’t do it stepwise.”
Library preparation is a non-trivial step. “It’s really garbage in, garbage out,” says Korf. “So much is dependent on having high quality sequence and making your libraries correctly.” Indeed, some Assemblathon participants believe there should be a library construction competition, because that’s more important in some ways.