Dataset Information


Comparing de novo assemblers for 454 transcriptome data.

ABSTRACT: Roche 454 pyrosequencing has become a method of choice for generating transcriptome data from non-model organisms. Once the tens to hundreds of thousands of short (250-450 base) reads have been produced, it is important to correctly assemble these to estimate the sequence of all the transcripts. Most transcriptome assembly projects use only one program for assembling 454 pyrosequencing reads, but there is no evidence that the programs used to date are optimal. We have carried out a systematic comparison of five assemblers (CAP3, MIRA, Newbler, SeqMan and CLC) to establish best practices for transcriptome assemblies, using a new dataset from the parasitic nematode Litomosoides sigmodontis.

Although no single assembler performed best on all our criteria, Newbler 2.5 gave longer contigs, better alignments to some reference sequences, and was fast and easy to use. SeqMan assemblies performed best on the criterion of recapitulating known transcripts, and had more novel sequence than the other assemblers, but generated an excess of small, redundant contigs. The remaining assemblers all performed almost as well, with the exception of Newbler 2.3 (the version currently used by most assembly projects), which generated assemblies that had significantly lower total length. As different assemblers use different underlying algorithms to generate contigs, we also explored merging of assemblies and found that the merged datasets not only aligned better to reference sequences than individual assemblies, but were also more consistent in the number and size of contigs.

Transcriptome assemblies are smaller than genome assemblies and thus should be more computationally tractable, but are often harder because individual contigs can have highly variable read coverage. Comparing single assemblers, Newbler 2.5 performed best on our trial data set, but other assemblers were closely comparable. Combining differently optimal assemblies from different programs, however, gave a more credible final product, and this strategy is recommended.
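
Several of the criteria named in the abstract (number of contigs, contig size, total assembly length) can be computed directly from an assembler's FASTA output. The following is a minimal sketch in Python, not the study's own pipeline: the function names `read_fasta_lengths` and `summarize`, and the toy sequences, are illustrative assumptions and do not come from the dataset.

```python
def read_fasta_lengths(lines):
    """Return the length of each sequence in an iterable of FASTA lines."""
    lengths = []
    current = None  # running length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def summarize(lengths):
    """Summarize contig lengths: count, total length, longest and mean."""
    if not lengths:
        return {"contigs": 0, "total_length": 0, "longest": 0, "mean_length": 0.0}
    total = sum(lengths)
    return {
        "contigs": len(lengths),
        "total_length": total,
        "longest": max(lengths),
        "mean_length": total / len(lengths),
    }


# Toy input (hypothetical contigs, for illustration only).
toy_assembly = [">contig1", "ACGTACGT", "ACGT", ">contig2", "ACG"]
toy_summary = summarize(read_fasta_lengths(toy_assembly))
print(toy_summary)
```

To compare assemblies, the same summary can be computed per assembler by passing an open FASTA file handle to `read_fasta_lengths`. Criteria such as alignment to reference sequences or recovery of known transcripts need an aligner and are outside this sketch.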


PROVIDER: S-EPMC3091720 | BioStudies | 2010-01-01

REPOSITORIES: biostudies

Similar Datasets

2012-01-01 | S-EPMC3288049 | BioStudies
2012-01-01 | S-EPMC3599625 | BioStudies
2008-01-01 | S-EPMC2639302 | BioStudies
2016-01-01 | S-EPMC5121142 | BioStudies
2011-01-01 | S-EPMC3233632 | BioStudies
2012-01-01 | S-EPMC3296665 | BioStudies
2012-01-01 | S-EPMC3517413 | BioStudies
2011-01-01 | S-EPMC3128070 | BioStudies
2014-01-01 | S-EPMC3901335 | BioStudies
2014-01-01 | S-EPMC3896809 | BioStudies