ABSTRACT: Most reconstruction methods for genomes of ancient origin that are used today require a closely related reference. In order to identify genomic rearrangements or the deletion of whole genes, de novo assembly has to be used. However, because of inherent problems with ancient DNA, its de novo assembly is highly complicated. In order to tackle the diversity in the length of the input reads, we propose a two-layer approach, where multiple assemblies are generated in the first layer, which are then combined in the second layer. We used this two-layer assembly to generate assemblies for two different ancient samples and compared the results to current de novo assembly approaches. We are able to improve the assembly with respect to the length of the contigs and can resolve more repetitive regions.
Project description:The development of next-generation sequencing has made it possible to sequence whole genomes at a relatively low cost. However, de novo genome assemblies remain challenging due to short read length, missing data, repetitive regions, polymorphisms and sequencing errors. As more and more genomes are sequenced, reference-guided assembly approaches can be used to assist the assembly process. However, previous methods mostly focused on the assembly of other genotypes within the same species. We adapted and extended a reference-guided de novo assembly approach, which enables the usage of a related reference sequence to guide the genome assembly. In order to compare and evaluate de novo and our reference-guided de novo assembly approaches, we used a simulated data set of a repetitive and heterozygotic plant genome.The extended reference-guided de novo assembly approach almost always outperforms the corresponding de novo assembly program even when a reference of a different species is used. Similar improvements can be observed in high and low coverage situations. In addition, we show that a single evaluation metric, like the widely used N50 length, is not enough to properly rate assemblies as it not always points to the best assembly evaluated with other criteria. Therefore, we used the summed z-scores of 36 different statistics to evaluate the assemblies.The combination of reference mapping and de novo assembly provides a powerful tool to improve genome reconstruction by integrating information of a related genome. Our extension of the reference-guided de novo assembly approach enables the application of this strategy not only within but also between related species. Finally, the evaluation of genome assemblies is often not straight forward, as the truth is not known. Thus one should always use a combination of evaluation metrics, which not only try to assess the continuity but also the accuracy of an assembly.
Project description:TransRate is a tool for reference-free quality assessment of de novo transcriptome assemblies. Using only the sequenced reads and the assembly as input, we show that multiple common artifacts of de novo transcriptome assembly can be readily detected. These include chimeras, structural errors, incomplete assembly, and base errors. TransRate evaluates these errors to produce a diagnostic quality score for each contig, and these contig scores are integrated to evaluate whole assemblies. Thus, TransRate can be used for de novo assembly filtering and optimization as well as comparison of assemblies generated using different methods from the same input reads. Applying the method to a data set of 155 published de novo transcriptome assemblies, we deconstruct the contribution that assembly method, read length, read quantity, and read quality make to the accuracy of de novo transcriptome assemblies and reveal that variance in the quality of the input data explains 43% of the variance in the quality of published de novo transcriptome assemblies. Because TransRate is reference-free, it is suitable for assessment of assemblies of all types of RNA, including assemblies of long noncoding RNA, rRNA, mRNA, and mixed RNA samples.
Project description:BACKGROUND: The sequencing, de novo assembly and annotation of transcriptome datasets generated with next generation sequencing (NGS) has enabled biologists to answer genomic questions in non-model species with unprecedented ease. Reliable and accurate de novo assembly and annotation of transcriptomes, however, is a critically important step for transcriptome assemblies generated from short read sequences. Typical benchmarks for assembly and annotation reliability have been performed with model species. To address the reliability and accuracy of de novo transcriptome assembly in non-model species, we generated an RNAseq dataset for an intertidal gastropod mollusc species, Nerita melanotragus, and compared the assembly produced by four different de novo transcriptome assemblers; Velvet, Oases, Geneious and Trinity, for a number of quality metrics and redundancy. RESULTS: Transcriptome sequencing on the Ion Torrent PGM™ produced 1,883,624 raw reads with a mean length of 133 base pairs (bp). Both the Trinity and Oases de novo assemblers produced the best assemblies based on all quality metrics including fewer contigs, increased N50 and average contig length and contigs of greater length. Overall the BLAST and annotation success of our assemblies was not high with only 15-19% of contigs assigned a putative function. CONCLUSIONS: We believe that any improvement in annotation success of gastropod species will require more gastropod genome sequences, but in particular an increase in mollusc protein sequences in public databases. Overall, this paper demonstrates that reliable and accurate de novo transcriptome assemblies can be generated from short read sequencers with the right assembly algorithms.
Project description:Whereas de novo assemblies of RNA-Seq data are being published for a growing number of species across the tree of life, there are currently no broadly accepted methods for evaluating such assemblies. Here we present a detailed comparison of 99 transcriptome assemblies, generated with 6 de novo assemblers including CLC, Trinity, SOAP, Oases, ABySS and NextGENe. Controlled analyses of de novo assemblies for Arabidopsis thaliana and Oryza sativa transcriptomes provide new insights into the strengths and limitations of transcriptome assembly strategies. We find that the leading assemblers generate reassuringly accurate assemblies for the majority of transcripts. At the same time, we find a propensity for assemblers to fail to fully assemble highly expressed genes. Surprisingly, the instance of true chimeric assemblies is very low for all assemblers. Normalized libraries are reduced in highly abundant transcripts, but they also lack 1000s of low abundance transcripts. We conclude that the quality of de novo transcriptome assemblies is best assessed through consideration of a combination of metrics: 1) proportion of reads mapping to an assembly 2) recovery of conserved, widely expressed genes, 3) N50 length statistics, and 4) the total number of unigenes. We provide benchmark Illumina transcriptome data and introduce SCERNA, a broadly applicable modular protocol for de novo assembly improvement. Finally, our de novo assembly of the Arabidopsis leaf transcriptome revealed ~20 putative Arabidopsis genes lacking in the current annotation.
Project description:BACKGROUND: De novo genome assembly can be challenging due to inherent properties of the reads, even when using current state-of-the-art assembly tools based on de Bruijn graphs. Often users are not bio-informaticians and, in a black box approach, utilise assembly parameters such as contig length and N50 to generate whole genome sequences, potentially resulting in mis-assemblies. FINDINGS: Utilising several assembly tools based on de Bruijn graphs like Velvet, SPAdes and IDBA, we demonstrate that at the optimal N50, mis-assemblies do occur, even when using the multi-k-mer approaches of SPAdes and IDBA. We demonstrate that whole genome mapping can be used to identify these mis-assemblies and can guide the selection of the best k-mer size which yields the highest N50 without mis-assemblies. CONCLUSIONS: We demonstrate the utility of whole genome mapping (WGM) as a tool to identify mis-assemblies and to guide k-mer selection and higher quality de novo genome assembly of bacterial genomes.
Project description:BACKGROUND:Accurate de novo genome assembly has become reality with the advancements in sequencing technology. With the ever-increasing number of de novo genome assembly tools, assessing the quality of assemblies has become of great importance in genome research. Although many quality metrics have been proposed and software tools for calculating those metrics have been developed, the existing tools do not produce a unified measure to reflect the overall quality of an assembly. RESULTS:To address this issue, we developed the de novo Assembly Quality Evaluation Tool (dnAQET) that generates a unified metric for benchmarking the quality assessment of assemblies. Our framework first calculates individual quality scores for the scaffolds/contigs of an assembly by aligning them to a reference genome. Next, it computes a quality score for the assembly using its overall reference genome coverage, the quality score distribution of its scaffolds and the redundancy identified in it. Using synthetic assemblies randomly generated from the latest human genome build, various builds of the reference genomes for five organisms and six de novo assemblies for sample NA24385, we tested dnAQET to assess its capability for benchmarking quality evaluation of genome assemblies. For synthetic data, our quality score increased with decreasing number of misassemblies and redundancy and increasing average contig length and coverage, as expected. For genome builds, dnAQET quality score calculated for a more recent reference genome was better than the score for an older version. To compare with some of the most frequently used measures, 13 other quality measures were calculated. The quality score from dnAQET was found to be better than all other measures in terms of consistency with the known quality of the reference genomes, indicating that dnAQET is reliable for benchmarking quality assessment of de novo genome assemblies. CONCLUSIONS:The dnAQET is a scalable framework designed to evaluate a de novo genome assembly based on the aggregated quality of its scaffolds (or contigs). Our results demonstrated that dnAQET quality score is reliable for benchmarking quality assessment of genome assemblies. The dnQAET can help researchers to identify the most suitable assembly tools and to select high quality assemblies generated.
Project description:De novo heterozygous assembly is an ongoing challenge requiring improved assembly approaches. In this study, three strategies were used to develop de novo Vitis vinifera 'Sultanina' genome assemblies for comparison with the inbred V. vinifera (PN40024 12X.v2) reference genome and a published Sultanina ALLPATHS-LG assembly (AP). The strategies were: 1) a default PLATANUS assembly (PLAT_d) for direct comparison with AP assembly, 2) an iterative merging strategy using METASSEMBLER to combine PLAT_d and AP assemblies (MERGE) and 3) PLATANUS parameter modifications plus GapCloser (PLAT*_GC).The three new assemblies were greater in size than the AP assembly. PLAT*_GC had the greatest number of scaffolds aligning with a minimum of 95% identity and ?1000 bp alignment length to V. vinifera (PN40024 12X.v2) reference genome. SNP analysis also identified additional high quality SNPs. A greater number of sequence reads mapped back with zero-mismatch to the PLAT_d, MERGE, and PLAT*_GC (>94%) than was found in the AP assembly (87%) indicating a greater fidelity to the original sequence data in the new assemblies than in AP assembly. A de novo gene prediction conducted using seedless RNA-seq data predicted >?30,000 coding sequences for the three new de novo assemblies, with the greatest number (30,544) in PLAT*_GC and only 26,515 for the AP assembly. Transcription factor analysis indicated good family coverage, but some genes found in the VCOST.v3 annotation were not identified in any of the de novo assemblies, particularly some from the MYB and ERF families.The PLAT_d and PLAT*_GC had a greater number of synteny blocks with the V. vinifera (PN40024 12X.v2) reference genome than AP or MERGE. PLAT*_GC provided the most contiguous assembly with only 1.2% scaffold N, in contrast to AP (10.7% N), PLAT_d (6.6% N) and Merge (6.4% N). A PLAT*_GC pseudo-chromosome assembly with chromosome alignment to the reference genome V. vinifera, (PN40024 12X.v2) provides new information for use in seedless grape genetic mapping studies. An annotated de novo gene prediction for the PLAT*_GC assembly, aligned with VitisNet pathways provides new seedless grapevine specific transcriptomic resource that has excellent fidelity with the seedless short read sequence data.
Project description:<h4>Background</h4>Nicotiana benthamiana is an allo-tetraploid plant, which can be challenging for de novo transcriptome assemblies due to homeologous and duplicated gene copies. Transcripts generated from such genes can be distinct yet highly similar in sequence, with markedly differing expression levels. This can lead to unassembled, partially assembled or mis-assembled contigs. Due to the different properties of de novo assemblers, no one assembler with any one given parameter space can re-assemble all possible transcripts from a transcriptome.<h4>Results</h4>In an effort to maximise the diversity and completeness of de novo assembled transcripts, we utilised four de novo transcriptome assemblers, TransAbyss, Trinity, SOAPdenovo-Trans, and Oases, using a range of k-mer sizes and different input RNA-seq read counts. We complemented the parameter space biologically by using RNA from 10 plant tissues. We then combined the output of all assemblies into a large super-set of sequences. Using a method from the EvidentialGene pipeline, the combined assembly was reduced from 9.9 million de novo assembled transcripts to about 235,000 of which about 50,000 were classified as primary. Metrics such as average bit-scores, feature response curves and the ability to distinguish paralogous or homeologous transcripts, indicated that the EvidentialGene processed assembly was of high quality. Of 35 RNA silencing gene transcripts, 34 were identified as assembled to full length, whereas in a previous assembly using only one assembler, 9 of these were partially assembled.<h4>Conclusions</h4>To achieve a high quality transcriptome, it is advantageous to implement and combine the output from as many different de novo assemblers as possible. We have in essence taking the 'best' output from each assembler while minimising sequence redundancy. We have also shown that simultaneous assessment of a variety of metrics, not just focused on contig length, is necessary to gauge the quality of assemblies.
Project description:BACKGROUND:Study of meta-transcriptomic datasets involving non-model organisms represents bioinformatic challenges. The production of chimeric sequences and our inability to distinguish the taxonomic origins of the sequences produced are inherent and recurrent difficulties in de novo assembly analyses. As the study of holobiont meta-transcriptomes is affected by challenges invoked above, we propose an innovative bioinformatic approach to tackle such difficulties and tested it on marine models as a proof of concept. RESULTS:We considered three holobiont models, of which two transcriptomes were previously published and a yet unpublished transcriptome, to analyze and sort their raw reads using Short Read Connector, a k-mer based similarity method. Before assembly, we thus defined four distinct categories for each holobiont meta-transcriptome: host reads, symbiont reads, shared reads, and unassigned reads. Afterwards, we observed that independent de novo assemblies for each category led to a diminution of the number of chimeras compared to classical assembly methods. Moreover, the separation of each partner's transcriptome offered the independent and comparative exploration of their functional diversity in the holobiont. Finally, our strategy allowed to propose new functional annotations for two well-studied holobionts (a Cnidaria-Dinophyta, a Porifera-Bacteria) and a first meta-transcriptome from a planktonic Radiolaria-Dinophyta system forming widespread symbiotic association for which our knowledge is considerably limited. CONCLUSIONS:In contrast to classical assembly approaches, our bioinformatic strategy generates less de novo assembled chimera and allows biologists to study separately host and symbiont data from a holobiont mixture. The pre-assembly separation of reads using an efficient tool as Short Read Connector is an effective way to tackle meta-transcriptomic challenges and offers bright perpectives to study holobiont systems composed of either well-studied or poorly characterized symbiotic lineages and ultimately expand our knowledge about these associations.
Project description:RNA-seq enables gene expression profiling in selected spatiotemporal windows and yields massive sequence information with relatively low cost and time investment, even for non-model species. However, there remains a large room for optimizing its workflow, in order to take full advantage of continuously developing sequencing capacity.Transcriptome sequencing for three embryonic stages of Madagascar ground gecko (Paroedura picta) was performed with the Illumina platform. The output reads were assembled de novo for reconstructing transcript sequences. In order to evaluate the completeness of transcriptome assemblies, we prepared a reference gene set consisting of vertebrate one-to-one orthologs.To take advantage of increased read length of >150 nt, we demonstrated shortened RNA fragmentation time, which resulted in a dramatic shift of insert size distribution. To evaluate products of multiple de novo assembly runs incorporating reads with different RNA sources, read lengths, and insert sizes, we introduce a new reference gene set, core vertebrate genes (CVG), consisting of 233 genes that are shared as one-to-one orthologs by all vertebrate genomes examined (29 species)., The completeness assessment performed by the computational pipelines CEGMA and BUSCO referring to CVG, demonstrated higher accuracy and resolution than with the gene set previously established for this purpose. As a result of the assessment with CVG, we have derived the most comprehensive transcript sequence set of the Madagascar ground gecko by means of assembling individual libraries followed by clustering the assembled sequences based on their overall similarities.Our results provide several insights into optimizing de novo RNA-seq workflow, including the coordination between library insert size and read length, which manifested in improved connectivity of assemblies. The approach and assembly assessment with CVG demonstrated here would be applicable to transcriptome analysis of other species as well as whole genome analyses.