Project description: De novo genome assembly is the process of reconstructing a complete genomic sequence from countless small sequencing reads. Due to the complexity of this task, numerous genome assemblers have been developed to cope with different requirements and the different kinds of data provided by sequencers within the fast-evolving field of next-generation sequencing technologies. In particular, the recently introduced generation of benchtop sequencers, like Illumina's MiSeq and Ion Torrent's Personal Genome Machine (PGM), has brought easy, fast, and cheap sequencing of bacterial organisms to a broad range of academic and clinical institutions. With a strong pragmatic focus, we add to the line of assembly evaluation surveys by benchmarking popular de novo genome assemblers on bacterial data generated by benchtop sequencers. To this end, single-library assemblies were generated and compared to each other by metrics describing assembly contiguity and accuracy, and also by practice-oriented criteria such as computing time. In addition, we extensively analyzed the effect of the depth of coverage on the genome assemblies within reasonable ranges and the k-mer optimization problem of de Bruijn graph assemblers. Our results show that, although both MiSeq and PGM allow for good genome assemblies, they require different approaches. They not only pair with different assembler types but also respond differently to the depth of coverage, where oversampling can become problematic. Assemblies vary greatly in contiguity and accuracy, and also in their computing requirements. Consequently, no assembler can be rated best for all preconditions. Instead, the given kind of data, the demands on assembly quality, and the available computing infrastructure determine which assembler suits best.
The data sets, scripts and all additional information needed to replicate our results are freely available at ftp://ftp.cebitec.uni-bielefeld.de/pub/GABenchToB.
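The abstract above compares assemblies by contiguity and by depth of coverage without defining either. As a minimal sketch, assuming the widely used N50 statistic as the contiguity measure (the abstract does not say which metrics were used) and a simple "total sequenced bases divided by genome size" definition of coverage, the two quantities could be computed as follows. The function names are illustrative and are not taken from the GABenchToB scripts.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the length L of the contig at which the running total of
    contig lengths, taken from longest to shortest, first reaches at
    least half of the total assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


def depth_of_coverage(read_lengths, genome_size):
    """Average depth of coverage: sequenced bases per genome position."""
    return sum(read_lengths) / genome_size


if __name__ == "__main__":
    # Hypothetical contig lengths, for illustration only.
    print(n50([80, 70, 50, 40, 30, 20]))
    # 1000 hypothetical reads of 150 bp over a 5 kb genome.
    print(depth_of_coverage([150] * 1000, 5000))
```

N50 rewards long contigs but says nothing about correctness, which is why the abstract pairs contiguity with accuracy metrics.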
Project description: Background: Plastid acquisition by secondary endosymbiosis is a driving force in algal evolution, and comparative genomics is required to examine the genomic changes in the symbiont. We therefore established a pipeline for de novo assembly of middle-sized genomes at low cost and with high quality using long and short reads. Results: We sequenced the symbiotic alga Chlorella variabilis using the Oxford Nanopore MinION as the long-read sequencer and the Illumina HiSeq 4000 as the short-read sequencer, and then assembled the genomes under various conditions. Subsequently, we evaluated these assemblies by gene model quality and RNA-seq mapping rate. We found that long-read-only assemblies were not suitable for comparative genomics studies, but that adding short reads yielded an acceptable assembly. On the basis of this result, we established a de novo assembly pipeline for middle-sized algal genomes using MinION. Conclusions: The genomic changes during the early stages of plastid acquisition can now be revealed by sequencing and comparing many algal genomes. Moreover, this pipeline offers a solution for assembling various middle-sized eukaryotic genomes with high quality and ease.
Project description: Since chloroplasts and mitochondria are maternally inherited and have unique evolutionary features, DNA sequences of these organelle genomes have been broadly used in phylogenetic studies. Thanks to recent progress in next-generation sequencing (NGS) technology, whole-genome sequencing can be easily performed. Here, using NGS data generated by Roche GS Titanium and Illumina HiSeq 2000, we performed a hybrid assembly of the organelle genome sequences of Vigna angularis (azuki bean). Both the mitochondrial genome (mtDNA) and the chloroplast genome (cpDNA) of V. angularis are very similar in size and gene content to those of V. radiata (mungbean). However, in structure, the mtDNA sequences have undergone many recombination events since divergence from the common ancestor of V. angularis and V. radiata, whereas the cpDNAs are almost identical between the two. The stability of cpDNAs and the variability of mtDNAs were further confirmed by comparative analysis of Vigna organelles with those of the model plants Lotus japonicus and Arabidopsis thaliana.
Project description: Background: Aedes aegypti is the principal vector of yellow fever and dengue viruses throughout the tropical world. To provide a set of manually curated and annotated sequences from the Ae. aegypti genome, 14 mapped bacterial artificial chromosome (BAC) clones encompassing 1.57 Mb were sequenced, assembled and manually annotated using a combination of computational gene-finding, expressed sequence tag (EST) matches and comparative protein homology. PCR and sequencing were used to experimentally confirm expression and sequence of a subset of these transcripts. Results: Of the 51 manual annotations, 50 and 43 demonstrated a high level of similarity to Anopheles gambiae and Drosophila melanogaster genes, respectively. Ten of the 12 BAC sequences with more than one annotated gene exhibited synteny with the A. gambiae genome. Putative transcripts from eight BAC clones were found in multiple copies (two copies in most cases) in the Aedes genome assembly, which points to the probable presence of haplotype polymorphisms and/or misassemblies. Conclusion: This study not only provides a benchmark set of manually annotated transcripts for this genome that can be used to assess the quality of the auto-annotation pipeline and the assembly, but also examines the effect of a high repeat content on the genome assembly and annotation pipeline.
Project description: Background: Generating high-quality de novo genome assemblies is foundational to the genomic study of model and non-model organisms. In recent years, long-read sequencing has greatly benefited genome assembly and scaffolding, a process by which assembled sequences are ordered and oriented through the use of long-range information. Long reads are better able to span repetitive genomic regions than short reads, and thus have tremendous utility for resolving problematic regions and helping generate more complete draft assemblies. Here, we present LongStitch, a scalable pipeline that corrects and scaffolds draft genome assemblies exclusively using long reads. Results: LongStitch incorporates multiple tools developed by our group and runs in up to three stages: an initial assembly correction stage (Tigmint-long) followed by two incremental scaffolding stages (ntLink and ARKS-long). Tigmint-long and ARKS-long are misassembly correction and scaffolding utilities, respectively, previously developed for linked reads, that we adapted for long reads. Here, we describe the LongStitch pipeline and introduce our new long-read scaffolder, ntLink, which utilizes lightweight minimizer mappings to join contigs. LongStitch was tested on short- and long-read assemblies of Caenorhabditis elegans, Oryza sativa, and three different human individuals using corresponding nanopore long-read data, and improves the contiguity of each assembly from 1.2-fold up to 304.6-fold (as measured by NGA50 length). Furthermore, LongStitch generates more contiguous and correct assemblies than the state-of-the-art long-read scaffolder LRScaf in most tests, and consistently improves upon human assemblies in under five hours using less than 23 GB of RAM. Conclusions: Due to its effectiveness and efficiency in improving draft assemblies using long reads, we expect LongStitch to benefit a wide variety of de novo genome assembly projects.
The LongStitch pipeline is freely available at https://github.com/bcgsc/longstitch .
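The abstract above mentions that ntLink joins contigs using lightweight minimizer mappings but does not describe how minimizers are chosen. The following is a generic illustration of the (w, k)-minimizer idea only, not ntLink's implementation: it assumes plain lexicographic ordering of k-mers (real tools typically order k-mers by a hash value), and the function names and parameters are hypothetical.

```python
def minimizers(seq, k, w):
    """Return the sorted list of (position, k-mer) minimizers of seq.

    For every window of w consecutive k-mers, the lexicographically
    smallest k-mer is selected (leftmost on ties). Assumption: plain
    string ordering stands in for the hash ordering used in practice.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: (kmers[i], i))
        picked.add((best, kmers[best]))
    return sorted(picked)


def shared_minimizers(seq_a, seq_b, k, w):
    """Return the minimizer k-mers that two sequences have in common."""
    a = {kmer for _, kmer in minimizers(seq_a, k, w)}
    b = {kmer for _, kmer in minimizers(seq_b, k, w)}
    return a & b


if __name__ == "__main__":
    # Toy sequences, for illustration only.
    print(minimizers("ACGTACGT", k=3, w=2))
    print(shared_minimizers("ACGTACGT", "TTACGTAA", k=3, w=2))
```

Because only a subset of k-mers is kept per window, two sequences can be compared by their shared minimizers at a fraction of the cost of a full alignment, which is the sense in which such mappings are lightweight.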