Two low coverage bird genomes and a comparison of reference-guided versus de novo genome assemblies.
ABSTRACT: As a greater number and diversity of high-quality vertebrate reference genomes become available, it is increasingly feasible to use these references to guide new draft assemblies for related species. Reference-guided assembly approaches may substantially increase the contiguity and completeness of a new genome using only low levels of genome coverage that might otherwise be insufficient for de novo genome assembly. We used low-coverage (?3.5-5.5x) Illumina paired-end sequencing to assemble draft genomes of two bird species (the Gunnison Sage-Grouse, Centrocercus minimus, and the Clark's Nutcracker, Nucifraga columbiana). We used these data to estimate de novo genome assemblies and reference-guided assemblies, and compared the information content and completeness of these assemblies by comparing CEGMA gene set representation, repeat element content, simple sequence repeat content, and GC isochore structure among assemblies. Our results demonstrate that even lower-coverage genome sequencing projects are capable of producing informative and useful genomic resources, particularly through the use of reference-guided assemblies.
Project description:Daphnia species are well-suited for studying local adaptation and evolutionary responses to stress(ors) including those caused by algal blooms. Algal blooms, characterized by an overgrowth (bloom) of cyanobacteria, are detrimental to the health of aquatic and terrestrial members of freshwater ecosystems. Some strains of Daphnia pulicaria have demonstrated resistance to toxic algae and the ability to mitigate toxic algal blooms. Understanding the genetic mechanism associated with this toxin resistance requires adequate genomic resources. Using whole-genome sequence data mapped to the Daphnia pulex reference genome (PA42), we present reference-guided draft assemblies from one tolerant and one sensitive strain of D. pulicaria, Wintergreen-6 (WI-6), and Bassett-411 (BA-411), respectively. Assessment of the draft assemblies reveal low contamination levels, and high levels (95%) of genic content. Reference scaffolds had coverage breadths of 98.9-99.4%, and average depths of 33X and 29X for BA-411 and WI-6, respectively. Within, we discuss caveats and suggestions for improving these draft assemblies. These genomic resources are presented with a goal of contributing to the resources necessary to understand the genetic mechanisms and associations of toxic prey resistance observed in this species.
Project description:Genome assembly of short reads from large plant genomes remains a challenge in computational biology despite major developments in next generation sequencing. Of late several draft assemblies have been reported in sequenced plant genomes. The reported draft genome assemblies of <i>Cajanus cajan</i> have different levels of genome completeness, a large number of repeats, gaps, and segmental duplications. Draft assemblies with portions of genome missing are shorter than the referenced original genome. These assemblies come with low map accuracy affecting further functional annotation and the prediction of gene components as desired by crop researchers. Genome coverage, i.e., the number of sequenced raw reads mapped onto a certain location of the genome is an important quality indicator of completeness and assembly quality in draft assemblies. The present work aimed to improve the coverage in reported <i>de novo</i> sequenced draft genomes (GCA_000340665.1 and GCA_000230855.2) of pigeonpea, a legume widely cultivated in India. The two recently sequenced assemblies, A1 and A2 comprised 72% and 75% of the estimated coverage of the genome, respectively. We employed an assembly reconciliation approach to compare the draft assemblies and merge them, filling the gaps by employing an algorithm size sorting mate-pair library to generate a high quality and near complete assembly with enhanced contiguity. The majority of gaps present within scaffolds were filled with right-sized mate-pair reads. The improved assembly reduced the number of gaps than those reported in draft assemblies resulting in an improved genome coverage of 82.4%. Map accuracy of the improved assembly was evaluated using various quality metrics and for the presence of specific trait-related functional genes. Employed pair-end and mate-pair local libraries helped us to reduce gaps, repeats, and other sequence errors resulting in lengthier scaffolds compared to the two draft assemblies. We reported the prediction of putative host resistance genes against <i>Fusarium</i> wilt disease by their performance and evaluated them both in wet laboratory and field phenotypic conditions.
Project description:Assembling complete or near complete genomes from complex microbial communities remains a significant challenge in metagenomic studies. Recent developments in single cell amplified genomes (SAGs) have enabled the sequencing of individual draft genomes representative of uncultivated microbial populations. SAGs suffer from incomplete and uneven coverage due to artifacts that arise from multiple displacement amplification techniques. Conversely, metagenomic sequence data does not suffer from the same biases as SAGs, and significant improvements have been realized in the recovery of draft genomes from metagenomes. Nevertheless, the inherent genomic complexity of many microbial communities often obfuscates facile generation of population genome assemblies from metagenomic data. Here we describe a new method for metagenomic-guided SAG assembly that leverages the advantages of both methods and significantly improves the completeness of initial SAGs assemblies. We demonstrate that SAG assemblies of two cosmopolitan marine lineages-Marine Group 1 Thaumarchaeota and SAR324 clade bacterioplankton-were substantially improved using this approach. Moreover, the improved assemblies strengthened biological inferences. For example, the improved SAR324 clade genome assembly revealed the presence of many genes in phenylalanine catabolism and flagellar assembly that were absent in the original SAG.
Project description:BACKGROUND:Nicotiana benthamiana is an important model organism of the Solanaceae (Nightshade) family. Several draft assemblies of the N. benthamiana genome have been generated, but many of the gene-models in these draft assemblies appear incorrect. RESULTS:Here we present an improved proteome based on the Niben1.0.1 draft genome assembly guided by gene models from other Nicotiana species. Due to the fragmented nature of the Niben1.0.1 draft genome, many protein-encoding genes are missing or partial. We complement these missing proteins by similarly annotating other draft genome assemblies. This approach overcomes problems caused by mis-annotated exon-intron boundaries and mis-assigned short read transcripts to homeologs in polyploid genomes. With an estimated 98.1% completeness; only 53,411 protein-encoding genes; and improved protein lengths and functional annotations, this new predicted proteome is better in assigning spectra than the preceding proteome annotations. This dataset is more sensitive and accurate in proteomics applications, clarifying the detection by activity-based proteomics of proteins that were previously predicted to be inactive. Phylogenetic analysis of the subtilase family of hydrolases reveal inactivation of likely homeologs, associated with a contraction of the functional genome in this alloploid plant species. Finally, we use this new proteome annotation to characterize the extracellular proteome as compared to a total leaf proteome, which highlights the enrichment of hydrolases in the apoplast. CONCLUSIONS:This proteome annotation provides the community working with Nicotiana benthamiana with an important new resource for functional proteomics.
Project description:<h4>Summary</h4>The ability to generate high-quality genome sequences is cornerstone to modern biological research. Even with recent advancements in sequencing technologies, many genome assemblies are still not achieving reference-grade. Here, we introduce ntJoin, a tool that leverages structural synteny between a draft assembly and reference sequence(s) to contiguate and correct the former with respect to the latter. Instead of alignments, ntJoin uses a lightweight mapping approach based on a graph data structure generated from ordered minimizer sketches. The tool can be used in a variety of different applications, including improving a draft assembly with a reference-grade genome, a short-read assembly with a draft long-read assembly and a draft assembly with an assembly from a closely related species. When scaffolding a human short-read assembly using the reference human genome or a long-read assembly, ntJoin improves the NGA50 length 23- and 13-fold, respectively, in under 13?m, using <11 GB of RAM. Compared to existing reference-guided scaffolders, ntJoin generates highly contiguous assemblies faster and using less memory.<h4>Availability and implementation</h4>ntJoin is written in C++ and Python and is freely available at https://github.com/bcgsc/ntjoin.<h4>Supplementary information</h4>Supplementary data are available at Bioinformatics online.
Project description:Many genomes have been sequenced to high-quality draft status using Sanger capillary electrophoresis and/or newer short-read sequence data and whole genome assembly techniques. However, even the best draft genomes contain gaps and other imperfections due to limitations in the input data and the techniques used to build draft assemblies. Sequencing biases, repetitive genomic features, genomic polymorphism, and other complicating factors all come together to make some regions difficult or impossible to assemble. Traditionally, draft genomes were upgraded to "phase 3 finished" status using time-consuming and expensive Sanger-based manual finishing processes. For more facile assembly and automated finishing of draft genomes, we present here an automated approach to finishing using long-reads from the Pacific Biosciences RS (PacBio) platform. Our algorithm and associated software tool, PBJelly, (publicly available at https://sourceforge.net/projects/pb-jelly/) automates the finishing process using long sequence reads in a reference-guided assembly process. PBJelly also provides "lift-over" co-ordinate tables to easily port existing annotations to the upgraded assembly. Using PBJelly and long PacBio reads, we upgraded the draft genome sequences of a simulated Drosophila melanogaster, the version 2 draft Drosophila pseudoobscura, an assembly of the Assemblathon 2.0 budgerigar dataset, and a preliminary assembly of the Sooty mangabey. With 24× mapped coverage of PacBio long-reads, we addressed 99% of gaps and were able to close 69% and improve 12% of all gaps in D. pseudoobscura. With 4× mapped coverage of PacBio long-reads we saw reads address 63% of gaps in our budgerigar assembly, of which 32% were closed and 63% improved. With 6.8× mapped coverage of mangabey PacBio long-reads we addressed 97% of gaps and closed 66% of addressed gaps and improved 19%. The accuracy of gap closure was validated by comparison to Sanger sequencing on gaps from the original D. pseudoobscura draft assembly and shown to be dependent on initial reference quality.
Project description:BACKGROUND:Whole-genome sequencing data from wild-caught individuals of closely related North American killifish species (Fundulus xenicus, Fundulus catenatus, Fundulus nottii, and Fundulus olivaceus) were obtained using long-read Oxford Nanopore Technology (ONT) PromethION and short-read Illumina platforms. FINDINGS:Draft de novo reference genome assemblies were generated using a combination of long and short sequencing reads. For each species, the PromethION platform was used to generate 30-45× sequence coverage, and the Illumina platform was used to generate 50-160× sequence coverage. Illumina-only assemblies were fragmented with high numbers of contigs, while ONT-only assemblies were error prone with low BUSCO scores. The highest N50 values, ranging from 0.4 to 2.7 Mb, were from assemblies generated using a combination of short- and long-read data. BUSCO scores were consistently >90% complete using the Eukaryota database. CONCLUSIONS:High-quality genomes can be obtained from a combination of using short-read Illumina data to polish assemblies generated with long-read ONT data. Draft assemblies and raw sequencing data are available for public use. We encourage use and reuse of these data for assembly benchmarking and other analyses.
Project description:Genome assemblies that are accurate, complete and contiguous are essential for identifying important structural and functional elements of genomes and for identifying genetic variation. Nevertheless, most recent genome assemblies remain incomplete and fragmented. While long molecule sequencing promises to deliver more complete genome assemblies with fewer gaps, concerns about error rates, low yields, stringent DNA requirements and uncertainty about best practices may discourage many investigators from adopting this technology. Here, in conjunction with the platinum standard Drosophila melanogaster reference genome, we analyze recently published long molecule sequencing data to identify what governs completeness and contiguity of genome assemblies. We also present a hybrid meta-assembly approach that achieves remarkable assembly contiguity for both Drosophila and human assemblies with only modest long molecule sequencing coverage. Our results motivate a set of preliminary best practices for obtaining accurate and contiguous assemblies, a 'missing manual' that guides key decisions in building high quality de novo genome assemblies, from DNA isolation to polishing the assembly.
Project description:Background:Contiguous genome assemblies are a highly valued biological resource because of the higher number of completely annotated genes and genomic elements that are usable compared to fragmented draft genomes. Nonetheless, contiguity is difficult to obtain if only low coverage data and/or only distantly related reference genome assemblies are available. Findings:In order to improve genome contiguity, we have developed Cross-Species Scaffolding-a new pipeline that imports long-range distance information directly into the de novo assembly process by constructing mate-pair libraries in silico. Conclusions:We show how genome assembly metrics and gene prediction dramatically improve with our pipeline by assembling two primate genomes solely based on ?30x coverage of shotgun sequencing data.
Project description:The metagenome skimming approach, i.e. low coverage shotgun sequencing of multi-species assemblages and subsequent reconstruction of individual genomes, is increasingly used for in-depth genomic characterization of ecological communities. This approach is a promising tool for reconstructing genomes of facultative symbionts, such as lichen-forming fungi, from metagenomic reads. However, no study has so far tested accuracy and completeness of assemblies based on metagenomic sequences compared to assemblies based on pure culture strains of lichenized fungi. Here we assembled the genomes of Evernia prunastri and Pseudevernia furfuracea based on metagenomic sequences derived from whole lichen thalli. We extracted fungal contigs using two different taxonomic binning methods, and performed gene prediction on the fungal contig subsets. We then assessed quality and completeness of the metagenome-based assemblies using genome assemblies as reference which are based on pure culture strains of the two fungal species. Our comparison showed that we were able to reconstruct fungal genomes from uncultured lichen thalli, and also cover most of the gene space (86-90%). Metagenome skimming will facilitate genome mining, comparative (phylo)genomics, and population genetics of lichen-forming fungi by circumventing the time-consuming, sometimes unfeasible, step of aposymbiotic cultivation.