HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads.
ABSTRACT: Complete and accurate genome assemblies form the basis of most downstream genomic analyses and are of critical importance. Recent genome assembly projects have relied on a combination of noisy long-read sequencing and accurate short-read sequencing, with the former offering greater assembly continuity and the latter providing higher consensus accuracy. The recently introduced Pacific Biosciences (PacBio) HiFi sequencing technology bridges this divide by delivering long reads (>10 kbp) with high per-base accuracy (>99.9%). Here we present HiCanu, a modification of the Canu assembler designed to leverage the full potential of HiFi reads via homopolymer compression, overlap-based error correction, and aggressive false overlap filtering. We benchmark HiCanu with a focus on the recovery of haplotype diversity, major histocompatibility complex (MHC) variants, satellite DNAs, and segmental duplications. For diploid human genomes sequenced to 30× HiFi coverage, HiCanu achieved superior accuracy and allele recovery compared to the current state of the art. On the effectively haploid CHM13 human cell line, HiCanu achieved an NG50 contig size of 77 Mbp with a per-base consensus accuracy of 99.999% (QV50), surpassing recent assemblies of high-coverage, ultralong Oxford Nanopore Technologies (ONT) reads in terms of both accuracy and continuity. This HiCanu assembly correctly resolves 337 out of 341 validation BACs sampled from known segmental duplications and provides the first preliminary assemblies of nine complete human centromeric regions. Although gaps and errors still remain within the most challenging regions of the genome, these results represent a significant advance toward the complete assembly of human genomes.
Project description:Genome assemblers are computational tools for <i>de novo</i> genome assembly, based on a plenitude of primary sequencing data. The quality of genome assemblies is estimated by their contiguity and the occurrences of misassemblies (duplications, deletions, translocations or inversions). The rapid development of sequencing technologies has enabled the rise of novel <i>de novo</i> genome assembly strategies. The ultimate goal of such strategies is to utilise the features of each sequencing platform in order to address the existing weaknesses of each sequencing type and compose a complete and correct genome map. In the present study, the hybrid strategy, which is based on Illumina short paired?end reads and Nanopore long reads, was benchmarked using MaSuRCA and Wengan assemblers. Moreover, the long?read assembly strategy, which is based on Nanopore reads, was benchmarked using Canu or PacBio HiFi reads were benchmarked using Hifiasm and HiCanu. The assemblies were performed on a computational cluster with limited computational resources. Their outputs were evaluated in terms of accuracy and computational performance. PacBio HiFi assembly strategy outperforms the other ones, while Hi?C scaffolding, which is based on chromatin 3D structure, is required in order to increase continuity, accuracy and completeness when large and complex genomes, such as the human one, are assembled. The use of Hi?C data is also necessary while using the hybrid assembly strategy. The results revealed that HiFi sequencing enabled the rise of novel algorithms which require less genome coverage than that of the other strategies making the assembly a less computationally demanding task. Taken together, these developments may lead to the democratisation of genome assembly projects which are now approachable by smaller labs with limited technical and financial resources.
Project description:The ability to characterize repetitive regions of the human genome is limited by the read lengths of short-read sequencing technologies. Although long-read sequencing technologies such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies can potentially overcome this limitation, long segmental duplications with high sequence identity pose challenges for long-read mapping. We describe a probabilistic method, DuploMap, designed to improve the accuracy of long-read mapping in segmental duplications. It analyzes reads mapped to segmental duplications using existing long-read aligners and leverages paralogous sequence variants (PSVs)-sequence differences between paralogous sequences-to distinguish between multiple alignment locations. On simulated datasets, DuploMap increased the percentage of correctly mapped reads with high confidence for multiple long-read aligners including Minimap2 (74.3-90.6%) and BLASR (82.9-90.7%) while maintaining high precision. Across multiple whole-genome long-read datasets, DuploMap aligned an additional 8-21% of the reads in segmental duplications with high confidence relative to Minimap2. Using DuploMap-aligned PacBio circular consensus sequencing reads, an additional 8.9 Mb of DNA sequence was mappable, variant calling achieved a higher F1 score and 14 713 additional variants supported by linked-read data were identified. Finally, we demonstrate that a significant fraction of PSVs in segmental duplications overlaps with variants and adversely impacts short-read variant calling.
Project description:We have developed a computational method based on polyploid phasing of long sequence reads to resolve collapsed regions of segmental duplications within genome assemblies. Segmental Duplication Assembler (SDA; https://github.com/mvollger/SDA ) constructs graphs in which paralogous sequence variants define the nodes and long-read sequences provide attraction and repulsion edges, enabling the partition and assembly of long reads corresponding to distinct paralogs. We apply it to single-molecule, real-time sequence data from three human genomes and recover 33-79 megabase pairs (Mb) of duplications in which approximately half of the loci are diverged (<99.8%) compared to the reference genome. We show that the corresponding sequence is highly accurate (>99.9%) and that the diverged sequence corresponds to copy-number-variable paralogs that are absent from the human reference genome. Our method can be applied to other complex genomes to resolve the last gene-rich gaps, improve duplicate gene annotation, and better understand copy-number-variant genetic diversity at the base-pair level.
Project description:Third generation sequencing technologies, with sequencing reads in the tens- of kilo-bases, facilitate genome assembly by spanning ambiguous regions and improving continuity. This has been critical for plant genomes, which are difficult to assemble due to high repeat content, gene family expansions, segmental and tandem duplications, and polyploidy. Recently, high-throughput mapping and scaffolding strategies have further improved continuity. Together, these long-range technologies enable quality draft assemblies of complex genomes in a cost-effective and timely manner.Here, we present high quality genome assemblies of the model legume plant, Medicago truncatula (R108) using PacBio, Dovetail Chicago (hereafter, Dovetail) and BioNano technologies. To test these technologies for plant genome assembly, we generated five assemblies using all possible combinations and ordering of these three technologies in the R108 assembly. While the BioNano and Dovetail joins overlapped, they also showed complementary gains in continuity and join numbers. Both technologies spanned repetitive regions that PacBio alone was unable to bridge. Combining technologies, particularly Dovetail followed by BioNano, resulted in notable improvements compared to Dovetail or BioNano alone. A combination of PacBio, Dovetail, and BioNano was used to generate a high quality draft assembly of R108, a M. truncatula accession widely used in studies of functional genomics. As a test for the usefulness of the resulting genome sequence, the new R108 assembly was used to pinpoint breakpoints and characterize flanking sequence of a previously identified translocation between chromosomes 4 and 8, identifying more than 22.7 Mb of novel sequence not present in the earlier A17 reference assembly.Adding Dovetail followed by BioNano data yielded complementary improvements in continuity over the original PacBio assembly. This strategy proved efficient and cost-effective for developing a quality draft assembly compared to traditional reference assemblies.
Project description:Recent long-read assemblies often exceed the quality and completeness of available reference genomes, making validation challenging. Here we present Merqury, a novel tool for reference-free assembly evaluation based on efficient k-mer set operations. By comparing k-mers in a de novo assembly to those found in unassembled high-accuracy reads, Merqury estimates base-level accuracy and completeness. For trios, Merqury can also evaluate haplotype-specific accuracy, completeness, phase block continuity, and switch errors. Multiple visualizations, such as k-mer spectrum plots, can be generated for evaluation. We demonstrate on both human and plant genomes that Merqury is a fast and robust method for assembly validation.
Project description:It is currently impossible to get complete de-novo assembly of segmentally duplicated genome regions using genome-wide short-read datasets. Here, we devise a new computational method called Regional Extension of Assemblies Using Linked-Reads (REXTAL) for improved region-specific assembly of segmental duplication-containing DNA, leveraging genomic short-read datasets generated from large DNA molecules partitioned and barcoded using the "Gel Bead in Emulsion" (GEM) microfluidic method (Zheng et al., 2016). We show that using REXTAL, it is possible to extend assembly of single-copy diploid DNA into adjacent, otherwise inaccessible subtelomere segmental duplication regions and other subtelomeric gap regions. Moreover, REXTAL is computationally more efficient for the directed assembly of such regions from multiple genomes (e.g., for the comparison of structural variation) than genome-wide assembly approaches.
Project description:Illumina sequencing allows rapid, cheap and accurate whole genome bacterial analyses, but short reads (<300?bp) do not usually enable complete genome assembly. Long-read sequencing greatly assists with resolving complex bacterial genomes, particularly when combined with short-read Illumina data (hybrid assembly). However, it is not clear how different long-read sequencing methods affect hybrid assembly accuracy. Relative automation of the assembly process is also crucial to facilitating high-throughput complete bacterial genome reconstruction, avoiding multiple bespoke filtering and data manipulation steps. In this study, we compared hybrid assemblies for 20 bacterial isolates, including two reference strains, using Illumina sequencing and long reads from either Oxford Nanopore Technologies (ONT) or SMRT Pacific Biosciences (PacBio) sequencing platforms. We chose isolates from the family Enterobacteriaceae, as these frequently have highly plastic, repetitive genetic structures, and complete genome reconstruction for these species is relevant for a precise understanding of the epidemiology of antimicrobial resistance. We de novo assembled genomes using the hybrid assembler Unicycler and compared different read processing strategies, as well as comparing to long-read-only assembly with Flye followed by short-read polishing with Pilon. Hybrid assembly with either PacBio or ONT reads facilitated high-quality genome reconstruction, and was superior to the long-read assembly and polishing approach evaluated with respect to accuracy and completeness. Combining ONT and Illumina reads fully resolved most genomes without additional manual steps, and at a lower consumables cost per isolate in our setting. Automated hybrid assembly is a powerful tool for complete and accurate bacterial genome assembly.
Project description:The Illumina DNA sequencing platform generates accurate but short reads, which can be used to produce accurate but fragmented genome assemblies. Pacific Biosciences and Oxford Nanopore Technologies DNA sequencing platforms generate long reads that can produce complete genome assemblies, but the sequencing is more expensive and error-prone. There is significant interest in combining data from these complementary sequencing technologies to generate more accurate "hybrid" assemblies. However, few tools exist that truly leverage the benefits of both types of data, namely the accuracy of short reads and the structural resolving power of long reads. Here we present Unicycler, a new tool for assembling bacterial genomes from a combination of short and long reads, which produces assemblies that are accurate, complete and cost-effective. Unicycler builds an initial assembly graph from short reads using the de novo assembler SPAdes and then simplifies the graph using information from short and long reads. Unicycler uses a novel semi-global aligner to align long reads to the assembly graph. Tests on both synthetic and real reads show Unicycler can assemble larger contigs with fewer misassemblies than other hybrid assemblers, even when long-read depth and accuracy are low. Unicycler is open source (GPLv3) and available at github.com/rrwick/Unicycler.
Project description:Long-read sequencing technologies such as Pacific Biosciences and Oxford Nanopore MinION are capable of producing long sequencing reads with average fragment lengths of over 10,000 base-pairs and maximum lengths reaching 100,000 base- pairs. Compared with short reads, the assemblies obtained from long-read sequencing platforms have much higher contig continuity and genome completeness as long fragments are able to extend paths into problematic or repetitive regions. Many successful assembly applications of the Pacific Biosciences technology have been reported ranging from small bacterial genomes to large plant and animal genomes. Recently, genome assemblies using Oxford Nanopore MinION data have attracted much attention due to the portability and low cost of this novel sequencing instrument. In this paper, we re-sequenced a well characterized genome, the Saccharomyces cerevisiae S288C strain using three different platforms: MinION, PacBio and MiSeq. We present a comprehensive metric comparison of assemblies generated by various pipelines and discuss how the platform associated data characteristics affect the assembly quality. With a given read depth of 31X, the assemblies from both Pacific Biosciences and Oxford Nanopore MinION show excellent continuity and completeness for the 16 nuclear chromosomes, but not for the mitochondrial genome, whose reconstruction still represents a significant challenge.
Project description:PacBio long reads sequencing presents several potential advantages for DNA assembly, including being able to provide more complete gene profiling of metagenomic samples. However, lower single-pass accuracy can make gene discovery and assembly for low-abundance organisms difficult. To evaluate the application and performance of PacBio long reads and Illumina HiSeq short reads in metagenomic analyses, we directly compared various assemblies involving PacBio and Illumina sequencing reads based on two anaerobic digestion microbiome samples from a biogas fermenter. Using a PacBio platform, 1.58 million long reads (19.6 Gb) were produced with an average length of 7,604 bp. Using an Illumina HiSeq platform, 151.2 million read pairs (45.4 Gb) were produced. Hybrid assemblies using PacBio long reads and HiSeq contigs produced improvements in assembly statistics, including an increase in the average contig length, contig N50 size, and number of large contigs. Interestingly, depth-based hybrid assemblies generated a higher percentage of complete genes (98.86%) compared to those based on HiSeq contigs only (40.29%), because the PacBio reads were long enough to cover many repeating short elements and capture multiple genes in a single read. Additionally, the incorporation of PacBio long reads led to considerable advantages regarding reducing contig numbers and increasing the completeness of the genome reconstruction, which was poorly assembled and binned when using HiSeq data alone. From this comparison of PacBio long reads with Illumina HiSeq short reads related to complex microbiome samples, we conclude that PacBio long reads can produce longer contigs, more complete genes, and better genome binning, thereby offering more information about metagenomic samples.