Project description:Gene synthesis attempts to assemble user-defined DNA sequences with base-level precision. Verifying the sequences of construction intermediates and the final product of a gene synthesis project is a critical part of the workflow, yet one that has received the least attention. Sequence validation is equally important for other kinds of curated clone collections. Ensuring that the physical sequence of a clone matches its published sequence is a common quality control step performed at least once over the course of a research project. GenoREAD is a web-based application that breaks the sequence verification process into two steps: the assembly of sequencing reads and the alignment of the resulting contig with a reference sequence. GenoREAD can determine if a clone matches its reference sequence. Its sophisticated reporting features help identify and troubleshoot problems that arise during the sequence verification process. GenoREAD has been experimentally validated on thousands of gene-sized constructs from an ORFeome project, and on longer sequences including whole plasmids and synthetic chromosomes. Comparing GenoREAD results with those from manual analysis of the sequencing data demonstrates that GenoREAD tends to be conservative in its diagnostic. GenoREAD is available at www.genoread.org.
Project description:BackgroundBioinformatics is fundamental to biomedical sciences, but its mastery presents a steep learning curve for bench biologists and clinicians. Learning to code while analyzing data is difficult. The curve may be flattened by separating these two aspects and providing intermediate steps for budding bioinformaticians. Single-cell analysis is in great demand from biologists and biomedical scientists, as evidenced by the proliferation of training events, materials, and collaborative global efforts like the Human Cell Atlas. However, iterative analyses lacking reinstantiation, coupled with unstandardized pipelines, have made effective single-cell training a moving target.FindingsTo address these challenges, we present a Multi-Interface Galaxy Hands-on Training Suite (MIGHTS) for single-cell RNA sequencing (scRNA-seq) analysis, which offers parallel analytical methods using a graphical interface (buttons) or code. With clear, interoperable materials, MIGHTS facilitates smooth transitions between environments. Bridging the biologist-programmer gap, MIGHTS emphasizes interdisciplinary communication for effective learning at all levels. Real-world data analysis in MIGHTS promotes critical thinking and best practices, while FAIR data principles ensure validation of results. MIGHTS is freely available, hosted on the Galaxy Training Network, and leverages Galaxy interfaces for analyses in both settings. Given the ongoing popularity of Python-based (Scanpy) and R-based (Seurat & Monocle) scRNA-seq analyses, MIGHTS enables analyses using both.ConclusionsMIGHTS consists of 11 tutorials, including recordings, slide decks, and interactive visualizations, and a demonstrated track record of sustainability via regular updates and community collaborations. Parallel pathways in MIGHTS enable concurrent training of scientists at any programming level, addressing the heterogeneous needs of novice bioinformaticians.
Project description:Building de novo genome assemblies for complex genomes is possible thanks to long-read DNA sequencing technologies. However, maximizing the quality of assemblies based on long reads is a challenging task that requires the development of specialized data analysis techniques. We present new algorithms for assembling long DNA sequencing reads from haploid and diploid organisms. The assembly algorithm builds an undirected graph with two vertices for each read based on minimizers selected by a hash function derived from the k-mer distribution. Statistics collected during the graph construction are used as features to build layout paths by selecting edges, ranked by a likelihood function. For diploid samples, we integrated a reimplementation of the ReFHap algorithm to perform molecular phasing. We ran the implemented algorithms on PacBio HiFi and Nanopore sequencing data taken from haploid and diploid samples of different species. Our algorithms showed competitive accuracy and computational efficiency, compared with other currently used software. We expect that this new development will be useful for researchers building genome assemblies for different species.
Project description:Nanopore sequencing of DNA is a single-molecule technique that may achieve long reads, low cost and high speed with minimal sample preparation and instrumentation. Here, we build on recent progress with respect to nanopore resolution and DNA control to interpret the procession of ion current levels observed during the translocation of DNA through the pore MspA. As approximately four nucleotides affect the ion current of each level, we measured the ion current corresponding to all 256 four-nucleotide combinations (quadromers). This quadromer map is highly predictive of ion current levels of previously unmeasured sequences derived from the bacteriophage phi X 174 genome. Furthermore, we show nanopore sequencing reads of phi X 174 up to 4,500 bases in length, which can be unambiguously aligned to the phi X 174 reference genome, and demonstrate proof-of-concept utility with respect to hybrid genome assembly and polymorphism detection. This work provides a foundation for nanopore sequencing of long, natural DNA strands.
Project description:Next-Generation Sequencing (NGS) is widely used to investigate genomic variation. In several studies, the genetic variation of Mycobacterium tuberculosis has been analyzed in sputum samples without previous culture, using target enrichment methodologies for NGS. Alignments obtained by different programs generally map the sequences under default parameters, and from these results, it is assumed that only Mycobacterium reads will be obtained. However, variants of interest microorganism in clinical samples can be confused with a vast collection of reads from other bacteria, viruses, and human DNA. Currently, there are no standardized pipelines, and the cleaning success is never verified since there is a lack of rigorous controls to identify and remove reads from other sputum-microorganisms genetically similar to M. tuberculosis. Therefore, we designed a bioinformatic pipeline to process NGS data from sputum samples, including several filters and quality control points to identify and eliminate non-M. tuberculosis reads to obtain a reliable genetic variant report. Our proposal uses the SURPI software as a taxonomic classifier to filter input sequences and perform a mapping that provides the highest percentage of Mycobacterium reads, minimizing the reads from other microorganisms. We then use the filtered sequences to perform variant calling with the GATK software, ensuring the mapping quality, realignment, recalibration, hard-filtering, and post-filter to increase the reliability of the reported variants. Using default mapping parameters, we identified reads of contaminant bacteria, such as Streptococcus, Rhotia, Actinomyces, and Veillonella. Our final mapping strategy allowed a sequence identity of 97.8% between the input reads and the whole M. tuberculosis reference genome H37Rv using a genomic edit distance of three, thus removing 98.8% of the off-target sequences with a Mycobacterium reads loss of 1.7%. Finally, more than 200 unreliable genetic variants were removed during the variant calling, increasing the report's reliability.
Project description:Single-molecule sequencing instruments can generate multikilobase sequences with the potential to greatly improve genome and transcriptome assembly. However, the error rates of single-molecule reads are high, which has limited their use thus far to resequencing bacteria. To address this limitation, we introduce a correction algorithm and assembly strategy that uses short, high-fidelity sequences to correct the error in single-molecule sequences. We demonstrate the utility of this approach on reads generated by a PacBio RS instrument from phage, prokaryotic and eukaryotic whole genomes, including the previously unsequenced genome of the parrot Melopsittacus undulatus, as well as for RNA-Seq reads of the corn (Zea mays) transcriptome. Our long-read correction achieves >99.9% base-call accuracy, leading to substantially better assemblies than current sequencing strategies: in the best example, the median contig size was quintupled relative to high-coverage, second-generation assemblies. Greater gains are predicted if read lengths continue to increase, including the prospect of single-contig bacterial chromosome assembly.
Project description:Plasmids are a foundational tool for basic and applied research across all subfields of biology. Increasingly, researchers in synthetic biology are relying on and developing massive libraries of plasmids as vectors for directed evolution, combinatorial gene circuit tests, and for CRISPR multiplexing. Verification of plasmid sequences following synthesis is a crucial quality control step that creates a bottleneck in plasmid fabrication workflows. Crucially, researchers often elect to forego the cumbersome verification step, potentially leading to reproducibility and-depending on the application-security issues. In order to facilitate plasmid verification to improve the quality and reproducibility of life science research, we developed a fast, simple, and open source pipeline for assembly and verification of plasmid sequences from Illumina reads. We demonstrate that our pipeline, which relies on de novo assembly, can also be used to detect contaminating sequences in plasmid samples. In addition to presenting our pipeline, we discuss the role for verification and quality control in the increasingly complex life science workflows ushered in by synthetic biology.
Project description:We report the sequencing and assembly of a reference genome for the human GM12878 Utah/Ceph cell line using the MinION (Oxford Nanopore Technologies) nanopore sequencer. 91.2 Gb of sequence data, representing ∼30× theoretical coverage, were produced. Reference-based alignment enabled detection of large structural variants and epigenetic modifications. De novo assembly of nanopore reads alone yielded a contiguous assembly (NG50 ∼3 Mb). We developed a protocol to generate ultra-long reads (N50 > 100 kb, read lengths up to 882 kb). Incorporating an additional 5× coverage of these ultra-long reads more than doubled the assembly contiguity (NG50 ∼6.4 Mb). The final assembled genome was 2,867 million bases in size, covering 85.8% of the reference. Assembly accuracy, after incorporating complementary short-read sequencing data, exceeded 99.8%. Ultra-long reads enabled assembly and phasing of the 4-Mb major histocompatibility complex (MHC) locus in its entirety, measurement of telomere repeat length, and closure of gaps in the reference human genome assembly GRCh38.
Project description:Here, we describe single-tube long fragment read (stLFR), a technology that enables sequencing of data from long DNA molecules using economical second-generation sequencing technology. It is based on adding the same barcode sequence to subfragments of the original long DNA molecule (DNA cobarcoding). To achieve this efficiently, stLFR uses the surface of microbeads to create millions of miniaturized barcoding reactions in a single tube. Using a combinatorial process, up to 3.6 billion unique barcode sequences were generated on beads, enabling practically nonredundant cobarcoding with 50 million barcodes per sample. Using stLFR, we demonstrate efficient unique cobarcoding of more than 8 million 20- to 300-kb genomic DNA fragments. Analysis of the human genome NA12878 with stLFR demonstrated high-quality variant calling and phase block lengths up to N50 34 Mb. We also demonstrate detection of complex structural variants and complete diploid de novo assembly of NA12878. These analyses were all performed using single stLFR libraries, and their construction did not significantly add to the time or cost of whole-genome sequencing (WGS) library preparation. stLFR represents an easily automatable solution that enables high-quality sequencing, phasing, SV detection, scaffolding, cost-effective diploid de novo genome assembly, and other long DNA sequencing applications.