Lighter: fast and memory-efficient sequencing error correction without counting.
ABSTRACT: Lighter is a fast, memory-efficient tool for correcting sequencing errors. Lighter avoids counting k-mers. Instead, it uses a pair of Bloom filters, one holding a sample of the input k-mers and the other holding k-mers likely to be correct. As long as the sampling fraction is adjusted in inverse proportion to the depth of sequencing, Bloom filter size can be held constant while maintaining near-constant accuracy. Lighter is parallelized, uses no secondary storage, and is both faster and more memory-efficient than competing approaches while achieving comparable accuracy.
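To make the sampling idea concrete, here is a minimal Python sketch of Lighter's first pass, assuming a toy Bloom filter class and an illustrative constant c in the sampling fraction alpha = c/d; the actual tool's filter layout and its second "trusted k-mer" pass are more involved.

```python
import hashlib
import random

class BloomFilter:
    """Minimal Bloom filter over strings; illustrative only, not Lighter's implementation."""
    def __init__(self, n_bits, n_hashes):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # Derive n_hashes bit positions from salted SHA-256 digests.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def sample_pass(reads, k, alpha, bf, rng=None):
    """Sketch of Lighter's subsampling pass: insert each k-mer occurrence into
    the Bloom filter with probability alpha. Choosing alpha inversely
    proportional to the sequencing depth d (alpha = c/d, c an assumed constant)
    keeps the expected number of insertions per distinct genomic k-mer, and
    hence the required filter size, roughly constant as depth grows."""
    rng = rng or random.Random(0)
    for read in reads:
        for i in range(len(read) - k + 1):
            if rng.random() < alpha:
                bf.add(read[i:i + k])

bf = BloomFilter(n_bits=1 << 16, n_hashes=3)
sample_pass(["ACGTACGTACGT", "ACGTACGTACGA"], k=5, alpha=0.5, bf=bf)
```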
Project description: BACKGROUND: High-throughput sequencing technologies have become fast and cheap in recent years. As a result, large-scale projects have begun sequencing tens to several thousands of genomes per species, producing a large number of sequences sampled from each genome. Such a highly redundant collection of very similar sequences is called a pan-genome. It can be transformed into a set of sequences "colored" by the genomes to which they belong. A colored de Bruijn graph (C-DBG) extracts from the sequences all colored k-mers (strings of length k) and stores them in vertices. RESULTS: In this paper, we present an alignment-free, reference-free and incremental data structure for storing a pan-genome as a C-DBG: the Bloom filter trie (BFT). The data structure makes it possible to store and compress a set of colored k-mers and to traverse the graph efficiently. The Bloom filter trie was used to index and query different pan-genome datasets. Compared to another state-of-the-art data structure, the BFT was up to two times faster to build while using about the same amount of main memory. For querying k-mers, the BFT was about 52-66 times faster while using about 5.5-14.3 times less memory. CONCLUSION: We present a novel succinct data structure, the Bloom filter trie, for indexing a pan-genome as a colored de Bruijn graph. The trie stores k-mers and their colors based on a new representation of vertices that compresses and indexes shared substrings. Vertices use basic data structures for lightweight substring storage as well as Bloom filters for efficient trie and graph traversals. Experimental results demonstrate better performance compared to another state-of-the-art data structure. AVAILABILITY: https://www.github.com/GuillaumeHolley/BloomFilterTrie.
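The relation the BFT stores can be illustrated with a toy colored k-mer index; the dict below only captures the semantics (k-mer to color-set mapping), not the BFT's succinct burst-trie representation, and the function name is a placeholder of ours.

```python
from collections import defaultdict

def color_kmers(genomes, k):
    """Toy colored k-mer index: map each k-mer to the set of genome IDs
    (colors) it occurs in. The BFT stores this same relation compressed in
    a trie with Bloom filters; this dict only illustrates the semantics."""
    colors = defaultdict(set)
    for color, seq in enumerate(genomes):
        for i in range(len(seq) - k + 1):
            colors[seq[i:i + k]].add(color)
    return colors

index = color_kmers(["ACGTACG", "TTGTACG"], k=4)
print(index["GTAC"])  # {0, 1}: this k-mer is shared by both genomes
```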
Project description: Counting k-mers (substrings of length k in DNA sequence data) is an essential component of many methods in bioinformatics, including genome and transcriptome assembly, metagenomic sequencing, and error correction of sequence reads. Although simple in principle, counting k-mers in large modern sequence datasets can easily overwhelm the memory capacity of standard computers. In current datasets, a large fraction (often more than 50%) of the storage capacity may be spent on storing k-mers that contain sequencing errors and are typically observed only a single time in the data. These singleton k-mers are uninformative for many algorithms without some kind of error correction. We present a new method that identifies all the k-mers that occur more than once in a DNA sequence dataset. Our method does this using a Bloom filter, a probabilistic data structure that stores all the observed k-mers implicitly in memory with greatly reduced memory requirements. We then make a second sweep through the data to provide exact counts of all non-unique k-mers. For example datasets, we report up to 50% savings in memory usage compared to current software, with modest costs in computational speed. This approach may reduce memory requirements for any algorithm that starts by counting k-mers in sequence data with errors. A reference implementation of this methodology, BFCounter, is written in C++ and is GPL licensed. It is available for free download at http://pritch.bsd.uchicago.edu/bfcounter.html.
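A short sketch of the two-pass scheme, assuming a Python set as a stand-in for the Bloom filter; swapping in a real Bloom filter preserves correctness because the second pass recounts candidates exactly and thereby discards any false positives the filter lets through. This is our illustration, not BFCounter's C++ code.

```python
from collections import Counter

def nonunique_kmer_counts(reads, k):
    """Two-pass counting in the spirit of BFCounter (sketch).
    Pass 1: `seen` (stand-in for the Bloom filter) remembers k-mers observed
    once; a k-mer already present on arrival becomes a repeat candidate.
    Pass 2: recount candidates exactly, so singletons are never stored."""
    seen, candidates = set(), set()
    for read in reads:                      # pass 1: flag repeat candidates
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen:
                candidates.add(kmer)
            else:
                seen.add(kmer)
    counts = Counter()
    for read in reads:                      # pass 2: exact counts, candidates only
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in candidates:
                counts[kmer] += 1
    return counts

print(nonunique_kmer_counts(["ACGTACGT", "TTTT"], k=4))  # Counter({'ACGT': 2})
```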
Project description: BACKGROUND: Gene homology type classification is required for many types of genome analyses, including comparative genomics, phylogenetics, and protein function annotation. Consequently, a large variety of tools have been developed to perform homology classification across genomes of different species. However, when applied to large genomic datasets, these tools require high memory and CPU usage, typically available only in computational clusters. FINDINGS: Here we present a new graph-based orthology analysis tool, SwiftOrtho, which is optimized for speed and memory usage when applied to large-scale data. SwiftOrtho uses long k-mers to speed up homology search, while using a reduced amino acid alphabet and spaced seeds to compensate for the loss of sensitivity due to long k-mers. In addition, it uses an affinity propagation algorithm to reduce memory usage when clustering large-scale orthology relationships into orthologous groups. In our tests, SwiftOrtho was the only tool that completed orthology analysis of proteins from 1,760 bacterial genomes on a computer with only 4 GB of RAM. Using various standard orthology datasets, we also show that SwiftOrtho has high accuracy. CONCLUSIONS: SwiftOrtho enables accurate comparative genomic analyses of thousands of genomes using low-memory computers. SwiftOrtho is available at https://github.com/Rinoahu/SwiftOrtho.
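The following toy sketch shows how a reduced alphabet and spaced seeds combine; the residue grouping and the seed pattern "11011" are hypothetical examples chosen by us, not SwiftOrtho's actual parameters.

```python
# Hypothetical reduced amino acid alphabet: each group collapses to its first letter.
REDUCED = {}
for group in ("AGST", "C", "DENQ", "FWY", "HKR", "ILMV", "P"):
    for aa in group:
        REDUCED[aa] = group[0]

def spaced_seeds(protein, pattern="11011"):
    """Yield spaced-seed keys: keep residues where the pattern has '1' and skip
    the '0' (don't-care) positions. Kept residues are mapped to the reduced
    alphabet, so biochemically similar residues (e.g. I/L/V) yield identical
    keys, recovering sensitivity lost to the longer seed span."""
    span = len(pattern)
    for i in range(len(protein) - span + 1):
        window = protein[i:i + span]
        yield "".join(REDUCED.get(c, c) for c, p in zip(window, pattern) if p == "1")

print(list(spaced_seeds("MKVLIN")))  # ['IHII', 'HIID']
```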
Project description: Reading the nucleotides from the two ends of a DNA fragment is called paired-end tag (PET) sequencing. When the fragment length is longer than the combined read length, a gap of unsequenced nucleotides remains between the read pairs. If the target in such experiments is sequenced deeply enough to provide redundant coverage, it may be possible to bridge these gaps using bioinformatics methods. Konnector is a local de novo assembly tool that addresses this problem. Here we report on version 2.0 of our tool. Konnector uses a probabilistic and memory-efficient data structure called a Bloom filter to represent a k-mer spectrum: all sequences of length k present in an input file, such as the collection of reads in a PET sequencing experiment. It performs look-ups against this data structure to construct an implicit de Bruijn graph, which describes the (k-1) base pair overlaps between adjacent k-mers, and it traverses this graph to bridge the gap between a given pair of flanking sequences. Here we report the performance of Konnector v2.0 on simulated and experimental datasets, and compare it against other tools with similar functionality. We note that, by representing k-mers with 1.5 bytes of memory on average, Konnector can scale to very large genomes. With our parallel implementation, it can also process over a billion bases on commodity hardware.
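The "implicit" graph never stores edges; successors are discovered by membership queries, as in this minimal sketch (ours, not Konnector's C++ code), where the spectrum is any set-like object:

```python
def neighbors(kmer, kmer_spectrum):
    """Enumerate successors of `kmer` in the implicit de Bruijn graph: shift
    left by one base, try each extension, and keep those present in the
    k-mer spectrum. With a Bloom filter as `kmer_spectrum`, a small
    false-positive rate adds occasional spurious edges that the gap-bridging
    path search must tolerate."""
    for base in "ACGT":
        candidate = kmer[1:] + base
        if candidate in kmer_spectrum:
            yield candidate

spectrum = {"ACGT", "CGTA", "GTAC", "CGTG"}  # stand-in for the Bloom filter
print(list(neighbors("ACGT", spectrum)))     # ['CGTA', 'CGTG']
```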
Project description: Understanding variability in smoking patterns may inform smoking cessation interventions. Retrospective reports of cigarettes smoked per day may be biased and typically do not provide temporal precision regarding when cigarettes are smoked. However, real-time, user-initiated tracking, such as logging each time a cigarette is smoked, can be burdensome over long time frames. In this study, adult, non-treatment-seeking daily smokers (N = 22) used an electronic smart lighter to light and timestamp cigarettes for 14 days. Participants reported the number of cigarettes smoked per day (CPD) via a mobile device (daily diary) and retrospectively reported CPD at the end of the study using the Timeline Followback (TLFB). Self-reported lighter satisfaction and adherence varied, with 68% of participants reporting that they liked using the lighter and participants reporting using the lighter for 92% of cigarettes smoked, on average. Lighter-estimated CPD did not differ from daily diary-estimated CPD, but was significantly lower than TLFB estimates. The lighter yielded greater day-to-day variability relative to the other methods and fewer rounded cigarette counts (digit bias) relative to the TLFB. The lighter appears to be feasible for capturing data on smoking patterns in daily smokers. Though false-positive cigarettes are likely rare, additional technologies that augment data captured from the lighter may be necessary to reduce false negatives (missed cigarettes), and alternative lighter designs may appeal more to certain smokers.
Project description: Using a sequence's k-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As k-mer sets often reach hundreds of millions of elements, traditional data structures are often impractical for k-mer set storage, and Bloom filters (BFs) and their variants are used instead. BFs reduce the memory footprint required to store millions of k-mers while allowing for fast set containment queries, at the cost of a low false positive rate (FPR). We show that, because k-mers are derived from sequencing reads, the information about k-mer overlap in the original sequence can be used to reduce the FPR up to 30× with little or no additional memory and with set containment queries that are only 1.3-1.6 times slower. Alternatively, we can leverage k-mer overlap information to store k-mer sets in about half the space while maintaining the original FPR. We consider several variants of such k-mer Bloom filters (kBFs), derive theoretical upper bounds for their FPR, and discuss their range of applications and limitations.
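A minimal sketch of the overlap idea, under the assumption of a one-sided neighbor check (one of several kBF variants); function and variable names are ours, and a set stands in for the Bloom filter:

```python
def kbf_contains(kmer, bf):
    """Strengthened containment query in the spirit of the kBF idea: report
    `kmer` present only if the plain filter contains it AND at least one
    adjacent k-mer (sharing k-1 bases to the left or right) is also in the
    filter. True k-mers come from reads and so almost always have a stored
    neighbor, while an isolated false positive rarely does, which lowers the
    effective FPR. (Real kBFs handle k-mers at read boundaries specially.)"""
    if kmer not in bf:
        return False
    left = [b + kmer[:-1] for b in "ACGT"]
    right = [kmer[1:] + b for b in "ACGT"]
    return any(n in bf for n in left + right)

stored = {"ACGTA", "CGTAC"}                # stand-in for the Bloom filter contents
print(kbf_contains("ACGTA", stored))       # True: successor CGTAC is also stored
```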
Project description: The transformation-induced stress relaxation and stress recovery of a TiNi shape memory alloy (SMA) under stress-controlled subloop loading were investigated based on the local variation in temperature and the transformation band on the surface of the tape in tension tests. The results obtained are summarized as follows. (1) In the loading process, temperature increases due to the exothermic martensitic transformation (MT) until the holding strain is reached; thereafter, temperature decreases while the strain is held constant, resulting in stress relaxation due to the MT. (2) In the unloading process, temperature decreases due to the endothermic reverse transformation until the holding strain is reached; thereafter, temperature increases while the strain is held constant, resulting in stress recovery due to the reverse transformation. (3) Stress varies markedly in the initial stage, followed by gradual change, while the strain is held constant. (4) If the stress rate is high up to the holding strain in the loading and unloading processes, both the stress relaxation and the stress recovery are large. (5) It is important to take this behavior into account in the design of SMA elements, since the force exerted by SMA elements varies even if the ambient temperature is kept constant.
Project description: With the rapidly increasing volume of deep sequencing data, more efficient algorithms and data structures are needed. Minimizers are a central recent paradigm that has improved various sequence analysis tasks, including hashing for faster read overlap detection, sparse suffix arrays for creating smaller indexes, and Bloom filters for speeding up sequence search. Here, we propose an alternative paradigm that can lead to substantial further improvement in these and other tasks. For integers k and L > k, we say that a set of k-mers is a universal hitting set (UHS) if every possible L-long sequence must contain a k-mer from the set. We develop a heuristic called DOCKS to find a compact UHS, which works in two phases: the first phase is solved optimally, and for the second we propose several efficient heuristics, trading set size for speed and memory. The use of heuristics is motivated by the NP-hardness of a closely related problem, which we prove. We show that DOCKS works well in practice and produces UHSs that are very close to a theoretical lower bound. We present results for various values of k and L and, by applying them to real genomes, show that UHSs indeed improve over minimizers. In particular, DOCKS uses less than 30% of the 10-mers needed to span the human genome compared to minimizers. The software and computed UHSs are freely available at github.com/Shamir-Lab/DOCKS/ and acgt.cs.tau.ac.il/docks/, respectively.
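The UHS definition can be checked directly by enumeration for toy parameters; the brute-force verifier below is ours (DOCKS constructs compact UHSs without any such enumeration, which is exponential in L):

```python
from itertools import product

def is_universal_hitting_set(kmers, k, L, alphabet="ACGT"):
    """Brute-force check of the UHS definition: every length-L string over the
    alphabet must contain at least one k-mer from the set. Only usable for
    toy sizes, since the number of strings grows as |alphabet|**L."""
    for letters in product(alphabet, repeat=L):
        s = "".join(letters)
        if not any(s[i:i + k] in kmers for i in range(L - k + 1)):
            return False
    return True

# k=1, L=2: the full alphabet trivially hits everything,
# while dropping "T" misses the sequence "TT".
print(is_universal_hitting_set({"A", "C", "G", "T"}, 1, 2))  # True
print(is_universal_hitting_set({"A", "C", "G"}, 1, 2))       # False
```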
Project description: Next-generation sequencing of cellular RNA (RNA-seq) is rapidly becoming the cornerstone of transcriptomic analysis. However, sequencing errors in the already short RNA-seq reads complicate bioinformatics analyses, in particular alignment and assembly. Error correction methods have been highly effective for whole-genome sequencing (WGS) reads, but are unsuitable for RNA-seq reads, owing to the variation in gene expression levels and alternative splicing. We developed a k-mer-based method, Rcorrector, to correct random sequencing errors in Illumina RNA-seq reads. Rcorrector uses a De Bruijn graph to compactly represent all trusted k-mers in the input reads. Unlike WGS read correctors, which use a global threshold to determine trusted k-mers, Rcorrector computes a local threshold at every position in a read. Rcorrector has an accuracy higher than or comparable to that of existing methods, including the only other method (SEECER) designed for RNA-seq reads, and is more time- and memory-efficient. With a 5 GB memory footprint for 100 million reads, it can be run on virtually any desktop or server. The software is available free of charge under the GNU General Public License from https://github.com/mourisl/Rcorrector/.
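To illustrate why a local threshold matters, here is a sketch in which the cutoff scales with the largest k-mer count observed in the read itself, so lowly expressed transcripts are not wiped out by a global cutoff; the `ratio` parameter and the exact threshold shape are our illustrative assumptions, not Rcorrector's formula.

```python
def untrusted_positions(read, k, count, ratio=0.05):
    """Flag k-mer start positions whose count falls below a per-read threshold.
    `count` maps a k-mer to its multiplicity across the dataset. The threshold
    is a fixed fraction of the read's maximum k-mer count (an assumption for
    illustration), rather than one global value for all reads."""
    counts = [count.get(read[i:i + k], 0) for i in range(len(read) - k + 1)]
    threshold = max(1, int(max(counts) * ratio))
    return [i for i, c in enumerate(counts) if c < threshold]

counts = {"ACGTA": 40, "CGTAC": 38, "GTACT": 1, "TACTG": 1}  # toy k-mer counts
print(untrusted_positions("ACGTACTG", k=5, count=counts))    # [2, 3]
```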
Project description: MOTIVATION: Nanopore sequencing is one of the leading third-generation sequencing technologies. A number of computational tools have been developed to facilitate the processing and analysis of Nanopore data. Previously, we developed DeepSimulator1.0 (DS1.0), the first simulator for Nanopore sequencing to produce both the raw electrical signals and the reads. However, although DS1.0 can produce high-quality reads, for some sequences the divergence between the simulated raw signals and the real signals can be large. Furthermore, the Nanopore sequencing technology has evolved greatly since DS1.0 was released. It is thus necessary to update DS1.0 to accommodate those changes. RESULTS: We propose DeepSimulator1.5 (DS1.5), all three modules of which have been updated substantially from DS1.0. In the sequence generator, we updated the sampled read length distribution to reflect the characteristics of the newest real reads. In the signal generator, which is the core of DeepSimulator, we added one more pore model, a context-independent pore model that is much faster than the previous context-dependent one. Furthermore, to make the generated signals more similar to the real ones, we added a low-pass filter to post-process the pore model signals. Regarding the basecaller, we added support for the newest official basecaller, Guppy, which can run on both GPU and CPU. In addition, multiple optimizations related to multiprocessing control and memory and storage management have been implemented to make DS1.5 a much lighter and more usable simulator than DS1.0. AVAILABILITY AND IMPLEMENTATION: The main program and the data are available at https://github.com/lykaust15/DeepSimulator. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
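The effect of low-pass post-processing can be sketched with a simple moving-average filter; this is a stand-in of ours, and DS1.5's actual filter design may differ.

```python
import numpy as np

def lowpass(signal, width=5):
    """Moving-average low-pass filter (illustrative). Smoothing the idealized
    pore-model levels suppresses the instantaneous level jumps that real
    Nanopore hardware, with its limited bandwidth, cannot produce, bringing
    simulated signals closer to measured ones."""
    kernel = np.ones(width) / width
    return np.convolve(signal, kernel, mode="same")

# Idealized, context-independent pore-model levels with additive noise (toy values).
levels = np.repeat([80.0, 95.0, 70.0], 20)
smoothed = lowpass(levels + np.random.normal(0.0, 1.0, levels.size))
```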