Parallel Generalized Suffix Tree Construction for Genomic Data
ABSTRACT: After a decade of digitization and technological advancements, we have an abundance of usable genomic data, which provide unique insights into our well-being. However, such datasets are large in volume, and retrieving meaningful information from them is often challenging. Hence, different indexing techniques and data structures have been proposed to handle such a massive scale of data. We utilize one such technique: Generalized Suffix Tree (GST). In this paper, we introduce an efficient parallel generalized suffix tree construction algorithm that is scalable for arbitrary genomic datasets. Our construction mechanism employs shared and distributed memory architecture collectively while not posing any fixed, prior memory requirement as it uses external memory (disks). Our experimental results show that our proposed architecture offers around 4-times speedup with respect to the sequential algorithm with only 16 parallel processors. The experiments on different datasets and parameters also exhibit the scalability of the execution time. In addition, we utilize different string queries and demonstrate their execution time on such tree structure, illustrating the efficacy and usability of GST for genomic data.
Project description:The suffix array and its variants are text-indexing data structures that have become indispensable in the field of bioinformatics. With the uninitiated in mind, we provide an accessible exposition of the SA-IS algorithm, which is the state of the art in suffix array construction. We also describe DisLex, a technique that allows standard suffix array construction algorithms to create modified suffix arrays designed to enable a simple form of inexact matching needed to support 'spaced seeds' and 'subset seeds' used in many biological applications.
Project description:Large DNA sequence data sets require special bioinformatics tools to search and compare them. Such tools should be easy to use so that the data can be easily accessed by a wide array of researchers. In the past, the use of suffix trees for searching DNA sequences has been limited by a practical need to keep the trees in RAM. Newer algorithms solve this problem by using disk-based approaches. However, none of the fastest suffix tree algorithms have been implemented with a graphical user interface, preventing their incorporation into a feasible laboratory workflow.Suffix Tree Searcher (STS) is designed as an easy-to-use tool to index, search, and analyze very large DNA sequence datasets. The program accommodates very large numbers of very large sequences, with aggregate size reaching tens of billions of nucleotides. The program makes use of pre-sorted persistent "building blocks" to reduce the time required to construct new trees. STS is comprised of a graphical user interface written in Java, and four C modules. All components are automatically downloaded when a web link is clicked. The underlying suffix tree data structure permits extremely fast searching for specific nucleotide strings, with wild cards or mismatches allowed. Complete tree traversals for detecting common substrings are also very fast. The graphical user interface allows the user to transition seamlessly between building, traversing, and searching the dataset.Thus, STS provides a new resource for the detection of substrings common to multiple DNA sequences or within a single sequence, for truly huge data sets. The re-searching of sequence hits, allowing wild card positions or mismatched nucleotides, together with the ability to rapidly retrieve large numbers of sequence hits from the DNA sequence files, provides the user with an efficient method of evaluating the similarity between nucleotide sequences by multiple alignment or use of Logos. The ability to re-use existing suffix tree pieces considerably shortens index generation time. The graphical user interface enables quick mastery of the analysis functions, easy access to the generated data, and seamless workflow integration.
Project description:BACKGROUND:Alternative splicing allows the pre-mRNAs of a gene to be spliced into various mRNAs, which greatly increases the diversity of proteins. High-throughput sequencing of mRNAs has revolutionized our ability for transcripts reconstruction. However, the massive size of short reads makes de novo transcripts assembly an algorithmic challenge. RESULTS:We develop a novel radical framework, called DTA-SiST, for de novo transcriptome assembly based on suffix trees. DTA-SiST first extends contigs by reads that have the longest overlaps with the contigs' terminuses. These reads can be found in linear time of the lengths of the reads through a well-designed suffix tree structure. Then, DTA-SiST constructs splicing graphs based on contigs for each gene locus. Finally, DTA-SiST proposes two strategies to extract transcript-representing paths: a depth-first enumeration strategy and a hybrid strategy based on length and coverage. We implemented the above two strategies and compared them with the state-of-the-art de novo assemblers on both simulated and real datasets. Experimental results showed that the depth-first enumeration strategy performs always better with recall and also better with precision for smaller datasets while the hybrid strategy leads with precision for big datasets. CONCLUSIONS:DTA-SiST performs more competitive than the other compared de novo assemblers especially with precision measure, due to the read-based contig extension strategy and the elegant transcripts extraction rules.
Project description:Background:The construction of a suffix array for a collection of strings is a fundamental task in Bioinformatics and in many other applications that process strings. Related data structures, as the Longest Common Prefix array, the Burrows-Wheeler transform, and the document array, are often needed to accompany the suffix array to efficiently solve a wide variety of problems. While several algorithms have been proposed to construct the suffix array for a single string, less emphasis has been put on algorithms to construct suffix arrays for string collections. Result:In this paper we introduce gsufsort, an open source software for constructing the suffix array and related data indexing structures for a string collection with N symbols in O(N) time. Our tool is written in ANSI/C and is based on the algorithm gSACA-K (Louza et al. in Theor Comput Sci 678:22-39, 2017), the fastest algorithm to construct suffix arrays for string collections. The tool supports large fasta, fastq and text files with multiple strings as input. Experiments have shown very good performance on different types of strings. Conclusions:gsufsort is a fast, portable, and lightweight tool for constructing the suffix array and additional data structures for string collections.
Project description:The evolution of the next generation sequencing technology increases the demand for efficient solutions, in terms of space and time, for several bioinformatics problems. This paper presents a practical and easy-to-implement solution for one of these problems, namely, the all-pairs suffix-prefix problem, using a compact prefix tree. The paper demonstrates an efficient construction of this time-efficient and space-economical tree data structure. The paper presents techniques for parallel implementations of the proposed solution. Experimental evaluation indicates superior results in terms of space and time over existing solutions. Results also show that the proposed technique is highly scalable in a parallel execution environment.
Project description:The vast majority of methods available for sequence comparison rely on a first sequence alignment step, which requires a number of assumptions on evolutionary history and is sometimes very difficult or impossible to perform due to the abundance of gaps (insertions/deletions). In such cases, an alternative alignment-free method would prove valuable. Our method starts by a computation of a generalized suffix tree of all sequences, which is completed in linear time. Using this tree, the frequency of all possible words with a preset length L-L-words--in each sequence is rapidly calculated. Based on the L-words frequency profile of each sequence, a pairwise standard Euclidean distance is then computed producing a symmetric genetic distance matrix, which can be used to generate a neighbor joining dendrogram or a multidimensional scaling graph. We present an improvement to word counting alignment-free approaches for sequence comparison, by determining a single optimal word length and combining suffix tree structures to the word counting tasks. Our approach is, thus, a fast and simple application that proved to be efficient and powerful when applied to mitochondrial genomes. The algorithm was implemented in Python language and is freely available on the web.
Project description:Multiple sequence alignment (MSA) is an integral part of molecular biology. But handling massive number of large sequences is still a bottleneck for most of the state-of-the-art software tools. Knowledge driven algorithms utilizing features of input sequences, such as high similarity in case of DNA sequences, can help in improving the efficiency of DNA MSA to assist in phylogenetic tree construction, comparative genomics etc. This article showcases the benefit of utilizing similarity features while performing the alignment. The algorithm uses suffix tree for identifying common substrings and uses a modified Needleman-Wunsch algorithm for pairwise alignments. In order to improve the efficiency of pairwise alignments, a knowledge base is created and a supervised learning with nearest neighbor algorithm is used to guide the alignment. The algorithm provided linear complexity O(m) compared to O(m2). Comparing with state-of-the-art algorithms (e.g., HAlign II), SPARK-MSNA provided 50% improvement in memory utilization in processing human mitochondrial genome (mt. genomes, 100x, 1.1. GB) with a better alignment accuracy in terms of average SP score and comparable execution time. The algorithm is implemented on big data framework Apache Spark in order to improve the scalability. The source code & test data are available at: https://sourceforge.net/projects/spark-msna/ .
Project description:BACKGROUND:Next-generation sequencing (NGS) technologies have produced large volumes of genomic data. One common operation on heterogeneous genomic data is genomic interval intersection. Most of the existing tools impose restrictions such as not allowing nested intervals or requiring intervals to be sorted when finding overlaps in two or more interval sets. RESULTS:We proposed segment tree (ST) and indexed segment tree forest (ISTF) based solutions for intersection of multiple genomic interval sets in parallel. We developed these methods as a tool, Joint Overlap Analysis (JOA), which takes n interval sets and finds overlapping intervals with no constraints on the given intervals. The proposed indexed segment tree forest is a novel composite data structure, which leverages on indexing and natural binning of a segment tree. We also presented construction and search algorithms for this novel data structure. We compared JOA ST and JOA ISTF with each other, and with other interval intersection tools for verification of its correctness and for showing that it attains comparable execution times. CONCLUSIONS:We implemented JOA in Java using the fork/join framework which speeds up parallel processing by taking advantage of all available processor cores. We compared JOA ST with JOA ISTF and showed that segment tree and indexed segment tree forest methods are comparable with each other in terms of execution time and memory usage. We also carried out execution time comparison analysis for JOA and other tools and demonstrated that JOA has comparable execution time and is able to further reduce its running time by using more processors per node. JOA can be run using its GUI or as a command line tool. JOA is available with source code at https://github.com/burcakotlu/JOA/ . A user manual is provided at https://joa.readthedocs.org.
Project description:BACKGROUND:Current popular variant calling pipelines rely on the mapping coordinates of each input read to a reference genome in order to detect variants. Since reads deriving from variant loci that diverge in sequence substantially from the reference are often assigned incorrect mapping coordinates, variant calling pipelines that rely on mapping coordinates can exhibit reduced sensitivity. RESULTS:In this work we present GeDi, a suffix array-based somatic single nucleotide variant (SNV) calling algorithm that does not rely on read mapping coordinates to detect SNVs and is therefore capable of reference-free and mapping-free SNV detection. GeDi executes with practical runtime and memory resource requirements, is capable of SNV detection at very low allele frequency (<1%), and detects SNVs with high sensitivity at complex variant loci, dramatically outperforming MuTect, a well-established pipeline. CONCLUSION:By designing novel suffix-array based SNV calling methods, we have developed a practical SNV calling software, GeDi, that can characterise SNVs at complex variant loci and at low allele frequency thus increasing the repertoire of detectable SNVs in tumour genomes. We expect GeDi to find use cases in targeted-deep sequencing analysis, and to serve as a replacement and improvement over previous suffix-array based SNV calling methods.
Project description:MOTIVATION: Predicting the part of speech (POS) tag of an unknown word in a sentence is a significant challenge. This is particularly difficult in biomedicine, where POS tags serve as an input to training sophisticated literature summarization techniques, such as those based on Hidden Markov Models (HMM). Different approaches have been taken to deal with the POS tagger challenge, but with one exception--the TnT POS tagger--previous publications on POS tagging have omitted details of the suffix analysis used for handling unknown words. The suffix of an English word is a strong predictor of a POS tag for that word. As a pre-requisite for an accurate HMM POS tagger for biomedical publications, we present an efficient suffix prediction method for integration into a POS tagger. RESULTS: We have implemented a fully functional HMM POS tagger using experimentally optimised suffix based prediction. Our simple suffix analysis method, significantly outperformed the probability interpolation based TnT method. We have also shown how important suffix analysis can be for probability estimation of a known word (in the training corpus) with an unseen POS tag; a common scenario with a small training corpus. We then integrated this simple method in our POS tagger and determined an optimised parameter set for both methods, which can help developers to optimise their current algorithm, based on our results. We also introduce the concept of counting methods in maximum likelihood estimation for the first time and show how counting methods can affect the prediction result. Finally, we describe how machine-learning techniques were applied to identify words, for which prediction of POS tags were always incorrect and propose a method to handle words of this type. AVAILABILITY AND IMPLEMENTATION: Java source code, binaries and setup instructions are freely available at http://genomes.sapac.edu.au/text_mining/pos_tagger.zip.