A multi-pattern hash-binary hybrid algorithm for URL matching in the HTTP protocol.
ABSTRACT: In this paper, based on our previous multi-pattern uniform resource locator (URL) binary-matching algorithm called HEM, we propose an improved multi-pattern matching algorithm called MH that is based on hash tables and binary tables. The MH algorithm can be applied to the fields of network security, data analysis, load balancing, cloud robotic communications, and so on-all of which require string matching from a fixed starting position. Our approach effectively solves the performance problems of the classical multi-pattern matching algorithms. This paper explores ways to improve string matching performance under the HTTP protocol by using a hash method combined with a binary method that transforms the symbol-space matching problem into a digital-space numerical-size comparison and hashing problem. The MH approach has a fast matching speed, requires little memory, performs better than both the classical algorithms and HEM for matching fields in an HTTP stream, and it has great promise for use in real-world applications.
Project description:BACKGROUND:Chaos Game Representation (CGR) is an iterated function that bijectively maps discrete sequences into a continuous domain. As a result, discrete sequences can be object of statistical and topological analyses otherwise reserved to numerical systems. Characteristically, CGR coordinates of substrings sharing an L-long suffix will be located within 2-L distance of each other. In the two decades since its original proposal, CGR has been generalized beyond its original focus on genomic sequences and has been successfully applied to a wide range of problems in bioinformatics. This report explores the possibility that it can be further extended to approach algorithms that rely on discrete, graph-based representations. RESULTS:The exploratory analysis described here consisted of selecting foundational string problems and refactoring them using CGR-based algorithms. We found that CGR can take the role of suffix trees and emulate sophisticated string algorithms, efficiently solving exact and approximate string matching problems such as finding all palindromes and tandem repeats, and matching with mismatches. The common feature of these problems is that they use longest common extension (LCE) queries as subtasks of their procedures, which we show to have a constant time solution with CGR. Additionally, we show that CGR can be used as a rolling hash function within the Rabin-Karp algorithm. CONCLUSIONS:The analysis of biological sequences relies on algorithmic foundations facing mounting challenges, both logistic (performance) and analytical (lack of unifying mathematical framework). CGR is found to provide the latter and to promise the former: graph-based data structures for sequence analysis operations are entailed by numerical-based data structures produced by CGR maps, providing a unifying analytical framework for a diversity of pattern matching problems.
Project description:Motivated by the problem of classifying individuals with a disease versus controls using a functional genomic attribute as input, we present relatively efficient general purpose inner product-based kernel classifiers to classify the test as a normal or disease sample. We encode each training sample as a string of 1 s (presence) and 0 s (absence) representing the attribute's existence across ordered physical blocks of the subdivided genome. Having binary-valued features allows for highly efficient data encoding in the computational basis for classifiers relying on binary operations. Given that a natural distance between binary strings is Hamming distance, which shares properties with bit-string inner products, our two classifiers apply different inner product measures for classification. The active inner product (AIP) is a direct dot product-based classifier whereas the symmetric inner product (SIP) classifies upon scoring correspondingly matching genomic attributes. SIP is a strongly Hamming distance-based classifier generally applicable to binary attribute-matching problems whereas AIP has general applications as a simple dot product-based classifier. The classifiers implement an inner product between N =?2 n dimension test and train vectors using n Fredkin gates while the training sets are respectively entangled with the class-label qubit, without use of an ancilla. Moreover, each training class can be composed of an arbitrary number m of samples that can be classically summed into one input string to effectively execute all test-train inner products simultaneously. Thus, our circuits require the same number of qubits for any number of training samples and are O(logN) in gate complexity after the states are prepared. Our classifiers were implemented on ibmqx2 (IBM-Q-team 2019b) and ibmq_16_melbourne (IBM-Q-team 2019a). The latter allowed encoding of 64 training features across the genome.
Project description:BACKGROUND:A number of alignment tools have been developed to align sequencing reads to the human reference genome. The scale of information from next-generation sequencing (NGS) experiments, however, is increasing rapidly. Recent studies based on NGS technology have routinely produced exome or whole-genome sequences from several hundreds or thousands of samples. To accommodate the increasing need of analyzing very large NGS data sets, it is necessary to develop faster, more sensitive and accurate mapping tools. RESULTS:HIA uses two indices, a hash table index and a suffix array index. The hash table performs direct lookup of a q-gram, and the suffix array performs very fast lookup of variable-length strings by exploiting binary search. We observed that combining hash table and suffix array (hybrid index) is much faster than the suffix array method for finding a substring in the reference sequence. Here, we defined the matching region (MR) is a longest common substring between a reference and a read. And, we also defined the candidate alignment regions (CARs) as a list of MRs that is close to each other. The hybrid index is used to find candidate alignment regions (CARs) between a reference and a read. We found that aligning only the unmatched regions in the CAR is much faster than aligning the whole CAR. In benchmark analysis, HIA outperformed in mapping speed compared with the other aligners, without significant loss of mapping accuracy. CONCLUSIONS:Our experiments show that the hybrid of hash table and suffix array is useful in terms of speed for mapping NGS sequencing reads to the human reference genome sequence. In conclusion, our tool is appropriate for aligning massive data sets generated by NGS sequencing.
Project description:BACKGROUND: Programs based on hash tables and Burrows-Wheeler are very fast for mapping short reads to genomes but have low accuracy in the presence of mismatches and gaps. Such reads can be aligned accurately with the Smith-Waterman algorithm but it can take hours and days to map millions of reads even for bacteria genomes. RESULTS: We introduce a GPU program called MaxSSmap with the aim of achieving comparable accuracy to Smith-Waterman but with faster runtimes. Similar to most programs MaxSSmap identifies a local region of the genome followed by exact alignment. Instead of using hash tables or Burrows-Wheeler in the first part, MaxSSmap calculates maximum scoring subsequence score between the read and disjoint fragments of the genome in parallel on a GPU and selects the highest scoring fragment for exact alignment. We evaluate MaxSSmap's accuracy and runtime when mapping simulated Illumina E.coli and human chromosome one reads of different lengths and 10% to 30% mismatches with gaps to the E.coli genome and human chromosome one. We also demonstrate applications on real data by mapping ancient horse DNA reads to modern genomes and unmapped paired reads from NA12878 in 1000 genomes. CONCLUSIONS: We show that MaxSSmap attains comparable high accuracy and low error to fast Smith-Waterman programs yet has much lower runtimes. We show that MaxSSmap can map reads rejected by BWA and NextGenMap with high accuracy and low error much faster than if Smith-Waterman were used. On short read lengths of 36 and 51 both MaxSSmap and Smith-Waterman have lower accuracy compared to at higher lengths. On real data MaxSSmap produces many alignments with high score and mapping quality that are not given by NextGenMap and BWA. The MaxSSmap source code in CUDA and OpenCL is freely available from http://www.cs.njit.edu/usman/MaxSSmap.
Project description:Counting the number of occurrences of every k-mer (substring of length k) in a long string is a central subproblem in many applications, including genome assembly, error correction of sequencing reads, fast multiple sequence alignment and repeat detection. Recently, the deep sequence coverage generated by next-generation sequencing technologies has caused the amount of sequence to be processed during a genome project to grow rapidly, and has rendered current k-mer counting tools too slow and memory intensive. At the same time, large multicore computers have become commonplace in research facilities allowing for a new parallel computational paradigm.We propose a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient. It is based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. Due to their flexibility, suffix arrays have been the data structure of choice for solving many string problems. For the task of k-mer counting, important in many biological applications, Jellyfish offers a much faster and more memory-efficient solution.The Jellyfish software is written in C++ and is GPL licensed. It is available for download at http://www.cbcb.umd.edu/software/jellyfish.
Project description:There is an urgent need for the development of global analytic frameworks that can perform analyses in a privacy-preserving federated environment across multiple institutions without privacy leakage. A few studies on the topic of federated medical analysis have been conducted recently with the focus on several algorithms. However, none of them have solved similar patient matching, which is useful for applications such as cohort construction for cross-institution observational studies, disease surveillance, and clinical trials recruitment.The aim of this study was to present a privacy-preserving platform in a federated setting for patient similarity learning across institutions. Without sharing patient-level information, our model can find similar patients from one hospital to another.We proposed a federated patient hashing framework and developed a novel algorithm to learn context-specific hash codes to represent patients across institutions. The similarities between patients can be efficiently computed using the resulting hash codes of corresponding patients. To avoid security attack from reverse engineering on the model, we applied homomorphic encryption to patient similarity search in a federated setting.We used sequential medical events extracted from the Multiparameter Intelligent Monitoring in Intensive Care-III database to evaluate the proposed algorithm in predicting the incidence of five diseases independently. Our algorithm achieved averaged area under the curves of 0.9154 and 0.8012 with balanced and imbalanced data, respectively, in κ-nearest neighbor with κ=3. We also confirmed privacy preservation in similarity search by using homomorphic encryption.The proposed algorithm can help search similar patients across institutions effectively to support federated data analysis in a privacy-preserving manner.
Project description:Pathogen whole-genome sequencing has huge potential as a tool to better understand infection transmission. However, rapidly identifying closely related genomes among a background of thousands of other genomes is challenging. Here, we describe a refinement to core genome multilocus sequence typing (cgMLST) in which alleles at each gene are reproducibly converted to a unique hash, or short string of letters (hash-cgMLST). This avoids the resource-intensive need for a single centralized database of sequentially numbered alleles. We test the reproducibility and discriminatory power of cgMLST/hash-cgMLST compared to those of mapping-based approaches in Clostridium difficile, using repeated sequencing of the same isolates (replicates) and data from consecutive infection isolates from six English hospitals. Hash-cgMLST provided the same results as standard cgMLST, with minimal performance penalty. Comparing 272 replicate sequence pairs using reference-based mapping, there were 0, 1, or 2 single-nucleotide polymorphisms (SNPs) between 262 (96%), 5 (2%), and 1 (<1%) of the pairs, respectively. Using hash-cgMLST, 218 (80%) of replicate pairs assembled with SPAdes had zero gene differences, and 31 (11%), 5 (2%), and 18 (7%) pairs had 1, 2, and >2 differences, respectively. False gene differences were clustered in specific genes and associated with fragmented assemblies, but were reduced using the SKESA assembler. Considering 412 pairs of infections with ?2 SNPS, i.e., consistent with recent transmission, 376 (91%) had ?2 gene differences and 16 (4%) had ?4. Comparing a genome to 100,000 others took <1 min using hash-cgMLST. Hash-cgMLST is an effective surveillance tool for rapidly identifying clusters of related genomes. However, cgMLST/hash-cgMLST generate more false variants than mapping-based approaches. Follow-up mapping-based analyses are likely required to precisely define close genetic relationships.
Project description:Exact pattern matching algorithms are popular and used widely in several applications, such as molecular biology, text processing, image processing, web search engines, network intrusion detection systems and operating systems. The focus of these algorithms is to achieve time efficiency according to applications but not memory consumption. In this work, we propose a novel idea to achieve both time efficiency and memory consumption by splitting query string for searching in Corpus. For a given text, the proposed algorithm split the query pattern into two equal halves and considers the second (right) half as a query string for searching in Corpus. Once the match is found with second halves, the proposed algorithm applies brute force procedure to find remaining match by referring the location of right half. Experimental results on different S1 Dataset, namely Arabic, English, Chinese, Italian and French text databases show that the proposed algorithm outperforms the existing S1 Algorithm in terms of time efficiency and memory consumption as the length of the query pattern increases.
Project description:Massively parallel sequencing readouts of epigenomic assays are enabling integrative genome-wide analyses of genomic and epigenomic variation. Pash 3.0 performs sequence comparison and read mapping and can be employed as a module within diverse configurable analysis pipelines, including ChIP-Seq and methylome mapping by whole-genome bisulfite sequencing.Pash 3.0 generally matches the accuracy and speed of niche programs for fast mapping of short reads, and exceeds their performance on longer reads generated by a new generation of massively parallel sequencing technologies. By exploiting longer read lengths, Pash 3.0 maps reads onto the large fraction of genomic DNA that contains repetitive elements and polymorphic sites, including indel polymorphisms.We demonstrate the versatility of Pash 3.0 by analyzing the interaction between CpG methylation, CpG SNPs, and imprinting based on publicly available whole-genome shotgun bisulfite sequencing data. Pash 3.0 makes use of gapped k-mer alignment, a non-seed based comparison method, which is implemented using multi-positional hash tables. This allows Pash 3.0 to run on diverse hardware platforms, including individual computers with standard RAM capacity, multi-core hardware architectures and large clusters.