Project description:Exact pattern matching algorithms are popular and used widely in several applications, such as molecular biology, text processing, image processing, web search engines, network intrusion detection systems and operating systems. The focus of these algorithms is to achieve time efficiency according to applications but not memory consumption. In this work, we propose a novel idea to achieve both time efficiency and memory consumption by splitting query string for searching in Corpus. For a given text, the proposed algorithm split the query pattern into two equal halves and considers the second (right) half as a query string for searching in Corpus. Once the match is found with second halves, the proposed algorithm applies brute force procedure to find remaining match by referring the location of right half. Experimental results on different S1 Dataset, namely Arabic, English, Chinese, Italian and French text databases show that the proposed algorithm outperforms the existing S1 Algorithm in terms of time efficiency and memory consumption as the length of the query pattern increases.
Project description:Currently, one of the most common causes of death worldwide is cancer. The development of innovative methods to support the early and accurate detection of cancers is required to increase the recovery rate of patients. Several studies have shown that medical Hyperspectral Imaging (HSI) combined with artificial intelligence algorithms is a powerful tool for cancer detection. Various preprocessing methods are commonly applied to hyperspectral data to improve the performance of the algorithms. However, there is currently no standard for these methods, and no studies have compared them so far in the medical field. In this work, we evaluated different combinations of preprocessing steps, including spatial and spectral smoothing, Min-Max scaling, Standard Normal Variate normalization, and a median spatial smoothing technique, with the goal of improving tumor detection in three different HSI databases concerning colorectal, esophagogastric, and brain cancers. Two machine learning and deep learning models were used to perform the pixel-wise classification. The results showed that the choice of preprocessing method affects the performance of tumor identification. The method that showed slightly better results with respect to identifing colorectal tumors was Median Filter preprocessing (0.94 of area under the curve). On the other hand, esophagogastric and brain tumors were more accurately identified using Min-Max scaling preprocessing (0.93 and 0.92 of area under the curve, respectively). However, it is observed that the Median Filter method smooths sharp spectral features, resulting in high variability in the classification performance. Therefore, based on these results, obtained with different databases acquired by different HSI instrumentation, the most relevant preprocessing technique identified in this work is Min-Max scaling.
Project description:BackgroundMany RNAs have evolutionarily conserved secondary structures instead of primary sequences. Recently, there are an increasing number of methods being developed with focus on the structural alignments for finding conserved secondary structures as well as common structural motifs in pair-wise or multiple sequences. A challenging task is to search similar structures quickly for structured RNA sequences in large genomic databases since existing methods are too slow to be used in large databases.ResultsAn implementation of a fast structural alignment algorithm, RScan, is proposed to fulfill the task. RScan is developed by levering the advantages of both hashing algorithms and local alignment algorithms. In our experiment, on the average, the times for searching a tRNA and an rRNA in the randomized A. pernix genome are only 256 seconds and 832 seconds respectively by using RScan, but need 3,178 seconds and 8,951 seconds respectively by using an existing method RSEARCH. Remarkably, RScan can handle large database queries, taking less than 4 minutes for searching similar structures for a microRNA precursor in human chromosome 21.ConclusionThese results indicate that RScan is a preferable choice for real-life application of searching structural similarities for structured RNAs in large databases. RScan software is freely available at http://bioinfo.au.tsinghua.edu.cn/member/cxue/rscan/RScan.htm.
Project description:A general searching method for comparing multiple sequence alignments was developed to detect sequence relationships between conserved protein regions. Multiple alignments are treated as sequences of amino acid distributions and aligned by comparing pairs of such distributions. Four different comparison measures were tested and the Pearson correlation coefficient chosen. The method is sensitive, detecting weak sequence relationships between protein families. Relationships are detected beyond the range of conventional sequence database searches, illustrating the potential usefulness of the method. The previously undetected relation between flavoprotein subunits of two oxidoreductase families points to the potential active site in one of the families. The similarity between the bacterial RecA, DnaA and Rad51 protein families reveals a region in DnaA and Rad51 proteins likely to bind and unstack single-stranded DNA. Helix--turn--helix DNA binding domains from diverse proteins are readily detected and shown to be similar to each other. Glycosylasparaginase and gamma-glutamyltransferase enzymes are found to be similar in their proteolytic cleavage sites. The method has been fully implemented on the World Wide Web at URL: http://blocks.fhcrc.org/blocks-bin/LAMAvsearch.
Project description:Metabolites provide a direct functional signature of cellular state. Untargeted metabolomics experiments usually rely on tandem MS to identify the thousands of compounds in a biological sample. Today, the vast majority of metabolites remain unknown. We present a method for searching molecular structure databases using tandem MS data of small molecules. Our method computes a fragmentation tree that best explains the fragmentation spectrum of an unknown molecule. We use the fragmentation tree to predict the molecular structure fingerprint of the unknown compound using machine learning. This fingerprint is then used to search a molecular structure database such as PubChem. Our method is shown to improve on the competing methods for computational metabolite identification by a considerable margin.
Project description:The presentation of non-target stimuli during trial-based instruction is known as instructive feedback. Previous research on instructive feedback has shown that learners with developmental disabilities may acquire these additional (i.e., secondary) targets without further training. Embedding secondary targets during the review of previously mastered targets may be used to bolster instructional gains. The current study evaluated the efficacy of embedding secondary targets during maintenance sessions for a child with autism spectrum disorder. The participant's responding met the mastery criterion for two target sets presented during the consequence portion of maintenance trials. For the remaining five target sets, a more intrusive intervention was required. Additional research is needed to evaluate the conditions under which secondary targets may be acquired.Supplementary informationThe online version contains supplementary material available at 10.1007/s40616-022-00174-9.
Project description:Over the last five years, virtual screening of ultralarge synthesis on-demand libraries has emerged as a powerful tool for hit identification in drug discovery programs. As these libraries have grown to tens of billions of molecules, we have reached a point where it is no longer cost-effective to screen every molecule virtually. To address these challenges, several groups have developed heuristic search methods to rapidly identify the best molecules on a virtual screen. This article describes the application of Thompson sampling (TS), an active learning approach that streamlines the virtual screening of large combinatorial libraries by performing a probabilistic search in the reagent space, thereby never requiring the full enumeration of the library. TS is a general technique that can be applied to various virtual screening modalities, including 2D and 3D similarity search, docking, and application of machine-learning models. In an illustrative example, we show that TS can identify more than half of the top 100 molecules from a docking-based virtual screen of 335 million molecules by evaluating 1% of the data set.
Project description:Single-cell RNA-seq (scRNA-seq) is being used widely to resolve cellular heterogeneity. With the rapid accumulation of public scRNA-seq data, an effective and efficient cell-querying method is critical for the utilization of the existing annotations to curate newly sequenced cells. Such a querying method should be based on an accurate cell-to-cell similarity measure, and capable of handling batch effects properly. Herein, we present Cell BLAST, an accurate and robust cell-querying method built on a neural network-based generative model and a customized cell-to-cell similarity metric. Through extensive benchmarks and case studies, we demonstrate the effectiveness of Cell BLAST in annotating discrete cell types and continuous cell differentiation potential, as well as identifying novel cell types. Powered by a well-curated reference database and a user-friendly Web server, Cell BLAST provides the one-stop solution for real-world scRNA-seq cell querying and annotation.
Project description:BackgroundHigh-throughput molecular biology provides new data at an incredible rate, so that the increase in the size of biological databanks is enormous and very rapid. This scenario generates severe problems not only at indexing time, where suitable algorithmic techniques for data indexing and retrieval are required, but also at query time, since a user query may produce such a large set of results that their browsing and "understanding" becomes humanly impractical. This problem is well known to the Web community, where a new generation of Web search engines is being developed, like Vivisimo. These tools organize on-the-fly the results of a user query in a hierarchy of labeled folders that ease their browsing and knowledge extraction. We investigate this approach on biological data, and propose the so called The BioPrompt-boxsoftware system which deploys ontology-driven clustering strategies for making the searching process of biologists more efficient and effective.ResultsThe BioPrompt-box (Bpb) defines a document as a biological sequence plus its associated meta-data taken from the underneath databank--like references to ontologies or to external databanks, and plain texts as comments of researchers and (title, abstracts or even body of) papers. Bpboffers several tools to customize the search and the clustering process over its indexed documents. The user can search a set of keywords within a specific field of the document schema, or can execute Blastto find documents relative to homologue sequences. In both cases the search task returns a set of documents (hits) which constitute the answer to the user query. Since the number of hits may be large, Bpbclusters them into groups of homogenous content, organized as a hierarchy of labeled clusters. The user can actually choose among several ontology-based hierarchical clustering strategies, each offering a different "view" of the returned hits. Bpbcomputes these views by exploiting the meta-data present within the retrieved documents such as the references to Gene Ontology, the taxonomy lineage, the organism and the keywords. Of course, the approach is flexible enough to leave room for future additions of other meta-information. The ultimate goal of the clustering process is to provide the user with several different readings of the (maybe numerous) query results and show possible hidden correlations among them, thus improving their browsing and understanding.ConclusionBpb is a powerful search engine that makes it very easy to perform complex queries over the indexed databanks (currently only UNIPROT is considered). The ontology-based clustering approach is efficient and effective, and could thus be applied successfully to larger databanks, like GenBank or EMBL.
Project description:This project aims to assess the utility of bibliographic databases beyond the three major ones (MEDLINE, EMBASE and Cochrane CENTRAL) for finding controlled trials of complementary and alternative medicine (CAM). Fifteen databases were searched to identify controlled clinical trials (CCTs) of CAM not also indexed in MEDLINE. Searches were conducted in May 2006 using the revised Cochrane highly sensitive search strategy (HSSS) and the PubMed CAM Subset. Yield of CAM trials per 100 records was determined, and databases were compared over a standardized period (2005). The Acudoc2 RCT, Acubriefs, Index to Chiropractic Literature (ICL) and Hom-Inform databases had the highest concentrations of non-MEDLINE records, with more than 100 non-MEDLINE records per 500. Other productive databases had ratios between 500 and 1500 records to 100 non-MEDLINE records-these were AMED, MANTIS, PsycINFO, CINAHL, Global Health and Alt HealthWatch. Five databases were found to be unproductive: AGRICOLA, CAIRSS, Datadiwan, Herb Research Foundation and IBIDS. Acudoc2 RCT yielded 100 CAM trials in the most recent 100 records screened. Acubriefs, AMED, Hom-Inform, MANTIS, PsycINFO and CINAHL had more than 25 CAM trials per 100 records screened. Global Health, ICL and Alt HealthWatch were below 25 in yield. There were 255 non-MEDLINE trials from eight databases in 2005, with only 10% indexed in more than one database. Yield varied greatly between databases; the most productive databases from both sampling methods were Acubriefs, Acudoc2 RCT, AMED and CINAHL. Low overlap between databases indicates comprehensive CAM literature searches will require multiple databases.