Project description:Purpose: The Electronic Medical Record Search Engine (EMERSE) is a software tool built to aid research spanning cohort discovery, population health, and data abstraction for clinical trials. EMERSE is now live at three academic medical centers, with additional sites currently working on implementation. In this report, we describe how EMERSE has been used to support cancer research based on a variety of metrics. Methods: We identified peer-reviewed publications that used EMERSE through online searches as well as through direct e-mails to users based on audit logs. These logs were also used to summarize use at each of the three sites. Search terms for two of the sites were characterized using the natural language processing tool MetaMap to determine to which semantic types the terms could be mapped. Results: We identified a total of 326 peer-reviewed publications that used EMERSE through August 2019, although this is likely an underestimation of the true total based on the use log analysis. Oncology-related research comprised nearly one third (n = 105; 32.2%) of all research output. The use logs showed that EMERSE had been used by multiple people at each site (nearly 3,500 across all three) who had collectively logged into the system > 100,000 times. Many user-entered search queries could not be mapped to a semantic type, but the most common semantic type for terms that did match was "disease or syndrome," followed by "pharmacologic substance." Conclusion: EMERSE has been shown to be a valuable tool for supporting cancer research. It has been successfully deployed at other sites, despite some implementation challenges unique to each deployment environment.
Project description:This article describes our participation in the Gene Ontology Curation task (GO task) in BioCreative IV, where we participated in both subtasks: A) identification of GO evidence sentences (GOESs) for relevant genes in full-text articles and B) prediction of GO terms for relevant genes in full-text articles. For subtask A, we trained a logistic regression model to detect GOESs based on annotations in the training data supplemented with more noisy negatives from an external resource. Then, a greedy approach was applied to associate genes with sentences. For subtask B, we designed two types of systems: (i) search-based systems, which predict GO terms based on existing annotations for GOESs that are of different textual granularities (i.e., full-text articles, abstracts, and sentences) using state-of-the-art information retrieval techniques (i.e., a novel application of the idea of distant supervision) and (ii) a similarity-based system, which assigns GO terms based on the distance between words in sentences and GO terms/synonyms. Our best performing system for subtask A achieves an F1 score of 0.27 based on exact match and 0.387 allowing relaxed overlap match. Our best performing system for subtask B, a search-based system, achieves an F1 score of 0.075 based on exact match and 0.301 considering hierarchical matches. Our search-based systems for subtask B significantly outperformed the similarity-based system. https://github.com/noname2020/Bioc.
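To make the similarity-based idea for subtask B more concrete, the following is a minimal sketch that ranks GO terms by token overlap between a sentence and each term's name or synonyms. The Jaccard measure, the tokenizer, the function names, and the two-entry term dictionary are all assumptions chosen for illustration; the abstract does not specify which distance the authors used.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_go_terms(sentence, go_terms):
    """Score each GO term by its best-matching label, highest score first."""
    sent_tokens = tokenize(sentence)
    scored = []
    for go_id, labels in go_terms.items():
        best = max(jaccard(sent_tokens, tokenize(label)) for label in labels)
        scored.append((go_id, best))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Illustrative mini-dictionary (assumption): GO identifier -> [name, synonym, ...]
GO_TERMS = {
    "GO:0006915": ["apoptotic process", "apoptosis"],
    "GO:0006412": ["translation", "protein biosynthesis"],
}

ranking = rank_go_terms(
    "The gene regulates the apoptotic process in neurons", GO_TERMS
)
```

A real system would likely need stemming, stop-word handling, and a score threshold before emitting a prediction; none of that is modelled here.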
Project description:The flanking sequences provided by dbSNP of NCBI are usually short and fixed length without further extension, thus making the design of appropriate PCR primers difficult. Here, we introduce a tool named "SNP-Flankplus" to provide a web environment for retrieval of SNP flanking sequences from both the dbSNP and the nucleotide databases of NCBI. Two SNP ID types, rs# and ss#, are acceptable for querying SNP flanking sequences with adjustable lengths for at least sixteen organisms. Availability: This software is freely available at http://bio.kuas.edu.tw/snp-flankplus/
Project description:Identification and retrieval of genes of interest from genomic data are an essential step for many bioinformatic applications. We present orthofisher, a command-line tool for automated identification and retrieval of genes with high sequence similarity to a query profile Hidden Markov Model sequence alignment across a set of proteomes. Performance assessment of orthofisher revealed high accuracy and precision during single-copy orthologous gene identification. orthofisher may be useful for assessing gene annotation quality, identifying single-copy orthologous genes for phylogenomic analyses, estimating gene copy number, and other evolutionary analyses that rely on identification and retrieval of homologous genes from genomic data. orthofisher comes complete with comprehensive documentation (https://jlsteenwyk.com/orthofisher/), is freely available under the MIT license, and is available for download from GitHub (https://github.com/JLSteenwyk/orthofisher), PyPi (https://pypi.org/project/orthofisher/), and the Anaconda Cloud (https://anaconda.org/jlsteenwyk/orthofisher).
Project description:Microbiome analyses can be challenging because microbial strains are numerous, and often, confounding factors in the data set are also numerous. Many tools reduce, summarize, and visualize these high-dimensional data to provide insight at the community level. However, they lose the detailed information about each taxon and can be misleading (for example, the well-known horseshoe effect in ordination plots). Thus, multiple methods at different levels of resolution are required to capture the full range of microbial patterns. Here we present Calour, a user-friendly data exploration tool for microbiome analyses. Calour provides a study-centric data model to store and manipulate sample-by-feature tables (with features typically being operational taxonomic units) and their associated metadata. It generates an interactive heatmap, allowing visualization of microbial patterns and exploration using microbial knowledge databases. We demonstrate the use of Calour by exploring publicly available data sets, including the gut and skin microbiota of habitat-switched fire salamander larvae, gut microbiota of Trichuris muris-infected mice, skin microbiota of different human body sites, gut microbiota of various ant species, and a metabolome study of mice exposed to intermittent hypoxia and hypercapnia. In these cases, Calour reveals novel patterns and potential contaminants of subgroups of microbes that are otherwise hard to find. Calour is open source under the Berkeley Software Distribution (BSD) license and available from https://github.com/biocore/calour. IMPORTANCE Calour allows us to identify interesting microbial patterns and generate novel biological hypotheses by interactively inspecting microbiome studies and incorporating annotation databases and convenient statistical tools. Calour can be used as a first-step tool for microbiome data exploration.
Project description:The exclusion of monolingual natives from cyberspace is a global socioeconomic and cultural problem. Efforts at addressing this problem have been socioeconomic, culminating in training, empowerment, and digital access with the indelible hurt of language inequities. This paper is aimed at the cyber-inclusion of monolingual natives. Since cyber participation is basically through human interaction with cyber-applications in a human language, encapsulating these applications for interaction in any human language will help evade the hurt of language inequities. Information retrieval system (IRS) remains a fundamental cyber-application. Consequently, adopting the design science research methodology, we introduced a lingual agnostic IRS architecture designed on the principle of transparency on user language detection, information translations, and caching. The detailed design of the architecture was done using the unified modeling language. The designed IRS architecture has been implemented using the agile and component-based software engineering approaches. The resultant lingual agnostic IRS (LAIRS) was evaluated using heuristics and system evaluation methods for parity of language of interaction against the default language and was excellently stable across queries and languages, guaranteeing 86% parity with the default language in the use of other languages for information access and retrieval. Furthermore, it has been shown that LAIRS is the most appropriate IRS to address the problem of language barriers to cyber-inclusion compared with existing IRSs.
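The three principles named above (transparent language detection, translation, and caching) can be sketched as a thin wrapper around an existing search function. This is only an illustration under stated assumptions: the class name, the `search`, `detect`, and `translate` callables, and the toy data are hypothetical and do not describe the actual LAIRS implementation.

```python
class LingualAgnosticSearch:
    """Wrap a single-language search function so queries work in any language."""

    def __init__(self, search, detect, translate, default_lang="en"):
        self.search = search          # query in default language -> list of results
        self.detect = detect          # text -> language code
        self.translate = translate    # (text, src, dst) -> translated text
        self.default_lang = default_lang
        self.cache = {}               # (text, src, dst) -> translated text

    def _translate(self, text, src, dst):
        """Translate with caching; skip the call when no translation is needed."""
        if src == dst:
            return text
        key = (text, src, dst)
        if key not in self.cache:
            self.cache[key] = self.translate(text, src, dst)
        return self.cache[key]

    def query(self, text):
        """Detect the user's language, search in the default one, translate back."""
        lang = self.detect(text)
        results = self.search(self._translate(text, lang, self.default_lang))
        return [self._translate(r, self.default_lang, lang) for r in results]


# Toy stand-ins (assumptions) so the sketch runs without external services.
TOY_DICT = {
    ("eau", "fr", "en"): "water",
    ("water is wet", "en", "fr"): "l'eau est humide",
}
calls = []

def toy_translate(text, src, dst):
    calls.append((text, src, dst))
    return TOY_DICT[(text, src, dst)]

def toy_detect(text):
    return "fr" if text == "eau" else "en"

def toy_search(query):
    return ["water is wet"] if query == "water" else []

engine = LingualAgnosticSearch(toy_search, toy_detect, toy_translate)
first = engine.query("eau")
second = engine.query("eau")  # served from the translation cache
```

Because the cache is keyed on the text and the language pair, repeating a query costs no additional translation calls, which is the behaviour the caching principle is meant to provide.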
Project description:Background: Bioinformatics and medical informatics are two research fields that serve the needs of different but related communities. Both domains share the common goal of providing new algorithms, methods and technological solutions to biomedical research, and contributing to the treatment and cure of diseases. Although different microarray techniques have been successfully used to investigate useful information for cancer diagnosis at the gene expression level, the true integration of existing methods into day-to-day clinical practice is still a long way off. Within this context, case-based reasoning emerges as a suitable paradigm specially intended for the development of biomedical informatics applications and decision support systems, given the support and collaboration involved in such a translational development. With the goals of removing barriers against multi-disciplinary collaboration and facilitating the dissemination and transfer of knowledge to real practice, case-based reasoning systems have the potential to be applied to translational research mainly because their computational reasoning paradigm is similar to the way clinicians gather, analyze and process information in their own practice of clinical medicine. Results: In addressing the issue of bridging the existing gap between biomedical researchers and clinicians who work in the domain of cancer diagnosis, prognosis and treatment, we have developed and made accessible a common interactive framework. Our geneCBR system implements a freely available software tool that allows the use of combined techniques that can be applied to gene selection, clustering, knowledge extraction and prediction for aiding diagnosis in cancer research. For biomedical researchers, geneCBR expert mode offers a core workbench for designing and testing new techniques and experiments.
For pathologists or oncologists, geneCBR diagnostic mode implements an effective and reliable system that can diagnose cancer subtypes based on the analysis of microarray data using a CBR architecture. For programmers, geneCBR programming mode includes an advanced edition module for run-time modification of previously coded techniques. Conclusion: geneCBR is a new translational tool that can effectively support the integrative work of programmers, biomedical researchers and clinicians working together in a common framework. The code is freely available under the GPL license and can be obtained at http://www.genecbr.org.
Project description:Correlating specific genotypes with human diseases remains a complex issue despite the advantages arising from high-throughput technologies, such as Genome Wide Association Studies (GWAS). New tools for genetic variant interpretation and for Single Nucleotide Polymorphism (SNP) prioritization are needed. Given a list of the most relevant SNPs statistically associated with a specific pathology as a result of a genotype study, a critical issue is the identification of genes that are effectively related to the disease by re-scoring the importance of the identified genetic variations. Vice versa, given a list of genes, it can be of great importance to predict which SNPs can be involved in the onset of a particular disease, in order to focus the research on their effects. We propose a new bioinformatics approach to support biological data mining in the analysis and interpretation of SNPs associated with pathologies. This system can be employed to design custom genotyping chips for disease-oriented studies and to re-score GWAS results. The proposed method relies (1) on the data integration of public resources using a gene-centric database design, (2) on the evaluation of a set of static biomolecular annotations, defined as features, and (3) on the SNP scoring function, which computes SNP scores using parameters and weights set by users. We employed a machine learning classifier to set default feature weights and an ontological annotation layer to enable the enrichment of the input gene set. We implemented our method as a web tool called SNPranker 2.0 (http://www.itb.cnr.it/snpranker), improving our first published release of this system. A user-friendly interface allows the input of a list of genes, SNPs or a biological process, and customization of the feature set with relative weights.
As a result, SNPranker 2.0 returns a list of SNPs, localized within input and ontologically enriched genes, combined with their prioritization scores. Different databases and resources are already available for SNP annotation, but they do not prioritize or re-score SNPs relying on a priori biomolecular knowledge. SNPranker 2.0 attempts to fill this gap through a user-friendly integrated web resource. End users, such as researchers in medical genetics and epidemiology, may find in SNPranker 2.0 a new tool for data mining and interpretation able to support SNP analysis. Possible scenarios are GWAS data re-scoring, SNP selection for custom genotyping arrays and SNP/disease association studies.
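One simple way to picture a scoring function driven by user-set feature weights is a weighted sum over per-SNP annotation features. The sketch below assumes that linear form; the feature names, weights, and SNP labels are hypothetical and are not taken from SNPranker 2.0, whose actual scoring function is not given in the abstract.

```python
def snp_score(features, weights):
    """Weighted sum of feature values; features without a weight contribute 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def rank_snps(snp_features, weights):
    """Return (snp, score) pairs sorted from highest to lowest score."""
    scores = {snp: snp_score(feats, weights) for snp, feats in snp_features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical user-set weights and binary annotation features (assumptions).
WEIGHTS = {"coding": 2.0, "conserved": 1.0, "regulatory": 0.5}
SNP_FEATURES = {
    "snp_A": {"coding": 1, "conserved": 1, "regulatory": 0},
    "snp_B": {"coding": 0, "conserved": 1, "regulatory": 1},
    "snp_C": {"coding": 0, "conserved": 0, "regulatory": 1},
}

ranked = rank_snps(SNP_FEATURES, WEIGHTS)
# ranked == [("snp_A", 3.0), ("snp_B", 1.5), ("snp_C", 0.5)]
```

Changing a single weight re-orders the list without touching the annotation data, which is the kind of user-driven re-scoring the abstract describes.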
Project description:The DAVID Gene Functional Classification Tool (http://david.abcc.ncifcrf.gov) uses a novel agglomeration algorithm to condense a list of genes or associated biological terms into organized classes of related genes or biology, called biological modules. This organization is accomplished by mining the complex biological co-occurrences found in multiple sources of functional annotation. It is a powerful method to group functionally related genes and terms into a manageable number of biological modules for efficient interpretation of gene lists in a network context.
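The general idea of condensing a gene list into modules from shared annotations can be illustrated with a deliberately simple stand-in: genes are linked when their annotation sets overlap strongly, and linked genes are merged into one group. This is not DAVID's agglomeration algorithm; the Jaccard measure, the single-linkage merging, the 0.5 threshold, and the gene and term names are all assumptions made for the sketch.

```python
from itertools import combinations

def group_genes(annotations, threshold=0.5):
    """Merge genes whose annotation sets have Jaccard similarity >= threshold."""
    parent = {gene: gene for gene in annotations}

    def find(gene):
        # Union-find lookup with path halving.
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for a, b in combinations(annotations, 2):
        terms_a, terms_b = annotations[a], annotations[b]
        union = terms_a | terms_b
        similarity = len(terms_a & terms_b) / len(union) if union else 0.0
        if similarity >= threshold:
            parent[find(a)] = find(b)  # link the two groups

    groups = {}
    for gene in annotations:
        groups.setdefault(find(gene), set()).add(gene)
    return sorted(sorted(members) for members in groups.values())

# Hypothetical gene -> annotation-term sets (assumption).
ANNOTATIONS = {
    "geneA": {"kinase", "signaling", "membrane"},
    "geneB": {"kinase", "signaling"},
    "geneC": {"ribosome", "translation"},
    "geneD": {"ribosome", "translation", "cytoplasm"},
}

modules = group_genes(ANNOTATIONS)
# modules == [["geneA", "geneB"], ["geneC", "geneD"]]
```

Single linkage can chain loosely related genes together at low thresholds, so the threshold choice matters even in this toy version.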
Project description:Visualization algorithms have been widely used for intuitive interrogation of genomic data, and popular tools include MDS, t-SNE, and UMAP. However, these algorithms are not tuned for the visualization of binary data, and none of them considers the hubness of observations. To address these limitations, we propose hubViz, a novel tool for hub-centric visualization of binary data. We evaluated the performance of hubViz by applying it to gene expression data measured in multiple brain regions of rats exposed to cocaine, single-cell RNA-seq data of peripheral blood mononuclear cells treated with interferon beta, and literature mining data used to investigate relationships among diseases. We further evaluated the performance of hubViz using simulation studies. We showed that hubViz provides effective visual inspection by locating the hub in the center and the contrasting elements on opposite sides around the center. We believe that hubViz and its software can be powerful tools that can improve visualizations of various genomic data. hubViz is implemented as the R package hubviz, which is publicly available at https://dongjunchung.github.io/hubviz/.