Identification of pharmacodynamic biomarker hypotheses through literature analysis with IBM Watson.
ABSTRACT: BACKGROUND: Pharmacodynamic biomarkers are becoming increasingly valuable for assessing drug activity and target modulation in clinical trials. However, identifying quality biomarkers is challenging due to the increasing volume and heterogeneity of relevant data describing the biological networks that underlie disease mechanisms. A biological pathway network typically includes entities (e.g. genes, proteins and chemicals/drugs) as well as the relationships between them, and is typically curated or mined from structured databases and textual co-occurrence data. We propose a hybrid Natural Language Processing and directed relationships-based network analysis approach using IBM Watson for Drug Discovery to rank all human genes and identify potential candidate biomarkers, requiring only an initial determination of a specific target-disease relationship. METHODS: Through natural language processing of scientific literature, Watson for Drug Discovery creates a network of semantic relationships between biological concepts such as genes, drugs, and diseases. Using Bruton's tyrosine kinase (BTK) as a case study, Watson for Drug Discovery's automatically extracted relationship network was compared with a prominent manually curated physical interaction network. Additionally, potential biomarkers for BTK inhibition were predicted using a matrix factorization approach and subsequently compared with expert-generated biomarkers. RESULTS: Watson's natural language processing generated a relationship network matching 55 (86%) genes upstream of BTK and 98 (95%) genes downstream of BTK in a prominent manually curated physical interaction network. Matrix factorization analysis predicted 11 of 13 genes identified by Merck subject matter experts in the top 20% of Watson for Drug Discovery's 13,595 ranked genes, with 7 in the top 5%.
CONCLUSION: Taken together, these results suggest that Watson for Drug Discovery's automatic relationship network identifies the majority of upstream and downstream genes in biological pathway networks and can be used to help with the identification and prioritization of pharmacodynamic biomarker evaluation, accelerating the early phases of disease hypothesis generation.
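The evaluation reported above (how many expert-selected genes land in the top 20% or top 5% of a ranked gene list) can be sketched as follows. This is a minimal illustration and not the study's code: the function name `hits_in_top_percent` and the toy gene labels are assumptions made for the example.

```python
from typing import Iterable, List


def hits_in_top_percent(ranked_genes: List[str],
                        reference_genes: Iterable[str],
                        percent: int) -> List[str]:
    """Return the reference genes found within the top `percent` of a best-first ranking."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    # Integer arithmetic avoids floating-point surprises at the cutoff.
    cutoff = len(ranked_genes) * percent // 100
    top = set(ranked_genes[:cutoff])
    return [gene for gene in reference_genes if gene in top]


# Toy data: 20 placeholder genes ranked best-first (hypothetical, not from the study).
ranked = [f"GENE{i}" for i in range(1, 21)]
experts = ["GENE1", "GENE3", "GENE4", "GENE15"]

top20 = hits_in_top_percent(ranked, experts, 20)  # cutoff is 4 ranks
top5 = hits_in_top_percent(ranked, experts, 5)    # cutoff is 1 rank
print(len(top20), "of", len(experts), "expert genes in the top 20%")
print(len(top5), "of", len(experts), "expert genes in the top 5%")
```

With a list of 13,595 ranked genes, the same integer cutoff would count the first 2,719 ranks as the top 20% and the first 679 ranks as the top 5%.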
Project description:The goal of this study was to discover a minimally invasive pathway-specific biomarker that is immune to normal cell mRNA contamination for diagnosing head and neck squamous cell carcinoma (HNSCC). Using Elsevier's MedScan natural language processing component of the Pathway Studio software and the TRANSFAC database, we produced a curated set of genes regulated by the signaling networks driving the development of HNSCC. The network and its gene targets provided prior probabilities for gene expression, which guided our CoGAPS matrix factorization algorithm to isolate patterns related to HNSCC signaling activity from a microarray-based study. Using patterns that distinguished normal from tumor samples, we identified a reduced set of genes to analyze with Top Scoring Pair in order to produce a potential biomarker for HNSCC. Our proposed biomarker comprises targets of the transcription factor (TF) HIF1A and the FOXO family of TFs coupled with genes that show remarkable stability across all normal tissues. Based on validation with novel data from The Cancer Genome Atlas (TCGA), measured by RNAseq, and bootstrap sampling, the biomarker for normal vs. tumor has an accuracy of 0.77, a Matthews correlation coefficient of 0.54, and an area under the curve (AUC) of 0.82.
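For readers unfamiliar with the metrics quoted above, the following sketch shows how accuracy and the Matthews correlation coefficient are computed from a binary confusion matrix. The counts are hypothetical and chosen only for illustration; they are not the study's data, and the function names are assumptions.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of samples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical tumor-vs-normal counts (not taken from the study).
tp, tn, fp, fn = 40, 30, 10, 20
acc = accuracy(tp, tn, fp, fn)
mcc = matthews_corrcoef(tp, tn, fp, fn)
print(f"accuracy={acc:.2f}, MCC={mcc:.2f}")
```

Unlike accuracy, the coefficient uses all four cells of the confusion matrix, so it stays informative when the normal and tumor classes are unbalanced.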
Project description:In the present work, we apply a geometric network approach to study common biological features of anticancer drug response. We use for this purpose the panel of 60 human cell lines (NCI-60) provided by the National Cancer Institute. Our study suggests that mathematical tools for network-based analysis can provide novel insights into drug response and cancer biology. We adopted a discrete notion of Ricci curvature to measure, via a link between Ricci curvature and network robustness established by the theory of optimal mass transport, the robustness of biological networks constructed with a pre-treatment gene expression dataset and coupled the results with the GI50 response of the cell lines to the drugs. Based on the resulting drug response ranking, we assessed the impact of genes that are likely associated with individual drug response. For genes identified as important, we performed a gene ontology enrichment analysis using a curated bioinformatics database which resulted in biological processes associated with drug response across cell lines and tissue types which are plausible from the point of view of the biological literature. These results demonstrate the potential of using the mathematical network analysis in assessing drug response and in identifying relevant genomic biomarkers and biological processes for precision medicine.
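To give a concrete sense of what a discrete edge curvature looks like, the sketch below computes a curvature value for every edge of a small graph. Note the substitution: the abstract ties its curvature to optimal mass transport, which suggests an Ollivier-type definition, whereas this example uses the simpler combinatorial Forman-Ricci curvature for unweighted graphs, F(u, v) = 4 - deg(u) - deg(v). The toy edge list and function name are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def forman_curvature(edges: List[Edge]) -> Dict[Edge, int]:
    """Combinatorial Forman-Ricci curvature of each edge in an unweighted,
    undirected simple graph: 4 - deg(u) - deg(v)."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    return {(u, v): 4 - len(neighbors[u]) - len(neighbors[v]) for u, v in edges}


# Toy network (hypothetical nodes, not a gene network from the study).
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("D", "E")]
curvature = forman_curvature(toy_edges)
for edge, value in curvature.items():
    print(edge, value)
```

Edges attached to high-degree hubs receive more negative values under this definition; the transport-based curvature used in the study would require solving an optimal transport problem per edge and is not reproduced here.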
Project description:BACKGROUND: Because drug-drug interactions (DDIs) may cause adverse drug reactions or contribute to complex-disease treatments, it is important to identify DDIs before multiple-drug medications are prescribed. As an alternative to high-cost experimental identification, computational approaches provide much cheaper large-scale screening for potential DDIs. Nevertheless, most of them only predict whether or not one drug interacts with another, and neglect the enhancive (positive) and degressive (negative) changes in pharmacological effects. Moreover, these comprehensive DDIs do not occur at random, but exhibit a weakly balanced relationship (a structural property of the DDI network), which would help in understanding how high-order DDIs work. RESULTS: This work exploits this intrinsic structural relationship to solve two tasks: drug community detection and comprehensive DDI prediction in the cold-start scenario. Accordingly, we first design a balance regularized semi-nonnegative matrix factorization (BRSNMF) to partition the drugs into communities. Then, to predict enhancive and degressive DDIs in the cold-start scenario, we develop a BRSNMF-based predictive approach, which leverages drug-binding proteins (DBP) as features to associate new drugs (having no known DDI) with other drugs (having known DDIs). Our experiments demonstrate that BRSNMF can generate drug communities that exhibit more reasonable sizes, the property of weak balance, and pharmacological significance. Moreover, they demonstrate the superiority of DBP features and the promising ability of the BRSNMF-based predictive approach on comprehensive DDI prediction, with 94% accuracy among the top-50 predicted enhancive DDIs and 86% accuracy among the bottom-50 predicted degressive DDIs.
CONCLUSIONS: By incorporating the weak balance property of the comprehensive DDI network as a regularizer in semi-nonnegative matrix factorization, our proposed BRSNMF is able not only to generate better drug communities but also to provide promising comprehensive DDI prediction in the cold-start scenario.
Project description:A central focus of clinical proteomics is to search for biomarkers in plasma for diagnostic and therapeutic use. We studied a set of plasma proteins accessed from the Healthy Human Individual's Integrated Plasma Proteome (HIP(2)) database, a larger set of curated human proteins, and a subset of inflammatory proteins, for overlap with sets of known protein biomarkers, drug targets, and secreted proteins. Most inflammatory proteins were found to occur in plasma, and over three times the level of biomarkers were found in inflammatory plasma proteins and their interacting protein neighbors compared to the sets of plasma and curated human proteins. Percentage overlaps with Gene Ontology terms were similar between the curated human set and plasma protein set, yet the set of inflammatory plasma proteins had a distinct ontology-based profile. Most of the major hub proteins within protein-protein interaction networks of tissue-specific sets of inflammatory proteins were found to occur in disease pathways. The present study presents a systematic approach for profiling a plasma subproteome's relationship to both its potential range of clinical application and its overlap with complex disease.
Project description:The biomedical community's collective understanding of how chemicals, genes and phenotypes interact is distributed across the text of over 24 million research articles. These interactions offer insights into the mechanisms behind higher order biochemical phenomena, such as drug-drug interactions and variations in drug response across individuals. To assist their curation at scale, we must understand what relationship types are possible and map unstructured natural language descriptions onto these structured classes. We used NCBI's PubTator annotations to identify instances of chemical, gene and disease names in Medline abstracts and applied the Stanford dependency parser to find connecting dependency paths between pairs of entities in single sentences. We combined a published ensemble biclustering algorithm (EBC) with hierarchical clustering to group the dependency paths into semantically-related categories, which we annotated with labels, or 'themes' ('inhibition' and 'activation', for example). We evaluated our theme assignments against six human-curated databases: DrugBank, Reactome, SIDER, the Therapeutic Target Database, OMIM and PharmGKB. Clustering revealed 10 broad themes for chemical-gene relationships, 7 for chemical-disease, 10 for gene-disease and 9 for gene-gene. In most cases, enriched themes corresponded directly to known database relationships. Our final dataset, represented as a network, contained 37,491 thematically-labeled chemical-gene edges, 2,021,192 chemical-disease edges, 136,206 gene-disease edges and 41,418 gene-gene edges, each representing a single-sentence description of an interaction from somewhere in the literature. The complete network is available on Zenodo (https://zenodo.org/record/1035500).
We have also provided the full set of dependency paths connecting biomedical entities in Medline abstracts, with associated sentences, for future use by the biomedical research community. Supplementary data are available at Bioinformatics online.
Project description:One challenge facing biologists is to tease out useful information from massive data sets for further analysis. A pathway-based analysis may shed light by projecting candidate genes onto protein functional relationship networks. We are building such a pathway-based analysis system. We have constructed a protein functional interaction network by extending curated pathways with non-curated sources of information, including protein-protein interactions, gene coexpression, protein domain interaction, Gene Ontology (GO) annotations and text-mined protein interactions, which cover close to 50% of the human proteome. By applying this network to two glioblastoma multiforme (GBM) data sets and projecting cancer candidate genes onto the network, we found that the majority of GBM candidate genes form a cluster and are closer than expected by chance, and the majority of GBM samples have sequence-altered genes in two network modules, one mainly comprising genes whose products are localized in the cytoplasm and plasma membrane, and another comprising gene products in the nucleus. Both modules are highly enriched in known oncogenes, tumor suppressors and genes involved in signal transduction. Similar network patterns were also found in breast, colorectal and pancreatic cancers. We have built a highly reliable functional interaction network upon expert-curated pathways and applied this network to the analysis of two genome-wide GBM and several other cancer data sets. The network patterns revealed by our results suggest common mechanisms in cancer biology. Our system should provide a foundation for a network or pathway-based analysis platform for cancer and other diseases.
Project description:The published biomedical research literature encompasses most of our understanding of how drugs interact with gene products to produce physiological responses (phenotypes). Unfortunately, this information is distributed throughout the unstructured text of over 23 million articles. The creation of structured resources that catalog the relationships between drugs and genes would accelerate the translation of basic molecular knowledge into discoveries of genomic biomarkers for drug response and prediction of unexpected drug-drug interactions. Extracting these relationships from natural language sentences on such a large scale, however, requires text mining algorithms that can recognize when different-looking statements are expressing similar ideas. Here we describe a novel algorithm, Ensemble Biclustering for Classification (EBC), that learns the structure of biomedical relationships automatically from text, overcoming differences in word choice and sentence structure. We validate EBC's performance against manually-curated sets of (1) pharmacogenomic relationships from PharmGKB and (2) drug-target relationships from DrugBank, and use it to discover new drug-gene relationships for both knowledge bases. We then apply EBC to map the complete universe of drug-gene relationships based on their descriptions in Medline, revealing unexpected structure that challenges current notions about how these relationships are expressed in text. For instance, we learn that newer experimental findings are described in consistently different ways than established knowledge, and that seemingly pure classes of relationships can exhibit interesting chimeric structure. The EBC algorithm is flexible and adaptable to a wide range of problems in biomedical text mining.
Project description:BACKGROUND: Systems biology approaches based on molecular connectivity maps have attracted great interest as a way to understand functional similarities between genes across diseases. In this study, we developed a computational framework to build molecular connectivity maps by integrating mutated and differentially expressed genes of neurological and psychiatric diseases to determine their relationship with aging. RESULTS: The systematic large-scale analysis of 124 human diseases yielded three classes of molecular connectivity maps. First, the molecular interaction network of disease proteins comprises 3632 proteins with 6172 interactions, which identifies the genes/proteins shared between diseases. Second, the disease-disease network includes 4845 positively scored disease-disease relationships. Comparison of these disease-disease pairs with the Medical Subject Headings (MeSH) classification tree suggests that 25% of the pairs fall in the same disease area. The remaining pairs may represent novel disease-disease relationships based on gene/protein similarity. Inclusion of an aging gene set showed that 79 neurological and 20 psychiatric diseases have a strong association with aging. Third, a curated disease biomarker network was created by relating the proteins/genes in specific disease contexts; this analysis showed 73 markers for 24 diseases. Further, a series of statistical methods was applied to ensure the overall quality of the results and to exclude insignificant data from the biological networks. CONCLUSIONS: This study improves the understanding of the complex interactions that occur between neurological and psychiatric diseases and aging, which can help determine diagnostic markers. Also, the disease-disease association results could be helpful in determining the symptom relationships between neurological and psychiatric diseases. Together, our study presents many research opportunities in post-genomic biomarker development.
Project description:Given the complex relationship between gene expression and phenotypic outcomes, computationally efficient approaches are needed to sift through large high-dimensional datasets in order to identify biologically relevant biomarkers. In this report, we describe a method of identifying the most salient biomarker genes in a dataset, which we call "candidate genes", by evaluating the ability of gene combinations to classify samples from a dataset, which we call "classification potential". Our algorithm, Gene Oracle, uses a neural network to test user defined gene sets for polygenic classification potential and then uses a combinatorial approach to further decompose selected gene sets into candidate and non-candidate biomarker genes. We tested this algorithm on curated gene sets from the Molecular Signatures Database (MSigDB) quantified in RNAseq gene expression matrices obtained from The Cancer Genome Atlas (TCGA) and Genotype-Tissue Expression (GTEx) data repositories. First, we identified which MSigDB Hallmark subsets have significant classification potential for both the TCGA and GTEx datasets. Then, we identified the most discriminatory candidate biomarker genes in each Hallmark gene set and provide evidence that the improved biomarker potential of these genes may be due to reduced functional complexity.
Project description:Interferon-gamma (IFN-gamma) regulates various immune responses that are often critical for vaccine-induced protection. In order to annotate the IFN-gamma-related gene interaction network from the large amount of IFN-gamma research reported in the literature, a literature-based discovery approach was applied that combines natural language processing (NLP) with network centrality analysis. The interaction network of human IFN-gamma (Gene symbol: IFNG) and its vaccine-specific subnetwork were automatically extracted using abstracts from all articles in PubMed. Four network centrality metrics were then calculated to rank the genes in the constructed networks. The resulting generic IFNG network contains 1060 genes and 26313 interactions among these genes. The vaccine-specific subnetwork contains 102 genes and 154 interactions. Fifty-six genes such as TNF, NFKB1, IL2, IL6, and MAPK8 were ranked among the top 25 by at least one of the centrality methods in one or both networks. Gene enrichment analysis indicated that these genes were classified in various immune mechanisms such as response to extracellular stimulus, lymphocyte activation, and regulation of apoptosis. Literature evidence was manually curated for the IFN-gamma relatedness of 56 genes and vaccine development relatedness for 52 genes. This study also generated many new hypotheses worth further experimental studies.
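Ranking genes by a network centrality score can be illustrated with a short sketch. The abstract does not name its four centrality metrics, so normalized degree centrality is used here purely as an assumed example. The gene symbols are borrowed from the abstract, but the edges between them are hypothetical and do not come from the extracted IFNG network.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def degree_centrality(edges: List[Tuple[str, str]]) -> Dict[str, float]:
    """Normalized degree centrality: number of neighbors divided by (n - 1)."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {gene: 0.0 for gene in neighbors}
    return {gene: len(adjacent) / (n - 1) for gene, adjacent in neighbors.items()}


def rank_genes(scores: Dict[str, float]) -> List[str]:
    """Sort genes by descending score, breaking ties alphabetically."""
    return sorted(scores, key=lambda gene: (-scores[gene], gene))


# Hypothetical interactions, for illustration only.
toy_interactions = [
    ("IFNG", "TNF"), ("IFNG", "IL2"), ("IFNG", "IL6"),
    ("TNF", "NFKB1"), ("TNF", "IL6"),
]
scores = degree_centrality(toy_interactions)
ranking = rank_genes(scores)
print(ranking)
```

Other centrality measures (for example betweenness or closeness) would plug into the same `rank_genes` step, which is one way to compare which genes reach the top ranks under at least one metric.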