Evolving BioAssay Ontology (BAO): modularization, integration and applications.
ABSTRACT: The lack of established standards to describe and annotate biological assays and screening outcomes in the domain of drug and chemical probe discovery is a severe limitation to utilize public and proprietary drug screening data to their maximum potential. We have created the BioAssay Ontology (BAO) project (http://bioassayontology.org) to develop common reference metadata terms and definitions required for describing relevant information of low-and high-throughput drug and probe screening assays and results. The main objectives of BAO are to enable effective integration, aggregation, retrieval, and analyses of drug screening data. Since we first released BAO on the BioPortal in 2010 we have considerably expanded and enhanced BAO and we have applied the ontology in several internal and external collaborative projects, for example the BioAssay Research Database (BARD). We describe the evolution of BAO with a design that enables modeling complex assays including profile and panel assays such as those in the Library of Integrated Network-based Cellular Signatures (LINCS). One of the critical questions in evolving BAO is the following: how can we provide a way to efficiently reuse and share among various research projects specific parts of our ontologies without violating the integrity of the ontology and without creating redundancies. This paper provides a comprehensive answer to this question with a description of a methodology for ontology modularization using a layered architecture. Our modularization approach defines several distinct BAO components and separates internal from external modules and domain-level from structural components. This approach facilitates the generation/extraction of derived ontologies (or perspectives) that can suit particular use cases or software applications. We describe the evolution of BAO related to its formal structures, engineering approaches, and content to enable modeling of complex assays and integration with other ontologies and datasets.
Project description:High-throughput screening data repositories, such as PubChem, represent valuable resources for the development of small-molecule chemical probes and can serve as entry points for drug discovery programs. Although the loose data format offered by PubChem allows for great flexibility, important annotations, such as the assay format and technologies employed, are not explicitly indexed. The authors have previously developed a BioAssay Ontology (BAO) and curated more than 350 assays with standardized BAO terms. Here they describe the use of BAO annotations to analyze a large set of assays that employ luciferase- and ?-lactamase-based technologies. They identified promiscuous chemotypes pertaining to different subcategories of assays and specific mechanisms by which these chemotypes interfere in reporter gene assays. Results show that the data in PubChem can be used to identify promiscuous compounds that interfere nonspecifically with particular technologies. Furthermore, they show that BAO is a valuable toolset for the identification of related assays and for the systematic generation of insights that are beyond the scope of individual assays or screening campaigns.
Project description:With the public availability of biochemical assays and screening data constantly increasing, new applications for data mining and method analysis are evolving in parallel. One example is BioAssay Ontology (BAO) for systematic classification of assays based on screening setup and metadata annotations. In this article we report a high-throughput screening (HTS) against phospho-N-acetylmuramoyl-pentapeptide translocase (MraY), an attractive antibacterial drug target involved in peptidoglycan synthesis. The screen resulted in novel chemistry identification using a fluorescence resonance energy transfer assay. To address a subset of the false positive hits, a frequent hitter analysis was performed using an approach in which MraY hits were compared with hits from similar assays, previously used for HTS. The MraY assay was annotated according to BAO and three internal reference assays, using a similar assay design and detection technology, were identified. Analyzing the assays retrospectively, it was clear that both MraY and the three reference assays all showed a high false positive rate in the primary HTS assays. In the case of MraY, false positives were efficiently identified by applying a method to correct for compound interference at the hit-confirmation stage. Frequent hitter analysis based on the three reference assays with similar assay method identified additional false actives in the primary MraY assay as frequent hitters. This article demonstrates how assays annotated using BAO terms can be used to identify closely related reference assays, and that analysis based on these assays clearly can provide useful data to influence assay design, technology, and screening strategy.
Project description:BACKGROUND:The Cell Ontology (CL) is an OBO Foundry candidate ontology covering the domain of canonical, natural biological cell types. Since its inception in 2005, the CL has undergone multiple rounds of revision and expansion, most notably in its representation of hematopoietic cells. For in vivo cells, the CL focuses on vertebrates but provides general classes that can be used for other metazoans, which can be subtyped in species-specific ontologies. CONSTRUCTION AND CONTENT:Recent work on the CL has focused on extending the representation of various cell types, and developing new modules in the CL itself, and in related ontologies in coordination with the CL. For example, the Kidney and Urinary Pathway Ontology was used as a template to populate the CL with additional cell types. In addition, subtypes of the class 'cell in vitro' have received improved definitions and labels to provide for modularity with the representation of cells in the Cell Line Ontology and Reagent Ontology. Recent changes in the ontology development methodology for CL include a switch from OBO to OWL for the primary encoding of the ontology, and an increasing reliance on logical definitions for improved reasoning. UTILITY AND DISCUSSION:The CL is now mandated as a metadata standard for large functional genomics and transcriptomics projects, and is used extensively for annotation, querying, and analyses of cell type specific data in sequencing consortia such as FANTOM5 and ENCODE, as well as for the NIAID ImmPort database and the Cell Image Library. The CL is also a vital component used in the modular construction of other biomedical ontologies-for example, the Gene Ontology and the cross-species anatomy ontology, Uberon, use CL to support the consistent representation of cell types across different levels of anatomical granularity, such as tissues and organs. CONCLUSIONS:The ongoing improvements to the CL make it a valuable resource to both the OBO Foundry community and the wider scientific community, and we continue to experience increased interest in the CL both among developers and within the user community.
Project description:Bioinformatics and computer aided drug design rely on the curation of a large number of protocols for biological assays that measure the ability of potential drugs to achieve a therapeutic effect. These assay protocols are generally published by scientists in the form of plain text, which needs to be more precisely annotated in order to be useful to software methods. We have developed a pragmatic approach to describing assays according to the semantic definitions of the BioAssay Ontology (BAO) project, using a hybrid of machine learning based on natural language processing, and a simplified user interface designed to help scientists curate their data with minimum effort. We have carried out this work based on the premise that pure machine learning is insufficiently accurate, and that expecting scientists to find the time to annotate their protocols manually is unrealistic. By combining these approaches, we have created an effective prototype for which annotation of bioassay text within the domain of the training set can be accomplished very quickly. Well-trained annotations require single-click user approval, while annotations from outside the training set domain can be identified using the search feature of a well-designed user interface, and subsequently used to improve the underlying models. By drastically reducing the time required for scientists to annotate their assays, we can realistically advocate for semantic annotation to become a standard part of the publication process. Once even a small proportion of the public body of bioassay data is marked up, bioinformatics researchers can begin to construct sophisticated and useful searching and analysis algorithms that will provide a diverse and powerful set of tools for drug discovery researchers.
Project description:Novel tools need to be developed to help scientists analyze large amounts of available screening data with the goal to identify entry points for the development of novel chemical probes and drugs. As the largest class of drug targets, G protein-coupled receptors (GPCRs) remain of particular interest and are pursued by numerous academic and industrial research projects.We report the first GPCR ontology to facilitate integration and aggregation of GPCR-targeting drugs and demonstrate its application to classify and analyze a large subset of the PubChem database. The GPCR ontology, based on previously reported BioAssay Ontology, depicts available pharmacological, biochemical and physiological profiles of GPCRs and their ligands. The novelty of the GPCR ontology lies in the use of diverse experimental datasets linked by a model to formally define these concepts. Using a reasoning system, GPCR ontology offers potential for knowledge-based classification of individuals (such as small molecules) as a function of the data.The GPCR ontology is available at http://www.bioassayontology.org/bao_gpcr and the National Center for Biomedical Ontologies Web site.
Project description:Understanding naturally evolved cellular networks requires the consecutive identification and revision of the interactions between relevant molecular species. In this process, initially often simplified and incomplete networks are extended by integrating new reactions or whole subnetworks to increase consistency between model predictions and new measurement data. However, increased consistency with experimental data alone is not sufficient to show the existence of biomolecular interactions, because the interplay of different potential extensions might lead to overall similar dynamics. Here, we present a graph-based modularization approach to facilitate the design of experiments targeted at independently validating the existence of several potential network extensions. Our method is based on selecting the outputs to measure during an experiment, such that each potential network extension becomes virtually insulated from all others during data analysis. Each output defines a module that only depends on one hypothetical network extension, and all other outputs act as virtual inputs to achieve insulation. Given appropriate experimental time-series measurements of the outputs, our modules can be analyzed, simulated, and compared to the experimental data separately. Our approach exemplifies the close relationship between structural systems identification and modularization, an interplay that promises development of related approaches in the future.
Project description:BACKGROUND: Genome-scale metabolic network models have contributed to elucidating biological phenomena, and predicting gene targets to engineer for biotechnological applications. With their increasing importance, their precise network characterization has also been crucial for better understanding of the cellular physiology. RESULTS: We herein introduce a framework for network modularization and Bayesian network analysis (FMB) to investigate organism's metabolism under perturbation. FMB reveals direction of influences among metabolic modules, in which reactions with similar or positively correlated flux variation patterns are clustered, in response to specific perturbation using metabolic flux data. With metabolic flux data calculated by constraints-based flux analysis under both control and perturbation conditions, FMB, in essence, reveals the effects of specific perturbations on the biological system through network modularization and Bayesian network analysis at metabolic modular level. As a demonstration, this framework was applied to the genetically perturbed Escherichia coli metabolism, which is a lpdA gene knockout mutant, using its genome-scale metabolic network model. CONCLUSIONS: After all, it provides alternative scenarios of metabolic flux distributions in response to the perturbation, which are complementary to the data obtained from conventionally available genome-wide high-throughput techniques or metabolic flux analysis.
Project description:Cell lines are frequently used as highly standardized and reproducible in vitro models for biomedical analyses and assays. Cell lines are distributed by cell banks that operate databases describing their products. However, the description of the cell lines' properties are not standardized across different cell banks. Existing cell line-related ontologies mostly focus on the description of the cell lines' names, but do not cover aspects like the origin or optimal growth conditions. The objective of this work is to develop an ontology that allows for a more comprehensive description of cell lines and their metadata, which should cover the data elements provided by cell banks. This will provide the basis for the standardized annotation of cell lines and corresponding assays in biomedical research. In addition, the ontology will be the foundation for automated evaluation of such assays and their respective protocols in the future. To accomplish this, a broad range of cell bank databases as well as existing ontologies were analyzed in a comprehensive manner. We identified existing ontologies capable of covering different aspects of the cell line domain. However, not all data fields derived from the cell banks' databases could be mapped to existing ontologies. As a result, we created a new ontology called cell culture ontology (CCONT) integrating existing ontologies where possible. CCONT provides classes from the areas of cell line identification, origin, cell line properties, propagation and tests performed.
Project description:The wealth of bioactivity information now available on low-molecular weight compounds has enabled a paradigm shift in chemical biology and early phase drug discovery efforts. Traditionally chemical libraries have been most commonly employed in screening approaches where a bioassay is used to characterize a chemical library in a random search for active samples. However, robust curating of bioassay data, establishment of ontologies enabling mining of large chemical biology datasets, and a wealth of public chemical biology information has made possible the establishment of highly annotated compound collections. Such annotated chemical libraries can now be used to build a pathway/target hypothesis and have led to a new view where chemical libraries are used to characterize a bioassay. In this article we discuss the types of compounds in these annotated libraries composed of tools, probes, and drugs. As well, we provide rationale and a few examples for how such libraries can enable phenotypic/forward chemical genomic approaches. As with any approach, there are several pitfalls that need to be considered and we also outline some strategies to avoid these.
Project description:<h4>Summary</h4>We define a disease module as a partition of a molecular network whose components are jointly associated with one or several diseases or risk factors thereof. Identification of such modules, across different types of networks, has great potential for elucidating disease mechanisms and establishing new powerful biomarkers. To this end, we launched the 'Disease Module Identification (DMI) DREAM Challenge', a community effort to build and evaluate unsupervised molecular network modularization algorithms. Here, we present MONET, a toolbox providing easy and unified access to the three top-performing methods from the DMI DREAM Challenge for the bioinformatics community.<h4>Availability and implementation</h4>MONET is a command line tool for Linux, based on Docker and Singularity containers; the core algorithms were written in R, Python, Ada and C++. It is freely available for download at https://github.com/BergmannLab/MONET.git.<h4>Supplementary information</h4>Supplementary data are available at Bioinformatics online.