QSAR with experimental and predictive distributions: an information theoretic approach for assessing model quality.
ABSTRACT: We propose that quantitative structure-activity relationship (QSAR) predictions should be explicitly represented as predictive (probability) distributions. If both predictions and experimental measurements are treated as probability distributions, the quality of a set of predictive distributions output by a model can be assessed with Kullback-Leibler (KL) divergence: a widely used information theoretic measure of the distance between two probability distributions. We have assessed a range of machine learning algorithms and error estimation methods for producing predictive distributions in an analysis of three of AstraZeneca's global DMPK datasets. Using the KL-divergence framework, we have identified a few combinations of algorithms that produce accurate and valid compound-specific predictive distributions. These methods use reliability indices to assign predictive distributions to the predictions output by QSAR models, so that reliable predictions have tight distributions and vice versa. Finally, we show how valid predictive distributions can be used to estimate the probability that a test compound has properties that hit single- or multi-objective target profiles.
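For instance, if both the prediction and the experimental measurement for a compound are represented as Gaussians, the KL divergence between them has a closed form. A minimal sketch of that case (the Gaussian assumption is ours; the abstract does not specify the distributional forms used):

```python
import math

def kl_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """KL(P || Q) between two univariate Gaussians in closed form:
    log(sq/sp) + (sp^2 + (mp - mq)^2) / (2 sq^2) - 1/2."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sigma_q ** 2)
            - 0.5)

# Identical predictive and experimental distributions: zero divergence.
print(kl_gaussian(0.0, 1.0, 0.0, 1.0))  # → 0.0
# A prediction whose mean is offset from the measurement is penalized.
print(kl_gaussian(0.5, 1.0, 0.0, 1.0))  # → 0.125
```

A reliability index would then shrink or widen `sigma_p` per compound, tightening the distribution for reliable predictions.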
Project description: In cognitive radio communication, spectrum sensing plays a vital role in detecting the presence of the primary user (PU). Sensing performance is badly degraded by fading and shadowing when only a single secondary user (SU) senses. To overcome this issue, cooperative spectrum sensing (CSS) has been proposed. Although cooperation improves the reliability of the system, the presence of malicious users (MUs) in the CSS deteriorates its performance. In this work, we apply the Kullback-Leibler (KL) divergence method to mitigate spectrum sensing data falsification (SSDF) attacks. In the proposed CSS scheme, each SU reports the availability of the PU to the fusion center (FC) and also keeps the same evidence in its local database. If, based on the KL divergence value, the FC acknowledges a user as normal, that user then sends unified energy information to the FC based on its current and previous sensing results. This method keeps the probability of detection high and the energy consumption optimal, thus improving the performance of the system. Simulation results show that the proposed KL divergence method outperforms the existing equal gain combination (EGC), maximum gain combination (MGC), and simple KL divergence schemes in the presence of MUs.
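The abstract does not give the FC's detection rule in detail; a minimal sketch of the KL-based screening idea, assuming each SU reports a binned energy distribution and the FC flags users whose report diverges from the fused mean beyond a threshold (the data and threshold are illustrative):

```python
import math

def kl_div(p, q, eps=1e-12):
    """Discrete KL(P || Q), with a small floor to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def flag_malicious(reports, threshold):
    """Flag SUs whose reported distribution diverges from the fused mean."""
    n_bins = len(reports[0])
    fused = [sum(r[i] for r in reports) / len(reports) for i in range(n_bins)]
    return [kl_div(r, fused) > threshold for r in reports]

# Hypothetical binned energy reports from three SUs; the third falsifies.
reports = [[0.10, 0.20, 0.70],
           [0.12, 0.18, 0.70],
           [0.70, 0.20, 0.10]]
print(flag_malicious(reports, threshold=0.2))  # → [False, False, True]
```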
Project description: Humans are entertained and emotionally captivated by a good story. Artworks such as operas, theatre plays, movies, TV series, and cartoons contain implicit stories, which are conveyed visually (e.g., through scenes) and aurally (e.g., via music and speech). Story theorists have explored the structure of various artworks and identified forms and paradigms that are common to most well-written stories. Further, typical story structures have been formalized in different ways and are used by professional screenwriters as guidelines. Currently, computers cannot yet identify such a latent narrative structure of a movie story. Therefore, in this work, we raise the novel challenge of understanding and formulating movie story structure and introduce the first story-based labeled dataset: the Flintstones Scene Dataset (FSD). The dataset consists of 1,569 scenes taken from a manual annotation of 60 episodes of a famous cartoon series, The Flintstones, by 105 distinct annotators. The labels assigned to each scene by different annotators are summarized by a probability vector over 10 possible story elements representing the function of each scene in the advancement of the story, such as the Climax of Act One or the Midpoint. These elements are drawn from guidelines for professional scriptwriting. The annotated dataset is used to investigate the effectiveness of various story-related features and multi-label classification algorithms for the task of predicting the probability distribution of scene labels. We use cosine similarity and KL divergence to measure the quality of the predicted distributions. The best approaches achieved 0.81 average similarity and 0.67 KL divergence between the predicted label vectors and the ground-truth vectors based on the manual annotations.
These results demonstrate the ability of machine learning approaches to detect the narrative structure in movies, which could lead to the development of story-related video analytics tools, such as automatic video summarization and recommendation systems.
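The two evaluation measures can be sketched directly on a scene's label vector; the numbers below are toy values, not taken from the FSD:

```python
import math

def cosine(p, q):
    """Cosine similarity between two label probability vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

def kl_div(p, q, eps=1e-12):
    """KL(P || Q) over story-element bins, floored to avoid log(0)."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)

truth = [0.6, 0.3, 0.1]  # annotators' label distribution for one scene
pred  = [0.5, 0.4, 0.1]  # a model's predicted distribution
print(round(cosine(truth, pred), 3))  # → 0.978
print(round(kl_div(truth, pred), 3))  # → 0.023
```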
Project description: PURPOSE: Oral bioavailability (%F) is a key factor that determines the fate of a new drug in clinical trials. Traditionally, %F is measured using costly and time-consuming experimental tests. Developing computational models to evaluate the %F of new drugs before they are synthesized would be beneficial in the drug discovery process. METHODS: We employed the Combinatorial Quantitative Structure-Activity Relationship approach to develop several computational %F models. We compiled a %F dataset of 995 drugs from public sources. After generating chemical descriptors for each compound, we used random forest, support vector machine, k-nearest neighbor, and CASE Ultra to develop the relevant QSAR models. The resulting models were validated using five-fold cross-validation. RESULTS: The external predictivity of %F values was poor (R² = 0.28, n = 995, MAE = 24), but was improved (R² = 0.40, n = 362, MAE = 21) by filtering out unreliable predictions for compounds with a high probability of interacting with the MDR1 and MRP2 transporters. Furthermore, classifying the compounds according to their %F values (%F < 50% as "low", %F ≥ 50% as "high") and developing category QSAR models resulted in an external accuracy of 76%. CONCLUSIONS: In this study, we developed predictive %F QSAR models that could be used to evaluate new drug compounds, and integrating drug-transporter interaction data greatly benefits the resulting models.
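The abstract validates several learners by five-fold cross-validation; a minimal stdlib sketch of that protocol using a k-nearest-neighbor regressor on toy descriptor/%F data (the data, fold logic, and k are illustrative, not the paper's):

```python
import random
import statistics

def kfold_indices(n, k=5, seed=0):
    """Shuffle indices once and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_predict(x, train, k=3):
    """Mean %F of the k nearest training compounds (Euclidean over descriptors)."""
    nearest = sorted(train, key=lambda t: sum((a - b) ** 2 for a, b in zip(x, t[0])))
    return statistics.mean(y for _, y in nearest[:k])

# Toy (descriptor vector, %F) pairs standing in for the 995-compound set.
data = [([i / 10, (i % 7) / 7], 20 + 0.5 * i) for i in range(50)]

errors = []
for fold in kfold_indices(len(data)):
    fold_set = set(fold)
    test = [data[i] for i in fold]
    train = [d for i, d in enumerate(data) if i not in fold_set]
    for x, y in test:
        errors.append(abs(knn_predict(x, train) - y))

print(round(statistics.mean(errors), 2))  # cross-validated MAE on the toy data
```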
Project description: The International Conference on Harmonization (ICH) M7 guideline allows the use of in silico approaches for predicting Ames mutagenicity for the initial assessment of impurities in pharmaceuticals. This is the first international guideline that addresses the use of quantitative structure-activity relationship (QSAR) models in lieu of actual toxicological studies for human health assessment. QSAR models for Ames mutagenicity therefore now require higher predictive power for identifying mutagenic chemicals. Increasing the predictive power of QSAR models requires larger experimental datasets from reliable sources. The Division of Genetics and Mutagenesis, National Institute of Health Sciences (DGM/NIHS) of Japan recently established a unique proprietary Ames mutagenicity database containing 12,140 new chemicals that had not previously been used for developing QSAR models. The DGM/NIHS provided this Ames database to QSAR vendors to validate and improve their QSAR tools. The Ames/QSAR International Challenge Project was initiated in 2014, with 12 QSAR vendors testing 17 QSAR tools against these compounds in three phases. We now present the final results. All tools were considerably improved by participation in this project. Most tools achieved >50% sensitivity (the proportion of Ames positives correctly predicted), and predictive power (accuracy) was as high as 80%, almost equivalent to the inter-laboratory reproducibility of Ames tests. Further increasing the predictive power of QSAR tools requires the accumulation of additional Ames test data as well as re-evaluation of some previous Ames test results. Indeed, some Ames-positive or Ames-negative chemicals may previously have been classified incorrectly because of methodological weaknesses, resulting in false-positive or false-negative predictions by QSAR tools. These incorrect data hamper prediction and are a source of noise in the development of QSAR models.
It is thus essential to establish a large benchmark database consisting only of well-validated Ames test results to build more accurate QSAR models.
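Sensitivity and accuracy as used above follow directly from paired true/predicted labels; a minimal sketch with toy data:

```python
def sensitivity_accuracy(y_true, y_pred):
    """Sensitivity (recall on Ames positives) and overall accuracy."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))   # true positives
    pos = sum(y_true)                                    # all Ames positives
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return tp / pos, correct / len(y_true)

# Toy labels: 1 = Ames positive, 0 = Ames negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
sens, acc = sensitivity_accuracy(y_true, y_pred)
print(sens, acc)  # → 0.75 0.8
```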
Project description: Quantitative structure-activity relationship (QSAR) models have long been used for making predictions and filling data gaps in diverse fields including medicinal chemistry, predictive toxicology, environmental fate modeling, materials science, agricultural science, nanoscience, and food science. Usually a QSAR model is developed from the chemical information of a properly designed training set and the corresponding experimental response data, while the model is validated using one or more test set(s) for which the experimental response data are available. However, it is interesting to estimate the reliability of predictions when the model is applied to a completely new dataset (a true external set), even when the new data points are within the applicability domain (AD) of the developed model. In the present study, we categorized the quality of predictions for the test set or true external set into three groups (good, moderate, and bad) based on absolute prediction errors. We then used three criteria [(a) the mean absolute error of leave-one-out predictions for the 10 training compounds closest to each query molecule; (b) AD in terms of similarity based on the standardization approach; and (c) the proximity of the predicted value of the query compound to the mean training response] in different weighting schemes to construct a composite score for the predictions. Using the most frequently appearing weighting scheme, 0.5-0-0.5, the composite-score-based categorization agreed with the absolute-prediction-error-based categorization for more than 80% of the test data points, across 5 different datasets with 15 models for each set derived using three different splitting techniques. These observations were also confirmed with true external sets for another four endpoints, suggesting the applicability of the scheme for judging the reliability of predictions for new datasets.
The scheme has been implemented in a tool "Prediction Reliability Indicator" available at http://dtclab.webs.com/software-tools and http://teqip.jdvu.ac.in/QSAR_Tools/DTCLab/, and the tool is presently valid for multiple linear regression models only.
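The composite score is a weighted combination of the three criteria; a sketch, assuming each criterion has already been scaled to [0, 1] with 1 meaning most reliable, and with hypothetical category cut-offs (the tool's actual thresholds are not given in the abstract):

```python
def composite_score(scores, weights=(0.5, 0.0, 0.5)):
    """Weighted combination of the three per-compound criterion scores.
    The 0.5-0-0.5 scheme weights criteria (a) and (c) equally and
    drops the similarity-based criterion (b)."""
    return sum(w * s for w, s in zip(weights, scores))

def categorize(score, good=0.66, moderate=0.33):
    """Hypothetical cut-offs mapping a composite score to a quality class."""
    if score >= good:
        return "good"
    if score >= moderate:
        return "moderate"
    return "bad"

print(categorize(composite_score((0.9, 0.2, 0.8))))  # → good
print(categorize(composite_score((0.3, 0.9, 0.4))))  # → moderate
```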
Project description: Cerium dioxide (CeO2) is a surrogate material for traditional nuclear fuels and an essential material for a wide variety of industrial applications at both its bulk and nanometer length scales. Despite this, the underlying physics of its thermal conductivity (kL), a crucial design parameter in industrial applications, has not received enough attention. In this article, a systematic investigation of the phonon transport properties was performed using ab initio calculations combined with the Boltzmann transport equation. An extensive examination of the phonon mode contributions, the available three-phonon scattering phase space, the mode Grüneisen parameters, and the mean free path (MFP) distributions was also conducted. To further augment the theoretical predictions of kL, measurements were made by the laser flash technique on specimens prepared by spark plasma sintering. Since sample porosity plays a vital role in the measured value of kL, the effect of porosity on kL was investigated by molecular dynamics (MD) simulations. Finally, we also determined the effect of nanostructuring on the thermal properties of CeO2. Since CeO2 films find application in various industries, the dependence of the in-plane and cross-plane kL on thickness for an infinite CeO2 thin film was also reported.
Project description: When constructing discrete (binned) distributions from samples of a dataset, there are applications where it is desirable to ensure that all bins of the sample distribution have nonzero probability. For example, the sample distribution may be part of a predictive model that must return a response over the entire codomain, or Kullback-Leibler divergence may be used to measure the (dis)agreement of the sample distribution and the original distribution of the variable, which is inconveniently infinite whenever a bin has zero probability. Several sample-based distribution estimators exist that ensure nonzero bin probability, such as adding one counter to each zero-probability bin of the sample histogram, adding a small probability to the sample pdf, smoothing methods such as kernel-density smoothing, or Bayesian approaches based on the Dirichlet and multinomial distributions. Here, we suggest and test an approach based on the Clopper-Pearson method, which makes use of the binomial distribution. Based on the sample distribution, confidence intervals for the bin-occupation probability are calculated. The mean of each confidence interval is a strictly positive estimator of the true bin-occupation probability and is convergent with increasing sample size. For small samples, it converges towards a uniform distribution, i.e., the method effectively applies a maximum entropy approach. We apply this nonzero method and four alternative sample-based distribution estimators to a range of typical distributions (uniform, Dirac, normal, multimodal, and irregular) and measure the effect with Kullback-Leibler divergence. While the performance of each method strongly depends on the distribution type it is applied to, on average, and especially for small sample sizes, the nonzero method, the simple "add one counter" method, and the Bayesian Dirichlet-multinomial model show very similar behavior and perform best.
We conclude that, when estimating distributions without an a priori idea of their shape, applying one of these methods is favorable.
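A stdlib sketch of the interval computation described above, inverting the binomial CDF by bisection (rather than via Beta quantiles); the final renormalization of the interval means into a proper probability vector is our assumption, not a step stated in the abstract:

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for a bin-occupation
    probability, found by bisection on the binomial CDF."""
    def solve(too_small):
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if too_small(mid) else (lo, mid)
        return (lo + hi) / 2
    lower = 0.0 if k == 0 else solve(lambda p: 1 - binom_cdf(k - 1, n, p) < alpha / 2)
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p) >= alpha / 2)
    return lower, upper

def nonzero_estimate(counts, alpha=0.05):
    """Strictly positive bin probabilities: mean of each bin's interval,
    renormalized to sum to one (renormalization is our assumption)."""
    n = sum(counts)
    means = [sum(clopper_pearson(k, n, alpha)) / 2 for k in counts]
    total = sum(means)
    return [m / total for m in means]

# Even the empty first bin receives positive probability.
probs = nonzero_estimate([0, 3, 7])
print(all(p > 0 for p in probs), round(sum(probs), 6))  # → True 1.0
```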
Project description: Despite decades of intensive research and a number of demonstrable successes, quantitative structure-activity relationship (QSAR) models still fail to yield predictions of reasonable accuracy in some circumstances, especially when the QSAR paradox occurs. In this study, to avoid the QSAR paradox, we propose a novel integrated approach that improves model performance by using both structural and biological information from compounds. As a proof of concept, integrated models were built on a toxicological dataset to predict the non-genotoxic carcinogenicity of compounds, using not only conventional molecular descriptors but also expression profiles of significant genes selected from microarray data. On the test set, the prediction accuracy of the QSAR model increased from 0.57 to 0.67 with the incorporation of expression data for just one selected signature gene. This successful integration of biological information into a classic QSAR model provides a new insight and methodology for building predictive models, especially when the QSAR paradox occurs.
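In practice, this kind of integration amounts to widening each compound's feature vector before training a single classifier; a minimal sketch with hypothetical values:

```python
def integrate_features(descriptors, expression):
    """Append selected gene-expression values to each compound's molecular
    descriptor vector, so one classifier sees both views of the compound."""
    return [desc + expr for desc, expr in zip(descriptors, expression)]

# Hypothetical compounds: conventional descriptors plus one signature gene.
descriptors = [[0.12, 3.4, 1.0], [0.45, 2.1, 0.0]]
expression  = [[1.8], [-0.6]]   # expression level of the selected gene
print(integrate_features(descriptors, expression))
# → [[0.12, 3.4, 1.0, 1.8], [0.45, 2.1, 0.0, -0.6]]
```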
Project description: A computational approach to functional specialization suggests that brain systems can be characterized in terms of the types of computations they perform, rather than their sensory or behavioral domains. We contrasted the neural systems associated with two computationally distinct forms of predictive model: a reinforcement-learning model of the environment obtained through experience with discrete events, and continuous dynamic forward modeling. By manipulating the precision with which each type of prediction could be used, we caused participants to shift computational strategies within a single spatial prediction task. Hence (using fMRI) we showed that activity in two brain systems (typically associated with reward learning and motor control) could be dissociated in terms of the forms of computations that were performed there, even when both systems were used to make parallel predictions of the same event. A region in parietal cortex, which was sensitive to the divergence between the predictions of the models and anatomically connected to both computational networks, is proposed to mediate integration of the two predictive modes to produce a single behavioral output.
Project description: Cytochromes P450 3A4, 2D6, and 2C9 metabolize a large fraction of drugs. Knowing where these enzymes will preferentially oxidize a molecule, the regioselectivity, allows medicinal chemists to plan how best to block its metabolism. We present QSAR-based regioselectivity models for these enzymes calibrated against compiled literature data of drugs and drug-like compounds. These models are purely empirical and use only the structures of the substrates, in contrast to models that simulate a specific mechanism like hydrogen radical abstraction and/or use explicit models of active sites. Our most predictive models use three substructure descriptors and two physical property descriptors. Descriptor importances from the random forest QSAR method show that factors other than the immediate chemical environment and the accessibility of the hydrogen affect regioselectivity in all three isoforms. The cross-validated predictions of the models are compared to predictions from our earlier mechanistic model (Singh et al. J. Med. Chem. 2003, 46, 1330-1336) and predictions from MetaSite (Cruciani et al. J. Med. Chem. 2005, 48, 6970-6979).