Support Vector Machine model for hERG inhibitory activities based on the integrated hERG database using descriptor selection by NSGA-II.
ABSTRACT: Assessing the hERG liability in the early stages of drug discovery programs is important. The recent increase of hERG-related information in public databases has enabled various successful applications of machine learning techniques to predict hERG inhibition. However, most of these studies constructed their datasets from only one database, limiting the predictability and scope of the models. In this study, a hERG classification model was constructed using the largest dataset for hERG inhibition, built by integrating multiple databases. The integrated dataset consisted of more than 291,000 structurally diverse compounds derived from ChEMBL, GOSTAR, PubChem, and hERGCentral. The prediction model was built by support vector machine (SVM) with descriptor selection based on Non-dominated Sorting Genetic Algorithm-II (NSGA-II) to optimize the descriptor set for maximum prediction performance with the minimal number of descriptors. The SVM classification model using 72 selected descriptors and ECFP_4 structural fingerprints recorded a kappa statistic of 0.733 and an accuracy of 0.984 for the test set, substantially outperforming the prediction performance of the current commercial applications for hERG prediction. Finally, the applicability domain of the prediction model was assessed based on the molecular similarity between the training set and test set compounds.
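The abstract reports both a kappa statistic and an accuracy, and the two can differ widely when inhibitors are rare. The following minimal sketch shows how Cohen's kappa is obtained from a 2x2 confusion matrix. The counts are hypothetical and chosen only to illustrate an imbalanced test set; they are not the study's confusion matrix, and the function name `accuracy_and_kappa` is an assumption made for this example.

```python
def accuracy_and_kappa(tp, fn, fp, tn):
    """Return (accuracy, Cohen's kappa) for a 2x2 confusion matrix.

    tp/fn: true inhibitors predicted as inhibitor / non-inhibitor
    fp/tn: true non-inhibitors predicted as inhibitor / non-inhibitor
    """
    n = tp + fn + fp + tn
    observed = (tp + tn) / n  # observed agreement, identical to accuracy
    # Agreement expected by chance, computed from the row and column totals.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    # Undefined when expected == 1 (all compounds in a single class).
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Hypothetical test set: 30 inhibitors among 1000 compounds.
acc, kappa = accuracy_and_kappa(tp=20, fn=10, fp=6, tn=964)
print(f"accuracy={acc:.3f} kappa={kappa:.3f}")
```

Because most compounds in such a set are non-inhibitors, chance agreement is already high, so kappa gives a stricter view of performance than accuracy alone.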
Project description:The human ether-a-go-go related gene (hERG) plays an important role in cardiac action potential. It encodes an ion channel protein named Kv11.1, which is related to long QT syndrome and may cause avoidable sudden cardiac death. Therefore, it is important to assess the hERG channel blockage of lead compounds in an early drug discovery process. In this study, we collected a large data set containing 1163 diverse compounds with IC50 values determined by the patch clamp method on mammalian cell lines. The whole data set was divided into 80% as the training set and 20% as the test set. Then, five machine learning methods were applied to build a series of binary classification models based on 13 molecular descriptors, five fingerprints and molecular descriptors combining fingerprints at four IC50 thresholds to discriminate hERG blockers from nonblockers, respectively. Models built by molecular descriptors combining fingerprints were validated by using an external validation set containing 407 compounds collected from the hERGCentral database. The performance indicated that the model built by molecular descriptors combining fingerprints yielded the best results and each threshold had its best suitable method, which means that hERG blockage assessment might depend on threshold values. Meanwhile, kNN and SVM methods were better than the others for model building. Furthermore, six privileged substructures were identified using information gain and frequency analysis methods, which could be regarded as structural alerts of cardiac toxicity mediated by hERG channel blockage.
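As a rough sketch of the data preparation described above, the code below labels compounds as blockers or nonblockers at several IC50 thresholds and makes an 80/20 split. The abstract does not state the four threshold values, so the values in `THRESHOLDS_UM`, the compound names, the IC50 values, and the "IC50 at or below threshold means blocker" rule are all assumptions for illustration.

```python
import random

# Hypothetical thresholds in micromolar; the abstract does not list the real ones.
THRESHOLDS_UM = [1.0, 5.0, 10.0, 30.0]


def label_blockers(ic50_um, threshold_um):
    """Map each compound to 1 (blocker) if IC50 <= threshold, else 0."""
    return {name: int(value <= threshold_um) for name, value in ic50_um.items()}


def split_80_20(names, seed=0):
    """Shuffle compound names reproducibly and return (training, test) lists."""
    ordered = sorted(names)
    random.Random(seed).shuffle(ordered)
    cut = (len(ordered) * 4) // 5  # 80% of the compounds go to training
    return ordered[:cut], ordered[cut:]


# Hypothetical IC50 values (micromolar) for ten made-up compounds.
ic50_um = {f"cpd{i}": v for i, v in enumerate(
    [0.02, 0.5, 0.9, 3.0, 4.5, 8.0, 12.0, 25.0, 40.0, 100.0])}

train, test = split_80_20(ic50_um)
for threshold in THRESHOLDS_UM:
    labels = label_blockers(ic50_um, threshold)
    print(f"threshold {threshold} uM: {sum(labels.values())} blockers of {len(labels)}")
```

The same compound can change class as the threshold moves, which is consistent with the abstract's remark that the best method may depend on the threshold chosen.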
Project description:The inhibition of the hERG potassium channel is closely related to the prolonged QT interval, and thus assessing this risk could greatly facilitate the development of therapeutic compounds and the withdrawal of hazardous marketed drugs. The recent increase in SAR information about hERG inhibitors in public databases has led to many successful applications of machine learning techniques to predict hERG inhibition. However, most of these reports constructed their prediction models based on only one SAR database because the differences in the data format and ontology hindered the integration of the databases. In this study, we curated the hERG-related data in ChEMBL, PubChem, GOSTAR, and hERGCentral, and integrated them into the largest database about hERG inhibition by small molecules. Assessment of structural diversity using Murcko frameworks revealed that the integrated database contains more than twice as many chemical scaffolds for hERG inhibitors as any of the individual databases, and covers 18.2% of the Murcko framework-based chemical space occupied by the compounds in ChEMBL. The database provides the most comprehensive information about hERG inhibitors and will be useful to design safer compounds for drug discovery. The database is freely available at http://drugdesign.riken.jp/hERGdb/.
Project description:Growing evidence suggests that drugs interact with diverse molecular targets mediating both therapeutic and toxic effects. Prediction of these complex interactions from chemical structures alone remains challenging, as compounds with different structures may possess similar toxicity profiles. In contrast, predictions based on systems-level measurements of drug effect may reveal pharmacologic similarities not evident from structure or known therapeutic indications. Here we utilized drug-induced transcriptional responses in the Connectivity Map (CMap) to discover such similarities among diverse antagonists of the human ether-à-go-go related (hERG) potassium channel, a common target of promiscuous inhibition by small molecules. Analysis of transcriptional profiles generated in three independent cell lines revealed clusters enriched for hERG inhibitors annotated using a database of experimental measurements (hERGcentral) and clinical indications. As a validation, we experimentally identified novel hERG inhibitors among the unannotated drugs in these enriched clusters, suggesting transcriptional responses may serve as predictive surrogates of cardiotoxicity complementing existing functional assays.
Project description:The hERG (human ether-a-go-go-related gene) encoded potassium ion (K+) channel plays a major role in cardiac repolarization. Drug-induced blockade of hERG has been a major cause of potentially lethal ventricular tachycardia termed Torsades de Pointes (TdPs). Therefore, we presented a pharmacoinformatics strategy using combined ligand and structure based models for the prediction of hERG inhibition potential (IC50) of new chemical entities (NCEs) during early stages of drug design and development. Integrated GRid-INdependent Descriptor (GRIND) models, and lipophilic efficiency (LipE), ligand efficiency (LE) guided template selection for the structure based pharmacophore models have been used for virtual screening and subsequent hERG activity (pIC50) prediction of identified hits. Finally, two selected hits were experimentally evaluated for hERG inhibition potential (pIC50) using whole cell patch clamp assay. Overall, our results demonstrate a difference of less than ±1.6 log units between experimentally determined and predicted hERG inhibition potential (IC50) of the selected hits. This revealed the predictive ability and robustness of our models and could help to correctly rank the potency order (lower µM to higher nM range) against hERG.
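The quantities named in this abstract (pIC50, LipE, LE) follow from simple formulas, sketched below. The definitions are the commonly used ones: pIC50 is the negative base-10 logarithm of the molar IC50, LipE is pIC50 minus logP, and LE is approximately 1.37 x pIC50 per heavy atom (kcal/mol near room temperature). Whether the authors used exactly these forms is an assumption, and all numeric inputs are hypothetical.

```python
import math


def pic50_from_ic50_um(ic50_um):
    """Convert an IC50 in micromolar to pIC50 = -log10(IC50 in mol/L)."""
    return 6.0 - math.log10(ic50_um)


def lipophilic_efficiency(pic50, logp):
    """LipE = pIC50 - logP (common definition)."""
    return pic50 - logp


def ligand_efficiency(pic50, heavy_atoms):
    """LE ~ 1.37 * pIC50 / heavy-atom count, in kcal/mol per heavy atom."""
    return 1.37 * pic50 / heavy_atoms


# Hypothetical hit: measured IC50 of 0.25 uM, logP 3.2, 28 heavy atoms,
# and a model-predicted pIC50 of 7.4.
experimental = pic50_from_ic50_um(0.25)
predicted = 7.4
print(f"pIC50={experimental:.2f}")
print(f"LipE={lipophilic_efficiency(experimental, 3.2):.2f}")
print(f"LE={ligand_efficiency(experimental, 28):.2f}")
print("within 1.6 log units:", abs(predicted - experimental) < 1.6)
```

On this scale, an error of 1.6 log units corresponds to roughly a 40-fold difference in IC50, which is useful context when reading the reported agreement.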
Project description:Capsule networks (CapsNets), a new class of deep neural network architectures proposed recently by Hinton et al., have shown great performance in many fields, particularly in image recognition and natural language processing. However, CapsNets have not yet been applied to drug discovery-related studies. As a first attempt, in this investigation we adopted CapsNets to develop classification models of hERG blockers/nonblockers; drugs with hERG blockade activity are thought to have a potential risk of cardiotoxicity. Two capsule network architectures were established: convolution-capsule network (Conv-CapsNet) and restricted Boltzmann machine-capsule networks (RBM-CapsNet), in which convolution and a restricted Boltzmann machine (RBM) were used as feature extractors, respectively. Two prediction models of hERG blockers/nonblockers were then developed by Conv-CapsNet and RBM-CapsNet with Doddareddy's training set composed of 2,389 compounds. The established models showed excellent performance in an independent test set comprising 255 compounds, with prediction accuracies of 91.8 and 92.2% for Conv-CapsNet and RBM-CapsNet models, respectively. Various comparisons were also made between our models and those developed by other machine learning methods including deep belief network (DBN), convolutional neural network (CNN), multilayer perceptron (MLP), support vector machine (SVM), k-nearest neighbors (kNN), logistic regression (LR), and LightGBM, and with different training sets. All the results showed that the models by Conv-CapsNet and RBM-CapsNet are among the best classification models. Overall, the excellent performance of capsule networks achieved in this investigation highlights their potential in drug discovery-related studies.
Project description:Shape Signatures is a new computational tool that is being evaluated for applications in computational toxicology and drug discovery. The method employs a customized ray-tracing algorithm to explore the volume enclosed by the surface of a molecule and then uses the output to construct compact histograms (i.e., signatures) that encode for molecular shape and polarity. In the present study, we extend the application of the Shape Signatures methodology to the domain of computational models for cardiotoxicity. The Shape Signatures method is used to generate molecular descriptors that are then utilized with widely used classification techniques such as k nearest neighbors (k-NN), support vector machines (SVM), and Kohonen self-organizing maps (SOM). The performances of these approaches were assessed by applying them to a data set of compounds with varying affinity toward the 5-HT(2B) receptor as well as a set of human ether-a-go-go-related gene (hERG) potassium channel inhibitors. Our classification models for 5-HT(2B) represented the first attempt at global computational models for this receptor and exhibited average accuracies in the range of 73-83%. This level of performance is comparable to using commercially available molecular descriptors. The overall accuracy of the hERG Shape Signatures-SVM models was 69-73%, in line with other computational models published to date. Our data indicate that Shape Signatures descriptors can be used with SVM and Kohonen SOM and perform better in classification problems related to the analysis of highly clustered and heterogeneous property spaces. Such models may have utility for predicting the potential for cardiotoxicity in drug discovery mediated by the 5-HT(2B) receptor and hERG.
Project description:In this paper, we explore the impact of combining different in silico prediction approaches and data sources on the predictive performance of the resulting system. We use inhibition of the hERG ion channel target as the endpoint for this study as it constitutes a key safety concern in drug development and a potential cause of attrition. We show that combining data sources can improve the relevance of the training set with regard to the target chemical space, leading to improved performance. Similarly, we demonstrate that combining multiple statistical models together, and with expert systems, can lead to positive synergistic effects when taking into account the confidence in the predictions of the merged systems. The best combinations analyzed display a good hERG predictivity. Finally, this work demonstrates the suitability of the SOHN methodology for building models in the context of receptor based endpoints like hERG inhibition when using the appropriate pharmacophoric descriptors.
Project description:Several non-cardiovascular drugs have been withdrawn from the market due to their inhibition of hERG K+ channels that can potentially lead to severe heart arrhythmia and death. As hERG safety testing is a mandatory FDA-required procedure, there is considerable interest in developing predictive computational tools to identify and filter out potential hERG blockers early in the drug discovery process. In this study, we aimed to generate predictive and well-characterized quantitative structure-activity relationship (QSAR) models for hERG blockage using the largest publicly available dataset of 11,958 compounds from the ChEMBL database. The models have been developed and validated according to OECD guidelines using four types of descriptors and four different machine-learning techniques. The classification accuracies discriminating blockers from non-blockers were as high as 0.83-0.93 on the external set. Model interpretation revealed several SAR rules, which can guide structural optimization of some hERG blockers into non-blockers. We have also applied the generated models for screening the World Drug Index (WDI) database and identifying putative hERG blockers and non-blockers among currently marketed drugs. The developed models can reliably identify blockers and non-blockers, which could be useful for the scientific community. A freely accessible web server has been developed allowing users to identify putative hERG blockers and non-blockers in chemical libraries of their interest (http://labmol.farmacia.ufg.br/predherg).
Project description:BACKGROUND: We investigate the relationships between the EC (Enzyme Commission) class, the associated chemical reaction, and the reaction mechanism by building predictive models using Support Vector Machine (SVM), Random Forest (RF) and k-Nearest Neighbours (kNN). We consider two ways of encoding the reaction mechanism in descriptors, and also three approaches that encode only the overall chemical reaction. Both cross-validation and also an external test set are used. RESULTS: The three descriptor sets encoding overall chemical transformation perform better than the two descriptions of mechanism. SVM and RF models perform comparably well; kNN is less successful. Oxidoreductases and hydrolases are relatively well predicted by all types of descriptor; isomerases are well predicted by overall reaction descriptors but not by mechanistic ones. CONCLUSIONS: Our results suggest that pairs of similar enzyme reactions tend to proceed by different mechanisms. Oxidoreductases, hydrolases, and to some extent isomerases and ligases, have clear chemical signatures, making them easier to predict than transferases and lyases. We find evidence that isomerases as a class are notably mechanistically diverse and that their one shared property, of substrate and product being isomers, can arise in various unrelated ways. The performance of the different machine learning algorithms is in line with many cheminformatics applications, with SVM and RF being roughly equally effective. kNN is less successful, given the role that non-local information plays in successful classification. We note also that, despite a lack of clarity in the literature, EC number prediction is not a single problem; the challenge of predicting protein function from available sequence data is quite different from assigning an EC classification from a cheminformatics representation of a reaction.
Project description:Human ether-a-go-go-related gene (hERG) potassium channel blockage by small molecules may cause severe cardiac side effects. Thus, it is crucial to screen compounds for activity on the hERG channels early in the drug discovery process. In this study, we collected 5299 hERG inhibitors with diverse chemical structures from a number of sources. Based on this dataset, we evaluated different machine learning (ML) and deep learning (DL) algorithms using various integer and binary type fingerprints. A training set of 3991 compounds was used to develop quantitative structure-activity relationship (QSAR) models. The performance of the developed models was evaluated using a test set of 998 compounds. Models were further validated using external set 1 (263 compounds) and external set 2 (47 compounds). Overall, models with integer type fingerprints showed better performance than models with no fingerprints, converted binary type fingerprints or original binary type fingerprints. Comparison of ML and DL algorithms revealed that integer type fingerprints are suitable for ML, whereas binary type fingerprints are suitable for DL. The outcomes of this study indicate that the rational selection of fingerprints is important for hERG blocker prediction.
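To illustrate the distinction between integer type and binary type fingerprints drawn above, the sketch below represents an integer (count) fingerprint as a mapping from feature identifier to occurrence count and converts it to a binary fingerprint by keeping only presence or absence. The feature identifiers and counts are hypothetical, and the two Tanimoto variants shown are standard similarity definitions, not necessarily those used in the study.

```python
def to_binary(counts):
    """Convert a count fingerprint {feature_id: count} to a set of present features."""
    return {feature for feature, count in counts.items() if count > 0}


def tanimoto_binary(a, b):
    """Tanimoto similarity of two binary fingerprints given as sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def tanimoto_counts(a, b):
    """Count-based Tanimoto: sum of per-feature minima over sum of maxima."""
    features = set(a) | set(b)
    shared = sum(min(a.get(f, 0), b.get(f, 0)) for f in features)
    total = sum(max(a.get(f, 0), b.get(f, 0)) for f in features)
    return shared / total if total else 1.0


# Two hypothetical molecules with the same features but different counts.
mol_a = {101: 1, 202: 3, 303: 1}
mol_b = {101: 1, 202: 1, 303: 1}

print("binary:", tanimoto_binary(to_binary(mol_a), to_binary(mol_b)))
print("counts:", tanimoto_counts(mol_a, mol_b))
```

The two molecules are indistinguishable after conversion to binary form but differ in the count representation, which is the kind of information lost when integer fingerprints are binarized.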