QSAR and Classification Study on Prediction of Acute Oral Toxicity of N-Nitroso Compounds.
ABSTRACT: To better understand the mechanism of in vivo toxicity of N-nitroso compounds (NNCs), rat acute oral toxicity data (50% lethal dose, LD50) for 80 NNCs were used to establish quantitative structure-activity relationship (QSAR) and classification models. Descriptors calculated by quantum chemistry methods were combined with Dragon descriptors to describe the molecular information of all compounds. Genetic algorithm (GA) and multiple linear regression (MLR) analyses were combined to develop QSAR models. Fingerprints and machine learning methods were used to establish classification models. The quality and predictive performance of all established models were evaluated by internal and external validation techniques. The best GA-MLR-based QSAR model, containing eight molecular descriptors, was obtained with Q²loo = 0.7533, R² = 0.8071, Q²ext = 0.7041 and R²ext = 0.7195. The QSAR results showed that the acute oral toxicity of NNCs depends mainly on three factors, namely the polarizability, the ionization potential (IP) and the presence/absence and frequency of the C–O bond. For classification studies, the best model was obtained using the MACCS keys fingerprint combined with an artificial neural network (ANN) algorithm. The classification models suggested that several representative substructures, including nitrile, hetero N nonbasic, alkylchloride and amine-containing fragments, are the main contributors to the high toxicity of NNCs. Overall, the developed QSAR and classification models of the rat acute oral toxicity of NNCs showed satisfactory predictive abilities. The results provide insight into the toxicity mechanism of NNCs in vivo and might be used for a preliminary assessment of NNC toxicity to mammals.
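The internal validation statistics quoted above (R² and Q²loo) can be illustrated with a minimal sketch. It assumes the common definitions R² = 1 - RSS/TSS and Q²loo = 1 - PRESS/TSS, where PRESS sums the squared leave-one-out prediction errors; the abstract does not give its exact formulas, and the helper names (`fit_mlr`, `r2`, `q2_loo`) and the toy data are hypothetical, not taken from the study.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_mlr(X, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


def predict(coef, x):
    """Predict one response from coefficients [intercept, b1, b2, ...]."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))


def r2(X, y):
    """Coefficient of determination of the model fitted on all samples."""
    coef = fit_mlr(X, y)
    mean = sum(y) / len(y)
    rss = sum((t - predict(coef, x)) ** 2 for x, t in zip(X, y))
    tss = sum((t - mean) ** 2 for t in y)
    return 1.0 - rss / tss


def q2_loo(X, y):
    """Leave-one-out Q2: refit without each sample, then predict that sample."""
    mean = sum(y) / len(y)
    press = 0.0
    for i in range(len(y)):
        coef = fit_mlr(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(coef, X[i])) ** 2
    tss = sum((t - mean) ** 2 for t in y)
    return 1.0 - press / tss


# Toy data: two hypothetical descriptors and a noisy response.
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]]
y = [1.1, 2.0, 2.9, 4.2, 5.1, 5.8]
print(r2(X, y), q2_loo(X, y))
```

For ordinary least squares, Q²loo cannot exceed R² on the same data, which is why a large gap between the two is usually read as a sign of overfitting.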
Project description:O⁶-methylguanine-DNA methyltransferase (MGMT), a unique DNA repair enzyme, can confer resistance to DNA anticancer alkylating agents that modify the O⁶-position of guanine. Thus, inhibition of MGMT activity in tumors is of great interest to cancer researchers because it can significantly improve the anticancer efficacy of such alkylating agents. In this study, we performed a quantitative structure-activity relationship (QSAR) and classification study based on a total of 134 base analogs and their ED50 values (50% inhibitory concentration) against MGMT. Molecular information for all compounds was described by quantum chemical descriptors and Dragon descriptors. Genetic algorithm (GA) and multiple linear regression (MLR) analyses were combined to develop QSAR models. Classification models were generated by seven machine-learning methods based on six types of molecular fingerprints. The performance of all developed models was assessed by internal and external validation techniques. The best QSAR model was obtained with Q²loo = 0.83, R² = 0.87, Q²ext = 0.67, and R²ext = 0.69 based on 84 compounds. The QSAR results indicated that topological charge indices, polarizability, ionization potential (IP), and the number of primary aromatic amines are the main contributors to MGMT inhibition by base analogs. For classification studies, the accuracies of 10-fold cross-validation ranged from 0.750 to 0.885 for the top ten models. Accuracy on the external test set ranged from 0.800 to 0.880 except for the PubChem-Tree model, suggesting a satisfactory predictive ability. Three models (Ext-SVM, Ext-Tree and Graph-RF) showed high and reliable predictive accuracy for both training and external test sets. In addition, several representative substructures for characterizing MGMT inhibitors were identified by information gain and substructure frequency analysis.
Our studies might be useful for further study to design and rapidly identify potential MGMT inhibitors.
Project description:AIMS AND OBJECTIVES:QSPR models relate different types of structural information to observed properties. The present study describes the relationship between molecular descriptors and quantum properties of cycloalkanes. MATERIALS AND METHODS:Genetic algorithm (GA) and multiple linear regression (MLR) models were developed to predict quantum properties of cycloalkanes. A large number of molecular descriptors were calculated with Dragon software, and a subset of the calculated descriptors was selected with a genetic algorithm as a feature selection technique. The quantum properties, namely the heat capacity (Cv, J mol⁻¹ K⁻¹), entropy (S, J mol⁻¹ K⁻¹) and thermal energy (Eth, kJ mol⁻¹), were obtained from quantum-chemistry calculations at the Hartree-Fock (HF) level using the ab initio 6-31G* basis set. RESULTS:The genetic algorithm (GA) was used to select important molecular descriptors, which were then used as inputs for the SPSS software package. The predictive power of the MLR models was assessed using leave-one-out (LOO) cross-validation, leave-group-out (LGO, 5-fold) cross-validation and an external prediction series. The statistical parameters of the training and test sets for the GA-MLR models were calculated. CONCLUSION:The resulting quantitative GA-MLR models of Cv, S, and Eth were obtained: [r² = 0.950, Q² = 0.989, r²ext = 0.969, MAE(overall, 5-fold) = 0.6825 J mol⁻¹ K⁻¹], [r² = 0.980, Q² = 0.947, r²ext = 0.943, MAE(overall, 5-fold) = 0.5891 J mol⁻¹ K⁻¹], and [r² = 0.980, Q² = 0.809, r²ext = 0.985, MAE(overall, 5-fold) = 2.0284 kJ mol⁻¹]. The results showed that the predictive ability of the models was satisfactory, and that constitutional descriptors, topological indices and ring descriptors could be used to predict the mentioned properties of 103 cycloalkanes.
Project description:The prediction of biological activity of a chemical compound from its structural features plays an important role in drug design. In this paper, we discuss the quantitative structure activity relationship (QSAR) prediction models developed on a dataset of 170 HIV protease enzyme inhibitors. Various chemical descriptors that encode hydrophobic, topological, geometrical and electronic properties are calculated to represent the structures of the molecules in the dataset. We use the hybrid-GA (genetic algorithm) optimization technique for descriptor space reduction. The linear multiple regression analysis (MLR), correlation-based feature selection (CFS), non-linear decision tree (DT), and artificial neural network (ANN) approaches are used as fitness functions. The selected descriptors represent the overall descriptor space and account well for the binding nature of the considered dataset. These selected features are also human interpretable and can be used to explain the interactions between a drug molecule and its receptor protein (HIV protease). The selected descriptors are then used for developing the QSAR prediction models by using the MLR, DT and ANN approaches. These models are discussed, analyzed and compared to validate and test their performance for this dataset. All three approaches yield the QSAR models with good prediction performance. The models developed by DT and ANN are comparable and have better prediction than the MLR model. For ANN model, weight analysis is carried out to analyze the role of various descriptors in activity prediction. All the prediction models point towards the involvement of hydrophobic interactions. These models can be useful for predicting the biological activity of new untested HIV protease inhibitors and virtual screening for identifying new lead compounds.
Project description:Limited information on the potential toxicity of ionic liquids (ILs) is a bottleneck for their large-scale application. In this work, two quantitative structure-activity relationship (QSAR) models were used to evaluate the toxicity of ILs toward the acetylcholinesterase enzyme using multiple linear regression (MLR) and extreme learning machine (ELM) algorithms. The structures of 57 cations and 21 anions were optimized using quantum chemistry calculations. The electrostatic potential surface area (S_EP) and the screening charge density distribution area (S_σ) descriptors were calculated and used for prediction of IL toxicity. The performance and predictive ability of the MLR and ELM models were compared. The ELM model showed the highest squared correlation coefficient (R²) and the lowest average absolute relative deviation (AARD%) and root-mean-square error (RMSE) for the training set, test set, and total set. These findings validated the superior performance of the ELM over the MLR toxicity prediction model.
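The two error measures named above can be sketched as follows. This is a minimal illustration assuming the usual definitions, AARD% = (100/N) Σ |(predicted - experimental)/experimental| and RMSE = sqrt(Σ (predicted - experimental)² / N); the abstract does not spell out its formulas, and the function names and numbers are hypothetical.

```python
import math


def aard_percent(experimental, predicted):
    """Average absolute relative deviation, in percent (experimental values must be non-zero)."""
    n = len(experimental)
    return 100.0 / n * sum(abs((p - e) / e) for e, p in zip(experimental, predicted))


def rmse(experimental, predicted):
    """Root-mean-square error between experimental and predicted values."""
    n = len(experimental)
    return math.sqrt(sum((p - e) ** 2 for e, p in zip(experimental, predicted)) / n)


# Hypothetical toxicity values, for illustration only.
exp_vals = [2.0, 4.0]
pred_vals = [1.0, 5.0]
print(aard_percent(exp_vals, pred_vals), rmse(exp_vals, pred_vals))
```

AARD% is scale-free but unstable when experimental values are close to zero, whereas RMSE carries the units of the response, so reporting both, as the abstract does, gives complementary views of model error.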
Project description:Quantitative structure-activity relationship (QSAR) models can inform on the correlation between activities and structure-based molecular descriptors. This information is important for understanding the factors that govern molecular properties and for designing new compounds with favorable properties. Because of the large number of calculable descriptors and, consequently, the much larger number of descriptor combinations, the derivation of QSAR models can be treated as an optimization problem. For continuous responses, the metrics typically optimized in this process relate to model performance on the training set, for example, R² and Q²CV. Similar metrics, calculated on an external set of data (e.g., Q²F1/F2/F3), are used to evaluate the performance of the final models. A common theme of these metrics is that they are context-"ignorant". In this work we propose that QSAR models should be evaluated based on their intended usage. More specifically, we argue that QSAR models developed for virtual screening (VS) should be derived and evaluated using a virtual screening-aware metric, e.g., an enrichment-based metric. To demonstrate this point, we developed 21 multiple linear regression (MLR) models for seven targets (three models per target), evaluated them first on validation sets and subsequently tested their performance on two additional test sets constructed to mimic small-scale virtual screening campaigns. As expected, we found no correlation between model performance evaluated by "classical" metrics, e.g., R² and Q²F1/F2/F3, and the number of active compounds picked by the models from within a pool of random compounds. In particular, in some cases models with favorable R² and/or Q²F1/F2/F3 values were unable to pick a single active compound from within the pool, whereas in other cases models with poor R² and/or Q²F1/F2/F3 values performed well in the context of virtual screening.
We also found no significant correlation between the number of active compounds correctly identified by the models in the training, validation and test sets. Next, we have developed a new algorithm for the derivation of MLR models by optimizing an enrichment-based metric and tested its performances on the same datasets. We found that the best models derived in this manner showed, in most cases, much more consistent results across the training, validation and test sets and outperformed the corresponding MLR models in most virtual screening tests. Finally, we demonstrated that when tested as binary classifiers, models derived for the same targets by the new algorithm outperformed Random Forest (RF) and Support Vector Machine (SVM)-based models across training/validation/test sets, in most cases. We attribute the better performances of the Enrichment Optimizer Algorithm (EOA) models in VS to better handling of inactive random compounds. Optimizing an enrichment-based metric is therefore a promising strategy for the derivation of QSAR models for classification and virtual screening.
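To make the idea of an enrichment-based metric concrete, here is a minimal sketch of the widely used enrichment factor (EF): the hit rate among the top-ranked fraction of a screened pool divided by the hit rate of the whole pool. This is an assumption for illustration; the abstract does not state which enrichment metric the Enrichment Optimizer Algorithm optimizes, and the function name and toy data are hypothetical.

```python
def enrichment_factor(scores, is_active, top_fraction=0.1):
    """Enrichment factor of the top-scored fraction of a compound pool.

    scores       -- predicted activities, higher means more likely active
    is_active    -- 1/0 (or True/False) experimental activity labels
    top_fraction -- fraction of the pool treated as the selected set
    """
    n_total = len(scores)
    n_top = max(1, round(n_total * top_fraction))
    total_actives = sum(1 for a in is_active if a)
    if total_actives == 0:
        raise ValueError("the pool contains no active compounds")
    # Rank compounds by predicted score, best first.
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(1 for _, a in ranked[:n_top] if a)
    return (hits_top / n_top) / (total_actives / n_total)


# Toy pool of ten compounds with two actives, both ranked at the top.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(enrichment_factor(scores, labels, top_fraction=0.2))
```

An EF of 1 corresponds to random selection. Unlike R², the value depends only on how the compounds are ranked, not on how accurately each activity is predicted, which is why a model with a poor R² can still perform well in screening.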
Project description:The data were obtained by exploring the modulatory activities of bioflavonoids on P-glycoprotein function using ligand-based approaches. Multivariate linear QSAR models for predicting the inducing/inhibitory activities of the flavonoids were created. Molecular descriptors were initially used as independent variables, and the dependent variable was expressed as pFAR. The variables were then used in MLR analysis with stepwise regression to build the linear QSAR model. The entire dataset, consisting of 23 bioflavonoids, was used as a training set. For the obtained MLR QSAR model, R = 0.963, R² = 0.927, [Formula: see text], SEE = 0.197, F = 33.849 and q² = 0.927 were achieved. The true predictive ability of the QSAR model was assessed with the external dataset (Table 4). The pFARs of representative flavonoids were predicted by MLR QSAR modelling. The data showed that internal and external validations may lead to the same conclusion.
Project description:MMP-2 is a matrix metalloproteinase that digests denatured collagens and gelatins. It is highly involved in the process of tumor invasion and has been considered a promising target for cancer therapy. The structural requirements of an MMP-2 inhibitor are: (1) a functional group that binds the zinc ion, and (2) a functional group that interacts with the enzyme backbone, together with side chains that undergo effective interactions with the enzyme subsites. In the present study, a QSAR model was generated to screen new inhibitors of MMP-2 based on the L-hydroxy tyrosine scaffold. Descriptors were generated with the Hyperchem 8, DRAGON and Gaussian98W programs. The SPSS and MATLAB programs were used for multiple linear regression (MLR) and genetic algorithm partial least squares (GA-PLS) analyses and for theoretical validation. The applicability domain of the model was determined to screen new compounds. The binding site potential of all inhibitors was verified by structure-based docking according to their binding energy, and the best inhibitors were then selected. The best MLR and GA-PLS QSAR models were reported, with squared correlation coefficients for leave-one-out cross-validation (Q²LOO) larger than 0.921 and 0.900, respectively. The MLR and GA-PLS models indicated the importance of molecular size, degree of branching, flexibility, shape, and the three-dimensional coordination of different atoms in a molecule for inhibitory activity against MMP-2. The docking study indicated that lipophilic and hydrogen bonding interactions between the inhibitors and the receptor are involved in the ligand-receptor interaction. The oxygen atoms of the carbonyl and sulfonyl groups are important for hydrogen bonds between the ligand and Leu82 and Ala83. The R2 and R3 substituents play a main role in hydrogen bonding interactions. R1 is located in the hydrophobic pocket.
A methylene group can help a ligand fit into the lipophilic pocket, so two methylene groups are better than one. The phenyl group can form a π-π interaction with Phe86. The QSAR and docking analyses proved to be helpful tools for predicting anti-cancer activities and a guide to the synthesis of new metalloproteinase inhibitors based on the L-tyrosine scaffold.
Project description:Overexpression of protein kinase (CK2) suppresses apoptosis induced by a variety of agents, whereas down-regulation of CK2 sensitizes cells to induction of apoptosis. In this study, we built quantitative structure-activity relationship (QSAR) models, which were trained and tested on 38 experimentally verified inhibitors of the enzyme with inhibitory values (IC50) in µM. These inhibitors were docked at the active site of CK2 (PDB id: 2ZJW) using AutoDock software, which yielded energy-based descriptors such as binding energy, intermolecular energy, torsional energy, internal energy and docking energy. For QSAR modeling, a multiple linear regression (MLR) model was built using the energy-based descriptors, yielding a correlation coefficient r² of 0.4645. To assess the predictive performance of the QSAR models, different cross-validation procedures were adopted. Our results suggest that QSAR modeling of ligand-receptor binding interactions for CK2 is a promising approach for predicting the IC50 value of a new ligand molecule against CK2. Further, twenty analogues of ellagic acid were docked with the CK2 structure. After docking, two compounds, CID 46229200 and CID 10003463, which had docking energies with CK2 even lower than that of the standard control ellagic acid, were selected as potent candidate drugs for oral cancer. The biological activity of the two compounds in terms of IC50 was predicted based on the QSAR model, which could be used as a guideline for the anticancer activity of compounds before their synthesis.
Project description:A series of 436 Munro database chemicals were studied with respect to their experimental LD50 values to investigate the possibility of establishing a global QSAR model for acute toxicity. Dragon molecular descriptors were used for QSAR model development, and genetic algorithms were used to select the descriptors best correlated with the toxicity data. Toxicity values were discretized into qualitative classes on the basis of the Globally Harmonized Scheme: the 436 chemicals were divided into 3 classes based on their experimental LD50 values, namely highly toxic, intermediate toxic and low to non-toxic. The k-nearest neighbor (k-NN) classification method was calibrated on 25 molecular descriptors and gave a non-error rate (NER) of 0.66 and 0.57 for the internal and external prediction sets, respectively. Even if the classification performance is not optimal, the subsequent analysis of the selected descriptors and their relationship with toxicity levels constitutes a step towards the development of a global QSAR model for acute toxicity.
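The non-error rate quoted above can be sketched as follows, assuming the definition common in chemometrics, where NER is the unweighted mean of the per-class sensitivities (the abstract itself does not define it). The function name and the toy labels are hypothetical.

```python
def non_error_rate(y_true, y_pred):
    """Mean of per-class sensitivities (fraction of each true class predicted correctly)."""
    sensitivities = []
    for cls in sorted(set(y_true)):
        members = [i for i, t in enumerate(y_true) if t == cls]
        correct = sum(1 for i in members if y_pred[i] == cls)
        sensitivities.append(correct / len(members))
    return sum(sensitivities) / len(sensitivities)


# Toy three-class example with unbalanced classes.
y_true = ["high"] * 2 + ["intermediate"] * 4 + ["low"] * 4
y_pred = ["high", "low",
          "intermediate", "intermediate", "intermediate", "low",
          "low", "low", "low", "low"]
print(non_error_rate(y_true, y_pred))
```

Under this definition every class carries equal weight regardless of its size, so for a three-class problem a classifier guessing at random is expected to score about 0.33, which gives a baseline for reading the reported 0.66 and 0.57.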
Project description:Experimental methods for the prediction of molecular toxicity are tedious and time-consuming. Computational approaches can therefore be used to develop alternative methods for toxicity prediction. We have developed a tool for the prediction of molecular toxicity along with the aqueous solubility and permeability of any molecule/metabolite. Using a comprehensive and curated set of toxin molecules as a training set, different chemical and structural features such as descriptors and fingerprints were exploited for feature selection, optimization and development of machine learning based classification and regression models. The compositional differences in the distribution of atoms were apparent between toxins and non-toxins, and hence the molecular features were used for classification and regression. On 10-fold cross-validation, the descriptor-based, fingerprint-based and hybrid-based classification models showed similar accuracy (93%) and Matthews correlation coefficient (0.84). The performance of all three models was comparable (Matthews correlation coefficient = 0.84-0.87) on the blind dataset. In addition, the regression models using descriptors as input features were also compared and evaluated on the blind dataset. The random forest based regression model for the prediction of solubility performed better (R² = 0.84) than the multi-linear regression (MLR) and partial least squares regression (PLSR) models, whereas the partial least squares based regression model for the prediction of permeability (Caco-2) performed better (R² = 0.68) than the random forest and MLR based regression models. The performance of the final classification and regression models was evaluated using two validation datasets, including known toxins and commonly used constituents of health products, which attests to their accuracy.
The ToxiM web server would be a highly useful and reliable tool for the prediction of toxicity, solubility, and permeability of small molecules.
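For readers unfamiliar with the Matthews correlation coefficient reported above, the following minimal sketch computes it from the four cells of a binary confusion matrix using the standard formula MCC = (TP·TN - FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The counts in the example are hypothetical and are not taken from the ToxiM study.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention: return 0 when any row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for a toxin / non-toxin classifier.
print(matthews_corrcoef(tp=6, tn=3, fp=1, fn=2))
```

The coefficient ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction) and, unlike plain accuracy, stays informative when the two classes are unbalanced.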