Project description:Nanostructures such as fullerene derivatives (FDs) belong to a new family of nano-sized organic compounds. Fullerenes have found widespread application in materials science and in the pharmaceutical, biomedical, and medical fields, which makes it important to study the pharmacological and toxicological properties of this relatively new family of chemicals. In this work, a large set of 169 FDs and their binding activity to 1117 proteins was investigated. Structure-based descriptors widely used in drug design (so-called drug-like descriptors) were applied to understand the cheminformatics characteristics related to the binding activity of fullerene nanostructures. Investigation of the applied descriptors demonstrated that polarizability, topological diameter, and rotatable bonds play the most significant role in the binding activity of FDs. Various cheminformatics methods were applied, including the counter propagation artificial neural network (CPANN) and the Kohonen network as a visualization tool. The results of this study can be used to compose a priority list for testing in risk assessment related to the toxicological properties of FDs. Pharmacologists can filter the data from the heat map to view all possible side effects for selected FDs.
Project description:Cancer cells have upregulated DNA repair mechanisms, enabling them to survive DNA damage induced during repeated rapid cell divisions and targeted chemotherapeutic treatments. Targeting cancer cell proliferation and survival via inhibition of DNA repair pathways is currently a very promising anti-tumor approach. The deubiquitinating enzyme USP1 is known to promote DNA repair by forming a complex with UAF1. The USP1/UAF1 complex is responsible for regulating DNA break repair pathways such as the trans-lesion synthesis pathway, the Fanconi anemia pathway, and homologous recombination. Thus, USP1/UAF1 inhibition is a potentially efficient anti-cancer strategy. The recent availability of high-throughput screening data for anti-USP1/UAF1 activity prompted us to build bioactivity prediction models that could help in screening for potential USP1/UAF1 inhibitors with anti-cancer properties. The current study uses a publicly available high-throughput screening data set of chemical compounds evaluated for their potential USP1/UAF1 inhibitory effect. A machine learning approach was devised to generate computational models that could predict the potential anti-USP1/UAF1 biological activity of novel anticancer compounds. The active compounds were additionally screened with a SMARTS filter to eliminate molecules with non-drug-like features. Structural fragment analysis was then performed to explore the structural properties of the molecules. We demonstrated that modern machine learning approaches can be efficiently employed to build predictive computational models and that their predictive performance is statistically accurate. The structural fragment analysis revealed structures that could play an important role in the identification of USP1/UAF1 inhibitors.
Project description:Agranulocytosis, induced by non-chemotherapy drugs, is a serious medical condition that presents a formidable challenge in predictive toxicology due to its idiosyncratic nature and complex mechanisms. In this study, we assembled a dataset of 759 compounds and applied a rigorous feature selection process prior to employing ensemble machine learning classifiers to forecast non-chemotherapy drug-induced agranulocytosis (NCDIA) toxicity. The balanced bagging classifier combined with a gradient boosting decision tree (BBC + GBDT), utilizing the combined descriptor set of DS and RDKit comprising 237 features, emerged as the top-performing model, with an external validation AUC of 0.9164, ACC of 83.55%, and MCC of 0.6095. The model's predictive reliability was further substantiated by an applicability domain analysis. Feature importance, assessed through permutation importance within the BBC + GBDT model, highlighted key molecular properties that significantly influence NCDIA toxicity. Additionally, 16 structural alerts identified by SARpy software further revealed potential molecular signatures associated with toxicity, enriching our understanding of the underlying mechanisms. We also applied the constructed models to assess the NCDIA toxicity of novel drugs approved by FDA. This study advances predictive toxicology by providing a framework to assess and mitigate agranulocytosis risks, ensuring the safety of pharmaceutical development and facilitating post-market surveillance of new drugs.
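The class-balanced resampling step behind a balanced bagging classifier can be illustrated with a minimal sketch. It assumes each bag is drawn by undersampling every class to the size of the rarest class. The abstract does not state which implementation or sampling settings were used, so the function name `balanced_bootstrap` and the toy labels are hypothetical.

```python
import random
from collections import Counter
from typing import Hashable, List, Sequence


def balanced_bootstrap(labels: Sequence[Hashable], rng: random.Random) -> List[int]:
    """Return sample indices for one class-balanced bag.

    Every class is undersampled (without replacement) to the size of the
    rarest class, so each base learner sees a balanced training set.
    """
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    n_min = min(len(indices) for indices in by_class.values())
    bag = []
    for indices in by_class.values():
        bag.extend(rng.sample(indices, n_min))
    rng.shuffle(bag)
    return bag


# Toy imbalanced label vector (hypothetical): 3 toxic vs. 9 non-toxic compounds.
toy_labels = [1] * 3 + [0] * 9
toy_bag = balanced_bootstrap(toy_labels, random.Random(0))
toy_counts = Counter(toy_labels[i] for i in toy_bag)
print(toy_counts)
```

In a full ensemble, one base model (for example a gradient boosting decision tree) would be trained on each such bag and the predictions averaged.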
Project description:Modern nanotechnology provides efficient and cost-effective nanomaterials (NMs). The increasing usage of NMs raises great concern regarding nanotoxicity in humans. Traditional animal testing of nanotoxicity is expensive and time-consuming. Modeling studies using machine learning (ML) approaches are promising alternatives to direct evaluation of nanotoxicity based on nanostructure features. However, NMs, including two-dimensional nanomaterials (2DNMs) such as graphenes, have complex structures, making it difficult to annotate and quantify their nanostructures for modeling purposes. To address this issue, we constructed a virtual graphene library using nanostructure annotation techniques. The irregular graphene structures were generated by modifying virtual nanosheets. The nanostructures were digitalized from the annotated graphenes. Based on the annotated nanostructures, geometrical nanodescriptors were computed using a Delaunay tessellation approach for ML modeling. Partial least squares regression (PLSR) models for the graphenes were built and validated using a leave-one-out cross-validation (LOOCV) procedure. The resulting models showed good predictivity for four toxicity-related endpoints, with the coefficient of determination (R2) ranging from 0.558 to 0.822. This study provides a novel nanostructure annotation strategy that can be applied to generate high-quality nanodescriptors for ML model development and that can be widely applied to nanoinformatics studies of graphenes and other NMs.
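The leave-one-out validation procedure can be sketched as follows. To stay dependency-free, this sketch swaps PLSR for a one-descriptor ordinary least squares fit. That model is a stand-in, not the one used in the study, and the exact definition of cross-validated R2 used by the authors is an assumption. All names and toy values are hypothetical.

```python
from typing import List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, mean_y  # constant descriptor: predict the mean
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def loocv_r2(xs: List[float], ys: List[float]) -> float:
    """Cross-validated R2: each sample is predicted by a model fit without it."""
    predictions = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        predictions.append(slope * xs[i] + intercept)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Toy descriptor/endpoint values (hypothetical, not from the study).
toy_descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_endpoint = [1.0, 2.9, 5.2, 6.8, 9.1]
print(loocv_r2(toy_descriptor, toy_endpoint))
```

Because every prediction comes from a model that never saw the held-out sample, this R2 is typically lower than the R2 of a fit on all data.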
Project description:Aberrant methylation patterns in human DNA have great potential for the discovery of novel diagnostic and disease progression biomarkers. In this paper we used machine learning algorithms to identify promising methylation sites for diagnosing cancerous tissue and to classify patients based on methylation values at these sites. We used genome-wide DNA methylation patterns from both cancerous and normal tissue samples, obtained from the Genomic Data Commons consortium and trialled our methods on three types of urological cancer. A decision tree was used to identify the methylation sites most useful for diagnosis. The identified locations were then used to train a neural network to classify samples as either cancerous or non-cancerous. Using this two-step approach we found strong indicative biomarker panels for each of the three cancer types. These methods could likely be translated to other cancers and improved by using non-invasive liquid methods such as blood instead of biopsy tissue.
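The first step of the two-step approach, selecting informative methylation sites with a tree-style criterion, can be sketched with single-threshold splits. This sketch assumes Gini impurity decrease as the split criterion, which the abstract does not specify, and it ranks sites by their best root split rather than growing a full tree. The site identifiers and beta values are hypothetical.

```python
from typing import Dict, List, Sequence


def gini(labels: Sequence[int]) -> float:
    """Gini impurity for binary labels (1 = cancerous, 0 = normal)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_split_gain(values: Sequence[float], labels: Sequence[int]) -> float:
    """Largest impurity decrease achievable with one threshold on this site."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    parent = gini(labels)
    best = 0.0
    for k in range(1, n):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold can separate identical values
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        gain = parent - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
        best = max(best, gain)
    return best


def rank_sites(beta: Dict[str, List[float]], labels: Sequence[int]) -> List[str]:
    """Order sites from most to least useful for separating the two classes."""
    gains = {site: best_split_gain(vals, labels) for site, vals in beta.items()}
    return sorted(gains, key=gains.get, reverse=True)


# Toy methylation values for four samples at two hypothetical sites.
toy_labels = [0, 0, 1, 1]
toy_beta = {"site_a": [0.1, 0.2, 0.8, 0.9], "site_b": [0.5, 0.9, 0.4, 0.8]}
print(rank_sites(toy_beta, toy_labels))
```

The top-ranked sites would then serve as the input features for the second-step classifier.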
Project description:Aquatic toxicity is a crucial endpoint for evaluating the adverse effects of chemicals on ecosystems. Therefore, we developed in silico methods for the prediction of chemical aquatic toxicity in the marine environment. First, a diverse data set including different crustacean species was constructed. We then built local binary models using Mysidae data and global binary models using Mysidae, Palaemonidae, and Penaeidae data. Molecular fingerprints and descriptors were employed separately to represent chemical structures. All the models were built with six machine learning methods. The AUC (area under the receiver operating characteristic curve) values of the better local and global models were around 0.8 and 0.9 for the test sets, respectively. We also identified several chemicals with selective toxicity toward different species. The analysis of selective toxicity could help in designing greener chemicals for a specific environment. Finally, to understand and interpret the models, we explored the relationships between chemical aquatic toxicity and the molecular descriptors. Our study should be helpful in gaining further insight into marine organisms, in predicting chemical aquatic toxicity, and in prioritizing environmental hazard assessment.
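The AUC metric reported for the binary models can be computed directly from its probabilistic definition: the fraction of (toxic, non-toxic) pairs in which the toxic compound receives the higher score, with ties counted as one half. The following is a minimal sketch. The function name `roc_auc` and the toy scores are hypothetical.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy test set (hypothetical): 1 = toxic, 0 = non-toxic.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(toy_labels, toy_scores))
```

The pairwise form is quadratic in the number of samples. It is convenient for small test sets, while rank-based formulas scale better.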
Project description:Background: As a serious public health issue, vaccination refusal has been attracting more and more attention, especially for newly approved human papillomavirus (HPV) vaccines. Understanding public opinion towards HPV vaccines, especially concerns expressed on social media, is of significant importance for HPV vaccination promotion. Methods: In this study, we leveraged a hierarchical machine learning based sentiment analysis system to extract public opinions towards HPV vaccines from Twitter. English tweets containing HPV vaccine-related keywords were collected from November 2, 2015 to March 28, 2016. Manual annotation was done to evaluate the performance of the system on the unannotated tweets corpus. Time series analysis was then applied to this corpus to track the trends of machine-deduced sentiments and their associations with different days of the week. Results: The evaluation on the unannotated tweets corpus showed that the micro-averaged F score reached 0.786. The learning system deduced the sentiment labels for 184,214 tweets in the collected unannotated tweets corpus. Time series analysis identified a coincidence between mainstream outcome and Twitter contents. A weak trend was found for "Negative" tweets, which first decreased and later began to increase; the opposite trend was identified for "Positive" tweets. Tweets that contain worries about the efficacy of HPV vaccines showed a relatively significant decreasing trend. Strong associations were found between some sentiments ("Positive", "Negative", "Negative-Safety" and "Negative-Others") and different days of the week. Conclusions: Our efforts on sentiment analysis for newly approved HPV vaccines provide an automatic and instant way to extract public opinion and understand the concerns on Twitter.
Our approaches can provide feedback to public health professionals so that they can monitor online public response, examine the effectiveness of their HPV vaccination promotion strategies, and adjust their promotion plans.
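The micro-averaged F score used to evaluate the sentiment classifier pools true positives, false positives, and false negatives over all classes before computing precision and recall. The sketch below assumes a flat single-label setting. The hierarchical system in the study may aggregate differently, and the toy labels are hypothetical.

```python
from collections import Counter
from typing import Sequence


def micro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Micro-averaged F1: counts are summed over classes before averaging."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    total_tp = sum(tp.values())
    if total_tp == 0:
        return 0.0
    precision = total_tp / (total_tp + sum(fp.values()))
    recall = total_tp / (total_tp + sum(fn.values()))
    return 2.0 * precision * recall / (precision + recall)


# Toy annotations (hypothetical) using sentiment labels named in the abstract.
toy_true = ["Positive", "Negative", "Negative-Safety", "Negative", "Positive"]
toy_pred = ["Positive", "Negative", "Negative", "Negative", "Negative-Others"]
print(micro_f1(toy_true, toy_pred))
```

In this single-label setting every error is both one false positive and one false negative, so the micro-averaged F score equals plain accuracy.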
Project description:Cytochrome P450 17A1 (CYP17A1) is one of the key enzymes in steroidogenesis that produces dehydroepiandrosterone (DHEA) from cholesterol. Abnormal DHEA production may lead to the progression of severe diseases, such as prostatic and breast cancers. Thus, CYP17A1 is a druggable target for anti-cancer molecule development. In this study, cheminformatic analyses and quantitative structure-activity relationship (QSAR) modeling were applied to a set of 962 CYP17A1 inhibitors (279 steroidal and 683 nonsteroidal inhibitors) compiled from the ChEMBL database. For steroidal inhibitors, a QSAR classification model built using the PubChem fingerprint along with the extra trees algorithm achieved the best performance, reflected by accuracy values of 0.933, 0.818, and 0.833 for the training, cross-validation, and test sets, respectively. For nonsteroidal inhibitors, a systematic cheminformatic analysis was applied to explore the chemical space, Murcko scaffolds, and structure-activity relationships (SARs) and to visualize distributions, patterns, and representative scaffolds for drug discovery. Furthermore, a total of seven QSAR classification models were established based on the nonsteroidal scaffolds, and two activity cliff (AC) generators were identified. The best performing of these seven was model VIII, which is built upon the PubChem fingerprint along with the random forest algorithm. It achieved robust accuracy on the training, cross-validation, and test sets (0.96, 0.92, and 0.913, respectively). It is anticipated that the results presented herein will be instrumental for further CYP17A1 inhibitor drug discovery efforts.
Project description:Cancer metastasis accounts for approximately 90% of cancer deaths, and elucidating markers in metastasis is the first step in its prevention. To characterize metastasis marker genes (MGs) of breast cancer, XGBoost models that classify metastasis status were trained with gene expression profiles from TCGA. Then, a metastasis score (MS) was assigned to each gene by calculating the inner product between the feature importance and the AUC performance of the models. As a result, 54, 202, and 357 genes with the highest MS were characterized as MGs by empirical p-value cutoffs of 0.001, 0.005, and 0.01, respectively. The three sets of MGs were compared with those from existing metastasis marker databases, which provided significant results in most comparisons (p-value < 0.05). They were also significantly enriched in biological processes associated with breast cancer metastasis. The three MGs, SPPL2C, KRT23, and RGS7, showed highly significant results (p-value < 0.01) in the survival analysis. The MGs that could not be identified by statistical analysis (e.g., GOLM1, ELAVL1, UBP1, and AZGP1), as well as the MGs with the highest MS (e.g., ZNF676, FAM163B, LDOC2, IRF1, and STK40), were verified via the literature. Additionally, we checked how close the MGs were to each other in the protein-protein interaction networks. We expect that the characterized markers will help understand and prevent breast cancer metastasis.
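The metastasis score described above, an inner product between per-model feature importances and model AUCs, can be sketched as follows. The data layout (one importance dictionary per trained model) and the placeholder gene names and numbers are assumptions for illustration only, not values from the study.

```python
from typing import Dict, List, Sequence


def metastasis_scores(
    importances: Sequence[Dict[str, float]], aucs: Sequence[float]
) -> Dict[str, float]:
    """Per-gene score: sum over models of (feature importance * model AUC).

    importances[m][gene] is the importance of `gene` in model m; genes a
    model did not use contribute zero for that model.
    """
    if len(importances) != len(aucs):
        raise ValueError("need exactly one AUC per model")
    scores: Dict[str, float] = {}
    for model_importance, auc in zip(importances, aucs):
        for gene, value in model_importance.items():
            scores[gene] = scores.get(gene, 0.0) + value * auc
    return scores


def top_genes(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k genes with the highest score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Two toy models over placeholder genes (hypothetical values).
toy_importances = [
    {"GENE_A": 0.5, "GENE_B": 0.1},
    {"GENE_A": 0.2, "GENE_B": 0.4, "GENE_C": 0.3},
]
toy_aucs = [0.8, 0.6]
toy_scores = metastasis_scores(toy_importances, toy_aucs)
print(top_genes(toy_scores, 2))
```

Weighting by AUC means that genes deemed important by better-performing models contribute more to the final ranking.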
Project description:Experimental methods for determining molecular toxicity are tedious and time-consuming. Computational approaches can therefore be used to develop alternative methods for toxicity prediction. We have developed a tool for the prediction of molecular toxicity along with the aqueous solubility and permeability of any molecule/metabolite. Using a comprehensive and curated set of toxin molecules as a training set, different chemical and structural features such as descriptors and fingerprints were exploited for feature selection, optimization, and development of machine learning based classification and regression models. The compositional differences in the distribution of atoms were apparent between toxins and non-toxins, and hence the molecular features were used for the classification and regression. On 10-fold cross-validation, the descriptor-based, fingerprint-based, and hybrid-based classification models showed similar accuracy (93%) and Matthews correlation coefficient (0.84). The performances of all three models were comparable (Matthews correlation coefficient = 0.84-0.87) on the blind dataset. In addition, the regression-based models using descriptors as input features were also compared and evaluated on the blind dataset. The random forest based regression model for the prediction of solubility performed better (R2 = 0.84) than the multi-linear regression (MLR) and partial least squares regression (PLSR) models, whereas the partial least squares based regression model for the prediction of permeability (Caco-2) performed better (R2 = 0.68) than the random forest and MLR based regression models. The performance of the final classification and regression models was evaluated using two validation datasets including known toxins and commonly used constituents of health products, which attests to their accuracy.
The ToxiM web server would be a highly useful and reliable tool for the prediction of toxicity, solubility, and permeability of small molecules.
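The Matthews correlation coefficient reported for the classification models is computed from the four cells of a binary confusion matrix. A minimal sketch follows. The confusion-matrix counts are toy values chosen for illustration and are not taken from the study.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC = (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn)).

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Toy confusion matrix (hypothetical): toxins are the positive class.
print(matthews_corrcoef(tp=45, tn=48, fp=2, fn=5))
```

Unlike accuracy, the coefficient ranges from -1 to 1 and stays near 0 for a classifier that ignores the input, which makes it informative when toxins and non-toxins are unevenly represented.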