Project description:A novel automated high-throughput screening approach, ClusterFinder, is reported for finding candidate structures for atomic pair distribution function (PDF) structural refinements. Finding starting models for PDF refinements is notoriously difficult when the PDF originates from nanoclusters or small nanoparticles. The reported ClusterFinder algorithm can screen 10⁴ to 10⁵ candidate structures from structural databases such as the Inorganic Crystal Structure Database (ICSD) in minutes, using the crystal structures as templates in which it looks for atomic clusters that result in a PDF similar to the target measured PDF. The algorithm returns a rank-ordered list of clusters for further assessment by the user. The algorithm has performed well for simulated and measured PDFs of metal-oxido clusters such as Keggin clusters. This is therefore a powerful approach to finding structural cluster candidates in a modelling campaign for PDFs of nanoparticles and nanoclusters.
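The ranking step described above can be illustrated with a minimal sketch. It assumes that similarity between a candidate PDF and the target PDF is scored with the Pearson correlation coefficient on a shared r-grid; the description does not state which metric ClusterFinder uses, and the function names and toy data below are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equally sampled PDFs (assumed metric)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat signal carries no shape information
    return cov / math.sqrt(var_a * var_b)


def rank_candidates(target_pdf, candidate_pdfs):
    """Return (name, score) pairs sorted from most to least similar."""
    scores = [(name, pearson(target_pdf, pdf))
              for name, pdf in candidate_pdfs.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy data: hypothetical PDFs sampled on the same r-grid.
target = [0.0, 1.0, 0.2, -0.5, 0.8]
candidates = {
    "cluster_a": [0.0, 2.0, 0.4, -1.0, 1.6],   # same shape, different scale
    "cluster_b": [0.0, -1.0, -0.2, 0.5, -0.8],  # inverted shape
    "cluster_c": [0.1, 0.9, 0.5, -0.2, 0.3],    # partially similar
}
ranked = rank_candidates(target, candidates)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Because the Pearson coefficient is insensitive to overall scale, a candidate with the right peak positions but the wrong intensity scale still ranks highly, which suits a coarse screening step that precedes a full refinement.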
Project description:Structural modelling of octahedral tilts in perovskites is typically carried out using the symmetry constraints of the resulting space group. In most cases, this introduces more degrees of freedom than those strictly necessary to describe only the octahedral tilts. It can therefore be a challenge to disentangle the octahedral tilts from other structural distortions such as cation displacements and octahedral distortions. This paper reports the development of constraints for modelling pure octahedral tilts and implementation of the constraints in diffpy-CMI, a powerful package to analyse pair distribution function (PDF) data. The model in the program allows features in the PDF that come from rigid tilts to be separated from non-rigid relaxations, providing an intuitive picture of the tilting. The model has many fewer refinable variables than the unconstrained space group fits and provides robust and stable refinements of the tilt components. It further demonstrates the use of the model on the canonical tilted perovskite CaTiO3 which has the known Glazer tilt system α⁺β⁻β⁻. The Glazer model fits comparably to the corresponding space-group model Pnma below r = 14 Å and becomes progressively worse than the space-group model at higher r due to non-rigid distortions in the real material.
Project description:The structures of metal ions in solution constitute essential information for obtaining chemical insight spanning from catalytic reaction mechanisms to formation of functional nanomaterials. Here, we explore Zr4+ solution structures using X-ray pair distribution function (PDF) analysis across pH (0-14), concentrations (0.1-1.5 M), solvents (water, methanol, ethanol, acetonitrile) and metal sources (ZrCl4, ZrOCl2·8H2O, ZrO(NO3)2·xH2O). In water, [Zr4(OH)8(OH2)16]8+-tetramers are predominant, while non-aqueous solvents contain monomeric complexes. The PDF analysis also reveals second sphere coordination of chloride counter ions to the aqueous tetramers. The results are reproducible across data measured at three different beamlines at the PETRA-III and MAX IV synchrotron light sources.
Project description:Explainable artificial intelligence aims to interpret how machine learning models make decisions, and many model explainers have been developed in the computer vision field. However, understanding of the applicability of these model explainers to biological data is still lacking. In this study, we comprehensively evaluated multiple explainers by interpreting pre-trained models for predicting tissue types from transcriptomic data and by identifying the top contributing genes from each sample with the greatest impacts on model prediction. To improve the reproducibility and interpretability of results generated by model explainers, we proposed a series of optimization strategies for each explainer on two different model architectures of multilayer perceptron (MLP) and convolutional neural network (CNN). We observed three groups of explainer and model architecture combinations with high reproducibility. Group II, which contains three model explainers on aggregated MLP models, identified top contributing genes in different tissues that exhibited tissue-specific manifestation and were potential cancer biomarkers. In summary, our work provides novel insights and guidance for exploring biological mechanisms using explainable machine learning models.
Project description:Identifying patient prognostic phenotypes facilitates precision medicine. This study aimed to explore phenotypes of patients with heart failure (HF) corresponding to prognostic condition (risk of mortality) and to identify the phenotype of new patients by machine learning (ML). An unsupervised ML method was applied to explore phenotypes of patients in a derivation dataset (n = 562) based on their medical records. Thereafter, supervised ML models were trained on the derivation dataset to classify these identified phenotypes. Then, the trained classifiers were further validated on an independent validation dataset (n = 168). Finally, Shapley additive explanations were used to interpret the decision making of phenotype classification. Three patient phenotypes corresponding to stratified mortality risk (high, low, and intermediate) were identified. Kaplan-Meier survival curves differed significantly among the three phenotypes (pairwise comparison p < 0.05). The hazard ratio of all-cause mortality between patients in phenotype 1 (n = 91; high risk) and phenotype 3 (n = 329; intermediate risk) was 2.08 (95%CI 1.29-3.37, p = 0.003), and 0.26 (95%CI 0.11-0.61, p = 0.002) between phenotype 2 (n = 142; low risk) and phenotype 3. For phenotype classification by random forest, AUCs of phenotypes 1, 2, and 3 were 0.736 ± 0.038, 0.815 ± 0.035, and 0.721 ± 0.03, respectively, slightly better than the decision tree. Then, the classifier effectively identified the phenotypes for new patients in the validation dataset, with significant differences in survival curves and hazard ratios. Finally, age and creatinine clearance rate were identified as the top two most important predictors. ML could effectively identify patient prognostic phenotypes, facilitating reasonable management and treatment considering prognostic condition.
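For readers unfamiliar with the survival curves mentioned above, the following is a minimal sketch of the standard Kaplan-Meier product-limit estimator. It is not the study's code; the function name and the toy follow-up data are hypothetical, and censoring is encoded as an event flag of 0.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time for each patient
    events:    1 if death was observed at that time, 0 if censored
    """
    event_times = sorted({t for t, e in zip(durations, events) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Patients still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Deaths observed exactly at time t.
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Toy cohort (hypothetical): five patients, two of them censored.
toy_durations = [1, 2, 2, 3, 4]
toy_events = [1, 1, 0, 1, 0]
km_curve = kaplan_meier(toy_durations, toy_events)
for time, prob in km_curve:
    print(f"t = {time}: S(t) = {prob:.2f}")
```

Computing one such curve per phenotype and comparing them pairwise (for example with a log-rank test) is the usual way to check that phenotypes separate by mortality risk.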
Project description:We focused on building models that incorporated transcription factor (TF)-DNA interaction data for 12 members of the Auxin Response Factor (ARF) family from soybean as assessed by DNA Affinity Purification and sequencing (DAP-seq).
Project description:Having accurate maps depicting the locations of residential buildings across a region benefits a range of sectors. This is particularly true for public health programs focused on delivering services at the household level, such as indoor residual spraying with insecticide to help prevent malaria. While open-source data from OpenStreetMap (OSM) depicting the locations and shapes of buildings is rapidly improving in terms of quality and completeness globally, even in settings where all buildings have been mapped, information on whether these buildings are residential, commercial or another type is often only available for a small subset. Using OSM building data from Botswana and Swaziland, we identified buildings for which 'type' was indicated, generated via on-the-ground observations, and classified these into two classes, "sprayable" and "not-sprayable". Ensemble machine learning, using building characteristics such as size, shape and proximity to neighbouring features, was then used to form a model to predict which of these two classes every building in these two countries fell into. Results show that an ensemble machine learning approach performed marginally, but statistically significantly, better than the best individual model and that using this ensemble model we were able to correctly classify >86% (using independent test data) of structures as sprayable or not-sprayable across both countries.
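The idea of combining several models into an ensemble can be sketched as follows. This assumes simple soft voting, that is, averaging the "sprayable" probabilities produced by individual models; the description does not specify the ensembling method actually used, and the function name, threshold and probabilities below are hypothetical.

```python
def soft_vote(probabilities, threshold=0.5):
    """Average per-model 'sprayable' probabilities and assign a class label.

    probabilities: one predicted probability per base model (hypothetical)
    threshold:     assumed decision cut-off on the averaged probability
    """
    if not probabilities:
        raise ValueError("at least one model probability is required")
    mean_probability = sum(probabilities) / len(probabilities)
    label = "sprayable" if mean_probability >= threshold else "not-sprayable"
    return label, mean_probability


# Three hypothetical base models scoring the same building.
label, score = soft_vote([0.9, 0.6, 0.4])
print(label, round(score, 3))
```

Averaging lets a confident model outweigh a marginal dissenting one, which is one reason an ensemble can edge out its best individual member.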
Project description:Lysosomotropism is a phenomenon of broad pharmaceutical interest because it is a property of compounds with diverse chemical structures and primary targets. While it is primarily reported to be caused by compounds having suitable lipophilicity and basicity values, not all compounds that fulfill such criteria are in fact lysosomotropic. Here, we use morphological profiling by means of the cell painting assay (CPA) as a reliable surrogate to identify lysosomotropism. We noticed that only 35% of the compound subset with matching physicochemical properties shows the lysosomotropic phenotype. Based on a matched molecular pair analysis (MMPA), no key substructures driving lysosomotropism could be identified. However, using explainable machine learning (XML), we were able to highlight that higher lipophilicity, basicity, molecular weight, and lower topological polar surface area are among the important properties that induce lysosomotropism in the compounds of this subset.
Project description:Quantifying the extent to which points are clustered in single-molecule localization microscopy data is vital to understanding the spatial relationships between molecules in the underlying sample. Many existing computational approaches are limited in their ability to process large-scale data sets, to deal effectively with sample heterogeneity, or require subjective user-defined analysis parameters. Here, we develop a supervised machine-learning approach to cluster analysis which is fast and accurate. Trained on a variety of simulated clustered data, the neural network can classify millions of points from a typical single-molecule localization microscopy data set, with the potential to include additional classifiers to describe different subtypes of clusters. The output can be further refined for the measurement of cluster area, shape, and point-density. We demonstrate this approach on simulated data and experimental data of the kinase Csk and the adaptor PAG in primary human T cell immunological synapses.
Project description:When drilling wells for energy exploration, it is important to regulate the formation pressures appropriately to prevent kicks, which can lead to catastrophic loss of life and property. This is usually done by controlling the equivalent circulating density (ECD), which responds to the dynamic conditions that occur during drilling. The conventional approach to determine ECD is via mathematical modeling or downhole measurements. However, the downhole measurement tools can be very expensive, and the mathematical models do not provide a high degree of accuracy. Some previous authors have proposed using machine learning (ML) techniques to improve the accuracy of the ECD predictions. In this work, we employed an extreme gradient-boosting (XGBoost) methodology to predict ECD values. The model's accuracy was determined using correlation coefficients (R²) and root mean square errors (RMSE) as performance metrics. The results showed a strong prediction capability with an R² and RMSE of 1.00 and 0.0005 for the training data and an R² and RMSE of 0.989 and 0.023 for the testing/blind data set, respectively. The developed model outperformed those obtained using other popular machine learning techniques. Lastly, an interpretation of the model results showed that mud weight, weight on hook, and standpipe pressure contributed the most to the ECD prediction values.
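The two performance metrics named above can be computed as in the following minimal sketch. It assumes that R2 denotes the coefficient of determination, which is the common convention for regression models but is not confirmed by the description; the function names and toy values are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed meaning of R2)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Toy ECD-like values (hypothetical, arbitrary units).
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 2.0, 3.0, 3.0]
print(f"RMSE = {rmse(measured, predicted):.4f}")
print(f"R^2  = {r_squared(measured, predicted):.4f}")
```

Reporting both metrics on the training and the blind test set, as the description does, helps reveal overfitting: a near-perfect training score paired with a clearly worse test score would be a warning sign.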