Project description:Predicting pathogenicity of missense variants in molecular diagnostics remains a challenge despite the wealth of available data, such as evolutionary information, and the many tools that integrate those data. We describe DeepRank-Mut, a configurable framework designed to extract and learn from physicochemically relevant features of amino acids surrounding missense variants in 3D space. For each variant, various atomic and residue-level features are extracted from its structural environment, including sequence conservation scores of the surrounding amino acids, and stored in multi-channel 3D voxel grids, which are then used to train a 3D convolutional neural network (3D-CNN). The resultant model gives a probabilistic estimate of whether a given input variant is disease-causing or benign. We find that the performance of our 3D-CNN model on independent test datasets is comparable to that of other widely used resources that also combine sequence and structural features. Based on 10-fold cross-validation experiments, we achieve an average accuracy of 0.77 on the independent test datasets. We discuss the contribution of the variant neighborhood to the model's predictive power, in addition to the impact of individual features on the model's performance. Two key features, the evolutionary information of residues in the variant neighborhood and their solvent accessibilities, were observed to influence the predictions. We also highlight how predictions are affected by the underlying disease mechanisms of missense mutations and offer insights into understanding these mechanisms to improve pathogenicity predictions. Our study presents aspects to take into consideration when adopting deep learning approaches for protein structure-guided pathogenicity predictions.
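The DeepRank-Mut abstract above describes storing features in multi-channel 3D voxel grids around each variant. A minimal sketch of that idea follows. It assumes simple nearest-voxel binning, and the function name `voxelize`, the grid size, the resolution and the channel names are all hypothetical choices for illustration; the abstract does not specify how the framework actually maps features onto the grid.

```python
import math
from collections import defaultdict


def voxelize(atoms, center, n_voxels=10, resolution=1.0):
    """Bin per-atom feature values into a cubic grid centred on the variant.

    atoms      : iterable of ((x, y, z), channel_name, value)
    center     : (x, y, z) of the variant residue (assumed grid centre)
    n_voxels   : number of voxels along each axis (hypothetical default)
    resolution : voxel edge length, in the same unit as the coordinates

    Returns a sparse grid as {(channel, i, j, k): summed value}.
    """
    grid = defaultdict(float)
    half = n_voxels / 2.0
    for coords, channel, value in atoms:
        # Shift so the variant sits at the grid centre, then take the voxel index.
        idx = tuple(
            math.floor((c - c0) / resolution + half)
            for c, c0 in zip(coords, center)
        )
        # Atoms outside the box do not contribute.
        if all(0 <= i < n_voxels for i in idx):
            grid[(channel,) + idx] += value
    return dict(grid)


# Toy input: two nearby atoms share a voxel, a distant atom falls outside the box.
toy_atoms = [
    ((0.2, 0.2, 0.2), "conservation", 0.9),
    ((0.4, 0.1, 0.3), "conservation", 0.5),
    ((100.0, 0.0, 0.0), "conservation", 1.0),
]
toy_grid = voxelize(toy_atoms, center=(0.0, 0.0, 0.0), n_voxels=4, resolution=1.0)
```

Each channel would hold one feature type (for example a conservation score or solvent accessibility), and the dense version of this grid is what a 3D-CNN would take as input.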
Project description:Accurately predicting activation energies is crucial for understanding chemical reactions and modeling complex reaction systems. However, the high computational cost of quantum chemistry methods often limits the feasibility of large-scale studies, leading to a scarcity of high-quality activation energy data. In this work, we explore and compare three innovative approaches (transfer learning, delta learning, and feature engineering) to enhance the accuracy of activation energy predictions using graph neural networks, specifically focusing on methods that incorporate low-cost, low-level computational data. Using the Chemprop model, we systematically evaluated how these methods leverage data from semiempirical quantum mechanics (SQM) calculations to improve predictions. Delta learning, which adjusts low-level SQM activation energies to align with high-level CCSD(T)-F12a targets, emerged as the most effective method, achieving high accuracy with substantially reduced data requirements. Notably, delta learning trained with just 20-30% of high-level data matched or exceeded the performance of other methods trained with full data sets, making it advantageous in data-scarce scenarios. However, its reliance on transition state searches imposes significant computational demands during model application. Transfer learning, which pretrains models on large data sets of low-level data, provided mixed results, particularly when there was a mismatch in the reaction distributions between the training and target data sets. Feature engineering, which involves adding computed molecular properties as input features, showed modest gains, particularly in thermodynamic properties. Our study highlights the trade-offs between accuracy and computational demand in selecting the best approach for enhancing activation energy predictions. 
These insights provide valuable guidelines for researchers aiming to apply machine learning in chemical reaction engineering, helping to balance accuracy with resource constraints.
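The delta-learning idea in the abstract above (learn the correction from a low-level activation energy to a high-level target, then add it back) can be sketched as follows. This is only an illustration: the study uses a graph neural network (Chemprop), whereas this sketch swaps in a one-feature linear regression, and the names `train_delta_model`, `predict_high_level` and the toy numbers are hypothetical.

```python
def fit_linear(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept for 1-D data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def train_delta_model(features, ea_low, ea_high):
    """Learn the correction delta = E_a(high) - E_a(low) from a feature."""
    deltas = [high - low for high, low in zip(ea_high, ea_low)]
    return fit_linear(features, deltas)


def predict_high_level(model, feature, ea_low):
    """High-level estimate = low-level value + predicted correction."""
    slope, intercept = model
    return ea_low + slope * feature + intercept


# Toy data (made up): the correction is exactly 2 * feature + 1.
features = [0.0, 1.0, 2.0, 3.0]
ea_low = [10.0, 12.0, 15.0, 20.0]    # stand-in for SQM activation energies
ea_high = [11.0, 15.0, 20.0, 27.0]   # stand-in for high-level targets
model = train_delta_model(features, ea_low, ea_high)
estimate = predict_high_level(model, feature=4.0, ea_low=30.0)
```

Note that prediction needs `ea_low` for every new reaction, which mirrors the cost the abstract points out: the low-level activation energy requires a transition state search at application time.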
Project description:Predicting the stability of crystals is one of the central problems in materials science. Today, density functional theory (DFT) calculations remain comparatively expensive and scale poorly with system size. Here we show that deep neural networks utilizing just two descriptors, the Pauling electronegativity and ionic radii, can predict the DFT formation energies of C3A2D3O12 garnets and ABO3 perovskites with low mean absolute errors (MAEs) of 7-10 meV atom⁻¹ and 20-34 meV atom⁻¹, respectively, well within the limits of DFT accuracy. Further extension to mixed garnets and perovskites with little loss in accuracy can be achieved using a binary encoding scheme, addressing a critical gap in the extension of machine-learning models from fixed-stoichiometry crystals to the infinite universe of mixed-species crystals. Finally, we demonstrate the potential of these models to rapidly traverse vast chemical spaces and accurately identify stable compositions, accelerating the discovery of novel materials with potentially superior properties.
Project description:Background: RNA regulation depends significantly on its binding protein partners, known as RNA-binding proteins (RBPs). Unfortunately, the binding preferences of most RBPs are still not well characterized. Interdependencies between sequence and secondary structure specificities make both the prediction of RBP binding sites and the accurate detection of sequence and structure motifs challenging. Results: In this study, we propose a deep learning-based method, iDeepS, to simultaneously identify the binding sequence and structure motifs from RNA sequences using convolutional neural networks (CNNs) and a bidirectional long short-term memory network (BLSTM). We first perform one-hot encoding for both the sequence and the predicted secondary structure to enable subsequent convolution operations. To reveal the hidden binding knowledge in the observed sequences, the CNNs are applied to learn abstract features. Considering the close relationship between sequence and predicted structures, we use the BLSTM to capture possible long-range dependencies between the binding sequence and structure motifs identified by the CNNs. Finally, the learned weighted representations are fed into a classification layer to predict the RBP binding sites. We evaluated iDeepS on verified RBP binding sites derived from large-scale representative CLIP-seq datasets. The results demonstrate that iDeepS can reliably predict RBP binding sites on RNAs and outperforms the state-of-the-art methods. An important advantage over other methods is that iDeepS can automatically extract both binding sequence and structure motifs, which will improve our understanding of the mechanisms of RBP binding specificity. Conclusion: Our study shows that the iDeepS method identifies the sequence and structure motifs to accurately predict RBP binding sites. iDeepS is available at https://github.com/xypan1232/iDeepS .
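The one-hot encoding step mentioned in the iDeepS abstract above can be illustrated with a short sketch. The sequence alphabet `ACGU` is standard for RNA; the dot-bracket structure alphabet used here is an assumption made for illustration, since the abstract does not say which structure annotation iDeepS encodes. The function name `one_hot` is likewise hypothetical.

```python
RNA_ALPHABET = "ACGU"
STRUCTURE_ALPHABET = "(.)"  # assumed dot-bracket notation, for illustration only


def one_hot(symbols, alphabet):
    """Encode a string as a list of one-hot rows, one row per position.

    Symbols outside the alphabet (for example an ambiguous base 'N')
    are encoded as an all-zero row.
    """
    index = {ch: i for i, ch in enumerate(alphabet)}
    rows = []
    for ch in symbols:
        row = [0] * len(alphabet)
        if ch in index:
            row[index[ch]] = 1
        rows.append(row)
    return rows


# A toy sequence and a matching toy structure string of the same length.
sequence_matrix = one_hot("GCAUNGC", RNA_ALPHABET)
structure_matrix = one_hot("((...))", STRUCTURE_ALPHABET)
```

Both matrices have one row per nucleotide, so they can be passed to separate convolutional branches or concatenated column-wise before convolution.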
Project description:Accurate ligand-protein binding affinity prediction for a set of similar binders is a major challenge in the lead optimization stage of drug development. In general, docking and scoring functions perform unsatisfactorily in this application. Docking calculations followed by molecular dynamics simulations and free energy calculations can be applied to improve the predictions. However, for targets with large, flexible binding sites and no experimentally determined binding modes for a set of ligands, insufficient sampling can decrease the accuracy of the free energy calculations. Cytochrome P450s, a protein family of major importance for drug metabolism, are an example of a challenging target for binding affinity predictions. As a result, the choice of starting structure from the docking solutions becomes crucial. In this study, an iterative scheme is introduced that includes multiple independent molecular dynamics simulations to obtain weighted ensemble averages to be used in the linear interaction energy method. The proposed scheme makes the initial pose selection less crucial for further simulation, as it automatically calculates the relative weights of the various poses. It also properly takes into account the possibility that multiple binding modes contribute similarly to the overall affinity, or that similar compounds occupy very different poses. The method was applied to a set of 12 compounds binding to cytochrome P450 2C9 and displayed a root-mean-square error of 2.9 kJ/mol.
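The abstract above says the scheme calculates relative weights for the docked poses and lets several binding modes contribute to the overall affinity. One way to do this, assumed here and not confirmed by the abstract, is Boltzmann weighting of per-pose free energies. In the sketch below the function names, the temperature and the example energies are hypothetical, and the per-pose values stand in for linear interaction energy estimates from independent simulations.

```python
import math

GAS_CONSTANT = 0.0083145           # kJ/(mol K)
KT = GAS_CONSTANT * 300.0          # assumed temperature of 300 K


def pose_weights(dg_poses, kt=KT):
    """Boltzmann weights W_i = exp(-dG_i/kT) / sum_j exp(-dG_j/kT)."""
    dg_min = min(dg_poses)         # shift by the minimum for numerical stability
    factors = [math.exp(-(dg - dg_min) / kt) for dg in dg_poses]
    total = sum(factors)
    return [f / total for f in factors]


def combined_free_energy(dg_poses, kt=KT):
    """Overall dG = -kT * ln(sum_i exp(-dG_i/kT)) over all poses."""
    dg_min = min(dg_poses)
    total = sum(math.exp(-(dg - dg_min) / kt) for dg in dg_poses)
    return dg_min - kt * math.log(total)


# Hypothetical per-pose binding free energies in kJ/mol.
dg_per_pose = [-30.0, -30.0, -20.0]
weights = pose_weights(dg_per_pose)
dg_total = combined_free_energy(dg_per_pose)
```

With two equally favourable poses the weights split evenly between them and the combined free energy is lower than either pose alone, which is the behaviour the abstract describes for multiple contributing binding modes. In an iterative scheme, the weights would be recomputed each time the per-pose estimates are updated.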
Project description:Conventional analysis of fluorescence recovery after photobleaching (FRAP) data for diffusion coefficient estimation typically involves fitting an analytical or numerical FRAP model to the recovery curve data using non-linear least squares. Depending on the model, this can be time consuming, especially for batch analysis of large numbers of data sets and if multiple initial guesses for the parameter vector are used to ensure convergence. In this work, we develop a completely new approach, DeepFRAP, utilizing machine learning for parameter estimation in FRAP. From a numerical FRAP model developed in previous work, we generate a very large set of simulated recovery curve data with realistic noise levels. The data are used for training different deep neural network regression models for prediction of several parameters, most importantly the diffusion coefficient. The neural networks are extremely fast and can estimate the parameters orders of magnitude faster than least squares. The performance of the neural network estimation framework is compared to conventional least squares estimation on simulated data, and found to be strikingly similar. Also, a simple experimental validation is performed, demonstrating excellent agreement between the two methods. We make the data and code used publicly available to facilitate further development of machine learning-based estimation in FRAP. LAY DESCRIPTION: Fluorescence recovery after photobleaching (FRAP) is one of the most frequently used methods for microscopy-based diffusion measurements and broadly used in materials science, pharmaceutics, food science and cell biology. In a FRAP experiment, a laser is used to photobleach fluorescent particles in a region. By analysing the recovery of the fluorescence intensity due to the diffusion of still fluorescent particles, the diffusion coefficient and other parameters can be estimated. 
Typically, a confocal laser scanning microscope (CLSM) is used to image the time evolution of the recovery, and a model is fit using least squares to obtain parameter estimates. In this work, we introduce a new, fast and accurate method for analysis of data from FRAP. The new method is based on using artificial neural networks to predict parameter values, such as the diffusion coefficient, effectively circumventing classical least squares fitting. This leads to a dramatic speed-up, especially noticeable when analysing large numbers of FRAP data sets, while still producing results in excellent agreement with least squares. Further, the neural network estimates can be used as very good initial guesses for least squares estimation, making the least squares optimization converge much faster than it otherwise would. This makes it possible to obtain, for example, diffusion coefficients as soon as possible, spending minimal time on data analysis. In this fashion, the proposed method facilitates efficient use of the experimentalist's time, which is the main motivation for our approach. The concept is demonstrated on pure diffusion. However, it can easily be extended to the diffusion and binding case. The concept is likely to be useful in all application areas of FRAP, including diffusion in cells, gels and solutions.
Project description:Accurate somatic variant calling from next-generation sequencing data is one of the most important tasks in personalised cancer therapy. The sophistication of the available technologies is ever-increasing, yet manual candidate refinement is still a necessary step in state-of-the-art processing pipelines. This limits reproducibility and introduces a bottleneck with respect to scalability. We demonstrate that the validation of genetic variants can be improved using a machine learning approach based on a convolutional neural network trained using existing human annotation. In contrast to existing approaches, we introduce a way in which contextual data from sequencing tracks can be included in the automated assessment. A rigorous evaluation shows that the resulting model is robust and performs on par with trained researchers following a published standard operating procedure.
Project description:Background: RNA binding proteins (RBPs) play a vital role in post-transcriptional processes in all eukaryotes, such as splicing regulation, mRNA transport, and modulation of mRNA translation and decay. The identification of RBP binding sites is a crucial step in understanding the biological mechanism of post-transcriptional gene regulation. However, the determination of RBP binding sites on a large scale is a challenging task due to the high cost of biochemical assays. A number of studies have exploited machine learning methods to predict binding sites. In particular, deep learning is increasingly used in the bioinformatics field by virtue of its ability to learn generalized representations from DNA and protein sequences. Results: In this paper, we implemented a novel deep neural network model, DeepRKE, which combines primary RNA sequence and secondary structure information to effectively predict RBP binding sites. Specifically, we used a word embedding algorithm to extract features of RNA sequences and secondary structures, i.e., distributed representations of k-mer sequences rather than the traditional one-hot encoding. The distributed representations are taken as input to convolutional neural networks (CNN) and bidirectional long short-term memory networks (BiLSTM) to identify RBP binding sites. Our results show that DeepRKE outperforms existing counterpart methods on two large-scale benchmark datasets. Conclusions: Our extensive experimental results show that DeepRKE is an efficacious tool for predicting RBP binding sites. The distributed representations of RNA sequences and secondary structures can effectively detect the latent relationship and similarity between k-mers, and thus improve the predictive performance. The source code of DeepRKE is available at https://github.com/youzhiliu/DeepRKE/ .
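The DeepRKE abstract above replaces one-hot encoding with distributed representations of k-mers. Before any embedding can be learned, a sequence has to be split into k-mer tokens and mapped to integer indices. A minimal sketch of that preprocessing follows; the function names, the choice of k = 3 and the stride of 1 are assumptions for illustration and are not taken from the abstract.

```python
def kmers(sequence, k=3, stride=1):
    """Split a sequence into overlapping k-mer tokens."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def build_vocabulary(token_lists):
    """Assign each distinct k-mer an integer index in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(tokens, vocab):
    """Map k-mer tokens to the integer indices an embedding layer would look up."""
    return [vocab[token] for token in tokens]


# Toy RNA sequence; the repeated k-mer "AUG" maps to the same index both times.
tokens = kmers("AUGGCAUG", k=3)
vocab = build_vocabulary([tokens])
indices = encode(tokens, vocab)
```

A word embedding algorithm would then learn a dense vector for each index, so that k-mers occurring in similar contexts end up with similar vectors.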
Project description:The Machine Recognition of Crystallization Outcomes (MARCO) initiative has assembled roughly half a million annotated images of macromolecular crystallization experiments from various sources and setups. Here, state-of-the-art machine learning algorithms are trained and tested on different parts of this data set. We find that more than 94% of the test images can be correctly labeled, irrespective of their experimental origin. Because crystal recognition is key to high-density screening and the systematic analysis of crystallization experiments, this approach opens the door to both industrial and fundamental research applications.
Project description:Alterations in joint contact forces (JCFs) are thought to be important mechanisms for the onset and progression of many musculoskeletal and orthopaedic pain disorders. Computational approaches to JCF assessment represent the only non-invasive means of estimating in-vivo forces, but they cannot be undertaken in free-living environments. Here, we used deep neural networks to train models to predict JCFs using only joint angles as predictors. Our neural network models were generally able to predict JCFs with errors within published minimal detectable change values. The errors ranged from a minimum of 0.03 bodyweight (BW) (ankle medial-lateral JCF in walking) to a maximum of 0.65 BW (knee VT JCF in running). Interestingly, we also found that over-parametrised neural networks, obtained by training for more epochs (>100), produced better and smoother waveform predictions. Our methods for predicting JCFs using only joint kinematics hold considerable promise for allowing clinicians and coaches to continuously monitor tissue loading in free-living environments.