Blind tests of RNA-protein binding affinity prediction.
ABSTRACT: Interactions between RNA and proteins are pervasive in biology, driving fundamental processes such as protein translation and participating in the regulation of gene expression. Modeling the energies of RNA-protein interactions is therefore critical for understanding and repurposing living systems but has been hindered by complexities unique to RNA-protein binding. Here, we bring together several advances to complete a calculation framework for RNA-protein binding affinities, including a unified free energy function for bound complexes, automated Rosetta modeling of mutations, and use of secondary structure-based energetic calculations to model unbound RNA states. The resulting Rosetta-Vienna RNP-ΔΔG method achieves root-mean-squared errors (RMSEs) of 1.3 kcal/mol on high-throughput MS2 coat protein-RNA measurements and 1.5 kcal/mol on an independent test set involving the signal recognition particle, human U1A, PUM1, and FOX-1. As a stringent test, the method achieves RMSE accuracy of 1.4 kcal/mol in blind predictions of hundreds of human PUM2-RNA relative binding affinities. Overall, these RMSE accuracies are significantly better than those attained by prior structure-based approaches applied to the same systems. Importantly, Rosetta-Vienna RNP-ΔΔG establishes a framework for further improvements in modeling RNA-protein binding that can be tested by prospective high-throughput measurements on new systems.
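For readers unfamiliar with the quantities reported in the abstract above, the following minimal Python sketch shows how a relative binding affinity and an RMSE are conventionally computed. It is not the Rosetta-Vienna pipeline: the function names and the simple "bound minus unbound" decomposition are illustrative assumptions, and all inputs are taken to be free energies in kcal/mol.

```python
import math


def relative_binding_affinity(g_bound_mut, g_unbound_mut, g_bound_wt, g_unbound_wt):
    """Return ddG = dG_bind(mutant) - dG_bind(wild type), in kcal/mol.

    Assumption: each binding free energy is the free energy of the bound
    complex minus the total free energy of the unbound partners.
    """
    dg_mut = g_bound_mut - g_unbound_mut
    dg_wt = g_bound_wt - g_unbound_wt
    return dg_mut - dg_wt


def rmse(predicted, measured):
    """Root-mean-squared error between paired predictions and measurements."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))
```

A positive ddG under this convention means the mutant binds more weakly than the wild type.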
Project description:The predictive modeling and design of biologically active RNA molecules requires understanding the energetic balance among their basic components. Rapid developments in computer simulation promise increasingly accurate recovery of RNA's nearest-neighbor (NN) free-energy parameters, but these methods have not been tested in predictive trials or on nonstandard nucleotides. Here, we present, to our knowledge, the first such tests through a RECCES-Rosetta (reweighting of energy-function collection with conformational ensemble sampling in Rosetta) framework that rigorously models conformational entropy, predicts previously unmeasured NN parameters, and estimates these values' systematic uncertainties. RECCES-Rosetta recovers the 10 NN parameters for Watson-Crick stacked base pairs and 32 single-nucleotide dangling-end parameters with unprecedented accuracies: rmsd of 0.28 kcal/mol and 0.41 kcal/mol, respectively. For set-aside test sets, RECCES-Rosetta gives rmsd values of 0.32 kcal/mol on eight stacked pairs involving G-U wobble pairs and 0.99 kcal/mol on seven stacked pairs involving nonstandard isocytidine-isoguanosine pairs. To more rigorously assess RECCES-Rosetta, we carried out four blind predictions for stacked pairs involving 2,6-diaminopurine-U pairs, which achieved 0.64 kcal/mol rmsd accuracy when tested by subsequent experiments. Overall, these results establish that computational methods can now blindly predict energetics of basic RNA motifs, including chemically modified variants, with consistently better than 1 kcal/mol accuracy. Systematic tests indicate that resolving the remaining discrepancies will require energy function improvements beyond simply reweighting component terms, and we propose further blind trials to test such efforts.
Project description:RNA-protein complexes underlie numerous cellular processes including translation, splicing, and posttranscriptional regulation of gene expression. The structures of these complexes are crucial to their functions but often elude high-resolution structure determination. Computational methods are needed that can integrate low-resolution data for RNA-protein complexes while modeling de novo the large conformational changes of RNA components upon complex formation. To address this challenge, we describe RNP-denovo, a Rosetta method to simultaneously fold-and-dock RNA to a protein surface. On a benchmark set of diverse RNA-protein complexes not solvable with prior strategies, RNP-denovo consistently sampled native-like structures with better than nucleotide resolution. We revisited three past blind modeling challenges involving the spliceosome, telomerase, and a methyltransferase-ribosomal RNA complex in which previous methods gave poor results. When coupled with the same sparse FRET, crosslinking, and functional data used previously, RNP-denovo gave models with significantly improved accuracy. These results open a route to modeling global folds of RNA-protein complexes from low-resolution data.
Project description:The SM8 quantum mechanical aqueous continuum solvation model is applied to a 17-molecule test set proposed by Nicholls et al. (J. Med. Chem. 2008, 51, 769) to predict free energies of solvation. With the M06-2X density functional, the 6-31G(d) basis set, and CM4M charge model, the root-mean-square error (RMSE) of SM8 is 1.08 kcal mol⁻¹ for aqueous geometries and 1.14 kcal mol⁻¹ for gas-phase geometries. These errors compare favorably with optimal explicit and continuum models reported by Nicholls et al., having RMSEs of 1.33 and 1.87 kcal mol⁻¹, respectively. Other models examined by these workers had RMSEs of 1.5-2.6 kcal mol⁻¹. We also explore the use of other density functionals and charge models with SM8: the RMSE increases to 1.21 kcal mol⁻¹ for mPW1/CM4 with gas-phase geometries, to 1.50 kcal mol⁻¹ for M06-2X/CM4 with gas-phase geometries, and to 1.27-1.64 kcal mol⁻¹ with three different models at B3LYP gas-phase geometries.
Project description:Results are reported for octanol-water partition coefficients (log P) of the neutral states of drug-like molecules provided during the SAMPL6 (Statistical Assessment of Modeling of Proteins and Ligands) blind prediction challenge from applying the "embedded cluster reference interaction site model" (EC-RISM) as a solvation model for quantum-chemical calculations. Following the strategy outlined during earlier SAMPL challenges, we first train 1- and 2-parameter water-free ("dry") and water-saturated ("wet") models for n-octanol solvation Gibbs energies with respect to experimental values from the "Minnesota Solvation Database" (MNSOL), yielding a root mean square error (RMSE) of 1.5 kcal mol⁻¹ for the best-performing 2-parameter wet model, while the optimal water model developed for the pKa part of the SAMPL6 challenge is kept unchanged (RMSE 1.6 kcal mol⁻¹ for neutral compounds from a model trained on both neutral and ionic species). Applying these models to the blind prediction set yields a log P RMSE of less than 0.5 for our best model (2-parameters, wet). Further analysis of our results reveals that a single compound, SM15, is responsible for most of the error; without it the RMSE drops to 0.2. Since this is the only compound in the challenge dataset with a hydroxyl group, we investigate other alcohols for which Gibbs energy of solvation data for both water and n-octanol are available in the MNSOL database to demonstrate a systematic cause of error and to discuss strategies for improvement.
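The link between solvation Gibbs energies and log P used in work of this kind follows from standard thermodynamics: log P = (ΔG_solv,water − ΔG_solv,octanol) / (RT ln 10). The short Python sketch below illustrates that conversion only; it is not the EC-RISM workflow, and the function names and the default temperature of 298.15 K are assumptions made for illustration.

```python
import math

# Gas constant in kcal/(mol K).
R_KCAL_PER_MOL_K = 1.987204e-3


def kcal_per_log_unit(temperature_k=298.15):
    """Free energy (kcal/mol) corresponding to one log10 unit, RT ln 10."""
    return R_KCAL_PER_MOL_K * temperature_k * math.log(10.0)


def log_p_from_solvation(dg_water, dg_octanol, temperature_k=298.15):
    """Octanol-water log P from solvation Gibbs energies in kcal/mol.

    A solute that is solvated more favorably (more negative dG) in octanol
    than in water gets a positive log P.
    """
    return (dg_water - dg_octanol) / kcal_per_log_unit(temperature_k)
```

At 298.15 K one log P unit corresponds to roughly 1.36 kcal/mol of transfer free energy, which helps when comparing the energy RMSEs and the log P RMSEs quoted above.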
Project description:We used protein-compound docking simulations to develop a structure-based quantitative structure-activity relationship (QSAR) model. The prediction model used docking scores as descriptors. The binding free energy was approximated by a weighted average of docking scores for multiple proteins. This approximation was based on a pharmacophore model of receptor pockets and compounds. The weights of the docking scores were restricted to small values to avoid unrealistic weights by a regularization term. Additional outlier elimination improved the results. We applied this method to two groups of targets. The first target was the kinase family. The cross-validation results of 107 kinase proteins showed that the RMSE of predicted binding free energies was 1.1 kcal/mol. The second target was the matrix metalloproteinase (MMP) family, which has been difficult for docking programs. MMPs require metal-binding groups in their inhibitor structures in many cases. A quantum effect contributes to the metal-ligand interaction. Despite this difficulty, the present method worked well for the MMPs. This method showed that the RMSE of predicted binding free energies was 1.1 kcal/mol. In comparison, with the original docking method the RMSE was 1.7 kcal/mol. The results suggest that the present QSAR model should be applied to general target proteins.
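The idea of approximating a binding free energy by a weighted combination of docking scores, with a regularization term that keeps the weights small, can be sketched as follows. This is an illustration under stated assumptions, not the authors' model: the squared (L2) penalty, the optional bias term, and the names `predict_affinity`, `regularized_loss`, and `lam` are all hypothetical choices.

```python
def predict_affinity(scores, weights, bias=0.0):
    """Predicted binding free energy as a weighted sum of docking scores.

    `scores` holds one compound's docking scores against several proteins.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return bias + sum(w * s for w, s in zip(weights, scores))


def regularized_loss(score_rows, affinities, weights, bias=0.0, lam=1.0):
    """Mean squared error plus an L2 penalty that discourages large weights.

    Assumption: the penalty is lam * sum(w^2); the abstract does not state
    the form of the regularization term.
    """
    errors = [
        (predict_affinity(row, weights, bias) - y) ** 2
        for row, y in zip(score_rows, affinities)
    ]
    penalty = lam * sum(w * w for w in weights)
    return sum(errors) / len(errors) + penalty
```

Minimizing such a loss over the weights (for example with a ridge-regression solver) trades fit quality against weight size; a larger `lam` pulls the weights closer to zero.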
Project description:Consistently predicting biopolymer structure at atomic resolution from sequence alone remains a difficult problem, even for small sub-segments of large proteins. Such loop prediction challenges, which arise frequently in comparative modeling and protein design, can become intractable as loop lengths exceed 10 residues and if surrounding side-chain conformations are erased. Current approaches, such as the protein local optimization protocol or kinematic inversion closure (KIC) Monte Carlo, involve stages that coarse-grain proteins, simplifying modeling but precluding a systematic search of all-atom configurations. This article introduces an alternative modeling strategy based on a 'stepwise ansatz', recently developed for RNA modeling, which posits that any realistic all-atom molecular conformation can be built up by residue-by-residue stepwise enumeration. When harnessed to a dynamic-programming-like recursion in the Rosetta framework, the resulting stepwise assembly (SWA) protocol enables enumerative sampling of a 12 residue loop at a significant but achievable cost of thousands of CPU-hours. In a previously established benchmark, SWA recovers crystallographic conformations with sub-Angstrom accuracy for 19 of 20 loops, compared to 14 of 20 by KIC modeling with a comparable expenditure of computational power. Furthermore, SWA gives high accuracy results on an additional set of 15 loops highlighted in the biological literature for their irregularity or unusual length. Successes include cis-Pro touch turns, loops that pass through tunnels of other side-chains, and loops of lengths up to 24 residues. Remaining problem cases are traced to inaccuracies in the Rosetta all-atom energy function. In five additional blind tests, SWA achieves sub-Angstrom accuracy models, including the first such success in a protein/RNA binding interface, the YbxF/kink-turn interaction in the fourth 'RNA-puzzle' competition. 
These results establish all-atom enumeration as an unusually systematic approach to ab initio protein structure modeling that can leverage high performance computing and physically realistic energy functions to more consistently achieve atomic accuracy.
Project description:A complete macromolecule modeling package must be able to solve the simplest structure prediction problems. Despite recent successes in high resolution structure modeling and design, the Rosetta software suite fares poorly on small protein and RNA puzzles, some as small as four residues. To illustrate these problems, this manuscript presents Rosetta results for four well-defined test cases: the 20-residue mini-protein Trp cage, an even smaller disulfide-stabilized conotoxin, the reactive loop of a serine protease inhibitor, and a UUCG RNA tetraloop. In contrast to previous Rosetta studies, several lines of evidence indicate that conformational sampling is not the major bottleneck in modeling these small systems. Instead, approximations and omissions in the Rosetta all-atom energy function currently preclude discriminating experimentally observed conformations from de novo models at atomic resolution. These molecular "puzzles" should serve as useful model systems for developers wishing to make foundational improvements to this powerful modeling suite.
Project description:The effect of charge hydration asymmetry (CHA), the non-invariance of solvation free energy upon solute charge inversion, is missing from the standard linear response continuum electrostatics. The proposed charge hydration asymmetric-generalized Born (CHA-GB) approximation introduces this effect into the popular generalized Born (GB) model. The CHA is added to the GB equation via an analytical correction that quantifies the specific propensity of CHA of a given water model; the latter is determined by the charge distribution within the water model. Significant variations in CHA seen in explicit water (TIP3P, TIP4P-Ew, and TIP5P-E) free energy calculations on charge-inverted "molecular bracelets" are closely reproduced by CHA-GB, with the accuracy similar to models such as SEA and 3D-RISM that go beyond the linear response. Compared against reference explicit (TIP3P) electrostatic solvation free energies, CHA-GB shows about a 40% improvement in accuracy over the canonical GB, tested on a diverse set of 248 rigid small neutral molecules (root mean square error, rmse = 0.88 kcal/mol for CHA-GB vs 1.24 kcal/mol for GB) and 48 conformations of amino acid analogs (rmse = 0.81 kcal/mol vs 1.26 kcal/mol). CHA-GB employs a novel definition of the dielectric boundary that does not subsume the CHA effects into the intrinsic atomic radii. The strategy leads to finding a new set of intrinsic atomic radii optimized for CHA-GB; these radii show physically meaningful variation with the atom type, in contrast to the radii set optimized for GB. Compared to several popular radii sets used with the original GB model, the new radii set shows better transferability between different classes of molecules.
Project description:The design of proteins with novel ligand-binding functions holds great potential for application in biomedicine and biotechnology. However, our ability to engineer ligand-binding proteins is still limited, and current approaches rely primarily on experimentation. Computation could reduce the cost of the development process and would allow rigorous testing of our understanding of the principles governing molecular recognition. While computational methods have proven successful in the early stages of the discovery process, optimization approaches that can quantitatively predict ligand affinity changes upon protein mutation are still lacking. Here, we assess the ability of free energy calculations based on first-principles statistical mechanics, as well as the latest Rosetta protocols, to quantitatively predict such affinity changes on a challenging set of 134 mutations. After evaluating different protocols with computational efficiency in mind, we investigate the performance of different force fields. We show that both the free energy calculations and Rosetta are able to quantitatively predict changes in ligand binding affinity upon protein mutations, yet the best predictions are the result of combining the estimates of both methods. These closely match the experimentally determined ΔΔG values, with a root-mean-square error of 1.2 kcal/mol for the full benchmark set and of 0.8 kcal/mol for a subset of protein systems providing the most reproducible results. The currently achievable accuracy offers the prospect of being able to employ computation for the optimization of ligand-binding proteins as well as the prediction of drug resistance.
Project description:In the previous publications of this series, we presented a set of Thole induced dipole interaction models using four types of screening functions. In this work, we document our effort to refine the van der Waals parameters for the Thole polarizable models. Following the philosophy of AMBER force field development, the van der Waals (vdW) parameters were tuned for the Thole model with linear screening function to reproduce both the ab initio interaction energies and the experimental densities of pure liquids. An in-house genetic algorithm was applied to maximize the fitness of "chromosomes", which is a function of the root-mean-square errors (RMSE) of interaction energy and liquid density. To efficiently explore the vdW parameter space, a novel approach was developed to estimate the liquid densities for a given vdW parameter set using the mean residue-residue interaction energies through interpolation/extrapolation. This approach allowed the costly molecular dynamics simulations to be performed only at the end of each optimization cycle and eliminated the simulations during the cycle. Test results show notable improvements over the original AMBER FF99 vdW parameter set, as indicated by the reduction in errors of the calculated pure liquid densities (d), heats of vaporization (H(vap)), and hydration energies. The average percent error (APE) of the densities of 59 pure liquids was reduced from 5.33 to 2.97%; the RMSE of H(vap) was reduced from 1.98 to 1.38 kcal/mol; the RMSE of solvation free energies of 15 compounds was reduced from 1.56 to 1.38 kcal/mol. For the interaction energies of 1639 dimers, the overall performance of the optimized vdW set is slightly better than the original FF99 vdW set (RMSE of 1.56 versus 1.63 kcal/mol). The optimized vdW parameter set was also evaluated for the exponential screening function used in the Amoeba force field to assess its applicability for different types of screening functions. 
Encouragingly, comparable performance was observed when the optimized vdW set was combined with the Thole Amoeba-like polarizable model, particularly for the interaction energy and liquid density calculations. Thus, the optimized vdW set is applicable to both types of Thole models with either linear or Amoeba-like screening functions.