A New Hidden Markov Model for Protein Quality Assessment Using Compatibility Between Protein Sequence and Structure.
ABSTRACT: Protein structure Quality Assessment (QA) is an essential component in protein structure prediction and analysis. The relationship between protein sequence and structure often serves as a basis for protein structure QA. In this work, we developed a new Hidden Markov Model (HMM) to assess the compatibility of protein sequence and structure for capturing their complex relationship. More specifically, the emission of the HMM consists of protein local structures in angular space, secondary structures, and sequence profiles. This model has two capabilities: (1) encoding local structure of each position by jointly considering sequence and structure information, and (2) assigning a global score to estimate the overall quality of a predicted structure, as well as local scores to assess the quality of specific regions of a structure, which provides useful guidance for targeted structure refinement. We compared the HMM model to the state-of-the-art single-structure quality assessment methods OPUSCA, DFIRE, GOAP, and RW in protein structure selection. Computational results showed that our new score, HMM.Z, can achieve better overall selection performance on the benchmark datasets.
Project description:In this study, we address the problem of local quality assessment in homology models. As a prerequisite for the evaluation of methods for predicting local model quality, we first examine the problem of measuring local structural similarities between a model and the corresponding native structure. Several local geometric similarity measures are evaluated. Two methods based on structural superposition are found to best reproduce local model quality assessments by human experts. We then examine the performance of state-of-the-art statistical potentials in predicting local model quality on three qualitatively distinct data sets. The best statistical potential, DFIRE, is shown to perform on par with the best current structure-based method in the literature, ProQres. A combination of different statistical potentials and structural features using support vector machines is shown to provide somewhat improved performance over published methods.
Project description:This paper presents a new fast and accurate web service for protein model quality analysis, called PSICA (Protein Structural Information Conformity Analysis). It is designed to evaluate how much a tertiary model of a given protein primary sequence conforms to the known protein structures of similar protein sequences, and to evaluate the quality of predicted protein models. PSICA implements the MUfoldQA_S method, an efficient state-of-the-art protein model quality assessment (QA) method. In CASP12, MUfoldQA_S ranked No. 1 in the protein model QA select-20 category in terms of the difference between the predicted and true GDT-TS value of each model. For a given predicted 3D model, PSICA generates (i) predicted global GDT-TS value; (ii) interactive comparison between the model and other known protein structures; (iii) visualization of the predicted local quality of the model; and (iv) JSmol rendering of the model. Additionally, PSICA implements MUfoldQA_C, a new consensus method based on MUfoldQA_S. In CASP12, MUfoldQA_C ranked No. 1 in top 1 model GDT-TS loss on the select-20 QA category and No. 2 in the average difference between the predicted and true GDT-TS value of each model for both select-20 and best-150 QA categories. The PSICA server is freely available at http://qas.wangwb.com/?wwr34/mufoldqa/index.html.
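The two CASP12 ranking criteria named above (the difference between predicted and true GDT-TS, and the top 1 model GDT-TS loss) can be illustrated with a minimal sketch. The function names and the toy scores are hypothetical, and the use of a mean absolute difference is an assumption; the official CASP evaluation may define these quantities differently.

```python
from statistics import mean


def gdt_ts_difference(predicted, true):
    """Mean absolute difference between predicted and true GDT-TS per model.

    Assumption: the difference is averaged as an absolute value.
    """
    if len(predicted) != len(true) or not predicted:
        raise ValueError("score lists must be non-empty and of equal length")
    return mean(abs(p - t) for p, t in zip(predicted, true))


def top1_loss(predicted, true):
    """True GDT-TS of the best model in the pool minus the true GDT-TS
    of the model that the QA method ranks first."""
    if len(predicted) != len(true) or not predicted:
        raise ValueError("score lists must be non-empty and of equal length")
    selected = max(range(len(predicted)), key=predicted.__getitem__)
    return max(true) - true[selected]


# Toy model pool (hypothetical values, GDT-TS on a 0-1 scale).
predicted = [0.62, 0.71, 0.55, 0.68]
true = [0.60, 0.66, 0.58, 0.73]

print(round(gdt_ts_difference(predicted, true), 4))
print(round(top1_loss(predicted, true), 4))
```

A loss of zero means the QA method picked the best model in the pool; the difference measure instead rewards accurate absolute score prediction for every model.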
Project description:Hidden Markov Models (HMMs) have proven very useful in computational biology for such applications as sequence pattern matching, gene-finding, and structure prediction. Thus far, however, they have been confined to representing 1D sequence (or the aspects of structure that could be represented by character strings). We develop an HMM formalism that explicitly uses 3D coordinates in its match states. The match states are modeled by 3D Gaussian distributions centered on the mean coordinate position of each alpha carbon in a large structural alignment. The transition probabilities depend on the spread of the neighboring match states and on the number of gaps found in the structural alignment. We also develop methods for aligning query structures against 3D HMMs and scoring the result probabilistically. For 1D HMMs these tasks are accomplished by the Viterbi and forward algorithms. However, these will not work in unmodified form for the 3D problem, due to the non-local quality of structural alignment, so we develop extensions of these algorithms for the 3D case. Several applications of 3D HMMs for protein structure classification are reported. A good separation of scores for different fold families suggests that the described construct is quite useful for protein structure analysis. We have created a rigorous 3D HMM representation for protein structures and implemented a complete set of routines for building 3D HMMs in C and Perl. The code is freely available from http://www.molmovdb.org/geometry/3dHMM, and at this site we also have a simple prototype server to demonstrate the features of the described approach.
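For reference, the standard 1D Viterbi algorithm mentioned above can be sketched as follows. This is the textbook discrete-emission version in log space, not the 3D extension developed in the paper; the two-state model, its state names, and all probabilities are hypothetical illustration values.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most probable state path and its log probability."""
    # best[t][s]: log probability of the best path ending in state s at time t
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []  # back[t-1][s]: predecessor of s on the best path at time t
    for o in obs[1:]:
        prev = best[-1]
        col, ptr = {}, {}
        for s in states:
            p = max(states, key=lambda q: prev[q] + log_trans[q][s])
            col[s] = prev[p] + log_trans[p][s] + log_emit[s][o]
            ptr[s] = p
        best.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, best[-1][last]


def logs(table):
    """Convert a dict of probabilities (possibly nested) to log space."""
    return {k: logs(v) if isinstance(v, dict) else math.log(v)
            for k, v in table.items()}


# Hypothetical toy model: two states, two observation symbols.
states = ["H", "C"]
log_start = logs({"H": 0.5, "C": 0.5})
log_trans = logs({"H": {"H": 0.9, "C": 0.1}, "C": {"H": 0.1, "C": 0.9}})
log_emit = logs({"H": {"a": 0.9, "b": 0.1}, "C": {"a": 0.1, "b": 0.9}})

path, score = viterbi("aabb", states, log_start, log_trans, log_emit)
print(path, math.exp(score))
```

The recursion relies on each position depending only on its predecessor, which is the locality assumption that the abstract says breaks down for 3D structural alignment.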
Project description:Protein tertiary structures are essential for studying functions of proteins at the molecular level. An indispensable approach for protein structure solution is computational prediction. Most protein structure prediction methods generate candidate models first and select the best candidates by model quality assessment (QA). In many cases, good models can be produced, but the QA tools fail to select the best ones from the candidate model pool. Because of incomplete understanding of protein folding, each QA method only reflects partial facets of a structure model and thus has limited discerning power, with no one method consistently outperforming the others. In this article, we developed a set of new QA methods, including two QA methods for evaluating target/template alignments, a molecular dynamics (MD)-based QA method, and three consensus QA methods with selected references to reveal new facets of protein structures complementary to the existing methods. Moreover, the underlying relationships among different QA methods were analyzed and then integrated into a multilayer evaluation approach to guide the model generation and model selection in prediction. All methods are integrated and implemented into an innovative and improved prediction system hereafter referred to as MUFOLD. In CASP8 and CASP9, MUFOLD has demonstrated the proof of the principles in terms of both QA discerning power and structure prediction accuracy.
Project description:An accurate statistical energy function that is suitable for the prediction of protein structures of all classes should be independent of the structural database used for energy extraction. Here, two high-resolution, low-sequence-identity structural databases of 333 alpha-proteins and 271 beta-proteins were built for examining the database dependence of three all-atom statistical energy functions. They are RAPDF (residue-specific all-atom conditional probability discriminatory function), atomic KBP (atomic knowledge-based potential), and DFIRE (statistical potential based on distance-scaled finite ideal-gas reference state). These energy functions differ in the reference states used for energy derivation. The energy functions extracted from the different structural databases are used to select native structures from multiple decoys of 64 alpha-proteins and 28 beta-proteins. The performance in native structure selections indicates that the DFIRE-based energy function is mostly independent of the structural database whereas RAPDF and KBP have a significant dependence. The construction of two additional structural databases of alpha/beta and alpha + beta-proteins further confirmed the weak dependence of DFIRE on the structural databases of various structural classes. The possible source for the difference between the three all-atom statistical energy functions is that the physical reference state of ideal gas used in the DFIRE-based energy function is least dependent on the structural database.
Project description:Structure prediction on a genomic scale requires a simplified energy function that can efficiently sample the conformational space of polypeptide chains. A good energy function at minimum should discriminate native structures against decoys. Here, we show that a recently developed, residue-specific, all-atom knowledge-based potential (167 atomic types) based on distance-scaled, finite ideal-gas reference state (DFIRE-all-atom) can be substantially simplified to 20 residue types located at side-chain center of mass (DFIRE-SCM) without a significant change in its capability of structure discrimination. Using 96 standard multiple decoy sets, we show that there is only a small reduction (from 80% to 78%) in success rate of ranking native structures as the top 1. The success rate is higher than two previously developed, all-atom distance-dependent statistical pair potentials. Applied to structure selections of 21 docking decoys without modification, the DFIRE-SCM potential is 29% more successful in recognizing native complex structures than an all-atom statistical potential trained by a database of dimeric interfaces. The potential also achieves 92% accuracy in distinguishing true dimeric interfaces from artificial crystal interfaces. In addition, the DFIRE potential with the C(alpha) positions as the interaction centers recognizes 123 native structures out of a comprehensive 125-protein TOUCHSTONE decoy set in which each protein has 24,000 decoys with only C(alpha) positions. Furthermore, the performance by DFIRE-SCM on newly established 25 monomeric and 31 docking Rosetta-decoy sets is comparable to (or better than in the case of monomeric decoy sets) that of a recently developed, all-atom Rosetta energy function enhanced with an orientation-dependent hydrogen bonding potential.
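The "success rate of ranking native structures as the top 1" reported above can be illustrated with a small sketch. It assumes that a lower energy means a better structure and that a tie with a decoy counts as a failure; the function names and energy values are hypothetical and are not DFIRE outputs.

```python
def native_ranked_first(native_energy, decoy_energies):
    """True if the native structure has strictly lower energy than every decoy."""
    return all(native_energy < e for e in decoy_energies)


def top1_success_rate(decoy_sets):
    """Fraction of decoy sets in which the native structure ranks first.

    decoy_sets: list of (native_energy, [decoy_energy, ...]) pairs.
    """
    if not decoy_sets:
        raise ValueError("at least one decoy set is required")
    hits = sum(native_ranked_first(n, d) for n, d in decoy_sets)
    return hits / len(decoy_sets)


# Hypothetical energies for four decoy sets (arbitrary units).
decoy_sets = [
    (-10.0, [-8.0, -9.5]),
    (-5.0, [-6.0, -4.0]),   # a decoy scores lower than the native: failure
    (-7.0, [-6.9]),
    (-3.0, [-2.0, -1.0]),
]

print(top1_success_rate(decoy_sets))
```

Under this definition, the 80% and 78% figures quoted above would correspond to the fraction of the 96 decoy sets in which the native structure received the lowest energy.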
Project description:Protein quality assessment (QA) has played an important role in protein structure prediction. We developed a novel single-model quality assessment method, Qprob. Qprob calculates the absolute error for each protein feature value against the true quality scores (i.e. GDT-TS scores) of protein structural models, and uses them to estimate its probability density distribution for quality assessment. Qprob has been blindly tested on the 11th Critical Assessment of Techniques for Protein Structure Prediction (CASP11) as the MULTICOM-NOVEL server. The official CASP result shows that Qprob ranks as one of the top single-model QA methods. In addition, Qprob makes contributions to our protein tertiary structure predictor MULTICOM, which is officially ranked 3rd out of 143 predictors. The good performance shows that Qprob is good at assessing the quality of models of hard targets. These results demonstrate that this new probability density distribution based method is effective for protein single-model quality assessment and is useful for protein structure prediction. The webserver of Qprob is available at: http://calla.rnet.missouri.edu/qprob/. The software is now freely available in the web server of Qprob.
Project description:Protein model quality assessment (QA) plays a very important role in protein structure prediction. It can be divided into two groups of methods: single-model and consensus QA methods. The consensus QA methods may fail when there is a large portion of low quality models in the model pool. In this paper, we develop a novel single-model quality assessment method QAcon utilizing structural features, physicochemical properties, and residue contact predictions. We apply residue-residue contact information predicted by two protein contact prediction methods, PSICOV and DNcon, to generate a new score as a feature for quality assessment. This novel feature and 11 other features are used as input to train a two-layer neural network on CASP9 datasets to predict the quality of a single protein model. We blindly benchmarked our method QAcon on the CASP11 dataset as the MULTICOM-CLUSTER server. Based on the evaluation, our method is ranked as one of the top single-model QA methods. The good performance of the features based on contact prediction illustrates the value of using contact information in protein quality assessment. The web server and the source code of QAcon are freely available at: http://cactus.rnet.missouri.edu/QAcon. Contact: email@example.com. Supplementary data are available at Bioinformatics online.
Project description:The arrangement of amino acids in a protein sequence encodes its native folding. However, the same arrangement in aggregation-prone regions may cause misfolding as a result of local environmental stress. Under normal physiological conditions, such regions congregate in the protein's interior to avoid aggregation and attain the native fold. We have used solvent accessibility of aggregation patches (SAAPp) to determine the packing of aggregation-prone residues. Our results showed that SAAPp has low values for native crystal structures, consistent with protein folding as a mechanism to minimize the solvent accessibility of aggregation-prone residues. SAAPp also shows an average correlation of 0.76 with the global distance test (GDT) score on CASP12 template-based protein models. Using SAAPp scores and five structural features, a random forest machine learning quality assessment tool, SAAP-QA, showed 2.32 average GDT loss between best model predicted and actual best based on GDT score on independent CASP test data, with the ability to discriminate native-like folds having an AUC of 0.94. Overall, the Pearson correlation coefficient (PCC) between true and predicted GDT scores on independent CASP data was 0.86 while on the external CAMEO dataset, comprising high quality protein structures, PCC and average GDT loss were 0.71 and 4.46 respectively. SAAP-QA can be used to detect the quality of models and iteratively improve them to native or near-native structures.
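The Pearson correlation coefficient (PCC) between true and predicted GDT scores, used as an evaluation measure above, follows the standard definition and can be computed as in this minimal sketch. The function name and the toy scores are hypothetical and are unrelated to the CASP or CAMEO results quoted in the abstract.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical true and predicted GDT scores for three models.
true_gdt = [40.0, 50.0, 60.0]
pred_gdt = [45.0, 65.0, 55.0]

print(pearson(true_gdt, pred_gdt))
```

PCC measures how well predicted scores track true scores across models, whereas GDT loss only reflects the quality of the single model ranked first, which is why the abstract reports both.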
Project description:The thioredoxin family of oxidoreductases plays an important role in redox signaling and control of protein function. Not only are thioredoxins linked to a variety of disorders, but their stable structure has also seen application in protein engineering. Both sequence-based and structure-based tools exist for thioredoxin identification, but remote homolog detection remains a challenge. We developed a thioredoxin predictor using the approach of integrating sequence with structural information. We combined a sequence-based Hidden Markov Model (HMM) with a molecular dynamics enhanced structure-based recognition method (dynamic FEATURE, DF). This hybrid method (HMMDF) has high precision and recall (0.90 and 0.95, respectively) compared with HMM (0.92 and 0.87, respectively) and DF (0.82 and 0.97, respectively). Dynamic FEATURE is sensitive but struggles to resolve closely related protein families, while HMM identifies these evolutionary differences by compromising sensitivity. Our method applied to structural genomics targets makes a strong prediction of a novel thioredoxin.