LiveBench-1: continuous benchmarking of protein structure prediction servers.
ABSTRACT: We present a novel, continuous approach aimed at the large-scale assessment of the performance of available fold-recognition servers. Six popular servers were investigated: PDB-Blast, FFAS, T98-lib, GenTHREADER, 3D-PSSM, and INBGU. The assessment was conducted using as prediction targets a large number of selected protein structures released from October 1999 to April 2000. A target was selected if its sequence showed no significant similarity to any of the proteins previously available in the structural database. Overall, the servers were able to produce structurally similar models for one-half of the targets, but significantly accurate sequence-structure alignments were produced for only one-third of the targets. We further classified the targets into two sets: easy and hard. We found that all servers were able to find the correct answer for the vast majority of the easy targets if a structurally similar fold was present in the server's fold libraries. However, among the hard targets--where standard methods such as PSI-BLAST fail--the most sensitive fold-recognition servers were able to produce similar models for only 40% of the cases, half of which had a significantly accurate sequence-structure alignment. Among the hard targets, the presence of updated libraries appeared to be less critical for the ranking. An "ideally combined consensus" prediction, where the results of all servers are considered, would increase the percentage of correct assignments by 50%. Each server had a number of cases with a correct assignment, where the assignments of all the other servers were wrong. This emphasizes the benefits of considering more than one server in difficult prediction tasks. The LiveBench program (http://BioInfo.PL/LiveBench) is being continued, and all interested developers are cordially invited to join.
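The "ideally combined consensus" figure can be illustrated with a small sketch. It assumes an oracle that always picks a server with a correct assignment whenever one exists, and it reads the quoted 50% as a relative gain over the best single server; both readings are assumptions, as are the function name and the toy data, none of which come from LiveBench.

```python
from typing import Dict, Set


def ideal_consensus(correct: Dict[str, Set[str]], n_targets: int) -> dict:
    """Summarize an oracle-style consensus over several servers.

    `correct` maps a server name to the set of target IDs it solved;
    how "solved" is decided is left to the caller (assumption).
    """
    best_single = max(len(ids) for ids in correct.values())
    if best_single == 0:
        raise ValueError("no server solved any target")
    solved_by_any = set().union(*correct.values())
    # Targets that exactly one server solved, grouped by that server.
    unique = {}
    for name, ids in correct.items():
        others = set().union(*(o for n, o in correct.items() if n != name))
        unique[name] = ids - others
    return {
        "best_single_fraction": best_single / n_targets,
        "consensus_fraction": len(solved_by_any) / n_targets,
        "relative_gain": (len(solved_by_any) - best_single) / best_single,
        "unique": unique,
    }


# Toy data for illustration only, not LiveBench results.
toy = {
    "server_A": {"t1", "t2", "t3", "t4"},
    "server_B": {"t1", "t2", "t5"},
    "server_C": {"t2", "t6"},
}
summary = ideal_consensus(toy, n_targets=10)
print(summary["relative_gain"], summary["unique"])
```

In the toy data every server contributes at least one target that no other server solved, which mirrors the observation that each server had cases where all the others were wrong.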
Project description:Protein structure prediction provides valuable insights into function, and comparative modeling is one of the most reliable methods to predict 3D structures directly from amino acid sequences. However, critical problems arise during the selection of the correct templates and the alignment of query sequences to them. We have developed an automatic protein structure prediction server, (PS)2, which uses an effective consensus strategy both in template selection, which combines PSI-BLAST and IMPALA, and in target-template alignment, which integrates PSI-BLAST, IMPALA and T-Coffee. (PS)2 was evaluated for 47 comparative modeling targets in CASP6 (Critical Assessment of Techniques for Protein Structure Prediction). For the benchmark dataset, the predictive performance of (PS)2, based on the mean GDT_TS score, was superior to 10 other automatic servers. Our method is based solely on the consensus sequence and thus is considerably faster than other methods that rely on the additional structural consensus of templates. Our results show that (PS)2, coupled with suitable consensus strategies and a new similarity score, can significantly improve structure prediction. Our approach should be useful in structure prediction and modeling. (PS)2 is available through the website at http://ps2.life.nctu.edu.tw/.
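For readers unfamiliar with the GDT_TS score used in CASP, the sketch below shows the arithmetic: the average, over the cutoffs 1, 2, 4 and 8 Å, of the percentage of residues whose C-alpha atom lies within the cutoff of its position in the experimental structure. It is a simplification: it assumes per-residue distances from a single superposition are already available, whereas the full GDT procedure searches for the best superposition at each cutoff. The function name and the distances are hypothetical.

```python
from typing import Sequence


def gdt_ts(distances: Sequence[float],
           cutoffs: Sequence[float] = (1.0, 2.0, 4.0, 8.0)) -> float:
    """Approximate GDT_TS (0-100) from per-residue C-alpha deviations in Angstrom.

    Assumes one fixed superposition (simplification of the real procedure).
    """
    if not distances:
        raise ValueError("need at least one residue distance")
    # Count residue/cutoff pairs where the residue is within the cutoff.
    hits = sum(1 for c in cutoffs for d in distances if d <= c)
    return 100.0 * hits / (len(distances) * len(cutoffs))


# Hypothetical deviations for a five-residue model.
print(gdt_ts([0.5, 1.5, 3.0, 6.0, 10.0]))
```

A mean GDT_TS over a benchmark is then the plain average of this value across targets.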
Project description:BACKGROUND: Protein inter-residue contacts play a crucial role in the determination and prediction of protein structures. Previous studies on contact prediction indicate that although template-based consensus methods outperform sequence-based methods on targets with typical templates, such consensus methods perform poorly on new fold targets. However, we find that even for new fold targets, the models generated by threading programs can contain many true contacts. The challenge is how to identify them. RESULTS: In this paper, we develop an integer linear programming model for consensus contact prediction. In contrast to the simple majority voting method, which assumes that all the individual servers are equally important and independent, the newly developed method evaluates their correlation by using maximum likelihood estimation and extracts independent latent servers from them by using principal component analysis. An integer linear programming method is then applied to assign a weight to each latent server to maximize the difference between true contacts and false ones. The proposed method is tested on the CASP7 data set. If the top L/5 predicted contacts are evaluated, where L is the protein size, the average accuracy is 73%, which is much higher than that of any previously reported study. Moreover, if only the 15 new fold CASP7 targets are considered, our method achieves an average accuracy of 37%, which is much better than that of the majority voting method, SVM-LOMETS, SVM-SEQ, and SAM-T06. These methods demonstrate an average accuracy of 13.0%, 10.8%, 25.8% and 21.2%, respectively. CONCLUSION: Reducing server correlation and optimally combining independent latent servers show a significant improvement over the traditional consensus methods. This approach can hopefully provide a powerful tool for protein structure refinement and prediction.
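The baseline and the evaluation metric mentioned above can be sketched as follows. This shows only simple majority voting and the top L/5 accuracy, not the integer linear programming model, whose details the abstract does not give. Function names, the representation of a contact as a residue pair (i, j) with i < j, and the toy data are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

Contact = Tuple[int, int]  # residue pair (i, j) with i < j (assumed convention)


def majority_vote(server_predictions: Iterable[Set[Contact]]) -> Dict[Contact, int]:
    """Score each contact by the number of servers that predict it."""
    votes: Counter = Counter()
    for contacts in server_predictions:
        votes.update(set(contacts))
    return dict(votes)


def top_l5_accuracy(scores: Dict[Contact, float],
                    true_contacts: Set[Contact],
                    length: int) -> float:
    """Fraction of the top L/5 scored contacts that are true contacts."""
    k = max(1, length // 5)
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    if not ranked:
        return 0.0
    return sum(1 for pair in ranked if pair in true_contacts) / len(ranked)


# Toy example for a protein of length 10, so the top 2 contacts are evaluated.
servers = [
    {(1, 8), (2, 9), (3, 10)},
    {(1, 8), (2, 9)},
    {(1, 8), (4, 10)},
]
scores = majority_vote(servers)
print(top_l5_accuracy(scores, {(1, 8), (3, 10)}, length=10))
```

Because every server gets one vote, correlated servers reinforce each other's errors, which is the weakness the weighted approach in the abstract is meant to address.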
Project description:A number of state-of-the-art protein structure prediction servers have been developed by researchers working in the Bioinformatics Unit at University College London. The popular PSIPRED server allows users to perform secondary structure prediction, transmembrane topology prediction and protein fold recognition. More recent servers include DISOPRED for the prediction of protein dynamic disorder and DomPred for domain boundary prediction. These servers are available from our software home page at http://bioinf.cs.ucl.ac.uk/software.html.
Project description:BACKGROUND: The accurate prediction of ligand binding residues from amino acid sequences is important for the automated functional annotation of novel proteins. In the previous two CASP experiments, the most successful methods in the function prediction category were those which used structural superpositions of 3D models and related templates with bound ligands in order to identify putative contacting residues. However, whilst most of this prediction process can be automated, visual inspection and manual adjustments of parameters, such as the distance thresholds used for each target, have often been required to prevent over-prediction. Here we describe a novel method, FunFOLD, which uses an automatic approach for cluster identification and residue selection. The software provided can easily be integrated into existing fold recognition servers, requiring only a 3D model and a list of templates as inputs. A simple web interface is also provided, allowing access to non-expert users. The method has been benchmarked against the top servers and manual prediction groups tested at both CASP8 and CASP9. RESULTS: The FunFOLD method shows a significant improvement over the best available servers and is shown to be competitive with the top manual prediction groups that were tested at CASP8. The FunFOLD method is also competitive with both the top server and manual methods tested at CASP9. When tested using common subsets of targets, the predictions from FunFOLD are shown to achieve significantly higher mean Matthews Correlation Coefficient (MCC) scores and Binding-site Distance Test (BDT) scores than all server methods that were tested at CASP8. Testing on the CASP9 set showed no statistically significant separation in performance between FunFOLD and the other top server groups tested.
CONCLUSIONS: The FunFOLD software is freely available as both a standalone package and a prediction server, providing competitive ligand binding site residue predictions for expert and non-expert users alike. The software provides a new fully automated approach for structure based function prediction using 3D models of proteins.
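As an illustration of the MCC score referred to above, the following sketch computes it for a single target from sets of predicted and observed binding residues. The function name and the residue numbers are hypothetical, and the BDT score is not covered here.

```python
import math
from typing import Iterable


def binding_site_mcc(predicted: Iterable[int],
                     observed: Iterable[int],
                     n_residues: int) -> float:
    """Matthews Correlation Coefficient for one target's binding-residue prediction."""
    predicted, observed = set(predicted), set(observed)
    tp = len(predicted & observed)      # predicted and truly binding
    fp = len(predicted - observed)      # predicted but not binding
    fn = len(observed - predicted)      # binding but missed
    tn = n_residues - tp - fp - fn      # everything else
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when the coefficient is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical 20-residue protein.
print(binding_site_mcc({3, 4, 5, 9}, {4, 5, 6, 9}, n_residues=20))
```

Because binding residues are usually a small minority of a protein, MCC is more informative than plain accuracy, which a predictor could inflate by predicting no binding residues at all.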
Project description:LISE is a web server for a novel method for predicting small molecule binding sites on proteins. It differs from a number of servers currently available for such predictions in two aspects. First, rather than relying on knowledge of similar protein structures, identification of surface cavities or estimation of binding energy, LISE computes a score by counting geometric motifs extracted from sub-structures of interaction networks connecting protein and ligand atoms. These network motifs take into account spatial and physicochemical properties of ligand-interacting protein surface atoms. Second, LISE has now been more thoroughly tested, as, in addition to the evaluation we previously reported using two commonly used small benchmark test sets and targets of two community-based experiments on ligand-binding site predictions, we now report an evaluation using a large non-redundant data set containing >2000 protein-ligand complexes. This unprecedented test, the largest ever reported to our knowledge, demonstrates LISE's overall accuracy and robustness. Furthermore, we have identified some hard to predict protein classes and provided an estimate of the performance that can be expected from a state-of-the-art binding site prediction server, such as LISE, on a proteome scale. The server is freely available at http://lise.ibms.sinica.edu.tw.
Project description:PISCES is a database server for producing lists of sequences from the Protein Data Bank (PDB) using a number of entry- and chain-specific criteria and mutual sequence identity. Our goal in culling the PDB is to provide the longest list possible of the highest resolution structures that fulfill the sequence identity and structural quality cut-offs. The new PISCES server uses a combination of PSI-BLAST and structure-based alignments to determine sequence identities. Structure alignment produces more complete alignments and therefore more accurate sequence identities than PSI-BLAST. PISCES now allows a user to cull the PDB by entry in addition to the standard culling by individual chains. In this scenario, a list will contain only entries in which no chain has a sequence identity above the cut-off to any chain in any other entry in the list. PISCES also provides fully annotated sequences including gene name and species. The server allows a user to cull an input list of entries or chains, so that other criteria, such as function, can be used. Results from a search on the RCSB's re-engineered site for the PDB can be entered into the PISCES server by a single click, combining the powerful searching abilities of the PDB with PISCES's utilities for sequence culling. The server's data are updated weekly. The server is available at http://dunbrack.fccc.edu/pisces.
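The by-entry culling rule can be made concrete with a short sketch. It is a generic greedy procedure (best resolution first) written for illustration and is not claimed to be the algorithm PISCES itself uses; the function name, the identity lookup, and the PDB-style identifiers are all hypothetical.

```python
from typing import Callable, List, Tuple

Entry = Tuple[str, float, List[str]]  # (entry ID, resolution in Angstrom, chain IDs)


def cull_by_entry(entries: List[Entry],
                  identity: Callable[[str, str], float],
                  cutoff: float) -> List[str]:
    """Greedily keep entries, best resolution first, so that no chain of a kept
    entry exceeds `cutoff` percent identity to any chain of another kept entry."""
    kept: List[Entry] = []
    for entry in sorted(entries, key=lambda e: e[1]):
        _, _, chains = entry
        clashes = any(identity(a, b) > cutoff
                      for _, _, kept_chains in kept
                      for a in chains
                      for b in kept_chains)
        if not clashes:
            kept.append(entry)
    return [entry_id for entry_id, _, _ in kept]


# Hypothetical pairwise identities (percent); unlisted pairs default to 10.
pairs = {frozenset({"1abcB", "2xyzA"}): 90.0}


def toy_identity(a: str, b: str) -> float:
    return pairs.get(frozenset({a, b}), 10.0)


entries = [
    ("2xyz", 1.8, ["2xyzA"]),
    ("1abc", 1.2, ["1abcA", "1abcB"]),
    ("3def", 2.5, ["3defA"]),
]
print(cull_by_entry(entries, toy_identity, cutoff=25.0))
```

Here the second entry is dropped because one of its chains is 90% identical to a chain of a higher-resolution entry that was already kept.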
Project description:BACKGROUND:Subunit vaccines based on recombinant proteins have been effective in preventing infectious diseases and are expected to meet the demands of future vaccine development. Computational approaches, especially the reverse vaccinology (RV) method, have enormous potential for identification of protein vaccine candidates (PVCs) from a proteome. The existing protective antigen prediction software and web servers have low prediction accuracy, leading to limited applications for vaccine development. Besides machine learning techniques, those software and web servers have considered only a protein's adhesin-likeliness as the criterion for identification of PVCs. Several non-adhesin functional classes of proteins involved in host-pathogen interactions and pathogenesis are known to provide protection against bacterial infections. Therefore, knowledge of bacterial pathogenesis has the potential to identify PVCs. RESULTS:A web server, Jenner-Predict, has been developed for prediction of PVCs from proteomes of bacterial pathogens. The web server targets host-pathogen interactions and pathogenesis by considering known functional domains from protein classes such as adhesin, virulence, invasin, porin, flagellin, colonization, toxin, choline-binding, penicillin-binding, transferrin-binding, fibronectin-binding and solute-binding. It predicts non-cytosolic proteins containing the above domains as PVCs. It also provides the vaccine potential of PVCs in terms of their possible immunogenicity by comparing with experimentally known IEDB epitopes, absence of autoimmunity and conservation in different strains. Predicted PVCs are prioritized so that only a few prospective PVCs need to be validated experimentally. The performance of the web server was evaluated against known protective antigens from diverse classes of bacteria reported in the Protegen database and datasets used for VaxiJen server development.
The web server efficiently predicted known vaccine candidates reported from Streptococcus pneumoniae and Escherichia coli proteomes. The Jenner-Predict server outperformed the NERVE, Vaxign and VaxiJen methods. It has a sensitivity of 0.774 and 0.711 for the Protegen and VaxiJen datasets, respectively, while a specificity of 0.940 was obtained for the latter dataset. CONCLUSIONS:The better prediction accuracy of the Jenner-Predict web server indicates that domains involved in host-pathogen interactions and pathogenesis are better criteria for prediction of PVCs. The web server successfully predicted the maximum number of known PVCs belonging to different functional classes. The Jenner-Predict server is freely accessible at http://126.96.36.199/vaccine/home.html.
Project description:Most proteins consist of multiple domains, independent structural and evolutionary units that are often reshuffled in genomic rearrangements to form new protein architectures. Template-based modeling methods can often detect homologous templates for individual domains, but templates that could be used to model the entire query protein are often not available. We have developed a fast docking algorithm, ab initio domain assembly (AIDA), for assembling multi-domain protein structures, guided by the ab initio folding potential. This approach can be extended to discontinuous domains (i.e. domains with 'inserted' domains). When tested on experimentally solved structures of multi-domain proteins, the relative domain positions were accurately found among top 5000 models in 86% of cases. AIDA server can use domain assignments provided by the user or predict them from the provided sequence. The latter approach is particularly useful for automated protein structure prediction servers. The blind test consisting of 95 CASP10 targets shows that domain boundaries could be successfully determined for 97% of targets. The AIDA package as well as the benchmark sets used here are available for download at http://ffas.burnham.org/AIDA. Supplementary data are available at Bioinformatics online.
Project description:BACKGROUND: Accurate prediction of antigenic epitopes is important for immunologic research and medical applications, but it is still an open problem in bioinformatics. The case for discontinuous epitopes is even worse - currently there are only a few discontinuous epitope prediction servers available, though discontinuous peptides constitute the majority of all B-cell antigenic epitopes. The small number of structures for antigen-antibody complexes limits the development of reliable discontinuous epitope prediction methods and of an unbiased benchmark to evaluate developed methods. RESULTS: In this work, we present two novel server applications for discontinuous epitope prediction: EPSVR and EPMeta, where EPMeta is a meta server. EPSVR, EPMeta, and datasets are available at http://sysbio.unl.edu/services. CONCLUSION: The server application for discontinuous epitope prediction, EPSVR, uses a Support Vector Regression (SVR) method to integrate six scoring terms. Furthermore, we combined EPSVR with five existing epitope prediction servers to construct EPMeta. All methods were benchmarked on our curated independent test set, in which none of the antigens had a complex structure with an antibody and the epitopes were identified by various biochemical experiments. The area under the receiver operating characteristic curve (AUC) of EPSVR was 0.597, higher than that of any other existing single server, and EPMeta had a better performance than any single server - with an AUC of 0.638, significantly higher than PEPITO and DiscoTope (p-value < 0.05).
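The AUC values quoted above can be read as the probability that a randomly chosen epitope residue receives a higher score than a randomly chosen non-epitope residue. The sketch below computes AUC directly from that pairwise definition, counting ties as half. It assumes residues are the unit of evaluation, which the abstract does not state; the function name, scores and labels are hypothetical.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A positive ranked above a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical per-residue scores; label 1 marks an epitope residue.
print(roc_auc([0.9, 0.8, 0.7, 0.6, 0.4, 0.2], [1, 0, 1, 1, 0, 0]))
```

An AUC of 0.5 corresponds to random ranking, so values near 0.6, as reported here, indicate a modest but real signal on a difficult problem.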
Project description:Proteins with high-sequence identity but very different folds present a special challenge to sequence-based protein structure prediction methods. In particular, a 56-residue three-helical bundle protein (GA(95)) and an alpha/beta-fold protein (GB(95)), which share 95% sequence identity, were targets in the CASP-8 structure prediction contest. With only 12 out of 300 submitted server-CASP8 models for GA(95) exhibiting the correct fold, this protein proved particularly challenging despite its small size. Here, we demonstrate that the information contained in NMR chemical shifts can readily be exploited by the CS-Rosetta structure prediction program and yields adequate convergence, even when input chemical shifts are limited to just amide (1)H(N) and (15)N or (1)H(N) and (1)H(alpha) values.