Prediction of membrane transport proteins and their substrate specificities using primary sequence information.
ABSTRACT: Membrane transport proteins (transporters) move hydrophilic substrates across hydrophobic membranes and play vital roles in most cellular functions. Transporters represent a diverse group of proteins that differ in topology, energy coupling mechanism, and substrate specificity as well as sequence similarity. Among the functional annotations of transporters, information about their transporting substrates is especially important. The experimental identification and characterization of transporters is currently costly and time-consuming. The development of robust bioinformatics-based methods for the prediction of membrane transport proteins and their substrate specificities is therefore an important and urgent task.Support vector machine (SVM)-based computational models, which comprehensively utilize integrative protein sequence features such as amino acid composition, dipeptide composition, physico-chemical composition, biochemical composition, and position-specific scoring matrices (PSSM), were developed to predict the substrate specificity of seven transporter classes: amino acid, anion, cation, electron, protein/mRNA, sugar, and other transporters. An additional model to differentiate transporters from non-transporters was also developed. Among the developed models, the biochemical composition and PSSM hybrid model outperformed other models and achieved an overall average prediction accuracy of 76.69% with a Mathews correlation coefficient (MCC) of 0.49 and a receiver operating characteristic area under the curve (AUC) of 0.833 on our main dataset. This model also achieved an overall average prediction accuracy of 78.88% and MCC of 0.41 on an independent dataset.Our analyses suggest that evolutionary information (i.e., the PSSM) and the AAIndex are key features for the substrate specificity prediction of transport proteins. In comparison, similarity-based methods such as BLAST, PSI-BLAST, and hidden Markov models do not provide accurate predictions for the substrate specificity of membrane transport proteins. TrSSP: The Transporter Substrate Specificity Prediction Server, a web server that implements the SVM models developed in this paper, is freely available at http://bioinfo.noble.org/TrSSP.
Project description:S-palmitoylation, the covalent attachment of 16-carbon palmitic acids to a cysteine residue via a thioester linkage, is an important reversible lipid modification that plays a regulatory role in a variety of physiological and biological processes. As the number of experimentally identified S-palmitoylated peptides increases, it is imperative to investigate substrate motifs to facilitate the study of protein S-palmitoylation. Based on 710 non-homologous S-palmitoylation sites obtained from published databases and the literature, we carried out a bioinformatics investigation of S-palmitoylation sites based on amino acid composition. Two Sample Logo indicates that positively charged and polar amino acids surrounding S-palmitoylated sites may be associated with the substrate site specificity of protein S-palmitoylation. Additionally, maximal dependence decomposition (MDD) was applied to explore the motif signatures of S-palmitoylation sites by categorizing a large-scale dataset into subgroups with statistically significant conservation of amino acids. Single features such as amino acid composition (AAC), amino acid pair composition (AAPC), position specific scoring matrix (PSSM), position weight matrix (PWM), amino acid substitution matrix (BLOSUM62), and accessible surface area (ASA) were considered, along with the effectiveness of incorporating MDD-identified substrate motifs into a two-layered prediction model. Evaluation by five-fold cross-validation showed that a hybrid of AAC and PSSM performs best at discriminating between S-palmitoylation and non-S-palmitoylation sites, according to the support vector machine (SVM). The two-layered SVM model integrating MDD-identified substrate motifs performed well, with a sensitivity of 0.79, specificity of 0.80, accuracy of 0.80, and Matthews Correlation Coefficient (MCC) value of 0.45. Using an independent testing dataset (613 S-palmitoylated and 5412 non-S-palmitoylated sites) obtained from the literature, we demonstrated that the two-layered SVM model could outperform other prediction tools, yielding a balanced sensitivity and specificity of 0.690 and 0.694, respectively. This two-layered SVM model has been implemented as a web-based system (MDD-Palm), which is now freely available at http://csb.cse.yzu.edu.tw/MDDPalm/.
Project description:BACKGROUND: Malaria parasite secretes various proteins in infected RBC for its growth and survival. Thus identification of these secretory proteins is important for developing vaccine/drug against malaria. The existing motif-based methods have got limited success due to lack of universal motif in all secretory proteins of malaria parasite. RESULTS: In this study a systematic attempt has been made to develop a general method for predicting secretory proteins of malaria parasite. All models were trained and tested on a non-redundant dataset of 252 secretory and 252 non-secretory proteins. We developed SVM models and achieved maximum MCC 0.72 with 85.65% accuracy and MCC 0.74 with 86.45% accuracy using amino acid and dipeptide composition respectively. SVM models were developed using split-amino acid and split-dipeptide composition and achieved maximum MCC 0.74 with 86.40% accuracy and MCC 0.77 with accuracy 88.22% respectively. In this study, for the first time PSSM profiles obtained from PSI-BLAST, have been used for predicting secretory proteins. We achieved maximum MCC 0.86 with 92.66% accuracy using PSSM based SVM model. All models developed in this study were evaluated using 5-fold cross-validation technique. CONCLUSION: This study demonstrates that secretory proteins have different residue composition than non-secretory proteins. Thus, it is possible to predict secretory proteins from its residue composition-using machine learning technique. The multiple sequence alignment provides more information than sequence itself. Thus performance of method based on PSSM profile is more accurate than method based on sequence composition. A web server PSEApred has been developed for predicting secretory proteins of malaria parasites,the URL can be found in the Availability and requirements section.
Project description:BACKGROUND: Guanosine triphosphate (GTP)-binding proteins play an important role in regulation of G-protein. Thus prediction of GTP interacting residues in a protein is one of the major challenges in the field of the computational biology. In this study, an attempt has been made to develop a computational method for predicting GTP interacting residues in a protein with high accuracy (Acc), precision (Prec) and recall (Rc). RESULT: All the models developed in this study have been trained and tested on a non-redundant (40% similarity) dataset using five-fold cross-validation. Firstly, we have developed neural network based models using single sequence and PSSM profile and achieved maximum Matthews Correlation Coefficient (MCC) 0.24 (Acc 61.30%) and 0.39 (Acc 68.88%) respectively. Secondly, we have developed a support vector machine (SVM) based models using single sequence and PSSM profile and achieved maximum MCC 0.37 (Prec 0.73, Rc 0.57, Acc 67.98%) and 0.55 (Prec 0.80, Rc 0.73, Acc 77.17%) respectively. In this work, we have introduced a new concept of predicting GTP interacting dipeptide (two consecutive GTP interacting residues) and tripeptide (three consecutive GTP interacting residues) for the first time. We have developed SVM based model for predicting GTP interacting dipeptides using PSSM profile and achieved MCC 0.64 with precision 0.87, recall 0.74 and accuracy 81.37%. Similarly, SVM based model have been developed for predicting GTP interacting tripeptides using PSSM profile and achieved MCC 0.70 with precision 0.93, recall 0.73 and accuracy 83.98%. CONCLUSION: These results show that PSSM based method performs better than single sequence based method. The prediction models based on dipeptides or tripeptides are more accurate than the traditional model based on single residue. A web server "GTPBinder" http://www.imtech.res.in/raghava/gtpbinder/ based on above models has been developed for predicting GTP interacting residues in a protein.
Project description:BACKGROUND:DNA-binding proteins play a pivotal role in various intra- and extra-cellular activities ranging from DNA replication to gene expression control. Identification of DNA-binding proteins is one of the major challenges in the field of genome annotation. There have been several computational methods proposed in the literature to deal with the DNA-binding protein identification. However, most of them can't provide an invaluable knowledge base for our understanding of DNA-protein interactions. RESULTS:We firstly presented a new protein sequence encoding method called PSSM Distance Transformation, and then constructed a DNA-binding protein identification method (SVM-PSSM-DT) by combining PSSM Distance Transformation with support vector machine (SVM). First, the PSSM profiles are generated by using the PSI-BLAST program to search the non-redundant (NR) database. Next, the PSSM profiles are transformed into uniform numeric representations appropriately by distance transformation scheme. Lastly, the resulting uniform numeric representations are inputted into a SVM classifier for prediction. Thus whether a sequence can bind to DNA or not can be determined. In benchmark test on 525 DNA-binding and 550 non DNA-binding proteins using jackknife validation, the present model achieved an ACC of 79.96%, MCC of 0.622 and AUC of 86.50%. This performance is considerably better than most of the existing state-of-the-art predictive methods. When tested on a recently constructed independent dataset PDB186, SVM-PSSM-DT also achieved the best performance with ACC of 80.00%, MCC of 0.647 and AUC of 87.40%, and outperformed some existing state-of-the-art methods. CONCLUSIONS:The experiment results demonstrate that PSSM Distance Transformation is an available protein sequence encoding method and SVM-PSSM-DT is a useful tool for identifying the DNA-binding proteins. A user-friendly web-server of SVM-PSSM-DT was constructed, which is freely accessible to the public at the web-site on http://bioinformatics.hitsz.edu.cn/PSSM-DT/.
Project description:BACKGROUND: Plasminogen (Pg), the precursor of the proteolytic and fibrinolytic enzyme of blood, is converted to the active enzyme plasmin (Pm) by different plasminogen activators (tissue plasminogen activators and urokinase), including the bacterial activators streptokinase and staphylokinase, which activate Pg to Pm and thus are used clinically for thrombolysis. The identification of Pg-activators is therefore an important step in understanding their functional mechanism and derives new therapies. METHODS: In this study, different computational methods for predicting plasminogen activator peptide sequences with high accuracy were investigated, including support vector machines (SVM) based on amino acid (AC), dipeptide composition (DC), PSSM profile and Hybrid methods used to predict different Pg-activators from both prokaryotic and eukaryotic origins. RESULTS: Overall maximum accuracy, evaluated using the five-fold cross validation technique, was 88.37%, 84.32%, 87.61%, 85.63% in 0.87, 0.83,0.86 and 0.85 MCC with amino (AC) or dipeptide composition (DC), PSSM profile and Hybrid methods respectively. Through this study, we have found that the different subfamilies of Pg-activators are quite closely correlated in terms of amino, dipeptide, PSSM and Hybrid compositions. Therefore, our prediction results show that plasminogen activators are predictable with a high accuracy from their primary sequence. Prediction performance was also cross-checked by confusion matrix and ROC (Receiver operating characteristics) analysis. A web server to facilitate the prediction of Pg-activators from primary sequence data was implemented. CONCLUSION: The results show that dipeptide, PSSM profile, and Hybrid based methods perform better than single amino acid composition (AC). Furthermore, we also have developed a web server, which predicts the Pg-activators and their classification (available online at http://mamsap.it.deakin.edu.au/plas_pred/home.html). Our experimental results show that our approaches are faster and achieve generally a good prediction performance.
Project description:Succinylation is a type of protein post-translational modification (PTM), which can play important roles in a variety of cellular processes. Due to an increasing number of site-specific succinylated peptides obtained from high-throughput mass spectrometry (MS), various tools have been developed for computationally identifying succinylated sites on proteins. However, most of these tools predict succinylation sites based on traditional machine learning methods. Hence, this work aimed to carry out the succinylation site prediction based on a deep learning model. The abundance of MS-verified succinylated peptides enabled the investigation of substrate site specificity of succinylation sites through sequence-based attributes, such as position-specific amino acid composition, the composition of k-spaced amino acid pairs (CKSAAP), and position-specific scoring matrix (PSSM). Additionally, the maximal dependence decomposition (MDD) was adopted to detect the substrate signatures of lysine succinylation sites by dividing all succinylated sequences into several groups with conserved substrate motifs. According to the results of ten-fold cross-validation, the deep learning model trained using PSSM and informative CKSAAP attributes can reach the best predictive performance and also perform better than traditional machine-learning methods. Moreover, an independent testing dataset that truly did not exist in the training dataset was used to compare the proposed method with six existing prediction tools. The testing dataset comprised of 218 positive and 2621 negative instances, and the proposed model could yield a promising performance with 84.40% sensitivity, 86.99% specificity, 86.79% accuracy, and an MCC value of 0.489. Finally, the proposed method has been implemented as a web-based prediction tool (CNN-SuccSite), which is now freely accessible at http://csb.cse.yzu.edu.tw/CNN-SuccSite/ .
Project description:Functional annotation of protein sequences with low similarity to well characterized protein sequences is a major challenge of computational biology in the post genomic era. The cyclin protein family is once such important family of proteins which consists of sequences with low sequence similarity making discovery of novel cyclins and establishing orthologous relationships amongst the cyclins, a difficult task. The currently identified cyclin motifs and cyclin associated domains do not represent all of the identified and characterized cyclin sequences. We describe a Support Vector Machine (SVM) based classifier, CyclinPred, which can predict cyclin sequences with high efficiency. The SVM classifier was trained with features of selected cyclin and non cyclin protein sequences. The training features of the protein sequences include amino acid composition, dipeptide composition, secondary structure composition and PSI-BLAST generated Position Specific Scoring Matrix (PSSM) profiles. Results obtained from Leave-One-Out cross validation or jackknife test, self consistency and holdout tests prove that the SVM classifier trained with features of PSSM profile was more accurate than the classifiers based on either of the other features alone or hybrids of these features. A cyclin prediction server--CyclinPred has been setup based on SVM model trained with PSSM profiles. CyclinPred prediction results prove that the method may be used as a cyclin prediction tool, complementing conventional cyclin prediction methods.
Project description:BACKGROUND: Identification of DNA-binding proteins is one of the major challenges in the field of genome annotation, as these proteins play a crucial role in gene-regulation. In this paper, we developed various SVM modules for predicting DNA-binding domains and proteins. All models were trained and tested on multiple datasets of non-redundant proteins. RESULTS: SVM models have been developed on DNAaset, which consists of 1153 DNA-binding and equal number of non DNA-binding proteins, and achieved the maximum accuracy of 72.42% and 71.59% using amino acid and dipeptide compositions, respectively. The performance of SVM model improved from 72.42% to 74.22%, when evolutionary information in form of PSSM profiles was used as input instead of amino acid composition. In addition, SVM models have been developed on DNAset, which consists of 146 DNA-binding and 250 non-binding chains/domains, and achieved the maximum accuracy of 79.80% and 86.62% using amino acid composition and PSSM profiles. The SVM models developed in this study perform better than existing methods on a blind dataset. CONCLUSION: A highly accurate method has been developed for predicting DNA-binding proteins using SVM and PSSM profiles. This is the first study in which evolutionary information in form of PSSM profiles has been used successfully for predicting DNA-binding proteins. A web-server DNAbinder has been developed for identifying DNA-binding proteins and domains from query amino acid sequences http://www.imtech.res.in/raghava/dnabinder/.
Project description:BACKGROUND: Mannose binding proteins (MBPs) play a vital role in several biological functions such as defense mechanisms. These proteins bind to mannose on the surface of a wide range of pathogens and help in eliminating these pathogens from our body. Thus, it is important to identify mannose interacting residues (MIRs) in order to understand mechanism of recognition of pathogens by MBPs. RESULTS: This paper describes modules developed for predicting MIRs in a protein. Support vector machine (SVM) based models have been developed on 120 mannose binding protein chains, where no two chains have more than 25% sequence similarity. SVM models were developed on two types of datasets: 1) main dataset consists of 1029 mannose interacting and 1029 non-interacting residues, 2) realistic dataset consists of 1029 mannose interacting and 10320 non-interacting residues. In this study, firstly, we developed standard modules using binary and PSSM profile of patterns and got maximum MCC around 0.32. Secondly, we developed SVM modules using composition profile of patterns and achieved maximum MCC around 0.74 with accuracy 86.64% on main dataset. Thirdly, we developed a model on a realistic dataset and achieved maximum MCC of 0.62 with accuracy 93.08%. Based on this study, a standalone program and web server have been developed for predicting mannose interacting residues in proteins (http://www.imtech.res.in/raghava/premier/). CONCLUSIONS: Compositional analysis of mannose interacting and non-interacting residues shows that certain types of residues are preferred in mannose interaction. It was also observed that residues around mannose interacting residues have a preference for certain types of residues. Composition of patterns/peptide/segment has been used for predicting MIRs and achieved reasonable high accuracy. It is possible that this novel strategy may be effective to predict other types of interacting residues. This study will be useful in annotating the function of protein as well as in understanding the role of mannose in the immune system.
Project description:Protein carbonylation, an irreversible and non-enzymatic post-translational modification (PTM), is often used as a marker of oxidative stress. When reactive oxygen species (ROS) oxidized the amino acid side chains, carbonyl (CO) groups are produced especially on Lysine (K), Arginine (R), Threonine (T), and Proline (P). Nevertheless, due to the lack of information about the carbonylated substrate specificity, we were encouraged to develop a systematic method for a comprehensive investigation of protein carbonylation sites.After the removal of redundant data from multipe carbonylation-related articles, totally 226 carbonylated proteins in human are regarded as training dataset, which consisted of 307, 126, 128, and 129 carbonylation sites for K, R, T and P residues, respectively. To identify the useful features in predicting carbonylation sites, the linear amino acid sequence was adopted not only to build up the predictive model from training dataset, but also to compare the effectiveness of prediction with other types of features including amino acid composition (AAC), amino acid pair composition (AAPC), position-specific scoring matrix (PSSM), positional weighted matrix (PWM), solvent-accessible surface area (ASA), and physicochemical properties. The investigation of position-specific amino acid composition revealed that the positively charged amino acids (K and R) are remarkably enriched surrounding the carbonylated sites, which may play a functional role in discriminating between carbonylation and non-carbonylation sites. A variety of predictive models were built using various features and three different machine learning methods. Based on the evaluation by five-fold cross-validation, the models trained with PWM feature could provide better sensitivity in the positive training dataset, while the models trained with AAindex feature achieved higher specificity in the negative training dataset. Additionally, the model trained using hybrid features, including PWM, AAC and AAindex, obtained best MCC values of 0.432, 0.472, 0.443 and 0.467 on K, R, T and P residues, respectively.When comparing to an existing prediction tool, the selected models trained with hybrid features provided a promising accuracy on an independent testing dataset. In short, this work not only characterized the carbonylated substrate preference, but also demonstrated that the proposed method could provide a feasible means for accelerating preliminary discovery of protein carbonylation.