Dataset Information

Identification of hot regions in protein-protein interactions by sequential pattern mining.

ABSTRACT:

Background

Identification of protein interacting sites is an important task in computational molecular biology. As more and more protein sequences are deposited without available structural information, it is highly desirable to predict protein binding regions from sequence alone. This paper presents a pattern mining approach to this problem. It is observed that a functional region of a protein structure usually consists of several peptide segments linked by large wildcard regions. The proposed mining technique therefore allows large irregular gaps when growing patterns, in order to find residues that are simultaneously conserved but widely separated in the sequences. A derived pattern is called a cluster-like pattern because the discovered conserved residues are always grouped into several blocks, each of which corresponds to a locally conserved region of the protein sequence.
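
The notion of conserved blocks separated by large irregular gaps can be illustrated with a minimal Python sketch. It is not the mining algorithm of the paper or of MAGIIC-PRO: it only checks how many sequences contain a pattern that is already given. The function names, the regular-expression encoding, the gap bounds, and the toy sequences are all assumptions made for illustration.

```python
import re
from typing import List, Optional, Pattern, Sequence, Tuple

Gap = Tuple[int, int]  # (min, max) number of wildcard residues between two blocks


def compile_pattern(blocks: Sequence[str], gaps: Sequence[Gap]) -> Pattern[str]:
    """Build a regex for a block/gap pattern (hypothetical encoding).

    Each block is a short string of residues in which "." matches any
    single residue; consecutive blocks are separated by a bounded gap.
    """
    if len(gaps) != len(blocks) - 1:
        raise ValueError("need exactly one gap between each pair of blocks")
    parts = [f"({blocks[0]})"]
    for (lo, hi), block in zip(gaps, blocks[1:]):
        parts.append(f".{{{lo},{hi}}}")  # flexible wildcard region
        parts.append(f"({block})")       # next conserved block
    return re.compile("".join(parts))


def support(pattern: Pattern[str], sequences: Sequence[str]) -> float:
    """Fraction of sequences that contain the pattern at least once."""
    if not sequences:
        return 0.0
    return sum(1 for s in sequences if pattern.search(s)) / len(sequences)


def locate_blocks(pattern: Pattern[str], sequence: str) -> Optional[List[Tuple[int, int]]]:
    """Return (start, end) spans of each block in the first match, or None."""
    match = pattern.search(sequence)
    if match is None:
        return None
    return [match.span(i) for i in range(1, (pattern.groups or 0) + 1)]


if __name__ == "__main__":
    # Toy sequences, invented for this example only.
    toy = [
        "MK" + "CAH" + "G" * 10 + "DEWLL",
        "AA" + "CTH" + "P" * 16 + "DEWQ",
        "MKLLV",
    ]
    pat = compile_pattern(["C.H", "DEW"], [(5, 30)])
    print(support(pat, toy))
    print(locate_blocks(pat, toy[0]))
```

A real miner would grow such patterns from the data and tune the gap bounds; here they are fixed by hand so that the block/gap structure is easy to see.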

Results

The experiments conducted in this work demonstrate that the derived long patterns automatically discover the important residues that form one or several hot regions of protein-protein interactions. The methodology is evaluated by conducting experiments on the web server MAGIIC-PRO based on a well-known benchmark containing 220 protein chains from 72 distinct complexes. Among the 218 tested proteins, 900 sequential blocks were discovered, 4.25 blocks per protein chain on average. About 92% of the derived blocks are observed to be clustered in space with at least one of the other blocks, and about 66% of the blocks are found to be near the interface of protein-protein interactions. In summary, for about 83% of the tested proteins, at least two interacting blocks can be discovered by this approach.
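
One way to make "clustered in space with at least one of the other blocks" concrete is sketched below in Python. The abstract does not state how spatial clustering was measured, so everything here is an assumption: blocks are represented by lists of 3D residue coordinates, two blocks count as clustered when any pair of their residues lies within a distance cutoff, and the 8.0 Å default is an arbitrary illustrative value rather than the paper's criterion.

```python
from itertools import combinations
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]  # one residue position, e.g. a C-alpha atom


def blocks_in_contact(block_a: Sequence[Point], block_b: Sequence[Point],
                      cutoff: float = 8.0) -> bool:
    """True if any residue of block_a is within `cutoff` of any residue of block_b.

    The cutoff (in the same units as the coordinates) is an assumed value.
    """
    return any(dist(p, q) <= cutoff for p in block_a for q in block_b)


def fraction_clustered(blocks: Sequence[Sequence[Point]], cutoff: float = 8.0) -> float:
    """Fraction of blocks that are in contact with at least one other block."""
    n = len(blocks)
    if n == 0:
        return 0.0
    clustered: List[bool] = [False] * n
    for i, j in combinations(range(n), 2):
        if blocks_in_contact(blocks[i], blocks[j], cutoff):
            clustered[i] = clustered[j] = True
    return sum(clustered) / n


if __name__ == "__main__":
    # Invented coordinates: the first two blocks are close, the third is far away.
    toy_blocks = [
        [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)],
        [(5.0, 0.0, 0.0)],
        [(50.0, 50.0, 50.0)],
    ]
    print(fraction_clustered(toy_blocks))
```

The same pairwise test could be reused for the "near the interface" statistic by comparing each block against the residues of the binding partner instead of against other blocks.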

Conclusion

This work aims to demonstrate that the important residues associated with the interface of protein-protein interactions may be automatically discovered by sequential pattern mining. The detected regions are highly conserved and are thus considered computational hot regions. This information would be useful for characterizing protein sequences, predicting protein function, finding potential partners, and facilitating protein docking for drug discovery.

SUBMITTER: Hsu CM

PROVIDER: S-EPMC1892096 | biostudies-literature | 2007 May

REPOSITORIES: biostudies-literature

Publications

Identification of hot regions in protein-protein interactions by sequential pattern mining.

Hsu Chen-Ming, Chen Chien-Yu, Liu Baw-Jhiune, Huang Chih-Chang, Laio Min-Hung, Lin Chien-Chieh, Wu Tzung-Lin

BMC Bioinformatics, 2007-05-24

PMID: 17570867

Similar Datasets

Project description: BACKGROUND: Non-coding RNAs (ncRNAs) play crucial roles in many biological processes, such as post-transcriptional gene regulation. ncRNAs mainly function through interaction with RNA binding proteins (RBPs). To understand the function of a ncRNA, a fundamental step is to identify which protein is involved in its interaction. Therefore it is promising to computationally predict RBPs, where the major challenge is that the interaction pattern or motif is difficult to find. RESULTS: In this study, we propose a computational method IPMiner (Interaction Pattern Miner) to predict ncRNA-protein interactions from sequences, which makes use of deep learning and further improves its performance using stacked ensembling. One of IPMiner's typical merits is that it is able to mine the hidden sequential interaction patterns from sequence composition features of protein and RNA sequences using a stacked autoencoder, and then the learned hidden features are fed into random forest models. Finally, stacked ensembling is used to integrate different predictors to further improve the prediction performance. The experimental results indicate that IPMiner achieves superior performance on the tested lncRNA-protein interaction dataset with an accuracy of 0.891, sensitivity of 0.939, specificity of 0.831, precision of 0.945 and Matthews correlation coefficient of 0.784. We further comprehensively investigate IPMiner on other RNA-protein interaction datasets, where it yields better performance than the state-of-the-art methods, with an increase of over 20% on some tested benchmark datasets. In addition, we further apply IPMiner to large-scale prediction of the ncRNA-protein network, which achieves promising prediction performance. CONCLUSION: By integrating a deep neural network and stacked ensembling, from simple sequence composition features, IPMiner can automatically learn high-level abstraction features, which have strong discriminant ability for RNA-protein detection. IPMiner achieved high performance on our constructed lncRNA-protein benchmark dataset and other RNA-protein datasets. The IPMiner tool is available at http://www.csbio.sjtu.edu.cn/bioinf/IPMiner .

Project description: BACKGROUND AND OBJECTIVE: Increases in outpatients seeking medical check-ups are expanding the number of health examination data records, which can be utilized for medical strategic planning and other purposes. However, because hospital visits by outpatients seeking medical check-ups are unpredictable, those patients often cannot receive optimal service due to limited hospital facilities. To resolve this problem, this study attempted to predict re-visit patterns of outpatients. METHOD: Two-phase sequential pattern mining (SPM) and an association mining method were chosen to predict patient returns using sequential data. The data were grouped according to the outpatients' personal information and evaluated by a discriminant analysis to check the significance of the grouping. Furthermore, SPM was employed to generate frequency patterns from each group and extract a general association pattern of return. RESULTS: Results of sequence patterns and association mining in this study provided valuable insights in terms of outpatients' re-visit behaviors for regular medical check-ups. Cosine and Jaccard are two symmetric measures which were used in this study to indicate the degree of association between two variables. For instance, Jaccard values of the variable abnormal blood pressure associated with an abnormal body-mass index (BMI) and/or abnormal blood sugar were 47.5% and 100% for the two-visit and three-visit behavior patterns, respectively. These results indicated that the corresponding pair of variables was more reliable when covering the three-visit behavior pattern than the two-visit behavior pattern. Thus, appropriate preventive measures or suggestions for other medical treatments can be prepared for outpatients that have this pattern on their third visit. The higher degree of association implies that the corresponding behavior pattern might influence outpatients' intentions to regularly seek medical check-ups concerning the risk of stroke. Furthermore, a radiology diagnosis (i.e., magnetic resonance imaging or neck vascular ultrasound) plays an important role in the association with a re-visit behavior pattern, with Cosine and Jaccard values of 50% and 70%, respectively, in general behavior {f11}∧{f01}. These findings can serve as valuable information to increase the quality of medical services and marketing, by suggesting appropriate treatment for the subsequent visit after learning the behavior patterns. CONCLUSIONS: The proposed method can provide valuable information related to outpatients' re-visit behavior patterns based on hidden knowledge generated from sequential patterns and association mining results. For marketing purposes, medical practitioners can take behavior patterns studied in this paper into account to raise patients' awareness of several possible medical conditions that might arise on subsequent visits and encourage them to take preventive measures or suggest other medical treatments.

Project description: BACKGROUND: Deciphering physical protein-protein interactions is fundamental to elucidating both the functions of proteins and biological processes. The development of high-throughput experimental technologies such as yeast two-hybrid screening has produced an explosion in data relating to interactions. Since manual curation is intensive in terms of time and cost, there is an urgent need for text-mining tools to facilitate the extraction of such information. The BioCreative (Critical Assessment of Information Extraction systems in Biology) challenge evaluation provided common standards and shared evaluation criteria to enable comparisons among different approaches. RESULTS: During the benchmark evaluation of BioCreative 2006, all of our results ranked in the top three places. In the task of filtering articles irrelevant to physical protein interactions, our method contributes a precision of 75.07%, a recall of 81.07%, and an AUC (area under the receiver operating characteristic curve) of 0.847. In the task of identifying protein mentions and normalizing mentions to molecule identifiers, our method is competitive among runs submitted, with a precision of 34.83%, a recall of 24.10%, and an F1 score of 28.5%. In extracting protein interaction pairs, our profile-based method was competitive on the SwissProt-only subset (precision = 36.95%, recall = 32.68%, and F1 score = 30.40%) and on the entire dataset (30.96%, 29.35%, and 26.20%, respectively). From the biologist's point of view, however, these findings are far from satisfactory. The error analysis presented in this report provides insight into how performance could be improved: three-quarters of false negatives were due to protein normalization problems (532/698), and about one-quarter were due to problems with correctly extracting interactions for this system. CONCLUSION: We present a text-mining framework to extract physical protein-protein interactions from the literature. Three key issues are addressed, namely filtering irrelevant articles, identifying protein names and normalizing them to molecule identifiers, and extracting protein-protein interactions. Our system is among the top three performers in the benchmark evaluation of BioCreative 2006. The tool will be helpful for manual interaction curation and can greatly facilitate the process of extracting protein-protein interactions.

Project description: BACKGROUND: The local connectivity and global position of a protein in a protein interaction network are known to correlate with some of its functional properties, including its essentiality or dispensability. It is therefore of interest to extend this observation and examine whether network properties of two proteins considered simultaneously can determine their joint dispensability, i.e., their propensity for synthetic sick/lethal interaction. Accordingly, we examine the predictive power of protein interaction networks for synthetic genetic interaction in Saccharomyces cerevisiae, an organism in which high confidence protein interaction networks are available and synthetic sick/lethal gene pairs have been extensively identified. RESULTS: We design a support vector machine system that uses graph-theoretic properties of two proteins in a protein interaction network as input features for prediction of synthetic sick/lethal interactions. The system is trained on interacting and non-interacting gene pairs culled from large scale genetic screens as well as literature-curated data. We find that the method is capable of predicting synthetic genetic interactions with sensitivity and specificity both exceeding 85%. We further find that the prediction performance is reasonably robust with respect to errors in the protein interaction network and with respect to changes in the features of test datasets. Using the prediction system, we carried out novel predictions of synthetic sick/lethal gene pairs at a genome-wide scale. These pairs appear to have functional properties that are similar to those that characterize the known synthetic lethal gene pairs. CONCLUSION: Our analysis shows that protein interaction networks can be used to predict synthetic lethal interactions with accuracies on par with or exceeding that of other computational methods that use a variety of input features, including functional annotations. This indicates that protein interaction networks could plausibly be rich sources of information about epistatic effects among genes.
