Learning to translate sequence and structure to function: identifying DNA binding and membrane binding proteins.
ABSTRACT: A protein's function depends in a large part on interactions with other molecules. With an increasing number of protein structures becoming available every year, a corresponding structural annotation approach identifying such interactions grows more expedient. At the same time, machine learning has gained popularity in bioinformatics providing robust annotation of genes and proteins without sequence homology. Here we have developed a general machine learning protocol to identify proteins that bind DNA and membrane. In general, there is no theory or even rule of thumb to pick the best machine learning algorithm. Thus, a systematic comparison of several classification algorithms known to perform well is investigated. Indeed, the boosted tree classifier is found to give the best performance, achieving 93% and 88% accuracy to discriminate non-homologous proteins that bind membrane and DNA, respectively, significantly outperforming all previously published works. We also attempted to address the importance of the attributes in function prediction and the relationships between relevant attributes. A graphical model based on boosted trees is applied to study the important features in discriminating DNA-binding proteins. In summary, the current protocol identified physical features important in DNA and membrane binding, rather than annotating function through sequence similarity.
Project description:MOTIVATION: Peripheral membrane-targeting domain (MTD) families, such as C1-, C2- and PH domains, play a key role in signal transduction and membrane trafficking by dynamically translocating their parent proteins to specific plasma membranes when changes in lipid composition occur. It is, however, difficult to determine the subset of domains within families displaying this property, as sequence motifs signifying the membrane binding properties are not well defined. For this reason, procedures based on sequence similarity alone are often insufficient in computational identification of MTDs within families (yielding less than 65% accuracy even with a sequence identity of 70%). RESULTS: We present a machine learning protocol for determining membrane-targeting properties achieving 85-90% accuracy in separating binding and non-binding domains within families. Our model is based on features from both sequence and structure, thereby incorporation statistics obtained from the entire domain family and domain-specific physical quantities such as surface electrostatics. In addition, by using the enriched rules in alternating decision tree classifiers, we are able to determine the meaning of the assigned function labels in terms of biological mechanisms. CONCLUSIONS: The high accuracy of the learned models and good agreement between the rules discovered using the ADtree classifier and mechanisms reported in the literature reflect the value of machine learning protocols in both prediction and biological knowledge discovery. Our protocol can thus potentially be used as a general function annotation and knowledge mining tool for other protein domains. AVAILABILITY: metador.bioengr.uic.edu CONTACT: firstname.lastname@example.org.
Project description:Membrane-binding peripheral proteins play important roles in many biological processes, including cell signaling and membrane trafficking. Unlike integral membrane proteins, these proteins bind the membrane mostly in a reversible manner. Since peripheral proteins do not have canonical transmembrane segments, it is difficult to identify them from their amino acid sequences. As a first step toward genome-scale identification of membrane-binding peripheral proteins, we built a kernel-based machine learning protocol. Key features of known membrane-binding proteins, including electrostatic properties and amino acid composition, were calculated from their amino acid sequences and tertiary structures, which were then incorporated into the support vector machine to perform the classification. A data set of 40 membrane-binding proteins and 230 non-membrane-binding proteins was used to construct and validate the protocol. Cross-validation and holdout evaluation of the protocol showed that the accuracy of the prediction reached up to 93.7% and 91.6%, respectively. The protocol was applied to the prediction of membrane-binding properties of four C2 domains from novel protein kinases C. Although these C2 domains have 50% sequence identity, only one of them was predicted to bind the membrane, which was verified experimentally with surface plasmon resonance analysis. These results suggest that our protocol can be used for predicting membrane-binding properties of a wide variety of modular domains and may be further extended to genome-scale identification of membrane-binding peripheral proteins.
Project description:Minimotifs are short peptide sequences within one protein, which are recognized by other proteins or molecules. While there are now several minimotif databases, they are incomplete. There are reports of many minimotifs in the primary literature, which have yet to be annotated, while entirely novel minimotifs continue to be published on a weekly basis. Our recently proposed function and sequence syntax for minimotifs enables us to build a general tool that will facilitate structured annotation and management of minimotif data from the biomedical literature.We have built the MimoSA application for minimotif annotation. The application supports management of the Minimotif Miner database, literature tracking, and annotation of new minimotifs. MimoSA enables the visualization, organization, selection and editing functions of minimotifs and their attributes in the MnM database. For the literature components, Mimosa provides paper status tracking and scoring of papers for annotation through a freely available machine learning approach, which is based on word correlation. The paper scoring algorithm is also available as a separate program, TextMine. Form-driven annotation of minimotif attributes enables entry of new minimotifs into the MnM database. Several supporting features increase the efficiency of annotation. The layered architecture of MimoSA allows for extensibility by separating the functions of paper scoring, minimotif visualization, and database management. MimoSA is readily adaptable to other annotation efforts that manually curate literature into a MySQL database.MimoSA is an extensible application that facilitates minimotif annotation and integrates with the Minimotif Miner database. We have built MimoSA as an application that integrates dynamic abstract scoring with a high performance relational model of minimotif syntax. MimoSA's TextMine, an efficient paper-scoring algorithm, can be used to dynamically rank papers with respect to context.
Project description:Function annotation efforts provide a foundation to our understanding of cellular processes and the functioning of the living cell. This motivates high-throughput computational methods to characterize new protein members of a particular function. Research work has focused on discriminative machine-learning methods, which promise to make efficient, de novo predictions of protein function. Furthermore, available function annotation exists predominantly for individual proteins rather than residues of which only a subset is necessary for the conveyance of a particular function. This limits discriminative approaches to predicting functions for which there is sufficient residue-level annotation, e.g., identification of DNA-binding proteins or where an excellent global representation can be divined. Complete understanding of the various functions of proteins requires discovery and functional annotation at the residue level. Herein, we cast this problem into the setting of multiple-instance learning, which only requires knowledge of the protein's function yet identifies functionally relevant residues and need not rely on homology. We developed a new multiple-instance leaning algorithm derived from AdaBoost and benchmarked this algorithm against two well-studied protein function prediction tasks: annotating proteins that bind DNA and RNA. This algorithm outperforms certain previous approaches in annotating protein function while identifying functionally relevant residues involved in binding both DNA and RNA, and on one protein-DNA benchmark, it achieves near perfect classification.
Project description:BACKGROUND:We introduce TranScriptML, a semantic representation schema for prescription regimens allowing various properties of prescriptions (e.g. dose, frequency, route) to be specified separately and applied (manually or automatically) as annotations to patient instructions. In this paper, we describe the annotation schema, the curation of a corpus of prescription instructions through a manual annotation effort, and initial experiments in modeling and automated generation of TranScriptML representations. RESULTS:TranScriptML was developed in the process of curating a corpus of 2914 ambulatory prescriptions written within the Partners Healthcare network, and its schema is informed by the content of that corpus. We developed the representation schema as a novel set of semantic tags for prescription concept categories (e.g. frequency); each tag label is defined with an accompanying attribute framework in which the meaning of tagged concepts can be specified in a normalized fashion. We annotated a subset (1746) of this dataset using cross-validation and reconciliation between multiple annotators, and used Conditional Random Field machine learning and various other methods to train automated annotation models based on the manual annotations. The TranScriptML schema implementation, manual annotation, and machine learning were all performed using the MITRE Annotation Toolkit (MAT). We report that our annotation schema can be applied with varying levels of pairwise agreement, ranging from low agreement levels (0.125 F for the relatively rare REFILL tag) to high agreement levels approaching 0.9 F for some of the more frequent tags. We report similarly variable scores for modeling tag labels and spans, averaging 0.748 F-measure with balanced precision and recall. The best of our various attribute modeling methods captured most attributes with accuracy above 0.9. CONCLUSIONS:We have described an annotation schema for prescription regimens, and shown that it is possible to annotate prescription regimens at high accuracy for many tag types. We have further shown that many of these tags and attributes can be modeled at high accuracy with various techniques. By structuring the textual representation through annotation enriched with normalized values, the text can be compared against the pharmacist-entered structured data, offering an opportunity to detect and correct discrepancies.
Project description:BACKGROUND: There has been an explosion in the number of single nucleotide polymorphisms (SNPs) within public databases. In this study we focused on non-synonymous protein coding single nucleotide polymorphisms (nsSNPs), some associated with disease and others which are thought to be neutral. We describe the distribution of both types of nsSNPs using structural and sequence based features and assess the relative value of these attributes as predictors of function using machine learning methods. We also address the common problem of balance within machine learning methods and show the effect of imbalance on nsSNP function prediction. We show that nsSNP function prediction can be significantly improved by 100% undersampling of the majority class. The learnt rules were then applied to make predictions of function on all nsSNPs within Ensembl. RESULTS: The measure of prediction success is greatly affected by the level of imbalance in the training dataset. We found the balanced dataset that included all attributes produced the best prediction. The performance as measured by the Matthews correlation coefficient (MCC) varied between 0.49 and 0.25 depending on the imbalance. As previously observed, the degree of sequence conservation at the nsSNP position is the single most useful attribute. In addition to conservation, structural predictions made using a balanced dataset can be of value. CONCLUSION: The predictions for all nsSNPs within Ensembl, based on a balanced dataset using all attributes, are available as a DAS annotation. Instructions for adding the track to Ensembl are at http://www.brightstudy.ac.uk/das_help.html.
Project description:BACKGROUND:The postmortem microbiome can provide valuable information to a death investigation and to the human health of the once living. Microbiome sequencing produces, in general, large multi-dimensional datasets that can be difficult to analyze and interpret. Machine learning methods can be useful in overcoming this analytical challenge. However, different methods employ distinct strategies to handle complex datasets. It is unclear whether one method is more appropriate than others for modeling postmortem microbiomes and their ability to predict attributes of interest in death investigations, which require understanding of how the microbial communities change after death and may represent those of the once living host. METHODS AND FINDINGS:Postmortem microbiomes were collected by swabbing five anatomical areas during routine death investigation, sequenced and analyzed from 188 death cases. Three machine learning methods (boosted algorithms, random forests, and neural networks) were compared with respect to their abilities to predict case attributes: postmortem interval (PMI), location of death, and manner of death. Accuracy depended on the method used, the numbers of anatomical areas analyzed, and the predicted attribute of death. CONCLUSIONS:All algorithms performed well but with distinct features to their performance. Xgboost often produced the most accurate predictions but may also be more prone to overfitting. Random forest was the most stable across predictions that included more anatomic areas. Analysis of postmortem microbiota from more than three anatomic areas appears to yield limited returns on accuracy, with the eyes and rectum providing the most useful information correlating with circumstances of death in most cases for this dataset.
Project description:Nucleic acid-binding proteins are involved in a great number of cellular processes. Understanding the mechanisms underlying these proteins first requires the identification of specific residues involved in nucleic acid binding. Prediction of NA-binding residues can provide practical assistance in the functional annotation of NA-binding proteins. Predictions can also be used to expedite mutagenesis experiments, guiding researchers to the correct binding residues in these proteins. Here, we present a method for the identification of amino acid residues involved in DNA- and RNA-binding using sequence-based attributes. The method used in this work combines the C4.5 algorithm with bootstrap aggregation and cost-sensitive learning. Our DNA-binding model achieved 79.1% accuracy, while the RNA-binding model reached an accuracy of 73.2%. The NAPS web server is freely available at http://proteomics.bioengr.uic.edu/NAPS.
Project description:We aimed to develop machine learning models to accurately predict bronchiolitis severity, and to compare their predictive performance with a conventional scoring (reference) model. In a 17-center prospective study of infants (aged?<?1 year) hospitalized for bronchiolitis, by using routinely-available pre-hospitalization data as predictors, we developed four machine learning models: Lasso regression, elastic net regression, random forest, and gradient boosted decision tree. We compared their predictive performance-e.g., area-under-the-curve (AUC), sensitivity, specificity, and net benefit (decision curves)-using a cross-validation method, with that of the reference model. The outcomes were positive pressure ventilation use and intensive treatment (admission to intensive care unit and/or positive pressure ventilation use). Of 1,016 infants, 5.4% underwent positive pressure ventilation and 16.0% had intensive treatment. For the positive pressure ventilation outcome, machine learning models outperformed reference model (e.g., AUC 0.88 [95% CI 0.84-0.93] in gradient boosted decision tree vs 0.62 [95% CI 0.53-0.70] in reference model), with higher sensitivity (0.89 [95% CI 0.80-0.96] vs. 0.62 [95% CI 0.49-0.75]) and specificity (0.77 [95% CI 0.75-0.80] vs. 0.57 [95% CI 0.54-0.60]). The machine learning models also achieved a greater net benefit over ranges of clinical thresholds. Machine learning models consistently demonstrated a superior ability to predict acute severity and achieved greater net benefit.
Project description:BACKGROUND: DNA repair is the general term for the collection of critical mechanisms which repair many forms of DNA damage such as methylation or ionizing radiation. DNA repair has mainly been studied in experimental and clinical situations, and relatively few information-based approaches to new extracting DNA repair knowledge exist. As a first step, automatic detection of DNA repair proteins in genomes via informatics techniques is desirable; however, there are many forms of DNA repair and it is not a straightforward process to identify and classify repair proteins with a single optimal method. We perform a study of the ability of homology and machine learning-based methods to identify and classify DNA repair proteins, as well as scan vertebrate genomes for the presence of novel repair proteins. Combinations of primary sequence polypeptide frequency, secondary structure, and homology information are used as feature information for input to a Support Vector Machine (SVM). RESULTS: We identify that SVM techniques are capable of identifying portions of DNA repair protein datasets without admitting false positives; at low levels of false positive tolerance, homology can also identify and classify proteins with good performance. Secondary structure information provides improved performance compared to using primary structure alone. Furthermore, we observe that machine learning methods incorporating homology information perform best when data is filtered by some clustering technique. Analysis by applying these methodologies to the scanning of multiple vertebrate genomes confirms a positive correlation between the size of a genome and the number of DNA repair protein transcripts it is likely to contain, and simultaneously suggests that all organisms have a non-zero minimum number of repair genes. In addition, the scan result clusters several organisms' repair abilities in an evolutionarily consistent fashion. Analysis also identifies several functionally unconfirmed proteins that are highly likely to be involved in the repair process. A new web service, INTREPED, has been made available for the immediate search and annotation of DNA repair proteins in newly sequenced genomes. CONCLUSION: Despite complexity due to a multitude of repair pathways, combinations of sequence, structure, and homology with Support Vector Machines offer good methods in addition to existing homology searches for DNA repair protein identification and functional annotation. Most importantly, this study has uncovered relationships between the size of a genome and a genome's available repair repertoire, and offers a number of new predictions as well as a prediction service, both which reduce the search time and cost for novel repair genes and proteins.