Dataset Information

Learning to translate sequence and structure to function: identifying DNA binding and membrane binding proteins.


ABSTRACT: A protein's function depends in large part on its interactions with other molecules. As the number of available protein structures grows every year, structural annotation approaches that identify such interactions become increasingly useful. At the same time, machine learning has gained popularity in bioinformatics because it can provide robust annotation of genes and proteins without relying on sequence homology. Here we have developed a general machine learning protocol to identify proteins that bind DNA or membranes. Because there is no theory, or even a rule of thumb, for picking the best machine learning algorithm, we systematically compared several classification algorithms known to perform well. The boosted tree classifier gave the best performance, achieving 93% and 88% accuracy in discriminating non-homologous proteins that bind membrane and DNA, respectively, significantly outperforming all previously published works. We also examined the importance of individual attributes in function prediction and the relationships between relevant attributes. A graphical model based on boosted trees was applied to study the features that are important in discriminating DNA-binding proteins. In summary, the current protocol identifies physical features important in DNA and membrane binding, rather than annotating function through sequence similarity.
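
The abstract names a boosted tree classifier but gives no implementation details, so the following is only a minimal sketch of the general idea of boosting, not the authors' protocol. It assumes AdaBoost over one-level trees (decision stumps) and a made-up one-feature toy dataset; every function name, the data, and the number of rounds are hypothetical and chosen purely for illustration.

```python
import math

def stump_predict(stump, row):
    """Predict +1/-1 with a one-level tree: (feature index, threshold, sign)."""
    j, t, sign = stump
    return sign if row[j] > t else -sign

def fit_stump(X, y, w):
    """Return (weighted error, stump) for the best single-threshold rule."""
    best_err, best_stump = None, None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for sign in (1, -1):
                stump = (j, t, sign)
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(stump, row) != yi)
                if best_err is None or err < best_err:
                    best_err, best_stump = err, stump
    return best_err, best_stump

def adaboost(X, y, rounds):
    """Train an AdaBoost ensemble of stumps; labels must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform sample weights
    ensemble = []
    for _ in range(rounds):
        err, stump = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)   # avoid log(0) and division by 0
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this stump
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified samples, decrease the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    """Weighted majority vote of all stumps."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

def accuracy(ensemble, X, y):
    hits = sum(predict(ensemble, row) == yi for row, yi in zip(X, y))
    return hits / len(y)

# Hypothetical toy data: the positive class sits in the middle of the range,
# so no single threshold can separate it, but a few combined stumps can.
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [-1, -1, 1, 1, -1, -1]

single = adaboost(X, y, rounds=1)
boosted = adaboost(X, y, rounds=3)
print("one stump:   ", accuracy(single, X, y))
print("three stumps:", accuracy(boosted, X, y))
```

On this toy data a single stump misclassifies two of the six points, while three boosted stumps fit all six. These are training-set figures for a contrived example and say nothing about the 93% and 88% accuracies reported in the abstract, which refer to non-homologous proteins.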

SUBMITTER: Langlois RE 

PROVIDER: S-EPMC2706547 | BioStudies | 2007-01-01

REPOSITORIES: biostudies

Similar Datasets

2012-01-01 | S-EPMC3436836 | BioStudies
2006-01-01 | S-EPMC2707359 | BioStudies
1000-01-01 | S-EPMC2905367 | BioStudies
2019-01-01 | S-EPMC6729729 | BioStudies
2019-01-01 | S-EPMC6544933 | BioStudies
2006-01-01 | S-EPMC1489951 | BioStudies
2019-01-01 | S-EPMC6464165 | BioStudies
1000-01-01 | S-EPMC2896077 | BioStudies
2020-01-01 | S-EPMC7335203 | BioStudies
2009-01-01 | S-EPMC2660303 | BioStudies