Dataset Information

Transferable deep generative modeling of intrinsically disordered protein conformations.

ABSTRACT: Intrinsically disordered proteins have dynamic structures through which they play key biological roles. The elucidation of their conformational ensembles is a challenging problem requiring an integrated use of computational and experimental methods. Molecular simulations are a valuable computational strategy for constructing structural ensembles of disordered proteins but are highly resource-intensive. Recently, machine learning approaches based on deep generative models that learn from simulation data have emerged as an efficient alternative for generating structural ensembles. However, such methods currently suffer from limited transferability when modeling sequences and conformations absent in the training data. Here, we develop a novel generative model that achieves high levels of transferability for intrinsically disordered protein ensembles. The approach, named idpSAM, is a latent diffusion model based on transformer neural networks. It combines an autoencoder to learn a representation of protein geometry and a diffusion model to sample novel conformations in the encoded space. IdpSAM was trained on a large dataset of simulations of disordered protein regions performed with the ABSINTH implicit solvent model. Thanks to the expressiveness of its neural networks and its training stability, idpSAM faithfully captures 3D structural ensembles of test sequences with no similarity to the training set. Our study also demonstrates the potential for generating full conformational ensembles from datasets with limited sampling and underscores the importance of training set size for generalization. We believe that idpSAM represents significant progress in transferable protein ensemble modeling through machine learning.

Author summary

Proteins are essential molecules in living organisms, and some of them have highly dynamical structures, which makes understanding their biological roles challenging. Disordered proteins can be studied through a combination of computer simulations and experiments. Computer simulations are often resource-intensive. Recently, machine learning has been used to make this process more efficient. The strategy is to learn from previous simulations to model the heterogeneous conformations of proteins. However, such methods still suffer from poor transferability, meaning that they tend to make incorrect predictions on proteins not seen in the training data. In this study, we present idpSAM, a method based on generative artificial intelligence for modeling the structures of disordered proteins. The model was trained using a vast dataset and, thanks to its architecture and training procedure, it not only performs well on proteins in the training set but also achieves high levels of transferability to proteins unseen in training. This advancement is a step forward in modeling biologically relevant disordered proteins. It shows how the combination of generative modeling and large training sets can help us understand how dynamical proteins behave.

SUBMITTER: Janson G

PROVIDER: S-EPMC10871340 | biostudies-literature | 2024 Feb

REPOSITORIES: biostudies-literature

Publications

Transferable deep generative modeling of intrinsically disordered protein conformations.

Giacomo Janson, Michael Feig

bioRxiv: the preprint server for biology, 2024-02-08

PMID: 38370653

Similar Datasets

Cooperative unfolding of compact conformations of the intrinsically disordered protein osteopontin. | S-EPMC3737600 | biostudies-literature

Modeling intrinsically disordered proteins with bayesian statistics. | S-EPMC2956375 | biostudies-other

Beyond monopole electrostatics in regulating conformations of intrinsically disordered proteins. | S-EPMC11382291 | biostudies-literature

How multisite phosphorylation impacts the conformations of intrinsically disordered proteins. | S-EPMC8148376 | biostudies-literature

The neuroendocrine protein 7B2 is intrinsically disordered. | S-EPMC3457758 | biostudies-other

Deep generative modeling for single-cell transcriptomics. | S-EPMC6289068 | biostudies-literature

Protein kinases phosphorylate long disordered regions in intrinsically disordered proteins. | S-EPMC6954741 | biostudies-literature

Folding-upon-binding pathways of an intrinsically disordered protein from a deep Markov state model. | S-EPMC10401938 | biostudies-literature

Diffusing protein binders to intrinsically disordered proteins. | S-EPMC11275890 | biostudies-literature

Conformational recognition of an intrinsically disordered protein. | S-EPMC4008819 | biostudies-literature