Dataset Information

Large language models generate functional protein sequences across diverse families.


ABSTRACT: Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Here we describe ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned to curated sequences and tags to improve controllable generation performance of proteins from families with sufficient homologous samples. Artificial proteins fine-tuned to five distinct lysozyme families showed catalytic efficiencies similar to those of natural lysozymes, with sequence identity to natural proteins as low as 31.4%. ProGen is readily adapted to diverse protein families, as we demonstrate with chorismate mutase and malate dehydrogenase.

SUBMITTER: Madani A 

PROVIDER: S-EPMC10400306 | biostudies-literature | 2023 Aug

REPOSITORIES: biostudies-literature

Similar Datasets

| S-EPMC10087057 | biostudies-literature
| S-EPMC11913289 | biostudies-literature
| S-EPMC11870394 | biostudies-literature
| S-EPMC11358375 | biostudies-literature
| S-EPMC3205580 | biostudies-literature
| S-EPMC11893428 | biostudies-literature
| S-EPMC10701588 | biostudies-literature
| S-EPMC10626176 | biostudies-literature
| S-EPMC3098182 | biostudies-literature
| S-EPMC11501434 | biostudies-literature