Dataset Information

Nucleotide augmentation for machine learning-guided protein engineering.

ABSTRACT:

Summary

Machine learning-guided protein engineering is a rapidly advancing field. Despite major experimental and computational advances, collecting protein genotype (sequence) and phenotype (function) data remains time- and resource-intensive. As a result, the quality and quantity of training data are often limiting factors in developing machine learning models. Data augmentation techniques have been successfully applied in computer vision and natural language processing; however, such techniques are lacking for biological sequence data. To this end, we develop nucleotide augmentation (NTA), which leverages natural codon degeneracy to augment protein sequence data via synonymous codon substitution. As a proof of concept for protein engineering, we test several online and offline augmentation implementations to train machine learning models on benchmark datasets of protein genotype and phenotype, revealing performance on par with, and surpassing, benchmark models while using a fraction of the training data. NTA also enables substantial improvements for classification tasks under heavy class imbalance.
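The idea of synonymous codon substitution can be illustrated with a minimal Python sketch. This is not the authors' implementation (see the linked repository for that); it assumes the standard genetic code and uniform random codon choice, and the names `translate` and `augment_protein` are hypothetical, introduced only for this illustration. Each protein sequence is back-translated into several different nucleotide sequences that all encode the same protein, so one labeled protein yields several training examples with the same label.

```python
import random
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 64 codons to its amino acid ('*' marks a stop codon).
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Invert the table: amino acid -> list of synonymous codons.
AA_TO_CODONS = {}
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODONS.setdefault(aa, []).append(codon)


def translate(nucleotides):
    """Translate a nucleotide sequence (length divisible by 3) to protein."""
    return "".join(
        CODON_TO_AA[nucleotides[i:i + 3]]
        for i in range(0, len(nucleotides), 3)
    )


def augment_protein(protein, n_augmented, seed=None):
    """Hypothetical helper: return n_augmented nucleotide sequences that
    all encode `protein`, picking a synonymous codon uniformly at random
    at each position. Duplicates are possible for short sequences."""
    rng = random.Random(seed)
    return [
        "".join(rng.choice(AA_TO_CODONS[aa]) for aa in protein)
        for _ in range(n_augmented)
    ]


# Example: five nucleotide encodings of one short peptide.
for seq in augment_protein("MKTLLV", n_augmented=5, seed=0):
    print(seq, translate(seq))
```

In this sketch the augmented sequences would be generated once before training, which corresponds to an offline setup; sampling fresh encodings for each batch during training would be the online counterpart. Both readings are assumptions about how the terms apply here, not details taken from the abstract.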

Availability and implementation

The code used in this study is publicly available at https://github.com/minotm/NTA.

Supplementary information

Supplementary data are available at Bioinformatics Advances online.

SUBMITTER: Minot M

PROVIDER: S-EPMC9843584 | biostudies-literature | 2023

REPOSITORIES: biostudies-literature

Publications

Nucleotide augmentation for machine learning-guided protein engineering.

Mason Minot, Sai T. Reddy

Bioinformatics Advances, 2022-12-09, issue 1

PMID: 36698759

Similar Datasets

Machine Learning-Guided Protein Engineering. | S-EPMC10629210 | biostudies-literature

Machine-guided cell-fate engineering | 2025-06-02 | GSE214951 | GEO

Machine-guided cell-fate engineering | PRJNA887788 | ENA

Machine learning-guided channelrhodopsin engineering enables minimally invasive optogenetics. | S-EPMC6858556 | biostudies-literature

Machine learning-guided engineering of genetically encoded fluorescent calcium indicators. | S-EPMC11878291 | biostudies-literature

Machine Learning-Guided Three-Dimensional Printing of Tissue Engineering Scaffolds. | S-EPMC7759288 | biostudies-literature

Accelerated enzyme engineering by machine-learning guided cell-free expression. | S-EPMC11747319 | biostudies-literature

Machine learning-guided acyl-ACP reductase engineering for improved in vivo fatty alcohol production. | S-EPMC8492656 | biostudies-literature

Machine learning-guided co-optimization of fitness and diversity facilitates combinatorial library design in enzyme engineering. | S-EPMC11289365 | biostudies-literature

Domain-guided data augmentation for deep learning on medical imaging. | S-EPMC10035842 | biostudies-literature