Dataset Information

Descriptor-augmented machine learning for enzyme-chemical interaction predictions.

ABSTRACT: Descriptors play a pivotal role in enzyme design for the greener synthesis of biochemicals, as they can characterize enzymes and chemicals from physicochemical and evolutionary perspectives. This study examined the effects of various descriptors on the performance of a Random Forest model used for predicting enzyme-chemical relationships. We curated activity data for seven specific enzyme families from the literature and developed a pipeline for evaluating machine learning model performance using 10-fold cross-validation. The influence of protein and chemical descriptors was assessed in three scenarios: predicting the activity of unknown relations between known enzymes and known chemicals (new relationship evaluation), predicting the activity of novel enzymes on known chemicals (new enzyme evaluation), and predicting the activity of new chemicals on known enzymes (new chemical evaluation). The results showed that protein descriptors significantly enhanced the classification performance of the model in the new enzyme evaluation in three out of the seven datasets with the greatest number of enzymes, whereas chemical descriptors appeared to have no effect. A variety of sequence-based and structure-based protein descriptors were constructed, among which the ESM-2 descriptor achieved the best results. Using enzyme families as labels showed that descriptors could cluster proteins well, which could explain the contributions of descriptors to the machine learning model. Conversely, in the new chemical evaluation, chemical descriptors yielded significant improvements in four out of the seven datasets, while protein descriptors appeared to have no effect. We attempted to evaluate the generalization ability of the model by correlating the statistics of the datasets with the performance of the models. The results showed that datasets with higher sequence similarity were more likely to achieve better results in the new enzyme evaluation, and datasets with more enzymes were more likely to benefit from the protein descriptor strategy. This work provides guidance for the development of machine learning models for specific enzyme families.
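The three evaluation scenarios in the abstract differ only in what is held out in each cross-validation fold: individual enzyme-chemical pairs, whole enzymes, or whole chemicals. The following is a minimal sketch of that idea, not the authors' pipeline. The function name `k_fold_splits`, the representation of the data as `(enzyme, chemical)` tuples, and the round-robin fold assignment are all assumptions made for illustration.

```python
import random


def k_fold_splits(pairs, k=10, group_by=None, seed=0):
    """Yield (train, test) lists of (enzyme, chemical) pairs.

    group_by=None       -> hold out individual pairs (new relationship)
    group_by="enzyme"   -> hold out whole enzymes    (new enzyme)
    group_by="chemical" -> hold out whole chemicals  (new chemical)

    Hypothetical helper for illustration; not taken from the paper.
    """
    index = {None: None, "enzyme": 0, "chemical": 1}[group_by]
    # The "unit" is whatever gets assigned to a fold as a whole.
    key = (lambda p: p) if index is None else (lambda p: p[index])

    units = sorted(set(key(p) for p in pairs))
    if len(units) < k:
        raise ValueError("need at least k distinct units to build k folds")

    # Shuffle reproducibly, then assign units to folds round-robin.
    random.Random(seed).shuffle(units)
    fold_of = {u: i % k for i, u in enumerate(units)}

    for fold in range(k):
        test = [p for p in pairs if fold_of[key(p)] == fold]
        train = [p for p in pairs if fold_of[key(p)] != fold]
        yield train, test


# Toy data: 12 enzymes x 11 chemicals, every combination measured.
toy_pairs = [(f"E{i}", f"C{j}") for i in range(12) for j in range(11)]

for train, test in k_fold_splits(toy_pairs, k=10, group_by="enzyme"):
    # In the new enzyme setting, no test enzyme appears in training.
    assert {e for e, _ in test}.isdisjoint({e for e, _ in train})
```

Under this sketch, a model trained without protein descriptors has no information about a held-out enzyme in the new enzyme setting, which is consistent with the abstract's finding that protein descriptors mattered there and chemical descriptors mattered in the new chemical setting.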

SUBMITTER: Han Y

PROVIDER: S-EPMC10915406 | biostudies-literature | 2024 Jun

REPOSITORIES: biostudies-literature

Publications

Descriptor-augmented machine learning for enzyme-chemical interaction predictions.

Han Yilei, Zhang Haoye, Zeng Zheni, Liu Zhiyuan, Lu Diannan, Liu Zheng

Synthetic and Systems Biotechnology, 2024-02-28, 2

PMID: 38450325

Similar Datasets

Dataset's chemical diversity limits the generalizability of machine learning predictions.
| S-EPMC6852905 | biostudies-literature

Machine-Learning Predictions of Critical Temperatures from Chemical Compositions of Superconductors.
| S-EPMC11481088 | biostudies-literature

Machine Learning Identifies Chemical Characteristics That Promote Enzyme Catalysis.
| S-EPMC6407039 | biostudies-literature

Expert-augmented machine learning.
| S-EPMC7060733 | biostudies-literature

Chemical features and machine learning assisted predictions of protein-ligand short hydrogen bonds.
| S-EPMC10447522 | biostudies-literature

Chemical Features and Machine Learning Assisted Predictions of Protein-Ligand Short Hydrogen Bonds.
| S-EPMC10246099 | biostudies-literature

Fatigue Evaluation through Machine Learning and a Global Fatigue Descriptor.
| S-EPMC6969995 | biostudies-literature

Spectra-descriptor-based machine learning for predicting protein-ligand interactions.
| S-EPMC11905448 | biostudies-literature

Model-to-crop conserved NUE Regulons enhance machine learning predictions of nitrogen use efficiency
2025-05-09 | GSE280353 | GEO

Model-to-crop conserved NUE Regulons enhance machine learning predictions of nitrogen use efficiency
2025-05-09 | GSE280345 | GEO