Dataset Information

Fine-tuning protein embeddings for functional similarity evaluation.

ABSTRACT:

Motivation

Proteins of unknown function are frequently compared to better-characterized relatives, either by sequence similarity or, more recently, by similarity in a learned embedding space. Such comparisons make protein sequence embeddings useful for interpretable and accurate annotation of proteins, as well as for downstream tasks such as clustering for unsupervised discovery of protein families. However, it is unclear whether embeddings can be deliberately designed to improve their use in these downstream tasks.

Results

We find that for functional annotation of proteins, as represented by Gene Ontology (GO) terms, directly fine-tuning language models with a simple classification loss immediately improves protein embedding quality. Fine-tuned embeddings serve as stronger representations for k-nearest neighbor classifiers, which annotate GO terms more accurately than even directly comparable fine-tuned classifiers while remaining interpretable through protein similarity comparisons. The embeddings also retain their quality in related tasks, such as rediscovering protein families by clustering.
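
To make the k-nearest neighbor idea concrete, the following is a minimal sketch of similarity-weighted GO term voting over protein embeddings. It is not the authors' implementation (see the repository listed below for that). The function name knn_go_scores, the choice of cosine similarity, the weighting scheme, and the toy two-dimensional embeddings and two-term label matrix are all assumptions made for illustration.

```python
import numpy as np

def knn_go_scores(train_emb, train_labels, query_emb, k=3):
    """Score GO terms for each query protein by similarity-weighted
    voting over its k nearest training proteins (hypothetical helper)."""
    # L2-normalize rows so that a dot product equals cosine similarity.
    train_n = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    query_n = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    sims = query_n @ train_n.T                      # (n_query, n_train)

    # Indices and similarities of the k most similar training proteins.
    nn_idx = np.argsort(-sims, axis=1)[:, :k]       # (n_query, k)
    nn_sims = np.take_along_axis(sims, nn_idx, axis=1)

    # Non-negative weights that sum to 1 per query.
    weights = np.clip(nn_sims, 0.0, None)
    weights = weights / np.maximum(weights.sum(axis=1, keepdims=True), 1e-12)

    # Weighted vote over the neighbors' binary GO annotations.
    scores = (weights[:, :, None] * train_labels[nn_idx]).sum(axis=1)
    return scores, nn_idx

# Toy data (assumed): four annotated proteins, two GO terms (columns).
train_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
train_labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
query_emb = np.array([[1.0, 0.05]])

scores, nn_idx = knn_go_scores(train_emb, train_labels, query_emb, k=2)
print("nearest annotated proteins:", nn_idx[0].tolist())
print("GO term scores:", scores[0].tolist())
```

Because each prediction is returned together with the indices of the neighbors that produced it, the annotation can be traced back to specific annotated proteins, which is the sense in which similarity-based annotation stays interpretable.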

Availability and implementation

github.com/mofradlab/go_metric.

SUBMITTER: Dickson A

PROVIDER: S-EPMC11299545 | biostudies-literature | 2024 Aug

REPOSITORIES: biostudies-literature

Publications

Fine-tuning protein embeddings for functional similarity evaluation.

Dickson A, Mofrad MRK

Bioinformatics (Oxford, England), 2024 Aug 1, issue 8

PMID: 38985218

Similar Datasets

Protein domain embeddings for fast and accurate similarity search.
| S-EPMC11529836 | biostudies-literature

Controlling protein function by fine-tuning conformational flexibility.
| S-EPMC7375816 | biostudies-literature

Evaluation of input data modality choices on functional gene embeddings.
| S-EPMC10629286 | biostudies-literature

Democratizing Protein Language Models with Parameter-Efficient Fine-Tuning.
| S-EPMC10659351 | biostudies-literature

PHIStruct: improving phage-host interaction prediction at low sequence similarity settings using structure-aware protein embeddings.
| S-EPMC11783280 | biostudies-literature

Identification of Protein Subcellular Localization With Network and Functional Embeddings.
| S-EPMC7873866 | biostudies-literature

Similarity-driven multi-view embeddings from high-dimensional biomedical data.
| S-EPMC8009088 | biostudies-literature

Fine-tuning protein language models boosts predictions across diverse tasks.
| S-EPMC11358375 | biostudies-literature

Fine Tuning of Chlorophyll Spectra by Protein-Induced Ring Deformation.
| S-EPMC6690836 | biostudies-literature

Fine-Tuning Protein Self-Organization by Orthogonal Chemo-Optogenetic Tools.
| S-EPMC7986231 | biostudies-literature