
Dataset Information


Fine-tuning protein embeddings for functional similarity evaluation.


ABSTRACT:

Motivation

Proteins with unknown function are frequently compared to better characterized relatives, either by sequence similarity or, more recently, by similarity in a learned embedding space. Through such comparisons, protein sequence embeddings allow for interpretable and accurate annotation of proteins, and support downstream tasks such as clustering for unsupervised discovery of protein families. However, it is unclear whether embeddings can be deliberately designed to improve their use in these downstream tasks.

Results

We find that for functional annotation of proteins, as represented by Gene Ontology (GO) terms, direct fine-tuning of language models on a simple classification loss has an immediate positive impact on protein embedding quality. Fine-tuned embeddings perform better as representations for K-nearest neighbor classifiers, outperforming even directly comparable fine-tuned classifiers on GO annotation, while maintaining interpretability through protein similarity comparisons. They also maintain their quality in related tasks, such as rediscovering protein families with clustering.
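
To make the K-nearest neighbor idea concrete, the following is a minimal sketch of how GO terms might be transferred from annotated reference proteins to a query protein by similarity in embedding space. It is not the authors' implementation: the function names (`cosine`, `knn_go_scores`), the two-dimensional toy embeddings, the similarity-weighted voting scheme, and the choice of `k` are all assumptions made for illustration. Real embeddings would come from a protein language model and have hundreds or thousands of dimensions.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_go_scores(query, ref_embeddings, ref_terms, k=3):
    """Score GO terms for `query` from its k most similar reference proteins.

    Hypothetical helper: each neighbor votes for its own GO terms with a
    weight equal to its (non-negative) cosine similarity, and the votes are
    normalized by the total neighbor weight so scores fall in [0, 1].
    Returns (scores, neighbor_indices).
    """
    sims = [(cosine(query, emb), i) for i, emb in enumerate(ref_embeddings)]
    top = sorted(sims, reverse=True)[:k]
    total = sum(max(s, 0.0) for s, _ in top)
    scores = defaultdict(float)
    for s, i in top:
        for term in ref_terms[i]:
            scores[term] += max(s, 0.0)
    if total > 0:
        scores = {term: w / total for term, w in scores.items()}
    return dict(scores), [i for _, i in top]


# Toy reference set (assumed values, for illustration only).
ref_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
ref_terms = [
    {"GO:0003824"},                # catalytic activity
    {"GO:0003824", "GO:0005515"},  # catalytic activity, protein binding
    {"GO:0005215"},                # transporter activity
]

query = [1.0, 0.05]
scores, neighbors = knn_go_scores(query, ref_embeddings, ref_terms, k=2)
print(neighbors)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

Because the prediction is a weighted vote over named reference proteins, the returned neighbor indices double as an explanation: a predicted term can be traced back to the specific annotated proteins that contributed it, which is the kind of interpretability the abstract refers to.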

Availability and implementation

github.com/mofradlab/go_metric.

SUBMITTER: Dickson A 

PROVIDER: S-EPMC11299545 | biostudies-literature | 2024 Aug

REPOSITORIES: biostudies-literature


Publications

Fine-tuning protein embeddings for functional similarity evaluation.

Andrew Dickson, Mohammad R K Mofrad

Bioinformatics (Oxford, England), 2024 Aug 1, issue 8



Similar Datasets

| S-EPMC11529836 | biostudies-literature
| S-EPMC7375816 | biostudies-literature
| S-EPMC10629286 | biostudies-literature
| S-EPMC10659351 | biostudies-literature
| S-EPMC11783280 | biostudies-literature
| S-EPMC7873866 | biostudies-literature
| S-EPMC8009088 | biostudies-literature
| S-EPMC11358375 | biostudies-literature
| S-EPMC6690836 | biostudies-literature
| S-EPMC7986231 | biostudies-literature