Dataset Information

learnMSA: learning and aligning large protein families.


ABSTRACT:

Background

The alignment of large numbers of protein sequences is a challenging task and its importance grows rapidly along with the size of biological datasets. State-of-the-art algorithms have a tendency to produce less accurate alignments with an increasing number of sequences. This is a fundamental problem since many downstream tasks rely on accurate alignments.

Results

We present learnMSA, a novel statistical learning approach of profile hidden Markov models (pHMMs) based on batch gradient descent. Fundamentally different from popular aligners, we fit a custom recurrent neural network architecture for (p)HMMs to potentially millions of sequences with respect to a maximum a posteriori objective and decode an alignment. We rely on automatic differentiation of the log-likelihood, and thus, our approach is different from existing HMM training algorithms like Baum-Welch. Our method does not involve progressive, regressive, or divide-and-conquer heuristics. We use uniform batch sampling to adapt to large datasets in linear time without the requirement of a tree. When tested on ultra-large protein families with up to 3.5 million sequences, learnMSA is both more accurate and faster than state-of-the-art tools. On the established benchmarks HomFam and BaliFam with smaller sequence sets, it matches state-of-the-art performance. All experiments were done on a standard workstation with a GPU.
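
The training idea described above can be illustrated with a small sketch. It is not learnMSA's implementation and rests on several simplifying assumptions: a plain HMM stands in for a profile HMM, the objective is maximum likelihood rather than maximum a posteriori (no priors), central finite differences stand in for automatic differentiation, and all function names (`log_likelihood`, `batch_loss`, `numeric_grad`, `fit`) are hypothetical. It shows only the core loop: draw a uniform random batch, evaluate the forward-algorithm log-likelihood, and take a gradient step on unconstrained logits.

```python
import math
import random

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_softmax(row):
    lse = logsumexp(row)
    return [x - lse for x in row]

def unpack(theta, K, A):
    """Split a flat logit vector into log-probabilities (init, trans, emit)."""
    init = log_softmax(theta[:K])
    trans = [log_softmax(theta[K + i * K: K + (i + 1) * K]) for i in range(K)]
    off = K + K * K
    emit = [log_softmax(theta[off + i * A: off + (i + 1) * A]) for i in range(K)]
    return init, trans, emit

def log_likelihood(theta, seq, K, A):
    """Forward algorithm in log space for one sequence of symbol indices."""
    init, trans, emit = unpack(theta, K, A)
    alpha = [init[k] + emit[k][seq[0]] for k in range(K)]
    for sym in seq[1:]:
        alpha = [logsumexp([alpha[j] + trans[j][k] for j in range(K)]) + emit[k][sym]
                 for k in range(K)]
    return logsumexp(alpha)

def batch_loss(theta, batch, K, A):
    """Mean negative log-likelihood of a batch of sequences."""
    return -sum(log_likelihood(theta, s, K, A) for s in batch) / len(batch)

def numeric_grad(theta, batch, K, A, eps=1e-5):
    """Central finite differences; a stand-in for automatic differentiation."""
    grad = []
    for i in range(len(theta)):
        up, down = theta[:], theta[:]
        up[i] += eps
        down[i] -= eps
        grad.append((batch_loss(up, batch, K, A) - batch_loss(down, batch, K, A)) / (2 * eps))
    return grad

def fit(seqs, K, A, steps=50, batch_size=3, lr=0.5, seed=0):
    """Mini-batch gradient descent with uniform batch sampling."""
    rng = random.Random(seed)
    theta = [rng.uniform(-0.1, 0.1) for _ in range(K + K * K + K * A)]
    for _ in range(steps):
        batch = rng.sample(seqs, min(batch_size, len(seqs)))  # uniform, no guide tree
        grad = numeric_grad(theta, batch, K, A)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

# Toy data: sequences over a 2-letter alphabet, encoded as symbol indices.
toy_seqs = [[0, 0, 0, 1, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 0, 1, 0, 0],
            [0, 1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 0, 0, 0]]
fitted = fit(toy_seqs, K=2, A=2)
```

Each step touches only one batch, so the cost per step does not depend on the total number of sequences. A real implementation would replace `numeric_grad` with automatic differentiation, since finite differences scale poorly with the number of parameters.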

Conclusions

Our results show that learnMSA does not share the counterintuitive drawback of many popular heuristic aligners, which can substantially lose accuracy when many additional homologs are input. LearnMSA is a future-proof framework for large alignments with many opportunities for further improvements.

SUBMITTER: Becker F 

PROVIDER: S-EPMC9673500 | biostudies-literature | 2022 Nov

REPOSITORIES: biostudies-literature

Publications

learnMSA: learning and aligning large protein families.

Becker Felix, Stanke Mario

GigaScience, 2022-11-01


Similar Datasets

| S-EPMC2712342 | biostudies-literature
| S-EPMC7570326 | biostudies-literature
| S-EPMC3605371 | biostudies-literature
| S-EPMC2711941 | biostudies-literature
| S-EPMC101833 | biostudies-literature
| S-EPMC10400306 | biostudies-literature
| S-EPMC3519460 | biostudies-literature
| S-EPMC3205580 | biostudies-literature
| S-EPMC7391736 | biostudies-literature
| S-EPMC2525701 | biostudies-literature