Dataset Information

Large multiple sequence alignments with a root-to-leaf regressive method.

ABSTRACT: Multiple sequence alignments (MSAs) are used for structural [1,2] and evolutionary predictions [1,2], but the complexity of aligning large datasets requires the use of approximate solutions [3], including the progressive algorithm [4]. Progressive MSA methods start by aligning the most similar sequences and subsequently incorporate the remaining sequences, from leaf to root, based on a guide tree. Their accuracy declines substantially as the number of sequences is scaled up [5]. We introduce a regressive algorithm that enables MSA of up to 1.4 million sequences on a standard workstation and substantially improves accuracy on datasets larger than 10,000 sequences. Our regressive algorithm works the other way around from the progressive algorithm and begins by aligning the most dissimilar sequences. It uses an efficient divide-and-conquer strategy to run third-party alignment methods in linear time, regardless of their original complexity. Our approach will enable analyses of extremely large genomic datasets such as the recently announced Earth BioGenome Project, which comprises 1.5 million eukaryotic genomes [6].
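
The contrast between the two traversal directions named in the abstract can be illustrated with a small sketch. It only shows the order in which the nodes of a guide tree would be visited: leaf to root for a progressive method, root to leaf for a regressive one. The toy tree, the sequence names, and the function names are all hypothetical, breadth-first order is just one possible root-to-leaf order, and the sketch does not model how the published method selects sequences at each node or merges sub-alignments, which the abstract does not describe.

```python
from collections import deque

# Hypothetical guide tree (an assumption for illustration only):
# a leaf is a sequence name (str), an internal node is a (left, right) tuple.
GUIDE_TREE = ((("seqA", "seqB"), "seqC"), ("seqD", "seqE"))


def leaves(node):
    """Return the sequence names under a node, left to right."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)


def progressive_order(node):
    """Leaf-to-root: internal nodes in post-order (children before parent)."""
    if isinstance(node, str):
        return []
    left, right = node
    return progressive_order(left) + progressive_order(right) + [leaves(node)]


def regressive_order(root):
    """Root-to-leaf: internal nodes breadth-first, starting at the root."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        if isinstance(node, str):
            continue  # leaves have nothing further to split
        order.append(leaves(node))
        queue.extend(node)  # enqueue both children
    return order


if __name__ == "__main__":
    print("progressive:", progressive_order(GUIDE_TREE))
    print("regressive: ", regressive_order(GUIDE_TREE))
```

In this toy tree the progressive order begins at the closest pair (seqA, seqB) and ends at the root, while the regressive order begins at the root, where the most divergent groups meet, and ends at that same pair.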

SUBMITTER: Garriga E

PROVIDER: S-EPMC6894943 | biostudies-literature | 2019 Dec

REPOSITORIES: biostudies-literature

Publications

Large multiple sequence alignments with a root-to-leaf regressive method.

Garriga E, Di Tommaso P, Magis C, Erb I, Mansouri L, Baltzis A, Laayouni H, Kondrashov F, Floden E, Notredame C

Nature Biotechnology, 2019 Dec 2, issue 12

PMID: 31792410

Similar Datasets

Obtaining extremely large and accurate protein multiple sequence alignments from curated hierarchical alignments. | S-EPMC7297217 | biostudies-literature

Parallelization of MAFFT for large-scale multiple sequence alignments. | S-EPMC6041967 | biostudies-literature

Detecting species-site dependencies in large multiple sequence alignments. | S-EPMC2764451 | biostudies-literature

Progressive multiple sequence alignments from triplets. | S-EPMC1948021 | biostudies-literature

EMMA: a new method for computing multiple sequence alignments given a constraint subset alignment. | S-EPMC10704716 | biostudies-literature

Using de novo protein structure predictions to measure the quality of very large multiple sequence alignments. | S-EPMC5939968 | biostudies-literature

A minimum reporting standard for multiple sequence alignments. | S-EPMC7671350 | biostudies-literature

Refining multiple sequence alignments with conserved core regions. | S-EPMC1463900 | biostudies-literature

OD-seq: outlier detection in multiple sequence alignments. | S-EPMC4548304 | biostudies-literature

MSAViewer: interactive JavaScript visualization of multiple sequence alignments. | S-EPMC5181560 | biostudies-literature