Dataset Information

DeepMSA: constructing deep multiple sequence alignment to improve contact prediction and fold-recognition for distant-homology proteins.


ABSTRACT:

Motivation

The success of genome sequencing techniques has resulted in a rapid explosion of protein sequences. Collections of multiple homologous sequences can provide critical information for modeling the structure and function of unknown proteins. There is, however, no standard and efficient pipeline available for sensitive multiple sequence alignment (MSA) collection. This is particularly challenging when large whole-genome and metagenome databases are involved.

Results

We developed DeepMSA, a new open-source method for sensitive MSA construction, which collects homologous sequences and builds alignments from multiple whole-genome and metagenome databases using complementary hidden Markov model algorithms. The practical usefulness of the pipeline was examined in three large-scale benchmark experiments based on 614 non-redundant proteins. First, DeepMSA was used to generate MSAs for residue-level contact prediction by six coevolution and deep learning-based programs, which increased the accuracy of long-range contacts by up to 24.4% compared to the default programs. Next, multiple threading programs were run for homologous structure identification, where the average TM-score of the template alignments increased by over 7.5% with the use of the new DeepMSA profiles. Finally, DeepMSA was used for secondary structure prediction and resulted in statistically significant improvements in the Q3 accuracy. Notably, all these improvements were achieved without re-training the parameters and neural-network models, demonstrating the robustness and general usefulness of DeepMSA in protein structural bioinformatics applications, especially for targets without homologous templates in the PDB library.
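
For readers unfamiliar with two of the metrics named above, the following is a minimal sketch of how Q3 accuracy and long-range contact accuracy are commonly computed. It is not taken from the DeepMSA code. The function names are hypothetical, and the details are assumptions: Q3 is treated as the fraction of residues whose three-state label (H, E, C) is predicted correctly, and long-range contact accuracy is treated as the precision of the top-scoring predicted pairs with a sequence separation of at least 24 residues, a widely used convention that the abstract itself does not specify.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[int, int]


def q3_accuracy(predicted: str, observed: str) -> float:
    """Fraction of residues whose 3-state label (H, E or C) matches."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have equal length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


def long_range_precision(
    scores: Dict[Pair, float],
    true_contacts: Set[Pair],
    top_k: int,
    min_sep: int = 24,  # assumed long-range cutoff, not stated in the abstract
) -> float:
    """Precision of the top_k highest-scoring long-range predicted pairs."""
    # Keep only pairs separated by at least min_sep positions in sequence.
    long_range = [
        (pair, score)
        for pair, score in scores.items()
        if abs(pair[0] - pair[1]) >= min_sep
    ]
    # Rank by predicted score, highest first, and keep the top_k pairs.
    ranked = sorted(long_range, key=lambda item: item[1], reverse=True)[:top_k]
    if not ranked:
        return 0.0
    hits = sum(1 for pair, _ in ranked if pair in true_contacts)
    return hits / len(ranked)
```

In practice, top_k is often set relative to the sequence length L (for example L, L/2 or L/5), so the same function can be called once per cutoff.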

Availability and implementation

https://zhanglab.ccmb.med.umich.edu/DeepMSA/.

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Zhang C 

PROVIDER: S-EPMC7141871 | biostudies-literature

REPOSITORIES: biostudies-literature

Similar Datasets

| S-EPMC1779300 | biostudies-literature
| S-EPMC6818797 | biostudies-other
| S-EPMC3716705 | biostudies-literature
| S-EPMC4669437 | biostudies-literature
| S-EPMC3534397 | biostudies-literature
| S-EPMC153506 | biostudies-other
| S-EPMC3463120 | biostudies-literature