
Dataset Information


GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction.


ABSTRACT: Whereas protein language models have demonstrated remarkable efficacy in predicting the effects of missense variants, DNA counterparts have not yet achieved a similar competitive edge for genome-wide variant effect predictions, especially in complex genomes such as that of humans. To address this challenge, we here introduce GPN-MSA, a novel framework for DNA language models that leverages whole-genome sequence alignments across multiple species and takes only a few hours to train. Across several benchmarks on clinical databases (ClinVar, COSMIC, and OMIM) and population genomic data (gnomAD), our model for the human genome achieves outstanding performance on deleteriousness prediction for both coding and non-coding variants.

SUBMITTER: Benegas G 

PROVIDER: S-EPMC10592768 | biostudies-literature | 2023 Oct

REPOSITORIES: biostudies-literature


Publications

GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction.

Gonzalo Benegas, Carlos Albors, Alan J Aw, Chengzhong Ye, Yun S Song

bioRxiv: the preprint server for biology, 2024-04-06



Similar Datasets

| S-EPMC10622914 | biostudies-literature
| S-EPMC10484790 | biostudies-literature
| S-EPMC7901104 | biostudies-literature
| S-EPMC7849406 | biostudies-literature
| S-EPMC5587781 | biostudies-literature
| S-EPMC4073816 | biostudies-literature
| S-EPMC4165772 | biostudies-literature
| S-EPMC7037363 | biostudies-literature
| S-EPMC5343954 | biostudies-literature
| S-EPMC4489222 | biostudies-literature