Dataset Information

Cross-species modeling of plant genomes at single nucleotide resolution using a pre-trained DNA language model.


ABSTRACT: Understanding the function and fitness effects of diverse plant genomes requires transferable models. Language models (LMs) pre-trained on large-scale biological sequences can learn evolutionary conservation and are therefore expected to offer better cross-species prediction than supervised deep learning models when fine-tuned on limited labeled data. We introduce PlantCaduceus, a plant DNA LM based on the Caduceus and Mamba architectures, pre-trained on a carefully curated dataset of 16 diverse Angiosperm genomes. Fine-tuning PlantCaduceus on limited labeled Arabidopsis data for four tasks involving transcription and translation modeling demonstrated high transferability to maize, which diverged 160 million years ago, outperforming the best baseline model by 1.45-fold to 7.23-fold. PlantCaduceus also enables genome-wide identification of deleterious mutations without multiple sequence alignment (MSA). PlantCaduceus demonstrated a threefold enrichment of rare alleles in prioritized deleterious mutations compared to MSA-based methods and matched state-of-the-art protein LMs. PlantCaduceus is a versatile pre-trained DNA LM expected to accelerate plant genomics and crop breeding applications.
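
The abstract does not say how mutations are scored without an MSA. One common zero-shot approach for masked language models, assumed here purely for illustration and not confirmed as the authors' method, is a log-likelihood ratio between the alternate and reference nucleotide at a masked position. The function name `llr_score` and the probability values below are hypothetical, not PlantCaduceus output.

```python
import math

def llr_score(probs, ref, alt):
    """Log-likelihood ratio log p(alt) - log p(ref) at one masked site.

    `probs` maps each nucleotide to the probability a masked LM assigns it.
    More negative scores mean the alternate allele is less expected by the
    model, which is often read as more likely deleterious (an assumption).
    """
    return math.log(probs[alt]) - math.log(probs[ref])

# Hypothetical model output for a single masked position (illustrative only).
probs = {"A": 0.90, "C": 0.04, "G": 0.05, "T": 0.01}

score = llr_score(probs, ref="A", alt="T")
print(f"LLR for A>T: {score:.3f}")
```

At a strongly conserved site the model concentrates probability on the reference base, so any substitution receives a strongly negative score; at an unconstrained site the scores stay near zero.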

SUBMITTER: Zhai J 

PROVIDER: S-EPMC11185591 | biostudies-literature | 2024 Jun

REPOSITORIES: biostudies-literature

Publications

Cross-species modeling of plant genomes at single nucleotide resolution using a pre-trained DNA language model.

Jingjing Zhai, Aaron Gokaslan, Yair Schiff, Ana Berthel, Zong-Yan Liu, Wei-Yun Lai, Zachary R Miller, Armin Scheben, Michelle C Stitzer, M Cinta Romay, Edward S Buckler, Volodymyr Kuleshov

bioRxiv: the preprint server for biology, 2024-08-22


Interpreting function and fitness effects in diverse plant genomes requires transferable models. Language models (LMs) pre-trained on large-scale biological sequences can learn evolutionary conservation and offer cross-species prediction better than supervised models through fine-tuning limited labeled data. We introduce PlantCaduceus, a plant DNA LM based on the Caduceus and Mamba architectures, pre-trained on a curated dataset of 16 Angiosperm genomes. Fine-tuning PlantCaduceus on limited labe ...

Similar Datasets

2025-07-19 | GSE271937 | GEO
| S-EPMC11924033 | biostudies-literature
| S-EPMC10357883 | biostudies-literature
| S-EPMC10457366 | biostudies-literature
| S-EPMC9710646 | biostudies-literature
| S-EPMC11226626 | biostudies-literature
| S-EPMC8756089 | biostudies-literature
| S-EPMC9044357 | biostudies-literature
| S-EPMC11444397 | biostudies-literature
| S-EPMC11227580 | biostudies-literature