Genomic language models improve cross-species gene expression prediction and accurately capture regulatory variant effects in Brachypodium mutant lines
ABSTRACT: Predicting gene expression from cis-regulatory DNA sequences is a central challenge in plant genomics. Here, we developed deep learning sequence-to-expression (S2E) models that predict gene expression across 17 plant species using high-dimensional representations from auxiliary foundational models (the genomic language model PlantCaduceus and the chromatin accessibility model a2z) instead of one-hot encoded sequences. We first evaluated the models on unseen gene families via cross-validation; across all species, their prediction accuracy exceeded that of PhytoExpr, a state-of-the-art (SOTA) S2E model trained on the same dataset (Pearson R=0.82 vs. R=0.74). We then validated variant effect predictions using an experimental dataset of 796 Brachypodium mutant lines, specifically designed to test predictions at single-base resolution. Our models outperformed the SOTA models in predicting between-gene expression differences (regression coefficient β=0.78 vs. β=0.57). Remarkably, they also accurately predicted the effects of single-nucleotide mutations on within-gene expression, whereas SOTA models showed only weak associations (regression coefficient β=0.38 vs. β=0.08). Our results demonstrate the value of context-aware DNA sequence embeddings for predicting regulatory variant effects in plants. They also reveal a persistent accuracy gap in S2E models when moving from between-gene to allelic variation, a challenge that future studies need to address.
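The abstract reports separate regression coefficients for between-gene and within-gene (allelic) variation but does not give the exact procedure. The sketch below shows one common way such a split could be computed, and it is an assumption rather than the authors' method: the between-gene slope regresses per-gene mean observed expression on per-gene mean predicted expression, and the within-gene slope regresses the deviations from those gene means. The function names `slope` and `between_within_slopes` and the toy values are hypothetical.

```python
import numpy as np


def slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    return float(np.dot(xc, y - y.mean()) / np.dot(xc, xc))


def between_within_slopes(gene_ids, predicted, observed):
    """Split the predicted-vs-observed relationship into two slopes.

    Assumed decomposition (not taken from the record):
      between-gene: per-gene mean observed ~ per-gene mean predicted
      within-gene:  (observed - gene mean) ~ (predicted - gene mean)
    """
    gene_ids = np.asarray(gene_ids)
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)

    # Map every allele (row) to the index of its gene.
    _, inverse = np.unique(gene_ids, return_inverse=True)
    counts = np.bincount(inverse)

    # Per-gene means of predicted and observed expression.
    pred_mean = np.bincount(inverse, weights=predicted) / counts
    obs_mean = np.bincount(inverse, weights=observed) / counts

    beta_between = slope(pred_mean, obs_mean)
    beta_within = slope(predicted - pred_mean[inverse],
                        observed - obs_mean[inverse])
    return beta_between, beta_within


# Toy data: three genes with three alleles each (values are illustrative only).
genes = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
pred = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
obs = [1.5, 2.0, 2.5, 4.5, 5.0, 5.5, 7.5, 8.0, 8.5]
beta_between, beta_within = between_within_slopes(genes, pred, obs)
print(beta_between, beta_within)
```

In this toy example the model ranks genes correctly but overstates allelic effects, so the within-gene slope is lower than the between-gene slope. This is the same qualitative pattern as the accuracy gap described in the abstract.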
ORGANISM(S): Brachypodium distachyon
PROVIDER: GSE324261 | GEO | 2026/03/09
REPOSITORIES: GEO