
Dataset Information


Genomic language models improve cross-species gene expression prediction and accurately capture regulatory variant effects in Brachypodium mutant lines

ABSTRACT: Predicting gene expression from cis-regulatory DNA sequences is a central challenge in plant genomics. Here, we developed deep learning sequence-to-expression (S2E) models that leverage high-dimensional representations from auxiliary foundational models (genomic language model PlantCaduceus, chromatin accessibility model a2z) instead of one-hot encoding of sequences, to predict gene expression across 17 plant species. We first evaluated how well our models predict gene expression for unseen gene families via cross-validation, and found that their prediction accuracy across all species exceeds that of PhytoExpr, a state-of-the-art (SOTA) S2E model trained on the same dataset (Pearson R=0.82 vs. R=0.74). We then validated variant effect predictions using an experimental dataset across 796 Brachypodium mutant lines, specifically designed to test predictions at single-base resolution. Our models outperformed the SOTA models in predicting between-gene expression differences (regression coefficient β=0.78 vs. β=0.57). Remarkably, they also accurately predicted the effects of single-nucleotide mutations on within-gene expression, while SOTA models showed only weak associations (regression coefficient β=0.38 vs. β=0.08). Our results demonstrate the value of context-aware DNA sequence embeddings for predicting regulatory variant effects in plants. They also reveal a persistent accuracy gap in S2E models when moving from between-gene to allelic variation, a challenge that needs to be addressed in future studies.
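
The abstract reports two kinds of metric: a Pearson correlation between predicted and observed expression, and regression coefficients for between-gene and within-gene comparisons. The record does not state how these were computed, so the following is only a minimal sketch of one common approach, assuming β is the ordinary least-squares slope of observed on predicted expression, that "between-gene" refers to per-gene means, and that "within-gene" refers to deviations of each line from its gene's mean. The function names and the toy values are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ols_slope(x, y):
    """Least-squares slope of y regressed on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def split_between_within(records):
    """Split (gene_id, predicted, observed) records into two components.

    Between-gene: one (mean predicted, mean observed) pair per gene.
    Within-gene: each record's deviation from its own gene's means.
    This decomposition is an assumption made for illustration.
    """
    by_gene = defaultdict(list)
    for gene, pred, obs in records:
        by_gene[gene].append((pred, obs))

    between_pred, between_obs = [], []
    within_pred, within_obs = [], []
    for pairs in by_gene.values():
        gene_pred = mean([p for p, _ in pairs])
        gene_obs = mean([o for _, o in pairs])
        between_pred.append(gene_pred)
        between_obs.append(gene_obs)
        for p, o in pairs:
            within_pred.append(p - gene_pred)
            within_obs.append(o - gene_obs)
    return (between_pred, between_obs), (within_pred, within_obs)


# Hypothetical toy data: two alleles per gene. The values were constructed so
# that gene means match exactly while within-gene deviations in the observed
# values are half the size of the predicted ones.
toy_records = [
    ("geneA", 1.0, 1.25), ("geneA", 2.0, 1.75),
    ("geneB", 3.0, 3.50), ("geneB", 5.0, 4.50),
    ("geneC", 6.0, 6.25), ("geneC", 7.0, 6.75),
]

(between_p, between_o), (within_p, within_o) = split_between_within(toy_records)
print("between-gene slope:", ols_slope(between_p, between_o))
print("within-gene slope:", ols_slope(within_p, within_o))
print("between-gene Pearson r:", pearson_r(between_p, between_o))
```

Separating the two components this way makes it possible for a model to rank genes well while still capturing little of the allelic variation within a gene, which is the kind of gap the abstract describes.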

ORGANISM(S): Brachypodium distachyon

PROVIDER: GSE324261 | GEO | 2026/03/09

REPOSITORIES: GEO


Similar Datasets

Development and validation of the AI-based diagnosis system for pathological findings in invasive front of colorectal cancer

Project description:Primary outcome(s): Accuracy of models (i.e. dice coefficient, Jaccard Index)

| 2653863 | ecrin-mdr-crc

Identification of a prognostic signature for old-age mortality by integrating genome-wide transcriptomic data with the conventional predictors: the Vitality 90+ Study

Project description:BACKGROUND: Prediction models for old-age mortality have generally relied upon conventional markers such as plasma-based factors and biophysiological characteristics. However, it is unknown whether the existing markers are able to provide the most relevant information in terms of old-age survival or whether predictions could be improved through the integration of whole-genome expression profiles. METHODS: We assessed the predictive abilities of survival models containing only conventional markers, only gene expression data or both types of data together in a Vitality 90+ study cohort consisting of n = 151 nonagenarians. The all-cause death rate was 32.5% (49 of 151 individuals), and the median follow-up time was 2.55 years. RESULTS: Three different feature selection models, the penalized Lasso and Ridge regressions and the C-index boosting algorithm, were used to test the genomic data. The Ridge regression model incorporating both the conventional markers and transcripts outperformed the other models. The multivariate Cox regression model was used to adjust for the conventional mortality prediction markers, i.e., the body mass index, frailty index and cell-free DNA level, revealing that 331 transcripts were independently associated with survival. The final mortality-predicting transcriptomic signature derived from the Ridge regression model was mapped to a network that identified nuclear factor kappa beta (NF-κB) as a central node. CONCLUSIONS: Together with the loss of physiological reserves, the transcriptomic predictors centered around NF-κB underscored the role of immunoinflammatory signaling, the control of the DNA damage response and cell cycle, and mitochondrial functions as the key determinants of old-age mortality. The study consisted of 151 nonagenarians who were analyzed for genome-wide transcriptomic mortality predictors in a longitudinal setting.

2015-01-22 | E-GEOD-65218 | biostudies-arrayexpress

Identification of a prognostic signature for old-age mortality by integrating genome-wide transcriptomic data with the conventional predictors: the Vitality 90+ Study

Project description:BACKGROUND: Prediction models for old-age mortality have generally relied upon conventional markers such as plasma-based factors and biophysiological characteristics. However, it is unknown whether the existing markers are able to provide the most relevant information in terms of old-age survival or whether predictions could be improved through the integration of whole-genome expression profiles. METHODS: We assessed the predictive abilities of survival models containing only conventional markers, only gene expression data or both types of data together in a Vitality 90+ study cohort consisting of n = 151 nonagenarians. The all-cause death rate was 32.5% (49 of 151 individuals), and the median follow-up time was 2.55 years. RESULTS: Three different feature selection models, the penalized Lasso and Ridge regressions and the C-index boosting algorithm, were used to test the genomic data. The Ridge regression model incorporating both the conventional markers and transcripts outperformed the other models. The multivariate Cox regression model was used to adjust for the conventional mortality prediction markers, i.e., the body mass index, frailty index and cell-free DNA level, revealing that 331 transcripts were independently associated with survival. The final mortality-predicting transcriptomic signature derived from the Ridge regression model was mapped to a network that identified nuclear factor kappa beta (NF-κB) as a central node. CONCLUSIONS: Together with the loss of physiological reserves, the transcriptomic predictors centered around NF-κB underscored the role of immunoinflammatory signaling, the control of the DNA damage response and cell cycle, and mitochondrial functions as the key determinants of old-age mortality.

2015-01-22 | GSE65218 | GEO

DeepWheat: Predicting the Effects of Genomic Variants on Gene Expression and Regulatory Activities Across Tissues and Varieties in Wheat Using Deep Learning [ChIP-Seq]

Project description:Accurate prediction of genomic variant effects and gene expression is essential for identifying functional variations and enabling precise genome editing of cis-regulatory elements (CREs). Spatiotemporal gene expression patterns are fundamental to the formation of key traits, yet tissue-specific predictions remain inaccurate, particularly in large-genome crops like wheat. In this study, we developed DeepWheat, a suite of two models for predicting epigenomic features and gene expression in wheat. DeepEXP, a deep learning model, integrates epigenomic and transcriptomic data across various wheat tissues, achieving Pearson correlation coefficients (PCC) over 0.8 and outperforming sequence-only models, especially for tissue-specific genes. DeepEPI predicts epigenomic features from DNA sequences, helping identify regulatory sequences and facilitating model transfer across wheat varieties. Using chromatin accessibility and transcriptomic data from 9 additional wheat varieties, we validated the model’s accuracy and transfer efficiency. Our analysis further revealed that indels have a greater impact on gene expression than SNPs, and that, compared to promoter regions, the 5’UTR, 3’UTR, and introns exert even stronger regulatory effects on gene expression. These models also identified mutations that alter gene expression, supporting precise CRE editing. They provide valuable tools for tissue-specific predictions, regulatory sequence identification, and saturation mutagenesis to pinpoint high-effect sites.

2025-09-05 | GSE289179 | GEO

DeepWheat: Predicting the Effects of Genomic Variants on Gene Expression and Regulatory Activities Across Tissues and Varieties in Wheat Using Deep Learning [ATAC-seq]

2025-09-05 | GSE287695 | GEO

Improved Prediction of Smoking Status via Isoform-Aware RNA-seq Deep Learning Models

Project description:Predictive models based on gene expression are already a part of medical decision making for selected situations such as early breast cancer treatment. Most of these models are based on measures that do not capture critical aspects of gene splicing, but with RNA sequencing it is possible to capture some of these aspects of alternative splicing and use them to improve clinical predictions. Building on previous models to predict cigarette smoking status, we show that measures of alternative splicing significantly improve the accuracy of these predictive models.

2020-12-31 | GSE158699 | GEO

Baseline gene expression profiling determines long-term benefit to programmed cell death protein 1 axis blockade

Project description:Treatment with immune checkpoint inhibitors has altered the course of malignant melanoma, with approximately half of the patients with advanced disease surviving for more than 5 years after diagnosis. Currently, there are no biomarker methods for predicting outcome from immunotherapy. Here, we obtained transcriptomic information from a total of 105 baseline tumor samples comprising two cohorts of patients with advanced melanoma treated with programmed cell death protein 1 (PD-1)-based immunotherapies. Gene expression profiles were correlated with progression-free survival (PFS) within consecutive clinical benefit intervals (i.e., 6, 12, 18, and 24 months). Elastic net binomial regression models with cross validation were utilized to compare the predictive value of distinct genes across time. Lasso regression was used to generate a signature predicting long-term benefit (LTB), defined as patients who remain alive and free of disease progression at 24 months post treatment initiation. We show that baseline gene expression profiles were consistently able to predict long-term immunotherapy outcomes with high accuracy. The predictive value of different genes fluctuated across consecutive clinical benefit intervals, with a distinct set of genes defining benefit at 24 months compared to earlier outcomes. A 12-gene signature was able to predict LTB following anti-PD-1 therapy with an area under the curve (AUC) equal to 0.92 and 0.74 in the training and validation set, respectively. Evaluation of LTB via a unique signature may complement objective response classification and characterize the logistics of sustained antitumor immune responses.

2022-10-21 | GSE215868 | GEO

Quantitative modeling of transcription factor binding specificities using DNA shape

Project description:Accurate predictions of the DNA binding specificities of transcription factors (TFs) are necessary for understanding gene regulatory mechanisms. Traditionally, predictive models are built based on nucleotide sequence features. Here, we employed three-dimensional DNA shape information obtained on a high-throughput basis to integrate intuitive DNA structural features into the modeling of TF binding specificities using support vector regression. We performed quantitative predictions of DNA binding specificities, using the DREAM5 dataset for 65 mouse TFs and genomic-context protein binding microarray data for three human basic helix-loop-helix TFs. DNA shape-augmented models compared favorably with sequence-based models for these predictions. Although both k-mer and DNA shape features encoded the interdependencies between nucleotide positions of the binding site, using DNA shape features reduced the dimensionality of the feature space compared to k-mer use. Finally, analyzing the weights of DNA shape-augmented models uncovered TF family-specific structural readout mechanisms that were not obvious from the nucleotide sequence.

2014-11-04 | GSE59845 | GEO

Gene expression of bovine embryo

Project description:The objective of the present study was to use our own fabricated bovine embryo cDNA microarray to profile genome-wide expression patterns of genes during the implantation period. Duplicate readings were obtained by hybridization of a Cy3/Cy5 probe mixture for each sample. For data normalization, the local background intensity of each array spot was smoothed by the locally weighted regression (lowess) smoother and subtracted from the spot intensity data. The subtracted intensity data were subjected to non-parametric regression and local variance normalization because non-parametric regression can reduce intensity-dependent biases. Accuracy is improved, when compared to linear regression, if the points in the scatter plot of Cy3 vs. Cy5 are not distributed around a straight line. Keywords: gene expression, embryo, fetus, fetal membrane, implantation period, time-course

2004-11-24 | GSE1414 | GEO

Quantitative modeling of transcription factor binding specificities using DNA shape

Project description:The SELEX-seq platform was used to generate DNA-binding affinity predictions for the human Max transcription factor. This experiment was performed as part of a cross-validation study comparing the accuracy of DNA shape-augmented TF binding specificity models across two different platforms (SELEX-seq and gcPBM).

2015-03-01 | GSE60200 | GEO
