Other

Dataset Information

DNA language models reveal the architecture of nucleotide dependencies in genomes


ABSTRACT: While the genome is composed of individual nucleotides, functional elements such as cis-regulatory elements and structural interactions are formed from sets of interdependent nucleotides. In principle, these dependencies are reflected in coevolutionary relationships. However, classical comparative genomics approaches struggle to detect these dependencies beyond alignable highly conserved sequences such as within coding regions. DNA language models (LMs), which are trained by predicting nucleotides given their sequence context, have recently been proposed as foundational models for sequence-based prediction problems. DNA LMs implicitly capture functional elements from genomic sequences alone. However, which dependencies DNA LMs learn and whether they reflect known or even novel biology remains an open question. Here we introduce nucleotide dependency maps to systematically study nucleotide dependencies captured by DNA LMs in a purely unsupervised setup. We compute these maps genome-wide and show that they reveal and clearly delineate known functional genomic features such as transcription factor binding motifs, functional interactions between splice sites, RNA tertiary structures, and coding sequences. This allowed us to uncover novel, experimentally validated RNA structures. We furthermore investigate dependency maps from in silico manipulated sequences, revealing the ability of DNA LMs to capture operations such as copying and reverse complementarity without memorization. Lastly, we compare dependency maps from openly available DNA LMs, showcasing the drawbacks and advantages of different models. We find stark differences in the ability of models to accurately learn conserved but infrequent features. Altogether, by leveraging the flexibility of DNA language models, nucleotide dependency mapping emerges as a general methodology to discover and study functional interactions in genomes.

ORGANISM(S): Escherichia coli

PROVIDER: GSE271937 | GEO | 2025/07/19

REPOSITORIES: GEO

Similar Datasets

| PRJNA1134158 | ENA
2025-01-14 | GSE286872 | GEO
2025-04-30 | GSE295371 | GEO
2024-11-05 | GSE270865 | GEO
2024-04-27 | GSE265942 | GEO
2020-08-22 | GSE110061 | GEO
2024-06-19 | MODEL2406180001 | BioModels
2017-02-09 | GSE68295 | GEO
2017-02-09 | GSE68304 | GEO
2022-12-31 | GSE159849 | GEO