
Dataset Information


Evaluating the representational power of pre-trained DNA language models for regulatory genomics.


ABSTRACT: The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested only after fine-tuning their weights for each downstream task, whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question. Here we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation. Our findings suggest that current gLMs do not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. This work highlights a major limitation of current gLMs and raises potential issues with conventional pre-training strategies for the non-coding genome.

SUBMITTER: Tang Z 

PROVIDER: S-EPMC10925287 | biostudies-literature | 2024 Mar

REPOSITORIES: biostudies-literature


Publications

Evaluating the representational power of pre-trained DNA language models for regulatory genomics.

Tang Z, Somia N, Yu Y, Koo PK

bioRxiv: the preprint server for biology, 2024-09-25



Similar Datasets

S-EPMC10357883 | biostudies-literature
S-EPMC10457366 | biostudies-literature
S-EPMC9710646 | biostudies-literature
S-EPMC9044357 | biostudies-literature
S-EPMC11227580 | biostudies-literature
S-EPMC11529870 | biostudies-literature
S-EPMC8637713 | biostudies-literature
S-EPMC11764203 | biostudies-literature
S-EPMC10782905 | biostudies-literature
S-EPMC10038667 | biostudies-literature