Dataset Information

EpiGePT: a Pretrained Transformer model for epigenomics.


ABSTRACT: Transformer-based models such as GPT-3 and DALL-E have achieved unprecedented breakthroughs in natural language processing and computer vision. The inherent similarities between natural language and biological sequences have prompted a new wave of efforts to infer the grammatical rules underlying biological sequences. In genomic studies, it is worth noting that DNA sequence alone cannot explain all gene activities, owing to epigenetic mechanisms. To investigate this problem, we propose EpiGePT, a new transformer-based pretrained language model for epigenomics that predicts genome-wide epigenomic signals by incorporating mechanistic modeling of transcriptional regulation. Specifically, EpiGePT takes the context-specific activities of transcription factors (TFs) into consideration, which offers deeper biological insight compared to models trained on DNA sequence alone. In a series of experiments, EpiGePT demonstrates state-of-the-art performance on a diverse set of epigenomic signal prediction tasks, as well as on new prediction tasks via fine-tuning. Furthermore, EpiGePT is capable of learning cell-type-specific long-range interactions through its self-attention mechanism and of interpreting genetic variants associated with human diseases. We expect that the advances of EpiGePT will shed light on the complex regulatory mechanisms of gene regulation. We provide a free online prediction service for EpiGePT at https://health.tsinghua.edu.cn/epigept/.
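The abstract describes two mechanistic ingredients: conditioning genomic-bin representations on context-specific TF activities, and self-attention that can capture long-range interactions between bins. The following is a minimal numpy sketch of that idea under assumed shapes; all names, dimensions, and the single-head attention layout are hypothetical illustrations, not EpiGePT's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (n_bins, d) bin representations; one attention head
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]))  # (n_bins, n_bins) attention map
    return A @ V, A

n_bins, n_tf, d = 8, 5, 16                      # hypothetical sizes
seq_emb = rng.normal(size=(n_bins, d))          # stand-in DNA-bin embeddings
tf_act = rng.normal(size=(n_tf,))               # stand-in TF activity vector (cell-type context)
W_tf = rng.normal(size=(n_tf, d))               # projects TF activities into embedding space

# Inject cell-type-specific TF context into every bin's representation
X = seq_emb + tf_act @ W_tf

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)

# Linear head mapping each bin to (say) 3 epigenomic signal tracks
W_head = rng.normal(size=(d, 3))
signals = out @ W_head                          # (n_bins, 3) predicted signals
```

In this toy setup, `attn` plays the role of the interpretable attention map the abstract alludes to: each row is a distribution over bins, so distant bins with high weight can be read as candidate long-range interactions for that cell context.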

SUBMITTER: Gao Z 

PROVIDER: S-EPMC10370089 | biostudies-literature | 2023 Jul

REPOSITORIES: biostudies-literature


Publications

EpiGePT: a Pretrained Transformer model for epigenomics.

Zijing Gao, Qiao Liu, Wanwen Zeng, Rui Jiang, Wing Hung Wong

bioRxiv: the preprint server for biology, 2024-02-03


The inherent similarities between natural language and biological sequences have given rise to great interest in adapting the transformer-based large language models (LLMs) underlying recent breakthroughs in natural language processing for applications in genomics. However, current LLMs for genomics suffer from several limitations, such as the inability to include chromatin interactions in the training data, and the inability to make predictions in new cellular contexts not represent  ...[more]

Similar Datasets

| S-EPMC11657395 | biostudies-literature
| S-EPMC10469107 | biostudies-literature
| S-EPMC9280463 | biostudies-literature
| S-EPMC10805303 | biostudies-literature
| S-EPMC11369538 | biostudies-literature
| S-EPMC11874033 | biostudies-literature
| S-EPMC11591995 | biostudies-literature
| S-EPMC11806334 | biostudies-literature
| S-EPMC11220626 | biostudies-literature
| S-EPMC11349980 | biostudies-literature