Project description: Sequence-based deep learning models have become the state of the art for the analysis of the genomic regulatory code. Particularly for transcriptional enhancers, deep learning models excel at deciphering the sequence features and grammar that underlie their spatiotemporal activity. To enable end-to-end enhancer modeling and design, we developed a software and modeling package called CREsted. It combines preprocessing starting from single-cell ATAC-seq data; modeling with a choice of several architectures for training classification and regression models on either topics or pseudobulk peak heights; sequence design using multiple strategies; and downstream analysis through a collection of tools to locate transcription factor (TF) binding sites, infer the effect of a TF (activating or repressing) on enhancer accessibility, decipher enhancer grammar, and score gene loci. We demonstrate CREsted using a mouse cortex model that we validate using the BICCN collection of in vivo validated mouse brain enhancers. Classical enhancers in immune cells, including the IFN-β enhanceosome, are revisited using a PBMC model, and we assess the accuracy of TF binding site predictions with ChIP-seq. Additionally, we use CREsted to compare mesenchymal-like cancer cell states between tumor types, and we investigate different fine-tuning strategies of Borzoi within CREsted, comparing their performance and explainability with CREsted models trained from scratch. Finally, we train a CREsted model on a scATAC-seq atlas of zebrafish development and use it to design and in vivo validate cell type-specific synthetic enhancers in three tissues. Across these varied datasets, we demonstrate that CREsted facilitates efficient training and analysis, enabling scrutiny of enhancer logic and design of synthetic enhancers across tissues and species. CREsted is available at https://crested.readthedocs.io.
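As a minimal illustration of what "sequence-based" means in practice, the sketch below one-hot encodes a DNA string into per-base rows and builds its reverse complement. This is a common input convention for such models and is assumed here for illustration; it is not taken from the CREsted package, and the function names and the A, C, G, T column order are hypothetical choices.

```python
# Illustrative sketch only; this is NOT CREsted's API.
# Assumption: columns are ordered A, C, G, T and unknown bases (e.g. N) map to all zeros.
BASES = "ACGT"


def one_hot_encode(sequence):
    """Encode a DNA string as a list of 4-element rows, one row per base."""
    rows = []
    for base in sequence.upper():
        row = [0.0, 0.0, 0.0, 0.0]
        index = BASES.find(base)
        if index >= 0:  # leave the row as zeros for bases outside A, C, G, T
            row[index] = 1.0
        rows.append(row)
    return rows


def reverse_complement(one_hot):
    """Reverse-complement an encoded sequence.

    Reversing the row order reverses the sequence; reversing each row swaps
    A<->T and C<->G because of the A, C, G, T column order.
    """
    return [row[::-1] for row in reversed(one_hot)]


encoded = one_hot_encode("ACGTN")
```

Ordering the columns so that complementary bases mirror each other makes the reverse complement a pure reversal along both axes, which is convenient when a model is evaluated on both strands.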
Project description: Mammalian development is orchestrated by the interplay of trans-acting factors and cis-regulatory elements. Although genome sequences evolve rapidly, the regulatory grammar that governs their interpretation evolves far more slowly. We hypothesized that this pronounced mismatch in evolutionary tempos creates a powerful opportunity for “evolutionary transfer learning”, in that models trained to learn cell type-specific cis-regulatory grammars in one mammalian species should generalize to the orthologous cell types of other mammals. To test this, we generated a time-resolved atlas of chromatin accessibility across mouse development from embryonic day 10 (E10) to birth (P0). Using single-cell combinatorial indexing, we profiled 3.9 million nuclei from 36 precisely staged embryos, resolving dynamic accessibility landscapes across 36 cell classes and 140 cell types. We then applied a multi-output deep learning model, CREsted, to these data to predict cell type-specific chromatin accessibility from DNA sequence. However, while “evolution-naive” models performed well within peak-defined regions, genome-wide inference revealed systematic failure modes, including overprediction at tandem repeats and conflation of promoter and distal enhancer grammars. To address this, we introduced an “evolution-aware” framework that isolates distal enhancer grammars by requiring both syntenic persistence and functional coherence across mammals, defined as sequence-intrinsic regulatory behavior that is concordant across enhancer orthologs and robust to in silico tandem repeat disruption. This updated CREsted model produced refined genome-wide regulatory maps whose predicted enhancer activity scaled with enhancer score and enhancer–promoter proximity to explain cell type-specific gene expression. Incorporating syntenic enhancer orthologs from up to 240 placental mammals directly into training expanded the effective regulatory corpus by more than two orders of magnitude. Finally, applying the fully evolution-augmented model to the human genome yielded distal enhancer maps for orthologous human cell types. Taken together, our results unify advances in single-cell molecular profiling, deep learning, and comparative genomics into a framework for model-driven reconstruction of human cis-regulatory landscapes, including for cell types that emerge during the embryonic, fetal, and pediatric stages of human development and are largely inaccessible to molecular profiling. More broadly, our work supports the view that model organisms and evolutionarily diverse non-human genomes are indispensable resources for accelerating the AI-enabled exploration of human biology.
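The abstract does not specify how the in silico tandem repeat disruption is carried out. The sketch below shows one way such a robustness check could look, under loud assumptions: tandem repeats are detected with a naive scan for short units repeated back to back, disruption means replacing the repeat with uniformly random bases, and `gc_fraction` is a toy stand-in for a trained model's prediction. All function names, parameters, and thresholds are hypothetical.

```python
import random


def find_tandem_repeats(seq, max_unit=6, min_copies=4):
    """Return (start, end) spans where a unit of 1..max_unit bases repeats
    back to back at least min_copies times (naive left-to-right scan)."""
    spans = []
    i = 0
    while i < len(seq):
        best_end = i
        for unit_len in range(1, max_unit + 1):
            unit = seq[i:i + unit_len]
            if len(unit) < unit_len:
                break  # not enough sequence left for a full unit
            j = i + unit_len
            while seq[j:j + unit_len] == unit:
                j += unit_len
            if (j - i) // unit_len >= min_copies and j > best_end:
                best_end = j
        if best_end > i:
            spans.append((i, best_end))
            i = best_end
        else:
            i += 1
    return spans


def disrupt_tandem_repeats(seq, rng, max_unit=6, min_copies=4):
    """Replace each detected repeat span with random bases (an assumed
    disruption strategy), leaving all other positions untouched."""
    chars = list(seq)
    for start, end in find_tandem_repeats(seq, max_unit, min_copies):
        chars[start:end] = [rng.choice("ACGT") for _ in range(end - start)]
    return "".join(chars)


def robustness_delta(seq, predict, rng):
    """Absolute change in a prediction after repeat disruption; values near
    zero suggest the prediction does not depend on the repeat."""
    return abs(predict(seq) - predict(disrupt_tandem_repeats(seq, rng)))


def gc_fraction(seq):
    """Toy stand-in for a model prediction, used only to make the sketch run."""
    return (seq.count("G") + seq.count("C")) / len(seq)


example = "GATTACAG" + "CA" * 10 + "TGGATCCT"
delta = robustness_delta(example, gc_fraction, random.Random(0))
```

In a real setting, `predict` would wrap a trained model's accessibility output for one cell type, and a region would be kept only if its prediction stays stable after disruption. The threshold for "stable" is a separate design choice that this sketch does not address.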