Decoding and Rewiring Promoter Architecture Using Large Language Models and Diffusion Frameworks
ABSTRACT: High-performance promoters are essential tools for precisely regulating gene expression, yet their rational design within the vast combinatorial sequence space remains a major challenge. Here, we present a hybrid framework that integrates a large language model (LLM) with a diffusion model to enable data-driven and interpretable promoter design. The fine-tuned LLM predicts promoter strength with high accuracy and, through pseudo-sequence mutations, identifies biologically essential core motifs. A diffusion model is then conditioned on these motifs to reconstruct non-core regions and generate complete promoter sequences. We experimentally validated this approach in E. coli using high-throughput barcoded promoter activity sequencing: over 90% of the generated promoters showed measurable activity, and the best variants achieved approximately 20-fold higher expression than the benchmark promoter (BBa_J23119). By explicitly coupling interpretability with generative design, this strategy provides a generalizable path to accelerate synthetic biology efforts and advance large-scale regulatory sequence engineering.
ORGANISM(S): Escherichia coli DH5α
PROVIDER: GSE316915 | GEO | 2026/01/23
REPOSITORIES: GEO