Project description: While the genome is composed of individual nucleotides, functional elements such as cis-regulatory elements and structural interactions are formed from sets of interdependent nucleotides. In principle, these dependencies are reflected in coevolutionary relationships. However, classical comparative genomics approaches struggle to detect these dependencies beyond alignable, highly conserved sequences such as those within coding regions. DNA language models (LMs), which are trained by predicting nucleotides given their sequence context, have recently been proposed as foundational models for sequence-based prediction problems. DNA LMs implicitly capture functional elements from genomic sequences alone. However, which dependencies DNA LMs learn and whether they reflect known or even novel biology remains an open question. Here we introduce nucleotide dependency maps to systematically study nucleotide dependencies captured by DNA LMs in a purely unsupervised setup. We compute these maps genome-wide and show that they reveal and clearly delineate known functional genomic features such as transcription factor binding motifs, functional interactions between splice sites, RNA tertiary structures, and coding sequences. This allowed us to uncover novel, experimentally validated RNA structures. We furthermore investigate dependency maps from in silico manipulated sequences, revealing the ability of DNA LMs to capture operations such as copying and reverse complementarity without memorization. Lastly, we compare dependency maps from openly available DNA LMs, showcasing the drawbacks and advantages of different models. We find stark differences in the ability of models to accurately learn conserved but infrequent features. Altogether, by leveraging the flexibility of DNA language models, nucleotide dependency mapping emerges as a general methodology to discover and study functional interactions in genomes.
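The description above does not state how a dependency map is computed. As a rough illustration only, the sketch below assumes one plausible reading: substitute each nucleotide in turn and record the largest shift in the model's predicted log-odds at every other position. The names `toy_model`, `log_odds`, and `dependency_map` are hypothetical, and `toy_model` is a hand-written stand-in for a DNA LM (it simply favours the complement of the mirror position), not a trained model.

```python
import math

ALPHABET = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def toy_model(seq):
    """Hypothetical stand-in for a DNA LM.

    Returns one probability row (order A, C, G, T) per position. Position j
    favours the complement of the nucleotide at the mirror position n-1-j,
    mimicking a reverse-complement dependency.
    """
    n = len(seq)
    probs = []
    for j in range(n):
        favoured = COMPLEMENT[seq[n - 1 - j]]
        probs.append([0.85 if nt == favoured else 0.05 for nt in ALPHABET])
    return probs


def log_odds(p):
    """Log2 odds of a probability strictly between 0 and 1."""
    return math.log2(p / (1.0 - p))


def dependency_map(seq, model):
    """Assumed definition: dep[i][j] is the largest absolute change in
    log-odds at position j over all single substitutions at position i."""
    n = len(seq)
    ref = model(seq)
    dep = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for alt in ALPHABET:
            if alt == seq[i]:
                continue  # only true substitutions
            mut = model(seq[:i] + alt + seq[i + 1:])
            for j in range(n):
                if j == i:
                    continue  # skip the mutated position itself
                shift = max(
                    abs(log_odds(mut[j][k]) - log_odds(ref[j][k]))
                    for k in range(len(ALPHABET))
                )
                dep[i][j] = max(dep[i][j], shift)
    return dep


if __name__ == "__main__":
    for row in dependency_map("ACGTAC", toy_model):
        print(" ".join(f"{value:5.2f}" for value in row))
```

With this toy model the only non-zero entries lie on the anti-diagonal, because each position depends solely on its mirror partner. A real DNA LM would be plugged in by replacing `toy_model` with a function that returns per-position nucleotide probabilities.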
Project description: Current base editors use DNA deaminases, including cytidine deaminase in cytidine base editors (CBEs) or adenine deaminase in adenine base editors (ABEs), to facilitate transition nucleotide substitutions. Combining CBEs or ABEs with glycosylase enzymes can induce limited transversion mutations. Nonetheless, a critical demand remains for base editors capable of generating alternative mutation types, such as T>G corrections. In this study, we leveraged pre-trained protein language models to optimize a uracil-N-glycosylase (UNG) variant with altered specificity for thymines (eTDG). Notably, after two rounds of testing fewer than 50 top-ranking variants, more than 50% exhibited over 1.5-fold enhancement in enzymatic activity. When eTDG was fused with nCas9, it induced programmable T-to-S (G/C) substitutions and corrected the db/db diabetic mutation in mice (up to 55%). Our findings not only establish orthogonal strategies for developing novel base editors, but also demonstrate the capacity of protein language models to optimize enzymes without extensive task-specific training data.
Project description: Transcription initiation involves the recruitment of basal transcription factors to the core promoter. A variety of core promoter elements exist; however, for most of these motifs the distribution across species is unknown. Here we report on the comparison of human and amphibian promoter sequences. We have used oligo-capping in combination with deep sequencing to determine transcription start sites in Xenopus tropicalis. To systematically predict regulatory elements, we have developed a de novo motif finding pipeline using an ensemble of computational tools. A comprehensive comparison of human and amphibian promoter sequences revealed both similarities and differences in core promoter architecture. Some of the differences stem from a highly divergent nucleotide composition of Xenopus and human promoters. Whereas the distribution of some core promoter motifs is conserved independent of species-specific nucleotide bias, the frequency of another class of motifs correlates with the single nucleotide frequencies. This class includes the well-known TATA box and SP1 motifs, which are more abundant in Xenopus and human promoters, respectively. While these motifs are enriched above the local nucleotide background in both organisms, their frequency varies in step with this background. These differences are likely adaptive, as these motifs can recruit TFIID to either CpG island or sharply initiating promoters. Our results highlight both conserved and diverged aspects of vertebrate transcription, most notably showing co-opted motif usage to recruit the transcriptional machinery to promoters with diverging nucleotide composition. This shows how sweeping changes in nucleotide composition are compatible with highly conserved mechanisms of transcription initiation. ChIP-seq profiles of TBP in Xenopus tropicalis stage 12 embryos and TSS-seq profiles of Xenopus oocytes and stage 12 embryos.
Project description: Traditional protein engineering methods, such as directed evolution, while effective, are often slow and labor-intensive. Advances in machine learning and automated biofoundries present new opportunities for optimizing these processes. This study devises a protein language model-enabled automatic evolution platform, a closed-loop system for automated protein engineering within the Design-Build-Test-Learn cycle. The protein language model ESM-2 makes zero-shot predictions of 96 variants to initiate the cycle. The biofoundry constructs and evaluates these variants and feeds the results back to a multi-layer perceptron to train a fitness predictor, which then predicts a second round of 96 variants with improved fitness. With a tRNA synthetase as a model enzyme, four rounds of evolution carried out within 10 days led to mutants with enzyme activity improved by up to 2.4-fold. Our system significantly enhances the speed and accuracy of protein evolution, driving faster advancements in protein engineering for industrial applications.
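The closed loop described above (zero-shot design, build and test, learn, redesign) can be sketched in a few lines. This is a toy simulation under loud assumptions, not the platform itself: the sequence `WILD_TYPE`, the random `ZERO_SHOT` scores standing in for ESM-2, the simulated `assay`, and the averaging-based `fit_predictor` (used here in place of the multi-layer perceptron) are all hypothetical. Only the batch size of 96 and the four rounds come from the description.

```python
import random
from statistics import mean

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
WILD_TYPE = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEV"  # hypothetical toy sequence
BATCH = 96  # variants per round, as in the description

rng = random.Random(0)

# Candidate pool: every single-residue substitution of the toy sequence.
POOL = [
    (pos, aa)
    for pos in range(len(WILD_TYPE))
    for aa in AMINO_ACIDS
    if aa != WILD_TYPE[pos]
]

# Hidden ground truth, visible only to the simulated assay.
_POS_EFFECT = [rng.gauss(0, 1) for _ in WILD_TYPE]
_AA_EFFECT = {aa: rng.gauss(0, 0.5) for aa in AMINO_ACIDS}

# Stand-in for zero-shot language model scores: a noisy view of the truth.
ZERO_SHOT = {v: _POS_EFFECT[v[0]] + rng.gauss(0, 1) for v in POOL}


def assay(variant):
    """Simulated build-and-test step returning a fitness value."""
    pos, aa = variant
    return _POS_EFFECT[pos] + _AA_EFFECT[aa]


def fit_predictor(measured):
    """Learn step. An additive averaging model replaces the MLP here."""
    overall = mean(measured.values())
    by_pos, by_aa = {}, {}
    for (pos, aa), fitness in measured.items():
        by_pos.setdefault(pos, []).append(fitness)
        by_aa.setdefault(aa, []).append(fitness)

    def predict(variant):
        pos, aa = variant
        pos_part = mean(by_pos[pos]) if pos in by_pos else overall
        aa_part = mean(by_aa[aa]) if aa in by_aa else overall
        return pos_part + aa_part - overall

    return predict


def run_cycle(rounds=4):
    """Run the Design-Build-Test-Learn loop for a number of rounds."""
    measured, history = {}, []
    score = ZERO_SHOT.get  # round 1 is ranked by zero-shot scores
    for _ in range(rounds):
        untested = [v for v in POOL if v not in measured]
        batch = sorted(untested, key=score, reverse=True)[:BATCH]  # Design
        for variant in batch:                                      # Build, Test
            measured[variant] = assay(variant)
        history.append(batch)
        score = fit_predictor(measured)                            # Learn
    return measured, history


if __name__ == "__main__":
    measured, history = run_cycle()
    for number, batch in enumerate(history, start=1):
        print(number, round(max(measured[v] for v in batch), 3))
```

The point of the sketch is the control flow: the first batch is chosen without any measurements, and every later batch is chosen by a predictor retrained on all measurements collected so far.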
Project description: To investigate the architecture of the E. coli K-12 transcriptome, we used two RNA-Seq technologies to analyze strand-specific transcription at single-nucleotide resolution. We analyzed the data by using an organizational schema to annotate the promoters and terminators that define transcription units across the genome. Our results showed that most (ca. two-thirds) operons have a single promoter and terminator, whereas one-third of operons contain multiple transcription units. We found substantial evidence for differential gene expression within complex operons, which we categorized based on operon architecture. E. coli K-12 strain MG1655 substrain BW38028 and an isogenic rpoS mutant were cultured in minimal glucose media, and the total transcriptome of log and stationary phase samples was sequenced.
Project description: Language is a unique human capability with limited molecular understanding due to the lack of animal models and technical constraints. Transcriptomic analysis offers a comprehensive view of gene expression in specific tissues, aiding the understanding of their functions. However, such patterns have been underexplored in language-related regions of the human brain. This study conducts a comprehensive transcriptomic analysis of 125 samples from 13 language-related Brodmann areas (BAs) in both hemispheres of five human postmortem brains. The expression landscape of human language-related regions is mapped, revealing higher expression in the right hemisphere, notably BA45 (Broca’s area) and BA3/1/2 (ventral sensory-motor cortex). Integrative analysis of differentially expressed genes and language-relevant genetic discoveries provides insights into the rs62060948 locus. The findings suggest that the rs62060948-MYC-WNT3 axis plays a crucial role in language function in BA44 of the right hemisphere. Behavioral tests in mice show that Wnt3 knockdown in the right auditory cortex leads to abnormal behaviors, confirming that imbalanced Wnt3 expression impacts language function. Pathway analysis indicates that Wnt3 maintains nervous system development through neuron ensheathment. This study enhances understanding of the molecular mechanisms underlying human language function and identifies potential targets for language disorder therapies.