Project description:Long noncoding RNAs (lncRNAs) can exert their function by interacting with DNA via triplex structure formation. Although this has been validated in a handful of experiments, a genome-wide analysis of lncRNA-DNA binding is still needed. In this paper, we develop and interpret deep learning models that predict the genome-wide binding sites identified by ChIRP-Seq experiments for 12 different lncRNAs. Among the several deep learning architectures tested, a simple architecture consisting of two convolutional neural network layers performed best, suggesting that local sequence patterns are determinants of the interaction. Further interpretation of the kernels in the model revealed that these local sequence patterns form triplex structures with the corresponding lncRNAs. We uncovered several novel triplex-forming domains (TFDs) of these 12 lncRNAs, as well as previously experimentally verified TFDs of the lncRNAs HOTAIR and MEG3. Using electrophoretic mobility shift assays, we experimentally verified two such novel TFDs, of the lncRNAs HOTAIR and TUG1, that were predicted by our method but previously unreported. In conclusion, we show that a simple deep learning architecture can accurately predict genome-wide binding sites of lncRNAs, and that interpretation of the models suggests RNA:DNA:DNA triplex formation as a viable mechanism underlying lncRNA-DNA interactions at the genome-wide level.
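As a rough illustration of the kind of architecture described above, the following minimal sketch runs a forward pass through two convolutional layers over a one-hot encoded DNA sequence, followed by global max pooling and a sigmoid output. It is not the authors' model: the kernel widths, channel counts, random weights, and the names `one_hot`, `conv1d_relu`, `random_kernels`, and `predict_binding` are all assumptions chosen for illustration, and no training is shown.

```python
import math
import random

ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of 4 floats (A, C, G, T); other bases become all zeros."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET] for base in seq.upper()]


def conv1d_relu(x, kernels, biases):
    """Valid 1D convolution followed by ReLU.

    x: list of positions, each a list of in_channels floats.
    kernels: list of out_channels kernels, each shaped [width][in_channels].
    """
    width = len(kernels[0])
    out = []
    for start in range(len(x) - width + 1):
        row = []
        for kernel, bias in zip(kernels, biases):
            total = bias
            for offset in range(width):
                for channel, weight in enumerate(kernel[offset]):
                    total += weight * x[start + offset][channel]
            row.append(max(0.0, total))
        out.append(row)
    return out


def random_kernels(rng, n_out, width, n_in):
    """Hypothetical random initialisation; a real model would learn these weights."""
    return [[[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(width)]
            for _ in range(n_out)]


def predict_binding(seq, params):
    """Two conv layers, global max pooling over positions, then a sigmoid score in (0, 1)."""
    h1 = conv1d_relu(one_hot(seq), params["k1"], params["b1"])
    h2 = conv1d_relu(h1, params["k2"], params["b2"])
    pooled = [max(row[c] for row in h2) for c in range(len(h2[0]))]
    logit = params["b_out"] + sum(w * p for w, p in zip(params["w_out"], pooled))
    return 1.0 / (1.0 + math.exp(-logit))


# Assumed sizes: 8 first-layer kernels of width 5, 4 second-layer kernels of width 3.
rng = random.Random(0)
params = {
    "k1": random_kernels(rng, 8, 5, 4),
    "b1": [0.0] * 8,
    "k2": random_kernels(rng, 4, 3, 8),
    "b2": [0.0] * 4,
    "w_out": [rng.uniform(-0.5, 0.5) for _ in range(4)],
    "b_out": 0.0,
}

score = predict_binding("GAGAGAGAGGAAGGAGAGAA", params)
print(f"untrained binding score: {score:.3f}")
```

Because each first-layer kernel only sees a short window of the sequence, its learned weights can be read as a local sequence pattern, which is the property the kernel interpretation in the description relies on.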
Project description:RNA silencing at the transcriptional and posttranscriptional levels regulates endogenous gene expression, controls invading transposable elements (TEs), and protects the cell against viruses. Key components of the mechanism are small RNAs (sRNAs) of 21-24 nt that guide the silencing machinery to their nucleic acid targets in a nucleotide sequence-specific manner. Transcriptional gene silencing is associated with 24-nt sRNAs and RNA-directed DNA methylation (RdDM) at cytosine residues in three DNA sequence contexts (CG, CHG, and CHH). We previously demonstrated that 24-nt sRNAs are mobile from shoot to root in Arabidopsis thaliana and confirmed that they mediate DNA methylation at three sites in recipient cells. In this study, we extend this finding by demonstrating that RdDM of thousands of loci in root tissues is dependent upon mobile sRNAs from the shoot and that mobile sRNA-dependent DNA methylation occurs predominantly in non-CG contexts. Mobile sRNA-dependent non-CG methylation is largely dependent on the DOMAINS REARRANGED METHYLTRANSFERASES 1/2 (DRM1/DRM2) RdDM pathway but is independent of the CHROMOMETHYLASE (CMT)2/3 DNA methyltransferases. Specific superfamilies of TEs, including those typically found in gene-rich euchromatic regions, lose DNA methylation in a mutant lacking 22- to 24-nt sRNAs (dicer-like 2, 3, 4 triple mutant). Transcriptome analyses identified a small number of genes whose expression in roots is associated with mobile sRNAs and connected to DNA methylation directly or indirectly. Finally, we demonstrate that sRNAs from shoots of one accession move across a graft union and target DNA methylation de novo at normally unmethylated sites in the genomes of root cells from a different accession.
Project description:Genome-wide binding assays can determine where individual transcription factors bind in the genome. However, these factors rarely bind chromatin alone; instead, they frequently bind to cis-regulatory elements (CREs) together with other factors, thus forming protein complexes. Currently, there are no integrative analytical approaches that can predict which complexes are formed on chromatin. Here, we describe a computational methodology to systematically capture protein complexes and infer their impact on gene expression. We applied our method to three human cell types, identified thousands of CREs, inferred known and undescribed complexes recruited to these CREs, and determined the role of the complexes as activators or repressors. Importantly, we found that the predicted complexes have a higher number of physical interactions between their members than expected by chance. Our work provides a means of developing hypotheses about gene regulation via binding partners and of deciphering the interplay between combinatorial binding and gene expression.
Project description:Genome-wide maps of transcription factor (TF) occupancy and regions of open chromatin implicitly contain DNA sequence signals for multiple factors. We present SeqGL, a novel de novo motif discovery algorithm to identify multiple TF sequence signals from ChIP-, DNase-, and ATAC-seq profiles. SeqGL trains a discriminative model using a k-mer feature representation together with group lasso regularization to extract a collection of sequence signals that distinguish peak sequences from flanking regions. Benchmarked on over 100 ChIP-seq experiments, SeqGL outperformed traditional motif discovery tools in discriminative accuracy. Furthermore, SeqGL can be naturally used with multitask learning to identify genomic and cell-type context determinants of TF binding. SeqGL successfully scales to the large multiplicity of sequence signals in DNase- or ATAC-seq maps. In particular, SeqGL was able to identify a number of ChIP-seq-validated sequence signals that were not found by traditional motif discovery algorithms. Thus, compared to widely used motif discovery algorithms, SeqGL demonstrates both greater discriminative accuracy and higher sensitivity for detecting the DNA sequence signals underlying regulatory element maps. SeqGL is available at http://cbio.mskcc.org/public/Leslie/SeqGL/.
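The two ingredients named above, a k-mer feature representation and group lasso regularization, can be sketched in a few lines. This is a generic illustration and not SeqGL's implementation: the function names `kmer_counts` and `group_soft_threshold`, the choice of k, and the way features are assigned to groups are all assumptions. The second function is the standard proximal step for a group lasso penalty, which shrinks each group of weights jointly and zeroes out whole groups whose norm falls below the threshold.

```python
import math
from itertools import product


def kmer_counts(seq, k):
    """Count occurrences of every DNA k-mer in seq; windows with non-ACGT bases are skipped."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        position = index.get(seq[start:start + k])
        if position is not None:
            counts[position] += 1
    return kmers, counts


def group_soft_threshold(weights, groups, lam):
    """Proximal operator of lam * sum over groups of the L2 norm of that group's weights.

    groups: list of index lists (hypothetical grouping; assumed non-overlapping).
    """
    out = list(weights)
    for group in groups:
        norm = math.sqrt(sum(weights[i] ** 2 for i in group))
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        for i in group:
            out[i] = weights[i] * scale
    return out


kmers, counts = kmer_counts("ACGTAC", 2)
print({kmer: n for kmer, n in zip(kmers, counts) if n})

# The first group has norm 5 and is shrunk; the second has norm below lam and is zeroed.
shrunk = group_soft_threshold([3.0, 4.0, 0.1, 0.1], [[0, 1], [2, 3]], lam=1.0)
print(shrunk)
```

Zeroing entire groups at once is what lets a group lasso model select a small collection of related k-mer features rather than isolated individual k-mers.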