Project description:Transcription factors read the genome, fundamentally connecting DNA sequence to gene expression across diverse cell types. Determining how, where, and when TFs bind chromatin will advance our understanding of gene regulatory networks and cellular behavior. The 2017 ENCODE-DREAM in vivo Transcription-Factor Binding Site (TFBS) Prediction Challenge highlighted the value of chromatin accessibility data to TFBS prediction, establishing state-of-the-art methods. Yet, while Assay-for-Transposase-Accessible-Chromatin (ATAC)-seq datasets grow exponentially, suboptimal motif scanning is commonly used for TFBS prediction from ATAC-seq. Here, we present “maxATAC”, a suite of user-friendly, deep neural network models for genome-wide TFBS prediction from ATAC-seq in any cell type. With models available for 127 human TFs, maxATAC is the largest collection of state-of-the-art TFBS models to date. maxATAC performance extends to primary cells and single-cell ATAC-seq, enabling state-of-the-art TFBS prediction in vivo. We demonstrate maxATAC’s capabilities by identifying TFBS associated with allele-dependent chromatin accessibility at atopic dermatitis genetic risk loci.
Project description:We developed an unbiased strategy for mechanism-of-action (MOA) prediction, called Perturbation-Specific Transcriptional Mapping (PerSpecTM), which pairs high-throughput expression profiling of wildtype strains or hypomorphic mutants depleted for essential targets with a computational analysis of the resulting profiles. We applied PerSpecTM to perform reference-based MOA prediction based on the principle that similar perturbations, whether small molecule or genetic, will elicit similar transcriptional responses. Using this approach, we elucidated the MOAs of three new molecules with activity against Pseudomonas aeruginosa by mapping their expression profiles to those of a reference set of antimicrobial compounds with known MOAs. We also show with PerSpecTM that transcriptional responses to small-molecule inhibition map to those resulting from CRISPRi depletion of essential targets, demonstrating proof of concept that correlations between expression profiles of small-molecule and genetic perturbations can facilitate MOA prediction when no chemical entities exist to serve as a reference. This work lays the foundation for an unbiased, readily scalable, systematic reference-based strategy for MOA elucidation that could transform antibiotic discovery efforts.
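The reference-based principle described above (similar perturbations elicit similar transcriptional responses) can be illustrated with a minimal sketch. This is not the PerSpecTM implementation: the function names, the use of Pearson correlation as the similarity measure, the labels "MOA_A" and "MOA_B", and the toy expression values are all assumptions made for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / (sd_x * sd_y)

def predict_moa(query_profile, reference_profiles):
    """Return (label, correlation) of the best-correlated reference profile.

    reference_profiles is a list of (moa_label, profile) pairs; the labels
    and profiles used below are hypothetical toy values.
    """
    best_label, best_r = None, -2.0  # -2.0 is below any valid correlation
    for label, profile in reference_profiles:
        r = pearson(query_profile, profile)
        if r > best_r:
            best_label, best_r = label, r
    return best_label, best_r

# Toy reference set: per-gene expression changes for compounds of known MOA.
references = [
    ("MOA_A", [2.0, -1.0, 0.5, 0.0]),
    ("MOA_B", [-1.5, 2.0, 0.0, 1.0]),
]
# Toy query: profile of a compound whose MOA is unknown.
query = [1.8, -0.7, 0.4, 0.1]

label, r = predict_moa(query, references)
print(label, round(r, 3))
```

In this toy setup the query is assigned the label of its most correlated reference. A reference built from genetic depletion profiles rather than compounds would plug into the same function, which is the idea behind using CRISPRi profiles when no reference compound exists.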
Project description:In this work, a total of eight proteins have been identified (six specific to the human proteome and two specific to the soybean proteome) that are supported by literature to be involved in human health, specifically related to immunological and neurological pathways. Our approach involved the use of the Protein-protein Interaction Prediction Engine (PIPE4) algorithm which was specifically developed for complex inter- and cross-species prediction schemas to generate the comprehensive interactome between H. sapiens and G. max. A literature-curated list of human proteins known to be associated with the Human Allergy Response and a second literature-curated list of soybean proteins known to be associated with Soybean Allergens were used to identify candidate proteins whose interactions may be consequential to human health. This study, beyond generating the most comprehensive human-soybean interactome to date, elucidated a soybean seed interactome and identified several proteins putatively consequential to human health.
Project description:Abstract:
Crosslinking mass spectrometry (Crosslinking MS) has developed into a robust technique that is increasingly used to investigate the interactomes of organelles and cells. However, the incomplete and noisy information contained in spectra particularly limits the identification of heteromeric protein-protein interactions (PPIs) among the many theoretically possible PPIs. Here, we leveraged chromatographic retention time (RT) to complement the mass spectrometry-centric identification process. For this, we first made crosslinked peptides amenable to RT prediction through a Siamese neural network, and then added RT information to the identification process. Our multi-task machine learning model xiRT achieved highly accurate predictions in a multi-dimensional separation experiment of crosslinked E. coli lysate conducted for this study. We combined strong cation exchange (SCX), hydrophilic strong anion exchange (hSAX) and reversed-phase (RP) chromatography and reached an R² of 0.94 in RP, with predictions within a margin of error of 1 fraction in 94% of cases for hSAX and 85% of cases for SCX. Importantly, supplementing the search engine score with retention time features led to a 1.4-fold increase in PPIs at 1% PPI false discovery rate (FDR). We also demonstrated the value of this approach for the more routine analysis of multiprotein complexes. In the Fanconi anaemia monoubiquitin ligase complex, an increase of 1.7-fold in heteromeric residue-pairs was achieved at 1% residue-pair FDR, solely using reversed-phase RT. Retention times therefore proved to be a powerful complement to mass spectrometric information to improve the identification of crosslinked peptides. We envision xiRT supplementing search engines in their scoring routines to increase the sensitivity of Crosslinking MS analyses, especially for protein-protein interactions.
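The two kinds of accuracy figures quoted above can be made concrete with a short sketch: the coefficient of determination for a continuous RT dimension such as RP, and the share of predictions within one fraction for fractionated dimensions such as hSAX and SCX. The function names and toy values are assumptions for illustration, and the exact evaluation procedure used for xiRT may differ.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def within_one_fraction(observed, predicted):
    """Share of predictions at most one fraction away from the observed one."""
    hits = sum(1 for o, p in zip(observed, predicted) if abs(o - p) <= 1)
    return hits / len(observed)

# Toy data (hypothetical): RP retention times in minutes.
rp_observed = [10.0, 20.0, 30.0, 40.0]
rp_predicted = [12.0, 19.0, 31.0, 38.0]

# Toy data (hypothetical): fraction indices from an offline separation.
fraction_observed = [1, 2, 3, 4]
fraction_predicted = [1, 3, 5, 4]

print(r_squared(rp_observed, rp_predicted))
print(within_one_fraction(fraction_observed, fraction_predicted))
```

Treating fraction indices as ordinal and scoring them with a tolerance of one reflects that neighbouring fractions overlap in practice, so an off-by-one prediction is still informative.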
Conclusion:
Using a Siamese network architecture, we succeeded in bringing RT prediction into the Crosslinking MS field, independent of separation setup and search software. Our open source application xiRT introduces the concept of multi-task learning to achieve multi-dimensional chromatographic retention time prediction, and may use any peptide sequence-dependent measure including for example collision cross section or isoelectric point. The black-box character of the neural network was reduced by means of interpretable machine learning that revealed individual amino acid contributions towards the separation behavior. The RT predictions – even when using only the RP dimension – complement mass spectrometric information to enhance the identification of heteromeric crosslinks in multiprotein complex and proteome-wide studies. Overfitting does not account for this gain as known false target matches from an entrapment database did not increase. Leveraging additional information sources may help to address the mass-spectrometric identification challenge of heteromeric crosslinks.