Project description:How to design experiments that accelerate knowledge discovery on complex biological landscapes remains a tantalizing question. Here, we present OPEX, an optimal experimental design method to identify informative omics experiments for both experimental space exploration and model training. OPEX-guided exploration of Escherichia coli's cross-behavior potential, when exposed to novel biocide and antibiotic combinations, led to accelerated knowledge discovery with predictive models that are more accurate while needing 44% fewer data to train. Selecting experiments favoring broader exploration followed by fine-tuning emerged as the optimal strategy. This led to the discovery of 29 cross-protection and 4 cross-vulnerability conditions, with further validation revealing the central role of chaperones, stress response proteins and transport pumps in cross-stress exposure. This work demonstrates how active learning can be used to automate omics data collection for training accurate predictive models, evidence-driven decision making and accelerated knowledge discovery in life sciences.
Project description:To find new conditions that show maximum entropy and the highest prediction interval, we bounded the condition space by listing the top 98 conditions for MG1655 without genetic perturbations. These combine the 13 most populated stresses (acidic, Ax, Bm, butanol, Cfs, cold, ethanol, heat, Mcn, Nx, hypoxia, osmotic, oxidative) or no stress with the 7 most-used carbon sources (Glu 0.4%, Gly 0.4%, Lac 0.4%, Galactose, Arabinose 0.4%, Glucose 0.2%, and Alanine) in M9 medium. Of these, 13 conditions were already in the Ecomics dataset. From the 85 unexplored conditions, we identified the top 15 that show maximum entropy and the highest prediction interval in an adaptive fashion: at each iteration, we select the condition in the list that shows maximum entropy and the highest prediction interval under the model built from the current training data. Because the entropy and prediction interval values are not on the same scale, we bound both measures between zero and one by min-max normalization across the 85 conditions. We then supplement the training data with the predicted expression levels for the selected condition and repeat the procedure until all 15 conditions are identified. The initial training data comprise 2610 profiles of 178 transcription factors. Transcriptome profiling was performed on 45 E. coli samples (3 replicates for each of the 15 conditions) selected by optimal experimental design for a genome-scale model.
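The adaptive selection loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The description does not say how the two normalized measures are combined, so the unweighted sum used here is an assumption, as is normalizing over the candidates still remaining at each iteration. The names `uncertainty_fn`, `update_fn` and `state` are hypothetical placeholders for the predictive model and its training data.

```python
import numpy as np


def min_max(values):
    """Rescale values to [0, 1]; a constant vector maps to all zeros."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        return np.zeros_like(values)
    return (values - values.min()) / span


def select_conditions(candidates, uncertainty_fn, update_fn, state, n_select=15):
    """Greedily pick n_select conditions, one per iteration.

    uncertainty_fn(state, remaining) -> (entropy, interval), with one value
        per remaining candidate (hypothetical stand-in for the model).
    update_fn(state, chosen) -> new state after adding the model's predicted
        expression levels for the chosen condition (hypothetical).
    """
    remaining = list(candidates)
    chosen = []
    for _ in range(min(n_select, len(remaining))):
        entropy, interval = uncertainty_fn(state, remaining)
        # ASSUMPTION: the two normalized measures are combined by a plain sum.
        score = min_max(entropy) + min_max(interval)
        best = remaining.pop(int(np.argmax(score)))
        chosen.append(best)
        state = update_fn(state, best)
    return chosen
```

Any rule that ranks candidates by both normalized measures would fit the same loop; only the `score` line would change.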
Project description:BACKGROUND: Microarray comparative genomic hybridization (CGH) is currently one of the most powerful techniques to measure DNA copy number in large genomes. In humans, microarray CGH is widely used to assess copy number variants in healthy individuals and copy number aberrations associated with various diseases, syndromes and disease susceptibility. In model organisms such as Caenorhabditis elegans (C. elegans) the technique has been applied to detect mutations, primarily deletions, in strains of interest. Although various constraints on oligonucleotide properties have been suggested to minimize non-specific hybridization and improve the data quality, there have been few experimental validations for CGH experiments. For genomic regions where strict design filters would limit the coverage it would also be useful to quantify the expected loss in data quality associated with relaxed design criteria. RESULTS: We have quantified the effects of filtering various oligonucleotide properties by measuring the resolving power for detecting deletions in the human and C. elegans genomes using NimbleGen microarrays. Approximately twice as many oligonucleotides are typically required to be affected by a deletion in human DNA samples in order to achieve the same statistical confidence as one would observe for a deletion in C. elegans. Surprisingly, the ability to detect deletions strongly depends on the oligonucleotide 15-mer count, which is defined as the sum of the genomic frequency of all the constituent 15-mers within the oligonucleotide. A similarity level above 80% to non-target sequences over the length of the probe produces significant cross-hybridization. We recommend the use of a fairly large melting temperature window of up to 10 °C, the elimination of repeat sequences, the elimination of homopolymers longer than 5 nucleotides, and a threshold of -1 kcal/mol on the oligonucleotide self-folding energy.
We observed very little difference in data quality when varying the oligonucleotide length between 50 and 70, and even when using an isothermal design strategy. CONCLUSIONS: We have determined experimentally the effects of varying several key oligonucleotide microarray design criteria for detection of deletions in C. elegans and humans with NimbleGen's CGH technology. Our oligonucleotide design recommendations should be applicable for CGH analysis in most species.
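The 15-mer count defined in this abstract, and the homopolymer filter it recommends, can be illustrated with a short sketch. It assumes exact matching on a single strand (no reverse complement) and an in-memory genome string, which is only practical for small examples. The function names are hypothetical and do not come from the study.

```python
from collections import Counter


def kmer_frequencies(genome, k=15):
    """Count how often each k-mer occurs in the genome (one strand only)."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def kmer_count(oligo, freqs, k=15):
    """Sum the genomic frequencies of all constituent k-mers of the oligo."""
    return sum(freqs[oligo[i:i + k]] for i in range(len(oligo) - k + 1))


def longest_homopolymer(seq):
    """Length of the longest run of a single repeated nucleotide."""
    longest = run = 0
    previous = None
    for base in seq:
        run = run + 1 if base == previous else 1
        longest = max(longest, run)
        previous = base
    return longest


def passes_homopolymer_filter(oligo, max_run=5):
    """Reject oligos containing a homopolymer longer than max_run."""
    return longest_homopolymer(oligo) <= max_run
```

A probe whose constituent 15-mers each occur once in the genome gets a count equal to its number of 15-mers; higher values indicate sequence shared with other genomic locations.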
Project description:A definitive screening design was used to systematically optimize a DIA workflow for crustacean neuropeptide quantitation. We were able to assess several parameters for their effect on increasing quantifiable neuropeptides and predict the optimal value for these parameters.
Project description:Transcriptional enhancers act as docking stations for combinations of transcription factors and thereby regulate spatiotemporal activation of their target genes. A single enhancer, of a few hundred base pairs in length, can autonomously and independently of its location and orientation drive cell-type specific expression of a gene or transgene. It has been a long-standing goal in the field to decode the regulatory logic of an enhancer and to understand the details of how spatiotemporal gene expression is encoded in an enhancer sequence. Recently, deep learning models have yielded unprecedented insight into the enhancer code, and well-trained models are reaching a level of understanding that may be close to complete. As a consequence, we hypothesized that deep learning models can be used to guide the directed design of synthetic, cell type specific enhancers, and that this process would allow for a detailed tracing of all enhancer features at nucleotide-level resolution. Here we implemented and compared three different design strategies, each built on a deep learning model: (1) directed sequence evolution; (2) directed iterative motif implanting; and (3) generative design. We evaluated the function of fully synthetic enhancers to specifically target Kenyon cells or glial cells in the fruit fly brain using transgenic animals. We then exploited this concept further by creating “dual-code” enhancers that target two cell types, and minimal enhancers smaller than 50 base pairs that are fully functional. By examining the trajectories followed during state space searches towards functional enhancers, we could accurately define the enhancer code as the optimal strength, combination, and relative distance of TF activator motifs, and the absence of TF repressor motifs. Finally, we applied the same three strategies to successfully design human enhancers, finding highly similar design principles as in Drosophila. 
In conclusion, enhancer design guided by deep learning leads to better understanding of how enhancers work and shows that their code can be exploited to manipulate cell states.
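The abstract names directed sequence evolution as one design strategy but does not spell out the procedure. One common way to realize the idea is a greedy in silico mutagenesis loop: score every single-nucleotide substitution and keep the best one at each step. The sketch below shows that generic loop under this assumption. `score_fn` is a hypothetical stand-in for a trained deep learning model's predicted activity in the target cell type.

```python
def evolve(seq, score_fn, n_steps=10, alphabet="ACGT"):
    """Greedy sequence evolution by single-nucleotide substitutions.

    score_fn(sequence) -> float is a hypothetical placeholder for a model
    prediction. Returns the trajectory as a list of (sequence, score) pairs.
    """
    seq = str(seq)
    history = [(seq, score_fn(seq))]
    for _ in range(n_steps):
        best_seq, best_score = history[-1]
        # Try every substitution at every position of the current sequence.
        for i, base in enumerate(seq):
            for new_base in alphabet:
                if new_base == base:
                    continue
                candidate = seq[:i] + new_base + seq[i + 1:]
                score = score_fn(candidate)
                if score > best_score:
                    best_seq, best_score = candidate, score
        if best_seq == seq:
            break  # no substitution improves the score; stop early
        seq = best_seq
        history.append((seq, best_score))
    return history
```

The returned trajectory records each accepted mutation, which is the kind of information needed to trace how features accumulate along the search path.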