Deep2Full: Evaluating strategies for selecting the minimal mutational experiments for optimal computational predictions of deep mutational scan outcomes.
ABSTRACT: Performing a complete deep mutational scan with all single point mutations may not be practical, and may not even be required, especially if predictive computational models can be developed. Computational models are, however, naive to the cellular response under the myriad assay conditions. In a realistic paradigm of assay context-aware predictive hybrid models that combine minimal experimental data from deep mutational scans with structure, sequence information and computational models, we define and evaluate different strategies for choosing this minimal set. We evaluated the trivial strategy of a systematic reduction in the number of mutational studies from 85% to 15%, along with several others concerning the choice of mutation types, such as random versus site-directed, at the same 15% data completeness. Interestingly, the predictive capabilities obtained by training on a random set of mutations and by using a systematic substitution of all amino acids to alanine, asparagine and histidine (ANH) were comparable. Another strategy we explored, augmenting the training data with measurements of the same mutants at multiple assay conditions, did not improve the prediction quality. For the six proteins we analyzed, the bin-wise error in prediction is optimal when 50-100 mutations per bin are used in training the computational model, suggesting that good prediction quality may be achieved with a library of 500-1000 mutations.
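The two training-set strategies compared above (a random subset versus an ANH scan) can be illustrated with a minimal sketch. The wild-type sequence, function names and the 15% fraction applied to this toy protein are assumptions made for illustration only; this is not the authors' pipeline.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def all_single_mutants(wild_type):
    """Enumerate every single point mutant as (position, wt_residue, new_residue)."""
    return [(pos, wt, aa)
            for pos, wt in enumerate(wild_type, start=1)
            for aa in AMINO_ACIDS if aa != wt]

def random_subset(mutants, fraction, seed=0):
    """Random strategy: draw a fixed fraction of the full library (rounded down)."""
    rng = random.Random(seed)
    return rng.sample(mutants, int(fraction * len(mutants)))

def anh_scan(mutants, targets="ANH"):
    """ANH strategy: keep only substitutions to alanine, asparagine or histidine."""
    return [m for m in mutants if m[2] in targets]

wild_type = "MKTAYIAKQR"  # hypothetical 10-residue sequence, not from the paper
library = all_single_mutants(wild_type)      # 10 positions x 19 substitutions
random_train = random_subset(library, 0.15)  # random 15% of the library
anh_train = anh_scan(library)                # positions already A/N/H contribute 2, others 3
```

For this toy sequence both strategies happen to yield 28 of the 190 possible mutants, which mirrors why an ANH scan is comparable in size to a 15% random sample.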
Project description:The genome of a cancer cell carries somatic mutations that are the cumulative consequences of the DNA damage and repair processes operative during the cellular lineage between the fertilized egg and the cancer cell. Remarkably, these mutational processes are poorly characterized. Global sequencing initiatives are yielding catalogs of somatic mutations from thousands of cancers, thus providing the unique opportunity to decipher the signatures of mutational processes operative in human cancer. However, until now there have been no theoretical models describing the signatures of mutational processes operative in cancer genomes and no systematic computational approaches are available to decipher these mutational signatures. Here, by modeling mutational processes as a blind source separation problem, we introduce a computational framework that effectively addresses these questions. Our approach provides a basis for characterizing mutational signatures from cancer-derived somatic mutational catalogs, paving the way to insights into the pathogenetic mechanism underlying all cancers.
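The abstract above does not name the algorithm used for blind source separation. One common way to pose the problem for nonnegative mutation counts is nonnegative matrix factorization (NMF), where a catalog V (mutation types by samples) is approximated as W times H (signatures times exposures). The sketch below assumes that formulation and uses standard multiplicative updates on made-up toy data; all names and numbers are hypothetical.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Frobenius norm of V - W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5

def nmf(v, n_signatures, n_iter=200, seed=0, eps=1e-9):
    """Factor V ~ W H with nonnegative W (signatures) and H (exposures)."""
    rng = random.Random(seed)
    n_types, n_samples = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_signatures)] for _ in range(n_types)]
    h = [[rng.random() + eps for _ in range(n_samples)] for _ in range(n_signatures)]
    for _ in range(n_iter):
        # Multiplicative update for H, then for W (eps guards against division by zero).
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_samples)]
             for k in range(n_signatures)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_signatures)]
             for i in range(n_types)]
    return w, h

# Toy catalog: 4 mutation types x 3 samples, built from two made-up signatures.
catalog = [[70, 0, 35], [20, 10, 15], [10, 30, 20], [0, 60, 30]]
w0, h0 = nmf(catalog, n_signatures=2, n_iter=0)  # random initialization only
w, h = nmf(catalog, n_signatures=2)              # after 200 update rounds
```

Comparing `frobenius_error(catalog, w, h)` with the value at initialization shows how closely the two inferred signatures reconstruct the catalog.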
Project description:FMS-like tyrosine kinase 3 receptor (FLT3) internal tandem duplication (ITD) mutations result in constitutive activation of this receptor and have been shown to increase the risk of relapse in patients with acute myeloid leukemia (AML); however, substantial heterogeneity in clinical outcomes still exists within both the ITD mutated and unmutated AML subgroups, suggesting alternative mechanisms of disease relapse not accounted for by FLT3 mutational status. Single cell network profiling (SCNP) is a multiparametric flow cytometry based assay that simultaneously measures, in a quantitative fashion and at the single cell level, both extracellular surface marker levels and changes in intracellular signaling proteins in response to extracellular modulators. We previously reported an initial characterization of FLT3 ITD-mediated signaling using SCNP. Herein SCNP was applied sequentially to two separate cohorts of samples collected from elderly AML patients at diagnosis. In the first (training) study, AML samples carrying unmutated, wild-type FLT3 (FLT3 WT) displayed a wide range of induced signaling, with a fraction having signaling profiles comparable to FLT3 ITD AML samples. Conversely, the FLT3 ITD AML samples displayed more homogeneous induced signaling, with the exception of patients with low (<40%) mutational load, which had profiles comparable to FLT3 WT AML samples. This observation was then confirmed in an independent (verification) cohort. Data from the second cohort were also used to assess the association between SCNP data and disease-free survival (DFS) in the context of FLT3 and nucleophosmin (NPM1) mutational status among patients who achieved complete remission (CR) to induction chemotherapy. The combination of SCNP readouts together with FLT3 and NPM1 molecular status improved the DFS prediction accuracy of the latter. Taken together, these results emphasize the value of comprehensive functional assessment of biologically relevant signaling pathways in AML as a basis for the development of highly predictive tests for guidance of post-remission therapy.
Project description:We participated in the Critical Assessment of Genome Interpretation eQTL challenge to further test computational models of regulatory variant impact and their association with human disease. Our prediction model is based on a discriminative gapped-kmer SVM (gkm-SVM) trained on genome-wide chromatin accessibility data in the cell type of interest. The comparisons with massively parallel reporter assays (MPRA) in lymphoblasts show that gkm-SVM is among the most accurate prediction models even though all other models used the MPRA data for model training, and gkm-SVM did not. In addition, we compare gkm-SVM with other MPRA datasets and show that gkm-SVM is a reliable predictor of expression and that deltaSVM is a reliable predictor of variant impact in K562 cells and mouse retina. We further show that DHS (DNase-I hypersensitive sites) and ATAC-seq (assay for transposase-accessible chromatin using sequencing) data are equally predictive substrates for training gkm-SVM, and that DHS regions flanked by H3K27Ac and H3K4me1 marks are more predictive than DHS regions alone.
Project description:Much remains unknown about the progression and heterogeneity of mutational processes in different cancers and their diagnostic and clinical potential. A growing body of evidence supports mutation rate dependence on the local DNA sequence context for various types of mutations. We propose several tools for the analysis of cancer context-dependent mutations, which are implemented in an online computational framework MutaGene. The framework explores DNA context-dependent mutational patterns and underlying somatic cancer mutagenesis, analyzes mutational profiles of cancer samples, identifies the combinations of underlying mutagenic processes including those related to infidelity of DNA replication and repair machinery, and various other endogenous and exogenous mutagenic factors. As a result, the combination of mutagenic processes can be identified in any query sample with subsequent comparison to mutational profiles derived from malignant and benign samples. In addition, mutagen or cancer-specific mutational background models are applied to calculate expected DNA and protein site mutability to decouple relative contributions of mutagenesis and selection in carcinogenesis, thus elucidating the site-specific driving events in cancer. MutaGene is freely available at https://www.ncbi.nlm.nih.gov/projects/mutagene/.
Project description:Competition for substrates is a ubiquitous selection pressure faced by microbes, yet intracellular trade-offs can prevent cells from metabolizing every type of available substrate. Adaptive evolution is constrained by these trade-offs, but their consequences for the repeatability and predictability of evolution are unclear. Here we develop an eco-evolutionary model with a metabolic trade-off to generate networks of mutational paths in microbial communities and show that these networks have descriptive and predictive information about the evolution of microbial communities. We find that long-term outcomes, including community collapse, diversity, and cycling, have characteristic evolutionary dynamics that determine the entropy, or repeatability, of mutational paths. Although reliable prediction of evolutionary outcomes from environmental conditions is difficult, graph-theoretic properties of the mutational networks enable accurate prediction even from incomplete observations. In conclusion, we present a novel methodology for analyzing adaptive evolution and report that the dynamics of adaptation are a key variable for predictive success. The structure and dynamics of microbial communities reflect trade-offs in the ability to use different resources. Here, Josephides and Swain incorporate metabolic trade-offs into an eco-evolutionary model to predict networks of mutational paths and the evolutionary outcomes for microbial communities.
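The link between entropy and repeatability of mutational paths can be illustrated with Shannon entropy over the paths observed in replicate runs. The abstract does not give the exact definition used, so the function below is only a sketch under that assumption, with hypothetical path labels.

```python
from collections import Counter
from math import log2

def path_entropy(paths):
    """Shannon entropy (in bits) of the empirical distribution over observed paths."""
    counts = Counter(tuple(p) for p in paths)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Hypothetical replicate runs, each recorded as an ordered list of mutation labels.
repeatable = [["a", "b", "c"]] * 4
divergent = [["a", "b"], ["a", "c"], ["b", "a"], ["c", "a"]]

low = path_entropy(repeatable)   # every replicate follows the same path
high = path_entropy(divergent)   # four equally likely paths
```

Here `low` is zero bits and `high` is two bits: lower entropy means that replicates retrace the same path, that is, evolution is more repeatable.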
Project description:Mutational robustness of gene regulatory networks refers to their ability to generate constant biological output upon mutations that change network structure. Such networks contain regulatory interactions (transcription factor-target gene interactions) but often also protein-protein interactions between transcription factors. Using computational modeling, we study factors that influence robustness and we infer several network properties governing it. These include the type of mutation, i.e. whether a regulatory interaction or a protein-protein interaction is mutated, and in the case of mutation of a regulatory interaction, the sign of the interaction (activating vs. repressive). In addition, we analyze the effect of combinations of mutations and we compare networks containing monomeric with those containing dimeric transcription factors. Our results are consistent with available data on biological networks, for example based on evolutionary conservation of network features. As a novel and remarkable property, we predict that networks are more robust against mutations in monomer than in dimer transcription factors, a prediction for which analysis of conservation of DNA binding residues in monomeric vs. dimeric transcription factors provides indirect evidence.
Project description:Identifying driver mutations in cancer is notoriously difficult. To date, recurrence of a mutation in patients remains one of the most reliable markers of mutation driver status. However, some mutations are more likely to occur than others due to differences in background mutation rates arising from various forms of infidelity of DNA replication and repair machinery, and from endogenous and exogenous mutagens. We calculated nucleotide and codon mutability to study the contribution of background processes in shaping the observed mutational spectrum in cancer. We developed and tested probabilistic pan-cancer and cancer-specific models that adjust the number of mutation recurrences in patients by background mutability in order to find mutations which may be under selection in cancer. We showed that mutations with higher mutability values had higher observed recurrence frequency, especially in tumor suppressor genes. This trend was prominent for nonsense and silent mutations or mutations with neutral functional impact. In oncogenes, however, highly recurring mutations were characterized by relatively low mutability, resulting in an inverted U-shaped trend. Mutations not yet observed in any tumor had relatively low mutability values, indicating that background mutability might limit mutation occurrence. We compiled a dataset of missense mutations from 58 genes with experimentally validated functional and transforming impacts from various studies. We found that mutability of driver mutations was lower than that of passengers and consequently adjusting mutation recurrence frequency by mutability significantly improved ranking of mutations and driver mutation prediction. Even though no training on existing data was involved, our approach performed similarly to or better than state-of-the-art methods.
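The idea of adjusting recurrence by background mutability can be sketched with a simple tail probability. The abstract does not specify the probabilistic model, so the binomial form, the function name and the numbers below are assumptions chosen only to illustrate the reasoning.

```python
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of seeing at least k recurrences
    among n patients if the mutation arose only at background mutability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n_patients = 1000      # hypothetical cohort size
observed = 5           # same recurrence count for both mutations

# Hypothetical per-patient background mutabilities.
tail_high_mutability = binomial_tail(observed, n_patients, 1e-3)
tail_low_mutability = binomial_tail(observed, n_patients, 1e-5)
```

With equal recurrence, the mutation with lower background mutability has the smaller tail probability, so its recurrence is harder to explain by background processes alone. This is consistent with the reported finding that drivers tend to have lower mutability than passengers.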
Project description:The recent technological advances underlying the screening of large combinatorial libraries in high-throughput mutational scans deepen our understanding of adaptive protein evolution and boost its applications in protein design. Nevertheless, the large number of possible genotypes requires suitable computational methods for data analysis, the prediction of mutational effects, and the generation of optimized sequences. We describe a computational method that, trained on sequencing samples from multiple rounds of a screening experiment, provides a model of the genotype-fitness relationship. We tested the method on five large-scale mutational scans, yielding accurate predictions of the mutational effects on fitness. The inferred fitness landscape is robust to experimental and sampling noise and exhibits high generalization power in terms of broader sequence space exploration and higher fitness variant predictions. We investigate the role of epistasis and show that the inferred model provides structural information about the 3D contacts in the molecular fold.
Project description:Proteins of all kinds can self-assemble into highly ordered β-sheet aggregates known as amyloid fibrils, important both biologically and clinically. However, the specific molecular structure of a fibril can vary dramatically depending on sequence and environmental conditions, and mutations can drastically alter amyloid function and pathogenicity. Experimental structure determination has proven extremely difficult with only a handful of NMR-based models proposed, suggesting a need for computational methods. We present AmyloidMutants, a statistical mechanics approach for de novo prediction and analysis of wild-type and mutant amyloid structures. Based on the premise of protein mutational landscapes, AmyloidMutants energetically quantifies the effects of sequence mutation on fibril conformation and stability. Tested on non-mutant, full-length amyloid structures with known chemical shift data, AmyloidMutants offers roughly 2-fold improvement in prediction accuracy over existing tools. Moreover, AmyloidMutants is the only method to predict complete super-secondary structures, enabling accurate discrimination of topologically dissimilar amyloid conformations that correspond to the same sequence locations. Applied to mutant prediction, AmyloidMutants identifies a global conformational switch between Aβ and its highly toxic 'Iowa' mutant in agreement with a recent experimental model based on partial chemical shift data. Predictions on mutant, yeast-toxic strains of HET-s suggest similar alternate folds. When applied to HET-s and a HET-s mutant with core asparagines replaced by glutamines (both highly amyloidogenic, chemically similar residues abundant in many amyloids), AmyloidMutants surprisingly predicts a greatly reduced capacity of the glutamine mutant to form amyloid. We confirm this finding by conducting mutagenesis experiments. Our tool is publicly available on the web.
Project description:Cancer arises through the accumulation of somatic mutations over time. Understanding the sequence of mutation occurrence during cancer progression can assist early and accurate diagnosis and improve clinical decision-making. Here we employ long short-term memory (LSTM) networks, a class of recurrent neural network, to learn the evolution of a tumor through an ordered sequence of mutations. We demonstrate the capacity of LSTMs to learn complex dynamics of the mutational time series governing tumor progression, allowing accurate prediction of the mutational burden and the occurrence of mutations in the sequence. Using the probabilities learned by the LSTM, we simulate mutational data and show that the simulation results are statistically indistinguishable from the empirical data. We identify passenger mutations that are significantly associated with established cancer drivers in the sequence and demonstrate that the genes carrying these mutations are substantially enriched in interactions with the corresponding driver genes. Breaking the network into modules consisting of driver genes and their interactors, we show that these interactions are associated with poor patient prognosis, thus likely conferring growth advantage for tumor progression. Thus, application of LSTM provides for prediction of numerous additional conditional drivers and reveals hitherto unknown aspects of cancer evolution.