Predicting gene essentiality in Caenorhabditis elegans by feature engineering and machine-learning.
ABSTRACT: Defining genes that are essential for life has major implications for understanding critical biological processes and mechanisms. Although essential genes have been identified and characterised experimentally using functional genomic tools, it is challenging to predict with confidence such genes from molecular and phenomic data sets using computational methods. Using extensive data sets available for the model organism Caenorhabditis elegans, we constructed here a machine-learning (ML)-based workflow for the prediction of essential genes on a genome-wide scale. We identified strong predictors for such genes and showed that trained ML models consistently achieve highly accurate classifications. Complementary analyses revealed an association between essential genes and chromosomal location. Our findings reveal that essential genes in C. elegans tend to be located in or near the centre of autosomal chromosomes; are positively correlated with low single nucleotide polymorphism (SNP) densities and epigenetic markers in promoter regions; are involved in protein and nucleotide processing; are transcribed in most cells; are enriched in reproductive tissues or are targets for small RNAs bound to the Argonaute CSR-1. Based on these results, we hypothesise an interplay between epigenetic markers and small RNA pathways in the germline, with transcription-based memory; this hypothesis warrants testing. From a technical perspective, further work is needed to evaluate whether the present ML-based approach will be applicable to other metazoans (including Drosophila melanogaster) for which comprehensive data sets (i.e. genomic, transcriptomic, proteomic, variomic, epigenetic and phenomic) are available.
Project description:The genetic dependencies of human cancers widely vary. Here, we catalog this heterogeneity and use it to identify functional gene interactions and genotype-dependent liabilities in cancer. By using genome-wide CRISPR-based screens, we generate a gene essentiality dataset across 14 human acute myeloid leukemia (AML) cell lines. Sets of genes with correlated patterns of essentiality across the lines reveal new gene relationships, the essential substrates of enzymes, and the molecular functions of uncharacterized proteins. Comparisons of differentially essential genes between Ras-dependent and -independent lines uncover synthetic lethal partners of oncogenic Ras. Screens in both human AML and engineered mouse pro-B cells converge on a surprisingly small number of genes in the Ras processing and MAPK pathways and pinpoint PREX1 as an AML-specific activator of MAPK signaling. Our findings suggest general strategies for defining mammalian gene networks and synthetic lethal interactions by exploiting the natural genetic and epigenetic diversity of human cancer cells.
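The idea of grouping genes by correlated essentiality across cell lines can be illustrated with a minimal sketch. It assumes each gene has one essentiality score per cell line and uses Pearson correlation with an arbitrary cutoff; the gene names, scores, and the 0.9 threshold are hypothetical and are not taken from the study.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length score profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / sqrt(var_x * var_y)


def correlated_pairs(scores, threshold=0.9):
    """Return (gene1, gene2, r) for every gene pair with r >= threshold."""
    pairs = []
    for g1, g2 in combinations(sorted(scores), 2):
        r = pearson(scores[g1], scores[g2])
        if r >= threshold:
            pairs.append((g1, g2, r))
    return pairs


# Hypothetical scores across five cell lines (more negative = more essential).
toy_scores = {
    "GENE_A": [-2.0, -0.1, -1.8, -0.2, -1.9],
    "GENE_B": [-1.9, -0.2, -1.7, -0.1, -2.1],
    "GENE_C": [-0.1, -2.0, -0.3, -1.8, -0.2],
}

for g1, g2, r in correlated_pairs(toy_scores):
    print(g1, g2, round(r, 3))
```

In this toy input only GENE_A and GENE_B share a pattern of essentiality, so they are the only pair reported as a candidate functional relationship.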
Project description:BACKGROUND: New drug targets are urgently needed for parasites of socio-economic importance. Genes that are essential for parasite survival are highly desirable targets, but information on these genes is lacking, as gene knockouts or knockdowns are difficult to perform in many species of parasites. We examined the applicability of large-scale essentiality information from four model eukaryotes, Caenorhabditis elegans, Drosophila melanogaster, Mus musculus and Saccharomyces cerevisiae, to discover essential genes in each of their genomes. Parasite genes that lack orthologues in their host are desirable as selective targets, so we also examined prediction of essential genes within this subset. RESULTS: Cross-species analyses showed that the evolutionary conservation of genes and the presence of essential orthologues are each strong predictors of essentiality in eukaryotes. Absence of paralogues was also found to be a general predictor of increased relative essentiality. By combining several orthology and essentiality criteria one can select gene sets with up to a five-fold enrichment in essential genes compared with a random selection. We show how quantitative application of such criteria can be used to predict a ranked list of potential drug targets from Ancylostoma caninum and Haemonchus contortus--two blood-feeding strongylid nematodes, for which there are presently limited sequence data but no functional genomic tools. CONCLUSIONS: The present study demonstrates the utility of using orthology information from multiple, diverse eukaryotes to predict essential genes. The data also emphasize the challenge of identifying essential genes among those in a parasite that are absent from its host.
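The "fold enrichment" quoted in the abstract above can be read as the fraction of essential genes in a selected set divided by the fraction in the whole genome. The sketch below shows that calculation under this assumption; the function name and the toy gene identifiers are hypothetical, and the numbers are chosen only to make the arithmetic easy to follow.

```python
def fold_enrichment(selected, essential, genome):
    """Essential fraction in `selected` divided by the essential fraction in `genome`."""
    genome = set(genome)
    selected = set(selected) & genome
    essential = set(essential) & genome
    if not selected or not essential:
        raise ValueError("need a non-empty selection and at least one essential gene")
    selected_rate = len(selected & essential) / len(selected)
    background_rate = len(essential) / len(genome)
    return selected_rate / background_rate


# Hypothetical toy genome: 100 genes, of which the first 10 are essential.
genome = [f"g{i}" for i in range(100)]
essential = genome[:10]

# Hypothetical selection produced by some orthology criteria:
# 10 genes, 5 of them essential (50% versus a 10% background).
selected = genome[:5] + genome[50:55]

print(fold_enrichment(selected, essential, genome))
```

A random selection would give a value near 1, so values well above 1 indicate that the selection criteria concentrate essential genes.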
Project description:Essential gene prediction helps to find the minimal set of genes indispensable for the survival of an organism. Machine learning (ML) algorithms have been useful for the prediction of gene essentiality. However, currently available ML pipelines perform poorly for organisms with limited experimental data. The objective is the development of a new ML pipeline to help in the annotation of essential genes of less explored disease-causing organisms for which minimal experimental data are available. The proposed strategy combines an unsupervised feature selection technique, dimension reduction using the Kamada-Kawai algorithm, and a semi-supervised ML algorithm employing a Laplacian Support Vector Machine (LapSVM) for prediction of essential and non-essential genes from genome-scale metabolic networks using a very limited labeled dataset. A novel scoring technique, the Semi-Supervised Model Selection Score, equivalent to area under the ROC curve (auROC), has been proposed for the selection of the best model when calculation of supervised performance metrics is difficult due to lack of data. The unsupervised feature selection followed by dimension reduction revealed a distinct circular pattern in the clustering of essential and non-essential genes. LapSVM then created a curve that dissected this circle for the classification and prediction of essential genes with high accuracy (auROC > 0.85) even with 1% labeled data for model training. After successful validation of this ML pipeline on both Eukaryotes and Prokaryotes, which showed high accuracy even when the labeled dataset is very limited, this strategy is used for the prediction of essential genes of organisms with inadequate experimentally known data, such as Leishmania sp. Using a graph-based semi-supervised machine learning scheme, a novel integrative approach has been proposed for essential gene prediction that shows universality in application to both Prokaryotes and Eukaryotes with limited labeled data. 
The essential genes predicted using the pipeline provide an important lead for the prediction of gene essentiality and identification of novel therapeutic targets for antibiotic and vaccine development against disease-causing parasites.
Project description:In many prokaryotes but only a limited number of eukaryotic species, the combination of transposon mutagenesis and high-throughput sequencing has greatly accelerated the identification of essential genes. Here we successfully applied this technique to the methylotrophic yeast Pichia pastoris and classified its conditionally essential/non-essential gene sets. First, we showed that two DNA transposons, TcBuster and Sleeping Beauty, had high transposition activities in P. pastoris. By merging their insertion libraries and performing Tn-seq, we identified a total of 202,858 unique insertions under glucose-supported growth conditions. We then developed a machine learning method to classify the 5,040 annotated genes into putatively essential, putatively non-essential, ambig1 and ambig2 groups, and validated the accuracy of this classification model. In addition, Tn-seq was performed under methanol-supported growth conditions and methanol-specific essential genes were identified. The comparison of conditionally essential genes between glucose- and methanol-supported growth conditions helped to reveal potential novel targets involved in methanol metabolism and signaling. Our findings suggest that transposon mutagenesis and Tn-seq could be applied in the methylotrophic yeast Pichia pastoris to classify conditionally essential/non-essential gene sets. Our work also shows that determining gene essentiality under different culture conditions could help to screen for novel functional components specifically involved in methanol metabolism.
Project description:Genes are characterized as essential if their knockout is associated with a lethal phenotype, and these "essential genes" play a central role in biological function. In addition, some genes are only essential when deleted in pairs, a phenomenon known as synthetic lethality. Here we consider genes displaying synthetic lethality as "essential pairs" of genes, and analyze the properties of yeast essential genes and synthetic lethal pairs together. As gene duplication initially produces an identical pair or sets of genes, it is often invoked as an explanation for synthetic lethality. However, we find that duplication explains only a minority of cases of synthetic lethality. Similarly, disruption of metabolic pathways leads to relatively few examples of synthetic lethality. By contrast, the vast majority of synthetic lethal gene pairs code for proteins with related functions that share interaction partners. We also find that essential genes and synthetic lethal pairs cluster in the protein-protein interaction network. These results suggest that synthetic lethality is strongly dependent on the formation of protein-protein interactions. Compensation by duplicates usually occurs not because the genes involved are recent duplicates, but more commonly because of functional similarity that permits preservation of essential protein complexes. This unified view, combining genes that are individually essential with those that form essential pairs, suggests that essentiality is a feature of physical protein-protein interactions, rather than being inherent in gene and protein products themselves.
Project description:Identification of essential genes is not only useful for our understanding of the minimal gene set required for cellular life but also aids the identification of novel drug targets in pathogens. In this work, we present a simple and effective gene essentiality prediction method using information-theoretic features that are derived exclusively from the gene sequences. We developed a Random Forest classifier and performed an extensive model performance evaluation among and within 15 selected bacteria. In intra-organism predictions, where training and testing sets are taken from the same organism, AUC (Area Under the Curve) scores ranging from 0.73 to 0.90, 0.84 on average, were obtained. Cross-organism predictions using 5-fold cross-validation, pairwise, leave-one-species-out, leave-one-taxon-out, and cross-taxon yielded average AUC scores of 0.88, 0.75, 0.80, 0.82, and 0.78, respectively. To further show the applicability of our method in other domains of life, we predicted the essential genes of the yeast Schizosaccharomyces pombe and obtained a similar accuracy (AUC 0.84). The proposed method enables a simple and reliable identification of essential genes without searching in databases for orthologs and demanding further experimental data such as network topology and gene-expression.
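The abstract above does not list which information-theoretic features were used. As one plausible illustration, the sketch below computes the Shannon entropy of k-mer frequencies in a gene sequence; the choice of feature, the function names, and the k values are assumptions made for this example only.

```python
from collections import Counter
from math import log2


def kmer_entropy(seq, k=1):
    """Shannon entropy (in bits) of the overlapping k-mer distribution of `seq`."""
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        raise ValueError("sequence is shorter than k")
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def sequence_features(seq, ks=(1, 2, 3)):
    """Feature vector: one entropy value per k-mer size (an assumed feature set)."""
    return [kmer_entropy(seq, k) for k in ks]


# Hypothetical toy sequences: a uniform mix of bases versus a homopolymer.
print(sequence_features("ACGTACGTACGT"))
print(sequence_features("AAAAAAAAAAAA"))
```

Feature vectors of this kind, computed per gene, could then be passed to any standard classifier, such as a Random Forest, with known essential and non-essential genes as training labels.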
Project description:Knowing the full set of essential genes for a given organism provides important information about ways to promote, and to limit, its growth and survival. For many non-model organisms, the lack of a stable haploid state and low transformation efficiencies impede the use of conventional approaches to generate a genome-wide comprehensive set of mutant strains and the identification of the genes essential for growth. Here we report on the isolation and utilization of a highly stable haploid derivative of the human pathogenic fungus Candida albicans, together with a modified heterologous transposon and machine learning (ML) analysis method, to predict the degree to which all of the open reading frames are required for growth under standard laboratory conditions. We identified 1,610 C. albicans essential genes, including 1,195 with high "essentiality confidence" scores, thereby increasing the number of essential genes (currently 66 in the Candida Genome Database) by >20-fold and providing an unbiased approach to determine the degree of confidence in the determination of essentiality. Among the genes essential in C. albicans were 602 genes also essential in the model budding and fission yeasts analyzed by both deletion and transposon mutagenesis. We also identified essential genes conserved among the four major human pathogens C. albicans, Aspergillus fumigatus, Cryptococcus neoformans, and Histoplasma capsulatum and highlight those that lack homologs in humans and that thus could serve as potential targets for the design of antifungal therapies. IMPORTANCE Comprehensive understanding of an organism requires that we understand the contributions of most, if not all, of its genes. Classical genetic approaches to this issue have involved systematic deletion of each gene in the genome, with comprehensive sets of mutants available only for very-well-studied model organisms. 
We took a different approach, harnessing the power of in vivo transposition coupled with deep sequencing to identify >500,000 different mutations, one per cell, in the prevalent human fungal pathogen Candida albicans and to map their positions across the genome. The transposition approach is efficient and less labor-intensive than classic approaches. Here, we describe the production and analysis (aided by machine learning) of a large collection of mutants and the comprehensive identification of 1,610 C. albicans genes that are essential for growth under standard laboratory conditions. Among these C. albicans essential genes, we identify those that are also essential in two distantly related model yeasts as well as those that are conserved in all four major human fungal pathogens and that are not conserved in the human genome. This list of genes with functions important for the survival of the pathogen provides a good starting point for the development of new antifungal drugs, which are greatly needed because of the emergence of fungal pathogens with elevated resistance and/or tolerance of the currently limited set of available antifungal drugs.
Project description:Phenomic profiles are high-dimensional sets of readouts that can comprehensively capture the biological impact of chemical and genetic perturbations in cellular assay systems. Phenomic profiling of compound libraries can be used for compound target identification or mechanism of action (MoA) prediction and other applications in drug discovery. To devise an economical set of phenomic profiling assays, we assembled a library of 1,008 approved drugs and well-characterized tool compounds manually annotated to 218 unique MoAs, and we profiled each compound at four concentrations in live-cell, high-content imaging screens against a panel of 15 reporter cell lines, which expressed a diverse set of fluorescent organelle and pathway markers in three distinct cell lineages. For 41 of 83 testable MoAs, phenomic profiles accurately ranked the reference compounds (AUC-ROC ≥ 0.9). MoAs could be better resolved by screening compounds at multiple concentrations than by including replicates at a single concentration. Screening additional cell lineages and fluorescent markers increased the number of distinguishable MoAs but this effect quickly plateaued. There remains a substantial number of MoAs that were hard to distinguish from others under the current study's conditions. We discuss ways to close this gap, which will inform the design of future phenomic profiling efforts.
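The AUC-ROC figure quoted above measures how well a score ranks true members of a class ahead of non-members. A minimal sketch of the standard rank-based definition follows, assuming each compound has a similarity score to a given MoA and a binary annotation; the scores and labels here are hypothetical.

```python
def auc_roc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    positives = [s for s, is_pos in zip(scores, labels) if is_pos]
    negatives = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical similarity scores of four compounds to one MoA profile;
# label 1 means the compound is annotated to that MoA.
print(auc_roc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # perfect ranking
print(auc_roc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))  # one positive ranked low
```

A value of 1.0 means every annotated compound outranks every other compound, and 0.5 corresponds to a random ordering.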
Project description:Identifying genes required by pathogens during infection is critical for antimicrobial development. Here, we use a Monte Carlo simulation-based method to analyse high-throughput transposon sequencing data to determine the role of infection site and co-infecting microorganisms on the in vivo 'essential' genome of Staphylococcus aureus. We discovered that co-infection of murine surgical wounds with Pseudomonas aeruginosa results in conversion of ~25% of the in vivo S. aureus mono-culture essential genes to non-essential. Furthermore, 182 S. aureus genes are uniquely essential during co-infection. These 'community dependent essential' (CoDE) genes illustrate the importance of studying pathogen gene essentiality in polymicrobial communities.
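The abstract above does not describe the Monte Carlo procedure in detail, so the following is only a simplified illustration of the general idea, not the authors' method. It assumes insertions land uniformly at random across all insertion sites, simulates how many would fall in a gene by chance, and reports an empirical p-value for observing as few insertions as were seen. All parameter names and values are hypothetical.

```python
import random


def depletion_p_value(observed, gene_sites, total_sites, total_insertions,
                      n_sim=1000, seed=0):
    """Empirical p-value that a gene receives <= `observed` insertions by chance.

    Assumes (for illustration only) that every insertion hits a uniformly
    random site, so each lands in the gene with probability
    gene_sites / total_sites.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    p_hit = gene_sites / total_sites
    at_or_below = 0
    for _ in range(n_sim):
        simulated = sum(1 for _ in range(total_insertions) if rng.random() < p_hit)
        if simulated <= observed:
            at_or_below += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_or_below + 1) / (n_sim + 1)


# Hypothetical gene covering 50 of 1,000 sites in a library of 500 insertions
# (about 25 insertions expected by chance).
print(depletion_p_value(observed=0, gene_sites=50, total_sites=1000, total_insertions=500))
print(depletion_p_value(observed=25, gene_sites=50, total_sites=1000, total_insertions=500))
```

A gene with no insertions where about 25 are expected gets a very small p-value and would be a candidate essential gene, while a gene with the expected number does not. Real Tn-seq analyses also account for read depth, site bias, and multiple testing, which this sketch omits.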
Project description:Gene essentiality changes are crucial for organismal evolution. However, it is unclear how essentiality of orthologs varies across species. We investigated the underlying mechanism of gene essentiality changes between yeast and mouse based on the framework of network evolution and comparative genomic analysis. We found that yeast nonessential genes become essential in mouse when their network connections rapidly increase through engagement in protein complexes. The increased interactions allowed the previously nonessential genes to become members of vital pathways. By accounting for changes in gene essentiality, we firmly reestablished the centrality-lethality rule, which proposed the relationship of essential genes and network hubs. Furthermore, we discovered that the number of connections associated with essential and non-essential genes depends on whether they were essential in ancestral species. Our study describes for the first time how network evolution occurs to change gene essentiality.
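The centrality-lethality rule mentioned above relates essentiality to the number of network connections (degree) of a gene's product. A minimal sketch of that comparison follows, assuming an undirected interaction network given as an edge list; the network and the essential/non-essential labels are hypothetical toy data.

```python
from collections import defaultdict


def node_degrees(edges):
    """Number of distinct interaction partners per node in an undirected network."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-interactions
            neighbours[a].add(b)
            neighbours[b].add(a)
    return {node: len(partners) for node, partners in neighbours.items()}


def mean_degree(degree, genes):
    """Average degree over `genes`; genes absent from the network count as 0."""
    genes = list(genes)
    if not genes:
        raise ValueError("gene list is empty")
    return sum(degree.get(g, 0) for g in genes) / len(genes)


# Hypothetical toy network in which one hub gene is labelled essential.
toy_edges = [("HUB", "A"), ("HUB", "B"), ("HUB", "C"), ("A", "B"), ("C", "D")]
toy_essential = {"HUB"}
toy_nonessential = {"A", "B", "C", "D"}

degree = node_degrees(toy_edges)
print(mean_degree(degree, toy_essential), mean_degree(degree, toy_nonessential))
```

Under the rule, essential genes are expected to show a higher mean degree than non-essential ones, as the hub does in this toy network.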