Project description:The noncoding genome plays an important role in de novo gene birth and the emergence of genetic novelty. Nevertheless, how the properties of noncoding sequences could promote the birth of novel genes and shape the structural diversity and evolution of proteins remains unclear. Here, we investigated the potential of the noncoding genome of yeast to produce novel protein bricks that can give rise to novel genes or be integrated in pre-existing proteins, thus participating in protein structure evolution and diversity. Combining different bioinformatics approaches, we showed that intergenic ORFs of yeast encompass the large structural diversity of canonical proteins with the majority encoding peptides predicted as foldable. Then, we investigated the early stages of de novo gene birth with Ribosome Profiling and systematic reconstruction of yeast de novo gene ancestral sequences. We highlighted sequence and structural factors determining de novo gene birth and protein evolution. Finally, we showed a strong correlation between the fold potential of de novo genes and their ancestral ORFs reflecting the relationship between the noncoding genome and the protein structure universe. Overall design: RiboSeq of two replicates of the same BY 4742 yeast strain, to identify for alternative CDS in unanotated regions, such CDS are then called IGORFs if RiboSeq signal is detected.
Project description:Novel protein-coding genes can arise either through re-organization of pre-existing genes or de novo. Processes involving re-organization of pre-existing genes, notably after gene duplication, have been extensively described. In contrast, de novo gene birth remains poorly understood, mainly because translation of sequences devoid of genes, or 'non-genic' sequences, is expected to produce insignificant polypeptides rather than proteins with specific biological functions. Here we formalize an evolutionary model according to which functional genes evolve de novo through transitory proto-genes generated by widespread translational activity in non-genic sequences. Testing this model at the genome scale in Saccharomyces cerevisiae, we detect translation of hundreds of short species-specific open reading frames (ORFs) located in non-genic sequences. These translation events seem to provide adaptive potential, as suggested by their differential regulation upon stress and by signatures of retention by natural selection. In line with our model, we establish that S. cerevisiae ORFs can be placed within an evolutionary continuum ranging from non-genic sequences to genes. We identify ~1,900 candidate proto-genes among S. cerevisiae ORFs and find that de novo gene birth from such a reservoir may be more prevalent than sporadic gene duplication. Our work illustrates that evolution exploits seemingly dispensable sequences to generate adaptive functional innovation.
Project description:Little is known about the rate of emergence of de novo genes, what their initial properties are, and how they spread in populations. We examined wild yeast populations (Saccharomyces paradoxus) to characterize the diversity and turnover of intergenic ORFs over short evolutionary timescales. We find that hundreds of intergenic ORFs show translation signatures similar to canonical genes, and we experimentally confirmed the translation of many of these ORFs in laboratory conditions using a reporter assay. Compared with canonical genes, intergenic ORFs have lower translation efficiency, which could imply a lack of optimization for translation or a mechanism to reduce their production cost. Translated intergenic ORFs also tend to have sequence properties that are generally close to those of random intergenic sequences. However, some of the very recent translated intergenic ORFs, which appeared <110 kya, already show gene-like characteristics, suggesting that the raw material for functional innovations could appear over short evolutionary timescales.
Project description:Deep sequencing of mammalian DNA methylomes has uncovered a previously unpredicted number of discrete hypomethylated regions in intergenic space (iHMRs). Here, we combined whole-genome bisulfite sequencing data with extensive gene expression and chromatin-state data to define functional classes of iHMRs, and to reconstruct the dynamics of their establishment in a developmental setting. Comparing HMR profiles in embryonic stem and primary blood cells, we show that iHMRs mark an exclusive subset of active DNase hypersensitive sites (DHS), and that both developmentally constitutive and cell-type-specific iHMRs display chromatin states typical of distinct regulatory elements. We also observe that iHMR changes are more predictive of nearby gene activity than the promoter HMR itself, and that expression of noncoding RNAs within the iHMR accompanies full activation and complete demethylation of mature B cell enhancers. Conserved sequence features corresponding to iHMR transcript start sites, including a discernible TATA motif, suggest a conserved, functional role for transcription in these regions. Similarly, we explored both primate-specific and human population variation at iHMRs, finding that while enhancer iHMRs are more variable in sequence and methylation status than any other functional class, conservation of the TATA box is highly predictive of iHMR maintenance, reflecting the impact of sequence plasticity and transcriptional signals on iHMR establishment. Overall, our analysis allowed us to construct a three-step timeline in which (1) intergenic DHS are pre-established in the stem cell, (2) partial demethylation of blood-specific intergenic DHSs occurs in blood progenitors, and (3) complete iHMR formation and transcription coincide with enhancer activation in lymphoid-specified cells.
Project description:The regulatory information for a eukaryotic gene is encoded in cis-regulatory modules. The binding sites for a set of interacting transcription factors have the tendency to colocalize to the same modules. Current de novo motif discovery methods do not take advantage of this knowledge. We propose a hierarchical mixture approach to model the cis-regulatory module structure. Based on the model, a new de novo motif-module discovery algorithm, CisModule, is developed for the Bayesian inference of module locations and within-module motif sites. Dynamic programming-like recursions are developed to reduce the computational complexity from exponential to linear in sequence length. By using both simulated and real data sets, we demonstrate that CisModule is not only accurate in predicting modules but also more sensitive in detecting motif patterns and binding sites than standard motif discovery methods are.
Project description:Long intergenic non-coding RNAs (lincRNAs) are emerging as important regulatory molecules involved in diseases including heart failure. However, little is known about how the lincRNAs work together with protein-coding genes (PCGs) contributing to the pathogenesis of heart failure. In this study, we constructed a comprehensive transcriptome profile of lincRNAs, PCGs and miRNAs using RNA-seq and miRNA-seq data of 16 heart failure patients (HFs) and 8 non-failing individuals (NFs). Through integrating lincRNA and PCG expression profiles, we identified HF-associated lincRNA modules. We identified a heart-specific lincRNA module which was significantly enriched for differentially expressed lincRNAs and PCGs. This module was associated with heart failure rather than with other clinical traits such as sex, age, smoking and diabetes mellitus. Moreover, the module was significantly correlated with certain indicators of left ventricular function like ejection fraction and left ventricular end-diastolic diameter, implying the potential of its components as crucial biomarkers. Apart from enhancer-like function, lincRNAs in this module could act as competing endogenous RNAs (ceRNAs) to regulate genes which were associated with left-ventricular systolic function. Our work provided deep insights into the critical roles of lincRNAs in the pathology of heart failure and suggested that they could be valuable biomarkers and therapeutic targets.
Project description:The adaptive immune system of vertebrates has an extraordinary potential to sense and neutralize foreign antigens entering the body. De novo evolution of genes implies that the genome itself expresses novel antigens from intergenic sequences which could cause a problem with this immune system. Peptides from these novel proteins could be presented by the major histocompatibility complex (MHC) receptors to the cell surface and would be recognized as foreign. The respective cells would then be attacked and destroyed, or would cause inflammatory responses. Hence, de novo expressed peptides have to be introduced to the immune system as being self-peptides to avoid such autoimmune reactions. The regulation of the distinction between self and non-self starts during embryonic development, but continues late into adulthood. It is mostly mediated by specialized cells in the thymus, but can also be conveyed in peripheral tissues, such as the lymph nodes and the spleen. The self-antigens need to be exposed to the reactive T-cells, which requires the expression of the genes in the respective tissues. Since the initial activation of a promotor for new intergenic transcription of a de novo gene could occur in any tissue, we should expect that the evolutionary establishment of a de novo gene in animals with an adaptive immune system should also involve expression in at least one of the tissues that confer self-recognition.We have studied this question by analyzing the transcriptomes of multiple tissues from young mice in three closely related natural populations of the house mouse (M. m. domesticus). We find that new intergenic transcription occurs indeed mostly in only a single tissue. When a second tissue becomes involved, thymus and spleen are significantly overrepresented.We conclude that the inclusion of de novo transcripts in the processes for the induction of self-tolerance is indeed an important step in the evolution of functional de novo genes in vertebrates.
Project description:Preterm birth (PTB) affects ~12% of pregnancies in the US. Despite its high mortality and morbidity, the molecular etiology underlying PTB has been unclear. Numerous studies have been devoted to identifying genetic factors in maternal and fetal genomes, but so far few genomic loci have been associated with PTB. By analyzing whole-genome sequencing data from 816 trio families, for the first time, we observed the role of fetal de novo mutations in PTB. We observed a significant increase in de novo mutation burden in PTB fetal genomes. Our genomic analyses further revealed that affected genes by PTB de novo mutations were dosage sensitive, intolerant to genomic deletions, and their mouse orthologs were likely developmentally essential. These genes were significantly involved in early fetal brain development, which was further supported by our analysis of copy number variants identified from an independent PTB cohort. Our study indicates a new mechanism in PTB occurrence independently contributed from fetal genomes, and thus opens a new avenue for future PTB research.
Project description:De novo gene origination has been recently established as an important mechanism for the formation of new genes. In organisms with a large genome, intergenic and intronic regions provide plenty of raw material for new transcriptional events to occur, but little is know about how de novo transcripts originate in more densely-packed genomes. Here, we identify 213 de novo originated transcripts in Saccharomyces cerevisiae using deep transcriptomics and genomic synteny information from multiple yeast species grown in two different conditions. We find that about half of the de novo transcripts are expressed from regions which already harbor other genes in the opposite orientation; these transcripts show similar expression changes in response to stress as their overlapping counterparts, and some appear to translate small proteins. Thus, a large fraction of de novo genes in yeast are likely to co-evolve with already existing genes.
Project description:Regulatory networks control the spatiotemporal gene expression patterns that give rise to and define the individual cell types of multicellular organisms. In eumetazoa, distal regulatory elements called enhancers play a key role in determining the structure of such networks, particularly the wiring diagram of "who regulates whom." Mutations that affect enhancer activity can therefore rewire regulatory networks, potentially causing adaptive changes in gene expression. Here, we use whole-tissue and single-cell transcriptomic and chromatin accessibility data from mouse to show that enhancers play an additional role in the evolution of regulatory networks: They facilitate network growth by creating transcriptionally active regions of open chromatin that are conducive to de novo gene evolution. Specifically, our comparative transcriptomic analysis with three other mammalian species shows that young, mouse-specific intergenic open reading frames are preferentially located near enhancers, whereas older open reading frames are not. Mouse-specific intergenic open reading frames that are proximal to enhancers are more highly and stably transcribed than those that are not proximal to enhancers or promoters, and they are transcribed in a limited diversity of cellular contexts. Furthermore, we report several instances of mouse-specific intergenic open reading frames proximal to promoters showing evidence of being repurposed enhancers. We also show that open reading frames gradually acquire interactions with enhancers over macroevolutionary timescales, helping integrate genes-those that have arisen de novo or by other means-into existing regulatory networks. Taken together, our results highlight a dual role of enhancers in expanding and rewiring gene regulatory networks.