ABSTRACT: BACKGROUND: The number of transcription factor binding sites (TFBS) in an organism's genome positively correlates with the complexity of the regulatory network of the organism. However, the manner by which TFBS arise and accumulate in genomes and the effects of regulatory network complexity on the organism's fitness are far from understood. The availability of TFBS data from many organisms provides an opportunity to explore these issues, particularly from an evolutionary perspective. RESULTS: We analyzed TFBS data from five model organisms - E. coli K12, S. cerevisiae, C. elegans, D. melanogaster, A. thaliana - and found a positive correlation between the amount of non-coding DNA (ncDNA) in the organism's genome and regulatory complexity. Based on this finding, we hypothesize that the amount of ncDNA, combined with the population size, can explain the patterns of regulatory complexity across organisms. To test this hypothesis, we devised a genome-based regulatory pathway model and subjected it to the forces of evolution through population genetic simulations. The results support our hypothesis, showing that neutral evolutionary forces alone can explain TFBS patterns, and that selection on the regulatory network function does not alter this finding. CONCLUSIONS: The cis-regulome is not a clean functional network crafted by adaptive forces alone, but instead a data source filled with the noise of non-adaptive forces. From a regulatory perspective, this evolutionary noise manifests as complexity on both the binding site and pathway level, which has significant implications for many areas of microbiology, genetics, and synthetic biology.
Project description:DNase I hypersensitive sites (DHSs) define the accessible chromatin landscape and have revolutionised the discovery of distinct cis-regulatory elements in diverse organisms. Here, we report the first comprehensive map of human transcription factor binding site (TFBS)-clustered regions using Gaussian kernel density estimation based on genome-wide mapping of the TFBSs in 133 human cell and tissue types. Approximately 1.6 million distinct TFBS-clustered regions, collectively spanning 27.7% of the human genome, were discovered. The TFBS complexity assigned to each TFBS-clustered region was highly correlated with genomic location, cell selectivity, evolutionary conservation, sequence features, and functional roles. An integrative analysis of these regions using ENCODE data revealed transcription factor occupancy, transcriptional activity, histone modification, DNA methylation, and chromatin structures that varied based on TFBS complexity. Furthermore, we found that we could recreate lineage-branching relationships by simple clustering of the TFBS-clustered regions from terminally differentiated cells. Based on these findings, a model of transcriptional regulation determined by TFBS complexity is proposed.
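The abstract above names Gaussian kernel density estimation as the way TFBS-clustered regions were delineated, but gives no bandwidth, threshold, or implementation details. The following is only a minimal one-dimensional sketch of the general idea, not the authors' pipeline: the positions, the bandwidth, the density threshold, and the function names `gaussian_kde_1d` and `clustered_regions` are all hypothetical choices made for illustration.

```python
import math


def gaussian_kde_1d(positions, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    n = len(positions)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - p) / bandwidth) ** 2) for p in positions)
        for g in grid
    ]


def clustered_regions(grid, density, threshold):
    """Return (start, end) grid intervals where the density stays >= threshold."""
    regions = []
    start = None
    prev = None
    for g, d in zip(grid, density):
        if d >= threshold:
            if start is None:
                start = g
            prev = g
        elif start is not None:
            regions.append((start, prev))
            start = None
    if start is not None:
        regions.append((start, prev))
    return regions


# Hypothetical TFBS midpoints (bp) on a toy 1 kb locus, forming two clusters.
tfbs_midpoints = [100, 110, 120, 125, 700, 710, 715]
grid = list(range(0, 1001, 10))

# Bandwidth and threshold are illustrative assumptions, not values from the study.
density = gaussian_kde_1d(tfbs_midpoints, grid, bandwidth=20.0)
regions = clustered_regions(grid, density, threshold=0.002)

for start, end in regions:
    print(f"clustered region: {start}-{end} bp")
```

In this sketch a region is simply a maximal run of grid points whose estimated density clears the threshold; a larger bandwidth would merge nearby sites into broader regions, while a smaller one would split them.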
Project description:It is now clear that animal genomes are predominantly non-protein-coding, and that these sequences encode a wide array of RNA transcripts and other regulatory elements that are fundamental to the development of complex life. We have previously argued that the proportion of an animal genome that is non-protein-coding DNA (ncDNA) correlates well with its apparent biological complexity. Here we extend that work and, using data from a total of 1,627 prokaryotic and 153 eukaryotic complete and annotated genomes, show that the proportion of ncDNA per haploid genome is significantly positively correlated with a previously published proxy of biological complexity, the number of distinct cell types. This is in contrast to the amount of the genome that encodes proteins, which we show is essentially unchanged across Metazoa. Furthermore, using a total of 179 RNA-seq data sets from nematode (47), fruit fly (72), zebrafish (20) and human (42), we show, consistent with other recent reports, that the vast majority of ncDNA in animals is transcribed. This includes more than 60 human loci previously considered "gene deserts," many of which are expressed tissue-specifically and associated with previously reported GWAS SNPs. These results suggest that ncDNA, and the ncRNAs encoded within it, may be intimately involved in the evolution, maintenance and development of complex life.
Project description:Previous research demonstrated the use of evolutionary computation for the discovery of transcription factor binding sites (TFBS) in promoter regions upstream of coexpressed genes. However, it remained unclear whether or not composite TFBS elements, commonly found in higher organisms where two or more TFBSs form functional complexes, could also be identified by using this approach. Here, we present an important refinement of our previous algorithm and test the identification of composite elements using NFAT/AP-1 as an example. We demonstrate that by using appropriate existing parameters such as window size, novel scoring methods such as central bonusing, and methods of self-adaptation to automatically adjust the variation operators during the evolutionary search, TFBSs of different sizes and complexity can be identified as top solutions. Some of these solutions have known experimental relationships with NFAT/AP-1. We also indicate that even after properly tuning the model parameters, the choice of the appropriate window size has a significant effect on algorithm performance. We believe that this improved algorithm will greatly augment TFBS discovery.
Project description:BACKGROUND: Retroelements (REs) are transposable elements occupying ~40% of the human genome that can regulate genes by providing transcription factor binding sites (TFBS). The RE-linked TFBS profile can serve as a marker of gene transcriptional regulation evolution. This approach allows for interrogating the regulatory evolution of organisms with RE-rich genomes. We aimed to characterize the evolution of transcriptional regulation for human genes and molecular pathways using RE-linked TFBS accumulation as a metric. Methods: We characterized human genes and molecular pathways either enriched or deficient in RE-linked TFBS regulation. We used the ENCODE database with mapped TFBS for 563 transcription factors in 13 human cell lines. For 24,389 genes and 3124 molecular pathways, we calculated the score of RE-linked TFBS regulation reflecting the regulatory evolution rate at the level of individual genes and molecular pathways. Results: The major groups enriched by RE regulation deal with gene regulation by microRNAs, olfaction, color vision, fertilization, cellular immune response, and amino acids and fatty acids metabolism and detoxication. The deficient groups were involved in translation, RNA transcription and processing, chromatin organization, and molecular signaling. Conclusion: We identified genes and molecular processes that have characteristics of especially high or low evolutionary rates at the level of RE-linked TFBS regulation in the human lineage.
Project description:Maize is a major crop and a model plant for studying C4 photosynthesis and leaf development. However, a genomewide regulatory network of leaf development is not yet available. This knowledge is useful for developing C3 crops to perform C4 photosynthesis for enhanced yields. Here, using 22 transcriptomes of developing maize leaves from dry seeds to 192 h post imbibition, we studied gene up- and down-regulation and functional transition during leaf development and inferred sets of strongly coexpressed genes. More significantly, we developed a method to predict transcription factor binding sites (TFBSs) and their cognate transcription factors (TFs) using genomic sequence and transcriptomic data. The method requires not only evolutionary conservation of candidate TFBSs and sets of strongly coexpressed genes but also that the genes in a gene set share the same Gene Ontology term so that they are involved in the same biological function. In addition, we developed another method to predict maize TF-TFBS pairs using known TF-TFBS pairs in Arabidopsis or rice. From these efforts, we predicted 1,340 novel TFBSs and 253 new TF-TFBS pairs in the maize genome, far exceeding the 30 TF-TFBS pairs currently known in maize. In most cases studied by both methods, the two methods gave similar predictions. In vitro tests of 12 predicted TF-TFBS interactions showed that our methods perform well. Our study has significantly expanded our knowledge on the regulatory network involved in maize leaf development.
Project description:The sequencing of the human genome heralded the new age of 'genetic medicine' and raised the hope of precision medicine facilitating prolonged and healthy lives. Recent studies have dampened this expectation, as the relationships among mutations (termed 'risk factors'), biological processes, and diseases have emerged to be more complex than initially anticipated. In this review, we elaborate upon the nature of the relationship between genotype and phenotype, between chance-laden molecular complexity and the evolution of complex traits, and the relevance of this relationship to precision medicine. Molecular contingency, i.e., chance-driven molecular changes, in conjunction with the blind nature of evolutionary processes, creates genetic redundancy or multiple molecular pathways to the same phenotype; as time goes on, these pathways become more complex, interconnected, and hierarchically integrated. Based on the proposition that gene-gene interactions provide the major source of variation for evolutionary change, we present a theory of molecular complexity and posit that it consists of two parts, necessary and unnecessary complexity, both of which are inseparable and increase over time. We argue that, unlike necessary complexity, comprising all aspects of the organism's genetic program, unnecessary complexity is evolutionary baggage: the result of molecular constraints, historical circumstances, and the blind nature of evolutionary forces. In the short term, unnecessary complexity can give rise to similar risk factors with different genetic backgrounds; in the long term, genes become functionally interconnected and integrated, directly or indirectly, affecting multiple traits simultaneously. We reason that in addition to personal genomics and precision medicine, unnecessary complexity has consequences in evolutionary biology.
Project description:Gene regulatory networks exhibit complex, hierarchical features such as global regulation and network motifs. There is much debate about whether the evolutionary origins of such features are the results of adaptation, or the by-products of non-adaptive processes of DNA replication. The lack of availability of gene regulatory networks of ancestor species on evolutionary timescales makes this a particularly difficult problem to resolve. Digital organisms, however, can be used to provide a complete evolutionary record of lineages. We use a biologically realistic evolutionary model that includes gene expression, regulation, metabolism and biosynthesis, to investigate the evolution of complex function in gene regulatory networks. We discover that: (i) network architecture and complexity evolve in response to environmental complexity, (ii) global gene regulation is selected for in complex environments, (iii) complex, inter-connected, hierarchical structures evolve in stages, with energy regulation preceding stress responses, and stress responses preceding growth rate adaptations and (iv) robustness of evolved models to mutations depends on hierarchical level: energy regulation and stress responses tend not to be robust to mutations, whereas growth rate adaptations are more robust and non-lethal when mutated. These results highlight the adaptive and incremental evolution of complex biological networks, and the value and potential of studying realistic in silico evolutionary systems as a way of understanding living systems.
Project description:Organisms from all domains of life use gene regulation networks to control cell growth, identity, function, and responses to environmental challenges. Although accurate global regulatory models would provide critical evolutionary and functional insights, they remain incomplete, even for the best studied organisms. Efforts to build comprehensive networks are confounded by challenges including network scale, degree of connectivity, complexity of organism-environment interactions, and difficulty of estimating the activity of regulatory factors. Taking advantage of the large number of known regulatory interactions in Bacillus subtilis and two transcriptomics datasets (including one with 38 separate experiments collected specifically for this study), we use a new combination of network component analysis and model selection to simultaneously estimate transcription factor activities and learn a substantially expanded transcriptional regulatory network for this bacterium. In total, we predict 2,258 novel regulatory interactions and recall 74% of the previously known interactions. We obtained experimental support for 391 (out of 635 evaluated) novel regulatory edges (62% accuracy), thus significantly increasing our understanding of various cell processes, such as spore formation.
2015-01-01 | S-EPMC4670728 | BioStudies
Project description:Evolutionary forces shaping the nucleotide composition of organisms
Project description:Recent advances in genome sequencing suggest a remarkable conservation in gene content of mammalian organisms. The similarity in gene repertoire present in different organisms has increased interest in studying regulatory mechanisms of gene expression aimed at elucidating the differences in phenotypes. In particular, a proximal promoter region contains a large number of regulatory elements that control the expression of its downstream gene. Although many studies have focused on identification of these elements, a broader picture of the complexity of transcriptional regulation of different biological processes has not been addressed in mammals. The regulatory complexity may strongly correlate with gene function, as different evolutionary forces must act on the regulatory systems under different biological conditions. We investigate this hypothesis by comparing the conservation of promoters upstream of genes classified in different functional categories. By conducting a rank correlation analysis between functional annotation and upstream sequence alignment scores obtained by human-mouse and human-dog comparison, we found a significantly greater conservation of the upstream sequence of genes involved in development, cell communication, neural functions and signaling processes than those involved in more basic processes shared with unicellular organisms such as metabolism and ribosomal function. This observation persists after controlling for G+C content. Considering conservation as a functional signature, we hypothesize a higher density of cis-regulatory elements upstream of genes participating in complex and adaptive processes. We identified a class of functions that are associated with either high or low promoter conservation in mammals. We detected a significant tendency for complex and adaptive processes to be associated with higher promoter conservation, despite the fact that they have emerged relatively recently during evolution.
We described and contrasted several hypotheses that provide a deeper insight into how transcriptional complexity might have emerged during evolution.
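The abstract above refers to a rank correlation analysis between functional annotation and upstream alignment scores, without stating which rank statistic was used. As a rough illustration only, the sketch below assumes Spearman's rho with average ranks for ties, and encodes annotation as a hypothetical 0/1 membership indicator for one functional category. The gene scores are invented toy values, and the helper names are not from the study.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (hypothetical): 1 = gene annotated to the category, 0 = not annotated.
in_category = [1, 1, 1, 0, 0, 0]
# Toy upstream alignment scores for the same six genes (hypothetical values).
alignment_scores = [0.9, 0.8, 0.7, 0.4, 0.5, 0.3]

rho = spearman_rho(in_category, alignment_scores)
print(f"Spearman rho = {rho:.3f}")
```

A positive rho here would indicate that genes in the category tend to have higher upstream alignment scores; a real analysis would also need a significance test and, as the abstract notes, a control for G+C content.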