Project description:Next-generation sequencing technologies have revolutionized the field of virology by enabling the reading of complete viral genomes, extensive metagenomic studies, and the identification of novel viral pathogens. Although metagenomic sequencing has the advantage of not requiring specific probes or primers, it faces significant challenges in analyzing data and identifying novel viruses. Traditional bioinformatics tools for sequence identification mainly depend on homology-based strategies, which may not allow the detection of a virus significantly different from known variants due to the extensive genetic diversity and rapid evolution of viruses. In this work, we performed metagenomic analysis of bat feces from different Russian cities and identified a wide range of viral pathogens. We then selected sequences with minimal homology to a known picornavirus and used "Switching Mechanism at the 5' end of RNA Template" technology to obtain a longer genome fragment, allowing for more reliable identification. This study emphasizes the importance of integrating advanced computational methods with experimental strategies for identifying unknown viruses to better understand the viral universe.
Project description:Although lessons have been learned from previous severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS) outbreaks, the rapid evolution of the viruses means that future outbreaks of a much larger scale are possible, as shown by the current coronavirus disease 2019 (COVID-19) outbreak. Therefore, it is necessary to better understand the evolution of coronaviruses as well as viruses in general. This study reports a comparative analysis of the amino acid usage within several key viral families and genera that are prone to triggering outbreaks, including coronavirus (severe acute respiratory syndrome coronavirus 2 [SARS-CoV-2], SARS-CoV, MERS-CoV, human coronavirus-HKU1 [HCoV-HKU1], HCoV-OC43, HCoV-NL63, and HCoV-229E), influenza A (H1N1 and H3N2), flavivirus (dengue virus serotypes 1 to 4 and Zika) and ebolavirus (Zaire, Sudan, and Bundibugyo ebolavirus). Our analysis reveals that the distribution of amino acid usage in the viral genome is constrained to follow a linear order, and the distribution remains closely related to the viral species within the family or genus. This constraint can be adapted to predict viral mutations and future variants of concern. By studying previous SARS and MERS outbreaks, we have adapted this naturally occurring pattern to determine that although the pangolin plays a role in the outbreak of COVID-19, it may not be the sole intermediate animal. Beyond this study, our findings contribute to the understanding of viral mutations for the subsequent development of vaccines and to the development of a model for determining the source of an outbreak. IMPORTANCE This study reports a comparative analysis of amino acid usage within several key viral genera that are prone to triggering outbreaks. Interestingly, there is evidence that the amino acid usage within the viral genomes is not random but follows a linear order.
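The amino acid usage compared in the abstract above can be illustrated with a minimal sketch that turns a set of protein sequences into a 20-value frequency profile. The function name `amino_acid_usage` and the toy sequences are assumptions made for illustration. The sketch does not reproduce the study's method or its linear-order analysis; it only shows what an "amino acid usage" profile might look like as data.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_usage(protein_sequences):
    """Return the relative frequency of each standard amino acid.

    Hypothetical helper: characters outside the 20 standard codes
    (stop symbols, gaps, ambiguity codes) are ignored.
    """
    counts = Counter()
    for seq in protein_sequences:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


# Toy input, not real viral proteins.
usage = amino_acid_usage(["MKKL", "mk*"])
```

Profiles computed this way for different viruses could then be compared with any vector distance or ranked by frequency; which comparison the authors used is not stated in the abstract.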
Project description:Since its beginning in the early 1960s, the field of X-ray astronomy has exploded, experiencing a ten-billion-fold increase in sensitivity, which brought it on par with the most advanced facilities at all wavelengths. I will briefly describe the revolutionary first discoveries prior to the launch of the Chandra and XMM-Newton X-ray observatories, present some of the current achievements, and offer some thoughts about the future of this field.
Project description:Metagenomics sequencing projects have dramatically increased our knowledge of the protein universe and provided over one-half of currently known protein sequences; they have also introduced a much broader phylogenetic diversity into the protein databases. The full analysis of metagenomic datasets is only beginning, but it has already led to the discovery of thousands of new protein families, likely representing novel functions specific to given environments. At the same time, a deeper analysis of such novel families, including experimental structure determination of some representatives, suggests that most of them represent distant homologs of already characterized protein families, and thus most of the protein diversity present in the new environments is due to functional divergence of the known protein families rather than the emergence of new ones.
Project description:The protein universe is the set of all proteins of all organisms. Here, all currently known sequences are analyzed in terms of families that have single-domain or multidomain architectures and whether they have a known three-dimensional structure. Growth of new single-domain families is very slow: Almost all growth comes from new multidomain architectures that are combinations of domains characterized by approximately 15,000 sequence profiles. Single-domain families are mostly shared by the major groups of organisms, whereas multidomain architectures are specific and account for species diversity. There are known structures for a quarter of the single-domain families, and >70% of all sequences can be partially modeled thanks to their membership in these families.
Project description:p53 transcriptional networks are well-characterized in many organisms. However, a global understanding of requirements for in vivo p53 interactions with DNA and relationships with transcription across human biological systems in response to various p53 activating situations remains limited. Using a common analysis pipeline, we analyzed 41 data sets from genome-wide ChIP-seq studies of which 16 have associated gene expression data, including our recent primary data with normal human lymphocytes. The resulting extensive analysis, accessible at p53 BAER hub via the UCSC browser, provides a robust platform to characterize p53 binding throughout the human genome including direct influence on gene expression and underlying mechanisms. We establish the impact of spacers and mismatches from consensus on p53 binding in vivo and propose that once bound, neither significantly influences the likelihood of expression. Our rigorous approach revealed a large p53 genome-wide cistrome composed of >900 genes directly targeted by p53. Importantly, we identify a core cistrome signature composed of genes appearing in over half the data sets, and we identify signatures that are treatment- or cell-specific, demonstrating new functions for p53 in cell biology. Our analysis reveals a broad homeostatic role for human p53 that is relevant to both basic and translational studies.
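The "core cistrome signature" criterion in the abstract above (genes appearing in over half the data sets) can be sketched as a simple counting step. The function name `core_signature` and the gene labels are placeholders invented for illustration; this is not the authors' pipeline, which also involves ChIP-seq peak calling and expression analysis not shown here.

```python
from collections import Counter


def core_signature(target_sets):
    """Return genes present in strictly more than half of the data sets.

    `target_sets` is a list of iterables of gene names, one per data set
    (hypothetical input format).
    """
    n = len(target_sets)
    # Count each gene at most once per data set.
    counts = Counter(gene for genes in target_sets for gene in set(genes))
    return {gene for gene, c in counts.items() if c > n / 2}


# Toy data with placeholder gene labels, not results from the study.
datasets = [
    {"GENE_A", "GENE_B"},
    {"GENE_A", "GENE_C"},
    {"GENE_A", "GENE_B", "GENE_D"},
    {"GENE_C"},
]
core = core_signature(datasets)
```

With four data sets, a gene must appear in at least three to count as "over half", so only `GENE_A` qualifies in this toy example.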
Project description:To explore protein space from a global perspective, we consider 9,710 SCOP (Structural Classification of Proteins) domains with up to 70% sequence identity and present all similarities among them as networks: In the "domain network," nodes represent domains, and edges connect domains that share "motifs," i.e., significantly sized segments of similar sequence and structure. We explore the dependence of the network on the thresholds that define the evolutionary relatedness of the domains. At excessively strict thresholds the network falls apart completely; for very lax thresholds, there are network paths between virtually all domains. Interestingly, at intermediate thresholds the network constitutes two regions that can be described as "continuous" versus "discrete." The continuous region comprises a large connected component, dominated by domains with alternating alpha and beta elements, and the discrete region includes the rest of the domains in isolated islands, each generally corresponding to a fold. We also construct the "motif network," in which nodes represent recurring motifs, and edges connect motifs that appear in the same domain. This network also features a large and highly connected component of motifs that originate from domains with alternating alpha/beta elements (and some all-alpha domains), and smaller isolated islands. Indeed, the motif network suggests that nature reuses such motifs extensively. The networks suggest evolutionary paths between domains and give hints about protein evolution and the underlying biophysics. They provide natural means of organizing protein space, and could be useful for the development of strategies for protein search and design.
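The threshold dependence described in the abstract above can be illustrated with a small sketch: edges are kept only when a similarity score meets a threshold, and the connected components of the resulting graph are counted. The scored-edge format, the function name `connected_components`, and the numeric scores are assumptions for illustration. The paper defines its edges through shared motifs with several thresholds, which this sketch collapses into a single score.

```python
def connected_components(nodes, scored_edges, threshold):
    """Return the connected components of the thresholded network.

    `scored_edges` is an iterable of (node_a, node_b, score) tuples
    (hypothetical format); an edge is kept when score >= threshold.
    """
    adjacency = {node: set() for node in nodes}
    for a, b, score in scored_edges:
        if score >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from start.
        stack = [start]
        seen.add(start)
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


# Toy domains and scores, invented for illustration.
domains = ["d1", "d2", "d3", "d4"]
edges = [("d1", "d2", 0.9), ("d2", "d3", 0.5), ("d3", "d4", 0.2)]
strict = connected_components(domains, edges, 0.95)  # every domain isolated
medium = connected_components(domains, edges, 0.4)   # one cluster plus an island
lax = connected_components(domains, edges, 0.1)      # everything connected
```

The three calls mirror the three regimes in the abstract: at a strict threshold the network falls apart, at a lax one all domains are linked, and in between a larger component coexists with isolated islands.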
Project description:The Protein Data Bank (PDB) has grown from a small data resource for crystallographers to a worldwide resource serving structural biology. The history of the growth of the PDB and the role that the community has played in developing standards and policies are described. This article also illustrates how other biophysics communities are collaborating with the worldwide PDB to create a network of interoperating data resources. This network will expand the capabilities of structural biology and enable the determination and archiving of increasingly complex structures.
Project description:Historically, small proteins (sproteins) of less than 50 amino acids, in their final processed forms or genetically encoded as such, have been understudied. However, both serendipity and more recent focused efforts have led to the identification of a number of new sproteins in both Gram-negative and Gram-positive bacteria. Increasing evidence demonstrates that sproteins participate in a wide array of cellular processes and exhibit great diversity in their mechanisms of action, yet general principles of sprotein function are emerging. This review highlights examples of sproteins that participate in cell signaling, act as antibiotics and toxins, and serve as structural proteins. We also describe roles for sproteins in detecting and altering membrane features, acting as chaperones, and regulating the functions of larger proteins.
Project description:The variety of metagenomes in current databases provides a rapidly growing source of information for comparative studies. However, the quantity and quality of supplementary metadata are still lagging behind. It is therefore important to be able to identify related metagenomes by means of the available sequence data alone. We have studied efficient sequence-based methods for large-scale identification of similar metagenomes within a database retrieval context. In a broad comparison of different profiling methods, we found that vector-based distance measures are well suited for the detection of metagenomic neighbors. Our evaluation on more than 1700 publicly available metagenomes indicates that for a query metagenome from a particular habitat, on average nine out of ten nearest neighbors represent the same habitat category, independent of the utilized profiling method or distance measure. While for well-defined labels a neighborhood accuracy of 100% can be achieved, in general the neighbor detection is severely affected by a natural overlap of manually annotated categories. In addition, we present results of a novel visualization method that is able to reflect the similarity of metagenomes in a 2D scatter plot. The visualization method shows a similarly high accuracy in the reduced space as compared with the high-dimensional profile space. Our study suggests that for inspection of metagenome neighborhoods, the profiling methods and distance measures can be chosen to provide a convenient interpretation of results in terms of the underlying features. Furthermore, in the future, supplementary metadata of metagenome samples needs to comply with readily available ontologies for fine-grained and standardized annotation. To make profile-based k-nearest-neighbor search and the 2D visualization of the metagenome universe available to the research community, we included the proposed methods in our CoMet-Universe server for comparative metagenome analysis.
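The profile-based k-nearest-neighbor search and the neighborhood accuracy mentioned in the abstract above can be sketched as follows. Cosine distance stands in here for the "vector-based distance measures"; the abstract does not say which measures were evaluated, so that choice is an assumption. The function names, profile vectors, and habitat labels are likewise hypothetical and do not reflect the CoMet-Universe implementation.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length profile vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an all-zero profile as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def nearest_neighbors(query, profiles, k):
    """Return the names of the k profiles closest to the query vector."""
    ranked = sorted(profiles, key=lambda name: cosine_distance(query, profiles[name]))
    return ranked[:k]


def neighborhood_accuracy(query_label, neighbor_names, labels):
    """Return the fraction of neighbors sharing the query's habitat label."""
    if not neighbor_names:
        return 0.0
    hits = sum(1 for name in neighbor_names if labels[name] == query_label)
    return hits / len(neighbor_names)


# Toy three-feature profiles and labels, invented for illustration.
profiles = {
    "soil_a": [0.7, 0.2, 0.1],
    "soil_b": [0.6, 0.3, 0.1],
    "marine_a": [0.1, 0.2, 0.7],
}
labels = {"soil_a": "soil", "soil_b": "soil", "marine_a": "marine"}
neighbors = nearest_neighbors([0.65, 0.25, 0.10], profiles, k=2)
accuracy = neighborhood_accuracy("soil", neighbors, labels)
```

A linear scan like this is adequate for a few thousand profiles; a larger database would likely need an index structure, which is beyond this sketch.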