Project description: Inconsistent polymer indexing, caused by the lack of a uniform way of expressing polymer names, is a major challenge for the widespread use of polymer-related data resources and limits the application of materials informatics to innovation across polymer science and polymer-based materials. The current solution of using a variety of different chemical identifiers has proven insufficient to address the challenge and is not intuitive for researchers. This work proposes a multi-algorithm mapping methodology, ChemProps, that is optimized to solve the polymer indexing issue and is designed to be easy to update in both depth and width. A RESTful API is provided for lightweight data exchange and easy integration across data systems. A weight factor is assigned to each algorithm to generate scores for candidate chemical names, and the weight factors are optimized to maximize the minimum score difference between the ground-truth chemical name and the other candidate chemical names. Ten-fold validation is applied to the 160 training data points to prevent overfitting. The resulting set of weight factors achieves 100% test accuracy on the 54 test data points. The weight factors will evolve as ChemProps grows. With ChemProps, other polymer databases can remove duplicate entries and enable a more accurate "search by SMILES" function by using ChemProps as a common name-to-SMILES translator through API calls. ChemProps is also well suited to auto-populating polymer properties thanks to its easy-to-update design.
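The weight-factor objective described above (maximize the smallest score gap between the true name and any other candidate) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the two-algorithm toy scores, the function names, the constraint that weights are non-negative and sum to 1, and the coarse grid search, which stands in for whatever optimizer ChemProps actually uses.

```python
from itertools import product


def margin(weights, candidates, truth):
    """Score gap between the true name and the best-scoring wrong candidate."""
    scores = {
        name: sum(w * s for w, s in zip(weights, algo_scores))
        for name, algo_scores in candidates.items()
    }
    best_wrong = max(v for name, v in scores.items() if name != truth)
    return scores[truth] - best_wrong


def worst_margin(weights, queries):
    """Smallest margin over all training queries (the quantity to maximize)."""
    return min(margin(weights, cands, truth) for cands, truth in queries)


def grid_search(queries, n_algorithms, steps=10):
    """Try every weight vector on a simplex grid (weights >= 0, sum to 1)."""
    best_w, best_val = None, float("-inf")
    for combo in product(range(steps + 1), repeat=n_algorithms):
        if sum(combo) != steps:
            continue
        w = tuple(c / steps for c in combo)
        val = worst_margin(w, queries)
        if val > best_val:
            best_w, best_val = w, val
    return best_w, best_val


# Hypothetical per-algorithm scores for two queries (not ChemProps data).
toy_queries = [
    ({"polystyrene": (0.9, 0.2), "polyethylene": (0.5, 0.6)}, "polystyrene"),
    ({"PMMA": (0.3, 0.9), "PVC": (0.6, 0.4)}, "PMMA"),
]

best_w, best_val = grid_search(toy_queries, n_algorithms=2)
print(best_w, round(best_val, 3))
```

A positive worst-case margin means every training query ranks its ground-truth name first under those weights; a larger margin leaves more room for unseen names.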
Project description: VIPERdb (http://viperdb.scripps.edu) is a relational database and a web portal for icosahedral virus capsid structures. Our aim is to provide a comprehensive resource specific to the needs of the virology community, with an emphasis on the description and comparison of derived data from structural and computational analyses of the virus capsids. In the current release, VIPERdb², we implemented a useful and novel method to represent capsid protein residues in the icosahedral asymmetric unit (IAU) using azimuthal polar orthographic projections, otherwise known as Φ-Ψ (Phi-Psi) diagrams. In conjunction with a new Application Programming Interface (API), these diagrams can be used as a dynamic interface to the database to map residues (categorized as surface, interface and core residues) and identify family-wide conserved residues, including hotspots at the interfaces. Additionally, we enhanced the interactivity with the database by interfacing with web-based tools. In particular, the applications Jmol and STRAP were implemented to visualize and interact with the virus molecular structures and provide sequence-structure alignment capabilities. Together with extended curation practices that maintain data uniformity, a relational database implementation based on a schema for macromolecular structures and the APIs provided will greatly enhance the ability to do structural bioinformatics analysis of virus capsids.
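As a rough illustration of a polar orthographic projection of residue positions, the sketch below converts a Cartesian coordinate to two angles and projects it onto a plane. The axis convention (z-axis as the pole, Φ as azimuth, Ψ as polar angle) and the function names are assumptions; the abstract does not state the convention VIPERdb uses.

```python
import math


def to_phi_psi(x, y, z):
    """Return (phi, psi) in radians: azimuth around z and polar angle from z."""
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0:
        raise ValueError("the origin has no defined direction")
    phi = math.atan2(y, x)   # azimuth in (-pi, pi]
    psi = math.acos(z / r)   # polar angle in [0, pi]
    return phi, psi


def polar_orthographic(phi, psi):
    """Project a direction onto the plane perpendicular to the pole axis."""
    rho = math.sin(psi)      # distance from the pole in the projected plane
    return rho * math.cos(phi), rho * math.sin(phi)


# Hypothetical residue coordinate (arbitrary units), not taken from VIPERdb.
phi, psi = to_phi_psi(3.0, 0.0, 0.0)
print(polar_orthographic(phi, psi))
```

Under these assumptions the projection discards the radial distance, so residues at different depths along the same direction land on the same point of the diagram.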
Project description: The International Committee on Taxonomy of Viruses (ICTV) is charged with the task of developing, refining, and maintaining a universal virus taxonomy. This task encompasses the classification of virus species and higher-level taxa according to the genetic and biological properties of their members; naming virus taxa; maintaining a database detailing the currently approved taxonomy; and providing the database, supporting proposals, and other virus-related information through an open-access, public web site. The ICTV web site (http://ictv.global) provides access to the current taxonomy database in online and downloadable formats, and maintains a complete history of virus taxa back to the first release in 1971. The ICTV has also published the ICTV Report on Virus Taxonomy starting in 1971. This Report provides a comprehensive description of all virus taxa covering virus structure, genome structure, biology and phylogenetics. The ninth ICTV report, published in 2012, is available as an open-access online publication from the ICTV web site. The current, 10th report (http://ictv.global/report/) is being published online and replaces the previous hard-copy edition with a completely open-access, continuously updated publication. No other database or resource exists that provides such a comprehensive, fully annotated compendium of information on virus taxa and taxonomy.
Project description: The NCBI Taxonomy database (http://www.ncbi.nlm.nih.gov/taxonomy) is the standard nomenclature and classification repository for the International Nucleotide Sequence Database Collaboration (INSDC), comprising the GenBank, ENA (EMBL) and DDBJ databases. It includes organism names and taxonomic lineages for each of the sequences represented in the INSDC's nucleotide and protein sequence databases. The taxonomy database is manually curated by a small group of scientists at the NCBI who use the current taxonomic literature to maintain a phylogenetic taxonomy for the source organisms represented in the sequence databases. The taxonomy database is a central organizing hub for many of the resources at the NCBI, and provides a means for clustering elements within other domains of the NCBI web site, for internal linking between domains of the Entrez system and for linking out to taxon-specific external resources on the web. Our primary purpose is to index the domain of sequences as conveniently as possible for our user community.
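One common way to store names and lineages together is a table of nodes that each point to a parent, from which a lineage is recovered by walking up to the root. The sketch below illustrates that idea only; the identifiers, the abbreviated three-level hierarchy, and the convention that the root points to itself are assumptions for this example, not the database's actual schema or contents.

```python
# Hypothetical node table: node id -> (parent id, name).
# IDs and hierarchy are made up for illustration; the root is its own parent.
TOY_NODES = {
    1: (1, "root"),
    10: (1, "Eukaryota"),
    20: (10, "Homo"),
    30: (20, "Homo sapiens"),
}


def lineage(node_id, nodes):
    """Return names from the root down to node_id by following parent links."""
    names = []
    seen = set()
    current = node_id
    while True:
        if current in seen:
            raise ValueError("cycle detected in parent links")
        seen.add(current)
        parent, name = nodes[current]
        names.append(name)
        if parent == current:  # reached the root
            break
        current = parent
    return list(reversed(names))


print(" > ".join(lineage(30, TOY_NODES)))
```

Storing only parent links keeps each name in one place, so renaming or re-parenting a single node updates the lineage of everything beneath it.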
Project description: Background: Discovery and incorporation of predictive and prognostic biomarkers enhance outcomes for patients with cancer. Clinico-genomic datasets, which retrospectively link real-world clinical data to tumor sequencing data, are important resources for biomarker research, which has historically relied on robust research infrastructures exclusive to large academic centers. The objective was to evaluate the feasibility of a pragmatic, technology-enabled platform at community-based research sites for development of a prospective clinico-genomic database supported by centralized electronic health record (EHR)-based patient ascertainment and data processing. Methods: Adults with stage IV or recurrent metastatic non-small cell lung cancer or extensive-stage small-cell lung cancer were enrolled at 23 US sites upon initiating a standard line of therapy. Enrollment rates were estimated from eligible populations at individual centers. Clinical data from routinely collected EHR documentation were centrally processed and normalized for quality control. Serial blood samples at pre-specified timepoints (baseline, during treatment and at disease progression/end of therapy) were used for circulating tumor DNA (ctDNA) genomic profiling. Results: Between December 2019 and May 2021, 944 patients enrolled, representing approximately 25% of eligible patients. Of the participants, 817 of 944 (87%), 406 of 606 (67%) and 398 of 852 (47%) provided qualifying samples for ctDNA testing at baseline, during treatment and at disease progression/end of therapy, respectively. Samples were provided at all three timepoints by 35% of participants. Conclusion: A community-based oncology patient cohort was rapidly enrolled, creating a real-world clinico-genomic dataset. This pragmatic study platform has potential research applications where prospective real-world data may contribute to evidence generation.
Project description: Background: Advances in sequencing and genotyping technologies are leading to the widespread availability of multi-species variation data, dense genotype data and large-scale resequencing projects. The 1000 Genomes Project and similar efforts in other species are challenging the methods previously used for storage and manipulation of such data, necessitating the redesign of existing genome-wide bioinformatics resources. Results: Ensembl has created a database and software library to support data storage, analysis and access to the existing and emerging variation data from large mammalian and vertebrate genomes. These tools scale to thousands of individual genome sequences and are integrated into the Ensembl infrastructure for genome annotation and visualisation. The database and software system is easily expanded to integrate both public and non-public data sources in the context of an Ensembl software installation and is already being used outside of the Ensembl project in a number of database and application environments. Conclusions: Ensembl's powerful, flexible and open source infrastructure for the management of variation, genotyping and resequencing data is freely available at http://www.ensembl.org.
Project description: Type material is the taxonomic device that ties formal names to the physical specimens that serve as exemplars for the species. For the prokaryotes these are strains submitted to the culture collections; for the eukaryotes they are specimens submitted to museums or herbaria. The NCBI Taxonomy Database (http://www.ncbi.nlm.nih.gov/taxonomy) now includes annotation of type material that we use to flag sequences from type in GenBank and in Genomes. This has important implications for many NCBI resources, some of which are outlined below.
Project description: Nucleotide sequence and taxonomy reference databases are critical resources for widespread applications including marker-gene and metagenome sequencing for microbiome analysis, diet metabarcoding, and environmental DNA (eDNA) surveys. Reproducibly generating, managing, using, and evaluating nucleotide sequence and taxonomy reference databases creates a significant bottleneck for researchers aiming to generate custom sequence databases. Furthermore, database composition drastically influences results, and lack of standardization limits cross-study comparisons. To address these challenges, we developed RESCRIPt, a Python 3 software package and QIIME 2 plugin for reproducible generation and management of reference sequence taxonomy databases, including dedicated functions that streamline creating databases from popular sources, and functions for evaluating, comparing, and interactively exploring qualitative and quantitative characteristics across reference databases. To highlight the breadth and capabilities of RESCRIPt, we provide several examples for working with popular databases for microbiome profiling (SILVA, Greengenes, NCBI-RefSeq, GTDB), eDNA and diet metabarcoding surveys (BOLD, GenBank), as well as for genome comparison. We show that bigger is not always better, and that reference databases with standardized taxonomies and those that focus on type strains have quantitative advantages, though they may not be appropriate for all use cases. Most databases appear to benefit from some curation (quality filtering), though sequence clustering appears detrimental to database quality. Finally, we demonstrate the breadth and extensibility of RESCRIPt for reproducible workflows with a comparison of global hepatitis genomes. RESCRIPt provides tools to democratize the process of reference database acquisition and management, enabling researchers to reproducibly and transparently create reference materials for diverse research applications.
RESCRIPt is released under a permissive BSD-3 license at https://github.com/bokulich-lab/RESCRIPt.
Project description: Background: The effectiveness of weight loss therapies is commonly measured using body mass index and other obesity-related variables. Although these data are often stored in electronic health records (EHRs) and potentially very accessible, few studies on obesity and weight loss have used data derived from EHRs. We developed processes for obtaining data from the EHR in order to construct a database on patients undergoing Roux-en-Y gastric bypass (RYGB) surgery. Methods: Clinical data obtained as part of standard of care in a bariatric surgery program at an integrated health delivery system were extracted from the EHR and deposited into a data warehouse. Data files were extracted, cleaned, and stored in research datasets. To illustrate the utility of the data, Kaplan-Meier analysis was used to estimate length of post-operative follow-up. Results: Demographic, laboratory, medication, co-morbidity, and survey data were obtained from 2028 patients who had undergone RYGB at the same institution since 2004. Pre- and post-operative diagnostic and prescribing information were available on all patients, while survey and laboratory data were available on a majority of patients. The number of patients with post-operative laboratory test results varied by test. Based on Kaplan-Meier estimates, over 74% of patients had post-operative weight data available at 4 years. Conclusion: A variety of EHR-derived data related to obesity can be efficiently obtained and used to study important outcomes following RYGB.
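For readers unfamiliar with the Kaplan-Meier method mentioned above, the following is a minimal sketch of the product-limit estimator. The toy durations and event flags are hypothetical, not study data, and the framing (an "event" is the end of a patient's follow-up, and a patient still being followed at data extraction is censored) is an assumption about how such an analysis could be set up rather than a description of the authors' procedure.

```python
def kaplan_meier(durations, events):
    """Product-limit estimate: list of (time, retained fraction) at event times.

    durations: observed time for each subject
    events: 1 if the event occurred at that time, 0 if the subject was censored
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    retained = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_removed = 0
        # Group all subjects sharing the same observed time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_removed += 1
            i += 1
        if n_events:
            retained *= 1 - n_events / at_risk
            curve.append((t, retained))
        at_risk -= n_removed  # events and censored subjects both leave the risk set
    return curve


# Hypothetical follow-up times in years and event flags.
toy_curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
for t, s in toy_curve:
    print(t, round(s, 2))
```

Censored subjects contribute to the at-risk count up to their last observed time but never trigger a drop in the curve, which is why the estimate differs from a simple proportion of subjects remaining.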