Project description:Scientific work in chemistry typically yields results in the form of materials and data. A major step toward transparency and reproducibility is achieved when scientists publish their data in research data repositories in a FAIR manner. Nevertheless, to make chemistry a sustainable discipline, FAIR data alone are insufficient; a comprehensive concept that includes the preservation of materials is needed. To offer a comprehensive infrastructure for finding and accessing the data and materials generated in chemistry projects, we combined the Chemotion repository with an archive for chemical compounds. Samples play a key role in this concept: we describe how the FAIR metadata of a virtual sample representation can be used to refer to a physically available sample in a materials archive and to link it with the FAIR research data obtained using that sample. We further describe measures to make the physically available samples not only FAIR through their metadata but also findable, accessible, and reusable.
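To make the linking concept concrete, a minimal sketch of such a virtual sample record is shown below; every field name and value here is hypothetical and illustrative, not the actual Chemotion metadata schema:

```python
import json

# Hypothetical virtual sample record linking a physical sample in a
# compound archive to the FAIR research data obtained from it.
# All field names and values are illustrative placeholders, not the
# actual Chemotion schema.
sample_record = {
    "sample_id": "SAMPLE-0001",           # identifier of the virtual sample
    "archive_location": "ARCHIVE-A-042",  # where the physical sample is stored
    "molecule_inchikey": "UHOVQNZJYSORNB-UHFFFAOYSA-N",  # benzene, as an example
    "datasets": [
        {"type": "NMR", "doi": "10.xxxx/example-nmr"},   # placeholder DOIs
        {"type": "MS", "doi": "10.xxxx/example-ms"},
    ],
}

# Serializing the record makes it machine-readable and exchangeable.
serialized = json.dumps(sample_record, indent=2)
```

The record points in two directions, to the physical sample's archive location and to the datasets derived from it, which is what makes both the material and the data findable from a single entry.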
Project description:This article presents a practical roadmap for scholarly data repositories to implement data citation in accordance with the Joint Declaration of Data Citation Principles, a synopsis and harmonization of the recommendations of major science policy bodies. The roadmap was developed by the Repositories Expert Group as part of the Data Citation Implementation Pilot (DCIP) project, an initiative of FORCE11.org and the NIH-funded BioCADDIE (https://biocaddie.org) project. The roadmap makes 11 specific recommendations, grouped into three phases of implementation: a) required steps needed to support the Joint Declaration of Data Citation Principles, b) recommended steps that facilitate article/data publication workflows, and c) optional steps that further improve the data citation support provided by data repositories. We describe the early adoption of these recommendations 18 months after their initial publication, looking specifically at implementations of machine-readable metadata on dataset landing pages.
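One widely used way to satisfy the machine-readable-metadata recommendation is to embed a schema.org Dataset description as JSON-LD on the dataset landing page. A minimal sketch, with placeholder values throughout:

```python
import json

# A minimal schema.org Dataset record of the kind recommended for
# embedding as JSON-LD on a dataset landing page. All values are
# placeholders, not a real dataset.
landing_page_metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "identifier": "https://doi.org/10.xxxx/example",  # persistent identifier
    "name": "Example dataset title",
    "description": "Short human-readable description of the dataset.",
    "url": "https://repository.example.org/dataset/123",
    "publisher": {"@type": "Organization", "name": "Example Repository"},
}

# Serialized form, ready to place in a <script type="application/ld+json"> tag.
jsonld = json.dumps(landing_page_metadata, indent=2)
```

Harvesters and dataset search engines can parse this block without scraping the human-readable page, which is precisely what makes the landing page machine-actionable.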
Project description:Prior to 1950, the consensus was that biological transformations occurred in two-electron steps, thereby avoiding the generation of free radicals. Dramatic advances in spectroscopy, biochemistry, and molecular biology have led to the realization that protein-based radicals participate in a vast array of vital biological mechanisms. Redox processes involving high-potential intermediates formed in reactions with O2 are particularly susceptible to radical formation. Clusters of tyrosine (Tyr) and tryptophan (Trp) residues have been found in many O2-reactive enzymes, raising the possibility that they play an antioxidant protective role. In blue copper proteins with plastocyanin-like domains, Tyr/Trp clusters are uncommon in the low-potential single-domain electron-transfer proteins and in the two-domain copper nitrite reductases. The two-domain multicopper oxidases, however, exhibit clusters of Tyr and Trp residues near the trinuclear copper active site where O2 is reduced. These clusters may play a protective role to ensure that reactive oxygen species are not liberated during O2 reduction.
Project description:The volume of proteomics and mass spectrometry data available in public repositories continues to grow at a rapid pace as more researchers embrace open science practices. Open access to the data behind scientific discoveries has become critical to validate published findings and develop new computational tools. Here, we present ppx, a Python package that provides easy, programmatic access to the data stored in ProteomeXchange repositories, such as PRIDE and MassIVE. The ppx package can be used as either a command line tool or a Python package to retrieve the files and metadata associated with a project when provided its identifier. To demonstrate how ppx enhances reproducible research, we used ppx within a Snakemake workflow to reanalyze a published data set with the open modification search tool ANN-SoLo and compared our reanalysis to the original results. We show that ppx readily integrates into workflows, and our reanalysis produced results consistent with the original analysis. We envision that ppx will be a valuable tool for creating reproducible analyses, providing tool developers easy access to data for development, testing, and benchmarking, and enabling the use of mass spectrometry data in data-intensive analyses. The ppx package is freely available and open source under the MIT license at https://github.com/wfondrie/ppx.
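Tools like ppx resolve a project identifier to its host repository before fetching files. The toy function below illustrates that dispatch idea only; it is not ppx's actual implementation or API (PXD accessions belong to PRIDE, MSV accessions to MassIVE):

```python
import re

def classify_identifier(identifier: str) -> str:
    """Map a ProteomeXchange-style accession to its likely host repository.

    A toy illustration of the kind of dispatch a tool like ppx performs;
    this is not ppx's actual implementation.
    """
    if re.fullmatch(r"PXD\d+", identifier):
        return "PRIDE"
    if re.fullmatch(r"MSV\d+", identifier):
        return "MassIVE"
    raise ValueError(f"Unrecognized project identifier: {identifier}")
```

Given such a mapping, a retrieval tool only needs the accession to know which repository's download mechanism to use, which is what makes a single-identifier interface possible.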
Project description:Structural biology research generates large amounts of data, some deposited in public databases or repositories, but a substantial remainder never becomes available to the scientific community. In addition, some of the deposited data contain errors of varying severity that may bias the results of data mining. Thorough analysis and discussion of these problems are needed to ameliorate this situation. This perspective is an attempt to propose some solutions and encourage both further discussion and action on the part of the relevant organizations, in particular the PDB and various bodies of the International Union of Crystallography.
Project description:Structural biology, like many other areas of modern science, produces an enormous amount of primary, derived, and "meta" data, placing high demands on data storage and manipulation. Primary data come from various steps of sample preparation, diffraction experiments, and functional studies. These data are not only used to obtain tangible results, like macromolecular structural models, but also to enrich and guide our analysis and interpretation of various biomedical problems. Herein we define several categories of data resources, (a) Archives, (b) Repositories, (c) Databases, and (d) Advanced Information Systems, that can accommodate primary, derived, or reference data. Data resources may be used either as web portals or internally by structural biology software. To be useful, each resource must be maintained and curated, as well as integrated with other resources. Ideally, the system of interconnected resources should evolve toward comprehensive "hubs", or Advanced Information Systems. Such systems, exemplified by the PDB and UniProt, are indispensable not only for structural biology but for many related fields of science. The categories of data resources described herein are applicable well beyond our usual scientific endeavors.
Project description:The disposal of nuclear waste represents a paramount concern for human safety, and the corrosion resistance of containers within the disposal environment is a critical factor in ensuring the integrity of such waste containment systems. In this report, the corrosion behavior of copper canisters was monitored in simulated Olkiluoto groundwater and in groundwater procured in South Korea, at different temperatures. Exposure of copper to the procured groundwater at 70 °C revealed a 3.7-fold increase in corrosion susceptibility compared with room-temperature conditions, with a current density of 12.7 μA/cm2. During a three-week immersion test in a controlled 70 °C chamber, the canister in the Korean groundwater maintained a constant weight. In contrast, its counterpart in the simulated groundwater showed continuous weight loss, indicating heightened corrosion. X-ray diffraction (XRD) analysis identified corrosion byproducts, specifically Cu2Cl3(OH) and calcite (CaCO3), in the simulated groundwater, confirming its corrosive nature. Initial impedance analysis revealed distinct differences: the Korean groundwater exhibits high pore resistance and diffusion effects, whereas the simulated groundwater shows low pore resistance. Consequently, copper canisters in the Korean environment are deemed relatively stable against corrosion because of significant differences in ion concentrations.
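For context, a corrosion current density can be converted to a penetration rate via Faraday's law in the form standardized by ASTM G102. The sketch below assumes divalent copper dissolution (Cu → Cu²⁺), which is an assumption for illustration, not a finding of the study:

```python
def corrosion_rate_mm_per_year(i_corr_uA_cm2: float,
                               equivalent_weight: float,
                               density_g_cm3: float) -> float:
    """Penetration rate from corrosion current density (ASTM G102 form).

    CR [mm/yr] = K1 * i_corr [uA/cm^2] * EW / rho [g/cm^3],
    with K1 = 3.27e-3 mm*g/(uA*cm*yr).
    """
    K1 = 3.27e-3
    return K1 * i_corr_uA_cm2 * equivalent_weight / density_g_cm3

# Copper, assuming divalent dissolution: EW = 63.55 / 2; density 8.96 g/cm^3.
rate = corrosion_rate_mm_per_year(12.7, 63.55 / 2, 8.96)
```

Under these assumptions, the reported 12.7 μA/cm² corresponds to a penetration rate of roughly 0.15 mm per year.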
Project description:Background: Clinical data repositories (CDRs) have great potential to improve outcome prediction and risk modeling. However, most clinical studies require careful study design, dedicated data-collection efforts, and sophisticated modeling techniques before a hypothesis can be tested. We aim to bridge this gap so that clinical domain users can perform first-hand prediction on existing repository data without complicated handling, and can obtain insightful patterns of imbalanced targets before a formal study is conducted. We specifically target interpretability for domain users, so that the model can be conveniently explained and applied in clinical practice. Methods: We propose an interpretable pattern model that tolerates the noise (missing values) found in practice data. To address the challenge of imbalanced targets of interest in clinical research, e.g., death rates of only a few percent, the geometric mean of sensitivity and specificity (G-mean) is employed as the optimization criterion, for which a simple but effective heuristic algorithm is developed. Results: We compared pattern discovery to clinically interpretable methods on two retrospective clinical datasets, containing 14.9% deaths within 1 year in the thoracic dataset and 9.1% deaths in the cardiac dataset. Despite the imbalance challenge evident for the other methods, pattern discovery consistently shows competitive cross-validated prediction performance. Compared to logistic regression, naïve Bayes, and decision trees, pattern discovery achieves statistically significantly (p-values < 0.01, Wilcoxon signed-rank test) better average testing G-means and F1-scores (harmonic mean of precision and sensitivity). Without requiring sophisticated technical processing of the data or tweaking, the prediction performance of pattern discovery is consistently comparable to the best achievable performance. Conclusions: Pattern discovery has proven robust and valuable for target prediction on existing clinical data repositories with imbalance and noise. The prediction results and interpretable patterns can provide insights in an agile and inexpensive way for potential formal studies.
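The two evaluation criteria named above, G-mean and F1-score, can be computed directly from confusion-matrix counts; a minimal sketch:

```python
import math

def g_mean(tp: int, fp: int, tn: int, fn: int) -> float:
    """Geometric mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)

def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and sensitivity (recall)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because the G-mean collapses to zero when either sensitivity or specificity is zero, optimizing it prevents the degenerate majority-class classifier that plain accuracy rewards on imbalanced targets such as rare deaths.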
Project description:Background: With the advent of inexpensive assay technologies, there has been unprecedented growth in genomics data, as well as in the number of databases in which it is stored. In these databases, sample annotation using ontologies and controlled vocabularies is becoming more common. However, the annotation is rarely available as Linked Data, in a machine-readable format, or for standardized queries using SPARQL. This makes large-scale reuse, or integration with other knowledge bases, very difficult. Methods: To address this challenge, we have developed the second generation of our eXframe platform, a reusable framework for creating online repositories of genomics experiments. This second generation now publishes Semantic Web data. To accomplish this, we created an experiment model that covers provenance, citations, external links, assays, biomaterials used in the experiment, and the data collected during the process. The elements of our model are mapped to classes and properties from established biomedical ontologies. Resource Description Framework (RDF) data is automatically produced using these mappings and indexed in an RDF store with a built-in SPARQL (SPARQL Protocol and RDF Query Language) endpoint. Conclusions: Using the open-source eXframe software, institutions and laboratories can create Semantic Web repositories of their experiments, integrate them with heterogeneous resources, and make them interoperable with the vast Semantic Web of biomedical knowledge.
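As an illustration of the standardized queries such a SPARQL endpoint enables, the sketch below builds a query over experiment records; the `ex:` prefix and its predicates are hypothetical placeholders, not eXframe's actual vocabulary:

```python
# A SPARQL query of the kind a repository's endpoint could answer.
# The ex: prefix and predicates below are hypothetical placeholders,
# not eXframe's actual vocabulary; dcterms: is the real Dublin Core
# terms namespace.
query = """
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX ex: <http://example.org/experiment#>

SELECT ?experiment ?title ?assay
WHERE {
  ?experiment a ex:Experiment ;
              dcterms:title ?title ;
              ex:hasAssay ?assay .
}
LIMIT 10
"""
```

Because the query addresses classes and properties rather than a specific database schema, the same pattern works against any endpoint that publishes its experiments with the same ontology terms, which is the interoperability payoff of Linked Data.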