Project description: Background: Metadata describe and provide context for other data, playing a pivotal role in enabling the findability, accessibility, interoperability, and reusability (FAIR) data principles. By providing comprehensive and machine-readable descriptions of digital resources, metadata empower both machines and human users to seamlessly discover, access, integrate, and reuse data or content across diverse platforms and applications. However, the limited accessibility and machine-interpretability of existing metadata for population health data hinder effective data discovery and reuse. Objective: To address these challenges, we propose a comprehensive framework that uses standardized formats, vocabularies, and protocols to render population health data machine-readable, significantly enhancing their FAIRness and enabling seamless discovery, access, and integration across diverse platforms and research applications. Methods: The framework implements a 3-stage approach. The first stage is Data Documentation Initiative (DDI) integration, which leverages the DDI Codebook metadata standard to document detailed information for data and associated assets while ensuring transparency and comprehensiveness. The second stage is Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) standardization, in which the data are harmonized and standardized into the OMOP CDM, facilitating unified analysis across heterogeneous data sets. The third stage integrates Schema.org and JavaScript Object Notation for Linked Data (JSON-LD): machine-readable metadata are generated using Schema.org entities and embedded within the data using JSON-LD, boosting discoverability and comprehension for both machines and human users. We demonstrated the implementation of these 3 stages using the Integrated Disease Surveillance and Response (IDSR) data from Malawi and Kenya. Results: The implementation of our framework significantly enhanced the FAIRness of population health data, resulting in improved discoverability through seamless integration with platforms such as Google Dataset Search. The adoption of standardized formats and protocols streamlined data accessibility and integration across various research environments, fostering collaboration and knowledge sharing. Additionally, machine-interpretable metadata empowered researchers to efficiently reuse data for targeted analyses and insights, thereby maximizing the overall value of population health resources. The JSON-LD code is accessible via a GitHub repository, and the HTML code integrated with JSON-LD is available on the Implementation Network for Sharing Population Information from Research Entities website. Conclusions: The adoption of machine-readable metadata standards is essential for ensuring the FAIRness of population health data. By embracing these standards, organizations can enhance the visibility, accessibility, and utility of diverse resources, leading to broader impact, particularly in low- and middle-income countries. Machine-readable metadata can accelerate research, improve health care decision-making, and ultimately promote better health outcomes for populations worldwide.
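To make the third stage concrete, here is a minimal sketch of Schema.org metadata expressed as JSON-LD and wrapped in the script tag that crawlers such as Google Dataset Search index. All values (dataset name, description, URLs, creator) are placeholders, not the actual IDSR metadata published by the project.

```python
import json

# Minimal Schema.org Dataset metadata (placeholder values, not the
# actual IDSR records described in the project above).
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Integrated Disease Surveillance and Response data (example)",
    "description": "Weekly aggregated case counts for notifiable diseases.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "creator": {"@type": "Organization", "name": "Example Health Institute"},
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/idsr/idsr_weekly.csv",
    },
}

# Embed the metadata in an HTML page so dataset crawlers can discover it.
script_tag = (
    '<script type="application/ld+json">\n'
    + json.dumps(dataset, indent=2)
    + "\n</script>"
)
print(script_tag)
```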
Project description: Motivation: Understanding the structure of sequenced fragments from genomics libraries is essential for accurate read preprocessing. Currently, different assays and sequencing technologies require custom scripts and programs that do not leverage the common structure of sequence elements present in genomics libraries. Results: We present seqspec, a machine-readable specification for libraries produced by genomics assays that facilitates standardization of preprocessing and enables tracking and comparison of genomics assays. Availability and implementation: The specification and the associated seqspec command line tool are available at https://www.doi.org/10.5281/zenodo.10213865.
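For intuition, here is a hypothetical, simplified sketch of what a machine-readable library specification enables: fixed-length sequence regions declared in YAML can be turned directly into slice coordinates for preprocessing. The field names below are invented for illustration and do not follow the actual seqspec schema.

```python
import yaml  # PyYAML

# A hypothetical, simplified library-structure specification; the
# field names here are illustrative and are NOT the seqspec schema.
SPEC = """
read1:
  - {name: cell_barcode, length: 16}
  - {name: umi, length: 12}
read2:
  - {name: cdna, length: 90}
"""

def region_slices(regions):
    """Turn an ordered list of fixed-length regions into 0-based slices."""
    slices, offset = {}, 0
    for region in regions:
        slices[region["name"]] = (offset, offset + region["length"])
        offset += region["length"]
    return slices

spec = yaml.safe_load(SPEC)
print(region_slices(spec["read1"]))
# {'cell_barcode': (0, 16), 'umi': (16, 28)}
```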
Project description: Community-developed minimum information checklists are designed to drive the rich and consistent reporting of metadata, underpinning the reproducibility and reuse of the data. These reporting guidelines, however, are usually in the form of narratives intended for human consumption. Modular and reusable machine-readable versions are also needed: first, to provide the quantitative and verifiable measures of the degree to which metadata descriptors meet these community requirements, as the FAIR Principles demand; and second, to encourage the creation of standards-driven templates for metadata authoring, especially when describing complex experiments that require multiple reporting guidelines to be used in combination or extended. We present new functionalities to support the creation and improvement of machine-readable models, apply the approach to an exemplar set of reporting guidelines in the life sciences, and discuss the challenges. Our work, targeted at developers of standards and those familiar with standards, promotes the concept of compositional metadata elements and encourages the creation of community standards that are modular and interoperable from the outset.
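A minimal sketch of the idea, assuming a hypothetical checklist model: once a guideline is machine-readable, compliance becomes a computable, verifiable quantity rather than a judgment call. The field names below are invented for illustration and are not drawn from any published reporting guideline.

```python
# A hypothetical machine-readable rendering of a minimum information
# checklist; field names are illustrative, not from a real guideline.
CHECKLIST = {
    "name": "Minimal Sample Checklist (example)",
    "fields": [
        {"name": "organism",  "required": True},
        {"name": "tissue",    "required": True},
        {"name": "treatment", "required": False},
    ],
}

def compliance(record, checklist):
    """Fraction of required fields present: a quantitative, verifiable
    measure of how far a metadata record meets the guideline."""
    required = [f["name"] for f in checklist["fields"] if f["required"]]
    present = [name for name in required if record.get(name)]
    return len(present) / len(required), sorted(set(required) - set(present))

score, missing = compliance({"organism": "Mus musculus"}, CHECKLIST)
print(score, missing)  # 0.5 ['tissue']
```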
Project description: Audio descriptions (AD) make videos accessible to people who cannot see them, but many videos lack AD and remain inaccessible because traditional approaches involve expensive professional production. We aim to lower production costs by involving novices in this process. We present an AD authoring system that supports novices in writing scene descriptions (SD), textual descriptions of video scenes, and converting them into AD via text-to-speech. The system combines video scene recognition and natural language processing to review novice-written SD and automatically feeds back what to mention. To assess the effectiveness of this automatic feedback in supporting novices, we recruited 60 participants to author SD with no feedback, human feedback, and automatic feedback. Our study shows that automatic feedback improves the descriptiveness, objectiveness, and learning quality of SD without affecting qualities such as sufficiency and clarity. Though human feedback remains more effective, automatic feedback can reduce production costs by 45%.
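A toy sketch of the feedback mechanism, under the assumption that scene recognition yields a set of salient entities: the system can flag entities the novice's SD fails to mention. The entity list and the substring matching below are placeholders, far simpler than the paper's actual models.

```python
# Toy sketch of automatic feedback: compare scene-recognition output
# with a novice-written scene description (SD). The detected entities
# and the matching rule are placeholders, not the paper's models.
detected = {"woman", "bicycle", "park"}
sd = "A woman rides through a sunny park."

mentioned = {entity for entity in detected if entity in sd.lower()}
missing = detected - mentioned
if missing:
    print("Consider mentioning:", ", ".join(sorted(missing)))
# Consider mentioning: bicycle
```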
Project description: Publishing studies using standardized, machine-readable formats will enable machines to perform meta-analyses on demand. To build a semantically enhanced technology that embodies these functions, we developed the Cooperation Databank (CoDa), a databank that contains 2,636 studies on human cooperation (1958-2017) conducted in 78 societies involving 356,283 participants. Experts annotated these studies along 312 variables, including the quantitative results (13,959 effects). We designed an ontology that defines and relates concepts in cooperation research and that can represent the relationships between results of correlational and experimental studies. We have created a research platform that, given the data set, enables users to retrieve studies that test the relation of variables with cooperation, visualize these study results, and perform (a) meta-analyses, (b) metaregressions, (c) estimates of publication bias, and (d) statistical power analyses for future studies. We leveraged the data set with visualization tools that allow users to explore the ontology of concepts in cooperation research and to plot a citation network of the history of studies. CoDa offers a vision of how publishing studies in a machine-readable format can establish institutions and tools that improve scientific practices and knowledge.
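As an illustration of the kind of on-demand analysis such a databank supports, the sketch below pools hypothetical effect sizes with standard inverse-variance (fixed-effect) weighting; it is not CoDa's actual implementation or API, and the numbers are invented.

```python
import math

# Hypothetical study results as (effect size, sampling variance) pairs.
# These numbers are made up for illustration, not drawn from CoDa.
effects = [(0.30, 0.02), (0.12, 0.01), (0.45, 0.05)]

# Inverse-variance (fixed-effect) pooling: w_i = 1/v_i,
# pooled = sum(w_i * y_i) / sum(w_i), var(pooled) = 1 / sum(w_i).
weights = [1.0 / v for _, v in effects]
pooled = sum(w * y for (y, _), w in zip(effects, weights)) / sum(weights)
se = math.sqrt(1.0 / sum(weights))

print(f"pooled effect = {pooled:.3f}, 95% CI = "
      f"[{pooled - 1.96 * se:.3f}, {pooled + 1.96 * se:.3f}]")
```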
Project description: Cholesterol biosynthesis serves as a central metabolic hub for numerous biological processes in health and disease. A detailed, integrative single-view description of how the cholesterol pathway is structured and how it interacts with other pathway systems is lacking in the existing literature. Here we provide a systematic review of the existing literature and present a detailed pathway diagram that describes the cholesterol biosynthesis pathway (the mevalonate, the Kandutsch-Russell, and the Bloch pathways) and the shunt pathway that leads to 24(S),25-epoxycholesterol synthesis. The diagram has been produced using the Systems Biology Graphical Notation (SBGN) and is available in the SBGN-ML format, a human-readable and semantically machine-parsable open community file format.
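Because SBGN-ML is plain XML, the diagram can be parsed with standard tooling. The sketch below reads glyph labels from a minimal fragment, assuming the libSBGN 0.2 namespace; the fragment itself is invented for illustration and is far smaller than the published map.

```python
import xml.etree.ElementTree as ET

# Minimal SBGN-ML fragment for illustration; real pathway files are far
# larger. The namespace below is libSBGN 0.2; adjust it if the
# downloaded file uses a different SBGN-ML version.
SBGN_ML = """<sbgn xmlns="http://sbgn.org/libsbgn/0.2">
  <map language="process description">
    <glyph id="g1" class="simple chemical">
      <label text="HMG-CoA"/>
      <bbox x="0" y="0" w="60" h="30"/>
    </glyph>
    <glyph id="g2" class="simple chemical">
      <label text="mevalonate"/>
      <bbox x="120" y="0" w="60" h="30"/>
    </glyph>
  </map>
</sbgn>"""

NS = {"s": "http://sbgn.org/libsbgn/0.2"}
root = ET.fromstring(SBGN_ML)
for glyph in root.findall(".//s:glyph", NS):
    label = glyph.find("s:label", NS)
    print(glyph.get("class"), "->", label.get("text"))
# simple chemical -> HMG-CoA
# simple chemical -> mevalonate
```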
Project description: As the volume and complexity of data sets archived at NCBI grow rapidly, so does the need to gather and organize the associated metadata. Although metadata had previously been collected for some archival databases, there was no centralized approach at NCBI for collecting this information and using it across databases. The BioProject database was recently established to facilitate organization and classification of project data submitted to NCBI, EBI and DDBJ databases. It captures descriptive information about research projects that result in high volume submissions to archival databases, ties together related data across multiple archives and serves as a central portal by which to inform users of data availability. Concomitantly, the BioSample database is being developed to capture descriptive information about the biological samples investigated in projects. BioProject and BioSample records link to corresponding data stored in archival repositories. Submissions are supported by a web-based Submission Portal that guides users through a series of forms for input of rich metadata describing their projects and samples. Together, these databases offer improved ways for users to query, locate, integrate and interpret the masses of data held in NCBI's archival repositories. The BioProject and BioSample databases are available at http://www.ncbi.nlm.nih.gov/bioproject and http://www.ncbi.nlm.nih.gov/biosample, respectively.
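A sketch of programmatic access: BioProject is searchable through NCBI's E-utilities, as below. The query term is a placeholder, and production code should supply an API key and respect NCBI's rate limits.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Search BioProject through NCBI E-utilities (esearch). The query term
# is a placeholder; swap in your own search, and add an api_key
# parameter for higher rate limits.
params = urlencode({
    "db": "bioproject",
    "term": "transcriptome AND homo sapiens[Organism]",
    "retmode": "json",
    "retmax": 5,
})
url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?{params}"

with urlopen(url) as response:
    result = json.load(response)["esearchresult"]

print(result["count"], "matching BioProjects; first IDs:", result["idlist"])
```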
Project description: The design and use of a metadata-driven data repository for research data management are described. Metadata is collected automatically during the submission process whenever possible and is registered with DataCite in accordance with their current metadata schema, in exchange for a persistent digital object identifier. Two examples of data preview are illustrated, including the demonstration of a method for integration with commercial software that confers rich domain-specific data analytics without introducing customisation into the repository itself.
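As a sketch of the DOI registration step, the snippet below drafts a DOI against DataCite's test REST API. The repository ID, prefix, credentials, and metadata values are all placeholders, and the attribute names should be checked against the current DataCite metadata schema before use.

```python
import requests

# Draft a DOI via the DataCite REST API (test endpoint). Credentials,
# repository ID, and prefix below are placeholders.
payload = {
    "data": {
        "type": "dois",
        "attributes": {
            "prefix": "10.80000",  # placeholder prefix
            "titles": [{"title": "Example research data set"}],
            "creators": [{"name": "Doe, Jane"}],
            "publisher": "Example University",
            "publicationYear": 2024,
            "types": {"resourceTypeGeneral": "Dataset"},
            "url": "https://repository.example.org/records/123",
        },
    }
}

response = requests.post(
    "https://api.test.datacite.org/dois",
    json=payload,
    auth=("EXAMPLE.REPO", "password"),  # placeholder credentials
    headers={"Content-Type": "application/vnd.api+json"},
)
print(response.status_code, response.json()["data"]["id"])
```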
Project description: The ecosystem management actions taxonomy (EMAT) consists of actions taken by humans and wildlife that affect an ecosystem. Here, I present a protocol for discovering machine-readable entities of the EMAT. I describe steps for acquiring stories from online locations, collecting them into a story file, and processing them through a software package to extract those actions that match EMAT taxa. I then detail procedures for using the story file to learn new EMAT taxa.
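A hypothetical sketch of the extraction step: taxa mapped to trigger phrases can be matched against story text. The taxa and phrases below are invented for illustration and are not the published EMAT taxa or the protocol's actual software package.

```python
import re

# Hypothetical EMAT-style taxa mapped to trigger phrases; these terms
# are illustrative, not the published taxonomy.
TAXA = {
    "prescribed burning": ["prescribed burn", "controlled burn"],
    "reforestation": ["tree planting", "replanted", "reforestation"],
    "dam building": ["built a dam", "beaver dam"],
}

def match_taxa(story, taxa=TAXA):
    """Return the taxa whose trigger phrases appear in a story."""
    hits = set()
    for taxon, phrases in taxa.items():
        if any(re.search(re.escape(p), story, re.IGNORECASE) for p in phrases):
            hits.add(taxon)
    return sorted(hits)

story = "Beavers built a dam upstream, and crews replanted the burned slope."
print(match_taxa(story))  # ['dam building', 'reforestation']
```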
Project description: The emergence of nanoinformatics as a key component of nanotechnology and nanosafety assessment, used to predict the properties, interactions, and hazards of engineered nanomaterials (NMs) and to support grouping and read-across that reduce reliance on animal testing, has put the spotlight firmly on the need for access to high-quality, curated datasets. To date, the focus has been on what constitutes data quality and completeness, on the development of minimum reporting standards, and on the FAIR (findable, accessible, interoperable, and reusable) data principles. However, moving from the theoretical realm to practical implementation requires human intervention, which will be facilitated by the definition of clear roles and responsibilities across the complete data lifecycle and a deeper appreciation of what metadata is and how to capture and index it. Here, we demonstrate, using specific worked case studies, how to organise the nano-community's efforts to define metadata schemas by organising the data management cycle as a joint effort of all players (data creators, analysts, curators, managers, and customers) supervised by the newly defined role of data shepherd. We propose that once researchers understand their tasks and responsibilities, they will naturally apply the available tools. Two case studies are presented (modelling of particle agglomeration for dose metrics, and consensus for NM dissolution), along with a survey of the metadata schemas currently implemented in existing nanosafety databases. We conclude by offering recommendations on the steps forward and the workflows needed for metadata capture to ensure FAIR nanosafety data.
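A minimal sketch, assuming an invented schema: a metadata record for an NM dissolution experiment that carries provenance fields for each role in the data lifecycle, including the proposed data shepherd. No field name here is taken from an existing nanosafety database.

```python
# A hypothetical minimal metadata record for an NM dissolution
# experiment; every field name is illustrative, not from a real schema.
record = {
    "material": {"name": "ZnO nanoparticles", "core_size_nm": 30},
    "assay": {"endpoint": "dissolution", "medium": "simulated lung fluid"},
    "provenance": {
        "data_creator": "lab A",
        "data_curator": "curator B",
        "data_shepherd": "shepherd C",  # the supervisory role proposed above
    },
}

REQUIRED = [("material", "name"), ("assay", "endpoint"),
            ("provenance", "data_shepherd")]

missing = [f"{sec}.{key}" for sec, key in REQUIRED
           if not record.get(sec, {}).get(key)]
print("complete" if not missing else f"missing: {missing}")
```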