Biodiversity sequence datasets published in EMBL repositories in 2023 (NFDI4Biodiversity)
ABSTRACT: In 2023, a total of 11 genome assemblies, 2 variation datasets and 14 studies were published in the EMBL ENA and EMBL EVA repositories by the IPK BIT group. Furthermore, 12,683 samples were registered in the EMBL BioSamples repository. The total data volume released in 2023 is 7.8 terabytes.
All datasets are linked via study ID.
Project description: Background: A considerable portion of primary biodiversity data is digitally locked inside published literature, which is often stored as PDF files. Large-scale approaches to biodiversity science could benefit from retrieving this information and making it digitally accessible and machine-readable. Nonetheless, the amount and diversity of digitally published literature pose many challenges for knowledge discovery and retrieval. Text mining has been extensively used for data discovery tasks in large quantities of documents. However, text mining approaches for knowledge discovery and retrieval have been limited in biodiversity science compared to other disciplines. New information: Here, we present a novel, open source text mining tool, the Biodiversity Observations Miner (BOM). This web application, written in R, allows the semi-automated discovery of punctual biodiversity observations (e.g. biotic interactions, functional or behavioural traits and natural history descriptions) associated with the scientific names present inside a corpus of scientific literature. Furthermore, BOM enables users to rapidly screen large quantities of literature based on word co-occurrences that match custom biodiversity dictionaries. This tool aims to increase the digital mobilisation of primary biodiversity data and is freely accessible via GitHub or through a web server.
Project description: The increasing amount of publicly available research data provides the opportunity to link and integrate data in order to create and prove novel hypotheses, to repeat experiments or to compare recent data to data collected at a different time or place. However, recent studies have shown that retrieving relevant data for data reuse is a time-consuming task in daily research practice. In this study, we explore what hampers dataset retrieval in biodiversity research, a field that produces a large amount of heterogeneous data. In particular, we focus on scholarly search interests and metadata, the primary source of data in a dataset retrieval system. We show that existing metadata currently reflect information needs poorly and are therefore the biggest obstacle in retrieving relevant data. Our findings indicate that, for data seekers in the biodiversity domain, the important information categories are environments, materials and chemicals, species, biological and chemical processes, locations, data parameters and data types. These interests are well covered in metadata elements of domain-specific standards. However, instead of utilizing these standards, large data repositories tend to use metadata standards with domain-independent metadata fields that cover search interests only to some extent. A second problem is the arbitrary keywords used in descriptive fields such as title, description or subject. Keywords support scholars in a full text search only if the provided terms syntactically match the terms used in a user query or their semantic relationship to those terms is known.
Project description: Sharing of experimental clinical research data usually happens between individuals or research groups rather than via public repositories, in part due to the need to protect research participant privacy. This approach to data sharing makes it difficult to connect journal articles with their underlying datasets and is often insufficient for ensuring access to data in the long term. Voluntary data sharing services such as the Yale Open Data Access (YODA) and Clinical Study Data Request (CSDR) projects have increased accessibility to clinical datasets for secondary uses while protecting patient privacy and the legitimacy of secondary analyses, but these resources are generally disconnected from journal articles, where researchers typically search for reliable information to inform future research. New scholarly journal and article types dedicated to increasing accessibility of research data have emerged in recent years and, in general, journals are developing stronger links with data repositories. There is a need for increased collaboration between journals, data repositories, researchers, funders, and voluntary data sharing services to increase the visibility and reliability of clinical research. Using the journal Scientific Data as a case study, we propose and show examples of changes to the format and peer-review process for journal articles to more robustly link them to data that are only available on request. We also propose additional features for data repositories to better accommodate non-public clinical datasets, including Data Use Agreements (DUAs).
Project description: Tracking the state of biodiversity over time is critical to successful conservation, but conventional monitoring schemes tend to be insufficient to adequately quantify how species' abundances and distributions are changing. One solution to this issue is to leverage data generated by citizen scientists, who collect vast quantities of data at temporal and spatial scales that cannot be matched by most traditional monitoring methods. However, the quality of citizen science data can vary greatly. In this paper, we develop three metrics (inventory completeness, range completeness, spatial bias) to assess the adequacy of spatial observation data. We explore the adequacy of citizen science data at the species level for Australia's terrestrial native birds and then model these metrics against a suite of seven species traits (threat status, taxonomic uniqueness, body mass, average count, range size, species density, and human population density) to identify predictors of data adequacy. We find that citizen science data adequacy for Australian birds is increasing across two of our metrics (inventory completeness and range completeness), but not spatial bias, which has worsened over time. Relationships between the three metrics and seven traits we modelled were variable, with only two traits having consistently significant relationships across the three metrics. Our results suggest that although citizen science data adequacy has generally increased over time, there are still gaps in the spatial adequacy of citizen science for monitoring many Australian birds. Despite these gaps, citizen science can play an important role in biodiversity monitoring by providing valuable baseline data that may be supplemented by information collected through other methods.
We believe the metrics presented here constitute an easily applied approach to assessing the utility of citizen science datasets for biodiversity analyses, allowing researchers to identify and prioritise regions or species with lower data adequacy that will benefit most from targeted monitoring efforts.
Project description: The Human Microbiome Project (HMP) has made it possible to test important ecological theories for arguably the most important ecosystem to human health: the human microbiome. The limited number of existing studies has reported conflicting evidence in the case of the neutral theory; the present study aims to comprehensively test the neutral theory with extensive HMP datasets covering all five major body sites inhabited by the human microbiome. Utilizing 7437 datasets of bacterial community samples, we discovered that only 49 communities (less than 1%) satisfied the neutral theory, and concluded that human microbial communities are not neutral in general. The 49 positive cases, although only a tiny minority, do demonstrate the existence of neutral processes. We realize that the traditional doctrine of microbial biogeography, "Everything is everywhere, but the environment selects", first proposed by Baas-Becking, resolves the apparent contradiction. The first part of the Baas-Becking doctrine states that microbes are not dispersal-limited and are therefore neutral-prone, and the second part reiterates that the freely dispersed microbes must endure selection by the environment. Therefore, in most cases, it is the host environment that ultimately shapes the community assembly and tips the human microbiome toward a niche regime.
Project description: With next-generation sequencing, the genomic data available for the characterization of integration sites (IS) has dramatically increased. At present, in a single experiment, several thousand viral integration genome targets can be investigated to define genomic hot spots. In a previous article, we reformulated a formal common integration site (CIS) analysis, based on a rigid fixed-window demarcation, into a more flexible definition based on graphs. Here, we present a selection of supporting data related to the graph-based framework (GBF) from our previous article, in which a collection of CIS was identified on six published datasets. In this work, we focus on two datasets, ISRTCGD and ISHIV, which have been previously discussed. Moreover, we show in more detail the workflow design that produced the datasets.