Project description:Despite recent and growing interest in using Twitter to examine human behavior and attitudes, there is still significant room for growth in leveraging Twitter data for social science research. In particular, gleaning demographic information about Twitter users, a key component of much social science research, remains a challenge. This article develops an accurate and reliable data processing approach for social science researchers interested in using Twitter data to examine behaviors and attitudes, as well as the demographic characteristics of the populations expressing or engaging in them. Using information gathered from Twitter users who state an intention not to vote in the 2012 presidential election, we describe a method for processing data to retrieve demographic information reported by users that is not encoded as text (e.g., details of images) and evaluate the reliability of these techniques. We end by assessing the challenges of this data collection strategy and discussing how large-scale social media data may benefit demographic researchers.
Project description:An important hallmark of science is the transparency and reproducibility of scientific results. Over the last few years, internet-based technologies have emerged that allow for a representation of the scientific process that goes far beyond traditional methods and analysis descriptions. Using these often freely available tools requires a suite of skills that is not necessarily part of a curriculum in the life sciences. However, funders, journals, and policy makers increasingly require researchers to ensure complete reproducibility of their methods and analyses. To close this gap, we designed an introductory course that guides students towards a reproducible science workflow. Here, we outline the course content and possible extensions, report encountered challenges, and discuss how to integrate such a course in existing curricula.
Project description:Background: Research projects often involve observation, registration, and data processing starting from information obtained in field experiments. In many cases, these tasks are carried out by several people in different places, at different times, and in different ways, adding complexity and error to data collection. Furthermore, data processing can be time-consuming, and input errors may produce unwanted results. Results: We have developed novel, open source software called Phenobook, an easy, flexible, and intuitive tool to organize, collect, and save experimental data for further analyses. Phenobook was conceived to collect phenotypic observations in a user-friendly, cost-effective way. It consists of web-based software for experiment design, data input, visualization, and export, combined with a mobile application for remote data collection. We provide in this article a detailed description of the developed tool. Conclusion: Phenobook is a software tool that can be easily implemented in collaborative research and development projects involving data collection and downstream analyses. Adopting Phenobook is expected to improve the involved processes by minimizing input errors, resulting in higher quality and reliability of the research outcomes.
Project description:Despite the increased access to scientific publications and data as a result of open science initiatives, access to scientific tools remains limited. Uncrewed aerial vehicles (UAVs, or drones) can be a powerful tool for research in disciplines such as agriculture and environmental sciences, but their use in research is currently dominated by proprietary, closed source tools. The objective of this work was to collect, curate, organize, and test a set of open source tools for aerial data capture for research purposes. The Open Science Drone Toolkit was built through a collaborative and iterative process by more than 100 people in five countries, and comprises an open-hardware autonomous drone and off-the-shelf hardware, open-source software, and guides and protocols that enable the user to perform all the necessary tasks to obtain aerial data. Data obtained with this toolkit over a wheat field were compared to data from satellite imagery and a commercial hand-held sensor, with a high correlation found for both instruments. Our results demonstrate the possibility of capturing research-grade aerial data with open workflows, using affordable, accessible, and customizable open source software and hardware.
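The validation step described above, comparing plot-level readings from the drone toolkit against satellite and hand-held instruments, amounts to computing a correlation between paired measurements. A minimal sketch of that comparison is shown below; the NDVI values and variable names are hypothetical, not taken from the study.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation between two paired sets of plot-level readings."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    a_c, b_c = a - a.mean(), b - b.mean()
    return float((a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum()))

# Hypothetical plot-level NDVI readings over the same wheat field,
# one series per instrument (illustrative values only).
drone_ndvi = [0.61, 0.55, 0.72, 0.48, 0.66]
satellite_ndvi = [0.58, 0.53, 0.70, 0.50, 0.64]

print(round(pearson_r(drone_ndvi, satellite_ndvi), 3))
```

In practice the same comparison would be repeated against the hand-held sensor, and a correlation close to 1 would indicate that the open toolkit recovers the same spatial signal as the commercial instruments.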
Project description:As research in smart homes and activity recognition increases, it is ever more important to have benchmark systems and data upon which researchers can compare methods. While synthetic data can be useful for certain method developments, real data sets that are open and shared are equally important. This paper presents the E-care@home system, its installation in a real home setting, and a series of data sets that were collected using the E-care@home system. Our first contribution, the E-care@home system, is a collection of software modules for data collection, labeling, and various reasoning tasks such as activity recognition, person counting, and configuration planning. It supports a heterogeneous set of sensors that can be extended easily and connects collected sensor data to higher-level Artificial Intelligence (AI) reasoning modules. Our second contribution is a series of open data sets which can be used to recognize activities of daily living. In addition to these data sets, we describe the technical infrastructure that we have developed to collect the data and the physical environment. Each data set is annotated with ground-truth information, making it relevant for researchers interested in benchmarking different algorithms for activity recognition.
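The key property of the data sets described above is that raw sensor events are paired with ground-truth activity annotations. A minimal sketch of such a record layout is shown below; the field names, sensor identifiers, and labels are our own illustrative assumptions, not the actual E-care@home schema.

```python
from dataclasses import dataclass

@dataclass
class SensorEvent:
    """One annotated smart-home sensor reading (hypothetical layout)."""
    timestamp: float      # seconds since start of recording
    sensor_id: str        # e.g. "kitchen_motion_01"
    value: float          # raw sensor reading
    activity_label: str   # ground-truth annotation, e.g. "preparing_meal"

events = [
    SensorEvent(0.0, "kitchen_motion_01", 1.0, "preparing_meal"),
    SensorEvent(4.2, "stove_power_01", 230.5, "preparing_meal"),
    SensorEvent(9.7, "sofa_pressure_01", 1.0, "watching_tv"),
]

# Ground-truth labels are what make these data sets usable as benchmarks:
# any activity-recognition algorithm's predictions can be scored against them.
labels = sorted({e.activity_label for e in events})
print(labels)
```

This pairing of heterogeneous sensor streams with annotations is what allows different recognition algorithms to be compared on the same footing.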
Project description:The science of science has attracted growing research interest, partly due to the increasing availability of large-scale datasets capturing the inner workings of science. These datasets, and the numerous linkages among them, enable researchers to ask a range of fascinating questions about how science works and where innovation occurs. Yet as datasets grow, it becomes increasingly difficult to track available sources and linkages across datasets. Here we present SciSciNet, a large-scale open data lake for the science of science research, covering over 134M scientific publications and millions of external linkages to funding and public uses. We offer detailed documentation of pre-processing steps and analytical choices in constructing the data lake. We further supplement the data lake by computing frequently used measures in the literature, illustrating how researchers may contribute collectively to enriching the data lake. Overall, this data lake serves as an initial but useful resource for the field, by lowering the barrier to entry, reducing duplication of efforts in data processing and measurements, improving the robustness and replicability of empirical claims, and broadening the diversity and representation of ideas in the field.
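The external linkages described above are, at their core, relational joins between a publication table and linkage tables. A minimal sketch of that pattern is shown below; the miniature tables and column names are hypothetical, not the actual SciSciNet schema.

```python
import pandas as pd

# Hypothetical miniature tables in the spirit of a science-of-science
# data lake: core publication records plus a linkage table to funding.
papers = pd.DataFrame({
    "paper_id": [1, 2, 3],
    "year": [2018, 2019, 2020],
})
funding_links = pd.DataFrame({
    "paper_id": [1, 1, 3],
    "grant_id": ["G-10", "G-11", "G-42"],
})

# A left join keeps every publication and attaches zero or more grants,
# mirroring how linkage tables let researchers enrich the core records
# without duplicating the underlying data processing.
linked = papers.merge(funding_links, on="paper_id", how="left")
print(linked)
```

The same join pattern extends to other linkage tables (e.g. public uses), which is what lets derived measures be computed once and shared.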
Project description:Within this perspective article, we intend to summarise definitions and terms that are often used in the context of open science and data-driven R&D, and we discuss upcoming European regulations concerning data, data sharing, and data handling. With this background in hand, we take a closer look at the potential connections and permeable interfaces of open science and the digital economy, in which data and resulting immaterial goods can become vital, tradeable items. We believe that both science and the digital economy can profit from a seamless transition and foresee that the scientific outcomes of publicly funded research can be better exploited. To close the gap between open science and the digital economy, and to balance the interests of data producers, data consumers, an economy built around services, and the public, we introduce the concept of generic research data management plans (RDMs), which have in part been developed through a community effort and have been evaluated by academic and industry members of the NFDI4Cat consortium. We believe that in data-driven research, RDMs need to become a vital element of publicly funded projects.
Project description:The Health Informatics Centre at the University of Dundee provides a service to securely host clinical datasets and extract relevant data for anonymized cohorts to researchers to enable them to answer key research questions. As is common in research using routine healthcare data, the service was historically delivered using ad-hoc processes, resulting in the slow provision of data whose provenance was often hidden from the researchers using it. This paper describes the development and evaluation of the Research Data Management Platform (RDMP): an open source tool to load, manage, clean, and curate longitudinal healthcare data for research and provide reproducible and updateable datasets for defined cohorts to researchers. Between 2013 and 2017, RDMP tool implementation tripled the productivity of data analysts producing data releases for researchers from 7.1 to 25.3 per month and reduced the error rate from 12.7% to 3.1%. The effort on data management reduced from a mean of 24.6 to 3.0 hours per data release. The waiting time for researchers to receive data after agreeing a specification reduced from approximately 6 months to less than 1 week. The software is scalable and currently manages 163 datasets. A total of 1,321 data extracts for research have been produced, with the largest extract linking data from 70 different datasets. The tools and processes that encompass the RDMP not only fulfil the research data management requirements of researchers but also support the seamless collaboration of data cleaning, data transformation, data summarization, and data quality assessment activities by different research groups.
Project description:Objective: The purpose of this research is to identify how data science is applied in the suicide prevention literature, describe the current landscape of this literature, and highlight areas where data science may be useful for future injury prevention research. Design: We conducted a literature review of injury prevention and data science in April 2020 and January 2021 in three databases. Methods: For the 99 included articles, we extracted the following: (1) author(s) and year; (2) title; (3) study approach; (4) reason for applying the data science method; (5) data science method type; (6) study description; (7) data source; and (8) focus on a disproportionately affected population. Results: The literature on data science and suicide more than doubled from 2019 to 2020, with individual-level approaches more prevalent than population-level approaches. Most population-level articles applied data science methods to describe outcomes (n=10), while most individual-level articles identified risk factors (n=27). Machine learning was the most common data science method applied in the studies (n=48). A wide array of data sources was used for suicide research, with most articles (n=45) using social media and web-based behaviour data. Eleven studies demonstrated the value of applying data science to suicide prevention literature for disproportionately affected groups. Conclusion: Data science techniques proved to be effective tools for describing suicidal thoughts or behaviour, identifying individual risk factors, and predicting outcomes. Future research should focus on identifying how data science can be applied in other injury-related topics.
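The eight-field extraction protocol described in the Methods is essentially a fixed coding schema applied to each article. A minimal sketch of such a schema is shown below; the class, field names, and the example entry are our own illustration, not the authors' actual coding instrument.

```python
from dataclasses import dataclass

@dataclass
class ReviewRecord:
    """One coded article, mirroring the eight extracted fields (hypothetical names)."""
    authors_year: str                        # (1) author(s) and year
    title: str                               # (2) title
    study_approach: str                      # (3) e.g. "individual-level" or "population-level"
    reason_for_method: str                   # (4) reason for applying the method
    data_science_method: str                 # (5) e.g. "machine learning"
    study_description: str                   # (6) brief study description
    data_source: str                         # (7) e.g. "social media and web-based behaviour"
    disproportionately_affected_focus: bool  # (8) focus on a disproportionately affected population

# Entirely hypothetical example entry, for illustration only.
record = ReviewRecord(
    authors_year="Doe et al., 2020",
    title="Predicting suicidal ideation from social media posts",
    study_approach="individual-level",
    reason_for_method="identify risk factors",
    data_science_method="machine learning",
    study_description="classifier trained on posting history",
    data_source="social media and web-based behaviour data",
    disproportionately_affected_focus=False,
)
print(record.study_approach)
```

Coding each of the 99 articles into a structure like this is what makes the tallies in the Results (e.g. n=48 machine-learning studies) straightforward to compute.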