ABSTRACT: This synthetic dataset contains genetic data for 1,008,000 individuals and 9 continuous phenotypic traits with various genetic architectures. The dataset includes 6 ancestry groups (AFR, AMR, CSA, EAS, EUR, MID) and over 6.8 million single nucleotide polymorphisms (SNPs) across 22 chromosomes. The data was generated using the HAPNEST software (https://github.com/intervene-EU-H2020/synthetic_data) developed by members of the INTERVENE consortium (https://www.interveneproject.eu/). This software has been specifically designed to enable efficient, large-scale synthetic data generation for common genetic variants and complex phenotypic traits. We have open-sourced this software so that anyone can easily generate their own synthetic datasets; please see the linked GitHub repository for further details. The reference dataset used to generate this synthetic dataset is the combined 1000 Genomes Project and Human Genome Diversity Project dataset downloaded from https://gnomad.broadinstitute.org/downloads. The data was preprocessed by retaining SNPs that have a non-zero minor allele frequency (MAF) in all populations and whose rsIDs could be successfully aligned, resulting in over 6.8 million variants across 22 chromosomes.
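Purely as an illustration of that preprocessing step, the sketch below keeps only SNPs with a non-zero MAF in every ancestry group and a resolvable rsID; the input table, its column names and the file paths are assumptions for the example, not part of the HAPNEST pipeline.

```python
# Minimal sketch (assumed input format): keep SNPs with a non-zero MAF in
# every population and a resolvable rsID. Column names are hypothetical.
import csv

POPULATIONS = ["AFR", "AMR", "CSA", "EAS", "EUR", "MID"]

def keep_variant(row):
    """Return True if the SNP passes both preprocessing criteria."""
    has_rsid = row["rsid"].startswith("rs")
    nonzero_maf = all(float(row[f"maf_{pop}"]) > 0.0 for pop in POPULATIONS)
    return has_rsid and nonzero_maf

with open("variant_frequencies.tsv") as fin, open("retained_snps.txt", "w") as fout:
    for row in csv.DictReader(fin, delimiter="\t"):
        if keep_variant(row):
            fout.write(row["rsid"] + "\n")
```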
Project description: Quality control of genomic data is an essential but complicated multi-step procedure, often requiring separate installation of, and expert familiarity with, a combination of different bioinformatics tools. Software incompatibilities and inconsistencies across computing environments are recurrent challenges, leading to poor reproducibility. Existing semi-automated or automated solutions lack comprehensive quality checks, flexible workflow architecture, and user control. To address these challenges, we have developed snpQT: a scalable, stand-alone software pipeline using Nextflow and BioContainers for comprehensive, reproducible and interactive quality control of human genomic data. snpQT offers some 36 discrete quality filters or correction steps in a complete standardised pipeline, producing graphical reports to demonstrate the state of the data before and after each quality control procedure. This includes human genome build conversion, population stratification against data from the 1000 Genomes Project, automated population outlier removal, and built-in imputation with its own pre- and post-imputation quality controls. Common input formats are used, and a synthetic dataset and comprehensive online tutorial are provided for testing, educational purposes, and demonstration. The snpQT pipeline is designed to run with minimal user input and coding experience; quality control steps are implemented with numerous user-modifiable thresholds, and workflows can be flexibly combined in custom combinations. snpQT is open source and freely available at https://github.com/nebfield/snpQT. A comprehensive online tutorial and installation guide, covering the pipeline through to GWAS, is provided at https://snpqt.readthedocs.io/en/latest/, introducing snpQT using a synthetic demonstration dataset and a real-world Amyotrophic Lateral Sclerosis SNP-array dataset.
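As an illustration of the kind of user-modifiable thresholds such a pipeline exposes, the sketch below applies three common per-variant filters (call rate, MAF and Hardy-Weinberg p-value); the thresholds and data structures are hypothetical examples, not snpQT defaults or code.

```python
# Illustrative only: three common per-SNP QC filters with user-modifiable
# thresholds, mirroring the kind of cut-offs a GWAS QC pipeline exposes.
from dataclasses import dataclass

@dataclass
class QCThresholds:
    min_call_rate: float = 0.98   # hypothetical default
    min_maf: float = 0.01         # hypothetical default
    min_hwe_p: float = 1e-6       # hypothetical default

def passes_qc(call_rate: float, maf: float, hwe_p: float,
              t: QCThresholds = QCThresholds()) -> bool:
    """Return True if a variant survives all three filters."""
    return (call_rate >= t.min_call_rate
            and maf >= t.min_maf
            and hwe_p >= t.min_hwe_p)

print(passes_qc(call_rate=0.995, maf=0.12, hwe_p=0.3))   # True
print(passes_qc(call_rate=0.90, maf=0.12, hwe_p=0.3))    # False: low call rate
```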
Project description: The amount of publicly available DNA sequence data is increasing drastically, making it a tedious task to create the sequence databases necessary for the design of diagnostic assays. The selection of appropriate sequences is especially challenging for genes affected by frequent point mutations, such as antibiotic resistance genes. To overcome this issue, we have designed the webtool resiDB, a rapid and user-friendly sequence database manager that handles bacterial, fungal, viral, protozoan, invertebrate, plant, archaeal, environmental and whole-genome shotgun sequence data. It automatically identifies and curates sequence clusters to create custom sequence databases based on user-defined input sequences. A collection of helpful visualization tools gives the user the opportunity to easily access, evaluate, edit, and download the newly created database. Consequently, researchers no longer have to manually manage sequence data retrieval, deal with hardware limitations, or run multiple independent software tools, each with its own requirements and input and output formats. Our tool was developed within the H2020 project FAPIC, which aims to develop a single diagnostic assay targeting all sepsis-relevant pathogens and antibiotic resistance mechanisms. resiDB is freely accessible to all users at https://residb.ait.ac.at/.
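As a very rough sketch of the clustering idea, grouping similar sequences so each cluster can be represented in a custom database, the code below greedily clusters sequences by naive pairwise identity; the identity measure, threshold and example sequences are simplistic placeholders, not resiDB's actual algorithm.

```python
# Toy greedy clustering by pairwise identity; a crude stand-in for the
# automated sequence-cluster curation described above.
def identity(a: str, b: str) -> float:
    """Fraction of matching positions over the shorter sequence (naive, no alignment)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n

def greedy_cluster(seqs: dict[str, str], threshold: float = 0.9) -> dict[str, list[str]]:
    """Assign each sequence to the first cluster whose representative it matches."""
    clusters: dict[str, list[str]] = {}
    for name, seq in seqs.items():
        for rep in clusters:
            if identity(seqs[rep], seq) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]  # start a new cluster seeded by this sequence
    return clusters

# Placeholder sequences, not real resistance genes.
example = {"res_gene_v1": "ATGAGTATTCAACAT", "res_gene_v2": "ATGAGTATTCAACAC", "other_gene": "ATGAATAGAATAAAA"}
print(greedy_cluster(example))
```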
Project description: Molecular Cloning Designer Simulator (MCDS) is a powerful new all-in-one cloning and genetic engineering design, simulation and management software platform developed for complex synthetic biology and metabolic engineering projects. In addition to standard functions, it has a number of features that are either unique, or are not found in combination in any one software package: (1) it has a novel interactive flow-chart user interface for complex multi-step processes, allowing an integrated overview of the whole project; (2) it can perform a user-defined workflow of cloning steps in a single execution of the software; (3) it can handle multiple types of genetic recombineering, a technique that is rapidly replacing classical cloning for many applications; (4) it includes experimental information to conveniently guide wet lab work; and (5) it can store results and comments to allow the tracking and management of the whole project in one platform. MCDS is freely available from https://mcds.codeplex.com.
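For illustration, the sketch below shows one way a multi-step cloning project could be represented as a directed graph of steps whose outputs feed later steps, the kind of structure a flow-chart interface and single-execution workflow can be built on; the class, step names and ordering logic are hypothetical and are not MCDS code.

```python
# Illustration only: a multi-step cloning workflow as a directed acyclic graph.
from collections import defaultdict

class CloningWorkflow:
    def __init__(self):
        self.steps = {}                 # step name -> description
        self.edges = defaultdict(list)  # step -> downstream steps

    def add_step(self, name, description):
        self.steps[name] = description

    def connect(self, upstream, downstream):
        self.edges[upstream].append(downstream)

    def execution_order(self):
        """Topological sort so each step runs only after its inputs exist."""
        indeg = {s: 0 for s in self.steps}
        for downs in self.edges.values():
            for d in downs:
                indeg[d] += 1
        order, ready = [], [s for s, d in indeg.items() if d == 0]
        while ready:
            s = ready.pop()
            order.append(s)
            for d in self.edges[s]:
                indeg[d] -= 1
                if indeg[d] == 0:
                    ready.append(d)
        return order

wf = CloningWorkflow()
wf.add_step("pcr_insert", "Amplify insert with tailed primers")
wf.add_step("digest_vector", "Linearise destination vector")
wf.add_step("assemble", "Assemble insert and vector")
wf.connect("pcr_insert", "assemble")
wf.connect("digest_vector", "assemble")
print(wf.execution_order())   # e.g. ['digest_vector', 'pcr_insert', 'assemble']
```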
Project description: MOTIVATION: Correlated Nuclear Magnetic Resonance (NMR) chemical shift changes identified through the CHEmical Shift Projection Analysis (CHESPA) and CHEmical Shift Covariance Analysis (CHESCA) reveal pathways of allosteric transitions in biological macromolecules. To address the need for an automated platform that implements CHESPA and CHESCA and integrates them with other NMR analysis software packages, we introduce here integrated plugins for NMRFAM-SPARKY that implement the seamless detection and visualization of allosteric networks. AVAILABILITY AND IMPLEMENTATION: CHESCA-SPARKY and CHESPA-SPARKY are available in the latest version of NMRFAM-SPARKY from the National Magnetic Resonance Facility at Madison (http://pine.nmrfam.wisc.edu/download_packages.html), the NMRbox Project (https://nmrbox.org) and to SBGrid subscribers (https://sbgrid.org). The assigned spectra involved in this study and tutorial videos using this dataset are available at https://sites.google.com/view/chescachespa-sparky. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
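As background for the projection analysis itself, the sketch below computes the two per-residue quantities a CHESPA-style analysis reports: the cosine of the angle between a perturbation-induced chemical shift change and a reference activation vector, and the fractional shift along that vector. The shift values and the 0.2 nitrogen weighting are illustrative assumptions, not output of the SPARKY plugins.

```python
# Hedged sketch of a CHESPA-style projection for a single residue.
# Chemical shift values and the 0.2 nitrogen weighting are illustrative.
import math

def shift_vector(h_from, n_from, h_to, n_to, n_weight=0.2):
    """2D chemical shift change (1H, weighted 15N) between two states."""
    return (h_to - h_from, n_weight * (n_to - n_from))

def chespa(a, b):
    """Project vector a onto reference vector b: return (cos_theta, fractional shift X)."""
    dot = a[0] * b[0] + a[1] * b[1]
    na, nb = math.hypot(*a), math.hypot(*b)
    cos_theta = dot / (na * nb)
    x = dot / (nb * nb)          # equals |a| * cos(theta) / |b|
    return cos_theta, x

# Invented shifts: reference state -> perturbed state (a) and reference -> active state (b).
a = shift_vector(8.20, 120.5, 8.26, 121.4)
b = shift_vector(8.20, 120.5, 8.28, 121.6)
print(chespa(a, b))   # cos_theta near 1 suggests the perturbation follows the activation pathway
```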
Project description: In recent years, the explosion of genomic data and bioinformatic tools has been accompanied by a growing conversation around the reproducibility of results and the usability of software. However, the actual state of the body of bioinformatics software remains largely unknown. The purpose of this paper is to investigate the state of source code in the bioinformatics community, specifically looking at relationships between code properties, development activity, developer communities, and software impact. To investigate these issues, we curated a list of 1,720 bioinformatics repositories on GitHub through their mention in peer-reviewed bioinformatics articles. Additionally, we included 23 high-profile repositories identified by their popularity in an online bioinformatics forum. We analyzed repository metadata, source code, development activity, and team dynamics using data made available publicly through the GitHub API, as well as article metadata. We found key relationships within our dataset, including: certain scientific topics are associated with more active code development and higher community interest in the repository; most of the code in the main dataset is written in dynamically typed languages, while most of the code in the high-profile set is statically typed; developer team size is associated with community engagement, and high-profile repositories have larger teams; the proportion of female contributors decreases for high-profile repositories and with seniority level in author lists; and multiple measures of project impact are associated with the simple variable of whether the code was modified at all after paper publication. In addition to providing, to our knowledge, the first large-scale analysis of bioinformatics code, our work will enable future analysis through publicly available data, code, and methods. Code to generate the dataset and reproduce the analysis is provided under the MIT license at https://github.com/pamelarussell/github-bioinformatics. Data are available at https://doi.org/10.17605/OSF.IO/UWHX8.
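For readers who want to reproduce this kind of metadata collection, a minimal sketch of querying the public GitHub REST API for a single repository is shown below; unauthenticated requests are rate-limited, and the fields printed are only a small subset of what the study analyzed.

```python
# Minimal example of fetching public repository metadata from the GitHub REST API.
# Unauthenticated requests are rate-limited; pass an access token for larger crawls.
import json
import urllib.request

def repo_metadata(full_name: str) -> dict:
    """Return the metadata record for one repository (owner/name)."""
    url = f"https://api.github.com/repos/{full_name}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

meta = repo_metadata("pamelarussell/github-bioinformatics")
print(meta["language"], meta["stargazers_count"], meta["pushed_at"])
```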
Project description: BACKGROUND: Inflammatory bowel disease (IBD) is a chronic complex disease of the gastrointestinal tract. Patients with IBD can experience a wide range of symptoms, but the pathophysiological mechanisms that cause these individual differences in clinical presentation remain largely unknown. In consequence, IBD is currently classified into subtypes using clinical characteristics. If we are to develop a more targeted treatment approach, molecular subtypes of IBD need to be discovered that can be used as new drug targets. To achieve this, we need multiple layers of molecular data generated from the same IBD patients. CONSTRUCTION AND CONTENT: We initiated the 1000IBD project (https://1000ibd.org) to prospectively follow more than 1000 IBD patients from the Northern provinces of the Netherlands. For these patients, we have collected a uniquely large number of phenotypes and generated multi-omics profiles. To date, 1215 participants have been enrolled in the project and enrolment is ongoing. Phenotype data collected for these participants includes information on dietary and environmental factors, drug responses and adverse drug events. Genome information has been generated using genotyping (ImmunoChip, Global Screening Array and HumanExomeChip) and sequencing (whole exome sequencing and targeted resequencing of IBD susceptibility loci); transcriptome information has been generated using RNA-sequencing of intestinal biopsies; and microbiome information has been generated using both sequencing of the 16S rRNA gene and whole genome shotgun metagenomic sequencing. UTILITY AND DISCUSSION: All molecular data generated within the 1000IBD project will be shared on the European Genome-Phenome Archive (https://ega-archive.org, accession no. EGAS00001002702). The first data release, detailed in this announcement and released simultaneously with this publication, will contain basic phenotypes for 1215 participants, genotypes of 314 participants and gut microbiome data from stool samples (315 participants) and biopsies (107 participants) generated by tag sequencing of the 16S rRNA gene. Future releases will comprise many more additional phenotypes and -omics data layers. 1000IBD data can be used by other researchers as a replication cohort, a dataset to test new software tools, or a dataset for applying new statistical models. CONCLUSIONS: We report on the establishment and future development of the 1000IBD project: the first comprehensive multi-omics dataset aimed at discovering IBD biomarker profiles and treatment targets.
Project description: MOTIVATION: Developing a robust and performant data analysis workflow that integrates all necessary components whilst still being able to scale over multiple compute nodes is a challenging task. We introduce a generic method based on the microservice architecture, where software tools are encapsulated as Docker containers that can be connected into scientific workflows and executed using the Kubernetes container orchestrator. RESULTS: We developed a Virtual Research Environment (VRE) which facilitates rapid integration of new tools and the development of scalable and interoperable workflows for performing metabolomics data analysis. The environment can be launched on demand on cloud resources and desktop computers. IT-expertise requirements on the user side are kept to a minimum, and workflows can be re-used effortlessly by any novice user. We validate our method in the field of metabolomics on two mass spectrometry studies, one nuclear magnetic resonance spectroscopy study and one fluxomics study. We show that the method scales dynamically with increasing availability of computational resources. We also demonstrate that the method facilitates interoperability by integrating the major software suites, resulting in a turn-key workflow encompassing all steps of mass-spectrometry-based metabolomics, including preprocessing, statistics and identification. Microservices are a generic methodology that can serve any scientific discipline and open up new types of large-scale integrative science. AVAILABILITY AND IMPLEMENTATION: The PhenoMeNal consortium maintains a web portal (https://portal.phenomenal-h2020.eu) providing a GUI for launching the Virtual Research Environment. The GitHub repository https://github.com/phnmnl/ hosts the source code of all projects. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
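To caricature the microservice idea, one container per tool chained into a workflow, the sketch below runs two hypothetical containerised steps with a local Docker client and a shared volume; the image names and file paths are placeholders, and PhenoMeNal itself orchestrates such containers with Kubernetes rather than calling docker directly.

```python
# Caricature of a containerised two-step workflow run with a local Docker client.
# Image names and paths are placeholders; a real deployment would use Kubernetes.
import subprocess

def run_step(image: str, args: list[str], workdir: str = "/data") -> None:
    """Run one containerised tool, mounting the working directory for input/output."""
    cmd = ["docker", "run", "--rm", "-v", f"{workdir}:{workdir}", image, *args]
    subprocess.run(cmd, check=True)

# Hypothetical preprocessing followed by statistics, sharing files via the mounted volume.
run_step("example/preprocess:latest", ["--in", "/data/raw.mzML", "--out", "/data/peaks.csv"])
run_step("example/stats:latest", ["--in", "/data/peaks.csv", "--out", "/data/results.csv"])
```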
Project description: Mouse cortical slices were cut from two GFP-labelled strains marking pyramidal cells and interneurons. Slices were then scanned with a Femtonics SMART microscope equipped with Stimulated Raman Spectroscopy (SRS) measurement capabilities. Neurons located by their two-photon fluorescence were considered identified. This work was supported by the H2020 NEURAM project (grant no. 712821).
Project description: BACKGROUND: Double minute chromosomes are circular fragments of DNA whose presence is associated with the onset of certain cancers. Double minutes are lethal, as they are highly amplified and typically contain oncogenes. Locating double minutes can supplement the process of cancer diagnosis, and it can help to identify therapeutic targets. However, there is currently a dearth of computational methods available to identify double minutes. We propose a computational framework for the identification of double minute chromosomes using next-generation sequencing data. Our framework integrates predictions from algorithms that detect DNA copy number variants, and it also integrates predictions from algorithms that locate genomic structural variants. This information is used by a graph-based algorithm to predict the presence of double minute chromosomes. RESULTS: Using a previously published copy number variant algorithm and two structural variation prediction algorithms, we implemented our framework and tested it on a dataset consisting of simulated double minute chromosomes. Our approach uncovered double minutes with high accuracy, demonstrating its plausibility. CONCLUSIONS: Although we only tested the framework with three programs (RDXplorer, BreakDancer, Delly), it can be extended to incorporate results from programs that (1) detect amplified copy number and (2) detect genomic structural variants such as deletions, translocations, inversions, and tandem repeats. The software that implements the framework can be accessed at https://github.com/mhayes20/DMFinder.
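To give a feel for the graph-based step, the sketch below treats amplified segments (from a copy number caller) as nodes, adds an edge for every structural-variant junction linking two segments, and searches for a cycle, since a circular double minute should close a cycle of amplified segments. The coordinates and the simple cycle test are illustrative and are not DMFinder's implementation.

```python
# Toy version of the graph idea: amplified segments are nodes, SV junctions are
# edges, and a cycle among amplified segments is a double-minute candidate.
from collections import defaultdict

def find_cycle(nodes, edges):
    """Depth-first search for a cycle in an undirected simple graph; return a back edge or None."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    visited = set()
    for start in nodes:
        if start in visited:
            continue
        stack = [(start, None)]
        while stack:
            node, parent = stack.pop()
            if node in visited:
                continue
            visited.add(node)
            for nxt in adj[node]:
                if nxt == parent:
                    continue
                if nxt in visited:
                    return (node, nxt)   # back edge: closes a cycle
                stack.append((nxt, node))
    return None

# Hypothetical amplified segments joined head-to-tail by SV junctions into a circle.
amplified = ["chr8:127.7-128.0Mb", "chr8:128.4-128.6Mb", "chr12:69.1-69.3Mb"]
sv_links = [("chr8:127.7-128.0Mb", "chr8:128.4-128.6Mb"),
            ("chr8:128.4-128.6Mb", "chr12:69.1-69.3Mb"),
            ("chr12:69.1-69.3Mb", "chr8:127.7-128.0Mb")]
print(find_cycle(amplified, sv_links))   # a cycle is found, flagging a candidate double minute
```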
Project description: Along with the adoption of 5G, the development of neutral host solutions provides a unique opportunity for mobile network operators to accommodate the needs of emerging use-cases and to consolidate new business models. By exploiting the concept of network slicing, one of the key enablers in the transition to 5G, infrastructure and service providers can logically split a shared physical network into multiple isolated and customized networks to flexibly address the specific demands of each tenant's slice. Motivated by this reality, the H2020 5GCity project proposed a novel 5G-enabled neutral host framework for three European cities: Barcelona (ESP), Bristol (UK), and Lucca (IT). This article reviews the main achievements and contributions of the 5GCity project, focusing on the deployment and validation of the proposed framework. The developed neutral host framework encompasses two main parts: the infrastructure and the software platform. A detailed description of the framework implementation, in terms of functional capabilities and practical implications of city-wide deployments, is provided in this article. This work also presents the performance evaluation of the proposed solution during the implementation of real vertical use cases. The obtained results validate the feasibility of the neutral host model and of the proposed framework for deployment in city-wide 5G infrastructures.