Project description:Several studies are starting to show the power of DNA microarrays to identify interactions between animal hosts and their pathogens, and have revealed interesting correlations between host responses to different infectious agents.
Project description:BACKGROUND: With the cost of DNA sequencing decreasing, increasing amounts of RNA-Seq data are being generated giving novel insight into gene expression and regulation. Prior to analysis of gene expression, the RNA-Seq data has to be processed through a number of steps resulting in a quantification of expression of each gene/transcript in each of the analyzed samples. A number of workflows are available to help researchers perform these steps on their own data, or on public data to take advantage of novel software or reference data in data re-analysis. However, many of the existing workflows are limited to specific types of studies. We therefore aimed to develop a maximally general workflow, applicable to a wide range of data and analysis approaches and at the same time support research on both model and non-model organisms. Furthermore, we aimed to make the workflow usable also for users with limited programming skills. RESULTS: Utilizing the workflow management system Snakemake and the package management system Conda, we have developed a modular, flexible and user-friendly RNA-Seq analysis workflow: RNA-Seq Analysis Snakemake Workflow (RASflow). Utilizing Snakemake and Conda alleviates challenges with library dependencies and version conflicts and also supports reproducibility. To be applicable for a wide variety of applications, RASflow supports the mapping of reads to both genomic and transcriptomic assemblies. RASflow has a broad range of potential users: it can be applied by researchers interested in any organism and since it requires no programming skills, it can be used by researchers with different backgrounds. The source code of RASflow is available on GitHub: https://github.com/zhxiaokang/RASflow. CONCLUSIONS: RASflow is a simple and reliable RNA-Seq analysis workflow covering many use cases.
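How a workflow manager orders processing steps can be illustrated with a small dependency graph. The sketch below is not RASflow's code: the step names and their dependencies are hypothetical, and Python's standard `graphlib` module stands in for Snakemake's own scheduler.

```python
from graphlib import TopologicalSorter

# Hypothetical RNA-Seq steps; each key maps to the steps it depends on.
# These names are illustrative only and are not taken from RASflow.
STEPS = {
    "quality_control": set(),
    "trimming": {"quality_control"},
    "alignment": {"trimming"},
    "quantification": {"alignment"},
    "differential_expression": {"quantification"},
}


def execution_order(steps):
    """Return the steps in an order that satisfies every dependency."""
    return list(TopologicalSorter(steps).static_order())


if __name__ == "__main__":
    for name in execution_order(STEPS):
        print(name)
```

Because every step declares only what it needs, a step can be added or swapped without rewriting the rest of the chain, which is the kind of modularity the abstract attributes to a Snakemake-based design.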
Project description:For image-based infection biology, accurate unbiased quantification of host-pathogen interactions is essential, yet often performed manually or using limited enumeration employing simple image analysis algorithms based on image segmentation. Host protein recruitment to pathogens is often refractory to accurate automated assessment due to its heterogeneous nature. An intuitive intelligent image analysis program to assess host protein recruitment within general cellular pathogen defense is lacking. We present HRMAn (Host Response to Microbe Analysis), an open-source image analysis platform based on machine learning algorithms and deep learning. We show that HRMAn has the capacity to learn phenotypes from the data, without relying on researcher-based assumptions. Using Toxoplasma gondii and Salmonella enterica Typhimurium we demonstrate HRMAn's capacity to recognize, classify and quantify pathogen killing, replication and cellular defense responses. HRMAn thus presents the only intelligent solution operating at human capacity suitable for both single image and high content image analysis. This article has been through an editorial process in which the authors decide how to respond to the issues raised during peer review. The Reviewing Editor's assessment is that all the issues have been addressed (see decision letter).
Project description:Background: DNA adenine methyltransferase identification followed by sequencing (DamID-seq) is a powerful method used to map genome-wide chromatin-protein interactions. However, the bioinformatic analysis of DamID-seq data presents significant challenges due to the inherent complexities of the data and a notable lack of comprehensive software solutions for data processing and downstream analysis. Results: To address these challenges, we present a comprehensive bioinformatic workflow for DamID-seq data analysis, DamMapper, using the Snakemake workflow management system. Key features include straightforward processing of multiple biological replicates, visualisation of quality control figures, such as correlation heatmaps and principal component analysis (PCA) plots, and robust code quality maintained through continuous integration (CI). Reproducibility is ensured across diverse computational environments, including cloud computing and high-performance computing (HPC) clusters, through the implementation of software environments (Conda) and containerisation (Docker/Apptainer). We validated this workflow using a previously published DamID-seq dataset. Furthermore, we apply it to analyse novel datasets for proteins involved in the hypoxia response, specifically the transcription factor HIF-1α and the histone methyltransferase SET1B. This application reveals a strong concordance between our HIF-1α DamID-seq results and ChIP-seq data, and importantly, provides the first genome-wide DNA binding map for SET1B. Conclusions: This work provides a validated, reproducible, and feature-rich workflow that overcomes common hurdles in DamID-seq data analysis. By streamlining the processing and ensuring robustness, DamMapper facilitates reliable analysis and enables new biological discoveries, as demonstrated by the characterization of SET1B binding sites. The workflow is available under the MIT license at: https://github.com/niekwit/damid-seq.
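The numbers behind a replicate correlation heatmap can be sketched as a pairwise Pearson correlation matrix. This is an illustration under assumptions, not DamMapper's implementation: the sample names and the per-bin signal values are invented, and the abstract does not state which correlation measure the workflow uses.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant signal")
    return cov / sqrt(var_x * var_y)


def correlation_matrix(samples):
    """Return {sample: {sample: r}} for every pair of samples."""
    names = list(samples)
    return {
        a: {b: pearson(samples[a], samples[b]) for b in names}
        for a in names
    }


# Hypothetical per-bin signal for three replicates (invented values).
SAMPLES = {
    "rep1": [1.0, 2.0, 3.0, 4.0],
    "rep2": [2.0, 4.0, 6.0, 8.0],
    "rep3": [4.0, 3.0, 2.0, 1.0],
}
```

Calling `correlation_matrix(SAMPLES)` gives the values a plotting library would then colour; replicates that agree sit close to 1, and a replicate with an inverted signal sits close to -1.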
Project description:Motivation: Whole-genome sequencing (WGS) is increasingly used to aid the understanding of Mycobacterium tuberculosis (MTB) transmission. The epidemiological analysis of tuberculosis based on WGS requires a diverse collection of bioinformatics tools. Using these analysis tools effectively in a scalable and reproducible way can be challenging, especially for non-experts. Results: Here, we present TransFlow (Transmission Workflow), a user-friendly, fast, efficient and comprehensive WGS-based transmission analysis pipeline. TransFlow combines state-of-the-art tools to take transmission analysis from raw sequencing data, through quality control, sequence alignment and variant calling, into downstream transmission clustering, transmission network reconstruction and transmission risk factor inference, together with summary statistics and data visualization in a summary report. TransFlow relies on Snakemake and Conda to resolve dependencies among consecutive processing steps and can be easily adapted to any computation environment. Availability and implementation: TransFlow is freely available at https://github.com/cvn001/transflow. Supplementary information: Supplementary data are available at Bioinformatics online.
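One common way to define transmission clusters is to link isolates whose pairwise SNP distance falls at or below a cutoff. The sketch below illustrates that idea with single-linkage clustering; it is an assumption for illustration, since the abstract does not say which clustering method TransFlow applies, and the isolate names, sequences and cutoff are invented.

```python
from itertools import combinations


def snp_distance(a, b):
    """Count positions where two equal-length aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def cluster_isolates(sequences, threshold):
    """Single-linkage clustering: link isolates with distance <= threshold."""
    parent = {name: name for name in sequences}

    def find(name):
        # Follow parent links to the cluster root, shortening the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(sequences, 2):
        if snp_distance(sequences[a], sequences[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in sequences:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(sorted(members) for members in clusters.values())


# Invented toy alignment; real inputs would be whole-genome variant calls.
ISOLATES = {
    "A": "ACGTACGT",
    "B": "ACGTACGA",
    "C": "ACGTACAA",
    "D": "TTTTTTTT",
}
```

With a cutoff of 1, isolates A and C end up in the same cluster even though they differ at two sites, because both are within one SNP of B. That chaining is the defining property of single linkage.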
Project description:Background: Sequencing of marker genes amplified from environmental samples, known as amplicon sequencing, allows us to resolve some of the hidden diversity and elucidate evolutionary relationships and ecological processes among complex microbial communities. The analysis of large numbers of samples at high sequencing depths generated by high-throughput sequencing technologies requires efficient, flexible, and reproducible bioinformatics pipelines. Only a few existing workflows can be run in a user-friendly, scalable, and reproducible manner on different computing devices using an efficient workflow management system. Results: We present Natrix, an open-source bioinformatics workflow for preprocessing raw amplicon sequencing data. The workflow contains all analysis steps from quality assessment, read assembly, dereplication, chimera detection, split-sample merging, sequence representative assignment (OTUs or ASVs) to the taxonomic assignment of sequence representatives. The workflow is written using Snakemake, a workflow management engine for developing data analysis workflows. In addition, Conda is used for version control of the utilized programs. Thus, Snakemake ensures reproducibility and Conda fixes the software versions. The encapsulation of rules and their dependencies supports hassle-free sharing of rules between workflows and easy adaptation and extension of existing workflows. Natrix is freely available on GitHub (https://github.com/MW55/Natrix) or as a Docker container on DockerHub (https://hub.docker.com/r/mw55/natrix). Conclusion: Natrix is a user-friendly and highly extensible workflow for processing Illumina amplicon data.
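Dereplication, one of the steps listed above, collapses identical reads into a single representative with an abundance count. The following is a minimal sketch of that idea and not Natrix's implementation; the function name and the toy reads are invented.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical reads into (sequence, abundance) pairs.

    Reads are compared case-insensitively. The result is sorted by
    decreasing abundance, with ties broken alphabetically.
    """
    counts = Counter(read.upper() for read in reads)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Invented toy reads.
READS = ["ACGT", "acgt", "TTGA", "ACGT", "GGCC", "TTGA"]
```

Sorting by abundance puts the most frequent sequences first, which is the order that later steps such as chimera detection and clustering typically expect.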
Project description:Next-generation sequencing technologies are becoming more accessible and affordable over the years, with entire genome sequences of several pathogens being deciphered in a few hours. However, there is a need to analyze multiple genomes within a short time in order to provide critical information about a pathogen of interest, such as drug resistance, mutations and the genetic relationship of isolates in an outbreak setting. Many pipelines that currently do this are stand-alone workflows and require substantial computational resources to analyze multiple genomes. We present an automated and scalable pipeline called BAGEP for monomorphic bacteria that performs quality control on paired-end FASTQ files, scans reads for contaminants using a taxonomic classifier, maps reads to a reference genome of choice for variant detection, detects antimicrobial resistance (AMR) genes, constructs a phylogenetic tree from core genome alignments and provides interactive single-nucleotide polymorphism (SNP) visualization across core genomes in the data set. The objective of our research was to create an easy-to-use pipeline from existing bioinformatics tools that can be deployed on a personal computer. The pipeline was built on the Snakemake framework and utilizes existing tools for each processing step: fastp for quality trimming, snippy for variant calling, Centrifuge for taxonomic classification, Abricate for AMR gene detection, snippy-core for generating whole and core genome alignments, IQ-TREE for phylogenetic tree construction and vcfR for an interactive heatmap visualization which shows SNPs at specific locations across the genomes. BAGEP was successfully tested and validated with Mycobacterium tuberculosis (n = 20) and Salmonella enterica serovar Typhi (n = 20) genomes, which are about 4.4 million and 4.8 million base pairs, respectively. Running these test data on a laptop with 8 GB RAM and a 2.5 GHz quad-core processor took 122 and 61 minutes, respectively, to complete the analysis. BAGEP is fast, calls SNPs accurately and is easy to run on a mid-range laptop; it is freely available at https://github.com/idolawoye/BAGEP.
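The idea of SNP positions across a core genome alignment can be sketched as finding alignment columns where the samples carry different bases. This is an illustration only and not how snippy-core or vcfR work internally; the function name, sample names and sequences are invented, and skipping columns with gaps or ambiguous bases is an assumed simplification of what "core" means.

```python
def variable_sites(alignment):
    """Return {position: {sample: base}} for columns where samples differ.

    Positions are 1-based. Columns containing anything other than
    A, C, G or T (gaps, N) are skipped, so only sites present in
    every sample are reported.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("alignment must be non-empty with equal-length rows")
    sites = {}
    for pos in range(lengths.pop()):
        column = {name: seq[pos].upper() for name, seq in alignment.items()}
        bases = set(column.values())
        if bases <= set("ACGT") and len(bases) > 1:
            sites[pos + 1] = column
    return sites


# Invented toy alignment.
ALIGNMENT = {
    "ref": "ACGTAC",
    "iso1": "ATGTAC",
    "iso2": "ACGT-C",
}
```

A heatmap of SNPs by genome position would be drawn from exactly this kind of position-by-sample table.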
Project description:Chloroplasts are photosynthetic organelles in algal and plant cells that contain their own genome. Chloroplast genomes are commonly used in evolutionary studies and taxonomic identification and are increasingly becoming a target for crop improvement studies. As DNA sequencing becomes more affordable, researchers are collecting vast swathes of high-quality whole-genome sequence data from laboratory and field settings alike. Whole tissue read libraries sequenced with the primary goal of understanding the nuclear genome will inadvertently contain many reads derived from the chloroplast genome. These whole-genome, whole-tissue read libraries can additionally be used to assemble chloroplast genomes with little to no extra cost. While several tools exist that make use of short-read second generation and third-generation long-read sequencing data for chloroplast genome assembly, these tools may have complex installation steps, inadequate error reporting, poor expandability, and/or lack scalability. Here, we present CLAW (Chloroplast Long-read Assembly Workflow), an easy to install, customise, and use Snakemake tool to assemble chloroplast genomes from chloroplast long-reads found in whole-genome read libraries (https://github.com/aaronphillips7493/CLAW). Using 19 publicly available reference chloroplast genome assemblies and long-read libraries from algal, monocot and eudicot species, we show that CLAW can rapidly produce chloroplast genome assemblies with high similarity to the reference assemblies. CLAW was designed such that users have complete control over parameterisation, allowing individuals to optimise CLAW to their specific use cases. We expect that CLAW will provide researchers (with varying levels of bioinformatics expertise) with an additional resource useful for contributing to the growing number of publicly available chloroplast genome assemblies.
Project description:Background: Genome-wide association studies (GWAS) are a powerful method to detect associations between variants and phenotypes. A GWAS requires several complex computations with large data sets, and many steps may need to be repeated with varying parameters. Running these analyses manually can be tedious, error-prone and hard to reproduce. Results: The H3AGWAS workflow from the Pan-African Bioinformatics Network for H3Africa is a powerful, scalable and portable workflow implementing pre-association analysis, various association testing methods and post-association analysis of results. Conclusions: The workflow scales from laptop to cluster to cloud (e.g., SLURM, AWS Batch, Azure). All required software is containerised and can run under Docker or Singularity.