Project description:RNA sequencing (RNA-seq) is a widely used high-throughput method for characterizing transcriptomic dynamics across space and time. However, typical RNA-seq data analysis pipelines depend on either a sequenced genome or reference transcripts. This constraint makes RNA-seq challenging to use for species that lack both. To solve this problem, we developed CRSP, an RNA-seq pipeline that integrates a multiple-comparative-species strategy and does not depend on a specific sequenced genome or set of reference transcripts. Benchmarking suggests that CRSP quantifies gene expression levels with high accuracy.
Project description:DNA methylation plays critical roles in gene regulation and cellular specification without altering DNA sequences. The wide application of reduced representation bisulfite sequencing (RRBS) and whole genome bisulfite sequencing (bis-seq) has opened the door to studying DNA methylation at single-CpG-site resolution. One challenging question is how best to test for significant methylation differences between groups of biological samples while minimizing false positive findings. Current methods for analyzing genome-wide bisulfite sequencing data use a smoothing approach or a simple statistical test based on the binomial distribution. Comparative DNA methylation profiling in AML blasts and normal CD34(+) control cells
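As a rough illustration of the "simple statistical test based on the binomial distribution" mentioned in the description above, the sketch below asks whether the methylated read count at one CpG site is consistent with a methylation rate taken from a reference group. The function names and the read counts are hypothetical, and this is not the method of any particular tool.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k methylated reads out of n at rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_test_two_sided(k: int, n: int, p: float) -> float:
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely
    than the observed count k (a small relative tolerance absorbs
    floating-point ties).
    """
    observed = binom_pmf(k, n, p)
    total = 0.0
    for i in range(n + 1):
        pm = binom_pmf(i, n, p)
        if pm <= observed * (1 + 1e-9):
            total += pm
    return min(1.0, total)


# Hypothetical site: 10 of 10 reads methylated in the test sample,
# against an assumed reference methylation rate of 0.5.
p_value = binom_test_two_sided(10, 10, 0.5)
print(p_value)
```

A per-site test like this ignores variation between biological replicates, which is one reason the choice of test matters for controlling false positives.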
Project description:Current pipelines used to map genetrap insertion sites are based on inverse- or splinkerette-PCR methods, which, despite their efficacy, are prone to artifacts and provide no information on the impact of the genetrap on the expression of the targeted gene. We developed a new method, named TrapSeq, that maps genetrap insertions using paired-end RNA sequencing. By recognizing chimeric mRNAs that contain genetrap sequences spliced to an endogenous exon, our method identifies insertions that lead to productive trapping.
Project description:We investigated the reported binding of the telomere-associated factors TERF1 and TERF2 to internal telomere sites using ChIP-Seq for these two factors in a lymphoblastoid cell line. We mapped over 40 million reads for each sample to a custom reference genome that incorporates our subtelomere assembly, and generated signal tracks using only uniquely mapping reads and, separately, using a multimapping pipeline we developed. We find that peaks are misshapen and made up of reads that cannot be distinguished from true telomere sequence. Removing reads identified as telomeric removes all internal signal. Examination of TRF1 and TRF2
Project description:The microbiome is an essential omics layer for elucidating disease pathophysiology. However, microbiome studies suffer from low reproducibility, partly due to a lack of standard analytical pipelines. Here, we developed OMARU (Omnibus metagenome-wide association study with robustness), a new end-to-end analysis workflow that covers a wide range of microbiome analyses, from phylogenetic and functional profiling to case-control metagenome-wide association studies (MWAS). OMARU rigorously controls the statistical significance of the analysis results, including correction for hidden confounding factors and for multiple testing. Furthermore, OMARU can evaluate pathway-level links between the metagenome and the germline genome-wide association study (i.e. MWAS-GWAS pathway interaction), as well as links between taxa and genes in the metagenome. OMARU is publicly available (https://github.com/toshi-kishikawa/OMARU), with a flexible workflow that can be customized by users. We applied OMARU to publicly available type 2 diabetes (T2D) and schizophrenia (SCZ) metagenomic data (n = 171 and 344, respectively), identifying disease biomarkers through comprehensive, multilateral, and unbiased case-control comparisons of the metagenome (e.g. increased Streptococcus vestibularis in SCZ and disrupted diversity in T2D). OMARU improves accessibility and reproducibility in the microbiome research community. The robust and multifaceted results of OMARU reflect microbiome dynamics that are genuinely relevant to disease pathophysiology.
Project description:RNA-Sequencing is a transformative method that captures the quantitative dynamics of a transcriptome with exquisite sensitivity and single-base resolution. There are, however, few computational pipelines for RNA-Seq whose statistical tests have the robustness and power demanded by the difficult combination of small sample sizes and high variability in sequence read counts. To this end, we developed GENE-counter, a complete software pipeline for analyzing RNA-Seq data for genome-wide expression differences between replicated treatment groups. One important component of GENE-counter is a statistical test based on the NBP parameterization of the negative binomial distribution for identifying differentially expressed genome features. We used GENE-counter to analyze RNA-Seq data derived from Arabidopsis thaliana infected with a strain of defense-eliciting bacteria. We identified 308 genes that were differentially induced. Using alternative methods, we provided support for the induced expression and biological relevance of a substantial proportion of the genes. These results suggest that the NBP parameterization of the negative binomial distribution is well suited for explaining RNA-Seq data and that the statistical test makes GENE-counter a powerful pipeline for studying genome-wide expression changes. GENE-counter is freely available at http://changlab.cgrb.oregonstate.edu/. Our RNA-seq data are available from the NCBI Short Read Archive (SRA) under accession SRA025952.
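To make the NBP parameterization mentioned above more concrete, here is a minimal sketch of its mean-variance relationship, assuming the commonly stated form Var(Y) = mu + phi * mu^alpha, in which the dispersion is allowed to depend on the mean. The function name and the parameter values are illustrative only and are not taken from GENE-counter.

```python
def nbp_variance(mu: float, phi: float, alpha: float) -> float:
    """Variance of a read count under the assumed NBP form.

    mu    : mean read count
    phi   : dispersion parameter
    alpha : exponent linking dispersion to the mean
            (alpha = 2 gives the familiar NB2 model)
    """
    return mu + phi * mu**alpha


# Illustrative values only: the same mean count under two exponents.
for alpha in (1.5, 2.0):
    print(alpha, nbp_variance(mu=100.0, phi=0.1, alpha=alpha))
```

With phi = 0 the variance equals the mean, which is the Poisson case; larger phi or alpha adds the extra variability typical of replicated RNA-Seq counts.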
Project description:Metagenomic data compression is increasingly important as metagenomic projects face larger data volumes per sample and growing numbers of samples. Reference-based compression is a promising way to obtain a high compression ratio. However, existing microbial reference genome databases are not suitable for direct use as compression references because of their large size and redundancy, and different metagenomic cohorts often have different microbial compositions. We present a novel pipeline that generates simplified and tailored reference genomes for large metagenomic cohorts, enabling reference-based compression of metagenomic data. We constructed customized reference genomes, ranging from 2.4 to 3.9 GB, for 29 real metagenomic datasets and evaluated their compression performance. Reference-based compression achieved a compression ratio of over 20 for human whole-genome data and up to 33.8 for all samples, a 4.5-fold improvement over standard Gzip compression. Our method provides new insights into reference-based metagenomic data compression and has broad application potential for faster and cheaper data transfer, storage, and analysis.
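For readers unfamiliar with the metrics quoted above, the sketch below shows how a compression ratio and a fold improvement over a Gzip baseline can be computed. It uses only Python's standard `gzip` module on a toy repetitive sequence; the helper names and the toy input are assumptions for illustration and do not reproduce the pipeline or its reported figures.

```python
import gzip


def compression_ratio(raw_size: int, compressed_size: int) -> float:
    """Original size divided by compressed size (higher is better)."""
    return raw_size / compressed_size


def gzip_ratio(raw: bytes) -> float:
    """Compression ratio achieved by standard Gzip on the given bytes."""
    return compression_ratio(len(raw), len(gzip.compress(raw)))


def fold_improvement(ratio_new: float, ratio_baseline: float) -> float:
    """How many times better a method's ratio is than a baseline's."""
    return ratio_new / ratio_baseline


# Toy input: a highly repetitive sequence, so it compresses far better
# than real metagenomic reads would.
toy_reads = b"ACGTTGCA" * 5000
baseline = gzip_ratio(toy_reads)
print(f"Gzip ratio on toy data: {baseline:.1f}")
```

Real sequencing files also carry read identifiers and quality scores, which usually compress less well than the bases themselves.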
Project description:DamID is a powerful technique for identifying regions of the genome bound by a DNA-binding (or DNA-associated) protein. Currently no method exists for automatically processing next-generation sequencing DamID (DamID-seq) data, and the use of DamID-seq datasets with normalisation based on read-counts alone can lead to high background and the loss of bound signal. DamID-seq thus presents novel challenges in terms of normalisation and background minimisation. We describe here damidseq_pipeline, a software pipeline that performs automatic normalisation and background reduction on multiple DamID-seq FASTQ or BAM datasets. Single replicate profiling of pol II occupancy in 3rd instar larval neuroblasts of Drosophila
Project description:The analysis of shotgun metagenomic data provides valuable insights into microbial communities while allowing resolution at the level of individual genomes. In the absence of complete reference genomes, this requires the reconstruction of metagenome-assembled genomes (MAGs) from sequencing reads. We present the nf-core/mag pipeline for metagenome assembly, binning, and taxonomic classification. It can optionally combine short and long reads to increase assembly continuity, and it can use sample-wise group information for co-assembly and genome binning. The pipeline is easy to install (all dependencies are provided within containers), portable, and reproducible. It is written in Nextflow and developed as part of the nf-core initiative for best-practice pipeline development. All code is hosted on GitHub under the nf-core organization at https://github.com/nf-core/mag and released under the MIT license.