The bench scientist's guide to statistical analysis of RNA-Seq data.
ABSTRACT: BACKGROUND: RNA sequencing (RNA-Seq) is emerging as a highly accurate method to quantify transcript abundance. However, analyses of the large data sets obtained by sequencing the entire transcriptome of organisms have generally been performed by bioinformatics specialists. Here we provide a step-by-step guide and outline a strategy using currently available statistical tools that results in a conservative list of differentially expressed genes. We also discuss potential sources of error in RNA-Seq analysis that could alter interpretation of global changes in gene expression. FINDINGS: When comparing statistical tools, the negative binomial distribution-based methods, edgeR and DESeq, respectively identified 11,995 and 11,317 differentially expressed genes from an RNA-seq dataset generated from soybean leaf tissue grown in elevated O3. However, the number of genes in common between these two methods was only 10,535, resulting in 2,242 genes determined to be differentially expressed by only one method. Upon analysis of the non-significant genes, several limitations of these analytic tools were revealed, including evidence for overly stringent parameters for determining statistical significance of differentially expressed genes as well as increased type II error for high abundance transcripts. CONCLUSIONS: Because of the high variability between methods for determining differential expression of RNA-Seq data, we suggest using several bioinformatics tools, as outlined here, to ensure that a conservative list of differentially expressed genes is obtained. We also conclude that despite these analytical limitations, RNA-Seq provides highly accurate transcript abundance quantification that is comparable to qRT-PCR.
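The overlap figures in the abstract above can be checked with simple set arithmetic. The sketch below is only an illustration of that arithmetic; the function name is hypothetical and nothing here reproduces the authors' pipeline.

```python
def exclusive_count(n_a: int, n_b: int, n_shared: int) -> int:
    """Number of items reported by exactly one of two methods."""
    if n_shared > min(n_a, n_b):
        raise ValueError("shared count cannot exceed either list size")
    return (n_a - n_shared) + (n_b - n_shared)


# Counts quoted in the abstract: edgeR, DESeq, and genes in common.
edger, deseq, shared = 11_995, 11_317, 10_535

one_method_only = exclusive_count(edger, deseq, shared)
union = edger + deseq - shared          # genes called by at least one method
print(one_method_only, union)

# The same idea on explicit gene sets (toy, made-up identifiers).
edger_genes = {"g1", "g2", "g3", "g4"}
deseq_genes = {"g2", "g3", "g5"}
conservative = edger_genes & deseq_genes   # called by both methods
disputed = edger_genes ^ deseq_genes       # called by exactly one method
```

Taking the intersection is one way to obtain the conservative list the abstract recommends; the symmetric difference flags the genes that deserve a closer look.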
Project description:Detection of differentially expressed genes is a common task in single-cell RNA-seq (scRNA-seq) studies. Various methods based on both bulk-cell and single-cell approaches are in current use. Due to the unique distributional characteristics of single-cell data, it is important to compare these methods with rigorous statistical assessments. In this study, we assess the reproducibility of 9 tools for differential expression analysis in scRNA-seq data. These tools include four methods originally designed for scRNA-seq data, three popular methods that were originally developed for bulk-cell RNA-seq data but have been applied in scRNA-seq analysis, and two general statistical tests. Instead of comparing the performance across all genes, we compare the methods in terms of the rediscovery rates (RDRs) of top-ranked genes, separately for highly and lowly expressed genes. Three real and one simulated scRNA-seq data sets are used for the comparisons. The results indicate that some widely used methods, such as edgeR and monocle, have worse RDR performance than the other methods, especially for the top-ranked genes. For highly expressed genes, many bulk-cell-based methods can perform similarly to the methods designed for scRNA-seq data. For lowly expressed genes, however, performance varies substantially: edgeR and monocle are too liberal and have poor control of false positives, while DESeq2 is too conservative and consequently loses sensitivity compared to the other methods. BPSC, Limma, DEsingle, MAST, t-test and Wilcoxon have similar performances in the real data sets. Overall, the scRNA-seq based method BPSC performs well against the other methods, particularly when there is a sufficient number of cells.
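To make the idea of a rediscovery rate concrete, here is a minimal sketch. It assumes a simplified definition (the fraction of the top-k genes from a training split that reappear among the top-k genes of a validation split); the study's exact definition may differ, and the function name and toy gene lists are hypothetical.

```python
def rediscovery_rate(train_ranked, valid_ranked, k):
    """Fraction of the top-k training genes also found in the top-k validation genes.

    Both inputs are gene identifiers ordered from most to least significant.
    This is a simplified stand-in for the RDR used in the study.
    """
    top_train = train_ranked[:k]
    if not top_train:
        raise ValueError("need at least one top-ranked training gene")
    top_valid = set(valid_ranked[:k])
    hits = sum(1 for gene in top_train if gene in top_valid)
    return hits / len(top_train)


# Toy rankings from two hypothetical splits of the same data set.
train = ["g1", "g2", "g3", "g4", "g5"]
valid = ["g2", "g1", "g5", "g3", "g4"]
rdr_top3 = rediscovery_rate(train, valid, k=3)   # g1 and g2 are rediscovered
```

Computing the rate separately for highly and lowly expressed genes, as the abstract describes, would amount to filtering both rankings by expression level before calling the function.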
Project description:Meta-analysis combining multiple transcriptomic studies increases statistical power and accuracy in detecting differentially expressed genes. As next-generation sequencing experiments become mature and affordable, an increasing number of RNA-seq datasets are available in the public domain. The count-data based technology provides better experimental accuracy, reproducibility and ability to detect low-expressed genes. A naive approach to combining multiple RNA-seq studies is to apply differential analysis tools such as edgeR and DESeq to each study and then combine the summary statistics of p-values or effect sizes by conventional meta-analysis methods. Such a two-stage approach loses statistical power, especially for genes with short length or low expression abundance. In this paper, we propose a full Bayesian hierarchical model (namely, BayesMetaSeq) for RNA-seq meta-analysis by modelling count data, integrating information across genes and across studies, and modelling potentially heterogeneous differential signals across studies via latent variables. A Dirichlet process mixture (DPM) prior is further applied on the latent variables to provide categorization of detected biomarkers according to their differential expression patterns across studies, facilitating improved interpretation and biological hypothesis generation. Simulations and a real application on multi-brain-region HIV-1 transgenic rats demonstrate improved sensitivity, accuracy and biological findings of the proposed method.
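The naive two-stage approach described above can be sketched with Fisher's method, one conventional way to combine per-study p-values for a single gene. This is an illustration of the baseline only, not of BayesMetaSeq; the abstract does not say which combination rule was used, so Fisher's method is an assumption, and the function name is hypothetical.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values for one gene with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom under the null. For even degrees of freedom the
    survival function has the closed form exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    k = len(pvalues)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    half_stat = -sum(math.log(p) for p in pvalues)   # this is X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i                        # (X/2)^i / i!
        total += term
    return math.exp(-half_stat) * total


# Stage 1 would run a per-study test (e.g. edgeR or DESeq) to obtain these
# values; here they are made-up numbers for one gene in three studies.
combined = fisher_combined_pvalue([0.04, 0.20, 0.01])
```

Because each study is reduced to a single p-value before combination, information in the raw counts is discarded, which is the loss of power the abstract attributes to two-stage approaches.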
Project description:BACKGROUND: Long non-coding RNA (lncRNA) expression data have been increasingly used in finding diagnostic and prognostic biomarkers in cancer studies. Existing differential analysis tools for RNA sequencing do not effectively accommodate low-abundance genes, which are common among lncRNAs. RESULTS: We investigated the statistical distribution of normalized counts for low-expression genes in lncRNAs and mRNAs, and proposed a new tool, lncDIFF, based on the underlying distribution pattern to detect differentially expressed (DE) lncRNAs. lncDIFF adopts the generalized linear model with zero-inflated Exponential quasi-likelihood to estimate group effect on normalized counts, and employs the likelihood ratio test to detect differentially expressed genes. The proposed method and tool are applicable to data processed with standard RNA-Seq preprocessing and normalization pipelines. Simulation results showed that lncDIFF was able to detect DE genes with more power and lower false discovery rate regardless of the data pattern, compared to DESeq2, edgeR, limma, zinbwave, DEsingle, and ShrinkBayes. In the analysis of a head and neck squamous cell carcinoma dataset, lncDIFF also appeared to have higher sensitivity in identifying novel lncRNA genes with relatively large fold change and prognostic value. CONCLUSIONS: lncDIFF is a powerful differential analysis tool for low-abundance non-coding RNA expression data. This method is compatible with various existing RNA-Seq quantification and normalization tools. lncDIFF is implemented in an R package available at https://github.com/qianli10000/lncDIFF .
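As a rough illustration of a likelihood ratio test under a zero-inflated exponential model, the sketch below compares two groups of normalized counts for one gene. It is not lncDIFF's implementation: it assumes a plain likelihood rather than a quasi-likelihood, a zero proportion shared by both groups (so it cancels out of the test), and a group effect only on the mean of the non-zero values. The function name is hypothetical.

```python
import math


def zie_group_test(group_a, group_b):
    """Likelihood ratio test for a difference in the non-zero mean.

    Assumed model: a value is 0 with probability pi, otherwise exponential
    with mean mu. Null: one mu for both groups. Alternative: one mu per group.
    Returns (statistic, p_value) using a chi-square reference with 1 df.
    """
    if any(y < 0 for y in list(group_a) + list(group_b)):
        raise ValueError("normalized counts must be non-negative")
    pos_a = [y for y in group_a if y > 0]
    pos_b = [y for y in group_b if y > 0]
    if not pos_a or not pos_b:
        raise ValueError("each group needs at least one non-zero value")
    n_a, n_b = len(pos_a), len(pos_b)
    mean_a = sum(pos_a) / n_a                        # MLE of mu in group A
    mean_b = sum(pos_b) / n_b                        # MLE of mu in group B
    mean_pooled = (sum(pos_a) + sum(pos_b)) / (n_a + n_b)
    # At the MLE the exponential log-likelihood is -n * (log(mean) + 1),
    # so the "+ 1" terms cancel between the two models.
    stat = 2.0 * ((n_a + n_b) * math.log(mean_pooled)
                  - n_a * math.log(mean_a)
                  - n_b * math.log(mean_b))
    stat = max(stat, 0.0)                            # guard against rounding
    p_value = math.erfc(math.sqrt(stat / 2.0))       # chi-square(1) tail
    return stat, p_value


# Made-up normalized counts for one low-abundance gene in two groups.
stat, p_value = zie_group_test([0, 1, 1, 1], [0, 10, 10, 10])
```

With only a handful of non-zero values per group the chi-square approximation is crude, so this should be read as a demonstration of the test's structure rather than a usable DE caller.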
Project description:BACKGROUND: Experimental designs that take advantage of high-throughput sequencing to generate datasets include RNA sequencing (RNA-seq), chromatin immunoprecipitation sequencing (ChIP-seq), sequencing of 16S rRNA gene fragments, metagenomic analysis and selective growth experiments. In each case the underlying data are similar and are composed of counts of sequencing reads mapped to a large number of features in each sample. Despite this underlying similarity, the data analysis methods used for these experimental designs are all different, and do not translate across experiments. Alternative methods have been developed in the physical and geological sciences that treat similar data as compositions. Compositional data analysis methods transform the data to relative abundances with the result that the analyses are more robust and reproducible. RESULTS: Data from an in vitro selective growth experiment, an RNA-seq experiment and the Human Microbiome Project 16S rRNA gene abundance dataset were examined by ALDEx2, a compositional data analysis tool that uses Bayesian methods to infer technical and statistical error. The ALDEx2 approach is shown to be suitable for all three types of data: it correctly identifies both the direction and differential abundance of features in the differential growth experiment, it identifies a substantially similar set of differentially expressed genes in the RNA-seq dataset as the leading tools and it identifies as differential the taxa that distinguish the tongue dorsum and buccal mucosa in the Human Microbiome Project dataset. The design of ALDEx2 reduces the number of false positive identifications that result from datasets composed of many features in few samples. 
CONCLUSION: Statistical analysis of high-throughput sequencing datasets composed of per feature counts showed that the ALDEx2 R package is a simple and robust tool, which can be applied to RNA-seq, 16S rRNA gene sequencing and differential growth datasets, and by extension to other techniques that use a similar approach.
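A standard compositional transform is the centered log-ratio (CLR), sketched below for one sample of per-feature read counts. The abstract does not state which transform ALDEx2 applies or how zeros are handled, so the fixed pseudocount and the function name here are assumptions made purely for illustration.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's per-feature counts.

    Each value becomes log2(count) minus the mean log2 count of the sample,
    so the result depends only on ratios between features, not on the
    sequencing depth of the sample.
    """
    shifted = [c + pseudocount for c in counts]
    if not shifted or any(x <= 0 for x in shifted):
        raise ValueError("all shifted counts must be positive")
    logs = [math.log2(x) for x in shifted]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# The same composition sequenced at two depths gives the same CLR values
# when no pseudocount is needed.
shallow = clr([10, 20, 40], pseudocount=0)
deep = clr([100, 200, 400], pseudocount=0)
```

The transformed values sum to zero within a sample, which is why downstream comparisons reflect relative rather than absolute abundance.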
Project description:High-throughput sequencing of mRNA (RNA-seq) has become the standard method for measuring and comparing the levels of gene expression in a wide variety of species and conditions. RNA-seq experiments generate very large, complex data sets that demand fast, accurate and flexible software to reduce the raw read data to comprehensible results. HISAT (hierarchical indexing for spliced alignment of transcripts), StringTie and Ballgown are free, open-source software tools for comprehensive analysis of RNA-seq experiments. Together, they allow scientists to align reads to a genome, assemble transcripts including novel splice variants, compute the abundance of these transcripts in each sample and compare experiments to identify differentially expressed genes and transcripts. This protocol describes all the steps necessary to process a large set of raw sequencing reads and create lists of gene transcripts, expression levels, and differentially expressed genes and transcripts. The protocol's execution time depends on the computing resources, but it typically takes under 45 min of computer time. HISAT, StringTie and Ballgown are available from http://ccb.jhu.edu/software.shtml.
Project description:Molecular changes occurring during mammalian oocyte maturation are partly regulated by cytoplasmic polyadenylation (CP) and affect oocyte quality, yet the extent of CP activity during oocyte maturation remains unknown. Single bovine oocyte RNA sequencing (RNA-Seq) was performed to examine changes in transcript abundance during in vitro oocyte maturation in cattle. Polyadenylated RNA from individual germinal-vesicle and metaphase-II oocytes was amplified and processed for Illumina sequencing, producing approximately 30 million reads per replicate for each sample type. A total of 10,494 genes were found to be expressed, of which 2,455 were differentially expressed (adjusted P < 0.05 and fold change >2) between stages, with 503 and 1,952 genes respectively increasing and decreasing in abundance. Differentially expressed genes with complete 3'-untranslated-region sequence (279 increasing and 918 decreasing in polyadenylated transcript abundance) were examined for the presence, position, and distribution of motifs mediating CP, revealing enrichment (85%) and lack thereof (18%) in up- and down-regulated genes, respectively. Examination of total and polyadenylated RNA abundance by quantitative PCR validated these RNA-Seq findings. The observed increases in polyadenylated transcript abundance within the RNA-Seq data are likely due to CP, providing novel insight into targeted transcripts and resultant differential gene expression profiles that contribute to oocyte maturation.
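The selection rule quoted above (adjusted P < 0.05 and fold change > 2) can be sketched as follows. The abstract does not name the adjustment procedure, so Benjamini-Hochberg is assumed here; the function names and toy gene records are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def classify(genes, alpha=0.05, min_fold=2.0):
    """Split genes into increasing and decreasing lists.

    `genes` holds (name, fold_change, raw_p) tuples, with fold_change given
    as a positive ratio of the second stage to the first.
    """
    adjusted = benjamini_hochberg([p for _, _, p in genes])
    up, down = [], []
    for (name, fold_change, _), q in zip(genes, adjusted):
        if q >= alpha:
            continue
        if fold_change > min_fold:
            up.append(name)
        elif fold_change < 1.0 / min_fold:
            down.append(name)
    return up, down


# Made-up records: (gene, MII/GV ratio, raw p-value).
toy = [("a", 4.0, 0.001), ("b", 0.2, 0.002), ("c", 1.5, 0.001), ("d", 3.0, 0.9)]
increasing, decreasing = classify(toy)
```

In the study itself the two lists contained 503 and 1,952 genes, which together account for the 2,455 differentially expressed genes reported.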
Project description:One limitation of the widely used RNA-seq method is that long transcripts are represented by more reads than shorter transcripts, resulting in a biased estimation of expression levels. The 3' RNA-seq method, which yields only one sequence per transcript, bypasses this limitation. Here, RNA was extracted from two samples, in which we expected to find differentially expressed genes. Each was processed by both traditional and 3' RNA-seq protocols. Both methods yielded similar differentially expressed genes and estimated expression levels in a comparable way, confirming they both represent valid tools for RNA-seq analysis. Notably, however, we identified more differentially expressed transcripts with the 3' RNA-seq method, suggesting a greater power to detect expression variation using this method. Hence, when little genomic information is available for the species studied, the standard RNA-seq presents a better cost-benefit compromise, whereas for model species, the 3' RNA-seq method might more accurately detect differential expression.
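The length bias mentioned at the start of this abstract can be illustrated with an idealized model: under a full-length protocol the expected read count of a transcript scales with abundance times length, whereas a 3' protocol yields roughly one read per molecule. The sketch ignores fragmentation, mappability and sampling noise, uses made-up numbers, and brings in a TPM-style length normalization that the abstract itself does not mention.

```python
def expected_reads_full_length(molecules, lengths, total_reads):
    """Expected reads when sampling is proportional to molecules * length."""
    weights = [m * l for m, l in zip(molecules, lengths)]
    scale = total_reads / sum(weights)
    return [w * scale for w in weights]


def expected_reads_three_prime(molecules, total_reads):
    """Expected reads when each molecule contributes one 3' tag."""
    scale = total_reads / sum(molecules)
    return [m * scale for m in molecules]


def tpm(reads, lengths):
    """Length-normalized abundance scaled to one million (TPM-style)."""
    rates = [r / l for r, l in zip(reads, lengths)]
    scale = 1_000_000 / sum(rates)
    return [rate * scale for rate in rates]


# Two transcripts present at equal molar abundance but different lengths.
molecules = [100, 100]
lengths = [1_000, 4_000]

full = expected_reads_full_length(molecules, lengths, total_reads=1_000_000)
tags = expected_reads_three_prime(molecules, total_reads=1_000_000)
normalized = tpm(full, lengths)
```

In this toy case the longer transcript receives four times as many full-length reads despite equal abundance, while the 3' counts and the length-normalized values are equal for both transcripts.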
Project description:BACKGROUND: The use of RNA-sequencing (RNA-seq) in molecular biology research and clinical settings has increased significantly over the past decade. Despite its widespread adoption, there is a lack of simple and interactive tools to analyze and explore RNA-seq data. Many established tools require programming or Unix/Bash knowledge to analyze and visualize results. This requirement presents a significant barrier for many researchers to efficiently analyze and present RNA-seq data. RESULTS: Here we present BEAVR, a Browser-based tool for the Exploration And Visualization of RNA-seq data. BEAVR is an easy-to-use tool that facilitates interactive analysis and exploration of RNA-seq data. BEAVR is developed in R and uses DESeq2 as its engine for differential gene expression (DGE) analysis, but assumes users have no prior knowledge of R or DESeq2. BEAVR allows researchers to easily obtain a table of differentially-expressed genes with statistical testing and then visualize the results in a series of graphs, plots and heatmaps. Users are able to customize many parameters for statistical testing, dealing with variance, clustering methods and pathway analysis to generate high quality figures. CONCLUSION: BEAVR simplifies analysis for novice users but also streamlines the RNA-seq analysis process for experts by automating several steps. BEAVR and its documentation can be found on GitHub at https://github.com/developerpiru/BEAVR. BEAVR is available as a Docker container at https://hub.docker.com/r/pirunthan/beavr.
Project description:BACKGROUND: The past decade has seen an abundance of transcriptional profiling studies of preclinical models of persistent pain, predominantly employing microarray technology. In this study we directly compare exon microarrays to RNA-seq and investigate the ability of both platforms to detect differentially expressed genes following nerve injury using the L5 spinal nerve transection model of neuropathic pain. We also investigate the effects of increasing RNA-seq sequencing depth. Finally, we take advantage of the "agnostic" approach of RNA-seq to discover areas of expression outside of annotated exons that show marked changes in expression following nerve injury. RESULTS: RNA-seq and microarrays largely agree in terms of the genes called as differentially expressed. However, RNA-seq is able to interrogate a much larger proportion of the genome. It can also detect a greater number of differentially expressed genes than microarrays, across a wider range of fold changes, and is able to assign a larger range of expression values to the genes it measures. The number of differentially expressed genes detected increases with sequencing depth. RNA-seq also allows the discovery of a number of genes displaying unusual and interesting patterns of non-exonic expression following nerve injury, an effect that cannot be detected using microarrays. CONCLUSION: We recommend the use of RNA-seq for future high-throughput transcriptomic experiments in pain studies. RNA-seq allowed the identification of a larger number of putative candidate pain genes than microarrays and can also detect a wider range of expression values in a neuropathic pain model. In addition, RNA-seq can interrogate the whole genome regardless of prior annotations, being able to detect transcription from areas of the genome not currently annotated as exons. Some of these areas are differentially expressed following nerve injury, and may represent novel genes or isoforms.
We also recommend the use of a high sequencing depth in order to detect differential expression for genes with low levels of expression.
Project description:BACKGROUND: Despite the availability of many ready-made testing software packages, reliable detection of differentially expressed genes in RNA-seq data is not a trivial task. Even though the data collection is considered high-throughput, data analysis has intricacies that require careful human attention. Researchers should use modern data analysis techniques that incorporate visual feedback to verify the appropriateness of their models. While some RNA-seq packages provide static visualization tools, their capabilities should be expanded and their meaningfulness should be explicitly demonstrated to users. RESULTS: In this paper, we 1) introduce new interactive RNA-seq visualization tools and 2) compile a collection of examples that demonstrate to biologists why visualization should be an integral component of differential expression analysis. We use public RNA-seq datasets to show that our new visualization tools can detect normalization issues, differential expression designation problems, and common analysis errors. We also show that our new visualization tools can identify genes of interest in ways undetectable with models. Our R package "bigPint" includes the plotting tools introduced in this paper, many of which are unique additions to what is currently available. The "bigPint" website is located at https://lindsayrutter.github.io/bigPint and contains short vignette articles that introduce new users to our package, all written in reproducible code. CONCLUSIONS: We emphasize that interactive graphics should be an indispensable component of modern RNA-seq analysis, which is currently not the case. This paper and its corresponding software aim to persuade 1) users to slightly modify their differential expression analyses by incorporating statistical graphics into their usual analysis pipelines and 2) developers to create additional complex and interactive plotting methods for RNA-seq data, possibly using lessons learned from our open-source code.
We hope our work will play a small part in upgrading the RNA-seq analysis world into one that more holistically extracts biological information using both models and visuals.