Statistical quantification of methylation levels by next-generation sequencing.
ABSTRACT: Recently, next-generation sequencing-based technologies have enabled DNA methylation profiling at high resolution and low cost. Methyl-Seq and Reduced Representation Bisulfite Sequencing (RRBS) are two such technologies that interrogate methylation levels at CpG sites throughout the entire human genome. With rapid reduction of sequencing costs, these technologies will enable epigenotyping of large cohorts for phenotypic association studies. Existing quantification methods for sequencing-based methylation profiling are simplistic and do not deal with the noise due to the random sampling nature of sequencing and various experimental artifacts. Therefore, there is a need to investigate the statistical issues related to the quantification of methylation levels for these emerging technologies, with the goal of developing an accurate quantification method.In this paper, we propose two methods for Methyl-Seq quantification. The first method, the Maximum Likelihood estimate, is both conceptually intuitive and computationally simple. However, this estimate is biased at extreme methylation levels and does not provide variance estimation. The second method, based on bayesian hierarchical model, allows variance estimation of methylation levels, and provides a flexible framework to adjust technical bias in the sequencing process.We compare the previously proposed binary method, the Maximum Likelihood (ML) method, and the bayesian method. In both simulation and real data analysis of Methyl-Seq data, the bayesian method offers the most accurate quantification. The ML method is slightly less accurate than the bayesian method. But both our proposed methods outperform the original binary method in Methyl-Seq. In addition, we applied these quantification methods to simulation data and show that, with sequencing depth above 40-300 (which varies with different tissue samples) per cleavage site, Methyl-Seq offers a comparable quantification consistency as microarrays.
Project description:Genome-wide enrichment of methylated DNA followed by sequencing (MeDIP-seq) offers a reasonable compromise between experimental costs and genomic coverage. However, the computational analysis of these experiments is complex, and quantification of the enrichment signals in terms of absolute levels of methylation requires specific transformation. In this work, we present QSEA, Quantitative Sequence Enrichment Analysis, a comprehensive workflow for the modelling and subsequent quantification of MeDIP-seq data. As the central part of the workflow we have developed a Bayesian statistical model that transforms the enrichment read counts to absolute levels of methylation and, thus, enhances interpretability and facilitates comparison with other methylation assays. We suggest several calibration strategies for the critical parameters of the model, either using additional data or fairly general assumptions. By comparing the results with bisulfite sequencing (BS) validation data, we show the improvement of QSEA over existing methods. Additionally, we generated a clinically relevant benchmark data set consisting of methylation enrichment experiments (MeDIP-seq), BS-based validation experiments (Methyl-seq) as well as gene expression experiments (RNA-seq) derived from non-small cell lung cancer patients, and show that the workflow retrieves well-known lung tumour methylation markers that are causative for gene expression changes, demonstrating the applicability of QSEA for clinical studies. QSEA is implemented in R and available from the Bioconductor repository 3.4 (www.bioconductor.org/packages/qsea).
Project description:RNA-sequencing (RNA-seq) technology has emerged as the preferred method for quantification of gene and isoform expression. Numerous RNA-seq quantification tools have been proposed and developed, bringing us closer to developing expression-based diagnostic tests based on this technology. However, because of the rapidly evolving technologies and algorithms, it is essential to establish a systematic method for evaluating the quality of RNA-seq quantification. We investigate how different RNA-seq experimental designs (i.e., variations in sequencing depth and read length) affect various quantification algorithms (i.e., HTSeq, Cufflinks, and MISO). Using simulated data, we evaluate the quantification tools based on four metrics, namely: (1) total number of usable fragments for quantification, (2) detection of genes and isoforms, (3) correlation, and (4) accuracy of expression quantification with respect to the ground truth. Results show that Cufflinks is able to use the largest number of fragments for quantification, leading to better detection of genes and isoforms. However, HTSeq produces more accurate expression estimates. Moreover, each quantification algorithm is affected differently by varying sequencing depth and read length, suggesting that the selection of quantification algorithms should be application-dependent.
Project description:Next-generation RNA sequencing (RNA-seq) technology has been widely used to assess full-length RNA isoform abundance in a high-throughput manner. RNA-seq data offer insight into gene expression levels and transcriptome structures, enabling us to better understand the regulation of gene expression and fundamental biological processes. Accurate isoform quantification from RNA-seq data is challenging due to the information loss in sequencing experiments. A recent accumulation of multiple RNA-seq data sets from the same tissue or cell type provides new opportunities to improve the accuracy of isoform quantification. However, existing statistical or computational methods for multiple RNA-seq samples either pool the samples into one sample or assign equal weights to the samples when estimating isoform abundance. These methods ignore the possible heterogeneity in the quality of different samples and could result in biased and unrobust estimates. In this article, we develop a method, which we call "joint modeling of multiple RNA-seq samples for accurate isoform quantification" (MSIQ), for more accurate and robust isoform quantification by integrating multiple RNA-seq samples under a Bayesian framework. Our method aims to (1) identify a consistent group of samples with homogeneous quality and (2) improve isoform quantification accuracy by jointly modeling multiple RNA-seq samples by allowing for higher weights on the consistent group. We show that MSIQ provides a consistent estimator of isoform abundance, and we demonstrate the accuracy and effectiveness of MSIQ compared with alternative methods through simulation studies on D. melanogaster genes. We justify MSIQ's advantages over existing approaches via application studies on real RNA-seq data from human embryonic stem cells, brain tissues, and the HepG2 immortalized cell line. We also perform a comprehensive analysis of how the isoform quantification accuracy would be affected by RNA-seq sample heterogeneity and different experimental protocols.
Project description:Bisulfite treatment of DNA followed by high-throughput sequencing (Bisulfite-seq) is an important method for studying DNA methylation and epigenetic gene regulation, yet current software tools do not adequately address single nucleotide polymorphisms (SNPs). Identifying SNPs is important for accurate quantification of methylation levels and for identification of allele-specific epigenetic events such as imprinting. We have developed a model-based bisulfite SNP caller, Bis-SNP, that results in substantially better SNP calls than existing methods, thereby improving methylation estimates. At an average 30× genomic coverage, Bis-SNP correctly identified 96% of SNPs using the default high-stringency settings. The open-source package is available at http://epigenome.usc.edu/publicationdata/bissnp2011.
Project description:Recent advances in RNA sequencing (RNA-Seq) technology have offered unprecedented scope and resolution for transcriptome analysis. However, precise quantification of mRNA abundance and identification of differentially expressed genes are complicated due to biological and technical variations in RNA-Seq data.We systematically study the variation in count data and dissect the sources of variation into between-sample variation and within-sample variation. A novel Bayesian framework is developed for joint estimate of gene level mRNA abundance and differential state, which models the intrinsic variability in RNA-Seq to improve the estimation. Specifically, a Poisson-Lognormal model is incorporated into the Bayesian framework to model within-sample variation; a Gamma-Gamma model is then used to model between-sample variation, which accounts for over-dispersion of read counts among multiple samples. Simulation studies, where sequencing counts are synthesized based on parameters learned from real datasets, have demonstrated the advantage of the proposed method in both quantification of mRNA abundance and identification of differentially expressed genes. Moreover, performance comparison on data from the Sequencing Quality Control (SEQC) Project with ERCC spike-in controls has shown that the proposed method outperforms existing RNA-Seq methods in differential analysis. Application on breast cancer dataset has further illustrated that the proposed Bayesian model can 'blindly' estimate sources of variation caused by sequencing biases.We have developed a novel Bayesian hierarchical approach to investigate within-sample and between-sample variations in RNA-Seq data. Simulation and real data applications have validated desirable performance of the proposed method. The software package is available at http://www.cbil.ece.vt.edu/software.htm.
Project description:RNA-sequencing (RNA-Seq) has become a popular tool for transcriptome profiling in mammals. However, accurate estimation of allele-specific expression (ASE) based on alignments of reads to the reference genome is challenging, because it contains only one allele on a mosaic haploid genome. Even with the information of diploid genome sequences, precise alignment of reads to the correct allele is difficult because of the high-similarity between the corresponding allele sequences.We propose a Bayesian approach to estimate ASE from RNA-Seq data with diploid genome sequences. In the statistical framework, the haploid choice is modeled as a hidden variable and estimated simultaneously with isoform expression levels by variational Bayesian inference. Through the simulation data analysis, we demonstrate the effectiveness of the proposed approach in terms of identifying ASE compared to the existing approach. We also show that our approach enables better quantification of isoform expression levels compared to the existing methods, TIGAR2, RSEM and Cufflinks. In the real data analysis of the human reference lymphoblastoid cell line GM12878, some autosomal genes were identified as ASE genes, and skewed paternal X-chromosome inactivation in GM12878 was identified.The proposed method, called ASE-TIGAR, enables accurate estimation of gene expression from RNA-Seq data in an allele-specific manner. Our results show the effectiveness of utilizing personal genomic information for accurate estimation of ASE. An implementation of our method is available at http://nagasakilab.csml.org/ase-tigar .
Project description:BACKGROUND:Bread wheat has a large complex genome that makes whole genome resequencing costly. Therefore, genome complexity reduction techniques such as sequence capture make re-sequencing cost effective. With a high-quality draft wheat genome now available it is possible to design capture probe sets and to use them to accurately genotype and anchor SNPs to the genome. Furthermore, in addition to genetic variation, epigenetic variation provides a source of natural variation contributing to changes in gene expression and phenotype that can be profiled at the base pair level using sequence capture coupled with bisulphite treatment. Here, we present a new 12 Mbp wheat capture probe set, that allows both the profiling of genotype and methylation from the same DNA sample. Furthermore, we present a method, based on Agilent SureSelect Methyl-Seq, that will use a single capture assay as a starting point to allow both DNA sequencing and methyl-seq. RESULTS:Our method uses a single capture assay that is sequentially split and used for both DNA sequencing and methyl-seq. The resultant genotype and epi-type data is highly comparable in terms of coverage and SNP/methylation site identification to that generated from separate captures for DNA sequencing and methyl-seq. Furthermore, by defining SNP frequencies in a diverse landrace from the Watkins collection we highlight the importance of having genotype data to prevent false positive methylation calls. Finally, we present the design of a new 12 Mbp wheat capture and demonstrate its successful application to re-sequence wheat. CONCLUSIONS:We present a cost-effective method for performing both DNA sequencing and methyl-seq from a single capture reaction thus reducing reagent costs, sample preparation time and DNA requirements for these complementary analyses.
Project description:BACKGROUND: Transcriptome sequencing is a powerful tool for measuring gene expression, but as well as some other technologies, various artifacts and biases affect the quantification. In order to correct some of them, several normalization approaches have emerged, differing both in the statistical strategy employed and in the type of corrected biases. However, there is no clear standard normalization method. RESULTS: We present a novel methodology to normalize RNA-Seq data, taking into account transcript size, GC content, and sequencing depth, which are the major quantification-related biases. In this study, we found that transcripts shorter than 600 bp have an underestimated expression level, while longer transcripts are even more overestimated that they are long. Second, it was well known that the higher the GC content (>50%), the more the transcripts are underestimated. Third, we demonstrated that the sequencing depth impacts the size bias and proposed a correction allowing the comparison of expression levels among many samples. The efficiency of our approach was then tested by comparing the correlation between normalized RNA-Seq data and qRT-PCR expression measurements. All the steps are automated in a program written in Perl and available on request. CONCLUSIONS: The methodology presented in this article identifies and corrects different biases that influence RNA-Seq quantification, and provides more accurate estimations of gene expression levels. This method can be applied to compare expression quantifications from many samples, but preferentially from the same tissue. In order to compare samples from different tissue, a calibration using several reference genes will be required.
Project description:Alternatively spliced transcript isoforms are commonly observed in higher eukaryotes. The expression levels of these isoforms are key for understanding normal functions in healthy tissues and the progression of disease states. However, accurate quantification of expression at the transcript level is limited with current RNA-seq technologies because of, for example, limited read length and the cost of deep sequencing.A large number of tools have been developed to tackle this problem, and we performed a comprehensive evaluation of these tools using both experimental and simulated RNA-seq datasets. We found that recently developed alignment-free tools are both fast and accurate. The accuracy of all methods was mainly influenced by the complexity of gene structures and caution must be taken when interpreting quantification results for short transcripts. Using TP53 gene simulation, we discovered that both sequencing depth and the relative abundance of different isoforms affect quantification accuracy CONCLUSIONS: Our comprehensive evaluation helps data analysts to make informed choice when selecting computational tools for isoform quantification.
Project description:DNA methylation is catalysed by DNA methyltransferases (DNMTs) and is necessary for a correct embryonic development. On the other hand, the DNA demethylation is mediated by the Ten Eleven Translocation (Tet) proteins through oxidation of 5-methyl cytosine (5mC) to 5-hydroxyl (5hmC), 5-formyl (5fC) and 5-carboxyl (5caC) cytosine, and by the Thymine-DNA glycosylase (TDG) that excises the 5fC and 5caC. In embryonic stem cells (ESCs), gene promoters are maintained in an hypomethylated state, but the dynamics of this phenomenon still remains unknown. Here we present a genome-wide approach, named methylation-assisted bisulfite sequencing (MAB-Seq) that enables single-base resolution mapping of 5fC and 5caC and measuring of their relative abundance. Application of this method to mouse ESCs exposed the presence of 5fcaC residues on the hypomethylated promoters of the expressed genes, revealing an active DNA demethylation mechanism since the loss of TDG leads to an increase of 5fC/5caC. We also show that TDG is actually bound on these regions and that co-localizes and interacts with Tet1. We moreover demonstrate, by reduced representation of bisulfite sequencing (RRBS), that active promoters are actually demethylated by a Tet-dependent mechanism and that Dnmt1 and Dnmt3a are responsible of this DNA methylation. Our work shows the whole-genome map of 5fC and 5caC at single base resolution in ESCs, it demonstrates in detail the DNA methylation dynamics occurring on expressed gene promoters and identifies the key players of this mechanism. Furthermore, we provide a new tool (MAB-Seq) that can be broadly used in all biological contexts for epigenetics study involving identification and quantification of 5fC and 5caC at single base resolution. Methylation-assisted bisulfite sequencing (MAB-Seq) of E14 embryonic stem cells (ESCs), Biotag ChIP-Seq of Tdg and Reduced representation Bisulfite Sequencing (RRBS) in E14 ESCs.