Project description:Evaluating the similarity of different measured variables is a fundamental task in statistics, and a key part of many bioinformatics algorithms. Here we propose a Bayesian scheme for estimating the correlation between different entities' measurements based on high-throughput sequencing data. These entities could be different genes or miRNAs whose expression is measured by RNA-seq, different transcription factors or histone marks whose expression is measured by ChIP-seq, or even combinations of different types of entities. Our Bayesian formulation accounts for both measured signal levels and uncertainty in those levels, due to varying sequencing depth in different experiments and to varying absolute levels of individual entities, both of which affect the precision of the measurements. In comparison with a traditional Pearson correlation analysis, we show that our Bayesian correlation analysis retains high correlations when measurement confidence is high, but suppresses correlations when measurement confidence is low, especially for entities with low signal levels. In addition, we consider the influence of priors on the Bayesian correlation estimate. Perhaps surprisingly, we show that naive, uniform priors on entities' signal levels can lead to highly biased correlation estimates, particularly when different experiments have widely varying sequencing depths. However, we propose two alternative priors that provably mitigate this problem. We also prove that, like traditional Pearson correlation, our Bayesian correlation calculation constitutes a kernel in the machine learning sense, and thus can be used as a similarity measure in any kernel-based machine learning algorithm. We demonstrate our approach on two RNA-seq datasets and one miRNA-seq dataset.
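The idea of letting measurement uncertainty shrink a correlation can be illustrated with a small sketch. The description above does not give the model details, so everything here is an assumption made for illustration: each entity's signal level in each experiment gets a Beta posterior from its read count and the experiment's sequencing depth, posteriors of different entities are treated as independent, and the total variance is the spread of posterior means across experiments plus the average posterior variance. The function names and the `alpha`/`beta` prior parameters are hypothetical. The default `alpha = beta = 1` is the uniform prior that the description warns can bias estimates when depths vary widely; the two alternative priors it mentions are not specified there and are not reproduced.

```python
from math import sqrt


def beta_posterior(count, depth, alpha=1.0, beta=1.0):
    """Posterior mean and variance of a signal level under a Beta(alpha, beta) prior.

    Assumption: `count` reads out of `depth` total reads map to the entity.
    """
    a = alpha + count
    b = beta + depth - count
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, var


def bayesian_correlation(counts_i, counts_j, depths, alpha=1.0, beta=1.0):
    """Uncertainty-aware correlation between two entities across experiments.

    Illustrative sketch only, not the published method.
    """
    post_i = [beta_posterior(r, n, alpha, beta) for r, n in zip(counts_i, depths)]
    post_j = [beta_posterior(r, n, alpha, beta) for r, n in zip(counts_j, depths)]
    m = len(depths)
    mean_i = sum(mu for mu, _ in post_i) / m
    mean_j = sum(mu for mu, _ in post_j) / m
    # Covariance of the posterior means across experiments.
    cov = sum((mi - mean_i) * (mj - mean_j)
              for (mi, _), (mj, _) in zip(post_i, post_j)) / m
    # Total variance: spread of posterior means plus average posterior variance.
    var_i = sum((mu - mean_i) ** 2 + v for mu, v in post_i) / m
    var_j = sum((mu - mean_j) ** 2 + v for mu, v in post_j) / m
    return cov / sqrt(var_i * var_j)


depths = [1e6, 1e6, 1e6, 1e6]
# Two pairs of perfectly proportional profiles, at high and at low counts.
high = bayesian_correlation([1000, 2000, 3000, 4000], [2000, 4000, 6000, 8000], depths)
low = bayesian_correlation([1, 2, 3, 4], [2, 4, 6, 8], depths)
```

Both pairs have a Pearson correlation of exactly 1. In this sketch the high-count pair stays close to 1, while the low-count pair falls well below it, because the posterior variance dominates the denominator when counts are small.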
Project description:High-throughput sequencing assays such as RNA-Seq, ChIP-Seq or barcode counting provide quantitative readouts in the form of count data. To infer differential signal in such data correctly and with good statistical power, estimation of data variability throughout the dynamic range and a suitable error model are required. We propose a method based on the negative binomial distribution, with variance and mean linked by local regression, and present an implementation, DESeq, as an R/Bioconductor package.
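The link between mean and variance in a negative binomial error model can be made concrete with a short sketch. This is plain Python written for illustration, not the DESeq API: the function names are hypothetical, and the nearest-neighbour average in `local_variance` is a deliberately crude stand-in for the local regression the description refers to.

```python
from math import exp, lgamma, log


def nb_params(mean, variance):
    """Convert a mean/variance pair to negative binomial (size r, probability p).

    Requires variance > mean, i.e. overdispersion relative to Poisson.
    """
    if variance <= mean:
        raise ValueError("negative binomial requires variance > mean")
    p = mean / variance
    r = mean * mean / (variance - mean)
    return r, p


def nb_pmf(k, mean, variance):
    """Probability of observing count k under the mean/variance parameterisation."""
    r, p = nb_params(mean, variance)
    log_pmf = (lgamma(k + r) - lgamma(r) - lgamma(k + 1)
               + r * log(p) + k * log(1.0 - p))
    return exp(log_pmf)


def local_variance(means, variances, target_mean, k=3):
    """Estimate the variance at `target_mean` from genes with similar means.

    Assumption: averages the variances of the k genes whose mean is closest
    on the log scale; all means must be positive.
    """
    nearest = sorted(zip(means, variances),
                     key=lambda mv: abs(log(mv[0]) - log(target_mean)))[:k]
    return sum(v for _, v in nearest) / len(nearest)


# Hypothetical per-gene means and variances estimated from replicates.
gene_means = [1.0, 10.0, 100.0, 1000.0]
gene_vars = [2.0, 30.0, 1500.0, 120000.0]
fitted_var = local_variance(gene_means, gene_vars, target_mean=12.0, k=2)
prob_of_20 = nb_pmf(20, mean=12.0, variance=fitted_var)
```

Pooling variance information across genes with similar means is what makes such a model usable with few replicates: the per-gene variance estimate alone would be too noisy to support a test.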
Project description:MicroRNAs (miRNAs) regulate many genes critical for tumorigenesis. We profiled miRNAs from 11 normal breast tissues, 17 non-invasive, 151 invasive breast carcinomas, and 6 cell lines by in-house-developed barcoded Solexa sequencing. miRNAs were organized in genomic clusters representing promoter-controlled miRNA expression and sequence families representing seed-sequence-dependent miRNA-target regulation. Unsupervised clustering of samples by miRNA sequence families best reflected the clustering based on mRNA expression available for this sample set. Clustering and comparative analysis of miRNA read frequencies showed that normal breast samples were separated from most non-invasive ductal carcinoma in situ and invasive carcinomas by increased miR-21 (the most abundant miRNA in carcinomas) and multiple decreased miRNA families (including mir-98/let-7), with most miRNA changes apparent already in the non-invasive carcinomas. In addition, patients who went on to develop metastasis demonstrated increased expression of mir-423, and triple negative breast carcinomas were most distinct from other tumor subtypes due to up-regulation of the mir-17~92 cluster. However, absolute miRNA levels between normal breast and carcinomas did not reveal any significant differences. We also discovered two polymorphic nucleotide variations among the more abundant miRNAs miR-181a (T19G) and miR-185 (T16G), but we did not identify nucleotide variations expected for classical tumor suppressor function associated with miRNAs. The differentiation of tumor subtypes and prediction of metastasis based on miRNA levels are statistically possible, but are not driven by deregulation of abundant miRNAs, implicating far fewer miRNAs in tumorigenic processes than previously suggested.