Project description:Quantifying exposure-disease associations is a central issue in epidemiology. Researchers often present an odds ratio (or the logarithm of the odds ratio, logOR) estimate together with its confidence interval (CI) for each exposure examined. Here the authors advocate using empirical-Bayes-based 'prediction intervals' (PIs) to bound the uncertainty of logORs. The PI approach is applicable to a panel of factors believed to be exchangeable (no extra information, other than the data itself, is available to distinguish some logORs from the others). The authors demonstrate its use in a genetic epidemiological study on age-related macular degeneration (AMD). The proposed PIs enjoy straightforward probabilistic interpretations--a 95% PI has a probability of 0.95 of encompassing the true value, and for a total of m 95% PIs the expected number of true values encompassed is 0.95m. The PI approach is theoretically more efficient (producing shorter intervals) than the traditional CI approach; in the AMD data, the average efficiency gain is 51.2%. The authors advocate the PI approach for presenting the uncertainties of many logORs in a study, because of its straightforward probabilistic interpretation and higher efficiency while maintaining the nominal coverage probability.
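A minimal sketch of one common way such empirical-Bayes intervals can be formed for exchangeable logORs, assuming a normal-normal model with plug-in hyperparameters estimated by a DerSimonian-Laird-type moment method; the function name and the toy numbers are illustrative, and this is not necessarily the authors' exact construction (in particular, uncertainty in the estimated hyperparameters is ignored here).

```python
import numpy as np
from scipy.stats import norm

def eb_prediction_intervals(logor, se, level=0.95):
    """Empirical-Bayes intervals for exchangeable logORs under a normal-normal
    model: logor_i ~ N(theta_i, se_i^2), theta_i ~ N(mu, tau2)."""
    logor, se = np.asarray(logor, float), np.asarray(se, float)
    w = 1.0 / se**2
    ybar = np.sum(w * logor) / np.sum(w)
    q = np.sum(w * (logor - ybar) ** 2)
    k = len(logor)
    # DerSimonian-Laird-type moment estimate of the between-factor variance.
    tau2 = max(0.0, (q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    w_star = 1.0 / (se**2 + tau2)
    mu = np.sum(w_star * logor) / np.sum(w_star)
    # Shrink each estimate toward mu; the interval width shrinks accordingly.
    b = se**2 / (se**2 + tau2)
    post_mean = b * mu + (1 - b) * logor
    post_sd = np.sqrt((1 - b) * se**2)
    z = norm.ppf(0.5 + level / 2)
    return post_mean - z * post_sd, post_mean + z * post_sd

# Toy usage: three exchangeable logOR estimates with their standard errors.
lo, hi = eb_prediction_intervals([0.10, 0.45, 0.30], [0.12, 0.15, 0.10])
print(np.round(lo, 3), np.round(hi, 3))
```

The resulting intervals are shorter than the corresponding Wald CIs (estimate ± z·se) whenever the shrinkage factor is nonzero, which is the source of the efficiency gain described above.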
Project description:A number of empirical Bayes models (each with different statistical distribution assumptions) have now been developed to analyze differential DNA methylation using high-density oligonucleotide tiling arrays. However, it remains unclear which model performs best. For example, the sensitivity of analyses of differentially methylated regions for conserved and functional sequence characteristics (e.g., enrichment of transcription factor-binding sites (TFBSs)) under various empirical Bayes models remains unclear. In this paper, five empirical Bayes models were constructed, based on either a gamma distribution or a log-normal distribution, for the identification of differentially methylated loci and their cell-division- (1, 3, and 5) and drug-treatment- (cisplatin) dependent methylation patterns. While differential methylation patterns generated by the log-normal models were enriched with numerous TFBSs, we observed almost no TFBS-enriched sequences using the gamma-based models. Statistical and biological results suggest that a log-normal, rather than a gamma, distribution makes the empirical Bayes model a highly accurate and precise method for differential methylation microarray analysis. In addition, we presented one of the log-normal models for differential methylation analysis and tested its reproducibility in a simulation study. We believe this research to be the first extensive comparison of statistical modeling for the analysis of differential DNA methylation, an important biological phenomenon that precisely regulates gene transcription.
Project description:In meta-analysis, the heterogeneity of effect sizes across component studies is typically described by a variance parameter in a random-effects (RE) model. In the literature, methods for constructing confidence intervals (CIs) for this parameter often assume that study-level effect sizes are normally distributed. However, this assumption might be violated in practice, especially in meta-analysis of rare binary events. We propose to use jackknife empirical likelihood (JEL), a nonparametric approach that uses jackknife pseudo-values, to construct CIs for the heterogeneity parameter. To compute jackknife pseudo-values, we employ a moment-based estimator and consider two commonly used weighting schemes (i.e., equal and inverse-variance weights). We prove that with each scheme, the resulting log empirical likelihood ratio follows a chi-square distribution asymptotically. We further examine the performance of the proposed JEL methods and compare them with existing CIs through simulation studies and data examples that focus on rare binary events. Our numerical results suggest that the JEL method with equal weights compares favorably to alternatives, especially when (observed) effect sizes are non-normal and the number of component studies is large. Thus, it is worth serious consideration in statistical inference.
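A minimal sketch of the JEL recipe under the equal-weight scheme: pseudo-values are built from a Cochran ANOVA-type moment estimator of the heterogeneity variance, and the empirical-likelihood ratio for their mean is calibrated against a chi-square(1) distribution. The choice of moment estimator, the omission of truncation at zero, and the helper names are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np
from scipy.optimize import brentq

def tau2_equal_weight(y, v):
    """Equal-weight (Cochran ANOVA-type) moment estimator of the heterogeneity
    variance; left untruncated so the pseudo-values vary smoothly."""
    return np.var(y, ddof=1) - np.mean(v)

def jackknife_pseudo_values(y, v, estimator=tau2_equal_weight):
    """V_i = n*T(all studies) - (n-1)*T(leave study i out)."""
    y, v = np.asarray(y, float), np.asarray(v, float)
    n = len(y)
    t_full = estimator(y, v)
    keep = ~np.eye(n, dtype=bool)
    t_loo = np.array([estimator(y[keep[i]], v[keep[i]]) for i in range(n)])
    return n * t_full - (n - 1) * t_loo

def jel_ratio(pseudo, theta0):
    """-2 log empirical-likelihood ratio for the mean of the pseudo-values;
    asymptotically chi-square with 1 df under H0: mean = theta0."""
    d = pseudo - theta0
    if d.min() >= 0 or d.max() <= 0:
        return np.inf                      # theta0 outside the convex hull
    lo, hi = -1.0 / d.max() + 1e-10, -1.0 / d.min() - 1e-10
    lam = brentq(lambda l: np.sum(d / (1 + l * d)), lo, hi)
    return 2.0 * np.sum(np.log1p(lam * d))

# Toy usage: 20 studies with log odds ratios y and within-study variances v.
rng = np.random.default_rng(1)
v = rng.uniform(0.05, 0.2, 20)
y = rng.normal(0.3, np.sqrt(v + 0.1))
pv = jackknife_pseudo_values(y, v)
print(jel_ratio(pv, 0.1))   # compare with the chi-square(1) quantile 3.84
```

A CI for the heterogeneity parameter is then obtained by collecting the values of theta0 whose ratio falls below the chi-square(1) critical value.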
Project description:Background: An important goal of whole-genome studies concerned with single nucleotide polymorphisms (SNPs) is the identification of SNPs associated with a covariate of interest such as the case-control status or the type of cancer. Since these studies often comprise the genotypes of hundreds of thousands of SNPs, methods are required that can cope with the corresponding multiple testing problem. For the analysis of gene expression data, approaches such as the empirical Bayes analysis of microarrays have been developed particularly for the detection of genes associated with the response. However, the empirical Bayes analysis of microarrays has only been suggested for binary responses when considering expression values, i.e. continuous predictors. Results: In this paper, we propose a modification of this empirical Bayes analysis that can be used to analyze high-dimensional categorical SNP data. This approach, along with a generalized version of the original empirical Bayes method, is available in the R package siggenes version 1.10.0 and later, which can be downloaded from http://www.bioconductor.org. Conclusion: As applications to two subsets of the HapMap data show, the empirical Bayes analysis of microarrays can not only be used to analyze continuous gene expression data, but can also be applied to categorical SNP data, where the response is not restricted to be binary. In association studies in which typically several tens to a few hundred SNPs are considered, our approach can furthermore be employed to test interactions of SNPs. Moreover, the posterior probabilities resulting from the empirical Bayes analysis of (prespecified) interactions/genotypes can also be used to quantify the importance of these interactions.
Project description:Background: Advances in mass spectrometry-based proteomics have enabled the incorporation of proteomic data into systems approaches to biology. However, development of analytical methods has lagged behind. Here we describe an empirical Bayes framework for quantitative proteomics data analysis. The method provides a statistical description of each experiment, including the number of proteins that differ in abundance between 2 samples, the experiment's statistical power to detect them, and the false-positive probability of each protein. Methodology/principal findings: We analyzed 2 types of mass spectrometric experiments. First, we showed that the method identified the protein targets of small molecules in affinity purification experiments with high precision. Second, we re-analyzed a mass spectrometric data set designed to identify proteins regulated by microRNAs. Our results were supported by sequence analysis of the 3' UTR regions of predicted target genes, and we found that the previously reported conclusion that a large fraction of the proteome is regulated by microRNAs was not supported by our statistical analysis of the data. Conclusions/significance: Our results highlight the importance of rigorous statistical analysis of proteomic data, and the method described here provides a statistical framework to robustly and reliably interpret such data.
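A generic sketch of the kind of per-protein false-positive probability described here, phrased as a two-group empirical Bayes local false discovery rate on protein-level z-scores, with a theoretical N(0,1) null and a kernel estimate of the mixture density; the authors' actual model and parameterization are not reproduced.

```python
import numpy as np
from scipy.stats import norm, gaussian_kde

def local_fdr(z, pi0=None):
    """Per-protein false-positive probability fdr(z) = pi0 * f0(z) / f(z),
    with f0 a standard-normal null and f a kernel estimate of the mixture."""
    z = np.asarray(z, float)
    f = gaussian_kde(z)(z)                 # mixture density at each z-score
    f0 = norm.pdf(z)                       # theoretical null density
    if pi0 is None:
        # Crude estimate of the null proportion from the central z-scores.
        pi0 = min(1.0, np.mean(np.abs(z) < 1) / (norm.cdf(1) - norm.cdf(-1)))
    return np.clip(pi0 * f0 / f, 0, 1)

# Toy usage: 950 unchanged proteins plus 50 with shifted abundance ratios.
rng = np.random.default_rng(0)
z = np.concatenate([rng.normal(0, 1, 950), rng.normal(3, 1, 50)])
fdr = local_fdr(z)
print(np.sum(fdr < 0.1))   # proteins called differentially abundant
```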
Project description:We provide a nonparametric estimate of τ-restricted mean survival using follow-up information beyond τ when appropriate to improve precision. The variance accounts for correlation between follow-up windows. Both asymptotic calculations and simulation studies recommend follow-up intervals spaced approximately τ/2 apart.
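For reference, a minimal sketch of the baseline quantity being improved upon: the standard restricted mean survival time ∫_0^τ Ŝ(t) dt, computed as the area under the Kaplan-Meier curve. The refinement that borrows follow-up information beyond τ, and its variance estimator, are not reproduced; the function name and toy data are illustrative.

```python
import numpy as np

def rmst_kaplan_meier(time, event, tau):
    """Restricted mean survival time up to tau: the area under the
    Kaplan-Meier step function, integrated exactly."""
    time, event = np.asarray(time, float), np.asarray(event, int)
    order = np.argsort(time)
    time, event = time[order], event[order]
    n = len(time)
    at_risk = n - np.arange(n)
    surv = np.cumprod(1.0 - event / at_risk)   # S(t) just after each time
    grid = np.concatenate([[0.0], time, [tau]])
    step = np.concatenate([[1.0], surv, [surv[-1] if n else 1.0]])
    area, s_prev, t_prev = 0.0, 1.0, 0.0
    for t, s in zip(grid[1:], step[1:]):
        t_clip = min(t, tau)
        if t_clip > t_prev:                    # add rectangle up to this time
            area += s_prev * (t_clip - t_prev)
            t_prev = t_clip
        s_prev = s
    return area

# Toy usage: times in months, event = 1 for death, 0 for censoring.
t = [2, 5, 6, 8, 11, 14, 20]
d = [1, 1, 0, 1, 0, 1, 0]
print(rmst_kaplan_meier(t, d, tau=12.0))
```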
Project description:Motivation: Computational inference methods that make use of graphical models to extract regulatory networks from gene expression data can have difficulty reconstructing dense regions of a network, a consequence of both computational complexity and unreliable parameter estimation when sample size is small. As a result, hub genes are especially difficult for these methods to identify. Methods: We present a new algorithm, Empirical Light Mutual Min (ELMM), for large network reconstruction that has properties well suited for dense graph recovery. ELMM reconstructs the undirected graph of a regulatory network using empirical Bayes conditional independence testing with a heuristic relaxation of independence constraints in dense areas of the graph. This relaxation allows only one gene of a pair with a putative relation to be aware of the network connection, an approach aimed at easing the multiple testing problems associated with recovering densely connected structures. Results: Using in silico data, we show that ELMM has better performance than commonly used network inference algorithms, including the PC algorithm, GeneNet, and ARACNE. We also apply ELMM to reconstruct a network among 5,400 genes expressed in human lung airway epithelium of healthy nonsmokers, healthy smokers, and smokers with pulmonary diseases, assayed using microarrays. The analysis identifies dense subnetworks that are consistent with known regulatory relationships in the lung airway and also suggests novel hub regulatory relationships among a number of genes that play roles in oxidative stress, wound response, and secretion.
Project description:Consider a Bayesian setup in which we observe Y, whose distribution depends on a parameter θ, that is, Y | θ ~ π_{Y|θ}. The parameter θ is unknown and treated as random, and a prior distribution chosen from some parametric family {ν_h(·), h ∈ ℋ} is to be placed on it. For the subjective Bayesian there is a single prior in the family which represents his or her beliefs about θ, but determination of this prior is very often extremely difficult. In the empirical Bayes approach, the latent distribution on θ is estimated from the data. This is usually done by choosing the value of the hyperparameter h that maximizes some criterion. Arguably the most common way of doing this is to let m(h) be the marginal likelihood of h, that is, m(h) = ∫ π_{Y|θ}(y | θ) ν_h(θ) dθ, and to choose the value of h that maximizes m(·). Unfortunately, except for a handful of textbook examples, analytic evaluation of argmax_h m(h) is not feasible. The purpose of this paper is two-fold. First, we review the literature on estimating it and find that the most commonly used procedures are either potentially highly inaccurate or don't scale well with the dimension of h, the dimension of θ, or both. Second, we present a method for estimating argmax_h m(h), based on Markov chain Monte Carlo, that applies very generally and scales well with dimension. Let g be a real-valued function of θ, and let I(h) be the posterior expectation of g(θ) when the prior is ν_h. As a byproduct of our approach, we show how to obtain point estimates and globally-valid confidence bands for the family I(h), h ∈ ℋ. To illustrate the scope of our methodology we provide three detailed examples, having different characters.
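A minimal sketch of the basic identity that posterior-sample-based estimators of the marginal likelihood exploit: for a fixed baseline h1, m(h)/m(h1) = E[ν_h(θ)/ν_{h1}(θ)], where the expectation is over the posterior corresponding to h1. The toy normal model, the direct posterior sampling (standing in for MCMC output), the grid search, and the specific prior family are illustrative assumptions, not the method proposed in the paper.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Toy model: y_i | theta ~ N(theta, 1), prior nu_h = N(0, h) with h > 0.
y = rng.normal(1.2, 1.0, size=30)
n, ybar = len(y), np.mean(y)

# Posterior under a fixed baseline hyperparameter h1 (conjugate here, so we
# can draw directly; in general these would be MCMC draws).
h1 = 1.0
post_var = 1.0 / (n + 1.0 / h1)
theta = rng.normal(post_var * n * ybar, np.sqrt(post_var), size=50_000)

# m(h)/m(h1) = E_posterior(h1)[ nu_h(theta) / nu_h1(theta) ]; estimate on a grid.
grid = np.linspace(0.05, 5.0, 200)
ratios = np.array([np.mean(norm.pdf(theta, 0, np.sqrt(h)) /
                           norm.pdf(theta, 0, np.sqrt(h1))) for h in grid])
h_hat = grid[np.argmax(ratios)]

# Exact answer for this conjugate toy problem, for comparison: since
# ybar | h ~ N(0, h + 1/n), m(h) is maximized at h = max(0, ybar^2 - 1/n).
print(h_hat, max(0.0, ybar**2 - 1.0 / n))
```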
Project description:Studying small effects or subtle neuroanatomical variation requires data with large sample sizes. As a result, combining neuroimaging data from multiple datasets is necessary. Variation in acquisition protocols, magnetic field strength, scanner build, and many other non-biological factors can introduce undesirable bias into studies. Hence, harmonization is required to remove the bias-inducing factors from the data. ComBat is one of the most common methods applied to features from structural images. ComBat models the data using a hierarchical Bayesian model and uses the empirical Bayes approach to infer the distribution of the unknown factors. The empirical Bayes harmonization method is computationally efficient and provides valid point estimates. However, it tends to underestimate uncertainty. This paper investigates a new approach, fully Bayesian ComBat, where Monte Carlo sampling is used for statistical inference. When comparing fully Bayesian and empirical Bayesian ComBat, we found that empirical Bayesian ComBat more effectively removed scanner strength information and was much more computationally efficient. Conversely, fully Bayesian ComBat better preserved biological disease- and age-related information while performing more accurate harmonization on traveling subjects. The fully Bayesian approach generates a rich posterior distribution, which is useful for generating simulated imaging features to improve classifier performance in a limited-data setting. We show the generative capacity of our model for augmenting and improving the detection of patients with Alzheimer's disease. Posterior distributions for harmonized imaging measures can also be used for brain-wide uncertainty comparison and more principled downstream statistical analysis. Code for our new fully Bayesian ComBat extension is available at https://github.com/batmanlab/BayesComBat.
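A minimal location-only sketch of the empirical Bayes adjustment underlying ComBat: standardize each feature, estimate per-feature batch shifts, shrink them toward a batch-level prior estimated across features, and subtract. The full location-and-scale model, covariate preservation, and the fully Bayesian extension in the linked repository are not reproduced; names and toy data are illustrative.

```python
import numpy as np

def combat_location_only(x, batch):
    """x: features x samples matrix; batch: length-n array of batch labels.
    Returns x with empirical-Bayes-shrunken per-batch location shifts removed."""
    x = np.asarray(x, float)
    batch = np.asarray(batch)
    alpha = x.mean(axis=1, keepdims=True)           # per-feature grand mean
    sigma = x.std(axis=1, ddof=1, keepdims=True)    # per-feature scale
    z = (x - alpha) / sigma                         # standardized data
    z_adj = z.copy()
    for b in np.unique(batch):
        cols = batch == b
        n_b = cols.sum()
        gamma_hat = z[:, cols].mean(axis=1)         # per-feature batch shift
        # Empirical Bayes: normal prior on the shifts, moments across features.
        gamma_bar, tau2 = gamma_hat.mean(), gamma_hat.var(ddof=1)
        # Posterior mean shrinks feature-level shifts toward the batch mean
        # (within-batch residual variance taken as 1 after standardization).
        gamma_star = (n_b * tau2 * gamma_hat + gamma_bar) / (n_b * tau2 + 1.0)
        z_adj[:, cols] -= gamma_star[:, None]
    return z_adj * sigma + alpha                    # back to the original scale

# Toy usage: 100 features, two scanners of 20 subjects each, scanner 2 shifted.
rng = np.random.default_rng(3)
x = rng.normal(0, 1, (100, 40))
x[:, 20:] += rng.normal(0.5, 0.2, (100, 1))         # additive scanner effect
labels = np.array([1] * 20 + [2] * 20)
x_harmonized = combat_location_only(x, labels)
```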
Project description:Discrete random structures are important tools in Bayesian nonparametrics and the resulting models have proven effective in density estimation, clustering, topic modeling and prediction, among others. In this paper, we consider nested processes and study the dependence structures they induce. Dependence ranges between homogeneity, corresponding to full exchangeability, and maximum heterogeneity, corresponding to (unconditional) independence across samples. The popular nested Dirichlet process is shown to degenerate to the fully exchangeable case when there are ties across samples at the observed or latent level. To overcome this drawback, inherent to nesting general discrete random measures, we introduce a novel class of latent nested processes. These are obtained by adding common and group-specific completely random measures and, then, normalizing to yield dependent random probability measures. We provide results on the partition distributions induced by latent nested processes, and develop a Markov Chain Monte Carlo sampler for Bayesian inferences. A test for distributional homogeneity across groups is obtained as a by-product. The results and their inferential implications are showcased on synthetic and real data.
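A crude finite-truncation sketch of the construction described above: each group's random probability measure is obtained by normalizing the sum of a shared completely random measure and a group-specific one, with each gamma CRM approximated here by K gamma-distributed jumps. The truncation level, total masses, and base distribution are illustrative assumptions, and the sampler for posterior inference is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(4)
K = 200                       # truncation level for each approximate gamma CRM
a_shared, a_own = 1.0, 1.0    # total masses of shared and group-specific parts

def gamma_crm(total_mass):
    """Finite approximation of a gamma CRM on a N(0,1) base measure:
    K atoms with Gamma(total_mass/K, 1) jumps (summing to ~Gamma(total_mass, 1))."""
    atoms = rng.normal(0, 1, K)
    jumps = rng.gamma(total_mass / K, 1.0, K)
    return atoms, jumps

# The shared component induces dependence; group-specific parts add heterogeneity.
shared_atoms, shared_jumps = gamma_crm(a_shared)
groups = []
for _ in range(2):
    own_atoms, own_jumps = gamma_crm(a_own)
    atoms = np.concatenate([shared_atoms, own_atoms])
    jumps = np.concatenate([shared_jumps, own_jumps])
    probs = jumps / jumps.sum()            # normalize the summed CRM
    groups.append((atoms, probs))

# Draw observations from each group's dependent random probability measure.
samples = [a[rng.choice(len(a), size=100, p=p)] for a, p in groups]
print([np.round(s.mean(), 3) for s in samples])
```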