Variable Selection and Joint Estimation of Mean and Covariance Models with an Application to eQTL Data.
ABSTRACT: In genomic data analysis, the underlying regulatory relationships among multiple genes are often hard to ascertain because of unknown genetic complexity and epigenetic regulation. In this paper, we consider a joint mean and constant covariance model (JMCCM) that elucidates the conditional dependence structure of genes while controlling for potential genotype perturbations. To this end, the modified Cholesky decomposition is used to parametrize the entries of the precision matrix. Parameters of the JMCCM are estimated by maximizing the likelihood function. We also develop a variable selection algorithm that selects explanatory variables and Cholesky factors using a combination of GCV and BIC as benchmarks, together with Rao and Wald statistics. Importantly, sparse estimation of the precision matrix (or, equivalently, of the gene network) is effectively achieved by the proposed variable selection scheme, which also helps identify significant hub genes that are concordant with prior biological evidence. In simulation studies, we confirm that our model selection procedure efficiently identifies the true underlying networks. With an application to mRNA expression and SNP data from yeast (i.e., eQTL data), we demonstrate that the constructed gene networks reproduce validated biological and clinical knowledge about various pathways, including the cell cycle pathway.
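As a minimal illustration of the modified Cholesky parametrization mentioned above (a sketch, not the authors' JMCCM fitting code), the snippet below computes a unit lower-triangular T and diagonal D with T Σ Tᵀ = D, so that the precision matrix is Σ⁻¹ = Tᵀ D⁻¹ T; sparsity imposed on T then translates into sparsity of the precision matrix, i.e., of the gene network.

```python
import numpy as np

def modified_cholesky(cov):
    """Modified Cholesky decomposition: find a unit lower-triangular T and a
    diagonal D with T @ cov @ T.T = D, so that inv(cov) = T.T @ inv(D) @ T.
    Row j of T holds the negated coefficients from regressing variable j on
    variables 0..j-1; D holds the corresponding innovation variances."""
    p = cov.shape[0]
    T = np.eye(p)
    d = np.empty(p)
    d[0] = cov[0, 0]
    for j in range(1, p):
        phi = np.linalg.solve(cov[:j, :j], cov[:j, j])   # regression on predecessors
        T[j, :j] = -phi
        d[j] = cov[j, j] - cov[:j, j] @ phi              # innovation variance
    return T, np.diag(d)

# quick check on a small positive-definite matrix
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
cov = A @ A.T + 5 * np.eye(5)
T, D = modified_cholesky(cov)
assert np.allclose(T.T @ np.linalg.inv(D) @ T, np.linalg.inv(cov))
```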
Project description:In longitudinal studies, serial dependence of repeated outcomes must be taken into account to make correct inferences about covariate effects. As such, care must be taken in modeling the covariance matrix. However, estimating the covariance matrix is challenging because it contains many parameters and the estimate must be positive definite. To overcome these limitations, two Cholesky decomposition approaches have been proposed: the modified Cholesky decomposition for autoregressive (AR) structure and the moving-average Cholesky decomposition for moving average (MA) structure. However, the correlations of repeated outcomes are often not captured parsimoniously by either approach alone. In this paper, we propose a class of flexible, nonstationary, heteroscedastic models that exploit the structure obtained by combining AR and MA modeling of the covariance matrix, which we denote ARMACD. We analyze a recent lung cancer study to illustrate the power of our proposed methods.
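A rough numpy sketch of the two decompositions being combined (not the ARMACD fitting procedure itself): the AR-type modified Cholesky factor T satisfies T Σ Tᵀ = D, while the MA-type factor L = T⁻¹ satisfies Σ = L D Lᵀ, so the two parametrize the same covariance through autoregressive and moving-average coefficients, respectively.

```python
import numpy as np
from scipy.linalg import cholesky

def ar_and_ma_factors(cov):
    """Return the AR-type factor T (T @ cov @ T.T = D), the MA-type factor
    L = inv(T) (cov = L @ D @ L.T), and the diagonal D of innovation variances.
    A sketch contrasting the two parametrizations; not the ARMACD model."""
    C = cholesky(cov, lower=True)   # standard Cholesky factor, cov = C @ C.T
    d = np.diag(C) ** 2             # innovation variances
    L = C / np.diag(C)              # unit lower-triangular moving-average factor
    T = np.linalg.inv(L)            # unit lower-triangular autoregressive factor
    return T, L, np.diag(d)

# illustrative 3x3 covariance of three repeated measurements
cov = np.array([[2.0, 0.8, 0.3], [0.8, 1.5, 0.6], [0.3, 0.6, 1.2]])
T, L, D = ar_and_ma_factors(cov)
assert np.allclose(T @ cov @ T.T, D) and np.allclose(L @ D @ L.T, cov)
```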
Project description:In this article, we propose a computationally efficient approach to estimate (large) p-dimensional covariance matrices of ordered (or longitudinal) data based on an independent sample of size n. To do this, we construct the estimator based on a k-band partial autocorrelation matrix with the number of bands chosen using an exact multiple hypothesis testing procedure. This approach is considerably faster than many existing methods and only requires inversion of (k + 1)-dimensional covariance matrices. The resulting estimator is positive definite as long as k < n (where p can be larger than n). We make connections between this approach and banding the Cholesky factor of the modified Cholesky decomposition of the inverse covariance matrix (Wu and Pourahmadi, 2003) and show that the maximum likelihood estimator of the k-band partial autocorrelation matrix is the same as the k-band inverse Cholesky factor. We evaluate our estimator via extensive simulations and illustrate the approach using high-dimensional sonar data.
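The banding idea referenced above can be sketched directly from data: regress each variable on at most its k nearest predecessors and assemble the fitted coefficients and residual variances into the inverse Cholesky factor. This is only a schematic of a banded Cholesky construction (the band size k is assumed to have been chosen already, e.g. by a testing procedure like the one described above), not the authors' partial-autocorrelation estimator.

```python
import numpy as np

def k_band_precision_estimate(X, k):
    """Estimate a precision matrix by k-banding the inverse Cholesky factor:
    each variable is regressed on at most its k nearest predecessors, and the
    coefficients and residual variances are assembled into
    inv(Sigma_hat) = T.T @ inv(D) @ T. Positive definite whenever the
    residual variances are positive (roughly, k < n)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    T = np.eye(p)
    d = np.empty(p)
    d[0] = Xc[:, 0].var()
    for j in range(1, p):
        lo = max(0, j - k)                       # only the k nearest predecessors
        Z = Xc[:, lo:j]
        beta, *_ = np.linalg.lstsq(Z, Xc[:, j], rcond=None)
        T[j, lo:j] = -beta
        d[j] = (Xc[:, j] - Z @ beta).var()       # residual (innovation) variance
    return T.T @ np.diag(1.0 / d) @ T
```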
Project description:The estimation of the covariance matrix is a key concern in the analysis of longitudinal data. When the data consist of multiple groups, the covariance matrices are often assumed to be either equal across groups or completely distinct. We seek methodology that allows borrowing of strength across potentially similar groups to improve estimation. To that end, we introduce a covariance partition prior, which proposes a partition of the groups at each measurement time. Groups in the same set of the partition share dependence parameters for the distribution of the current measurement given the preceding ones, and the sequence of partitions is modeled as a Markov chain to encourage similar structure at nearby measurement times. This approach additionally encourages a lower-dimensional structure of the covariance matrices by shrinking the parameters of the Cholesky decomposition toward zero. We demonstrate the performance of our model through two simulation studies and the analysis of data from a depression study. This article includes Supplementary Material available online.
Project description:There has been an intense development in the Bayesian graphical model literature over the past decade; however, most of the existing methods are restricted to moderate dimensions. We propose a novel graphical model selection approach for large dimensional settings where the dimension increases with the sample size, by decoupling model fitting and covariance selection. First, a full model based on a complete graph is fit under a novel class of mixtures of inverse-Wishart priors, which induce shrinkage on the precision matrix under an equivalence with Cholesky-based regularization, while enabling conjugate updates. Subsequently, a post-fitting model selection step uses penalized joint credible regions to perform model selection. This allows our methods to be computationally feasible for large dimensional settings using a combination of straightforward Gibbs samplers and efficient post-fitting inferences. Theoretical guarantees in terms of selection consistency are also established. Simulations show that the proposed approach compares favorably with competing methods, both in terms of accuracy metrics and computation times. We apply this approach to a cancer genomics data example.
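The conjugacy that makes such Gibbs samplers tractable can be sketched with the basic (single) inverse-Wishart update; the paper's mixture-of-inverse-Wishart prior and its penalized joint credible-region selection step are not reproduced here, so the snippet is only the elementary building block.

```python
import numpy as np
from scipy.stats import invwishart

def posterior_covariance_draws(X, nu0, Psi0, n_draws=100, seed=0):
    """Conjugate inverse-Wishart update underlying Gibbs samplers of this kind:
    with an IW(nu0, Psi0) prior on the covariance and (centered) Gaussian data,
    the posterior is IW(nu0 + n, Psi0 + S), where S is the scatter matrix.
    Requires nu0 > p - 1 for a proper prior."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc                               # scatter matrix
    return invwishart.rvs(df=nu0 + n, scale=Psi0 + S,
                          size=n_draws, random_state=seed)

# posterior draws of the precision matrix follow by inverting each sampled covariance
```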
Project description:The gain in quantification precision that can be expected in human brain ¹H MRS at very high field remains a matter of debate. Here, we investigate this issue using Monte Carlo simulations. Simulated human brain-like ¹H spectra were fitted repeatedly with different noise realizations using LCModel at B0 ranging from 1.5 T to 11.7 T, assuming a linear increase in signal-to-noise ratio with B0 in the time domain and a linear increase in linewidth with B0 based on experimental measurements. Average quantification precision (Cramér-Rao lower bound) was then determined for each metabolite as a function of B0. For singlets, Cramér-Rao lower bounds improved (decreased) by a factor of √B0 as B0 increased, as predicted by theory. For most J-coupled metabolites, Cramér-Rao lower bounds decreased even faster with B0, reflecting additional gains in quantification precision compared to singlets owing to simplification of the spectral pattern and reduced overlap. Quantification precision of ¹H magnetic resonance spectroscopy in human brain therefore continues to improve with B0 up to 11.7 T, even though peak signal-to-noise ratio in the frequency domain levels off above 3 T. In most cases, the gain in quantification precision is higher for J-coupled metabolites than for singlets.
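A toy calculation, under the same two assumptions stated above (time-domain SNR and linewidth both growing linearly with B0), reproduces the √B0 behaviour for a singlet: the Fisher information for the amplitude of a single decaying exponential scales like T2*, so the relative Cramér-Rao bound scales like √B0 / B0 = 1/√B0. All numerical values below (reference linewidth, sampling) are illustrative assumptions, not the study's simulation settings.

```python
import numpy as np

def relative_crlb(b0, dt=1e-4, n=4096, noise=1.0):
    """Relative Cramér-Rao lower bound for the amplitude of one decaying
    exponential (a singlet), assuming amplitude ∝ B0 and linewidth ∝ B0
    (i.e. T2* ∝ 1/B0); values are normalized to 1.5 T."""
    amp = b0 / 1.5                      # time-domain amplitude ∝ B0
    t2_star = 0.1 * (1.5 / b0)          # T2* ∝ 1/B0 (100 ms at 1.5 T, assumed)
    t = np.arange(n) * dt
    deriv = np.exp(-t / t2_star)        # d(signal)/d(amplitude)
    fisher = np.sum(deriv ** 2) / noise ** 2
    return np.sqrt(1.0 / fisher) / amp  # relative CRLB, ∝ 1/sqrt(B0)

for b0 in (1.5, 3.0, 7.0, 11.7):
    print(b0, relative_crlb(b0))        # halving the bound requires ~4x higher field
```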
Project description:In the modeling of longitudinal data from several groups, appropriate handling of the dependence structure is of central importance. Standard methods include specifying a single covariance matrix for all groups or independently estimating the covariance matrix for each group without regard to the others, but when these model assumptions are incorrect, these techniques can lead to biased mean effects or loss of efficiency, respectively. Thus, it is desirable to develop methods to simultaneously estimate the covariance matrix for each group that will borrow strength across groups in a way that is ultimately informed by the data. In addition, for several groups with covariance matrices of even medium dimension, it is difficult to manually select a single best parametric model among the huge number of possibilities given by incorporating structural zeros and/or commonality of individual parameters across groups. In this paper, we develop a family of nonparametric priors using the matrix stick-breaking process of Dunson et al. (2008) that seeks to accomplish this task by parameterizing the covariance matrices in terms of the parameters of their modified Cholesky decomposition (Pourahmadi, 1999). We establish some theoretical properties of these priors, examine their effectiveness via a simulation study, and illustrate the priors using data from a longitudinal clinical trial.
Project description:Many statistical analysis procedures require a good estimator for a high-dimensional covariance matrix or its inverse, the precision matrix. When the precision matrix is banded, the Cholesky-based method often yields a good estimator of the precision matrix. One important aspect of this method is determination of the band size of the precision matrix. In practice, cross-validation is commonly used; however, we show that cross-validation not only is computationally intensive but can be very unstable. In this paper, we propose a new hypothesis testing procedure to determine the band size in high dimensions. Our proposed test statistic is shown to be asymptotically normal under the null hypothesis, and its theoretical power is studied. Numerical examples demonstrate the effectiveness of our testing procedure.
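For contrast with the proposed test, the cross-validation practice the abstract criticizes can be sketched as follows: for each candidate band size, a banded Cholesky-type precision estimator (any implementation, passed in as a hypothetical callable) is refit on every training fold and scored by held-out Gaussian log-likelihood, which is exactly what makes the procedure slow and fold-dependent.

```python
import numpy as np

def cv_band_size(X, fit_banded_precision, candidate_ks, n_folds=5, seed=0):
    """Choose the band size k by cross-validated Gaussian log-likelihood.
    `fit_banded_precision(X, k)` is any banded Cholesky-type precision
    estimator (assumed, not specified by the paper). This illustrates the
    common practice, not the proposed hypothesis test."""
    n, _ = X.shape
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    scores = []
    for k in candidate_ks:
        ll = 0.0
        for held_out in folds:
            train = np.setdiff1d(np.arange(n), held_out)
            omega = fit_banded_precision(X[train], k)
            Xt = X[held_out] - X[train].mean(axis=0)
            _, logdet = np.linalg.slogdet(omega)
            # average held-out log-likelihood (up to an additive constant)
            ll += 0.5 * (logdet
                         - np.einsum('ij,jk,ik->', Xt, omega, Xt) / len(held_out))
        scores.append(ll / n_folds)
    return candidate_ks[int(np.argmax(scores))]
```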
Project description:BACKGROUND: Poisson regression modeling has been widely used to estimate influenza-associated disease burden, as it has the advantage of adjusting for multiple seasonal confounders. However, few studies have discussed how to judge the adequacy of confounding adjustment. This study aims to compare the performance of commonly adopted model selection criteria in terms of providing a reliable and valid estimate of the health impact of influenza. METHODS: We assessed four model selection criteria: the quasi-Akaike information criterion (QAIC), the quasi-Bayesian information criterion (QBIC), partial autocorrelation functions of residuals (PACF), and generalized cross-validation (GCV), by separately applying each to select the Poisson model that best fitted mortality datasets simulated under different assumptions about seasonal confounding. The performance of these criteria was evaluated by the bias and root-mean-square error (RMSE) of the estimates relative to the pre-determined coefficients of the influenza proxy variable. The four criteria were subsequently applied to an empirical hospitalization dataset to confirm the findings of the simulation study. RESULTS: GCV consistently provided smaller biases and RMSEs for the influenza coefficient estimates than QAIC, QBIC and PACF under the different simulation scenarios. Sensitivity analyses with different pre-determined influenza coefficients, study periods and lag weeks showed that GCV consistently outperformed the other criteria. Similar results were found when applying these selection criteria to estimate influenza-associated hospitalization. CONCLUSIONS: The GCV criterion is recommended for selecting Poisson models to estimate influenza-associated mortality and morbidity burden with proper adjustment for confounding. These findings should help standardize the Poisson modeling approach for influenza disease burden studies.
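A schematic of the kind of Poisson model and criteria being compared: weekly counts regressed on an influenza proxy plus a trend and Fourier seasonal terms, with a GCV-style and a QAIC-style score computed from the fit. Column names (deaths, flu_proxy, week) and the exact criterion formulas are illustrative assumptions, not the study's definitions.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def fit_and_score(df, n_harmonics):
    """Fit a Poisson model for weekly counts on an influenza proxy, a linear
    trend and Fourier seasonal terms; return the proxy coefficient plus
    GCV- and QAIC-style scores (smaller is better for both)."""
    t = 2 * np.pi * df['week'].to_numpy() / 52.0
    cols = {'flu_proxy': df['flu_proxy'].to_numpy(),
            'trend': np.arange(len(df))}
    for h in range(1, n_harmonics + 1):
        cols[f'sin{h}'], cols[f'cos{h}'] = np.sin(h * t), np.cos(h * t)
    X = sm.add_constant(pd.DataFrame(cols))
    fit = sm.GLM(df['deaths'].to_numpy(), X,
                 family=sm.families.Poisson()).fit(scale='X2')
    n, k = len(df), X.shape[1]
    gcv = n * fit.deviance / (n - k) ** 2      # GCV-style score for a GLM
    qaic = fit.deviance / fit.scale + 2 * k    # quasi-AIC with estimated dispersion
    return fit.params['flu_proxy'], gcv, qaic
```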
Project description:Single-molecule localization microscopy is a powerful tool for visualizing subcellular structures, interactions and protein functions in biological research. However, inhomogeneous refractive indices inside cells and tissues distort the fluorescent signal emitted from single-molecule probes, which rapidly degrades resolution with increasing depth. We propose a method that enables the construction of an in situ 3D response of single emitters directly from single-molecule blinking datasets, and therefore allows their locations to be pinpointed with precision that achieves the Cramér-Rao lower bound and uncompromised fidelity. We demonstrate this method, named in situ PSF retrieval (INSPR), across a range of cellular and tissue architectures, from mitochondrial networks and nuclear pores in mammalian cells to amyloid-β plaques and dendrites in brain tissues and elastic fibers in developing cartilage of mice. This advancement expands the routine applicability of super-resolution microscopy from selected cellular targets near coverslips to intra- and extracellular targets deep inside tissues.
Project description:Expression quantitative trait loci (eQTL) mapping is a widely used technique to uncover regulatory relationships between genes. A range of methodologies have been developed to map links between expression traits and genotypes. The DREAM (Dialogue on Reverse Engineering Assessments and Methods) initiative is a community project to objectively assess the relative performance of different computational approaches for solving specific systems biology problems. The goal of one of the DREAM5 challenges was to reverse-engineer genetic interaction networks from synthetic genetic variation and gene expression data, which simulates the problem of eQTL mapping. In this framework, we proposed an approach whose originality resides in the use of a combination of existing machine learning algorithms (a committee). Although it was not the best performer, this method was by far the most precise on average. After the competition, we continued in this direction by evaluating other committees using the DREAM5 data and developed a method that relies on Random Forests and LASSO. It achieved a much higher average precision than the DREAM best performer, at the cost of slightly lower average sensitivity.
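In the spirit of the committee described above (a sketch, not the authors' exact pipeline or aggregation rules), the snippet below combines Random Forest importances with LASSO coefficient magnitudes to score candidate regulator-to-target edges from an expression matrix.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import scale

def committee_edge_scores(expr):
    """Score regulator->target edges by regressing each gene on all others and
    averaging normalized Random Forest importances and |LASSO| coefficients.
    `expr` is a samples x genes matrix; scores[j, i] scores the edge j -> i."""
    X = scale(expr)
    n_genes = X.shape[1]
    scores = np.zeros((n_genes, n_genes))
    for i in range(n_genes):
        others = [j for j in range(n_genes) if j != i]
        y, Z = X[:, i], X[:, others]
        rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(Z, y)
        lasso = LassoCV(cv=5, random_state=0).fit(Z, y)
        rf_imp = rf.feature_importances_
        la_imp = np.abs(lasso.coef_)
        # average the two rankings after normalizing each to sum to one
        combined = 0.5 * rf_imp / (rf_imp.sum() + 1e-12) \
                 + 0.5 * la_imp / (la_imp.sum() + 1e-12)
        scores[others, i] = combined
    return scores
```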