Project description:Conjugate priors allow for fast inference in large-dimensional vector autoregressive (VAR) models, but at the same time they introduce the restriction that each equation features the same set of explanatory variables. This paper proposes a straightforward means of postprocessing posterior estimates of a conjugate Bayesian VAR to effectively perform equation-specific covariate selection. Compared with existing techniques using shrinkage alone, our approach combines shrinkage and sparsity in both the VAR coefficients and the error variance-covariance matrices, greatly reducing estimation uncertainty in large dimensions while maintaining computational tractability. We illustrate our approach by means of two applications: the first uses synthetic data to investigate the properties of the model across different data-generating processes, and the second analyzes the predictive gains from sparsification in a forecasting exercise for U.S. data.
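A minimal sketch of the kind of post-processing this description refers to, assuming a SAVS-style (signal adaptive variable selector) soft-thresholding rule applied to each posterior draw; the paper's exact rule and penalty choice may differ. Plain NumPy, with beta_draw holding one draw of the k-by-n coefficient matrix (k regressors, n equations) and X the stacked lagged regressors shared across equations:

import numpy as np

def savs_sparsify(beta_draw, X):
    """Soft-threshold one posterior draw of the VAR coefficients, column by column
    (i.e., equation by equation), zeroing out weak signals while leaving large
    coefficients nearly untouched."""
    col_ss = np.sum(X ** 2, axis=0)                      # ||x_j||^2 for each regressor j
    kappa = 1.0 / np.maximum(beta_draw ** 2, 1e-12)      # coefficient-specific penalty
    shrunk = np.abs(beta_draw) * col_ss[:, None] - kappa
    return np.sign(beta_draw) * np.maximum(shrunk, 0.0) / col_ss[:, None]

# Usage: apply to every retained posterior draw, then summarize the sparsified draws.
# rng = np.random.default_rng(0)
# X = rng.normal(size=(200, 10))           # stacked lagged regressors (T x k)
# draws = rng.normal(size=(500, 10, 3))    # 500 draws, k = 10 regressors, n = 3 equations
# sparse = np.stack([savs_sparsify(b, X) for b in draws])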
Project description:We introduce a new shrinkage prior on function spaces, called the functional horseshoe prior (fHS), that encourages shrinkage towards parametric classes of functions. Unlike other shrinkage priors for parametric models, the fHS shrinkage acts on the shape of the function rather than inducing sparsity on model parameters. We study the efficacy of the proposed approach by showing an adaptive posterior concentration property on the function. We also demonstrate consistency of the model selection procedure that thresholds the shrinkage parameter of the functional horseshoe prior. We apply the fHS prior to nonparametric additive models and compare its performance with procedures based on the standard horseshoe prior and several penalized likelihood approaches. We find that the new procedure achieves smaller estimation error and more accurate model selection than other procedures in several simulated and real examples. The supplementary material for this article, which contains additional simulated and real data examples, MCMC diagnostics, and proofs of the theoretical results, is available online.
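As a rough mental model of how shrinkage on the shape of a function works (not the authors' construction or sampler): conditional on a scalar weight omega in (0, 1), the fitted function is a convex combination of a flexible basis fit and its parametric counterpart, and the fHS places a heavy-tailed, horseshoe-like prior on omega so the data decide how far to shrink toward the parametric class. A toy NumPy sketch of the blending step, with a small ridge term added purely for numerical stability:

import numpy as np

def blended_fit(y, Phi, Phi0, omega, ridge=1e-8):
    """Toy fit: convex combination of a flexible basis fit and a parametric fit.

    Phi   : (n, d)  flexible basis (e.g., B-splines)
    Phi0  : (n, d0) parametric design (e.g., intercept + linear term)
    omega : scalar in (0, 1); omega near 1 shrinks toward the parametric class
    """
    def proj(B):
        return B @ np.linalg.solve(B.T @ B + ridge * np.eye(B.shape[1]), B.T @ y)
    return omega * proj(Phi0) + (1.0 - omega) * proj(Phi)

# In the fHS construction, omega itself receives a heavy-tailed Beta-type prior,
# so the amount of shrinkage toward the parametric shape is learned from the data.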
Project description:High-dimensional and highly correlated data leading to non- or weakly identified effects are commonplace. Maximum likelihood will typically fail in such situations, and a variety of shrinkage methods have been proposed. Standard techniques, such as ridge regression or the lasso, shrink estimates toward zero, with some approaches allowing coefficients to be selected out of the model by achieving a value of zero. When substantive information is available, estimates can be shrunk to nonnull values; however, such information may not be available. We propose a Bayesian semiparametric approach that allows shrinkage to multiple locations. Coefficients are given a mixture of heavy-tailed double exponential priors, with location and scale parameters assigned Dirichlet process hyperpriors to allow groups of coefficients to be shrunk toward the same, possibly nonzero, mean. Our approach favors sparse, but flexible, structure by shrinking toward a small number of random locations. The methods are illustrated using a study of genetic polymorphisms and Parkinson's disease.
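A generative sketch of the prior described above, using a truncated stick-breaking representation of the Dirichlet process and illustrative (not the paper's) base-measure choices for the cluster locations and scales:

import numpy as np

rng = np.random.default_rng(1)

def draw_coefficients(p, alpha=1.0, n_atoms=50):
    """Generative sketch: p coefficients drawn from a truncated Dirichlet-process
    mixture of double-exponential (Laplace) components, so groups of coefficients
    share a common, possibly nonzero, location and scale."""
    v = rng.beta(1.0, alpha, size=n_atoms)                      # stick-breaking fractions
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))   # truncated DP weights
    locs = rng.normal(0.0, 1.0, size=n_atoms)                   # cluster locations (illustrative base measure)
    scales = rng.gamma(2.0, 0.5, size=n_atoms)                  # cluster scales (illustrative base measure)
    z = rng.choice(n_atoms, size=p, p=w / w.sum())              # cluster assignments
    return rng.laplace(loc=locs[z], scale=scales[z])

# beta = draw_coefficients(100)  # groups of coefficients share a (possibly nonzero) mean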
Project description:The dimension of the parameter space is typically unknown in a variety of models that rely on factorizations. For example, in factor analysis the number of latent factors is not known and has to be inferred from the data. Although classical shrinkage priors are useful in such contexts, increasing shrinkage priors can provide a more effective approach that progressively penalizes expansions with growing complexity. In this article we propose a novel increasing shrinkage prior, called the cumulative shrinkage process, for the parameters that control the dimension in overcomplete formulations. Our construction has broad applicability and is based on an interpretable sequence of spike-and-slab distributions which assign increasing mass to the spike as the model complexity grows. Using factor analysis as an illustrative example, we show that this formulation has theoretical and practical advantages relative to current competitors, including an improved ability to recover the model dimension. An adaptive Markov chain Monte Carlo algorithm is proposed, and the performance gains are outlined in simulations and in an application to personality data.
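A small NumPy sketch of how such an increasing shrinkage prior can be simulated, with illustrative hyperparameters and a small constant standing in for the spike: the spike probability for each loading column is the cumulative sum of stick-breaking weights, so higher-index columns are increasingly likely to be shrunk away:

import numpy as np

rng = np.random.default_rng(2)

def cusp_column_scales(H, alpha=5.0, a=2.0, b=2.0, theta_spike=0.05):
    """Draw column-specific scales for H factor-loading columns from a cumulative
    shrinkage process: the spike probability increases with the column index."""
    v = rng.beta(1.0, alpha, size=H)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))   # stick-breaking weights
    pi = np.cumsum(w)                                           # increasing spike probabilities
    slab = 1.0 / rng.gamma(a, 1.0 / b, size=H)                  # inverse-gamma slab draws
    spiked = rng.random(H) < pi                                 # later columns more likely spiked
    return np.where(spiked, theta_spike, slab), pi

# scales, pi = cusp_column_scales(H=15)   # columns at the spike value are effectively inactive factors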
Project description:Linear-bilinear models, especially the additive main effects and multiplicative interaction (AMMI) model, are widely used in genotype-by-environment interaction (GEI) studies in plant breeding programs. These models allow parsimonious modeling of GEI by retaining a small number of principal components in the analysis. One aspect of the AMMI model that is still debated, however, is the criterion for selecting the number of multiplicative terms required to describe the GEI pattern. Shrinkage estimators have been proposed as selection criteria for the GEI components. In this study, a Bayesian approach was combined with the AMMI model using shrinkage estimators for the principal components. A total of 55 maize genotypes were evaluated in nine environments using a complete-blocks design with three replicates. The results show that the traditional Bayesian AMMI model produces little shrinkage of the singular values but avoids the usual pitfalls in determining credible intervals in the biplot. Bayesian shrinkage AMMI models, on the other hand, have difficulty with credible intervals for the model parameters but shrink the principal components more strongly, converging to GEI matrices with more shrinkage than those obtained using mixed models. This characteristic allowed more parsimonious models to be chosen, with more of the GEI pattern retained in the first two components; the selected models were similar to those obtained by the Cornelius F-test (α = 0.05) in traditional AMMI models and by leave-one-out cross-validation. Our method thus enables the estimation of credible intervals for the AMMI biplot and the choice of the AMMI model directly from the posterior distribution of the singular values, retaining more of the GEI pattern in the first components and discarding noise without the Gaussian assumption required by F-based tests or the parametric problems observed in traditional AMMI shrinkage methods.
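For context, a classical (non-Bayesian) AMMI fit is just additive main effects plus an SVD of the interaction residuals; the shrinkage approaches discussed above act on the singular values of that decomposition rather than hard-truncating at a fixed number of terms. A minimal NumPy sketch:

import numpy as np

def ammi_fit(Y, K=2):
    """Classical AMMI fit for a genotype-by-environment table of means Y (G x E):
    grand mean and additive main effects, plus the first K multiplicative terms
    from an SVD of the interaction residuals. Bayesian shrinkage variants place
    priors that shrink the singular values lam instead of truncating at a fixed K."""
    mu = Y.mean()
    g = Y.mean(axis=1) - mu                        # genotype main effects
    e = Y.mean(axis=0) - mu                        # environment main effects
    resid = Y - mu - g[:, None] - e[None, :]       # GEI residual matrix
    U, lam, Vt = np.linalg.svd(resid, full_matrices=False)
    gei_hat = (U[:, :K] * lam[:K]) @ Vt[:K]        # retained GEI pattern for the biplot
    return mu, g, e, lam, gei_hat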
Project description:Biclustering techniques can identify local patterns of a data matrix by clustering the feature space and the sample space at the same time. Various biclustering methods have been proposed and successfully applied to the analysis of gene expression data. While existing biclustering methods have many desirable features, most of them are developed for continuous data, and few of them can efficiently handle -omics data of various types, for example, binomial data as in single nucleotide polymorphism data or negative binomial data as in RNA-seq data. In addition, none of the existing methods can utilize biological information such as that from functional genomics or proteomics. Recent work has shown that incorporating biological information can improve variable selection and prediction performance in analyses such as linear regression and multivariate analysis. In this article, we propose a novel Bayesian biclustering method that can handle multiple data types, including Gaussian, binomial, and negative binomial. In addition, our method uses a Bayesian adaptive structured shrinkage prior that enables feature selection guided by existing biological information. Our simulation studies and an application to multi-omics datasets demonstrate the robust and superior performance of the proposed method compared to existing biclustering methods.
Project description:Polygenic risk scores (PRS) have shown promise in predicting human complex traits and diseases. Here, we present PRS-CS, a polygenic prediction method that infers posterior effect sizes of single nucleotide polymorphisms (SNPs) using genome-wide association summary statistics and an external linkage disequilibrium (LD) reference panel. PRS-CS utilizes a high-dimensional Bayesian regression framework, and is distinct from previous work by placing a continuous shrinkage (CS) prior on SNP effect sizes, which is robust to varying genetic architectures, provides substantial computational advantages, and enables multivariate modeling of local LD patterns. Simulation studies using data from the UK Biobank show that PRS-CS outperforms existing methods across a wide range of genetic architectures, especially when the training sample size is large. We apply PRS-CS to predict six common complex diseases and six quantitative traits in the Partners HealthCare Biobank, and further demonstrate the improvement of PRS-CS in prediction accuracy over alternative methods.
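A toy sketch of the general form of a summary-statistic shrinkage step and of scoring individuals, assuming standardized genotypes and given local shrinkage scales psi; PRS-CS embeds a step of this general form inside a Gibbs sampler with a continuous global-local prior on psi, so this is an illustration of the idea, not the method's exact update:

import numpy as np

def shrunk_effects(beta_marginal, D, psi):
    """Posterior-mean-style joint SNP effects from marginal GWAS effects, an LD
    matrix D estimated from a reference panel, and local shrinkage scales psi
    (one per SNP). Small psi shrinks an effect strongly toward zero."""
    return np.linalg.solve(D + np.diag(1.0 / psi), beta_marginal)

def polygenic_score(G, beta_post):
    """Polygenic risk score: genotype matrix G (individuals x SNPs) times effects."""
    return G @ beta_post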
Project description:The successful implementation of Bayesian shrinkage analysis of high-dimensional regression models, as often encountered in quantitative trait locus (QTL) mapping, is contingent upon the choice of suitable sparsity-inducing priors. In practice, the shape (that is, the rate of tail decay) of such priors is typically preset, with no regard for the range of plausible alternatives or the fact that the most appropriate shape may depend on the data at hand. This study is, to our knowledge, the first attempt to address this oversight, through the shape-adaptive shrinkage prior (SASP) approach, with a focus on the mapping of QTLs in experimental crosses. Simulation results showed that the separation between genuine QTL effects and spurious ones can be made clearer using the SASP-based approach than with existing competitors. This feature makes our new method a promising approach to QTL mapping, where good separation is the ultimate goal. We also discuss a re-estimation procedure intended to improve the accuracy of the estimated genetic effects of detected QTLs with regard to shrinkage-induced bias, which may be particularly important in large-scale models with collinear predictors. The re-estimation procedure is relevant to any shrinkage method and is potentially valuable for many scientific disciplines, such as bioinformatics and quantitative genetics, where oversaturated models are increasingly common.
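One standard family with a tunable rate of tail decay is the exponential-power (generalized Gaussian) prior; the sketch below illustrates what "shape adaptation" means, and is not necessarily the exact SASP form used in the paper:

import numpy as np

def log_exponential_power(beta, scale=1.0, shape=1.0):
    """Log density (up to a constant) of an exponential-power prior,
    proportional to exp(-|beta/scale|**shape). shape = 2 gives Gaussian-like
    (ridge) shrinkage, shape = 1 the double-exponential (lasso-like) prior, and
    shape < 1 heavier tails with a sharper peak at zero. A shape-adaptive prior
    treats `shape` as unknown so the rate of tail decay is learned from the data."""
    return -np.abs(np.asarray(beta) / scale) ** shape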
Project description:Motivated by the increasing use of and rapid changes in array technologies, we consider the prediction problem of fitting a linear regression relating a continuous outcome Y to a large number of covariates X, e.g., measurements from current, state-of-the-art technology. For most of the samples, only the outcome Y and surrogate covariates, W, are available. These surrogates may be data from prior studies using older technologies. Owing to the dimension of the problem and the large fraction of missing information, a critical issue is appropriate shrinkage of model parameters for an optimal bias-variance tradeoff. We discuss a variety of fully Bayesian and empirical Bayes algorithms which account for uncertainty in the missing data and adaptively shrink parameter estimates for superior prediction. These methods are evaluated via a comprehensive simulation study. In addition, we apply our methods to a lung cancer dataset, predicting survival time (Y) using qRT-PCR (X) and microarray (W) measurements.
Project description:Introduction: Population stratification (PS) is a major source of confounding in population-based genetic association studies of quantitative traits. Principal component regression (PCR) and the linear mixed model (LMM) are two commonly used approaches to account for PS in association studies. Previous studies have shown that LMM can be interpreted as including all principal components (PCs) as random-effect covariates. However, including all PCs in LMM may dilute the influence of relevant PCs in some scenarios, while including only a few preselected PCs in PCR may fail to fully capture the genetic diversity. Materials and methods: To address these shortcomings, we introduce Bayestrat, a method to detect associated variants with PS correction under the Bayesian LASSO framework. To adjust for PS, Bayestrat accommodates a large number of PCs and utilizes appropriate shrinkage priors to shrink the effects of nonassociated PCs. Results: Simulation results show that Bayestrat consistently controls type I error rates and achieves higher power compared to its non-shrinkage counterparts, especially when the number of PCs included in the model is large. As a demonstration of the utility of Bayestrat, we apply it to the Multi-Ethnic Study of Atherosclerosis (MESA). Variants and genes associated with serum triglycerides or HDL cholesterol are identified in our analyses. Discussion: The automatic, self-selecting nature of Bayestrat makes it particularly suited to situations with complex underlying PS, where it is unknown a priori which PCs are potential confounders, yet the number that needs to be considered could be large in order to fully account for PS.
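A MAP-level sketch of the kind of model Bayestrat describes, assuming a single candidate variant, a matrix of PCs, and a Laplace (Bayesian-LASSO-type) penalty on the PC effects only; Bayestrat itself samples the full posterior rather than maximizing it, so this is only an illustration of the design:

import numpy as np
from scipy.optimize import minimize

def map_fit(y, snp, pcs, lam=1.0):
    """MAP sketch of a Bayestrat-style regression: trait y on a candidate variant
    plus many principal components, with a Laplace penalty shrinking the PC
    effects toward zero while the variant effect is left unpenalized."""
    X = np.column_stack([np.ones_like(y), snp, pcs])   # intercept, variant, PCs
    def neg_log_post(theta):
        resid = y - X @ theta
        return 0.5 * resid @ resid + lam * np.sum(np.abs(theta[2:]))  # penalize PC effects only
    return minimize(neg_log_post, np.zeros(X.shape[1]), method="Powell").x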