Bayesian lasso for semiparametric structural equation models.
ABSTRACT: There has been great interest in developing nonlinear structural equation models and associated statistical inference procedures, including estimation and model selection methods. In this paper a general semiparametric structural equation model (SSEM) is developed in which the structural equation is composed of nonparametric functions of exogenous latent variables and fixed covariates on a set of latent endogenous variables. A basis representation is used to approximate these nonparametric functions in the structural equation and the Bayesian Lasso method coupled with a Markov Chain Monte Carlo (MCMC) algorithm is used for simultaneous estimation and model selection. The proposed method is illustrated using a simulation study and data from the Affective Dynamics and Individual Differences (ADID) study. Results demonstrate that our method can accurately estimate the unknown parameters and correctly identify the true underlying model.
Project description:Structural equation modeling is commonly used to capture complex structures of relationships among multiple variables, both latent and observed. We propose a general class of structural equation models with a semiparametric component for potentially censored survival times. We consider nonparametric maximum likelihood estimation and devise a combined Expectation-Maximization and Newton-Raphson algorithm for its implementation. We establish conditions for model identifiability and prove the consistency, asymptotic normality, and semiparametric efficiency of the estimators. Finally, we demonstrate the satisfactory performance of the proposed methods through simulation studies and provide an application to a motivating cancer study that contains a variety of genomic variables. Supplementary materials for this article are available online.
Project description:BACKGROUND: With the abundant information produced by microarray technology, various approaches have been proposed to infer transcriptional regulatory networks. However, few approaches have studied subtle and indirect interaction such as genetic compensation, the existence of which is widely recognized although its mechanism has yet to be clarified. Furthermore, when inferring gene networks most models include only observed variables whereas latent factors, such as proteins and mRNA degradation that are not measured by microarrays, do participate in networks in reality. RESULTS: Motivated by inferring transcriptional compensation (TC) interactions in yeast, a stepwise structural equation modeling algorithm (SSEM) is developed. In addition to observed variables, SSEM also incorporates hidden variables to capture interactions (or regulations) from latent factors. Simulated gene networks are used to determine with which of six possible model selection criteria (MSC) SSEM works best. SSEM with Bayesian information criterion (BIC) results in the highest true positive rates, the largest percentage of correctly predicted interactions from all existing interactions, and the highest true negative (non-existing interactions) rates. Next, we apply SSEM using real microarray data to infer TC interactions among (1) small groups of genes that are synthetic sick or lethal (SSL) to SGS1, and (2) a group of SSL pairs of 51 yeast genes involved in DNA synthesis and repair that are of interest. For (1), SSEM with BIC is shown to outperform three Bayesian network algorithms and a multivariate autoregressive model, checked against the results of qRT-PCR experiments. The predictions for (2) are shown to coincide with several known pathways of Sgs1 and its partners that are involved in DNA replication, recombination and repair. In addition, experimentally testable interactions of Rad27 are predicted. CONCLUSION: SSEM is a useful tool for inferring genetic networks, and the results reinforce the possibility of predicting pathways of protein complexes via genetic interactions.
Project description:Motivated by the spatial modeling of aberrant crypt foci (ACF) in colon carcinogenesis, we consider binary data with probabilities modeled as the sum of a nonparametric mean plus a latent Gaussian spatial process that accounts for short-range dependencies. The mean is modeled in a general way using regression splines. The mean function can be viewed as a fixed effect and is estimated with a penalty for regularization. With the latent process viewed as another random effect, the model becomes a generalized linear mixed model. In our motivating data set and other applications, the sample size is too large to easily accommodate maximum likelihood or restricted maximum likelihood estimation (REML), so pairwise likelihood, a special case of composite likelihood, is used instead. We develop an asymptotic theory for models that are sufficiently general to be used in a wide variety of applications, including, but not limited to, the problem that motivated this work. The splines have penalty parameters that must converge to zero asymptotically: we derive theory for this along with a data-driven method for selecting the penalty parameter, a method that is shown in simulations to improve greatly upon standard devices, such as likelihood crossvalidation. Finally, we apply the methods to the data from our experiment ACF. We discover an unexpected location for peak formation of ACF.
Project description:In this article, we study the estimation of mean response and regression coefficient in semiparametric regression problems when response variable is subject to nonrandom missingness. When the missingness is independent of the response conditional on high-dimensional auxiliary information, the parametric approach may misspecify the relationship between covariates and response while the nonparametric approach is infeasible because of the curse of dimensionality. To overcome this, we study a model-based approach to condense the auxiliary information and estimate the parameters of interest nonparametrically on the condensed covariate space. Our estimators possess the double robustness property, i.e., they are consistent whenever the model for the response given auxiliary covariates or the model for the missingness given auxiliary covariate is correct. We conduct a number of simulations to compare the numerical performance between our estimators and other existing estimators in the current missing data literature, including the propensity score approach and the inverse probability weighted estimating equation. A set of real data is used to illustrate our approach.
Project description:In a family-based genetic study such as the Framingham Heart Study (FHS), longitudinal trait measurements are recorded on subjects collected from families. Observations on subjects from the same family are correlated due to shared genetic composition or environmental factors such as diet. The data have a 3-level structure with measurements nested in subjects and subjects nested in families. We propose a semiparametric variance components model to describe phenotype observed at a time point as the sum of a nonparametric population mean function, a nonparametric random quantitative trait locus (QTL) effect, a shared environmental effect, a residual random polygenic effect and measurement error. One feature of the model is that we do not assume a parametric functional form of the age-dependent QTL effect, and we use penalized spline-based method to fit the model. We obtain nonparametric estimation of the QTL heritability defined as the ratio of the QTL variance to the total phenotypic variance. We use simulation studies to investigate performance of the proposed methods and apply these methods to the FHS systolic blood pressure data to estimate age-specific QTL effect at 62cM on chromosome 17.
Project description:This article studies a semiparametric accelerated failure time mixture model for estimation of a biological treatment effect on a latent subgroup of interest with a time-to-event outcome in randomized clinical trials. Latency is induced because membership is observable in one arm of the trial and unidentified in the other. This method is useful in randomized clinical trials with all-or-none noncompliance when patients in the control arm have no access to active treatment and in, for example, oncology trials when a biopsy used to identify the latent subgroup is performed only on subjects randomized to active treatment. We derive a computational method to estimate model parameters by iterating between an expectation step and a weighted Buckley-James optimization step. The bootstrap method is used for variance estimation, and the performance of our method is corroborated in simulation. We illustrate our method through an analysis of a multicenter selective lymphadenectomy trial for melanoma.
Project description:In cancer research, interest frequently centers on factors influencing a latent event that must precede a terminal event. In practice it is often impossible to observe the latent event precisely, making inference about this process difficult. To address this problem, we propose a joint model for the unobserved time to the latent and terminal events, with the two events linked by the baseline hazard. Covariates enter the model parametrically as linear combinations that multiply, respectively, the hazard for the latent event and the hazard for the terminal event conditional on the latent one. We derive the partial likelihood estimators for this problem assuming the latent event is observed, and propose a profile likelihood-based method for estimation when the latent event is unobserved. The baseline hazard in this case is estimated nonparametrically using the EM algorithm, which allows for closed-form Breslow-type estimators at each iteration, bringing improved computational efficiency and stability compared with maximizing the marginal likelihood directly. We present simulation studies to illustrate the finite-sample properties of the method; its use in practice is demonstrated in the analysis of a prostate cancer data set.
Project description:The successional dynamics of microbial communities are influenced by the synergistic interactions of physical and biological factors. In our motivating data, ocean microbiome samples were collected from the Santa Cruz Municipal Wharf, Monterey Bay at multiple time points and then 16S ribosomal RNA (rRNA) sequenced. We develop a Bayesian semiparametric regression model to investigate how microbial abundance and succession change with covarying physical and biological factors including algal bloom and domoic acid concentration level using 16S rRNA sequencing data. A generalized linear regression model is built using the Laplace prior, a sparse inducing prior, to improve estimation of covariate effects on mean abundances of microbial species represented by operational taxonomic units (OTUs). A nonparametric prior model is used to facilitate borrowing strength across OTUs, across samples and across time points. It flexibly estimates baseline mean abundances of OTUs and provides the basis for improved quantification of covariate effects. The proposed method does not require prior normalization of OTU counts to adjust differences in sample total counts. Instead, the normalization and estimation of covariate effects on OTU abundance are simultaneously carried out for joint analysis of all OTUs. Using simulation studies and a real data analysis, we demonstrate improved inference compared to an existing method.
Project description:This paper investigates the semiparametric statistical methods for recurrent events. The mean number of the recurrent events are modeled with the generalized semiparametric varying-coefficient model that can flexibly model three types of covariate effects: time-constant effects, time-varying effects, and covariate-varying effects. We assume that the time-varying effects are unspecified functions of time and the covariate-varying effects are parametric functions of an exposure variable specified up to a finite number of unknown parameters. Different link functions can be selected to provide a rich family of models for recurrent events data. The profile estimation methods are developed for the parametric and nonparametric components. The asymptotic properties are established. We also develop some hypothesis testing procedures to test validity of the parametric forms of covariate-varying effects. The simulation study shows that both estimation and hypothesis testing procedures perform well. The proposed method is applied to analyze a data set from an acyclovir study and investigate whether acyclovir treatment reduces the mean relapse recurrences.
Project description:Motivated by medical studies in which patients could be cured of disease but the disease event time may be subject to interval censoring, we presents a semiparametric non-mixture cure model for the regression analysis of interval-censored time-to-event datxa. We develop semiparametric maximum likelihood estimation for the model using the expectation-maximization method for interval-censored data. The maximization step for the baseline function is nonparametric and numerically challenging. We develop an efficient and numerically stable algorithm via modern convex optimization techniques, yielding a self-consistency algorithm for the maximization step. We prove the strong consistency of the maximum likelihood estimators under the Hellinger distance, which is an appropriate metric for the asymptotic property of the estimators for interval-censored data. We assess the performance of the estimators in a simulation study with small to moderate sample sizes. To illustrate the method, we also analyze a real data set from a medical study for the biochemical recurrence of prostate cancer among patients who have undergone radical prostatectomy. Supplemental materials for the computational algorithm are available online.