A SIEVE M-THEOREM FOR BUNDLED PARAMETERS IN SEMIPARAMETRIC MODELS, WITH APPLICATION TO THE EFFICIENT ESTIMATION IN A LINEAR MODEL FOR CENSORED DATA.
ABSTRACT: In many semiparametric models that are parameterized by two types of parameters (a Euclidean parameter of interest and an infinite-dimensional nuisance parameter), the two parameters are bundled together, i.e., the nuisance parameter is an unknown function that contains the parameter of interest as part of its argument. For example, in a linear regression model for censored survival data, the unspecified error distribution function involves the regression coefficients. Motivated by the need for an efficient estimating method for the regression parameters, we propose a general sieve M-theorem for bundled parameters and apply the theorem to derive the asymptotic theory for sieve maximum likelihood estimation in the linear regression model for censored survival data. The proposed estimating method can be implemented numerically with conventional gradient-based search algorithms such as the Newton-Raphson algorithm. We show that the proposed estimator is consistent and asymptotically normal and achieves the semiparametric efficiency bound. Simulation studies demonstrate that the proposed method performs well in practical settings and yields more efficient estimates than existing methods based on estimating equations. An illustration with a real data example is also provided.
Project description:Epidemiologic studies and disease prevention trials often seek to relate an exposure variable to a failure time that suffers from interval-censoring. When the failure rate is low and the time intervals are wide, a large cohort is often required so as to yield reliable precision on the exposure-failure-time relationship. However, large cohort studies with simple random sampling could be prohibitive for investigators with a limited budget, especially when the exposure variables are expensive to obtain. Alternative cost-effective sampling designs and inference procedures are therefore desirable. We propose an outcome-dependent sampling (ODS) design with interval-censored failure time data, where we enrich the observed sample by selectively including certain more informative failure subjects. We develop a novel sieve semiparametric maximum empirical likelihood approach for fitting the proportional hazards model to data from the proposed interval-censoring ODS design. This approach employs the empirical likelihood and sieve methods to deal with the infinite-dimensional nuisance parameters, which greatly reduces the dimensionality of the estimation problem and eases the computational difficulty. The consistency and asymptotic normality of the resulting regression parameter estimator are established. The results from our extensive simulation study show that the proposed design and method work well in practical situations and are more efficient than the alternative designs and competing approaches. An example from the Atherosclerosis Risk in Communities (ARIC) study is provided for illustration.
Project description:We consider scenarios in which the likelihood function for a semiparametric regression model factors into separate components, with an efficient estimator of the regression parameter available for each component. An optimal weighted combination of the component estimators, named an ensemble estimator, may be employed as an overall estimate of the regression parameter, and may be fully efficient under uncorrelatedness conditions. This approach is useful when the full likelihood function may be difficult to maximize, but the components are easy to maximize. It covers settings where the nuisance parameter may be estimated at different rates in the component likelihoods. As a motivating example we consider proportional hazards regression with prospective doubly censored data, in which the likelihood factors into a current status data likelihood and a left-truncated right-censored data likelihood. Variable selection is important in such regression modelling, but the applicability of existing techniques is unclear in the ensemble approach. We propose ensemble variable selection using the least squares approximation technique on the unpenalized ensemble estimator, followed by ensemble re-estimation under the selected model. The resulting estimator has the oracle property such that the set of nonzero parameters is successfully recovered and the semiparametric efficiency bound is achieved for this parameter set. Simulations show that the proposed method performs well relative to alternative approaches. Analysis of an AIDS cohort study illustrates the practical utility of the method.
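As a rough illustration of the weighted combination described in the preceding abstract, the sketch below covers only the simplest case: scalar component estimators that are assumed uncorrelated, with known (or consistently estimated) variances. Under those assumptions the variance-minimizing weights are proportional to the inverse variances. The function name `ensemble_estimate` and its inputs are hypothetical and do not come from the paper, which treats vector parameters and more general conditions.

```python
def ensemble_estimate(estimates, variances):
    """Combine uncorrelated scalar estimators with inverse-variance weights.

    Returns (combined estimate, variance of the combination, weights).
    Assumes the component estimators are uncorrelated; this is an
    illustrative simplification, not the paper's general construction.
    """
    if not estimates or len(estimates) != len(variances):
        raise ValueError("need one positive variance per estimate")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")

    # Precision (inverse variance) of each component estimator.
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)

    # Weights are proportional to precision and sum to one.
    weights = [p / total_precision for p in precisions]
    combined = sum(w * e for w, e in zip(weights, estimates))

    # For uncorrelated components the combined variance is 1 / (sum of precisions).
    return combined, 1.0 / total_precision, weights
```

For example, component estimates 0.52 and 0.47 with variances 0.04 and 0.01 receive weights 0.2 and 0.8, giving a combined estimate of 0.48 with variance 0.008, which is smaller than either component variance.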
Project description:We develop asymptotic theory for weighted likelihood estimators (WLE) under two-phase stratified sampling without replacement. We also consider several variants of WLEs involving estimated weights and calibration. A set of empirical process tools is developed, including a Glivenko-Cantelli theorem, a theorem for rates of convergence of M-estimators, and a Donsker theorem for the inverse probability weighted empirical processes under two-phase sampling and sampling without replacement at the second phase. Using these general results, we derive asymptotic distributions of the WLE of a finite-dimensional parameter in a general semiparametric model where a nuisance parameter is estimable at either regular or nonregular rates. We illustrate these results and methods in the Cox model with right censoring and interval censoring. We compare the methods via their asymptotic variances under both sampling without replacement and the more usual (and easier to analyze) assumption of Bernoulli sampling at the second phase.
Project description:The case-cohort design has been widely used as a means of cost reduction in assembling or measuring expensive covariates in large cohort studies. The existing literature on the case-cohort design is mainly focused on right-censored data. In practice, however, the failure time is often subject to interval-censoring; it is known only to fall within some random time interval. In this paper, we consider the case-cohort study design for interval-censored failure time and develop a sieve semiparametric likelihood approach for analyzing data from this design under the proportional hazards model. We construct the likelihood function using inverse probability weighting and build the sieves with Bernstein polynomials. The consistency and asymptotic normality of the resulting regression parameter estimator are established and a weighted bootstrap procedure is considered for variance estimation. Simulations show that the proposed method works well for practical situations, and an application to real data is provided.
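To make the Bernstein polynomial sieve mentioned in the preceding abstract more concrete, here is a minimal sketch of one common way to represent a nondecreasing cumulative baseline hazard on an interval. The reparameterization through cumulative sums of exponentials, the function names, and the interval endpoints `lower` and `upper` are assumptions made for illustration; the abstract does not state the paper's exact parameterization.

```python
from math import comb, exp


def bernstein_basis(t, k, m, lower, upper):
    """k-th Bernstein basis polynomial of degree m on [lower, upper]."""
    s = (t - lower) / (upper - lower)  # rescale t to [0, 1]
    return comb(m, k) * s**k * (1.0 - s) ** (m - k)


def monotone_coefficients(raw):
    """Map unconstrained reals to positive, strictly increasing coefficients.

    Cumulative sums of exp(raw_k) remove the order constraint, so an
    unconstrained optimizer can be used on the raw values.
    """
    coeffs, running = [], 0.0
    for r in raw:
        running += exp(r)
        coeffs.append(running)
    return coeffs


def cumulative_hazard(t, raw, lower, upper):
    """Sieve approximation: sum_k coeff_k * B_k(t; m, lower, upper)."""
    coeffs = monotone_coefficients(raw)
    m = len(coeffs) - 1
    return sum(
        c * bernstein_basis(t, k, m, lower, upper) for k, c in enumerate(coeffs)
    )
```

Because a Bernstein polynomial with nondecreasing coefficients is itself nondecreasing, `cumulative_hazard` is monotone in `t` for any real-valued `raw`, so the sieve dimension is just the number of raw coefficients.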
Project description:We propose a two-stage outcome-dependent sampling design and inference procedure for studies that concern interval-censored failure time outcomes. This design enhances the study efficiency by allowing the selection probabilities of the second-stage sample, for which the expensive exposure variable is ascertained, to depend on the first-stage observed interval-censored failure time outcomes. In particular, the second-stage sample is enriched by selectively including subjects who are known or observed to experience the failure at an early or late time. We develop a sieve semiparametric maximum pseudo likelihood procedure that makes use of all available data from the proposed two-stage design. The resulting regression parameter estimator is shown to be consistent and asymptotically normal, and a consistent estimator for its asymptotic variance is derived. Simulation results demonstrate that the proposed design and inference procedure perform well in practical situations and are more efficient than the existing designs and methods. An application to a phase 3 HIV vaccine trial is provided.
Project description:Partly interval-censored (PIC) data arise when some failure times are exactly observed while others are only known to lie within certain intervals. In this article, we consider efficient semiparametric estimation of the accelerated failure time (AFT) model with PIC data. We first generalize the Buckley-James estimator for right-censored data to PIC data. Then, we develop a one-step estimator by deriving and estimating the efficient score for the regression parameters. We show that under mild regularity conditions the generalized Buckley-James estimator is consistent and asymptotically normal and the one-step estimator is consistent and asymptotically normal with a covariance matrix that attains the semiparametric efficiency bound. We conduct extensive simulation studies to examine the performance of the proposed estimators in finite samples and apply our methods to data derived from an AIDS study.
Project description:We consider inference under a nonparametric or semiparametric model with likelihood that factorizes as the product of two or more variation-independent factors. We are interested in a finite-dimensional parameter that depends on only one of the likelihood factors and whose estimation requires the auxiliary estimation of one or several nuisance functions. We investigate general structures conducive to the construction of so-called multiply robust estimating functions, whose computation requires postulating several dimension-reducing models but which have mean zero at the true parameter value provided one of these models is correct.
Project description:Right-censored time-to-event data are sometimes observed from a (sub)cohort of patients whose survival times can be subject to outcome-dependent sampling schemes. In this paper, we propose a unified estimation method for semiparametric accelerated failure time models under general biased sampling schemes. The proposed estimator of the regression coefficients is built upon a bias-offsetting weighting scheme and is proved to be consistent and asymptotically normally distributed. Large sample properties for the estimator are also derived. Using rank-based monotone estimating functions for the regression parameters, we find that the estimating equations can be easily solved via convex optimization. The methods are confirmed through simulations and illustrated by application to real datasets on various sampling schemes including length-bias sampling, the case-cohort design and its variants.
Project description:This paper concerns statistical inference for longitudinal data with ultrahigh dimensional covariates. We first study the problem of constructing confidence intervals and hypothesis tests for a low dimensional parameter of interest. The major challenge is how to construct a powerful test statistic in the presence of high-dimensional nuisance parameters and sophisticated within-subject correlation of longitudinal data. To deal with the challenge, we propose a new quadratic decorrelated inference function approach, which simultaneously removes the impact of nuisance parameters and incorporates the correlation to enhance the efficiency of the estimation procedure. When the parameter of interest is of fixed dimension, we prove that the proposed estimator is asymptotically normal and attains the semiparametric information bound, based on which we can construct an optimal Wald test statistic. We further extend this result and establish the limiting distribution of the estimator under the setting with the dimension of the parameter of interest growing with the sample size at a polynomial rate. Finally, we study how to control the false discovery rate (FDR) when a vector of high-dimensional regression parameters is of interest. We prove that applying Storey's (2002) procedure to the proposed test statistics for each regression parameter controls the FDR asymptotically in longitudinal data. We conduct simulation studies to assess the finite sample performance of the proposed procedures. Our simulation results indicate that the newly proposed procedure can control both the Type I error for testing a low dimensional parameter of interest and the FDR in the multiple testing problem. We also apply the proposed procedure to a real data example.
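For readers unfamiliar with the FDR step in the preceding abstract, the following is a simplified sketch of a Storey-type procedure applied to a list of p-values. It estimates the proportion of true nulls with a single tuning value `lam` and then rejects the largest set of smallest p-values whose estimated FDR stays at or below `alpha`. The function names, the default `lam = 0.5`, and the p-values themselves are assumptions for illustration; the paper's decorrelated test statistics are not reproduced here.

```python
def storey_pi0(pvalues, lam=0.5):
    """Storey-type estimate of the proportion of true null hypotheses.

    Under the null, p-values are roughly uniform, so the share above `lam`
    divided by (1 - lam) estimates the null proportion. Capped at 1.
    """
    m = len(pvalues)
    above = sum(1 for p in pvalues if p > lam)
    return min(1.0, above / (m * (1.0 - lam)))


def storey_reject(pvalues, alpha=0.05, lam=0.5):
    """Return a list of booleans marking which hypotheses are rejected.

    The estimated FDR when rejecting the `rank` smallest p-values is
    pi0 * m * p_(rank) / rank; we keep the largest rank meeting `alpha`.
    """
    m = len(pvalues)
    pi0 = storey_pi0(pvalues, lam)
    order = sorted(range(m), key=lambda i: pvalues[i])

    n_reject = 0
    for rank, i in enumerate(order, start=1):
        if pi0 * m * pvalues[i] / rank <= alpha:
            n_reject = rank

    rejected = set(order[:n_reject])
    return [i in rejected for i in range(m)]
```

When the estimated null proportion equals 1, this rule coincides with the Benjamini-Hochberg step-up procedure; smaller estimates make it less conservative.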
Project description:Consider a semiparametric model indexed by a Euclidean parameter of interest and an infinite-dimensional nuisance parameter. In many applications, pseudolikelihood provides a convenient way to infer the parameter of interest, where the nuisance parameter is replaced by a consistent estimator. The purpose of this paper is to establish the asymptotic behaviour of the pseudolikelihood ratio statistic under semiparametric models. In particular, we consider testing the hypothesis that the parameter of interest lies on the boundary of its parameter space. Under regularity conditions, we establish the equivalence between the asymptotic distributions of the pseudolikelihood ratio statistic and a likelihood ratio statistic for a normal mean problem with a misspecified covariance matrix. This result holds when the nuisance parameter is estimated at a rate slower than the usual rate in parametric models. We study three examples in which the asymptotic distributions are shown to be mixtures of chi-squared variables. We conduct simulation studies to examine the finite-sample performance of the pseudolikelihood ratio test.