Project description: Background: In prognostic studies, the lasso technique is attractive since it improves the quality of predictions by shrinking regression coefficients, compared to predictions based on a model fitted via unpenalized maximum likelihood. Since some coefficients are set to zero, parsimony is achieved as well. It is unclear whether the performance of a model fitted using the lasso still shows some optimism. Bootstrap methods have been advocated to quantify optimism and generalize model performance to new subjects. It is unclear how resampling should be performed in the presence of multiply imputed data. Method: The data were based on a cohort of Chronic Obstructive Pulmonary Disease patients. We constructed models to predict Chronic Respiratory Questionnaire dyspnea 6 months ahead. Optimism of the lasso model was investigated by comparing four approaches to handling multiply imputed data in the bootstrap procedure, using the study data and simulated data sets. In the first three approaches, data sets that had been completed via multiple imputation (MI) were resampled, while the fourth approach resampled the incomplete data set and then performed MI. Results: The discriminative performance of the lasso model was optimistic. Calibration was suboptimal due to over-shrinkage. The estimate of optimism was sensitive to how imputed data were handled in the bootstrap resampling procedure. Resampling the completed data sets underestimates optimism, especially if, within a bootstrap step, the selected individuals differ across the imputed data sets. Incorporating the MI procedure in the validation yields estimates of optimism that are closer to the true value, albeit slightly too large. Conclusion: The performance of prognostic models constructed using the lasso technique can be optimistic as well. Results of the internal validation are sensitive to how bootstrap resampling is performed.
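The bootstrap optimism correction described above can be sketched for the simplest case of a single, fully observed data set (synthetic data; an L1-penalized logistic fit stands in for the lasso, `C` is an assumed tuning value, and MI is not re-run inside the loop as the fourth approach would require):

```python
# Harrell-style bootstrap optimism correction, minimal sketch.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, p = 300, 10
X = rng.normal(size=(n, p))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 0).astype(int)

def fit_lasso(X, y):
    # L1-penalized logistic regression as a stand-in for the lasso
    return LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)

apparent = roc_auc_score(y, fit_lasso(X, y).decision_function(X))
optimism = []
for _ in range(50):
    idx = rng.integers(0, n, n)                        # resample rows with replacement
    m = fit_lasso(X[idx], y[idx])
    boot = roc_auc_score(y[idx], m.decision_function(X[idx]))
    orig = roc_auc_score(y, m.decision_function(X))    # original data as test set
    optimism.append(boot - orig)

corrected = apparent - float(np.mean(optimism))
print(round(apparent, 3), round(corrected, 3))
```

The corrected AUC is the apparent AUC minus the average over-performance of bootstrap models on their own resamples.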
Project description: Background: The use of alternative modeling techniques for predicting patient survival is complicated by the fact that some alternative techniques cannot readily deal with censoring, which is essential for analyzing survival data. In the current study, we aimed to demonstrate that pseudo values enable statistically appropriate analyses of survival outcomes when used in seven alternative modeling techniques. Methods: In this case study, we analyzed survival of 1282 Dutch patients with newly diagnosed Head and Neck Squamous Cell Carcinoma (HNSCC) with conventional Kaplan-Meier and Cox regression analysis. We subsequently calculated pseudo values to reflect the individual survival patterns. We used these pseudo values to compare recursive partitioning (RPART), neural nets (NNET), logistic regression (LR), general linear models (GLM), and three variants of support vector machines (SVM) with respect to dichotomous 60-month survival, and continuous pseudo values at 60 months or estimated survival time. We used the area under the ROC curve (AUC) and the root mean squared error (RMSE) to compare the performance of these models using bootstrap validation. Results: Of a total of 1282 patients, 986 patients died during a median follow-up of 66 months (60-month survival: 52% [95% CI: 50%-55%]). The LR model had the highest optimism-corrected AUC (0.791) to predict 60-month survival, followed by the SVM model with a linear kernel (AUC 0.787). The GLM model had the smallest optimism-corrected RMSE when continuous pseudo values were considered for 60-month survival or the estimated survival time, followed by SVM models with a linear kernel.
The estimated importance of predictors varied substantially by the specific aspect of survival studied and the modeling technique used. Conclusions: The use of pseudo values makes it readily possible to apply alternative modeling techniques to survival problems, to compare their performance, and to search further for promising alternative techniques for analyzing survival time.
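The pseudo values referred to above are jackknife pseudo-observations of the Kaplan-Meier estimator, theta_i = n*S(t) - (n-1)*S_{-i}(t). A minimal NumPy sketch (not the authors' code):

```python
import numpy as np

def km_surv(time, event, t):
    """Kaplan-Meier estimate of S(t)."""
    s = 1.0
    for u in np.sort(np.unique(time[event == 1])):
        if u > t:
            break
        at_risk = np.sum(time >= u)
        deaths = np.sum((time == u) & (event == 1))
        s *= 1.0 - deaths / at_risk
    return s

def pseudo_values(time, event, t):
    """Jackknife pseudo-observations: n*S(t) - (n-1)*S_{-i}(t)."""
    n = len(time)
    full = km_surv(time, event, t)
    keep = np.ones(n, dtype=bool)
    pv = np.empty(n)
    for i in range(n):
        keep[i] = False
        pv[i] = n * full - (n - 1) * km_surv(time[keep], event[keep], t)
        keep[i] = True
    return pv

# with no censoring, the pseudo values reduce to the indicator I(T_i > t)
time = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
event = np.ones(5, dtype=int)
print(np.round(pseudo_values(time, event, t=2.5), 6))
```

Each subject thereby gets a continuous outcome at time t, which any regression-style learner can consume.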
Project description: As a performance measure for a prediction model, the area under the receiver operating characteristic curve (AUC) is insensitive to the addition of strong markers. A number of measures sensitive to performance change have recently been proposed; however, these relative-performance measures may lead to self-contradictory conclusions. This paper examines alternative performance measures for prediction models: the Lorenz curve-based Gini and Pietra indices, and a standardized version of the Brier score, the scaled Brier score. Computer simulations are performed in order to study the sensitivity of these measures to performance change when a new marker is added to a baseline model. When the discrimination power of the added marker is concentrated in the gray zone of the baseline model, the AUC and the Gini show minimal performance improvements, whereas the Pietra and the scaled Brier show substantially larger improvements in the same situation. The Pietra and scaled Brier indices are therefore recommended for measuring prediction model performance, in light of their ease of interpretation, clinical relevance, and sensitivity to gray-zone-resolving markers.
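The scaled Brier score standardizes the Brier score against a null model that predicts the observed event prevalence for everyone. A minimal sketch with illustrative numbers:

```python
import numpy as np

def scaled_brier(y, p):
    """Scaled Brier score: 1 - Brier / Brier(null), where the null model
    predicts the observed event prevalence for every subject."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    brier = np.mean((p - y) ** 2)
    prev = y.mean()
    return 1.0 - brier / (prev * (1.0 - prev))

y = np.array([0, 0, 1, 1])
print(scaled_brier(y, np.array([0.1, 0.2, 0.8, 0.9])))  # close to 1: good model
print(scaled_brier(y, np.full(4, 0.5)))                 # 0: no better than prevalence
```

Like R-squared, it is 1 for a perfect model and 0 for a model no better than predicting the prevalence.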
Project description: An important goal of censored quantile regression is to provide reliable predictions of survival quantiles, which are often reported in practice to offer robust and comprehensive biomedical summaries. However, formal methods for evaluating and comparing working quantile regression models in terms of their performance in predicting survival quantiles have been lacking, especially when the working models are subject to model mis-specification. In this article, we propose a sensible and rigorous framework to fill this gap. We introduce and justify a predictive performance measure defined via the check loss function. We derive estimators of the proposed predictive performance measure and study their distributional properties and the corresponding inference procedures. More importantly, we develop model comparison procedures that enable thorough evaluations of predictive performance among nested or non-nested models. Our proposals properly accommodate random censoring of the survival outcome and the realistic complication of model mis-specification, and thus are generally applicable. Extensive simulations and a real data example demonstrate satisfactory performance of the proposed methods in real-life settings.
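The check loss underlying such a performance measure is Koenker's rho_tau(u) = u * (tau - I(u < 0)); the tau-quantile minimizes its expectation. A minimal sketch (illustrative only, not the article's estimator, which must also handle censoring):

```python
import numpy as np

def check_loss(u, tau):
    """Koenker check (pinball) loss: rho_tau(u) = u * (tau - I(u < 0))."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0))

def avg_check_loss(y, q, tau):
    """Average check loss of a predicted tau-quantile q against outcomes y."""
    return float(np.mean(check_loss(np.asarray(y) - q, tau)))

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
# the sample median minimizes the average tau = 0.5 check loss,
# so it scores better than the (outlier-sensitive) mean
print(avg_check_loss(y, np.median(y), 0.5))
print(avg_check_loss(y, np.mean(y), 0.5))
```

Comparing two working models then amounts to comparing their average check losses on held-out outcomes.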
Project description: Support vector regression (SVR) was used to investigate quantitative structure-activity relationships (QSAR) of 75 phenolic compounds with Trolox-equivalent antioxidant capacity (TEAC). Geometric structures were optimized at the EF level of the MOPAC software program. Using Pearson correlation coefficient analysis, four molecular descriptors [n(OH), Cosmo Area (CA), Core-Core Repulsion (CCR) and Final Heat of Formation (FHF)] were selected as independent variables. The QSAR model was developed from a training set of 57 compounds, and the leave-one-out cross-validation (LOOCV) correlation coefficient was then used to evaluate its prediction ability. Artificial neural network (ANN) and multiple linear regression (MLR) models were built for comparison. The LOOCV RMSE (root mean square error) values of the SVR, ANN and MLR models were 0.44, 0.46 and 0.54, respectively. The RMSE values for prediction of the 18 external compounds were 0.41, 0.39 and 0.54 for the SVR, ANN and MLR models, respectively. These results indicate that the SVR model exhibits excellent predictive performance and is competent for predicting the TEAC of phenolic compounds.
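LOOCV evaluation of an SVR model can be sketched as follows (synthetic descriptors standing in for n(OH), CA, CCR and FHF; the hyperparameters are assumptions, not the paper's settings):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(1)
X = rng.normal(size=(57, 4))       # 57 training compounds x 4 descriptor stand-ins
y = X @ np.array([0.8, 0.3, -0.2, 0.1]) + rng.normal(scale=0.3, size=57)

model = SVR(kernel="rbf", C=10.0, epsilon=0.1)    # assumed hyperparameters
pred = cross_val_predict(model, X, y, cv=LeaveOneOut())  # each compound held out once
rmse = float(np.sqrt(np.mean((pred - y) ** 2)))
print(round(rmse, 2))
```

Each compound is predicted by a model trained on the other 56, so the RMSE reflects out-of-sample error rather than fit to the training set.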
Project description: Ellenberg indicator values (EIVs) are a widely used metric in plant ecology, comprising a semi-quantitative description of species' ecological requirements. Typically, point estimates of mean EIV scores are compared over space or time to infer differences in the environmental conditions structuring plant communities, particularly in resurvey studies where no historical environmental data are available. However, using point estimates as a basis for inference does not account for variance among species EIVs within sampled plots and gives equal weight to means calculated from plots with differing numbers of species. Traditional methods are also vulnerable to inaccurate estimates where only incomplete species lists are available. We present a set of multilevel (hierarchical) models, fitted with and without group-level predictors (e.g., habitat type), to improve the precision and accuracy of plot mean EIV scores and to provide more reliable inference on changing environmental conditions over spatial and temporal gradients in resurvey studies. We compare multilevel model performance to GLMMs fitted to point estimates of mean EIVs. We also test the reliability of this method for improving inference when species lists are incomplete in some or all sample plots. Hierarchical modeling led to more accurate and precise estimates of plot-level differences in mean EIV scores between time periods, particularly for datasets with incomplete records of species occurrence. Furthermore, hierarchical models revealed directional environmental change within ecological habitat types that the less precise estimates from GLMMs of raw mean EIVs failed to detect.
The ability to compute separate residual variance and adjusted R² parameters for plot mean EIVs and for temporal differences in plot mean EIVs in multilevel models also allowed us to uncover a prominent role of hydrological differences as a driver of community compositional change in our case study, which traditional use of EIVs would fail to reveal. Assessing the environmental change underlying ecological communities is a vital issue in the face of accelerating anthropogenic change. We have demonstrated that multilevel modeling of EIVs allows a nuanced estimation of such changes from plant assemblage data at local scales and beyond, leading to a better understanding of the temporal dynamics of ecosystems. Further, the ability of these methods to perform well with missing data should increase the total set of historical data that can be used to this end.
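The partial pooling that drives such multilevel estimates can be sketched with a random-intercept model (synthetic data; `statsmodels`' MixedLM serves as a generic fitter, not the authors' model, which also included group-level predictors):

```python
# Partial pooling of plot mean EIV scores: plot-level BLUPs are shrunk
# toward the grand mean, more strongly for species-poor plots.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
sizes = rng.integers(3, 15, size=20)           # unequal species richness per plot
plots = np.repeat(np.arange(20), sizes)
true_mean = rng.normal(5.0, 1.0, size=20)      # true plot-level mean EIVs
eiv = true_mean[plots] + rng.normal(scale=1.5, size=len(plots))
df = pd.DataFrame({"plot": plots, "eiv": eiv})

m = smf.mixedlm("eiv ~ 1", df, groups=df["plot"]).fit()
# plot-level estimates = grand mean + shrunken random intercepts (BLUPs)
blup = m.fe_params["Intercept"] + np.array(
    [float(re.iloc[0]) for re in m.random_effects.values()]
)
raw = df.groupby("plot")["eiv"].mean().to_numpy()
print(round(float(np.std(raw)), 2), round(float(np.std(blup)), 2))
```

The pooled estimates vary less than the raw plot means because noisy means from small plots borrow strength from the rest of the data.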
Project description: Determining the target population for the screening of Barrett's esophagus (BE), a precancerous condition of esophageal adenocarcinoma, remains a challenge in Asia. The aim of our study was to develop risk prediction models for BE using logistic regression (LR) and artificial neural network (ANN) methods, and to compare their predictive performances. We retrospectively analyzed 9646 adults aged ≥20 years undergoing upper gastrointestinal endoscopy at a health examination center in Taiwan. Evaluated by 10-fold cross-validation, both models exhibited good discriminative power, with comparable areas under the curve (both AUCs were 0.702). Our risk prediction models for BE were developed from individuals with or without clinical indications for upper gastrointestinal endoscopy. The models have the potential to serve as a practical tool for identifying individuals in the general population at high risk of BE for endoscopic screening.
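The 10-fold cross-validated AUC evaluation can be sketched as follows (synthetic data; the features and effect sizes are illustrative stand-ins, not the study's predictors):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(3)
n = 1000
X = rng.normal(size=(n, 5))        # stand-ins for clinical predictors
y = (0.8 * X[:, 0] + 0.4 * X[:, 1] + rng.normal(size=n) > 1.0).astype(int)

# stratified folds preserve the outcome prevalence in each split
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                      cv=cv, scoring="roc_auc")
print(round(float(auc.mean()), 3))
```

The same `cv` object can be passed to an ANN estimator (e.g. `MLPClassifier`) so that both models are scored on identical folds.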
Project description: Motivation: The DNA binding specificity of a transcription factor (TF) is typically represented using a position weight matrix (PWM) model, which implicitly assumes that individual bases in a TF binding site contribute independently to the binding affinity, an assumption that does not always hold. For this reason, more complex models of binding specificity have been developed. However, these models have their own caveats: they typically have a large number of parameters, which makes them hard to learn and interpret. Results: We propose novel regression-based models of TF-DNA binding specificity, trained using high resolution in vitro data from custom protein binding microarray (PBM) experiments. Our PBMs are specifically designed to cover a large number of putative DNA binding sites for the TFs of interest (yeast TFs Cbf1 and Tye7, and human TFs c-Myc, Max, and Mad2) in their native genomic context. These high-throughput, quantitative data are well suited for training complex models that take into account not only independent contributions from individual bases, but also contributions from di- and trinucleotides at various positions within or near the binding sites. To ensure that our models remain interpretable, we use feature selection to identify a small number of sequence features that accurately predict TF-DNA binding specificity. To further illustrate the accuracy of our regression models, we show that even in the case of paralogous TFs with highly similar PWMs, our new models can distinguish the specificities of individual factors. Thus, our work represents an important step towards better sequence-based models of individual TF-DNA binding specificity. Four protein binding microarray (PBM) experiments of human transcription factors were performed.
Briefly, the PBMs involved binding GST-tagged transcription factors c-Myc, Max, and Mad2 (Mxi1) to double-stranded 180K Agilent microarrays in order to determine their binding specificity for putative DNA binding sites in native genomic context. The arrays represent three categories of 36-bp sequences: 1) bound probes, 2) unbound probes (negative controls), and 3) test probes. Bound probes corresponded to genomic regions bound in vivo by c-Myc, Max, or Mad2 (ChIP-seq P < 10^(-10) in HeLaS3 or K562 cells (ENCODE)) that contain at least two consecutive 8-mers with universal PBM E-score > 0.4 (Munteanu and Gordan, LNCS 2013). All putative binding sites occur at the same position within the probes on the array. "Unbound" probes corresponded to genomic regions with ChIP-seq P < 10^(-10) and a maximum 8-mer E-score < 0.2. We also designed test probes that contain, within constant flanking regions, all nnCACGTGnn 10-mers and 18 nnnCACGTGnnn 12-mers (where n = A, C, G, or T). Each DNA sequence represented on the array is present in 6 replicate spots. We report the PBM signal intensity for each spot. The PBM protocol is described in Berger et al., Nature Biotechnology 2006 (PMID 16998473).
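A regression model with mononucleotide and adjacent-dinucleotide features, plus sparsity-inducing feature selection, can be sketched as follows (toy sequences and intensities, not the PBM data; the lasso here is a generic stand-in for the paper's feature-selection step):

```python
import numpy as np
from itertools import product
from sklearn.linear_model import Lasso

def features(seq):
    """One-hot mononucleotide plus adjacent-dinucleotide indicators."""
    bases = "ACGT"
    dinucs = ["".join(d) for d in product(bases, repeat=2)]
    f = []
    for b in seq:
        f += [float(b == x) for x in bases]
    for i in range(len(seq) - 1):
        f += [float(seq[i:i + 2] == d) for d in dinucs]
    return np.array(f)

rng = np.random.default_rng(4)
seqs = ["".join(rng.choice(list("ACGT"), size=8)) for _ in range(200)]
# toy signal: intensity depends on a CG dinucleotide at positions 3-4
y = np.array([2.0 * (s[3:5] == "CG") for s in seqs]) + rng.normal(scale=0.1, size=200)
X = np.vstack([features(s) for s in seqs])

model = Lasso(alpha=0.01, max_iter=5000).fit(X, y)
print(int(np.sum(model.coef_ != 0)), round(model.score(X, y), 2))
```

The L1 penalty zeroes out most of the 144 position-specific features, leaving a small, interpretable set, which mirrors the interpretability goal stated above.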
Project description: A new distribution defined on the (0,1) interval is introduced. Its probability density and cumulative distribution functions have simple forms. Owing to these simple forms, the moments, incomplete moments, and quantile function of the proposed distribution are obtained in explicit form. Four parameter estimation methods are used to estimate the unknown parameter of the distribution, and a simulation study is conducted to compare their efficiency. More importantly, the proposed distribution provides an alternative regression model for bounded response variables. The proposed regression model is compared with the beta and unit-Lindley regression models on two real data sets.
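The abstract does not name the distribution, so as a stand-in this sketch contrasts two of the usual estimation methods (closed-form MLE vs. method of moments) for the one-parameter power distribution f(x; a) = a*x^(a-1) on (0, 1), which also has explicit moments and quantile function:

```python
import numpy as np

rng = np.random.default_rng(5)
a_true = 3.0
u = rng.uniform(size=5000)
x = u ** (1.0 / a_true)                 # inverse-CDF sampling: Q(p) = p**(1/a)

a_mle = -x.size / np.sum(np.log(x))     # closed-form maximum likelihood estimate
m = x.mean()                            # E[X] = a / (a + 1)
a_mm = m / (1.0 - m)                    # method-of-moments estimate
print(round(a_mle, 2), round(a_mm, 2))
```

Repeating this over many simulated samples and comparing the spread of the two estimators is exactly the kind of efficiency comparison the abstract describes.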
Project description: Non-standard structured, multivariate data are emerging in many research areas, including genetics and genomics, ecology, and social science. Suitably defined pairwise distance measures are commonly used in distance-based analysis to study the association between the variables. In this work, we consider a linear quantile regression model for pairwise distances. We investigate the large-sample properties of an estimator of the unknown coefficients and propose corresponding statistical inference procedures. Extensive simulations provide evidence of satisfactory finite-sample properties of the proposed method. Finally, we apply the method to a microbiome association study to illustrate its utility.