Project description: Background: In prognostic studies, the lasso technique is attractive since it improves the quality of predictions by shrinking regression coefficients, compared to predictions based on a model fitted via unpenalized maximum likelihood. Since some coefficients are set to zero, parsimony is achieved as well. It is unclear whether the performance of a model fitted using the lasso still shows some optimism. Bootstrap methods have been advocated to quantify optimism and generalize model performance to new subjects. It is unclear how resampling should be performed in the presence of multiply imputed data. Method: The data were based on a cohort of Chronic Obstructive Pulmonary Disease patients. We constructed models to predict Chronic Respiratory Questionnaire dyspnea 6 months ahead. Optimism of the lasso model was investigated by comparing four approaches to handling multiply imputed data in the bootstrap procedure, using the study data and simulated data sets. In the first three approaches, data sets that had been completed via multiple imputation (MI) were resampled, while the fourth approach resampled the incomplete data set and then performed MI. Results: The discriminative performance of the lasso model was optimistic. Calibration was suboptimal due to over-shrinkage. The estimate of optimism was sensitive to how imputed data were handled in the bootstrap resampling procedure. Resampling the completed data sets underestimates optimism, especially if, within a bootstrap step, the selected individuals differ across the imputed data sets. Incorporating the MI procedure in the validation yields estimates of optimism that are closer to the true value, albeit slightly too large. Conclusion: The performance of prognostic models constructed using the lasso technique can be optimistic as well. Results of the internal validation are sensitive to how bootstrap resampling is performed.
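The bootstrap optimism correction described above can be sketched for the simplest case of a single, fully observed data set (synthetic data; an L1-penalized logistic fit stands in for the lasso, `C` is an assumed tuning value, and MI is not re-run inside the loop as the fourth approach would require):

```python
# Harrell-style bootstrap optimism correction, minimal sketch.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, p = 300, 10
X = rng.normal(size=(n, p))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 0).astype(int)

def fit_lasso(X, y):
    # L1-penalized logistic regression as a stand-in for the lasso
    return LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)

apparent = roc_auc_score(y, fit_lasso(X, y).decision_function(X))
optimism = []
for _ in range(50):
    idx = rng.integers(0, n, n)                        # resample rows with replacement
    m = fit_lasso(X[idx], y[idx])
    boot = roc_auc_score(y[idx], m.decision_function(X[idx]))
    orig = roc_auc_score(y, m.decision_function(X))    # original data as test set
    optimism.append(boot - orig)

corrected = apparent - float(np.mean(optimism))
print(round(apparent, 3), round(corrected, 3))
```

The corrected AUC is the apparent AUC minus the average over-performance of bootstrap models on their own resamples.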
Project description: Background: The use of alternative modeling techniques for predicting patient survival is complicated by the fact that some alternative techniques cannot readily deal with censoring, which is essential for analyzing survival data. In the current study, we aimed to demonstrate that pseudo values enable statistically appropriate analyses of survival outcomes when used in seven alternative modeling techniques. Methods: In this case study, we analyzed survival of 1282 Dutch patients with newly diagnosed Head and Neck Squamous Cell Carcinoma (HNSCC) with conventional Kaplan-Meier and Cox regression analysis. We subsequently calculated pseudo values to reflect the individual survival patterns. We used these pseudo values to compare recursive partitioning (RPART), neural nets (NNET), logistic regression (LR), general linear models (GLM), and three variants of support vector machines (SVM) with respect to dichotomous 60-month survival, and continuous pseudo values at 60 months or estimated survival time. We used the area under the ROC curve (AUC) and the root mean squared error (RMSE) to compare the performance of these models using bootstrap validation. Results: Of a total of 1282 patients, 986 patients died during a median follow-up of 66 months (60-month survival: 52% [95% CI: 50%-55%]). The LR model had the highest optimism-corrected AUC (0.791) to predict 60-month survival, followed by the SVM model with a linear kernel (AUC 0.787). The GLM model had the smallest optimism-corrected RMSE when continuous pseudo values were considered for 60-month survival or the estimated survival time, followed by SVM models with a linear kernel.
The estimated importance of predictors varied substantially by the specific aspect of survival studied and the modeling technique used. Conclusions: The use of pseudo values makes it readily possible to apply alternative modeling techniques to survival problems, to compare their performance, and to search further for promising alternative techniques for analyzing survival time.
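The pseudo values referred to above are jackknife pseudo-observations of the Kaplan-Meier estimator, theta_i = n*S(t) - (n-1)*S_{-i}(t). A minimal NumPy sketch (not the authors' code):

```python
import numpy as np

def km_surv(time, event, t):
    """Kaplan-Meier estimate of S(t)."""
    s = 1.0
    for u in np.sort(np.unique(time[event == 1])):
        if u > t:
            break
        at_risk = np.sum(time >= u)
        deaths = np.sum((time == u) & (event == 1))
        s *= 1.0 - deaths / at_risk
    return s

def pseudo_values(time, event, t):
    """Jackknife pseudo-observations: n*S(t) - (n-1)*S_{-i}(t)."""
    n = len(time)
    full = km_surv(time, event, t)
    keep = np.ones(n, dtype=bool)
    pv = np.empty(n)
    for i in range(n):
        keep[i] = False
        pv[i] = n * full - (n - 1) * km_surv(time[keep], event[keep], t)
        keep[i] = True
    return pv

# with no censoring, the pseudo values reduce to the indicator I(T_i > t)
time = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
event = np.ones(5, dtype=int)
print(np.round(pseudo_values(time, event, t=2.5), 6))
```

Each subject thereby gets a continuous outcome at time t, which any regression-style learner can consume.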
Project description: As a performance measure for a prediction model, the area under the receiver operating characteristic curve (AUC) is insensitive to the addition of strong markers. A number of measures sensitive to performance change have recently been proposed; however, these relative-performance measures may lead to self-contradictory conclusions. This paper examines alternative performance measures for prediction models: the Lorenz curve-based Gini and Pietra indices, and a standardized version of the Brier score, the scaled Brier score. Computer simulations are performed in order to study the sensitivity of these measures to performance change when a new marker is added to a baseline model. When the discrimination power of the added marker is concentrated in the gray zone of the baseline model, the AUC and the Gini show minimal performance improvements, whereas the Pietra and the scaled Brier show substantially larger improvements in the same situation. The Pietra and scaled Brier indices are therefore recommended for measuring prediction model performance, in light of their ease of interpretation, clinical relevance, and sensitivity to gray-zone-resolving markers.
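The scaled Brier score standardizes the Brier score against a null model that predicts the observed event prevalence for everyone. A minimal sketch with illustrative numbers:

```python
import numpy as np

def scaled_brier(y, p):
    """Scaled Brier score: 1 - Brier / Brier(null), where the null model
    predicts the observed event prevalence for every subject."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    brier = np.mean((p - y) ** 2)
    prev = y.mean()
    return 1.0 - brier / (prev * (1.0 - prev))

y = np.array([0, 0, 1, 1])
print(scaled_brier(y, np.array([0.1, 0.2, 0.8, 0.9])))  # close to 1: good model
print(scaled_brier(y, np.full(4, 0.5)))                 # 0: no better than prevalence
```

Like R-squared, it is 1 for a perfect model and 0 for a model no better than predicting the prevalence.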
Project description: An important goal of censored quantile regression is to provide reliable predictions of survival quantiles, which are often reported in practice to offer robust and comprehensive biomedical summaries. However, formal methods for evaluating and comparing working quantile regression models in terms of their performance in predicting survival quantiles have been lacking, especially when the working models are subject to model mis-specification. In this article, we propose a sensible and rigorous framework to fill this gap. We introduce and justify a predictive performance measure defined via the check loss function. We derive estimators of the proposed predictive performance measure and study their distributional properties and the corresponding inference procedures. More importantly, we develop model comparison procedures that enable thorough evaluations of predictive performance among nested or non-nested models. Our proposals properly accommodate random censoring of the survival outcome and the realistic complication of model mis-specification, and thus are generally applicable. Extensive simulations and a real data example demonstrate satisfactory performance of the proposed methods in real-life settings.
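The check loss underlying such a performance measure is Koenker's rho_tau(u) = u * (tau - I(u < 0)); the tau-quantile minimizes its expectation. A minimal sketch (illustrative only, not the article's estimator, which must also handle censoring):

```python
import numpy as np

def check_loss(u, tau):
    """Koenker check (pinball) loss: rho_tau(u) = u * (tau - I(u < 0))."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0))

def avg_check_loss(y, q, tau):
    """Average check loss of a predicted tau-quantile q against outcomes y."""
    return float(np.mean(check_loss(np.asarray(y) - q, tau)))

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
# the sample median minimizes the average tau = 0.5 check loss,
# so it scores better than the (outlier-sensitive) mean
print(avg_check_loss(y, np.median(y), 0.5))
print(avg_check_loss(y, np.mean(y), 0.5))
```

Comparing two working models then amounts to comparing their average check losses on held-out outcomes.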
Project description: Support vector regression (SVR) was used to investigate quantitative structure-activity relationships (QSAR) of 75 phenolic compounds with Trolox-equivalent antioxidant capacity (TEAC). Geometric structures were optimized at the EF level of the MOPAC software program. Using Pearson correlation coefficient analysis, four molecular descriptors [n(OH), Cosmo Area (CA), Core-Core Repulsion (CCR) and Final Heat of Formation (FHF)] were selected as independent variables. The QSAR model was developed from a training set of 57 compounds, and the leave-one-out cross-validation (LOOCV) correlation coefficient was then used to evaluate its prediction ability. Artificial neural network (ANN) and multiple linear regression (MLR) models were built for comparison. The LOOCV RMSE (root mean square error) values of the SVR, ANN and MLR models were 0.44, 0.46 and 0.54, respectively. The RMSE values for prediction of the 18 external compounds were 0.41, 0.39 and 0.54 for the SVR, ANN and MLR models, respectively. These results indicate that the SVR model exhibits excellent predictive performance and is competent for predicting the TEAC of phenolic compounds.
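LOOCV evaluation of an SVR model can be sketched as follows (synthetic descriptors standing in for n(OH), CA, CCR and FHF; the hyperparameters are assumptions, not the paper's settings):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(1)
X = rng.normal(size=(57, 4))       # 57 training compounds x 4 descriptor stand-ins
y = X @ np.array([0.8, 0.3, -0.2, 0.1]) + rng.normal(scale=0.3, size=57)

model = SVR(kernel="rbf", C=10.0, epsilon=0.1)    # assumed hyperparameters
pred = cross_val_predict(model, X, y, cv=LeaveOneOut())  # each compound held out once
rmse = float(np.sqrt(np.mean((pred - y) ** 2)))
print(round(rmse, 2))
```

Each compound is predicted by a model trained on the other 56, so the RMSE reflects out-of-sample error rather than fit to the training set.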
Project description: Ellenberg indicator values (EIVs) are a widely used metric in plant ecology, comprising a semi-quantitative description of species' ecological requirements. Typically, point estimates of mean EIV scores are compared over space or time to infer differences in the environmental conditions structuring plant communities, particularly in resurvey studies where no historical environmental data are available. However, using point estimates as a basis for inference does not account for variance among species EIVs within sampled plots and gives equal weight to means calculated from plots with differing numbers of species. Traditional methods are also vulnerable to inaccurate estimates where only incomplete species lists are available. We present a set of multilevel (hierarchical) models, fitted with and without group-level predictors (e.g., habitat type), to improve the precision and accuracy of plot mean EIV scores and to provide more reliable inference on changing environmental conditions over spatial and temporal gradients in resurvey studies. We compare multilevel model performance to GLMMs fitted to point estimates of mean EIVs. We also test the reliability of this method for improving inference when species lists are incomplete in some or all sample plots. Hierarchical modeling led to more accurate and precise estimates of plot-level differences in mean EIV scores between time periods, particularly for datasets with incomplete records of species occurrence. Furthermore, hierarchical models revealed directional environmental change within ecological habitat types that the less precise estimates from GLMMs of raw mean EIVs failed to detect.
The ability to compute separate residual variance and adjusted R² parameters for plot mean EIVs and for temporal differences in plot mean EIVs in multilevel models also allowed us to uncover a prominent role of hydrological differences as a driver of community compositional change in our case study, which traditional use of EIVs would fail to reveal. Assessing the environmental change underlying ecological communities is a vital issue in the face of accelerating anthropogenic change. We have demonstrated that multilevel modeling of EIVs allows a nuanced estimation of such changes from plant assemblage data at local scales and beyond, leading to a better understanding of the temporal dynamics of ecosystems. Further, the ability of these methods to perform well with missing data should increase the total set of historical data that can be used to this end.
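The partial pooling that drives such multilevel estimates can be sketched with a random-intercept model (synthetic data; `statsmodels`' MixedLM serves as a generic fitter, not the authors' model, which also included group-level predictors):

```python
# Partial pooling of plot mean EIV scores: plot-level BLUPs are shrunk
# toward the grand mean, more strongly for species-poor plots.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
sizes = rng.integers(3, 15, size=20)           # unequal species richness per plot
plots = np.repeat(np.arange(20), sizes)
true_mean = rng.normal(5.0, 1.0, size=20)      # true plot-level mean EIVs
eiv = true_mean[plots] + rng.normal(scale=1.5, size=len(plots))
df = pd.DataFrame({"plot": plots, "eiv": eiv})

m = smf.mixedlm("eiv ~ 1", df, groups=df["plot"]).fit()
# plot-level estimates = grand mean + shrunken random intercepts (BLUPs)
blup = m.fe_params["Intercept"] + np.array(
    [float(re.iloc[0]) for re in m.random_effects.values()]
)
raw = df.groupby("plot")["eiv"].mean().to_numpy()
print(round(float(np.std(raw)), 2), round(float(np.std(blup)), 2))
```

The pooled estimates vary less than the raw plot means because noisy means from small plots borrow strength from the rest of the data.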
Project description: Determining the target population for the screening of Barrett's esophagus (BE), a precancerous condition of esophageal adenocarcinoma, remains a challenge in Asia. The aim of our study was to develop risk prediction models for BE using logistic regression (LR) and artificial neural network (ANN) methods, and to compare their predictive performances. We retrospectively analyzed 9646 adults aged ≥20 years undergoing upper gastrointestinal endoscopy at a health examination center in Taiwan. Evaluated by 10-fold cross-validation, both models exhibited good discriminative power, with comparable areas under the curve (both AUCs were 0.702). Our risk prediction models for BE were developed from individuals with or without clinical indications for upper gastrointestinal endoscopy. The models have the potential to serve as a practical tool for identifying individuals in the general population at high risk of BE for endoscopic screening.
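The 10-fold cross-validated AUC evaluation can be sketched as follows (synthetic data; the features and effect sizes are illustrative stand-ins, not the study's predictors):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(3)
n = 1000
X = rng.normal(size=(n, 5))        # stand-ins for clinical predictors
y = (0.8 * X[:, 0] + 0.4 * X[:, 1] + rng.normal(size=n) > 1.0).astype(int)

# stratified folds preserve the outcome prevalence in each split
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                      cv=cv, scoring="roc_auc")
print(round(float(auc.mean()), 3))
```

The same `cv` object can be passed to an ANN estimator (e.g. `MLPClassifier`) so that both models are scored on identical folds.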
Project description: Motivation: The DNA binding specificity of a transcription factor (TF) is typically represented using a position weight matrix (PWM) model, which implicitly assumes that individual bases in a TF binding site contribute independently to the binding affinity, an assumption that does not always hold. For this reason, more complex models of binding specificity have been developed. However, these models have their own caveats: they typically have a large number of parameters, which makes them hard to learn and interpret. Results: We propose novel regression-based models of TF-DNA binding specificity, trained using high resolution in vitro data from custom protein binding microarray (PBM) experiments. Our PBMs are specifically designed to cover a large number of putative DNA binding sites for the TFs of interest (yeast TFs Cbf1 and Tye7, and human TFs c-Myc, Max, and Mad2) in their native genomic context. These high-throughput, quantitative data are well suited for training complex models that take into account not only independent contributions from individual bases, but also contributions from di- and trinucleotides at various positions within or near the binding sites. To ensure that our models remain interpretable, we use feature selection to identify a small number of sequence features that accurately predict TF-DNA binding specificity. To further illustrate the accuracy of our regression models, we show that even in the case of paralogous TFs with highly similar PWMs, our new models can distinguish the specificities of individual factors. Thus, our work represents an important step towards better sequence-based models of individual TF-DNA binding specificity. Four protein binding microarray (PBM) experiments of human transcription factors were performed.
Briefly, the PBMs involved binding GST-tagged transcription factors c-Myc, Max, and Mad2 (Mxi1) to double-stranded 180K Agilent microarrays in order to determine their binding specificity for putative DNA binding sites in native genomic context. The arrays represent three categories of 36-bp sequences: 1) bound probes, 2) unbound probes (negative controls), and 3) test probes. Bound probes corresponded to genomic regions bound in vivo by c-Myc, Max, or Mad2 (ChIP-seq P < 10^(-10) in HeLaS3 or K562 cells (ENCODE)) that contain at least two consecutive 8-mers with universal PBM E-score > 0.4 (Munteanu and Gordan, LNCS 2013). All putative binding sites occur at the same position within the probes on the array. "Unbound" probes corresponded to genomic regions with ChIP-seq P < 10^(-10) and a maximum 8-mer E-score < 0.2. We also designed test probes that contain, within constant flanking regions, all nnCACGTGnn 10-mers and 18 nnnCACGTGnnn 12-mers (where n = A, C, G, or T). Each DNA sequence represented on the array is present in 6 replicate spots. We report the PBM signal intensity for each spot. The PBM protocol is described in Berger et al., Nature Biotechnology 2006 (PMID 16998473).
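A regression model with mononucleotide and adjacent-dinucleotide features, plus sparsity-inducing feature selection, can be sketched as follows (toy sequences and intensities, not the PBM data; the lasso here is a generic stand-in for the paper's feature-selection step):

```python
import numpy as np
from itertools import product
from sklearn.linear_model import Lasso

def features(seq):
    """One-hot mononucleotide plus adjacent-dinucleotide indicators."""
    bases = "ACGT"
    dinucs = ["".join(d) for d in product(bases, repeat=2)]
    f = []
    for b in seq:
        f += [float(b == x) for x in bases]
    for i in range(len(seq) - 1):
        f += [float(seq[i:i + 2] == d) for d in dinucs]
    return np.array(f)

rng = np.random.default_rng(4)
seqs = ["".join(rng.choice(list("ACGT"), size=8)) for _ in range(200)]
# toy signal: intensity depends on a CG dinucleotide at positions 3-4
y = np.array([2.0 * (s[3:5] == "CG") for s in seqs]) + rng.normal(scale=0.1, size=200)
X = np.vstack([features(s) for s in seqs])

model = Lasso(alpha=0.01, max_iter=5000).fit(X, y)
print(int(np.sum(model.coef_ != 0)), round(model.score(X, y), 2))
```

The L1 penalty zeroes out most of the 144 position-specific features, leaving a small, interpretable set, which mirrors the interpretability goal stated above.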
Project description: A new distribution defined on the (0,1) interval is introduced. Its probability density and cumulative distribution functions have simple forms. Owing to these simple forms, the moments, incomplete moments, and quantile function of the proposed distribution are obtained in explicit form. Four parameter estimation methods are used to estimate the unknown parameter of the distribution, and a simulation study is conducted to compare their efficiency. More importantly, the proposed distribution provides an alternative regression model for bounded response variables. The proposed regression model is compared with the beta and unit-Lindley regression models on two real data sets.
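The abstract does not name the distribution, so as a stand-in this sketch contrasts two of the usual estimation methods (closed-form MLE vs. method of moments) for the one-parameter power distribution f(x; a) = a*x^(a-1) on (0, 1), which also has explicit moments and quantile function:

```python
import numpy as np

rng = np.random.default_rng(5)
a_true = 3.0
u = rng.uniform(size=5000)
x = u ** (1.0 / a_true)                 # inverse-CDF sampling: Q(p) = p**(1/a)

a_mle = -x.size / np.sum(np.log(x))     # closed-form maximum likelihood estimate
m = x.mean()                            # E[X] = a / (a + 1)
a_mm = m / (1.0 - m)                    # method-of-moments estimate
print(round(a_mle, 2), round(a_mm, 2))
```

Repeating this over many simulated samples and comparing the spread of the two estimators is exactly the kind of efficiency comparison the abstract describes.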
Project description: Non-standard structured, multivariate data are emerging in many research areas, including genetics and genomics, ecology, and social science. Suitably defined pairwise distance measures are commonly used in distance-based analysis to study the association between the variables. In this work, we consider a linear quantile regression model for pairwise distances. We investigate the large-sample properties of an estimator of the unknown coefficients and propose corresponding statistical inference procedures. Extensive simulations provide evidence of satisfactory finite-sample properties of the proposed method. Finally, we apply the method to a microbiome association study to illustrate its utility.