Project description:Model averaging is an effective way to enhance prediction accuracy. However, most previous works focus on low-dimensional settings with completely observed responses. To attain an accurate prediction for the risk effect of survival data with high-dimensional predictors, we propose a novel method: rank-based greedy (RG) model averaging. Specifically, adopting the transformation model with splitting predictors as working models, we use the smooth concordance index function twice: once to derive the candidate predictions and once to derive the optimal model weights. The final prediction is obtained as a weighted average of all the candidates. Our approach is flexible, computationally efficient, and robust against model misspecification, as it neither requires the correctness of a joint model nor involves the estimation of the transformation function. We further adopt the greedy algorithm for high dimensions. Theoretically, we derive an asymptotic error bound for the optimal weights under some mild conditions. In addition, the sum of the weights assigned to the correct candidate submodels is proven to approach one in probability when correct models are included among the candidate submodels. Extensive numerical studies are carried out using both simulated and real datasets to show the proposed approach's robust performance compared to the existing regularization approaches. Supplementary materials for this article are available online.
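As an illustration of the concordance index mentioned above, the following is a minimal sketch and not the authors' implementation. It assumes a Harrell-type concordance index for right-censored data and a sigmoid smoothing with a hypothetical bandwidth `h`; the function name `concordance` and its arguments are chosen here for illustration only.

```python
import numpy as np

def concordance(time, event, score, h=None):
    """Concordance between risk scores and right-censored survival times.

    A pair (i, j) is comparable when subject i has an observed event
    and time[i] < time[j]. With h=None the hard indicator
    score[i] > score[j] is used; otherwise the indicator is replaced by
    a sigmoid with bandwidth h (an assumed smoothing choice).
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    score = np.asarray(score, dtype=float)

    # comparable[i, j] is True when i failed before j was last seen.
    comparable = (time[:, None] < time[None, :]) & event[:, None]
    diff = score[:, None] - score[None, :]

    if h is None:
        agree = (diff > 0).astype(float)           # hard indicator, ties count as 0
    else:
        agree = 1.0 / (1.0 + np.exp(-diff / h))    # smooth surrogate

    return float(np.sum(agree * comparable) / np.sum(comparable))
```

The smooth version is differentiable in the scores, which is what makes it usable as an objective for fitting coefficients or weights; as `h` shrinks it approaches the hard index.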
Project description:Applied researchers often confront two issues when using the fixed effect-two-stage least squares (FE-2SLS) estimator for panel data models. One is that it may lose its consistency due to too many instruments. The other is that the gain of using FE-2SLS may not exceed its loss when the endogeneity is weak. In this paper, an L2Boosting regularization procedure for panel data models is proposed to tackle the many instruments issue. We then construct a Stein-like model-averaging estimator to take advantage of FE and FE-2SLS-Boosting estimators. Finite sample properties are examined in Monte Carlo and an empirical application is presented.
Project description:Background: Model averaging has attracted increasing attention in recent years for the analysis of high-dimensional data. By weighting several competing statistical models suitably, model averaging attempts to achieve stable and improved prediction. In this paper, we develop a two-stage model averaging procedure to enhance accuracy and stability in prediction for high-dimensional linear regression. First, we employ a high-dimensional variable selection method such as LASSO to screen out redundant predictors and construct a class of candidate models; then we apply jackknife cross-validation to optimize the model weights for averaging. Results: In simulation studies, the proposed technique outperforms commonly used alternative methods under the high-dimensional regression setting in terms of minimizing the mean squared prediction error. We apply the proposed method to a riboflavin dataset; the results show that the method is quite efficient in forecasting the riboflavin production rate when there are thousands of genes and only tens of subjects. Conclusions: Compared with a recent high-dimensional model averaging procedure (Ando and Li in J Am Stat Assoc 109:254-65, 2014), the proposed approach enjoys three appealing features and thus has better predictive performance: (1) More suitable methods are applied for model construction and weighting. (2) Computational flexibility is retained, since each candidate model and its corresponding weight are determined in the low-dimensional setting and quadratic programming is used in the cross-validation. (3) Model selection and averaging are combined in the procedure, so it makes full use of the strengths of both techniques. As a consequence, the proposed method can achieve stable and accurate predictions in high-dimensional linear models, and can greatly help practical researchers analyze genetic data in medical research.
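The second stage described above, choosing averaging weights by jackknife (leave-one-out) cross-validation, can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's code: the candidate models are assumed to be given already as lists of column indices (the LASSO screening stage is omitted), each candidate is fitted by ordinary least squares, and the simplex-constrained quadratic program is solved with a Frank-Wolfe iteration swapped in for a dedicated QP solver. The names `loo_predictions` and `jackknife_weights` are hypothetical.

```python
import numpy as np

def loo_predictions(X, y):
    """Leave-one-out OLS predictions via the hat-matrix shortcut."""
    Q, _ = np.linalg.qr(X)                 # orthonormal basis of col(X)
    leverage = np.sum(Q ** 2, axis=1)      # diagonal of the hat matrix
    fitted = Q @ (Q.T @ y)
    # LOO residual = ordinary residual / (1 - leverage)
    return y - (y - fitted) / (1.0 - leverage)

def jackknife_weights(X, y, candidate_sets, n_iter=500):
    """Minimize ||y - P w||^2 over the probability simplex, where column k
    of P holds the leave-one-out predictions of candidate model k."""
    P = np.column_stack(
        [loo_predictions(X[:, cols], y) for cols in candidate_sets]
    )
    m = P.shape[1]
    w = np.full(m, 1.0 / m)                # start from equal weights
    for _ in range(n_iter):
        resid = y - P @ w
        grad = -2.0 * (P.T @ resid)
        k = int(np.argmin(grad))           # best vertex of the simplex
        direction = -w.copy()
        direction[k] += 1.0                # move from w toward e_k
        Pd = P @ direction
        denom = Pd @ Pd
        if denom <= 1e-15:
            break
        step = float(np.clip((resid @ Pd) / denom, 0.0, 1.0))  # exact line search
        if step == 0.0:
            break
        w = w + step * direction           # stays on the simplex
    return w
```

Because every candidate is low-dimensional, the leave-one-out predictions cost one QR factorization per model, and the weight problem has only as many unknowns as there are candidates.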
Project description:Background: Microarray technology is increasingly used to identify potential biomarkers for cancer prognostics and diagnostics. Previously, we have developed the iterative Bayesian Model Averaging (BMA) algorithm for use in classification. Here, we extend the iterative BMA algorithm for application to survival analysis on high-dimensional microarray data. The main goal in applying survival analysis to microarray data is to determine a highly predictive model of patients' time to event (such as death, relapse, or metastasis) using a small number of selected genes. Our multivariate procedure combines the effectiveness of multiple contending models by calculating the weighted average of their posterior probability distributions. Our results demonstrate that our iterative BMA algorithm for survival analysis achieves high prediction accuracy while consistently selecting a small and cost-effective number of predictor genes. Results: We applied the iterative BMA algorithm to two cancer datasets: breast cancer and diffuse large B-cell lymphoma (DLBCL) data. On the breast cancer data, the algorithm selected a total of 15 predictor genes across 84 contending models from the training data. The maximum likelihood estimates of the selected genes and the posterior probabilities of the selected models from the training data were used to divide patients in the test (or validation) dataset into high- and low-risk categories. Using the genes and models determined from the training data, we assigned patients from the test data into highly distinct risk groups (as indicated by a p-value of 7.26e-05 from the log-rank test). Moreover, we achieved comparable results using only the 5 top selected genes with 100% posterior probabilities. On the DLBCL data, our iterative BMA procedure selected a total of 25 genes across 3 contending models from the training data.
Once again, we assigned the patients in the validation set to significantly distinct risk groups (p-value = 0.00139). Conclusion: The strength of the iterative BMA algorithm for survival analysis lies in its ability to account for model uncertainty. The results from this study demonstrate that our procedure selects a small number of genes while eclipsing other methods in predictive performance, making it a highly accurate and cost-effective prognostic tool in the clinical setting.
Project description:In this paper, we develop a model averaging method to estimate a high-dimensional covariance matrix, where the candidate models are constructed by different orders of polynomial functions. We propose a Mallows-type model averaging criterion and select the weights by minimizing this criterion, which is an unbiased estimator of the expected in-sample squared error plus a constant. Then, we prove the asymptotic optimality of the resulting model average covariance estimators. Finally, we conduct numerical simulations and a case study on Chinese airport network structure data to demonstrate the usefulness of the proposed approaches.
Project description:To clarify the effects of near-infrared radiation, we assessed DNA microarray after water-filtered near-infrared (1100-1800 nm together with a water-filter that excludes wavelengths 1400-1500 nm) irradiation.
Project description:Motivation: DNA methylation plays an important role in many biological processes and cancer progression. Recent studies have found that, besides differences in methylation means, there are also differences in methylation variation between groups. Several methods have been developed that consider both mean and variance signals in order to improve the statistical power of detecting differentially methylated loci. Moreover, as methylation levels of neighboring CpG sites are known to be strongly correlated, methods that incorporate correlations have also been developed. We previously developed a network-based penalized logistic regression for correlated methylation data, but it focuses only on mean signals. We have also developed a generalized exponential tilt model that captures both mean and variance signals, but it examines only one CpG site at a time. Results: In this article, we propose a penalized Exponential Tilt Model (pETM) using network-based regularization that captures both mean and variance signals in DNA methylation data and takes into account the correlations among nearby CpG sites. By combining the strengths of the two models we previously developed, we demonstrate the superior power and better performance of the pETM method through simulations and applications to the 450K DNA methylation array data of the four breast invasive carcinoma cancer subtypes from The Cancer Genome Atlas (TCGA) project. The pETM method identifies many cancer-related methylation loci that were missed by our previously developed method, which considers correlations among nearby methylation loci but not variance signals. Availability and implementation: The R package 'pETM' is publicly available through CRAN: http://cran.r-project.org. Contact: sw2206@columbia.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Project description:An irradiated turbid medium scatters light in accordance with its optical properties. Near-infrared (NIR) clinical methods, which are based on spectral-dependent absorption, suffer from an inherent error due to spectral-dependent scattering. We present here a unique spatial point, the iso-pathlength (IPL) point, on the surface of a tissue at which the intensity of re-emitted light remains constant. This scattering-indifferent point depends solely on the medium geometry. On the basis of this natural phenomenon, we suggest a novel optical method for self-calibrated clinical measurements. We found that the IPL point exists in both cylindrical and semi-infinite tissue geometries (Supporting Information, Video file). Finally, in vivo human finger and mouse measurements are used to validate the crossing point between the intensity profiles of two wavelengths. Hence, measurements at the IPL point yield an accurate absorption assessment while eliminating the scattering dependence. This finding can be useful for oxygen saturation determination, NIR spectroscopy, photoplethysmography measurements, and a wide range of optical sensing methods for physiological aims.
Project description:Optimal management of free-ranging herbivores requires the accurate assessment of an animal's nutritional status. For this purpose 'near-infrared reflectance spectroscopy' (NIRS) is very useful, especially when nutritional assessment is done through faecal indicators such as faecal nitrogen (FN). In order to perform an NIRS calibration, the default protocol recommends starting by generating an initial equation based on at least 50-75 samples from the given species. Although this protocol optimises prediction accuracy, it limits the use of NIRS with rare or endangered species where sample sizes are often small. To overcome this limitation we tested a single NIRS equation (i.e., multispecies calibration) to predict FN in herbivores. Firstly, we used five herbivore species with highly contrasting digestive physiologies to build monospecies and multispecies calibrations, namely horse, sheep, Pyrenean chamois, red deer and European rabbit. Secondly, the equation accuracy was evaluated by two procedures using: (1) an external validation with samples from the same species, which were not used in the calibration process; and (2) samples from different ungulate species, specifically Alpine ibex, domestic goat, European mouflon, roe deer and cattle. The multispecies equation was highly accurate in terms of the coefficient of determination for calibration R2 = 0.98, standard error of validation SECV = 0.10, standard error of external validation SEP = 0.12, ratio of performance to deviation RPD = 5.3, and range error of prediction RER = 28.4. The accuracy of the multispecies equation to predict other herbivore species was also satisfactory (R2 > 0.86, SEP < 0.27, RPD > 2.6, and RER > 8.1). Lastly, the agreement between multi- and monospecies calibrations was also confirmed by the Bland-Altman method. 
In conclusion, our single multispecies equation can be used as a reliable, cost-effective, easy and powerful analytical method to assess FN in a wide range of herbivore species.
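The validation statistics quoted above (R2, SEP, RPD, RER) can be computed from paired reference and predicted values. The sketch below assumes common chemometric definitions, which may differ in detail from those used in the study: SEP is taken as the root mean squared error of prediction without bias correction, RPD as the sample standard deviation of the reference values divided by SEP, and RER as the range of the reference values divided by SEP. The function name `nirs_validation_metrics` is hypothetical.

```python
import numpy as np

def nirs_validation_metrics(reference, predicted):
    """Summary statistics for an external validation set.

    Assumed definitions (not taken from the abstract):
      SEP = sqrt(mean((predicted - reference)^2))
      RPD = sd(reference) / SEP
      RER = (max(reference) - min(reference)) / SEP
    """
    reference = np.asarray(reference, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = predicted - reference

    sep = float(np.sqrt(np.mean(err ** 2)))
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((reference - reference.mean()) ** 2))

    return {
        "R2": 1.0 - ss_res / ss_tot,
        "SEP": sep,
        "RPD": float(np.std(reference, ddof=1)) / sep,
        "RER": float(reference.max() - reference.min()) / sep,
    }
```

Under these definitions, a larger RPD or RER means the prediction error is small relative to the spread of the reference values, which is how the figures reported above should be read.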