High Dimensional Variable Selection with Error Control.
ABSTRACT: Background. Iterative sure independence screening (ISIS) is a popular method for selecting important variables while retaining most of the informative variables relevant to the outcome in high-throughput data. However, it is not only computationally intensive but may also produce a high false discovery rate (FDR). We propose using the FDR as a screening method that reduces the high dimension to a lower dimension and controls the FDR in combination with three popular variable selection methods: LASSO, SCAD, and MCP. Method. The three methods with the proposed screenings were applied to prostate cancer data with presence of metastasis as the outcome. Results. Simulations showed that the three variable selection methods with the proposed screenings controlled the predefined FDR and produced high area under the receiver operating characteristic curve (AUROC) scores. In applying these methods to the prostate cancer example, LASSO and MCP selected 12 and 8 genes and produced AUROC scores of 0.746 and 0.764, respectively. Conclusions. We demonstrated that the variable selection methods with the sequential use of FDR and ISIS not only controlled the predefined FDR in the final models but also had relatively high AUROC scores.
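As an illustration of FDR-based screening, the sketch below applies the Benjamini-Hochberg step-up rule to a list of marginal p-values and keeps only the variables that pass. The abstract does not say which FDR procedure was used, so Benjamini-Hochberg is an assumption here, and the function name `bh_screen` and the p-values are hypothetical.

```python
def bh_screen(p_values, q=0.05):
    """Return indices of variables kept by the Benjamini-Hochberg step-up rule.

    Assumption: one marginal p-value per candidate variable; q is the target FDR.
    """
    m = len(p_values)
    # Rank variables from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: remember the largest rank whose p-value is under its threshold.
        if p_values[idx] <= rank * q / m:
            k = rank
    return sorted(order[:k])


# Hypothetical p-values for ten candidate variables.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
kept = bh_screen(p, q=0.05)
print(kept)
```

The retained indices would then be passed to a penalized regression such as LASSO, SCAD, or MCP, which is the second stage the abstract describes.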
Project description:Background:Building prognostic models of clinical outcomes is an increasingly important research task and will remain a vital area in genomic medicine. Prognostic models of clinical outcomes are usually built and validated utilizing variable selection methods and machine learning tools. The challenges, however, in ultra-high dimensional space are not only to reduce the dimensionality of the data, but also to retain the important variables which predict the outcome. Screening approaches, such as the sure independence screening (SIS), iterative SIS (ISIS) and principled SIS (PSIS) have been developed to overcome the challenge of high dimensionality. We are interested in identifying important single-nucleotide polymorphisms (SNPs) and integrating them into a validated prognostic model of overall survival in patients with metastatic prostate cancer. While the above-mentioned variable selection approaches have theoretical justification in selecting SNPs, the comparison and the performance of these combined methods in predicting time-to-event outcomes have not been previously studied in ultra-high dimensional space with hundreds of thousands of variables. Methods:We conducted a series of simulations to compare the performance of different combinations of variable selection approaches and classification trees, such as the least absolute shrinkage and selection operator (LASSO), adaptive least absolute shrinkage and selection operator (ALASSO) and random survival forest (RSF), in ultra-high dimensional data for the purpose of developing prognostic models for a time-to-event outcome that is subject to censoring. The variable selection methods were evaluated for discrimination (Harrell's concordance statistic), calibration and overall performance. In addition, we applied these approaches to 498,081 SNPs from 623 Caucasian patients with prostate cancer. 
Results:When n=300, ISIS-LASSO and ISIS-ALASSO chose all the informative variables, which resulted in the highest Harrell's c-index (>0.80). On the other hand, with a small sample size (n=150), ALASSO performed better than any other combination, as demonstrated by the highest c-index and/or overall performance, although there was evidence of overfitting. In analyzing the prostate cancer data, the ISIS-ALASSO, SIS-LASSO, and SIS-ALASSO combinations achieved the highest discrimination, with a c-index of 0.67. Conclusions:Choosing the appropriate variable selection method for training a model is a critical step in developing a robust prognostic model. Based on the simulation studies, the effective use of ALASSO or a combination of methods, such as ISIS-LASSO and ISIS-ALASSO, allows for the development of prognostic models with both high predictive accuracy and a low risk of overfitting, assuming moderate sample sizes.
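Harrell's c-index, the discrimination measure reported above, can be sketched as follows for right-censored data. This is a minimal version that assumes a higher risk score means a shorter expected survival and that ignores pairs with tied times; the function name `harrell_c` and the toy data are hypothetical.

```python
def harrell_c(times, events, risk_scores):
    """Harrell's concordance index for right-censored survival data.

    A pair (i, j) is comparable when subject i has the shorter observed time
    and experienced the event (events[i] == 1). Tied times are skipped, which
    is a simplification.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # earlier event had the higher risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example: four subjects, the third is censored at time 6.
c_index = harrell_c(times=[2, 4, 6, 8],
                    events=[1, 1, 0, 1],
                    risk_scores=[0.9, 0.7, 0.8, 0.1])
print(round(c_index, 3))
```

In this toy data there are five comparable pairs, four of which are ordered correctly by the risk score.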
Project description:<h4>Motivation</h4>Association studies to discover links between genetic markers and phenotypes are central to bioinformatics. Methods of regularized regression, such as variants of the Lasso, are popular for this task. Despite the good predictive performance of these methods in the average case, they suffer from unstable selections of correlated variables and inconsistent selections of linearly dependent variables. Unfortunately, as we demonstrate empirically, such problematic situations of correlated and linearly dependent variables often exist in genomic datasets and lead to under-performance of classical methods of variable selection.<h4>Results</h4>To address these challenges, we propose the Precision Lasso. Precision Lasso is a Lasso variant that promotes sparse variable selection by regularization governed by the covariance and inverse covariance matrices of explanatory variables. We illustrate its capacity for stable and consistent variable selection in simulated data with highly correlated and linearly dependent variables. We then demonstrate the effectiveness of the Precision Lasso to select meaningful variables from transcriptomic profiles of breast cancer patients. Our results indicate that in settings with correlated and linearly dependent variables, the Precision Lasso outperforms popular methods of variable selection such as the Lasso, the Elastic Net and Minimax Concave Penalty (MCP) regression.<h4>Availability and implementation</h4>Software is available at https://github.com/HaohanWang/thePrecisionLasso.<h4>Supplementary information</h4>Supplementary data are available at Bioinformatics online.
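For reference, the baseline that the Precision Lasso modifies can be sketched as a plain Lasso solved by cyclic coordinate descent. This is not the Precision Lasso itself: its covariance-based penalty is not specified in the abstract and is not reproduced here. The function names and the toy design matrix are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly zero."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Plain Lasso: minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1.

    X is a list of rows; the solver is cyclic coordinate descent.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X @ beta, starting from beta = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes column j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


# Orthogonal toy design: y = 2 * column0 + 0.1 * column1.
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [2.1, 1.9, -1.9, -2.1]
beta = lasso_cd(X, y, lam=0.5)
print(beta)
```

With an orthogonal design each coefficient is simply the soft-thresholded least-squares estimate, so the weak second variable is dropped. With correlated columns, the choice among them can become unstable, which is the situation the abstract targets.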
Project description:Genome-wide association studies (GWAS) have become an essential tool for exploring the genetic mechanisms of complex traits. To reduce computational complexity, it is common practice to remove unrelated single nucleotide polymorphisms (SNPs) before GWAS, e.g., by using the iterative sure independence screening expectation-maximization Bayesian Lasso (ISIS EM-BLASSO) method. In this work, a modified version of ISIS EM-BLASSO is proposed, which reduces the number of SNPs by a screening methodology based on Pearson correlation and mutual information, then estimates the effects via EM-Bayesian Lasso (EM-BLASSO), and finally detects the true quantitative trait nucleotides (QTNs) through a likelihood ratio test. We call our method a two-stage mutual information based Bayesian Lasso (MBLASSO). Under three simulation scenarios, MBLASSO improves statistical power and retains higher effect estimation accuracy compared with three other algorithms. Moreover, MBLASSO performs best on model fitting, the accuracy of its detected associations is the highest, and 21 genes are detected only by MBLASSO in <i>Arabidopsis thaliana</i> datasets.
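The first screening stage can be illustrated with a marginal Pearson correlation filter that keeps the d SNPs most correlated with the trait. This sketch covers only the correlation part; the mutual information criterion and the EM-BLASSO estimation step are omitted, and the function names, the cutoff d, and the genotype data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def screen_top_d(columns, y, d):
    """Keep the d columns with the largest absolute marginal correlation with y."""
    scores = [abs(pearson(col, y)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:d])


# Toy genotypes coded 0/1/2 for four SNPs and five individuals.
snps = [
    [0, 0, 1, 2, 2],  # increases with the trait
    [1, 1, 1, 1, 1],  # monomorphic, carries no signal
    [2, 0, 1, 0, 2],  # uncorrelated with the trait
    [2, 2, 1, 0, 0],  # decreases with the trait
]
trait = [1.0, 2.0, 3.0, 4.0, 5.0]
selected = screen_top_d(snps, trait, d=2)
print(selected)
```

Using the absolute correlation keeps SNPs with negative effects as well as positive ones.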
Project description:Colon leakage score (CLS) was introduced as a clinical tool to predict anastomotic leakage (AL) in patients who underwent left-sided colorectal surgery, but its clinical validity has not been widely studied. We evaluated the clinical utility of CLS and developed a modified CLS (m-CLS). In total, 566 patients who underwent left-sided colorectal surgery were enrolled and categorized into training (n = 396) and validation (n = 170) sets via random sampling. Using CLS variables, the least absolute shrinkage and selection operator (LASSO) regression model was applied for variable selection and predictive signature building in the training set. The model's performance was validated in the validation set. The predictive powers of m-CLS and CLS were compared by the area under the receiver operating characteristic (AUROC) curve in the overall group. Twenty-three AL events (4.1%) were noted. The AL group had a significantly higher mean CLS than the No Leakage group (12.5 vs. 9.6, p = 0.001). Five clinical variables were selected and used to generate m-CLS. The predictive performance of m-CLS was similar in training and validation sets (AUROC 0.838 vs. 0.803, p = 0.724). In the overall set, m-CLS was significantly predictive of AL and performed better than CLS (AUROC 0.831 vs. 0.701, p = 0.008). In conclusion, LASSO-model-generated m-CLS could predict AL more accurately than CLS.
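The AUROC used to compare m-CLS with CLS equals the probability that a randomly chosen patient with the event scores higher than a randomly chosen patient without it. The sketch below computes it by direct pairwise comparison; the function name `auroc` and the labels and scores are hypothetical, not study data.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for the event (e.g. leakage), 0 otherwise. Ties count as half.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predicted risks for six patients.
labels = [1, 1, 0, 0, 0, 1]
scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.7]
area = auroc(labels, scores)
print(round(area, 3))
```

The pairwise form is quadratic in the sample size, which is acceptable for a cohort of a few hundred patients; a rank-based formula gives the same value more efficiently.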
Project description:<h4>Background</h4>There is increasing interest in examining the consequences of simultaneous exposures to chemical mixtures. However, a consensus or recommendations on how to appropriately select the statistical approach analyzing the health effects of mixture exposures which best aligns with study goals has not been well established. We recognize the limitations that existing methods have in effectively reducing data dimension and detecting interaction effects when analyzing chemical mixture exposures collected in high dimensional datasets with varying degrees of variable intercorrelations. In this research, we aim to examine the performance of a two-step statistical approach in addressing the analytical challenges of chemical mixture exposures using two simulated data sets, and an existing data set from the Navajo Birth Cohort Study as a representative case study.<h4>Methods</h4>We propose to use a two-step approach: a robust variable selection step using the random forest approach followed by adaptive lasso methods that incorporate both dimensionality reduction and quantification of the degree of association between the chemical exposures and the outcome of interest, including interaction terms. We compared the proposed method with other approaches including (1) single step adaptive lasso; and (2) two-step Classification and regression trees (CART) followed by adaptive lasso method.<h4>Results</h4>Utilizing simulated data sets and applying the method to a real-life dataset from the Navajo Birth Cohort Study, we have demonstrated good performance of the proposed two-step approach. 
Results from the simulation datasets indicated the effectiveness of variable dimension reduction and reliable identification of a parsimonious model compared to other methods: single-step adaptive lasso or two-step CART followed by adaptive lasso method.<h4>Conclusions</h4>Our proposed two-step approach provides a robust way of analyzing the effects of high-throughput chemical mixture exposures on health outcomes by combining the strengths of variable selection and adaptive shrinkage strategies.
Project description:BACKGROUND:Cancer is the second leading cause of death in the United States. Cancer screenings can detect precancerous cells and allow for earlier diagnosis and treatment. Our purpose was to better understand risk factors for cancer screenings and to assess the effect of cancer screenings on changes in cardiovascular health (CVH) measures before and after screening. METHODS:We used The Guideline Advantage (TGA)-American Heart Association ambulatory quality clinical data registry of electronic health record data (n = 362,533 patients) to investigate associations between time-series CVH measures and receipt of breast, cervical, and colon cancer screenings. Long short-term memory (LSTM) neural networks were employed to predict receipt of cancer screenings. We also compared the distributions of CVH factors between patients who received cancer screenings and those who did not. Finally, we examined and quantified changes in CVH measures among the screened and non-screened groups. RESULTS:Model performance was evaluated by the area under the receiver operating characteristic curve (AUROC): the average AUROC of 10 curves was 0.63 for breast, 0.70 for cervical, and 0.61 for colon cancer screening. Distribution comparison found that screened patients had a higher prevalence of poor CVH categories. CVH submetrics improved for patients after cancer screenings. CONCLUSION:A deep learning algorithm could be used to investigate the associations between time-series CVH measures and cancer screenings in an ambulatory population. Patients with more adverse CVH profiles tend to be screened for cancers, and cancer screening may also prompt favorable changes in CVH. Cancer screenings may improve patient CVH, thus potentially decreasing the burden of disease and costs for the health system (e.g., cardiovascular diseases and cancers).
Project description:OBJECTIVES:The FUTUREPAIN study develops a short general-purpose questionnaire, based on the biopsychosocial model, to predict the probability of developing or maintaining moderate-to-severe chronic pain 7-10 years into the future. METHODS:This is a retrospective cohort study. Two-thirds of participants in the National Survey of Midlife Development in the United States were randomly assigned to a training cohort used to train a predictive machine learning model based on the least absolute shrinkage and selection operator (LASSO) algorithm, which produces a model with minimal covariates. Out-of-sample predictions from this model were then estimated using the remaining one-third testing cohort to determine the area under the receiver operating characteristic curve (AUROC). An optimal cut-point that maximized the sum of sensitivity and specificity was determined. RESULTS:The LASSO model, using 82 variables in the training cohort, yielded an 18-variable model with an out-of-sample AUROC of 0.85 (95% Confidence Interval (CI): 0.80, 0.91) in the testing cohort. The sum of sensitivity (0.88) and specificity (0.76) was maximized at a cut-point of 17 (95% CI: 15, 18) on a 0-100 scale, where the AUROC was 0.82. DISCUSSION:We developed a short general-purpose questionnaire that predicts the probability of an adult having moderate-to-severe chronic pain in 7-to-10 years. Its discriminative ability (AUROC) exceeds 0.80, and it can be used regardless of whether a patient is currently experiencing chronic pain. Knowing which patients are likely to have moderate-to-severe chronic pain in the future allows clinicians to target preventive treatment.
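Choosing the cut-point that maximizes the sum of sensitivity and specificity can be sketched as a scan over every observed score. The function name `best_cut_point`, the rule "predict positive when score >= cut", and the scores and labels below are assumptions for illustration, not study data.

```python
def best_cut_point(labels, scores):
    """Return (cut, sensitivity, specificity) maximizing sensitivity + specificity.

    Assumption: a subject is predicted positive when score >= cut.
    Ties in the summed criterion keep the lowest cut.
    """
    n_pos = sum(1 for l in labels if l == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative case")
    best = None
    for cut in sorted(set(scores)):
        tp = sum(1 for l, s in zip(labels, scores) if l == 1 and s >= cut)
        tn = sum(1 for l, s in zip(labels, scores) if l == 0 and s < cut)
        sens, spec = tp / n_pos, tn / n_neg
        if best is None or sens + spec > best[1] + best[2]:
            best = (cut, sens, spec)
    return best


# Hypothetical questionnaire scores on a 0-100 scale; 1 = chronic pain at follow-up.
scores = [5, 10, 12, 22, 24, 30, 45, 50, 60]
labels = [0, 0, 0, 1, 1, 0, 1, 0, 1]
cut, sens, spec = best_cut_point(labels, scores)
print(cut, sens, spec)
```

Maximizing the sum weights false negatives and false positives equally; a clinical application might prefer a different trade-off.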
Project description:Establishment of a statistical association between microbiome features and clinical outcomes is of growing interest because of the potential for yielding insights into biological mechanisms and pathogenesis. Extracting microbiome features that are relevant for a disease is challenging, and existing variable selection methods are limited by the large number of risk factor variables from microbiome sequence data and their complex biological structure. We propose a tree-based scanning method, Selection of Models for the Analysis of Risk factor Trees (referred to as SMART-scan), for identifying taxonomic groups that are associated with a disease or trait. SMART-scan is a model selection technique that uses a predefined taxonomy to organize the large pool of possible predictors into optimized groups, and hierarchically searches and determines variable groups for association testing. We investigate the statistical properties of SMART-scan through simulations, in comparison to a regular single-variable analysis and three commonly used variable selection methods: stepwise regression, least absolute shrinkage and selection operator (LASSO), and classification and regression tree (CART). When there are taxonomic group effects in the data, SMART-scan can significantly increase power by using bacterial taxonomic information to split large numbers of variables into groups. Through an application to microbiome data from a vervet monkey diet experiment, we demonstrate that SMART-scan can identify important phenotype-associated taxonomic features missed by single-variable analysis, stepwise regression, LASSO and CART.
Project description:BACKGROUND:To identify a synovial fluid (SF) biomarker profile characteristic of individuals with an inflammatory osteoarthritis (OA) endotype. METHODS:A total of 48 knees (of 25 participants) were characterized for an extensive array of SF biomarkers quantified by Rules Based Medicine using the high-sensitivity multiplex immunoassay, Myriad Human InflammationMAP® 1.0, which included 47 different cytokines, chemokines, and growth factors related to inflammation. Multivariable regression with generalized estimating equations (GEE) and false discovery rate (FDR) correction was used to assess associations of SF RBM biomarkers with etarfolatide imaging scores reflecting synovial inflammation; radiographic knee OA severity (based on Kellgren-Lawrence (KL) grade, joint space narrowing, and osteophyte scores); knee joint symptoms; and SF biomarkers associated with activated macrophages and knee OA progression including CD14 and CD163 (shed by activated macrophages) and elastase (shed by activated neutrophils). RESULTS:Significant associations of SF biomarkers meeting FDR < 0.05 included soluble (s)VCAM-1 and MMP-3 with synovial inflammation (FDR-adjusted p = 0.025 and 1.06 × 10⁻⁷); sVCAM-1, sICAM-1, TIMP-1, and VEGF with radiographic OA severity (p = 1.85 × 10⁻⁵ to 3.97 × 10⁻⁴); and VEGF, MMP-3, TIMP-1, sICAM-1, sVCAM-1, and MCP-1 with OA symptoms (p = 2.72 × 10⁻⁵ to 0.050). All these SF biomarkers were highly correlated with macrophage markers CD163 and CD14 in SF (r = 0.43 to 0.90, FDR < 0.05); all but MCP-1 were also highly correlated with neutrophil elastase in SF (r = 0.62 to 0.89, FDR < 0.05). CONCLUSIONS:A subset of six SF biomarkers was related to synovial inflammation in OA, as well as radiographic and symptom severity. These six OA-related SF biomarkers were specifically linked to indicators of activated macrophages and neutrophils. 
These results attest to an inflammatory OA endotype that may serve as the basis for therapeutic targeting of a subset of individuals at high risk for knee OA progression. TRIAL REGISTRATION:Written informed consent was received from participants prior to inclusion in the study; the study was registered at ClinicalTrials.gov (NCT01237405) on November 9, 2010, prior to enrollment of the first participant.
Project description:Reconstruction of grasp is a high priority for tetraplegic patients. Restoration of finger flexion by surgical activation of flexor digitorum profundus can result in roll-up finger flexion, interphalangeal (IP) joint before metacarpophalangeal (MCP) joint flexion, which can be improved by restoring intrinsic function. This study compares grasp kinematics between 2 intrinsic balancing procedures: Zancolli-lasso and House. The intrinsic muscles of 12 cadaver hands were reconstructed by either the Zancolli-lasso or the House procedure (n = 6 each) and tested by deforming the flexor digitorum profundus (FDP) with a motor to simulate hand closure. Results were compared with 5 control hands. All 17 hands were studied by video analysis. Kinematics were characterized by the order of MCP joint and IP joint flexion. Optimal grasp was defined as the maximal fingertip-to-palm distance during the arc of finger closure. Kinematics differed between the 2 procedures. The Zancolli-lasso reconstructed hands flexed first in the IP joints, and then in MCP joints, resembling an unreconstructed intrinsic-minus hand, whereas the House reconstructed hands flexed first in the MCP joints and then in the IP joints, resembling an intrinsic-activated hand. Maximal fingertip-to-palm distance did not differ significantly between the 2 procedures, and both showed improvement over unreconstructed controls. Both intrinsic balancing techniques improved grasp. Only the House procedure restored hand kinematics approximating those of an intrinsic-activated hand. Improvement in fingertip-to-palm distance in Zancolli-lasso hands resulted primarily from the initial resting MCP joint flexion of 40°. We therefore advocate the more physiological House procedure for restoration of intrinsic function in tetraplegic patients. This study provides a rationale for advocacy of 1 reconstructive procedure over another.