Censoring Unbiased Regression Trees and Ensembles.
ABSTRACT: This paper proposes a novel paradigm for building regression trees and ensemble learners in survival analysis. Generalizations of the CART and Random Forests algorithms that accommodate general loss functions, and, in the case of forests, more general bootstrap procedures, are introduced. These results, combined with an extension of the theory of censoring unbiased transformations to loss functions, underpin the development of two new classes of algorithms for constructing survival trees and survival forests: Censoring Unbiased Regression Trees and Censoring Unbiased Regression Ensembles. For a certain "doubly robust" censoring unbiased transformation of squared error loss, we further show how these new algorithms can be implemented using existing software (e.g., CART, random forests). Comparisons of these methods with existing ensemble procedures for predicting survival probabilities are provided both in simulated settings and through applications to four datasets. The new methods are shown either to improve upon, or remain competitive with, existing implementations of random survival forests, conditional inference forests, and recursively imputed survival trees.
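To make the response-transformation idea concrete, here is a minimal sketch in Python. It uses the simpler inverse-probability-of-censoring-weighted (IPCW) member of the class of censoring unbiased transformations, with a Kaplan-Meier estimate of the censoring distribution; the paper's doubly robust transformation adds an augmentation term built from a working model for the conditional survival function and is not reproduced here. All names (km_censoring_survival, ipcw_transform) and the toy data are illustrative assumptions, not the authors' code.

```python
# Sketch: fit an off-the-shelf regression forest to an IPCW-transformed response.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def km_censoring_survival(time, delta):
    """Kaplan-Meier estimate of the censoring survival function G(t) = P(C > t),
    obtained by treating censoring (delta == 0) as the event (ties ignored)."""
    order = np.argsort(time)
    t, d = time[order], delta[order]
    at_risk = len(t) - np.arange(len(t))
    G = np.cumprod(1.0 - (d == 0) / at_risk)

    def G_at(s):
        idx = np.searchsorted(t, s, side="right") - 1
        return np.where(idx >= 0, G[np.clip(idx, 0, len(G) - 1)], 1.0)

    return G_at

def ipcw_transform(time, delta, y):
    """Y* = delta * y / G(T): uncensored responses are upweighted so that the
    conditional mean of Y* matches the conditional mean under no censoring."""
    G = np.clip(km_censoring_survival(time, delta)(time), 1e-8, None)
    return delta * y / G

# Toy example: predict log survival time.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
T = np.exp(X[:, 0] + rng.normal(scale=0.5, size=500))  # latent survival times
C = rng.exponential(scale=5.0, size=500)               # censoring times
time, delta = np.minimum(T, C), (T <= C).astype(float)
forest = RandomForestRegressor(n_estimators=500, random_state=0)
forest.fit(X, ipcw_transform(time, delta, np.log(time)))
```

Because the transformation only changes the response, any regression tree or forest implementation can be reused unchanged, which is the practical point of the abstract.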
Project description:Classification and Regression Trees (CART) and their successors, bagging and random forests, are statistical learning tools that are receiving increasing attention. However, owing to the characteristics of censored data collection, standard CART algorithms do not transfer directly to the context of survival analysis. Questions about the occurrence and timing of events arise throughout the psychological and behavioral sciences, especially in longitudinal studies. The predictive power and other key features of tree-based methods make them promising in studies where event occurrence is the outcome of interest. This article reviews existing tree algorithms designed specifically for censored responses as well as recently developed survival ensemble methods, and introduces available computer software. Through simulations and a practical example, the merits and limitations of these methods are discussed, and suggestions are provided for practical use.
Project description:Deep learning is a class of machine learning algorithms that are popular for building risk prediction models. When observations are censored, the outcomes are only partially observed and standard deep learning algorithms cannot be applied directly. We develop a new class of deep learning algorithms for outcomes that are potentially censored. To account for censoring, the unobservable loss function that would be used in the absence of censoring is replaced by a censoring unbiased transformation. The resulting class of algorithms can be used to estimate both survival probabilities and restricted mean survival. We show how the deep learning algorithms can be implemented by adapting software for uncensored data through a form of response transformation. We provide comparisons of the proposed deep learning algorithms to existing risk prediction algorithms for predicting survival probabilities and restricted mean survival, through both simulated datasets and an analysis of data from breast cancer patients.
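As with the tree-based variant above, the key step is that only the response is transformed, so a standard regressor for uncensored data can be trained as-is. The sketch below is a hedged illustration under that reading: scikit-learn's MLPRegressor stands in for the paper's deep learning architectures, and a Kaplan-Meier-based IPCW transform stands in for the more general censoring unbiased transformations. Names and data are illustrative, not from the paper.

```python
# Sketch: train a plain neural-network regressor on an IPCW-transformed response.
import numpy as np
from sklearn.neural_network import MLPRegressor

def ipcw_response(time, delta, y):
    """Y* = delta * y / G(T), with G the Kaplan-Meier censoring survival
    estimate evaluated at each subject's own observed time (ties ignored)."""
    order = np.argsort(time)
    at_risk = len(time) - np.arange(len(time))
    G_sorted = np.cumprod(1.0 - (delta[order] == 0) / at_risk)
    G = np.empty_like(G_sorted)
    G[order] = G_sorted  # scatter back to the original ordering
    return delta * y / np.clip(G, 1e-8, None)

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 10))
T = np.exp(0.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=1000))
C = rng.exponential(scale=4.0, size=1000)
time, delta = np.minimum(T, C), (T <= C).astype(float)

# Fit exactly as one would on uncensored data; only the response differs.
net = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000, random_state=0)
net.fit(X, ipcw_response(time, delta, np.log(time)))
```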
Project description:Propensity score weighting is sensitive to model misspecification and to outlying weights that can unduly influence results. The authors investigated whether trimming large weights downward can improve the performance of propensity score weighting and whether the benefits of trimming differ by propensity score estimation method. In a simulation study, the authors examined the performance of weight trimming following logistic regression, classification and regression trees (CART), boosted CART, and random forests used to estimate propensity score weights. Results indicate that although misspecified logistic regression propensity score models yield increased bias and standard errors, weight trimming following logistic regression can improve the accuracy and precision of the final parameter estimates. In contrast, weight trimming did not improve the performance of boosted CART and random forests. The performance of boosted CART and random forests without weight trimming was similar to the best performance obtainable with weight-trimmed, logistic-regression-estimated propensity scores. While trimming may be used to optimize propensity score weights estimated using logistic regression, the optimal level of trimming is difficult to determine. These results indicate that although trimming can improve inferences in some settings, to consistently improve the performance of propensity score weighting, analysts should focus on the procedures leading to the generation of weights (i.e., proper specification of the propensity score model) rather than relying on ad hoc methods such as weight trimming.
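For readers unfamiliar with the mechanics, the following sketch shows the basic trimming operation studied here: estimate propensity scores, form inverse probability of treatment weights, and cap weights above a chosen percentile. The 99th-percentile cutoff and all variable names are illustrative choices, not the simulation settings from the study (which, as the abstract notes, found the optimal trimming level hard to pin down).

```python
# Sketch: inverse probability of treatment weights with percentile-based trimming.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 4))
treated = rng.binomial(1, 1.0 / (1.0 + np.exp(-X[:, 0])))  # treatment assignment

# Propensity scores from (possibly misspecified) logistic regression.
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]
weights = np.where(treated == 1, 1.0 / ps, 1.0 / (1.0 - ps))  # ATE-style weights

# Trim: cap weights at an (arbitrary, illustrative) 99th-percentile cutoff.
trimmed = np.minimum(weights, np.percentile(weights, 99))
```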
Project description:Hepatitis C virus (HCV) remains a significant public health challenge, with approximately half of the infected population untreated and undiagnosed. In this retrospective study, predictive models were developed to identify undiagnosed HCV patients using longitudinal medical claims linked to prescription data from approximately ten million patients in the United States (US) between 2010 and 2016. Features capturing information on demographics, risk factors, symptoms, treatments and procedures relevant to HCV were extracted from patients' medical history. Predictive algorithms were developed based on logistic regression, random forests, gradient boosted trees and a stacked ensemble. Descriptive analysis indicated that patients exhibited known symptoms of HCV on average 2-3 years prior to their diagnosis. The precision was at least 95% for all algorithms at low levels of recall (10%). For recall levels >50%, the stacked ensemble performed best, with a precision of 97% compared with 87% for the gradient boosted trees and just 31% for the logistic regression. For context, the Centers for Disease Control and Prevention (CDC) recommends screening in an at-risk sub-population with an estimated HCV prevalence of 2.23%. The artificial intelligence (AI) algorithm presented here has a precision substantially higher than the hit rate implied by guideline-based screening of that at-risk sub-population, suggesting that AI algorithms have the potential to provide a step change in the effectiveness of HCV screening.
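The abstract names the model family but not its configuration, so the following is only a generic sketch of a stacked ensemble over the three base learners mentioned, fitted to a synthetic, heavily imbalanced stand-in for the claims data; none of the features, sample sizes, or hyperparameters are from the study.

```python
# Sketch: stacked ensemble of logistic regression, random forest, and gradient
# boosted trees, with a logistic regression meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: ~2% positive class, loosely mirroring a rare diagnosis.
X, y = make_classification(n_samples=5000, n_features=30, weights=[0.98],
                           random_state=0)
stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("gbt", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X, y)
```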
Project description:BACKGROUND:Network inference is crucial for biomedicine and systems biology. Biological entities and their associations are often modeled as interaction networks. Examples include drug-protein interaction or gene regulatory networks. Studying and elucidating such networks can lead to the comprehension of complex biological processes. However, we usually have only partial knowledge of those networks, and the experimental identification of all the existing associations between biological entities is very time-consuming and particularly expensive. Many computational approaches have been proposed over the years for network inference; nonetheless, efficiency and accuracy remain open problems. Here, we propose bi-clustering tree ensembles as a new machine learning method for network inference, extending traditional tree-ensemble models to the global network setting. The proposed approach addresses the network inference problem as a multi-label classification task. More specifically, the nodes of a network (e.g., drugs or proteins in a drug-protein interaction network) are modeled as samples described by features (e.g., chemical structure similarities or protein sequence similarities). The labels in our setting represent the presence or absence of links connecting the nodes of the interaction network (e.g., drug-protein interactions in a drug-protein interaction network). RESULTS:We extended traditional tree-ensemble methods, such as extremely randomized trees (ERT) and random forests (RF), to ensembles of bi-clustering trees, integrating background information from both node sets of a heterogeneous network into the same learning framework. We performed an empirical evaluation, comparing the proposed approach to currently used tree-ensemble-based approaches as well as other approaches from the literature. We demonstrated the effectiveness of our approach in different interaction prediction (network inference) settings. For evaluation purposes, we used several benchmark datasets that represent drug-protein and gene regulatory networks. We also applied our proposed method to two versions of a chemical-protein association network extracted from the STITCH database, demonstrating the potential of our model in predicting non-reported interactions. CONCLUSIONS:Bi-clustering trees outperform existing tree-based strategies as well as machine learning methods based on other algorithms. Since our approach is based on tree ensembles, it inherits the advantages of tree-ensemble learning, such as handling of missing values, scalability and interpretability.
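True bi-clustering trees split on features of both node sets inside a single tree, which standard libraries do not implement. As a rough, hedged approximation of the multi-label framing described above, the sketch below treats each drug as a sample and its row of the interaction matrix as a binary label vector, using a plain extremely randomized trees ensemble; all data and names are synthetic placeholders.

```python
# Sketch: network inference framed as multi-label classification with an
# extremely randomized trees ensemble (a stand-in for bi-clustering trees).
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(3)
n_drugs, n_proteins = 200, 50
X = rng.normal(size=(n_drugs, 64))  # e.g., chemical-structure similarity features
Y = rng.binomial(1, 0.05, size=(n_drugs, n_proteins))  # drug-protein link matrix

model = ExtraTreesClassifier(n_estimators=300, random_state=0).fit(X, Y)
# predict_proba returns one probability array per protein (label); these link
# scores can be ranked to prioritize candidate non-reported interactions.
link_scores = model.predict_proba(X)
```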
Project description:BACKGROUND:Machine learning (ML) is a powerful tool for identifying and structuring several informative variables for predictive tasks. Here, we investigated how ML algorithms may assist in echocardiographic pulmonary hypertension (PH) prediction, where current guidelines recommend integrating several echocardiographic parameters. METHODS:In our database of 90 patients with invasively determined pulmonary artery pressure (PAP) and corresponding echocardiographic estimations of PAP obtained within 24 hours, we trained and applied five ML algorithms (random forest of classification trees, random forest of regression trees, lasso penalized logistic regression, boosted classification trees, support vector machines) using a ten-times-repeated 3-fold cross-validation (CV) scheme. RESULTS:The ML algorithms achieved high prediction accuracies: support vector machines (AUC 0.83; 95% CI 0.73-0.93), boosted classification trees (AUC 0.80; 95% CI 0.68-0.92), lasso penalized logistic regression (AUC 0.78; 95% CI 0.67-0.89), random forest of classification trees (AUC 0.85; 95% CI 0.75-0.95), and random forest of regression trees (AUC 0.87; 95% CI 0.78-0.96). In contrast to the best of several conventional formulae (by Aduen et al.), the best-performing ML algorithm is based on several echocardiographic signs and feature selection, with estimated right atrial pressure (RAP) being of minor importance. CONCLUSIONS:Using ML, we were able to predict pulmonary hypertension from a broader set of echocardiographic data with little reliance on estimated RAP, with performance non-inferior to an existing formula. With the conceptual advantages of a broader and unbiased selection and weighting of data, our ML approach is suited for high-level assistance in PH prediction.
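The evaluation scheme is the most transferable detail here. A minimal sketch of a ten-times-repeated 3-fold CV with AUC scoring, applied to one of the five model types on synthetic data standing in for the 90-patient dataset, might look as follows (everything below is illustrative, not the study's code):

```python
# Sketch: 10-times-repeated 3-fold cross-validation with AUC scoring.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=90, n_features=12, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=10, random_state=0)
aucs = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                       cv=cv, scoring="roc_auc")
print(f"AUC: {aucs.mean():.2f} +/- {aucs.std():.2f}")
```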
Project description:Multi-study learning uses multiple training studies, separately trains classifiers on each, and forms an ensemble with weights rewarding members with better cross-study prediction ability. This article considers novel weighting approaches for constructing tree-based ensemble learners in this setting. Using Random Forests as the single-study learner, we compare weighting each forest to form the ensemble against extracting the individual trees trained by each Random Forest and weighting them directly. We find that incorporating multiple layers of ensembling in the training process by weighting trees increases the robustness of the resulting predictor. Furthermore, we explore how ensembling weights correspond to tree structure, to shed light on the features that determine whether weighting trees directly is advantageous. Finally, we apply our approach to genomic datasets and show that weighting trees improves upon the basic multi-study learning paradigm. Code and supplementary material are available at https://github.com/m-ramchandran/tree-weighting.
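The linked repository contains the authors' implementation; the toy sketch below only illustrates the forest-level variant of the scheme, with weights proportional to each forest's cross-study R^2 (floored at zero). The paper's stacking-style weights, and the tree-level variant that would weight forest.estimators_ individually, are more refined.

```python
# Sketch: one random forest per study, weighted by cross-study performance.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)

def make_study(i):
    X = rng.normal(size=(300, 8))
    y = X[:, 0] + rng.normal(scale=0.5 + 0.3 * i, size=300)  # study-specific noise
    return X, y

studies = [make_study(i) for i in range(3)]
forests = [RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
           for X, y in studies]

# Weight each forest by its mean R^2 on the other studies, floored at zero.
weights = np.array([
    max(0.0, np.mean([f.score(Xo, yo)
                      for j, (Xo, yo) in enumerate(studies) if j != i]))
    for i, f in enumerate(forests)
])
weights /= weights.sum()

def ensemble_predict(X_new):
    return sum(w * f.predict(X_new) for w, f in zip(weights, forests))
```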
Project description:BACKGROUND: HIV-1 genotypic susceptibility scores (GSSs) were proven to be significant prognostic factors of fixed time-point virologic outcomes after combination antiretroviral therapy (cART) switch/initiation. However, their relative hazard for the time to virologic failure has not been thoroughly investigated, and an expert system that is able to predict how long a new cART regimen will remain effective has never been designed. METHODS: We analyzed patients of the Italian ARCA cohort starting a new cART from 1999 onwards, either after virologic failure or as treatment-naïve. The time to virologic failure was the endpoint, from the 90th day after treatment start, defined as the first HIV-1 RNA > 400 copies/ml, censoring at the last available HIV-1 RNA before treatment discontinuation. We assessed the relative hazard/importance of GSSs according to distinct interpretation systems (Rega, ANRS and HIVdb) and other covariates by means of Cox regression and random survival forests (RSF). Prediction models were validated via the bootstrap and the c-index measure. RESULTS: The dataset included 2337 regimens from 2182 patients, of which 733 were previously treatment-naïve. We observed 1067 virologic failures over 2820 person-years. Multivariable analysis revealed that low GSSs of cART were independently associated with the hazard of virologic failure, along with several other covariates. Evaluation of predictive performance yielded a modest ability of the Cox regression to predict the virologic endpoint (c-index = 0.70), while RSF showed a better performance (c-index = 0.73, p < 0.0001 vs. Cox regression). Variable importance according to RSF was concordant with the Cox hazards. CONCLUSIONS: GSSs of cART and several other covariates were investigated using linear and non-linear survival analysis. RSF models are a promising approach for the development of a reliable system that predicts time to virologic failure better than Cox regression. Such models might represent a significant improvement over the current methods for monitoring and optimization of cART.
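A hedged sketch of the model comparison described here, using the Python package scikit-survival as a stand-in (the study's own tooling is not specified in this description): Cox regression and a random survival forest scored by Harrell's c-index, on synthetic covariates in place of genotypic susceptibility scores and clinical covariates.

```python
# Sketch: Cox regression vs. random survival forest, compared by c-index.
import numpy as np
from sksurv.ensemble import RandomSurvivalForest
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.util import Surv

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 6))  # stand-ins for GSSs and other covariates
T = np.exp(0.8 * X[:, 0] + rng.normal(scale=0.5, size=500))  # failure times
C = rng.exponential(scale=3.0, size=500)                     # censoring times
y = Surv.from_arrays(event=T <= C, time=np.minimum(T, C))

cox = CoxPHSurvivalAnalysis().fit(X, y)
rsf = RandomSurvivalForest(n_estimators=200, random_state=0).fit(X, y)
# .score returns Harrell's concordance index for both estimators.
print("Cox c-index:", cox.score(X, y))
print("RSF c-index:", rsf.score(X, y))
```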
Project description:BACKGROUND: Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means of variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories. RESULTS: Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are, on the one hand, biased variable selection in the individual classification trees used to build the random forest and, on the other, effects induced by bootstrap sampling with replacement. CONCLUSION: We propose to employ an alternative implementation of random forests that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories. The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analyzing data from a study on RNA editing. The suggested method can therefore be applied straightforwardly by scientists in bioinformatics research.
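The unbiased procedure recommended here is implemented in R (cforest in the party package). As a loosely analogous Python illustration only, the sketch below contrasts impurity-based importances, which tend to inflate high-cardinality predictors, with permutation importance computed on held-out data; scikit-learn does not offer subsampling without replacement, so that part of the recommendation is not reproduced.

```python
# Sketch: impurity-based vs. permutation importance on a two-feature toy problem
# where only the low-cardinality feature is informative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
n = 1000
X = np.column_stack([
    rng.integers(0, 2, n),    # binary, informative
    rng.integers(0, 100, n),  # high-cardinality, pure noise
]).astype(float)
y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
# The noise feature typically receives nonzero impurity-based credit, while its
# permutation importance on held-out data stays near zero.
print("impurity importances:   ", rf.feature_importances_)
print("permutation importances:",
      permutation_importance(rf, X_te, y_te, random_state=0).importances_mean)
```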
Project description:One of the pressing open problems of computational systems biology is the elucidation of the topology of genetic regulatory networks (GRNs) using high-throughput genomic data, in particular microarray gene expression data. The Dialogue for Reverse Engineering Assessments and Methods (DREAM) challenge aims to evaluate the success of GRN inference algorithms on benchmarks of simulated data. In this article, we present GENIE3, a new algorithm for the inference of GRNs that was the best performer in the DREAM4 In Silico Multifactorial challenge. GENIE3 decomposes the prediction of a regulatory network between p genes into p different regression problems. In each regression problem, the expression pattern of one gene (the target gene) is predicted from the expression patterns of all the other genes (the input genes), using the tree-based ensemble methods Random Forests or Extra-Trees. The importance of an input gene in the prediction of the target gene's expression pattern is taken as an indication of a putative regulatory link. Putative regulatory links are then aggregated over all genes to provide a ranking of interactions from which the whole network is reconstructed. In addition to performing well on the DREAM4 In Silico Multifactorial challenge simulated data, we show that GENIE3 compares favorably with existing algorithms in deciphering the genetic regulatory network of Escherichia coli. It makes no assumptions about the nature of gene regulation, can deal with combinatorial and non-linear interactions, produces directed GRNs, and is fast and scalable. In conclusion, we propose a new algorithm for GRN inference that performs well on both synthetic and real gene expression data. The algorithm, based on feature selection with tree-based ensemble methods, is simple and generic, making it adaptable to other types of genomic data and interactions.
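Because the procedure is spelled out above, it is easy to sketch. The following compact version follows the description directly (one regression problem per target gene, feature importances as link scores); official GENIE3 implementations exist, and this minimal scikit-learn rendering is only illustrative.

```python
# Sketch of the GENIE3 scheme: per-target-gene regression with feature
# importances used as putative regulatory link scores.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def genie3_scores(expr, n_estimators=100, random_state=0):
    """expr: (n_samples, n_genes) expression matrix. Returns W with
    W[i, j] = importance of gene i for predicting gene j (edge i -> j)."""
    n_genes = expr.shape[1]
    W = np.zeros((n_genes, n_genes))
    for j in range(n_genes):
        inputs = np.delete(np.arange(n_genes), j)  # all genes except the target
        rf = RandomForestRegressor(n_estimators=n_estimators,
                                   random_state=random_state)
        rf.fit(expr[:, inputs], expr[:, j])
        W[inputs, j] = rf.feature_importances_
    return W

# Aggregate: rank all candidate directed edges by score to reconstruct the GRN.
expr = np.random.default_rng(7).normal(size=(80, 10))
W = genie3_scores(expr)
edges = sorted(((W[i, j], i, j) for i in range(10) for j in range(10) if i != j),
               reverse=True)
```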