Project description:Background: COPD is a leading cause of mortality. Research question: We hypothesized that applying machine learning to clinical and quantitative CT imaging features would improve mortality prediction in COPD. Study design and methods: We selected 30 clinical, spirometric, and imaging features as inputs for a random survival forest. We used top features in a Cox regression to create a machine learning mortality prediction (MLMP) in COPD model and also assessed the performance of other statistical and machine learning models. We trained the models in subjects with moderate to severe COPD from a subset of subjects in Genetic Epidemiology of COPD (COPDGene) and tested prediction performance in the remainder of individuals with moderate to severe COPD in COPDGene and Evaluation of COPD Longitudinally to Identify Predictive Surrogate Endpoints (ECLIPSE). We compared our model with the BMI, airflow obstruction, dyspnea, exercise capacity (BODE) index; BODE modifications; and the age, dyspnea, and airflow obstruction index. Results: We included 2,632 participants from COPDGene and 1,268 participants from ECLIPSE. The top predictors of mortality were 6-min walk distance, FEV1 % predicted, and age. The top imaging predictor was pulmonary artery-to-aorta ratio. The MLMP-COPD model resulted in a C index ≥ 0.7 in both COPDGene and ECLIPSE (6.4- and 7.2-year median follow-ups, respectively), significantly better than all tested mortality indexes (P < .05). The MLMP-COPD model had fewer predictors but similar performance to that of other models. The group with the highest BODE scores (7-10) had 64% mortality, whereas the highest mortality group defined by the MLMP-COPD model had 77% mortality (P = .012). Interpretation: An MLMP-COPD model outperformed four existing models for predicting all-cause mortality across two COPD cohorts. Performance of machine learning was similar to that of traditional statistical methods.
The model is available online at: https://cdnm.shinyapps.io/cgmortalityapp/.
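The C index reported for the MLMP-COPD model measures how often a model ranks the subject who dies earlier as higher risk. The following is a minimal sketch of Harrell's concordance index in plain Python; the function name, the toy data, and the choice to skip tied survival times are assumptions made for illustration and are not taken from the study.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C index for right-censored data (illustrative sketch).

    times       -- observed follow-up time per subject
    events      -- 1 if death was observed, 0 if the subject was censored
    risk_scores -- higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # 'a' is the subject with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Assumption: pairs with tied times are skipped; a pair is only
        # comparable when the earlier subject actually had the event.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (made up): four subjects, the third one is censored.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.7, 0.4, 0.2]
print(concordance_index(toy_times, toy_events, toy_risk))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported C index ≥ 0.7 should be read.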
Project description:Background: Unlike linear models, which are traditionally used to study all-cause mortality, complex machine learning models can capture non-linear interrelations and provide opportunities to identify unexplored risk factors. Explainable artificial intelligence (XAI) can improve prediction accuracy over linear models and reveal great insights into outcomes like mortality. This paper comprehensively analyzes all-cause mortality by explaining complex machine learning models. Methods: We propose the IMPACT framework, which uses XAI techniques to explain a state-of-the-art tree ensemble mortality prediction model. We apply IMPACT to understand all-cause mortality for 1-, 3-, 5-, and 10-year follow-up times within the NHANES dataset, which contains 47,261 samples and 151 features. Results: We show that IMPACT models achieve higher accuracy than linear models and neural networks. Using IMPACT, we identify several overlooked risk factors and interaction effects. Furthermore, we identify relationships between laboratory features and mortality that may suggest adjusting established reference intervals. Finally, we develop highly accurate, efficient and interpretable mortality risk scores that can be used by medical professionals and individuals without medical expertise. We ensure generalizability by performing temporal validation of the mortality risk scores and external validation of important findings with the UK Biobank dataset. Conclusions: IMPACT's unique strength is the explainable prediction, which provides insights into the complex, non-linear relationships between mortality and features, while maintaining high accuracy. Our explainable risk scores could help individuals improve self-awareness of their health status and help clinicians identify patients with high risk. IMPACT takes a consequential step towards bringing contemporary developments in XAI to epidemiology.
Project description:Aims: Prediction of adverse events in mid-term follow-up after transcatheter aortic valve implantation (TAVI) is challenging. We sought to develop and validate a machine learning model for prediction of 1-year all-cause mortality in patients who underwent TAVI and were discharged following the index procedure. Methods and results: The model was developed on data of patients who underwent TAVI at a high-volume centre between January 2013 and March 2019. Machine learning by extreme gradient boosting was trained and tested with repeated 10-fold hold-out testing using 34 pre- and 25 peri-procedural clinical variables. External validation was performed on unseen data from two other independent high-volume TAVI centres. Six hundred four patients (43% men, 81 ± 5 years old, EuroSCORE II 4.8 [3.0-6.3]%) in the derivation and 823 patients (46% men, 82 ± 5 years old, EuroSCORE II 4.7 [2.9-6.0]%) in the validation cohort underwent TAVI and were discharged home following the index procedure. Over the 12 months of follow-up, 68 (11%) and 95 (12%) subjects died in the derivation and validation cohorts, respectively. In external validation, the machine learning model had an area under the receiver-operator curve of 0.82 (0.78-0.87) for prediction of 1-year all-cause mortality following hospital discharge after TAVI, which was superior to pre- and peri-procedural clinical variables including age 0.52 (0.46-0.59) and the EuroSCORE II 0.57 (0.51-0.64), P < 0.001 for a difference. Conclusion: Machine learning based on readily available clinical data allows accurate prediction of 1-year all-cause mortality following a successful TAVI.
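The area under the receiver-operator curve used in the TAVI abstract can be read as the probability that a randomly chosen patient who died receives a higher predicted risk than a randomly chosen survivor. The sketch below computes it by direct pairwise comparison; the function name and the toy labels and scores are illustrative assumptions, and the study's own confidence intervals are not reproduced here.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positives and negatives.

    labels -- 1 for the event (e.g. death within 1 year), 0 otherwise
    scores -- predicted risk, higher means more likely to have the event
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy data (made up): two deaths, two survivors.
print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]))
```

On this scale, the reported values of 0.52 for age and 0.57 for EuroSCORE II sit close to the 0.5 expected from random ranking.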
Project description:Gene expression profiles were generated from 199 primary breast cancer patients. Samples 1-176 were used in another study, GEO Series GSE22820, and form the training data set in this study. Sample numbers 200-222 form a validation set. This data is used to model a machine learning classifier for Estrogen Receptor Status. RNA was isolated from 199 primary breast cancer patients. A machine learning classifier was built to predict ER status using only three gene features.
Project description:Wrong dose, a common prescription error, can cause serious patient harm, especially in the case of high-risk drugs like oral corticosteroids. This study aims to build a machine learning model to predict dose-related prescription modifications for oral prednisolone tablets (i.e., highly imbalanced data with very few positive cases). Prescription data were obtained from the electronic medical records at a single institute. Cluster analysis classified the clinical departments into six clusters with similar patterns of prednisolone prescription. Two patterns of training datasets were created, with and without preprocessing by the SMOTE method. Five ML models (SVM, KNN, GB, RF, and BRF) and logistic regression (LR) models were constructed in Python. The models were internally validated by five-fold stratified cross-validation and were validated with a 30% holdout test dataset. A total of 82,553 prescription records for prednisolone tablets, including 135 dose-corrected positive cases, were obtained. In the original dataset (without SMOTE), only the BRF model showed good performance (in test dataset, ROC-AUC: 0.917, recall: 0.951). In the training dataset preprocessed by SMOTE, performance was improved on all models. The highest performance models with SMOTE were SVM (in test dataset, ROC-AUC: 0.820, recall: 0.659) and BRF (ROC-AUC: 0.814, recall: 0.634). Although the prescribing data for dose-related correction are highly imbalanced, the following techniques allowed us to build high-performance prediction models: data preprocessing by SMOTE, stratified cross-validation, and a BRF classifier suited to imbalanced data. ML is useful in complicated dose audits such as oral prednisolone. Supplementary information: The online version contains supplementary material available at 10.1007/s41666-023-00128-3.
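SMOTE, as used in the prednisolone study, balances a dataset by creating synthetic minority-class samples that lie between an existing minority sample and one of its nearest minority neighbours. The sketch below shows only that interpolation idea in plain Python; it is a simplified illustration, not the implementation the authors used, and the function name, parameters, and toy points are assumptions.

```python
import random


def smote_sketch(minority, n_synthetic, k=3, seed=0):
    """Create synthetic minority samples by neighbour interpolation.

    minority    -- list of feature tuples from the minority (positive) class
    n_synthetic -- number of synthetic samples to generate
    k           -- number of nearest minority neighbours to choose from
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Toy minority class (made up): four positive cases with two features each.
toy_minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (2.5, 2.5)]
for point in smote_sketch(toy_minority, n_synthetic=3):
    print(point)
```

Oversampling of this kind is normally applied to the training data only, so that the holdout test set keeps the original class imbalance, consistent with the abstract's description of SMOTE as preprocessing for the training dataset.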
Project description:Introduction: A critical value (or panic value) is a laboratory test result that significantly deviates from the normal value and represents a potentially life-threatening condition requiring immediate action. Although notification of critical values by critical value list (CVL) is a well-established method, their contribution to mortality prediction is unclear. Methods: A total of 335,430 clinical laboratory results from 92,673 patients from July 2018 to December 2019 were used. Data in the first 12 months were divided into two datasets at a ratio of 70:30, and a 7-day mortality prediction model by machine learning (eXtreme Gradient Boosting [XGB] decision tree) was created using stratified random undersampling data of the 70% dataset. Mortality predictions by the CVL and XGB model were validated using the remaining 30% of the data, as well as a different 6-month dataset from July to December 2019. Results: On the remaining 30% of the data, the XGB model and CVL produced 61,535 and 61,024 true results (correct predictions) and 5,492 and 6,003 false results (incorrect predictions), respectively. On the different 6-month dataset, the true results were 105,956 and 102,061 tests, and the false results were 6,052 and 9,947, respectively. The XGB model was significantly better than CVL (p < 0.001) in both datasets. The receiver operating characteristic-area under the curve values for the 30% and validation data by XGB were 0.9807 and 0.9646, respectively, which were significantly higher than those by CVL (0.7549 and 0.7172, respectively). Conclusions: Mortality prediction within 7 days by machine learning using numeric laboratory results was significantly better than that by conventional CVL. The results indicate that machine learning enables timely notification to healthcare providers and may be safer than prediction by conventional CVL.
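The true and false result counts reported for the XGB model and the CVL can be turned into overall accuracy as true / (true + false). The short sketch below does this arithmetic with the counts from the abstract; the dictionary layout and the dataset labels are assumptions made for illustration.

```python
def accuracy(true_count, false_count):
    """Share of predictions that were correct."""
    return true_count / (true_count + false_count)


# (true results, false results) as reported in the abstract.
reported_counts = {
    ("30% holdout", "XGB"): (61_535, 5_492),
    ("30% holdout", "CVL"): (61_024, 6_003),
    ("6-month validation", "XGB"): (105_956, 6_052),
    ("6-month validation", "CVL"): (102_061, 9_947),
}

for (dataset, method), (n_true, n_false) in reported_counts.items():
    print(f"{dataset:20s} {method}: {accuracy(n_true, n_false):.1%}")
```

This works out to roughly 91.8% versus 91.0% on the 30% holdout and 94.6% versus 91.1% on the 6-month dataset. The gap in accuracy is much smaller than the gap in ROC-AUC, which is what one would expect when deaths within 7 days are rare and most predictions are correct negatives.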
Project description:Protein-protein interactions (PPI) control most of the biological processes in a living cell. In order to fully understand protein functions, a knowledge of protein-protein interactions is necessary. Prediction of PPI is challenging, especially when the three-dimensional structure of interacting partners is not known. Recently, a novel prediction method was proposed by exploiting physical interactions of constituent domains. We propose here a novel knowledge-based prediction method, namely PPI_SVM, which predicts interactions between two protein sequences by exploiting their domain information. We trained a two-class support vector machine on the benchmarking set of pairs of interacting proteins extracted from the Database of Interacting Proteins (DIP). The method considers all possible combinations of constituent domains between two protein sequences, unlike most of the existing approaches. Moreover, it deals with both single-domain proteins and multi-domain proteins; therefore it can be applied to the whole proteome in high-throughput studies. Our machine learning classifier, following a brainstorming approach, achieves an accuracy of 86%, with a specificity of 95% and a sensitivity of 75%, which are better results than most previous methods that sacrifice recall values in order to boost the overall precision. Our method has on average better sensitivity combined with good selectivity on the benchmarking dataset. The PPI_SVM source code, train/test datasets and supplementary files are available freely in the public domain at: http://code.google.com/p/cmater-bioinfo/.
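The idea of considering all possible combinations of constituent domains between two proteins can be sketched as a Cartesian product of the two domain lists. The abstract does not say how PPI_SVM encodes these pairs for the support vector machine, so the function below, its treatment of pairs as unordered, and the domain identifiers are illustrative assumptions only.

```python
from itertools import product


def domain_pair_features(domains_a, domains_b):
    """List every domain pair formed by one domain from each protein.

    Assumption: a pair is unordered, so (X, Y) and (Y, X) are the same
    feature; duplicates are removed and the result is sorted.
    """
    pairs = {tuple(sorted(pair)) for pair in product(domains_a, domains_b)}
    return sorted(pairs)


# Hypothetical example: a single-domain and a two-domain protein.
protein_a = ["PF00069"]
protein_b = ["PF00017", "PF00018"]
for pair in domain_pair_features(protein_a, protein_b):
    print(pair)
```

Because a single-domain protein simply contributes a one-element list, the same function covers both the single-domain and multi-domain cases mentioned in the abstract.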
Project description:Stratifying prognosis following coronary bifurcation percutaneous coronary intervention (PCI) is an unmet clinical need that may be fulfilled through the adoption of machine learning (ML) algorithms to refine outcome predictions. We sought to develop an ML-based risk stratification model built on clinical, anatomical, and procedural features to predict all-cause mortality following contemporary bifurcation PCI. Multiple ML models to predict all-cause mortality were tested on a cohort of 2393 patients (training, n = 1795; internal validation, n = 598) undergoing bifurcation PCI with contemporary stents from the real-world RAIN registry. Twenty-five commonly available patient-/lesion-related features were selected to train ML models. The best model was validated in an external cohort of 1701 patients undergoing bifurcation PCI from the DUTCH PEERS and BIO-RESORT trial cohorts. At ROC curves, the AUC for the prediction of 2-year mortality was 0.79 (0.74-0.83) in the overall population, 0.74 (0.62-0.85) at internal validation and 0.71 (0.62-0.79) at external validation. Performance at risk ranking analysis, k-center cross-validation, and continual learning confirmed the generalizability of the models, also available as an online interface. The RAIN-ML prediction model represents the first tool combining clinical, anatomical, and procedural features to predict all-cause mortality among patients undergoing contemporary bifurcation PCI with reliable performance.
Project description:Different machine learning (ML) models are proposed in the present work to predict density functional theory-quality barrier heights (BHs) from semiempirical quantum mechanical (SQM) calculations. The ML models include a multitask deep neural network, gradient-boosted trees by means of the XGBoost interface, and Gaussian process regression. The obtained mean absolute errors are similar to those of previous models considering the same number of data points. The ML corrections proposed in this paper could be useful for rapid screening of the large reaction networks that appear in combustion chemistry or in astrochemistry. Finally, our results show that 70% of the features with the highest impact on model output are bespoke predictors. This custom-made set of predictors could be employed by future Δ-ML models to improve the quantitative prediction of other reaction properties.
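In a Δ-ML setup of the kind mentioned above, the model is trained on the difference between the high-level and the low-level result, and the prediction is the low-level value plus the learned correction. The sketch below illustrates that structure with a one-descriptor least-squares line standing in for the neural network, XGBoost, and Gaussian process models of the paper; the stand-in model, the function names, and the synthetic numbers are all assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("descriptor values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_delta_model(descriptors, bh_sqm, bh_dft):
    """Learn the correction delta = BH(DFT) - BH(SQM) from one descriptor."""
    deltas = [dft - sqm for dft, sqm in zip(bh_dft, bh_sqm)]
    return fit_line(descriptors, deltas)


def predict_barrier_height(model, descriptor, bh_sqm):
    """DFT-quality estimate = cheap SQM value + learned correction."""
    slope, intercept = model
    return bh_sqm + slope * descriptor + intercept


# Synthetic training data (made up): here delta = 2 * descriptor + 1 exactly.
train_descriptors = [0.0, 1.0, 2.0, 3.0]
train_sqm = [10.0, 12.0, 9.0, 15.0]
train_dft = [11.0, 15.0, 14.0, 22.0]

delta_model = fit_delta_model(train_descriptors, train_sqm, train_dft)
print(predict_barrier_height(delta_model, descriptor=4.0, bh_sqm=20.0))
```

Learning the correction rather than the barrier height itself lets the inexpensive SQM calculation carry most of the physics, which is what makes the approach attractive for screening large reaction networks.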
Project description:Background: Pediatric myocarditis is a rare disease. The etiologies are multiple. Mortality associated with the disease is 5-8%. Prognostic factors were identified with the use of national hospitalization databases. Applying these identified risk factors for mortality prediction has not been reported. Methods: We used the Kids' Inpatient Database for this project. We manually curated fourteen variables as predictors of mortality based on the current knowledge of the disease, and compared performance of mortality prediction between linear regression models and a machine learning (ML) model. For ML, the random forest algorithm was chosen because of the categorical nature of the variables. Based on variable importance scores, a reduced model was also developed for comparison. Results: We identified 4,144 patients from the database for randomization into the primary (for model development) and testing (for external validation) datasets. We found that the conventional logistic regression model had low sensitivity (~50%) despite high specificity (>95%) and overall accuracy. On the other hand, the ML model struck a good balance between sensitivity (89.9%) and specificity (85.8%). The reduced ML model with the top five variables (mechanical ventilation, cardiac arrest, ECMO, acute kidney injury, ventricular fibrillation) was sufficient to approximate the prediction performance of the full model. Conclusions: The ML algorithm outperforms the linear regression model for mortality prediction in pediatric myocarditis in this retrospective dataset. Prospective studies are warranted to further validate the applicability of our model in clinical settings.
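The contrast between low sensitivity and high specificity or accuracy follows directly from the low mortality rate: when deaths are rare, a model can miss half of them and still be correct for most patients. The sketch below computes the three metrics on a synthetic cohort; the cohort of 100 patients with 6 deaths and the predictions are made up to mirror the 5-8% mortality range and are not data from the study.

```python
def classification_metrics(y_true, y_pred):
    """Return (sensitivity, specificity, accuracy) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)   # share of deaths that were flagged
    specificity = tn / (tn + fp)   # share of survivors correctly cleared
    accuracy = (tp + tn) / len(pairs)
    return sensitivity, specificity, accuracy


# Synthetic cohort: 6 deaths and 94 survivors; the hypothetical model
# flags 3 of the 6 deaths and raises 2 false alarms among survivors.
y_true = [1] * 6 + [0] * 94
y_pred = [1, 1, 1, 0, 0, 0] + [1, 1] + [0] * 92

sens, spec, acc = classification_metrics(y_true, y_pred)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} accuracy={acc:.3f}")
```

In this toy case half of the deaths are missed while specificity and accuracy both remain at or above 95%, which is why sensitivity is the more informative figure when comparing models for a rare outcome.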