Project description:Objective: Hypertension has long been recognized as one of the most important predisposing factors for cardiovascular diseases and mortality. In recent years, machine learning methods have shown potential in diagnostic and predictive approaches in chronic diseases. Electronic health records (EHRs) have emerged as a reliable source of longitudinal data. The aim of this study is to predict the onset of hypertension using modern deep learning (DL) architectures, specifically long short-term memory (LSTM) networks, and longitudinal EHRs. Materials and methods: We compare this approach to the best performing models reported from previous works, particularly XGBoost, applied to aggregated features. Our work is based on data from 233 895 adult patients from a large health system in the United States. We divided our population into 2 distinct longitudinal datasets based on the diagnosis date. To ensure generalization to unseen data, we trained our models on the first dataset (dataset A "train and validation") using cross-validation, and then applied the models to a second dataset (dataset B "test") to assess their performance. We also experimented with 2 different time windows before the onset of hypertension and evaluated the impact on model performance. Results: With the LSTM network, we were able to achieve an area under the receiver operating characteristic curve value of 0.98 in the "train and validation" dataset A and 0.94 in the "test" dataset B for a prediction time window of 1 year. Lipid disorders, type 2 diabetes, and renal disorders are found to be associated with incident hypertension. Conclusion: These findings show that DL models based on temporal EHR data can improve the identification of patients at high risk of hypertension and corresponding driving factors.
In the long term, this work may support identifying individuals who are at high risk for developing hypertension and facilitate earlier intervention to prevent the future development of hypertension.
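The split into dataset A and dataset B by diagnosis date can be sketched as follows. This is only an illustration: the abstract does not state the cutoff date, the direction of the split, or the record layout, so the `cutoff` value, the `patient_id` and `diagnosis_date` field names, and the choice to put earlier diagnoses into dataset A are all assumptions.

```python
from datetime import date


def split_by_diagnosis_date(records, cutoff):
    """Split patient records into two datasets around a cutoff date.

    Assumption: records diagnosed before `cutoff` go to dataset A
    ("train and validation"), all others go to dataset B ("test").
    """
    dataset_a, dataset_b = [], []
    for record in records:
        if record["diagnosis_date"] < cutoff:
            dataset_a.append(record)
        else:
            dataset_b.append(record)
    return dataset_a, dataset_b


# Hypothetical toy records; field names are illustrative only.
records = [
    {"patient_id": "p1", "diagnosis_date": date(2015, 3, 1)},
    {"patient_id": "p2", "diagnosis_date": date(2018, 6, 15)},
    {"patient_id": "p3", "diagnosis_date": date(2016, 11, 30)},
]

dataset_a, dataset_b = split_by_diagnosis_date(records, cutoff=date(2017, 1, 1))
print(len(dataset_a), "records in A;", len(dataset_b), "records in B")
```

Splitting on a date rather than at random keeps the test set later in time than the training set, which is one way to probe generalization to unseen data.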
Project description:Precision medicine requires accurate identification of clinically relevant patient subgroups. Electronic health records provide major opportunities for leveraging machine learning approaches to uncover novel patient subgroups. However, many existing approaches fail to adequately capture complex interactions between diagnosis trajectories and disease-relevant risk events, leading to subgroups that can still display great heterogeneity in event risk and underlying molecular mechanisms. To address this challenge, we implemented VaDeSC-EHR, a transformer-based variational autoencoder for clustering longitudinal survival data as extracted from electronic health records. We show that VaDeSC-EHR outperforms baseline methods on both synthetic and real-world benchmark datasets with known ground-truth cluster labels. In an application to Crohn's disease, VaDeSC-EHR successfully identifies four distinct subgroups with divergent diagnosis trajectories and risk profiles, revealing clinically and genetically relevant factors in Crohn's disease. Our results show that VaDeSC-EHR can be a powerful tool for discovering novel patient subgroups in the development of precision medicine approaches.
Project description:Predictive modeling with electronic health record (EHR) data is anticipated to drive personalized medicine and improve healthcare quality. Constructing predictive statistical models typically requires extraction of curated predictor variables from normalized EHR data, a labor-intensive process that discards the vast majority of information in each patient's record. We propose a representation of patients' entire raw EHR records based on the Fast Healthcare Interoperability Resources (FHIR) format. We demonstrate that deep learning methods using this representation are capable of accurately predicting multiple medical events from multiple centers without site-specific data harmonization. We validated our approach using de-identified EHR data from two US academic medical centers with 216,221 adult patients hospitalized for at least 24 h. In the sequential format we propose, this volume of EHR data unrolled into a total of 46,864,534,945 data points, including clinical notes. Deep learning models achieved high accuracy for tasks such as predicting: in-hospital mortality (area under the receiver operator curve [AUROC] across sites 0.93-0.94), 30-day unplanned readmission (AUROC 0.75-0.76), prolonged length of stay (AUROC 0.85-0.86), and all of a patient's final discharge diagnoses (frequency-weighted AUROC 0.90). These models outperformed traditional, clinically-used predictive models in all cases. We believe that this approach can be used to create accurate and scalable predictions for a variety of clinical scenarios. In a case study of a particular prediction, we demonstrate that neural networks can be used to identify relevant information from the patient's chart.
Project description:Background: Artificial intelligence approaches can integrate complex features and can be used to predict a patient's risk of developing lung cancer, thereby decreasing the need for unnecessary and expensive diagnostic interventions. Objective: The aim of this study was to use electronic medical records to prescreen patients who are at risk of developing lung cancer. Methods: We randomly selected 2 million participants from the Taiwan National Health Insurance Research Database who received care between 1999 and 2013. We built a predictive lung cancer screening model with neural networks that were trained and validated using pre-2012 data, and we tested the model prospectively on post-2012 data. An age- and gender-matched subgroup that was 10 times larger than the original lung cancer group was used to assess the predictive power of the electronic medical record. Discrimination (area under the receiver operating characteristic curve [AUC]) and calibration analyses were performed. Results: The analysis included 11,617 patients with lung cancer and 1,423,154 control patients. The model achieved AUCs of 0.90 for the overall population and 0.87 in patients ≥55 years of age. The AUC in the matched subgroup was 0.82. The positive predictive value was highest (14.3%) among people aged ≥55 years with a pre-existing history of lung disease. Conclusions: Our model achieved excellent performance in predicting lung cancer within 1 year and has potential to be deployed for digital patient screening. Convolutional neural networks facilitate the effective use of EMRs to identify individuals at high risk for developing lung cancer.
Project description:Methicillin-resistant Staphylococcus aureus (MRSA) causes significant morbidity and mortality in hospitals. Rapid, accurate risk stratification of MRSA is crucial for optimizing antibiotic therapy. Our study introduced a deep learning model, PyTorch_EHR, which leverages electronic health record (EHR) time-series data, including a wide variety of patient-specific data, to predict MRSA culture positivity within two weeks. A total of 8,164 MRSA and 22,393 non-MRSA patient events from Memorial Hermann Hospital System, Houston, Texas, were used for model development. PyTorch_EHR outperforms logistic regression (LR) and light gradient boosting machine (LGBM) models in accuracy (AUROC: PyTorch_EHR = 0.911, LR = 0.857, LGBM = 0.892). External validation with 393,713 patient events from the Medical Information Mart for Intensive Care (MIMIC)-IV dataset in Boston confirms its superior accuracy (AUROC: PyTorch_EHR = 0.859, LR = 0.816, LGBM = 0.838). Our model effectively stratifies patients into high-, medium-, and low-risk categories, potentially optimizing antimicrobial therapy and reducing unnecessary MRSA-specific antimicrobials. This highlights the advantage of deep learning models in predicting MRSA-positive cultures, surpassing traditional machine learning models and supporting clinicians' judgments.
Project description:Electronic health records naturally contain most of the medical information in the form of doctor's notes as unstructured or semi-structured texts. Current deep learning text analysis approaches allow researchers to reveal the inner semantics of text information and even identify hidden consequences that can offer extra decision support to doctors. In the presented article, we offer a new automated analysis of Polish summary texts of patient hospitalizations. The presented models were found to be able to predict the final diagnosis with almost 70% accuracy based just on the patient's medical history (only 132 words on average), with possible accuracy increases when adding further sentences from hospitalization results; even one sentence was found to improve the results by 4%, and the best accuracy of 78% was achieved with five extra sentences. In addition to detailed descriptions of the data and methodology, we present an evaluation of the analysis using more than 50,000 Polish cardiology patient texts and dive into a detailed error analysis of the approach. The results indicate that the deep analysis of just the medical history summary can suggest the direction of diagnosis with a high probability that can be further increased just by supplementing the records with further examination results.
Project description:Objective: Postpartum hemorrhage (PPH) remains a leading cause of preventable maternal mortality in the United States. We sought to develop a novel risk assessment tool and compare its accuracy to tools used in current practice. Materials and methods: We used a PPH digital phenotype that we developed and validated previously to identify 6639 PPH deliveries from our delivery cohort (N = 70 948). Using a vast array of known and potential risk factors extracted from electronic medical records available prior to delivery, we trained a gradient boosting model in a subset of our cohort. In a held-out test sample, we compared performance of our model with 3 clinical risk-assessment tools and 1 previously published model. Results: Our 24-feature model achieved an area under the receiver-operating characteristic curve (AUROC) of 0.71 (95% confidence interval [CI], 0.69-0.72), higher than all other tools (research-based AUROC, 0.67 [95% CI, 0.66-0.69]; clinical AUROCs, 0.55 [95% CI, 0.54-0.56] to 0.61 [95% CI, 0.59-0.62]). Five features were novel, including red blood cell indices and infection markers measured upon admission. Additionally, we identified inflection points for vital signs and labs where risk rose substantially. Most notably, patients with median intrapartum systolic blood pressure above 132 mm Hg had an 11% (95% CI, 8%-13%) median increase in relative risk for PPH. Conclusions: We developed a novel approach for predicting PPH and identified clinical feature thresholds that can guide intrapartum monitoring for PPH risk. These results suggest that our model is an excellent candidate for prospective evaluation and could ultimately reduce PPH morbidity and mortality through early detection and prevention.
Project description:Background: An artificial-intelligence (AI) model for predicting the prognosis or mortality of coronavirus disease 2019 (COVID-19) patients will allow efficient allocation of limited medical resources. We developed an early mortality prediction ensemble model for COVID-19 using AI models with initial chest X-ray and electronic health record (EHR) data. Results: We used convolutional neural network (CNN) models (Inception-ResNet-V2 and EfficientNet) for chest X-ray analysis and multilayer perceptron (MLP), Extreme Gradient Boosting (XGBoost), and random forest (RF) models for EHR data analysis. The Gradient-weighted Class Activation Mapping and Shapley Additive Explanations (SHAP) methods were used to determine the effects of these features on COVID-19. We developed an ensemble model (area under the receiver operating characteristic curve of 0.8698) using a soft voting method with weight differences for CNN, XGBoost, MLP, and RF models. To resolve the data imbalance, we conducted F1-score optimization by adjusting the cutoff values to optimize the model performance (F1 score of 0.77). Conclusions: Our study is meaningful in that we developed an early mortality prediction model using only the initial chest X-ray and EHR data of COVID-19 patients. Early prediction of the clinical courses of patients is helpful for not only treatment but also bed management. Our results confirmed the performance improvement of the ensemble model achieved by combining AI models. Through the SHAP method, laboratory tests that indicate the factors affecting COVID-19 mortality were discovered, highlighting the importance of these tests in managing COVID-19 patients.
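The weighted soft voting and F1-based cutoff selection described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the model weights, the toy probabilities, and the candidate cutoffs below are hypothetical, and only two models are combined to keep the example short.

```python
def soft_vote(prob_lists, weights):
    """Weighted average of per-model predicted probabilities (soft voting)."""
    total_weight = sum(weights)
    n_samples = len(prob_lists[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / total_weight
        for i in range(n_samples)
    ]


def f1_at_cutoff(y_true, y_prob, cutoff):
    """F1 score when samples with probability >= cutoff are called positive."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= cutoff)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= cutoff)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < cutoff)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_cutoff(y_true, y_prob, candidates):
    """Return the candidate cutoff with the highest F1 score."""
    return max(candidates, key=lambda c: f1_at_cutoff(y_true, y_prob, c))


# Hypothetical toy data: 1 = died, 0 = survived.
y_true = [0, 0, 0, 1, 1]
cnn_probs = [0.2, 0.4, 0.3, 0.6, 0.9]   # e.g. an image model
xgb_probs = [0.1, 0.3, 0.5, 0.4, 0.8]   # e.g. a tabular EHR model

ensemble = soft_vote([cnn_probs, xgb_probs], weights=[2.0, 1.0])
cutoff = best_cutoff(y_true, ensemble, candidates=[0.3, 0.5, 0.7])
print("chosen cutoff:", cutoff, "F1:", f1_at_cutoff(y_true, ensemble, cutoff))
```

With imbalanced outcomes, the default cutoff of 0.5 is often not the one that maximizes F1, which is why the cutoff is treated as a tunable value. In practice it would be selected on validation data rather than on the test set.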
Project description:Background: Bleeding is associated with a significantly increased morbidity and mortality. Bleeding events are often described in the unstructured text of electronic health records, which makes them difficult to identify by manual inspection. Objectives: To develop a deep learning model that detects and visualizes bleeding events in electronic health records. Patients/methods: Three hundred electronic health records with International Classification of Diseases, Tenth Revision diagnosis codes for bleeding or leukemia were extracted. Each sentence in the electronic health record was annotated as positive or negative for bleeding. The annotated sentences were used to develop a deep learning model that detects bleeding at sentence and note level. Results: On a balanced test set of 1178 sentences, the best-performing deep learning model achieved a sensitivity of 0.90, specificity of 0.90, and negative predictive value of 0.90. On a test set consisting of 700 notes, of which 49 were positive for bleeding, the model achieved a note-level sensitivity of 1.00, specificity of 0.52, and negative predictive value of 1.00. By using a sentence-level model on a note level, the model can explain its predictions by visualizing the exact sentence in a note that contains information regarding bleeding. Moreover, we found that the model performed consistently well across different types of bleedings. Conclusions: A deep learning model can be used to detect and visualize bleeding events in the free text of electronic health records. The deep learning model can thus facilitate systematic assessment of bleeding risk, and thereby optimize patient care and safety.
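One way to apply a sentence-level classifier at note level is sketched below. The abstract does not specify the aggregation rule, so the "any sentence above a threshold" rule and the 0.5 threshold are assumptions. The confusion counts in the example are approximate back-calculations from the reported figures (700 notes, 49 positive, sensitivity 1.00, specificity 0.52), not values taken from the paper.

```python
def note_is_positive(sentence_probs, threshold=0.5):
    """Flag a note if any sentence is predicted positive (assumed rule)."""
    return any(p >= threshold for p in sentence_probs)


def flagged_sentences(sentences, sentence_probs, threshold=0.5):
    """Return the sentences that triggered the flag, for display to a reader."""
    return [s for s, p in zip(sentences, sentence_probs) if p >= threshold]


def screening_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, and negative predictive value from counts."""
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        "npv": tn / (tn + fn) if (tn + fn) else 0.0,
    }


# Hypothetical note with made-up sentence probabilities.
sentences = ["Patient admitted with dizziness.", "Melena noted on examination."]
probs = [0.05, 0.93]
print(note_is_positive(probs), flagged_sentences(sentences, probs))

# Approximate counts consistent with the reported note-level results:
# 49 positive notes all detected, about 339 of 651 negative notes cleared.
print(screening_metrics(tp=49, fp=312, tn=339, fn=0))
```

Under an any-sentence rule, a single false-positive sentence flags the whole note, which is consistent with high note-level sensitivity alongside lower specificity.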
Project description:BACKGROUND Reported per-patient costs of Clostridium difficile infection (CDI) vary by 2 orders of magnitude among different hospitals, implying that infection control officers need precise, local analyses to guide rational decision making between interventions. OBJECTIVE We sought to comprehensively estimate changes in length of stay (LOS) attributable to CDI at a single urban tertiary-care facility using only data automatically extractable from the electronic medical record (EMR). METHODS We performed a retrospective cohort study of 171,938 visits spanning a 7-year period. In total, 23,968 variables were extracted from EMR data recorded within 24 hours of admission to train elastic-net regularized logistic regression models for propensity score matching. To address time-dependent bias (reverse causation), we separately stratified comparisons by time of infection, and we fit multistate models. RESULTS The estimated difference in median LOS for propensity-matched cohorts varied from 3.1 days (95% CI, 2.2-3.9) to 10.1 days (95% CI, 7.3-12.2) depending on the case definition; however, dependency of the estimate on time to infection was observed. Stratification by time to first positive toxin assay, excluding probable community-acquired infections, showed a minimum excess LOS of 3.1 days (95% CI, 1.7-4.4). Under the same case definition, the multistate model averaged an excess LOS of 3.3 days (95% CI, 2.6-4.0). CONCLUSIONS In this study, 2 independent time-to-infection adjusted methods converged on similar excess LOS estimates. Changes in LOS can be extrapolated to marginal dollar costs by multiplying by average costs of an inpatient day. Infection control officers can leverage automatically extractable EMR data to estimate costs of CDI at their own institutions. Infect Control Hosp Epidemiol. 2017;38:1478-1486.
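The extrapolation from excess LOS to marginal cost mentioned in the conclusions is a single multiplication, sketched here. The excess LOS values come from the multistate estimate reported above (3.3 days; 95% CI, 2.6-4.0). The cost of an inpatient day is a hypothetical placeholder that each institution would replace with its own figure.

```python
def excess_cost(excess_los_days, cost_per_inpatient_day):
    """Marginal cost attributable to CDI: excess days times average daily cost."""
    return excess_los_days * cost_per_inpatient_day


# Hypothetical local figure, NOT from the study.
COST_PER_DAY = 2000.0

# Excess LOS from the multistate model: point estimate and 95% CI bounds.
point = excess_cost(3.3, COST_PER_DAY)
lower = excess_cost(2.6, COST_PER_DAY)
upper = excess_cost(4.0, COST_PER_DAY)

print(f"Estimated excess cost per case: {point:.0f} ({lower:.0f} to {upper:.0f})")
```

Scaling the confidence bounds this way carries over only the uncertainty in LOS; any uncertainty in the daily cost itself is ignored in this sketch.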