Automatic extraction and assessment of lifestyle exposures for Alzheimer's disease using natural language processing.
ABSTRACT: INTRODUCTION:Previous biomedical studies identified many lifestyle exposures that could possibly represent risk factors for dementia in general or dementia due to Alzheimer's disease (AD). These lifestyle exposures are mainly mentioned in free-text electronic health records (EHRs). However, automatic extraction and assessment of these exposures using EHRs remains understudied. METHODS:A natural language processing (NLP) approach was adopted to extract lifestyle exposures and intervention strategies from the clinical notes of 260 patients with clinical diagnoses of AD dementia and 260 age-matched cognitively unimpaired persons. Statistics of lifestyle exposures were compared between these two groups. The mapping results of the NLP extraction were evaluated by comparing the results with data captured independently by clinicians. RESULTS:Thirty out of fifty-five potentially relevant lifestyle exposures were mentioned in our clinical note dataset. Twenty-two dietary factors and three types of substance abuse that were potentially relevant were not found in clinical notes. Patients with AD dementia were exposed to significantly more of the potential risk factors than the cognitively unimpaired subjects (χ² = 120.31, p-value < 0.001). The average accuracy of the automated extraction was 74.0% in comparison with the manual review of 50 randomly selected sample documents. DISCUSSION AND CONCLUSION:We illustrated the feasibility of NLP techniques for the automated evaluation of a large number of lifestyle habits using free-text EHR data. We found that AD dementia patients were exposed to more of the potential risk factors than the comparison group. Our results also demonstrated the feasibility and accuracy of investigating putative risk factors using NLP techniques.
Project description:The increasing availability of electronic health records (EHRs) creates opportunities for automated extraction of information from clinical text. We hypothesized that natural language processing (NLP) could substantially reduce the burden of manual abstraction in studies examining outcomes, like cancer recurrence, that are documented in unstructured clinical text, such as progress notes, radiology reports, and pathology reports. We developed an NLP-based system using open-source software to process electronic clinical notes from 1995 to 2012 for women with early-stage incident breast cancers to identify whether and when recurrences were diagnosed. We developed and evaluated the system using clinical notes from 1,472 patients receiving EHR-documented care in an integrated health care system in the Pacific Northwest. A separate study provided the patient-level reference standard for recurrence status and date. The NLP-based system correctly identified 92% of recurrences and estimated diagnosis dates within 30 days for 88% of these. Specificity was 96%. The NLP-based system overlooked 5 of 65 recurrences, 4 because electronic documents were unavailable. The NLP-based system identified 5 other recurrences incorrectly classified as nonrecurrent in the reference standard. If used in similar cohorts, NLP could reduce by 90% the number of EHR charts abstracted to identify confirmed breast cancer recurrence cases at a rate comparable to traditional abstraction.
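The 92% figure in the abstract above is consistent with the stated count of 5 overlooked recurrences out of 65. A minimal sketch of that arithmetic follows; the function and variable names are illustrative and do not come from the study, and only the two counts given in the abstract are used.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of actual positive cases that the system detected."""
    return true_positives / (true_positives + false_negatives)


# Counts stated in the abstract: 65 recurrences, 5 of them overlooked.
total_recurrences = 65
missed = 5
detected = total_recurrences - missed

sens = sensitivity(detected, missed)
print(f"Sensitivity: {sens:.1%}")
```

The same counts cannot reproduce the 96% specificity, because the abstract does not report the number of true negatives or false positives.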
Project description:BACKGROUND:The increasing adoption of electronic health records (EHRs) in clinical practice holds the promise of improving care and advancing research by serving as a rich source of data, but most EHRs allow clinicians to enter data in a text format without much structure. Natural language processing (NLP) may reduce reliance on manual abstraction of these text data by extracting clinical features directly from unstructured clinical digital text data and converting them into structured data. OBJECTIVE:This study aimed to assess the performance of a commercially available NLP tool for extracting clinical features from free-text consult notes. METHODS:We conducted a pilot, retrospective, cross-sectional study of the accuracy of NLP from dictated consult notes from our tuberculosis clinic with manual chart abstraction as the reference standard. Consult notes for 130 patients were extracted and processed using NLP. We extracted 15 clinical features from these consult notes and grouped them a priori into categories of simple, moderate, and complex for analysis. RESULTS:For the primary outcome of overall accuracy, NLP performed best for features classified as simple, achieving an overall accuracy of 96% (95% CI 94.3-97.6). Performance was slightly lower for features of moderate clinical and linguistic complexity at 93% (95% CI 91.1-94.4), and lowest for complex features at 91% (95% CI 87.3-93.1). CONCLUSIONS:The findings of this study support the use of NLP for extracting clinical features from dictated consult notes in the setting of a tuberculosis clinic. Further research is needed to fully establish the validity of NLP for this and other purposes.
Project description:BACKGROUND:Novel approaches that complement and go beyond evidence-based medicine are required in the domain of chronic diseases, given the growing incidence of such conditions in the worldwide population. A promising avenue is the secondary use of electronic health records (EHRs), where patient data are analyzed to conduct clinical and translational research. Methods based on machine learning to process EHRs are resulting in improved understanding of patient clinical trajectories and chronic disease risk prediction, creating a unique opportunity to derive previously unknown clinical insights. However, a wealth of clinical histories remains locked behind clinical narratives in free-form text. Consequently, unlocking the full potential of EHR data is contingent on the development of natural language processing (NLP) methods to automatically transform clinical text into structured clinical data that can guide clinical decisions and potentially delay or prevent disease onset. OBJECTIVE:The goal of the research was to provide a comprehensive overview of the development and uptake of NLP methods applied to free-text clinical notes related to chronic diseases, including the investigation of challenges faced by NLP methodologies in understanding clinical narratives. METHODS:Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines were followed and searches were conducted in 5 databases using "clinical notes," "natural language processing," and "chronic disease" and their variations as keywords to maximize coverage of the articles. RESULTS:Of the 2652 articles considered, 106 met the inclusion criteria. Review of the included papers resulted in identification of 43 chronic diseases, which were then further classified into 10 disease categories using the International Classification of Diseases, 10th Revision.
The majority of studies focused on diseases of the circulatory system (n=38) while endocrine and metabolic diseases were fewest (n=14). This was due to the structure of clinical records related to metabolic diseases, which typically contain much more structured data, compared with medical records for diseases of the circulatory system, which focus more on unstructured data and consequently have seen a stronger focus of NLP. The review has shown that there is a significant increase in the use of machine learning methods compared to rule-based approaches; however, deep learning methods remain emergent (n=3). Consequently, the majority of works focus on classification of disease phenotype with only a handful of papers addressing extraction of comorbidities from the free text or integration of clinical notes with structured data. There is a notable use of relatively simple methods, such as shallow classifiers (or combination with rule-based methods), due to the interpretability of predictions, which still represents a significant issue for more complex methods. Finally, scarcity of publicly available data may also have contributed to insufficient development of more advanced methods, such as extraction of word embeddings from clinical notes. CONCLUSIONS:Efforts are still required to improve (1) progression of clinical NLP methods from extraction toward understanding; (2) recognition of relations among entities rather than entities in isolation; (3) temporal extraction to understand past, current, and future clinical events; (4) exploitation of alternative sources of clinical knowledge; and (5) availability of large-scale, de-identified clinical corpora.
Project description:OBJECTIVES:To demonstrate the utility of a natural language processing (NLP) algorithm for mining kidney stone composition in a large-scale electronic health records (EHR) repository. METHODS:We developed StoneX, a pattern-matching method for extracting kidney stone composition information from clinical notes. We trained the extraction algorithm on manually annotated text mentions of calcium oxalate monohydrate, calcium oxalate dihydrate, hydroxyapatite, brushite, uric acid, and struvite stones. We employed StoneX to identify patients with kidney stone composition data and mine >125 million notes from our institutional EHR. Analyses performed on the extracted patients included stone type conversions over time, survival analysis from a second stone surgery, and disease associations by stone composition to validate the phenotyping method against known associations. RESULTS:The NLP algorithm identified 45,235 text mentions corresponding to 11,585 patients. Overall, the system achieved positive predictive value >90% for calcium oxalate monohydrate, calcium oxalate dihydrate, hydroxyapatite, brushite, and struvite; except for uric acid (positive predictive value = 87.5%). Survival analysis from a second stone surgery showed statistically significant differences among stone types (P = .03). Several phenotype associations were found: uric acid-type 2 diabetes (odds ratio, OR = 2.69, 95% confidence intervals, CI = 1.91-3.79), struvite-neurogenic bladder (OR = 12.27, 95% CI = 4.33-34.79), struvite-urinary tract infection (OR = 7.36, 95% CI = 3.01-17.99), hydroxyapatite-pulmonary collapse (OR = 3.67, 95% CI = 2.10-6.42), hydroxyapatite-neurogenic bladder (OR = 5.23, 95% CI = 2.05-13.36), brushite-calcium metabolism disorder (OR = 4.59, 95% CI = 2.14-9.81), and brushite-hypercalcemia (OR = 4.09, 95% CI = 1.90-8.80).
CONCLUSION:NLP extraction of kidney stone composition from large-scale EHRs is feasible with high precision, enabling high-throughput epidemiological studies of kidney stone disease. These tools will enable high fidelity kidney stone research from the EHR.
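The odds ratios and confidence intervals in this abstract can be sanity-checked without the underlying counts. The sketch below assumes the intervals are Wald-type intervals computed on the log-odds scale, which the abstract does not state. Under that assumption each interval is symmetric around log(OR), so the geometric mean of its bounds should land close to the reported odds ratio. The dictionary and function names are illustrative only.

```python
import math

# (odds ratio, lower 95% bound, upper 95% bound) as reported in the abstract.
reported = {
    "uric acid / type 2 diabetes": (2.69, 1.91, 3.79),
    "struvite / neurogenic bladder": (12.27, 4.33, 34.79),
    "struvite / urinary tract infection": (7.36, 3.01, 17.99),
}


def log_scale_midpoint(lower: float, upper: float) -> float:
    """Geometric mean of the bounds, i.e. the midpoint on the log scale."""
    return math.sqrt(lower * upper)


midpoints = {}
for name, (odds_ratio, lower, upper) in reported.items():
    midpoints[name] = log_scale_midpoint(lower, upper)
    print(f"{name}: reported OR {odds_ratio}, log-scale midpoint {midpoints[name]:.2f}")
```

Close agreement is consistent with the assumed interval construction, but it does not confirm which method the authors actually used.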
Project description:BACKGROUND:Lung cancer is the second most common cancer for men and women; the wide adoption of electronic health records (EHRs) offers a potential to accelerate cohort-related epidemiological studies using informatics approaches. Since manual extraction from large volumes of text materials is time consuming and labor intensive, some efforts have emerged to automatically extract information from text for lung cancer patients using natural language processing (NLP), an artificial intelligence technique. METHODS:In this study, using an existing cohort of 2311 lung cancer patients with information about stage, histology, tumor grade, and therapies (chemotherapy, radiotherapy and surgery) manually ascertained, we developed and evaluated an NLP system to extract information on these variables automatically for the same patients from clinical narratives including clinical notes, pathology reports and surgery reports. RESULTS:Evaluation showed promising results with the recalls for stage, histology, tumor grade, and therapies achieving 89, 98, 78, and 100% respectively and the precisions were 70, 88, 90, and 100% respectively. CONCLUSION:This study demonstrated the feasibility and accuracy of automatically extracting pre-defined information from clinical narratives for lung cancer research.
Project description:Large volumes of data are continuously generated from clinical notes and diagnostic studies catalogued in electronic health records (EHRs). Echocardiography is one of the most commonly ordered diagnostic tests in cardiology. This study sought to explore the feasibility and reliability of using natural language processing (NLP) for large-scale and targeted extraction of multiple data elements from echocardiography reports. An NLP tool, EchoInfer, was developed to automatically extract data pertaining to cardiovascular structure and function from heterogeneously formatted echocardiographic data sources. EchoInfer was applied to echocardiography reports (2004 to 2013) available from 3 different on-going clinical research projects. EchoInfer analyzed 15,116 echocardiography reports from 1684 patients, and extracted 59 quantitative and 21 qualitative data elements per report. EchoInfer achieved a precision of 94.06%, a recall of 92.21%, and an F1-score of 93.12% across all 80 data elements in 50 reports. Physician review of 400 reports demonstrated that EchoInfer achieved a recall of 92-99.9% and a precision of >97% in four data elements, including three quantitative and one qualitative data element. Failure of EchoInfer to correctly identify or reject reported parameters was primarily related to non-standardized reporting of echocardiography data. EchoInfer provides a powerful and reliable NLP-based approach for the large-scale, targeted extraction of information from heterogeneous data sources. The use of EchoInfer may have implications for the clinical management and research analysis of patients undergoing echocardiographic evaluation.
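The F1-score reported for EchoInfer is the harmonic mean of precision and recall. A minimal sketch of that relationship, using the two rounded percentages from the abstract as inputs, is shown below; the function name is illustrative and not part of EchoInfer.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded percentages reported in the abstract.
echoinfer_f1 = f1_score(94.06, 92.21)
print(f"F1 = {echoinfer_f1:.2f}%")
```

With these rounded inputs the result falls within about 0.01 percentage points of the reported 93.12%. The small gap is presumably because the published value was computed from unrounded precision and recall.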
Project description:Alzheimer's disease (AD) is the leading cause of dementia in the United States and afflicted >5.7 million Americans in 2018. Therapeutic options remain extremely limited to those that are symptom targeting, while no drugs have been approved for the modification or reversal of the disease itself. Risk factors for AD include aging, female sex, and carrying an APOE4 genotype. These risk factors have been extensively examined in the literature, while less attention has been paid to modifiable risk factors, including lifestyle, and environmental risk factors such as exposures to air pollution and pesticides. This review highlights the most recent data on risk factors in AD and identifies gene by environment interactions that have been investigated. It also provides a suggested framework for a personalized therapeutic approach to AD, by combining genetic, environmental and lifestyle risk factors. Understanding modifiable risk factors and their interaction with non-modifiable factors (age, susceptibility alleles, and sex) is paramount for designing personalized therapeutic interventions.
Project description:BACKGROUND:An adverse drug event (ADE) is commonly defined as "an injury resulting from medical intervention related to a drug." Providing information related to ADEs and alerting caregivers at the point of care can reduce the risk of prescription and diagnostic errors and improve health outcomes. ADEs captured in structured data in electronic health records (EHRs) as either coded problems or allergies are often incomplete, leading to underreporting. Therefore, it is important to develop capabilities to process unstructured EHR data in the form of clinical notes, which contain a richer documentation of a patient's ADE. Several natural language processing (NLP) systems have been proposed to automatically extract information related to ADEs. However, the results from these systems showed that significant improvement is still required for the automatic extraction of ADEs from clinical notes. OBJECTIVE:This study aims to improve the automatic extraction of ADEs and related information such as drugs, their attributes, and reason for administration from the clinical notes of patients. METHODS:This research was conducted using discharge summaries from the Medical Information Mart for Intensive Care III (MIMIC-III) database obtained through the 2018 National NLP Clinical Challenges (n2c2) annotated with drugs, drug attributes (ie, strength, form, frequency, route, dosage, duration), ADEs, reasons, and relations between drugs and other entities. We developed a deep learning-based system for extracting these drug-centric concepts and relations simultaneously using a joint method enhanced with contextualized embeddings, a position-attention mechanism, and knowledge representations. The joint method generated different sentence representations for each drug, which were then used to extract related concepts and relations simultaneously. Contextualized representations trained on the MIMIC-III database were used to capture context-sensitive meanings of words. 
The position-attention mechanism amplified the benefits of the joint method by generating sentence representations that capture long-distance relations. Knowledge representations were obtained from graph embeddings created using the US Food and Drug Administration Adverse Event Reporting System database to improve relation extraction, especially when contextual clues were insufficient. RESULTS:Our system achieved new state-of-the-art results on the n2c2 data set, with significant improvements in recognizing crucial drug-reason (F1=0.650 versus F1=0.579) and drug-ADE (F1=0.490 versus F1=0.476) relations. CONCLUSIONS:This study presents a system for extracting drug-centric concepts and relations that outperformed current state-of-the-art results and shows that contextualized embeddings, position-attention mechanisms, and knowledge graph embeddings effectively improve deep learning-based concepts and relation extraction. This study demonstrates the potential for deep learning-based methods to help extract real-world evidence from unstructured patient data for drug safety surveillance.
Project description:Triggering receptor expressed on myeloid cells 2 (TREM2) is an innate immune receptor expressed by microglia. Its cleaved fragments, soluble TREM2 (sTREM2), can be measured in the cerebrospinal fluid (CSF). Previous studies indicate higher CSF sTREM2 in symptomatic AD; however, most of these studies have included biomarker positive AD cases and biomarker negative controls. The aim of the study was to explore potential differences in the CSF level of sTREM2 and factors associated with an increased sTREM2 level in patients diagnosed with mild cognitive impairment (MCI) or dementia due to AD compared with cognitively unimpaired controls as judged by clinical symptoms and biomarker category (AT). We included 299 memory clinic patients, 62 (20.7%) with AD-MCI and 237 (79.3%) with AD dementia, and 113 cognitively unimpaired controls. CSF measures of the core biomarkers were applied to determine AT status. CSF sTREM2 was analyzed by ELISA. Patients presented with CSF sTREM2 levels comparable to those of the cognitively unimpaired (9.6 ng/ml [SD 4.7] versus 8.8 ng/ml [SD 3.6], p = 0.27). We found that CSF sTREM2 was associated with age-related neuroinflammation and tauopathy irrespective of amyloid β, APOE ε4 status or gender. The findings were similar in both symptomatic and non-symptomatic individuals.
Project description:BACKGROUND:White matter hyperintensities (WMH) of presumed vascular origin have been associated with an increased risk of Alzheimer's disease (AD). This study aims to describe the patterns of WMH associated with dementia risk estimates and individual risk factors in a cohort of middle-aged/late middle-aged individuals (mean 58 (interquartile range 51-64) years old). METHODS:Magnetic resonance imaging and AD risk factors were collected from 575 cognitively unimpaired participants. WMH load was automatically calculated in each brain lobe and in four equidistant layers from the ventricular surface to the cortical interface. Global volumes and regional patterns of WMH load were analyzed as a function of the Cardiovascular Risk Factors, Aging and Incidence of Dementia (CAIDE) dementia risk score, as well as family history of AD and Apolipoprotein E (APOE) genotype. Additional analyses were performed after correcting for the effect of age and hypertension. RESULTS:The studied cohort showed very low WMH burden (median 1.94 cm³) and 20-year dementia risk estimates (median 1.47%). Even so, higher CAIDE scores were significantly associated with increased global WMH load. The main drivers of this association were age and hypertension, with hypercholesterolemia and body mass index also displaying a minor, albeit significant, influence. Regionally, CAIDE scores were positively associated with WMH in anterior areas, mostly in the frontal lobe. Age and hypertension showed significant association with WMH in almost all regions analyzed. The APOE-ε2 allele showed a protective effect over global WMH with a pattern that comprised juxtacortical temporo-occipital and fronto-parietal deep white matter regions. Participants with maternal family history of AD had higher WMH load than those without, especially in temporal and occipital lobes.
CONCLUSIONS:WMH load is associated with AD risk factors even in cognitively unimpaired subjects with very low WMH burden and dementia risk estimates. Our results suggest that tight control of modifiable risk factors in middle-age/late middle-age could have a significant impact on late-life dementia.