Chapter 13: Mining electronic health records in the genomics era.
ABSTRACT: The combination of improved genomic analysis methods, decreasing genotyping costs, and increasing computing resources has led to an explosion of clinical genomic knowledge in the last decade. Similarly, healthcare systems are increasingly adopting robust electronic health record (EHR) systems that not only can improve health care, but also contain a vast repository of disease and treatment data that could be mined for genomic research. Indeed, institutions are creating EHR-linked DNA biobanks to enable genomic and pharmacogenomic research, using EHR data for phenotypic information. However, EHRs are designed primarily for clinical care, not research, so reuse of clinical EHR data for research purposes can be challenging. Difficulties in use of EHR data include: data availability, missing data, incorrect data, and vast quantities of unstructured narrative text data. Structured information includes billing codes, most laboratory reports, and other variables such as physiologic measurements and demographic information. Significant information, however, remains locked within EHR narrative text documents, including clinical notes and certain categories of test results, such as pathology and radiology reports. For relatively rare observations, combinations of simple free-text searches and billing codes may prove adequate when followed by manual chart review. However, to extract the large cohorts necessary for genome-wide association studies, natural language processing methods to process narrative text data may be needed. Combinations of structured and unstructured textual data can be mined to generate high-validity collections of cases and controls for a given condition. Once high-quality cases and controls are identified, EHR-derived cases can be used for genomic discovery and validation. Since EHR data includes a broad sampling of clinically-relevant phenotypic information, it may enable multiple genomic investigations upon a single set of genotyped individuals. This chapter reviews several examples of phenotype extraction and their application to genetic research, demonstrating a viable future for genomic discovery using EHR-linked data.
Project description:Despite increased risk for negative outcomes, cognitive impairment (CI) is greatly under-detected during hospitalization. While automated EHR-based phenotypes have potential to improve recognition of CI, they are hindered by widespread under-diagnosis of underlying etiologies such as dementia-limiting the utility of more precise structured data elements. This study examined unstructured data on symptoms of CI in the acute-care EHRs of hip and stroke fracture patients with dementia from two hospitals. Clinician reviewers identified and classified unstructured EHR data using standardized criteria. Relevant narrative text was descriptively characterized and evaluated for key terminology. Most patient EHRs (90%) had narrative text reflecting cognitive and/or behavioral dysfunction common in CI that were reliably classified (? 0.82). The majority of statements reflected vague descriptions of cognitive/behavioral dysfunction as opposed to diagnostic terminology. Findings from this preliminary derivation study suggest that clinicians use specific terminology in unstructured EHR fields to describe common symptoms of CI. This terminology can inform the design of EHR-based phenotypes for CI and merits further investigation in more diverse, robustly characterized samples.
Project description:OBJECTIVE:Phenotyping algorithms applied to electronic health record (EHR) data enable investigators to identify large cohorts for clinical and genomic research. Algorithm development is often iterative, depends on fallible investigator intuition, and is time- and labor-intensive. We developed and evaluated 4 types of phenotyping algorithms and categories of EHR information to identify hypertensive individuals and controls and provide a portable module for implementation at other sites. MATERIALS AND METHODS:We reviewed the EHRs of 631 individuals followed at Vanderbilt for hypertension status. We developed features and phenotyping algorithms of increasing complexity. Input categories included International Classification of Diseases, Ninth Revision (ICD9) codes, medications, vital signs, narrative-text search results, and Unified Medical Language System (UMLS) concepts extracted using natural language processing (NLP). We developed a module and tested portability by replicating 10 of the best-performing algorithms at the Marshfield Clinic. RESULTS:Random forests using billing codes, medications, vitals, and concepts had the best performance with a median area under the receiver operator characteristic curve (AUC) of 0.976. Normalized sums of all 4 categories also performed well (0.959 AUC). The best non-NLP algorithm combined normalized ICD9 codes, medications, and blood pressure readings with a median AUC of 0.948. Blood pressure cutoffs or ICD9 code counts alone had AUCs of 0.854 and 0.908, respectively. Marshfield Clinic results were similar. CONCLUSION:This work shows that billing codes or blood pressure readings alone yield good hypertension classification performance. However, even simple combinations of input categories improve performance. The most complex algorithms classified hypertension with excellent recall and precision.
Project description:Motivation:Despite significant efforts in expert curation, clinical relevance about most of the 154 million dbSNP reference variants (RS) remains unknown. However, a wealth of knowledge about the variant biological function/disease impact is buried in unstructured literature data. Previous studies have attempted to harvest and unlock such information with text-mining techniques but are of limited use because their mutation extraction results are not standardized or integrated with curated data. Results:We propose an automatic method to extract and normalize variant mentions to unique identifiers (dbSNP RSIDs). Our method, in benchmarking results, demonstrates a high F-measure of ?90% and compared favorably to the state of the art. Next, we applied our approach to the entire PubMed and validated the results by verifying that each extracted variant-gene pair matched the dbSNP annotation based on mapped genomic position, and by analyzing variants curated in ClinVar. We then determined which text-mined variants and genes constituted novel discoveries. Our analysis reveals 41?889 RS numbers (associated with 9151 genes) not found in ClinVar. Moreover, we obtained a rich set worth further review: 12?462 rare variants (MAF???0.01) in 3849 genes which are presumed to be deleterious and not frequently found in the general population. To our knowledge, this is the first large-scale study to analyze and integrate text-mined variant data with curated knowledge in existing databases. Our results suggest that databases can be significantly enriched by text mining and that the combined information can greatly assist human efforts in evaluating/prioritizing variants in genomic research. Availability and implementation:The tmVar 2.0 source code and corpus are freely available at https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/tmvar/. Contact:email@example.com.
Project description:We developed an insomnia classification algorithm by interrogating an electronic medical records (EMR) database of 314,292 patients. The patients received care at Massachusetts General Hospital (MGH), Brigham and Women's Hospital (BWH), or both, between 1992 and 2010. Our algorithm combined structured variables (such as International Classification of Diseases 9th Revision [ICD-9] codes, prescriptions, laboratory observations) and unstructured variables (such as text mentions of sleep and psychiatric disorders in clinical narrative notes). The highest classification performance of our algorithm was achieved when it included a combination of structured variables (billing codes for insomnia, common psychiatric conditions, and joint disorders) and unstructured variables (sleep disorders and psychiatric disorders). Our algorithm had superior performance in identifying insomnia patients compared to billing codes alone (area under the receiver operating characteristic curve [AUROC]?=?0.83 vs. 0.55 with 95% confidence intervals [CI] of 0.76-0.90 and 0.51-0.58, respectively). When applied to the 314,292-patient population, our algorithm classified 36,810 of the patients with insomnia, of which less than 17% had a billing code for insomnia. In conclusion, an insomnia classification algorithm that incorporates clinical notes is superior to one based solely on billing codes. Compared to traditional methods, our study demonstrates that a classification algorithm that incorporates physician notes can more accurately, comprehensively, and quickly identify large cohorts of insomnia patients.
Project description:A full-text search engine can be a useful tool for augmenting the reuse value of unstructured narrative data stored in electronic health records (EHR). A prominent barrier to the effective utilization of such tools originates from users' lack of search expertise and/or medical-domain knowledge. To mitigate the issue, the authors experimented with a 'collaborative search' feature through a homegrown EHR search engine that allows users to preserve their search knowledge and share it with others. This feature was inspired by the success of many social information-foraging techniques used on the web that leverage users' collective wisdom to improve the quality and efficiency of information retrieval.The authors conducted an empirical evaluation study over a 4-year period. The user sample consisted of 451 academic researchers, medical practitioners, and hospital administrators. The data were analyzed using a social-network analysis to delineate the structure of the user collaboration networks that mediated the diffusion of knowledge of search.The users embraced the concept with considerable enthusiasm. About half of the EHR searches processed by the system (0.44 million) were based on stored search knowledge; 0.16 million utilized shared knowledge made available by other users. The social-network analysis results also suggest that the user-collaboration networks engendered by the collaborative search feature played an instrumental role in enabling the transfer of search knowledge across people and domains.Applying collaborative search, a social information-foraging technique popularly used on the web, may provide the potential to improve the quality and efficiency of information retrieval in healthcare.
Project description:<h4>Background</h4>Tuberculosis (TB) is a major cause of death worldwide. TB research draws heavily on clinical cohorts which can be generated using electronic health records (EHR), but granular information extracted from unstructured EHR data is limited. The St. Michael's Hospital TB database (SMH-TB) was established to address gaps in EHR-derived TB clinical cohorts and provide researchers and clinicians with detailed, granular data related to TB management and treatment.<h4>Methods</h4>We collected and validated multiple layers of EHR data from the TB outpatient clinic at St. Michael's Hospital, Toronto, Ontario, Canada to generate the SMH-TB database. SMH-TB contains structured data directly from the EHR, and variables generated using natural language processing (NLP) by extracting relevant information from free-text within clinic, radiology, and other notes. NLP performance was assessed using recall, precision and F1 score averaged across variable labels. We present characteristics of the cohort population using binomial proportions and 95% confidence intervals (CI), with and without adjusting for NLP misclassification errors.<h4>Results</h4>SMH-TB currently contains retrospective patient data spanning 2011 to 2018, for a total of 3298 patients (N = 3237 with at least 1 associated dictation). Performance of TB diagnosis and medication NLP rulesets surpasses 93% in recall, precision and F1 metrics, indicating good generalizability. We estimated 20% (95% CI: 18.4-21.2%) were diagnosed with active TB and 46% (95% CI: 43.8-47.2%) were diagnosed with latent TB. After adjusting for potential misclassification, the proportion of patients diagnosed with active and latent TB was 18% (95% CI: 16.8-19.7%) and 40% (95% CI: 37.8-41.6%) respectively.<h4>Conclusion</h4>SMH-TB is a unique database that includes a breadth of structured data derived from structured and unstructured EHR data by using NLP rulesets. The data are available for a variety of research applications, such as clinical epidemiology, quality improvement and mathematical modeling studies.
Project description:Electronic health records (EHRs), originally designed to facilitate health care delivery, are becoming a valuable data source for health research. EHR systems have two components, both of which have various components, and points of data entry, management, and analysis. The "front end" refers to where the data are entered, primarily by healthcare workers (e.g. physicians and nurses). The second component of EHR systems is the electronic data warehouse, or "back-end," where the data are stored in a relational database. EHR data elements can be of many types, which can be categorized as structured, unstructured free-text, and imaging data. The Sunrise Clinical Manager (SCM) EHR is one example of an inpatient EHR system, which covers the city of Calgary (Alberta, Canada). This system, under the management of Alberta Health Services, is now being explored for research use. The purpose of the present paper is to describe the SCM EHR for research purposes, showing how this generalizes to EHRs in general. We further discuss advantages, challenges (e.g. potential bias and data quality issues), analytical capacities, and requirements associated with using EHRs in a health research context.
Project description:Testing potential drug treatments in animal disease models is a decisive step of all preclinical drug discovery programs. Yet, despite the importance of such experiments for translational medicine, there have been relatively few efforts to comprehensively and consistently analyze the data produced by in vivo bioassays. This is partly due to their complexity and lack of accepted reporting standards-publicly available animal screening data are only accessible in unstructured free-text format, which hinders computational analysis. In this study, we use text mining to extract information from the descriptions of over 100,000 drug screening-related assays in rats and mice. We retrieve our dataset from ChEMBL-an open-source literature-based database focused on preclinical drug discovery. We show that in vivo assay descriptions can be effectively mined for relevant information, including experimental factors that might influence the outcome and reproducibility of animal research: genetic strains, experimental treatments, and phenotypic readouts used in the experiments. We further systematize extracted information using unsupervised language model (Word2Vec), which learns semantic similarities between terms and phrases, allowing identification of related animal models and classification of entire assay descriptions. In addition, we show that random forest models trained on features generated by Word2Vec can predict the class of drugs tested in different in vivo assays with high accuracy. Finally, we combine information mined from text with curated annotations stored in ChEMBL to investigate the patterns of usage of different animal models across a range of experiments, drug classes, and disease areas.
Project description:OBJECTIVE:Electronic health record (EHR) systems contain structured data (such as diagnostic codes) and unstructured data (clinical documentation). Clinical insights can be derived from analyzing both. The use of natural language processing (NLP) algorithms to effectively analyze unstructured data has been well demonstrated. Here we examine the utility of NLP for the identification of patients with non-alcoholic fatty liver disease, assess patterns of disease progression, and identify gaps in care related to breakdown in communication among providers. MATERIALS AND METHODS:All clinical notes available on the 38,575 patients enrolled in the Mount Sinai BioMe cohort were loaded into the NLP system. We compared analysis of structured and unstructured EHR data using NLP, free-text search, and diagnostic codes with validation against expert adjudication. We then used the NLP findings to measure physician impression of progression from early-stage NAFLD to NASH or cirrhosis. Similarly, we used the same NLP findings to identify mentions of NAFLD in radiology reports that did not persist into clinical notes. RESULTS:Out of 38,575 patients, we identified 2,281 patients with NAFLD. From the remainder, 10,653 patients with similar data density were selected as a control group. NLP outperformed ICD and text search in both sensitivity (NLP: 0.93, ICD: 0.28, text search: 0.81) and F2 score (NLP: 0.92, ICD: 0.34, text search: 0.81). Of 2281 NAFLD patients, 673 (29.5%) were believed to have progressed to NASH or cirrhosis. Among 176 where NAFLD was noted prior to NASH, the average progression time was 410 days. 619 (27.1%) NAFLD patients had it documented only in radiology notes and not acknowledged in other forms of clinical documentation. Of these, 170 (28.4%) were later identified as having likely developed NASH or cirrhosis after a median 1057.3 days. DISCUSSION:NLP-based approaches were more accurate at identifying NAFLD within the EHR than ICD/text search-based approaches. Suspected NAFLD on imaging is often not acknowledged in subsequent clinical documentation. Many such patients are later found to have more advanced liver disease. Analysis of information flows demonstrated loss of key information that could have been used to help prevent the progression of early NAFLD (NAFL) to NASH or cirrhosis. CONCLUSION:For identification of NAFLD, NLP performed better than alternative selection modalities. It then facilitated analysis of knowledge flow between physician and enabled the identification of breakdowns where key information was lost that could have slowed or prevented later disease progression.
Project description:BACKGROUND:Most US health care providers have adopted electronic health records (EHRs) that facilitate the uniform collection of clinical information. However, standardized data formats to capture social and behavioral determinants of health (SBDH) in structured EHR fields are still evolving and not adopted widely. Consequently, at the point of care, SBDH data are often documented within unstructured EHR fields that require time-consuming and subjective methods to retrieve. Meanwhile, collecting SBDH data using traditional surveys on a large sample of patients is infeasible for health care providers attempting to rapidly incorporate SBDH data in their population health management efforts. A potential approach to facilitate targeted SBDH data collection is applying information extraction methods to EHR data to prescreen the population for identification of immediate social needs. OBJECTIVE:Our aim was to examine the availability and characteristics of SBDH data captured in the EHR of a multilevel academic health care system that provides both inpatient and outpatient care to patients with varying SBDH across Maryland. METHODS:We measured the availability of selected patient-level SBDH in both structured and unstructured EHR data. We assessed various SBDH including demographics, preferred language, alcohol use, smoking status, social connection and/or isolation, housing issues, financial resource strains, and availability of a home address. EHR's structured data were represented by information collected between January 2003 and June 2018 from 5,401,324 patients. EHR's unstructured data represented information captured for 1,188,202 patients between July 2016 and May 2018 (a shorter time frame because of limited availability of consistent unstructured data). We used text-mining techniques to extract a subset of SBDH factors from EHR's unstructured data. RESULTS:We identified a valid address or zip code for 5.2 million (95.00%) of approximately 5.4 million patients. Ethnicity was captured for 2.7 million (50.00%), whereas race was documented for 4.9 million (90.00%) and a preferred language for 2.7 million (49.00%) patients. Information regarding alcohol use and smoking status was coded for 490,348 (9.08%) and 1,728,749 (32.01%) patients, respectively. Using the International Classification of Diseases-10th Revision diagnoses codes, we identified 35,171 (0.65%) patients with information related to social connection/isolation, 10,433 (0.19%) patients with housing issues, and 3543 (0.07%) patients with income/financial resource strain. Of approximately 1.2 million unique patients with unstructured data, 30,893 (2.60%) had at least one clinical note containing phrases referring to social connection/isolation, 35,646 (3.00%) included housing issues, and 11,882 (1.00%) had mentions of financial resource strain. CONCLUSIONS:Apart from demographics, SBDH data are not regularly collected for patients. Health care providers should assess the availability and characteristics of SBDH data in EHRs. Evaluating the quality of SBDH data can potentially enable health care providers to modify underlying workflows to improve the documentation, collection, and extraction of SBDH data from EHRs.