Extracting diagnoses and investigation results from unstructured text in electronic health records by semi-supervised machine learning.
ABSTRACT: BACKGROUND: Electronic health records are invaluable for medical research, but much of the information is recorded as unstructured free text which is time-consuming to review manually. AIM: To develop an algorithm to identify relevant free texts automatically based on labelled examples. METHODS: We developed a novel machine learning algorithm, the 'Semi-supervised Set Covering Machine' (S3CM), and tested its ability to detect the presence of coronary angiogram results and ovarian cancer diagnoses in free text in the General Practice Research Database. For training the algorithm, we used texts classified as positive and negative according to their associated Read diagnostic codes, rather than by manual annotation. We evaluated the precision (positive predictive value) and recall (sensitivity) of S3CM in classifying unlabelled texts against the gold standard of manual review. We compared the performance of S3CM with the Transductive Support Vector Machine (TSVM), the original fully-supervised Set Covering Machine (SCM) and our 'Freetext Matching Algorithm' natural language processor. RESULTS: Only 60% of texts with Read codes for angiogram actually contained angiogram results. However, the S3CM algorithm achieved 87% recall with 64% precision in detecting coronary angiogram results, outperforming the fully-supervised SCM (recall 78%, precision 60%) and the TSVM (recall 2%, precision 3%). For ovarian cancer diagnoses, S3CM had higher recall (86%) than the other algorithms tested. The Freetext Matching Algorithm had better precision than S3CM (85% versus 74%) but lower recall (62%). CONCLUSIONS: Our novel S3CM machine learning algorithm effectively detected free texts in primary care records associated with angiogram results and ovarian cancer diagnoses, after training on automatically pre-classified texts. It should be easy to adapt to other disease areas as it does not rely on linguistic rules, but it needs further testing in other electronic health record datasets.
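The S3CM algorithm itself is not reproduced here; as a rough, minimal sketch of the semi-supervised setup described above (code-derived labels for some texts, unlabelled texts folded into training, evaluation by precision and recall against manual review), the snippet below uses scikit-learn's SelfTrainingClassifier as a stand-in. The texts, labels and gold standard are hypothetical.

```python
# Minimal sketch of the semi-supervised text-classification setup described above.
# NOTE: SelfTrainingClassifier is a stand-in; it is NOT the S3CM algorithm.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.metrics import precision_score, recall_score

# Hypothetical data: texts labelled 1/0 via Read codes, plus unlabelled texts (-1).
texts = ["coronary angiogram showed severe LAD stenosis",
         "routine blood pressure check, no concerns",
         "angiography report: normal coronary arteries",
         "patient requests repeat prescription"]
labels = np.array([1, 0, -1, -1])          # -1 marks unlabelled examples

X = TfidfVectorizer().fit_transform(texts)
clf = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, labels)

# Evaluate against a manually reviewed gold standard (hypothetical here).
gold = np.array([1, 0, 1, 0])
pred = clf.predict(X)
print("precision:", precision_score(gold, pred), "recall:", recall_score(gold, pred))
```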
Project description: Electronic health records are invaluable for medical research, but much information is stored as free text rather than in a coded form. For example, in the UK General Practice Research Database (GPRD), causes of death and test results are sometimes recorded only in free text. Free text can be difficult to use for research if it requires time-consuming manual review. Our aim was to develop an automated method for extracting coded information from free text in electronic patient records. We reviewed the electronic patient records in GPRD of a random sample of 3310 patients who died in 2001, to identify the cause of death. We developed a computer program called the Freetext Matching Algorithm (FMA) to map diagnoses in text to the Read Clinical Terminology. The program uses lookup tables of synonyms and phrase patterns to identify diagnoses, dates and selected test results. We tested it on two random samples of free text from GPRD (1000 texts associated with death in 2001, and 1000 general texts from cases and controls in a coronary artery disease study), comparing the output to the U.S. National Library of Medicine's MetaMap program and the gold standard of manual review. Among 3310 patients registered in the GPRD who died in 2001, the cause of death was recorded in coded form in 38.1% of patients, and in the free text alone in 19.4%. On the 1000 texts associated with death, FMA coded 683 of the 735 positive diagnoses, with precision (positive predictive value) 98.4% (95% confidence interval (CI) 97.2, 99.2) and recall (sensitivity) 92.9% (95% CI 90.8, 94.7). On the general sample, FMA detected 346 of the 447 positive diagnoses, with precision 91.5% (95% CI 88.3, 94.1) and recall 77.4% (95% CI 73.2, 81.2), which was similar to MetaMap. We have developed an algorithm to extract coded information from free text in GP records with good precision. It may facilitate research using free text in electronic patient records, particularly for extracting the cause of death.
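A minimal sketch of the lookup-table idea behind such a free-text matching step is shown below, assuming a hypothetical synonym-to-Read-code table; the real FMA uses far richer phrase patterns and also extracts dates and test results.

```python
# Illustrative sketch of synonym/phrase-pattern lookup mapping free text to codes.
# The patterns and Read codes below are hypothetical examples, not the FMA tables.
import re

SYNONYMS = {
    r"\bmyocardial infarction\b|\bheart attack\b": "G30..",   # hypothetical mapping
    r"\bovarian (cancer|carcinoma)\b": "B440.",
    r"\bstroke\b|\bcva\b": "G66..",
}

def match_diagnoses(text):
    """Return the codes whose synonym patterns appear in the text."""
    found = []
    for pattern, code in SYNONYMS.items():
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(code)
    return found

print(match_diagnoses("Cause of death: acute myocardial infarction"))  # ['G30..']
```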
Project description: This paper presents Lancet, a supervised machine-learning system that automatically extracts medication events consisting of medication names and information pertaining to their prescribed use (dosage, mode, frequency, duration and reason) from lists or narrative text in medical discharge summaries. Lancet incorporates three supervised machine-learning models: a conditional random fields model for tagging individual medication names and associated fields, an AdaBoost model with the decision stump algorithm for determining which medication names and fields belong to a single medication event, and a support vector machine disambiguation model for identifying the context style (narrative or list). The authors, from the University of Wisconsin-Milwaukee, participated in the third i2b2 shared task for challenges in natural language processing for clinical data: the medication extraction challenge. Using the performance metrics provided by the i2b2 challenge, micro F1 (precision/recall) scores are reported at both the horizontal and vertical levels. Among the top 10 teams, Lancet achieved the highest precision at 90.4% with an overall F1 score of 76.4% (horizontal system level with exact match), a gain of 11.2% and 12%, respectively, compared with the rule-based baseline system jMerki. By combining the two systems, the hybrid system further increased the F1 score by 3.4%, from 76.4% to 79.0%. Supervised machine-learning systems with minimal external knowledge resources can achieve high precision with a competitive overall F1 score. Lancet, based on this learning framework, does not rely on expensive manually curated rules. The system is available online at http://code.google.com/p/lancet/.
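For orientation, a sketch of how the three supervised components might be instantiated with common Python libraries (sklearn-crfsuite and scikit-learn) is shown below; the parameter choices are illustrative assumptions, not those of Lancet.

```python
# Sketch of the three supervised components, using common libraries as stand-ins.
import sklearn_crfsuite                            # CRF for token-level tagging
from sklearn.ensemble import AdaBoostClassifier    # event-grouping model
from sklearn.svm import SVC                        # narrative-vs-list disambiguation

# 1) CRF tagger: assigns labels such as B-MED, B-DOSE, O to each token.
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)

# 2) AdaBoost: decides whether tagged fields belong to the same medication event.
#    The default base learner in scikit-learn is a depth-1 decision stump.
grouper = AdaBoostClassifier(n_estimators=200)

# 3) SVM: classifies the surrounding context as narrative text or a list.
context_clf = SVC(kernel="linear")
```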
Project description: OBJECTIVE: 1) To develop a machine learning approach for detecting stroke cases and subtypes from hospitalization data; 2) to assess algorithm performance and predictors on real-world data collected by a large-scale epidemiology study in the US; and 3) to identify directions for future development of high-precision stroke phenotypic signatures. MATERIALS AND METHODS: We utilized 8,131 hospitalization events (ICD-9 codes 430-438) collected from the Greater Cincinnati/Northern Kentucky Stroke Study in 2005 and 2010. Detailed information from patients' medical records was abstracted for each event by trained research nurses. By analyzing the broad list of demographic and clinical variables, the machine learning algorithms predicted whether an event was a stroke case and, if so, the stroke subtype. The performance was validated on gold-standard labels adjudicated by stroke physicians, and results were compared with stroke classifications based on ICD-9 discharge codes, as well as labels determined by study nurses. RESULTS: The best performing machine learning algorithm achieved 88.57% accuracy, 93.81% precision, 92.80% recall, 93.30% F-measure, 89.84% area under the ROC curve and 98.01% area under the precision-recall curve on stroke case detection. For detecting stroke subtypes, the algorithm yielded an overall accuracy of 87.39% and greater than 85% precision on individual subtypes. The machine learning algorithms significantly outperformed the ICD-9 method on all measures (P value < 0.001). Their performance was comparable to that of study nurses, with a better tradeoff between precision and recall. The feature selection uncovered a subset of predictive variables that could facilitate future development of effective stroke phenotyping algorithms. DISCUSSION AND CONCLUSIONS: By analyzing a broad array of patient data, the machine learning technologies hold promise for improving detection of stroke diagnosis, thus unlocking high statistical power for subsequent genetic and genomic studies.
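As an illustration of the metric suite reported above (accuracy, precision, recall, F-measure, and the two areas under curves), here is a minimal sketch for a binary stroke/non-stroke classifier; the synthetic data and the gradient boosting model are placeholders, not the study's variables or algorithm.

```python
# Illustrative computation of the full metric suite for a binary classifier.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)

X, y = make_classification(n_samples=2000, n_features=30, weights=[0.7, 0.3],
                           random_state=0)       # stand-in for abstracted chart variables
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = GradientBoostingClassifier().fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]
pred = (proba >= 0.5).astype(int)

print("accuracy ", accuracy_score(y_te, pred))
print("precision", precision_score(y_te, pred))
print("recall   ", recall_score(y_te, pred))
print("F1       ", f1_score(y_te, pred))
print("ROC AUC  ", roc_auc_score(y_te, proba))
print("PR AUC   ", average_precision_score(y_te, proba))
```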
Project description: Identifying cancer-associated mutations (driver mutations) is critical for understanding the cellular function of the cancer genome that leads to activation of oncogenes or inactivation of tumor suppressor genes. Many approaches have been proposed that use supervised machine learning techniques for prediction, with features obtained from various databases. However, it is often unclear which features are important for driver mutation prediction. In this study, we propose a novel feature selection method (called DX) over a set of 126 candidate features. To obtain the best performance, the rotation forest algorithm was adopted for the experiments. On the training dataset, which was collected from the COSMIC and Swiss-Prot databases, we obtained high prediction performance, with 88.03% accuracy, 93.9% precision, and 81.35% recall when the 11 top-ranked features were used. Comparison with various other techniques on the TP53, EGFR, and Cosmic2plus datasets shows the generality of our method.
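The DX ranking and the rotation forest classifier are not available in standard scikit-learn; the sketch below illustrates the same keep-the-top-k-features-then-classify workflow with mutual-information ranking and a random forest as stand-ins, on synthetic data with 126 candidate features.

```python
# Sketch of ranked feature selection followed by an ensemble classifier (stand-ins).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=126, n_informative=11,
                           random_state=0)        # 126 candidate features, as above

selector = SelectKBest(mutual_info_classif, k=11)  # keep the 11 top-ranked features
X_top = selector.fit_transform(X, y)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X_top, y, cv=5, scoring="accuracy").mean())
```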
Project description: Relation extraction in biomedical text mining systems has largely focused on identifying clause-level relations, but increasing sophistication demands the recognition of relations at the discourse level. A first step in identifying discourse relations involves the detection of discourse connectives: words or phrases used in text to express discourse relations. In this study, supervised machine-learning approaches were developed and evaluated for automatically identifying discourse connectives in biomedical text. Two supervised machine-learning models (support vector machines and conditional random fields) were explored for identifying discourse connectives in biomedical literature. In-domain supervised machine-learning classifiers were trained on the Biomedical Discourse Relation Bank, an annotated corpus of discourse relations over 24 full-text biomedical articles (~112,000 word tokens), a subset of the GENIA corpus. Novel domain adaptation techniques were also explored to leverage the larger open-domain Penn Discourse Treebank (~1 million word tokens). The models were evaluated using the standard evaluation metrics of precision, recall and F1 scores. Supervised machine-learning approaches can automatically identify discourse connectives in biomedical text, and the novel domain adaptation techniques yielded the best performance: an F1 score of 0.761. A demonstration version of the fully implemented classifier, BioConn, is available at: http://bioconn.askhermes.org.
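A toy sketch of discourse-connective detection framed as token classification is shown below, using a linear SVM over simple lexical context features; the paper's SVM and CRF models used richer features and were trained on the BioDRB and Penn Discourse Treebank corpora.

```python
# Toy token-classification sketch for connective detection (illustrative only).
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

def token_features(tokens, i):
    """Simple lexical context features for token i."""
    return {"word": tokens[i].lower(),
            "prev": tokens[i - 1].lower() if i > 0 else "<s>",
            "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>"}

sent = "The protein was degraded because the inhibitor was absent".split()
labels = ["O", "O", "O", "O", "CONN", "O", "O", "O", "O"]   # 'because' is a connective

X = [token_features(sent, i) for i in range(len(sent))]
clf = make_pipeline(DictVectorizer(), LinearSVC())
clf.fit(X, labels)
print(clf.predict([token_features(sent, 4)]))    # -> ['CONN'] on the training sentence
```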
Project description: BACKGROUND: MEDLINE citations are manually indexed at the U.S. National Library of Medicine (NLM) using as reference the Medical Subject Headings (MeSH) controlled vocabulary. For this task, the human indexers read the full text of the article. Due to the growth of MEDLINE, the NLM Indexing Initiative explores indexing methodologies that can support the task of the indexers. Medical Text Indexer (MTI) is a tool developed by the NLM Indexing Initiative to provide MeSH indexing recommendations to indexers. Currently, the input to MTI is MEDLINE citations, title and abstract only. Previous work has shown that using full text as input to MTI increases recall, but decreases precision sharply. We propose using summaries generated automatically from the full text as the input to MTI for the task of suggesting MeSH headings to indexers. Summaries distill the most salient information from the full text, which might increase the coverage of automatic indexing approaches based on MEDLINE. We hypothesize that if the results were good enough, manual indexers could possibly use automatic summaries instead of the full texts, along with the recommendations of MTI, to speed up the process while maintaining high-quality indexing results. RESULTS: We have generated summaries of different lengths using two different summarizers, and evaluated the MTI indexing on the summaries using different algorithms: MTI, individual MTI components, and machine learning. The results are compared to those of full-text articles and MEDLINE citations. Our results show that automatically generated summaries achieve similar recall but higher precision compared to full-text articles. Compared to MEDLINE citations, summaries achieve higher recall but lower precision. CONCLUSIONS: Our results show that automatic summaries produce better indexing than full-text articles. Summaries produce similar recall to full text but much better precision, which seems to indicate that automatic summaries can efficiently capture the most important contents within the original articles. The combination of MEDLINE citations and automatically generated summaries could improve the recommendations suggested by MTI. On the other hand, indexing performance might be dependent on the MeSH heading being indexed. Summarization techniques could thus be considered as a feature selection algorithm that might have to be tuned individually for each MeSH heading.
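As a rough illustration of the extractive-summarization step (scoring sentences and keeping the top-n as the input passed to an indexer), here is a generic TF-IDF-based sketch; it is not one of the two summarizers evaluated in the paper.

```python
# Generic extractive-summarization sketch: keep the n highest-weighted sentences.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def extractive_summary(sentences, n=3):
    """Return the n highest-scoring sentences, in their original order."""
    tfidf = TfidfVectorizer().fit_transform(sentences)
    scores = np.asarray(tfidf.sum(axis=1)).ravel()   # sentence weight = sum of term weights
    top = sorted(np.argsort(scores)[::-1][:n])
    return [sentences[i] for i in top]

article_sentences = [
    "We studied the role of gene X in cardiac hypertrophy.",
    "Mice lacking gene X developed severe hypertrophy under pressure overload.",
    "Methods included echocardiography and RNA sequencing.",
    "These findings suggest gene X suppresses hypertrophic signalling.",
]
print(extractive_summary(article_sentences, n=2))
```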
Project description: OBJECTIVE: In this work, we introduce a privacy technique for anonymizing clinical notes that guarantees all private health information is secured (including sensitive data, such as family history, that are not adequately covered by current techniques). MATERIALS AND METHODS: We employ a new "random replacement" paradigm (replacing each token in clinical notes with neighboring word vectors from the embedding space) to achieve 100% recall on the removal of sensitive information, unachievable with current "search-and-secure" paradigms. We demonstrate the utility of this paradigm on multiple corpora in a diverse set of classification tasks. RESULTS: We empirically evaluate the effect of our anonymization technique on both upstream and downstream natural language processing tasks to show that our perturbations, while increasing security (i.e., achieving 100% recall on any dataset), do not greatly impact the results of end-to-end machine learning approaches. DISCUSSION: As long as current approaches utilize precision and recall to evaluate deidentification algorithms, there will remain a risk of overlooking sensitive information. Inspired by differential privacy, we sought to make it statistically infeasible to recreate the original data, although at the cost of readability. We hope that this work will serve as a catalyst for further research into alternative deidentification methods that can address current weaknesses. CONCLUSION: Our proposed technique can secure clinical texts at low cost and with extremely high recall, at a readability trade-off, while remaining useful for natural language processing classification tasks. We hope that our work can be used by risk-averse data holders to release clinical texts to researchers.
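A minimal sketch of the "random replacement" idea is shown below, assuming toy Word2Vec embeddings trained on a few example sentences: every token is swapped for a randomly chosen nearest neighbour in the embedding space, so no original token survives. The corpus and parameters are illustrative, not the paper's.

```python
# Minimal sketch of random-replacement anonymization with toy embeddings.
import random
from gensim.models import Word2Vec

corpus = [["patient", "denies", "chest", "pain"],
          ["patient", "reports", "chest", "discomfort"],
          ["family", "history", "of", "breast", "cancer"],
          ["no", "family", "history", "of", "heart", "disease"]]
wv = Word2Vec(corpus, vector_size=25, min_count=1, seed=0).wv

def randomly_replace(tokens, k=3):
    """Replace each in-vocabulary token with a random one of its k nearest neighbours."""
    out = []
    for tok in tokens:
        if tok in wv:
            neighbours = [w for w, _ in wv.most_similar(tok, topn=k)]
            out.append(random.choice(neighbours))
        else:
            out.append("<UNK>")          # out-of-vocabulary tokens are also masked
    return out

print(randomly_replace(["patient", "denies", "chest", "pain"]))
```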
Project description: A drug-drug interaction (DDI) is a change in the effect of a drug when a patient takes another drug. Characterizing DDIs is extremely important to avoid potential adverse drug reactions. We represent DDIs as a complex network in which nodes refer to drugs and links refer to their potential interactions. Recently, the problem of link prediction has attracted much attention in the scientific community. We represent the process of link prediction as a binary classification task on networks of potential DDIs. We use link prediction techniques for predicting unknown interactions between drugs in five arbitrarily chosen large-scale DDI databases, namely DrugBank, KEGG, NDF-RT, SemMedDB, and Twosides. We estimated the performance of link prediction using a series of experiments on DDI networks. We performed link prediction using unsupervised and supervised approaches, the latter including classification tree, k-nearest neighbors, support vector machine, random forest, and gradient boosting machine classifiers based on topological and semantic similarity features. The supervised approach clearly outperforms the unsupervised approach. The Twosides network yielded the best prediction performance with regard to the area under the precision-recall curve (0.93 for both random forests and gradient boosting machines). The applied methodology can be used as a tool to help researchers identify potential DDIs. The supervised link prediction approach proved to be promising for potential DDI prediction and may facilitate the identification of potential DDIs in clinical research.
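The sketch below illustrates supervised link prediction on a toy DDI graph: topological similarity scores for candidate drug pairs (common neighbours, Jaccard coefficient, preferential attachment) become features for a random forest. The drugs, edges and labels are hypothetical stand-ins for the databases named above.

```python
# Supervised link prediction over a toy drug-drug interaction network.
import networkx as nx
from sklearn.ensemble import RandomForestClassifier

G = nx.Graph([("warfarin", "aspirin"), ("aspirin", "ibuprofen"),
              ("warfarin", "amiodarone"), ("simvastatin", "amiodarone"),
              ("simvastatin", "clarithromycin")])

def pair_features(g, pairs):
    """Common-neighbour count, Jaccard and preferential-attachment scores per pair."""
    jac = {(u, v): s for u, v, s in nx.jaccard_coefficient(g, pairs)}
    pa = {(u, v): s for u, v, s in nx.preferential_attachment(g, pairs)}
    return [[len(list(nx.common_neighbors(g, u, v))), jac[(u, v)], pa[(u, v)]]
            for u, v in pairs]

train_pairs = [("warfarin", "ibuprofen"), ("simvastatin", "warfarin"),
               ("aspirin", "clarithromycin"), ("ibuprofen", "amiodarone")]
train_labels = [1, 0, 0, 1]          # hypothetical known / absent interactions

clf = RandomForestClassifier(random_state=0).fit(pair_features(G, train_pairs), train_labels)
print(clf.predict_proba(pair_features(G, [("aspirin", "amiodarone")]))[:, 1])
```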
Project description: Clinical narratives are a valuable source of information for both patient care and biomedical research. Given the unstructured nature of medical reports, specific automatic techniques are required to extract relevant entities from such texts. In the natural language processing (NLP) community, this task is often addressed by using supervised methods. To develop such methods, both reliably annotated corpora and elaborately designed features are needed. Despite the recent advances in corpora collection and annotation, research on multiple domains and languages is still limited. In addition, to compute the features required for supervised classification, suitable language- and domain-specific tools are needed. In this work, we propose a novel application of recurrent neural networks (RNNs) for event extraction from medical reports written in Italian. To train and evaluate the proposed approach, we annotated a corpus of 75 cardiology reports, for a total of 4365 mentions of relevant events and their attributes (e.g., the polarity). For the annotation task, we developed specific annotation guidelines, which are provided together with this paper. The RNN-based classifier was trained on a training set including 3335 events (60 documents). The resulting model was integrated into an NLP pipeline that uses a dictionary lookup approach to search for relevant concepts inside the text. A test set of 1030 events (15 documents) was used to evaluate and compare different pipeline configurations. As a main result, using the RNN-based classifier instead of the dictionary lookup approach increased recall from 52.4% to 88.9% and precision from 81.1% to 88.2%. Further, using the two methods in combination, we obtained a final recall, precision, and F1 score of 91.7%, 88.6%, and 90.1%, respectively. These experiments indicate that integrating a well-performing RNN-based classifier with a standard knowledge-based approach can be a good strategy for extracting information from clinical text in non-English languages.
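For readers unfamiliar with RNN-based token classification, the sketch below shows a compact BiLSTM tagger of the general kind described (per-token event labels over word embeddings); the dimensions, label set and framework choice (PyTorch) are assumptions, not the paper's configuration.

```python
# Compact BiLSTM token-classification sketch (illustrative dimensions and labels).
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, n_labels, emb_dim=64, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_labels)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        h, _ = self.lstm(self.emb(token_ids))     # (batch, seq_len, 2*hidden)
        return self.out(h)                        # per-token label scores

model = BiLSTMTagger(vocab_size=5000, n_labels=3)   # e.g. O / B-EVENT / I-EVENT
tokens = torch.randint(0, 5000, (1, 12))             # one 12-token report sentence
print(model(tokens).shape)                            # torch.Size([1, 12, 3])
```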
Project description: OBJECTIVES: We investigated the usefulness of machine learning artificial intelligence (AI) in classifying the severity of ophthalmic emergencies to support timely hospital visits. STUDY DESIGN: This retrospective study analysed patients who first visited the Armed Forces Daegu Hospital between May and December 2019. General patient information, events and symptoms were the input variables. Events, symptoms, diagnoses and treatments were the output variables. The output variables were classified into four classes (red, orange, yellow and green, indicating immediate to no emergency cases). A class-balanced validation data set of about 200 cases was randomly selected before all training procedures. An ensemble AI model combining fully connected neural networks with the synthetic minority oversampling technique (SMOTE) algorithm was adopted. PARTICIPANTS: A total of 1681 patients were included. MAJOR OUTCOMES: Model performance was evaluated using accuracy, precision, recall and F1 scores. RESULTS: The accuracy of the model was 99.05%. The precision for each class (red, orange, yellow and green) was 100%, 98.10%, 92.73% and 100%, respectively. The recall for each class was 100%, 100%, 98.08% and 95.33%. The F1 scores for each class were 100%, 99.04%, 95.33% and 96.00%. CONCLUSIONS: We provided support for an AI method to classify ophthalmic emergency severity based on symptoms.
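A minimal sketch of combining SMOTE oversampling with a fully connected neural network for a four-class triage problem is given below, using imbalanced-learn and scikit-learn on synthetic data; the study's ensemble architecture and input variables are not reproduced.

```python
# SMOTE oversampling + fully connected network for a four-class problem (sketch).
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=1600, n_features=20, n_classes=4,
                           n_informative=8, weights=[0.55, 0.25, 0.15, 0.05],
                           random_state=0)        # imbalanced stand-in for triage data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

pipe = make_pipeline(SMOTE(random_state=0),
                     MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500,
                                   random_state=0))
pipe.fit(X_tr, y_tr)
print(classification_report(y_te, pipe.predict(X_te)))   # per-class precision/recall/F1
```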