Mining the pharmacogenomics literature--a survey of the state of the art.
ABSTRACT: This article surveys efforts on text mining of the pharmacogenomics literature, mainly from the period 2008 to 2011. Pharmacogenomics (or pharmacogenetics) is the field that studies how human genetic variation impacts drug response. Therefore, publications span the intersection of research in genotypes, phenotypes and pharmacology, a topic that has increasingly become a focus of active research in recent years. This survey covers efforts dealing with the automatic recognition of relevant named entities (e.g. genes, gene variants and proteins, diseases and other pathological phenomena, drugs and other chemicals relevant for medical treatment), as well as various forms of relations between them. A wide range of text genres is considered, such as scientific publications (abstracts, as well as full texts), patent texts and clinical narratives. We also discuss infrastructure and resources needed for advanced text analytics, e.g. document corpora annotated with corresponding semantic metadata (gold standards and training data), biomedical terminologies and ontologies providing domain-specific background knowledge at different levels of formality and specificity, software architectures for building complex and scalable text analytics pipelines and Web services grounded to them, as well as comprehensive ways to disseminate and interact with the typically huge amounts of semiformal knowledge structures extracted by text mining tools. Finally, we consider some of the novel applications that have already been developed in the field of pharmacogenomic text mining and point out perspectives for future research.
Project description:Text mining in biomedical literature is an emerging field which has already been shown to have a variety of implementations in many research areas, including genetics, personalized medicine, and pharmacogenomics. In this study, we describe a novel text-mining approach for the extraction of pharmacogenomics associations. The code that was used toward this end was implemented using R programming language, either through custom scripts, where needed, or through utilizing functions from existing libraries. Articles (abstracts or full texts) that correspond to a specified query were extracted from PubMed, while concept annotations were derived by PubTator Central. Terms that denote a Mutation or a Gene as well as Chemical compound terms corresponding to drug compounds were normalized and the sentences containing the aforementioned terms were filtered and preprocessed to create appropriate training sets. Finally, after training and adequate hyperparameter tuning, four text classifiers were created and evaluated (FastText, Linear kernel SVMs, XGBoost, Lasso, and Elastic-Net Regularized Generalized Linear Models) with regard to their performance in identifying pharmacogenomics associations. Although further improvements are essential toward proper implementation of this text-mining approach in the clinical practice, our study stands as a comprehensive, simplified, and up-to-date approach for the identification and assessment of research articles enriched in clinically relevant pharmacogenomics relationships. Furthermore, this work highlights a series of challenges concerning the effective application of text mining in biomedical literature, whose resolution could substantially contribute to the further development of this field.
Project description:The biomedical literature holds our understanding of pharmacogenomics, but it is dispersed across many journals. In order to integrate our knowledge, connect important facts across publications and generate new hypotheses we must organize and encode the contents of the literature. By creating databases of structured pharmocogenomic knowledge, we can make the value of the literature much greater than the sum of the individual reports. We can, for example, generate candidate gene lists or interpret surprising hits in genome-wide association studies. Text mining automatically adds structure to the unstructured knowledge embedded in millions of publications, and recent years have seen a surge in work on biomedical text mining, some specific to pharmacogenomics literature. These methods enable extraction of specific types of information and can also provide answers to general, systemic queries. In this article, we describe the main tasks of text mining in the context of pharmacogenomics, summarize recent applications and anticipate the next phase of text mining applications.
Project description:There have been repeated initiatives to produce standard nosologies and terminologies for cutaneous disease, some dedicated to the domain and some part of bigger terminologies such as ICD-10. Recently, formally structured terminologies, ontologies, have been widely developed in many areas of biomedical research. Primarily, these address the aim of providing comprehensive working terminologies for domains of knowledge, but because of the knowledge contained in the relationships between terms they can also be used computationally for many purposes.We have developed an ontology of cutaneous disease, constructed manually by domain experts. With more than 3000 terms, DermO represents the most comprehensive formal dermatological disease terminology available. The disease entities are categorized in 20 upper level terms, which use a variety of features such as anatomical location, heritability, affected cell or tissue type, or etiology, as the features for classification, in line with professional practice and nosology in dermatology. Available in OBO flatfile and OWL 2 formats, it is integrated semantically with other ontologies and terminologies describing diseases and phenotypes. We demonstrate the application of DermO to text mining the biomedical literature and in the creation of a network describing the phenotypic relationships between cutaneous diseases.DermO is an ontology with broad coverage of the domain of dermatologic disease and we demonstrate here its utility for text mining and investigation of phenotypic relationships between dermatologic disorders. We envision that in the future it may be applied to the creation and mining of electronic health records, clinical training and basic research, as it supports automated inference and reasoning, and for the broader integration of skin disease information with that from other domains.
Project description:BACKGROUND:The subcellular localization of a protein is an important aspect of its function. However, the experimental annotation of locations is not even complete for well-studied model organisms. Text mining might aid database curators to add experimental annotations from the scientific literature. Existing extraction methods have difficulties to distinguish relationships between proteins and cellular locations co-mentioned in the same sentence. RESULTS:LocText was created as a new method to extract protein locations from abstracts and full texts. LocText learned patterns from syntax parse trees and was trained and evaluated on a newly improved LocTextCorpus. Combined with an automatic named-entity recognizer, LocText achieved high precision (P = 86%±4). After completing development, we mined the latest research publications for three organisms: human (Homo sapiens), budding yeast (Saccharomyces cerevisiae), and thale cress (Arabidopsis thaliana). Examining 60 novel, text-mined annotations, we found that 65% (human), 85% (yeast), and 80% (cress) were correct. Of all validated annotations, 40% were completely novel, i.e. did neither appear in the annotations nor the text descriptions of Swiss-Prot. CONCLUSIONS:LocText provides a cost-effective, semi-automated workflow to assist database curators in identifying novel protein localization annotations. The annotations suggested through text-mining would be verified by experts to guarantee high-quality standards of manually-curated databases such as Swiss-Prot.
Project description:Synonymous relationships among biomedical terms are extensively annotated within specialized terminologies, implying that synonymy is important for practical computational applications within this field. It remains unclear, however, whether text mining actually benefits from documented synonymy and whether existing biomedical thesauri provide adequate coverage of these linguistic relationships. In this study, we examine the impact and extent of undocumented synonymy within a very large compendium of biomedical thesauri. First, we demonstrate that missing synonymy has a significant negative impact on named entity normalization, an important problem within the field of biomedical text mining. To estimate the amount synonymy currently missing from thesauri, we develop a probabilistic model for the construction of synonym terminologies that is capable of handling a wide range of potential biases, and we evaluate its performance using the broader domain of near-synonymy among general English words. Our model predicts that over 90% of these relationships are currently undocumented, a result that we support experimentally through "crowd-sourcing." Finally, we apply our model to biomedical terminologies and predict that they are missing the vast majority (>90%) of the synonymous relationships they intend to document. Overall, our results expose the dramatic incompleteness of current biomedical thesauri and suggest the need for "next-generation," high-coverage lexical terminologies.
Project description:Through advances in neural language modeling, it has become possible to generate artificial texts in a variety of genres and styles. While the semantic coherence of such texts should not be over-estimated, the grammatical correctness and stylistic qualities of these artificial texts are at times remarkably convincing. In this paper, we report a study into crowd-sourced authenticity judgments for such artificially generated texts. As a case study, we have turned to rap lyrics, an established sub-genre of present-day popular music, known for its explicit content and unique rhythmical delivery of lyrics. The empirical basis of our study is an experiment carried out in the context of a large, mainstream contemporary music festival in the Netherlands. Apart from more generic factors, we model a diverse set of linguistic characteristics of the input that might have functioned as authenticity cues. It is shown that participants are only marginally capable of distinguishing between authentic and generated materials. By scrutinizing the linguistic features that influence the participants' authenticity judgments, it is shown that linguistic properties such as 'syntactic complexity', 'lexical diversity' and 'rhyme density' add to the user's perception of texts being authentic. This research contributes to the improvement of the quality and credibility of generated text. Additionally, it enhances our understanding of the perception of authentic and artificial art.
Project description:Biomedical text mining promises to assist biologists in quickly navigating the combined knowledge in their domain. This would allow improved understanding of the complex interactions within biological systems and faster hypothesis generation. New biomedical research articles are published daily and text mining tools are only as good as the corpus from which they work. Many text mining tools are underused because their results are static and do not reflect the constantly expanding knowledge in the field. In order for biomedical text mining to become an indispensable tool used by researchers, this problem must be addressed. To this end, we present PubRunner, a framework for regularly running text mining tools on the latest publications. PubRunner is lightweight, simple to use, and can be integrated with an existing text mining tool. The workflow involves downloading the latest abstracts from PubMed, executing a user-defined tool, pushing the resulting data to a public FTP or Zenodo dataset, and publicizing the location of these results on the public PubRunner website. We illustrate the use of this tool by re-running the commonly used word2vec tool on the latest PubMed abstracts to generate up-to-date word vector representations for the biomedical domain. This shows a proof of concept that we hope will encourage text mining developers to build tools that truly will aid biologists in exploring the latest publications.
Project description:BACKGROUND:Health experts including planners and policy-makers face complex decisions in diverse and constantly changing healthcare systems. Visual analytics may play a critical role in supporting analysis of complex healthcare data and decision-making. The purpose of this study was to examine the real-world experience that experts in mental healthcare planning have with visual analytics tools, investigate how well current visualisation techniques meet their needs, and suggest priorities for the future development of visual analytics tools of practical benefit to mental healthcare policy and decision-making. METHODS:Health expert experience was assessed by an online exploratory survey consisting of a mix of multiple choice and open-ended questions. Health experts were sampled from an international pool of policy-makers, health agency directors, and researchers with extensive and direct experience of using visual analytics tools for complex mental healthcare systems planning. We invited them to the survey, and the experts' responses were analysed using statistical and text mining approaches. RESULTS:The forty respondents who took part in the study recognised the complexity of healthcare systems data, but had most experience with and preference for relatively simple and familiar visualisations such as bar charts, scatter plots, and geographical maps. Sixty-five percent rated visual analytics as important to their field for evidence-informed decision-making processes. Fifty-five percent indicated that more advanced visual analytics tools were needed for their data analysis, and 67.5% stated their willingness to learn new tools. This was reflected in text mining and qualitative synthesis of open-ended responses. CONCLUSIONS:This exploratory research provides readers with the first self-report insight into expert experience with visual analytics in mental healthcare systems research and policy. In spite of the awareness of their importance for complex healthcare planning, the majority of experts use simple, readily available visualisation tools. We conclude that co-creation and co-development strategies will be required to support advanced visual analytics tools and skills, which will become essential in the future of healthcare.
Project description:To develop an adaptive approach to mine frequent semantic tags (FSTs) from heterogeneous clinical research texts.We develop a "plug-n-play" framework that integrates replaceable unsupervised kernel algorithms with formatting, functional, and utility wrappers for FST mining. Temporal information identification and semantic equivalence detection were two example functional wrappers. We first compared this approach's recall and efficiency for mining FSTs from ClinicalTrials.gov to that of a recently published tag-mining algorithm. Then we assessed this approach's adaptability to two other types of clinical research texts: clinical data requests and clinical trial protocols, by comparing the prevalence trends of FSTs across three texts.Our approach increased the average recall and speed by 12.8% and 47.02% respectively upon the baseline when mining FSTs from ClinicalTrials.gov, and maintained an overlap in relevant FSTs with the base- line ranging between 76.9% and 100% for varying FST frequency thresholds. The FSTs saturated when the data size reached 200 documents. Consistent trends in the prevalence of FST were observed across the three texts as the data size or frequency threshold changed.This paper contributes an adaptive tag-mining framework that is scalable and adaptable without sacrificing its recall. This component-based architectural design can be potentially generalizable to improve the adaptability of other clinical text mining methods.
Project description:BACKGROUND:Mental illness is quickly becoming one of the most prevalent public health problems worldwide. Social network platforms, where users can express their emotions, feelings, and thoughts, are a valuable source of data for researching mental health, and techniques based on machine learning are increasingly used for this purpose. OBJECTIVE:The objective of this review was to explore the scope and limits of cutting-edge techniques that researchers are using for predictive analytics in mental health and to review associated issues, such as ethical concerns, in this area of research. METHODS:We performed a systematic literature review in March 2017, using keywords to search articles on data mining of social network data in the context of common mental health disorders, published between 2010 and March 8, 2017 in medical and computer science journals. RESULTS:The initial search returned a total of 5386 articles. Following a careful analysis of the titles, abstracts, and main texts, we selected 48 articles for review. We coded the articles according to key characteristics, techniques used for data collection, data preprocessing, feature extraction, feature selection, model construction, and model verification. The most common analytical method was text analysis, with several studies using different flavors of image analysis and social interaction graph analysis. CONCLUSIONS:Despite an increasing number of studies investigating mental health issues using social network data, some common problems persist. Assembling large, high-quality datasets of social media users with mental disorder is problematic, not only due to biases associated with the collection methods, but also with regard to managing consent and selecting appropriate analytics techniques.