Using PhenX measures to identify opportunities for cross-study analysis.
ABSTRACT: The PhenX Toolkit provides researchers with recommended, well-established, low-burden measures suitable for human subject research. The database of Genotypes and Phenotypes (dbGaP) is the data repository for a variety of studies funded by the National Institutes of Health, including genome-wide association studies. dbGaP requires that investigators provide a data dictionary of study variables as part of the data submission process. Thus, dbGaP is a unique resource that can help investigators identify studies that share the same or similar variables. As a proof of concept, variables from 16 studies deposited in dbGaP were mapped to PhenX measures. Soon, investigators will be able to search dbGaP using PhenX variable identifiers and find comparable and related variables in these 16 studies. To enhance effective data exchange, PhenX measures, protocols, and variables were modeled in Logical Observation Identifiers Names and Codes (LOINC®). PhenX domains and measures are also represented in the Cancer Data Standards Registry and Repository (caDSR). Associating PhenX measures with existing standards (LOINC® and caDSR) and mapping to dbGaP study variables extends the utility of these measures by revealing new opportunities for cross-study analysis.
Project description:BACKGROUND: The purpose of this manuscript is to describe the PhenX RISING network and the site experiences in the implementation of PhenX measures into ongoing population-based genomic studies. METHODS: Eighty PhenX measures were implemented across the seven PhenX RISING groups, thirty-three of which were used at more than two sites, allowing for cross-site collaboration. Each site used between four and 37 individual measures, and five of the sites are validating the PhenX measures through comparison with other study measures. Self-administered and computer-based administration modes are being evaluated at several sites, which required changes to the original PhenX Toolkit protocols. A network-wide data use agreement was developed to facilitate data sharing and collaboration. RESULTS: PhenX Toolkit measures have been collected for more than 17,000 participants across the PhenX RISING network. The process of implementation provided information that was used to improve the PhenX Toolkit. The Toolkit was revised to allow researchers to select self- or interviewer administration when creating the data collection worksheets, and ranges of specimens necessary to run biological assays have been added to the Toolkit. CONCLUSIONS: The PhenX RISING network has demonstrated that the PhenX Toolkit measures can be implemented successfully in ongoing genomic studies. The next step will be to conduct gene/environment studies.
Project description:The National Center for Biotechnology Information has created the dbGaP public repository for individual-level phenotype, exposure, genotype and sequence data and the associations between them. dbGaP assigns stable, unique identifiers to studies and subsets of information from those studies, including documents, individual phenotypic variables, tables of trait data, sets of genotype data, computed phenotype-genotype associations, and groups of study subjects who have given similar consents for use of their data.
Project description:OBJECTIVE: A Working Group (WG) of tobacco regulatory science experts identified measures for the tobacco environment domain. METHODS: This article describes the methods by which measures were identified, selected, approved and placed in the PhenX Toolkit. FINDINGS: The WG identified 20 initial elements relevant to tobacco regulatory science and determined whether they were already in the PhenX Toolkit or whether novel or improved measures existed. In addition to the 10 complementary measures already in the Toolkit, the WG recommended 13 additional measures: aided and confirmed awareness of televised antitobacco advertising, interpersonal communication about tobacco advertising, media use, perceived effectiveness of antitobacco advertising, exposure to smoking on television and in the movies, social norms about tobacco (for adults and for youth), worksite policies, youth cigarette purchase behaviours and experiences, compliance with cigarette packaging and labelling policies, local and state tobacco control public policies, and neighbourhood-level racial/ethnic composition. Supplemental measures included youth social capital and compliance with smoke-free air laws and with point of sale and internet tobacco marketing restrictions. Gaps were identified in the areas of policy environment (public and private), communications environment, community environment and social environment (ie, the norms/acceptability of tobacco use). CONCLUSIONS: Consistent use of these tobacco environment measures will enhance the rigor and reproducibility of tobacco research.
Project description:BACKGROUND: The purpose of this paper is to describe the data collection efforts and validation of PhenX measures in the Personalized Medicine Research Project (PMRP) cohort. METHODS: Thirty-six measures were chosen from the PhenX Toolkit within the following domains: demographics; anthropometrics; alcohol, tobacco and other substances; cardiovascular; environmental exposures; cancer; psychiatric; neurology; and physical activity and physical fitness. Eligibility criteria for the current study were: living PMRP subjects with known addresses who had consented to future contact and were not currently living in a nursing home, and available GWAS data from eMERGE I, for which age-related cataract, HDL, dementia and resistant hypertension were the primary phenotypes. These criteria biased the sample toward the older PMRP participants. The questionnaires were mailed twice. Data from the PhenX measures were compared with information from PMRP questionnaires and data from Marshfield Clinic electronic medical records. RESULTS: Completed PhenX questionnaires were returned by 2271 subjects for a final response rate of 70%. The mean age reported on the PhenX questionnaire (73.1 years) was greater than that on the PMRP questionnaire (64.8 years) because the data were collected at different time points. The mean self-reported weight, and the BMI subsequently calculated from it, were lower on the PhenX survey than the values measured at the time of enrollment into PMRP (PhenX means 173.5 pounds and BMI 28.2 kg/m² versus PMRP 182.9 pounds and BMI 29.6 kg/m²). There was 95.3% agreement between the two questionnaires about having ever smoked at least 100 cigarettes. A total of 139 subjects (6.2%) indicated on the PhenX questionnaire that they had been told they had a stroke. Of these, only 15 (10.8%) had no electronic indication of a prior stroke or TIA.
All of the age- and gender-specific 95% confidence limits around point estimates for major depressive episodes overlap and show that 31% of women aged 50-64 reported symptoms associated with a major depressive episode. CONCLUSIONS: The approach employed resulted in a high response rate and valuable data for future gene/environment analyses. These results and the high response rate highlight the utility of the PhenX Toolkit to collect valid phenotypic data that can be shared across groups to facilitate gene/environment studies.
Project description:Electronic reporting of genetic testing results is increasing, but the results are often represented in diverse formats and naming conventions. Logical Observation Identifiers Names and Codes (LOINC) is a vocabulary standard that provides universal identifiers for laboratory tests and clinical observations. In genetics, LOINC provides codes to improve interoperability in the midst of reporting style transition, including codes for cytogenetic or mutation analysis tests, specific chromosomal alteration or mutation testing, and fully structured discrete genetic test reporting. LOINC terms follow the recommendations and nomenclature of other standards such as the Human Genome Organization Gene Nomenclature Committee's terminology for gene names. In addition to the narrative text they report now, we recommend that laboratories always report as discrete variables chromosome analysis results, genetic variation(s) found, and genetic variation(s) tested for. By adopting and implementing data standards like LOINC, information systems can help care providers and researchers unlock the potential of genetic information for delivering more personalized care.
Project description:OBJECTIVE: To address the problem of mapping local laboratory terminologies to Logical Observation Identifiers Names and Codes (LOINC). To study different ontology matching algorithms and investigate how the probability of term combinations in LOINC helps to increase match quality and reduce manual effort. MATERIALS AND METHODS: We proposed two matching strategies: full name and multi-part. The multi-part approach also considers the occurrence probability of combined concept parts. It can further recommend possible combinations of concept parts to allow more local terms to be mapped. Three real-world laboratory databases from Taiwanese hospitals were used to validate the proposed strategies with respect to different quality measures and execution run time. A comparison with the commonly used tool, Regenstrief LOINC Mapping Assistant (RELMA) Lab Auto Mapper (LAM), was also carried out. RESULTS: The new multi-part strategy yields the best match quality, with F-measure values between 89% and 96%. It can automatically match 70-85% of the laboratory terminologies to LOINC. The recommendation step can further propose mapping to (proposed) LOINC concepts for 9-20% of the local terminology concepts. On average, 91% of the local terminology concepts can be correctly mapped to existing or newly proposed LOINC concepts. CONCLUSIONS: The mapping quality of the multi-part strategy is significantly better than that of LAM. It enables domain experts to perform LOINC matching with little manual work. The probability of term combinations proved to be a valuable strategy for increasing the quality of match results, providing recommendations for proposed LOINC concepts, and decreasing the run time for match processing.
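The abstract above reports match quality as an F-measure but does not say how it was computed. The following is a minimal sketch, assuming the standard definition (the harmonic mean of precision and recall) and a gold-standard mapping prepared by domain experts. The function name `f_measure` and the term and code labels are hypothetical and are not taken from the study.

```python
def f_measure(proposed, gold):
    """F-measure of a proposed term-to-code mapping against a gold standard.

    Both arguments are dicts from a local term to a target code.
    Assumption: a proposed pair is correct only if the same pair
    appears in the gold standard.
    """
    proposed_pairs = set(proposed.items())
    gold_pairs = set(gold.items())
    true_positives = len(proposed_pairs & gold_pairs)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(proposed_pairs)
    recall = true_positives / len(gold_pairs)
    return 2 * precision * recall / (precision + recall)


# Hypothetical local terms and placeholder codes, for illustration only.
gold = {"term_a": "X", "term_b": "Y", "term_c": "Z", "term_d": "W"}
proposed = {"term_a": "X", "term_b": "Y", "term_c": "Q"}
score = f_measure(proposed, gold)
```

Here two of the three proposed pairs are correct (precision 2/3) and two of the four gold pairs are found (recall 1/2), so `score` is 4/7, or about 0.57.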
Project description:Systematic study of clinical phenotypes is important for a better understanding of the genetic basis of human diseases and more effective gene-based disease management. A key aspect of facilitating such studies is the standardized representation of phenotype data using common data elements (CDEs) and controlled biomedical vocabularies. In this study, the authors analyzed how a limited subset of phenotypic data is amenable to common definition and standardized collection, as well as how their adoption in large-scale epidemiological and genome-wide studies can significantly facilitate cross-study analysis. The authors mapped phenotype data dictionaries from five different eMERGE (Electronic Medical Records and Genomics) Network sites studying multiple diseases such as peripheral arterial disease and type 2 diabetes. For mapping, standardized terminological and metadata repository resources, such as the caDSR (Cancer Data Standards Registry and Repository) and SNOMED CT (Systematized Nomenclature of Medicine), were used. The mapping process comprised both lexical (via searching for relevant pre-coordinated concepts and data elements) and semantic (via post-coordination) techniques. Where feasible, new data elements were curated to enhance the coverage during mapping. A web-based application was also developed to uniformly represent and query the mapped data elements from different eMERGE studies. Approximately 60% of the target data elements (95 out of 157) could be mapped using simple lexical analysis techniques on pre-coordinated terms and concepts before any additional curation of terminology and metadata resources was initiated by eMERGE investigators. After curation of 54 new caDSR CDEs and nine new NCI thesaurus concepts and using post-coordination, the authors were able to map the remaining 40% of data elements to caDSR and SNOMED CT.
A web-based tool was also implemented to assist in semi-automatic mapping of data elements. This study emphasizes the requirement for standardized representation of clinical research data using existing metadata and terminology resources and provides simple techniques and software for data element mapping using experiences from the eMERGE Network.
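The lexical step described above can be illustrated with a small sketch. It is not the authors' tool: it assumes that "simple lexical analysis" can be approximated by comparing the set of lowercase word tokens in a local variable name against the labels of standard data elements. The identifiers `CDE-1` and `CDE-2`, the labels, and the function names are hypothetical.

```python
import re


def normalize(name):
    """Reduce a name to an order-independent set of lowercase tokens."""
    return frozenset(re.findall(r"[a-z0-9]+", name.lower()))


def lexical_map(local_names, standard_elements):
    """Map local variable names to standard element ids by token-set equality.

    standard_elements maps an element id to its label. Names with no
    match are returned separately so they can go on to manual curation.
    """
    index = {normalize(label): element_id
             for element_id, label in standard_elements.items()}
    mapped, unmapped = {}, []
    for name in local_names:
        element_id = index.get(normalize(name))
        if element_id is None:
            unmapped.append(name)
        else:
            mapped[name] = element_id
    return mapped, unmapped


# Hypothetical standard elements and local data dictionary entries.
standard = {"CDE-1": "Body Mass Index", "CDE-2": "Systolic Blood Pressure"}
local = ["body_mass_index", "blood pressure, systolic", "ankle brachial index"]
mapped, unmapped = lexical_map(local, standard)
coverage = len(mapped) / len(local)
```

In this toy example two of three local names are mapped and `ankle brachial index` is left for curation, which mirrors the split the abstract reports between lexically mappable elements and those that needed new CDEs or post-coordination.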
Project description:The Database of Genotypes and Phenotypes (dbGaP, http://www.ncbi.nlm.nih.gov/gap) is a National Institutes of Health-sponsored repository charged to archive, curate and distribute information produced by studies investigating the interaction of genotype and phenotype. Information in dbGaP is organized as a hierarchical structure and includes the accessioned objects, phenotypes (as variables and datasets), various molecular assay data (SNP and Expression Array data, Sequence and Epigenomic marks), analyses and documents. Publicly accessible metadata about submitted studies, summary level data, and documents related to studies can be accessed freely on the dbGaP website. Individual-level data are accessible via Controlled Access application to scientists across the globe.
Project description:Logical Observation Identifiers Names and Codes (LOINC) is the most widely used controlled vocabulary to identify laboratory tests. A given laboratory test can often be reported in more than 1 unit of measure (eg, grams or moles), and LOINC defines unique codes for each unit. Consequently, an identical laboratory test performed by 2 different clinical laboratories may have different LOINC codes. The absence of unit conversions between compatible LOINC codes impedes data aggregation and analysis of laboratory results. To develop such conversions, a computational process was developed to review the LOINC standard for potential conversions, and multiple expert reviewers oversaw and finalized the conversion list. In all, 285 bidirectional conversions were identified, including conversions for routine clinical tests such as sodium, magnesium, and human immunodeficiency virus (HIV). Unit conversions were applied to the aggregation of laboratory test results to demonstrate their usefulness. Diverse informatics projects may benefit from the ability to interconvert compatible results.
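To make the idea of a bidirectional conversion concrete, here is a minimal sketch of a mass-to-molar conversion for magnesium, one of the routine tests named above. It is an illustration, not the published conversion list: the codes `MG-MASS` and `MG-MOLAR` are hypothetical placeholders rather than real LOINC codes, the units (mg/dL and mmol/L) are assumed, and the only external fact used is the molar mass of magnesium (about 24.305 g/mol).

```python
from dataclasses import dataclass

# Molar mass of magnesium in g/mol (standard atomic weight, rounded).
MAGNESIUM_G_PER_MOL = 24.305


@dataclass(frozen=True)
class Conversion:
    """A bidirectional conversion between two compatible result codes."""
    mass_code: str     # hypothetical placeholder, not a real LOINC code
    molar_code: str    # hypothetical placeholder, not a real LOINC code
    molar_mass: float  # g/mol

    def mg_dl_to_mmol_l(self, value):
        # mg/dL to mg/L is a factor of 10; mg/L divided by g/mol is mmol/L.
        return value * 10.0 / self.molar_mass

    def mmol_l_to_mg_dl(self, value):
        return value * self.molar_mass / 10.0


MAGNESIUM = Conversion("MG-MASS", "MG-MOLAR", MAGNESIUM_G_PER_MOL)


def harmonize(results, conversion):
    """Express every (code, value) result in mmol/L under the molar code."""
    harmonized = []
    for code, value in results:
        if code == conversion.mass_code:
            value = conversion.mg_dl_to_mmol_l(value)
        elif code != conversion.molar_code:
            raise ValueError(f"no conversion known for code {code!r}")
        harmonized.append((conversion.molar_code, value))
    return harmonized


# Results from two hypothetical laboratories reporting in different units.
pooled = harmonize([("MG-MASS", 2.4305), ("MG-MOLAR", 0.9)], MAGNESIUM)
```

After harmonization both results carry the same code and unit, so they can be aggregated directly. Codes outside the known pair raise an error instead of being passed through silently.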