Evaluating the re-identification risk of a clinical study report anonymized under EMA Policy 0070 and Health Canada Regulations.
ABSTRACT: BACKGROUND:Regulatory agencies, such as the European Medicines Agency and Health Canada, are requiring the public sharing of the clinical trial reports that are used to make drug approval decisions. Both agencies have provided guidance for the quantitative anonymization of these clinical reports before they are shared. There is limited empirical information on the effectiveness of this approach in protecting patient privacy for clinical trial data. METHODS:In this paper we empirically test the hypothesis that, when these guidelines are implemented in practice, they provide adequate privacy protection to patients. An anonymized clinical study report for a trial on a non-steroidal anti-inflammatory drug that is sold as a prescription eye drop was subjected to re-identification. The target was 500 patients in the USA. Only suspected matches to real identities were reported. RESULTS:Six suspected matches with low confidence scores were identified. Each suspected match took 24.2 h of effort. Social media and death records provided the most useful information for arriving at the suspected matches. CONCLUSIONS:These results suggest that the anonymization guidance from these agencies can provide adequate privacy protection for patients, and that the modes of attack can inform further refinements of the methodologies the agencies recommend in their guidance for manufacturers.
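To make the quantitative side of this guidance concrete, the sketch below shows how a maximum re-identification (prosecutor) risk is commonly computed from equivalence classes over quasi-identifiers. The field names, the toy records, and the 0.09 ceiling (often cited in connection with EMA Policy 0070 and Health Canada's PRCI guidance) are illustrative assumptions, not the study's actual pipeline.

```python
from collections import Counter

# Illustrative quasi-identifier tuples (age band, sex, country); a real CSR
# anonymization would use the trial's actual fields.
records = [
    ("60-69", "F", "USA"),
    ("60-69", "F", "USA"),
    ("70-79", "M", "USA"),
    ("60-69", "M", "CAN"),
]

# Size of each equivalence class: records indistinguishable on all
# quasi-identifiers fall into the same class.
class_sizes = Counter(records)

# Prosecutor risk per record: the probability of singling out a known
# target, i.e. 1 / (size of the record's equivalence class).
risks = [1 / class_sizes[r] for r in records]

max_risk = max(risks)
avg_risk = sum(risks) / len(risks)
print(f"max risk = {max_risk:.2f}, average risk = {avg_risk:.2f}")

# The 0.09 ceiling is an assumption here, shown only to indicate where a
# threshold check would fit in such a pipeline.
print("meets 0.09 threshold:", max_risk < 0.09)
```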
Project description:SUMMARY: The advent of next-generation sequencing for functional genomics has given rise to quantities of sequence information that are often so large that they are difficult to handle. Moreover, sequence reads from a specific individual can contain sufficient information to potentially identify and genetically characterize that person, raising privacy concerns. In order to address these issues, we have developed the Mapped Read Format (MRF), a compact data summary format for both short and long read alignments that enables the anonymization of confidential sequence information, while allowing one to still carry out many functional genomics studies. We have developed a suite of tools (RSEQtools) that use this format for the analysis of RNA-Seq experiments. These tools consist of a set of modules that perform common tasks such as calculating gene expression values, generating signal tracks of mapped reads and segmenting that signal into actively transcribed regions. Moreover, the tools can readily be used to build customizable RNA-Seq workflows. In addition to the anonymization afforded by MRF, this format also facilitates the decoupling of the alignment of reads from downstream analyses. AVAILABILITY AND IMPLEMENTATION: RSEQtools is implemented in C and the source code is available at http://rseqtools.gersteinlab.org/.
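As a rough illustration of the anonymization idea behind MRF, the toy function below keeps only alignment coordinates from a SAM-style record and discards the sequence and quality strings that carry genotype information. The record layout shown is a deliberate simplification, not the actual MRF specification; see http://rseqtools.gersteinlab.org/ for the real format.

```python
# Toy illustration of sequence-stripping anonymization in the spirit of MRF:
# keep the alignment coordinates needed for expression and signal analyses,
# drop the read sequence and base qualities that carry genotype information.
# This record layout is a simplification, NOT the real MRF specification.

def strip_alignment(sam_line: str) -> str:
    """Convert one SAM alignment line into a coordinate-only summary."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, rname, pos, seq = int(fields[1]), fields[2], fields[3], fields[9]
    strand = "-" if flag & 0x10 else "+"
    # chrom:start:strand:readlen, with no sequence and no qualities.
    return f"{rname}:{pos}:{strand}:{len(seq)}"

sam = ("read1\t0\tchr1\t11874\t255\t36M\t*\t0\t0\t"
       + "ACGT" + "A" * 32 + "\t" + "I" * 36)
print(strip_alignment(sam))  # chr1:11874:+:36
```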
Project description:OBJECTIVE:Electronic medical record (EMR) data is increasingly incorporated into genome-phenome association studies. Investigators hope to share data, but there are concerns it may be "re-identified" through the exploitation of various features, such as combinations of standardized clinical codes. Formal anonymization algorithms (e.g., k-anonymization) can prevent such violations, but prior studies suggest that the size of the population available for anonymization may influence the utility of the resulting data. We systematically investigate this issue using a large-scale biorepository and EMR system, through which we evaluate the ability of researchers to learn from anonymized data for genome-phenome association studies under various conditions. METHODS:We use a k-anonymization strategy to simulate a data protection process (on data sets containing clinical codes) for resources of similar size to those found at nine academic medical institutions within the United States. Following the protection process, we replicate an existing genome-phenome association study and compare the discoveries made using the protected data and the original data through the correlation (r²) of the p-values of association significance. RESULTS:Our investigation shows that anonymizing an entire dataset with respect to the population from which it is derived yields significantly more utility than anonymizing small study-specific datasets on their own. When evaluated using the correlation of genome-phenome association strengths on anonymized versus original data across all nine simulated sites, results from the largest-scale anonymizations (population ≥100,000) retained better utility than those from smaller populations (approximately 6000-75,000). We observed a general trend of increasing r² with data set size: r²=0.9481 for small-sized datasets, r²=0.9493 for moderately sized datasets, and r²=0.9934 for large-sized datasets. CONCLUSIONS:This research implies that, regardless of the overall size of an institution's data, there may be significant benefits to anonymizing the entire EMR, even if the institution is planning to release data about only a specific cohort of patients.
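A minimal sketch of the two ingredients of this evaluation, under assumptions of our own: a single numeric quasi-identifier is generalized until every bin holds at least k people, and utility is scored as the squared Pearson correlation (r²) between association significance on the original and the protected data. The ages and p-values are synthetic, and whether the correlation is taken on raw or -log10-transformed p-values is an assumption here.

```python
import numpy as np

def generalize_ages(ages, k):
    """Coarsen ages into progressively wider bins until every bin holds at
    least k people, a toy stand-in for full k-anonymization over codes."""
    width = 1
    while True:
        bins = np.floor(np.asarray(ages) / width) * width
        _, counts = np.unique(bins, return_counts=True)
        if counts.min() >= k:
            return bins
        width *= 2

rng = np.random.default_rng(0)
ages = rng.integers(20, 90, size=10_000)
protected = generalize_ages(ages, k=200)
print(f"k=200 generalization leaves {len(np.unique(protected))} age bins")

# Synthetic p-values standing in for association scans on the original and
# the protected data; the noise mimics information lost to generalization.
p_orig = rng.uniform(1e-8, 1.0, size=500)
p_anon = np.clip(p_orig * rng.lognormal(0.0, 0.1, size=500), 1e-12, 1.0)

# Utility metric from the study: squared Pearson correlation (r^2) of
# association significance on protected versus original data.
x, y = -np.log10(p_orig), -np.log10(p_anon)
r2 = np.corrcoef(x, y)[0, 1] ** 2
print(f"utility r^2 = {r2:.4f}")
```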
Project description:The Lean European Open Survey on SARS-CoV-2 Infected Patients (LEOSS) is a European registry for studying the epidemiology and clinical course of COVID-19. To support evidence-generation at the rapid pace required in a pandemic, LEOSS follows an Open Science approach, making data available to the public in real-time. To protect patient privacy, quantitative anonymization procedures are used to protect the continuously published data stream, consisting of 16 variables on the course and therapy of COVID-19, from singling out, inference, and linkage attacks. We investigated the bias introduced by this process and found that it has very little impact on the quality of output data. Current laws do not specify requirements for the application of formal anonymization methods, guidelines with clear recommendations are lacking, and few real-world applications of quantitative anonymization procedures have been described in the literature. We therefore believe that our work can help others with developing urgently needed anonymization pipelines for their projects.
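A hedged sketch of one such protection against singling out: a publication gate that suppresses records whose quasi-identifier combination occurs fewer than k times in a batch. The column names and the threshold are illustrative; the actual LEOSS pipeline uses 16 variables and its own parameters.

```python
import pandas as pd

K = 3  # illustrative minimum class size; the real LEOSS parameters differ

def publishable(batch: pd.DataFrame, quasi_ids: list) -> pd.DataFrame:
    """Suppress records whose quasi-identifier combination occurs fewer
    than K times, a simple gate against singling-out attacks on a
    continuously published stream."""
    sizes = batch.groupby(quasi_ids)[quasi_ids[0]].transform("size")
    return batch[sizes >= K]

batch = pd.DataFrame({
    "age_group": ["26-35", "26-35", "66-75", "26-35", "26-35", "26-35"],
    "stage":     ["uncomplicated"] * 5 + ["critical"],
})
# Rows in classes of size < 3 (66-75/uncomplicated, 26-35/critical) drop out.
print(publishable(batch, ["age_group", "stage"]))
```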
Project description:Researchers often face the problem of needing to protect the privacy of subjects while also needing to integrate data that contains personal information from diverse data sources. The advent of computational social science and the enormous amount of data about people that is being collected makes protecting the privacy of research subjects ever more important. However, strict privacy procedures can hinder the process of joining diverse sources of data that contain information about specific individual behaviors. In this paper we present a procedure to keep information about specific individuals from being "leaked" or shared in either direction between two sources of data, without the need for a trusted third party. To achieve this goal, we randomly assign individuals to anonymous groups before combining the anonymized information between the two sources of data. We refer to this method as the Yahtzee procedure, and show that it performs as predicted by theoretical analysis when we apply it to data from Facebook and public voter records.
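The sketch below captures the gist of such a grouping step, under assumptions of our own: both data holders share a secret key agreed out of band and derive identical pseudo-random group assignments with a keyed hash, then exchange only per-group aggregates. The published Yahtzee procedure differs in its details; this is not its reference implementation.

```python
import hashlib
import hmac
from collections import Counter

SHARED_KEY = b"agreed-out-of-band"  # hypothetical secret both parties hold
N_GROUPS = 50  # group size governs the privacy/utility trade-off

def group_of(person_id: str) -> int:
    """Assign an identifier to a pseudo-random group. Both data holders
    compute the same grouping without ever exchanging identities."""
    digest = hmac.new(SHARED_KEY, person_id.encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % N_GROUPS

# Each side then shares only per-group aggregates (counts, means), never
# rows about identifiable individuals.
voters = ["jane.doe.1980", "john.roe.1975", "amy.poe.1990"]
print(Counter(group_of(v) for v in voters))
```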
Project description:In response to the growing interest in genome-wide association study (GWAS) data privacy, the Integrating Data for Analysis, Anonymization and SHaring (iDASH) center organized the iDASH Healthcare Privacy Protection Challenge, with the aim of investigating the effectiveness of applying privacy-preserving methodologies to human genetic data. This paper is based on a submission to the iDASH Healthcare Privacy Protection Challenge. We apply privacy-preserving methods that are adapted from Uhler et al. 2013 and Yu et al. 2014 to the challenge's data and analyze the data utility after the data are perturbed by the privacy-preserving methods. Major contributions of this paper include a new interpretation of the χ² statistic in a GWAS setting and new results about the Hamming distance score, a key component for one of the privacy-preserving methods.
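For orientation, the snippet below computes the baseline one-degree-of-freedom allelic χ² statistic for a single SNP from a 2x2 table of case/control allele counts. It shows only the statistic being reinterpreted, not the paper's privacy-preserving perturbation or its Hamming distance score.

```python
import numpy as np
from scipy.stats import chi2

def allelic_chi2(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Standard 1-df allelic chi-square test for one GWAS SNP:
    a 2x2 table of (case/control) x (alt/ref) allele counts."""
    table = np.array([[case_alt, case_ref], [ctrl_alt, ctrl_ref]], float)
    row, col, n = table.sum(1), table.sum(0), table.sum()
    expected = np.outer(row, col) / n
    stat = ((table - expected) ** 2 / expected).sum()
    return stat, chi2.sf(stat, df=1)

stat, p = allelic_chi2(120, 880, 80, 920)  # illustrative allele counts
print(f"chi2 = {stat:.3f}, p = {p:.4g}")
```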
Project description:Objective:Online data centers (ODCs) are becoming increasingly popular for making health-related data available for research. Such centers provide good privacy protection during analysis by trusted researchers, but privacy concerns may still remain if the system outputs are not sufficiently anonymized. In this article, we propose a method for anonymizing analysis outputs from ODCs for publication in academic literature. Methods:We use as a model system the Secure Unified Research Environment, an online computing system that allows researchers to access and analyze linked health-related data for approved studies in Australia. This model system suggests realistic assumptions for an ODC that, together with literature and practice reviews, inform our solution design. Results:We propose a two-step approach to anonymizing analysis outputs from an ODC. A data preparation stage requires data custodians to apply some basic treatments to the dataset before making it available. A subsequent output anonymization stage requires researchers to use a checklist at the point of downloading analysis output. The checklist assists researchers with highlighting potential privacy concerns, then applying appropriate anonymization treatments. Conclusion:The checklist can be used more broadly in health care research, not just in ODCs. Ease of online publication as well as encouragement from journals to submit supplementary material are likely to increase both the volume and detail of analysis results publicly available, which in turn will increase the need for approaches such as the one suggested in this paper.
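One checklist item translates naturally into code: before download, scan an output table for small cells and suppress them. The sketch below assumes a simple count threshold of 5; real ODCs such as the Secure Unified Research Environment set their own thresholds and rules.

```python
import pandas as pd

MIN_CELL = 5  # assumed small-cell rule of thumb; actual ODC thresholds vary

def check_output_table(table: pd.DataFrame) -> pd.DataFrame:
    """One checklist item made executable: suppress numeric cells whose
    value (an underlying count) falls below the disclosure threshold."""
    out = table.copy()
    flags = out.select_dtypes("number") < MIN_CELL
    for col in flags.columns:
        out[col] = out[col].astype(object)
        out.loc[flags[col], col] = "<suppressed>"
    return out

results = pd.DataFrame({"region": ["A", "B", "C"], "n_cases": [102, 3, 48]})
print(check_output_table(results))  # region B's count is suppressed
```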
Project description:BACKGROUND:The secondary use of health data is central to biomedical research in the era of data science and precision medicine. National and international initiatives, such as the Global Open Findable, Accessible, Interoperable, and Reusable (GO FAIR) initiative, are supporting this approach in different ways (eg, making the sharing of research data mandatory or improving the legal and ethical frameworks). Preserving patients' privacy is crucial in this context. De-identification and anonymization are the two most common terms used to refer to the technical approaches that protect privacy and facilitate the secondary use of health data. However, it is difficult to find a consensus on the definitions of the concepts or on the reliability of the techniques used to apply them. A comprehensive review is needed to better understand the domain, its capabilities, its challenges, and the ratio of risk between the data subjects' privacy on one side, and the benefit of scientific advances on the other. OBJECTIVE:This work aims at better understanding how the research community comprehends and defines the concepts of de-identification and anonymization. A rich overview should also provide insights into the use and reliability of the methods. Six aspects will be studied: (1) terminology and definitions, (2) backgrounds and places of work of the researchers, (3) reasons for anonymizing or de-identifying health data, (4) limitations of the techniques, (5) legal and ethical aspects, and (6) recommendations of the researchers. METHODS:Based on a scoping review protocol designed a priori, MEDLINE was searched for publications discussing de-identification or anonymization and published between 2007 and 2017. The search was restricted to MEDLINE to focus on the life sciences community. The screening process was performed by two reviewers independently. RESULTS:Of the 7972 records that matched at least one search term, 135 publications were screened and 60 full-text articles were included. (1) Terminology: Definitions of the terms de-identification and anonymization were provided in less than half of the articles (29/60, 48%). When both terms were used (41/60, 68%), their meanings divided the authors into two equal groups (19/60, 32%, each) with opposed views. The remaining articles (3/60, 5%) were equivocal. (2) Backgrounds and locations: Research groups were based predominantly in North America (31/60, 52%) and in the European Union (22/60, 37%). The authors came from 19 different domains; computer science (91/248, 36.7%), biomedical informatics (47/248, 19.0%), and medicine (38/248, 15.3%) were the most prevalent ones. (3) Purpose: The main reason declared for applying these techniques is to facilitate biomedical research. (4) Limitations: Progress is made on specific techniques but, overall, limitations remain numerous. (5) Legal and ethical aspects: Differences exist between nations in the definitions, approaches, and legal practices. (6) Recommendations: The combination of organizational, legal, ethical, and technical approaches is necessary to protect health data. CONCLUSIONS:Interest in privacy-enhancing techniques is growing in the life sciences community. This interest crosses scientific boundaries, involving primarily computer science, biomedical informatics, and medicine.
The variability observed in the use of the terms de-identification and anonymization emphasizes the need for clearer definitions as well as for better education and dissemination of information on the subject. The same observation applies to the methods. Several laws, such as the American Health Insurance Portability and Accountability Act (HIPAA) and the European General Data Protection Regulation (GDPR), regulate the domain. Using the definitions they provide could help address the variable use of these two concepts in the research community.
Project description:Artificial intelligence-enabled medical big data analysis has the potential to revolutionize medical practice, from diagnosis and prediction of complex diseases to making recommendations and resource allocation decisions in an evidence-based manner. However, big data comes with big disclosure risks. To preserve privacy, excessive data anonymization is often necessary, leading to significant loss of data utility. In this paper, we develop a systematic data scrubbing procedure for large datasets when the key variables for re-identification risk assessment are uncertain, and we assess the trade-off between anonymizing electronic health record data for sharing in support of open science and the performance of machine learning models for early acute kidney injury risk prediction using the data. Results demonstrate that our proposed data scrubbing procedure can maintain good feature diversity and moderate data utility, but raises concerns regarding its impact on knowledge discovery capability.
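The trade-off the authors quantify can be illustrated on synthetic data: coarsening a predictive variable into wider bins (a stand-in for generalization-based scrubbing) and tracking the resulting drop in discrimination. The lab value, the acute kidney injury label, and the bin widths below are all synthetic assumptions, not the study's dataset or model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 5_000
creatinine = rng.lognormal(0.1, 0.4, n)  # synthetic stand-in lab value
# Synthetic AKI label whose probability rises with the lab value.
aki = (rng.random(n) < 1 / (1 + np.exp(-(creatinine - 1.3) * 3))).astype(int)

for width in [0.0, 0.25, 0.5, 1.0]:  # 0.0 means no generalization
    x = creatinine if width == 0 else np.floor(creatinine / width) * width
    X_tr, X_te, y_tr, y_te = train_test_split(
        x.reshape(-1, 1), aki, random_state=0)
    model = LogisticRegression().fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"bin width {width:4.2f}: AUC = {auc:.3f}")  # wider bins, lower AUC
```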
Project description:BACKGROUND:Word embeddings are dense numeric vectors used to represent language in neural networks. Until recently, there had been no publicly released embeddings trained on clinical data. Our work is the first to study the privacy implications of releasing these models. OBJECTIVE:This paper aims to demonstrate that traditional word embeddings created on clinical corpora that have been deidentified by removing personal health information (PHI) can nonetheless be exploited to reveal sensitive patient information. METHODS:We used embeddings created from 400,000 doctor-written consultation notes and experimented with 3 common word embedding methods to explore the privacy-preserving properties of each. RESULTS:We found that if publicly released embeddings are trained from a corpus anonymized by PHI removal, it is possible to reconstruct up to 68.5% (n=411/600) of the full names that remain in the deidentified corpus and to associate sensitive information with specific patients in the corpus from which the embeddings were created. We also found that the distance between the word vector representation of a patient's name and a diagnostic billing code is informative and differs significantly from the distance between the name and a code not billed for that patient. CONCLUSIONS:Special care must be taken when sharing word embeddings created from clinical texts, as current approaches may compromise patient privacy. If PHI removal is used for anonymization before traditional word embeddings are trained, it is possible to attribute sensitive information to patients who have not been fully deidentified by the (necessarily imperfect) removal algorithms. A promising alternative (ie, anonymization by PHI replacement) may avoid these flaws. Our results are timely and critical, as an increasing number of researchers are pushing for publicly available health data.
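The attack signal described here reduces to a distance comparison, sketched below with toy vectors rather than the actual consultation-note embeddings: a name vector that co-occurred with a billing code during training ends up measurably closer to that code than to an unrelated one.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(2)
dim = 50
code_billed = rng.normal(size=dim)  # code that appeared near the name
code_other = rng.normal(size=dim)   # code never billed for the patient

# Toy effect of co-occurrence during training: the name vector drifts
# toward the vector of the code it appeared alongside in the notes.
name_vec = 0.6 * code_billed + 0.4 * rng.normal(size=dim)

print(f"sim(name, billed code)    = {cosine(name_vec, code_billed):.3f}")
print(f"sim(name, unrelated code) = {cosine(name_vec, code_other):.3f}")
```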
Project description:OBJECTIVE:To identify data linkage errors in the form of possible false matches, where two patients appear to share the same unique identification number. SETTING:Hospital Episode Statistics (HES) in England, United Kingdom. DATA:Births and re-admissions for infants (April 1, 2011 to March 31, 2012; age 0-1 year) and adolescents (April 1, 2004 to March 31, 2011; age 10-19 years). METHODS:Hospital records were pseudo-anonymized using an algorithm designed to link multiple records belonging to the same person. Six implausible clinical scenarios were treated as possible false matches: multiple births sharing a HESID, re-admission after death, two birth episodes sharing a HESID, simultaneous admission at different hospitals, infant episodes coded as deliveries, and adolescent episodes coded as births. RESULTS:Among 507,778 infants, possible false matches were relatively rare (n = 433, 0.1 percent). The most common scenario (simultaneous admission at two hospitals, n = 324) was more likely for infants with missing data, those born preterm, and Asian infants. Among adolescents, this scenario (n = 320) was more common for males, younger patients, the Mixed ethnic group, and those re-admitted more frequently. CONCLUSIONS:Researchers can identify clinically implausible scenarios and the patients affected at the data cleaning stage, to mitigate the impact of possible linkage errors.
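A sketch of how two of these implausible scenarios can be flagged with a self-join on the pseudonymous identifier; the column names and episodes are illustrative and not the real HES schema.

```python
import pandas as pd

episodes = pd.DataFrame({
    "hesid":     ["A", "A", "B", "B"],
    "hospital":  ["H1", "H2", "H1", "H1"],
    "admit":     pd.to_datetime(["2011-05-01", "2011-05-01",
                                 "2011-06-01", "2011-07-01"]),
    "discharge": pd.to_datetime(["2011-05-03", "2011-05-02",
                                 "2011-06-05", "2011-07-02"]),
    "died":      [False, False, True, False],
})

# Scenario 1: simultaneous admissions at different hospitals under one HESID.
pairs = episodes.merge(episodes, on="hesid", suffixes=("_x", "_y"))
overlap = pairs[
    (pairs.hospital_x != pairs.hospital_y)
    & (pairs.admit_x < pairs.discharge_y)
    & (pairs.admit_y < pairs.discharge_x)
]
print("simultaneous admissions:", sorted(overlap.hesid.unique()))

# Scenario 2: a new admission recorded after an episode ending in death.
death = (episodes[episodes.died].groupby("hesid").discharge.min()
         .rename("death").reset_index())
readmit = episodes.merge(death, on="hesid")
print("re-admission after death:",
      sorted(readmit[readmit.admit > readmit.death].hesid.unique()))
```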