Scalable privacy-preserving data sharing methodology for genome-wide association studies: an application to iDASH healthcare privacy protection challenge.
ABSTRACT: In response to the growing interest in genome-wide association study (GWAS) data privacy, the Integrating Data for Analysis, Anonymization and SHaring (iDASH) center organized the iDASH Healthcare Privacy Protection Challenge, with the aim of investigating the effectiveness of applying privacy-preserving methodologies to human genetic data. This paper is based on a submission to the iDASH Healthcare Privacy Protection Challenge. We apply privacy-preserving methods adapted from Uhler et al. 2013 and Yu et al. 2014 to the challenge's data and analyze the data utility after the data are perturbed by the privacy-preserving methods. Major contributions of this paper include a new interpretation of the χ2 statistic in a GWAS setting and new results about the Hamming distance score, a key component of one of the privacy-preserving methods.
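For context, the χ2 statistic in a case-control GWAS is the Pearson statistic of a 2×3 contingency table of genotype counts (case/control rows, genotype columns). A minimal, self-contained illustration with made-up counts (not the challenge data):

```python
def chi2_stat(row1, row2):
    """Pearson chi-squared statistic for a 2xK contingency table.

    In a GWAS setting, row1/row2 are case and control counts of the
    three genotypes (AA, Aa, aa) at one SNP.
    """
    n1, n2 = sum(row1), sum(row2)
    n = n1 + n2
    stat = 0.0
    for k in range(len(row1)):
        col = row1[k] + row2[k]  # column (genotype) total
        for obs, total in ((row1[k], n1), (row2[k], n2)):
            expected = total * col / n
            stat += (obs - expected) ** 2 / expected
    return stat

# Illustrative counts only -- not data from the paper.
cases = [200, 250, 50]      # genotype counts among cases
controls = [150, 250, 100]  # genotype counts among controls
print(chi2_stat(cases, controls))
```

A large statistic (relative to the χ2 distribution with 2 degrees of freedom) indicates association between genotype and disease status, which is exactly the kind of per-SNP summary the privacy-preserving methods perturb.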
Project description:Genomic data is increasingly collected by a wide array of organizations. As such, there is a growing demand to make summary information about such collections more widely available. However, over the past decade, a series of investigations have shown that attacks, rooted in statistical inference methods, can be applied to discern the presence of a known individual's DNA sequence in the pool of subjects. Recently, it was shown that the Beacon Project of the Global Alliance for Genomics and Health, a web service for querying the presence (or absence) of a specific allele, was vulnerable. The Integrating Data for Analysis, Anonymization, and Sharing (iDASH) Center modeled a track in their third Privacy Protection Challenge on how to mitigate the Beacon vulnerability. We developed the winning solution for this track. This paper describes our computational method to optimize the tradeoff between the utility and the privacy of the Beacon service. We generalize the genomic data sharing problem beyond the one introduced in the iDASH Challenge to be more representative of real-world scenarios, allowing for a more comprehensive evaluation. We then conduct a sensitivity analysis of our method with respect to several state-of-the-art methods using a dataset of 400,000 positions in Chromosome 10 for 500 individuals from Phase 3 of the 1000 Genomes Project. All methods are evaluated for utility, privacy, and efficiency. Our method achieves better performance than all state-of-the-art methods, irrespective of how key factors (e.g., the allele frequency in the population, the size of the pool, and the utility weights) change from the original parameters of the problem. We further illustrate that our method can exhibit subpar performance under special cases of allele query sequences.
However, we show our method can be extended to address this issue when the query sequence is fixed and known a priori to the data custodian, so that they may plan and stage their responses accordingly. This research shows that it is possible to thwart the attack on Beacon services, without substantially altering the utility of the system, using computational methods. The method we initially developed is limited by the design of the scenario and evaluation protocol for the iDASH Challenge; however, it can be improved by allowing the data custodian to act in a staged manner.
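The abstract does not spell out the mechanism, but the utility/privacy tradeoff for a Beacon-style service can be sketched generically: answer truthfully for common alleles and randomly flip some answers for rare ones, which are the most identifying. The threshold and flip probability below are illustrative assumptions, not the authors' method:

```python
import random

def beacon_response(allele_present, allele_freq, rng,
                    rare_threshold=0.01, flip_prob=0.3):
    """Generic flip-based mitigation sketch for a Beacon presence query.

    Rare alleles contribute most to re-identification, so their answers
    are flipped with probability flip_prob. All parameters here are
    illustrative, not taken from the paper.
    """
    if allele_freq < rare_threshold and rng.random() < flip_prob:
        return not allele_present
    return allele_present

rng = random.Random(0)
# Common allele: answered truthfully. Rare allele: sometimes flipped.
print(beacon_response(True, 0.45, rng), beacon_response(True, 0.001, rng))
```

Raising flip_prob trades utility (fewer correct answers) for privacy (a weaker membership-inference signal), which is the tradeoff the paper optimizes.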
Project description:BACKGROUND:Word embeddings are dense numeric vectors used to represent language in neural networks. Until recently, there had been no publicly released embeddings trained on clinical data. Our work is the first to study the privacy implications of releasing these models. OBJECTIVE:This paper aims to demonstrate that traditional word embeddings created on clinical corpora that have been deidentified by removing personal health information (PHI) can nonetheless be exploited to reveal sensitive patient information. METHODS:We used embeddings created from 400,000 doctor-written consultation notes and experimented with 3 common word embedding methods to explore the privacy-preserving properties of each. RESULTS:We found that if publicly released embeddings are trained from a corpus anonymized by PHI removal, it is possible to reconstruct up to 68.5% (n=411/600) of the full names that remain in the deidentified corpus and to attribute associated sensitive information to specific patients in the corpus from which the embeddings were created. We also found that the distance between the word vector representation of a patient's name and a diagnostic billing code is informative and differs significantly from the distance between the name and a code not billed for that patient. CONCLUSIONS:Special care must be taken when sharing word embeddings created from clinical texts, as current approaches may compromise patient privacy. If PHI removal is used for anonymization before traditional word embeddings are trained, it is possible to attribute sensitive information to patients who have not been fully deidentified by the (necessarily imperfect) removal algorithms. A promising alternative (ie, anonymization by PHI replacement) may avoid these flaws. Our results are timely and critical, as an increasing number of researchers are pushing for publicly available health data.
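The informative "distance" above is a distance in embedding space; the abstract does not specify the metric, but cosine distance is the common choice for word vectors, sketched here with toy vectors:

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy 3-d "embeddings" (real clinical embeddings have hundreds of
# dimensions); the vectors below are invented for illustration.
name_vec = [0.9, 0.1, 0.2]       # hypothetical patient-name vector
billed_code = [0.8, 0.2, 0.1]    # code billed for that patient
unbilled_code = [0.1, 0.9, 0.4]  # code not billed for that patient

# The attack signal: the billed code sits closer to the name vector.
assert cosine_distance(name_vec, billed_code) < cosine_distance(name_vec, unbilled_code)
```

The study's point is that such gaps in distance leak diagnosis information even after PHI removal, since names that survive deidentification still acquire meaningful vectors.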
Project description:The proliferation of large genomic databases offers the potential to perform increasingly larger-scale genome-wide association studies (GWASs). Due to privacy concerns, however, access to these data is limited, greatly reducing their usefulness for research. Here, we introduce a computational framework for performing GWASs that adapts principles of differential privacy-a cryptographic theory that facilitates secure analysis of sensitive data-to both protect private phenotype information (e.g., disease status) and correct for population stratification. This framework enables us to produce privacy-preserving GWAS results based on EIGENSTRAT and linear mixed model (LMM)-based statistics, both of which correct for population stratification. We test our differentially private statistics, PrivSTRAT and PrivLMM, on simulated and real GWAS datasets and find they are able to protect privacy while returning meaningful results. Our framework can be used to securely query private genomic datasets to discover which specific genomic alterations may be associated with a disease, thus increasing the availability of these valuable datasets.
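The differential-privacy machinery behind statistics like PrivSTRAT and PrivLMM rests on calibrated noise addition. A minimal sketch of the generic Laplace mechanism (the basic building block of differential privacy, not the paper's stratification-corrected statistics):

```python
import math
import random

def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Release value + Laplace(0, sensitivity/epsilon) noise, giving
    epsilon-differential privacy for a statistic with the given L1
    sensitivity. The noise is drawn by inverse-CDF sampling."""
    scale = sensitivity / epsilon
    u = rng.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return value + noise

rng = random.Random(42)
# A hypothetical private count (e.g., number of cases carrying an
# allele) released with epsilon = 1; numbers are illustrative only.
print(laplace_mechanism(137.0, 1.0, 1.0, rng))
```

Smaller epsilon means larger noise and stronger privacy; the paper's contribution is making such releases meaningful while also correcting for population stratification.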
Project description:Process mining has been successfully applied in the healthcare domain and has helped to uncover various insights for improving healthcare processes. While the benefits of process mining are widely acknowledged, many people rightfully have concerns about irresponsible uses of personal data. Healthcare information systems contain highly sensitive information, and healthcare regulations often require protection of data privacy. The need to comply with strict privacy requirements may result in decreased data utility for analysis. Until recently, data privacy issues did not get much attention in the process mining community; however, several privacy-preserving data transformation techniques have been proposed in the data mining community. Many similarities between data mining and process mining exist, but there are key differences that make privacy-preserving data mining techniques unsuitable to anonymise process data (without adaptations). In this article, we analyse data privacy and utility requirements for healthcare process data and assess the suitability of privacy-preserving data transformation methods to anonymise healthcare data. We demonstrate how some of these anonymisation methods affect various process mining results using three publicly available healthcare event logs. We describe a framework for privacy-preserving process mining that can support healthcare process mining analyses. We also advocate the recording of privacy metadata to capture information about privacy-preserving transformations performed on an event log.
Project description:BACKGROUND:Security concerns have been raised since big data became a prominent tool in data analysis. For instance, many machine learning algorithms aim to generate prediction models using training data which contain sensitive information about individuals. The cryptography community is considering secure computation as a solution for privacy protection. In particular, practical requirements have triggered research on the efficiency of cryptographic primitives. METHODS:This paper presents a method to train a logistic regression model without information leakage. We apply the homomorphic encryption scheme of Cheon et al. (ASIACRYPT 2017) for efficient arithmetic over real numbers, and devise a new encoding method to reduce the storage of the encrypted database. In addition, we adapt Nesterov's accelerated gradient method to reduce the number of iterations as well as the computational cost while maintaining the quality of the output classifier. RESULTS:Our method shows state-of-the-art performance of a homomorphic encryption system in a real-world application. The submission based on this work was selected as the best solution of Track 3 at the iDASH privacy and security competition 2017. For example, it took about six minutes to obtain a logistic regression model given a dataset consisting of 1579 samples, each of which has 18 features and a binary outcome variable. CONCLUSIONS:We present a practical solution for outsourcing analysis tools such as logistic regression analysis while preserving data confidentiality.
Project description:One of the tasks in the iDASH Secure Genome Analysis Competition in 2016 was to demonstrate the feasibility of privacy-preserving queries on homomorphically encrypted genomic data. More precisely, given a list of up to 100,000 mutations, the task was to encrypt the data using homomorphic encryption in a way that allows it to be stored securely in the cloud, and enables the data owner to query the dataset for the presence of specific mutations, without revealing any information about the dataset or the queries to the cloud. We devise a novel string matching protocol to enable privacy-preserving queries on homomorphically encrypted data. Our protocol combines state-of-the-art techniques from homomorphic encryption and private set intersection protocols to minimize the computational and communication cost. We implemented our protocol using the homomorphic encryption library SEAL v2.1, and applied it to obtain an efficient solution to the iDASH competition task. For example, using 8 threads, our protocol achieves a running time of only 4 s, and a communication cost of 2 MB, when querying for the presence of 5 mutations from an encrypted dataset of 100,000 mutations. We demonstrate that homomorphic encryption can be used to enable an efficient privacy-preserving mechanism for querying the presence of particular mutations in realistic size datasets. Beyond its applications to genomics, our protocol can just as well be applied to any kind of data, and is therefore of independent interest to the homomorphic encryption community.
Project description:BACKGROUND:The deployment of genome-wide association studies (GWASs) requires genomic information of a large population to produce reliable results. This raises significant privacy concerns, making people hesitate to contribute their genetic information to such studies. RESULTS:We propose two provably secure solutions to address this challenge: (1) a somewhat homomorphic encryption (HE) approach, and (2) a secure multiparty computation (MPC) approach. Unlike previous work, our approach does not rely on adding noise to the input data, nor does it reveal any information about the patients. Our protocols aim to prevent data breaches by calculating the χ2 statistic in a privacy-preserving manner, without revealing any information other than whether the statistic is significant or not. Specifically, our protocols compute the χ2 statistic, but only return a yes/no answer, indicating significance. By not revealing the statistic value itself but only the significance, our approach thwarts attacks exploiting statistic values. We significantly increased the efficiency of our HE protocols by introducing a new masking technique to perform the secure comparison that is necessary for determining significance. CONCLUSIONS:We show that full-scale privacy-preserving GWAS is practical, as long as the statistics can be computed by low degree polynomials. Our implementations demonstrated that both approaches are efficient. The secure multiparty computation technique completes its execution in approximately 2 ms for data contributed by one million subjects.
Project description:Starting from a forensic problem, Homer et al. showed that it was possible to detect whether an individual contributes as little as 0.5% of the DNA in a pool. The finding was extended to prove the possibility of detecting whether a subject participated in a small homogeneous GWAS. We denote this as the detection of a subject belonging to a certain cohort (SBCC). Subsequently, Visscher and Hill showed that the power to detect an SBCC signal for an ethnically homogeneous cohort depends roughly on the ratio of the number of independent markers to the total sample size. However, it is not clear if the same holds for more ethnically diverse cohorts. Later, Masca et al. proposed, as an SBCC test, a regression of the departure from the assumed population frequency of (i) the subject genotype on (ii) the cohort-of-interest frequency. They used simulations to show that the approach has better SBCC detection power than the original Homer method but is impeded by population stratification. To investigate the possibility of SBCC detection in multi-ethnic cohorts, we generalize the Masca et al. approach by theoretically deriving the correlation between a subject genotype and the cohort reference allele frequencies (RAFs) for stratified cohorts. Based on the derived formula, we show theoretically that, due to background stratification noise, SBCC detection is unlikely even for mildly stratified cohorts of more than around a thousand subjects. Thus, for the vast majority of contemporary cohorts, the fear of compromising privacy via SBCC detection is unfounded.
Project description:Frequent itemset mining is the important first step of association rule mining, which discovers interesting patterns in massive data. There are increasing concerns about privacy in frequent itemset mining, and several approaches have been proposed to address them. In this paper, we introduce a personalized privacy problem, in which different attributes may need different levels of privacy protection. To solve this problem, we give a personalized privacy-preserving method based on the randomized response technique. By providing different privacy levels for different attributes, this method achieves higher accuracy on frequent itemset mining than the traditional method providing a single privacy level. Finally, our experimental results show that our method produces better frequent itemset mining results while preserving personalized privacy.
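As a concrete sketch of the randomized response technique on one binary attribute: the per-attribute truth probability p_truth is what carries the "different privacy levels" idea. The parameters below are illustrative, not the paper's:

```python
import random

def randomized_response(true_bit, p_truth, rng):
    """Report the true bit with probability p_truth, otherwise a
    uniformly random bit; a smaller p_truth gives stronger privacy
    (and a noisier aggregate)."""
    if rng.random() < p_truth:
        return true_bit
    return rng.randint(0, 1)

def estimate_frequency(reports, p_truth):
    """Unbiased estimate of the true frequency of 1s from noisy reports:
    E[report] = p*f + (1-p)/2, so f = (mean - (1-p)/2) / p."""
    mean = sum(reports) / len(reports)
    return (mean - (1.0 - p_truth) / 2.0) / p_truth

rng = random.Random(0)
bits = [1] * 3000 + [0] * 7000  # true frequency of the attribute: 0.3
reports = [randomized_response(b, 0.7, rng) for b in bits]
print(estimate_frequency(reports, 0.7))  # close to 0.3
```

Assigning a different p_truth to each attribute lets sensitive attributes get more noise while less sensitive ones keep high accuracy, which is the intuition behind the paper's accuracy gain.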
Project description:Motivation:Inappropriate disclosure of human genomes may put the privacy of study subjects and of their family members at risk. Existing privacy-preserving mechanisms for Genome-Wide Association Studies (GWAS) mainly focus on protecting individual information in case-control studies. Protecting privacy in family-based studies is more difficult. The transmission disequilibrium test (TDT) is a powerful family-based association test employed in many rare disease studies. It gathers information about families (most frequently involving parents, affected children and their siblings). It is important to develop privacy-preserving approaches to disclose TDT statistics with a guarantee that the risk of family 're-identification' stays below a pre-specified risk threshold. 'Re-identification' in this context means that an attacker can infer the presence of a family in a study. Methods:In the context of protecting family-level privacy, we developed and evaluated a suite of differentially private (DP) mechanisms for the TDT. They include Laplace mechanisms based on the TDT test statistic, P-values, and projected P-values, and exponential mechanisms based on the TDT test statistic and the shortest Hamming distance (SHD) score. Results:Using simulation studies with a small cohort and a large one, we showed that the exponential mechanism based on the SHD score preserves the highest utility and privacy among all proposed DP methods. We provide a guideline on applying our DP TDT to a real dataset, analyzing Kawasaki disease with 187 families and 906 SNPs. There are some limitations, including: (1) the performance of our implementation is too slow for real-time results generation and (2) handling missing data is still challenging. Availability and implementation:The software dpTDT is available at https://github.com/mwgrassgreen/dpTDT. Contact:firstname.lastname@example.org. Supplementary information:Supplementary data are available at Bioinformatics online.
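For context, the exponential mechanism named in the Methods selects an output c with probability proportional to exp(epsilon * score(c) / (2 * sensitivity)). A generic sketch of that sampling step (the score function below is a placeholder, not the SHD score or dpTDT itself):

```python
import math
import random

def exponential_mechanism(candidates, score, epsilon, sensitivity, rng):
    """Sample a candidate with probability proportional to
    exp(epsilon * score(c) / (2 * sensitivity)); higher-scoring
    outputs are exponentially more likely."""
    weights = [math.exp(epsilon * score(c) / (2.0 * sensitivity))
               for c in candidates]
    total = sum(weights)
    r = rng.random() * total
    acc = 0.0
    for cand, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return cand
    return candidates[-1]  # guard against floating-point rounding

# With a large epsilon the highest-scoring candidate dominates.
rng = random.Random(0)
print(exponential_mechanism([0, 1, 2, 3], lambda c: c, 10.0, 1.0, rng))
```

In the paper's setting, a higher SHD-based score marks outputs closer to the true TDT result, so the mechanism releases accurate statistics with high probability while bounding the privacy loss by epsilon.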