De-identifying a public use microdata file from the Canadian national discharge abstract database.
ABSTRACT: The Canadian Institute for Health Information (CIHI) collects hospital discharge abstract data (DAD) from Canadian provinces and territories. There are many demands for the disclosure of this data for research and analysis to inform policy making. To expedite the disclosure of data for some of these purposes, the construction of a DAD public use microdata file (PUMF) was considered. Such purposes include: confirming some published results, providing broader feedback to CIHI to improve data quality, training students and fellows, providing an easily accessible data set for researchers to prepare for analyses on the full DAD data set, and serving as a large health data set for computer scientists and statisticians to evaluate analysis and data mining techniques. The objective of this study was to measure the probability of re-identification for records in a PUMF, and to de-identify a national DAD PUMF consisting of 10% of records. Plausible attacks on a PUMF were evaluated. Based on these attacks, the 2008-2009 national DAD was de-identified. A new algorithm was developed to minimize the amount of suppression while maximizing the precision of the data. The acceptable threshold for the probability of correct re-identification of a record was set at between 0.04 and 0.05. Information loss was measured in terms of the extent of suppression and entropy. Two different PUMF files were produced, one with geographic information, and one with no geographic information but more clinical information. At a threshold of 0.05, the maximum proportion of records with the diagnosis code suppressed was 20%, but these suppressions represented only 8-9% of all values in the DAD. Our suppression algorithm incurs less information loss than a more traditional approach to suppression.
Smaller regions, patients with longer stays, and age groups that are infrequently admitted to hospitals tend to be the ones with the highest rates of suppression. The strategies we used to maximize data utility and minimize information loss can result in a PUMF that would be useful for the specific purposes noted earlier. However, to create a more detailed file with less information loss suitable for more complex health services research, the risk would need to be mitigated by requiring the data recipient to commit to a data sharing agreement.
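The abstract above sets a re-identification probability threshold of 0.05 but does not spell out how the probability is computed. The sketch below is one common way to illustrate the idea, not the authors' algorithm: it assumes the risk for a record is 1 divided by the number of records sharing the same quasi-identifier values (its equivalence class), so a threshold of 0.05 corresponds to classes of at least 20 records. The field names (`age_group`, `region`) and both function names are hypothetical, and the sketch ignores the effect of releasing only a 10% sample.

```python
from collections import Counter


def reidentification_risk(records, quasi_identifiers):
    """Per-record risk, assumed here to be 1 / equivalence-class size."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    return [1.0 / class_sizes[k] for k in keys]


def flag_for_suppression(records, quasi_identifiers, threshold=0.05):
    """True for records whose assumed risk exceeds the threshold."""
    risks = reidentification_risk(records, quasi_identifiers)
    return [risk > threshold for risk in risks]


# Hypothetical toy data: 20 records share one class, 1 record is unique.
toy = [{"age_group": "40-49", "region": "A"} for _ in range(20)]
toy.append({"age_group": "90+", "region": "B"})
flags = flag_for_suppression(toy, ["age_group", "region"])
```

In this toy data only the last record, which is alone in its class, is flagged. A real de-identification pass would then choose which values to suppress or generalize, which is the part the study's new algorithm addresses.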
Project description:OBJECTIVE: There has been a consistent concern about the inadvertent disclosure of personal information through peer-to-peer file sharing applications, such as Limewire and Morpheus. Examples of personal health and financial information being exposed have been published. We wanted to estimate the extent to which personal health information (PHI) is being disclosed in this way, and compare that to the extent of disclosure of personal financial information (PFI). DESIGN: After careful review and approval of our protocol by our institutional research ethics board, files were downloaded from peer-to-peer file sharing networks and manually analyzed for the presence of PHI and PFI. The geographic region of the IP addresses was determined, and classified as either USA or Canada. MEASUREMENT: We estimated the proportion of files that contain personal health and financial information for each region. We also estimated the proportion of search terms that return files with personal health and financial information. We ascertained and discuss the ethical issues related to this study. RESULTS: Approximately 0.4% of Canadian IP addresses had PHI, as did 0.5% of US IP addresses. There was more disclosure of financial information, at 1.7% of Canadian IP addresses and 4.7% of US IP addresses. An analysis of search terms used in these file sharing networks showed that a small percentage of the terms would return PHI and PFI files (ie, there are people successfully searching for PFI and PHI on the peer-to-peer file sharing networks). CONCLUSION: There is a real risk of inadvertent disclosure of PHI through peer-to-peer file sharing networks, although the risk is not as large as for PFI. Anyone keeping PHI on their computers should avoid installing file sharing applications on their computers, or if they have to use such tools, actively manage the risks of inadvertent disclosure of their, their family's, their clients', or patients' PHI.
Project description: To investigate the use of the WHO EML as a tool with which to evaluate the evidence base for the medicines on the national insurance coverage list of the Croatian Institute of Health Insurance (CIHI). Medicines from the 9 ATC categories with the highest expenditures in the 2012 CIHI Basic List (n = 509) were compared with the 2011 WHO EML for adults (n = 359). For medicines with a specific indication listed only in the CIHI Basic List, we assessed whether there was evidence in the Cochrane Database of Systematic Reviews questioning their efficacy and safety. The two lists shared 188 medicines (52.4% of WHO EML and 32.0% of CIHI list). The CIHI Basic List had 254 medicines and 33 combinations of these medicines which were not on the WHO EML, plus 14 medicines rejected and 20 deleted from the WHO EML by its Evaluation Committee. For deleted medicines, we could obtain data that showed 2,965,378 prescriptions issued to 617,684 insured patients, and a cost of approximately €41.2 million for 2012 and the first half of 2013, when the CIHI Basic List was in effect. For CIHI List-only medicines with a specific indication (n = 164 or 57.1% of the analyzed set), fewer benefits or more serious side-effects than other medicines were found for 17 (10.4%) and not enough evidence for recommendations for the specific indication for 21 (12.8%) medicines in Cochrane systematic reviews. National health care policy should use high-quality evidence in deciding on adding new medicines and reassessing those already present on national medicines lists, in order to rationalize expenditures and ensure wider and better access to medicines. The WHO EML and recommendations from its Evaluation Committee may be useful tools in this quality assurance process.
Project description: OBJECTIVE: Until recently, the options for summarizing Canadian patient complexity were limited to health risk predictive modeling tools developed outside of Canada. This study aims to validate a new model created by the Canadian Institute for Health Information (CIHI) for Canada's health care environment. RESEARCH DESIGN: This was a cohort study. SUBJECTS: The rolling population eligible for coverage under Ontario's Universal Provincial Health Insurance Program in the fiscal years (FYs) 2006/2007-2016/2017 (12-13 million annually) comprised the subjects. MEASURES: To evaluate model performance, we compared predicted cost risk at the individual level, on the basis of diagnosis history, with estimates of actual patient-level cost using "out-of-the-box" cost weights created by running the CIHI software "as is." We next considered whether performance could be improved by recalibrating the model weights, censoring outliers, or adding prior cost. RESULTS: We were able to closely match model performance reported by CIHI for their 2010-2012 development sample (concurrent R=48.0%; prospective R=8.9%) and show that performance improved over time (concurrent R=51.9%; prospective R=9.7% in 2014-2016). Recalibrating the model did not substantively affect prospective period performance, even with the addition of prior cost and censoring of cost outliers. However, censoring substantively improved concurrent period explanatory power (from R=53.6% to 66.7%). CONCLUSIONS: We validated the CIHI model for 2 periods, FYs 2010/2011-2012/2013 and FYs 2014/2015-2016/2017. Out-of-the-box model performance for Ontario was as good as that reported by CIHI for the development sample based on 3-province data (British Columbia, Alberta, and Ontario). We found that performance was robust to variations in model specification, data sources, and time.
Project description:RHYTHM is a web server that predicts buried versus exposed residues of helical membrane proteins. Starting from a given protein sequence, secondary and tertiary structure information is calculated by RHYTHM within only a few seconds. The prediction applies structural information from a growing data base of precalculated packing files and evolutionary information from sequence patterns conserved in a representative dataset of membrane proteins ('Pfam-domains'). The program uses two types of position specific matrices to account for the different geometries of packing in channels and transporters ('channels') or other membrane proteins ('membrane-coils'). The output provides information on the secondary structure and topology of the protein and specifically on the contact type of each residue and its conservation. This information can be downloaded as a graphical file for illustration, a text file for analysis and statistics and a PyMOL file for modeling purposes. The server can be freely accessed at: URL: http://proteinformatics.de/rhythm.
Project description: BACKGROUND: Outcomes for coronary artery bypass surgery are of broadening interest, but the impact of data type on quality reporting has not been fully examined. We compared the performance of administrative and clinical data-based risk adjustment models at a tertiary-quaternary care hospital. METHODS: We used a prospective study design to test two risk adjustment models, one from administrative (Canadian Institute for Health Information [CIHI] Cardiac Care Quality Indicator) and one from clinical data (Society of Thoracic Surgeons), on cardiac surgical procedures performed between 2013 and 2016 (n = 1635). Our primary outcome was in-hospital mortality within 30 days of surgery. Model performance was established by comparing predicted and observed mortality, model calibration and handling of critical covariates. RESULTS: Observed mortality was 1.96%, which was the same as that predicted by the Society of Thoracic Surgeons model (1.96%), but significantly higher than that predicted by the CIHI model (1.03%). Despite both models having similar C statistics (0.756 CIHI; 0.758 Society of Thoracic Surgeons), the CIHI model showed significant underestimation of mortality among patients at higher risk. There was significant miscalibration of risk associated with 7 covariates: New York Heart Association class IV, congestive heart failure, ejection fraction less than 20%, atrial fibrillation, acute coronary insufficiency, cardiac compromise (shock, myocardial infarction < 24 h, intra-aortic balloon pump, cardiac resuscitation or preprocedure circulatory support) and creatinine concentration of 100 mg/dL or more. Together, these factors accounted for 84% of the difference in predicted mortality between the administrative and clinical models. INTERPRETATION: Risk prediction using administrative data underestimated risk of death, potentially inflating observed-to-predicted mortality ratios at hospitals with patients who are more ill.
Caution is warranted when hospital reports of cardiac surgery outcomes are based on administrative data alone.
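The interpretation above turns on the observed-to-predicted mortality ratio. As a minimal sketch using only the rates quoted in the abstract (observed 1.96%, Society of Thoracic Surgeons prediction 1.96%, CIHI prediction 1.03%), the ratio can be computed as follows. The function name is hypothetical, and this is plain arithmetic on the reported figures, not the study's calibration analysis.

```python
def observed_to_predicted(observed_rate, predicted_rate):
    """Ratio of observed to predicted mortality; 1.0 means the model matches."""
    if predicted_rate <= 0:
        raise ValueError("predicted_rate must be positive")
    return observed_rate / predicted_rate


# Rates (in percent) as reported in the abstract.
ratio_sts = observed_to_predicted(1.96, 1.96)
ratio_cihi = observed_to_predicted(1.96, 1.03)
```

With these inputs the clinical model gives a ratio of 1.0, while the administrative model gives a ratio of about 1.9, which is how an underestimated prediction can make a hospital's outcomes look worse than expected.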
Project description: This study highlights the need to analyze the online disclosure practices of non-governmental organizations (NGOs) and the potential correlates of those practices. We propose a novel index for analyzing the extent of online disclosure by NGOs. Using the information stored in an auxiliary variable, we propose a new estimator for gauging the average value of the proposed index. Our approach relies on two components: an imperfect ranked-set sampling procedure that links the auxiliary variable with the study variable, and an NGO disclosure index under simple random sampling that uses information only about the study variable. The relative efficiency of the proposed index is compared with that of the conventional estimator for the population average under the imperfect ranked-set sampling scheme. Mathematical conditions required for retaining the efficiency of the proposed index, in comparison to the imperfect ranked-set sampling estimator, are derived. Numerical scrutiny of the relative efficiency, in response to the input variables, indicates that if the variance of the NGO disclosure index is less than the variance of the estimator under imperfect ranked-set sampling, then the proposed index is universally efficient compared to the estimator under imperfect ranked-set sampling. Even if this condition on the variances is unmet, the proposed estimator remains efficient if the majority of NGOs share online data on the auxiliary variable. This work can facilitate nonprofit regulation in countries where most non-governmental organizations maintain their own websites.
Project description:BACKGROUND: Privacy concerns by providers have been a barrier to disclosing patient information for public health purposes. This is the case even for mandated notifiable disease reporting. In the context of a pandemic it has been argued that the public good should supersede an individual's right to privacy. The precise nature of these provider privacy concerns, and whether they are diluted in the context of a pandemic are not known. Our objective was to understand the privacy barriers which could potentially influence family physicians' reporting of patient-level surveillance data to public health agencies during the Fall 2009 pandemic H1N1 influenza outbreak. METHODS: Thirty seven family doctors participated in a series of five focus groups between October 29-31 2009. They also completed a survey about the data they were willing to disclose to public health units. Descriptive statistics were used to summarize the amount of patient detail the participants were willing to disclose, factors that would facilitate data disclosure, and the consensus on those factors. The analysis of the qualitative data was based on grounded theory. RESULTS: The family doctors were reluctant to disclose patient data to public health units. This was due to concerns about the extent to which public health agencies are dependable to protect health information (trusting beliefs), and the possibility of loss due to disclosing health information (risk beliefs). We identified six specific actions that public health units can take which would affect these beliefs, and potentially increase the willingness to disclose patient information for public health purposes. CONCLUSIONS: The uncertainty surrounding a pandemic of a new strain of influenza has not changed the privacy concerns of physicians about disclosing patient data. It is important to address these concerns to ensure reliable reporting during future outbreaks.
Project description:This article describes a public data set containing the three-dimensional kinematics of the whole human body and the ground reaction forces (with a dual force platform setup) of subjects who were standing still for 60 s in different conditions, in which the subjects' vision and the standing surface were manipulated. Twenty-seven young subjects and 22 old subjects were evaluated. The data set comprises a file with metadata plus 1,813 files with the ground reaction force (GRF) and kinematics data for the 49 subjects (three files for each of the 12 trials plus one file for each subject). The file with metadata has information about each subject's sociocultural, demographic, and health characteristics. The files with the GRF have the data from each force platform and from the resultant GRF (including the center of pressure data). The files with the kinematics contain the three-dimensional positions of 42 markers that were placed on each subject's body and 73 calculated joint angles. In this text, we illustrate how to access, analyze, and visualize the data set. All the data is available at Figshare (DOI: 10.6084/m9.figshare.4525082), and a companion Jupyter Notebook presents programming code to access the data set, generate analyses and other examples. The availability of a public data set on the Internet that contains these measurements and information about how to access and process this data can potentially boost the research on human postural control, increase the reproducibility of studies, and be used for training and education, among other applications.
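The file count quoted above follows directly from the study design: 27 young plus 22 old subjects, each with 12 trials of three files plus one per-subject file. The short check below reproduces that arithmetic; the function and parameter names are hypothetical and are not taken from the data set's companion notebook.

```python
def expected_data_files(n_subjects, trials_per_subject,
                        files_per_trial, extra_files_per_subject):
    """Number of data files implied by the design described in the text."""
    per_subject = trials_per_subject * files_per_trial + extra_files_per_subject
    return n_subjects * per_subject


n_subjects = 27 + 22  # young + old subjects
n_files = expected_data_files(n_subjects, trials_per_subject=12,
                              files_per_trial=3, extra_files_per_subject=1)
```

Each subject contributes 37 files, and 49 subjects give 1,813 files, matching the figure in the description (the metadata file is counted separately).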
Project description: OBJECTIVE: Pressure ulcer development is a quality of care indicator, as pressure ulcers are potentially preventable. Yet pressure ulcer is a leading cause of morbidity, discomfort and additional healthcare costs for inpatients. Methods are lacking for accurate surveillance of pressure ulcer in hospitals to track occurrences and evaluate care improvement strategies. The main study aim was to validate the hospital discharge abstract database (DAD) in recording pressure ulcers against nursing consult reports, and to calculate the prevalence of pressure ulcers in Alberta, Canada in the DAD. We hypothesised that a more inclusive case definition for pressure ulcers would enhance the validity of cases identified in administrative data for research and quality improvement purposes. SETTING: A cohort of patients with pressure ulcers was identified from enterostomal (ET) nursing consult documents at a large university hospital in 2011. PARTICIPANTS: There were 1217 patients with pressure ulcers in ET nursing documentation who were linked to a corresponding record in the DAD to validate the DAD for correct and accurate identification of pressure ulcer occurrence, using two case definitions for pressure ulcer. RESULTS: Using pressure ulcer definition 1 (7 codes), prevalence was 1.4%, and using definition 2 (29 codes), prevalence was 4.2% after adjusting for misclassifications. The results were lower than expected. Definition 1 sensitivity was 27.7% and specificity was 98.8%, while definition 2 sensitivity was 32.8% and specificity was 95.9%. Pressure ulcer in both the DAD and ET consultation increased with age, number of comorbidities and length of stay. CONCLUSION: The DAD underestimates pressure ulcer prevalence. Since various codes are used to record pressure ulcers in the DAD, the case definition with more codes captures more pressure ulcer cases, and may be useful for monitoring facility trends.
However, low sensitivity suggests that this data source may not be accurate for determining overall prevalence, and should be cautiously compared with other prevalence studies.
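The sensitivity and specificity figures above come from comparing DAD coding against the nursing consult reference. The abstract does not report the underlying cell counts, so the sketch below uses made-up counts purely to show how the two measures are defined; the function names and all numbers in the example are hypothetical.

```python
def sensitivity(true_pos, false_neg):
    """Share of reference-standard cases that the coded data also flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Share of reference-standard non-cases that the coded data leaves unflagged."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts, NOT from the study: 100 true cases, 1000 non-cases.
sens = sensitivity(true_pos=30, false_neg=70)
spec = specificity(true_neg=950, false_pos=50)
```

In this toy example the coded data finds 30% of true cases while correctly leaving 95% of non-cases unflagged, the same low-sensitivity, high-specificity pattern the study reports for both case definitions.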
Project description: Research transparency, reproducibility, and data sharing uphold core principles of science at a time when the integrity of scientific research is being questioned. This article discusses how research data in psychology can be made accessible for reproducibility and reanalysis by describing practical ways to overcome barriers to data sharing. We examine key issues surrounding the sharing of data such as who owns research data, how to protect the confidentiality of the research participant, how to give appropriate credit to the data creator, how to deal with metadata and codebooks, how to address provenance, and other specifics such as versioning and file formats. The protection of research subjects is a fundamental obligation, and we explain frameworks and procedures designed to protect against the harms that may result from disclosure of confidential information. We also advocate greater recognition for data creators and the authors of program code used in the management and analysis of data. We argue that research data and program code are important scientific contributions that should be cited in the same way as publications.