Unsupervised Anomaly Detection of Healthcare Providers Using Generative Adversarial Networks
ABSTRACT: Healthcare fraud is a challenge for many societies. Healthcare funding that could be spent on medicine, care for the elderly or emergency room visits is instead lost to fraudulent activities by materialistic practitioners or patients, making fraud a major contributor to rising healthcare costs. This study evaluates previous anomaly detection machine learning models and proposes an unsupervised framework to identify anomalies using a generative adversarial network (GAN) model. The GAN anomaly detection (GAN-AD) model was applied to two different healthcare provider data sets. The anomalous healthcare providers were further analysed through the application of classification models, with the logistic regression and extreme gradient boosting models showing good performance. Results from SHapley Additive exPlanations (SHAP) also indicate that the predictors used explain the anomalous healthcare providers.
Project description:Anomaly detection involves identifying rare data instances (anomalies) that come from a different class or distribution than the majority (which are simply called "normal" instances). Given a training set of only normal data, the semi-supervised anomaly detection task is to identify anomalies in the future. Good solutions to this task have applications in fraud and intrusion detection. The unsupervised anomaly detection task is different: Given unlabeled, mostly-normal data, identify the anomalies among them. Many real-world machine learning tasks, including many fraud and intrusion detection tasks, are unsupervised because it is impractical (or impossible) to verify all of the training data. We recently presented FRaC, a new approach for semi-supervised anomaly detection. FRaC is based on using normal instances to build an ensemble of feature models, and then identifying instances that disagree with those models as anomalous. In this paper, we investigate the behavior of FRaC experimentally and explain why FRaC is so successful. We also show that FRaC is a superior approach for the unsupervised as well as the semi-supervised anomaly detection task, compared to well-known state-of-the-art anomaly detection methods, LOF and one-class support vector machines, and to an existing feature-modeling approach.
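The feature-modeling idea described above can be illustrated with a minimal sketch. This is not the FRaC implementation: it assumes linear least-squares feature models and Gaussian prediction errors, and the function names and toy data are hypothetical. Each feature is predicted from the others using normal data only, and an instance is scored by how surprising its features are under those models.

```python
import numpy as np

def fit_feature_models(X_normal):
    """For each feature, fit a least-squares linear model that predicts it from the others."""
    n, d = X_normal.shape
    models = []
    for j in range(d):
        others = np.delete(X_normal, j, axis=1)
        A = np.hstack([others, np.ones((n, 1))])  # append an intercept column
        coef, *_ = np.linalg.lstsq(A, X_normal[:, j], rcond=None)
        resid = X_normal[:, j] - A @ coef
        sigma = max(resid.std(), 1e-6)  # guard against a zero residual spread
        models.append((coef, sigma))
    return models

def anomaly_scores(models, X):
    """Sum, over features, the Gaussian surprisal of each feature's prediction error."""
    n, _ = X.shape
    scores = np.zeros(n)
    for j, (coef, sigma) in enumerate(models):
        others = np.delete(X, j, axis=1)
        A = np.hstack([others, np.ones((n, 1))])
        resid = X[:, j] - A @ coef
        scores += 0.5 * np.log(2 * np.pi * sigma**2) + resid**2 / (2 * sigma**2)
    return scores

# Toy "normal" data (assumption): three strongly correlated features.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X_train = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.1, size=200),
    -x + rng.normal(scale=0.1, size=200),
])
models = fit_feature_models(X_train)

# First row follows the learned relationships; second row violates them.
X_test = np.array([[1.0, 2.0, -1.0], [1.0, -2.0, 1.0]])
scores = anomaly_scores(models, X_test)
```

In this sketch the second test instance receives the higher score because its features disagree with the models learned from normal data.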
Project description:For complex machine learning (ML) algorithms to gain widespread acceptance in decision making, we must be able to identify the features driving the predictions. Explainability models allow transparency of ML algorithms; however, their reliability within high-dimensional data is unclear. To test the reliability of the explainability model SHapley Additive exPlanations (SHAP), we developed a convolutional neural network to predict tissue classification from Genotype-Tissue Expression (GTEx) RNA-seq data representing 16,651 samples from 47 tissues. Our classifier achieved an average F1 score of 96.1% on held-out GTEx samples. Using SHAP values, we identified the 2423 most discriminatory genes, of which 98.6% were also identified by differential expression analysis across all tissues. The SHAP genes reflected expected biological processes involved in tissue differentiation and function. Moreover, SHAP genes clustered tissue types with superior performance when compared to all genes, genes detected by differential expression analysis, or random genes. We demonstrate the utility and reliability of SHAP to explain a deep learning model and highlight the strengths of applying ML to transcriptome data.
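SHAP attributions are based on Shapley values. As a rough illustration of what such a value measures, the sketch below computes exact Shapley values by brute-force enumeration for a small toy model. It does not use the SHAP library (which relies on approximations for large models), and the toy model, instance, and baseline are assumptions made for the example.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every coalition of the other features."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley average.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                phi[i] += weight * (value_fn(coalition | {i}) - value_fn(coalition))
    return phi

# Hypothetical toy model: depends on features 0 and 1, ignores feature 2.
def toy_model(x):
    return 2.0 * x[0] + 1.0 * x[1]

instance = [1.0, 3.0, 5.0]
baseline = [0.0, 0.0, 0.0]

def value_fn(coalition):
    """Model output when features in the coalition take the instance's values
    and all other features are held at the baseline."""
    mixed = [instance[j] if j in coalition else baseline[j] for j in range(3)]
    return toy_model(mixed)

phi = shapley_values(value_fn, 3)
```

The attributions sum to the difference between the model output at the instance and at the baseline, and the unused feature receives an attribution of zero.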
Project description:Protein molecules are inherently dynamic and modulate their interactions with different molecular partners by accessing different tertiary structures under physiological conditions. Elucidating such structures remains challenging. Current momentum in deep learning and the powerful performance of generative adversarial networks (GANs) in complex domains, such as computer vision, inspires us to investigate GANs on their ability to generate physically-realistic protein tertiary structures. The analysis presented here shows that several GAN models fail to capture complex, distal structural patterns present in protein tertiary structures. The study additionally reveals that mechanisms touted as effective in stabilizing the training of a GAN model are not all effective, and that performance based on loss alone may be orthogonal to performance based on the quality of generated datasets. A novel contribution in this study is the demonstration that Wasserstein GAN strikes a good balance and manages to capture both local and distal patterns, thus presenting a first step towards more powerful deep generative models for exploring a possibly very diverse set of structures supporting diverse activities of a protein molecule in the cell.
Project description:BACKGROUND:Site-specific variations are challenges for pooling analyses in multi-center studies. This work aims to propose an inter-site harmonization method based on dual generative adversarial networks (GANs) for diffusion tensor imaging (DTI) derived metrics on neonatal brains. RESULTS:DTI-derived metrics (fractional anisotropy, FA; mean diffusivity, MD) are obtained on age-matched neonates without magnetic resonance imaging (MRI) abnormalities: 42 neonates from site 1 and 42 neonates from site 2. Significant inter-site differences of FA can be observed. The proposed harmonization approach and three conventional methods (the global-wise scaling, the voxel-wise scaling, and the ComBat) are performed on DTI-derived metrics from two sites. During the tract-based spatial statistics, inter-site differences can be removed by the proposed dual GANs method, the voxel-wise scaling, and the ComBat. Among these methods, the proposed method holds the lowest median values in absolute errors and root mean square errors. During the pooling analysis of two sites, Pearson correlation coefficients between FA and the postmenstrual age after harmonization are larger than those before harmonization. The effect sizes (Cohen's d between males and females) are also maintained by the harmonization procedure. CONCLUSIONS:The proposed dual GANs-based harmonization method is effective to harmonize neonatal DTI-derived metrics from different sites. Results in this study further suggest that the GANs-based harmonization is a feasible pre-processing method for pooling analyses in multi-center studies.
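The effect size mentioned above, Cohen's d, is the difference between two group means divided by a pooled standard deviation. A minimal sketch follows, assuming the common pooled-variance definition; the abstract does not state which variant was used, and the sample values are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation of two groups."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

# Illustrative values only (not data from the study).
d = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
```

A harmonization method that preserves biological differences should leave such an effect size roughly unchanged before and after harmonization.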
Project description:Single-cell RNA-sequencing (scRNA-seq) enables the characterization of transcriptomic profiles at the single-cell resolution with increasingly high throughput. However, it suffers from many sources of technical noise, including insufficient mRNA molecules that lead to excess false zero values, termed dropouts. Computational approaches have been proposed to recover the biologically meaningful expression by borrowing information from similar cells in the observed dataset. However, these methods suffer from oversmoothing and removal of natural cell-to-cell stochasticity in gene expression. Here, we propose the generative adversarial networks (GANs) for scRNA-seq imputation (scIGANs), which uses generated cells rather than observed cells to avoid these limitations and balances the performance between major and rare cell populations. Evaluations based on a variety of simulated and real scRNA-seq datasets show that scIGANs is effective for dropout imputation and enhances various downstream analyses. ScIGANs is robust to small datasets that have very few genes with low expression and/or cell-to-cell variance. ScIGANs works equally well on datasets from different scRNA-seq protocols and is scalable to datasets with over 100 000 cells. We demonstrated in many ways with compelling evidence that scIGANs is not only an application of GANs in omics data but also represents a competing imputation method for the scRNA-seq data.
Project description:Fluid flow characteristics are important to assess reservoir performance. Unfortunately, laboratory techniques are inadequate to know these characteristics, which is why numerical methods were developed. Such methods often use computed tomography (CT) scans as input but this technique is plagued by a resolution versus sample size trade-off. Therefore, a super-resolution method using generative adversarial neural networks (GANs) was used to artificially improve the resolution. Firstly, the influence of resolution on pore network properties and single-phase, unsaturated, and two-phase flow was analysed to verify that pores and pore throats become larger on average and surface area decreases with worsening resolution. These observations are reflected in increasingly overestimated single-phase permeability, less moisture uptake at lower capillary pressures, and high residual oil fraction after waterflooding. Therefore, the super-resolution GANs were developed which take low (12 µm) resolution input and increase the resolution to 4 µm, which is compared to the expected high-resolution output. These results better predicted pore network properties and fluid flow properties despite the overestimation of porosity. Relevant small pores and pore surfaces are better resolved thus providing better estimates of unsaturated and two-phase flow which can be heavily influenced by flow along pore boundaries and through smaller pores. This study presents the second case in which GANs were applied to a super-resolution problem on geological materials, but it is the first one to apply it directly on raw CT images and to determine the actual impact of a super-resolution method on fluid predictions.
Project description:Next-generation sequencing (NGS) technology has become a powerful tool for dissecting the molecular and pathological signatures of a variety of human diseases. However, the limited availability of biological samples from different disease stages is a major hurdle in studying disease progressions and identifying early pathological changes. Deep learning techniques have recently begun to be applied to analyze NGS data and thereby predict the progression of biological processes. In this study, we applied a deep learning technique called generative adversarial networks (GANs) to predict the molecular progress of Alzheimer's disease (AD). We successfully applied GANs to analyze RNA-seq data from a 5xFAD mouse model of AD, which recapitulates major AD features of massive amyloid-β (Aβ) accumulation in the brain. We examined how the generator is featured to have specific-sample generation and biological gene association. Based on the above observations, we suggested virtual disease progress by latent space interpolation to yield the transition curves of various genes with pathological changes from normal to AD state. By performing pathway analysis based on the transition curve patterns, we identified several pathological processes with progressive changes, such as inflammatory systems and synapse functions, which have previously been demonstrated to be involved in the pathogenesis of AD. Interestingly, our analysis indicates that alteration of cholesterol biosynthesis begins at a very early stage of AD, suggesting that it is the first effect to mediate the cholesterol metabolism of AD downstream of Aβ accumulation. Here, we suggest that GANs are a useful tool to study disease progression, leading to the identification of early pathological signatures.
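Latent space interpolation, as referred to above, can be sketched as follows. The example assumes simple linear interpolation between two latent vectors; the abstract does not specify the interpolation scheme, and `toy_generator` is a hypothetical stand-in for a trained GAN generator, not the model from the study.

```python
def interpolate(z_start, z_end, steps):
    """Return `steps` latent vectors evenly spaced on the line from z_start to z_end."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # runs from 0.0 (start) to 1.0 (end)
        path.append([a + t * (b - a) for a, b in zip(z_start, z_end)])
    return path

def toy_generator(z):
    """Hypothetical generator: maps a 2-D latent vector to two 'gene expression' values."""
    return [z[0] + z[1], z[0] - z[1]]

# Latent codes standing in for a "normal" and a "disease" sample (assumed values).
z_normal, z_disease = [0.0, 0.0], [1.0, 2.0]
latent_path = interpolate(z_normal, z_disease, 3)

# Passing each interpolated code through the generator gives one transition curve per gene.
transition_curves = [toy_generator(z) for z in latent_path]
```

Reading a single output dimension across `transition_curves` gives the simulated trajectory of one gene between the two states.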
Project description:MOTIVATION:Hi-C is a genome-wide technology for investigating 3D chromatin conformation by measuring physical contacts between pairs of genomic regions. The resolution of Hi-C data directly impacts the effectiveness and accuracy of downstream analysis such as identifying topologically associating domains (TADs) and meaningful chromatin loops. High resolution Hi-C data are valuable resources which implicate the relationship between 3D genome conformation and function, especially linking distal regulatory elements to their target genes. However, high resolution Hi-C data across various tissues and cell types are not always available due to the high sequencing cost. It is therefore indispensable to develop computational approaches for enhancing the resolution of Hi-C data. RESULTS:We proposed hicGAN, an open-sourced framework, for inferring high resolution Hi-C data from low resolution Hi-C data with generative adversarial networks (GANs). To the best of our knowledge, this is the first study to apply GANs to 3D genome analysis. We demonstrate that hicGAN effectively enhances the resolution of low resolution Hi-C data by generating matrices that are highly consistent with the original high resolution Hi-C matrices. A typical scenario of usage for our approach is to enhance low resolution Hi-C data in new cell types, especially where the high resolution Hi-C data are not available. Our study not only presents a novel approach for enhancing Hi-C data resolution, but also provides fascinating insights into disclosing complex mechanism underlying the formation of chromatin contacts. AVAILABILITY AND IMPLEMENTATION:We release hicGAN as an open-sourced software at https://github.com/kimmo1019/hicGAN. SUPPLEMENTARY INFORMATION:Supplementary data are available at Bioinformatics online.
Project description:Antimicrobial peptides are a potential solution to the threat of multidrug-resistant bacterial pathogens. Recently, deep generative models including generative adversarial networks (GANs) have been shown to be capable of designing new antimicrobial peptides. Intuitively, a GAN controls the probability distribution of generated sequences to cover active peptides as much as possible. This paper presents a peptide-specialized model called PepGAN that strikes a balance between covering active peptides and dodging nonactive peptides. As a result, PepGAN has superior statistical fidelity with respect to physicochemical descriptors including charge, hydrophobicity, and weight. The top six peptides were synthesized, and one of them was confirmed to be highly antimicrobial. The minimum inhibitory concentration was 3.1 µg/mL, indicating that the peptide is twice as strong as ampicillin.
Project description:BACKGROUND:Cardiac surgery-associated acute kidney injury (CSA-AKI) is a major complication that results in increased morbidity and mortality after cardiac surgery. Most established prediction models are limited to the analysis of nonlinear relationships and fail to fully consider intraoperative variables, which represent the acute response to surgery. Therefore, this study utilized an artificial intelligence-based machine learning approach through perioperative data-driven learning to predict CSA-AKI. METHODS:A total of 671 patients undergoing cardiac surgery from August 2016 to August 2018 were enrolled. AKI following cardiac surgery was defined according to criteria from Kidney Disease: Improving Global Outcomes (KDIGO). The variables used for analysis included demographic characteristics, clinical condition, preoperative biochemistry data, preoperative medication, and intraoperative variables such as time-series hemodynamic changes. The machine learning methods used included logistic regression, support vector machine (SVM), random forest (RF), extreme gradient boosting (XGBoost), and ensemble (RF + XGBoost). The performance of these models was evaluated using the area under the receiver operating characteristic curve (AUC). We also utilized SHapley Additive exPlanation (SHAP) values to explain the prediction model. RESULTS:Development of CSA-AKI was noted in 163 patients (24.3%) during the first postoperative week. Regarding the efficacy of the single model that most accurately predicted the outcome, RF exhibited the greatest AUC (0.839, 95% confidence interval [CI] 0.772-0.898), whereas the AUC (0.843, 95% CI 0.778-0.899) of the ensemble model (RF + XGBoost) was even greater than that of the RF model alone. The top 3 most influential features in the RF importance matrix plot were intraoperative urine output, units of packed red blood cells (pRBCs) transfused during surgery, and preoperative hemoglobin level. The SHAP summary plot was used to illustrate the positive or negative effects of the top 20 features attributed to the RF. We also used the SHAP dependence plot to explain how a single feature affects the output of the RF prediction model. CONCLUSIONS:In this study, machine learning methods were successfully established to predict CSA-AKI, which determines risks following cardiac surgery, enabling the optimization of postoperative treatment strategies to minimize the postoperative complications following cardiac surgeries.
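The AUC used to compare the models above can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. The sketch below computes it directly from that pairwise definition. It is an illustration only: the labels and scores are invented, and the study's own tooling and confidence-interval method are not specified in the abstract.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Illustrative labels (1 = event, 0 = no event) and predicted risks.
example_labels = [0, 0, 1, 1]
example_scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(example_labels, example_scores)
```

This pairwise loop is quadratic in the number of cases, which is acceptable for a cohort of a few hundred patients; rank-based formulas give the same value more efficiently on larger data.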