Bayesian pathway analysis of cancer microarray data.
ABSTRACT: High Throughput Biological Data (HTBD) requires detailed analysis methods and from a life science perspective, these analysis results make most sense when interpreted within the context of biological pathways. Bayesian Networks (BNs) capture both linear and nonlinear interactions and handle stochastic events in a probabilistic framework accounting for noise making them viable candidates for HTBD analysis. We have recently proposed an approach, called Bayesian Pathway Analysis (BPA), for analyzing HTBD using BNs in which known biological pathways are modeled as BNs and pathways that best explain the given HTBD are found. BPA uses the fold change information to obtain an input matrix to score each pathway modeled as a BN. Scoring is achieved using the Bayesian-Dirichlet Equivalent method and significance is assessed by randomization via bootstrapping of the columns of the input matrix. In this study, we improve on the BPA system by optimizing the steps involved in "Data Preprocessing and Discretization", "Scoring", "Significance Assessment", and "Software and Web Application". We tested the improved system on synthetic data sets and achieved over 98% accuracy in identifying the active pathways. The overall approach was applied on real cancer microarray data sets in order to investigate the pathways that are commonly active in different cancer types. We compared our findings on the real data sets with a relevant approach called the Signaling Pathway Impact Analysis (SPIA).
Project description:Gene-gene epistatic interactions likely play an important role in the genetic basis of many common diseases. Recently, machine-learning and data mining methods have been developed for learning epistatic relationships from data. A well-known combinatorial method that has been successfully applied for detecting epistasis is Multifactor Dimensionality Reduction (MDR). Jiang et al. created a combinatorial epistasis learning method called BNMBL to learn Bayesian network (BN) epistatic models. They compared BNMBL to MDR using simulated data sets. Each of these data sets was generated from a model that associates two SNPs with a disease and includes 18 unrelated SNPs. For each data set, BNMBL and MDR were used to score all 2-SNP models, and BNMBL learned significantly more correct models. In real data sets, we ordinarily do not know the number of SNPs that influence phenotype. BNMBL may not perform as well if we also scored models containing more than two SNPs. Furthermore, a number of other BN scoring criteria have been developed. They may detect epistatic interactions even better than BNMBL.Although BNs are a promising tool for learning epistatic relationships from data, we cannot confidently use them in this domain until we determine which scoring criteria work best or even well when we try learning the correct model without knowledge of the number of SNPs in that model.We evaluated the performance of 22 BN scoring criteria using 28,000 simulated data sets and a real Alzheimer's GWAS data set. Our results were surprising in that the Bayesian scoring criterion with large values of a hyperparameter called ? performed best. This score performed better than other BN scoring criteria and MDR at recall using simulated data sets, at detecting the hardest-to-detect models using simulated data sets, and at substantiating previous results using the real Alzheimer's data set.We conclude that representing epistatic interactions using BN models and scoring them using a BN scoring criterion holds promise for identifying epistatic genetic variants in data. In particular, the Bayesian scoring criterion with large values of a hyperparameter ? appears more promising than a number of alternatives.
Project description:Gene expression class comparison studies may identify hundreds or thousands of genes as differentially expressed (DE) between sample groups. Gaining biological insight from the result of such experiments can be approached, for instance, by identifying the signaling pathways impacted by the observed changes. Most of the existing pathway analysis methods focus on either the number of DE genes observed in a given pathway (enrichment analysis methods), or on the correlation between the pathway genes and the class of the samples (functional class scoring methods). Both approaches treat the pathways as simple sets of genes, disregarding the complex gene interactions that these pathways are built to describe.We describe a novel signaling pathway impact analysis (SPIA) that combines the evidence obtained from the classical enrichment analysis with a novel type of evidence, which measures the actual perturbation on a given pathway under a given condition. A bootstrap procedure is used to assess the significance of the observed total pathway perturbation. Using simulations we show that the evidence derived from perturbations is independent of the pathway enrichment evidence. This allows us to calculate a global pathway significance P-value, which combines the enrichment and perturbation P-values. We illustrate the capabilities of the novel method on four real datasets. The results obtained on these data show that SPIA has better specificity and more sensitivity than several widely used pathway analysis methods.SPIA was implemented as an R package available at http://vortex.cs.wayne.edu/ontoexpress/
Project description:Aberrant intracellular signaling plays an important role in many diseases. The causal structure of signal transduction networks can be modeled as Bayesian Networks (BNs), and computationally learned from experimental data. However, learning the structure of Bayesian Networks (BNs) is an NP-hard problem that, even with fast heuristics, is too time consuming for large, clinically important networks (20-50 nodes). In this paper, we present a novel graphics processing unit (GPU)-accelerated implementation of a Monte Carlo Markov Chain-based algorithm for learning BNs that is up to 7.5-fold faster than current general-purpose processor (GPP)-based implementations. The GPU-based implementation is just one of several implementations within the larger application, each optimized for a different input or machine configuration. We describe the methodology we use to build an extensible application, assembled from these variants, that can target a broad range of heterogeneous systems, e.g., GPUs, multicore GPPs. Specifically we show how we use the Merge programming model to efficiently integrate, test and intelligently select among the different potential implementations.
Project description:BACKGROUND:Phenotype prediction problems are usually considered ill-posed, as the amount of samples is very limited with respect to the scrutinized genetic probes. This fact complicates the sampling of the defective genetic pathways due to the high number of possible discriminatory genetic networks involved. In this research, we outline three novel sampling algorithms utilized to identify, classify and characterize the defective pathways in phenotype prediction problems, such as the Fisher's ratio sampler, the Holdout sampler and the Random sampler, and apply each one to the analysis of genetic pathways involved in tumor behavior and outcomes of triple negative breast cancers (TNBC). Altered biological pathways are identified using the most frequently sampled genes and are compared to those obtained via Bayesian Networks (BNs). RESULTS:Random, Fisher's ratio and Holdout samplers were more accurate and robust than BNs, while providing comparable insights about disease genomics. CONCLUSIONS:The three samplers tested are good alternatives to Bayesian Networks since they are less computationally demanding algorithms. Importantly, this analysis confirms the concept of "biological invariance" since the altered pathways should be independent of the sampling methodology and the classifier used for their inference. Nevertheless, still some modifications are needed in the Bayesian networks to be able to sample correctly the uncertainty space in phenotype prediction problems, since the probabilistic parameterization of the uncertainty space is not unique and the use of the optimum network might falsify the pathways analysis.
Project description:Common inflammatome gene signatures as well as disease-specific signatures were identified by analyzing 12 expression profiling data sets derived from 9 different tissues isolated from 11 rodent inflammatory disease models. The inflammatome signature significantly overlaps with known drug targets and co-expressed gene modules linked to metabolic disorders and cancer. A large proportion of genes in this signature are tightly connected in tissue-specific Bayesian networks (BNs) built from multiple independent mouse and human cohorts. Both the inflammatome signature and the corresponding consensus BNs are highly enriched for immune response-related genes supported as causal for adiposity, adipokine, diabetes, aortic lesion, bone, muscle, and cholesterol traits, suggesting the causal nature of the inflammatome for a variety of diseases. Integration of this inflammatome signature with the BNs uncovered 151 key drivers that appeared to be more biologically important than the non-drivers in terms of their impact on disease phenotypes. The identification of this inflammatome signature, its network architecture, and key drivers not only highlights the shared etiology but also pinpoints potential targets for intervention of various common diseases.
Project description:SUMMARY: Bayesian Networks (BNs) are versatile probabilistic models applicable to many different biological phenomena. In biological applications the structure of the network is usually unknown and needs to be inferred from experimental data. BNFinder is a fast software implementation of an exact algorithm for finding the optimal structure of the network given a number of experimental observations. Its second version, presented in this article, represents a major improvement over the previous version. The improvements include (i) a parallelized learning algorithm leading to an order of magnitude speed-ups in BN structure learning time; (ii) inclusion of an additional scoring function based on mutual information criteria; (iii) possibility of choosing the resulting network specificity based on statistical criteria and (iv) a new module for classification by BNs, including cross-validation scheme and classifier quality measurements with receiver operator characteristic scores. AVAILABILITY AND IMPLEMENTATION: BNFinder2 is implemented in python and freely available under the GNU general public license at the project Web site https://launchpad.net/bnfinder, together with a user's manual, introductory tutorial and supplementary methods.
Project description:Analysis of gene sets has been widely applied in various high-throughput biological studies. One weakness in the traditional methods is that they neglect the heterogeneity of genes expressions in samples which may lead to the omission of some specific and important gene sets. It is also difficult for them to reflect the severities of disease and provide expression profiles of gene sets for individuals. We developed an application software called IGSA that leverages a powerful analytical capacity in gene sets enrichment and samples clustering. IGSA calculates gene sets expression scores for each sample and takes an accumulating clustering strategy to let the samples gather into the set according to the progress of disease from mild to severe. We focus on gastric, pancreatic and ovarian cancer data sets for the performance of IGSA. We also compared the results of IGSA in KEGG pathways enrichment with David, GSEA, SPIA, ssGSEA and analyzed the results of IGSA clustering and different similarity measurement methods. Notably, IGSA is proved to be more sensitive and specific in finding significant pathways, and can indicate related changes in pathways with the severity of disease. In addition, IGSA provides with significant gene sets profile for each sample.
Project description:BACKGROUND:Patients who were diagnosed with hematologic malignancies (HM) had a higher risk of acute kidney injury (AKI). This study applies the Bayesian networks (BNs) to investigate the interrelationships between AKI and its risk factors among HM patients, and to evaluate the predictive and inferential ability of BNs model in different clinical settings. METHODS:During 2014 and 2015, a total of 2501 inpatients with HM were recruited in this retrospective study conducted in a tertiary hospital, Shanghai of China. Patients' demographics, medical history, clinical and laboratory records on admission were extracted from the electronic medical records. Candidate predictors of AKI were screened in the group-LASSO (gLASSO) regression, and then they were incorporated into BNs analysis for further interrelationship modeling and disease prediction. RESULTS:Of 2395 eligible patients with HM, 370 episodes were diagnosed with AKI (15.4%). Patients with multiple myeloma (24.1%) and leukemia (23.9%) had higher incidences of AKI, followed by lymphoma (13.4%). Screened by the gLASSO regression, variables as age, gender, diabetes, HM category, anti-tumor treatment, hemoglobin, serum creatinine (SCr), the estimated glomerular filtration rate (eGFR), serum uric acid, serum sodium and potassium level were found with significant associations with the occurrence of AKI. Through BNs analysis, age, hemoglobin, eGFR, serum sodium and potassium had directed connections with AKI. HM category and anti-tumor treatment were indirectly linked to AKI via hemoglobin and eGFR, and diabetes was connected with AKI by affecting eGFR level. BNs inferences concluded that when poor eGFR, anemia and hyponatremia occurred simultaneously, the patients' probability of AKI was up to 78.5%. The area under the receiver operating characteristic curve (AUC) of BNs model was 0.835, higher than that in the logistic score model (0.763). It also showed a robust performance in 10-fold cross-validation (AUC: 0.812). CONCLUSION:Bayesian networks can provide a novel perspective to reveal the intrinsic connections between AKI and its risk factors in HM patients. The BNs predictive model could help us to calculate the probability of AKI at the individual level, and follow the tide of e-alert and big-data realize the early detection of AKI.
Project description:Pathway analysis is a common approach to gain insight from biological experiments. Signaling-pathway impact analysis (SPIA) is one such method and combines both the classical enrichment analysis and the actual perturbation on a given pathway. Because this method focuses on a single pathway, its resolution generally is not very high because the differentially expressed genes may be enriched in a local region of the pathway. In the present work, to identify cancer-related pathways, we incorporated a recent subpathway analysis method into the SPIA method to form the "sub-SPIA method." The original subpathway analysis uses the k-clique structure to define a subpathway. However, it is not sufficiently flexible to capture subpathways with complex structure and usually results in many overlapping subpathways. We therefore propose using the minimal-spanning-tree structure to find a subpathway. We apply this approach to colorectal cancer and lung cancer datasets, and our results show that sub-SPIA can identify many significant pathways associated with each specific cancer that other methods miss. Based on the entire pathway network in the Kyoto Encyclopedia of Genes and Genomes, we find that the pathways identified by sub-SPIA not only have the largest average degree, but also are more closely connected than those identified by other methods. This result suggests that the abnormality signal propagating through them might be responsible for the specific cancer or disease.
Project description:Structure learning of Bayesian Networks (BNs) is an important topic in machine learning. Driven by modern applications in genetics and brain sciences, accurate and efficient learning of large-scale BN structures from high-dimensional data becomes a challenging problem. To tackle this challenge, we propose a Sparse Bayesian Network (SBN) structure learning algorithm that employs a novel formulation involving one L1-norm penalty term to impose sparsity and another penalty term to ensure that the learned BN is a Directed Acyclic Graph--a required property of BNs. Through both theoretical analysis and extensive experiments on 11 moderate and large benchmark networks with various sample sizes, we show that SBN leads to improved learning accuracy, scalability, and efficiency as compared with 10 existing popular BN learning algorithms. We apply SBN to a real-world application of brain connectivity modeling for Alzheimer's disease (AD) and reveal findings that could lead to advancements in AD research.