JCD-DEA: a joint covariate detection tool for differential expression analysis on tumor expression profiles.
ABSTRACT: BACKGROUND: Differential expression analysis on tumor expression profiles has always been a key issue for subsequent biological experimental validation. It is important to select the features that best discriminate between different groups of patients. Despite the emergence of multivariate analysis approaches, prevailing feature selection methods primarily focus on multiple hypothesis testing on individual variables, and then combine them for an explanatory result. Moreover, these methods, which are commonly based on hypothesis testing, view classification as a posterior validation of the selected variables. RESULTS: Based on the previously proposed A5 feature selection strategy, we develop a joint covariate detection tool for differential expression analysis on tumor expression profiles. This software combines hypothesis testing with testing according to classification results. A model selection approach based on a Gaussian mixture model is introduced for automatic selection of features. In addition, a projection heatmap is proposed for the first time. CONCLUSIONS: Joint covariate detection strengthens the viewpoint for selecting variables which are not only individually but also jointly significant. Experiments on simulated and real data show the effectiveness of the developed software, which enhances the reliability of joint covariate detection for differential expression analysis on tumor expression profiles. The software is available at http://bio-nefu.com/resource/jcd-dea .
Project description:In linear regression models with high dimensional data, the classical z-test (or t-test) for testing the significance of each single regression coefficient is no longer applicable. This is mainly because the number of covariates exceeds the sample size. In this paper, we propose a simple and novel alternative by introducing the Correlated Predictors Screening (CPS) method to control for predictors that are highly correlated with the target covariate. Accordingly, the classical ordinary least squares approach can be employed to estimate the regression coefficient associated with the target covariate. In addition, we demonstrate that the resulting estimator is consistent and asymptotically normal even if the random errors are heteroscedastic. This enables us to apply the z-test to assess the significance of each covariate. Based on the p-value obtained from testing the significance of each covariate, we further conduct multiple hypothesis testing by controlling the false discovery rate at the nominal level. Then, we show that the multiple hypothesis testing achieves consistent model selection. Simulation studies and empirical examples are presented to illustrate the finite sample performance and the usefulness of the proposed method, respectively.
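The final step described above, turning per-covariate p-values into a selected set while controlling the false discovery rate, can be sketched as follows. The abstract does not name the FDR procedure, so this sketch assumes the Benjamini-Hochberg step-up rule; the function name `benjamini_hochberg` and the example p-values are illustrative and not taken from the paper.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected.

    Assumption: the Benjamini-Hochberg step-up rule is used for FDR control;
    the source abstract does not specify which procedure it applies.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the hypotheses, ordered by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k whose ordered p-value satisfies p_(k) <= k * alpha / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, one per covariate.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
selected = [i for i, r in enumerate(benjamini_hochberg(p, alpha=0.05)) if r]
```

With these illustrative values only the two smallest p-values pass their rank-dependent thresholds, so `selected` holds the indices of the first two covariates.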
Project description:BACKGROUND: In gene expression analysis, statistical tests for differential gene expression provide lists of candidate genes having, individually, a sufficiently low p-value. However, the interpretation of each single p-value within complex systems involving several interacting genes is problematic. In parallel, in the last sixty years, game theory has been applied to political and social problems to assess the power of interacting agents in forcing a decision and, more recently, to represent the relevance of genes in response to certain conditions. RESULTS: In this paper we introduce a Bootstrap procedure to test the null hypothesis that each gene has the same relevance between two conditions, where the relevance is represented by the Shapley value of a particular coalitional game defined on a microarray data-set. This method, which is called Comparative Analysis of Shapley value (CASh for short), is applied to data concerning the gene expression in children differentially exposed to air pollution. The results provided by CASh are compared with the results from a parametric statistical test for testing differential gene expression. Both lists of genes provided by CASh and t-test are informative enough to discriminate exposed subjects on the basis of their gene expression profiles. While many genes are selected in common by CASh and the parametric test, it turns out that the biological interpretation of the differences between these two selections is more interesting, suggesting a different interpretation of the main biological pathways in gene expression regulation for exposed individuals. A simulation study suggests that CASh offers more power than t-test for the detection of differential gene expression variability. CONCLUSION: CASh is successfully applied to gene expression analysis of a data-set where the joint expression behavior of genes may be critical to characterize the expression response to air pollution. We demonstrate a synergistic effect between coalitional games and statistics that resulted in a selection of genes with a potential impact in the regulation of complex pathways.
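To make the notion of "relevance as a Shapley value" concrete, the following minimal sketch computes exact Shapley values for a small coalitional game by averaging each player's marginal contribution over all orderings. The three-gene game used here is a hypothetical toy example, not the microarray game from the paper, and the bootstrap comparison between conditions is not reproduced.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: mean marginal contribution over all player orderings.

    `value` maps a frozenset of players to a number. Enumerating orderings is
    only feasible for a handful of players.
    """
    totals = {p: 0.0 for p in players}
    n_orderings = 0
    for ordering in permutations(players):
        coalition = frozenset()
        for p in ordering:
            with_p = coalition | {p}
            # Marginal contribution of p when joining the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
        n_orderings += 1
    return {p: t / n_orderings for p, t in totals.items()}


def toy_game(coalition):
    """Hypothetical game: a coalition 'wins' (value 1) if it contains g1
    together with at least one of g2 or g3."""
    return 1.0 if "g1" in coalition and ({"g2", "g3"} & coalition) else 0.0


relevance = shapley_values(["g1", "g2", "g3"], toy_game)
```

In this toy game `g1` receives a Shapley value of 2/3 and `g2` and `g3` receive 1/6 each, which reflects that `g1` is required in every winning coalition while the other two are interchangeable.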
Project description:There is mounting evidence that complex human phenotypes are highly polygenic, with many loci harboring multiple causal variants, yet most genetic association studies examine each SNP in isolation. While this has led to the discovery of thousands of disease associations, discovered variants account for only a small fraction of disease heritability. Alternative multi-SNP methods have been proposed, but issues such as multiple-testing correction, sensitivity to genotyping error, and optimization for the underlying genetic architectures remain. Here we describe a local joint-testing procedure, complete with multiple-testing correction, that leverages a genetic phenomenon we call linkage masking wherein linkage disequilibrium between SNPs hides their signal under standard association methods. We show that local joint testing on the original Wellcome Trust Case Control Consortium (WTCCC) data set leads to the discovery of 22 associated loci, 5 more than the marginal approach. These loci were later found in follow-up studies containing thousands of additional individuals. We find that these loci significantly increase the heritability explained by genome-wide significant associations in the WTCCC data set. Furthermore, we show that local joint testing in a cis-expression QTL (eQTL) study of the gEUVADIS data set increases the number of genes containing significant eQTL by 10.7% over marginal analyses. Our multiple-hypothesis correction and joint-testing framework are available in a python software package called Jester, available at github.com/brielin/Jester.
Project description:Although many clinical and molecular factors have been reported to contribute to survival in glioblastoma, prevailing studies have been limited to partial or local feature selection for survival analysis. We proposed a feature selection strategy that includes not only joint covariate detection but also its evaluation, and applied it to miRNA expression profiles of glioblastoma. MiR-10b and miR-222 were selected as the most significant two-dimensional feature. Crucially, we integrated in vitro experiments on GBM cells and in vivo studies on a mouse model of human glioma to elucidate the synergistic effects between miR-10b and miR-222. Inhibition of miR-10b and miR-222 strongly suppressed GBM cell growth and invasion and induced apoptosis by co-targeting PTEN, ultimately leading to activation of p53. We also demonstrated that miR-10b and miR-222 co-target BIM to induce apoptosis independent of p53 status. The results define important roles for miR-10b and miR-222 in gliomagenesis and provide a reliable survival analysis strategy.
Project description:Meta-analyses that synthesize statistical evidence across studies have become important analytical tools for genetic studies. Inspired by the success of genome-wide association studies of the genetic main effect, researchers are searching for gene × environment interactions. Confounders are routinely included in the genome-wide gene × environment interaction analysis as covariates; however, this does not control for any confounding effects on the results if covariate × environment interactions are present. We carried out simulation studies to evaluate the robustness to the covariate × environment confounder for meta-regression and joint meta-analysis, which are two commonly used meta-analysis methods for testing the gene × environment interaction or the genetic main effect and interaction jointly. Here we show that meta-regression is robust to the covariate × environment confounder while joint meta-analysis is subject to the confounding effect with inflated type I error rates. Given vast sample sizes employed in genome-wide gene × environment interaction studies, non-significant covariate × environment interactions at the study level could substantially elevate the type I error rate at the consortium level. When covariate × environment confounders are present, type I errors can be controlled in joint meta-analysis by including the covariate × environment terms in the analysis at the study level. Alternatively, meta-regression can be applied, which is robust to potential covariate × environment confounders.
Project description:We consider a semiparametric regression model that relates a normal outcome to covariates and a genetic pathway, where the covariate effects are modeled parametrically and the pathway effect of multiple gene expressions is modeled parametrically or nonparametrically using least-squares kernel machines (LSKMs). This unified framework allows a flexible function for the joint effect of multiple genes within a pathway by specifying a kernel function and allows for the possibility that each gene expression effect might be nonlinear and the genes within the same pathway are likely to interact with each other in a complicated way. This semiparametric model also makes it possible to test for the overall genetic pathway effect. We show that the LSKM semiparametric regression can be formulated using a linear mixed model. Estimation and inference hence can proceed within the linear mixed model framework using standard mixed model software. Both the regression coefficients of the covariate effects and the LSKM estimator of the genetic pathway effect can be obtained using the best linear unbiased predictor in the corresponding linear mixed model formulation. The smoothing parameter and the kernel parameter can be estimated as variance components using restricted maximum likelihood. A score test is developed to test for the genetic pathway effect. Model/variable selection within the LSKM framework is discussed. The methods are illustrated using a prostate cancer data set and evaluated using simulations.
Project description:Pathway and gene set-based approaches for the analysis of gene expression profiling experiments have become increasingly popular for addressing problems associated with individual gene analysis. Since most genes are not differentially expressed, existing gene set tests, which consider all the genes within a gene set, are subject to considerable noise and power loss, a concern exacerbated in studies in which the degree of differential expression is moderate for truly differentially expressed genes. For a significantly differentially expressed pathway, it is also of substantial interest to select important genes that drive the differential expression of the pathway. We develop a unified framework to jointly test the significance of a pathway and to select a subset of genes that drive the significant pathway effect. To achieve dimension reduction and gene selection, we decompose each gene pathway into a single score by using a regularized form of linear discriminant analysis, called sparse linear discriminant analysis (sLDA). Testing for the significance of the pathway effect proceeds via permutation of the sLDA score. The sLDA-based test is compared with competing approaches with simulations and two applications: a study on the effect of metal fume exposure on immune response and a study of gene expression profiles among Type II Diabetes patients. Our results show that sLDA-based testing provides a powerful approach to test for the significance of a differentially expressed pathway and gene selection. An implementation of the proposed sLDA-based pathway test in the R statistical computing environment is available at http://www.hsph.harvard.edu/~mwu/software/. Supplementary data are available at Bioinformatics online.
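The permutation step of such a test can be sketched as below. This is only an illustration under stated assumptions: the one-dimensional pathway score per sample is taken as given (computing the sLDA score itself is not shown), the test statistic is assumed to be the absolute difference in group means, and the name `permutation_p_value` and the example data are hypothetical.

```python
import random


def permutation_p_value(scores, labels, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means of a score.

    Assumptions: `scores` is a precomputed one-dimensional score per sample,
    `labels` holds 0/1 group membership, and the statistic is the absolute
    difference in group means (not necessarily the statistic used in the paper).
    """
    def mean_diff(lab):
        g1 = [s for s, l in zip(scores, lab) if l == 1]
        g0 = [s for s, l in zip(scores, lab) if l == 0]
        return abs(sum(g1) / len(g1) - sum(g0) / len(g0))

    rng = random.Random(seed)
    observed = mean_diff(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        # Shuffling the labels breaks any association between score and group.
        rng.shuffle(shuffled)
        if mean_diff(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical scores: five samples per group, clearly separated.
scores = [0.1, 0.2, 0.3, 0.15, 0.25, 5.1, 5.2, 5.3, 5.15, 5.25]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
p_value = permutation_p_value(scores, labels)
```

A fixed seed makes the estimate reproducible; in practice the number of permutations limits the smallest p-value that can be reported, which is 1 / (n_perm + 1) here.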
Project description:In gene selection for cancer classification using microarray data, we define an eigenvalue-ratio statistic to measure a gene's contribution to the joint discriminability when this gene is added to a set of genes. Based on this eigenvalue-ratio statistic, we define a novel hypothesis test for gene statistical redundancy and propose two gene selection methods. Simulation studies illustrate the agreement between statistical redundancy testing and gene selection methods. Real data examples show that the proposed gene selection methods can select a compact gene subset that not only can be used to build high-quality cancer classifiers but also shows biological relevance.
Project description:The challenges of successfully applying causal inference methods include: (i) satisfying underlying assumptions, (ii) limitations in data/models accommodated by the software and (iii) low power of common multiple testing approaches. The causal inference test (CIT) is based on hypothesis testing rather than estimation, allowing the testable assumptions to be evaluated in the determination of statistical significance. A user-friendly software package provides P-values and optionally permutation-based FDR estimates (q-values) for potential mediators. It can handle single and multiple binary and continuous instrumental variables, binary or continuous outcome variables and adjustment covariates. Also, the permutation-based FDR option provides a non-parametric implementation. Simulation studies demonstrate the validity of the cit package and show a substantial advantage of permutation-based FDR over other common multiple testing strategies. The cit open-source R package is freely available from the CRAN website (https://cran.r-project.org/web/packages/cit/index.html) with embedded C++ code that utilizes the GNU Scientific Library, also freely available (http://www.gnu.org/software/gsl/). Supplementary data are available at Bioinformatics online.
Project description:Profile regression is a Bayesian statistical approach designed for investigating the joint effect of multiple risk factors. It reduces dimensionality by using as its main unit of inference the exposure profiles of the subjects, that is, the sequence of covariate values that correspond to each subject. We applied profile regression to a case-control study of lung cancer in nonsmokers, nested within the European Prospective Investigation into Cancer and Nutrition (EPIC) cohort, to estimate the combined effect of environmental carcinogens and to explore possible gene-environment interactions. We tailored and extended the profile regression approach to the analysis of case-control studies, allowing for the analysis of ordinal data and the computation of posterior odds ratios. We compared and contrasted our results with those obtained using standard logistic regression and classification tree methods, including multifactor dimensionality reduction. Profile regression strengthened previous observations in other study populations on the role of air pollutants, particularly particulate matter ≤ 10 μm in aerodynamic diameter (PM10), in lung cancer for nonsmokers. Covariates including living on a main road, exposure to PM10 and nitrogen dioxide, and carrying out manual work characterized high-risk subject profiles. Such combinations of risk factors were consistent with a priori expectations. In contrast, other methods gave less interpretable results. We conclude that profile regression is a powerful tool for identifying risk profiles that express the joint effect of etiologically relevant variables in multifactorial diseases.