Project description: Motivation: Gaussian graphical models (GGMs) are network representations of random variables (as nodes) and their partial correlations (as edges). GGMs overcome the challenges of high-dimensional data analysis by using shrinkage methodologies, which has made them useful for reconstructing gene regulatory networks from gene-expression profiles. However, it is often ignored that the partial correlations are 'shrunk' and that they cannot be compared or assessed directly. Accurate (differential) network analyses therefore need to account for the number of variables, the sample size and the shrinkage value; otherwise the analysis and its biological interpretation will be biased. To date, there are no appropriate methods that account for these factors and address these issues. Results: We derive the statistical properties of the partial correlation obtained with the Ledoit-Wolf shrinkage. Our result provides a toolbox for (differential) network analysis: (i) confidence intervals, (ii) a test for zero partial correlation (null effects) and (iii) a test to compare partial correlations. These novel (parametric) methods account for the number of variables, the sample size and the shrinkage value. They are also computationally fast, simple to implement and require only basic statistical knowledge. Our simulations show that the novel tests perform better than DiffNetFDR, a recently published alternative, in terms of the trade-off between true and false positives. The methods are demonstrated on synthetic data and on two gene-expression datasets from Escherichia coli and Mus musculus. Availability and implementation: The R package with the methods and the R script with the analysis are available at https://github.com/V-Bernal/GeneNetTools. Supplementary information: Supplementary data are available at Bioinformatics online.
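To make the quantity under test concrete, the following minimal sketch (base R, not the GeneNetTools API) computes partial correlations from a Ledoit-Wolf-style shrunk correlation matrix; the shrinkage value lambda is fixed by hand here rather than estimated, and the data are pure noise.

```r
# Minimal sketch (not the GeneNetTools API): 'shrunk' partial correlations.
set.seed(1)
n <- 20; p <- 50                        # fewer samples than variables
X <- matrix(rnorm(n * p), n, p)
R <- cor(X)                             # sample correlation, rank-deficient
lambda <- 0.2                           # shrinkage value (assumed, not estimated)
R_shrunk <- (1 - lambda) * R + lambda * diag(p)   # shrink toward the identity
Omega <- solve(R_shrunk)                # invertible thanks to the shrinkage
pcor <- -cov2cor(Omega)                 # scaled negative inverse = partial cor.
diag(pcor) <- 1
pcor[1:3, 1:3]                          # the 'shrunk' edges a test must assess
```

Note that the off-diagonal entries of pcor depend on n, p and lambda, which is exactly why they cannot be assessed with a test that ignores the shrinkage.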
Project description: Mixed models: Models that allow for continuous heterogeneity, by assuming that the values of one or more parameters follow a specified distribution, have become increasingly popular. This is known as 'mixing' parameters, and it is standard practice among researchers (and the default option in many statistical programs) to base test statistics for mixed models on simulations using asymmetric draws (e.g. Halton draws). PROBLEM 1: INCONSISTENT LR TESTS DUE TO ASYMMETRIC DRAWS: This paper shows that when the estimated likelihood function depends on the standard deviations of the mixed parameters, this practice is very likely to produce misleading test results at the numbers of draws in common use today. The paper illustrates that increasing the number of draws is an inefficient remedy, since very large numbers of draws are required to guard against misleading test statistics. The main conclusion of this paper is that the problem can be solved completely by using fully antithetic draws, and that one-dimensionally antithetic draws are not enough. PROBLEM 2: MAINTAINING THE CORRECT DIMENSIONS WHEN REDUCING THE MIXING DISTRIBUTION: A second point of the paper is that even when fully antithetic draws are used, models that reduce the dimension of the mixing distribution must replicate the relevant dimensions of the quasi-random draws in the simulation of the restricted likelihood. Again, this is not standard in research or in statistical programs. The paper therefore recommends using fully antithetic draws, replicating the relevant dimensions of the quasi-random draws in the simulation of the restricted likelihood, and argues that this should become the default option in statistical programs. JEL classification: C15; C25.
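The distinction between one-dimensionally and fully antithetic draws can be made concrete. In this hypothetical sketch (d = 2 mixed parameters, pseudo-random standard normal base draws standing in for Halton draws), fully antithetic draws apply all 2^d coordinate-wise sign combinations to each base draw, whereas one-dimensionally antithetic draws only pair each draw with its full reflection.

```r
# Sketch (assumed setup: d = 2 mixing dimensions, normal base draws).
d <- 2; n_draws <- 100
base <- matrix(rnorm(n_draws * d), n_draws, d)

# One-dimensionally antithetic: each draw paired only with its full reflection.
one_dim <- rbind(base, -base)                     # 2 * n_draws rows

# Fully antithetic: all 2^d coordinate-wise sign flips of every base draw.
signs <- as.matrix(expand.grid(rep(list(c(1, -1)), d)))
fully <- do.call(rbind, lapply(seq_len(nrow(signs)),
                 function(k) sweep(base, 2, signs[k, ], `*`)))  # 2^d * n_draws rows
colMeans(fully)   # numerically zero in every coordinate, by construction
```

The fully antithetic set is symmetric in every coordinate separately, not just under joint reflection, which is the property the paper exploits.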
Project description: Motivation: One of the main goals in systems biology is to learn molecular regulatory networks from quantitative profile data. In particular, Gaussian graphical models (GGMs) are widely used network models in bioinformatics, where variables (e.g. transcripts, metabolites or proteins) are represented by nodes and pairs of nodes are connected with an edge according to their partial correlation. Reconstructing a GGM from data is challenging when the sample size is smaller than the number of variables: the main problem lies in inverting the covariance estimator, which is ill-conditioned in this case. Shrinkage-based covariance estimators are a popular approach, producing an invertible 'shrunk' covariance. However, a proper significance test for the 'shrunk' partial correlations (i.e. the GGM edges) remains an open challenge, as no probability density that includes the shrinkage is known. In this article, we present (i) a geometric reformulation of the shrinkage-based GGM and (ii) a probability density that naturally includes the shrinkage parameter. Results: Our results show that inference using this new 'shrunk' probability density is as accurate as Monte Carlo estimation (an unbiased non-parametric method) for any shrinkage value, while being computationally more efficient. We show on synthetic data how the novel test for significance allows accurate control of the Type I error and outperforms the network reconstruction obtained with the widely used R package GeneNet. This is further highlighted in two gene-expression datasets, from the stress response in Escherichia coli and from the effect of influenza infection in Mus musculus. Availability and implementation: https://github.com/V-Bernal/GGM-Shrinkage. Supplementary information: Supplementary data are available at Bioinformatics online.
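For intuition, the sketch below approximates the null distribution of one 'shrunk' partial correlation by Monte Carlo, the non-parametric baseline that the analytic density is compared against; it is an illustration only, with the shrinkage value fixed by hand.

```r
# Monte Carlo null distribution of a 'shrunk' partial correlation
# (a sketch of the non-parametric baseline, not the paper's analytic density).
shrunk_pcor12 <- function(X, lambda) {
  R <- (1 - lambda) * cor(X) + lambda * diag(ncol(X))
  O <- solve(R)
  -O[1, 2] / sqrt(O[1, 1] * O[2, 2])
}
set.seed(2)
n <- 20; p <- 40; lambda <- 0.2
null_draws <- replicate(2000, shrunk_pcor12(matrix(rnorm(n * p), n, p), lambda))
quantile(abs(null_draws), 0.95)   # empirical critical value for a 5% test
```

A parametric 'shrunk' density replaces this resampling loop with a single closed-form evaluation, which is the source of the computational gain.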
Project description: We consider likelihood ratio tests (LRTs) and their modifications for homogeneity in admixture models. The admixture model is a two-component mixture model in which one component is indexed by an unknown parameter while the parameter value of the other component is known. This model is widely used in genetic linkage analysis under heterogeneity, in which the kernel distribution is binomial. For such models, it has long been recognized that testing for homogeneity is nonstandard and that the LRT statistic does not converge to a conventional χ² distribution. In this article, we investigate the asymptotic behavior of the LRT for general admixture models and show that its limiting distribution is equivalent to the supremum of a squared Gaussian process. We also discuss the connection and comparison between the LRT and alternative approaches such as modifications of the LRT and score tests, including the modified LRT (Fu, Chen, and Kalbfleisch, 2006, Statistica Sinica 16, 805-823). The LRT is an omnibus test that is powerful against general alternative hypotheses. In contrast, alternative approaches may be slightly more powerful against certain types of alternatives, but much less powerful against others. Our results are illustrated by simulation studies and an application to a genetic linkage study of schizophrenia.
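A small simulation makes the nonstandard null behavior visible. The sketch below uses a toy setup (binomial kernel with known component probability 0.5; all numbers are illustrative assumptions, not the paper's applications) and compares the homogeneity LRT with the nominal χ²(1) tail.

```r
# Toy admixture model: (1 - lambda) * Binom(m, 0.5) + lambda * Binom(m, theta);
# H0: lambda = 0 (homogeneity); theta is unidentified under the null.
set.seed(3)
m <- 10; n <- 200; B <- 500
negll <- function(par, x)
  -sum(log((1 - par[1]) * dbinom(x, m, 0.5) + par[1] * dbinom(x, m, par[2])))
lrt <- replicate(B, {
  x <- rbinom(n, m, 0.5)                                # data under H0
  fit <- optim(c(0.5, 0.3), negll, x = x, method = "L-BFGS-B",
               lower = c(0, 0.01), upper = c(1, 0.99))
  max(0, 2 * (-fit$value - sum(dbinom(x, m, 0.5, log = TRUE))))
})
mean(lrt > qchisq(0.95, df = 1))   # rejection rate if chi^2(1) were (wrongly) used
```

The empirical rejection rate deviates from the nominal 5% level, reflecting the supremum-of-a-squared-Gaussian-process limit rather than a conventional χ² limit.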
Project description: In the 1970s, Professor Robbins and his coauthors extended the Ville and Wald inequality in order to derive fundamental theoretical results regarding likelihood-ratio-based sequential tests with power one. The law of the iterated logarithm confirms an optimality property of the power-one tests. In parallel with Robbins's decision-making procedures, we propose and examine sequential empirical likelihood ratio (ELR) tests with power one. In this setting, we develop nonparametric one- and two-sided ELR tests. It turns out that the proposed sequential ELR tests significantly outperform their classical nonparametric t-statistic-based counterparts in many scenarios across different underlying data distributions.
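A minimal sketch of the ingredients, under stated assumptions: the -2 log empirical likelihood ratio for a one-sample mean hypothesis is computed by solving Owen's Lagrange-multiplier equation, then monitored as observations accrue. The stopping boundary below is a crude placeholder for illustration, not the boundary derived in the paper.

```r
# -2 log ELR for H0: E[X] = mu0 (Owen's one-sample construction).
elr <- function(x, mu0) {
  z <- x - mu0
  if (min(z) >= 0 || max(z) <= 0) return(Inf)   # mu0 outside the convex hull
  g  <- function(lam) sum(z / (1 + lam * z))    # score in the multiplier lam
  lo <- (1 / length(z) - 1) / max(z)            # keeps all EL weights in (0, 1)
  hi <- (1 / length(z) - 1) / min(z)
  lam <- uniroot(g, c(lo + 1e-8, hi - 1e-8))$root
  2 * sum(log(1 + lam * z))
}
# Sequential monitoring with an ad hoc illustrative boundary (an assumption):
set.seed(4)
x <- rnorm(3000, mean = 0.15)                   # true mean differs from 0
bound <- function(n) qchisq(1 - 0.05 / n^1.1, df = 1)
n_stop <- which(sapply(30:3000, function(n) elr(x[1:n], 0) > bound(n)))[1] + 29
n_stop                                          # first sample size that rejects
```

Under the alternative the statistic grows roughly linearly in n while the boundary grows slowly, which is the mechanism behind power one.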
Project description: Regression analyses are commonly performed with doubly limited continuous dependent variables; for instance, when modeling the behavior of rates, proportions and income concentration indices. Several models are available in the literature for use with such variables, one of them being the unit gamma regression model. In all such models, parameter estimation is typically performed by maximum likelihood, and testing inferences on the model's parameters are usually based on the likelihood ratio test. Such a test can, however, deliver quite imprecise inferences when the sample size is small. In this paper, we propose two modified likelihood ratio test statistics for use with unit gamma regressions that deliver much more accurate inferences when the number of data points is small. Numerical (i.e. simulation) evidence is presented for both fixed dispersion and varying dispersion models, and also for tests that involve nonnested models. We also present and discuss two empirical applications.
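The small-sample problem, and the generic idea of rescaling the LRT, can be illustrated in a few lines. The sketch below is not the paper's analytic statistics: it uses an exponential regression (base R provides no unit gamma model) and a bootstrap Bartlett-type rescaling, w* = w * q / mean(w_boot) with q = 1, as a stand-in for an analytic correction.

```r
# Generic illustration (exponential regression stands in for the unit gamma
# model; bootstrap rescaling stands in for the paper's modified statistics).
set.seed(5)
n <- 15; x <- runif(n)                              # deliberately small sample
nll <- function(b, y, x) sum(b[1] + b[2] * x + y * exp(-(b[1] + b[2] * x)))
lrt_stat <- function(y, x) {
  ll0 <- -nll(c(log(mean(y)), 0), y, x)             # null MLE is closed form
  fit <- optim(c(log(mean(y)), 0), nll, y = y, x = x)
  2 * (-fit$value - ll0)                            # LRT for H0: no x effect
}
y <- rexp(n, rate = exp(-1))                        # H0 true: mean exp(1)
w  <- lrt_stat(y, x)
wb <- replicate(400, lrt_stat(rexp(n, rate = 1 / mean(y)), x))  # null bootstrap
c(raw = w, rescaled = w / mean(wb))                 # mean(wb) > 1 inflates raw w
```

In small samples mean(wb) typically exceeds the nominal df of 1, so the raw LRT over-rejects; dividing by it is the crudest version of the kind of finite-sample adjustment the paper develops analytically.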
Project description: Beta regressions are commonly used with responses that assume values in the standard unit interval, such as rates, proportions and concentration indices. Hypothesis testing inferences on the model parameters are typically performed using the likelihood ratio test. It delivers accurate inferences when the sample size is large, but can otherwise lead to unreliable conclusions. It is thus important to develop alternative tests with superior finite-sample behavior. We derive the Bartlett correction to the likelihood ratio test under the more general formulation of the beta regression model, i.e. under varying precision. The model contains two submodels, one for the mean response and a separate one for the precision parameter. Our interest lies in performing testing inferences on the parameters that index both submodels. We use three Bartlett-corrected likelihood ratio test statistics that are expected to yield superior performance when the sample size is small. We present Monte Carlo simulation evidence on the finite-sample behavior of the Bartlett-corrected tests relative to the standard likelihood ratio test and to two improved tests based on an alternative approach. The numerical evidence shows that one of the Bartlett-corrected statistics typically delivers accurate inferences even when the sample size is quite small. An empirical application related to behavioral biometrics is presented and discussed.
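For reference, the three corrected statistics are usually written in the following standard, asymptotically equivalent forms from the Bartlett-correction literature (the paper's contribution is the correction term c for the varying-precision beta regression, which is not reproduced here): with w the likelihood ratio statistic, q the number of restrictions and E(w) ≈ q(1 + c),

\[
w_{b1} = \frac{w}{1+c}, \qquad w_{b2} = w\, e^{-c}, \qquad w_{b3} = w\,(1 - c),
\]

each of which improves the order of the χ²_q approximation relative to the uncorrected w.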
Project description: Many statistical tests have been proposed for case-control data to detect disease association with multiple single nucleotide polymorphisms (SNPs) in linkage disequilibrium. The main reason so many tests exist is that each aims to detect one or two aspects of the many possible distributional differences between cases and controls, largely due to the lack of a general yet simple model for discrete genotype data. Here we propose a latent variable model to represent SNP data: the observed SNP data are assumed to be obtained by discretizing a latent multivariate Gaussian variate. Because the latent variate is multivariate Gaussian, its distribution is completely characterized by its mean vector and covariance matrix, in contrast to the much more complex form of a general distribution for discrete multivariate SNP data. We propose a composite likelihood approach for parameter estimation. A direct application of this latent variable model is association testing with multiple SNPs in a candidate gene or region. In contrast to many existing tests that detect only one or two aspects of the possible distributional differences in discrete SNP data, we can focus exclusively on testing the mean and covariance parameters of the latent Gaussian distributions for cases and controls. Our simulation results demonstrate potential power gains of the proposed approach over some existing methods.
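The latent-variable representation is easy to picture by simulating from it. The sketch below uses illustrative assumptions (an AR(1) latent correlation, and thresholds matching Hardy-Weinberg proportions for allele frequency 0.2) to discretize a latent multivariate normal into 0/1/2 genotype counts.

```r
# Sketch of the latent Gaussian representation of SNP genotypes.
library(MASS)                                # for mvrnorm
set.seed(6)
p <- 5
Sigma <- 0.5 ^ abs(outer(1:p, 1:p, "-"))     # latent AR(1) correlation (assumed)
Z <- mvrnorm(500, mu = rep(0, p), Sigma = Sigma)
t1 <- qnorm(0.64); t2 <- qnorm(0.96)         # P(G=0)=0.8^2, P(G<=1)=1-0.2^2
G <- (Z > t1) + (Z > t2)                     # genotype counts 0/1/2
table(G[, 1])                                # marginal genotype frequencies
```

A case-control comparison in this framework then reduces to testing equality of the latent mean vectors and correlation matrices between the two groups.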
Project description: The likelihood ratio test (LRT) is widely used for comparing the relative fit of nested latent variable models. Following Wilks' theorem, the LRT is conducted by comparing the LRT statistic with its asymptotic distribution under the restricted model, a χ² distribution with degrees of freedom equal to the difference in the number of free parameters between the two nested models under comparison. For models with latent variables, such as factor analysis, structural equation models and random effects models, however, it is often found that the χ² approximation does not hold. In this note, we show how the regularity conditions of Wilks' theorem may be violated, using three examples of models with latent variables. In addition, a more general theory for the LRT is given that provides the correct asymptotic theory for these LRTs. This general theory was first established in Chernoff (Ann Math Stat 25:573-578, 1954) and discussed in both van der Vaart (Asymptotic statistics, Cambridge University Press, Cambridge, 2000) and Drton (Ann Stat 37:979-1012, 2009), but it does not seem to have received enough attention. We illustrate this general theory with the three examples.
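A textbook boundary case, in the spirit of (but not taken from) the note's three examples, shows how the regularity conditions fail: testing H0: mu = 0 when the parameter space is constrained to mu >= 0 yields the Chernoff-type limit 0.5 chi^2_0 + 0.5 chi^2_1 rather than chi^2_1, exactly the situation of a variance component tested at zero.

```r
# Boundary example: X_i ~ N(mu, 1), H0: mu = 0 vs the constrained space mu >= 0.
# The constrained MLE is max(0, xbar), so the LRT is (max(0, sqrt(n) * xbar))^2.
set.seed(7)
B <- 1e5; n <- 50
xbar <- rnorm(B, mean = 0, sd = 1 / sqrt(n))     # null distribution of the mean
lrt  <- pmax(0, sqrt(n) * xbar)^2                # 0.5*chi2_0 + 0.5*chi2_1 mixture
mean(lrt > qchisq(0.95, df = 1))                 # about 0.025, not the nominal 0.05
```

Using the naive chi^2_1 reference here makes the test conservative; in other latent variable models the same kind of violation can instead inflate the Type I error, which is why the general theory matters.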