When can Multi-Site Datasets be Pooled for Regression? Hypothesis Tests, ? 2-consistency and Neuroscience Applications.
ABSTRACT: Many studies in biomedical and health sciences involve small sample sizes due to logistic or financial constraints. Often, identifying weak (but scientifically interesting) associations between a set of predictors and a response necessitates pooling datasets from multiple diverse labs or groups. While there is a rich literature in statistical machine learning to address distributional shifts and inference in multi-site datasets, it is less clear when such pooling is guaranteed to help (and when it does not) - independent of the inference algorithms we use. In this paper, we present a hypothesis test to answer this question, both for classical and high dimensional linear regression. We precisely identify regimes where pooling datasets across multiple sites is sensible, and how such policy decisions can be made via simple checks executable on each site before any data transfer ever happens. With a focus on Alzheimer's disease studies, we present empirical results showing that in regimes suggested by our analysis, pooling a local dataset with data from an international study improves power.
Project description:When sample sizes are small, the ability to identify weak (but scientifically interesting) associations between a set of predictors and a response may be enhanced by pooling existing datasets. However, variations in acquisition methods and the distribution of participants or observations between datasets, especially due to the distributional shifts in some predictors, may obfuscate real effects when datasets are combined. We present a rigorous statistical treatment of this problem and identify conditions where we can correct the distributional shift. We also provide an algorithm for the situation where the correction is identifiable. We analyze various properties of the framework for testing model fit, constructing confidence intervals, and evaluating consistency characteristics. Our technical development is motivated by Alzheimer's disease (AD) studies, and we present empirical results showing that our framework enables harmonizing of protein biomarkers, even when the assays across sites differ. Our contribution may, in part, mitigate a bottleneck that researchers face in clinical research when pooling smaller sized datasets and may offer benefits when the subjects of interest are difficult to recruit or when resources prohibit large single-site studies.
Project description:Inferring patterns of synchronous brain activity from a heterogeneous sample of electroencephalograms (EEG) is scientifically and methodologically challenging. While it is intuitively and statistically appealing to rely on readings from more than one individual in order to highlight recurrent patterns of brain activation, pooling information across subjects presents non-trivial methodological problems. We discuss some of the scientific issues associated with the understanding of synchronized neuronal activity and propose a methodological framework for statistical inference from a sample of EEG readings. Our work builds on classical contributions in time-series, clustering and functional data analysis, in an effort to reframe a challenging inferential problem in the context of familiar analytical techniques. Some attention is paid to computational issues, with a proposal based on the combination of machine learning and Bayesian techniques.
Project description:BACKGROUND: Multiple data-analytic methods have been proposed for evaluating gene-expression levels in specific biological pathways, assessing differential expression associated with a binary phenotype. Following Goeman and Bühlmann's recent review, we compared statistical performance of three methods, namely Global Test, ANCOVA Global Test, and SAM-GS, that test "self-contained null hypotheses" Via. subject sampling. The three methods were compared based on a simulation experiment and analyses of three real-world microarray datasets. RESULTS: In the simulation experiment, we found that the use of the asymptotic distribution in the two Global Tests leads to a statistical test with an incorrect size. Specifically, p-values calculated by the scaled chi2 distribution of Global Test and the asymptotic distribution of ANCOVA Global Test are too liberal, while the asymptotic distribution with a quadratic form of the Global Test results in p-values that are too conservative. The two Global Tests with permutation-based inference, however, gave a correct size. While the three methods showed similar power using permutation inference after a proper standardization of gene expression data, SAM-GS showed slightly higher power than the Global Tests. In the analysis of a real-world microarray dataset, the two Global Tests gave markedly different results, compared to SAM-GS, in identifying pathways whose gene expressions are associated with p53 mutation in cancer cell lines. A proper standardization of gene expression variances is necessary for the two Global Tests in order to produce biologically sensible results. After the standardization, the three methods gave very similar biologically-sensible results, with slightly higher statistical significance given by SAM-GS. The three methods gave similar patterns of results in the analysis of the other two microarray datasets. CONCLUSION: An appropriate standardization makes the performance of all three methods similar, given the use of permutation-based inference. SAM-GS tends to have slightly higher power in the lower alpha-level region (i.e. gene sets that are of the greatest interest). Global Test and ANCOVA Global Test have the important advantage of being able to analyze continuous and survival phenotypes and to adjust for covariates. A free Microsoft Excel Add-In to perform SAM-GS is available from http://www.ualberta.ca/~yyasui/homepage.html.
Project description:Over 15% of the data sets catalogued in the Gene Expression Omnibus Database involve RNA samples that have been pooled before hybridization. Pooling affects data quality and inference, but the exact effects are not yet known because pooling has not been systematically studied in the context of microarray experiments. Here we report on the results of an experiment designed to evaluate the utility of pooling and the impact on identifying differentially expressed genes. We find that inference for most genes is not adversely affected by pooling, and we recommend that pooling be done when fewer than three arrays are used in each condition. For larger designs, pooling does not significantly improve inferences if few subjects are pooled. The realized benefits in this case do not outweigh the price paid for loss of individual specific information. Pooling is beneficial when many subjects are pooled, provided that independent samples contribute to multiple pools.
Project description:Modern scientific studies often require the identification of a subset of explanatory variables. Several statistical methods have been developed to automate this task, and the framework of knockoffs has been proposed as a general solution for variable selection under rigorous Type I error control, without relying on strong modelling assumptions. In this paper, we extend the methodology of knockoffs to problems where the distribution of the covariates can be described by a hidden Markov model. We develop an exact and efficient algorithm to sample knockoff variables in this setting and then argue that, combined with the existing selective framework, this provides a natural and powerful tool for inference in genome-wide association studies with guaranteed false discovery rate control. We apply our method to datasets on Crohn's disease and some continuous phenotypes.
Project description:Science is a social process with far-reaching impact on our modern society. In recent years, for the first time we are able to scientifically study the science itself. This is enabled by massive amounts of data on scientific publications that is increasingly becoming available. The data is contained in several databases such as Web of Science or PubMed, maintained by various public and private entities. Unfortunately, these databases are not always consistent, which considerably hinders this study. Relying on the powerful framework of complex networks, we conduct a systematic analysis of the consistency among six major scientific databases. We found that identifying a single "best" database is far from easy. Nevertheless, our results indicate appreciable differences in mutual consistency of different databases, which we interpret as recipes for future bibliometric studies.
Project description:Association tests that pool minor alleles into a measure of burden at a locus have been proposed for case-control studies using sequence data containing rare variants. However, such pooling tests are not robust to the inclusion of neutral and protective variants, which can mask the association signal from risk variants. Early studies proposing pooling tests dismissed methods for locus-wide inference using nonnegative single-variant test statistics based on unrealistic comparisons. However, such methods are robust to the inclusion of neutral and protective variants and therefore may be more useful than previously appreciated. In fact, some recently proposed methods derived within different frameworks are equivalent to performing inference on weighted sums of squared single-variant score statistics. In this study, we compared two existing methods for locus-wide inference using nonnegative single-variant test statistics to two widely cited pooling tests under more realistic conditions. We established analytic results for a simple model with one rare risk and one rare neutral variant, which demonstrated that pooling tests were less powerful than even Bonferroni-corrected single-variant tests in most realistic situations. We also performed simulations using variants with realistic minor allele frequency and linkage disequilibrium spectra, disease models with multiple rare risk variants and extensive neutral variation, and varying rates of missing genotypes. In all scenarios considered, existing methods using nonnegative single-variant test statistics had power comparable to or greater than two widely cited pooling tests. Moreover, in disease models with only rare risk variants, an existing method based on the maximum single-variant Cochran-Armitage trend chi-square statistic in the locus had power comparable to or greater than another existing method closely related to some recently proposed methods. We conclude that efficient locus-wide inference using single-variant test statistics should be reconsidered as a useful framework for devising powerful association tests in sequence data with rare variants.
Project description:Now over 20 years old, functional MRI (fMRI) has a large and growing literature that is best synthesised with meta-analytic tools. As most authors do not share image data, only the peak activation coordinates (foci) reported in the article are available for Coordinate-Based Meta-Analysis (CBMA). Neuroimaging meta-analysis is used to (i) identify areas of consistent activation; and (ii) build a predictive model of task type or cognitive process for new studies (reverse inference). To simultaneously address these aims, we propose a Bayesian point process hierarchical model for CBMA. We model the foci from each study as a doubly stochastic Poisson process, where the study-specific log intensity function is characterized as a linear combination of a high-dimensional basis set. A sparse representation of the intensities is guaranteed through latent factor modeling of the basis coefficients. Within our framework, it is also possible to account for the effect of study-level covariates (meta-regression), significantly expanding the capabilities of the current neuroimaging meta-analysis methods available. We apply our methodology to synthetic data and neuroimaging meta-analysis datasets.