Dataset Information

ARE DISCOVERIES SPURIOUS? DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS AND THEIR APPLICATIONS.

ABSTRACT: Over the last two decades, many exciting variable selection methods have been developed for finding a small group of covariates that are associated with the response from a large pool. Can the discoveries by such data mining approaches be spurious due to high dimensionality and limited sample size? Can our fundamental assumptions on exogeneity of covariates needed for such variable selection be validated with the data? To answer these questions, we need to derive the distributions of the maximum spurious correlations given a certain number of predictors, namely, the distribution of the correlation of a response variable Y with the best s linear combinations of p covariates X, even when X and Y are independent. When the covariance matrix of X possesses the restricted eigenvalue property, we derive such distributions for both finite s and diverging s, using Gaussian approximation and empirical process techniques. However, such a distribution depends on the unknown covariance matrix of X. Hence, we use the multiplier bootstrap procedure to approximate the unknown distributions and establish the consistency of such a simple bootstrap approach. The results are further extended to the situation where residuals are from regularized fits. Our approach is then applied to construct the upper confidence limit for the maximum spurious correlation and to test the exogeneity of covariates. The former provides a baseline for guarding against false discoveries due to data mining and the latter tests whether our fundamental assumptions for high-dimensional model selection are statistically valid. Our techniques and results are illustrated by both numerical examples and real data analysis.
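To make the notion of a maximum spurious correlation concrete, the following is a minimal sketch for the simplest case s = 1. It is a plain Monte Carlo simulation with independent standard normal covariates, not the multiplier bootstrap procedure described in the abstract, and it assumes the covariance of X is known to be the identity. The function names (`sample_correlation`, `max_spurious_correlation`, `upper_quantile`) and the default settings are hypothetical and chosen only for illustration.

```python
import math
import random


def sample_correlation(x, y):
    """Pearson sample correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_spurious_correlation(n, p, rng):
    """Largest |correlation| between Y and p covariates drawn independently of Y.

    Assumption: Y and every covariate are i.i.d. standard normal, so any
    nonzero sample correlation is spurious by construction.
    """
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    best = 0.0
    for _ in range(p):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        best = max(best, abs(sample_correlation(x, y)))
    return best


def upper_quantile(n, p, level=0.95, reps=100, seed=0):
    """Monte Carlo estimate of the `level` quantile of the maximum spurious correlation."""
    rng = random.Random(seed)
    draws = sorted(max_spurious_correlation(n, p, rng) for _ in range(reps))
    # Order statistic whose rank is ceil(level * reps), converted to a 0-based index.
    index = min(reps - 1, int(math.ceil(level * reps)) - 1)
    return draws[index]


if __name__ == "__main__":
    # With n fixed, the simulated upper limit typically grows as p increases.
    for p in (5, 50, 500):
        print(p, round(upper_quantile(n=30, p=p), 3))
```

An observed correlation below such a simulated upper limit would be hard to distinguish from a purely spurious one. In practice the covariance of X is unknown, which is the gap the abstract's bootstrap approach is meant to address.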

SUBMITTER: Fan J

PROVIDER: S-EPMC6014708 | biostudies-literature | 2018 Jun

REPOSITORIES: biostudies-literature


Publications

ARE DISCOVERIES SPURIOUS? DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS AND THEIR APPLICATIONS.

Jianqing Fan, Qi-Man Shao, Wen-Xin Zhou

Annals of Statistics, 2018-05-03, issue 3


PMID: 29942099

Similar Datasets

Project description:

Background: African trypanosomiasis is a tsetse-borne parasitic infection that affects humans, wildlife, and domesticated animals. Tsetse flies are endemic to much of Sub-Saharan Africa, and a spatial and temporal understanding of tsetse habitat can aid surveillance and support disease risk management. Problematically, current fine spatial resolution remote sensing data are delivered with a temporal lag and have relatively coarse temporal resolution (e.g., 16 days), which results in disease control models often targeting incorrect places. The goal of this study was to devise a heuristic for identifying tsetse habitat (at a fine spatial resolution) into the future and in the temporal gaps where remote sensing and proximal data fail to supply information.

Methods: This paper introduces a generalizable and scalable open-access version of the tsetse ecological distribution (TED) model used to predict tsetse distributions across space and time, and contributes a geospatial Bayesian Maximum Entropy (BME) prediction model trained by TED output data to forecast where tsetse (here, the Morsitans group) persist in Kenya, a method that mitigates the temporal lag problem. This model facilitates identification of tsetse habitat and provides critical information to control tsetse, mitigate the impact of trypanosomiasis on vulnerable human and animal populations, and guide disease minimization in places with ephemeral tsetse. Moreover, this BME analysis is one of the first to utilize cluster and parallel computing along with a Monte Carlo analysis to optimize BME computations. This allows for the analysis of an exceptionally large dataset (over 2 billion data points) at a finer resolution and larger spatiotemporal scale than what had previously been possible.

Results: Under the most conservative assessment for Kenya, the BME kriging analysis showed an overall prediction accuracy of 74.8% (limited to the maximum suitability extent). In predicting tsetse distribution outcomes for the entire country, the BME kriging analysis was 97% accurate in its forecasts.

Conclusions: This work offers a solution to the persistent temporal data gap in accurate and spatially precise rainfall predictions and the delayed processing of remotely sensed data, collectively in the -45 days past to +180 days future temporal window. As is shown here, the BME model is a reliable alternative for forecasting future tsetse distributions to allow preplanning for tsetse control. Furthermore, this model provides guidance on disease control that would otherwise not be available. These 'big data' BME methods are particularly useful for large domain studies, considering that past BME studies required reduction of the spatiotemporal grid to facilitate analysis. Both the GEE-TED and the BME libraries have been made open source to enable reproducibility and offer continual updates into the future as new remotely sensed data become available.


