Multiple hypothesis testing procedures in clinical trials and genomic studies.
ABSTRACT: We review and compare multiple hypothesis testing procedures used in clinical trials with those used in genomic studies. Clinical trials often employ global tests, which draw an overall conclusion for all the hypotheses, such as the SUM test, the Two-Step test, the Approximate Likelihood Ratio Test (ALRT), the Intersection-Union Test (IUT), and the MAX test. The SUM and Two-Step tests are most powerful under homogeneous treatment effects, while the ALRT and MAX test are robust when treatment effects are non-homogeneous. Furthermore, the ALRT is robust to unequal sample sizes across the hypotheses being tested. In genomic studies, stepwise procedures are used to draw marker-specific conclusions while controlling the family-wise error rate (FWER) or the false discovery rate (FDR). The FDR is the expected proportion of false positives among all significant results and is preferred over the FWER when screening high-dimensional genomic markers because of its interpretability. When correlations between test statistics cannot be ignored, the Westfall-Young resampling method generates the joint distribution of P-values under the null while maintaining their correlation structure. Finally, GWAS data from a clinical trial searching for SNPs associated with nephropathy among Type 1 diabetic patients are used to illustrate the various procedures.
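The Westfall-Young resampling idea described above can be sketched as a min-P permutation procedure. The following is a minimal sketch, assuming a two-group comparison with per-marker t-tests; the data layout, test choice, and permutation count are illustrative assumptions, not details from the review:

```python
import numpy as np
from scipy.stats import ttest_ind

def westfall_young_minp(X, y, n_perm=200, seed=0):
    """Min-P adjusted p-values via permutation of group labels (Westfall-Young style).

    X: (n_samples, n_markers) data matrix; y: binary group labels.
    Permuting labels preserves the correlation structure among markers,
    so the null distribution of the minimum p-value is estimated jointly.
    """
    rng = np.random.default_rng(seed)
    raw = ttest_ind(X[y == 0], X[y == 1], axis=0).pvalue
    min_p = np.empty(n_perm)
    for b in range(n_perm):
        yb = rng.permutation(y)  # relabel samples under the global null
        pb = ttest_ind(X[yb == 0], X[yb == 1], axis=0).pvalue
        min_p[b] = pb.min()      # most extreme p-value in this permutation
    # adjusted p: how often the permutation min-p is at least as extreme
    # as each observed p (with +1 smoothing so no adjusted p is exactly 0)
    adj = (1 + (min_p[:, None] <= raw[None, :]).sum(axis=0)) / (1 + n_perm)
    return raw, adj
```

Because only the labels are permuted, dependence among markers carries through to the null distribution, which is the property that distinguishes this approach from marginal corrections such as Bonferroni.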
Project description:Benjamini and Hochberg (1995) proposed the false discovery rate (FDR) as an alternative to the familywise error rate (FWER) in multiple testing problems. Since then, researchers have been increasingly interested in developing methodologies for controlling the FDR under different model assumptions. In a later paper, Benjamini and Yekutieli (2001) developed a conservative step-up procedure that controls the FDR without relying on the assumption that the test statistics are independent. In this paper, we develop a new step-down procedure aiming to control the FDR. It incorporates dependence information, as does the FWER-controlling step-down procedure of Westfall and Young (1993). This new procedure has three versions: lFDR, eFDR, and hFDR. Using simulations with independent and dependent data, we observe that the lFDR is too optimistic for controlling the FDR; the hFDR is very conservative; and the eFDR (a) appears to control the FDR for the hypotheses of interest and (b) suggests the number of false null hypotheses. The most conservative procedure, hFDR, is proved to control the FDR for finite samples under the subset pivotality condition and under the assumption that the joint distribution of statistics from true nulls is independent of the joint distribution of statistics from false nulls.
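For orientation, the Benjamini-Hochberg step-up rule that such step-down variants are benchmarked against can be written in a few lines. This is a plain BH implementation, not the paper's lFDR/eFDR/hFDR procedures:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """BH step-up: reject H_(1),...,H_(k) where k = max{i : p_(i) <= i*alpha/m}."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m  # i*alpha/m for sorted p-values
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest index meeting its threshold
        reject[order[: k + 1]] = True     # step up: reject everything smaller
    return reject
```

The "step-up" character shows in the last line: a p-value that misses its own threshold is still rejected if some larger p-value meets its threshold.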
Project description:Increasing evidence suggests that single nucleotide polymorphisms (SNPs) associated with complex traits are more likely to be expression quantitative trait loci (eQTLs). Incorporating eQTL information therefore has the potential to increase the power of genome-wide association studies (GWAS). In this paper, we propose using eQTL weights as prior information in SNP-based association tests to improve test power while maintaining control of the family-wise error rate (FWER) or the false discovery rate (FDR). We apply the proposed methods to the analysis of a GWAS of childhood asthma consisting of 1296 unrelated individuals of German ancestry. The results confirm that eQTLs are enriched for previously reported asthma SNPs. We also find that some SNPs that are not significant under procedures without eQTL weighting become significant under the eQTL-weighted Bonferroni or Benjamini-Hochberg procedures while controlling the FWER or FDR at the same level. Some of these SNPs have been reported by independent studies in the recent literature. These results suggest that eQTL-weighted procedures are a promising approach for improving the power of GWAS. We also report results from applying our methods to the large-scale European GABRIEL consortium data.
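Weighted FWER control of the kind this abstract describes can be sketched with a weighted Bonferroni rule: reject H_i when p_i <= alpha * w_i / sum(w), which keeps the FWER at level alpha for any nonnegative weights fixed before seeing the p-values. A minimal sketch; the weights here are illustrative placeholders, not the paper's eQTL-derived weights:

```python
import numpy as np

def weighted_bonferroni(pvals, weights, alpha=0.05):
    """Weighted Bonferroni: per-hypothesis thresholds alpha * w_i / sum(w).

    The rejection probabilities under the null sum to at most
    sum_i alpha * w_i / sum(w) = alpha, so the FWER is controlled
    for any nonnegative weights chosen independently of the p-values.
    """
    p = np.asarray(pvals, dtype=float)
    w = np.asarray(weights, dtype=float)
    return p <= alpha * w / w.sum()
```

The point of weighting is visible when two hypotheses share the same p-value: the one with the larger prior weight (e.g., a SNP that is an eQTL) gets a more generous threshold and can be rejected where the unweighted rule would not reject it.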
Project description:Although enormous costs have been dedicated to discovering relevant disease-related genetic variants, especially in genome-wide association studies (GWASs), only a small fraction of estimated heritability can be explained by these results. This is the so-called missing heritability problem. The conventional use of overly conservative multiple testing strategies based on controlling the familywise error rate (FWER), in particular with a genome-wide significance threshold of P < 5 × 10^-8, is one of the most important issues from a statistical perspective. To help resolve this problem, we performed comprehensive re-assessments of currently available strategies using recently published, extremely large-scale GWAS data sets of rheumatoid arthritis and schizophrenia (>50,000 subjects). The estimates of statistical power averaged for all disease-related genetic variants of the standard FWER-based strategy were only 0.09% for the rheumatoid arthritis data and 0.04% for the schizophrenia data. To design more efficient strategies, we also conducted an extensive comparison of multiple testing strategies by applying false discovery rate (FDR)-controlling procedures to these data sets and simulations, and found that the FDR-based procedures achieved higher power than the FWER-based strategy, even at a strict FDR level (e.g., FDR = 1%). We also discuss a useful alternative measure, namely "partial power," which is an averaged power for detecting the clinically and biologically meaningful genetic factors with the largest effects. Simulation results suggest that the FDR-based procedures can achieve sufficient partial power (>80%) for detecting these factors (odds ratios of >1.05) with 80,000 subjects, and thus this may be a useful measure for defining realistic objectives of future GWASs.
Project description:When analyzing high-throughput genomic data, the multiple comparison problem is most often addressed through estimation of the false discovery rate (FDR), using methods such as the Benjamini & Hochberg, Benjamini & Yekutieli, or q-value procedures, or by controlling the family-wise error rate (FWER) using Holm's step-down method. To date, research studies comparing FDR/FWER methodologies have made use of limited simulation studies and/or have applied the methods to one or more microarray gene expression datasets. However, for microarray datasets the veracity of each null hypothesis tested is unknown, so an objective evaluation of performance cannot be rendered for application data. Due to the role of methylation in X-chromosome inactivation, we postulate that high-throughput methylation datasets may provide an appropriate forum for assessing the performance of commonly used FDR methodologies. These datasets preserve the complex correlation structure between probes, offering an advantage over simulated datasets. Using several methylation datasets, commonly used FDR methods, including the q-value, Benjamini & Hochberg, and Benjamini & Yekutieli procedures, as well as Holm's step-down method, were applied to identify CpG sites that are differentially methylated when comparing healthy males to healthy females. The methods were compared with respect to their ability to identify CpG sites located on sex chromosomes as significant, by reporting the sensitivity, specificity, and observed FDR. These datasets are useful for characterizing the performance of multiple comparison procedures, and may find further utility in other tasks such as comparing variable selection capabilities of classification methods and evaluating the performance of meta-analytic methods for microarray data.
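The evaluation in such a benchmark reduces to confusion-matrix bookkeeping against the known truth (here, whether a CpG site lies on a sex chromosome). A minimal sketch, with hypothetical rejection and truth vectors:

```python
import numpy as np

def evaluate_calls(rejected, is_true_signal):
    """Sensitivity, specificity, and observed FDR given known truth labels."""
    rejected = np.asarray(rejected, dtype=bool)
    truth = np.asarray(is_true_signal, dtype=bool)
    tp = np.sum(rejected & truth)    # correctly flagged sites
    fp = np.sum(rejected & ~truth)   # false positives
    tn = np.sum(~rejected & ~truth)  # correctly ignored sites
    fn = np.sum(~rejected & truth)   # missed signals
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    # convention: observed FDR is 0 when nothing is rejected
    fdr = fp / (tp + fp) if tp + fp else 0.0
    return sens, spec, fdr
```

Running this for each multiple-testing procedure on the same dataset yields directly comparable sensitivity/specificity/observed-FDR triples of the kind reported in the study.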
Project description:Improved procedures, in terms of smaller missed discovery rates (MDR), for performing multiple hypotheses testing with weak and strong control of the family-wise error rate (FWER) or the false discovery rate (FDR) are developed and studied. The improvement over existing procedures such as the Šidák procedure for FWER control and the Benjamini-Hochberg (BH) procedure for FDR control is achieved by exploiting possible differences in the powers of the individual tests. Results signal the need to take into account the powers of the individual tests and to have multiple hypotheses decision functions which are not limited to simply using the individual p-values, as is the case, for example, with the Šidák, Bonferroni, or BH procedures. They also enhance understanding of the role of the powers of individual tests, or more precisely the receiver operating characteristic (ROC) functions of decision processes, in the search for better multiple hypotheses testing procedures. A decision-theoretic framework is utilized, and through auxiliary randomizers the procedures can be used with discrete or mixed-type data or with rank-based nonparametric tests. This is in contrast to existing p-value-based procedures, whose theoretical validity is contingent on each of these p-value statistics being stochastically equal to or greater than a standard uniform variable under the null hypothesis. The proposed procedures are relevant in the analysis of high-dimensional "large M, small n" data sets arising in the natural, physical, medical, economic and social sciences, whose generation is accelerated by advances in high-throughput technology, notably, but not limited to, microarray technology.
Project description:In neuroimaging, hundreds to hundreds of thousands of tests are performed across a set of brain regions or all locations in an image. Recent studies have shown that the most common family-wise error (FWE) controlling procedures in imaging, which rely on classical mathematical inequalities or Gaussian random field theory, yield FWE rates (FWER) that are far from the nominal level. Depending on the approach used, the FWER can be exceedingly small or grossly inflated. Given the widespread use of neuroimaging as a tool for understanding neurological and psychiatric disorders, it is imperative that reliable multiple testing procedures are available. To our knowledge, only permutation joint testing procedures have been shown to reliably control the FWER at the nominal level. However, these procedures are computationally intensive given the increasingly large sample sizes and image dimensionality now available, and analyses can take days to complete. Here, we develop a parametric bootstrap joint testing procedure. The parametric bootstrap procedure works directly with the test statistics, which leads to much faster estimation of adjusted p-values than resampling-based procedures while reliably controlling the FWER at sample sizes available in many neuroimaging studies. We demonstrate that the procedure controls the FWER in finite samples using simulations, and present region- and voxel-wise analyses to test for sex differences in developmental trajectories of cerebral blood flow.
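The idea of working directly with the test statistics can be sketched as a parametric bootstrap maxT procedure: draw statistic vectors from a multivariate normal with the estimated correlation of the tests, and compare each observed statistic to the null distribution of the maximum. This is a generic sketch under stated assumptions (known correlation matrix, standard normal marginals, and an illustrative number of draws), not the paper's exact algorithm:

```python
import numpy as np

def parametric_bootstrap_maxT(stats, corr, n_boot=2000, seed=0):
    """FWER-adjusted p-values: compare each |statistic| to the distribution
    of the maximum |Z| over draws from N(0, corr), the joint null of the tests."""
    rng = np.random.default_rng(seed)
    stats = np.abs(np.asarray(stats, dtype=float))
    # sample statistic vectors from the joint null; no data resampling needed
    z = rng.multivariate_normal(np.zeros(len(stats)), corr, size=n_boot)
    max_abs = np.abs(z).max(axis=1)  # null distribution of the maximum
    # adjusted p with +1 smoothing so no adjusted p-value is exactly 0
    adj = (1 + (max_abs[:, None] >= stats[None, :]).sum(axis=0)) / (1 + n_boot)
    return adj
```

Because only Gaussian vectors are drawn, the cost scales with the number of tests and draws rather than with re-fitting the model to resampled data, which is the speed advantage the abstract highlights over permutation.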
Project description:To detect localized glaucomatous structural changes using a proper orthogonal decomposition (POD) framework with false-positive control that minimizes confirmatory follow-ups, and to compare the results to topographic change analysis (TCA). We included 167 participants (246 eyes) with ≥4 Heidelberg Retina Tomograph (HRT)-II exams from the Diagnostic Innovations in Glaucoma Study; 36 eyes progressed by stereo-photographs or visual fields. All other patient eyes (n = 210) were non-progressing. Specificities were evaluated using 21 normal eyes. Significance of change at each HRT superpixel between each follow-up and its nearest baseline (obtained using POD) was estimated using mixed-effects ANOVA. Locations with significant reduction in retinal height (red pixels) were determined using the Bonferroni, Lehmann-Romano k-family-wise error rate (k-FWER), and Benjamini-Hochberg false discovery rate (FDR) type I error control procedures. The observed positive rate (OPR) in each follow-up was calculated as the ratio of the number of red pixels within the disk to the disk size. Progression by POD was defined as one or more follow-ups with an OPR greater than the anticipated false-positive rate. TCA was evaluated using the recently proposed liberal, moderate, and conservative progression criteria. Sensitivity in progressors, specificity in normals, and specificity in non-progressors, respectively, were: POD-Bonferroni = 100%, 0%, and 0%; POD k-FWER = 78%, 86%, and 43%; POD-FDR = 78%, 86%, and 43%; POD k-FWER with retinal height change ≥50 μm = 61%, 95%, and 60%; TCA-liberal = 86%, 62%, and 21%; TCA-moderate = 53%, 100%, and 70%; and TCA-conservative = 17%, 100%, and 84%. With its stronger control of type I errors, k-FWER in the POD framework minimized confirmatory follow-ups while providing diagnostic accuracy comparable to TCA. Thus, POD with k-FWER shows promise for reducing the number of confirmatory follow-ups required for clinical care and for studies evaluating new glaucoma treatments.
(ClinicalTrials.gov number: NCT00221897.)
Project description:Microarray gene expression applications have greatly stimulated statistical research on the massive multiple hypothesis testing problem. There is now a large body of literature in this area, with essentially five paradigms of massive multiple testing: control of the false discovery rate (FDR), estimation of the FDR, significance threshold criteria, control of the family-wise error rate (FWER) or generalized FWER (gFWER), and empirical Bayes approaches. This paper contains a technical survey of the developments of the FDR-related paradigms, emphasizing precise formulation of the problem, concepts of error measurement, and considerations in applications. The goal is not an exhaustive literature survey, but rather a review of the current state of the field.
Project description:Phylogenetic inference and evaluating support for inferred relationships is at the core of many studies testing evolutionary hypotheses. Despite the popularity of nonparametric bootstrap frequencies and Bayesian posterior probabilities, the interpretation of these measures of tree branch support remains a source of discussion. Furthermore, both methods are computationally expensive and become prohibitive for large data sets. Recent fast approximate likelihood-based measures of branch support (the approximate likelihood ratio test [aLRT] and Shimodaira-Hasegawa [SH]-aLRT) provide a compelling alternative to these slower conventional methods, offering not only speed advantages but also excellent levels of accuracy and power. Here we propose an additional method: a Bayesian-like transformation of the aLRT (aBayes). Considering both probabilistic and frequentist frameworks, we compare the performance of the three fast likelihood-based methods with the standard bootstrap (SBS), the Bayesian approach, and the recently introduced rapid bootstrap. Our simulations and real data analyses show that with moderate model violations, all tests are sufficiently accurate, but aLRT and aBayes offer the highest statistical power and are very fast. With severe model violations, aLRT, aBayes and Bayesian posteriors can produce elevated false-positive rates. For data sets in which such violations can be detected, we recommend using SH-aLRT, the nonparametric version of the aLRT based on a procedure similar to the Shimodaira-Hasegawa tree selection. In general, the SBS seems to be excessively conservative and is much slower than our approximate likelihood-based methods.
Project description:BACKGROUND:When many (up to millions of) statistical tests are conducted in discovery set analyses such as genome-wide association studies (GWAS), approaches controlling the family-wise error rate (FWER) or false discovery rate (FDR) are required to reduce the number of false-positive decisions. Some methods were specifically developed in the context of high-dimensional settings and partially rely on the estimation of the proportion of true null hypotheses. However, these approaches are also applied in low-dimensional settings such as replication set analyses, which might be restricted to a small number of specific hypotheses. The aim of this study was to compare different approaches in low-dimensional settings using (a) real data from the CKDGen Consortium and (b) a simulation study. RESULTS:In both the application and the simulation, FWER approaches were less powerful than FDR control methods, regardless of whether the number of tested hypotheses was large or small. The q-value method was the most powerful. However, its specificity in maintaining true null hypotheses was especially decreased when the number of tested hypotheses was small, because in this low-dimensional situation the estimate of the proportion of true null hypotheses was biased. CONCLUSIONS:The results highlight the importance of a sizeable data set for reliable estimation of the proportion of true null hypotheses. Consequently, methods relying on this estimate should only be applied in high-dimensional settings. Furthermore, if the focus lies on testing a small number of hypotheses, as in replication settings, FWER methods rather than FDR methods should be preferred to maintain high specificity.
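The estimation step these conclusions warn about can be illustrated with Storey's simple estimator of the proportion of true nulls at a fixed tuning parameter λ (λ = 0.5 below is an illustrative choice). The estimate is a proportion computed from the observed p-values, so with only a handful of hypotheses it is inevitably noisy:

```python
import numpy as np

def storey_pi0(pvals, lam=0.5):
    """Storey-style estimate of the proportion of true null hypotheses.

    Under the null, p-values are uniform, so the count of p-values above
    lam is roughly pi0 * m * (1 - lam); rescaling gives the estimate.
    """
    p = np.asarray(pvals, dtype=float)
    pi0 = np.mean(p > lam) / (1.0 - lam)
    return min(pi0, 1.0)  # clip: a proportion cannot exceed 1
```

With thousands of p-values the estimate concentrates near the true proportion; with only a few, `np.mean(p > lam)` is an average of a handful of indicators, which is consistent with the biased low-dimensional behavior reported above.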