Project description:The conventional approach of choosing sample size to provide 80% or greater power ignores the cost implications of different sample size choices. Costs, however, are often impossible for investigators and funders to ignore in actual practice. Here, we propose and justify a new approach for choosing sample size based on cost efficiency, the ratio of a study's projected scientific and/or practical value to its total cost. By showing that a study's projected value exhibits diminishing marginal returns as a function of increasing sample size for a wide variety of definitions of study value, we are able to develop two simple choices that can be defended as more cost efficient than any larger sample size. The first is to choose the sample size that minimizes the average cost per subject. The second is to choose sample size to minimize total cost divided by the square root of sample size. This latter method is theoretically more justifiable for innovative studies, but also performs reasonably well and has some justification in other cases. For example, if projected study value is assumed to be proportional to power at a specific alternative and total cost is a linear function of sample size, then this approach is guaranteed either to produce more than 90% power or to be more cost efficient than any sample size that does. These methods are easy to implement, based on reliable inputs, and well justified, so they should be regarded as acceptable alternatives to current conventional approaches.
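A minimal sketch of the two sample-size rules described above, assuming a hypothetical cost model with a fixed start-up cost, a constant per-subject cost, and a mild quadratic term reflecting rising marginal recruitment costs (the cost figures, the candidate range of n, and the cost model itself are illustrative assumptions, not values from the study):

import numpy as np

# Hypothetical cost model (illustrative assumption): fixed start-up cost plus
# per-subject cost, with a mild quadratic term for rising marginal costs.
FIXED = 50_000.0
PER_SUBJECT = 400.0
RISING = 2.0

def total_cost(n):
    """Projected total study cost as a function of sample size n."""
    return FIXED + PER_SUBJECT * n + RISING * n**2

candidate_n = np.arange(10, 1001)

# Rule 1: choose n minimizing the average cost per subject, total_cost(n) / n.
avg_cost = total_cost(candidate_n) / candidate_n
n_rule1 = candidate_n[np.argmin(avg_cost)]

# Rule 2: choose n minimizing total_cost(n) / sqrt(n), the rule suggested for
# innovative studies.  Note that if total cost were purely linear, Rule 2 would
# still have an interior optimum at n = FIXED / PER_SUBJECT, whereas the
# per-subject average would keep falling as n grows.
root_n_cost = total_cost(candidate_n) / np.sqrt(candidate_n)
n_rule2 = candidate_n[np.argmin(root_n_cost)]

print(f"Rule 1 (min cost per subject): n = {n_rule1}")
print(f"Rule 2 (min cost per sqrt(n)): n = {n_rule2}")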
Project description:Multilevel models have been developed to address data with a hierarchical structure. In particular, with the increase in longitudinal studies, three-level growth models are frequently used to measure change in individuals who are nested in groups. In multilevel modeling, sufficient sample sizes are needed to obtain unbiased estimates and enough power to detect individual or group effects. However, there are few sample size guidelines for three-level growth models. Therefore, it is important that researchers recognize the possibility of unreliable results when sample sizes are small. The purpose of this study is to find adequate sample sizes for a three-level growth model under realistic conditions. A Monte Carlo simulation was performed under 12 conditions, crossing (1) level-2 sample size (10, 30), (2) level-3 sample size (30, 50, 100), and (3) intraclass correlation at level 3 (0.05, 0.15). The study examined the following outcomes: convergence rate, relative parameter bias, mean square error (MSE), 95% coverage rate, and power. The results indicate that estimates of the regression coefficients are unbiased, but the variance component estimates tend to be inaccurate with small sample sizes.
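To make the simulation design concrete, the sketch below enumerates the 2 x 3 x 2 factorial grid of 12 conditions and generates one replicate dataset from a simple three-level growth model (random intercepts at levels 2 and 3 plus a fixed linear time slope). The variance values, slope, and number of measurement occasions are illustrative assumptions; the study's actual generating parameters and fitting software are not given in the description.

import itertools
import numpy as np

rng = np.random.default_rng(2024)

# The 2 x 3 x 2 factorial design: 12 simulation conditions.
conditions = list(itertools.product((10, 30),           # level-2 sample size
                                    (30, 50, 100),       # level-3 sample size
                                    (0.05, 0.15)))       # level-3 ICC
assert len(conditions) == 12

def simulate_growth_data(n2, n3, icc3, n_time=4, slope=0.5):
    """One replicate from a simple three-level growth model (assumed form):
    random intercepts for groups (level 3) and individuals (level 2) plus a
    fixed linear time slope.  Level-3 variance is set to match the ICC."""
    var2, resid_var = 0.5, 1.0                           # illustrative values
    var3 = icc3 * (var2 + resid_var) / (1.0 - icc3)      # yields the target ICC
    rows = []
    for g in range(n3):                                  # groups
        u3 = rng.normal(0.0, np.sqrt(var3))
        for i in range(n2):                              # individuals in group
            u2 = rng.normal(0.0, np.sqrt(var2))
            for t in range(n_time):                      # measurement occasions
                y = u3 + u2 + slope * t + rng.normal(0.0, np.sqrt(resid_var))
                rows.append((g, i, t, y))
    return np.array(rows)

# Example: one dataset from the smallest design cell (10 per group, 30 groups).
data = simulate_growth_data(*conditions[0])
print(data.shape)   # (30 groups * 10 individuals * 4 occasions, 4 columns)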
Project description:Limited sample sizes can lead to spurious modeling findings in biomedical research. The objective of this work is to present a new method to generate synthetic populations (SPs) from limited samples using matched case-control data (n = 180 pairs), treated as two separate limited samples. SPs were generated with multivariate kernel density estimations (KDEs) with unconstrained bandwidth matrices. We included four continuous variables and one categorical variable for each individual. Bandwidth matrices were determined with Differential Evolution (DE) optimization based on covariance comparisons. Four synthetic samples (n = 180) were derived from their respective SPs. Similarity between the observed and synthetic samples was assessed by comparing their empirical probability density functions (EPDFs). EPDFs were compared with the maximum mean discrepancy (MMD) test statistic based on the Kernel Two-Sample Test. To evaluate similarity within a modeling context, EPDFs derived from Principal Component Analysis (PCA) scores and residuals were summarized with the distance to the model in X-space (DModX) as additional comparisons. Four SPs were generated from each sample. The probability of selecting a replicate when randomly constructing synthetic samples (n = 180) was infinitesimally small. MMD tests indicated that the observed sample EPDFs were similar to the respective synthetic EPDFs. PCA scores and residuals of the observed samples did not deviate significantly from those of their respective synthetic samples. The feasibility of this approach was demonstrated by producing synthetic data at the individual level that were statistically similar to the observed samples. The methodology coupled KDE with DE optimization and deployed novel similarity metrics derived from PCA. This approach could be used to generate larger-sized synthetic samples. To develop this approach into a research tool for data exploration purposes, additional evaluation with increased dimensionality is required. Moreover, given a fully specified population, the degree to which individuals can be discarded while still synthesizing that population accurately will be investigated. When these objectives are addressed, comparisons with other techniques such as bootstrapping will be required for a complete evaluation.
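A rough sketch of this generation pipeline for the continuous variables only (the categorical variable and the study's exact covariance-comparison criterion are omitted): a Gaussian KDE with a full, unconstrained bandwidth matrix parameterised through its Cholesky factor, with the bandwidth chosen by Differential Evolution. The observed data are replaced by a simulated stand-in, and the objective shown is only a placeholder for the study's covariance comparison.

import numpy as np
from scipy.optimize import differential_evolution

rng = np.random.default_rng(0)

# Stand-in for one observed limited sample: n = 180 rows, 4 continuous
# variables (real data would be loaded here; this is only for illustration).
X = rng.multivariate_normal(mean=np.zeros(4),
                            cov=np.array([[1.0, 0.3, 0.1, 0.0],
                                          [0.3, 1.0, 0.2, 0.1],
                                          [0.1, 0.2, 1.0, 0.4],
                                          [0.0, 0.1, 0.4, 1.0]]),
                            size=180)
n, d = X.shape

def bandwidth_from_params(params):
    """Build an unconstrained (full) bandwidth matrix H from a lower-triangular
    Cholesky parameterisation, which keeps H symmetric positive definite."""
    L = np.zeros((d, d))
    L[np.tril_indices(d)] = params
    L[np.diag_indices(d)] = np.abs(np.diag(L)) + 1e-6
    return L @ L.T

def sample_kde(data, H, size, generator):
    """Draw from the Gaussian KDE: resample data rows, add N(0, H) noise."""
    idx = generator.integers(0, len(data), size=size)
    noise = generator.multivariate_normal(np.zeros(d), H, size=size)
    return data[idx] + noise

def objective(params):
    """Placeholder covariance-comparison objective: Frobenius distance between
    the observed covariance and the covariance of a large synthetic draw under
    bandwidth H.  This simplistic criterion favours very small bandwidths; the
    study's actual covariance-comparison criterion would be substituted here."""
    H = bandwidth_from_params(params)
    synth = sample_kde(X, H, size=2000, generator=np.random.default_rng(1))
    return np.linalg.norm(np.cov(synth, rowvar=False) - np.cov(X, rowvar=False))

bounds = [(-1.0, 1.0)] * (d * (d + 1) // 2)
result = differential_evolution(objective, bounds, maxiter=20, seed=2, tol=1e-6)
H_opt = bandwidth_from_params(result.x)

synthetic_sample = sample_kde(X, H_opt, size=180, generator=rng)
print(synthetic_sample.shape)   # (180, 4)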
Project description:Motivation: Network-based analyses of high-throughput genomics data provide a holistic, systems-level understanding of various biological mechanisms for a common population. However, when estimating multiple networks across heterogeneous sub-populations, varying sample sizes pose a challenge for estimation and inference, as network differences may be driven by differences in power. We are particularly interested in addressing this challenge in the context of proteomic networks for related cancers, as the number of subjects available for rare cancer (sub-)types is often limited. Results: We develop NExUS (Network Estimation across Unequal Sample sizes), a Bayesian method that enables joint learning of multiple networks while avoiding an artefactual relationship between sample size and network sparsity. We demonstrate through simulations that NExUS outperforms existing network estimation methods in this context, and apply it to learn network similarity and shared pathway activity for groups of cancers with related origins represented in The Cancer Genome Atlas (TCGA) proteomic data. Availability and implementation: The NExUS source code is freely available for download at https://github.com/priyamdas2/NExUS. Supplementary information: Supplementary data are available at Bioinformatics online.
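The sketch below is not NExUS itself; it only illustrates the artefact the method is designed to avoid, using naive independent graphical-lasso fits (via scikit-learn) on two simulated groups that share the same true network but have very different sample sizes. The dimensions, sample sizes, and network structure are illustrative assumptions.

import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(7)

# Illustration of the problem described above (NOT the NExUS method): when each
# sub-population's network is estimated independently, smaller samples tend to
# yield artificially sparser networks simply because weaker partial
# correlations cannot survive regularisation at low n.
p = 10
prec = np.eye(p)
for i in range(p - 1):                  # a shared chain-structured true network
    prec[i, i + 1] = prec[i + 1, i] = 0.3
cov = np.linalg.inv(prec)

for label, n in [("common cancer type, n = 300", 300),
                 ("rare cancer type,   n = 30 ", 30)]:
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    model = GraphicalLassoCV().fit(X)
    n_edges = np.sum(np.abs(np.triu(model.precision_, k=1)) > 1e-8)
    print(f"{label}: {n_edges} edges recovered (true network has {p - 1})")

# Joint methods such as NExUS borrow strength across groups so that edge
# recovery is not driven by sample size alone; see the linked source code.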
Project description:Multivariate pattern analysis approaches can be applied to the topographic distribution of event-related potential (ERP) signals to 'decode' subtly different stimulus classes, such as different faces or different orientations. These approaches are extremely sensitive, and it seems possible that they could also be used to increase effect sizes and statistical power in traditional paradigms that ask whether an ERP component differs in amplitude across conditions. To assess this possibility, we leveraged the open-source ERP CORE dataset and compared the effect sizes resulting from conventional univariate analyses of mean amplitude with two multivariate pattern analysis approaches (support vector machine decoding and the cross-validated Mahalanobis distance, both of which are easy to compute using open-source software). We assessed these approaches across seven widely studied ERP components (N170, N400, N2pc, P3b, lateralized readiness potential, error-related negativity, and mismatch negativity). Across all components, we found that multivariate approaches yielded effect sizes that were as large as or larger than the effect sizes produced by univariate approaches. These results indicate that researchers could obtain larger effect sizes, and therefore greater statistical power, by using multivariate analyses of topographic voltage patterns instead of traditional univariate analyses in many ERP studies.
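As a minimal illustration of the decoding approach (not the authors' exact pipeline), the sketch below trains a linear support vector machine to classify two conditions from single-trial topographic patterns and scores it with stratified cross-validation. The data are a simulated stand-in for single-trial mean amplitudes across electrodes; the cross-validated Mahalanobis distance would be computed on the same trial-by-channel matrices.

import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(42)

# Stand-in data: single-trial mean amplitudes over a measurement window,
# shaped (trials, channels).  With real ERP data these would be averaged
# voltages per trial and electrode for the two conditions being compared.
n_trials, n_channels = 200, 30
condition = np.repeat([0, 1], n_trials // 2)
topography_shift = rng.normal(0, 0.5, n_channels)   # assumed condition effect
X = rng.normal(0, 1, (n_trials, n_channels)) + np.outer(condition, topography_shift)

# Support vector machine decoding of condition from the topographic pattern,
# evaluated with stratified cross-validation; above-chance accuracy plays the
# role of the "effect" in the multivariate analysis.
clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
accuracy = cross_val_score(clf, X, condition, cv=cv).mean()
print(f"Cross-validated decoding accuracy: {accuracy:.2f} (chance = 0.50)")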
Project description:Saturation is commonly used to determine sample sizes in qualitative research, yet there is little guidance on what influences saturation. We aimed to assess saturation and identify parameters to estimate sample sizes for focus group studies in advance of data collection. We used two approaches to assess saturation in data from 10 focus group discussions. Four focus groups were sufficient to identify a range of new issues (code saturation), but more groups were needed to fully understand these issues (meaning saturation). Group stratification influenced meaning saturation, whereby one focus group per stratum was needed to identify issues; two groups per stratum provided a more comprehensive understanding of issues, but more groups per stratum provided little additional benefit. We identify six parameters influencing saturation in focus group data: study purpose, type of codes, group stratification, number of groups per stratum, and type and degree of saturation.
Project description:Accurate quantification of forest carbon stocks is required for constraining the global carbon cycle and its impacts on climate. The accuracy of forest biomass maps is inherently dependent on the accuracy of the field biomass estimates used to calibrate models, which are generated with allometric equations. Here, we provide a quantitative assessment of the sensitivity of allometric parameters to sample size in temperate forests, focusing on the allometric relationship between tree height and crown radius. We use LiDAR remote sensing to isolate from 10,000 to more than 1,000,000 tree height and crown radius measurements per site in six U.S. forests. We find that fitted allometric parameters are highly sensitive to sample size, producing systematic overestimates of height. We extend our analysis to biomass through the application of empirical relationships from the literature, and show that, given the small sample sizes used in common allometric equations for biomass, the average site-level biomass bias is ~+70% with a standard deviation of 71%, ranging from -4% to +193%. These findings underscore the importance of increasing the sample sizes used to generate allometric equations.
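A sketch of the kind of sensitivity analysis described above, under assumed values: a power-law height-to-crown-radius allometry is simulated for a large "LiDAR-derived" population, then refitted on repeated subsamples of different sizes to show how strongly the fitted exponent depends on sample size. The coefficients, noise level, and subsample sizes are illustrative assumptions, not values from the study.

import numpy as np

rng = np.random.default_rng(1)

# Illustrative stand-in for LiDAR-derived measurements: a power-law allometry
# between crown radius (m) and tree height (m) with multiplicative noise.
n_population = 200_000
crown_radius = rng.lognormal(mean=1.0, sigma=0.4, size=n_population)
height = 6.0 * crown_radius**0.8 * rng.lognormal(mean=0.0, sigma=0.25, size=n_population)

def fit_allometry(cr, h):
    """Log-log least-squares fit of h = a * cr**b; returns (a, b)."""
    b, log_a = np.polyfit(np.log(cr), np.log(h), deg=1)
    return np.exp(log_a), b

# Refit the allometry on repeated subsamples of different sizes to see how
# strongly the fitted exponent depends on sample size.
for n_sample in (30, 300, 10_000):
    exponents = []
    for _ in range(200):
        idx = rng.choice(n_population, size=n_sample, replace=False)
        _, b_hat = fit_allometry(crown_radius[idx], height[idx])
        exponents.append(b_hat)
    print(f"n = {n_sample:>6}: exponent mean = {np.mean(exponents):.3f}, "
          f"SD = {np.std(exponents):.3f}")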
Project description:Background and objectives: Reference ranges are widely used to locate the major range of the target probability distribution. When future measurements fall outside the reference range, they are classified as atypical and require further investigation. The fundamental principles and statistical properties of reference ranges are closely related to those of tolerance interval procedures. Existing investigations of reference ranges and tolerance intervals have mainly been devoted to the basic cases of one-sample and paired-sample designs. Although reference ranges hold considerable promise for parallel-group designs, the corresponding methodological and computational issues for determining reference limits and sample sizes have not been adequately addressed. Methods: This paper describes a complete collection of one- and two-sided reference ranges for assessing measurement differences in parallel-group studies that assume variance homogeneity. Results: The problem of sample size determination for precise reference ranges is also examined under expected half-width and assurance probability considerations. Unlike current methods, the suggested sample size criteria explicitly accommodate the desired interval width in precise interval estimation. Conclusions: Theoretical examinations and empirical assessments are presented to validate the usefulness of the proposed reference range and sample size procedures. To facilitate the use of the recommended techniques in practical applications, computer programs are developed for efficient calculation and exact analysis. A real data example regarding tablet absorption rate and extent is presented to illustrate the suggested assessments for comparing two drug formulations.
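For orientation only, the sketch below computes a naive large-sample plug-in version of a two-sided reference range for the difference between one future measurement from each of two parallel groups, using a pooled variance estimate under variance homogeneity. The paper's exact one- and two-sided procedures and its sample-size criteria (expected half-width, assurance probability) are more refined and are not reproduced here; all data values are simulated assumptions.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Simulated stand-in for a parallel-group study with homogeneous variance
# (e.g., a measurement from each of two drug formulations).
group1 = rng.normal(loc=10.0, scale=2.0, size=40)
group2 = rng.normal(loc=9.2, scale=2.0, size=40)

def plugin_reference_range(x1, x2, coverage=0.95):
    """Naive plug-in two-sided reference range for the difference between one
    future measurement from each group, assuming equal variances.

    This is only a large-sample approximation for illustration; exact
    procedures additionally account for the uncertainty in the estimated mean
    difference and pooled variance."""
    n1, n2 = len(x1), len(x2)
    diff = x1.mean() - x2.mean()
    pooled_var = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    z = stats.norm.ppf(0.5 + coverage / 2.0)
    half_width = z * np.sqrt(2.0 * pooled_var)   # Var(X1 - X2) = 2 * sigma^2
    return diff - half_width, diff + half_width

low, high = plugin_reference_range(group1, group2)
print(f"Approximate 95% reference range for a measurement difference: ({low:.2f}, {high:.2f})")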
Project description:A central challenge in the analysis of genetic variation is to provide realistic genome simulation across millions of samples. Present-day coalescent simulators either do not scale well or use approximations that fail to capture important long-range linkage properties. Analysing the results of simulations also presents a substantial challenge, as current methods for storing genealogies consume a great deal of space, are slow to parse, and do not take advantage of shared structure in correlated trees. We solve these problems by introducing sparse trees and coalescence records as the key units of genealogical analysis. Using these tools, exact simulation of the coalescent with recombination for chromosome-sized regions over hundreds of thousands of samples is possible, and substantially faster than present-day approximate methods. We can also analyse the results orders of magnitude more quickly than with existing methods.
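The description does not spell out the exact contents of a coalescence record, but a minimal sketch under plausible assumptions (a genomic interval, the parent node created by the event, its child nodes, and the event time) shows why interval-based records let adjacent correlated trees share structure rather than being stored as full copies.

from dataclasses import dataclass
from typing import Tuple

# A minimal sketch of the "coalescence record" idea; the fields below are an
# assumption for illustration, not the authors' exact specification.
@dataclass(frozen=True)
class CoalescenceRecord:
    left: float                 # start of the genomic interval the record covers
    right: float                # end of the interval
    parent: int                 # ancestral node created by the coalescence
    children: Tuple[int, ...]   # nodes that coalesce into the parent
    time: float                 # time of the coalescence event

# Because each record covers an interval rather than a single position,
# adjacent correlated trees can share records (here, the record for node 4):
records = [
    CoalescenceRecord(left=0.0,     right=10_000.0, parent=4, children=(0, 1), time=0.3),
    CoalescenceRecord(left=0.0,     right=6_000.0,  parent=5, children=(2, 4), time=0.8),
    CoalescenceRecord(left=6_000.0, right=10_000.0, parent=6, children=(3, 4), time=1.1),
    CoalescenceRecord(left=0.0,     right=6_000.0,  parent=7, children=(3, 5), time=2.0),
    CoalescenceRecord(left=6_000.0, right=10_000.0, parent=8, children=(2, 6), time=2.4),
]

def tree_at(position, records):
    """Recover the sparse tree at a genomic position as a child -> parent map."""
    return {c: r.parent
            for r in records if r.left <= position < r.right
            for c in r.children}

print(tree_at(3_000.0, records))   # tree for the left interval
print(tree_at(8_000.0, records))   # tree for the right interval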