Project description: In genomic-scale data sets, loci are closely packed within chromosomes and hence provide correlated information. Averaging across loci as if they were independent creates pseudoreplication, which reduces the effective degrees of freedom (df') compared to the nominal degrees of freedom (df). This issue has been known for some time, but its consequences have not been systematically quantified across the entire genome. Here, we measured pseudoreplication (quantified by the ratio df'/df) for a common metric of genetic differentiation (FST) and a common measure of linkage disequilibrium between pairs of loci (r2). Based on data simulated with SLiM and msprime, which allow efficient forward-in-time and coalescent simulations while precisely controlling population pedigrees, we estimated df' and df'/df by measuring the rate of decline in the variance of mean FST and mean r2 as more loci were used. For both indices, df' increases with Ne and genome size, as expected. However, even for large Ne and large genomes, df' for mean r2 plateaus after a few thousand loci, and a variance-components analysis indicates that the limiting factor is uncertainty associated with sampling individuals rather than genes. Pseudoreplication is less extreme for FST, but df'/df ≤ 0.01 can occur in data sets using tens of thousands of loci. Commonly used block-jackknife methods consistently overestimated var(FST), producing very conservative confidence intervals. Predicting df' from our modelling results as a function of Ne, L, S, and genome size provides a robust way to quantify the precision of genomic-scale data sets.
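The plateau in df' has a simple toy illustration: if loci shared a constant pairwise correlation ρ, then var(mean of L loci) = σ²(1 + (L−1)ρ)/L, so df' = L / (1 + (L−1)ρ), which can never exceed 1/ρ no matter how many loci are added. A minimal numpy sketch of this (equicorrelation and ρ = 0.01 are illustrative assumptions, not the article's pedigree-based simulations):

```python
import numpy as np

rng = np.random.default_rng(0)

def effective_df(L, rho, n_reps=100_000):
    """Empirical df' for the mean of L equicorrelated, unit-variance loci.

    df' is defined so that var(mean) = sigma^2 / df'; with sigma^2 = 1,
    df' = 1 / var(mean).  Equicorrelated loci can be written as
    x_i = sqrt(rho)*z + sqrt(1-rho)*e_i with a shared factor z, so the
    mean over L loci can be simulated directly from z and mean(e_i).
    """
    z = rng.standard_normal(n_reps)                   # shared factor
    ebar = rng.standard_normal(n_reps) / np.sqrt(L)   # mean of L independent e_i
    locus_mean = np.sqrt(rho) * z + np.sqrt(1 - rho) * ebar
    return 1.0 / locus_mean.var()

# df' plateaus near 1/rho = 100 no matter how many loci are added
print(effective_df(100, 0.01))     # ~50: df'/df is already 0.5
print(effective_df(10_000, 0.01))  # ~100: df'/df is now ~0.01
```

Going from 100 to 10,000 loci multiplies the nominal df by 100 but barely doubles df', which is the qualitative pattern the study quantifies for real genomic correlation structure.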
Project description: This article reviews how to analyze data from experiments designed to compare the cellular physiology of two or more groups of animals or people. This is commonly done by measuring several cells from each animal and using simple t tests or ANOVA to compare the groups. I use simulations to illustrate that this approach can produce spurious positive results because it treats the cells from each animal as if they were independent of each other. This problem, which may be responsible for much of the lack of reproducibility in the literature, can easily be avoided by using a hierarchical, nested statistical approach.
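The simulation the article describes can be sketched in a few lines: generate data with no true group difference, then compare a naive t test on all cells against a test on per-animal means. All parameter values below (5 animals per group, 20 cells per animal, equal animal and cell variances) are invented for illustration, not taken from the article:

```python
import numpy as np

rng = np.random.default_rng(1)

def false_positive_rates(n_sims=2000, n_animals=5, n_cells=20,
                         sigma_animal=1.0, sigma_cell=1.0):
    """Simulate the null (no real group difference) and compare two tests:
    naive  -- pool all cells and treat them as independent observations
    nested -- average cells within each animal first (animal = replicate)
    Returns the fraction of simulations with |t| > 1.96 for each analysis.
    """
    naive_hits = nested_hits = 0
    for _ in range(n_sims):
        # two groups of animals drawn from identical distributions
        animal = rng.normal(0.0, sigma_animal, (2, n_animals))
        cells = animal[:, :, None] + rng.normal(0.0, sigma_cell,
                                                (2, n_animals, n_cells))
        # naive: every cell counted as an independent observation
        g1, g2 = cells[0].ravel(), cells[1].ravel()
        t = (g1.mean() - g2.mean()) / np.sqrt(
            g1.var(ddof=1) / g1.size + g2.var(ddof=1) / g2.size)
        naive_hits += abs(t) > 1.96
        # nested: one mean per animal, the genuine replicate
        m1, m2 = cells[0].mean(axis=1), cells[1].mean(axis=1)
        t = (m1.mean() - m2.mean()) / np.sqrt(
            m1.var(ddof=1) / n_animals + m2.var(ddof=1) / n_animals)
        nested_hits += abs(t) > 1.96
    return naive_hits / n_sims, nested_hits / n_sims

naive, nested = false_positive_rates()
print(naive, nested)  # naive is far above the nominal 0.05; nested is close
```

With these settings the naive analysis rejects the (true) null in a large fraction of simulations, while the animal-level analysis stays near the nominal 5%.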
Project description: Pseudoreplication occurs when the number of measured values or data points exceeds the number of genuine replicates, and the statistical analysis treats all data points as independent, so that each contributes fully to the result. By artificially inflating the sample size, pseudoreplication contributes to irreproducibility, and it is a pervasive problem in biological research. In some fields, more than half of published experiments contain pseudoreplication, making it one of the biggest threats to inferential validity. Researchers may be reluctant to use appropriate statistical methods when their hypothesis concerns the pseudoreplicates rather than the genuine replicates; for example, when an intervention is applied to pregnant female rodents (genuine replicates) but the hypothesis is about the effect on their multiple offspring (pseudoreplicates). We propose a Bayesian predictive approach, which enables researchers to make valid inferences about the biological entities of interest, even if they are pseudoreplicates, and we show the benefits of this approach using two in vivo data sets.
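The pregnant-rodent example can be made concrete with a toy variance-components sketch: a predictive statement about a new offspring from a new dam must carry both the between-dam and the within-litter variance, which a naive pooled standard error ignores. This is a method-of-moments caricature of the idea, not the article's full Bayesian predictive model; all numbers (8 dams, 6 pups, both variance components) are invented:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy litter data: dams are the genuine replicates, pups the pseudoreplicates
n_dams, n_pups = 8, 6
dam_means = rng.normal(10.0, 2.0, n_dams)                    # between-dam sd = 2
pups = dam_means[:, None] + rng.normal(0.0, 1.0, (n_dams, n_pups))

# Naive SE treats every pup as an independent replicate
naive_se = pups.std(ddof=1) / np.sqrt(pups.size)

# A prediction for a *new pup from a new dam* must include both variance
# components (method-of-moments estimates; the article uses a Bayesian model)
within = pups.var(axis=1, ddof=1).mean()
between = pups.mean(axis=1).var(ddof=1) - within / n_pups
predictive_sd = np.sqrt(max(between, 0.0) + within)
print(naive_se, predictive_sd)
```

The predictive standard deviation is much larger than the naive standard error, because the relevant uncertainty for an offspring-level claim is dominated by variation between dams, not by the number of pups measured.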
Project description: Despite expanding data sets and advances in phylogenomic methods, deep-level metazoan relationships remain highly controversial. Recent phylogenomic analyses depart from classical concepts in recovering ctenophores as the earliest branching metazoan taxon and propose a sister-group relationship between sponges and cnidarians (e.g., Dunn CW, Hejnol A, Matus DQ, et al. (18 co-authors). 2008. Broad phylogenomic sampling improves resolution of the animal tree of life. Nature 452:745-749). Here, we argue that these results are artifacts stemming from insufficient taxon sampling and long-branch attraction (LBA). By increasing taxon sampling from previously unsampled nonbilaterians and using an identical gene set to that reported by Dunn et al., we recover monophyletic Porifera as the sister group to all other Metazoa. This suggests that the basal position of the fast-evolving Ctenophora proposed by Dunn et al. was due to LBA and that broad taxon sampling is of fundamental importance to metazoan phylogenomic analyses. Additionally, saturation in the Dunn et al. character set is comparatively high, possibly contributing to the poor support for some nonbilaterian nodes.
Project description: Quantifying fluxes of nitrous oxide (N2O), a potent greenhouse gas, from soils is necessary to improve our knowledge of terrestrial N2O losses. Developing universal sampling frequencies for calculating annual N2O fluxes is difficult, as fluxes are renowned for their high temporal variability. We demonstrate that daily sampling was generally required to achieve annual N2O flux estimates within 10% of the 'best' estimate for 28 annual datasets collected from three continents: Australia, Europe and Asia. Decreasing the regularity of measurements either under- or overestimated annual N2O fluxes, with a maximum overestimation of 935%. Measurement frequency could be lowered using a sampling strategy based on environmental factors known to affect temporal variability, but sampling more than once a week was still required. Consequently, the uncertainty in current global terrestrial N2O budgets associated with upscaling field-based datasets can be decreased significantly by using adequate sampling frequencies.
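The sensitivity to sampling frequency is easy to reproduce on synthetic data: because N2O emissions are dominated by short-lived pulses, sparse sampling combined with interpolation can badly miss (or badly extrapolate) those pulses. The series below is invented (gamma baseline plus eight random pulses), as is the linear-interpolation gap-filling; neither is taken from the article's datasets:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic daily N2O flux series: low baseline plus a few large emission
# pulses (e.g. after rain or fertilisation). All values are illustrative.
days = 365
flux = rng.gamma(shape=0.5, scale=2.0, size=days)      # baseline
pulse_days = rng.choice(days, size=8, replace=False)
flux[pulse_days] += rng.gamma(5.0, 20.0, size=8)       # episodic pulses

def annual_flux(series, every):
    """Annual sum estimated by sampling every `every` days and linearly
    interpolating between sampling dates (a common gap-filling approach)."""
    idx = np.arange(0, len(series), every)
    return np.interp(np.arange(len(series)), idx, series[idx]).sum()

best = flux.sum()                      # 'best' estimate: daily sampling
for every in (7, 28):
    err = 100 * (annual_flux(flux, every) - best) / best
    print(f"every {every} days: {err:+.1f}% error")
```

Whether sparse sampling over- or underestimates the annual total depends entirely on whether a sampling date happens to land on a pulse, which is the instability the study documents across 28 real datasets.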
Project description: Cells from the same individual share common genetic and environmental backgrounds and are not statistically independent; they are therefore subsamples, or pseudoreplicates. Thus, single-cell data have a hierarchical structure that many current single-cell methods do not address, leading to biased inference, highly inflated type I error rates, and reduced robustness and reproducibility. This includes methods that use a batch-effect correction for individual as a means of accounting for within-sample correlation. Here, we document this dependence across a range of cell types and show that pseudo-bulk aggregation methods are conservative and underpowered relative to mixed models. To compute differential expression within a specific cell type across treatment groups, we propose applying generalized linear mixed models with a random effect for individual, to properly account for both zero inflation and the correlation structure among measures from cells within an individual. Finally, we provide power estimates across a range of experimental conditions to assist researchers in designing appropriately powered studies.
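The pseudo-bulk alternative mentioned above is simple to sketch: cells are collapsed to one aggregate value per individual before any between-group test, so the test operates on the genuine units of independence (at the cost of power relative to a mixed model). A minimal numpy sketch with invented dimensions; a real analysis would feed the per-individual aggregates to a bulk differential-expression test:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy single-cell counts for one gene: 6 individuals (3 per group), each
# with its own baseline expression, and 200 cells per individual. The
# individual-level baselines induce the within-individual correlation.
# All values are illustrative.
n_per_group, n_cells = 3, 200
baseline = rng.lognormal(mean=1.0, sigma=0.5, size=2 * n_per_group)
counts = rng.poisson(baseline[:, None], size=(2 * n_per_group, n_cells))
group = np.repeat([0, 1], n_per_group)

# Pseudo-bulk: collapse cells to one value per individual, so any
# downstream test compares independent units rather than 1,200
# correlated cells
pseudobulk = counts.sum(axis=1)
print(counts.shape, "->", pseudobulk.shape)   # (6, 200) -> (6,)
```

The article's point is that this aggregation is valid but conservative; a generalized linear mixed model with a per-individual random effect uses the cell-level data while still respecting the hierarchy.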
Project description: Background: Pseudoreplication occurs when observations are not statistically independent but are treated as if they were. This can occur when there are multiple observations on the same subjects, when samples are nested or hierarchically organised, or when measurements are correlated in time or space. Analysing such data without taking these dependencies into account can lead to meaningless results, and examples can easily be found in the neuroscience literature. Results: A single issue of Nature Neuroscience provided a number of examples and is used as a case study to highlight how pseudoreplication arises in neuroscientific studies and why the analyses in these papers are incorrect; appropriate analytical methods are also provided. 12% of papers had pseudoreplication, and a further 36% were suspected of having pseudoreplication, but it was not possible to determine this for certain because insufficient information was provided. Conclusions: Pseudoreplication can undermine the conclusions of a statistical analysis, and it would be easier to detect if the sample size, degrees of freedom, test statistic, and precise p-values were reported. This information should be a requirement for all publications.
Project description: Understanding changes in biodiversity requires monitoring programs that encompass different dimensions of biodiversity through varying sampling techniques. In this work, fish assemblages associated with the "outer" and "inner" sides of four marinas, two in the Canary Islands and two in southern Portugal, were investigated using three complementary sampling techniques: underwater visual censuses (UVCs), baited cameras (BCs), and fish traps (FTs). We first investigated the complementarity of these sampling methods for describing species composition. We then investigated differences in taxonomic (TD), phylogenetic (PD), and functional (FD) diversity between sides of the marinas according to each sampling method. Finally, we explored the applicability and reproducibility of each sampling technique for characterizing fish assemblages according to these diversity metrics. UVCs and BCs provided complementary information on the number and abundances of species, while FTs sampled a distinct assemblage. Patterns of TD, PD, and FD between sides of the marinas varied depending on the sampling method. UVC was the most cost-efficient technique in terms of personnel hours and is recommended for local studies. For large-scale studies, however, BCs are recommended, as they cover greater spatio-temporal scales at a lower cost. Our study highlights the need to implement complementary sampling techniques to monitor ecological change across various dimensions of biodiversity. The results presented here will be useful for optimizing future monitoring programs.
Project description: The metabolic profiling of tissue biopsies using high-resolution magic angle spinning (HR-MAS) 1H nuclear magnetic resonance (NMR) spectroscopy may be influenced by experimental factors such as the sampling method. We therefore compared the effects of two different sampling methods on the metabolome of brain tissue obtained from the brainstem and thalamus of healthy goats by 1H HR-MAS NMR spectroscopy: biopsies harvested in vivo by a minimally invasive stereotactic approach versus samples harvested postmortem by dissection with a scalpel. Lactate and creatine were elevated, and choline-containing compounds were altered, in the postmortem compared with the in vivo-harvested samples, demonstrating rapid changes most likely due to sample ischemia. In addition, acetate and inositols in the brainstem samples, and γ-aminobutyric acid in the thalamus samples, were relatively increased postmortem, demonstrating regional differences in tissue degradation. In conclusion, brain biopsies harvested in vivo show different metabolic alterations than postmortem-harvested samples, reflecting less tissue degradation. Sampling method and brain region should be taken into account in the analysis of metabolic profiles. To be as close as possible to the actual situation in the living individual, it is desirable to use brain samples obtained by stereotactic biopsy whenever possible.
Project description: BACKGROUND: The allele frequency spectrum (AFS) consists of counts of the number of single nucleotide polymorphism (SNP) loci with derived variants present at each given frequency in a sample. Multiple approaches have recently been developed for parameter estimation and calculation of model likelihoods based on the joint AFS from two or more populations. We conducted a simulation study of one of these approaches, implemented in the Python module ∂a∂i, to compare parameter estimation and model selection accuracy given different sample sizes under one- and two-population models. RESULTS: Our simulations included a variety of demographic models and two parameterizations that differed in the timing of events (divergence or size change). Using a number of SNPs reasonably obtained through next-generation sequencing approaches (10,000–50,000), accurate parameter estimates and model selection were possible for models with more ancient demographic events, even given relatively small numbers of sampled individuals. However, for recent events, larger numbers of individuals were required to achieve accuracy and precision in parameter estimates similar to that seen for models with older divergence or population size changes. We quantify i) the uncertainty in model selection, using tools from information theory, and ii) the accuracy and precision of parameter estimates, using the root mean squared error, as a function of the timing of demographic events, sample sizes used in the analysis, and complexity of the simulated models. CONCLUSIONS: Here, we illustrate the utility of the genome-wide AFS for estimating demographic history and provide recommendations to guide sampling in population genomics studies that seek to draw inference from the AFS. Our results indicate that larger samples of individuals (and thus a larger AFS) provide greater power for model selection and parameter estimation for more recent demographic events.
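The AFS itself is a simple object to construct: given a 0/1 genotype matrix over haploid samples, entry k counts the SNPs whose derived allele appears on exactly k sampled chromosomes. A minimal numpy sketch (the toy data draw per-SNP frequencies at random purely for illustration; it is not a coalescent simulation of the kind ∂a∂i models):

```python
import numpy as np

rng = np.random.default_rng(5)

def derived_afs(genotypes):
    """Unfolded AFS from a (loci x haploid samples) 0/1 matrix, where 1
    marks the derived allele.  Entry k counts the SNPs whose derived
    allele appears on exactly k of the n sampled chromosomes."""
    n = genotypes.shape[1]
    return np.bincount(genotypes.sum(axis=1), minlength=n + 1)

# Toy data: 1,000 SNPs typed on 20 chromosomes, with a random
# derived-allele frequency per SNP (illustrative only)
freqs = rng.uniform(0.05, 0.5, size=(1000, 1))
genotypes = (rng.random((1000, 20)) < freqs).astype(int)
afs = derived_afs(genotypes)
print(afs.shape)   # (21,): entries for 0..20 copies of the derived allele
```

The sample-size effect discussed in the abstract enters here directly: more sampled chromosomes mean more AFS entries, and thus finer resolution on recent demographic events.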