ABSTRACT: A classical problem in statistics is estimating the expected coverage of a sample, a problem with applications in gene expression, microbial ecology, optimization, and even numismatics. Here we consider a related extension of this problem to random samples from two discrete distributions. Specifically, we estimate what we call the dissimilarity probability of a sample, i.e., the probability that a draw from one distribution is not observed in [Formula: see text] draws from another distribution. We show our estimator of dissimilarity to be a [Formula: see text]-statistic and a uniformly minimum variance unbiased estimator of dissimilarity over the largest appropriate range of [Formula: see text]. Furthermore, despite the non-Markovian nature of our estimator when applied sequentially over [Formula: see text], we show it converges uniformly in probability to the dissimilarity parameter, and we present criteria under which it is approximately normally distributed and admits a consistent jackknife estimator of its variance. As proof of concept, we analyze V35 16S rRNA data to discern between various microbial environments. Other potential applications concern any situation in which the dissimilarity of two discrete distributions may be of interest. For instance, in SELEX experiments, each urn could represent a random RNA pool and each draw a possible solution to a particular binding site problem over that pool. The dissimilarity of these pools is then related to the probability of finding binding site solutions in one pool that are absent in the other.
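Once both distributions are fully known, the dissimilarity parameter described above has the closed form D_n = Σ_x p(x)(1 − q(x))^n, which makes a small simulation a useful sanity check. The sketch below uses hypothetical toy distributions `p` and `q` given as dicts; it is a Monte Carlo check of the parameter itself, not the paper's unbiased U-statistic estimator:

```python
import random

def dissimilarity_exact(p, q, n):
    """D_n = sum_x p(x) * (1 - q(x))**n: the probability that one draw
    from p is absent from n independent draws from q."""
    return sum(px * (1.0 - q.get(x, 0.0)) ** n for x, px in p.items())

def dissimilarity_mc(p, q, n, trials=20000, seed=0):
    """Monte Carlo check of the same quantity (toy setup only, not the
    paper's U-statistic estimator, which works from samples alone)."""
    rng = random.Random(seed)
    p_items, p_wts = zip(*p.items())
    q_items, q_wts = zip(*q.items())
    hits = 0
    for _ in range(trials):
        x = rng.choices(p_items, weights=p_wts)[0]
        if x not in set(rng.choices(q_items, weights=q_wts, k=n)):
            hits += 1
    return hits / trials
```

Identical distributions with a common support give dissimilarity 0 in the limit of large n, while disjoint supports give dissimilarity 1 for every n.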
Project description:The effective population size ([Formula: see text]) is a major factor determining allele frequency changes in natural and experimental populations. Temporal methods provide a powerful and simple approach to estimate short-term [Formula: see text]. They use allele frequency shifts between temporal samples to calculate the standardized variance, which is directly related to [Formula: see text]. Here we focus on experimental evolution studies that often rely on repeated sequencing of samples in pools (Pool-seq). Pool-seq is cost-effective and often outperforms individual-based sequencing in estimating allele frequencies, but it is associated with atypical sampling properties: in addition to sampling individuals, sequencing DNA in pools leads to a second round of sampling, which increases the variance of allele frequency estimates. We propose a new estimator of [Formula: see text] that relies on allele frequency changes in temporal data and corrects for the variance in both sampling steps. In simulations, we obtain accurate [Formula: see text] estimates, as long as the drift variance is not too small compared to the sampling and sequencing variance. In addition to genome-wide [Formula: see text] estimates, we extend our method using a recursive partitioning approach to estimate [Formula: see text] locally along the chromosome. Since the type I error is controlled, our method permits the identification of genomic regions that differ significantly in their [Formula: see text] estimates. We present an application to Pool-seq data from experimental evolution with Drosophila and provide recommendations for whole-genome data. The estimator is computationally efficient and available as an R package at https://github.com/ThomasTaus/Nest.
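The two-round sampling correction can be illustrated with a classic Waples-style temporal estimator: compute the standardized variance of allele frequency change across loci, subtract the expected contribution of sampling individuals and (for Pool-seq) of sequencing reads, and convert the remaining drift variance into an effective-size estimate. This is a simplified sketch, not the Nest estimator from the paper; the sample sizes `s0`, `st` and read depths `r0`, `rt` are hypothetical parameters:

```python
def ne_temporal(p_start, p_end, t, s0, st, r0=None, rt=None):
    """Illustrative temporal Ne estimate from per-locus allele
    frequencies at generations 0 and t. Computes a Waples-style
    standardized variance F_c, subtracts the expected contribution of
    sampling s0 and st diploid individuals (and, optionally, of the
    Pool-seq read depths r0 and rt), and returns t / (2 * drift var).
    A simplified sketch, not the paper's Nest estimator."""
    fc, nloci = 0.0, 0
    for p0, pt in zip(p_start, p_end):
        z = (p0 + pt) / 2.0
        if 0.0 < z < 1.0:
            fc += (p0 - pt) ** 2 / (z * (1.0 - z))
            nloci += 1
    fc /= nloci
    correction = 1.0 / (2 * s0) + 1.0 / (2 * st)       # individual sampling
    if r0 is not None and rt is not None:
        correction += 1.0 / (2 * r0) + 1.0 / (2 * rt)  # sequencing round
    return t / (2.0 * (fc - correction))
```

As the abstract notes, the estimate degrades when the drift variance `fc - correction` is small relative to the sampling terms; in that regime the denominator can even turn negative.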
Project description:Group testing, introduced by Dorfman (1943), has been used to reduce costs when estimating the prevalence of a binary characteristic based on a screening test of [Formula: see text] groups that include [Formula: see text] independent individuals in total. If the unknown prevalence is low and the screening test suffers from misclassification, it is also possible to obtain more precise prevalence estimates than those obtained from testing all [Formula: see text] samples separately (Tu et al., 1994). In some applications, the individual binary response corresponds to whether an underlying time-to-event variable [Formula: see text] is less than an observed screening time [Formula: see text], a data structure known as current status data. Given sufficient variation in the observed [Formula: see text] values, it is possible to estimate the distribution function [Formula: see text] of [Formula: see text] nonparametrically, at least at some points in its support, using the pool-adjacent-violators algorithm (Ayer et al., 1955). Here, we consider nonparametric estimation of [Formula: see text] based on group-tested current status data for groups of size [Formula: see text], where the group tests positive if and only if any individual's unobserved [Formula: see text] is less than the corresponding observed [Formula: see text]. We investigate the performance of the group-based estimator as compared to the individual test nonparametric maximum likelihood estimator, and show that the former can be more precise in the presence of misclassification for low values of [Formula: see text]. Potential applications include testing for the presence of various diseases in pooled samples where interest focuses on the age-at-incidence distribution rather than overall prevalence. We apply this estimator to the age-at-incidence curve for hepatitis C infection in a sample of U.S. women who gave birth to a child in 2014, where group assignment is done at random and based on maternal age. We discuss connections to other work in the literature, as well as potential extensions.
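The two ingredients here are standard enough to sketch: the pool-adjacent-violators algorithm produces a monotone fit to the observed test indicators, and because a group of size k tests positive with probability g(c) = 1 − (1 − F(c))^k, the group-level fit can be inverted pointwise. A minimal sketch of the idea (unit weights, no misclassification), not the paper's full estimator:

```python
def pava(y):
    """Pool-adjacent-violators: least-squares non-decreasing fit to y.
    Maintains a stack of (mean, count) blocks, merging any adjacent
    pair whose means violate monotonicity."""
    blocks = []  # each entry: [block mean, block size]
    for yi in y:
        blocks.append([float(yi), 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, c2 = blocks.pop()
            m1, c1 = blocks.pop()
            blocks.append([(m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2])
    return [m for m, c in blocks for _ in range(c)]

def estimate_F_grouped(times, positives, k):
    """Nonparametric estimate of F at the observed screening times from
    group-tested current status data: isotonize the 0/1 group outcomes
    in time order, then invert g(c) = 1 - (1 - F(c))**k. Sketch of the
    inversion idea, ignoring misclassification."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    g = pava([positives[i] for i in order])
    return [(times[i], 1.0 - (1.0 - gi) ** (1.0 / k))
            for i, gi in zip(order, g)]
```

With k = 1 this reduces to the usual individual-level current-status NPMLE at the observation times.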
Project description:Estimating the number of unseen species is an important problem in many scientific endeavors. Its most popular formulation, introduced by Fisher et al. [Fisher RA, Corbet AS, Williams CB (1943) J Animal Ecol 12(1):42-58], uses n samples to predict the number U of hitherto unseen species that would be observed if [Formula: see text] new samples were collected. Of considerable interest is the largest ratio t between the number of new and existing samples for which U can be accurately predicted. In seminal works, Good and Toulmin [Good I, Toulmin G (1956) Biometrika 43(102):45-63] constructed an intriguing estimator that predicts U for all [Formula: see text]. Subsequently, Efron and Thisted [Efron B, Thisted R (1976) Biometrika 63(3):435-447] proposed a modification that empirically predicts U even for some [Formula: see text], but without provable guarantees. We derive a class of estimators that provably predict U all the way up to [Formula: see text]. We also show that this range is the best possible and that the estimator's mean-square error is near optimal for any t. Our approach yields a provable guarantee for the Efron-Thisted estimator and, in addition, a variant with stronger theoretical and experimental performance than existing methodologies on a variety of synthetic and real datasets. The estimators are simple, linear, computationally efficient, and scalable to massive datasets. Their performance guarantees hold uniformly for all distributions, and apply to all four standard sampling models commonly used across various scientific disciplines: multinomial, Poisson, hypergeometric, and Bernoulli product.
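The Good-Toulmin estimator underlying this line of work is a signed sum over the prevalences Φ_i, the number of species observed exactly i times in the sample. A direct transcription, without the smoothing that the later estimators add to tame it for t > 1:

```python
from collections import Counter

def good_toulmin(sample, t):
    """Good-Toulmin prediction of how many unseen species would appear
    in t*n additional samples: U(t) = -sum_i (-t)**i * Phi_i, where
    Phi_i is the number of species observed exactly i times. Unsmoothed,
    so its terms oscillate and blow up for t > 1 -- the failure mode
    that smoothed estimators are designed to fix."""
    phi = Counter(Counter(sample).values())
    return -sum((-t) ** i * count for i, count in phi.items())
```

For t = 1 this is the familiar alternating series Φ1 − Φ2 + Φ3 − …, the same form Efron and Thisted applied to estimate the number of words Shakespeare knew but never used.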
Project description:The advent of high throughput sequencing and genotyping technologies enables the comparison of patterns of polymorphisms at a very large number of markers. While the characterization of genetic structure from individual sequencing data remains expensive for many nonmodel species, it has been shown that sequencing pools of individual DNAs (Pool-seq) represents an attractive and cost-effective alternative. However, analyzing sequence read counts from a DNA pool instead of individual genotypes raises statistical challenges in deriving correct estimates of genetic differentiation. In this article, we provide a method-of-moments estimator of [Formula: see text] for Pool-seq data, based on an analysis-of-variance framework. We show, by means of simulations, that this new estimator is unbiased and outperforms previously proposed estimators. We evaluate the robustness of our estimator to model misspecification, such as sequencing errors and uneven contributions of individual DNAs to the pools. Finally, by reanalyzing published Pool-seq data from different ecotypes of the prickly sculpin Cottus asper, we show how the use of an unbiased [Formula: see text] estimator may call into question the interpretation of population structure drawn from previous analyses.
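To see why a Pool-seq-aware estimator matters, it helps to have a naive baseline: a textbook Hudson-type method-of-moments estimator applied directly to pool read frequencies, treating reads as independent draws and combining loci as a ratio of averages. This ignores the finite number of individuals in each pool, which is exactly the misspecification the paper's ANOVA-based estimator corrects; the sketch below is that naive baseline, not the paper's estimator:

```python
def fst_moment(counts1, depths1, counts2, depths2):
    """Naive Hudson-type moment estimator of Fst between two pools,
    from per-locus reference-allele read counts and total read depths.
    Numerator and denominator are each averaged over loci before taking
    the ratio. Treating reads as independent samples ignores pool size,
    so this is an illustrative baseline only."""
    num = den = 0.0
    for c1, n1, c2, n2 in zip(counts1, depths1, counts2, depths2):
        p1, p2 = c1 / n1, c2 / n2
        num += ((p1 - p2) ** 2
                - p1 * (1 - p1) / (n1 - 1)
                - p2 * (1 - p2) / (n2 - 1))
        den += p1 * (1 - p2) + p2 * (1 - p1)
    return num / den
```

At a locus with no differentiation the numerator's sampling corrections drive the estimate toward zero (it can go slightly negative), while strongly diverged frequencies push it toward one.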
Project description:We draw a random subset of [Formula: see text] rows from a frame with [Formula: see text] rows (vectors) and [Formula: see text] columns (dimensions), where [Formula: see text] and [Formula: see text] are proportional to [Formula: see text]. For a variety of important deterministic equiangular tight frames (ETFs) and tight non-ETFs, we consider the distribution of singular values of the [Formula: see text]-subset matrix. We observe that, for large [Formula: see text], they can be precisely described by a known probability distribution: Wachter's MANOVA (multivariate ANOVA) spectral distribution, a phenomenon that was previously known only for two types of random frames. In terms of convergence to this limit, the [Formula: see text]-subset matrix from all of these frames is shown to be empirically indistinguishable from the classical MANOVA (Jacobi) random matrix ensemble. Thus, empirically, the MANOVA ensemble offers a universal description of the spectra of randomly selected [Formula: see text] subframes, even those taken from deterministic frames. The same universality phenomenon is shown to hold for notable random frames as well. This description enables exact calculations of properties of solutions for systems of linear equations based on a random choice of [Formula: see text] frame vectors out of [Formula: see text] possible vectors and has a variety of implications for erasure coding, compressed sensing, and sparse recovery. When the aspect ratio [Formula: see text] is small, the MANOVA spectrum tends to the well-known Marčenko-Pastur distribution of the singular values of a Gaussian matrix, in agreement with previous work on highly redundant frames. Our results are empirical, but they are exhaustive, precise, and fully reproducible.
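The basic experiment is easy to reproduce in miniature: build a tight frame, draw a random subset of its rows, and look at the singular values of the submatrix. For simplicity the sketch below uses a Haar-random frame (the first d columns of a random orthogonal matrix, rescaled) as a stand-in for the deterministic ETFs studied in the paper:

```python
import numpy as np

def subframe_singular_values(n, d, k, seed=0):
    """Singular values of a random k-row subset of an n x d tight
    frame. The frame here is the first d columns of a Haar-random
    orthogonal matrix, rescaled so rows have unit norm on average --
    a simple stand-in for the deterministic frames in the paper, whose
    k-subset spectra the MANOVA law is observed to describe."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    frame = q[:, :d] * np.sqrt(n / d)   # n x d, orthonormal columns scaled
    rows = rng.choice(n, size=k, replace=False)
    return np.linalg.svd(frame[rows], compute_uv=False)
```

For moderate sizes (say n = 2000, d = 1000, k = 1000), a histogram of these singular values can be compared against the MANOVA spectral density. Because the full frame's columns are orthonormal up to the scale factor, every subset's singular values are capped at sqrt(n/d).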
Project description:Metagenomics sequencing provides deep insights into microbial communities. To investigate their taxonomic structure, binning assembled contigs into discrete clusters is critical. Many binning algorithms have been developed, but their performance is not always satisfactory, especially for complex microbial communities, calling for further development. According to previous studies, relative sequence compositions are similar across different regions of the same genome, but they differ between distinct genomes. Generally, current tools have used the normalized frequency of k-tuples directly, but this represents an absolute, not relative, sequence composition. Therefore, we attempted to model contigs using relative k-tuple composition, followed by measuring dissimilarity between contigs using [Formula: see text]. The [Formula: see text] was designed to measure the dissimilarity between two long sequences or Next-Generation Sequencing data using Markov models of the background genomes. This method was effective in revealing group and gradient relationships between genomes, metagenomes and metatranscriptomes. With many binning tools available, we do not try to bin contigs from scratch. Instead, we developed [Formula: see text] to adjust contigs among bins based on the output of existing binning tools for a single metagenomic sample. The tool is taxonomy-free and depends only on k-tuples. To evaluate the performance of [Formula: see text], five widely used binning tools with different strategies of sequence composition or the hybrid of sequence composition and abundance were selected to bin six synthetic and real datasets, after which [Formula: see text] was applied to adjust the binning results. Our experiments showed that [Formula: see text] consistently achieves the best performance with tuple length k = 6 under the independent identically distributed (i.i.d.) background model. Using the metrics of recall, precision and ARI (Adjusted Rand Index), [Formula: see text] improves the binning performance in 28 out of 30 testing experiments (6 datasets with 5 binning tools). The [Formula: see text] is available at https://github.com/kunWangkun/d2SBin. Experiments showed that [Formula: see text] accurately measures the dissimilarity between contigs of metagenomic reads and that relative sequence composition is more reasonable to bin the contigs. The [Formula: see text] can be applied to any existing contig-binning tools for single metagenomic samples to obtain better binning results.
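The core idea, comparing contigs by their k-tuple counts after centering each count by its expectation under a background model, can be sketched compactly. The version below centers counts under an i.i.d. background estimated from pooled single-letter frequencies and returns half of one minus the cosine similarity of the centered vectors; it is a simplified dissimilarity in the spirit of the measure used here, not the exact formula from the d2SBin tool:

```python
from collections import Counter
from math import sqrt

def kmer_counts(seq, k):
    """Counts of all overlapping k-tuples in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def centered_dissimilarity(seq1, seq2, k=6):
    """Simplified relative-composition dissimilarity: center each
    k-tuple count by its expected count under an i.i.d. background
    (letter frequencies pooled from both sequences), then return
    0.5 * (1 - cosine similarity) of the centered vectors. A sketch
    in the spirit of the paper's measure, not its exact formula."""
    c1, c2 = kmer_counts(seq1, k), kmer_counts(seq2, k)
    n1, n2 = sum(c1.values()), sum(c2.values())
    letters = Counter(seq1) + Counter(seq2)
    total = sum(letters.values())
    p = {a: c / total for a, c in letters.items()}
    def expect(w):
        e = 1.0
        for a in w:
            e *= p.get(a, 0.0)
        return e
    words = set(c1) | set(c2)
    x = {w: c1.get(w, 0) - n1 * expect(w) for w in words}
    y = {w: c2.get(w, 0) - n2 * expect(w) for w in words}
    dot = sum(x[w] * y[w] for w in words)
    nx = sqrt(sum(v * v for v in x.values()))
    ny = sqrt(sum(v * v for v in y.values()))
    return 0.5 * (1.0 - dot / (nx * ny))
```

Centering is what makes the composition "relative": two contigs from the same genome share deviations from the background, even when their absolute k-tuple frequencies differ.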
Project description:The fixation index F(st) plays a central role in ecological and evolutionary genetic studies. The estimators of Wright ([Formula: see text]), Weir and Cockerham ([Formula: see text]), and Hudson et al. ([Formula: see text]) are widely used to measure genetic differences among different populations, but all have limitations. We propose a minimum variance estimator [Formula: see text] using [Formula: see text] and [Formula: see text]. We tested [Formula: see text] in simulations and applied it to 120 unrelated East African individuals from Ethiopia and 11 subpopulations in HapMap 3 with 464,642 SNPs. Our simulation study showed that [Formula: see text] has smaller bias than [Formula: see text] for small sample sizes and smaller bias than [Formula: see text] for large sample sizes. Also, [Formula: see text] has smaller variance than [Formula: see text] for small F(st) values and smaller variance than [Formula: see text] for large F(st) values. We demonstrated that approximately 30 subpopulations and 30 individuals per subpopulation are required in order to accurately estimate F(st).
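The generic step behind a minimum variance combination of two unbiased estimators is standard: weight them by w and 1 − w, with w chosen to minimize the variance of the combination. The paper builds its F(st) estimator along these lines from the Weir-Cockerham and Hudson estimators; the sketch below is only that generic textbook step, with variances and covariance supplied by the caller:

```python
def min_variance_combination(a, b, var_a, var_b, cov_ab=0.0):
    """Optimal linear combination w*a + (1-w)*b of two unbiased
    estimators of the same quantity. The variance-minimizing weight is
    w = (var_b - cov_ab) / (var_a + var_b - 2*cov_ab); the combined
    variance is never larger than min(var_a, var_b). Generic step only,
    not the paper's specific Fst estimator."""
    w = (var_b - cov_ab) / (var_a + var_b - 2.0 * cov_ab)
    combined = w * a + (1.0 - w) * b
    var = (w ** 2 * var_a + (1.0 - w) ** 2 * var_b
           + 2.0 * w * (1.0 - w) * cov_ab)
    return combined, var
```

With equal variances and no covariance the weights reduce to 1/2 each and the combined variance is halved; as one estimator becomes noisier, its weight shrinks toward zero.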
Project description:Gene diversity, or expected heterozygosity (H), is a common statistic for assessing genetic variation within populations. Estimation of this statistic decreases in accuracy and precision when individuals are related or inbred, due to increased dependence among allele copies in the sample. The original unbiased estimator of expected heterozygosity underestimates true population diversity in samples containing relatives, as it only accounts for sample size. More recently, a general unbiased estimator of expected heterozygosity was developed that explicitly accounts for related and inbred individuals in samples. Though unbiased, this estimator's variance is greater than that of the original estimator. To address this issue, we introduce a general unbiased estimator of gene diversity for samples containing related or inbred individuals, which employs the best linear unbiased estimator of allele frequencies, rather than the commonly used sample proportion. We examine the properties of this estimator, [Formula: see text], relative to alternative estimators using simulations and theoretical predictions, and show that it typically achieves the smallest mean squared error among the alternatives. Further, we empirically assess the performance of [Formula: see text] on a global human microsatellite dataset of 5795 individuals, from 267 populations, genotyped at 645 loci. Additionally, we show that the improved variance of [Formula: see text] leads to improved estimates of the population differentiation statistic, [Formula: see text], which employs measures of gene diversity within its calculation. Finally, we provide an R script, BestHet, to compute this estimator from genomic and pedigree data.
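The best linear unbiased estimator (BLUE) of an allele frequency under a known kinship structure is a generalized-least-squares average: individuals are weighted by the row sums of the inverse kinship matrix, normalized to one, so that correlated relatives are jointly downweighted. A sketch of that BLUE step only, not the paper's full heterozygosity estimator:

```python
import numpy as np

def blue_allele_freq(freqs, kinship):
    """Best linear unbiased estimator of an allele frequency from the
    per-individual sample frequencies `freqs`, given the kinship
    (correlation) matrix of the sampled individuals. GLS weights are
    the row sums of the inverse kinship matrix, normalized to sum to
    one; with unrelated, non-inbred individuals (identity kinship) the
    estimator reduces to the ordinary sample proportion. Sketch of the
    BLUE step the paper's estimator builds on, not the full estimator."""
    kinv = np.linalg.inv(np.asarray(kinship, dtype=float))
    ones = np.ones(len(freqs))
    weights = kinv @ ones / (ones @ kinv @ ones)
    return float(weights @ np.asarray(freqs, dtype=float))
```

A pair of relatives thus contributes less than two independent individuals would, which is precisely what shrinks the variance of diversity statistics built on these frequency estimates.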
Project description:A motif in a network is a connected graph that occurs significantly more frequently as an induced subgraph than would be expected in a similar randomized network. By virtue of being atypical, it is thought that motifs might play a more important role than arbitrary subgraphs. Recently, a flurry of advances in the study of network motifs has created demand for faster computational means for identifying motifs in increasingly larger networks. Motif detection is typically performed by enumerating subgraphs in an input network and in an ensemble of comparison networks; this poses a significant computational problem. Classifying the subgraphs encountered, for instance, is typically performed using a graph canonical labeling package, such as Nauty, which may be called billions of times. In this article, we describe an implementation of a network motif detection package, which we call NetMODE. NetMODE can only perform motif detection for [Formula: see text]-node subgraphs when [Formula: see text], but does so without the use of Nauty. To avoid using Nauty, NetMODE has an initial pretreatment phase, where [Formula: see text]-node graph data is stored in memory ([Formula: see text]). For [Formula: see text], we take a novel approach, which relates to the Reconstruction Conjecture for directed graphs. We find that NetMODE can perform up to around [Formula: see text] times faster than its predecessors when [Formula: see text] and up to around [Formula: see text] times faster when [Formula: see text] (the exact improvement varies considerably). NetMODE also (a) includes a method for generating comparison graphs uniformly at random, (b) can interface with external packages (e.g. R), and (c) can utilize multi-core architectures. NetMODE is available from netmode.sf.net.
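The enumeration-and-classification step can be sketched for 3-node subgraphs: enumerate connected triples, and classify each induced subgraph by a brute-force canonical code (the smallest adjacency bit-pattern over all orderings of the three nodes). The brute force is feasible precisely because the subgraphs are tiny, the same reason a pretreatment table over all small graphs can replace repeated canonical-labeling calls. An illustrative sketch, not NetMODE's implementation:

```python
from itertools import combinations, permutations

def canonical_code(trio, adj):
    """Smallest adjacency bit-pattern over all 3! orderings of the
    trio: a brute-force canonical form for directed 3-node graphs."""
    return min(
        tuple(1 if (p[i], p[j]) in adj else 0
              for i in range(3) for j in range(3) if i != j)
        for p in permutations(trio)
    )

def triad_census(edges):
    """Count each connected 3-node induced subgraph class of a
    directed graph, given as a list of (u, v) edges. Enumeration step
    only; motif detection would compare these counts against an
    ensemble of randomized networks."""
    adj = set(edges)
    nodes = sorted({u for e in edges for u in e})
    census = {}
    for trio in combinations(nodes, 3):
        undirected = {frozenset(p) for p in permutations(trio, 2) if p in adj}
        if len(undirected) < 2:  # 3 nodes need >= 2 edges to be connected
            continue
        code = canonical_code(trio, adj)
        census[code] = census.get(code, 0) + 1
    return census
```

Isomorphic triads collapse onto the same code, so, for example, a feed-forward loop and a 3-cycle land in different classes while two disjoint directed paths land in the same one.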
Project description:We propose a new method to assess the merit of any set of scientific papers in a given field based on the citations they receive. Given a field and a citation impact indicator, such as the mean citation or the [Formula: see text]-index, the merit of a given set of [Formula: see text] articles is identified with the probability that a randomly drawn set of [Formula: see text] articles from a given pool of articles in that field has a lower citation impact according to the indicator in question. The method allows for comparisons between sets of articles of different sizes and fields. Using a dataset acquired from Thomson Scientific that contains the articles published in the periodical literature in the period 1998-2007, we show that the novel approach yields rankings of research units different from those obtained by a direct application of the mean citation or the [Formula: see text]-index.
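The percentile idea above is straightforward to approximate by Monte Carlo: draw equal-size sets from the field's pool of citation counts and record how often they score below the set under evaluation. Mean citation is used as the default indicator; any indicator function (e.g. an h-index) could be passed in. A sketch of the approach, with hypothetical inputs:

```python
import random

def merit(article_citations, field_pool, indicator=None,
          trials=5000, seed=0):
    """Merit of a set of articles under a citation impact indicator:
    the probability that a randomly drawn, equally sized set of
    citation counts from the field's pool scores strictly lower.
    Monte Carlo sketch of the percentile approach; `field_pool` is a
    hypothetical list of per-article citation counts for the field."""
    if indicator is None:
        indicator = lambda xs: sum(xs) / len(xs)
    rng = random.Random(seed)
    n = len(article_citations)
    score = indicator(article_citations)
    lower = sum(indicator(rng.sample(field_pool, n)) < score
                for _ in range(trials))
    return lower / trials
```

Because the output is a probability, sets of different sizes, and even from different fields, become directly comparable, which is the method's main appeal over raw indicator values.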