Project description: Background: Statistical inference based on small datasets, commonly found in precision oncology, is subject to low power and high uncertainty. In these settings, drawing strong conclusions about future research utility is difficult when using standard inferential measures. It is therefore important to better quantify the uncertainty associated with both significant and non-significant results based on small sample sizes. Methods: We developed a new method, Bayesian Additional Evidence (BAE), that determines (1) how much additional supportive evidence is needed for a non-significant result to reach Bayesian posterior credibility, or (2) how much additional opposing evidence is needed to render a significant result non-credible. Although rooted in Bayesian analysis, the method does not require specifying a prior distribution; instead, the tipping-point output is compared to reasonable effect ranges to draw conclusions. We demonstrate our approach in a comparative effectiveness analysis comparing two treatments in a real-world biomarker-defined cohort, and provide guidelines for applying BAE in practice. Results: Our initial comparative effectiveness analysis yields a hazard ratio of 0.31 with 95% confidence interval (0.09, 1.1). Applying BAE to this result gives a tipping point of 0.54; thus, an observed hazard ratio of 0.54 or smaller in a replication study would result in posterior credibility for the treatment association. Given that effect sizes in this range are not extreme, and that supportive evidence exists from a similar published study, we conclude that this problem is worthy of further research. Conclusions: Our proposed method provides a useful framework for interpreting analytic results from small datasets. It can assist researchers in deciding how to interpret and continue their investigations based on an initial analysis that has high uncertainty. Although we illustrate its use in estimating parameters based on time-to-event outcomes, BAE applies readily to any normally distributed estimator, such as those used for analyzing binary or continuous outcomes.
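The tipping-point idea above can be made concrete with a small calculation. The sketch below is a minimal illustration, not the authors' exact BAE procedure: it assumes the log hazard ratio estimator is approximately normal, that a hypothetical replication study has the same standard error as the original analysis, and that the two estimates are pooled by inverse-variance weighting under a flat prior. The function name and the pooling rule are our own assumptions; only the reported HR of 0.31 and CI (0.09, 1.1) come from the description. Under these assumptions the computed value lands near the reported tipping point of 0.54.

```python
import numpy as np
from scipy.stats import norm

def tipping_point_hr(hr, ci_low, ci_high, level=0.95):
    """Hedged sketch: smallest replication HR (same SE as the original study,
    inverse-variance pooling on the log scale, flat prior) at which the pooled
    upper credible limit falls below 1."""
    z = norm.ppf(1 - (1 - level) / 2)
    b1 = np.log(hr)                                     # observed log hazard ratio
    se = (np.log(ci_high) - np.log(ci_low)) / (2 * z)   # SE recovered from the CI
    se_pooled = se / np.sqrt(2)                         # equal-precision pooling of two studies
    # Require (b1 + b2)/2 + z * se_pooled <= 0, i.e. b2 <= -2*z*se_pooled - b1
    b2_max = -2 * z * se_pooled - b1
    return np.exp(b2_max)

print(round(tipping_point_hr(0.31, 0.09, 1.1), 2))  # ~0.55, close to the reported 0.54
```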
Project description: Recent advances in molecular simulations allow the evaluation of previously unattainable observables, such as rate constants for protein folding. However, these calculations are usually computationally expensive, and even significant computing resources may yield only a small number of independent estimates spread over many orders of magnitude. Such small-sample, high "log-variance" data are not readily amenable to analysis using the standard uncertainty (i.e., the "standard error of the mean"), because the resulting confidence intervals can have unphysical negative lower limits. Bootstrapping, a natural alternative guaranteed to yield a confidence interval within the minimum and maximum values, also exhibits a striking systematic bias of the lower confidence limit in log space. As we show, bootstrapping artifactually assigns high probability to improbably low mean values. A second alternative, the Bayesian bootstrap strategy, does not suffer from the same deficit and is more logically consistent with the type of confidence interval desired. The Bayesian bootstrap provides uncertainty intervals that are more reliable than those from the standard bootstrap method, but it must nevertheless be used with caution. Neither standard nor Bayesian bootstrapping can overcome the intrinsic challenge of underestimating the mean from small-size, high log-variance samples. Our conclusions are based on extensive analysis of model distributions and reanalysis of multiple independent atomistic simulations. Although we only analyze rate constants, similar considerations will apply to related calculations, potentially including highly nonlinear averages like the Jarzynski relation.
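As a toy illustration of the two resampling strategies contrasted above, the sketch below computes a standard (percentile) bootstrap and a Bayesian bootstrap (Dirichlet-weighted) interval for the mean of a small, high log-variance sample. The lognormal toy data, sample size, and interval settings are our own assumptions, not the simulation protocol analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for a handful of rate-constant estimates spanning orders of magnitude.
sample = rng.lognormal(mean=0.0, sigma=3.0, size=8)

def standard_bootstrap_ci(x, n_boot=10000, level=0.95):
    """Percentile CI from resampling the data with replacement."""
    means = np.array([rng.choice(x, size=len(x), replace=True).mean()
                      for _ in range(n_boot)])
    return tuple(np.percentile(means, [100 * (1 - level) / 2, 100 * (1 + level) / 2]))

def bayesian_bootstrap_ci(x, n_boot=10000, level=0.95):
    """Percentile CI from Dirichlet-weighted sample means (Rubin's Bayesian bootstrap)."""
    weights = rng.dirichlet(np.ones(len(x)), size=n_boot)
    means = weights @ x
    return tuple(np.percentile(means, [100 * (1 - level) / 2, 100 * (1 + level) / 2]))

print("standard bootstrap CI :", standard_bootstrap_ci(sample))
print("Bayesian bootstrap CI :", bayesian_bootstrap_ci(sample))
```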
Project description: In many experiments, and especially in translational and preclinical research, sample sizes are (very) small. In addition, data designs are often high dimensional, i.e., more dependent observations than independent replications of the trial are observed. The present paper discusses the applicability of max t-test-type statistics (multiple contrast tests) in high-dimensional designs (repeated measures or multivariate) with small sample sizes. A randomization-based approach is developed to approximate the distribution of the maximum statistic. Extensive simulation studies confirm that the new method is particularly suitable for analyzing data sets with small sample sizes. A real data set illustrates the application of the methods.
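The description does not spell out the randomization scheme, so the sketch below is only a generic stand-in for the idea of approximating the null distribution of a maximum of contrast statistics by randomization: a sign-flipping scheme for within-subject differences in a small-sample, many-endpoint design. The data, number of permutations, and the specific contrasts are illustrative assumptions, not the authors' procedure.

```python
import numpy as np

def max_t_randomization_pvalue(diffs, n_perm=5000, seed=1):
    """diffs: (n_subjects, n_endpoints) array of within-subject differences.
    Returns the observed max |t| and a sign-flipping randomization p-value."""
    rng = np.random.default_rng(seed)
    n = diffs.shape[0]

    def max_abs_t(d):
        t = d.mean(axis=0) / (d.std(axis=0, ddof=1) / np.sqrt(n))
        return np.abs(t).max()

    t_obs = max_abs_t(diffs)
    null = np.empty(n_perm)
    for b in range(n_perm):
        signs = rng.choice([-1.0, 1.0], size=(n, 1))   # flip each subject's sign
        null[b] = max_abs_t(diffs * signs)
    p = (1 + np.sum(null >= t_obs)) / (1 + n_perm)
    return t_obs, p

# Example: 12 subjects, 50 endpoints (far fewer subjects than endpoints)
rng = np.random.default_rng(0)
d = rng.normal(loc=0.0, scale=1.0, size=(12, 50))
print(max_t_randomization_pvalue(d))
```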
Project description: When item scores are ordered categorical, categorical omega can be computed from the parameter estimates of a factor analysis model using frequentist estimators such as diagonally weighted least squares. When the sample size is relatively small and thresholds differ across items, using diagonally weighted least squares can yield a substantially biased estimate of categorical omega. In this study, we applied Bayesian estimation methods for computing categorical omega. The simulation study investigated the performance of categorical omega under a variety of conditions by manipulating the scale length, number of response categories, distributions of the categorical variables, heterogeneity of thresholds across items, and prior distributions for model parameters. The Bayes estimator appears to be a promising method for estimating categorical omega. Mplus and SAS code for computing categorical omega is provided.
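To show what "computing categorical omega from factor model estimates" involves, the sketch below implements the plug-in step following the nonlinear SEM reliability formula commonly attributed to Green and Yang (2009), given loadings and thresholds from a one-factor ordinal model with uncorrelated residuals. The parameter values are made up, and the sketch does not perform the Bayesian (or DWLS) estimation discussed above, for which the authors provide Mplus and SAS code.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def bivariate_cdf(a, b, rho):
    """P(Z1 <= a, Z2 <= b) for a standard bivariate normal with correlation rho."""
    if rho >= 0.999:                       # degenerate case: perfectly correlated
        return norm.cdf(min(a, b))
    return multivariate_normal(mean=[0.0, 0.0],
                               cov=[[1.0, rho], [rho, 1.0]]).cdf([a, b])

def ordinal_cov(tau_i, tau_j, rho):
    """Covariance of two observed ordinal scores (number of thresholds exceeded)
    implied by underlying standard normal variables with correlation rho."""
    return sum(bivariate_cdf(a, b, rho) - norm.cdf(a) * norm.cdf(b)
               for a in tau_i for b in tau_j)

def categorical_omega(loadings, thresholds):
    """Categorical omega for a one-factor model with uncorrelated residuals."""
    k = len(loadings)
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            rho_true = loadings[i] * loadings[j]                   # common-factor part only
            rho_total = 1.0 if i == j else loadings[i] * loadings[j]
            num += ordinal_cov(thresholds[i], thresholds[j], rho_true)
            den += ordinal_cov(thresholds[i], thresholds[j], rho_total)
    return num / den

# Hypothetical estimates: 4 items, 4 response categories (3 thresholds each).
loadings = [0.7, 0.6, 0.8, 0.65]
thresholds = [[-1.0, 0.0, 1.0]] * 4
print(round(categorical_omega(loadings, thresholds), 3))
```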
Project description: In omics experiments, variable selection involves a large number of metabolites/genes and a small number of samples (the n < p problem). The ultimate goal is often the identification of one, or a few, features that differ among conditions: a biomarker. Complicating biomarker identification, the p variables often contain a correlation structure due to the biology of the experiment, making it difficult to separate causal compounds from merely correlated ones. Additionally, there may be elements in the experimental design (blocks, batches) that introduce structure in the data. While this problem has been discussed in the literature and various strategies proposed, the overfitting problems concomitant with such approaches are rarely acknowledged. Instead of viewing a single omics experiment as a definitive test for a biomarker, an unrealistic analytical goal, we propose to view such studies as screening studies whose goal is to reduce the number of features carried into a second round of testing while limiting the Type II error. Using this perspective, the performance of LASSO, ridge regression, and the Elastic Net was compared with that of ANOVA via a simulation study and two real data comparisons. Interestingly, a dramatic increase in the number of features had no effect on the Type I error of the ANOVA approach. ANOVA, even without multiple-test correction, has a low false positive rate in the scenarios tested. The Elastic Net has an inflated Type I error (from 10 to 50%) for small numbers of features, which increases with sample size. The Type II error rate for ANOVA is comparable to or lower than that of the Elastic Net, leading us to conclude that ANOVA is an effective analytical tool for the initial screening of features in omics experiments.
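The screening comparison described above can be mimicked on simulated data. The sketch below is our own minimal illustration, not the paper's simulation design: per-feature one-way ANOVA F-tests versus an elastic-net-penalized logistic regression on an n < p dataset in which only a handful of features truly differ between conditions; all sample sizes, effect sizes, and penalty settings are assumptions.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_per_group, p, n_true = 10, 500, 5          # n < p screening scenario
X = rng.normal(size=(2 * n_per_group, p))
y = np.repeat([0, 1], n_per_group)
X[y == 1, :n_true] += 1.5                    # only the first n_true features truly differ

# Per-feature ANOVA screening (with two groups this is equivalent to a t-test).
pvals = np.array([f_oneway(X[y == 0, j], X[y == 1, j]).pvalue for j in range(p)])
anova_hits = np.flatnonzero(pvals < 0.05)

# Elastic net screening: features with nonzero penalized coefficients are retained.
enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, C=0.5, max_iter=10000).fit(X, y)
enet_hits = np.flatnonzero(enet.coef_.ravel() != 0)

for name, hits in [("ANOVA", anova_hits), ("Elastic net", enet_hits)]:
    tp = np.sum(hits < n_true)               # true positives among the selected features
    fp = len(hits) - tp
    print(f"{name}: selected {len(hits)} features ({tp} true, {fp} false)")
```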
Project description: Background: Targeted therapies have greatly improved cancer patient prognosis. For instance, chronic myeloid leukemia is now well treated with imatinib, a tyrosine kinase inhibitor, and around 80% of patients reach complete remission. However, despite its great efficacy, some patients are resistant to the drug. This heterogeneity in response might be associated with pharmacokinetic parameters that vary between individuals because of genetic variants. To assess this issue, next-generation sequencing of large panels of genes can be performed on patient samples. However, a common problem in pharmacogenetic studies is the limited availability of samples. As a result, large sequencing datasets are obtained from small sample sizes, and classical statistical analyses cannot be applied to identify interesting targets. To overcome this concern, we describe here original and underused statistical methods to analyze large sequencing data from a restricted number of samples. Results: To evaluate the relevance of our method, 48 genes involved in pharmacokinetics were sequenced by next-generation sequencing in 24 chronic myeloid leukemia patients, either sensitive or resistant to imatinib treatment. Using a graphical representation, a reduced list of 115 candidates was obtained from the 708 identified polymorphisms. Then, by analyzing each gene and the distribution of variant alleles, several candidates were highlighted, such as UGT1A9, PTPN22, and ERCC5. These genes had already been associated with the transport, the metabolism, and even the sensitivity to imatinib in previous studies. Conclusions: These tests are useful alternatives to classical inferential statistics, which are not applicable to next-generation sequencing experiments performed on small sample sizes. These approaches make it possible to reduce the number of targets and to find good candidates for further treatment sensitivity studies.
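The description does not specify the statistics behind the candidate reduction, so the sketch below is only a generic stand-in for the general task: narrowing a variant list by comparing variant allele frequencies between two small patient groups. The genotype matrix, the 12/12 sensitive/resistant split, and the ranking rule are all illustrative assumptions; only the counts of 708 polymorphisms, 115 candidates, and 24 patients come from the description.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_variants, n_sensitive, n_resistant = 708, 12, 12   # 708 variants, 24 patients (split assumed)

# Toy genotype matrix: variant allele counts (0/1/2) per variant and patient.
geno = pd.DataFrame(rng.integers(0, 3, size=(n_variants, n_sensitive + n_resistant)),
                    index=[f"var{i}" for i in range(n_variants)])
group = np.array(["sensitive"] * n_sensitive + ["resistant"] * n_resistant)

# Variant allele frequency per group (diploid genotypes, hence the division by 2),
# and the absolute frequency difference between groups.
vaf = geno.T.groupby(group).mean().T / 2
vaf["delta"] = (vaf["sensitive"] - vaf["resistant"]).abs()

# Keep the variants whose allele frequencies differ most between groups.
shortlist = vaf.sort_values("delta", ascending=False).head(115)
print(shortlist.head())
```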
Project description: Motivation: Network-based analyses of high-throughput genomics data provide a holistic, systems-level understanding of various biological mechanisms for a common population. However, when estimating multiple networks across heterogeneous sub-populations, varying sample sizes pose a challenge in estimation and inference, as network differences may be driven by differences in power. We are particularly interested in addressing this challenge in the context of proteomic networks for related cancers, as the number of subjects available for rare cancer (sub-)types is often limited. Results: We develop NExUS (Network Estimation across Unequal Sample sizes), a Bayesian method that enables joint learning of multiple networks while avoiding an artefactual relationship between sample size and network sparsity. We demonstrate through simulations that NExUS outperforms existing network estimation methods in this context, and apply it to learn network similarity and shared pathway activity for groups of cancers with related origins represented in The Cancer Genome Atlas (TCGA) proteomic data. Availability and implementation: The NExUS source code is freely available for download at https://github.com/priyamdas2/NExUS. Supplementary information: Supplementary data are available at Bioinformatics online.
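NExUS itself is available at the GitHub link above. To illustrate the problem it addresses, without reproducing its Bayesian joint model, the sketch below fits a separate graphical lasso per group with the same penalty and shows how naive per-group estimation can yield different edge counts even when the underlying network is identical, simply because the groups have different sample sizes. The toy precision structure, group sizes, and penalty are our own assumptions.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
p = 20
# One shared ground-truth precision structure (tridiagonal) for both groups.
A = np.eye(p) + 0.25 * np.diag(np.ones(p - 1), 1) + 0.25 * np.diag(np.ones(p - 1), -1)
cov = np.linalg.inv(A)

for name, n in [("common cancer", 300), ("rare cancer", 30)]:
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    prec = GraphicalLasso(alpha=0.1).fit(X).precision_
    edges = np.sum(np.abs(prec[np.triu_indices(p, k=1)]) > 1e-6)
    print(f"{name:13s} (n={n:3d}): {edges} edges detected")
```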
Project description:LC-HRMS experiments detect thousands of compounds, with only a small fraction of them identified in most studies. Traditional data processing pipelines contain an alignment step to assemble the measurements of overlapping features across samples into a unified table. However, data sets acquired under nonidentical conditions are not amenable to this process, mostly due to significant alterations in chromatographic retention times. Alignment of features between disparately acquired LC-MS metabolomics data could aid collaborative compound identification efforts and enable meta-analyses of expanded data sets. Here, we describe metabCombiner, a new computational pipeline for matching known and unknown features in a pair of untargeted LC-MS data sets and concatenating their abundances into a combined table of intersecting feature measurements. metabCombiner groups features by mass-to-charge (m/z) values to generate a search space of possible feature pair alignments, fits a spline through a set of selected retention time ordered pairs, and ranks alignments by m/z, mapped retention time, and relative abundance similarity. We evaluated this workflow on a pair of plasma metabolomics data sets acquired with different gradient elution methods, achieving a mean absolute retention time prediction error of roughly 0.06 min and a weighted per-compound matching accuracy of approximately 90%. We further demonstrate the utility of this method by comprehensively mapping features in urine and muscle metabolomics data sets acquired from different laboratories. metabCombiner has the potential to bridge the gap between otherwise incompatible metabolomics data sets and is available as an R package at https://github.com/hhabra/metabCombiner and Bioconductor.
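metabCombiner is an R package (links above); to make the matching idea concrete, the sketch below is a stripped-down Python analogue of its core steps as summarized in the description: group features across two data sets by m/z tolerance, fit a smooth retention time mapping through matched anchors, and score candidate pairs by m/z, mapped retention time, and abundance similarity. All tolerances, weights, and toy data here are illustrative assumptions, not the package's actual defaults or implementation.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)

# Toy feature tables: columns are (m/z, retention time in minutes, median abundance).
mz = np.sort(rng.uniform(100, 1000, size=200))
rt_x = rng.uniform(1, 20, size=200)
rt_y = 0.8 * rt_x + 1.5 + rng.normal(0, 0.05, size=200)   # different gradient in data set Y
abund = rng.lognormal(10, 1, size=200)
data_x = np.column_stack([mz, rt_x, abund])
data_y = np.column_stack([mz, rt_y, abund * rng.lognormal(0, 0.1, size=200)])

# 1) Group features across the two data sets by m/z tolerance (5 ppm is an assumption).
def mz_matches(x, y, ppm=5.0):
    pairs = []
    for i in range(len(x)):
        hits = np.flatnonzero(np.abs(y[:, 0] - x[i, 0]) / x[i, 0] * 1e6 <= ppm)
        pairs.extend((i, j) for j in hits)
    return pairs

pairs = mz_matches(data_x, data_y)

# 2) Fit a smooth retention time mapping (X -> Y) through unambiguous m/z anchors.
anchors = [(i, j) for i, j in pairs
           if sum(1 for a, _ in pairs if a == i) == 1]
anchors.sort(key=lambda ij: data_x[ij[0], 1])
xs = np.array([data_x[i, 1] for i, _ in anchors])
ys = np.array([data_y[j, 1] for _, j in anchors])
rt_map = UnivariateSpline(xs, ys, s=len(anchors))

# 3) Score every candidate pair by m/z, mapped retention time, and abundance similarity.
def score(i, j, w=(0.5, 0.35, 0.15)):        # weights are illustrative, not package defaults
    d_mz = abs(data_x[i, 0] - data_y[j, 0]) / data_x[i, 0] * 1e6 / 5.0
    d_rt = abs(float(rt_map(data_x[i, 1])) - data_y[j, 1]) / 0.5
    d_ab = abs(np.log10(data_x[i, 2]) - np.log10(data_y[j, 2]))
    return w[0] * d_mz + w[1] * d_rt + w[2] * d_ab   # lower score = better match

best = min(pairs, key=lambda ij: score(*ij))
print("best-scoring pair (X index, Y index):", best, "score:", round(float(score(*best)), 3))
```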
Project description: This study investigates the performance of robust ML estimators when fitting and evaluating small-sample latent growth models (LGMs) with non-normal missing data. Results showed that the robust ML methods can account for non-normality even when the sample size is very small (e.g., N < 100). Among the robust ML estimators, MLR was the optimal choice, as it was found to be robust to both non-normality and missing data while also yielding more accurate standard error estimates and growth parameter coverage. However, MLMV produced the most accurate p values for the chi-square test statistic under the conditions studied. Regarding goodness of fit, as sample size decreased, all three fit indices studied (CFI, RMSEA, and SRMR) indicated worse fit. When the sample size was very small (e.g., N < 60), the fit indices could imply that a proposed model fits poorly when this might not actually be the case in the population.
Project description: Covering: 2014 to 2023 for metabolomics, 2002 to 2023 for information visualization. LC-MS/MS-based untargeted metabolomics is a rapidly developing research field spawning increasing numbers of computational metabolomics tools that assist researchers with their complex data processing, analysis, and interpretation tasks. In this article, we review the entire untargeted metabolomics workflow from the perspective of information visualization, visual analytics, and visual data integration. Data visualization is a crucial step at every stage of the metabolomics workflow, where it provides core capabilities for data inspection, evaluation, and sharing. However, due to the large number of available data analysis tools and corresponding visualization components, it is hard for both users and developers to get an overview of what is already available and which tools are suitable for their analysis. In addition, there is little cross-pollination between the fields of data visualization and metabolomics, leaving visual tools to be designed in a secondary and mostly ad hoc fashion. With this review, we aim to bridge the gap between the fields of untargeted metabolomics and data visualization. First, we introduce data visualization to the untargeted metabolomics field as a topic worthy of its own dedicated research, and provide a primer on cutting-edge data visualization research for both researchers and developers active in metabolomics. We extend this primer with a discussion of best practices for data visualization as they have emerged from visualization studies. Second, we provide a practical roadmap to the visual tool landscape and its use within the untargeted metabolomics field. Here, for several computational analysis stages within the untargeted metabolomics workflow, we provide an overview of commonly used visual strategies with practical examples. In this context, we also outline promising areas for further research and development. We end the review with a set of recommendations for developers and users on how to make the best use of visualizations for more effective and transparent communication of results.