Project description: Sparse feature tables, in which many features are present in very few samples, are common in big biological data (e.g. metagenomics). Ignoring the issues posed by zero-laden datasets can result in biased statistical estimates and decreased power in downstream analyses. Zeros are also a particular problem for compositional data analysis using log-ratios, since the log of zero is undefined. Researchers typically deal with this issue by removing low-frequency features, but the thresholds for removal differ markedly between studies, with little or no justification. Here, we present CurvCut, an unsupervised, data-driven approach with human confirmation for rare-feature removal. CurvCut implements two distinct approaches for determining natural breaks in feature distributions: a curvature-analysis method borrowed from thermodynamics and the Fisher-Jenks statistical method. Our results show that CurvCut rapidly identifies data-specific breaks in these distributions that can serve as cutoff points for low-frequency feature removal while maximizing feature retention. We show that CurvCut works across different biological data types and rapidly generates clear visual results that allow researchers to confirm and apply feature-removal cutoffs to individual datasets.
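The curvature idea can be illustrated with a minimal sketch; this is not CurvCut's actual implementation, and the `elbow_cutoff` helper and example prevalences are hypothetical. The idea: sort features by prevalence and take the point of maximum discrete curvature of the resulting curve as a candidate break.

```python
import numpy as np

def elbow_cutoff(prevalence):
    """Hypothetical helper: locate a natural break in a feature-prevalence
    curve at the point of maximum discrete curvature (the 'elbow')."""
    y = np.sort(np.asarray(prevalence, dtype=float))[::-1]  # rank features
    x = np.arange(len(y), dtype=float)
    dy = np.gradient(y, x)    # discrete first derivative
    d2y = np.gradient(dy, x)  # discrete second derivative
    # Curvature of a plane curve: kappa = |y''| / (1 + y'^2)^(3/2)
    kappa = np.abs(d2y) / (1.0 + dy**2) ** 1.5
    return int(np.argmax(kappa))

# Toy prevalences: a few common features, then a long rare tail
prev = [500, 480, 450, 400, 5, 4, 3, 2, 2, 1]
idx = elbow_cutoff(prev)
```

Features ranked below the returned index would be candidates for removal, subject to the visual confirmation step the abstract describes.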
Project description: Data on the time spent in different exposures, behaviors, and work tasks are common in occupational research. Such data are most often expressed in hours, minutes, or percentage of work time. They are thus constrained, or 'compositional', in that they add up to a finite sum (e.g. 8 h of work or 100% of work time). Because of these properties, compositional data need to be processed and analyzed with specifically adapted methods. Compositional data analysis (CoDA) has become a well-established framework for handling such data in scientific fields such as nutritional epidemiology, geology, and chemistry, but has only recently gained attention in public and occupational health sciences. In this paper, we introduce the reader to CoDA by explaining why CoDA should be used when dealing with compositional time-use data, showing how to perform CoDA, including a worked example, and pointing out some remaining challenges. The paper concludes by emphasizing that CoDA in occupational research is still in its infancy and stressing the need for further development of, and experience with, CoDA for time-based occupational exposures. We hope the paper will encourage researchers to adopt and apply CoDA in studies of work exposures and health.
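Because a composition carries only relative information, CoDA first maps it out of the constrained simplex. A minimal sketch of the centred log-ratio (clr) transform, with a hypothetical workday (the function names and numbers are ours, not from the paper):

```python
import numpy as np

def closure(x):
    """Rescale the parts of a composition to sum to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def clr(x):
    """Centred log-ratio transform: log of each part relative to the
    geometric mean, mapping a composition into unconstrained real space."""
    x = np.asarray(x, dtype=float)
    g = np.exp(np.mean(np.log(x)))  # geometric mean of the parts
    return np.log(x / g)

# Hypothetical 8-h workday (minutes): sitting, standing, walking
day = closure([300, 120, 60])
z = clr(day)
```

clr coordinates always sum to zero, and rescaling the raw minutes (e.g. to proportions) leaves them unchanged, which is exactly the scale invariance that compositional methods require.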
Project description: This manuscript treats the fat compositions of the Manzanilla and Hojiblanca cultivars as compositional data (CoDa) and applies CoDa analysis (CoDA) to investigate the effect of processing and packaging on the fatty acid profiles of these cultivars. To this aim, the fat-component percentages were first explored with CoDA tools and then transformed into ilr (isometric log-ratio) coordinates in Euclidean space, where they were subjected to standard multivariate techniques. The first approach (bar plots of geometric means, tetrahedral plots, compositional biplots, and balance dendrograms) showed that the effect of processing was limited, while most of the variability among the fatty acid (FA) profiles was due to cultivar. Applying standard multivariate methods (canonical variates, Linear Discriminant Analysis (LDA), ANOVA/MANOVA with bootstrapping (n = 1000), and nested General Linear Models (GLM)) to the ilr-transformed data, following Ward's clustering or a descending-order-of-variances criterion, showed effects similar to the exploratory analysis, but also showed that Hojiblanca was more sensitive to fat modifications than Manzanilla. In contrast, associating GLM changes in ilr coordinates with specific fatty acids was not straightforward because some coordinates are complex to interpret. Therefore, according to the CoDA, table olive fatty acid profiles are scarcely affected by Spanish-style processing compared with the differences between cultivars. This work demonstrates that CoDA can be successfully applied to the fatty acid profiles of olive fat and olive oils, and may serve as a model for the statistical analysis of other fats, with the advantage of applying appropriate statistical techniques and preventing misinterpretations.
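The ilr coordinates mentioned above can be built in several ways; below is a minimal sketch using one common choice, pivot (balance) coordinates. The specific balances the manuscript used may differ, and the example composition is hypothetical.

```python
import numpy as np

def ilr(x):
    """Pivot-coordinate ilr transform of a D-part composition:
    z_i = sqrt(i/(i+1)) * ln(g(x_1..x_i) / x_(i+1)), i = 1..D-1,
    where g() is the geometric mean."""
    x = np.asarray(x, dtype=float)
    D = len(x)
    z = np.empty(D - 1)
    for i in range(1, D):
        g = np.exp(np.mean(np.log(x[:i])))  # geometric mean of first i parts
        z[i - 1] = np.sqrt(i / (i + 1)) * np.log(g / x[i])
    return z

# Hypothetical 3-part fatty acid composition (proportions)
z = ilr([0.5, 0.3, 0.2])
```

A D-part composition yields D - 1 ilr coordinates in ordinary Euclidean space, where standard multivariate methods (LDA, MANOVA, GLM) apply; as the abstract notes, mapping effects on individual coordinates back to single fatty acids is the hard part.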
Project description: Analyzing the combined mRNA and miRNA content of a biological sample can help answer several research questions, such as biomarker discovery or mRNA-miRNA interactions. However, the process is costly and time-consuming: separate libraries need to be prepared and sequenced on different flowcells. Combo-Seq is a library prep kit that allows combined mRNA-miRNA libraries to be prepared from very low amounts of total RNA. To date, no dedicated bioinformatics method exists for processing Combo-Seq data. In this paper, we describe CODA (Combo-Seq Data Analysis), a workflow specifically developed for processing Combo-Seq data that employs existing free-to-use tools. We compare CODA with exceRpt, the pipeline suggested by the kit manufacturer for this purpose, and evaluate how Combo-Seq libraries analysed with CODA perform compared with conventional poly(A) and small RNA libraries prepared from the same samples. We show that CODA recovers more successfully trimmed reads than exceRpt, and that the difference is more dramatic with short sequencing reads. We demonstrate that Combo-Seq identifies as many genes as, and fewer miRNAs than, the standard libraries, and that miRNA validation favours conventional small RNA libraries over Combo-Seq. The CODA code is available at https://github.com/marta-nazzari/CODA.
Project description: Background: Information is limited on the benefits of physical activity (PA) in preschoolers, and previous research using accelerometer-assessed PA may be affected by multicollinearity issues. Objectives: This study investigated the cross-sectional and prospective associations of sedentary behaviour (SB) and PA with body composition and physical fitness using compositional data analysis. Methods: Baseline PA and SB were collected in 4-year-olds (n = 315) using wrist-worn GT3X+ accelerometers during seven 24-h periods. Body composition (air-displacement plethysmography) and physical fitness (PREFIT test battery) were assessed at baseline and at the 12-month follow-up. Results: Increasing vigorous PA at the expense of lower-intensity behaviours was associated with body composition and physical fitness both cross-sectionally and longitudinally. For example, reallocating 15 min/day from lower intensities to vigorous PA at baseline was associated with a higher fat-free mass index (+0.45 kg/m2, 95% confidence interval [CI]: 0.18-0.72 kg/m2), higher upper-body strength (+0.6 kg, 95% CI: 0.1-1.19 kg), higher lower-body strength (+8 cm, 95% CI: 3-13 cm), and a shorter time to complete the motor fitness test (-0.4 s, 95% CI: -0.82 to -0.01 s) at the 12-month follow-up. Pairwise reallocations of time indicated that which behaviour was replaced did not matter, as long as vigorous PA increased. Conclusions: More time in vigorous PA may imply short- and long-term benefits for body composition and physical fitness in preschoolers. These findings using compositional data analysis corroborate our previously published results using isotemporal substitution models.
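The reallocation analyses described above keep the 24-h total fixed: time added to vigorous PA must come from other behaviours. A minimal bookkeeping sketch (the study's actual isotemporal models are regression-based; the behaviours and minutes here are hypothetical):

```python
import numpy as np

def reallocate(comp, labels, to, minutes):
    """Move `minutes` into one behaviour, drawing the time proportionally
    from all other behaviours so the daily total is unchanged."""
    comp = np.asarray(comp, dtype=float)
    out = comp.copy()
    i = labels.index(to)
    others = [j for j in range(len(comp)) if j != i]
    out[i] += minutes
    out[others] -= minutes * comp[others] / comp[others].sum()
    return out

labels = ["sleep", "SB", "LPA", "MPA", "VPA"]
day = np.array([600.0, 500.0, 250.0, 80.0, 10.0])  # hypothetical min/day
new = reallocate(day, labels, to="VPA", minutes=15)
```

The reallocated day still sums to 1440 min; in the compositional framework such reallocated compositions are transformed (e.g. to ilr coordinates) and fed into the fitted model to estimate the associated change in the outcome.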
Project description: Purpose: Privacy-protecting analytic and data-sharing methods that minimize the disclosure risk of sensitive information are increasingly important given the growing interest in utilizing data across multiple sources. We conducted a simulation study to examine how avoiding the sharing of individual-level data in a distributed data network can affect analytic results. Methods: The base scenario had four sites of varying sizes with 5% outcome incidence, 50% treatment prevalence, and seven confounders. We varied treatment prevalence, outcome incidence, treatment effect, site size, number of sites, and covariate distribution. Confounding adjustment was conducted using propensity scores or disease risk scores. We compared analyses of three types of aggregate-level data requested from sites, namely risk-set, summary-table, or effect-estimate data (meta-analysis), with benchmark results from analysis of pooled individual-level data. We assessed bias and precision of hazard ratio estimates as well as the accuracy of standard error estimates. Results: All the aggregate-level data-sharing approaches, regardless of confounding adjustment method, successfully approximated the pooled individual-level analysis in most simulation scenarios. Meta-analysis showed minor bias when using inverse probability of treatment weighting (IPTW) in infrequent-exposure (5%), rare-outcome (0.01%), and small-site (5,000 patients) settings. Standard error estimates became less accurate for the IPTW risk-set approach with less frequent exposure, and for the propensity-score-matching meta-analysis approach with rare outcomes. Conclusions: Overall, we found that sharing individual-level data can be avoided while still obtaining valid results in many settings, although care must be taken with the meta-analysis approach in infrequent-exposure and rare-outcome scenarios, particularly when confounding adjustment is performed with IPTW.
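Of the three aggregate-level approaches, the effect-estimate (meta-analysis) route is the simplest to sketch: each site shares only its log hazard ratio and standard error, and the coordinating centre pools them by inverse-variance weighting. All numbers below are hypothetical.

```python
import math

def fixed_effect_meta(log_hrs, ses):
    """Fixed-effect inverse-variance meta-analysis: pooled log hazard
    ratio and its standard error from site-level estimates."""
    weights = [1.0 / se**2 for se in ses]       # precision weights
    pooled = sum(w * b for w, b in zip(weights, log_hrs)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))   # SE of the pooled estimate
    return pooled, pooled_se

# Hypothetical log-HRs and SEs from four sites of varying size
log_hrs = [0.10, 0.25, 0.18, 0.30]
ses = [0.05, 0.10, 0.08, 0.20]
pooled, pooled_se = fixed_effect_meta(log_hrs, ses)
```

No individual-level data leave the sites; as the abstract cautions, this approximation degrades when sites contribute unstable estimates, e.g. with rare outcomes or infrequent exposure.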
Project description: Background: There is no gold standard for body composition measurement in pediatric patients with obesity. The aim of this study was therefore to investigate whether two bioelectrical impedance analysis techniques performed in children and adolescents with obesity differ. Methods: Data were collected at the Department of Pediatrics and Adolescent Medicine in Vienna from September 2015 to May 2017. Body composition was measured with a TANITA scale and BIA-BIACORPUS. Results: In total, 38 children and adolescents (age: 10-18 years, BMI: 25-54 kg/m2) were included. Boys had significantly higher fat-free mass (TANITA p = 0.019, BIA p = 0.003), total body water (TANITA p = 0.020, BIA p = 0.005), and basal metabolic rate (TANITA p = 0.002, BIA p = 0.029). Girls had a significantly higher body fat percentage with BIA (p = 0.001). No significant gender differences in core abdominal area were found. TANITA overestimated body fat percentage (p < 0.001), fat mass (p = 0.002), and basal metabolic rate (p < 0.001) compared with BIA, and underestimated fat-free mass (p = 0.002). The Bland-Altman plot demonstrated low agreement between the body composition methods. Conclusions: Low agreement between the TANITA scale and BIA-BIACORPUS was observed; body composition should therefore always be measured with the same device to obtain comparable results. Given its feasibility, safety, and efficiency in clinical routine, bioelectrical impedance analysis is appropriate for obese pediatric patients. Trial registration: ClinicalTrials NCT02545764. Registered 10 September 2015.
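The Bland-Altman approach mentioned above summarizes agreement by the mean difference (bias) between paired measurements and its 95% limits of agreement. A minimal sketch with hypothetical paired readings (not the study's data):

```python
import statistics

def bland_altman(a, b):
    """Bias (mean difference) and 95% limits of agreement for paired
    measurements of the same subjects by two devices."""
    diffs = [x - y for x, y in zip(a, b)]
    bias = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)  # sample SD of the differences
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Hypothetical body fat % measured by two devices on the same children
device_a = [38.2, 41.0, 35.5, 44.1, 39.8]
device_b = [36.0, 39.5, 34.8, 41.0, 38.1]
bias, (lo, hi) = bland_altman(device_a, device_b)
```

Wide limits of agreement relative to a clinically acceptable difference indicate low agreement, which is why the abstract recommends sticking to a single device for repeated measurements.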
Project description: Multi-compartment body-composition models, which divide the body into its multiple constituents, are the criterion method for measuring body fat percentage, fat mass, and fat-free mass. However, 2- and 3-compartment devices such as air-displacement plethysmography (ADP), DXA, and bioelectrical impedance analysis (BIA) devices are more commonly used. Accurate measurement depends on several assumptions, including constant hydration, body proportions, fat-free body density, and population characteristics. Investigations evaluating body composition in racial and ethnic minorities have observed differences in the aforementioned components between cohorts; consequently, estimates of body composition may not be valid for racial/ethnic minority populations. The purpose of this review was to comprehensively examine the validity of common body-composition devices in multi-ethnic samples (samples including more than one race/ethnicity) and in African-American, Hispanic, Asian, and Native American populations. Based on the literature, DXA produces valid results in multi-ethnic samples, and ADP is valid for Hispanic and African-American males when race-specific equations are used. However, for DXA and ADP there is a need for validity investigations that include larger, more racially diverse samples, specifically Hispanic/Latinx, Asian, and Native American adults and African-American females. Technology has advanced significantly since the initial validity studies were conducted; conclusions are therefore based on outdated models and software. For BIA, body-composition measures may be valid in a multi-ethnic sample, but the literature demonstrates disparate results between races/ethnicities. For BIA and ADP, the majority of studies have used DXA or hydrostatic weighing as the criterion for validity; additional studies using a multi-compartment model as the criterion are essential to evaluate accuracy. Validity studies evaluating more recent technology in larger, more racially/ethnically diverse samples may improve our ability to select the appropriate method to accurately assess body composition in each racial/ethnic population.
Project description: Secondary analyses of survey data collected from large probability samples of persons or establishments further scientific progress in many fields. The complex design features of these samples improve data collection efficiency, but also require analysts to account for them when conducting analysis. Unfortunately, many secondary analysts from fields outside of statistics, biostatistics, and survey methodology do not have adequate training in this area, and as a result may apply incorrect statistical methods when analyzing these survey data sets. This in turn can lead to the publication of incorrect inferences based on the survey data that effectively negate the resources dedicated to these surveys. In this article, we build on the results of a preliminary meta-analysis of 100 peer-reviewed journal articles presenting analyses of data from a variety of national health surveys, which suggested that analytic errors may be extremely prevalent in these types of investigations. We first perform a meta-analysis of a stratified random sample of 145 additional research products analyzing survey data from the Scientists and Engineers Statistical Data System (SESTAT), which describes features of the U.S. science and engineering workforce, and examine trends in the prevalence of analytic error across the decades used to stratify the sample. We once again find that analytic errors appear to be quite prevalent in these studies. Next, we present several example analyses of real SESTAT data and demonstrate that failure to perform these analyses correctly can result in substantially biased estimates with standard errors that do not adequately reflect complex sample design features. Collectively, the results of this investigation suggest that reviewers of this type of research need to pay much closer attention to the analytic methods employed by researchers attempting to publish or present secondary analyses of survey data.
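One concrete failure mode: treating weighted survey data as a simple random sample understates standard errors. A minimal sketch using Kish's effective sample size as a rough correction (real design-based analysis also uses the stratification and clustering information; the data here are hypothetical):

```python
import math

def weighted_mean_se(values, weights):
    """Weighted mean with two standard errors: a naive one that treats the
    data as a simple random sample, and one deflated to Kish's effective
    sample size n_eff = (sum w)^2 / sum w^2 to reflect unequal weights."""
    wsum = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / wsum
    var = sum(w * (v - mean) ** 2 for w, v in zip(weights, values)) / wsum
    n_eff = wsum**2 / sum(w**2 for w in weights)
    se_naive = math.sqrt(var / len(values))  # ignores weight variability
    se_kish = math.sqrt(var / n_eff)         # never smaller than se_naive
    return mean, se_naive, se_kish

values = [1, 2, 3, 4, 5]    # hypothetical survey responses
weights = [1, 1, 1, 5, 5]   # unequal selection weights
mean, se_naive, se_kish = weighted_mean_se(values, weights)
```

Here n_eff falls well below the nominal n, so the naive standard error is too small, precisely the kind of understatement the meta-analysis flags in published analyses.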
Project description: Background & aims: A diagnosis of cirrhosis can be made on the basis of findings from imaging studies, but these are subjective. Analytic morphomics uses computational image-processing algorithms to provide precise and detailed measurements of organs and body tissues. We investigated whether morphomic parameters can be used to identify patients with cirrhosis. Methods: In a retrospective study, we performed analytic morphomics on data from 357 patients evaluated at the University of Michigan from 2004 to 2012 who had a liver biopsy within 6 months of a computed tomography scan for any reason. We used logistic regression with elastic net regularization and cross-validation to develop predictive models for cirrhosis within an 80% randomly selected internal training set. The remaining 20% of the data were used as an internal test set to ensure that model overfitting did not occur. In validation studies, we tested the performance of our models on an external cohort of patients from a different health system. Results: Our predictive models, based on analytic morphomics and demographics (morphomics model) or on analytic morphomics, demographics, and laboratory studies (full model), identified patients with cirrhosis with areas under the receiver operating characteristic curve (AUROC) of 0.91 and 0.90, respectively, compared with 0.69, 0.77, and 0.76 for the aspartate aminotransferase-to-platelet ratio, Lok score, and FIB-4, respectively, on the same data set. In the validation set, the morphomics model identified patients who developed cirrhosis with an AUROC of 0.97, and the full model with an AUROC of 0.90. Conclusions: We used analytic morphomics to demonstrate that cirrhosis can be objectively quantified from medical imaging. In a retrospective analysis of multi-protocol scans, we found that it is possible to identify patients with cirrhosis from preexisting scans, without significant additional risk or cost.
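The AUROC values reported above have a simple probabilistic reading: the chance that a randomly chosen cirrhosis case receives a higher model score than a randomly chosen control. A minimal sketch computing it via the Mann-Whitney pairwise form (scores and labels are hypothetical):

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs in which the
    positive case outscores the negative one, counting ties as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical risk scores with biopsy-confirmed labels (1 = cirrhosis)
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
```

An AUROC of 0.90 means 90% of case-control pairs are correctly ordered, which is why it is a natural benchmark for comparing the morphomics models against the aminotransferase-to-platelet ratio, Lok score, and FIB-4.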