Biological observations in microbiota analysis are robust to the choice of 16S rRNA gene sequencing processing algorithm: case study on human milk microbiota.
ABSTRACT: BACKGROUND:In recent years, the microbiome field has undergone a shift from clustering-based methods of operational taxonomic unit (OTU) designation based on sequence similarity to denoising algorithms that identify exact amplicon sequence variants (ASVs), and methods to identify contaminating bacterial DNA sequences from low biomass samples have been developed. Although these methods improve accuracy when analyzing mock communities, their impact on real samples and downstream analysis of biological associations is less clear. RESULTS:Here, we re-processed our recently published milk microbiota data using Qiime1 to identify OTUs, and Qiime2 to identify ASVs, with or without contaminant removal using decontam. Qiime2 resolved the mock community more accurately, primarily because Qiime1 failed to detect Lactobacillus. Qiime2 also considerably reduced the average number of ASVs detected in human milk samples (364 ± 145 OTUs vs. 170 ± 73 ASVs, p?
Project description:BACKGROUND:Increasing the accuracy of microbiome data analysis requires addressing the technical limitations of existing sequencing machines. Quality trimming has been suggested as a way to reduce the effect of the progressive decrease in sequencing quality with increasing length of the sequenced library. In this study, we examined the effect of the trimming thresholds (0-20 for QIIME1 and 0-30 for QIIME2) on the number of reads that remained after quality control and chimera removal (the "good reads"). We also examined the distance of the analysis results from the gold standard using simulated samples. RESULTS:Quality trimming increased the number of good reads and abundance measurement accuracy in Illumina paired-end reads of the V3-V4 hypervariable region. CONCLUSIONS:Our results suggest that a pre-analysis trimming step should be included before the application of QIIME1 or QIIME2.
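As a rough illustration of what a trimming threshold does to a single read, the sketch below truncates a read at the first base whose Phred score falls below the threshold. It is a minimal sketch only: the function names, the Phred+33 encoding, and the "truncate at first low-quality base" rule are assumptions made for the example and are not taken from the QIIME1 or QIIME2 implementations.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_read(sequence, quality_string, threshold, offset=33):
    """Truncate a read at the first base scoring below `threshold`.

    Hypothetical rule for illustration; real pipelines may use windows
    or runs of low-quality calls instead of a single base.
    """
    scores = phred_scores(quality_string, offset)
    for position, score in enumerate(scores):
        if score < threshold:
            return sequence[:position], quality_string[:position]
    return sequence, quality_string


if __name__ == "__main__":
    # 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
    read, quality = "ACGTACGT", "IIII##II"
    for threshold in (0, 20):
        trimmed, _ = trim_read(read, quality, threshold)
        print(threshold, trimmed)
```

With a threshold of 0 the read is left intact, while a threshold of 20 cuts it at the first low-quality base, which is the trade-off between read length and read quality that the thresholds in the study explore.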
Project description:One of the major methods to identify microbial community composition, to unravel microbial population dynamics, and to explore microbial diversity in environmental samples is high-throughput DNA- or RNA-based 16S rRNA (gene) amplicon sequencing in combination with bioinformatics analyses. However, for environmental samples from contrasting habitats, it has not been systematically evaluated (i) which analysis methods provide results that reflect reality most accurately, (ii) how the interpretations of microbial community studies are biased by different analysis methods and (iii) whether the optimal analysis workflow can be implemented in an easy-to-use pipeline. Here, we compared the performance of 16S rRNA (gene) amplicon sequencing analysis tools (i.e., Mothur, QIIME1, QIIME2, and MEGAN) using three mock datasets with known microbial community composition that differed in sequencing quality, species number and abundance distribution (i.e., even or uneven), and phylogenetic diversity (i.e., closely related or well-separated amplicon sequences). Our results showed that QIIME2 outperformed all other investigated tools in sequence recovery (>10 times fewer false positives), taxonomic assignments (>22% better F-score) and diversity estimates (>5% better assessment), suggesting that this approach is able to reflect the in situ microbial community most accurately. Further analysis of 24 environmental datasets obtained from four contrasting terrestrial and freshwater sites revealed dramatic differences in the resulting microbial community composition for all pipelines at genus level. For instance, at the investigated river water sites Sphaerotilus was only reported when using QIIME1 (8% abundance) and Agitococcus with QIIME1 or QIIME2 (2 or 3% abundance, respectively), but both genera remained undetected when analyzed with Mothur or MEGAN.
Since these abundant taxa probably have implications for important biogeochemical cycles (e.g., nitrate and sulfate reduction) at these sites, their detection and semi-quantitative enumeration is crucial for valid interpretations. A high-performance computing conformant workflow was constructed to allow FAIR (Findable, Accessible, Interoperable, and Re-usable) 16S rRNA (gene) amplicon sequence analysis starting from raw sequence files, using the optimal methods identified in our study. Our presented workflow should be considered for future studies, as it substantially facilitates the analysis of high-throughput 16S rRNA (gene) sequencing data while maximizing reliability and confidence in microbial community data analysis.
Project description:DNA sequencing and analysis methods were compared for 16S rRNA V4 PCR amplicon and genomic DNA (gDNA) mock communities encompassing nine bacterial species commonly found in milk and dairy products. The two communities comprised strain-specific DNA that was pooled before (gDNA) or after (PCR amplicon) the PCR step. The communities were sequenced on the Illumina MiSeq and Ion Torrent PGM platforms and then analyzed using the QIIME 1 (UCLUST) and Divisive Amplicon Denoising Algorithm 2 (DADA2) analysis pipelines with taxonomic comparisons to the Greengenes and Ribosomal Database Project (RDP) databases. Examination of the PCR amplicon mock community with these methods resulted in operational taxonomic units (OTUs) and amplicon sequence variants (ASVs) that ranged from 13 to 118 and were dependent on the DNA sequencing method and read assembly steps. The additional 4 to 109 OTUs/ASVs (from 9 OTUs/ASVs) included assignments to spurious taxa and sequence variants of the 9 species included in the mock community. Comparisons between the gDNA and PCR amplicon mock communities showed that combining gDNAs from the different strains prior to PCR resulted in up to 8.9-fold greater numbers of spurious OTUs/ASVs. However, the DNA sequencing method and paired-end read assembly steps conferred the largest effects on predictions of bacterial diversity, with effect sizes of 0.88 (Bray-Curtis) and 0.32 (weighted UniFrac), independent of the mock community type. Overall, DNA sequencing performed with the Ion Torrent PGM and analyzed with DADA2 and the Greengenes database resulted in the most accurate predictions of the mock community phylogeny, taxonomy, and diversity. IMPORTANCE Validated methods are urgently needed to improve DNA sequence-based assessments of complex bacterial communities.
In this study, we used 16S rRNA PCR amplicon and gDNA mock community standards, consisting of nine dairy-associated bacterial species, to evaluate the most commonly applied 16S rRNA marker gene DNA sequencing and analysis platforms used in evaluating dairy and other bacterial habitats. Our results show that bacterial metataxonomic assessments are largely dependent on the DNA sequencing platform and read curation method used. DADA2 improved sequence annotation compared with QIIME 1, and when combined with the Ion Torrent PGM DNA sequencing platform and the Greengenes database for taxonomic assignment, the most accurate representation of the dairy mock community standards was reached. This approach will be useful for validating sample collection and DNA extraction methods and ultimately investigating bacterial population dynamics in milk- and dairy-associated environments.
Project description:High-throughput sequencing has the potential to describe biological communities with high efficiency, yet comprehensive assessment of diversity with species-level resolution remains one of the most challenging aspects of metabarcoding studies. We investigated the utility of curated ribosomal and mitochondrial nematode reference sequence databases for determining phylum-specific species-level clustering thresholds. We compiled 438 ribosomal and 290 mitochondrial sequences which identified 99% and 94% as the species delineation clustering threshold, respectively. These thresholds were evaluated in HTS data from mock communities containing 39 nematode species as well as environmental samples from Vietnam. We compared the taxonomic description of the mocks generated by two read-merging and two clustering algorithms and the cluster-free DADA2 pipeline. Taxonomic assignment with the RDP classifier was assessed under different training sets. Our results showed that 36/39 mock nematode species were identified across the molecular markers (18S: 32, JB2: 19, JB3: 21) in UClust_ref OTUs at their respective clustering thresholds, outperforming UParse_denovo and the commonly used 97% similarity. DADA2 generated the most realistic number of ASVs (18S: 83, JB2: 75, JB3: 82), collectively identifying 30/39 mock species. The ribosomal marker outperformed the mitochondrial markers in terms of species and genus-level detections for both OTUs and ASVs. The number of taxonomic assignments of OTUs/ASVs was highest when the smallest reference database containing only nematode sequences was used and when sequences were truncated to the respective amplicon length. Overall, OTUs generated more species-level detections, which were, however, associated with higher error rates compared to ASVs.
Genus-level assignments using ASVs exhibited higher accuracy and lower error rates compared to species-level assignments, suggesting that this is the most reliable pipeline for rapid assessment of alpha diversity from environmental samples.
Project description:BACKGROUND:It is now possible to comprehensively characterize the microbiota of the lungs using culture-independent, sequencing-based assays. Several sample types have been used to investigate the lung microbiota, each presenting specific challenges for preparation and analysis of microbial communities. Bronchoalveolar lavage fluid (BALF) enables the identification of microbiota specific to the lower lung but commonly has low bacterial density, increasing the risk of false-positive signal from contaminating DNA. The objectives of this study were to investigate the extent of contamination across a range of sample densities representative of BALF and identify features of contaminants that facilitate their removal from sequence data and aid in the interpretation of BALF sample 16S sequencing data. RESULTS:Using three mock communities across a range of densities from 8E+02 to 8E+09 16S copies/ml, we assessed taxonomic accuracy and precision by 16S rRNA gene sequencing and the proportion of reads arising from contaminants. Sequencing accuracy, precision, and the relative abundance of mock community members decreased with sample input density, with a significant drop-off below 8E+05 16S copies/ml. Contaminant OTUs were commonly inversely correlated with sample input density or not reproduced between technical replicates. Removal of taxa with these features or physical concentration of samples prior to sequencing improved both sequencing accuracy and precision for samples between 8E+04 and 8E+06 16S copies/ml. For the lowest densities, below 8E+03 16S copies/ml BALF, accuracy and precision could not be significantly improved using these approaches. Using clinical BALF samples across a large density range, we observed that OTUs with features of contaminants identified in mock communities were also evident in low-density BALF samples.
CONCLUSION:Relative abundance data and community composition generated by 16S sequencing of BALF samples across the range of density commonly observed in this sample type should be interpreted in the context of input sample density and may be improved by simple pre- and post-sequencing steps for densities above 8E+04 16S copies/ml.
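One of the contaminant features described above, an inverse correlation between an OTU's relative abundance and sample input density, can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the use of Pearson correlation on log10-transformed density, and the cutoff of -0.5 are all hypothetical choices for illustration and are not the procedure used in the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def flag_inverse_density(rel_abundance, densities, r_cutoff=-0.5):
    """Return OTUs whose relative abundance falls as input density rises.

    rel_abundance maps an OTU id to its relative abundance in each sample;
    densities holds the 16S copies/ml of the same samples, in the same order.
    The cutoff is an arbitrary illustrative value.
    """
    log_density = [math.log10(d) for d in densities]
    return sorted(
        otu
        for otu, values in rel_abundance.items()
        if pearson(log_density, values) <= r_cutoff
    )


if __name__ == "__main__":
    densities = [8e2, 8e4, 8e6, 8e8]  # made-up dilution series
    table = {
        "otu_contaminant": [0.60, 0.30, 0.05, 0.00],
        "otu_member": [0.10, 0.40, 0.70, 0.80],
    }
    print(flag_inverse_density(table, densities))
```

The second feature mentioned in the abstract, lack of reproducibility between technical replicates, would need replicate-level data and is not covered by this sketch.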
Project description:High-depth sequencing of universal marker genes such as the 16S rRNA gene is a common strategy to profile microbial communities. Traditionally, sequence reads are clustered into operational taxonomic units (OTUs) at a defined identity threshold to avoid sequencing errors generating spurious taxonomic units. However, there have been numerous bioinformatic packages recently released that attempt to correct sequencing errors to determine real biological sequences at single nucleotide resolution by generating amplicon sequence variants (ASVs). As more researchers begin to use high resolution ASVs, there is a need for an in-depth and unbiased comparison of these novel "denoising" pipelines. In this study, we conduct a thorough comparison of three of the most widely-used denoising packages (DADA2, UNOISE3, and Deblur) as well as an open-reference 97% OTU clustering pipeline on mock, soil, and host-associated communities. We found from the mock community analyses that although they produced similar microbial compositions based on relative abundance, the approaches identified vastly different numbers of ASVs that significantly impact alpha diversity metrics. Our analysis on real datasets using recommended settings for each denoising pipeline also showed that the three packages were consistent in their per-sample compositions, resulting in only minor differences based on weighted UniFrac and Bray-Curtis dissimilarity. DADA2 tended to find more ASVs than the other two denoising pipelines when analyzing both the real soil data and two other host-associated datasets, suggesting that it could be better at finding rare organisms, but at the expense of possible false positives. The open-reference OTU clustering approach identified considerably more OTUs in comparison to the number of ASVs from the denoising pipelines in all datasets tested. 
The three denoising approaches differed significantly in their run times, with UNOISE3 running more than 1,200 and 15 times faster than DADA2 and Deblur, respectively. Our findings indicate that, although all pipelines result in similar general community structure, the number of ASVs/OTUs and resulting alpha-diversity metrics vary considerably and should be considered when attempting to distinguish rare organisms from possible background noise.
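For readers unfamiliar with the metrics named in this abstract, the following minimal sketch computes Bray-Curtis dissimilarity between two count vectors and observed richness, a simple alpha-diversity measure. The function names and the toy counts are assumptions made for illustration; the study's own analysis is not reproduced here.

```python
def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity: sum|a - b| / sum(a + b), over shared feature order."""
    if len(counts_a) != len(counts_b):
        raise ValueError("count vectors must cover the same features")
    total = sum(counts_a) + sum(counts_b)
    if total == 0:
        return 0.0  # two empty samples are treated as identical
    return sum(abs(a - b) for a, b in zip(counts_a, counts_b)) / total


def observed_richness(counts):
    """Number of features (ASVs or OTUs) with a non-zero count."""
    return sum(1 for c in counts if c > 0)


if __name__ == "__main__":
    # Toy samples: the same dominant features, but the second sample
    # carries several extra low-abundance features.
    sample_a = [500, 300, 200, 0, 0, 0]
    sample_b = [495, 298, 199, 3, 3, 2]
    print(bray_curtis(sample_a, sample_b))
    print(observed_richness(sample_a), observed_richness(sample_b))
```

The toy data show the pattern the abstract reports: a small abundance-weighted dissimilarity can coexist with a large difference in feature counts, because rare features contribute little to Bray-Curtis but each one adds to richness.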
Project description:Recent advances have made it possible to analyze high-throughput marker-gene sequencing data without resorting to the customary construction of molecular operational taxonomic units (OTUs): clusters of sequencing reads that differ by less than a fixed dissimilarity threshold. New methods control errors sufficiently such that amplicon sequence variants (ASVs) can be resolved exactly, down to the level of single-nucleotide differences over the sequenced gene region. The benefits of finer resolution are immediately apparent, and arguments for ASV methods have focused on their improved resolution. Less obvious, but we believe more important, are the broad benefits that derive from the status of ASVs as consistent labels with intrinsic biological meaning identified independently from a reference database. Here we discuss how these features grant ASVs the combined advantages of closed-reference OTUs-including computational costs that scale linearly with study size, simple merging between independently processed data sets, and forward prediction-and of de novo OTUs-including accurate measurement of diversity and applicability to communities lacking deep coverage in reference databases. We argue that the improvements in reusability, reproducibility and comprehensiveness are sufficiently great that ASVs should replace OTUs as the standard unit of marker-gene analysis and reporting.
Project description:BACKGROUND:Careful consideration of experimental artefacts is required in order to successfully apply high-throughput 16S ribosomal ribonucleic acid (rRNA) gene sequencing technology. Here we introduce experimental design, quality control and "denoising" approaches for sequencing low biomass specimens. RESULTS:We found that bacterial biomass is a key driver of 16S rRNA gene sequencing profiles generated from bacterial mock communities and that the use of different deoxyribonucleic acid (DNA) extraction methods [DSP Virus/Pathogen Mini Kit® (Kit-QS) and ZymoBIOMICS DNA Miniprep Kit (Kit-ZB)] and storage buffers [PrimeStore® Molecular Transport medium (Primestore) and Skim-milk, Tryptone, Glucose and Glycerol (STGG)] further influence these profiles. Kit-QS better represented hard-to-lyse bacteria from bacterial mock communities compared to Kit-ZB. Primestore storage buffer yielded lower levels of background operational taxonomic units (OTUs) from low biomass bacterial mock community controls compared to STGG. In addition to bacterial mock community controls, we used technical repeats (nasopharyngeal and induced sputum processed in duplicate, triplicate or quadruplicate) to further evaluate the effect of specimen biomass and participant age at specimen collection on resultant sequencing profiles. We observed a positive correlation (r = 0.16) between specimen biomass and participant age at specimen collection: low biomass technical repeats (represented by < 500 16S rRNA gene copies/µl) were primarily collected at < 14 days of age. We found that low biomass technical repeats also produced higher alpha diversities (r = -0.28); 16S rRNA gene profiles similar to no template controls (Primestore); and reduced sequencing reproducibility.
Finally, we show that the use of statistical tools for in silico contaminant identification, as implemented through the decontam package in R, provides better representations of indigenous bacteria following decontamination. CONCLUSIONS:We provide insight into experimental design, quality control steps and "denoising" approaches for 16S rRNA gene high-throughput sequencing of low biomass specimens. We highlight the need for careful assessment of DNA extraction methods and storage buffers; sequence quality and reproducibility; and in silico identification of contaminant profiles in order to avoid spurious results.
Project description:Microbial amplicon sequencing studies are an important tool in biological and biomedical research. Widespread 16S rRNA gene microbial surveys have shed light on the structure of many ecosystems inhabited by bacteria, including the human body. However, specialized software and algorithms are needed to convert raw sequencing data into biologically meaningful information (i.e. tables of bacterial counts). While different bioinformatic pipelines are available in a rapidly changing and improving field, users are often unaware of limitations and biases associated with individual pipelines and there is a lack of agreement regarding best practices. Here, we compared six bioinformatic pipelines for the analysis of amplicon sequence data: three OTU-level flows (QIIME-uclust, MOTHUR, and USEARCH-UPARSE) and three ASV-level flows (DADA2, Qiime2-Deblur, and USEARCH-UNOISE3). We tested workflows with different quality control options, clustering algorithms, and cutoff parameters on a mock community as well as on a large (N = 2170) recently published fecal sample dataset from the multi-ethnic HELIUS study. We assessed the sensitivity, specificity, and degree of consensus of the different outputs. DADA2 offered the best sensitivity, at the expense of decreased specificity compared to USEARCH-UNOISE3 and Qiime2-Deblur. USEARCH-UNOISE3 showed the best balance between resolution and specificity. OTU-level USEARCH-UPARSE and MOTHUR performed well, but with lower specificity than ASV-level pipelines. QIIME-uclust produced a large number of spurious OTUs as well as inflated alpha-diversity measures and should be avoided in future studies. This study provides guidance for researchers using amplicon sequencing to gain biological insights.
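A mock community makes this kind of assessment concrete, because the expected members are known. The sketch below scores one pipeline's output against an expected taxon list. It is a minimal sketch under stated assumptions: the function name and taxon labels are hypothetical, and because true negatives are not well defined for an open-ended list of detected features, the sketch reports precision (the fraction of detected taxa that are expected) in place of specificity. The study's own definitions may differ.

```python
def mock_scores(expected, detected):
    """Return (sensitivity, precision) of detected taxa against a mock community.

    sensitivity = expected taxa that were detected / all expected taxa
    precision   = expected taxa that were detected / all detected taxa
    """
    expected, detected = set(expected), set(detected)
    true_positives = expected & detected
    sensitivity = len(true_positives) / len(expected) if expected else 0.0
    precision = len(true_positives) / len(detected) if detected else 0.0
    return sensitivity, precision


if __name__ == "__main__":
    # Placeholder labels, not taxa from the study.
    expected = {"taxon_A", "taxon_B", "taxon_C", "taxon_D"}
    detected = {"taxon_A", "taxon_B", "taxon_C", "spurious_1", "spurious_2"}
    sensitivity, precision = mock_scores(expected, detected)
    print(f"sensitivity={sensitivity:.2f} precision={precision:.2f}")
```

A pipeline that reports many spurious features lowers precision without necessarily changing sensitivity, which is the trade-off the abstract describes between DADA2 and the other ASV-level pipelines.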
Project description:Biomonitoring approaches and investigations of many ecological questions require assessments of the biodiversity of a given habitat. Small organisms, ranging from protozoans to metazoans, are of great ecological importance and comprise a major share of the planet's biodiversity but they are extremely difficult to identify, due to their minute body sizes and indistinct structures. Thus, most biodiversity studies that include small organisms draw on several methods for species delimitation, ranging from traditional microscopy to molecular techniques. In this study, we compared the efficiency of these methods by analyzing a community of nematodes. Specifically, we evaluated the performances of traditional morphological identification, single-specimen barcoding (Sanger sequencing), and metabarcoding in the identification of 1500 nematodes from sediment samples. The molecular approaches were based on the analysis of the 28S ribosomal large and 18S small subunits (LSU and SSU). The morphological analysis resulted in the determination of 22 nematode species. Barcoding identified a comparable number of operational taxonomic units (OTUs) based on 28S rDNA (n = 20) and fewer OTUs based on 18S rDNA (n = 12). Metabarcoding identified a higher OTU number but fewer amplicon sequence variants (ASVs) (n = 48 OTUs, n = 17 ASVs for 28S rDNA, and n = 31 OTUs, n = 6 ASVs for 18S rDNA). Among the three approaches (morphology, barcoding, and metabarcoding), only three species (13.6%) were shared. This lack of taxonomic resolution hinders reliable community identifications to the species level. Further database curation will ensure the effective use of molecular species identification.