Dataset Information

PARTIE: a partition engine to separate metagenomic and amplicon projects in the Sequence Read Archive.

ABSTRACT:

Motivation

The Sequence Read Archive (SRA) contains raw data from many different types of sequence projects. As of 2017, the SRA contained approximately ten petabases of DNA sequence (10^16 bp). Annotations of the data are provided by the submitter, and mining the data in the SRA is complicated by both the amount of data and the detail within those annotations. Here, we introduce PARTIE, a partition engine optimized to differentiate sequence read data into metagenomic (random) and amplicon (targeted) sequence data sets.

Results

PARTIE subsamples reads from the sequencing file and calculates four different statistics: k-mer frequency, 16S abundance, prokaryotic- and viral-read abundance. These metrics are used to create a RandomForest decision tree to classify the sequencing data, and PARTIE provides mechanisms for both supervised and unsupervised classification. We demonstrate the accuracy of PARTIE for classifying SRA data, discuss the probable error rates in the SRA annotations and introduce a resource assessing SRA data.
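The subsampling and k-mer steps described above can be illustrated with a minimal Python sketch. This is not PARTIE's implementation: the function names (`subsample_reads`, `kmer_counts`, `singleton_fraction`) are hypothetical, and the "fraction of k-mers seen only once" statistic is an assumed stand-in for whatever k-mer frequency measure PARTIE actually computes. The intuition it tries to show is that amplicon (targeted) reads repeat the same k-mers many times, whereas random metagenomic reads tend to yield mostly rare k-mers.

```python
import random
from collections import Counter


def subsample_reads(reads, n, seed=0):
    """Return up to n reads chosen uniformly at random (reservoir sampling)."""
    rng = random.Random(seed)
    sample = []
    for i, read in enumerate(reads):
        if i < n:
            # Fill the reservoir with the first n reads.
            sample.append(read)
        else:
            # Replace an existing entry with probability n / (i + 1).
            j = rng.randint(0, i)
            if j < n:
                sample[j] = read
    return sample


def kmer_counts(reads, k):
    """Count every overlapping k-mer across the reads, skipping k-mers containing N."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def singleton_fraction(counts):
    """Fraction of distinct k-mers seen exactly once (0.0 if there are none)."""
    if not counts:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


# Illustrative use: identical (amplicon-like) reads share all their k-mers.
amplicon_like = ["ACGTACGTAC"] * 50
sampled = subsample_reads(amplicon_like, 10)
print(singleton_fraction(kmer_counts(sampled, 4)))
```

In a fuller pipeline, a value like this would be one of several features (alongside 16S, prokaryotic-read and viral-read abundance) passed to a random forest classifier; that step is omitted here because the abstract does not give its details.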

Availability and implementation

PARTIE and reclassified metagenome SRA entries are available from https://github.com/linsalrob/partie.

Contact

redwards@mail.sdsu.edu.

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Torres PJ

PROVIDER: S-EPMC5860118 | biostudies-literature | 2017 Aug

REPOSITORIES: biostudies-literature

Publications

PARTIE: a partition engine to separate metagenomic and amplicon projects in the Sequence Read Archive.

Pedro J Torres, Robert A Edwards, Katelyn A McNair

Bioinformatics (Oxford, England), 2017 Aug 1, issue 15

PMID: 28369246

Similar Datasets

Project description: Amplicon sequencing variants (ASVs) have been proposed as an alternative to operational taxonomic units (OTUs) for analyzing microbial communities. ASVs have grown in popularity, in part because of a desire to reflect a more refined level of taxonomy since they do not cluster sequences based on a distance-based threshold. However, ASVs and the use of overly narrow thresholds to identify OTUs increase the risk of splitting a single genome into separate clusters. To assess this risk, I analyzed the intragenomic variation of 16S rRNA genes from the bacterial genomes represented in an rrn copy number database, which contained 20,427 genomes from 5,972 species. As the number of copies of the 16S rRNA gene increased in a genome, the number of ASVs also increased. There was an average of 0.58 ASVs per copy of the 16S rRNA gene for full-length 16S rRNA genes. It was necessary to use a distance threshold of 5.25% to cluster full-length ASVs from the same genome into a single OTU with 95% confidence for genomes with 7 copies of the 16S rRNA, such as Escherichia coli. This research highlights the risk of splitting a single bacterial genome into separate clusters when ASVs are used to analyze 16S rRNA gene sequence data. Although there is also a risk of clustering ASVs from different species into the same OTU when using broad distance thresholds, these risks are of less concern than artificially splitting a genome into separate ASVs and OTUs.

IMPORTANCE 16S rRNA gene sequencing has engendered significant interest in studying microbial communities. There has been tension between trying to classify 16S rRNA gene sequences to increasingly lower taxonomic levels and the reality that those levels were defined using more sequence and physiological information than is available from a fragment of the 16S rRNA gene. Furthermore, the naming of bacterial taxa reflects the biases of those who name them. One motivation for the recent push to adopt ASVs in place of OTUs in microbial community analyses is to allow researchers to perform their analyses at the finest possible level that reflects species-level taxonomy. The current research is significant because it quantifies the risk of artificially splitting bacterial genomes into separate clusters. Far from providing a better representation of bacterial taxonomy and biology, the ASV approach can lead to conflicting inferences about the ecology of different ASVs from the same genome.
