<HashMap><database>biostudies-literature</database><scores/><additional><omics_type>Unknown</omics_type><volume>6</volume><submitter>Edgar RC</submitter><pubmed_abstract>Prediction of taxonomy for marker gene sequences such as 16S ribosomal RNA (rRNA) is a fundamental task in microbiology. Most experimentally observed sequences are diverged from reference sequences of authoritatively named organisms, creating a challenge for prediction methods. I assessed the accuracy of several algorithms using cross-validation by identity, a new benchmark strategy which explicitly models the variation in distances between query sequences and the closest entry in a reference database. When the accuracy of genus predictions was averaged over a representative range of identities with the reference database (100%, 99%, 97%, 95% and 90%), all tested methods had ≤50% accuracy on the currently-popular V4 region of 16S rRNA. Accuracy was found to fall rapidly with identity; for example, better methods were found to have V4 genus prediction accuracy of ~100% at 100% identity but ~50% at 97% identity. The relationship between identity and taxonomy was quantified as the probability that a rank is the lowest shared by a pair of sequences with a given pair-wise identity.
With the V4 region, 95% identity was found to be a twilight zone where taxonomy is highly ambiguous because the probabilities that the lowest shared rank between pairs of sequences is genus, family, order or class are approximately equal.</pubmed_abstract><journal>PeerJ</journal><pagination>e4652</pagination><full_dataset_link>https://www.ebi.ac.uk/biostudies/studies/S-EPMC5910792</full_dataset_link><repository>biostudies-literature</repository><pubmed_title>Accuracy of taxonomy prediction for 16S rRNA and fungal ITS sequences.</pubmed_title><pmcid>PMC5910792</pmcid><pubmed_authors>Edgar RC</pubmed_authors></additional><is_claimable>false</is_claimable><name>Accuracy of taxonomy prediction for 16S rRNA and fungal ITS sequences.</name><description>Prediction of taxonomy for marker gene sequences such as 16S ribosomal RNA (rRNA) is a fundamental task in microbiology. Most experimentally observed sequences are diverged from reference sequences of authoritatively named organisms, creating a challenge for prediction methods. I assessed the accuracy of several algorithms using cross-validation by identity, a new benchmark strategy which explicitly models the variation in distances between query sequences and the closest entry in a reference database. When the accuracy of genus predictions was averaged over a representative range of identities with the reference database (100%, 99%, 97%, 95% and 90%), all tested methods had ≤50% accuracy on the currently-popular V4 region of 16S rRNA. Accuracy was found to fall rapidly with identity; for example, better methods were found to have V4 genus prediction accuracy of ~100% at 100% identity but ~50% at 97% identity. The relationship between identity and taxonomy was quantified as the probability that a rank is the lowest shared by a pair of sequences with a given pair-wise identity.
With the V4 region, 95% identity was found to be a twilight zone where taxonomy is highly ambiguous because the probabilities that the lowest shared rank between pairs of sequences is genus, family, order or class are approximately equal.</description><dates><release>2018-01-01T00:00:00Z</release><publication>2018</publication><modification>2021-02-20T08:23:30Z</modification><creation>2019-03-26T23:29:53Z</creation></dates><accession>S-EPMC5910792</accession><cross_references><pubmed>29682424</pubmed><doi>10.7717/peerj.4652</doi></cross_references></HashMap>