Dataset Information

MLR-OOD: A Markov Chain Based Likelihood Ratio Method for Out-Of-Distribution Detection of Genomic Sequences.


ABSTRACT: Machine learning or deep learning models have been widely used for taxonomic classification of metagenomic sequences, and many studies have reported high classification accuracy. Such models are usually trained on sequences from several training classes in the hope of accurately classifying unknown sequences into these classes. However, when the classification models are deployed on real testing data sets, sequences that do not belong to any of the training classes may be present and are falsely assigned to one of the training classes with high confidence. Such sequences are referred to as out-of-distribution (OOD) sequences and are ubiquitous in metagenomic studies. To address this problem, we develop a deep generative model-based method, MLR-OOD, that measures the probability of a testing sequence belonging to OOD by the likelihood ratio of the maximum of the in-distribution (ID) class conditional likelihoods and the Markov chain likelihood of the testing sequence, which measures the sequence complexity. We compose three different microbial data sets consisting of bacterial, viral, and plasmid sequences for comprehensively benchmarking OOD detection methods. We show that MLR-OOD achieves state-of-the-art performance, demonstrating its generality to various types of microbial data sets. We also show that MLR-OOD is robust to GC content, which is a major confounding effect for OOD detection of genomic sequences. In conclusion, MLR-OOD will greatly reduce false positives caused by OOD sequences in metagenomic sequence classification.
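
The likelihood ratio described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several points are assumptions made here for illustration: the function names `markov_log_likelihood` and `mlr_score` are hypothetical, the Markov chain is fit to the testing sequence itself with a fixed order and a pseudocount, the probability of the first `order` bases is ignored, and the class-conditional log-likelihoods (which the paper obtains from deep generative models) are simply passed in as numbers.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def markov_log_likelihood(seq, order=2, pseudocount=1.0):
    """Log-likelihood of `seq` under a Markov chain fit to `seq` itself.

    Assumption for this sketch: transition probabilities are estimated from
    the sequence's own k-mer counts with additive smoothing, and the
    probability of the first `order` bases is ignored.
    """
    seq = seq.upper()
    context_counts = Counter()
    transition_counts = Counter()
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        # Skip windows containing ambiguous bases such as N.
        if set(context + base) <= set(ALPHABET):
            context_counts[context] += 1
            transition_counts[(context, base)] += 1

    loglik = 0.0
    for (context, base), n in transition_counts.items():
        p = (n + pseudocount) / (context_counts[context] + pseudocount * len(ALPHABET))
        loglik += n * math.log(p)
    return loglik


def mlr_score(class_log_likelihoods, seq, order=2):
    """Log of the ratio: max ID class-conditional likelihood over Markov chain likelihood.

    `class_log_likelihoods` holds one log-likelihood per in-distribution class;
    in this sketch they are placeholders supplied by the caller.
    """
    return max(class_log_likelihoods) - markov_log_likelihood(seq, order)


if __name__ == "__main__":
    test_seq = "ACGTTGCAAGCTTAGGCATCGATCG" * 4
    # Hypothetical per-class log-likelihoods for the same sequence.
    print(mlr_score([-140.0, -135.5, -150.2], test_seq))
```

Working in log space turns the ratio into a difference, which avoids numerical underflow on long sequences. How the score is thresholded to call a sequence OOD is not specified in the abstract and is left out of the sketch.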

SUBMITTER: Bai X 

PROVIDER: S-EPMC10433695 | biostudies-literature | 2022 Aug

REPOSITORIES: biostudies-literature

Publications

MLR-OOD: A Markov Chain Based Likelihood Ratio Method for Out-Of-Distribution Detection of Genomic Sequences.

Xin Bai, Jie Ren, Fengzhu Sun

Journal of Molecular Biology, 2022 Apr 12, issue 15

Similar Datasets

| S-EPMC5705233 | biostudies-literature
| S-EPMC10763884 | biostudies-literature
| S-EPMC3256161 | biostudies-literature
| S-EPMC6305257 | biostudies-literature
2021-06-25 | GSE164072 | GEO
| S-EPMC5447239 | biostudies-literature
| S-EPMC5144651 | biostudies-literature
| S-EPMC7090511 | biostudies-literature
2021-06-25 | GSE164045 | GEO
2021-06-25 | GSE164016 | GEO