Dataset Information

Decreasing the number of false positives in sequence classification.

ABSTRACT:

Background

Many probabilistic models used in sequence analysis assign non-zero probability to most input sequences. The most common way to decide whether a given probability is sufficient is Bayesian binary classification, in which the probability of the model characterizing the sequence family of interest is compared with that of an alternative probability model. A null model can serve as this alternative. This is the scoring technique used by sequence analysis tools such as HMMER, SAM and INFERNAL. The most prevalent null models are position-independent residue distributions, including the uniform distribution, the genomic distribution, the family-specific distribution and the target sequence distribution. This paper presents a study evaluating the impact of the choice of null model on the final classification results. In particular, we are interested in minimizing the number of false predictions in a classification, a crucial issue for reducing the costs of biological validation.
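The comparison described above can be sketched as a log-odds score: the log-probability of a sequence under the family model minus its log-probability under the null model. The sketch below is illustrative only. The function names, the pseudocount, and the toy family distribution are assumptions, and the family model here is a simple position-independent distribution standing in for a real profile model such as an HMM.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def uniform_null():
    """Uniform null model: every residue is equally likely."""
    return {b: 1.0 / len(ALPHABET) for b in ALPHABET}

def target_null(seq, pseudocount=1.0):
    """Target null model: residue frequencies of the candidate sequence itself.

    The pseudocount (an assumption of this sketch) avoids zero probabilities.
    """
    counts = Counter(seq)
    total = len(seq) + pseudocount * len(ALPHABET)
    return {b: (counts[b] + pseudocount) / total for b in ALPHABET}

def log_prob_iid(seq, dist):
    """Log-probability of seq under a position-independent distribution."""
    return sum(math.log(dist[b]) for b in seq)

def log_odds(seq, family_dist, null_dist):
    """Log-odds score; a positive value favours the family model."""
    return log_prob_iid(seq, family_dist) - log_prob_iid(seq, null_dist)

# Hypothetical GC-rich family model and a GC-rich candidate sequence.
family = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}
candidate = "GCGCGCGC"

score_uniform = log_odds(candidate, family, uniform_null())
score_target = log_odds(candidate, family, target_null(candidate))
print(f"uniform null: {score_uniform:.3f}")
print(f"target null:  {score_target:.3f}")
```

In this toy case the score against the uniform null is positive purely because the candidate is GC-rich, while the score against the target null is slightly negative, since the target null already accounts for the candidate's composition.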

Results

For all the tests, the target null model presented the lowest number of false positives when random sequences were used as a test. The study was performed on DNA sequences using GC content as the measure of content bias, but the results should also be valid for protein sequences. To broaden the application of the results, the study was performed using randomly generated sequences. Previous studies were performed on amino acid sequences, used only one probabilistic model (HMM) and a specific benchmark, and lack more general conclusions about the performance of null models. Finally, a benchmark test with P. falciparum confirmed these results.
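As a rough illustration of the kind of test data described above, random DNA sequences with a chosen GC content could be generated as follows. This is a minimal sketch under stated assumptions, not the authors' procedure: the function names are hypothetical, residues are drawn independently, and G and C (likewise A and T) are assumed to be equally frequent.

```python
import random

def random_dna(length, gc, rng):
    """Draw an i.i.d. DNA sequence whose expected GC fraction is `gc`."""
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))

def gc_content(seq):
    """Observed fraction of G and C residues in seq."""
    return sum(1 for b in seq if b in "GC") / len(seq)

rng = random.Random(0)  # fixed seed so the example is reproducible
seq = random_dna(1000, 0.7, rng)
print(len(seq), round(gc_content(seq), 2))
```

Sequences generated this way contain no true family members, so any sequence a classifier accepts counts as a false positive.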

Conclusions

Of the evaluated models, the best suited for classification are the uniform model and the target model. However, the uniform model presents a GC bias that can cause more false positives for candidate sequences with extreme compositional bias, a characteristic not described in previous studies. In these cases, the target model is more dependable for biological validation because of its higher specificity.

SUBMITTER: Machado-Lima A

PROVIDER: S-EPMC3045793 | biostudies-literature | 2010 Dec

REPOSITORIES: biostudies-literature

Publications

Decreasing the number of false positives in sequence classification.

Machado-Lima A, Kashiwabara AY, Durham AM

BMC Genomics, 2010 Dec 22

PMID: 21210966

Similar Datasets

Correction of copy number induced false positives in CRISPR screens.
| S-EPMC6067744 | biostudies-literature

Machine learning classification can reduce false positives in structure-based virtual screening.
| S-EPMC7414157 | biostudies-literature

vScreenML v2.0: Improved Machine Learning Classification for Reducing False Positives in Structure-Based Virtual Screening.
| S-EPMC11595162 | biostudies-literature

E-GWAS: an ensemble-like GWAS strategy that provides effective control over false positive rates without decreasing true positives.
| S-EPMC10320972 | biostudies-literature

Hydrophobicity identifies false positives and false negatives in peptide-MHC binding.
| S-EPMC9677119 | biostudies-literature

microRNA target prediction programs predict many false positives.
| S-EPMC5287229 | biostudies-literature

Underlying causes for prevalent false positives and false negatives in STARR-seq data.
| S-EPMC10516709 | biostudies-literature

Determining false-positives requires considering the totality of evidence.
| S-EPMC4672784 | biostudies-literature

Balancing false positives and false negatives for the detection of differential expression in malignancies.
| S-EPMC2747693 | biostudies-literature

Metal impurities cause false positives in high-throughput screening campaigns.
| S-EPMC4027514 | biostudies-literature