Dataset Information

Improving the filtering of false positive single nucleotide variations by combining genomic features with quality metrics.


ABSTRACT:

Motivation

Technical errors in sequencing or bioinformatics steps, together with alignment difficulties at some genomic sites, produce false positive (FP) variants. Filtering on quality metrics is a common way to detect FP variants, but fixed thresholds overlook the more complex relationships between features, so thresholds set to reduce the FP rate may also discard true positive variants. The goal of this study is to develop a machine learning-based model for identifying FPs that integrates quality metrics with genomic features and offers feature-level interpretability to provide insight into its results.

Results

We propose a random forest-based model that utilizes genomic features to improve the identification of FPs. Further examination of the features shows that the newly introduced features have an important impact on the prediction of variants misclassified by VEF, GATK-CNN, and GARFIELD, three recently introduced FP detection systems. We applied cost-sensitive training to reduce the misclassification of true variants, yielding a model that is robust against misclassifying true variants while increasing the detection rate of FP variants. The model can be easily re-trained when factors such as experimental protocols alter the FP distribution. In addition, it has an interpretability mechanism that allows users to understand the impact of features on the model's predictions.
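
The idea behind cost-sensitive training can be illustrated with a toy example. The sketch below is not the FPDetect implementation and does not use a random forest: it fits a single cost-weighted threshold (a decision stump) on one quality score, and every name in it (`SAMPLES`, `weighted_cost`, `fit_stump`) as well as the data values are hypothetical, chosen only to show how penalizing the loss of a true variant more heavily shifts the decision boundary.

```python
from typing import List, Tuple

# Hypothetical toy data: (quality score, label).
# label 1 = true variant, label 0 = false positive (FP).
SAMPLES: List[Tuple[float, int]] = [
    (10.0, 0), (15.0, 0), (20.0, 0), (22.0, 1), (25.0, 0),
    (27.0, 0), (30.0, 1), (35.0, 1), (40.0, 1), (50.0, 1),
]


def weighted_cost(samples: List[Tuple[float, int]], threshold: float,
                  cost_true_lost: float, cost_fp_kept: float) -> float:
    """Total cost when every call scoring below `threshold` is filtered as FP."""
    cost = 0.0
    for score, label in samples:
        flagged_fp = score < threshold
        if label == 1 and flagged_fp:
            cost += cost_true_lost   # a true variant was discarded
        elif label == 0 and not flagged_fp:
            cost += cost_fp_kept     # an FP slipped through the filter
    return cost


def fit_stump(samples: List[Tuple[float, int]],
              cost_true_lost: float = 1.0,
              cost_fp_kept: float = 1.0) -> float:
    """Return the threshold that minimizes the weighted misclassification cost."""
    scores = sorted({score for score, _ in samples})
    # Candidate thresholds: below the minimum, between neighbours, above the maximum.
    candidates = [scores[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(scores, scores[1:])]
    candidates.append(scores[-1] + 1.0)
    return min(
        candidates,
        key=lambda t: weighted_cost(samples, t, cost_true_lost, cost_fp_kept),
    )


if __name__ == "__main__":
    print("equal costs:", fit_stump(SAMPLES))
    print("true variants 5x costlier:", fit_stump(SAMPLES, cost_true_lost=5.0))
```

On this made-up data, equal costs select a threshold of 28.5, which removes all five FPs but also discards the true variant scoring 22. Making the loss of a true variant five times as costly moves the threshold down to 21.0, which keeps every true variant at the price of letting two FPs through. A tree ensemble trained with class weights applies the same trade-off across many features and splits.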

Availability and implementation

The software implementation can be found at https://github.com/ideateknoloji/FPDetect.

SUBMITTER: Eren KK 

PROVIDER: S-EPMC10692869 | biostudies-literature | 2023 Dec

REPOSITORIES: biostudies-literature

Publications

Improving the filtering of false positive single nucleotide variations by combining genomic features with quality metrics.

Eren Kazım Kıvanç KK, Çınar Esra E, Karakurt Hamza U HU, Özgür Arzucan A

Bioinformatics (Oxford, England), 2023-12-01, 12


Similar Datasets

| S-EPMC9082887 | biostudies-literature
| S-EPMC10107899 | biostudies-literature
| S-EPMC1906837 | biostudies-literature
| S-EPMC9845072 | biostudies-literature
| S-EPMC4112476 | biostudies-literature
| S-EPMC6475568 | biostudies-literature
| S-EPMC7470380 | biostudies-literature
| S-EPMC1127019 | biostudies-literature
2018-12-28 | PXD011364 | JPOST Repository
| S-EPMC5989480 | biostudies-literature