
Dataset Information


Influence of Feature Encoding and Choice of Classifier on Disease Risk Prediction in Genome-Wide Association Studies.


ABSTRACT: Various attempts have been made to predict individual disease risk from genotype data collected in genome-wide association studies (GWAS). However, most studies investigated only one or two classification algorithms and feature encoding schemes. In this study, we applied seven different classification algorithms to GWAS case-control data sets for seven different diseases to create models for disease risk prediction. Further, we used three different encoding schemes for the genotypes of single nucleotide polymorphisms (SNPs) and investigated their influence on the predictive performance of these models. Our study suggests that an additive encoding of the SNP data should be the preferred encoding scheme, as it yielded the best predictive performance for all algorithms and data sets. Furthermore, our results showed that the differences between most state-of-the-art classification algorithms are not statistically significant. Consequently, we recommend preferring algorithms with simple models, such as the linear support vector machine (SVM), as they allow for better subsequent interpretation without a significant loss of accuracy.
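
To make the notion of a genotype encoding concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not define its three schemes, so this assumes the common convention that an additive encoding counts the copies of the minor allele (0, 1, or 2) at each SNP. The one-hot variant is shown only for contrast and is not necessarily one of the schemes used in the study. The function names and the example genotypes are hypothetical.

```python
from typing import List


def additive_encode(genotype: str, minor_allele: str) -> int:
    """Assumed additive encoding: number of minor-allele copies (0, 1, or 2)."""
    if len(genotype) != 2:
        raise ValueError(f"expected a two-allele genotype, got {genotype!r}")
    return sum(1 for allele in genotype if allele == minor_allele)


def one_hot_encode(genotype: str, minor_allele: str) -> List[int]:
    """Contrast scheme (assumption): one indicator per genotype class."""
    vector = [0, 0, 0]
    vector[additive_encode(genotype, minor_allele)] = 1
    return vector


# Hypothetical genotypes at one SNP where "G" is taken to be the minor allele.
genotypes = ["AA", "AG", "GG", "GA"]
additive = [additive_encode(g, "G") for g in genotypes]
one_hot = [one_hot_encode(g, "G") for g in genotypes]
```

Under this assumption, the additive scheme produces one ordinal feature per SNP, whereas the one-hot scheme produces three binary features per SNP, which triples the input dimension a classifier has to handle.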

SUBMITTER: Mittag F 

PROVIDER: S-EPMC4540285 | biostudies-literature | 2015

REPOSITORIES: biostudies-literature


Publications

Influence of Feature Encoding and Choice of Classifier on Disease Risk Prediction in Genome-Wide Association Studies.

Florian Mittag, Michael Römer, Andreas Zell

PLoS ONE, 2015-08-18, issue 8



Similar Datasets

| S-EPMC2964405 | biostudies-literature
| S-EPMC8294554 | biostudies-literature
| S-EPMC3208586 | biostudies-literature
| S-EPMC5287117 | biostudies-literature
| S-EPMC2688469 | biostudies-literature
| S-EPMC3044281 | biostudies-literature
| S-EPMC5436027 | biostudies-literature
| S-EPMC5007749 | biostudies-other
| S-EPMC7199379 | biostudies-literature
| S-EPMC7590725 | biostudies-literature