Dataset Information

Evaluation of a two-stage framework for prediction using big genomic data.

ABSTRACT: We are in the era of abundant 'big' or 'high-dimensional' data. These data afford us the opportunity to discover predictors of an event of interest, and to estimate occurrence of the event based on values of these predictors. For example, 'genome-wide association studies' examine millions of single-nucleotide polymorphisms (SNPs), along with disease status. We can learn SNPs that affect disease status from these data sets, and use the knowledge learned to predict disease likelihood. Owing to the large number of features, it is difficult for many prediction methods to use all the features directly. The ReliefF algorithm ranks a set of features in terms of how well they predict a target. It can be used to identify good predictors, which can then be provided to a prediction method. We compared the performance of eight prediction methods when predicting binary outcomes using high-dimensional discrete data sets. We performed two-stage prediction, where ReliefF is used in the first stage to identify good predictors. Bayesian network (BN)-based methods performed best overall. Furthermore, ReliefF did not improve their performance. The BN-based methods use the Bayesian Dirichlet Equivalent Uniform score to evaluate candidate models, and use BN inference algorithms to perform prediction. This score and these algorithms were developed for discrete variables. This perhaps explains why they perform better in this domain. Many prediction methods are available, and researchers have little reason for choosing one over the other in the domain of binary prediction using high-dimensional data sets. Our results indicate that the best choices overall are BN-based methods.
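The two-stage idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a simplified ReliefF for discrete features and a binary outcome, using Hamming distance and k nearest hits and misses. A naive Bayes classifier is swapped in as the second-stage predictor purely for illustration, and it is not one of the BN-based methods evaluated in the paper. The function names, the toy data, and the choice of k and of two selected features are all hypothetical.

```python
import math
from collections import Counter


def relieff_scores(X, y, k=3):
    """Simplified ReliefF weights for discrete features and a binary target."""
    n, d = len(X), len(X[0])
    weights = [0.0] * d
    for i in range(n):
        # Rank every other sample by Hamming distance to sample i.
        neighbours = sorted(
            (sum(a != b for a, b in zip(X[i], X[j])), j)
            for j in range(n) if j != i
        )
        hits = [j for _, j in neighbours if y[j] == y[i]][:k]
        misses = [j for _, j in neighbours if y[j] != y[i]][:k]
        for f in range(d):
            # A useful feature agrees with near hits and differs from near misses.
            if hits:
                weights[f] -= sum(X[i][f] != X[j][f] for j in hits) / (len(hits) * n)
            if misses:
                weights[f] += sum(X[i][f] != X[j][f] for j in misses) / (len(misses) * n)
    return weights


def fit_naive_bayes(X, y, features, n_values=2, alpha=1.0):
    """Stand-in second-stage predictor restricted to the selected features."""
    classes = sorted(set(y))
    class_count = Counter(y)
    value_count = {(c, f): Counter() for c in classes for f in features}
    for row, c in zip(X, y):
        for f in features:
            value_count[(c, f)][row[f]] += 1

    def predict(row):
        def log_posterior(c):
            lp = math.log(class_count[c] / len(y))
            for f in features:
                # Laplace smoothing keeps unseen values from zeroing the product.
                lp += math.log((value_count[(c, f)][row[f]] + alpha)
                               / (class_count[c] + alpha * n_values))
            return lp
        return max(classes, key=log_posterior)

    return predict


# Toy data (hypothetical): all 32 combinations of five binary features;
# the outcome simply copies feature 0, so feature 0 is the only true predictor.
X = [[(i >> f) & 1 for f in range(5)] for i in range(32)]
y = [row[0] for row in X]

scores = relieff_scores(X, y, k=3)          # stage 1: rank features
ranked = sorted(range(len(scores)), key=lambda f: scores[f], reverse=True)
selected = ranked[:2]                       # keep the two top-ranked features
predict = fit_naive_bayes(X, y, selected)   # stage 2: fit on selected features only
predictions = [predict(row) for row in X]   # evaluated on the training rows
```

On this toy set the informative feature is ranked first and the second stage recovers the outcome from the selected features. The predictions are made on the training rows only, so the sketch says nothing about out-of-sample performance or about the comparison reported in the paper.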

SUBMITTER: Jiang X

PROVIDER: S-EPMC4652616 | biostudies-literature | 2015 Nov

REPOSITORIES: biostudies-literature

Publications

Evaluation of a two-stage framework for prediction using big genomic data.

Xia Jiang, Richard E. Neapolitan

Briefings in Bioinformatics, 2015-03-18, issue 6

PMID: 25788325

Similar Datasets

HIBLUP: an integration of statistical models on the BLUP framework for efficient genetic evaluation using big genomic data.
S-EPMC10164590 | biostudies-literature

Framework for parallelisation on big data.
S-EPMC6532858 | biostudies-literature

Big Data Analytics for Genomic Medicine.
S-EPMC5343946 | biostudies-literature

bigSCale: An Analytical Framework for Big-Scale Single Cell Data
2018-04-15 | GSE102934 | GEO

Prediction of lithium response using genomic data.
S-EPMC7806976 | biostudies-literature

SparkText: Biomedical Text Mining on Big Data Framework.
S-EPMC5042555 | biostudies-literature

Epileptic Seizure Prediction Using Big Data and Deep Learning: Toward a Mobile System.
S-EPMC5828366 | biostudies-literature

Gene network inherent in genomic big data improves the accuracy of prognostic prediction for cancer patients.
S-EPMC5652797 | biostudies-literature

A new tool called DISSECT for analysing large genomic data sets using a Big Data approach.
S-EPMC4682108 | biostudies-literature

An Innovative Big Data Predictive Analytics Framework over Hybrid Big Data Sources with an Application for Disease Analytics
S-EPMC7123615 | biostudies-literature