
Dataset Information


Gene Selection with Sequential Classification and Regression Tree Algorithm.


ABSTRACT:

BACKGROUND: In the typical setting of gene-selection problems from high-dimensional data, e.g., gene expression data from microarray or next-generation sequencing-based technologies, an enormous volume of high-throughput data is generated. There is often a need for a simple, computationally inexpensive, non-parametric screening procedure that can quickly and accurately find a low-dimensional variable subset that preserves biological information from the original very high-dimensional data (dimension p > 40,000). This is in contrast to highly sophisticated variable selection methods, which are computationally expensive, need pre-processing routines, and often require calibration of priors.

RESULTS: We present a tree-based sequential CART (S-CART) approach to variable selection in the binary classification setting and compare it against the more sophisticated procedures using simulated and real biological data. In simulated data, we analyze S-CART performance versus (i) a random forest (RF), (ii) a fully parametric Bayesian stochastic search variable selection (SSVS), and (iii) the moderated t-test statistic from the LIMMA package in R. The simulation study is based on a hierarchical Bayesian model in which dataset dimensionality, percentage of significant variables, and substructure via dependency vary. Selection efficacy is measured through false-discovery and missed-discovery rates. In all scenarios, the S-CART method is seen to consistently outperform SSVS and RF in both speed and detection accuracy. We demonstrate the utility of the S-CART technique both on simulated data and in a control-treatment mouse study. We show that the network analysis based on the S-CART-selected gene subset in essence recapitulates the biological findings of the study using only a fraction of the original set of genes considered in the study's analysis.

CONCLUSIONS: Relatively simple gene selection algorithms such as S-CART may often be preferable in practice to much more sophisticated ones. The advantage of the "greedy" selection methods used by S-CART and similar algorithms is that they scale well with the problem size and require virtually no tuning or training, while remaining efficient in extracting the relevant information from microarray-like datasets containing a large number of redundant or irrelevant variables.

AVAILABILITY: The MATLAB 7.4b code for the S-CART implementation is available for download from https://neyman.mcg.edu/posts/scart.zip.
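
ILLUSTRATIVE SKETCH: The abstract does not spell out the S-CART steps or the exact definitions of its two error rates, so the following Python sketch is only one plausible reading and not the authors' MATLAB implementation. It assumes that "sequential" greedy selection can be approximated by repeatedly fitting a depth-one tree (a stump) with the Gini criterion, recording its split variable, and removing that variable before refitting. It also assumes that the false-discovery rate is the fraction of selected variables that are not truly relevant and that the missed-discovery rate is the fraction of truly relevant variables that were not selected. All function names and the toy data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_split_gain(values, labels):
    """Largest Gini impurity reduction over all thresholds on one variable."""
    n = len(labels)
    parent = gini(labels)
    pairs = sorted(zip(values, labels))
    best = 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best


def sequential_stump_selection(X, y, max_vars, min_gain=0.0):
    """Greedy selection (assumed scheme): fit a stump, keep its split
    variable, drop that variable from the pool, and refit.

    With stumps this is equivalent to ranking variables by split gain;
    deeper trees, as in full CART, would generally behave differently.
    """
    remaining = list(range(len(X[0])))
    selected = []
    while remaining and len(selected) < max_vars:
        gains = {j: best_split_gain([row[j] for row in X], y) for j in remaining}
        j_best = max(remaining, key=lambda j: gains[j])
        if gains[j_best] <= min_gain:
            break  # no remaining variable improves class purity enough
        selected.append(j_best)
        remaining.remove(j_best)
    return selected


def discovery_rates(selected, truly_relevant):
    """Return (false-discovery rate, missed-discovery rate) under the
    definitions assumed in the lead-in paragraph."""
    selected, truly_relevant = set(selected), set(truly_relevant)
    false_pos = len(selected - truly_relevant)
    missed = len(truly_relevant - selected)
    fdr = false_pos / len(selected) if selected else 0.0
    mdr = missed / len(truly_relevant) if truly_relevant else 0.0
    return fdr, mdr


# Hypothetical toy data: 4 samples, 2 variables; only variable 0 tracks the class.
X_toy = [[0, 5], [1, 3], [10, 4], [11, 6]]
y_toy = [0, 0, 1, 1]
chosen = sequential_stump_selection(X_toy, y_toy, max_vars=1)
print(chosen, discovery_rates(chosen, truly_relevant=[0]))
```

In a simulation where the truly relevant variables are known by construction, `discovery_rates` can be applied to the output of any selector, which is how a comparison across methods of the kind described in RESULTS could be scored under these assumed definitions.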

SUBMITTER: Bastian CD 

PROVIDER: S-EPMC4214923 | biostudies-literature | 2011 Aug

REPOSITORIES: biostudies-literature


Publications

Gene Selection with Sequential Classification and Regression Tree Algorithm.

Caleb D. Bastian, Grzegorz A. Rempala

Biostatistics, bioinformatics and biomathematics, 2011-08-01, 4



Similar Datasets

| S-EPMC3110013 | biostudies-literature
| S-EPMC4908120 | biostudies-literature
| S-EPMC5416823 | biostudies-literature
| S-EPMC5879502 | biostudies-literature
| S-EPMC3718705 | biostudies-literature
| S-EPMC9270253 | biostudies-literature