Dataset Information

Exploiting the noise: improving biomarkers with ensembles of data analysis methodologies.

ABSTRACT:

Background

The advent of personalized medicine requires robust, reproducible biomarkers that indicate which treatment will maximize therapeutic benefit while minimizing side effects and costs. Numerous molecular signatures have been developed over the past decade to fill this need, but their validation and uptake into clinical settings have been poor. Here, we investigate the technical reasons underlying reported failures in biomarker validation for non-small cell lung cancer (NSCLC).

Methods

We evaluated two published prognostic multi-gene biomarkers for NSCLC in an independent 442-patient dataset. We then systematically assessed how technical factors influenced validation success.

Results

Both biomarkers validated successfully (biomarker #1: hazard ratio (HR) 1.63, 95% confidence interval (CI) 1.21 to 2.19, P = 0.001; biomarker #2: HR 1.42, 95% CI 1.03 to 1.96, P = 0.030). Further, despite the cohort being underpowered for stage-specific analyses, both biomarkers successfully stratified stage II patients, and biomarker #1 also stratified stage IB patients. We then systematically evaluated reasons for reported validation failures and found that they can be directly attributed to technical challenges in data analysis. By examining 24 separate pre-processing techniques, we show that minor alterations in pre-processing can change a successful prognostic biomarker (HR 1.85, 95% CI 1.37 to 2.50, P < 0.001) into one indistinguishable from random chance (HR 1.15, 95% CI 0.86 to 1.54, P = 0.348). Finally, we develop a new method, based on ensembles of analysis methodologies, to exploit this technical variability to improve biomarker robustness and to provide an independent confidence metric.
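The abstract does not spell out how the ensemble is built, so the following is only a minimal sketch of one plausible reading: each pre-processing pipeline yields its own risk call for a patient, the calls are combined by majority vote, and the fraction of pipelines that agree serves as a confidence score. The function name `ensemble_call`, the labels "good" and "poor", and the example patients are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Dict, List, Tuple


def ensemble_call(calls: List[str]) -> Tuple[str, float]:
    """Combine per-pipeline risk calls for one patient by majority vote.

    Returns the winning label and the fraction of pipelines that voted
    for it. A tie for first place is reported as "ambiguous".
    This is an illustrative assumption, not the authors' algorithm.
    """
    if not calls:
        raise ValueError("at least one pipeline call is required")
    ranked = Counter(calls).most_common()
    top_label, top_votes = ranked[0]
    agreement = top_votes / len(calls)
    if len(ranked) > 1 and ranked[1][1] == top_votes:
        return "ambiguous", agreement
    return top_label, agreement


# Hypothetical calls from 24 pre-processing pipelines for three patients.
patients: Dict[str, List[str]] = {
    "patient_A": ["poor"] * 24,                 # unanimous
    "patient_B": ["poor"] * 13 + ["good"] * 11,  # narrow majority
    "patient_C": ["good"] * 12 + ["poor"] * 12,  # tie
}

results = {pid: ensemble_call(calls) for pid, calls in patients.items()}
for pid, (label, agreement) in results.items():
    print(f"{pid}: {label} (agreement {agreement:.2f})")
```

Under this reading, a patient whose call is unanimous across pipelines is insensitive to the pre-processing choice, while a narrow majority flags a classification that depends on it.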

Conclusions

Biomarkers are a fundamental component of personalized medicine. We first validated two NSCLC prognostic biomarkers in an independent patient cohort. Power analyses demonstrate that even this large, 442-patient cohort is underpowered for stage-specific analyses. We then use these results to discover an unexpected sensitivity of validation to subtle data analysis decisions. Finally, we develop a novel algorithmic approach to exploit this sensitivity to improve biomarker robustness.
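To give a sense of why subgroup analyses run short of power, the sketch below uses Schoenfeld's approximation for the number of events needed by a two-group survival comparison. The abstract does not say how its power analyses were performed, so the formula choice, the two-sided alpha of 0.05, the 80% power, and the equal split between risk groups are all assumptions made for illustration. The result counts events (deaths), not patients, so the required cohort is larger still.

```python
import math
from statistics import NormalDist


def events_required(hazard_ratio: float,
                    alpha: float = 0.05,
                    power: float = 0.80,
                    fraction_high_risk: float = 0.5) -> int:
    """Approximate number of events needed to detect a given hazard ratio.

    Schoenfeld's approximation:
        d = (z_{1-alpha/2} + z_{power})^2 / (p * (1 - p) * ln(HR)^2)
    where p is the fraction of patients in the high-risk group.
    All default parameter values are illustrative assumptions.
    """
    if hazard_ratio <= 0 or hazard_ratio == 1:
        raise ValueError("hazard_ratio must be positive and different from 1")
    if not 0 < fraction_high_risk < 1:
        raise ValueError("fraction_high_risk must lie strictly between 0 and 1")
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    p = fraction_high_risk
    events = (z_alpha + z_power) ** 2 / (p * (1 - p) * math.log(hazard_ratio) ** 2)
    return math.ceil(events)


# Hazard ratios reported for the two biomarkers in the full cohort.
for hr in (1.63, 1.42):
    print(f"HR {hr}: about {events_required(hr)} events needed")
```

Because the required number of events grows as the hazard ratio approaches 1, a stage-specific subgroup that contains only a fraction of the cohort's events can fall below the threshold even when the full cohort does not.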

SUBMITTER: Starmans MH

PROVIDER: S-EPMC3580418 | biostudies-literature | 2012

REPOSITORIES: biostudies-literature

Publications

Exploiting the noise: improving biomarkers with ensembles of data analysis methodologies.

Maud H. W. Starmans, Melania Pintilie, Thomas John, Sandy D. Der, Frances A. Shepherd, Igor Jurisica, Philippe Lambin, Ming-Sound Tsao, Paul C. Boutros

Genome Medicine, 12 November 2012, issue 11

PMID: 23146350

Similar Datasets

Project description:

BACKGROUND: Single-cell RNA-sequencing (scRNA-seq) is a transformative technology, allowing global transcriptomes of individual cells to be profiled with high accuracy. An essential task in scRNA-seq data analysis is the identification of cell types from complex samples or tissues profiled in an experiment. To this end, clustering has become a key computational technique for grouping cells based on their transcriptome profiles, enabling subsequent cell type identification from each cluster of cells. Due to the high feature-dimensionality of the transcriptome (i.e. the large number of measured genes in each cell) and because only a small fraction of genes are cell type-specific and therefore informative for generating cell type-specific clusters, clustering directly on the original feature/gene dimension may lead to uninformative clusters and hinder correct cell type identification.

RESULTS: Here, we propose an autoencoder-based cluster ensemble framework in which we first take random subspace projections from the data, then compress each random projection to a low-dimensional space using an autoencoder artificial neural network, and finally apply ensemble clustering across all encoded datasets to generate clusters of cells. We employ four evaluation metrics to benchmark clustering performance and our experiments demonstrate that the proposed autoencoder-based cluster ensemble can lead to substantially improved cell type-specific clusters when applied with both the standard k-means clustering algorithm and a state-of-the-art kernel-based clustering algorithm (SIMLR) designed specifically for scRNA-seq data. Compared to directly using these clustering algorithms on the original datasets, the performance improvement in some cases is up to 100%, depending on the evaluation metric used.

CONCLUSIONS: Our results suggest that the proposed framework can facilitate more accurate cell type identification as well as other downstream analyses. The code for creating the proposed autoencoder-based cluster ensemble framework is freely available from https://github.com/gedcom/scCCESS.

Project description:

BACKGROUND: Exposure to certain pesticides has been associated with several chronic diseases. However, to determine the role of pesticides in the causation of such diseases, an assessment of historical exposures is required. Exposure measurement data are rarely available; therefore, assessment of historical exposures is frequently based on surrogate self-reported information, which has inherent limitations. Understanding the performance of the applied surrogate measures in the exposure assessment of pesticides is therefore important to allow proper evaluation of the risks.

OBJECTIVE: The Improving Exposure Assessment Methodologies for Epidemiological Studies on Pesticides (IMPRESS) project aims to assess the reliability and external validity of the surrogate measures used to assign exposure within individuals or groups of individuals, which are frequently based on self-reported data on exposure determinants. IMPRESS will also evaluate the size of recall bias on the misclassification of exposure to pesticides; this in turn will affect epidemiological estimates of the effect of pesticides on human health.

METHODS: The IMPRESS project will recruit existing cohort participants from previous and ongoing research studies primarily of epidemiological origin from Malaysia, Uganda, and the United Kingdom. Consenting participants of each cohort will be reinterviewed using an amended version of the original questionnaire addressing pesticide use characteristics administered to that cohort. The format and relevant questions will be retained but some extraneous questions from the original (e.g., relating to health) will be excluded for ethical and practical reasons. The reliability of pesticide exposure recall over different time periods (<2 years, 6-12 years, and >15 years) will then be evaluated. Where the original cohort study is still ongoing, participants will also be asked if they wish to take part in a new exposure biomonitoring survey, which involves them providing urine samples for pesticide metabolite analysis and completing questionnaire information regarding their work activities at the time of sampling. The participant's level of exposure to pesticides will be determined by analyzing the collected urine samples for selected pesticide metabolites. The biomonitoring measurement results will be used to assess the performance of algorithm-based exposure assessment methods used in epidemiological studies to estimate individual exposures during application and re-entry work.

RESULTS: The project was funded in September 2017. Enrollment and sample collection were completed for Malaysia in 2019 and are ongoing for Uganda and the United Kingdom. Sample and data analysis will proceed in 2020 and the first results are expected to be submitted for publication in 2021.

CONCLUSIONS: The study will evaluate the consistency of questionnaire data and accuracy of current algorithms in assessing pesticide exposures. It will indicate where amendments can be made to better capture exposure data for future epidemiology studies and thus improve the reliability of exposure-disease associations.

INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): PRR1-10.2196/16448.

OmicsDI is part of the ELIXIR infrastructure
