Dataset Information

Predicting biomedical metadata in CEDAR: A study of Gene Expression Omnibus (GEO).

ABSTRACT: A crucial and limiting factor in data reuse is the lack of accurate, structured, and complete descriptions of data, known as metadata. Towards improving the quantity and quality of metadata, we propose a novel metadata prediction framework that learns associations from existing metadata and uses them to predict metadata values. We evaluate our framework in the context of experimental metadata from the Gene Expression Omnibus (GEO). We applied four rule mining algorithms to the most common structured metadata elements (sample type, molecular type, platform, label type, and organism) from over 1.3 million GEO records. We examined the quality of well-supported rules from each algorithm and visualized the dependencies among metadata elements. Finally, we evaluated the performance of the algorithms in terms of accuracy, precision, recall, and F-measure. We found that PART is the best algorithm, outperforming Apriori, Predictive Apriori, and Decision Table. All algorithms perform significantly better in predicting class values than the majority vote classifier. We found that the performance of the algorithms is related to the dimensionality of the GEO elements: the average performance of all algorithms increases as the number of unique values of these elements decreases (2697 platforms, 537 organisms, 454 labels, 9 molecules, and 5 types). Our work suggests that experimental metadata such as that present in GEO can be accurately predicted using rule mining algorithms. Our work has implications for both prospective and retrospective augmentation of metadata quality, which are geared towards making data easier to find and reuse.
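As a rough illustration of the ideas in the abstract, the sketch below scores a single association rule by its support and confidence and compares it with a majority vote baseline. It is not the authors' implementation and does not reproduce Apriori, Predictive Apriori, PART, or Decision Table; the records, field names, and the helper functions `rule_stats` and `majority_vote_accuracy` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy records; the values are illustrative, not drawn from GEO.
records = [
    {"organism": "Homo sapiens", "molecule": "total RNA", "label": "biotin"},
    {"organism": "Homo sapiens", "molecule": "total RNA", "label": "biotin"},
    {"organism": "Homo sapiens", "molecule": "genomic DNA", "label": "Cy3"},
    {"organism": "Mus musculus", "molecule": "total RNA", "label": "Cy5"},
    {"organism": "Mus musculus", "molecule": "total RNA", "label": "Cy5"},
    {"organism": "Mus musculus", "molecule": "total RNA", "label": "Cy5"},
]


def rule_stats(rows, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    # Rows whose fields match every key/value pair on the left-hand side.
    lhs = [r for r in rows if all(r[k] == v for k, v in antecedent.items())]
    # Of those, rows that also match the right-hand side.
    both = [r for r in lhs if all(r[k] == v for k, v in consequent.items())]
    support = len(both) / len(rows)
    confidence = len(both) / len(lhs) if lhs else 0.0
    return support, confidence


def majority_vote_accuracy(rows, target):
    """Accuracy obtained by always predicting the most frequent target value."""
    counts = Counter(r[target] for r in rows)
    return counts.most_common(1)[0][1] / len(rows)


support, confidence = rule_stats(
    records,
    antecedent={"organism": "Homo sapiens", "molecule": "total RNA"},
    consequent={"label": "biotin"},
)
baseline = majority_vote_accuracy(records, "label")

print(f"support={support:.3f} confidence={confidence:.3f} baseline={baseline:.3f}")
```

On this toy data the rule covers only a third of the records but is always correct when it applies, whereas the majority vote baseline is right half the time regardless of the other fields. A rule-based predictor is useful to the extent that its rules beat that baseline on the records they cover.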

SUBMITTER: Panahiazar M

PROVIDER: S-EPMC5643580 | biostudies-literature | 2017 Aug

REPOSITORIES: biostudies-literature


Publications

Predicting biomedical metadata in CEDAR: A study of Gene Expression Omnibus (GEO).

Maryam Panahiazar, Michel Dumontier, Olivier Gevaert

Journal of Biomedical Informatics, 2017 Jun 16

PMID: 28625880

Similar Datasets

Restructured GEO: restructuring Gene Expression Omnibus metadata for genome dynamics analysis. | S-EPMC6333964 | biostudies-literature

Mining microarray data at NCBI's Gene Expression Omnibus (GEO)*. | S-EPMC1619899 | biostudies-literature

GEOMetaCuration: a web-based application for accurate manual curation of Gene Expression Omnibus metadata. | S-EPMC5868185 | biostudies-literature

The systematic assessment of completeness of public metadata accompanying omics studies in the Gene Expression Omnibus. | S-EPMC12265520 | biostudies-literature

Identification of a novel iron zinc finger protein 36 (ZFP36) for predicting the overall survival of osteosarcoma based on the Gene Expression Omnibus (GEO) database. | S-EPMC8576698 | biostudies-literature

The systematic assessment of completeness of public metadata accompanying omics studies in the Gene Expression Omnibus data repository. | S-EPMC12421755 | biostudies-literature

An in silico analytical study of lung cancer and smokers datasets from gene expression omnibus (GEO) for prediction of differentially expressed genes. | S-EPMC4464538 | biostudies-literature

Prognostic values and prospective pathway signaling of MicroRNA-182 in ovarian cancer: a study based on gene expression omnibus (GEO) and bioinformatics analysis. | S-EPMC6839211 | biostudies-literature

Maximizing the reusability of gene expression data by predicting missing metadata. | S-EPMC7673503 | biostudies-literature

The Gene Expression Omnibus Database. | S-EPMC4944384 | biostudies-literature