Dataset Information

Learning Meta-Learning (LML) dataset: Survey data of meta-learning parameters

ABSTRACT:

SUBMITTER:

PROVIDER: S-EPMC10694062 | biostudies-literature | 2023 Nov

REPOSITORIES: biostudies-literature

Similar Datasets

Project description: Which dataset features affect machine learning (ML) performance has largely been unknown in the current literature. This study examines the impact of different meta-level and statistical features of tabular datasets on the performance of various ML algorithms. The three meta-level features this study considered are the dataset size, the number of attributes and the ratio between the positive (class 1) and negative (class 0) class instances. It considered four statistical features for each dataset: mean, standard deviation, skewness and kurtosis. After applying the required scaling, this study averaged each statistic over the dataset's attributes (with both uniform and weighted averaging) to quantify the four statistical features. We analysed 200 open-access tabular datasets from Kaggle (147) and the UCI Machine Learning Repository (53) and developed ML classification models (through classical implementation and hyperparameter tuning) for each dataset. Then, this study developed multiple regression models to explore the impact of dataset features on ML performance. We found that kurtosis has a statistically significant negative effect on the accuracy of the three non-tree-based ML algorithms, support vector machine (SVM), logistic regression (LR) and K-nearest neighbour (KNN), in their classical implementation with both uniform and weighted aggregations. This study observed similar findings in most cases for ML implementations through hyperparameter tuning, except for SVM with weighted aggregation. Meta-level and statistical features show almost no statistically significant impact on the accuracy of the two tree-based ML algorithms (decision tree and random forest), except for the implementation through hyperparameter tuning with weighted aggregation. When we excluded some datasets because of class imbalance or because one attribute contributed significantly more to the classification performance than the others, we found a significant effect of the meta-level ratio feature and the statistical mean and standard deviation features on SVM, LR and KNN accuracy in many cases. Our findings open a new research direction in understanding how dataset characteristics affect ML performance and will help researchers select appropriate ML algorithms for a possible optimal accuracy outcome.
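The feature extraction described above can be sketched roughly as follows. This is a minimal illustration, not the study's code: the description does not say which scaling was used, how the weights for the weighted aggregation were chosen, or whether kurtosis is reported as excess kurtosis, so min-max scaling, caller-supplied weights and excess kurtosis are all assumptions here, and every function name is hypothetical.

```python
import math


def min_max_scale(column):
    """Scale one attribute to [0, 1] (assumed scaling; the source only says 'required scaling')."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def column_moments(column):
    """Return (mean, std, skewness, excess kurtosis) using population moments."""
    n = len(column)
    mu = sum(column) / n
    dev = [x - mu for x in column]
    m2 = sum(d ** 2 for d in dev) / n
    if m2 == 0:
        # A constant attribute has no spread; skewness and kurtosis are set to 0 by convention here.
        return mu, 0.0, 0.0, 0.0
    m3 = sum(d ** 3 for d in dev) / n
    m4 = sum(d ** 4 for d in dev) / n
    return mu, math.sqrt(m2), m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def statistical_features(columns, weights=None):
    """Average the per-attribute statistics; weights=None gives the uniform aggregation."""
    if weights is None:
        weights = [1.0] * len(columns)
    total = sum(weights)
    stats = [column_moments(min_max_scale(c)) for c in columns]
    names = ("mean", "std", "skewness", "kurtosis")
    return {
        name: sum(w * s[i] for w, s in zip(weights, stats)) / total
        for i, name in enumerate(names)
    }


def meta_level_features(columns, labels):
    """Dataset size, number of attributes, and positive-to-negative class ratio."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    return {
        "size": len(labels),
        "n_attributes": len(columns),
        "ratio": positives / negatives if negatives else float("inf"),
    }


# Toy data, invented purely for illustration.
toy_columns = [[0, 1, 2, 3, 4], [10, 10, 20, 30, 30]]
toy_labels = [1, 0, 0, 1, 0]
toy_stats = statistical_features(toy_columns)
toy_meta = meta_level_features(toy_columns, toy_labels)
```

The resulting per-dataset feature dictionaries would then serve as predictors in a regression on each algorithm's accuracy, which is the step the description refers to as developing multiple regression models.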

Project description: The present dataset ("dataset 3") is a subset of a large metastudy on AML classification. It contains normalized gene expression values of 1181 samples. In total, three datasets were generated, each containing data from a different platform: dataset 1 (Affymetrix HG-U133 A microarrays), dataset 2 (Affymetrix HG-U133 2.0 microarrays) and dataset 3 (RNA-seq). Dataset 3 was generated using the following strategy: All data sets published in the National Center for Biotechnology Information Gene Expression Omnibus (GEO) on 20 September 2017 were reviewed for inclusion in the present study. Basic criteria for inclusion were the cell type under study (human peripheral blood mononuclear cells (PBMCs) and/or bone marrow samples) as well as the species (Homo sapiens). Furthermore, GEO SuperSeries were excluded to avoid duplicated samples. We filtered the datasets for data generated with high-throughput RNA sequencing (RNA-seq) and excluded studies with very small sample sizes (< 10 samples). We then applied a disease-specific search, in which we filtered for acute myeloid leukemia, other leukemia and healthy or non-leukemia-related samples. The results of this search strategy were then internally reviewed and data were excluded based on the following criteria: (i) exclusion of duplicated samples, (ii) exclusion of studies that sorted single cell types (e.g. T cells or B cells) prior to gene expression profiling, (iii) exclusion of studies with inaccessible data. Other than that, no studies were excluded from our analysis. In total, the datasets contained samples from the following GSE series: GSE63085, GSE32874, GSE58335, GSE86884, GSE63703, GSE63646, GSE63816, GSE72790, GSE81259, GSE85712, GSE45735, GSE64655, GSE87186, GSE49642, GSE52656, GSE62190, GSE66917, GSE67039, GSE61162, GSE67184, GSE49601, GSE78785, GSE79970. All raw data files were downloaded from GEO. Transcript abundances were calculated using kallisto version 0.43.0 and all data were normalized with the R package DESeq2 (R version R-3.2.4, DESeq2 version 1.12.4) with standard parameters. Genome build hg38 was used for read alignment. No filtering of low-expressed genes was performed.
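The study-level inclusion and exclusion criteria listed above could be expressed as a simple filter, sketched below. The record layout, field names and example entries are hypothetical and do not correspond to GEO's actual metadata schema or to any real series; the sample-level removal of duplicates (criterion i) is not modelled.

```python
from dataclasses import dataclass


@dataclass
class SeriesRecord:
    """Hypothetical summary of one GEO series; not GEO's real schema."""
    accession: str
    species: str
    tissue: str               # e.g. "PBMC" or "bone marrow"
    technology: str           # e.g. "RNA-seq" or "microarray"
    n_samples: int
    is_superseries: bool
    sorted_single_cell_type: bool
    data_accessible: bool


ALLOWED_TISSUES = {"PBMC", "bone marrow"}
MIN_SAMPLES = 10  # studies with fewer than 10 samples were excluded


def is_eligible(rec):
    """Apply the study-level criteria described in the text."""
    return (
        rec.species == "Homo sapiens"
        and rec.tissue in ALLOWED_TISSUES
        and rec.technology == "RNA-seq"
        and rec.n_samples >= MIN_SAMPLES
        and not rec.is_superseries
        and not rec.sorted_single_cell_type
        and rec.data_accessible
    )


# Invented records for illustration only.
example_records = [
    SeriesRecord("EXAMPLE_A", "Homo sapiens", "PBMC", "RNA-seq", 24, False, False, True),
    SeriesRecord("EXAMPLE_B", "Homo sapiens", "bone marrow", "RNA-seq", 8, False, False, True),
    SeriesRecord("EXAMPLE_C", "Homo sapiens", "PBMC", "RNA-seq", 40, True, False, True),
    SeriesRecord("EXAMPLE_D", "Mus musculus", "bone marrow", "RNA-seq", 30, False, False, True),
    SeriesRecord("EXAMPLE_E", "Homo sapiens", "PBMC", "RNA-seq", 50, False, True, True),
]
eligible = [rec.accession for rec in example_records if is_eligible(rec)]
```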
