Project description:Zeolites, nanoporous aluminosilicates with well-defined porous structures, are versatile materials with applications in catalysis, gas separation, and ion exchange. Hydrothermal synthesis is widely used for zeolite production, offering control over composition, crystallinity, and pore size. However, the intricate interplay of synthesis parameters necessitates a comprehensive understanding of synthesis-structure relationships to optimize the synthesis process. Hitherto, public zeolite synthesis databases only contain a subset of parameters and are small in scale, comprising up to a few thousand synthesis routes. We present ZeoSyn, a dataset of 23,961 zeolite hydrothermal synthesis routes, encompassing 233 zeolite topologies and 921 organic structure-directing agents (OSDAs). Each synthesis route comprises comprehensive synthesis parameters: 1) gel composition, 2) reaction conditions, 3) OSDAs, and 4) zeolite products. Using ZeoSyn, we develop a machine learning classifier to predict the resultant zeolite given a synthesis route with >70% accuracy. We employ SHapley Additive exPlanations (SHAP) to uncover key synthesis parameters for >200 zeolite frameworks. We introduce an aggregation approach to extend SHAP to all building units. We demonstrate applications of this approach to phase-selective and intergrowth synthesis. This comprehensive analysis illuminates the synthesis parameters pivotal in driving zeolite crystallization, offering the potential to guide the synthesis of desired zeolites. The dataset is available at https://github.com/eltonpan/zeosyn_dataset.
Project description:The present dataset ("dataset 3") is a subset of a large metastudy on AML classfication. It contains normalized gene expression values of 1181 samples. In total, three datasets were generated, each containing data of a different platforms: dataset 1 (Affymetrix HG-U133 A microarrays), dataset 2 (Affymetrix HG-U133 2.0 microarrays) and dataset 3 (RNA-seq). Dataset 3 was generated using the following strategy: All data sets published in the National Center for Biotechnology Information Gene Expression Omnibus (GEO) on 20 September 2017 were reviewed for inclusion in the present study. Basic criteria for inclusion were the cell type under study (human peripheral blood mononuclear cells (PMBCs) and/or bone marrow samples) as well as the species (Homo sapiens). Furthermore, GEO SuperSeries were excluded to avoid duplicated samples. We filtered the datasets for data generated with high-throughput RNA sequencing (RNA-seq) and excluded studies with very small sample sizes (< 10 samples). We then applied a disease-specific search, in which we filtered for acute myeloid leukemia, other leukemia and healthy or non-leukemia-related samples. The results of this search strategy were then internally reviewed and data were excluded based on the following criteria: (i) exclusion of duplicated samples, (ii) exclusion of studies that sorted single cell types (e.g. T cells or B cells) prior to gene expression profiling, (iii) exclusion of studies with inaccessible data. Other than that, no studies were excluded from our analysis. In total, the datasets contained samples from the following GSE Series: GSE63085, GSE32874, GSE58335, GSE86884, GSE63703, GSE63646, GSE63816, GSE72790, GSE81259, GSE85712, GSE45735, GSE64655, GSE87186, GSE49642, GSE52656, GSE62190, GSE66917, GSE67039, GSE61162, GSE67184, GSE49601, GSE78785, GSE79970. All raw data files were downloaded from GEO. Transcript abundances were calculated using kallisto version 0.43.0 and all data was normalized with the R package DESeq2 (R version R-3.2.4, DESeq2 version 1.12.4) with standard parameters. Genome build hg38 was used for read alignment. No filtering of low-expressed genes was performed.
Project description:What dataset features affect machine learning (ML) performance has primarily been unknown in the current literature. This study examines the impact of tabular datasets' different meta-level and statistical features on the performance of various ML algorithms. The three meta-level features this study considered are the dataset size, the number of attributes and the ratio between the positive (class 1) and negative (class 0) class instances. It considered four statistical features for each dataset: mean, standard deviation, skewness and kurtosis. After applying the required scaling, this study averaged (uniform and weighted) each dataset's different attributes to quantify its four statistical features. We analysed 200 open-access tabular datasets from the Kaggle (147) and UCI Machine Learning Repository (53) and developed ML classification models (through classification implementation and hyperparameter tuning) for each dataset. Then, this study developed multiple regression models to explore the impact of dataset features on ML performance. We found that kurtosis has a statistically significant negative effect on the accuracy of the three non-tree-based ML algorithms of the Support vector machine (SVM), Logistic regression (LR) and K-nearest neighbour (KNN) for their classical implementation with both uniform and weighted aggregations. This study observed similar findings in most cases for ML implementations through hyperparameter tuning, except for SVM with weighted aggregation. Meta-level and statistical features barely show any statistically significant impact on the accuracy of the two tree-based ML algorithms (Decision tree and Random forest), except for implementation through hyperparameter tuning for the weighted aggregation. When we excluded some datasets based on the imbalanced statistics and a significantly higher contribution of one attribute compared to others to the classification performance, we found a significant effect of the meta-level ratio feature and statistical mean and standard deviation features on SVM, LR and KNN accuracy in many cases. Our findings open a new research direction in understanding how dataset characteristics affect ML performance and will help researchers select appropriate ML algorithms for a possible optimal accuracy outcome.
Project description:We present a meta-dataset comprising of a total of 178 samples including both primary tumors and tumor-free pancreatic tissues from four independent GEO datasets. To minimise inter-platform variation, only datasets generated from the GPL570 platform (Affymetrix Human Genome U133 Plus 2.0 Array) were processed to develop the meta-dataset. Using multiple open source R packages implemented in our previously developed bioinformatics pipeline, each dataset has been preprocessed with RMA normalisation, merged, and batch effect-corrected via Combat method. With increased sample size, the present meta-dataset serves an excellent 'discovery cohort' for discovering differentially expressed in diseased phenotype.
Project description:We present a meta-dataset comprising of a total of 663 samples including both primary tumors and tumor-free ovarian tissues from ten independent GEO datasets. To minimise inter-platform variation, only datasets generated from the GPL570 platform (Affymetrix Human Genome U133 Plus 2.0 Array) were processed to develop the meta-dataset. Using multiple open source R packages implemented in our previously developed bioinformatics pipeline, each dataset has been preprocessed with RMA normalisation, merged, and batch effect-corrected via Combat method. With increased sample size, the present meta-dataset serves an excellent 'discovery cohort' for discovering differentially expressed in diseased phenotype.
Project description:We present a meta-dataset comprising of a total of 347 samples including both primary tumors and tumor-free renal tissues from six independent GEO datasets. To minimise inter-platform variation, only datasets generated from the GPL570 platform (Affymetrix Human Genome U133 Plus 2.0 Array) were processed to develop the meta-dataset. Using multiple open source R packages implemented in our previously developed bioinformatics pipeline, each dataset has been preprocessed with RMA normalisation, merged, and batch effect-corrected via Combat method. With increased sample size, the present meta-dataset serves an excellent 'discovery cohort' for discovering differentially expressed in diseased phenotype.
Project description:We present a meta-dataset comprising of a total of 237 samples including both primary tumors and tumor-free prostate tissues from six independent GEO datasets. To minimise inter-platform variation, only datasets generated from the GPL570 platform (Affymetrix Human Genome U133 Plus 2.0 Array) were processed to develop the meta-dataset. Using multiple open source R packages implemented in our previously developed bioinformatics pipeline, each dataset has been preprocessed with RMA normalisation, merged, and batch effect-corrected via Combat method. With increased sample size, the present meta-dataset serves an excellent 'discovery cohort' for discovering differentially expressed in diseased phenotype.
Project description:We present a meta-dataset comprising of a total of 212 samples including both primary tumors and tumor-free bladder tissues from four independent GEO datasets. To minimise inter-platform variation, only datasets generated from the GPL570 platform (Affymetrix Human Genome U133 Plus 2.0 Array) were processed to develop the meta-dataset. Using multiple open source R packages implemented in our previously developed bioinformatics pipeline, each dataset has been preprocessed with RMA normalisation, merged, and batch effect-corrected via Combat method. With increased sample size, the present meta-dataset serves an excellent 'discovery cohort' for discovering differentially expressed in diseased phenotype.
Project description:We present a meta-dataset comprising of a total of 214 samples including both primary tumors and tumor-free melanoma tissues from four independent GEO datasets. To minimise inter-platform variation, only datasets generated from the GPL570 platform (Affymetrix Human Genome U133 Plus 2.0 Array) were processed to develop the meta-dataset. Using multiple open source R packages implemented in our previously developed bioinformatics pipeline, each dataset has been preprocessed with RMA normalisation, merged, and batch effect-corrected via Combat method. With increased sample size, the present meta-dataset serves an excellent 'discovery cohort' for discovering differentially expressed in diseased phenotype.
Project description:We present a meta-dataset comprising of a total of 737 samples including both primary tumors and tumor-free gastric tissues from seven independent GEO datasets. To minimise inter-platform variation, only datasets generated from the GPL570 platform (Affymetrix Human Genome U133 Plus 2.0 Array) were processed to develop the meta-dataset. Using multiple open source R packages implemented in our previously developed bioinformatics pipeline, each dataset has been preprocessed with RMA normalisation, merged, and batch effect-corrected via Combat method. With increased sample size, the present meta-dataset serves an excellent 'discovery cohort' for discovering differentially expressed in diseased phenotype.