Dataset Information

CaliciBoost: Performance-driven evaluation of molecular representations for Caco-2 permeability prediction.


ABSTRACT: Caco-2 permeability serves as a critical in vitro indicator for predicting the oral absorption of drug candidates during early-stage drug discovery. To enhance the accuracy and efficiency of computational predictions, we systematically investigated the impact of eight molecular feature representation types, including 2D/3D descriptors, structural fingerprints, and deep learning-based embeddings, combined with automated machine learning (AutoML) techniques to predict Caco-2 permeability. We evaluated model performance across these molecular representations using two datasets differing in scale and chemical diversity, namely the TDC benchmark and curated OCHEM data. Among the tested fingerprints and descriptors, PaDEL, Mordred, and RDKit emerged as particularly effective for predicting Caco-2 permeability. Notably, our model CaliciBoost, identified through training optimization, achieved the lowest MAE and secured the top position on the TDC Caco-2 Leaderboard. Furthermore, for both PaDEL and Mordred on the TDC data, incorporating 3D descriptors appears to improve performance over using 2D features alone, as supported by feature importance analyses. These findings highlight the effectiveness of automated machine learning approaches in ADMET modeling and offer practical guidance for feature selection in data-limited prediction tasks.

SCIENTIFIC CONTRIBUTION: This work provides a systematic benchmarking of eight molecular feature representation types in conjunction with AutoML for Caco-2 permeability prediction. It highlights the critical role of 3D descriptors in enhancing predictive accuracy and establishes a PaDEL-based AutoML model that achieves top-ranked performance on a public leaderboard. The study also emphasizes the value of interpretable feature selection (via SHAP and permutation importance), offering insights into feature contributions and generalizable modeling strategies for cheminformatics applications.
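
The abstract names MAE as the ranking metric and permutation importance as one of the feature-analysis tools. As a rough illustration of how those two pieces fit together, here is a minimal standard-library sketch. It is not the authors' CaliciBoost pipeline: the toy model `toy_predict`, the three synthetic features, and the helper names are all hypothetical, chosen only to show the mechanics (shuffle one feature column, measure how much the MAE gets worse).

```python
import random


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean increase in MAE when each feature column is shuffled.

    `predict` maps one feature row (a list of floats) to a prediction.
    Returns (baseline_mae, [importance per feature]).
    """
    rng = random.Random(seed)
    baseline = mean_absolute_error(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving every other feature intact.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            mae = mean_absolute_error(y, [predict(row) for row in X_perm])
            increases.append(mae - baseline)
        importances.append(sum(increases) / n_repeats)
    return baseline, importances


# Hypothetical stand-in for a fitted model: it uses features 0 and 1
# and ignores feature 2 entirely.
def toy_predict(row):
    return 2.0 * row[0] - 0.5 * row[1]


data_rng = random.Random(42)
X = [[data_rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
y = [toy_predict(row) for row in X]  # noise-free targets for clarity

baseline, importances = permutation_importance(toy_predict, X, y)
```

In this toy setup the ignored feature receives an importance of exactly zero, and the feature with the larger coefficient ranks highest. With a real descriptor matrix (for example PaDEL or Mordred output) the same comparison would typically be run on held-out data, and correlated descriptors can share or mask importance, so the ranking should be read with care.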

SUBMITTER: Le HV 

PROVIDER: S-EPMC12752011 | biostudies-literature | 2025 Dec

REPOSITORIES: biostudies-literature

Publications

CaliciBoost: Performance-driven evaluation of molecular representations for Caco-2 permeability prediction.

Le Huong Van, Ren Weibin, Kim Junhong, Yun Yukyung, Park Young Bin, Kim Young Jun, Han Bok Kyung, Choi Inho, Park Jong-Il, Yun Hwi-Yeol, Choi Jae-Mun

Journal of Cheminformatics, 2025-12-22, 1



Similar Datasets

| S-EPMC6727618 | biostudies-literature
| S-EPMC12339718 | biostudies-literature
| S-EPMC11651206 | biostudies-literature
| S-EPMC11966294 | biostudies-literature
| S-EPMC9569108 | biostudies-literature
| S-EPMC9058322 | biostudies-literature
| S-EPMC10829170 | biostudies-literature
| S-EPMC6368215 | biostudies-literature
| S-EPMC7495975 | biostudies-literature