
Dataset Information


Comparison of Pretraining Models and Strategies for Health-Related Social Media Text Classification


ABSTRACT: Pretrained contextual language models proposed in the recent past have been reported to achieve state-of-the-art performance in many natural language processing (NLP) tasks, including those involving health-related social media data. We sought to evaluate the effectiveness of different pretrained transformer-based models for social media-based health-related text classification tasks. An additional objective was to explore and propose effective pretraining strategies to improve machine learning performance on such datasets and tasks. We benchmarked six transformer-based models that were pretrained with texts from different domains and sources—BERT, RoBERTa, BERTweet, TwitterBERT, BioClinical_BERT, and BioBERT—on 22 social media-based health-related text classification tasks. For the top-performing models, we explored the possibility of further boosting performance by comparing several pretraining strategies: domain-adaptive pretraining (DAPT), source-adaptive pretraining (SAPT), and a novel approach called topic-specific pretraining (TSPT). We also attempted to interpret the impacts of distinct pretraining strategies by visualizing document-level embeddings at different stages of the training process. RoBERTa outperformed BERTweet on most tasks, and both performed better than the other models. BERT, TwitterBERT, BioClinical_BERT, and BioBERT consistently underperformed. Among the pretraining strategies, SAPT performed better than or comparably to the off-the-shelf models, and significantly outperformed DAPT. SAPT + TSPT showed consistently high performance, with statistically significant improvements in three tasks. Our findings demonstrate that RoBERTa and BERTweet are excellent off-the-shelf models for health-related social media text classification, and that extended pretraining using SAPT and TSPT can further improve performance.

SUBMITTER: Guo Y 

PROVIDER: S-EPMC9408372 | biostudies-literature | 2022 Aug

REPOSITORIES: biostudies-literature

Similar Datasets

| S-EPMC11413434 | biostudies-literature
| S-EPMC7835447 | biostudies-literature
| S-EPMC8627225 | biostudies-literature
| S-EPMC8980624 | biostudies-literature
| S-EPMC7551727 | biostudies-literature
| S-EPMC6188524 | biostudies-literature
| S-EPMC10338115 | biostudies-literature
| S-EPMC11850375 | biostudies-literature
| S-EPMC10746297 | biostudies-literature
| S-EPMC10168403 | biostudies-literature