Dataset Information

Machine Learning Classifiers for Twitter Surveillance of Vaping: Comparative Machine Learning Study.

ABSTRACT:

Background

Twitter presents a valuable and relevant social media platform to study the prevalence of information and sentiment on vaping that may be useful for public health surveillance. Machine learning classifiers that identify vaping-relevant tweets and characterize sentiments in them can underpin a Twitter-based vaping surveillance system. Compared with traditional machine learning classifiers that are reliant on annotations that are expensive to obtain, deep learning classifiers offer the advantage of requiring fewer annotated tweets by leveraging the large numbers of readily available unannotated tweets.

Objective

This study aims to derive and evaluate traditional and deep learning classifiers that can identify tweets relevant to vaping, tweets of a commercial nature, and tweets with provape sentiments.

Methods

We continuously collected tweets that matched vaping-related keywords over 2 months from August 2018 to October 2018. From this data set of tweets, a set of 4000 tweets was selected, and each tweet was manually annotated for relevance (vape relevant or not), commercial nature (commercial or not), and sentiment (provape or not). Using the annotated data, we derived traditional classifiers that included logistic regression, random forest, linear support vector machine, and multinomial naive Bayes. In addition, using the annotated data set and a larger unannotated data set of tweets, we derived deep learning classifiers that included a convolutional neural network (CNN), long short-term memory (LSTM) network, LSTM-CNN network, and bidirectional LSTM (BiLSTM) network. The unannotated tweet data were used to derive word vectors that deep learning classifiers can leverage to improve performance.
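As a rough illustration of one of the traditional classifiers named above, the following is a minimal sketch of a multinomial naive Bayes text classifier for the commercial/noncommercial task. It is not the authors' implementation: the whitespace tokenizer, the Laplace smoothing constant `alpha`, the class name `TinyMultinomialNB`, and the six toy tweets are all assumptions made up for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; a real pipeline would do more.
    return text.lower().split()


class TinyMultinomialNB:
    """Bag-of-words multinomial naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # hypothetical default smoothing constant

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.n_docs = len(labels)
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for c in self.classes:
            self.vocab |= set(self.word_counts[c])
        self.total_words = {c: sum(self.word_counts[c].values()) for c in self.classes}
        return self

    def log_scores(self, text):
        # log P(class) + sum of log P(word | class) over known words.
        scores = {}
        vocab_size = len(self.vocab)
        for c in self.classes:
            score = math.log(self.class_counts[c] / self.n_docs)
            denom = self.total_words[c] + self.alpha * vocab_size
            for tok in tokenize(text):
                if tok not in self.vocab:
                    continue  # words never seen in training are ignored
                score += math.log((self.word_counts[c][tok] + self.alpha) / denom)
            scores[c] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy, invented examples (1 = commercial, 0 = not commercial); not study data.
train_texts = [
    "new vape juice flavors on sale today",
    "buy one get one free vape mods",
    "discount code for vape pens free shipping",
    "my friend will not stop vaping in the car",
    "thinking about quitting vaping this year",
    "vaping at school got him in trouble",
]
train_labels = [1, 1, 1, 0, 0, 0]

model = TinyMultinomialNB().fit(train_texts, train_labels)
print(model.predict("free vape juice sale"))
print(model.predict("quitting vaping this year"))
```

The same `fit`/`predict` shape would apply to the relevance and sentiment tasks by swapping in the corresponding labels.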

Results

For relevance, LSTM-CNN performed the best, with the highest area under the receiver operating characteristic curve (AUC) of 0.96 (95% CI 0.93-0.98). For distinguishing commercial from noncommercial tweets, all deep learning classifiers, including LSTM-CNN, performed better than the traditional classifiers, with an AUC of 0.99 (95% CI 0.98-0.99). For provape sentiment, BiLSTM performed the best, with an AUC of 0.83 (95% CI 0.78-0.89). Overall, LSTM-CNN performed the best across all 3 classification tasks.
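To make the reported metric concrete, here is a minimal sketch of how an AUC and an accompanying 95% CI could be computed from classifier scores. The abstract does not say how the intervals were obtained; the percentile bootstrap used here is an assumption, as are the function names, the `n_boot` and `seed` parameters, and the toy labels and scores.

```python
import random


def auc(labels, scores):
    # AUC as the probability that a random positive outscores a random
    # negative, counting ties as one half (the Mann-Whitney formulation).
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=2000, level=0.95, seed=0):
    # Percentile bootstrap (an assumed method): resample tweets with
    # replacement, recompute AUC, and take the empirical quantiles.
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that contain only one class
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((1 - level) / 2 * n_boot)]
    upper = stats[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper


# Toy, invented labels and scores; not study data.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

point = auc(toy_labels, toy_scores)
ci_low, ci_high = bootstrap_auc_ci(toy_labels, toy_scores)
print(point, ci_low, ci_high)
```

With only eight toy examples the interval is very wide; on a held-out set of annotated tweets the same procedure would give a much tighter range.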

Conclusions

We derived and evaluated traditional machine learning and deep learning classifiers to identify vaping-related relevant, commercial, and provape tweets. Overall, deep learning classifiers such as LSTM-CNN had superior performance and had the added advantage of requiring no preprocessing. The performance of these classifiers supports the development of a vaping surveillance system.

SUBMITTER: Visweswaran S

PROVIDER: S-EPMC7450367 | biostudies-literature | 2020 Aug

REPOSITORIES: biostudies-literature

Publications

Machine Learning Classifiers for Twitter Surveillance of Vaping: Comparative Machine Learning Study.

Shyam Visweswaran, Jason B Colditz, Patrick O'Halloran, Na-Rae Han, Sanya B Taneja, Joel Welling, Kar-Hai Chu, Jaime E Sidani, Brian A Primack

Journal of Medical Internet Research, 2020 Aug 12, issue 8

PMID: 32784184

Similar Datasets

Automated Detection of Vaping-Related Tweets on Twitter During the 2019 EVALI Outbreak Using Machine Learning Classification.
S-EPMC8866955 | biostudies-literature

Machine Learning Classifiers for Endometriosis Using Transcriptomics and Methylomics Data [Transcriptomics]
2019-07-18 | GSE134056 | GEO

Machine Learning Classifiers for Endometriosis Using Transcriptomics and Methylomics Data [Methylomics]
2019-07-18 | GSE134052 | GEO

Applying GIS and Machine Learning Methods to Twitter Data for Multiscale Surveillance of Influenza.
S-EPMC4959719 | biostudies-literature

Topics and Sentiment Surrounding Vaping on Twitter and Reddit During the 2019 e-Cigarette and Vaping Use-Associated Lung Injury Outbreak: Comparative Study.
S-EPMC9795395 | biostudies-literature

A comparative analysis of machine learning classifiers for predicting protein-binding nucleotides in RNA sequences.
S-EPMC9249596 | biostudies-literature

Machine learning classifiers and fMRI: a tutorial overview.
S-EPMC2892746 | biostudies-literature

Regional level influenza study based on Twitter and machine learning method.
S-EPMC6478375 | biostudies-literature

Machine Learning in Sensory Analysis of Mead—A Case Study: Ensembles of Classifiers
S-EPMC12348089 | biostudies-literature

Generating automated kidney transplant biopsy reports combining molecular measurements with ensembles of machine learning classifiers
2019-11-07 | GSE124203 | GEO