A practical approach for content mining of Tweets.
ABSTRACT: Use of data generated through social media for health studies is gradually increasing. Twitter is a short-text message system developed 6 years ago, now with more than 100 million users generating over 300 million Tweets every day. Twitter may be used to gain real-world insights to promote healthy behaviors. The purposes of this paper are to describe a practical approach to analyzing Tweet contents and to illustrate an application of the approach to the topic of physical activity. The approach includes five steps: (1) selecting keywords to gather an initial set of Tweets to analyze; (2) importing data; (3) preparing data; (4) analyzing data (topic, sentiment, and ecologic context); and (5) interpreting data. The steps are implemented using tools that are publicly available, free of charge, and designed for use by researchers with limited programming skills. Content mining of Tweets can contribute to addressing challenges in health behavior research.
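The first steps of the approach described above (keyword selection, data preparation, and a simple first-pass analysis) can be sketched in a few lines of Python. This is only an illustrative sketch, not the tooling used in the paper: the keyword set, the cleaning rules, and the function names `prepare`, `select`, and `term_counts` are all assumptions made for the example, and plain term frequency stands in for the topic analysis.

```python
import re
from collections import Counter

# Hypothetical physical-activity keywords (step 1); not taken from the paper.
KEYWORDS = {"running", "gym", "workout"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"[a-z']+")


def prepare(text):
    """Step 3: lowercase, drop URLs and @mentions, split into word tokens."""
    text = URL_RE.sub(" ", text.lower())
    text = MENTION_RE.sub(" ", text)
    return TOKEN_RE.findall(text)


def select(tweets, keywords):
    """Steps 1-2: keep only tweets containing at least one keyword."""
    return [t for t in tweets if keywords & set(prepare(t))]


def term_counts(tweets):
    """Step 4 (simplified): term frequencies as a crude stand-in for topics."""
    counts = Counter()
    for t in tweets:
        counts.update(prepare(t))
    return counts
```

Calling `term_counts(select(tweets, KEYWORDS)).most_common(10)` on a list of tweet strings would give a rough first view of what the selected Tweets talk about; interpretation (step 5) remains a manual task.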
Project description:Two code files and one dataset related to Olympic Twitter activity are the foundation for this article. Through Twitter's Spritzer streaming API (Application Programming Interface), we collected over 430 million tweets from May 12th, 2016 to September 12th, 2016, a period bracketing the Rio de Janeiro Olympics and Paralympics. We cleaned and filtered these tweets to contain Olympic-related content. We then analyzed the raw data of 21,218,652 tweets, including location data, language, and tweet content, to distill the sentiment and emotions of Twitter users pertaining to the Olympic Games (Kassens-Noor E. et al., 2019). We generalized the original data set to comply with Twitter's Terms of Service and Developer Agreement (2018). We present the modified dataset and accompanying code files in this article and suggest using both for further analysis of sentiment and emotions related to the Rio de Janeiro Olympics and for comparative research on imagery and perceptions of other Olympic Games.
Project description:BACKGROUND:With restrictions on movement and stay-at-home orders in place due to the COVID-19 pandemic, social media platforms such as Twitter have become an outlet for users to express their concerns, opinions, and feelings about the pandemic. Individuals, health agencies, and governments are using Twitter to communicate about COVID-19. OBJECTIVE:The aims of this study were to examine key themes and topics of English-language COVID-19-related tweets posted by individuals and to explore the trends and variations in how the COVID-19-related tweets, key topics, and associated sentiments changed over a period of time from before to after the disease was declared a pandemic. METHODS:Building on the emergent stream of studies examining COVID-19-related tweets in English, we performed a temporal assessment covering the time period from January 1 to May 9, 2020, and examined variations in tweet topics and sentiment scores to uncover key trends. Combining data from two publicly available COVID-19 tweet data sets with those obtained in our own search, we compiled a data set of 13.9 million English-language COVID-19-related tweets posted by individuals. We used guided latent Dirichlet allocation (LDA) to infer themes and topics underlying the tweets, and we used VADER (Valence Aware Dictionary and sEntiment Reasoner) sentiment analysis to compute sentiment scores and examine weekly trends for 17 weeks. RESULTS:Topic modeling yielded 26 topics, which were grouped into 10 broader themes underlying the COVID-19-related tweets. Of the 13,937,906 examined tweets, 2,858,316 (20.51%) were about the impact of COVID-19 on the economy and markets, followed by spread and growth in cases (2,154,065, 15.45%), treatment and recovery (1,831,339, 13.14%), impact on the health care sector (1,588,499, 11.40%), and government response (1,559,591, 11.19%).
Average compound sentiment scores were found to be negative throughout the examined time period for the topics of spread and growth of cases, symptoms, racism, source of the outbreak, and political impact of COVID-19. In contrast, we saw a reversal of sentiments from negative to positive for prevention, impact on the economy and markets, government response, impact on the health care industry, and treatment and recovery. CONCLUSIONS:Identification of dominant themes, topics, sentiments, and changing trends in tweets about the COVID-19 pandemic can help governments, health care agencies, and policy makers frame appropriate responses to prevent and control the spread of the pandemic.
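The weekly sentiment trends described above can be illustrated with a small aggregation sketch. It assumes that each tweet already carries a topic label and a compound sentiment score in [-1, 1] (for example from VADER); the record layout and the function names `weekly_mean_compound` and `reversed_to_positive` are hypothetical and are not taken from the study.

```python
from collections import defaultdict
from datetime import date


def weekly_mean_compound(records, start):
    """Average compound scores per (topic, week index).

    records: iterable of (day, topic, score) with day a datetime.date and
    score a compound sentiment value in [-1, 1]. Week 0 begins on `start`.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for day, topic, score in records:
        week = (day - start).days // 7
        cell = sums[(topic, week)]
        cell[0] += score
        cell[1] += 1
    return {key: total / n for key, (total, n) in sums.items()}


def reversed_to_positive(weekly, topic):
    """True if the topic's first weekly mean is negative and its last is positive."""
    keys = sorted(k for k in weekly if k[0] == topic)
    series = [weekly[k] for k in keys]
    return bool(series) and series[0] < 0 < series[-1]
```

Comparing only the first and last week is a deliberately crude test of a "reversal"; a fuller analysis would inspect the whole weekly series for each topic.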
Project description:There is a growing recognition of social media data as being useful for understanding local area patterns. In this study, we sought to utilize geotagged tweets, specifically the frequency and type of food mentions, to understand the neighborhood food environment and the social modeling of food behavior. Additionally, we examined associations between aggregated food-related tweet characteristics and prevalent chronic health outcomes at the census tract level. We used a Twitter streaming application programming interface (API) to continuously collect a ~1% random sample of public tweets in the United States. A total of 4,785,104 geotagged food tweets from 71,844 census tracts were collected from April 2015 to May 2018. We obtained census tract chronic disease outcomes from the CDC 500 Cities Project. We investigated associations between Twitter-derived food variables and chronic outcomes (obesity, diabetes, and high blood pressure) using median regression. Census tracts with higher average calories per tweet, less frequent healthy food mentions, and a higher percentage of food tweets about fast food had higher obesity and hypertension prevalence. Twitter-derived food variables were not predictive of diabetes prevalence. Food-related tweets can be leveraged to help characterize the neighborhood social and food environment, which in turn are linked with community levels of obesity and hypertension.
Project description:Surveys are popular methods to measure public perceptions in emergencies but can be costly and time consuming. We suggest and evaluate a complementary "infoveillance" approach using Twitter during the 2009 H1N1 pandemic. Our study aimed to: 1) monitor the use of the terms "H1N1" versus "swine flu" over time; 2) conduct a content analysis of "tweets"; and 3) validate Twitter as a real-time content, sentiment, and public attention trend-tracking tool. Between May 1 and December 31, 2009, we archived over 2 million Twitter posts containing keywords "swine flu," "swineflu," and/or "H1N1" using Infovigil, an infoveillance system. Tweets using "H1N1" increased from 8.8% to 40.5% (R² = .788; p < .001), indicating a gradual adoption of World Health Organization-recommended terminology. A total of 5,395 tweets were randomly selected from 9 days, 4 weeks apart, and coded using a tri-axial coding scheme. To track tweet content and to test the feasibility of automated coding, we created database queries for keywords and correlated these results with manual coding. Content analysis indicated resource-related posts were most commonly shared (52.6%). 4.5% of cases were identified as misinformation. News websites were the most popular sources (23.2%), while government and health agencies were linked only 1.5% of the time. 7/10 automated queries correlated with manual coding. Several Twitter activity peaks coincided with major news stories. Our results correlated well with H1N1 incidence data. This study illustrates the potential of using social media to conduct "infodemiology" studies for public health. 2009 H1N1-related tweets were primarily used to disseminate information from credible sources, but were also a source of opinions and experiences. Tweets can be used for real-time content analysis and knowledge translation research, allowing health authorities to respond to public concerns.
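The two quantitative checks mentioned above, the share of tweets matching a keyword and the correlation between automated keyword queries and manual coding, can be sketched as follows. This is a hedged illustration only: the study used database queries inside Infovigil, whereas the regular-expression matching and the function names `share_matching` and `pearson_r` here are assumptions made for the example, and Pearson's r is used as one plausible correlation measure.

```python
import math
import re


def share_matching(tweets, pattern):
    """Fraction of tweets whose text matches a case-insensitive regex."""
    rx = re.compile(pattern, re.IGNORECASE)
    return sum(1 for t in tweets if rx.search(t)) / len(tweets)


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences,
    e.g. per-day counts from an automated query and from manual coding."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

With one automated count and one manual count per sampled day, `pearson_r(auto_counts, manual_counts)` gives a single agreement figure per query; both sequences must vary, otherwise the denominator is zero.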
Project description:BACKGROUND:Dementia is a prevalent disorder among adults and often subjects an individual and his or her family. Social media websites may serve as a platform to raise awareness for dementia and allow researchers to explore health-related data. OBJECTIVE:The objective of this study was to utilize Twitter, a social media website, to examine the content and location of tweets containing the keyword "dementia" to better understand the reasons why individuals discuss dementia. We adopted an approach that analyzed user location, user category, and tweet content subcategories to classify large publicly available datasets. METHODS:A total of 398 tweets were collected using the Twitter search application programming interface with the keyword "dementia," circulated between January and February 2018. Twitter users were categorized into 4 categories: general public, health care field, advocacy organization, and public broadcasting. Tweets posted by "general public" users were further subcategorized into 5 categories: mental health advocate, affected persons, stigmatization, marketing, and other. Placement into the categories was done through thematic analysis. RESULTS:A total of 398 tweets were written by 359 different screen names from 28 different countries. The largest number of Twitter users were from the United States and the United Kingdom. Within the United States, the largest number of users were from California and Texas. The majority (281/398, 70.6%) of Twitter users were categorized into the "general public" category. Content analysis of tweets from the "general public" category revealed stigmatization (113/281, 40.2%) and mental health advocacy (102/281, 36.3%) as the most common themes. Among tweets from California and Texas, California had more stigmatization tweets, while Texas had more mental health advocacy tweets. 
CONCLUSIONS:Themes from the content of tweets highlight the mixture of the political climate and the supportive network present on Twitter. The ability to use Twitter to combat stigma and raise awareness of mental health indicates the benefits that can potentially be facilitated via the platform, but negative stigmatizing tweets may interfere with the effectiveness of this social support.
Project description:BACKGROUND:Although birth defects are the leading cause of infant mortality in the United States, methods for observing human pregnancies with birth defect outcomes are limited. OBJECTIVE:The primary objectives of this study were (i) to assess whether rare health-related events (in this case, birth defects) are reported on social media, (ii) to design and deploy a natural language processing (NLP) approach for collecting such sparse data from social media, and (iii) to utilize the collected data to discover a cohort of women whose pregnancies with birth defect outcomes could be observed on social media for epidemiological analysis. METHODS:To assess whether birth defects are mentioned on social media, we mined 432 million tweets posted by 112,647 users who were automatically identified via their public announcements of pregnancy on Twitter. To retrieve tweets that mention birth defects, we developed a rule-based, bootstrapping approach, which relies on a lexicon, lexical variants generated from the lexicon entries, regular expressions, post-processing, and manual analysis guided by distributional properties. To identify users whose pregnancies with birth defect outcomes could be observed for epidemiological analysis, inclusion criteria were (i) tweets indicating that the user's child has a birth defect, and (ii) accessibility to the user's tweets during pregnancy. We conducted a semi-automatic evaluation to estimate the recall of the tweet-collection approach, and performed a preliminary assessment of the prevalence of selected birth defects among the pregnancy cohort derived from Twitter. RESULTS:We manually annotated 16,822 retrieved tweets, distinguishing tweets indicating that the user's child has a birth defect (true positives) from tweets that merely mention birth defects (false positives). Inter-annotator agreement was substantial: κ = 0.79 (Cohen's kappa).
Analyzing the timelines of the 646 users whose tweets were true positives resulted in the discovery of 195 users that met the inclusion criteria. Congenital heart defects are the most common type of birth defect reported on Twitter, consistent with findings in the general population. Based on an evaluation of 4169 tweets retrieved using alternative text mining methods, the recall of the tweet-collection approach was 0.95. CONCLUSIONS:Our contributions include (i) evidence that rare health-related events are indeed reported on Twitter, (ii) a generalizable, systematic NLP approach for collecting sparse tweets, (iii) a semi-automatic method to identify undetected tweets (false negatives), and (iv) a collection of publicly available tweets by pregnant users with birth defect outcomes, which could be used for future epidemiological analysis. In future work, the annotated tweets could be used to train machine learning algorithms to automatically identify users reporting birth defect outcomes, enabling the large-scale use of social media mining as a complementary method for such epidemiological research.
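The two evaluation measures reported above, Cohen's kappa for inter-annotator agreement and recall for the tweet-collection approach, follow standard definitions and can be computed as in the sketch below. The function names and the label encoding are illustrative assumptions; the study's own evaluation code is not described in the abstract.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e the agreement expected from each annotator's label frequencies.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


def recall(true_positives, false_negatives):
    """Share of relevant items that the collection approach retrieved."""
    return true_positives / (true_positives + false_negatives)
```

Kappa is undefined when both annotators use a single identical label for every item (expected agreement of 1), so a production version would need to guard against that case.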
Project description:The pervasiveness of mobile devices, which is increasing daily, is generating a vast amount of geo-located data allowing us to gain further insights into human behaviors. In particular, this new technology enables users to communicate through mobile social media applications, such as Twitter, anytime and anywhere. Thus, geo-located tweets offer the possibility to carry out in-depth studies on human mobility. In this paper, we study the use of Twitter in transportation by identifying tweets posted from roads and rails in Europe between September 2012 and November 2013. We compute the percentage of highway and railway segments covered by tweets in 39 countries. The coverages are very different from country to country and their variability can be partially explained by differences in Twitter penetration rates. Still, some of these differences might be related to cultural factors regarding mobility habits and interacting socially online. Analyzing particular road sectors, our results show a positive correlation between the number of tweets on the road and the Average Annual Daily Traffic on highways in France and in the UK. Transport modality can be studied with these data as well, for which we discover very heterogeneous usage patterns across the continent.
Project description:INTRODUCTION:It is unclear whether warnings on electronic cigarette (e-cigarette) advertisements required by the US Food and Drug Administration (FDA) will apply to social media. Given the key role of social media in marketing e-cigarettes, we seek to inform FDA decision making by exploring how warnings on various tweet content influence perceived healthiness, nicotine harm, likelihood to try e-cigarettes, and warning recall. METHODS:In this 2 × 4 between-subjects experiment participants viewed a tweet from a fictitious e-cigarette brand. Four tweet content versions (e-cigarette product, e-cigarette use, e-cigarette in social context, unrelated content) were crossed with two warning versions (absent, present). Adult e-cigarette users (N = 994) were recruited via social media ads to complete a survey and randomized to view one of eight tweets. Multivariable regressions explored effects of tweet content and warning on perceived healthiness, perceived harm, and likelihood to try e-cigarettes, and tweet content on warning recall. Covariates were tobacco and social media use and demographics. RESULTS:Tweets with warnings elicited more negative health perceptions of the e-cigarette brand than tweets without warnings (p < .05). Tweets featuring e-cigarette products (p < .05) or use (p < .001) elicited higher warning recall than tweets featuring unrelated content. CONCLUSIONS:This is the first study to examine warning effects on perceptions of e-cigarette social media marketing. Warnings led to more negative e-cigarette health perceptions, but no effect on perceived nicotine harm or likelihood to try e-cigarettes. There were differences in warning recall by tweet content. Research should explore how varying warning content (text, size, placement) on tweets from e-cigarette brands influences health risk perceptions. 
IMPLICATIONS:FDA's 2016 ruling requires warnings on advertisements for nicotine-containing e-cigarettes, but does not specify whether this applies to social media. This study is the first to examine how e-cigarette warnings in tweets influence perceived healthiness and harm of e-cigarettes, which is important because e-cigarette brands are voluntarily including warnings on Twitter and Instagram. Warnings influenced perceived healthiness of the e-cigarette brand, but not perceived nicotine harm or likelihood to try e-cigarettes. We also saw higher recall of warning statements for tweets featuring e-cigarettes. Findings suggest that expanding warning requirements to e-cigarette social media marketing warrants further exploration and FDA consideration.
Project description:The misuse of prescription opioids (MUPO) is a leading public health concern. Social media are playing an expanded role in public health research, but there are few methods for estimating established epidemiological metrics from social media. The purpose of this study was to demonstrate that the geographic variation of social media posts mentioning prescription opioid misuse strongly correlates with government estimates of MUPO in the last month. We wrote software to acquire publicly available tweets from Twitter from 2012 to 2014 that contained at least one keyword related to prescription opioid use (n = 3,611,528). A medical toxicologist and emergency physician curated the list of keywords. We used the semantic distance (SemD) to automatically quantify the similarity of meaning between tweets and identify tweets that mentioned MUPO. We defined the SemD between two words as the shortest distance between the two corresponding word-centroids. Each word-centroid represented all recognized meanings of a word. We validated this automatic identification with manual curation. We used Twitter metadata to estimate the location of each tweet. We compared our estimated geographic distribution with the 2013-2015 National Surveys on Drug Use and Health (NSDUH). Tweets that mentioned MUPO formed a distinct cluster far away from semantically unrelated tweets. The state-by-state correlation between Twitter and NSDUH was highly significant across all NSDUH survey years. The correlation was strongest between Twitter and NSDUH data from those aged 18-25 (r = 0.94, p < 0.01 for 2012; r = 0.94, p < 0.01 for 2013; r = 0.71, p = 0.02 for 2014). The correlation was driven by discussions of opioid use, even after controlling for geographic variation in Twitter usage. Mentions of MUPO on Twitter correlate strongly with state-by-state NSDUH estimates of MUPO.
We have also demonstrated that natural language processing can be used to analyze social media to provide insights for syndromic toxicosurveillance.
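The word-centroid idea behind the semantic distance can be illustrated with a toy sketch. The abstract does not specify how meanings are represented or which metric is used, so everything here is an assumption made for illustration: each recognized meaning of a word is taken to be a numeric vector, the word-centroid is the mean of those vectors, and the distance is Euclidean. The names `centroid` and `semantic_distance` are hypothetical.

```python
import math


def centroid(vectors):
    """Mean of a word's sense vectors (assumed representation of meanings)."""
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def semantic_distance(senses_a, senses_b):
    """Euclidean distance between two word-centroids (an assumed metric)."""
    return math.dist(centroid(senses_a), centroid(senses_b))
```

Under these assumptions, words whose meanings cluster together yield small distances, which is the property the study relies on when separating MUPO-related tweets from unrelated ones.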
Project description:Importance:As society becomes increasingly networked, researchers are beginning to explore how social media can be used to study person-to-person communication about health and health care use. Twitter is an online messaging platform used by more than 300 million people who have generated several billion Tweets, yet little work has focused on the potential applications of these data for studying public attitudes and behaviors associated with cardiovascular health. Objective:To describe the volume and content of Tweets associated with cardiovascular disease as well as the characteristics of Twitter users. Design, Setting, and Participants:We used Twitter to access a random sample of approximately 10 billion English-language Tweets originating from US counties from July 23, 2009, to February 5, 2015, associated with cardiovascular disease. We characterized each Tweet relative to estimated user demographics. A random subset of 2500 Tweets was hand-coded for content and modifiers. Main Outcomes and Measures:The volume of Tweets about cardiovascular disease and the content of these Tweets. Results:Of 550 338 Tweets associated with cardiovascular disease, the terms diabetes (n = 239 989) and myocardial infarction (n = 269 907) were used more frequently than heart failure (n = 9414). Users who Tweeted about cardiovascular disease were more likely to be older than the general population of Twitter users (mean age, 28.7 vs 25.4 years; P < .01) and less likely to be male (59 082 of 124 896 [47.3%] vs 8433 of 17 270 [48.8%]; P < .01). Most Tweets (2338 of 2500 [93.5%]) were associated with a health topic; common themes of Tweets included risk factors (1048 of 2500 [41.9%]), awareness (585 of 2500 [23.4%]), and management (541 of 2500 [21.6%]) of cardiovascular disease. Conclusions and Relevance:Twitter offers promise for studying public communication about cardiovascular disease.